### Objectives
- convert text into machine-readable numbers
- enable analysis and modeling

### Encoding techniques

Choose one of the following techniques:
- One-hot encoding: maps each word to a unique binary vector
    - 1 for the presence of the word
    - 0 for its absence
- Bag of words (BoW): captures word frequency, disregarding order
    - treats a document as an unordered collection of words
    - focuses on frequency, not order
- TF-IDF: balances uniqueness and importance
    - term frequency-inverse document frequency
    - rare words have a higher score
    - common words have a lower score
    - emphasizes the informative words
- Embedding: converts words into numerical vectors that capture semantic meaning. Example: king and queen, man and woman.
    - word-to-index mapping: king -> 1, queen -> 2
    - more compact and computationally efficient than one-hot vectors
    - typically follows tokenization
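
The word-to-index step that precedes an embedding can be sketched in plain Python. This is a minimal illustration, not the course's code: the sentence and the regex tokenizer are assumptions chosen for the example.

```python
import re

sentence = "the cat sat on the mat"

# Tokenize: lowercase, then split into runs of word characters (an assumed, simple rule)
tokens = re.findall(r"\w+", sentence.lower())

# dict.fromkeys keeps first-seen order and drops repeats, so each word gets exactly one index
vocab = list(dict.fromkeys(tokens))
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Encode the sentence as a sequence of indices
indices = [word_to_idx[t] for t in tokens]

print(word_to_idx)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(indices)      # [0, 1, 2, 3, 0, 4]
```

Building the vocabulary from unique tokens keeps the indices contiguous, which matters when the vocabulary size is later used to size an embedding table.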

### One-hot encoding with PyTorch

In [1]:
import torch
vocab = ['cat', 'dog','rabbit']
vocab_size = len(vocab)
one_hot_vectors = torch.eye(vocab_size)
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}
print(one_hot_dict)

{'cat': tensor([1., 0., 0.]), 'dog': tensor([0., 1., 0.]), 'rabbit': tensor([0., 0., 1.])}
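
For intuition, the same mapping can be written without PyTorch. The sketch below is a dependency-free equivalent of the identity-matrix approach, and `decode` is a hypothetical helper added only to show that a one-hot vector can be turned back into its word.

```python
vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)

# Row i of an identity matrix has a 1.0 at position i and 0.0 elsewhere
identity = [[1.0 if col == row else 0.0 for col in range(vocab_size)]
            for row in range(vocab_size)]
one_hot = {word: identity[i] for i, word in enumerate(vocab)}

def decode(vector):
    """Return the word whose position holds the 1.0 (hypothetical helper)."""
    return vocab[vector.index(1.0)]

print(one_hot['dog'])          # [0.0, 1.0, 0.0]
print(decode(one_hot['dog']))  # dog
```

Each vector is as long as the vocabulary, so this representation grows quickly as more words are added.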


### BoW (CountVectorizer)

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'Collective intelligence: which is the combining of behavior, preferences, or ideas of a group of people to create novel insights.',
    'Google’s PageRank: which take user data and perform calculations to create new information that can enhance the user experience.',
    'Machine Learning: is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn. ',
]
# Learn the vocabulary and count each word per document
x = vectorizer.fit_transform(corpus)
print(x.toarray())
print(vectorizer.get_feature_names_out())

[[0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1 3 1 0 1 0 1 0 0
  0 1 1 0 1 0]
 [0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1
  1 1 1 2 1 0]
 [1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0
  1 0 1 0 0 1]]
['ai' 'algorithms' 'allow' 'and' 'artificial' 'behavior' 'calculations'
 'can' 'collective' 'combining' 'computers' 'concerned' 'create' 'data'
 'enhance' 'experience' 'google' 'group' 'ideas' 'information' 'insights'
 'intelligence' 'is' 'learn' 'learning' 'machine' 'new' 'novel' 'of' 'or'
 'pagerank' 'people' 'perform' 'preferences' 'subfield' 'take' 'that'
 'the' 'to' 'user' 'which' 'with']
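
The counts in the first row can be checked by hand. The sketch below uses `collections.Counter` in place of `CountVectorizer`, with a regex that mirrors scikit-learn's default token pattern (lowercased tokens of two or more word characters). It is meant as an illustration of the idea, not a replacement for the library.

```python
import re
from collections import Counter

doc = ("Collective intelligence: which is the combining of behavior, "
       "preferences, or ideas of a group of people to create novel insights.")

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokens = tokenize(doc)
bow = Counter(tokens)

# Reversing the word order leaves the counts unchanged
bow_reversed = Counter(reversed(tokens))

print(bow["of"])            # 3
print("a" in bow)           # False: single-character tokens are dropped
print(bow == bow_reversed)  # True
```

This is why 'of' has a count of 3 in the first row of the matrix and why 'a' does not appear in the feature names.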


### TF-IDF

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
# Reuses the corpus defined in the BoW cell above
y = vectorizer.fit_transform(corpus)
print(y.toarray())
print(vectorizer.get_feature_names_out())

[[0.         0.         0.         0.         0.         0.23283269
  0.         0.         0.23283269 0.23283269 0.         0.
  0.17707526 0.         0.         0.         0.         0.23283269
  0.23283269 0.         0.23283269 0.17707526 0.17707526 0.
  0.         0.         0.         0.23283269 0.53122579 0.23283269
  0.         0.23283269 0.         0.23283269 0.         0.
  0.         0.17707526 0.13751474 0.         0.17707526 0.        ]
 [0.         0.         0.         0.23148133 0.         0.
  0.23148133 0.23148133 0.         0.         0.         0.
  0.17604751 0.23148133 0.23148133 0.23148133 0.23148133 0.
  0.         0.23148133 0.         0.         0.         0.
  0.         0.         0.23148133 0.         0.         0.
  0.23148133 0.         0.23148133 0.         0.         0.23148133
  0.17604751 0.17604751 0.1367166  0.46296265 0.17604751 0.        ]
 [0.27054288 0.27054288 0.27054288 0.         0.27054288 0.
  0.         0.         0.         0.         0.27

### Embedding encoding

In [2]:
import torch
from torch import nn
words = ["the", "calico","cat","is","very","beautiful", "sat", "mat"]
word_to_idx = {word: i for i, word in enumerate(words)}
inputs = torch.LongTensor([word_to_idx[w] for w in words])
embeddings = nn.Embedding(num_embeddings=len(words), embedding_dim=10)
output = embeddings(inputs)
print(output)

tensor([[ 2.4661, -0.6180, -1.5599, -0.8083, -1.1261,  0.2165,  0.6045, -0.2697,
          1.3442,  0.4469],
        [-0.6772, -1.3861, -1.0436, -0.6108, -0.4254,  0.6196,  0.0470, -0.7417,
         -0.6104,  0.3865],
        [-0.5603, -0.4499,  0.2457, -0.7004, -1.9934,  0.8634,  1.3279,  1.4211,
          0.5491, -1.2899],
        [-0.2726, -1.2696,  0.9959, -0.2407, -0.2018,  1.3897, -0.0850, -0.6953,
          1.7704,  0.2885],
        [-0.6623, -0.5446, -0.8054,  0.3834,  0.1954,  1.4201,  0.7841,  0.8725,
         -0.1405, -2.2205],
        [-0.4203, -0.2566, -0.7075,  0.2712, -2.6097, -2.0456, -1.3913,  0.6208,
          0.6186, -1.3626],
        [ 0.6556,  0.6496,  0.3189, -0.1471, -1.0145,  0.0521, -2.1341,  0.4171,
         -0.0328,  0.1784],
        [-0.5188, -1.3143,  1.5656,  0.2504, -0.9893, -0.1764,  0.7010, -0.7542,
          0.9931, -0.1539]], grad_fn=<EmbeddingBackward0>)
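
Conceptually, an embedding layer is a lookup table with one row per word. The sketch below imitates that behaviour in plain Python under the assumption that rows start as standard-normal random values. `weight` and `embed` are hypothetical names, and no training happens here.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

words = ["the", "calico", "cat", "is", "very", "beautiful", "sat", "mat"]
word_to_idx = {word: i for i, word in enumerate(words)}
embedding_dim = 10

# One randomly initialised row per word
weight = [[random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
          for _ in range(len(words))]

def embed(indices):
    """Look up one row of the table per index (hypothetical helper)."""
    return [weight[i] for i in indices]

output = embed([word_to_idx["cat"], word_to_idx["mat"], word_to_idx["cat"]])

print(len(output), len(output[0]))  # 3 10
print(output[0] == output[2])       # True: same index, same vector
```

Because the rows are random at initialisation, the printed values carry no semantic meaning until the table is trained as part of a model.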


### Practice

One-hot encoded book titles

PyBooks wants to catalog and analyze the book genres in its library. Apply one-hot encoding to a list of book genres to make them machine-readable.

torch has been imported for you.
Instructions

    Define the size of the vocabulary and save to vocab_size.
    Create one-hot vectors using the appropriate torch technique and vocab_size.
    Create a dictionary mapping genres to their corresponding one-hot vectors using dictionary comprehension; the dictionary keys should be the genre.

In [4]:
genres = ['Fiction','Non-fiction','Biography', 'Children','Mystery']

# Define the size of the vocabulary
vocab_size = len(genres)

# Create one-hot vectors
one_hot_vectors = torch.eye(vocab_size)

# Create a dictionary mapping genres to their one-hot vectors
one_hot_dict = {genre: one_hot_vectors[i] for i, genre in enumerate(genres)}

for genre, vector in one_hot_dict.items():
    print(f'{genre}: {vector.numpy()}')

Fiction: [1. 0. 0. 0. 0.]
Non-fiction: [0. 1. 0. 0. 0.]
Biography: [0. 0. 1. 0. 0.]
Children: [0. 0. 0. 1. 0.]
Mystery: [0. 0. 0. 0. 1.]


Bag-of-words for book titles

PyBooks now has a list of book titles that need to be encoded for further analysis. The data team believes the Bag of Words (BoW) model could be the best approach.

The following packages have been imported for you: torch, torchtext.
Instructions

    Import the CountVectorizer class for implementing bag-of-words.
    Initialize an object of the class you imported, then use this object to transform the titles into a matrix representation.
    Extract and display the first five feature names and encoded titles with the get_feature_names_out() method.


In [5]:
# Import from sklearn
from sklearn.feature_extraction.text import CountVectorizer

titles = ['The Great Gatsby','To Kill a Mockingbird','1984','The Catcher in the Rye','The Hobbit', 'Great Expectations']

# Initialize Bag-of-words with the list of book titles
vectorizer = CountVectorizer()
bow_encoded_titles = vectorizer.fit_transform(titles)

# Extract and print the first five features
print(vectorizer.get_feature_names_out()[:5])
print(bow_encoded_titles.toarray()[0, :5])

['1984' 'catcher' 'expectations' 'gatsby' 'great']
[0 0 0 1 1]


Applying TF-IDF to book descriptions

PyBooks has collected several book descriptions and wants to identify important words within them using the TF-IDF encoding technique. By doing this, they hope to gain more insights into the unique attributes of each book to help with their book recommendation system.

The following packages have been imported for you: torch, torchtext.
Instructions

    Import the class from sklearn.feature_extraction.text that converts a collection of raw documents to a matrix of TF-IDF features.
    Instantiate an object of this class, then use this object to encode the descriptions into a TF-IDF matrix of vectors.
    Retrieve and display the first five feature names from the vectorizer and encoded vectors from tfidf_encoded_descriptions.


In [7]:
# Importing TF-IDF from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

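# Initialize TF-IDF encoding vectorizer
vectorizer = TfidfVectorizer()
# No descriptions list is defined in these notes, so titles from the previous cell is encoded instead
tfidf_encoded_descriptions = vectorizer.fit_transform(titles)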

# Extract and print the first five features
print(vectorizer.get_feature_names_out()[:5])
print(tfidf_encoded_descriptions.toarray()[0, :5])

['1984' 'catcher' 'expectations' 'gatsby' 'great']
[0.         0.         0.         0.68172171 0.55902156]
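
The two non-zero scores can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The helper names are hypothetical.

```python
import math
import re

titles = ['The Great Gatsby', 'To Kill a Mockingbird', '1984',
          'The Catcher in the Rye', 'The Hobbit', 'Great Expectations']

def tokenize(text):
    # Lowercased tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(t) for t in titles]
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc):
    # Term count times IDF, then scale the row to unit L2 length
    raw = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

weights = tfidf(docs[0])
print(round(weights['gatsby'], 4))  # 0.6817
print(round(weights['great'], 4))   # 0.559
```

'gatsby' appears in one title and 'great' in two, so 'gatsby' receives the higher score even though both occur once in 'The Great Gatsby'.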


Embedding in PyTorch

PyBooks found success with a book recommendation system. However, it doesn't account for some of the semantics found in the text. PyTorch's built-in embedding layer can learn and represent the relationship between words directly from data. Your team is curious to explore this capability to improve the book recommendation system. Can you help implement it?

torch and torch.nn as nn have been imported for you.
Instructions

    Map a unique index to each word in words, saving to word_to_idx.
    Convert word_to_idx into a PyTorch tensor and save to inputs.
    Initialize an embedding layer using the torch module with ten dimensions.
    Pass the inputs tensor to the embedding layer and review the output.


In [3]:
# Map a unique index to each word
words = ["This", "book", "was", "fantastic", "I", "really", "love", "science", "fiction", "but", "the", "protagonist", "was", "rude", "sometimes"]
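# Note: "was" appears twice in words, so the dict keeps only its last index (12) and index 2 is never used
word_to_idx = {word: i for i, word in enumerate(words)}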

# Convert word_to_idx to a tensor
inputs = torch.LongTensor([word_to_idx[w] for w in words])

# Initialize embedding layer with ten dimensions
embedding = nn.Embedding(num_embeddings=len(words), embedding_dim=10)

# Pass the tensor to the embedding layer
output = embedding(inputs)
print(output)

tensor([[ 3.7160e-01,  1.6860e+00, -1.0942e+00, -1.2422e+00, -6.9444e-02,
          2.6087e+00, -8.3396e-01, -1.6440e-01, -1.2661e+00, -6.3702e-01],
        [-3.7453e-01, -1.9003e+00,  7.8607e-02, -1.7111e+00,  9.1807e-01,
          1.6506e+00,  3.4699e-01, -7.8172e-01,  1.4753e+00, -1.6265e+00],
        [ 1.2198e+00,  6.1211e-01,  3.7045e-01, -1.4451e+00,  4.1161e-03,
         -6.5725e-01, -2.8760e-01, -3.0660e-02, -2.3058e+00,  7.1787e-01],
        [ 1.9828e-02, -1.5543e+00, -1.1705e+00,  7.3115e-01, -5.2529e-01,
         -5.7655e-04,  1.8865e+00, -1.9242e+00, -1.0738e-01, -1.1423e+00],
        [-4.5352e-01, -9.4602e-02,  7.8129e-01, -2.9492e+00, -1.6859e-01,
         -8.5185e-01,  1.4575e-01, -4.8461e-01,  8.1060e-01, -1.2035e-01],
        [ 2.4926e-01, -9.2849e-01,  2.0370e+00,  1.6821e+00,  7.7106e-01,
          2.1014e+00,  5.5311e-01, -1.0334e-01, -2.2921e+00,  1.3842e-01],
        [ 5.8318e-01, -7.6713e-01, -3.8975e-01,  5.4291e-01, -1.4369e+00,
          4.0370e-01, -1.8750e-0