### Objectives
- convert text into machine-readable numbers
- enable analysis and modeling

### Encoding techniques

Use one of the following techniques:
- One-hot encoding: maps each word to a unique binary vector
    - 1 for the presence of the word
    - 0 for its absence
- Bag of words (BoW): captures word frequency, disregarding order
    - treats a document as an unordered collection of words
    - focuses on frequency, not order
- TF-IDF: balances uniqueness and importance
    - term frequency-inverse document frequency
    - rare words have a higher score
    - common words have a lower score
    - emphasizes the informative words
- Embedding: converts words into vectors, capturing semantic meaning
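
The embedding technique has no cell of its own in these notes, so here is a minimal sketch of the lookup step using `torch.nn.Embedding`. The 4-dimensional size and the sample words are assumptions chosen only for illustration. The vectors start out random; they only come to reflect semantic meaning after being trained on a task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random initial weights reproducible

vocab = ['cat', 'dog', 'rabbit']
word_to_index = {word: i for i, word in enumerate(vocab)}

# Assumed size for illustration: one 4-dimensional vector per word.
embedding_dim = 4
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)

# Look up the vectors for a short sequence of words.
indices = torch.tensor([word_to_index[w] for w in ['dog', 'cat']])
vectors = embedding(indices)

print(vectors.shape)  # torch.Size([2, 4])
```

Unlike a one-hot vector, whose length equals the vocabulary size, the embedding dimension is a free choice and usually much smaller than the vocabulary.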

### One-hot encoding with PyTorch

In [1]:
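import torch

# Each word in the vocabulary gets one row of the identity matrix.
vocab = ['cat', 'dog', 'rabbit']
vocab_size = len(vocab)
one_hot_vectors = torch.eye(vocab_size)  # vocab_size x vocab_size identity matrix
# Map each word to its one-hot vector.
one_hot_dict = {word: one_hot_vectors[i] for i, word in enumerate(vocab)}
print(one_hot_dict)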

{'cat': tensor([1., 0., 0.]), 'dog': tensor([0., 1., 0.]), 'rabbit': tensor([0., 0., 1.])}
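
The dictionary above encodes single words. As a possible extension, a whole sequence can be encoded in one call with `torch.nn.functional.one_hot`, which takes integer indices rather than words. The sample sentence and the `word_to_index` helper below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

vocab = ['cat', 'dog', 'rabbit']
word_to_index = {word: i for i, word in enumerate(vocab)}

# Hypothetical input sequence; every word must be in the vocabulary.
sentence = ['dog', 'cat', 'dog']
indices = torch.tensor([word_to_index[w] for w in sentence])

# One row per word, one column per vocabulary entry (integer dtype).
encoded = F.one_hot(indices, num_classes=len(vocab))
print(encoded)
```

Each row contains a single 1 in the column of the corresponding word, so the result has shape (sequence length, vocabulary size).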


### BoW (CountVectorizer)

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = [
    'Collective intelligence: which is the combining of behavior, preferences, or ideas of a group of people to create novel insights.',
    'Google’s PageRank: which take user data and perform calculations to create new information that can enhance the user experience.',
    'Machine Learning: is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn.',
]
# Learn the vocabulary and build the document-term count matrix.
x = vectorizer.fit_transform(corpus)
print(x.toarray())  # one row per document, one column per vocabulary word
print(vectorizer.get_feature_names_out())  # vocabulary in column order

[[0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 1 0 0 0 0 1 3 1 0 1 0 1 0 0
  0 1 1 0 1 0]
 [0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1
  1 1 1 2 1 0]
 [1 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0
  1 0 1 0 0 1]]
['ai' 'algorithms' 'allow' 'and' 'artificial' 'behavior' 'calculations'
 'can' 'collective' 'combining' 'computers' 'concerned' 'create' 'data'
 'enhance' 'experience' 'google' 'group' 'ideas' 'information' 'insights'
 'intelligence' 'is' 'learn' 'learning' 'machine' 'new' 'novel' 'of' 'or'
 'pagerank' 'people' 'perform' 'preferences' 'subfield' 'take' 'that'
 'the' 'to' 'user' 'which' 'with']
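
To make the "disregarding order" property concrete, here is a small standard-library sketch that counts words by hand. The `bag_of_words` helper and the two sample sentences are hypothetical. The regular expression mirrors `CountVectorizer`'s default token pattern, which keeps only tokens of two or more word characters; that is why single-character tokens such as 'a' do not appear in the feature names above.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in text, ignoring word order."""
    # Lowercase, then keep tokens of at least two word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = ['bites', 'dog', 'man']
a = bag_of_words('Dog bites man', vocabulary)
b = bag_of_words('Man bites dog', vocabulary)

print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True
```

The two sentences mean different things but produce identical vectors, which is the information BoW gives up in exchange for simplicity.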


### TF-IDF
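
With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn weights a term $t$ in a corpus of $n$ documents by

$$\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,$$

multiplies it by the raw term count, and then scales each document row to unit Euclidean length. The sketch below applies that formula by hand in plain Python. The toy corpus is hypothetical and already tokenized; the code illustrates the arithmetic and is not the library's implementation.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized toy corpus.
docs = [
    ['cat', 'sat', 'mat'],
    ['cat', 'cat', 'dog'],
    ['dog', 'barked'],
]

vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Raw count times idf for each vocabulary term, then L2-normalized."""
    counts = Counter(doc)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

matrix = [tfidf_row(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print([round(value, 3) for value in row])
```

In the first document, 'sat' appears in only one document while 'cat' appears in two, so 'sat' receives the higher weight even though both occur once.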

In [3]:
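from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the corpus defined in the BoW cell above.
vectorizer = TfidfVectorizer()
y = vectorizer.fit_transform(corpus)
print(y.toarray())  # TF-IDF weights, one row per document
print(vectorizer.get_feature_names_out())  # vocabulary in column order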

[[0.         0.         0.         0.         0.         0.23283269
  0.         0.         0.23283269 0.23283269 0.         0.
  0.17707526 0.         0.         0.         0.         0.23283269
  0.23283269 0.         0.23283269 0.17707526 0.17707526 0.
  0.         0.         0.         0.23283269 0.53122579 0.23283269
  0.         0.23283269 0.         0.23283269 0.         0.
  0.         0.17707526 0.13751474 0.         0.17707526 0.        ]
 [0.         0.         0.         0.23148133 0.         0.
  0.23148133 0.23148133 0.         0.         0.         0.
  0.17604751 0.23148133 0.23148133 0.23148133 0.23148133 0.
  0.         0.23148133 0.         0.         0.         0.
  0.         0.         0.23148133 0.         0.         0.
  0.23148133 0.         0.23148133 0.         0.         0.23148133
  0.17604751 0.17604751 0.1367166  0.46296265 0.17604751 0.        ]
 [0.27054288 0.27054288 0.27054288 0.         0.27054288 0.
  0.         0.         0.         0.         0.27