# Word Representations

In natural language processing, *word embedding* is the term used for representing a word as a vector. To train a model we need numerical data, so we need a way to turn a text into vectors of numbers while keeping as much of the information relevant to our task as possible. Depending on the task, semantic relations may matter most, or lexical content, and so on.

By vectorizing words, we represent each word as a number or as a list of numbers. With dense (continuous) word representations, the idea is that similar words end up close to each other in the vector space they are projected into.

[The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/) (we will come back to word2vec next class)

# Bag of Words (BoW)

When the context and the order of the words are not relevant, and only how often each word appears matters, we can use a bag of words. It is as if we threw all the words into a bag (perhaps shuffling them a bit) and then counted how many times each word appears, which gives a frequency vector. As a special case, a binary BoW only records whether each word appears at all. This is about the simplest technique for vectorizing a text.

<center><img src='https://drive.google.com/uc?export=view&id=1v6McR199QkVXvuQmC3FWJ80rSXGTbZUS' width=500></center>

Let's take the text from the image above. We can either write our own implementation from scratch or use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) already implemented in scikit-learn:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

text = ['Did you see the fly?', 'The fly will fly with you.']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['did' 'fly' 'see' 'the' 'will' 'with' 'you']
[[1 1 1 1 0 0 1]
 [0 2 0 1 1 1 1]]


`CountVectorizer` has a constructor with many default parameters that we can override. For example, if we want a binary representation (a presence vector) that counts only bigrams, we would proceed as follows:

In [None]:
text = ['I am not happy.', 'He is very happy']
vectorizer = CountVectorizer(binary=True, ngram_range=(2, 2))
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['am not' 'he is' 'is very' 'not happy' 'very happy']
[[1 0 0 1 0]
 [0 1 1 0 1]]


Word-level n-grams are sequences of n consecutive words. They provide some context, for example letting us distinguish _not happy_ from _very happy_. We can use them either as features (numerical representations) or to analyze the dataset.
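To make the definition concrete, here is a minimal sketch of word-level n-gram extraction in plain Python. The helper name `word_ngrams` is made up for this illustration, and the tokens come from a simple whitespace split, which is not how `CountVectorizer` tokenizes.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "he is very happy".split()
bigrams = word_ngrams(tokens, 2)
print(bigrams)  # ['he is', 'is very', 'very happy']
```

A text shorter than n simply yields an empty list.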

#  Term Frequency - Inverse Document Frequency (Tf-idf)

Just because a word appears many times does not mean that it is relevant to the content (think of stopwords). Stopwords also vary from one domain to another, and defining stopword lists by hand every time is tedious. If we want to build a search engine, for example, a more relevant word is one that appears often in a given document but is not frequent in the other documents (unlike stopwords, which are frequent everywhere). Tf-idf takes this into account.

For a given document, we repeat the following computation for each word in the whole dataset:

$$TFIDF = TF \cdot IDF$$
where:
$$TF(word, document) = \frac{\text{number of times the word appears in the document}}{\text{number of words in the document}}$$
and:
$$IDF(word, Documents) = \log\left(\frac{\text{number of documents in the corpus}}{1 + \text{number of documents containing the word}} + 1\right)$$

We use **log** to dampen the values, so that very rare words do not dominate and the scores are easier to compare.

For the implementation we have [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). The result is a matrix in which each row corresponds to a document and each column to a word from the whole dataset:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

text = ['I am not happy.', 'He is very happy']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['am' 'happy' 'he' 'is' 'not' 'very']
[[0.6316672  0.44943642 0.         0.         0.6316672  0.        ]
 [0.         0.37997836 0.53404633 0.53404633 0.         0.53404633]]
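The printed values do not follow the textbook formulas exactly. Assuming the defaults documented by scikit-learn (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses the raw count as TF, a smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit Euclidean length. The sketch below reproduces the two rows by hand under those assumptions. The token lists are written out manually (the default tokenizer drops one-letter tokens such as `I`), and the helper names are made up for this illustration.

```python
import math

docs = [["am", "not", "happy"], ["he", "is", "very", "happy"]]
n_docs = len(docs)

def smoothed_idf(word):
    # df = number of documents that contain the word
    df = sum(word in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    # raw count times smoothed idf, then L2-normalize the whole row
    weights = {w: doc.count(w) * smoothed_idf(w) for w in sorted(set(doc))}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

row0 = tfidf_row(docs[0])
row1 = tfidf_row(docs[1])
print(row0)
print(row1)
```

The word `happy` appears in both documents, so its IDF is the smallest and it receives the lowest weight in each row.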


# Continuous word representations

The representations seen so far (BoW, Tf-idf) have two drawbacks:
- the number of dimensions becomes very large as the texts (and therefore the vocabulary) grow
- they have no way of representing semantic relations between words

To use less memory, these representations are stored as sparse matrices. Even so, sparse word representations do not scale well when the vocabulary is very large.

Dense (continuous) word representations emerged as a solution to this problem: the number of dimensions of a word's embedding is much smaller.

A vocabulary can reach tens or hundreds of thousands of words (even millions), so using matrices of that size in neural networks becomes extremely expensive.

### Briefly about word2vec (2013) and GloVe (2014)

In short, continuous representations come down to a matrix used to reduce the dimension of a vector. If the vocabulary size is $V$ and we reduce to embeddings of dimension $N$, we have a $V \times N$ matrix. Initially, each word in our text is represented as a one-hot vector of dimension $V$.

To obtain the representation of a word, we multiply its one-hot vector by the embedding matrix, which yields a vector of dimension $N$. In effect, this selects the row of the matrix that corresponds to the word.

For example, suppose the vocabulary size is $10^5$ and we reduce to representations of dimension 300. Instead of working with vectors of dimension $10^5$ in our networks, we work with much smaller vectors of dimension 300.
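A small NumPy sketch of the multiplication described above, with toy sizes ($V = 5$, $N = 3$) and a random matrix standing in for learned weights. All names and values here are illustrative assumptions.

```python
import numpy as np

V, N = 5, 3                                  # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(V, N))   # random stand-in for learned weights

word_id = 2                                  # index of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

via_product = one_hot @ embedding_matrix     # shape (N,)
via_lookup = embedding_matrix[word_id]       # same vector, without the multiplication

print(np.allclose(via_product, via_lookup))  # True
```

This is why embedding layers are usually implemented as a row lookup rather than as an explicit matrix product.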

### fastText

Although word2vec and GloVe address the problems of sparse word representations, they share a limitation (which also existed before): how to handle words that do not appear in the original dataset.

Out-of-vocabulary (OOV) words cannot be represented in the vector space given by the embedding matrix, because they are not in the vocabulary. A few workarounds:
- simplest of all, add an UNK (unknown) token to the vocabulary
- use stemming or lemmatization in the hope of finding a similar word in the original vocabulary
- look for synonyms or other similar words (a bit ironic, isn't it? 🙂)

Even so, we still have no representations for invented words or for words newly introduced into the language.

[fastText](https://fasttext.cc/), released in 2016, addresses this problem by using character-level n-grams. Even for an unknown word, a representation can be obtained by combining the representations of the character n-grams that make up the word.
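A rough sketch of the idea, not of fastText's actual implementation: the word is wrapped in boundary markers, split into character n-grams, and the vectors of the known n-grams are combined. The n-gram vectors below are made-up toy values, and averaging is an assumption chosen for simplicity.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of the word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# Hypothetical n-gram vectors (toy 2-dimensional values, not learned ones)
ngram_vectors = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0], "her": [1.0, 1.0]}

def oov_vector(word, dim=2):
    """Average the vectors of the word's known trigrams (zero vector if none is known)."""
    known = [ngram_vectors[g] for g in char_ngrams(word, 3, 3) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

print(oov_vector("whet"))  # [0.5, 0.5]
```

Here `whet` is not in the toy table as a whole word, yet it still gets a vector because it shares the n-grams `<wh` and `whe` with known material.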

### Contextual word representations

The classic representations word2vec, GloVe and fastText have the drawback of using a single fixed representation for each learned word, regardless of context. For example, in the sentence
```
Pe cer e un nor, eu cer un marker color.
```
In this Romanian sentence (roughly, "There is a cloud in the sky, I am asking for a colored marker"), the word `cer` appears twice with different meanings: first as a noun ("sky"), then as a verb ("I ask for").

Classic continuous word representations would provide a single vector for this word, even though the two senses are semantically unrelated.

To address this, contextual word representations appeared in 2018. In the example above, the same word gets two different vector representations, depending on its context.

Examples of contextual representations: BERT, ELMo, GPT-2.

See https://ai.stanford.edu/blog/contextual/ (and the NLP master's courses) for more details.

Like fastText, these models use "sub-word" tokenization (similar in spirit to character-level n-grams) so that unknown words can be represented more easily.

It should be noted that OOV sub-words are still possible in this case, but for texts on general topics they are much rarer. OOV problems can still arise today when we use models trained on other languages or other domains, or when we work in specialized domains with many niche terms (for example biology or medicine).

This is roughly where we stand at present (2024) with respect to word representations.

### Other directions

In other settings we may not need such fine granularity when representing text. There are also representations at sentence level (sent2vec) or at document level (doc2vec). These ideas have also been adapted to newer architectures (for example [sentence transformers](https://www.sbert.net/)). More details in the NLP master's program 🙂


# Word2vec

Word2vec was one of the best-known word representation techniques before the Transformer in 2017 ([arxiv](https://arxiv.org/pdf/1706.03762.pdf), [NeurIPS](https://dl.acm.org/doi/pdf/10.5555/3295222.3295349)) and BERT in 2018 ([arxiv](https://arxiv.org/pdf/1810.04805.pdf), [NAACL](https://aclanthology.org/N19-1423.pdf)). Word2vec was published in 2013 ([1](https://arxiv.org/pdf/1301.3781.pdf), [2](https://arxiv.org/pdf/1310.4546.pdf), [3](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)). The idea is to use a shallow neural network with a single hidden layer, trained on the words of a text so that similar words are pulled closer together in the vector space and unrelated words are pushed apart.

The basic idea has much older roots in linguistics and philosophy:
- _You shall know a word by the company it keeps_ (John Rupert Firth, 1957, A Synopsis of Linguistic Theory)
- _The meaning of a word is its use in the language_ (Ludwig Wittgenstein, 1953, Philosophical Investigations).

The idea is that we use the contexts in which a word appears to compute similarities between the words of a text. We use many such contexts as a training set and train the model on a prediction task; the learned weights then serve as the word embeddings. A word2vec model can use one of the following algorithms:
- Continuous Bag of Words (CBoW): use the context window around a word to predict the word in the "middle"; generally reported to train faster and to do slightly better on frequent words
- Skip-Gram: use a word to predict the context around it; generally reported to work better with small amounts of training data and for rare words

<img src= "https://wiki.pathmind.com/images/wiki/word2vec_diagrams.png" width="500" height="300">



[A more in-depth explanation with code](https://www.tensorflow.org/text/tutorials/word2vec)

### Continuous Bag-of-Words (CBoW)

Unlike the classic BoW model, CBoW takes into account the context around a given word by using a context window.

For example, if we take the text _The fly will fly with you_ and a window of size 1, the algorithm looks at one word before and one word after each word, generating the following sequence of (_context_, _target word_) pairs:

$$([the, will], fly), ([fly, fly], will), ([will, with], fly), ([fly, you], with)$$

This is the information the model is trained on in order to predict the most probable word given a context.
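The pairs listed above can be generated with a few lines of plain Python. This is only a sketch: the helper name `cbow_pairs` is made up, and words at the two ends of the text are skipped because they lack a full window (no padding is used).

```python
def cbow_pairs(tokens, window=1):
    """(context, target) pairs for every token that has a full window on both sides."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "the fly will fly with you".split()
pairs = cbow_pairs(tokens, window=1)
for context, target in pairs:
    print(context, target)
```

Swapping the roles of the two elements in each pair (one target word, each context word as a label) gives the training examples used by Skip-Gram.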

### Skip-Gram

The Skip-Gram model works the other way around from CBoW: given a target word, it predicts the most probable context. To do this, we train a network with one hidden layer to predict the probability that a word _y_ appears near a word _x_ in a random text. The weights of the hidden layer are then used as the vector representation of the target word. The distance between two words in the vector space is smaller when the words appear in similar contexts.

### Training a model

In [None]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

embedding = Word2Vec(
    sentences=common_texts,   # the list of sentences, where each sentence is given as a list of words (processed or not processed)
    vector_size=100,          # the number of features in the vectorized representation
    window=7,                 # the context window
    min_count=3,              # the minimum number of times a word should appear in our dataset in order to be counted
    sg=1                      # sg=1 means skip-gram is used, sg=0 means CBOW is used
)

In [None]:
embedding.wv.key_to_index

{'system': 0, 'graph': 1, 'trees': 2, 'user': 3}

In [None]:
import pandas as pd

df = pd.DataFrame(
    [embedding.wv.get_vector(word) for word in embedding.wv.key_to_index.keys()],
    index=embedding.wv.key_to_index
  )

df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
system,-0.000536,0.000236,0.005103,0.009009,-0.009303,-0.007117,0.006459,0.008973,-0.005015,-0.003763,...,0.001631,0.00019,0.003474,0.000218,0.009619,0.005061,-0.008917,-0.007042,0.000901,0.006393
graph,-0.00862,0.003666,0.00519,0.005742,0.007467,-0.006168,0.001106,0.006047,-0.00284,-0.006174,...,0.001088,-0.001576,0.002197,-0.007882,-0.002717,0.002663,0.005347,-0.002392,-0.00951,0.004506
trees,9.5e-05,0.003077,-0.006813,-0.001375,0.007669,0.007346,-0.003673,0.002643,-0.008317,0.006205,...,-0.004509,0.005702,0.00918,-0.0041,0.007965,0.005375,0.005879,0.000513,0.008213,-0.007019
user,-0.008243,0.009299,-0.000198,-0.001967,0.004604,-0.004095,0.002743,0.00694,0.006065,-0.007511,...,-0.007426,-0.001064,-0.000795,-0.002563,0.009683,-0.000459,0.005874,-0.007448,-0.002506,-0.00555


In [None]:
embedding.wv.most_similar('system')

[('graph', -0.01083916611969471),
 ('trees', -0.05234673246741295),
 ('user', -0.111670583486557)]
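The scores returned by `most_similar` are cosine similarities between word vectors. Values near zero or below, as in the output above, mean the vectors point in unrelated directions, which is unsurprising after training on such a tiny corpus. A minimal sketch of the measure itself, on toy vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-2.0, 0.0])        # close to -1
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, the length of the vectors does not affect the score.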

### Loading a pretrained model

[Info about data and models](https://github.com/piskvorky/gensim-data)

[Examples on how to use](https://radimrehurek.com/gensim/models/word2vec.html)

In [None]:
import gensim.downloader as api

api.info()

In [None]:
model = api.load("word2vec-google-news-300")



In [None]:
model.most_similar('system')

[('systems', 0.7227916717529297),
 ('sytem', 0.7129376530647278),
 ('sys_tem', 0.5871982574462891),
 ('System', 0.5275423526763916),
 ('mechanism', 0.5058810114860535),
 ('sysem', 0.5027822852134705),
 ('systen', 0.49969804286956787),
 ('system.The', 0.49599188566207886),
 ('sytems', 0.4949610233306885),
 ('computerized', 0.47604817152023315)]

In [None]:
model.similarity('system', 'graph')

0.09396098

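### Fine-tuning our model


In [None]:
# `model` above is a KeyedVectors object (vectors only) and cannot be trained further,
# so we continue training the full Word2Vec model `embedding` built earlier.
embedding.train(common_texts, total_examples=len(common_texts), epochs=1)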

Other cool stuff:

In [None]:
model.most_similar(positive=["king", "woman"], negative=["man"])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]

[And less cool stuff:](https://arxiv.org/pdf/1607.06520.pdf)

In [None]:
model.most_similar(positive=["computer_programmer", "woman"], negative=["man"])

[('homemaker', 0.5627118945121765),
 ('housewife', 0.5105047225952148),
 ('graphic_designer', 0.505180299282074),
 ('schoolteacher', 0.497949481010437),
 ('businesswoman', 0.493489146232605),
 ('paralegal', 0.49255111813545227),
 ('registered_nurse', 0.4907974898815155),
 ('saleswoman', 0.4881627559661865),
 ('electrical_engineer', 0.4797725975513458),
 ('mechanical_engineer', 0.4755399227142334)]

Bias is still an unsolved problem in Machine Learning. Do you know any other popular examples of bias?

# Principal Component Analysis (PCA)

PCA is a dimensionality reduction algorithm, which makes it useful for visualizing data with hundreds of dimensions in 2D or 3D. Here is an example of how to use it to see the distances between word embeddings in 2D:

In [None]:
from sklearn.decomposition import PCA

text = ['system', 'graph', 'trees', 'user']
embeddings = [model[word] for word in text]

pca = PCA(n_components=2)
pca.fit(embeddings)
vectors_2d = pca.transform(embeddings)

The scikit-learn interface is similar to that of an ML model. After fitting it, we can plot the two components with matplotlib:

In [None]:
import matplotlib.pyplot as plt

x = [v[0] for v in vectors_2d]
y = [v[1] for v in vectors_2d]

fig, ax = plt.subplots()
ax.scatter(x, y)

for i, txt in enumerate(text):
    ax.annotate(txt, (x[i], y[i]))

plt.show()
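For intuition about what `PCA(n_components=2)` computes, here is a NumPy sketch under the usual textbook definition: center the data, take the singular value decomposition, and project onto the first two principal directions. The random matrix is a stand-in for four word vectors. The signs of the components may differ from scikit-learn's output, since each direction is only defined up to sign.

```python
import numpy as np

def pca_2d(vectors):
    """Project the rows of `vectors` onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)           # PCA works on mean-centered data
    # rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 10))     # stand-in for four word vectors
projected = pca_2d(toy_embeddings)
print(projected.shape)  # (4, 2)
```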

## Global Vectors (GloVe)

Word2Vec relies on local statistics (co-occurrences within a context window, at sentence level). [GloVe](https://nlp.stanford.edu/projects/glove/) also incorporates global statistics, which can help on smaller datasets, since less training data is needed.

The model counts all "word1 word2 ..." pairs (for a context window of size x we consider words at distance at most x from each other) and stores the counts in a co-occurrence matrix:

<center><img src='https://drive.google.com/uc?export=view&id=1pnX1lPdQItUauHp9W8xJlx8q2lgTe4cJ' width=500></center>

Afterwards, it computes the probability that a word appears in the context of another word, based on this matrix:

$$P(j | i) = \frac{X_{ij}}{X_i}$$
where:
$$P(j | i) = \text{the probability of seeing word } j \text{ in the context of word } i$$
$$X_{ij} = \text{the number of times word } j \text{ appears in the context of word } i$$
$$X_i = \sum_k X_{ik} = \text{the total number of times any word appears in the context of word } i$$
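The quantities above can be computed on a toy text with plain Python. This is a sketch with made-up helper names, and it uses unweighted counts, whereas the GloVe paper down-weights pairs of words that are farther apart.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=1):
    """X[i][j] = number of times word j appears within `window` positions of word i."""
    X = defaultdict(Counter)
    for pos, word in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                X[word][tokens[ctx_pos]] += 1
    return X

def conditional_probability(X, i, j):
    """P(j | i) = X_ij / X_i, with X_i the sum of row i."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i

tokens = "the fly will fly with you".split()
X = cooccurrence_counts(tokens, window=1)
print(dict(X["fly"]))                              # {'the': 1, 'will': 2, 'with': 1}
print(conditional_probability(X, "fly", "will"))   # 0.5
```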

Based on this, we should be able to infer relations between words:

<center><img src='https://nlp.stanford.edu/projects/glove/images/table.png' width=500></center>

Notice that _solid_ is related to _ice_ but not to _steam_, while _gas_ is related to _steam_ but not to _ice_: the ratio of the two conditional probabilities is much larger than 1 in the first case and much smaller than 1 in the second. _Water_ is related to both _ice_ and _steam_, and _fashion_ to neither, so for these two words the ratio is close to 1.

More details can be found in [the paper](https://aclanthology.org/D14-1162.pdf).

### Using GloVe

We can load a pretrained GloVe model using the gensim library (or other resources):

In [None]:
import gensim.downloader as api

model = api.load("glove-twitter-100")



We can then use it to look up word embeddings (or to compute any of the similarity functions we saw for Word2Vec):

In [None]:
model['system']

array([ 0.43887 ,  0.32601 , -0.28524 , -0.08248 ,  0.43643 ,  0.75065 ,
        0.093945, -0.72626 ,  0.32297 , -0.37128 , -0.23306 ,  0.35499 ,
       -3.1764  ,  0.015004,  0.69725 , -0.15256 ,  0.025449, -0.058944,
        0.20002 , -0.61298 , -0.79661 ,  0.53051 ,  0.64765 ,  0.90153 ,
       -0.27407 ,  0.52871 ,  0.39344 ,  0.56076 ,  0.31942 ,  0.83347 ,
       -0.53268 , -1.0166  , -0.25328 , -0.17347 ,  0.68794 ,  0.25902 ,
        0.42864 ,  0.3844  , -0.071415, -0.026013, -0.42733 ,  0.58874 ,
       -0.30061 , -0.18357 ,  0.21158 , -0.72648 , -0.48477 ,  0.43527 ,
       -0.37412 , -0.48493 ,  0.26264 ,  0.21684 , -0.8822  ,  0.57925 ,
       -0.54    ,  0.7147  , -0.33133 , -0.44715 , -0.40713 , -0.014364,
       -0.083808,  0.45569 , -0.094374,  0.56057 ,  0.65446 , -0.45768 ,
        0.2522  ,  0.34328 , -0.061001, -0.4899  ,  0.3342  ,  0.41277 ,
       -0.55403 ,  0.30807 ,  0.22867 , -0.53921 ,  0.16439 ,  0.021561,
        0.15131 , -0.70287 ,  1.4152  ,  0.83387 , 

Or we can train our own model from scratch:

In [None]:
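# common_texts is the toy corpus from gensim used earlier;
# Corpus and Glove are assumed to come from the third-party glove-python package
from gensim.test.utils import common_texts
from glove import Corpus, Glove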

corpus = Corpus()
corpus.fit(common_texts, window=4)

glove = Glove(no_components=4, learning_rate=0.1)
glove.fit(corpus.matrix, epochs=10, no_threads=8, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model.txt')


## Exercises

1. Write your own Bag of Words implementation from scratch. It should provide both binary representations and occurrence-count representations.
2. Implement Tf-idf from scratch. Define as many helper functions as you think necessary.
3. Create (context, target_word) pairs and train a neural network using skip-gram or continuous bag of words. Label each word with a unique ID and use padding for the words at the ends of the text ("dummy" context on the left/right).
4. Visualize the distances between a few words in 2D using PCA (or another dimensionality reduction technique).
5. Compare the word representations in various ways, for example training time, the most similar word to a word X, distances in 2D space, accuracy with an SVM, etc. Compare your implementations with the ones provided by libraries.