# NLP & representation learning: Neural Embeddings, Text Classification

## CELIK Simay 28713301 - SOYKOK Aylin 28711545

To use statistical classifiers with text, it is first necessary to vectorize the text. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state-of-the-art** methods use embeddings to vectorize the text before classification, in order to avoid feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol.json)


## "Modern" NLP pipeline

In contrast to the **bag-of-words** model, in the modern NLP pipeline everything is **embeddings**. Instead of encoding a text as a **sparse vector** of length $|D|$ (the size of the feature dictionary $D$), the goal is to encode it as a meaningful dense vector of small size $|e| \ll |D|$.


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class 
```
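
As a rough illustration of this pipeline, here is a minimal sketch that contrasts a sparse bag-of-words vector with a dense embedding lookup. The vocabulary, the embedding table and the linear layer are hypothetical and untrained (random weights), so the predicted class is arbitrary; only the shapes and the data flow matter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary (|D| = 5 words) and embedding table (|e| = 3 dimensions).
vocab = {"a": 0, "great": 1, "bad": 2, "movie": 3, "film": 4}
embedding_table = rng.normal(size=(len(vocab), 3))

def token_ids(text):
    """Lowercase, split on whitespace and keep only the known tokens."""
    return [vocab[tok] for tok in text.lower().split() if tok in vocab]

def bag_of_words(text):
    """Sparse representation: one count per dictionary entry (length |D|)."""
    return np.bincount(token_ids(text), minlength=len(vocab))

def embed(text):
    """Dense representation: average of the embedding rows of the tokens (length |e|)."""
    return embedding_table[token_ids(text)].mean(axis=0)

# Untrained linear layer standing in for the neural net (2 classes).
W = rng.normal(size=(3, 2))
b = np.zeros(2)

def predict(text):
    """Return the index of the class with the highest score."""
    return int(np.argmax(embed(text) @ W + b))

print(bag_of_words("A great movie"))  # [1 1 0 1 0]
print(embed("A great movie").shape)   # (3,)
print(predict("A great movie"))       # 0 or 1, arbitrary with random weights
```

In a real pipeline the embedding table and the classifier weights would be learnt from data rather than drawn at random.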


### Using a language model:

How to tokenize the text and extract a feature dictionary is still a manual task. To obtain meaningful embeddings directly, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class 
```


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language model techniques (see next)

 - [ULMFiT](https://arxiv.org/abs/1801.06146)
 - [ELMo](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on the training dataset
2. Tinker with the learnt embeddings and inspect the relations they capture
3. Tinker with pre-trained embeddings
4. Use those embeddings for classification
5. Compare different embedding models

## STEP 0: Loading data 

In [1]:
import json
from collections import Counter

# Loading json
file = './datasets/json_pol.json'
with open(file,encoding="utf-8") as f:
    data = json.load(f)
    

# Quick Check
counter = Counter((x[1] for x in data))
print("Number of reviews : ", len(data))
print("----> # of positive : ", counter[1])
print("----> # of negative : ", counter[0])
print("")
print(data[0])


Number of reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old Justin Henry and a script that was sympathetic to each character (and each one\'s predicament), the thought provoking elements linger long after the tear jerking ones are over. Overall, superior acting from a solid cast, excellent directing, and a very powerful script. The right touches of humor throughout help keep a "heavy" subject from becoming tedious or difficult to sit through. Lastly, this film stands the test of time and seems in no way dated, decades after it was released.', 1]


In [16]:
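# Preprocessing
import string
import re

# Characters replaced by spaces: ASCII punctuation plus newline, carriage return and tab
# (the trailing '"' is already part of string.punctuation)
punc = string.punctuation+'\n\r\t"'

def preprocess(text):
    """Remove digits, replace punctuation with spaces and lowercase the text."""
    no_digits = re.sub('[0-9]+', '', text)
    return no_digits.translate(str.maketrans(punc, ' ' * len(punc))).lower()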
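
The training cell further down tokenizes the raw reviews with `t.split()`, which is why tokens such as `movie,` or `"film"` show up among the nearest neighbours. One possible way to apply the preprocessing before tokenizing is sketched below. The function is redefined so that the block runs on its own, and the sample sentence is made up for illustration.

```python
import re
import string

# Same idea as `punc` above: ASCII punctuation plus whitespace control characters.
punc = string.punctuation + "\n\r\t"
table = str.maketrans(punc, " " * len(punc))

def preprocess(text):
    """Remove digits, replace punctuation with spaces and lowercase the text."""
    return re.sub("[0-9]+", "", text).translate(table).lower()

sample = 'A "fine" film, released in 1979!'
tokens = preprocess(sample).split()
print(tokens)  # ['a', 'fine', 'film', 'released', 'in']
```

With the notebook's data, the equivalent tokenization would be `[preprocess(t).split() for t, p in data]`. The outputs shown in the rest of this notebook were obtained without it.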



## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)
- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)
- predicts the context given a word

maximizing `p(i'm taking the ___ out for a walk | dog)`
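
To make the difference concrete, the sketch below builds the training examples that each objective would see for the example sentence, using a symmetric window of 2 words. It is a simplified illustration: actual implementations add details such as subsampling of frequent words and negative sampling, which are omitted here, and the helper names are made up.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: one (context words -> target word) example per position."""
    return [(context_window(tokens, i, window), tok) for i, tok in enumerate(tokens)]

def skipgram_examples(tokens, window=2):
    """Skip-gram: one (center word -> context word) pair per neighbour."""
    return [(tok, ctx)
            for i, tok in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]

tokens = "i'm taking the dog out for a walk".split()

print(cbow_examples(tokens)[3])
# (['taking', 'the', 'out', 'for'], 'dog')

print([pair for pair in skipgram_examples(tokens) if pair[0] == "dog"])
# [('dog', 'taking'), ('dog', 'the'), ('dog', 'out'), ('dog', 'for')]
```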



   

## STEP 1: train a language model (word2vec)

Gensim provides one of the fastest [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) implementations.


### Train:

In [17]:
# if gensim not installed yet
# ! pip install gensim

In [18]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Whitespace tokenization of the raw reviews (preprocess() is not applied here)
text = [t.split() for t,p in data]

# Apart from sg=1, these are gensim's default values
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### sg=1 trains a skip-gram model (sg=0 would train CBOW)
                                cbow_mean=1, epochs=5)

2024-02-27 01:09:21,151 : INFO : collecting all words and their counts
2024-02-27 01:09:21,153 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-27 01:09:21,649 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-27 01:09:22,147 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-27 01:09:22,380 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-27 01:09:22,381 : INFO : Creating a fresh vocabulary
2024-02-27 01:09:22,614 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-27T01:09:22.614486', 'gensim': '4.3.0', 'python': '3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'prepare_vocab'}
2024-02-27 01:09:22,616 : IN

In [26]:
# It is worth saving the trained model
w2v.save("W2v-movies.dat")
# You will be able to reload it:
# w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and you can continue the learning process if needed

2024-02-27 01:11:35,857 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-02-27T01:11:35.857444', 'gensim': '4.3.0', 'python': '3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'saving'}
2024-02-27 01:11:35,858 : INFO : not storing attribute cum_table
2024-02-27 01:11:35,930 : INFO : saved W2v-movies.dat


## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector coding for the word "great" will be closer to the vector coding for "good" than to the one coding for "bad". Generally, [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the measure used to compare vectors: the higher it is, the closer the two words.

KeyedVectors have a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method to compute the cosine similarity between words.

In [27]:
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))

great and good: 0.7843455
great and bad: 0.47022957
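
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below is a plain NumPy version applied to hand-picked 2-D vectors; it is meant to illustrate the formula, not to reproduce gensim's internals.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity([1, 0], [2, 0]))   # 1.0  (same direction, length is ignored)
print(cosine_similarity([1, 0], [0, 3]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

The value lies in [-1, 1] and does not depend on vector length, which is why it is preferred to the Euclidean distance for comparing embeddings.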


Since cosine similarity encodes closeness, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` most similar words for a given query.

In [28]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
print(w2v.wv.most_similar("movie",topn=5)) # 5 most similar words
print(w2v.wv.most_similar("awesome",topn=5))
print(w2v.wv.most_similar("actor",topn=5))

[('film', 0.9289810061454773), ('"movie"', 0.8283028602600098), ('flick', 0.7667436599731445), ('movie,', 0.7451980710029602), ('"film"', 0.7428070902824402)]
[('amazing', 0.7817275524139404), ('excellent', 0.7275443077087402), ('awesome,', 0.6990002989768982), ('exceptional', 0.6928319931030273), ('cool', 0.6804935336112976)]
[('actor,', 0.8143014907836914), ('actor.', 0.7617161273956299), ('Reeves', 0.7540039420127869), ('Hopper', 0.7491146326065063), ('actress', 0.734898567199707)]


But the query can be more complicated: word embedding spaces tend to encode much more than similarity.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`

In [29]:
# What is awesome - good + bad ?
w2v.wv.most_similar(positive=["awesome","bad"],negative=["good"],topn=3)  


# Try other things, like plurals for example.

[('awful', 0.7694743871688843),
 ('horrible', 0.6754423975944519),
 ('atrocious', 0.6486415266990662)]

In [30]:
w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # does the famous example work for actor?

[('actress', 0.8380486965179443),
 ('actress,', 0.7827675342559814),
 ('actress.', 0.6776851415634155)]

In [31]:
w2v.wv.most_similar(positive=["actors","women"],negative=["men"],topn=3) # same analogy with plurals



[('actresses', 0.765678882598877),
 ('actors/actresses', 0.6931259036064148),
 ('actors,', 0.6719549298286438)]
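
The analogy queries above boil down to vector arithmetic followed by a nearest-neighbour search. Here is a simplified sketch on hypothetical 2-D embeddings (axis 0 loosely stands for "royalty", axis 1 for "gender"). It follows the same idea as `most_similar(positive=..., negative=...)`, excluding the query words from the candidates, but it is not gensim's actual code.

```python
import numpy as np

# Hypothetical hand-made embeddings, for illustration only.
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def unit(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

def analogy(positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(unit(toy[w]) for w in positive) - sum(unit(toy[w]) for w in negative)
    query = unit(query)
    scores = {w: float(unit(v) @ query)
              for w, v in toy.items()
              if w not in positive and w not in negative}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

# king - man + woman: "queen" ranks first among the remaining words
print(analogy(positive=["king", "woman"], negative=["man"], topn=1))
```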

**To test learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a special dataset containing a wide variety of analogy questions (a is to b as c is to d).**

**You can download the dataset [here](https://thome.isir.upmc.fr/classes/RITAL/questions-words.txt).**

In [32]:
out = w2v.wv.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2024-02-27 01:11:37,763 : INFO : Evaluating word analogies for top 300000 words in the model on ressources/questions-words.txt


2024-02-27 01:11:37,900 : INFO : capital-common-countries: 7.8% (7/90)
2024-02-27 01:11:38,019 : INFO : capital-world: 2.8% (2/71)
2024-02-27 01:11:38,062 : INFO : currency: 0.0% (0/28)
2024-02-27 01:11:38,550 : INFO : city-in-state: 0.0% (0/329)
2024-02-27 01:11:39,076 : INFO : family: 34.8% (119/342)
2024-02-27 01:11:40,405 : INFO : gram1-adjective-to-adverb: 1.9% (18/930)
2024-02-27 01:11:41,241 : INFO : gram2-opposite: 3.6% (20/552)
2024-02-27 01:11:42,871 : INFO : gram3-comparative: 19.0% (240/1260)
2024-02-27 01:11:43,803 : INFO : gram4-superlative: 7.0% (49/702)
2024-02-27 01:11:44,815 : INFO : gram5-present-participle: 16.8% (127/756)
2024-02-27 01:11:45,874 : INFO : gram6-nationality-adjective: 2.4% (19/792)
2024-02-27 01:11:47,568 : INFO : gram7-past-tense: 17.0% (214/1260)
2024-02-27 01:11:48,599 : INFO : gram8-plural: 4.9% (40/812)
2024-02-27 01:11:49,808 : INFO : gram9-plural-verbs: 25.9% (196/756)
2024-02-27 01:11:49,809 : INFO : Quadruplets with out-of-vocabulary words: 

**The w2v model trained on the review dataset does not perform very well on this benchmark: it was learnt from a small, domain-specific corpus (25,000 movie reviews).**
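
To see what the benchmark contains, the sketch below parses the file format: a line starting with `:` opens a section, and every other line holds four words `a b c d`. The two sample quadruplets are given for illustration, and the parser is an assumption about how one might read the file, not gensim's own loader.

```python
def parse_analogies(lines):
    """Group 'a b c d' quadruplets under their ': section' header."""
    sections, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            a, b, c, d = line.split()
            sections[current].append((a, b, c, d))
    return sections

sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
: family
boy girl brother sister
""".splitlines()

for name, quads in parse_analogies(sample).items():
    for a, b, c, d in quads:
        # Each quadruplet asks: a is to b as c is to ? (expected answer: d)
        print(f"[{name}] {a} : {b} :: {c} : {d}")
```

A quadruplet counts as correct when the top answer to `most_similar(positive=[b, c], negative=[a])` is `d`. Geography sections such as capitals or currencies are unlikely to be well covered by a movie-review corpus, which is consistent with the low scores in the log above.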


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, the key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

**You can download the pre-trained word embedding [HERE](https://thome.isir.upmc.fr/classes/RITAL/word2vec-google-news-300.dat) .**

In [33]:
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors
import gensim.downloader as api
bload = True   # True: load the vectors saved locally; False: download them with gensim's API
fname = "word2vec-google-news-300"
sdir = "word2vec-google-news-300/" # Change to your local directory

if bload:
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")
    

2024-02-27 01:11:49,847 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2024-02-27 01:11:49,848 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2024-02-27 01:11:49,849 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2024-02-27T01:11:49.849481', 'gensim': '4.3.0', 'python': '3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'created'}


2024-02-27 01:11:49,979 : INFO : loading KeyedVectors object from word2vec-google-news-300/word2vec-google-news-300.dat
2024-02-27 01:11:51,738 : INFO : loading vectors from word2vec-google-news-300/word2vec-google-news-300.dat.vectors.npy with mmap=None
2024-02-27 01:11:56,231 : INFO : KeyedVectors lifecycle event {'fname': 'word2vec-google-news-300/word2vec-google-news-300.dat', 'datetime': '2024-02-27T01:11:56.231264', 'gensim': '4.3.0', 'python': '3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'loaded'}


**Perform the "syntactic" and "semantic" evaluations again. Conclude on the pre-trained embeddings.**

In [39]:
out = wv_pre_trained.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  # same analogy file as for the model trained above

In [None]:
print("great and good:",wv_pre_trained.similarity("great","good"))
print("great and bad:",wv_pre_trained.similarity("great","bad"))

great and good: 0.72915095
great and bad: 0.3928765


In [None]:
for word in ['movie', 'awesome', 'actor']:
    print(word, ':')
    for item in wv_pre_trained.most_similar(word,topn=5):
        print('\t', item)

movie :
	 ('film', 0.8676770329475403)
	 ('movies', 0.8013108372688293)
	 ('films', 0.7363011837005615)
	 ('moive', 0.6830360889434814)
	 ('Movie', 0.6693680286407471)
awesome :
	 ('amazing', 0.8282866477966309)
	 ('unbelievable', 0.74649578332901)
	 ('fantastic', 0.7453290224075317)
	 ('incredible', 0.7390913367271423)
	 ('unbelieveable', 0.6678116917610168)
actor :
	 ('actress', 0.7930010557174683)
	 ('Actor', 0.7446156740188599)
	 ('thesp', 0.6954971551895142)
	 ('thespian', 0.6651668548583984)
	 ('actors', 0.6519852876663208)


In [None]:
positives = [
    ["awesome","bad"],
    ["actor","woman"]
]

negatives = [
    ["good"],
    ["man"]
]

# Each query computes sum(positive) - sum(negative) and lists its nearest neighbours
for pos, neg in zip(positives, negatives):
    print(*pos, '--', *neg)
    for item in wv_pre_trained.most_similar(positive=pos,negative=neg,topn=5):
        print('\t', item)

awesome bad -- good
	 ('horrible', 0.5953484177589417)
	 ('amazing', 0.5928210020065308)
	 ('weird', 0.5782381296157837)
	 ('freaky', 0.5767403244972229)
	 ('unbelievable', 0.5747914910316467)
actor woman -- man
	 ('actress', 0.8602624535560608)
	 ('actresses', 0.6596670150756836)
	 ('thesp', 0.6290916800498962)
	 ('Actress', 0.6165294647216797)
	 ('actress_Rachel_Weisz', 0.5997323989868164)


In [None]:
positives = [
    ["movie","movies"],
    ["actor","actors"]
]

negatives = [
    ["film"],
    ["actress"]
]

# Each query computes sum(positive) - sum(negative) and lists its nearest neighbours
for pos, neg in zip(positives, negatives):
    print(*pos, '--', *neg)
    for item in wv_pre_trained.most_similar(positive=pos,negative=neg,topn=5):
        print('\t', item)

movie movies -- film
	 ('films', 0.6389263868331909)
	 ('Movies', 0.6188486218452454)
	 ('flicks', 0.6120561361312866)
	 ('Hollywood_blockbusters', 0.5958892703056335)
	 ('romcoms', 0.5864962339401245)
actor actors -- actress
	 ('Actors', 0.6128960251808167)
	 ('thesps', 0.5747196078300476)
	 ('screenwriters', 0.5588046312332153)
	 ('thespians', 0.5531652569770813)
	 ('thespian', 0.5476460456848145)


## STEP 4: Sentiment classification

In the previous practical session, we used a bag-of-words approach to transform text into vectors.
Here, we propose to use word vectors instead (previously learnt or loaded).


### <font color='green'> Since we only have word vectors and sentences are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Min per feature
- Max per feature

#### A few pointers:

- `w2v.wv.key_to_index` is a `dict` mapping every word of the vocabulary to its index (in gensim 4 it replaces the former `w2v.wv.vocab`); `word in w2v.wv` tests whether a word is known
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min/max
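
The four aggregations listed above can be compared on a small hypothetical example: a "review" of three words, each with a made-up 4-dimensional vector. Every aggregation reduces the word axis and returns one vector with the embedding dimension.

```python
import numpy as np

# Hypothetical review of 3 words, one row per word, 4 embedding dimensions.
word_vectors = np.array([
    [ 1.0, -2.0, 0.5, 0.0],
    [ 3.0,  0.0, 0.5, 1.0],
    [-1.0,  2.0, 0.5, 2.0],
])

aggregations = {
    "sum":  np.sum(word_vectors, axis=0),
    "mean": np.mean(word_vectors, axis=0),
    "min":  np.min(word_vectors, axis=0),   # per-feature minimum over the words
    "max":  np.max(word_vectors, axis=0),   # per-feature maximum over the words
}

for name, vec in aggregations.items():
    print(name, vec)
```

Note that sum grows with the review length while mean, min and max do not, so the resulting features can live on very different scales.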

In [43]:
import numpy as np
from sklearn.model_selection import train_test_split

# We first need to vectorize the text.
# We start by summing the word vectors of each review.

def randomvec():
    """Random unit vector used for out-of-vocabulary words."""
    default = np.random.randn(w2v.wv.vector_size)
    return default / np.linalg.norm(default)

def vectorize(text,func=np.sum):
    """
    Vectorize one review by aggregating its word vectors.

    input: list of tokens (str)
    output: np.array(float), aggregated over the words with `func`
            (np.sum, np.mean, np.min or np.max)
    """
    vec = []
    for word in text:
        if word in w2v.wv:
            vec.append(w2v.wv[word])
        else:
            # unknown word: fall back to a random unit vector
            vec.append(randomvec())
    return func(np.array(vec),axis=0)

lab = [l for t,l in data]
train,test, y_train, y_test = train_test_split(text,lab,test_size=0.2,random_state=42)
X_train = [vectorize(review) for review in train]
X_test = [vectorize(review) for review in test]


# Let's see what a review vector looks like.
print(X_train[0])

[  -2.37627046   32.90345695  -22.36017699   -1.88356059   11.65108255
 -106.57608645    4.58677254  179.66670126  -67.97166419  -76.83892138
   16.09992365 -100.60202609    2.30489281   84.6385065   -15.19924983
  -25.08711765   19.02260561  -60.92727518  -21.38774537 -221.77669695
   53.83795181    9.30805586  158.35142657  -91.08885736  -18.84795691
   77.33316899  -47.85235558   20.23314666 -104.13862743   81.14041466
   65.81978434   17.8525585    27.8722761  -138.68033748  -34.42874938
   63.72472595   50.04359605  -77.72665762  -67.26169342 -157.09248285
   -6.43196663 -160.29360995  -62.9862789    44.3848597   104.91261487
  -43.62902318  -76.69970658   20.4128542    14.89370537   14.50099317
   65.89867352 -100.22173624   53.86226202  -36.92350398  -40.82686786
    9.17822228   -8.21536259   42.3206258   -44.35313587   53.63165555
   41.9501717   -25.40144322   22.63124714   72.12487827 -104.50473
  140.31925851   -5.23539437   73.04332452  -89.99325113   87.20401604
   -7.396

### (2) Train a classifier 
As in the previous practical session, train a logistic regression to perform sentiment classification with word vectors.



In [44]:


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
train,test, y_train, y_test = train_test_split(text,lab,test_size=0.2,random_state=42)

X_train = [vectorize(text,func=np.sum) for text in train]
X_test = [vectorize(text,func=np.sum) for text in test]

# Scikit Logistic Regression
clf = LogisticRegression()
clf.fit(X_train,  y_train)  
score = clf.score(X_test,y_test)
print(clf)
print("score sum=",score)
#print("classifier:",clf.coef_) # retrieve the coefs from inside the object (cf doc)

LogisticRegression()
score sum= 0.8232


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
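
The convergence warning above suggests two remedies: scaling the features and raising `max_iter` (100 by default). The sketch below combines both in a scikit-learn pipeline. It runs on synthetic data standing in for summed review vectors, so the score it prints says nothing about the review task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for summed review vectors: large, unevenly scaled features.
X = rng.normal(size=(200, 10)) * rng.uniform(1, 200, size=10)
# Labels depend linearly on the first two (rescaled) features.
y = (X[:, 0] / X[:, 0].std() + X[:, 1] / X[:, 1].std() > 0).astype(int)

# Standardize the features, then allow more optimizer iterations than the default.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```

With the notebook's variables, the same pipeline would be fitted with `clf.fit(X_train, y_train)` and evaluated with `clf.score(X_test, y_test)`.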


In [45]:

X_train = [vectorize(text,func=np.mean) for text in train]
X_test = [vectorize(text,func=np.mean) for text in test]
clf1 = LogisticRegression()
clf1.fit(X_train,  y_train)  
score = clf1.score(X_test,y_test)
print(clf1)
print("score mean=",score)

LogisticRegression()
score mean= 0.8148


In [46]:
X_train = [vectorize(text,func=np.max) for text in train]
X_test = [vectorize(text,func=np.max) for text in test]
clf2 = LogisticRegression()
clf2.fit(X_train,  y_train)  
score = clf2.score(X_test,y_test)
print(clf2)
print("score max=",score)

LogisticRegression()
score max= 0.6998


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [47]:
X_train = [vectorize(text,func=np.min) for text in train]
X_test = [vectorize(text,func=np.min) for text in test]
clf3 = LogisticRegression()
clf3.fit(X_train,  y_train)  
score = clf3.score(X_test,y_test)
print(clf3)
print("score min=",score)

LogisticRegression()
score min= 0.698


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Performance should be worse than with the bag-of-words model (~80%). Here the scores are 0.82 with sum, 0.81 with mean and about 0.70 with min or max. Sum/mean aggregation does not work well on long reviews, especially those with many frequent words, which add a lot of noise.

## **Todo**:  Try answering the following questions:

- Which word2vec model works best: skip-gram or CBOW?
- Do pre-trained vectors work better than those learnt on the training dataset?

## **Todo**: Evaluate the same pipeline on the speaker ID task (Chirac/Mitterrand)

In [38]:
# Which word2vec model works best: skip-gram or CBOW?
skipgram = w2v   # trained above with sg=1
cbow = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=0, hs=0, negative=5,        ### sg=0 trains a CBOW model (sg=1 would train skip-gram)
                                cbow_mean=1, epochs=5)

2024-02-27 01:18:06,060 : INFO : collecting all words and their counts
2024-02-27 01:18:06,061 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-27 01:18:06,597 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-27 01:18:07,064 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-27 01:18:07,306 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-27 01:18:07,307 : INFO : Creating a fresh vocabulary
2024-02-27 01:18:07,526 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-27T01:18:07.526104', 'gensim': '4.3.0', 'python': '3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19045-SP0', 'event': 'prepare_vocab'}
2024-02-27 01:18:07,528 : IN

In [41]:
print("S",skipgram.wv.most_similar("movie",topn=5)) # 5 most similar words
print("C",cbow.wv.most_similar("movie",topn=5)) # 5 most similar words
print("S",skipgram.wv.most_similar("awesome",topn=5))
print("C",cbow.wv.most_similar("awesome",topn=5))
print("S",skipgram.wv.most_similar("actor",topn=5))
print("C",cbow.wv.most_similar("actor",topn=5))


S [('film', 0.9289810061454773), ('"movie"', 0.8283028602600098), ('flick', 0.7667436599731445), ('movie,', 0.7451980710029602), ('"film"', 0.7428070902824402)]
C [('film', 0.9295904040336609), ('movie,', 0.8304710388183594), ('film,', 0.7681915163993835), ('flick', 0.747571587562561), ('documentary', 0.7352034449577332)]
S [('amazing', 0.7817275524139404), ('excellent', 0.7275443077087402), ('awesome,', 0.6990002989768982), ('exceptional', 0.6928319931030273), ('cool', 0.6804935336112976)]
C [('amazing', 0.8534489870071411), ('excellent', 0.8053512573242188), ('exceptional', 0.7795264720916748), ('incredible', 0.779024600982666), ('outstanding', 0.7754307985305786)]
S [('actor,', 0.8143014907836914), ('actor.', 0.7617161273956299), ('Reeves', 0.7540039420127869), ('Hopper', 0.7491146326065063), ('actress', 0.734898567199707)]
C [('actress', 0.8344793319702148), ('actor,', 0.8023119568824768), ('role', 0.7676254510879517), ('role,', 0.7433274984359741), ('performance', 0.71019965410232

CBOW works better with a bit of preprocessing.


**(Bonus)** To get better accuracy, we could try a few things:
- Better aggregation methods (weight by tf-idf ?)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
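
As an illustration of the first idea, here is a minimal sketch of an idf-weighted average of word vectors. The tokenized reviews and the 2-D vectors are hypothetical, and the smoothed idf formula `log((1 + n) / (1 + df)) + 1` is one common choice among several. Frequent words such as "the" get a lower weight, so they contribute less noise to the review vector.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical tokenized reviews and 2-D word vectors, for illustration only.
docs = [["the", "movie", "was", "great"],
        ["the", "movie", "was", "awful"],
        ["the", "plot", "was", "great"]]
vectors = {"the": [0.1, 0.1], "movie": [0.5, 0.0], "was": [0.1, 0.2],
           "great": [1.0, 1.0], "awful": [-1.0, -1.0], "plot": [0.0, 0.5]}

# Smoothed inverse document frequency: rare words get larger weights.
n_docs = len(docs)
doc_freq = Counter(tok for doc in docs for tok in set(doc))
idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def weighted_average(tokens):
    """Average the word vectors, weighting each token by its idf."""
    known = [t for t in tokens if t in vectors and t in idf]
    weights = np.array([idf[t] for t in known])
    matrix = np.array([vectors[t] for t in known])
    return weights @ matrix / weights.sum()

print(idf["the"], idf["awful"])          # "the" appears everywhere, "awful" only once
print(weighted_average(docs[1]))         # pulled towards the rare word "awful"
print(np.mean([vectors[t] for t in docs[1]], axis=0))  # plain average, for comparison
```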