# NLP & representation learning: Neural Embeddings, Text Classification

## CELIK Simay 28713301 - SOYKOK Aylin 28711545

To use statistical classifiers with text, it is first necessary to vectorize the text. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state of the art** methods use embeddings to vectorize the text before classification, in order to avoid feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol.json)


## "Modern" NLP pipeline

In contrast to the **bag of words** model, in the modern NLP pipeline everything is **embeddings**. Instead of encoding a text as a **sparse vector** of length $D$ (the size of the feature dictionary), the goal is to encode the text as a meaningful dense vector of small size $|e| \ll D$.
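To make the contrast concrete, here is a minimal sketch using a hypothetical five-word dictionary and made-up 2-dimensional embeddings. The vocabulary, the vector values and the helper names `bow_vector` and `dense_vector` are illustrative assumptions, not values taken from the dataset.

```python
# Toy comparison between a sparse bag-of-words vector and a dense embedding.
# The vocabulary and the embedding values below are invented for illustration.
vocab = ["a", "bad", "film", "great", "movie"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Hypothetical 2-dimensional embeddings (a real model would learn these).
embeddings = {
    "a": [0.0, 0.1],
    "bad": [-0.9, 0.2],
    "film": [0.1, 0.8],
    "great": [0.9, 0.3],
    "movie": [0.2, 0.9],
}

def bow_vector(tokens):
    """Sparse representation: one count per dictionary entry (length D)."""
    counts = [0] * len(vocab)
    for token in tokens:
        if token in word_to_index:
            counts[word_to_index[token]] += 1
    return counts

def dense_vector(tokens):
    """Dense representation: average of the word embeddings (length |e|)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

tokens = "a great movie".split()
bow = bow_vector(tokens)      # length D = 5; mostly zeros with a real dictionary
dense = dense_vector(tokens)  # length |e| = 2; every component carries information
print(bow, dense)
```

With a real dictionary of tens of thousands of entries, the first vector would be almost entirely zeros, while the second keeps the same small size whatever the vocabulary.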


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class 
```


### Using a language model:

How to tokenize the text and extract a feature dictionary is still a manual choice. To obtain meaningful embeddings directly, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class 
```


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language model techniques (see next)

 - [ULMFiT](https://arxiv.org/abs/1801.06146)
 - [ELMo](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on the training dataset
2. Tinker with the learnt embeddings and see learnt relations
3. Tinker with pre-trained embeddings.
4. Use those embeddings for classification
5. Compare different embedding models

## STEP 0: Loading data 

In [1]:
import json
from collections import Counter

# Loading json
file = './datasets/json_pol.json'
with open(file,encoding="utf-8") as f:
    data = json.load(f)
    

# Quick Check
counter = Counter((x[1] for x in data))
print("Number of reviews : ", len(data))
print("----> # of positive : ", counter[1])
print("----> # of negative : ", counter[0])
print("")
print(data[0])


Number of reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old Justin Henry and a script that was sympathetic to each character (and each one\'s predicament), the thought provoking elements linger long after the tear jerking ones are over. Overall, superior acting from a solid cast, excellent directing, and a very powerful script. The right touches of humor throughout help keep a "heavy" subject from becoming tedious or difficult to sit through. Lastly, this film stands the test of time and seems in no way dated, decades after it was released.', 1]


In [2]:
# Preprocessing
import string
import re

# Characters replaced by a space: ASCII punctuation (which already includes '"'), line breaks and tabs
punc = string.punctuation + '\n\r\t'

def preprocess(text):
    """Remove digits, replace punctuation with spaces, and lowercase the text."""
    chiffsupp = re.sub('[0-9]+', '', text)  # drop digits
    return chiffsupp.translate(str.maketrans(punc, ' ' * len(punc))).lower()
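Note that the training cell further down builds `text` with a plain `t.split()` on the raw reviews, so `preprocess` is never actually applied. This is why tokens such as `movie,` or `"movie"` show up among the nearest neighbours later on. The sketch below shows one possible way to combine the two steps. The helper name `tokenize` and the sample sentence are assumptions made for illustration.

```python
import re
import string

# Same cleaning logic as the `preprocess` cell above, repeated so the sketch runs on its own.
punc = string.punctuation + "\n\r\t"
table = str.maketrans(punc, " " * len(punc))

def preprocess(text):
    """Remove digits, replace punctuation with spaces, and lowercase the text."""
    return re.sub("[0-9]+", "", text).translate(table).lower()

def tokenize(text):
    """Hypothetical helper: clean the text, then split on whitespace."""
    return preprocess(text).split()

sample = 'The movie, "Movie" was great!! 10/10.'
raw_tokens = sample.split()      # punctuation stays glued to the words
clean_tokens = tokenize(sample)  # variants of the same word are merged
print(raw_tokens)
print(clean_tokens)
```

In the notebook, this would amount to something like `text = [tokenize(t) for t, p in data]` before training, which is left here as a suggestion rather than a tested change.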



## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)
- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)
- predicts the context given a word

maximizing `p(i'm taking the out for a walk | dog)`
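The two objectives differ only in how training examples are built from a sliding window. The sketch below enumerates them for the sentence above. It assumes a fixed window of 2 words on each side, whereas the probabilities written above condition on the whole sentence, and real implementations add refinements such as subsampling of frequent words. The function names are illustrative.

```python
def context_window(tokens, position, window):
    """Return the words within `window` positions of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

def cbow_examples(tokens, window):
    """CBOW view: one (context words, target word) example per position."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_pairs(tokens, window):
    """Skip-gram view: one (center word, context word) pair per context word."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

tokens = "i'm taking the dog out for a walk".split()
cbow = cbow_examples(tokens, window=2)
pairs = skipgram_pairs(tokens, window=2)

print(cbow[3])                              # context used to predict "dog"
print([p for p in pairs if p[0] == "dog"])  # words predicted from "dog"
```

CBOW produces one example per position, with the whole context as input, while skip-gram produces one example per (word, neighbour) pair.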



   

## STEP 1: train a language model (word2vec)

Gensim provides one of the fastest [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) implementations.


### Train:

In [4]:
# if gensim not installed yet
# ! pip install gensim

In [5]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in data]

# Gensim's default configuration, except sg=1 (skip-gram) instead of the default sg=0 (CBOW)
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### sg=1 trains a skip-gram model; set sg=0 to train a CBOW model
                                cbow_mean=1, epochs=5)

2024-02-27 19:31:20,913 : INFO : collecting all words and their counts
2024-02-27 19:31:20,913 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-27 19:31:21,308 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-27 19:31:21,706 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-27 19:31:21,927 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-27 19:31:21,928 : INFO : Creating a fresh vocabulary
2024-02-27 19:31:22,125 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-27T19:31:22.125328', 'gensim': '4.3.0', 'python': '3.8.17 (default, Jul  5 2023, 16:07:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2024-02-27 19:31:22,126 : INFO : Word2Vec lifecycle event {'ms

2024-02-27 19:32:11,435 : INFO : EPOCH 4 - PROGRESS: at 46.66% examples, 382353 words/s, in_qsize 6, out_qsize 0
2024-02-27 19:32:12,450 : INFO : EPOCH 4 - PROGRESS: at 56.22% examples, 383019 words/s, in_qsize 5, out_qsize 0
2024-02-27 19:32:13,451 : INFO : EPOCH 4 - PROGRESS: at 65.24% examples, 382783 words/s, in_qsize 5, out_qsize 0
2024-02-27 19:32:14,453 : INFO : EPOCH 4 - PROGRESS: at 74.78% examples, 383631 words/s, in_qsize 5, out_qsize 0
2024-02-27 19:32:15,463 : INFO : EPOCH 4 - PROGRESS: at 83.84% examples, 382238 words/s, in_qsize 5, out_qsize 0
2024-02-27 19:32:16,469 : INFO : EPOCH 4 - PROGRESS: at 93.16% examples, 383294 words/s, in_qsize 5, out_qsize 0
2024-02-27 19:32:17,216 : INFO : EPOCH 4: training on 5713167 raw words (4165478 effective words) took 10.9s, 383551 effective words/s
2024-02-27 19:32:17,217 : INFO : Word2Vec lifecycle event {'msg': 'training on 28565835 raw words (20823713 effective words) took 54.5s, 382420 effective words/s', 'datetime': '2024-02-27

In [6]:
# Worth it to save the previous embedding
w2v.save("W2v-movies.dat")
# You will be able to reload them:
# w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and you can continue the learning process if needed

2024-02-27 20:01:12,112 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2024-02-27T20:01:12.112532', 'gensim': '4.3.0', 'python': '3.8.17 (default, Jul  5 2023, 16:07:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'saving'}
2024-02-27 20:01:12,117 : INFO : not storing attribute cum_table
2024-02-27 20:01:12,195 : INFO : saved W2v-movies.dat


In [7]:
vocab_size = len(w2v.wv.key_to_index)
print("Number of unique words found:", vocab_size)

Number of unique words found: 48208


## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector coding for the word "great" will be closer to the vector coding for "good" than to the one coding for "bad". Generally, [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the measure used to compare vectors: the higher the value, the closer the two words.

KeyedVectors have a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method to compute the cosine similarity between words.
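For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The following is a minimal pure-Python sketch of that formula on made-up vectors, not Gensim's actual implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Because the norms are divided out, only the direction of the vectors matters, not their length.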

In [11]:
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))

great and good: 0.7652735
great and bad: 0.48330587


Since cosine similarity encodes relatedness, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` most similar words for a given query.

In [12]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
print(w2v.wv.most_similar("movie",topn=5)) # 5 most similar words
print(w2v.wv.most_similar("awesome",topn=5))
print(w2v.wv.most_similar("actor",topn=5))

[('film', 0.9344167709350586), ('"movie"', 0.8251368999481201), ('flick', 0.7798967361450195), ('movie,', 0.7704945206642151), ('stinker', 0.729508638381958)]
[('amazing', 0.7707414031028748), ('excellent', 0.7452518939971924), ('terrific', 0.685361385345459), ('fantastic', 0.6832655668258667), ('cool', 0.6712515950202942)]
[('actor,', 0.8258581161499023), ('actor.', 0.74735027551651), ('actress', 0.7395841479301453), ('Reeves', 0.7224267721176147), ('role,', 0.7131319642066956)]


But the query can also be more complicated: word embedding spaces tend to encode much more than plain similarity.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`
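The arithmetic behind such a query can be sketched on hand-made vectors. Everything below is invented for illustration: the 3-dimensional vectors, whose axes loosely stand for (royalty, male, female), and the helper `analogy`. It is a simplified view, and Gensim's `most_similar` may differ in details such as how the input vectors are normalized.

```python
import math

# Invented 3-dimensional vectors; the axes loosely stand for (royalty, male, female).
toy = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "prince": [0.8, 0.9, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), query words excluded."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = analogy(positive=["king", "woman"], negative=["man"], vectors=toy)
print(answer)
```

The query words themselves are excluded from the candidates, otherwise `king` would often be its own nearest neighbour.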

In [13]:
# What is awesome - good + bad ?
w2v.wv.most_similar(positive=["awesome","bad"],negative=["good"],topn=3)  


# Try other things, like plurals for example.

[('awful', 0.7509977221488953),
 ('horrible', 0.6313647031784058),
 ('atrocious', 0.6246147751808167)]

In [14]:
w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # does the famous example work for actor?

[('actress', 0.8075904250144958),
 ('actress,', 0.7300375699996948),
 ('role', 0.6861469149589539)]

In [15]:
w2v.wv.most_similar(positive=["actors","women"],negative=["men"],topn=3) # same analogy with plurals



[('actresses', 0.7714454531669617),
 ('actors/actresses', 0.7179805636405945),
 ('actors,', 0.6822283864021301)]

**To test learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a dedicated dataset containing a wide variety of three-way similarities (analogies).**

**You can download the dataset [here](https://thome.isir.upmc.fr/classes/RITAL/questions-words.txt).**

In [17]:
out = w2v.wv.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2024-02-27 20:26:51,060 : INFO : Evaluating word analogies for top 300000 words in the model on ressources/questions-words.txt
2024-02-27 20:26:51,153 : INFO : capital-common-countries: 2.2% (2/90)
2024-02-27 20:26:51,222 : INFO : capital-world: 1.4% (1/71)
2024-02-27 20:26:51,250 : INFO : currency: 0.0% (0/28)
2024-02-27 20:26:51,553 : INFO : city-in-state: 0.0% (0/329)
2024-02-27 20:26:51,867 : INFO : family: 36.8% (126/342)
2024-02-27 20:26:52,683 : INFO : gram1-adjective-to-adverb: 2.7% (25/930)
2024-02-27 20:26:53,178 : INFO : gram2-opposite: 3.1% (17/552)
2024-02-27 20:26:54,189 : INFO : gram3-comparative: 22.1% (278/1260)
2024-02-27 20:26:54,735 : INFO : gram4-superlative: 5.6% (39/702)
2024-02-27 20:26:55,396 : INFO : gram5-present-participle: 17.1% (129/756)
2024-02-27 20:26:56,110 : INFO : gram6-nationality-adjective: 3.4% (27/792)
2024-02-27 20:26:57,179 : INFO : gram7-past-tense: 16.5% (208/1260)
2024-02-27 20:26:57,841 : INFO : gram8-plural: 6.7% (54/812)
2024-02-27 20:26:

**The w2v model trained on the review dataset does not perform very well on this benchmark: the corpus is small and restricted to movie reviews, so many of the tested relations (capitals, currencies, ...) are barely represented.**


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, they key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

**You can download the pre-trained word embeddings [HERE](https://thome.isir.upmc.fr/classes/RITAL/word2vec-google-news-300.dat).**

In [21]:
from gensim.models import KeyedVectors
import gensim.downloader as api

bload = True  # True: load the local copy; False: download with gensim.downloader, then save it
fname = "word2vec-google-news-300"
sdir = "word2vec-google-news-300/"  # change to the directory holding the .dat file

if bload:
    wv_pre_trained = KeyedVectors.load(sdir + fname + ".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir + fname + ".dat")
    

2024-02-27 20:59:49,738 : INFO : loading KeyedVectors object from word2vec-google-news-300/word2vec-google-news-300.dat
2024-02-27 20:59:50,902 : INFO : loading vectors from word2vec-google-news-300/word2vec-google-news-300.dat.vectors.npy with mmap=None
2024-02-27 20:59:52,880 : INFO : KeyedVectors lifecycle event {'fname': 'word2vec-google-news-300/word2vec-google-news-300.dat', 'datetime': '2024-02-27T20:59:52.879874', 'gensim': '4.3.0', 'python': '3.8.17 (default, Jul  5 2023, 16:07:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'loaded'}


**Perform the "syntactic" and "semantic" evaluations again. Conclude on the pre-trained embeddings.**

In [22]:
#out = wv_pre_trained.evaluate_word_analogies("ressources/questions-words_pretrained.txt",case_insensitive=True)

In [23]:
print("great and good:",wv_pre_trained.similarity("great","good"))
print("great and bad:",wv_pre_trained.similarity("great","bad"))

great and good: 0.729151
great and bad: 0.3928765


In [24]:
for word in ['movie', 'awesome', 'actor']:
    print(word, ':')
    for item in wv_pre_trained.most_similar(word,topn=5):
        print('\t', item)

movie :
	 ('film', 0.8676770329475403)
	 ('movies', 0.8013108372688293)
	 ('films', 0.7363012433052063)
	 ('moive', 0.6830360889434814)
	 ('Movie', 0.6693678498268127)
awesome :
	 ('amazing', 0.8282864689826965)
	 ('unbelievable', 0.7464959025382996)
	 ('fantastic', 0.7453291416168213)
	 ('incredible', 0.7390913963317871)
	 ('unbelieveable', 0.6678117513656616)
actor :
	 ('actress', 0.7930009961128235)
	 ('Actor', 0.7446157336235046)
	 ('thesp', 0.6954972147941589)
	 ('thespian', 0.6651668548583984)
	 ('actors', 0.6519852876663208)


In [25]:
positives = [
    ["awesome","bad"],
    ["actor","woman"]
]

negatives = [
    ["good"],
    ["man"]
]

for i in range(2):
    print(*positives[i], '--', *negatives[i])
    for item in wv_pre_trained.most_similar(positive=positives[i],negative=negatives[i],topn=5):
        print('\t', item)

awesome bad -- good
	 ('horrible', 0.5953485369682312)
	 ('amazing', 0.5928210020065308)
	 ('weird', 0.5782380700111389)
	 ('freaky', 0.5767402648925781)
	 ('unbelievable', 0.5747914910316467)
actor woman -- man
	 ('actress', 0.8602624535560608)
	 ('actresses', 0.6596671342849731)
	 ('thesp', 0.6290916800498962)
	 ('Actress', 0.6165293455123901)
	 ('actress_Rachel_Weisz', 0.5997322201728821)


In [26]:
positives = [
    ["movie","movies"],
    ["actor","actors"]
]

negatives = [
    ["film"],
    ["actress"]
]

for i in range(2):
    print(*positives[i], '--', *negatives[i])
    for item in wv_pre_trained.most_similar(positive=positives[i],negative=negatives[i],topn=5):
        print('\t', item)

movie movies -- film
	 ('films', 0.6389263272285461)
	 ('Movies', 0.6188488006591797)
	 ('flicks', 0.6120561361312866)
	 ('Hollywood_blockbusters', 0.5958892703056335)
	 ('romcoms', 0.5864962339401245)
actor actors -- actress
	 ('Actors', 0.6128960251808167)
	 ('thesps', 0.5747197270393372)
	 ('screenwriters', 0.5588046312332153)
	 ('thespians', 0.5531652569770813)
	 ('thespian', 0.5476460456848145)


## STEP 4:  sentiment classification

In the previous practical session, we used a bag-of-words approach to transform text into vectors.
Here, we propose to use word vectors instead (either the ones learnt above or the pre-trained ones).


### <font color='green'> Since we only have word vectors and reviews are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Min/feature
- Max/feature

#### a few pointers:

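- `w2v.wv.key_to_index` is a `dict` mapping each word of the vocabulary to its index (it replaces `w2v.wv.vocab`, which was removed in Gensim 4); `word in w2v.wv` tests whether a word is known
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min/max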
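As a quick illustration of the four aggregation modes, the sketch below applies them to three made-up 2-dimensional word vectors (the values are assumptions, not model outputs). It also shows how `np.minimum` and `np.maximum` can accumulate the same min/max one word at a time.

```python
import numpy as np

# Three hypothetical word vectors of dimension 2 (one row per word of a review).
word_vectors = np.array([[1.0, -2.0],
                         [3.0,  4.0],
                         [-1.0, 1.0]])

aggregations = {
    "sum": np.sum(word_vectors, axis=0),    # grows with the review length
    "mean": np.mean(word_vectors, axis=0),  # independent of the review length
    "min": np.min(word_vectors, axis=0),    # smallest value of each feature
    "max": np.max(word_vectors, axis=0),    # largest value of each feature
}

# np.minimum / np.maximum compare two arrays element-wise, so the same min/max
# can be accumulated word by word without stacking all the vectors first.
running_min = word_vectors[0]
running_max = word_vectors[0]
for vec in word_vectors[1:]:
    running_min = np.minimum(running_min, vec)
    running_max = np.maximum(running_max, vec)

for name, value in aggregations.items():
    print(name, value)
```

Whatever the number of words, each aggregation returns a vector with the same dimension as a single word vector, which is what the classifier needs.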

In [27]:
import numpy as np
from sklearn.model_selection import train_test_split

# We first need to vectorize the text.
# We start by summing the word vectors of each review.

def randomvec():
    """Random unit vector (dimension 100) used for out-of-vocabulary words."""
    default = np.random.randn(100)
    default = default / np.linalg.norm(default)
    return default

def vectorize(text, func=np.sum):
    """
    Vectorize one review by aggregating its word vectors.

    input: list of str (the tokens of the review)
    output: np.array(float)
    """
    vec = []
    for word in text:
        if word not in w2v.wv:
            vec.append(randomvec())
        else:
            vec.append(w2v.wv[word])
    # func is applied feature-wise (axis=0): np.sum, np.mean, np.min or np.max
    return func(np.array(vec), axis=0)

lab = [l for t, l in data]
train, test, y_train, y_test = train_test_split(text, lab, test_size=0.2, random_state=42)
X_train = [vectorize(review) for review in train]
X_test = [vectorize(review) for review in test]

# Let's see what a review vector looks like.
print(X_train[0])

[  14.64355245   27.6113289     2.8425088    -7.93338127    6.84360613
  -79.19433906   10.86327781  196.89161115  -67.31446975  -92.96547834
   15.56442079 -112.20022673    5.8086668    86.83066121  -25.40946327
  -49.9276499    33.3363579   -69.70861041   -7.62340931 -215.95424779
   50.17666742   -2.38020069  145.31848439  -69.1918506   -39.70970039
   68.09699225  -56.2441937    27.71014114  -98.97472918   79.38693306
   68.50543934    1.10013077   46.54545162 -145.75732527    6.4726288
   62.17236457   59.48089204  -91.40268719  -69.07438376 -159.08528891
   -4.02529158 -129.18004145  -47.1709022    36.39325455   72.80883827
  -61.38499063  -51.53503226   15.24842464   25.49245624   25.80657919
   40.06329639  -66.5347559    63.89715937  -29.40907955  -44.37395762
  -11.74681224   -7.10754368   20.37445744  -33.31172254   70.73421621
   55.71731006  -48.41107585  -10.93231426   51.54490925  -86.33590543
   95.01279946   27.11545611   69.53969419  -94.11321586   62.86848952
    0.4

### (2) Train a classifier 
As in the previous practical session, train a logistic regression to perform sentiment classification with word vectors.



In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
train,test, y_train, y_test = train_test_split(text,lab,test_size=0.2,random_state=42)

X_train = [vectorize(text,func=np.sum) for text in train]
X_test = [vectorize(text,func=np.sum) for text in test]

# Scikit Logistic Regression
clf = LogisticRegression()
clf.fit(X_train,  y_train)  
score = clf.score(X_test,y_test)
print(clf)
print("score sum=",score)
#print("classifier:",clf.coef_) # retrieve the coefs from inside the object (cf doc)

LogisticRegression()
score sum= 0.8256


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
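The convergence warning above suggests either raising `max_iter` or scaling the features. Both can be combined in a scikit-learn pipeline, as in the minimal sketch below. It runs on synthetic data standing in for the summed word vectors, so the data, the labels and the resulting accuracy are assumptions and say nothing about the scores on the reviews.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for summed word vectors: large, unscaled feature values.
rng = np.random.default_rng(0)
X = rng.normal(scale=50.0, size=(400, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # label depends on two of the features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize each feature, then fit the classifier with a larger iteration budget.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
print("accuracy:", accuracy)
```

Putting the scaler inside the pipeline means its statistics are computed on the training split only and then reused on the test split.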


In [29]:

X_train = [vectorize(text,func=np.mean) for text in train]
X_test = [vectorize(text,func=np.mean) for text in test]
clf1 = LogisticRegression()
clf1.fit(X_train,  y_train)  
score = clf1.score(X_test,y_test)
print(clf1)
print("score mean=",score)

LogisticRegression()
score mean= 0.8156


In [30]:
X_train = [vectorize(text,func=np.max) for text in train]
X_test = [vectorize(text,func=np.max) for text in test]
clf2 = LogisticRegression()
clf2.fit(X_train,  y_train)  
score = clf2.score(X_test,y_test)
print(clf2)
print("score max=",score)

LogisticRegression()
score max= 0.707


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [31]:
X_train = [vectorize(text,func=np.min) for text in train]
X_test = [vectorize(text,func=np.min) for text in test]
clf3 = LogisticRegression()
clf3.fit(X_train,  y_train)  
score = clf3.score(X_test,y_test)
print(clf3)
print("score min=",score)

LogisticRegression()
score min= 0.7046


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Performance should be worse than with bag of words (~80%). Sum/mean aggregation does not work well on long reviews, especially those with many frequent words, because every word contributes equally and the frequent ones add a lot of noise.
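Another source of noise is the handling of unknown words: `vectorize` above substitutes a fresh random unit vector for every out-of-vocabulary token. One alternative is simply to skip those tokens, as sketched below. The function `mean_vector` and the toy lookup table are assumptions, and whether this improves accuracy on the reviews has not been tested here.

```python
import numpy as np

def mean_vector(tokens, lookup, dim):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [lookup[t] for t in tokens if t in lookup]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical lookup table standing in for `w2v.wv` (which also supports `in` and `[]`).
lookup = {
    "great": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

review = ["great", "movie", "xyzzy"]  # "xyzzy" is out of vocabulary
vec = mean_vector(review, lookup, dim=2)
empty = mean_vector(["xyzzy"], lookup, dim=2)
print(vec, empty)
```

Unlike the random-vector approach, this makes the representation of a review deterministic: vectorizing the same text twice gives the same result.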

## **Todo**:  Try answering the following questions:

- Which word2vec model works best: skip-gram or CBOW?
- Do pre-trained vectors work better than those learnt on the training dataset?

## **Todo**: evaluate the same pipeline on the speaker ID task (Chirac/Mitterrand)

In [32]:
#- Which word2vec model works best: skip-gram or cbow
skipgram = w2v
cbow =gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,               
                                min_count=5,                      
                                sample=0.001, workers=3,
                                sg=0, hs=0, negative=5,   # sg=0    
                                cbow_mean=1, epochs=5)

2024-02-27 21:22:27,600 : INFO : collecting all words and their counts
2024-02-27 21:22:27,604 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-27 21:22:28,071 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-27 21:22:28,564 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-27 21:22:28,831 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-27 21:22:28,833 : INFO : Creating a fresh vocabulary
2024-02-27 21:22:29,050 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-27T21:22:29.049973', 'gensim': '4.3.0', 'python': '3.8.17 (default, Jul  5 2023, 16:07:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2024-02-27 21:22:29,051 : INFO : Word2Vec lifecycle event {'ms

In [41]:
print("S",skipgram.wv.most_similar("movie",topn=5)) # 5 most similar words
print("C",cbow.wv.most_similar("movie",topn=5)) # 5 most similar words
print("S",skipgram.wv.most_similar("awesome",topn=5))
print("C",cbow.wv.most_similar("awesome",topn=5))
print("S",skipgram.wv.most_similar("actor",topn=5))
print("C",cbow.wv.most_similar("actor",topn=5))


S [('film', 0.9289810061454773), ('"movie"', 0.8283028602600098), ('flick', 0.7667436599731445), ('movie,', 0.7451980710029602), ('"film"', 0.7428070902824402)]
C [('film', 0.9295904040336609), ('movie,', 0.8304710388183594), ('film,', 0.7681915163993835), ('flick', 0.747571587562561), ('documentary', 0.7352034449577332)]
S [('amazing', 0.7817275524139404), ('excellent', 0.7275443077087402), ('awesome,', 0.6990002989768982), ('exceptional', 0.6928319931030273), ('cool', 0.6804935336112976)]
C [('amazing', 0.8534489870071411), ('excellent', 0.8053512573242188), ('exceptional', 0.7795264720916748), ('incredible', 0.779024600982666), ('outstanding', 0.7754307985305786)]
S [('actor,', 0.8143014907836914), ('actor.', 0.7617161273956299), ('Reeves', 0.7540039420127869), ('Hopper', 0.7491146326065063), ('actress', 0.734898567199707)]
C [('actress', 0.8344793319702148), ('actor,', 0.8023119568824768), ('role', 0.7676254510879517), ('role,', 0.7433274984359741), ('performance', 0.71019965410232

CBOW works better with a little preprocessing.

In [36]:
# Evaluation

from gensim.test.utils import datapath
analogy_skipgram_score = skipgram.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]
analogy_cbow_score = cbow.wv.evaluate_word_analogies(datapath('questions-words.txt'))[0]

similarity_skipgram_score = skipgram.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))[0]
similarity_cbow_score = cbow.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))[0]

2024-02-27 21:48:58,082 : INFO : Evaluating word analogies for top 300000 words in the model on /Users/aylinsoykok/opt/anaconda3/lib/python3.8/site-packages/gensim/test/test_data/questions-words.txt
2024-02-27 21:48:58,166 : INFO : capital-common-countries: 2.2% (2/90)
2024-02-27 21:48:58,236 : INFO : capital-world: 1.4% (1/71)
2024-02-27 21:48:58,265 : INFO : currency: 0.0% (0/28)
2024-02-27 21:48:58,567 : INFO : city-in-state: 0.0% (0/329)
2024-02-27 21:48:58,880 : INFO : family: 36.8% (126/342)
2024-02-27 21:48:59,695 : INFO : gram1-adjective-to-adverb: 2.7% (25/930)
2024-02-27 21:49:00,191 : INFO : gram2-opposite: 3.1% (17/552)
2024-02-27 21:49:01,218 : INFO : gram3-comparative: 22.1% (278/1260)
2024-02-27 21:49:01,771 : INFO : gram4-superlative: 5.6% (39/702)
2024-02-27 21:49:02,415 : INFO : gram5-present-participle: 17.1% (129/756)
2024-02-27 21:49:03,140 : INFO : gram6-nationality-adjective: 3.4% (27/792)
2024-02-27 21:49:04,233 : INFO : gram7-past-tense: 16.5% (208/1260)
2024-0

2024-02-27 21:49:14,468 : INFO : Skipping line #278 with OOV words: investor	earning	7.13
2024-02-27 21:49:14,468 : INFO : Skipping line #283 with OOV words: marathon	sprint	7.47
2024-02-27 21:49:14,469 : INFO : Skipping line #287 with OOV words: seafood	sea	7.47
2024-02-27 21:49:14,470 : INFO : Skipping line #288 with OOV words: seafood	food	8.34
2024-02-27 21:49:14,470 : INFO : Skipping line #289 with OOV words: seafood	lobster	8.70
2024-02-27 21:49:14,472 : INFO : Skipping line #306 with OOV words: environment	ecology	8.81
2024-02-27 21:49:14,473 : INFO : Skipping line #310 with OOV words: murder	manslaughter	8.53
2024-02-27 21:49:14,475 : INFO : Skipping line #330 with OOV words: ministry	culture	4.69
2024-02-27 21:49:14,476 : INFO : Skipping line #342 with OOV words: concert	virtuoso	6.81
2024-02-27 21:49:14,477 : INFO : Skipping line #351 with OOV words: weather	forecast	8.34
2024-02-27 21:49:14,482 : INFO : Pearson correlation coefficient against /Users/aylinsoykok/opt/anaconda3

In [37]:
# Evaluation

print("Analogy score (skip-gram):",analogy_skipgram_score)
print("Analogy score (cbow):",analogy_cbow_score)

print("Similarity score (skip-gram):", similarity_skipgram_score)
print("Similarity score (cbow):", similarity_cbow_score)

Analogy score (skip-gram): 0.12799539170506913
Analogy score (cbow): 0.11394009216589862
Similarity score (skip-gram): PearsonRResult(statistic=0.23389082080183135, pvalue=7.327643069840452e-05)
Similarity score (cbow): PearsonRResult(statistic=0.17728023181783326, pvalue=0.002812533933874721)


#### Evaluation with different aggregation methods

In [38]:
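def vectorize_new(text, func, model):
    """Aggregate the word vectors of one tokenized review with `func`, using `model.wv`."""
    vec = []
    for word in text:
        if word not in model.wv:
            vec.append(randomvec())  # random unit vector for out-of-vocabulary words
        else:
            vec.append(model.wv[word])
    return func(np.array(vec), axis=0)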

In [39]:
def evaluation(func, model):
    """Test accuracy of a logistic regression trained on reviews aggregated with `func`."""
    train, test, y_train, y_test = train_test_split(text, lab, test_size=0.2, random_state=42)
    X_train = [vectorize_new(review, func, model) for review in train]
    X_test = [vectorize_new(review, func, model) for review in test]
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    return score

In [43]:
def testing(model_sg,model_cbow,aggreg_functions):
    print("Results:")
    print("="*50)
    print("{:<10} {:<15} {:<15}".format("Method", "skip-gram", "cbow"))
    print("="*50)
    for func in aggreg_functions:
        score_sg = evaluation(func, model_sg)
        score_cbow = evaluation(func, model_cbow)
        print("{:<10} {:<15.4f} {:<15.4f}".format(func.__name__, score_sg, score_cbow))
    print("="*50)

In [44]:
aggregation_functions = [np.sum,np.mean,np.max,np.min]
testing(skipgram,cbow,aggregation_functions)

Results:
Method     skip-gram       cbow           
sum        0.8208          0.7762         
mean       0.8174          0.7786         
amax       0.7078          0.6590         
amin       0.7054          0.6468         



**(Bonus)** To get better accuracy, we could try a few things:
- Better aggregation methods (weight by tf-idf ?)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)
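The first suggestion, weighting by tf-idf, can be sketched as follows. The corpus and the 2-dimensional vectors are invented for illustration, and the idf formula is the smoothed variant $\ln\frac{1+n}{1+df}+1$, chosen here as an assumption. The idea is that words appearing in every review, such as "the", get the lowest weight in the average.

```python
import math
import numpy as np

# Tiny invented corpus and 2-dimensional vectors, for illustration only.
corpus = [["the", "movie", "was", "great"],
          ["the", "movie", "was", "awful"],
          ["the", "plot", "was", "great"]]
vectors = {
    "the": np.array([0.1, 0.1]), "was": np.array([0.1, -0.1]),
    "movie": np.array([0.0, 1.0]), "plot": np.array([0.2, 0.8]),
    "great": np.array([1.0, 0.0]), "awful": np.array([-1.0, 0.0]),
}

# Smoothed inverse document frequency: words present in every document get the lowest weight.
n_docs = len(corpus)
doc_freq = {}
for doc in corpus:
    for word in set(doc):
        doc_freq[word] = doc_freq.get(word, 0) + 1
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}

def tfidf_average(tokens):
    """Weighted mean of word vectors; each occurrence contributes its idf weight."""
    known = [t for t in tokens if t in vectors and t in idf]
    if not known:
        return np.zeros(2)
    weights = np.array([idf[t] for t in known])
    stacked = np.array([vectors[t] for t in known])
    return weights @ stacked / weights.sum()

doc_vec = tfidf_average(corpus[1])
print(idf["the"], idf["awful"], doc_vec)
```

Since a repeated word is counted once per occurrence, its total weight is its term frequency times its idf, which is the usual tf-idf weighting.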

In [45]:
# FastText
from gensim.models import FastText
fasttext_model = FastText(sentences=text, vector_size=100, window=5, min_count=5, workers=3, sg=1)

2024-02-27 22:16:46,815 : INFO : collecting all words and their counts
2024-02-27 22:16:46,816 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-27 22:16:47,271 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-27 22:16:47,704 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-27 22:16:47,950 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-27 22:16:47,951 : INFO : Creating a fresh vocabulary
2024-02-27 22:16:48,130 : INFO : FastText lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-27T22:16:48.130563', 'gensim': '4.3.0', 'python': '3.8.17 (default, Jul  5 2023, 16:07:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'prepare_vocab'}
2024-02-27 22:16:48,131 : INFO : FastText lifecycle event {'ms

2024-02-27 22:17:41,428 : INFO : EPOCH 2 - PROGRESS: at 96.99% examples, 247357 words/s, in_qsize 5, out_qsize 0
2024-02-27 22:17:41,953 : INFO : EPOCH 2: training on 5713167 raw words (4166361 effective words) took 16.8s, 247393 effective words/s
2024-02-27 22:17:42,980 : INFO : EPOCH 3 - PROGRESS: at 5.73% examples, 236829 words/s, in_qsize 5, out_qsize 0
2024-02-27 22:17:44,019 : INFO : EPOCH 3 - PROGRESS: at 11.97% examples, 242180 words/s, in_qsize 5, out_qsize 0
2024-02-27 22:17:45,050 : INFO : EPOCH 3 - PROGRESS: at 18.17% examples, 244327 words/s, in_qsize 5, out_qsize 0
2024-02-27 22:17:46,075 : INFO : EPOCH 3 - PROGRESS: at 24.28% examples, 245837 words/s, in_qsize 5, out_qsize 0
2024-02-27 22:17:47,099 : INFO : EPOCH 3 - PROGRESS: at 30.53% examples, 246935 words/s, in_qsize 5, out_qsize 0
2024-02-27 22:17:48,107 : INFO : EPOCH 3 - PROGRESS: at 36.38% examples, 247101 words/s, in_qsize 5, out_qsize 0
2024-02-27 22:17:49,114 : INFO : EPOCH 3 - PROGRESS: at 42.58% examples, 24

In [46]:
print("Score sum with FastText:",evaluation(np.sum, fasttext_model))
print("Score mean with FastText:",evaluation(np.mean, fasttext_model))

Score sum with FastText: 0.8162
Score mean with FastText: 0.8078


In [50]:
# Doc2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
tagged_data = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(text)]
doc2vec_model = Doc2Vec(vector_size=100, window=5, min_count=5, workers=3)
doc2vec_model.build_vocab(tagged_data)
doc2vec_model.train(tagged_data, total_examples=doc2vec_model.corpus_count, epochs=5)

2024-02-27 22:24:51,143 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d100,n5,w5,mc5,s0.001,t3>', 'datetime': '2024-02-27T22:24:51.143073', 'gensim': '4.3.0', 'python': '3.8.17 (default, Jul  5 2023, 16:07:30) \n[Clang 14.0.6 ]', 'platform': 'macOS-10.16-x86_64-i386-64bit', 'event': 'created'}
2024-02-27 22:24:51,143 : INFO : collecting all words and their counts
2024-02-27 22:24:51,144 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2024-02-27 22:24:51,460 : INFO : PROGRESS: at example #10000, processed 2301366 words (7289152 words/s), 153853 word types, 0 tags
2024-02-27 22:24:51,797 : INFO : PROGRESS: at example #20000, processed 4553558 words (6707075 words/s), 240043 word types, 0 tags
2024-02-27 22:24:51,965 : INFO : collected 276678 word types and 25000 unique tags from a corpus of 25000 examples and 5713167 words
2024-02-27 22:24:51,966 : INFO : Creating a fresh vocabulary
2024-02-27 22:24:52,170 : INFO : Doc2Vec lifecycle event

In [52]:
print("Score sum with Doc2Vec:",evaluation(np.sum, doc2vec_model))
print("Score mean with Doc2Vec:",evaluation(np.mean, doc2vec_model))

Score sum with Doc2Vec: 0.7764
Score mean with Doc2Vec: 0.781
