# NLP & representation learning: Neural Embeddings, Text Classification


To use statistical classifiers with text, it is first necessary to vectorize the text. In the first practical session we explored the **Bag of Words (BoW)** model.

Modern **state-of-the-art** methods use embeddings to vectorize the text before classification, which avoids manual feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol.json)


## "Modern" NLP pipeline

In contrast to the **bag-of-words** model, the modern NLP pipeline relies on **embeddings** throughout. Instead of encoding a text as a **sparse vector** of length $D$ (the size of the feature dictionary), the goal is to encode it as a meaningful dense vector of much smaller size $|e| \ll D$.
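
To make the contrast concrete, here is a minimal sketch that encodes the same sentence both ways. The vocabulary, the embedding size `e` and the random embedding table are assumptions made for illustration; in practice the table would be learnt and $D$ would be in the tens of thousands.

```python
import numpy as np

# Toy vocabulary (hypothetical); D is the size of the feature dictionary.
vocab = ["the", "movie", "was", "great", "bad", "boring"]
word_to_index = {word: i for i, word in enumerate(vocab)}
D = len(vocab)

def bow_vector(tokens):
    """Count vector of length D: one slot per dictionary word, mostly zeros."""
    vec = np.zeros(D)
    for tok in tokens:
        if tok in word_to_index:
            vec[word_to_index[tok]] += 1
    return vec

# Dense embedding table: one row of size e per word.
# Random values stand in for vectors that would normally be learnt.
e = 3
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(D, e))

def dense_vector(tokens):
    """Average of the embedding rows of the known tokens."""
    rows = [embedding_table[word_to_index[t]] for t in tokens if t in word_to_index]
    return np.mean(rows, axis=0)

tokens = "the movie was great".split()
bow = bow_vector(tokens)
dense = dense_vector(tokens)
print(bow.shape, dense.shape)  # (6,) (3,)
```

The BoW vector grows with the dictionary, whereas the dense vector keeps the same small size whatever the vocabulary.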


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class 
```


### Using a language model:

Tokenizing the text and extracting a feature dictionary are still manual tasks. To obtain meaningful embeddings directly, it is common to use a pre-trained language model such as `word2vec`, which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class 
```
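
As an illustration of the manual tokenization choice mentioned above, the sketch below compares plain whitespace splitting with a simple regular-expression tokenizer. Both helper functions are hypothetical and only meant to show how the choice changes the resulting vocabulary.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only: punctuation stays attached to the words."""
    return text.split()

def regex_tokenize(text):
    """Keep runs of letters, digits and apostrophes; punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9']+", text)

sample = "A great actor, a great role."
print(whitespace_tokenize(sample))  # ['A', 'great', 'actor,', 'a', 'great', 'role.']
print(regex_tokenize(sample))       # ['A', 'great', 'actor', 'a', 'great', 'role']
```

With whitespace splitting, `actor,` and `actor` become two distinct dictionary entries, each with its own vector.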


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [GloVe](https://nlp.stanford.edu/projects/glove/)


- #### Bleeding-edge language-model techniques (see next)

 - [ULMFiT](https://arxiv.org/abs/1801.06146)
 - [ELMo](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on the training dataset
2. Tinker with the learnt embeddings and inspect the relations they capture
3. Tinker with pre-trained embeddings.
4. Use those embeddings for classification
5. Compare different embedding models

## STEP 0: Loading data 

In [3]:
import json
from collections import Counter

# Loading json
file = './datasets/json_pol.json'
with open(file,encoding="utf-8") as f:
    data = json.load(f)
    

# Quick Check
counter = Counter((x[1] for x in data))
print("Number of reviews : ", len(data))
print("----> # of positive : ", counter[1])
print("----> # of negative : ", counter[0])
print("")
print(data[0][1])

Number of reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

1


## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


Given an example sentence: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Words (CBOW)
- predicts a word given its context

maximizing `p(dog | i'm taking the ___ out for a walk)`

### (b) Skip-Gram (SG)
- predicts the context given a word

maximizing `p(i'm taking the out for a walk | dog)`
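
The difference between the two objectives can be sketched by listing the training examples each one would draw from the sentence above. This is a simplified illustration with a fixed window of 2 words on each side (an assumption made for readability); actual implementations add refinements such as subsampling and negative sampling.

```python
def context_windows(tokens, window=2):
    """Yield (center, context), where context holds up to `window` words on each side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = "i'm taking the dog out for a walk".split()

# CBOW: one example per position, (context words) -> center word
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: one example per (center word, context word) pair
sg_examples = [(center, c) for center, context in context_windows(tokens) for c in context]

print(cbow_examples[3])
# (['taking', 'the', 'out', 'for'], 'dog')
print([pair for pair in sg_examples if pair[0] == "dog"])
# [('dog', 'taking'), ('dog', 'the'), ('dog', 'out'), ('dog', 'for')]
```

CBOW produces a single prediction per position, while skip-gram produces one prediction per context word.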



   

## STEP 1: train a language model (word2vec)

Gensim provides one of the fastest [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) implementations.


### Train:

In [11]:
# if gensim not installed yet
# ! pip install gensim

In [4]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in data]


# All values below are gensim's defaults except sg (the default is sg=0, i.e. CBOW)
w2v = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,
                                min_count=5,
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### sg=1 trains a skip-gram model; set sg to 0 to train a cbow model
                                cbow_mean=1, epochs=5)

2024-02-29 19:12:06,348 : INFO : collecting all words and their counts
2024-02-29 19:12:06,348 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-29 19:12:06,849 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-29 19:12:07,282 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-29 19:12:07,502 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-29 19:12:07,502 : INFO : Creating a fresh vocabulary
2024-02-29 19:12:07,684 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-29T19:12:07.684063', 'gensim': '4.3.2', 'python': '3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'prepare_vocab'}
2024-02-29 19:12:07,684 : INFO : Word2Vec lif

In [3]:
len(text[1])

533

In [3]:
# Worth it to save the previous embedding
#w2v.save("W2v-movies.dat")
# You will be able to reload them:
w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and you can continue the learning process if needed

2024-02-29 13:36:19,380 : INFO : loading Word2Vec object from W2v-movies.dat
2024-02-29 13:36:19,434 : INFO : loading wv recursively from W2v-movies.dat.wv.* with mmap=None
2024-02-29 13:36:19,435 : INFO : setting ignored attribute cum_table to None
2024-02-29 13:36:19,705 : INFO : Word2Vec lifecycle event {'fname': 'W2v-movies.dat', 'datetime': '2024-02-29T13:36:19.705738', 'gensim': '4.3.2', 'python': '3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'loaded'}


## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector for "great" should be closer to the vector for "good" than to the one for "bad". [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the measure generally used to compare two vectors.
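
As a reminder of what this measure computes, here is a minimal sketch on hand-made 3-dimensional vectors. The values are invented for illustration and are not taken from any trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by the product of norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors
great = np.array([0.9, 0.8, 0.1])
good = np.array([0.8, 0.9, 0.2])
bad = np.array([-0.7, 0.2, 0.9])

print("great and good:", cosine_similarity(great, good))
print("great and bad:", cosine_similarity(great, bad))
```

The value lies between -1 and 1 and does not depend on the norms of the vectors, only on their directions.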

KeyedVectors has a built-in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method that computes the cosine similarity between two words.

In [5]:
# is great really closer to good than to bad ?
print("great and good:",w2v.wv.similarity("great","good"))
print("great and bad:",w2v.wv.similarity("great","bad"))
print("king and man:",w2v.wv.similarity("king","man"))

great and good: 0.7646114
great and bad: 0.48655832
king and man: 0.5934056


Since cosine similarity measures how close two words are, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` most similar words for a given query.

In [6]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
# w2v.wv.most_similar("movie",topn=5) # 5 most similar words
# w2v.wv.most_similar("awesome",topn=5)
w2v.wv.most_similar("actor",topn=5)

[('actor,', 0.8326389789581299),
 ('actor.', 0.7580764889717102),
 ('Reeves', 0.7533103227615356),
 ('actress', 0.7248632311820984),
 ('role,', 0.7208895087242126)]

But the query can also be more complicated: word embedding spaces tend to encode much more than plain similarity.

The most famous example is: `vec(king) - vec(man) + vec(woman) => vec(queen)`

In [6]:
# What is awesome - good + bad ?
#w2v.wv.most_similar(positive=["awesome","bad"],negative=["good"],topn=3)  

#w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # does the famous example work for actor?

w2v.wv.most_similar(positive=["Paris","France"],negative=["English"],topn=6) 

# Try other things, like plurals for example.

[('Mexico', 0.6771059036254883),
 ('downtown', 0.6716404557228088),
 ('Angeles', 0.6657335162162781),
 ('London', 0.6620723009109497),
 ('California,', 0.6614657640457153),
 ('Mexico,', 0.6600666642189026)]
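
To see what such a positive/negative query amounts to, here is a rough sketch on hand-made 2-dimensional vectors: add the normalized positive vectors, subtract the normalized negative ones, then rank the remaining words by cosine similarity. The vectors and the helper names are hypothetical, and this is only an approximation of the general idea, not a description of gensim's exact implementation.

```python
import numpy as np

# Hand-made vectors (hypothetical): axis 0 ~ "royalty", axis 1 ~ "gender"
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def unit(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

def most_similar(positive, negative, topn=3):
    """Rank the words not used in the query by cosine similarity to the query vector."""
    query = np.sum([unit(vectors[w]) for w in positive], axis=0)
    query = query - np.sum([unit(vectors[w]) for w in negative], axis=0)
    scores = {
        word: float(np.dot(unit(query), unit(vec)))
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

result = most_similar(positive=["king", "woman"], negative=["man"])
print(result[0][0])  # queen
```

Excluding the query words from the candidates matters: otherwise one of the inputs is often its own nearest neighbor.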

**To test learnt "syntactic" and "semantic" similarities, Mikolov et al. introduced a dedicated dataset containing a wide variety of three-way similarities (analogies of the form "a is to b as c is to ?").**

**You can download the dataset [here](https://thome.isir.upmc.fr/classes/RITAL/questions-words.txt).**
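
A sketch of how such a file can be read, assuming the usual layout of `questions-words.txt`: section headers start with `:` and every other line holds four space-separated words. The function name and the sample lines are only illustrative.

```python
def parse_analogies(lines):
    """Group 4-word analogy questions by section name."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # A new section starts, e.g. ": family"
            current = line[1:].strip()
            sections[current] = []
        elif current is not None:
            words = line.split()
            if len(words) == 4:
                sections[current].append(tuple(words))
    return sections

# A few sample lines in the assumed format
sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Berlin Germany
: family
boy girl brother sister
""".splitlines()

sections = parse_analogies(sample)
for name, questions in sections.items():
    print(name, len(questions))
```

For each quadruplet `a b c d`, the evaluation asks whether `d` is the word closest to `vec(b) - vec(a) + vec(c)`, and the accuracy is reported per section.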

In [7]:
out = w2v.wv.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2024-02-29 19:13:15,811 : INFO : Evaluating word analogies for top 300000 words in the model on ressources/questions-words.txt


2024-02-29 19:13:15,929 : INFO : capital-common-countries: 6.7% (6/90)
2024-02-29 19:13:16,009 : INFO : capital-world: 2.8% (2/71)
2024-02-29 19:13:16,044 : INFO : currency: 0.0% (0/28)
2024-02-29 19:13:16,393 : INFO : city-in-state: 0.0% (0/329)
2024-02-29 19:13:16,759 : INFO : family: 34.5% (118/342)
2024-02-29 19:13:17,711 : INFO : gram1-adjective-to-adverb: 1.3% (12/930)
2024-02-29 19:13:18,265 : INFO : gram2-opposite: 2.9% (16/552)
2024-02-29 19:13:19,435 : INFO : gram3-comparative: 20.2% (254/1260)
2024-02-29 19:13:20,060 : INFO : gram4-superlative: 7.3% (51/702)
2024-02-29 19:13:20,722 : INFO : gram5-present-participle: 18.8% (142/756)
2024-02-29 19:13:21,395 : INFO : gram6-nationality-adjective: 2.8% (22/792)
2024-02-29 19:13:22,350 : INFO : gram7-past-tense: 16.7% (211/1260)
2024-02-29 19:13:22,915 : INFO : gram8-plural: 4.2% (34/812)
2024-02-29 19:13:23,516 : INFO : gram9-plural-verbs: 24.9% (188/756)
2024-02-29 19:13:23,516 : INFO : Quadruplets with out-of-vocabulary words: 

**Because the w2v model was trained only on the review dataset, which is a small amount of data, it does not perform very well on this benchmark.**


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, they key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.
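
The mapping described in the quote can be pictured with a tiny stand-in class. `ToyKeyedVectors` is hypothetical and not part of gensim; the attribute names (`key_to_index`, `index_to_key`, `vectors`, `vector_size`) were chosen to echo those exposed by gensim 4.x.

```python
import numpy as np

class ToyKeyedVectors:
    """Hypothetical, minimal stand-in for a {str -> 1D numpy array} mapping."""

    def __init__(self, keys, vectors):
        self.index_to_key = list(keys)                       # row index -> string id
        self.key_to_index = {k: i for i, k in enumerate(self.index_to_key)}
        self.vectors = np.asarray(vectors, dtype=float)      # one row per key
        self.vector_size = self.vectors.shape[1]

    def __contains__(self, key):
        return key in self.key_to_index

    def __getitem__(self, key):
        return self.vectors[self.key_to_index[key]]

kv = ToyKeyedVectors(["good", "bad"], [[0.1, 0.9], [0.8, -0.2]])
print("good" in kv, "great" in kv)  # True False
print(kv["bad"], kv.vector_size)
```

Storing all vectors in one matrix keeps lookups cheap and lets similarity queries run as a single matrix product.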

**You can download the pre-trained word embedding [HERE](https://thome.isir.upmc.fr/classes/RITAL/word2vec-google-news-300.dat) .**

In [8]:
#from gensim.test.utils import get_tmpfile
import gensim.downloader as api
bload = True  # True: load the local copy; False: download it with gensim, then save it locally
fname = "word2vec-google-news-300"
sdir = "word2vec-google-news-300/" # Change to your own directory

if bload:
    wv_pre_trained = gensim.models.KeyedVectors.load(sdir+fname+".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")
    
    

2024-02-29 19:13:23,544 : INFO : loading KeyedVectors object from word2vec-google-news-300/word2vec-google-news-300.dat
2024-02-29 19:13:24,689 : INFO : loading vectors from word2vec-google-news-300/word2vec-google-news-300.dat.vectors.npy with mmap=None
2024-02-29 19:13:29,017 : INFO : KeyedVectors lifecycle event {'fname': 'word2vec-google-news-300/word2vec-google-news-300.dat', 'datetime': '2024-02-29T19:13:29.017946', 'gensim': '4.3.2', 'python': '3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'loaded'}


**Perform the "syntactic" and "semantic" evaluations again. What can you conclude about the pre-trained embeddings?**

## STEP 4: Sentiment classification

In the previous practical session, we used a bag-of-words approach to transform text into vectors.
Here, we propose to use word vectors instead (previously learnt or loaded).


### <font color='green'> Since we only have word vectors and sentences are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Min/feature
- Max/feature

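#### A few pointers:

- `w2v.wv.key_to_index` is a `dict` mapping every word of the vocabulary (all existing words in your model) to its index; it replaces `w2v.wv.vocab`, which no longer exists in gensim 4.x
- `np.minimum(a,b)` and `np.maximum(a,b)` respectively return the element-wise min and max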
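
The four aggregations listed above can be sketched on three toy word vectors (values invented for illustration). Min and max are taken per feature, which is what chaining `np.minimum` / `np.maximum` over the words achieves.

```python
import numpy as np
from functools import reduce

# Three hypothetical word vectors of dimension 3
word_vectors = [
    np.array([1.0, -2.0, 0.5]),
    np.array([0.0, 4.0, -1.5]),
    np.array([2.0, 1.0, 1.0]),
]

stacked = np.stack(word_vectors)             # shape (n_words, dim)
agg_sum = stacked.sum(axis=0)                # [3.0, 3.0, 0.0]
agg_mean = stacked.mean(axis=0)              # [1.0, 1.0, 0.0]
agg_min = reduce(np.minimum, word_vectors)   # [0.0, -2.0, -1.5]
agg_max = reduce(np.maximum, word_vectors)   # [2.0, 4.0, 1.0]

print(agg_sum, agg_mean, agg_min, agg_max)
```

`stacked.min(axis=0)` and `stacked.max(axis=0)` give the same results as the `reduce` calls; the key point is that the aggregation runs over the words (axis 0), never over the features.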

In [9]:
import numpy as np
from sklearn.model_selection  import train_test_split
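# We first need to vectorize the text:
# as a first approach, we sum the word vectors of each review

def randomvec(val):
    """Random unit vector of size `val`, used for out-of-vocabulary words."""
    default = np.random.randn(val)
    default = default  / np.linalg.norm(default)
    return default

def vectorize(text,mean=False):
    """
    This function vectorizes one review.

    input: list of str (the tokens of the review)
    output: np.array(float)
    """
    vec = list()
    for word in text:
        if not (word in w2v.wv): 
            vec.append(randomvec(100))
        else:
            vec.append(w2v.wv[word])

    # average the word vectors if requested, otherwise sum them
    return np.mean(vec, axis=0) if mean else np.sum(vec, axis=0)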

lab = [l for t,l in data]  
train, test,y_train, y_test = train_test_split(text,lab,test_size=0.2,random_state=42)

#classes = [pol for text,pol in train]
X = [vectorize(text) for text in train]
X_test = [vectorize(text) for text in test]
#true = [pol for text,pol in test]

#let's see what a review vector looks like.
print(X[0])

[  -7.57293123   28.71725747  -14.85025887   -7.28830908    8.46466901
 -108.42054234    1.14369239  176.05374151  -70.98090958  -76.57534255
    1.40000819 -106.06666902   -1.98149369   71.09030095    7.09264713
  -36.0693451    11.67508647  -62.26880306  -17.52540909 -210.37950162
   48.22400982    3.73759637  172.64392945  -97.07995029  -27.64271897
   69.14704071  -53.25381788   35.61033983 -102.77474912   83.73235576
   72.59939349   27.34860033   31.01915031 -131.62599325  -34.11725628
   68.10432259   27.23754314  -89.81583224  -53.09578426 -164.65305753
    5.35532321 -148.81550646  -56.12899937   41.6580968   102.65842123
  -34.24901678  -71.58212442    5.3329379     1.81644962   13.064842
   66.78748237  -92.31527053   48.49156979  -29.12656435  -48.62162613
    9.35690386  -18.06858999   34.7728236   -39.10262153   55.61303918
   43.64949047  -26.43843784   23.16181149   50.64614199  -91.63264633
  129.96082067   -6.19493017   74.68392716  -88.19130325   82.28000903
   -8.75

In [10]:

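def vectorize(text,mean_b=False,sum_b = True,max_b=False, min_b=False):
    """Same as before, but with the pre-trained vectors (300 dimensions)."""
    vec = list()
    for word in text:
        if not (word in wv_pre_trained): 
            vec.append(randomvec(300))
        else:
            vec.append(wv_pre_trained[word])
    # The first flag set to True wins (pass sum_b=False to reach max/min);
    # axis=0 aggregates feature-wise over the words of the review
    if mean_b:        
        return np.mean(vec, axis=0)
    if sum_b:        
        return np.sum(vec, axis=0)
    if max_b:        
        return np.max(vec, axis=0)
    else: 
        return np.min(vec, axis=0)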

lab = [l for t,l in data]  
train, test,y_train, y_test = train_test_split(text,lab,test_size=0.2,random_state=42)

#classes = [pol for text,pol in train]
X = [vectorize(text) for text in train]
X_test = [vectorize(text) for text in test]
#true = [pol for text,pol in test]

#let's see what a review vector looks like.
print(X[0])

[ 1.95050251e+01  1.54200372e+01  1.23154661e+01  3.10836614e+01
 -2.26996226e+01 -2.49738134e+00  7.86992791e+00 -2.29712747e+01
  2.74694308e+01  2.77216386e+01 -1.05731094e+01 -3.87175350e+01
 -2.71817231e+00  1.58782151e+01 -3.57856571e+01  1.48034387e+01
  1.29298093e+01  3.40879125e+01 -1.14338176e+00 -1.44684337e+01
 -2.74474629e+00  1.93755284e+01  6.59190603e+00 -1.90031588e+00
  1.94091390e+01 -1.17171934e+01 -2.68422124e+01  2.32434216e+01
  1.21691816e+01  5.39238139e+00 -6.91356235e+00 -2.01518819e+00
 -1.23126532e+01  4.24769607e+00  5.08062781e+00 -3.09706575e+00
  6.95105550e+00  4.74856222e-01  1.31316447e+01  2.41026764e+01
  3.19635396e+01 -2.15723952e+01  3.40254192e+01 -4.24116698e+00
 -1.39004307e+00 -2.94744783e+00 -1.31642977e+01  4.80535654e-01
  1.99618221e+01  9.01170093e+00 -6.65149044e+00  1.58836373e+01
 -5.84438930e+00 -1.13945938e+01  2.36811412e+00  7.60873540e+00
 -3.52722560e+00 -1.99061144e+01  1.08302062e+01 -1.23251594e+01
 -8.17759768e+00  3.38086

### (2) Train a classifier 
As in the previous practical session, train a logistic regression to perform sentiment classification with word vectors.



In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# Scikit Logistic Regression
lr = LogisticRegression()
lr.fit(X,y_train)
y_pred = lr.predict(X_test)
print("acc:",accuracy_score(y_test,y_pred))

acc: 0.8346


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Performance should be worse than with bag of words (~80%). Sum/mean aggregation does not work well on long reviews, especially those containing many frequent words, because these words add a lot of noise.
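
The noise effect described above can be illustrated with a toy computation: one informative word vector is averaged with an increasing number of copies of a frequent, uninformative word vector. Both vectors are hypothetical and chosen orthogonal to keep the arithmetic simple.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sentiment_word = np.array([1.0, 0.0])  # hypothetical vector of an informative word
frequent_word = np.array([0.0, 1.0])   # hypothetical vector of a frequent, uninformative word

similarities = []
for n_frequent in [0, 1, 5, 20]:
    review = [sentiment_word] + [frequent_word] * n_frequent
    mean_vec = np.mean(review, axis=0)
    sim = cosine(mean_vec, sentiment_word)
    similarities.append(sim)
    print(n_frequent, round(sim, 3))
```

In this toy setting the similarity between the review vector and the informative word is $1/\sqrt{1+n^2}$ for $n$ frequent words, so the signal fades quickly as the review gets longer.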

## **Todo**: Try answering the following questions:

- Which word2vec model works best: skip-gram or CBOW?
- Do pre-trained vectors work better than those learnt on the training dataset?


In [18]:
import gensim
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Train CBOW model
cbow_model = gensim.models.Word2Vec(sentences=text,
                                    vector_size=100, window=5,
                                    min_count=5, sample=0.001,
                                    workers=3, sg=0, hs=0, negative=5,
                                    cbow_mean=1, epochs=5)

# Train Skip-gram model
sg_model = gensim.models.Word2Vec(sentences=text,
                                   vector_size=100, window=5,
                                   min_count=5, sample=0.001,
                                   workers=3, sg=1, hs=0, negative=5,
                                   cbow_mean=1, epochs=5)

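# Define vectorize function to accept the KeyedVectors of a Word2Vec model directly
def vectorize(text, wv_model, aggregation='mean'):
    # out-of-vocabulary words get a random vector with the same dimension as the model
    vec = [wv_model[word] if word in wv_model else np.random.rand(wv_model.vector_size) for word in text]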
    if aggregation == 'mean':
        return np.mean(vec, axis=0)
    elif aggregation == 'sum':
        return np.sum(vec, axis=0)
    elif aggregation == 'max':
        return np.max(vec, axis=0)
    elif aggregation == 'min':
        return np.min(vec, axis=0)
    else:
        raise ValueError("Invalid aggregation method")

# List of aggregation methods
aggregation_methods = ['mean', 'sum', 'max', 'min']

for method in aggregation_methods:
    # Vectorize training and test datasets using CBOW model
    X_cbow_train = np.array([vectorize(text, cbow_model.wv, method) for text in train])
    X_cbow_test = np.array([vectorize(text, cbow_model.wv, method) for text in test])

    # Vectorize training and test datasets using Skip-gram model
    X_sg_train = np.array([vectorize(text, sg_model.wv, method) for text in train])
    X_sg_test = np.array([vectorize(text, sg_model.wv, method) for text in test])

    # Train logistic regression models
    lr_cbow = LogisticRegression()
    lr_cbow.fit(X_cbow_train, y_train)

    lr_sg = LogisticRegression()
    lr_sg.fit(X_sg_train, y_train)

    # Evaluate accuracy of logistic regression models
    y_pred_cbow = lr_cbow.predict(X_cbow_test)
    accuracy_cbow = accuracy_score(y_test, y_pred_cbow)

    y_pred_sg = lr_sg.predict(X_sg_test)
    accuracy_sg = accuracy_score(y_test, y_pred_sg)

    print(f"Aggregation method: {method}")
    print("Accuracy of CBOW model:", accuracy_cbow)
    print("Accuracy of Skip-gram model:", accuracy_sg)
    print()


2024-02-29 19:23:20,360 : INFO : collecting all words and their counts
2024-02-29 19:23:20,362 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2024-02-29 19:23:20,754 : INFO : PROGRESS: at sentence #10000, processed 2301366 words, keeping 153853 word types
2024-02-29 19:23:21,178 : INFO : PROGRESS: at sentence #20000, processed 4553558 words, keeping 240043 word types
2024-02-29 19:23:21,396 : INFO : collected 276678 word types from a corpus of 5713167 raw words and 25000 sentences
2024-02-29 19:23:21,396 : INFO : Creating a fresh vocabulary
2024-02-29 19:23:21,683 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 48208 unique words (17.42% of original 276678, drops 228470)', 'datetime': '2024-02-29T19:23:21.683804', 'gensim': '4.3.2', 'python': '3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22631-SP0', 'event': 'prepare_vocab'}
2024-02-29 19:23:21,683 : INFO : Word2Vec lif

Aggregation method: mean
Accuracy of CBOW model: 0.7754
Accuracy of Skip-gram model: 0.8206





Aggregation method: sum
Accuracy of CBOW model: 0.772
Accuracy of Skip-gram model: 0.8268





Aggregation method: max
Accuracy of CBOW model: 0.676
Accuracy of Skip-gram model: 0.5848

Aggregation method: min
Accuracy of CBOW model: 0.6464
Accuracy of Skip-gram model: 0.6954



