# NLP & representation learning: Neural Embeddings, Text Classification


To use statistical classifiers with text, it is first necessary to vectorize the text. In the first practical session we explored the **Bag of Word (BoW)** model. 

Modern **state of the art** methods uses  embeddings to vectorize the text before classification in order to avoid feature engineering.

## [Dataset](https://thome.isir.upmc.fr/classes/RITAL/json_pol)


## "Modern" NLP pipeline

By opposition to the **bag of word** model, in the modern NLP pipeline everything is **embeddings**. Instead of encoding a text as a **sparse vector** of length $D$ (size of feature dictionnary) the goal is to encode the text in a meaningful dense vector of a small size $|e| <<< |D|$. 


The raw classification pipeline is then the following:

```
raw text ---|embedding table|-->  vectors --|Neural Net|--> class 
```


### Using a  language model:

How to tokenize the text and extract a feature dictionnary is still a manual task. To directly have meaningful embeddings, it is common to use a pre-trained language model such as `word2vec` which we explore in this practical.

In this setting, the pipeline becomes the following:
```
      
raw text ---|(pre-trained) Language Model|--> vectors --|classifier (or fine-tuning)|--> class 
```


- #### Classic word embeddings

 - [Word2Vec](https://arxiv.org/abs/1301.3781)
 - [Glove](https://nlp.stanford.edu/projects/glove/)


- #### bleeding edge language models techniques (see next)

 - [UMLFIT](https://arxiv.org/abs/1801.06146)
 - [ELMO](https://arxiv.org/abs/1802.05365)
 - [GPT](https://blog.openai.com/language-unsupervised/)
 - [BERT](https://arxiv.org/abs/1810.04805)






### Goal of this session:

1. Train word embeddings on training dataset
2. Tinker with the learnt embeddings and see learnt relations
3. Tinker with pre-trained embeddings.
4. Use those embeddings for classification
5. Compare different embedding models
6. Pytorch first look: learn to generate text.

## STEP 0: Loading data 

In [1]:
import json
from collections import Counter

# Loading json
with open("ressources/json_pol.txt",encoding="utf-8") as f:
    data = f.readlines()
    json_data = json.loads(data[0])
    train = json_data["train"]
    test = json_data["test"]
    

# Quick Check
counter_train = Counter((x[1] for x in train))
counter_test = Counter((x[1] for x in test))
print("Number of train reviews : ", len(train))
print("----> # of positive : ", counter_train[1])
print("----> # of negative : ", counter_train[0])
print("")
print(train[0])
print("")
print("Number of test reviews : ",len(test))
print("----> # of positive : ", counter_test[1])
print("----> # of negative : ", counter_test[0])

print("")
print(test[0])
print("")

Number of train reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

["The undoubted highlight of this movie is Peter O'Toole's performance. In turn wildly comical and terribly terribly tragic. Does anybody do it better than O'Toole? I don't think so. What a great face that man has!<br /><br />The story is an odd one and quite disturbing and emotionally intense in parts (especially toward the end) but it is also oddly touching and does succeed on many levels. However, I felt the film basically revolved around Peter O'Toole's luminous performance and I'm sure I wouldn't have enjoyed it even half as much if he hadn't been in it.", 1]

Number of test reviews :  25000
----> # of positive :  12500
----> # of negative :  12500

['Although credit should have been given to Dr. Seuess for stealing the story-line of "Horton Hatches The Egg", this was a fine film. It touched both the emotions and the intellect. Due especially to the incredible performance of seven year old 

## Word2Vec: Quick Recap

**[Word2Vec](https://arxiv.org/abs/1301.3781) is composed of two distinct language models (CBOW and SG), optimized to quickly learn word vectors**


given a random text: `i'm taking the dog out for a walk`



### (a) Continuous Bag of Word (CBOW)
    -  predicts a word given a context
    
maximizing `p(dog | i'm taking the ___ out for a walk)`
    
### (b) Skip-Gram (SG)               
    -  predicts a context given a word
    
 maximizing `p(i'm taking the out for a walk | dog)`



   

## STEP 1: train a language model (word2vec)

Gensim has one of [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) fastest implementation.


### Train:

In [2]:
# if gensim not installed yet
# ! pip install gensim

## Notes:
- Word2vec est un algorithme d'entrainement auto supervisé : on apprend à faire de la classif en supprimant des mots du texte.
- Modèles pré-entrainé sur un grand nombre de documents.
- Similarité cosine : produit scalaire normalisé.

In [12]:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

text = [t.split() for t,p in train]

# the following configuration is the default configuration
w2v_cbow = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,               ### here we train a cbow model 
                                min_count=5,                      
                                sample=0.001, workers=3,
                                sg=0, hs=0, negative=5,        ### set sg to 1 to train a sg model
                                cbow_mean=1, epochs=5)

w2v_sg = gensim.models.word2vec.Word2Vec(sentences=text,
                                vector_size=100, window=5,               ### here we train a cbow model 
                                min_count=5,                      
                                sample=0.001, workers=3,
                                sg=1, hs=0, negative=5,        ### set sg to 1 to train a sg model
                                cbow_mean=1, epochs=5)

2023-02-16 14:07:57,753 : INFO : collecting all words and their counts
2023-02-16 14:07:57,754 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-02-16 14:07:58,790 : INFO : PROGRESS: at sentence #10000, processed 2358544 words, keeping 155393 word types
2023-02-16 14:07:59,857 : INFO : PROGRESS: at sentence #20000, processed 4675912 words, keeping 243050 word types
2023-02-16 14:08:00,399 : INFO : collected 280617 word types from a corpus of 5844680 raw words and 25000 sentences
2023-02-16 14:08:00,399 : INFO : Creating a fresh vocabulary
2023-02-16 14:08:00,989 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 49345 unique words (17.58% of original 280617, drops 231272)', 'datetime': '2023-02-16T14:08:00.989523', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'prepare_vocab'}
2023-02-16 14:08:00,990 : INFO : Word2Vec l

2023-02-16 14:08:40,870 : INFO : PROGRESS: at sentence #20000, processed 4675912 words, keeping 243050 word types
2023-02-16 14:08:41,390 : INFO : collected 280617 word types from a corpus of 5844680 raw words and 25000 sentences
2023-02-16 14:08:41,390 : INFO : Creating a fresh vocabulary
2023-02-16 14:08:41,980 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 49345 unique words (17.58% of original 280617, drops 231272)', 'datetime': '2023-02-16T14:08:41.980534', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'prepare_vocab'}
2023-02-16 14:08:41,988 : INFO : Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 5517507 word corpus (94.40% of original 5844680, drops 327173)', 'datetime': '2023-02-16T14:08:41.988218', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform':

2023-02-16 14:09:36,305 : INFO : EPOCH 1 - PROGRESS: at 89.06% examples, 153981 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:09:37,360 : INFO : EPOCH 1 - PROGRESS: at 92.69% examples, 153861 words/s, in_qsize 6, out_qsize 0
2023-02-16 14:09:38,375 : INFO : EPOCH 1 - PROGRESS: at 96.28% examples, 153692 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:09:39,329 : INFO : EPOCH 1: training on 5844680 raw words (4269381 effective words) took 27.7s, 153979 effective words/s
2023-02-16 14:09:40,336 : INFO : EPOCH 2 - PROGRESS: at 3.37% examples, 149315 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:09:41,430 : INFO : EPOCH 2 - PROGRESS: at 6.94% examples, 146417 words/s, in_qsize 6, out_qsize 0
2023-02-16 14:09:42,470 : INFO : EPOCH 2 - PROGRESS: at 10.68% examples, 148052 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:09:43,500 : INFO : EPOCH 2 - PROGRESS: at 14.56% examples, 150832 words/s, in_qsize 6, out_qsize 0
2023-02-16 14:09:44,529 : INFO : EPOCH 2 - PROGRESS: at 18.09% examples, 149

2023-02-16 14:10:50,042 : INFO : EPOCH 4 - PROGRESS: at 55.65% examples, 153096 words/s, in_qsize 6, out_qsize 0
2023-02-16 14:10:51,059 : INFO : EPOCH 4 - PROGRESS: at 59.40% examples, 153284 words/s, in_qsize 6, out_qsize 0
2023-02-16 14:10:52,072 : INFO : EPOCH 4 - PROGRESS: at 62.87% examples, 153335 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:10:53,122 : INFO : EPOCH 4 - PROGRESS: at 66.85% examples, 153228 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:10:54,122 : INFO : EPOCH 4 - PROGRESS: at 70.64% examples, 153471 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:10:55,182 : INFO : EPOCH 4 - PROGRESS: at 74.48% examples, 153270 words/s, in_qsize 6, out_qsize 0
2023-02-16 14:10:56,201 : INFO : EPOCH 4 - PROGRESS: at 78.10% examples, 153123 words/s, in_qsize 5, out_qsize 0
2023-02-16 14:10:57,260 : INFO : EPOCH 4 - PROGRESS: at 81.94% examples, 153018 words/s, in_qsize 4, out_qsize 1
2023-02-16 14:10:58,282 : INFO : EPOCH 4 - PROGRESS: at 85.65% examples, 153416 words/s, in_qsiz

In [13]:
# Worth it to save the previous embedding
w2v.save("W2v_cbow-movies.dat")
w2v.save("W2v_sg-movies.dat")
# You will be able to reload them:
# w2v = gensim.models.Word2Vec.load("W2v-movies.dat")
# and you can continue the learning process if needed

2023-02-16 14:11:24,833 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v_cbow-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2023-02-16T14:11:24.833059', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'saving'}
2023-02-16 14:11:24,836 : INFO : not storing attribute cum_table
2023-02-16 14:11:24,924 : INFO : saved W2v_cbow-movies.dat
2023-02-16 14:11:24,925 : INFO : Word2Vec lifecycle event {'fname_or_handle': 'W2v_sg-movies.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2023-02-16T14:11:24.925150', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'saving'}
2023-02-16 14:11:24,926 : INFO : not storing attribute cum_table
2023-02-16 14:11:25,012 : INFO : saved W2v_sg

## STEP 2: Test learnt embeddings

The word embedding space directly encodes similarities between words: the vector coding for the word "great" will be closer to the vector coding for "good" than to the one coding for "bad". Generally, [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is the distance used when considering distance between vectors.

KeyedVectors have a built in [similarity](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.similarity) method to compute the cosine similarity between words

In [14]:
# is great really closer to good than to bad ?
print("great and good:",w2v_cbow.wv.similarity("great","good"))
print("great and bad:",w2v_sg.wv.similarity("great","bad"))

great and good: 0.79202557
great and bad: 0.4801622


Since cosine distance encodes similarity, neighboring words are supposed to be similar. The [most_similar](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.BaseKeyedVectors.most_similar) method returns the `topn` words given a query.

In [7]:
# The query can be as simple as a word, such as "movie"

# Try changing the word
w2v.wv.most_similar("movie",topn=5) # 5 most similar words
#w2v.wv.most_similar("awesome",topn=5)
#w2v.wv.most_similar("actor",topn=5)

[('film', 0.9370229244232178),
 ('movie,', 0.8480282425880432),
 ('film,', 0.7691652774810791),
 ('flick', 0.7553712129592896),
 ('picture', 0.7163830995559692)]

But it can be a more complicated query
Word embedding spaces tend to encode much more.

The most famous exemple is: `vec(king) - vec(man) + vec(woman) => vec(queen)`

In [8]:
# What is awesome - good + bad ?
# w2v.wv.most_similar(positive=["awesome","bad"],negative=["good"],topn=3)  

# w2v.wv.most_similar(positive=["actor","woman"],negative=["man"],topn=3) # do the famous exemple works for actor ?


# Try other things like plurals for exemple.
w2v.wv.most_similar(positive=["angel","angels"],negative=["car"],topn=3) # do the famous exemple works for actor ?

[('author,', 0.795407235622406),
 ('forceful', 0.7824472784996033),
 ('internationally', 0.7710718512535095)]

To test learnt "synctactic" and "semantic" similarities, Mikolov et al. introduced a special dataset containing a wide variety of three way similarities.

In [15]:
out = w2v_cbow.wv.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2023-02-16 14:11:49,823 : INFO : Evaluating word analogies for top 300000 words in the model on ressources/questions-words.txt
2023-02-16 14:11:50,298 : INFO : capital-common-countries: 1.9% (3/156)
2023-02-16 14:11:50,648 : INFO : capital-world: 0.0% (0/111)
2023-02-16 14:11:50,710 : INFO : currency: 0.0% (0/18)
2023-02-16 14:11:51,619 : INFO : city-in-state: 0.0% (0/301)
2023-02-16 14:11:52,936 : INFO : family: 28.6% (120/420)
2023-02-16 14:11:55,559 : INFO : gram1-adjective-to-adverb: 0.9% (8/870)
2023-02-16 14:11:57,231 : INFO : gram2-opposite: 2.7% (15/552)
2023-02-16 14:12:00,823 : INFO : gram3-comparative: 22.4% (266/1190)
2023-02-16 14:12:03,047 : INFO : gram4-superlative: 6.0% (45/756)
2023-02-16 14:12:05,457 : INFO : gram5-present-participle: 15.0% (122/812)
2023-02-16 14:12:08,214 : INFO : gram6-nationality-adjective: 1.2% (12/967)
2023-02-16 14:12:11,768 : INFO : gram7-past-tense: 16.7% (210/1260)
2023-02-16 14:12:14,038 : INFO : gram8-plural: 6.3% (51/812)
2023-02-16 14:12

In [16]:
out = w2v_sg.wv.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2023-02-16 14:12:38,821 : INFO : Evaluating word analogies for top 300000 words in the model on ressources/questions-words.txt
2023-02-16 14:12:39,270 : INFO : capital-common-countries: 2.6% (4/156)
2023-02-16 14:12:39,637 : INFO : capital-world: 1.8% (2/111)
2023-02-16 14:12:39,707 : INFO : currency: 0.0% (0/18)
2023-02-16 14:12:40,688 : INFO : city-in-state: 0.0% (0/301)
2023-02-16 14:12:41,869 : INFO : family: 34.3% (144/420)
2023-02-16 14:12:44,138 : INFO : gram1-adjective-to-adverb: 1.7% (15/870)
2023-02-16 14:12:45,699 : INFO : gram2-opposite: 3.1% (17/552)
2023-02-16 14:12:49,246 : INFO : gram3-comparative: 18.6% (221/1190)
2023-02-16 14:12:51,548 : INFO : gram4-superlative: 9.9% (75/756)
2023-02-16 14:12:53,975 : INFO : gram5-present-participle: 20.9% (170/812)
2023-02-16 14:12:56,738 : INFO : gram6-nationality-adjective: 1.4% (14/967)
2023-02-16 14:13:00,303 : INFO : gram7-past-tense: 21.2% (267/1260)
2023-02-16 14:13:02,493 : INFO : gram8-plural: 6.4% (52/812)
2023-02-16 14:1

**When training the w2v models on the review dataset, since it hasn't been learnt with a lot of data (and thus it hasen't seen a lot of words), it does not perform very well.**


## STEP 3: Loading a pre-trained model

In Gensim, embeddings are loaded and can be used via the ["KeyedVectors"](https://radimrehurek.com/gensim/models/keyedvectors.html) class

> Since trained word vectors are independent from the way they were trained (Word2Vec, FastText, WordRank, VarEmbed etc), they can be represented by a standalone structure, as implemented in this module.

>The structure is called “KeyedVectors” and is essentially a mapping between entities and vectors. Each entity is identified by its string id, so this is a mapping between {str => 1D numpy array}.

>The entity typically corresponds to a word (so the mapping maps words to 1D vectors), but for some models, they key can also correspond to a document, a graph node etc. To generalize over different use-cases, this module calls the keys entities. Each entity is always represented by its string id, no matter whether the entity is a word, a document or a graph node.

In [20]:
#from gensim.test.utils import get_tmpfile
import gensim.downloader as api
from gensim.models import KeyedVectors


bload = False
# A word2vec trained on google news and of dimension 300
fname = "word2vec-google-news-300"
sdir = "ressources/" # Change

if(bload==True):
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:    
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")

2023-02-16 14:16:45,195 : INFO : loading projection weights from C:\Users\PC/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz
2023-02-16 14:18:35,618 : INFO : KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from C:\\Users\\PC/gensim-data\\word2vec-google-news-300\\word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2023-02-16T14:18:35.618111', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'load_word2vec_format'}
2023-02-16 14:18:35,618 : INFO : KeyedVectors lifecycle event {'fname_or_handle': 'ressources/word2vec-google-news-300.dat', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2023-02-16T14:18:35.618111', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 

**Perform the "synctactic" and "semantic" evaluations again. Conclude on the pre-trained embeddings.**

In [21]:
out = wv_pre_trained.evaluate_word_analogies("ressources/questions-words.txt",case_insensitive=True)  #original semantic syntactic dataset.

2023-02-16 14:20:09,547 : INFO : Evaluating word analogies for top 300000 words in the model on ressources/questions-words.txt
2023-02-16 14:20:31,949 : INFO : capital-common-countries: 83.2% (421/506)
2023-02-16 14:22:53,409 : INFO : capital-world: 81.3% (3552/4368)
2023-02-16 14:23:19,176 : INFO : currency: 28.5% (230/808)


KeyboardInterrupt: 

## STEP 4:  sentiment classification

In the previous practical session, we used a bag of word approach to transform text into vectors.
Here, we propose to try to use word vectors (previously learnt or loaded).


### <font color='green'> Since we have only word vectors and that sentences are made of multiple words, we need to aggregate them. </font>


### (1) Vectorize reviews using word vectors:

Word aggregation can be done in different ways:

- Sum
- Average
- Min/feature
- Max/feature

#### a few pointers:

- `w2v.wv.vocab` is a `set()` of the vocabulary (all existing words in your model)
- `np.minimum(a,b) and np.maximum(a,b)` respectively return element-wise min/max 

Une des différences avec les transformers c'est l'approche attentionnelle

- Un token cls
- Un pooling plus fin
- Moyenne pondérée avec tout les mots de la phrase

In [60]:
# wv_pre_trained.key_to_index

In [69]:
# wv_pre_trained.vectors
# wv_pre_trained.get_vector('with')
# wv_pre_trained['with']

In [61]:
import numpy as np

def vectorize(words, model, method='sum'):
    vectors = []
    for word in words:
        try:
            vectors.append(model[word])
        except KeyError:
            continue
    
    if not vectors:
        return np.zeros(model.vector_size)
    
    if method == 'sum':
        return np.sum(vectors, axis=0)
    
    elif method == 'average':
        return np.mean(vectors, axis=0)
    
    elif method == 'min':
        return np.min(vectors, axis=0)
    
    elif method == 'max':
        return np.max(vectors, axis=0)
    
    else:
        raise ValueError("Invalid method, choose one of 'sum', 'average', 'min', 'max'")

# ### TEST   
# classes = [pol for text,pol in train]
# X = [vectorize(text, wv_pre_trained) for text,pol in train]
# X_test = [vectorize(text, wv_pre_trained) for text,pol in test]
# true = [pol for text,pol in test]

# #let's see what a review vector looks like.
# print(X[0])

### (2) Train a classifier 
as in the previous practical session, train a logistic regression to do sentiment classification with word vectors



### Training a classifier using a trained model on the dataset

In [32]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Load pre-trained word embeddings
word_vectors = w2v_cbow.wv # your pre-trained word embedding model

# Load your sentiment data
y_train = [c for r,c in train]
y_test  = [c for r,c in test ]

methods = ['sum', 'average', 'min', 'max']

for m in tqdm(methods):
    # Vectorize the input data
    X_train = [vectorize(text, word_vectors, method=m) for text,pol in train]
    X_test  = [vectorize(text, word_vectors, method=m) for text,pol in test]
    
    # Train a logistic regression model
    clf = LogisticRegression(max_iter=500)
    clf.fit(X_train, y_train)

    # Evaluate the model on the testing set
    accuracy = clf.score(X_test, y_test)
    
    # Results
    print('Method :', m)
    print("Test accuracy:", accuracy)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
 25%|████████████████████▊                                                              | 1/4 [03:54<11:42, 234.27s/it]

Method : sum
Test accuracy: 0.64572


 50%|█████████████████████████████████████████▌                                         | 2/4 [07:50<07:50, 235.47s/it]

Method : average
Test accuracy: 0.594


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
 75%|██████████████████████████████████████████████████████████████▎                    | 3/4 [11:59<04:01, 241.47s/it]

Method : min
Test accuracy: 0.58904


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [16:01<00:00, 240.49s/it]

Method : max
Test accuracy: 0.59232





### Training a classifier using the pre-trained model

In [31]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import time

# Load pre-trained word embeddings
word_vectors = wv_pre_trained # your pre-trained word embedding model

# Load your sentiment data
y_train = [c for r,c in train]
y_test  = [c for r,c in test ]

methods = ['sum', 'average', 'min', 'max']
    
for m in tqdm(methods):
    # Vectorize the input data
    X_train = [vectorize(text, wv_pre_trained, method=m) for text,pol in train]
    X_test  = [vectorize(text, wv_pre_trained, method=m) for text,pol in test]
    
    # Train a logistic regression model
    tic = time.time()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # Evaluate the model on the testing set
    accuracy = clf.score(X_test, y_test)
    
    # Results
    print('Exec time : ', time.time()-tic)
    print('Method :', m)
    print("Test accuracy:", accuracy)
    print('---')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
 25%|████████████████████▊                                                              | 1/4 [03:17<09:53, 197.76s/it]

Exec time :  13.9268639087677
Method : sum
Test accuracy: 0.58756
---


 50%|█████████████████████████████████████████▌                                         | 2/4 [06:19<06:16, 188.49s/it]

Exec time :  2.244309663772583
Method : average
Test accuracy: 0.58528
---


 75%|██████████████████████████████████████████████████████████████▎                    | 3/4 [09:18<03:04, 184.04s/it]

Exec time :  3.518094539642334
Method : min
Test accuracy: 0.57396
---


100%|███████████████████████████████████████████████████████████████████████████████████| 4/4 [12:19<00:00, 184.81s/it]

Exec time :  4.330111742019653
Method : max
Test accuracy: 0.57848
---





performance should be worst than with bag of word (~80%). Sum/Mean aggregation does not work well on long reviews (especially with many frequent words). This adds a lot of noise.

## **Todo** :  Try answering the following questions:

- Which word2vec model works best: skip-gram or cbow
- Do pretrained vectors work best than those learnt on the train dataset ?


In [34]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Load pre-trained word embeddings
word_vectors = w2v_cbow.wv # your pre-trained word embedding model

# Load your sentiment data
y_train = [c for r,c in train]
y_test  = [c for r,c in test ]

m = 'sum'

# Vectorize the input data
X_train = [vectorize(text, word_vectors, method=m) for text,pol in train]
X_test  = [vectorize(text, word_vectors, method=m) for text,pol in test]

# Train a logistic regression model
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)

# Evaluate the model on the testing set
accuracy = clf.score(X_test, y_test)

# Results
print('Exec time : ', time.time()-tic)
print('Method :', m)
print("Test accuracy:", accuracy)
print('---')

Exec time :  1261.5267460346222
Method : sum
Test accuracy: 0.64572
---


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Load pre-trained word embeddings
word_vectors = w2v_sg.wv # your pre-trained word embedding model

# Load your sentiment data
y_train = [c for r,c in train]
y_test  = [c for r,c in test ]

m = 'sum'

# Vectorize the input data
X_train = [vectorize(text, word_vectors, method=m) for text,pol in train]
X_test  = [vectorize(text, word_vectors, method=m) for text,pol in test]

# Train a logistic regression model
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)

# Evaluate the model on the testing set
accuracy = clf.score(X_test, y_test)

# Results
print('Exec time : ', time.time()-tic)
print('Method :', m)
print("Test accuracy:", accuracy)
print('---')

Exec time :  1393.2965927124023
Method : sum
Test accuracy: 0.6532
---


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



**(Bonus)** To have a better accuracy, we could try two things:
- Better aggregation methods (weight by tf-idf ?)
- Another word vectorizing method such as [fasttext](https://radimrehurek.com/gensim/models/fasttext.html)
- A document vectorizing method such as [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html)

In [41]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim

def vectorize_tfidf(docs, model):
    # Create a TfidfVectorizer objecté
    tfidf = TfidfVectorizer()

    # Compute the tf-idf weights for the documents
    tfidf_weights = tfidf.fit_transform(docs)

    # Get the vocabulary and idf weights from the TfidfVectorizer
    vocabulary = tfidf.vocabulary_
    idf_weights = tfidf.idf_

    # Create an empty numpy array to store the aggregated vectors
    agg_vectors = np.zeros((len(docs), model.vector_size))

    # Loop over the documents and aggregate the vectors
    for i, doc in enumerate(docs):
        # Split the document into individual words
        words = doc.split()

        # Loop over the words and aggregate their vectors
        for word in words:
            if word in vocabulary:
                # Compute the tf-idf weight for the word
                tf_idf = idf_weights[vocabulary[word]] * (words.count(word) / len(words))

                # Add the weighted vector for the word to the aggregate vector for the document
                agg_vectors[i] += tf_idf * model[word]

    return agg_vectors


In [57]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Load pre-trained word embeddings
word_vectors = w2v_sg.wv # your pre-trained word embedding model

# Load your sentiment data
y_train = [c for r,c in train]
y_test  = [c for r,c in test ]

# Vectorize the input data
X_train = [vectorize_tfidf(text, word_vectors) for text,pol in train]
X_test  = [vectorize_tfidf(text, word_vectors) for text,pol in test]

# Train a logistic regression model
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)

# Evaluate the model on the testing set
accuracy = clf.score(X_test, y_test)

# Results
print('Exec time    : ', time.time()-tic)
print('Method       :', m)
print("Test accuracy:", accuracy)
print('---')

## Fasttext

## Doc2vec

In [62]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(train)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

2023-02-16 15:53:03,233 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2023-02-16 15:53:03,235 : INFO : built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)
2023-02-16 15:53:03,236 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...> from 9 documents (total 29 corpus positions)", 'datetime': '2023-02-16T15:53:03.236808', 'gensim': '4.3.0', 'python': '3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22000-SP0', 'event': 'created'}
2023-02-16 15:53:03,535 : INFO : collecting all words and their counts
2023-02-16 15:53:03,536 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2023-02-16 15:53:03,603 : INFO : PROGRESS: at example #10000, processed 20000 words (313453 words/s), 9985 word types, 0 tags
2023

---

## Annexe

sentences: This is the input data used to train the Word2Vec model. In this case, it is a list of sentences represented by the text variable.

vector_size: This parameter sets the size of the vector used to represent each word. In this case, it is set to 100.

window: This parameter sets the size of the window used to predict the target word given the context words. It is set to 5, which means that the model will consider the five words before and after the target word.

min_count: This parameter sets the minimum frequency of a word that should be included in the vocabulary. Words that occur less frequently than min_count times in the input data will be ignored. It is set to 5 in this case.

sample: This parameter sets the threshold for downsampling high-frequency words. Words that occur more frequently than sample will be randomly downsampled. It is set to 0.001 in this case.

workers: This parameter sets the number of worker threads to use for training the model. It is set to 3 in this case.

sg: This parameter sets the training algorithm. It is set to 1, which means that the model will use the Skip-gram (SG) algorithm instead of the default Continuous Bag of Words (CBOW) algorithm.

hs: This parameter sets the hierarchical softmax training algorithm. It is set to 0, which means that negative sampling will be used instead of hierarchical softmax.

negative: This parameter sets the number of negative samples to use when training the model. It is set to 5 in this case.

cbow_mean: This parameter sets the method for computing the word vectors in the CBOW algorithm. It is set to 1, which means that the vectors will be the mean of the context vectors.

epochs: This parameter sets the number of epochs (iterations) to train the model. It is set to 5 in this case.