# TP : Word Embeddings for Classification

## Objectives:

Explore the various way to represent textual data by applying them to a relatively small French classification dataset based on professionnal certification titles - **RNCP** - and evaluate how they perform on the classification task. 
1. Using what we have previously seen, pre-process the data: clean it, obtain an appropriate vocabulary.
2. Obtain representations: any that will allow us to obtain a vector representation of each document is appropriate.
    - Symbolic: **BoW, TF-IDF**
    - Dense document representations: via **Topic Modeling: LSA, LDA**
    - Dense word representations: **SVD-reduced PPMI, Word2vec, GloVe**
        - For these, you will need to implement a **function aggregating word representations into document representations**
3. Perform classification: we can make things simple and only use a **logistic regression**

## Necessary dependancies

We will need the following packages:
- The Machine Learning API Scikit-learn : http://scikit-learn.org/stable/install.html
- The Natural Language Toolkit : http://www.nltk.org/install.html
- Gensim: https://radimrehurek.com/gensim/

These are available with Anaconda: https://anaconda.org/anaconda/nltk and https://anaconda.org/anaconda/scikit-learn

In [2]:
import os.path as op
import re 
import numpy as np
import matplotlib.pyplot as plt
import pprint
import pandas as pd
import gzip
import nltk
pp = pprint.PrettyPrinter(indent=3)

## Loading data

Let's load the data: take a first look.

In [3]:
with open("rncp.csv", encoding='utf-8') as f:
    rncp = pd.read_csv(f, na_filter=False)

print(rncp.head())

   Categorie                                text_certifications
0          1  Responsable de chantiers de bûcheronnage manue...
1          1  Responsable de chantiers de bûcheronnage manue...
2          1                                 Travaux forestiers
3          1                                              Forêt
4          1                                              Forêt


In [4]:
print(rncp.columns.values)
texts = rncp.loc[:,'text_certifications'].astype('str').tolist()
labels = rncp.loc[:,'Categorie'].astype('str').tolist()

['Categorie' 'text_certifications']


You can see that the first column is the category, the second the title of the certification. Let's get the category names for clarity: 

In [5]:
Categories = ["1-environnement",
              "2-defense",
              "3-patrimoine",
              "4-economie",
              "5-recherche",
              "6-nautisme",
              "7-aeronautique",
              "8-securite",
              "9-multimedia",
              "10-humanitaire",
              "11-nucleaire",
              "12-enfance",
              "13-saisonnier",
              "14-assistance",
              "15-sport",
              "16-ingenierie"]

In [6]:
pp.pprint(texts[:10])

[  'Responsable de chantiers de bûcheronnage manuel et de débardage',
   'Responsable de chantiers de bûcheronnage manuel et de sylviculture',
   'Travaux forestiers',
   'Forêt',
   'Forêt',
   'Responsable de chantiers forestiers',
   'Diagnostic et taille des arbres',
   'option Chef d’entreprise ou OHQ en travaux forestiers, spécialité '
   'abattage-façonnage',
   'option Chef d’entreprise ou OHQ en travaux forestiers, spécialité '
   'débardage',
   'Gestion et conduite de chantiers forestiers']


In [7]:
# This number of documents may be high for some computers: we can select a fraction of them (here, one in k)
# Use an even number to keep the same number of positive and negative reviews
k = 4
texts_reduced = texts[0::k]
labels_reduced = labels[0::k]

print('Number of documents:', len(texts_reduced))

Number of documents: 23578


Use the function ```train_test_split```from ```sklearn``` function to set aside test data that you will use during the lab. Make it one fifth of the data you have currently.

<div class='alert alert-block alert-info'>
            Code:</div>

In [8]:
from sklearn.model_selection import train_test_split
texts_reduced, test_texts, labels_reduced, test_labels = train_test_split(texts_reduced, labels_reduced, test_size = 0.20)

## 1 - Document Preprocessing

You should use a pre-processing function you can apply to the raw text before any other processing (*i.e*, tokenization and obtaining representations). Some pre-processing can also be tied with the tokenization (*i.e*, removing stop words). Complete the following function, using the appropriate ```nltk``` tools. 
<div class='alert alert-block alert-info'>
            Code:</div>

In [9]:
# Imports
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

def remove_pontuation(liste):
    return [re.sub(r'[^\w\s]', '', word) for word in liste if re.sub(r'[^\w\s]', '', word) != ""]

def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)

    # Remove ponctuation
    tokens = remove_pontuation(tokens)
    
    # Remove stopwords 
    stop_words = set(stopwords.words('french'))
    
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    
    # Stemming
    stemmer = SnowballStemmer('french')
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
        
    return stemmed_tokens

def vect_preprocess_text(texts):
    res=[]
    for text in texts:
        res.append(preprocess_text(text))
    return res

<div class='alert alert-block alert-info'>
            Code:</div>

In [10]:
# Look at the data and apply the appropriate pre-processing
prepo_text_reduced = vect_preprocess_text(texts_reduced)
prepo_text_test = vect_preprocess_text(test_texts)

In [11]:
prepo_text_reduced[48]

['domain',
 'scienc',
 'technolog',
 'mention',
 'scienc',
 'lingénieur',
 'spécial',
 'compétent',
 'complémentair',
 'informat']

Now that the data is cleaned, the first step we will follow is to pick a common vocabulary that we will use for every representations we obtain in this lab. **Use the code of the previous lab to create a vocabulary.**

<div class='alert alert-block alert-info'>
            Code:</div>

In [12]:
def vocabulary(corpus, count_threshold=0, voc_threshold=10000):
    """    
    Function using word counts to build a vocabulary - can be improved with a second parameter for 
    setting a frequency threshold
    Params:
        corpus (list of strings): corpus of sentences
        count_threshold (int): number of occurrences necessary for a word to be included in the vocabulary
        voc_threshold (int): maximum size of the vocabulary 
    Returns:
        vocabulary (dictionary): keys: list of distinct words across the corpus
                                 values: indexes corresponding to each word sorted by frequency   
        vocabulary_word_counts (dictionary): keys: list of distinct words across the corpus
                                             values: corresponding counts of words in the corpus
    """
    word_counts = {}

    for text in corpus:
        text = text.lower()
        tokens = preprocess_text(text)
        for token in tokens:
            if token not in word_counts:
                word_counts[token] = 0
            word_counts[token] += 1

    filtered_word_counts = {'UNK': 0}

    for word, count in word_counts.items():
        if count >= count_threshold:
            filtered_word_counts[word] = count

    sorted_word_counts = sorted(filtered_word_counts.items(), key=lambda x: x[1], reverse=True)

    vocabulary = {}
    vocabulary_word_counts = {}

    for i, (word, count) in enumerate(sorted_word_counts):
        if i >= voc_threshold:
            vocabulary['UNK'] = i
            vocabulary_word_counts['UNK'] = 0
            break
        vocabulary[word] = i
        vocabulary_word_counts[word] = count

    return vocabulary, vocabulary_word_counts

In [13]:
# Vocabulary
Vocabulary, Vocabulary_word_counts = vocabulary(texts_reduced)

In [14]:
Vocabulary['UNK']

3771

What do you think is the **appropriate vocabulary size here** ? Would any further pre-processing make sense ? Motivate your answer.

<div class='alert alert-block alert-warning'>
            Question:</div>

We have 16 categories and a vocabulary of size 3773. We have on average 235 unique words per categorie with almost 20000 samples. It seems reasonable.

We can maybe use lemmatization but nltk does not handle lemmatization for french word.

## 2 - Symbolic text representations

We can use the ```CountVectorizer``` class from scikit-learn to obtain the first set of representations:
- Use the appropriate argument to get your own vocabulary
- Fit the vectorizer on your training data, transform your test data
- Create a ```LogisticRegression``` model and train it with these representations. Display the confusion matrix using functions from ```sklearn.metrics``` 

Then, re-execute the same pipeline with the ```TfidfVectorizer```.

<div class='alert alert-block alert-info'>
            Code:</div>

### Count Word

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [16]:
vectorizer = CountVectorizer(tokenizer = preprocess_text, token_pattern=None, vocabulary=Vocabulary)
X_train_bow = vectorizer.fit_transform(texts_reduced)
X_test_bow = vectorizer.transform(test_texts)



In [17]:
print("Length of the vocabulary :", len(vectorizer.vocabulary_))
print("Unique index assigned to each word : ", vectorizer.vocabulary_)
#print("Sparse Matrix Representation : ", X_test_bow)

Length of the vocabulary : 3772
Unique index assigned to each word :  {'spécial': 0, 'mention': 1, 'scienc': 2, 'sant': 3, 'technolog': 4, 'diplôm': 5, 'ingénieur': 6, 'gestion': 7, 'manag': 8, 'national': 9, 'domain': 10, 'droit': 11, 'mast': 12, 'informat': 13, 'system': 14, 'professionnel': 15, 'industriel': 16, 'gen': 17, 'social': 18, 'supérieur': 19, 'option': 20, 'commun': 21, 'méti': 22, 'art': 23, 'appliqu': 24, 'humain': 25, 'langu': 26, 'développ': 27, 'environ': 28, 'final': 29, 'ingénier': 30, 'projet': 31, 'econom': 32, 'recherch': 33, 'product': 34, 'lunivers': 35, 'fich': 36, 'respons': 37, 'économ': 38, 'polytechn': 39, 'lettr': 40, 'réseau': 41, 'international': 42, 'matérial': 43, 'mécan': 44, 'biolog': 45, 'sécur': 46, 'techniqu': 47, 'physiqu': 48, 'marketing': 49, 'lecol': 50, 'chim': 51, 'qualit': 52, 'écol': 53, 'organis': 54, 'linstitut': 55, 'entrepris': 56, 'inform': 57, 'licenc': 58, 'ecol': 59, 'chef': 60, 'télécommun': 61, 'électron': 62, 'commerc': 63, 'n

In [18]:
model = LogisticRegression(max_iter = 2000)

model.fit(X_train_bow, labels_reduced)

preds = model.predict(X_test_bow)

accuracy = metrics.accuracy_score(test_labels, preds)
print(f"Accuracy : {accuracy}")

#Display the confusion matrix
cm = confusion_matrix(test_labels, preds)
print(cm)

Accuracy : 0.22370653095843934
[[107  42  17   4   9   3  11  87   2  22 187  12  17  42   5  14]
 [ 67  33   0   3   2   2   4  16   0  10 119   4   3   2   0  20]
 [ 33   3  17   0   0   1   0  47   7   0  37   5   9  35   6   1]
 [  9   0   1   7  19  11   7   1   0   3  25   0   5   0   0   5]
 [ 14   0   0   9  20   5   7   0   1   9  10   1   4   1   1   4]
 [ 19   3   1   5  10  34  11   7   2   8  26   5   4   6   8   8]
 [ 21   7   0   9   5  10  11   2   1   8  78   1  15   0   0  28]
 [ 66  12  24   0   1   2   1 123   3   1 182  11   9 131   8  26]
 [  8   1   4   0   0   2   0   6   9   0  13   1  10   8   5   1]
 [ 33   6   2   2   2   4   3   6   0  46  51  16   4   4   0   8]
 [ 92  30   7   1   4   7  15 128   1  16 456  16  11  42   6  87]
 [ 37   4   6   2   2   5   4  45   0  21  83  16  13  28   2   1]
 [ 41   5   4   4   4   5  12  41  11   2  50  12  22  31   3   3]
 [ 31   8   8   0   0   2   0 165   1   4  90   7   8  65   6  13]
 [ 11   0   8   1   0   4   0  

### TFIDF

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvect = TfidfVectorizer(tokenizer= preprocess_text, token_pattern=None, vocabulary=Vocabulary)
X_train_tfidf = tfidfvect.fit_transform(texts_reduced)

X_test_tfidf = tfidfvect.transform(test_texts)



In [20]:
print("Length of the vocabulary :", len(tfidfvect.vocabulary_))
print("Unique index assigned to each word : ", tfidfvect.vocabulary_)
print("Sparse Matrix Representation : ", X_test_tfidf)

Length of the vocabulary : 3772
Unique index assigned to each word :  {'spécial': 0, 'mention': 1, 'scienc': 2, 'sant': 3, 'technolog': 4, 'diplôm': 5, 'ingénieur': 6, 'gestion': 7, 'manag': 8, 'national': 9, 'domain': 10, 'droit': 11, 'mast': 12, 'informat': 13, 'system': 14, 'professionnel': 15, 'industriel': 16, 'gen': 17, 'social': 18, 'supérieur': 19, 'option': 20, 'commun': 21, 'méti': 22, 'art': 23, 'appliqu': 24, 'humain': 25, 'langu': 26, 'développ': 27, 'environ': 28, 'final': 29, 'ingénier': 30, 'projet': 31, 'econom': 32, 'recherch': 33, 'product': 34, 'lunivers': 35, 'fich': 36, 'respons': 37, 'économ': 38, 'polytechn': 39, 'lettr': 40, 'réseau': 41, 'international': 42, 'matérial': 43, 'mécan': 44, 'biolog': 45, 'sécur': 46, 'techniqu': 47, 'physiqu': 48, 'marketing': 49, 'lecol': 50, 'chim': 51, 'qualit': 52, 'écol': 53, 'organis': 54, 'linstitut': 55, 'entrepris': 56, 'inform': 57, 'licenc': 58, 'ecol': 59, 'chef': 60, 'télécommun': 61, 'électron': 62, 'commerc': 63, 'n

In [21]:
modeltfidf = LogisticRegression(max_iter = 2000)

modeltfidf.fit(X_train_tfidf, labels_reduced)

preds_tfidf = modeltfidf.predict(X_test_tfidf)

accuracy = metrics.accuracy_score(test_labels, preds_tfidf)
print(f"Accuracy : {accuracy}")

#Display the confusion matrix
cm = confusion_matrix(test_labels, preds_tfidf)
print(cm)

Accuracy : 0.23918575063613232
[[132  30  10   1   5   0   9 101   2  23 192  10  11  39   2  14]
 [ 62  34   0   4   0   2   6  14   0  11 125   3   3   3   0  18]
 [ 40   2  14   0   0   0   0  62   3   1  34   3   6  35   1   0]
 [ 11   0   0   8  17  12   4   1   0   5  23   0   7   0   0   5]
 [ 22   0   0   5  13   9   5   0   1   6  15   1   3   1   1   4]
 [ 26   3   1   6  10  28   7   3   1   5  39   1   4   8   8   7]
 [ 21   1   0   9   5  13  10   3   1  12  76   0  11   0   0  34]
 [ 72  12  16   0   1   2   1 153   0   0 183  10   5 114   3  28]
 [ 13   2   3   0   0   1   0   8   5   0  11   0  12   8   4   1]
 [ 36   5   2   2   1   5   3   9   0  49  49  11   7   3   0   5]
 [ 90  25   5   1   1   3  16 146   0  12 483  10   4  39   6  78]
 [ 47   2   1   2   0   4   6  58   0  20  75  12   6  33   0   3]
 [ 47   4   1   5   5   4   8  51   5   3  44  13  23  31   2   4]
 [ 34   6   1   0   0   1   0 172   1   2  86   5   4  76   5  15]
 [ 16   0   4   1   0   4   0  

## 3 - Dense Representations from Topic Modeling

Now, the goal is to re-use the bag-of-words representations we obtained earlier - but reduce their dimension through a **topic model**. Note that this allows to obtain reduced **document representations**, which we can again use directly to perform classification.
- Do this with two models: ```TruncatedSVD``` and ```LatentDirichletAllocation```
- Pick $300$ as the dimensionality of the latent representation (*i.e*, the number of topics)

<div class='alert alert-block alert-info'>
            Code:</div>

In [22]:
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# TruncatedSVD
svd = TruncatedSVD(n_components=300)
X_train_svd = svd.fit_transform(X_train_bow)

X_test_svd = svd.transform(X_test_bow)

# LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=300)
X_train_lda = lda.fit_transform(X_train_bow)

X_test_lda = lda.transform(X_test_bow)

In [23]:
X_train_svd.shape, X_train_lda.shape

((18862, 300), (18862, 300))

In [24]:
modelsvd = LogisticRegression(max_iter = 200)

modelsvd.fit(X_train_svd, labels_reduced)

preds_svd = modelsvd.predict(X_test_svd)

accuracy = metrics.accuracy_score(test_labels, preds_tfidf)
print(f"Accuracy : {accuracy}")
print(metrics.classification_report(test_labels, preds_svd))

#Display the confusion matrix
cm = confusion_matrix(test_labels, preds_svd)
print(cm)

Accuracy : 0.23918575063613232
              precision    recall  f1-score   support

           1       0.22      0.23      0.23       581
          10       0.23      0.11      0.15       285
          11       0.21      0.08      0.12       201
          12       0.26      0.11      0.15        93
          13       0.32      0.26      0.28        86
          14       0.21      0.13      0.16       157
          15       0.19      0.08      0.11       196
          16       0.22      0.26      0.24       600
           2       0.27      0.15      0.19        68
           3       0.25      0.17      0.20       187
           4       0.27      0.56      0.37       919
           5       0.19      0.07      0.10       269
           6       0.25      0.10      0.14       250
           7       0.18      0.15      0.16       408
           8       0.25      0.19      0.22        72
           9       0.27      0.20      0.23       344

    accuracy                           0.24      

In [25]:
modellda = LogisticRegression(max_iter = 2000)

modellda.fit(X_train_lda, labels_reduced)

preds_lda = modellda.predict(X_test_lda)

accuracy = metrics.accuracy_score(test_labels, preds_lda)
print(f"Accuracy : {accuracy}")

#Display the confusion matrix
cm = confusion_matrix(test_labels, preds_lda)
print(cm)

Accuracy : 0.2383375742154368
[[153   6   8   2   3   0   3  91   0  12 251   8  10  20   0  14]
 [ 42  11   0   3   0   4   2  13   0   7 181   0   1   5   0  16]
 [ 43   0  11   0   0   0   0  62   0   1  54   1   0  26   0   3]
 [ 16   1   0  12   7   5   3   1   0   6  33   0   5   1   0   3]
 [ 15   1   0   2   3   6   4   3   0   3  35   0   6   3   1   4]
 [ 34   0   2   3   4   7   2  13   0   3  66   1   4   7   6   5]
 [ 16   2   0   5   1   9   5   1   0   6 118   0   6   0   0  27]
 [ 99   2  10   0   0   2   0 169   0   0 229   5   7  56   0  21]
 [ 19   0   4   0   0   0   0   5   0   0  29   0   2   6   3   0]
 [ 46   0   0   2   1   1   3  11   0  30  79   4   5   1   0   4]
 [ 82   6   4   1   0   7   7 131   0   7 593   5   4  26   2  44]
 [ 58   0   0   2   0   1   2  59   0  18  98   5   4  19   0   3]
 [ 60   0   0   2   0   4   2  55   0   2  81   5  13  20   3   3]
 [ 40   8   4   0   0   0   0 146   0   3 148   3   1  43   4   8]
 [ 20   0   2   1   0   0   0  1

<div class='alert alert-block alert-warning'>
            Question:</div>
            
We picked $300$ as number of topics. What would be the procedure to follow if we wanted to choose this hyperparameter through the data ? 

We would have to use a grid search to find the best hyperparameters for the model and select the value of the hypermarameter such that the cumulative variance ratio is greater than 0.9.

## 4 - Dense Count-based Representations

The following function allows to obtain very large-dimensional vectors for **words**. We will now follow a different procedure:
- Step 1: Obtain the co-occurence matrix, based on the vocabulary, giving you a vector by word in the vocabulary.
- Step 2: Apply an SVD to obtain **word embeddings** of dimension $300$, for each word in the vocabulary.
- Step 3: Obtain document representations by aggregating embeddings associated to each word in the document.
- Step 4: Train a classifier on the (document representations, label) pairs. 

Some instructions:
- In step 1, use the ```co_occurence_matrix``` function, which you need to complete.
- In step 2, use ```TruncatedSVD```to obtain word representations of dimension $300$ from the output of the ```co_occurence_matrix``` function.
- In step 3, use the ```sentence_representations``` function, which you will need to complete.
- In step 4, put the pipeline together by obtaining document representations for both training and testing data. Careful: the word embeddings must come from the *training data co-occurence matrix* only.

Lastly, add a **Step 1b**: transform the co-occurence matrix into the PPMI matrix, and compare the results.

In [26]:
def co_occurence_matrix(corpus, vocabulary, window=0):
    """
    Params:
        corpus (list of list of strings): corpus of sentences
        vocabulary (dictionary): words to use in the matrix
        window (int): size of the context window; when 0, the context is the whole sentence
    Returns:
        matrix (array of size (len(vocabulary), len(vocabulary))): the co-oc matrix, using the same ordering as the vocabulary given in input    
    """ 
    l = len(vocabulary)
    M = np.zeros((l,l))
    for sent in corpus:
        # Get the sentence
        sent = preprocess_text(sent)
        # Obtain the indexes of the words in the sentence from the vocabulary 
        sent_idx = []
        for word in sent:
            if word in vocabulary:
                sent_idx.append(vocabulary[word])
            else:
                sent_idx.append(vocabulary['UNK'])
        # Avoid one-word sentences - can create issues in normalization:
        if len(sent_idx) == 1:
                sent_idx.append(len(vocabulary)-1) # This adds an Unkown word to the sentence
        # Go through the indexes and add 1 / dist(i,j) to M[i,j] if words of index i and j appear in the same window
        for i, idx in enumerate(sent_idx):
            # If we consider a limited context:
            if window > 0:
                # Create a list containing the indexes that are on the left of the current index 'idx_i'
                l_ctx_idx = sent_idx[max(0, i-window):i]
            # If the context is the entire document:
            else:
                # The list containing the left context is easier to create
                l_ctx_idx = sent_idx[:i]
            # Go through the list and update M[i,j] and M[j,i]:        
            for j, ctx_idx in enumerate(l_ctx_idx):
                M[idx, ctx_idx] += 1 / abs(i-j)
                M[ctx_idx, idx] += 1 / abs(i-j)
    return M  

<div class='alert alert-block alert-info'>
            Code:</div>

In [27]:
# Obtain the co-occurence matrix, transform it as needed, reduce its dimension
co_occ_train = co_occurence_matrix(texts_reduced, Vocabulary, window=5)
co_occ_test = co_occurence_matrix(texts_reduced, Vocabulary, window=5)

In [28]:
svd = TruncatedSVD(n_components=300)
X_train_svd = svd.fit_transform(co_occ_train)

<div class='alert alert-block alert-info'>
            Code:</div>

In [29]:
def sentence_representations(texts, vocabulary, embeddings, np_func=np.mean):
    """
    Represent the sentences as a combination of the vector of its words.
    Parameters
    ----------
    texts : a list of sentences   
    vocabulary : dict
        From words to indexes of vector.
    embeddings : Matrix containing word representations
    np_func : function (default: np.sum)
        A numpy matrix operation that can be applied columnwise, 
        like `np.mean`, `np.sum`, or `np.prod`. 
    Returns
    -------
    np.array, dimension `(len(texts), embeddings.shape[1])`            
    """
    representations = []
    for text in texts:
        sent = preprocess_text(text)
        indexes = [vocabulary.get(token, vocabulary['UNK']) for token in sent] # Indexes of words in the sentence obtained thanks to the vocabulary
        sentrep = np_func(embeddings[indexes], axis=0) # Embeddings of words in the sentence, aggregated thanks to the function
        representations.append(sentrep)
    representations = np.array(representations)    
    return representations

In [30]:
#step3
sentence_representations_train = sentence_representations(texts_reduced, Vocabulary, X_train_svd)
sentence_representations_test = sentence_representations(test_texts, Vocabulary, X_train_svd)

In [31]:
sentence_representations_train.shape, sentence_representations_test.shape

((18862, 300), (4716, 300))

<div class='alert alert-block alert-info'>
            Code:</div>

In [32]:
model = LogisticRegression(max_iter = 4000)
model.fit(sentence_representations_train, labels_reduced)
test_preds = model.predict(sentence_representations_test)
print(metrics.classification_report(test_labels, test_preds))


#Display the confusion matrix
cm = confusion_matrix(test_labels, test_preds)
print(cm)


              precision    recall  f1-score   support

           1       0.21      0.22      0.21       581
          10       0.28      0.10      0.15       285
          11       0.24      0.09      0.13       201
          12       0.24      0.11      0.15        93
          13       0.27      0.22      0.24        86
          14       0.15      0.13      0.14       157
          15       0.12      0.05      0.07       196
          16       0.21      0.29      0.25       600
           2       0.18      0.12      0.14        68
           3       0.28      0.22      0.25       187
           4       0.30      0.55      0.38       919
           5       0.17      0.06      0.09       269
           6       0.21      0.10      0.13       250
           7       0.21      0.17      0.19       408
           8       0.22      0.21      0.21        72
           9       0.30      0.26      0.28       344

    accuracy                           0.25      4716
   macro avg       0.22   

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Now with PPMI matrix

In [33]:
def ppmi_matrix(co_occ_matrix):
    l = co_occ_matrix.shape[0]
    eps = 10**-50
    ppmi_matrix = np.zeros((l,l))
    sum_co_occ = np.sum(co_occ_matrix)
    sum_co_occ_col = np.sum(co_occ_matrix, axis=0)
    sum_co_occ_row = np.sum(co_occ_matrix, axis=1)
    for i in range(l):
        for j in range(l):
            res = (co_occ_matrix[i,j]*sum_co_occ)/(sum_co_occ_row[i]*sum_co_occ_col[j])
            if res > 1:
                ppmi_matrix[i,j] = np.log(res)
    return ppmi_matrix

ppmi_train = ppmi_matrix(co_occ_train)
ppmi_test = ppmi_matrix(co_occ_test)

In [34]:
svd = TruncatedSVD(n_components=300)
X_train_svd_ppmi = svd.fit_transform(ppmi_train)
X_test_svd_ppmi = svd.transform(ppmi_test)

In [35]:
sentence_representations_train_ppmi = sentence_representations(texts_reduced, Vocabulary, X_train_svd_ppmi)
sentence_representations_test_ppmi = sentence_representations(test_texts, Vocabulary, X_train_svd_ppmi)

model_ppmi = LogisticRegression(max_iter = 2000)
model_ppmi.fit(sentence_representations_train, labels_reduced)
test_preds_ppmi = model_ppmi.predict(sentence_representations_test_ppmi)

print("PPMI model :")
print(metrics.classification_report(test_labels, test_preds_ppmi))

print()
      
#Display the confusion matrix
cm = confusion_matrix(test_labels, test_preds_ppmi)
print(cm)


PPMI model :
              precision    recall  f1-score   support

           1       0.12      0.09      0.10       581
          10       0.09      0.11      0.10       285
          11       0.05      0.02      0.03       201
          12       0.01      0.03      0.02        93
          13       0.02      0.03      0.03        86
          14       0.04      0.12      0.06       157
          15       0.03      0.09      0.05       196
          16       0.05      0.01      0.01       600
           2       0.01      0.03      0.01        68
           3       0.08      0.06      0.07       187
           4       0.18      0.16      0.17       919
           5       0.00      0.00      0.00       269
           6       0.03      0.04      0.04       250
           7       0.07      0.04      0.05       408
           8       0.01      0.03      0.01        72
           9       0.09      0.07      0.08       344

    accuracy                           0.07      4716
   macro avg 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## 5 - Dense Prediction-based Representations

We will now use word embeddings from ```Word2Vec```: which we will train ourselves

We will use the ```gensim``` library for its implementation of word2vec in python. Since we want to keep the same vocabulary as before: we'll first create the model, then re-use the vocabulary we generated above. 

In [36]:
from gensim.models import Word2Vec
X_train_bow.shape

(18862, 3772)

In [37]:
model = Word2Vec(vector_size=300,
                 window=5,
                 null_word= len(Vocabulary_word_counts))
model.build_vocab_from_freq(Vocabulary_word_counts)

<div class='alert alert-block alert-info'>
            Code:</div>

In [38]:
# The model is to be trained with a list of tokenized sentences, containing the full training dataset.
preprocessed_corpus = vect_preprocess_text(texts_reduced)

In [None]:
model.train(preprocessed_corpus, total_examples=model.corpus_count, epochs=30)

Then, we can re-use the ```sentence_representations```function like before to obtain document representations, and apply classification. 
<div class='alert alert-block alert-info'>
            Code:</div>

In [None]:
#hen, we can re-use the sentence_representations function like before to obtain document representations, and apply classification. 
sentence_representations_train_w2v = sentence_representations(texts_reduced, Vocabulary, model.wv)
sentence_representations_test_w2v = sentence_representations(test_texts, Vocabulary, model.wv)

model_w2v = LogisticRegression(max_iter = 2000)
model_w2v.fit(sentence_representations_train_w2v, labels_reduced)
test_preds_w2v = model_w2v.predict(sentence_representations_test_w2v)

print("Word2Vec model :")
print(metrics.classification_report(test_labels, test_preds_w2v))

<div class='alert alert-block alert-warning'>
            Question:</div>
            
Comment on the results. What is the big issue with the dataset that using embeddings did not solve ? 
**Given this type of data**, what would you propose if you needed solve this task (i.e, reach a reasonnable performance) in an industrial context ? 