# USING WORD2VEC EMBEDDINGS AS FEATURES IN SUPERVISED LEARNING

# Introduction

In this tutorial we'll briefly cover word2vec, a baseline worth trying on almost any NLP task, and show how to use its embeddings as features in supervised learning.

Word2vec is an unsupervised learning algorithm, developed at Google and released in 2013, for learning vector representations of words. The main idea is to encode words with close meanings, which can substitute for each other in a context, as nearby vectors in an N-dimensional space. In other words, we assume that similar words keep similar company, as in the saying 'tell me who your friends are...'.


**Why it's useful**

When we use words in machine learning models, we need to convert them into some numeric representation, unless the model handles categorical inputs natively, as some tree-based implementations do. The easiest way is one-hot encoding: each word becomes a sparse vector with a single non-zero element marking the corresponding word. Sparseness aside, such approaches carry no information about the local context of a word: they strip away which words commonly appear close together in a sentence (or in neighboring sentences).
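
As a minimal sketch of that limitation (the three-word vocabulary is a hypothetical toy example, not taken from the dataset), one-hot vectors of any two different words are orthogonal, so the encoding cannot say that 'good' is closer to 'great' than to 'terrible':

```python
import numpy as np

# Hypothetical toy vocabulary, used only for illustration
vocab = ["good", "great", "terrible"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab), dtype="float32")
    vec[word_to_index[word]] = 1.0
    return vec

# The dot product of any two distinct one-hot vectors is 0, so "good" is
# exactly as unrelated to "great" as it is to "terrible"
sim_good_great = float(one_hot("good") @ one_hot("great"))
sim_good_terrible = float(one_hot("good") @ one_hot("terrible"))
print(sim_good_great, sim_good_terrible)
```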

As a human being, you understand that few things matter more for analyzing sequential text than the context each word is used in. Keeping information about context lets us treat semantically close words as similar instances, which can be very important when analyzing text data.

**How it works**

The algorithm tries to reflect the meaning of a word by analyzing its context. It comes in two flavors: ***CBOW*** and ***Skip-Gram***. Looping over a corpus of sentences, ***Skip-Gram*** uses the current word to predict its neighbors, while ***CBOW*** uses the surrounding context words to predict the current word. The number of words taken on each side of the current word is limited by a parameter called “window size”.
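
To make the difference concrete, here is a simplified sketch of the training pairs each flavor would build from one tokenized sentence. The helper names are hypothetical, and the sketch ignores details of real implementations such as random window shrinking, subsampling of frequent words, and negative sampling.

```python
def context_window(tokens, position, window):
    """Return the tokens within `window` positions of `position`, excluding the word itself."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

def skipgram_pairs(tokens, window):
    """(input, target) pairs: the current word predicts each neighbor."""
    return [(word, neighbor)
            for i, word in enumerate(tokens)
            for neighbor in context_window(tokens, i, window)]

def cbow_pairs(tokens, window):
    """(input, target) pairs: the whole context predicts the current word."""
    return [(context_window(tokens, i, window), word)
            for i, word in enumerate(tokens)]

tokens = ["the", "movie", "was", "great"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

With a window of 1, Skip-Gram yields one pair per (word, neighbor) combination, while CBOW yields one pair per word with all of its neighbors grouped as the input.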

Besides allowing for a numerical representation of textual data, the resulting embeddings also learn interesting relationships between words.
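
A well-known example of such a relationship is the analogy 'king - man + woman ≈ queen'. The sketch below reproduces the arithmetic on hand-made two-dimensional vectors. These vectors are an assumption made purely for illustration and are not the output of a trained model.

```python
import numpy as np

# Hand-made 2-D vectors (axis 0: "royalty", axis 1: "gender"); purely
# illustrative, not the output of a trained model
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land near queen
query = toy["king"] - toy["man"] + toy["woman"]
nearest = max(toy, key=lambda word: cosine(query, toy[word]))
print(nearest)
```

On a trained gensim model the same kind of query can be run with `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`.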

A Python implementation of word2vec is available in the ***gensim*** package.

# Data preprocessing

To get better results you should first clean your data. For word2vec we need to split the text into sentences and then tokenize each sentence. Some prefer to remove punctuation marks at this point. Some prefer not to split large texts into sentences at all and tokenize them directly. It's up to you.

We'll work with classified film reviews from IMDb.

In [None]:
import pandas as pd
import nltk
import seaborn as sns
import re
import numpy as np
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('../input/bag-of-words-meets-bags-of-popcorn-/labeledTrainData.tsv', sep='\t')
df.head()

Now let's implement a function to preprocess the text in the dataset. First we lowercase the reviews. Then we split them into sentences and tokenize each sentence with the help of the ***nltk*** package. Finally we drop the empty token lists that appear wherever a split leaves an empty string, so they don't end up in the training corpus. Note that you can go further and delete non-ASCII characters, apply stemming or lemmatization, or perform other steps as needed.

In [None]:
%%time
# Transform the documents into tokenized sentences for the word2vec model.
# Wrapped in a function so the same code can be reused later when making a submission.
def preprocess(df):
    df['review'] = df.review.str.lower()
    df['document_sentences'] = df.review.str.split('.') 
    df['tokenized_sentences'] = list(map(lambda sentences: list(map(nltk.word_tokenize, sentences)), df.document_sentences))  
    df['tokenized_sentences'] = list(map(lambda sentences: list(filter(lambda lst: lst, sentences)), df.tokenized_sentences))

preprocess(df)
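
To see what these steps produce, here is a dependency-free sketch applied to a single string. The function name `preprocess_text` is hypothetical, and the regular expression is a simplified stand-in for `nltk.word_tokenize` (for instance, it drops punctuation that nltk would keep as separate tokens).

```python
import re

def preprocess_text(text):
    """Lowercase, split on '.', tokenize each sentence, drop empty sentences."""
    sentences = text.lower().split(".")
    # re.findall is a simplified stand-in for nltk.word_tokenize (an assumption
    # made to keep the sketch dependency-free); it keeps only letters, digits
    # and apostrophes
    tokenized = [re.findall(r"[a-z0-9']+", sentence) for sentence in sentences]
    # Consecutive dots such as "..." leave empty strings behind; filter them out
    return [tokens for tokens in tokenized if tokens]

example = preprocess_text("Great movie. I loved it... Really!")
print(example)
```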

Now let's split our data into train and test sets. We'll use 20% of the data for evaluation and 80% for training.

In [None]:
from sklearn.model_selection import train_test_split
train, test, y_train, y_test = train_test_split(df.drop(columns='sentiment'), df['sentiment'], test_size=.2)

The next step is collecting all tokenized sentences from the training data into one flat list, which serves as the training corpus for word2vec.

In [None]:
# Collect all training sentences into one flat list
voc = []
for sentence in train.tokenized_sentences:
    voc.extend(sentence)

print("Number of sentences: {}.".format(len(voc)))
print("Number of rows: {}.".format(len(train)))

Now we need to initialize and train the word2vec model. We'll set the values of some key parameters:

- *size* - the dimensionality of the word vectors (larger values take longer to compute);
- *min_count* - the minimum frequency of a word; rarer words are ignored;
- *window* - how many of the closest words are used as context;
- *workers* - the number of threads.

In [None]:
%%time
from gensim.models import word2vec, Word2Vec

num_features = 300    
min_word_count = 3    
num_workers = 4       
context = 8           
downsampling = 1e-3   

# Initialize and train the model
W2Vmodel = Word2Vec(sentences=voc, sg=1, hs=0, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                    sample=downsampling, negative=5, iter=6)
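
The call above uses the gensim 3.x argument names. In gensim 4.0 and later, `size` was renamed to `vector_size` and `iter` to `epochs`. The helper below is a hypothetical convenience, not part of gensim, that returns these two arguments under the names expected by a given major version:

```python
def word2vec_kwargs(gensim_major_version, dim=300, epochs=6):
    """Return dimensionality and epoch arguments named for the given gensim major version.

    Hypothetical helper for illustration; the defaults mirror the values used
    in this tutorial.
    """
    if gensim_major_version >= 4:
        return {"vector_size": dim, "epochs": epochs}
    return {"size": dim, "iter": epochs}

print(word2vec_kwargs(3))
print(word2vec_kwargs(4))
```

The major version can be read with `int(gensim.__version__.split('.')[0])`, and the resulting dictionary unpacked into the constructor with `**`.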

To turn each review (a group of sentences) into a feature vector, we average the vectors of all its words. A sum would not work well because reviews differ in length, so longer reviews would get larger vectors. Of course, the model can only return vectors for words that are in its vocabulary, so unknown words are skipped.

In [None]:
%%time
def sentence_vectors(model, sentences):
    """Average the vectors of all in-vocabulary words in a group of tokenized sentences."""
    sent_vector = np.zeros(model.vector_size, dtype="float32")

    # Counter for the number of words known to the model
    nwords = 0
    # Sum up the vectors of all words that are known to the model
    for sentence in sentences:
        for word in sentence:
            if word in model.wv:
                sent_vector += model.wv[word]
                nwords += 1

    # Now get the average; the vector stays all-zero if no word is known
    if nwords > 0:
        sent_vector /= nwords
    return sent_vector

train['sentence_vectors'] = list(map(lambda sen_group:
                                      sentence_vectors(W2Vmodel, sen_group),
                                      train.tokenized_sentences))
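
The choice of the mean over the sum can be illustrated with a small sketch. The two-dimensional word vectors below are hypothetical and hand-picked. A text repeated three times has the same content as the original, and only the mean gives both the same representation.

```python
import numpy as np

# Hypothetical 2-dimensional word vectors, hand-picked for illustration
toy_vectors = {"good": np.array([1.0, 0.0]), "film": np.array([0.0, 1.0])}

def pooled(tokens, reduce):
    """Stack the vectors of known tokens and reduce them along the word axis."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return reduce(np.stack(known), axis=0)

short_text = ["good", "film"]
long_text = ["good", "film"] * 3  # same content, three times as long

# The sum grows with the text length, the mean does not
print(pooled(short_text, np.sum), pooled(long_text, np.sum))
print(pooled(short_text, np.mean), pooled(long_text, np.mean))
```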

Now we only have to unpack each sentence vector into separate feature columns, one per dimension.

In [None]:
def vectors_to_feats(df, ndim):
    index=[]
    for i in range(ndim):
        df[f'w2v_{i}'] = df['sentence_vectors'].apply(lambda x: x[i])
        index.append(f'w2v_{i}')
    return df[index]
X_train = vectors_to_feats(train, 300)
X_train.head()
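
Adding the columns one at a time works, but it gets slow as the number of dimensions grows. A possible alternative, sketched here on hypothetical 4-dimensional vectors rather than on the real data, is to stack all vectors into a single matrix in one call:

```python
import numpy as np

# Three hypothetical 4-dimensional vectors standing in for train['sentence_vectors']
vectors = [np.arange(4, dtype="float32") + row for row in range(3)]

# Stack into an (n_rows, n_dims) matrix in one call instead of column by column
feature_matrix = np.vstack(vectors)
# Column names follow the same w2v_<i> convention as above
feature_names = [f"w2v_{i}" for i in range(feature_matrix.shape[1])]

print(feature_matrix.shape)
print(feature_names)
```

The matrix can then be wrapped with `pd.DataFrame(feature_matrix, columns=feature_names, index=train.index)` to get the same layout as `X_train`.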

We'll repeat the same steps for the test part of the set. 

In [None]:
%%time
test['sentence_vectors'] = list(map(lambda sen_group:sentence_vectors(W2Vmodel, sen_group), test.tokenized_sentences))
X_test=vectors_to_feats(test, 300)

And now we're ready to evaluate results:

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
from sklearn.metrics import roc_auc_score, confusion_matrix
roc_auc_score(y_test,lr.predict_proba(X_test)[:,1])


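* A pretty good result for such a simple pipeline. Before concluding that it beats 'bag of words' techniques, though, fit such a model on the same split and compare the scores.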

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
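# confusion_matrix puts actual classes on rows and predicted classes on columns,
# with labels in sorted order (0 = negative, 1 = positive)
df_cm = pd.DataFrame(confusion_matrix(y_test, lr.predict(X_test)),
                     index=['actual negative', 'actual positive'],
                     columns=['predicted negative', 'predicted positive'])
plt.figure(figsize=(10, 7))
sns.heatmap(df_cm, annot=True, fmt='d')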
plt.show()
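
When reading the heatmap, keep in mind that scikit-learn's `confusion_matrix` places the actual classes on the rows and the predicted classes on the columns, with labels in sorted order. The pure-Python sketch below follows the same convention on a few made-up labels. The function name and the labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix for labels 0/1 with rows = actual class, columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Made-up labels for illustration: 0 = negative, 1 = positive
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

Here `counts[0][1]` is the number of actual negatives predicted as positive (false positives), and `counts[1][0]` is the number of actual positives predicted as negative (false negatives).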

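Now we'll check it with cross-validation, training word2vec on the sentences of all texts. This can be treated as an information leak, because the embeddings see the validation texts, but word2vec never uses the sentiment labels, so in this case it doesn't matter much.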

In [None]:
voc_df = []
for sentence_group in df.tokenized_sentences:
    voc_df.extend(sentence_group)

print("Number of sentences: {}.".format(len(voc_df)))
print("Number of texts: {}.".format(len(df)))

In [None]:
%%time
from gensim.models import word2vec, Word2Vec

num_features = 300    
min_word_count = 3    
num_workers = 4       
context = 8           
downsampling = 1e-3   

# Initialize and train the model
W2Vmodel = Word2Vec(sentences=voc_df, sg=1, hs=0, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                    sample=downsampling, negative=5, iter=6)

In [None]:
%%time
df['sentence_vectors'] = list(map(lambda sen_group: sentence_vectors(W2Vmodel, sen_group), df.tokenized_sentences))
df = vectors_to_feats(df, 300)
y = pd.read_csv('../input/bag-of-words-meets-bags-of-popcorn-/labeledTrainData.tsv', sep='\t')['sentiment'].values

In [None]:
from sklearn.model_selection import ShuffleSplit, cross_val_score
cv = ShuffleSplit(n_splits=5, random_state=1)

cv_score = cross_val_score(lr, df, y ,cv=cv, scoring='roc_auc')
print(cv_score, cv_score.mean())

The scores are very stable across the CV splits, which indicates low variance, and we see no sign of overfitting.

# Outro
That's it. We've briefly covered how to use word2vec embeddings in supervised NLP tasks. Remember: word2vec is always worth a try when it comes to NLP!