# Homework 3 - Word Embedding & Rating Prediction

## 0. Introduction

### Goal and outline

In this notebook, our goal is to build a simple classification model and evaluate its performance when using the embedding matrices produced by three different embedding techniques: LSI and Word2Vec. 

In the three following sections we build the embedding matrices using each technique. Then, in section 4, we try to fit a cosine similarity classifier and a random forest classifier and evaluate their performances in predicting the ratings of the reviews as a target variable, with respect to the embeding matrices.

### Importing useful libraries

In [1]:
# Data manipulation
import numpy as np 
import pandas as pd 
import os.path

# Text manipulation
from string import punctuation
from wordcloud import WordCloud

# NLP Modules
import nltk
from nltk import word_tokenize
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
import gensim
from gensim import corpora, similarities
from gensim.test.utils import common_dictionary, common_corpus
from gensim.test.utils import get_tmpfile
from gensim.models import LsiModel
import nltk
from nltk.tokenize import RegexpTokenizer
from gensim.models.coherencemodel import CoherenceModel

# Vizualisation
import seaborn as sns
import matplotlib as plt

### Loading and preprocessing the data

Here we use functions we defined in previous homework, running this cell takes few minutes.

In [2]:
def load_data(DATA_PATH = "data/", file_name = "raw_scrapped_data.csv.gzip"):
    """
    Input  : path of where data is stored
    Purpose: loading csv file of reviews
    Output : data frame of reviews with associated ratings
    """    
    # Path of the file
    file_path = DATA_PATH + file_name

    # Reading data
    scrapped_data = pd.read_csv(file_path, compression='gzip')
    data = scrapped_data[['content', 'rating']]
    return data 

def basic_cleaning(series):
    # Remove punctuation
    new_series = series.str.replace('[^\w\s]','')
    # Strip trailing whitespace
    new_series = new_series.str.strip(" ")
    # Decapitalize letters
    new_series = new_series.apply(lambda x: str(x).lower())
    return new_series

def tokenize_filter(sentence):
    # Define stopwords
    stop_words = set(stopwords.words('english')) 
    ## Add personalised stop words
    stop_words |= set(["london", "food", "drink", "restaurant"])
    # Filter the sentence
    word_tokens = word_tokenize(sentence) 
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    return (word_tokens, filtered_sentence)

def stem_review(tokens):
    porter = PorterStemmer()
    return tokens.apply(lambda x: [porter.stem(x[i]) for i in range(len(x))])

def preprocess_data(data):
    df = data
    df["clean_content"] = basic_cleaning(df["content"])
    df["tokenized_content"] = df["clean_content"].apply(lambda x: tokenize_filter(x)[1])
    df["stemmed_reviews"] = stem_review(df["tokenized_content"])
    return df[['stemmed_reviews', 'rating']]

df = preprocess_data(load_data())
df.head()

Unnamed: 0,stemmed_reviews,rating
0,"[decid, visit, windsor, castl, way, back, sw, ...",5
1,"[good, although, rather, small, portion, howev...",2
2,"[look, somewher, budget, go, eat, overnight, w...",5
3,"[good, menu, select, unfortun, stifado, avail,...",4
4,"[pop, last, night, glass, wine, attend, theatr...",3


## 1. Latent semantic indexing (LSI)

Here, we use the functions already defined in the handout and adapt them to our particular case.

In [3]:
reduced_df = df[:1000] # We will use 1000 reviews for now

1) Using "corpora" from gensim to extract vocabulary from a corpus.

In [4]:
def get_dictionary(doc_clean):
    """
    Input  : clean document
    Purpose: get the whole associated vocabulary
    Output : term dictionary
    """
    # Creating the term dictionary of our corpus, where every unique term is assigned an index. dictionary = corpora.Dictionary(doc_clean)
    return corpora.Dictionary(doc_clean)

2) Buildind TF matrix useful for LSI.

In [5]:
def get_TF_matrix(doc_clean, useTransfertDict=True):
    """
    Input  : clean document
    Purpose: get the term frequency matrix from a corpus
    Output : Document Term Frequency Matrix
    """
    # Creating the term dictionary of our corpus, where every unique term is assigned an index. 
    dictionary = corpora.Dictionary(doc_clean)
        
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    return [dictionary.doc2bow(doc) for doc in doc_clean]

3) Create an LSI model using Gensim and obtain LSI word embedding for our reviews.

In [6]:
def create_gensim_lsi_model(clean_documents_list, k=None):
    """
    Input  : clean document, dictionary
    Purpose: create LSI model (Latent Semantic Indexing) 
             from corpus and dictionary
    Output : return LSI model
    """
    
    #LSI model consists of Singular Value Decomposition (SVD) of
    #Term Document Matrix M: M = T x S x D'
    #and dimensionality reductions of T, S and D ("Derivation")
    
    dictionary = get_dictionary(clean_documents_list)
    
    corpus = get_TF_matrix(clean_documents_list)
    if k is not None:
        lsi_model = LsiModel(
                corpus=corpus,
                id2word=dictionary,
                num_topics=int(k)
                )
    else:
            lsi_model = LsiModel(
            corpus=corpus,
            id2word=dictionary 
            )
    #print(); print(); print("="*20, "Training LSI model report", "="*20); print()
    
    #print("Initial TF matrix (NwordsXNdocuments): ")
    TF = []
    for x in corpus:
        wrds = [0 for i in range(len(dictionary))]
        for i, j in x: wrds[i] = j
        TF.append(wrds)
    #print(pd.np.transpose(TF))
    #print()
    #print("Derivation of Term Matrix T of Training Document Word Stems: ")
    #print(lsi_model.get_topics())
    #print()
    #Derivation of Term Document Matrix of Training Document Word Stems = M' x [Derivation of T]
    #print("LSI Vectors of 10 Training Document Word Stems: ")
    #print([lsi_model[document_word_stems] for document_word_stems in corpus[:10]])
    #print("="*70); print(); print()
    return lsi_model

def get_lsi_vector(lsi_model, clean_text, dictionary):
    return lsi_model[dictionary.doc2bow(clean_text)]

def get_lsi_matrix(lsi_model, corpus_TFmatrix):
    return np.array(lsi_model[corpus_TFmatrix])[:,:,1]

# create lsi model
lsi_model = create_gensim_lsi_model(reduced_df.stemmed_reviews)
# build encoded corpus (TF matrix)
corpus_TFmatrix = get_TF_matrix(reduced_df.stemmed_reviews)
# obtain the LSI representation of our reviews in the form of a matrix
lsi_matrix = get_lsi_matrix(lsi_model, corpus_TFmatrix)
print(lsi_matrix.shape)

(1000, 200)


## 2. Word2Vec

1) Instantiating the model from gensim models.

In [8]:
def create_word2vec_model():
    """
    Input  : none
    Purpose: create word2vec model from corpus
    Output : term dictionary
    """
    path = get_tmpfile("word2vec.model")
    model = gensim.models.Word2Vec(size=300, 
                                   window=3, 
                                   min_count=5, 
                                   workers=4, 
                                   seed=1, 
                                   iter=50)
    return model

2) Building model vocabulary.

In [9]:
def init_vocab(model, clean_documents_list):
    """
    Input  : model and clean documents list
    Purpose: instantiate model vocabulary from clean documents list
    Output : model with vocabulary
    """
    init_vocab = list(map(lambda review: review, clean_documents_list["stemmed_reviews"]))
    model.build_vocab(init_vocab)
    return model

3) Training the model on the reviews, then saving it.

In [10]:
def train_word2vec_model(model, clean_documents_list):
    """
    Input  : model and clean documents list
    Purpose: train model on clean documents list
    Output : trained model
    """
    corpus = list(map(lambda review: review, clean_documents_list["stemmed_reviews"]))
    model.train(corpus, total_examples=model.corpus_count, epochs=model.iter)
    model.save("word2vec.model")
    return model

4) Loading pre-trained model.

In [11]:
def load_trained_model(model_name):
    """
    Input  : trained model name
    Purpose: load trained model
    Output : saved model
    """
    return gensim.models.Word2Vec.load("word2vec.model")

5) Getting embedded matrix from corpus.

In [12]:
def get_word2vec_matrix(model):
    embedding_matrix = dict()
    for word in model.wv.vocab.keys():
        embedding_matrix[word] = list(model.wv[word])
    return pd.DataFrame(embedding_matrix)

Instantiating and training model:

In [13]:
model = create_word2vec_model()
model = init_vocab(model, reduced_df)
model = train_word2vec_model(model, reduced_df)
model = load_trained_model("word2vec.model")

  if __name__ == '__main__':
  


[('superb', 0.7938962578773499),
 ('excel', 0.756244957447052),
 ('outstand', 0.7333143353462219),
 ('throughout', 0.7261067628860474),
 ('buzz', 0.716883659362793),
 ('love', 0.6926382780075073),
 ('impecc', 0.6919640302658081),
 ('good', 0.6886517405509949),
 ('faultless', 0.68587726354599),
 ('brilliant', 0.6840408444404602)]

Testing for an example:

In [None]:
model.most_similar("great", topn=10)

Retrieving embedding matrix:

In [14]:
corpus_word2vecMatrix = get_word2vec_matrix(model)
print(corpus_word2vecMatrix.shape)
corpus_word2vecMatrix.head()

(300, 1278)


Unnamed: 0,decid,visit,way,back,england,saw,establish,english,beer,thought,...,appetit,broken,pipe,45,court,dairi,sorbet,mother,effort,aunt
0,-0.323729,0.139122,-0.11771,-0.362096,-0.097604,-0.113042,-0.170597,-0.191632,-0.074101,0.14266,...,-0.248556,-0.150945,-0.252921,-0.245601,-0.113153,-0.06301,-0.255565,-0.213441,-0.214218,-0.175309
1,0.019365,-0.139541,-0.184012,-0.334717,0.065058,0.004493,-0.00754,0.005203,0.126799,0.200532,...,0.06677,0.123345,0.126601,0.100687,0.015741,0.044488,0.172349,0.007982,0.029864,0.075839
2,-0.033673,-0.106202,0.016617,0.010697,-0.060481,-0.06642,-0.230945,0.074824,-0.284938,-0.388685,...,-0.02995,-0.098206,-0.014048,-0.097435,-0.072245,-0.153745,-0.00693,0.008833,-0.062296,-0.083426
3,-0.249941,0.699008,-0.216362,-0.066214,-0.004977,-0.001074,-0.041546,-0.224951,-0.515194,-0.311595,...,-0.189921,-0.166494,-0.143823,-0.098051,-0.0299,-0.122641,-0.247966,-0.009082,0.000888,-0.167865
4,0.270207,-0.196001,0.018785,0.607687,0.251987,-0.003024,0.344423,0.216505,0.289487,0.075034,...,0.271758,0.069775,0.201372,0.111284,0.113445,0.010921,0.161079,0.166344,0.136033,0.041274


## 4. Comparison of the performance of classifiers on different types of embedding

First, we split train/test data.

In [15]:
df_dataset = reduced_df
n = len(df_dataset)
df_dataset.sample(n=n, random_state=16)
n = int(2 * n / 3)
df_dataset_train = df_dataset[:n]
df_dataset_test = df_dataset[n:]
print("Split train/test: ", df_dataset_train.shape, "VS", df_dataset_test.shape)
corpus_TFmatrix_train = get_TF_matrix(df_dataset_train.stemmed_reviews)
corpus_TFmatrix_test = get_TF_matrix(df_dataset_test.stemmed_reviews)
y_train=df_dataset_train.rating
y_test=df_dataset_test.rating

Split train/test:  (666, 2) VS (334, 2)


### Cosine distance classification

Here, we use the classifier defined in the handout on our dataset.

In [16]:
def distance_classifier_cosine_traning(lsi_vector_trainDB):
    """
    Input  : LSI vectors
    Purpose: calculate cosine similarity matrix
    Output : return similarity matrix
    """
    #calculate cosine similarity matrix for all training document LSI vectors
    return similarities.MatrixSimilarity(lsi_vector_trainDB)

def distance_classifier_cosine_test(classification_model, training_data, test_doc_lsi_vector, N=1):
    """
    Input  : trained classifier model, the training data (list of descriptions), lsi vectors of a document and N nearest document in the training data base
    Purpose: calculate cosine similarity matrix against all training samples
    Output : return nearest N document and classes
    """
    cosine_similarities = classification_model[test_doc_lsi_vector]

    most_similar_document_test = training_data[np.argmax(cosine_similarities)]

    #calculate cosine similarity matrix for all training document LSI vectors
    return most_similar_document_test

def reco_rate(ref_labels, predicted_labels):
    commun_labels = (pd.np.array(ref_labels)==pd.np.array(predicted_labels)).sum()
    return 100 * commun_labels / len(ref_labels)

classification_model = distance_classifier_cosine_traning(lsi_model[corpus_TFmatrix_train])
classification_model

<gensim.similarities.docsim.MatrixSimilarity at 0x7f07deb36d68>

We test on train data:

In [17]:
dictionary = get_dictionary(df_dataset.stemmed_reviews)
predicted_ratings = [distance_classifier_cosine_test(classification_model, 
                                df_dataset_train.rating, 
                                get_lsi_vector(lsi_model, df_dataset_train.stemmed_reviews.iloc[i]))
                                for i in range(df_dataset_train.shape[0])]

print("Classifier performances on train DB: %.2f" % reco_rate(df_dataset_train.rating, predicted_ratings), "%")

Classifier performances on train DB: 100.00 %




Unsurprinsingly, the cosine distance classifier performs perfectly on test data because it by definition the distance of each stemmed review to itself is zero and the decision is easily made. Let's see how it performs on test data.

In [18]:
predicted_ratings_test = [distance_classifier_cosine_test(classification_model, 
                                 df_dataset_train.rating, 
                                 get_lsi_vector(lsi_model, 
                                               df_dataset_test.stemmed_reviews.iloc[i]
                                              ))
                   for i in range(df_dataset_test.shape[0])]

print("Classifier performances on test DB: %.2f"%(reco_rate(df_dataset_test.rating, predicted_ratings_test)), "%")


Classifier performances on test DB: 50.00 %




Take a look at the test reviews and their ratings: 

In [19]:
df_dataset_test[:10]

Unnamed: 0,stemmed_reviews,rating
666,"[pub, great, quit, cheap, consid, staff, atten...",3
667,"[best, thai, eaten, lobster, green, curri, fan...",5
668,"[one, arriv, love, place, understand, one, nic...",5
669,"[welcom, mouth, water, smell, steak, enter, ga...",5
670,"[went, waffl, jack, wretch, hangov, love, waff...",5
671,"[husband, decid, go, winter, garden, anniversa...",5
672,"[say, enough, excel, gripe, sauc, turbot, stro...",5
673,"[book, birthday, presentup, moment, didnt, kno...",4
674,"[good, surpris, went, curiou, sinc, friend, mi...",4
675,"[burger, delici, best, valu, price, eat, burge...",4


Here is our prediction:

In [20]:
predicted_ratings_test[:10]

[5, 5, 4, 5, 5, 5, 5, 4, 5, 4]

As we see, the cosine classifier performs very poorly on test data using LSI embedding, we switch to random forest classification

### Random Forest Classification

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [22]:
def model_training(clf,X_train,y_train):
    """
    Input  : LSI vectors
    Purpose: train classification model
    Output : return trained classifier
    """
    clf.fit(X_train,y_train)
    return clf

def model_predictions(clf, X_test):
    """
    Input  : trained classifier model, the test set
    Purpose: make predictions on test data
    Output : predictions
    """
    y_pred=clf.predict(X_test)
    
    return y_pred

def reco_rate(ref_labels, predicted_labels):
    commun_labels = (pd.np.array(ref_labels)==pd.np.array(predicted_labels)).sum()
    return 100 * commun_labels / len(ref_labels)

Hyperparameters to tune:

In [23]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}


#### LSI

Use the random grid to search for best hyperparameters:

In [30]:
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=41, n_jobs = -1)
# Fit the random search model
rf_random.fit(get_lsi_matrix(lsi_model, corpus_TFmatrix_train), y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   23.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  4.7min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

Printing best parameters:

In [31]:
rf_random.best_params_

{'n_estimators': 200,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 100,
 'bootstrap': False}

Predictions:

In [36]:
# We test on train data
predicted_ratings = model_predictions(rf_random, get_lsi_matrix(lsi_model, corpus_TFmatrix_train))
print("Classifier performances on train DB: %.2f" % reco_rate(df_dataset_train.rating, predicted_ratings), "%")

# We test on test Data
predicted_ratings_test = model_predictions(clf, get_lsi_matrix(lsi_model, corpus_TFmatrix_test))
print("Classifier performances on test DB: %.2f"%(reco_rate(df_dataset_test.rating, predicted_ratings_test)), "%")

Classifier performances on train DB: 100.00 %
Classifier performances on test DB: 53.59 %




#### Word2Vec

In [24]:
def document_vector(model, doc):
    ##remove words that aren't in vocabulary
    doc = [word for word in doc if word in model.wv.vocab.keys()]
    return np.sum(model[doc], axis=0)

In [25]:
x = []
for doc in df_dataset_train.stemmed_reviews: # append the vector for each document
    x.append(document_vector(model, doc))
X_train = np.array(x)

x = []
for doc in df_dataset_test.stemmed_reviews: # append the vector for each document
    x.append(document_vector(model, doc))
X_test = np.array(x)

  after removing the cwd from sys.path.


Use the random grid to search for best hyperparameters:

In [26]:
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train,y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   32.4s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  6.4min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

Printing best parameters:

In [27]:
rf_random.best_params_

{'n_estimators': 200,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': 90,
 'bootstrap': False}

Predictions:

In [28]:
# We test on train data
predicted_ratings = model_predictions(rf_random, X_train)
print("Classifier performances on train DB: %.2f" % reco_rate(df_dataset_train.rating, predicted_ratings), "%")

# We test on test Data
predicted_ratings_test = model_predictions(rf_random, X_test)
print("Classifier performances on test DB: %.2f"%(reco_rate(df_dataset_test.rating, predicted_ratings_test)), "%")

Classifier performances on train DB: 100.00 %
Classifier performances on test DB: 58.08 %




We see that our tuned Random Forests do not perfom well, we believe this is due to the reduced number of reviews (1000) we are using. Increasing the number of reviews and trying to tune RFs on those reviews requires computing power that our machines can't handle.