# Homework 3 - Word Embedding & Rating Prediction

## 0. Introduction

### Goal and outline

In this notebook, our goal is to build a simple classification model and evaluate its performance when using the embedding matrices produced by three different embedding techniques: LSI, Word2Vec and FastText. 

In the following three sections, we build the embedding matrices using each technique. Then, in section 4, we fit a cosine similarity classifier and a random forest classifier and evaluate how well each one predicts the review ratings (the target variable) from each embedding matrix.

### Importing useful libraries

In [36]:
# Data manipulation
import numpy as np 
import pandas as pd 
import os.path

# Text manipulation
from string import punctuation
from wordcloud import WordCloud

# NLP Modules
import nltk
from nltk import word_tokenize
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
from gensim import corpora, similarities
from gensim.test.utils import common_dictionary, common_corpus
from gensim.models import LsiModel
from nltk.tokenize import RegexpTokenizer
from gensim.models.coherencemodel import CoherenceModel

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt


# Extra imports

# Uncomment the following lines if you haven't installed gensim and nltk
#!pip3 install gensim
#!pip3 install nltk

# Downloading useful nltk packages if not already done
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')

### Loading and preprocessing the data

In [37]:
# Here we use functions we defined in the previous homework; running this cell takes a few minutes

def load_data(DATA_PATH = "data/", file_name = "raw_scrapped_data.csv.gzip"):
    """
    Input  : path of where data is stored
    Purpose: loading csv file of reviews
    Output : data frame of reviews with associated ratings
    """    
    # Path of the file
    file_path = DATA_PATH + file_name

    # Reading data
    scrapped_data = pd.read_csv(file_path, compression='gzip')
    data = scrapped_data[['content', 'rating']]
    return data 

def basic_cleaning(series):
    # Remove punctuation
    new_series = series.str.replace(r'[^\w\s]', '', regex=True)
    # Strip leading and trailing spaces
    new_series = new_series.str.strip(" ")
    # Decapitalize letters
    new_series = new_series.apply(lambda x: str(x).lower())
    return new_series

def tokenize_filter(sentence):
    # Define stopwords
    stop_words = set(stopwords.words('english')) 
    ## Add personalised stop words
    stop_words |= set(["london", "food", "drink", "restaurant"])
    # Filter the sentence
    word_tokens = word_tokenize(sentence) 
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return (word_tokens, filtered_sentence)

def stem_review(tokens):
    porter = PorterStemmer()
    return tokens.apply(lambda x: [porter.stem(x[i]) for i in range(len(x))])

def preprocess_data(data):
    # Work on a copy so that adding columns does not modify a slice of the raw data
    df = data.copy()
    df["clean_content"] = basic_cleaning(df["content"])
    df["tokenized_content"] = df["clean_content"].apply(lambda x: tokenize_filter(x)[1])
    df["stemmed_reviews"] = stem_review(df["tokenized_content"])
    return df[['stemmed_reviews', 'rating']]

df = preprocess_data(load_data())
df.head()

Unnamed: 0,stemmed_reviews,rating
0,"[decid, visit, windsor, castl, way, back, sw, ...",5
1,"[good, although, rather, small, portion, howev...",2
2,"[look, somewher, budget, go, eat, overnight, w...",5
3,"[good, menu, select, unfortun, stifado, avail,...",4
4,"[pop, last, night, glass, wine, attend, theatr...",3
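To make the effect of the cleaning and filtering steps concrete, here is a minimal, dependency-free sketch applied to a single sentence. It is only an illustration under stated assumptions: `STOP_WORDS` is a small hypothetical sample rather than NLTK's English list, `clean_and_filter` is an illustrative name, whitespace splitting stands in for `word_tokenize`, and stemming is left out.

```python
import re

# Hypothetical, tiny stop-word sample; the notebook uses NLTK's English list
# extended with domain words such as "london" and "restaurant".
STOP_WORDS = {"the", "was", "and", "a", "in", "london", "food", "restaurant"}

def clean_and_filter(text):
    """Drop punctuation, lowercase, split on whitespace, remove stop words."""
    text = re.sub(r"[^\w\s]", "", text).strip().lower()
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = clean_and_filter("The food was GREAT, and the service in London was friendly!")
print(tokens)
```

Only the content-bearing words survive, which is what the stemmer then receives in the notebook's pipeline.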


## 1. Latent semantic indexing (LSI)

Here, we reuse the functions defined in the handout and adapt them to our dataset.
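As background for these functions, LSI can be viewed as a truncated singular value decomposition of the term-document matrix. The sketch below illustrates the idea with NumPy on a toy matrix; the counts, the term names in the comments and the choice `k = 2` are assumptions made for the example, and this is not how gensim implements `LsiModel` internally.

```python
import numpy as np

# Toy term-document matrix (hypothetical counts): rows are terms, columns are documents
M = np.array([
    [2, 1, 0, 0],   # "pizza"
    [1, 2, 0, 0],   # "pasta"
    [0, 0, 3, 1],   # "beer"
    [0, 0, 1, 2],   # "wine"
], dtype=float)

k = 2  # number of latent topics kept

# SVD: M = T @ diag(S) @ Dt, with singular values in decreasing order
T, S, Dt = np.linalg.svd(M, full_matrices=False)
T_k, S_k, Dt_k = T[:, :k], S[:k], Dt[:k, :]

# Rank-k approximation of M
M_k = T_k @ np.diag(S_k) @ Dt_k

# k-dimensional vector of each document: its term counts projected onto the k term vectors
doc_vectors = T_k.T @ M            # shape (k, n_documents)

# A new document is embedded the same way
new_doc = np.array([1, 1, 0, 0], dtype=float)
new_doc_vector = T_k.T @ new_doc   # shape (k,)

print(doc_vectors.shape, new_doc_vector.shape)
```

Documents that share vocabulary end up close to each other in the reduced space, which is what the similarity-based classifier in section 4 relies on.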

In [77]:
# We will use 100 reviews for now
reduced_df = df[:100]

In [78]:
# 1. Using "corpora" from gensim to extract vocabulary from a corpus
def get_dictionary(doc_clean):
    """
    Input  : clean document
    Purpose: get the whole associated vocabulary
    Output : term dictionary
    """
    # Creating the term dictionary of our corpus, where every unique term is assigned an index.
    return corpora.Dictionary(doc_clean)

In [79]:
# 2. Building the TF matrix used by LSI
def get_TF_matrix(doc_clean, useTransfertDict=True):
    """
    Input  : clean document
    Purpose: get the term frequency matrix from a corpus
    Output : Document Term Frequency Matrix
    """
    # Creating the term dictionary of our corpus, where every unique term is assigned an index.
    dictionary = corpora.Dictionary(doc_clean)
        
    # Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
    return [dictionary.doc2bow(doc) for doc in doc_clean]
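The bag-of-words encoding used above can be sketched in plain Python. This only mirrors the idea behind a term dictionary and `doc2bow`; the helper names `build_vocabulary` and `to_bag_of_words` are hypothetical, and the id assignment order is an assumption that need not match gensim's.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["good", "wine", "good"], ["wine", "list"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(d, vocab) for d in docs]
print(vocab)
print(bows)
```

The ids only have meaning relative to the vocabulary that produced them, so the same dictionary should be used for training the model and for encoding every document that is later passed to it.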

In [80]:
# 3. Create an LSI model using Gensim
def create_gensim_lsi_model(clean_documents_list, k=None):
    """
    Input  : list of clean (tokenized) documents, optional number of topics k
    Purpose: create LSI model (Latent Semantic Indexing) 
             from corpus and dictionary
    Output : return LSI model
    """
    
    #LSI model consists of Singular Value Decomposition (SVD) of
    #Term Document Matrix M: M = T x S x D'
    #and dimensionality reductions of T, S and D ("Derivation")
    
    dictionary = get_dictionary(clean_documents_list)
    
    corpus = get_TF_matrix(clean_documents_list)
    if k is not None:
        lsi_model = LsiModel(
                corpus=corpus,
                id2word=dictionary,
                num_topics=int(k)
                )
    else:
        lsi_model = LsiModel(
                corpus=corpus,
                id2word=dictionary
                )
    print(); print(); print("="*20, "Training LSI model report", "="*20); print()
    
    print("Initial TF matrix (NwordsXNdocuments): ")
    TF = []
    for x in corpus:
        wrds = [0 for i in range(len(dictionary))]
        for i, j in x: wrds[i] = j
        TF.append(wrds)
    print(np.transpose(TF))
    print()
    print("Derivation of Term Matrix T of Training Document Word Stems: ")
    print(lsi_model.get_topics())
    print()
    #Derivation of Term Document Matrix of Training Document Word Stems = M' x [Derivation of T]
    print("LSI Vectors of Training Document Word Stems: ")
    print([lsi_model[document_word_stems] for document_word_stems in corpus])
    print("="*70); print(); print()
    return lsi_model

def get_lsi_vector(lsi_model, clean_text):
    # Encode the text with the dictionary the model was trained with
    return lsi_model[lsi_model.id2word.doc2bow(clean_text)]

# create lsi model
lsi_model = create_gensim_lsi_model(reduced_df.stemmed_reviews)
# build encoded corpus (TF matrix)
corpus_TFmatrix = get_TF_matrix(reduced_df.stemmed_reviews)




Initial TF matrix (NwordsXNdocuments): 
[[1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [2 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 2]
 [0 0 0 ... 0 0 1]]

Derivation of Term Matrix T of Training Document Word Stems: 
[[ 3.27689715e-03  3.27689715e-03  9.09272852e-03 ...  8.67302505e-03
   1.73460501e-02  8.67302505e-03]
 [-5.97654184e-06 -5.97654184e-06 -3.52539124e-03 ... -4.21252894e-03
  -8.42505788e-03 -4.21252894e-03]
 [ 1.92095118e-03  1.92095118e-03  4.60127091e-03 ... -4.32058486e-03
  -8.64116973e-03 -4.32058486e-03]
 ...
 [-5.17215841e-03 -5.17215841e-03 -1.22328452e-02 ...  4.46598922e-03
   8.93197844e-03  4.46598922e-03]
 [ 1.23908936e-03  1.23908936e-03 -1.37522038e-03 ...  2.30560508e-03
   4.61121016e-03  2.30560508e-03]
 [-1.58974874e-03 -1.58974874e-03  5.24729898e-04 ...  1.16595285e-03
   2.33190570e-03  1.16595285e-03]]

LSI Vectors of Training Document Word Stems: 
[[(0, 2.084081844952941), (1, -0.00134612000816209), (2, 0.39879753049062794), (3, -2.02853

## 2. Word2Vec

## 3. FastText

## 4. Comparison of the performance of a Random Forest on different types of embedding

In [88]:
# First we split train/test data

df_dataset = reduced_df
n = len(df_dataset)
# The split is sequential (no shuffling): the first two thirds of the rows are used for training
n = int(2 * n / 3)
df_dataset_train = df_dataset[:n]
df_dataset_test = df_dataset[n:]
print("Split train/test: ", df_dataset_train.shape, "VS", df_dataset_test.shape)
corpus_TFmatrix_train = get_TF_matrix(df_dataset_train.stemmed_reviews)
corpus_TFmatrix_test = get_TF_matrix(df_dataset_test.stemmed_reviews)

Split train/test:  (66, 2) VS (34, 2)
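Note that pandas' `DataFrame.sample` returns a new, shuffled frame and leaves the original untouched, so a shuffled split requires keeping its result. As an alternative, here is a minimal sketch of a shuffled 2/3 - 1/3 split based on an index permutation; the function name `shuffled_split_indices` and the seed are illustrative assumptions, and the resulting index arrays could be applied to a data frame with `.iloc`.

```python
import numpy as np

def shuffled_split_indices(n_rows, train_fraction=2 / 3, seed=16):
    """Return disjoint, shuffled arrays of row positions for train and test."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_rows)
    n_train = int(train_fraction * n_rows)
    return permutation[:n_train], permutation[n_train:]

train_idx, test_idx = shuffled_split_indices(100)
print(len(train_idx), len(test_idx))
```

With a shuffled split, the documents should still be encoded with the dictionary the LSI model was trained with, since term ids depend on the dictionary.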


### Cosine distance classification

In [82]:
# Here we use the classifier defined in the handout on our dataset

def distance_classifier_cosine_traning(lsi_vector_trainDB):
    """
    Input  : LSI vectors
    Purpose: calculate cosine similarity matrix
    Output : return similarity matrix
    """
    #calculate cosine similarity matrix for all training document LSI vectors
    return similarities.MatrixSimilarity(lsi_vector_trainDB)

def distance_classifier_cosine_test(classification_model, training_data, test_doc_lsi_vector, N=1):
    """
    Input  : trained similarity index, labels of the training documents,
             LSI vector of the document to classify, N (currently unused:
             only the single nearest document is considered)
    Purpose: compute the cosine similarity to every training document
    Output : label of the most similar training document
    """
    # Cosine similarity between the query and every training document
    cosine_similarities = classification_model[test_doc_lsi_vector]

    # Positional lookup of the label of the closest training document
    return np.asarray(training_data)[np.argmax(cosine_similarities)]

def reco_rate(ref_labels, predicted_labels):
    # Percentage of predictions that match the reference labels
    common_labels = (np.array(ref_labels) == np.array(predicted_labels)).sum()
    return 100 * common_labels / len(ref_labels)

classification_model = distance_classifier_cosine_traning(lsi_model[corpus_TFmatrix_train])
classification_model

<gensim.similarities.docsim.MatrixSimilarity at 0x1a4ede76a0>

In [89]:
# We test on train data
dictionary = get_dictionary(df_dataset.stemmed_reviews)
predicted_ratings = [distance_classifier_cosine_test(classification_model, 
                                df_dataset_train.rating, 
                                get_lsi_vector(lsi_model, df_dataset_train.stemmed_reviews.iloc[i]))
                                for i in range(df_dataset_train.shape[0])]

print("Classifier performances on train DB: %.2f" % reco_rate(df_dataset_train.rating, predicted_ratings), "%")

Classifier performances on train DB: 100.00 %


Unsurprisingly, the cosine distance classifier performs perfectly on the training data: by definition, the cosine distance of each stemmed review to itself is zero, so every review is matched to itself and receives its own rating. Let's see how it performs on test data.
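The perfect training score can be reproduced with a minimal nearest-neighbour sketch in NumPy. The vectors and labels below are made up for the illustration, and `nearest_label_by_cosine` is a hypothetical helper rather than the gensim-based classifier used in this notebook. The argument assumes that no two training vectors pointing in the same direction carry different labels.

```python
import numpy as np

def nearest_label_by_cosine(train_vectors, train_labels, query):
    """Return the label of the training vector with the highest cosine similarity to query."""
    train_vectors = np.asarray(train_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(query)
    sims = train_vectors @ query / norms
    return train_labels[int(np.argmax(sims))]

# Made-up 2-dimensional embeddings and ratings
train_vectors = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 1.0]])
train_labels = [5, 2, 4]

# Each training vector has similarity 1 with itself, so it retrieves its own label
train_predictions = [nearest_label_by_cosine(train_vectors, train_labels, v)
                     for v in train_vectors]
print(train_predictions == train_labels)

# An unseen vector has no exact match and simply inherits its neighbour's label
unseen = np.array([1.8, 0.5])
print(nearest_label_by_cosine(train_vectors, train_labels, unseen))
```

The training score therefore says nothing about generalization; only the test score does.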

In [91]:
# We test on test Data
predicted_ratings_test = [distance_classifier_cosine_test(classification_model, 
                                 df_dataset_train.rating, 
                                 get_lsi_vector(lsi_model, 
                                               df_dataset_test.stemmed_reviews.iloc[i]
                                              ))
                   for i in range(df_dataset_test.shape[0])]

print("Classifier performances on test DB: %.2f"%(reco_rate(df_dataset_test.rating, predicted_ratings_test)), "%")


Classifier performances on test DB: 41.18 %


In [92]:
# We take a look at the test reviews and their ratings 
df_dataset_test[:10]

Unnamed: 0,stemmed_reviews,rating
66,"[select, lunch, south, african, busi, guest, c...",4
67,"[miss, love, indonesian, cuisin, holiday, hop,...",4
68,"[great, good, wine, list, attent, courteou, se...",5
69,"[friend, went, weekend, absolut, love, believ,...",4
70,"[first, time, visit, sceneri, took, breath, aw...",4
71,"[love, pint, beer, littl, els, call, windsor, ...",3
72,"[spanish, tapa, best, great, locat, realli, go...",5
73,"[time, lunch, work, peopl, last, time, worst, ...",3
74,"[stay, close, good, littl, place, good, locat,...",4
75,"[got, introduc, friend, hook, triangl, love, h...",5


In [93]:
# Here is our prediction
predicted_ratings_test[:10]

[5, 5, 5, 3, 5, 3, 5, 2, 5, 5]

As we can see, the cosine classifier performs poorly on the test data with the LSI embedding (41.18% accuracy), so we switch to random forest classification.

### Random Forest Classification