# Learning from Big Data: Module 1 - Natural Language Processing

#### Session 3 - LDA and Word2Vec

# Introduction
#### This file illustrates `LDA` (Latent Dirichlet Allocation) and `Word2Vec`. 

# 1. Loading Packages

In [1]:
# Loading the required packages
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# 2. Loading the Reviews

Next, we load the review data. **Note** that we pass `encoding='ISO-8859-1'` to `pd.read_csv()`: this lets pandas interpret non-ASCII symbols correctly, so the review text is read properly for further processing.
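
To see why the encoding argument matters, here is a minimal sketch using only the Python standard library. The word "café" is a made-up example, not taken from the review file; the point is that the same bytes decode cleanly under ISO-8859-1 but fail under UTF-8.

```python
# The same bytes decode differently depending on the assumed encoding.
# "caf\u00e9" ("cafe" with an accented e) is a hypothetical example string.
raw_bytes = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9'

# Decoding with the encoding the bytes were written in recovers the text.
decoded_latin1 = raw_bytes.decode("iso-8859-1")
print(decoded_latin1)

# Decoding the same bytes as UTF-8 raises an error, because 0xE9 on its own
# is not a valid UTF-8 sequence.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

Keep in mind that ISO-8859-1 maps every byte to some character, so it never raises an error; if a file was actually saved as UTF-8, reading it as ISO-8859-1 would silently produce wrong characters instead.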

In [2]:
# Loading the review data.
reviews_raw = pd.read_csv('reviews_tiny.csv', encoding='ISO-8859-1')
reviews_raw = reviews_raw[
    ['movie_name',
     'review_code',
     'reviewer',
     'review_date',
     'num_eval',
     'prob_sentiment',
     'words_in_lexicon_sentiment_and_review',
     'ratio_helpful',
     'raters',
     'prob_storyline',
     'prob_acting',
     'prob_sound_visual',
     'full_text',
     'processed_text',
     'release_date',
     'first_week_box_office',
     'MPAA',
     'studio',
     'num_theaters']
]

TOT_REVIEWS = len(reviews_raw)

### 2.1 Calculating the likelihoods with your own content likelihood file

In [4]:
# TODO: compute the content likelihoods for all the words in the training data...
likelihoods_content = pd.read_csv('likelihood_content.csv')

### 2.2 Inspecting the list of words to be passed to LDA:

In [5]:
# Converting the first column to a list of strings
lexicon_content = likelihoods_content.iloc[:, 0].values.astype('U')
print(lexicon_content)

['story' 'hero' 'world' 'character' 'moral' 'audience' 'opponent' 'ofthe'
 'scene' 'one' 'not' 'characters' 'plot' 'will' 'can' 'also' 'man'
 'desire' 'stories' 'two' 'time' 'see' 'line' 'great' 'must' 'good' 'way'
 'revelation' 'ofa' 'first' 'need' 'make' 'michael' 'change' 'house'
 'heros' 'action' 'main' 'get' 'love' 'dialogue' 'selfrevelation' 'many'
 'just' 'technique' 'end' 'structure' 'steps' 'tells' 'life' 'argument'
 'symbol' 'key' 'george' 'wants' 'anatomy' 'only' 'theme' 'use' 'well'
 'even' 'single' 'place' 'principle' 'opposition' 'comes' 'look' 'values'
 'rick' 'storyteller' 'new' 'point' 'writers' 'big' 'web' 'within' 'says'
 'premise' 'scenes' 'people' 'conflict' 'human' 'weakness' 'back' 'take'
 'form' 'down' 'beginning' 'give' 'come' 'show' 'designing' 'doesnt'
 'makes' 'king' 'three' 'example' 'family' 'plan' 'know']


# 3. Unsupervised Learning: Latent Dirichlet Allocation (LDA)

In [6]:
# Creating a CountVectorizer to create the Document-Term matrix
vectorizer = CountVectorizer(analyzer='word',       
                             vocabulary={word: i for i, word in enumerate(lexicon_content)}, 
                             stop_words='english',             
                             lowercase=True,                   
                             token_pattern='[a-zA-Z0-9]{3,}',  
                            )

# Applying the vectorizer
data_vectorized = vectorizer.fit_transform(reviews_raw['processed_text'])
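
To make the Document-Term matrix concrete, here is a small sketch on toy data. The three-word lexicon and the two sentences are hypothetical and only serve to show how a fixed `vocabulary` determines the columns: each row is a document, each column a lexicon word, and each cell a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-lexicon and documents (not taken from the review data)
toy_lexicon = ["story", "hero", "plot"]
toy_docs = ["The hero saves the story", "A plot with no hero and no Hero"]

# With a fixed vocabulary, words outside the lexicon are simply ignored
toy_vectorizer = CountVectorizer(
    vocabulary={word: i for i, word in enumerate(toy_lexicon)},
    lowercase=True,
    token_pattern="[a-zA-Z0-9]{3,}",
)
toy_dtm = toy_vectorizer.fit_transform(toy_docs)

# Rows: documents; columns: "story", "hero", "plot"
print(toy_dtm.toarray())
```

The second document counts "hero" twice because `lowercase=True` maps "Hero" and "hero" to the same token.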

**Next**, we will set the `LDA` parameters.
+ `k` is the number of topics we ask LDA to estimate. In supervised learning, we set it equal to 3. In this example, we arbitrarily set `k` equal to 10.
+ `SEED` is for replicability (i.e., obtaining the same results every time the code is run).
+ `ITER` sets the maximum number of iterations (passes over the data) for the variational Bayes (EM-style) algorithm used by sklearn's LDA implementation.
    + In the unlikely case you have a warning of "no convergence", you may increase `ITER` to 2000 or 4000.

In [7]:
# Setting the LDA parameters
SEED = 100
ITER = 1000
k = 10

#### Tip: choosing which `k` to use in LDA is a **model selection problem**.
Typically, the best approach is to fit a model for each candidate value of `k`, save each model's log-likelihood, and choose the `k` that produces the highest log-likelihood.
+ The `LatentDirichletAllocation` object in sklearn has a method called `score`, which returns the (approximate) log-likelihood.
+ The `score` method can be used after the model has been fitted, as follows:
  + `loglikelihood_k = lda_model_k.score(data_vectorized)`
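
The tip above can be sketched as a simple loop. This is a minimal illustration on synthetic counts, not the review data: `toy_counts`, `candidate_ks`, and the small `max_iter` are assumptions chosen so that the example runs quickly.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term counts (60 documents, 20 words), for illustration only
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=1.0, size=(60, 20))

# Hypothetical grid of candidate values for k
candidate_ks = [2, 3, 5]
loglik_by_k = {}

for k_candidate in candidate_ks:
    # Fit one model per candidate k (few iterations to keep the sketch fast)
    lda_k = LatentDirichletAllocation(n_components=k_candidate,
                                      max_iter=20,
                                      random_state=100)
    lda_k.fit(toy_counts)

    # Save the approximate log-likelihood of this model
    loglik_by_k[k_candidate] = lda_k.score(toy_counts)

# Pick the k with the highest log-likelihood
best_k = max(loglik_by_k, key=loglik_by_k.get)
print(loglik_by_k)
print(f"Selected k = {best_k}")
```

In the notebook, `toy_counts` would be replaced by `data_vectorized`. A common refinement is to score each model on held-out documents rather than on the matrix it was fitted on.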

**Next**, we will run the LDA and save the model. The model produced by `LatentDirichletAllocation()` is an object of class `LatentDirichletAllocation` (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html). This object gives access to the topics, the score (log-likelihood), and a lot more. To extract these elements, use the attributes and methods listed in the documentation.

In [10]:
# Fitting the LDA Model
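lda_model = LatentDirichletAllocation(n_components=k,            # number of topics
                                      max_iter=ITER,             # maximum number of passes over the data
                                      learning_method='online',  # online variational Bayes
                                      random_state=SEED,         # for replicability
                                      batch_size=128,            # documents per online update
                                      evaluate_every=-1,         # do not evaluate perplexity during training
                                      n_jobs=-1,                 # use all available processors
                                     )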
lda_output = lda_model.fit_transform(data_vectorized)

#### Printing the log-likelihood.

In [11]:
log_likelihood = lda_model.score(data_vectorized)
print(f"The log-likelihood for k = {k} is {log_likelihood:.3f}")

The log-likelihood for k = 10 is -56122.886


#### Inspecting the posteriors.

In [12]:
# The columns/topics names (formatting)
topic_names = ["Topic" + str(i + 1) for i in range(lda_model.n_components)]

# The rows/indices names (formatting)
doc_names = ["Review_" + str(i + 1) for i in range(data_vectorized.shape[0])]

# Posterior probabilities per document by topic
df_document_topic = pd.DataFrame(np.round(lda_output, 3), columns=topic_names, index=doc_names)
print(df_document_topic)

             Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_1      0.005   0.005   0.302   0.005   0.005   0.005   0.173   0.488   
Review_2      0.203   0.003   0.401   0.053   0.003   0.003   0.102   0.161   
Review_3      0.049   0.358   0.003   0.003   0.003   0.003   0.003   0.524   
Review_4      0.003   0.338   0.003   0.003   0.003   0.003   0.041   0.499   
Review_5      0.011   0.011   0.580   0.011   0.011   0.011   0.011   0.173   
...             ...     ...     ...     ...     ...     ...     ...     ...   
Review_996    0.278   0.011   0.011   0.011   0.011   0.011   0.011   0.208   
Review_997    0.008   0.008   0.428   0.106   0.008   0.008   0.314   0.008   
Review_998    0.081   0.648   0.006   0.230   0.006   0.006   0.006   0.006   
Review_999    0.009   0.826   0.009   0.009   0.009   0.009   0.101   0.009   
Review_1000   0.344   0.006   0.006   0.006   0.006   0.006   0.314   0.006   

             Topic9  Topic10  
Review_1      0.005 

In [13]:
# Printing the 999th review (row index 998, since indexing starts at 0)
print(df_document_topic[998:999])

            Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_999   0.009   0.826   0.009   0.009   0.009   0.009   0.101   0.009   

            Topic9  Topic10  
Review_999   0.009    0.009  
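
One way to read the posterior table is to check that each row sums to (approximately) one and to label each review with its most probable topic. The sketch below uses a small hand-made table; `toy_posteriors` is a hypothetical stand-in for `df_document_topic`.

```python
import numpy as np
import pandas as pd

# Hypothetical posterior table with the same layout as df_document_topic
toy_posteriors = pd.DataFrame(
    [[0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8]],
    columns=["Topic1", "Topic2", "Topic3"],
    index=["Review_1", "Review_2"],
)

# Each row is a probability distribution over topics, so it should sum to 1
row_sums = toy_posteriors.sum(axis=1)
print(row_sums)

# The dominant topic of a review is the column with the highest probability
dominant_topic = toy_posteriors.idxmax(axis=1)
print(dominant_topic)
```

With the rounded values printed above, row sums may deviate slightly from one because of the rounding to three decimals.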


#### Tip: for the data splits.
For the data splits, mind the time dimension if you can. It is best to train on a split that temporally precedes the prediction split. That is not always viable, but it is good to be aware of.
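
A time-aware split can be sketched as follows. The toy DataFrame and its dates are made up; the sketch assumes that a date column such as `review_date` can be parsed by `pd.to_datetime`, which may need a `format` argument for the real data.

```python
import pandas as pd

# Hypothetical reviews with unsorted dates (not taken from the review file)
toy_reviews = pd.DataFrame({
    "review_date": ["2015-03-01", "2014-01-15", "2016-07-30", "2014-11-02",
                    "2015-09-09", "2013-05-20", "2016-01-10", "2014-06-06",
                    "2015-12-25", "2013-10-31"],
    "full_text": [f"review text {i}" for i in range(10)],
})

# Parse the dates and sort the reviews chronologically
toy_reviews["review_date"] = pd.to_datetime(toy_reviews["review_date"])
toy_sorted = toy_reviews.sort_values("review_date").reset_index(drop=True)

# The earliest 70% go to training, the most recent 30% to prediction
cutoff = len(toy_sorted) * 7 // 10
train_split = toy_sorted.iloc[:cutoff]
test_split = toy_sorted.iloc[cutoff:]

print(len(train_split), len(test_split))
```

Splitting by position after sorting guarantees that no training review is dated later than any prediction review.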

# 4. Unsupervised Learning: Word Embeddings

Our word embedding example has three steps.
+ First, run Word2Vec to train a model using the training data split.
+ Second, use the trained model to analyze the prediction data split.
+ Third, use the constructed variables to forecast the `box office`.

### Step 1: Training step

In [14]:
full_data = reviews_raw['full_text'].str.lower()

In [15]:
# TODO: use a split of the data here (say 70%) instead of the entire dataset
# train_data = ...

In [16]:
# Tokenizing each sentence into a list of words
full_data = [simple_preprocess(line, deacc=True) for line in full_data]

# Dimensionality of the Word2Vec vectors (called "topics" here by analogy with LDA)
topics_word2vec = 10

# Training the Word2Vec model (sg=0 selects the CBOW architecture)
model = Word2Vec(full_data, vector_size=topics_word2vec, sg=0, epochs=20)

# The embeddings in gensim's Word2Vec model can be accessed via the 'wv' attribute
embeddings = model.wv
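
Word vectors are usually compared with cosine similarity. The sketch below computes it by hand with NumPy on three made-up vectors; the words and numbers in `toy_vectors` are hypothetical and are not output of the model trained above.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "hero":    np.array([0.9, 0.1, 0.0]),
    "villain": np.array([0.8, 0.2, 0.1]),
    "popcorn": np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_hero_villain = cosine_similarity(toy_vectors["hero"], toy_vectors["villain"])
sim_hero_popcorn = cosine_similarity(toy_vectors["hero"], toy_vectors["popcorn"])

print(f"hero vs villain: {sim_hero_villain:.3f}")
print(f"hero vs popcorn: {sim_hero_popcorn:.3f}")
```

With the trained model, gensim offers the same comparison directly, e.g. `embeddings.similarity('hero', 'villain')` or `embeddings.most_similar('hero', topn=5)`, provided the words are in the model's vocabulary.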

### Step 2: Constructing variables from word embeddings

In [17]:
# TODO: use the other split of the data here (30%)
# test_data = ...

In [21]:
# Initializing the embeddings matrix
all_embeddings = np.zeros((TOT_REVIEWS, topics_word2vec))

# Looping through each review
for review in range(TOT_REVIEWS):
    
    # Tokenizing the review: identify the words, separately
    tokenized_review = simple_preprocess(reviews_raw['full_text'].iloc[review])

    # Getting the word vectors per review
    embedding_review = [] # Initializing an empty list to store the word vectors

    # Looping through each word in the tokenized review
    for word in tokenized_review:
    
        # Checking if the word exists in the Word2Vec model vocabulary
        if word in model.wv.key_to_index:
        
            # If it does, get its vector and add it to the list
            word_vector = model.wv[word]
            embedding_review.append(word_vector)

    # If none of the words in the review are in the Word2Vec vocabulary,
    # skip the review; its row in all_embeddings stays all zeros
    if not embedding_review:
        continue
    
    # Compute mean across all words in the review 
    all_embeddings[review, :] = np.mean(embedding_review, axis=0)

#### Inspecting the embeddings

In [19]:
# Word embeddings per document by topic (these are not probabilities!)
# The columns/topics names (formatting)
topic_names_w2v = ["Topic" + str(i + 1) for i in range(topics_word2vec)]

# The rows/indices names (formatting)
doc_names_w2v = ["Review_" + str(i + 1) for i in range(all_embeddings.shape[0])]

# Mean word embedding per document by dimension (not probabilities)
df_document_w2v_topic = pd.DataFrame(np.round(all_embeddings, 3), columns=topic_names_w2v, index=doc_names_w2v)
print(df_document_w2v_topic)

             Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_1     -0.550  -0.772   0.487   0.042   0.326   0.754   0.616   0.800   
Review_2     -0.369  -0.730   0.230  -0.030   0.222   0.613   0.736   0.859   
Review_3     -0.243  -0.957   0.119  -0.095   0.677   0.657   0.764   0.600   
Review_4     -0.788  -0.965   0.142  -0.264   0.580   1.091   1.076   0.627   
Review_5     -0.111  -0.811   0.047  -0.228   0.694   0.601   1.037   0.774   
...             ...     ...     ...     ...     ...     ...     ...     ...   
Review_996   -0.052  -0.775   0.043   0.151  -0.029   0.531   0.538   0.982   
Review_997    0.031  -0.774   0.173   0.398   0.990   0.331   0.627   0.407   
Review_998    0.202  -1.261  -0.533  -0.096   0.817   0.940   0.665   0.354   
Review_999    0.258  -0.743  -0.010   0.085   0.904   0.702   0.333   0.390   
Review_1000   0.011  -1.016  -0.108   0.069   0.208   0.325   0.500   0.847   

             Topic9  Topic10  
Review_1     -0.322 

In [20]:
# Printing the 999th review (row index 998, since indexing starts at 0)
print(df_document_w2v_topic[998:999])

            Topic1  Topic2  Topic3  Topic4  Topic5  Topic6  Topic7  Topic8  \
Review_999   0.258  -0.743   -0.01   0.085   0.904   0.702   0.333    0.39   

            Topic9  Topic10  
Review_999  -0.891   -0.519  


### Step 3: Using the constructed variables to forecast the `box office`

In [3]:
# TODO: Implementation...
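
As one possible starting point, and not the intended solution, the sketch below fits an ordinary least squares regression on synthetic data. Everything here is an assumption: `X` stands in for the constructed variables (e.g. embeddings or topic posteriors), `y` for a box office measure, and the 70/30 split mirrors the split suggested earlier.

```python
import numpy as np

# Synthetic stand-ins: X for the constructed variables, y for the box office
rng = np.random.default_rng(100)
n_obs, n_features = 50, 3
X = rng.normal(size=(n_obs, n_features))
true_beta = np.array([0.5, -0.3, 0.8])   # made-up coefficients for the simulation
y = 2.0 + X @ true_beta + rng.normal(scale=0.1, size=n_obs)

# First 70% for estimation, last 30% for forecasting
split = n_obs * 7 // 10
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

def add_intercept(matrix):
    """Prepend a column of ones so the regression includes an intercept."""
    return np.column_stack([np.ones(len(matrix)), matrix])

# Ordinary least squares on the training split
beta_hat, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Forecast on the held-out split and measure the error
y_pred = add_intercept(X_test) @ beta_hat
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(f"Out-of-sample RMSE: {rmse:.3f}")
```

With the real data, the rows of `X` and `y` would need to be aligned at the same level (for example, one row per review or one row per movie) before fitting.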