# NLP: Assignment 2
### Group 11: Teo Stereciu (s4678826) & Csanad Vegh (s4739124)

For Part II of the second assignment, we designed two feedforward neural networks to classify movie reviews as positive or negative. The first takes TF-IDF vectors as input and the second uses Word2Vec embeddings.

### Preparation

Here we set up all dependencies for both models.

In [18]:
import numpy as np
import pandas as pd

import re
from nltk.corpus import stopwords 
from nltk.corpus import wordnet
from nltk.tokenize import wordpunct_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

from gensim.models import Word2Vec

Now we load the data from a local source. The reviews have already been split three ways, into training, validation, and test sets.

In [2]:
# load dataset into memory and return the review texts and their labels
def load_data(filename):
    df = pd.read_csv(filename)
    corpus = df["text"]
    target = df["label"]
    return corpus, target

corpus_valid, target_valid = load_data("IMDB/Valid.csv")
corpus_train, target_train = load_data("IMDB/Train.csv")
corpus_test, target_test = load_data("IMDB/Test.csv")

Next, we look into our training set. The output below shows that there are two possible labels (0 and 1), that the training corpus contains 40,000 reviews, and that it is balanced (49% positive).

In [3]:
print("Possible sentiments are", np.unique(target_train))
print("The number of reviews for training is", len(corpus_train))
size = len(corpus_train) + len(corpus_valid) + len(corpus_test)
print("Training corpus is " + str(int(100*np.sum(target_train)/len(target_train))) + "% positive reviews")
info = pd.DataFrame([corpus_train[5]], columns=["raw text example"]) # use to track progress
info

Possible sentiments are [0 1]
The number of reviews for training is 40000
Training corpus is 49% positive reviews


Unnamed: 0,raw text example
0,A terrible movie as everyone has said. What ma...


Now we need to clean up the data. We want to obtain clean tokens, but we join them back into a single string per review because the TF-IDF vectorizer used later expects text rather than token lists.

In [29]:
stop_words = set(stopwords.words("english"))  # set for fast membership tests
lemmatizer = WordNetLemmatizer()

# helper function to convert the pos tag format into something compatible with the lemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# turn the dataset into clean tokens
def clean_data(doc):
    doc = re.sub(r'<[^>]+>', '', doc)  # remove HTML tags
    doc = re.sub(r'\W+', ' ', doc) # remove every char that is not alphanumeric, keep spaces
    tokens = wordpunct_tokenize(doc) 
    tokens = [token.lower() for token in tokens if token.lower() not in stop_words] 
    pos = pos_tag(tokens)
    # loop variable is named `tag` so it does not shadow the imported pos_tag function
    clean_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for (word, tag) in pos]

    # put tokens back into string format for tfidf vectorizer
    clean_doc = " ".join(clean_tokens)

    return clean_doc
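
To make the two regular-expression steps concrete, here is a minimal sketch that applies only those steps to a made-up sentence. The helper name `strip_markup` and the example text are hypothetical; stop-word removal and lemmatization are left out so the snippet runs without any NLTK data.

```python
import re

def strip_markup(doc):
    """Apply the two regex steps from clean_data, then lowercase and split."""
    doc = re.sub(r'<[^>]+>', '', doc)  # drop HTML tags such as <br />
    doc = re.sub(r'\W+', ' ', doc)     # replace each run of non-word characters with one space
    return doc.lower().split()

# hypothetical input, not taken from the dataset
example = "A terrible movie.<br /><br />What's next?"
print(strip_markup(example))
```

Note that the apostrophe counts as a non-word character, so a contraction such as "What's" is split into two tokens.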

In [30]:
# clean every review in each split
X_train = [clean_data(line) for line in corpus_train]
y_train = target_train

X_valid = [clean_data(line) for line in corpus_valid]
y_valid = target_valid

X_test = [clean_data(line) for line in corpus_test]
y_test = target_test

Let's take a look at what the clean text looks like compared to the original. We also find how many (unique) tokens there are in the training corpus to get a sense of how complex our models have to be.

In [31]:
info["clean text example"] = X_train[5]
info["sentiment"] = y_train[5]
tokens = [review.split() for review in X_train]
review_len = [len(review) for review in tokens]
flat_tokens = [token for review in tokens for token in review]
num_tokens_unique = len(set(flat_tokens))
info["num tokens in corpus"] = np.sum(review_len)
info["num unique tokens in corpus"] = num_tokens_unique
info["avg review length"] = np.mean(review_len)
info["max review length"] = np.max(review_len)
info["min review length"] = np.min(review_len)
info

Unnamed: 0,raw text example,clean text example,sentiment,num tokens in corpus,num unique tokens in corpus,avg review length,max review length,min review length
0,A terrible movie as everyone has said. What ma...,terrible movie everyone say make laugh cameo a...,0,4786521,84468,119.663025,1429,3


## TF-IDF

In this section we focus on the TF-IDF method. In a nutshell, each review is represented as a vector with one weight per vocabulary term. The TF-IDF score gives more importance to words that occur often in a given document but in few documents across the corpus, so words that appear everywhere are down-weighted. We also choose to ignore rare terms (i.e., those that appear in fewer than 10% of the reviews) because highly movie-specific words would not be useful during training. Note that sklearn.feature_extraction.text.TfidfVectorizer() takes text rather than tokens as input, which is why we joined the tokens back together earlier.
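
As a rough illustration of the weighting and of the `min_df` filter, the sketch below computes a toy TF-IDF in plain Python. The function `tfidf_sketch` and the three-document corpus are hypothetical. The smoothed IDF formula is the one scikit-learn documents for its default settings, but the sketch skips the row normalisation that `TfidfVectorizer` applies by default, so the numbers will not match the pipeline's output.

```python
import math
from collections import Counter

def tfidf_sketch(docs, min_df=0.0):
    """Toy TF-IDF: raw term counts times a smoothed IDF, no normalisation."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # keep only terms that appear in at least a fraction min_df of the documents
    vocab = sorted(t for t, c in df.items() if c / n_docs >= min_df)
    # smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append({t: counts[t] * idf[t] for t in vocab if t in counts})
    return idf, rows

# hypothetical toy corpus
docs = ["good movie", "bad movie", "good plot good cast"]
idf, rows = tfidf_sketch(docs)
print(idf["movie"], idf["cast"])  # the rarer term gets the larger IDF

idf_filtered, _ = tfidf_sketch(docs, min_df=0.5)
print(sorted(idf_filtered))       # only terms found in at least half the documents remain
```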

In [50]:
def define_model_tfidf(size):
    model_tfidf = Pipeline([
        ("vect", TfidfVectorizer(min_df=0.1)), 
        ("clf", MLPClassifier(hidden_layer_sizes=(size,), max_iter=500))
    ])
    model_tfidf.fit(X_train, y_train)
    predict_train = model_tfidf.predict(X_train)
    train_accuracy = accuracy_score(y_train, predict_train)
    
    predict_valid = model_tfidf.predict(X_valid)
    valid_accuracy = accuracy_score(y_valid, predict_valid)
    return model_tfidf, train_accuracy, valid_accuracy

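We explore a single hidden layer with 100, 200, or 500 units and compare the configurations on the validation set (the 500-unit run is commented out in the cells below). We set our maximum number of iterations relatively high, at 500 epochs, because of the training sample size.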
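
To get a feel for how the width of the hidden layer affects model size, the following sketch counts the weights and biases of a fully connected network. The helper `mlp_param_count` and the input dimension of 300 are hypothetical (the real input dimension is whatever vocabulary survives the `min_df` filter), and a single output unit is assumed for the binary task.

```python
def mlp_param_count(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected feedforward network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # each layer has fan_in * fan_out weights plus fan_out biases
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

n_features = 300  # hypothetical vocabulary size, not measured on the dataset
for size in (100, 200, 500):
    print(size, mlp_param_count(n_features, (size,)))
```

The count grows linearly with the width of the layer, which is one reason wider layers can fit the training set more closely.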

In [51]:
model_tfidf100, train_accuracy100, valid_accuracy100 = define_model_tfidf(100)
model_tfidf200, train_accuracy200, valid_accuracy200 = define_model_tfidf(200)
#model_tfidf500, train_accuracy500, valid_accuracy500 = define_model_tfidf(500)



In [54]:
print("model100:", train_accuracy100, valid_accuracy100)
print("model200:", train_accuracy200, valid_accuracy200)
#print("model500:", train_accuracy500, valid_accuracy500)

model100: 0.91355 0.6918
model200: 0.998525 0.681


Now, for the final test, we'll be using ...

## Word2Vec

Moving on to Word2Vec, we first need to split the cleaned training reviews back into lists of tokens. The settings we used for the Word2Vec model are not too different from the defaults. At a high level, Word2Vec learns a dense vector for each word by predicting words from their surrounding context, so words that appear in similar contexts end up with similar vectors. Our neural network takes these embeddings as input.
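
The notion of "context" can be sketched as follows: for each position in a review, the context is the set of neighbouring tokens within `window` positions. The generator `context_windows` is a hypothetical helper written for illustration, not part of gensim, and it only shows which words are paired up, not how the vectors are trained.

```python
def context_windows(tokens, window=5):
    """Yield (target, context) pairs; context holds the neighbours within `window` positions."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

# first tokens of the cleaned example review shown earlier
review = "terrible movie everyone say make laugh".split()
for target, context in context_windows(review, window=2):
    print(target, context)
```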

In [None]:
# tokenize for word2vec
X_train_list = [line.split() for line in X_train]
# initialize the word2vec model
w2v_model = Word2Vec(X_train_list,
                    vector_size=100,
                    window=5,
                    min_count=2)

We define our own vectorizer that uses the Word2Vec model above on clean text.

In [74]:
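def w2v_vectorizer(X):
    # tokenize
    X_list = [line.split() for line in X]

    # average embeddings for each review
    dim = w2v_model.wv.vector_size
    vocab = set(w2v_model.wv.index_to_key)
    X_vect = []
    for line in X_list:
        # start from a NumPy array so the division below also works
        # when none of the review's words are in the vocabulary
        sum_vec = np.zeros(dim)
        for word in line:
            if word in vocab:
                sum_vec += w2v_model.wv[word]
        # divide by the total token count (out-of-vocabulary words included);
        # max(..., 1) guards against empty reviews
        X_vect.append(sum_vec / max(len(line), 1))

    return np.array(X_vect)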
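
The averaging step can be checked by hand on a toy example. The sketch below uses a hypothetical two-dimensional lookup table in place of the trained model and, like `w2v_vectorizer`, divides by the total number of tokens, so out-of-vocabulary words pull the average towards zero.

```python
def mean_embedding(tokens, vectors, dim):
    """Sum the vectors of known tokens and divide by the total token count."""
    total = [0.0] * dim
    for token in tokens:
        if token in vectors:
            total = [t + v for t, v in zip(total, vectors[token])]
    # max(..., 1) avoids a division by zero for an empty review
    return [t / max(len(tokens), 1) for t in total]

# hypothetical embeddings, not taken from the trained model
toy_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
print(mean_embedding(["good", "movie", "zzz", "zzz"], toy_vectors, dim=2))
```

Dividing by the number of in-vocabulary tokens instead would be an alternative design, but it would change the feature values the classifier sees.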

In [75]:
X_train_vect = w2v_vectorizer(X_train)
print(X_train_vect)

[[-0.44250308 -0.12474672 -0.11875592 ... -0.12294614 -0.01942103
   0.0104833 ]
 [-0.62371868 -0.46918127 -0.41876    ...  0.14540264  0.15781565
  -0.41587733]
 [-0.77591374 -0.39079838 -0.33819434 ...  0.26265447  0.13211061
  -0.38795699]
 ...
 [-0.30249076 -0.11166037 -0.14121279 ...  0.25851835  0.15269776
  -0.36960565]
 [-0.9231643  -0.41026558 -0.67779348 ...  0.26594043  0.32524223
  -0.42071024]
 [-0.4394797  -0.19827351 -0.24165958 ...  0.09765698  0.37199169
  -0.01631636]]


Finally, we define our Word2Vec classifier. We use a hidden layer of 200 units, one of the sizes tried for TF-IDF, so we can compare how the two representations perform with similar learning capacity.

In [78]:
def define_model_w2v():
    X_train_vect = w2v_vectorizer(X_train)
    clf_w2v = MLPClassifier(hidden_layer_sizes=(200,))
    clf_w2v.fit(X_train_vect, y_train)
    
    predict_train = clf_w2v.predict(X_train_vect)
    train_accuracy = accuracy_score(y_train, predict_train)

    X_valid_vect = w2v_vectorizer(X_valid)
    predict_valid = clf_w2v.predict(X_valid_vect)
    valid_accuracy = accuracy_score(y_valid, predict_valid)
    return clf_w2v, train_accuracy, valid_accuracy
    

In [79]:
model_w2v, train_accuracy_w2v, valid_accuracy_w2v = define_model_w2v()
print(train_accuracy_w2v, valid_accuracy_w2v)



0.950875 0.8456


And this is the final test.