# NLP: Assignment 2
### Group 11: Teo Stereciu (s4678826) & Csanad Vegh (s4739124)

For Part II of the second assignment, we built two feedforward neural networks that classify movie reviews as positive or negative. The first takes TF-IDF vectors as input; the second uses averaged Word2Vec embeddings.

### Preparation

Here we set up all dependencies for both models.

In [18]:
import numpy as np
import pandas as pd

import re
from nltk.corpus import stopwords 
from nltk.corpus import wordnet
from nltk.tokenize import wordpunct_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

from gensim.models import Word2Vec

Now we load the data from a local source. The reviews have already been split three ways, into training, validation, and test sets.

In [2]:
# load one CSV split and return its review texts and sentiment labels
def load_data(filename):
    df = pd.read_csv(filename)
    corpus = df["text"]
    target = df["label"]
    return corpus, target

corpus_valid, target_valid = load_data("IMDB/Valid.csv")
corpus_train, target_train = load_data("IMDB/Train.csv")
corpus_test, target_test = load_data("IMDB/Test.csv")

Next, we look into our training set. A quick check shows that there are two possible labels and that the training corpus is reasonably large and balanced.

In [3]:
print("Possible sentiments are", np.unique(target_train))
print("The number of reviews for training is", len(corpus_train))
print("Training corpus is " + str(int(100*np.sum(target_train)/len(target_train))) + "% positive reviews")
info = pd.DataFrame([corpus_train[5]], columns=["raw text example"]) # use to track progress
info

Possible sentiments are [0 1]
The number of reviews for training is 40000
Training corpus is 49% positive reviews


| | raw text example |
|---|---|
| 0 | A terrible movie as everyone has said. What ma... |


Now we need to clean up the data. We want to obtain clean tokens, but put them back together into a string for practical reasons that will become apparent soon.

In [32]:
stop_words = set(stopwords.words("english"))  # a set makes the membership test in clean_data fast
lemmatizer = WordNetLemmatizer()

# helper function to convert the pos tag format into something compatible with the lemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# turn a review into clean tokens
def clean_data(doc):
    doc = re.sub(r'<[^>]+>', '', doc)  # remove HTML tags
    doc = re.sub(r'\W+', ' ', doc)  # replace runs of non-word characters (punctuation, whitespace) with one space
    tokens = wordpunct_tokenize(doc)
    tokens = [token.lower() for token in tokens if token.lower() not in stop_words]
    tagged = pos_tag(tokens)
    # the loop variable is called "tag" so it does not shadow the imported pos_tag function
    clean_tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for (word, tag) in tagged]

    # put tokens back into string format for tfidf vectorizer
    clean_doc = " ".join(clean_tokens)

    return clean_doc
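
To make the two regular-expression steps concrete, here is a minimal standard-library sketch on a made-up review snippet (the string is hypothetical and not taken from the dataset). It only illustrates the substitutions; tokenization, stop-word removal, and lemmatization are left out.

```python
import re

# Hypothetical raw snippet, for illustration only.
raw = "A <br />great movie... isn't it?"

# Step 1: drop anything that looks like an HTML tag.
no_tags = re.sub(r'<[^>]+>', '', raw)

# Step 2: replace each run of non-word characters with a single space.
normalized = re.sub(r'\W+', ' ', no_tags)

print(no_tags)
print(normalized.split())
```

Note that the second substitution also splits contractions at the apostrophe, so a word such as "isn't" reaches the tokenizer as two separate pieces.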

In [30]:
# clean every review in each split; the labels are used unchanged
X_train = [clean_data(line) for line in corpus_train]
y_train = target_train

X_valid = [clean_data(line) for line in corpus_valid]
y_valid = target_valid

X_test = [clean_data(line) for line in corpus_test]
y_test = target_test

Let's take a look at what the clean text looks like compared to the original. We also find how many (unique) tokens there are in the training corpus to get a sense of how complex our models have to be.

In [31]:
info["clean text example"] = X_train[5]
info["sentiment"] = y_train[5]
tokens = [review.split() for review in X_train]
review_len = [len(review_tokens) for review_tokens in tokens]
flat_tokens = [token for review_tokens in tokens for token in review_tokens]
num_tokens_unique = len(set(flat_tokens))
info["num tokens in corpus"] = np.sum(review_len)
info["num unique tokens in corpus"] = num_tokens_unique
info["avg review length"] = np.mean(review_len)
info["max review length"] = np.max(review_len)
info["min review length"] = np.min(review_len)
info

| | raw text example | clean text example | sentiment | num tokens in corpus | num unique tokens in corpus | avg review length | max review length | min review length |
|---|---|---|---|---|---|---|---|---|
| 0 | A terrible movie as everyone has said. What ma... | terrible movie everyone say make laugh cameo a... | 0 | 4786521 | 84468 | 119.663025 | 1429 | 3 |


## TF-IDF

In this section we focus on the TF-IDF method. In a nutshell, each review is represented as a vector with one weight per vocabulary term. The TF-IDF weight is high for terms that occur often in a given review but in few reviews across the corpus, so frequent but uninformative words are down-weighted. We also ignore rare terms (i.e., those that appear in fewer than 10% of the reviews) because highly movie-specific words would not be useful during training. Note that sklearn.feature_extraction.text.TfidfVectorizer takes raw text rather than tokens as input, which is why we joined the tokens back into strings earlier.
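
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand with the standard library only. It assumes the smoothed IDF variant and L2 normalisation that scikit-learn applies by default; the three-document corpus and the function name `tfidf` are made up for this example and are not part of our pipeline.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned and space-separated.
docs = [
    "great movie great cast",
    "terrible movie",
    "fun movie",
]

def tfidf(docs):
    """Return one {term: weight} dict per document plus the IDF table."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within the document
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors, idf

vectors, idf = tfidf(docs)
print(idf)
print(vectors[1])
```

In this toy corpus "movie" occurs in every document and therefore gets the lowest IDF, while "terrible" occurs in only one and dominates the vector of the second document.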

In [44]:
def define_model_tfidf(size, max_iter):
    model_tfidf = Pipeline([
        ("vect", TfidfVectorizer(min_df=0.1)), 
        ("clf", MLPClassifier(hidden_layer_sizes=(size,), max_iter=max_iter, random_state=11))
    ])
    model_tfidf.fit(X_train, y_train)
    predict_train = model_tfidf.predict(X_train)
    train_accuracy = accuracy_score(y_train, predict_train)
    
    predict_valid = model_tfidf.predict(X_valid)
    valid_accuracy = accuracy_score(y_valid, predict_valid)
    return model_tfidf, train_accuracy, valid_accuracy

We experiment with a single hidden layer of 100, 200, and 500 units and compare the models on the validation set. We set the maximum number of iterations relatively high, at 500 epochs, given the size of the training set.
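
To get a feel for how the size of the hidden layer affects model capacity, the sketch below counts the weights and biases of a fully connected network. The helper `mlp_param_count` and the input dimension of 300 are hypothetical; the real input dimension is the TF-IDF vocabulary size, which we did not record here. The single output unit is an assumption that matches a binary classifier with a logistic output.

```python
def mlp_param_count(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected feedforward network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus one bias per output unit.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

n_features = 300  # hypothetical vocabulary size, for illustration only

for hidden in [(100,), (200,), (500,), (100, 100)]:
    print(hidden, mlp_param_count(n_features, hidden))
```

A tuple with one entry, such as (100,), describes one hidden layer with 100 units; a second hidden layer would be written as (100, 100).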

In [45]:
model_tfidf100, train_accuracy100, valid_accuracy100 = define_model_tfidf(100, 500)
model_tfidf200, train_accuracy200, valid_accuracy200 = define_model_tfidf(200, 500)
model_tfidf500, train_accuracy500, valid_accuracy500 = define_model_tfidf(500, 500)



In [47]:
print("model100:", train_accuracy100, valid_accuracy100) # model100: 0.950675 0.6992
print("model200:", train_accuracy200, valid_accuracy200) # model200: 1.0 0.7022
print("model500:", train_accuracy500, valid_accuracy500) # model500: 1.0 0.714

model100: 0.950675 0.6992
model200: 1.0 0.7022
model500: 1.0 0.714
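
One simple way to read these scores is the gap between training and validation accuracy, since a large gap is a common sign of overfitting. The sketch below recomputes that gap from the numbers printed above; the dictionary layout is just a convenience for this example.

```python
# (training accuracy, validation accuracy) as reported in the output above
results = {
    "model100": (0.950675, 0.6992),
    "model200": (1.0, 0.7022),
    "model500": (1.0, 0.714),
}

# Gap between training and validation accuracy for each model.
gaps = {name: train - valid for name, (train, valid) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: train-validation gap = {gap:.3f}")
```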


The larger models are clearly overfitting: training accuracy reaches 100% while validation accuracy stays around 70%. Let's try to limit this with a simple form of early stopping, capping training at 200 iterations.

In [48]:
model_tfidf100_early, train_accuracy100_early, valid_accuracy100_early = define_model_tfidf(100, 200)
model_tfidf200_early, train_accuracy200_early, valid_accuracy200_early = define_model_tfidf(200, 200)
model_tfidf500_early, train_accuracy500_early, valid_accuracy500_early = define_model_tfidf(500, 200)



In [49]:
print("model100_early:", train_accuracy100_early, valid_accuracy100_early) 
print("model200_early:", train_accuracy200_early, valid_accuracy200_early) 
print("model500_early:", train_accuracy500_early, valid_accuracy500_early)

model100_early: 0.9074 0.7208
model200_early: 0.9806 0.717
model500_early: 1.0 0.715
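
Capping the number of iterations is only a coarse form of early stopping. A common alternative is to monitor a held-out score and stop once it has not improved for a number of consecutive epochs; MLPClassifier exposes parameters for this (early_stopping, validation_fraction, n_iter_no_change, tol), which we did not use. The sketch below shows the general patience rule in plain Python. It is not scikit-learn's exact implementation, and the score history is invented for illustration.

```python
def early_stopping_epoch(valid_scores, patience=10, tol=1e-4):
    """Return the epoch (1-based) at which patience-based early stopping halts."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(valid_scores, start=1):
        if score > best + tol:
            # Meaningful improvement: remember it and reset the counter.
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    # Never ran out of patience: training uses every epoch.
    return len(valid_scores)

# Hypothetical validation accuracies per epoch.
history = [0.60, 0.68, 0.71, 0.72, 0.72, 0.71, 0.71, 0.70]
print(early_stopping_epoch(history, patience=3))
```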


The classifier with 500 hidden units still fits the training set perfectly, which indicates that its capacity is far too large for our set-up. For the final test we use the model with 100 hidden units, since it had the best accuracy on the validation set (72.08%). We get 72.52% accuracy on the test set, which is not too bad.

In [50]:
predict_test = model_tfidf100_early.predict(X_test)
test_accuracy = accuracy_score(y_test, predict_test)
print(test_accuracy)

0.7252


## Word2Vec

Moving on to Word2Vec, we first need to split the cleaned training reviews back into lists of tokens. The settings we use for the Word2Vec model are close to the gensim defaults. At a high level, Word2Vec learns a dense vector for each word by training a shallow network to predict words from their surrounding context, so words that occur in similar contexts end up with similar vectors. Our neural network classifier takes these embeddings as input.

In [41]:
# tokenize for word2vec
X_train_list = [[word for word in line.split()] for line in X_train]
# initialize the word2vec model
model = Word2Vec(X_train_list,
                vector_size=100,
                window=5,
                min_count=2)

model.save("word2vec.model")

We define our own vectorizer that uses the Word2Vec model above on clean text.

In [42]:
w2v_model = Word2Vec.load("word2vec.model")

def w2v_vectorizer(X):
    # tokenize
    X_list = [[word for word in line.split()] for line in X]
    
    # average embeddings for each review
    X_vect = []
    vocab = set(w2v_model.wv.index_to_key)
    for line in X_list:
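        mean_vec = np.zeros(w2v_model.wv.vector_size)
        for word in line:
            if word in vocab:
                mean_vec = mean_vec + w2v_model.wv[word]
        # divide by the full review length, so out-of-vocabulary tokens count as zero vectors;
        # max(..., 1) guards against empty reviews
        X_vect.append(mean_vec / max(len(line), 1))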

    return np.array(X_vect)
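
The choice of denominator in the averaging step matters when a review contains words that are missing from the Word2Vec vocabulary. The plain-Python sketch below contrasts dividing by the full review length (as in the vectorizer above) with dividing by the number of known words only. The helper `average_embedding` and the two-dimensional toy vectors are hypothetical.

```python
def average_embedding(tokens, vectors, dim, divide_by_known=True):
    """Average the vectors of known tokens; unknown tokens contribute nothing."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        if token in vectors:
            known += 1
            total = [t + v for t, v in zip(total, vectors[token])]
    denom = known if divide_by_known else len(tokens)
    if denom == 0:
        return total  # nothing to average: fall back to the zero vector
    return [t / denom for t in total]

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}
tokens = ["good", "film", "zzz", "qqq"]  # two known and two unknown words

print(average_embedding(tokens, toy_vectors, dim=2, divide_by_known=True))
print(average_embedding(tokens, toy_vectors, dim=2, divide_by_known=False))
```

Dividing by the full length shrinks the vector of reviews with many unknown words towards zero, whereas dividing by the number of known words keeps its scale independent of vocabulary coverage.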

Finally, we define our Word2Vec classifier. We kept the same hidden layer size as the selected TF-IDF model (100 units) so that the two models have a similar learning capacity, and the same maximum number of iterations (200) to make the comparison straightforward.

In [52]:
def define_model_w2v():
    X_train_vect = w2v_vectorizer(X_train)
    clf_w2v = MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=11)
    clf_w2v.fit(X_train_vect, y_train)
    
    predict_train = clf_w2v.predict(X_train_vect)
    train_accuracy = accuracy_score(y_train, predict_train)

    X_valid_vect = w2v_vectorizer(X_valid)
    predict_valid = clf_w2v.predict(X_valid_vect)
    valid_accuracy = accuracy_score(y_valid, predict_valid)
    return clf_w2v, train_accuracy, valid_accuracy

In [54]:
clf_w2v, train_accuracy_w2v, valid_accuracy_w2v = define_model_w2v()
print(train_accuracy_w2v, valid_accuracy_w2v) # 0.924375 0.8408



0.924375 0.8408


And this is the final test, which yields predictions that are 84.74% accurate.

In [55]:
X_test_vect = w2v_vectorizer(X_test)
predict_test = clf_w2v.predict(X_test_vect)
test_accuracy = accuracy_score(y_test, predict_test)

print(test_accuracy)

0.8474


For this project we fed two types of text representations to the same feedforward neural network architecture. We obtained 72.52% test accuracy with the TF-IDF-based method and 84.74% with the averaged Word2Vec embeddings. This is not surprising, since Word2Vec representations capture semantic similarity between words, which TF-IDF does not. However, in a restricted domain TF-IDF may still outperform Word2Vec if word frequency information is highly indicative of the label. In any case, a more extensive hyperparameter search may improve the performance of both models.