# Sentiment classification - close to the state of the art

The task of classifying sentiments of texts (for example movie or product reviews) has high practical significance in online marketing as well as financial prediction. This is a non-trivial task, since the concept of sentiment is not easily captured.

For this assignment you have to use the larger [IMDB sentiment](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) benchmark dataset from Stanford, and achieve close to state-of-the-art results.

The task is to try out multiple models in ascending complexity, namely:

1. TFIDF + classical statistical model (e.g. RandomForest)
2. LSTM classification model
3. LSTM model, where the embeddings are initialized with pre-trained word vectors
4. fastText model
5. BERT based model (you are advised to use a pre-trained one and finetune, since the resource consumption is considerable!)

You should get over 90% validation accuracy (though nearly 94% is achievable).

You are allowed to use any library or tool, though the Keras environment and some wrappers on top of it (e.g. Ktrain) make your life easier.





__Groups__
This assignment is to be completed individually, four weeks after the class has finished. For the precise deadline please see canvas.

__Format of submission__
You need to submit a PDF of your Google Colab notebooks.

__Due date__
Four weeks after the class has finished. For the precise deadline please see canvas.

Grade distribution:
1. TFIDF + classical statistical model (e.g. RandomForest) (25% of the final grade)
2. LSTM classification model (15% of the final grade)
3. LSTM model, where the embeddings are initialized with pre-trained word vectors, e.g. fastText, GloVe etc. (15% of the final grade)
4. fastText model (15% of the final grade)
5. BERT based model (you are advised to use a pre-trained one and finetune it, since the resource consumption is considerable!) (30% of the final grade). For BERT you should get over 90% validation accuracy (though nearly 94% is achievable).


__For each of the models, the marks will be awarded according to the following three criteria__:

(1) The (appropriately measured) accuracy of your prediction for the task. The more accurate the prediction is, the better. Note that you need to validate the predictive accuracy of your model on a hold-out of unseen data that the model has not been trained with.

(2) How well you motivate the use of the model: what in this model's structure makes it suited for representing sentiment? After using the model for the task, how well you evaluate the accuracy you obtained and discuss the main advantages and disadvantages of the model for this particular modelling task. Ideally, you use parts of your modelling results to support your arguments.

(3) The consistency of your take-aways, i.e. what you have learned from your analyses. Also, analyze when the model is good and when and where it does not predict well.

Please make sure that you comment with # on the separate steps of the code you have produced. For the verbal description and analyses please insert markdown cells.


__Plagiarism__: The Frankfurt School does not accept any plagiarism. Data science is a collaborative exercise and you can discuss the research question with your classmates, if you like. You must not copy any code or text though. Plagiarism will be prosecuted and will result in a mark of 0 and you failing this class.

After carefully reading this document and having had a look at the data you may still have questions. Please submit those questions to the public Q&A board in canvas and we will answer each question, so 

# Data download

In [1]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz
!ls

--2021-12-07 22:02:26--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-12-07 22:02:27 (69.0 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]

aclImdb  aclImdb_v1.tar.gz  sample_data


### Read data

In [2]:
import os
import re
import numpy as np
import pandas as pd
from gensim.utils import simple_preprocess

In [3]:
data = {}
for split in ["train", "test"]:
    data[split] = []
    for sentiment in ["neg", "pos"]:
        score = 1 if sentiment == "pos" else 0
        path = os.path.join('aclImdb', split, sentiment)
        file_names = os.listdir(path)
        for f_name in file_names:
            with open(os.path.join(path, f_name), "r") as f:
                review = f.read()
                data[split].append([review, score])

np.random.shuffle(data["train"])        
data_train = pd.DataFrame(data["train"],columns=['text', 'label'])
print(data_train)
np.random.shuffle(data["test"])
data_test = pd.DataFrame(data["test"],columns=['text', 'label'])
print(data_test)

                                                    text  label
0      Red Rock West is one of those rare films that ...      1
1      I still wonder why I watched this movie. Admit...      1
2      This is an installment in the notorious Guinea...      0
3      Randolph Scott is heading into Albuquerque to ...      1
4      An apparent vanity project for Karin Mani (who...      0
...                                                  ...    ...
24995  Ashanti is a very 70s sort of film (1979, to b...      1
24996  A brutally depressing script and some fine low...      1
24997  I viewed this movie in DVD format. My copy may...      0
24998  I just watched this movie. In one word: sucky!...      0
24999  No day passes without a new released computer ...      1

[25000 rows x 2 columns]
                                                    text  label
0      I had enjoyed the Masters of Horror Series unt...      0
1      G&M started a the odd couple downstairs in Man...      0
2      This mo

### Preprocess text
Use gensim's `simple_preprocess` to lowercase and tokenize each review, then join the tokens back into a single string.


In [4]:
data_train.iloc[:,0] = data_train.iloc[:,0].apply(lambda x :' '.join(simple_preprocess(x))) 
data_test.iloc[:,0] =data_test.iloc[:,0].apply(lambda x :' '.join(simple_preprocess(x)))

# TF-IDF + Classic Model
The first step is to vectorize the text into numbers, here using `TfidfVectorizer` from sklearn. First build a vectorizer, then fit it on the training text. Then transform the test text with the same fitted vectorizer, so that the training and test matrices have the same dimensions.

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
import xgboost as xgb
from sklearn.model_selection import train_test_split

In [6]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(data_train['text'])
print('X_train.shape:' ,X_train.shape)
X_test = vectorizer.transform(data_test['text'])
print('X_test.shape:' ,X_test.shape)

X_train.shape: (25000, 73293)
X_test.shape: (25000, 73293)
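
To make the fit-on-train, transform-on-test step concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 row normalisation, which is how sklearn documents its defaults. The helper names `fit_idf` and `transform` and the toy sentences are illustrative only and not part of the notebook.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    vocab = sorted(doc_freq)
    # Smoothed IDF (assumed to match sklearn's default): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf

def transform(docs, vocab, idf):
    """Map documents to L2-normalised TF-IDF rows over the fitted vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [counts[t] * idf[t] for t in vocab]  # words outside vocab are ignored
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return rows

train_docs = ["good movie", "bad movie", "good good plot"]
test_docs = ["good acting"]  # "acting" never appeared in the training text

vocab, idf = fit_idf(train_docs)
train_rows = transform(train_docs, vocab, idf)
test_rows = transform(test_docs, vocab, idf)
print(vocab)
print(len(train_rows[0]), len(test_rows[0]))
```

Because the vocabulary is fixed at fit time, the unseen word in the test document is simply dropped, and every row has one column per training-vocabulary word.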


### Build Model
To be able to detect overfitting, hold out 10% of the training data as a validation set.

In [7]:
X_train_rf, X_valid_rf, y_train_rf, y_valid_rf = train_test_split(
    X_train, data_train['label'], test_size=0.1, random_state=42)

In [8]:
clf = RandomForestClassifier(n_estimators=200, max_depth=10,random_state = 10223)
clf.fit(X_train_rf, y_train_rf)

RandomForestClassifier(max_depth=10, n_estimators=200, random_state=10223)

In [9]:
y_predicted = clf.predict(X_test)

In [10]:
accuracy_score(data_test['label'], y_predicted)

0.82708

This gives a baseline accuracy of 82.71% on the test data.

# FastText model
The fastText model requires a specific input format, so the text is first converted into that format.

In [16]:
data_train_ft = data_train.copy()
data_test_ft = data_test.copy()
data_train_ft.iloc[:,1] = data_train_ft.iloc[:,1].apply(lambda x:'__label__'+str(x))
data_test_ft.iloc[:,1] = data_test_ft.iloc[:,1].apply(lambda x:'__label__'+str(x))

In [17]:
import csv
data_train_ft.to_csv('train.txt',index = False,
                    sep = ' ',header = None,
                    quoting = csv.QUOTE_NONE,
                    quotechar = "",
                    escapechar = " ")
data_test_ft.to_csv('test.txt', 
                   index = False, 
                   sep = ' ',
                   header = None, 
                   quoting = csv.QUOTE_NONE, 
                   quotechar = "", 
                   escapechar = " ")
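
As a sketch of the target format: fastText's supervised mode expects one example per line, with each label written as a token carrying the `__label__` prefix. The helper `to_fasttext_line` below is hypothetical and puts the label first, which is the layout commonly shown in fastText examples; the CSV export above places it after the text, which should also work as long as the prefix is present.

```python
def to_fasttext_line(text, label, prefix="__label__"):
    """Format one example as a single line: label token first, then the text."""
    # A newline inside a review would split it into several examples,
    # so collapse all whitespace to single spaces.
    clean_text = " ".join(text.split())
    return f"{prefix}{label} {clean_text}"

# Toy examples (illustrative only)
examples = [("great film loved it", 1), ("boring and\ntoo long", 0)]
lines = [to_fasttext_line(text, label) for text, label in examples]
for line in lines:
    print(line)
```

Writing these lines to a file with a plain `"\n".join(lines)` avoids having to configure CSV quoting and escaping.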

In [18]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
Collecting pybind11>=2.2
  Using cached pybind11-2.8.1-py2.py3-none-any.whl (208 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3123444 sha256=e5934e566498d86b38f66650e38a49a6861bdb051415f49c3a7ae323d436cd52
  Stored in directory: /root/.cache/pip/wheels/4e/ca/bf/b020d2be95f7641801a65

In [63]:
import fasttext
fast_model = fasttext.train_supervised('train.txt')
fast_model.test('test.txt')

(25000, 0.87708, 0.87708)

Even with the default parameter settings, the fastText model performs better than the random forest model (87.71% vs. 82.71% test accuracy).
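
The tuple returned by `test` is, in order, the number of examples, precision at 1 and recall at 1. A minimal sketch of why the two metrics coincide here, assuming every example has exactly one true label and the model predicts exactly one label (the function name and the toy labels are illustrative):

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Precision@1 and recall@1 when each example has one true and one predicted label."""
    n = len(true_labels)
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = hits / n  # correct predictions / predictions made (one per example)
    recall = hits / n     # correct predictions / true labels (one per example)
    return n, precision, recall

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
result = precision_recall_at_1(y_true, y_pred)
print(result)
```

In this single-label setting both numbers equal plain accuracy, which is why they can be compared directly with `accuracy_score` from the random forest baseline.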

In [None]:
lr =0.1
dim = 128
epoch = 10

In [None]:
fast_model_opt = fasttext.train_supervised('train.txt',lr=lr,dim=dim,epoch=epoch)

In [62]:
fast_model_opt.test('test.txt')

(25000, 0.88356, 0.88356)

After tuning over a very limited parameter space, the model performs slightly better (88.36% vs. 87.71% test accuracy).
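
A wider search could be organised as a small grid search. The sketch below is generic: `fake_validation_accuracy` is a hypothetical stand-in for "train fastText with these settings and return the accuracy on a validation file", and its numbers are invented purely so that the example runs.

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Try every parameter combination in `grid`; return the best (score, params)."""
    best_score, best_params = float("-inf"), None
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical stand-in for training a model and scoring it on validation data.
def fake_validation_accuracy(lr, dim, epoch):
    return 0.85 + 0.01 * (lr == 0.1) + 0.005 * (dim == 128) + 0.001 * epoch

grid = {"lr": [0.05, 0.1, 0.5], "dim": [64, 128], "epoch": [5, 10]}
best_score, best_params = grid_search(fake_validation_accuracy, grid)
print(best_params)
```

In real use the scoring function would train on a training split and evaluate on a separate validation file, so that the test set is only touched once for the final number.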

# LSTM + FastText
Using the fastText model trained above, each review in the training data is converted to a fixed-length matrix: only the first `max_seq_len` words are kept, and each word is replaced by its word vector from the fastText model.
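
The fixed-length conversion can be illustrated on its own with plain lists. This is only a sketch of the truncate-then-zero-pad idea; `pad_or_truncate` and the tiny two-dimensional vectors are made up for the example and stand in for real fastText word vectors.

```python
def pad_or_truncate(vectors, max_len, dim):
    """Return exactly `max_len` vectors: keep the first ones, pad the rest with zeros."""
    kept = [list(v) for v in vectors[:max_len]]
    padding = [[0.0] * dim for _ in range(max_len - len(kept))]
    return kept + padding

# A one-word "review" is padded, a four-word "review" is cut to three words.
short = pad_or_truncate([[1.0, 2.0]], max_len=3, dim=2)
long = pad_or_truncate(
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]], max_len=3, dim=2
)
print(short)
print(long)
```

Every review then has the same shape `(max_len, dim)`, which is what allows the reviews to be stacked into a single input array for the LSTM.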

In [None]:
!pip install seed

In [None]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow import keras
from tensorflow.keras.layers import Dense,Input,Embedding,GlobalAveragePooling1D,LSTM
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import Model,backend
from tensorflow.keras import regularizers
import seed
tf.random.set_seed(1234)
import nltk
nltk.download('punkt')

In [None]:
lstm_size =64
max_seq_len = 150

In [None]:
def build_sentence_matrix(series):
  '''Convert n texts to an array of shape (n, max_seq_len, word_vector_dimension)'''
  x = np.zeros((len(series), max_seq_len, fast_model_opt.dim))
  for idx, sentence in enumerate(series):
    # Keep at most the first max_seq_len tokens; shorter texts stay zero-padded at the end
    tokens = nltk.word_tokenize(sentence)[:max_seq_len]
    for pos, word in enumerate(tokens):
      x[idx, pos, :] = fast_model_opt.get_word_vector(word)
  return x

In [None]:
build_sentence_matrix(['this a string','this is another string']).shape # test how the function works

In [None]:
train_text_flstm = build_sentence_matrix(data_train.iloc[:,0]) # convert train text

First, build a simple model with only one LSTM layer.

In [None]:
keras.backend.clear_session()
inp = Input(shape=(max_seq_len,fast_model_opt.dim))
x = LSTM(lstm_size)(inp)
out = Dense(1,activation='sigmoid')(x)
model = Model(inp, out)

In [30]:
model.compile(
    tf.keras.optimizers.RMSprop(
    learning_rate=0.01),
    # The output layer already applies a sigmoid, so the loss receives probabilities, not logits
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=['accuracy'])
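
The `from_logits` flag tells the loss whether the model emits raw scores (logits) or probabilities. Since the `Dense` output layer here applies a sigmoid, its outputs are probabilities. The following plain-Python sketch, which is independent of Keras internals and uses made-up helper names, shows what goes wrong when a probability is mistaken for a logit:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(y_true, p):
    """Binary cross-entropy when the model output is already a probability."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def bce_from_logit(y_true, z):
    """Binary cross-entropy when the model output is a raw score (logit)."""
    return bce_from_probability(y_true, sigmoid(z))

logit = 2.0
prob = sigmoid(logit)                    # what a sigmoid output layer emits
correct = bce_from_probability(1, prob)  # probability treated as a probability
same = bce_from_logit(1, logit)          # logit treated as a logit
double = bce_from_logit(1, prob)         # probability mistaken for a logit
print(correct, same, double)
```

Applying the sigmoid twice squashes every prediction towards 0.5, so the loss for a confident correct prediction is larger than it should be.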

In [31]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 150, 128)]        0         
                                                                 
 lstm (LSTM)                 (None, 64)                49408     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 49,473
Trainable params: 49,473
Non-trainable params: 0
_________________________________________________________________
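
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights and a bias vector. The sketch below reproduces the numbers using that standard formula; the function names are illustrative.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (input_dim * units + units * units + units)

def dense_param_count(input_dim, units):
    """Weight matrix plus bias."""
    return input_dim * units + units

lstm_params = lstm_param_count(input_dim=128, units=64)  # 128-dim word vectors
dense_params = dense_param_count(input_dim=64, units=1)
print(lstm_params, dense_params, lstm_params + dense_params)
```

The same formula with `input_dim=64` gives 33,024, which matches the second LSTM layer in the stacked model further below.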


In [32]:
X_train_ls, X_valid_ls, y_train_ls, y_valid_ls = train_test_split(
    train_text_flstm, data_train['label'], test_size=0.2, random_state=42)

In [33]:
model.fit(X_train_ls,y_train_ls,epochs=10,batch_size= 100,validation_data=(X_valid_ls,y_valid_ls))

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fe8a01cb290>

In [34]:
test_text_flstm = build_sentence_matrix(data_test.iloc[:,0])

In [35]:
model.evaluate(test_text_flstm,data_test.iloc[:,1],batch_size=100)



[0.46056094765663147, 0.8181599974632263]

The model shows little overfitting, but it does not seem to have enough capacity even after 10 epochs of training. Try adding another layer to increase the capacity. Another issue is that performance on the test data is noticeably worse than on both the training and validation data. Try a more thoroughly randomized validation set than the one produced by `validation_split` in `model.fit`.

In [57]:
keras.backend.clear_session()
inp = Input(shape=(max_seq_len,fast_model_opt.dim))
lstm1 = LSTM(lstm_size,return_sequences=True,return_state=True)(inp)
lstm2 = LSTM(lstm_size)(lstm1[0])
out = Dense(1,activation='sigmoid')(lstm2)
model = Model(inp, out)

In [58]:
model.compile(
    tf.keras.optimizers.Adam(
    learning_rate=0.02),
    # The output layer already applies a sigmoid, so the loss receives probabilities, not logits
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    metrics=['accuracy'])

In [59]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 150, 128)]        0         
                                                                 
 lstm (LSTM)                 [(None, 150, 64),         49408     
                              (None, 64),                        
                              (None, 64)]                        
                                                                 
 lstm_1 (LSTM)               (None, 64)                33024     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 82,497
Trainable params: 82,497
Non-trainable params: 0
_________________________________________________________________


In [60]:
model.fit(X_train_ls,y_train_ls,epochs=10,batch_size= 100,validation_data=(X_valid_ls,y_valid_ls))

Epoch 1/10




Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fe79af93690>

In [61]:
model.evaluate(test_text_flstm,data_test.iloc[:,1],batch_size=50)





[0.4172135293483734, 0.8278800249099731]

The test result is not as good as expected. A more elaborate analysis is needed.