# Assignment 5

Build a CNN model for sentiment analysis (binary classification) of IMDB reviews (https://www.kaggle.com/utathya/imdb-review-dataset).
You may use the data with label="unsup" to pretrain embeddings, but you must not use the test dataset for pretraining.  
The quality metric is the accuracy score on the test dataset. Use the "type" column for the train/test split.  
You may use pretrained embeddings from external sources.  
You have to report the results of trials with different hyperparameter values.  

You have to beat the following baselines:  
[3 points] acc = 0.75  
[5 points] acc = 0.8  
[8 points] acc = 0.9  

[2 points] for using unsupervised data  
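
For reference, the architecture named in the task can be illustrated with a minimal NumPy sketch of the forward pass of a one-layer text CNN: embedding lookup, convolution over token windows, ReLU, max-over-time pooling, and a sigmoid output. All sizes, the random weights, and the name `cnn_forward` are illustrative assumptions. Nothing here is trained, and a real solution would use a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
vocab_size, emb_dim, n_filters, width = 50, 8, 4, 3

embeddings = rng.normal(size=(vocab_size, emb_dim))     # one row per token id
filters = rng.normal(size=(n_filters, width, emb_dim))  # each filter spans `width` tokens
w_out = rng.normal(size=n_filters)                      # weights of the output unit


def cnn_forward(token_ids):
    """Return an (untrained) probability of the positive class for one review."""
    x = embeddings[token_ids]                            # (seq_len, emb_dim)
    n_windows = len(token_ids) - width + 1
    windows = np.stack([x[i:i + width] for i in range(n_windows)])
    conv = np.einsum('nwe,fwe->nf', windows, filters)    # (n_windows, n_filters)
    conv = np.maximum(conv, 0.0)                         # ReLU
    pooled = conv.max(axis=0)                            # max over time: (n_filters,)
    logit = pooled @ w_out
    return 1.0 / (1.0 + np.exp(-logit))


prob = cnn_forward(np.array([3, 17, 42, 8, 25]))
print(prob)  # untrained weights, so the value is arbitrary
```

Max-over-time pooling is what lets reviews of different lengths map to a fixed-size vector of `n_filters` features.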

In [1]:
import nltk
import numpy as np
import pandas as pd
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score

nltk.download(['stopwords', 'wordnet'])

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/margaritaberseneva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/margaritaberseneva/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df = pd.read_csv('imdb_master.csv', encoding="latin-1", index_col=0)
df.head()

Unnamed: 0,type,review,label,file
0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt


In [3]:
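# Keep only labeled reviews; .copy() avoids a SettingWithCopyWarning
# when the 'text' column is added later.
df = df[df['label'] != 'unsup'].copy()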
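
The filter above drops the `unsup` rows. If they were wanted for embedding pretraining, one option would be to set them aside before filtering. The sketch below uses a tiny hypothetical frame (`toy`) with the same `type`, `review`, and `label` columns; its rows are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature of imdb_master.csv; the rows are invented.
toy = pd.DataFrame({
    'type':   ['test', 'train', 'train', 'train'],
    'review': ['bad film', 'great film', 'some review', 'another review'],
    'label':  ['neg', 'pos', 'unsup', 'unsup'],
})

is_unsup = toy['label'] == 'unsup'

# Unlabeled reviews, restricted to the train split so the test split stays untouched.
unsup_texts = toy.loc[is_unsup & (toy['type'] == 'train'), 'review'].tolist()

# Labeled reviews, as in the filter above.
labeled = toy[~is_unsup].copy()

print(len(unsup_texts), len(labeled))  # 2 2
```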

In [4]:
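def preproc(text):
    """Lowercase a review, lemmatize each token, and drop English stop words."""
    # Replace every non-letter character with a space, then lowercase.
    text = re.sub('[^a-zA-Z]', ' ', text).lower()
    # Splitting on a single space keeps empty tokens where several spaces meet;
    # CountVectorizer ignores them later.
    # Lemmatize as a noun first (the default), then as a verb.
    tokens = [lemmatizer.lemmatize(token) for token in text.split(" ")]
    tokens = [lemmatizer.lemmatize(token, 'v') for token in tokens]
    # Stop words are filtered after lemmatization, so forms such as
    # "has" -> "ha" and "was" -> "wa" slip past the filter.
    tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(tokens)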
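
To see what the first steps of `preproc` do to raw text, the sketch below applies only the regular expression, the lowercasing, and the split to a made-up sentence. Lemmatization and stop-word removal are left out so that it runs without NLTK data; `sample` and `cleaned` are names introduced here for illustration.

```python
import re

sample = "Great film, 10/10!"   # made-up input, not from the dataset

# Same cleaning as in preproc: non-letters become spaces, then lowercase.
cleaned = re.sub('[^a-zA-Z]', ' ', sample).lower()

# split(" ") yields an empty token for every extra space ...
tokens_single_space = cleaned.split(" ")
print(sum(tok == '' for tok in tokens_single_space))  # 8

# ... while split() with no argument collapses runs of whitespace.
print(cleaned.split())  # ['great', 'film']
```

The empty tokens do not change the bag-of-words counts, because they contain no word characters, but they do leave runs of spaces in the joined text.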

In [5]:
df['text'] = df['review'].apply(preproc)

In [6]:
df_train = df[df['type'] == 'train'].drop(columns=['type', 'file', 'review'])
df_test = df[df['type'] == 'test'].drop(columns=['type', 'file', 'review'])

In [7]:
# Encode labels as integers: 'neg' -> 0, 'pos' -> 1.
df_train['label'] = df_train['label'].apply(lambda x: 0 if x == 'neg' else 1)
df_test['label'] = df_test['label'].apply(lambda x: 0 if x == 'neg' else 1)

In [8]:
df_train.head()

Unnamed: 0,label,text
25000,0,story man ha unnatural feel pig start open sc...
25001,0,airport start brand new luxury plane l...
25002,0,film lack something put finger first charisma...
25003,0,sorry everyone know suppose art film wo...
25004,0,wa little parent take along theater see interi...


In [9]:
vectorizer = CountVectorizer(max_features=5000)
vectorizer.fit(df_train['text'].values)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
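
As a small illustration of what `max_features` does, the sketch below fits a `CountVectorizer` on a made-up three-document corpus and keeps only the two most frequent words. The corpus and the name `toy_vectorizer` are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus; "good" occurs 3 times, "movie" twice, the rest once.
corpus = ["good movie good plot", "bad movie", "good acting"]

toy_vectorizer = CountVectorizer(max_features=2)
toy_vectorizer.fit(corpus)
print(sorted(toy_vectorizer.vocabulary_))  # ['good', 'movie']

# Words outside the kept vocabulary ("bad") are ignored at transform time.
counts = toy_vectorizer.transform(["good good movie bad"]).toarray()
print(counts)  # [[2 1]]
```

The vocabulary is learned from the training texts only, which is why the notebook calls `fit` on `df_train` and only `transform` on the test split.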

In [10]:
X_train = vectorizer.transform(df_train['text']).toarray()
y_train = df_train['label'].values

X_test = vectorizer.transform(df_test['text']).toarray()
y_test = df_test['label'].values
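
`.toarray()` converts the sparse count matrix into a dense one, which can be large. A back-of-the-envelope estimate, assuming 25,000 training reviews (the usual size of the IMDB train split; the notebook does not print it) and the `int64` counts shown in the `CountVectorizer` output:

```python
n_rows = 25_000      # assumed number of training reviews
n_features = 5_000   # max_features passed to CountVectorizer
bytes_per_value = 8  # int64

dense_bytes = n_rows * n_features * bytes_per_value
print(dense_bytes)          # 1000000000
print(dense_bytes / 2**30)  # roughly 0.93 (GiB)
```

`RandomForestClassifier` also accepts SciPy sparse matrices in `fit` and `predict`, so dropping `.toarray()` is an option if memory is tight.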

In [11]:
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [12]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print('Train accuracy:', accuracy_score(y_train, y_train_pred), ' f1:', f1_score(y_train, y_train_pred))
print('Test accuracy:', accuracy_score(y_test, y_test_pred), ' f1:', f1_score(y_test, y_test_pred))

Train accuracy: 1.0  f1: 1.0
Test accuracy: 0.84508  f1: 0.8443015075376884
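
The run above uses a single configuration, and the gap between train accuracy (1.0) and test accuracy (about 0.845) suggests that the forest memorizes the training set. The assignment also asks for trials with different hyperparameter values. Below is a minimal sketch of how such trials could be logged. It uses synthetic data from `make_classification` so that it runs on its own; the grid values and the `trials` table are illustrative assumptions, not results from this notebook.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized reviews.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rows = []
for n_estimators, max_depth in product([10, 50], [3, None]):
    clf = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    clf.fit(X_tr, y_tr)
    rows.append({
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'train_acc': accuracy_score(y_tr, clf.predict(X_tr)),
        'val_acc': accuracy_score(y_val, clf.predict(X_val)),
    })

# One row per trial, ready to be reported alongside the final test score.
trials = pd.DataFrame(rows)
print(trials)
```

Selecting hyperparameters on a validation split held out from the training data keeps the test accuracy an honest estimate, and fixing `random_state` makes each trial repeatable.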
