# IMDB sentiment analysis with Word2Vec (Kaggle) 
[task link](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

> Use Google's Word2Vec for movie reviews

Data Set: <br>
* The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. <br>
* The sentiment of each review is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1.<br>
* No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. <br>
* In addition, there are another 50,000 IMDB reviews provided without any rating labels.
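
The labeling rule above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and reading the rating from the part of the `id` after the underscore is an assumption (it is consistent with the rows shown by `df_train.head()` further down, but the notebook itself never parses the id).

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB star rating to the binary label described above."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None  # ratings 5 and 6 are not covered by the rule


def rating_from_id(review_id: str) -> int:
    # Assumption: ids look like "5814_8", with the rating after the underscore.
    return int(review_id.rsplit("_", 1)[1])


for review_id in ["5814_8", "7759_3", "3630_4"]:
    print(review_id, rating_to_sentiment(rating_from_id(review_id)))
```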

Task: 

Method 1. Predict labeled data with BOW, tf-idf and MultinomialNB.

    a. Check basic info of the dataset
    b. Data cleaning with stopwords (use NLTK)
    c. Build a model with CountVectorizer and MultinomialNB (use GridSearchCV and Pipeline)
    d. Build a model with TfidfVectorizer and MultinomialNB (use GridSearchCV and Pipeline)
    e. Kaggle data submission

Method 2. Use Word2Vec and tree-based models for prediction.

    a. Train a Word2Vec model on the unlabeled data
    b. Turn the labeled reviews into vectors with the trained Word2Vec model
    c. Fit the vector data with GBDT and predict
    d. Fit the vector data with XGBoost and predict
    e. Kaggle data submission


#### Kaggle submission result

![submission result](https://github.com/vivianchang2019/IMDB_word2vec_kaggle/blob/master/result/word2vec_reviews.JPG?raw=true)

# 1. Predict labeled data with BOW, tf-idf and MultinomialNB

In [2]:
import pandas as pd
import numpy as np

In [3]:
df_train = pd.read_csv('data/IMDB/labeledTrainData.tsv', delimiter = '\t')
df_test = pd.read_csv('data/IMDB/testData.tsv', delimiter = '\t')

#### a. check basic info of dataset

In [4]:
df_train.info() # no missing value

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [5]:
df_train.head()

|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [6]:
df_test.head()

|   | id | review |
|---|----|--------|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... |


#### b. data cleaning with stopwords

In [7]:
from bs4 import BeautifulSoup
import re
import nltk

In [8]:
# nltk.download('stopwords')
from nltk.corpus import stopwords

In [9]:
# data cleaning

def review_to_text(review):
    # strip HTML tags
    raw_text = BeautifulSoup(review, 'html.parser').get_text()
    # replace every non-letter character with a space
    letters = re.sub('[^a-zA-Z]', ' ', raw_text)
    words = letters.lower().split()
    # remove stopwords (NLTK corpus names are lowercase)
    stop_words = set(stopwords.words('english'))
    words = ' '.join([w for w in words if w not in stop_words])

    return words
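
To make the effect of the cleaning steps concrete, here is a small standard-library sketch run on a made-up review. It is not the notebook's function: the regex tag stripper is a crude stand-in for BeautifulSoup, and `STOP_WORDS` is a hypothetical mini list used instead of NLTK's English stopword list.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "is", "this", "of", "and", "it"}


def clean_review(review: str) -> str:
    no_tags = re.sub(r"<[^>]+>", " ", review)     # crude stand-in for BeautifulSoup
    letters = re.sub(r"[^a-zA-Z]", " ", no_tags)  # digits and punctuation become spaces
    words = letters.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)


sample = "This movie is GREAT!<br /><br />10/10, the best of 2005."
print(clean_review(sample))  # movie great best
```

Note that numbers such as "10/10" are dropped along with punctuation, so any sentiment signal carried by digits is lost at this step.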


In [10]:
X_train = df_train['review'].apply(review_to_text)
X_test = df_test['review'].apply(review_to_text)

In [11]:
X_train.head()

0    stuff going moment mj started listening music ...
1    classic war worlds timothy hines entertaining ...
2    film starts manager nicholas bell giving welco...
3    must assumed praised film greatest filmed oper...
4    superbly trashy wondrously unpretentious explo...
Name: review, dtype: object

In [12]:
y_train = df_train['sentiment']

#### c. Build model by CountVectorizer and MultinomialNB

[CountVectorizer document](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [13]:
# Transforms text to vectors. 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [14]:
# use a Pipeline to chain CountVectorizer and MultinomialNB
pip_count = Pipeline([('count_vec', CountVectorizer(analyzer='word')),
                     ('mnb', MultinomialNB())])             

In [15]:
# set hyperparameters for GridSearch
params_count = {'count_vec__binary':[True, False],
                'count_vec__ngram_range': [(1, 1), (1, 2)],
                'mnb__alpha': [0.1, 1.0, 10.0]}
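
As a rough illustration of what the `binary` and `ngram_range` options in this grid change, the sketch below builds a bag of words by hand with the standard library. The helper names are hypothetical and the whitespace tokenization is simplified; scikit-learn's `CountVectorizer` applies its own token pattern and returns a sparse matrix rather than a dict.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with n between the two bounds (inclusive)."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def bag_of_words(text, ngram_range=(1, 1), binary=False):
    counts = Counter(ngrams(text.split(), ngram_range))
    if binary:
        # binary=True records presence only, not how often a term occurs
        return {term: 1 for term in counts}
    return dict(counts)


doc = "good good movie"
print(bag_of_words(doc))                      # {'good': 2, 'movie': 1}
print(bag_of_words(doc, binary=True))         # {'good': 1, 'movie': 1}
print(bag_of_words(doc, ngram_range=(1, 2)))  # adds 'good good' and 'good movie'
```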

In [16]:
# run GridSearch for CountVectorizer

gs_count = GridSearchCV(pip_count, params_count, cv=4, n_jobs=-1, verbose=1)
gs_count.fit(X_train, y_train)

Fitting 4 folds for each of 12 candidates, totalling 48 fits
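
The "12 candidates" and "48 fits" in the log line above follow directly from the grid: every combination of the listed values is tried, and each combination is fitted once per fold. A minimal check with `itertools.product`:

```python
from itertools import product

params_count = {
    "count_vec__binary": [True, False],
    "count_vec__ngram_range": [(1, 1), (1, 2)],
    "mnb__alpha": [0.1, 1.0, 10.0],
}

# every combination of the listed values is one candidate
candidates = list(product(*params_count.values()))
n_folds = 4

print(len(candidates))            # 12
print(len(candidates) * n_folds)  # 48
```

The same arithmetic explains the 27 candidates and 108 fits reported for the tree-based grids later on (3 x 3 x 3 combinations, 4 folds each).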


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  1.8min finished


GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('count_vec', CountVectorizer()),
                                       ('mnb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'count_vec__binary': [True, False],
                         'count_vec__ngram_range': [(1, 1), (1, 2)],
                         'mnb__alpha': [0.1, 1.0, 10.0]},
             verbose=1)

In [17]:
print(gs_count.best_score_)
print(gs_count.best_params_)

0.88216
{'count_vec__binary': True, 'count_vec__ngram_range': (1, 2), 'mnb__alpha': 1.0}


In [18]:
# predict the data with model and best parameter
count_y_predict = gs_count.predict(X_test)

#### d. Build model by TfidfVectorizer and MultinomialNB

In [19]:
pip_tfidf = Pipeline([('tfidf_vec', TfidfVectorizer(analyzer='word')),
                     ('mnb', MultinomialNB())])   

In [20]:
params_tfidf = {'tfidf_vec__binary':[True, False],
                'tfidf_vec__ngram_range': [(1, 1), (1, 2)],
                'mnb__alpha': [0.1, 1.0, 10.0]}
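
For intuition about what tf-idf adds over raw counts, here is a hand-rolled sketch using the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what `TfidfVectorizer` does with its default settings. The function name and the two toy documents are made up, and tokenization is reduced to a whitespace split.

```python
import math
from collections import Counter


def tfidf(docs):
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df_t)) + 1 for t, df_t in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize each document vector; guard against empty documents
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf(["good movie", "bad movie"])
for row in rows:
    print(row)
```

In this toy corpus "movie" appears in both documents, so it receives a lower weight than "good" or "bad", which each appear in only one.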

In [21]:
# run GridSearch for TfidfVectorizer
gs_tfidf = GridSearchCV(pip_tfidf, params_tfidf, cv=4, n_jobs=-1, verbose=1)
gs_tfidf.fit(X_train, y_train)

Fitting 4 folds for each of 12 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  2.0min finished


GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('tfidf_vec', TfidfVectorizer()),
                                       ('mnb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'mnb__alpha': [0.1, 1.0, 10.0],
                         'tfidf_vec__binary': [True, False],
                         'tfidf_vec__ngram_range': [(1, 1), (1, 2)]},
             verbose=1)

In [22]:
print(gs_tfidf.best_score_)
print(gs_tfidf.best_params_)

0.88712
{'mnb__alpha': 0.1, 'tfidf_vec__binary': True, 'tfidf_vec__ngram_range': (1, 2)}


In [23]:
tfidf_y_predict = gs_tfidf.predict(X_test)

#### e. Kaggle data submission

In [24]:
submission_count = pd.DataFrame({'id': df_test['id'], 'sentiment': count_y_predict})
submission_tfidf = pd.DataFrame({'id': df_test['id'], 'sentiment': tfidf_y_predict})


In [72]:
# submission_count.to_csv('result/submission_count.csv', index = False)
# submission_tfidf.to_csv('result/submission_tfidf.csv', index = False)

# 2. Use Word2Vec and tree-based models to do prediction

#### a. Train Word2Vec model by unlabeled data

In [26]:
unlabeled_train = pd.read_csv('data/IMDB/unlabeledTrainData.tsv', delimiter = '\t', quoting = 3)

In [27]:
unlabeled_train.head()

|   | id | review |
|---|----|--------|
| 0 | "9999_0" | "Watching Time Chasers, it obvious that it was... |
| 1 | "45057_0" | "I saw this film about 20 years ago and rememb... |
| 2 | "15561_0" | "Minor Spoilers<br /><br />In New York, Joan B... |
| 3 | "7161_0" | "I went to see this film with a great deal of ... |
| 4 | "43971_0" | "Yes, I agree with everyone on this site this ... |


In [28]:
import nltk

# nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [29]:
# split a review into sentences, then clean each sentence into a list of words


def review_to_text_word2vec(review):
    # strip HTML tags; fall back to the raw text if parsing fails
    try:
        raw_text = BeautifulSoup(review, 'html.parser').get_text()
    except Exception:
        raw_text = review
    # replace every non-letter character with a space
    letters = re.sub('[^a-zA-Z]', ' ', raw_text)
    words = letters.lower().split()
    # remove stopwords (NLTK corpus names are lowercase)
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    return words


def review_to_sentences(review, tokenizer):
    # split the review into sentences with the punkt tokenizer
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append(review_to_text_word2vec(raw_sentence))
    return sentences
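
Word2Vec is trained on a list of sentences, each of which is a list of words, and that is the shape these two functions produce. The sketch below shows that shape on a made-up review using only the standard library. The regex splitter is a rough stand-in for the punkt tokenizer, the function names are hypothetical, and stopword removal is left out to keep the example short.

```python
import re


def split_sentences(review):
    # Rough stand-in for the punkt tokenizer: split after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]


def to_token_lists(review):
    # one list of lowercase words per sentence
    return [
        re.sub(r"[^a-zA-Z]", " ", sentence).lower().split()
        for sentence in split_sentences(review)
    ]


print(to_token_lists("I loved it. Would watch again!"))
# [['i', 'loved', 'it'], ['would', 'watch', 'again']]
```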

In [30]:
# prepare data for Word2vec training

corpora = []
for review in unlabeled_train['review']:
    corpora += review_to_sentences(review, tokenizer)



[Word2Vec document](https://radimrehurek.com/gensim/models/word2vec.html) <br>
[Word2Vec explanation](https://allenlu2007.wordpress.com/2018/05/18/nlp-%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86-word2vec-%E5%B7%A5%E5%85%B7%E7%AE%B1%E5%92%8C-skip-gram/)

In [31]:
# train Word2Vec model
from gensim.models import Word2Vec

num_features = 300

# note: gensim >= 4.0 renames the `size` argument to `vector_size`
model = Word2Vec(corpora, size = num_features, workers = 4, min_count = 20, window = 10)
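
The `min_count` argument drops every word whose total frequency in the corpus is below the threshold, so rare words never get a vector. A minimal standard-library sketch of that filtering step, with a hypothetical helper name and a toy corpus (the notebook uses `min_count = 20` on the real data):

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Keep only words that occur at least min_count times in total."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


sentences = [["good", "movie"], ["good", "plot"], ["good", "movie", "twist"]]
print(sorted(build_vocab(sentences, min_count=2)))  # ['good', 'movie']
```

Words filtered out here are exactly the ones that later have to be skipped when review vectors are averaged.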

In [32]:
# model.save('model/word2vec_300features') # save the model

In [34]:
# check the training result
model.wv.most_similar("man")

[('lady', 0.5992120504379272),
 ('woman', 0.5843505859375),
 ('men', 0.5464239716529846),
 ('lad', 0.45248642563819885),
 ('payback', 0.43589454889297485),
 ('gentleman', 0.43261802196502686),
 ('person', 0.43119335174560547),
 ('monk', 0.429901123046875),
 ('mans', 0.41871994733810425),
 ('angus', 0.41664761304855347)]
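
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the idea on made-up two-dimensional vectors; the toy values and the helper names are illustrative only and have nothing to do with the trained model's actual vectors.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# made-up 2-d vectors, for illustration only
toy_vectors = {"man": [1.0, 0.0], "woman": [0.8, 0.6], "film": [0.0, 1.0]}


def most_similar(word, vectors, topn=2):
    # score every other word against the query and sort from most to least similar
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


result = most_similar("man", toy_vectors)
print(result)  # 'woman' ranks above 'film'
```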

#### b. Turn labeled reviews data into vectors by trained Word2Vec model

In [55]:
# Function to average all of the word vectors in a given paragraph

def makeFeatureVec(words, model, num_features):

    # Pre-initialize an empty numpy array
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0

    # set of the words in the model's vocabulary
    # (gensim >= 4.0 renames index2word to index_to_key)
    index2word_set = set(model.wv.index2word)

    # sum the vectors of the words that are in the vocabulary
    for word in words:
        if word in index2word_set:
            nwords += 1
            featureVec = np.add(featureVec, model.wv[word])

    # average; keep the zero vector if no word was in the vocabulary
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)

    return featureVec


In [56]:
# Function to calculate the average feature vector for each review 

def getAvgFeatureVecs(reviews, model, num_features):
    counter = 0
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")

    for review in reviews:
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
        counter = counter + 1
        
    return reviewFeatureVecs
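
The averaging step is easy to verify by hand on a toy vocabulary. The following is a plain-Python sketch with made-up two-dimensional vectors and a hypothetical function name, not the notebook's NumPy code. Words missing from the vocabulary are skipped, and a review with no known words yields a zero vector in this sketch.

```python
def average_vector(words, vectors, num_features):
    total = [0.0] * num_features
    n_known = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue  # out-of-vocabulary word: skip it
        n_known += 1
        total = [t + x for t, x in zip(total, vec)]
    if n_known == 0:
        return total  # no known words: return the zero vector
    return [t / n_known for t in total]


# made-up 2-d vectors, for illustration only
toy_vectors = {"good": [1.0, 3.0], "movie": [3.0, 1.0]}

print(average_vector(["good", "movie", "unseenword"], toy_vectors, 2))  # [2.0, 2.0]
print(average_vector(["unseenword"], toy_vectors, 2))                   # [0.0, 0.0]
```

Averaging discards word order and review length, which is one plausible reason the Word2Vec features score below the n-gram models in this notebook.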


In [57]:
# convert the training and test sets into vectors

clean_train_reviews = []
for review in df_train["review"]:
    clean_train_reviews.append(review_to_text_word2vec(review))

trainDataVecs = getAvgFeatureVecs(clean_train_reviews, model, num_features )

In [61]:
clean_test_reviews = []
for review in df_test["review"]:
    clean_test_reviews.append(review_to_text_word2vec(review))

testDataVecs = getAvgFeatureVecs(clean_test_reviews, model, num_features )

#### c. Fit the vector data by GBDT

In [62]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

In [63]:
gbc = GradientBoostingClassifier()

In [64]:
params_gbc = {'n_estimators':[10, 100, 500],
              'learning_rate': [0.01, 0.1, 1.0],
              'max_depth': [2, 3, 4]}

In [65]:
gs_gbc = GridSearchCV(gbc, params_gbc, cv=4, n_jobs=-1, verbose=1)

In [66]:
gs_gbc.fit(trainDataVecs, y_train)

Fitting 4 folds for each of 27 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 45.8min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 193.3min finished


GridSearchCV(cv=4, estimator=GradientBoostingClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.01, 0.1, 1.0],
                         'max_depth': [2, 3, 4],
                         'n_estimators': [10, 100, 500]},
             verbose=1)

In [67]:
print(gs_gbc.best_score_)
print(gs_gbc.best_params_)

0.8565600000000001
{'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 500}


In [68]:
result_gbc = gs_gbc.predict(testDataVecs)

In [74]:
submission_w2v_gbc = pd.DataFrame({'id': df_test['id'], 'sentiment': result_gbc})

In [86]:
# submission_w2v_gbc.to_csv('result/submission_w2v_gbc.csv', index = False)

#### d. Fit the vector data by XGBoost for classification.

In [77]:
import xgboost as xgb

In [78]:
# use a separate name so the xgboost module alias `xgb` is not overwritten
xgb_clf = xgb.XGBClassifier()

In [79]:
params_xgb = {'n_estimators':[10, 100, 500],
              'learning_rate': [0.01, 0.1, 1.0],
              'max_depth': [2, 3, 4]}

In [80]:
gs_xgb = GridSearchCV(xgb_clf, params_xgb, cv=4, n_jobs=-1, verbose=1)

In [81]:
gs_xgb.fit(trainDataVecs, y_train)

Fitting 4 folds for each of 27 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 12.9min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 51.1min finished


GridSearchCV(cv=4,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None, random_state=None,
                                     reg_alpha=None, reg_lambda=None,
                                     scale_pos_weight=None, subsample=None,
                                     tree_method=None,

In [82]:
print(gs_xgb.best_score_)
print(gs_xgb.best_params_)

0.8560800000000001
{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 500}
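
With all four grid searches finished, the best cross-validation scores printed in this notebook can be put side by side. The numbers below are copied from the outputs above (rounded to five decimals); they are 4-fold cross-validation scores on the training set, not Kaggle leaderboard scores, and the dictionary keys are just descriptive labels.

```python
cv_scores = {
    "CountVectorizer + MultinomialNB": 0.88216,
    "TfidfVectorizer + MultinomialNB": 0.88712,
    "Word2Vec + GBDT": 0.85656,
    "Word2Vec + XGBoost": 0.85608,
}

# print the models from best to worst cross-validation score
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.5f}")

best = max(cv_scores, key=cv_scores.get)
print("best:", best)
```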


#### e. Kaggle data submission

In [83]:
result_xgb = gs_xgb.predict(testDataVecs)

In [84]:
submission_w2v_xgb = pd.DataFrame({'id': df_test['id'], 'sentiment': result_xgb})

In [85]:
# submission_w2v_xgb.to_csv('result/submission_w2v_xgb.csv', index = False)