[Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
======

## Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of each review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.
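
The labeling rule above can be written as a small function. This is only a sketch: `rating_to_sentiment` is a hypothetical helper that is not part of the dataset or the competition tooling, and returning `None` for ratings of 5 and 6 is an assumption, since the description only covers ratings below 5 and ratings of 7 or above.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB rating to the binary sentiment label described above."""
    if rating < 5:
        return 0  # negative review
    if rating >= 7:
        return 1  # positive review
    return None  # assumption: ratings 5 and 6 are not covered by the rule


# Show the label for a few sample ratings
for rating in (1, 4, 5, 6, 7, 10):
    print(rating, rating_to_sentiment(rating))
```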

## File descriptions

labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.

## Data fields

* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review

## Objective

The objective is to predict **sentiment** (positive or negative) from the **review** text, so X is the **review** column and y is the **sentiment** column.

## 1. Load Dataset

Let's first have a look at the data. You can download the file `labeledTrainData.tsv` from the [Kaggle website of the competition](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) or from our [Google Drive](https://drive.google.com/file/d/1a1Lyn7ihikk3klAX26fgO3YsGdWHWoK5/view?usp=sharing).


In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')

import pandas as pd
import numpy as np
import csv
import zipfile
import glob
from string import digits
import random

import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus.reader.wordnet import NOUN, VERB, ADJ
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from collections import Counter
from collections import defaultdict
import math

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\msi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\msi\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\msi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\msi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Read dataset with extra params sep='\t', encoding="latin-1"
data = pd.read_csv("labeledTrainData.csv",sep='\t', encoding='latin-1')
data.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


## 2. Preprocessing

In [3]:
X, Y = data['review'],data['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [4]:
print(Counter(y_train))
print(Counter(y_test))

Counter({0: 10019, 1: 9981})
Counter({1: 2519, 0: 2481})


In [5]:
def preformat(sentence):

    # Clean text data
    sentence = sentence.lower()  # Lowercase the text
    sentence = sentence.translate(str.maketrans('', '', digits))  # Remove digits
    sentence = re.sub(r"http\S+", "", sentence)  # Remove URLs
    sentence = sentence.translate({ord(c): " " for c in ",.:'!@#$%^&*()[]{}/<>?\\|`~=_+"})  # Replace special characters with spaces
    # Remove leftover HTML tag names and stray single-letter tokens
    sentence = sentence.replace(" br "," ").replace(" a "," ").replace(" p "," ").replace(" div "," ").replace(" span "," ").replace(" s "," ").replace(" i "," ")

    # Processing text data with NLP
    tokens = word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)

    lemmatizer = WordNetLemmatizer()
    # Map the first letter of the Penn Treebank tag to a WordNet part of speech
    pos_map = {'N': NOUN, 'V': VERB, 'J': ADJ}
    words = []

    # Keep only nouns, verbs and adjectives (lemmatized); all other tokens are dropped
    for token, pos_tag in tags:
        pos = pos_map.get(pos_tag[:1])
        if pos is not None:
            words.append(lemmatizer.lemmatize(token, pos=pos))

    return ' '.join(words)
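
To see what the cleaning half of `preformat` does without installing any NLTK data, the sketch below applies the same lowercase, digit, URL and special-character steps to a made-up sentence. `basic_clean` and the sample text are hypothetical names introduced only for this illustration. The POS-tagging and lemmatization half is deliberately left out, and only the `" br "` replacement is kept from the chain of `replace` calls.

```python
import re
from string import digits

# Same special characters that preformat replaces with spaces
SPECIAL_CHARS = ",.:'!@#$%^&*()[]{}/<>?\\|`~=_+"


def basic_clean(sentence: str) -> str:
    """Simplified stand-in for the cleaning steps of preformat (no NLTK)."""
    sentence = sentence.lower()
    sentence = sentence.translate(str.maketrans("", "", digits))  # drop digits
    sentence = re.sub(r"http\S+", "", sentence)  # drop URLs
    sentence = sentence.translate({ord(c): " " for c in SPECIAL_CHARS})
    sentence = sentence.replace(" br ", " ")  # leftovers of <br /> tags
    return " ".join(sentence.split())  # collapse repeated whitespace


sample = "Great movie!<br /><br />Rated 10/10, see http://example.com now."
print(basic_clean(sample))  # great movie rated see now
```

Because `<`, `/` and `>` are turned into spaces, each `<br />` tag becomes a standalone ` br ` token, which is why the original function removes it with a plain string replacement.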

In [6]:
print("---Text before formatting: ")
print(data['review'][0])
# Text after formatting
print("---Text after formatting: ")
print(preformat(data['review'][0]))

---Text before formatting: 
With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature f

In [7]:
stopword = set(stopwords.words('english'))

def build_vocab(X):
    """
    Build vocabulary from dataset
    """
    vocab = {}
    for sentence in X:
        sentence = preformat(sentence)
        for word in sentence.split():
            if word in stopword:
                continue  # Skip stopwords; they are not added to the vocabulary
            if word not in vocab:
                idx = len(vocab)
                vocab[word] = idx
    return vocab
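
The value stored for each word is its insertion index in the vocabulary, not a frequency. A minimal sketch on toy data makes this visible. `build_vocab_toy`, the tiny `STOPWORDS` set and the three sentences are all hypothetical stand-ins, and the `preformat` step is skipped so the example runs without NLTK.

```python
# Tiny stand-in for the NLTK English stopword list (assumption for this sketch)
STOPWORDS = {"the", "a", "is"}


def build_vocab_toy(sentences):
    """Assign each new non-stopword the next free integer index."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            if word in STOPWORDS:
                continue
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


toy_vocab = build_vocab_toy(["the movie is great", "a great plot", "the plot is thin"])
print(toy_vocab)  # {'movie': 0, 'great': 1, 'plot': 2, 'thin': 3}
```

Note that `great` appears twice in the toy data but still maps to `1`, the position at which it was first seen.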

In [8]:
list_vocab = build_vocab(X_train)
# Randomly select 20 words from the vocabulary and print each with its index in the vocabulary
for val in random.sample(list(list_vocab), 20):
    print(val + ": " + str(list_vocab[val]))

kelsey: 28199
ambitiousness: 48756
pornoes: 31941
waspish: 35985
incur: 20641
cecile: 16814
babtise: 15866
innsbruck: 39388
roeves: 34373
wonderfalls: 34121
knockoff: 978
castelnuovo: 18144
marianne: 4324
panic: 7374
heaven: 1155
bushy: 19161
murdock: 20867
marvel: 6310
enthralling: 44095
battle: 3407


## 3. Create Model and Train 

In [9]:
def train_naive_bayes(texts, labels, target_classes, alpha=1):  
    """
    Train a multinomial Naive Bayes model
    """
    ndoc = 0
    nc = defaultdict(int)   # map from a class label to number of documents in the class
    logprior = dict()
    loglikelihood = dict()
    count = defaultdict(int)  # count the occurrences of w in documents of class c

    vocab = build_vocab(texts)
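    # Training: count how often each vocabulary word occurs in each class.
    # Note: s is the raw review text, so only whitespace-separated tokens that
    # exactly match a (preprocessed) vocabulary entry are counted.
    for s, c in zip(texts, labels):
        ndoc += 1
        nc[c] += 1
        for w in s.split():
            if w in vocab:
                count[(w,c)] += 1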
    
    vocab_size = len(vocab)
    for c in target_classes:
        logprior[c] = math.log(nc[c]/ndoc)
        sum_ = 0
        for w in vocab.keys():
            if (w,c) not in count: count[(w,c)] = 0
            sum_ += count[(w,c)]
        
        for w in vocab.keys():
            loglikelihood[(w,c)] = math.log( (count[(w,c)] + alpha) / (sum_ + alpha * vocab_size) )
    
    return logprior, loglikelihood, vocab
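
The two quantities computed by `train_naive_bayes` can be written out as follows, where $N_c$ corresponds to `nc[c]`, $N_{doc}$ to `ndoc`, $\mathrm{count}(w, c)$ to `count[(w,c)]`, $\alpha$ to `alpha` and $|V|$ to `vocab_size`. The numbers in the worked example are made up for illustration and do not come from the dataset.

```latex
\log \hat{P}(c) = \log \frac{N_c}{N_{doc}}
\qquad
\log \hat{P}(w \mid c) = \log \frac{\mathrm{count}(w, c) + \alpha}
                                   {\sum_{w' \in V} \mathrm{count}(w', c) + \alpha \, |V|}

% Worked example with alpha = 1, |V| = 5 and 10 counted words in class c:
% a word seen 3 times:  \log \frac{3 + 1}{10 + 5} = \log \frac{4}{15} \approx -1.32
% a word never seen:    \log \frac{0 + 1}{10 + 5} = \log \frac{1}{15} \approx -2.71
```

The additive term $\alpha$ keeps the likelihood of an unseen word above zero, so its logarithm stays finite.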

In [10]:
def test_naive_bayes(testdoc, logprior, loglikelihood, target_classes, vocab):
    """
    Return the class with the highest log posterior score for testdoc
    """
    sum_ = {}
    for c in target_classes:
        sum_[c] = logprior[c]
        for w in testdoc.split():
            if w in vocab:  # words outside the vocabulary are ignored
                sum_[c] += loglikelihood[(w,c)]
    # Pick the class with the largest score
    return max(sum_, key=sum_.get)
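
The scoring rule can be checked by hand on a tiny example. The sketch below uses hand-picked, purely illustrative probabilities (they are not learned from the dataset and are not normalized), and `score` and `predict` are hypothetical helper names that mirror the logic of `test_naive_bayes`.

```python
import math

# Illustrative values only, chosen so the arithmetic is easy to follow
logprior = {0: math.log(0.5), 1: math.log(0.5)}
loglikelihood = {
    ("great", 0): math.log(0.1), ("great", 1): math.log(0.4),
    ("boring", 0): math.log(0.5), ("boring", 1): math.log(0.1),
}
vocab = {"great": 0, "boring": 1}


def score(doc, c):
    """Log prior plus the log likelihood of every known word in doc."""
    return logprior[c] + sum(
        loglikelihood[(w, c)] for w in doc.split() if w in vocab
    )


def predict(doc):
    """Return the class with the highest score."""
    return max(logprior, key=lambda c: score(doc, c))


# Class 0: 0.5 * 0.1 * 0.1 * 0.5 = 0.0025; class 1: 0.5 * 0.4 * 0.4 * 0.1 = 0.008
print(predict("great great boring"))  # 1
# "unknown" is not in the vocabulary, so only "boring" contributes
print(predict("boring unknown"))  # 0
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long reviews.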

In [13]:
target_classes = set(y_train)  # the distinct labels found in the training set

print(target_classes)

{0, 1}


In [14]:
logprior, loglikelihood, vocab = train_naive_bayes(X_train, y_train, target_classes)


In [15]:
predicted_labels = [test_naive_bayes(s, logprior, loglikelihood, target_classes, vocab)
                    for s in X_test]

print('Accuracy score: %f' % metrics.accuracy_score(y_test, predicted_labels))

Accuracy score: 0.826800


## 4. Evaluate Model

In [16]:
predicted_labels = [test_naive_bayes(s, logprior, loglikelihood, target_classes, vocab)
                    for s in X_test]

print('Accuracy score: %f' % metrics.accuracy_score(y_test, predicted_labels))

Accuracy score: 0.826800


In [17]:
# Macro-averaged scores: the unweighted mean of the per-class values
print('  Precision: %f' % metrics.precision_score(y_test, predicted_labels, average='macro'))
print('  Recall: %f' % metrics.recall_score(y_test, predicted_labels, average='macro'))
print('  F1: %f' % metrics.f1_score(y_test, predicted_labels, average='macro'))

  Precision: 0.826974
  Recall: 0.826887
  F1: 0.826795


In [18]:
print('Macro-averaged f1: %f' % metrics.f1_score(y_test, predicted_labels, average='macro'))
print('Micro-averaged f1: %f' % metrics.f1_score(y_test, predicted_labels, average='micro'))

Macro-averaged f1: 0.826795
Micro-averaged f1: 0.826800
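
The two averages differ in what they aggregate: macro averaging takes the mean of the per-class F1 scores, while micro averaging pools the true positives, false positives and false negatives of all classes first. A minimal pure-Python sketch is shown below. The labels are made up for illustration and `f1_scores` is a hypothetical helper, not a scikit-learn function.

```python
def f1_scores(y_true, y_pred, classes=(0, 1)):
    """Return (macro F1, micro F1) for single-label predictions."""
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / len(per_class)
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    return macro, micro


# Made-up labels: 4 negatives, 6 positives, 3 mistakes in total
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

macro_f1, micro_f1 = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("macro: %f, micro: %f, accuracy: %f" % (macro_f1, micro_f1, accuracy))
```

For single-label classification every false positive of one class is a false negative of another, so the micro-averaged F1 always equals the accuracy. That matches the output above, where the micro-averaged F1 is identical to the accuracy score.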


In [19]:
print(metrics.classification_report(y_test, predicted_labels, digits=3))

              precision    recall  f1-score   support

           0      0.817     0.838     0.828      2481
           1      0.837     0.815     0.826      2519

    accuracy                          0.827      5000
   macro avg      0.827     0.827     0.827      5000
weighted avg      0.827     0.827     0.827      5000

