# Bag of Words Meets Bags of Popcorn

#### Yuichiro Suzuki
#### Last update: 20170326

## Description
---
In this project, we predict whether a movie review is positive or negative from the words it contains.  
This type of machine learning task is called sentiment analysis.  
The data set is available from the Kaggle competition [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data).

## Evaluation
---
Submissions are scored by the [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (AUC).
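
To make the metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed by hand. It relies on the fact that AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are my own illustration, not part of the competition code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive example outscores a negative one;
    ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels (1 = positive review) and hypothetical predicted probabilities.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, a submission should contain predicted probabilities rather than hard 0/1 labels.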

In [1]:
import pandas as pd
import re
import numpy as np
import nltk
import os
import csv
import pyprind


from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier
import joblib  # sklearn.externals.joblib was removed from newer scikit-learn releases
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier



In [2]:
train = pd.read_csv("./data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
train.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [3]:
print(train.shape)

(25000, 3)
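
A note on `quoting=3` in the `read_csv` call: the value 3 is `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters. That is why the `id` and `review` values shown above still carry their literal double quotes. The sketch below illustrates the difference with the standard `csv` module; the single data row is a hypothetical stand-in for the real file.

```python
import csv
import io

# One hypothetical row in the same layout as labeledTrainData.tsv.
raw = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff going down"\n'

# quoting=3 is the numeric value of csv.QUOTE_NONE.
quote_none_value = csv.QUOTE_NONE

# QUOTE_NONE: quote characters are kept as part of the field.
rows_none = list(csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))

# Default quoting: the surrounding quotes are interpreted and stripped.
rows_default = list(csv.reader(io.StringIO(raw), delimiter="\t"))

print(quote_none_value)    # 3
print(rows_none[1][0])     # "5814_8"  (quotes kept)
print(rows_default[1][0])  # 5814_8    (quotes stripped)
```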


### Tokenize and vectorize

In [4]:
from vectorizer import vect

"""
# vectorizer.py

def tokenizer(text):
    text = BeautifulSoup(text, "html.parser").get_text()
    emoticons = re.findall("(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    text = re.sub("[^a-zA-Z]", " ", text.lower()) + " ".join(emoticons).replace("-", " ")
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


vect = HashingVectorizer(decode_error="ignore",
                         n_features=2 ** 21,
                         preprocessor=None,
                         tokenizer=tokenizer)
                         
"""

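
To show what the tokenizer does to a single review, here is a small stand-alone sketch that follows the same steps with the standard library only. Two things are my own assumptions: the tag stripping uses a simple regex in place of `BeautifulSoup`, and `STOP` is a tiny hypothetical stop-word set, because the real `stop` list is defined elsewhere in `vectorizer.py`.

```python
import re

# Hypothetical, tiny stop-word set; the notebook's `stop` is defined elsewhere.
STOP = {"is", "a", "this", "and", "the"}


def tokenize(text, stop=STOP):
    # Simplification: strip HTML tags with a regex instead of BeautifulSoup.
    text = re.sub(r"<[^>]*>", " ", text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed.
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, turn every non-letter into a space, then append the emoticons
    # (same expression as in vectorizer.py).
    text = re.sub(r"[^a-zA-Z]", " ", text.lower()) + " ".join(emoticons).replace("-", " ")
    # Split on whitespace and drop stop words.
    return [w for w in text.split() if w not in stop]


tokens = tokenize("<br />This movie is great :-) and fun!")
print(tokens)  # ['movie', 'great', 'fun', ':', ')']
```

Note that, as written, replacing `-` with a space splits `:-)` into the two tokens `:` and `)`. Also, the emoticons are appended without a separating space, so a review that ends in a letter would have the first emoticon glued to its last word.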

## Modeling

### Stochastic Gradient Descent
I used stochastic gradient descent (SGD) in this project. SGD is useful for large data sets because scikit-learn's `SGDClassifier` provides a `partial_fit` method, which updates the model one minibatch at a time instead of requiring all the data in memory at once. The model was trained and pickled in advance and is loaded here with the joblib module. The source code that produced the pickled SGD model is shown below.

In [5]:
sgd = joblib.load(os.path.join("pkl_objects", "sgd.pkl"))


"""
def stream_docs(path):
    with open(path, "r", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)
        for line in reader:
            text, label = line[2], int(line[1])
            yield text, label


def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


doc_stream = stream_docs(path="./data/labeledTrainData.tsv")


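# Newer scikit-learn releases use loss="log_loss" and max_iter in place of loss="log" and n_iter.
sgd = SGDClassifier(loss="log_loss", random_state=1, max_iter=1)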

classes = np.unique(train["sentiment"])
length = 20
pbar = pyprind.ProgBar(length)

for _ in range(length):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    sgd.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

    
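# Hold out the remaining 5000 reviews to estimate accuracy.
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print("Accuracy: {0: .3f}".format(sgd.score(X_test, y_test)))


# After evaluation, update the model with the held-out reviews as well.
sgd = sgd.partial_fit(X_test, y_test)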


dest = os.path.join("pkl_objects")
if not os.path.exists(dest):
    os.makedirs(dest)
joblib.dump(sgd, open(os.path.join(dest, "sgd.pkl"), "wb"))

"""

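
The idea behind minibatch training can be sketched without scikit-learn. The example below is only an illustration under my own assumptions: `hash_features`, `OnlineLogistic`, and `minibatches` are hypothetical names, `zlib.crc32` stands in for the MurmurHash3 function that `HashingVectorizer` uses, and the toy reviews replace the streamed TSV rows. It shows the key property of `partial_fit`: the weights persist between calls, so the model never needs the full data set in memory.

```python
import math
import zlib


def hash_features(tokens, n_features=2 ** 16):
    """Map tokens to a sparse {index: count} dict with the hashing trick."""
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


class OnlineLogistic:
    """Tiny logistic regression trained one minibatch at a time (illustrative only)."""

    def __init__(self, n_features=2 ** 16, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(self.w[i] * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One SGD step per example on the log loss; weights carry over
        # between calls, so each minibatch refines the same model.
        for x, label in zip(X, y):
            err = self.predict_proba(x) - label
            for i, v in x.items():
                self.w[i] -= self.lr * err * v
            self.b -= self.lr * err
        return self


def minibatches(stream, size):
    """Yield lists of up to `size` items from any iterator."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical toy reviews standing in for the streamed TSV rows.
docs = [("great fun movie", 1), ("boring awful movie", 0),
        ("fun great acting", 1), ("awful boring plot", 0)] * 25

model = OnlineLogistic()
for batch in minibatches(iter(docs), size=10):
    X = [hash_features(text.split()) for text, _ in batch]
    y = [label for _, label in batch]
    model.partial_fit(X, y)

p_pos = model.predict_proba(hash_features("great fun".split()))
p_neg = model.predict_proba(hash_features("awful boring".split()))
print(p_pos > 0.5, p_neg < 0.5)
```

One difference from the notebook's `get_minibatch`: that function returns `(None, None)` as soon as the stream runs out, so a final partial batch is discarded, whereas `minibatches` here yields it. With 25000 reviews split into 20 batches of 1000 plus one batch of 5000, nothing is lost either way.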

## Processing test data

In [6]:
test = pd.read_csv("./data/testData.tsv", delimiter="\t", quoting=3)
test.head()

|   | id | review |
|---|---|---|
| 0 | "12311_10" | "Naturally in a film who's main themes are of ... |
| 1 | "8348_2" | "This movie is a disaster within a disaster fi... |
| 2 | "5828_4" | "All in all, this is a movie for kids. We saw ... |
| 3 | "7186_2" | "Afraid of the Dark left me with the impressio... |
| 4 | "12128_7" | "A very accurate depiction of small time mob l... |


In [7]:
test.shape

(25000, 2)

In [8]:
X_test = vect.transform(test["review"])

In [17]:
result = sgd.predict_proba(X_test)
output = pd.DataFrame({"id": test["id"], "sentiment": result[:, 1]})
output.to_csv("./processed/Bag_of_Words_model.csv", index=False, quoting=3)