In [94]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

<hr>

The data is located in the file amazon_reviews.txt. Each line contains a review followed by a 1 or a 0 at the end of the line, indicating whether the review is positive or negative, respectively.


Load the data set into a dataframe (you can use `pandas.read_csv`), and note that the separator is a tab (`'\t'`).

In [18]:
amazon_reviews = pd.read_csv("amazon_reviews.txt", sep="\t", header=None)
amazon_reviews.columns = ["comment","sentiment"]
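
As a quick check of the file layout, here is a minimal sketch that parses two made-up lines in the same format (review text, a tab, then the label). The sample reviews and the variable names `sample` and `reviews` are assumptions for illustration only; they are not taken from amazon_reviews.txt.

```python
import io
import pandas as pd

# Two hypothetical lines in the same layout as the data file:
# review text, a tab character, then the 0/1 sentiment label.
sample = "Great phone, works well.\t1\nStopped charging after a week.\t0\n"

# header=None because the file has no header row; names= sets the columns directly.
reviews = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                      names=["comment", "sentiment"])

print(reviews.shape)
print(reviews["sentiment"].tolist())
```

Commas inside a review do not split the line because only the tab is treated as a separator. If the real file contains stray double quotes, passing `quoting=csv.QUOTE_NONE` to `read_csv` is one option worth trying.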

<hr>
Print the first rows of your data frame to make sure the data is loaded correctly.

In [46]:
amazon_reviews.head(1)

Unnamed: 0,comment,sentiment
0,So there is no way for me to plug it in here i...,0


<hr>

Split your data into training and test sets (use 20% of reviews for test)

In [47]:
X_train, X_test, y_train, y_test = train_test_split(amazon_reviews.iloc[:,0], 
                                                    amazon_reviews.iloc[:,1], 
                                                    test_size=0.2, 
                                                    random_state=1)
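
The effect of `test_size=0.2` can be seen on a toy example. The sketch below uses ten made-up reviews and also passes `stratify`, which is not used in the cell above; it is shown only as an optional way to keep the class proportions the same in both sets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 reviews, 5 positive (1) and 5 negative (0).
texts = [f"review {i}" for i in range(10)]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# stratify=labels is an optional addition, not part of the exercise cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=1, stratify=labels
)

print(len(X_tr), len(X_te))  # sizes of the training and test sets
print(sorted(y_te))          # labels that ended up in the test set
```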

<hr>

We will use the words in the reviews as features. Our Naive Bayes assumption is the following:
* A review is randomly generated by:
    * Randomly choosing a class (positive or negative) with a probability to be learned
    * Once a class is chosen, generating a variable-length sequence of words from the overall vocabulary. At each position, a word w_i is drawn independently with a multinomial probability P(w_i | chosen sentiment class)
    * Therefore, the order of the words does not matter
 
We will then use a bag of words as the features for each review. A feature vector contains the number of occurrences of each vocabulary word in the given review.
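
To make the generative assumption concrete, here is a minimal hand-rolled sketch of the score log P(class) + sum over words of n_i * log P(w_i | class). The toy training set, the helper name `class_log_scores`, and the use of add-one smoothing are assumptions for illustration (add-one smoothing does match the default `alpha=1.0` of `MultinomialNB`).

```python
import math
from collections import Counter

# Hypothetical toy training set: (review, label), 1 = positive, 0 = negative.
train = [("good good phone", 1), ("bad phone", 0), ("good case", 1)]

# Overall vocabulary, built from the training reviews only.
vocab = sorted({w for text, _ in train for w in text.split()})

def class_log_scores(text, alpha=1.0):
    """Return the unnormalised log-probability of each class for a review."""
    counts = Counter(text.split())  # bag of words: word order is discarded here
    scores = {}
    for c in (0, 1):
        docs = [t for t, y in train if y == c]
        word_counts = Counter(w for t in docs for w in t.split())
        total = sum(word_counts.values())
        score = math.log(len(docs) / len(train))  # log P(class)
        for w, n in counts.items():
            if w in vocab:  # words never seen in training are ignored
                p = (word_counts[w] + alpha) / (total + alpha * len(vocab))
                score += n * math.log(p)  # n occurrences of word w
        scores[c] = score
    return scores

scores = class_log_scores("phone good")
same = class_log_scores("good phone")
print(scores, same)
```

Both calls give the same scores up to floating-point rounding, which is the "order does not matter" property, and the positive class scores higher for this review.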

<hr>

Use sklearn's feature extractor named `CountVectorizer` (https://bit.ly/2Yc5P1X) to convert the reviews into bags of words. Fit the converter on your training data only.

In [77]:
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1))
vectorizer.fit(X_train)

corpus_train = vectorizer.transform(X_train)
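
Fitting on the training data only means the vocabulary is fixed by the training reviews, so words that appear only in the test set are silently dropped. A small sketch on made-up sentences (`train_docs`, `test_docs`, and `vec` are illustrative names, not part of the exercise):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora.
train_docs = ["good phone", "bad battery bad case"]
test_docs = ["good battery terrible"]

vec = CountVectorizer()
vec.fit(train_docs)  # the vocabulary is learned from the training reviews only

# Words ordered by their column index in the count matrix.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
train_counts = vec.transform(train_docs).toarray()
test_counts = vec.transform(test_docs).toarray()

print(vocab)
print(train_counts)
print(test_counts)  # "terrible" is not in the vocabulary, so it is not counted
```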

<hr>

Instantiate a multinomial Naive Bayes classifier from sklearn, `MultinomialNB`.

In [69]:
clf = MultinomialNB()

<hr>

Fit your classifier on the training data.

In [84]:
clf.fit(corpus_train, y_train)

MultinomialNB()

<hr>

Print the accuracy score of your classifier (use the built-in `score` method of your model):
1. On your training data
2. And on your test data

In [87]:
pred_train = clf.predict(corpus_train)

accuracy_score(y_train, pred_train)

0.97375

In [88]:
corpus_test = vectorizer.transform(X_test)
pred_test = clf.predict(corpus_test)

accuracy_score(y_test, pred_test)

0.78
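
The cells above use `accuracy_score`; for a classifier, the built-in `score` method should return the same mean accuracy. A minimal sketch on made-up data (the reviews and the names `docs`, `labels`, and `model` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews and labels.
docs = ["good phone", "great case", "bad battery", "terrible charger"]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(docs)
model = MultinomialNB().fit(X, labels)

# Two ways of computing the same accuracy.
via_score = model.score(X, labels)
via_metric = accuracy_score(labels, model.predict(X))

print(via_score, via_metric)
```

On the real data this would read `clf.score(corpus_train, y_train)` and `clf.score(corpus_test, y_test)`.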

<hr>

Here we use only the number of occurrences of each word. If you have time, try fitting the model on tf-idf features instead. With tf-idf, a word w_i that has a high score in the reviews of a given sentiment contributes more to the estimated probability P(w_i | that sentiment). Intuitively, tf-idf is a word count scaled by the importance of the word: words that appear in many reviews are down-weighted.
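
The scaling can be seen on a toy corpus. In the sketch below (made-up sentences, default `TfidfVectorizer` settings), every word occurs once per review, so raw counts would all be equal, while the tf-idf weights are smaller for words shared by many reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews: "the" and "is" occur in every review.
docs = ["the phone is good", "the phone is bad", "the case is good"]

vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()
idx = vec.vocabulary_  # maps each word to its column index

# First review: "good" (in 2 of 3 reviews) outweighs "the" (in all 3).
print(weights[0, idx["good"]], weights[0, idx["the"]])
# Second review: "bad" (in 1 review) outweighs "phone" (in 2 reviews).
print(weights[1, idx["bad"]], weights[1, idx["phone"]])
```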

In [95]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)
corpus_train = vectorizer.transform(X_train)
corpus_test = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(corpus_train, y_train)

MultinomialNB()

In [96]:
pred_train = clf.predict(corpus_train)

accuracy_score(y_train, pred_train)

0.96875

In [97]:
corpus_test = vectorizer.transform(X_test)
pred_test = clf.predict(corpus_test)

accuracy_score(y_test, pred_test)

0.8