# Objective

* Load Data, vectorize reviews to numbers
* Build a basic model based on counting
* Evaluate the Model
* Make a first Kaggle Submission

## Download Data from Kaggle:

* Competition Link: https://www.kaggle.com/c/movie-sentiment-analysis
        
* Unzip into Data Directory

In [None]:
from __future__ import print_function  # Python 2/3 compatibility
import numpy as np
import pandas as pd
from collections import Counter

from IPython.display import Image

## Load Data

In [None]:
train_df = pd.read_csv("data/train.tsv", sep="\t")

In [None]:
train_df.sample(10)

In [None]:
# Load the Test Dataset
# Note that it's missing the Sentiment Column.  That's what we need to Predict
#
test_df = pd.read_csv("data/test.tsv", sep="\t")
test_df.head()

## Explore Dataset

In [None]:
# Equal Number of Positive and Negative Sentiments
train_df.sentiment.value_counts()

In [None]:
# Let's take a look at some examples
def print_reviews(reviews, max_chars=500):
    """Print each review, truncated to at most max_chars characters."""
    for review in reviews:
        print(review[:max_chars], end="\n\n")

In [None]:
# Some Positive Reviews
print("Sample **Positive** Reviews: ", "\n")
print_reviews(train_df[train_df["sentiment"] == 1].sample(3).review)

In [None]:
# Some Negative Reviews
print("Sample **Negative** Reviews: ", "\n")
print_reviews(train_df[train_df["sentiment"] == 0].sample(3).review)

## Vectorize Data (a.k.a. convert text to numbers)

Computers don't understand text, so we need to convert it to numbers before we can do any math on it and build a system that classifies a review as Positive or Negative.

Ways to vectorize data:

* Bag of Words
* TF-IDF
* Word Embeddings (Word2Vec) 

### Bag of Words

Take each sentence and count how many times each word occurs in it.
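
As a minimal sketch of the idea, the snippet below turns two made-up reviews into count vectors over a shared vocabulary. The names `toy_reviews`, `toy_counts`, `vocab` and `vectors` are hypothetical and are not used elsewhere in this notebook.

```python
from collections import Counter

# Two made-up reviews, used only for illustration
toy_reviews = ["A great great movie", "A terrible movie"]

# One Counter per review: each maps a lowercased word to its count
toy_counts = [Counter(review.lower().split()) for review in toy_reviews]

# Shared vocabulary, sorted so the column order is stable
vocab = sorted(set(word for counts in toy_counts for word in counts))

# Each review becomes a row of counts, one column per vocabulary word
vectors = [[counts[word] for word in vocab] for counts in toy_counts]

print(vocab)    # ['a', 'great', 'movie', 'terrible']
print(vectors)  # [[1, 2, 1, 0], [1, 0, 1, 1]]
```

Word order is lost in this representation; only the counts survive, which is where the name "bag" comes from.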

In [None]:
## Doing it by Hand

def bag_of_words_vocab(reviews):
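    """Returns a Counter mapping each lowercased word to its total count across the reviews"""
    # all_words = []
    # for review in reviews:
    #    for word in review.split():
    #       all_words.append(word.lower())
    ## List comprehension method of the same lines above
    # split() with no argument splits on any whitespace, matching the tokenization used at prediction time
    all_words = [word.lower() for review in reviews for word in review.split()]
    return Counter(all_words)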

In [None]:
words_vocab = bag_of_words_vocab(train_df.review)

In [None]:
words_vocab.most_common(20)

#### Observations:

* The most common words (also called stop words) are unfortunately not that meaningful
* These words are likely to appear in both Positive and Negative Reviews


We need a way to find which words are more likely to occur in Positive Reviews than in Negative Reviews

In [None]:
pos_words_vocab = bag_of_words_vocab(train_df[train_df.sentiment == 1].review)
neg_words_vocab = bag_of_words_vocab(train_df[train_df.sentiment == 0].review)

In [None]:
pos_words_vocab.most_common(10)

In [None]:
neg_words_vocab.most_common(10)

In [None]:
pos_neg_freq = Counter()

for word in words_vocab:
    pos_neg_freq[word] = (pos_words_vocab[word] + 1e-3) / (neg_words_vocab[word] + 1e-3)

In [None]:
print("Neutral words:")
print("Pos-to-neg for 'the' = {:.2f}".format(pos_neg_freq["the"]))
print("Pos-to-neg for 'movie' = {:.2f}".format(pos_neg_freq["movie"]))

print("\nPositive and Negative review words:")
print("Pos-to-neg for 'amazing' = {:.2f}".format(pos_neg_freq["amazing"]))
print("Pos-to-neg for 'terrible' = {:.2f}".format(pos_neg_freq["terrible"]))

### Let's Amplify the difference using Log Scale


* Neutral Values are Close to 1
* Negative Sentiment Words are less than 1
* Positive Sentiment Words are greater than 1

When converted to log scale:

* Neutral Values are Close to 0
* Negative Sentiment Words are negative
* Positive Sentiment Words are positive

That not only makes a lot of sense when looking at the numbers, but we can also use it for our first classifier
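
To make the effect of the transform concrete, here is a small sketch with hypothetical ratios (the values and the names `ratios` and `log_ratios` are invented for illustration and are not computed from the dataset). A word that is twice as frequent in positive reviews and a word that is twice as frequent in negative reviews end up at the same distance from 0, with opposite signs.

```python
import numpy as np

# Hypothetical pos-to-neg ratios, chosen only to illustrate the transform
ratios = {"neutral": 1.0, "twice_as_positive": 2.0, "twice_as_negative": 0.5}

# Natural log of each ratio: 1 maps to 0, values below 1 become negative
log_ratios = {name: float(np.log(value)) for name, value in ratios.items()}

for name, value in log_ratios.items():
    print("{:>18}: {:+.2f}".format(name, value))
```

Before the log, the negative words are squeezed into the interval between 0 and 1 while positive words can grow without bound; after the log, both sides are symmetric around 0.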

In [None]:
# https://www.desmos.com/calculator  
Image("images/log-function.png", width=960)

In [None]:
for word in pos_neg_freq:
    pos_neg_freq[word] = np.log(pos_neg_freq[word])

In [None]:
print("Neutral words:")
print("Log pos-to-neg for 'the' = {:.2f}".format(pos_neg_freq["the"]))
print("Log pos-to-neg for 'movie' = {:.2f}".format(pos_neg_freq["movie"]))

print("\nPositive and Negative review words:")
print("Log pos-to-neg for 'amazing' = {:.2f}".format(pos_neg_freq["amazing"]))
print("Log pos-to-neg for 'terrible' = {:.2f}".format(pos_neg_freq["terrible"]))

## Time to build a Counting Model

* For each review, we add up the pos_neg_freq values of all its words. If the total is > 0, we call it a Positive Review; otherwise we call it a Negative Review.  Sounds good?
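
The rule can be sketched on toy data before wrapping it in a class. The scores in `toy_scores` are made up for illustration, and `score_review` is a hypothetical helper, not part of this notebook's model.

```python
# Hypothetical log-ratio scores, made up for illustration
toy_scores = {"great": 1.2, "boring": -0.9, "terrible": -1.5}

def score_review(review, scores):
    """Sum the scores of all words; unknown words contribute 0."""
    return sum(scores.get(word, 0.0) for word in review.lower().split())

positive = score_review("A great great film", toy_scores)
negative = score_review("Boring and terrible", toy_scores)

print("{:+.2f} -> {}".format(positive, "Positive" if positive > 0 else "Negative"))
print("{:+.2f} -> {}".format(negative, "Positive" if negative > 0 else "Negative"))
```

Words that never appeared in the training reviews are treated as neutral here, which is an assumption of this sketch.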

In [None]:
class CountingClassifier(object):
    
    def __init__(self, pos_neg_freq):
        self.pos_neg_freq = pos_neg_freq
    
    def fit(self, X, y=None):
        # No Machine Learning here.  It's just counting
        pass
    
    def predict(self, X):
        predictions = []
        for review in X:
            all_words = [word.lower() for word in review.split()]
            # Built-in sum: calling np.sum on a generator is deprecated in NumPy
            result = sum(self.pos_neg_freq.get(word, 0) for word in all_words)
            predictions.append(result)
        return np.array(predictions)

In [None]:
counting_model = CountingClassifier(pos_neg_freq)
train_predictions = counting_model.predict(train_df.review)

In [None]:
train_predictions[:10]

In [None]:
# Convert to Binary Classifier
train_predictions > 0

In [None]:
y_pred = (train_predictions > 0).astype(int)
y_pred

In [None]:
y_true = train_df.sentiment
len(y_true)

In [None]:
np.sum(y_pred == y_true)

In [None]:
## Accuracy
# np.mean of the boolean matches is the fraction correct; it also avoids integer division on Python 2
train_accuracy = np.mean(y_pred == y_true)

print("Accuracy on Train Data: {:.2f}".format(train_accuracy))

#### Machine Learning Easy?  What Gives?

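Remember that this is Training Accuracy.  The word scores were computed from the same reviews we just predicted on, so this number says little about unseen reviews.  We have not split our Data into Train and Validation (which we will do in our next notebook when we actually build a Machine Learning Model)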
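
For orientation, a hold-out split could look roughly like the sketch below. It uses a toy frame in place of `train_df`; the names `toy_df`, `train_part` and `valid_part`, the 20% fraction, and the seed are all assumptions of this sketch, not the approach the next notebook necessarily takes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df, with the same two columns used in this notebook
toy_df = pd.DataFrame({
    "review": ["review {}".format(i) for i in range(10)],
    "sentiment": [0, 1] * 5,
})

# Shuffle the row positions with a fixed seed so the split is repeatable
rng = np.random.RandomState(42)
shuffled = rng.permutation(len(toy_df))

# Hold out 20% of the rows for validation (the fraction is an arbitrary choice)
n_valid = int(0.2 * len(toy_df))
valid_part = toy_df.iloc[shuffled[:n_valid]]
train_part = toy_df.iloc[shuffled[n_valid:]]

print(len(train_part), len(valid_part))  # 8 2
```

For the counting model, the word scores would then be computed from `train_part` only and the accuracy measured on `valid_part`.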

## Make a Submission to Kaggle

Predict on Test Data and Submit to Kaggle.  Maybe we could end the tutorial right here :-D

In [None]:
## Test Predictions (the test set has no Sentiment column, so accuracy comes from the Kaggle score)
test_predictions = counting_model.predict(test_df.review)

test_predictions

In [None]:
y_pred = (test_predictions > 0).astype(int)

In [None]:
df = pd.DataFrame({
    "document_id": test_df.document_id,
    "sentiment": y_pred
})

In [None]:
df.head()

In [None]:
df.to_csv("data/count-submission.csv", index=False)

## Reasons for Testing Accuracy Being Lower?


* One hypothesis: since we are just adding up ALL of the scores for each word in the review, the length of the review could have an impact.  Let's look at the length of reviews in the train and test datasets

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
train_df.review.str.len().hist(log=True)

In [None]:
test_df.review.str.len().hist(log=True)
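
Histograms can be hard to compare by eye, so a numeric summary side by side may help. The sketch below uses two tiny made-up Series in place of `train_df.review` and `test_df.review`; `toy_train`, `toy_test` and `length_summary` are hypothetical names.

```python
import pandas as pd

# Made-up reviews standing in for the real train and test review columns
toy_train = pd.Series(["short one", "a somewhat longer review text"])
toy_test = pd.Series(["tiny", "another fairly long review to compare"])

# describe() reports count, mean, std, min, quartiles and max of the lengths
length_summary = pd.DataFrame({
    "train": toy_train.str.len().describe(),
    "test": toy_test.str.len().describe(),
})

print(length_summary)
```

If the two columns look similar on the real data, review length alone is unlikely to explain the gap.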

## Next Steps

* Split the Training Data into Training and Validation to avoid surprises on New Data (this might not have helped with our counting method)
* Build a Machine Learning Model beyond the rule-based system of Counting values