# Objective

Build a Sentiment Classifier using Logistic Regression:

* Load Data
* Vectorize using Scikit-Learn
* Build a Logistic Regression Model
* Evaluate the Model
* Update our Kaggle Submission

In [None]:
from __future__ import print_function  # Python 2/3 compatibility
import numpy as np
import pandas as pd

from IPython.display import Image

## Load Data

In [None]:
train_df = pd.read_csv("data/train.tsv", sep="\t")

In [None]:
train_df.sample(10)

## Training process

* Split the Overall Training examples into Training and Validation
* Build the Models on Training Data
* Score on Validation data
* Choose the best model and submit to Kaggle


Caution: If you repeat this cycle many times, you will end up overfitting to the validation data. To avoid that, consider a three-way Train-Validation-Test split and generate the final score on the test data only.
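A minimal sketch of the three-way split mentioned above, calling `train_test_split` twice. The toy DataFrame, the `toy_` variable names, the 60/20/20 proportions, and the `random_state` value are all assumptions made for illustration; they are not part of this notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the review data; column names mirror train_df.
toy_df = pd.DataFrame({
    "review": ["review {}".format(i) for i in range(100)],
    "sentiment": [i % 2 for i in range(100)],
})

# First hold out 20% of the rows as the final test set.
toy_rest_X, toy_test_X, toy_rest_y, toy_test_y = train_test_split(
    toy_df["review"], toy_df["sentiment"],
    test_size=0.2, random_state=42, stratify=toy_df["sentiment"])

# Then split the remaining 80% into train (60% overall) and validation (20% overall).
toy_train_X, toy_valid_X, toy_train_y, toy_valid_y = train_test_split(
    toy_rest_X, toy_rest_y,
    test_size=0.25, random_state=42, stratify=toy_rest_y)

print("Train: {}, Validation: {}, Test: {}".format(
    len(toy_train_X), len(toy_valid_X), len(toy_test_X)))
```

With a split like this, the validation set is used to compare models, and the test set is scored once at the very end.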

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_df["review"], train_df["sentiment"], test_size=0.2)

In [None]:
print("Training Data: {}, Validation: {}".format(len(X_train), len(X_valid)))

## Vectorize Data (a.k.a. convert text to numbers)

Computers don't understand text, so we need to convert it to numbers before we can do any math on it and build a system that classifies a review as Positive or Negative.

Ways to vectorize data:

* Bag of Words
* TF-IDF
* Word Embeddings (Word2Vec) 

Scikit-Learn has nice APIs for preprocessing and feature extraction. In fact, these modules can be used even if you build your own models or use another library for the model-building process.
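To make the first two options concrete, here is a small sketch that runs a Bag of Words vectorizer and a TF-IDF vectorizer over the same three made-up sentences. The `toy_docs` list and the variable names are assumptions for illustration only. Word embeddings are not covered by this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus, not taken from the review dataset.
toy_docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]

# Bag of Words: each column holds the raw count of one term.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(toy_docs)

# TF-IDF: same columns, but counts are reweighted so that terms
# appearing in many documents contribute less.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(toy_docs)

print(sorted(bow.vocabulary_))
print(bow_matrix.toarray())
print(tfidf_matrix.toarray().round(2))
```

Both matrices have one row per document and one column per term. The count matrix holds integers, while the TF-IDF matrix holds floats, and with the default settings each TF-IDF row is scaled to unit length.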

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# The API is very similar to the model-building API.
# Step 1: Instantiate the Vectorizer (more generally called a Transformer)

vect = CountVectorizer(max_features=5000, binary=True, stop_words="english")

In [None]:
# Step 2: Fit on the training data only, so the vocabulary is learned from it
vect.fit(X_train)

# Step 3: Transform the training and validation data with the fitted vectorizer
X_train_vect = vect.transform(X_train)
X_valid_vect = vect.transform(X_valid)

In [None]:
X_train.head()

In [None]:
# Creates a Sparse Matrix
X_train_vect

In [None]:
# Understand the Vectorizer
vect

In [None]:
# Does similar things to what we did manually in our bag of words model

# vect.vocabulary_

In [None]:
# Peek at the first 10 (term, column index) pairs in the vocabulary
from itertools import islice

list(islice(vect.vocabulary_.items(), 10))

In [None]:
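# vocabulary_ maps term -> column index; sort terms by index so the labels match the matrix columns
pd.DataFrame(X_train_vect.todense(), columns=sorted(vect.vocabulary_, key=vect.vocabulary_.get)).head()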
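The `vocabulary_` attribute is a dictionary from each term to its column index in the vectorized matrix, and the dictionary's iteration order does not have to match that column order. The sketch below shows one way to recover column labels in the right order on a two-sentence toy corpus. The `toy_` names and the sentences are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
toy_docs = ["plot was good", "acting was bad"]

toy_vect = CountVectorizer(binary=True)
toy_matrix = toy_vect.fit_transform(toy_docs)

# Sort the terms by their column index so that label i describes column i.
toy_columns = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

toy_table = pd.DataFrame(toy_matrix.toarray(), columns=toy_columns)
print(toy_table)
```

Each row of `toy_table` has a 1 under every term that occurs in the corresponding sentence and a 0 elsewhere.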

## Model - Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train_vect, y_train)

In [None]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model.score(X_train_vect, y_train)))

In [None]:
# Validation Accuracy
print("Validation Accuracy: {:.3f}".format(model.score(X_valid_vect, y_valid)))

## Model Tuning

The model seems to be overfitting. Try regularization to bring the training accuracy closer to the validation accuracy.

* What options are available in Logistic Regression?
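One option is the `C` parameter of `LogisticRegression`, which is the inverse of the regularization strength: smaller values regularize more strongly. The sketch below sweeps a few values of `C` and prints training and validation accuracy for each. It uses synthetic data from `make_classification` rather than the reviews, and the grid of `C` values, the `toy_` names, and `max_iter=1000` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the review dataset): many features and
# few samples, which makes overfitting easy to provoke.
toy_X, toy_y = make_classification(
    n_samples=400, n_features=200, n_informative=10, random_state=0)
toy_X_tr, toy_X_val, toy_y_tr, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0)

# Fit one model per C value and record (train accuracy, validation accuracy).
results = {}
for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    toy_model = LogisticRegression(C=C, max_iter=1000)
    toy_model.fit(toy_X_tr, toy_y_tr)
    results[C] = (toy_model.score(toy_X_tr, toy_y_tr),
                  toy_model.score(toy_X_val, toy_y_val))
    print("C={:<6} train={:.3f} valid={:.3f}".format(C, *results[C]))
```

A reasonable way to read the output is to look for the `C` value where validation accuracy is highest and the gap to training accuracy is small.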

In [None]:
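# C is the inverse of regularization strength; smaller C means stronger regularization (default is 1.0)
model = LogisticRegression(C=0.1)
model.fit(X_train_vect, y_train)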

In [None]:
# Training and Validation Accuracy
print("Training Accuracy: {:.3f}".format(model.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model.score(X_valid_vect, y_valid)))

## Feeling Good? - Let's Update the Kaggle Submission

Steps:

* Load Test Dataset
* Vectorize the Features (Review)
* Predict the sentiment
* Create the CSV file and update the submission

In [None]:
# Read in the Test Dataset
# Note that it's missing the Sentiment column. That's what we need to predict.
test_df = pd.read_csv("data/test.tsv", sep="\t")
test_df.head()

In [None]:
# Vectorize the Review Text with the vectorizer fitted on the training data (transform only, no re-fit)

X_test = test_df.review
X_test_vect = vect.transform(X_test)

In [None]:
y_test_pred = model.predict(X_test_vect)

In [None]:
df = pd.DataFrame({
    "document_id": test_df.document_id,
    "sentiment": y_test_pred
})

In [None]:
df.to_csv("data/logistic_reg_submission1.csv", index=False)

In [None]:
!head data/logistic_reg_submission1.csv

## The End

* Now it's your turn: open the 04-compete notebook, try different classifiers, and see if you can improve the predictions.