# Intro to the Bag of Words Classifier

Our overall goal is to classify IMDB reviews as positive (1) or negative (0):

In [1]:
import pandas as pd

#   run ./get_data.sh from the data dir to download the data
reviews = pd.read_csv("data/movie_data.csv")

In [2]:
reviews.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


In [3]:
print("Example Positive review:")
print(reviews.iloc[3].review)

Example Positive review:
hi for all the people who have seen this wonderful movie im sure thet you would have liked it as much as i. i love the songs once you have seen the show you can sing along as though you are part of the show singing and dancing . dancing and singing. the song ONE is an all time fave musical song too and the strutters at the end with the mirror its so oh you have to watch this one


In [4]:
print("Example Negative review:")
print(reviews.iloc[1].review)

Example Negative review:
OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low energy style and he will steal a scene effortlessly. But, Disappearance is his misstep. Holy Moly, this was a bad movie! <br /><br />I must give kudos to the cinematography and and the actors, including Kris, for trying their darndest to make sense from this goofy, confusing story! None of it made sense and Kris probably didn't understand it either and he was just going through the motions hoping someone would come up to him and tell him what it was all about! <br /><br />I don't care that everyone on this movie was doing out of love for the project, or some such nonsense... I've seen low budget movies that had a plot for goodness sake! This had none, zilcho, nada, zippo, empty of reason... a complete waste of good talent, scenery and celluloid! <br /><br />I rented this piece of garbage for a buck, and I want my mon

## Where to start

To use the power of various machine learning methods, we need to do two things:
* turn our training examples into vectors (feature extraction)
* turn our labels (in this case positive / negative) into a binary, multi-class, or continuous output

The second part is done, as our labels are already binary.  The first part is harder: how do we make vectors representing these blocks of text?

Each of our training examples is approximately one paragraph of text.  The Bag of Words classifier treats each "word" as a feature, and at its core counts the number of times each unique word in a corpus (large collection of text) appears in each example.

## Transforming documents into feature vectors

Calling the `fit_transform` method on `CountVectorizer` constructs the vocabulary of the bag-of-words model and transforms the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two

In [5]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [6]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [7]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
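To make the mapping between the vocabulary and the count matrix concrete, the same numbers can be reproduced by hand. The following is a minimal sketch using only the standard library, not the way `CountVectorizer` is implemented. It assumes the default behaviour of lowercasing the text and keeping tokens of two or more word characters; the names `tokenize`, `vocabulary`, and `bag_by_hand` are illustrative.

```python
import re
from collections import Counter

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]


def tokenize(text):
    # Assumption: lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in docs]

# Each unique token gets a column index, assigned in alphabetical order
unique_tokens = sorted({tok for tokens in tokenized for tok in tokens})
vocabulary = {tok: idx for idx, tok in enumerate(unique_tokens)}


def to_counts(tokens):
    # One entry per vocabulary term, in column order; missing terms count as 0
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


bag_by_hand = [to_counts(tokens) for tokens in tokenized]

print(vocabulary)
for row in bag_by_hand:
    print(row)
```

Each column index is the value stored in the vocabulary, so column 6 counts `the` and column 0 counts `and`. In the third row, `is` (column 1) appears three times.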


## Assessing word relevancy via term frequency-inverse document frequency

When we analyze text data, we often encounter words that occur across many documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we look at a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

Here *tf(t, d)* is the term frequency, i.e. the raw count that we introduced in the previous section,
and the inverse document frequency *idf(t, d)* can be calculated as:

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and *df(d, t)* is the number of documents *d* that contain the term *t*. Adding the constant 1 to the denominator is optional: it avoids a division by zero for terms that appear in no document, and it assigns a non-zero value to terms that occur in all training samples. The log is used to ensure that low document frequencies are not given too much weight.
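As a small worked example of the formula above, the sketch below computes *df* and *idf* for a few terms of the three example sentences. It assumes a natural logarithm and a simple whitespace tokenizer; the helper names `df` and `idf` are illustrative.

```python
import math

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]

# Set of distinct lowercase tokens per document (commas removed)
tokenized = [set(doc.lower().replace(",", "").split()) for doc in docs]
n_d = len(docs)


def df(term):
    # Number of documents that contain the term
    return sum(term in tokens for tokens in tokenized)


def idf(term):
    # idf = log(n_d / (1 + df)), natural log assumed
    return math.log(n_d / (1 + df(term)))


for term in ("is", "sun", "and"):
    print(term, df(term), round(idf(term), 3))
```

With $n_d = 3$, the term `and` (one document) gets the largest weight, $\log(3/2) \approx 0.405$, `sun` (two documents) gets $\log(3/3) = 0$, and `is` (all three documents) gets a slightly negative weight, $\log(3/4) \approx -0.288$.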

Scikit-learn implements yet another transformer, the `TfidfTransformer`, that takes the raw term frequencies from `CountVectorizer` as input and transforms them into tf-idfs:

In [8]:
np.set_printoptions(precision=2)

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True, 
                         norm='l2', 
                         smooth_idf=True)
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[ 0.    0.43  0.    0.56  0.56  0.    0.43  0.    0.  ]
 [ 0.    0.43  0.    0.    0.    0.56  0.43  0.    0.56]
 [ 0.5   0.45  0.5   0.19  0.19  0.19  0.3   0.25  0.19]]
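The values printed above do not follow the textbook formula exactly. As documented for `TfidfTransformer`, with `smooth_idf=True` scikit-learn computes $\text{idf}(t) = \ln\frac{1+n_d}{1+\text{df}(t)} + 1$ and then, with `norm='l2'`, scales each row to unit Euclidean length. The following pure-Python sketch, written under that assumption, starts from the raw count matrix shown earlier; `tfidf_row` and `tfidf_by_hand` are illustrative names.

```python
import math

# Raw term counts from CountVectorizer (rows = documents, columns = terms)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_d = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf (assumed formula): ln((1 + n_d) / (1 + df)) + 1
idf = [math.log((1 + n_d) / (1 + df_j)) + 1 for df_j in df]


def tfidf_row(row):
    # Weight the raw counts by idf, then apply L2 normalization
    raw = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


tfidf_by_hand = [tfidf_row(row) for row in counts]
for row in tfidf_by_hand:
    print([round(value, 2) for value in row])
```

In the third document, `is` has the highest raw count (3), but because it occurs in every document its weight (0.45) ends up below that of `and` and `one` (0.5 each), which occur in only one document.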


### Conceptual Explanation

We're trying to develop a vector representation for each document / training example we encounter, and we'd like the representation to accurately represent each example.  How does the TFIDF representation accomplish that?  

TF (term frequency) counts the number of times a given word or token appears in a document.

IDF (inverse document frequency) measures how rare a token is across the corpus: it shrinks as the number of documents containing the token grows.

TFIDF representation relies on the following assumptions:
* if a term appears a lot in a document, it's important to that document
    * TFIDF grows as TF for a document grows
* if a term appears in a lot of documents, it's not a very important word for classification
    * TFIDF decreases as the number of documents a term appears in grows
* if a term doesn't appear in a lot of documents, it's an important word for classification
    * TFIDF increases as the number of documents a term appears in shrinks

## Naive Implementation

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

In [11]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None,
                        max_features=10000)

In [12]:
train_data = tfidf.fit_transform(reviews.review)

In [13]:
print(train_data.shape)

(50000, 10000)


Now we have a matrix of shape (n_documents, 10000) with which to fit an SVM.

In [14]:
clf = SVC(C=100)
clf.fit(train_data[:1000], reviews.sentiment[:1000])

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

#### How do we do?

In [17]:
from sklearn.metrics import accuracy_score, f1_score

y_pred = clf.predict(train_data[1000:1100])
print("accuracy: ", accuracy_score(reviews.sentiment[1000:1100], y_pred))
print("f1: ", f1_score(reviews.sentiment[1000:1100], y_pred))

accuracy:  0.46
f1:  0.630136986301


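Training on just 1000 reviews gives an [F1 Score](https://en.wikipedia.org/wiki/F1_score) of about 0.63, but that number is misleading. The accuracy of 0.46 is below the 0.5 that coin-flipping would be expected to reach on a balanced binary task, and the pair (accuracy 0.46, F1 0.63) is exactly what a classifier that labels every review as positive produces when 46 of the 100 test reviews are positive. The model has most likely not learned a useful decision boundary yet.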
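As a sanity check on these two metrics, the sketch below computes accuracy and F1 for a hypothetical classifier that predicts "positive" for every review. The split of 46 positive and 54 negative test reviews is an assumption chosen to match the reported accuracy, not a figure taken from the notebook; `accuracy_and_f1` is an illustrative helper.

```python
def accuracy_and_f1(y_true, y_pred):
    # Count true positives, false positives and false negatives for class 1
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Assumed test set: 46 positive and 54 negative reviews
y_true = [1] * 46 + [0] * 54
# Hypothetical degenerate classifier: always predict positive
y_pred = [1] * 100

acc, f1 = accuracy_and_f1(y_true, y_pred)
print("accuracy:", acc)
print("f1:", f1)
```

Precision is 0.46 and recall is 1.0, so F1 is $2 \cdot 0.46 / 1.46 \approx 0.63$. Looking at accuracy alongside F1 (or printing a confusion matrix) makes this kind of degenerate behaviour easy to spot.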

## Issues with our model

Dirty data:
* our dataset contains bits of HTML code (such as `<br />`) and other odd punctuation
* it also has groups of words such as "watch", "watching", "watches", "watcher" that are treated as unique words even though the information they convey is nearly the same
* uppercase and lowercase variants of the same word are counted separately, because we set `lowercase=False`

## Improve!

Implement some pre-processing! 
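One possible starting point is sketched below, using only the standard library. The function name `preprocess` and the exact cleaning rules are assumptions for illustration: it strips HTML tags, lowercases the text, and replaces punctuation with spaces. It does not address the "watch / watching" problem, which would need stemming or lemmatization.

```python
import re


def preprocess(text):
    # Replace HTML tags such as <br /> with a space
    text = re.sub(r"<[^>]*>", " ", text)
    # Treat "Movie" and "movie" as the same token
    text = text.lower()
    # Collapse anything that is not a letter, digit or apostrophe into one space
    text = re.sub(r"[^a-z0-9']+", " ", text)
    return text.strip()


sample = "Holy Moly, this was a bad movie! <br /><br />I must give kudos"
print(preprocess(sample))
# holy moly this was a bad movie i must give kudos
```

A function like this could be passed to `TfidfVectorizer` through its `preprocessor` argument, so that the cleaning happens before tokenization.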