Lab: Apply Naive Bayes to classify movie reviews with scikit-learn
====

![](https://imgs.xkcd.com/comics/star_ratings.png)

By The End Of This Lab You Should Be Able To:
----

- Programmatically download data from the Internet
- Fit a Naive Bayes model with scikit-learn
- Improve model performance to be better than the results from a published peer-reviewed paper

You are going to apply the Data Science workflow to classifying movie reviews as positive or negative. 

__Data Science Workflow:__

1. Ask
2. Acquire
3. Process
4. Model
5. Deliver 

You are going to use the Internet Movie Database ([imdb.com](https://www.imdb.com)) data from the seminal [Pang et al. (2002)](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf) paper.

1) Ask: 
----

Can a simple machine learning model predict whether a movie review is positive or negative?

Can we improve model performance to beat the state of the art from 2002?

2) Acquire
----

You are going to use movie review data from http://www.cs.cornell.edu/People/pabo/movie-review-data/, specifically the `polarity dataset v2.0 (3.0Mb)`

In [1]:
reset -fs

In [2]:
import os
import shutil
import tarfile
from urllib.request import urlretrieve

url = "http://www.cs.cornell.edu/People/pabo/movie-review-data/"
path = "."
filename = "review_polarity.tar.gz"

In [3]:
def download_and_unzip_data(url: str, path: str, filename: str) -> None:
    """Write code that retrieves the zipped file and unzips it.
    Check if file is local, if not then retrieve it.
    Check if files are unzipped, if not then unzip them.
    """
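    archive = os.path.join(path, filename)
    # Download the archive only if it is not already on disk.
    # The URL is built by plain concatenation (url ends with "/");
    # os.path.join is meant for filesystem paths, not URLs.
    if not os.path.exists(archive):
        urlretrieve(url=url + filename, filename=archive)
    # Extract the archive only if the target directory is missing
    if not os.path.exists(os.path.join(path, "txt_sentoken")):
        with tarfile.open(archive) as tar:
            tar.extractall(path)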

In [4]:
"""
2 points
Test code for the 'download_and_unzip_data' function. 
This cell should NOT give any errors when it is run.
"""

# Remove directory to make sure function works
try:
    shutil.rmtree(path+"/txt_sentoken/")
except OSError:
    print('Directory not found. Moving along…')
    
# Run function
print("Downloading and unzipping files…")
download_and_unzip_data(url, path, filename)

# NOTE: Assumes UNIX-style path names
assert os.path.exists(path+"/txt_sentoken")
assert os.path.exists(path+"/txt_sentoken/neg")
assert os.path.exists(path+"/txt_sentoken/pos")
assert os.path.exists(path+"/txt_sentoken/neg/cv000_29416.txt")
assert os.path.exists(path+"/txt_sentoken/pos/cv999_13106.txt")
assert not os.path.exists(path+"/txt_sentoken/neg/cv999_13106.txt")
print('Files downloaded and unzipped.')

Downloading and unzipping files…
Files downloaded and unzipped.


3) Process
-----

The data is already processed.

Look at several reviews in your favorite text editor to get a general sense of the data.

In [5]:
! head ./txt_sentoken/neg/cv000_29416.txt

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 


In [6]:
from sklearn.datasets import load_files

In [7]:
# Load data
# scikit-learn assumes data in different folders belongs to different classes
sentiment = load_files(path+'/txt_sentoken/', 
                       encoding='utf-8',
                       random_state=42)
sentiment.target_names

['neg', 'pos']
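To illustrate the folder-per-class convention, here is a minimal sketch that builds a throwaway directory with `neg` and `pos` subfolders and loads it with `load_files`. The two review texts and the file names are made up for illustration and are not part of the dataset.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical toy reviews, one per class (not taken from the real dataset)
toy_reviews = {
    "neg": ["a dull , jumbled mess ."],
    "pos": ["a smart , funny film ."],
}

with tempfile.TemporaryDirectory() as root:
    # Write one sub-folder per class; the folder name becomes the label
    for label, texts in toy_reviews.items():
        os.makedirs(os.path.join(root, label))
        for i, text in enumerate(texts):
            file_path = os.path.join(root, label, f"review_{i}.txt")
            with open(file_path, "w", encoding="utf-8") as f:
                f.write(text)
    # File contents are read into memory here, before the directory is removed
    toy = load_files(root, encoding="utf-8", random_state=42)

print(toy.target_names)  # folder names, sorted alphabetically
for text, target in zip(toy.data, toy.target):
    print(toy.target_names[target], "->", text)
```

The integer in `target` is the index of the folder name in `target_names`, so `0` means `neg` and `1` means `pos` here.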

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# Create train/test split with labels
train_data, test_data, train_target, test_target = train_test_split(sentiment.data,
                                                                    sentiment.target,
                                                                    random_state=42)
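When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the data. For the 2,000 reviews in this dataset that means 500 test reviews, which is why later cells assert that there are 500 predictions. A small sketch with placeholder documents (the strings and labels below are invented stand-ins):

```python
from sklearn.model_selection import train_test_split

# Placeholder documents and alternating labels standing in for the 2,000 reviews
docs = [f"review {i}" for i in range(2000)]
labels = [i % 2 for i in range(2000)]

# Default split: 75% train, 25% test
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, random_state=42
)

print(len(train_docs), len(test_docs))  # 1500 500
```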

4) Model
-----

The words need to be converted into numbers (i.e., vectorized). Machine Learning algorithms assume numeric inputs.

One of the simplest methods to vectorize is based on word counts. Each word is mapped to an index. For each index, we count the number of occurrences per document. The result is a matrix with a row for each document/review and a column for each word. This matrix is called the document-term matrix.

Learn more in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
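As a small illustration of a document-term matrix, the sketch below vectorizes three made-up documents. The documents are hypothetical; the real lab applies the same idea to the full review texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical mini-documents
toy_docs = ["good movie", "bad movie", "good good plot"]

toy_vectorizer = CountVectorizer()
matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x words

# Recover the column order: each word maps to a column index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(vocab)             # ['bad', 'good', 'movie', 'plot']
print(matrix.toarray())
# [[0 1 1 0]
#  [1 0 1 0]
#  [0 2 0 1]]
```

Each row is one document and each column is one word, so the `2` in the last row records that "good" appears twice in the third document.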

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
# Convert the words to numbers for both training and test data
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)

Now you can define your model. Let's start with a simple one: Naive Bayes. There are many variations of Naive Bayes; for today's lab, use the multinomial variation.

> The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). 

Learn more [here](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
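To make the idea concrete, here is a minimal sketch that fits `MultinomialNB` on a hand-written count matrix. The three "words" and four "documents" are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts; columns are "bad", "good", "movie"
toy_counts = np.array([
    [0, 2, 1],  # positive review
    [0, 1, 1],  # positive review
    [2, 0, 1],  # negative review
    [1, 0, 1],  # negative review
])
toy_targets = np.array([1, 1, 0, 0])  # 1 = pos, 0 = neg

toy_clf = MultinomialNB()  # default alpha=1.0 applies add-one smoothing
toy_clf.fit(toy_counts, toy_targets)

# Two unseen documents: "good movie" and "bad movie"
new_counts = np.array([[0, 1, 1], [1, 0, 1]])
toy_pred = toy_clf.predict(new_counts)
toy_proba = toy_clf.predict_proba(new_counts)

print(toy_pred)  # [1 0]
```

The smoothing matters here: "good" never appears in a negative training document, and without smoothing its estimated probability under the negative class would be zero.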

In [12]:
from sklearn.naive_bayes import MultinomialNB

> We decided to sample an equal number of positive and negative reviews—was that a good idea? — Bo Pang

Because the dataset contains equal numbers of positive and negative reviews, accuracy is a reasonable evaluation metric.
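One way to see why class balance matters for accuracy is to score a trivial classifier that always predicts the most common class. The label arrays below are synthetic and the helper name `majority_baseline_accuracy` is made up for this sketch.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic label sets: one balanced, one heavily skewed
balanced = np.array([0] * 50 + [1] * 50)
imbalanced = np.array([0] * 95 + [1] * 5)

def majority_baseline_accuracy(y: np.ndarray) -> float:
    """Accuracy of always predicting the most frequent class in y."""
    majority_class = np.bincount(y).argmax()
    always_majority = np.full_like(y, majority_class)
    return accuracy_score(y, always_majority)

print(majority_baseline_accuracy(balanced))    # 0.5
print(majority_baseline_accuracy(imbalanced))  # 0.95
```

On balanced data the trivial baseline sits at 50%, so any accuracy well above that reflects real signal. On skewed data a useless model can still look accurate.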

In [13]:
from sklearn.metrics import accuracy_score

In [14]:
def fit_nb(train_features, test_features, train_target, test_target) -> float:
    """Fit a NB model on the training features. 
    Then predict on the test features. 
    Return accuracy."""
    # YOUR CODE HERE
    # Define a classifier
    clf = MultinomialNB()
    clf.fit(train_features, train_target)
    # Predictions for the test reviews: a vector of 0s (neg) and 1s (pos)
    predicted = clf.predict(test_features)
    assert predicted.shape == (500,)
    # accuracy_score expects (y_true, y_pred)
    return accuracy_score(test_target, predicted)

In [15]:
clf = MultinomialNB()
clf.fit(train_features, train_target)
# Predictions for the test reviews: a vector of 0s (neg) and 1s (pos)
predicted = clf.predict(test_features)
predicted

array([1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1,

In [16]:
"""
3 points
Test code for the 'fit_nb' function. 
This cell should NOT give any errors when it is run.
"""

assert round(fit_nb(train_features, test_features, train_target, test_target), 3) == 0.812

5) Deliver
-----

The final part of the lab is to improve the model so that it scores greater than 82.90% accuracy on the test set, that is, better than the score from [the published paper](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf). Once you exceed the paper's performance, you have delivered and can stop.

You can tune the model by hand or write code to programmatically search for a "good enough model". If you write code to search, just turn in the best model. Do not submit search code.

Hints:

- Tune Naive Bayes
- Pick a different algorithm
- Engineer better features
    - Tune count vectorizer
    - Pick a different vectorizer
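If you choose to search programmatically, one possible approach is to wrap the vectorizer and classifier in a `Pipeline` and let `GridSearchCV` try combinations of settings. The sketch below runs on a tiny invented corpus so that it finishes instantly; the corpus, the step names `vec` and `nb`, and the parameter grid are illustrative assumptions, not recommended values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the real training reviews
toy_docs = [
    "great acting and a great plot",
    "a wonderful and moving film",
    "great fun from start to finish",
    "wonderful cast and great script",
    "terrible acting and a boring plot",
    "a dull and boring film",
    "terrible waste of time",
    "boring script and dull cast",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are tuned together as one estimator
pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Keys follow the "<step name>__<parameter>" convention
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__binary": [False, True],
    "nb__alpha": [0.1, 1.0],
}

# cv=2 only because the toy corpus is tiny; more folds suit real data
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

Cross-validating on the training data keeps the test set untouched until the final evaluation, so the reported test accuracy is not inflated by the search itself.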

In [17]:
# Reset namespace to make sure your function contains all necessary scikit-learn functions. 

In [18]:
reset -fs

In [19]:
from sklearn.datasets import load_files
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [20]:
def fit_final_model() -> float:
    """Fit your final model, returning accuracy. 
    Remember to add needed import statements inside of this function. 
    You should only use scikit-learn. No other packages should be used.
    Do not modify any code that is already written.
    """
    from sklearn.svm import SVC
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Load data
    sentiment = load_files('./txt_sentoken/', 
                           encoding='utf-8',
                           random_state=42)

    # Create train/test split with labels
    train_data, test_data, train_target, test_target = train_test_split(sentiment.data,
                                                                        sentiment.target,
                                                                        random_state=42)
    
    # Convert the words to numbers for both training and test data
    vectorizer = TfidfVectorizer(stop_words='english')
    
    train_features = vectorizer.fit_transform(train_data)
    test_features = vectorizer.transform(test_data)
    # A linear kernel ignores gamma and coef0, so they are omitted
    clf = SVC(C=1,
              kernel='linear',
              random_state=42)
    clf.fit(train_features, train_target)
    predicted = clf.predict(test_features)
    assert predicted.shape == (500,)
    # accuracy_score expects (y_true, y_pred)
    return accuracy_score(test_target, predicted)

fit_final_model()

0.856

In [21]:
"""
10 points 👈
Test code for the 'fit_final_model' function.
This cell should NOT give any errors when it is run, warnings are okay.
"""

assert round(fit_final_model(), 4) > .8290

Bonus Cartoon
------

![](https://imgs.xkcd.com/comics/emoji_movie_reviews.png)

<br>
<br> 
<br>

----