In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import seaborn as sns
import matplotlib 
%matplotlib inline
import matplotlib.pyplot as plt  
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm

## 1 Load data

Load data and split it into working and testing chunks. But before you begin: ensure you can save a
dataframe in a format you can load back in afterwards. DataFrame.to_csv is a good bet, but it has many
options that can change how the data is read back. Ensure you can store data in a way that you can
read it back in correctly, including that missing values stay missing.

1\. create a tiny toy data frame that includes some numbers, strings, and missings. Save it and ensure you can reload it in the correct form.

In [2]:
data = {"name": ["john", "jane", "bob", "candace"], "age": [18, 6, 36], "occupation": ["student", "farmer", "barista"]}
tiny_df = pd.DataFrame.from_dict(data, orient='index').T
tiny_df.head()

Unnamed: 0,name,age,occupation
0,john,18.0,student
1,jane,6.0,farmer
2,bob,36.0,barista
3,candace,,


In [3]:
tiny_df.to_csv("tiny_df.csv", sep=',', index=False)

In [4]:
pd.read_csv("tiny_df.csv", sep=',')

Unnamed: 0,name,age,occupation
0,john,18.0,student
1,jane,6.0,farmer
2,bob,36.0,barista
3,candace,,
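
A round trip like the one above can also be checked programmatically. The sketch below uses a made-up frame and an in-memory buffer instead of a file; the names `toy`, `reloaded`, and `tricky` are illustrative only. It also shows one pitfall: by default `read_csv` treats the literal string "NA" as a missing value.

```python
import io

import numpy as np
import pandas as pd

# Made-up frame with numbers, strings, and missing values
toy = pd.DataFrame({
    "name": ["john", "jane", "bob", "candace"],
    "age": [18.0, 6.0, 36.0, np.nan],
    "occupation": ["student", "farmer", "barista", np.nan],
})

# Write without the index, then read the same text back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises an AssertionError if values, dtypes, or missing positions differ
pd.testing.assert_frame_equal(toy, reloaded)

# Pitfall: a genuine string "NA" is read back as missing unless told otherwise
tricky = pd.DataFrame({"code": ["NA", "US"]})
text = tricky.to_csv(index=False)
as_default = pd.read_csv(io.StringIO(text))
as_strings = pd.read_csv(io.StringIO(text), keep_default_na=False)
print(as_default["code"].isna().tolist())
print(as_strings["code"].tolist())
```

Whether `keep_default_na=False` is appropriate depends on the data: it keeps "NA" as a string, but it also turns empty cells into empty strings rather than missing values.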


Now you are good to go:

2\. load the data (available on canvas: files/data/rotten-tomatoes.csv). DO NOT LOOK AT IT!

In [2]:
tomatoes = pd.read_csv("reviews.csv", sep=',')

3\. split the dataset into working-testing parts (80/20 or so). Note that sklearn's train_test_split can easily handle dataframes. Just for your confirmation, ensure that the size of the working and testing data look reasonable.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    tomatoes.loc[:, tomatoes.columns != "fresh"], 
    tomatoes["fresh"], 
    test_size=0.20)
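
As a sanity check on the split, here is a minimal sketch on a made-up frame (the `toy`, `work`, and `test` names are illustrative). It fixes `random_state` so the split is reproducible and looks at `.shape`, which reports rows and columns, whereas `.size` reports the total number of cells.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 reviews with alternating labels
toy = pd.DataFrame({
    "quote": [f"review {i}" for i in range(100)],
    "fresh": ["fresh", "rotten"] * 50,
})

# train_test_split accepts a dataframe directly and keeps the original index labels
work, test = train_test_split(toy, test_size=0.20, random_state=0)

print("work rows and columns:", work.shape)
print("test rows and columns:", test.shape)
print("work cells (.size):", work.size)
```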

In [5]:
print("train features size: " + str(X_train.size))

train features size: 86024


In [6]:
print("test features size: " + str(X_test.size))

NameError: name 'X_test' is not defined

4\. now save the test data and delete it from memory. Use Python's del statement, or R's rm function.

In [9]:
X_test.to_csv("X_test", sep=',', index=False)
y_test.to_csv("y_test", sep=',', index=False)

In [4]:
del X_test
del y_test

## 2 Explore and clean the data

Now that the test data is put aside, we can breathe out and take a closer look at what the work data looks like.

1\. Take a look at a few lines of data (you may use DataFrame.sample for this).

In [9]:
train_data = X_train.copy()  # copy so that adding the label column does not modify X_train
train_data["fresh"] = y_train

In [12]:
train_data.sample(5)

Unnamed: 0,critic,imdb,link,publication,quote,review_date,rtid,title,fresh
6081,Roger Moore,482606,http://www.orlandosentinel.com/entertainment/m...,Orlando Sentinel,Fans of the 'pitiless/merciless killers' schoo...,2008-05-30 00:00:00,678555400,The Strangers,fresh
10165,Rick Groen,403217,http://www.theglobeandmail.com/servlet/story/R...,Globe and Mail,Gus Van Sant ventures into the valley of death...,2005-08-12 00:00:00,12868,Last Days,fresh
4546,David Ansen,117318,http://www.msnbc.com/m/nw/a/m/mv_p.asp#The%20P...,Newsweek,"A brave, spectacularly entertaining -- and une...",2008-06-30 00:00:00,15475,The People Vs. Larry Flynt,fresh
9933,Jeff Millar,127722,http://www.chron.com/cs/CDA/moviestory.mpl/ae/...,Houston Chronicle,A well-executed scene can be followed by anoth...,2005-07-21 00:00:00,17297,Another Day In Paradise,rotten
3290,Hal Hinson,116320,http://www.washingtonpost.com/wp-srv/style/lon...,Washington Post,"On the whole, Baldwin seems pretty dim for a r...",2002-01-22 00:00:00,16068,Fled,rotten


2\. print out all variable names.

In [13]:
train_data.columns

Index(['critic', 'imdb', 'link', 'publication', 'quote', 'review_date', 'rtid',
       'title', 'fresh'],
      dtype='object')

3\. create a summary table (maybe more like a bullet list) where you print out the most important summary statistics for the most interesting variables. The facts you present should include: a) number of missing values for fresh and quote; b) all different values for fresh/rotten evaluations; c) counts or percentages of these values; d) number of zero-length or whitespace-only quotes; e) minimum, maximum, and average length of quotes (either in words or in characters; can you do this as a one-liner?); f) how many reviews appear in the data multiple times. Feel free to add more figures you consider relevant.

In [14]:
print("Number of nulls for fresh: " + str(y_train.isnull().sum()))
print("Number of nulls for quote: " + str(X_train["quote"].isnull().sum(axis = 0)))
print()
print("Unique values for fresh/rotten: ")
print(y_train.unique())
print()
print("Counts for unique values of fresh/rotten: ")
print(y_train.value_counts())
print()
print("Percentages for unique values of fresh/rotten: ")
print(y_train.value_counts()/y_train.size*100)
print()
# A quote is blank if nothing is left after stripping whitespace (covers "" and whitespace-only strings)
print("Number of zero-length or whitespace quotes: " + str((X_train["quote"].str.strip() == "").sum()))
print()
# Word count per quote, computed once and reused for all three statistics
quote_lengths = X_train["quote"].str.split().str.len()
print("Minimum, maximum, and average length of quotes in words: " + str(quote_lengths.min()) + ", " + str(quote_lengths.max()) + ", " + str(quote_lengths.mean()))
print()
# duplicated() flags every repeat of an earlier row, so the sum counts the extra copies
print("Reviews in data multiple times: " + str(X_train.duplicated().sum()))


Number of nulls for fresh: 0
Number of nulls for quote: 0

Unique values for fresh/rotten: 
['fresh' 'rotten' 'none']

Counts for unique values of fresh/rotten: 
fresh     6736
rotten    3998
none        19
Name: fresh, dtype: int64

Percentages for unique values of fresh/rotten: 
fresh     62.642983
rotten    37.180322
none       0.176695
Name: fresh, dtype: float64

Number of zero-length or whitespace quotes: 0

Minimum, maximum, and average length of quotes in words: 1, 49, 20.108992839207662

Reviews in data multiple times: 369
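
For item e), one possible one-liner is `agg` on the per-quote word counts. The sketch below runs on a made-up Series of quotes (not the course data) and also shows the blank and duplicate counts; the variable names are illustrative.

```python
import pandas as pd

# Made-up quotes: one whitespace-only, one empty, one repeated
quotes = pd.Series(["A brave film", "  ", "Dull", "", "A brave film"])

# d) blank quotes: nothing left after stripping whitespace
n_blank = (quotes.str.strip() == "").sum()

# e) min / max / mean word count in a single expression
length_stats = quotes.str.split().str.len().agg(["min", "max", "mean"])

# f) rows that repeat an earlier row
n_repeats = quotes.duplicated().sum()

print(n_blank, n_repeats)
print(length_stats)
```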


4\. Now that you have an overview of what is in the data, clean it by removing all the inconsistencies the table reveals. We have to ensure that the central variables, quote and fresh, are not missing, and that quote is not an empty string (or just spaces and such). I strongly recommend doing this as a standalone function, because at the end you have to perform exactly the same cleaning operations on your test data too.

In [10]:
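def clean(data):
    # dropna returns a new frame, so the caller's data is left unmodified
    data = data.dropna(subset=['fresh', 'quote'])
    data = data[data["fresh"] != "none"]
    # str.strip() leaves "" for both empty and whitespace-only quotes
    data = data[data["quote"].str.strip() != ""]
    data = data.drop_duplicates()
    # drop=True discards the old index instead of adding it as a column
    return data.reset_index(drop=True)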

In [11]:
cleaned_train_data = clean(train_data)

## 3 Naïve Bayes

Now that you are familiar with the data, it's time to get serious and implement the Naive Bayes classifier
from scratch. But first things first.

1\. Ensure you are familiar with Naive Bayes. Consult the readings, available on canvas. Schutt & O'Neil is an easy and accessible (and long) introduction; Witten & Frank is a much shorter but still accessible one.

2\. Convert your data (quotes) into bag-of-words. Your code should look something along the lines of PS4. However, now we don't want a BOW that contains counts of words in quotes, but just 1/0 (or true/false) for the presence/non-presence of the words. Convert the count-based BOW into such a presence BOW. Hint: think in terms of vectorized (universal) functions.

In [12]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned_train_data["quote"]).toarray()
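# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
words = vectorizer.get_feature_names_out()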

In [13]:
X = X > 0

3\. Split your work data and target (i.e. the variable fresh) into training and validation chunks (80/20 or so). Later we also do cross-validation, but for now, a simple training/validation will do.

In [14]:
X_t, X_v, y_t, y_v = train_test_split(
    cleaned_train_data.loc[:, cleaned_train_data.columns != "fresh"], 
    cleaned_train_data["fresh"], 
    test_size=0.20)

Good. Now you are ready with the preparatory work and it's time to dive into the real thing. Let's
implement Naive Bayes. Use only training data in the fitting below.

4\. Compute the unconditional (log) probability that the tomato is fresh/rotten, log Pr(F), and log Pr(R). These probabilities are based on the values of fresh variable but not on the words the quotes contain.

In [15]:
Nfresh = y_t[y_t == "fresh"].count()
Nrotten = y_t[y_t == "rotten"].count()
Ntotal = Nfresh + Nrotten
Pfresh = Nfresh/Ntotal
Protten = Nrotten/Ntotal
print("log probability that tomato is fresh: " + str(np.log(Pfresh)))
print("log probability that tomato is rotten: " + str(np.log(Protten)))

log probability that tomato is fresh: -0.469399582112
log probability that tomato is rotten: -0.981836809772


5\. For each word w, compute log Pr(w|F) and log Pr(w|R), the (log) probability that the word is present in a fresh/rotten review. These probabilities can easily be calculated from counts of how many times these words are present for each class. Hint: these computations are based on your BOW-s X. Look at ways to sum along columns in this matrix.

In [16]:
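# Rows of the presence BOW that belong to fresh / rotten training quotes
f = X[y_t[y_t == "fresh"].index,:]
r = X[y_t[y_t == "rotten"].index,:]
# Sum over quotes (axis=0) to count, for each word, how many quotes of the class contain it
PwordsF = np.sum(f, axis=0)/Nfresh
PwordsR = np.sum(r, axis=0)/Nrotten
# Words never seen in a class give log(0) = -inf here; smoothing (section 5) removes this
LPF = np.log(PwordsF)
LPR = np.log(PwordsR)
print("Probability words are in fresh review: " + str(LPF))
print("Probability words are in rotten review: " + str(LPR))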

Probability words are in fresh review: [-5.98664526 -5.55586234 -6.47215308 ..., -5.60715564 -5.55586234
 -5.60715564]
Probability words are in rotten review: [-4.82028157 -4.86110356 -6.65286303 ..., -5.40010006 -5.09471841
 -5.09471841]
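
To make the direction of the sum concrete, here is a minimal sketch on a made-up presence matrix with four quotes (rows) and three words (columns). Summing with `axis=0` collapses the rows and leaves one number per word, which is what Pr(w|F) and Pr(w|R) need. All names and values are illustrative.

```python
import numpy as np

# Made-up presence BOW: rows are quotes, columns are words
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=bool)
y = np.array(["fresh", "fresh", "rotten", "rotten"])

f = X[y == "fresh"]
r = X[y == "rotten"]

# axis=0 sums over quotes: one entry per word
p_w_fresh = f.sum(axis=0) / f.shape[0]
p_w_rotten = r.sum(axis=0) / r.shape[0]

# A word never seen in a class has probability 0, so its log is -inf
with np.errstate(divide="ignore"):
    log_p_w_fresh = np.log(p_w_fresh)
    log_p_w_rotten = np.log(p_w_rotten)

print(p_w_fresh, p_w_rotten)
print(log_p_w_fresh, log_p_w_rotten)
```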


Now we are done with the estimator. Your fitted model is completely described by these four probability
vectors: log Pr(F), log Pr(R), log Pr(w|F), log Pr(w|R). Let's now turn to prediction, and pull out your
validation data (not the test data!).

6\. For both destination classes, F and R, compute the log-likelihood that the quote belongs to this class. Based on the log-likelihoods, predict the class F or R for each quote in the validation set.

In [23]:
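llF, llR = (np.where(X[y_v.index, :], lp, 0).sum(axis=1) + np.log(p) for lp, p in ((LPF, Pfresh), (LPR, Protten)))  # log prior + log Pr(w|class) of the words present; LPF, LPR hold one entry per word
pred = np.where(llF > llR, "fresh", "rotten")  # pick the class with the larger log-likelihood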
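
The prediction step can be sketched on a tiny made-up model. The probabilities below are invented for illustration and are not estimates from the review data. For each quote, the class log-likelihood is the log prior plus the log Pr(w|class) of the words that are present; words that are absent contribute nothing in this simple variant.

```python
import numpy as np

# Hypothetical fitted model for a three-word vocabulary
log_prior_fresh = np.log(0.6)
log_prior_rotten = np.log(0.4)
lp_w_fresh = np.log(np.array([0.8, 0.5, 0.1]))   # log Pr(w|F)
lp_w_rotten = np.log(np.array([0.2, 0.5, 0.7]))  # log Pr(w|R)

# Made-up validation quotes as a presence BOW
X_val = np.array([[1, 0, 0],
                  [0, 0, 1],
                  [1, 1, 0]], dtype=bool)

# np.where keeps the word's log-probability where it is present and 0 elsewhere
ll_fresh = log_prior_fresh + np.where(X_val, lp_w_fresh, 0.0).sum(axis=1)
ll_rotten = log_prior_rotten + np.where(X_val, lp_w_rotten, 0.0).sum(axis=1)

pred = np.where(ll_fresh > ll_rotten, "fresh", "rotten")
print(pred)
```

Using `np.where` rather than a matrix product avoids `0 * -inf = nan` when an unsmoothed model contains `-inf` entries.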

7\. Print the resulting confusion matrix and accuracy (feel free to use existing libraries).

In [24]:
# from sklearn.metrics import confusion_matrix
# confusion_matrix(y_v, pred)
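
A minimal sketch of the confusion matrix and accuracy, using made-up label vectors in place of the validation labels and predictions. Passing `labels` fixes the row and column order; rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels
y_true = ["fresh", "rotten", "rotten", "fresh", "rotten"]
y_pred = ["fresh", "rotten", "fresh", "fresh", "rotten"]

cm = confusion_matrix(y_true, y_pred, labels=["fresh", "rotten"])
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```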

## 4 Interpretation

Now it is time to look at your fitted model a little bit closer. NB model probabilities are rather easy to understand and interpret. The task here is to find the best words to predict a fresh, and a rotten review. And we only want to look at words that are reasonably frequent, say more frequent than 30 times in the data.

1\. Extract from your conditional probability vectors log Pr(w|F) and log Pr(w|R) the probabilities that correspond to frequent words only.

In [25]:
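# Count, for each vocabulary word, the training quotes that contain it (axis=0 sums over quotes)
countF = np.sum(f, axis=0)
countR = np.sum(r, axis=0)
# Mask over the vocabulary: words present in more than 30 training quotes
frequentF = frequentR = (countF + countR) > 30
# log Pr(w|F) and log Pr(w|R) restricted to the frequent words
frequent_LPF = np.log(countF[frequentF]/Nfresh)
frequent_LPR = np.log(countR[frequentR]/Nrotten)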

2\. Find 10 best words to predict F and 10 best words to predict R. Hint: imagine we have a review that contains just a single word. Which word will give the highest weight to the probability the review is fresh? Which one to the likelihood it is rotten? Comment your results.

In [26]:
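top_F = np.where(frequentF)[0][(-frequent_LPF).argsort()[:10]]  # map positions in the frequent subset back to vocabulary indices
print("10 best words to predict F: " + str([words[i] for i in top_F]))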

10 best words to predict F: ['1957', '27', '1971', 'about', 'active', '1963', '20000', 'abroad', '90', '17th']


In [27]:
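top_R = np.where(frequentR)[0][(-frequent_LPR).argsort()[:10]]  # map positions in the frequent subset back to vocabulary indices
print("10 best words to predict R: " + str([words[i] for i in top_R]))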

10 best words to predict R: ['1955', '1993', '136', '39', '8212', '21st', '231', '1920s', '3000', '130']
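
One way to read the hint about a single-word review: for such a review the prior is the same for every word, so what decides the class is the difference log Pr(w|F) − log Pr(w|R). Ranking by log Pr(w|F) alone tends to favor words that are simply common in both classes. The sketch below uses invented counts for six words to illustrate the ranking; the numbers are not from the review data.

```python
import numpy as np

# Invented per-word document counts in each class
words = np.array(["the", "brilliant", "dull", "moving", "mess", "zeitgeist"])
count_fresh = np.array([900, 60, 5, 45, 2, 3])
count_rotten = np.array([550, 6, 50, 5, 40, 1])
n_fresh, n_rotten = 1000, 600

# Keep words that occur in more than 30 quotes overall
frequent = (count_fresh + count_rotten) > 30

# Positive values favor fresh, negative values favor rotten
log_ratio = np.log(count_fresh / n_fresh) - np.log(count_rotten / n_rotten)

freq_words = words[frequent]
order = np.argsort(log_ratio[frequent])   # ascending: most rotten first

best_rotten = freq_words[order[:2]]
best_fresh = freq_words[order[::-1][:2]]
print("best for fresh:", best_fresh)
print("best for rotten:", best_rotten)
```

Note that "the" is by far the most probable word in fresh reviews here, yet its log ratio is close to zero, so it says almost nothing about the class.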


3\. Print out a few misclassified quotes. Can you understand why these are misclassified?

## 5 NB with smoothing

So, now you have your brand-new NB algorithm up and running. As a next step, we add smoothing to it. As you will be doing cross-validation below, your first task is to mold what you did above into two functions: one for fitting and another one for predicting.

1\. Create two functions: one for fitting NB model, and another to predict outcome based on the fitted model. As mentioned above, the model is fully described with 4 probabilities, so your fitting function may return such a list as the model; and the prediction function may take it as an input.

In [28]:
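def fitNB(X_t, y_t, alpha):
    """Fit NB on a presence BOW X_t (rows = quotes) and labels y_t; alpha is the smoothing count."""
    y_t = np.asarray(y_t)
    f = X_t[y_t == "fresh"]
    r = X_t[y_t == "rotten"]
    Nfresh, Nrotten = f.shape[0], r.shape[0]
    # Log priors log Pr(F) and log Pr(R)
    LPfresh = np.log(Nfresh/(Nfresh + Nrotten))
    LProtten = np.log(Nrotten/(Nfresh + Nrotten))
    # Smoothed log Pr(w|F) and log Pr(w|R); the 2*alpha denominator is one common choice (an assumption here)
    LPF = np.log((f.sum(axis=0) + alpha)/(Nfresh + 2*alpha))
    LPR = np.log((r.sum(axis=0) + alpha)/(Nrotten + 2*alpha))
    return LPfresh, LProtten, LPF, LPR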

In [29]:
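def predictNB(model, X):
    """Predict fresh/rotten per row of a presence BOW X; model = (log Pr(F), log Pr(R), log Pr(w|F), log Pr(w|R))."""
    LPfresh, LProtten, LPF, LPR = model
    # Class log-likelihood: log prior plus log Pr(w|class) summed over the words present in the quote
    llF = LPfresh + np.where(X, LPF, 0).sum(axis=1)
    llR = LProtten + np.where(X, LPR, 0).sum(axis=1)
    return np.where(llF > llR, "fresh", "rotten")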

2\. Add smoothing to the model. See Schutt p 103 and 109. Smoothing amounts to assuming that we have "seen" every possible word α > 0 times already, for both classes. (If you wish, you can also assume you have seen the words α times for F and β times for R). Note that α does not have to be an integer, and typically the best α < 1.

3\. Now fit a few models with different α-s and see if the accuracy improves compared to the baseline case above.
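
The effect of smoothing can be seen on a few made-up counts. The sketch below assumes one common additive form, (count + α) / (n + 2α), where the 2α in the denominator accounts for the two outcomes "word present" and "word absent". The readings may use a different denominator, so treat this form and the function name `smoothed_log_prob` as assumptions.

```python
import numpy as np

def smoothed_log_prob(counts, n_docs, alpha):
    """Log of (count + alpha) / (n_docs + 2 * alpha) for each word."""
    return np.log((counts + alpha) / (n_docs + 2 * alpha))

# Made-up counts: how many of 10 quotes in a class contain each of three words
counts = np.array([0, 3, 10])
n_docs = 10

# Without smoothing, a word never seen in the class gets log(0) = -inf
with np.errstate(divide="ignore"):
    unsmoothed = smoothed_log_prob(counts, n_docs, alpha=0)

# With alpha > 0 every log-probability is finite
smoothed = smoothed_log_prob(counts, n_docs, alpha=0.5)

print(unsmoothed)
print(smoothed)
```

A single `-inf` rules out a class for every quote that contains the word, no matter what the other words say; smoothing replaces that veto with a large but finite penalty.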

## 6 Cross-Validation

Finally (well, almost finally), we do cross-validation. This is another piece of code you have to implement yourself, not use existing libraries.

* Implement k-fold CV. I recommend implementing it as a function that a) puts your data into random order; b) splits it into k chunks; c) selects a chunk for testing and the others for training; d) trains your NB model on the training chunks; e) computes accuracy on the testing chunk; f) returns mean accuracy over all these k trials. The function should also take α as an argument; this is the hyperparameter you are going to optimize.

In [None]:
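def cv(k, alpha):
    # Not implemented yet: shuffle the data, split into k chunks, fit and score per chunk, return the mean accuracy
    raise NotImplementedError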

* Find the optimal α by 5-fold CV using your own CV code. You have to find the cross-validated accuracies for a number of α-s between 0 and 1. Present the accuracy as a function of α on a plot and indicate which one is the best α.
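
The k-fold steps a) to f) can be sketched as follows. This is a self-contained illustration on synthetic data, not the notebook's own implementation: `fit_nb`, `predict_nb`, and `kfold_accuracy` are hypothetical names, and the smoothing denominator `n + 2 * alpha` is an assumed form.

```python
import numpy as np

def fit_nb(X, y, alpha):
    """Return (log Pr(F), log Pr(R), log Pr(w|F), log Pr(w|R)) with additive smoothing."""
    f, r = X[y == "fresh"], X[y == "rotten"]
    n_f, n_r = f.shape[0], r.shape[0]
    with np.errstate(divide="ignore"):  # alpha = 0 may give log(0) = -inf
        return (np.log(n_f / (n_f + n_r)),
                np.log(n_r / (n_f + n_r)),
                np.log((f.sum(axis=0) + alpha) / (n_f + 2 * alpha)),
                np.log((r.sum(axis=0) + alpha) / (n_r + 2 * alpha)))

def predict_nb(model, X):
    """Pick the class with the larger log-likelihood for each row of X."""
    lp_f, lp_r, lpw_f, lpw_r = model
    ll_f = lp_f + np.where(X, lpw_f, 0.0).sum(axis=1)
    ll_r = lp_r + np.where(X, lpw_r, 0.0).sum(axis=1)
    return np.where(ll_f > ll_r, "fresh", "rotten")

def kfold_accuracy(X, y, k, alpha, seed=0):
    """Mean accuracy over k folds for a given smoothing parameter alpha."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))            # a) random order
    folds = np.array_split(order, k)           # b) k chunks
    accuracies = []
    for i in range(k):
        test_idx = folds[i]                    # c) one chunk for testing
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_nb(X[train_idx], y[train_idx], alpha)       # d) fit
        pred = predict_nb(model, X[test_idx])
        accuracies.append(np.mean(pred == y[test_idx]))         # e) accuracy on the testing chunk
    return float(np.mean(accuracies))          # f) mean over folds

# Synthetic data: word 0 marks fresh, word 1 marks rotten, word 2 is noise
idx = np.arange(100)
y = np.where(idx % 2 == 0, "fresh", "rotten")
X = np.column_stack([idx % 2 == 0, idx % 2 == 1, idx % 3 == 0])

accuracy = kfold_accuracy(X, y, k=5, alpha=0.5)
print(accuracy)
```

To search for the best α, the same function can be called for a grid of values and the resulting accuracies plotted against α.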

## 7 Final model performance

Finally (and now I mean finally), estimate the model performance on the testing data. Complete this section after everything else is done and you are ready to submit your work. Don't improve the model after you have loaded the testing data!

1\. Fit your NB model using the cross-validated optimal alpha using your complete work data (both training and validation). This is your best and final model.

2\. Load your testing data. Clean it using exactly the same procedure (you made a function for this, right?) and transform it into BOW-s.

3\. Predict the F/R class on testing data. Compute accuracy. Present it.

4\. Did you get a better or worse result compared to the k-NN and TF-IDF in PS04?