# Week 2: Codebase, Model Complexity, Regularizers, and Classification

#### Making Meaningful Predictions from Data

This week we were introduced to the concepts of Complexity and Regularization, and further discussed Classification. In this notebook we will give examples of and reasoning behind Complexity, show why Regularization should be used, and show how to evaluate a classifier using accuracy, precision, and recall.

## Part 1: Setting up the codebase

Let's start by importing some libraries that we'll need throughout the notebook.

In [None]:
import gzip
from collections import defaultdict
import string # Some string utilities
import random
from nltk.stem.porter import PorterStemmer # Stemming
import numpy
from sklearn import linear_model

### The Data
We are going to continue to use the Amazon Gift Card data for these examples. This data contains a large set of reviews paired with star ratings and various other pieces of information.
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz

### Reading the Data
This should be familiar if you have taken previous courses that discuss data ingestion. Just remember what type of file we are working with.

In [None]:
path = "./datasets/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"

In [None]:
f = gzip.open(path, 'rt', encoding="utf8")

In [None]:
header = f.readline()
header = header.strip().split('\t')

Let's look at the header for the given data, along with an entry to show how the two relate.

In [None]:
header

In [None]:
#TODO create a list, each element of which is a dictionary of an entry in the review data
#TODO cast the categories titled "'star_rating', 'helpful_votes', and 'total_votes'" into integer values

In [None]:
dataset = []

for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(' #TODO ')
    d['helpful_votes'] = int(' #TODO ')
    d['total_votes'] = int(' #TODO ')
    dataset.append(d)
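
To see how `zip(header, fields)` pairs column names with values, here is a minimal sketch on a two-row TSV held in memory. The sample rows and the reduced set of columns are made up for illustration (the real file has more columns), and the names `toy_tsv`, `toy_header`, and `toy_dataset` are hypothetical.

```python
import io

# Hypothetical two-review TSV held in memory (not the real Amazon file)
toy_tsv = (
    "star_rating\thelpful_votes\ttotal_votes\treview_body\n"
    "5\t0\t0\tGreat gift!\n"
    "2\t1\t3\tCard never arrived.\n"
)

toy_file = io.StringIO(toy_tsv)
toy_header = toy_file.readline().strip().split('\t')

toy_dataset = []
for line in toy_file:
    fields = line.strip().split('\t')
    d = dict(zip(toy_header, fields))  # column name -> raw string value
    # Every field is read as a string, so numeric columns need an explicit cast
    for key in ('star_rating', 'helpful_votes', 'total_votes'):
        d[key] = int(d[key])
    toy_dataset.append(d)

print(toy_dataset[0])
```

Casting matters later on: comparisons such as `rating > 3` only behave numerically once the rating is an integer rather than a string.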

In [None]:
#Example entry
dataset[0]

## Part 2: Complexity and Regularizers

In this part we will go over two ways of reducing the complexity of our data (there are plenty more). The next cell outputs the length of the dataset. For simplicity's sake we will work with a smaller portion of it, though every operation below also applies to the entire dataset, given more computation time.

In [None]:
len(dataset)

In [None]:
#Grab the first 10,000 values of the dataset and put them into a new dataset named shortData

shortData = dataset[' #TODO ']

In [None]:
#Check
len(shortData) == 10000

First, let's count the number of unique words in the dataset.

In [None]:
#TODO count the number of unique words found within the 'review_body' portion of your dataset using the .split() function
# and the defaultdict collection
wordCount = defaultdict(int)

#SOLN
for d in shortData:
    for w in d[' #TODO '].split():
        wordCount[w] += 1

print(len(wordCount)) #Answer should be 11215
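
A small sketch on made-up text shows why splitting on whitespace alone inflates the vocabulary: differences in capitalization and trailing punctuation turn one word into several distinct tokens. The three review strings and the name `toy_counts` are assumptions for illustration only.

```python
from collections import defaultdict

# Made-up review texts for illustration
toy_reviews = ["Great gift", "great gift!", "A great card."]

toy_counts = defaultdict(int)
for text in toy_reviews:
    for w in text.split():  # split on whitespace only
        toy_counts[w] += 1

print(len(toy_counts))      # number of distinct tokens
print(dict(toy_counts))
```

Here "Great" and "great" are counted separately, as are "gift" and "gift!", so three short reviews already produce six distinct tokens.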

On its own this is manageable, but the full dataset has roughly 97,000 unique words and we are only working with a fraction of the data. Let's try to improve on this.

Next, let's try to reduce the number of words by removing punctuation and capitalization.

In [None]:
wordCountPunc = defaultdict(int)
punctuation = set(string.punctuation)

#TODO same as above without use of stemming


#SOLN
for d in shortData:
    r = ''.join([c for c in d[' #TODO '].lower() if c not in punctuation])
    for w in r.split():
        wordCountPunc[w] += 1

print(len(wordCountPunc)) #Answer should be 6023
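
The same idea can be sketched on a few made-up strings. The helper name `normalize` and the sample text are hypothetical; the character filter mirrors the one used in the cell above.

```python
import string
from collections import defaultdict

punct = set(string.punctuation)
toy_reviews = ["Great gift", "great gift!", "A great card."]

def normalize(text):
    """Lowercase the text and drop every punctuation character."""
    return ''.join(c for c in text.lower() if c not in punct)

toy_counts_clean = defaultdict(int)
for text in toy_reviews:
    for w in normalize(text).split():
        toy_counts_clean[w] += 1

print(sorted(toy_counts_clean.items()))
```

One consequence of deleting punctuation rather than replacing it with a space is that hyphenated words are joined: `normalize("Gift-Card!!")` yields `"giftcard"`. Whether that is desirable depends on the data.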

### This is better (roughly half the previous amount) but we can do better!


Let's build a few data structures to count the number of instances of each word. Here you want to remove punctuation and capitalization, then apply stemming.

Stemming is provided by the NLTK (Natural Language Toolkit) library; see the [NLTK stemming how-to](http://www.nltk.org/howto/stem.html) for details. (Hint: strip capitalization and punctuation from each review first, then pass each resulting word to the stemmer.)

In [None]:
wordCountStem = defaultdict(int)
punctuation = set(string.punctuation)
stemmer = PorterStemmer() #use stemmer.stem(stuff)

#Note: this will take longer than the previous to run as a result of stemming

#TODO count the number of unique words, this time removing capitalization and punctuation, USE STEMMING HERE


#SOLN
for d in shortData:
    r = ''.join([c for c in d[' #TODO '].lower() if c not in punctuation])
    for w in r.split():
        w = stemmer.stem(w) # with stemming
        wordCountStem[w] += 1

print(len(wordCountStem)) #Answer should be 4666

### Recap 

Note that each word here will hold a weight in a model, meaning each unique word is a dimension of the model. As we expand our dataset from our smaller portion into the entirety of the data, our model will grow in dimensionality very quickly. This will cause our model to be highly prone to overfitting.
For a visual explanation using dogs and cats, see [The curse of dimensionality and overfitting](http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/#The_curse_of_dimensionality_and_overfitting).
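
The link between dimensionality and overfitting can be sketched with synthetic data. In this illustration, which is not part of the course material, the features and labels are independent random noise, so there is nothing real to learn. With more features than training points, unregularized least squares still fits the training set almost perfectly. All sizes and variable names are assumptions.

```python
import numpy

rng = numpy.random.default_rng(0)

n_train, n_test, n_features = 20, 1000, 50  # more features than training points

# Features and labels are independent noise: there is nothing real to learn
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.normal(size=n_test)

# Unregularized least squares (minimum-norm solution when underdetermined)
theta, _, _, _ = numpy.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = numpy.mean((X_train @ theta - y_train) ** 2)
test_mse = numpy.mean((X_test @ theta - y_test) ** 2)

print("train MSE:", train_mse)
print("test MSE:", test_mse)
```

The training error is essentially zero while the error on fresh data is not, which is the signature of a model that has memorized its training set.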

Next week you will learn how to address this by implementing a regularizer using the "Ridge" Model. This model from sklearn implements a least squares regression model that includes a regularizer.
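
As a preview, the effect of an L2 regularizer can be sketched with the closed-form ridge solution in plain numpy. This is an illustration on synthetic data, not sklearn's implementation; `ridge_fit` and `lam` are hypothetical names (sklearn's `Ridge` exposes the regularization strength as its `alpha` parameter).

```python
import numpy

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * numpy.eye(n_features)
    return numpy.linalg.solve(A, X.T @ y)

rng = numpy.random.default_rng(0)
X_toy = rng.normal(size=(20, 50))   # synthetic data, more features than samples
y_toy = rng.normal(size=20)

for lam in (0.01, 1.0, 100.0):
    theta = ridge_fit(X_toy, y_toy, lam)
    print(lam, numpy.linalg.norm(theta))
```

As `lam` grows, the norm of the weight vector shrinks: the penalty trades some training accuracy for smaller weights, which limits how closely the model can fit noise.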

## Part 3: Evaluating Classifiers for Ranking

Last week you learned about Classification Diagnostics (accuracy, precision etc). Using a Logistic Regression model, you can use those calculations to evaluate your classifiers.

In [None]:
#Rank words by frequency so we can keep the 1,000 most common as features
#Iterate over wordCountPunc itself: its keys are the lowercased, punctuation-free words
counts = [(wordCountPunc[w], w) for w in wordCountPunc]
counts.sort()
counts.reverse()
words = [x[1] for x in counts[:1000]]
wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

In [None]:
#Can you figure out what this function is doing? 
def feature(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if c not in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    feat.append(1) #offset
    return feat                 #Next week we will discuss what this is doing.
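
One way to check your reading of `feature` is to run the same logic on a tiny vocabulary. The three-word vocabulary and the names `toy_words`, `toy_word_id`, and `toy_feature` are hypothetical stand-ins for `words`, `wordId`, and `feature`.

```python
import string

punct = set(string.punctuation)

# Hypothetical three-word vocabulary standing in for the 1,000 most common words
toy_words = ["great", "gift", "card"]
toy_word_id = {w: i for i, w in enumerate(toy_words)}

def toy_feature(text):
    feat = [0] * len(toy_words)
    cleaned = ''.join(c for c in text.lower() if c not in punct)
    for w in cleaned.split():
        if w in toy_word_id:            # words outside the vocabulary are ignored
            feat[toy_word_id[w]] += 1
    feat.append(1)                      # constant term, as in feature() above
    return feat

print(toy_feature("Great card, great price!"))  # [2, 0, 1, 1]
```

Each position counts how often one vocabulary word appears in the review, and the final entry is always 1.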

In [None]:
X = [feature(d) for d in dataset] #List of our "features"
y = [d['star_rating'] for d in dataset] #List of all star ratings
y_class = [(rating > 3) for rating in y] #List of all ratings higher than 3 stars

modelLin = linear_model.LogisticRegression() #Basic Linear Model
modelLin.fit(X, y_class);

In [None]:
predictions = modelLin.predict(X)
correct = predictions == y_class

In [None]:
correct #Think, what does this array tell us at each entry?

Now we can calculate accuracy and precision of the Logistic Regression model. 

In [None]:
#TODO calculate the accuracy through any method

#SOLN
accuracy = ' #TODO '
print("Accuracy = " + str(accuracy)) #Hint this should be high.

#### Now that we've worked with accuracy, let's move to ranking based on Precision and Recall

In [None]:
#Here's a quick calculation of accuracy through true/false positives/negatives

TP = sum([(p and l) for (p,l) in zip(predictions, y_class)])
FP = sum([(p and not l) for (p,l) in zip(predictions, y_class)])
TN = sum([(not p and not l) for (p,l) in zip(predictions, y_class)])
FN = sum([(not p and l) for (p,l) in zip(predictions, y_class)])

TFaccuracy = (TP + TN) / (TP + FP + TN + FN)

In [None]:
#Check
TFaccuracy == accuracy

In [None]:
#TODO Calculate the precision and recall using the True/False values defined above

#SOLN
precision = TP / ' #TODO '
recall = TP / ' #TODO '

In [None]:
precision, recall

In [4]:
#Aside: True counts as 1 when summed, so this counts the False entries in s
s = [True, False, False]
sum([not i for i in s])

2

Notice how high our precision and recall are. Individually, neither of these is difficult to obtain, but it can be difficult to get both at the same time. When they are both high, that indicates a good model!
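
The tension between precision and recall can be sketched by sweeping a decision threshold over made-up confidence scores. The scores, labels, and the helper `precision_recall` are assumptions for illustration and are not outputs of the model above.

```python
def precision_recall(predictions, labels):
    """Return (precision, recall) for two equal-length lists of booleans."""
    TP = sum(p and l for p, l in zip(predictions, labels))
    FP = sum(p and not l for p, l in zip(predictions, labels))
    FN = sum(not p and l for p, l in zip(predictions, labels))
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    return precision, recall

# Made-up confidence scores and true labels, sorted by score for readability
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, True, False]

for threshold in (0.85, 0.5, 0.0):
    preds = [s > threshold for s in scores]
    p, r = precision_recall(preds, labels)
    print(threshold, round(p, 2), round(r, 2))
```

A strict threshold (0.85) gives precision 1.0 but recall 0.5, while labeling everything positive (threshold 0.0) gives recall 1.0 but precision of about 0.57. Getting both high at once requires the scores themselves to separate the classes well.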

## You're All Done!

Next week we will learn about guidelines for the implementation of predictive pipelines.