## Imports

In [1]:
import gzip
from collections import defaultdict
import random
import numpy
import scipy.optimize
import string
from sklearn import linear_model
from nltk.stem.porter import PorterStemmer # Stemming

# Task 1: Data Processing

###  Read the data and Fill your dataset

The dataset which is used can be found at [amazon-reviews-us-Sports](https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Sports_v1_00.tsv.gz).

We can take any other dataset from this official github sit of tensorflow datasets [Similar Datasets](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/url_checksums/amazon_us_reviews.txt).

In [2]:
path = 'data/amazon_reviews_us_Sports_v1_00.tsav.gz'

f = gzip.open(path, 'rt', encoding='utf8')

header = f.readline()
header = header.strip().split('\t')

dataset = []

for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    d['verified_purchase'] = d['verified_purchase'] == 'Y'
    dataset.append(d)

###  Split the data into a Training and Testing set

First shuffle your data, then split your data. Have Training be the first 80%, and testing be the remaining 20%. 

In [3]:
N = len(dataset)
trainingSet = dataset[:int(N*0.8)]
testSet = dataset[int(N*0.8):]

#### Now delete your dataset
You don't want any of your answers to come from your original dataset any longer, but rather your Training Set, this will help you to not make any mistakes later on, especialy when referencing the checkpoint solutions.

In [4]:
del dataset

###  Extracting Basic Statistics

1. How many entries are in your dataset?
2. Pick a non-trivial attribute (i.e. verified purchases in example), what percentage of your data has this atttribute?
3. Pick another different non-trivial attribute, what percentage of your data share both attributes?

In [8]:
print('Numer of entries in (training) dataset: ', len(trainingSet))

verified_purchases = [d['verified_purchase'] for d in trainingSet]
print('Fraction of reviews from verified purchases: ', sum(verified_purchases) / len(verified_purchases))

verified_purchases_5_star_ratings = [d for d in trainingSet if d['star_rating'] == 5 and d['verified_purchase']]
print('Fraction of reviews from verified purchases and have 5-star ratings: ',
      len(verified_purchases_5_star_ratings) / len(trainingSet))

Numer of entries in (training) dataset:  3880288
Fraction of reviews from verified purchases:  0.9077135511590892
Fraction of reviews from verified purchases and have 5-star ratings:  0.5829724494676684


# Task 2: Classification

Next you will use our knowledge of classification to extract features and make predictions based on them. Here you will be using a Logistic Regression Model.

### Define the feature function

This implementation will be based on any two attributes from your dataset. You will be using these two attributes to predict a third.

In [9]:
def feature(d):
    feat = [1, d['star_rating'], len(d['review_body'])]
    return feat

### Fit your model

1. Create your __Feature Vector__ based on your feature function defined above. 
2. Create your __Label Vector__ based on the "verified purchase" column of your training set.
3. Define your model as a __Logistic Regression__ model.
4. Fit your model.

In [10]:
feature_vector = [feature(d) for d in trainingSet]

label_vector = [d['verified_purchase'] for d in trainingSet]

model = linear_model.LogisticRegression(solver='lbfgs')

model.fit(feature_vector, label_vector)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

### Compute Accuracy of Your Model

1. Make __Predictions__ based on your model.
2. Compute the __Accuracy__ of your model.

In [11]:
#YOUR CODE HERE
predictions = model.predict(feature_vector)

corrects = predictions == label_vector
accuracy = sum(corrects) / len(corrects)
print('Accuracy = ', accuracy)

Accuracy =  0.907281882169571


# Task 3: Regression

In this section you will start by working though two examples of altering features to further differentiate. Then you will work through how to evaluate a Regularaized model.

For this example you will be working with the dataset in the example notebook. This dataset includes reviews that you will be doing a few regression tasks on. The dataset can be found [here.](https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Home_Improvement_v1_00.tsv.gz)

In [12]:
path = "data/amazon_reviews_us_Home_Improvement_v1_00.tsv.gz"

f = gzip.open(path, 'rt', encoding="utf8")
header = f.readline()
header = header.strip().split('\t')
reg_dataset = []
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    reg_dataset.append(d)

### Unique Words in a Sample Set

We are going to work with a new dataset here, as such we are going to take a smaller portion of the set and call it a Sample Set. This is because stemming on the normal training set will take a very long time. (Feel free to change sampleSet -> reg_dataset if you would like to see the difference for yourself)

1. Count the number of unique words found within the 'review body' portion of the sample set defined below, making sure to __Ignore Punctuation and Capitalization__.
2. Count the number of unique words found within the 'review body' portion of the sample set defined below, this time with use of __Stemming,__ __Ignoring Puctuation,__ ___and___ __Capitalization__.

In [13]:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

wordCountStem = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)

#SampleSet and y vector given
sampleSet = reg_dataset[:2*len(reg_dataset)//10]
y_reg = [d['star_rating'] for d in sampleSet]

In [16]:
for d in sampleSet:
    review = ''.join([x for x in d['review_body'].lower() if not x in punctuation])
    for word in review.split():
        wordCount[word] += 1
        
for d in sampleSet:
    review = ''.join([x for x in d['review_body'].lower() if not x in punctuation])
    for word in review.split():
        stemmed_word = stemmer.stem(word)
        wordCountStem[stemmed_word] += 1
        
print("# of unique words in 'review body' of sampleSet, ignoring punctuation & capitalization:\n",
      len(wordCount))
print("# of unique words in 'review body' of sampleSet, using stemming & ignoring punctuation & capitalization:\n",
      len(wordCountStem))

# of unique words in 'review body' of sampleSet, ignoring punctuation & capitalization:
 156174
# of unique words in 'review body' of sampleSet, using stemming & ignoring punctuation & capitalization:
 131821


### Evaluating Classifiers

1. Given the feature function and your counts vector, __Define__ your X_reg vector. (This being the X vector, simply labeled for the Regression model)
2. __Fit__ your model using a __Ridge Model__ with (alpha = 1.0, fit_intercept = True).
3. Using your model, __Make your Predictions__.
4. Find the __MSE__ between your predictions and your y_reg vector.

In [17]:
def feature_reg(datum):
    feat = [0] * len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    return feat

def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

#Note: increasing the size of the dictionary may require a lot of memory
words = [x[1] for x in counts[:100]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

In [19]:
X_reg = [feature_reg(d) for d in sampleSet]

model = linear_model.Ridge(alpha=1.0, fit_intercept=True)
model.fit(X_reg, y_reg)

predictions = model.predict(X_reg)

mse = MSE(predictions, y_reg)
print('MSE =', mse)

MSE = 1.2681907016075111


# Task 4: Recommendation Systems

You will use your knowledge of simple similarity-based recommender systems to make calculate the most similar items.

In [20]:
attribute_1 = defaultdict(set)
attribute_2 = defaultdict(set)

### Fill your Dictionaries

1. For each entry in your training set, fill your default dictionaries (defined above). 

In [22]:
itemNames = {}

for d in sampleSet:
    user, item = d['customer_id'], d['product_id']
    attribute_1[item].add(user)
    attribute_2[user].add(item)
    itemNames[item] = d['product_title']

In [23]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom

def mostSimilar(n, m): #n is the entry index
    similarities = []  #m is the number of entries
    users = attribute_1[n]
    for i2 in attribute_1:
        if i2 == n: continue
        sim = Jaccard(users, attribute_1[n])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:m]

### Getting Predictions

1. Calculate the __10__ most similar entries to the __first__ entry in your dataset, using the functions defined above.

In [25]:
query = sampleSet[0]['product_id']
print('Recommendations for ' + itemNames[query] + ':\n')
recommendations = [itemNames[x[1]] for x in mostSimilar(query, 10)]
for i, recommendation in enumerate(recommendations):
    print(i + 1, ': ', recommendation + '\n')

Recommendations for SadoTech Model C Wireless Doorbell Operating at over 500-feet Range with Over 50 Chimes, No Batteries Required for Receiver, (Various Colors):

1 :  Fibaro Z-Wave Motion Sensor - FGMS-001

2 :  PowerHalo Sliding Gate Opener Sliding Gate Opener Kit Sliding Gate Opener Remote Auto Close Particularly Simple Installation with Comprehensive Interface

3 :  Newsee Decals Dream Until Your Dreams Come True Wall Famous PVC Wall Sticker Decal Quote Art Vinyl Black

4 :  Antique Gold Swing Arm Floor Lamp 58"

5 :  REEGE Premium Multifunction Toilet Handheld Bidet Shattaf Cloth Diaper Sprayer Shower with Brass Material

6 :  Double Cylinder Satin Nickel Finish Deadbolt Lock w/ Keys - Fits All Doors

7 :  Alarm Detects One Drop of Water! Leak Detector for your basement. The only water sensor equipment on the market that detects a single drop of water and small amounts of moisture.

8 :  Classic with Nylon FBA(Ready to Ships)

9 :  Sentrel SSK60367831GB Royale Tub/Shower Surround