# Rating Predictions
*Author: Koki Sasagawa*  
*Date: 10/8/2018* 

## Task
Build a recommendation system that predicts a rating on a scale of 1 to 5 for a given user-product pair. Collaborative filtering models rest on the idea that people tend to like things similar to other things they like, and things liked by other people who share their preferences.

Two approaches:
- Memory-based approach using cosine similarity and weighted averages of ratings
- Model-based approach using SVD for matrix factorization
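
The memory-based idea can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code in `predictions.py`: the toy ratings, the item-based formulation, and the helper names `cosine_similarity` and `predict_rating` are all assumptions made for the example.

```python
from math import sqrt

# Toy ratings, item -> {user: rating}. Hypothetical data for illustration only.
ratings = {
    "movie_a": {"u1": 5, "u2": 4, "u3": 1, "u4": 5},
    "movie_b": {"u1": 4, "u2": 5, "u3": 2},
    "movie_c": {"u1": 1, "u2": 2, "u3": 5, "u4": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating vectors stored as dicts."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(user, item, ratings, default=3.0):
    """Similarity-weighted average of the user's ratings on other items."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == item or user not in other_ratings:
            continue
        sim = cosine_similarity(ratings.get(item, {}), other_ratings)
        weighted_sum += sim * other_ratings[user]
        weight_total += sim
    # Fall back to a default when the user has rated no comparable item.
    return weighted_sum / weight_total if weight_total > 0 else default


# u4 has not rated movie_b; the estimate leans toward their rating of the
# more similar movie_a.
pred = predict_rating("u4", "movie_b", ratings)
```

Because ratings are non-negative, raw cosine similarities are never negative, so the weighting shifts the estimate only mildly. Mean-centering the ratings first is a common refinement.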

## Submission 
Excel file containing 3 columns: reviewerID, asin, and overall.  
Kaggle link: https://www.kaggle.com/c/si671-hw1

## Running the notebook
This notebook requires the `predictions.py`, `evaluations.py`, and `decorators.py` files.

In [1]:
import os
import pickle 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import sparse
from surprise import SVD, Dataset, Reader

from predictions import baseline
from evaluations import rmse_test
from decorators import timer 

%matplotlib inline

## 1. Load files

In [2]:
train_df = pd.read_csv('../temp_data/full_train_set.csv')
print('Number of rows: {}'.format(train_df.shape[0]))
train_df.head()

Number of rows: 1527779


| Unnamed: 0 | reviewerID | asin | overall |
|---|---|---|---|
| 0 | AMFIPCYDYWGVT | B0090SI56Y | 4 |
| 1 | A3G602Z4DWDZKS | B00005JL99 | 5 |
| 2 | A33BOYMVG3U58Y | B00109KN0M | 5 |
| 3 | ANEDXRFDZDL18 | B00005JMPT | 5 |
| 4 | A1VN7IS16PY024 | B00005AAA9 | 4 |


In [3]:
train_matrix = sparse.load_npz('../temp_data/movie_reviews.npz')
print('Sparse matrix dimensions: {}'.format(train_matrix.shape))
print('Number of stored values: {}'.format(train_matrix.nnz))

Sparse matrix dimensions: (50051, 123960)
Number of stored values: 1527779


In [4]:
with open('../temp_data/movie_index_map.p', 'rb') as fp:
    row_index_to_movies = pickle.load(fp)
    
print('Number of row indexes: {}'.format(len(row_index_to_movies)))

Number of row indexes: 50051


In [5]:
with open('../temp_data/reviewer_index_map.p', 'rb') as fp:
    col_index_to_reviewer = pickle.load(fp)
    
print('Number of col indexes: {}'.format(len(col_index_to_reviewer)))

Number of col indexes: 123960


In [6]:
test_df = pd.read_csv('../raw_data/reviews.test.unlabeled.csv')
print('Number of rows: {}'.format(test_df.shape[0]))
test_df.head()

Number of rows: 169753


| Unnamed: 0 | datapointID | reviewerID | asin |
|---|---|---|---|
| 0 | 85288b7fd23d48dcb4fd2c9b52a7fa3c | AT79BAVA063DG | B0009UVCQC |
| 1 | 06f33eaec5bb4c20857cc1f9aee60fb4 | A2DAHERP7HYJGO | B002ZG99TA |
| 2 | 8f14a0d25996472d80a2e745b66f565a | A3NM0RAYSL6PA8 | B0001NBNDY |
| 3 | 50095c59950e444eb2b35afb00009f44 | A2KODQS5LJGHF8 | 6304089767 |
| 4 | abbbd3cd87d846b0a965ae7ce0ea1aaf | A2ULE2TYILL4BR | B000056MOF |


## 2. Train model

After testing different methods during development, we concluded the best performance was achieved using SVD for matrix factorization with the following parameters:

1. n_factors: 5
2. n_epochs: 80
3. lr_all: 0.003
4. reg_all: 0.15
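
To make the role of these hyperparameters concrete, here is a minimal pure-Python sketch of biased matrix factorization trained with stochastic gradient descent, the general technique behind Surprise's `SVD`. It is not Surprise's implementation: the function names `fit_biased_mf` and `rmse`, the toy data, and the seed handling are assumptions for illustration. Each estimate is the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors.

```python
import random


def fit_biased_mf(ratings, n_factors=5, n_epochs=80,
                  lr_all=0.003, reg_all=0.15, seed=0):
    """Fit biased matrix factorization by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu, bi, pu, qi = {}, {}, {}, {}
    for u, i, _ in ratings:
        bu.setdefault(u, 0.0)
        bi.setdefault(i, 0.0)
        # Small random starting factors (illustrative initialization).
        pu.setdefault(u, [rng.gauss(0, 0.1) for _ in range(n_factors)])
        qi.setdefault(i, [rng.gauss(0, 0.1) for _ in range(n_factors)])

    for _ in range(n_epochs):  # n_epochs: number of passes over the data
        for u, i, r in ratings:
            dot = sum(p * q for p, q in zip(pu[u], qi[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # lr_all scales every step; reg_all shrinks parameters toward zero.
            bu[u] += lr_all * (err - reg_all * bu[u])
            bi[i] += lr_all * (err - reg_all * bi[i])
            for f in range(n_factors):  # n_factors: latent dimensions
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr_all * (err * q - reg_all * p)
                qi[i][f] += lr_all * (err * p - reg_all * q)
    return mu, bu, bi, pu, qi


def rmse(ratings, params):
    """Root mean squared error of the fitted model on the given triples."""
    mu, bu, bi, pu, qi = params
    total = 0.0
    for u, i, r in ratings:
        est = mu + bu[u] + bi[i] + sum(p * q for p, q in zip(pu[u], qi[i]))
        total += (r - est) ** 2
    return (total / len(ratings)) ** 0.5


# Hypothetical toy data, for illustration only.
toy = [("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 4),
       ("u2", "c", 1), ("u3", "b", 2), ("u3", "c", 5)]
untrained = fit_biased_mf(toy, n_epochs=0)
trained = fit_biased_mf(toy, n_epochs=80)
```

Comparing `rmse(toy, untrained)` with `rmse(toy, trained)` shows the training error falling as the epochs accumulate. Training error alone says nothing about generalization, which is why parameters like these are normally chosen on held-out data.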

In [7]:
reader = Reader(rating_scale=(1, 5))
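# Surprise expects exactly three columns, in this order: user, item, rating
data = Dataset.load_from_df(train_df[['reviewerID', 'asin', 'overall']], reader)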

In [20]:
# Prepare data 
train_data = data.build_full_trainset()

# Fit SVD model with best performing parameters 
model = SVD(n_factors=5, n_epochs=80, lr_all=0.003, reg_all=0.15)
model.fit(train_data)

# Save model
with open('../output/SVD_model.p', 'wb') as fp:
    pickle.dump(model, fp, protocol=pickle.HIGHEST_PROTOCOL)

## 3. Make predictions

In [21]:
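# Predict a rating for every (reviewer, movie) pair in the test set
predictions = []

for row in test_df.itertuples(index=False):
    # Access columns by name rather than by tuple position
    predictions.append(model.predict(row.reviewerID, row.asin).est)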
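
Some test pairs may involve reviewers or movies that never appear in the training set. Surprise's `SVD` still returns an estimate in that case by leaving out the terms it cannot compute, and `predict` clips the result to the rating scale by default. The sketch below illustrates that fallback logic with hypothetical parameter dictionaries; it is not Surprise's code, and `estimate` is a made-up helper name.

```python
def estimate(user, item, mu, bu, bi, pu, qi, scale=(1, 5)):
    """Estimate a rating, skipping terms for unseen users or items."""
    est = mu  # start from the global mean
    if user in bu:
        est += bu[user]
    if item in bi:
        est += bi[item]
    if user in pu and item in qi:
        est += sum(p * q for p, q in zip(pu[user], qi[item]))
    low, high = scale
    return min(high, max(low, est))  # clip to the rating scale


# Hypothetical fitted parameters, for illustration only.
mu = 4.0
bu = {"u1": 0.5}
bi = {"a": -0.25}
pu = {"u1": [1.0, 0.0]}
qi = {"a": [0.5, 2.0]}

known = estimate("u1", "a", mu, bu, bi, pu, qi)     # all four terms apply
new_user = estimate("u9", "a", mu, bu, bi, pu, qi)  # mean plus item bias only
new_pair = estimate("u9", "z", mu, bu, bi, pu, qi)  # global mean only
```

For a pair where both the reviewer and the movie are unseen, every prediction collapses to the global mean, so the share of such pairs in the test set bounds how much the model can help there.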

## 4. Save output

In [22]:
# Save predictions 
rating_predictions = pd.DataFrame({'datapointID': test_df['datapointID'], 
                                   'overall': predictions})

rating_predictions.to_csv('../output/test.csv', index=False)
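
Before uploading, it can be worth checking the saved file. The sketch below is a hypothetical, dependency-free check (the helper name `check_submission` is an assumption, not part of this project) that verifies the row count and confirms every row has a unique `datapointID` and an `overall` value within the 1 to 5 scale.

```python
import csv
import io


def check_submission(lines, expected_rows, scale=(1, 5)):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    rows = list(csv.DictReader(lines))
    if len(rows) != expected_rows:
        problems.append(
            "expected {} rows, found {}".format(expected_rows, len(rows)))
    seen = set()
    for n, row in enumerate(rows, start=1):
        datapoint = row.get("datapointID")
        if not datapoint:
            problems.append("row {}: missing datapointID".format(n))
        elif datapoint in seen:
            problems.append("row {}: duplicate datapointID".format(n))
        seen.add(datapoint)
        try:
            value = float(row.get("overall") or "")
        except ValueError:
            problems.append("row {}: overall is not a number".format(n))
            continue
        # NaN fails this comparison too, so it is reported as out of scale.
        if not scale[0] <= value <= scale[1]:
            problems.append("row {}: overall {} outside scale".format(n, value))
    return problems


# Hypothetical two-row file, for illustration only.
sample = io.StringIO("datapointID,overall\nabc,4.2\ndef,3.9\n")
issues = check_submission(sample, expected_rows=2)
```

Against the real output, the same function could be called with an open file handle, for example `open('../output/test.csv', newline='')`, and `expected_rows=len(test_df)`.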