# 158.755 Data Science - Making Sense of Data Project 4
# A Study of Recommendation Engines

## Abstract

A recommender system, or recommendation engine, predicts the 'rating' or 'preference' that a user would give to an item or event. Such systems are mainly used in commercial web applications.

They have been used in many areas over the last ten years: as playlist generators for video and music services such as YouTube, Netflix, and BiliBili; as product recommenders for sites such as Amazon, Trade Me, and eBay; and as content or news recommenders for social media platforms such as Facebook, Twitter, and Instagram. Other popular recommender systems serve specific domains such as hotel booking, dating, and team matching in online competitive games.

Flow (traffic) and monetization are the core concepts associated with commercial web applications. Simply put, flow measures the number of visits to a web application, while monetization measures the overall income that keeps the business running sustainably (e.g. advertising income for YouTube, membership subscriptions for Netflix, and fees charged on each successful sale for Amazon).

A recommendation engine is therefore a key part of monetization: it uncovers consumers' real demand by directing them to the items or services that interest them most. In addition, owners of commercial web applications can target advertisements and services to their customers accurately once a recommendation engine is in place. For example, a playlist generator (YouTube) can show viewers well-matched advertisements, and a product recommender such as Amazon can push related products to customers while they view a certain item. This optimises resource usage and increases income for the organisation that operates the application.

In this project, we follow a ***'breadth first'*** strategy: we explore as much of the lifecycle and implementation of a recommendation engine as possible, including its input and output data structures, machine learning algorithms, evaluation standards for comparing algorithms, and its position and role in the architecture of a commercial web application. We do not undertake ***in-depth*** research into the theory and implementation of specific algorithms. Instead, we use an existing library to compute our findings and recommendations.



## Introduction


## Data Collection

Datasets from commercial web applications are almost never released for public use or research, because they are private, strategic business assets. Normally they are generated through 

Fortunately, a public resource exists for recommendation engine study: the Amazon product data prepared by Associate Professor Julian McAuley, UCSD, at http://jmcauley.ucsd.edu/data/amazon/.

User activities in a commercial web app can usually be summarised as viewing a web page, purchasing products, and commenting on or rating items, content, or news. It is difficult to define a standard data structure for a recommendation engine because business activities vary widely in their demands and requirements.

However, one classic data structure has become widely used after long-term experimentation and operation at major internet companies such as Amazon and Facebook:

- ***user id*** : unique user id
- ***item id*** : unique item id
- ***behaviour type*** : type of behaviour, e.g. purchasing or viewing an item
- ***context*** : behaviour context, including location, time, etc.
- ***behaviour weight*** : strength of the behaviour, e.g. the viewing length for a video or the rating given to an item
- ***behaviour content*** : if a user comments on something, the content can be saved as text. If a user clicks an item, the content can be a binary value.
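
The six fields above could be represented as a small record type. The following is only a sketch: the class name `BehaviourRecord`, the field types, and the sample values are assumptions made for illustration, not a structure taken from any particular company.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BehaviourRecord:
    """One user-item interaction, mirroring the six fields listed above."""
    user_id: str
    item_id: str
    behaviour_type: str                           # e.g. "purchase", "view", "rate"
    weight: float                                 # e.g. rating value or seconds watched
    context: dict = field(default_factory=dict)   # e.g. {"time": ..., "location": ...}
    content: Optional[str] = None                 # e.g. review text; None for a plain click


# A hypothetical rating event expressed in this structure.
record = BehaviourRecord(
    user_id="U1",
    item_id="I42",
    behaviour_type="rate",
    weight=4.0,
    context={"time": 1400000000},
    content="Great show",
)

# Collaborative-filtering libraries typically need only this triple.
rating_triple = (record.user_id, record.item_id, record.weight)
print(rating_triple)
```

The (user, item, weight) triple is the part that rating-based algorithms such as those used later in this report consume; the remaining fields matter mostly for context-aware or content-based approaches.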


### Web API

The Amazon Instant Video review data, http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Amazon_Instant_Video_5.json.gz, has been selected as the dataset for this project.
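
A minimal sketch of how such a file could be read, assuming it has already been downloaded locally and contains one JSON review object per line with the fields `reviewerID`, `asin`, and `overall`. The function name `load_reviews` is hypothetical, and the output keys `userID`, `itemID`, and `rating` are chosen only to match the column names used in the modelling cells of this report.

```python
import gzip
import json


def load_reviews(path):
    """Read a gzipped file with one JSON review per line into rating rows."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            review = json.loads(line)
            rows.append({
                "userID": review["reviewerID"],   # assumed field name
                "itemID": review["asin"],         # assumed field name
                "rating": float(review["overall"]),
            })
    return rows
```

The resulting list of dictionaries could then be passed to `pandas.DataFrame` and saved as a CSV file. Whether the project's own CSV was produced this way is not stated in this report.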




### Web Crawling

## EDA

## Algorithm Research

In [9]:
from collections import defaultdict
import io  # needed because of weird encoding of u.item file
import os

import numpy as np
import pandas as pd
from surprise import Dataset, KNNBaseline, NormalPredictor, Reader, SVD
from surprise import dump, get_dataset_dir
from surprise.model_selection import KFold, cross_validate

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

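        # Precision@K: Proportion of recommended items that are relevant.
        # When nothing in the top k is recommended, precision is undefined
        # and is set to 1 here.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended.
        # When the user has no relevant items, recall is undefined and is
        # set to 1 here.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1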

    return precisions, recalls
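
To make the two metrics concrete, here is a small worked example for a single user, using the same counting logic as the function above. The five (estimated, true) rating pairs and the values `k = 3` and `threshold = 4.0` are made up for illustration.

```python
# Toy (estimated, true) rating pairs for one hypothetical user.
user_ratings = [(4.6, 5.0), (4.2, 3.0), (3.9, 4.0), (3.1, 4.5), (2.0, 1.0)]
k, threshold = 3, 4.0

# Rank by estimated rating, highest first, and keep the top k.
top_k = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)[:k]

# Relevant items overall: true rating at or above the threshold.
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)

# Recommended items in the top k: estimated rating at or above the threshold.
n_rec_k = sum(est >= threshold for est, _ in top_k)

# Items in the top k that are both recommended and relevant.
n_rel_and_rec_k = sum(
    est >= threshold and true_r >= threshold for est, true_r in top_k
)

precision = n_rel_and_rec_k / n_rec_k   # 1 of 2 recommended items is relevant
recall = n_rel_and_rec_k / n_rel        # 1 of 3 relevant items is recommended

print(precision)          # 0.5
print(round(recall, 3))   # 0.333
```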




### SVD

In [13]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

df = pd.read_csv('full_dataset.csv')
# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# Candidate hyperparameters for the SVD model
param_grid = {'n_epochs': [5, 10, 15], 'lr_all': [0.002, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# Iterate through the parameter grid to find the best model
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score, followed by the best MAE score
print('best RMSE score')
print(gs.best_score['rmse'])
print(gs.best_score['mae'])
# parameter combinations that gave the best RMSE and the best MAE
print('combination of parameters that gave the best RMSE score')
print(gs.best_params['rmse'])
print(gs.best_params['mae'])

best RMSE score
1.0273052571444052
0.7624481094830237
combination of parameters that gave the best RMSE score
{'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.2}
{'n_epochs': 15, 'lr_all': 0.01, 'reg_all': 0.2}
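
For reference, the two error measures reported above can be computed by hand. The sketch below uses four made-up (true, estimated) rating pairs; it illustrates the standard definitions of RMSE and MAE and is not the library's internal code.

```python
import math

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0), (3.0, 3.0)]

errors = [true_r - est for true_r, est in pairs]

# RMSE: square root of the mean squared error; penalises large errors more.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# MAE: mean of the absolute errors; every error counts in proportion to its size.
mae = sum(abs(e) for e in errors) / len(errors)

print(rmse)  # 0.75
print(mae)   # 0.625
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions, which is consistent with the scores printed above.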


In [14]:
kf = KFold(n_splits=5)
#best algo based on above
algo = gs.best_estimator['rmse']
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

precision@k
0.9056838372698603
recall@k
0.8122341971254446
precision@k
0.9031010602439188
recall@k
0.8142918035172808
precision@k
0.9052896754349741
recall@k
0.8111687175038439
precision@k
0.9050548802438796
recall@k
0.8097509651491113
precision@k
0.903617541447018
recall@k
0.8107720699323242


In [15]:
#save the model for later use
dump.dump(os.path.expanduser("SVD_GS_BEST_RMSE"), algo=gs.best_estimator['rmse'])

### Cosine Similarity

In [22]:
sim_options = {'name': 'cosine', 'user_based': False}  # False = item-based similarities
algo = KNNBaseline(sim_options=sim_options)
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0877  1.0845  1.0827  1.0826  1.0855  1.0846  0.0019  
MAE (testset)     0.7407  0.7390  0.7382  0.7378  0.7383  0.7388  0.0010  
Fit time          49.62   280.10  307.70  56.33   82.74   155.30  114.04  
Test time         9.07    149.09  97.32   69.34   40.16   73.00   48.09   


{'test_rmse': array([1.08774203, 1.08446025, 1.08267142, 1.08255709, 1.08551081]),
 'test_mae': array([0.74068957, 0.73899557, 0.73816598, 0.73776043, 0.73833932]),
 'fit_time': (49.61555480957031,
  280.0983302593231,
  307.70160937309265,
  56.328015089035034,
  82.74313378334045),
 'test_time': (9.068887710571289,
  149.0948407649994,
  97.32177829742432,
  69.33917331695557,
  40.16020727157593)}
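
As an illustration of the similarity measure itself, the sketch below computes the cosine similarity between two items using only the users who rated both. The function name `cosine_sim` and the toy ratings are assumptions; the library's own implementation works on the full rating matrix, and `KNNBaseline` additionally combines the similarities with baseline estimates, which is not shown here.

```python
import math


def cosine_sim(ratings_i, ratings_j):
    """Cosine similarity between two items over their common raters.

    Each argument maps a user id to that user's rating of the item.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0  # no shared raters: treat the items as unrelated
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)


# Hypothetical ratings: users u1 and u2 rated both items.
item_a = {"u1": 5.0, "u2": 3.0, "u3": 4.0}
item_b = {"u1": 4.0, "u2": 3.0, "u4": 2.0}

sim = cosine_sim(item_a, item_b)
print(sim)
```

Since ratings on a 1 to 5 scale are always positive, this measure stays between 0 and 1 and tends to be high for most item pairs with shared raters.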

In [23]:
kf = KFold(n_splits=5)
# same item-based cosine KNN configuration as above
sim_options = {'name': 'cosine', 'user_based': False}  # False = item-based similarities
algo = KNNBaseline(sim_options=sim_options)
#print precision@k and recall@k
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print('precision@k')
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print('recall@k')
    print(sum(rec for rec in recalls.values()) / len(recalls))

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8971717241273791
recall@k
0.791610061030033
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8950178931316445
recall@k
0.7913077032646899
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8981713917229017
recall@k
0.7895691976528738
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8968833226191572
recall@k
0.7893254046291424
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
precision@k
0.8975616193142721
recall@k
0.7904921824778635


In [24]:
#save the model for later use
dump.dump(os.path.expanduser("KNN_COSINE"), algo=algo)

### Get Top N Items 

In [None]:
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
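
The same grouping-and-ranking idea can be tried on toy data without a trained model. The sketch below is a compact variant that uses `heapq.nlargest` instead of a full sort. The name `top_n_per_user` and the prediction tuples are hypothetical; the tuples mimic the five-field layout (user id, item id, true rating, estimate, details) that the function above unpacks.

```python
from collections import defaultdict
import heapq


def top_n_per_user(predictions, n=10):
    """Group predictions by user and keep the n highest estimates for each."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    return {
        uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
        for uid, items in by_user.items()
    }


# Hypothetical predictions for two users.
predictions = [
    ("U1", "I1", 4.0, 4.5, {}),
    ("U1", "I2", 3.0, 3.2, {}),
    ("U1", "I3", 5.0, 4.9, {}),
    ("U2", "I1", 2.0, 2.5, {}),
]

top = top_n_per_user(predictions, n=2)
for uid, items in top.items():
    print(uid, items)
```

For recommendations to be useful in practice, the predictions passed in would normally cover items the user has not yet rated, rather than a held-out test set.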

### Get the k nearest neighbors of a user (or item)

In [27]:
# 10000 is an inner id; the model is item-based, so the neighbours are items
algo.get_neighbors(10000, k=10)

[3, 12, 28, 40, 66, 67, 72, 77, 80, 95]

## Presentation of Web App

## Recommendation Engine in System Architecture 