# Accuracy Metrics

**Objectives:**

**1.** As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.

**2.** Implement support for at least one business or user experience goal such as increased serendipity, novelty, or diversity.

**3.** Compare and report on any change in accuracy before and after you’ve made the change in #2.

**4.** As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation was possible. Also, briefly propose how you would design a reasonable online evaluation environment.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import surprise
from surprise import Reader
from surprise import Dataset
from surprise.prediction_algorithms.random_pred import NormalPredictor
from surprise.prediction_algorithms.knns import KNNWithMeans

# Data - Jester

[Jester](http://eigentaste.berkeley.edu/dataset/): over 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users, collected between April 1999 and May 2003.


Format:

- 3 Data files contain anonymous ratings data from 73,421 users.
- Data files are in .zip format; when unzipped, they are in Excel (.xls) format
- Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated"). One row per user
- The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.
- The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).


In [2]:
# Load the data
jokes_df_original = pd.read_csv(r'D:\Rafal\CUNY\643\hw\hw4\data\jester.csv', header=None)

# Create a copy of the original and apply data transformations 
df = jokes_df_original.copy()

# Drop the column that contains the count of ratings (0-100)
df.drop([0], axis = 1, inplace = True)

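# Keep a copy in which unrated entries (99) are marked as missing (NaN)
df_nans = df.replace(99, np.nan)

# In the working copy, replace unrated entries (99) with 0.
# Note: this makes "not rated" indistinguishable from a true rating of 0.
df = df.replace(99, 0)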

df['userID'] = df.index

In [3]:
df.head()

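| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | userID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -7.82 | 8.79 | -9.66 | -8.16 | -7.52 | -8.5 | -9.85 | 4.17 | -8.98 | -4.76 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -5.63 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 4.08 | -0.29 | 6.36 | 4.37 | -2.38 | -9.66 | -0.73 | -5.34 | 8.88 | 9.22 | ... | -4.95 | -0.29 | 7.86 | -0.19 | -2.14 | 3.06 | 0.34 | -4.32 | 1.07 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 9.03 | 9.27 | 9.03 | 9.27 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 9.08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 3 | 0.0 | 8.35 | 0.0 | 0.0 | 1.8 | 8.16 | -2.82 | 6.21 | 0.0 | 1.84 | ... | 0.0 | 0.0 | 0.53 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 4 | 8.5 | 4.61 | -4.17 | -5.39 | 1.36 | 1.6 | 7.04 | 4.61 | -0.44 | 5.73 | ... | 5.58 | 4.27 | 5.19 | 5.73 | 1.55 | 3.11 | 6.55 | 1.8 | 1.6 | 4 |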


In [4]:
# Reshape from wide to long format, keeping userID as the identifier column
df = pd.melt(df, id_vars='userID', var_name='itemID', value_name='rating')
df.head()

| | userID | itemID | rating |
|---|---|---|---|
| 0 | 0 | 1 | -7.82 |
| 1 | 1 | 1 | 4.08 |
| 2 | 2 | 1 | 0.0 |
| 3 | 3 | 1 | 0.0 |
| 4 | 4 | 1 | 8.5 |
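
The cell above keeps unrated jokes as explicit 0.0 ratings. What follows is a minimal sketch of an alternative, assuming unrated entries should be left out of the long table entirely. The toy `wide_rows` data and the `wide_to_long` helper are illustrative and not part of the original notebook; plain Python is used so the sketch runs without the data file.

```python
UNRATED = 99  # Jester's sentinel value for "not rated"


def wide_to_long(wide_rows, unrated=UNRATED):
    """Turn one-row-per-user ratings into (userID, itemID, rating) triples,
    skipping entries equal to the unrated sentinel."""
    long_rows = []
    for user_id, ratings in enumerate(wide_rows):
        # Joke IDs start at 1, matching the column labels in the notebook.
        for item_id, rating in enumerate(ratings, start=1):
            if rating != unrated:
                long_rows.append((user_id, item_id, rating))
    return long_rows


# Hypothetical toy data: 2 users x 3 jokes.
wide_rows = [
    [-7.82, 99, 4.17],
    [99, 99, 9.03],
]
long_rows = wide_to_long(wide_rows)
```

With this approach, only observed ratings would reach the train/test split, so the error metric would not be dominated by placeholder zeros.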


## Move data to Surprise format

In [5]:
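# Define a Surprise Reader and provide the rating scale.
# Note: Jester ratings span -10 to +10, but the scale given here is (0, 1).
# Surprise clips estimates to this range, so every `est` below falls in [0, 1].
reader = Reader(rating_scale=(0, 1))

# Load the dataframe into a Surprise Dataset object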
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# KNN Means Predictor

In [6]:
def calculate_knn(data):
    
    # data split 
    trainset, testset = surprise.model_selection.train_test_split(data, test_size=.25)
    
    # Similarity parameters used for models
    sim_options = {'name': 'cosine', 'user_based': False}
    
    # Define KNN model
    knn_model = KNNWithMeans(sim_options=sim_options)

    # Predict values and create dataframe
    predictions = knn_model.fit(trainset).test(testset)
    df = pd.DataFrame(predictions)
    
    # Calculate the rmse
    rmse = surprise.accuracy.rmse(predictions, verbose=True)
    
    return df, rmse
    

In [7]:
svd_cosine, knn_rmse = calculate_knn(data)

Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 4.2846


In [8]:
svd_cosine.drop('details', axis=1).head()

| | uid | iid | r_ui | est |
|---|---|---|---|---|
| 0 | 24648 | 19 | 7.18 | 0.0 |
| 1 | 19550 | 74 | 0.0 | 0.0 |
| 2 | 2520 | 15 | -4.08 | 0.0 |
| 3 | 11559 | 44 | -9.85 | 0.0 |
| 4 | 6568 | 59 | -7.43 | 0.0 |
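
Every `est` value in the output above is 0.0, even though the true ratings range from -10 to +10. Surprise clips estimates to the Reader's `rating_scale` by default, so a (0, 1) scale forces all predictions into [0, 1]. The sketch below illustrates how such clipping can inflate RMSE. The `raw_estimates` are made up for illustration; only the `r_ui` values come from the output above.

```python
import math


def rmse(true_ratings, estimates):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - e) ** 2 for t, e in zip(true_ratings, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def clip(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


true_ratings = [7.18, 0.0, -4.08, -9.85, -7.43]  # r_ui values from the output above
raw_estimates = [5.0, 0.5, -3.0, -8.0, -6.0]     # hypothetical unclipped estimates

# Same estimates, clipped to the (0, 1) scale vs. the full Jester scale.
clipped_01 = [clip(e, 0, 1) for e in raw_estimates]
clipped_full = [clip(e, -10, 10) for e in raw_estimates]

rmse_01 = rmse(true_ratings, clipped_01)
rmse_full = rmse(true_ratings, clipped_full)
```

In this toy case `rmse_01` is considerably larger than `rmse_full`, because every negative estimate is pushed up to 0.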


# Normal Predictor

In [9]:
def calculate_normal(data):
    
    # data split  
    trainset, testset = surprise.model_selection.train_test_split(data, test_size=.25)
    
    # Define model
    model = NormalPredictor()

    # Predict values and create dataframe
    model.fit(trainset)
    predictions = model.test(testset)
    
    df = pd.DataFrame(predictions)
    
    # Calculate the rmse
    rmse = surprise.accuracy.rmse(predictions, verbose=True)
    
    return df, rmse
    

In [10]:
norm_predictions, norm_rmse = calculate_normal(data)

RMSE: 4.4982


In [11]:
norm_predictions.drop('details', axis=1).head()

| | uid | iid | r_ui | est |
|---|---|---|---|---|
| 0 | 8948 | 51 | -8.98 | 0.0 |
| 1 | 15850 | 11 | 2.04 | 1.0 |
| 2 | 13207 | 58 | 0.0 | 1.0 |
| 3 | 9285 | 1 | 8.93 | 1.0 |
| 4 | 19026 | 89 | 0.0 | 0.0 |
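
Both `calculate_knn` and `calculate_normal` draw their own random 75/25 split, so the two RMSE values are measured on different test sets. Below is a minimal sketch of a single seeded split that could be shared across models. `split_ratings` and the toy triples are hypothetical, and plain Python stands in for Surprise here.

```python
import random


def split_ratings(ratings, test_size=0.25, seed=0):
    """Shuffle with a fixed seed and return (train, test) lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (userID, itemID, rating) triples: 4 users x 4 jokes.
ratings = [(u, i, float(u - i)) for u in range(4) for i in range(1, 5)]

# The same seed yields the same split, so both models see identical test data.
train_a, test_a = split_ratings(ratings, seed=42)
train_b, test_b = split_ratings(ratings, seed=42)
```

Surprise's `train_test_split` also accepts a `random_state` argument, which serves the same purpose inside the two functions above.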


# Business or User Experience Goals


- **Diversity** - How dissimilar are the recommendations?
- **Coverage** - What percentage of the user-item space can be recommended?
- **Serendipity** - How surprising are the relevant recommendations?
- **Novelty** - How surprising are the recommendations in general?
- **Relevancy** - How relevant are the recommendations?
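
Some of these goals are straightforward to quantify offline. The rough sketch below computes coverage, interpreted here as the share of the item catalog that appears in at least one user's recommendation list (an assumption; other definitions exist). The `recommendations` dict and the `catalog_coverage` helper are illustrative names, not code from this notebook.

```python
def catalog_coverage(recommendations, n_items):
    """Fraction of the catalog that appears in at least one recommendation list."""
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended) / n_items


# Hypothetical top-3 joke IDs for three users, out of a 100-joke catalog.
recommendations = {
    0: [5, 7, 8],
    1: [7, 8, 13],
    2: [5, 8, 50],
}
coverage = catalog_coverage(recommendations, n_items=100)
```

Here five distinct jokes are recommended, so `coverage` is 5 out of 100.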

## Novelty - How surprising are the recommendations?

Novelty measures how unknown the recommended items are to a user. Higher novelty values mean that less popular items are being recommended.

In [12]:
# Number of jokes
n = 100

# Rank of the items (jokes)
# Note: rank(axis=1) ranks values across the columns of each row,
# so this is not a popularity ranking of the jokes.
rank = svd_cosine.rank(axis=1).iid

# Probability as a function of rank, for all users
probability = (n - rank) / (n - 1)

# Higher novelty values indicate that less popular items are being recommended
novelty = (np.log2(probability) / n).sum()

round(novelty,4)

-182.4577
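
One common way to make novelty concrete is mean self-information: an item rated by a fraction p of users contributes -log2(p), so rarely rated jokes score higher. The sketch below works under that assumption. It is not the rank-based formula used in the cell above, and `item_popularity`, `mean_self_information`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def item_popularity(ratings, n_users):
    """Fraction of users who rated each item, from (user, item, rating) triples."""
    counts = Counter(item for _, item, _ in ratings)
    return {item: count / n_users for item, count in counts.items()}


def mean_self_information(recommended_items, popularity):
    """Average of -log2(popularity) over a recommendation list."""
    scores = [-math.log2(popularity[item]) for item in recommended_items]
    return sum(scores) / len(scores)


# Toy data: joke 1 is rated by all 4 users, joke 2 by 2 users, joke 3 by 1 user.
ratings = [
    (0, 1, 3.0), (1, 1, -2.0), (2, 1, 5.5), (3, 1, 0.4),
    (0, 2, 1.0), (1, 2, 7.2),
    (0, 3, -4.1),
]
popularity = item_popularity(ratings, n_users=4)

novelty_popular = mean_self_information([1, 2], popularity)
novelty_niche = mean_self_information([2, 3], popularity)
```

The list containing the rarely rated joke 3 receives the higher score, which matches the intuition that novelty rewards less popular items.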

# Online Evaluation of Recommender Systems

Online evaluation gives us a unique opportunity to run two or more recommender systems in parallel and perform A/B testing. A percentage of the total traffic should be routed to each system for comparison. In this case, we are looking for any differences in users' behavior.
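
One possible way to route traffic is to hash each user ID into a bucket, so that a returning user always sees the same system. This is a sketch only; `assign_variant`, the variant labels, and the `salt` value are hypothetical names chosen for illustration.

```python
import hashlib


def assign_variant(user_id, variants=("A", "B"), salt="recsys-exp-1"):
    """Deterministically map a user ID to one of the variants."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Assign 1,000 hypothetical users and check how the traffic is shared.
assignments = {uid: assign_variant(uid) for uid in range(1000)}
share_a = sum(v == "A" for v in assignments.values()) / len(assignments)
```

Changing the salt reshuffles the assignment, which keeps separate experiments from reusing the same user groups.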

One obvious metric is the number of recommendations users followed. A higher number indicates increased interest and shows that users actually relied on the recommended items to navigate further within the site instead of trying alternative methods such as search or HTML form filtering.

Some of the more advanced techniques would include:

- Mouse tracking, which could be translated into heatmaps. A strong indicator would be that users 'hovered' over the recommended items and then either clicked (followed) the item or 'bounced.' This could be translated into a ratio of _time spent over the items_ vs _clicks_.
- Items 'consumed' or purchased from each of the recommender systems.
- Time spent looking vs time spent 'consuming'
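
Signals like these can be compared across systems with a simple rate and a significance test. The sketch below computes click-through rate per variant and a pooled two-proportion z statistic. The click and impression counts are invented for illustration, and the helper names are hypothetical.

```python
import math


def click_through_rate(clicks, impressions):
    """Share of shown recommendations that were clicked."""
    return clicks / impressions


def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z statistic for the difference between two click-through rates (pooled)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


# Invented counts: 2,000 impressions per system.
ctr_a = click_through_rate(120, 2000)
ctr_b = click_through_rate(165, 2000)
z = two_proportion_z(120, 2000, 165, 2000)
```

An absolute z value above 1.96 corresponds to a difference that is significant at the 5% level in a two-sided test.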



Online tests may negatively impact users' trust in the recommender system. Such a methodology should be implemented with caution and thoroughly tested offline first. To get a more accurate comparison of each system's performance, all systems/websites should share the same design and functionality.

