### CUNY MSDS DATA643 - Recommender System

---

### Rose Koh
### 06/10/2018

---

## Global Baseline Predictors and RMSE

In this first assignment, we’ll attempt to predict ratings with very little information. 

We’ll first look at just raw averages across all (training dataset) users. 

We’ll then account for “bias” by normalizing across users and across items.

We work with ratings in a user-item matrix, where each rating is either (1) assigned to the training dataset, (2) assigned to the test dataset, or (3) missing.

## Recommender System Description

<i> Briefly describe the recommender system that you’re going to build out from a business perspective, e.g. “This system recommends data science books to readers.”</i>

* This system recommends travel destinations to travellers.

## Dataset Description

<i> 
Find a dataset, or build out your own toy dataset. As a minimum requirement for complexity, please include numeric ratings for at least five users, across at least five items, with some missing data.
</i>

* This is a toy dataset built for the assignment: fifteen users rate five destinations, and some ratings are missing.

## Load Data

<i>
Load your data into (for example) an R or pandas dataframe, a Python dictionary or list of lists, (or another data structure of your choosing). From there, create a user-item matrix.

If you choose to work with a large dataset, you’re encouraged to also create a small, relatively dense “user-item” matrix as a subset so that you can hand-verify your calculations.
</i>
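As a minimal sketch of the "Python dictionary to user-item matrix" route mentioned in the prompt, the snippet below builds a NumPy matrix from a nested dictionary of ratings and leaves unrated cells as NaN. It uses three users from the toy dataset; the helper name `to_user_item_matrix` is hypothetical and not part of the notebook.

```python
import numpy as np

# A few users from the toy dataset, stored as {user: {destination: rating}}.
# A missing key means the user has not rated that destination.
ratings = {
    "user_A": {"Montenegro": 4, "Malta": 3, "Fiji": 5, "Mauritius": 5},
    "user_B": {"Croatia": 5, "Montenegro": 4, "Malta": 3, "Fiji": 4, "Mauritius": 5},
    "user_C": {"Croatia": 1, "Malta": 1, "Fiji": 4, "Mauritius": 5},
}
destinations = ["Croatia", "Montenegro", "Malta", "Fiji", "Mauritius"]


def to_user_item_matrix(ratings, items):
    """Return (users, matrix): one row per user, NaN where a rating is missing."""
    users = sorted(ratings)
    matrix = np.full((len(users), len(items)), np.nan)
    for row, user in enumerate(users):
        for col, item in enumerate(items):
            if item in ratings[user]:
                matrix[row, col] = ratings[user][item]
    return users, matrix


users, matrix = to_user_item_matrix(ratings, destinations)
print(users)
print(matrix)
```

Keeping missing ratings as NaN (rather than 0) lets functions such as `np.nanmean` skip them when computing averages.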

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({
    "destinations": ["Croatia", "Montenegro", "Malta", "Fiji", "Mauritius"],
    "user_A": ['NA', 4, 3, 5, 5],
    "user_B": [5, 4, 3, 4, 5],
    "user_C": [1, 'NA', 1, 4, 5],
    "user_D": ['NA', 1, 3, 5, 5],
    "user_E": [3, 3, 1, 4, 5],
    "user_F": [3, 2, 5, 1, 1],
    "user_G": [2, 3, 4, 'NA', 2],
    "user_H": [3, 4, 3, 1, 2],
    "user_I": [4, 3, 'NA', 2, 1],
    "user_J": [5, 1, 3, 1, 5],
    "user_K": [4, 'NA', 3, 3, 2],
    "user_L": [3, 5, 3, 2, 5],
    "user_M": [3, 5, 3, 1, 5],
    "user_N": [1, 2, 3, 'NA', 5],
    "user_O": [1, 2, 3, 4, 'NA'],
})

In [3]:
df

,destinations,user_A,user_B,user_C,user_D,user_E,user_F,user_G,user_H,user_I,user_J,user_K,user_L,user_M,user_N,user_O
0,Croatia,,5,1.0,,3,3,2.0,3,4.0,5,4.0,3,3,1.0,1.0
1,Montenegro,4.0,4,,1.0,3,2,3.0,4,3.0,1,,5,5,2.0,2.0
2,Malta,3.0,3,1.0,3.0,1,5,4.0,3,,3,3.0,3,3,3.0,3.0
3,Fiji,5.0,4,4.0,5.0,4,1,,1,2.0,1,3.0,2,1,,4.0
4,Mauritius,5.0,5,5.0,5.0,5,1,2.0,2,1.0,5,2.0,5,5,5.0,


In [4]:
df = df.set_index('destinations').T

In [5]:
# Turn the 'NA' placeholders into NaN and make every column numeric.
df = df.replace('NA', np.nan).astype(float)

In [6]:
df.isnull().any()

destinations
Croatia       True
Montenegro    True
Malta         True
Fiji          True
Mauritius     True
dtype: bool

In [7]:
df

destinations,Croatia,Montenegro,Malta,Fiji,Mauritius
user_A,,4.0,3.0,5.0,5.0
user_B,5.0,4.0,3.0,4.0,5.0
user_C,1.0,,1.0,4.0,5.0
user_D,,1.0,3.0,5.0,5.0
user_E,3.0,3.0,1.0,4.0,5.0
user_F,3.0,2.0,5.0,1.0,1.0
user_G,2.0,3.0,4.0,,2.0
user_H,3.0,4.0,3.0,1.0,2.0
user_I,4.0,3.0,,2.0,1.0
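user_J,5.0,1.0,3.0,1.0,5.0
user_K,4.0,,3.0,3.0,2.0
user_L,3.0,5.0,3.0,2.0,5.0
user_M,3.0,5.0,3.0,1.0,5.0
user_N,1.0,2.0,3.0,,5.0
user_O,1.0,2.0,3.0,4.0,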


## Split data

<i>
Break your ratings into separate training and test datasets.
</i>

In [8]:
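# Split by user: the first 80% of rows (users) form the training set, the rest the test set.
split = int(len(df) * 0.8)
train = df.iloc[:split]
test = df.iloc[split:]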

In [9]:
train

destinations,Croatia,Montenegro,Malta,Fiji,Mauritius
user_A,,4.0,3.0,5.0,5.0
user_B,5.0,4.0,3.0,4.0,5.0
user_C,1.0,,1.0,4.0,5.0
user_D,,1.0,3.0,5.0,5.0
user_E,3.0,3.0,1.0,4.0,5.0
user_F,3.0,2.0,5.0,1.0,1.0
user_G,2.0,3.0,4.0,,2.0
user_H,3.0,4.0,3.0,1.0,2.0
user_I,4.0,3.0,,2.0,1.0
user_J,5.0,1.0,3.0,1.0,5.0
user_K,4.0,,3.0,3.0,2.0
user_L,3.0,5.0,3.0,2.0,5.0


In [10]:
test

destinations,Croatia,Montenegro,Malta,Fiji,Mauritius
user_M,3.0,5.0,3.0,1.0,5.0
user_N,1.0,2.0,3.0,,5.0
user_O,1.0,2.0,3.0,4.0,
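
The split above is by user: the first twelve users form the training set and the last three the test set. The prompt can also be read as holding out individual ratings, so that every user contributes to both sets. A minimal sketch of that alternative, assuming a random 80/20 holdout over the observed cells (the function name `split_ratings`, the seed, and the three-user example matrix are illustrative and not part of the notebook):

```python
import numpy as np


def split_ratings(matrix, test_fraction=0.2, seed=0):
    """Hold out a random share of the observed ratings as a test set.

    Returns (train, test), both the same shape as `matrix`; each observed
    rating appears in exactly one of them and is NaN in the other.
    """
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(matrix))  # (row, col) of every known rating
    n_test = int(round(len(observed) * test_fraction))
    test_idx = rng.choice(len(observed), size=n_test, replace=False)

    train = matrix.copy()
    test = np.full_like(matrix, np.nan)
    for row, col in observed[test_idx]:
        test[row, col] = matrix[row, col]  # move the rating to the test matrix
        train[row, col] = np.nan           # and hide it from training
    return train, test


# First three users of the toy dataset (rows) by five destinations (columns).
ratings = np.array([
    [np.nan, 4, 3, 5, 5],
    [5, 4, 3, 4, 5],
    [1, np.nan, 1, 4, 5],
])
train_part, test_part = split_ratings(ratings)
print(np.count_nonzero(~np.isnan(train_part)), "training ratings")
print(np.count_nonzero(~np.isnan(test_part)), "test ratings")
```

With this kind of split, user biases for the test ratings can be estimated from the same users' training ratings, which a by-user split does not allow.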


## Calculation

### Raw average (mean) rating

<i>  Using your training data, calculate the raw average (mean) rating for every user-item combination. </i>

In [11]:
train_raw_avg = np.nanmean(np.array(train))  # mean of all observed training ratings; NaNs are ignored
print(train_raw_avg)

3.14814814815
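
The raw average is the sum of the observed training ratings divided by their count. As a quick hand check, assuming the twelve training users are user_A through user_L as defined in the cell that builds `df` (the variable name `train_matrix` is introduced here for illustration):

```python
import numpy as np

nan = np.nan
# Training ratings: user_A..user_L (rows) by
# Croatia, Montenegro, Malta, Fiji, Mauritius (columns).
train_matrix = np.array([
    [nan, 4, 3, 5, 5],
    [5, 4, 3, 4, 5],
    [1, nan, 1, 4, 5],
    [nan, 1, 3, 5, 5],
    [3, 3, 1, 4, 5],
    [3, 2, 5, 1, 1],
    [2, 3, 4, nan, 2],
    [3, 4, 3, 1, 2],
    [4, 3, nan, 2, 1],
    [5, 1, 3, 1, 5],
    [4, nan, 3, 3, 2],
    [3, 5, 3, 2, 5],
])

n_ratings = np.count_nonzero(~np.isnan(train_matrix))  # observed cells only
total = np.nansum(train_matrix)                        # sum, skipping NaN
print(n_ratings, total, total / n_ratings)  # 54 ratings summing to 170
```

Dividing 170 by 54 gives approximately 3.1481, which agrees with the value printed by `np.nanmean`.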


### RMSE

<i> Calculate the RMSE for raw average for both your training data and your test data. </i>
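
For reference, the RMSE over a set of observed ratings can be written as follows. The symbols are notation introduced here, not taken from the assignment: $K$ is the set of user-item pairs with a known rating, $r_{ui}$ is the actual rating, and $\hat{r}_{ui}$ is the prediction. For the raw-average predictor, $\hat{r}_{ui} = \mu$ for every pair, where $\mu$ is the raw training average.

$$\mathrm{RMSE} = \sqrt{\frac{1}{|K|}\sum_{(u,i)\in K}\left(r_{ui}-\hat{r}_{ui}\right)^{2}}$$

Missing ratings are not part of $K$, which is why the implementation averages with `np.nanmean` rather than `np.mean`.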

In [12]:
from math import sqrt

def rmse(actual, pred):
    return sqrt(np.nanmean((actual - pred)**2))

In [13]:
train_rmse = rmse(np.array(train), train_raw_avg)
test_rmse = rmse(np.array(test), train_raw_avg)

In [14]:
print(train_rmse)

1.4064324108183353


In [15]:
print(test_rmse)

1.4565929333601841


### Bias

<i> Using your training data, calculate the bias for each user and each item. </i>
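
In symbols (notation introduced here for illustration): let $\mu$ be the raw training average, $I_u$ the set of items rated by user $u$, and $U_i$ the set of users who rated item $i$. Each bias is the deviation of that user's or item's mean rating from $\mu$:

$$b_u = \frac{1}{|I_u|}\sum_{i \in I_u} r_{ui} - \mu, \qquad b_i = \frac{1}{|U_i|}\sum_{u \in U_i} r_{ui} - \mu$$

For example, user_A rated four destinations (4, 3, 5, 5), so $b_A = 17/4 - 3.148 \approx 1.102$: this user rates about one point above the overall average.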

In [16]:
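# Item (destination) bias: column mean minus the raw training average.
train_dest_bias = np.nanmean(np.array(train), axis = 0) - train_raw_avg
# Computed for comparison only; the predictions below use train_dest_bias.
test_dest_bias = np.nanmean(np.array(test), axis = 0) - train_raw_avg

# User bias: row mean minus the raw training average.
train_user_bias = np.nanmean(np.array(train), axis = 1) - train_raw_avg
# Test-user biases are taken from each test user's own ratings.
test_user_bias = np.nanmean(np.array(test), axis = 1) - train_raw_avg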

In [17]:
print(train_user_bias)

[ 1.10185185  1.05185185 -0.39814815  0.35185185  0.05185185 -0.74814815
 -0.39814815 -0.54814815 -0.64814815 -0.14814815 -0.14814815  0.45185185]


In [18]:
print(test_user_bias)

[ 0.25185185 -0.39814815 -0.64814815]


In [19]:
print(train_dest_bias)

[ 0.15185185 -0.14814815 -0.23905724 -0.23905724  0.43518519]


In [20]:
print(test_dest_bias)

[-1.48148148 -0.14814815 -0.14814815 -0.64814815  1.85185185]


### Baseline predictors

<i> From the raw average, and the appropriate user and item biases, calculate the baseline predictors for every user-item combination. </i>
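
Written out (notation introduced here for illustration), the baseline predictor for user $u$ and item $i$ adds the user bias $b_u$ and the item bias $b_i$ to the raw training average $\mu$:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

As a worked example with the training values, user_A has $b_u \approx 1.102$ and Croatia has $b_i \approx 0.152$, so the predicted rating is $3.148 + 1.102 + 0.152 \approx 4.402$. In general this sum can fall outside the 1 to 5 rating scale; no clipping is applied in this notebook.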

In [21]:
print(train.shape)

(12, 5)


In [22]:
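def baseline_predict(raw_avg, user_bias, item_bias):
    # For each user, predict raw average + user bias + item bias for every item.
    # The per-user rows are concatenated into one flat array; the caller
    # reshapes it to (n_users, n_items).
    rows = [raw_avg + u_bias + item_bias for u_bias in user_bias]
    return np.concatenate(rows)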

In [23]:
train_pred = baseline_predict(train_raw_avg, train_user_bias, train_dest_bias)
test_pred = baseline_predict(train_raw_avg, test_user_bias, train_dest_bias)

In [24]:
print(train_pred)
train_pred = train_pred.reshape(train.shape)  # one row per user, one column per destination
print(train_pred.shape)

[ 4.40185185  4.10185185  4.01094276  4.01094276  4.68518519  4.35185185
  4.05185185  3.96094276  3.96094276  4.63518519  2.90185185  2.60185185
  2.51094276  2.51094276  3.18518519  3.65185185  3.35185185  3.26094276
  3.26094276  3.93518519  3.35185185  3.05185185  2.96094276  2.96094276
  3.63518519  2.55185185  2.25185185  2.16094276  2.16094276  2.83518519
  2.90185185  2.60185185  2.51094276  2.51094276  3.18518519  2.75185185
  2.45185185  2.36094276  2.36094276  3.03518519  2.65185185  2.35185185
  2.26094276  2.26094276  2.93518519  3.15185185  2.85185185  2.76094276
  2.76094276  3.43518519  3.15185185  2.85185185  2.76094276  2.76094276
  3.43518519  3.75185185  3.45185185  3.36094276  3.36094276  4.03518519]
(12, 5)


In [25]:
print(test_pred)
test_pred = test_pred.reshape(test.shape)  # one row per user, one column per destination
print(test_pred.shape)

[ 3.55185185  3.25185185  3.16094276  3.16094276  3.83518519  2.90185185
  2.60185185  2.51094276  2.51094276  3.18518519  2.65185185  2.35185185
  2.26094276  2.26094276  2.93518519]
(3, 5)
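
The same matrix of predictions can be built without a loop or a reshape by using NumPy broadcasting. The sketch below is one possible alternative, not the notebook's own function; the name `baseline_matrix` is hypothetical, and the input values are copied from the printed outputs above.

```python
import numpy as np


def baseline_matrix(raw_avg, user_bias, item_bias):
    """Return the (n_users, n_items) matrix raw_avg + user_bias[u] + item_bias[i]."""
    user_bias = np.asarray(user_bias, dtype=float)
    item_bias = np.asarray(item_bias, dtype=float)
    # user_bias[:, None] is a column and item_bias[None, :] a row;
    # broadcasting adds every user bias to every item bias.
    return raw_avg + user_bias[:, None] + item_bias[None, :]


# Values copied from the printed outputs: the raw training average, the
# biases of the three test users, and the five training destination biases.
raw_avg = 3.14814815
test_user_bias = [0.25185185, -0.39814815, -0.64814815]
dest_bias = [0.15185185, -0.14814815, -0.23905724, -0.23905724, 0.43518519]

pred = baseline_matrix(raw_avg, test_user_bias, dest_bias)
print(pred.shape)
print(np.round(pred, 4))
```

If predictions need to stay on the rating scale, `np.clip(pred, 1, 5)` could be applied afterwards; that step is an optional addition and would change the RMSE values if any prediction fell outside the scale.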


### RMSE

<i> Calculate the RMSE for the baseline predictors for both your training data and your test data. </i>

In [26]:
train_pred_rmse = rmse(np.array(train), np.array(train_pred))
test_pred_rmse = rmse(np.array(test), np.array(test_pred))

In [27]:
print(train_pred_rmse)

1.2350483773221095


In [28]:
print(test_pred_rmse)

1.340145457604759


## Summary

In [29]:
eval_train = train_pred_rmse/train_rmse - 1
eval_test = test_pred_rmse/test_rmse - 1

In [30]:
result = pd.DataFrame({"Data": ['Train', 'Test'],
                       "1_Raw_Avg_RMSE": [train_rmse, test_rmse],
                       "2_Baseline_Pred_RMSE": [train_pred_rmse, test_pred_rmse],
                       "3_Eval": [eval_train, eval_test]})
result = result.set_index('Data')
print(result)

       1_Raw_Avg_RMSE  2_Baseline_Pred_RMSE    3_Eval
Data                                                 
Train        1.406432              1.235048 -0.121857
Test         1.456593              1.340145 -0.079945


With an 80/20 split of the users, adding the user and item biases lowered the RMSE by 12.19% on the training data and 7.99% on the test data, relative to predicting the raw average.

As the table shows, adding the bias terms to the raw average gives a lower RMSE on both datasets, although the improvement is modest. This might be due to the small size of the dataset.