# **CUHK-STAT3009** Notebook - Overview of Recommender Systems


## Markdown

- We will documentate the Python code by using `markdown`
- [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/)
- Make sure a general audit could understand your notebook

## Examples (RS) in Kaggle

- [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation/data?select=Data+Dictionary.xlsx): `merchant_id` and `card_id`.

- [WSDM - KKBox's Music Recommendation Challenge](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data): `user` and `music`.

- [Event Recommendation Engine Challenge](https://www.kaggle.com/c/event-recommendation-engine-challenge/overview/evaluation): `user` and `event`.

## Load Netflix dataset

- Dowload [Netflix Prize Data](https://www.kaggle.com/netflix-inc/netflix-prize-data). (For illustration, we only take the first subset.)

- Dataset is pre-processed by [pre-process.py](https://github.com/statmlben/CUHK-STAT3009/tree/main/dataset)

- Load data into Python

- Re-orginize the data structure as a standard form

- For testing set, we remove the real ratings.

In [1]:
import numpy as np
import pandas as pd

## Upload Netflix dataset in CUHK-STAT3009 Github repo

train_url = "https://raw.githubusercontent.com/statmlben/CUHK-STAT3009/main/dataset/train.csv"
test_url = "https://raw.githubusercontent.com/statmlben/CUHK-STAT3009/main/dataset/train.csv"

dtrain = pd.read_csv(train_url)
dtest = pd.read_csv(test_url)

## save real ratings for test set for evaluation.
test_rating = np.array(dtest['rating'], dtype=float)
## remove the ratings in the test set to simulate prediction
dtest = dtest.drop(columns='rating')

In [2]:
dtrain.sample(5).T

Unnamed: 0,48099,41453,3792,65,32925
movie_id,3415,241,2317,2843,2665
user_id,41,858,1404,1820,1838
rating,4,4,5,2,4
date,2004-06-07,2004-03-23,2005-12-11,2003-04-03,2005-11-25


In [3]:
dtest.sample(5).T

Unnamed: 0,41957,17760,33557,18950,48485
movie_id,1869,712,916,1424,495
user_id,1917,1949,1782,1623,1664
date,2005-01-04,2005-06-28,2005-02-20,2005-05-14,2003-05-03


### Pre-process the data as a `np.array`

In [4]:
## save (user_id, item_id) and rating separately
train_pair = dtrain[['user_id', 'movie_id']].values
train_rating = np.array(dtrain['rating'], dtype=float)
test_pair = dtest[['user_id', 'movie_id']].values

In [5]:
train_pair

array([[1960,  670],
       [1346,  152],
       [ 785, 1741],
       ...,
       [ 858, 2725],
       [1809,   21],
       [ 142,  574]])

In [6]:
## find the number of users/items

n_user, n_item = max(train_pair[:,0].max(), test_pair[:,0].max())+1, max(train_pair[:,1].max(), test_pair[:,1].max())+1

## Evaluation

- Define a function to compute `rmse` for the predicted rating

- Test your function

In [7]:
def rmse(true_rating, pred_rating):
  return np.sqrt(np.mean((true_rating - pred_rating)**2))

In [8]:
## Test `rmse` function

pred = np.zeros(len(test_rating))
print('rmse for zero rating: %.3f' %rmse(test_rating, pred))
print('rmse for true rating: %.3f' %rmse(test_rating, test_rating))

rmse for zero rating: 3.781
rmse for true rating: 0.000


## Implement Baseline methods

- Inpout: training set.

- Output: return predicted ratings for (user id, item id) user-item pairs in testing set.

- Goal: make prediction for testing set

### Global mean

In [9]:
## create a potential prediction for `test_rating`
pred = np.zeros(len(test_rating))

## Compute global mean based on `train_rating`
global_pred = pred.copy()
global_mean = train_rating.mean()
global_pred = global_mean*np.ones(len(pred))

print(global_pred[:10])
print('rmse for global mean: %.3f' %rmse(test_rating, global_pred))

[3.62115674 3.62115674 3.62115674 3.62115674 3.62115674 3.62115674
 3.62115674 3.62115674 3.62115674 3.62115674]
rmse for global mean: 1.088


### user average

- Loop for all users
  - Find all records for this user in both training and testing sets.
  - Compute the average ratings for this user in the training set.
  - Predict the ratings for this users in the testing set.

In [10]:
## (InClass Practice) user average
UA_pred = np.zeros_like(test_rating, dtype=float)
for u in range(n_user):
    # find the index for both train and test for user_id = u
    ind_test = np.where(test_pair[:,0] == u)[0]
    ind_train = np.where(train_pair[:,0] == u)[0]
    ## if there is no record; predict as the global mean
    if len(ind_test) == 0:
        continue
    ## (Practice) if the records for the users is too small, then predict as global mean
    if len(ind_train) < 3:
        UA_pred[ind_test] = global_mean
    else:
        # predict as user average
        UA_pred[ind_test] = train_rating[ind_train].mean()
print(UA_pred[:10])

print('rmse for user mean: %.3f' %rmse(test_rating, UA_pred))

[3.81556196 3.71962617 2.95774648 4.52272727 3.69393939 3.81632653
 4.         3.59440559 3.05263158 4.        ]
rmse for user mean: 0.977


### item average

In [11]:
## (InClass Practice) item average
IA_pred = np.zeros_like(test_rating, dtype=float)
for i in range(n_item):
    # find the index for both train and test for item_id = i
    ind_test = np.where(test_pair[:,1] == i)[0]
    ind_train = np.where(train_pair[:,1] == i)[0]
    if len(ind_test) == 0:
        continue
    if len(ind_train) < 3:
        IA_pred[ind_test] = global_mean
    else:
        # predict as item average
        IA_pred[ind_test] = train_rating[ind_train].mean()

print('rmse for item mean: %.3f' %rmse(test_rating, IA_pred))

rmse for item mean: 1.009


## Package Python functions


- *Input*: 'train_rating', 'test_pair'

- *Return*: Predicted ratings.

In [12]:
def glb_mean(train_rating, test_pair):
    pred = train_rating.mean() * np.ones(len(test_pair))
    return pred

In [13]:
def user_mean(train_pair, train_rating, test_pair):
    n_user = max(train_pair[:,0].max(), test_pair[:,0].max())+1
    pred = np.zeros(len(test_pair))
    glb_mean_value = train_rating.mean()
    for u in range(n_user):
        # find the index for both train and test for user_id = u
        ind_test = np.where(test_pair[:,0] == u)[0]
        ind_train = np.where(train_pair[:,0] == u)[0]
        if len(ind_test) == 0:
            continue
        if len(ind_train) < 3:
            pred[ind_test] = glb_mean_value
        else:
            # predict as user average
            pred[ind_test] = train_rating[ind_train].mean()
    return pred

In [14]:
def item_mean(train_pair, train_rating, test_pair):
    n_item = max(train_pair[:,1].max(), test_pair[:,1].max())+1
    pred = np.zeros(len(test_pair))
    glb_mean_value = train_rating.mean()
    for i in range(n_item):
        # find the index for both train and test for item_id = i
        ind_test = np.where(test_pair[:,1] == i)[0]
        ind_train = np.where(train_pair[:,1] == i)[0]
        if len(ind_test) == 0:
            continue
        if len(ind_train) < 3:
            pred[ind_test] = glb_mean_value
        else:
            # predict as user average
            pred[ind_test] = train_rating[ind_train].mean()
    return pred

## Sequential models; user-item mean

- We can predict the rating by the `user_mean`, then fit the residual by `item_mean`

$$\hat{r}_{ui} = \bar{r} + \mu_u + \mu_i$$

where 
		$$\mu_u = \frac{1}{|\mathcal{I}_u|} \sum_{i \in \mathcal{I}_u} (r_{ui} - \bar{r}), \quad \mu_i = \frac{1}{|\mathcal{U}_i|} \sum_{u \in \mathcal{U}_i} (r_{ui} - \bar{r} - \mu_u)$$

In [16]:
pred_rating_user = user_mean(train_pair, train_rating, test_pair)

print('rmse for user mean: %.3f' %rmse(test_rating, UA_pred))

rmse for user mean: 0.977


In [18]:
## compute the residual rating
## test_pair -> train_pair

res_rating = train_rating - user_mean(train_pair, train_rating, train_pair)
pred_res_item = item_mean(train_pair, res_rating, test_pair)

final_pred = pred_rating_user + pred_res_item
print('rmse for user mean: %.3f' %rmse(test_rating, final_pred))

rmse for user mean: 0.903


## To-do list

- **STAT**
  - [ ] Background of RS  
  - [ ] The data types in RS
  - [ ] Evaluation metrics
  - [ ] Statistical models for baseline methods

- **Code**

  - [ ] Load data to Python `pd.read_csv`
  - [ ] implementation of baseline methods
  - [ ] define Python functions