# Overview of Recommender Systems

## Examples of recommender systems (RS) on Kaggle

- [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation/data?select=Data+Dictionary.xlsx): `merchant_id` and `card_id`.

- [WSDM - KKBox's Music Recommendation Challenge](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data): `user` and `music`.

- [Event Recommendation Engine Challenge](https://www.kaggle.com/c/event-recommendation-engine-challenge/overview/evaluation): `user` and `event`.


## Load Netflix dataset

- Download the [Netflix Prize Data](https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data). (For illustration, we only take the first subset.)

- The dataset is pre-processed by [`pre-process.py`](https://github.com/statmlben/CUHK-STAT3009/tree/main/dataset).

- Load the data into Python.

- Reorganize the data into a standard form.

- For the testing set, we set the true ratings aside and use them only for evaluation.



In [None]:
import numpy as np
import pandas as pd

## Load the Netflix dataset from the CUHK-STAT3009 GitHub repo

train_url = "https://raw.githubusercontent.com/statmlben/CUHK-STAT3009/main/dataset/train.csv"
test_url = "https://raw.githubusercontent.com/statmlben/CUHK-STAT3009/main/dataset/test.csv"

dtrain = pd.read_csv(train_url)
dtest = pd.read_csv(test_url)

In [None]:
dtrain.sample(5).T

Unnamed: 0,3569,56,4720,1286,12835
movie_id,1872,3543,2753,2216,2946
user_id,766,104,1643,663,1558
rating,4,5,3,5,3
date,2005-03-14,2005-12-30,2003-06-03,2005-09-16,2004-01-29


In [None]:
dtest.sample(5).T

Unnamed: 0,13157,30022,40949,18877,8341
movie_id,2151,458,602,1860,730
user_id,671,1867,1009,663,641
rating,4,5,5,3,5
date,2005-08-17,2004-01-28,2005-11-02,2005-01-27,2004-11-27
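
The `Unnamed: 0` row in the outputs above suggests that the CSV files store the saved DataFrame index as an unnamed first column. That is an assumption about how `train.csv` and `test.csv` were written, not something stated in the notes. Under that assumption, a minimal sketch of reading such a file with `index_col=0` looks like this; the CSV text below is a hypothetical two-row stand-in, not the real data.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with an unnamed first column (the saved index)
csv_text = (
    ",movie_id,user_id,rating,date\n"
    "0,1872,766,4,2005-03-14\n"
    "1,3543,104,5,2005-12-30\n"
)

# Default read: the unnamed column is kept and labelled 'Unnamed: 0'
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

The rest of the notebook only selects `user_id`, `movie_id` and `rating` by name, so either way of reading the file gives the same arrays.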


### Pre-process the data as a `np.array`

In [None]:
## save (user_id, item_id) and rating separately
train_rating = dtrain['rating'].values
train_rating = np.array(train_rating, dtype=float)
train_pair = dtrain[['user_id', 'movie_id']].values

test_rating = dtest['rating'].values
test_rating = np.array(test_rating, dtype=float)
test_pair = dtest[['user_id', 'movie_id']].values

## we want to predict `test_rating` based on `train_pair`, `train_rating`, `test_pair`

In [None]:
test_pair[:,1]

array([2956,  791, 1547, ...,  653, 2195, 3081])

In [None]:
## find the number of users/items
n_user = max( max(train_pair[:,0]), max(test_pair[:,0]) ) + 1
print('total number of users: %d' %n_user)

n_item = max( max(train_pair[:,1]), max(test_pair[:,1]) ) + 1
print('total number of items: %d' %n_item)

total number of users: 2000
total number of items: 3568


## Evaluation

- Define a function to compute `rmse` for the predicted rating

- Test your function
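
For reference, the RMSE can be written as follows, where $\Omega^{te}$ denotes the set of user-item pairs in the testing set (a symbol introduced here for illustration; it does not appear elsewhere in these notes):

$$
\text{RMSE} = \sqrt{ \frac{1}{|\Omega^{te}|} \sum_{(u,i) \in \Omega^{te}} ( r_{ui} - \hat{r}_{ui} )^2 }
$$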

In [None]:
## define RMSE function
def rmse(true_rating, pred_rating):
  return np.sqrt(np.mean((true_rating - pred_rating)**2))

In [None]:
## Test `rmse` function

pred = np.zeros(len(test_rating))
print('rmse for zero rating: %.3f' %rmse(test_rating, pred))
print('rmse for true rating: %.3f' %rmse(test_rating, test_rating))

rmse for zero rating: 3.787
rmse for true rating: 0.000
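
As an extra sanity check, the function can be compared against a value computed by hand. The sketch below repeats the `rmse` definition so that it runs on its own; the toy ratings are hypothetical and chosen only to keep the arithmetic simple.

```python
import numpy as np

def rmse(true_rating, pred_rating):
    # same definition as in the cell above
    return np.sqrt(np.mean((true_rating - pred_rating)**2))

# Hypothetical toy ratings, small enough to check by hand
toy_true = np.array([3., 4., 5.])
toy_pred = np.array([3., 3., 3.])

# errors are 0, 1, 2 -> squared errors 0, 1, 4 -> mean 5/3
toy_rmse = rmse(toy_true, toy_pred)
print('toy rmse: %.3f' % toy_rmse)  # sqrt(5/3), about 1.291
```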


## Implement Baseline methods

- Input: training set.

- Output: predicted ratings for the (user id, item id) pairs in the testing set.

- Goal: make predictions for the testing set.

### Global mean

$$
\bar{r} = \frac{1}{|\Omega|} \sum_{(u,i) \in \Omega} r_{ui}, \quad \hat{r}_{ui} = \bar{r}
$$

In [None]:
## Compute global mean based on `train_rating`
global_mean = train_rating.mean()

## Predict every rating in the testing set as the global mean
global_pred = global_mean*np.ones(len(test_rating))

print(global_pred[:10])
print('rmse for global mean: %.3f' %rmse(test_rating, global_pred))

[3.62115674 3.62115674 3.62115674 3.62115674 3.62115674 3.62115674
 3.62115674 3.62115674 3.62115674 3.62115674]
rmse for global mean: 1.085


### User average

$$
		\bar{r}_{u} = \frac{1}{|\mathcal{I}_u|} \sum_{i \in \mathcal{I}_u} r_{ui}, \text{ for } u=1, \cdots, n; \quad \hat{r}_{ui} = \bar{r}_u
$$

- Loop over all users
  - Find all records for this user in both the training and testing sets.
  - Compute the average rating for this user in the training set.
  - Predict the ratings for this user in the testing set.

In [None]:
## (InClass Practice) user average
UA_pred = np.zeros_like(test_rating, dtype=float)
for u in range(n_user):
    # find the index for both train and test for user_id = u
    ind_test = np.where(test_pair[:,0] == u)[0]
    ind_train = np.where(train_pair[:,0] == u)[0]
    ## if this user has no record in the testing set, there is nothing to predict
    if len(ind_test) == 0:
        continue
    ## (Practice) if this user has too few training records, predict as the global mean
    if len(ind_train) < 3:
        UA_pred[ind_test] = global_mean
    else:
        # predict as user average
        UA_pred[ind_test] = train_rating[ind_train].mean()
print(UA_pred[:10])

print('rmse for user mean: %.3f' %rmse(test_rating, UA_pred))

[3.73684211 3.35714286 3.66037736 2.84931507 3.70909091 3.27419355
 3.16666667 3.57142857 4.1        3.375     ]
rmse for user mean: 1.013
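
The loop above scans the full arrays once per user. As a possible alternative, the same rule (user average, with a fallback to the global mean when a user has fewer than 3 training records) can be sketched with `np.bincount`. The function name `user_mean_vectorized`, the `min_count` parameter and the toy data are hypothetical, and the sketch assumes that user ids are non-negative integers, as in this dataset.

```python
import numpy as np

def user_mean_vectorized(train_pair, train_rating, test_pair, min_count=3):
    # hypothetical helper: assumes user ids are non-negative integers
    n_user = max(train_pair[:, 0].max(), test_pair[:, 0].max()) + 1
    # number of training records and sum of ratings for each user id
    counts = np.bincount(train_pair[:, 0], minlength=n_user)
    sums = np.bincount(train_pair[:, 0], weights=train_rating, minlength=n_user)
    # start from the global mean, then overwrite users with enough records
    means = np.full(n_user, train_rating.mean())
    enough = counts >= min_count
    means[enough] = sums[enough] / counts[enough]
    # look up the per-user value for every pair in the testing set
    return means[test_pair[:, 0]]

# Toy data: user 0 has 3 training records, user 1 has 1, user 2 has none
toy_train_pair = np.array([[0, 0], [0, 1], [0, 2], [1, 0]])
toy_train_rating = np.array([5., 4., 3., 1.])
toy_test_pair = np.array([[0, 3], [1, 1], [2, 0]])

toy_pred = user_mean_vectorized(toy_train_pair, toy_train_rating, toy_test_pair)
print(toy_pred)  # user 0 gets its own mean 4.0; users 1 and 2 get the global mean 3.25
```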


### Item average

$$
\bar{r}_{i} = \frac{1}{|\mathcal{U}_i|} \sum_{u \in \mathcal{U}_i} r_{ui}, \text{ for } i=1, \cdots, m; \quad \hat{r}_{ui} = \bar{r}_i
$$

In [None]:
## (InClass Practice) item average

IA_pred = np.zeros_like(test_rating, dtype=float)
for i in range(n_item):
    # find the index for both train and test for item_id = i
    ind_test = np.where(test_pair[:,1] == i)[0]
    ind_train = np.where(train_pair[:,1] == i)[0]
    if len(ind_test) == 0:
        continue
    if len(ind_train) < 3:
        IA_pred[ind_test] = global_mean
    else:
        # predict as item average
        IA_pred[ind_test] = train_rating[ind_train].mean()

print('rmse for item mean: %.3f' %rmse(test_rating, IA_pred))

rmse for item mean: 1.039


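## Package as Python functions


- *Input*: `train_pair`, `train_rating`, `test_pair` (`glb_mean` only needs `train_rating` and `test_pair`)

- *Return*: predicted ratings for the pairs in `test_pair`.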

In [None]:
def glb_mean(train_rating, test_pair):
    pred = train_rating.mean() * np.ones(len(test_pair))
    return pred

In [None]:
def user_mean(train_pair, train_rating, test_pair):
    n_user = max(train_pair[:,0].max(), test_pair[:,0].max())+1
    pred = np.zeros(len(test_pair))
    glb_mean_value = train_rating.mean()
    for u in range(n_user):
        # find the index for both train and test for user_id = u
        ind_test = np.where(test_pair[:,0] == u)[0]
        ind_train = np.where(train_pair[:,0] == u)[0]
        if len(ind_test) == 0:
            continue
        if len(ind_train) < 3:
            pred[ind_test] = glb_mean_value
        else:
            # predict as user average
            pred[ind_test] = train_rating[ind_train].mean()
    return pred

In [None]:
def item_mean(train_pair, train_rating, test_pair):
    n_item = max(train_pair[:,1].max(), test_pair[:,1].max())+1
    pred = np.zeros(len(test_pair))
    glb_mean_value = train_rating.mean()
    for i in range(n_item):
        # find the index for both train and test for item_id = i
        ind_test = np.where(test_pair[:,1] == i)[0]
        ind_train = np.where(train_pair[:,1] == i)[0]
        if len(ind_test) == 0:
            continue
        if len(ind_train) < 3:
            pred[ind_test] = glb_mean_value
        else:
            # predict as item average
            pred[ind_test] = train_rating[ind_train].mean()
    return pred
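
`user_mean` and `item_mean` differ only in which column of the pair array they group by. One possible way to share the logic is a single helper with a column argument. The sketch below is not part of the original notebook: the name `group_mean`, its `col` and `min_count` parameters, and the toy data are all hypothetical.

```python
import numpy as np

def group_mean(train_pair, train_rating, test_pair, col, min_count=3):
    # hypothetical helper: col=0 groups by user, col=1 groups by item
    glb_mean_value = train_rating.mean()
    # default prediction is the global mean
    pred = np.full(len(test_pair), glb_mean_value)
    # only ids that appear in the testing set need a prediction
    for g in np.unique(test_pair[:, col]):
        ind_train = np.where(train_pair[:, col] == g)[0]
        if len(ind_train) >= min_count:
            # enough training records: predict as the group average
            pred[test_pair[:, col] == g] = train_rating[ind_train].mean()
    return pred

# Toy data: columns are (user_id, item_id)
toy_train_pair = np.array([[0, 10], [0, 11], [0, 12], [1, 10], [2, 10]])
toy_train_rating = np.array([5., 4., 3., 2., 2.])
toy_test_pair = np.array([[0, 10], [1, 11], [3, 10]])

toy_user_pred = group_mean(toy_train_pair, toy_train_rating, toy_test_pair, col=0)
toy_item_pred = group_mean(toy_train_pair, toy_train_rating, toy_test_pair, col=1)
print(toy_user_pred)  # user 0 -> 4.0; users 1 and 3 -> global mean 3.2
print(toy_item_pred)  # item 10 -> 3.0; item 11 -> global mean 3.2
```

Looping over `np.unique(test_pair[:, col])` instead of `range(n_user)` skips ids that never appear in the testing set, which is the same effect as the `continue` branch in the functions above.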

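## Sequential models: user-item mean

- We can first predict the rating by `user_mean`, then fit the residual by `item_mean`. Since $\bar{r} + \mu_u = \bar{r}_u$ in the formula below, the first step is exactly the user average.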

$$\hat{r}_{ui} = \bar{r} + \mu_u + \mu_i$$

where 
		$$\mu_u = \frac{1}{|\mathcal{I}_u|} \sum_{i \in \mathcal{I}_u} (r_{ui} - \bar{r}), \quad \mu_i = \frac{1}{|\mathcal{U}_i|} \sum_{u \in \mathcal{U}_i} (r_{ui} - \bar{r} - \mu_u)$$
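
To make the two-step formula concrete, here is a small sketch on hypothetical toy ratings using `pandas` group-by operations. Unlike `user_mean` and `item_mean` in this notebook, it has no fallback for users or items with fewer than 3 records, so it illustrates the formula only and does not replace the functions above.

```python
import pandas as pd

# Hypothetical toy ratings: 3 users, 2 items
toy = pd.DataFrame({
    'user_id':  [0, 0, 1, 1, 2],
    'movie_id': [0, 1, 0, 1, 0],
    'rating':   [5., 3., 4., 2., 3.],
})

# Step 0: global mean (17 / 5 = 3.4)
r_bar = toy['rating'].mean()

# Step 1: user effect = average of (rating - global mean) for each user
mu_u = (toy['rating'] - r_bar).groupby(toy['user_id']).mean()

# Step 2: item effect = average of the remaining residual for each item
res = toy['rating'] - r_bar - toy['user_id'].map(mu_u)
mu_i = res.groupby(toy['movie_id']).mean()

# Predict the unseen pair (user 2, item 1): 3.4 + (-0.4) + (-1.0)
pred_21 = r_bar + mu_u.loc[2] + mu_i.loc[1]
print('prediction for (user 2, item 1): %.3f' % pred_21)
```

Worked by hand, the user effects are 0.6, -0.4 and -0.4, and the item effects are 2/3 and -1.0, so the prediction for (user 2, item 1) is 2.0.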

In [None]:
## compute user-mean
pred_rating_user = user_mean(train_pair, train_rating, test_pair)

print('rmse for user mean: %.3f' %rmse(test_rating, pred_rating_user))

rmse for user mean: 1.013


In [None]:
## compute the residual rating on the training set
## note: `user_mean` is applied to `train_pair` here (not `test_pair`)

res_rating = train_rating - user_mean(train_pair, train_rating, train_pair)
pred_res_item = item_mean(train_pair, res_rating, test_pair)

final_pred = pred_rating_user + pred_res_item
print('rmse for user-item mean: %.3f' %rmse(test_rating, final_pred))

rmse for user-item mean: 0.964


## To-do list

- **STAT**
  - [ ] Background of RS  
  - [ ] The data types in RS
  - [ ] Evaluation metrics
  - [ ] Statistical models for baseline methods

- **Code**

  - [ ] Load data into Python with `pd.read_csv`
  - [ ] Implement baseline methods
  - [ ] Define Python functions

# **CUHK-STAT3009** Notebook - Overview of Recommender Systems
