# CUHK [STAT3009](https://www.bendai.org/STAT3009/) Notebook1: Baseline methods for Recommender Systems

## Software preparation
- `Code Editor`: VS Code; Sublime; or Atom

- `Terminal`: iTerm2 on macOS; Deepin Terminal on Linux

## Creating virtual environments
- If you have multiple versions of Python on your system, you can select a specific one by running `python3` or whichever version you want.

- To create a virtual environment, decide on a directory where you want to place it, and run the `venv` module as a script with the directory path, e.g. `python3 -m venv tutorial-env`.

- To create and activate a virtual environment, see Section 12.2 of the [Python tutorial](https://docs.python.org/3/tutorial/venv.html).

- To install packages via `pip`, see Section 12.3 (Managing Packages with pip) of the same [tutorial](https://docs.python.org/3/tutorial/venv.html).


## Kaggle competitions on recommender systems: `user` and `item` can be extended to more general cases
- [Elo Merchant Category Recommendation](https://www.kaggle.com/c/elo-merchant-category-recommendation/data?select=Data+Dictionary.xlsx): `merchant_id` and `card_id`.

- [WSDM - KKBox's Music Recommendation Challenge](https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data): `user` and `music`.

- [Event Recommendation Engine Challenge](https://www.kaggle.com/c/event-recommendation-engine-challenge/overview/evaluation): `user` and `event`.

## Load dataset into Python
- Download the [Netflix Prize Data](https://www.kaggle.com/netflix-inc/netflix-prize-data).

- Load the data into Python.

- Reorganize the data into a standard form.

- For the testing set, we hide the real ratings.

- We only take the first subset for illustration.

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('./dataset/train_demo.csv')
df.sample(5)

Unnamed: 0.1,Unnamed: 0,"(user_id, item_id)",ratings
14,1071,"(867, 419)",6.32142
1970,2989,"(644, 128)",1.904344
2083,2178,"(69, 361)",-3.877432
1996,2818,"(491, 75)",2.476299
245,3193,"(445, 95)",3.69998


In [5]:
dt = pd.read_csv('./dataset/test_demo.csv')
## save real ratings for test set for evaluation.
test_ratings = dt['ratings']
## remove the ratings in the test set to simulate prediction
dt = dt.drop(columns='ratings')
dt.sample(5)

Unnamed: 0.1,Unnamed: 0,"(user_id, item_id)"
163,1788,"(276, 406)"
394,1379,"(199, 292)"
281,2344,"(781, 213)"
146,865,"(455, 69)"
183,2261,"(493, 258)"


## Pre-process the data into a standard form
- Convert `string` '(user_id, item_id)' -> `np.array` int \[user_id, item_id\]

- Tutorial: [Reading Data from the Web: Web Scraping & Regular Expressions](https://www.summet.com/dmsi/html/readingTheWeb.html)

In [6]:
## convert string to user_id and item_id -> [user_id, item_id, rating]
import re
# pre-process for training data
train_pair = [re.findall(r'\d+', tmp) for tmp in df['(user_id, item_id)']]
train_pair = np.array(train_pair)
train_pair = train_pair.astype(int)
# pre-process for testing set
test_pair = [re.findall(r'\d+', tmp) for tmp in dt['(user_id, item_id)']]
test_pair = np.array(test_pair)
test_pair = test_pair.astype(int)
n_user, n_item = train_pair[:,0].max()+1, train_pair[:,1].max()+1
train_ratings = df['ratings'].values
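
As a minimal sketch of the parsing step above: `re.findall(r'\d+', s)` returns every run of digits in `s` as a list of strings, which can then be cast to integers. The sample strings below are made up for illustration, and the pattern assumes that ids are non-negative integers.

```python
import re
import numpy as np

# hypothetical sample values in the '(user_id, item_id)' string format
raw_pairs = ['(867, 419)', '(69, 361)']

# extract the digit runs from each string, then cast to an integer array
pairs = np.array([re.findall(r'\d+', s) for s in raw_pairs]).astype(int)
print(pairs.shape)  # (2, 2)
print(pairs[0])     # [867 419]
```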


In [7]:
## save the pre-processed data using `np.save`
np.save('./dataset/train_pair.npy', train_pair)
np.save('./dataset/test_pair.npy', test_pair)
np.save('./dataset/train_ratings.npy', train_ratings)
np.save('./dataset/test_ratings.npy', test_ratings)
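
Arrays written with `np.save` can be read back with `np.load`. The round trip below is a small sketch on a toy array; the temporary directory and the file name `toy_pair.npy` are illustrative only and are not part of the course dataset.

```python
import os
import tempfile
import numpy as np

# toy array standing in for `train_pair`
toy_pair = np.array([[0, 1], [2, 3]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pair.npy')
    np.save(path, toy_pair)    # write the array in NumPy's binary .npy format
    restored = np.load(path)   # read it back with the same dtype and shape

print(np.array_equal(toy_pair, restored))  # True
```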

## Implement Baseline methods: global\_average, user\_average and item\_average (For your practice)
- Input: training set.

- Output: predicted ratings for the (user_id, item_id) pairs in the testing set.

- Goal: make predictions for the testing set.

In [8]:
pred = np.zeros(len(test_ratings))

In [9]:
## Global average
global_pred = pred.copy()
global_mean = train_ratings.mean()
global_pred = global_mean*np.ones(len(pred))
print(global_pred[:10])

[0.16038824 0.16038824 0.16038824 0.16038824 0.16038824 0.16038824
 0.16038824 0.16038824 0.16038824 0.16038824]


### user\_average
- Loop over all users
    - Find all records for this user in both the training and testing sets.
    - Compute the average rating for this user in the training set.
    - Predict that average for this user's records in the testing set; if the user has fewer than 3 training ratings, fall back to the global average.

In [10]:
## user average
UA_pred = pred.copy()
for u in range(n_user):
    # find the index for both train and test for user_id = u
    ind_test = np.where(test_pair[:,0] == u)[0]
    ind_train = np.where(train_pair[:,0] == u)[0]
    if len(ind_test) == 0:
        continue
    if len(ind_train) < 3:
        UA_pred[ind_test] = global_mean
    else:
        # predict as user average
        UA_pred[ind_test] = train_ratings[ind_train].mean()
print(UA_pred[:10])

[0.16038824 0.16038824 0.16038824 0.16038824 0.16038824 0.16038824
 0.16038824 0.16038824 0.16038824 0.16038824]
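
The per-user loop can also be written without iterating over users. The following is a sketch on made-up toy data, using `np.bincount` to get each user's rating count and rating sum; the `toy_*` names are hypothetical. One difference from the loop above: test users whose ids exceed the largest training id also fall back to the global mean here.

```python
import numpy as np

# toy data: columns of the pair arrays are (user_id, item_id)
toy_train_pair = np.array([[0, 0], [0, 1], [0, 2], [1, 0]])
toy_train_ratings = np.array([1.0, 2.0, 3.0, 5.0])
toy_test_pair = np.array([[0, 3], [1, 1], [2, 0]])

toy_global_mean = toy_train_ratings.mean()  # 2.75
toy_n_user = max(toy_train_pair[:, 0].max(), toy_test_pair[:, 0].max()) + 1

# per-user rating count and rating sum in the training set
count = np.bincount(toy_train_pair[:, 0], minlength=toy_n_user)
total = np.bincount(toy_train_pair[:, 0], weights=toy_train_ratings,
                    minlength=toy_n_user)

# user mean where the user has at least 3 training ratings, else the global mean
user_avg = np.where(count >= 3, total / np.maximum(count, 1), toy_global_mean)

# look up the prediction for each test pair by its user id
toy_UA_pred = user_avg[toy_test_pair[:, 0]]
print(toy_UA_pred)  # user 0 -> 2.0; users 1 and 2 -> 2.75 (global mean)
```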


## Evaluation: compute RMSE for baseline methods
- Input: (1) predicted testing ratings (2) real testing ratings

- Output: RMSE for the prediction

- Goal: evaluate the prediction performance for the method.
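
For reference, with $n$ test pairs, real ratings $r_i$ and predicted ratings $\hat{r}_i$ (notation introduced here for illustration), the root mean squared error is

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{r}_i - r_i\right)^2}
$$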

In [11]:
## RMSE for Global average
rmse_glb = np.sqrt(np.mean((global_pred - test_ratings)**2))
print('RMSE for GLB average: %.3f' %rmse_glb)

RMSE for GLB average: 2.588


In [12]:
## RMSE for user average
rmse_usr = np.sqrt(np.mean((UA_pred - test_ratings)**2))
print('RMSE for user average: %.3f' %rmse_usr)

RMSE for user average: 2.591


## Summarize `glb_average` and `user_average` methods as Python functions

### `glb_average`

- *Input*: `train_ratings`, `test_pair`

- *Return*: predicted ratings based on the global mean.

In [None]:
def glb_mean(train_ratings, test_pair):
    pred = train_ratings.mean() * np.ones(len(test_pair))
    return pred

### `user_average`

- *Input*: `train_pair`, `train_ratings`, `test_pair`

- *Return*: predicted ratings based on the user mean.

In [None]:
def user_mean(train_pair, train_ratings, test_pair):
    n_user = train_pair[:,0].max()+1
    pred = np.zeros(len(test_pair))
    glb_mean_value = train_ratings.mean()
    for u in range(n_user):
        # find the index for both train and test for user_id = u
        ind_test = np.where(test_pair[:,0] == u)[0]
        ind_train = np.where(train_pair[:,0] == u)[0]
        if len(ind_test) == 0:
            continue
        if len(ind_train) < 3:
            pred[ind_test] = glb_mean_value
        else:
            # predict as user average
            pred[ind_test] = train_ratings[ind_train].mean()
    return pred

## Summarize `Evaluation` as a Python function

In [13]:
def rmse(true, pred):
    return np.sqrt(np.mean((pred - true)**2))
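
A short usage sketch on made-up toy data, combining `glb_mean` and `rmse`. Both functions are repeated so that the block runs on its own, and the `toy_*` arrays are hypothetical.

```python
import numpy as np

def glb_mean(train_ratings, test_pair):
    # predict the training-set mean for every test pair
    return train_ratings.mean() * np.ones(len(test_pair))

def rmse(true, pred):
    # root mean squared error between real and predicted ratings
    return np.sqrt(np.mean((pred - true)**2))

# toy data, made up for illustration
toy_train_ratings = np.array([1.0, 2.0, 3.0])
toy_test_pair = np.array([[0, 1], [1, 0], [2, 2]])
toy_test_ratings = np.array([1.0, 2.0, 3.0])

toy_pred = glb_mean(toy_train_ratings, toy_test_pair)  # [2. 2. 2.]
print('RMSE for GLB average: %.3f' % rmse(toy_test_ratings, toy_pred))  # 0.816
```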