# CUHK [STAT3009](https://www.bendai.org/STAT3009/) Notebook10(a): binary recommender systems

## Pre-process the MovieLens (`ml-latest-small`) raw data
- check the `user_id` and `item_id`: map each to a contiguous integer sequence with `sklearn.preprocessing`
- use `sklearn.model_selection.train_test_split` to generate the `train` and `test` datasets
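
The two steps above can be sketched on toy data with NumPy alone. For a 1-D array, `np.unique(..., return_inverse=True)` yields sorted, contiguous codes, which is the same mapping `sklearn.preprocessing.LabelEncoder` produces. The helper `split_indices` is hypothetical: a simplified stand-in for `train_test_split`, not scikit-learn's implementation, and the ids are made up.

```python
import numpy as np

# Toy raw item ids with gaps (made-up values).
raw_item_id = np.array([10, 500, 10, 42, 500, 7])

# Map ids to 0..n_item-1; `classes` plays the role of LabelEncoder.classes_.
classes, codes = np.unique(raw_item_id, return_inverse=True)
print(classes)  # [  7  10  42 500]
print(codes)    # [1 3 1 2 3 0]

def split_indices(n, test_size=0.33, seed=42):
    """Hypothetical stand-in for train_test_split: shuffle, then cut."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(np.ceil(n * test_size))
    return perm[n_test:], perm[:n_test]

train_idx, test_idx = split_indices(len(raw_item_id))
```

Keeping `classes` around allows the encoded ids to be mapped back to the raw ids with `classes[codes]`.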

## Load additional ``side information``

ref: https://colab.research.google.com/github/lcharlin/80-629/blob/master/week4-PracticalSession/Introduction_to_ML.ipynb#scrollTo=4R717-S52plZ

In [1]:
import numpy as np
import pandas as pd
# load rating
df = pd.read_csv('./dataset/ml-latest-small/ratings.csv')
del df['timestamp']

movies_pd = pd.read_csv('./dataset/ml-latest-small/movies.csv', sep=',', engine='python')
movies_pd.sample(10)

Unnamed: 0,movieId,title,genres
2840,3798,What Lies Beneath (2000),Drama|Horror|Mystery
1189,1586,G.I. Jane (1997),Action|Drama
116,141,"Birdcage, The (1996)",Comedy
6678,57843,Rise of the Footsoldier (2007),Action|Crime|Drama
4994,7707,"He Said, She Said (1991)",Comedy|Drama|Romance
9242,154975,Merci Patron ! (2016),Comedy|Documentary
7860,93855,God Bless America (2011),Comedy|Drama
4383,6427,"Railway Children, The (1970)",Children|Drama
8481,113186,Felony (2013),Thriller
2668,3571,Time Code (2000),Comedy|Drama


## Feature engineering
- extract `year` and `genre` from the movies' side information
- for simplicity, if a movie has multiple genres, keep only the first one
- use regular expressions (`re`) to parse the raw titles ([tutorial](https://regexone.com/))
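
A minimal sketch of the year extraction, assuming the release year appears as a four-digit number in parentheses, as in the sampled titles above. The helper name `extract_year` is hypothetical. The last line shows why a bare `\d+` pattern is fragile: it picks up digits that belong to the title itself.

```python
import re

def extract_year(title):
    """Return the last parenthesized 4-digit number in `title`, or None."""
    matches = re.findall(r'\((\d{4})\)', title)
    return int(matches[-1]) if matches else None

print(extract_year('G.I. Jane (1997)'))              # 1997
print(extract_year('2001: A Space Odyssey (1968)'))  # 1968
print(extract_year('No year here'))                  # None

# A bare digit pattern returns the first run of digits, which may be part of the title.
print(re.findall(r'\d+', '2001: A Space Odyssey (1968)')[0])  # 2001
```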

In [2]:
import re

year, genre = [], []
for i in range(len(movies_pd)):
	row = movies_pd.loc[i]
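	# the release year is the 4-digit number in parentheses, e.g. "G.I. Jane (1997)";
	# a bare '\d+' would also match digits that are part of the title itself
	year_tmp = re.findall(r'\((\d{4})\)', row['title'])
	if len(year_tmp) > 0:
		year.append(int(year_tmp[-1]))
	else:
		year.append(np.nan)
	## take the first one as the primary genre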
	genre.append(row['genres'].split('|')[0])

movies_pd['year'], movies_pd['pGenre'] = year, genre
## delete original title and genres
del movies_pd['title']
del movies_pd['genres']
movies_pd.sample(10)

Unnamed: 0,movieId,year,pGenre
5866,32875,1949.0,Comedy
2209,2936,1941.0,Adventure
1473,1998,1977.0,Horror
2783,3724,1978.0,Drama
9633,179133,2017.0,Animation
849,1119,1995.0,Drama
8989,139157,2015.0,Comedy
2059,2738,1986.0,Comedy
2545,3406,1951.0,Action
2084,2770,1999.0,Comedy


## [Missing data](https://machinelearningmastery.com/handle-missing-data-python/)
- A common default is to impute missing values with the column mean; more sophisticated methods are described in [Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html#impute).
- Here we use `sklearn.impute.SimpleImputer`
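
As a rough illustration of what mean imputation does, the NumPy sketch below replaces each `NaN` with the mean of the observed entries. The toy `year` values are made up.

```python
import numpy as np

year = np.array([1995.0, np.nan, 2001.0, 1986.0, np.nan])

# Mean over the observed (non-NaN) entries only.
mean_year = np.nanmean(year)

# Fill the missing entries with that mean; observed entries are unchanged.
year_imputed = np.where(np.isnan(year), mean_year, year)
print(year_imputed)  # [1995. 1994. 2001. 1986. 1994.]
```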

In [4]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(movies_pd['year'].values.reshape(-1, 1))
movies_pd['year'] = imp_mean.transform(movies_pd['year'].values.reshape(-1, 1))

## Generate some additional side information for users and items
- Number of ratings
- Averaged ratings
- quantiles of the ratings (as a practice)
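
The quantile item is left as practice. One possible sketch, using NumPy on toy data, is given below; the helper `user_quantiles` and the chosen quantile levels are assumptions for illustration only.

```python
import numpy as np

# Toy (userId, rating) records; values are made up.
user_ids = np.array([0, 0, 0, 0, 1, 1, 1])
ratings = np.array([1.0, 2.0, 3.0, 4.0, 2.0, 4.0, 5.0])

def user_quantiles(user_ids, ratings, qs=(0.25, 0.5, 0.75)):
    """Hypothetical helper: per-user rating quantiles (linear interpolation)."""
    return {int(u): np.quantile(ratings[user_ids == u], qs)
            for u in np.unique(user_ids)}

q = user_quantiles(user_ids, ratings)
print(q[0])  # [1.75 2.5  3.25]
print(q[1])  # [3.  4.  4.5]
```

Each quantile could then be added as an extra continuous feature column next to `rating_mean` and `rating_count`.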

In [5]:
user_pd = pd.merge(left=df.groupby('userId')['rating'].mean(), 
					right=df.groupby('userId')['rating'].count(), on='userId', )
user_pd.columns = ['rating_mean', 'rating_count']
user_pd = user_pd.reset_index()

movie_rating_pd = pd.merge(left=df.groupby('movieId')['rating'].mean(), 
						right=df.groupby('movieId')['rating'].count(), on='movieId')
movie_rating_pd.columns	= ['rating_mean', 'rating_count']

movies_pd = pd.merge(left=movie_rating_pd, right=movies_pd, on='movieId')

print(user_pd.sample(10))
print(movies_pd.sample(10))

     userId  rating_mean  rating_count
587     588     3.250000            56
11       12     4.390625            32
488     489     3.017747           648
243     244     3.774194            93
235     236     3.966667            30
96       97     4.194444            36
551     552     3.119681           188
597     598     3.809524            21
593     594     3.924569           232
347     348     4.672727            55
      movieId  rating_mean  rating_count    year       pGenre
3288     4453     3.500000             1  2000.0  Documentary
9282   158956     2.500000             2  2016.0       Action
7331    78218     3.000000             1  2010.0        Drama
3862     5437     3.000000             3  1986.0       Comedy
939      1241     4.050000            10  1992.0       Comedy
4351     6374     2.166667             3  2003.0       Comedy
7189    72733     3.857143             7  2009.0        Drama
2409     3201     4.500000             8  1970.0        Drama
3567     4890

## Pre-processing the dataset
- all continuous features should be standardized to mean 0 and standard deviation 1
- all categorical features should be re-encoded as contiguous integers, so that no unused category ids remain
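
The standardization step can be written out by hand as below. This is a sketch on made-up numbers; it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy feature matrix: columns play the role of rating_mean and rating_count.
X = np.array([[3.0, 10.0],
              [4.0, 200.0],
              [5.0, 30.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)   # population std (ddof=0)
Z = (X - mu) / sigma    # column-wise z-scores

print(Z.mean(axis=0))   # approximately [0, 0]
print(Z.std(axis=0))    # approximately [1, 1]
```

Without this step a count-like feature such as `rating_count` would have a far larger scale than `rating_mean` when both are fed to the network.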

In [6]:
from sklearn import preprocessing
## pre-processing for users
user_cont = ['rating_mean', 'rating_count']
user_pd[user_cont] = preprocessing.StandardScaler().fit_transform(user_pd[user_cont])

## pre-processing for movies
movie_cont = ['rating_mean', 'rating_count', 'year']
movies_pd[movie_cont] = preprocessing.StandardScaler().fit_transform(movies_pd[movie_cont])

## encoding for categorical data
le_genre = preprocessing.LabelEncoder()
movies_pd['pGenre'] = le_genre.fit_transform(movies_pd['pGenre'])

## joint encoding for userId and movieId
# !!! all dfs must share the same encoding for userId and movieId, respectively !!!
le_movie = preprocessing.LabelEncoder()
le_user = preprocessing.LabelEncoder()

df['movieId'] = le_movie.fit_transform(df['movieId'])
df['userId'] = le_user.fit_transform(df['userId'])

movies_pd['movieId'] = le_movie.transform(movies_pd['movieId'])
user_pd['userId'] = le_user.transform(user_pd['userId'])

user_pd = user_pd.set_index('userId', drop=False)
movies_pd = movies_pd.set_index('movieId', drop=False)
## generate train / test dataset
from sklearn.model_selection import train_test_split
dtrain, dtest = train_test_split(df, test_size=0.33, random_state=42)
## save real ratings for test set for evaluation.
test_rating = np.array(dtest['rating'])
## remove the ratings in the test set to simulate prediction
dtest = dtest.drop(columns='rating')

In [7]:
# train_pair, train_rating
train_pair = dtrain[['userId', 'movieId']].values
train_rating = dtrain['rating'].values
# test_pair
test_pair = dtest[['userId', 'movieId']].values
n_user, n_item = max(train_pair[:,0].max(), test_pair[:,0].max())+1, max(train_pair[:,1].max(), test_pair[:,1].max())+1

## Generate a binary dataset
- if `rating` >= 3.5: ``LIKE``;
- if `rating` <3.5: ``DISLIKE``


In [8]:
train_like = 1*(train_rating >= 3.5)
test_like = 1*(test_rating >= 3.5)
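
A useful reference point for any accuracy reported on the binary labels is the majority-class baseline: always predict whichever label is more frequent in the training set. The sketch below uses toy label arrays (suffix `_toy`, made-up values), not the notebook's data.

```python
import numpy as np

train_like_toy = np.array([1, 1, 0, 1, 0, 1])
test_like_toy = np.array([1, 0, 1, 1])

# Predict the more frequent training label for every test pair.
majority = int(train_like_toy.mean() >= 0.5)
pred_toy = np.full(len(test_like_toy), majority)

baseline_acc = np.mean(pred_toy == test_like_toy)
print(majority, baseline_acc)  # 1 0.75
```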

## Load the existing methods

In [9]:
def rmse(true, pred):
	return np.sqrt(np.mean((pred - true)**2))

# baseline methods
class glb_mean(object):
	def __init__(self):
		self.glb_mean = 0
	
	def fit(self, train_ratings):
		self.glb_mean = np.mean(train_ratings)
	
	def predict(self, test_pair):
		pred = np.ones(len(test_pair))
		pred = pred*self.glb_mean
		return pred

class user_mean(object):
	def __init__(self, n_user):
		self.n_user = n_user
		self.glb_mean = 0.
		self.user_mean = np.zeros(n_user)
	
	def fit(self, train_pair, train_ratings):
		self.glb_mean = train_ratings.mean()
		for u in range(self.n_user):
			ind_train = np.where(train_pair[:,0] == u)[0]
			if len(ind_train) == 0:
				self.user_mean[u] = self.glb_mean
			else:
				self.user_mean[u] = train_ratings[ind_train].mean()
	
	def predict(self, test_pair):
		pred = np.ones(len(test_pair))*self.glb_mean
		j = 0
		for row in test_pair:
			user_tmp, item_tmp = row[0], row[1]
			pred[j] = self.user_mean[user_tmp]
			j = j + 1
		return pred

class item_mean(object):
	def __init__(self, n_item):
		self.n_item = n_item
		self.glb_mean = 0.
		self.item_mean = np.zeros(n_item)
	
	def fit(self, train_pair, train_ratings):
		self.glb_mean = train_ratings.mean()
		for i in range(self.n_item):
			ind_train = np.where(train_pair[:,1] == i)[0]
			if len(ind_train) == 0:
				self.item_mean[i] = self.glb_mean
			else:
				self.item_mean[i] = train_ratings[ind_train].mean()
	
	def predict(self, test_pair):
		pred = np.ones(len(test_pair))*self.glb_mean
		j = 0
		for row in test_pair:
			user_tmp, item_tmp = row[0], row[1]
			pred[j] = self.item_mean[item_tmp]
			j = j + 1
		return pred


class LFM(object):

    def __init__(self, n_user, n_item, lam=.001, K=10, iterNum=10, tol=1e-4, verbose=1):
        self.P = np.random.randn(n_user, K)
        self.Q = np.random.randn(n_item, K)
        # self.index_item = []
        # self.index_user = []
        self.n_user = n_user
        self.n_item = n_item
        self.lam = lam
        self.K = K
        self.iterNum = iterNum
        self.tol = tol
        self.verbose = verbose

    def fit(self, train_pair, train_rating):
        diff, tol = 1., self.tol
        n_user, n_item, n_obs = self.n_user, self.n_item, len(train_pair)
        K, iterNum, lam = self.K, self.iterNum, self.lam
        ## store user/item index set
        self.index_item = [np.where(train_pair[:,1] == i)[0] for i in range(n_item)]
        self.index_user = [np.where(train_pair[:,0] == u)[0] for u in range(n_user)]
        if self.verbose:
            print('Fitting Reg-LFM: K: %d, lam: %.5f' %(K, lam))
        for i in range(iterNum):
            ## item update
            score_old = self.rmse(test_pair=train_pair, test_rating=train_rating)
            for item_id in range(n_item):
                index_item_tmp = self.index_item[item_id]
                if len(index_item_tmp) == 0:
                    self.Q[item_id,:] = 0.
                    continue
                sum_pu, sum_matrix = np.zeros((K)), np.zeros((K, K))
                for record_ind in index_item_tmp:
                    ## double-check
                    if item_id != train_pair[record_ind][1]:
                        raise ValueError('the item_id is wrong in updating Q!')
                    user_id, rating_tmp = train_pair[record_ind][0], train_rating[record_ind]
                    sum_matrix = sum_matrix + np.outer(self.P[user_id,:], self.P[user_id,:])
                    sum_pu = sum_pu + rating_tmp * self.P[user_id,:]                    
                self.Q[item_id,:] = np.dot(np.linalg.inv(sum_matrix + lam*n_obs*np.identity(K)), sum_pu)
            
            for user_id in range(n_user):
                index_user_tmp = self.index_user[user_id]
                if len(index_user_tmp) == 0:
                    self.P[user_id,:] = 0.
                    continue
                sum_pu, sum_matrix = np.zeros((K)), np.zeros((K, K))
                for record_ind in index_user_tmp:
                    ## double-check
                    if user_id != train_pair[record_ind][0]:
                        raise ValueError('the user_id is wrong in updating P!')
                    item_id, rating_tmp = train_pair[record_ind][1], train_rating[record_ind]
                    sum_matrix = sum_matrix + np.outer(self.Q[item_id,:], self.Q[item_id,:])
                    sum_pu = sum_pu + rating_tmp * self.Q[item_id,:]                    
                self.P[user_id,:] = np.dot(np.linalg.inv(sum_matrix + lam*n_obs*np.identity(K)), sum_pu)
            # compute the new rmse score
            score_new = self.rmse(test_pair=train_pair, test_rating=train_rating)
            diff = abs(score_new - score_old) / score_old
            if self.verbose:
                print("Reg-LFM: ite: %d; diff: %.3f RMSE: %.3f" %(i, diff, score_new))
            if(diff < tol):
                break

    def predict(self, test_pair):
        # predict ratings for user-item pairs
        pred_rating = [np.dot(self.P[line[0]], self.Q[line[1]]) for line in test_pair]
        return np.array(pred_rating)
    
    def rmse(self, test_pair, test_rating):
        # report the rmse for the fitted `LFM`
        pred_rating = self.predict(test_pair=test_pair)
        return np.sqrt( np.mean( (pred_rating - test_rating)**2) )

from sklearn.model_selection import KFold
import itertools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

class LFM_CV(object):

	def __init__(self, n_user, n_item, cv=5,
				lams=[.000001,.0001,.001,.01], 
				Ks=[3,5,10,20], 
				iterNum=10, tol=1e-4):
		# self.index_item = []
		# self.index_user = []
		self.n_user = n_user
		self.n_item = n_item
		self.cv = cv
		self.lams = lams
		self.Ks = Ks
		self.iterNum = iterNum
		self.tol = tol
		self.best_model = {}
		self.cv_result = {'K': [], 'lam': [], 'train_rmse': [], 'valid_rmse': []}

	def grid_search(self, train_pair, train_rating):
		## generate all comb of `K` and `lam`
		kf = KFold(n_splits=self.cv, shuffle=True)
		for (K,lam) in itertools.product(self.Ks, self.lams):
			train_rmse_tmp, valid_rmse_tmp = 0., 0.
			for train_index, valid_index in kf.split(train_pair):
				# produce training/validation sets
				train_pair_cv, train_rating_cv = train_pair[train_index], train_rating[train_index]
				valid_pair_cv, valid_rating_cv = train_pair[valid_index], train_rating[valid_index]
				# fit the model based on CV data
				model_tmp = LFM(self.n_user, self.n_item, K=K, lam=lam, verbose=0)
				model_tmp.fit(train_pair=train_pair_cv, train_rating=train_rating_cv)
				train_rmse_tmp_cv = model_tmp.rmse(test_pair=train_pair_cv, test_rating=train_rating_cv)
				valid_rmse_tmp_cv = model_tmp.rmse(test_pair=valid_pair_cv, test_rating=valid_rating_cv)
				train_rmse_tmp = train_rmse_tmp + train_rmse_tmp_cv / self.cv
				valid_rmse_tmp = valid_rmse_tmp + valid_rmse_tmp_cv / self.cv
				print('%d-Fold CV for K: %d; lam: %.5f: train_rmse: %.3f, valid_rmse: %.3f' 
						%(self.cv, K, lam, train_rmse_tmp_cv, valid_rmse_tmp_cv))
			self.cv_result['K'].append(K)
			self.cv_result['lam'].append(lam)
			self.cv_result['train_rmse'].append(train_rmse_tmp)
			self.cv_result['valid_rmse'].append(valid_rmse_tmp)
		self.cv_result = pd.DataFrame.from_dict(self.cv_result)
		best_ind = self.cv_result['valid_rmse'].argmin()
		self.best_model = self.cv_result.loc[best_ind]
	
	def plot_grid(self, data_source='valid'):
		sns.set_theme()
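		if data_source == 'train':
			# keyword arguments: recent pandas versions no longer accept positional ones here
			cv_pivot = self.cv_result.pivot(index="K", columns="lam", values="train_rmse")
		elif data_source == 'valid':
			cv_pivot = self.cv_result.pivot(index="K", columns="lam", values="valid_rmse")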
		else:
			raise ValueError('data_source must be train or valid!')
		sns.heatmap(cv_pivot, annot=True, fmt=".3f", linewidths=.5, cmap="YlGnBu")
		plt.show()
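
The item update inside `LFM.fit` above accumulates outer products in a loop and then solves a ridge-regularized least-squares problem. As a sketch on random toy data, the same update can be written with matrix products and `np.linalg.solve`; the array names `P_rated` and `r` are assumptions introduced here for the factors and ratings of the users who rated one item.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_obs, lam = 3, 100, 0.001
P_rated = rng.normal(size=(5, K))  # factors of the 5 users who rated the item
r = rng.normal(size=5)             # their ratings (toy values)

# Loop form, mirroring the accumulation in LFM.fit.
sum_matrix, sum_pu = np.zeros((K, K)), np.zeros(K)
for p_u, r_u in zip(P_rated, r):
    sum_matrix += np.outer(p_u, p_u)
    sum_pu += r_u * p_u
q_loop = np.linalg.inv(sum_matrix + lam * n_obs * np.identity(K)) @ sum_pu

# Vectorized form: the same linear system, solved without an explicit inverse.
A = P_rated.T @ P_rated + lam * n_obs * np.identity(K)
q_vec = np.linalg.solve(A, P_rated.T @ r)

print(np.allclose(q_loop, q_vec))  # True
```

Solving the system directly is generally preferred over forming the inverse, and the matrix products avoid the inner Python loop over records.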

## ``NCF`` Model based on side information

### Step 1: Formulate neural network based on continuous and categorical features
- embedding for categorical features
- concatenate continuous features and all embedding vectors

In [10]:
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Embedding, Flatten, Input, Dropout, Dense, Concatenate
from tensorflow.keras.optimizers import Adam
from IPython.display import SVG
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf



In [11]:
class SideNCF(keras.Model):
    def __init__(self, num_users, num_movies, num_genre, embedding_size, **kwargs):
        super(SideNCF, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.genre_embedding = layers.Embedding(
            num_genre,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.concatenate = layers.Concatenate()
        self.dense1 = layers.Dense(100, name='fc-1', activation='relu')
        self.dense2 = layers.Dense(50, name='fc-2', activation='relu')
        ## the last layer's activation must be sigmoid, so that the output is a probability in (0, 1)
        self.dense3 = layers.Dense(1, name='fc-3', activation='sigmoid')

    def call(self, inputs):
        cont_feats = inputs[0]
        cate_feats = inputs[1]
        user_vector = self.user_embedding(cate_feats[:,0])
        movie_vector = self.movie_embedding(cate_feats[:,1])
        genre_vector = self.genre_embedding(cate_feats[:,2])
        concatted_vec = self.concatenate([cont_feats, user_vector, movie_vector, genre_vector])
        fc_1 = self.dense1(concatted_vec)
        fc_2 = self.dense2(fc_1)
        fc_3 = self.dense3(fc_2)
        return fc_3

## ``loss`` function and evaluation ``metrics`` should be changed!
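
For reference, the two quantities used for the binary task can be written out in NumPy as below. This is a sketch of the standard definitions, not the Keras implementation; the clipping constant `eps` and the `>= 0.5` threshold (the rule this notebook applies to its predictions) are assumptions of the sketch.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def binary_accuracy(y_true, p, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    return float(np.mean((p >= threshold).astype(int) == y_true))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.4, 0.6])
print(binary_accuracy(y, p))                      # 0.5
print(binary_cross_entropy(y, np.full(4, 0.5)))   # log(2), about 0.693
```

Unlike squared error on a rating, the cross-entropy penalizes confident wrong probabilities heavily, which is why it is paired with a sigmoid output.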

In [12]:
num_genre = movies_pd['pGenre'].max() + 1
model = SideNCF(num_users=n_user, num_movies=n_item, num_genre=num_genre, embedding_size=50)

# metrics = [
#     keras.metrics.MeanAbsoluteError(name='mae'),
#     keras.metrics.RootMeanSquaredError(name='rmse')
# ]

# model.compile(
#     optimizer=keras.optimizers.Adam(1e-3), 
#     loss=tf.keras.losses.MeanSquaredError(), 
#     metrics=metrics
# )

metrics = [
    keras.metrics.BinaryAccuracy(name='binary_acc')
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-3), 
    loss=tf.keras.losses.BinaryCrossentropy(), 
    metrics=metrics
)

# from tensorflow.keras.utils import plot_model
# plot_model(model, to_file='model.png')


### Step 2: produce the continuous and categorical features for users and items

In [13]:
## find the continuous features and categorical features for user and item, respectively
## note: `year` was standardized above but is not included in `movie_cont` here
movie_cont, movie_cate = ['rating_mean', 'rating_count'], ['movieId', 'pGenre']
user_cont, user_cate = ['rating_mean', 'rating_count'], ['userId']

train_cont_feats = np.hstack((user_pd.loc[train_pair[:,0]][user_cont], movies_pd.loc[train_pair[:,1]][movie_cont]))
train_cate_feats = np.hstack((user_pd.loc[train_pair[:,0]][user_cate], movies_pd.loc[train_pair[:,1]][movie_cate]))

test_cont_feats = np.hstack((user_pd.loc[test_pair[:,0]][user_cont], movies_pd.loc[test_pair[:,1]][movie_cont]))
test_cate_feats = np.hstack((user_pd.loc[test_pair[:,0]][user_cate], movies_pd.loc[test_pair[:,1]][movie_cate]))
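
The feature assembly above builds one row per (user, movie) pair by looking up the user's and the movie's features and stacking them side by side. A NumPy sketch with made-up lookup tables is shown below; row `u` of `user_feats` is assumed to hold the features of encoded user `u`, and likewise for movies.

```python
import numpy as np

user_feats = np.array([[0.1, -1.0], [0.5, 0.3], [-0.6, 0.7]])  # 3 users x 2 features
movie_feats = np.array([[1.2, 0.0], [-0.4, 2.0]])              # 2 movies x 2 features
movie_genre = np.array([3, 1])                                 # encoded genre per movie

pairs = np.array([[2, 0], [0, 1], [2, 1]])  # (userId, movieId) per sample

# Continuous block: user features followed by movie features.
cont = np.hstack((user_feats[pairs[:, 0]], movie_feats[pairs[:, 1]]))

# Categorical block: columns ordered as userId, movieId, pGenre.
cate = np.column_stack((pairs[:, 0], pairs[:, 1], movie_genre[pairs[:, 1]]))

print(cont.shape, cate.shape)  # (3, 4) (3, 3)
```

The column order of the categorical block matters, because `SideNCF.call` reads columns 0, 1 and 2 as user, movie and genre.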

### Step 3: Feed neural network with multi-source dataset

In [14]:
callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_binary_acc', min_delta=0, patience=5, verbose=1, 
    mode='auto', baseline=None, restore_best_weights=True)]

history = model.fit(
    x=[train_cont_feats, train_cate_feats],
    y=train_like,
    batch_size=64,
    epochs=50,
    verbose=1,
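    validation_split=.2,
    # NOTE: `callbacks` defined above is not passed here, so early stopping is
    # inactive and all 50 epochs run; add `callbacks=callbacks` to enable it.
)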



Epoch 1/50
...
Epoch 50/50
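
The `EarlyStopping` callback configured in the cell above monitors `val_binary_acc` with `patience=5`. Its stopping rule can be sketched in plain Python as below; this is a simplified illustration of the patience logic for a metric that should increase, not the Keras implementation, and the function name and scores are made up.

```python
def early_stopping_epoch(val_scores, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), 0-indexed, for a metric to maximize."""
    best, best_epoch, wait = float('-inf'), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never triggered: training ran to the last epoch.
    return len(val_scores) - 1, best_epoch

scores = [0.70, 0.72, 0.71, 0.71, 0.72, 0.70, 0.71, 0.69, 0.73]
print(early_stopping_epoch(scores))  # (6, 1)
```

With `restore_best_weights=True`, the weights from the best epoch (epoch 1 in this toy run) would be the ones kept after stopping.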


In [15]:
## make prediction
pred_prob = model.predict([test_cont_feats, test_cate_feats]).flatten()
pred_like = 1*(pred_prob >= 0.5)
print(pred_like)
_, binary_acc_test = model.evaluate(x=[test_cont_feats, test_cate_feats], y=test_like)
print('binary accuracy: SideNCF: %.3f' %binary_acc_test)

[1 1 0 ... 1 0 1]
binary accuracy: SideNCF: 0.772
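
The value returned by `model.evaluate` here is the binary accuracy; the misclassification rate is one minus that value. The sketch below computes both, along with the confusion counts, from thresholded probabilities. The arrays carry a `_toy` suffix and hold made-up values.

```python
import numpy as np

test_like_toy = np.array([1, 0, 1, 1, 0, 0])
pred_prob_toy = np.array([0.9, 0.4, 0.3, 0.7, 0.6, 0.1])

# Same thresholding rule as `pred_like` above.
pred_like_toy = 1 * (pred_prob_toy >= 0.5)

acc = np.mean(pred_like_toy == test_like_toy)
mcr = 1 - acc  # misclassification rate

# Confusion counts for the LIKE (1) class.
tp = int(np.sum((pred_like_toy == 1) & (test_like_toy == 1)))
fp = int(np.sum((pred_like_toy == 1) & (test_like_toy == 0)))
fn = int(np.sum((pred_like_toy == 0) & (test_like_toy == 1)))
tn = int(np.sum((pred_like_toy == 0) & (test_like_toy == 0)))
print(tp, fp, fn, tn)  # 2 1 1 2
```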
