# CUHK [STAT3009](https://www.bendai.org/STAT3009/) Notebook8: Side information: continuous and discrete features

## Pre-process the MovieLens raw data (`ml-latest-small`, about 100K ratings)
- check the `user_id` and `item_id`: map them to consecutive integer indices with `sklearn.preprocessing.LabelEncoder`
- use `sklearn.model_selection.train_test_split` to generate the `train` and `test` datasets

## Load additional ``side information``

ref: https://colab.research.google.com/github/lcharlin/80-629/blob/master/week4-PracticalSession/Introduction_to_ML.ipynb#scrollTo=4R717-S52plZ

In [4]:
import numpy as np
import pandas as pd
# load rating
df = pd.read_csv('./dataset/ml-latest-small/ratings.csv')
del df['timestamp']

movies_pd = pd.read_csv('./dataset/ml-latest-small/movies.csv', sep=',', engine='python')
movies_pd.sample(10)

| | movieId | title | genres |
|---|---|---|---|
| 3150 | 4237 | Gleaners & I, The (Les glaneurs et la glaneuse... | Documentary |
| 5831 | 32294 | Milk and Honey (2003) | Drama |
| 1432 | 1955 | Kramer vs. Kramer (1979) | Drama |
| 4836 | 7218 | Ox-Bow Incident, The (1943) | Drama\|Western |
| 8572 | 116823 | The Hunger Games: Mockingjay - Part 1 (2014) | Adventure\|Sci-Fi\|Thriller |
| 2184 | 2901 | Phantasm (1979) | Horror\|Sci-Fi |
| 7973 | 96530 | Conception (2011) | Comedy\|Romance |
| 9529 | 172229 | Plain Clothes (1988) | Comedy\|Mystery\|Romance\|Thriller |
| 2798 | 3740 | Big Trouble in Little China (1986) | Action\|Adventure\|Comedy\|Fantasy |
| 4866 | 7292 | Best Defense (1984) | Comedy\|War |


## Feature engineering
- extract `year` and `genre` from the movies' side information
- for simplicity, if a movie has multiple genres, we keep only the first one
- use regular expressions (`re`) to parse the raw titles; see this [tutorial](https://regexone.com/)

In [5]:
import re

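year, genre = [], []
for i in range(len(movies_pd)):
    row = movies_pd.loc[i]
    # take the first run of digits in the title as the year;
    # note: this misreads titles that contain a number before the year
    year_tmp = re.findall(r'\d+', row['title'])
    if len(year_tmp) > 0:
        year.append(int(year_tmp[0]))
    else:
        # no digits in the title: mark the year as missing
        year.append(np.nan)
    ## take the first genre as the primary genre
    genre.append(row['genres'].split('|')[0])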

movies_pd['year'], movies_pd['pGenre'] = year, genre
## delete original title and genres
del movies_pd['title']
del movies_pd['genres']
movies_pd.sample(10)

| | movieId | year | pGenre |
|---|---|---|---|
| 7983 | 96726 | 2012.0 | Comedy |
| 3318 | 4490 | 1988.0 | Comedy |
| 8012 | 97742 | 2012.0 | Action |
| 7160 | 71732 | 2008.0 | Comedy |
| 119 | 146 | 1995.0 | Adventure |
| 3604 | 4951 | 1990.0 | Adventure |
| 3816 | 5346 | 1990.0 | Drama |
| 7697 | 89774 | 2011.0 | Drama |
| 5853 | 32598 | 2005.0 | Comedy |
| 6433 | 51662 | 300.0 | Action |
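
The `300.0` entry in the sample output shows a weakness of taking the first run of digits in the title. A possible alternative, sketched here under the assumption that the release year appears as four digits in parentheses, is to match that pattern explicitly and keep the last match. The helper name `extract_year` and the example titles are illustrative, not part of the original notebook.

```python
import re
import numpy as np

def extract_year(title):
    """Return the last 4-digit number found in parentheses, or NaN if none."""
    matches = re.findall(r'\((\d{4})\)', title)
    return float(matches[-1]) if matches else np.nan

# illustrative titles: a numeric title, an ordinary title, and one without a year
titles = ['300 (2007)', 'Kramer vs. Kramer (1979)', 'Untitled Film']
years = [extract_year(t) for t in titles]
print(years)
```

Titles without a parenthesised year still yield `np.nan`, so the imputation step that follows applies unchanged.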


## [Missing data](https://machinelearningmastery.com/handle-missing-data-python/)
- A common default is to impute missing values with the column mean; more sophisticated methods exist, see [Imputation of missing values](https://scikit-learn.org/stable/modules/impute.html#impute).
- Here we use `sklearn.impute.SimpleImputer`

In [None]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(movies_pd['year'].values.reshape(-1, 1))
movies_pd['year'] = imp_mean.transform(movies_pd['year'].values.reshape(-1, 1))
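
As a small illustration of how the imputation strategy matters, the sketch below compares `strategy='mean'` with `strategy='median'` on a toy column. The toy values are made up; the outlier `300.0` mimics a mis-parsed year, and the median is only one possible alternative, not something the notebook itself uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy column with one outlier (300.0) and one missing value
toy_year = np.array([[1990.0], [2010.0], [300.0], [np.nan]])

mean_filled = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(toy_year)
median_filled = SimpleImputer(missing_values=np.nan, strategy='median').fit_transform(toy_year)

print(mean_filled.ravel())    # missing entry replaced by the column mean
print(median_filled.ravel())  # missing entry replaced by the column median
```

The mean is pulled towards the outlier, while the median stays at a plausible year.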

## Generate some additional side information for users and items
- Number of ratings
- Average rating
- Quantiles of the ratings (left as practice)

In [None]:
# per-user mean rating and number of ratings
user_pd = pd.merge(left=df.groupby('userId')['rating'].mean(),
                   right=df.groupby('userId')['rating'].count(), on='userId')
user_pd.columns = ['rating_mean', 'rating_count']
user_pd = user_pd.reset_index()

# per-movie mean rating and number of ratings
movie_rating_pd = pd.merge(left=df.groupby('movieId')['rating'].mean(),
                           right=df.groupby('movieId')['rating'].count(), on='movieId')
movie_rating_pd.columns = ['rating_mean', 'rating_count']

# inner join: movies that received no rating are dropped
movies_pd = pd.merge(left=movie_rating_pd, right=movies_pd, on='movieId')

print(user_pd.sample(10))
print(movies_pd.sample(10))
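
The cell above builds the count and the mean but leaves the quantiles as practice. One possible way to compute them is sketched below on a toy rating table; the column names `rating_q25`, `rating_q50` and `rating_q75` are hypothetical choices, and the toy data stands in for `df`.

```python
import pandas as pd

# toy ratings standing in for `df`
toy = pd.DataFrame({'userId': [1, 1, 1, 1, 2, 2, 2],
                    'rating': [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 3.0]})

# one row per user, one column per requested quantile
user_q = toy.groupby('userId')['rating'].quantile([0.25, 0.5, 0.75]).unstack()
user_q.columns = ['rating_q25', 'rating_q50', 'rating_q75']
user_q = user_q.reset_index()
print(user_q)
```

The resulting table can be merged into `user_pd` on `userId` in the same way as the mean and the count.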

## Pre-processing the dataset
- all continuous features should be standardized to mean 0 and standard deviation 1
- all categorical features should be re-encoded as consecutive integers, so that no index is left unused

In [None]:
from sklearn import preprocessing
## pre-processing for users
user_cont = ['rating_mean', 'rating_count']
user_pd[user_cont] = preprocessing.StandardScaler().fit_transform(user_pd[user_cont])

## pre-processing for movies
movie_cont = ['rating_mean', 'rating_count', 'year']
movies_pd[movie_cont] = preprocessing.StandardScaler().fit_transform(movies_pd[movie_cont])

## encode the categorical feature as integer labels
le_genre = preprocessing.LabelEncoder()
movies_pd['pGenre'] = le_genre.fit_transform(movies_pd['pGenre'])

## joint encoding for userId and movieId
# !!! all dfs must share the same encoding for userId and movieId, respectively !!!
le_movie = preprocessing.LabelEncoder()
le_user = preprocessing.LabelEncoder()

df['movieId'] = le_movie.fit_transform(df['movieId'])
df['userId'] = le_user.fit_transform(df['userId'])

movies_pd['movieId'] = le_movie.transform(movies_pd['movieId'])
user_pd['userId'] = le_user.transform(user_pd['userId'])

## generate train / test dataset
from sklearn.model_selection import train_test_split
dtrain, dtest = train_test_split(df, test_size=0.33, random_state=42)
## save real ratings for test set for evaluation.
test_rating = np.array(dtest['rating'])
## remove the ratings in the test set to simulate prediction
dtest = dtest.drop(columns='rating')
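
To illustrate why the rating table and the side-information tables must share one encoder, here is a minimal sketch on toy data (the two small DataFrames are made up). The encoder is fitted once on the ratings and then only applied to the side table, so the same raw id always maps to the same index.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy tables standing in for `df` and `movies_pd`
ratings = pd.DataFrame({'movieId': [10, 50, 10, 200],
                        'rating': [4.0, 3.0, 5.0, 2.0]})
side = pd.DataFrame({'movieId': [200, 10, 50],
                     'year': [1999, 2005, 2012]})

le = LabelEncoder()
ratings['movieId'] = le.fit_transform(ratings['movieId'])  # fit once: 10->0, 50->1, 200->2
side['movieId'] = le.transform(side['movieId'])            # reuse the same mapping

# an id that was not seen during fit cannot be transformed
try:
    le.transform([999])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(ratings['movieId'].tolist(), side['movieId'].tolist(), unseen_ok)
```

Fitting a second encoder on the side table would assign indices independently, and rows would silently point at the wrong embeddings.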

In [None]:
# train_pair, train_rating
train_pair = dtrain[['userId', 'movieId']].values
train_rating = dtrain['rating'].values
# test_pair
test_pair = dtest[['userId', 'movieId']].values
# number of users / items: largest encoded index + 1
n_user = max(train_pair[:,0].max(), test_pair[:,0].max()) + 1
n_item = max(train_pair[:,1].max(), test_pair[:,1].max()) + 1

## Create the NCF (neural collaborative filtering) model

In [20]:
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Embedding, Flatten, Input, Dropout, Dense, Concatenate
from tensorflow.keras.optimizers import Adam
from IPython.display import SVG
from tensorflow import keras
from tensorflow.keras import layers

In [45]:
class NCF(keras.Model):
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(NCF, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-2),
        )
        self.concatenate = layers.Concatenate()
        self.dense1 = layers.Dense(100, name='fc-1', activation='relu')
        self.dense2 = layers.Dense(50, name='fc-2', activation='relu')
        self.dense3 = layers.Dense(1, name='fc-3', activation='relu')

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        concatted_vec = self.concatenate([user_vector, movie_vector])
        fc_1 = self.dense1(concatted_vec)
        fc_2 = self.dense2(fc_1)
        fc_3 = self.dense3(fc_2)
        return fc_3
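
To make the data flow in `call` concrete, the following is a rough NumPy sketch of the same forward pass: embedding lookup, concatenation, then three ReLU layers. It is not the Keras implementation; the weights are random, and the toy sizes and the names `P`, `Q`, `W1`, `W2`, `W3` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_user, n_item, k = 5, 7, 4   # toy sizes, much smaller than in the notebook

# embedding tables: one row per user / per movie
P = rng.normal(size=(n_user, k))
Q = rng.normal(size=(n_item, k))
# dense layers 2k -> 8 -> 4 -> 1 (shrunk from 100 -> 50 -> 1 for illustration)
W1, b1 = rng.normal(size=(2 * k, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(4, 1)), np.zeros(1)

def relu(x):
    return np.maximum(x, 0.0)

def ncf_forward(pairs):
    """pairs: integer array of shape (batch, 2) holding (userId, movieId)."""
    u = P[pairs[:, 0]]                  # user embeddings, shape (batch, k)
    v = Q[pairs[:, 1]]                  # movie embeddings, shape (batch, k)
    h = np.concatenate([u, v], axis=1)  # shape (batch, 2k)
    h = relu(h @ W1 + b1)
    h = relu(h @ W2 + b2)
    return relu(h @ W3 + b3)            # shape (batch, 1), non-negative

pairs = np.array([[0, 1], [4, 6], [2, 2]])
out = ncf_forward(pairs)
print(out.shape)
```

Because the last layer also uses ReLU, every prediction is non-negative, which suits ratings but means the output can never go below zero.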

## Select the `loss function`, `metrics`, and optimization `algorithm`

In [46]:
model = NCF(num_users=n_user, num_movies=n_item, embedding_size=50)

metrics = [
    keras.metrics.MeanAbsoluteError(name='mae'),
    keras.metrics.RootMeanSquaredError(name='rmse')
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-3), 
    loss=tf.keras.losses.MeanSquaredError(), 
    metrics=metrics
)

# from tensorflow.keras.utils import plot_model
# plot_model(model, to_file='model.png')

In [47]:
callbacks = [keras.callbacks.EarlyStopping( 
    monitor='val_rmse', min_delta=0, patience=5, verbose=1, 
    mode='auto', baseline=None, restore_best_weights=True)]

history = model.fit(
    x=train_pair,
    y=train_rating,
    batch_size=64,
    epochs=50,
    verbose=1,
    validation_split=.2,
    # pass the callbacks, otherwise early stopping has no effect
    callbacks=callbacks,
)

Epoch 1/50
...
Epoch 50/50
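
The `EarlyStopping` callback defined above uses `patience=5` and `min_delta=0` on the validation RMSE. The patience rule can be sketched in plain Python as follows; this is a simplified illustration, not Keras's actual code, and the function name and the validation losses are made up.

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best, best_epoch, wait = float('inf'), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # patience never ran out: training runs to the last epoch
    return len(val_losses) - 1, best_epoch

# made-up validation losses: best value at epoch 1, then no improvement
val_losses = [1.0, 0.9, 0.95, 0.93, 0.92, 0.91, 0.94, 0.90]
stopped, best_epoch = early_stop_epoch(val_losses)
print(stopped, best_epoch)
```

With `restore_best_weights=True`, the weights from the best epoch are the ones kept once training stops.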


In [48]:
## make prediction
pred_rating = model.predict(test_pair).flatten()
print(pred_rating)
print('rmse: NCF: %.3f' %np.sqrt(np.mean((pred_rating - test_rating)**2)))

[2.9420114 3.3594353 2.0358317 ... 3.8942363 3.1720738 3.9475594]
rmse: NCF: 0.882
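
The RMSE above is computed inline. A small reusable helper, together with an optional clipping step, could look like the sketch below. It assumes the ratings lie in the range 0.5 to 5.0 (the half-star scale of this dataset); the arrays are toy values, not the notebook's predictions.

```python
import numpy as np

def rmse(pred, true):
    """Root mean squared error between two 1-D arrays."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

# toy values standing in for `test_rating` and `pred_rating`
true = np.array([4.0, 5.0, 0.5, 3.0])
pred = np.array([4.2, 5.6, 0.3, 3.0])

# assumption: valid ratings lie in [0.5, 5.0]
clipped = np.clip(pred, 0.5, 5.0)

raw_rmse = rmse(pred, true)
clip_rmse = rmse(clipped, true)
print('rmse raw: %.3f, clipped: %.3f' % (raw_rmse, clip_rmse))
```

Clipping to the valid range never increases the error when the true ratings themselves lie inside that range.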
