# AI-Frameworks

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
</center>

# LAB 5 Introduction to Recommendation System with Collaborative Filtering  -  Part 3 : Latent Vector-Based Methods with `Keras` Python Library.

The objectives of this notebook are the following : 

* Build Keras models to learn an embedding space for user and item data.
* Visualize these spaces.
* Use the trained models to make recommendations.

# Library

In [None]:
import numpy as np
import pandas as pd 
import tensorflow.keras.layers as kl
import tensorflow.keras.models as km
import sklearn.metrics as sm
import sklearn.decomposition as sdec

import matplotlib.pyplot as plt
import seaborn as sb

# Data

We load the updated ratings data generated in the first notebook, `1-Python-Neighborhood-MovieLens.ipynb`.

In [None]:
DATA_DIR = "movielens_small/"
rating = pd.read_csv(DATA_DIR + "ratings_updated.csv")
nb_entries = rating.shape[0]
print("Number of entries : %d " %nb_entries)
rating.head(5)

We first create two new columns. The column **user_id** (resp. **item_id**) remaps the userId (resp. movieId) column to contiguous integers starting at 0: user ids then run from 0 to 609, and item ids from 0 to the number of distinct movies minus one.

In [None]:
userIdToNormUserId = {k:v for v,k in enumerate(rating.userId.unique())}
rating["user_id"] = [userIdToNormUserId[x] for x in rating.userId.values]
itemIdToNormItemId = {k:v for v,k in enumerate(rating.movieId.unique())}
rating["item_id"] = [itemIdToNormItemId[x] for x in rating.movieId.values]
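
The remapping above can be illustrated on a toy list. This is only a sketch: the raw ids are made up, and `dict.fromkeys` stands in for `unique()` to keep the ids in order of first appearance.

```python
# Toy illustration of the id remapping; the raw ids below are made up.
raw_ids = [10, 42, 10, 7, 42, 99]

# Assign 0, 1, 2, ... to each distinct id in order of first appearance
# (dict.fromkeys keeps insertion order and drops duplicates).
id_to_norm = {raw: norm for norm, raw in enumerate(dict.fromkeys(raw_ids))}
norm_ids = [id_to_norm[r] for r in raw_ids]

print(id_to_norm)  # {10: 0, 42: 1, 7: 2, 99: 3}
print(norm_ids)    # [0, 1, 0, 2, 1, 3]
```

The largest normalized id is always the number of distinct ids minus one, which is why the embedding layers later use `max_user_id + 1` and `max_item_id + 1` as their input dimension.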

In [None]:
rating.head()

In [None]:
movies = pd.read_csv(DATA_DIR + "movies.csv")
id_movie_to_title = dict(movies[["movieId","title"]].values)
id_item_to_title = {itemIdToNormItemId[k]:v for k,v in id_movie_to_title.items() if k in itemIdToNormItemId}
print("Number of movies in the dictionary : %d" %(len(id_item_to_title)))
movies.head()

We now recreate the same train/test split as in the first notebook.

In [None]:
train = rating[rating.test_train=="train"]
user_id_train = train['user_id']
item_id_train = train['item_id']
rating_train = train['rating']
print(train.shape)

test = rating[rating.test_train=="test"]
user_id_test = test['user_id']
item_id_test = test['item_id']
rating_test = test['rating']
print(test.shape)

# Neural Recommender System

We first build a very simple recommender according to this architecture:

![alt text](images/simple_architecture.png)

Let's decompose the construction of this network.


We first create the input layers, which take the id of the user and the id of the item as input.

In [None]:
# For each sample we input the integer identifiers of a single user and a single item
user_id_input = kl.Input(shape=[1], name='user')
item_id_input = kl.Input(shape=[1], name='item')

These ids are then mapped to vectors in their embedding spaces. This can be easily done with the `Embedding` layer object of Keras.

In [None]:
max_user_id= rating.user_id.max()
max_item_id= rating.item_id.max()
embedding_size = 30
user_embedding = kl.Embedding(output_dim=embedding_size, input_dim=max_user_id + 1,
                           input_length=1, name='user_embedding')(user_id_input)
item_embedding = kl.Embedding(output_dim=embedding_size, input_dim=max_item_id + 1,
                           input_length=1, name='item_embedding')(item_id_input)
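
Conceptually, an embedding layer is a lookup table with one trainable row per id. The following numpy sketch mimics that lookup; the sizes and the random matrix are hypothetical stand-ins, not the weights of the Keras layer.

```python
import numpy as np

# Hypothetical sizes, chosen small for readability.
n_users, embedding_size = 5, 3

rng = np.random.default_rng(0)
# One row per user id: this plays the role of the Embedding weights.
embedding_matrix = rng.normal(scale=0.05, size=(n_users, embedding_size))

# A batch of integer user ids with shape (batch_size, 1), as fed to the Input layer.
user_ids = np.array([[0], [3], [3]])

# Looking up an embedding is just row indexing.
looked_up = embedding_matrix[user_ids]
print(looked_up.shape)  # (3, 1, 3) = (batch_size, input_length, embedding_size)
```

Two samples with the same id get exactly the same vector, and during training only the rows that appear in a batch receive a gradient.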

We then compute the dot product of the two embedding vectors that represent the input user and the input item.

In [None]:
# reshape from shape: (batch_size, input_length, embedding_size)
# to shape: (batch_size, input_length * embedding_size) which is
# equal to shape: (batch_size, embedding_size)
user_vecs = kl.Flatten()(user_embedding)
item_vecs = kl.Flatten()(item_embedding)

y = kl.Dot(axes=1)([user_vecs, item_vecs])
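
As a rough numpy equivalent of the dot-product step, the predicted rating of each (user, item) pair is the sum of the element-wise products of their two vectors. The values below are made up for illustration.

```python
import numpy as np

# Made-up flattened embeddings for a batch of 2 (user, item) pairs.
user_vecs = np.array([[1.0, 0.0, 2.0],
                      [0.5, 0.5, 0.5]])
item_vecs = np.array([[3.0, 1.0, 1.0],
                      [2.0, 4.0, 6.0]])

# Row-wise dot product: multiply element-wise, then sum over the embedding axis.
# keepdims=True keeps a (batch_size, 1) output, one score per pair.
scores = np.sum(user_vecs * item_vecs, axis=1, keepdims=True)
print(scores.ravel())  # [5. 6.]
```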

We now have the complete model.

In [None]:
model = km.Model(inputs=[user_id_input, item_id_input], outputs=y)
model.compile(optimizer='adam', loss='mse')
model.summary()

Predictions can now be computed by passing the lists of user ids and item ids for which we want a rating.

In [None]:
initial_train_preds = model.predict([user_id_train, item_id_train])
initial_train_preds.shape

Of course, as the model has not been trained yet, its error is quite large.

In [None]:
print("Random init MSE: %0.3f" % sm.mean_squared_error(initial_train_preds, rating_train))
print("Random init MAE: %0.3f" % sm.mean_absolute_error(initial_train_preds, rating_train))
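
For reference, the two metrics can be written out by hand. This is a minimal numpy sketch on made-up ratings, not a replacement for the `sklearn.metrics` functions used above.

```python
import numpy as np

# Made-up true ratings and predictions.
y_true = np.array([4.0, 3.0, 5.0, 1.0])
y_pred = np.array([3.5, 3.0, 4.0, 2.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # mean squared error
mae = np.mean(np.abs(errors))   # mean absolute error
print("MSE: %0.3f, MAE: %0.3f" % (mse, mae))
```

When doing this by hand with Keras predictions, which have shape `(n, 1)`, flatten them first (for example with `.ravel()`); otherwise subtracting a vector of shape `(n,)` broadcasts to an `(n, n)` matrix.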

Let's fit the model

In [None]:
history = model.fit([user_id_train, item_id_train], rating_train,
                    batch_size=64, epochs=10, validation_split=0.1,
                    shuffle=True)
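
The `validation_split=0.1` argument holds out part of the arrays passed to `fit`. According to the Keras documentation, the held-out fraction is taken from the end of the arrays, before any shuffling. A rough sketch of that slicing, on ten made-up samples:

```python
import numpy as np

# Ten made-up samples standing in for the training arrays.
x = np.arange(10)
validation_split = 0.1

# Index at which the arrays are cut: everything after it is held out.
split_at = int(len(x) * (1.0 - validation_split))
x_fit, x_val = x[:split_at], x[split_at:]
print(x_fit)  # [0 1 2 3 4 5 6 7 8]
print(x_val)  # [9]
```

If the rows happen to be sorted (for instance by user), the held-out rows are not a random sample; shuffling the DataFrame beforehand is one way around this.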

**Questions**:

- Why is the train loss higher than the validation loss in the first few epochs?
- Why is Keras not computing the train loss on the full training set at the end of each epoch as it does on the validation set?


Now that the model is trained, the model MSE and MAE look nicer:

In [None]:
test_preds = model.predict([user_id_test, item_id_test])
print("Final test MSE: %0.3f" % sm.mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % sm.mean_absolute_error(test_preds, rating_test))

In [None]:
train_preds = model.predict([user_id_train, item_id_train])
print("Final train MSE: %0.3f" % sm.mean_squared_error(train_preds, rating_train))
print("Final train MAE: %0.3f" % sm.mean_absolute_error(train_preds, rating_train))

**Q** What do you think about those results? 

# A Deep recommender model

Let's now build a deeper architecture in order to improve those results.

![alt text](images/deep_architecture.png)


**Exercise**: Implement a model similar to the previous one with:

* A concatenate layer (look at the `kl.Concatenate` layer).
* A dropout layer (rate=0.5) after the concatenate layer.
* Only one hidden layer with 64 neurons and a relu activation function.
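
To get a feel for the shapes involved, here is a numpy walk-through of one forward pass through such an architecture. It is only a sketch with random weights and made-up inputs, and it assumes a single linear output unit after the hidden layer; the exercise itself should be solved with Keras layers.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, embedding_size, hidden_units = 4, 30, 64

# Made-up flattened user and item embeddings.
user_vecs = rng.normal(size=(batch_size, embedding_size))
item_vecs = rng.normal(size=(batch_size, embedding_size))

# Concatenation puts the two vectors side by side: (batch_size, 2 * embedding_size).
input_vecs = np.concatenate([user_vecs, item_vecs], axis=1)

# Dropout (training time only): zero each entry with probability `rate`
# and rescale the survivors so the expected value is unchanged.
rate = 0.5
keep_mask = rng.random(input_vecs.shape) >= rate
dropped = input_vecs * keep_mask / (1.0 - rate)

# Hidden layer with relu, then a single linear output unit (random weights here).
W1, b1 = rng.normal(size=(2 * embedding_size, hidden_units)), np.zeros(hidden_units)
W2, b2 = rng.normal(size=(hidden_units, 1)), np.zeros(1)
hidden = np.maximum(dropped @ W1 + b1, 0.0)
y = hidden @ W2 + b2
print(input_vecs.shape, hidden.shape, y.shape)  # (4, 60) (4, 64) (4, 1)
```

Unlike the dot-product model, the user and item vectors no longer need to live in the same space: the dense layers learn how to combine them.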


In [None]:
# %load solutions/exercise_3_1.py

In [None]:
history = model.fit([user_id_train, item_id_train], rating_train,
                    batch_size=64, epochs=5, validation_split=0.1,
                    shuffle=True)

In [None]:
train_preds = model.predict([user_id_train, item_id_train])
print("Final train MSE: %0.3f" % sm.mean_squared_error(train_preds, rating_train))
print("Final train MAE: %0.3f" % sm.mean_absolute_error(train_preds, rating_train))

In [None]:
test_preds = model.predict([user_id_test, item_id_test])
print("Final test MSE: %0.3f" % sm.mean_squared_error(test_preds, rating_test))
print("Final test MAE: %0.3f" % sm.mean_absolute_error(test_preds, rating_test))

**Question** What can you say about those results?

# Exploiting the model

In this section we will see how to explore both the model and the embedding space.

## Finding similar items and users

We want to find the K closest elements to a given item or user. The model we built cannot be used directly for this, as it takes a user and an item as input, not two users or two items.

However, we can easily build such a method from the learned embedding space. Let's first get the embedding matrices of the users and the movies.

In [None]:
weights = model.get_weights()
user_embeddings = weights[0]
print("User embedding matrix dimension : %s" %str(user_embeddings.shape))
item_embeddings = weights[1]
print("item embedding matrix dimension : %s" %str(item_embeddings.shape))

For the id of an item we compute the distance (*cosine*, *euclidean*, etc.) of its embedding vector to all embedding vectors of the items. 

(The procedure would be the same for users, but the results are easier to interpret with movies.)

In [None]:
idx = 1027
X = np.expand_dims(item_embeddings[idx],axis=0)
distX = sm.pairwise_distances(X, item_embeddings, metric="cosine")[0]
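
The cosine distance used above is one minus the cosine similarity, so it only depends on the direction of the vectors, not on their length. A minimal numpy sketch on a made-up embedding matrix:

```python
import numpy as np

# Made-up embedding matrix: 4 items in a 2-dimensional space.
item_embeddings = np.array([[1.0, 0.0],
                            [1.0, 1.0],
                            [0.0, 1.0],
                            [-1.0, 0.0]])
idx = 0

# Cosine distance = 1 - cosine similarity between the query and every row.
query = item_embeddings[idx]
norms = np.linalg.norm(item_embeddings, axis=1) * np.linalg.norm(query)
cosine_distance = 1.0 - item_embeddings @ query / norms

# Indices sorted from closest to farthest; the query itself comes first.
closest = np.argsort(cosine_distance)
print(closest)  # [0 1 2 3]
```

Since the query item is at distance 0 from itself, it always appears at the top of its own neighbor list.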

The top 10 items for the item `idx` are then the ten items whose embeddings are the closest to its own.

In [None]:
print("Top 10 items similar to movies %s" %str(id_item_to_title[idx]))
mostSimilarItem = pd.DataFrame([[id_item_to_title[x], distX[x],x] for x in distX.argsort()[:10]])
mostSimilarItem

**Question** What do you think of these results? Unfortunately, the dataset is too small to yield really meaningful neighbors.

## Visualizing Items

In [None]:
pcaItems = sdec.PCA(n_components=2)
items_pca_embeddings = pcaItems.fit_transform(item_embeddings)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)
# Plot the two PCA components against each other (not a raw embedding column)
ax.plot(items_pca_embeddings[:,0], items_pca_embeddings[:,1], linestyle="None", marker=".")
ax.plot(items_pca_embeddings[mostSimilarItem[2].values,0], items_pca_embeddings[mostSimilarItem[2].values,1], linestyle="None", marker=".", markersize=10)
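
The PCA projection used for this plot can be sketched by hand with a singular value decomposition. The embedding matrix below is random and only there to show the mechanics; the result should match `sdec.PCA(n_components=2)` up to the sign of each axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up embedding matrix: 100 items, 30 dimensions.
item_embeddings = rng.normal(size=(100, 30))

# PCA by hand: center the data, then project on the top-2 right singular vectors.
centered = item_embeddings - item_embeddings.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T

# Share of the total variance kept by the 2 components.
explained = (singular_values[:2] ** 2).sum() / (singular_values ** 2).sum()
print(projected.shape, round(float(explained), 3))
```

The explained-variance share is worth checking: if two components keep only a small part of the variance, the 2D picture hides most of the structure of the embedding space.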

## A recommendation function for a given user

Once the model is trained, the system can be used to recommend to a user a few items that they have not seen yet.

First, let's select a user and display the movies they like or dislike.

In [None]:
user_id = 0
rating_user = rating[rating["user_id"]==user_id]
rating_user_sorted = rating_user.sort_values("rating")
print("10 best rated movies by user %d" %user_id)
display(rating_user_sorted[-10:][["movie","rating"]])
print("10 worst rated movies by user %d" %user_id)
display(rating_user_sorted[:10][["movie","rating"]])

**Exercise** Use the model to compute the estimated ratings that the user would give to the movies they haven't seen. Display the 10 movies you would recommend to them.

In [None]:
# %load solutions/exercise_3_2.py
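
The selection logic behind this exercise (drop the items already rated, then keep the best predicted ones) can be sketched in plain Python. The predicted ratings and the set of seen items below are made up; in the notebook they would come from `model.predict` and from `rating_user`.

```python
# Made-up predicted ratings for 6 items and the set of items already rated.
predicted = [4.2, 3.1, 4.8, 2.0, 4.5, 3.9]
seen_items = {2, 3}
k = 3

# Keep only unseen items, then sort them by decreasing predicted rating.
candidates = [i for i in range(len(predicted)) if i not in seen_items]
top_k = sorted(candidates, key=lambda i: predicted[i], reverse=True)[:k]
print(top_k)  # [4, 0, 5]
```

Item 2 has the highest predicted rating but is excluded because the user has already rated it.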

# Complete dataset

The following code trains the same model on the complete dataset.
It will take a long time if you don't have a GPU.


In [None]:
DATA_DIR = "ml-25/"
rating = pd.read_csv(DATA_DIR + "ratings_updated.csv")
nb_entries = rating.shape[0]
print("Number of entries : %d " %nb_entries)
rating.head(5)

In [None]:
# Rebuild the id mappings for this dataset first: the title dictionary below relies on itemIdToNormItemId
userIdToNormUserId = {k:v for v,k in enumerate(rating.userId.unique())}
rating["user_id"] = [userIdToNormUserId[x] for x in rating.userId.values]
itemIdToNormItemId = {k:v for v,k in enumerate(rating.movieId.unique())}
rating["item_id"] = [itemIdToNormItemId[x] for x in rating.movieId.values]

In [None]:
movies = pd.read_csv(DATA_DIR + "movies.csv")
id_movie_to_title = dict(movies[["movieId","title"]].values)
id_item_to_title = {itemIdToNormItemId[k]:v for k,v in id_movie_to_title.items() if k in itemIdToNormItemId}
print("Number of movies in the dictionary : %d" %(len(id_item_to_title)))
movies.head()

In [None]:
train = rating[rating.test_train=="train"]
user_id_train = train['user_id']
item_id_train = train['item_id']
rating_train = train['rating']
print(train.shape)

test = rating[rating.test_train=="test"]
user_id_test = test['user_id']
item_id_test = test['item_id']
rating_test = test['rating']
print(test.shape)

In [None]:
user_id_input = kl.Input(shape=[1], name='user')
item_id_input = kl.Input(shape=[1], name='item')

embedding_size = 30
max_user_id= rating.user_id.max()
max_item_id= rating.item_id.max()
user_embedding = kl.Embedding(output_dim=embedding_size, input_dim=max_user_id + 1,
                           input_length=1, name='user_embedding')(user_id_input)
item_embedding = kl.Embedding(output_dim=embedding_size, input_dim=max_item_id + 1,
                           input_length=1, name='item_embedding')(item_id_input)

# reshape from shape: (batch_size, input_length, embedding_size)
# to shape: (batch_size, input_length * embedding_size) which is
# equal to shape: (batch_size, embedding_size)
user_vecs = kl.Flatten()(user_embedding)
item_vecs = kl.Flatten()(item_embedding)

input_vecs = kl.Concatenate()([user_vecs, item_vecs])
input_vecs = kl.Dropout(0.5)(input_vecs)

x = kl.Dense(64, activation='relu')(input_vecs)
y = kl.Dense(1)(x)

model = km.Model(inputs=[user_id_input, item_id_input], outputs=y)
model.compile(optimizer='adam', loss='mae')
model.summary()

In [None]:
history = model.fit([user_id_train, item_id_train], rating_train,
                    batch_size=2048, epochs=10, validation_split=0.1,
                    shuffle=True)

In [None]:
weights = model.get_weights()
user_embeddings = weights[0]
print("User embedding matrix dimension : %s" %str(user_embeddings.shape))
item_embeddings = weights[1]
print("item embedding matrix dimension : %s" %str(item_embeddings.shape))

In [None]:
idx = 283
X = np.expand_dims(item_embeddings[idx],axis=0)
distX = sm.pairwise_distances(X, item_embeddings, metric="cosine")[0]

In [None]:
print("Top 10 items similar to movies %s" %str(id_item_to_title[idx]))
mostSimilarItem = pd.DataFrame([[id_item_to_title[x], distX[x],x] for x in distX.argsort()[:10]])
mostSimilarItem

In [None]:
pcaItems = sdec.PCA(n_components=2)
items_pca_embeddings = pcaItems.fit_transform(item_embeddings)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)
# Plot the two PCA components against each other (not a raw embedding column)
ax.plot(items_pca_embeddings[:,0], items_pca_embeddings[:,1], linestyle="None", marker=".")
ax.plot(items_pca_embeddings[mostSimilarItem[2].values,0], items_pca_embeddings[mostSimilarItem[2].values,1], linestyle="None", marker=".", markersize=10)

In [None]:
user_id = 1
rating_user = rating[rating["user_id"]==user_id]
rating_user_sorted = rating_user.sort_values("rating")
print("10 best rated movies by user %d" %user_id)
display(rating_user_sorted[-10:][["movie","rating"]])
print("10 worst rated movies by user %d" %user_id)
display(rating_user_sorted[:10][["movie","rating"]])

In [None]:
# Ids of the movies the user has already rated
seen_movie = set(rating_user["item_id"])
# Run prediction for all movies (item ids go from 0 to max_item_id included)
all_item_ids = np.arange(max_item_id + 1)
prediction = model.predict([np.full_like(all_item_ids, user_id), all_item_ids])
# Concatenate results with id of the movie
prediction_with_id = zip(prediction, all_item_ids)
# Filter on unseen movies, get the title and sort the results by predicted rating
prediction_of_unseen_movie = sorted([[p[0], id_item_to_title[x]] for p, x in prediction_with_id if x not in seen_movie], key=lambda x: x[0], reverse=True)
# Display it.
pd.DataFrame(prediction_of_unseen_movie)