# AI-Frameworks

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
</center>

# LAB 5: Introduction to Recommendation Systems with Collaborative Filtering - Part 1: Neighborhood-Based Methods with the `Surprise` Python Library

The objectives of this notebook are the following:

* Discover and explore the `MovieLens` dataset.
* Discover the `Surprise` Python library.
* Use neighborhood-based methods (User-User and Item-Item filters) to learn similarities between users and between items, and use them to make recommendations.

# Library

In [None]:
import collections
import pickle
import random
import time

import numpy as np
import pandas as pd
import scipy.sparse as scsparse
import scipy.stats as scstats
import sklearn.metrics.pairwise as smp
import surprise
import surprise.model_selection as sms
import surprise.prediction_algorithms as spa

#Plotly
import plotly.graph_objects as go
import plotly.offline as pof

#Matplotlib
import matplotlib.pyplot as plt

# Seaborn
import seaborn as sb
sb.set(color_codes=True)

# Data : Movielens dataset

The `movielens` dataset is a famous and widely used dataset provided by the *GroupLens* research group (https://grouplens.org/).

The dataset is composed of movie ratings given by a set of users, collected over various periods of time.

Datasets of different sizes are available on their website: https://grouplens.org/datasets/movielens/.

Throughout the different notebooks of this lab, we will use the small dataset (100k ratings) for tests and exploration, and the stable dataset (25 million ratings) for testing performance.


* Small Dataset :  *movielens_small folder*
    * 100,000 ratings. 
    * 9742 movies. 
    * 610 users.
    
* Stable Dataset : 
    * 25 million ratings.
    * 59,047 movies.
    * 162,541 users.
    
These datasets also contain the genres of the movies and other metadata (tags on the movies, information about the users: age, sex, ...) that can be used to improve the recommendation system. We won't use this information, as the methods covered in the course do not handle metadata.
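To get a feel for how sparse the small dataset is, the figures quoted in the list above can be turned into a density estimate. This is only a back-of-the-envelope sketch: the counts are the rounded ones from the list, and the variable names are introduced here for illustration.

```python
# Rounded figures for the small dataset, as quoted in the list above.
n_ratings = 100_000
n_movies = 9_742
n_users = 610

# A full user x movie matrix would have one cell per (user, movie) pair.
n_cells = n_users * n_movies

# Fraction of cells that actually hold a rating.
density = n_ratings / n_cells
print("Possible (user, movie) pairs: %d" % n_cells)
print("Density of the rating matrix: %.2f%%" % (100 * density))
```

Fewer than 2% of the cells are filled, which is why neighborhood methods can only compare users (or movies) on the few ratings they have in common.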

## Presentation

### Ratings
The `ratings.csv` file is composed of four columns:

* userId : Int. Unique id of the user.
* movieId : Int. Unique id of the movie.
* rating : Float (0.5 to 5, in half-star increments). Rating given by a user to a movie.
* timestamp : Time at which the rating was given.

We won't use the *timestamp* column during this lab.

In [None]:
DATA_DIR = "movielens_small/"
rating = pd.read_csv(DATA_DIR + "ratings.csv")
nb_entries = rating.shape[0]
print("Number of entries : %d " %nb_entries)
rating.head(5)

In [None]:
nb_user = len(rating.userId.unique())
print("Number of unique User : %d" %nb_user)

In [None]:
nb_movie = len(rating.movieId.unique())
print("Number of unique Movies : %d" %nb_movie)

### Movies

The `movies.csv` file is composed of three columns:

* movieId : Int. Unique id of the movie.
* title : string. The title of the movie.
* genres : string. The genre(s) of the movie.

We won't use the *genres* column during this lab. We won't use the title in our algorithms either, but we will use it to display information and make our predictions easier to interpret.

In [None]:
movies = pd.read_csv(DATA_DIR + "movies.csv")
print("Number of movies in the dictionary : %d" %(len(movies.movieId.unique())))
movies.head()

We create an `id_to_title` dictionary to convert each movie id to its title.

In [None]:
id_to_title = dict(movies[["movieId","title"]].values)

We add a *movie* column to the rating dataset in order to display the title directly.

In [None]:
rating["movie"] = [id_to_title[x] for x in rating["movieId"].values]
rating.head()

# Exploration

Let's do some quick exploration to build intuition about these data.

## User
We look at the distribution of the number of ratings per user. We create a pandas groupby object where rows are grouped by user.

In [None]:
rating_gb_user = rating.groupby("userId")

### Number of ratings per user

We will display the distribution of the number of ratings per user.

Plots are displayed using:
* **Matplotlib**: The standard Python plotting library.
* **Seaborn**: A library based on matplotlib that makes it easy to produce more attractive and readable plots.
* **Plotly**: A library available in Python, JavaScript and R that allows building interactive graphs.

#### Plotly.

In [None]:
x = rating_gb_user.count()["rating"].values
data = [go.Histogram(x=x,
                    xbins=dict( # bins used for histogram
                    start=x.min(),
                    end=x.max(),
                    size=5,
                ))]
fig = go.Figure(data=data)
fig.update_layout(
    title_text='Distribution of the number of ratings per user', # title of plot
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

#### Matplotlib. 

In [None]:
x = rating_gb_user.count()["rating"].values
fig = plt.figure(figsize=(30,5))
ax = fig.add_subplot(1,1,1)
plt.hist(x,bins = np.arange(x.min(),x.max()+5,5))
plt.show()

#### Seaborn

In [None]:
x = rating_gb_user.count()["rating"].values
fig = plt.figure(figsize=(30,5))
ax = fig.add_subplot(1,1,1)
sb.distplot(x, ax=ax, kde=False, bins = np.arange(x.min(),x.max()+5,5))

**Question** What can you say about the distribution? What is the minimum number of ratings a user has given?

**Exercise**: Find the most *lenient* and the most *harsh* users and display their ratings.

In [None]:
# %load solutions/exercise_1_1.py

Most "Hard" user

## Movie
We look at the distribution of the number of ratings received per movie. We create a pandas groupby object where rows are grouped by movie.

In [None]:
rating_gb_movie = rating.groupby("movie")

### Number of ratings per movie
We will display the distribution of the number of ratings per movie.

#### Plotly

In [None]:
x = rating_gb_movie.count()["userId"].values
data = [go.Histogram(x=x,
                    xbins=dict( # bins used for histogram
                    start=x.min(),
                    end=x.max(),
                    size=2,
                ))]
fig = go.Figure(data=data)
fig.update_layout(
    title_text='Number of ratings per movie', # title of plot
    bargap=0.2, # gap between bars of adjacent location coordinates
    bargroupgap=0.1 # gap between bars of the same location coordinates
)
fig.show()

#### Matplotlib

In [None]:
# Count ratings per movie (this section is about movies, not users)
x = rating_gb_movie.count()["userId"].values
fig = plt.figure(figsize=(30,5))
ax = fig.add_subplot(1,1,1)
plt.hist(x,bins = np.arange(x.min(),x.max()+5,5))
plt.show()

#### Seaborn

In [None]:
# Count ratings per movie (this section is about movies, not users)
x = rating_gb_movie.count()["userId"].values
fig = plt.figure(figsize=(30,5))
ax = fig.add_subplot(1,1,1)
sb.distplot(x, ax=ax, kde=False, bins = np.arange(x.min(),x.max()+5,5))

**Question** What can you say about the distribution for the movies? What is the minimum number of ratings a movie can have?

**Exercise** Display the top 10 most rated movies, and the top 10 best and worst rated movies (among movies with at least 10 ratings).

In [None]:
# %load solutions/exercise_1_2.py

# Surprise

Surprise (http://surpriselib.com/) is a Python library that contains various algorithms dedicated to recommendation. We will use it to apply neighborhood-based algorithms.

Surprise contains various functions that can directly load the movielens dataset and create a train/test partition. However, we won't use those methods.
The small movielens dataset changes over time, and we want the data to stay the same so that the methods of the different libraries can be compared across the notebooks of this lab.

First, we create the train and test sets and save an updated version of the *ratings.csv* file with a new *test_train* column.

In [None]:
rating["test_train"] = ["test" if random.random()<=0.1 else "train" for _ in range(rating.shape[0])]
rating["test_train"].value_counts()

In [None]:
rating.to_csv("movielens_small/ratings_updated.csv",index=False)
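The split above draws from Python's global random generator, so the labels change on every run unless the saved file is reused. If a repeatable split is preferred, one possible approach (an assumption, not what this notebook does) is to draw the labels from a seeded generator. `make_split_labels` and the seed value are hypothetical names chosen for this sketch.

```python
import random


def make_split_labels(n_rows, test_fraction=0.1, seed=42):
    """Return a list of "test"/"train" labels drawn from a seeded generator."""
    rng = random.Random(seed)  # private generator, global random state untouched
    return ["test" if rng.random() <= test_fraction else "train"
            for _ in range(n_rows)]


labels_a = make_split_labels(1000)
labels_b = make_split_labels(1000)
print(labels_a == labels_b)  # same seed, same labels
print(labels_a.count("test"), "test rows out of", len(labels_a))
```

In the notebook, this could be used as `rating["test_train"] = make_split_labels(rating.shape[0])`.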

We then use the `load_from_df` method, which requires an N x 3 data frame where N is the number of entries and the 3 columns are the users, the items and the ratings, in that order. This corresponds to the rating dataset.

In [None]:
reader = surprise.Reader(rating_scale=(0, 5))
rating_train = rating[rating.test_train=="train"]
data = surprise.Dataset.load_from_df(rating_train[['userId', 'movieId', 'rating']], reader)

We then use `build_full_trainset` to convert the Surprise Dataset object into a Surprise Trainset object on which an algorithm can be fitted.

In [None]:
train = data.build_full_trainset()
train

In [None]:
rating_test = rating[rating.test_train=="test"]
test = list([tuple(x) for x in rating_test[['userId', 'movieId', 'rating']].values])
test[:10]

# User-User Filter

**Main assumption** : customers with a similar profile will have similar tastes.


For a customer u, the aim is to find a subset of customers with a close profile. Predicting the missing rating of a product i by customer u then relies on a convex linear combination of the ratings given by these similar customers.


$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{u'\in S_u} s(u,u')\cdot (r_{u',i}-\bar{r}_{u'})}{\sum_{u'\in S_u} |s(u,u')| }$$
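The formula above can be checked by hand on a tiny example. The sketch below uses made-up numbers for one target user and three neighbours; it only illustrates the arithmetic and is not how Surprise implements it.

```python
import numpy as np

# Toy data (made up for illustration): target user u and three neighbours.
mean_u = 3.5                                   # mean rating of user u
neighbour_means = np.array([4.0, 2.5, 3.0])    # mean rating of each neighbour
neighbour_ratings = np.array([5.0, 2.0, 4.0])  # neighbours' ratings of item i
similarities = np.array([0.9, 0.4, -0.2])      # s(u, u') for each neighbour

# Each neighbour contributes its deviation from its own mean,
# weighted by its similarity to u.
deviations = neighbour_ratings - neighbour_means
weighted_sum = np.sum(similarities * deviations)

# Normalise by the sum of absolute similarities, as in the formula.
prediction = mean_u + weighted_sum / np.sum(np.abs(similarities))
print("Predicted rating: %.3f" % prediction)
```

Working with deviations from each user's mean compensates for the fact that some users rate systematically higher or lower than others.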

## Fit the User-User similarity Matrix

Have a look at the Surprise "k-NN inspired" algorithms documentation: https://surprise.readthedocs.io/en/stable/knn_inspired.html to understand the different algorithms available.

**Exercise**: Initialize a method that performs a **user-user** filter based on the formula above (i.e. that **takes means** into account) with:
* **pearson** similarity
* **k** (number of neighbors) set to 40.

In [None]:
UUFilter = 

In [None]:
# %load solutions/exercise_1_3.py
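The exercise above relies on the Pearson similarity. As a reminder of what it measures, here is a minimal sketch on toy ratings, restricted to the movies both users have rated. The data are made up, and Surprise's own implementation may differ in details (for instance a minimum number of common items).

```python
import numpy as np

# Toy ratings of two users over five movies; np.nan marks "not rated".
user_a = np.array([5.0, 3.0, np.nan, 4.0, 1.0])
user_b = np.array([4.0, 2.0, 5.0, np.nan, 1.0])

# Keep only the movies both users have rated.
common = ~np.isnan(user_a) & ~np.isnan(user_b)
a, b = user_a[common], user_b[common]

# Pearson correlation: dot product of the centred ratings divided by
# the product of their norms.
a_centred = a - a.mean()
b_centred = b - b.mean()
pearson = np.sum(a_centred * b_centred) / (
    np.linalg.norm(a_centred) * np.linalg.norm(b_centred))
print("Co-rated movies: %d, Pearson similarity: %.3f" % (common.sum(), pearson))
```

With very few co-rated movies, the correlation can be close to 1 or -1 by chance, which is one reason the number of neighbours `k` matters.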

You can now easily fit the algorithm and compute the results on the test set with the dedicated `surprise` methods.

In [None]:
# Train the algorithm on the trainset, and predict ratings for the testset
UUFilter.fit(train)
predictions = UUFilter.test(test)

# Then compute RMSE
surprise.accuracy.rmse(predictions)
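For reference, the RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it by hand on a few made-up pairs; it is only meant to show the definition.

```python
import math

# Toy (true rating, estimated rating) pairs, made up for illustration.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0)]

# Mean of the squared errors, then its square root.
squared_errors = [(true - est) ** 2 for true, est in pairs]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print("RMSE: %.4f" % rmse)
```

Each element of the `predictions` list returned by Surprise exposes the true rating as `r_ui` and the estimate as `est`, so the same computation can be applied to it.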

## Use the User-User similarity Matrix

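A big advantage of this method is that it is quite easy to explore the results.

### Nearest user

The Surprise library provides a `get_neighbors` method that returns the nearest neighbors of a given id. It works with Surprise inner ids, so raw ids have to be converted with the `to_inner_uid` and `to_raw_uid` methods of the trainset.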

In [None]:
userId = 1
# get_neighbors works with Surprise inner ids: convert from and back to raw ids
inner_uid = train.to_inner_uid(userId)
nearest_userId = train.to_raw_uid(UUFilter.get_neighbors(inner_uid, k=1)[0])
print("user %d is the closest user of user %d" %(nearest_userId,userId))
print("User %d" %userId)
display(rating[rating.userId==userId][["movie","rating"]].sort_values(by="rating"))
print("User %d" %nearest_userId)
rating[rating.userId==nearest_userId][["movie","rating"]].sort_values(by="rating")
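Surprise works internally with "inner" ids, contiguous integers starting at 0, which generally differ from the raw `userId` values of the CSV file. `get_neighbors` takes and returns inner ids, and the trainset offers `to_inner_uid` and `to_raw_uid` to convert between the two. The plain-Python sketch below only mimics the idea; `build_id_maps` is a hypothetical helper and not part of Surprise.

```python
def build_id_maps(raw_ids):
    """Map raw ids to contiguous inner ids (0, 1, 2, ...) in order of first appearance."""
    raw_to_inner = {}
    for raw in raw_ids:
        if raw not in raw_to_inner:
            raw_to_inner[raw] = len(raw_to_inner)
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


# Raw user ids as they could appear, row by row, in a ratings file.
raw_user_ids = [12, 12, 7, 305, 7, 12]
raw_to_inner, inner_to_raw = build_id_maps(raw_user_ids)

print(raw_to_inner)     # {12: 0, 7: 1, 305: 2}
print(inner_to_raw[2])  # 305
```

Passing a raw id where an inner id is expected does not necessarily raise an error; it can silently return the neighbours of a different user.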


## Recommendation

**Exercise** Build the list of the 10 most recommended movies for the user, with their estimated ratings. Use the `predict` method of the `UUFilter` object, which gives the estimated rating for a pair (userId, itemId).

In [None]:
UUFilter.predict?

In [None]:
# %load solutions/exercise_1_4.py

# Item-Item Filter

**Main assumption**: customers will prefer products that are highly similar to those they have already rated well. The rating of product $i$ is predicted by aggregating, with a convex linear combination, the ratings given by the user to the products $j \in N^k_u(i)$ that are closest to product $i$.

$$\hat{r}_{ui} = \mu_i + \frac{ \sum\limits_{j \in N^k_u(i)}
\text{sim}(i, j) \cdot (r_{uj} - \mu_j)} {\sum\limits_{j \in
N^k_u(i)} \text{sim}(i, j)}$$

We just have to change one parameter (`user_based=False`) in order to perform Item-Item filtering.

In [None]:
IIFilter = spa.knns.KNNWithMeans(k=40, 
                      min_k =1, 
                      sim_options = {'name': 'pearson',
                                     'user_based': False},
                     verbose=True)

In [None]:
# Train the algorithm on the trainset, and predict ratings for the testset
IIFilter.fit(train)
predictions = IIFilter.test(test)

# Then compute RMSE
surprise.accuracy.rmse(predictions)

**Questions** This method is much slower than the previous one. Why is that?
What can you say about the performance?

# Get an example prediction

## Use the Item-Item similarity Matrix

### Nearest items

The same `get_neighbors` method can be used; it now returns the closest items of a given item.

In [None]:
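movieId = 2
print("Selected Movie : %s" %(id_to_title[movieId]))
# get_neighbors works with Surprise inner ids: convert from and back to raw ids
inner_iid = train.to_inner_iid(movieId)
nearest_movieId = [train.to_raw_iid(i) for i in IIFilter.get_neighbors(inner_iid, k=10)]
print("10 most similar movies")
pd.DataFrame([id_to_title[k] for k in nearest_movieId])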

### Prediction 

The same code as above can be used to recommend 10 movies to the user.

In [None]:
userId=1
# Get list of movies already rated by the user
idmovies_rated_per_user = rating[rating.userId==userId]["movieId"].values
# Get predictions for all movies that are not already rated by the user
predicted = [[mid,IIFilter.predict(userId, mid)] for mid in movies.movieId.values if not(mid in idmovies_rated_per_user)]
# Sort the predicted list according to the estimated rating
recommendation = sorted(predicted, key=lambda x : x[1].est, reverse=True)
# Display the top 10 predictions in a dataframe
pd.DataFrame([(id_to_title[r[0]], r[1].est) for r in recommendation[:10]])
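Sorting every prediction is fine for the small dataset. When only the top few items are needed, an alternative is `heapq.nlargest`, which avoids sorting the whole list. The sketch below uses made-up (movieId, estimate) pairs rather than Surprise prediction objects.

```python
import heapq

# Hypothetical (movieId, estimated rating) pairs, made up for illustration.
estimates = [(10, 3.2), (11, 4.8), (12, 4.1), (13, 2.7), (14, 4.5)]

# nlargest keeps only the n best items instead of sorting the whole list.
top_3 = heapq.nlargest(3, estimates, key=lambda pair: pair[1])
print(top_3)  # [(11, 4.8), (14, 4.5), (12, 4.1)]
```

With the `predicted` list built above, the key would be `lambda x: x[1].est`.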

# Compare results for different parameters

In [None]:
# Compare parameters
results = []
for k in [10,25,50,100]:
    for user_based in [True, False]:
        for sim_options_name in ["pearson","cosine","msd"]:
            tstart = time.time()
            Filter = spa.knns.KNNWithMeans(k=k,
                                  sim_options = {'name': sim_options_name,
                                                 'user_based': user_based}, 
                                verbose=0)
            Filter.fit(train)
            predictions = Filter.test(test)
            rmse = surprise.accuracy.rmse(predictions)
            results.append([k, user_based, sim_options_name, rmse])
            tend = time.time()
            print("%s, %s, %s computed in %d seconds" %(k, user_based, sim_options_name, tend-tstart))
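Once the loop has finished, the `results` list can be turned into a data frame to find the best configuration. The sketch below shows one way to do it with pandas. The rows are placeholders in the same layout as `results`, and the RMSE values are invented for the demo, not measurements from this lab.

```python
import pandas as pd

# Placeholder rows in the same layout as `results`:
# [k, user_based, similarity name, rmse]. The RMSE values are invented
# for the demo and are NOT measurements from this lab.
demo_results = [
    [10, True, "pearson", 0.93],
    [10, False, "msd", 0.91],
    [50, True, "pearson", 0.92],
    [50, False, "msd", 0.90],
]

df = pd.DataFrame(demo_results, columns=["k", "user_based", "similarity", "rmse"])

# Row with the lowest RMSE.
best = df.loc[df["rmse"].idxmin()]
print(best)

# One row per k, one column per (user_based, similarity) pair.
table = df.pivot_table(index="k", columns=["user_based", "similarity"], values="rmse")
print(table)
```

Replacing `demo_results` with the real `results` list gives the table for the actual experiment.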

## Plotly

In [None]:
data=[]
color_dict = {True:"green",False:"red"}
marker_dict = {"pearson":"x","cosine":0,"msd":"triangle-up"}
for user_based in [True, False]:
    for sim_options_name in ["pearson","cosine","msd"]:
        result_ = [r for r in results if r[1]==user_based and r[2] == sim_options_name]
        x = [r[0] for r in result_]
        y = [r[3] for r in result_]
        user_string = "User_User" if user_based else "Item Item"
        
        data.append(go.Scatter(x=x,
                               y=y,
                               marker =dict(color=color_dict[user_based], symbol=marker_dict[sim_options_name]),
                               name = "%s Filter with %s similarity" %(user_string, sim_options_name)
                        ))
fig = go.Figure(data=data)
fig.update_layout(
    title_text='RMSE according to parameters'
)
fig.show()

## Seaborn

In [None]:
fig=plt.figure(figsize=(30,10))
ax = fig.add_subplot(1,1,1)
color_dict = {True:"green",False:"red"}
marker_dict = {"pearson":"x","cosine":0,"msd":"^"}
for user_based in [True, False]:
    for sim_options_name in ["pearson","cosine","msd"]:
        result_ = [r for r in results if r[1]==user_based and r[2] == sim_options_name]
        x = [r[0] for r in result_]
        y = [r[3] for r in result_]
        user_string = "User_User" if user_based else "Item Item"
        ax.plot(x,y, color=color_dict[user_based], marker = marker_dict[sim_options_name], label = "%s Filter with %s similarity" %(user_string, sim_options_name))
ax.set_title("RMSE according to parameters", fontsize=20)
plt.legend(fontsize=15)
plt.show()

**Question** Which algorithm performs best? With which parameters?

We will see that these results are not that bad compared to other methods.
However, this method would take too much time and require too much computing power to be applied to the complete dataset (25 million rows).
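A rough estimate makes the cost concrete. The sketch below assumes that the similarity matrix is stored as a dense array of 8-byte floats (an assumption about the implementation) and uses the user and movie counts quoted earlier for the complete dataset.

```python
# Dataset sizes quoted earlier in this notebook for the complete dataset.
n_users = 162_541
n_movies = 59_047
BYTES_PER_FLOAT64 = 8


def dense_similarity_gigabytes(n):
    """Size in GB (10**9 bytes) of a dense n x n float64 matrix."""
    return n * n * BYTES_PER_FLOAT64 / 1e9


print("User-user matrix: about %.0f GB" % dense_similarity_gigabytes(n_users))
print("Item-item matrix: about %.0f GB" % dense_similarity_gigabytes(n_movies))
```

Under this assumption, neither matrix fits comfortably in the memory of a typical laptop, before even counting the time needed to fill it.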

# (Optional) Run code on complete dataset

**Exercise**

* Download the complete and stable dataset from: http://files.grouplens.org/datasets/movielens/ml-25m.zip.
* Move the dataset to the current folder (RecomendationSystem).
* Load the data and create a train/test split.
* Fit a neighborhood-based algorithm with the best parameters according to the results found on the small dataset (**it may take a while**).

In [None]:
# %load solutions/exercise_1_5.py

In [None]:
tstart=time.time()
IIFilter = spa.knns.KNNWithMeans(k=100, 
                      min_k =1, 
                      sim_options = {'name': 'msd',
                                     'user_based': False},
                     verbose=1)
# Train the algorithm on the trainset, and predict ratings for the testset
IIFilter.fit(train)
predictions = IIFilter.test(test)

# Then compute RMSE
surprise.accuracy.rmse(predictions)
tend=time.time()
print(tend-tstart)