<left>
<h1> Application of AI </h1>
<h3> Practice 1 <br>
Recommender system </h3>
<h4> Group 2 <br>
Pr. L. BENEDETTI <br>
Author : Rizk AIT BRIK </h4>
</left>

# Introduction

In this practice, we build a user-based recommender system from a movies dataset. We start by manipulating the data with pandas, then build a simple recommender system based on weighted ratings, and finally turn it into a user-based recommender system.

# Part 0 : Loading the necessary packages

In this section, we load pandas and numpy for data manipulation, along with the warnings module. The code cell below does not list every package we need; further imports are added where they are first used.

In [142]:
import pandas as pd
import numpy as np
import warnings as wn

# Part 1 : Data manipulation with pandas

## Let's create a small DataFrame

Let's read the CSV file of our dataset and create a small DataFrame that keeps only the following features:
<ul>
    <li> Title of the movie. </li>
    <li> Release date of the movie. </li>
    <li> Budget of the movie. </li>
    <li> Revenue of the movie. </li>
    <li> Runtime of the movie. </li>
    <li> Genres of the movie. </li>
</ul>

In [143]:
df = pd.read_csv('movies_metadata.csv')
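# Create the small DataFrame as an explicit copy, so that later column
# assignments modify small_df itself rather than a view of df
small_df = df[["title", "release_date", "budget", "revenue", "runtime", "genres"]].copy()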

Now that we have the small DataFrame, let's peek at its contents.

In [144]:
small_df.head(10)

Unnamed: 0,title,release_date,budget,revenue,runtime,genres
0,Toy Story,1995-10-30,30000000,373554033.0,81.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,Jumanji,1995-12-15,65000000,262797249.0,104.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,Grumpier Old Men,1995-12-22,0,0.0,101.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,Waiting to Exhale,1995-12-22,16000000,81452156.0,127.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,Father of the Bride Part II,1995-02-10,0,76578911.0,106.0,"[{'id': 35, 'name': 'Comedy'}]"
5,Heat,1995-12-15,60000000,187436818.0,170.0,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam..."
6,Sabrina,1995-12-15,58000000,0.0,127.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '..."
7,Tom and Huck,1995-12-22,0,0.0,97.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
8,Sudden Death,1995-12-22,35000000,64350171.0,106.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam..."
9,GoldenEye,1995-11-16,58000000,352194034.0,130.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 28, '..."


## Let's check out the data types

In [145]:
print("The datatypes of the different features are : \n" + format(small_df.dtypes))

The datatypes of the different features are : 
title            object
release_date     object
budget           object
revenue         float64
runtime         float64
genres           object
dtype: object


The budget column has dtype object, which suggests that it holds strings rather than floating-point numbers. Arithmetic will therefore not behave as expected on it: summing the column, for example, would concatenate the strings instead of adding numbers.
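
To illustrate the point, here is a minimal sketch on a toy Series. The values are made up, and the Series is forced to dtype object, which is an assumption about how the real column is stored: summing strings concatenates them, while summing the converted floats adds them.

```python
import pandas as pd

# Toy budget column stored as strings (hypothetical values)
budget_as_text = pd.Series(["30000000", "65000000"], dtype=object)

# On an object column of strings, sum() concatenates the strings
text_sum = budget_as_text.sum()

# After conversion to float, sum() adds the numbers
numeric_sum = budget_as_text.astype(float).sum()

print(text_sum)
print(numeric_sum)
```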

## Let's convert budget to float

To use the budget column as intended, we need to convert its contents to floating-point numbers and handle missing values, if there are any. Let's start with a basic conversion from string to float.

In [146]:
# Convert budget column to float
small_df["budget"] = small_df["budget"].astype(float)

ValueError: could not convert string to float: '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'

The error we get is <span style = 'color: red'> ValueError : could not convert string to float: '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'</span>. This means that some entries are strings that cannot be parsed as numbers, here what looks like an image file path. A string can only be converted to a float if it represents a number; otherwise the conversion makes no sense.

## Let's code our own function <i>to_float</i>

To process this column, which contains not only missing values but also unconvertible strings, we need to write our own conversion function. It has to handle missing values as well as strings like the one that raised the ValueError, so we'll use an error-handling approach: try the conversion and fall back to NaN when it fails.

In [147]:
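def to_float(x):
    # Convert x to float; if the conversion fails, return NaN instead
    try:
        return float(x)
    except (ValueError, TypeError):
        # ValueError: non-numeric string, TypeError: None or another unsupported type
        return np.nan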

## Let's use our <i> to_float </i>

In [148]:
# ignore warnings
wn.filterwarnings('ignore')
small_df["budget"] = small_df["budget"].apply(to_float)


In [149]:
small_df.budget.astype("float")

0        30000000.0
1        65000000.0
2               0.0
3        16000000.0
4               0.0
            ...    
45461           0.0
45462           0.0
45463           0.0
45464           0.0
45465           0.0
Name: budget, Length: 45466, dtype: float64

In [150]:
small_df.dtypes

title            object
release_date     object
budget          float64
revenue         float64
runtime         float64
genres           object
dtype: object

Voilà! The budget column is now a float64 column, so we can use it with arithmetic and comparison operators as intended.
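
As an aside, pandas also ships a built-in that follows the same "convert or fall back to NaN" idea. The sketch below applies `pd.to_numeric` with `errors="coerce"` to a toy column; the values, including the file path, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column mixing numeric strings, a stray file path and a missing value
raw_budget = pd.Series(["30000000", "/some_poster.jpg", None, "0"], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN
budget = pd.to_numeric(raw_budget, errors="coerce")

print(budget)
print(budget.dtype)
```

This gives the same kind of result as applying `to_float` element by element, without writing the try/except by hand.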

## Define a new feature called "year"

Next, we focus on the release year of each movie, which we extract from the release date feature. The release_date column holds plain strings, so we first convert it to a datetime type and then read the year from it. The code snippet below does exactly that.

In [151]:
release_date = pd.to_datetime(small_df['release_date'], errors='coerce')
# Get the year of the release date
small_df['year'] = release_date.dt.year
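
The years are displayed as floats (for example 1874.0) in the outputs of this notebook. A likely reason is that `errors='coerce'` turns unparseable dates into missing values, and a column with missing values cannot stay an integer column by default. The sketch below reproduces this on made-up strings and shows one optional way to get whole-number years with pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Toy release dates: one valid date, one garbage string, one missing value
raw_dates = pd.Series(["1995-10-30", "not a date", None], dtype=object)

# Unparseable or missing entries become NaT instead of raising an error
dates = pd.to_datetime(raw_dates, errors="coerce")

# The year of a NaT is missing, so the result contains NaN
years = dates.dt.year

# Optional: nullable integers keep the missing values but drop the ".0"
years_int = years.astype("Int64")

print(years_int)
```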

## What are the oldest movies in this dataset

To get the oldest movies, let's sort by release year in ascending order.

In [152]:
# sort by year
small_df = small_df.sort_values(by=['year'], ascending=True)
small_df.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres,year
34940,Passage of Venus,1874-12-09,0.0,0.0,1.0,"[{'id': 99, 'name': 'Documentary'}]",1874.0
34937,Sallie Gardner at a Gallop,1878-06-14,0.0,0.0,1.0,"[{'id': 99, 'name': 'Documentary'}]",1878.0
41602,Buffalo Running,1883-11-19,0.0,0.0,1.0,"[{'id': 99, 'name': 'Documentary'}]",1883.0
34933,Man Walking Around a Corner,1887-08-18,0.0,0.0,1.0,"[{'id': 99, 'name': 'Documentary'}]",1887.0
34934,Accordion Player,1888-01-01,0.0,0.0,1.0,"[{'id': 99, 'name': 'Documentary'}]",1888.0


Here, we can see that the 5 oldest movies in our dataset are : 

In [153]:
print("The 5 oldest movies are : \n" + format(small_df["title"].head(5)))

The 5 oldest movies are : 
34940               Passage of Venus
34937     Sallie Gardner at a Gallop
41602                Buffalo Running
34933    Man Walking Around a Corner
34934               Accordion Player
Name: title, dtype: object


## What are the most successful movies in this dataset ? 

We measure success by revenue, so we answer the question by sorting on the revenue column in descending order.

In [154]:
# Sort by revenue
small_df = small_df.sort_values(by=['revenue'], ascending=False)
small_df.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres,year
14551,Avatar,2009-12-10,237000000.0,2787965000.0,162.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",2009.0
26555,Star Wars: The Force Awakens,2015-12-15,245000000.0,2068224000.0,136.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",2015.0
1639,Titanic,1997-11-18,200000000.0,1845034000.0,194.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",1997.0
17818,The Avengers,2012-04-25,220000000.0,1519558000.0,143.0,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",2012.0
25084,Jurassic World,2015-06-09,150000000.0,1513529000.0,124.0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",2015.0


Here, we can see the top 5 most successful movies ordered by revenue.

## Creation of the new Dataframe

In [155]:
# Create a new dataframe with the movies that earned more than 1 billion dollars
# and sort them by revenue in ascending order
new = small_df[small_df['revenue'] > 1000000000]
new = new.sort_values(by=['revenue'], ascending=True)
new.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres,year
12481,The Dark Knight,2008-07-16,185000000.0,1004558000.0,152.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",2008.0
44009,Despicable Me 3,2017-06-15,80000000.0,1020063000.0,96.0,"[{'id': 28, 'name': 'Action'}, {'id': 16, 'nam...",2017.0
19971,The Hobbit: An Unexpected Journey,2012-11-26,250000000.0,1021104000.0,169.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",2012.0
36253,Zootopia,2016-02-11,150000000.0,1023784000.0,108.0,"[{'id': 16, 'name': 'Animation'}, {'id': 12, '...",2016.0
14892,Alice in Wonderland,2010-03-03,200000000.0,1025491000.0,108.0,"[{'id': 10751, 'name': 'Family'}, {'id': 14, '...",2010.0


In [156]:
new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29 entries, 12481 to 14551
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         29 non-null     object 
 1   release_date  29 non-null     object 
 2   budget        29 non-null     float64
 3   revenue       29 non-null     float64
 4   runtime       29 non-null     float64
 5   genres        29 non-null     object 
 6   year          29 non-null     float64
dtypes: float64(4), object(3)
memory usage: 1.8+ KB


Judging from the output above, 29 movies earned more than 1 billion dollars in revenue.

## Creation of the new2 DataFrame

In [157]:
# Create a dataframe called new2 with the movies that earned more than 1 billion dollars on a budget of less than 150 million dollars
new2 = new[new['budget'] < 150000000]
new2.head()

Unnamed: 0,title,release_date,budget,revenue,runtime,genres,year
44009,Despicable Me 3,2017-06-15,80000000.0,1020063000.0,96.0,"[{'id': 28, 'name': 'Action'}, {'id': 16, 'nam...",2017.0
7000,The Lord of the Rings: The Return of the King,2003-12-01,94000000.0,1118889000.0,201.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",2003.0
30700,Minions,2015-06-17,74000000.0,1156731000.0,91.0,"[{'id': 10751, 'name': 'Family'}, {'id': 16, '...",2015.0
17437,Harry Potter and the Deathly Hallows: Part 2,2011-07-07,125000000.0,1342000000.0,130.0,"[{'id': 10751, 'name': 'Family'}, {'id': 14, '...",2011.0


In [158]:
new2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 44009 to 17437
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         4 non-null      object 
 1   release_date  4 non-null      object 
 2   budget        4 non-null      float64
 3   revenue       4 non-null      float64
 4   runtime       4 non-null      float64
 5   genres        4 non-null      object 
 6   year          4 non-null      float64
dtypes: float64(4), object(3)
memory usage: 256.0+ bytes


Judging from the last output, 4 movies earned more than 1 billion dollars in revenue on a budget of less than 150 million dollars.
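
The two filters can also be combined into a single boolean mask. The sketch below shows the idea on a made-up DataFrame (the titles and amounts are hypothetical); note the parentheses around each condition, which are required because `&` binds more tightly than the comparisons.

```python
import pandas as pd

# Hypothetical movies, not taken from the dataset
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [80e6, 200e6, 10e6],
    "revenue": [1.02e9, 1.5e9, 0.3e9],
})

# Revenue above 1 billion AND budget below 150 million, in one step
mask = (movies["revenue"] > 1e9) & (movies["budget"] < 150e6)
high_revenue_low_budget = movies[mask]

print(high_revenue_low_budget)
```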

# Part 2 : Building a simple recommender system

## The theory of simple recommender systems

Simple recommender systems use a weighted rating to create a scoring system. They offer generalized recommendations to every user, based on movie popularity and/or genre. The basic idea is that movies that are more popular and critically acclaimed have a higher probability of being liked by the average audience.

Simple recommenders are basic systems that recommend the top items based on a certain metric or score. Using the average rating directly as that metric isn't enough, because it ignores how many votes each movie received.

In other words, the raw average does not take the popularity of a movie into consideration: a movie rated 9 by 10 voters would be considered 'better' than a movie rated 8.9 by 10,000 voters. Such a metric also tends to favor movies with few voters and skewed or extremely high ratings.

<h4> The solution ? </h4>

The solution is a metric that takes both the rating and the number of votes into account. This is the weighted rating, which has the following expression.
<center>

$ \text{WeightedRating (WR)} = \frac{v}{v+m} \times R + \frac{m}{v+m} \times C $

</center>

Where : 
 - $v$ is the number of votes garnered by the movie
 - $m$ is the minimum number of votes required for the movie to be in the chart
 - $R$ is the mean rating of the movie
 - $C$ is the mean rating of all the movies in the dataset 

Here, we already have $v$ as vote_count and $R$ as vote_average. We still need to calculate $m$ and $C$.
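
To see how the formula corrects the problem described earlier, here is a small worked example in plain Python. The function name `weighted_rating_single` and the values m = 50 and C = 5.6 are assumptions chosen for illustration; the two movies are the hypothetical ones from the text (9 from 10 voters versus 8.9 from 10,000 voters).

```python
def weighted_rating_single(v, R, m, C):
    """Weighted rating of one movie with v votes and mean rating R."""
    return v / (v + m) * R + m / (v + m) * C

# Illustrative values (assumptions for this example)
m, C = 50, 5.6

few_votes = weighted_rating_single(v=10, R=9.0, m=m, C=C)
many_votes = weighted_rating_single(v=10_000, R=8.9, m=m, C=C)

print(round(few_votes, 3), round(many_votes, 3))  # prints 6.167 8.884
```

With only 10 votes, the score is pulled strongly towards the global mean C, while the movie with 10,000 votes keeps a score close to its own mean rating.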


## Let's calculate m

In [159]:
# m is the 80th percentile of the vote counts: a movie needs at least m votes
# to have more votes than 80% of the movies in the dataset
votes = df['vote_count'].dropna()
m = np.percentile(votes, 80)
print("The minimum of votes required for a movie to be in the 80th percentile is : m =" + format(m))

The minimum of votes required for a movie to be in the 80th percentile is : m =50.0


## Let's calculate C

In [160]:
C = df.vote_average.mean()
print("The mean of the average of votes garnered by the movies in the dataset is : C =" + format(C))

The mean of the average of votes garnered by the movies in the dataset is : C =5.618207215134185


## Let's calculate WR

In [161]:
# We will calculate Weighted Rating
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
df['WR score'] = df.apply(weighted_rating, axis=1)
df[["title", "WR score"]].head(10)

Unnamed: 0,title,WR score
0,Toy Story,7.680953
1,Jumanji,6.873979
2,Grumpier Old Men,6.18951
3,Waiting to Exhale,5.813219
4,Father of the Bride Part II,5.681661
5,Heat,7.646235
6,Sabrina,6.047698
7,Tom and Huck,5.514846
8,Sudden Death,5.526386
9,GoldenEye,6.560539


## Let's sort the dataframe using WR score

In [162]:
# Sort the dataframe by weighted rating in descending order
df = df.sort_values(by=['WR score'], ascending=False)
# Output the top 10 movies
df[['title', 'WR score']].head(10)

Unnamed: 0,title,WR score
10309,Dilwale Dulhania Le Jayenge,8.855148
314,The Shawshank Redemption,8.482863
834,The Godfather,8.476278
40251,Your Name.,8.366584
12481,The Dark Knight,8.289115
2843,Fight Club,8.286216
292,Pulp Fiction,8.284623
522,Schindler's List,8.270109
23673,Whiplash,8.269704
5481,Spirited Away,8.266628


We can clearly see that the highest WR score is 8.855148, obtained by the movie "Dilwale Dulhania Le Jayenge".
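
By the definition of $m$, only movies with at least $m$ votes should appear in the chart, whereas the cells above score every movie in the dataset. A possible refinement, sketched here on a made-up DataFrame, is to apply the cutoff before scoring. The sketch also uses column arithmetic instead of a row-wise `apply`, which is a design choice rather than a requirement.

```python
import pandas as pd

# Hypothetical movies (titles and votes are made up)
movies = pd.DataFrame({
    "title": ["Niche gem", "Crowd pleaser", "Average film"],
    "vote_count": [10, 10_000, 500],
    "vote_average": [9.0, 8.9, 6.0],
})

m = 50                              # assumed vote cutoff
C = movies["vote_average"].mean()   # mean rating over all movies

# Keep only the movies that reach the cutoff
qualified = movies[movies["vote_count"] >= m].copy()

# Weighted rating computed on whole columns at once
v = qualified["vote_count"]
R = qualified["vote_average"]
qualified["WR score"] = v / (v + m) * R + m / (v + m) * C

chart = qualified.sort_values("WR score", ascending=False)
print(chart[["title", "WR score"]])
```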

# Part 3 : Implement a user-based recommender system

User-based collaborative filtering predicts the items a user might like from the ratings given to those items by other users whose tastes are similar to the target user's. Collaborative filtering is widely used in recommender systems, and its two most common families of methods are memory-based and model-based. The approach below is memory-based: it works directly on the user rating matrix.

In [163]:
# Using the data in ratings_small.csv, construct the user rating matrix (where the rows represent the users and the columns represent the movies)
ratings_small = pd.read_csv('ratings_small.csv')
ratings_small.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


Let's create the user rating matrix.

In [164]:
# Create user rating matrix
user_ratings = ratings_small.pivot_table(index=['userId'], columns=['movieId'], values='rating')
user_ratings

userId / movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,4.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,4.0,...,,,,,,,,,,
5,,,4.0,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,,,,,,4.0,,,,,...,,,,,,,,,,
668,,,,,,,,,,,...,,,,,,,,,,
669,,,,,,,,,,,...,,,,,,,,,,
670,4.0,,,,,,,,,,...,,,,,,,,,,


This matrix contains many NaN values, since no user has rated every movie; this is why such a matrix is called sparse. The next important step is to replace these NaN values with actual values. Several approaches are possible: here I chose to replace NaN values with 0, but we could also replace them with a mean rating, taken either per user (row-wise) or per movie (column-wise).
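
The two filling strategies can be compared on a tiny example. The rating matrix below is made up, and the per-user mean is only one of the alternatives mentioned above; this is a sketch, not the method used in the rest of the notebook.

```python
import numpy as np
import pandas as pd

# Toy user rating matrix: rows are users, columns are movies, NaN = not rated
ratings = pd.DataFrame(
    {10: [4.0, np.nan, 2.0], 20: [np.nan, 3.0, 4.0], 30: [2.0, 5.0, np.nan]},
    index=pd.Index([1, 2, 3], name="userId"),
)
ratings.columns.name = "movieId"

# Strategy 1: treat a missing rating as 0
filled_zero = ratings.fillna(0)

# Strategy 2: replace a missing rating with the mean rating of that user
filled_user_mean = ratings.apply(lambda row: row.fillna(row.mean()), axis=1)

print(filled_zero)
print(filled_user_mean)
```

Filling with 0 treats an unseen movie like a very low rating, whereas the per-user mean treats it as a neutral one; which is preferable depends on the similarity measure used afterwards.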

Let's calculate the cosine similarity for each pair of users

In [165]:
# Let's calculate the cosine similarity for each pair of users
from sklearn.metrics.pairwise import cosine_similarity
user_filled_val = ratings_small.pivot_table(index = 'userId', columns = 'movieId', values = "rating", fill_value = 0)
user_similarity = cosine_similarity(user_filled_val)
# Print the 2D array
print(user_similarity)

[[1.         0.         0.         ... 0.06291708 0.         0.01746565]
 [0.         1.         0.12429498 ... 0.02413984 0.17059464 0.1131753 ]
 [0.         0.12429498 1.         ... 0.08098382 0.13660585 0.17019275]
 ...
 [0.06291708 0.02413984 0.08098382 ... 1.         0.04260878 0.08520194]
 [0.         0.17059464 0.13660585 ... 0.04260878 1.         0.22867673]
 [0.01746565 0.1131753  0.17019275 ... 0.08520194 0.22867673 1.        ]]


Here, we get the user similarity matrix.
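
For reference, the cosine similarity of two rating vectors $u$ and $v$ is $\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. The sketch below computes it by hand with NumPy on made-up vectors; the helper name `cosine_sim` is hypothetical and only serves to show what each entry of the matrix means.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up rating vectors over three movies (0 = not rated)
same_taste = cosine_sim([4, 0, 3], [4, 0, 3])   # identical ratings
partial = cosine_sim([4, 0, 3], [4, 0, 0])      # one movie in common
no_overlap = cosine_sim([4, 0, 3], [0, 5, 0])   # no movie in common

print(same_taste, partial, no_overlap)
```

Identical vectors give 1 (hence the diagonal of ones in the matrix), and users with no rated movie in common give 0.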

## Nearest neighbor calculation

In [166]:
def nearest_neighbors(user_id, n=10):
    # Get the index of the user
    user_index = user_ratings.index.get_loc(user_id)
    # Get the similarity scores of the user with all the other users
    similarity_scores = list(enumerate(user_similarity[user_index]))
    # Sort the similarity scores in descending order
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # Get the user ids of the top n similar users
    nearest_neighbors_id = [user_id for user_id, _ in similarity_scores[1:n+1]]
    # Get the top n similar users
    nearest_neighbors = user_ratings.loc[nearest_neighbors_id]
    # Get the top n similar users' ids
    nearest_neighbors_id = nearest_neighbors.index
    return nearest_neighbors, nearest_neighbors_id

Let's use it to find the nearest neighbors of the user 1.

In [167]:
nearest_neighbors_1, nearest_neighbors_id_1 = nearest_neighbors(user_id=1, n=10)
print("The nearest neighbors for user 1 are : "+ format(nearest_neighbors_id_1.values))

The nearest neighbors for user 1 are : [324 633 340 309 206  34 194 484 129 228]


In other words, these are the users that the function reports as the most similar to user 1.
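
One detail worth checking in `nearest_neighbors`: `enumerate` over a row of `user_similarity` yields row positions (starting at 0), while `user_ratings.loc` expects `userId` labels, so the two can differ. A minimal sketch of one way to keep them aligned is to label the similarity scores with the index of the rating matrix. The data and the helper name `nearest_neighbor_ids` below are made up, and the user ids are deliberately non-contiguous.

```python
import numpy as np
import pandas as pd

# Toy rating matrix (NaN already replaced by 0), hypothetical users and movies
ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 0, 1],
     [0, 0, 4, 5],
     [4, 4, 1, 0]],
    index=pd.Index([10, 20, 35, 50], name="userId"),
    columns=pd.Index([1, 2, 3, 4], name="movieId"),
    dtype=float,
)

# Cosine similarity between all pairs of users, computed with NumPy
values = ratings.to_numpy()
norms = np.linalg.norm(values, axis=1, keepdims=True)
similarity = (values @ values.T) / (norms * norms.T)

def nearest_neighbor_ids(user_id, n=2):
    # Row position of the user in the similarity matrix
    position = ratings.index.get_loc(user_id)
    # Label the scores with userId so that no position leaks out
    scores = pd.Series(similarity[position], index=ratings.index)
    # Never return the user as their own neighbor
    scores = scores.drop(user_id)
    return scores.nlargest(n).index.tolist()

print(nearest_neighbor_ids(10, n=2))
```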

## Recommended movies

The nearest_neighbors function returns both the ids of the nearest neighbors and their rows of the rating matrix, so that we can use them for the movie recommendation part.

In [168]:
# A function that recommends k movies for user based on the nearest neighbors
def recommend_movies(user_id, n=10, k=10):
    nn, nn_id = nearest_neighbors(user_id, n)
    # Get the movies rated by the nearest neighbors
    nn_movies = nn.drop(user_id, axis=1)
    # Get the average ratings of the movies
    nearest_neighbors_mean_ratings = nn_movies.mean(axis=1)
    # Sort the nearest neighbors' movies in descending order of their average ratings
    nearest_neighbors_mean_ratings = nearest_neighbors_mean_ratings.sort_values(ascending=False)
    # Get the top k movies
    top_k_movies = nearest_neighbors_mean_ratings.head(k)
    # Get the movie ids of the top n movies
    top_k_movies_id = top_k_movies.index
    # Get the top n movies' ratings
    top_k_movies_ratings = user_ratings.loc[user_id, top_k_movies_id]
    # Sort the top n movies in descending order of their ratings
    top_k_movies_ratings = top_k_movies_ratings.sort_values(ascending=False)
    # Get the top n movies' ids
    top_k_movies_id = top_k_movies_ratings.index
    return top_k_movies_id

Let's use it to find the recommended movies for user 1.

In [169]:
top_k_movies_id_1 = recommend_movies(user_id=1, n=10, k=10)
print("The top 10 movies recommended for user 1 are : "+ format(top_k_movies_id_1.values))

The top 10 movies recommended for user 1 are : [309 228 484  34 340 324 194 206 129 633]


Let's see what the titles of those movies are.

In [170]:
df_movies_ids = pd.read_csv('movies_ids.csv')
df_movies_ids.loc[top_k_movies_id_1]

userId,movieId,title,genres
309,344,Ace Ventura: Pet Detective (1994),Comedy
228,256,Junior (1994),Comedy|Sci-Fi
484,540,Sliver (1993),Thriller
34,36,Dead Man Walking (1995),Crime|Drama
340,376,"River Wild, The (1994)",Action|Thriller
324,360,I Love Trouble (1994),Action|Comedy
194,220,Castle Freak (1995),Horror
206,234,Exit to Eden (1994),Comedy
129,150,Apollo 13 (1995),Adventure|Drama|IMAX
633,760,Stalingrad (1993),Drama|War


Above, we can see the movies which should be recommended to our user n°1.
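
Note that in `recommend_movies`, `mean(axis=1)` averages along each neighbor's row, so the result is indexed by `userId` (as the header of the output above shows) rather than by `movieId`. A possible alternative, sketched here on made-up data under the assumption that the neighbor ids are already known, averages each movie's ratings over the neighbors and keeps only movies the target user has not rated. The function name `recommend_from_neighbors` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rating matrix: NaN means "not rated" (hypothetical users and movies)
ratings = pd.DataFrame(
    [[5.0, 4.0, np.nan, np.nan],
     [5.0, np.nan, 4.0, 2.0],
     [4.0, 4.0, 5.0, np.nan]],
    index=pd.Index([10, 20, 50], name="userId"),
    columns=pd.Index([101, 102, 103, 104], name="movieId"),
)

def recommend_from_neighbors(user_id, neighbor_ids, k=2):
    # Mean rating of each movie over the neighbors (axis=0), ignoring NaN
    neighbor_mean = ratings.loc[neighbor_ids].mean(axis=0)
    # Only movies the target user has not rated are candidates
    unseen = ratings.loc[user_id].isna()
    candidates = neighbor_mean[unseen].dropna()
    # Best-rated candidates first
    return candidates.sort_values(ascending=False).head(k).index.tolist()

print(recommend_from_neighbors(10, neighbor_ids=[20, 50], k=2))
```

The returned values are `movieId` labels, so they can be looked up in a movies table by matching on its `movieId` column.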

## The most suitable value of n

Propose an approach to find which value of n is suitable for this dataset. 

In [171]:
# A function that finds the best value of neighborhood size n for the given dataset
def find_best_n(user_id, n=10, k=10):
    best_n = 0
    best_n_movies = []
    for i in range(1, n+1):
        movies = recommend_movies(user_id, n=i, k=k)
        if len(movies) > len(best_n_movies):
            best_n = i
            best_n_movies = movies
    return best_n, best_n_movies

For user 34, the best n to recommend 9 movies is :

In [172]:
find_best_n(user_id=34, n=10, k=9)

(9,
 Int64Index([231, 118, 101, 117, 247, 194, 20, 18, 213], dtype='int64', name='userId'))
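
Another possible approach, different from `find_best_n` above, is to judge each value of n by how well it predicts ratings we already know. The sketch below hides a share of the known ratings, predicts each hidden rating as the mean rating given to that movie by the n most similar users, and keeps the n with the lowest mean absolute error. Everything in it is an assumption made for illustration: the synthetic data, the 20% hold-out share, the candidate values of n and the fallback to the global mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 30 users in two taste groups rating 12 movies
n_users, n_movies = 30, 12
group = rng.integers(0, 2, size=n_users)
first_half = np.arange(n_movies) < n_movies // 2
# Group 0 likes the first half of the movies, group 1 the second half
likes = first_half[None, :] == (group[:, None] == 0)
full = np.where(likes, 4.0, 2.0) + rng.integers(-1, 2, size=(n_users, n_movies))

# Only some ratings are known; part of the known ones are held out for testing
known = rng.random(full.shape) < 0.7
test = known & (rng.random(full.shape) < 0.2)
train = np.where(known & ~test, full, np.nan)

# Cosine similarity between users on the training ratings (NaN replaced by 0)
filled = np.nan_to_num(train)
norms = np.linalg.norm(filled, axis=1, keepdims=True)
norms[norms == 0] = 1.0
sim = (filled @ filled.T) / (norms * norms.T)
np.fill_diagonal(sim, -np.inf)  # a user is never their own neighbor

global_mean = np.nanmean(train)

def predict(u, i, n):
    # Positions of the n most similar users
    neighbors = np.argsort(sim[u])[::-1][:n]
    neighbor_ratings = train[neighbors, i]
    neighbor_ratings = neighbor_ratings[~np.isnan(neighbor_ratings)]
    if neighbor_ratings.size == 0:
        return global_mean  # no neighbor rated this movie
    return neighbor_ratings.mean()

def mean_absolute_error(n):
    users, items = np.nonzero(test)
    errors = [abs(predict(u, i, n) - full[u, i]) for u, i in zip(users, items)]
    return float(np.mean(errors))

candidates = [1, 3, 5, 10, 20]
errors = {n: mean_absolute_error(n) for n in candidates}
best_n = min(errors, key=errors.get)

print(errors)
print("best n:", best_n)
```

On the real dataset, the same idea would be applied to `ratings_small.csv`, ideally averaging the error over several random splits before settling on a value of n.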

# Conclusion

From here, we can do much more than just recommend movies: we can also predict the score a user might give a movie after watching it. And that is the beauty of it!