# Read the data

On this occasion, we will use a more recent version of the MovieLens dataset that already includes the genres related to each of the movies.

In [21]:
import pandas as pd

#Storing the movie information into a pandas dataframe
movies_df = pd.read_csv('./ml-latest-small/movies.csv')
#Storing the user ratings into a pandas dataframe
ratings_df = pd.read_csv('./ml-latest-small/ratings.csv')
#Head is a function that gets the first N rows of a dataframe. N's default is 5.
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


# Feature extraction

The first step is to extract the features that we will use as content to describe the items. 

As explained in class, this step is one of the most important ones in a Content-based RS since the extracted features will allow the identification of similar contents and the creation of the user profiles.

In particular, for this experiment, we are going to make use of the movie genres.


In their current format (a single string per movie, with genres separated by `|`), the genres are not convenient for a content-based recommendation technique. We will use one-hot encoding to convert the list of genres into a vector where each column corresponds to one possible value of the feature. This encoding is a common way to feed categorical data to a model. In this case, every genre becomes a column containing either 1 or 0: 1 means that the movie has that genre, and 0 means that it doesn't.

In [22]:
#Every genre is separated by a | so we simply have to call the split function on |
movies_df['genres'] = movies_df.genres.str.split('|')

#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.
moviesWithGenres_df = movies_df.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
        
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",0.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II (1995),[Comedy],0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We now have a proper content-based representation of the movies that we will leverage to create our recommendation model.
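The loop above visits the rows one by one. As a hedged alternative, the sketch below builds the same kind of 0/1 genre columns with pandas' `Series.str.get_dummies`. It assumes the `genres` column still holds the raw `|`-separated strings (that is, before the `split` call), and `toy_movies` is a hypothetical three-row stand-in for `movies.csv`.

```python
import pandas as pd

# Toy stand-in for movies.csv (hypothetical frame, same column layout)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# Split each string on the separator and build one 0/1 column per distinct genre
genre_dummies = toy_movies["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the movie id and title
toy_encoded = pd.concat([toy_movies[["movieId", "title"]], genre_dummies], axis=1)
print(toy_encoded)
```

Two differences from the loop are worth keeping in mind: the columns produced this way are integers rather than floats, and they come out in alphabetical order rather than in order of first appearance.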

# Content-Based recommendation system

Now, let's take a look at how to implement Content-Based recommendation systems. 

The recommendation is based on two steps:

- Generate the user profiles: We need to describe the users based on the features (genres) related to the movies that they have rated.

- Search for the movies most related to each user profile: Once we have the user preferences properly represented (in the user profiles), we need to look for movies whose description (genres) fits those preferences.

## User Profile Generation

This technique attempts to figure out the user's favorite aspects of an item and then recommends the items that present those aspects. In our case, we will try to figure out the user's favorite genres from the movies and ratings given.

Let's begin with a single sample user and extract the set of movies related to their profile.

In [23]:
user_id = 2

# Get from the ratings dataframe only the rows (ratings) related to the user_id
user_rating = ratings_df[ratings_df.userId == user_id]
# Drop the timestamp, which we don't need (drop returns a new dataframe, so we reassign it)
user_rating = user_rating.drop(columns='timestamp')

# Merge with the movies dataframe to add the movie title to facilitate the analysis of the results
inputMovies = pd.merge(user_rating, movies_df, on='movieId').drop(columns=['genres', 'userId'])
inputMovies

Unnamed: 0,movieId,rating,title
0,318,3.0,"Shawshank Redemption, The (1994)"
1,333,4.0,Tommy Boy (1995)
2,1704,4.5,Good Will Hunting (1997)
3,3578,4.0,Gladiator (2000)
4,6874,4.0,Kill Bill: Vol. 1 (2003)
5,8798,3.5,Collateral (2004)
6,46970,4.0,Talladega Nights: The Ballad of Ricky Bobby (2...
7,48516,4.0,"Departed, The (2006)"
8,58559,4.5,"Dark Knight, The (2008)"
9,60756,5.0,Step Brothers (2008)


We can see that the movies liked by the user are more or less consistent.

We're going to start by learning the user's preferences, so let's get the subset of movies that the user has watched from the previous dataframe and extend it with the movie genres.

In [24]:
#Keeping only the movies that the user has rated
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
277,318,"Shawshank Redemption, The (1994)","[Crime, Drama]",0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
291,333,Tommy Boy (1995),[Comedy],0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1284,1704,Good Will Hunting (1997),"[Drama, Romance]",0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2674,3578,Gladiator (2000),"[Action, Adventure, Drama]",1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4615,6874,Kill Bill: Vol. 1 (2003),"[Action, Crime, Thriller]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5305,8798,Collateral (2004),"[Action, Crime, Drama, Thriller]",0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6253,46970,Talladega Nights: The Ballad of Ricky Bobby (2...,"[Action, Comedy]",0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6315,48516,"Departed, The (2006)","[Crime, Drama, Thriller]",0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6710,58559,"Dark Knight, The (2008)","[Action, Crime, Drama, IMAX]",0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6801,60756,Step Brothers (2008),[Comedy],0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the movieId, title and genres columns.



In [25]:
#Resetting the index to avoid future issues
userMovies = userMovies.reset_index(drop=True)
#Dropping the columns we don't need, to save memory and to keep only the genre indicators
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres'])
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we're ready to start learning the user's preferences!

To do this, we're going to turn each genre into a weight. We can do this by multiplying the user's ratings into the genre table and then summing up the resulting table by column. This operation is actually a dot product between a matrix and a vector, so we can accomplish it simply by calling pandas' `dot` function.

This way, we give a larger score to the genres related to the movies that the user has rated better. In the same way, genres related to movies that the user disliked or has not interacted with will have a lower score in the user profile. Note that, because the ratings are summed, a genre also scores higher simply because the user has rated many movies in it.
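To make the equivalence concrete, here is a minimal sketch on made-up data. The genre table, the ratings and the movie ids are all hypothetical; the point is only that "multiply each row by its rating, then sum by column" gives the same result as a dot product.

```python
import pandas as pd

# Hypothetical 0/1 genre table: three rated movies (rows), three genres (columns)
toy_genres = pd.DataFrame(
    {"Comedy": [1, 0, 1], "Drama": [0, 1, 0], "Action": [0, 1, 1]},
    index=[10, 20, 30],  # hypothetical movie ids
)
# Hypothetical ratings for the same movies, indexed by the same ids
toy_ratings = pd.Series([4.0, 3.0, 5.0], index=[10, 20, 30])

# Long form: scale every movie row by its rating, then sum each genre column
by_hand = toy_genres.mul(toy_ratings, axis=0).sum(axis=0)

# Short form: the same result as a matrix-vector dot product
toy_profile = toy_genres.T.dot(toy_ratings)

print(toy_profile)
```

pandas matches the columns of the transposed table with the index of the ratings by label, so both objects must describe the same movies under the same labels. When the tables use a positional index instead of movie ids, the rows must also be in the same movie order for the result to be meaningful.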

This is a very simplistic way of generating user profiles. **Can you think of more advanced methods of doing it?**

In [26]:
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
userProfile

Adventure             12.5
Animation              0.0
Children               0.0
Comedy                28.0
Fantasy                0.0
Romance                4.5
Drama                 66.0
Action                43.5
Crime                 38.0
Thriller              37.0
Horror                 3.0
Mystery                8.0
Sci-Fi                15.5
War                    4.5
Musical                0.0
Documentary           13.0
IMAX                  15.0
Western                3.5
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64

Now we have a weight for each of the user's genre preferences. This user, in particular, seems to be interested in Drama, Action, Crime and Thriller, while showing no interest in Animation or Children movies.

Using this, we can recommend movies that satisfy the user's preferences. Let's start by extracting the genre table from the original dataframe.

In [27]:
#Now let's get the genres of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index('movieId')
#And drop the unnecessary information
genreTable = genreTable.drop(columns=['title', 'genres'])
genreTable.head(10) #This is for all movies

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With the user profile as input and the complete list of movies and their genres in hand, we're going to compute, for every movie, the weighted average of its genre indicators using the user profile as weights, and recommend the top twenty movies that best satisfy it.
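As a small worked example of this weighted average, the sketch below scores three made-up movies against a made-up profile. All names and values here are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical user profile: one weight per genre
toy_profile = pd.Series({"Comedy": 9.0, "Drama": 3.0, "Action": 8.0})

# Hypothetical candidate movies with 0/1 genre flags, indexed by movie id
toy_catalog = pd.DataFrame(
    {"Comedy": [1, 0, 0], "Drama": [0, 1, 0], "Action": [1, 1, 0]},
    index=pd.Index([101, 102, 103], name="movieId"),
)

# Multiply each genre flag by its weight, add up per movie,
# and normalize by the total weight of the profile
toy_scores = (toy_catalog * toy_profile).sum(axis=1) / toy_profile.sum()

# Highest score first
toy_scores = toy_scores.sort_values(ascending=False)
print(toy_scores)
```

Movie 101 covers Comedy and Action, so its score is (9 + 8) / 20 = 0.85; movie 102 gets (3 + 8) / 20 = 0.55; movie 103 has none of the weighted genres and scores 0. With non-negative weights, the score always lies between 0 and 1.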

In [28]:
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())

#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)

recommendationTable_df

movieId
81132     0.820205
79132     0.763699
4719      0.743151
7235      0.738014
5628      0.727740
            ...   
2096      0.000000
4294      0.000000
136556    0.000000
313       0.000000
81018     0.000000
Length: 9742, dtype: float64

If we look carefully at the result list, we can see movieId 79132, which corresponds to Inception, one of the movies that the user has already watched and rated. We do not want to recommend movies that have already been watched, so we will remove them from the recommendation list.

In [29]:
recommendationTable_df.drop(inputMovies.movieId, inplace=True)
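
To turn the filtered scores into a readable top-N list, one possible final step is to keep the best entries and attach their titles. The sketch below uses made-up data: `toy_scores`, `seen_ids`, `toy_movies` and `top_n` are hypothetical names, and the cut-off is set to 2 only to keep the output short.

```python
import pandas as pd

# Hypothetical scores indexed by movie id, already sorted in descending order
toy_scores = pd.Series([0.85, 0.55, 0.40, 0.0],
                       index=pd.Index([101, 102, 104, 103], name="movieId"))
# Hypothetical ids of movies the user has already rated
seen_ids = [102]
# Hypothetical title lookup with the same column names as movies.csv
toy_movies = pd.DataFrame({"movieId": [101, 102, 103, 104],
                           "title": ["Movie A", "Movie B", "Movie C", "Movie D"]})

top_n = 2  # hypothetical cut-off; the text above mentions twenty

# errors="ignore" skips ids that are missing from the scores instead of raising KeyError
unseen = toy_scores.drop(seen_ids, errors="ignore")

# Keep the best top_n and attach the titles; the left merge preserves the score order
top = (unseen.head(top_n)
             .rename("score")
             .reset_index()
             .merge(toy_movies, on="movieId", how="left"))
print(top)
```

Assigning the result of `drop` to a new variable, rather than using `inplace=True`, keeps the full score table available in case we want to inspect it again later.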