# Recommender Systems



## Content-Based Filtering
This example uses a movie dataset.
You can download it from https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
The archive contains two files:
- 'movies.csv' - genres, title and year of each movie
- 'ratings.csv' - the rating each user gave to the movies they watched

The idea is to recommend movies a user might like, by comparing the genres of the movies the user has watched (weighted by the ratings the user gave them) against the genres of all the other movies.

Some data cleaning is needed before running the calculations.

Let's start!

First, import the libraries used in this example:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#uncomment to unzip the downloaded archive (-o overwrites existing files, -j drops the folder structure)
#!unzip -o -j moviedataset.zip 
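
If the `unzip` command is not available, the same extraction can be sketched with Python's standard library. The helper name `extract_flat` and the `data` target folder are assumptions for illustration (the folder is chosen to match the `data/` paths used when loading the CSV files); the archive is assumed to have been downloaded already.

```python
import zipfile
from pathlib import Path


def extract_flat(zip_path, dest_dir):
    """Extract every file of the archive into dest_dir, ignoring internal folders."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue
            # Keep only the file name, like `unzip -j`
            target = dest / Path(member.filename).name
            # Overwrite any existing file, like `unzip -o`
            target.write_bytes(archive.read(member))
            extracted.append(target.name)
    return extracted


# Example call (hypothetical paths):
# extract_flat('moviedataset.zip', 'data')
```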

Load the two datasets 'movies.csv' and 'ratings.csv' into two dataframes 

In [5]:
#movies_df holds information about the movies; ratings_df holds the rating each user gave to some of the movies
movies_df = pd.read_csv('data/movies.csv')
ratings_df = pd.read_csv('data/ratings.csv')

### Data cleaning

The necessary steps are:

- Check the missing values
- Create categorical variables for genres
- Format title
- Create year column 

In [6]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


#### Check missing values

In [9]:
print(movies_df.isnull().sum())
ratings_df.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64


userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

#### Format data
We can see that the year is inside the title. We can extract it into a new column 'year', then check for missing values and remove the rows in which no year was found.

In [12]:
#extract the four-digit year in parentheses (raw string keeps the regex escapes intact)
movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4}).*\)', expand=False)
#check how many titles had no year
print(movies_df['year'].isnull().sum())
#remove the rows without a year
movies_df.dropna(inplace=True)
#remove the year from the title, then strip the leftover whitespace
movies_df['title'] = movies_df['title'].str.replace(r'\((\d{4}).*\)', '', regex=True)
movies_df['title'] = movies_df['title'].str.strip()

65
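
To see what the regular expression does, here is a minimal sketch on three made-up rows (the `sample` dataframe is an assumption for illustration, not part of the dataset). It shows a normal title, a title with an extra pair of parentheses, and a title without a year.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "City of God (Cidade de Deus) (2002)",
        "Untitled Film",
    ]
})

# '(' + four digits (captured) + anything up to the last ')'
year_pattern = r"\((\d{4}).*\)"

# expand=False returns a Series; titles without a year become NaN
sample["year"] = sample["title"].str.extract(year_pattern, expand=False)

# Remove the matched text, then trim the space left at the end
sample["title"] = sample["title"].str.replace(year_pattern, "", regex=True).str.strip()

print(sample)
```

Only the parenthesis that starts with four digits is matched, so "(Cidade de Deus)" stays in the title, and the row without a year ends up with a missing value that `dropna` would remove.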


#### Transform categorical data
- Convert the year column to numeric
- Split the genres variable into binary variables
- Remove unnecessary columns

In [15]:
#transform to numeric
movies_df.year = movies_df.year.astype('int32')
#split genres into binary variables
for index, row in movies_df.iterrows():
    for genre in row['genres'].split('|'):
        movies_df.at[index, genre] = 1
movies_df = movies_df.fillna(0)
movies_df.head()

#remove the '(no genres listed)' column (it is implied when every genre is 0)
#remove the genres column since it is no longer needed
movies_df.drop(columns=['(no genres listed)', 'genres'], inplace=True)
movies_df

Unnamed: 0,movieId,title,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
0,1,Toy Story,1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34203,151697,Grand Slam,1967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34204,151701,Bloodmoney,2010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34205,151703,The Butterfly Circus,2009,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34206,151709,Zero,2015,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
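
The row-by-row loop works, but it can be slow on a table with tens of thousands of movies. As an alternative sketch, pandas offers `Series.str.get_dummies`, which builds the binary columns in one call. The `sample` dataframe below is made up for illustration; note that this variant produces integer 0/1 columns in alphabetical order rather than the float columns shown above.

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["Toy Story", "Grumpier Old Men", "Unknown"],
    "genres": ["Adventure|Animation|Comedy", "Comedy|Romance", "(no genres listed)"],
})

# One 0/1 column per distinct genre found between the '|' separators
genre_flags = sample["genres"].str.get_dummies(sep="|")

# A movie with no listed genres is simply the all-zero row
genre_flags = genre_flags.drop(columns=["(no genres listed)"])

# Put the binary columns next to the remaining movie columns
encoded = pd.concat([sample.drop(columns=["genres"]), genre_flags], axis=1)
print(encoded)
```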


Show the first rows of the ratings dataframe

In [18]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


The timestamp column is not needed for this analysis, so remove it.

In [21]:
#drop the timestamp column
ratings_df.drop('timestamp', axis=1, inplace=True)

### Choose a user
Choose the user with userId 1.

In [24]:
#choose user 1
user_df = ratings_df[ratings_df['userId']==1]
user_df

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0


Get the genres of the movies that user 1 watched

In [27]:
user_movies = movies_df[movies_df['movieId'].isin(user_df['movieId'].tolist())]
user_genres = user_movies.drop(['movieId','title', 'year'], axis=1)
user_genres.reset_index(inplace=True, drop=True)
user_genres

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
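
One caveat: `isin` returns the movies in the order of `movies_df`, while the ratings stay in the order of `ratings_df`. For user 1 both appear to be sorted by movieId, so each rating lines up with the right movie, but that is not guaranteed for other users. A sketch of a safer variant, on made-up data, joins the two tables on movieId so that each rating stays on the same row as its genres:

```python
import pandas as pd

# Toy data: the ratings are deliberately NOT in movieId order
movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "Comedy": [1.0, 0.0, 1.0],
    "Drama": [0.0, 1.0, 1.0],
})
user_ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [30, 10],
    "rating": [5.0, 2.0],
})

# Join on movieId so every rating sits on the same row as its genres
rated = user_ratings.merge(movies, on="movieId", how="inner")
genre_columns = ["Comedy", "Drama"]

# Each genre's weight is the sum of the ratings of the movies in that genre
profile = rated[genre_columns].mul(rated["rating"], axis=0).sum()
print(profile)
```

Here movie 30 (Comedy and Drama) was rated 5.0 and movie 10 (Comedy only) was rated 2.0, so Comedy gets 7.0 and Drama gets 5.0. Pairing the rows by position instead would wrongly give Drama a weight of 2.0.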


### Make the calculations
To generate the list of recommended movies, we compute a weighted average of each movie's genres, using the user's profile as the weights.  
This value is the score of each movie: the higher it is, the closer the movie is to the user's tastes.  
The profile of a user is obtained by multiplying the genres of the movies that the user watched by the ratings the user gave them, so each genre's weight is the sum of the ratings of the watched movies in that genre.

Follow these steps:  
1. profile: transpose the user_genres matrix and multiply it by the rating column 
2. find the genres of all movies by creating a new dataframe 'all_genres'
3. calculate the recommendation table with the following formula: sum(all_genres*profile)/sum(profile)
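
As a worked example of these steps, the sketch below uses the three ratings of user 1 (2.5, 3.0 and 5.0) and the non-zero genre columns of user_genres shown above. The `candidate` movie is hypothetical: it carries every genre in the profile except Children.

```python
import pandas as pd

genre_columns = ["Adventure", "Children", "Comedy", "Drama", "Action", "Crime", "Thriller"]

# Genres of the three movies rated by user 1 (non-zero columns only)
user_genres = pd.DataFrame(
    [
        [1, 1, 0, 1, 0, 0, 0],  # rated 2.5
        [1, 0, 1, 0, 1, 0, 0],  # rated 3.0
        [0, 0, 0, 1, 0, 1, 1],  # rated 5.0
    ],
    columns=genre_columns,
    dtype=float,
)
ratings = pd.Series([2.5, 3.0, 5.0])

# Step 1: genre weights = transposed genre matrix times the ratings
profile = user_genres.transpose().dot(ratings)

# Hypothetical candidate with every profile genre except Children
candidate = pd.Series([1, 0, 1, 1, 1, 1, 1], index=genre_columns, dtype=float)

# Step 3: weighted average of the candidate's genres
score = (candidate * profile).sum() / profile.sum()
print(profile.to_dict())
print(score)
```

The profile weights add up to 31.5 (Drama is the heaviest at 7.5), and the candidate collects 29 of them, so its score is 29/31.5, about 0.9206.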

In [30]:
#multiply by rating to get the profile of the user
profile = user_genres.transpose().dot(user_df['rating'])
#get the genres of all movies, indexed by movieId
all_genres = movies_df.set_index('movieId').drop(columns=['title', 'year'])
#calculate weighted average
recommendation_table_df = ((all_genres*profile).sum(axis=1))/(profile.sum())
#sort values so the most similar movies appear at the top
recommendation_table_df.sort_values(ascending=False, inplace=True)
recommendation_table_df.head()

movieId
81132     0.920635
122787    0.920635
64645     0.920635
96601     0.825397
124681    0.825397
dtype: float64

### Recommend a movie!

Everything is ready to make a recommendation. Let's get the top 10 scores, which correspond to the movies most similar to this user's profile.

In [33]:
#get top 10 recommended movies 
top_10 = recommendation_table_df.head(10).keys()
movies_df.loc[movies_df['movieId'].isin(top_10)]

Unnamed: 0,movieId,title,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
455,459,"Getaway, The",1994,1.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4923,5018,Motorama,1991,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
5918,6016,City of God (Cidade de Deus),2002,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11494,49530,Blood Diamond,2006,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
13250,64645,The Wrecking Crew,1968,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15480,78729,24: Redemption,2008,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16055,81132,Rubber,2010,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
19519,96601,Icon,2005,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26442,122787,The 39 Steps,1959,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26806,124681,Raffles,1939,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
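
Note that filtering with `isin` lists the ten movies in the order of movies_df, not by score, and it does not check whether the user has already watched them. A possible refinement, sketched on made-up data (the names `scores`, `watched_ids` and `top_movies` are assumptions for illustration), removes the watched movies first and joins on the score index to keep the ranking:

```python
import pandas as pd

# Toy stand-ins for movies_df, the score Series and the user's watched movies
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
})
scores = pd.Series({3: 0.9, 1: 0.8, 4: 0.6, 2: 0.1}, name="score")
scores.index.name = "movieId"
watched_ids = [1]

# Drop movies the user already rated, then keep the best-scoring ones
unseen = scores[~scores.index.isin(watched_ids)]
top_scores = unseen.sort_values(ascending=False).head(2)

# A left merge keeps the rows in the order of the scores
top_movies = top_scores.reset_index().merge(movies, on="movieId", how="left")
print(top_movies)
```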
