# Recommender Systems

This notebook provides two examples: one for content-based filtering and one for user-based (collaborative) filtering.


## Content-Based Filtering
For this example we use a movies dataset. 
You can download the data from https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
This example uses 2 of the files in the archive: 
- 'movies.csv' - title (with the release year) and genres of each movie 
- 'ratings.csv' - rating that each user gave to the movies they watched 

The idea is to recommend to a user movies they might like, based on the ratings and genres of the movies they have already watched, by comparing that information with the genres of every other movie in the list.

Some data cleaning is needed before doing the calculations. 


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#unzip data
!unzip -o -j moviedataset.zip 

Archive:  moviedataset.zip
  inflating: links.csv               
  inflating: movies.csv              
  inflating: ratings.csv             
  inflating: README.txt              
  inflating: tags.csv                


In [3]:
#movies_df holds information about the movies; ratings_df holds the rating that each user gave to some of the movies
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')

### Data cleaning
The necessary steps are:
- Check for missing values
- Create categorical variables for genres
- Format the title
- Create a year column 
- Drop unnecessary columns
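
As a rough illustration of these steps, the following sketch applies them to a small toy table. The rows are made up, and `str.get_dummies` is used here as one possible vectorized way to build the genre indicators; it is not necessarily how the cells below do it.

```python
import pandas as pd

# Toy rows in the same layout as movies.csv (values are illustrative only).
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995) ", "Untitled Film"],
    "genres": ["Adventure|Animation", "Adventure|Children", "(no genres listed)"],
})

# Extract the four-digit year from the parentheses; titles without one become NaN.
toy["year"] = toy["title"].str.extract(r"\((\d{4}).*\)", expand=False)

# Remove the "(year)" suffix from the title, then trim whitespace.
toy["title"] = toy["title"].str.replace(r"\((\d{4}).*\)", "", regex=True).str.strip()

# Build one 0/1 indicator column per genre in a single call.
genre_flags = toy["genres"].str.get_dummies(sep="|")

# Attach the indicators and drop the columns that are no longer needed.
clean = pd.concat([toy.drop(columns="genres"), genre_flags], axis=1)
clean = clean.drop(columns="(no genres listed)", errors="ignore")

# Drop rows without a year and store the year as an integer.
clean = clean.dropna(subset=["year"]).astype({"year": "int32"})
print(clean)
```

The row without a year is dropped, and the remaining rows carry a clean title, a numeric year, and one column per genre.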

In [4]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
#check missing values
print(movies_df.isnull().sum())
ratings_df.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64


userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [6]:
#Format data: the year is embedded in the title, so extract it into its own column
movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4}).*\)', expand=False)
#count the titles that have no year
print(movies_df['year'].isnull().sum())
#remove rows without a year
movies_df.dropna(inplace=True)
#transform to numeric
movies_df['year'] = movies_df['year'].astype('int32')

#remove the year from the title (regex=True is needed in pandas >= 2.0), then trim whitespace
movies_df['title'] = movies_df['title'].str.replace(r'\((\d{4}).*\)', '', regex=True)
movies_df['title'] = movies_df['title'].str.strip()

65


In [7]:
#split genres into binary variables
for index, row in movies_df.iterrows():
    for genre in row['genres'].split('|'):
        movies_df.at[index, genre] = 1
movies_df = movies_df.fillna(0)

#remove the '(no genres listed)' column (a row of all zeros already implies it)
#remove the genres column since it is no longer needed
movies_df.drop(columns=['(no genres listed)', 'genres'], inplace=True)
movies_df

Unnamed: 0,movieId,title,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
0,1,Toy Story,1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34203,151697,Grand Slam,1967,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34204,151701,Bloodmoney,2010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34205,151703,The Butterfly Circus,2009,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
34206,151709,Zero,2015,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [10]:
#remove timestamp column, not important for the analysis
ratings_df.drop('timestamp', axis=1, inplace=True)

### Choose a user

In [11]:
#choose user 1
user_df = ratings_df[ratings_df['userId']==1]
#get the genres of the movies that user 1 watched
user_movies = movies_df[movies_df['movieId'].isin(user_df['movieId'].tolist())]
user_genres = user_movies.drop(columns=['movieId', 'title', 'year'])
user_genres.reset_index(inplace=True, drop=True)
user_genres

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Make the calculations
To generate the list of recommended movies, each movie is scored with a weighted average of its genres, using the user's profile as the weights.  
The higher the score, the closer the movie is to what the user has rated highly.  
The profile of a user is obtained by multiplying the rating of each watched movie by that movie's genre indicators and summing the results per genre. 

In [12]:
#multiply the genres by the ratings to get the profile of the user
#note: dot() aligns on index labels; this works for user 1 because both sides are labelled 0..2 in movieId order
profile = user_genres.transpose().dot(user_df['rating'])
#get genres for all movies 
all_genres = movies_df.set_index('movieId').drop(columns=['title', 'year'])
#calculate weighted average
recommendationTable_df = ((all_genres*profile).sum(axis=1))/(profile.sum())
recommendationTable_df.head()

movieId
1    0.349206
2    0.253968
3    0.095238
4    0.333333
5    0.095238
dtype: float64
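
The score of the first movie can be checked by hand. The sketch below rebuilds user 1's profile from the three rated movies shown earlier (only the genres that appear in them are kept, to stay short) and scores a candidate tagged Adventure, Children and Comedy, which are the genres Toy Story shares with the profile. The variable names are illustrative.

```python
import pandas as pd

# Genre indicators of the three movies user 1 rated, in movieId order.
genres = ["Adventure", "Children", "Comedy", "Drama", "Action", "Crime", "Thriller"]
watched = pd.DataFrame(
    [[1, 1, 0, 1, 0, 0, 0],
     [1, 0, 1, 0, 1, 0, 0],
     [0, 0, 0, 1, 0, 1, 1]],
    columns=genres,
)
ratings = pd.Series([2.5, 3.0, 5.0])

# Profile: each genre accumulates the ratings of the watched movies that carry it.
profile = watched.T.dot(ratings)

# Candidate movie: genres outside the profile would contribute 0, so they are omitted.
candidate = pd.Series([1, 1, 1, 0, 0, 0, 0], index=genres)

# Score: share of the profile's total weight covered by the candidate's genres.
score = (candidate * profile).sum() / profile.sum()
print(profile.to_dict())
print(round(score, 6))  # 0.349206
```

The profile weights sum to 31.5 and the candidate's genres cover 5.5 + 2.5 + 3.0 = 11.0 of them, giving 11.0 / 31.5 ≈ 0.349206.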

In [13]:
#sort values so the most similar movies appear at the top
recommendationTable_df.sort_values(ascending=False, inplace=True)

### Recommend a movie!
Everything is ready to make a recommendation. Let's get the top 10, which are the movies that best match this user's profile.

In [14]:
#get the top 10 recommended movies (rows come back in movies_df order, not ranked by score)
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(10).keys())]

Unnamed: 0,movieId,title,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir
455,459,"Getaway, The",1994,1.0,0.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4923,5018,Motorama,1991,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
5918,6016,City of God (Cidade de Deus),2002,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11494,49530,Blood Diamond,2006,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
13250,64645,The Wrecking Crew,1968,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15480,78729,24: Redemption,2008,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16055,81132,Rubber,2010,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
19519,96601,Icon,2005,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26442,122787,The 39 Steps,1959,1.0,0.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26806,124681,Raffles,1939,1.0,0.0,0.0,1.0,0.0,1.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
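
The table above is filtered with `isin`, so its rows follow the order of `movies_df` rather than the ranking. A possible variation, sketched here on hypothetical toy data, keeps the ranking order and leaves out movies the user has already rated. The names `scores`, `catalogue` and `already_rated` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical scores indexed by movieId, standing in for recommendationTable_df.
scores = pd.Series({10: 0.35, 20: 0.90, 30: 0.10, 40: 0.75, 50: 0.60})

# Hypothetical catalogue and the movies the user has already rated.
catalogue = pd.DataFrame(
    {"movieId": [10, 20, 30, 40, 50], "title": ["A", "B", "C", "D", "E"]}
)
already_rated = [20]

# Drop rated movies, then keep the N highest scores in descending order.
top = scores.drop(index=already_rated, errors="ignore").nlargest(3)

# Look the titles up by movieId; .loc with a list of labels keeps the ranking order.
ranked = catalogue.set_index("movieId").loc[top.index].assign(score=top)
print(ranked)
```

Whether already rated movies should be excluded depends on the use case; it is done here only to show how it could be handled.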


## Collaborative Filtering


In [15]:
#Find users who rated the same movies as user 1
other_users = ratings_df[ratings_df['movieId'].isin(user_df['movieId'].tolist())]
other_users.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
491,13,169,1.0
663,14,169,3.0


In [54]:
#group by a single column name so that each group key is the plain user id
other_users_group = other_users.groupby('userId')

In [96]:
#TODO: try to improve https://github.com/pancr9/Netflix-Recommender-System/blob/master/PearsonBasedRecommender.ipynb
from math import sqrt
#Store the Pearson correlation in a dictionary, where the key is the user id and the value is the coefficient
pearsonCorrelationDict = {}

#For each user group in our subset
for name, group in other_users_group:
    #Start by sorting the current user and the input user so that the values do not get mixed up later
    group = group.sort_values(by='movieId')
    inputMovies = user_df.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the ratings for the movies that both users have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #Store them in a temporary list to simplify the calculations below
    tempRatingList = temp_df['rating'].tolist()
    #Also put the ratings of the current user group in a list
    tempGroupList = group['rating'].tolist()
    #Compute the Pearson correlation between the two users, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is non-zero, divide; otherwise the correlation is 0.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0

5.0
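
The formula used in the loop above can be isolated in a small function, which makes it easier to test on known inputs. This is a minimal sketch using only the standard library; the name `pearson` is a hypothetical helper, not part of the notebook.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists (hypothetical helper)."""
    n = len(x)
    # Sums of squares and cross products, centred with the same shortcut as the loop.
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    # A constant list has zero variance, so the ratio is undefined; return 0 as the loop does.
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Ratings that move together give a coefficient close to 1.
print(pearson([1, 2, 3], [2, 4, 6]))
# A user who gives every movie the same rating gets a coefficient of 0.
print(pearson([2.5, 3.0, 5.0], [3.0, 3.0, 3.0]))
```

With only a few movies in common the coefficient is very unstable, so in practice it may be worth requiring a minimum number of shared ratings before trusting it.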