# Movie Recommender System

This notebook demonstrates the following recommenders in action, built on the MovieLens dataset (26+ million ratings). The complete implementation of the Python classes used here can be found in the `src` section of this project.
- Popularity based recommendation
- Item similarity based recommendation
- User similarity based recommendation

In [49]:
import sys
sys.path.insert(0,'/Users/skumar/recommendation_system/src')

In [50]:
%matplotlib inline
from models import recommenders
from metrics import metrics
import pandas as pd
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
import numpy as np
import time
# joblib is a standalone package (sklearn.externals.joblib was removed)
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
import warnings; warnings.simplefilter('ignore')
from nltk.tokenize import RegexpTokenizer

# Load MovieLens movie data

In [24]:
rating_df = pd.read_csv('../data/movielens/ratings.csv', sep=',',header=0)
movies_df = pd.read_csv('../data/movielens/movies.csv', sep=',', header=0)
rating_df.columns=['user_id', 'movie_id', 'rating','timestamp']
movies_df.columns=['movie_id', 'title', 'genres']
movies_df.head()

| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [25]:
rating_df.head()

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 110 | 1.0 | 1425941529 |
| 1 | 1 | 147 | 4.5 | 1425942435 |
| 2 | 1 | 858 | 5.0 | 1425941523 |
| 3 | 1 | 1221 | 5.0 | 1425941546 |
| 4 | 1 | 1246 | 5.0 | 1425941556 |


In [26]:
combined_df=pd.merge(movies_df, rating_df, on='movie_id')
combined_df.head()

| | movie_id | title | genres | user_id | rating | timestamp |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 8 | 4.0 | 1013443596 |
| 1 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 9 | 4.5 | 1073837180 |
| 2 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 12 | 4.0 | 943912205 |
| 3 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 20 | 4.0 | 1368361348 |
| 4 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 24 | 4.0 | 979869938 |


## Length of the dataset

In [27]:
combined_df.shape

(26024289, 6)

## Showing the most popular movies in the dataset

In [28]:
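# Count ratings per movie and express each count as a percentage of all ratings
movie_grouped = combined_df.groupby(['movie_id']).agg({'user_id': 'count'}).reset_index()
grouped_sum = movie_grouped['user_id'].sum()
movie_grouped['percentage'] = movie_grouped['user_id'].div(grouped_sum) * 100
# Sort by rating count (descending), breaking ties by movie_id (ascending)
movie_grouped.sort_values(['user_id', 'movie_id'], ascending=[False, True]).head()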

| | movie_id | user_id | percentage |
|---|---|---|---|
| 352 | 356 | 91921 | 0.353212 |
| 315 | 318 | 91082 | 0.349988 |
| 293 | 296 | 87901 | 0.337765 |
| 587 | 593 | 84078 | 0.323075 |
| 2487 | 2571 | 77960 | 0.299566 |
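
The table above identifies movies only by `movie_id`, and its `user_id` column actually holds the rating count. A minimal sketch of carrying the title through the aggregation and giving the count a clearer name is shown here. It runs on a toy stand-in for `combined_df`; the rows and the column name `rating_count` are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for combined_df; values are illustrative only.
toy_df = pd.DataFrame({
    'movie_id': [1, 1, 1, 2, 2, 3],
    'title': ['A', 'A', 'A', 'B', 'B', 'C'],
    'user_id': [10, 11, 12, 10, 13, 11],
})

# Count ratings per movie, keeping the title alongside the id.
popularity = (
    toy_df.groupby(['movie_id', 'title'])['user_id']
    .count()
    .reset_index(name='rating_count')
)

# Share of all ratings that each movie received, in percent.
popularity['percentage'] = (
    popularity['rating_count'] / popularity['rating_count'].sum() * 100
)

# Most-rated movies first; ties broken by movie_id.
popularity = popularity.sort_values(
    ['rating_count', 'movie_id'], ascending=[False, True]
)
print(popularity)
```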


## Count the number of unique users in the dataset

In [29]:
users = combined_df['user_id'].unique()

In [30]:
len(users)

270896

In [31]:
combined_df=combined_df.head(100000)
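
The merged frame shown earlier is ordered by movie, so its first 100,000 rows cover only a handful of titles (the model output further down reports 3 unique movies in the training set). One hedged alternative, which is not what this notebook does, is to draw a random sample so the subset spans more movies. The sketch below uses a small synthetic frame in place of `combined_df`.

```python
import pandas as pd

# Synthetic stand-in for combined_df: 5 movies with 4 ratings each,
# ordered by movie_id like the merged frame.
toy_df = pd.DataFrame({
    'movie_id': [m for m in range(1, 6) for _ in range(4)],
    'user_id': list(range(20)),
})

head_subset = toy_df.head(8)                        # first rows only
random_subset = toy_df.sample(n=8, random_state=0)  # rows drawn across the frame

print(head_subset['movie_id'].nunique())    # movies covered by head()
print(random_subset['movie_id'].nunique())  # movies covered by the random sample
```

A random sample keeps the run time bounded in the same way as `head`, but the resulting user-item data is much sparser per movie, so the trade-off depends on what is being demonstrated.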

# Create a movie recommender

In [32]:
train_data, test_data = train_test_split(combined_df, test_size = 0.20, random_state=0)
print(train_data.head(5))

       movie_id                    title  \
10382         1         Toy Story (1995)   
73171         2           Jumanji (1995)   
30938         1         Toy Story (1995)   
99310         3  Grumpier Old Men (1995)   
58959         1         Toy Story (1995)   

                                            genres  user_id  rating  \
10382  Adventure|Animation|Children|Comedy|Fantasy    42534     4.0   
73171                   Adventure|Children|Fantasy    74688     3.5   
30938  Adventure|Animation|Children|Comedy|Fantasy   126997     4.0   
99310                               Comedy|Romance   127781     1.5   
58959  Adventure|Animation|Children|Comedy|Fantasy   241602     3.5   

        timestamp  
10382   981571500  
73171  1472861478  
30938  1436561086  
99310  1056431462  
58959  1448244208  


## Popularity-based recommendations

### Create an instance of the popularity-based recommender class

In [33]:
pm = recommenders.PopularityRecommender()
pm.create(train_data, 'user_id', 'movie_id')

### Use the popularity model to make some predictions

In [34]:
user_id = users[0]
pm.recommend(user_id)

| | user_id | movie_id | score | Rank |
|---|---|---|---|---|
| 0 | 8 | 1 | 52756 | 1.0 |
| 1 | 8 | 2 | 20900 | 2.0 |
| 2 | 8 | 3 | 6344 | 3.0 |
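
The implementation of `PopularityRecommender` lives in `src` and is not shown in this notebook. As a rough sketch of the idea, assuming the score is simply the number of ratings per item (consistent with the `score` column above, but still an assumption about the class), a popularity recommender could look like the following. The function name `popularity_recommend` and the toy data are hypothetical.

```python
import pandas as pd

def popularity_recommend(train_df, user_id, user_col, item_col, top_n=10):
    """Return the top_n most-rated items; the list is the same for every user."""
    scores = (
        train_df.groupby(item_col)[user_col]
        .count()
        .reset_index(name='score')
        .sort_values(['score', item_col], ascending=[False, True])
        .head(top_n)
        .reset_index(drop=True)
    )
    # Rank 1 is the most-rated item.
    scores['Rank'] = range(1, len(scores) + 1)
    # Tag the rows with the requesting user, mirroring the output layout above.
    scores.insert(0, user_col, user_id)
    return scores

# Toy training data; values are illustrative only.
toy_train = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 2, 3],
    'movie_id': [10, 10, 10, 20, 20, 30],
})
recs = popularity_recommend(toy_train, user_id=8,
                            user_col='user_id', item_col='movie_id')
print(recs)
```

Because the ranking ignores the user entirely, every user receives the same list, which is why this model serves as the non-personalized baseline.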


In [35]:
user_id = users[3]


## Build a movie recommender with personalization

We now create an item similarity based collaborative filtering model that allows us to make personalized recommendations to each user. 
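
The `ItemSimilarityRecommender` class is defined in `src`, and its log output mentions a co-occurrence matrix. One common way to build such a matrix, assumed here for illustration rather than confirmed from the source, is the Jaccard index between the sets of users who rated each pair of items. The helper names and toy data below are hypothetical.

```python
import pandas as pd

def jaccard_cooccurrence(train_df, user_col, item_col):
    """Item-by-item matrix of |users(i) & users(j)| / |users(i) | users(j)|."""
    users_by_item = train_df.groupby(item_col)[user_col].apply(set).to_dict()
    items = sorted(users_by_item)
    matrix = pd.DataFrame(0.0, index=items, columns=items)
    for i in items:
        for j in items:
            union = users_by_item[i] | users_by_item[j]
            if union:
                common = users_by_item[i] & users_by_item[j]
                matrix.loc[i, j] = len(common) / len(union)
    return matrix

def recommend_for_user(train_df, user_id, user_col, item_col, top_n=10):
    """Score unseen items by their average similarity to the user's items."""
    sim = jaccard_cooccurrence(train_df, user_col, item_col)
    seen = sorted(set(train_df.loc[train_df[user_col] == user_id, item_col]))
    if not seen:
        return pd.Series(dtype=float)
    # Average the similarity rows of the seen items, then drop the seen items.
    scores = sim.loc[seen].mean(axis=0).drop(labels=seen)
    return scores.sort_values(ascending=False).head(top_n)

# Toy training data; values are illustrative only.
toy_train = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3, 3, 4],
    'movie_id': [10, 20, 10, 20, 10, 30, 30],
})
recs = recommend_for_user(toy_train, user_id=4,
                          user_col='user_id', item_col='movie_id')
print(recs)
```

The nested loop is quadratic in the number of items, which is fine for a sketch but would need a sparse-matrix formulation for the full dataset.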

## Class for an item similarity based personalized recommender system

In [36]:
is_model = recommenders.ItemSimilarityRecommender()
is_model.create(train_data, 'user_id', 'movie_id')

### Use the personalized model to make movie recommendations

In [37]:
# Print the movies rated by this user in the training data
user_id = users[0]
user_items = is_model.get_user_items(user_id)
print("------------------------------------------------------------------------------------")
print("Training data movies for the user user_id: %s:" % user_id)
print("------------------------------------------------------------------------------------")

for user_item in user_items:
    print(user_item)

print("----------------------------------------------------------------------")
print("Recommendation process going on:")
print("----------------------------------------------------------------------")

# Recommend movies for the user using the personalized model
is_model.recommend(user_id)

------------------------------------------------------------------------------------
Training data movies for the user user_id: 8:
------------------------------------------------------------------------------------
1
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
No. of unique movies for the user: 1
no. of unique movies in the training set: 3
Non zero values in cooccurence_matrix :3


| | user_id | movie | score | rank |
|---|---|---|---|---|
| 0 | 8.0 | 2.0 | 0.16931 | 1.0 |
| 1 | 8.0 | 3.0 | 0.056489 | 2.0 |


### We can also apply the model to find movies similar to any movie in the dataset

In [38]:
is_model.get_similar_items([16240])

no. of unique movies in the training set: 3
Non zero values in cooccurence_matrix :0


| | user_id | movie | score | rank |
|---|---|---|---|---|
| 0 | | 3 | 0.0 | 1 |
| 1 | | 2 | 0.0 | 2 |
| 2 | | 1 | 0.0 | 3 |
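
All scores above are 0.0 and the log reports no non-zero co-occurrence values. That is the expected result when the queried movie does not occur in the training data, which here contains only 3 unique movies. A small guard, sketched with a hypothetical helper and toy data, can flag such items before the model is queried.

```python
import pandas as pd

def items_missing_from_training(train_df, item_col, item_ids):
    """Return the requested item ids that never appear in the training data."""
    known = set(train_df[item_col].unique())
    return [item for item in item_ids if item not in known]

# Toy training data containing movies 1, 2 and 3 only.
toy_train = pd.DataFrame({'movie_id': [1, 1, 2, 3]})
missing = items_missing_from_training(toy_train, 'movie_id', [2, 16240])
print(missing)
```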


# Quantitative comparison between the models

We now formally compare the popularity and the personalized models using precision-recall curves. 
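
The `metrics` module used for this comparison is part of `src` and is not reproduced here. As a hedged sketch of the underlying quantities: precision at k is the fraction of the top-k recommendations that appear in the user's held-out items, and recall at k is the fraction of held-out items that appear in the top-k. The function name and sample lists are illustrative.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision, recall) for the first k recommended items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative item ids: a ranked recommendation list and held-out test items.
recommended = [10, 20, 30, 40, 50]
relevant = [20, 50, 60]

for k in (1, 3, 5):
    print(k, precision_recall_at_k(recommended, relevant, k))
```

Sweeping k and averaging the two values over a sample of test users yields the points of a precision-recall curve for each model.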

## Precision and recall to measure the quality of recommendations

In [39]:
train_data.shape

(80000, 6)

In [40]:
train_data.head()

| | movie_id | title | genres | user_id | rating | timestamp |
|---|---|---|---|---|---|---|
| 10382 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 42534 | 4.0 | 981571500 |
| 73171 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 74688 | 3.5 | 1472861478 |
| 30938 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 126997 | 4.0 | 1436561086 |
| 99310 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 127781 | 1.5 | 1056431462 |
| 58959 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 241602 | 3.5 | 1448244208 |


## Code to plot the precision-recall curve

In [109]:
import matplotlib.pyplot as pl

# Plot the precision-recall curves of two models on the same axes
def plot_precision_recall(m1_precision_list, m1_recall_list, m1_label,
                          m2_precision_list, m2_recall_list, m2_label):
    pl.clf()
    # Recall goes on the x-axis and precision on the y-axis
    pl.plot(m1_recall_list, m1_precision_list, label=m1_label)
    pl.plot(m2_recall_list, m2_precision_list, label=m2_label)
    pl.xlabel('Recall')
    pl.ylabel('Precision')
    pl.ylim([0.0, 0.20])
    pl.xlim([0.0, 0.20])
    pl.title('Precision-Recall curve')
    # Place the legend below the plot, centered horizontally
    pl.legend(loc=9, bbox_to_anchor=(0.5, -0.2))
    pl.show()


The curve shows that the personalized model performs much better than the popularity model.