<h1>Simple Recommender</h1>
A simple recommender that is based on the popularity/ratings of movies.

The popularity-based recommender system will perform the following steps:
1. Retrieve user, item and activity data
2. Generate item profiles
3. Generate user profiles
4. Generate the recommendation engine model
5. Suggest the top N recommendations

<h2>Step 1 - Retrieve Data</h2>
The first step is always to gather the data and load it into the programming environment.

For our use case, we download the MovieLens dataset, which contains three sets of data:
 - Movie data containing a certain movie's information, such as movieID, release date, URL, genre details, and so on
 - User data containing the user information, such as userID, age, gender, occupation, ZIP code, and so on
 - Ratings data containing userID, itemID, rating, timestamp

In [2]:
# Import the libraries that are going to be used here
import pandas as pd
import numpy as np
import scipy
import sklearn

In [3]:
# Column headers for the dataset
data_cols = ['user id','movie id','rating','timestamp']
item_cols = ['movie id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
user_cols = ['user id','age','gender','occupation', 'zip code']

In [4]:
# List of users
df_u_user = pd.read_csv('/home/nbuser/library/dataset/u.user', header=None, sep='|', names=user_cols, encoding='latin-1')
df_u_user = df_u_user.sort_values('user id', ascending=True)
df_u_user.columns
df_u_user.head(10)

Unnamed: 0,user id,age,gender,occupation,zip code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [5]:
# List of movie items
df_u_item = pd.read_csv('/home/nbuser/library/dataset/u.item', header=None, sep='|', names=item_cols, encoding='latin-1')
df_u_item = df_u_item.sort_values('movie id', ascending=True)
df_u_item.columns
df_u_item.head(10)

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Childrens,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995),0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,9,Dead Man Walking (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Dead%20Man%20...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,10,Richard III (1995),22-Jan-1996,,http://us.imdb.com/M/title-exact?Richard%20III...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [6]:
# User activity data
df_u_data = pd.read_csv('/home/nbuser/library/dataset/u.data', header=None, sep='\t', names=data_cols, encoding='latin-1')
df_u_data = df_u_data.sort_values('user id', ascending=True)
df_u_data.columns
df_u_data.head(10)

Unnamed: 0,user id,movie id,rating,timestamp
66567,1,55,5,875072688
62820,1,203,4,878542231
10207,1,183,5,875072262
9971,1,150,5,876892196
22496,1,68,4,875072688
9811,1,201,3,878542960
9722,1,157,4,876892918
9692,1,184,4,875072956
9566,1,210,4,878542909
9382,1,163,4,875072442


<h2>Creating A Simple Recommender</h2>

In [8]:
# Merge the three dataframes into one. With no keys given, pd.merge joins on the
# columns the frames share: 'movie id' for items and ratings, then 'user id' for users.
dataset = pd.merge(pd.merge(df_u_item, df_u_data), df_u_user)
dataset.head(10)

Unnamed: 0,movie id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Childrens,...,Thriller,War,Western,user id,rating,timestamp,age,gender,occupation,zip code
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,1,5,874965758,24,M,technician,85711
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,1,0,0,1,3,876893171,24,M,technician,85711
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,1,0,0,1,4,878542960,24,M,technician,85711
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,1,3,876893119,24,M,technician,85711
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,1,0,0,1,3,889751712,24,M,technician,85711
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,0,0,0,0,0,...,0,0,0,1,5,887431973,24,M,technician,85711
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,0,0,0,0,0,...,0,0,0,1,4,875071561,24,M,technician,85711
7,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995),0,0,0,0,1,...,0,0,0,1,1,875072484,24,M,technician,85711
8,9,Dead Man Walking (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Dead%20Man%20...,0,0,0,0,0,...,0,0,0,1,5,878543541,24,M,technician,85711
9,10,Richard III (1995),22-Jan-1996,,http://us.imdb.com/M/title-exact?Richard%20III...,0,0,0,0,0,...,0,1,0,1,3,875693118,24,M,technician,85711
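
The merge above relies on the frames sharing column names. As a minimal sketch on made-up toy frames (the values and the variable names `items`, `ratings`, `users` and `merged` are assumptions for illustration, not part of the MovieLens data), the same join can be written with the keys spelled out, which makes it obvious what is being matched:

```python
import pandas as pd

# Toy stand-ins for the item, activity and user tables (values are made up)
items = pd.DataFrame({'movie id': [1, 2], 'movie title': ['A', 'B']})
ratings = pd.DataFrame({'user id': [10, 10, 11],
                        'movie id': [1, 2, 1],
                        'rating': [5, 3, 4]})
users = pd.DataFrame({'user id': [10, 11], 'age': [24, 53]})

# Name the join keys explicitly instead of relying on shared column names
merged = (items.merge(ratings, on='movie id', how='inner')
               .merge(users, on='user id', how='inner'))
print(merged.shape)  # one row per rating, with item and user columns attached
```

Each rating row ends up with the matching movie and user details attached, which is the shape the later groupby steps work on.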


Next we use groupby to group the rows by movie title, then use the size function to return the number of rows under each title. Because the merged dataframe has one row per rating, this gives the number of ratings each movie received.



In [9]:
ratings_total = dataset.groupby('movie title').size()
print(ratings_total.head())

movie title
'Til There Was You (1997)      9
1-900 (1994)                   5
101 Dalmatians (1996)        109
12 Angry Men (1957)          125
187 (1997)                    41
dtype: int64
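
To make the behaviour of size concrete, here is a small sketch on a made-up frame (the names `toy`, `by_size` and `by_count` are assumptions for illustration). It contrasts size, which counts rows, with count, which counts only non-missing values. On the merged dataset the two should agree as long as no rating is missing.

```python
import pandas as pd

# Made-up ratings; one rating for movie 'A' is missing on purpose
toy = pd.DataFrame({'movie title': ['A', 'B', 'A', 'A', 'B'],
                    'rating': [5, 3, 4, None, 2]})

# size() counts every row in the group, including rows with a missing rating
by_size = toy.groupby('movie title').size()

# count() counts only the non-missing ratings in each group
by_count = toy.groupby('movie title')['rating'].count()

print(by_size)
print(by_count)
```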


Next we compute the mean rating of each movie. We group by movie title, select only the rating column, and apply the mean function. The movie title does not need to be selected, because as the grouping key it becomes the index of the result.



In [10]:
# Double brackets keep the result a DataFrame with a single 'rating' column
ratings_mean = dataset.groupby('movie title')[['rating']].mean()
print(ratings_mean.head())

                             rating
movie title                        
'Til There Was You (1997)  2.333333
1-900 (1994)               2.600000
101 Dalmatians (1996)      2.908257
12 Angry Men (1957)        4.344000
187 (1997)                 3.024390
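
As an alternative sketch, the mean and the number of ratings can be computed in a single pass with named aggregation, which leaves nothing to merge afterwards. The toy data and the names `toy`, `summary` and `total_ratings` are assumptions for illustration; the notebook itself keeps the two results separate.

```python
import pandas as pd

# Made-up ratings for two movies
toy = pd.DataFrame({'movie title': ['A', 'B', 'A', 'A', 'B'],
                    'rating': [5, 3, 4, 3, 2]})

# One groupby, two aggregates: mean rating and number of ratings per title
summary = (toy.groupby('movie title')
              .agg(rating=('rating', 'mean'),
                   total_ratings=('rating', 'count'))
              .reset_index())
print(summary)
```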


If you check ratings_total, you will find that it is a Series and not a DataFrame, so we convert it into a DataFrame. In ratings_mean, the movie title has been turned from a column into the index, so we make it a column again.

In [11]:
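# Reshape both results so that they can be merged on 'movie title'
ratings_total = pd.DataFrame({'movie title': ratings_total.index, 'total ratings': ratings_total.values})
ratings_mean['movie title'] = ratings_mean.index
# Drop the index so 'movie title' exists only as a column; recent pandas versions
# refuse to merge on a name that is both an index level and a column
ratings_mean = ratings_mean.reset_index(drop=True)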
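
A shorter way to get the same two shapes, sketched here on made-up data (the names `toy`, `counts` and `means` are assumptions for illustration), is to call reset_index, which moves the index back into an ordinary column. Note that it places movie title first, so the column order differs from the output shown in this notebook.

```python
import pandas as pd

# Made-up ratings for two movies
toy = pd.DataFrame({'movie title': ['A', 'B', 'A'], 'rating': [5, 3, 4]})

# Series -> DataFrame: the index becomes a column, the values get the given name
counts = toy.groupby('movie title').size().reset_index(name='total ratings')

# DataFrame with 'movie title' as index -> 'movie title' as a regular column
means = toy.groupby('movie title')[['rating']].mean().reset_index()

print(counts)
print(means)
```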

Now we merge the two dataframes and sort the result by total ratings in descending order, so that the movies rated by the most people come first.

In [12]:
final = pd.merge(ratings_mean, ratings_total, on='movie title').sort_values(by='total ratings', ascending=False)
print(final.head())

        rating                movie title  total ratings
1398  4.358491           Star Wars (1977)            583
333   3.803536             Contact (1997)            509
498   4.155512               Fargo (1996)            508
1234  4.007890  Return of the Jedi (1983)            507
860   3.156701           Liar Liar (1997)            485


We need to look at the basic characteristics of the data to decide on a minimum cutoff for total ratings, because it is not reliable to recommend a movie whose high mean rating comes from only 10 or so people. Here we keep the 300 most-rated movies and rank them by mean rating.

In [13]:
print(final.describe())
# final is already sorted by 'total ratings', so the first 300 rows are the 300 most-rated movies
final = final[:300].sort_values(by='rating', ascending=False)
print(final.head())

            rating  total ratings
count  1664.000000    1664.000000
mean      3.077018      60.096154
std       0.780418      80.956484
min       1.000000       1.000000
25%       2.665094       7.000000
50%       3.162132      27.000000
75%       3.651808      80.250000
max       5.000000     583.000000
        rating                       movie title  total ratings
1281  4.466443           Schindler's List (1993)            298
1652  4.466102        Wrong Trousers, The (1993)            118
273   4.456790                 Casablanca (1942)            243
1317  4.445230  Shawshank Redemption, The (1994)            283
1215  4.387560                Rear Window (1954)            209
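
The cell above keeps a fixed number of the most-rated movies. A variation, sketched below under stated assumptions, applies an explicit minimum number of ratings and then returns the top N by mean rating. The helper `top_n_popular`, the toy data and the threshold of 100 are hypothetical choices for illustration; a suitable threshold for the real data would come from the describe output above.

```python
import pandas as pd

def top_n_popular(summary, min_ratings, n):
    """Return the n best-rated titles among those with at least min_ratings ratings."""
    # Discard movies whose mean rating rests on too few ratings
    eligible = summary[summary['total ratings'] >= min_ratings]
    # Rank the remaining movies by mean rating, best first
    return eligible.sort_values(by='rating', ascending=False).head(n)

# Made-up summary table in the same layout as `final`
toy = pd.DataFrame({'movie title': ['A', 'B', 'C', 'D'],
                    'rating': [4.9, 4.2, 3.1, 4.5],
                    'total ratings': [3, 250, 400, 120]})

top = top_n_popular(toy, min_ratings=100, n=2)
print(top['movie title'].tolist())  # ['D', 'B']
```

Movie A has the highest mean but only 3 ratings, so it is filtered out before ranking.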
