# Acquiring the Data

Dataset acquired from [GroupLens](http://grouplens.org/datasets/movielens/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2021-01-01)\
Download the dataset using !wget from IBM Object Storage

In [1]:
!wget -O moviedataset.zip https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%205/data/moviedataset.zip
print('unziping ...')
!unzip -o -j moviedataset.zip 

--2021-09-09 21:32:40--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%205/data/moviedataset.zip
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160301210 (153M) [application/zip]
Saving to: ‘moviedataset.zip’


2021-09-09 21:32:44 (41.1 MB/s) - ‘moviedataset.zip’ saved [160301210/160301210]

unziping ...
Archive:  moviedataset.zip
  inflating: links.csv               
  inflating: movies.csv              
  inflating: ratings.csv             
  inflating: README.txt              
  inflating: tags.csv                
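
Outside a notebook, the `!unzip -o -j` step can be approximated with Python's standard library. The sketch below is an assumption about how one might do it, not part of the original lab: `extract_flat` is a hypothetical helper that overwrites existing files (like `-o`) and drops any folder structure stored in the archive (like `-j`).

```python
import os
import shutil
import zipfile

def extract_flat(zip_path, dest_dir="."):
    """Hypothetical helper: extract every file into dest_dir, ignoring stored folders."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            if member.is_dir():
                continue  # directory entries are skipped, as with -j
            name = os.path.basename(member.filename)
            if not name:
                continue
            target = os.path.join(dest_dir, name)
            # Opening the target with "wb" overwrites an existing file, as with -o
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(name)
    return extracted
```

Calling `extract_flat('moviedataset.zip')` after the download would write the CSV files into the working directory.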


# Preprocessing

In [2]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
#Movie info
movdf = pd.read_csv('movies.csv')
#User info
rdf = pd.read_csv('ratings.csv')

## Movie Dataframe

In [4]:
movdf.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Each movie has a unique ID, a title that includes its release year in parentheses, and a pipe-separated list of genres packed into a single field.

In [5]:
#Separate the title from its release year
#Create a 'year' column by extracting a 4-digit number wrapped in parentheses
#(the parentheses avoid matching movies that have a year in their title)
movdf['year'] = movdf.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#Remove the parentheses from the 'year' column
movdf['year'] = movdf.year.str.extract(r'(\d\d\d\d)', expand=False)
#Remove the year from the 'title' column; regex=True is needed because the pattern is a regular expression
movdf['title'] = movdf.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Strip any leading or trailing whitespace left behind
movdf['title'] = movdf['title'].apply(lambda x: x.strip())


In [6]:
#Dropping the genres column
movdf = movdf.drop(columns='genres')

In [7]:
movdf.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995
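
The title/year split can be checked on a couple of toy titles. This is a minimal sketch with made-up rows (not read from movies.csv); the second title contains a year of its own, which shows why the pattern insists on parentheses.

```python
import pandas as pd

# Toy titles for illustration; the second has a 4-digit number inside the title itself
toy = pd.DataFrame({'title': ['Toy Story (1995)', '2001: A Space Odyssey (1968)']})

# Capture only the 4 digits that sit inside parentheses
toy['year'] = toy['title'].str.extract(r'\((\d{4})\)', expand=False)
# Remove the parenthesised year, then trim the leftover whitespace
toy['title'] = toy['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(toy)
```

Note that `year` stays a string column here, as it does in the notebook.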


## Ratings dataframe

In [8]:
rdf.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


Each observation consists of a user ID, the ID of a movie that user rated, the rating, and a timestamp showing when the review was made. I don't need the timestamp column.

In [9]:
rdf = rdf.drop(columns='timestamp')
rdf.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


# Collaborative Filtering

## Input Movies dataframe

In [10]:
#Creating an input user to recommend movies to
VanInput = [
            {'title':'Jumanji','rating':2},
            {'title':'Toy Story', 'rating':6.5},
            {'title':'Mulan', 'rating':4},
            {'title':"Cruella", 'rating':8.9},
            {'title':'Pulp Fiction', 'rating':7.5}
         ] 
inp_mov = pd.DataFrame(VanInput)
inp_mov

Unnamed: 0,title,rating
0,Jumanji,2.0
1,Toy Story,6.5
2,Mulan,4.0
3,Cruella,8.9
4,Pulp Fiction,7.5


Look up the input movies' IDs in the movies dataframe

In [11]:
#Filter the movies table down to the titles that appear in inp_mov
inputId = movdf[movdf['title'].isin(inp_mov['title'].tolist())]
#Merge on the shared 'title' column (pd.merge picks it implicitly) to bring in the 'rating' column
inp_mov = pd.merge(inputId, inp_mov)
#Drop the unneeded 'year' column
inp_mov = inp_mov.drop(columns='year')
#P.S.: Not all user input movies are in the original dataframe, and one title can match more than one movieId
inp_mov

Unnamed: 0,movieId,title,rating
0,1,Toy Story,6.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,7.5
3,1907,Mulan,4.0
4,90620,Mulan,4.0
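
The result above has no row for Cruella and two rows for Mulan. A small sketch with toy frames (the names `movies` and `user` are illustrative, not from the notebook) shows how a title-based merge produces both effects.

```python
import pandas as pd

movies = pd.DataFrame({'movieId': [1, 1907, 90620],
                       'title': ['Toy Story', 'Mulan', 'Mulan']})
user = pd.DataFrame({'title': ['Toy Story', 'Mulan', 'Cruella'],
                     'rating': [6.5, 4.0, 8.9]})

# With no `on=` argument, pd.merge joins on the columns both frames share ('title' here).
# The default how='inner' drops titles with no match and repeats a rating
# for every movieId that carries the same title.
merged = pd.merge(movies, user)
print(merged)
```

Matching on title alone is therefore ambiguous when two films share a name; keeping the year in the join key would be one way to separate them.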


## Users Subset dataframe

Get the subset of users who have watched and reviewed the input movies (matched by movieId)

In [12]:
#Keep only the ratings for movies that the input user has also rated
userSubset = rdf[rdf['movieId'].isin(inp_mov['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
479,13,2,2.0
681,14,296,2.0
749,15,1,4.0
776,15,296,3.0


In [13]:
#groupby splits the dataframe into sub-dataframes, one per distinct value of the given column
#Passing the column name as a string (not a list) keeps each group key a plain user id
usg = userSubset.groupby('userId')

In [14]:
#Ex: Checking on id 17
usg.get_group(17)

Unnamed: 0,userId,movieId,rating
1247,17,1,5.0
1248,17,2,3.0
1333,17,296,2.0
1822,17,1907,3.0


In [15]:
#Sort the groups so that users who share the most movies with the input come first
#Only the first groups are used later, so this keeps the users with the best overlap
usg = sorted(usg,  key=lambda x: len(x[1]), reverse=True)

In [16]:
usg[:1]

[(1040,
         userId  movieId  rating
  96689    1040        1     3.0
  96690    1040        2     1.5
  96733    1040      296     3.5
  96915    1040     1907     4.5
  97499    1040    90620     3.5)]

I will select a subset of users to iterate through \
This limit is imposed to avoid wasting too much time going through every single user

In [17]:
usg = usg[0:100]
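
The group, sort, and slice steps above can be sketched on a toy ratings table. The rows below are illustrative, and the cut-off of 2 stands in for the notebook's 100.

```python
import pandas as pd

ratings = pd.DataFrame({'userId':  [4, 13, 17, 17, 17, 15, 15],
                        'movieId': [296, 2, 1, 2, 296, 1, 296],
                        'rating':  [4.0, 2.0, 5.0, 3.0, 2.0, 4.0, 3.0]})

# Iterating a groupby yields (key, sub-dataframe) pairs
groups = ratings.groupby('userId')
# Largest groups first: these users share the most movies with the input
ranked = sorted(groups, key=lambda pair: len(pair[1]), reverse=True)
# Keep only the first n groups
top = ranked[:2]

print([user_id for user_id, _ in top])
```

After `sorted`, the result is a plain list of tuples rather than a groupby object, which is why slicing works.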

## Pearson Corr
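
The code in this section uses the computational form of the Pearson correlation coefficient. Writing $x_i$ for the input user's ratings and $y_i$ for another user's ratings on the $n$ movies they share (symbols chosen here to mirror the `Sxx`, `Syy` and `Sxy` variables in the code):

$$
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
$$

The coefficient lies between -1 and 1. It is undefined when $S_{xx}$ or $S_{yy}$ is zero (all ratings on one side are identical), and the code records 0 in that case.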

In [18]:
#Pearson correlation between the input user and each user in the subset, stored in a dictionary
#where the key is the user id and the value is the coefficient
pearsonCorrelationDict = {}

#Sort the input once so its ratings line up with each group's ratings by movieId
inp_mov = inp_mov.sort_values(by='movieId')

#For every user group in our subset
for userId, group in usg:
    #Sort the current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the input's scores for the movies that both users have in common
    temp_df = inp_mov[inp_mov['movieId'].isin(group['movieId'].tolist())]
    #Store them in list format to simplify the calculations
    tempRatingList = temp_df['rating'].tolist()
    #Do the same for the current user group's scores
    tempGroupList = group['rating'].tolist()
    #Pearson correlation between the two users, called x (input) and y (group)
    Sxx = sum(x**2 for x in tempRatingList) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum(y**2 for y in tempGroupList) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum(x*y for x, y in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)

    #If the denominator is non-zero, divide; otherwise record a correlation of 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[userId] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[userId] = 0
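
As a sanity check, the same calculation can be wrapped in a function and compared with NumPy. This is a sketch, and `pearson` is a hypothetical helper name. The inputs are the input user's ratings for movieIds 1, 2, 296 and 1907 and user 17's ratings for the same movies, as shown in the outputs above.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Computational-form Pearson correlation; returns 0 when either side has no variance."""
    n = len(x)
    sxx = sum(a**2 for a in x) - sum(x)**2 / n
    syy = sum(b**2 for b in y) - sum(y)**2 / n
    sxy = sum(a*b for a, b in zip(x, y)) - sum(x)*sum(y) / n
    if sxx == 0 or syy == 0:
        return 0
    return sxy / sqrt(sxx * syy)

x = [6.5, 2.0, 7.5, 4.0]   # input user, ordered by movieId
y = [5.0, 3.0, 2.0, 3.0]   # user 17, same movies in the same order

# Both values should agree up to floating-point rounding
print(pearson(x, y), np.corrcoef(x, y)[0, 1])
```

Both values come out at about 0.0533, in line with the coefficient stored for user 17 in the dictionary.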

In [19]:
pearsonCorrelationDict.items()

dict_items([(1040, 0.41039099474342067), (13493, 0.7018203569799772), (53030, 0.43048326528833586), (63572, 0.38968659876683565), (109213, 0.7480544714310469), (132030, 0.40524829064032775), (155986, -0.4262762762934024), (178786, 0.7579643954569074), (198941, 0.7954873853310125), (227112, 0.5853345883431811), (228452, 0.6626870906293719), (236434, 0.7125812379830004), (17, 0.053338074706266496), (114, 0.7397954428741078), (222, -0.2684624220856097), (241, -0.1513415349215006), (277, 0.8053872662568292), (340, 0.9299811099505543), (393, 0.7710996009560598), (407, 0.9009861607017401), (479, -0.8790079680805661), (670, 0.7009996372327816), (683, 0.8000711205939974), (815, 0.8740200053736314), (904, 0.3449417900125087), (1130, -0.11624763874381927), (1204, 0.9299811099505543), (1348, 0.0), (1414, 0.9441196694198674), (1599, 0.6711560552140243), (1607, -0.1643989873053573), (1615, 0.2684624220856097), (1621, 0.34874291623145787), (1629, 0.9867967802605516), (1643, 0.5607997097862253), (167

In [20]:
per = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
per.columns = ['similarityIndex']
per['userId'] = per.index
per.index = range(len(per))
per.head()

Unnamed: 0,similarityIndex,userId
0,0.410391,1040
1,0.70182,13493
2,0.430483,53030
3,0.389687,63572
4,0.748054,109213


In [21]:
#Top 50 users most similar to the input
topUsers=per.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
33,0.986797,1629
57,0.986394,3388
79,0.967239,6157
52,0.949158,2791
28,0.94412,1414


In [22]:
#Bring in every rating made by the top users from the ratings df; each row keeps that user's similarityIndex
top_rat=topUsers.merge(rdf, on='userId', how='inner')
top_rat.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.986797,1629,1,4.5
1,0.986797,1629,2,2.5
2,0.986797,1629,3,2.5
3,0.986797,1629,5,2.5
4,0.986797,1629,6,3.5


Multiply each rating by its weight (the user's similarity index), sum the weighted ratings per movie, and divide by the sum of the weights.\
In pandas terms: multiply two columns, group the dataframe by movieId, then divide two columns.\
The result is a weighted average of the similar users' ratings for each candidate movie:

In [23]:
#Multiplies the similarity by the user's ratings
top_rat['weightedRating'] = top_rat['similarityIndex']*top_rat['rating']
top_rat.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.986797,1629,1,4.5,4.440586
1,0.986797,1629,2,2.5,2.466992
2,0.986797,1629,3,2.5,2.466992
3,0.986797,1629,5,2.5,2.466992
4,0.986797,1629,6,3.5,3.453789
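
In symbols (notation chosen here for illustration): let $U_m$ be the set of top users who rated movie $m$, $s_u$ the similarityIndex of user $u$, and $r_{u,m}$ that user's rating of $m$. Each `weightedRating` value is one term $s_u\, r_{u,m}$ of the numerator, and the recommendation score is

$$
\text{score}(m) = \frac{\sum_{u \in U_m} s_u\, r_{u,m}}{\sum_{u \in U_m} s_u}
$$

For movieId 1 in this notebook's outputs, the two sums are 170.99888 and 41.944079, which gives a score of about 4.0768.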


In [24]:
#Sum the similarity and the weighted rating for each movie after grouping top_rat by movieId
temptop_rat = top_rat.groupby('movieId')[['similarityIndex','weightedRating']].sum()
temptop_rat.columns = ['sum_similarityIndex','sum_weightedRating']
temptop_rat.head()

movieId,sum_similarityIndex,sum_weightedRating
1,41.944079,170.99888
2,41.944079,113.015682
3,13.12017,34.592224
4,4.262199,11.028845
5,14.705382,39.584802


In [25]:
#Creates an empty df
rec_df = pd.DataFrame()
#Now we take the weighted average
rec_df['weighted average recommendation score'] = temptop_rat['sum_weightedRating']/temptop_rat['sum_similarityIndex']
rec_df['movieId'] = temptop_rat.index
rec_df.head()

movieId,weighted average recommendation score,movieId
1,4.07683,1
2,2.694437,2
3,2.636568,3
4,2.587595,4
5,2.691858,5
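
The whole weighted-average step can be reproduced on a toy table. This is a minimal sketch with made-up similarities and ratings, using the same column names as the notebook.

```python
import pandas as pd

# Toy data: two similar users (weights 0.9 and 0.5) rating two movies
toy = pd.DataFrame({'similarityIndex': [0.9, 0.9, 0.5],
                    'movieId':         [1, 2, 1],
                    'rating':          [4.0, 2.0, 5.0]})

# Weight each rating by the similarity of the user who gave it
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']
# Sum weights and weighted ratings per movie, then divide
sums = toy.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()
scores = sums['weightedRating'] / sums['similarityIndex']

print(scores)
```

Movie 2 is rated by a single user, so its score is simply that user's rating. The same effect may explain scores of exactly 5.0 for movies that only one or a few similar users have rated.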


In [26]:
#Top 10 movies that the algorithm recommended
rec_df = rec_df.sort_values(by='weighted average recommendation score', ascending=False)
rec_df.head(10)

movieId,weighted average recommendation score,movieId
7091,5.0,7091
6286,5.0,6286
5747,5.0,5747
363,5.0,363
26524,5.0,26524
6509,5.0,6509
58627,5.0,58627
26788,5.0,26788
6434,5.0,6434
6368,5.0,6368


In [27]:
movdf.loc[movdf['movieId'].isin(rec_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
359,363,"Wonderful, Horrible Life of Leni Riefenstahl, ...",1993
5649,5747,Gallipoli,1981
6188,6286,"Man Without a Past, The (Mies vailla menneisyy...",2002
6264,6368,Cinemania,2002
6326,6434,"Objective, Burma!",1945
6400,6509,Ali: Fear Eats the Soul (Angst essen Seele auf),1974
6980,7091,Horse Feathers,1932
8902,26524,"Times of Harvey Milk, The",1984
9058,26788,"Story of Qiu Ju, The (Qiu Ju da guan si)",1992
12549,58627,Never Back Down,2008


The table above lists the top 10 movies recommended for the user whose ratings were given in VanInput.
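
Filtering with `isin` returns rows in the order of `movdf`, not in ranking order. If the ranking should be visible next to the titles, a left merge is one option. The sketch below uses toy stand-ins (`rec`, `movies`, and the `score` column are illustrative names, not the notebook's).

```python
import pandas as pd

# Toy stand-ins for rec_df and movdf (values are illustrative)
rec = pd.DataFrame({'movieId': [30, 10, 20],
                    'score':   [4.8, 4.5, 4.1]})   # already sorted by score
movies = pd.DataFrame({'movieId': [10, 20, 30],
                       'title':   ['A', 'B', 'C']})

# A left merge keeps the row order of the left frame, so the ranking survives
ranked = rec.merge(movies, on='movieId', how='left')
print(ranked)
```

In the notebook itself, `rec_df` carries `movieId` both as its index name and as a column, so it would likely need `rec_df.reset_index(drop=True)` before such a merge.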