# Movie Recommendation Systems

by **Young Hun Ji**

June 8, 2021

*Note: This notebook is based on a project submitted to Coursera in partial fulfillment of the requirements for the IBM DATA SCIENCE PROFESSIONAL CERTIFICATE*

!["auto1"](cover.jpg "cover")

## Overview

In this mini-project, I developed two simple movie recommendation systems using **(1) content-based filtering**, which recommends movies similar in genre to those rated highly by the user, and **(2) collaborative filtering**, which recommends movies rated highly by other users with similar preferences.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1.  <a href="#chapter1">Loading the Dataset</a><br>
2.  <a href="#chapter2">Preprocessing</a><br>
3.  <a href="#chapter3">Content-Based Filtering</a><br>
4.  <a href="#chapter4">Collaborative Filtering</a><br>
    </font>
    </div>

## 1. Loading the Dataset <a class="anchor" id="chapter1"></a>

Prior to loading the dataset, I imported all dependencies required for the analysis:

In [1]:
import pandas as pd
from math import sqrt
import urllib.request
from zipfile import ZipFile

I downloaded the data file from the URL provided by IBM as follows:

In [2]:
moviedataset = urllib.request.urlretrieve("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%205/data/moviedataset.zip", "moviedataset.zip")

After downloading the data file, I extracted the two datasets used in this analysis: `movies.csv` and `ratings.csv`:

In [3]:
with ZipFile("moviedataset.zip", 'r') as zipf:
    # Extracting movies.csv and loading the movie information into a pandas dataframe
    movies_df = pd.read_csv(zipf.extract("ml-latest/movies.csv"))
    # Extracting ratings.csv and loading the user ratings into a pandas dataframe
    ratings_df = pd.read_csv(zipf.extract("ml-latest/ratings.csv"))

## 2. Preprocessing <a class="anchor" id="chapter2"></a>

Inspecting the `movies.csv` dataset:

In [4]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


I removed the years from the **title** column and stored them in a new **year** column as follows:

In [5]:
# Including the parentheses in the pattern to avoid conflicts with movies that have years in their titles
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# Removing the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

# Removing the years from the 'title' column (regex=True is needed in pandas 2.0+, where the default is a literal match)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

# Applying the strip function to get rid of any leading or trailing whitespace left behind
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())

# Viewing the first 5 rows
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
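
The same extraction can be done in a single pass by placing the capture group inside the parentheses. The snippet below is only a minimal sketch on a hypothetical three-title sample (the variable names `titles`, `years` and `clean_titles` are not part of the notebook); it also shows that a year appearing inside the title itself is left alone.

```python
import pandas as pd

# Toy titles (hypothetical sample); the second has a year-like number in the title itself
titles = pd.Series(["Toy Story (1995)", "2001: A Space Odyssey (1968)", "Untitled Film"])

# The literal parentheses are matched but only the four digits are captured
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# regex=True is passed explicitly because its default differs across pandas versions
clean_titles = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True).str.strip()

print(years.tolist())
print(clean_titles.tolist())
```

Titles without a parenthesized year yield a missing value in `years`, which is worth checking for before converting the column to a numeric type.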


Next, I split the values in the **genres** column into a list of genres:

In [6]:
# Splitting on "|"
movies_df['genres'] = movies_df.genres.str.split('|')

# Viewing the first 5 rows
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


I then applied one-hot encoding to each genre category and stored the result as a new dataframe:

In [7]:
# Creating a copy of the dataframe
moviesWithGenres_df = movies_df.copy()

# Iterating through the list of genres for each row and placing a 1 into the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
        
# Filling in the NaN values with 0
moviesWithGenres_df = moviesWithGenres_df.fillna(0)

# Viewing the first 5 rows
moviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
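
Row-by-row iteration can be slow on a large movie table. As a possible alternative (a sketch on a three-movie toy frame, not the approach used in this notebook), the genre lists can be joined back into delimited strings and passed to `Series.str.get_dummies`, which builds all indicator columns at once. The names `toy`, `dummies` and `toy_with_genres` are hypothetical.

```python
import pandas as pd

# Toy frame mirroring the structure of movies_df after the genres were split into lists
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
    ],
})

# Join each list into "A|B|C", then expand into one 0/1 column per genre
dummies = toy["genres"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the original frame
toy_with_genres = pd.concat([toy, dummies], axis=1)
print(toy_with_genres.drop(columns="genres"))
```

Note that this variant produces integer indicators in alphabetical column order, whereas the loop above produces floats in order of first appearance.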


Next, inspecting the `ratings.csv` dataset:

In [8]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


As shown, each row contained a user ID, the ID of a movie rated by that user, the rating given, and a timestamp showing when the rating was submitted. For this analysis, I didn't need the timestamp column and thus removed it as follows:

In [9]:
# Removing the timestamp column (the positional axis argument was removed in pandas 2.0)
ratings_df = ratings_df.drop(columns='timestamp')

# Viewing the first 5 rows
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


Checking the shape of the dataframe:

In [10]:
ratings_df.shape

(22884377, 3)

So there were 22,884,377 user ratings in total.

## 3. Content-Based Filtering <a class="anchor" id="chapter3"></a>

First, I created a recommendation system that uses **content-based filtering**. This method aims to identify a user's favorite aspects of an item and then recommend other items that share those attributes. In this case, I created a system that recommends movies sharing **similar genre characteristics** with those rated highly by the user.

I began by creating a hypothetical user with the following movie ratings:

In [11]:
# User ratings
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 

# Saving to pandas dataframe
inputMovies = pd.DataFrame(userInput)

# Viewing the dataframe
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


Next, I extracted the **movie IDs** from the movies dataframe and added them to the user input dataframe as follows:

In [12]:
# Filtering out the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

# Merging 
inputMovies = pd.merge(inputId, inputMovies)

# Dropping unused columns
inputMovies = inputMovies.drop(columns=['genres', 'year'])

# Viewing the resulting dataframe
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


I then created a dataframe depicting the user's genre preferences. This was done by extracting the subset of movies rated by the user from the `moviesWithGenres_df` dataframe:

In [13]:
# Filtering out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]

# Viewing the resulting dataframe
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
293,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1246,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1885,1968,"Breakfast Club, The","[Comedy, Drama]",1985,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Since only the genre table was needed, I dropped the movieId, title, genres and year columns:

In [14]:
# Resetting the index
userMovies = userMovies.reset_index(drop=True)

# Dropping unused columns
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])

# Viewing the resulting dataframe
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


From there, I computed the user's **weighted genre preferences**. This was done by taking the user's ratings and multiplying them with the user genre table and then summing the resulting table by columns. This was essentially the **dot product** between a vector (i.e., user ratings) and a matrix (i.e., user genre table).

In [15]:
# User ratings vector
inputMovies['rating']

0    3.5
1    2.0
2    5.0
3    4.5
4    5.0
Name: rating, dtype: float64

In [16]:
# Dot product between user ratings vector and user genre table
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])

# User profile (i.e., user's weighted genre preferences)
userProfile

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
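
To make the arithmetic concrete, the sketch below recomputes a few entries of the user profile from the values shown above, restricted to five genres for brevity. The names `genre_subset`, `ratings` and `profile` are hypothetical. For example, Adventure appears in Toy Story (3.5), Jumanji (2.0) and Akira (4.5), so its weight is 3.5 + 2.0 + 4.5 = 10.0.

```python
import pandas as pd

# Five genre columns taken from the user genre table above
# (rows: Toy Story, Jumanji, Pulp Fiction, Akira, Breakfast Club)
genre_subset = pd.DataFrame(
    {
        "Adventure": [1, 1, 0, 1, 0],
        "Animation": [1, 0, 0, 1, 0],
        "Children":  [1, 1, 0, 0, 0],
        "Comedy":    [1, 0, 1, 0, 1],
        "Drama":     [0, 0, 1, 0, 1],
    }
)

# The focal user's ratings, in the same row order
ratings = pd.Series([3.5, 2.0, 5.0, 4.5, 5.0])

# Each genre weight is the sum of the ratings of the movies carrying that genre
profile = genre_subset.T.dot(ratings)
print(profile)
```

The dot product relies on the rows of the genre table and the ratings vector being in the same movie order, which is why the index was reset beforehand.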

After computing the "user profile" above, I created a system that recommends movies that best fit the user's weighted genre preferences.

First, I extracted only genre columns from the original dataframe:

In [17]:
# Extracting the genres of every movie in the original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])

# Dropping unnecessary columns
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])

# Viewing the first 5 rows
genreTable.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, for each movie, I multiplied the genre indicators by the user's genre preference weights, summed the products, and divided the result by the sum of the weights. Essentially, I computed **weighted average scores** reflecting the extent to which each movie fits the genre preferences of the focal user.

In [18]:
# Multiplying the genres by the weights and then taking the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())

# Viewing the first 5 rows
recommendationTable_df.head()

movieId
1    0.594406
2    0.293706
3    0.188811
4    0.328671
5    0.188811
dtype: float64
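
As a sanity check, the first two scores can be reproduced by hand from the user profile. The sketch below uses only the nonzero profile weights shown earlier, which sum to 71.5; the names `weights`, `movie_genres` and `scores` are hypothetical. Toy Story covers Adventure, Animation, Children, Comedy and Fantasy, so its score is (10.0 + 8.0 + 5.5 + 13.5 + 5.5) / 71.5.

```python
import pandas as pd

# Nonzero genre weights from the user profile above
weights = pd.Series({
    "Adventure": 10.0, "Animation": 8.0, "Children": 5.5, "Comedy": 13.5,
    "Fantasy": 5.5, "Drama": 10.0, "Action": 4.5, "Crime": 5.0,
    "Thriller": 5.0, "Sci-Fi": 4.5,
})

# Genre indicators for movieId 1 (Toy Story) and movieId 2 (Jumanji)
movie_genres = pd.DataFrame(0.0, index=[1, 2], columns=weights.index)
movie_genres.loc[1, ["Adventure", "Animation", "Children", "Comedy", "Fantasy"]] = 1.0
movie_genres.loc[2, ["Adventure", "Children", "Fantasy"]] = 1.0

# Weighted average: sum of matching weights divided by the total weight
scores = (movie_genres * weights).sum(axis=1) / weights.sum()
print(scores.round(6))
```

Because the score is a share of the total weight, a movie that lists more of the user's preferred genres will tend to score higher, regardless of how well it is rated by anyone.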

In [19]:
# Sorting the recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)

# Viewing the first 5 rows
recommendationTable_df.head()

movieId
5018      0.748252
26093     0.734266
27344     0.720280
148775    0.685315
6902      0.678322
dtype: float64

Finally, I used the movie IDs to generate a **recommendation of top 20 movies that best fit the user's genre preferences**:

In [20]:
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
664,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996
1824,1907,Mulan,"[Adventure, Animation, Children, Comedy, Drama...",1998
2902,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
4923,5018,Motorama,"[Adventure, Comedy, Crime, Drama, Fantasy, Mys...",1991
6793,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002
8605,26093,"Wonderful World of the Brothers Grimm, The","[Adventure, Animation, Children, Comedy, Drama...",1962
8783,26340,"Twelve Tasks of Asterix, The (Les douze travau...","[Action, Adventure, Animation, Children, Comed...",1976
9296,27344,Revolutionary Girl Utena: Adolescence of Utena...,"[Action, Adventure, Animation, Comedy, Drama, ...",1999
9825,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005
11716,51632,Atlantis: Milo's Return,"[Action, Adventure, Animation, Children, Comed...",2003


Below are some of the advantages and disadvantages of content-based filtering:

**Advantages**

-   Learns user's preferences
-   Highly personalized for the user

**Disadvantages**

-   Doesn't take into account what others think of the item, so low-quality items might be recommended
-   Extracting data is not always intuitive
-   Determining what characteristics of the item the user dislikes or likes is not always obvious

## 4. Collaborative Filtering <a class="anchor" id="chapter4"></a>

Second, I created a recommendation system that uses **collaborative filtering** (also referred to as "user-user filtering"). This method uses inputs from other users to recommend items to the input user. In this case, I created a system that recommends movies rated highly by **other users with similar preferences** to those of the focal user. 

I used **Pearson correlation coefficients** to determine other users' similarity to the focal user.

I began with the same hypothetical user and movie ratings as in the previous section:

In [21]:
# Hypothetical user's movie ratings
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


Next, I dropped the **genres** column from the `movies_df` dataframe, which was not used in this analysis:

In [22]:
# Dropping the genres column
movies_df = movies_df.drop(columns='genres')

# Inspecting the dataframe
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


I then used the movie IDs to extract the subset of users that have watched and reviewed the same movies as those of the focal user:

In [23]:
# Keeping only the ratings of movies that the focal user has also reviewed
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]

# Viewing the first 5 rows
userSubset.head()

Unnamed: 0,userId,movieId,rating
19,4,296,4.0
441,12,1968,3.0
479,13,2,2.0
531,13,1274,5.0
681,14,296,2.0


Checking the number of ratings given to the same movies:

In [24]:
userSubset.shape

(196623, 3)

Checking the total number of ratings in the "ratings" dataframe:

In [25]:
ratings_df.shape

(22884377, 3)

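So 196,623 of the 22,884,377 ratings were given to one of the movies reviewed by the focal user. Note that these are counts of ratings (rows), not of distinct users, since a user who rated several of these movies appears once per movie.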
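
The shape of a ratings table counts rows, so a user who rated several of the focal user's movies is counted more than once. To count distinct users instead, `nunique` can be applied to the user ID column. The sketch below uses the five rows displayed in the `userSubset` output above; the name `subset_sample` is hypothetical.

```python
import pandas as pd

# The first five rows of userSubset shown above
subset_sample = pd.DataFrame({
    "userId":  [4, 12, 13, 13, 14],
    "movieId": [296, 1968, 2, 1274, 296],
    "rating":  [4.0, 3.0, 2.0, 5.0, 2.0],
})

n_ratings = subset_sample.shape[0]          # number of rating rows
n_users = subset_sample["userId"].nunique() # number of distinct users

print(n_ratings, n_users)
```

Here user 13 contributes two rows, so the sample holds five ratings from four users.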

Next, I grouped the rows in the resulting dataframe by user IDs:

In [26]:
# Grouping by a scalar key so that the group names are plain user IDs
userSubsetGroup = userSubset.groupby('userId')

The resulting grouped object can be used to check the ratings given by a particular user within the subset. For example, inspecting the user with user ID = 1130:

In [27]:
userSubsetGroup.get_group(1130)

Unnamed: 0,userId,movieId,rating
104167,1130,1,0.5
104168,1130,2,4.0
104214,1130,296,4.0
104363,1130,1274,4.5
104443,1130,1968,4.5


I then sorted these groups so that the users sharing the most movies in common with the focal user have higher priority:

In [28]:
# Sorting so that users with the most movies in common with the focal user will have priority
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

Inspecting the first three users in the subset:

In [29]:
userSubsetGroup[0:3]

[(75,
        userId  movieId  rating
  7507      75        1     5.0
  7508      75        2     3.5
  7540      75      296     5.0
  7633      75     1274     4.5
  7673      75     1968     5.0),
 (106,
        userId  movieId  rating
  9083     106        1     2.5
  9084     106        2     3.0
  9115     106      296     3.5
  9198     106     1274     3.0
  9238     106     1968     3.5),
 (686,
         userId  movieId  rating
  61336     686        1     4.0
  61337     686        2     3.0
  61377     686      296     4.0
  61478     686     1274     4.0
  61569     686     1968     5.0)]

After sorting the groups (i.e., such that other users sharing the most movies in common with the focal user are at the top), **I computed the focal user's similarity to each of the first 100 users in the subset using Pearson correlations**. Specifically, I computed the Pearson correlation coefficients between the focal user's ratings and the ratings given by each of the first 100 users in the subset.

In [30]:
# Selecting the first 100 users in the sorted subset
userSubsetGroup = userSubsetGroup[0:100]

I calculated the Pearson correlations and stored them in a dictionary, where the key is the user ID and the value is the coefficient:

In [31]:
# Storing the correlation coefficients in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

# Filling the dictionary
for name, group in userSubsetGroup:
    # Sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    # Number of ratings in the group (N in the Pearson formula)
    nRatings = len(group)
    # Getting the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    # Storing them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    # Putting the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    # Calculating pearson correlations between two users X and Y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    # If the denominator is nonzero, divide; otherwise, set the correlation to 0
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
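
To illustrate the formula used in the loop, the sketch below applies it to the focal user's ratings and those of user 75 (both taken from the outputs above, sorted by movie ID) and compares the result with the built-in `Series.corr`. The helper name `pearson` is hypothetical and not part of the notebook.

```python
import pandas as pd
from math import sqrt

def pearson(x, y):
    """Pearson correlation via the computational (sum-of-squares) formula."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A constant rating vector has zero variance; treat that case as no correlation
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Ratings for movie IDs 1, 2, 296, 1274 and 1968
focal_user = [3.5, 2.0, 5.0, 4.5, 5.0]
user_75 = [5.0, 3.5, 5.0, 4.5, 5.0]

manual = pearson(focal_user, user_75)
builtin = pd.Series(focal_user).corr(pd.Series(user_75))

print(round(manual, 6), round(builtin, 6))
```

Both values agree with the coefficient stored for user 75 in the dictionary. With only five shared movies at most, these correlations rest on very few data points and should be read as rough similarity indicators.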


In [32]:
# Dictionary items
pearsonCorrelationDict.items()

dict_items([(75, 0.8272781516947562), (106, 0.5860090386731182), (686, 0.8320502943378437), (815, 0.5765566601970551), (1040, 0.9434563530497265), (1130, 0.2891574659831201), (1502, 0.8770580193070299), (1599, 0.4385290096535153), (1625, 0.716114874039432), (1950, 0.179028718509858), (2065, 0.4385290096535153), (2128, 0.5860090386731196), (2432, 0.1386750490563073), (2791, 0.8770580193070299), (2839, 0.8204126541423674), (2948, -0.11720180773462392), (3025, 0.45124262819713973), (3040, 0.89514359254929), (3186, 0.6784622064861935), (3271, 0.26989594817970664), (3429, 0.0), (3734, -0.15041420939904673), (4099, 0.05860090386731196), (4208, 0.29417420270727607), (4282, -0.4385290096535115), (4292, 0.6564386345361464), (4415, -0.11183835382312353), (4586, -0.9024852563942795), (4725, -0.08006407690254357), (4818, 0.4885967564883424), (5104, 0.7674257668936507), (5165, -0.4385290096535153), (5547, 0.17200522903844556), (6082, -0.04728779924109591), (6207, 0.9615384615384616), (6366, 0.65779

In [33]:
# Saving correlations to a pandas dataframe
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')

# Labeling columns
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index

# Setting index
pearsonDF.index = range(len(pearsonDF))

# Viewing the first 5 rows
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.827278,75
1,0.586009,106
2,0.83205,686
3,0.576557,815
4,0.943456,1040


Next, I identified the top 50 users with the highest similarity to the focal user:

In [34]:
# Sorting the users by descending similarity indices
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]

# Viewing the first 5 rows
topUsers.head()

Unnamed: 0,similarityIndex,userId
64,0.961678,12325
34,0.961538,6207
55,0.961538,10707
67,0.960769,13053
4,0.943456,1040


From there, I took a list of all movies rated by the other users, then multiplied the ratings by the Pearson correlation associated with each user. Essentially, **I computed the weighted scores of all of the users' movie ratings, with the correlations (i.e., similarity to the focal user) as the weight**.

In [35]:
# Using user IDs to obtain movie ratings given by the top 50 similar users
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')

# Viewing the first 5 rows
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.961678,12325,1,3.5
1,0.961678,12325,2,1.5
2,0.961678,12325,3,3.0
3,0.961678,12325,5,0.5
4,0.961678,12325,6,2.5


In [36]:
# Multiplying ratings by the correlation (i.e., similarity) coefficients to reflect Weighted Ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']

# Viewing the first 5 rows
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.961678,12325,1,3.5,3.365874
1,0.961678,12325,2,1.5,1.442517
2,0.961678,12325,3,3.0,2.885035
3,0.961678,12325,5,0.5,0.480839
4,0.961678,12325,6,2.5,2.404196


I then summed the similarity indices and weighted ratings for each of the movies in the list:

In [37]:
# Summing the weighted ratings (and similarity indices) by movie IDs
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]

# Re-labeling columns
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']

# Viewing the first 5 rows
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,38.376281,140.800834
2,38.376281,96.656745
3,10.253981,27.254477
4,0.929294,2.787882
5,11.723262,27.151751


Finally, I divided the aggregate weighted ratings by aggregate similarity indices for each of the movies in the list:

In [38]:
# Creating an empty dataframe
recommendation_df = pd.DataFrame()

# Computing the average weighted rating for each movie in the list
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']

# Adding the movie IDs (currently the index) as a regular column
recommendation_df['movieId'] = tempTopUsersRating.index

# Viewing the first 5 rows
recommendation_df.head()

movieId,weighted average recommendation score,movieId
1,3.668955,1
2,2.518658,2
3,2.657941,3
4,3.0,4
5,2.316058,5
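
The weighted average can be traced on a small made-up example. In the sketch below, two hypothetical users "A" and "B" (with invented similarity indices and ratings) rate two hypothetical movies; the names `toy`, `sums` and `scores` are not part of the notebook.

```python
import pandas as pd

# Hypothetical ratings: movie 10 is rated by both users, movie 20 only by user B
toy = pd.DataFrame({
    "userId": ["A", "B", "B"],
    "similarityIndex": [0.9, 0.5, 0.5],
    "movieId": [10, 10, 20],
    "rating": [4.0, 2.0, 5.0],
})

# Weight each rating by the rater's similarity to the focal user
toy["weightedRating"] = toy["similarityIndex"] * toy["rating"]

# Sum weights and weighted ratings per movie, then divide
sums = toy.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()
scores = sums["weightedRating"] / sums["similarityIndex"]

print(scores)
```

Movie 10 scores (0.9 × 4.0 + 0.5 × 2.0) / (0.9 + 0.5) ≈ 3.29, pulled toward the rating of the more similar user. Movie 20 scores exactly 5.0 because a single rater's weight cancels out, which suggests that movies rated by only one similar user can reach the top of the list on the strength of one rating.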


Sorting the recommendation table:

In [39]:
# Sorting
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)

# Viewing the first 5 rows
recommendation_df.head()

movieId,weighted average recommendation score,movieId
5073,5.0,5073
3329,5.0,3329
2284,5.0,2284
26801,5.0,26801
6776,5.0,6776


Finally, I used the movie IDs to generate a **recommendation of top 20 movies based on ratings of similar users**:

In [40]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(20)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
97,99,Heidi Fleiss: Hollywood Madam,1995
119,121,"Boys of St. Vincent, The",1992
2200,2284,Bandit Queen,1994
3243,3329,"Year My Voice Broke, The",1987
3449,3539,"Filth and the Fury, The",2000
3669,3759,Fun and Fancy Free,1947
3679,3769,Thunderbolt and Lightfoot,1974
3685,3775,Make Mine Music,1946
3686,3776,Melody Time,1948
3759,3851,I'm the One That I Want,2000


Below are some of the advantages and disadvantages of collaborative filtering:

**Advantages**

-   Takes other users' ratings into consideration
-   Doesn't need to study or extract information from the recommended item
-   Adapts to the user's interests which might change over time

**Disadvantages**

-   Approximation function can be slow
-   There might be too few similar users to base the approximation on
-   Privacy issues when trying to learn the user's preferences

### Thank you!

Created by Young Hun Ji