# CS 1656 – Introduction to Data Science 

## Instructor: Alexandros Labrinidis / Teaching Assistant: Evangelos Karageorgos
### Additional credits: Xiaoting Li, Phuong Pham, Zuha Agha, Anatoli Shein
## Recitation 9: Collaborative Filtering & Similarity Metrics
---
In this recitation we will be doing a fun exercise to implement collaborative filtering for recommender systems. We will also learn how the choice of similarity metric in collaborative filtering can affect its output of predicted ratings. 

The packages you will need for this recitation are:

* pandas
* numpy
* scipy

Recall that the numpy package provides nd-arrays and operations for easily manipulating them. 
Likewise, scipy provides an additional suite of useful mathematical functions and distributions for numpy arrays, including distance functions, which we will use in this recitation to measure similarity. We import only the functions we need for today's session, as shown below: three distance functions from scipy.spatial.distance and pearsonr from scipy.stats. Note that cityblock is just another name for the Manhattan distance metric seen in class.

In [1]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine
from scipy.stats import pearsonr
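
As a quick sanity check of the three distance functions imported above, the sketch below applies them to two small vectors. The vectors `a` and `b` are made up for illustration and are not part of the recitation data.

```python
from scipy.spatial.distance import euclidean, cityblock, cosine

# Hypothetical vectors: b points in the same direction as a, but is twice as long
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

d_euclid = euclidean(a, b)     # sqrt(1^2 + 2^2 + 3^2) = sqrt(14)
d_manhattan = cityblock(a, b)  # |1| + |2| + |3| = 6
d_cosine = cosine(a, b)        # 1 - cos(angle between a and b); 0 up to rounding here

print(d_euclid, d_manhattan, d_cosine)
```

Note that scipy's cosine returns a distance (one minus the cosine similarity), so parallel vectors are at distance 0 even when their lengths differ, while Euclidean and Manhattan distance both grow with the difference in magnitude.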

## User-Based vs Item-Based Recommendation
There are two types of collaborative filtering methods: user-based and item-based.

User-based recommendation assumes that similar users give similar ratings to each item, whereas item-based recommendation assumes that similar items receive similar ratings from each user. You can think of the two as duals of each other. 

In this recitation, we will walk through a toy example for user-based recommendation and you will try out item-based recommendation later in one of your tasks. 
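
The duality between the two methods can be sketched with a tiny ratings table: user-based methods compare columns (one vector per user), while item-based methods compare rows (one vector per movie). The two-movie, two-user table below is a small excerpt assumed for illustration; the names `ratings`, `user_vectors`, and `item_vectors` are hypothetical.

```python
import pandas as pd

# Rows are movies, columns are users (a small excerpt for illustration)
ratings = pd.DataFrame(
    {'Alice': [2, 4], 'Bob': [3, 4]},
    index=['The Matrix', 'Planes'],
)

# User-based: each user is described by the ratings they gave (a column)
user_vectors = {user: ratings[user].tolist() for user in ratings.columns}

# Item-based: each movie is described by the ratings it received (a row)
item_vectors = {movie: ratings.loc[movie].tolist() for movie in ratings.index}

print(user_vectors)
print(item_vectors)
```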

## Data Input

In [2]:
df = pd.read_csv('http://data.cs1656.org/movies_example.csv')
df

Unnamed: 0,Name,Alice,Bob,Christine,David,Elaine,Frank
0,The Matrix,2,3.0,4,5.0,5.0,
1,Gone with the Wind,5,,5,,3.0,3.0
2,Jack and Jill,2,1.0,2,2.0,1.0,1.0
3,Planes,4,4.0,5,2.0,,3.0
4,Rocky IV,2,2.0,3,4.0,3.0,


### Accessing rows in dataframe

Two ways to access the rows of a dataframe are shown below.

In [3]:
# An equality test on a column returns a Series of booleans
df['Name'] == 'The Matrix'

0     True
1    False
2    False
3    False
4    False
Name: Name, dtype: bool

In [4]:
# First way to access rows
df[df['Name'] == 'The Matrix']

Unnamed: 0,Name,Alice,Bob,Christine,David,Elaine,Frank
0,The Matrix,2,3.0,4,5.0,5.0,


In [5]:
# Second way
df.iloc[0]

Name         The Matrix
Alice                 2
Bob                   3
Christine             4
David                 5
Elaine                5
Frank               NaN
Name: 0, dtype: object
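
A third option, not used in the rest of this recitation, is to look rows up by label. The sketch below assumes a small stand-in dataframe (`df_small`, hypothetical) and uses set_index together with loc so that a movie can be retrieved by its title rather than by position.

```python
import pandas as pd

# Small stand-in for the movie dataframe (illustrative values only)
df_small = pd.DataFrame({
    'Name': ['The Matrix', 'Planes'],
    'Alice': [2, 4],
    'Bob': [3.0, 4.0],
})

# Use the movie title as the row label, then select by label with loc
by_name = df_small.set_index('Name')
row = by_name.loc['The Matrix']
print(row)
```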

### Missing values in data frame

To exclude missing values (NaNs) from a dataframe, we can use the notnull() function, which returns a boolean Series that is True wherever a value is present.

In [6]:
df['Frank'].notnull()

0    False
1     True
2     True
3     True
4    False
Name: Frank, dtype: bool

In [7]:
df['Elaine'].notnull()

0     True
1     True
2     True
3    False
4     True
Name: Elaine, dtype: bool

You can also perform logical operations on the returned boolean Series, as shown below.

In [8]:
df['Frank'].notnull() & df['Elaine'].notnull()

0    False
1     True
2     True
3    False
4    False
dtype: bool

You can also select the subset of rows and columns where the boolean value is True.

In [9]:
df_notmissing = df[['Frank','Elaine']][df['Frank'].notnull() & df['Elaine'].notnull()]
df_notmissing

Unnamed: 0,Frank,Elaine
1,3.0,3.0
2,1.0,1.0
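
The same selection can also be written with dropna, which drops every row that contains a missing value in the selected columns. The sketch below rebuilds only the 'Frank' and 'Elaine' columns of the toy data (under the hypothetical name `toy`) and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# 'Frank' and 'Elaine' columns of the toy data above
toy = pd.DataFrame({
    'Frank': [np.nan, 3, 1, 3, np.nan],
    'Elaine': [5, 3, 1, np.nan, 3],
})

# Boolean-mask version, as in the cell above
mask_version = toy[['Frank', 'Elaine']][toy['Frank'].notnull() & toy['Elaine'].notnull()]

# Equivalent version: drop rows with a NaN in either column
dropna_version = toy[['Frank', 'Elaine']].dropna()

print(dropna_version)
```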


## Similarity Metrics & Predicted Ratings
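Different metrics can be used to measure similarity. In this recitation, we will use Euclidean distance, Manhattan distance, the Pearson correlation coefficient, and cosine distance. For the distance metrics, a distance d is converted to a similarity weight 1/(1+d), so a distance of 0 gives a weight of 1 and larger distances give smaller weights.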

### Euclidean 

In [10]:
sim_weights = {}
for user in df.columns[1:-1]:
    df_subset = df[['Frank',user]][df['Frank'].notnull() & df[user].notnull()]
    dist = euclidean(df_subset['Frank'], df_subset[user])
    sim_weights[user] = 1.0 / (1.0 + dist)
print ("similarity weights: %s" % sim_weights)

similarity weights: {'Alice': 0.28989794855663564, 'Bob': 0.5, 'Christine': 0.25, 'David': 0.4142135623730951, 'Elaine': 1.0}


Now let's find the predicted rating of 'Frank' for 'The Matrix'. We can get all ratings for a movie by accessing its row of the dataframe with iloc, as shown earlier. The slice [1:-1] keeps only the rating columns we need: it drops the first column, 'Name', and the last column, 'Frank'.

In [11]:
ratings = df.iloc[0][1:-1]
ratings

Alice        2
Bob          3
Christine    4
David        5
Elaine       5
Name: 0, dtype: object

Now we compute the predicted rating as a weighted average: multiply each user's similarity weight by that user's rating for 'The Matrix', sum the products, and divide by the sum of the weights.

In [12]:
predicted_rating = 0.0
weights_sum = 0.0
for user in df.columns[1:-1]:
    predicted_rating += ratings[user] * sim_weights[user]
    weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %f" % predicted_rating)

predicted rating: 4.136268
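
The loop above is a weighted average, so it can be cross-checked with np.average. The sketch below copies the Euclidean similarity weights and the ratings for 'The Matrix' from the outputs above into plain arrays (in the order Alice, Bob, Christine, David, Elaine); it is a check on the arithmetic, not a replacement for the loop.

```python
import numpy as np

# Values copied from the cell outputs above (Alice, Bob, Christine, David, Elaine)
weights = np.array([0.28989794855663564, 0.5, 0.25, 0.4142135623730951, 1.0])
ratings = np.array([2, 3, 4, 5, 5], dtype=float)

# sum(weights * ratings) / sum(weights)
predicted = np.average(ratings, weights=weights)
print("predicted rating: %f" % predicted)
```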


### Manhattan (Cityblock)

We now repeat the same procedure for finding the predicted rating, this time using cityblock distance.

In [13]:
sim_weights = {}
for user in df.columns[1:-1]:
    df_subset = df[['Frank',user]][df['Frank'].notnull() & df[user].notnull()]
    dist = cityblock(df_subset['Frank'], df_subset[user])
    sim_weights[user] = 1.0 / (1.0 + dist)
print ("similarity weights: %s" % sim_weights)

predicted_rating = 0.0
weights_sum = 0.0
ratings = df.iloc[0][1:-1]
for user in df.columns[1:-1]:
    predicted_rating += ratings[user] * sim_weights[user]
    weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %f" % predicted_rating)

similarity weights: {'Alice': 0.2, 'Bob': 0.5, 'Christine': 0.16666666666666666, 'David': 0.3333333333333333, 'Elaine': 1.0}
predicted rating: 4.196970
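
Since the Euclidean and Manhattan cells differ only in the distance function, the repeated logic can be folded into one helper that takes the distance function as a parameter. This is a minimal sketch: `predict_rating` and `toy` are hypothetical names, the toy dataframe is rebuilt inline so the block runs without network access, and skipping users who did not rate the target movie is an added safeguard that the cells above do not need for 'The Matrix'.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean, cityblock

# Toy ratings from the Data Input section, rebuilt inline
toy = pd.DataFrame({
    'Name': ['The Matrix', 'Gone with the Wind', 'Jack and Jill', 'Planes', 'Rocky IV'],
    'Alice': [2, 5, 2, 4, 2],
    'Bob': [3, np.nan, 1, 4, 2],
    'Christine': [4, 5, 2, 5, 3],
    'David': [5, np.nan, 2, 2, 4],
    'Elaine': [5, 3, 1, np.nan, 3],
    'Frank': [np.nan, 3, 1, 3, np.nan],
})

def predict_rating(df, target, movie_row, dist_fn):
    """Weighted-average prediction of `target`'s rating for row `movie_row`."""
    numerator, weights_sum = 0.0, 0.0
    for user in df.columns:
        if user in ('Name', target):
            continue
        common = df[[target, user]].dropna()   # movies rated by both users
        rating = df.loc[movie_row, user]
        if common.empty or pd.isna(rating):    # added safeguard (assumption)
            continue
        weight = 1.0 / (1.0 + dist_fn(common[target], common[user]))
        numerator += rating * weight
        weights_sum += weight
    return numerator / weights_sum if weights_sum > 0 else float('nan')

pred_euclid = predict_rating(toy, 'Frank', 0, euclidean)
pred_city = predict_rating(toy, 'Frank', 0, cityblock)
print("euclidean: %f, cityblock: %f" % (pred_euclid, pred_city))
```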


### Pearson Correlation Coefficient

In [14]:
sim_weights = {}
for user in df.columns[1:-1]:
    df_subset = df[['Frank',user]][df['Frank'].notnull() & df[user].notnull()]
    sim_weights[user] = pearsonr(df_subset['Frank'], df_subset[user])[0]
print ("similarity weights: %s" % sim_weights)

predicted_rating = 0.0
weights_sum = 0.0
ratings = df.iloc[0][1:-1]
for user in df.columns[1:-1]:
    predicted_rating += ratings[user] * sim_weights[user]
    weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %s" % predicted_rating)

similarity weights: {'Alice': 0.9449111825230679, 'Bob': 1.0, 'Christine': 1.0, 'David': nan, 'Elaine': 1.0}
predicted rating: nan




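Why nan?
Because David's weight is nan. On the two movies that both Frank and David rated, David gave the same rating (2), so his ratings have zero variance and the Pearson correlation becomes 0/0, which is undefined. A single nan weight then turns the whole weighted average into nan. Computing the prediction again while skipping nan weights gives the following.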

In [15]:
predicted_rating = 0.0
weights_sum = 0.0
ratings = df.iloc[0][1:-1]
for user in df.columns[1:-1]:
    if (not np.isnan(sim_weights[user])):
        predicted_rating += ratings[user] * sim_weights[user]
        weights_sum += sim_weights[user]

predicted_rating /= weights_sum
print ("predicted rating: %f" % predicted_rating)

predicted rating: 3.520947
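
The nan weight for David can be reproduced by hand. The sketch below applies the Pearson formula directly with numpy to the two movies both users rated ('Jack and Jill' and 'Planes'), rather than calling pearsonr, so that the zero denominator is visible. The variable names are illustrative.

```python
import numpy as np

# Ratings on the movies rated by both users: Jack and Jill, Planes
frank = np.array([1.0, 3.0])
david = np.array([2.0, 2.0])

# Deviations from each user's mean rating
dx = frank - frank.mean()
dy = david - david.mean()

numerator = np.sum(dx * dy)                              # 0: David never deviates
denominator = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)) # 0: David's variance is zero

# 0/0 is undefined; numpy represents the result as nan
with np.errstate(invalid='ignore', divide='ignore'):
    r = numerator / denominator

print(numerator, denominator, r)
```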


## Tasks
For your tasks, use the movie ratings data we collected from a previous class in movie_class_responses.csv. You will predict the missing movie ratings of a student based on other students with similar tastes. The first column, 'Alias', holds the movie name, and every other column holds one student's ratings under that student's user name. Ratings range from 1 to 5, and many values are missing (movies a student did not rate).

In [16]:
df = pd.read_csv('http://data.cs1656.org/movie_class_responses.csv')
df.head()

Unnamed: 0,Alias,potatofaminewasaninsidejob,Thug Nugget,Arjen_Robben_Is_Cool,Oh-Long Johnson,BigD,Starlord,captainamerica,BabyKangaroo,Ryan,...,Garbage,CleverAlias,Kirito,ryan,Geek in the Pink,Michelle,Jim Jarmush,rater,RonJohnson,Dory
0,50 Shades of Grey,1.0,,,1.0,,1.0,1.0,,1,...,3,3,1.0,1,1,,1.0,,2,2.0
1,Airplane,,,,5.0,,3.0,,4.0,4,...,5,3,5.0,1,3,,,,4,
2,Au Revoir Les Enfants,,,,2.0,,,,,3,...,3,3,,3,3,,,,4,
3,Blues Brothers,,,,4.0,,,,2.0,3,...,5,3,4.0,3,4,,4.0,,4,
4,Dark Night,,3.0,,5.0,5.0,5.0,,5.0,5,...,4,3,4.0,4,4,,,4.0,5,5.0


In [41]:
df

Unnamed: 0,Alias,potatofaminewasaninsidejob,Thug Nugget,Arjen_Robben_Is_Cool,Oh-Long Johnson,BigD,Starlord,captainamerica,BabyKangaroo,Ryan,...,Garbage,CleverAlias,Kirito,ryan,Geek in the Pink,Michelle,Jim Jarmush,rater,RonJohnson,Dory
0,50 Shades of Grey,1.0,,,1.0,,1.0,1.0,,1,...,3,3,1.0,1,1,,1.0,,2,2.0
1,Airplane,,,,5.0,,3.0,,4.0,4,...,5,3,5.0,1,3,,,,4,
2,Au Revoir Les Enfants,,,,2.0,,,,,3,...,3,3,,3,3,,,,4,
3,Blues Brothers,,,,4.0,,,,2.0,3,...,5,3,4.0,3,4,,4.0,,4,
4,Dark Night,,3.0,,5.0,5.0,5.0,,5.0,5,...,4,3,4.0,4,4,,,4.0,5,5.0
5,Deadpool (2016),4.0,4.0,,5.0,,2.0,5.0,,4,...,5,4,4.0,5,3,4.0,1.0,,5,
6,Don't Mess with the Zohan,,,3.0,5.0,,2.0,,,2,...,1,3,3.0,1,3,,1.0,,2,
7,Dr. Strange,4.0,,,3.0,,4.0,5.0,,4,...,3,5,4.0,3,3,,5.0,,4,
8,Ex Machina,4.0,,,2.0,,4.0,,,5,...,4,3,4.0,3,3,3.0,3.0,,5,
9,Fast and Furious,,,,1.0,4.0,1.0,3.0,,2,...,3,5,4.0,5,2,,,,5,4.0


**Task 1: User-based Recommendation with Cosine Metric**

For a specified user, calculate ALL missing movie ratings using user-based recommendation with Cosine Metric.

In [42]:
def t1(name):
    """Predict all missing ratings of `name` from every other user (cosine metric)."""
    sim_weights = {}
    missing_ratings = []

    # Similarity of `name` to each other user, over the movies both have rated
    for user in df.columns[1:]:
        if user == name:
            continue
        df_subset = df[[name,user]][df[name].notnull() & df[user].notnull()]
        dist = cosine(df_subset[name], df_subset[user])
        sim_weights[user] = 1.0 / (1.0 + dist)

    df_subset = df[name]
    for x in range(0, len(df)):
        movie_title = df['Alias'][x]
        user_rating = df_subset.iloc[x]

        # Only predict for movies that `name` has not rated
        if np.isnan(user_rating):
            predicted_rating = 0.0
            weights_sum = 0.0
            ratings = df.iloc[x][1:]

            # Weighted average over users who rated this movie and have a valid weight
            for user in df.columns[1:]:
                if (user != name and not np.isnan(sim_weights[user]) and not np.isnan(ratings[user])):
                    predicted_rating += ratings[user] * sim_weights[user]
                    weights_sum += sim_weights[user]

            # Guard against division by zero when nobody else rated the movie
            predicted_rating = predicted_rating / weights_sum if weights_sum > 0 else np.nan
            missing_ratings.append((movie_title, predicted_rating))

    return missing_ratings

print(t1('BabyKangaroo'))

[('50 Shades of Grey', 1.800510039784056), ('Au Revoir Les Enfants', 2.677412888697481), ('Deadpool (2016)', 3.884560310821831), ("Don't Mess with the Zohan", 2.3252685926656045), ('Dr. Strange', 3.4874545533060894), ('Ex Machina', 3.295500789684209), ('Fast and Furious', 3.2643182745411243), ('Finding Dory', 3.797867126964514), ('Grownups', 2.6027977835524734), ("Harry Potter: The Sorcerer's Stone", 3.8440637198956455), ('Jurassic Park', 3.6299162284421977), ('Keanu', 2.7149176195697478), ('La La Land', 3.519753102584685), ('Lord of the Rings: Return of the King', 3.8882854882871607), ('Mr. Deeds', 3.110755481556487), ('Nacho Libre', 2.3646069694393863), ('Remember the Titans', 3.2135894977660335), ('Shrek', 4.113414168507833), ('Snakes on a Plane', 2.0812632781733402), ('Spiderman', 3.39690383570258), ('Star Wars: Episode IV', 3.704378124682928), ('Star Wars: The Return of the Jedi', 3.6867809290631763), ('Suicide Squad', 2.5957661280674174), ('Superbad', 3.3117330945104837), ('The F

**Task 2: Item-based Recommendation with Cosine Metric**

Repeat the task above using item-based recommendation instead of user-based recommendation. To calculate a missing movie rating with item-based recommendation, you find the similarity between movies instead of between users. In other words, you measure the similarity of the movie the user has not rated to each movie that the user has rated. Then you compute a weighted average of the user's ratings for those movies, using the movie similarities as weights, to obtain the predicted rating. You need to predict ALL missing movie ratings for the user.

In [83]:
df_t = df.transpose()
df_t

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
Alias,50 Shades of Grey,Airplane,Au Revoir Les Enfants,Blues Brothers,Dark Night,Deadpool (2016),Don't Mess with the Zohan,Dr. Strange,Ex Machina,Fast and Furious,...,The Fifth Element,The Fundamentals of Caring,The Godfather,The Imitation Game,The Martian,The Matrix,The Room,Split (2016),Toy Story,War Dogs
potatofaminewasaninsidejob,1,,,,,4,,4,4,,...,,,,4,4,5,3,,,
Thug Nugget,,,,,3,4,,,,,...,,,1,,4,4,,,,
Arjen_Robben_Is_Cool,,,,,,,3,,,,...,,,,,,5,,,,
Oh-Long Johnson,1,5,2,4,5,5,5,3,2,1,...,,,5,4,4,4,1,,,
BigD,,,,,5,,,,,4,...,,,,5,4,,,,5,
Starlord,1,3,,,5,2,2,4,4,1,...,5,,,5,4,5,,,4,3
captainamerica,1,,,,,5,,5,,3,...,,,4,5,,,,,4,
BabyKangaroo,,4,,2,5,,,,,,...,,,5,3,4,3,,,5,
Ryan,1,4,3,3,5,4,2,4,5,2,...,3,4,5,5,4,5,3,3,5,3


In [97]:
name = 'BabyKangaroo' # not necessary to have in task.py
unrated_movies = {}
sim_weights = {}
predicted_ratings = []

# Collect the movies (row index -> title) that the user has not rated
for x in range(0, len(df)):
    if np.isnan(df[name][x]):
        unrated_movies[x] = df['Alias'][x]

# After transposing, each column is a movie and each row after 'Alias' is a user
df_t = df.transpose()

for x in unrated_movies:
    movie_title = unrated_movies[x]
    # Similarity between the unrated movie x and every other movie y,
    # computed over the users who rated both
    for y in range(0, len(df)):
        movie = df['Alias'][y]
        if movie_title == movie:
            continue

        df_subset = df_t[[x,y]][df_t[x].notnull() & df_t[y].notnull()]
        # [1:] skips the 'Alias' row, which holds titles rather than ratings
        dist = cosine(df_subset[x][1:].astype(float), df_subset[y][1:].astype(float))
        sim_weights[movie] = 1.0 / (1.0 + dist)

    # Only the weights for the first unrated movie are printed here.
    # Remaining step: take the movies the user has rated and use their
    # sim_weights to compute the weighted average of the user's ratings.
    print(sim_weights)
    break

{'Airplane': 0.778212983152115, 'Au Revoir Les Enfants': 0.825340944432571, 'Blues Brothers': 0.792033989807172, 'Dark Night': 0.7754706897841156, 'Deadpool (2016)': 0.8081969606912395, "Don't Mess with the Zohan": 0.7762383510353102, 'Dr. Strange': 0.8157074385068009, 'Ex Machina': 0.7918084563632197, 'Fast and Furious': 0.8290820970258105, 'Fight Club': 0.7832795123037051, 'Finding Dory': 0.8344246041589632, 'Good Will Hunting': 0.7951188772279406, 'Grownups': 0.8299672785224285, "Harry Potter: The Sorcerer's Stone": 0.8594138279431318, 'Jurassic Park': 0.816197416788081, 'Keanu': 0.8043182731945776, 'La La Land': 0.8357208207826734, 'Lord of the Rings: Return of the King': 0.752137175310163, 'Mr. Deeds': 0.7737980505227353, 'Nacho Libre': 0.7726649443013877, 'Pulp Fiction (1994)': 0.7972155700616214, 'Ratatuille': 0.8485803371245432, 'Remember the Titans': 0.7767707730472735, 'Sharknado (2013)': 0.841067921922297, 'Shawshank Redemption': 0.8522620003996483, 'Shrek': 0.87231236966125
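
The cell above stops after computing the similarity weights for the first unrated movie. One possible way to finish the item-based prediction is sketched below on the small toy dataset from the beginning of the recitation, rebuilt inline so the block is self-contained. The function name `item_based_predictions` and its `title_col` parameter are hypothetical, the sketch assumes movie titles are unique, and it is one reading of the task description rather than a reference solution. For the class data, `title_col` would be 'Alias'.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

toy = pd.DataFrame({
    'Name': ['The Matrix', 'Gone with the Wind', 'Jack and Jill', 'Planes', 'Rocky IV'],
    'Alice': [2, 5, 2, 4, 2],
    'Bob': [3, np.nan, 1, 4, 2],
    'Christine': [4, 5, 2, 5, 3],
    'David': [5, np.nan, 2, 2, 4],
    'Elaine': [5, 3, 1, np.nan, 3],
    'Frank': [np.nan, 3, 1, 3, np.nan],
})

def item_based_predictions(df, name, title_col='Name'):
    """Predict each missing rating of `name` from the movies `name` has rated."""
    ratings = df.set_index(title_col).astype(float)   # rows: movies, columns: users
    rated = ratings.index[ratings[name].notna().to_numpy()]
    unrated = ratings.index[ratings[name].isna().to_numpy()]

    predictions = []
    for target in unrated:
        numerator, weights_sum = 0.0, 0.0
        for other in rated:
            # Keep only the users who rated both movies
            pair = ratings.loc[[target, other]].dropna(axis=1)
            if pair.shape[1] == 0:
                continue
            weight = 1.0 / (1.0 + cosine(pair.loc[target], pair.loc[other]))
            numerator += weight * ratings.loc[other, name]
            weights_sum += weight
        predicted = numerator / weights_sum if weights_sum > 0 else float('nan')
        predictions.append((target, predicted))
    return predictions

item_predictions = item_based_predictions(toy, 'Frank')
print(item_predictions)
```

Because each prediction is a weighted average of the user's own ratings with positive weights, it always falls between the user's lowest and highest existing rating.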

**Task 3: User-based Recommendation with Cosine Metric**

Repeat Task 1, but compute the weighted average using only the 10 most similar users instead of all users.

In [49]:
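def t3(name):
    """Predict all missing ratings of `name` from the 10 most similar users (cosine metric)."""
    sim_weights = {}
    missing_ratings = []

    # Similarity of `name` to each other user, over the movies both have rated
    for user in df.columns[1:]:
        if user == name:
            continue
        df_subset = df[[name,user]][df[name].notnull() & df[user].notnull()]
        dist = cosine(df_subset[name], df_subset[user])
        sim_weights[user] = 1.0 / (1.0 + dist)

    # Keep only the 10 users with the largest similarity weights
    sim_weights = {k: v for k, v in sorted(sim_weights.items(), key=lambda item: item[1], reverse=True)[:10]}

    df_subset = df[name]
    for x in range(0, len(df)):
        movie_title = df['Alias'][x]
        user_rating = df_subset.iloc[x]

        # Only predict for movies that `name` has not rated
        if np.isnan(user_rating):
            predicted_rating = 0.0
            weights_sum = 0.0
            ratings = df.iloc[x][1:]

            # Weighted average over the top-10 users who rated this movie
            for user in df.columns[1:]:
                if (user != name and user in sim_weights and not np.isnan(sim_weights[user]) and not np.isnan(ratings[user])):
                    predicted_rating += ratings[user] * sim_weights[user]
                    weights_sum += sim_weights[user]

            # Guard against division by zero when none of the top-10 users rated the movie
            predicted_rating = predicted_rating / weights_sum if weights_sum > 0 else np.nan
            missing_ratings.append((movie_title, predicted_rating))

    return missing_ratings

print(t3('BabyKangaroo'))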

[('50 Shades of Grey', 1.2501062977225976), ('Au Revoir Les Enfants', 2.499979499698017), ('Deadpool (2016)', 3.6715087825874733), ("Don't Mess with the Zohan", 3.0003808055650913), ('Dr. Strange', 3.499050502383946), ('Ex Machina', 3.598541193625111), ('Fast and Furious', 2.43305340976135), ('Finding Dory', 3.7530297156456833), ('Grownups', 2.5712769115930887), ("Harry Potter: The Sorcerer's Stone", 3.751014292593285), ('Jurassic Park', 3.447578069750026), ('Keanu', 2.3304483131845752), ('La La Land', 3.80217068036191), ('Lord of the Rings: Return of the King', 4.124140887522079), ('Mr. Deeds', 3.165723776145156), ('Nacho Libre', 2.99787714677672), ('Remember the Titans', 3.4998598496048094), ('Shrek', 3.871824257347736), ('Snakes on a Plane', 1.3758214108843625), ('Spiderman', 3.1153416120632498), ('Star Wars: Episode IV', 4.201510804793711), ('Star Wars: The Return of the Jedi', 3.8569293788600487), ('Suicide Squad', 1.857633145720359), ('Superbad', 3.714544137074008), ('The Fifth E
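
The top-10 selection in t3 sorts the full dictionary of weights. A possible alternative is sketched below with heapq.nlargest, which also filters out nan weights first, since comparisons involving nan make the sort order unreliable. The helper name `top_k_weights` and the example weights are hypothetical.

```python
import heapq
import math

def top_k_weights(sim_weights, k):
    """Return the k largest non-nan similarity weights as a dict."""
    valid = {user: w for user, w in sim_weights.items() if not math.isnan(w)}
    return dict(heapq.nlargest(k, valid.items(), key=lambda kv: kv[1]))

# Made-up weights for illustration
example = {'u1': 0.9, 'u2': float('nan'), 'u3': 0.7, 'u4': 0.95}
top2 = top_k_weights(example, 2)
print(top2)
```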