# Collaborative Filtering

# Imports

In [1]:
import pandas as pd
import numpy as np

# Data

In [2]:
user_ratings = pd.read_csv('user_ratings.csv', index_col=False)
user_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


# Exercises

> # 1. Pivoting your data

Transform the user_ratings DataFrame to a DataFrame containing ratings with one row per user and one column per movie and call it user_ratings_table.

In [3]:
user_ratings = user_ratings[['userId', 'title', 'rating']]
user_ratings.sample(15)

Unnamed: 0,userId,title,rating
48128,279,Star Wars: Episode VII - The Force Awakens (2015),3.5
50658,103,Raging Bull (1980),5.0
83988,380,Underworld: Evolution (2006),2.0
83207,552,"Blue Lagoon, The (1980)",1.0
80659,57,Monsieur Verdoux (1947),4.0
31948,64,Coneheads (1993),3.5
70500,105,"Samouraï, Le (Godson, The) (1967)",4.0
34191,376,Contact (1997),4.5
86814,89,Carry on Cruising (1962),3.5
48344,249,10 Cloverfield Lane (2016),4.0


In [27]:
#user_ratings_pivot = user_ratings.pivot(index='userId',
#                                       columns='title',
#                                       values='rating')
# This raised "ValueError: Index contains duplicate entries, cannot reshape"; the fix is described at the link below:
# https://www.statology.org/valueerror-index-contains-duplicate-entries-cannot-reshape/#:~:text=How%20to%20Fix%3A%20ValueError%3A%20Index%20contains%20duplicate%20entries%2C%20cannot%20reshape,-One%20error%20you&text=This%20error%20usually%20occurs%20when,share%20the%20same%20index%20values.

In [4]:
user_ratings_table = user_ratings.pivot_table(index='userId', columns='title', values='rating', aggfunc='mean')
user_ratings_table

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,,,,,,,,,,,...,,,,,,,,,,
607,,,,,,,,,,,...,,,,,,,,,,
608,,,,,,,,,,,...,,,,,,4.5,3.5,,,
609,,,,,,,,,,,...,,,,,,,,,,
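The difference between `pivot` and `pivot_table` can be seen on a small example. The sketch below uses hypothetical toy data (the titles `Movie X` and `Movie Y` and the ratings are made up for illustration) in which one user has two ratings for the same title. `pivot` raises a `ValueError` in that situation, while `pivot_table` combines the duplicates with the chosen `aggfunc`.

```python
import pandas as pd

# Hypothetical toy data: user 1 has two ratings for the same title
toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "title": ["Movie X", "Movie X", "Movie Y"],
    "rating": [4.0, 2.0, 5.0],
})

# pivot() cannot place two values in the same (userId, title) cell, so it raises
try:
    toy.pivot(index="userId", columns="title", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here with the mean)
toy_table = toy.pivot_table(index="userId", columns="title",
                            values="rating", aggfunc="mean")
print(pivot_failed)
print(toy_table)
```

Averaging is only one possible choice; whether the mean is the right way to merge duplicate ratings depends on why the duplicates exist in the data.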


> # 2. Finding similar users

Collaborative filtering is built around the premise that users who have rated items similarly in the past have similar tastes, and are therefore likely to rate new items in a similar fashion.

A subset of the movies dataset has been loaded as user_ratings_subset. The DataFrame contains user ratings with a row for each user and a column for each movie.

Examine user_ratings_subset. Which user is most similar to User A?

![](subset_user_ratings.png)

User_B is most similar to User_A because they gave similar ratings to the same movies, Pulp Fiction and The Matrix.

> # 3. Challenges with missing values

You may have noticed that the pivoted DataFrames you have been working with often have missing data. This is to be expected since users rarely see all movies, and most movies are not seen by everyone, resulting in gaps in the user-rating matrix.

In this exercise, you will explore another subset of the user ratings table, user_ratings_subset, which has missing values, and observe how different approaches to dealing with missing data may affect its usability.

- Fill the gaps in the user_ratings_subset with zeros.
- Print and inspect the results.

In [7]:
user_ratings_subset = pd.read_csv('user_ratings_subset2.csv', index_col=0)

In [8]:
user_ratings_subset

Unnamed: 0,Forrest Gump,Pulp Fiction,Toy Story,The Matrix
User_A,10,9,7,
User_B,10,9,7,0.0
User_C,10,9,7,8.0


In [9]:
# Fill in missing values with 0
user_ratings_table_filled = user_ratings_subset.fillna(0)

# Inspect the result
print(user_ratings_table_filled)

                   Forrest Gump   Pulp Fiction   Toy Story   The Matrix
User_A                        10              9           7         0.0
User_B                        10              9           7         0.0
User_C                        10              9           7         8.0


### Question

Based on this user_ratings_table_filled, who now looks most similar to User_A?

Possible Answers

- Both User B and User C

- User B ✔️

- User C
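One way to check this answer numerically is to measure how far each user's row is from User_A's row. The sketch below rebuilds the zero-filled table from the printed output and uses Euclidean distance, which is an assumption made here for illustration; the exercise itself does not prescribe a distance measure.

```python
import numpy as np
import pandas as pd

# Zero-filled table, copied from the printed output above
filled = pd.DataFrame(
    {
        "Forrest Gump": [10, 10, 10],
        "Pulp Fiction": [9, 9, 9],
        "Toy Story": [7, 7, 7],
        "The Matrix": [0.0, 0.0, 8.0],
    },
    index=["User_A", "User_B", "User_C"],
)

# Difference between every row and User_A's row, column by column
diff = filled.sub(filled.loc["User_A"], axis=1)

# Euclidean distance to User_A (smaller means more similar)
distances = np.sqrt((diff ** 2).sum(axis=1))
print(distances)
```

User_B ends up at distance 0 from User_A, but only because User_A's missing rating was replaced with the same 0 that User_B actually gave. The fill value, not the users' tastes, produces the match.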

> # 4. Compensating for incomplete data

For most datasets, the majority of users will have rated only a small number of items. As you saw in the last exercise, how you deal with users who do not have ratings for an item can greatly influence the validity of your models.

In this exercise, you will fill in missing data with information that should not bias the data that you do have.

You'll get the average score each user has given across all their ratings, and then use this average to center the users' scores around zero. Finally, you'll be able to fill in the empty values with zeros, which is now a neutral score, minimizing the impact on their overall profile, but still allowing the comparison of users.


- Find the average of the ratings given by each user in user_ratings_subset and store them as avg_ratings.
- Subtract the row averages from each row in user_ratings_subset, and store the result as user_ratings_table_centered.
- Fill the empty values in user_ratings_table_centered with zeros and store the result as user_ratings_table_normed.

In [11]:
# Get the average rating for each user 
avg_ratings = user_ratings_subset.mean(axis=1)

# Center each user's ratings around 0
user_ratings_table_centered = user_ratings_subset.sub(avg_ratings, axis=0)

# Fill in the missing data with 0s
user_ratings_table_normed = user_ratings_table_centered.fillna(0)

In [12]:
user_ratings_table_normed

Unnamed: 0,Forrest Gump,Pulp Fiction,Toy Story,The Matrix
User_A,1.333333,0.333333,-1.666667,0.0
User_B,3.5,2.5,0.5,-6.5
User_C,1.5,0.5,-1.5,-0.5
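To see what the centering changes, the users can be compared again on the centered table. The sketch below rebuilds the table from the raw ratings shown earlier and compares User_A with the other two users using cosine similarity. Both the choice of cosine similarity and the helper name `cosine_similarity` are assumptions made for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Raw ratings from the exercise; User_A has not rated The Matrix
ratings = pd.DataFrame(
    {
        "Forrest Gump": [10, 10, 10],
        "Pulp Fiction": [9, 9, 9],
        "Toy Story": [7, 7, 7],
        "The Matrix": [np.nan, 0.0, 8.0],
    },
    index=["User_A", "User_B", "User_C"],
)

# Center each user's ratings on their own mean, then fill gaps with 0
centered = ratings.sub(ratings.mean(axis=1), axis=0).fillna(0)

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_b = cosine_similarity(centered.loc["User_A"], centered.loc["User_B"])
sim_c = cosine_similarity(centered.loc["User_A"], centered.loc["User_C"])
print(f"A vs B: {sim_b:.3f}")
print(f"A vs C: {sim_c:.3f}")
```

With the centered data, User_C scores much closer to User_A (about 0.97) than User_B does (about 0.28). This reverses the conclusion drawn from the zero-filled table, where the filled-in 0 made User_A look identical to User_B.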
