
# Collaborative Filtering

Discover new items to recommend to users by finding others with similar tastes. Learn to make user-based and item-based recommendations—and in what context they should be used. Use k-nearest neighbors models to leverage the wisdom of the crowd and predict how someone might rate an item they haven’t yet encountered.

# Collaborative filtering

1. Collaborative filtering

In the last chapter, we used the items a customer liked to suggest other, similar items. This works well when we have a lot of information about the items but not much data on how people feel about them. In this chapter, we will find the users whose preferences are most similar to those of the user we are making recommendations for and, based on that group's preferences, make suggestions.
2. Collaborative filtering

This form of recommendation is called collaborative filtering. Collaborative filtering is the name given to the prediction, or filtering, of items that might interest a user based on the preferences of similar users. It works from the premise that if person A has similar tastes to persons B and C,
3. Collaborative filtering

and both person B and C also like a certain item,
4. Collaborative filtering

then it is likely that person A would also like that new item.
5. Finding similar users

But how do we go about programmatically finding users with similar interests? Rating data is often difficult to compare between users. Even here it is not immediately clear how User_1 and User_2 compare.
6. Finding similar users

We need to get this data into a matrix of users and the items they rated. Now we can see which items both users have seen. Based on this matrix we can compare across users; here it is apparent that User_1 and User_3 have more similar preferences than User_1 and User_2.
7. Working with real data

Time for some real data! We will continue working with the book ratings dataset from the previous chapters containing each user, the book they rated, and the rating score.
8. Pivoting our data

As the data is in a DataFrame, pandas' pivot method can be used to reshape the data around specified columns. We want the users as the index, the columns representing the books, and the ratings as the corresponding values like you see here.
9. Data sparsity

The first thing that may become apparent after this transform is the number of missing entries, shown as NaN values. This is expected: a user will rarely have rated every item, and it is similarly rare for an item to have been rated by every person. This is an issue, as most similarity metrics do not handle missing data very well. How can we deal with this? We cannot just drop all the rows and columns that have missing data because, with data this sparse, that could be the whole DataFrame!
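
As a rough illustration of this sparsity problem, the sketch below builds a small made-up ratings table (the user names, titles, and values are assumptions for illustration, not the course dataset), measures the share of missing cells, and shows that dropping incomplete rows or columns leaves no data at all.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-item ratings; NaN marks an item the user has not rated
ratings = pd.DataFrame(
    {
        "Book A": [4.0, np.nan, np.nan, np.nan, 5.0],
        "Book B": [5.0, np.nan, 1.0, np.nan, np.nan],
        "Book C": [3.0, np.nan, 1.0, 5.0, 2.0],
        "Book D": [np.nan, 3.0, np.nan, 3.0, 5.0],
        "Book E": [4.0, np.nan, 5.0, np.nan, 4.0],
    },
    index=["user_1", "user_2", "user_3", "user_4", "user_5"],
)

# Fraction of cells that hold no rating
sparsity = ratings.isna().to_numpy().mean()
print(f"Missing share: {sparsity:.0%}")

# Every row and every column contains at least one NaN,
# so dropping incomplete rows (or columns) removes all the data
print(ratings.dropna(axis=0).shape)
print(ratings.dropna(axis=1).shape)
```

In this toy table every user and every item has at least one gap, so neither form of `dropna` keeps any ratings.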
10. Filling the missing values

Similarly, you might suggest filling the empty values with 0s, which might be valid for some machine learning models, but can create issues with recommendation engines. Take for example the second user here. They loved Catcher in the Rye, and enjoyed Fifty Shades of Grey, but have not rated The Great Gatsby. If we were to fill this NaN with a 0, we would be incorrectly implying they greatly disliked the book compared to the others, which we can't say for sure.
11. Filling the missing values

One alternative is to center each user's ratings around 0 by subtracting the row average and then filling in the missing values with 0. This means the missing data is replaced with neutral scores.
12. Filling the missing values

We first find the row means, then subtract each mean from the rest of its row. You can see the rows centered around 0 here.
13. Filling the missing values

We then fill the NaNs with 0s. This is not a perfect solution: the values lose some of their interpretability and should not be used as predictions in themselves, but they suffice when comparing between users.
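
The centering steps described above can be sketched in pandas as follows. The table and its values are hypothetical, chosen only to mirror the book example, and the variable names are assumptions rather than the course's own.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; User_2 has not rated The Great Gatsby
user_ratings_table = pd.DataFrame(
    {
        "The Great Gatsby": [5.0, np.nan, 2.0],
        "Catcher in the Rye": [4.0, 5.0, np.nan],
        "Fifty Shades of Grey": [np.nan, 4.0, 1.0],
    },
    index=["User_1", "User_2", "User_3"],
)

# Mean rating per user (NaNs are skipped by default)
avg_ratings = user_ratings_table.mean(axis=1)

# Subtract each user's mean from that user's row
user_ratings_centered = user_ratings_table.sub(avg_ratings, axis=0)

# Replace the remaining NaNs with 0, the neutral score after centering
user_ratings_normed = user_ratings_centered.fillna(0)
print(user_ratings_normed)
```

After centering, a 0 sits at each user's own average, so a filled-in 0 reads as "no opinion" rather than as a very low score.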
14. Let's practice!

We can now calculate similarities between users and we will get to that soon, but first let's work through shaping the data! 

# Pivoting your data

In this chapter, you will go one step further in generating personalized recommendations: you will find items liked by users who are similar to the one you are making recommendations for.

The first step you will need to start with is formatting your data. You begin with a dataset containing users and their ratings as individual rows with the following columns:

 * userId: User ID
 * title: Title of the movie
 * rating: Rating the user gave the movie

You will need to transform the DataFrame into a user rating matrix where each row represents a user, and each column represents the movies on the platform. This will allow you to easily compare users and their preferences.

Instructions

1. Inspect the first five rows of the user_ratings DataFrame to observe which columns would be most appropriate to pivot the data around.


In [None]:
# Inspect the first 5 rows of user_ratings
print(user_ratings.head())

'''
<script.py> output:
         userId  rating                title
    0  user_001     3.0  Pulp Fiction (1994)
    1  user_004     1.0  Pulp Fiction (1994)
    2  user_005     5.0  Pulp Fiction (1994)
    3  user_006     2.0  Pulp Fiction (1994)
    4  user_008     4.0  Pulp Fiction (1994)
'''

Question

 2. Which column from user_ratings should become the index of the pivoted DataFrame?

Possible Answers

1. userId
 - Correct
 
2. title
 - Incorrect: Not quite. The resulting DataFrame should have the title values used as the column names.

3. rating
 - Incorrect. The rating scores are used for the contents of the DataFrame, not its index.

 3. Transform the user_ratings DataFrame to a DataFrame containing ratings with one row per user and one column per movie and call it user_ratings_table.

In [None]:
# Transform the table
user_ratings_table = user_ratings.pivot(index='userId', columns='title', values='rating')
# Inspect the transformed table
print(user_ratings_table.head())

'''
<script.py> output:
    title     Forrest Gump (1994)  Matrix, The (1999)  Pulp Fiction (1994)  Shawshank Redemption, The (1994)  Silence of the Lambs, The (1991)
    userId                                                                                                                                    
    user_001                  4.0                 5.0                  3.0                               NaN                               4.0
    user_002                  NaN                 NaN                  NaN                               3.0                               NaN
    user_004                  NaN                 1.0                  1.0                               NaN                               5.0
    user_005                  NaN                 NaN                  5.0                               3.0                               NaN
    user_006                  5.0                 NaN                  2.0                               5.0                               4.0
'''

Conclusion

Good work! With this data in a matrix, you will be able to compare between users much more easily.
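
One caveat worth knowing: `pivot` raises a `ValueError` when the same index and column pair appears more than once, for example if a user rated the same movie twice. The sketch below uses made-up rows to show the failure and one possible workaround, averaging the duplicates with `pivot_table`. Whether averaging is appropriate for a given dataset is an assumption here, not something the exercise prescribes.

```python
import pandas as pd

# Hypothetical long-format ratings; user_001 rated the same movie twice
user_ratings = pd.DataFrame(
    {
        "userId": ["user_001", "user_001", "user_004"],
        "title": ["Pulp Fiction (1994)"] * 3,
        "rating": [3.0, 5.0, 1.0],
    }
)

try:
    user_ratings.pivot(index="userId", columns="title", values="rating")
    pivot_failed = False
except ValueError:
    # pivot cannot place two values in the same cell
    pivot_failed = True
print("pivot failed:", pivot_failed)

# One possible workaround: average duplicate ratings while reshaping
user_ratings_table = user_ratings.pivot_table(
    index="userId", columns="title", values="rating", aggfunc="mean"
)
print(user_ratings_table)
```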

# Finding similar users

Collaborative filtering is built around the premise that users who have rated items similarly in the past have similar tastes, and are therefore likely to rate new items in a similar fashion.

A subset of the movies dataset has been loaded as user_ratings_subset. The DataFrame contains user ratings with a row for each user and a column for each movie.

Examine user_ratings_subset. Which user is most similar to User A?

In [None]:
In [1]:
user_ratings_subset
Out[1]:

        Pulp Fiction  Forrest Gump  Toy Story  The Matrix
User_A             4             1          1           5
User_B             5             1          1           4
User_C             2             4          5           2
User_D             1             4          4           2

Possible Answers

1. They are all equally similar.
 - Incorrect, one of the users has rated the movies more similarly to User A than the others have.

2. User B.
 - Correct! Users A and B both rated "Forrest Gump" and "Toy Story" poorly, and "The Matrix" and "Pulp Fiction" highly. It is likely that they would both give other movies similar ratings too.
 
3. User C.
 - Incorrect: Not quite. The movies User C has rated highly are the opposite of those that User A liked.

4. User D.
 - Incorrect: Not quite. The movies User D has rated highly are the opposite of those that User A liked.
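
The same comparison can be checked numerically. The sketch below rebuilds the table shown above and scores each user against User_A with Pearson correlation. Correlation is used here only as one convenient similarity measure and is an assumption of this sketch, not necessarily the measure the course goes on to use.

```python
import pandas as pd

# The ratings table from the exercise above
user_ratings_subset = pd.DataFrame(
    {
        "Pulp Fiction": [4, 5, 2, 1],
        "Forrest Gump": [1, 1, 4, 4],
        "Toy Story": [1, 1, 5, 4],
        "The Matrix": [5, 4, 2, 2],
    },
    index=["User_A", "User_B", "User_C", "User_D"],
)

# DataFrame.corr compares columns, so transpose to put users in the columns
user_similarity = user_ratings_subset.T.corr()

# Similarity of every other user to User_A, highest first
similar_to_a = (
    user_similarity["User_A"].drop("User_A").sort_values(ascending=False)
)
print(similar_to_a)
```

User_B comes out on top with a strongly positive correlation, while User_C and User_D are negatively correlated with User_A, matching the answer reached by inspection.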

# Challenges with missing values

You may have noticed that the pivoted DataFrames you have been working with often have missing data. This is to be expected since users rarely see all movies, and most movies are not seen by everyone, resulting in gaps in the user-rating matrix.

In this exercise, you will explore another subset of the user ratings table user_ratings_subset that has missing values and observe how different approaches in dealing with missing data may impact its usability.

Instructions

Question

Take a look at the user_ratings_subset that has been loaded for you. The None value represents a situation where a user has not made a rating.

1. Based on the table, which user is most similar to User_A?



In [None]:
In [1]:
user_ratings_subset
Out[1]:

       Forrest Gump Pulp Fiction Toy Story The Matrix
User_A           10            9         7       None
User_B           10            9         7          0
User_C           10            9         7          8

Possible Answers

1. Both User_B and User_C
 - Correct
 
2. User_B
 - Incorrect: Not quite. Neither User_B nor User_C is more similar to User_A, as they have given the same review scores for all the movies that User_A has reviewed.

3. User_C
 - Incorrect: Not quite. Neither User_B nor User_C is more similar to User_A, as they have given the same review scores for all the movies that User_A has reviewed.

 2. Fill the gaps in the user_ratings_subset with zeros.

 3. Print and inspect the results.


In [None]:
# Fill in missing values with 0
user_ratings_table_filled = user_ratings_subset.fillna(0)

# Inspect the result
print(user_ratings_table_filled)

'''
<script.py> output:
            Forrest Gump  Pulp Fiction  Toy Story  The Matrix
    User_A            10             9          7           0
    User_B            10             9          7           0
    User_C            10             9          7           8
'''

Question

 4. Based on this user_ratings_table_filled, who now looks most similar to User_A?

Possible Answers

1. Both User B and User C
 - Not quite, one of the users is now more similar to User_A.

2. User B
 - True, User_B now looks a lot more similar to User_A when you fill in the missing values with zero, but you know from the unfilled data that this should not be the case. Merely filling in gaps with zeros without otherwise adjusting the data can skew the ratings toward the negative and should be avoided.
 
3. User C
 - Incorrect, User_C is no longer as similar to User_A as User_B is.
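
To make the effect of zero-filling concrete, the sketch below rebuilds the table from this exercise (using `np.nan` for the missing rating, which is an assumption about how the data is stored) and measures the Euclidean distance from User_A to each user, first on the movies User_A actually rated and then on the zero-filled table. Euclidean distance is chosen here only for simplicity.

```python
import numpy as np
import pandas as pd

# The table from this exercise; NaN stands in for the missing rating
user_ratings_subset = pd.DataFrame(
    {
        "Forrest Gump": [10, 10, 10],
        "Pulp Fiction": [9, 9, 9],
        "Toy Story": [7, 7, 7],
        "The Matrix": [np.nan, 0, 8],
    },
    index=["User_A", "User_B", "User_C"],
)

# Distance to User_A using only the movies User_A has rated
rated_by_a = user_ratings_subset.loc["User_A"].notna()
co_rated = user_ratings_subset.loc[:, rated_by_a]
dist_observed = np.sqrt(((co_rated - co_rated.loc["User_A"]) ** 2).sum(axis=1))

# Distance to User_A after filling the gap with 0
filled = user_ratings_subset.fillna(0)
dist_filled = np.sqrt(((filled - filled.loc["User_A"]) ** 2).sum(axis=1))

print(pd.DataFrame({"observed only": dist_observed, "zero-filled": dist_filled}))
```

On the observed ratings both User_B and User_C are at distance 0 from User_A. After zero-filling, User_B stays at 0 while User_C moves to 8, purely because of the value invented for the missing rating.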