
# Collaborative Filtering

Discover new items to recommend to users by finding others with similar tastes. Learn to make user-based and item-based recommendations—and in what context they should be used. Use k-nearest neighbors models to leverage the wisdom of the crowd and predict how someone might rate an item they haven’t yet encountered.

# Collaborative filtering

1. Collaborative filtering

In the last chapter, we used the items a customer liked to suggest other similar items. This works well when we have a lot of information about the items but not much data on how people feel about them. In this chapter, we will find the users whose preferences are most similar to those of the user we are making recommendations for and, based on that group's preferences, make suggestions.
2. Collaborative filtering

This form of recommendation is called collaborative filtering. Collaborative filtering is the name given to the prediction, or filtering, of items that might interest a user based on the preferences of similar users. It works on the premise that if person A has similar tastes to persons B and C,
3. Collaborative filtering

and both person B and C also like a certain item,
4. Collaborative filtering

then it is likely that person A would also like that new item.
5. Finding similar users

But how do we go about programmatically finding users with similar interests? Rating data is often difficult to compare between users. Even here it is not immediately clear how User_1 and User_2 compare.
6. Finding similar users

We need to get this data into a matrix of users and the items they rated. Now we can see which items both users have seen. Based on this matrix we can compare across users: here it is apparent that User_1 and User_3 have more similar preferences than User_1 and User_2.
7. Working with real data

Time for some real data! We will continue working with the book ratings dataset from the previous chapters containing each user, the book they rated, and the rating score.
8. Pivoting our data

As the data is in a DataFrame, pandas' pivot method can be used to reshape the data around specified columns. We want the users as the index, the columns representing the books, and the ratings as the corresponding values like you see here.
9. Data sparsity

The first thing that may become apparent after this transform is the number of missing entries, demonstrated by the NaN values. This is expected: a user will rarely have rated every item, and it's similarly rare that an item will have been rated by every person. This is an issue, as most similarity metrics do not handle missing data very well. How can we deal with this? We cannot just drop all the rows and columns that have missing data because, with data this sparse, that could be the whole DataFrame!
10. Filling the missing values

Similarly, you might suggest filling the empty values with 0s, which might be valid for some machine learning models, but can create issues with recommendation engines. Take for example the second user here. They loved Catcher in the Rye, and enjoyed Fifty Shades of Grey, but have not rated The Great Gatsby. If we were to fill this NaN with a 0, we would be incorrectly implying they greatly disliked the book compared to the others, which we can't say for sure.
11. Filling the missing values

One alternative is to center each user's ratings around 0 by deducting the row average and then fill in the missing values with 0. This means the missing data is replaced with neutral scores.
12. Filling the missing values

We first find the row means, then subtract each mean from the values in its row. You can see the rows centered around 0 here.
13. Filling the missing values

We then fill the NaNs with 0s. This is not a perfect solution: the values lose some of their interpretability and should not be used as predictions in themselves, but they suffice when comparing between users.
14. Let's practice!

We can now calculate similarities between users and we will get to that soon, but first let's work through shaping the data! 

# Pivoting your data

In this chapter, you will go one step further in generating personalized recommendations — you will find items that users, similar to the one you are making recommendations for, have liked.

The first step you will need to start with is formatting your data. You begin with a dataset containing users and their ratings as individual rows with the following columns:

 * userId: User ID
 * title: Title of the movie
 * rating: Rating the user gave the movie

You will need to transform the DataFrame into a user rating matrix where each row represents a user, and each column represents the movies on the platform. This will allow you to easily compare users and their preferences.

Instructions

1. Inspect the first five rows of the user_ratings DataFrame to observe which columns would be most appropriate to pivot the data around.


In [None]:
# Inspect the first 5 rows of user_ratings
print(user_ratings.head())

'''
<script.py> output:
         userId  rating                title
    0  user_001     3.0  Pulp Fiction (1994)
    1  user_004     1.0  Pulp Fiction (1994)
    2  user_005     5.0  Pulp Fiction (1994)
    3  user_006     2.0  Pulp Fiction (1994)
    4  user_008     4.0  Pulp Fiction (1994)
'''

Question

 2. Which column from user_ratings should become the index of the pivoted DataFrame?

Possible Answers

1. userId
 - Correct
 
2. title
 - Incorrect: Not quite, the resulting DataFrame should have the title values used as the column names.

3. rating
 - Incorrect. The rating scores are used for the contents of the DataFrame, not its index.

 3. Transform the user_ratings DataFrame to a DataFrame containing ratings with one row per user and one column per movie and call it user_ratings_table.

In [None]:
# Transform the table
user_ratings_table = user_ratings.pivot(index='userId', columns='title', values='rating')
# Inspect the transformed table
print(user_ratings_table.head())

'''
<script.py> output:
    title     Forrest Gump (1994)  Matrix, The (1999)  Pulp Fiction (1994)  Shawshank Redemption, The (1994)  Silence of the Lambs, The (1991)
    userId                                                                                                                                    
    user_001                  4.0                 5.0                  3.0                               NaN                               4.0
    user_002                  NaN                 NaN                  NaN                               3.0                               NaN
    user_004                  NaN                 1.0                  1.0                               NaN                               5.0
    user_005                  NaN                 NaN                  5.0                               3.0                               NaN
    user_006                  5.0                 NaN                  2.0                               5.0                               4.0
'''

Conclusion

Good work! With this data in a matrix, you will be able to compare between users much more easily.
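
One caveat worth sketching: `pivot` needs each (userId, title) pair to appear only once. The example below uses a small made-up DataFrame (the duplicate rating is an assumption for illustration, not something taken from the course dataset) to show `pivot` failing on a repeated pair and `pivot_table` averaging the duplicates instead.

```python
import pandas as pd

# Hypothetical long-format ratings; user_001 rated the same movie twice.
ratings = pd.DataFrame({
    "userId": ["user_001", "user_001", "user_002"],
    "title": ["Pulp Fiction (1994)"] * 3,
    "rating": [3.0, 5.0, 4.0],
})

# pivot() cannot place two values in one cell, so it raises a ValueError here.
try:
    ratings.pivot(index="userId", columns="title", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (here with the mean).
table = ratings.pivot_table(
    index="userId", columns="title", values="rating", aggfunc="mean"
)

print(pivot_failed)
print(table)
```

Whether averaging, keeping the latest rating, or dropping duplicates is the right choice depends on the dataset; the mean is used here only as one reasonable option.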

# Finding similar users

Collaborative filtering is built around the premise that users who have rated items similarly in the past have similar tastes, and therefore are likely to rate new items in a similar fashion.

A subset of the movies dataset has been loaded as user_ratings_subset. The DataFrame contains user ratings with a row for each user and a column for each movie.

Examine user_ratings_subset. Which user is most similar to User A?

In [None]:
In [1]:
user_ratings_subset
Out[1]:

        Pulp Fiction  Forrest Gump  Toy Story  The Matrix
User_A             4             1          1           5
User_B             5             1          1           4
User_C             2             4          5           2
User_D             1             4          4           2

Possible Answers

1. They are all equally similar.
 - Incorrect, one of the users has rated the movies more similarly to User A than the others.

2. User B.
 - Correct! User A and B both rated "Forrest Gump" and "Toy Story" poorly, and "The Matrix" and "Pulp Fiction" highly. It is likely that they would both give other movies similar ratings too.
 
3. User C.
 - Incorrect: Not quite, the movies User C has rated highly are the opposite of those that User A liked.

4. User D.
 - Incorrect: Not quite, the movies User D has rated highly are the opposite of those that User A liked.
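
The visual comparison above can also be checked numerically. The sketch below rebuilds the four-user table by hand, centers each user's ratings on their own mean, and ranks the other users by cosine similarity to User_A. The variable names `centered` and `sims` are chosen for this illustration only.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# The table shown above, rebuilt by hand
user_ratings_subset = pd.DataFrame(
    {
        "Pulp Fiction": [4, 5, 2, 1],
        "Forrest Gump": [1, 1, 4, 4],
        "Toy Story": [1, 1, 5, 4],
        "The Matrix": [5, 4, 2, 2],
    },
    index=["User_A", "User_B", "User_C", "User_D"],
)

# Center each user's ratings on that user's own average
centered = user_ratings_subset.sub(user_ratings_subset.mean(axis=1), axis=0)

# Pairwise cosine similarity between users, wrapped for label-based lookup
sims = pd.DataFrame(
    cosine_similarity(centered), index=centered.index, columns=centered.index
)

# Rank the other users by similarity to User_A
ranking = sims.loc["User_A"].drop("User_A").sort_values(ascending=False)
print(ranking)
```

User_B comes out on top with a strongly positive score, while User_C and User_D receive negative scores because their centered ratings point in the opposite direction.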

# Challenges with missing values

You may have noticed that the pivoted DataFrames you have been working with often have missing data. This is to be expected since users rarely see all movies, and most movies are not seen by everyone, resulting in gaps in the user-rating matrix.

In this exercise, you will explore another subset of the user ratings table user_ratings_subset that has missing values and observe how different approaches in dealing with missing data may impact its usability.

Instructions

Question

Take a look at the user_ratings_subset that has been loaded for you. The None value represents a situation where a user has not made a rating.

1. Based on the table, which user is most similar to User_A?



In [None]:
In [1]:
user_ratings_subset
Out[1]:

       Forrest Gump Pulp Fiction Toy Story The Matrix
User_A           10            9         7       None
User_B           10            9         7          0
User_C           10            9         7          8

Possible Answers

1. Both User_B and User_C
 - Correct
 
2. User_B
 - Incorrect: Not quite, neither User_B nor User_C is more similar to User_A, as they have given the same review scores for all the movies that User_A has reviewed.

3. User_C
 - Incorrect: Not quite, neither User_B nor User_C is more similar to User_A, as they have given the same review scores for all the movies that User_A has reviewed.

 2. Fill the gaps in the user_ratings_subset with zeros.

 3. Print and inspect the results.


In [None]:
# Fill in missing values with 0
user_ratings_table_filled = user_ratings_subset.fillna(0)

# Inspect the result
print(user_ratings_table_filled)

'''
<script.py> output:
            Forrest Gump  Pulp Fiction  Toy Story  The Matrix
    User_A            10             9          7           0
    User_B            10             9          7           0
    User_C            10             9          7           8
'''

Question

 4. Based on this user_ratings_table_filled, who now looks most similar to User_A?

Possible Answers

1. Both User B and User C
 - Not quite, one of the users is now more similar to User_A.

2. User B
 - True, User_B now looks a lot more similar to User_A when you fill in the missing values with zero, but you know from the unfilled data this should not be the case. Merely filling in gaps with zeros without adjusting the data otherwise can cause issues by skewing the reviews more negative and should not be done.
 
3. User C
 - Incorrect, User_C is no longer as similar to User_A as User_B is.
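
To put a number on this effect, here is a minimal sketch that rebuilds the three-user table by hand and compares cosine similarities computed two ways: on the zero-filled table, and on only the movies User_A actually rated. The names `sim_filled` and `sim_corated` are introduced for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# The table shown above; NaN marks the movie User_A has not rated
user_ratings_subset = pd.DataFrame(
    {
        "Forrest Gump": [10, 10, 10],
        "Pulp Fiction": [9, 9, 9],
        "Toy Story": [7, 7, 7],
        "The Matrix": [np.nan, 0, 8],
    },
    index=["User_A", "User_B", "User_C"],
)
users = user_ratings_subset.index

# Similarities after naively filling the gap with 0
filled = user_ratings_subset.fillna(0)
sim_filled = pd.DataFrame(cosine_similarity(filled), index=users, columns=users)

# Similarities using only the movies User_A has rated
rated_by_a = user_ratings_subset.columns[user_ratings_subset.loc["User_A"].notna()]
sim_corated = pd.DataFrame(
    cosine_similarity(user_ratings_subset[rated_by_a]), index=users, columns=users
)

print(sim_filled.loc["User_A"])
print(sim_corated.loc["User_A"])
```

On the co-rated movies both users match User_A exactly, but after zero-filling only User_B does, which is an artifact of the fill value rather than anything User_A expressed.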

# Compensating for incomplete data

For most datasets, the majority of users will have rated only a small number of items. As you saw in the last exercise, how you deal with users who do not have ratings for an item can greatly influence the validity of your models.

In this exercise, you will fill in missing data with information that should not bias the data that you do have.

You'll get the average score each user has given across all their ratings, and then use this average to center the users' scores around zero. Finally, you'll be able to fill in the empty values with zeros, which is now a neutral score, minimizing the impact on their overall profile, but still allowing the comparison of users.

user_ratings_table with a row per user has been loaded for you.

Instructions

1. Find the average of the ratings given by each user in user_ratings_table and store them as avg_ratings.

2. Subtract the row averages from each row in user_ratings_table, and store it as user_ratings_table_centered.

3. Fill the empty values in the newly created user_ratings_table_centered with zeros.


In [None]:
In [1]:
user_ratings_table.head()
Out[1]:

title     Forrest Gump (1994)  Matrix, The (1999)  Pulp Fiction (1994)  Shawshank Redemption, The (1994)  Silence of the Lambs, The (1991)
userId                                                                                                                                    
user_001                  4.0                 5.0                  3.0                               NaN                               4.0
user_002                  NaN                 NaN                  NaN                               3.0                               NaN
user_004                  NaN                 1.0                  1.0                               NaN                               5.0
user_005                  NaN                 NaN                  5.0                               3.0                               NaN
user_006                  5.0                 NaN                  2.0                               5.0                               4.0

In [None]:
# Get the average rating for each user 
avg_ratings = user_ratings_table.mean(axis=1)

# Center each user's ratings around 0
user_ratings_table_centered = user_ratings_table.sub(avg_ratings, axis=0)

# Fill in the missing data with 0s
user_ratings_table_normed = user_ratings_table_centered.fillna(0)

'''
In [2]:
user_ratings_table_normed.head()
Out[2]:

title     Forrest Gump (1994)  Matrix, The (1999)  Pulp Fiction (1994)  Shawshank Redemption, The (1994)  Silence of the Lambs, The (1991)
userId                                                                                                                                    
user_001                  0.0            1.000000            -1.000000                               0.0                          0.000000
user_002                  0.0            0.000000             0.000000                               0.0                          0.000000
user_004                  0.0           -1.333333            -1.333333                               0.0                          2.666667
user_005                  0.0            0.000000             1.000000                              -1.0                          0.000000
user_006                  1.0            0.000000            -2.000000                               1.0                          0.000000
'''

Conclusion

Great work! You will now be able to compare between rows without adding an unnecessary bias to the data when values are missing.
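
Two properties of the centered table are worth checking, sketched below on the first three rows of the output above (rebuilt by hand): the ratings each user did give now average to zero, and a user with a single rating, such as user_002, ends up as a row of zeros that carries no preference information. The names `row_means` and `uninformative` are introduced for this illustration.

```python
import numpy as np
import pandas as pd

# First three users from the table above, rebuilt by hand
user_ratings_table = pd.DataFrame(
    {
        "Forrest Gump (1994)": [4.0, np.nan, np.nan],
        "Matrix, The (1999)": [5.0, np.nan, 1.0],
        "Pulp Fiction (1994)": [3.0, np.nan, 1.0],
        "Shawshank Redemption, The (1994)": [np.nan, 3.0, np.nan],
        "Silence of the Lambs, The (1991)": [4.0, np.nan, 5.0],
    },
    index=["user_001", "user_002", "user_004"],
)

# mean() skips NaN by default, so only actual ratings count
avg_ratings = user_ratings_table.mean(axis=1)
centered = user_ratings_table.sub(avg_ratings, axis=0)
normed = centered.fillna(0)

# Each user's observed ratings should now average to (approximately) zero
row_means = centered.mean(axis=1)

# Users whose row is all zeros after centering and filling
uninformative = normed.index[(normed == 0).all(axis=1)]

print(row_means)
print(list(uninformative))
```

Rows like user_002's cannot be meaningfully compared with anyone, so it may be worth filtering them out before computing similarities, depending on the use case.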

# Finding similarities

1. Finding similarities

We have been focusing on finding similar users so far in this chapter. This is called user-based collaborative filtering. Comparisons between items, or item-based collaborative filtering, is also possible.
2. Item-based recommendations

It assumes if Item A and B receive similar reviews, either positive or negative,
3. Item-based recommendations

Then however other people feel about A,
4. Item-based recommendations

They should feel the same way about B.
5. User-based to item-based

If we have our data prepped as we did in the last few exercises we can switch between these two approaches,
6. User-based to item-based

by transposing the matrices giving us the items as rows and the users as columns.
7. User-based to item-based

This can be achieved in pandas by calling .T on the book rating DataFrame (user_ratings_pivot) we generated previously, which returns its transposed version. We can see the user-based matrix on top and the corresponding item-based matrix on the bottom. We will discuss in more depth which matrix is preferred later in this chapter, but the high-level answer, as with many questions in data science, is that it depends on the data. We will focus on item-based filtering for now, as the items can be a little more relatable.
8. Cosine similarities

With the item-based matrix containing a row per book, shown here, we can calculate the similarities and distances between items in the dataset, as we did with our content-based recommendations last chapter. We'll continue to use cosine similarity, but it is worth noting that because we have centered the data around zero, the cosine values can now range from -1 to 1, with 1 being the most similar and -1 the least. This does not have any impact on the process, so don't be concerned if you see some negative cosine values!
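
To see where the -1 to 1 range comes from, the sketch below computes cosine similarity directly from its definition (dot product divided by the product of the vector norms). The helper `cosine_sim` and the example vectors are made up for this illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical centered ratings for one item
centered = np.array([1.0, -1.0, 0.5])

same = cosine_sim(centered, 2 * centered)   # same direction
opposite = cosine_sim(centered, -centered)  # exactly opposite direction

# Raw (non-negative) ratings can never produce a negative cosine
raw = cosine_sim(np.array([5.0, 1.0, 4.0]), np.array([1.0, 5.0, 2.0]))

print(same, opposite, raw)
```

Negative values only become possible once centering introduces negative entries, which is why they did not appear with the uncentered data.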
9. Cosine similarities

Even though the range of the output can be different, the way we calculate similarities is the same. Let's compare two books, The Lord of the Rings and The Hobbit, from our dataset using cosine similarity. As a reminder, scikit-learn's cosine_similarity function expects two-dimensional NumPy arrays, so we need to do some reshaping first.
10. Cosine similarities

We first get the rows we want to compare.
11. Cosine similarities

Then we need to turn them into NumPy arrays with .values.
12. Cosine similarities

And reshape each of them into a two-dimensional array with a single row, using reshape(1, -1). As you can see, the two books are found to be quite similar (remember the values are between -1 and 1). This is expected as they are by the same author, but if we repeat it with two very different books, we might even get a negative value.
13. Cosine similarities

Comparing items is all well and good, but of course you want to start making recommendations. Let's do so by finding the most similar items overall! To do this, we need to find the similarities between all the items at once. Just like we did with content-based recommendations, we can call cosine_similarity on the full dataset, resulting in a similarity matrix between all items. Tidying this up by wrapping it in a DataFrame, with the item names as both the index and the columns, gets us a usable lookup table with the similarities for all items.
14. Cosine similarities

With this matrix calculated, we can make recommendations by finding the items that have been rated most similarly to the one a user liked: select the item you want to compare against and sort its similarities. Here you can see that the item rated most similarly to The Hobbit was Lord of the Rings, which makes sense as they share characters and an author.
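
The lookup-table idea can be sketched on a tiny item-based matrix. The centered ratings below are invented for illustration (they are not the values from the book dataset), and "Book X" is a hypothetical third title.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical centered ratings: one row per book, one column per user
book_ratings_centered = pd.DataFrame(
    [
        [1.0, -1.0, 0.5, 0.0],
        [0.8, -0.9, 0.0, 0.2],
        [-1.0, 1.0, 0.0, -0.5],
    ],
    index=["The Hobbit", "Lord of the Rings", "Book X"],
    columns=["user_1", "user_2", "user_3", "user_4"],
)

# Similarity between every pair of books at once
similarities = cosine_similarity(book_ratings_centered)

# Wrap in a DataFrame so rows and columns can be looked up by title
cosine_similarity_df = pd.DataFrame(
    similarities,
    index=book_ratings_centered.index,
    columns=book_ratings_centered.index,
)

# Select one book, drop its self-similarity, and sort the rest
ranked = (
    cosine_similarity_df.loc["The Hobbit"]
    .drop("The Hobbit")
    .sort_values(ascending=False)
)
print(ranked)
```

Dropping the selected title before sorting avoids recommending the item to itself, since every item has a similarity of 1 with itself.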
15. Let's practice!

Let us try this with the movies dataset you have been working on! 

# User-based to item-based

By now you have a dataset with no empty values that is primed for use.

In the preceding video, you learned about both user-based recommendations and item-based recommendations. User-based recommendations compare amongst users, and item-based recommendations compare different items.

In other words, you could use user-based data to find similar users based on how they rated different movies, while you could use item-based data to find similar movies based on how they have been rated by the users.

In this exercise, you will switch between the two and compare their values.

user_ratings_subset, a subset of the user-based DataFrame you have been working with, has been loaded for you.



In [None]:
In [1]:
user_ratings_subset.head()
Out[1]:

        The Sandlot  Ocean's Eleven  The Lion King  John Wick
User_A            1               4              1          5
User_B            1               5              1          4
User_C            4               2              5          2
User_D            4               1              4          2

Instructions

Question

1. Based on the data in user_ratings_subset, which user is most similar to User_A?

Possible Answers

1. User_B
 - Correct
 
2. User_C
 - Not quite, the movies User C has rated highly are the opposite of those that User A liked.

3. User_D
 - Not quite, the movies User D has rated highly are the opposite of those that User A liked.

 2. Transpose the user_ratings_subset table so that it is indexed by the movies and store the result as movie_ratings_subset.

In [None]:
# Transpose the user_ratings_subset DataFrame
movie_ratings_subset = user_ratings_subset.T

print(movie_ratings_subset)

'''
<script.py> output:
                    User_A  User_B  User_C  User_D
    The Sandlot          1       1       4       4
    Ocean's Eleven       4       5       2       1
    The Lion King        1       1       5       4
    John Wick            5       4       2       2
'''

Question

3. Based on this new transposed data, what movie appears most similar to The Sandlot?

Possible Answers

* Ocean's Eleven
    - Not quite, Ocean's Eleven has received very different ratings to The Sandlot.

* The Lion King
 - Awesome! You are now able to switch between the data needed for user-based models and item-based models. This will allow you to build recommendations using both kinds of data to see which suits your use case the best.
 
* John Wick
 - Not quite, John Wick has received very different ratings to The Sandlot.

# Similar and different movie ratings

Some types of movies might be liked by one group of people, but hated by another. This might reflect the type of movie far more than its quality. Take, for example, horror movies — many people absolutely love them, while others hate them.

By understanding which movies were reviewed in a similar way, we can often find very similar movies.

In this exercise, you will compare movies and see whether they have received similar reviewing patterns.

The DataFrame movie_ratings_centered has been loaded with a row per movie, and the centered ratings it received as the values.

Instructions

1. Assign the values for Star Wars: Episode IV and Star Wars: Episode V to sw_IV and sw_V.

2. Find their cosine similarity.

3. Find the cosine similarity between the ratings for Jurassic Park (jurassic_park) and Pulp Fiction (pulp_fiction).


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Assign the arrays to variables
sw_IV = movie_ratings_centered.loc['Star Wars: Episode IV - A New Hope (1977)', :].values.reshape(1, -1)
sw_V = movie_ratings_centered.loc['Star Wars: Episode V - The Empire Strikes Back (1980)', :].values.reshape(1, -1)

# Find the similarity between two Star Wars movies
similarity_A = cosine_similarity(sw_IV, sw_V)
print(similarity_A)

'''
<script.py> output:
    [[0.5357054]]
'''

# Assign the arrays to variables
jurassic_park = movie_ratings_centered.loc['Jurassic Park (1993)', :].values.reshape(1, -1)
pulp_fiction = movie_ratings_centered.loc['Pulp Fiction (1994)', :].values.reshape(1, -1)

# Find the similarity between Pulp Fiction and Jurassic Park
similarity_B = cosine_similarity(jurassic_park, pulp_fiction)
print(similarity_B)

'''
<script.py> output:
    [[-0.25630617]]
'''

Conclusion

Great work! As you can see, the two Star Wars movies generated a much larger similarity score than Jurassic Park and Pulp Fiction. This is expected, as although they are all award-winning movies, the users who like one Star Wars movie are very likely to like the other, while totally different users may like Jurassic Park and Pulp Fiction.

# Finding similarly liked movies

Just like you calculated the similarity between two movies, you can calculate it between all pairs of movies to find the movie most similar to another, based on how users have rated them.

The approach is similar to how you worked with content-based filtering.

You will find the similarity scores between all movies and then drill down on the movie of interest by isolating and sorting the column containing its similarity scores.

movie_ratings_centered has once again been loaded, containing each movie as a row, and their centered ratings stored as the values.

Instructions

1. Calculate the similarity matrix between all movies in movie_ratings_centered and store it as similarities.

2. Wrap the similarities matrix in a DataFrame, with the indices of movie_ratings_centered as the columns and rows.


In [None]:
In [1]:
movie_ratings_centered.head()
Out[1]:

userId                  user_001  user_002  user_003  user_004  user_005  ...  user_606  user_607  user_608  user_609  user_610
title                                                                     ...                                                  
American Beauty (1999)    0.5625       0.0       0.0  1.777778     0.000  ...  0.526316 -1.142857  1.105263     0.000 -0.947368
Apollo 13 (1995)          0.0000       0.0       0.0  0.000000    -0.875  ...  0.000000  0.857143 -1.894737    -0.375  0.000000
Braveheart (1995)        -0.4375       0.0       0.0  0.000000     0.125  ... -0.473684  0.857143  0.105263    -0.375  0.052632
Fight Club (1999)         0.5625       0.0       0.0 -1.222222     0.000  ...  1.026316  0.000000  1.105263     0.000  0.552632
Forrest Gump (1994)      -0.4375       0.0       0.0  0.000000     0.000  ...  0.026316  0.000000 -0.894737     0.625 -1.447368

[5 rows x 569 columns]

In [None]:
In [3]:
similarities
Out[3]:

array([[ 1.00000000e+00, -6.46003587e-02, -6.57918009e-02,
         8.12554487e-02, -1.42046480e-01, -1.36312980e-01,
        -1.01912289e-01, -9.70650392e-02, -7.41500108e-02,
         3.72337357e-03, -2.21793757e-02, -7.13422047e-02,
         1.33354491e-03, -3.32204227e-02, -2.33990088e-02,
        -8.88923342e-02, -9.15217725e-02, -3.84703569e-02,
        -8.65701327e-02,  9.79745495e-02],
       [-6.46003587e-02,  1.00000000e+00, -1.62996329e-02,
        -1.57908568e-01, -5.34098390e-02,  1.57140328e-01,
         9.78615975e-02, -8.21050217e-02, -3.92370219e-02,
        -1.87951184e-01, -1.03280954e-01, -5.83904712e-02,
        -6.45201957e-02, -1.83872364e-01, -1.65054131e-01,
        -4.98510149e-02, -9.45589388e-02,  1.51806909e-01,
         5.64957944e-02, -2.31133519e-01],
       [-6.57918009e-02, -1.62996329e-02,  1.00000000e+00,
        -5.89881348e-02,  4.96600654e-02, -2.26548297e-02,
         9.41426718e-03, -3.14775590e-02,  2.08386749e-02,
        -1.52063960e-01, -9.38768763e-02, -1.06807870e-01,
        -5.26415672e-02, -1.87559576e-02, -1.08687110e-01,
        -1.48078057e-01, -1.28978663e-01,  1.12499237e-01,
        -1.25975699e-01, -1.17067146e-01],
       [ 8.12554487e-02, -1.57908568e-01, -5.89881348e-02,
         1.00000000e+00, -1.06253240e-01, -1.73208145e-01,
        -2.11609425e-01, -1.40538303e-02,  1.23051261e-02,
         1.94274111e-01, -5.76758308e-02, -4.70463299e-02,
         1.37836015e-01,  8.42396675e-02, -4.02229146e-02,
        -2.03626466e-01, -1.34649853e-01, -1.59341352e-01,
        -9.90444129e-02,  7.06688375e-02],
       [-1.42046480e-01, -5.34098390e-02,  4.96600654e-02,
        -1.06253240e-01,  1.00000000e+00, -6.81829320e-02,
        -2.95868964e-02, -8.31055468e-02, -8.68320750e-02,
        -1.35365635e-01, -6.79248758e-02,  3.47007695e-02,
        -1.54840185e-01,  1.26956893e-01, -3.66663985e-02,
        -1.43114222e-01, -1.20872062e-01, -1.03927853e-01,
         2.64944865e-03, -7.96453940e-02],
       [-1.36312980e-01,  1.57140328e-01, -2.26548297e-02,
        -1.73208145e-01, -6.81829320e-02,  1.00000000e+00,
         2.67337979e-01, -3.93326556e-02, -9.76493888e-02,
        -2.46225357e-01, -8.26892279e-02, -1.29566031e-01,
        -5.53154354e-02, -1.47364269e-01, -9.71665777e-02,
        -2.36363992e-01, -1.86114479e-01, -7.88990783e-03,
         1.33897199e-01, -1.89822722e-01],
       [-1.01912289e-01,  9.78615975e-02,  9.41426718e-03,
        -2.11609425e-01, -2.95868964e-02,  2.67337979e-01,
         1.00000000e+00, -2.77069421e-02, -1.04342040e-03,
        -2.56306168e-01,  5.16978163e-02, -1.84108198e-01,
        -3.97814277e-02, -2.71635082e-01, -7.65415232e-02,
        -7.22196623e-02, -1.77528675e-01,  8.18253853e-02,
         1.97670843e-02, -2.12358531e-01],
       [-9.70650392e-02, -8.21050217e-02, -3.14775590e-02,
        -1.40538303e-02, -8.31055468e-02, -3.93326556e-02,
        -2.77069421e-02,  1.00000000e+00, -7.20666262e-02,
        -6.40836831e-02, -6.96812556e-03, -8.05585555e-02,
        -8.77927077e-03, -1.00709309e-01, -1.91399791e-02,
         2.05817554e-02, -1.44351895e-02, -7.27698206e-02,
        -9.22371838e-02, -9.67344759e-03],
       [-7.41500108e-02, -3.92370219e-02,  2.08386749e-02,
         1.23051261e-02, -8.68320750e-02, -9.76493888e-02,
        -1.04342040e-03, -7.20666262e-02,  1.00000000e+00,
        -2.85071801e-02, -4.71872993e-02, -1.15232514e-01,
        -2.44707919e-03, -7.81735208e-02, -9.52737225e-02,
        -3.63820103e-02, -5.78575480e-02, -3.80315668e-02,
        -1.76950324e-01, -1.91849859e-02],
       [ 3.72337357e-03, -1.87951184e-01, -1.52063960e-01,
         1.94274111e-01, -1.35365635e-01, -2.46225357e-01,
        -2.56306168e-01, -6.40836831e-02, -2.85071801e-02,
         1.00000000e+00, -8.53401424e-02, -3.30206049e-02,
         1.76011848e-02, -5.82428159e-03,  8.27147317e-02,
        -8.38668092e-02, -6.34846655e-02, -1.47780097e-01,
        -9.71907407e-02,  1.71815836e-01],
       [-2.21793757e-02, -1.03280954e-01, -9.38768763e-02,
        -5.76758308e-02, -6.79248758e-02, -8.26892279e-02,
         5.16978163e-02, -6.96812556e-03, -4.71872993e-02,
        -8.53401424e-02,  1.00000000e+00, -1.48590271e-01,
        -1.12897832e-02, -1.03063408e-01,  1.29804400e-02,
         7.83998009e-02,  1.05068966e-01, -3.06752807e-02,
        -2.80791575e-02, -8.91666501e-02],
       [-7.13422047e-02, -5.83904712e-02, -1.06807870e-01,
        -4.70463299e-02,  3.47007695e-02, -1.29566031e-01,
        -1.84108198e-01, -8.05585555e-02, -1.15232514e-01,
        -3.30206049e-02, -1.48590271e-01,  1.00000000e+00,
        -9.04903156e-02,  9.24880164e-02, -9.15568079e-03,
         1.36018307e-02,  4.35882369e-02, -1.19084598e-01,
        -3.67484923e-02,  9.70244906e-03],
       [ 1.33354491e-03, -6.45201957e-02, -5.26415672e-02,
         1.37836015e-01, -1.54840185e-01, -5.53154354e-02,
        -3.97814277e-02, -8.77927077e-03, -2.44707919e-03,
         1.76011848e-02, -1.12897832e-02, -9.04903156e-02,
         1.00000000e+00, -1.29394844e-01,  3.54727466e-02,
        -1.52130055e-01, -1.26926171e-01, -1.53221216e-01,
        -1.09157680e-02, -4.76558288e-02],
       [-3.32204227e-02, -1.83872364e-01, -1.87559576e-02,
         8.42396675e-02,  1.26956893e-01, -1.47364269e-01,
        -2.71635082e-01, -1.00709309e-01, -7.81735208e-02,
        -5.82428159e-03, -1.03063408e-01,  9.24880164e-02,
        -1.29394844e-01,  1.00000000e+00, -7.95182351e-02,
        -2.90289134e-02, -2.54978290e-02, -1.09295309e-01,
        -1.12996004e-01,  1.16411059e-01],
       [-2.33990088e-02, -1.65054131e-01, -1.08687110e-01,
        -4.02229146e-02, -3.66663985e-02, -9.71665777e-02,
        -7.65415232e-02, -1.91399791e-02, -9.52737225e-02,
         8.27147317e-02,  1.29804400e-02, -9.15568079e-03,
         3.54727466e-02, -7.95182351e-02,  1.00000000e+00,
        -5.01647646e-02, -7.23981334e-02, -1.70069415e-01,
        -9.74701756e-03,  2.75834785e-02],
       [-8.88923342e-02, -4.98510149e-02, -1.48078057e-01,
        -2.03626466e-01, -1.43114222e-01, -2.36363992e-01,
        -7.22196623e-02,  2.05817554e-02, -3.63820103e-02,
        -8.38668092e-02,  7.83998009e-02,  1.36018307e-02,
        -1.52130055e-01, -2.90289134e-02, -5.01647646e-02,
         1.00000000e+00,  5.35705396e-01, -9.14909936e-04,
        -1.32080403e-01, -3.00237567e-02],
       [-9.15217725e-02, -9.45589388e-02, -1.28978663e-01,
        -1.34649853e-01, -1.20872062e-01, -1.86114479e-01,
        -1.77528675e-01, -1.44351895e-02, -5.78575480e-02,
        -6.34846655e-02,  1.05068966e-01,  4.35882369e-02,
        -1.26926171e-01, -2.54978290e-02, -7.23981334e-02,
         5.35705396e-01,  1.00000000e+00,  2.68886812e-02,
        -1.19884554e-01, -3.88486348e-02],
       [-3.84703569e-02,  1.51806909e-01,  1.12499237e-01,
        -1.59341352e-01, -1.03927853e-01, -7.88990783e-03,
         8.18253853e-02, -7.27698206e-02, -3.80315668e-02,
        -1.47780097e-01, -3.06752807e-02, -1.19084598e-01,
        -1.53221216e-01, -1.09295309e-01, -1.70069415e-01,
        -9.14909936e-04,  2.68886812e-02,  1.00000000e+00,
        -4.28457006e-02, -1.78383516e-01],
       [-8.65701327e-02,  5.64957944e-02, -1.25975699e-01,
        -9.90444129e-02,  2.64944865e-03,  1.33897199e-01,
         1.97670843e-02, -9.22371838e-02, -1.76950324e-01,
        -9.71907407e-02, -2.80791575e-02, -3.67484923e-02,
        -1.09157680e-02, -1.12996004e-01, -9.74701756e-03,
        -1.32080403e-01, -1.19884554e-01, -4.28457006e-02,
         1.00000000e+00, -4.10989025e-02],
       [ 9.79745495e-02, -2.31133519e-01, -1.17067146e-01,
         7.06688375e-02, -7.96453940e-02, -1.89822722e-01,
        -2.12358531e-01, -9.67344759e-03, -1.91849859e-02,
         1.71815836e-01, -8.91666501e-02,  9.70244906e-03,
        -4.76558288e-02,  1.16411059e-01,  2.75834785e-02,
        -3.00237567e-02, -3.88486348e-02, -1.78383516e-01,
        -4.10989025e-02,  1.00000000e+00]])

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Generate the similarity matrix
similarities = cosine_similarity(movie_ratings_centered)

# Wrap the similarities in a DataFrame
cosine_similarity_df = pd.DataFrame(similarities, index=movie_ratings_centered.index, columns=movie_ratings_centered.index)

# Find the similarity values for a specific movie
cosine_similarity_series = cosine_similarity_df.loc['Star Wars: Episode IV - A New Hope (1977)']

# Sort these values highest to lowest
ordered_similarities = cosine_similarity_series.sort_values(ascending=False)

print(ordered_similarities)

'''
<script.py> output:
    title
    Star Wars: Episode IV - A New Hope (1977)                                         1.000000
    Star Wars: Episode V - The Empire Strikes Back (1980)                             0.535705
    Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    0.078400
    Lord of the Rings: The Fellowship of the Ring, The (2001)                         0.020582
    Schindler's List (1993)                                                           0.013602
    Terminator 2: Judgment Day (1991)                                                -0.000915
    Shawshank Redemption, The (1994)                                                 -0.029029
    Usual Suspects, The (1995)                                                       -0.030024
    Matrix, The (1999)                                                               -0.036382
    Apollo 13 (1995)                                                                 -0.049851
    Silence of the Lambs, The (1991)                                                 -0.050165
    Jurassic Park (1993)                                                             -0.072220
    Pulp Fiction (1994)                                                              -0.083867
    American Beauty (1999)                                                           -0.088892
    Toy Story (1995)                                                                 -0.132080
    Forrest Gump (1994)                                                              -0.143114
    Braveheart (1995)                                                                -0.148078
    Seven (a.k.a. Se7en) (1995)                                                      -0.152130
    Fight Club (1999)                                                                -0.203626
    Independence Day (a.k.a. ID4) (1996)                                             -0.236364
    Name: Star Wars: Episode IV - A New Hope (1977), dtype: float64
'''

Conclusion

Fantastic! As you can see, the most similar movie to Star Wars: Episode IV was Star Wars: Episode V, followed by Indiana Jones, another action-packed movie from the same era.

# Using K-nearest neighbors

1. Using K-nearest neighbors

You are now able to find similar items based on how the users in your dataset have rated them.
2. Beyond similar items

But what if we wanted not only to find similarly rated items, but also to predict how a user might rate an item even if it is not similar to any item they have seen? One approach is to find similar users using a K-nearest neighbors model and see how they liked the item.
3. K-nearest neighbors

As a reminder, K-NN finds the k users that are closest to the user in question, as measured by a specified metric. It then averages the ratings those users gave the item we are trying to get a rating for. In this example, k equals 3, so it finds the 3 nearest users and gets their ratings. This allows us to predict how we think a user might feel about an item, even if they haven't seen it before. Scikit-learn has a pre-built KNN model we will use later, but it is valuable to understand how it works by going through the process step by step first.
4. User-user similarity

We continue with our book rating DataFrame, this time predicting what rating User_1 might give the book "Catch-22" which they have not read. We previously generated the similarity scores between all items, in the item-based DataFrame. As we are now looking to find similar users, we repeat the process, but on the user-based DataFrame, and assign the users as columns and indices.
5. Understanding the similarity matrix

Examining the output, we see a grid of all users as rows and columns, and where they meet, their similarity score.
6. Understanding the similarity matrix

So User_1 and User_3 here are quite similar.
7. Understanding the similarity matrix

While User_1 and User_2 are not.
8. Step by step KNN

Let's set k to 3 and find the k nearest neighbors of User_1. We select User_1's similarity values, then order them to find the 3 most similar users, getting just their names using .index.
9. Step by step KNN

We then find the ratings these users gave to the book from our original ratings DataFrame and get the mean. This rating represents the rating the user would likely give to Catch-22 based on the ratings users similar to them gave it.
10. Using scikit-learn's KNN

Let's look at how this can be done using scikit-learn. For this, we need two datasets: the centered user-based rating DataFrame, with a row per user, a column per item, and values of the ratings centered around 0, and the original user_ratings_table with uncentered scores and missing values.
11. Using scikit-learn's KNN

We drop the Catch-22 column as that will be our target, and separate the user we are predicting for. Note we use double brackets to keep this as a DataFrame. The original raw ratings for the item we are predicting on are extracted. Think of this as your Y values in your model.
12. Using scikit-learn's KNN

As we only care about neighbors that have read the book, we keep only the users that have actually rated it. We similarly drop the rows in the ratings that are empty. Think of other_users_x and other_users_y as your x and y training values, while target_user_x is the data you are trying to predict with.
13. Using scikit-learn's KNN

We can then import and instantiate the KNeighborsRegressor model from sklearn specifying cosine similarities as the metric. We fit it the same way we fit any model and predict on the user values we want to predict.
14. Using scikit-learn's KNN

An advantage of using the sklearn approach is that you can quickly change parameters, or even try out classification instead of regression, where the most common rating is predicted rather than the average, as seen here.
15. Let's practice!

Now it's time to try this yourself.
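The step-by-step KNN process described above can be sketched on toy data. This is a minimal illustration, not the course's own code: the ratings, the extra book titles, and the variable names are all hypothetical, and the similarity matrix is built from row-centered ratings with missing values filled with 0.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: one row per user, one column per book, NaN = not rated
user_ratings = pd.DataFrame(
    {
        "The Great Gatsby": [5.0, 1.0, 4.0, 2.0, 5.0, 4.0],
        "The Catcher in the Rye": [4.0, 2.0, 5.0, 1.0, 4.0, 4.0],
        "Fifty Shades of Grey": [1.0, 5.0, 2.0, 4.0, 1.0, 2.0],
        "Catch-22": [np.nan, 2.0, 5.0, 1.0, 4.0, 4.0],
    },
    index=["User_1", "User_2", "User_3", "User_4", "User_5", "User_6"],
)

# Center each user's ratings around their own mean, then fill gaps with 0
centered = user_ratings.sub(user_ratings.mean(axis=1), axis=0).fillna(0)

# User-user similarity matrix, with users as both rows and columns
user_similarities = pd.DataFrame(
    cosine_similarity(centered), index=centered.index, columns=centered.index
)

# Select User_1's similarity values, remove User_1 itself, and sort
k = 3
ordered = user_similarities.loc["User_1"].drop("User_1").sort_values(ascending=False)
nearest_neighbors = ordered.index[:k]

# Average the raw ratings those neighbors gave the target book
predicted_rating = user_ratings.loc[nearest_neighbors, "Catch-22"].mean()
print(list(nearest_neighbors), predicted_rating)
```

Dropping the target user by label, rather than skipping the first sorted position, avoids relying on the self-similarity sorting first.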

# Stepping through K-nearest neighbors

You have just seen how K-nearest neighbors can be used to infer how someone might rate an item based on the wisdom of a (similar) crowd. In this exercise, you will step through this process yourself to ensure a good understanding of how it works.

To get you started, as you have generated similarity matrices many times before, that step has been done for you with the user similarity matrix wrapped in a DataFrame loaded as user_similarities.

This has each user as the rows and columns, and where they meet the corresponding similarity score.

In this exercise, you will be working with user_001's similarity scores, find their nearest neighbors, and based on the ratings those neighbors gave a movie, infer what rating user_001 might give it if they saw it.

Instructions

1. Find the IDs of user_001's 10 nearest neighbors by extracting the top 10 users in ordered_similarities and storing them as nearest_neighbors.

2. Extract the ratings the users in nearest_neighbors gave from user_ratings_table as neighbor_ratings.

3. Calculate the average rating these users gave to the movie Apollo 13 (1995) to infer what user_001 might give it if they had seen it.


In [None]:
# Isolate the similarity scores for user_001 and sort
user_similarity_series = user_similarities.loc['user_001']
ordered_similarities = user_similarity_series.sort_values(ascending=False)

# Find the top 10 most similar users (position 0 is user_001 itself, so skip it)
nearest_neighbors = ordered_similarities.iloc[1:11].index

# Extract the ratings of the neighbors
neighbor_ratings = user_ratings_table.reindex(nearest_neighbors)

# Calculate the mean rating given by the user's nearest neighbors
print(neighbor_ratings['Apollo 13 (1995)'].mean())

'''
<script.py> output:
    3.8
'''

Good work! Based on the scores the users most similar to user_001 gave Apollo 13, we can infer that user_001 would likely give it a score close to 4.

# Getting KNN data in shape

Now that you understand the ins and outs of how K-nearest neighbors works, you can leverage scikit-learn's implementation of KNN while recognizing what it is doing underneath the hood.

In the next two exercises, you will step through how to prepare your data for scikit-learn's KNN model, and then use it to make inferences about what rating a user might give a movie they haven't seen.

For consistency, you will once again be working with user_001 and the rating they would give Apollo 13 (1995) if they saw it.

The users_to_ratings DataFrame has again been loaded for you. This contains each user with its own row and each rating they gave as the values.

Similarly, user_ratings_table has been loaded, which contains the raw rating values (pre-centering and filling with zeros).

Instructions

1. Drop the column corresponding to the movie you are predicting for (Apollo 13 (1995)) from the users_to_ratings DataFrame in place.

2. Extract the ratings for user_001 from the resulting users_to_ratings table and store them as target_user_x.


In [None]:
In [1]:
users_to_ratings.head()
Out[1]:

title     American Beauty (1999)  Apollo 13 (1995)  Braveheart (1995)  Fight Club (1999)  Forrest Gump (1994)  ...  Star Wars: Episode IV - A New Hope (1977)  \
userId                                                                                                         ...                                              
user_001                0.562500             0.000            -0.4375           0.562500              -0.4375  ...                                   0.562500   
user_002                0.000000             0.000             0.0000           0.000000               0.0000  ...                                   0.000000   
user_003                0.000000             0.000             0.0000           0.000000               0.0000  ...                                   0.000000   
user_004                1.777778             0.000             0.0000          -1.222222               0.0000  ...                                   1.777778   
user_005                0.000000            -0.875             0.1250           0.000000               0.0000  ...                                   0.000000   

title     Star Wars: Episode V - The Empire Strikes Back (1980)  Terminator 2: Judgment Day (1991)  Toy Story (1995)  Usual Suspects, The (1995)  
userId                                                                                                                                            
user_001                                           0.562500                                  0.000           -0.4375                      0.5625  
user_002                                           0.000000                                  0.000            0.0000                      0.0000  
user_003                                           0.000000                                  0.000            0.0000                      0.0000  
user_004                                           1.777778                                  0.000            0.0000                      0.0000  
user_005                                           0.000000                                 -0.875            0.1250                      0.1250  

[5 rows x 20 columns]

In [None]:
In [3]:
user_ratings_table.head()
Out[3]:

title     American Beauty (1999)  Apollo 13 (1995)  Braveheart (1995)  Fight Club (1999)  Forrest Gump (1994)  ...  Star Wars: Episode IV - A New Hope (1977)  \
userId                                                                                                         ...                                              
user_001                     5.0               NaN                4.0                5.0                  4.0  ...                                        5.0   
user_002                     NaN               NaN                NaN                NaN                  NaN  ...                                        NaN   
user_003                     NaN               NaN                NaN                NaN                  NaN  ...                                        NaN   
user_004                     5.0               NaN                NaN                2.0                  NaN  ...                                        5.0   
user_005                     NaN               3.0                4.0                NaN                  NaN  ...                                        NaN   

title     Star Wars: Episode V - The Empire Strikes Back (1980)  Terminator 2: Judgment Day (1991)  Toy Story (1995)  Usual Suspects, The (1995)  
userId                                                                                                                                            
user_001                                                5.0                                    NaN               4.0                         5.0  
user_002                                                NaN                                    NaN               NaN                         NaN  
user_003                                                NaN                                    NaN               NaN                         NaN  
user_004                                                5.0                                    NaN               NaN                         NaN  
user_005                                                NaN                                    3.0               4.0                         4.0  

[5 rows x 20 columns]

In [None]:
# Drop the column you are trying to predict
users_to_ratings.drop("Apollo 13 (1995)", axis=1, inplace=True)

# Get the data for the user you are predicting for
target_user_x = users_to_ratings.loc[["user_001"]]

 3. Get the raw ratings for Apollo 13 (1995) from the user_ratings_table and store it as other_users_y.

In [None]:
# Drop the column you are trying to predict
users_to_ratings.drop("Apollo 13 (1995)", axis=1, inplace=True)

# Get the data for the user you are predicting for
target_user_x = users_to_ratings.loc[["user_001"]]

# Get the target data from user_ratings_table
other_users_y = user_ratings_table["Apollo 13 (1995)"]

 4. Select only the users from users_to_ratings that have rated the movie and store it as other_users_x.

 5. Drop the rows from the other_users_y target that have not rated the movie.


In [None]:
# Drop the column you are trying to predict
users_to_ratings.drop("Apollo 13 (1995)", axis=1, inplace=True)

# Get the data for the user you are predicting for
target_user_x = users_to_ratings.loc[["user_001"]]

# Get the target data from user_ratings_table
other_users_y = user_ratings_table["Apollo 13 (1995)"]

# Get the data for only those that have seen the movie
other_users_x = users_to_ratings[other_users_y.notnull()]

# Remove those that have not seen the movie from the target
other_users_y.dropna(inplace=True)

Conclusion

Great work! You now have the data to train a KNN model (other_users_x and other_users_y) and the data you wish to predict against (target_user_x).
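The exercise cells above drop the target column in place, so re-running them in a notebook would fail once the column is gone. Below is a hedged sketch of the same preparation on hypothetical toy data that avoids in-place mutation; the ratings and the helper names `features` and `has_rated` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for user_ratings_table (raw ratings, NaN = unrated)
user_ratings_table = pd.DataFrame(
    {
        "Apollo 13 (1995)": [np.nan, 4.0, np.nan, 3.0],
        "Braveheart (1995)": [4.0, 5.0, 2.0, np.nan],
        "Toy Story (1995)": [5.0, np.nan, 3.0, 4.0],
    },
    index=["user_001", "user_002", "user_003", "user_004"],
)

# Hypothetical stand-in for users_to_ratings: row-centered, gaps filled with 0
users_to_ratings = user_ratings_table.sub(
    user_ratings_table.mean(axis=1), axis=0
).fillna(0)

target = "Apollo 13 (1995)"

# drop() without inplace returns a new DataFrame, so this cell can be re-run
features = users_to_ratings.drop(columns=[target])

# Double brackets keep the single user as a one-row DataFrame
target_user_x = features.loc[["user_001"]]

# Boolean mask of the users who actually rated the target movie
has_rated = user_ratings_table[target].notnull()

# Training features and raw target ratings, aligned on the same users
other_users_x = features[has_rated]
other_users_y = user_ratings_table.loc[has_rated, target]

print(other_users_x.index.tolist(), other_users_y.tolist())
```

Building both training objects from one mask keeps their rows aligned without modifying either source table.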

# KNN predictions

With the data in the correct shape from the last exercise, you can now use it to infer how user_001 feels about Apollo 13 (1995).

As a reminder, the data you prepared in the last exercise (and that has been loaded into this one) is:

1. target_user_x - Centered ratings that user_001 has given to the movies they have seen.

2. other_users_x - Centered ratings for all other users and the movies they have rated excluding the movie Apollo 13.

3. other_users_y - Raw ratings that all other users have given the movie Apollo 13.

You will use other_users_x and other_users_y to fit a KNeighborsRegressor from scikit-learn and use it to predict what user_001 might have rated Apollo 13 (1995).

Instructions

1. Import KNeighborsRegressor from scikit-learn.

2. Instantiate the regressor as user_knn with the metric specified as cosine and k set to 10.

In [None]:
# Import the regressor
from sklearn.neighbors import KNeighborsRegressor

# Instantiate the user KNN model
user_knn = KNeighborsRegressor(metric='cosine', n_neighbors=10)

 3. Fit the user_knn regressor on the other_users_x and other_users_y data.

 4. Using the trained model, predict what user_001 (whose ratings are stored in target_user_x) would have given the movie.


In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Instantiate the user KNN model
user_knn = KNeighborsRegressor(metric='cosine', n_neighbors=10)

# Fit the model and predict the target user
user_knn.fit(other_users_x, other_users_y)
user_user_pred = user_knn.predict(target_user_x)

print(user_user_pred)

'''
<script.py> output:
    [3.85]
'''

Conclusion

Perfect! One advantage of using a library like scikit-learn for these steps is that you are able to iterate easily. For example, you can try the above again, but this time with a different n_neighbors value, or even try to replace KNeighborsRegressor with KNeighborsClassifier to find the most common neighbors' rating as opposed to their average.
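To make the regressor-versus-classifier difference concrete, here is a small sketch on hypothetical data (the feature values, ratings, and names are invented for illustration). With k set to 3, the regressor averages the three nearest neighbors' ratings, while the classifier returns their most common rating. Note that KNeighborsClassifier expects discrete labels, so the ratings here are integers.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical centered ratings for five users on two movies
other_users_x = pd.DataFrame(
    {
        "movie_a": [1.0, 1.0, 0.9, -1.0, 0.0],
        "movie_b": [0.1, -0.1, 0.3, 0.0, 1.0],
    },
    index=["user_a", "user_b", "user_c", "user_d", "user_e"],
)

# Hypothetical raw ratings those users gave the movie being predicted
other_users_y = pd.Series([5, 5, 2, 1, 3], index=other_users_x.index)

# The user we are predicting for, kept as a one-row DataFrame
target_user_x = pd.DataFrame(
    {"movie_a": [1.0], "movie_b": [0.0]}, index=["target_user"]
)

# Regression: average rating of the 3 nearest users (user_a, user_b, user_c)
knn_reg = KNeighborsRegressor(metric="cosine", n_neighbors=3)
knn_reg.fit(other_users_x, other_users_y)
reg_pred = knn_reg.predict(target_user_x)

# Classification: most common rating among the same 3 nearest users
knn_clf = KNeighborsClassifier(metric="cosine", n_neighbors=3)
knn_clf.fit(other_users_x, other_users_y)
clf_pred = knn_clf.predict(target_user_x)

print(reg_pred, clf_pred)  # average is 4.0, most common rating is 5
```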

# Item-based or user-based

1. Item-based or user-based

Throughout this chapter, we have worked with item-based recommendations and user-based recommendations.
2. Item-based filtering

As a reminder, item-based recommendations are where you use the average of the k most similar items that a user has rated to suggest a rating for an item they haven't yet seen. If we want to see what this user would rate the purple book, we take the 3 most similar books they have read and reviewed, and average their scores.
3. User-based filtering

Alternatively, user-based recommendations are when you use the average of the ratings the k most similar users gave an item, to suggest what rating the target user would give it. So in the same example as before we would find the 3 most similar users that have reviewed the purple book, and take that average as our user's score. Both provide useful results, but you, of course, will want to know when to use item-based, and when to use user-based recommendations. Often it will very much depend on the data, but there are some standard pros and cons of both which we will discuss.
4. Why use item-based filtering?

First of all, item-based recommendations are more consistent over time. Users' preferences change: for example, you might enjoy animated movies when you are younger but prefer action movies later in life. Items, on the other hand, do not usually change; a movie that was a horror movie when it came out is still a horror movie years later. Item-based recommendations can also be easier to explain. Telling a user that they were recommended a book because they liked a similar one (item-based collaborative filtering) can make more sense than persuading them they might like a book because a user they have never met liked it (user-based collaborative filtering). Item-based recommendations also allow for more precalculation. Any online store generally has a finite, known inventory. Its owner can calculate offline which items in the inventory are similar to each other and which are not, and then use the results on their site. New users, on the other hand, appear every day, so they cannot benefit from as much precalculation. One negative is that item-based recommendations are often very obvious, for example just suggesting the next movie in a series, which might not add much value.
5. Why use user-based filtering?

In fact, although item-based recommendations appear to be preferable to user-based ones on most counts, one area where user-based recommendations can win out is that they can be a lot more interesting and unexpected than item-based ones. They can be particularly useful for finding less popular items that the user would like. For this reason, while item-based recommendations are preferable in use cases where conservative suggestions are encouraged, such as an e-commerce store, user-based recommendations can add value for more subjective items such as movies, books, or other entertainment.
6. Let's practice!

Now you can not only generate item-based and user-based collaborative recommendations, but you should also be able to recognize the value of both. Let's see if you can compare them and understand their benefits.
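The item-based approach described in this lesson can be sketched by hand in the same way as the user-based one. The example below is illustrative only: the ratings, book titles, and variable names are hypothetical. The item-item similarity matrix is the part a store could precalculate offline; the per-user lookup afterwards is cheap.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: one row per user, one column per book
user_ratings = pd.DataFrame(
    {
        "Book A": [5.0, 4.0, 1.0, 2.0],
        "Book B": [4.0, 5.0, 2.0, 1.0],
        "Book C": [1.0, 2.0, 5.0, 4.0],
        "Book D": [2.0, 1.0, 4.0, 5.0],
        "Purple Book": [np.nan, 5.0, 1.0, 2.0],
    },
    index=["User_1", "User_2", "User_3", "User_4"],
)

# Center per user and fill gaps, then transpose so each row is an item
centered = user_ratings.sub(user_ratings.mean(axis=1), axis=0).fillna(0)
item_vectors = centered.T

# Item-item similarities: this matrix could be computed ahead of time
item_similarities = pd.DataFrame(
    cosine_similarity(item_vectors),
    index=item_vectors.index,
    columns=item_vectors.index,
)

k = 2
target_user, target_item = "User_1", "Purple Book"

# Only items the target user has actually rated can contribute
rated_items = user_ratings.loc[target_user].dropna().index

# The k rated items most similar to the target item
nearest_items = (
    item_similarities.loc[target_item, rated_items]
    .sort_values(ascending=False)
    .index[:k]
)

# Average the target user's own ratings of those items
predicted_rating = user_ratings.loc[target_user, nearest_items].mean()
print(list(nearest_items), predicted_rating)
```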

# Comparing item-based and user-based models

You have now looked at two different KNN approaches. The first was item-item KNN where you use the average of the k most similar movies that a user has rated to suggest a rating for a movie they haven't watched. The other approach was user-user KNN where you use the average of the ratings that the k most similar users gave the movie to suggest what rating the target user would give the movie.

Now, you will compare the two and calculate what rating user_002 would give to Forrest Gump.

The code for the user_rating_predictor model (that predicts based on what similar users gave the movie), and the movie_rating_predictor (that predicts based off of what ratings this user gave to similar movies) has been started for you.

KNeighborsRegressor has been imported for you.

Instructions

1. Create a user-user K-nearest neighbors model called user_knn.

2. Fit the user_knn model then predict on target_user_x.

3. Similarly, fit an item-item K-nearest neighbors model called movie_knn, then predict on target_movie_x.


In [None]:
# Instantiate the user KNN model
user_knn = KNeighborsRegressor()

# Fit the model and predict the target user
user_knn.fit(other_users_x, other_users_y)
user_user_pred = user_knn.predict(target_user_x)
print("The user-user model predicts {}".format(user_user_pred))

# Instantiate the movie KNN model
movie_knn = KNeighborsRegressor()

# Fit the model on the movie data and predict
movie_knn.fit(other_movies_x, other_movies_y)
item_item_pred = movie_knn.predict(target_movie_x)
print("The item-item model predicts {}".format(item_item_pred))

'''
<script.py> output:
    The user-user model predicts [4.5]
    The item-item model predicts [4.1]
'''

Conclusion

Good work! Now you can make recommendations from both user data and item data and compare them! The two models predict different values; which approach is more accurate will depend on the data available.
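The exercise supplies other_movies_x, other_movies_y, and target_movie_x ready-made, and the source does not show how they were built. One plausible way to prepare them, offered purely as an assumption, is to transpose the centered ratings so each row is a movie, then mirror the user-based preparation. The toy ratings below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical raw ratings (NaN = unrated); user_002 has not rated Forrest Gump
user_ratings_table = pd.DataFrame(
    {
        "Apollo 13 (1995)": [4.0, 5.0, 1.0, 2.0],
        "Braveheart (1995)": [5.0, 4.0, 2.0, 1.0],
        "Toy Story (1995)": [2.0, 1.0, 5.0, 4.0],
        "Fight Club (1999)": [1.0, 2.0, 4.0, 5.0],
        "Forrest Gump (1994)": [5.0, np.nan, 1.0, 2.0],
    },
    index=["user_001", "user_002", "user_003", "user_004"],
)
target_user, target_movie = "user_002", "Forrest Gump (1994)"

# Center per user, fill gaps with 0 (assumed to match the course's centering)
centered = user_ratings_table.sub(
    user_ratings_table.mean(axis=1), axis=0
).fillna(0)

# Rows are now movies; drop the target user's column, as it is what we predict
movie_vectors = centered.T.drop(columns=[target_user])

# Features for the movie being predicted, kept as a one-row DataFrame
target_movie_x = movie_vectors.loc[[target_movie]]

# Raw ratings the target user gave to the movies they have seen
other_movies_y = user_ratings_table.loc[target_user].dropna()
other_movies_x = movie_vectors.loc[other_movies_y.index]

# Item-item KNN: average the user's ratings of the 2 most similar movies
movie_knn = KNeighborsRegressor(metric="cosine", n_neighbors=2)
movie_knn.fit(other_movies_x, other_movies_y)
item_item_pred = movie_knn.predict(target_movie_x)
print("The item-item model predicts {}".format(item_item_pred))
```

Here the features of each movie are the ratings other users gave it, and the target values are the ratings the target user gave, which is the mirror image of the user-user setup.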

# Which should you choose?

In the last exercise, you compared user-based and item-based models. As you saw, they both generated serviceable results, but it was not clear why you would choose one model over the other.

As a reminder, user-based collaborative filtering finds similar users to the one you are making a recommendation for and suggests items these similar users enjoy. Item-based collaborative filtering finds similar items to those that the user you are making recommendations for likes and recommends those.

Both have their uses, and choosing the best one will depend on the use case. Being able to identify which to use is a valuable skill.

Instructions

1. Each of the following attributes is more closely associated with one type of collaborative filtering than the other. Drag each tile to the model that best suits the description.

Item-Item

- This type of collaborative filtering is often easier to explain or more relatable.

- This type of collaborative filtering is more consistent over time.

- This type of collaborative filtering allows for similarities to be precalculated.

User-User

- This type of collaborative filtering generally creates more surprising/unexpected, but still relevant suggestions.

Conclusion

Congratulations, you now have shown that you are able to generate user-based and item-based recommendations, and you can tell when you should choose each one.