<a href="https://colab.research.google.com/github/ruizleandro/Recommender-Systems/blob/master/About_Recommender_Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Source: Laura Igual, Santi Segui. (2017). *Introduction to Data Science: A Python Approach*, Barcelona, Spain: Springer.

# Introduction

## What is a Recommender System?

It is a tool designed to interact with large and complex information spaces, and to provide information or items that are likely to be of interest to the user, in an automated fashion.

## Why and When Do We Need a Recommender System?

In this new era, where the quantity of information is huge, recommender systems are extremely useful in several domains. People cannot be experts in every domain in which they are users, and they do not have enough time to spend looking for the perfect TV or book to buy. In particular, recommender systems are really interesting when dealing with the following issues:

* Solutions for large amounts of good data.
* Reduction of cognitive load on the user.
* Allowing new items to be revealed to users.

## How Do Recommender Systems Work?

There are several different ways to build a recommender system. However, most of them take one of two basic approaches: *content-based filtering* (CBF) or *collaborative filtering* (CF).

### Content-Based Filtering

CBF methods are built on the following paradigm: "Show me more of the same as what I've liked". This approach recommends items that are similar to those the user liked before, and the recommendations are based on descriptions of items and a profile of the user's preferences. The computation of the similarity between items is the most important part of these methods, and it is based on the content of the items themselves. As the content of an item can be very diverse, and it usually depends on the kind of items the system recommends, a range of sophisticated algorithms is usually used to extract features from items. When dealing with textual information such as books or news, a widely used representation is *tf-idf*. The term tf-idf stands for term frequency-inverse document frequency; it is a numerical statistic that measures how important a word is to a document in a collection or corpus.
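To make the tf-idf idea concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, a length-normalized term frequency, and the plain `log(N / df)` form of the inverse document frequency; real systems often use smoothed variants. The item descriptions and the helper names `tf_idf` and `cosine` are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (tf times log idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical item descriptions, invented for illustration
descriptions = [
    "space adventure with robots",
    "robots fight in space",
    "romantic comedy in paris",
]
vectors = tf_idf(descriptions)
sim_space = cosine(vectors[0], vectors[1])    # the two items share "space" and "robots"
sim_romance = cosine(vectors[0], vectors[2])  # the two items share no terms
```

Under these assumptions, the first item is more similar to the second than to the third, which is the kind of signal a content-based recommender ranks on.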

### Collaborative Filtering

CF methods are built on the following paradigm: "Tell me what's popular among my like-minded users". This is a really intuitive paradigm, since it is very similar to what people usually do: ask about or look at the preferences of the people they trust. The working hypothesis behind this kind of recommender is that users who were similar in the past will like similar things in the future. The approach is based on collecting and analyzing a large amount of data related to the behavior, activities, or tastes of users, and predicting what users will like based on their similarity to other users. One of the main advantages of this type of system is that it does not need to "understand" what the item it recommends is.

These methods are extremely popular because of their simplicity and the large amount of data available from users. The main drawbacks of this kind of method are the need for a user community and the *cold-start* effect for new users in the community. The cold-start problem appears when the system cannot draw any inference or recommendation, or only a poor one, for certain users (or items) because it has not yet gathered sufficient information about them.

CF can be of two types: *user-based* or *item-based*.

* User-based CF works like this: Find similar users to me and recommend what they liked. In this method, given a user, *U*, we first find a set of other users, *D*, whose ratings are similar to the ratings of *U* and then we calculate a prediction for *U*.
* Item-based CF works like this: Find similar items to those that I previously liked. In item-based CF, we first build an item-item matrix that determines relationships between pairs of items; then, using this matrix and data on the current user *U*, we infer the user's taste. Typically, this approach is used in the domain: people who buy *x* also buy *y*. This is a really popular approach used by companies like Amazon. Moreover, one of the advantages of this approach is that items usually don't change much, so their similarities can be computed offline.
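The user-based variant described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: similarity is the cosine over co-rated items, and the prediction is a similarity-weighted average of the other users' ratings. The ratings, the user names, and the helpers `similarity` and `predict` are all hypothetical.

```python
import math

# Hypothetical ratings (user -> {item: rating}), invented for illustration
ratings = {
    "U": {"A": 5.0, "B": 4.0, "C": 1.0},
    "D1": {"A": 5.0, "B": 5.0, "C": 1.0, "X": 5.0},
    "D2": {"A": 1.0, "B": 2.0, "C": 5.0, "X": 1.0},
}

def similarity(a, b):
    """Cosine similarity over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item, ratings):
    """Similarity-weighted average of the other users' ratings for item."""
    weighted_sum, weight_total = 0.0, 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = similarity(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None

# U has not rated X; D1 agrees with U far more than D2 does,
# so the prediction is pulled toward D1's rating of X.
prediction = predict("U", "X", ratings)
```

An item-based version would apply the same similarity function to item columns instead of user rows, which is what the practical case later in this notebook does with correlations.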

### Hybrid Recommenders

Hybrid approaches can be implemented in several ways: by making content-based and collaborative predictions separately and then combining them; by adding content-based capabilities to a collaborative approach (and vice versa); or by unifying the approaches into one model.
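The first option, combining separate predictions, could look like the following sketch. It assumes both recommenders already produce scores on the same scale and blends them with a single weight `alpha`; the function name, the weight, and the scores are hypothetical.

```python
def hybrid_score(cbf_score, cf_score, alpha=0.5):
    """Blend two scores that are assumed to be on the same scale."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * cbf_score + (1.0 - alpha) * cf_score

# Hypothetical per-item scores from two separately built recommenders
cbf_scores = {"item_a": 0.9, "item_b": 0.2, "item_c": 0.5}
cf_scores = {"item_a": 0.1, "item_b": 0.8, "item_c": 0.6}

# alpha=0.3 gives the collaborative score more weight than the content score
blended = {
    item: hybrid_score(cbf_scores[item], cf_scores[item], alpha=0.3)
    for item in cbf_scores
}
ranking = sorted(blended, key=blended.get, reverse=True)
```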

## Modeling User Preferences

Both CBF and CF recommender systems need to understand user preferences. Modeling those preferences is a critical step because of the variety of sources they can come from. Dealing with an application like the Netflix movie recommender, where users rate movies with 1 to 5 stars, is not the same as dealing with a product recommender system like Amazon's, where the tracking information of purchases is usually used. In the latter case, three values can be used: 0 - not bought; 1 - viewed; 2 - bought.

The most common types of labels used to estimate the user preferences are:

* Boolean expressions (is bought?; is viewed?)
* Numerical expressions (e.g., star ranking)
* Up-Down expressions (e.g., like, neutral, or dislike)
* Weighted value expressions (e.g., number of reproductions or clicks)
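As a small illustration of implicit preference modeling, the sketch below turns a tracking log into the 0/1/2 values mentioned above (0 - not bought, 1 - viewed, 2 - bought). The event log, the event names, and the helper `implicit_preferences` are hypothetical; keeping the strongest interaction per pair is an assumption of this sketch, not something the source prescribes.

```python
# Interaction levels from the text: 0 = not bought, 1 = viewed, 2 = bought
EVENT_VALUE = {"viewed": 1, "bought": 2}

def implicit_preferences(events):
    """Keep the strongest interaction seen for each (user, item) pair."""
    prefs = {}
    for user, item, event in events:
        value = EVENT_VALUE[event]
        prefs[(user, item)] = max(prefs.get((user, item), 0), value)
    return prefs

# Hypothetical event log, invented for illustration
events = [
    ("u1", "tv", "viewed"),
    ("u1", "tv", "bought"),
    ("u1", "book", "viewed"),
    ("u2", "tv", "viewed"),
]
prefs = implicit_preferences(events)

# Pairs that never appear in the log count as 0 (not bought)
u2_book = prefs.get(("u2", "book"), 0)
```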

## Evaluating Recommenders

There are two main approaches to evaluating recommender systems: *offline* and *online* evaluation.

We refer to evaluation as offline when a set of labeled data is obtained and then divided into two sets: a *training set* and a *test set*. The training set is used to create the model and adjust all the parameters, while the test set is used to compute the selected evaluation metrics. Standard metrics such as RMSE, precision, and recall are extensively used, but recently other indirect measures have also started to be widely considered, for example diversity, novelty, coverage, cold-start, or serendipity. The last of these is a quite popular metric that evaluates how surprising the recommendations are.
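The standard offline metrics named above are simple to compute once a test set has been held out. The sketch below assumes the predictions have already been produced by some model; the numbers and the helper names `rmse` and `precision_recall` are invented for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against relevant items."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical held-out ratings and the model's predictions for them
test_actual = [4.0, 3.0, 5.0, 2.0]
test_predicted = [3.5, 3.0, 4.0, 3.0]
error = rmse(test_actual, test_predicted)

# Hypothetical top-4 recommendation list versus the items the user really liked
precision, recall = precision_recall(
    recommended=["a", "b", "c", "d"], relevant=["a", "c", "e"]
)
```

Here two of the four recommended items are relevant, and two of the three relevant items were recommended.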

We refer to evaluation as online when a set of tools is used that allows us to look at the interactions of users with the system. The most common online technique is called *A-B Testing*, and it has the benefit of allowing evaluation of the system at the same time as users are learning, buying, or playing with the recommender system. This brings the evaluation closer to the actual working of the system and makes it really effective when the purpose of the system is to change or influence the behavior of users, since we can observe how that behavior changes when users interact with different recommender systems. Let us give an example: imagine we want to develop a music recommender system like Pandora, where the final goal is none other than for users to love our intelligent music station and spend more time listening to it. In such a situation, offline metrics like RMSE are not good enough. In this case, we are particularly interested in evaluating the global goal of the recommender system, such as long-term profit or user retention.
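A very small sketch of the mechanics of an A-B test follows. It assumes users are split deterministically by hashing their id and that the online metric is minutes listened, as in the music-station example; the session log and both helper names are hypothetical, and a real test would add a statistical significance check before drawing conclusions.

```python
import hashlib
from statistics import mean

def assign_variant(user_id, variants=("A", "B")):
    """Deterministically map a user to one variant via a hash of the id."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def mean_by_variant(sessions):
    """Average the online metric (here: minutes listened) per variant."""
    grouped = {}
    for user_id, minutes in sessions:
        grouped.setdefault(assign_variant(user_id), []).append(minutes)
    return {variant: mean(values) for variant, values in grouped.items()}

# Hypothetical (user id, minutes listened) log, invented for illustration
sessions = [(1, 30.0), (2, 45.0), (3, 20.0), (4, 60.0), (5, 35.0), (6, 50.0)]
report = mean_by_variant(sessions)
```

Hashing the user id keeps each user in the same variant across sessions, so the two recommenders are compared on stable groups.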

# Practical Case

We are going to use the *MovieLens 100K Dataset*.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving movies.csv to movies (1).csv
User uploaded file "movies.csv" with length 494431 bytes


In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving ratings.csv to ratings.csv
User uploaded file "ratings.csv" with length 2483723 bytes


In [None]:
import pandas as pd

# load only the columns we need from each file
movies = pd.read_csv('movies.csv', usecols=['movieId', 'title'])
ratings = pd.read_csv('ratings.csv', usecols=['userId', 'movieId', 'rating'])

# merge the two tables on movieId so every rating row also carries the movie title
ratings = pd.merge(movies, ratings, on='movieId')

In [None]:
ratings.head()

| | movieId | title | userId | rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 1 | 4.0 |
| 1 | 1 | Toy Story (1995) | 5 | 4.0 |
| 2 | 1 | Toy Story (1995) | 7 | 4.5 |
| 3 | 1 | Toy Story (1995) | 15 | 2.5 |
| 4 | 1 | Toy Story (1995) | 17 | 4.5 |


## Movies and Users DataFrame

We pivot the table so that each row is a user and each column is a movie, which gives us a direct relation between movies and users.

In [None]:
movieRatings = ratings.pivot_table(index=['userId'],
                                   columns=['title'], values='rating')
movieRatings.head()

| userId | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | ... | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | ... | NaN | 4.0 | NaN |
| 2 | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | ... | NaN | NaN | NaN |


Now we can easily extract the vector of every movie that a given user rated, and we can also extract the vector of every user's rating for a given movie, which is what we want.

In [None]:
# for example
toyStoryRatings = movieRatings['Toy Story (1995)']
toyStoryRatings.head()

userId
1    4.0
2    NaN
3    NaN
4    NaN
5    4.0
Name: Toy Story (1995), dtype: float64
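To see what the pivot does without the full dataset, here is a toy version with three users and three movies. The data is invented; only `pivot_table`, column selection, and `.loc` are taken from the cells above.

```python
import pandas as pd

# Hypothetical ratings in the same long format as the merged table
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# One row per user, one column per movie; missing ratings become NaN
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

movie_a = toy_matrix["Movie A"]      # a column: every user's rating of one movie
user_1 = toy_matrix.loc[1].dropna()  # a row: the movies one user actually rated
```

User 3 never rated "Movie A", so that cell is `NaN`, just like most cells in the real `movieRatings` matrix.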

Next, we will compare columns from different movies.

For this, we are going to compute the correlation score between Toy Story and every other movie. After that, we have to drop all the `NaN` values, so that we only keep movie similarities that actually exist, where more than one person rated both movies. Finally, we are going to construct a new DataFrame from the results and look at the top 10.

## Similar Movies According To Ratings

In [None]:
similarMovies = movieRatings.corrwith(toyStoryRatings)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies.sort_values(ascending=False))
df.head(10)



| title | 0 |
|---|---|
| Land Before Time III: The Time of the Great Giving (1995) | 1.0 |
| Orlando (1992) | 1.0 |
| Goosebumps (2015) | 1.0 |
| Encounters at the End of the World (2008) | 1.0 |
| Suburban Commando (1991) | 1.0 |
| Bad Words (2013) | 1.0 |
| Escaflowne: The Movie (Escaflowne) (2000) | 1.0 |
| Opera (1987) | 1.0 |
| Evening with Kevin Smith 2: Evening Harder, An (2006) | 1.0 |
| Everybody Wants Some (2016) | 1.0 |


These results are clearly unreliable: many obscure movies come out as perfectly correlated (1.0) with Toy Story. So now we will improve the movie similarities.

The problem here is that if only one or two people rated both Toy Story and Orlando (for example), and their ratings happened to agree, the two movies come out as highly correlated. But we can't make recommendations based on the taste of just one or two people.
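The effect is easy to reproduce by hand. The sketch below computes a Pearson correlation over the users who rated both movies; with only two co-raters whose ratings move in the same direction, the result is exactly 1.0 regardless of how little evidence there is. The ratings and the helper names `pearson` and `co_rated_correlation` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def co_rated_correlation(movie_a, movie_b):
    """Correlate two movies using only the users who rated both."""
    common = sorted(set(movie_a) & set(movie_b))
    r = pearson([movie_a[u] for u in common], [movie_b[u] for u in common])
    return r, len(common)

# Hypothetical ratings (user id -> rating), invented for illustration
popular = {1: 4.0, 2: 2.5, 3: 5.0, 4: 3.0, 5: 4.5}
obscure = {2: 3.0, 5: 4.0}  # only two users rated this one

r, support = co_rated_correlation(popular, obscure)
```

Two points always lie on a straight line, so the correlation says nothing about how similar the movies really are; the number of co-raters (`support`) is what tells us whether to trust it.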

We need to have some sort of confidence level in our similarities by enforcing a minimum boundary of how many people watched a given movie. 

So, let's try to put that insight into action using the following code:

In [None]:
import numpy as np
movieStats = ratings.groupby('title').agg({'rating': [np.size, np.mean]})
movieStats.head()

| title | rating, size | rating, mean |
|---|---|---|
| '71 (2014) | 1.0 | 4.0 |
| 'Hellboy': The Seeds of Creation (2004) | 1.0 | 4.0 |
| 'Round Midnight (1986) | 2.0 | 3.5 |
| 'Salem's Lot (2004) | 1.0 | 5.0 |
| 'Til There Was You (1997) | 2.0 | 4.0 |


Now we can see how many people rated every movie and the average rating.

Let's go ahead and get rid of movies rated by fewer than 100 people:

In [None]:
popularMovies = movieStats['rating']['size'] >= 100
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]

| title | rating, size | rating, mean |
|---|---|---|
| Shawshank Redemption, The (1994) | 317.0 | 4.429022 |
| Godfather, The (1972) | 192.0 | 4.289062 |
| Fight Club (1999) | 218.0 | 4.272936 |
| Godfather: Part II, The (1974) | 129.0 | 4.25969 |
| Departed, The (2006) | 107.0 | 4.252336 |
| Goodfellas (1990) | 126.0 | 4.25 |
| Casablanca (1942) | 100.0 | 4.24 |
| Dark Knight, The (2008) | 149.0 | 4.238255 |
| Usual Suspects, The (1995) | 204.0 | 4.237745 |
| Princess Bride, The (1987) | 142.0 | 4.232394 |


What we have here is a list of movies that were rated by at least 100 people, sorted by their average rating score, and this in itself is a recommender system.
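The same "popular and well rated" idea can be written without pandas, which makes the two steps (filter by number of ratings, then sort by mean) explicit. This is a sketch on invented data; `top_rated` and its parameters are hypothetical names.

```python
from collections import defaultdict

def top_rated(rating_rows, min_ratings, k):
    """Return the k titles with the highest mean rating among those
    that received at least min_ratings ratings."""
    by_title = defaultdict(list)
    for title, rating in rating_rows:
        by_title[title].append(rating)
    stats = [
        (title, len(values), sum(values) / len(values))
        for title, values in by_title.items()
        if len(values) >= min_ratings
    ]
    # sort by mean rating, best first
    return sorted(stats, key=lambda row: row[2], reverse=True)[:k]

# Hypothetical (title, rating) rows, invented for illustration
rows = [
    ("Movie A", 5.0), ("Movie A", 4.0), ("Movie A", 4.5),
    ("Movie B", 3.0), ("Movie B", 3.5), ("Movie B", 4.0),
    ("Movie C", 5.0),  # a single perfect rating
]
best = top_rated(rows, min_ratings=3, k=2)
```

"Movie C" has the highest mean but only one rating, so the threshold removes it, which mirrors what the 100-rating cutoff does above.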

Things look a little better now, so let's build a new DataFrame of Toy Story recommendations (movies similar to Toy Story) restricted to the movies that appear in this filtered list. To do that, we use the `join` operation to combine the statistics of the popular movies with our original `similarMovies` results, which keeps only movies that have at least 100 ratings.

## Similar Movies According To A Lot Of People

In [None]:
df = movieStats[popularMovies].join(pd.DataFrame(similarMovies,
                                                 columns=['similarity']))
df.sort_values(by=['similarity'], ascending=False)[:15]



| title | (rating, size) | (rating, mean) | similarity |
|---|---|---|---|
| Toy Story (1995) | 215.0 | 3.92093 | 1.0 |
| Incredibles, The (2004) | 125.0 | 3.836 | 0.643301 |
| Finding Nemo (2003) | 141.0 | 3.960993 | 0.618701 |
| Aladdin (1992) | 183.0 | 3.79235 | 0.611892 |
| Monsters, Inc. (2001) | 132.0 | 3.871212 | 0.490231 |
| Mrs. Doubtfire (1993) | 144.0 | 3.388889 | 0.446261 |
| Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) | 120.0 | 4.183333 | 0.438237 |
| American Pie (1999) | 103.0 | 3.378641 | 0.420117 |
| Die Hard: With a Vengeance (1995) | 144.0 | 3.555556 | 0.410939 |
| E.T. the Extra-Terrestrial (1982) | 122.0 | 3.766393 | 0.409216 |


This is starting to look better! There's still room for improvement, but hey! We got some results that make sense, whoo-hoo!

## Making Movie Recommendations To People

In [None]:
corrMatrix = movieRatings.corr(method='pearson', min_periods=100)
corrMatrix.head(10)

| title | '71 (2014) | 'Hellboy': The Seeds of Creation (2004) | 'Round Midnight (1986) | ... | xXx: State of the Union (2005) | ¡Three Amigos! (1986) | À nous la liberté (Freedom for Us) (1931) |
|---|---|---|---|---|---|---|---|
| '71 (2014) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 'Hellboy': The Seeds of Creation (2004) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 'Round Midnight (1986) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 'Salem's Lot (2004) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 'Til There Was You (1997) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 'Tis the Season for Love (2015) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 'burbs, The (1989) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| 'night Mother (1986) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| (500) Days of Summer (2009) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
| *batteries not included (1987) | NaN | NaN | NaN | ... | NaN | NaN | NaN |
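The `min_periods` argument is what produces all those `NaN` cells: a pair of movies only gets a correlation if enough users rated both. The toy matrix below, with invented ratings, shows the same pair flipping from a perfect correlation to `NaN` once the threshold exceeds the number of co-raters.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix; "Movie C" was rated by only two users
toy_matrix = pd.DataFrame({
    "Movie A": [4.0, 2.0, 5.0, 3.0],
    "Movie B": [5.0, 1.0, 4.0, 2.0],
    "Movie C": [np.nan, np.nan, 4.0, 2.0],
}, index=pd.Index([1, 2, 3, 4], name="userId"))

# Require at least 2, then at least 3, users who rated both movies of a pair
loose = toy_matrix.corr(method="pearson", min_periods=2)
strict = toy_matrix.corr(method="pearson", min_periods=3)

a_c_loose = loose.loc["Movie A", "Movie C"]    # computed from two co-raters
a_c_strict = strict.loc["Movie A", "Movie C"]  # NaN: not enough co-raters
a_b_strict = strict.loc["Movie A", "Movie B"]  # four co-raters, still defined
```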


Let's start by picking a random user and using their ratings to create recommendations.

In [None]:
# you can create your own ratings with this:
myRatings = movieRatings.loc[463].dropna()
myRatings.head(50)

As Good as It Gets (1997)                    4.5
Dead Poets Society (1989)                    4.5
Die Hard (1988)                              0.5
Django Unchained (2012)                      4.5
Forrest Gump (1994)                          5.0
Fugitive, The (1993)                         0.5
Ghost (1990)                                 4.0
Gladiator (2000)                             0.5
Good Will Hunting (1997)                     4.0
Goodfellas (1990)                            4.0
Gravity (2013)                               0.5
Green Mile, The (1999)                       4.0
Inglourious Basterds (2009)                  0.5
Jurassic Park (1993)                         0.5
Matrix, The (1999)                           0.5
Nine Months (1995)                           4.0
Rain Man (1988)                              3.5
Schindler's List (1993)                      3.5
Shawshank Redemption, The (1994)             4.5
Shutter Island (2010)                        4.0
Silence of the Lambs

I am going to try to find recommendations for this user.

So, let's start by creating a Series called `simCandidates`, going through every movie that he rated.

In [None]:
simList = []
for title, rating in myRatings.items():
  print('Adding sims for ' + title + "...")
  # retrieve similar movies to this one that he rated
  sims = corrMatrix[title].dropna()
  # now scale its similarity by how well he rated this movie
  sims = sims * rating
  # add the scores to the list of similarity candidates
  simList.append(sims)
# combine everything into one Series (Series.append was removed in pandas 2.0)
simCandidates = pd.concat(simList)
# glance at our results so far:
print("Sorting...")
simCandidates.sort_values(inplace=True, ascending=False)
print(simCandidates.head(15))

Adding sims for As Good as It Gets (1997)...
Adding sims for Dead Poets Society (1989)...
Adding sims for Die Hard (1988)...
Adding sims for Django Unchained (2012)...
Adding sims for Forrest Gump (1994)...
Adding sims for Fugitive, The (1993)...
Adding sims for Ghost (1990)...
Adding sims for Gladiator (2000)...
Adding sims for Good Will Hunting (1997)...
Adding sims for Goodfellas (1990)...
Adding sims for Gravity (2013)...
Adding sims for Green Mile, The (1999)...
Adding sims for Inglourious Basterds (2009)...
Adding sims for Jurassic Park (1993)...
Adding sims for Matrix, The (1999)...
Adding sims for Nine Months (1995)...
Adding sims for Rain Man (1988)...
Adding sims for Schindler's List (1993)...
Adding sims for Shawshank Redemption, The (1994)...
Adding sims for Shutter Island (2010)...
Adding sims for Silence of the Lambs, The (1991)...
Adding sims for Sixth Sense, The (1999)...
Adding sims for Sleepless in Seattle (1993)...
Adding sims for Stand by Me (1986)...
Adding sims fo


This doesn't look bad, right?

Let's start to refine these results a little bit more. We're seeing that we're getting duplicate values back. If a movie was similar to more than one movie that he rated, it will come back more than once in the results, so we want to combine those together. If the same movie does appear several times, its scores should be added up into a combined, stronger recommendation score.

### Using The `groupby` Command To Combine Rows

We're going to use the `groupby` command again to group together all of the rows that are for the same movie. Next, we will sum up their correlation scores and look at the results.

In [None]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace=True, ascending=False)
simCandidates.head(20)

Forrest Gump (1994)                                      12.354128
Shawshank Redemption, The (1994)                         10.852738
Schindler's List (1993)                                   9.243253
Matrix, The (1999)                                        8.603162
Silence of the Lambs, The (1991)                          8.591899
Saving Private Ryan (1998)                                8.537086
Sixth Sense, The (1999)                                   8.395077
Fight Club (1999)                                         7.752193
Good Will Hunting (1997)                                  7.743020
Godfather, The (1972)                                     6.536762
Braveheart (1995)                                         5.702647
Ghost (1990)                                              5.541518
Terminator 2: Judgment Day (1991)                         5.537900
Usual Suspects, The (1995)                                5.382437
Gladiator (2000)                                          5.33

This is looking really good!

These are all movies that our user would actually enjoy watching!

The last thing we need to do is filter out the movies that he has already rated, because it doesn't make sense to recommend movies he has already seen.

In [None]:
# drop by title: both Series are indexed by movie title
# (drop_duplicates would compare the score values, not the titles)
filteredSims = simCandidates.drop(myRatings.index, errors='ignore')
filteredSims.head(20)

Forrest Gump (1994)                                      12.354128
Shawshank Redemption, The (1994)                         10.852738
Schindler's List (1993)                                   9.243253
Matrix, The (1999)                                        8.603162
Silence of the Lambs, The (1991)                          8.591899
Saving Private Ryan (1998)                                8.537086
Sixth Sense, The (1999)                                   8.395077
Fight Club (1999)                                         7.752193
Good Will Hunting (1997)                                  7.743020
Godfather, The (1972)                                     6.536762
Braveheart (1995)                                         5.702647
Ghost (1990)                                              5.541518
Terminator 2: Judgment Day (1991)                         5.537900
Usual Suspects, The (1995)                                5.382437
Gladiator (2000)                                          5.33
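The three steps of this section (scale each similarity by the user's rating, add up the scores of movies that appear more than once, and remove what the user already rated) can be summarized in one small function. This is a plain-Python sketch on invented similarities, not the notebook's pandas code; `recommend` and the data are hypothetical.

```python
from collections import defaultdict

def recommend(user_ratings, similarities, k=10):
    """Score candidates by summing similarity * rating over the user's
    rated movies, then drop what the user has already rated."""
    scores = defaultdict(float)
    for rated_title, rating in user_ratings.items():
        for candidate, sim in similarities.get(rated_title, {}).items():
            scores[candidate] += sim * rating  # duplicates accumulate here
    unseen = {t: s for t, s in scores.items() if t not in user_ratings}
    return sorted(unseen.items(), key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical item-item similarities and user ratings, invented for illustration
similarities = {
    "Movie A": {"Movie A": 1.0, "Movie B": 0.6, "Movie C": 0.4},
    "Movie B": {"Movie B": 1.0, "Movie A": 0.6, "Movie C": 0.5, "Movie D": 0.3},
}
user_ratings = {"Movie A": 5.0, "Movie B": 4.0}

recs = recommend(user_ratings, similarities)
rec_titles = [title for title, _ in recs]
```

"Movie C" is similar to both rated movies, so its two contributions add up and it ranks first, while "Movie A" and "Movie B" are filtered out because the user has already rated them.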