## Collaborative filtering using Python

Alright, so let's do it! We have some Python code that uses Pandas, and the various other tools at our disposal, to create movie recommendations with a surprisingly small amount of code.

The first thing we're going to do is show you item-based collaborative filtering in practice. We'll build up *people who watched this also watched that* relationships or, more precisely, *people who rated this highly also rated that highly*, so we're building up movie-to-movie relationships. We're going to base it on real data from the MovieLens project. If you go to MovieLens.org, there's an open movie recommender system there, where people can rate movies and get recommendations for new movies.
 
And, they make all the underlying data publicly available for researchers like us. So, we're going to use some real movie ratings data. It is a little bit dated, around 10 years old, so keep that in mind, but it is real behavior data that we finally get to work with here. We will use it to compute similarities between movies, and that data in and of itself is useful: you can use it to say *people who liked this also liked that*. Let's say I'm looking at a web page for a movie. The system can then say: *if you liked this movie, and given that you're looking at it you're probably interested in it, then you might also like these movies*. That's a form of recommender system right there, even though we don't even know who you are.

Now, it is real-world data, so we're going to encounter some real-world problems with it. Our initial set of results isn't going to look good, so we're going to spend a little extra time figuring out why, which is a lot of what you spend your time doing as a data scientist: correct those problems, then go back and run it again until you get results that make sense.

And finally, we'll actually do item-based collaborative filtering in its entirety, where we actually recommend movies to individuals based on their own behavior. So, let's do this, let's get started!

## Finding movie similarities

Let's apply the concept of item-based collaborative filtering. To start with, we'll compute movie similarities, that is, figure out which movies are similar to other movies. In particular, we'll try to figure out which movies are similar to Star Wars, based on user rating data, and we'll see what we get out of it. Let's dive in!

Okay so, let's go ahead and compute the first half of item-based collaborative filtering, which is finding similarities between items.

In [6]:
import pandas as pd

r_cols = ['user_id','movie_id','rating']
ratings = pd.read_csv('u.data.csv',names=r_cols, usecols=range(3))
m_cols = ['movie_id','title']
movies = pd.read_csv('u.item.csv',names=m_cols, sep='|',usecols=range(2))
#print(ratings.head())
ratings = pd.merge(movies,ratings)
ratings.to_csv('ratings.csv')
movies.head()

Unnamed: 0,movie_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In this case, we're going to be looking at similarities between movies, based on user behavior. And, we're going to be using some real movie rating data from the GroupLens project. GroupLens.org provides real movie ratings data, by real people who are using the MovieLens.org website to rate movies and get recommendations back for new movies that they want to watch.                        

We have included the data files that you need from the GroupLens dataset with the course materials, and the first thing we need to do is import those into a Pandas DataFrame, and we're really going to see the full power of Pandas in this example. It's pretty cool stuff!

Let's add a `ratings.head()` command and then run those cells. What we end up with is something like the following table. That was pretty quick!

In [7]:
ratings.head()

Unnamed: 0,movie_id,title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


We end up with a new DataFrame that contains the `user_id` and rating for each movie that a user rated, and we have both the `movie_id` and the `title` that we can read and see what it really is. So, the way to read this is `user_id` number `308` rated the `Toy Story (1995)` movie 4 stars, `user_id` number `287` rated the `Toy Story (1995)` movie 5 stars, and so on and so forth. And, if we were to keep looking at more and more of this DataFrame, we'd see different ratings for different movies as we go through it.

Now the real magic of Pandas comes in. So, what we really want is to look at relationships between movies based on all the users that watched each pair of movies, so we need, at the end, a matrix of every movie, and every user, and all the ratings that every user gave to every movie. The `pivot_table` command in Pandas can do that for us. It can basically construct a new table from a given DataFrame, pretty much any way that you want it. For this, we can use the following code:

In [9]:
movieRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values=['rating'])
movieRatings.head()

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Wyatt Earp (1994),Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown
user_id,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,,5.0,3.0,,,,4.0
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


It's kind of amazing how that just put it all together for us. Now, you'll see some `NaN` values; `NaN` stands for **Not a Number**, and it's just how Pandas indicates a missing value. So, the way to interpret this is that `user_id` number 1, for example, did not watch the movie `1-900 (1994)`, but did watch `101 Dalmatians (1996)` and rated it 2 stars. `user_id` number 1 also watched `12 Angry Men (1957)` and rated it 5 stars, but did not watch `2 Days in the Valley (1996)`, for example, okay? So, what we end up with here is basically a sparse matrix that contains every user and every movie, with a rating value at every intersection where a user rated a movie.
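
To see what `pivot_table` is doing on something small enough to check by eye, here is a minimal sketch. The three users, three movie titles, and the `toy` and `toy_matrix` names are made up for illustration and are not part of the MovieLens data.

```python
import pandas as pd

# Hypothetical toy ratings: one row per (user, movie) rating.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3],
    'title':   ['Movie A', 'Movie B', 'Movie A', 'Movie B', 'Movie C'],
    'rating':  [5, 3, 4, 2, 5],
})

# Rows become users, columns become titles, cells hold the rating.
# Passing values='rating' as a plain string keeps the columns single-level.
toy_matrix = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_matrix)

# User 2 never rated Movie B, so that cell is a missing value (NaN).
print(toy_matrix.loc[2, 'Movie B'])
```

The notebook passes `values=['rating']` as a list instead, which is why its columns carry an extra `rating` level and a single movie is selected with `movieRatings['rating', 'Star Wars (1977)']`.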

So, you can see now that we can very easily extract a vector of every movie that a given user watched, and we can also extract a vector of every user that rated a given movie, which is what we want. That's useful for both user-based and item-based collaborative filtering, right? If I wanted to find relationships between users, I could look at correlations between the user rows, but if I want to find correlations between movies, for item-based collaborative filtering, I can look at correlations between columns based on the user behavior. So, this is where the real *flipping things on their head for user-based versus item-based similarities* comes into play.
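
The row-versus-column idea can be sketched on a small made-up matrix. The data and the names `m`, `item_sim`, and `user_sim` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix (rows: users, columns: movies).
m = pd.DataFrame(
    {'Movie A': [5, 4, 1, 2],
     'Movie B': [4, 5, 2, 1],
     'Movie C': [1, 2, 5, np.nan]},
    index=pd.Index([1, 2, 3, 4], name='user_id'))

# Item-based: correlate the columns (movies) with each other.
item_sim = m.corr()

# User-based: transpose so that users become columns, then correlate.
user_sim = m.T.corr()

print(item_sim)
print(user_sim)
```

The same matrix serves both approaches; the only difference is which axis is treated as the thing being compared.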

Now, we're going with item-based collaborative filtering, so we want to extract columns. To do this, let's run the following code:

In [30]:
starWarsRatings = movieRatings['rating','Star Wars (1977)']
starWarsRatings.head()

user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: (rating, Star Wars (1977)), dtype: float64

That gives us the `Star Wars (1977)` column, which holds the rating that each user gave to that movie.

And, we can see most people have, in fact, watched and rated `Star Wars (1977)` and everyone liked it, at least in this little sample that we took from the head of the DataFrame. So, we end up with a resulting set of user IDs and their ratings for `Star Wars (1977)`. The user ID 3 did not rate `Star Wars (1977)` so we have a `NaN` value, indicating a missing value there, but that's okay. We want to make sure that we preserve those missing values so we can directly compare columns from different movies. So, how do we do that?

## The corrwith function

Before we look at Star Wars specifically, the next cell computes the full movie-to-movie correlation matrix, `corrMatrix`, which we will need later when we generate recommendations for a user. The `min_periods=100` argument means that a pair of movies only gets a correlation score if at least 100 users rated both of them; every other pair is left as `NaN`.

In [10]:
corrMatrix = movieRatings.corr(method='pearson',min_periods=100)  #pearson is the corr method
corrMatrix.head()


Unnamed: 0_level_0,Unnamed: 1_level_0,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating,rating
Unnamed: 0_level_1,title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Wyatt Earp (1994),Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown
Unnamed: 0_level_2,title,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2
rating,'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
rating,1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
rating,101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
rating,12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
rating,187 (1997),,,,,,,,,,,...,,,,,,,,,,


Well, Pandas keeps making it easy for us, and has a `corrwith` function that we can use, as you can see in the following code:

In [31]:
movieRatings1 = movieRatings['rating']
movieRatings1
similarMovies = movieRatings1.corrwith(starWarsRatings)
# print(similarMovies.shape)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
similarMovies.sort_values(ascending=False)
# print(similarMovies.shape)

title
No Escape (1994)                          1.0
Man of the Year (1995)                    1.0
Hollow Reed (1996)                        1.0
Commandments (1997)                       1.0
Cosi (1996)                               1.0
                                         ... 
Theodore Rex (1995)                      -1.0
I Like It Like That (1994)               -1.0
Two Deaths (1995)                        -1.0
Roseanna's Grave (For Roseanna) (1997)   -1.0
Frankie Starlight (1995)                 -1.0
Length: 1410, dtype: float64

That code will go ahead and correlate a given column with every other column in the DataFrame, compute the correlation scores, and give them back to us. So, what we're doing here is using `corrwith` on the entire `movieRatings` DataFrame, that is, the entire matrix of user movie ratings, correlating it with just the `starWarsRatings` column, and then dropping all of the missing results with `dropna`. That leaves us with only the movies for which a correlation could be computed, where more than one person rated both that movie and Star Wars, and we create a new DataFrame based on those results and then display them sorted by score. So again, just to recap:

1. We build the correlation score between Star Wars and every other movie.
2. We drop all the `NaN` values, so that we only keep movie similarities that actually exist, where more than one person rated both movies.
3. We construct a new DataFrame from the results and look at the top results.
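
The steps above can be sketched on a tiny made-up matrix. The column names (`Target`, `Similar`, `Opposite`, `Lonely`) and all the ratings are hypothetical, chosen so that the behavior of `corrwith` and `dropna` is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie ratings; NaN means "not rated".
ratings_matrix = pd.DataFrame({
    'Target':   [5, 4, 1, 2, np.nan],
    'Similar':  [5, 5, 2, 1, 4],
    'Opposite': [1, 2, 5, 4, 3],
    # Only one user rated both 'Lonely' and 'Target', so no correlation exists.
    # NumPy may emit a RuntimeWarning for this column.
    'Lonely':   [np.nan, np.nan, np.nan, 3, np.nan],
})

target = ratings_matrix['Target']

# Correlate every column with the target column, using only the users
# who rated both movies in each pair.
sims = ratings_matrix.corrwith(target)

# Drop movies with no computable correlation, then sort best-first.
sims = sims.dropna().sort_values(ascending=False)
print(sims)
```

Here `Lonely` disappears after `dropna`, while `Opposite` ends up at the bottom with a strongly negative score.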

We end up with a correlation score between Star Wars and each individual movie, and we can see, for example, a surprisingly high correlation score with the movie `'Til There Was You (1997)`, a negative correlation with the movie `1-900 (1994)`, and a very weak correlation with `101 Dalmatians (1996)`.

Now, all we should have to do is sort this by similarity score, and we should have the top movie similarities for Star Wars, right?

That is what the `sort_values` call in the preceding code does. Again, Pandas makes it really easy, and we can say `ascending=False` to get the results sorted in descending order by correlation score.

Okay, so `Star Wars (1977)` came out pretty close to the top, because it is similar to itself, but what's all this other stuff? What the heck? We can see in the preceding output some movies such as `Full Speed (1996)`, `Man of the Year (1995)`, and `The Outlaw (1943)`. These are all fairly obscure movies, most of which I've never even heard of, and yet they have perfect correlations with Star Wars. That's kind of weird! So, obviously we're doing something wrong here. What could it be?

Well, it turns out there's a perfectly reasonable explanation, and this is a good lesson in why you always need to examine your results when you're done with any sort of data science task. Question the results, because often there's something you missed: there might be something you need to clean in your data, or there might be something you did wrong. Always look skeptically at your results; don't just take them on faith, okay? If you do, you're going to get in trouble, because if I were to actually present these as recommendations to people who liked Star Wars, I would get fired. Don't get fired! Pay attention to your results! So, let's dive into what went wrong in our next section.

## Improving the results of movie similarities

Let's figure out what went wrong with our movie similarities there. We went through all this exciting work to compute correlation scores between movies based on their user ratings vectors, and the results we got kind of sucked. So, just to remind you, we looked for movies that are similar to Star Wars using that technique, and we ended up with a bunch of weird recommendations at the top that had a perfect correlation.

And, most of them were very obscure movies. So, what do you think might be going on there? Well, one thing that might make sense is this: let's say we have a lot of people who watch Star Wars, and one or two of them also watch some obscure film. We'd end up with a good correlation between these two movies because they're tied together by Star Wars, but at the end of the day, do we really want to base our recommendations on the behavior of one or two people who watch some obscure movie?

Probably not! I mean yes, the two people in the world, or whatever it is, who watched the movie Full Speed and liked it in addition to Star Wars, maybe that is a good recommendation for them, but it's probably not a good recommendation for the rest of the world. We need to have some sort of confidence level in our similarities by enforcing a minimum boundary on how many people watched a given movie. We can't make a judgment that a given movie is good just based on the behavior of one or two people.
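
To make the problem concrete, here is a small sketch with made-up ratings. When only two users have rated both movies, the Pearson correlation is computed from two points, so it comes out as a perfect 1 (or -1) no matter how little evidence there is. The `star_wars` and `obscure` series are hypothetical.

```python
import numpy as np
import pandas as pd

users = [1, 2, 3, 4, 5, 6]
# Hypothetical ratings: everyone rated the popular movie...
star_wars = pd.Series([5, 4, 5, 3, 4, 2], index=users)
# ...but only users 2 and 6 rated the obscure one.
obscure = pd.Series([np.nan, 5, np.nan, np.nan, np.nan, 3], index=users)

# Number of users who rated both movies.
overlap = (star_wars.notna() & obscure.notna()).sum()

# Two overlapping points always lie on a straight line: a "perfect" score.
r = star_wars.corr(obscure)

# Requiring at least 3 co-ratings turns that score into a missing value.
r_guarded = star_wars.corr(obscure, min_periods=3)

print(overlap, r, r_guarded)
```

This is the same safeguard that the `min_periods` argument provides when the full correlation matrix is built.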

So, let's try to put that insight into action using the following code.

In [32]:
import numpy as np
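# Count the ratings and compute the mean rating for each title.
# String names avoid the deprecation warning newer pandas raises for np.mean.
movieStats = ratings.groupby('title').agg({'rating':['size','mean']})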
movieStats.head()

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
title,Unnamed: 1_level_2,Unnamed: 2_level_2
'Til There Was You (1997),9,2.333333
1-900 (1994),5,2.6
101 Dalmatians (1996),109,2.908257
12 Angry Men (1957),125,4.344
187 (1997),41,3.02439


What we're going to do is try to identify the movies that weren't actually rated by many people and we'll just throw them out and see what we get. So, to do that we're going to take our original ratings DataFrame and we're going to say `groupby('title')`, again Pandas has all sorts of magic in it. And, this will basically construct a new DataFrame that aggregates together all the rows for a given title into one row.       

We can say that we want to aggregate specifically on the rating, and we want to show both the size, the number of ratings for each movie, and the mean average score, the mean rating for that movie. So, when we do that, we end up with something like the above.

This is telling us, for example, for the movie `101 Dalmatians (1996)`, `109` people rated that movie and their average rating was 2.9 stars, so not that great of a score really! So, if we just eyeball this data, we can say okay well, movies that I consider obscure, like `187 (1997)`, had `41` ratings, but `101 Dalmatians (1996)`, I've heard of that, you know `12 Angry Men (1957)`, I've heard of that. It seems like there's sort of a natural cutoff value at around `100` ratings, where maybe that's the magic value where things start to make sense.           
         
Let's go ahead and get rid of movies rated by fewer than 100 people, and yes, you know I'm kind of doing this intuitively at this point. As we'll talk about later, there are more principled ways of doing this, where you could actually experiment and do train/test experiments on different threshold values, to find the one that actually performs the best. But initially, let's just use our common sense and filter out movies that were rated by fewer than 100 people. Again, Pandas makes that really easy to do.    
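
As a toy illustration of the counting-and-filtering idea, here is a sketch with invented data and an assumed cutoff of 3 ratings rather than 100. The titles, the `MIN_RATINGS` constant, and the variable names are all hypothetical.

```python
import pandas as pd

# Hypothetical ratings: 'Popular' has four ratings, 'Obscure' only two.
toy = pd.DataFrame({
    'title':  ['Popular', 'Popular', 'Popular', 'Popular', 'Obscure', 'Obscure'],
    'rating': [4, 5, 3, 4, 5, 5],
})

# One row per title: how many ratings it received and their mean.
stats = toy.groupby('title')['rating'].agg(['size', 'mean'])

# Keep only titles with enough ratings to be trusted.
MIN_RATINGS = 3  # assumed cutoff for this toy data
popular = stats[stats['size'] >= MIN_RATINGS]
print(popular)
```

Note that `Obscure` has the higher mean rating, yet it is discarded because two ratings are too few to rely on.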

Let's figure it out with the following example:

In [33]:
#movieStats.loc['Star Wars (1977)']

popularMovies = movieStats['rating']['size']>=100
movieStats = movieStats[popularMovies].sort_values([('rating','mean')],ascending=False)
# movieStats = movieStats[movieStats.index != 'Star Wars (1977)']    #drop row having Star Wars(1977)
movieStats

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
title,Unnamed: 1_level_2,Unnamed: 2_level_2
"Close Shave, A (1995)",112,4.491071
Schindler's List (1993),298,4.466443
"Wrong Trousers, The (1993)",118,4.466102
Casablanca (1942),243,4.456790
"Shawshank Redemption, The (1994)",283,4.445230
...,...,...
Spawn (1997),143,2.615385
Event Horizon (1997),127,2.574803
Crash (1996),128,2.546875
Jungle2Jungle (1997),132,2.439394


    
What we have here is a list of movies that were rated by at least 100 people, sorted by their average rating score, and this in itself is a recommender system. These are highly rated popular movies. `A Close Shave (1995)`, apparently, was a really good movie, and a lot of people watched it and really liked it.

So again, this is a very old dataset, from the late 90s, so even though you might not be familiar with the film `A Close Shave (1995)`, it might be worth going back and rediscovering it; add it to your Netflix! `Schindler's List (1993)` is not a big surprise there; it comes up near the top of most top movies lists. `The Wrong Trousers (1993)` is another example of an obscure film that apparently was really good and was also pretty popular. So, some interesting discoveries there already, just by doing that.

Things look a little bit better now, so let's go ahead and make our new DataFrame of Star Wars recommendations, movies similar to Star Wars, where we only base it on movies that appear in this new DataFrame. We're going to use the `join` operation to join our original `similarMovies` results to this new DataFrame of only movies that have at least 100 ratings, okay?

In [34]:
df = movieStats.join(pd.DataFrame(similarMovies,columns=['similarity']))
df.head()

Unnamed: 0_level_0,"(rating, size)","(rating, mean)",similarity
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Close Shave, A (1995)",112,4.491071,0.183451
Schindler's List (1993),298,4.466443,0.100933
"Wrong Trousers, The (1993)",118,4.466102,0.216204
Casablanca (1942),243,4.45679,0.248016
"Shawshank Redemption, The (1994)",283,4.44523,0.174986


In this code, we create a new DataFrame from `similarMovies` with a single `similarity` column, join it onto our `movieStats` DataFrame, which by now contains only the popular movies, and look at the combined results. And, there we go with that output!
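
Here is a minimal sketch of how that join behaves, using made-up numbers. The `stats`, `sims`, and `joined` names and their values are hypothetical. Because `join` keeps the rows of the left-hand DataFrame by default, any movie that is missing from the filtered statistics simply drops out.

```python
import pandas as pd

# Hypothetical statistics for movies that passed the popularity filter.
stats = pd.DataFrame(
    {'size': [120, 300], 'mean': [4.1, 3.8]},
    index=pd.Index(['Movie A', 'Movie B'], name='title'))

# Hypothetical similarity scores, including an unpopular 'Movie C'.
sims = pd.Series({'Movie A': 0.42, 'Movie B': 0.10, 'Movie C': 1.0})

# A Series must be named before it can be joined; the name becomes the column.
joined = stats.join(sims.rename('similarity'))
print(joined)
```

`Movie C` had a perfect similarity score, but it never appears in the result because it is not in the left-hand table.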

Now we have the similarity score to Star Wars, restricted to movies that were rated by at least 100 people. So, now all we need to do is sort that using the following code:

In [35]:
df.sort_values(['similarity'],ascending=False)[:15]

Unnamed: 0_level_0,"(rating, size)","(rating, mean)",similarity
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Star Wars (1977),584,4.359589,1.0
"Empire Strikes Back, The (1980)",368,4.206522,0.748353
Return of the Jedi (1983),507,4.00789,0.672556
Raiders of the Lost Ark (1981),420,4.252381,0.536117
Austin Powers: International Man of Mystery (1997),130,3.246154,0.377433
"Sting, The (1973)",241,4.058091,0.367538
Indiana Jones and the Last Crusade (1989),331,3.930514,0.350107
Pinocchio (1940),101,3.673267,0.347868
"Frighteners, The (1996)",115,3.234783,0.332729
L.A. Confidential (1997),297,4.161616,0.319065


                                             
This is starting to look a little bit better! So, `Star Wars (1977)` comes out on top because it's similar to itself, `The Empire Strikes Back (1980)` is number 2, `Return of the Jedi (1983)` is number 3, and `Raiders of the Lost Ark (1981)` is number 4. You know, it's still not perfect, but these make a lot more sense, right? You would expect the three Star Wars films from the original trilogy to be similar to each other (this data goes back to before the next three films), and `Raiders of the Lost Ark (1981)` is also a very similar movie to Star Wars in style, and it comes out as number 4. So, I'm starting to feel a little bit better about these results. There's still room for improvement, but hey! We got some results that make sense, whoo-hoo!
                              

Now, ideally, we'd also filter out Star Wars, you don't want to be looking at similarities to the movie itself that you started from, but we'll worry about that later! So, if you want to play with this a little bit more, like I said 100 was sort of an arbitrary cutoff for the minimum number of ratings. If you do want to experiment with different cutoff values, I encourage you to go back and do so. See what that does to the results. You know, you can see in the preceding table that the results that we really like actually had much more than 100 ratings in common. So, we end up with `Austin Powers: International Man of Mystery (1997)` coming in there pretty high with only 130 ratings so maybe 100 isn't high enough! `Pinocchio (1940)` snuck in at 101, not very similar to Star Wars, so, you might want to consider an even higher threshold there and see what it does.
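
One way to experiment with different cutoffs, and to drop the starting movie at the same time, is to wrap the steps in a small helper. This is only a sketch: the `top_similar` function, its parameters, and the toy numbers below are assumptions for illustration, and it expects single-level column names rather than the two-level ones the notebook produces.

```python
import pandas as pd

def top_similar(stats, sims, target, min_ratings, n=5):
    """Rank movies by similarity to `target`, keeping only titles with at
    least `min_ratings` ratings and excluding the target itself."""
    popular = stats[stats['size'] >= min_ratings]
    joined = popular.join(sims.rename('similarity'), how='inner')
    joined = joined.drop(index=target, errors='ignore')
    return joined.sort_values('similarity', ascending=False).head(n)

# Hypothetical rating counts, mean ratings, and similarity scores.
stats = pd.DataFrame(
    {'size': [584, 368, 130, 101, 5], 'mean': [4.4, 4.2, 3.2, 3.7, 4.0]},
    index=pd.Index(['Target', 'Sequel', 'Comedy', 'Cartoon', 'Obscure'],
                   name='title'))
sims = pd.Series({'Target': 1.0, 'Sequel': 0.75, 'Comedy': 0.38,
                  'Cartoon': 0.35, 'Obscure': 1.0})

at_100 = top_similar(stats, sims, 'Target', min_ratings=100)
at_200 = top_similar(stats, sims, 'Target', min_ratings=200)
print(list(at_100.index))
print(list(at_200.index))
```

Raising the cutoff from 100 to 200 removes the borderline titles in this toy data, which mirrors the effect a higher threshold would have on entries such as the ones with 101 or 130 ratings.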

**Note:** Please keep in mind too that this is a very small, limited dataset that we used for experimentation purposes, and it's based on very old data, so you're only going to see older movies. Interpreting these results intuitively might be a little challenging as a result, but these are not bad results.

In [38]:
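# `is not` tests object identity rather than equality, so the comparison
# `movieStats.index is not 'Star Wars (1977)'` is always True.
# An element-wise `!=` gives a boolean mask that is False only for Star Wars.
k = movieStats.index != 'Star Wars (1977)'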

In [39]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

In [40]:
movieStats.loc['Miracle on 34th Street (1994)']

rating  size    101.000000
        mean      3.722772
Name: Miracle on 34th Street (1994), dtype: float64

## Understanding movie recommendation with an example

So, what do we do with this data? Well, what we want to do is recommend movies for people. The way we do that is we look at all the ratings for a given person, find movies similar to the stuff that they rated, and those are candidates for recommendations to that person.

Let's start by creating a fake person to create recommendations for. I've actually already added a fake user by hand, ID number 0, to the MovieLens dataset that we're processing. You can see that user with the following code:

In [42]:
myRatings = movieRatings.loc[0].dropna()
myRatings

        title                          
rating  Empire Strikes Back, The (1980)    5.0
        Gone with the Wind (1939)          1.0
        Star Wars (1977)                   5.0
Name: 0, dtype: float64

That kind of represents someone like me, who loved Star Wars and The Empire Strikes Back, but hated the movie Gone with the Wind. So, this represents someone who really loves Star Wars, but does not like old-style romantic dramas, okay? I gave a rating of 5 stars to `The Empire Strikes Back (1980)` and `Star Wars (1977)`, and a rating of 1 star to `Gone with the Wind (1939)`. I'm going to try to find recommendations for this fictitious user. So, how do I do that? Well, let's start by creating a series called `simCandidates`, and I'm going to go through every movie that I rated.

In [45]:
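simCandidates = pd.Series(dtype='float64')
for i in range(0,len(myRatings.index)):
    print("Adding sims for ",myRatings.index[i],"...")
    # Retrieve the movies similar to this one that I rated
    sims = corrMatrix[myRatings.index[i]].dropna()
    # Scale each similarity by how highly I rated this movie
    # (.iloc makes the positional lookup explicit)
    sims = sims.map(lambda x: x*myRatings.iloc[i])
    print(sims)
    # Series.append was removed in pandas 2.0; pd.concat does the same job
    simCandidates = pd.concat([simCandidates, sims])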

print('\nsorting..\n')
simCandidates.sort_values(inplace=True,ascending=False)
print(simCandidates.head(10))


Adding sims for  ('rating', 'Empire Strikes Back, The (1980)') ...
        title                                                                      
rating  2001: A Space Odyssey (1968)                                                   0.707991
        Abyss, The (1989)                                                              1.389334
        African Queen, The (1951)                                                      1.158286
        Air Force One (1997)                                                           0.828101
        Aladdin (1992)                                                                 1.555313
        Alien (1979)                                                                   1.008343
        Aliens (1986)                                                                  1.462883
        Amadeus (1984)                                                                 0.746641
        American President, The (1995)                                           

For `i` in the range 0 through the number of ratings that I have in `myRatings`, I am going to add up the movies similar to the ones that I rated. So, I'm going to take that `corrMatrix` DataFrame, that magical one that has all of the movie similarities, look up the column of similarities for the movie I rated, drop any missing values, and then scale the resulting correlation scores by how well I rated that movie.

So, the idea here is I'm going to go through all the similarities for The Empire Strikes Back, for example, and I will scale them all by 5, because I really liked The Empire Strikes Back. But, when I go through and get the similarities for Gone with the Wind, I'm only going to scale those by 1, because I did not like Gone with the Wind. So, this will give more strength to movies that are similar to movies that I liked, and less strength to movies that are similar to movies that I did not like, okay? So, I just go through and build up this list of similarity candidates, recommendation candidates if you will, sort the results, and print them out. The output above, which is truncated, shows the start of that printout.
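
The scaling step can be sketched with a tiny made-up similarity matrix. The titles (`Liked`, `Disliked`, `Candidate`), the scores, and the variable names are hypothetical, and the labels are kept single-level for readability.

```python
import pandas as pd

# Hypothetical movie-to-movie similarity matrix.
titles = ['Liked', 'Disliked', 'Candidate']
corr = pd.DataFrame([[1.0, 0.1, 0.6],
                     [0.1, 1.0, 0.5],
                     [0.6, 0.5, 1.0]], index=titles, columns=titles)

# Hypothetical user: 5 stars for one movie, 1 star for another.
my_ratings = pd.Series({'Liked': 5.0, 'Disliked': 1.0})

parts = []
for title, rating in my_ratings.items():
    # Similarities to this rated movie, weighted by the rating it received.
    parts.append(corr[title].dropna() * rating)

# Stack the weighted similarities into one long Series of candidates.
candidates = pd.concat(parts)
print(candidates)
```

`Candidate` shows up twice, once weighted by 5 (via `Liked`) and once weighted by 1 (via `Disliked`), so the similarity to the liked movie contributes far more.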

Hey, those don't look too bad, right? So, obviously `The Empire Strikes Back (1980)` and `Star Wars (1977)` come out on top, because I like those movies explicitly; I already watched them and rated them. But, bubbling up to the top of the list are `Return of the Jedi (1983)`, which we would expect, and `Raiders of the Lost Ark (1981)`.

Let's start to refine these results a little bit more. We're seeing that we're getting duplicate values back. If we have a movie that was similar to more than one movie that I rated, it will come back more than once in the results, so we want to combine those together. If I do in fact have the same movie, maybe that should get added up together into a combined, stronger recommendation score. Return of the Jedi, for example, was similar to both Star Wars and The Empire Strikes Back. How would we do that?

## Using the groupby command to combine rows

We'll go ahead and explore that. We're going to use the groupby command again to group together all of the rows that are for the same movie. Next, we will sum up their correlation score and look at the results:

In [46]:
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace=True,ascending=False)
simCandidates.head(10)

(rating, Empire Strikes Back, The (1980))              8.877450
(rating, Star Wars (1977))                             8.870971
(rating, Return of the Jedi (1983))                    7.178172
(rating, Raiders of the Lost Ark (1981))               5.519700
(rating, Indiana Jones and the Last Crusade (1989))    3.488028
(rating, Bridge on the River Kwai, The (1957))         3.366616
(rating, Back to the Future (1985))                    3.357941
(rating, Sting, The (1973))                            3.329843
(rating, Cinderella (1950))                            3.245412
(rating, Field of Dreams (1989))                       3.222311
dtype: float64

Hey, this is looking really good!     

So, among the movies I haven't already rated, `Return of the Jedi (1983)` comes out way on top, as it should, with a score of about 7.2, `Raiders of the Lost Ark (1981)` is second at about 5.5, and then we start to get to `Indiana Jones and the Last Crusade (1989)` and some more movies: `The Bridge on the River Kwai (1957)`, `Back to the Future (1985)`, `The Sting (1973)`. These are all movies that I would actually enjoy watching! You know, I actually do like old-school Disney movies too, so `Cinderella (1950)` isn't as crazy as it might seem.
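
The effect of grouping on the index can be checked on a small made-up Series in which the same title appears more than once. The labels and scores are hypothetical.

```python
import pandas as pd

# Hypothetical candidate scores; 'Candidate' and 'Liked' each appear twice.
candidates = pd.Series(
    [3.0, 0.5, 5.0, 0.1],
    index=['Candidate', 'Candidate', 'Liked', 'Liked'])

# Group rows that share the same label and add their scores together.
combined = candidates.groupby(candidates.index).sum()
combined = combined.sort_values(ascending=False)
print(combined)
```

Each title now appears once, with a score equal to the sum of its earlier entries.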

The last thing we need to do is filter out the movies that I've already rated, because it doesn't make sense to recommend movies you've already seen.

## Removing entries with the drop command

So, I can quickly drop any rows that happen to be in my original ratings series using the following code:

In [47]:
filteredSims = simCandidates.drop(myRatings.index)
filteredSims.head(10)

(rating, Return of the Jedi (1983))                    7.178172
(rating, Raiders of the Lost Ark (1981))               5.519700
(rating, Indiana Jones and the Last Crusade (1989))    3.488028
(rating, Bridge on the River Kwai, The (1957))         3.366616
(rating, Back to the Future (1985))                    3.357941
(rating, Sting, The (1973))                            3.329843
(rating, Cinderella (1950))                            3.245412
(rating, Field of Dreams (1989))                       3.222311
(rating, Wizard of Oz, The (1939))                     3.200268
(rating, Dumbo (1941))                                 2.981645
dtype: float64

And there we have it! `Return of the Jedi (1983)`, `Raiders of the Lost Ark (1981)`, and `Indiana Jones and the Last Crusade (1989)` are the top results for my fictitious user, and they all make sense. I'm seeing a few family-friendly films, you know, `Cinderella (1950)`, `The Wizard of Oz (1939)`, `Dumbo (1941)`, creeping in, probably based on the presence of Gone with the Wind in there; even though it was weighted downward, it's still in there and still being counted. So, there we have our results. Pretty cool!

We have actually generated recommendations for a given user and we could do that for any user in our entire DataFrame. So, go ahead and play with that if you want to. I also want to talk about how you can actually get your hands dirty a little bit more, and play with these results; try to improve upon them.     
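
To try this for other users, the steps from this section can be gathered into one function. What follows is a sketch under stated assumptions rather than the notebook's own code: the `recommend` function, its parameters, and the toy similarity matrix are hypothetical, and the titles are assumed to be plain single-level labels.

```python
import pandas as pd

def recommend(user_ratings, corr, n=10):
    """Score unseen movies by summing rating-weighted similarities."""
    # One weighted similarity column per movie the user has rated.
    parts = [corr[title].dropna() * rating
             for title, rating in user_ratings.items()
             if title in corr.columns]
    if not parts:
        return pd.Series(dtype='float64')
    scores = pd.concat(parts)
    # Combine duplicate titles into a single score.
    scores = scores.groupby(scores.index).sum()
    # Remove movies the user has already rated.
    scores = scores.drop(user_ratings.index, errors='ignore')
    return scores.sort_values(ascending=False).head(n)

# Hypothetical similarity matrix and user.
titles = ['Liked', 'Disliked', 'Candidate', 'Other']
corr = pd.DataFrame([[1.0, 0.1, 0.6, 0.2],
                     [0.1, 1.0, 0.5, 0.4],
                     [0.6, 0.5, 1.0, 0.3],
                     [0.2, 0.4, 0.3, 1.0]], index=titles, columns=titles)
recs = recommend(pd.Series({'Liked': 5.0, 'Disliked': 1.0}), corr)
print(recs)
```

With the real data, `user_ratings` would be one user's row of the ratings matrix with missing values dropped, and `corr` would be the movie-to-movie correlation matrix.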

There's a bit of an art to this, you know, you need to keep iterating and trying different ideas and different techniques until you get better and better results, and you can do this pretty much forever. I mean, I made a whole career out of it. So, I don't expect you to spend the next, you know, 10 years trying to refine this like I did, but there are some simple things you can do, so let's talk about that.