In [12]:
import pandas as pd
import numpy as np

## The data

The dataset contains 100,836 ratings and about 3,600 tag applications for 9,724 movies, contributed by 610 users.

In [16]:
ratings = pd.read_csv('data/ratings_names.csv')

In [17]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title
0,1,1,4.0,964982703,Toy Story (1995)
1,1,3,4.0,964981247,Grumpier Old Men (1995)
2,1,6,4.0,964982224,Heat (1995)
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995)
4,1,50,5.0,964982931,"Usual Suspects, The (1995)"


In [18]:
# number of distinct users
ratings.userId.nunique()

610

In [19]:
# number of distinct movies
ratings.movieId.nunique()

9724

In [20]:
# ratings
np.sort(ratings.rating.unique())

array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

In [21]:
# number of ratings
len(ratings)

100836

In [23]:
# movie titles, one entry per rating (so titles repeat)
list(ratings.title)

['Toy Story (1995)',
 'Grumpier Old Men (1995)',
 'Heat (1995)',
 'Seven (a.k.a. Se7en) (1995)',
 'Usual Suspects, The (1995)',
 'From Dusk Till Dawn (1996)',
 'Bottle Rocket (1996)',
 'Braveheart (1995)',
 'Rob Roy (1995)',
 'Canadian Bacon (1995)',
 'Desperado (1995)',
 'Billy Madison (1995)',
 'Clerks (1994)',
 'Dumb & Dumber (Dumb and Dumber) (1994)',
 'Ed Wood (1994)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Pulp Fiction (1994)',
 'Stargate (1994)',
 'Tommy Boy (1995)',
 'Clear and Present Danger (1994)',
 'Forrest Gump (1994)',
 'Jungle Book, The (1994)',
 'Mask, The (1994)',
 'Blown Away (1994)',
 'Dazed and Confused (1993)',
 'Fugitive, The (1993)',
 'Jurassic Park (1993)',
 'Mrs. Doubtfire (1993)',
 "Schindler's List (1993)",
 'So I Married an Axe Murderer (1993)',
 'Three Musketeers, The (1993)',
 'Tombstone (1993)',
 'Dances with Wolves (1990)',
 'Batman (1989)',
 'Silence of the Lambs, The (1991)',
 'Pinocchio (1940)',
 'Fargo (1996)',
 'Mission: Impossible (1996)',

## A movie recommendation system

Suppose a user has just watched the following movie:

In [46]:
movie = 'Star Wars: Episode V - The Empire Strikes Back (1980)'

The goal is to recommend new movies to this user.

First, let's create a pandas Series that maps each movie title to the set of users who have rated it.

In [47]:
movie_sets = ratings.groupby('title').userId.apply(set)
movie_sets

title
'71 (2014)                                                                               {610}
'Hellboy': The Seeds of Creation (2004)                                                  {332}
'Round Midnight (1986)                                                              {377, 332}
'Salem's Lot (2004)                                                                      {345}
'Til There Was You (1997)                                                           {345, 113}
                                                                   ...                        
eXistenZ (1999)                              {387, 391, 520, 267, 414, 425, 560, 182, 312, ...
xXx (2002)                                   {263, 9, 140, 274, 20, 414, 432, 182, 438, 448...
xXx: State of the Union (2005)                                       {610, 232, 432, 274, 382}
¡Three Amigos! (1986)                        {1, 141, 19, 282, 27, 414, 421, 294, 42, 555, ...
À nous la liberté (Freedom for Us) (1931)   
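
To make the grouping step concrete, here is a minimal plain-Python sketch of what `groupby('title').userId.apply(set)` computes: one set of user ids per title. The `toy_ratings` pairs and the name `toy_sets` are invented for illustration and are not part of the dataset.

```python
from collections import defaultdict

# Hypothetical (userId, title) pairs, invented for illustration only.
toy_ratings = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A")]

# Build a mapping: title -> set of users who rated that title.
toy_sets = defaultdict(set)
for user_id, title in toy_ratings:
    toy_sets[title].add(user_id)

print(sorted(toy_sets["A"]))  # users who rated movie "A"
print(len(toy_sets["A"]))     # number of distinct raters of "A"
```

Because each value is a set, a user who rated a movie more than once would still be counted only once.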

## Recommendations based on the Jaccard similarity

The Jaccard similarity between two sets $A$ and $B$ is defined as

$$
J(A,B) = \frac{|A\cap B|}{|A\cup B|}
$$
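
Here $A$ and $B$ are the sets of users who rated two movies, so $J(A,B)$ is the share of their combined audience that rated both. As a small worked example on made-up sets (the helper name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions of this sketch, not part of the notebook):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; taken to be 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

users_a = {1, 2, 3, 4}  # hypothetical raters of movie A
users_b = {3, 4, 5}     # hypothetical raters of movie B

# 2 shared users (3 and 4) out of 5 distinct users overall.
print(jaccard(users_a, users_b))
```

The score is symmetric, equals 1 only when the two audiences coincide, and equals 0 when they are disjoint.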

In [48]:
def Jaccard_sim(movie_A, users_B):
    # movie_A is a title; users_B is the set of users who rated the other movie
    users_A = movie_sets[movie_A]
    return len(users_A & users_B) / len(users_A | users_B)

In [49]:
score_Jac = movie_sets.apply(lambda x : Jaccard_sim(movie, x)).sort_values(ascending=False)

In [52]:
# top 20 recommendations
score_Jac.head(20)

title
Star Wars: Episode V - The Empire Strikes Back (1980)                             1.000000
Star Wars: Episode IV - A New Hope (1977)                                         0.698529
Star Wars: Episode VI - Return of the Jedi (1983)                                 0.661224
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    0.562738
Matrix, The (1999)                                                                0.547468
Indiana Jones and the Last Crusade (1989)                                         0.506438
Star Wars: Episode I - The Phantom Menace (1999)                                  0.487288
Back to the Future (1985)                                                         0.469231
Alien (1979)                                                                      0.463115
Terminator, The (1984)                                                            0.461538
Aliens (1986)                                                                     0.

## Recommendations based on the Serendipity similarity

In [53]:
def Serendipity_sim(movie_A, users_B):
    # movie_A is a title; users_B is the set of users who rated the other movie
    users_A = movie_sets[movie_A]
    return len(users_A & users_B) / len(users_A)
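
Reading the formula off the function above, the score is the fraction of movie $A$'s raters who also rated movie $B$:

$$
S(A,B) = \frac{|A\cap B|}{|A|}
$$

Unlike the Jaccard similarity, this score is not symmetric and does not shrink when $B$ has a large audience, so widely rated movies tend to score highly. A minimal sketch on made-up sets (the helper name `overlap_fraction` is hypothetical, and it assumes the first set is non-empty):

```python
def overlap_fraction(a, b):
    """Fraction of the users in `a` who are also in `b` (assumes `a` is non-empty)."""
    return len(a & b) / len(a)

niche = {1, 2}                       # hypothetical raters of a niche movie
popular = {1, 2, 3, 4, 5, 6, 7, 8}   # hypothetical raters of a popular movie

# Every niche rater also rated the popular movie.
print(overlap_fraction(niche, popular))
# Only 2 of the 8 popular raters also rated the niche movie.
print(overlap_fraction(popular, niche))
```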

In [54]:
score_Ser = movie_sets.apply(lambda x : Serendipity_sim(movie, x)).sort_values(ascending=False)

In [56]:
score_Ser.head(20)

title
Star Wars: Episode V - The Empire Strikes Back (1980)                             1.000000
Star Wars: Episode IV - A New Hope (1977)                                         0.900474
Matrix, The (1999)                                                                0.819905
Star Wars: Episode VI - Return of the Jedi (1983)                                 0.767773
Forrest Gump (1994)                                                               0.701422
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    0.701422
Pulp Fiction (1994)                                                               0.658768
Silence of the Lambs, The (1991)                                                  0.639810
Shawshank Redemption, The (1994)                                                  0.630332
Terminator 2: Judgment Day (1991)                                                 0.611374
Jurassic Park (1993)                                                              0.
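
In both rankings the first entry is the query movie itself, which is not a new recommendation. One possible way to handle this, sketched in plain Python on invented data, is to skip the query title before ranking. The function `recommend`, its parameters, and the `toy` dictionary are all hypothetical names for this illustration.

```python
def recommend(sets_by_title, query, score, n=3):
    """Rank all titles except `query` by score(query_users, other_users), highest first."""
    query_users = sets_by_title[query]
    scores = {
        title: score(query_users, users)
        for title, users in sets_by_title.items()
        if title != query  # skip the query movie, which trivially matches itself
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical title -> set of raters, invented for illustration only.
toy = {"Q": {1, 2, 3, 4}, "X": {1, 2, 3}, "Y": {4, 5}, "Z": {6}}

def jaccard(a, b):
    return len(a & b) / len(a | b)

print(recommend(toy, "Q", jaccard))
```

With the pandas objects used in this notebook, dropping the query label from the sorted scores, for example `score_Jac.drop(movie)`, should have the same effect.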