GroupLens Research provides several collections of movie ratings data gathered
from users of MovieLens in the late 1990s and early 2000s. The data provide movie
ratings, movie metadata (genres and year), and demographic data about the users
(age, zip code, gender identification, and occupation). Such data are often of interest in
the development of recommendation systems based on machine learning algorithms.

The MovieLens 1M dataset contains about 1 million ratings (1,000,209 rows) collected
from roughly 6,000 users (6,040) on roughly 4,000 movies (3,883 in the movies table).
It’s spread across three tables: ratings, user information, and movie information.

In [0]:
!wget -qnc -P ./movielens https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/movielens/users.dat
!wget -qnc -P ./movielens https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/movielens/ratings.dat
!wget -qnc -P ./movielens https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/movielens/movies.dat

In [0]:
import pandas as pd
import numpy as np

# pd.options.display.max_rows = 10

In [0]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

In [4]:
unames = ['user_id', 'gender', 'age', 'occupation', "zip"]
users = pd.read_table('movielens/users.dat', sep="::", header=None, names=unames, 
                      engine='python')
print(users.shape)

(6040, 5)


In [5]:
rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("movielens/ratings.dat", sep="::", header=None,
                      names=rnames, engine="python")
print(ratings.shape)

(1000209, 4)


In [6]:
mnames = ['movie_id', "title", "genres"]
movies = pd.read_table("movielens/movies.dat", sep="::", header=None,
                      names=mnames, engine="python")
print(movies.shape)

(3883, 3)
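
The three files use `::` as the field separator, which is why the calls above pass `sep="::"` together with `engine="python"` (a multi-character separator is handled by the Python parsing engine). The following is a minimal sketch on a hypothetical in-memory sample, not the real file; the names `sample` and `sample_users` and the `dtype={"zip": str}` argument are assumptions added for illustration, the latter to show one way of making sure leading zeros in zip codes are kept.

```python
import io

import pandas as pd

# Hypothetical three-line sample in the same "::"-separated layout as users.dat
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"
cols = ["user_id", "gender", "age", "occupation", "zip"]

# A multi-character separator requires the Python parsing engine.
# dtype={"zip": str} is an assumption: it forces zip codes to stay text,
# so a value such as "02460" keeps its leading zero.
sample_users = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    header=None,
    names=cols,
    engine="python",
    dtype={"zip": str},
)

print(sample_users)
```

Reading from `io.StringIO` keeps the example independent of the downloaded files, so it can be run on its own.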


In [7]:
display("users", "ratings", "movies.head()")

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
ratings.user_id.unique()

array([   1,    2,    3, ..., 6038, 6039, 6040])

In [9]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [10]:
len(ratings['user_id'].unique())

6040

In [11]:
len(users.user_id.unique())

6040

In [12]:
# https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html

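# Many-to-one merge: every rating row picks up the attributes of its user.
# With no `on=` argument, pd.merge joins on the column the two frames share (user_id).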

user_ratings = pd.merge(ratings, users)
display("users", "ratings", "user_ratings.head()")

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip
0,1,1193,5,978300760,F,1,10,48067
1,1,661,3,978302109,F,1,10,48067
2,1,914,3,978301968,F,1,10,48067
3,1,3408,4,978300275,F,1,10,48067
4,1,2355,5,978824291,F,1,10,48067
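
The merge above relies on pandas inferring the join key from the shared column name. As a hedged sketch of a more explicit variant, the toy example below names the key with `on=` and asks pandas to check the many-to-one assumption with `validate=`. The frames `toy_ratings` and `toy_users` are hypothetical stand-ins, not the MovieLens tables.

```python
import pandas as pd

# Hypothetical miniature versions of the ratings and users tables
toy_ratings = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]}
)
toy_users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})

# on= makes the join key explicit; validate="many_to_one" raises a
# MergeError if user_id is not unique in the right-hand frame.
toy_merged = pd.merge(
    toy_ratings, toy_users, on="user_id", how="inner", validate="many_to_one"
)

print(toy_merged)
```

Each row of `toy_ratings` appears once in the result, with the user's attributes repeated for every rating that user made.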


In [13]:
# Second many-to-one merge: attach title and genres via the shared movie_id column
data = pd.merge(user_ratings, movies)
display("user_ratings", "movies", "data.head()")

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip
0,1,1193,5,978300760,F,1,10,48067
1,1,661,3,978302109,F,1,10,48067
2,1,914,3,978301968,F,1,10,48067
3,1,3408,4,978300275,F,1,10,48067
4,1,2355,5,978824291,F,1,10,48067
...,...,...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,M,25,6,11106
1000205,6040,1094,5,956704887,M,25,6,11106
1000206,6040,562,5,956704746,M,25,6,11106
1000207,6040,1096,4,956715648,M,25,6,11106

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama


In [14]:
# Number of ratings submitted by the user whose user_id is 1 (label-based lookup)
user_ratings["user_id"].value_counts().loc[1]

53

In [15]:
data.loc[data["user_id"] == 1, :]

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1725,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
2250,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
2886,1,3408,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama
4201,1,2355,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy
5904,1,1197,3,978302268,F,1,10,48067,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
8222,1,1287,5,978302039,F,1,10,48067,Ben-Hur (1959),Action|Adventure|Drama
8926,1,2804,5,978300719,F,1,10,48067,"Christmas Story, A (1983)",Comedy|Drama
10278,1,594,4,978302268,F,1,10,48067,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical
11041,1,919,4,978301368,F,1,10,48067,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical


In [0]:
# https://www.youtube.com/watch?time_continue=123&v=xPPs59pn6qU&feature=emb_logo

mean_ratings = data.pivot_table(values="rating", index="title", columns="gender",
                                aggfunc="mean")

In [17]:
mean_ratings.head()

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
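
To make the pivot table less of a black box, the sketch below builds the same kind of title-by-gender table on a hypothetical toy frame in two ways: with `pivot_table` and with `groupby` followed by `unstack`. It also shows where missing values come from: a title that has no ratings from one gender gets `NaN` in that column. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical ratings: title "B" has no ratings from female users
toy = pd.DataFrame(
    {
        "title": ["A", "A", "A", "B"],
        "gender": ["F", "M", "M", "M"],
        "rating": [4, 2, 4, 5],
    }
)

# Mean rating per title (rows) and gender (columns)
by_pivot = toy.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)

# The same table built step by step: group, aggregate, then move
# the gender level from the row index into the columns
by_groupby = toy.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(by_pivot)
print(by_groupby)
```

Both tables hold a mean of 4.0 (F) and 3.0 (M) for title "A", and a missing value in the F column for title "B".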


In [18]:
mean_ratings.isnull().any()

gender
F    True
M    True
dtype: bool

In [19]:
# Number of titles that have no ratings from female users (NaN in column F)
# mean_ratings[mean_ratings['F'].isnull()].head()
len(mean_ratings.loc[mean_ratings['F'].isnull(), :].index.unique())

225

In [20]:
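# Number of ratings each title received
ratings_by_title = data.groupby('title').size()

# Peek inside the groupby: each group is a (title, sub-DataFrame) pair.
# Print the first two groups, then stop.
for i, (movie, df) in enumerate(data.groupby('title')):
    print(movie)
    print(df)
    if i == 1:
        break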

$1,000,000 Duck (1971)
        user_id  movie_id  ...                   title             genres
985679      216      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985680      494      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985681      714      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985682      869      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985683     1034      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985684     1111      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985685     1141      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985686     1556      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985687     1635      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985688     1645      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985689     1680      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985690     1709      2031  ...  $1,000,000 Duck (1971)  Children's|Comedy
985691     1748

In [21]:
ratings_by_title.head()

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

In [0]:
# Keep only the titles that received at least 250 ratings
active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [23]:
active_titles

Index(["'burbs, The (1989)", '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', "You've Got Mail (1998)",
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
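
The filter above works by indexing the Series' own index with a boolean mask. A minimal sketch on hypothetical data, using a threshold of 2 instead of 250 so the effect is visible on a handful of rows:

```python
import pandas as pd

# Hypothetical ratings: "A" has 3 ratings, "B" has 1, "C" has 2
toy = pd.DataFrame(
    {"title": ["A", "A", "A", "B", "C", "C"], "rating": [5, 4, 3, 2, 4, 4]}
)

# size() counts the rows in each group, giving a Series indexed by title
counts = toy.groupby("title").size()

# counts >= 2 is a boolean Series; applying it to counts.index keeps
# only the labels (titles) where the condition holds
popular = counts.index[counts >= 2]

print(list(popular))
```

The result is an `Index` of titles, which can then be passed to `.loc` to select the matching rows of any table indexed by title.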

In [0]:
# Restrict the pivot table to the titles with at least 250 ratings
mean_ratings = mean_ratings.loc[active_titles]

In [25]:
mean_ratings

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140


In [0]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

In [27]:
top_female_ratings[:10]

gender,F,M
title,Unnamed: 1_level_1,Unnamed: 2_level_1
"Close Shave, A (1995)",4.644444,4.473795
"Wrong Trousers, The (1993)",4.588235,4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.57265,4.464589
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107,4.385075
Schindler's List (1993),4.562602,4.491415
"Shawshank Redemption, The (1994)",4.539075,4.560625
"Grand Day Out, A (1992)",4.537879,4.293255
To Kill a Mockingbird (1962),4.536667,4.372611
Creature Comforts (1990),4.513889,4.272277
"Usual Suspects, The (1995)",4.513317,4.518248


## Measuring Rating Disagreement

In [0]:
# Positive diff: men rated the title higher on average; negative diff: women did
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

In [0]:
# Ascending sort puts the titles that women rated higher than men at the top
sorted_by_diff = mean_ratings.sort_values(by='diff')

In [30]:
sorted_by_diff.head()

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777


In [31]:
# Reverse the row order and take the first 10: movies rated higher by men than by women

sorted_by_diff[::-1][:10]

gender,F,M,diff
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Good, The Bad and The Ugly, The (1966)",3.494949,4.2213,0.726351
"Kentucky Fried Movie, The (1977)",2.878788,3.555147,0.676359
Dumb & Dumber (1994),2.697987,3.336595,0.638608
"Longest Day, The (1962)",3.411765,4.031447,0.619682
"Cable Guy, The (1996)",2.25,2.863787,0.613787
Evil Dead II (Dead By Dawn) (1987),3.297297,3.909283,0.611985
"Hidden, The (1987)",3.137931,3.745098,0.607167
Rocky III (1982),2.361702,2.943503,0.581801
Caddyshack (1980),3.396135,3.969737,0.573602
For a Few Dollars More (1965),3.409091,3.953795,0.544704
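
As a hedged alternative to sorting the whole table and slicing from either end, `nsmallest` and `nlargest` return the extremes directly. The sketch below uses a hypothetical table of mean ratings (`toy_means`, with invented titles and values) and the same `M - F` definition of `diff`.

```python
import pandas as pd

# Hypothetical mean ratings by gender for three made-up titles
toy_means = pd.DataFrame(
    {"F": [3.8, 2.7, 4.5], "M": [3.0, 3.3, 4.5]},
    index=pd.Index(["Movie A", "Movie B", "Movie C"], name="title"),
)

# Same convention as above: positive means men rated the title higher
toy_means["diff"] = toy_means["M"] - toy_means["F"]

# Most negative diff: the title women preferred by the widest margin
preferred_by_women = toy_means.nsmallest(1, "diff")

# Most positive diff: the title men preferred by the widest margin
preferred_by_men = toy_means.nlargest(1, "diff")

print(preferred_by_women)
print(preferred_by_men)
```

With these invented numbers, "Movie A" is the title women preferred most and "Movie B" the one men preferred most, while "Movie C" shows no gap.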


Suppose instead you wanted the movies that elicited the most disagreement among
viewers, independent of gender identification. Disagreement can be measured by the
variance or standard deviation of the ratings.

In [0]:
# Standard deviation of rating grouped by title

rating_std_by_title = data.groupby("title")['rating'].std()

In [0]:
# Filter down to active_titles

rating_std_by_title = rating_std_by_title.loc[active_titles]

In [34]:
# Order Series by value in descending order

rating_std_by_title.sort_values(ascending=False)[:10]

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
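
To see why standard deviation captures disagreement, the sketch below compares two hypothetical titles: one whose ratings are split between 1 and 5, and one whose ratings cluster around 3. The titles and ratings are invented. Note that pandas' `std` computes the sample standard deviation by default (`ddof=1`).

```python
import pandas as pd

# Hypothetical ratings: one title splits viewers, the other does not
toy = pd.DataFrame(
    {
        "title": ["Polarizing"] * 4 + ["Consensus"] * 4,
        "rating": [1, 5, 1, 5, 3, 3, 4, 3],
    }
)

# Sample standard deviation (ddof=1) of the ratings for each title
std_by_title = toy.groupby("title")["rating"].std()

print(std_by_title.sort_values(ascending=False))
```

The split title has a mean rating of 3, which alone would look unremarkable; its much larger standard deviation is what reveals that viewers disagree.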