# Exploring the MovieLens 1M Dataset


Let's begin by importing pandas, conventionally aliased as *pd*, then load the three `::`-delimited files and merge them into a single table.

In [4]:
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rnames, engine='python')

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=mnames, engine='python')

data = pd.merge(pd.merge(ratings, users), movies)
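The nested `pd.merge` call above relies on pandas joining on whichever column names the two tables share (`user_id` first, then `movie_id`). As a minimal sketch, the same join can be written with the keys spelled out. The `*_demo` frames below are made-up stand-ins with the same column names, not rows read from the real files.

```python
import pandas as pd

# Tiny stand-in tables (hypothetical rows) with the same columns as the real files.
ratings_demo = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
    "timestamp": [978300760, 978302109, 978298413],
})
users_demo = pd.DataFrame({
    "user_id": [1, 2],
    "gender": ["F", "M"],
    "age": [1, 56],
    "occupation": [10, 16],
    "zip": ["48067", "70072"],
})
movies_demo = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A (1975)", "Movie B (1991)"],
    "genres": ["Drama", "Thriller"],
})

# Same shape of join as pd.merge(pd.merge(ratings, users), movies),
# but with the join keys and join type stated explicitly.
data_demo = (
    ratings_demo
    .merge(users_demo, on="user_id", how="inner")
    .merge(movies_demo, on="movie_id", how="inner")
)
print(data_demo.shape)
```

Naming the keys makes the join robust if a later change gives two tables another column name in common.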

1. An aggregate of the number of ratings for each genre, e.g.,
Action, Adventure, Drama, Science Fiction, …

In [5]:
data.groupby(['genres']).count()

genres,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title
Action,12311,12311,12311,12311,12311,12311,12311,12311,12311
Action|Adventure,10446,10446,10446,10446,10446,10446,10446,10446,10446
Action|Adventure|Animation,345,345,345,345,345,345,345,345,345
Action|Adventure|Animation|Children's|Fantasy,135,135,135,135,135,135,135,135,135
Action|Adventure|Animation|Horror|Sci-Fi,618,618,618,618,618,618,618,618,618
...,...,...,...,...,...,...,...,...,...
Sci-Fi|Thriller|War,280,280,280,280,280,280,280,280,280
Sci-Fi|War,1367,1367,1367,1367,1367,1367,1367,1367,1367
Thriller,17851,17851,17851,17851,17851,17851,17851,17851,17851
War,991,991,991,991,991,991,991,991,991
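The counts above are per genre *combination* (for example `Action|Adventure`), because the `genres` column stores a pipe-delimited string. If a count per individual genre is wanted, one possible approach is to split the string and give each genre its own row before grouping. The sketch below uses a small made-up frame named `demo` rather than the real `data`.

```python
import pandas as pd

# Hypothetical sample using the same pipe-delimited genre format.
demo = pd.DataFrame({
    "genres": ["Action|Adventure", "Action", "Drama", "Action|Sci-Fi"],
    "rating": [4, 5, 3, 2],
})

# Split each genre string into a list, then explode so that a rating of a
# multi-genre film contributes one row to each of its genres.
genre_counts = (
    demo.assign(genre=demo["genres"].str.split("|"))
    .explode("genre")
    .groupby("genre")["rating"]
    .count()
    .sort_values(ascending=False)
)
print(genre_counts)
```

With this layout a single rating is counted once for every genre the film belongs to, so the per-genre totals add up to more than the number of ratings.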


2. The top 5 genres by number of ratings given by women.

In [6]:
genderF=data.loc[data["gender"] == "F"]
genderF.sort_values(["rating"],ascending=False).head(5)

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
399110,175,1343,5,977115555,F,25,2,95123,Cape Fear (1991),Thriller
399355,2609,1343,5,973725840,F,50,12,55391,Cape Fear (1991),Thriller
399335,2408,1343,5,974254826,F,45,1,1609,Cape Fear (1991),Thriller
399328,2322,1343,5,974465496,F,56,13,48105,Cape Fear (1991),Thriller


3. The top 5 genres by number of ratings given by men.

In [7]:
genderM=data.loc[data["gender"] == "M"]
genderM.sort_values(["rating"],ascending=False).head(5)

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
229230,1671,39,5,974712422,M,35,0,98368,Clueless (1995),Comedy|Romance
229333,2077,39,5,979972045,M,18,0,55112,Clueless (1995),Comedy|Romance
667100,5475,2248,5,960938844,M,25,4,94110,Say Anything... (1989),Comedy|Drama|Romance
667094,5421,2248,5,960154396,M,35,17,22030,Say Anything... (1989),Comedy|Drama|Romance
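The two cells for questions 2 and 3 sort individual rating rows by score, so they show five 5-star ratings rather than the genres with the most ratings. A minimal sketch of one way to rank genres by rating count for a given gender follows. The helper name `top_genres_by_count` and the `demo` frame are hypothetical, and the sketch counts each individual genre after splitting the pipe-delimited string.

```python
import pandas as pd

def top_genres_by_count(df, gender, n=5):
    """Return the n genres with the most ratings from users of the given gender."""
    subset = df.loc[df["gender"] == gender]
    # One row per (rating, individual genre) pair.
    per_genre = subset.assign(genre=subset["genres"].str.split("|")).explode("genre")
    # value_counts sorts by count in descending order.
    return per_genre["genre"].value_counts().head(n)

# Hypothetical sample rows.
demo = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M"],
    "genres": ["Drama", "Drama|Romance", "Comedy", "Action", "Action|Drama"],
    "rating": [5, 4, 3, 5, 2],
})
top_f = top_genres_by_count(demo, "F", n=2)
top_m = top_genres_by_count(demo, "M", n=2)
print(top_f)
print(top_m)
```

On the merged table the calls would be `top_genres_by_count(data, "F")` and `top_genres_by_count(data, "M")`.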


4. Pick a genre of your choice and provide the average movie rating for each of the following
four time intervals during which the movies were released:
(a) 1970 to 1979

In [9]:
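# Replace each title with its release year, taken from the trailing "(YYYY)".
# Note that this overwrites the title column for every later cell.
data['title'] = data['title'].str.slice(-5, -1).astype('int')
# Exact match on "Drama": multi-genre films such as "Comedy|Drama" are excluded.
genre1 = data.loc[(data["genres"] == "Drama") & (data["title"] >= 1970) & (data["title"] <= 1979)]
genre1["rating"].mean()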


3.9167029458714264

(b) 1980 to 1989

In [17]:
genre2=data.loc[(data["genres"] == "Drama") & (data["title"] >= 1980) & (data["title"] <= 1989)]
genre2["rating"].mean()


3.775669352742397

(c) 1990 to 1999


In [18]:
genre3=data.loc[(data["genres"] == "Drama") & (data["title"] >= 1990) & (data["title"] <= 1999)]
genre3["rating"].mean()

3.7056136079801445

(d) 2000 to 2009


In [19]:
genre4=data.loc[(data["genres"] == "Drama") & (data["title"] >= 2000) & (data["title"] <= 2009)]
genre4["rating"].mean()

3.626737427343947
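The four cells above repeat the same filter with different year bounds. As a sketch of an alternative, the release year can be extracted into its own column (leaving `title` intact) and the mean computed for every decade in one `groupby`. The `demo` frame and the `year` and `decade` column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with titles in the "Name (YYYY)" format used by movies.dat.
demo = pd.DataFrame({
    "title": ["Film A (1975)", "Film B (1978)", "Film C (1984)",
              "Film D (1999)", "Film E (2000)"],
    "genres": ["Drama"] * 5,
    "rating": [5, 3, 4, 2, 4],
})

# Pull the four-digit year out of the trailing "(YYYY)" without overwriting title.
year = demo["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(int)
# Integer division maps 1970..1979 to 1970, 1980..1989 to 1980, and so on.
demo = demo.assign(year=year, decade=(year // 10) * 10)

drama = demo.loc[demo["genres"] == "Drama"]
decade_means = drama.groupby("decade")["rating"].mean()
print(decade_means)
```

Keeping the year in a separate column means later cells can still display movie titles.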

5. A function that, given a genre and a rating_range (e.g. [3.5, 4]), returns all
the movies of that genre within that rating range, sorted by average rating. Using
an example, demonstrate that your function works.

In [14]:
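def pickMovies(genre, rating_range):
    # rating_range is a string such as '[3.5,4]'; strip the brackets and parse both bounds.
    rating_range = rating_range.lstrip("[").rstrip("]")
    ratingLeft = float(rating_range.split(",")[0])
    ratingRight = float(rating_range.split(",")[1])
    # Note: this converts the rating column of the global DataFrame in place.
    data["rating"] = data["rating"].astype(float)
    # Keep individual rating rows whose genre string matches exactly and whose rating is in range.
    movies = data.loc[(data["genres"] == genre) & (data["rating"] >= ratingLeft) & (data["rating"] <= ratingRight)]
    return movies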

pickMovies('Drama','[3.5,4]')

,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
2,12,1193,4.0,978220179,M,25,12,32793,1975,Drama
3,15,1193,4.0,978199279,M,25,7,22903,1975,Drama
5,18,1193,4.0,978156168,F,18,3,95825,1975,Drama
12,44,1193,4.0,978018995,M,45,17,98052,1975,Drama
13,47,1193,4.0,977978345,M,18,4,94305,1975,Drama
...,...,...,...,...,...,...,...,...,...,...
1000151,4169,530,4.0,976589311,M,50,0,66048,1994,Drama
1000165,4572,3164,4.0,964460301,F,1,10,17036,1968,Drama
1000169,4888,439,4.0,962737163,F,56,0,08055,1993,Drama
1000173,5146,2480,4.0,962722549,M,56,1,04240,1997,Drama
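`pickMovies` returns individual rating rows, so the same movie appears many times and the result is not ordered by average rating. A minimal sketch of a variant that first averages per movie, then filters and sorts, is given below. The name `movies_in_rating_range`, the list-style `rating_range` argument, and the `demo` frame are all hypothetical. Grouping is by `movie_id` because the notebook has already overwritten `title` with the release year at this point.

```python
import pandas as pd

def movies_in_rating_range(df, genre, rating_range):
    """Movies of `genre` whose mean rating lies in [low, high], highest first."""
    low, high = rating_range
    # Exact match on the genre string, as in the notebook's pickMovies.
    of_genre = df.loc[df["genres"] == genre]
    # Average all ratings of each movie.
    mean_rating = of_genre.groupby("movie_id")["rating"].mean()
    in_range = mean_rating[(mean_rating >= low) & (mean_rating <= high)]
    return in_range.sort_values(ascending=False)

# Hypothetical sample: movie 1 averages 4.0, movie 2 averages 3.5, movie 3 averages 5.0.
demo = pd.DataFrame({
    "movie_id": [1, 1, 2, 2, 3, 4],
    "genres": ["Drama", "Drama", "Drama", "Drama", "Drama", "Comedy"],
    "rating": [4, 4, 3, 4, 5, 4],
})
result = movies_in_rating_range(demo, "Drama", [3.5, 4])
print(result)
```

The function does not modify the DataFrame it is given, so it can be called repeatedly with different ranges.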


6. Present one other statistic, figure, aggregate, or plot that you created using
this dataset, along with a short description of what interesting observations you derived
from it. This question is meant to give you a free hand to explore and present aspects
of the dataset that interest you.

In [15]:
#Movie watching line chart by age
#The figure is cw1(6).png
import matplotlib.pyplot as plt
dataA=data["age"].value_counts()
dataA=pd.DataFrame(dataA).reset_index()
dataA.columns =["age","count"]
dataA["age"]=dataA["age"].astype(int)
dataA=dataA.sort_values(["age"],ascending=False)
plt.figure()
plt.plot(dataA["age"],dataA["count"])
plt.title('Movie watching line chart by age')
plt.xlabel('age')
plt.ylabel('count')
plt.show()

<Figure size 640x480 with 1 Axes>
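In MovieLens 1M the `age` column holds a code for an age bracket rather than an exact age, so the x-axis of the line chart is a set of categories. As a sketch, the counts can be relabelled with the bracket names before plotting. The `AGE_LABELS` mapping below follows the dataset's README as I recall it and should be checked against your copy; `demo_ages` is made-up data.

```python
import pandas as pd

# Age codes and brackets as described in the MovieLens 1M README
# (assumption: verify against the README shipped with the data).
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Hypothetical sample of the age column.
demo_ages = pd.Series([25, 25, 18, 1, 56, 25, 35], name="age")

# Count ratings per age code, order by code, then swap codes for readable labels.
counts = demo_ages.value_counts().sort_index()
counts.index = counts.index.map(AGE_LABELS)
print(counts)
```

Because the brackets are categories of unequal width, a bar chart of `counts` may be easier to read than a line chart.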