# Movie Ratings Lab (GroupBy Operations)

### Intro and objectives
#### Apply the concepts learned so far in a real use case

### In this lab you will learn to:
1. Implement advanced data filtering

## What I hope you'll get out of this lab
* Gain experience filtering datasets
* Compute basic insights from the data provided

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

### GroupLens Research provides a number of collections of movie ratings data collected from users of MovieLens in the late 1990s and early 2000s. The data provides movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender identification, and occupation). Such data is often of interest in the development of recommendation systems based on machine learning algorithms.

### While we do not explore machine learning techniques in detail in this lab, you will learn how to slice and dice datasets like these into the exact form you need.

### The MovieLens 1M dataset contains about one million ratings (1,000,209, to be exact) of approximately 3,900 movies, made by 6,040 anonymous MovieLens users who joined MovieLens in 2000. It is spread across three tables: ratings, user information, and movie information.


#### RATINGS FILE DESCRIPTION
================================================================================

All ratings are contained in the file "ratings.dat" and are in the
following format:

UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
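
As a minimal sketch of the format described above, the snippet below parses two lines written in the `UserID::MovieID::Rating::Timestamp` layout and converts the epoch seconds into readable dates. The in-memory string `raw` and the column name `rated_at` are assumptions made for illustration; the values are taken from the ratings preview shown later in this lab.

```python
import io
import pandas as pd

# Two sample lines in the UserID::MovieID::Rating::Timestamp layout
# (assumed to mirror how the raw file looks).
raw = "1::1193::5::978300760\n1::661::3::978302109\n"

rnames = ["user_id", "movie_id", "rating", "timestamp"]

# A multi-character separator such as "::" needs the Python parsing engine.
sample = pd.read_csv(io.StringIO(raw), sep="::", header=None,
                     names=rnames, engine="python")

# unit="s" interprets the integers as seconds since 1970-01-01 00:00 UTC.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")
print(sample)
```

Once the timestamp is a proper datetime column, it can be used like any other column for filtering or grouping, for example by year or month.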

#### USERS FILE DESCRIPTION
================================================================================

User information is in the file "users.dat" and is in the following
format:

UserID::Gender::Age::Occupation::Zip-code

All demographic information is provided voluntarily by the users and is
not checked for accuracy.  Only users who have provided some demographic
information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
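
The age and occupation columns store numeric codes, so summaries are easier to read once the codes are translated into labels. The sketch below shows one possible way to do that with `Series.map`; the dictionary names, the toy frame `sample_users`, and the new column names are hypothetical, and only a handful of occupation codes are included to keep it short.

```python
import pandas as pd

# Labels taken from the users file description above.
age_labels = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

# Only a few occupation codes are listed here to keep the sketch short.
occupation_labels = {7: "executive/managerial", 10: "K-12 student",
                     15: "scientist", 16: "self-employed", 20: "writer"}

# Toy rows in the same shape as the users table.
sample_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "gender": ["F", "M", "M", "M", "M"],
    "age": [1, 56, 25, 45, 25],
    "occupation": [10, 16, 15, 7, 20],
})

# map() looks up each code in the dictionary; unknown codes become NaN.
sample_users["age_group"] = sample_users["age"].map(age_labels)
sample_users["occupation_name"] = sample_users["occupation"].map(occupation_labels)
print(sample_users)
```

Note that the age value is the lower bound of a range, not the user's exact age, which matters when interpreting averages computed on that column.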

#### MOVIES FILE DESCRIPTION
================================================================================

Movie information is in the file "movies.dat" and is in the following
format:

MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

	* Action
	* Adventure
	* Animation
	* Children's
	* Comedy
	* Crime
	* Documentary
	* Drama
	* Fantasy
	* Film-Noir
	* Horror
	* Musical
	* Mystery
	* Romance
	* Sci-Fi
	* Thriller
	* War
	* Western

- Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist
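
Because genres are pipe-separated and titles carry the release year, both fields usually need a little string handling before analysis. The following is a sketch under the assumption that every title ends with a four-digit year in parentheses; the names `sample_movies`, `year`, `genre`, and `genres_long` are made up for this example.

```python
import pandas as pd

# Toy rows in the same shape as the movies table.
sample_movies = pd.DataFrame({
    "movie_id": [1, 2, 5],
    "title": ["Toy Story (1995)", "Jumanji (1995)",
              "Father of the Bride Part II (1995)"],
    "genres": ["Animation|Children's|Comedy",
               "Adventure|Children's|Fantasy", "Comedy"],
})

# Pull the four-digit year out of the trailing "(YYYY)" in the title.
sample_movies["year"] = (
    sample_movies["title"].str.extract(r"\((\d{4})\)$", expand=False).astype(int)
)

# Split on the literal pipe, then give each genre its own row.
genres_long = (
    sample_movies
    .assign(genre=sample_movies["genres"].str.split("|"))
    .explode("genre")
)
print(genres_long[["movie_id", "year", "genre"]])
```

Grouping on the raw `genres` column treats each combination (for example `Action|Adventure`) as its own category, whereas grouping on the exploded `genre` column counts a movie once per individual genre.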

In [2]:
unames = ["user_id", "gender", "age", "occupation", "zip"]
# sep="::" is a multi-character separator, so we request the Python parsing engine explicitly
users = pd.read_table("https://raw.githubusercontent.com/thousandoaks/Python4DS-II/main/datasets/users2.txt", sep="::", header=None, names=unames, engine="python")


In [3]:
rnames = ["user_id", "movie_id", "rating", "timestamp"]
# sep="::" is a multi-character separator, so we request the Python parsing engine explicitly
ratings = pd.read_table("https://raw.githubusercontent.com/thousandoaks/Python4DS-II/main/datasets/ratings2.txt", sep="::", header=None, names=rnames, engine="python")


In [4]:
mnames = ["movie_id", "title", "genres"]
# sep="::" is a multi-character separator, so we request the Python parsing engine explicitly
movies = pd.read_table("https://raw.githubusercontent.com/thousandoaks/Python4DS-II/main/datasets/movies2.txt", sep="::", header=None, names=mnames, engine="python")


In [5]:
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [6]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [7]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


#### Analyzing data spread across three tables is not a simple task; for example, suppose you wanted to compute mean ratings for a particular movie by gender and age. As you will see, this is more convenient with all of the data merged into a single table. Using pandas's merge function, we first merge ratings with users and then merge that result with the movies data. When no key is given, pandas infers the merge (or join) keys from overlapping column names; here we name the keys explicitly with the on argument:

In [8]:
RatingsUsersDataFrame=pd.merge(ratings, users, on='user_id')

In [9]:
RatingsUsersDataFrame

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip
0,1,1193,5,978300760,F,1,10,48067
1,1,661,3,978302109,F,1,10,48067
2,1,914,3,978301968,F,1,10,48067
3,1,3408,4,978300275,F,1,10,48067
4,1,2355,5,978824291,F,1,10,48067
...,...,...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,M,25,6,11106
1000205,6040,1094,5,956704887,M,25,6,11106
1000206,6040,562,5,956704746,M,25,6,11106
1000207,6040,1096,4,956715648,M,25,6,11106


In [10]:
RatingsUsersDataFrameMovies=pd.merge(RatingsUsersDataFrame,movies, on='movie_id')

In [11]:
RatingsUsersDataFrameMovies

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy
...,...,...,...,...,...,...,...,...,...,...
1000204,6040,1091,1,956716541,M,25,6,11106,Weekend at Bernie's (1989),Comedy
1000205,6040,1094,5,956704887,M,25,6,11106,"Crying Game, The (1992)",Drama|Romance|War
1000206,6040,562,5,956704746,M,25,6,11106,Welcome to the Dollhouse (1995),Comedy|Drama
1000207,6040,1096,4,956715648,M,25,6,11106,Sophie's Choice (1982),Drama


In [12]:
## We set movie_id as the new index

RatingsUsersDataFrameMovies.set_index('movie_id',inplace=True)
RatingsUsersDataFrameMovies

movie_id,user_id,rating,timestamp,gender,age,occupation,zip,title,genres
1193,1,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
661,1,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
914,1,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
3408,1,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama
2355,1,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy
...,...,...,...,...,...,...,...,...,...
1091,6040,1,956716541,M,25,6,11106,Weekend at Bernie's (1989),Comedy
1094,6040,5,956704887,M,25,6,11106,"Crying Game, The (1992)",Drama|Romance|War
562,6040,5,956704746,M,25,6,11106,Welcome to the Dollhouse (1995),Comedy|Drama
1096,6040,4,956715648,M,25,6,11106,Sophie's Choice (1982),Drama


## Let's describe ratings across genre categories

In [13]:
RatingsUsersDataFrameMovies.groupby('genres').agg({'rating':'describe'})

,rating,rating,rating,rating,rating,rating,rating,rating
genres,count,mean,std,min,25%,50%,75%,max
Action,12311.0,3.354886,1.052655,1.0,3.0,3.0,4.0,5.0
Action|Adventure,10446.0,3.676814,1.171912,1.0,3.0,4.0,5.0,5.0
Action|Adventure|Animation,345.0,4.147826,0.948470,1.0,4.0,4.0,5.0,5.0
Action|Adventure|Animation|Children's|Fantasy,135.0,2.703704,1.106791,1.0,2.0,3.0,3.5,5.0
Action|Adventure|Animation|Horror|Sci-Fi,618.0,3.546926,1.073581,1.0,3.0,4.0,4.0,5.0
...,...,...,...,...,...,...,...,...
Sci-Fi|Thriller|War,280.0,3.439286,1.035166,1.0,3.0,3.0,4.0,5.0
Sci-Fi|War,1367.0,4.449890,0.805507,1.0,4.0,5.0,5.0,5.0
Thriller,17851.0,3.555879,1.085143,1.0,3.0,4.0,4.0,5.0
War,991.0,3.889001,0.916620,1.0,3.0,4.0,5.0,5.0


## Let's compute the average rating across genres

In [14]:
RatingsUsersDataFrameMovies.groupby('genres').agg({'rating':'mean'}).sort_values(by='rating',ascending=False)

genres,rating
Animation|Comedy|Thriller,4.473837
Sci-Fi|War,4.449890
Animation,4.394336
Film-Noir|Mystery,4.367424
Adventure|War,4.346107
...,...
Action|Adventure|Children's|Fantasy,2.090909
Comedy|Film-Noir|Thriller,2.000000
Action|Adventure|Children's|Sci-Fi,1.874286
Action|Children's,1.742373


## Let's compute the average rating across genres and gender

In [15]:
RatingsUsersDataFrameMovies.groupby(['genres','gender']).agg({'rating':'mean'}).sort_values(by='rating',ascending=False)

genres,gender,rating
Animation|Comedy|Thriller,F,4.550802
Animation,F,4.533333
Sci-Fi|War,M,4.464789
Film-Noir|Romance|Thriller,F,4.448718
Animation|Comedy|Thriller,M,4.445110
...,...,...
Action|Adventure|Children's|Sci-Fi,M,1.820339
Action|Children's,M,1.708696
Action|Adventure|Children's,M,1.325000
Action|Adventure|Children's,F,1.250000


## Let's compute average values for rating and age across genres and gender

In [16]:
RatingsUsersDataFrameMovies.groupby(['genres','gender']).agg({'rating':'mean','age':'mean'})

genres,gender,rating,age
Action,F,3.367474,31.360646
Action,M,3.352991,29.671215
Action|Adventure,F,3.701213,30.080384
Action|Adventure,M,3.671115,29.420879
Action|Adventure|Animation,F,3.843750,25.125000
...,...,...,...
Thriller,M,3.553364,30.061157
War,F,3.841584,36.920792
War,M,3.894382,38.287640
Western,F,3.668613,34.705109
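
The dictionary form of `agg` used above reuses the source column names for the results. As an alternative sketch, pandas also supports named aggregation, which lets each output column get a descriptive name. The toy frame and the names `mean_rating`, `mean_age`, and `n_ratings` below are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the merged ratings table.
toy = pd.DataFrame({
    "genres": ["Drama", "Drama", "Drama", "Comedy", "Comedy"],
    "gender": ["F", "M", "M", "F", "F"],
    "rating": [5, 4, 2, 3, 4],
    "age": [25, 35, 45, 18, 56],
})

# Named aggregation: output_name=(source_column, function).
summary = (
    toy.groupby(["genres", "gender"])
    .agg(mean_rating=("rating", "mean"),
         mean_age=("age", "mean"),
         n_ratings=("rating", "count"))
    .reset_index()
)
print(summary)
```

Including the number of ratings next to each mean makes it easier to spot averages that rest on very few observations.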


In [17]:
RatingsUsersDataFrameMovies

movie_id,user_id,rating,timestamp,gender,age,occupation,zip,title,genres
1193,1,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
661,1,3,978302109,F,1,10,48067,James and the Giant Peach (1996),Animation|Children's|Musical
914,1,3,978301968,F,1,10,48067,My Fair Lady (1964),Musical|Romance
3408,1,4,978300275,F,1,10,48067,Erin Brockovich (2000),Drama
2355,1,5,978824291,F,1,10,48067,"Bug's Life, A (1998)",Animation|Children's|Comedy
...,...,...,...,...,...,...,...,...,...
1091,6040,1,956716541,M,25,6,11106,Weekend at Bernie's (1989),Comedy
1094,6040,5,956704887,M,25,6,11106,"Crying Game, The (1992)",Drama|Romance|War
562,6040,5,956704746,M,25,6,11106,Welcome to the Dollhouse (1995),Comedy|Drama
1096,6040,4,956715648,M,25,6,11106,Sophie's Choice (1982),Drama


## Let's explore users' preferences for genres

In [18]:
UsersGenrePreferences=RatingsUsersDataFrameMovies.groupby(['genres','user_id']).agg({'rating':'mean'}).reset_index()
UsersGenrePreferences

Unnamed: 0,genres,user_id,rating
0,Action,2,3.0
1,Action,4,5.0
2,Action,5,3.0
3,Action,6,4.0
4,Action,8,2.0
...,...,...,...
352167,Western,6032,3.0
352168,Western,6034,3.0
352169,Western,6036,4.2
352170,Western,6037,3.0


In [19]:
## This code computes the top 3 genre preferences per user, based on their mean ratings
UsersGenrePreferences.groupby('user_id').apply(lambda x: x.nlargest(3, 'rating'))


user_id,,genres,user_id,rating
1,21139,Action|Adventure|Drama,1,5.0
1,87523,Action|Drama|War,1,5.0
1,170760,Animation|Children's|Musical|Romance,1,5.0
2,12477,Action|Adventure|Comedy|Romance,2,5.0
2,22605,Action|Adventure|Drama|Sci-Fi|War,2,5.0
...,...,...,...,...
6039,39192,Action|Adventure|Romance|Sci-Fi|War,6039,5.0
6039,157848,Adventure|War,6039,5.0
6040,124775,Adventure,6040,5.0
6040,136238,Adventure|Children's|Drama|Musical,6040,5.0


In [20]:
## Note: head(3) keeps the first 3 rows of each user in their current order, which is not the same as the 3 highest ratings
UsersGenrePreferences.groupby('user_id').head(3).sort_values(by='user_id')

Unnamed: 0,genres,user_id,rating
27066,Action|Adventure|Fantasy|Sci-Fi,1,4.000000
21139,Action|Adventure|Drama,1,5.000000
12476,Action|Adventure|Comedy|Romance,1,3.000000
0,Action,2,3.000000
12477,Action|Adventure|Comedy|Romance,2,5.000000
...,...,...,...
30664,Action|Adventure|Fantasy|Sci-Fi,6039,5.000000
15535,Action|Adventure|Comedy|Romance,6039,4.000000
20922,Action|Adventure|Crime|Drama,6040,2.000000
2747,Action,6040,2.000000
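
The difference between taking the first rows of each group and taking the highest-rated rows is easy to miss on a large table. The sketch below illustrates it on a toy frame (the name `prefs` and its values are invented): `head` alone follows the existing row order, while sorting by rating first turns it into a top-N selection.

```python
import pandas as pd

# Toy stand-in for the per-user genre averages.
prefs = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "genres": ["Action", "Comedy", "Drama", "War", "Action", "Drama"],
    "rating": [2.0, 5.0, 3.0, 4.0, 1.0, 4.5],
})

# head(2) keeps the first two rows of each group in their existing order.
first_two = prefs.groupby("user_id").head(2)

# Sorting by rating first makes head(2) return the two highest ratings.
top_two = (
    prefs.sort_values("rating", ascending=False)
    .groupby("user_id")
    .head(2)
    .sort_values(["user_id", "rating"], ascending=[True, False])
)

print(first_two)
print(top_two)
```

For user 1 the two results differ, because the first two rows of that group are not the two best-rated genres.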


## Let's extract movies with at least 500 reviews and an average rating above 3

In [21]:
MovieRatingsAndCounts=RatingsUsersDataFrameMovies.reset_index().groupby('movie_id').agg({'rating':'mean','user_id':'count'})
MovieRatingsAndCounts

movie_id,rating,user_id
1,4.146846,2077
2,3.201141,701
3,3.016736,478
4,2.729412,170
5,3.006757,296
...,...,...
3948,3.635731,862
3949,4.115132,304
3950,3.666667,54
3951,3.900000,40


In [22]:
MovieRatingsAndCounts[(MovieRatingsAndCounts['rating']>3)&(MovieRatingsAndCounts['user_id']>=500)]

movie_id,rating,user_id
1,4.146846,2077
2,3.201141,701
6,3.878723,940
10,3.540541,888
11,3.793804,1033
...,...,...
3868,3.680636,692
3893,3.502683,559
3897,4.226358,994
3911,4.073059,657
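
The aggregate-then-filter pattern used in this last step can be sketched on a toy table. The thresholds `MIN_RATINGS` and `MIN_MEAN`, the frame `toy`, and the column names `mean_rating` and `n_ratings` are hypothetical; the lab's stated goal corresponds to 500 reviews and an average above 3.

```python
import pandas as pd

# Toy stand-in for the merged ratings table.
toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2, 3],
    "user_id": [1, 2, 3, 1, 2, 1],
    "rating": [5, 4, 4, 2, 3, 5],
})

# One row per movie: average rating and number of ratings.
stats = toy.groupby("movie_id").agg(
    mean_rating=("rating", "mean"),
    n_ratings=("rating", "count"),
)

MIN_RATINGS = 2  # hypothetical threshold ("at least" means >=)
MIN_MEAN = 3     # hypothetical threshold ("above" means >)

# Each comparison needs its own parentheses because & binds tighter than >.
popular = stats[(stats["n_ratings"] >= MIN_RATINGS)
                & (stats["mean_rating"] > MIN_MEAN)]
print(popular)
```

Giving the count column its own name (`n_ratings` rather than `user_id`) also makes the filter expression easier to read.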
