# Module 3 In-Class Assignment
Python for Data Analytics | Module 3 
<br>Professor James Ng

## Dataset 

In this in-class assignment, you will work with the MovieLens dataset. It contains user ratings of thousands of motion pictures, collected by MovieLens, a movie recommendation service.

The dataset is in a single file, movielens.csv. Each row represents one rating provided by a user for a movie at a given time. The columns are described below:

* **userId** - The identifier of the user who provided the rating
* **movieId** - The identifier of the movie that was rated
* **rating** - The star rating for the movie, ranging from half a star to five stars
* **title** - The name of the movie
* **year** - The year the movie was released
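
To make the column layout concrete, here is a minimal sketch of a frame with the same five columns. The rows are invented for illustration only (they are not taken from movielens.csv), and the name `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical rows, made up to illustrate the schema described above.
toy = pd.DataFrame({
    "userId": [1, 1, 2],                         # who gave the rating
    "movieId": [10, 20, 10],                     # which movie was rated
    "rating": [4.0, 0.5, 5.0],                   # half a star to five stars
    "title": ["Movie A", "Movie B", "Movie A"],  # name of the movie
    "year": [1994, 2001, 1994],                  # release year
})

# Each row is one rating event: the same movie can appear many times,
# once per user who rated it.
print(toy.dtypes)
print(toy.shape)
```

Note that `movieId`, `title`, and `year` repeat on every row for the same movie, which is why later steps de-duplicate on `movieId` before comparing titles.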

<div class="alert alert-block alert-warning">
    <b>This assignment is due by the end of class.</b> 
    <br>Late submissions will NOT be graded. 
    <br>There are three problems, and I think you can do it!
    </div>

In [1]:
# SETUP. RUN BUT DO NOT CHANGE.
import numpy as np
import pandas as pd

movielens = pd.read_csv("http://www3.nd.edu/~jng2/movielens_v2.csv", encoding='latin-1')

In [2]:
# PROBLEM 1 (10 points)
# Find the number of unique users, unique movies, and unique titles in the dataset. 
# Hint: This question DOES NOT involve groupby and is very straightforward.

#
# YOUR CODE HERE
#

### BEGIN SOLUTION
print(movielens['userId'].nunique(), movielens['movieId'].nunique(), movielens['title'].nunique())
### END SOLUTION

671 9050 8821
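
The same counts can also be obtained in one call, since `DataFrame.nunique()` works column by column. The sketch below uses a small made-up frame (the name `toy` and its rows are assumptions, not the course data) so that the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical mini-dataset: 3 users, 3 movies, but only 2 distinct titles.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "title": ["Alfie", "Ran", "Alfie", "Alfie"],
})

# nunique() on a column selection returns one count per column as a Series.
counts = toy[["userId", "movieId", "title"]].nunique()
print(counts)
```

Here movieId 10 and movieId 30 share the title "Alfie", so the title count is lower than the movie count, the same pattern seen in the real output above.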


In [3]:
# PROBLEM 2 (30 points)
# There are more unique movies than unique titles. Why? Give a reason and back it up with the data. 
# Hint 1: The reason is intuitive.
# Hint 2: You may want to use drop_duplicates(), duplicated(), and sort_values().

#
# YOUR CODE HERE
#

### BEGIN SOLUTION
# Reason: Some movies were remade with the same title, so distinct movieIds can share one title.

# Let's show this.
# First, keep one row per movieId.
uniqmovies = movielens.drop_duplicates(subset='movieId')

# Next, demonstrate that different movies can share the same title.
# keep=False flags every row whose title occurs more than once, not just the repeats.
shared_titles = uniqmovies[uniqmovies.duplicated(subset='title', keep=False)]
shared_titles.sort_values(['title', 'year'])[['movieId', 'title', 'year']].head(15)


### END SOLUTION

| | movieId | title | year |
|---|---|---|---|
| 4404 | 1203 | 12 Angry Men | 1957 |
| 57609 | 77846 | 12 Angry Men | 1997 |
| 4494 | 5300 | 3:10 to Yuma | 1957 |
| 90229 | 54997 | 3:10 to Yuma | 2007 |
| 86638 | 115881 | 9 | 2005 |
| 93292 | 71057 | 9 | 2009 |
| 7429 | 4191 | Alfie | 1966 |
| 83604 | 8948 | Alfie | 2004 |
| 433 | 80748 | Alice in Wonderland | 1933 |
| 2983 | 1032 | Alice in Wonderland | 1951 |
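
The two-step pattern used above (de-duplicate on `movieId`, then flag repeated titles with `keep=False`) can be traced on a tiny frame. This is a sketch on invented rows; `toy`, `unique_movies`, and `shared` are hypothetical names.

```python
import pandas as pd

# Hypothetical ratings: movieId 1 and 2 are different films both titled "Alfie";
# movieId 3 is the only film titled "Ran".
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3],
    "title": ["Alfie", "Alfie", "Alfie", "Ran", "Ran"],
    "year": [1966, 1966, 2004, 1985, 1985],
})

# Step 1: one row per movie, so repeated ratings no longer look like repeated titles.
unique_movies = toy.drop_duplicates(subset="movieId")

# Step 2: keep=False marks ALL rows whose title appears more than once.
shared = unique_movies[unique_movies.duplicated(subset="title", keep=False)]
shared = shared.sort_values(["title", "year"])
print(shared)
```

Skipping step 1 would wrongly flag "Ran" as a shared title, because its two ratings would count as two rows with the same title.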


In [5]:
# PROBLEM 3 (60 points)
# In each decade, find the movie with the highest average rating among movies with at least 20 reviews.
# Your answer should be a DataFrame with these columns: decade, movieId, title, year, 
# average_rating, number_of_reviews. It should have 10 rows (one per decade), sorted 
# in ascending order of decade.

# HINTS: 
# Disambiguation: Years 1980, 1981,...,1989 belong to the same decade, the 1980s. 
# There are ten decades in the data, starting with the 1920s.

# Notice that the data doesn't have a decade column. You will first have to create one.

#
# YOUR CODE HERE
#

### BEGIN SOLUTION

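# movielens['year'].unique()  # view years

# Bin years into decades. pd.cut intervals are right-closed by default,
# so (1979, 1989] covers 1980 through 1989.
movielens['decade'] = pd.cut(movielens['year'], 
                             bins = [1919,1929,1939,1949,1959,1969,1979,1989,1999,2009,2019])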

# compute average user rating and number of reviews for each movieId
df_avgrating = movielens.groupby('movieId')['rating'].aggregate(
    ['mean', 'count']).rename(
    columns={'mean':'average_rating', 'count':'number_of_reviews'})

# keep just the movies with >=20 reviews
df_avgrating = df_avgrating[ df_avgrating['number_of_reviews'] >= 20 ]

# keep one row per movieId
df_movies = movielens.drop_duplicates(subset='movieId')

# keep just the relevant columns
df_movies = df_movies[['movieId', 'title', 'year', 'decade']]

# merge the two dataframes
df_final = pd.merge(df_movies, df_avgrating, left_on='movieId', right_index=True)

# sort movies by decade (ascending), average rating (descending), number of reviews (descending)
df_final.sort_values(by=['decade', 'average_rating', 'number_of_reviews'], ascending=[True, False, False], inplace=True)

# Keep the first row per decade. This works only because df_final was sorted exactly as in the preceding step.
df_final.drop_duplicates(subset='decade', keep='first', inplace=True)

df_final

| | movieId | title | year | decade | average_rating | number_of_reviews |
|---|---|---|---|---|---|---|
| 128 | 2010 | Metropolis | 1927 | (1919, 1929] | 3.980769 | 26 |
| 453 | 905 | It Happened One Night | 1934 | (1929, 1939] | 4.38 | 25 |
| 1496 | 913 | Maltese Falcon, The | 1941 | (1939, 1949] | 4.387097 | 62 |
| 3598 | 1945 | On the Waterfront | 1954 | (1949, 1959] | 4.448276 | 29 |
| 7558 | 1276 | Cool Hand Luke | 1967 | (1959, 1969] | 4.271739 | 46 |
| 9681 | 858 | Godfather, The | 1972 | (1969, 1979] | 4.4875 | 200 |
| 20485 | 1217 | Ran | 1985 | (1979, 1989] | 4.423077 | 26 |
| 39759 | 318 | Shawshank Redemption, The | 1994 | (1989, 1999] | 4.487138 | 311 |
| 74522 | 7502 | Band of Brothers | 2001 | (1999, 2009] | 4.386364 | 22 |
| 95822 | 88125 | Harry Potter and the Deathly Hallows: Part 2 | 2011 | (2009, 2019] | 4.220588 | 34 |
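
As an alternative to the sort-then-drop-duplicates approach in the solution, the sketch below derives the decade by integer division and picks each decade's winner with `groupby(...).idxmax()`. It is not the solution's method, just one other way to get the same kind of result. The data are invented, and `MIN_REVIEWS` is a hypothetical constant set to 3 so the toy example stays small (the assignment uses 20).

```python
import pandas as pd

# Hypothetical ratings for four movies across two decades.
ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "title": ["A", "A", "A", "B", "B", "C", "C", "C", "D"],
    "year": [1980, 1980, 1980, 1989, 1989, 1990, 1990, 1990, 1995],
    "rating": [4.0, 5.0, 3.0, 5.0, 5.0, 2.0, 3.0, 4.0, 5.0],
})
MIN_REVIEWS = 3  # assumption for this toy data; the assignment uses 20

# 1980..1989 -> 1980, 1990..1999 -> 1990, and so on.
ratings["decade"] = (ratings["year"] // 10) * 10

# Average rating and review count per movie (title, year, decade ride along as keys).
stats = (
    ratings.groupby(["decade", "movieId", "title", "year"])["rating"]
    .agg(average_rating="mean", number_of_reviews="count")
    .reset_index()
)

# Drop movies with too few reviews BEFORE choosing the best one per decade.
stats = stats[stats["number_of_reviews"] >= MIN_REVIEWS]

# idxmax returns the row label of the top average rating within each decade.
best = stats.loc[stats.groupby("decade")["average_rating"].idxmax()]
best = best.sort_values("decade").reset_index(drop=True)
print(best)
```

In this toy data, movies B and D have the highest averages but too few reviews, so A and C are selected. One difference from the solution: when two movies tie on average rating, `idxmax` keeps the first one it encounters, whereas the solution breaks ties by number of reviews.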
