# Popularity-based recommendation engine

Simple recommenders are basic systems that recommend the top items based on a certain metric or score. The basic idea behind this system is that movies that are more popular will have a higher probability of being liked by the average audience.

In this lesson, we will build a simplified clone of IMDb Top 250 Movies using metadata collected from IMDb.

## 1. Import libraries and read the data

To begin, we import the pandas library and read the IMDb data:

In [1]:
# import Pandas
import pandas as pd

# load Movies Metadata
movies_df = pd.read_csv('resources/movies_metadata.csv')


## 2. Explore the data

It's a good idea to always explore your data a bit, so you know what you're working with. So, let's get into the habit of doing some pandas slicing and dicing!

Let's print the first three rows and have a look at the data.

In [2]:
# print the first three rows
movies_df.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [3]:
# Select all rows of the column at integer position 23 (the 24th column, since positions are zero-based).
# The output doubles as a sanity check: it shows the first and last rows, the column name, the length, and the dtype.
# BTW: iloc stands for 'integer location'.
movies_df.iloc[:, 23]

0        5415.0
1        2413.0
2          92.0
3          34.0
4         173.0
          ...  
45461       1.0
45462       3.0
45463       6.0
45464       0.0
45465       0.0
Name: vote_count, Length: 45466, dtype: float64

In [4]:
# Because pandas keeps the column names as labels, we can select the same column by name with 'loc' (label-based) instead of by position with 'iloc'.
movies_df.loc[:, 'vote_count']

0        5415.0
1        2413.0
2          92.0
3          34.0
4         173.0
          ...  
45461       1.0
45462       3.0
45463       6.0
45464       0.0
45465       0.0
Name: vote_count, Length: 45466, dtype: float64

In [5]:
# Or, even simpler: single square brackets with one column name return a Series.
movies_df['vote_count']

0        5415.0
1        2413.0
2          92.0
3          34.0
4         173.0
          ...  
45461       1.0
45462       3.0
45463       6.0
45464       0.0
45465       0.0
Name: vote_count, Length: 45466, dtype: float64

In [6]:
type(movies_df['vote_count'])

pandas.core.series.Series

In [7]:
# Which movies have at least 5000 votes? The comparison returns a boolean Series (a mask).
movies_df['vote_count'] >= 5000

0         True
1        False
2        False
3        False
4        False
         ...  
45461    False
45462    False
45463    False
45464    False
45465    False
Name: vote_count, Length: 45466, dtype: bool
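
A boolean Series like this one can also be summarized directly. The sketch below uses a small made-up Series in place of `movies_df['vote_count']` (the values and the names `mask`, `n_matching`, and `share` are assumptions for illustration) to show how counting and averaging a mask works:

```python
import pandas as pd

# Small made-up Series standing in for movies_df['vote_count'].
vote_count = pd.Series([5415.0, 2413.0, 92.0, 34.0, 173.0], name='vote_count')

mask = vote_count >= 5000   # boolean Series: True where the condition holds
n_matching = mask.sum()     # True counts as 1, so the sum is the number of matches
share = mask.mean()         # the mean of a mask is the fraction of matching rows

print(n_matching, share)
```

On the real data, the same two calls would tell you how many movies pass a threshold and what share of the dataset they represent.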

In [8]:
# BTW: double square brackets take a list of column names and return a DataFrame, even when we select a single column:
movies_df[['vote_count']]

Unnamed: 0,vote_count
0,5415.0
1,2413.0
2,92.0
3,34.0
4,173.0
...,...
45461,1.0
45462,3.0
45463,6.0
45464,0.0


In [9]:
type(movies_df[['vote_count']])

pandas.core.frame.DataFrame
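
To tie the selection styles together, here is a minimal sketch on a tiny made-up DataFrame (the data and variable names are assumptions, not part of the lesson's dataset). It checks that `iloc`, `loc`, and single brackets return the same Series, and that double brackets return a one-column DataFrame:

```python
import pandas as pd

# Tiny made-up frame; only the column names mirror the lesson's data.
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'vote_average': [7.7, 6.9, 6.5],
    'vote_count': [5415.0, 2413.0, 92.0],
})

pos = df.columns.get_loc('vote_count')  # integer position of the named column

by_position = df.iloc[:, pos]           # select by integer position
by_label = df.loc[:, 'vote_count']      # select by label
by_brackets = df['vote_count']          # shorthand, returns a Series
as_frame = df[['vote_count']]           # list of names, returns a DataFrame

print(by_position.equals(by_label) and by_label.equals(by_brackets))
print(type(by_brackets).__name__, by_brackets.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Looking up the position with `get_loc` avoids hard-coding an integer that silently points at a different column if the column order changes.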

In [10]:
# Next, let's use a boolean mask to select the movies with at least 160 votes
movies_df[movies_df['vote_count'] >= 160]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,1995-12-15,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,1995-12-22,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45177,False,"{'id': 442352, 'name': 'Brice Collection', 'po...",0,"[{'id': 35, 'name': 'Comedy'}]",,375798,tt5029602,fr,Brice 3,"Brice is back. The world has changed, but not ...",...,2016-10-19,0.0,95.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Brice 3,False,4.3,160.0
45204,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,417870,tt3564472,en,Girls Trip,Four girlfriends take a trip to New Orleans fo...,...,2017-07-21,0.0,122.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"""Forgive us in advance for this wild weekend""",Girls Trip,False,7.1,393.0
45258,False,"{'id': 466463, 'name': 'Descendants Collection...",0,"[{'id': 10770, 'name': 'TV Movie'}, {'id': 107...",,417320,tt5117876,en,Descendants 2,When the pressure to be royal becomes too much...,...,2017-07-21,0.0,111.0,"[{'iso_639_1': 'da', 'name': 'Dansk'}]",Released,Long live evil.,Descendants 2,False,7.5,171.0
45265,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,265189,tt2121382,sv,Turist,"While holidaying in the French Alps, a Swedish...",...,2014-08-15,1359497.0,118.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,,Force Majeure,False,6.8,255.0


Only about 4.5k of the roughly 45k movies meet this threshold. The value of 160 votes is not arbitrary: it is the 90th percentile of `vote_count`, which we compute later in this lesson, so these movies are the top 10% by number of votes.

Now print multiple columns, like the 'vote_average' and 'vote_count' columns, and let's just print the first 10 rows.

In [11]:
# print the vote_average and vote_count of the first 10 rows
movies_df[['vote_average', 'vote_count']].head(10)

Unnamed: 0,vote_average,vote_count
0,7.7,5415.0
1,6.9,2413.0
2,6.5,92.0
3,6.1,34.0
4,5.7,173.0
5,7.7,1886.0
6,6.2,141.0
7,5.4,45.0
8,5.5,174.0
9,6.6,1194.0


One of the most basic metrics to build our *Top 250* is the rating (vote_average from above). However, using this metric has a few caveats. For example, it does not take into consideration the popularity of a movie. Therefore, a movie with a rating of 9 from 10 voters will be considered 'better' than a movie with a rating of 8.9 from 10,000 voters.

So it is necessary to come up with a weighted rating that takes into account __the average rating and the number of votes__ it has gathered. We will use IMDb's weighted rating formula (since we are trying to build a clone of IMDb's Top 250):

$$\textrm{Weighted Rating (WR)} = \left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$$

where,

- v is the number of votes for the movie (`vote_count`)
- m is the minimum number of votes required to be listed in the chart (to be computed)
- R is the average rating of the movie (`vote_average`)
- C is the mean vote across the whole report (to be computed)
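
To see how the formula handles the two movies from the example above, here is a small worked sketch. The values of `m` and `C` are illustrative assumptions, not the values computed from the dataset, and the standalone function signature is chosen only for this example:

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own rating R with the global mean C, weighted by its vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative values only (assumptions, not computed from the data).
m = 160   # hypothetical minimum number of votes
C = 5.6   # hypothetical mean rating across all movies

niche = weighted_rating(v=10, R=9.0, m=m, C=C)        # rating of 9 from 10 voters
popular = weighted_rating(v=10_000, R=8.9, m=m, C=C)  # rating of 8.9 from 10,000 voters

print(f"niche:   {niche:.2f}")
print(f"popular: {popular:.2f}")
```

With few votes, the score is pulled strongly towards C; with many votes, it stays close to the movie's own rating R. As a result, the widely rated movie ends up ranked above the niche one.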

As a first step, let's calculate the value of C, the mean rating across all movies:

In [12]:
# calculate C
C = movies_df['vote_average'].mean()
print(C)

5.618207215134185


Now we need to determine an appropriate value for m, the minimum number of votes required to be listed in the chart. There is no single right value for m, so we will use the 90th percentile of `vote_count` as the cutoff. In other words, for a movie to feature in the chart, it must have more votes than roughly 90% of the movies in the dataset, which places it in the top 10% by vote count.

Let's calculate m, the number of votes at the 90th percentile. pandas makes this straightforward with the .quantile() method:

In [13]:
# calculate the minimum number of votes required to be in the chart, m
m = movies_df['vote_count'].quantile(0.90)
print(m)

160.0


If we had chosen the 75th percentile, we would have considered the top 25% of the movies in terms of the number of votes gathered. As the percentile decreases, the number of movies considered increases. You can check the number of votes for the 75th percentile yourself.
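
The relationship between the percentile and the number of qualifying movies can be sketched on a small made-up Series (the ten vote counts below and the names `threshold` and `qualified` are assumptions for illustration):

```python
import pandas as pd

# Made-up vote counts for ten hypothetical movies.
vote_count = pd.Series([0, 1, 3, 6, 34, 92, 141, 173, 2413, 5415], dtype=float)

qualified = {}
for q in (0.90, 0.75, 0.50):
    threshold = vote_count.quantile(q)                 # votes needed at this percentile
    qualified[q] = int((vote_count >= threshold).sum())  # movies at or above the threshold
    print(f"quantile={q:.2f}  threshold={threshold:8.1f}  qualified={qualified[q]}")
```

Lowering the percentile lowers the threshold, so more movies pass the filter. The same loop can be pointed at `movies_df['vote_count']` to compare cutoffs on the real data.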

## 3. Filter the data

Next, we can filter the movies that qualify for the chart, based on their vote counts. We use the .copy() method to ensure that the new q_movies_df DataFrame is independent of the original movies_df DataFrame. In other words, changes made to q_movies_df do not affect movies_df.

You see that there are 4555 movies which qualify to be in this list.

In [14]:
# keep only the qualified movies in a new, independent DataFrame
q_movies_df = movies_df.loc[movies_df['vote_count'] >= m].copy()
q_movies_df.shape

(4555, 24)
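
A minimal sketch of what this independence means in practice, using a tiny made-up DataFrame (the names `movies` and `qualified` and the data are assumptions for illustration). Here the copy is taken after filtering, which also yields an independent DataFrame:

```python
import pandas as pd

# Tiny made-up frame standing in for movies_df.
movies = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'vote_count': [5415.0, 92.0, 173.0],
})

# Filter first, then copy, so the result is an independent DataFrame.
qualified = movies.loc[movies['vote_count'] >= 160].copy()

# Adding a column to the copy leaves the original untouched.
qualified['score'] = 0.0

print(list(movies.columns))
print(list(qualified.columns))
```

Working on an explicit copy also keeps pandas from warning that you might be assigning to a slice of another DataFrame.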

Now, we need to calculate the *Weighted Rating* for each qualified movie. To do this, we define a function, weighted_rating(), and create a new feature, score, whose value we compute by applying this function to each row of the DataFrame of qualified movies:

In [15]:
# function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # calculation based on the IMDb formula
    return (v/(v+m) * R) + (m/(m+v) * C)

# define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies_df['score'] = q_movies_df.apply(weighted_rating, axis=1)
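
As an alternative to the row-wise `apply`, the same formula can be written with whole-column arithmetic, which is usually faster on large DataFrames. This is a sketch on made-up data with illustrative values for `m` and `C` (both are assumptions, not the values computed in this lesson):

```python
import pandas as pd

# Made-up data; m and C are illustrative assumptions.
df = pd.DataFrame({
    'vote_count': [5415.0, 2413.0, 173.0],
    'vote_average': [7.7, 6.9, 5.7],
})
m, C = 160.0, 5.6

def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v / (v + m) * R) + (m / (m + v) * C)

# Row-wise version, as in the lesson.
row_wise = df.apply(weighted_rating, axis=1)

# Column-wise version: the arithmetic is applied to whole Series at once.
v, R = df['vote_count'], df['vote_average']
vectorized = (v / (v + m)) * R + (m / (v + m)) * C

# Largest absolute difference between the two results.
print((row_wise - vectorized).abs().max())
```

Both versions compute the same scores; the function-based version is kept in the lesson because it mirrors the formula one movie at a time.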

Let's have a look at the newly created column.

In [16]:
q_movies_df.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,score
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.640253
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.820293
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.6607
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,7.537201
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0,5.556626


In [17]:
q_movies_df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,score
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,7.640253
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,6.820293
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,5.660700
5,False,,60000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,187436818.0,170.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886.0,7.537201
8,False,,35000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",,9091,tt0114576,en,Sudden Death,International action superstar Jean Claude Van...,...,64350171.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terror goes into overtime.,Sudden Death,False,5.5,174.0,5.556626
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45177,False,"{'id': 442352, 'name': 'Brice Collection', 'po...",0,"[{'id': 35, 'name': 'Comedy'}]",,375798,tt5029602,fr,Brice 3,"Brice is back. The world has changed, but not ...",...,0.0,95.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Brice 3,False,4.3,160.0,4.959104
45204,False,,0,"[{'id': 35, 'name': 'Comedy'}]",,417870,tt3564472,en,Girls Trip,Four girlfriends take a trip to New Orleans fo...,...,0.0,122.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"""Forgive us in advance for this wild weekend""",Girls Trip,False,7.1,393.0,6.671272
45258,False,"{'id': 466463, 'name': 'Descendants Collection...",0,"[{'id': 10770, 'name': 'TV Movie'}, {'id': 107...",,417320,tt5117876,en,Descendants 2,When the pressure to be royal becomes too much...,...,0.0,111.0,"[{'iso_639_1': 'da', 'name': 'Dansk'}]",Released,Long live evil.,Descendants 2,False,7.5,171.0,6.590372
45265,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,265189,tt2121382,sv,Turist,"While holidaying in the French Alps, a Swedish...",...,1359497.0,118.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,,Force Majeure,False,6.8,255.0,6.344369


In [18]:
q_movies_df['score']

0        7.640253
1        6.820293
4        5.660700
5        7.537201
8        5.556626
           ...   
45177    4.959104
45204    6.671272
45258    6.590372
45265    6.344369
45343    4.791783
Name: score, Length: 4555, dtype: float64

In [19]:
q_movies_df.loc[:, 'score']

0        7.640253
1        6.820293
4        5.660700
5        7.537201
8        5.556626
           ...   
45177    4.959104
45204    6.671272
45258    6.590372
45265    6.344369
45343    4.791783
Name: score, Length: 4555, dtype: float64

## 4. Top 15

Finally, let's sort the DataFrame based on the score feature and output the title, vote count, vote average and score (= weighted rating) of the top 15 movies.

In [20]:
# sort movies based on score calculated above
q_movies_df = q_movies_df.sort_values('score', ascending=False)

# print the top 15 movies
q_movies_df[['title', 'vote_count', 'vote_average', 'score']].head(15)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385
292,Pulp Fiction,8670.0,8.3,8.251406
522,Schindler's List,4436.0,8.3,8.206639
23673,Whiplash,4376.0,8.3,8.205404
5481,Spirited Away,3968.0,8.3,8.196055
2211,Life Is Beautiful,3643.0,8.3,8.187171
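
If you only need the top rows and do not want to keep the sorted DataFrame, pandas also offers `nlargest`. The sketch below compares it with sorting on a made-up DataFrame (the titles and scores are assumptions for illustration):

```python
import pandas as pd

# Made-up scores for four hypothetical movies.
scores = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'score': [6.8, 8.4, 7.6, 8.2],
})

# Sort the whole frame, then take the first two rows.
top_sorted = scores.sort_values('score', ascending=False).head(2)

# Ask directly for the two rows with the largest scores.
top_nlargest = scores.nlargest(2, 'score')

print(top_nlargest)
```

Both approaches select the same movies here; `nlargest` simply states the intent ("give me the top N") in a single call.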


You can see that the chart has a lot of movies in common with the IMDb Top 250 chart: for example, the top two movies, "The Shawshank Redemption" and "The Godfather", are the same as on IMDb. Check the other movies yourself. Pretty impressive, no?

<img src="./resources/imdb.png" style="height: 500px"/>