SYSTEM DESIGN AND IMPLEMENTATION
================================

IMPORT OUR DEPENDENCIES
=========

In [1]:
import pandas as pd
import numpy as np
import ast 
from ast import literal_eval


LOAD DATA SET
==

In [2]:
md = pd.read_csv('movies_metadata.csv')



UNDERSTAND DATA SET
==

METADATA  DATAFRAME
----

In [3]:
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')


FEATURES
---


    adult: Indicates if the movie is X-Rated or Adult.
    
    belongs_to_collection: A stringified dictionary that gives information on the movie series the particular film belongs to.
    
    budget: The budget of the movie in dollars.
    
    genres: A stringified list of dictionaries that list out all the genres associated with the movie.
    
    homepage: The Official Homepage of the movie.
    
    id: The ID of the movie.
    
    imdb_id: The IMDB ID of the movie.
    
    original_language: The language in which the movie was originally shot.
    
    original_title: The original title of the movie.
    
    overview: A brief blurb of the movie.
    
    popularity: The Popularity Score assigned by TMDB.
    
    poster_path: The URL of the poster image.
    
    production_companies: A stringified list of production companies involved with the making of the movie.
    
    production_countries: A stringified list of countries where the movie was shot/produced.
    
    release_date: Theatrical Release Date of the movie.
    
    revenue: The total revenue of the movie in dollars.
    
    runtime: The runtime of the movie in minutes.
    
    spoken_languages: A stringified list of spoken languages in the film.
    
    status: The status of the movie (Released, To Be Released, Announced, etc.)
    
    tagline: The tagline of the movie.
    
    title: The Official Title of the movie.
    
    video: Indicates if there is a video present of the movie with TMDB.
    
    vote_average: The average rating of the movie.
    
    vote_count: The number of votes by users, as counted by TMDB.



In [4]:
md.shape

(45466, 24)

In [5]:
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

BUILD RECOMMENDATION SYSTEM
===

Simple recommendation system
---



Approach
--

   * The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre.

   * The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience.

   * This model does not give personalized recommendations based on the user.
    


What we are actually doing:
--
   * The implementation of this model is extremely trivial.
   
   * All we have to do is sort our movies based on ratings and popularity and display the top movies of our list.
   
   * As an added step, we can pass in a genre argument to get the top movies of a particular genre.

We will build our overall Top 250 Chart and define a function to build charts for a particular genre. Let's begin!
    



In [6]:
# Parse each stringified list of dicts and keep only the genre names
md['genres'] = (md['genres'].fillna('[]')
                .apply(literal_eval)
                .apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else []))
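
To see what this cell does to a single value, the sketch below parses one stringified genre list with `literal_eval` and keeps only the names. The sample string and the helper name `extract_names` are illustrative assumptions, not part of the notebook.

```python
from ast import literal_eval

def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    parsed = literal_eval(raw)
    # Anything that is not a list is treated as "no genres"
    return [d['name'] for d in parsed] if isinstance(parsed, list) else []

# Illustrative sample in the same shape as the 'genres' column
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
names = extract_names(sample)
empty = extract_names('[]')   # what a missing value becomes after fillna('[]')
print(names, empty)
```

`literal_eval` only evaluates Python literals, which is why it is a safer choice than `eval` for strings read from a CSV file.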

WEIGHTED RATING (WR)
==

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where,

    v is the number of votes for the movie
    
    m is the minimum votes required to be listed in the chart
    
    R is the average rating of the movie
    
    C is the mean vote across the whole report (here, across all movies in the data set)
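
As a worked example of the weighted rating, the sketch below plugs in numbers that appear elsewhere in this notebook: m = 434 and C ≈ 5.2449, with the vote count and rating listed for Inception (v = 14075, R = 8). The function name `weighted_rating_value` is a hypothetical stand-in used only for this illustration.

```python
def weighted_rating_value(v, R, m, C):
    """Weighted rating: blend the movie's own rating R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434.0                 # minimum votes (95th percentile reported in this notebook)
C = 5.244896612406511     # mean vote reported in this notebook

wr_inception = weighted_rating_value(v=14075, R=8, m=m, C=C)
print(round(wr_inception, 4))   # about 7.9176

# With few votes the score is pulled toward C; with many votes it approaches R.
# When v equals m, the result sits exactly halfway between R and C.
wr_few_votes = weighted_rating_value(v=434, R=8, m=m, C=C)
print(round(wr_few_votes, 4))
```

The two weights always sum to 1, so the score is a weighted average of R and C that never leaves the interval between them.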

In [7]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')


In [8]:
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

# this is C
C = vote_averages.mean()
C

5.244896612406511



   * Next, we need to determine an appropriate value for m, the minimum votes required to be listed in the chart.

   * We will use the 95th percentile as our cutoff. In other words, for a movie to feature in the charts, its vote count must be at least as high as that of 95% of the movies in the list.



In [9]:
m = vote_counts.quantile(0.95)
m

434.0
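
A minimal sketch of how a percentile cutoff behaves, assuming a toy series of vote counts from 1 to 100 (the names `votes`, `cutoff` and `kept` are illustrative only). `Series.quantile` interpolates linearly between neighbouring values by default, so the cutoff need not be a value that occurs in the data.

```python
import pandas as pd

votes = pd.Series(range(1, 101))   # toy vote counts 1..100
cutoff = votes.quantile(0.95)      # linear interpolation by default
kept = votes[votes >= cutoff]      # movies that would pass the cutoff

print(cutoff, len(kept))
```

On this toy series, roughly the top 5% of entries survive the filter, which is the same effect the 95th percentile has on the real vote counts.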

 * Pre-processing step: extract the year from the release date by splitting the date string on '-'.

In [10]:
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(
    # NaT/NaN never compare equal to np.nan, so test for missing values with pd.notnull
    lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
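
A minimal sketch, assuming a toy series of date strings, of an alternative way to extract the year with the `.dt.year` accessor. It also illustrates why a comparison such as `x != np.nan` cannot detect missing values: NaN is unequal to everything, including itself. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.Series(['1995-10-30', None, 'not a date'])   # toy release dates
parsed = pd.to_datetime(dates, errors='coerce')         # unparseable or missing -> NaT

# NaN is unequal to everything, itself included, so this is always True
nan_is_unequal = (np.nan != np.nan)

# .dt.year yields the year for valid dates and NaN where the date is missing
years = parsed.dt.year
print(years.tolist(), nan_is_unequal)
```

Unlike the string-splitting approach, `.dt.year` returns numbers rather than strings, so any downstream code that compares years as text would need to be adjusted.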

In [11]:
qualified = md[(md['vote_count'] >= m) & 
               (md['vote_count'].notnull()) & 
               (md['vote_average'].notnull())][['title', 
                                                'year', 
                                                'vote_count', 
                                                'vote_average', 
                                                'popularity', 
                                                'genres']]

qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

* Therefore, to qualify for the chart, a movie must have at least 434 votes on TMDB.


* We also see that the average rating for a movie on TMDB is about 5.24 on a scale of 10 (note that the code truncates ratings to integers before averaging).
    
    
* Only 2274 movies qualify for our chart.



In [12]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [13]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)
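
The row-wise `apply` above can also be written as plain column arithmetic, which pandas evaluates for all rows at once. This is a sketch on a toy frame, not the notebook's own code: the frame name `toy` is hypothetical, and the vote counts and ratings are the ones this notebook lists for the two titles.

```python
import pandas as pd

m, C = 434.0, 5.244896612406511   # values reported in this notebook

toy = pd.DataFrame({
    'title': ['The Dark Knight', 'Interstellar'],
    'vote_count': [12269, 11187],
    'vote_average': [8, 8],
})

# Same formula as weighted_rating, applied to whole columns
v, R = toy['vote_count'], toy['vote_average']
toy['wr'] = (v / (v + m)) * R + (m / (v + m)) * C

print(toy[['title', 'wr']])
```

On a frame of a few thousand rows the difference is small, but the vectorized form avoids a Python-level function call per row.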

In [14]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

In [15]:
qualified.head(15)

|  | title | year | vote_count | vote_average | popularity | genres | wr |
|---|---|---|---|---|---|---|---|
| 15480 | Inception | 2010 | 14075 | 8 | 29.108149 | [Action, Thriller, Science Fiction, Mystery, A... | 7.917588 |
| 12481 | The Dark Knight | 2008 | 12269 | 8 | 123.167259 | [Drama, Action, Crime, Thriller] | 7.905871 |
| 22879 | Interstellar | 2014 | 11187 | 8 | 32.213481 | [Adventure, Drama, Science Fiction] | 7.897107 |
| 2843 | Fight Club | 1999 | 9678 | 8 | 63.869599 | [Drama] | 7.881753 |
| 4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.070725 | [Adventure, Fantasy, Action] | 7.871787 |
| 292 | Pulp Fiction | 1994 | 8670 | 8 | 140.950236 | [Thriller, Crime] | 7.86866 |
| 314 | The Shawshank Redemption | 1994 | 8358 | 8 | 51.645403 | [Drama, Crime] | 7.864 |
| 7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.324358 | [Adventure, Fantasy, Action] | 7.861927 |
| 351 | Forrest Gump | 1994 | 8147 | 8 | 48.307194 | [Comedy, Drama, Romance] | 7.860656 |
| 5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.423537 | [Adventure, Fantasy, Action] | 7.851924 |




   * We see that three Christopher Nolan films, Inception, The Dark Knight and Interstellar, occur at the very top of our chart.

   * The chart also indicates a strong bias of TMDB users towards particular genres and directors.

   * Let us now construct a function that builds charts for particular genres.

   * For this, we relax our default cutoff to the 85th percentile instead of the 95th.



In [16]:
'''
>>> s
     a   b
one  1.  2.
two  3.  4.

>>> s.stack()
one a    1
    b    2
two a    3
    b    4
'''
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
gen_md.head(3).transpose()


|  | 0 | 0 | 0 |
|---|---|---|---|
| adult | False | False | False |
| belongs_to_collection | {'id': 10194, 'name': 'Toy Story Collection', ... | {'id': 10194, 'name': 'Toy Story Collection', ... | {'id': 10194, 'name': 'Toy Story Collection', ... |
| budget | 30000000 | 30000000 | 30000000 |
| homepage | http://toystory.disney.com/toy-story | http://toystory.disney.com/toy-story | http://toystory.disney.com/toy-story |
| id | 862 | 862 | 862 |
| imdb_id | tt0114709 | tt0114709 | tt0114709 |
| original_language | en | en | en |
| original_title | Toy Story | Toy Story | Toy Story |
| overview | Led by Woody, Andy's toys live happily in his ... | Led by Woody, Andy's toys live happily in his ... | Led by Woody, Andy's toys live happily in his ... |
| popularity | 21.946943 | 21.946943 | 21.946943 |
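
The same one-row-per-genre reshaping can be sketched with `DataFrame.explode`, which avoids building a `pd.Series` for every row. This is an alternative under stated assumptions, not the notebook's method: the toy frame, its titles and the names `toy` and `gen_toy` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': [['Animation', 'Comedy', 'Family'], ['Action', 'Crime'], []],
})

# One output row per list element; the original index label is repeated.
# An empty list becomes a single row with a missing genre.
gen_toy = toy.explode('genres').rename(columns={'genres': 'genre'})

print(gen_toy)
```

As with the `stack` and `join` version, a movie with three genres appears three times under the same index label, which is why the transposed output above shows three identical columns.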


In [17]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & 
                   (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: 
                        (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C),
                        axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified



Let us see our method in action by displaying the Top 15 Romance Movies (Romance almost didn't feature at all in our Generic Top Chart despite being one of the most popular movie genres).

Top 15 Romantic Movies


EXPERIMENTAL RESULTS AND ANALYSIS
========

In [18]:
build_chart('Romance').head(15)

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 10309 | Dilwale Dulhania Le Jayenge | 1995 | 661 | 9 | 34.457024 | 8.565285 |
| 351 | Forrest Gump | 1994 | 8147 | 8 | 48.307194 | 7.971357 |
| 876 | Vertigo | 1958 | 1162 | 8 | 18.20822 | 7.811667 |
| 40251 | Your Name. | 2016 | 1030 | 8 | 34.461252 | 7.789489 |
| 883 | Some Like It Hot | 1959 | 835 | 8 | 11.845107 | 7.745154 |
| 1132 | Cinema Paradiso | 1988 | 834 | 8 | 14.177005 | 7.744878 |
| 19901 | Paperman | 2012 | 734 | 8 | 7.198633 | 7.713951 |
| 37863 | Sing Street | 2016 | 669 | 8 | 10.672862 | 7.689483 |
| 882 | The Apartment | 1960 | 498 | 8 | 11.994281 | 7.599317 |
| 38718 | The Handmaiden | 2016 | 453 | 8 | 16.727405 | 7.566166 |


In [19]:
build_chart('Action').head(6)

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 15480 | Inception | 2010 | 14075 | 8 | 29.108149 | 7.955099 |
| 12481 | The Dark Knight | 2008 | 12269 | 8 | 123.167259 | 7.94861 |
| 4863 | The Lord of the Rings: The Fellowship of the Ring | 2001 | 8892 | 8 | 32.070725 | 7.929579 |
| 7000 | The Lord of the Rings: The Return of the King | 2003 | 8226 | 8 | 29.324358 | 7.924031 |
| 5814 | The Lord of the Rings: The Two Towers | 2002 | 7641 | 8 | 29.423537 | 7.918382 |
| 256 | Star Wars | 1977 | 6778 | 8 | 42.149697 | 7.908327 |


In [20]:
build_chart('Family').head(3) 

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 1225 | Back to the Future | 1985 | 6239 | 8 | 25.778509 | 7.893053 |
| 359 | The Lion King | 1994 | 5520 | 8 | 21.605761 | 7.879754 |
| 5481 | Spirited Away | 2001 | 3968 | 8 | 41.048867 | 7.835635 |


In [21]:
build_chart('Comedy').head(8)


|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 10309 | Dilwale Dulhania Le Jayenge | 1995 | 661 | 9 | 34.457024 | 8.463024 |
| 351 | Forrest Gump | 1994 | 8147 | 8 | 48.307194 | 7.963363 |
| 1225 | Back to the Future | 1985 | 6239 | 8 | 25.778509 | 7.952358 |
| 18465 | The Intouchables | 2011 | 5410 | 8 | 16.086919 | 7.945207 |
| 22841 | The Grand Budapest Hotel | 2014 | 4644 | 8 | 14.442048 | 7.936384 |
| 2211 | Life Is Beautiful | 1997 | 3643 | 8 | 39.39497 | 7.91943 |
| 732 | Dr. Strangelove or: How I Learned to Stop Worr... | 1964 | 1472 | 8 | 9.80398 | 7.809073 |
| 3342 | Modern Times | 1936 | 881 | 8 | 8.159556 | 7.695554 |


In [22]:
build_chart('Animation').head(12)

|  | title | year | vote_count | vote_average | popularity | wr |
|---|---|---|---|---|---|---|
| 359 | The Lion King | 1994 | 5520 | 8 | 21.605761 | 7.909339 |
| 5481 | Spirited Away | 2001 | 3968 | 8 | 41.048867 | 7.875933 |
| 9698 | Howl's Moving Castle | 2004 | 2049 | 8 | 16.136048 | 7.772103 |
| 2884 | Princess Mononoke | 1997 | 2041 | 8 | 17.166725 | 7.771305 |
| 5833 | My Neighbor Totoro | 1988 | 1730 | 8 | 13.507299 | 7.735274 |
| 40251 | Your Name. | 2016 | 1030 | 8 | 34.461252 | 7.58982 |
| 5553 | Grave of the Fireflies | 1988 | 974 | 8 | 0.010902 | 7.570962 |
| 19901 | Paperman | 2012 | 734 | 8 | 7.198633 | 7.465676 |
| 39386 | Piper | 2016 | 487 | 8 | 11.243161 | 7.285132 |
| 20779 | Wolf Children | 2012 | 483 | 8 | 10.249498 | 7.281198 |
