<a id='section_id'></a>
# Netflix Movie Recommender

## Problem Statement
The goal of this project is to build a recommender system that helps users find movies and shows that fit their preferences and tastes, so that they are more likely to maintain a subscription with Netflix. Due to user privacy issues and the unavailability of datasets containing both movie content details and user ratings, we use _two_ separate datasets to build _two_ distinct recommender systems: a **Content-Based Filtering Model** and a **Collaborative Filtering Model**.

This has been a core focus for Netflix in recent years. In an academic paper, Gomez-Uribe and Netflix's Chief Product Officer Neil Hunt assert that ['the combined effect of personalization and recommendations is estimated to save them more than \\$1 billion per year.'](https://www.businessinsider.sg/netflix-recommendation-engine-worth-1-billion-per-year-2016-6?r=US&IR=T) This is a significant saving for a company that is expected to spend more than [\\$17 billion on global content](https://variety.com/2020/digital/news/netflix-2020-content-spending-17-billion-1203469237/) in the current financial year.

### 90 Seconds of Truth
According to findings from Netflix: “Consumer research suggests that a typical Netflix member loses interest after perhaps 60 to 90 seconds of choosing, having reviewed 10 to 20 titles (perhaps 3 in detail) on one or two screens. The user either finds something of interest or the risk of the user abandoning our service increases substantially.” Furthermore, Netflix estimates that only 20% of its subscribers' video choices come from search, with the other 80% coming from recommendations. It is therefore crucial that Netflix gets its recommendation system right.

The end goal of the engine is “moments of truth,” when “a member starts a session and we help that member find something engaging within a few seconds, preventing abandonment of our service for an alternative entertainment option.”

### Summary of Notebooks
1. [Data Cleaning & Wrangling](#section_id)
2. [EDA](Netflix-EDA.ipynb#section_id2)
3. [Content Based Filtering Model](Netflix-Content_Based_Filtering_Model.ipynb#section_id3)
4. [Collaborative Filtering](Netflix-Collaborative_Filtering.ipynb#section_id4)

### Load Libraries

In [1]:
import pandas as pd
import numpy as np

from rake_nltk import Rake

Configure pandas to display the full contents of all rows and columns.

In [2]:
# None means "no limit"; passing -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

## Background of Dataset

This dataset is obtained from [Kaggle](https://www.kaggle.com/shivamb/netflix-shows) and consists of TV shows and movies available on Netflix as of 2019. The data was collected from [Flixable](https://flixable.com), a third-party Netflix search engine.

### Load Dataset

In [3]:
df = pd.read_csv('./netflix_titles.csv')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,81145628,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson","United States, India, South Korea, China","September 9, 2019",2019,TV-PG,90 min,"Children & Family Movies, Comedies","Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first."
1,80117401,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,"September 9, 2016",2016,TV-MA,94 min,Stand-Up Comedy,"Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of ""Sex on Fire"" in his comedy show."
2,70234439,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle",United States,"September 8, 2018",2013,TV-Y7-FV,1 Season,Kids' TV,"With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticons and their leader, Megatron."
3,80058654,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen",United States,"September 8, 2018",2016,TV-Y7,1 Season,Kids' TV,"When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind."
4,80125979,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis",United States,"September 8, 2017",2017,TV-14,99 min,Comedies,"When nerdy high schooler Dani finally attracts the interest of her longtime crush, she lands in the cross hairs of his ex, a social media celebrity."


The columns of the dataset consist of:
- show_id - Unique ID for every movie / TV show
- type - Identifier: a Movie or TV Show
- title - Title of the movie / TV show
- director - Director of the movie / show
- cast - Actors involved in the movie / show
- country - Country where the movie / show was produced
- date_added - Date it was added on Netflix
- release_year - Actual release year of the movie / show
- rating - TV rating of the movie / show
- duration - Total duration, in minutes or number of seasons
- listed_in - Genre
- description - The summary description

In [4]:
df.shape

(6234, 12)

The dataset provided has 6,234 rows and 12 columns.

In [5]:
df.isnull().sum()

show_id         0   
type            0   
title           0   
director        1969
cast            570 
country         476 
date_added      11  
release_year    0   
rating          10  
duration        0   
listed_in       0   
description     0   
dtype: int64

Several columns contain null values: 1,969 in _director_, 570 in _cast_, 476 in _country_, 11 in _date_added_ and 10 in _rating_. This could be because the data was unavailable on Flixable, the third-party Netflix search engine.
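
To put these counts in proportion, the short sketch below converts them into percentages of the 6,234 rows. It only reuses the figures reported above; the variable names are illustrative.

```python
# Null counts and row total as reported by df.isnull().sum() and df.shape above
total_rows = 6234
null_counts = {
    "director": 1969,
    "cast": 570,
    "country": 476,
    "date_added": 11,
    "rating": 10,
}

# Share of rows with a missing value, per column, as a percentage
null_share = {col: round(100 * n / total_rows, 1) for col, n in null_counts.items()}

for col, pct in null_share.items():
    print(f"{col:<12}{pct:>5.1f}%")
```

The _director_ column dominates: roughly a third of the rows have no director, so any cleaning step that drops rows with a missing director removes at least that share of the titles.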

In [6]:
df.dtypes

show_id         int64 
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year    int64 
rating          object
duration        object
listed_in       object
description     object
dtype: object

## Data Cleaning and Wrangling

This dataset is used to build a content-based filtering model using bag of words and a TF-IDF vectorizer, with data from the following columns: _title, director, cast, listed_in_ and _description_. We therefore clean only these columns.
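
To make the idea concrete, here is a minimal, hand-rolled sketch of TF-IDF weighting followed by cosine similarity. It is not the code used in the modelling notebook: the three titles and their word lists are made up for illustration, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter

# Hypothetical toy "bag of words" per title (not taken from the dataset)
docs = {
    "Robot Wars": "robot future dystopian conspiracy robot",
    "Robot Love": "robot future romance comedy",
    "Wedding Bells": "wedding romance comedy family",
}


def tfidf_vectors(docs):
    """Return {title: {term: tf-idf weight}} using a smoothed IDF."""
    tokenized = {title: text.split() for title, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of titles that contain each term
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for title, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[title] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = tfidf_vectors(docs)
# Compare every other title against "Robot Wars"
scores = {
    title: cosine_similarity(vectors["Robot Wars"], vec)
    for title, vec in vectors.items()
    if title != "Robot Wars"
}
best_match = max(scores, key=scores.get)
print(best_match, scores)
```

Titles that share no words score exactly zero, which is why the cleaning steps that follow try to turn every director, actor, genre and description into consistent, comparable tokens.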

In [7]:
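# .copy() makes new_df an independent DataFrame, so later edits do not
# trigger pandas' SettingWithCopyWarning
new_df = df[['title','director','cast','listed_in','description']].copy()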
new_df.head()

Unnamed: 0,title,director,cast,listed_in,description
0,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson","Children & Family Movies, Comedies","Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first."
1,Jandino: Whatever it Takes,,Jandino Asporaat,Stand-Up Comedy,"Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of ""Sex on Fire"" in his comedy show."
2,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton, Steve Blum, Andy Pessoa, Ernie Hudson, Daran Norris, Will Friedle",Kids' TV,"With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticons and their leader, Megatron."
3,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen",Kids' TV,"When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind."
4,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins, Keith Powers, Alicia Sanz, Jake Borelli, Kid Ink, Yousef Erakat, Rebekah Graf, Anne Winters, Peter Gilroy, Patrick Davis",Comedies,"When nerdy high schooler Dani finally attracts the interest of her longtime crush, she lands in the cross hairs of his ex, a social media celebrity."


As there is no reasonable way to impute the null values by inference, we drop the rows that contain null values for this project.

In [8]:
# Remove rows with NaN values
cols = ['title', 'director', 'cast', 'listed_in', 'description']
new_df = new_df.dropna()

# Remove rows where any text column is empty or contains only whitespace
blank_mask = new_df[cols].apply(lambda s: s.str.strip().eq('')).any(axis=1)
new_df = new_df.loc[~blank_mask].copy()


In [9]:
new_df.isnull().sum()

title          0
director       0
cast           0
listed_in      0
description    0
dtype: int64

No null values remained in the new dataframe after cleaning.

**Keywords Extraction Using Rake**

RAKE, short for _Rapid Automatic Keyword Extraction_, is a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.

In this case, we use RAKE to split the text into a list of words and remove stopwords from that list. This returns a list of what are known as _content words_, which are useful for building our model and analyzing the keywords within the plot description of the show / movie.
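
The sketch below illustrates the core idea in plain Python: stopwords and punctuation cut the text into candidate phrases, and each content word receives a degree based on how many words it co-occurs with. It is a simplification rather than the `rake_nltk` implementation, and the stopword list is a small made-up stand-in for NLTK's English list.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list, not NLTK's English stopwords
STOPWORDS = {"a", "an", "the", "from", "of", "to", "his", "for", "and",
             "must", "take", "back"}


def word_degrees(text, stopwords=STOPWORDS):
    """Map each content word to its degree: the summed length of the
    candidate phrases it appears in (co-occurrences, itself included)."""
    deg = defaultdict(int)
    # Split on punctuation first, so a phrase never spans a comma or full stop
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        phrase = []
        # The trailing None flushes the last phrase of the fragment
        for word in fragment.split() + [None]:
            if word is None or word in stopwords:
                for w in phrase:
                    deg[w] += len(phrase)
                phrase = []
            else:
                phrase.append(word)
    return dict(deg)


sample = "A polar bear king must take back a stolen artifact from an evil archaeologist."
degrees = word_degrees(sample)
print(list(degrees.keys()))
```

The keys of the resulting dictionary are the content words; the notebook keeps only those keys and discards the scores.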

In [10]:
def extract_key_words(description):
    # instantiating Rake; by default it uses English stopwords from NLTK
    # and discards all punctuation characters as well
    r = Rake()

    # extracting the words by passing the text
    r.extract_keywords_from_text(description)

    # get_word_degrees() returns a dictionary with key words as keys and
    # their scores as values; only the key words are kept
    return list(r.get_word_degrees().keys())

# assigning the key words to the new column for the corresponding movie
# (assigning to a column is reliable, unlike modifying rows from iterrows())
new_df['key_words'] = new_df['description'].apply(extract_key_words)

# dropping the description column
new_df = new_df.drop(columns=['description'])


In [11]:
# splitting the comma-separated names into a list, then merging first and
# last name for each actor and director (lowercase, spaces removed)
def split_and_merge(names):
    return [name.lower().replace(' ', '') for name in names.split(',')]

new_df['cast'] = new_df['cast'].map(split_and_merge)
new_df['director'] = new_df['director'].map(split_and_merge)

# converting the genres into a list of lowercase genre names
new_df['listed_in'] = new_df['listed_in'].map(lambda x: x.lower().split(','))


In [12]:
new_df.head()

Unnamed: 0,title,director,cast,listed_in,key_words
0,Norm of the North: King Sized Adventure,"[richardfinn, timmaltby]","[alanmarriott, andrewtoth, briandobson, colehoward, jennifercameron, jonathanholmes, leetockar, lisadurupt, mayakay, michaeldobson]","[children & family movies, comedies]","[grandfather, planning, polar, bear, king, must, take, back, stolen, artifact, awesome, wedding, evil, archaeologist, first]"
4,#realityhigh,[fernandolebrija],"[nestacooper, katewalsh, johnmichaelhiggins, keithpowers, aliciasanz, jakeborelli, kidink, youseferakat, rebekahgraf, annewinters, petergilroy, patrickdavis]",[comedies],"[lands, social, media, celebrity, cross, hairs, interest, longtime, crush, ex, nerdy, high, schooler, dani, finally, attracts]"
6,Automata,[gabeibáñez],"[antoniobanderas, dylanmcdermott, melaniegriffith, birgittehjortsørensen, robertforster, christacampbell, timmcinnerny, andynyman, davidryall]","[international movies, sci-fi & fantasy, thrillers]","[tech, company, investigates, robot, killed, violating, protocol, global, conspiracy, insurance, adjuster, dystopian, future, discovers]"
7,Fabrizio Copano: Solo pienso en mi,"[rodrigotoro, franciscoschultz]",[fabriziocopano],[stand-up comedy],"[family, whatsapp, groups, next, level, stand, fabrizio, copano, takes, audience, participation, set, reflecting, sperm, banks]"
9,Good People,[henrikrubengenz],"[jamesfranco, katehudson, tomwilkinson, omarsy, samspruell, annafriel, thomasarnold, oliverdimsdale, dianahardcastle, michaeljibson, diarmaidmurtagh]","[action & adventure, thrillers]","[apartment, recently, murdered, find, believe, struggling, couple, money, stash, luck, neighbor]"


In [13]:
new_df.set_index('title', inplace = True)
new_df.head()

Unnamed: 0_level_0,director,cast,listed_in,key_words
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Norm of the North: King Sized Adventure,"[richardfinn, timmaltby]","[alanmarriott, andrewtoth, briandobson, colehoward, jennifercameron, jonathanholmes, leetockar, lisadurupt, mayakay, michaeldobson]","[children & family movies, comedies]","[grandfather, planning, polar, bear, king, must, take, back, stolen, artifact, awesome, wedding, evil, archaeologist, first]"
#realityhigh,[fernandolebrija],"[nestacooper, katewalsh, johnmichaelhiggins, keithpowers, aliciasanz, jakeborelli, kidink, youseferakat, rebekahgraf, annewinters, petergilroy, patrickdavis]",[comedies],"[lands, social, media, celebrity, cross, hairs, interest, longtime, crush, ex, nerdy, high, schooler, dani, finally, attracts]"
Automata,[gabeibáñez],"[antoniobanderas, dylanmcdermott, melaniegriffith, birgittehjortsørensen, robertforster, christacampbell, timmcinnerny, andynyman, davidryall]","[international movies, sci-fi & fantasy, thrillers]","[tech, company, investigates, robot, killed, violating, protocol, global, conspiracy, insurance, adjuster, dystopian, future, discovers]"
Fabrizio Copano: Solo pienso en mi,"[rodrigotoro, franciscoschultz]",[fabriziocopano],[stand-up comedy],"[family, whatsapp, groups, next, level, stand, fabrizio, copano, takes, audience, participation, set, reflecting, sperm, banks]"
Good People,[henrikrubengenz],"[jamesfranco, katehudson, tomwilkinson, omarsy, samspruell, annafriel, thomasarnold, oliverdimsdale, dianahardcastle, michaeljibson, diarmaidmurtagh]","[action & adventure, thrillers]","[apartment, recently, murdered, find, believe, struggling, couple, money, stash, luck, neighbor]"


In [14]:
# Save cleaned dataset to pickled file
new_df.to_pickle('new_df.pkl')