# FlixRecommender

FlixRecommender is my version of Netflix's recommender system. It uses an NLP (Natural Language Processing) model and a K-Means Clustering model to group Netflix movies and TV shows by their plot description, genre, lead actor or actress, director, and the country the movie or TV show was filmed in. Users can utilize my recommender model to find movies or TV shows that are similar to their favourite films.

#  
## Packages

In [1]:
import pandas as pd
import numpy as np
#import spacy
import sklearn

#  
## Data Set

There is a total of 6234 movies / TV shows in this data set collected from [Kaggle](https://www.kaggle.com/shivamb/netflix-shows). 

Each row contains the following information: **type** (Movie or TV Show), **title**, **director**, **cast**, **country**, **rating** (ex. PG, PG-13, R, etc.), **listed_in** (genre), and plot **description**.

In [2]:
# Load in the data from csv file
netflix = pd.read_csv('netflix.csv').drop(['show_id','date_added','release_year','duration'], axis=1)
netflix_df = netflix.copy()
netflix_df.head()

Unnamed: 0,type,title,director,cast,country,rating,listed_in,description
0,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",TV-PG,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,TV-MA,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,TV-Y7-FV,Kids' TV,"With the help of three human allies, the Autob..."
3,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,TV-Y7,Kids' TV,When a prison ship crash unleashes hundreds of...
4,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,TV-14,Comedies,When nerdy high schooler Dani finally attracts...


#   
## Natural Language Processing (NLP) Model

For the NLP portion of this project, I will first convert all plot descriptions to word vectors so they can be processed by the NLP model. Then, the similarity between all word vectors will be calculated using cosine similarity (measures the angle between two vectors, resulting in a score between -1 and 1, corresponding to complete oppositses or perfectly similar vectors). Finally, I will extract the 5 movies or TV shows with the most similar plot description to a given movie or TV show.

In [3]:
# Load the large model to get the vectors
#nlp = spacy.load('en_core_web_lg')

In [4]:
# Create word vectors for all movie and TV show descriptions
#with nlp.disable_pipes():
#    vectors = np.array([nlp(film.description).vector for idx, film in netflix_df.iterrows()])

In [5]:
# Function to analyze how similar two word vectors are
def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

In [6]:
# Calculate the mean for all word vectors
#vec_mean = vectors.mean(axis=0)

# Subtract the mean from the vectors
#centered = vectors - vec_mean 

In [7]:
# Function to get the indices of the five most similar descriptions
def get_similar_description_indices(description_vec):
    
    # Calculate similarities between given description and other descriptions in the dataset
    sims = np.array([cosine_similarity(description_vec - vec_mean, vec) for vec in centered])
    
    # Get the indices of the five most similar descriptions
    most_similar_index = np.argsort(sims)[-6:-1]
    
    return most_similar_index

In [8]:
# Create array of lists containing indices of five most similar descriptions
#similar_indices = np.array([get_similar_description_indices(vec) for vec in vectors])

 
### Test NLP Model

To test my NLP model, I will look at the most similar plot descriptions to one of my favourite movies, *Catch Me If You Can*, directed by Steven Spielberg and starring Leonardo DiCaprio and Tom Hanks.

In [9]:
#test_index = netflix.index[netflix.title == "Catch Me If You Can"][0]

#print("Chosen Movie/TV Show")
#print(netflix_df.title[test_index] + ': ' + netflix_df.description[test_index] + '\n')
#print("Top Recommendations")
#print(netflix_df.title[similar_indices[test_index][4]] + ': ' + netflix_df.description[similar_indices[test_index][4]] + '\n')
#print(netflix_df.title[similar_indices[test_index][3]] + ': ' + netflix_df.description[similar_indices[test_index][3]] + '\n')
#print(netflix_df.title[similar_indices[test_index][2]] + ': ' + netflix_df.description[similar_indices[test_index][2]] + '\n')
#print(netflix_df.title[similar_indices[test_index][1]] + ': ' + netflix_df.description[similar_indices[test_index][1]] + '\n')
#print(netflix_df.title[similar_indices[test_index][0]] + ': ' + netflix_df.description[similar_indices[test_index][0]] + '\n')

As you can see, my NLP model detected a lot of similar key words like 'FBI' from the descriptions. The descriptions also appear to be semantically related as they all appear to be related to crime.

#   
#   
## K-Means Clustering Model

Prior to creating the k-means clustering model, I will perform the following data cleaning and feature engineering tasks:
- Fill missing values with most common entry for the column (ex. fill missing country entries as 'United States')
- Convert columns with multiple values in a cell to a list and only take the first value (ex. take only the lead actor or actress from each cast list)
- Encode all categorical variables

### Data Cleaning 

In [10]:
# Fill missing country entries with United States
netflix_df.fillna('Other', inplace=True)

# Change director, cast, country, and listed_in columns from type str to lists
netflix_df.director = netflix_df.director.str.split(', ').tolist()
netflix_df.cast = netflix_df.cast.str.split(', ').tolist()
netflix_df.country = netflix_df.country.str.split(', ').tolist()
netflix_df.listed_in = netflix_df.listed_in.str.split(', ').tolist()

#   
### Feature Reduction

In [11]:
# Array of all unique directors, cast members, countries, ratings, and genres
all_directors = netflix_df['director'].explode().unique()
all_cast = netflix_df['cast'].explode().unique()
all_countries = netflix_df['country'].explode().unique()
all_ratings = netflix_df['rating'].unique()
all_genres = netflix_df['listed_in'].explode().unique()

len(all_directors), len(all_cast), len(all_countries), len(all_ratings),len(all_genres)

(3656, 27406, 107, 15, 42)

In [12]:
all_titles = netflix_df['title'].unique().tolist()
all_titles.sort()

There are 3656 directors, 27406 actors / actresses, and 107 countries  in the data set which are too many features to include in a K-Means clustering model. Thus, I will reduce the number of features by only taking the primary director, lead actor/actress, and primary country for each movie or TV show. Then, I will count encode each of these features by replacing each categorical value with the number of times it appears in the dataset.

I will use one-hot encoding encoding to encode ratings and genres (listed_in) since there are only 15 ratings and 42 genres. One-hot-encoding creates new columns indicating the presence (1) or absence (0) of each possible value in the data. Since a movie or TV show can belong to more than one genre, I will use a Multi Label Binarizer for rating.

In [13]:
# Retain primary director, lead actor/actress, and primary country
for inx in range(len(netflix_df)):
    netflix_df['director'][inx] = netflix_df['director'][inx][0]
    netflix_df['cast'][inx] = netflix_df['cast'][inx][0]
    netflix_df['country'][inx] = netflix_df['country'][inx][0]
    inx += 1

In [14]:
feature_reduced_df = netflix_df.copy()
feature_reduced_df.head()

Unnamed: 0,type,title,director,cast,country,rating,listed_in,description
0,Movie,Norm of the North: King Sized Adventure,Richard Finn,Alan Marriott,United States,TV-PG,"[Children & Family Movies, Comedies]",Before planning an awesome wedding for his gra...
1,Movie,Jandino: Whatever it Takes,Other,Jandino Asporaat,United Kingdom,TV-MA,[Stand-Up Comedy],Jandino Asporaat riffs on the challenges of ra...
2,TV Show,Transformers Prime,Other,Peter Cullen,United States,TV-Y7-FV,[Kids' TV],"With the help of three human allies, the Autob..."
3,TV Show,Transformers: Robots in Disguise,Other,Will Friedle,United States,TV-Y7,[Kids' TV],When a prison ship crash unleashes hundreds of...
4,Movie,#realityhigh,Fernando Lebrija,Nesta Cooper,United States,TV-14,[Comedies],When nerdy high schooler Dani finally attracts...


#  
### Feature Engineering (Categorical Encoding)

* Use the MultiLabelBinarizer to encode the genres the movies or TV shows are listed in (each entry can belong to multiple genres)
* One-hot encode rating
* Count encode the primary director, lead actor or actress, and country of each movie / TV show

In [15]:
from sklearn.preprocessing import MultiLabelBinarizer

# Create the MultiLabelBinarizer 
mlb = MultiLabelBinarizer()

# Encode each genre and join to dataframe
mlb_df = feature_reduced_df.join(pd.DataFrame(mlb.fit_transform(feature_reduced_df.pop('listed_in')),
                                              columns=mlb.classes_,
                                              index=feature_reduced_df.index))

In [16]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to rating column
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_rating = pd.DataFrame(OH_encoder.fit_transform(mlb_df[['rating']]), columns=all_ratings)

# One-hot encoding removed index; put it back
OH_rating.index = mlb_df.index

# Add one-hot encoded columns to data frame
mlb_df = mlb_df.join(OH_rating)

In [17]:
import category_encoders as ce

# Create the count encoder
count_enc = ce.CountEncoder()

# Count encode director, cast, and country columns
count_encoded = count_enc.fit_transform(mlb_df[['director','cast','country']])

# Rename the columns with the _count suffix, and join to dataframe
netflix_encoded_df = mlb_df.join(count_encoded.add_suffix("_count"))

In [18]:
netflix_encoded_df = netflix_encoded_df.drop(['type','title','director','cast','country','rating','description'], axis=1)
netflix_encoded_df.head()

Unnamed: 0,Action & Adventure,Anime Features,Anime Series,British TV Shows,Children & Family Movies,Classic & Cult TV,Classic Movies,Comedies,Crime TV Shows,Cult Movies,...,PG-13,TV-G,PG,G,Other,UR,NC-17,director_count,cast_count,country_count
0,0,0,0,0,1,0,0,1,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1,1,2302
1,0,0,0,0,0,0,0,0,0,0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1969,2,483
2,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1969,1,2302
3,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1969,1,2302
4,0,0,0,0,0,0,0,1,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,2302


#    
### Create (K-Means) Clusters 

I will create a k-means clustering model that will group the 6234 movies/TV shows into 500 clusters. After initializing the model, cluster predictions will be made and attached to the original data frame to show each movie/TV show and the cluster they belong to.

In [19]:
from sklearn.cluster import KMeans

# Create K-Means Model
modelkmeans = KMeans(n_clusters=500, init='k-means++', n_init=10).fit(netflix_encoded_df)

# Form cluster predictions using K-Means Model
predictions = modelkmeans.predict(netflix_encoded_df)

# Convert cluster predictions to data frame
predictions_df = pd.DataFrame(predictions, columns=['cluster'])

# Attach cluster predictions to original data frame
netflix_pred = netflix.copy()
netflix_pred.insert(len(netflix.columns), column = 'cluster', value = predictions_df.cluster)

In [20]:
netflix_pred.head()

Unnamed: 0,type,title,director,cast,country,rating,listed_in,description,cluster
0,Movie,Norm of the North: King Sized Adventure,"Richard Finn, Tim Maltby","Alan Marriott, Andrew Toth, Brian Dobson, Cole...","United States, India, South Korea, China",TV-PG,"Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...,101
1,Movie,Jandino: Whatever it Takes,,Jandino Asporaat,United Kingdom,TV-MA,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...,40
2,TV Show,Transformers Prime,,"Peter Cullen, Sumalee Montano, Frank Welker, J...",United States,TV-Y7-FV,Kids' TV,"With the help of three human allies, the Autob...",178
3,TV Show,Transformers: Robots in Disguise,,"Will Friedle, Darren Criss, Constance Zimmer, ...",United States,TV-Y7,Kids' TV,When a prison ship crash unleashes hundreds of...,435
4,Movie,#realityhigh,Fernando Lebrija,"Nesta Cooper, Kate Walsh, John Michael Higgins...",United States,TV-14,Comedies,When nerdy high schooler Dani finally attracts...,420


#    
### Test K-Means Clustering Model 

I will test the k-means clustering model using one of my favourite TV shows, *Breaking Bad*.

In [21]:
# Get cluster number from given movie or TV show
cluster_num = netflix_pred[netflix_pred.title=='Breaking Bad'].cluster.item()

# View cluster the movie or TV show belongs to
netflix[netflix_pred.cluster == cluster_num]

Unnamed: 0,type,title,director,cast,country,rating,listed_in,description
278,TV Show,Unbelievable,,"Toni Collette, Merritt Wever, Kaitlyn Dever, D...",United States,TV-MA,"Crime TV Shows, TV Dramas",After a young woman is accused of lying about ...
1147,TV Show,Godless,,"Jeff Daniels, Michelle Dockery, Jack O'Connell...",United States,TV-MA,TV Dramas,A ruthless outlaw terrorizes the West in searc...
1459,TV Show,Get Shorty,,"Ray Romano, Chris O'Dowd",United States,TV-MA,"Crime TV Shows, TV Comedies, TV Dramas",Organized crime enforcer Miles Daly strives to...
2490,TV Show,Gypsy,,"Naomi Watts, Billy Crudup, Sophie Cookson, Kar...","United States, United Kingdom",TV-MA,TV Dramas,Therapist Jean Holloway develops dangerous and...
2862,TV Show,Another Life,,"Katee Sackhoff, Justin Chatwin, Samuel Anderso...",United States,TV-MA,"TV Action & Adventure, TV Dramas, TV Mysteries","After a massive alien artifact lands on Earth,..."
3817,TV Show,Unsolved,,"Josh Duhamel, Jimmi Simpson, Bokeem Woodbine",United States,TV-MA,"Crime TV Shows, TV Dramas",Ride along for a dramatized version of the rea...
4943,TV Show,Marvel's The Defenders,,"Charlie Cox, Krysten Ritter, Mike Colter, Finn...",United States,TV-MA,"Crime TV Shows, TV Action & Adventure, TV Dramas","Daredevil, Jessica Jones, Luke Cage and Iron F..."
5637,TV Show,Narcos,,"Wagner Moura, Pedro Pascal, Boyd Holbrook, Dam...","United States, Colombia, Mexico",TV-MA,"Crime TV Shows, TV Action & Adventure, TV Dramas",The true story of Colombia's infamously violen...
5671,TV Show,Marvel's Daredevil,,"Charlie Cox, Deborah Ann Woll, Elden Henson, R...",United States,TV-MA,"Crime TV Shows, TV Action & Adventure","Blinded as a young boy, Matt Murdock fights in..."
5741,TV Show,House of Cards,,"Kevin Spacey, Robin Wright, Kate Mara, Corey S...",United States,TV-MA,"TV Dramas, TV Thrillers",A ruthless politician will stop at nothing to ...


As you can see, this cluster primarily contains American Crime TV Shows and TV Dramas.

#   
## Flask App

Finally, I will create a Flask app that will allow users to input a movie or TV show and will provide recommendations based on the director, lead actor/actress, genre, and country it was produced in. Prior to creating the app, I will clean the data and reformat it.

In [22]:
# Retain top 3 actors/actresses of each film
netflix_cast3 = netflix.copy()
netflix_cast3.fillna('N/A', inplace=True)
netflix_cast3.cast = netflix_cast3.cast.str.split(', ').tolist()

for x in range(len(netflix_cast3)):
    netflix_cast3['cast'][x] = netflix_cast3['cast'][x][:3]
    x += 1
    
netflix_cast3['cast'] = netflix_cast3['cast'].agg(lambda x: ', '.join(map(str, x)))

In [23]:
# Drop unnecessary columns
drop_netflix = netflix_cast3.drop(['type','director','country','rating'], axis=1)

# Rename remaining columns
flix_df = drop_netflix.rename(columns={'title':'Title','listed_in':'Genre','cast':'Cast','description':'Description', 
                                       'cluster':'Group'})

flix_df

Unnamed: 0,Title,Cast,Genre,Description
0,Norm of the North: King Sized Adventure,"Alan Marriott, Andrew Toth, Brian Dobson","Children & Family Movies, Comedies",Before planning an awesome wedding for his gra...
1,Jandino: Whatever it Takes,Jandino Asporaat,Stand-Up Comedy,Jandino Asporaat riffs on the challenges of ra...
2,Transformers Prime,"Peter Cullen, Sumalee Montano, Frank Welker",Kids' TV,"With the help of three human allies, the Autob..."
3,Transformers: Robots in Disguise,"Will Friedle, Darren Criss, Constance Zimmer",Kids' TV,When a prison ship crash unleashes hundreds of...
4,#realityhigh,"Nesta Cooper, Kate Walsh, John Michael Higgins",Comedies,When nerdy high schooler Dani finally attracts...
...,...,...,...,...
6229,Red vs. Blue,"Burnie Burns, Jason Saldaña, Gustavo Sorola","TV Action & Adventure, TV Comedies, TV Sci-Fi ...","This parody of first-person shooter games, mil..."
6230,Maron,"Marc Maron, Judd Hirsch, Josh Brener",TV Comedies,"Marc Maron stars as Marc Maron, who interviews..."
6231,Little Baby Bum: Nursery Rhyme Friends,,Movies,Nursery rhymes and original music for children...
6232,A Young Doctor's Notebook and Other Stories,"Daniel Radcliffe, Jon Hamm, Adam Godley","British TV Shows, TV Comedies, TV Dramas","Set during the Russian Revolution, this comic ..."


In [24]:
import dash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output

import gunicorn

In [25]:
from dash import dash_table

app = dash.Dash(__name__)
server = app.server

app.layout = html.Div(style={'backgroundColor': 'white'}, children = [
    html.H1("FlixRecommender", style={'text-align': 'center', 
                                      'font-family':'trebuchet ms',
                                      'font-size':'60px',
                                      'color': 'rgb(229,9,20)',
                                      'backgroundColor': 'black',
                                      'padding':'1%',
                                      'box-shadow': '2px 5px 5px 1px rgba(255, 101, 131,0.5)'}), 
    html.H2("Favourite Movie/TV Show:", style={'text-align': 'left', 
                                               'font-family':'trebuchet ms',
                                               'font-size':'20px',
                                               'color': 'black',
                                               'padding':'1%'}),
    dcc.Dropdown(id="select_film",
                 options=[{"label": title, "value": title} for title in all_titles],
                 multi=False,
                 value="Breaking Bad",
                 style={'width': "50%", 
                        'font-size':'14px', 
                        'font-family':'trebuchet ms', 
                        'padding-left':'1%'}
                 ),
    html.Br(),
    html.Br(),
    html.H2("Recommendations", style={'text-align': 'center', 
                                       'font-family':'trebuchet ms',
                                       'font-size':'24px',
                                       'color': 'white',
                                       'backgroundColor': 'rgb(229,9,20)',
                                       'padding':'1%',
                                       'box-shadow': '2px 5px 5px 1px grey'}),
    html.Div(id='dd-output-container'),
    
])

@app.callback(
    dash.dependencies.Output('dd-output-container', 'children'),
    [dash.dependencies.Input('select_film', 'value')])


def update_output(value):
    return  dash_table.DataTable(
        id='table',
        columns=[{"name": i, "id": i} for i in flix_df.columns],
        data=flix_df[netflix_pred.cluster == netflix_pred[netflix_pred.title==value].cluster.item()].to_dict('records'),
        style_header={
                    'backgroundColor': 'rgb(229,9,20)',
                    'color': 'white',
                    'fontWeight': 'bold',
                    'font-size':'14px',
                    'font-family':'trebuchet ms',
                    'padding':'1%'},
        style_cell={
                    'textAlign': 'left',
                    'backgroundColor': 'white', 
                    'color': 'black',
                    'font-size':'13px',
                    'font-family':'trebuchet ms',
                    'padding':'1%'},
        style_data={
                    'whiteSpace': 'normal',
                    'height': 'auto'},
        style_cell_conditional=[
            {'if': {'column_id': 'Title'},
             'width': '20%'},
            {'if': {'column_id': 'Cast'},
             'width': '20%'},
            {'if': {'column_id': 'Genre'},
             'width': '20%'},
        ]
    )

# Run dashboard app
if __name__ == '__main__':
    app.run_server(debug=True, use_reloader=False)

Dash is running on http://127.0.0.1:8050/

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on


In [26]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.
