# Movie Recommendation System  
In this notebook, we build a movie recommendation system using content-based filtering. We use a Netflix dataset that lists movies and TV shows together with descriptive features such as director, cast, genre, and description.

## Import Python Packages

In [87]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Load Netflix Movies & Shows Dataset 

In [88]:
data = pd.read_csv('Data/netflix_data.csv')
data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [89]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [90]:
data.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

In [91]:
data.nunique()

show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64

In [92]:
data['date_added'] = pd.to_datetime(data['date_added'].str.strip(), format='%B %d, %Y')
data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [93]:
data.fillna('', inplace=True)
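
`fillna('')` is applied to the whole frame here. A narrower variant, sketched below on a hypothetical toy frame (the frame `toy` and the list `text_cols` are assumptions, not part of the notebook), fills only the text columns so that a datetime column keeps its missing values and its dtype.

```python
import pandas as pd

# Hypothetical toy frame with two text columns and one datetime column.
toy = pd.DataFrame({
    'director': ['Kirsten Johnson', None],
    'country': ['United States', None],
    'date_added': pd.to_datetime(['2021-09-25', None]),
})

# Fill missing values only in the text columns (chosen by hand here).
text_cols = ['director', 'country']
toy[text_cols] = toy[text_cols].fillna('')

print(toy.dtypes)
print(toy)
```

Choosing the columns explicitly keeps the fill value `''` away from numeric and datetime columns, where an empty string is not a meaningful placeholder.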

In [94]:
data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,min,25%,50%,75%,max,std
show_id,8807.0,8807.0,s1,1.0,,,,,,,
type,8807.0,2.0,Movie,6131.0,,,,,,,
title,8807.0,8807.0,Dick Johnson Is Dead,1.0,,,,,,,
director,8807.0,4529.0,,2634.0,,,,,,,
cast,8807.0,7693.0,,825.0,,,,,,,
country,8807.0,749.0,United States,2818.0,,,,,,,
date_added,8797.0,,,,2019-05-17 05:59:08.436967168,2008-01-01 00:00:00,2018-04-06 00:00:00,2019-07-02 00:00:00,2020-08-19 00:00:00,2021-09-25 00:00:00,
release_year,8807.0,,,,2014.180198,1925.0,2013.0,2017.0,2019.0,2021.0,8.819312
rating,8807.0,18.0,TV-MA,3207.0,,,,,,,
duration,8807.0,221.0,1 Season,1793.0,,,,,,,


This dataset contains 8807 movie and TV show titles released between 1925 and 2021.

## Data Visualisations

#### Distribution of Movies and TV Shows

In [95]:
movie_counts = data['release_year'].value_counts().sort_index()
fig = go.Figure(data=go.Bar(x=movie_counts.index, y=movie_counts.values))
fig.update_layout(
    plot_bgcolor='rgb(17, 17, 17)',  
    paper_bgcolor='rgb(17, 17, 17)',  
    font_color='white', 
    title='Number of Movies & TV Shows Released Each Year',  
    xaxis=dict(title='Year'),  
    yaxis=dict(title='Number of Titles')
)
fig.update_traces(marker_color='red')
fig.show()

#### Distribution of Content Types

In [96]:
movie_type_counts = data['type'].value_counts()

fig = go.Figure(data=go.Pie(labels=movie_type_counts.index, values=movie_type_counts.values))

fig.update_layout(
    plot_bgcolor='rgb(17, 17, 17)',  
    paper_bgcolor='rgb(17, 17, 17)', 
    font_color='white',  
    title='Distribution of Content Types',
)
fig.update_traces(marker=dict(colors=['red']))
fig.show()

#### Top Countries

In [97]:
# Exclude titles whose country was missing (filled with '' above)
top_countries = data.loc[data['country'] != '', 'country'].value_counts().head(10)

fig = px.treemap(names=top_countries.index, parents=["" for _ in top_countries.index], values=top_countries.values)

fig.update_layout(
    plot_bgcolor='rgb(17, 17, 17)',  
    paper_bgcolor='rgb(17, 17, 17)', 
    font_color='white',  
    title='Top Countries with Highest Number of Movies',
)
fig.show()

#### Movies Rating Distribution

In [98]:
rating_value_counts = data['rating'].value_counts()  # count each rating once
ratings       = list(rating_value_counts.index)
rating_counts = list(rating_value_counts.values)

fig = go.Figure(data=[go.Bar(
    x=ratings,
    y=rating_counts,
    marker_color='#E50914'
)])

fig.update_layout(
    title='Movie Ratings Distribution',
    xaxis_title='Rating',
    yaxis_title='Count',
    plot_bgcolor='rgba(0, 0, 0, 0)',
    paper_bgcolor='rgba(0, 0, 0, 0.7)',
    font=dict(
        color='white'
    )
)

fig.show()

## Data Preprocessing

#### Data Textual Representation

In [99]:
def create_textual_representation(row):
    """Combine the descriptive fields of one row into a single text string."""
    text_representation = f"""Type:{row['type']}, 
    Title:{row['title']},
    Director:{row['director']},
    Cast:{row['cast']},
    Released:{row['release_year']},
    Genre:{row['listed_in']},
    Description:{row['description']}"""
    return text_representation

In [100]:
data['textual representation'] = data.apply(create_textual_representation, axis=1)
print(data['textual representation'].values[0])

Type:Movie, 
    Title:Dick Johnson Is Dead,
    Director:Kirsten Johnson,
    Cast:,
    Released:2020,
    Genre:Documentaries,
    Description:As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.


#### Converting Text to Vector

In [101]:
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(data['textual representation'])
tfidf_matrix.shape

(8807, 52947)

Each of the 8807 titles is converted to a TF-IDF vector with 52947 dimensions, one per term in the vocabulary.
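
To make the vector size concrete, here is a minimal pure-Python sketch of what a TF-IDF vectorizer computes, assuming the smoothed IDF formula and L2 normalisation that scikit-learn documents as its defaults. The helper names (`tokenize`, `tfidf_vectors`) and the toy documents are hypothetical, and stop-word removal is left out for brevity, so the numbers will not match the notebook's `TfidfVectorizer(stop_words='english')` exactly.

```python
import math
import re
from collections import Counter

# Word tokens of two or more characters, the pattern documented as the
# TfidfVectorizer default (assumption: no custom tokenizer is used).
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one L2-normalised TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocabulary, vectors

# Toy documents in the same "Label:value" style as the textual representation.
docs = [
    "Type:Movie, Genre:Documentaries",
    "Type:TV Show, Genre:TV Dramas",
    "Type:Movie, Genre:Dramas",
]
vocabulary, vectors = tfidf_vectors(docs)
print(vocabulary)                     # one dimension per distinct term
print(len(vectors), len(vectors[0]))  # number of documents, vector size
```

In this toy corpus the labels `type` and `genre` occur in every document, so they receive the lowest IDF and carry less weight than rarer terms such as `documentaries`.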

#### Computing Similarities

In [102]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
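
The call above builds a dense 8807 x 8807 matrix, which is roughly 620 MB if each entry is an 8-byte float. For larger catalogues, one alternative is to compute a single row of similarities on demand. The sketch below shows this on a hypothetical toy corpus; `similarity_row`, `texts`, and `toy_matrix` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for data['textual representation'].
texts = [
    "crime drama about a detective in Paris",
    "detective crime thriller set in Paris",
    "baking competition reality show",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(texts)

def similarity_row(matrix, idx):
    """Cosine similarity between row `idx` and every row, as a 1-D array."""
    # Slicing with idx:idx + 1 keeps the query as a 2-D, single-row matrix.
    return cosine_similarity(matrix[idx:idx + 1], matrix).ravel()

scores = similarity_row(toy_matrix, 0)
print(scores.round(3))
```

This trades a small amount of computation per query for not holding the full similarity matrix in memory.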

#### Recommendation System

We are going to define a function that:
* takes a title as input
* maps the title to its row index in the dataset
* returns the 10 most similar titles according to the similarity matrix

In [103]:
indices = pd.Series(data.index, index=data['title'])
indices = indices[~indices.index.duplicated()]  # keep the first row if a title repeats

In [104]:
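def get_recommendations(title, cosine_sim=cosine_sim):
    """Return the titles of the 10 items most similar to `title`."""
    idx = indices[title]  # row index of the requested title
    # Pair every row index with its similarity to the requested title
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is normally the title itself (similarity 1)
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return data['title'].iloc[movie_indices]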

In [105]:
get_recommendations('Dick Johnson Is Dead')

5233     The Death and Life of Marsha P. Johnson
4877                                    End Game
7015                          How to Be a Player
5894                  Anjelah Johnson: Not Fancy
3927                                    New Girl
7622                                 Nowhere Boy
3717                               Triple Threat
5540                                  Win It All
6553                                    Daffedar
6398    Cabins in the Wild with Dick Strawbridge
Name: title, dtype: object

In [106]:
get_recommendations('Blood & Water')

1514              Diamond City
1593          Kings of Jo'Burg
8316       The Future of Water
218             Titletown High
108                  Dive Club
4475                  Shirkers
2184                  Get Even
1955    The School Nurse Files
5479               The Keepers
5038                   Re:Mind
Name: title, dtype: object
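
The result lists above show titles only, which makes it hard to judge how close each match is. Below is a minimal sketch of a variant that also returns the similarity scores, excludes the query by its row index rather than by sort position, and raises a clear error for an unknown title. The function `recommend_with_scores` and the toy inputs are hypothetical and shown on a small hand-written similarity matrix.

```python
import numpy as np
import pandas as pd

def recommend_with_scores(title, titles, sim, k=10):
    """Return the k titles most similar to `title`, together with their scores."""
    matches = np.flatnonzero((titles == title).to_numpy())
    if matches.size == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = int(matches[0])
    # Sort row indices from most to least similar (stable for ties).
    order = np.argsort(-sim[idx], kind='stable')
    # Remove the query itself by index, then keep the top k.
    order = order[order != idx][:k]
    return pd.DataFrame({
        'title': titles.to_numpy()[order],
        'similarity': sim[idx][order],
    })

# Toy inputs: four titles and a symmetric similarity matrix.
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])
result = recommend_with_scores('A', toy_titles, toy_sim, k=2)
print(result)
```

With the notebook's objects, a call might look like `recommend_with_scores('Blood & Water', data['title'], cosine_sim)`, assuming `data` and `cosine_sim` are defined as above.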