# **Project Name**    - Netflix Movies and TV Shows Clustering



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

This project analyzes the evolution of Netflix's content library using a dataset of TV shows and movies available on Netflix as of 2019, collected from Flixable. Since 2010, the number of TV shows on Netflix has nearly tripled, while the number of movies has decreased by more than 2,000 titles. Through Exploratory Data Analysis (EDA), visualization, data cleaning, and unsupervised machine learning algorithms, the project aims to uncover trends in content availability, genre distribution, and other key attributes. Integrating this dataset with external sources such as IMDb and Rotten Tomatoes can enrich the analysis with information on content popularity and quality. The project also employs clustering algorithms to identify content similarities and dimensionality reduction techniques to reveal hidden patterns. The intended outcome is a set of detailed insights into Netflix's content strategy, interactive dashboards for user exploration, and a comprehensive view of how Netflix content is perceived in the broader entertainment ecosystem.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes can also provide many interesting findings.

In this project, you are required to do

* Exploratory Data Analysis
* Understanding what type of content is available in different countries
* Determining whether Netflix has been increasingly focusing on TV rather than movies in recent years
* Clustering similar content by matching text-based features
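
The last task, clustering by text-based features, could look roughly like the sketch below. It is only an illustration under stated assumptions: the four-row `toy` frame and its descriptions are made up, TF-IDF plus K-Means is one possible technique rather than the method this notebook settles on, and `n_clusters=2` is an arbitrary choice for the toy data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy catalogue; the real notebook would use the dataset's text columns.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": [
        "A detective investigates a murder in the city",
        "A police detective investigates a city murder case",
        "Amateur bakers compete in a cake baking contest",
        "Two bakers enter a cake baking contest",
    ],
})

# Turn each description into a TF-IDF vector, dropping common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(toy["description"])

# Group the vectors into k clusters (k=2 is an assumption for this toy example).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy["cluster"] = kmeans.fit_predict(X)

print(toy[["title", "cluster"]])
```

Titles whose descriptions share vocabulary end up with the same cluster label; the label numbers themselves carry no meaning.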

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick
from matplotlib.pyplot import figure
import plotly.graph_objects as go
import plotly.offline as py
import plotly.express as px
from plotly.subplots import make_subplots
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Mounting google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/Almabetter/Data Science/dataset/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# copy main dataset
df1 = df.copy()

In [None]:
# Dataset Rows & Columns count
df1.shape

In [None]:
df1.columns

### Dataset Information

In [None]:
# Dataset Info
df1.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values = df1.duplicated().sum()
duplicate_values

#### Missing Values/Null Values

In [None]:
#null values
df1.isnull().sum().sum()

In [None]:
# Missing Values/Null Values Count
missing_value = df1.isnull().sum().sort_values(ascending=False).reset_index().rename(columns={'index':'Columns',0:'Missing Values'})
missing_value.head(5)

In [None]:
# Visualizing the missing values

# Define a color palette
palette = sns.color_palette("colorblind", len(missing_value))

# Create a bar plot with missing values
plt.figure(figsize=(8,6))
# Assuming 'missing_value' is a DataFrame with 'Columns' and 'Missing Values' columns
ax = sns.barplot(x='Columns', y='Missing Values', data=missing_value.head(5), palette=palette)
plt.xticks(rotation=90)
plt.xlabel('Columns')
plt.ylabel('Missing Values')
plt.title('Missing Values')

# Adding the exact values on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()



### What did you know about your dataset?

The dataset has 7787 rows and 12 columns. There are no duplicate values in the dataset.

There are 3631 missing values in total: 2389 in the director column, 718 in the cast column, 507 in the country column, 10 in the date_added column, and 7 in the rating column.
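
The per-column counts above add up to the reported total (2389 + 718 + 507 + 10 + 7 = 3631). A minimal sketch of the same consistency check, run on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df1.
toy = pd.DataFrame({
    "director": ["X", np.nan, np.nan],
    "cast": ["a, b", np.nan, "c"],
    "rating": ["PG", "R", "TV-MA"],
})

per_column = toy.isnull().sum()        # missing values per column
total_missing = int(per_column.sum())  # grand total across all columns

# The grand total must equal the sum of the per-column counts.
assert total_missing == int(toy.isnull().sum().sum())

# Share of rows affected per column, often more informative than raw counts.
missing_pct = (toy.isnull().mean() * 100).round(1)

print(per_column, total_missing, missing_pct, sep="\n\n")
```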

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.describe()

### Variables Description

* show_id : Unique ID for every movie / TV show

* type : Identifier - a Movie or TV Show

* title : Title of the movie / TV show

* director : Director of the movie / show

* cast : Actors involved in the movie / show

* country : Country where the movie / show was produced

* date_added : Date it was added on Netflix

* release_year : Actual release year of the movie / show

* rating : TV rating of the movie / show

* duration : Total duration - in minutes or number of seasons

* listed_in : Genre

* description : Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df1.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
duplicate_values = df1.duplicated().sum()
duplicate_values

In [None]:
# find out missing values
missing_values = df1.isnull().sum().sort_values(ascending=False).reset_index().rename(columns={'index':'Columns',0:'Missing Values'})
missing_values.head(5)

In [None]:
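# Replace null values. Assign the result back instead of calling fillna(inplace=True)
# on a column selection, which recent pandas versions warn about as chained assignment.
df1['cast'] = df1['cast'].fillna("No Cast")
df1['country'] = df1['country'].fillna(df1['country'].mode()[0])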


In [None]:
# The date_added and rating columns have a few rows with null values, so we drop those rows using dropna.
df1.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
# The director column is not needed, so we drop it from the dataset
df1.drop(['director'],axis=1,inplace=True)

In [None]:
# checking null values
df1.isnull().sum()

### What all manipulations have you done and insights you found?

* The dataset has no duplicate values, so no changes were needed on that front.

* The dataset has 3631 missing values in total.
* Five columns have missing values:
  * director - 2389
  * cast - 718
  * country - 507
  * date_added - 10
  * rating - 7
* Of these five columns, I dropped the director column because it is not needed for the analysis. The date_added and rating columns have only a few null values, so I dropped those rows using the dropna function.

* Missing values in the cast column are replaced by "No Cast", and missing values in the country column are replaced by the most frequent country in the dataset (the mode).
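
The manipulations listed above can be sketched on a small made-up frame as follows. The `toy` data is hypothetical; the sketch only illustrates constant fill, mode fill, and row dropping without `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same kinds of gaps as the real dataset.
toy = pd.DataFrame({
    "cast": ["a", np.nan, "b", "c"],
    "country": ["India", "United States", np.nan, "United States"],
    "date_added": ["January 1, 2020", "May 5, 2019", "June 2, 2018", None],
})

# Most frequent country, used to fill the missing country values.
most_common_country = toy["country"].mode()[0]

cleaned = (
    toy.assign(
        cast=toy["cast"].fillna("No Cast"),                  # constant fill
        country=toy["country"].fillna(most_common_country),  # mode fill
    )
    .dropna(subset=["date_added"])                           # drop rows without a date
)

print(cleaned)
```

Building a new frame with `assign` keeps the original untouched, which makes it easy to compare before and after.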

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **How many TV shows and movies are there in the dataset?**

In [None]:
df1.columns

In [None]:
# calculate tv shows and movies
tv_movie_shows = df1['type'].value_counts()
tv_movie_shows

In [None]:
# Chart - 1 visualization tv shows and movies
plt.figure(figsize=(10, 6))
plt.pie(tv_movie_shows, labels=tv_movie_shows.index, autopct='%1.1f%%', startangle=140)
plt.title('TV Shows and Movies')
plt.axis('equal')
plt.show()



##### 1. Why did you pick the specific chart?

I picked a pie chart because it gives a clear and simple view of how the titles split between the two types.

##### 2. What is/are the insight(s) found from the chart?

From the above pie chart we can see that there are more movies than TV shows.

#### **What is the distribution of release years for the content?**

In [None]:
df1.columns

In [None]:
# count titles for the 10 most recent release years (using the cleaned copy df1)
release_year_dist = df1['release_year'].value_counts().sort_index(ascending=False).head(10)
release_year_dist

In [None]:
# visualize distribution of release years for the content
# Define a color palette
palette = sns.color_palette("colorblind", len(release_year_dist))

# Create a bar plot of the number of titles per release year
plt.figure(figsize=(12, 8))
ax = sns.barplot(x=release_year_dist.index, y=release_year_dist.values, palette=palette)
plt.title('Distribution of Release Years')
plt.xlabel('Release Year')
plt.ylabel('Number of shows/movies')

# Adding the exact values on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot makes it easy to compare the number of titles across release years.

##### 2. What is/are the insight(s) found from the chart?

Among the ten most recent release years, 2018 has the highest number of titles and 2021 has the lowest.
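
Note that `value_counts().sort_index(ascending=False).head(10)` selects the ten most recent years, not the ten years with the most titles. A small sketch of the difference, using a made-up series of release years:

```python
import pandas as pd

# Hypothetical release years; 2017 is the busiest year here, 2021 the latest.
years = pd.Series([2018, 2018, 2018, 2019, 2019, 2020, 2021,
                   2017, 2017, 2017, 2017])
counts = years.value_counts()

# Latest three years, regardless of how many titles they contain.
most_recent = counts.sort_index(ascending=False).head(3)

# Three years with the most titles, regardless of how recent they are.
busiest = counts.nlargest(3)

print(most_recent, busiest, sep="\n\n")
```

Which of the two views is appropriate depends on whether the question is about recent years or about peak years.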

#### **Top 10 countries who produce the most content?**

In [None]:
# find out content by countries
content_by_countries = df1['country'].value_counts().head(10)
content_by_countries

In [None]:
# Chart - 3 visualization of content by countries

# Define a color palette
palette = sns.color_palette("colorblind", len(content_by_countries))

# Create a bar plot of the number of titles per country
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=content_by_countries.index, y=content_by_countries.values, palette=palette)
plt.title('Content by Countries')
plt.xlabel('Country')
plt.ylabel('Number of shows/movies')
plt.xticks(rotation=90)

# Adding the exact values on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='baseline', fontsize=11, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()



##### 1. Why did you pick the specific chart?

Barplot gives clear and easy to understand visualization.

##### 2. What is/are the insight(s) found from the chart?

From the above chart, the United States released more content than any other country.
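
One caveat worth checking: if a `country` entry lists several countries separated by commas (an assumption about the data, not something verified here), `value_counts()` on the raw column treats each combination as its own category. A minimal sketch of counting each country separately, on made-up values:

```python
import pandas as pd

# Hypothetical country values, some listing more than one country.
toy = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "India",
        "India, United Kingdom",
    ]
})

# Raw counts: every distinct string is its own category.
raw_counts = toy["country"].value_counts()

# Split on commas, give each country its own row, trim spaces, then count.
per_country = (
    toy["country"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(raw_counts, per_country, sep="\n\n")
```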

#### Chart - 4 : Rating distribution of TV shows and Movies

In [None]:
df1['rating']

In [None]:
# assign ratings to grouped age categories
ratings = {
    'TV-MA': 'Adults',
    'R': 'Adults',
    'PG-13': 'Teens',
    'TV-14': 'Young Adults',
    'TV-PG': 'Older Kids',
    'NR': 'Adults',
    'TV-G': 'Kids',
    'TV-Y': 'Kids',
    'TV-Y7': 'Older Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'NC-17': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'UR': 'Adults'
}

df1['target_ages'] = df1['rating'].replace(ratings)

In [None]:
# type should be categorical
df1['type'] = df1['type'].astype('category')
df1['target_ages'] = pd.Categorical(df1['target_ages'], categories = ['Kids', 'Teens', 'Young Adults', 'Adults', 'Older Kids'])

In [None]:
df1.head()

In [None]:
# create two DataFrames, one for TV shows and one for movies
tv_shows = df1[df1['type'] == 'TV Show']
movies = df1[df1['type'] == 'Movie']

In [None]:
df1

In [None]:
# rating based on rating system of all tv shows
tv_shows['rating'].value_counts()

In [None]:
# Visualize rating distribution of all tv shows
plt.figure(figsize=(10, 6))
sns.pointplot(x=tv_shows['rating'].value_counts().index, y=tv_shows['rating'].value_counts().values)
plt.title('Rating Distribution of TV Shows')
plt.xlabel('Rating')
plt.ylabel('Number of TV Shows')
plt.show()


In [None]:
# Visualize rating distribution of all movies
plt.figure(figsize=(10, 6))
sns.pointplot(x=movies['rating'].value_counts().index, y=movies['rating'].value_counts().values)
plt.title('Rating Distribution of Movies')
plt.xlabel('Rating')
plt.ylabel('Number of Movies')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a point plot to visualize the rating distribution of TV shows and movies because it effectively highlights the variability across different rating categories.

##### 2. What is/are the insight(s) found from the chart?

TV-MA is the most common rating in both cases, i.e. for TV shows as well as for movies.
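
To compare TV shows and movies on the grouped age categories rather than on raw ratings, a normalized cross-tabulation is one option. The sketch below uses a made-up frame and a shortened, hypothetical version of the rating mapping:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["TV-MA", "PG-13", "TV-MA", "TV-Y", "R"],
})

# Shortened mapping for illustration only.
age_group = {"TV-MA": "Adults", "R": "Adults", "PG-13": "Teens", "TV-Y": "Kids"}

# Unlike replace(), map() turns any rating missing from the dict into NaN,
# which makes unmapped ratings easy to spot.
toy["target_ages"] = toy["rating"].map(age_group)

# Row-wise shares: each row (type) sums to 1.
share = pd.crosstab(toy["type"], toy["target_ages"], normalize="index")

print(share)
```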

In [None]:
df1.columns

In [None]:
df1

#### Chart - 5 : What are the most common genres on Netflix?

In [None]:
# find out the top 10 genres across all titles
top_10_genres = df1['listed_in'].value_counts().head(10)
top_10_genres

In [None]:
# Chart - 5 visualization top 10 genre
palette = sns.color_palette("colorblind", len(top_10_genres))

plt.figure(figsize=(10, 6))
sns.countplot(y='listed_in', data=df1, order=df1['listed_in'].value_counts().index[:10], palette=palette)
plt.title('Top 10 Genres of shows/movies')
plt.xlabel('Number of Shows/Movies')
plt.ylabel('Genre')
plt.show()

In [None]:
# visualizing top 10 genres of tv_shows
palette = sns.color_palette("colorblind", len(top_10_genres))

plt.figure(figsize=(10, 6))
sns.countplot(y='listed_in', data=tv_shows, order=tv_shows['listed_in'].value_counts().index[:10], palette=palette)
plt.title('Top 10 Genres of TV Shows')
plt.xlabel('Number of TV Shows')
plt.ylabel('Genre')
plt.show()

In [None]:
# visualizing top 10 genres of movies
palette = sns.color_palette("colorblind", len(top_10_genres))

plt.figure(figsize=(10, 6))
sns.countplot(y='listed_in', data=movies, order=movies['listed_in'].value_counts().index[:10], palette=palette)
plt.title('Top 10 Genres of Movies')
plt.xlabel('Number of Movies')
plt.ylabel('Genre')
plt.show()

#### Chart - 9 : How does the number of TV shows and movies vary by release year?

In [None]:
# count TV shows and movies by release year
tv_shows_by_year = df1[df1['type'] == 'TV Show'].groupby('release_year').size()
movies_by_year = df1[df1['type'] == 'Movie'].groupby('release_year').size()

In [None]:
# visualize the number of TV shows and movies by release year
plt.figure(figsize=(12,6))
plt.plot(tv_shows_by_year.index, tv_shows_by_year.values, label='TV Shows')
plt.plot(movies_by_year.index, movies_by_year.values, label='Movies')
plt.title('Number of TV Shows and Movies by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Number of Shows/Movies')
plt.legend()
plt.show()

##### 2. What is/are the insight(s) found from the chart?

The graph shows a significant increase in both TV shows and movies on Netflix starting from the 2000s, with a sharp spike in content around 2015-2019. TV shows have seen rapid growth, especially in the last decade, while the number of movies, after peaking, appears to have slightly declined recently.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Netflix should continue investing in TV shows due to their rapid growth and strong engagement. Additionally, diversifying with more classic content could attract a broader audience. Monitoring the recent decline in movie additions can help maintain a balanced content library.
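
The chart above groups titles by release year. To look at additions to the catalogue instead, the share of TV shows per year added can be computed as in the sketch below. The `toy` frame is made up, and the date format `"%B %d, %Y"` (for example "January 5, 2017") is an assumption about how `date_added` is stored.

```python
import pandas as pd

# Hypothetical toy data; one date has a leading space to show why strip() helps.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "date_added": [
        "January 5, 2017", "March 9, 2017", "July 1, 2017",
        "May 2, 2019", "June 3, 2019", " August 4, 2019",
    ],
})

# Parse the date strings and extract the year each title was added.
added = pd.to_datetime(toy["date_added"].str.strip(), format="%B %d, %Y")
toy["year_added"] = added.dt.year

# Titles per year and type, then the fraction of each year's additions that are TV shows.
by_year = pd.crosstab(toy["year_added"], toy["type"])
tv_share = by_year["TV Show"] / by_year.sum(axis=1)

print(tv_share)
```

A rising `tv_share` over the years would support the claim that the catalogue is shifting toward TV shows.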

#### Chart - 10 : How are genres distributed across TV shows and movies?

In [None]:
# find out how genres are distributed across TV shows and movies
genre_counts = df1['listed_in'].value_counts()
genre_counts
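
`value_counts()` on `listed_in` does not separate TV shows from movies. One way to get a genre-by-type table is sketched below, assuming `listed_in` holds genre labels separated by `", "` (an assumption about the data) and using a made-up frame:

```python
import pandas as pd

# Hypothetical toy data with comma-separated genre labels.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "listed_in": ["Dramas, Comedies", "Dramas", "TV Dramas, Kids' TV", "Kids' TV"],
})

# One 0/1 indicator column per individual genre label.
genre_flags = toy["listed_in"].str.get_dummies(sep=", ")

# Sum the indicators within each type to get a genre-by-type count table.
genre_by_type = genre_flags.groupby(toy["type"]).sum()

print(genre_by_type)
```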

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & remove words that contain digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces
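
Steps 2 to 5 above could be combined into a single helper along the lines of the sketch below. It uses only the standard library; `clean_text` and the tiny `STOPWORDS` set are hypothetical names introduced for illustration, and a real run would likely use a fuller stop-word list. URLs are removed before punctuation here so that they are still recognizable when the URL pattern runs.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration.
STOPWORDS = {"a", "an", "the", "in", "of", "and", "to", "is"}


def clean_text(text: str) -> str:
    """Lower-case, strip URLs, digit-bearing tokens, punctuation, stop words, extra spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"\S*\d\S*", " ", text)               # drop tokens containing digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)                             # also collapses whitespace


example = clean_text(
    "In 2019, the Crew visits  https://example.com and finds a LOST city!"
)
print(example)
```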

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
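
A minimal sketch of the save, reload, and sanity-check steps, assuming the final model is a scikit-learn estimator. The K-Means model and the two-dimensional toy points below are hypothetical stand-ins, and the model is serialized to an in-memory byte string rather than a file to keep the example self-contained.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Serialize and restore; in a notebook the bytes would be written to a file.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# Sanity check: the restored model must agree with the original on unseen points.
unseen = np.array([[0.05, 0.0], [5.05, 5.0]])
same = np.array_equal(model.predict(unseen), restored.predict(unseen))

print("Predictions match after reload:", same)
```

Pickled models should only be loaded from trusted sources and with a compatible library version.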

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***