# **Project Name**    -



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine. In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

Integrating this dataset with external datasets such as IMDb ratings and Rotten Tomatoes scores could also provide many interesting findings.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Netflix, the world's leading online streaming platform, boasts over 220 million subscribers as of Q2 2022. To enhance user experience and minimize subscriber churn, it is essential to organize the shows on their platform into well-defined clusters.

By clustering shows, we can gain insights into their similarities and differences. These clusters can then be used to provide personalized show recommendations, tailored to the preferences of individual users.

The primary objective of this project is to categorize Netflix shows into distinct clusters, ensuring that shows within the same cluster are similar, while those in different clusters are significantly dissimilar.
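As a rough illustration of this objective, the sketch below clusters a handful of made-up descriptions with TF-IDF features and K-Means, then scores the result with the silhouette coefficient, which is higher when points sit close to their own cluster and far from the others. The sample texts, the choice of two clusters, and the use of scikit-learn are assumptions made for illustration only, not the approach this project settles on.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical descriptions standing in for the dataset's `description` column
toy_descriptions = [
    "A detective hunts a serial killer in a dark city",
    "A detective investigates a murder in a small town",
    "Two friends open a bakery and fall in love",
    "A baker finds love while running a small bakery",
]

# Turn each description into a TF-IDF vector
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_descriptions)

# Group the vectors into two clusters (k=2 is an arbitrary choice for the toy data)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_labels = kmeans.fit_predict(tfidf_matrix)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
toy_score = silhouette_score(tfidf_matrix, toy_labels)
print(toy_labels, round(toy_score, 3))
```

On the real data, the same score could be compared across several values of `n_clusters` to judge how well-separated the resulting groups are.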

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that, for each algorithm, the following format is answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/AlmaBetter Project/M6-Project/Netflix Movies -Tv Shows Dataset.csv'

df_netflix = pd.read_csv(file_path,index_col='show_id')



### Dataset First View

In [None]:
# Dataset First Look
df_netflix.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df_netflix.shape

The dataset contains 7787 records and 11 attributes.

### Dataset Information

In [None]:
# Dataset Info
df_netflix.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df_netflix.duplicated().value_counts()

There are no duplicated records in the dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df_netflix.isnull().sum()

The director, cast, country, date_added, and rating columns contain missing values.
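Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a count-and-percentage summary on a small made-up frame; `toy_df` and its values are hypothetical, and in the notebook the same expressions would be applied to `df_netflix`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use df_netflix instead
toy_df = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "B"],
    "rating": ["TV-MA", "PG", np.nan, "TV-14"],
    "title": ["t1", "t2", "t3", "t4"],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isnull().sum(),
    "missing_pct": (toy_df.isnull().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```

A summary like this helps decide, column by column, whether dropping rows or imputing a value is the less costly option.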

In [None]:
# Visualizing the missing values
plt.figure(figsize=(10, 6))
sns.heatmap(df_netflix.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df_netflix.columns

In [None]:
# Dataset Describe
df_netflix.describe()

### Variables Description

* show_id : Unique ID for every Movie Or TV Show
* type : Identifier - Movie Or TV Show
* title : Title of the Movie/Show
* director : Director of the Movie/Show
* cast : Actors involved in the Movie/Show
* country : Country where the Movie/Show was produced
* date_added : Date it was added on Netflix
* release_year : Actual Release year of the Movie/Show
* rating : TV Rating of the Movie/Show
* duration : Total Duration - in minutes or number of seasons
* listed_in : Genre
* description : The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df_netflix.nunique()

* The missing values in the director, cast, and country attributes can be replaced with 'Unknown'
* 10 records with missing values in the date_added column can be dropped.
* The missing values in rating can be imputed with its mode, since this attribute is discrete.

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
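# Write your code to make your dataset analysis ready.
# Handling the missing values
df_netflix.fillna({'director': 'Unknown', 'cast': 'Unknown', 'country': 'Unknown'}, inplace=True)
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original frame
df_netflix['rating'] = df_netflix['rating'].fillna(df_netflix['rating'].mode()[0])
df_netflix.dropna(subset=['date_added'], inplace=True)
df_netflix.isnull().sum()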


We have successfully handled all the missing values in the dataset.

### Typecasting

In [None]:
# Typecasting 'duration' from string to integer
df_netflix['duration'] = df_netflix['duration'].apply(lambda x: int(x.split()[0]))
df_netflix.head()

In [None]:
# Number of TV Shows
print(f"Number of TV Shows: {df_netflix[df_netflix['type'] == 'TV Show'].shape[0]}")

In [None]:
# Movie length in minutes
print(f"Movie length in minutes: {df_netflix[df_netflix['type'] == 'Movie'].duration.unique()}")


In [None]:
# datatype of duration
df_netflix.duration.dtype
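After the cast above, `duration` is a single integer column that mixes two units: minutes for movies and number of seasons for TV shows. One way to keep them apart is sketched below, starting from the raw string form of the column. The toy rows and the column names `duration_min` and `seasons` are hypothetical and not part of the original dataset.

```python
import pandas as pd

# Hypothetical rows mimicking the raw `type` and `duration` columns
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["93 min", "2 Seasons", "120 min"],
})

# Leading integer of each duration string
duration_value = toy_df["duration"].str.split().str[0].astype(int)

# Keep the two units apart; rows of the other type stay missing (NaN)
toy_df["duration_min"] = duration_value.where(toy_df["type"] == "Movie")
toy_df["seasons"] = duration_value.where(toy_df["type"] == "TV Show")
print(toy_df)
```

Separate columns avoid comparing a 2-season show with a 2-minute clip when the values are later plotted or scaled.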

In [None]:
# Typecasting 'date_added' from string to datetime
# Strip stray whitespace first so that padded date strings are not coerced to NaT
df_netflix["date_added"] = pd.to_datetime(df_netflix['date_added'].str.strip(), errors='coerce')
df_netflix.head()

In [None]:
# Adding new attributes month and year of date added
df_netflix['month_added'] = df_netflix['date_added'].dt.month
df_netflix['year_added'] = df_netflix['date_added'].dt.year
# df_netflix.drop('date_added', axis=1, inplace=True)
df_netflix.head()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Number of Movies and TV Shows in the dataset
type_counts = df_netflix['type'].value_counts()

# Create a pie chart
plt.figure(figsize=(8, 6))
colors = ['#1f77b4', '#ff7f0e']  # Define custom colors
explode = (0.1, 0)  # Explode the first (largest) slice for emphasis

plt.pie(
    type_counts,
    labels=type_counts.index,
    autopct='%1.1f%%',  # Display percentage
    startangle=140,  # Rotate for better visibility
    colors=colors,
    explode=explode,
    shadow=True,  # Add shadow for effect
    wedgeprops={'edgecolor': 'black'}  # Add border to slices
)

# Add a title
plt.title('Distribution of Movies and TV Shows in the Dataset', fontsize=14)

# Show the chart
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart effectively shows the proportional distribution of Movies and TV Shows, making it easy to compare their relative sizes.

##### 2. What is/are the insight(s) found from the chart?

There are more movies (69.14%) than TV shows (30.86%) in the dataset.
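The pie chart labels are rounded to one decimal place, so shares quoted to two decimals are easier to verify when computed directly. A minimal sketch on a made-up `type` column follows; `toy_type` is hypothetical, and in the notebook the same call would be made on `df_netflix['type']`.

```python
import pandas as pd

# Hypothetical `type` column; the real notebook would use df_netflix['type']
toy_type = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns proportions instead of raw counts
type_share = (toy_type.value_counts(normalize=True) * 100).round(2)
print(type_share)
```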

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive Business Impact:
Yes, the insights can guide content strategy, focusing on dominant categories (Movies) while improving underrepresented ones (TV Shows) to attract a diverse audience and enhance subscriber retention.
* Negative Growth:
An imbalance, such as too few TV Shows, risks alienating binge-watchers and losing viewers to competitors. Balancing content offerings ensures broader appeal and sustainable growth.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Top 10 directors in the dataset

plt.figure(figsize=(12, 5))

# Filtering out the 'Unknown' placeholder, counting, and plotting the top 10 directors
top_directors = df_netflix[df_netflix['director'] != 'Unknown']['director'].value_counts().nlargest(10)
top_directors.plot(kind='barh', color='skyblue', edgecolor='black')

# Adding titles and labels for clarity
plt.title('Top 10 Directors by Number of Shows Directed', fontsize=14, weight='bold', pad=10)
plt.xlabel('Number of Shows Directed', fontsize=10)
plt.ylabel('Directors', fontsize=10)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.gca().invert_yaxis()  # To display the highest count on top

# Annotating bar values for better readability
for index, value in enumerate(top_directors):
    plt.text(value + 0.5, index, str(value), va='center', fontsize=10)

# Displaying the chart
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to efficiently compare the top 10 directors based on the number of shows directed. This format provides clear readability for director names and their corresponding counts.

##### 2. What is/are the insight(s) found from the chart?

Raul Campos and Jan Suter, credited together as a directing pair, have 18 movies / TV shows in the dataset, more than any other director entry.
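The count above treats each value of `director` as one entry, so a jointly credited pair is counted as a single "director". To count individuals instead, the field can be split and exploded first. The sketch below assumes co-directors are separated by commas; the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical `director` values; multi-director titles are assumed comma-separated
toy_directors = pd.Series([
    "Raul Campos, Jan Suter",
    "Jan Suter",
    "Unknown",
    "Raul Campos, Jan Suter",
])

# Split on commas, give each director a row, trim spaces, and drop placeholders
individual_counts = (
    toy_directors.str.split(",")
    .explode()
    .str.strip()
    .loc[lambda s: s != "Unknown"]
    .value_counts()
)
print(individual_counts)
```

Both views are useful: the joint count reflects recurring partnerships, while the exploded count reflects each person's total output.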

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive Impact: Identifying top directors helps strengthen partnerships and ensures continued high-quality content production, which could enhance viewer retention and subscriptions.
* Negative Impact: Over-reliance on a small group of directors could lead to content fatigue, potentially reducing audience engagement if diversity is not maintained.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Top 10 countries with the highest number of movies and TV shows in the dataset

# Exclude rows where 'country' is 'Unknown', using a separate frame so that
# df_netflix keeps all of its rows for the later steps
df_country = df_netflix[df_netflix['country'] != 'Unknown']

# Plot
plt.figure(figsize=(10,5))
sns.countplot(x="country", data=df_country, hue="type", order=df_country['country'].value_counts().head(10).index, palette="viridis")
plt.title("Top 10 Countries by Movies/TV Shows Count", fontsize=14, weight='bold')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Country')
plt.ylabel('Count')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A count plot is used to show the distribution of movies and TV shows per country, allowing a quick visual comparison.

##### 2. What is/are the insight(s) found from the chart?

The United States accounts for the highest number of movies / TV shows, followed by India and the United Kingdom.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive Impact: Helps in making content localization and marketing decisions. Identifying high-content countries can optimize regional focus and partnerships.
* Negative Impact: Imbalance between movies and TV shows could limit content diversity for specific markets.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Visualizing the year in which the movie / TV show was released
plt.figure(figsize=(10,5))
sns.histplot(df_netflix['release_year'])
plt.title('Distribution by release year',fontsize=14, weight='bold')
plt.xlabel('Release year')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Over the years a greater number of shows were added in the months of October, November, December, and January.
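This observation concerns the month in which a title was added rather than its release year, so it is not something the `release_year` histogram shows directly. A minimal sketch of how additions per month could be tabulated is given below. The toy dates are hypothetical, and in the notebook the `month_added` column created earlier could be counted in the same way.

```python
import pandas as pd

# Hypothetical `date_added` values; the real notebook would use df_netflix
toy_df = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2019-10-01", "2019-12-15", "2020-01-01", "2020-10-20", "2018-06-05"]
    )
})

# Count titles per calendar month across all years, ordered January to December
titles_per_month = (
    toy_df["date_added"].dt.month.value_counts().reindex(range(1, 13), fill_value=0)
)
print(titles_per_month)
```

Calling `titles_per_month.plot(kind='bar')` on the result would give a chart that supports or refutes the seasonal pattern.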

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Top 10 genres

# Keep only the primary (first listed) genre to simplify the analysis
# Note: this overwrites 'listed_in' with a single genre per title
df_netflix['listed_in'] = df_netflix['listed_in'].apply(lambda x: x.split(',')[0])

# Plot the 10 most frequent primary genres
plt.figure(figsize=(10,5))
df_netflix.listed_in.value_counts().nlargest(10).plot(kind='bar',color='darkgreen')
plt.title('Top 10 genres',fontsize=14, weight='bold')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is well suited to comparing the 10 most frequent genres clearly.

##### 2. What is/are the insight(s) found from the chart?

Dramas is the most common primary genre, followed by comedies and documentaries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Positive Impact: Yes, it helps in focusing content strategy on popular genres, boosting engagement.
* Negative Impact: Over-representation of certain genres could reduce variety, potentially alienating niche audiences.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### Which feature selection methods have you used and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If it needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***