<a href="https://colab.research.google.com/github/sagarmanikpuri/Netflix-Machine-Learning/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  Netflix Machine Learning



##### **Project Type**    - Exploratory Data Analysis & Unsupervised Machine Learning
##### **Contribution**    - Individual

# **Project Summary -**

This project analyzes a dataset of Netflix movies and TV shows available as of 2019, collected from Flixable, a third-party Netflix search engine. The objective is to perform Exploratory Data Analysis (EDA), understand content distribution across countries, evaluate Netflix’s shift in focus from movies to TV shows, and apply unsupervised learning to cluster similar titles.

The project begins with cleaning and preparing the dataset, handling missing values, splitting multiple countries, fixing date formats, and standardizing ratings. This ensures the accuracy and consistency of all insights generated.

The EDA highlights several important patterns. Netflix hosts more movies than TV shows overall, but a year-wise trend analysis shows that TV show additions have grown significantly in recent years, while movie additions have gradually declined. This confirms that Netflix has increasingly focused on producing and acquiring TV shows, aligning with industry reports of shifting viewer preferences toward series-style content.

Country-wise analysis reveals that the United States contributes the largest share of Netflix titles, followed by countries such as India, the United Kingdom, and Japan. The dataset also shows clear variation in genre preferences across regions. For example, Indian titles are dominated by drama and romance categories, while U.S. content includes a higher proportion of documentaries and TV shows. This emphasizes Netflix’s global content strategy and its efforts to offer region-specific content.

The final component of the project involves clustering similar titles using unsupervised learning. By transforming text features such as descriptions and genres into numerical representations (using techniques like TF-IDF) and applying clustering algorithms like KMeans, the project groups shows and movies based on similarity. These clusters represent patterns such as crime thrillers, family-friendly content, romantic dramas, or documentaries. Such clustering is useful for content recommendations and understanding how different types of shows and movies relate to each other.

Overall, this project provides meaningful insights into Netflix’s content library, its global distribution, and strategic trends. It also demonstrates the use of EDA and unsupervised machine learning techniques to derive actionable findings from real-world entertainment data.

# **GitHub Link -**

# **Problem Statement**


1. Perform Exploratory Data Analysis (EDA) to understand the structure, composition, and key characteristics of Netflix titles.

2. Examine the type of content available across different countries and identify regional differences in genre, format, and production.

3. Investigate whether Netflix has been increasingly focusing on TV shows rather than movies in recent years by analyzing year-wise content addition trends.

4. Apply unsupervised learning techniques to cluster similar content by using text-based features such as descriptions and genres.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
#Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#file path
file_path = '/content/drive/My Drive/Netflix Machine Learning/NETFLIX MOVIES & TV SHOWS.xlsx'
df = pd.read_excel(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicates_count = df.duplicated().sum()
duplicates_count

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
missing = df.isnull().sum()
missing = missing[missing > 0]

plt.figure(figsize=(10, 5))
missing.plot(kind='bar')
plt.title('Missing Values Count per Column')
plt.xlabel('Columns')
plt.ylabel('Missing Count')
plt.show()

In [None]:
# Column names
string_columns = ["director", "cast", "country", "rating"]
date_col = ['date_added']

# Replace missing values in string columns with a placeholder
df[string_columns] = df[string_columns].fillna("Not provided")

#Replace missing values in date_column
df[date_col] = df[date_col].fillna(pd.Timestamp.today())

#count Missing / Null values after replace
df.isnull().sum()

### What did you know about your dataset?

The dataset contains 7,787 rows and 12 columns, with a mix of string, numeric, and date columns.

Columns include details such as show_id, type, title, and director, plus a date column called date_added.

The numeric fields include release_year.

I checked for duplicates and found 0, so no removal was needed.

Four string columns (director, cast, country, and rating) had missing values, which I replaced with "Not provided".

The date_added column also had missing values, which the code above fills with the current date (pd.Timestamp.today()).

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
#check uniques values
for column in df.columns:
    print(f"Column: {column}")
    print("Unique count:", df[column].nunique())
    print("Unique Values:", df[column].unique())
    print("--------------")

### Variables Description

show_id Data Type: String / Object

Description: A unique identifier assigned to each show or movie.

Example Values: "s1", "m35"

Notes: Used to uniquely identify each record.

type Data Type: String (Categorical)

Description: Indicates whether the content is a Movie or a TV Show.

Example Values: "Movie", "TV Show"

title Data Type: String

Description: Name of the show or movie.

Example Values: "Stranger Things", "Inception"

director Data Type: String

Description: Name of the director(s) of the show or movie.

Example Values: "Christopher Nolan", "N/A"

Notes: Often contains missing values.

cast Data Type: String

Description: Names of actors involved in the show or movie.

Example Values: "Robert Downey Jr., Chris Evans"

Notes: Can contain missing values (especially for documentaries or stand-up shows).

country Data Type: String

Description: Country where the movie/show was produced.

Example Values: "United States", "India"

Notes: May contain multiple countries.

date_added Data Type: Date / Datetime

Description: The date when the movie or show was added to the platform (e.g., Netflix).

Example Values: "2019-07-10", "2020-01-01"

Notes: Missing values are common.

release_year Data Type: Integer

Description: The year when the movie/show was originally released.

Example Values: 2017, 2020

rating Data Type: String (Categorical)

Description: Age rating or maturity level assigned to the content.

Example Values: "PG-13", "TV-MA", "R"

duration Data Type: String

Description: Duration of the movie or number of seasons for TV shows.

Example Values: "90 min", "2 Seasons"

Notes: Contains mixed text values (minutes & seasons).

listed_in Data Type: String

Description: Genre or category of the content.

Example Values: "Drama", "Comedy", "Action & Adventure"

Notes: Often includes multiple genres.

description Data Type: String

Description: A short summary or description of the show or movie.

Example Values: "A young boy disappears...", "A detective solves..."

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#1.check dataset structures
print(df.shape)
print(df.dtypes)
print(df.columns)

In [None]:
#correct data types
#convert date_added to date time
df["date_added"] = pd.to_datetime(df["date_added"], errors = 'coerce')

# Clean duration into numeric value + unit (raw strings avoid invalid-escape warnings)
df["duration_value"] = df["duration"].str.extract(r"(\d+)", expand=False).astype("float")
df["duration_unit"] = df["duration"].str.extract(r"([A-Za-z]+)", expand=False).astype(str)

# Optional: convert release_year to int (already correct)
df["release_year"] = df["release_year"].astype(int)

In [None]:
# 3. Clean text columns
# -------------------------------
string_cols = ["type", "title", "director", "cast", "country",
               "rating", "duration", "listed_in", "description"]

# Strip extra spaces
for col in string_cols:
    df[col] = df[col].astype(str).str.strip()

# Standardize formatting
df["type"] = df["type"].str.title()
df["country"] = df["country"].str.title()

In [None]:
# 5. Final Summary
# -------------------------------
print("\nFinal Dataset Info:")
print(df.info())

print("\nPreview:")
print(df.head())

### What all manipulations have you done and insights you found?

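The date_added column was converted to datetime, and duration was split into a numeric duration_value and a text duration_unit (minutes for movies, seasons for TV shows). The release_year column was kept as an integer. Leading and trailing spaces were stripped from all text columns, and type and country were converted to title case for consistent grouping.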
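
The duration split in the wrangling code above can be checked on plain strings without pandas. This is a minimal sketch: `split_duration` and the sample values are hypothetical and only illustrate the parsing rule, including one extra step (merging "Season" and "Seasons") that the notebook itself does not perform.

```python
import re

def split_duration(raw):
    """Split a duration string into (value, unit); return (None, None) if it does not parse."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", raw)
    if match is None:
        return None, None
    value, unit = match.groups()
    # Assumption: treat "Season" and "Seasons" as one unit so only two labels remain.
    unit = "season" if unit.lower().startswith("season") else unit.lower()
    return int(value), unit

# Hypothetical values in the same shape as the duration column.
samples = ["90 min", "2 Seasons", "1 Season", "unknown"]
parsed = [split_duration(s) for s in samples]
print(parsed)
```

Keeping the unit next to the number matters because minutes and seasons are not comparable, so any numeric analysis of duration_value should be done per unit.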

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
type_counts = df['type'].value_counts()

plt.figure()
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%')
plt.title('Type Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart was chosen to show the proportion of Movies vs TV Shows in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Movies dominate the platform with 69.1%, while TV Shows account for the remaining 30.9%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps in identifying content imbalance and avoiding bias in recommendation or classification models.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Group titles by release year (this is the release year, not the year added to Netflix)
year_count = df.groupby('release_year').size()

# Create the line chart
plt.figure()
plt.plot(year_count.index, year_count.values, marker='o')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.title('Number of Titles Released Per Year')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen to show how the number of titles changes across release years, since it makes a trend over ordered years easy to read.

##### 2. What is/are the insight(s) found from the chart?

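The number of titles rises sharply for recent release years, which is consistent with platform expansion. Because the chart uses release_year rather than date_added, it describes when titles were released, not when Netflix added them.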

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps identify growth phases and supports time-based analysis for demand forecasting and recommendation models.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Drop missing values
release_year_data = df['release_year'].dropna()

# Create histogram (df holds both movies and TV shows, so the axis counts all titles)
plt.hist(release_year_data, bins=20)
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.title('Distribution of Titles by Release Year')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it shows how titles are distributed across release years. Although release_year is stored as an integer, it spans many decades, so binning it reveals where titles concentrate and which periods are sparse.

##### 2. What is/are the insight(s) found from the chart?

From the histogram, the most recent release years (around 2020) dominate, with the highest number of movies and TV shows. This indicates that the catalog is concentrated in recently released content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights create a positive business impact by identifying peak release years, which helps in better content planning and engagement.

However, over-concentration in a single year (2020) may indicate uneven content distribution, posing a risk to consistent long-term growth.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Count titles by country, excluding the "Not provided" placeholder used for missing values
known_country = df['country'].str.lower() != 'not provided'
country_counts = df.loc[known_country, 'country'].value_counts().head(10)

# Create horizontal bar chart
plt.barh(country_counts.index, country_counts.values)
plt.xlabel('Number of Movies/TV Shows')
plt.ylabel('Country')
plt.title('Top 10 Countries by Content on Netflix')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen because it is suitable for visualizing the top countries producing content on Netflix, as it clearly compares categorical data and improves readability for country names.

##### 2. What is/are the insight(s) found from the chart?

USA leads in Netflix content production, followed by India and the UK.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights create a positive business impact by helping Netflix focus on top content-producing countries like the USA, India, and the UK.

However, over-dependence on a few countries may limit regional diversity and affect growth in other markets.
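
Because the country column may list several countries in one cell, counting the raw strings treats "United States, India" as its own category. A minimal sketch of per-country counting, using hypothetical rows and plain Python, could look like this:

```python
from collections import Counter

# Hypothetical rows in the same shape as the country column.
country_column = [
    "United States",
    "India",
    "United States, India",
    "United Kingdom, United States",
    "Not provided",
]

country_counts = Counter()
for entry in country_column:
    if entry == "Not provided":      # skip the missing-value placeholder
        continue
    for name in entry.split(","):
        name = name.strip()
        if name:                     # ignore empty pieces from stray commas
            country_counts[name] += 1

top_countries = country_counts.most_common(3)
print(top_countries)
```

With this approach a co-production counts once for each participating country, so the totals can exceed the number of titles.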

#### Chart - 5

In [None]:
# Chart - 5 visualization code
from collections import Counter

# Drop missing values
listed_in_data = df['listed_in'].dropna()

# Split genres and flatten the list
all_genres = []
for genres in listed_in_data:
    all_genres.extend([g.strip() for g in genres.split(',')])

# Count frequency of each genre
genre_counts = Counter(all_genres)

# Convert to dictionary for plotting
top_genres = dict(genre_counts.most_common(10))  # Top 10 genres

# Create bar chart
plt.figure(figsize=(10,6))
plt.bar(top_genres.keys(), top_genres.values(), color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Genre')
plt.ylabel('Number of Movies / TV Shows')
plt.title('Top 10 Genres on Netflix')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it clearly shows and compares the top dominating genres on Netflix.

##### 2. What is/are the insight(s) found from the chart?

The analysis shows that International Movies and Dramas are the most common genres by title count on Netflix. The dataset contains no viewing figures, so this reflects catalog size rather than viewership.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight has a positive business impact, as Netflix can see which genres dominate its catalog and plan content accordingly. However, some genres have far fewer titles, so targeted strategies may be needed to strengthen under-represented genres.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Count movies/TV shows by rating (drop missing values)
rating_counts = df['rating'].dropna().value_counts()

# Create bar chart
plt.figure(figsize=(8,5))
plt.bar(rating_counts.index, rating_counts.values)
plt.xlabel('Rating')
plt.ylabel('Number of Movies / TV Shows')
plt.title('Distribution of Content Ratings on Netflix')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it clearly visualizes and compares the distribution of content ratings.

##### 2. What is/are the insight(s) found from the chart?

The analysis shows that TV-MA, TV-14, and TV-PG account for the largest number of titles on Netflix.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights help create a positive business impact by showing that TV-MA, TV-14, and TV-PG dominate Netflix’s content. This helps Netflix focus content creation and marketing on the ratings that make up most of its catalog.

However, the insights also indicate a potential negative growth risk. Over-concentration on a few ratings may limit content for niche or younger audiences, which could reduce engagement from those segments if not addressed.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Select only Movie durations and drop missing values
movie_duration = df[df['duration'].str.contains('min', na=False)]['duration']

# Extract minutes as integer
movie_duration_minutes = movie_duration.str.replace(' min', '').astype(int)

# Create histogram
plt.hist(movie_duration_minutes, bins=20)
plt.xlabel('Duration (Minutes)')
plt.ylabel('Number of Movies')
plt.title('Distribution of Movie Durations on Netflix')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it effectively shows the distribution of movie durations on Netflix.

##### 2. What is/are the insight(s) found from the chart?

The analysis shows that movies with a duration of 85–115 minutes have the highest count, with over 1,400 titles on Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

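Yes, the insight has a positive business impact: most movies in the catalog run 85–115 minutes, which gives a reference point for runtime decisions. However, the chart counts titles rather than views, so it cannot show which durations viewers prefer; movies outside this range may still need attention to quality or presentation to increase engagement.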

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Count content by director (drop missing values)
director_counts = df['director'].dropna().value_counts().head(10)

# Create bar chart
plt.figure(figsize=(10,6))
plt.bar(director_counts.index, director_counts.values)
plt.xlabel('Director')
plt.ylabel('Number of Movies / TV Shows')
plt.title('Top 10 Directors by Content on Netflix')
plt.xticks(rotation=45, ha='right')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it clearly shows the top directors dominating content on Netflix.

##### 2. What is/are the insight(s) found from the chart?

The analysis shows that content with missing or unspecified director names is the most dominant category on Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can create a positive business impact by highlighting gaps in metadata quality. Improving director information can enhance content discovery, search accuracy, and recommendation systems on Netflix.

However, the insight also indicates a potential negative growth issue. A large amount of content without director details may reduce user trust, limit personalization, and negatively affect viewer engagement if not addressed.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Drop missing values in 'type' and 'rating'
data = df[['type', 'rating']].dropna()

# Create a cross-tab for stacking
type_rating_counts = pd.crosstab(data['type'], data['rating'])

# Plot stacked bar chart
type_rating_counts.plot(kind='bar', stacked=True, figsize=(10,6))
plt.xlabel('Type')
plt.ylabel('Number of Titles')
plt.title('Stacked Bar Chart of Type vs Rating on Netflix')
plt.xticks(rotation=0)
plt.legend(title='Rating', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart is suitable here to compare how ratings are distributed across different content types (Movie vs TV Show), giving insights into viewer demographics and content strategy.

##### 2. What is/are the insight(s) found from the chart?

The analysis indicates that the number of movies on Netflix is higher than TV shows, with over 5,000 movies compared to approximately 2,500 TV shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The positive impact is that movies make up the larger share of the catalog and require less viewing time than TV shows. The negative aspect is that the smaller TV show catalog may limit engagement, so Netflix may need to expand and improve its TV shows to attract more viewers.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
from wordcloud import WordCloud

# Combine all descriptions into one string, dropping missing values
text = " ".join(df['description'].dropna().astype(str))

# Create Word Cloud
wordcloud = WordCloud(width=800, height=300, background_color='white', colormap='viridis', stopwords=None).generate(text)

# Plot Word Cloud
plt.figure(figsize=(15,7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Netflix Descriptions', fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

A Word Cloud is used to visualize the most common words in Netflix descriptions and highlight popular content themes.

##### 2. What is/are the insight(s) found from the chart?

LIFE, LOVE, FAMILY, FRIENDS, NEW, WOMEN, TAKE, and LIVE are the most frequent words in Netflix descriptions, highlighting popular content themes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: Shows popular themes, guiding content strategy and engagement.

Negative: Over-focusing on frequent themes may neglect niche audiences, limiting growth.
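
The word cloud is a picture of word frequencies, so the same ranking can be read as numbers. The sketch below uses hypothetical descriptions and a deliberately tiny stop-word list (both are assumptions for illustration, not the notebook's data or the WordCloud library's list):

```python
import re
from collections import Counter

# Hypothetical descriptions and a small illustrative stop-word list.
descriptions = [
    "A family moves to a new town and finds love.",
    "Two friends take on a new life in the city.",
    "A woman fights for her family and her life.",
]
stop_words = {"a", "and", "the", "to", "in", "on", "for", "her", "two"}

# Lowercase, pull out alphabetic tokens, drop stop words, then count.
word_counts = Counter(
    word
    for text in descriptions
    for word in re.findall(r"[a-z']+", text.lower())
    if word not in stop_words
)
top_words = word_counts.most_common(3)
print(top_words)
```

A table like this makes it easier to compare themes precisely than judging font sizes in the cloud.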

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Statement 1: Movies are more frequent than TV Shows on Netflix.

Statement 2: Content added after 2018 has a higher average release year.

Statement 3: International movies, dramas, and comedies are the top genres on Netflix.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀: Movies ≤ TV Shows on Netflix

H₁: Movies > TV Shows on Netflix

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import chisquare

# Count Movies and TV Shows
type_counts = df['type'].value_counts()

# Observed frequencies
observed = type_counts.values

# Expected frequencies (assuming equal distribution)
expected = [observed.sum()/len(observed)] * len(observed)

# Perform Chi-Square test
chi_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

chi_stat, p_value

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value, a Chi-Square Test was performed.

##### Why did you choose the specific statistical test?

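The Chi-Square goodness-of-fit test was used because it compares observed categorical frequencies against an expected equal split. Since the p-value was less than 0.05, the hypothesis of equal frequencies was rejected; together with the higher observed count of Movies, this supports the conclusion that Movies outnumber TV Shows on Netflix.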
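
To make the test less of a black box, the statistic can be worked out by hand for two categories. This is a sketch under stated assumptions: the counts are rounded, illustrative values (not the exact dataset counts), and the p-value uses the fact that with one degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)).

```python
import math

# Illustrative counts only, rounded for the example.
observed = {"Movie": 5000, "TV Show": 2500}

total = sum(observed.values())
expected = total / len(observed)          # equal split under H0

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_stat = sum((count - expected) ** 2 / expected for count in observed.values())

# Two categories give 1 degree of freedom, where the survival
# function reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_stat / 2))

# The chi-square test has no direction; the direction comes from the counts.
movies_dominate = observed["Movie"] > observed["TV Show"] and p_value < 0.05
print(chi_stat, p_value, movies_dominate)
```

The last step is the important design point: rejecting equal frequencies only says the counts differ, and the one-sided claim follows from checking which count is larger.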

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀: μ(after 2018) ≤ μ(2018 or before)

H₁: μ(after 2018) > μ(2018 or before)

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import ttest_ind

# Convert date_added to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# Extract year when content was added
df['added_year'] = df['date_added'].dt.year

# Create two groups
after_2018 = df[df['added_year'] > 2018]['release_year'].dropna()
before_or_2018 = df[df['added_year'] <= 2018]['release_year'].dropna()

# Perform a one-tailed Welch t-test (H1: mean after 2018 is greater).
# alternative='greater' needs SciPy 1.6+; halving a two-sided p-value
# is only valid when t_stat is positive.
t_stat, p_value_one_tailed = ttest_ind(
    after_2018, before_or_2018, equal_var=False, alternative='greater'
)

t_stat, p_value_one_tailed

##### Which statistical test have you done to obtain P-Value?

A two-sample (independent) t-test was performed to compare the average release year of content added after 2018 with content added in or before 2018.

##### Why did you choose the specific statistical test?

The two-sample t-test was chosen because we are comparing the means of a numerical variable (release_year) between two independent groups (content added after 2018 vs content added in or before 2018).

The p-value is far below 0.05, so we reject H₀; content added after 2018 has a higher average release year.
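
The Welch statistic behind this test can be sketched with the standard library. The two lists are hypothetical release years, and `welch_t` is an illustrative helper rather than the SciPy implementation:

```python
import math
import statistics

# Hypothetical release years for the two groups.
added_after_2018 = [2019, 2018, 2020, 2019, 2017, 2020]
added_2018_or_before = [2015, 2016, 2012, 2017, 2010, 2016]

def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    var_a = statistics.variance(a) / len(a)   # sample variance divided by n
    var_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, dof

t_stat, dof = welch_t(added_after_2018, added_2018_or_before)
print(t_stat, dof)
```

A positive t statistic means the first group has the larger mean, which is the direction H₁ claims; a one-tailed p-value is only meaningful as support for H₁ in that case.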

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

H₀: Genre counts are uniform; top genres do not dominate

H₁: International Movies, Dramas, and Comedies occur more frequently than other genres

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import chisquare

# Split the comma-separated genre strings so each genre is counted on its own
genre_counts = df['listed_in'].str.split(',').explode().str.strip().value_counts()

# Observed counts for the three genres named in the hypothesis
observed_top = genre_counts[['International Movies', 'Dramas', 'Comedies']]
print(observed_top)

# H0: every genre is equally frequent, so the expected count is the same for all genres
observed = genre_counts.values
expected = [observed.sum() / len(observed)] * len(observed)

# Perform Chi-Square goodness-of-fit test across all genres
chi_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

chi_stat, p_value


##### Which statistical test have you done to obtain P-Value?

A Chi-Square Goodness-of-Fit test was performed to obtain the p-value for the top genres (International Movies, Dramas, and Comedies) on Netflix.

##### Why did you choose the specific statistical test?

The Chi-Square Goodness-of-Fit test was chosen because the variable under analysis (listed_in) is categorical, and the objective was to compare the observed frequency counts of the top genres against expected counts assuming no dominance.

The p-value is far below 0.05, so we reject H₀: genre counts are not uniform. The observed counts show that International Movies, Dramas, and Comedies are among the most frequent genres on Netflix.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Verify missing values before feature engineering
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Missing values were identified and handled during the Exploratory Data Analysis (EDA) phase. Before starting feature engineering, a validation check was performed using df.isnull().sum() to confirm that the dataset contained no missing values. Therefore, no additional missing value imputation techniques were required at this stage.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import numpy as np

# Select numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
numerical_cols

In [None]:
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

In [None]:
for col in numerical_cols:
    outliers = detect_outliers_iqr(df, col)
    print(f"{col}: {outliers.shape[0]} outliers")

In [None]:
def cap_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    data[column] = np.where(
        data[column] < lower_bound, lower_bound,
        np.where(data[column] > upper_bound, upper_bound, data[column])
    )

In [None]:
for col in numerical_cols:
    cap_outliers_iqr(df, col)

In [None]:
for col in numerical_cols:
    print(f"{col} outliers after capping:",
          detect_outliers_iqr(df, col).shape[0])

##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the Interquartile Range (IQR) method to identify outliers and applied capping (winsorization) instead of removing records. This approach reduces the influence of extreme values while preserving the overall dataset size, which is important for maintaining information in business datasets.
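
The capping rule can be illustrated on a short list with the standard library. This is a minimal sketch with hypothetical values; `cap_iqr` is an illustrative helper, and `method="inclusive"` is chosen because it interpolates linearly between data points, the same idea as the pandas default.

```python
import statistics

def cap_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles by linear interpolation between sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

durations = [1, 2, 3, 4, 100]        # hypothetical values with one extreme point
capped = cap_iqr(durations)
print(capped)
```

Here Q1 is 2 and Q3 is 4, so the bounds are -1 and 7: the extreme value is pulled down to the upper bound while the row itself is kept.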

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Label encoding for binary categorical column
le_type = LabelEncoder()
df['type_encoded'] = le_type.fit_transform(df['type'])

In [None]:
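rating_order = {
    'G': 1, 'PG': 2, 'PG-13': 3,
    'R': 4, 'NC-17': 5,
    'TV-Y': 1, 'TV-Y7': 2,
    'TV-G': 2, 'TV-PG': 3,
    'TV-14': 4, 'TV-MA': 5,
    'Not provided': 0  # placeholder used earlier for missing ratings
}

# Ratings outside the mapping would become NaN, so treat them as unknown (0)
df['rating_encoded'] = df['rating'].map(rating_order).fillna(0).astype(int)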

In [None]:
country_freq = df['country'].value_counts(normalize=True)
df['country_encoded'] = df['country'].map(country_freq)

In [None]:
# Number of genres per title
df['num_genres'] = df['listed_in'].apply(lambda x: len(x.split(',')))

In [None]:
# Presence indicator (has director info or not); "Not provided" is the missing-value placeholder
df['has_director'] = df['director'].apply(lambda x: 0 if x == 'Not provided' else 1)

In [None]:
df.drop(
    columns=['type', 'rating', 'country', 'listed_in', 'director'],
    inplace=True
)

In [None]:
df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

Multiple categorical encoding techniques were applied based on the nature of each feature. Label encoding was used for binary variables such as content type. Ordinal encoding was applied to rating categories to preserve their maturity-level order. Frequency encoding was used for high-cardinality features like country to avoid the curse of dimensionality. For multi-label genre information, count-based encoding was used to represent the number of genres associated with each title. This approach ensured meaningful numerical representation while maintaining model performance and interpretability.
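
Two of these encodings are easy to verify on a handful of values. The sketch below uses hypothetical rows and a partial rating mapping for illustration only; the fallback to 0 for labels missing from the mapping is an assumption, not something the dataset prescribes.

```python
from collections import Counter

# Hypothetical rows; "Not provided" is the missing-value placeholder used earlier.
countries = ["United States", "India", "United States", "Japan"]
ratings = ["TV-MA", "PG", "Not provided", "NR"]

# Frequency encoding: each category is replaced by its share of the rows.
counts = Counter(countries)
country_encoded = [counts[c] / len(countries) for c in countries]

# Ordinal encoding: partial mapping; labels not in the mapping fall back to 0.
rating_order = {"PG": 2, "TV-MA": 5, "Not provided": 0}
rating_encoded = [rating_order.get(r, 0) for r in ratings]

print(country_encoded)
print(rating_encoded)
```

Frequency encoding keeps a single numeric column however many countries appear, at the cost that two countries with the same share become indistinguishable.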

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
import re

contractions = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
    "haven't": "have not",
    "hasn't": "has not",
    "hadn't": "had not",
    "couldn't": "could not",
    "shouldn't": "should not",
    "wouldn't": "would not",
    "it's": "it is",
    "i'm": "i am",
    "they're": "they are",
    "we're": "we are",
    "that's": "that is",
    "there's": "there is",
    "what's": "what is",
    "who's": "who is"
}

In [None]:
def expand_contractions(text):
    text = text.lower()
    for contraction, expanded in contractions.items():
        text = re.sub(r"\b" + contraction + r"\b", expanded, text)
    return text

In [None]:
df['description'] = df['description'].apply(expand_contractions)

#### 2. Lower Casing

In [None]:
# Lower Casing
df['description'] = df['description'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
df['description'] = df['description'].apply(
    lambda x: re.sub(r'[^\w\s]', '', x)
)

#### 4. Removing URLs & Removing Words That Contain Digits

In [None]:
# Remove URLs
df['description'] = df['description'].apply(
    lambda x: re.sub(r'http\S+|www\S+', ' ', x)
)
# Remove words that contain digits (extra spaces are collapsed in a later step)
df['description'] = df['description'].apply(
    lambda x: re.sub(r'\w*\d\w*', ' ', x)
)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

# Download stopwords (run once)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

In [None]:
df['description'] = df['description'].apply(
    lambda x: ' '.join(
        word for word in x.split() if word not in stop_words
    )
)

In [None]:
# Remove White spaces
# Remove extra spaces
df['description'] = df['description'].str.strip().str.replace(r'\s+', ' ', regex=True)

#### 6. Rephrase Text

In [None]:
# Rephrase Text
from nltk.corpus import wordnet
import random

# Download resources (run once)
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
def rephrase_text(text, replace_prob=0.3):
    words = text.split()
    new_words = []

    for word in words:
        synonyms = wordnet.synsets(word)

        # Replace word with synonym with some probability
        if synonyms and random.random() < replace_prob:
            synonym = synonyms[0].lemmas()[0].name()
            new_words.append(synonym.replace('_', ' '))
        else:
            new_words.append(word)

    return ' '.join(new_words)

In [None]:
df['description_rephrased'] = df['description'].apply(rephrase_text)

#### 7. Tokenization

In [None]:
# Tokenization
# Tokenization using split
df['description_tokens'] = df['description'].apply(lambda x: x.split())

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required resources (run once)
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
stemmer = PorterStemmer()

df['description_stemmed'] = df['description'].apply(
    lambda x: ' '.join(stemmer.stem(word) for word in x.split())
)

In [None]:
lemmatizer = WordNetLemmatizer()

df['description_lemmatized'] = df['description'].apply(
    lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split())
)

##### Which text normalization technique have you used and why?

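Both Porter stemming and WordNet lemmatization were applied, each stored in its own column (description_stemmed and description_lemmatized). Stemming strips suffixes by rule and can produce non-words, while lemmatization maps each word to a dictionary form. Note that the TF-IDF vectorization in the next step is fitted on the cleaned description column, not on either normalized column.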

#### 9. Part of speech tagging

In [None]:
import nltk

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def pos_tag_text(text):
    tokens = word_tokenize(text)
    return pos_tag(tokens)

In [None]:
df['description_pos_tags'] = df['description'].apply(pos_tag_text)

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=5000,     # limit features
    ngram_range=(1, 2),    # unigrams + bigrams
    stop_words='english'
)

In [None]:
X_tfidf = tfidf.fit_transform(df['description'])

In [None]:
tfidf_df = pd.DataFrame(
    X_tfidf.toarray(),
    columns=tfidf.get_feature_names_out()
)

##### Which text vectorization technique have you used and why?

I used the TF-IDF vectorization technique to transform textual descriptions into numerical feature vectors. TF-IDF was chosen because it highlights important words that are specific to a document while down-weighting frequently occurring words, making it more effective than simple word counts for text-based machine learning models.
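The down-weighting effect described above can be seen on three invented one-line descriptions. This is a minimal sketch using scikit-learn's default `TfidfVectorizer` settings, not the configuration fitted in this notebook; the sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented descriptions; "detective" appears in two of them
docs = [
    "a detective hunts a killer",
    "a detective falls in love",
    "a family adopts a dragon",
]

vec = TfidfVectorizer()          # default settings drop one-letter tokens like "a"
X = vec.fit_transform(docs)      # sparse matrix, one row per description

# Compare two words inside the first description
row0 = X[0].toarray().ravel()
w_detective = row0[vec.vocabulary_["detective"]]
w_killer = row0[vec.vocabulary_["killer"]]

print(f"detective: {w_detective:.3f}, killer: {w_killer:.3f}")
```

Both words occur once in the first description, but `detective` also occurs in the second one, so it receives a lower inverse document frequency and therefore a lower weight than `killer`.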

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
import pandas as pd

# -------------------------------
# 1. Convert date column properly
# -------------------------------
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# -------------------------------
# 2. Create new features
# -------------------------------

# Month when content was added
df['added_month'] = df['date_added'].dt.month

# Age of content at the time it was added
df['content_age'] = df['added_year'] - df['release_year']

# Duration in minutes (Movies only); other rows become NaN
df['duration_minutes'] = df['duration_value'].where(df['duration_unit'] == 'min')

# Number of seasons (TV Shows only); other rows become NaN
# startswith covers both 'Season' and 'Seasons' in case the unit was not normalized
df['num_seasons'] = df['duration_value'].where(
    df['duration_unit'].astype(str).str.startswith('Season')
)

# -------------------------------
# 3. Final numeric columns
# -------------------------------
numeric_cols = [
    'added_year',
    'added_month',
    'release_year',
    'content_age',
    'duration_value',
    'duration_minutes',
    'num_seasons',
    'num_genres',
    'type_encoded',
    'rating_encoded',
    'country_encoded',
    'has_director'
]

# -------------------------------
# 4. Ensure correct numeric types
# -------------------------------
existing_numeric_cols = [col for col in numeric_cols if col in df.columns]

df[existing_numeric_cols] = df[existing_numeric_cols].apply(
    pd.to_numeric, errors='coerce'
)

# -------------------------------
# 5. Quick verification
# -------------------------------
df[existing_numeric_cols].info()


In [None]:
import numpy as np

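# Select numerical features only (use the columns confirmed to exist above,
# so a missing column does not raise a KeyError)
num_df = df[existing_numeric_cols]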

# Compute absolute correlation matrix
corr_matrix = num_df.corr().abs()

# Upper triangle of correlation matrix
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Identify highly correlated features (> 0.85)
to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]

# Drop highly correlated features
df.drop(columns=to_drop, inplace=True)

In [None]:
df.columns

#### 2. Feature Selection

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# =====================================
# 1. Create a unified duration feature
#    (safe for Movies + TV Shows)
#    Movies keep their minutes; every other row is treated as seasons
#    and converted with a rough 60-minutes-per-season factor
# =====================================
df['normalized_duration'] = df['duration_value'].where(
    df['duration_unit'] == 'min',
    df['duration_value'] * 60
)

# =====================================
# 2. Select features for clustering
# =====================================
selected_features = [
    'normalized_duration',
    'num_genres',
    'added_year',
    'added_month'
]

X = df[selected_features]

# =====================================
# 3. Handle missing values
# =====================================
X = X.fillna(X.median())

# =====================================
# 4. Scale features (required for clustering)
# =====================================
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.01)
X_var = vt.fit_transform(X)

selected_var_features = X.columns[vt.get_support()]
# Subset the imputed X instead of re-reading df, so the median fill is kept
X = X[selected_var_features]

In [None]:
import numpy as np

corr_matrix = X.corr().abs()

upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

to_drop = [col for col in upper.columns if any(upper[col] > 0.85)]

X_selected = X.drop(columns=to_drop)

In [None]:
X_selected.columns

##### What all feature selection methods have you used  and why?

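Two filter methods were used: a variance threshold (VarianceThreshold with threshold=0.01) to remove near-constant features, and a pairwise correlation filter that drops one feature from any pair whose absolute correlation exceeds 0.85. Both methods are unsupervised, which suits a clustering task with no target variable.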

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
from sklearn.preprocessing import StandardScaler

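# Numeric features selected for clustering
numeric_features = [
    'normalized_duration',
    'num_seasons',
    'num_genres',
    'added_year',
    'added_month'
]

# Keep only columns still present in df; the correlation filter applied
# earlier may have dropped some of them (for example num_seasons)
numeric_features = [col for col in numeric_features if col in df.columns]

X_numeric = df[numeric_features]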

# Standardize features
scaler = StandardScaler()
X_numeric_scaled = scaler.fit_transform(X_numeric)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=3000,
    stop_words='english'
)

X_text = tfidf.fit_transform(df['description'])

In [None]:
from scipy.sparse import hstack

X_final = hstack([X_numeric_scaled, X_text])

Data transformation was performed by scaling numerical features using standardization and converting textual descriptions into numerical vectors using TF-IDF. The transformed features were combined to create a unified feature space for clustering analysis.

### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

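numeric_features = [
    'normalized_duration',
    'num_seasons',
    'num_genres',
    'added_year',
    'added_month'
]

# Keep only columns still present in df; the correlation filter applied
# earlier may have dropped some of them (for example num_seasons)
numeric_features = [col for col in numeric_features if col in df.columns]

X_numeric = df[numeric_features]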

In [None]:
scaler = StandardScaler()
X_numeric_scaled = scaler.fit_transform(X_numeric)

In [None]:
X_numeric_scaled_df = pd.DataFrame(
    X_numeric_scaled,
    columns=numeric_features
)

##### Which method have you used to scale your data and why?

StandardScaler was used for data scaling as the project involves KMeans clustering, which is sensitive to feature magnitudes. Standardization ensures that all numerical features contribute equally to distance calculations, preventing features with larger ranges from dominating the clustering process.
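The effect of feature magnitude on distances can be illustrated with four invented titles described by two features. This is a minimal sketch with made-up numbers; the column meaning (duration in minutes, number of genres) is borrowed from this notebook only for readability.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows: [duration in minutes, number of genres]
X_toy = np.array([
    [90.0, 1.0],
    [95.0, 3.0],
    [180.0, 1.0],
    [175.0, 3.0],
])

# Raw Euclidean distances: duration dominates because its range is far larger
d_raw_01 = np.linalg.norm(X_toy[0] - X_toy[1])   # differ mainly in genres
d_raw_02 = np.linalg.norm(X_toy[0] - X_toy[2])   # differ only in duration

# After standardization each column has mean 0 and unit variance
X_std = StandardScaler().fit_transform(X_toy)
d_std_01 = np.linalg.norm(X_std[0] - X_std[1])
d_std_02 = np.linalg.norm(X_std[0] - X_std[2])

print(f"raw:    d(0,1)={d_raw_01:.2f}  d(0,2)={d_raw_02:.2f}")
print(f"scaled: d(0,1)={d_std_01:.2f}  d(0,2)={d_std_02:.2f}")
```

Before scaling, the pair that differs in duration is more than ten times farther apart than the pair that differs in genre count; after scaling the two distances are of similar size, so both features can influence the cluster assignment.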

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Dimensionality reduction was applied because text vectorization produced a high-dimensional feature space, which can negatively affect clustering performance and interpretability.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

# =====================================
# 1. Impute missing values
# =====================================
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_final)

# =====================================
# 2. Scale features (SPARSE-SAFE)
# =====================================
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_imputed)

# =====================================
# 3. Dimensionality Reduction
# =====================================
svd = TruncatedSVD(
    n_components=min(100, X_scaled.shape[1]),
    random_state=42
)

X_reduced = svd.fit_transform(X_scaled)

In [None]:
explained_variance = svd.explained_variance_ratio_.sum()
print(f"Explained Variance: {explained_variance:.2f}")

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Dimensionality reduction was performed using TruncatedSVD to reduce the high-dimensional TF-IDF feature space while preserving the most important variance, improving clustering efficiency and interpretability.
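One way to check how many components are worth keeping is to look at the cumulative explained variance. The sketch below is an illustration on a synthetic sparse matrix, not on this notebook's features; the 50% target and the matrix size are arbitrary assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for a TF-IDF matrix (not real data)
X_sparse = sparse_random(200, 50, density=0.1, random_state=0, format="csr")

svd = TruncatedSVD(n_components=30, random_state=42)
svd.fit(X_sparse)

# Running total of the variance explained by the first 1, 2, ... components
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components reaching the (arbitrary) target share;
# capped at the number of fitted components if the target is never reached
target = 0.5
k = min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))

print(f"{k} components explain {cumulative[k - 1]:.2f} of the variance")
```

On real TF-IDF features the curve usually rises slowly, so the printed share for 100 components is a useful sanity check on how much information the reduced space retains.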

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test = train_test_split(
    X_reduced,
    test_size=0.2,
    random_state=42
)

In [None]:
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

##### What data splitting ratio have you used and why?

Although the project involves unsupervised learning, the dataset was split into training and testing subsets (80:20) to assess the stability and consistency of clustering results.
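A held-out set can be used to check stability roughly as follows: fit the same algorithm on two disjoint parts of the training data, assign the held-out points with both models, and compare the two assignments. The sketch below uses synthetic blobs and the adjusted Rand index; the data, the number of clusters, and the choice of index are assumptions for illustration, not results from this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Synthetic, well-separated data (not the Netflix features)
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)

# 80:20 split, then cut the training part into two disjoint halves
X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=42)
half_a, half_b = train_test_split(X_tr, test_size=0.5, random_state=0)

# Fit one model per half
km_a = KMeans(n_clusters=4, n_init=10, random_state=0).fit(half_a)
km_b = KMeans(n_clusters=4, n_init=10, random_state=1).fit(half_b)

# Agreement of the two models on the held-out points
# (the adjusted Rand index ignores how cluster ids are numbered)
ari = adjusted_rand_score(km_a.predict(X_te), km_b.predict(X_te))
print(f"Adjusted Rand index on held-out data: {ari:.2f}")
```

A value close to 1 means both models partition the held-out titles the same way; values near 0 suggest the clusters depend heavily on which rows were used for fitting.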

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Since the project involves exploratory analysis and unsupervised learning without a predefined target variable, class imbalance handling techniques were not applicable.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - KMeans Clustering

from sklearn.cluster import KMeans

# Fit the Algorithm
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X_train)

# Predict on the model
train_kmeans_labels = kmeans.predict(X_train)
test_kmeans_labels = kmeans.predict(X_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import silhouette_score

silhouette_kmeans = silhouette_score(X_train, train_kmeans_labels)
silhouette_kmeans

In [None]:
import matplotlib.pyplot as plt

models = ['KMeans']
scores = [silhouette_kmeans]

plt.figure(figsize=(5,4))
plt.bar(models, scores)
plt.ylabel('Silhouette Score')
plt.title('Evaluation Metric Score for KMeans Model')
plt.ylim(0, 1)
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (Grid Search)

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Fit the Algorithm
param_grid = {
    'n_clusters': [3, 4, 5, 6, 7],
    'init': ['k-means++', 'random'],
    'n_init': [10, 20]
}

best_score = -1
best_params = None

for params in ParameterGrid(param_grid):
    kmeans = KMeans(
        n_clusters=params['n_clusters'],
        init=params['init'],
        n_init=params['n_init'],
        random_state=42
    )

    labels = kmeans.fit_predict(X_train)
    score = silhouette_score(X_train, labels)

    if score > best_score:
        best_score = score
        best_params = params

# Train final optimized model
kmeans_optimized = KMeans(
    n_clusters=best_params['n_clusters'],
    init=best_params['init'],
    n_init=best_params['n_init'],
    random_state=42
)

kmeans_optimized.fit(X_train)

# Predict on the model
train_kmeans_labels = kmeans_optimized.predict(X_train)
test_kmeans_labels = kmeans_optimized.predict(X_test)

##### Which hyperparameter optimization technique have you used and why?

Grid Search was used to optimize KMeans hyperparameters by systematically evaluating different parameter combinations. Since the project is unsupervised, the Silhouette Score was used as the evaluation metric to select the optimal parameters that produced the best cluster separation.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

After applying hyperparameter optimization, a slight improvement in the Silhouette Score was observed for the KMeans model. This indicates better-defined clusters compared to the baseline model. Only minor improvements are expected in high-dimensional text-based clustering tasks, where themes overlap across titles.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation (Agglomerative Clustering)
from sklearn.cluster import AgglomerativeClustering

# Fit the Algorithm
agg_model = AgglomerativeClustering(
    n_clusters=5,
    linkage='ward'
)

train_agg_labels = agg_model.fit_predict(X_train)

In [None]:
from sklearn.metrics import silhouette_score

silhouette_agg = silhouette_score(X_train, train_agg_labels)
silhouette_agg

In [None]:
import matplotlib.pyplot as plt

models = ['Agglomerative Clustering']
scores = [silhouette_agg]

plt.figure(figsize=(5,4))
plt.bar(models, scores)
plt.ylabel('Silhouette Score')
plt.title('Evaluation Metric Score for Agglomerative Clustering Model')
plt.ylim(0, 1)
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (Grid Search)

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Fit the Algorithm
best_score = -1
best_params = None

for n_clusters in [3, 4, 5, 6, 7]:
    for linkage in ['ward', 'complete', 'average']:

        # Euclidean distance is the default metric for every linkage and the
        # only one ward supports, so it does not need to be passed explicitly
        model = AgglomerativeClustering(
            n_clusters=n_clusters,
            linkage=linkage
        )

        labels = model.fit_predict(X_train)
        score = silhouette_score(X_train, labels)

        if score > best_score:
            best_score = score
            best_params = {
                'n_clusters': n_clusters,
                'linkage': linkage
            }

# Train final optimized model
agg_optimized = AgglomerativeClustering(
    n_clusters=best_params['n_clusters'],
    linkage=best_params['linkage']
)

train_agg_labels = agg_optimized.fit_predict(X_train)

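# Cluster the test set
# (Agglomerative clustering has no predict method, so a fresh copy of the
# model is fitted on X_test; its label ids are not comparable with the
# training labels, and the model fitted on X_train is left untouched)
from sklearn.base import clone
test_agg_labels = clone(agg_optimized).fit_predict(X_test)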

##### Which hyperparameter optimization technique have you used and why?

Grid Search was used as the hyperparameter optimization technique to systematically evaluate different combinations of clustering parameters. Since the project is unsupervised, traditional methods like GridSearchCV could not be applied. Instead, the Silhouette Score was used as an internal validation metric to select the optimal hyperparameters that produced well-separated and compact clusters.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

The Agglomerative clustering model produced a low silhouette score, indicating overlapping clusters. This behavior is expected in high-dimensional text-based datasets. Compared to KMeans, Agglomerative clustering showed weaker cluster separation, highlighting its limitations for large-scale NLP data.
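Because AgglomerativeClustering has no predict method, unseen titles cannot be assigned directly. One common workaround, shown here as a sketch rather than as the approach taken in this notebook, is to fit a nearest-centroid classifier on the cluster labels and use it to assign new points. The synthetic blobs and the 240/60 split are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestCentroid

# Synthetic data standing in for the reduced feature matrix
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X_fit, X_new = X_demo[:240], X_demo[240:]

# Cluster the data that is available at fitting time
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
fit_labels = agg.fit_predict(X_fit)

# Treat the cluster ids as classes and assign new points to the closest centroid
assigner = NearestCentroid().fit(X_fit, fit_labels)
new_labels = assigner.predict(X_new)

print(new_labels[:10])
```

With this approach the labels of new points share the same ids as the fitted clusters, which is not the case when the clustering is simply re-run on the new data.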

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

The silhouette score was used to evaluate clustering quality by measuring cluster cohesion and separation. Although the scores were relatively low, this is expected in text-based content clustering where themes overlap. The KMeans model provided the most stable and interpretable clusters, enabling Netflix to group similar content, improve recommendations, and support content strategy decisions. Overall, the clustering models help Netflix enhance personalization, content discovery, and user engagement.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation (DBSCAN)

from sklearn.cluster import DBSCAN

# Fit the Algorithm
dbscan_model = DBSCAN(
    eps=0.5,
    min_samples=10
)

train_dbscan_labels = dbscan_model.fit_predict(X_train)

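# Cluster the test set
# (DBSCAN has no predict method, so a fresh copy of the model is fitted on
# X_test; its label ids are not comparable with the training labels)
from sklearn.base import clone
test_dbscan_labels = clone(dbscan_model).fit_predict(X_test)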

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import silhouette_score
import numpy as np

# Remove noise points (-1 label) for evaluation
mask = train_dbscan_labels != -1

if len(set(train_dbscan_labels[mask])) > 1:
    silhouette_dbscan = silhouette_score(
        X_train[mask],
        train_dbscan_labels[mask]
    )
else:
    silhouette_dbscan = None

silhouette_dbscan

In [None]:
import matplotlib.pyplot as plt

models = ['DBSCAN']
scores = [silhouette_dbscan if silhouette_dbscan is not None else 0]

plt.figure(figsize=(5,4))
plt.bar(models, scores)
plt.ylabel('Silhouette Score')
plt.title('Evaluation Metric Score for DBSCAN Model')
plt.ylim(0, 1)
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (Grid Search)

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import numpy as np

best_score = -1
best_params = None

eps_values = [0.3, 0.5, 0.7, 1.0]
min_samples_values = [5, 10, 15]

# Fit the Algorithm
for eps in eps_values:
    for min_samples in min_samples_values:

        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_train)

        # Remove noise points
        mask = labels != -1

        if len(set(labels[mask])) > 1:
            score = silhouette_score(X_train[mask], labels[mask])

            if score > best_score:
                best_score = score
                best_params = {
                    'eps': eps,
                    'min_samples': min_samples
                }

# Fallback if no valid parameters found
if best_params is None:
    best_params = {
        'eps': 0.5,
        'min_samples': 10
    }

# Train final DBSCAN model
dbscan_optimized = DBSCAN(
    eps=best_params['eps'],
    min_samples=best_params['min_samples']
)

train_dbscan_labels = dbscan_optimized.fit_predict(X_train)

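# Cluster the test set
# (DBSCAN has no predict method, so a fresh copy of the model is fitted on
# X_test; its label ids are not comparable with the training labels)
from sklearn.base import clone
test_dbscan_labels = clone(dbscan_optimized).fit_predict(X_test)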

##### Which hyperparameter optimization technique have you used and why?

For DBSCAN, hyperparameter optimization was performed using a manual grid search strategy over key parameters such as eps and min_samples. Since the project involves unsupervised learning with no target variable, traditional supervised tuning techniques could not be applied. Instead, an internal clustering validation metric, the Silhouette Score, was used after excluding noise points. This approach helped explore suitable density parameters for identifying clusters and outliers.
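A fixed list of eps values may miss the scale of the data entirely. A common heuristic for picking candidates is the k-distance curve: sort every point's distance to its k-th nearest neighbor and take values around the bend. The sketch below uses synthetic blobs and quantiles of the curve as candidates; both choices are assumptions for illustration, not part of this notebook's search.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data standing in for the reduced feature matrix
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

min_samples = 10

# kneighbors on the fitted data returns each point itself as its first
# neighbor (distance 0), which matches DBSCAN counting the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the farthest of those neighbors: the k-distance curve
k_dist = np.sort(distances[:, -1])

# Candidate eps values taken from the upper part of the curve
eps_candidates = np.quantile(k_dist, [0.5, 0.75, 0.9, 0.95])
print(np.round(eps_candidates, 3))
```

Candidates derived this way adapt to the distance scale of the feature space, which matters after scaling and SVD because typical distances there may be far from the 0.3 to 1.0 range.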

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

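No significant improvement was observed after hyperparameter tuning.

This is expected behavior for DBSCAN on high-dimensional, text-based datasets like Netflix content: pairwise distances tend to become similar as dimensionality grows, so a single eps threshold struggles to separate dense regions from sparse ones.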

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The Silhouette Score was chosen as the primary evaluation metric because it measures cluster cohesion and separation in unsupervised learning. From a business perspective, higher silhouette scores indicate better content grouping, which directly supports recommendation systems, content discovery, and user engagement.
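For a single point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster (cohesion) and b is the mean distance to the points of the nearest other cluster (separation). The sketch below computes this by hand on four invented points and compares the average with scikit-learn's `silhouette_score`; the points and labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Four invented points forming two tight, well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

D = pairwise_distances(X_toy)  # Euclidean distance matrix

scores = []
for i in range(len(X_toy)):
    # a: mean distance to the other members of the point's own cluster
    same = labels == labels[i]
    same[i] = False
    a = D[i, same].mean()

    # b: smallest mean distance to the members of any other cluster
    b = min(
        D[i, labels == other].mean()
        for other in set(labels)
        if other != labels[i]
    )
    scores.append((b - a) / max(a, b))

manual = float(np.mean(scores))
print(f"manual: {manual:.4f}  sklearn: {silhouette_score(X_toy, labels):.4f}")
```

Values near 1 indicate tight, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.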

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Among the three clustering models evaluated, KMeans was selected as the final model due to its superior silhouette score, scalability, and interpretability. The model produced stable and meaningful clusters that can support Netflix’s recommendation system, content discovery, and strategic decision-making.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The KMeans model was explained using centroid analysis and cluster profiling. Since clustering models do not provide direct feature importance, the contribution of features was analyzed using cluster centroids and average feature values within each cluster. This approach helped identify key attributes such as duration, number of seasons, genre diversity, and content recency that influenced cluster formation.
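Cluster profiling of the kind described above can be sketched as a group-by over the cluster labels. The rows below are invented and the three columns are only a subset of the features named in this notebook; the sketch shows the mechanics, not the actual cluster profiles.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented rows: four short titles and four long ones
toy = pd.DataFrame({
    "normalized_duration": [90, 95, 100, 105, 300, 360, 420, 480],
    "num_genres": [1, 2, 1, 2, 3, 3, 2, 3],
    "added_year": [2016, 2017, 2016, 2017, 2019, 2019, 2018, 2019],
})

# Cluster on standardized features, as KMeans is distance-based
X_toy = StandardScaler().fit_transform(toy)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_toy)

# Profile: mean of every feature per cluster, in the original units
profile = toy.assign(cluster=labels).groupby("cluster").mean()
print(profile)
```

Reading the profile row by row shows which features differ most between clusters, which is the practical substitute for feature importance in a clustering setting.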

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
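The two empty cells above could be filled along the following lines. This is a minimal sketch using joblib on a stand-in KMeans model fitted to synthetic data; the file name `kmeans_model.joblib` and the temporary directory are hypothetical, and in the notebook the saved object would be the tuned model together with the fitted vectorizer, scaler, and SVD needed to transform new titles.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in model fitted on synthetic data (not the notebook's final model)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=42)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Save the fitted model (hypothetical file name in a temporary directory)
path = os.path.join(tempfile.mkdtemp(), "kmeans_model.joblib")
joblib.dump(model, path)

# Load it back and run a sanity check on a few points
loaded = joblib.load(path)
same = np.array_equal(model.predict(X_demo[:5]), loaded.predict(X_demo[:5]))
print("Predictions match after reload:", same)
```

Saved models should be loaded with the same scikit-learn version that produced them, since the on-disk format is not guaranteed to be compatible across versions.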

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, an end-to-end unsupervised machine learning pipeline was developed to analyze and group Netflix content based on textual and numerical features. Extensive exploratory data analysis was performed to understand content distribution across countries, formats, and time, revealing key trends such as Netflix’s increasing focus on TV shows and region-specific content preferences.

Feature engineering and data preprocessing techniques—including text cleaning, TF-IDF vectorization, feature scaling, and dimensionality reduction using TruncatedSVD—were applied to transform raw data into a machine-learning-ready format. Multiple clustering algorithms were implemented and evaluated to identify meaningful content groupings.

Among the models tested, KMeans clustering emerged as the most effective approach, achieving the highest silhouette score and producing stable, interpretable clusters. Agglomerative clustering provided additional hierarchical insights but showed weaker cluster separation, while DBSCAN proved useful primarily for identifying niche and outlier content rather than forming well-defined clusters. Hyperparameter optimization further improved clustering performance for KMeans, validating it as the final model.

The clustering results offer valuable business insights by enabling Netflix to better segment its content library, enhance recommendation systems, improve content discovery, and support strategic decision-making related to content acquisition and audience targeting. Overall, this project demonstrates how unsupervised learning can effectively uncover hidden patterns in large-scale content datasets and drive meaningful business impact.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***