# **Project Name**    - Netflix Movies and TV Shows Clustering & Analysis



##### **Project Type**    - EDA + Unsupervised Learning (Clustering)
##### **Contribution**    - Individual (Surya Teja Chakkala)



# **Project Summary -**

Netflix is one of the world’s largest streaming platforms, offering a vast collection of movies and TV shows across different genres, countries, and audiences. Due to the continuously growing content library, it becomes challenging to understand content distribution, viewer preferences, and similarities between shows and movies. This project focuses on analyzing Netflix’s content dataset and grouping similar movies and TV shows using unsupervised machine learning techniques.

The dataset contains information such as title, type (Movie or TV Show), director, cast, country, date added, release year, rating, duration, genres, and description. Initially, Exploratory Data Analysis (EDA) is performed to understand the structure of the dataset, identify missing values, duplicate records, and analyze the distribution of categorical and numerical features. Data cleaning and preprocessing steps such as handling missing values, feature transformation, and text preprocessing are applied to make the dataset analysis-ready.

A structured visualization approach is followed using the UBM rule (Univariate, Bivariate, and Multivariate analysis). Univariate analysis helps understand individual variables like content type distribution, ratings, and country-wise content. Bivariate analysis explores relationships between variables such as content type vs rating and release year vs duration. Multivariate analysis is used to uncover deeper relationships using correlation heatmaps and pair plots.

For the machine learning part, text-based features such as genres and descriptions are transformed using TF-IDF vectorization. Then, K-Means Clustering is applied to group similar Netflix content based on textual similarity. The optimal number of clusters is identified using the Elbow Method and Silhouette Score. These clusters help identify content themes such as family-friendly content, crime and thriller-based shows, romantic movies, and documentaries.

The insights gained from this project can help Netflix in content recommendation systems, targeted marketing strategies, content acquisition decisions, and improving user experience. By understanding content clusters, Netflix can recommend similar shows to users, identify gaps in content offerings, and optimize regional content strategies. Overall, this project demonstrates how data analysis and unsupervised learning can generate meaningful business insights from real-world datasets.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Netflix hosts a massive and diverse content library, making it difficult to manually analyze content similarities and viewer-relevant patterns. The problem is to analyze Netflix movies and TV shows data and group similar content using unsupervised learning techniques. By performing EDA and clustering, the goal is to uncover hidden patterns, understand content distribution, and build meaningful clusters that can support recommendation systems and business decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import the libraries used for data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load dataset
df = pd.read_csv('/content/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')


### Dataset First View

In [None]:
# Dataset First Look
df.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape


### Dataset Information

In [None]:
# Dataset Info
df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()


In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)
plt.show()


### What did you know about your dataset?

The dataset contains both categorical and textual data related to Netflix content. Several columns such as director, cast, and country contain missing values. The dataset is suitable for EDA and text-based clustering after proper preprocessing.
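The missing-value counts above are often easier to read as percentages per column. The following is a minimal sketch on a toy frame; the `toy` data and the `missing_summary` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the Netflix data (values are illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["US", "IN", None, "UK"],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Columns with the most missing values come first
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Applied to the real data, `missing_summary(df)` would show at a glance which columns need imputation before clustering.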

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns


In [None]:
# Dataset Describe
df.describe(include='all')


### Variables Description

type: Movie or TV Show

title: Name of the content

director/cast: Creators and actors

country: Country of production

release_year: Year of release

rating: Age rating

duration: Movie length or number of seasons

listed_in: Genres

description: Brief summary

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    print(col, df[col].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Handling missing values
# Assign the result back instead of using inplace=True on a column selection,
# which is unreliable under pandas copy-on-write
for col in ['director', 'cast', 'country']:
    df[col] = df[col].fillna('Unknown')

# Removing duplicates
df.drop_duplicates(inplace=True)


### What all manipulations have you done and insights you found?

Missing values in the director, cast, and country columns were filled with 'Unknown', and duplicate rows were removed. This improved data consistency and reliability for analysis and clustering.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Content Type Distribution

In [None]:
# Chart - 1 visualization code
sns.countplot(x='type', data=df)
plt.show()


##### 1. Why did you pick the specific chart?

To understand the proportion of Movies vs TV Shows.


##### 2. What is/are the insight(s) found from the chart?

Movies dominate the Netflix catalog.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:
Helps Netflix decide whether to invest more in TV shows for user retention.

#### Chart - 2 Ratings Distribution

In [None]:
# Chart - 2 visualization code
df['rating'].value_counts().plot(kind='bar')
plt.show()


##### 1. Why did you pick the specific chart?

To analyze audience targeting.

##### 2. What is/are the insight(s) found from the chart?

TV-MA and TV-14 content is highest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact:
Indicates focus on mature audiences.


#### Chart - 3 Content Released Over the Years


In [None]:
# Chart - 3 visualization code
df['release_year'].value_counts().sort_index().plot(kind='line', figsize=(8,4))
plt.title("Content Released Over the Years")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze how the release years of titles in the Netflix catalog are distributed over time.


##### 2. What is/are the insight(s) found from the chart?

The number of titles by release year increased significantly after 2015.

The peak is observed in recent release years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact

✔ Positive: Shows Netflix’s aggressive content expansion strategy.

❌ Negative: Rapid growth may affect content quality control.

#### Chart - 4 Movies vs TV Shows Over Time

In [None]:
# Chart - 4 visualization code
sns.countplot(x='type', hue='release_year', data=df[df['release_year'] > 2015])
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

To understand how Netflix balances movies and TV shows over time.

##### 2. What is/are the insight(s) found from the chart?

TV shows have increased more rapidly than movies.

Indicates focus on long-term user engagement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact

✔ Positive: TV shows improve subscription retention.

#### Chart - 5 Country-wise Content Distribution

In [None]:
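# Chart - 5 visualization code
# Exclude the 'Unknown' placeholder used for missing countries so it is not ranked as a country
df.loc[df['country'] != 'Unknown', 'country'].value_counts().head(10).plot(kind='bar')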
plt.title("Top 10 Content Producing Countries")
plt.show()


##### 1. Why did you pick the specific chart?

To identify dominant content-producing countries.


##### 2. What is/are the insight(s) found from the chart?

USA contributes the highest content.

India and UK are emerging markets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact

✔ Positive: Helps in regional content investment decisions.

❌ Negative: Overdependence on a single country may limit diversity.

#### Chart - 6 Rating Distribution

In [None]:
# Chart - 6 visualization code
df['rating'].value_counts().plot(kind='bar')
plt.title("Content Rating Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To understand target audience age groups.

##### 2. What is/are the insight(s) found from the chart?

TV-MA and TV-14 dominate.

Netflix focuses on mature audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact

✔ Positive: Clear audience targeting strategy.

❌ Risk: Limited kids/family content may reduce family subscriptions.

#### Chart - 7 Duration Distribution

In [None]:
# Chart - 7 visualization code
movie_df = df[df['type']=='Movie']
movie_df['duration'].str.replace(' min','').astype(int).plot(kind='hist', bins=30)
plt.title("Movie Duration Distribution")
plt.show()


##### 1. Why did you pick the specific chart?

To analyze preferred movie lengths.

##### 2. What is/are the insight(s) found from the chart?

Most movies are between 80–120 minutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact

✔ Positive: Aligns with viewer attention span.

#### Chart - 8 Number of Seasons (TV Shows)

In [None]:
# Chart - 8 visualization code
tv_df = df[df['type']=='TV Show']
tv_df['duration'].value_counts().head(10).plot(kind='bar')
plt.title("TV Shows Season Count")
plt.show()


##### 1. Why did you pick the specific chart?

To understand series length trends.

##### 2. What is/are the insight(s) found from the chart?

Most TV shows have 1–2 seasons.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact

✔ Positive: Low-risk content experimentation.

❌ Negative: Fewer long-running shows may reduce loyalty.

#### Chart - 9 Genre Distribution

In [None]:
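# Chart - 9 visualization code
# Strip whitespace after splitting so that " Dramas" and "Dramas" are counted as one genre
df['listed_in'].str.split(',').explode().str.strip().value_counts().head(10).plot(kind='bar')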
plt.title("Top Genres on Netflix")
plt.show()


##### 1. Why did you pick the specific chart?

To identify popular content categories.

##### 2. What is/are the insight(s) found from the chart?

Dramas and International content dominate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✔ Positive: Guides content acquisition strategy.

#### Chart - 10 Content Type vs Rating

In [None]:
# Chart - 10 visualization code
sns.countplot(x='rating', hue='type', data=df)
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

To compare rating distribution across content types.

##### 2. What is/are the insight(s) found from the chart?

TV shows dominate mature ratings.

Movies are more evenly spread.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✔ Positive: Helps in personalized recommendations.

#### Chart - 11 Content Added Per Month

In [None]:
# Chart - 11 visualization code
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['month_added'] = df['date_added'].dt.month
df['month_added'].value_counts().sort_index().plot(kind='bar')
plt.title("Content Added Per Month")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze seasonal content strategy.

##### 2. What is/are the insight(s) found from the chart?

Higher content added in mid-year months.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Business impact

✔ Positive: Strategic releases during peak demand periods.

#### Chart - 12 Director Contribution

In [None]:
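# Chart - 12 visualization code
# Exclude the 'Unknown' placeholder used for missing directors so it is not ranked as a director
df.loc[df['director'] != 'Unknown', 'director'].value_counts().head(10).plot(kind='bar')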
plt.title("Top Directors on Netflix")
plt.show()


##### 1. Why did you pick the specific chart?

To identify frequently collaborating directors.

##### 2. What is/are the insight(s) found from the chart?

Few directors dominate content creation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✔ Positive: Strong partnerships with proven creators.

#### Chart - 13 Cast Appearance Frequency

In [None]:
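# Chart - 13 visualization code
# Strip whitespace after splitting and drop the 'Unknown' placeholder used for missing cast
cast_names = df['cast'].str.split(',').explode().str.strip()
cast_names[cast_names != 'Unknown'].value_counts().head(10).plot(kind='bar')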
plt.title("Top Actors on Netflix")
plt.show()


##### 1. Why did you pick the specific chart?

To understand star power influence.

##### 2. What is/are the insight(s) found from the chart?

Certain actors appear frequently.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

✔ Positive: Star-driven content attracts viewers.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
sns.heatmap(df.select_dtypes(include=np.number).corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

To identify numerical relationships.

##### 2. What is/are the insight(s) found from the chart?

Weak correlations indicate that the numerical features have little linear relationship with each other.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df.select_dtypes(include=np.number))
plt.show()


##### 1. Why did you pick the specific chart?

For multivariate relationship analysis.

##### 2. What is/are the insight(s) found from the chart?

Helps visualize data spread and potential clustering patterns.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Research Question:

Is there a significant difference in the release year distribution of Movies and TV Shows?

Null Hypothesis (H₀):
There is no significant difference in the mean release year between Movies and TV Shows.

Alternate Hypothesis (H₁):
There is a significant difference in the mean release year between Movies and TV Shows.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import ttest_ind

movie_years = df[df['type']=='Movie']['release_year']
tv_years = df[df['type']=='TV Show']['release_year']

t_stat, p_value = ttest_ind(movie_years, tv_years, equal_var=False)
p_value



##### Which statistical test have you done to obtain P-Value?

Independent T-Test (Welch's variant, since `equal_var=False` is used)

##### Why did you choose the specific statistical test?

The Independent T-Test is used to compare the mean of a numerical variable (release year) between two independent categories (Movies and TV Shows). Setting `equal_var=False` applies Welch's variant, which does not assume equal variances in the two groups.

Conclusion:
If p-value < 0.05, we reject H₀ → There is a significant difference.
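The decision rule above can be sketched on synthetic data. The two groups below are randomly generated stand-ins (an assumption for illustration, not the Netflix release years), and `alpha` is the conventional 0.05 threshold.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for two groups of release years (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=2013, scale=8, size=500)
group_b = rng.normal(loc=2016, scale=5, size=300)

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3g} -> {decision}")
```

Because the synthetic means differ by three years with hundreds of samples per group, the test rejects H₀ here; on the real data the outcome depends on the computed p-value.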

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Research Question:

Is content rating dependent on content type (Movie or TV Show)?

Null Hypothesis (H₀):
Content rating is independent of content type.

Alternate Hypothesis (H₁):
Content rating is dependent on content type.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['type'], df['rating'])
chi2, p, dof, exp = chi2_contingency(contingency_table)
p


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test of Independence

##### Why did you choose the specific statistical test?

Used to test the relationship between two categorical variables.


Conclusion:
If p-value < 0.05, we reject H₀ and conclude that content rating depends on content type.
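To make the mechanics of the Chi-Square test concrete, here is a small sketch on a hand-built 2×2 table. The counts are invented for illustration and do not come from the Netflix dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: 60 movies (20 TV-MA, 40 PG) and 40 TV shows (30 TV-MA, 10 PG)
toy = pd.DataFrame({
    "type": ["Movie"] * 60 + ["TV Show"] * 40,
    "rating": ["TV-MA"] * 20 + ["PG"] * 40 + ["TV-MA"] * 30 + ["PG"] * 10,
})

# Observed counts per (type, rating) combination
table = pd.crosstab(toy["type"], toy["rating"])

# expected holds the counts implied by independence:
# row total * column total / grand total
chi2, p, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
print("Smallest expected count:", expected.min())
```

A common rule of thumb is that the approximation is reliable when all expected counts are at least 5, which is worth checking for rare ratings in the real data.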

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Research Question:

Does movie duration significantly vary across different release years?

Null Hypothesis (H₀):
Movie duration does not vary significantly across years.

Alternate Hypothesis (H₁):
Movie duration varies significantly across years.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import pearsonr

movie_df = df[df['type']=='Movie'].copy()
movie_df['duration_min'] = movie_df['duration'].str.replace(' min','').astype(int)

corr, p_value = pearsonr(movie_df['release_year'], movie_df['duration_min'])
p_value


##### Which statistical test have you done to obtain P-Value?

Pearson Correlation Test

##### Why did you choose the specific statistical test?

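Used to measure the strength and direction of the linear relationship between two numerical variables (release year and movie duration).

Conclusion:
If p-value < 0.05, we reject H₀ and conclude that duration and release year are linearly related. A correlation coefficient (`corr`) close to zero indicates that the relationship is weak, so duration is mostly independent of release year.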
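As a worked illustration of what the Pearson coefficient measures, the sketch below computes it from its definition and compares it with `scipy.stats.pearsonr`. The `years` and `durations` arrays are synthetic and generated independently of each other, so they are an assumption for demonstration only.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data: durations are drawn independently of the years
rng = np.random.default_rng(1)
years = rng.integers(1990, 2021, size=400).astype(float)
durations = rng.normal(loc=100, scale=20, size=400)

# Pearson r from its definition:
# sum of centered products over the product of centered norms
x_c = years - years.mean()
y_c = durations - durations.mean()
r_manual = (x_c @ y_c) / np.sqrt((x_c @ x_c) * (y_c @ y_c))

# Library version, which also returns the two-sided p-value
r_scipy, p_value = pearsonr(years, durations)

print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.3f}")
```

Note that Pearson correlation only captures linear trends; a non-linear change in duration across years could go undetected by this test.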

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Assign the result back instead of using inplace=True on a column selection
for col in ['director', 'cast', 'country']:
    df[col] = df[col].fillna('Unknown')


#### What all missing value imputation techniques have you used and why did you use those techniques?

Techniques Used & Why

Constant imputation ('Unknown') for the categorical variables director, cast, and country

Prevents data loss and keeps dataset size intact

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# No extreme numerical outliers detected


##### What all outlier treatment techniques have you used and why did you use those techniques?

Techniques Used

Visual inspection using boxplots

No aggressive outlier removal to preserve real-world variation

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
df_encoded = pd.get_dummies(df[['type', 'rating']], drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?

Encoding Techniques & Why

One-Hot Encoding

Avoids ordinal assumptions and supports ML models
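The effect of One-Hot Encoding with `drop_first=True` is easiest to see on a tiny example. The three rows below are made up for illustration.

```python
import pandas as pd

# Three made-up rows with the same two categorical columns as above
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "rating": ["PG", "TV-MA", "TV-14"],
})

# drop_first=True removes the first category of each column
# ("Movie" and "PG"), which become the implicit baseline
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

A row with all indicator columns set to zero therefore represents the baseline categories, which avoids perfectly redundant columns.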

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# Example: don't → do not


#### 2. Lower Casing

In [None]:
# Lower Casing
df['description'] = df['description'].str.lower()


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
df['description'] = df['description'].str.translate(str.maketrans('', '', string.punctuation))


#### 4. Removing URLs & Removing words that contain digits.

In [None]:
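# Remove URLs & remove words that contain digits
import re
# \w*\d\w* matches any token containing at least one digit (e.g. "2nd", "1990s"), not only the digits
df['description'] = df['description'].apply(lambda x: re.sub(r'http\S+|\w*\d\w*', '', x))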



#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

df['description'] = df['description'].apply(
    lambda x: " ".join([word for word in x.split() if word not in stop_words])
)

In [None]:
# Remove White spaces
df['description'] = df['description'].apply(lambda x: ' '.join(x.split()))

#### 6. Rephrase Text

In [None]:
# Rephrase Text
# Not required as the description text is already meaningful

#### 7. Tokenization

In [None]:
# Tokenization
df['tokens'] = df['description'].apply(lambda x: x.split())


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

df['tokens'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

##### Which text normalization technique have you used and why?

Why Lemmatization?

Preserves meaningful root words (better than stemming)

#### 9. Part of speech tagging

In [None]:
# POS Tagging
import nltk
nltk.download('averaged_perceptron_tagger_eng')
df['pos_tags'] = df['tokens'].apply(nltk.pos_tag)

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

# Join the lemmatized tokens back into a string for TF-IDF Vectorization
# The 'tokens' column was created in the Tokenization step and lemmatized in the Text Normalization step
df['lemmatized_text'] = df['tokens'].apply(lambda x: ' '.join(x))

# Initialize TF-IDF Vectorizer
# Using max_features to limit the number of features and manage dimensionality
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the lemmatized text
X_tfidf = tfidf_vectorizer.fit_transform(df['lemmatized_text'])

print(f"Shape of TF-IDF matrix: {X_tfidf.shape}")

##### Which text vectorization technique have you used and why?

Why TF-IDF?

Captures importance of words

Ideal for clustering & text similarity
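To illustrate why TF-IDF suits text similarity, the sketch below vectorizes three invented one-line descriptions and compares them with cosine similarity. The `docs` list is an assumption for demonstration and not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented, already-cleaned descriptions (illustrative only)
docs = [
    "detective investigates murder city",
    "detective hunts killer city",
    "family comedy holiday adventure",
]

# Each row is a document, each column a vocabulary term weighted by TF-IDF
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

# Pairwise cosine similarity between the document vectors
sim = cosine_similarity(matrix)

print("Matrix shape:", matrix.shape)
print("crime vs crime :", round(float(sim[0, 1]), 3))
print("crime vs family:", round(float(sim[0, 2]), 3))
```

The two crime descriptions share terms and get a positive similarity, while the family description shares none and scores zero, which is the signal that clustering relies on.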

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
# Not required as TF-IDF normalizes values internally


### 6. Data Scaling

In [None]:
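# Scaling your data
# Note: the scaler is only instantiated here; it is not fitted or applied,
# because the TF-IDF vectors are already L2-normalized per document.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


##### Which method have you used to scale your data and why?

Why StandardScaler?

StandardScaler puts features on a comparable scale so that each contributes equally to distance-based clustering. In this notebook it is instantiated but not applied, since TF-IDF already normalizes each document vector.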


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, TF-IDF creates high-dimensional data.

In [None]:
# Dimensionality Reduction (If needed)
from sklearn.decomposition import PCA
# random_state fixes the randomized SVD solver so the projection is reproducible
pca = PCA(n_components=100, random_state=42)
X_reduced = pca.fit_transform(X_tfidf.toarray())


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Why PCA?

Reduces dimensionality

Preserves maximum variance
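One way to check how much variance a given number of components preserves is the cumulative explained variance ratio. The sketch below uses synthetic data built from three latent factors; the data and the 95% threshold are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 20 features driven by 3 latent factors plus small noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 20))

pca = PCA(n_components=10, random_state=42)
pca.fit(X)

# Running total of the variance share explained by the leading components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 95%:", n_keep)
```

Running the same check on the fitted `pca` object from the cell above would show how much of the TF-IDF variance the 100 components retain.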

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X_reduced, test_size=0.2, random_state=42)


##### What data splitting ratio have you used and why?

Why 80–20 Split?

Standard industry practice

Balanced training and evaluation

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

 Not applicable (Unsupervised learning)

Technique Used

Not required for clustering

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


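# Fit the Algorithm
# n_init is set explicitly so results do not depend on the scikit-learn version's default
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(X_reduced)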


# Predict on the model
labels_kmeans = kmeans.labels_


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

K-Means groups data points by minimizing the distance between points and cluster centroids.

In [None]:
# Calculate Silhouette Score (computed before the chart so the variable exists when plotting)
from sklearn.metrics import silhouette_score
silhouette_kmeans = silhouette_score(X_reduced, labels_kmeans)
print(f"Silhouette Score for KMeans: {silhouette_kmeans}")

In [None]:
# Visualizing evaluation Metric Score chart
plt.bar(['KMeans'], [silhouette_kmeans])
plt.title("Silhouette Score – KMeans")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Custom scorer for unsupervised clustering.
# GridSearchCV accepts any callable with the signature scorer(estimator, X, y),
# so the function is passed directly. Wrapping it in make_scorer would call it
# as score_func(y_true, y_pred), which does not match this signature.
def unsupervised_silhouette_scorer(estimator, X, y=None):
    # y is ignored: there are no ground-truth labels in clustering.
    labels = estimator.predict(X)
    # Silhouette is undefined when only one cluster appears in the fold
    if len(set(labels)) < 2:
        return -1.0
    return silhouette_score(X, labels)

param_grid = {'n_clusters': [3, 4, 5, 6, 7]}
grid = GridSearchCV(KMeans(random_state=42, n_init='auto'), param_grid,
                    scoring=unsupervised_silhouette_scorer)
grid.fit(X_reduced)  # Fit on the full X_reduced data

# The best estimator is refit on all of X_reduced (refit=True by default)
best_kmeans = grid.best_estimator_
labels_best = best_kmeans.labels_
silhouette_best = silhouette_score(X_reduced, labels_best)
print(f"Best n_clusters: {grid.best_params_['n_clusters']} | Silhouette Score: {silhouette_best}")

##### Which hyperparameter optimization technique have you used and why?

GridSearchCV

To test multiple cluster values systematically and find the optimal one.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Initial Silhouette Score (`silhouette_kmeans`, 5 clusters): Lower

Tuned Model Score (`silhouette_best`): Improved

A higher tuned score indicates better cluster separation.
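The Elbow Method and Silhouette Score mentioned in the project summary can be combined in a single sweep over candidate cluster counts. The sketch below runs on synthetic blobs from `make_blobs` (an assumption for illustration); on the real data `X` would be replaced by `X_reduced`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with four well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Inertia: within-cluster sum of squared distances (used for the elbow)
    inertias[k] = model.inertia_
    # Silhouette: cohesion vs separation, between -1 and 1
    silhouettes[k] = silhouette_score(X, model.labels_)
    print(f"k={k} | inertia={inertias[k]:.1f} | silhouette={silhouettes[k]:.3f}")

# Candidate k with the highest silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print("Best k by silhouette:", best_k)
```

Plotting `inertias` against `k` gives the elbow curve: the point where the decrease flattens is a candidate cluster count, which can then be cross-checked against the silhouette maximum.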

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
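from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# ML Model - 2 Implementation: fit hierarchical (Ward) clustering and score it
# before plotting, so that hier_silhouette is defined when the chart is drawn
hier_labels = AgglomerativeClustering(n_clusters=5, linkage='ward').fit_predict(X_reduced)
hier_silhouette = silhouette_score(X_reduced, hier_labels)

# Visualizing evaluation Metric Score chart
plt.figure()
plt.bar(['Hierarchical'], [hier_silhouette])
plt.ylabel('Silhouette Score')
plt.title('Hierarchical Clustering Evaluation')
plt.show()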



#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
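# ML Model - 2 Implementation with manual hyperparameter tuning
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Try several cluster counts and record the Silhouette Score of each
hier_scores = {}
for k in [3, 4, 5, 6, 7]:
    candidate_labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(X_reduced)
    hier_scores[k] = silhouette_score(X_reduced, candidate_labels)
    print(f"n_clusters={k} | silhouette={hier_scores[k]:.4f}")

# Fit the Algorithm with the best-scoring number of clusters
best_k = max(hier_scores, key=hier_scores.get)
hier_model = AgglomerativeClustering(n_clusters=best_k, linkage='ward')

# Predict on the model
hier_labels = hier_model.fit_predict(X_reduced)
hier_silhouette = hier_scores[best_k]
print(f"Best n_clusters: {best_k} | Silhouette Score: {hier_silhouette:.4f}")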



##### Which hyperparameter optimization technique have you used and why?

Technique Used:
Manual Hyperparameter Tuning (changing number of clusters)

Why Used:
`AgglomerativeClustering` has no `predict` method, so it cannot be scored on held-out folds by GridSearchCV directly. Therefore, we manually tested different cluster sizes and selected the value that produced the best Silhouette Score.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, improvement was observed.
By adjusting the number of clusters, the Silhouette Score improved, indicating better cluster separation and more meaningful groupings.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Silhouette Score – Business Interpretation

High Silhouette Score

Titles within the same cluster are very similar

Clear distinction between different groups

Leads to accurate content segmentation or pattern discovery

Low Silhouette Score

Overlapping or poorly formed clusters

Less meaningful segmentation

May lead to ineffective business decisions

💼 Business Impact of Hierarchical Clustering

Enables content segmentation for targeted marketing

Helps identify thematic patterns across titles

Improves decision-making in content acquisition, recommendations, and personalization

Supports strategic planning without requiring labeled data
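To make the Silhouette Score interpretation above concrete, the sketch below computes the per-point coefficient s(i) = (b - a) / max(a, b) by hand, where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The four points and the `silhouette_values` helper are illustrative assumptions.

```python
import numpy as np

def silhouette_values(points: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-point silhouette s(i) = (b - a) / max(a, b) using Euclidean distance.

    Assumes at least two clusters and at least two points in every cluster.
    """
    scores = np.empty(len(points))
    for i in range(len(points)):
        dists = np.linalg.norm(points - points[i], axis=1)
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = dists[same].mean()  # mean distance to the rest of its own cluster
        # b: smallest mean distance to the points of any other cluster
        b = min(dists[labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight pairs of points, five units apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

scores = silhouette_values(points, labels)
print("Per-point silhouette:", scores.round(3))
print("Mean silhouette:", round(float(scores.mean()), 3))
```

For the first point, a = 1 and b ≈ 5.05, giving s ≈ 0.80; by symmetry all four points score the same, so the mean is also about 0.80, which corresponds to the "high score" case described above.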

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

# Initialize DBSCAN
dbscan_model = DBSCAN(
    eps=1.0,      # Tuned eps value
    min_samples=5
)

# Fit and predict
dbscan_labels = dbscan_model.fit_predict(X_reduced)

# Remove noise points
mask = dbscan_labels != -1

# Check if enough clusters exist
unique_labels = np.unique(dbscan_labels[mask])

if len(unique_labels) > 1:
    dbscan_silhouette = silhouette_score(
        X_reduced[mask],
        dbscan_labels[mask]
    )
else:
    dbscan_silhouette = 0.0
    print("Not enough clusters formed for valid Silhouette Score")

print("Silhouette Score for DBSCAN:", dbscan_silhouette)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
plt.figure(figsize=(5,4))
plt.bar(['DBSCAN'], [dbscan_silhouette])
plt.ylabel('Silhouette Score')
plt.title('DBSCAN Performance Evaluation')
plt.show()



#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

eps_values = [0.5, 0.8, 1.0, 1.5, 2.0]
best_eps = None
best_score = -1
scores = []

for eps in eps_values:
    model = DBSCAN(eps=eps, min_samples=5)
    labels = model.fit_predict(X_reduced)

    mask = labels != -1
    unique_clusters = np.unique(labels[mask])

    if len(unique_clusters) > 1:
        score = silhouette_score(X_reduced[mask], labels[mask])
    else:
        score = -1

    scores.append(score)
    print(f"eps={eps} | clusters={len(unique_clusters)} | score={score}")

    if score > best_score:
        best_score = score
        best_eps = eps

print("\nBest eps:", best_eps)
print("Best Silhouette Score:", best_score)

# Plot the Silhouette Score for each eps value tested
# (a score of -1 marks runs that produced fewer than two clusters)
plt.figure(figsize=(7,4))
plt.plot(eps_values, scores, marker='o')
plt.xlabel('eps value')
plt.ylabel('Silhouette Score')
plt.title('DBSCAN Hyperparameter Tuning')
plt.grid(True)
plt.show()

##### Which hyperparameter optimization technique have you used and why?

Manual Grid Search was used for DBSCAN hyperparameter tuning.

DBSCAN is not compatible with GridSearchCV because it does not support
predicting unseen data. Cluster labels are assigned during training itself.

Therefore, multiple eps values were manually tested, and the value producing
the highest Silhouette Score was selected.
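
A manual grid search can also cover more than one DBSCAN parameter at a time. The sketch below uses scikit-learn's `ParameterGrid` to enumerate `eps` and `min_samples` combinations; it runs on synthetic `make_blobs` data rather than on `X_reduced`, and the grid values and the names `X_demo`, `results`, and `best` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Hypothetical stand-in for the reduced feature matrix
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)

# Every combination of the candidate values is tried once
param_grid = ParameterGrid({"eps": [0.3, 0.5, 1.0], "min_samples": [3, 5, 10]})

results = []
for params in param_grid:
    labels = DBSCAN(**params).fit_predict(X_demo)
    mask = labels != -1  # drop noise points before scoring
    n_clusters = len(np.unique(labels[mask]))
    # Silhouette Score is only defined for two or more clusters
    if n_clusters > 1:
        score = silhouette_score(X_demo[mask], labels[mask])
    else:
        score = float("nan")
    results.append({**params, "n_clusters": n_clusters, "silhouette": score})

# Keep only the runs with a defined score, then pick the highest
valid = [r for r in results if not np.isnan(r["silhouette"])]
best = max(valid, key=lambda r: r["silhouette"]) if valid else None
print("Best setting:", best)
```

Storing `nan` instead of a sentinel such as -1 keeps invalid runs out of the comparison without making them look like a real, very poor score.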

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Yes, improvement was observed after hyperparameter tuning.

- With lower eps values, DBSCAN labeled most data points as noise.
- Increasing eps allowed the model to form meaningful clusters.
- The Silhouette Score increased after tuning.
- The updated Evaluation Metric Score Chart shows better cluster separation.

This confirms that tuning the eps parameter significantly improved DBSCAN's clustering quality.
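
The link between `eps` and the share of points labeled as noise can be checked directly. The sketch below is an illustration on synthetic data, not on the Netflix features; `X_demo`, the `eps` values, and `noise_fractions` are assumed names and values.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical 2-D data standing in for the reduced feature matrix
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

eps_values = [0.1, 0.3, 0.5, 1.0, 2.0]
noise_fractions = []

for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    # DBSCAN marks noise points with the label -1
    noise_fraction = float(np.mean(labels == -1))
    n_clusters = len(set(labels) - {-1})
    noise_fractions.append(noise_fraction)
    print(f"eps={eps} | clusters={n_clusters} | noise={noise_fraction:.1%}")
```

Reporting the noise fraction next to the Silhouette Score is useful because the score in this notebook is computed on non-noise points only, so a setting that discards most of the data can still score well.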

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The Silhouette Score was the evaluation metric considered for all three clustering models.

- The dataset has no ground-truth labels, so supervised metrics such as accuracy do not apply
- A higher Silhouette Score means items within a cluster are similar to each other and clearly distinct from items in other clusters
- Well-separated clusters make segmentation, recommendations, and personalization decisions more reliable

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Based on our evaluation, the K-Means Clustering model, especially after hyperparameter tuning, yielded the highest Silhouette Score among the models tested. Although all scores were relatively low, K-Means provided the most distinct and cohesive clusters for this dataset.
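
A side-by-side comparison of this kind can be sketched as follows. The example uses synthetic data and assumed parameter values, so its scores say nothing about the Netflix results; `safe_silhouette`, `candidates`, and `X_demo` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the reduced feature matrix
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=7)


def safe_silhouette(X, labels):
    """Silhouette Score on non-noise points, or nan if fewer than two clusters."""
    mask = labels != -1
    if len(np.unique(labels[mask])) < 2:
        return float("nan")
    return silhouette_score(X[mask], labels[mask])


candidates = {
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=7),
    "Hierarchical": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
}

# Fit each model on the same data and score it with the same metric
scores = {
    name: safe_silhouette(X_demo, model.fit_predict(X_demo))
    for name, model in candidates.items()
}

# Ignore models without a defined score when choosing the best one
best_name = max((n for n in scores if not np.isnan(scores[n])), key=scores.get)

for name, score in scores.items():
    print(f"{name:<13} {score:.3f}")
print("Highest Silhouette Score:", best_name)
```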

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

### Model Explanation and Feature Importance

The final model selected for this project is **K-Means Clustering**, an unsupervised
machine learning algorithm that groups data points into clusters based on similarity.

K-Means works by:
- Selecting K initial centroids
- Assigning each data point to the nearest centroid
- Updating centroids iteratively to minimize within-cluster variance
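
The three steps above can be sketched in plain NumPy. This is a minimal illustration of the standard iteration, not the scikit-learn implementation used in this notebook; the function name `kmeans_lloyd`, the helper `assign`, and the toy data are assumptions.

```python
import numpy as np


def assign(X, centroids):
    """Return, for each point, the index of its nearest centroid."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_lloyd(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid
        labels = assign(X, centroids)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids

    return assign(X, centroids), centroids


# Toy data: two obvious groups of three points each
X_toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centroids = kmeans_lloyd(X_toy, k=2)
print("Labels:", labels)
print("Centroids:\n", centroids)
```

Each centroid update can only lower or keep the within-cluster variance, which is why the loop settles on a stable assignment.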

### Feature Importance / Model Explainability

Since K-Means is an unsupervised model, traditional feature importance is not available.
Therefore, **TF-IDF + PCA inverse transformation** was used as a model explainability technique.

This approach helps identify:
- The most influential textual features (words)
- The dominant themes within each cluster

By analyzing the top TF-IDF terms per cluster, we can interpret each cluster's meaning
and understand how content is grouped.

Feature Importance Code (Top TF-IDF Terms per Cluster)

In [None]:
import numpy as np

# Get feature names from TF-IDF
feature_names = tfidf_vectorizer.get_feature_names_out()

# Extract cluster centers from best K-Means model
cluster_centers = best_kmeans.cluster_centers_

# Inverse transform PCA components back to TF-IDF space
cluster_centers_original = pca.inverse_transform(cluster_centers)

print("Top 10 Important Features per Cluster:\n")

for i, center in enumerate(cluster_centers_original):
    top_indices = center.argsort()[::-1][:10]
    top_features = [feature_names[j] for j in top_indices]
    print(f"Cluster {i}: {', '.join(top_features)}")

Feature Importance Visualization

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, len(cluster_centers_original), figsize=(20, 5))
fig.suptitle("Top TF-IDF Feature Importance per Cluster", fontsize=14)

for i, (center, ax) in enumerate(zip(cluster_centers_original, axes)):
    top_indices = center.argsort()[::-1][:10]
    top_terms = [feature_names[j] for j in top_indices]
    top_scores = [center[j] for j in top_indices]

    ax.barh(top_terms[::-1], top_scores[::-1])
    ax.set_title(f"Cluster {i}")
    ax.set_xlabel("TF-IDF Weight")

plt.tight_layout()
plt.show()

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# Save model and preprocessing components
joblib.dump(best_kmeans, "best_kmeans_model.pkl")
joblib.dump(pca, "pca_transformer.pkl")
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")

print("Models saved successfully!")

### 2. Load the saved model file again and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
import joblib

loaded_kmeans = joblib.load("best_kmeans_model.pkl")
loaded_pca = joblib.load("pca_transformer.pkl")
loaded_tfidf = joblib.load("tfidf_vectorizer.pkl")

print("Models loaded successfully!")

In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+|http\S+', '', text)
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in text.split() if w not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return ' '.join(tokens)

# Unseen Netflix-style descriptions
unseen_data = [
    "A romantic drama about two strangers falling in love during a journey.",
    "A dark crime thriller involving a detective solving mysterious murders.",
    "An animated adventure movie for kids and families.",
    "A horror story set in an abandoned hospital.",
    "A documentary exploring marine life in deep oceans."
]

print("Sanity Check: Predicting clusters for unseen data\n")

for text in unseen_data:
    processed = preprocess_text(text)
    tfidf_vec = loaded_tfidf.transform([processed])
    pca_vec = loaded_pca.transform(tfidf_vec.toarray())
    cluster = loaded_kmeans.predict(pca_vec)[0]

    print(f"Description: {text}")
    print(f"Predicted Cluster: {cluster}\n")
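
A stricter sanity check is to confirm that the reloaded components give exactly the same predictions as the in-memory ones. The sketch below is self-contained and assumes a tiny toy corpus and hypothetical names (`toy_docs`, `predict_with`, `roundtrip_ok`); it writes to a temporary directory so it does not touch the saved `.pkl` files above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for preprocessed descriptions
toy_docs = [
    "romantic drama love journey",
    "love story romance couple",
    "crime thriller detective murder",
    "detective investigates murder case",
    "romance wedding couple love",
    "murder mystery crime police",
]

# Fit the same three-stage chain used in the notebook: TF-IDF, PCA, K-Means
tfidf_demo = TfidfVectorizer()
X_tfidf = tfidf_demo.fit_transform(toy_docs).toarray()
pca_demo = PCA(n_components=2, random_state=0)
X_pca = pca_demo.fit_transform(X_tfidf)
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_pca)


def predict_with(tfidf, pca, kmeans, texts):
    """Run texts through the full chain and return cluster labels."""
    return kmeans.predict(pca.transform(tfidf.transform(texts).toarray()))


# Save each component, then load it back from disk
components = {"tfidf": tfidf_demo, "pca": pca_demo, "kmeans": kmeans_demo}
with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = {}
    for name, obj in components.items():
        path = Path(tmp_dir) / f"{name}.pkl"
        joblib.dump(obj, path)
        reloaded[name] = joblib.load(path)

before = predict_with(tfidf_demo, pca_demo, kmeans_demo, toy_docs)
after = predict_with(reloaded["tfidf"], reloaded["pca"], reloaded["kmeans"], toy_docs)

roundtrip_ok = bool(np.array_equal(before, after))
print("Predictions identical after reload:", roundtrip_ok)
```

The same comparison could be applied to the real components by predicting on a few training descriptions before saving and again after loading.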

In [None]:
# Install Gradio (run once)
!pip install gradio

In [None]:
import gradio as gr
import joblib
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

# ----------------------------
# Load trained components
# ----------------------------
kmeans_model = joblib.load("best_kmeans_model.pkl")
pca = joblib.load("pca_transformer.pkl")
tfidf = joblib.load("tfidf_vectorizer.pkl")

# ----------------------------
# Text preprocessing function
# ----------------------------
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+|http\S+', '', text)
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in text.split() if w not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return ' '.join(tokens)

# ----------------------------
# Cluster label mapping
# (placeholder names: adjust them, and the number of keys,
#  to match the clusters of the fitted model)
# ----------------------------
cluster_names = {
    0: "Family & Animated Content",
    1: "Crime & Thriller",
    2: "Romance & Drama",
    3: "Horror & Suspense",
    4: "Documentary & International"
}

# ----------------------------
# Prediction function
# ----------------------------
def predict_cluster(description):
    processed = preprocess_text(description)
    tfidf_vec = tfidf.transform([processed])
    pca_vec = pca.transform(tfidf_vec.toarray())
    cluster = kmeans_model.predict(pca_vec)[0]

    return f"Predicted Cluster: {cluster_names.get(cluster, 'Unknown Category')}"

# ----------------------------
# Gradio UI
# ----------------------------
interface = gr.Interface(
    fn=predict_cluster,
    inputs=gr.Textbox(
        lines=6,
        placeholder="Enter a Netflix movie or TV show description here...",
        label="Netflix Content Description"
    ),
    outputs=gr.Textbox(label="Prediction Result"),
    title="🎬 Netflix Content Clustering System",
    description=(
        "This tool uses a trained K-Means clustering model to categorize "
        "Netflix movies and TV shows based on their description."
    ),
    theme="soft"
)

interface.launch()

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

This Machine Learning Capstone Project successfully analyzed the Netflix Movies and TV Shows dataset
using Exploratory Data Analysis (EDA) and Unsupervised Machine Learning techniques.

Key Achievements:
- Performed comprehensive EDA to understand content distribution and trends
- Applied three clustering models: K-Means, Hierarchical, and DBSCAN
- Evaluated models using Silhouette Score
- Identified K-Means as the best-performing model after hyperparameter tuning
- Explained clusters using TF-IDF-based feature importance
- Saved and reloaded the trained model for deployment readiness
- Successfully predicted clusters for unseen data as a sanity check

Business Impact:
- Enables personalized content recommendations
- Supports content strategy and acquisition decisions
- Improves user engagement and retention
- Provides scalable clustering solution for real-world deployment

The project demonstrates a complete end-to-end machine learning workflow and the model is
ready for deployment on a live server for real-time user interaction.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***