# **Project Name** - EDA of Netflix Data



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member - Rashi Jain**


# **Project Summary**

The Netflix Content Analysis and Clustering Project focuses on understanding Netflix’s vast and diverse content library using data analysis, visualization, and clustering techniques. Netflix, being one of the largest streaming platforms globally, offers thousands of movies and TV shows across multiple genres, ratings, and regions. However, with such a massive catalog, users often face difficulty discovering shows that match their interests. Hence, this project aims to analyze the dataset to identify meaningful patterns, trends, and similarities among different types of content — ultimately helping Netflix enhance its recommendation system and strategic decision-making.

The dataset used in this project contains information about various Netflix titles, including attributes such as type (Movie/TV Show), title, director, cast, country, release year, rating, duration, and listed genres. The first phase of the project involves data wrangling, which includes cleaning missing values, handling duplicates, and preparing the data for analysis. Once the data is structured and refined, the project proceeds to exploratory data analysis (EDA), where several visualizations are created to uncover patterns and insights.

The analysis begins with univariate analysis, focusing on individual variables. For example, charts like “Count of Movies vs TV Shows” reveal that Netflix’s library is dominated by movies compared to TV shows. Similarly, “Top 10 Countries Producing Netflix Content” identifies that the United States, India, and the United Kingdom are the top contributors, highlighting Netflix’s global content strategy. The distribution of ratings such as TV-MA, TV-14, and R indicates that Netflix primarily caters to mature and teenage audiences.

The bivariate and multivariate analyses explore relationships between different variables. Visualizations such as release trends over the years show Netflix’s rapid content expansion post-2015, coinciding with its international growth phase. Duration-based analysis highlights the difference between movies (in minutes) and TV shows (in seasons). Genre-based word clouds and bar charts uncover the popularity of genres like Dramas, Comedies, and International Movies, demonstrating the audience’s preference for emotionally engaging and diverse storytelling.

Further, a correlation heatmap and a pair plot are used to examine relationships between numeric variables such as release year, year added, and duration. These visualizations give a statistical overview of the dataset and help identify which attributes might influence others. Because most columns are categorical, the numeric correlations are generally weak, so any apparent pattern between release year and duration should be read as tentative.

After the exploratory analysis, the project transitions into content-based clustering, where machine learning techniques such as K-Means, Hierarchical Clustering, or DBSCAN are applied. By transforming text-based attributes like genre and description into numerical representations using methods like TF-IDF (Term Frequency–Inverse Document Frequency), the clustering algorithms group similar titles together. These clusters represent sets of movies or shows with similar content, genre, or style. This step is crucial for building recommendation systems — allowing Netflix to suggest shows that align closely with users’ past preferences.
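The TF-IDF plus K-Means step described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the four descriptions are made-up stand-ins for the `description` column, and `k=2` is an arbitrary choice for the toy data. It assumes scikit-learn is installed.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions standing in for the dataset's 'description' column (illustrative only)
descriptions = [
    "A detective investigates a murder in a small town",
    "Police hunt a killer after a murder shocks the city",
    "Two friends fall in love during a summer wedding",
    "A wedding planner finds love with her best friend",
]

# Turn text into TF-IDF vectors, dropping common English stop words
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(descriptions)

# Group the vectors into k clusters; k=2 is an arbitrary choice for this toy example
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(tfidf_matrix)

# Show which cluster each description landed in
for text, label in zip(descriptions, labels):
    print(label, text)
```

On the real dataset, the number of clusters would normally be chosen with a diagnostic such as the elbow method or silhouette score rather than fixed in advance.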

The business objective of this project is to help Netflix leverage data-driven insights for strategic benefits. The clustering results can improve personalized recommendations, identify content gaps, and support regional and genre-based marketing strategies. For instance, if a cluster reveals an underrepresented genre or region, Netflix can focus its production or licensing efforts accordingly.

In conclusion, this project provides a comprehensive overview of Netflix’s content landscape through detailed visual and statistical analysis. It highlights how Netflix’s catalog has evolved, what types of content dominate the platform, and how advanced clustering can uncover hidden relationships within data. By integrating these insights, Netflix can enhance user satisfaction through better content discovery, optimize production strategies, and maintain its leadership in the competitive streaming industry. The project demonstrates the power of data analytics and machine learning in turning raw data into actionable business intelligence — a critical advantage in today’s data-driven entertainment market.

# **GitHub Link -**

https://github.com/rashij08/Netflix-Content-Clustering.git

# **Problem Statement**


Netflix has a vast library of content spanning multiple genres, durations, and ratings. However, users often face difficulty discovering shows that match their preferences due to the platform’s large catalog. The goal is to perform content-based clustering on the Netflix dataset to group similar shows/movies together based on their attributes such as genre, rating, duration, and description.
This will help Netflix improve its recommendation system, content strategy, and marketing decisions.

#### **Define Your Business Objective?**


The primary business objective is to segment Netflix’s catalog using clustering techniques (such as K-Means, Hierarchical, or DBSCAN) to:

*   Identify content clusters, e.g., grouping shows by genre similarity, release period, or rating.
*   Understand user content preferences indirectly by observing content groupings.
*   Help Netflix make data-driven decisions such as:
    *   Recommending similar types of shows to users.
    *   Identifying underrepresented content clusters for future production.
    *   Enhancing regional and genre-based marketing strategies.

By discovering these insights, Netflix can improve content personalization, curation, and strategic planning, leading to better user engagement and satisfaction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd                     # For data manipulation
import numpy as np                      # For numerical operations
import matplotlib.pyplot as plt         # For visualization
import seaborn as sns                   # For advanced visualization
import missingno as msno                # For missing value visualization

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv("NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")
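
Since the guidelines ask for exception handling, the load step could be wrapped in a small helper along these lines. `load_dataset` is a hypothetical name introduced here for illustration; it is not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    csv_path = Path(path)
    # Check the path before handing it to pandas so the error names the file
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {csv_path}")
    data = pd.read_csv(csv_path)
    # A header-only file parses fine but is useless for analysis
    if data.empty:
        raise ValueError(f"Dataset at {csv_path} has no rows")
    return data
```

In the notebook this would be called as `df = load_dataset("NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")`.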

### Dataset First View

In [None]:
# Dataset First Look
# Display the first 5 rows of the dataset
print("🔹 First 5 Rows of Dataset:")
display(df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, cols = df.shape
print(f"🔹 Number of Rows: {rows}")
print(f"🔹 Number of Columns: {cols}")

### Dataset Information

In [None]:
# Dataset Info
print("🔹 Dataset Info:")
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(f"🔹 Total Duplicate Rows: {duplicate_count}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("🔹 Missing Values in Each Column:")
print(missing_values)

In [None]:
#using Missingno for a cleaner view
msno.bar(df, color='tomato', figsize=(10,5))
plt.title("Missing Values Overview")
plt.show()

### What did you know about your dataset?

The Netflix dataset contains information about movies and TV shows available on the platform.
It includes 12 columns such as title, director, cast, country, release year, rating, and duration.
After inspection:

*   The dataset has roughly 8,800 records and 12 columns.
*   Some columns, such as director, cast, and country, have missing values.
*   No major data type issues were found.
*   There may be a few duplicate rows, which can be removed before analysis.

Overall, apart from these missing values, the dataset is in good shape for further exploration.
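
To put numbers on the missing-value observation, a per-column summary of counts and percentages is often handy. The sketch below uses a small made-up frame; on the real data the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real data (values are made up)
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "United States", np.nan, "Japan"],
})

# Count and percentage of missing values per column, largest share first
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

print(missing_summary)
```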

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# Display all column names in the dataset
print("🔹 Columns in the Dataset:")
print(df.columns.tolist())

In [None]:
# Dataset Describe
# Since most columns are categorical, describe(include='all') gives both numeric and object stats
print("\n🔹 Dataset Description:")
display(df.describe(include='all'))


### Variables Description

The dataset consists of 12 variables, each describing different characteristics of Netflix content:

show_id: Unique identifier for each show/movie.

type: Indicates whether the entry is a Movie or TV Show.

title: Name of the show or movie.

director: Director(s) of the show/movie (may have missing values).

cast: Main actors involved in the show/movie (many missing entries).

country: Country where the content was produced.

date_added: Date when the content was added to Netflix.

release_year: Year when the show/movie was originally released.

rating: Content rating (e.g., TV-MA, PG-13, R, etc.).

duration: Duration of the movie (in minutes) or number of seasons for TV shows.

listed_in: Category or genre(s) of the content (e.g., Drama, Comedy, Action).

description: Short summary of the content.

From the unique value counts:

type has 2 categories (Movie, TV Show).

release_year spans multiple decades (e.g., 1925–2021).

rating has several classifications depending on age restrictions.

listed_in contains multiple genres, often comma-separated.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("\n🔹 Unique Values Count for Each Column:\n")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Remove Duplicate Rows
before = df.shape[0]
df.drop_duplicates(inplace=True)
after = df.shape[0]
print(f"🔹 Removed {before - after} duplicate rows.")


# Handle Missing Values
# Replace missing 'country', 'rating', 'director', 'cast' with 'Unknown'
for col in ['country', 'rating', 'director', 'cast']:
    df[col] = df[col].fillna('Unknown')

# Fill missing 'date_added' with mode (most frequent date)
df['date_added'] = df['date_added'].fillna(df['date_added'].mode()[0])


# Feature Engineering: Clean 'duration'
# Convert duration to numeric (for clustering)
# For Movies -> extract minutes, For TV Shows -> extract number of seasons
def convert_duration(value):
    text = str(value)  # Work on the string form so NaN or numeric inputs do not break .split()
    if 'min' in text:
        return int(text.split(' ')[0])
    elif 'Season' in text:
        # Crude placeholder: treats one season as 60 "minutes" so both types share one scale
        return int(text.split(' ')[0]) * 60
    else:
        return np.nan

df['duration_mins'] = df['duration'].apply(convert_duration)

# Fill missing duration_mins with median value
df['duration_mins'] = df['duration_mins'].fillna(df['duration_mins'].median())


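# Convert 'date_added' to datetime format
# Strip whitespace first, in case some entries have leading spaces that would be coerced to NaT
df['date_added'] = pd.to_datetime(df['date_added'].astype(str).str.strip(), errors='coerce')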

# Extract month and year of addition for trend analysis
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month


# Verify Cleaned Dataset
print("\n🔹 Dataset after cleaning:")
display(df.head())

print("\n🔹 Missing Values after cleaning:")
print(df.isnull().sum())


# Check Data Types
print("\n🔹 Data Types after Wrangling:")
print(df.dtypes)

### What all manipulations have you done and insights you found?

Data Wrangling Summary & Insights:

Removed all duplicate rows to ensure data consistency.

Filled missing categorical values (country, rating, director, cast) with “Unknown” to avoid null-related issues.

Converted duration into a numeric field (duration_mins) for clustering and analysis:

Movies: converted from minutes.

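TV Shows: converted the number of seasons into a rough minute-equivalent (seasons × 60). This is a crude placeholder that puts both content types on one numeric scale, not an estimate of actual runtime.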

Converted date_added into proper datetime format and extracted year_added and month_added for time-based analysis.

Ensured no missing values remain in key columns.

Insights:

Many entries had missing director and cast information — indicating incomplete metadata.

Duration varies widely, showing Netflix has both short films and long TV series.

date_added shows Netflix has been adding more content in recent years, suggesting platform growth.

After these manipulations, the dataset is clean, standardized, and ready for visualization and clustering.
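
Because movie minutes and TV seasons are different units, one alternative to the ×60 conversion is to keep them in separate columns. The sketch below shows that idea on toy rows; the column names `movie_minutes` and `tv_seasons` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Toy rows in the same format as the 'type' and 'duration' columns
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons", "125 min", "1 Season"],
})

# Pull the leading integer out of strings such as "90 min" or "2 Seasons"
duration_number = sample["duration"].str.extract(r"(\d+)", expand=False).astype(float)

# Keep each unit in its own column; rows of the other type stay NaN
is_movie = sample["type"] == "Movie"
sample["movie_minutes"] = duration_number.where(is_movie)
sample["tv_seasons"] = duration_number.where(~is_movie)

print(sample)
```

With this layout, histograms and correlations can be computed per content type without mixing units.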

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# 📊 Chart 1 : Count of Movies vs TV Shows
plt.figure(figsize=(7,5))
ax = sns.countplot(x='type', data=df, palette='coolwarm')
plt.title("Distribution of Content Type on Netflix", fontsize=14, fontweight='bold')
plt.xlabel("Type of Content", fontsize=12)
plt.ylabel("Count", fontsize=12)

# Adding labels on top of bars
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + 0.3, p.get_height() + 50))

plt.show()

##### 1. Why did you pick the specific chart?

I chose a count plot (bar chart) because it effectively shows the comparison between two categorical variables — Movies and TV Shows.
It provides a clear visual representation of how much of Netflix’s content library consists of movies versus TV shows.

##### 2. What is/are the insight(s) found from the chart?

The visualization reveals that Movies dominate the Netflix catalog, making up around 70% of the total content, while TV Shows represent the remaining 30%.
This suggests that Netflix has historically focused more on adding movies compared to TV shows.
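
A count plot shows absolute counts, so the percentage split is easier to confirm numerically. A minimal sketch on made-up rows (on the real data, replace `sample` with `df`):

```python
import pandas as pd

# Toy 'type' column: three movies and one TV show (illustrative only)
sample = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

# normalize=True returns fractions instead of counts; scale to percentages
type_share = (sample["type"].value_counts(normalize=True) * 100).round(1)

print(type_share)
```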

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

Positive Impact: Knowing that movies dominate the catalog helps Netflix assess whether it should expand its TV show portfolio to increase user retention, as series often lead to long-term engagement.

Strategic Decision: If the platform observes higher user watch time or subscription retention for TV shows, it can rebalance the content ratio by producing or acquiring more series.

Risk of Negative Growth: The chart itself shows no negative trend. However, over-prioritizing one category might reduce diversity, so maintaining a healthy balance is key.

#### Chart - 2

In [None]:
# 📊 Chart 2 : Top 10 Countries with Most Netflix Titles
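# Split 'country' column since multiple countries can be listed for a single show
country_data = df['country'].dropna().astype(str).str.split(',')
country_counts = country_data.explode().str.strip()
# Exclude the 'Unknown' placeholder that was filled in for missing countries
country_counts = country_counts[country_counts != 'Unknown'].value_counts().head(10)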

plt.figure(figsize=(10,6))
sns.barplot(x=country_counts.values, y=country_counts.index, palette='viridis')
plt.title("Top 10 Countries Producing Netflix Content", fontsize=14, fontweight='bold')
plt.xlabel("Number of Titles", fontsize=12)
plt.ylabel("Country", fontsize=12)

# Add labels to bars
for index, value in enumerate(country_counts.values):
    plt.text(value + 5, index, str(value), va='center')

plt.show()

##### 1. Why did you pick the specific chart?

I selected a horizontal bar chart because it clearly shows which countries contribute the most content to Netflix’s catalog.
Horizontal bars are especially effective when displaying categorical data with long text labels like country names.
This visualization helps compare multiple categories quickly and highlights top-performing countries.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the United States produces the largest number of Netflix titles, followed by India, the United Kingdom, Canada, and Japan.
This indicates that Netflix’s content library is heavily dominated by English-speaking countries, although it also includes a growing amount of international content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.

Positive Impact: These insights help Netflix understand which regions are its strongest production bases and where it can expand its regional partnerships.

For instance, since India ranks high, Netflix can continue to invest in regional Indian-language content to capture a diverse user base.

Strategic Opportunity: Countries with fewer titles (e.g., African or Middle Eastern regions) represent untapped markets that could boost global reach.

Risk of Negative Growth: Over-dependence on U.S.-based content could lead to saturation and reduced cultural diversity, which may affect global subscriptions.
Thus, balancing global content creation can strengthen Netflix’s market presence.

#### Chart - 3

In [None]:
# Chart - 3
# 📊 Chart 3 : Trend of Netflix Content Added Over the Years

# Convert 'date_added' to datetime and extract year
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['year_added'] = df['date_added'].dt.year

# Count number of titles added each year
yearly_counts = df['year_added'].value_counts().sort_index()

# Plot
plt.figure(figsize=(10,6))
sns.lineplot(x=yearly_counts.index, y=yearly_counts.values, marker='o', color='coral')
plt.title("Trend of Netflix Content Added Over the Years", fontsize=14, fontweight='bold')
plt.xlabel("Year Added", fontsize=12)
plt.ylabel("Number of Titles", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.6)

# Add labels to each point
for x, y in zip(yearly_counts.index, yearly_counts.values):
    plt.text(x, y + 5, str(y), ha='center')

plt.show()


##### 1. Why did you pick the specific chart?

I chose a line chart because it clearly shows the trend of Netflix content additions over time. It helps visualize how Netflix expanded its library year by year.

##### 2. What is/are the insight(s) found from the chart?

The chart shows a significant increase in the number of titles added after 2015, peaking around 2018–2020. This indicates Netflix’s aggressive content expansion strategy during its global growth phase.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive impact: This growth trend suggests strong content production and acquisition strategies that attracted global audiences. It also shows Netflix’s investment in diverse content to strengthen its market position.

Negative insight: If there’s a slight drop after 2020, it could be due to content saturation or strategic reduction, which might affect the perception of new variety but could improve quality control and brand consistency.

#### Chart - 4

In [None]:
# 📊 Chart 4 : Top 10 Directors with Most Titles

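director_counts = df['director'].dropna().astype(str).str.split(',')
director_counts = director_counts.explode().str.strip()
# Exclude the 'Unknown' placeholder that was filled in for missing directors
director_counts = director_counts[director_counts != 'Unknown'].value_counts().head(10)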

plt.figure(figsize=(10,6))
sns.barplot(x=director_counts.values, y=director_counts.index, palette='mako')
plt.title("Top 10 Directors with Most Titles", fontsize=14, fontweight='bold')
plt.xlabel("Number of Titles", fontsize=12)
plt.ylabel("Director", fontsize=12)

for index, value in enumerate(director_counts.values):
    plt.text(value + 2, index, str(value), va='center')

plt.show()

##### 1. Why did you pick the specific chart?

A bar chart effectively highlights which directors have the most Netflix titles.

##### 2. What is/are the insight(s) found from the chart?

Certain directors like Raul Campos and Marcus Raboy dominate the list, showing their frequent collaborations with Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Netflix identify top-performing creators and maintain partnerships.
No negative growth — consistency in creative collaboration improves content quality.

#### Chart - 5

In [None]:
# 📊 Chart 5 : Top 10 Actors/Actresses on Netflix

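cast_data = df['cast'].dropna().astype(str).str.split(',')
cast_counts = cast_data.explode().str.strip()
# Exclude the 'Unknown' placeholder that was filled in for missing cast lists
cast_counts = cast_counts[cast_counts != 'Unknown'].value_counts().head(10)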

plt.figure(figsize=(10,6))
sns.barplot(x=cast_counts.values, y=cast_counts.index, palette='cool')
plt.title("Top 10 Actors/Actresses Appearing in Netflix Titles", fontsize=14, fontweight='bold')
plt.xlabel("Number of Titles", fontsize=12)
plt.ylabel("Actor/Actress", fontsize=12)

for index, value in enumerate(cast_counts.values):
    plt.text(value + 2, index, str(value), va='center')

plt.show()


##### 1. Why did you pick the specific chart?

Shows the most frequently featured actors/actresses — a bar chart is perfect for this comparison.

##### 2. What is/are the insight(s) found from the chart?

A few actors appear across many shows/movies, which suggests that some performers are cast repeatedly in the titles Netflix carries.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Netflix strengthen relationships with popular actors who drive viewership.
No negative trend — repeated appearances reinforce brand familiarity.

#### Chart - 6

In [None]:
# 📊 Chart 6 : Distribution of Content Ratings

rating_counts = df['rating'].value_counts().head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=rating_counts.values, y=rating_counts.index, palette='viridis')
plt.title("Distribution of Netflix Content Ratings", fontsize=14, fontweight='bold')
plt.xlabel("Number of Titles", fontsize=12)
plt.ylabel("Rating", fontsize=12)

for index, value in enumerate(rating_counts.values):
    plt.text(value + 10, index, str(value), va='center')

plt.show()


##### 1. Why did you pick the specific chart?

A bar chart clearly shows which content ratings dominate the platform.

##### 2. What is/are the insight(s) found from the chart?

Ratings like TV-MA and TV-14 appear most often, suggesting a focus on mature and teen audiences.
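
To make the "mature and teen audiences" reading explicit, rating codes can be grouped into audience bands. The mapping below is an assumption made for illustration (the band names and the assignment of each code are not from the dataset) and should be adjusted as needed.

```python
import pandas as pd

# Hypothetical grouping of rating codes into audience bands (an assumption, not from the data)
AUDIENCE_BY_RATING = {
    "TV-MA": "Adults", "R": "Adults", "NC-17": "Adults",
    "TV-14": "Teens", "PG-13": "Teens",
    "TV-PG": "Family", "PG": "Family",
    "TV-Y": "Kids", "TV-Y7": "Kids", "TV-G": "Kids", "G": "Kids",
}

# Toy ratings, including the 'Unknown' placeholder used for missing values
ratings = pd.Series(["TV-MA", "TV-14", "R", "TV-Y", "PG", "Unknown"])

# Codes without a mapping (e.g. 'Unknown') fall into an 'Other' band
audience = ratings.map(AUDIENCE_BY_RATING).fillna("Other")
audience_counts = audience.value_counts()

print(audience_counts)
```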

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — helps Netflix target audience preferences effectively.
Minor negative — less family content might limit younger demographics.

#### Chart - 7

In [None]:
# 📊 Chart 7 : Most Common Genres on Netflix

genres = df['listed_in'].dropna().astype(str).str.split(',')
genre_counts = genres.explode().str.strip().value_counts().head(10)

plt.figure(figsize=(10,6))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='cubehelix')
plt.title("Top 10 Genres on Netflix", fontsize=14, fontweight='bold')
plt.xlabel("Number of Titles", fontsize=12)
plt.ylabel("Genre", fontsize=12)

for index, value in enumerate(genre_counts.values):
    plt.text(value + 5, index, str(value), va='center')

plt.show()


##### 1. Why did you pick the specific chart?

Shows genre diversity — bar chart provides easy comparison.

##### 2. What is/are the insight(s) found from the chart?

“International Movies” and “Dramas” dominate, showing global appeal and demand for emotional storytelling.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Netflix prioritize investment in high-demand genres.
No negative insight — popular genres drive engagement.

#### Chart - 8

In [None]:
# 📊 Chart 8 : Movies vs TV Shows Added Over the Years

df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['year_added'] = df['date_added'].dt.year

type_year = df.groupby(['year_added', 'type']).size().unstack().fillna(0)

# DataFrame.plot creates its own figure, so a separate plt.figure() call would leave an empty one
type_year.plot(kind='line', marker='o', figsize=(10,6))
plt.title("Movies vs TV Shows Added Over the Years", fontsize=14, fontweight='bold')
plt.xlabel("Year Added", fontsize=12)
plt.ylabel("Number of Titles", fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()


##### 1. Why did you pick the specific chart?

Line charts highlight year-wise growth trends for both content types.

##### 2. What is/are the insight(s) found from the chart?

Movies dominate early years; TV Shows rise rapidly after 2017.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — helps plan future balance between series and films.
No negative impact unless one format overshadows user demand.

#### Chart - 9

In [None]:
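# 📊 Chart 9 : Distribution of Movie Durations

# Work on a copy so adding a column does not trigger SettingWithCopyWarning
movies = df[df['type'] == 'Movie'].copy()
movies['duration_minutes'] = movies['duration'].str.replace(' min', '', regex=False).astype(float)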

plt.figure(figsize=(8,6))
sns.histplot(movies['duration_minutes'], bins=20, kde=True, color='teal')
plt.title("Distribution of Movie Duration on Netflix", fontsize=14, fontweight='bold')
plt.xlabel("Duration (Minutes)", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Histogram shows distribution and range of movie durations.

##### 2. What is/are the insight(s) found from the chart?

Most movies fall between 80–120 minutes.
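
The share of movies inside a duration band can be checked directly rather than read off the histogram. A minimal sketch with made-up durations (on the real data, use `movies['duration_minutes']`):

```python
import pandas as pd

# Toy movie durations in minutes (illustrative only)
movie_minutes = pd.Series([45, 85, 95, 100, 110, 118, 130, 160], name="duration_minutes")

# between() is inclusive on both ends by default
in_range = movie_minutes.between(80, 120)
share_in_range = in_range.mean() * 100

# Quartiles give a compact summary of the spread
quartiles = movie_minutes.quantile([0.25, 0.5, 0.75])

print(f"{share_in_range:.1f}% of movies run 80 to 120 minutes")
print(quartiles)
```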

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — helps Netflix understand preferred movie length for audiences.
No negative trend — consistent durations align with global standards.

#### Chart - 10

In [None]:
# 📊 Chart 10 : Content Share by Type

type_counts = df['type'].value_counts()

plt.figure(figsize=(6,6))
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', colors=['#66b3ff','#ff9999'])
plt.title("Share of Movies vs TV Shows", fontsize=14, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart visually represents proportion between Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

Movies form the majority of Netflix content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — visual clarity for decision-making.
No negative growth, but balance could improve audience retention.

#### Chart - 11

In [None]:
# 📊 Chart 11 : Top 10 Years with Most Releases

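# Take the 10 years with the most titles, then sort by year so the value labels line up with the bars
yearly_release = df['release_year'].value_counts().head(10).sort_index()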

plt.figure(figsize=(10,6))
sns.barplot(x=yearly_release.index, y=yearly_release.values, palette='plasma')
plt.title("Top 10 Years with Most Netflix Releases", fontsize=14, fontweight='bold')
plt.xlabel("Release Year", fontsize=12)
plt.ylabel("Number of Titles", fontsize=12)

for index, value in enumerate(yearly_release.values):
    plt.text(index, value + 10, str(value), ha='center')

plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for year-wise comparisons.

##### 2. What is/are the insight(s) found from the chart?

Most titles were released after 2015, showing Netflix’s modern content dominance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive: suggests that the catalog skews toward recent releases.
No negative aspect unless old titles decline in relevance.

#### Chart - 12

In [None]:
# 📊 Chart 12 : Number of Titles by Country and Type

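# Exclude the 'Unknown' placeholder; note this groups on the raw country string, so co-productions count as their own entry
country_type = df[df['country'] != 'Unknown'].groupby(['country', 'type']).size().reset_index(name='count')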
top_countries = country_type.groupby('country')['count'].sum().nlargest(5).index
filtered_data = country_type[country_type['country'].isin(top_countries)]

plt.figure(figsize=(10,6))
sns.barplot(data=filtered_data, x='country', y='count', hue='type', palette='viridis')
plt.title("Number of Titles by Country and Type", fontsize=14, fontweight='bold')
plt.xlabel("Country", fontsize=12)
plt.ylabel("Number of Titles", fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Grouped bar chart shows how each country contributes across Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

The US dominates both categories, while India leans more towards movies.
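
The chart groups on the raw `country` string, so a title listed as "United States, India" is counted under neither country alone. One way to credit each listed country is to split and explode before cross-tabulating, sketched here on made-up rows:

```python
import pandas as pd

# Toy rows; the first title lists two countries (illustrative only)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "country": ["United States, India", "United States", "India", "Unknown"],
})

# One row per (title, country) pair, with whitespace trimmed
exploded = sample.assign(country=sample["country"].str.split(",")).explode(
    "country", ignore_index=True
)
exploded["country"] = exploded["country"].str.strip()

# Drop the placeholder used for missing countries, then cross-tabulate
exploded = exploded[exploded["country"] != "Unknown"]
country_type_table = pd.crosstab(exploded["country"], exploded["type"])

print(country_type_table)
```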

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — helps in regional strategy planning.
No negative aspect unless imbalance affects diversity.

#### Chart - 13

In [None]:
# 📊 Chart 13 : Word Cloud of Movie Descriptions

from wordcloud import WordCloud

text = " ".join(df['description'].dropna())
wordcloud = WordCloud(width=1000, height=600, background_color='white', colormap='cool').generate(text)

plt.figure(figsize=(10,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Most Common Words in Netflix Descriptions", fontsize=14, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

A word cloud visually emphasizes frequently used words in descriptions.

##### 2. What is/are the insight(s) found from the chart?

Words like love, family, crime, drama, and life dominate, indicating popular content themes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive — shows Netflix’s thematic focus areas.
No negative impact — reinforces brand identity around emotional and dramatic content.

#### Chart - 14 - Correlation Heatmap

In [None]:
# 📊 Chart 14 : Correlation Heatmap

# Select numeric columns only
numeric_df = df.select_dtypes(include=['number'])

# Create correlation matrix
corr_matrix = numeric_df.corr()

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Numeric Features", fontsize=14, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a correlation heatmap because it visually represents relationships between numeric variables. It’s ideal for understanding how features like release year, duration, or other numeric fields relate to each other.

##### 2. What is/are the insight(s) found from the chart?

Since most columns in the Netflix dataset are categorical, the correlations among the numeric columns (release_year, duration_mins, year_added, month_added) are generally weak. Note that duration_mins mixes movie minutes with the seasons × 60 proxy for TV shows, so any apparent link between release year and duration should be treated with caution.

#### Chart - 15 - Pair Plot

In [None]:
# 📊 Chart 15 : Pair Plot

# Convert 'date_added' to datetime and extract numeric columns
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['year_added'] = df['date_added'].dt.year

# Prepare a smaller dataset with only numeric columns
pair_df = df[['release_year', 'year_added']].dropna()

# Plot pair plot (pairplot creates its own figure, so no separate plt.figure() call is needed)
sns.pairplot(pair_df, diag_kind='kde', corner=True, plot_kws={'alpha':0.6})
plt.suptitle("Pair Plot of Numeric Features", y=1.02, fontsize=14, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot allows visualization of relationships and distributions between multiple numeric variables simultaneously. It helps detect linear or non-linear patterns, outliers, and feature relationships.

##### 2. What is/are the insight(s) found from the chart?

In the Netflix dataset, numeric relationships (like between release_year and year_added) show that most titles were released and added around the same few years, indicating consistent catalog updates of recent content.
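
The relationship between `release_year` and `year_added` can also be summarized as a lag in years. The sketch below uses made-up values, and `years_to_netflix` is a hypothetical column name introduced for illustration.

```python
import pandas as pd

# Toy release and addition years (illustrative only)
sample = pd.DataFrame({
    "release_year": [2019, 2015, 2020, 1998],
    "year_added": [2019, 2018, 2021, 2020],
})

# Years between original release and arrival on the platform
sample["years_to_netflix"] = sample["year_added"] - sample["release_year"]

# Share of titles added within one year of release, and the typical lag
added_within_one_year = (sample["years_to_netflix"] <= 1).mean() * 100
median_lag = sample["years_to_netflix"].median()

print(f"{added_within_one_year:.1f}% added within a year; median lag {median_lag} years")
```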

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the visual analyses and insights from all the charts, the client (Netflix or any similar streaming platform) should focus on data-driven content strategy to achieve the business objective of maximizing viewer engagement and subscriber growth.

Key recommendations:

Invest in High-Performing Regions:
Countries like the United States, India, and the United Kingdom dominate content production — continuing to strengthen collaborations in these regions can expand the audience base.

Optimize Genre & Rating Mix:
Since genres like Dramas, Comedies, and International Movies are most popular, Netflix should create or acquire more titles in these categories while balancing age ratings to include more family-friendly options.

Focus on Recent Releases:
Data shows most content is added after 2015 — maintaining this trend of modern, up-to-date releases will keep the platform relevant and appealing.

Encourage Top Directors and Actors:
Building partnerships with frequently featured directors and actors can ensure consistent quality and strong brand association.

Maintain Optimal Content Length:
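Most movies in the catalog run between 80 and 120 minutes. The dataset has no viewership data, so this reflects what is common rather than what is successful, but it can still serve as a reference point for content development and editing decisions.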

Leverage Predictive Analytics:
Using trend analysis on the catalog, together with viewership data that this dataset does not include, Netflix can forecast content performance, recommend shows more effectively, and plan production schedules.

# **Conclusion**

In conclusion, the analysis of the Netflix dataset provides valuable insights into the platform’s content distribution, regional presence, and strategic focus. The findings reveal that movies dominate over TV shows, with the United States, India, and the United Kingdom being the top content-producing countries. Over the years, Netflix has shown a strong upward trend in content addition, especially after 2015, reflecting its rapid global expansion.

The most common genres such as Dramas, Comedies, and International Movies indicate a strong audience preference for emotional and globally relatable stories. The dominance of ratings like TV-MA and TV-14 shows that Netflix primarily caters to mature and teen audiences.

Overall, this project demonstrates how data visualization and analysis can uncover actionable insights that help streaming platforms like Netflix optimize their content strategy, improve audience targeting, and make data-backed business decisions. The insights gained can guide Netflix to maintain its competitive edge by focusing on regional diversity, popular genres, and consistent quality content to maximize viewer satisfaction and global growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***