# **Netflix project**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
name - Shobhit Kubde


# **Project Summary -**

This project focuses on analyzing and predicting patterns in Netflix’s dataset using advanced data preprocessing, exploratory data analysis, and machine learning techniques. We began by cleaning and transforming the raw data, handling missing values, encoding categorical variables, and scaling numerical features for consistency. Feature engineering was applied to derive meaningful insights, followed by model building using two machine learning algorithms.

Cross-validation and hyperparameter tuning were performed to optimize model performance, with evaluation metrics such as accuracy, precision, recall, and F1-score guiding the selection of the best-performing model. The final chosen model achieved strong predictive accuracy and interpretability, with feature importance analysis highlighting key factors influencing predictions.

This end-to-end pipeline not only offers reliable predictions but also provides actionable insights into viewing trends, content categorization, and potential customer engagement patterns, making it a valuable tool for strategic decision-making in the streaming domain.

# **GitHub Link -**

https://github.com/shobhitkubde19-commits/netflix-project-shobhit.git

# **Problem Statement**


Netflix, one of the leading streaming platforms worldwide, offers a vast library of movies, TV shows, and documentaries to millions of subscribers. However, understanding patterns in content availability, user preferences, and trends is crucial for enhancing customer satisfaction, improving content strategy, and driving business growth. The challenge lies in analyzing the existing dataset to uncover insights such as popular genres, distribution of content across countries, trends over time, and factors influencing audience engagement. By leveraging machine learning, the goal is to predict content performance and assist Netflix in making data-driven decisions to optimize its content library and improve viewer retention.

#### **Define Your Business Objective?**

**The primary business objective is to analyze and model Netflix’s content data to derive actionable insights that can guide strategic decisions. This includes:**

* Understanding Content Trends – Identify popular genres, release patterns, and regional preferences to optimize content acquisition and production strategies.
* Predicting Content Performance – Use machine learning models to estimate the potential success of shows and movies, aiding in investment and marketing decisions.

* Improving User Engagement – Recommend data-backed strategies for curating and personalizing the content library to enhance subscriber satisfaction and retention.
* Supporting Data-Driven Decision Making – Provide Netflix with analytical dashboards and predictive models to ensure measurable improvements in business impact.





# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
uploaded = files.upload()


### Dataset First View

In [None]:
# Dataset First Look
df = pd.read_csv('NETFLIX.csv')
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("rows:",df.shape[0])
print("columns:",df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicated_count = df.duplicated().sum()
print("no of duplicated rows:", duplicated_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isnull().sum()
print("count of missing values: ", missing_values)

In [None]:
# Visualizing the missing values
!pip install missingno
import missingno as msno
msno.bar(df)
plt.show()



### What did you know about your dataset?

*The dataset contains information about Netflix shows and movies, including attributes such as title, type (Movie/TV Show), director, cast, country, release year, date added to Netflix, rating, duration, genre, and description. The data spans multiple years and regions, giving a global perspective on content distribution.*

**From my analysis, I found:**

* Type Distribution – A larger proportion of movies compared to TV shows.

* Country Trends – The U.S., India, and the U.K. dominate in content production.

* Release Patterns – Most content is recent, with a significant spike after 2015.
* Genre Insights – Drama, Comedy, and International content are highly represented.

* Missing Values – Certain fields like director and cast had missing entries, which were handled during preprocessing.

Overall, the data offers both categorical and numerical features, requiring a mix of encoding and scaling techniques before feeding into machine learning models.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
#shows Columns name
df.columns

In [None]:
df.info()
df.dtypes

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")


In [None]:
df[['type', 'title', 'director', 'cast']].sample(5)


In [None]:
df['type'].value_counts()
df['rating'].value_counts()


In [None]:
df['description'].str.len().describe()


In [None]:
df['date_added'] = pd.to_datetime(df['date_added'], format='mixed', errors='coerce')
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
##Step 1: Identify and Handle Missing Values
df.isnull().sum().sort_values(ascending=False)


#Handling Missing Values (Column-wise Guide)
#director
df['director'] = df['director'].fillna('Unknown')
#cast
df['cast'] = df['cast'].fillna('Not available')
#country
df['country'] = df['country'].fillna('Unknown')
#dateadded Dropping null values as the count was only 10
df.dropna(subset=['date_added'], inplace=True)
# rating
df['rating'] = df['rating'].fillna('not rated')

#checking againg for null values
df.isnull().sum()
#getting overview
df.head(20)




In [None]:
# Step 2: Clean & Parse duration Column
df[['duration_int', 'duration_unit']] = df['duration'].str.extract(r'(\d+)\s*(\D+)')
df['duration_int'] = pd.to_numeric(df['duration_int'], errors='coerce')

In [None]:
#Step 3: Standardize Categorical Text Columns
cols_to_clean = ['type', 'rating', 'country', 'duration_unit']
for col in cols_to_clean:
    df[col] = df[col].str.strip().str.lower()



In [None]:
df[['duration', 'duration_int', 'duration_unit']].head(10)


In [None]:
#df[['duration_int', 'duration_unit']].isnull().sum()

#cols_to_clean = ['type', 'rating', 'country', 'duration_unit']

#for col in cols_to_clean:
    #df[col] = df[col].str.strip().str.lower()
#for col in cols_to_clean:
    #print(f"\nUnique values in {col}:")
    #print(df[col].value_counts())
df['duration_unit'] = df['duration_unit'].replace('seasons', 'season')
df['duration_unit'].value_counts()


In [None]:
# Step 4: Splitting listed_in (Genres)
df['genre_list'] = df['listed_in'].str.split(', ')


In [None]:
#print(df.columns)  # To check for column name typos
#print(df['listed_in'].head(5))  # To view actual content
from collections import Counter
import pandas as pd

# Split 'listed_in' again (in case earlier didn't take effect)
df['genre_list'] = df['listed_in'].str.split(', ')

# Flatten the genre list and count occurrences
genre_counts = Counter(genre for sublist in df['genre_list'].dropna() for genre in sublist)

# Convert to DataFrame
genre_df = pd.DataFrame(genre_counts.items(), columns=['genre', 'count']).sort_values(by='count', ascending=False)

# View top 20 genres
print(genre_df.head(20))


### What all manipulations have you done and insights you found?

#  ***Data Wrangling Report***
1. Column Overview & Data Types
*   Reviewed all 13 original columns.
*   Identified relevant types: object, int, datetime.
*   Converted date_added from object to datetime.Answer Here.

2. Missing Value Handling
*  director (2,389 missing) → filled with 'Unknown'
*  cast (718 missing) → filled with 'Not Available'
*   country (507 missing) → filled with 'Unknown'
*   date_added (10 missing) → rows dropped
*   rating (7 missing) → filled with 'Not Rated'
*   **Result**: No missing values remain in the dataset.

3. Duration Parsing
*  Split duration into: duration_int: numeric part (e.g., 90, 2)
*  duration_unit: time unit (e.g., 'min', 'season')
*   Standardized unit formatting (seasons → season)

4. Categorical Column Cleaning
*  Standardized text formatting in: type, rating, country, duration_unit
*  Actions: * Converted to lowercase, Removed whitespace,  Ensured consistent values

5. Genre Column Preparation
*  Split listed_in into genre_list (a list of genres per title)
*  Counted total genre occurrences across all records

7. Top Genres Identified:
*  International Movies
*  Dramas
*  Comedies
*  Action & Adventure
*  Documentaries

# Final Cleaned Dataset Highlights:


*  All values are clean and ready for analysis
*  No missing data
*  Categorical variables are standardized
*  Columns are split appropriately for filtering and visualization




## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Distribution of Content Types (Movies vs TV Shows)

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt

import matplotlib.pyplot as plt

# Clean the 'type' column
df['type'] = df['type'].fillna('unknown').str.strip().str.lower()

# Count of each type
type_counts = df['type'].value_counts()

# Plot
plt.figure(figsize=(6, 4))
type_counts.plot(kind='bar', color=['skyblue', 'lightcoral'])
plt.title('Distribution of Content Types on Netflix')
plt.ylabel('Number of Titles')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal to compare discrete categories like 'movie' and 'tv show'. It clearly shows the quantity difference in an easy-to-read format.

##### 2. What is/are the insight(s) found from the chart?

* There are significantly more movies than TV shows on Netflix.
* This suggests Netflix's catalog is currently more focused on one-off content than episodic series.Answer Here





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**POSITIVE IMPACT**

*  High movie volume means quicker content consumption — ideal for casual/binge users.
*  Movies require less commitment than series, attracting more diverse user engagement.

**Potential Negative Growth Area:**

*  If TV shows are underrepresented, Netflix may lose long-term viewer retention since series keep users coming back.
*  Competitors like Prime Video or Disney+ might fill this episodic content gap.





#### Chart - 2 Titles Added Over the Years

In [None]:
# Chart - 2 visualization code
import matplotlib.pyplot as plt

# Extract year from 'date_added'
df['year_added'] = df['date_added'].dt.year

# Count titles per year
titles_per_year = df['year_added'].value_counts().sort_index()

# Plot
plt.figure(figsize=(10, 5))
titles_per_year.plot(kind='bar', color='mediumseagreen')
plt.title('Number of Titles Added to Netflix Each Year')
plt.xlabel('Year')
plt.ylabel('Number of Titles')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3 Top 10 Countries by Content Count

In [None]:
# Chart - 3 visualization code
df['country'].value_counts().head(10).plot(kind='barh', figsize=(8, 5), color='orange')
plt.title('Top 10 Countries by Number of Titles')
plt.xlabel('Number of Titles')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4 Content Ratings Distribution

In [None]:
# Chart - 4 visualization code
df['rating'].value_counts().head(10).plot(kind='bar', figsize=(8, 4), color='slateblue')
plt.title('Top 10 Content Ratings on Netflix')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5 Top 10 Most Common Genres

In [None]:
# Chart - 5 visualization code
from collections import Counter
genre_counts = Counter(genre for sublist in df['genre_list'].dropna() for genre in sublist)
genre_df = pd.DataFrame(genre_counts.items(), columns=['genre', 'count']).sort_values(by='count', ascending=False)
genre_df.head(10).plot(kind='barh', x='genre', y='count', figsize=(8, 5), color='tomato')
plt.title('Top 10 Most Common Netflix Genres')
plt.xlabel('Number of Titles')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6 Top Genres by Type (Movie vs TV Show)

In [None]:
# Chart - 6 visualization code
from collections import defaultdict
genre_type_map = defaultdict(lambda: {'movie': 0, 'tv show': 0})
for _, row in df.iterrows():
    for genre in row['genre_list']:
        genre_type_map[genre][row['type']] += 1
genre_type_df = pd.DataFrame(genre_type_map).T.sort_values(by='movie', ascending=False).head(10)
genre_type_df[['movie', 'tv show']].plot(kind='bar', figsize=(10, 6))
plt.title('Top Genres by Content Type')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7 Genre Trends by Release Year (Top 3 Genres)

In [None]:
# Chart - 7 visualization code
top_genres = genre_df.head(3)['genre'].tolist()
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')
trend_data = {genre: [] for genre in top_genres}
years = sorted(df['release_year'].dropna().unique())
for year in years:
    subset = df[df['release_year'] == year]
    genre_list = sum(subset['genre_list'].dropna(), [])
    genre_count = Counter(genre_list)
    for genre in top_genres:
        trend_data[genre].append(genre_count.get(genre, 0))
pd.DataFrame(trend_data, index=years).plot(figsize=(10, 5))
plt.title('Top Genres Over Time (by Release Year)')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8  Average Movie Duration

In [None]:
# Chart - 8 visualization code
avg_duration = df[df['type'] == 'movie']['duration_int'].dropna().astype(int)
avg_duration.plot(kind='hist', bins=30, color='darkgreen', figsize=(8, 4))
plt.title('Distribution of Movie Durations')
plt.xlabel('Duration (minutes)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9 Distribution of TV Show Seasons

In [None]:
# Chart - 9 visualization code
tv_show_seasons = df[(df['type'] == 'tv show') & (df['duration_unit'] == 'season')]['duration_int'].dropna().astype(int)
tv_show_seasons.plot(kind='hist', bins=15, color='mediumpurple', figsize=(8, 4))
plt.title('Distribution of TV Show Seasons')
plt.xlabel('Number of Seasons')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10 Average Duration by Year (Movies Only)

In [None]:
# Chart - 10 visualization code
avg_duration_by_year = df[df['type'] == 'movie'].groupby('release_year')['duration_int'].mean()
avg_duration_by_year.plot(kind='line', figsize=(10, 5), color='teal')
plt.title('Average Movie Duration by Release Year')
plt.xlabel('Release Year')
plt.ylabel('Average Duration (minutes)')
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11 Top 10 Most Frequent Directors

In [None]:
# Chart - 11 visualization code

df['director'].value_counts().drop('Unknown').head(10).plot(kind='barh', figsize=(8, 5), color='salmon')
plt.title('Top 10 Most Frequent Directors on Netflix')
plt.xlabel('Number of Titles')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12  Genre Diversity per Country (Top 5 Countries)

In [None]:
# Chart - 12 visualization code
top_countries = df['country'].value_counts().head(5).index.tolist()
genre_diversity = {}
for country in top_countries:
    genres = []
    for _, row in df[df['country'] == country].iterrows():
        genres.extend(row['genre_list'])
    genre_diversity[country] = len(set(genres))
pd.Series(genre_diversity).plot(kind='bar', figsize=(8, 4), color='gold')
plt.title('Genre Diversity in Top 5 Countries')
plt.ylabel('Number of Unique Genres')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13 Titles Added by Month

In [None]:
# Chart - 13 visualization code
df['month_added'] = df['date_added'].dt.month_name()
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
df['month_added'].value_counts().reindex(month_order).plot(kind='bar', figsize=(10, 4), color='steelblue')
plt.title('Number of Titles Added by Month')
plt.xlabel('Month')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select only numerical columns for correlation
numerical_cols = ['release_year', 'duration_int', 'year_added']
corr = df[numerical_cols].corr()

# Plot heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap: Numeric Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Based on the data analysis and machine learning models applied, the client (Netflix) can enhance its content recommendation and acquisition strategy by:


* Personalized Recommendations – Leverage the trained model to recommend shows/movies based on viewer preferences, improving watch time and customer retention.

* Content Acquisition Strategy – Use insights from genre, country, and release trends to focus on producing/acquiring content types that are in high demand.
* Targeted Marketing Campaigns – Identify user segments most likely to engage with specific genres or formats, optimizing marketing spend.


* Optimized Release Planning – Schedule releases during peak engagement periods identified from historical trends.

* Data-Driven Content Mix – Maintain a balanced portfolio between movies and TV shows while expanding in regions with high growth potential.
* **By implementing these recommendations, Netflix can improve user engagement, subscription renewal rates, and global market penetration, directly aligning with the business objective of maximizing viewer satisfaction and revenue growth.**







Answer Here.

# **Conclusion**

***In this project, we systematically processed the dataset through comprehensive data preprocessing and feature engineering, ensuring that the input features were clean, relevant, and scaled for optimal model performance. We explored multiple machine learning models, performed cross-validation, and applied hyperparameter tuning to enhance predictive accuracy. Evaluation metrics were carefully chosen to align with the business objective, ensuring our model not only performed well statistically but also delivered actionable insights for decision-making. The final selected model demonstrated strong generalization capabilities and clear feature importance, enabling both accurate predictions and interpretability. This end-to-end pipeline—from raw data to an optimized prediction model—provides a robust and scalable solution for future business needs.***

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***