<a href="https://colab.research.google.com/github/srijit78/Netflix/blob/main/2_Srijit_Das_Netfilx_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV Shows EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### Name - Srijit Das


# **Project Summary -**

This Exploratory Data Analysis (EDA) project focuses on the Netflix Movies and TV Shows dataset, aiming to extract meaningful insights and identify patterns in the platform's content over time. The dataset includes over 7700 entries and features attributes like title, type, country, release year, added date, rating, and genre. Using the UBM (Univariate, Bivariate, and Multivariate) approach, the analysis explores various dimensions such as content type distribution, geographical contributions, genre popularity, and trends over time.

In the univariate section, we examine individual features like the proportion of movies to TV shows, top contributing countries, and distribution of content ratings. This gives a general overview of the dataset composition.

In the bivariate analysis, we delve deeper by exploring relationships between pairs of variables. For example, we analyze how content type varies with rating, which countries favor which type of content, and how release year trends differ across movies and TV shows.

The multivariate section builds upon the previous insights by analyzing how multiple variables interact. We chart content growth trends across years for both content types, examine genre preferences by content type, and compare rating distribution with type and release year.


# **Problem Statement**


To analyze Netflix content data to uncover trends and insights about content type, popularity by region, genre distribution, and user engagement patterns based on ratings and release years.

#### **Define Your Business Objective?**

To guide Netflix in making data-driven decisions for content acquisition, regional expansion, and platform personalization by analyzing patterns in movie/TV content additions, genres, ratings, and user-preferred content formats over time.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset

netflix_data = pd.read_csv("/content/drive/MyDrive/2. internship works/Labmentix /Projects/Netflix/1. NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv")
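
The guidelines above ask for exception handling and a notebook that runs end to end. A minimal sketch of a guarded loader follows; `load_dataset` is a hypothetical helper name and the demo file is created on the fly, so neither is part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(csv_path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at: {path}")
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Dataset at {path} has no rows")
    return df


# Demo on a throwaway file (illustrative data, not the Netflix dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("show_id,type\ns1,Movie\ns2,TV Show\n")
    demo_df = load_dataset(demo_path)

# A missing path raises a descriptive error instead of failing later in the notebook
try:
    load_dataset("does_not_exist.csv")
    missing_file_error = None
except FileNotFoundError as err:
    missing_file_error = str(err)
```

In the notebook itself, the same helper could be called with the Drive path used above in place of the demo file.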

### Dataset First View

In [None]:
# Dataset First Look

netflix_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Rows:", netflix_data.shape[0])
print("Columns:", netflix_data.shape[1])

### Dataset Information

In [None]:
# Dataset Info

netflix_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

print('Number of duplicates in dataset:', netflix_data.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

missing_values = pd.DataFrame({
    "missing_values": netflix_data.isnull().sum(),
    "percentage (%)": netflix_data.isnull().mean() * 100,
}).sort_values("missing_values", ascending=False)

print(missing_values)

In [None]:
# Visualizing the missing values

plt.figure(figsize=(10, 6))
sns.heatmap(netflix_data.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap")
plt.show()


### What did you know about your dataset?

* The dataset contains over 7700 entries with 12 features.

* Missing values exist in fields like director, cast, country, date_added, and rating.

* Most content is from the US and added between 2015 and 2020.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

netflix_data.columns

In [None]:
# Dataset Describe

netflix_data.describe()

### Variables Description

* Type: Movie or TV Show

* Title: Name of the content

* Director/Cast: People involved

* Country: Production country

* Date Added: When it was added to Netflix

* Release Year: When it was released

* Rating: Age rating

* Duration: Length in minutes or number of seasons

* Genre: Category or genres listed

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

netflix_data.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Create a copy of the dataset
netflix_data_copy = netflix_data.copy()

In [None]:
# Handling missing value

netflix_data_copy['director'] = netflix_data_copy['director'].fillna('Unknown')
netflix_data_copy['cast'] = netflix_data_copy['cast'].fillna('Unknown')
netflix_data_copy['country'] = netflix_data_copy['country'].fillna('Unknown')

netflix_data_copy = netflix_data_copy.dropna(subset=['date_added', 'rating'])

In [None]:
# cross checking null values

netflix_data_copy.isnull().sum()

In [None]:
# Extract the numeric duration and its unit (minutes for Movies, seasons for TV Shows)

netflix_data_copy["duration_int"] = netflix_data_copy["duration"].str.extract(r"(\d+)", expand=False).astype(float)
netflix_data_copy["duration_type"] = netflix_data_copy["duration"].str.extract(r"([a-zA-Z]+)", expand=False).str.strip()

In [None]:
# Convert 'date_added' to datetime and create year and month columns

netflix_data_copy['date_added'] = pd.to_datetime(netflix_data_copy['date_added'], format='mixed')
netflix_data_copy['year_added'] = netflix_data_copy['date_added'].dt.year
netflix_data_copy['month_added'] = netflix_data_copy['date_added'].dt.month

In [None]:
netflix_data_copy.head()

In [None]:
# Outlier Detection for Movie Durations

movie_durations = netflix_data_copy[netflix_data_copy["type"] == "Movie"]["duration_int"].dropna()

Q1 = movie_durations.quantile(0.25)
Q3 = movie_durations.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = movie_durations[(movie_durations < lower) | (movie_durations > upper)]
print(f"Number of movie duration outliers: {len(outliers)}")


In [None]:
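# Flag movie-duration outliers (the bounds come from Movies, so TV Shows measured in seasons are never flagged)
is_movie = netflix_data_copy['type'] == 'Movie'
out_of_bounds = (netflix_data_copy['duration_int'] < lower) | (netflix_data_copy['duration_int'] > upper)
netflix_data_copy['is_outlier_duration'] = (is_movie & out_of_bounds).astype(int)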

In [None]:
# Outliers Data

netflix_data_copy[netflix_data_copy['is_outlier_duration'] == 1]

In [None]:
# Final rows and columns

print("Rows:", netflix_data_copy.shape[0])
print("Columns:", netflix_data_copy.shape[1])

In [None]:
# Final dataset for visualizations

netflix_data_copy.head()

### What all manipulations have you done and insights you found?

* Converted date_added to datetime format

* Extracted year and month from date_added

* Dropped missing/null values where necessary

* Detected outliers and flagged them
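
The outlier step above uses the 1.5 × IQR rule on movie durations. As a rough sketch on made-up numbers, the rule can be wrapped in a small function; `iqr_bounds` and the toy frame are hypothetical names introduced only for illustration. Because TV Show durations are counted in seasons, the sketch applies the movie bounds to Movies only.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative data: six movies (minutes) and two TV shows (seasons)
toy = pd.DataFrame({
    "type": ["Movie"] * 6 + ["TV Show"] * 2,
    "duration_int": [90.0, 95.0, 100.0, 105.0, 110.0, 300.0, 1.0, 3.0],
})

is_movie = toy["type"] == "Movie"
lower, upper = iqr_bounds(toy.loc[is_movie, "duration_int"])

# A row is flagged only if it is a Movie and falls outside the fences
outside = (toy["duration_int"] < lower) | (toy["duration_int"] > upper)
toy["is_outlier_duration"] = (is_movie & outside).astype(int)
print(lower, upper)
print(toy)
```

Only the 300-minute movie is flagged here; the two TV Show rows stay at 0 even though their season counts are far below the movie fences.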

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# Univariate Visualizations

#### Chart - 1

In [None]:
# Content type count

netflix_data_copy['type'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title("Count of Movies and TV Shows")
plt.show()

##### 1. Why did you pick the specific chart?

To show the overall share of Movies vs TV Shows

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Roughly 70% of Netflix content is Movies and roughly 30% is TV Shows

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact**: Yes, it helps in understanding the platform’s main content format

**Negative Growth:** No - Movie dominance is consistent with demand trends.
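
The shares shown in the pie chart can also be read off numerically. A small sketch on toy data (the ten rows are made up; in the notebook the same call would run on `netflix_data_copy['type']`):

```python
import pandas as pd

# Illustrative data: 7 movies and 3 TV shows
toy = pd.DataFrame({"type": ["Movie"] * 7 + ["TV Show"] * 3})

# normalize=True returns proportions instead of raw counts
type_share = (toy["type"].value_counts(normalize=True) * 100).round(1)
print(type_share)
```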

#### Chart - 2

In [None]:
# Top 10 countries with most content

top_countries = netflix_data_copy['country'].value_counts().head(10)
plt.figure(figsize=(10, 6))
sns.barplot(x=top_countries.index, y=top_countries.values)
plt.title("Top 10 Countries with Most Content")
plt.xticks(rotation=45)  # rotate after plotting so the labels drawn by barplot are affected
plt.show()

##### 1. Why did you pick the specific chart?

To find countries with the most content

##### 2. What is/are the insight(s) found from the chart?

**Insight:** USA dominates, followed by India and UK

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Highlights strong production regions — useful for regional marketing and localization strategies.

**Negative Growth:** No - emphasizes global content strength
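
One caveat with counting the raw `country` column: missing values were filled with 'Unknown' during wrangling, and a single cell may list several countries. The sketch below, on made-up rows, shows one possible way to count each country separately. It assumes multi-country cells are comma-separated, in the same way `listed_in` is.

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "Unknown",
        "India",
        "United Kingdom, United States",
    ]
})

countries = (
    toy.loc[toy["country"] != "Unknown", "country"]  # drop the filled placeholder
    .str.split(",")                                  # one list of countries per title
    .explode()                                       # one row per (title, country)
    .str.strip()                                     # remove stray whitespace
)
countries = countries[countries != ""]               # guard against trailing commas

top_countries = countries.value_counts().head(10)
print(top_countries)
```

With this approach a co-production is counted once for every country involved, so the totals exceed the number of titles.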

#### Chart - 3

In [None]:
# Yearly Additions

plt.figure(figsize=(10, 6))
sns.countplot(data=netflix_data_copy, x="year_added", hue='type')
plt.title("Number of Content Additions per Year")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To track growth of content over time

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Netflix content peaked between 2018-2020

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Helps analyze content expansion trend and growing user base.

**Negative Growth:** Slight dip post-2020 could reflect saturation, COVID-related disruptions, or simply incomplete data for the most recent year in the dataset

#### Chart - 4

In [None]:
# Rating distribution

sns.countplot(data=netflix_data_copy, y='rating', hue="rating", order=netflix_data_copy['rating'].value_counts().index, palette='magma')
plt.title("Rating Distribution")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze content rating spread

##### 2. What is/are the insight(s) found from the chart?

**Insight:** TV-MA and TV-14 dominate

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Indicates that the catalogue is weighted toward mature and teen audiences

#### Chart - 5

In [None]:
# Top 10 Genres

genres = netflix_data_copy["listed_in"].str.split(", ", expand=True).stack()
top_genres = genres.value_counts().head(10)
sns.barplot(x=top_genres.values, y=top_genres.index, palette='viridis')
plt.title("Top 10 Genres")
plt.show()

##### 1. Why did you pick the specific chart?

To find genre popularity

##### 2. What is/are the insight(s) found from the chart?

**Insight:** International movies, Dramas and Comedies are top genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Helps decide what type of content to acquire more

**Negative Growth:** No - validates user preferences.

# Bivariate Visualizations

#### Chart - 6

In [None]:
# Type vs Rating

plt.figure(figsize=(10, 5))
sns.countplot(data=netflix_data_copy, x='rating', hue='type', palette='husl')
plt.title("Content Type by Rating")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To compare content type by rating

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Both TV Shows and Movies are most common in the teen and mature ratings

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Guides platform rating strategy

**Negative Growth:** No - supports audience alignment
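
Because Movies outnumber TV Shows, raw counts per rating make the two types hard to compare. A possible complement, sketched on toy data, is to normalise within each type so that each column sums to 100%:

```python
import pandas as pd

# Illustrative data: four movies and two TV shows
toy = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 2,
    "rating": ["TV-MA", "TV-MA", "TV-14", "PG", "TV-MA", "TV-14"],
})

# normalize="columns" divides each column (content type) by its own total
rating_share = pd.crosstab(toy["rating"], toy["type"], normalize="columns") * 100
print(rating_share.round(1))
```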

#### Chart - 7

In [None]:
# Content Type vs Country (Top 10 Countries)

top_countries = netflix_data_copy['country'].value_counts().head(10)
plt.figure(figsize=(12, 6))
top_countries_data = netflix_data_copy[netflix_data_copy['country'].isin(top_countries.index)]
sns.countplot(data=top_countries_data, x='country', hue='type', palette='Set1')
plt.title("Content Type by Top 10 Country")
plt.xlabel('Country')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Type')
plt.show()

##### 1. Why did you pick the specific chart?

To explore country-wise preferences

##### 2. What is/are the insight(s) found from the chart?

**Insight:** India's content is mostly Movies, while the US has a more balanced mix of Movies and TV Shows

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Regional content expansion ideas/strategies

**Negative Growth:** No - enables better localization.

#### Chart - 8

In [None]:
# Release Year vs Content Type

plt.figure(figsize=(12, 6))
sns.countplot(data=netflix_data_copy, x='release_year', hue='type', palette='Accent')
plt.title("Content Type by Release Year")
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Type')
plt.show()

##### 1. Why did you pick the specific chart?

Time trend for content type

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Growth of TV Shows faster after 2016

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Shows changing content strategy

**Negative Growth:** Requires evaluation of market saturation.

# Multivariate Visualizations

#### Chart - 9

In [None]:
# TV vs Movie Growth Trend

df_year_type = netflix_data_copy.groupby(['year_added', 'type']).size().unstack().fillna(0)
df_year_type.plot(kind='line', marker='o')
plt.title("Netflix Content Growth Over Years")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Tracks the growth of Movies and TV Shows added to Netflix over time.

##### 2. What is/are the insight(s) found from the chart?

**Insight:** Both categories show steady growth, but TV Shows began increasing more rapidly after 2016, indicating a strategic pivot toward serialized content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:** Understanding this trend can help Netflix anticipate future demand and balance its content investments accordingly.
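
To put numbers on which type grows faster, the yearly counts can be turned into year-over-year growth rates and a TV Show share. This is only a sketch on invented counts; the `counts` table and its values are illustrative, not results from the dataset.

```python
import pandas as pd

# Illustrative yearly counts, expanded to one row per title
counts = pd.DataFrame({
    "year_added": [2015, 2015, 2016, 2016, 2017, 2017],
    "type": ["Movie", "TV Show"] * 3,
    "n": [50, 20, 100, 50, 150, 100],
})
toy = counts.loc[counts.index.repeat(counts["n"]), ["year_added", "type"]]

# Same aggregation as the line chart: titles per year and type
by_year = toy.groupby(["year_added", "type"]).size().unstack(fill_value=0)

# Year-over-year growth in percent (first year has no previous year, so it is NaN)
growth = by_year.pct_change() * 100

# Share of each year's additions that are TV Shows
tv_share = by_year["TV Show"] / by_year.sum(axis=1) * 100

print(growth.round(1))
print(tv_share.round(1))
```

A rising `tv_share` alongside higher TV Show growth rates would support the reading of a shift toward serialized content.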

#### Chart - 10

In [None]:
# Movie Duration by Rating

# Keep only Movies, since their duration is measured in minutes
movies = netflix_data_copy[netflix_data_copy['type'] == 'Movie'].copy()
movies['duration_int'] = movies['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Remove nulls and outliers for cleaner view
movies = movies[(movies['duration_int'] < 300) & (movies['duration_int'] > 0)]

plt.figure(figsize=(12, 6))
sns.boxplot(data=movies, x='rating', y='duration_int')
plt.title('Movie Duration by Rating')
plt.xlabel('Rating')
plt.ylabel('Duration (minutes)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Combines genre, content rating, and content count.

##### 2. What is/are the insight(s) found from the chart?

**Insight:**

* Dramas and Comedies dominate across multiple ratings.

* Family-friendly genres like Children & Family Movies are underrepresented.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Impact:**

* Supports continued investment in high-demand genres like Drama.

*  Reveals underutilized genres (e.g., Kids content) for expansion.

**Negative Growth:**

The chart shows that a significant portion of popular genres fall under TV-MA and R ratings.
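
The genre and rating observations above could be checked with a cross-tabulation of genre against rating. The sketch below uses a few made-up rows and assumes, as in the Top 10 Genres chart, that `listed_in` holds genres separated by ", ".

```python
import pandas as pd

# Illustrative data only
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show"],
    "rating": ["TV-MA", "R", "TV-MA", "TV-14"],
    "listed_in": [
        "Dramas, Comedies",
        "Dramas",
        "Comedies, International Movies",
        "TV Dramas",
    ],
})

movies = toy[toy["type"] == "Movie"]

# One row per (title, genre) pair
genres = movies.assign(genre=movies["listed_in"].str.split(", ")).explode("genre")

# Rows are genres, columns are ratings, cells are title counts
genre_by_rating = pd.crosstab(genres["genre"], genres["rating"])
print(genre_by_rating)
```

The resulting table could be passed to `sns.heatmap` to give a single view that combines genre, rating, and content count.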

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

* Focus more on growing TV Show content for teen and adult audiences (TV-14, TV-MA)

* Acquire region-specific content (India, UK, etc.)

* Invest in top genres like Drama, Comedy, Action

* Track year-wise performance to predict content demand

# **Conclusion**

The EDA revealed valuable patterns in content type, country contributions, genre popularity, and yearly trends. Netflix can use these insights to align its content strategy with viewer preferences and regional demand.