## ***1. Know Your Data***

### Import Libraries

In [1]:
import pandas as pd
print("Notebook is working!")




Notebook is working!


In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_excel("/content/netflixdata.xlsx")

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(df)



### What did you know about your dataset?

The dataset has 7787 rows and 12 columns, 5 of which contain missing values. `director` has the most (2389), followed by `cast` (718), `country` (507), `date_added` (10), and `rating` (7).
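
The raw counts above are easier to compare as a share of all rows. The snippet below is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper (not part of the original notebook) and the toy frame only stands in for `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for `df` (values are illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "rating": ["TV-MA", None, "PG", "TV-14"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` should list the same five columns reported above, with the percentage column showing how much of each column is actually affected.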

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

| Column Name    | Description                                         |
| -------------- | --------------------------------------------------- |
| `show_id`      | Unique identifier for each Netflix title.           |
| `type`         | Whether the content is a *Movie* or a *TV Show*.    |
| `title`        | Title of the content.                               |
| `director`     | Name(s) of the director(s), comma-separated.        |
| `cast`         | Names of the cast members, comma-separated.         |
| `country`      | Country where the content was produced.             |
| `date_added`   | Date the title was added to Netflix.                |
| `release_year` | Year when the content was released.                 |
| `rating`       | Age rating (e.g., TV-MA, TV-14).                    |
| `duration`     | Duration in minutes (Movies) or seasons (TV Shows). |
| `listed_in`    | Genres of the content.                            |
| `description`  | Brief description or synopsis.                      |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()


## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# - Convert 'date_added' to datetime
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

In [None]:
# Create year and month columns
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month

In [None]:
# Split 'duration' into 'duration_int' and 'duration_type' (e.g., "90 min" → 90 and "min")
# Raw string avoids invalid-escape warnings for \d, \s and \D
df[['duration_int', 'duration_type']] = df['duration'].str.extract(r'(\d+)\s*(\D+)')
df['duration_int'] = pd.to_numeric(df['duration_int'], errors='coerce')

In [None]:
# Count remaining nulls after wrangling
null_counts_after = df.isnull().sum()
print(null_counts_after)

### What all manipulations have you done and insights you found?

**Key Manipulations:**

- Converted ***date_added*** column to proper datetime format.

- Extracted ***year_added*** and ***month_added*** from ***date_added*** for trend analysis.

- Split ***duration*** column into:

    - duration_int (numeric part)

    - duration_type (unit: minutes/seasons)

- Identified and counted missing values in key columns like ***director***, ***cast***, ***country***, etc.

**Insights Gained:**

- ***director*** and ***cast*** have a large number of missing values. This sparsity may be concentrated in *older, regional, or lesser-known* content, although the notebook does not test this.

- The ***duration_type*** column makes it easy to differentiate *Movies* from *TV Shows* for deeper analysis.

- Extracted time features (***year_added***, ***month_added***) enable time-series insights like content growth and seasonal trends.
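
As a quick check of the duration split described above, the following sketch runs the same regular expression on a toy frame. The unit normalization (lower-casing and stripping the plural "s") is an extra step assumed here for illustration; the original wrangling code keeps the raw unit text.

```python
import pandas as pd

# Toy rows standing in for `df` (illustrative values only)
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "1 Season", None],
})

# Same pattern as the wrangling cell: digits, optional space, then the unit
parts = toy["duration"].str.extract(r"(\d+)\s*(\D+)")
toy["duration_int"] = pd.to_numeric(parts[0], errors="coerce")

# Assumed extra step: normalize the unit so "Season" and "Seasons" form one group
toy["duration_type"] = parts[1].str.strip().str.lower().str.rstrip("s")

print(toy[["type", "duration_int", "duration_type"]])
```

With a single unit per content type, `toy.groupby("duration_type")["duration_int"].describe()` gives separate statistics for movie runtimes and season counts.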



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 (Distribution of Movies vs TV Shows)

In [None]:
# Distribution of Movies vs TV Shows
type_counts = df['type'].value_counts()

plt.figure(figsize=(10,6))
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=110, colors=['#ff9999','#66b3ff'])
plt.title('Distribution of Movies vs TV Shows on Netflix')
plt.axis('equal')
plt.show()



##### 1. Why did you pick the specific chart?

A pie chart is effective for visualizing the proportion between two categories—here, Movies and TV Shows. It quickly communicates which format dominates.

##### 2. What is/are the insight(s) found from the chart?

The number of movies on Netflix significantly exceeds the number of TV Shows, showing a heavier investment in movie content.
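
To put a number on "significantly exceeds", the split can be read directly as percentages. This is a minimal sketch on a toy series; on the real data the same call would be applied to `df['type']`.

```python
import pandas as pd

# Toy series standing in for df['type'] (illustrative values only)
toy_types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns fractions instead of counts
shares = toy_types.value_counts(normalize=True).mul(100).round(1)
print(shares)
```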

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps Netflix evaluate whether diversifying into more TV shows can improve user retention, especially since series often boost engagement. However, over-investment in movies without enough long-form content may limit binge-watching behavior, potentially reducing user time-on-platform.



#### Chart - 2 (Top 10 countries)

In [None]:
# Top 10 countries by content production on Netflix
top_countries = df['country'].dropna().str.split(', ', expand=True).stack().value_counts().head(10)

plt.figure(figsize=(10,5))
sns.barplot(x=top_countries.values, y=top_countries.index, palette='viridis')
plt.title('Top 10 Countries by Content Production on Netflix')
plt.xlabel('Count')
plt.ylabel('Country')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart clearly visualizes which countries contribute the most content. This format is ideal for ranked categorical data with long labels.

##### 2. What is/are the insight(s) found from the chart?

The United States dominates content production on Netflix, followed by India, the United Kingdom, and others. The U.S. alone accounts for a massive portion of the total content.
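
The "massive portion" can be made concrete by expressing each country's count as a share of titles that list a country. The sketch below uses toy data and `explode`, which should behave like the `split(..., expand=True).stack()` approach in the chart code. Because co-productions count once per country, the shares can add up to more than 100%.

```python
import pandas as pd

# Toy frame standing in for `df` (illustrative values only)
toy = pd.DataFrame({
    "country": ["United States", "India", "United States, India", None, "United Kingdom"],
})

# One row per (title, country) pair; co-productions count once per country
countries = toy["country"].dropna().str.split(", ").explode().str.strip()
counts = countries.value_counts()

# Share of titles (with a known country) that involve each country
titles_with_country = toy["country"].notna().sum()
share_of_titles = (counts / titles_with_country * 100).round(1)
print(share_of_titles)
```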

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight can help Netflix localize strategies. It indicates strong content sourcing from the U.S. and growing markets like India. However, lack of content diversity from other regions may limit market penetration in underrepresented geographies, potentially restricting user acquisition.



#### Chart - 3 (Top 10 directors)

In [None]:
# Count top 10 directors
director_counts = df['director'].dropna().str.split(', ').explode().value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=director_counts.index, y=director_counts.values, palette="coolwarm")
plt.title("Top 10 Directors on Netflix")
plt.ylabel("Titles Directed")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

It identifies the most active and possibly famous directors associated with Netflix.

##### 2. What is/are the insight(s) found from the chart?

Directors who appear repeatedly suggest that Netflix trusts and invests in specific creators.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Working with reliable directors helps scale production quickly. However, relying on the same directors risks similar themes and limited variety in storytelling, which can adversely affect growth.



#### Chart - 4 (Top 10 Most Frequent Actors)

In [None]:
# Top 10 Most Frequent Actors on Netflix
actor_list = df['cast'].dropna().str.split(', ').explode()
top_actors = actor_list.value_counts().head(10)
print(top_actors)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_actors.index, y=top_actors.values, palette="coolwarm")
plt.title("Top 10 Most Frequent Actors on Netflix")
plt.ylabel("Appearances")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

A bar plot identifies the actors who are featured most often, which is useful for partnership and marketing decisions.

##### 2. What is/are the insight(s) found from the chart?

Certain actors appear far more often than others, indicating star preferences or contracts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights indicate that star power can bring viewership and help in casting popular faces. However, overuse may lead to a decline in views, as viewers may crave new talent.



#### Chart - 5 (Popular Topics on Netflix)

In [None]:
# Popular Topics on Netflix
from wordcloud import WordCloud
text = ' '.join(df['description'].dropna().tolist())

plt.figure(figsize=(12,5))
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Popular Topics on Netflix')
plt.show()



##### 1. Why did you pick the specific chart?

A word cloud gives a quick, creative overview of the common themes and topics that Netflix content descriptions revolve around.

##### 2. What is/are the insight(s) found from the chart?

The most prominent words include life, family, world, friend, and love. This points to the themes that recur in Netflix's content and may reflect both user interest and Netflix's content production strategy.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights show Netflix's investment in topics like life and family. Lower visibility of niche themes such as adventure or dark content could negatively affect marketing to those audiences.


#### Chart - 6 (Top 10 most common genres/categories)

In [None]:
# Top 10 most common genres/categories on Netflix
from collections import Counter

# Split and count genres
genre_list = df['listed_in'].dropna().str.split(', ')
flat_genres = [genre for sublist in genre_list for genre in sublist]
top_genres = Counter(flat_genres).most_common(10)
genres, counts = zip(*top_genres)

# Plotting
plt.figure(figsize=(10,5))
sns.barplot(x=list(counts), y=list(genres), palette='magma')
plt.title('Top 10 Genres on Netflix')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()



##### 1. Why did you pick the specific chart?

A horizontal bar chart is perfect for showcasing frequency distribution of categorical data like genres. It provides clarity even when category names are long.

##### 2. What is/are the insight(s) found from the chart?

The most common genres on Netflix include International Movies, Dramas, and Comedies. These dominate the graph, which may reflect high user demand in these categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding genre preferences can help Netflix personalize recommendations and focus future content investments. However, over-representation in certain genres (e.g., dramas) could lead to audience fatigue and underrepresentation of niche interests, which might limit audience diversity.

#### Chart - 7 (Rating Distribution)

In [None]:
# Rating Distribution
plt.figure(figsize=(10, 5))
rating_counts = df['rating'].dropna().value_counts().sort_values(ascending=False)
print(rating_counts)
sns.barplot(x=rating_counts.index, y=rating_counts.values, palette="rocket")
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()



##### 1. Why did you pick the specific chart?

Bar charts are well suited to comparing counts across categorical variables like content ratings.

##### 2. What is/are the insight(s) found from the chart?

TV-MA and TV-14 dominate the content, indicating that most titles target mature and teen audiences. Lower counts for G, TV-Y7-FV, and similar ratings show that children's content is underrepresented.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It highlights Netflix's focus on adult and teen demographics, which is useful for targeting ads and promotions. However, children's and family content is comparatively scarce; expanding it could open up a new market segment.



#### Chart - 8 (Content Added per Month)

In [None]:
# Content added to Netflix by month (seasonality check)
monthly_counts = df['month_added'].value_counts().sort_index()

plt.figure(figsize=(10,5))
sns.barplot(x=monthly_counts.index, y=monthly_counts.values, palette='coolwarm')
plt.title('Content Added to Netflix by Month')
plt.xlabel('Month')
plt.ylabel('Count')
plt.xticks(ticks=range(0, 12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                       'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.grid(axis='y')
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for showing how content addition varies month-to-month, helping to identify seasonal trends or marketing strategies.

##### 2. What is/are the insight(s) found from the chart?

Certain months, especially around October and December, tend to have spikes in content addition. This likely aligns with holidays or school breaks when viewership is expected to be higher.
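
The month labels in the chart code rely on all twelve months being present and in order. A hedged alternative is to reindex the counts to months 1 to 12 and derive the labels from the month numbers, as sketched below on toy data. `calendar.month_abbr` returns English abbreviations under the default locale.

```python
import calendar
import pandas as pd

# Toy series standing in for df['month_added'] (floats, with a missing value)
toy_months = pd.Series([1.0, 1.0, 12.0, 10.0, None, 10.0, 10.0], name="month_added")

monthly_counts = (
    toy_months.dropna().astype(int).value_counts()
    .reindex(range(1, 13), fill_value=0)  # months with no additions show as 0
)

# Label each position with its month abbreviation (Jan ... Dec)
monthly_counts.index = [calendar.month_abbr[m] for m in monthly_counts.index]
print(monthly_counts)
```

The resulting series can be passed to `sns.barplot` directly, so the tick labels always match the data even if a month is missing.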

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight can help Netflix plan content releases during busy times to boost views and subscriptions. But if there’s not enough new content in some months, people might lose interest, which could affect how many stay subscribed.



#### Chart - 9 (Number of Releases per year)

In [None]:
# Number of Netflix releases per year
release_counts = df['release_year'].value_counts().sort_index()

plt.figure(figsize=(19,6))
sns.lineplot(x=release_counts.index, y=release_counts.values, marker='o', color='teal')
plt.title('Netflix Content Released Over the Years')
plt.xlabel('Release Year')
plt.ylabel('Count')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is ideal for showing trends over time. It clearly illustrates the rise or fall in content production across years.

##### 2. What is/are the insight(s) found from the chart?

Netflix significantly ramped up content releases from 2015 onwards, peaking around 2018–2019. There’s a noticeable dip post-2019, possibly due to the COVID-19 pandemic’s impact on production.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This insight helps Netflix understand how its catalog is distributed across release years and how quickly content volume has grown. It can guide how much to invest in future shows or movies. The drop in content after 2019 might be due to global disruptions, showing that Netflix should have backup plans to keep viewers engaged when production slows down.

#### Chart - 10 (Content Type across countries)

In [None]:
# Content Type across Countries
top_countries = df['country'].value_counts().head(10).index
filtered_df = df[df['country'].isin(top_countries)]

country_type = pd.crosstab(filtered_df['country'], filtered_df['type']).sort_values(by=['Movie', 'TV Show'], ascending=False)

country_type.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='tab20c')
plt.title('Content Type by Country')
plt.xlabel('Country')
plt.ylabel('Count')
plt.legend(title='Type')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A stacked bar chart clearly compares Movies vs TV Shows across the top countries.

##### 2. What is/are the insight(s) found from the chart?

The following insights are found from the chart :
- USA leads in total content, mostly Movies.
- UK has a more balanced mix.
- India, Canada, France contribute mainly Movies.
- South Korea & Japan lean more toward TV Shows.
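
Statements such as "mostly Movies" or "lean more toward TV Shows" are easier to verify as row percentages than as stacked counts. The following is a minimal sketch on toy data, using `pd.crosstab` with `normalize="index"`; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for `filtered_df` (illustrative values only)
toy = pd.DataFrame({
    "country": ["United States"] * 3 + ["South Korea"] * 4,
    "type": ["Movie", "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "Movie"],
})

# normalize="index" turns each country's row into shares that sum to 1
mix = pd.crosstab(toy["country"], toy["type"], normalize="index").mul(100).round(1)
print(mix)
```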

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights help target content and marketing by region (for example, focusing on TV shows in South Korea and movies in the US). However, heavy reliance on U.S.-originated content could backfire if audience preferences shift.

#### Chart - 11 (Content Type over Time)

In [None]:
# Content Type over Time
release_counts = df.groupby(['release_year', 'type']).size().unstack().fillna(0)

release_counts.plot(kind='line', figsize=(12, 6), linewidth=2)
plt.title('Content Type over Time')
plt.xlabel('Year')
plt.ylabel('Count')
plt.legend(title='Type')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line chart is perfect to visualize how content production trends have changed over years for both Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

- There is a sharp increase in content after 2015, especially in TV Shows.

- Movies dominated earlier years, while TV Shows have increased in recent years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights show a rising share of TV Shows, which suggests that Netflix can invest more in series. The decline in Movie growth may indicate market saturation or shifting preferences.

#### Chart - 12 (Top Genres by Country and Type)

In [None]:
# Top Genres by Country and Type
df_genre = df.dropna(subset=['listed_in', 'country', 'type']).copy()
df_genre['genre'] = df_genre['listed_in'].str.split(', ')
df_genre['country'] = df_genre['country'].str.split(', ')
df_genre = df_genre.explode('genre').explode('country')

# Focus on top 5 countries and top 5 genres for clarity
top_countries = df_genre['country'].value_counts().head(5).index
top_genres = df_genre['genre'].value_counts().head(5).index

filtered = df_genre[df_genre['country'].isin(top_countries) & df_genre['genre'].isin(top_genres)]

# Pivot table
genre_country_type = pd.pivot_table(filtered, index='genre', columns=['country', 'type'],
                                    values='title', aggfunc='count', fill_value=0)

# Heatmap
plt.figure(figsize=(14, 6))
sns.heatmap(genre_country_type, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Top Genres by Country and Type (Heatmap)')
plt.xlabel('Country & Type')
plt.ylabel('Genre')
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

A heatmap effectively compares multiple variables—genre, country, and content type—in a compact, visual format, highlighting patterns via color intensity.

##### 2. What is/are the insight(s) found from the chart?

- The US dominates content across most top genres and types.
- India heavily focuses on International Movies.
- Genre preferences vary by country (e.g., UK's strength in International TV Shows).
- Movies generally have higher representation in these top genres than TV shows within countries.
- Canada and France show lower output in these specific popular genres.
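
When reading the heatmap, it helps to remember that exploding both `genre` and `country` creates one row per genre-country pairing, so a single title can contribute to several cells. The sketch below illustrates this on two toy titles.

```python
import pandas as pd

# Two toy titles standing in for `df` (illustrative values only)
toy = pd.DataFrame({
    "title": ["T1", "T2"],
    "type": ["Movie", "TV Show"],
    "listed_in": ["Dramas, Comedies", "Dramas"],
    "country": ["United States, India", "India"],
})

# Split both multi-valued columns, then explode each in turn
pairs = toy.assign(
    genre=toy["listed_in"].str.split(", "),
    country=toy["country"].str.split(", "),
).explode("genre").explode("country")

# T1 has 2 genres x 2 countries = 4 rows; T2 has 1 row
print(pairs.groupby("title").size())
```

The cell values are therefore counts of pairings rather than counts of unique titles, which slightly favors countries and genres that appear in many co-productions.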

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights help Netflix tailor its genre-based content strategy per country and type, and they support localized marketing and production. However, over-relying on a few genres may cause viewer fatigue and risks a lack of diversity.

#### Chart - 13 (Yearly Content Additions by Type and Rating)

In [None]:
# Yearly Content Additions by Type and Rating
df_filtered = df.dropna(subset=['year_added', 'type', 'rating'])
top_ratings = df_filtered['rating'].value_counts().head(5).index
df_filtered = df_filtered[df_filtered['rating'].isin(top_ratings)]

# Group, pivot, and plot
pivot = df_filtered.groupby(['year_added', 'type', 'rating']).size() \
    .unstack(['type', 'rating'], fill_value=0) \
    .sort_index()

pivot.index = pivot.index.astype(int)  # Ensure years are integers

pivot.plot(kind='bar', stacked=True, figsize=(14, 7), colormap='tab20c')
plt.title('Yearly Content Additions by Type and Rating')
plt.xlabel('Year')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()





##### 1. Why did you pick the specific chart?

A stacked bar chart allows clear visualization of how content additions are distributed by rating and type over the years, showing both volume and composition changes.

##### 2. What is/are the insight(s) found from the chart?

- Significant growth post-2015, peak in 2019-2020, decline in 2021.
- Movie additions generally higher than TV shows.
- Mature ratings (R, TV-MA, TV-14) dominate additions for both types, especially in peak years.
- Early years had very low additions.
- 2021 shows a drop across most categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights show investment in mature content, a movie-heavy strategy, and peak release periods. The decline in 2021 suggests potential production issues or a strategy shift, which needs to be investigated. Over-emphasis on mature content might limit audience reach.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

   * Focus more on top-performing ratings like **TV-MA** and **TV-14** which dominate the platform.
   * Ensure a balance of family-friendly and adult content to retain a wide audience base.
   * Expand **TV Show** offerings in emerging markets with regional languages.
   * **Customize Content Strategies** based on country-specific preferences.
   * Genres like **International Movies**, **Dramas**, and **Comedies** appear most frequently and should be prioritized for production and promotion.
   * Introduce more niche genres to attract underserved user segments.
   * Countries like the **United States**, **India**, and **UK** are highly represented — continue collaborating with top production houses from these regions.
   * Cast members such as **Anupam Kher**, **Shah Rukh Khan**, and **David Attenborough** appear frequently; audience preference for familiar faces should guide future collaborations.
   * Avoid overly long content which may lead to viewer drop-offs.
   * Ensure all records have complete info (e.g. directors, cast, country) to improve **search relevance** and **algorithmic recommendations**.
   * Identify **peak months or quarters** for adding region-specific content and launch promotions around them.

# **Conclusion**

The Netflix data analysis shows that most content was added after 2015, with TV Shows gaining share over time. The United States leads in content production, followed by India and the UK. Genres such as International Movies, Dramas, and Comedies appear most frequently across the top countries. Ratings such as TV-MA and TV-14 dominate the platform, showing a trend toward mature content. However, Netflix can grow further by adding more family-friendly shows and focusing on regional content. Overall, this analysis helps Netflix understand its catalog and likely audience preferences, and make smarter decisions to improve user engagement and expand in global markets.
