<a href="https://colab.research.google.com/github/saurabhsingh3786/Netflix-Movies-and-TV-Shows-Clustering/blob/main/individual_notebook_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - `Netflix Movies and TV Shows Clustering`



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Name**            - Saurabh Singh



# **Project Summary -**

The goal of this project is to analyze the Netflix catalog of movies and TV shows, which was sourced from the third-party search engine Flixable, and group them into relevant clusters. This will help enhance the user experience and prevent subscriber churn for the world's largest online streaming service provider, Netflix, which had over 220 million subscribers as of 2022-Q2. The dataset, which includes movies and TV shows as of 2019, will be analyzed to uncover new insights and trends in the rapidly growing world of streaming entertainment.

* There were approximately 7787 records and 12 attributes in the dataset.

* We started by working on the missing values in the dataset and conducting exploratory data analysis (EDA).

* The following attributes were used for clustering: cast, country, genre, director, rating, and description. A TF-IDF vectorizer was used to tokenize, preprocess, and vectorize the values in these attributes.

* The problem of dimensionality was dealt with through the use of Principal Component Analysis (PCA).

* We built clusters with two algorithms, K-Means Clustering and Agglomerative Hierarchical clustering, and determined the optimal number of clusters using methods including the elbow method, silhouette score, and dendrogram.

* The similarity matrix generated by applying cosine similarity was used to construct a content-based recommender system. The user will receive ten recommendations from this recommender system based on the type of show they watched.
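
The recommender idea in the last bullet can be sketched on a toy catalog. This is a minimal illustration, not the notebook's actual implementation: the titles, the text, and the `recommend` helper are hypothetical, and only `TfidfVectorizer` and `cosine_similarity` are taken from the libraries the project imports.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalog (hypothetical titles and text, not rows from the real dataset)
toy = pd.DataFrame({
    "title": ["Space Quest", "Galaxy Wars", "Baking Duel", "Cake Masters"],
    "text": [
        "space adventure astronauts explore distant galaxy",
        "galaxy war space fleet battle astronauts",
        "baking competition cake pastry chefs",
        "cake decorating competition pastry chefs bake",
    ],
})

# Vectorize the text and compute the pairwise cosine similarity matrix
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(toy["text"])
similarity = cosine_similarity(matrix)

def recommend(title, top_n=10):
    """Return up to top_n titles most similar to `title`, excluding itself."""
    idx = toy.index[toy["title"] == title][0]
    scores = pd.Series(similarity[idx], index=toy["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend("Space Quest", top_n=1))
```

With the full dataset, the same lookup would return ten titles, assuming the similarity matrix is built from the combined text attributes listed above.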

# **GitHub Link -**

https://github.com/saurabhsingh3786/Netflix-Movies-and-TV-Shows-Clustering

# **Problem Statement**


Netflix is a streaming service that offers a wide variety of television shows and movies for viewers to watch at their convenience. With a monthly subscription, users have access to a vast library of content, including original series and films produced by Netflix. The platform also allows users to create multiple profiles, making it easy for family members or roommates to have their own personalized viewing experience. Additionally, Netflix allows users to download content to watch offline, making it a great option for those who travel frequently or have limited internet access. Overall, Netflix is a convenient and cost-effective way to access a wide variety of entertainment.

As of 2022-Q2, more than 220 million people had signed up for Netflix's online streaming service, making it the largest OTT provider worldwide. To improve the user experience and prevent subscriber churn, they must efficiently cluster the shows hosted on their platform.

By creating clusters, we will be able to comprehend the shows that are alike and different from one another. These clusters can be used to provide customers with individualized show recommendations based on their preferences.

This project aims to classify and group Netflix shows into specific clusters in such a way that shows in the same cluster are similar to one another and shows in different clusters are different.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries and modules

# libraries that are used for analysis and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
%matplotlib inline

# Visualizing the missing values
import missingno as msno

# Word Cloud library
from wordcloud import WordCloud, STOPWORDS

# libraries used to process textual data
import string
string.punctuation
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

# libraries used to implement clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# libraries that are used to construct a recommendation system
# (TfidfVectorizer is already imported above)
from sklearn.metrics.pairwise import cosine_similarity

# Library of warnings would assist in ignoring warnings issued
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
netflix = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Capstone Projects/Netflix Movies and TV Shows Clustering/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')
netflix_df = netflix.copy()


### Dataset First View

In [None]:
# Dataset First Look
# first five rows
netflix_df.head()

In [None]:
#Last five rows
netflix_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'number of rows : {netflix_df.shape[0]}  \nnumber of columns : {netflix_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
netflix_df.info()

#### Duplicate Values

**`How important is it to get rid of duplicate records in my data?`**

Duplication means that the same record appears more than once in the dataset. It can be caused by incorrect data entry or by the procedure used to collect the data. Removing duplicates avoids sending the same record to the machine learning model multiple times, which saves computation and keeps repeated rows from skewing the results.

In [None]:
# Dataset Duplicate Value Count
duplicate_value = len(netflix_df[netflix_df.duplicated()])
print("The number of duplicate values in the data set is = ",duplicate_value)

We found that there were no duplicate entries in the above data.
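
Because `show_id` is unique for every row, a full-row check cannot flag a title that was entered twice under different IDs. The sketch below shows a content-level check on a toy frame; the rows are hypothetical and the choice of `key_columns` is an assumption, not something the dataset documentation prescribes.

```python
import pandas as pd

# Toy frame (hypothetical rows): the same title appears twice under different IDs
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example Film", "Example Film", "Example Series"],
    "release_year": [2019, 2019, 2020],
})

# Full-row check: finds nothing, because show_id differs on every row
full_row_dupes = int(toy.duplicated().sum())

# Content-level check: ignore show_id and compare descriptive columns only
key_columns = ["type", "title", "release_year"]  # assumed key, adjust as needed
content_dupes = int(toy.duplicated(subset=key_columns).sum())

print(full_row_dupes, content_dupes)
```

Running the same `duplicated(subset=...)` call on `netflix_df` would show whether any title is repeated at the content level.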

#### Missing Values/Null Values

**Why is dealing with missing values necessary?**

Real-world data frequently contains missing values, for example because a value was never recorded or was corrupted. Since many machine-learning algorithms do not support missing values, missing data must be handled during the dataset's pre-processing. Therefore, we begin by looking for values that are missing.

In [None]:
# Missing Values/Null Values Count
print(netflix_df.isnull().sum())

In [None]:
# Missing Values Percentage
round(netflix_df.isna().sum()/len(netflix_df)*100, 2)

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(netflix_df, color='green',sort='ascending', figsize=(10,3), fontsize=15)

### What did you know about your dataset?

The given dataset is from the online streaming industry; our task is to examine the dataset and build clustering models and a content-based recommendation system.

Clustering is a technique used in machine learning and data mining to group similar data points together. A clustering algorithm is a method or technique used to identify clusters within a dataset. These clusters represent natural groupings of the data, and the goal of clustering is to discover these groupings without any prior knowledge of the groupings.

* There are 7787 rows and 12 columns in the dataset. In the director, cast, country, date_added, and rating columns, there are missing values. The dataset does not contain any duplicate values.

* Every row describes a specific movie or TV show, so missing values such as the director or cast cannot be reliably inferred from other rows. Additionally, because the dataset is small, we do not want to lose any data, so after analyzing each column we simply replace the missing values with an empty string in the following procedure.
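
A minimal sketch of that imputation on a toy frame follows. The rows are hypothetical, and restricting the fill to text columns is an assumption: an empty string suits free-text fields such as director, cast, and country, while a column like date_added would need separate handling.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the text columns (hypothetical values)
toy = pd.DataFrame({
    "director": ["A. Director", np.nan],
    "cast": [np.nan, "Actor One, Actor Two"],
    "country": ["India", np.nan],
})

# Replace missing text with an empty string so no rows are dropped
text_columns = ["director", "cast", "country"]
filled = toy.copy()
filled[text_columns] = filled[text_columns].fillna("")

print(filled.isna().sum().sum())
```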

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
netflix_df.columns

In [None]:
# Dataset Describe
netflix_df.describe().T

### Variables Description

* **show_id :** Unique ID for every Movie/Show
* **type :** Identifier - Movie/Show
* **title :** Title of the Movie/Show
* **director :** Director of the Movie/Show
* **cast :** Actors involved in the Movie/Show
* **country :** Country where the Movie/Show was produced
* **date_added :** Date it was added on Netflix
* **release_year :** Actual Release year of the Movie/Show
* **rating :** TV Rating of the Movie/Show
* **duration :** Total Duration - in minutes or number of seasons
* **listed_in :** Genre
* **description :** The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in netflix_df.columns.tolist():
  print("No. of unique values in",i,"is",netflix_df[i].nunique())

### Observations:

* We are focusing on several key columns of our dataset, including 'type', 'title', 'director', 'cast', 'country', 'rating', 'listed_in', and 'description', as they contain a wealth of information.
* By utilizing these features, we plan to create a cluster column and implement both K-means and Hierarchical clustering algorithms.
* Additionally, we will be developing a content-based recommendation system that utilizes the information from these columns to provide personalized suggestions to users. This approach will allow us to gain valuable insights and group similar data points together, as well as provide personalized recommendations based on user preferences and viewing history.
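
The plan above can be sketched end to end on a toy frame: join the text columns, vectorize with TF-IDF, fit K-Means, and score the result. The rows, the `clustering_text` column name, and `n_clusters=2` are illustrative assumptions, not the settings used later in the project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy frame (hypothetical rows) with two of the text attributes
toy = pd.DataFrame({
    "listed_in": ["Sci-Fi", "Sci-Fi", "Reality TV", "Reality TV", "Sci-Fi", "Reality TV"],
    "description": [
        "astronauts explore a distant galaxy",
        "a space fleet defends the galaxy",
        "chefs compete in a baking contest",
        "bakers compete to decorate cakes",
        "a robot crew travels through space",
        "home cooks compete in a cooking contest",
    ],
})

# Join the selected columns into one text field per title
feature_columns = ["listed_in", "description"]
toy["clustering_text"] = toy[feature_columns].fillna("").agg(" ".join, axis=1)

# Vectorize the combined text
vectors = TfidfVectorizer(stop_words="english").fit_transform(toy["clustering_text"])

# Fit K-Means and store the label of each title in a cluster column
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
toy["cluster"] = kmeans.fit_predict(vectors)

# Silhouette score lies in [-1, 1]; higher means better separated clusters
score = silhouette_score(vectors, toy["cluster"])
print(toy[["listed_in", "cluster"]])
print(f"silhouette score: {score:.3f}")
```

On the real data, the number of clusters would be chosen by comparing such scores across several values of `n_clusters` rather than fixed in advance.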

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Explore each column in netflix_df
for column in netflix_df.columns:
    print(f"Column: {column}")
    print("Data Type:", netflix_df[column].dtype)
    print("Number of Unique Values:", netflix_df[column].nunique())
    print("Value Counts:")
    print(netflix_df[column].value_counts())
    print("-" * 30)

### Type -

In [None]:
# Calculate the value counts of each type
type_counts = netflix_df['type'].value_counts()

# Calculate the total number of entries in the dataset
total_entries = len(netflix_df)

# Calculate the percentage of each type
percentage_movie = (type_counts['Movie'] / total_entries) * 100
percentage_tv_show = (type_counts['TV Show'] / total_entries) * 100

print(f"Percentage of Movies: {percentage_movie:.2f}%")
print(f"Percentage of TV Shows: {percentage_tv_show:.2f}%")


### Title -

In [None]:
# Most frequently occurring words in titles
# Select the title column
df_wordcloud = netflix_df['title']
text = " ".join(word for word in df_wordcloud)
# Create stopword list:
stopwords = set(STOPWORDS)
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

It seems that words such as love, man, story, christmas, and movie are very common in titles.

### Country & Listed_in -

Some movies and TV shows were filmed in multiple countries or have multiple genres associated with them. Let's find them.

In [None]:
# Find entries with multiple countries
multiple_countries = netflix_df[netflix_df['country'].str.contains(',', na=False)]

# Find entries with multiple genres
multiple_genres = netflix_df[netflix_df['listed_in'].str.contains(',', na=False)]

# Print movies/TV shows with multiple countries
print("Movies/TV Shows Filmed in Multiple Countries:")
print(multiple_countries[['title', 'country']])

# Print movies/TV shows with multiple genres
print("\nMovies/TV Shows with Multiple Genres:")
print(multiple_genres[['title', 'listed_in']])


To simplify the analysis, let's consider only the primary country where that respective movie / TV show was filmed.
Also, let's consider only the primary genre of the respective movie / TV show.

In [None]:
# Function to extract the primary value
def extract_primary(value):
    if isinstance(value, str):
        return value.split(',')[0]
    return value

# Apply the function to 'country' and 'listed_in' columns
netflix_df['country'] = netflix_df['country'].apply(extract_primary)
netflix_df['listed_in'] = netflix_df['listed_in'].apply(extract_primary)

# Print the DataFrame with simplified values
netflix_df

### date_added -

In [None]:
# Typecasting 'date_added' from string to datetime
# Strip stray whitespace first so that every date string has the same layout
netflix_df["date_added"] = pd.to_datetime(netflix_df['date_added'].str.strip())

In [None]:
# first and last date on which a show was added on Netflix
netflix_df.date_added.min(),netflix_df.date_added.max()

The shows were added on Netflix between 1st January 2008 and 16th January 2021.

In [None]:
# Adding new attributes day,  month and year of date added
netflix_df['day_added'] = netflix_df['date_added'].dt.day
netflix_df['month_added'] = netflix_df['date_added'].dt.month
netflix_df['year_added'] = netflix_df['date_added'].dt.year


### Rating -

The ratings can be mapped to age groups based on the age restrictions that apply to movies and TV shows.

[Reference](https://www.primevideo.com/help/ref=atv_hp_nd_cnt?nodeId=GFGQU3WYEG6FSJFJ)

In [None]:
# Age ratings for shows in the dataset
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=netflix_df)

**The highest number of shows on Netflix are rated TV-MA, followed by TV-14 and TV-PG.**

In [None]:
# Age ratings
netflix_df.rating.unique()

In [None]:
# Changing the values in the rating column
rating_map = {'TV-MA':'Adults',
              'R':'Adults',
              'PG-13':'Teens',
              'TV-14':'Young Adults',
              'TV-PG':'Older Kids',
              'NR':'Adults',
              'TV-G':'Kids',
              'TV-Y':'Kids',
              'TV-Y7':'Older Kids',
              'PG':'Older Kids',
              'G':'Kids',
              'NC-17':'Adults',
              'TV-Y7-FV':'Older Kids',
              'UR':'Adults'}

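# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame
netflix_df['rating'] = netflix_df['rating'].replace(rating_map)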
netflix_df['rating'].unique()

In [None]:
# Age ratings for shows in the dataset
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=netflix_df)

**Around 50% of shows on Netflix are produced for an adult audience, followed by young adults, older kids, and kids. Shows aimed specifically at teenagers form the smallest group.**
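
The share of each audience group can be read off directly instead of estimated from the bar heights. The helper below is a hypothetical sketch (`rating_share` is not part of the notebook), shown on toy data with made-up counts.

```python
import pandas as pd

def rating_share(df, column="rating"):
    """Return the percentage of titles in each rating group, largest first."""
    return (df[column].value_counts(normalize=True) * 100).round(2)

# Toy data with hypothetical counts, only to show the output shape
toy = pd.DataFrame({"rating": ["Adults"] * 5 + ["Young Adults"] * 3 + ["Kids"] * 2})
shares = rating_share(toy)
print(shares)
```

Calling `rating_share(netflix_df)` after the mapping step would give the exact percentages behind the statement above.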

### Duration -

In [None]:
# Splitting the duration column, and changing the datatype to integer
netflix_df['duration'] = netflix_df['duration'].apply(lambda x: int(x.split()[0]))

In [None]:
# Number of seasons for tv shows
netflix_df[netflix_df['type']=='TV Show'].duration.value_counts()

In [None]:
# Movie length in minutes
netflix_df[netflix_df['type']=='Movie'].duration.unique()

In [None]:
# datatype of duration
netflix_df.duration.dtype

We have successfully converted the datatype of the duration column to int. The number is in minutes for movies and in seasons for TV shows.
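
Since one integer column now mixes two units, an alternative is to keep minutes and seasons in separate columns. The sketch below illustrates this on toy rows in the raw string format; the column names `duration_minutes` and `duration_seasons` are hypothetical and not used elsewhere in the notebook.

```python
import pandas as pd

# Toy rows in the raw format of the duration column (hypothetical values)
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "duration": ["93 min", "1 Season", "4 Seasons"],
})

# Pull out the leading number of each duration string
value = toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)

# Keep minutes and seasons in separate columns so the units never mix;
# rows of the other type are left as NaN
toy["duration_minutes"] = value.where(toy["type"] == "Movie")
toy["duration_seasons"] = value.where(toy["type"] == "TV Show")

print(toy)
```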

### What all manipulations have you done and insights you found?

* Of the 12 attributes, date_added and duration were not stored in suitable datatypes, so we converted date_added to datetime and duration to integer.
* The type feature shows that there are more movies than TV shows.
* We generated a word cloud to identify the words that occur most often in titles.
* Some movies and TV shows were filmed in multiple countries or have multiple genres associated with them, so we kept only the primary country and primary genre of each title.
* The highest number of shows on Netflix are rated TV-MA, followed by TV-14 and TV-PG. We mapped the ratings to audience groups such as adults, teens, and older kids, and found that around 50% of shows on Netflix are produced for an adult audience, followed by young adults, older kids, and kids. Shows aimed specifically at teenagers form the smallest group.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### **UNIVARIATE ANALYSIS -**

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#Question-1: Analyze the type of content available on Netflix.

fig,ax = plt.subplots(1,2, figsize=(14,5))

# countplot
graph = sns.countplot(x = 'type', data = netflix_df, ax=ax[0])
graph.set_title('Count of Values', size=20)

# pie chart (drawn on the second axes; the figure size is already set by plt.subplots)
netflix_df['type'].value_counts().plot(kind='pie', autopct='%1.2f%%', ax=ax[1], startangle=90)
ax[1].set_title('Percentage Distribution', size=20)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

We chose a countplot to show the exact counts of "Movies" and "TV Shows." This gives a clear comparison of how many of each type of content is in our dataset. Additionally, we chose a pie chart to show the percentage distribution of these content types, which allows us to see the proportion of movies and TV shows in the whole dataset.

##### 2. What is/are the insight(s) found from the chart?

- The majority of the content available on Netflix is in the form of "Movies."
- "TV Shows" constitute a smaller portion of the overall content on the platform.
- The pie chart provides a clear visual representation of the distribution, with "Movies" taking up a larger portion of the whole.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Content Strategy: The insight that a larger portion of the content consists of "Movies" suggests that movies are more prevalent on the platform. This could inform content acquisition and production strategies, allowing Netflix to focus on obtaining popular and diverse movie titles to cater to a wider audience.

2. User Engagement: Understanding that "Movies" dominate the content catalog can help Netflix tailor its marketing and user engagement strategies. This insight could lead to targeted promotional campaigns for specific genres, leveraging the popularity of movies to attract and retain subscribers.

3. Retention Strategies: By knowing that movies are more abundant, Netflix can create customized recommendations and curated collections to enhance user engagement and satisfaction. Providing users with relevant movie suggestions could lead to increased usage and longer subscription durations.

**Negative Growth Insights:**

The provided visualizations do not directly indicate any insights that would lead to negative growth. However, it's important to note that the smaller share of "TV Shows" might imply a potential gap in certain content areas:

1. Diversity of Content: If the available "TV Shows" are limited in number or variety, there could be negative implications for subscribers who prefer TV series. They might find the content offerings lacking, potentially leading to lower satisfaction or churn.

2. Market Competitiveness: If competitors are offering a broader range of TV shows, Netflix might face challenges in attracting users who are specifically seeking TV series content. This could impact their market share.

3. Subscription Tier Optimization: Depending on user preferences, Netflix might need to optimize its subscription tiers. If users primarily prefer TV shows, they might expect more options in lower-cost tiers, which could affect the perceived value of subscriptions.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Question-2 : Distribution of shows by primary filming country; identify the countries with the highest number of shows.

# Calculate the value counts of each country
country_counts = netflix_df['country'].value_counts()

# Choose the top N countries to display on the chart
top_countries = country_counts[:10]  # Change the number to choose how many countries to display

# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x=top_countries.index, y=top_countries.values, palette="viridis")
plt.xticks(rotation=45, ha="right", fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel("Country", fontsize=14)
plt.ylabel("Number of Shows", fontsize=14)
plt.title("Distribution of Shows by Primary Filming Country", fontsize=16)
plt.tight_layout()
plt.show()

# Print the countries with the highest number of shows
print("Top countries with the highest number of shows:")
print(top_countries)


##### 1. Why did you pick the specific chart?

The reason to choose a bar plot in this context is that it effectively visualizes the distribution of categorical data, which in this case is the primary filming country. Here's why a bar plot is appropriate:

1. Categorical Data: The primary filming country is a categorical variable, and a bar plot is a common choice for visualizing the distribution of categorical data. Each country is represented by a separate bar, making it easy to compare the counts of shows for different countries.

2. Comparison: A bar plot allows you to quickly identify and compare the countries with the highest number of shows. The lengths of the bars provide a clear visual representation of the frequency of shows from each country.

3. Ordered Presentation: The bars are ordered in descending order of show counts, placing the countries with the highest counts first. This helps in quickly identifying the top contributors.

4. Ease of Interpretation: Viewers can easily read the values on the y-axis (show counts) corresponding to each country on the x-axis, allowing for accurate interpretation of the distribution.

5. Insight Extraction: By observing the bar lengths, you can immediately identify the countries that have a significant presence on Netflix in terms of content production.

In summary, a bar plot is a suitable choice for visualizing the distribution of shows by primary filming country because it effectively presents categorical data in a manner that allows for easy comparison and insight extraction.

##### 2. What is/are the insight(s) found from the chart?

The output of the top countries with the highest number of shows provides valuable insights into the distribution of content on Netflix based on primary filming countries. Here are some insights we can gain from this information:

1. Content Production Leaders: The United States has a significant lead with the highest number of shows, indicating that it's a major contributor to Netflix's content library. This could be attributed to the presence of a robust entertainment industry in the U.S.

2. Global Diversity: Countries like India, the United Kingdom, Canada, and Japan also have a substantial number of shows. This suggests a diverse range of content from various parts of the world, catering to different viewer preferences and cultures.

3. Language and Localization: The presence of shows from different countries highlights Netflix's efforts to offer content in multiple languages and localize content for global audiences. This diversity can help attract and retain a broader subscriber base.

4. Regional Appeal: Countries like South Korea, Spain, and Mexico are notable contributors, indicating the popularity of content from these regions. This could reflect a growing interest in international content and a willingness to explore shows from different cultures.

5. Potential Audience Segmentation: The distribution of shows across countries can aid Netflix in segmenting its audience based on preferences. For example, they might offer curated content to specific regions to cater to local tastes.

6. Collaborative Productions: Co-productions between countries could also be contributing to the high numbers for some countries. Collaborations allow for the sharing of resources and talent, leading to more diverse and engaging content.

7. Market Penetration: The number of shows from a specific country might reflect the extent of Netflix's penetration into that market. Higher numbers could indicate a stronger presence and focus in certain regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights about the distribution of shows by primary filming countries can indeed have a positive business impact for Netflix, but there are also potential challenges that need to be considered. Let's examine both the positive and potentially negative aspects:

**Positive Business Impact:**

1. Global Market Expansion: The insights highlight that Netflix has a diverse content library spanning various countries. This can attract a global audience, leading to positive growth by expanding its subscriber base in different regions.

2. Localized Content: Understanding the top contributing countries allows Netflix to focus on creating and acquiring content that resonates with specific markets. This localization strategy can lead to higher engagement and viewer satisfaction.

3. Cultural Relevance: Content from different countries provides opportunities to cater to cultural and regional preferences. This can foster a sense of inclusivity and connect with viewers on a deeper level.

4. Strategic Partnerships: High show counts from specific countries can signify successful partnerships and collaborations. Netflix can continue nurturing these relationships for co-productions and exclusive content, strengthening its position in the market.

5. Data-Informed Decisions: The insights help Netflix make informed decisions about content acquisition, production, and distribution. This can optimize resource allocation and content strategy.

**Potential Challenges and Negative Impact:**

1. Market Saturation: Relying heavily on content from a few countries, particularly the United States, might lead to market saturation. Overwhelming viewers with content from a single region can limit appeal to a broader global audience.

2. Cultural Misalignment: While localization is beneficial, there's a risk of cultural misalignment if content is not adapted accurately. Insensitive content can lead to negative backlash and viewer attrition.

3. Competition and Differentiation: If other streaming platforms offer unique content from underrepresented regions, Netflix might face competition for viewership. Lack of diversity could hinder differentiation.

4. Localization Complexity: Producing and localizing content for multiple countries can be complex and resource-intensive. This can strain budgets and operational efficiency.

5. Content Diversity: Focusing excessively on countries with high show counts might lead to overlooking content from regions with smaller contributions. Diversifying content sources is essential for attracting a broader audience.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#Question-3 : content growth over the years?

# Filter data by type (TV Show or Movie)
tv_show = netflix_df[netflix_df["type"] == "TV Show"]
movie = netflix_df[netflix_df["type"] == "Movie"]

col = "year_added"

# Count content added each year for TV Shows and Movies
# rename_axis + reset_index(name=...) yields the same column names on pandas 1.x and 2.x
content_1 = tv_show[col].value_counts().rename_axis(col).reset_index(name="count")
content_1 = content_1.sort_values(col)

content_2 = movie[col].value_counts().rename_axis(col).reset_index(name="count")
content_2 = content_2.sort_values(col)

# Create traces for TV Shows and Movies
trace1 = go.Scatter(x=content_1[col], y=content_1["count"], name="TV Shows", marker=dict(color="#db0000"))
trace2 = go.Scatter(x=content_2[col], y=content_2["count"], name="Movies", marker=dict(color="#564d4d"))

data = [trace1, trace2]
layout = go.Layout(
    title="Content Added Over the Years",
    xaxis=dict(title="Year"),
    yaxis=dict(title="Count"),
    legend=dict(x=0.4, y=1.1, orientation="h")
)
fig = go.Figure(data, layout=layout)

# Display the figure
fig.show()


##### 1. Why did you pick the specific chart?

We chose a line chart (specifically, a Plotly scatter trace with connected points) to visualize the growth of content (TV shows and movies) over the years. Here's why this chart is suitable:

1. Temporal Data: The x-axis represents years, which is a continuous variable and can be effectively shown using a line chart. Line charts are often used to display trends over time.

2. Comparison: The line chart allows you to compare the growth of TV shows and movies side by side. Each line represents a content type, making it easy to observe the trends and differences.

3. Connected Data Points: The scatter trace connects the yearly points with lines. This is a good choice when the data points are discrete (years) but we still want to show the trend between them.

4. Multiple Series: There are two data series (TV shows and movies) to compare. Line charts are well-suited for displaying multiple series on the same graph.

5. Year-to-Year Change: Line charts are effective for showing changes in data over time. You can quickly see if there are any spikes or drops in content added during certain years.

##### 2. What is/are the insight(s) found from the chart?

**TV Shows:**

* There was a small amount of TV show content added in the early years (2008 to 2010).
* A notable increase in TV show additions started around 2015, which continued to grow in the subsequent years.
* The highest growth in TV show content occurred from 2016 to 2020, with a peak of 697 TV shows added in 2020.
* The count for 2021 is much lower, but the data only covers additions up to 16 January 2021, so that year is incomplete and not comparable with the previous years.

**Movies:**

* Similar to TV shows, the earliest years (2008 to 2010) saw a relatively small number of movie additions.
* There's a noticeable increase in movie additions starting from 2014, with a more significant rise in 2016.
* The growth trend continues from 2016 to 2019, with the highest number of movies (1497) added in 2019.
* Movie additions show a slight decline in 2020. The 2021 count covers only additions up to 16 January 2021, so it is not comparable with full years.

**Overall Insights:**

* Both TV shows and movies exhibit a growth trend over the years, with significant expansion starting around 2015-2016.
* The years 2018 and 2019 appear to be particularly active in terms of content additions for both TV shows and movies.
* The slight decline in movie additions in 2020 might be attributed to factors like production delays caused by the COVID-19 pandemic; TV show additions still peaked that year.
* The lower number of additions in 2021 for both TV shows and movies could indicate a potential shift in strategy or the impact of ongoing circumstances.
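
The year-by-type counts behind these insights can be tabulated with a crosstab. The following is a minimal sketch on made-up rows: `toy_df` is a hypothetical stand-in for `netflix_df`, and only the column names `year_added` and `type` are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for netflix_df; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    "year_added": [2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show",
             "TV Show", "TV Show", "Movie"],
})

# One row per year, one column per content type.
yearly_counts = pd.crosstab(toy_df["year_added"], toy_df["type"])

# Year with the most additions for each content type.
peak_years = yearly_counts.idxmax()

print(yearly_counts)
print(peak_years)
```

Applied to the real dataframe, the same two calls would give the per-year counts and the peak year for each content type without reading values off the chart.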

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Strategic Decision-Making: The insights into the years of significant growth (such as 2018-2019) can help Netflix understand what strategies, content acquisitions, or original productions were successful during those periods. This information can guide future decision-making to replicate and build upon those successes.

2. Content Investment: Identifying trends in content growth can help Netflix allocate resources more effectively. For instance, if TV shows have shown consistent growth, Netflix might choose to invest more in producing and acquiring TV show content, targeting genres and themes that have proven popular.

3. Subscriber Retention and Attraction: Consistent content growth can be a driver for subscriber retention and acquisition. New and diverse content attracts and retains subscribers, potentially reducing churn rates.

4. Global Events Impact: The slight dip in movie additions in 2020 may be partly attributable to the COVID-19 pandemic, which disrupted production schedules worldwide, although TV show additions peaked that year. This insight can be useful for understanding the impact of external events on content availability and setting expectations for subscribers during such periods.

**Potential Negative Impact:**

1. Decline in Content Additions: The drop in content additions in 2021, both for TV shows and movies, could potentially lead to a decrease in subscriber engagement and retention. Users often expect a steady stream of fresh content, and a sudden drop could result in dissatisfaction.

2. Competition and Variety: The consistent growth in content could lead to oversaturation and reduced audience engagement if the content quality or variety isn't maintained. Users may become overwhelmed with choices, and competitors might offer content that better aligns with specific tastes.

3. Production Delays: Production delays due to unforeseen events, as seen in 2020, could lead to lower content availability. This could negatively impact user engagement and satisfaction, potentially affecting subscription renewals.

4. Content Quality: While the insights provided do not directly touch upon content quality, it's important to consider that growth should be accompanied by maintaining high-quality content. If content additions are made solely to meet quantity targets without considering quality, it could lead to dissatisfaction among subscribers.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#Question-4 : In which month do most movies and tv shows get added?

# Create a dataframe of month values and their counts.
# rename_axis/reset_index(name=...) gives the same column names on pandas 1.x and 2.x.
months_df = (netflix_df['month_added'].value_counts()
             .rename_axis('month')
             .reset_index(name='count'))
fig = px.bar(months_df, x="month", y="count", text_auto=True, color='count', color_continuous_scale=['#db0000', '#564d4d'])
fig.update_layout(
    title={
        'text': 'Month wise addition of movies and shows to the platform',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        autosize=False,
        width=1000,
        height=500)
fig.show()


##### 1. Why did you pick the specific chart?

The bar chart is an appropriate choice for this visualization because it effectively displays the distribution of content additions across months and allows for easy comparison and interpretation of the data.

##### 2. What is/are the insight(s) found from the chart?

* High-Volume Months: The months with the highest content additions are December (833), October (785), and January (757). These months seem to be particularly active in terms of adding new content to the platform.

* End-of-Year Peaks: December stands out as the month with the highest content additions. This could be related to the holiday season, where more people might be engaged in streaming and the platform might want to provide a variety of content choices.

* Release Patterns: The top months for content additions (December, October, January) might correspond to popular times for content releases, aligning with holidays, school breaks, or other cultural events.

* Mid-Year Dips: Months like May (543) and June (542) show relatively lower content additions. This could be due to factors like production schedules, vacation periods, or a focus on promoting existing content.

* Consistent Activity: Months from March to August (with counts ranging from 542 to 669) show consistent content additions. This could reflect a strategy to provide a steady stream of new content throughout the year.

* Potential Seasonal Patterns: It's interesting to see that December, October, and January are clustered at the top. This could indicate a seasonal pattern related to holidays, colder weather, or viewer behavior during specific periods.

* Varied Peaks: While December, October, and January are the high-volume months, November (738) is close behind, whereas February (472) has the lowest count quoted here.
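
Seasonal patterns are easier to judge when the months are shown in calendar order instead of count order. The sketch below assumes that `month_added` stores month numbers from 1 to 12 (an assumption, since the column's construction is not shown here) and uses invented values.

```python
import calendar
import pandas as pd

# Hypothetical month numbers; the real values would come from netflix_df['month_added'].
toy_months = pd.Series([12, 12, 12, 12, 1, 10, 10, 10, 2], name="month_added")

# Count per month, then force calendar order and fill absent months with zero.
monthly_counts = toy_months.value_counts().reindex(range(1, 13), fill_value=0)

# Replace month numbers with short names for readable axis labels.
monthly_counts.index = [calendar.month_abbr[m] for m in monthly_counts.index]

busiest = monthly_counts.idxmax()
print(monthly_counts)
print("Busiest month:", busiest)
```

Passing a series ordered this way to the bar chart would keep January to December on the x-axis, so end-of-year peaks and mid-year dips appear where a reader expects them.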

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Strategic Content Releases: The insights into high-volume months (such as December, October, and January) can be leveraged for strategic content releases. By concentrating major releases during these months, Netflix can maximize viewer engagement and subscriptions, leading to positive business impact.

2. Subscriber Engagement: Releasing content during months of higher user engagement can enhance user satisfaction and encourage longer subscriptions. Users are more likely to stay engaged when they have a variety of new content options.

3. Marketing Campaigns: The months with higher content additions can be targeted for marketing campaigns to highlight the availability of new and exciting content. This can attract both new subscribers and existing ones.

4. Revenue Generation: Optimizing content release schedules can lead to increased user engagement, attracting more viewers and generating additional revenue through increased subscription numbers and user retention.

**Potential Negative Impact:**

1. Content Oversaturation: Focusing too heavily on high-volume months could lead to content oversaturation during those periods. This might result in users feeling overwhelmed by the choices and potentially not fully engaging with the content.

2. Neglecting Low-Volume Months: Overemphasizing high-volume months could lead to neglecting low-volume months. If users experience a lack of new content during these months, they might become dissatisfied and consider canceling subscriptions.

3. Unpredictable Viewer Behavior: While the insights show trends, viewer behavior can be unpredictable. Relying solely on high-volume months might not fully capture the diverse preferences and viewing habits of all subscribers.

4. Competition: Other streaming platforms might also capitalize on high-volume months, leading to increased competition for viewers' attention and subscriptions. This could result in a fragmented audience and potential negative impact on subscriber numbers.

5. Quality Over Quantity: Prioritizing content additions during high-volume months might compromise content quality. Focusing on releasing high-quality content should be a priority to maintain viewer satisfaction and loyalty.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#Question-5 : Which days are more prominent?

# Create a dataframe of day values and their counts.
# rename_axis/reset_index(name=...) gives the same column names on pandas 1.x and 2.x.
days_df = (netflix_df['day_added'].value_counts()
           .rename_axis('day')
           .reset_index(name='count'))

fig = px.bar(days_df, x="day", y="count", text_auto=True, color='count', color_continuous_scale=['#db0000', '#564d4d'])
fig.update_layout(
    title={
        'text': 'Which days are more prominent',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        autosize=False,
        width=1200,
        height=600)
fig.show()


##### 1. Why did you pick the specific chart?

The bar chart is a suitable choice for this visualization because it effectively displays the distribution of content additions across days of the month and allows for easy comparison and interpretation of the data.

##### 2. What is/are the insight(s) found from the chart?

* Early-Month Concentration: `day_added` is the day of the month (its values run up to 31), not the day of the week. Days 1 to 5 have significantly higher content additions than days 6 and 7, which suggests that content is added more frequently in the first days of a month.

* Day 1 Peak: Day 1 has the highest count of content additions (2069). This suggests a practice of adding new content at the beginning of the month.

* Mid-Month Peaks: Days around the 15th of the month (days 15 and 16) have relatively high content additions (644 and 240, respectively). This could indicate a trend of content additions around the middle of the month.

* End-of-Month Surges: Days at the end of the month (days 31, 30, and 31) also show relatively higher content additions (274, 182, and 130, respectively). This could be related to content releases before the end of the month.

* Low Counts on Days 6 and 7: Days 6 and 7 have lower content additions (165 and 162, respectively), in line with additions tapering off after the first days of the month.

* Consistency in Numbers: Days in the mid-range (days 18 to 28) show consistent content additions, indicating a steady flow of new content throughout the month.

* Influence of Viewer Behavior: The higher content additions at the beginning and middle of the month might reflect scheduling conventions or viewer behavior patterns, such as higher engagement around mid-month paydays.
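
A day-of-month value and a day-of-week value answer different questions, and both can be derived from the same date column. The sketch below uses four hypothetical dates to show that the 1st of a month can fall on any weekday; `toy_dates` is an illustrative stand-in for a parsed `date_added` column.

```python
import pandas as pd

# Hypothetical addition dates, parsed to datetime.
toy_dates = pd.to_datetime(
    pd.Series(["2020-01-01", "2020-01-15", "2020-02-01", "2020-02-29"])
)

# Day of the month (1 to 31): what a column such as day_added typically holds.
day_of_month = toy_dates.dt.day

# Day of the week by name: needed for any weekday-versus-weekend claim.
weekday_name = toy_dates.dt.day_name()

print(pd.DataFrame({"date": toy_dates, "day": day_of_month, "weekday": weekday_name}))
```

Statements about weekdays and weekends would need a count over `weekday_name`, while statements about the start, middle, or end of the month rest on `day_of_month`.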

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Strategic Content Releases: The insights into days with higher content additions (e.g., the 1st of the month, mid-month, end-of-month) can be leveraged for strategic content releases. By concentrating major releases during these periods, Netflix can maximize viewer engagement and subscriptions.

2. Optimized User Engagement: Aligning content releases with the days of the month when users are most active can lead to optimized user engagement and longer subscription durations. Users are more likely to engage when there's fresh content available.

3. Viewer Satisfaction: Consistent content additions throughout the month can enhance viewer satisfaction. Offering a steady stream of new content prevents content gaps and provides viewers with reasons to keep using the platform.

4. Content Variety: Analyzing specific days with lower content additions (e.g., days 6 and 7) could be an opportunity to diversify content and cater to different viewer preferences during those times.

**Potential Negative Impact:**

1. Neglecting Low-Activity Days: Concentrating additions on a few days of the month might leave long stretches without new titles. If users experience a lack of new content between those dates, they might become dissatisfied and consider canceling subscriptions.

2. Viewer Fatigue: Concentrating content additions on specific peak days (e.g., the 1st of the month) might result in viewer fatigue. Releasing too much content all at once could lead to oversaturation and reduced engagement.

3. Content Quality Over Quantity: Focusing solely on aligning content additions with specific days could compromise content quality. It's important to ensure that content releases maintain high quality to retain viewer satisfaction.

4. Neglecting Viewer Diversity: Viewer behavior varies widely, and not all users follow the same patterns. Relying exclusively on the insights from certain days might overlook segments of users with different preferences and schedules.

5. Competition: If other streaming platforms also follow similar patterns of concentrated content releases, it could lead to increased competition for viewers' attention, potentially resulting in a fragmented audience.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#Question-6 : top 10 and last 10 genre in listed_in?

# Separate the individual genres in the listed_in column for analysis
genres = netflix_df['listed_in'].str.split(', ', expand=True).stack()
# Count titles per genre, sorted from most to least frequent
genres = genres.value_counts().rename_axis('genre').reset_index(name='count')

# plotting graph
fig,ax = plt.subplots(1,2, figsize=(15,6))

# Top 10 genres
top = sns.barplot(x='genre', y = 'count', data=genres[:10], ax=ax[0])
top.set_title('Top 10 genres present in Netflix', size=20)
plt.setp(top.get_xticklabels(), rotation=90)

# Last 10 genres
bottom = sns.barplot(x='genre', y = 'count', data=genres[-10:], ax=ax[1])
bottom.set_title('Last 10 genres present in Netflix', size=20)
plt.setp(bottom.get_xticklabels(), rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot (bar chart) is appropriate for visualizing the distribution of genres in the "listed_in" column of the Netflix dataset, for the following reasons:

1. Categorical Data: The genres extracted from the "listed_in" column are categorical data. A bar plot is commonly used to represent the distribution of categorical data.

2. Comparison: Bar plots are ideal for comparing the frequency or count of different categories. In this case, the count of each genre is compared.

3. Ordered Data: `value_counts()` returns the genres sorted by frequency, so the bars appear in descending order of count, and the first and last ten rows are the most and least common genres.
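
Because a title can list several genres in one string, the column has to be split before counting. The sketch below shows one way to do this with `str.split` and `explode` on a hypothetical `toy_listed_in` series; it is an alternative to the `expand=True` plus `stack()` approach in the cell above, not a replacement prescribed by the notebook.

```python
import pandas as pd

# Hypothetical multi-genre strings in the style of the listed_in column.
toy_listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Documentaries",
    "Dramas, Comedies",
], name="listed_in")

genre_counts = (
    toy_listed_in.str.split(", ")   # each cell becomes a list of genres
    .explode()                      # one row per (title, genre) pair
    .value_counts()                 # sorted from most to least frequent
    .rename_axis("genre")
    .reset_index(name="count")
)

print(genre_counts.head(10))   # most common genres
print(genre_counts.tail(10))   # least common genres
```

Since the result is already sorted by count, `head(10)` and `tail(10)` correspond to the two panels of the chart.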

##### 2. What is/are the insight(s) found from the chart?

**Top 10 Genres:**

1. Diverse Genre Offerings: The top genres include a variety of content, such as dramas, comedies, documentaries, and action & adventure. This diversity reflects Netflix's efforts to cater to a wide range of viewer preferences.

2. Mainstream Appeal: Genres like dramas, comedies, and documentaries have a high count, indicating their popularity and mainstream appeal among viewers.

3. Global Audience: The presence of "International TV Shows" in the top genres suggests that Netflix has a strong focus on providing content from various countries, appealing to a global audience.

4. Family and Kids' Content: The presence of "Children & Family Movies," "Kids' TV," and "Animation" genres indicates a commitment to offering family-friendly content.

5. Entertainment Variety: Genres like "Stand-Up Comedy" and "Music & Musicals" add entertainment variety, addressing different moods and preferences.

**Last 10 Genres:**

1. Niche and Specialized Content: The genres in the last 10 list, such as "Cult Movies," "TV Horror," and "Sci-Fi & Fantasy," tend to be more specialized and might cater to niche audiences.

2. Limited Representation: Genres with lower counts, such as "LGBTQ Movies," "Sports Movies," and "Spanish-Language TV Shows," are less represented in the catalog than mainstream genres. The counts measure how many titles are offered, so they may or may not reflect audience demand.

3. Highly Specific Content: The genres "TV Sci-Fi & Fantasy" and "TV Horror" are specific subgenres that might cater to fans of these particular genres.

4. Limited Availability: Some genres with very low counts (e.g., "Sports Movies") might indicate that Netflix offers limited content within those genres.

5. Viewer Diversity: The presence of genres like "TV Shows" and "Romantic Movies" suggests that Netflix aims to cater to diverse viewer interests, even if these genres have lower counts.

6. Content Focus: Lower counts in some genres might reflect a strategic decision to focus resources on more popular and mainstream genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Viewer Engagement: Offering a diverse range of popular genres (e.g., dramas, comedies, documentaries) can lead to higher viewer engagement, longer viewing sessions, and increased subscription renewals.

2. Global Audience: The inclusion of "International TV Shows" reflects Netflix's commitment to catering to a global audience. This can lead to a broader user base and positive business impact through increased international subscriptions.

3. Family-Friendly Content: Providing genres like "Children & Family Movies" and "Kids' TV" can attract families and parents, resulting in higher subscriptions and positive word-of-mouth recommendations.

4. Entertainment Variety: Offering a mix of genres, including "Stand-Up Comedy" and "Music & Musicals," can attract viewers seeking different types of entertainment, leading to longer engagement on the platform.

5. Catering to Niche Audiences: While some genres have lower counts, they might cater to niche audiences with passionate fan bases. Satisfying these niche audiences can lead to increased loyalty and positive reviews.

**Potential Negative Impact:**

1. Neglected Genres: Overemphasis on popular genres could lead to neglecting genres with lower counts. This might result in decreased engagement from viewers who prefer these genres.

2. Oversaturation: Overemphasizing the most popular genres might lead to oversaturation, causing viewers to become overwhelmed with content choices and potentially reducing engagement.

3. Limited Niche Content: While catering to niche audiences is valuable, focusing solely on niche genres might limit overall viewership and potentially result in negative growth if those genres don't have a sustainable audience.

4. Quality Over Quantity: Prioritizing quantity over quality in certain genres could lead to viewer dissatisfaction, negative reviews, and potential churn.

5. Missed Opportunities: Neglecting certain genres (e.g., LGBTQ Movies, Spanish-Language TV Shows) might miss opportunities to capture specific viewer segments, potentially leading to negative growth within those segments.

6. Competition: If certain genres are neglected or not well-curated, viewers might turn to other streaming platforms that offer more diverse and tailored genre options.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#Question-7 : Total shows/movies added each years?
sns.set(rc={'figure.figsize':(15,7)})
sns.countplot(x='year_added',data=netflix_df,palette="Set1")

plt.title('Total shows/movies added each year on Netflix', size=15, fontweight="bold")
plt.show()



##### 1. Why did you pick the specific chart?

The countplot is a suitable choice for visualizing the distribution of shows/movies added to Netflix each year. It effectively presents the frequency of content additions for each year and enables easy comparison and interpretation of the data.

##### 2. What is/are the insight(s) found from the chart?

- Rapid Growth in Recent Years: The years 2019 and 2020 saw the highest numbers of content additions, with 2153 and 2009 shows/movies added, respectively. This indicates a period of rapid growth for Netflix's content library in recent years.

- Continued Expansion: Following 2019 and 2020, the year 2018 also had a substantial number of content additions, with 1685 shows/movies added. This suggests that Netflix's content expansion efforts have been consistent over multiple years.

- Steady Growth: The years 2017 and 2016 also had significant numbers of content additions, with 1225 and 443 shows/movies added, respectively. This indicates steady growth in Netflix's content library during those years.

- Recent Decline: In 2021, the number of content additions dropped to 117 shows/movies. While this could indicate a slowdown, it's important to note that the data might not be complete for the entire year, and trends can change throughout the year.

- Early Years: The years 2014 and earlier had lower numbers of content additions, suggesting that Netflix's content library was smaller in its early years of operation.
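
The growth and decline described above can be quantified as year-over-year percentage change. This sketch uses only the yearly totals quoted in the insights (2016 to 2021); the series name `titles_added` is illustrative.

```python
import pandas as pd

# Yearly totals as quoted in the insights above.
additions = pd.Series(
    [443, 1225, 1685, 2153, 2009, 117],
    index=[2016, 2017, 2018, 2019, 2020, 2021],
    name="titles_added",
)

# Percentage change relative to the previous year; the first year has no baseline.
yoy_pct = (additions.pct_change() * 100).round(1)

print(pd.DataFrame({"titles_added": additions, "yoy_pct": yoy_pct}))
```

On these figures, growth slows each year after 2017, turns slightly negative in 2020, and collapses in 2021, where incomplete data is the likelier explanation.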

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Content Library Growth: The rapid growth in content additions in recent years (2019 and 2020) indicates that Netflix is actively investing in expanding its content library. This can have a positive impact on user engagement, attracting new subscribers and retaining existing ones.

* Subscriber Retention: Consistent content additions over multiple years (2017, 2018) contribute to subscriber satisfaction and retention. A diverse and growing content library can encourage users to stay subscribed.

* Competitive Edge: Regular content updates give Netflix a competitive edge by offering a wider variety of content compared to competitors. This can attract viewers looking for a comprehensive entertainment experience.

* Market Penetration: High content additions in recent years indicate Netflix's efforts to penetrate and capture a larger share of the global streaming market.

* Original Content Strategy: The growth in content additions aligns with Netflix's strategy of producing original content. Original shows and movies can generate brand loyalty and exclusivity.

**Potential Negative Impact:**

* Decline in Content Additions: The sudden drop in content additions in 2021 most likely reflects incomplete data for that year, but it could also signal a shift in content strategy. If content additions genuinely continue to decrease, it could lead to viewer dissatisfaction and churn.

* Subscription Attrition: A decline in content additions might result in users seeking content elsewhere, leading to subscription attrition or reduced acquisition of new subscribers.

* Saturation Effect: Oversaturation of the content library can overwhelm viewers, making it difficult for them to choose what to watch. This could lead to viewer frustration and potentially reduced engagement.

* Missed Opportunities: A lower number of content additions in earlier years might indicate missed opportunities to capture early adopters and establish a larger subscriber base from the beginning.

* Increased Competition: If other streaming platforms continue to invest heavily in content additions, Netflix's reduced growth could lead to increased competition for viewer attention.

* Lack of Freshness: A low number of content additions might result in a lack of freshness in the content library, potentially leading to viewer fatigue and decreased engagement.

### BIVARIATE ANALYSIS

#### Chart - 8

In [None]:
# Chart - 8 visualization code
#Question-8 : Rating based on rating system of all TV Shows and movies?
# Rating vs. Type (Grouped bar chart)
plt.figure(figsize=(10, 6))
sns.countplot(x="rating", hue="type", data=netflix_df)
plt.title("Rating vs. Type")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.legend(title="Type")

# Print count values on the bars (skip empty or zero-height bars)
ax = plt.gca()
for p in ax.patches:
    height = p.get_height()
    if pd.isna(height) or height == 0:
        continue
    ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                textcoords='offset points')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot is a type of bar plot that is specifically designed to show the count of occurrences of a categorical variable. It is particularly useful when you want to visualize the distribution of categorical data and compare the frequency of different categories.

##### 2. What is/are the insight(s) found from the chart?

* For the "Adults" rating, there are significantly more movies (2595) than TV shows (1025). For the "Teens" rating, there are only movies (386) and no TV shows.

* The "Young Adults" rating also leans toward movies (1272, about 66%) over TV shows (659), though less strongly than "Adults" (about 72% movies).

* For the "Older Kids" and "Kids" ratings, there are more movies than TV shows: "Older Kids" has 852 movies and 478 TV shows, while "Kids" is the most balanced category with 267 movies and 246 TV shows.
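
Raw counts make it hard to judge how balanced each rating group is, so it helps to convert them to shares. The sketch below builds a small table from the counts quoted above and computes the percentage of movies in each group; the variable names are illustrative.

```python
import pandas as pd

# Counts per rating group and content type, as quoted in the insights above.
counts = pd.DataFrame(
    {"Movie": [2595, 1272, 852, 267, 386], "TV Show": [1025, 659, 478, 246, 0]},
    index=["Adults", "Young Adults", "Older Kids", "Kids", "Teens"],
)

# Share of movies within each rating group, in percent.
movie_share = (counts["Movie"] / counts.sum(axis=1) * 100).round(1)

print(movie_share.sort_values())
```

A share close to 50% indicates an even split between movies and TV shows, and a share of 100% indicates a group with no TV shows at all.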

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impacts:**

1. Dominant Content Types: The insight that certain content ratings are dominant in specific content types (e.g., "Adults" rating having more movies) can help Netflix allocate resources more effectively. For example, producing more content within popular rating categories could attract and retain subscribers who prefer those ratings, leading to positive growth.

2. Balanced Distribution: The near-even split of the "Kids" rating between movies (267) and TV shows (246), and the roughly two-to-one split for "Young Adults", indicate that audiences in those groups are served by both formats. Offering a variety of content types can lead to higher engagement and satisfaction among different segments of viewers.

**Negative Impacts:**

1. Limited TV Shows for Certain Ratings: The absence of TV shows for the "Teens" rating might result in a missed opportunity to attract younger viewers looking for TV show content. This could lead to negative growth in the teenage demographic if not addressed.

2. Children's Content: The higher count of movies compared to TV shows in the "Older Kids" and "Kids" categories might limit the options available to younger audiences who prefer TV shows. This could result in negative growth among families seeking TV show content for children.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Question-9 : Season-wise distribution of tv shows
tv_df = netflix_df[netflix_df['type']=='TV Show']

# Count shows per season label; the column names are set explicitly so the
# code behaves the same on pandas 1.x and 2.x.
tv = tv_df['duration'].value_counts().rename_axis('seasons').reset_index(name='count')

fig = px.pie(tv, values='count', names='seasons', color_discrete_sequence=px.colors.sequential.Greens)
fig.update_layout(title="Season-wise distribution of TV shows")
fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=20,
                  marker=dict( line=dict(color = 'RebeccaPurple', width=2)))
fig.show()

##### 1. Why did you pick the specific chart?

A pie chart is a suitable choice for visualizing the distribution of TV shows on Netflix based on the number of seasons. It effectively conveys the proportion of TV shows within each season category and allows viewers to compare these proportions visually.

##### 2. What is/are the insight(s) found from the chart?

1. Diverse Content Strategy: Netflix has a diverse content strategy that includes a mix of single-season shows and multi-season shows. This strategy allows them to cater to a wide range of viewer preferences and consumption habits.

2. Emphasis on Shorter Formats: The dominance of single-season shows suggests that Netflix invests in producing shorter formats like mini-series and limited series. These formats might be more appealing to viewers who prefer concise storytelling.

3. Variety in Multi-Season Shows: The presence of multi-season shows in different ranges (2-3, 4-6, etc.) indicates that Netflix offers a variety of ongoing series and shows that explore longer story arcs.

4. Viewer Engagement with Long-Running Shows: Although rare, the presence of TV shows with higher numbers of seasons suggests that there are shows on Netflix that have managed to maintain viewer engagement over a significant period.

5. Impact of Production and Costs: The decreasing frequency as the number of seasons increases could be influenced by production costs and viewer engagement. Longer-running shows require sustained resources and consistent audience interest.

6. Balance between Quantity and Quality: The distribution might reflect Netflix's approach to balance the quantity of content with the quality of storytelling. This can ensure that both shorter and longer shows maintain a certain level of engagement and production value.

7. Changing Viewer Preferences: The gaps in the distribution might indicate changing viewer preferences. The absence of mid-range shows (7-9 seasons) could be due to viewer interest shifting toward shorter or longer formats.
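
Claims about season ranges are easier to check when the season labels are converted to integers. The sketch below assumes the `duration` values for TV shows look like "1 Season" or "3 Seasons" (an assumption about the column's format) and uses invented rows.

```python
import pandas as pd

# Hypothetical duration labels for TV shows.
toy_duration = pd.Series(
    ["1 Season", "2 Seasons", "1 Season", "5 Seasons", "1 Season", "3 Seasons"]
)

# Pull out the leading number and convert it to an integer season count.
seasons = toy_duration.str.extract(r"(\d+)", expand=False).astype(int)

# Share of shows for each season count, ordered by number of seasons.
season_share = seasons.value_counts(normalize=True).sort_index()

print(season_share)
```

With numeric season counts, the share of single-season shows and the presence or absence of any range of seasons can be read directly from `season_share`.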

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Content Diversity: The diverse distribution of TV shows across different numbers of seasons indicates that Netflix is catering to a wide range of viewer preferences. This diversity can attract and retain a broader audience, leading to positive business impact.

2. Viewer Engagement: The presence of TV shows with higher numbers of seasons suggests that some shows have successfully maintained viewer engagement over the long term. These engaged viewers contribute to positive word-of-mouth, loyalty, and potentially higher subscriber retention rates.

3. Data-Informed Renewals: The insights gained from this distribution can inform content renewal decisions. Shows with strong engagement and consistent viewer interest can be renewed for additional seasons, leading to sustained viewer satisfaction.

4. Appeal to Different Viewers: By offering both single-season and multi-season shows, Netflix can attract viewers with varying preferences. Some viewers prefer short, self-contained stories, while others enjoy longer narrative arcs.

5. Platform Stickiness: A diverse content library with shows of different lengths can make Netflix more "sticky" for subscribers. Subscribers might stay engaged for a longer time as they explore a variety of content.

**Potential Negative Impact:**

1. Overemphasis on Short Formats: If Netflix focuses excessively on producing single-season shows, it could result in a lack of long-running, ongoing series. This might lead to viewer dissatisfaction if subscribers are seeking shows with more extended storylines.

2. Content Fatigue: A skewed distribution with a majority of single-season shows might lead to content fatigue, as viewers may find it challenging to invest in shorter formats repeatedly. This could potentially impact viewer engagement and retention.

3. Risk of Abandoning Shows: If Netflix doesn't renew shows with potential for longevity, it might miss out on cultivating dedicated fan bases and long-term viewer engagement. Prematurely discontinuing shows could lead to subscriber disappointment.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
#Question-10 : top 10 directors who directed tv shows and movies?
fig,ax = plt.subplots(1,2, figsize=(14,5))

# Top 10 directors of TV shows
tv_shows = netflix_df[netflix_df['type']=='TV Show']['director'].value_counts()[:10].plot(kind='barh', ax=ax[0])
tv_shows.set_title('Top 10 directors of TV shows', size=15)

# Top 10 directors of movies
movies = netflix_df[netflix_df['type']=='Movie']['director'].value_counts()[:10].plot(kind='barh', ax=ax[1])
movies.set_title('Top 10 directors of movies', size=15)

plt.tight_layout(pad=1.2, rect=[0, 0, 0.95, 0.95])
plt.show()

##### 1. Why did you pick the specific chart?

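A horizontal bar chart suits this ranking because director names are long categorical labels that stay readable on the y-axis, and bar length makes the title counts of the top 10 directors easy to compare. Placing TV shows and movies in side-by-side panels keeps the two rankings separate.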

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing through your code and statistical tests to reach a final conclusion about the statements.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
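# Handling Missing Values & Missing Value Imputation
# Replace missing director, cast, and country values with the placeholder 'unknown'
netflix_df[['director','cast','country']] = netflix_df[['director','cast','country']].fillna('unknown')
# Drop the remaining rows that still contain nulls (rating and date_added)
netflix_df.dropna(axis=0, inplace=True)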

#### What all missing value imputation techniques have you used and why did you use those techniques?

- Missing values in the director, cast, and country attributes are replaced with the string 'unknown'.
- The rating and date_added columns contain only a small percentage of null values, so dropping those rows has little effect on the model. We therefore simply drop the rows where rating or date_added is null.
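The reasoning above can be checked with a short sketch. It runs on a small toy DataFrame (`toy_df`, with made-up values standing in for `netflix_df`), computes the percentage of nulls per column, and then applies the same fill-then-drop steps. The names `toy_df`, `null_pct`, `fill_cols`, and `cleaned` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for netflix_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "director": ["A", np.nan, "C", np.nan],
    "cast": ["x", "y", np.nan, "z"],
    "country": ["India", np.nan, "US", "UK"],
    "rating": ["TV-MA", "PG", np.nan, "R"],
    "date_added": ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"],
})

# Percentage of missing values per column; this helps decide
# which columns to fill and which rows are cheap to drop
null_pct = toy_df.isna().mean().mul(100).round(2)
print(null_pct)

# Fill the descriptive text columns with a placeholder string
fill_cols = ["director", "cast", "country"]
cleaned = toy_df.copy()
cleaned[fill_cols] = cleaned[fill_cols].fillna("unknown")

# Drop the few rows that still contain nulls (rating / date_added)
cleaned = cleaned.dropna(axis=0)
print(cleaned.isna().sum())
```

Running the same `isna().mean()` check on the real `netflix_df` before imputation is one way to confirm that the share of nulls in rating and date_added is small enough to drop.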

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in a pickle or joblib file format for the deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***