<a href="https://colab.research.google.com/github/saurabhsingh3786/Netflix-Movies-and-TV-Shows-Clustering/blob/main/Team_Notebook_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font size="+4" color=blue><center> Netflix Movies and TV Shows Clustering</center></font>



##### **Project Type**    - Unsupervised
##### **Contribution**    - Team
##### **Team Member 1 -**   Bharathwaj Bejjarapu
##### **Team Member 2 -**   Shriya Chouhan
##### **Team Member 3 -**   Saurabh Singh


# **Project Summary -**

Introduction:

With more than 83 million subscribers and presence in more than 190 countries, Netflix is the most popular Internet television network in the world. Its users watch more than 125 million hours of TV and movie content daily, including original series, documentaries, and feature films. On almost any screen that is linked to the Internet, members can watch as much as they want, whenever and wherever. Without interruptions or obligations, members can play, pause, and resume watching at any time.

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.

In this project, we worked on a text clustering problem in which we had to group the Netflix movies and shows into clusters such that the titles within a cluster are similar to each other and the titles in different clusters are dissimilar.

**In this project, we are required to do the following:**
1. Exploratory Data Analysis
2. Understanding what type of content is available in different countries
3. Assessing whether Netflix has increasingly focused on TV rather than movies in recent years
4. Clustering similar content by matching text-based features



* The dataset contains 7787 records and 12 attributes.

* We started by working on the missing values in the dataset and conducting exploratory data analysis (EDA).

* The following attributes were used to create the clusters: cast, country, genre, director, rating, and description. The TF-IDF vectorizer was used to tokenize, preprocess, and vectorize the values in these attributes.

* The problem of dimensionality was dealt with through the use of Principal Component Analysis (PCA).

* We built two types of clusters, using the K-Means and Agglomerative Hierarchical clustering algorithms respectively, and determined the optimal number of clusters with methods including the elbow method, silhouette score, and dendrogram.

* The similarity matrix generated by applying cosine similarity was used to construct a content-based recommender system. The user will receive ten recommendations from this recommender system based on the type of show they watched.
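
The vectorize, reduce, and cluster steps listed above can be sketched on a tiny example. This is a minimal illustration only: `toy_df`, its contents, the number of PCA components, and the range of `k` are all assumptions made for the sketch, not the settings used in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy catalogue standing in for the real Netflix data
toy_df = pd.DataFrame({
    "listed_in": ["Horror Movies", "Horror Movies", "Comedies",
                  "Comedies", "Documentaries", "Documentaries"],
    "description": [
        "A haunted house terrifies a family at night",
        "Ghosts stalk teenagers in a haunted forest",
        "A clumsy waiter stumbles into hilarious trouble",
        "Two friends open a hilarious comedy club",
        "A documentary explores ocean wildlife and coral reefs",
        "Scientists document wildlife across the Arctic",
    ],
})

# Combine the text attributes into one string per title
toy_df["text"] = toy_df["listed_in"] + " " + toy_df["description"]

# Tokenize and vectorize the text with TF-IDF
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(toy_df["text"]).toarray()

# Reduce dimensionality with PCA (3 components is an arbitrary choice here)
X_reduced = PCA(n_components=3, random_state=42).fit_transform(X)

# Compare candidate cluster counts using the silhouette score
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the candidate range for `k` and the number of retained components would be chosen from the elbow plot and the explained variance rather than fixed up front.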

**This comprehensive analysis and recommendation system are expected to enhance user satisfaction, leading to improved retention rates for Netflix.**

# **GitHub Link -**

https://github.com/bharath977/Netflix_Content_Analysis_and_Clustering_for_Insights

https://github.com/ShriyaChouhan/Netflix_Movies_and_TV_Shows_Clustering

https://github.com/saurabhsingh3786/Netflix-Movies-and-TV-Shows-Clustering

# **Problem Statement**


* Netflix is the world's largest online streaming service provider, with over 220 million subscribers as of 2022-Q2. It is crucial that they effectively cluster the shows that are hosted on their platform in order to enhance the user experience, thereby preventing subscriber churn.
* We will be able to understand the shows that are similar to and different from one another by creating clusters, which may be leveraged to offer the consumers personalized show suggestions depending on their preferences.
* The goal of this project is to classify/group the Netflix shows into certain clusters such that the shows within a cluster are similar to each other and the shows in different clusters are dissimilar to each other.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that for each algorithm, the following format is answered.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries and modules

# libraries that are used for analysis and visualization
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
%matplotlib inline

# Visualizing the missing values
import missingno as msno

# libraries used to process textual data
import string
import nltk
nltk.download('punkt')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# libraries used to implement clusters
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# libraries that are used to construct a recommendation system
# (TfidfVectorizer is already imported above)
from sklearn.metrics.pairwise import cosine_similarity

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
netflix = pd.read_csv('/content/drive/MyDrive/AlmaBetter/Capstone Projects/Netflix Movies and TV Shows Clustering/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')
netflix_df = netflix.copy()


### Dataset First View

In [None]:
# Dataset First Look
# first five rows
netflix_df.head()

In [None]:
#Last five rows
netflix_df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'number of rows : {netflix_df.shape[0]}  \nnumber of columns : {netflix_df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
netflix_df.info()

#### Duplicate Values

**`How important is it to get rid of duplicate records in my data?`**

Duplication refers to the presence of repeated records in the dataset. It can be caused by incorrect data entry or by the data collection procedure. Removing duplicate records saves time and money, because the same data is not sent to the machine learning model multiple times.

In [None]:
# Dataset Duplicate Value Count
duplicate_value = netflix_df.duplicated().sum()
print("The number of duplicate rows in the data set is = ",duplicate_value)

We found that there were no duplicate entries in the above data.
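
Because `show_id` is unique for every record, a full-row check cannot flag a title that was entered twice under different IDs. The sketch below shows, on hypothetical toy data, how a check restricted to a subset of columns behaves differently. The column subset chosen here is an assumption for illustration, not a step performed in this notebook.

```python
import pandas as pd

# Hypothetical toy records: the first two rows differ only in show_id
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Alpha", "Alpha", "Beta"],
    "type": ["Movie", "Movie", "TV Show"],
})

# Full-row duplicates: none, because show_id differs on every row
full_row_duplicates = toy.duplicated().sum()

# Duplicates when only title and type are compared
title_type_duplicates = toy.duplicated(subset=["title", "type"]).sum()

print(full_row_duplicates, title_type_duplicates)
```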

#### Missing Values/Null Values

**Why is dealing with missing values necessary?**

Real-world data frequently contains missing values, which may arise when data is corrupted or was never recorded. Since many machine-learning algorithms do not support missing values, they must be handled during the dataset's pre-processing. Therefore, we begin by looking for values that are missing.

In [None]:
# Missing Values/Null Values Count
print("-"*50)
print("Null value count in each of the Column: ")
print("-"*50)
print(netflix_df.isna().sum())
print("-"*50)

# Percentage of null values in each category
print("Percentage of null values in each Column: ")
print("-"*50)
null_count_by_column = netflix_df.isnull().sum()/len(netflix_df)
# Print the percentages (rounded) instead of appending a single '%' after the whole Series
print((null_count_by_column*100).round(2))
print("-"*50)

In [None]:
# Visualizing the missing values
import missingno as msno
msno.bar(netflix_df, color='green',sort='ascending', figsize=(10,3), fontsize=15)

In [None]:
# Visualizing the missing values
plt.figure(figsize=(15,8))
plots= sns.barplot(x=netflix_df.columns,y=netflix_df.isna().sum())
plt.grid(linestyle='--', linewidth=0.3)

for bar in plots.patches:
      plots.annotate(bar.get_height(),
                     (bar.get_x() + bar.get_width() / 2,
                      bar.get_height()), ha='center', va='center',
                     size=12, xytext=(0, 8),
                     textcoords='offset points')
plt.show()

### What did you know about your dataset?

The dataset "Netflix Movies and TV Shows Clustering" comprises 12 columns, with only one column having an integer data type. It does not contain any duplicate values, but it does have null values in five columns: director, cast, country, date_added, and rating.

This dataset provides a valuable resource for exploring trends in the range of movies and TV shows available on Netflix. Additionally, it can be utilized for developing clustering models to categorize similar titles together based on shared attributes such as genre, country of origin, and rating.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(f"Available columns:\n{netflix_df.columns.to_list()}")

In [None]:
# Dataset Describe
netflix_df.describe(include='all').T

### Variables Description

The variable description of the Netflix Movies and TV Shows Clustering Dataset is as follows:

1. **show_id**: Unique identifier for each movie/show.

2. **type**: Indicates whether the entry is a movie or a TV show.
3. **title**: Name of the movie or TV show.
4. **director**: Name of the director(s) of the movie or TV show.
5. **cast**: Names of the actors and actresses featured in the movie or TV show.
6. **country**: Country or countries where the movie or TV show was produced.
7. **date_added**: Date when the movie or TV show was added to Netflix.
8. **release_year**: Year when the movie or TV show was released.
9. **rating**: TV rating or movie rating of the movie or TV show.
10. **duration**: Length of the movie or TV show in minutes or seasons.
11. **listed_in**: Categories or genres of the movie or TV show.
12. **description**: Brief synopsis or summary of the movie or TV show.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in netflix_df.columns:
    print(f"No. of unique values in {column} is {netflix_df[column].nunique()}")

### Observations:

* We are focusing on several key columns of our dataset, including 'type', 'title', 'director', 'cast', 'country', 'rating', 'listed_in', and 'description', as they contain a wealth of information.
* By utilizing these features, we plan to create a cluster column and implement both K-means and Hierarchical clustering algorithms.
* Additionally, we will be developing a content-based recommendation system that uses the information from these columns to suggest similar titles. This approach will allow us to gain valuable insights and group similar data points together, as well as provide recommendations based on a show the user has watched.
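
As a rough illustration of the content-based idea, the sketch below ranks titles by cosine similarity of their TF-IDF vectors. The `toy_df` data and the `recommend` helper are hypothetical names introduced for this example. They are not the implementation built later in the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue with one combined text field per title
toy_df = pd.DataFrame({
    "title": ["Ghost House", "Haunted Night", "Laugh Factory", "Funny People Club"],
    "text": [
        "horror ghost haunted house scary",
        "horror haunted ghost night scary",
        "comedy stand-up jokes laugh",
        "comedy jokes friends laugh funny",
    ],
})

# Pairwise cosine similarity between the TF-IDF vectors of all titles
tfidf_matrix = TfidfVectorizer().fit_transform(toy_df["text"])
similarity = cosine_similarity(tfidf_matrix)

def recommend(title, top_n=10):
    """Return up to top_n titles most similar to `title` (hypothetical helper)."""
    matches = toy_df.index[toy_df["title"] == title]
    if len(matches) == 0:
        return []  # unknown title: nothing to recommend
    idx = matches[0]  # toy_df uses a default RangeIndex, so label == position
    ranked = similarity[idx].argsort()[::-1]  # most similar first
    ranked = [i for i in ranked if i != idx][:top_n]  # drop the title itself
    return toy_df["title"].iloc[ranked].tolist()

print(recommend("Ghost House", top_n=1))
```

Returning an empty list for an unknown title keeps the helper safe to call with free-text user input.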

## 3. ***Data Wrangling***

### Data Wrangling Code

####  Handling Null values from each feature

In [None]:
# Missing Values/Null Values Count
print("-"*50)
print("Null value count in each of the column: ")
print("-"*50)
print(netflix_df.isna().sum())
print("-"*50)

# Let's find out the percentage of null values in each category in order to deal with it.
print("Percentage of null values in each variable: ")
print("-"*50)
null_count_by_column = netflix_df.isnull().sum()/len(netflix_df)
# Print the percentages (rounded) instead of appending a single '%' after the whole Series
print((null_count_by_column*100).round(2))
print("-"*50)

In [None]:
netflix_df["date_added"].value_counts()

In [None]:
netflix_df['rating'].value_counts()

In [None]:
netflix_df['country'].value_counts()

1. Since 'date_added' and 'rating' have a very small percentage of null values, we can drop those observations rather than impute them, which avoids introducing bias into our clustering model.
2. We cannot drop the rows with missing 'director' or 'cast' values, as the null percentage is comparatively high, and we cannot impute them because we do not know the actual data for those movies/TV shows, so it is better to replace those entries with 'Unknown'.
3. We can fill the null values of 'country' with the mode, as only about 6% of the values are null and most of the movies/shows are from the US.

In [None]:
## Imputing null value as per our discussion
# imputing with unknown in null values of director and cast feature
netflix_df[['director','cast']]=netflix_df[['director','cast']].fillna("Unknown")

# Imputing null values of country with Mode
netflix_df['country']=netflix_df['country'].fillna(netflix_df['country'].mode()[0])

# Dropping remaining null values of date_added and rating
netflix_df.dropna(axis=0, inplace=True)

In [None]:
# Rechecking the Missing Values/Null Values Count
print("-"*50)
print("Null value count in each of the column: ")
print("-"*50)
print(netflix_df.isna().sum())
print("-"*50)

# Rechecking the percentage of null values in each category
print("Percentage of null values in each column: ")
print("-"*50)
null_count_by_column = netflix_df.isnull().sum()/len(netflix_df)
# Print the percentages (rounded) instead of appending a single '%' after the whole Series
print((null_count_by_column*100).round(2))
print("-"*50)

#### Country & Listed_in -

In [None]:
# Top countries
netflix_df.country.value_counts()

In [None]:
# Genre of shows
netflix_df.listed_in.value_counts()

Some movies/TV shows were filmed in multiple countries or have multiple genres associated with them. Let's find them.

In [None]:
# Find entries with multiple countries
multiple_countries = netflix_df[netflix_df['country'].str.contains(',', na=False)]

# Find entries with multiple genres
multiple_genres = netflix_df[netflix_df['listed_in'].str.contains(',', na=False)]

# Print movies/TV shows with multiple countries
print("Movies/TV Shows Filmed in Multiple Countries:")
print(multiple_countries[['title', 'country']])

# Print movies/TV shows with multiple genres
print("\nMovies/TV Shows with Multiple Genres:")
print(multiple_genres[['title', 'listed_in']])


To simplify the analysis, let's consider only the primary country where that respective movie / TV show was filmed.
Also, let's consider only the primary genre of the respective movie / TV show.

In [None]:
# Function to extract the primary value
def extract_primary(value):
    if isinstance(value, str):
        return value.split(',')[0]
    return value

# Apply the function to 'country' and 'listed_in' columns
netflix_df['country'] = netflix_df['country'].apply(extract_primary)
netflix_df['listed_in'] = netflix_df['listed_in'].apply(extract_primary)

# Print the DataFrame with simplified values
netflix_df
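
Keeping only the primary value is a simplification. As an alternative sketch (not the approach taken in this notebook), every listed country can be counted by splitting and exploding the column. The `toy` frame below is hypothetical example data.

```python
import pandas as pd

# Hypothetical toy data with comma-separated country lists
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", "United Kingdom, United States"],
})

# Split each entry into a list, give every country its own row,
# trim the leftover spaces, then count occurrences
country_counts = (
    toy["country"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(country_counts)
```

This counts a co-production once for each participating country, whereas the primary-value approach credits only the first country listed.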

#### date_added-

In [None]:
# Typecasting 'date_added' from string to datetime
# (stripping surrounding whitespace first, in case some entries carry stray spaces)
netflix_df["date_added"] = pd.to_datetime(netflix_df['date_added'].str.strip())

In [None]:
# first and last date on which a show was added on Netflix
netflix_df.date_added.min(),netflix_df.date_added.max()

The shows were added on Netflix between 1st January 2008 and 16th January 2021.

In [None]:
# Adding new attributes day,  month and year of date added
netflix_df['day_added'] = netflix_df['date_added'].dt.day
netflix_df['month_added'] = netflix_df['date_added'].dt.month
netflix_df['year_added'] = netflix_df['date_added'].dt.year
# Dropping date_added
netflix_df.drop('date_added', axis=1, inplace=True)

#### Rating -

The ratings can be changed to age restrictions that apply on certain movies and TV shows.

[Reference](https://www.primevideo.com/help/ref=atv_hp_nd_cnt?nodeId=GFGQU3WYEG6FSJFJ)

In [None]:
# Age ratings for shows in the dataset
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=netflix_df)

**The highest number of shows on Netflix are rated TV-MA, followed by TV-14 and TV-PG.**

In [None]:
# Age ratings
netflix_df.rating.unique()

In [None]:
# Changing the values in the rating column
rating_map = {'TV-MA':'Adults',
              'R':'Adults',
              'PG-13':'Teens',
              'TV-14':'Young Adults',
              'TV-PG':'Older Kids',
              'NR':'Adults',
              'TV-G':'Kids',
              'TV-Y':'Kids',
              'TV-Y7':'Older Kids',
              'PG':'Older Kids',
              'G':'Kids',
              'NC-17':'Adults',
              'TV-Y7-FV':'Older Kids',
              'UR':'Adults'}

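# Assign the result back instead of calling replace(..., inplace=True) on a column selection
netflix_df['rating'] = netflix_df['rating'].replace(rating_map)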
netflix_df['rating'].unique()

In [None]:
# Age ratings for shows in the dataset
plt.figure(figsize=(10,5))
sns.countplot(x='rating',data=netflix_df)

**Around 50% of shows on Netflix are produced for an adult audience, followed by young adults, older kids, and kids. Netflix has fewer shows specifically produced for teenagers than for any other age group.**
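
A share such as "around 50%" can be read directly from normalized value counts rather than estimated from the bar heights. The sketch below uses a hypothetical `toy_ratings` series. The numbers it prints are for the toy data only, not for the Netflix dataset.

```python
import pandas as pd

# Hypothetical toy ratings after mapping to age groups
toy_ratings = pd.Series(
    ["Adults", "Adults", "Young Adults", "Kids",
     "Adults", "Teens", "Older Kids", "Adults"],
    name="rating",
)

# Percentage share of each age group
share = toy_ratings.value_counts(normalize=True).mul(100).round(1)
print(share)
```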

#### Duration -

In [None]:
# Splitting the duration column, and changing the datatype to integer
netflix_df['duration'] = netflix_df['duration'].apply(lambda x: int(x.split()[0]))

In [None]:
# Number of seasons for tv shows
netflix_df[netflix_df['type']=='TV Show'].duration.value_counts()

In [None]:
# Movie length in minutes
netflix_df[netflix_df['type']=='Movie'].duration.unique()

In [None]:
# datatype of duration
netflix_df.duration.dtype

We have successfully converted the datatype of duration column to int.
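
Note that after this conversion the column mixes two units: minutes for movies and seasons for TV shows. As an optional sketch (an assumption for illustration, not a step in this notebook), the unit can be kept in its own column when parsing the raw strings. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical raw duration strings in the dataset's original format
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["93 min", "1 Season", "123 min", "4 Seasons"],
})

# Capture the number and the unit with named groups
parts = toy["duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>\w+)")
toy["duration_value"] = parts["value"].astype(int)

# Normalize 'Season'/'Seasons' to 'season'; 'min' is left unchanged
toy["duration_unit"] = parts["unit"].str.lower().str.rstrip("s")
print(toy)
```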

### What all manipulations have you done and insights you found?

There are 12 attributes, some of which were not stored in a proper datatype, such as date_added and duration, so we converted them to the desired datatypes. We then found that the type feature contains more movies than TV shows. We generated a word cloud image to identify the words that occur most often in titles. Some movies/TV shows were filmed in multiple countries or have multiple genres associated with them, so we kept only the primary country and primary genre for each title. The highest number of shows on Netflix are rated TV-MA, followed by TV-14 and TV-PG, so we mapped the ratings to audience groups such as adults, teens, and older kids. We found that around 50% of shows on Netflix are produced for an adult audience, followed by young adults, older kids, and kids. Netflix has fewer shows specifically produced for teenagers than for any other age group.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **What is EDA?**
* EDA stands for Exploratory Data Analysis. It is a process of analyzing and understanding the data, which is an essential step in the data science process. The goal of EDA is to gain insights into the data, identify patterns, and discover relationships and trends. It is an iterative process that helps to identify outliers, missing values, and any other issues that may affect the analysis and modeling of the data.


### **UNIVARIATE ANALYSIS -**

#### Chart - 1: Analyze the type of content available on Netflix.

In [None]:
# Chart - 1 visualization code

fig,ax = plt.subplots(1,2, figsize=(14,5))

# countplot
graph = sns.countplot(x = 'type', data = netflix_df, ax=ax[0])
graph.set_title('Count of Values', size=20)

# piechart
netflix_df['type'].value_counts().plot(kind='pie', autopct='%1.2f%%', ax=ax[1], figsize=(15,6),startangle=90)
# Set the title on the pie chart's axes explicitly rather than on the "current" axes
ax[1].set_title('Percentage Distribution', size=20)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

We chose a countplot to show the exact counts of "Movies" and "TV Shows." This gives a clear comparison of how many titles of each type are in our dataset. Additionally, we chose a pie chart to show the percentage distribution of these content types, which allows us to see the proportion of movies and TV shows in the whole dataset.

##### 2. What is/are the insight(s) found from the chart?

- The majority of the content available on Netflix is in the form of "Movies."
- "TV Shows" constitute a smaller portion of the overall content on the platform.
- The pie chart provides a clear visual representation of the distribution, with "Movies" taking up a larger portion of the whole.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Content Strategy: The insight that a larger portion of the content consists of "Movies" suggests that movies are more prevalent on the platform. This could inform content acquisition and production strategies, allowing Netflix to focus on obtaining popular and diverse movie titles to cater to a wider audience.

2. User Engagement: Understanding that "Movies" dominate the content catalog can help Netflix tailor its marketing and user engagement strategies. This insight could lead to targeted promotional campaigns for specific genres, leveraging the popularity of movies to attract and retain subscribers.

3. Retention Strategies: By knowing that movies are more abundant, Netflix can create customized recommendations and curated collections to enhance user engagement and satisfaction. Providing users with relevant movie suggestions could lead to increased usage and longer subscription durations.

**Negative Growth Insights:**

The provided visualizations do not directly indicate any insights that would lead to negative growth. However, it's important to note that the smaller share of "TV Shows" might imply a potential gap in certain content areas:

1. Diversity of Content: If the available "TV Shows" are limited in number or variety, there could be negative implications for subscribers who prefer TV series. They might find the content offerings lacking, potentially leading to lower satisfaction or churn.

2. Market Competitiveness: If competitors are offering a broader range of TV shows, Netflix might face challenges in attracting users who are specifically seeking TV series content. This could impact their market share.

3. Subscription Tier Optimization: Depending on user preferences, Netflix might need to optimize its subscription tiers. If users primarily prefer TV shows, they might expect more options in lower-cost tiers, which could affect the perceived value of subscriptions.

#### Chart - 2: Top 10 countries in content creation.

In [None]:
# Chart - 2 visualization code

df_country = netflix_df.groupby(['country']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plt.figure(figsize=(15,6))
plots= sns.barplot(y = "country",x = 'title', data = df_country)
plt.xticks(rotation = 60)
plt.title('Top 10 Countries for content creation')
plt.grid(linestyle='--', linewidth=0.3)
plots.bar_label(plots.containers[0])
plt.show()

##### 1. Why did you pick the specific chart?

The reason to choose a bar plot in this context is that it effectively visualizes the distribution of categorical data, which in this case is the primary filming country. Here's why a bar plot is appropriate:

1. Categorical Data: The primary filming country is a categorical variable, and a bar plot is a common choice for visualizing the distribution of categorical data. Each country is represented by a separate bar, making it easy to compare the counts of shows for different countries.

2. Comparison: A bar plot allows you to quickly identify and compare the countries with the highest number of shows. The lengths of the bars provide a clear visual representation of the frequency of shows from each country.

3. Ordered Presentation: You can order the bars in descending order of show counts, placing the countries with the highest counts at the top. This helps in quickly identifying the top contributors.

4. Ease of Interpretation: Viewers can easily read the values on the x-axis (show counts) corresponding to each country on the y-axis, allowing for accurate interpretation of the distribution.

5. Insight Extraction: By observing the bar lengths, you can immediately identify the countries that have a significant presence on Netflix in terms of content production.

In summary, a bar plot is a suitable choice for visualizing the distribution of shows by primary filming country because it effectively presents categorical data in a manner that allows for easy comparison and insight extraction.

##### 2. What is/are the insight(s) found from the chart?

The output of the top countries with the highest number of shows provides valuable insights into the distribution of content on Netflix based on primary filming countries. Here are some insights we can gain from this information:

1. Content Production Leaders: The United States has a significant lead with the highest number of shows, indicating that it's a major contributor to Netflix's content library. This could be attributed to the presence of a robust entertainment industry in the U.S.

2. Global Diversity: Countries like India, the United Kingdom, Canada, and Japan also have a substantial number of shows. This suggests a diverse range of content from various parts of the world, catering to different viewer preferences and cultures.

3. Language and Localization: The presence of shows from different countries highlights Netflix's efforts to offer content in multiple languages and localize content for global audiences. This diversity can help attract and retain a broader subscriber base.

4. Regional Appeal: Countries like South Korea, Spain, and Mexico are notable contributors, indicating the popularity of content from these regions. This could reflect a growing interest in international content and a willingness to explore shows from different cultures.

5. Potential Audience Segmentation: The distribution of shows across countries can aid Netflix in segmenting its audience based on preferences. For example, they might offer curated content to specific regions to cater to local tastes.

6. Collaborative Productions: Co-productions between countries could also be contributing to the high numbers for some countries. Collaborations allow for the sharing of resources and talent, leading to more diverse and engaging content.

7. Market Penetration: The number of shows from a specific country might reflect the extent of Netflix's penetration into that market. Higher numbers could indicate a stronger presence and focus in certain regions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights about the distribution of shows by primary filming countries can indeed have a positive business impact for Netflix, but there are also potential challenges that need to be considered. Let's examine both the positive and potentially negative aspects:

**Positive Business Impact:**

1. Global Market Expansion: The insights highlight that Netflix has a diverse content library spanning various countries. This can attract a global audience, leading to positive growth by expanding its subscriber base in different regions.

2. Localized Content: Understanding the top contributing countries allows Netflix to focus on creating and acquiring content that resonates with specific markets. This localization strategy can lead to higher engagement and viewer satisfaction.

3. Cultural Relevance: Content from different countries provides opportunities to cater to cultural and regional preferences. This can foster a sense of inclusivity and connect with viewers on a deeper level.

4. Strategic Partnerships: High show counts from specific countries can signify successful partnerships and collaborations. Netflix can continue nurturing these relationships for co-productions and exclusive content, strengthening its position in the market.

5. Data-Informed Decisions: The insights help Netflix make informed decisions about content acquisition, production, and distribution. This can optimize resource allocation and content strategy.

**Potential Challenges and Negative Impact:**

1. Market Saturation: Relying heavily on content from a few countries, particularly the United States, might lead to market saturation. Overwhelming viewers with content from a single region can limit appeal to a broader global audience.

2. Cultural Misalignment: While localization is beneficial, there's a risk of cultural misalignment if content is not adapted accurately. Insensitive content can lead to negative backlash and viewer attrition.

3. Competition and Differentiation: If other streaming platforms offer unique content from underrepresented regions, Netflix might face competition for viewership. Lack of diversity could hinder differentiation.

4. Localization Complexity: Producing and localizing content for multiple countries can be complex and resource-intensive. This can strain budgets and operational efficiency.

5. Content Diversity: Focusing excessively on countries with high show counts might lead to overlooking content from regions with smaller contributions. Diversifying content sources is essential for attracting a broader audience.

#### Chart - 3: Content growth over the years

In [None]:
# Chart - 3 visualization code


# Filter data by type (TV Show or Movie)
tv_show = netflix_df[netflix_df["type"] == "TV Show"]
movie = netflix_df[netflix_df["type"] == "Movie"]

col = "year_added"

# Count content added each year for TV Shows and Movies.
# groupby().size() yields the same column names across pandas versions,
# unlike value_counts().reset_index(), and the result is already sorted by year.
content_1 = tv_show.groupby(col).size().reset_index(name="count")

content_2 = movie.groupby(col).size().reset_index(name="count")

# Create traces for TV Shows and Movies
trace1 = go.Scatter(x=content_1[col], y=content_1["count"], name="TV Shows", marker=dict(color="#db0000"))
trace2 = go.Scatter(x=content_2[col], y=content_2["count"], name="Movies", marker=dict(color="#564d4d"))

data = [trace1, trace2]
layout = go.Layout(
    title="Content Added Over the Years",
    xaxis=dict(title="Year"),
    yaxis=dict(title="Count"),
    legend=dict(x=0.4, y=1.1, orientation="h")
)
fig = go.Figure(data, layout=layout)

# Display the figure (if using show)
fig.show()


##### 1. Why did you pick the specific chart?

We used a line chart (a Plotly scatter trace with connected points) to visualize how TV show and movie additions grew over the years. It suits this data for the following reasons:

1. Temporal Data: The x-axis represents years, and line charts are the standard way to display trends over time.

2. Comparison: Each content type gets its own line, so the growth of TV shows and movies can be compared side by side.

3. Connected Data Points: The data points are discrete (one per year), and connecting them shows the trend between years.

4. Multiple Series: Line charts handle several series on the same axes well; here there are two, TV shows and movies.

5. Year-to-Year Change: Spikes or drops in the content added during particular years are easy to spot.

##### 2. What is/are the insight(s) found from the chart?

**TV Shows:**

* Very little TV show content was added in the early years (2008 to 2010), when Netflix's streaming service was still new.
* A notable increase in TV show additions started around 2015, which continued to grow in the subsequent years.
* The highest growth in TV show content occurred from 2016 to 2020, with a peak of 697 TV shows added in 2020.
* There seems to be a significant drop in TV show additions in 2021 compared to the previous years.

**Movies:**

* Similar to TV shows, the earliest years (2008 to 2010) saw a relatively small number of movie additions.
* There's a noticeable increase in movie additions starting from 2014, with a more significant rise in 2016.
* The growth trend continues from 2016 to 2019, with the highest number of movies (1497) added in 2019.
* Movie additions show a slight decline in 2020, followed by a much sharper drop in 2021, a year that the data may cover only partially.

**Overall Insights:**

* Both TV shows and movies exhibit a growth trend over the years, with significant expansion starting around 2015-2016.
* The years 2018 and 2019 appear to be particularly active in terms of content additions for both TV shows and movies.
* The slight drop in additions in 2020 comes from movies, since TV show additions peaked that year. Production delays caused by the COVID-19 pandemic are one possible factor.
* The lower number of additions in 2021 for both TV shows and movies could indicate a shift in strategy, the impact of ongoing circumstances, or simply incomplete data for that year.
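
The peak years quoted above can also be read from a small table instead of the chart. The following is a minimal sketch, assuming `netflix_df` has the `type` and `year_added` columns used in the chart code; `toy_df` and its rows are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for netflix_df (not real data)
toy_df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "year_added": [2019, 2019, 2019, 2020, 2020, 2020],
})

# Rows = year, columns = content type, values = number of titles added
yearly_counts = pd.crosstab(toy_df["year_added"], toy_df["type"])

# Year with the most additions for each content type
peak_years = yearly_counts.idxmax()

print(yearly_counts)
print(peak_years)
```

Replacing `toy_df` with `netflix_df` should give one row per year, which makes it easier to check statements such as "TV shows peaked in 2020" against exact counts.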

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Strategic Decision-Making: The insights into the years of significant growth (such as 2018-2019) can help Netflix understand what strategies, content acquisitions, or original productions were successful during those periods. This information can guide future decision-making to replicate and build upon those successes.

2. Content Investment: Identifying trends in content growth can help Netflix allocate resources more effectively. For instance, if TV shows have shown consistent growth, Netflix might choose to invest more in producing and acquiring TV show content, targeting genres and themes that have proven popular.

3. Subscriber Retention and Attraction: Consistent content growth can be a driver for subscriber retention and acquisition. New and diverse content attracts and retains subscribers, potentially reducing churn rates.

4. Global Events Impact: The drop in content additions in 2020 may be partly attributable to the COVID-19 pandemic, which disrupted production schedules worldwide. This insight can be useful for understanding the impact of external events on content availability and setting expectations for subscribers during such periods.

**Potential Negative Impact:**

1. Decline in Content Additions: The drop in content additions in 2021, both for TV shows and movies, could lead to a decrease in subscriber engagement and retention if it reflects a real slowdown rather than incomplete data. Users often expect a steady stream of fresh content, and a sudden drop could result in dissatisfaction.

2. Competition and Variety: The consistent growth in content could lead to oversaturation and reduced audience engagement if the content quality or variety isn't maintained. Users may become overwhelmed with choices, and competitors might offer content that better aligns with specific tastes.

3. Production Delays: Production delays due to unforeseen events, as seen in 2020, could lead to lower content availability. This could negatively impact user engagement and satisfaction, potentially affecting subscription renewals.

4. Content Quality: While the insights provided do not directly touch upon content quality, it's important to consider that growth should be accompanied by maintaining high-quality content. If content additions are made solely to meet quantity targets without considering quality, it could lead to dissatisfaction among subscribers.

#### Chart - 4: In which month do most movies and tv shows get added on netflix?

In [None]:
# Chart - 4 visualization code


# Create dataframe to store month values and counts.
months_df = (
    netflix_df['month_added']
    .value_counts()
    .rename_axis('month')        # index (the month) becomes the 'month' column
    .reset_index(name='count')   # counts become the 'count' column
)
fig = px.bar(months_df, x="month", y="count", text_auto=True, color='count', color_continuous_scale=['#db0000', '#564d4d'])
fig.update_layout(
    title={
        'text': 'Month wise addition of movies and shows to the platform',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        autosize=False,
        width=1000,
        height=500)
fig.show()


##### 1. Why did you pick the specific chart?

The bar chart is an appropriate choice for this visualization because it effectively displays the distribution of content additions across months and allows for easy comparison and interpretation of the data.

##### 2. What is/are the insight(s) found from the chart?

* High-Volume Months: The months with the highest content additions are December (833), October (785), and January (757). These months seem to be particularly active in terms of adding new content to the platform.

* End-of-Year Peaks: December stands out as the month with the highest content additions. This could be related to the holiday season, where more people might be engaged in streaming and the platform might want to provide a variety of content choices.

* Release Patterns: The top months for content additions (December, October, January) might correspond to popular times for content releases, aligning with holidays, school breaks, or other cultural events.

* Mid-Year Dips: Months like May (543) and June (542) show relatively lower content additions. This could be due to factors like production schedules, vacation periods, or a focus on promoting existing content.

* Consistent Activity: Months from March to August (with counts ranging from 542 to 669) show consistent content additions. This could reflect a strategy to provide a steady stream of new content throughout the year.

* Potential Seasonal Patterns: It's interesting to see that December, October, and January are clustered at the top. This could indicate a seasonal pattern related to holidays, colder weather, or viewer behavior during specific periods.

* Varied Peaks: Beyond December, October, and January, November (738) is close behind the top three, while February (472) falls below even the mid-year dip.
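
Seasonal patterns such as the end-of-year peak are easier to see when the bars follow the calendar rather than the count. The sketch below shows one way to reorder the counts, assuming `month_added` holds full English month names (an assumption, since the column's construction is not shown here); `toy_months` is made-up data.

```python
import calendar
import pandas as pd

# Hypothetical month labels standing in for netflix_df["month_added"] (not real data)
toy_months = pd.Series(
    ["December", "January", "December", "October", "January", "December"]
)

# ['January', 'February', ..., 'December'] in calendar order
month_order = list(calendar.month_name)[1:]

# Count per month, then reindex so rows follow the calendar instead of the count
months_by_calendar = (
    toy_months.value_counts()
    .reindex(month_order, fill_value=0)
    .rename_axis("month")
    .reset_index(name="count")
)

print(months_by_calendar)
```

Passing a frame ordered like `months_by_calendar` to `px.bar` would place January first and December last, so neighbouring months can be compared directly.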

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Strategic Content Releases: The insights into high-volume months (such as December, October, and January) can be leveraged for strategic content releases. By concentrating major releases during these months, Netflix can maximize viewer engagement and subscriptions, leading to positive business impact.

2. Subscriber Engagement: Releasing content during months of higher user engagement can enhance user satisfaction and encourage longer subscriptions. Users are more likely to stay engaged when they have a variety of new content options.

3. Marketing Campaigns: The months with higher content additions can be targeted for marketing campaigns to highlight the availability of new and exciting content. This can attract both new subscribers and existing ones.

4. Revenue Generation: Optimizing content release schedules can lead to increased user engagement, attracting more viewers and generating additional revenue through increased subscription numbers and user retention.

**Potential Negative Impact:**

1. Content Oversaturation: Focusing too heavily on high-volume months could lead to content oversaturation during those periods. This might result in users feeling overwhelmed by the choices and potentially not fully engaging with the content.

2. Neglecting Low-Volume Months: Overemphasizing high-volume months could lead to neglecting low-volume months. If users experience a lack of new content during these months, they might become dissatisfied and consider canceling subscriptions.

3. Unpredictable Viewer Behavior: While the insights show trends, viewer behavior can be unpredictable. Relying solely on high-volume months might not fully capture the diverse preferences and viewing habits of all subscribers.

4. Competition: Other streaming platforms might also capitalize on high-volume months, leading to increased competition for viewers' attention and subscriptions. This could result in a fragmented audience and potential negative impact on subscriber numbers.

5. Quality Over Quantity: Prioritizing content additions during high-volume months might compromise content quality. Focusing on releasing high-quality content should be a priority to maintain viewer satisfaction and loyalty.

#### Chart - 5: Which days of the month are more prominent?

In [None]:
# Chart - 5 visualization code

# Create dataframe to store day values and count.
days_df = (
    netflix_df['day_added']
    .value_counts()
    .rename_axis('day')          # index (the day) becomes the 'day' column
    .reset_index(name='count')   # counts become the 'count' column
)

fig = px.bar(days_df, x="day", y="count", text_auto=True, color='count', color_continuous_scale=['#db0000', '#564d4d'])
fig.update_layout(
    title={
        'text': 'Which days are more prominent',
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        autosize=False,
        width=1200,
        height=600)
fig.show()


##### 1. Why did you pick the specific chart?

The bar chart is a suitable choice for this visualization because it effectively displays the distribution of content additions across days of the month and allows for easy comparison and interpretation of the data.

##### 2. What is/are the insight(s) found from the chart?

* Early-Month Concentration: Days 1 to 5 of the month have noticeably more content additions than days 6 and 7. The "day_added" column holds the day of the month (1 to 31), so these values cannot be read as weekdays and weekends.

* Day 1 Peak: Day 1 has the highest count of content additions (2069). This suggests that a large share of new content is added on the first day of the month.

* Mid-Month Peaks: Days around the 15th of the month (days 15 and 16) have relatively high content additions (644 and 240, respectively). This could indicate a trend of content additions around the middle of the month.

* End-of-Month Surges: Days at the end of the month, such as days 31 and 30 (274 and 182 additions, respectively), also show relatively high counts. This could be related to content releases before the end of the month.

* Lower Counts on Days 6 and 7: Days 6 and 7 of the month have lower content additions (165 and 162, respectively).

* Consistency in Numbers: Days in the mid-range (days 18 to 28) show consistent content additions, indicating a steady flow of new content throughout the month.

* Influence of Viewer Behavior: The higher content additions at the beginning and middle of the month might reflect release scheduling or viewer behavior patterns, such as renewed interest at the start of a month and around mid-month paydays.
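
The day values discussed above run up to 31, so they are days of the month. Any statement about weekdays versus weekends needs the day of the week instead. The sketch below shows how both could be derived, assuming `netflix_df` still has a `date_added` column parsed as datetimes (an assumption, since this section only shows `day_added`); `toy_dates` is made-up data.

```python
import pandas as pd

# Hypothetical dates standing in for netflix_df["date_added"] (not real data)
toy_dates = pd.Series(
    pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-04", "2020-01-10"])
)

# Day of the month (1 to 31): the kind of value the chart above plots
day_of_month = toy_dates.dt.day

# Day of the week ('Monday' to 'Sunday'): what a weekday/weekend claim needs
day_of_week = toy_dates.dt.day_name()

weekday_counts = day_of_week.value_counts()
print(weekday_counts)
```

Plotting `weekday_counts` in the same way as the day-of-month counts would show whether additions really cluster on particular weekdays.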

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Strategic Content Releases: The insights into days with higher content additions (e.g., the 1st of the month, mid-month, end-of-month) can be leveraged for strategic content releases. By concentrating major releases during these periods, Netflix can maximize viewer engagement and subscriptions.

2. Optimized User Engagement: Aligning content releases with the days when users are most active can lead to optimized user engagement and longer subscription durations. The chart shows when content is added rather than when users watch, so viewing data would be needed to identify those days.

3. Viewer Satisfaction: Consistent content additions throughout the month can enhance viewer satisfaction. Offering a steady stream of new content prevents content gaps and provides viewers with reasons to keep using the platform.

4. Content Variety: Analyzing specific days with lower content additions (e.g., days 6 and 7 of the month) could be an opportunity to diversify content and cater to different viewer preferences during those times.

**Potential Negative Impact:**

1. Neglecting Quiet Days: Overemphasizing the peak days of the month might leave long stretches with few additions. If users experience a lack of new content during those stretches, they might become dissatisfied and consider canceling subscriptions.

2. Viewer Fatigue: Concentrating content additions on specific peak days (e.g., the 1st of the month) might result in viewer fatigue. Releasing too much content all at once could lead to oversaturation and reduced engagement.

3. Content Quality Over Quantity: Focusing solely on aligning content additions with specific days could compromise content quality. It's important to ensure that content releases maintain high quality to retain viewer satisfaction.

4. Neglecting Viewer Diversity: Viewer behavior varies widely, and not all users follow the same patterns. Relying exclusively on the insights from certain days might overlook segments of users with different preferences and schedules.

5. Competition: If other streaming platforms also follow similar patterns of concentrated content releases, it could lead to increased competition for viewers' attention, potentially resulting in a fragmented audience.

#### Chart - 6: Top 10 and last 10 genres present in listed_in.

In [None]:
# Chart - 6 visualization code


# separating genres from the listed_in column for analysis purposes
genres = netflix_df['listed_in'].str.split(', ', expand=True).stack()
# count titles per genre, sorted from most to least frequent
genres = genres.value_counts().rename_axis('genre').reset_index(name='count')

# plotting graph
fig,ax = plt.subplots(1,2, figsize=(15,6))

# Top 10 genres
top = sns.barplot(x='genre', y = 'count', data=genres[:10], ax=ax[0])
top.set_title('Top 10 genres present in Netflix', size=20)
plt.setp(top.get_xticklabels(), rotation=90)

# Last 10 genres
bottom = sns.barplot(x='genre', y = 'count', data=genres[-10:], ax=ax[1])
bottom.set_title('Last 10 genres present in Netflix', size=20)
plt.setp(bottom.get_xticklabels(), rotation=90)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

We used bar plots to visualize the distribution of genres in the "listed_in" column of the Netflix dataset. A bar plot suits this data for the following reasons:

1. Categorical Data: The genres extracted from the "listed_in" column are categorical data. A bar plot is commonly used to represent the distribution of categorical data.

2. Comparison: Bar plots are ideal for comparing the frequency or count of different categories. In this case, we compare the count of each genre.

3. Ordered Data: The genre counts are sorted by frequency before plotting, so the bars appear in descending order and the ranking of genres is easy to read.

##### 2. What is/are the insight(s) found from the chart?

**Top 10 Genres:**

1. Diverse Genre Offerings: The top genres include a variety of content, ranging from dramas and comedies to documentaries and action & adventure. This diversity reflects Netflix's efforts to cater to a wide range of viewer preferences.

2. Mainstream Appeal: Genres like dramas, comedies, and documentaries have a high count, indicating their popularity and mainstream appeal among viewers.

3. Global Audience: The presence of "International TV Shows" in the top genres suggests that Netflix has a strong focus on providing content from various countries, appealing to a global audience.

4. Family and Kids' Content: The presence of "Children & Family Movies," "Kids' TV," and "Animation" genres indicates a commitment to offering family-friendly content.

5. Entertainment Variety: Genres like "Stand-Up Comedy" and "Music & Musicals" add entertainment variety, addressing different moods and preferences.

**Last 10 Genres:**

1. Niche and Specialized Content: The genres in the last 10 list, such as "Cult Movies," "TV Horror," and "Sci-Fi & Fantasy," tend to be more specialized and might cater to niche audiences.

2. Limited Appeal: Genres with lower counts, such as "LGBTQ Movies," "Sports Movies," and "Spanish-Language TV Shows," have fewer titles in the catalog. This might reflect limited supply or a narrower audience compared to more mainstream genres.

3. Highly Specific Content: The genres "TV Sci-Fi & Fantasy" and "TV Horror" are specific subgenres that might cater to fans of these particular genres.

4. Limited Availability: Some genres with very low counts (e.g., "Sports Movies") might indicate that Netflix offers limited content within those genres.

5. Viewer Diversity: The presence of genres like "TV Shows" and "Romantic Movies" suggests that Netflix aims to cater to diverse viewer interests, even if these genres have lower counts.

6. Content Focus: Lower counts in some genres might reflect a strategic decision to focus resources on more popular and mainstream genres.
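
The top and last genres above depend on how the comma-separated "listed_in" values are split and counted. The sketch below shows an equivalent way to do this with `explode()`, offered as an alternative to the `split(..., expand=True).stack()` approach in the chart code rather than a replacement for it; `toy_listed_in` is made-up data.

```python
import pandas as pd

# Hypothetical values standing in for netflix_df["listed_in"] (not real data)
toy_listed_in = pd.Series([
    "Dramas, International Movies",
    "Comedies, Dramas",
    "Documentaries",
])

# One row per (title, genre) pair, then count how often each genre occurs
genre_counts = (
    toy_listed_in.str.split(", ")
    .explode()
    .value_counts()
    .rename_axis("genre")
    .reset_index(name="count")
)

top_genres = genre_counts.head(2)     # most frequent genres
last_genres = genre_counts.tail(2)    # least frequent genres
print(genre_counts)
```

A title listed under several genres is counted once per genre, so the counts add up to more than the number of titles. This matters when reading the bars as shares of the catalog.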

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Viewer Engagement: Offering a diverse range of popular genres (e.g., dramas, comedies, documentaries) can lead to higher viewer engagement, longer viewing sessions, and increased subscription renewals.

2. Global Audience: The inclusion of "International TV Shows" reflects Netflix's commitment to catering to a global audience. This can lead to a broader user base and positive business impact through increased international subscriptions.

3. Family-Friendly Content: Providing genres like "Children & Family Movies" and "Kids' TV" can attract families and parents, resulting in higher subscriptions and positive word-of-mouth recommendations.

4. Entertainment Variety: Offering a mix of genres, including "Stand-Up Comedy" and "Music & Musicals," can attract viewers seeking different types of entertainment, leading to longer engagement on the platform.

5. Catering to Niche Audiences: While some genres have lower counts, they might cater to niche audiences with passionate fan bases. Satisfying these niche audiences can lead to increased loyalty and positive reviews.

**Potential Negative Impact:**

1. Neglected Genres: Overemphasis on popular genres could lead to neglecting genres with lower counts. This might result in decreased engagement from viewers who prefer these genres.

2. Oversaturation: Overemphasizing the most popular genres might lead to oversaturation, causing viewers to become overwhelmed with content choices and potentially reducing engagement.

3. Limited Niche Content: While catering to niche audiences is valuable, focusing solely on niche genres might limit overall viewership and potentially result in negative growth if those genres don't have a sustainable audience.

4. Quality Over Quantity: Prioritizing quantity over quality in certain genres could lead to viewer dissatisfaction, negative reviews, and potential churn.

5. Missed Opportunities: Neglecting certain genres (e.g., LGBTQ Movies, Spanish-Language TV Shows) might miss opportunities to capture specific viewer segments, potentially leading to negative growth within those segments.

6. Competition: If certain genres are neglected or not well-curated, viewers might turn to other streaming platforms that offer more diverse and tailored genre options.

#### Chart - 7: Total number of shows/movies added on Netflix each year.

In [None]:
# Chart - 7 visualization code

sns.set(rc={'figure.figsize':(15,7)})
sns.countplot(x='year_added',data=netflix_df,palette="Set1")

plt.title('Total shows/movies added each year on netflix ',size='15',fontweight="bold")
plt.show()



##### 1. Why did you pick the specific chart?

The countplot is a suitable choice for visualizing the distribution of shows/movies added to Netflix each year. It effectively presents the frequency of content additions for each year and enables easy comparison and interpretation of the data.

##### 2. What is/are the insight(s) found from the chart?

- Rapid Growth in Recent Years: The years 2019 and 2020 saw the highest numbers of content additions, with 2153 and 2009 shows/movies added, respectively. This indicates a period of rapid growth for Netflix's content library in recent years.

- Continued Expansion: Following 2019 and 2020, the year 2018 also had a substantial number of content additions, with 1685 shows/movies added. This suggests that Netflix's content expansion efforts have been consistent over multiple years.

- Earlier Ramp-Up: Additions rose from 443 shows/movies in 2016 to 1225 in 2017. This indicates that Netflix's content library was already expanding quickly during those years.

- Recent Decline: In 2021, the number of content additions dropped to 117 shows/movies. While this could indicate a slowdown, it's important to note that the data might not be complete for the entire year, and trends can change throughout the year.

- Early Years: The years 2014 and earlier had lower numbers of content additions, suggesting that Netflix's content library was smaller in its early years of operation.
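
The size of each year-to-year change can be made explicit with a percentage change. The sketch below uses only the yearly totals quoted in the insights above; the variable names are illustrative.

```python
import pandas as pd

# Yearly totals as quoted in the insights above
yearly_totals = pd.Series(
    {2016: 443, 2017: 1225, 2018: 1685, 2019: 2153, 2020: 2009, 2021: 117},
    name="titles_added",
)

# Percentage change relative to the previous year, rounded to one decimal place
yoy_pct = (yearly_totals.pct_change() * 100).round(1)
print(yoy_pct)
```

With these figures the 2020 total is about 7% below 2019, while the 2021 total is more than 90% below 2020. A drop of that size is more consistent with a partial year of data than with a real change in release volume.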

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Content Library Growth: The rapid growth in content additions in recent years (2019 and 2020) indicates that Netflix is actively investing in expanding its content library. This can have a positive impact on user engagement, attracting new subscribers and retaining existing ones.

* Subscriber Retention: Consistent content additions over multiple years (2017, 2018) contribute to subscriber satisfaction and retention. A diverse and growing content library can encourage users to stay subscribed.

* Competitive Edge: Regular content updates give Netflix a competitive edge by offering a wider variety of content compared to competitors. This can attract viewers looking for a comprehensive entertainment experience.

* Market Penetration: High content additions in recent years indicate Netflix's efforts to penetrate and capture a larger share of the global streaming market.

* Original Content Strategy: The growth in content additions aligns with Netflix's strategy of producing original content. Original shows and movies can generate brand loyalty and exclusivity.

**Potential Negative Impact:**

* Decline in Content Additions: A sudden drop in content additions in 2021 might suggest a shift in content strategy, although the data for that year might be incomplete. If content additions genuinely continue to decrease, it could lead to viewer dissatisfaction and churn.

* Subscription Attrition: A decline in content additions might result in users seeking content elsewhere, leading to subscription attrition or reduced acquisition of new subscribers.

* Saturation Effect: Oversaturation of the content library can overwhelm viewers, making it difficult for them to choose what to watch. This could lead to viewer frustration and potentially reduced engagement.

* Missed Opportunities: A lower number of content additions in earlier years might indicate missed opportunities to capture early adopters and establish a larger subscriber base from the beginning.

* Increased Competition: If other streaming platforms continue to invest heavily in content additions, Netflix's reduced growth could lead to increased competition for viewer attention.

* Lack of Freshness: A low number of content additions might result in a lack of freshness in the content library, potentially leading to viewer fatigue and decreased engagement.

### BIVARIATE ANALYSIS

#### Chart - 8: Distribution of ratings across TV shows and movies.

In [None]:
# Chart - 8 visualization code

# Rating vs. Type (Grouped bar chart)
plt.figure(figsize=(10, 6))
sns.countplot(x="rating", hue="type", data=netflix_df)
plt.title("Rating vs. Type")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.legend(title="Type")

# Print count values on the bars
ax = plt.gca()
for p in ax.patches:
    height = p.get_height()
    # Skip empty bars (zero or NaN height) and show counts as integers
    if height > 0:
        ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='center', fontsize=10, color='black', xytext=(0, 5),
                    textcoords='offset points')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot is a type of bar plot that is specifically designed to show the count of occurrences of a categorical variable. It is particularly useful when you want to visualize the distribution of categorical data and compare the frequency of different categories.

##### 2. What is/are the insight(s) found from the chart?

* For the "Adults" rating, there are significantly more movies (2595) than TV shows (1025). For the "Teens" rating, there are only movies (386) and no TV shows.

* The "Young Adults" rating has roughly twice as many movies (1272) as TV shows (659), so both content types are well represented in this category.

* For the "Older Kids" and "Kids" ratings, there are more movies than TV shows, with "Older Kids" having 852 movies and 478 TV shows, and "Kids" having 267 movies and 246 TV shows, the most even split of any category.
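
Comparing ratings is easier with shares than with raw counts, because the categories differ a lot in size. The sketch below computes the movie and TV show share within each rating, using only the counts quoted in the insights above; the variable names are illustrative.

```python
import pandas as pd

# Counts per rating category as quoted in the insights above
rating_counts = pd.DataFrame(
    {
        "Movie": [2595, 386, 1272, 852, 267],
        "TV Show": [1025, 0, 659, 478, 246],
    },
    index=["Adults", "Teens", "Young Adults", "Older Kids", "Kids"],
)

# Share of each content type within a rating (each row sums to 1)
shares = rating_counts.div(rating_counts.sum(axis=1), axis=0).round(2)
print(shares)
```

On the real data, a similar table could be built with `pd.crosstab(netflix_df["rating"], netflix_df["type"], normalize="index")`, assuming the "rating" column holds the grouped categories used in the chart.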

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impacts:**

1. Dominant Content Types: The insight that certain content ratings are dominant in specific content types (e.g., "Adults" rating having more movies) can help Netflix allocate resources more effectively. For example, producing more content within popular rating categories could attract and retain subscribers who prefer those ratings, leading to positive growth.

2. Both Content Types Represented: The "Young Adults" rating has a substantial number of both movies and TV shows, which indicates that there's a diverse audience within that age group. Offering a variety of content types can lead to higher engagement and satisfaction among different segments of viewers.

**Negative Impacts:**

1. Limited TV Shows for Certain Ratings: The absence of TV shows for the "Teens" rating might result in a missed opportunity to attract younger viewers looking for TV show content. This could lead to negative growth in the teenage demographic if not addressed.

2. Children's Content: The higher count of movies compared to TV shows in the "Older Kids" and "Kids" categories might limit the options available to younger audiences who prefer TV shows. This could result in negative growth among families seeking TV show content for children.

#### Chart - 9: Season-wise distribution of tv shows.

In [None]:
# Chart - 9 visualization code

tv_df = netflix_df[netflix_df['type']=='TV Show']

# Count TV shows for each number of seasons
tv = tv_df['duration'].value_counts().rename_axis('seasons').reset_index(name='count')

fig = px.pie(tv, values='count', names='seasons', color_discrete_sequence=px.colors.sequential.Greens)
fig.update_layout(title="Season-wise distribution of TV shows")
fig.update_traces(textposition='inside', textinfo='percent+label', textfont_size=20,
                  marker=dict(line=dict(color='RebeccaPurple', width=2)))
fig.show()

##### 1. Why did you pick the specific chart?

A pie chart is a suitable choice for visualizing the distribution of TV shows on Netflix based on the number of seasons. It effectively conveys the proportion of TV shows within each season category and allows viewers to compare these proportions visually.

##### 2. What is/are the insight(s) found from the chart?

1. Diverse Content Strategy: Netflix has a diverse content strategy that includes a mix of single-season shows and multi-season shows. This strategy allows them to cater to a wide range of viewer preferences and consumption habits.

2. Emphasis on Shorter Formats: The dominance of single-season shows suggests that Netflix invests in producing shorter formats like mini-series and limited series. These formats might be more appealing to viewers who prefer concise storytelling.

3. Variety in Multi-Season Shows: The presence of multi-season shows in different ranges (2-3, 4-6, etc.) indicates that Netflix offers a variety of ongoing series and shows that explore longer story arcs.

4. Viewer Engagement with Long-Running Shows: Although rare, the presence of TV shows with higher numbers of seasons suggests that there are shows on Netflix that have managed to maintain viewer engagement over a significant period.

5. Impact of Production and Costs: The decreasing frequency as the number of seasons increases could be influenced by production costs and viewer engagement. Longer-running shows require sustained resources and consistent audience interest.

6. Balance between Quantity and Quality: The distribution might reflect Netflix's approach to balance the quantity of content with the quality of storytelling. This can ensure that both shorter and longer shows maintain a certain level of engagement and production value.

7. Changing Viewer Preferences: The gaps in the distribution might indicate changing viewer preferences. The absence of mid-range shows (7-9 seasons) could be due to viewer interest shifting toward shorter or longer formats.
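
The season ranges mentioned above (single-season, 2-3, 4-6, and so on) can be computed rather than read from the pie chart. The sketch below assumes the "duration" column for TV shows holds strings such as "1 Season" or "3 Seasons"; the band edges and `toy_duration` values are illustrative choices, not taken from the dataset.

```python
import pandas as pd

# Hypothetical values standing in for tv_df["duration"] (not real data)
toy_duration = pd.Series(
    ["1 Season", "2 Seasons", "1 Season", "5 Seasons", "9 Seasons"]
)

# Pull the leading number out of each string and convert it to an integer
seasons = toy_duration.str.extract(r"(\d+)", expand=False).astype(int)

# Illustrative bands: 1, 2-3, 4-6, and 7 or more seasons
bins = [0, 1, 3, 6, float("inf")]
labels = ["1", "2-3", "4-6", "7+"]
season_bands = pd.cut(seasons, bins=bins, labels=labels)

# Number of shows per band, in band order
band_counts = season_bands.value_counts().sort_index()
print(band_counts)
```

Grouping into bands keeps the pie chart readable when only a handful of shows have many seasons.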

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Content Diversity: The diverse distribution of TV shows across different numbers of seasons indicates that Netflix is catering to a wide range of viewer preferences. This diversity can attract and retain a broader audience, leading to positive business impact.

2. Viewer Engagement: The presence of TV shows with higher numbers of seasons suggests that some shows have successfully maintained viewer engagement over the long term. These engaged viewers contribute to positive word-of-mouth, loyalty, and potentially higher subscriber retention rates.

3. Data-Informed Renewals: The insights gained from this distribution can inform content renewal decisions. Shows with strong engagement and consistent viewer interest can be renewed for additional seasons, leading to sustained viewer satisfaction.

4. Appeal to Different Viewers: By offering both single-season and multi-season shows, Netflix can attract viewers with varying preferences. Some viewers prefer short, self-contained stories, while others enjoy longer narrative arcs.

5. Platform Stickiness: A diverse content library with shows of different lengths can make Netflix more "sticky" for subscribers. Subscribers might stay engaged for a longer time as they explore a variety of content.

**Potential Negative Impact:**

1. Overemphasis on Short Formats: If Netflix focuses excessively on producing single-season shows, it could result in a lack of long-running, ongoing series. This might lead to viewer dissatisfaction if subscribers are seeking shows with more extended storylines.

2. Content Fatigue: A skewed distribution with a majority of single-season shows might lead to content fatigue, as viewers may find it challenging to invest in shorter formats repeatedly. This could potentially impact viewer engagement and retention.

3. Risk of Abandoning Shows: If Netflix doesn't renew shows with potential for longevity, it might miss out on cultivating dedicated fan bases and long-term viewer engagement. Prematurely discontinuing shows could lead to subscriber disappointment.

#### Chart - 10: Top 10 Directors of Movies and TV Shows.

In [None]:
# Chart - 10 visualization code
# Top 10 Directors in Movies and TV Shows
df_movies= netflix_df[netflix_df['type']== 'Movie']
df_tvshows= netflix_df[netflix_df['type']== 'TV Show']
plt.figure(figsize=(23,8))
for i,j,k in ((df_movies, 'Movies',0),(df_tvshows, 'TV Shows',1)):
  plt.subplot(1,2,k+1)
  # Skip the first row (presumably the placeholder for missing directors) and keep the next 10
  df_director = i.groupby(['director']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[1:11]
  plots= sns.barplot(y = "director",x = 'title', data = df_director, palette='Paired')
  plt.title(f'Top 10 directors by number of {j}')
  plt.grid(linestyle='--', linewidth=0.3)
  plots.bar_label(plots.containers[0])
plt.show()

##### 1. Why did you pick the specific chart?

The horizontal bar chart with subplots is a visually effective way to compare and contrast the top directors in both TV shows and movies, offering insights into their contributions to Netflix's content library across different categories.

##### 2. What is/are the insight(s) found from the chart?

**Top 10 Directors of TV Shows:**

1. Diverse Directorship: The top directors for TV shows vary in terms of the number of shows directed. The highest count is 3 shows directed by Alastair Fothergill, while others have directed 2 shows each.

2. Variety in Content: The variety in the names of the top TV show directors suggests that there isn't a single director who dominates the TV show category. This indicates a diverse range of directors contributing to Netflix's TV show offerings.

3. Documentaries and Series: Directors like Alastair Fothergill and Ken Burns are known for documentaries, which might be contributing to their high directorship counts.

4. Continuity in Series: Directors like Shin Won-ho, Iginio Straffi, and Rob Seidenglanz have directed multiple shows, possibly indicating a continuation of a successful series or franchise.

**Top 10 Directors of Movies:**

1. Highly Prolific Directors: The top directors for movies have directed a significant number of films. Raúl Campos and Jan Suter have directed the highest count of 18 movies, followed closely by directors like Marcus Raboy and Jay Karas.

2. Comedy and Stand-Up: Directors like Marcus Raboy, Jay Karas, and Jay Chapman are known for directing comedy content, including stand-up specials.

3. Diverse Genres: The presence of directors like Cathy Garcia-Molina and Youssef Chahine suggests a diverse range of movie genres, potentially including romance, drama, and international films.

4. Renowned Filmmakers: Directors like Martin Scorsese and Steven Spielberg, who are renowned in the film industry, are also among the top directors. This might indicate collaborations with Netflix for original films.

5. Variety in Directing Style: The list includes directors with varying styles and backgrounds, contributing to Netflix's diverse movie portfolio.
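A count such as the one shared by Raúl Campos and Jan Suter suggests that co-directed titles are stored as a single comma-separated string, so grouping on the raw `director` column ranks credits rather than individuals. The sketch below contrasts the two ways of counting; the toy frame and its placeholder names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical rows; the director names are placeholders, not real data.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["Dir One, Dir Two", "Dir One, Dir Two", "Dir One", "Dir Three"],
})

# Counting the raw string treats each co-directing team as its own "director".
by_credit = toy.groupby("director")["title"].nunique().sort_values(ascending=False)

# Splitting on ", " and exploding yields one row per individual director.
per_person = (
    toy.assign(director=toy["director"].str.split(", "))
       .explode("director")
       .groupby("director")["title"]
       .nunique()
       .sort_values(ascending=False)
)
print(by_credit)
print(per_person)
```

Which view is preferable depends on the question: the credit-level count matches the chart above, while the per-person count shows how many titles each individual worked on.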

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Diverse Content Portfolio: Collaborating with a diverse set of directors who specialize in different genres, styles, and formats can enhance Netflix's content portfolio. This diversity can attract a wider audience, leading to increased engagement and potentially more subscribers.

2. Renowned Filmmakers: Collaborating with renowned filmmakers like Martin Scorsese and Steven Spielberg can elevate the platform's prestige and attract subscribers seeking high-quality content from respected directors.

3. Prolific Directors: Directors who have directed a significant number of movies or TV shows can contribute to a consistent stream of fresh content. This frequent release of content can maintain subscriber interest and engagement.

4. Catering to Audience Preferences: By having a mix of directors specializing in comedy, drama, documentaries, and other genres, Netflix can cater to the diverse preferences of its global audience.

**Potential Negative Impact:**

1. Overemphasis on Quantity: While having prolific directors can result in a large quantity of content, an overemphasis on quantity over quality might lead to content fatigue among subscribers. Quality control and viewer satisfaction could be compromised.

2. Lack of Focus: Collaborating with too many directors might lead to a lack of a coherent content strategy. A broad spectrum of content could lack a unifying brand identity, potentially leading to viewer confusion.

3. Risk of Exclusivity: Relying heavily on a few renowned directors might make the platform dependent on their availability and schedules. If these directors choose to work with other platforms or studios, Netflix could face content gaps.

4. Niche Versus Mainstream: Depending on the mix of directors, Netflix might lean more towards niche content or mainstream blockbusters. Striking a balance is crucial to cater to a wide range of audience segments.

5. Disproportionate Focus: If a small group of directors dominates the content library, it might overshadow emerging talent and innovative storytelling, limiting the platform's ability to discover and promote new voices.

6. Limited Originality: Overreliance on certain directors might lead to a lack of originality in content, potentially resulting in repetitive themes and narratives.

#### Chart - 11: Top 10 Actors in TV Shows/Movies.

In [None]:
# Chart - 11 visualization code

# Filter out rows with 'unknown' cast entries
filtered_netflix_df = netflix_df[~netflix_df['cast'].str.contains('unknown', case=False, na=False)]
fig,ax = plt.subplots(1,2, figsize=(14,5))

# separating TV show actors from the cast column
top_TVshows_actor = filtered_netflix_df[filtered_netflix_df['type']=='TV Show']['cast'].str.split(', ', expand=True).stack()
# plotting the actors who appeared in the highest number of TV shows
a = top_TVshows_actor.value_counts().head(10).plot(kind='barh', ax=ax[0])
a.set_title('Top 10 TV shows actors', size=15)

# separating movie actors from the cast column
top_movie_actor = filtered_netflix_df[filtered_netflix_df['type']=='Movie']['cast'].str.split(', ', expand=True).stack()
# plotting the actors who appeared in the highest number of movies
b = top_movie_actor.value_counts().head(10).plot(kind='barh', ax=ax[1])
b.set_title('Top 10 Movie actors', size=15)

plt.tight_layout(pad=1.2, rect=[0, 0, 0.95, 0.95])
plt.show()

##### 1. Why did you pick the specific chart?

The horizontal bar chart with subplots is a visually effective way to compare and contrast the top actors in both TV shows and movies on Netflix, providing insights into how often they appear in each category.

##### 2. What is/are the insight(s) found from the chart?

**Top 10 TV Show Actors:**

1. Japanese Voice Actors: The presence of Japanese voice actors like Takahiro Sakurai, Yuki Kaji, Daisuke Ono, and Ai Kayano among the top TV show actors suggests that anime content is well-represented on Netflix.

2. Anime Catalogue Size: These performers are credited as the Japanese-language voice cast, so their high number of appearances reflects how many anime titles are on the platform rather than the popularity of dubbed versions.

3. Frequent Collaborations: Junichi Suwabe, Yoshimasa Hosoya, and Yuichi Nakamura are among the top actors, indicating frequent collaborations with the platform or consistent roles in TV shows.

4. Diverse Genres: While some of these actors are known for anime, their diverse appearances could mean they are involved in a range of genres beyond animation.

**Top 10 Movie Actors:**

1. Bollywood Dominance: The list of top movie actors is dominated by Bollywood stars like Shah Rukh Khan, Akshay Kumar, and Amitabh Bachchan. This suggests a strong presence of Indian cinema on Netflix.

2. Indian Cinema Showcase: The high counts for actors like Anupam Kher, Om Puri, Naseeruddin Shah, and Paresh Rawal highlight the platform's focus on showcasing classic and contemporary Indian cinema.

3. Versatile Actors: These actors have appeared in a variety of genres, showcasing their versatility in Indian cinema.

4. Global Reach of Bollywood: The popularity of these actors indicates that Bollywood films have a global audience on Netflix.
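The ranking behind these lists can be wrapped in a small helper. This is a sketch only: `top_cast` is a hypothetical function name, the toy frame is made up, and the `'unknown'` placeholder is assumed from the filter used in the Chart 11 cell.

```python
import pandas as pd

def top_cast(frame: pd.DataFrame, content_type: str, n: int = 10) -> pd.Series:
    """Return the n most frequent cast members for one content type."""
    cast = frame.loc[frame["type"] == content_type, "cast"]
    # Drop placeholder entries before splitting (assumed placeholder: 'unknown').
    cast = cast[~cast.str.contains("unknown", case=False, na=False)]
    # One row per cast member, then count appearances.
    return cast.str.split(", ").explode().value_counts().head(n)

# Hypothetical data with placeholder actor names.
toy = pd.DataFrame({
    "type": ["TV Show", "TV Show", "Movie", "Movie", "Movie"],
    "cast": ["Actor A, Actor B", "Actor A", "Actor C, Actor A", "Unknown", "Actor C"],
})
tv_top = top_cast(toy, "TV Show", n=2)
movie_top = top_cast(toy, "Movie", n=2)
print(tv_top)
print(movie_top)
```

`Series.explode` is used here in place of `str.split(expand=True).stack()`; both produce one row per cast member, and `explode` avoids building a wide intermediate frame.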

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Catering to Diverse Audiences: The presence of both Japanese voice actors and Bollywood stars indicates that Netflix is successfully catering to a diverse global audience. This diverse approach can attract and retain subscribers from various regions.

2. Audience Engagement: High appearances by certain actors suggest that these actors have a dedicated fan base. Featuring them in more content can increase viewer engagement and potentially attract new subscribers.

3. Regional Content Focus: The presence of Bollywood actors highlights Netflix's focus on regional content, particularly Indian cinema. This can lead to increased subscription rates from regions with a strong interest in Indian movies.

4. Collaborations and Partnerships: Consistent appearances by specific actors might indicate successful collaborations or partnerships with production houses and studios associated with these actors. Such collaborations can result in high-quality content and positive viewer reception.

**Potential Negative Impact:**

1. Overreliance on Specific Actors: An overemphasis on a few actors, especially in terms of high appearances, might lead to viewer fatigue. The audience might start perceiving the platform as repetitive or lacking in variety.

2. Limited Exploration: Focusing heavily on certain actors can limit the exploration of new and emerging talent. This can impact the platform's ability to discover fresh voices and innovative storytelling.

3. Cultural Balance: While featuring diverse actors is positive, if certain regions or cultural backgrounds are consistently underrepresented, it can lead to dissatisfaction among those specific audiences.

4. Risk of Overexposure: Overusing certain actors might lead to oversaturation in the market. Viewers might become less excited about their appearances, resulting in a decrease in engagement.

5. Competitive Challenges: Relying heavily on specific actors might limit the platform's ability to compete with other streaming services that have their own exclusive content deals with those actors.

6. Long-Term Engagement: While featuring popular actors can attract initial interest, the long-term success of a platform depends on a diverse range of factors, including content variety, quality, and viewer experience.

7. Cultural Authenticity: In the case of regional actors, maintaining cultural authenticity and sensitively addressing cultural nuances becomes crucial to avoid negative reactions from specific audience groups.

#### Chart - 12: Total number of Movies/TV Shows released and added per year on Netflix.

In [None]:
# Chart - 12 visualization code

plt.figure(figsize=(20,6))
for i,j,k in ((df_movies, 'Movies',0),(df_tvshows, 'TV Shows',1)):
  plt.subplot(1,2,k+1)
  df_release_year = i.groupby(['release_year']).agg({'title':'nunique'}).reset_index().sort_values(by=['release_year'],ascending=False)[:14]
  plots= sns.barplot(x = 'release_year',y= 'title', data = df_release_year, palette='husl')
  plt.title(f'{j} released by year')
  plt.ylabel(f"Number of {j} released")
  plt.grid(linestyle='--', linewidth=0.3)

  for bar in plots.patches:
     plots.annotate(bar.get_height(),
                    (bar.get_x() + bar.get_width() / 2,
                     bar.get_height()), ha='center', va='center',
                    size=12, xytext=(0, 8),
                    textcoords='offset points')
plt.show()

plt.figure(figsize=(20,6))
for i,j,k in ((df_movies, 'Movies',0),(df_tvshows, 'TV Shows',1)):
  plt.subplot(1,2,k+1)
  # Number of unique titles added to Netflix in each year
  df_year_added = i.groupby(['year_added']).agg({'title':'nunique'}).reset_index().sort_values(by=['year_added'],ascending=False)
  plots= sns.barplot(x = 'year_added',y= 'title', data = df_year_added, palette='husl')
  plt.title(f'{j} added to Netflix by year')
  plt.ylabel(f"Number of {j} added on Netflix")
  plt.grid(linestyle='--', linewidth=0.3)

  for bar in plots.patches:
     plots.annotate(bar.get_height(),
                    (bar.get_x() + bar.get_width() / 2,
                     bar.get_height()), ha='center', va='center',
                    size=12, xytext=(0, 8),
                    textcoords='offset points')
plt.show()

##### 1. Why did you pick the specific chart?

Together, the two pairs of bar charts provide a comprehensive view of how many titles were released and how many were added to Netflix in each year, split by content type. They allow for easy comparison, identification of patterns, and the extraction of meaningful insights about Netflix's content strategy and audience preferences.

##### 2. What is/are the insight(s) found from the chart?

**Distribution of Release Years:**

* The output provides a list of release years along with the corresponding counts of content released in each year.

* The years with the highest number of content releases are concentrated in the recent years, particularly from 2016 to 2020, for both TV shows and movies.

* There's a clear trend of increasing content production in recent years, which is likely influenced by the rise of streaming platforms and original content creation.

**TV Shows Released in the 14 Most Recent Years:**

* The first chart shows the number of TV shows for each of the 14 most recent release years.

* The highest number of TV shows were released in the year 2020, followed closely by 2019 and 2018. This suggests that recent years have seen a surge in TV show releases.

**Movies Released in the 14 Most Recent Years:**

* The chart also displays the number of movies for each of the 14 most recent release years.

* Unlike TV shows, which peak in 2020, the highest number of movies were released in the year 2017, followed by 2018, 2016, and 2019.

* This indicates a trend of increased movie production in recent years as well.
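Selecting the most recent release years is not the same as selecting the years with the most titles, and the two selections can differ. A minimal sketch on made-up release years (the `release_year` series is an assumption, not the notebook's data):

```python
import pandas as pd

# Hypothetical release years (not the real dataset).
release_year = pd.Series([2020, 2020, 2019, 2018, 2018, 2018, 2005, 2005, 2005, 2005])

counts = release_year.value_counts()

# "Most recent" years: sort by the year itself, as the Chart 12 cell does.
most_recent = counts.sort_index(ascending=False).head(2)

# "Top" years: sort by the number of titles instead.
busiest = counts.sort_values(ascending=False).head(2)

print(most_recent)
print(busiest)
```

In this toy data the two most recent years are 2020 and 2019, while the two busiest years are 2005 and 2018, so the choice of sort key determines which bars appear in the chart.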


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Increased Content Production: The trend of increasing content releases, especially in recent years, suggests that Netflix is actively investing in content creation. This can have a positive impact by attracting and retaining subscribers who have a wide variety of options to choose from.

2. Emphasis on Original Content: The growth in content production aligns with Netflix's strategy of focusing on original content. Original shows and movies can help differentiate the platform from competitors and provide exclusive content that subscribers can't find elsewhere.

3. Variety of Genres: The diverse range of genres, including dramas, comedies, documentaries, and more, caters to various audience preferences. This approach can attract a larger and more diverse subscriber base.

4. Catering to Global Audiences: The presence of international TV shows and movies indicates Netflix's effort to cater to global audiences, expanding its reach beyond its home market.

5. Growing TV Show and Movie Library: The counts of TV shows and movies released in recent years show that Netflix is continuously building a substantial library. This library growth is crucial for maintaining user engagement and subscriber retention.

**Insights That Lead to Negative Growth:**

There are no insights from the provided data that directly lead to negative growth. However, it's essential to consider potential challenges:

1. Quality vs. Quantity: While increased content production is positive, maintaining quality is crucial. A large volume of content doesn't necessarily translate to a positive impact if the quality of the content is compromised. Negative feedback on content quality could lead to subscriber dissatisfaction.

2. Oversaturation and Viewer Fatigue: An excessive number of releases, especially if they are not well-promoted or if there is content fatigue, could lead to viewers feeling overwhelmed. This could result in reduced viewer engagement or even unsubscribing.

3. Market Saturation: As streaming competition increases, the market becomes more saturated. While Netflix is currently a leader, the potential saturation of the market could lead to challenges in acquiring new subscribers.

4. Cannibalization: If there's not enough differentiation between various shows and movies, new titles might draw viewers away from Netflix's existing titles instead of generating additional viewing. This could lead to internal competition among Netflix's own offerings.

5. Budget Constraints: High production costs can strain the budget, especially if content isn't generating expected returns. This could impact the financial health of the company.

#### Chart - 13: Which countries have the most Movies and TV Shows on Netflix?

In [None]:
# Chart - 13 visualization code

plt.figure(figsize=(18, 5))
plt.grid(linestyle='--', linewidth=0.3)

# Top 15 countries with most content
top_countries = netflix_df['country'].value_counts().index[:15]
sns.countplot(x=netflix_df['country'], order=top_countries, hue=netflix_df['type'], palette="Set1")
plt.xticks(rotation=50)
plt.title('Top 15 countries with most content', fontsize=15, fontweight='bold')
plt.legend(title='Type')

plt.figure(figsize=(20, 8))
df_movies = netflix_df[netflix_df['type'] == 'Movie']
df_tvshows = netflix_df[netflix_df['type'] == 'TV Show']

for df, content_type in [(df_movies, 'Movies'), (df_tvshows, 'TV Shows')]:
    plt.subplot(1, 2, 1 if content_type == 'Movies' else 2)
    df_country = df['country'].value_counts().head(10).reset_index()
    df_country.columns = ['country', 'count']

    plots = sns.barplot(y="country", x='count', data=df_country, palette='Set1')
    plt.title(f'Top 10 countries launching {content_type}', fontsize=15, fontweight='bold')
    plt.grid(linestyle='--', linewidth=0.3)
    for i, value in enumerate(df_country['count']):
        plots.text(value + 10, i, str(value), ha='center', va='center')

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

**Top 15 Countries with Most Content - Countplot (Bar Chart):**
The first chart uses a countplot to show the distribution of content (Movies and TV Shows) among the top 15 countries with the most content on Netflix. This chart helps us understand which countries contribute the most to Netflix's content library and allows for a quick comparison of the content types within those countries.

**Top 10 Countries Launching Movies - Bar Chart:**
The second figure places two horizontal bar charts side by side. The left chart shows the 10 countries with the highest number of movies on Netflix, which helps identify the countries that contribute most to the movie catalogue and could point to potential production partnerships.

**Top 10 Countries Launching TV Shows - Bar Chart:**
The right chart shows the 10 countries with the highest number of TV shows. It helps identify the countries that are most active in TV production and can indicate content creation trends specific to TV shows.

##### 2. What is/are the insight(s) found from the chart?

* The United States is the clear leader in terms of content production, both for TV shows and movies. It produces more than twice as many TV shows as the runner-up, the United Kingdom, and more than 3 times as many movies as India, the second-place finisher.

* India is the second-largest producer of movies in the catalogue. It is also widely regarded as a fast-growing market for content consumption, driven by factors such as the increasing popularity of streaming services, the growing middle class, and rising disposable incomes.

* South Korea is a major player in the global TV show market, and it is known for its popular dramas and comedies. The Korean Wave, a term used to describe the global popularity of Korean culture, has helped to boost the visibility of South Korean TV shows around the world.

* Canada is a major producer of TV shows and is home to popular series such as "Schitt's Creek"; other series, such as "The Handmaid's Tale", are filmed there. The Canadian government provides financial support for the production of TV shows, which helps to attract foreign investment and create jobs.

* China is the world's most populous country, and it has a growing appetite for content. However, the Chinese government tightly controls the media, which limits the number of foreign TV shows and movies that are available in the country.
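Statements such as "more than twice as many" can be checked directly from the value counts rather than read off the chart. The sketch below uses made-up counts; the `tv_countries` series is a hypothetical stand-in for the `country` column filtered to TV shows.

```python
import pandas as pd

# Hypothetical country labels for TV shows (counts are made up).
tv_countries = pd.Series(["United States"] * 7 + ["United Kingdom"] * 3 + ["Japan"] * 2)

# value_counts sorts by frequency, so the first two entries are leader and runner-up.
counts = tv_countries.value_counts()
leader, runner_up = counts.index[0], counts.index[1]
ratio = counts.iloc[0] / counts.iloc[1]
print(f"{leader} has {ratio:.1f}x the titles of {runner_up}")
```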

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the gained insights can help create a positive business impact. For example, a content production company could use this information to:

* Identify new markets to target. India contributes the second-largest number of movies in the catalogue, so a company could produce content tailored to the Indian market.
* Develop new strategies to attract viewers. South Korea ranks among the top countries for TV shows, so a company could learn from the success of South Korean shows and develop its own shows with global appeal.
* Partner with other companies in the industry. For example, a content production company could partner with a streaming service to distribute its content to a wider audience.

However, some factors could lead to negative growth. The chart does not show it directly, but the Chinese government tightly controls the media, which could make it difficult for content production companies to distribute their content in China. Additionally, the global content production industry is becoming increasingly competitive, which could make it difficult for smaller companies to compete with larger ones.

#### Chart - 14: Which Genres are Popular on Netflix?

In [None]:
# Chart - 14 visualization code

plt.figure(figsize=(23,8))
# Use the full dataset here; after the Chart 13 loop, `df` refers only to the TV show subset
df_genre = netflix_df.groupby(['listed_in']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
plots= sns.barplot(y = "listed_in",x = 'title', data = df_genre)
plt.title('Most popular genre on Netflix')
plt.grid(linestyle='--', linewidth=0.3)
plots.bar_label(plots.containers[0])
plt.show()

plt.figure(figsize=(23,8))
for i,j,k in ((df_movies, 'Movies',0),(df_tvshows, 'TV Shows',1)):
  plt.subplot(1,2,k+1)
  df_genre = i.groupby(['listed_in']).agg({'title':'nunique'}).reset_index().sort_values(by=['title'],ascending=False)[:10]
  plots= sns.barplot(y = "listed_in",x = 'title', data = df_genre, palette='Set1')
  plt.title(f'Most popular genre of the {j}')
  plt.grid(linestyle='--', linewidth=0.3)
  plots.bar_label(plots.containers[0])
  plt.yticks(rotation = 45)
plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are a versatile choice for visualizing categorical data with clear rankings and comparisons, making them suitable for conveying insights about the most popular genres on Netflix.

##### 2. What is/are the insight(s) found from the chart?

* The most popular genre in TV shows is international TV shows. This suggests that viewers are interested in watching content from different cultures and countries. This could be due to the increasing globalization of the world, as well as the rise of streaming services that offer a wide variety of content from around the world.
* Crime TV shows are also popular, ranking second in the list. This could be because crime is a universal theme that appeals to viewers of all ages. Crime TV shows can also be suspenseful and exciting, which can keep viewers hooked.
* Kids' TV is another popular genre. This is not surprising, as children are naturally drawn to stories and characters that they can relate to. Kids' TV shows can also be educational, which can help children learn about different topics.
* British TV shows are also popular, ranking fourth in the list. This could be because British TV shows are known for their high quality and originality. British TV shows have also won numerous awards, which can help to attract viewers.
* Documentaries are also popular, ranking fifth in the list. This suggests that viewers are interested in learning about the world around them. Documentaries can be informative and educational, and they can also be entertaining.
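Because `listed_in` can hold several comma-separated genres per title, grouping on the raw column ranks genre combinations. One way to rank individual genres instead is `Series.str.get_dummies`. The sketch below runs on made-up values; the `listed_in` series is an illustrative assumption.

```python
import pandas as pd

# Hypothetical listed_in values (not the real dataset).
listed_in = pd.Series([
    "International TV Shows, Crime TV Shows",
    "Kids' TV",
    "International TV Shows, TV Dramas",
    "Docuseries",
    "International TV Shows, Crime TV Shows",
])

# Counting the raw string ranks genre combinations.
combo_counts = listed_in.value_counts()

# One indicator column per individual genre, then sum each column.
genre_counts = listed_in.str.get_dummies(sep=", ").sum().sort_values(ascending=False)

print(combo_counts)
print(genre_counts)
```

The two rankings answer different questions: the first shows which exact genre labels are most common, and the second shows how many titles carry each genre at all.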

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. Content Strategy: Knowing the most popular genres in both TV shows and movies allows Netflix to tailor their content acquisition and production strategies. By focusing on genres that have high popularity, they can attract and retain a larger audience.

2. Audience Engagement: Producing content in the most popular genres can lead to higher audience engagement. This can result in longer viewing sessions, increased user satisfaction, and reduced churn rates.

3. Marketing and Promotion: Netflix can use this information to effectively market and promote their content. Highlighting the most popular genres can attract viewers' attention and lead to increased viewership.

4. Personalization: Insights into user preferences can help Netflix enhance its recommendation algorithm. By suggesting content in the genres that users are most likely to enjoy, they can create a more personalized user experience.

**No Insights Leading to Negative Growth:**

Based on the provided insights, there are no indications of insights that could lead to negative growth. However, it's important to note that relying solely on popular genres might result in overlooking niche genres that have a dedicated audience. A diverse content library catering to various interests can be crucial to maintaining a broad and engaged user base.

#### Chart - 15: Total Number of Movies/TV Shows added per month on Netflix.

In [None]:
plt.figure(figsize=(23,8))
for i,j,k in ((df_movies, 'Movies',0),(df_tvshows, 'TV Shows',1)):
  plt.subplot(1,2,k+1)
  df_month = i.groupby(['month_added']).agg({'title':'nunique'}).reset_index().sort_values(by=['month_added'],ascending=False)
  plots= sns.barplot(x = 'month_added',y='title', data = df_month, palette='husl')
  plt.title(f'{j} added to Netflix by month')
  plt.ylabel(f"Number of {j} added on Netflix")
  plt.grid(linestyle='--', linewidth=0.3)
  for bar in plots.patches:
     plots.annotate(bar.get_height(),
                    (bar.get_x() + bar.get_width() / 2,
                     bar.get_height()), ha='center', va='center',
                    size=12, xytext=(0, 8),
                    textcoords='offset points')
plt.show()

##### 1. Why did you pick the specific chart?

We plotted this graph to find the month in which the number of movies/TV shows added is **maximum** and the month in which it is **minimum**.

##### 2. What is/are the insight(s) found from the chart?

1. We found that **October, November and December are the most popular months for TV shows addition**.

2. **January, October and December are the most popular months for movie addition**.

3. February is the least popular month for the movies and TV shows to be added on Netflix.
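The busiest and quietest months can also be read off programmatically. The sketch below derives month names from raw date strings; the `date_added` values and the `"%B %d, %Y"` format are assumptions, since the notebook's `month_added` column was prepared in an earlier cell.

```python
import pandas as pd

# Hypothetical date_added strings (not the real dataset).
date_added = pd.Series([
    "January 1, 2020", "January 5, 2019",
    "October 15, 2019", "October 3, 2018", "October 9, 2020",
    "December 25, 2019", "December 1, 2018",
    "February 2, 2020",
])

# Parse the dates and keep only the month name.
months = pd.to_datetime(date_added, format="%B %d, %Y").dt.month_name()
per_month = months.value_counts()

busiest_month = per_month.idxmax()
quietest_month = per_month.idxmin()
print(busiest_month, quietest_month)
```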

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained can help Netflix create a positive business impact by identifying the most popular months for new content additions. This can help Netflix plan content releases during peak periods, leading to increased user engagement and retention.

The insight that February is the least popular month for new content additions could potentially lead to negative growth if Netflix does not maintain a consistent flow of new content during this period. It is important for Netflix to keep its audience engaged throughout the year to avoid dissatisfaction and potential loss of subscribers.

#### Chart - 16: What is the distribution of content durations on Netflix?

In [None]:
#Checking the distribution of Movie Durations
plt.figure(figsize=(10,7))
# histplot replaces the deprecated distplot; no KDE or normal curve is drawn
plots= sns.histplot(df_movies['duration'], color='green')
plt.title('Distribution of movie durations',fontweight="bold")
for bar in plots.patches:
   plots.annotate(bar.get_height(),
                  (bar.get_x() + bar.get_width() / 2,
                   bar.get_height()), ha='center', va='bottom',
                  size=7, xytext=(0, 5),
                  textcoords='offset points', rotation=90)
plt.show()

In [None]:
plt.figure(figsize=(23,8))
df_duration = df_tvshows.groupby(['duration']).agg({'title':'nunique'}).reset_index().sort_values(by=['duration'],ascending=False)
plots= sns.barplot(x = 'duration',y='title', data = df_duration, palette='husl')
plt.title(f'Barplot of TV Shows Duration')
plt.ylabel(f"Content count")
plt.grid(linestyle='--', linewidth=0.3)
for bar in plots.patches:
   plots.annotate(bar.get_height(),
                  (bar.get_x() + bar.get_width() / 2,
                   bar.get_height()), ha='center', va='bottom',
                  size=12, xytext=(0, 8),
                  textcoords='offset points', rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

To know the duration distribution for Movies and TV Shows on Netflix.

##### 2. What is/are the insight(s) found from the chart?

1. The histogram of movie durations in minutes shows that the **majority of movies on Netflix have a duration between 80 and 120 minutes.**

2. The bar plot of TV show durations in seasons shows that the most common **duration for TV shows on Netflix is one season**, followed by two seasons.
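Both observations can be quantified once durations are numeric. The sketch below assumes the raw `"<number> <unit>"` string form (for example `"90 min"` or `"2 Seasons"`) and a made-up toy frame; the notebook's own `duration` column may already have been converted in an earlier cell.

```python
import pandas as pd

# Hypothetical duration strings in the raw "<number> <unit>" form.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "duration": ["90 min", "105 min", "45 min", "150 min", "1 Season", "2 Seasons", "1 Season"],
})

# Extract the leading integer; the unit is implied by the content type.
toy["duration_value"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)

# Share of movies whose runtime falls between 80 and 120 minutes (inclusive).
movies = toy.loc[toy["type"] == "Movie", "duration_value"]
share_80_120 = movies.between(80, 120).mean()

# Most common number of seasons among TV shows.
shows = toy.loc[toy["type"] == "TV Show", "duration_value"]
most_common_seasons = shows.mode().iloc[0]

print(share_80_120, most_common_seasons)
```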

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

*Hypothetical Statement 1:*
* **Null Hypothesis**: There is no significant difference in the proportions of drama movies and comedy movies on Netflix.

* **Alternative Hypothesis**: There is a significant difference in the proportions of drama movies and comedy movies on Netflix.

*Hypothetical Statement 2:*
* **Null Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is not significantly different from the average duration of TV shows added in the year 2021.

* **Alternative Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is significantly different from the average duration of TV shows added in the year 2021.

*Hypothetical Statement 3:*
* **Null Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is not significantly different from the proportion of movies added on Netflix that are produced in the United States.

* **Alternative Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is significantly different from the proportion of movies added on Netflix that are produced in the United States.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis**: There is no significant difference between the proportions of drama movies and comedy movies on Netflix.

**Alternative Hypothesis**: There is a significant difference between the proportions of drama movies and comedy movies on Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from statsmodels.stats.proportion import proportions_ztest  #------> This function is used to perform z test of proportion.

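# Subset the data to only include drama and comedy titles
# (fixed: the second mask previously referenced `df` instead of `netflix_df`)
is_drama = netflix_df['listed_in'].str.contains('Dramas')
is_comedy = netflix_df['listed_in'].str.contains('Comedies')
subset = netflix_df[is_drama | is_comedy]

# Count drama and comedy titles directly (avoids float rounding from proportion * length)
drama_count = int(subset['listed_in'].str.contains('Dramas').sum())
comedy_count = int(subset['listed_in'].str.contains('Comedies').sum())
drama_prop = drama_count / len(subset)
comedy_prop = comedy_count / len(subset)

# Set up the parameters for the z-test
# Note: a title can be listed as both a drama and a comedy, so the two groups
# overlap and the test's independence assumption holds only approximately.
count = [drama_count, comedy_count]
nobs = [len(subset), len(subset)]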
alternative = 'two-sided'

# Perform the z-test
z_stat, p_value = proportions_ztest(count=count, nobs=nobs, alternative=alternative)
print('z-statistic: ', z_stat)
print('p-value: ', p_value)

# Set the significance level
alpha = 0.05

# Print the results of the z-test
if p_value < alpha:
    print(f"Reject the null hypothesis.")
else:
    print(f"Fail to reject the null hypothesis.")


We conclude that there is a significant difference between the proportions of drama movies and comedy movies on Netflix.

##### Which statistical test have you done to obtain P-Value?

The statistical test we have used to obtain the P-value is the z-test for proportions.


##### Why did you choose the specific statistical test?

The z-test for proportions was chosen because we are comparing the proportions of two categorical variables (drama movies and comedy movies) in a sample. The null hypothesis and alternative hypothesis are about the difference in proportions, and we want to determine if the observed difference in proportions is statistically significant or not. The z-test for proportions is appropriate for this situation because it allows us to compare two proportions and calculate the probability of observing the difference we see in our sample if the null hypothesis were true.
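To make the reasoning above concrete, here is a minimal from-scratch sketch of a pooled two-proportion z-test using only the standard library. The function name `two_proportion_ztest` and the counts are hypothetical and chosen for illustration; to our understanding the pooled standard error matches the default behaviour of `proportions_ztest`, but treat that as an assumption rather than a guarantee.

```python
import math

def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Two-sided pooled z-test for the difference between two independent proportions."""
    p1 = count1 / nobs1
    p2 = count2 / nobs2
    # Pooled proportion under the null hypothesis p1 == p2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / std_err
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, for illustration only (not taken from the Netflix data)
z_stat, p_value = two_proportion_ztest(45, 100, 30, 100)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```

The sketch does not guard against a pooled proportion of exactly 0 or 1, where the standard error is zero.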

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is not significantly different from the average duration of TV shows added in the year 2021.

**Alternative Hypothesis**: The average duration of TV shows added in the year 2020 on Netflix is significantly different from the average duration of TV shows added in the year 2021.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# To test this hypothesis, we perform a two-sample t-test.
from scipy.stats import ttest_ind

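# Create separate dataframes for TV shows in 2020 and 2021
# Note: this filters on 'release_year'; to match the hypothesis wording ("added in")
# exactly, filter on the 'year_added' column instead.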
tv_2020 = netflix_df[(netflix_df['type'] == 'TV Show') & (netflix_df['release_year'] == 2020)]
tv_2021 = netflix_df[(netflix_df['type'] == 'TV Show') & (netflix_df['release_year'] == 2021)]

# Perform two-sample t-test
t, p = ttest_ind(tv_2020['duration'].astype(int),
                 tv_2021['duration'].astype(int), equal_var=False)
print('t-value: ', t)
print('p-value: ', p)

# Print the results
if p < 0.05:
    print('Reject null hypothesis. \nThe average duration of TV shows added in the year 2020 on Netflix is significantly different from the average duration of TV shows added in the year 2021.')
else:
    print('Failed to reject null hypothesis. \nThe average duration of TV shows added in the year 2020 on Netflix is not significantly different from the average duration of TV shows added in the year 2021.')



##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain the P-Value is a two-sample t-test.


##### Why did you choose the specific statistical test?

The two-sample t-test was chosen because we are comparing the means of two different samples (TV shows added in 2020 vs TV shows added in 2021) to determine whether they are significantly different. Additionally, we assume that the two samples have unequal variances since it is unlikely that the duration of TV shows added in 2020 and 2021 would have the exact same variance.
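The unequal-variance (Welch) form of the t-test described above can be sketched with the standard library as follows. The helper name `welch_t` and the sample values are hypothetical. The sketch returns only the t statistic and the Welch-Satterthwaite degrees of freedom; turning these into a p-value needs the t distribution, which `scipy.stats.ttest_ind(..., equal_var=False)` handles in the notebook.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Variance of each sample mean (statistics.variance uses the n - 1 denominator)
    v_a = variance(sample_a) / n_a
    v_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(v_a + v_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t_stat, dof

# Hypothetical season counts, for illustration only
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(f"t = {t_stat:.4f}, dof = {dof:.2f}")
```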

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is not significantly different from the proportion of movies added on Netflix that are produced in the United States.

**Alternative Hypothesis**: The proportion of TV shows added on Netflix that are produced in the United States is significantly different from the proportion of movies added on Netflix that are produced in the United States.           

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from statsmodels.stats.proportion import proportions_ztest  #------> This function is used to perform z test of proportion.

# Count US-produced titles among TV shows and among movies
tv_us_count = int(df_tvshows['country'].str.contains('United States').sum())
movie_us_count = int(df_movies['country'].str.contains('United States').sum())
tv_proportion = tv_us_count / len(df_tvshows)
movie_proportion = movie_us_count / len(df_movies)

# Set up the parameters for the z-test (use the counts directly to avoid float rounding)
count = [tv_us_count, movie_us_count]
nobs = [len(df_tvshows), len(df_movies)]
alternative = 'two-sided'

# Perform the z-test
z_stat, p_value = proportions_ztest(count=count, nobs=nobs, alternative=alternative)
print('z-statistic: ', z_stat)
print('p-value: ', p_value)

# Set the significance level
alpha = 0.05

# Print the results of the z-test
if p_value < alpha:
    print(f"Reject the null hypothesis.")
else:
    print(f"Fail to reject the null hypothesis.")


We conclude that the proportion of TV shows added on Netflix that are produced in the United States is significantly different from the proportion of movies added on Netflix that are produced in the United States.

##### Which statistical test have you done to obtain P-Value?

The statistical test used to obtain P-Value is a two-sample proportion test.



##### Why did you choose the specific statistical test?

We chose this specific statistical test because it is appropriate for comparing two proportions, and it helps us to determine whether the difference between the two proportions is due to chance or not.





## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Already done in the data wrangling section, to get clean visualizations

#### What all missing value imputation techniques have you used and why did you use those techniques?

- Missing values in the director, cast, and country attributes are replaced with placeholder strings such as 'unknown' and 'No Cast'.
- Only a small percentage of values are null in the rating and date_added columns. Dropping the rows with a null date_added and replacing null ratings with the mode has little effect on the model, so that is the approach we take.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# plotting graph
fig,ax = plt.subplots(1,2, figsize=(15,5))

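# Display distribution plot and boxplot of release_year.
# sns.distplot is deprecated, so histplot with a KDE overlay is used instead.
sns.histplot(netflix_df['release_year'], kde=True, ax=ax[0])
sns.boxplot(x=netflix_df['release_year'], ax=ax[1])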

##### What all outlier treatment techniques have you used and why did you use those techniques?

- Almost all of the data is in textual format; release year is the only exception.
- The data we need for clustering/model building is in textual format, so there is no need to handle outliers.

### 3. Textual Data Preprocessing

#### **What is textual data preprocessing?**
* Textual data preprocessing is the process of preparing text data for analysis or modeling. It includes a series of steps that are applied to raw text data in order to clean, organize and standardize it so that it can be easily analyzed or used as input for natural language processing or machine learning models. The preprocessing steps typically include tokenization, stop-word removal, stemming or lemmatization, lowercasing, removing punctuation, and removing numbers. The goal of textual data preprocessing is to prepare the data for further analysis and modeling by removing irrelevant information and standardizing the format of the text. This can help improve the accuracy and effectiveness of the analysis or modeling.
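As a compact illustration of the steps listed above, the sketch below chains lowercasing, URL removal, punctuation/digit removal, and stop-word removal in one function. The name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for this example; the notebook itself uses NLTK's English stop-word list and separate functions. Note that URLs are removed before punctuation, since stripping punctuation first would break the URL pattern.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "s", "the", "to"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def clean_text(text):
    """Lowercase, strip URLs, drop non-letters, and remove stop words."""
    text = text.lower()
    # URLs go first, while '://' and '.' are still present to match on
    text = URL_PATTERN.sub(" ", text)
    # Replace every non-letter character (punctuation, digits) with a space
    text = re.sub(r"[^a-z]", " ", text)
    # split() also collapses repeated whitespace
    words = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(words)

sample = "Visit https://example.com NOW, it's 100% free!"
cleaned = clean_text(sample)
print(cleaned)
```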

#### 1. Textual Columns

In [None]:
#drop unnecessary columns
columns_to_drop = [ 'month_added', 'day_added', 'year_added']
netflix_df.drop(columns=columns_to_drop, inplace=True)

In [None]:
# Creating new feature content_detail with the help of other textual attributes
netflix_df["content_detail"]= netflix_df["cast"]+" "+netflix_df["director"]+" "+netflix_df["listed_in"]+" "+netflix_df["type"]+" "+netflix_df["rating"]+" "+netflix_df["country"]+" "+netflix_df["description"]

#checking the manipulation
netflix_df.head(5)

#### 2. Lower Casing

In [None]:
# Lower Casing
netflix_df['content_detail']= netflix_df['content_detail'].str.lower()

# Checking the manipulation
netflix_df.iloc[281,]['content_detail']

#### 3. Removing Punctuations

In [None]:
# function to remove punctuations
def remove_punctuations(text):
    '''This function is used to remove the punctuations from the given sentence'''
    # importing needed library
    import string
    # replacing the punctuations with no space, which in effect deletes the punctuation marks.
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped off punctuation marks
    return text.translate(translator)

In [None]:
# Removing Punctuations from the content_detail
netflix_df['content_detail']= netflix_df['content_detail'].apply(remove_punctuations)

# Checking the observation after manipulation
netflix_df.iloc[281,]['content_detail']

#### 4. Removing URLs & removing words and digits containing digits

In [None]:
def remove_url_and_numbers(text):
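    '''This function is used to remove the URLs and numbers from the given sentence'''
    # importing needed library
    import re

    # Replacing the URLs with no space
    # Note: punctuation was already stripped in step 3, so URLs no longer contain
    # '://' or '.' at this point; run this step before punctuation removal for the pattern to match.
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    text = re.sub(url_pattern, '', text)

    # Replacing every non-letter character (digits included) with one space
    text = re.sub('[^a-zA-Z]', ' ', text)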

    # return the text stripped off URL's and Numbers
    return text

In [None]:
# Remove URLs & Remove words and digits contain digits
netflix_df['content_detail']= netflix_df['content_detail'].apply(remove_url_and_numbers)

# Checking the observation after manipulation
netflix_df.iloc[281,]['content_detail']

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Downloading stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords

# create a set of English stop words
stop_words = set(stopwords.words('english'))

# displaying stopwords
print(stop_words)



In [None]:
def remove_stopwords_and_whitespaces(text):
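    '''This function is used for removing the stopwords and extra whitespace from the given sentence'''
    import re   # needed for the whitespace cleanup further down
    # use the precomputed stop_words set instead of rebuilding the stop-word list for every word
    text = [word for word in text.split() if word not in stop_words]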

    # joining the list of words with space separator
    text=  " ".join(text)

    # removing whitespace
    text = re.sub(r'\s+', ' ', text)

    # return the manipulated string
    return text

In [None]:
# Remove stopwords & remove white spaces
netflix_df['content_detail']= netflix_df['content_detail'].apply(remove_stopwords_and_whitespaces)

# Checking the observation after manipulation
netflix_df.iloc[281,]['content_detail']

In [None]:
netflix_df['content_detail'][0]

#### 6. Tokenization

In [None]:
# Downloading needed libraries
nltk.download('punkt')

# Tokenization
netflix_df['content_detail']= netflix_df['content_detail'].apply(nltk.word_tokenize)

# Checking the observation after manipulation
netflix_df.iloc[281,]['content_detail']

#### 7. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
# Importing WordNetLemmatizer from nltk module
from nltk.stem import WordNetLemmatizer

# Creating instance for wordnet
wordnet  = WordNetLemmatizer()

In [None]:
def lemmatizing_sentence(text):
    '''This function is used for lemmatizing (changing each word into its meaningful base form) the words from the given sentence'''
    text = [wordnet.lemmatize(word) for word in text]

    # joining the list of words with space separator
    text=  " ".join(text)

    # return the manipulated string
    return text

In [None]:
# Downloading needed libraries
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Rephrasing text by applying defined lemmatizing function
netflix_df['content_detail']= netflix_df['content_detail'].apply(lemmatizing_sentence)

# Checking the observation after manipulation
netflix_df.iloc[281,]['content_detail']

##### Which text normalization technique have you used and why?

We have used Lemmatization instead of Stemming for our project because:

* **Lemmatization produces a more accurate base word**: Unlike Stemming, which simply removes the suffix from a word, Lemmatization looks at the meaning of the word and its context to produce a more accurate base form.

* **Lemmatization can handle different inflections**: Lemmatization can handle various inflections of a word, including plural forms, verb tenses, and comparative forms, making it useful for natural language processing.

* **Lemmatization produces real words**: Lemmatization always produces a real word that can be found in a dictionary, making it easier to interpret the results of text analysis.

* **Lemmatization improves text understanding**: By reducing words to their base form, Lemmatization makes it easier to understand the context and meaning of a sentence.

* **Lemmatization generalizes across languages**: Dictionary-based lemmatizers exist for many languages, although the WordNet lemmatizer used here covers English only.

#### 8. Part of speech tagging

In [None]:
# tokenize the text into words before POS Tagging
netflix_df['pos_tags'] = netflix_df['content_detail'].apply(nltk.word_tokenize).apply(nltk.pos_tag)

# Checking the observation after manipulation
netflix_df.head(5)

#### 9. Text Vectorization

In [None]:
# Vectorizing Text
# Importing needed libraries
from sklearn.feature_extraction.text import TfidfVectorizer

# Creating instance
tfidfv = TfidfVectorizer(max_features=30000)        # Setting max features as 30000 to avoid RAM explosion

In [None]:
# Fitting on TfidfVectorizer
x= tfidfv.fit_transform(netflix_df['content_detail'])

# Checking shape of the formed document matrix
print(x.shape)

##### Which text vectorization technique have you used and why?

We have used TF-IDF vectorization instead of Bag of Words because TF-IDF takes into account the importance of each word in a document. It assigns higher weights to words that are frequent in a particular document but rare across the corpus, making them more prominent in the representation.
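The weighting idea can be shown on a toy corpus with a short standard-library sketch. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`; the L2 normalization that scikit-learn applies afterwards is left out here for brevity. The function name `tfidf` and the three documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return per-document {term: weight} dicts and the idf table (no L2 normalization)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: rare terms get larger values
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({term: count * idf[term] for term, count in counts.items()})
    return weights, idf

# Toy corpus, for illustration only
docs = ["crime drama detective", "romantic drama", "drama comedy"]
weights, idf = tfidf(docs)
print(idf["drama"], idf["detective"])
```

Here "drama" appears in every document and gets the minimum idf, while "detective" appears once and is weighted more heavily.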

### 4. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Text vectorization creates up to 30,000 attributes, and our local machines cannot handle that many columns. So, we will use Principal Component Analysis (PCA) to reduce the dimensions of this huge sparse matrix.

In [None]:
# Dimensionality Reduction
# Importing PCA from sklearn
from sklearn.decomposition import PCA

# Defining PCA object with desired number of components
pca = PCA()

# Fitting the PCA model
pca.fit(x.toarray())

# percent of variance captured by each component
variance = pca.explained_variance_ratio_
print(f"Explained variance: {variance}")

In [None]:
# Plotting the cumulative fraction of variance captured versus the number of components in order to determine the reduced dimensions
fig, ax = plt.subplots()
ax.plot(range(1, len(variance)+1), np.cumsum(pca.explained_variance_ratio_))
ax.set_xlabel('Number of Components')
ax.set_ylabel('Percent of Variance Captured')
ax.set_title('PCA Analysis')
plt.grid(linestyle='--', linewidth=0.3)
plt.show()

It is clear from the above plot that 7770 principal components capture 100% of the variance. In our case, we will keep only as many principal components as are needed to capture 95% of the variance.

In [None]:
## Now we are passing the argument so that we can capture 95% of variance.
# Defining instance
pca_tuned = PCA(n_components=0.95)

# Fitting and transforming the model in one step (avoids densifying the matrix twice)
x_transformed = pca_tuned.fit_transform(x.toarray())

# Checking the shape of transformed matrix
x_transformed.shape

##### Which dimensionality reduction technique have you used and why?

We have used PCA (Principal Component Analysis) for dimensionality reduction. PCA is a widely used technique for reducing the dimensionality of high-dimensional data sets while retaining most of the information in the original data.

PCA works by finding the principal components of the data, which are linear combinations of the original features that capture the maximum amount of variation in the data. By projecting the data onto these principal components, PCA can reduce the number of dimensions while retaining most of the information in the original data.

PCA is a popular choice for dimensionality reduction because it is simple to implement, computationally efficient, and widely available in most data analysis software packages. Additionally, PCA has been extensively studied and has a strong theoretical foundation, making it a reliable and well-understood method.
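The "keep enough components for 95% of the variance" rule can be sketched without any libraries: accumulate the explained-variance ratios (sorted from largest to smallest, as PCA returns them) and stop at the first component count whose running total reaches the threshold. The function name `components_for_variance` and the ratios are hypothetical; scikit-learn's own handling of exact ties at the threshold may differ slightly.

```python
from itertools import accumulate

def components_for_variance(ratios, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    for n_components, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return n_components
    # Threshold never reached (e.g. ratios do not sum to 1): keep everything
    return len(ratios)

# Hypothetical explained-variance ratios, for illustration only
ratios = [0.5, 0.3, 0.1, 0.06, 0.04]
n_keep = components_for_variance(ratios)
print(n_keep)
```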

## ***7. ML Model Implementation***

### ML Model - 1 (K-Means Clustering)

K-means clustering is a type of unsupervised machine learning algorithm used for partitioning a dataset into K clusters based on similarity of data points. The goal of the algorithm is to minimize the sum of squared distances between each data point and its corresponding cluster centroid. It works iteratively by assigning each data point to its nearest centroid and then re-computing the centroid of each cluster based on the new assignments. The algorithm terminates when the cluster assignments no longer change or when a maximum number of iterations is reached.
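The assign-then-recompute loop described above can be written in a few lines of plain Python. This is a minimal sketch of Lloyd's algorithm with caller-supplied starting centroids, not scikit-learn's implementation (which adds k-means++ initialization and multiple restarts). The function name `kmeans` and the toy points are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm; `centroids` holds the starting centres."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Two well-separated toy groups, for illustration only
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels, centroids)
```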

Let's iterate over 1 to 15 clusters (`k=(1,16)` excludes the upper bound) and try to find the optimal number of clusters with the elbow method.

In [None]:
## Determining optimal value of K using KElbowVisualizer
# Importing needed library
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans(random_state=0)
visualizer = KElbowVisualizer(model, k=(1,16),locate_elbow=False)

# Fit the data to the visualizer
visualizer.fit(x_transformed)

# Finalize and render the figure
visualizer.show()

Here the elbow seems to form at 2 clusters, but before blindly believing it, let's plot one more chart that iterates over the same range of clusters and determines the Silhouette Score at every point.

Okay, but what is **Silhouette Score**?

The silhouette score is a measure of how similar an object is to its own cluster compared to other clusters. It is used to evaluate the quality of clustering, where a higher score indicates that objects are more similar to their own cluster and dissimilar to other clusters.

The silhouette score ranges from -1 to 1, where a score of 1 indicates that the object is well-matched to its own cluster, and poorly-matched to neighboring clusters. Conversely, a score of -1 indicates that the object is poorly-matched to its own cluster, and well-matched to neighboring clusters.
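For a single point, the silhouette is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster. The sketch below averages that value over all points. The function name `mean_silhouette` and the toy data are hypothetical; the sketch assumes at least two clusters and no duplicate points, and scores singleton clusters as 0.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (Euclidean distance)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster, excluding p itself
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:  # p is alone in its cluster
            scores.append(0.0)
            continue
        a = mean_dist[own]                                        # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)      # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy one-dimensional data, for illustration only
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean_silhouette(points, [0, 0, 1, 1])   # sensible grouping
bad = mean_silhouette(points, [0, 1, 0, 1])    # deliberately poor grouping
print(f"good = {good:.4f}, bad = {bad:.4f}")
```

The sensible grouping scores close to 1, while the deliberately poor grouping scores below 0.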

In [None]:
## Determining optimal value of K using the silhouette metric of KElbowVisualizer
# Importing needed library
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
visualizer = KElbowVisualizer(model, k=(2,16), metric='silhouette', timings=True, locate_elbow=False)

# Fit the data to the visualizer
visualizer.fit(x_transformed)

# Finalize and render the figure
visualizer.show()

In [None]:
## Computing Silhouette score for each k
# Importing needed libraries
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Defining Range
k_range = range(2, 7)
for k in k_range:
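    Kmodel = KMeans(n_clusters=k, random_state=0)   # fixed seed so the scores are reproducible
    labels = Kmodel.fit_predict(x_transformed)
    # score in the same PCA space the clusters were fitted in (previously the raw TF-IDF matrix x)
    score = silhouette_score(x_transformed, labels)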
    print("k=%d, Silhouette score=%f" % (k, score))

From the above plots (Elbow plot and Silhouette plot) it is clear that the Silhouette score is comparatively good for 4 clusters, so we will use 4 clusters in the K-means analysis.

Now let's plot and see how our data points look like after assigning to their respective clusters.

In [None]:
#training the K-means model on a dataset
kmeans = KMeans(n_clusters=4, init='k-means++', random_state= 0)

#predict the labels of clusters.
plt.figure(figsize=(10,6), dpi=120)
label = kmeans.fit_predict(x_transformed)
#Getting unique labels
unique_labels = np.unique(label)

#plotting the results:
for i in unique_labels:
    plt.scatter(x_transformed[label == i , 0] , x_transformed[label == i , 1] , label = i)
plt.legend()
plt.show()

We have 4 different clusters, but unfortunately the above plot is only two-dimensional. Let's plot the same figure in 3D using the mplot3d library and see whether the clusters separate.

In [None]:
# Importing library to visualize clusters in 3D
from mpl_toolkits.mplot3d import Axes3D

# Plot the clusters in 3D
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111, projection='3d')
colors = ['r', 'g', 'b', 'y']
for i in range(len(colors)):
    ax.scatter(x_transformed[kmeans.labels_ == i, 2], x_transformed[kmeans.labels_ == i, 0], x_transformed[kmeans.labels_ == i, 1], c=colors[i])

# Set the viewing angle: 20 degrees elevation and -120 degrees azimuth
ax.view_init(elev=20, azim=-120)
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
ax.set_zlabel('z-axis')
plt.show()

Cool, we can easily differentiate all 4 clusters with the naked eye. Now let's assign the content to its respective cluster by appending one more attribute to the final dataframe.

In [None]:
# Add cluster values to the dataframe.
netflix_df['kmeans_cluster'] = kmeans.labels_

#### 1. Explain the ML Model used and its performance

We start by defining a function that plots a word cloud for a given attribute of the dataframe.

In [None]:
def kmeans_wordcloud(cluster_number, column_name):
    '''function for Building a wordcloud for the movie/shows'''

    #Importing libraries
    from wordcloud import WordCloud, STOPWORDS

    # Filter the data by the specified cluster number and column name
    df_wordcloud = netflix_df[['kmeans_cluster', column_name]].dropna()
    df_wordcloud = df_wordcloud[df_wordcloud['kmeans_cluster'] == cluster_number]
    df_wordcloud = df_wordcloud[df_wordcloud[column_name].str.len() > 0]

    # Combine all text documents into a single string
    text = " ".join(word for word in df_wordcloud[column_name])

    # Create the word cloud
    wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="black").generate(text)

    # Convert the wordcloud to a numpy array
    image_array = wordcloud.to_array()

    # Return the numpy array
    return image_array

In [None]:
# Implementing the above defined function and plotting the wordcloud of each attribute
fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(20, 15))
for i in range(4):
    for j, col in enumerate(['description', 'listed_in', 'country', 'title']):
        axs[j][i].imshow(kmeans_wordcloud(i, col))
        axs[j][i].axis('off')
        axs[j][i].set_title(f'Cluster {i}, {col}',fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### ML Model - 2 (Hierarchical Clustering)

Hierarchical clustering is a type of clustering algorithm used for grouping similar data points together into clusters based on their similarity, by recursively merging or dividing clusters based on a measure of similarity or distance between them.

Let's dive into it by plotting a dendrogram and then we will determine the optimal number of clusters.

In [None]:
#importing needed libraries
from scipy.cluster.hierarchy import linkage, dendrogram

# HIERARCHICAL CLUSTERING
distances_linkage = linkage(x_transformed, method = 'ward', metric = 'euclidean')
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('All films/TV shows')
plt.ylabel('Euclidean Distance')

dendrogram(distances_linkage, no_labels = True)
plt.show()

Cool, but what is a dendrogram, and how do we determine the **optimal number of clusters?**

* A dendrogram is a tree-like diagram that records the sequence of merges or splits. The longer the vertical lines in the dendrogram, the greater the distance between those clusters.
* From the above dendrogram we can say that the optimal number of clusters is 2. But before assigning the values to their respective clusters, let's check the silhouette scores using Agglomerative clustering, which follows the bottom-up approach to aggregate the data points.
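The bottom-up approach can be sketched as a loop that starts with every point in its own cluster and repeatedly merges the closest pair until the requested number of clusters remains. For simplicity this sketch uses average linkage, which is a swapped-in simplification; the notebook itself uses Ward linkage via SciPy and scikit-learn. The function name `agglomerative` and the toy points are hypothetical.

```python
import math
from itertools import combinations

def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start: every point is its own cluster

    def avg_linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (a < b is guaranteed by combinations)
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy one-dimensional data, for illustration only
points = [(0.0,), (1.0,), (10.0,), (11.0,), (12.0,)]
result = agglomerative(points, 2)
print(result)
```

Each merge corresponds to one horizontal join in a dendrogram, and the linkage distance at which it happens is the height of that join.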

In [None]:
## Computing Silhouette score for each k
# Importing needed libraries
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Range selected from dendrogram above
k_range = range(2, 10)
for k in k_range:
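    model = AgglomerativeClustering(n_clusters=k)
    labels = model.fit_predict(x_transformed)
    # score in the same PCA space the clusters were fitted in (previously the raw TF-IDF matrix x)
    score = silhouette_score(x_transformed, labels)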
    print("k=%d, Silhouette score=%f" % (k, score))

From the above silhouette scores it is clear that 2 clusters is the optimal value (maximum Silhouette score). The dendrogram above agrees: the Euclidean distance between clusters is largest for 2 clusters.

Let's again plot the chart and observe the 2 different formed clusters.

In [None]:
# training the Agglomerative clustering model on the dataset
# (ward linkage always uses Euclidean distance, so the deprecated `affinity` argument is dropped)
Agmodel = AgglomerativeClustering(n_clusters = 2, linkage = 'ward')

#predict the labels of clusters.
plt.figure(figsize=(10,6), dpi=120)
label = Agmodel.fit_predict(x_transformed)
#Getting unique labels
unique_labels = np.unique(label)

#plotting the results:
for i in unique_labels:
    plt.scatter(x_transformed[label == i , 0] , x_transformed[label == i , 1] , label = i)
plt.legend()
plt.show()

Again plotting the 3 Dimensional plot to see the clusters clearly.

In [None]:
# Importing library to visualize clusters in 3D
from mpl_toolkits.mplot3d import Axes3D

# Plot the clusters in 3D
fig = plt.figure(figsize=(20,8))
ax = fig.add_subplot(111, projection='3d')
colors = ['r', 'g']   # one color per cluster (only 2 clusters here)
for i in range(len(colors)):
    ax.scatter(x_transformed[Agmodel.labels_ == i, 0], x_transformed[Agmodel.labels_ == i, 1], x_transformed[Agmodel.labels_ == i, 2],c=colors[i])
ax.set_xlabel('x-axis')
ax.set_ylabel('y-axis')
ax.set_zlabel('z-axis')
plt.show()

Cool, we can again easily differentiate both clusters with the naked eye. Now let's assign the content (Movies and TV Shows) to its respective cluster by appending one more attribute to the final dataframe.

In [None]:
# Add cluster values to the dataframe.
netflix_df['agglomerative_cluster'] = Agmodel.labels_

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Let's again define a function that plots a word cloud for different attributes, this time using the Agglomerative clusters.

In [None]:
def agglomerative_wordcloud(cluster_number, column_name):
  '''function for Building a wordcloud for the movie/shows'''

  #Importing libraries
  from wordcloud import WordCloud, STOPWORDS

  # Filter the data by the specified cluster number and column name
  df_wordcloud = netflix_df[['agglomerative_cluster', column_name]].dropna()
  df_wordcloud = df_wordcloud[df_wordcloud['agglomerative_cluster'] == cluster_number]

  # Combine all text documents into a single string
  text = " ".join(word for word in df_wordcloud[column_name])

  # Create the word cloud
  wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="black").generate(text)

  # Return the word cloud object
  return wordcloud

In [None]:
# Implementing the above defined function and plotting the wordcloud of each attribute
fig, axs = plt.subplots(nrows=4, ncols=2, figsize=(20, 15))
for i in range(2):
    for j, col in enumerate(['description', 'listed_in', 'country', 'title']):
        axs[j][i].imshow(agglomerative_wordcloud(i, col))
        axs[j][i].axis('off')
        axs[j][i].set_title(f'Cluster {i}, {col}',fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### ML Model - 3 (Building a Recommendation System)

We are using cosine similarity, a measure of similarity between two non-zero vectors in a multidimensional space. It measures the cosine of the angle between the two vectors, which ranges from -1 (opposite direction) to 1 (same direction), with 0 indicating orthogonality (the vectors are perpendicular to each other). Because TF-IDF weights are non-negative, the scores in this project fall between 0 and 1.

In this project we have used cosine similarity which is used to determine how similar two documents or pieces of text are. We represent the documents as vectors in a high-dimensional space, where each dimension represents a word or term in the corpus. We can then calculate the cosine similarity between the vectors to determine how similar the documents are based on their word usage.

We compute cosine similarity on the TF-IDF vectors because:

* Cosine similarity handles high dimensional sparse data well.

* Cosine similarity compares the direction of two vectors rather than their magnitude, so a long description and a short one that use the same terms in similar proportions still score as similar. Note that TF-IDF is a bag-of-words representation, so neither it nor the cosine score depends on word order.
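A small standard-library sketch shows how cosine similarity behaves on sparse term-weight vectors. The function name `cosine_sim_sparse` and the three toy vectors are hypothetical; the notebook itself uses `sklearn.metrics.pairwise.cosine_similarity` on the full TF-IDF matrix.

```python
import math

def cosine_sim_sparse(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty vectors in this sketch
    return dot / (norm_u * norm_v)

# Toy vectors, for illustration only
short_doc = {"crime": 1.0, "drama": 1.0}
long_doc = {"crime": 3.0, "drama": 3.0}   # same direction, larger magnitude
other_doc = {"comedy": 2.0}               # no terms in common

same_direction = cosine_sim_sparse(short_doc, long_doc)
no_overlap = cosine_sim_sparse(short_doc, other_doc)
print(same_direction, no_overlap)
```

Vectors that point in the same direction score (approximately) 1 regardless of their length, and vectors with no shared terms score 0.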

In [None]:
# Importing needed libraries
from sklearn.metrics.pairwise import cosine_similarity

# Create a TF-IDF vectorizer object and transform the text data
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(netflix_df['content_detail'])

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)

def recommend_content(title, cosine_sim=cosine_sim, data=netflix_df):
    # Get the index of the input title in the programme_list
    programme_list = data['title'].to_list()
    index = programme_list.index(title)

    # Create a list of tuples containing the similarity score and index
    # between the input title and all other programmes in the dataset
    sim_scores = list(enumerate(cosine_sim[index]))

    # Sort the list of tuples by similarity score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]

    # Get the recommended movie titles and their similarity scores
    recommend_index = [i[0] for i in sim_scores]
    rec_movie = data['title'].iloc[recommend_index]
    rec_score = [round(i[1], 4) for i in sim_scores]

    # Create a pandas DataFrame to display the recommendations
    rec_table = pd.DataFrame(list(zip(rec_movie, rec_score)), columns=['Recommendation', 'Similarity_score(0-1)'])

    return rec_table

Let's check how our recommender system is performing.

In [None]:
# Testing an Indian movie
recommend_content('Zindagi Na Milegi Dobara')

In [None]:
# Testing a non-Indian movie
recommend_content('THE RUM DIARY')

In [None]:
# Testing an Indian TV show
recommend_content('Humsafar')

In [None]:
# Testing a non-Indian TV show
recommend_content('The World Is Yours')

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

We chose the **Silhouette Score** over the **Distortion Score (also known as inertia or the sum of squared distances)** as our evaluation metric because it measures how well each data point fits its own cluster compared with the nearest other cluster. It ranges from -1 to 1, with higher values indicating better cluster separation. A silhouette score close to 1 indicates that the data point is well matched to its own cluster and poorly matched to neighboring clusters. A score close to 0 indicates that the data point is on or very close to the boundary between two clusters. A score close to -1 indicates that the data point is probably assigned to the wrong cluster.

The advantages of using silhouette score over distortion score are:

* Silhouette score takes into account both the cohesion (how well data points within a cluster are similar) and separation (how well data points in different clusters are dissimilar) of the clusters, whereas distortion score only considers the compactness of each cluster.
* **Silhouette score is less sensitive to the shape of the clusters**, while distortion score tends to favor spherical clusters, and in our case the clusters are not completely spherical.
* Silhouette score provides more intuitive and interpretable results, as it assigns a score to each data point rather than just a single value for the entire clustering solution.
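
The contrast between the two metrics can be sketched on synthetic data. This is a hedged illustration, not the project's actual evaluation: the data comes from `make_blobs` rather than the Netflix features, and the range of `k` and all variable names are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic, well-separated 2-D data (hypothetical stand-in for the real features)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = {
        "inertia": model.inertia_,                         # distortion: compactness only
        "silhouette": silhouette_score(X, model.labels_),  # cohesion and separation
        "labels": model.labels_,
    }
    print(f"k={k}  inertia={results[k]['inertia']:.1f}  "
          f"silhouette={results[k]['silhouette']:.3f}")

# Pick the k with the highest mean silhouette score
best_k = max(results, key=lambda k: results[k]["silhouette"])

# Per-point silhouette values; their mean equals silhouette_score
per_point = silhouette_samples(X, results[best_k]["labels"])
print(f"best k by silhouette: {best_k}, lowest per-point score: {per_point.min():.3f}")
```

Inertia tends to keep falling as `k` grows, so it needs an elbow judgment, whereas the silhouette score peaks at a specific `k` and also gives a value per data point.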

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We chose **K-means** as our final model because it gives a comparatively **high Silhouette Score** and the resulting clusters are well separated from each other, as we saw in the 3-dimensional plot.

K-means also has several practical advantages over other clustering methods such as hierarchical clustering:
* **Speed**: K-means is generally faster than hierarchical clustering, especially on large datasets, since each iteration scales roughly linearly with the number of data points, whereas standard agglomerative clustering needs the pairwise distances between all points.

* **Ease of use**: K-means is relatively straightforward to implement and interpret, as it requires only a few parameters (such as the number of clusters) and produces a clear partitioning of the data.

* **Scalability**: K-means can handle datasets with a large number of variables or dimensions, whereas hierarchical clustering becomes computationally expensive as the number of data points and dimensions increases.

* **Flat partition**: K-means directly produces a single set of non-overlapping clusters, whereas hierarchical clustering produces a hierarchy of nested clusters that must be cut at a chosen level to obtain a flat partition, which may not be ideal for certain applications.
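
A comparison of this kind can be sketched as follows. The snippet is a minimal illustration on synthetic `make_blobs` data, assuming Ward linkage for the agglomerative model and 4 clusters for both; it does not reproduce the project's features or scores.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (hypothetical stand-in for the reduced text features)
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.7, random_state=0)

# Each model assigns exactly one label per data point
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

scores = {
    "kmeans": silhouette_score(X, kmeans_labels),
    "agglomerative": silhouette_score(X, agglo_labels),
}
for name, score in scores.items():
    print(f"{name}: silhouette={score:.3f}")
```

Scoring both models with the same metric on the same features makes the choice between them explicit.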

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
import joblib

# 'kmeans' is the best performing K-Means model
# Save the model to a file using joblib
joblib.dump(kmeans, 'kmeans_model.pkl')

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
# Load the saved model
loaded_kmeans_model = joblib.load('kmeans_model.pkl')
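
To complete the sanity check, unseen text has to go through the same fitted vectorizer before the loaded model can assign it to a cluster. The sketch below is self-contained and rests on assumptions: it trains a toy `TfidfVectorizer` and `KMeans` on made-up sentences, saves both in one joblib file (the name `kmeans_bundle.pkl` is hypothetical), and omits any dimensionality reduction step the real pipeline may include.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus (hypothetical; stands in for the project's text column)
train_docs = [
    "detective investigates a murder in the city",
    "police hunt a serial killer",
    "crime boss faces a detective",
    "chefs compete in a baking contest",
    "home cooks bake cakes for judges",
    "a baking show with pastry chefs",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)

# Save the vectorizer together with the model so unseen text can be transformed later
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "kmeans_bundle.pkl")
    joblib.dump({"vectorizer": vectorizer, "kmeans": kmeans}, path)
    bundle = joblib.load(path)

# Predict the cluster of an unseen description with the reloaded objects
unseen = ["a detective hunts a killer in the city"]
predicted = bundle["kmeans"].predict(bundle["vectorizer"].transform(unseen))

# Sanity check: the reloaded model agrees with the in-memory one
original = kmeans.predict(vectorizer.transform(unseen))
print("predicted cluster:", int(predicted[0]), "| matches original:", bool(predicted[0] == original[0]))
```

Saving only the clustering model is not enough for deployment, because `predict` expects input with the same feature columns the model was trained on.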

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

### **Conclusions drawn from EDA**

In conclusion, the exploratory data analysis (EDA) of the Netflix TV shows and movies dataset has revealed insights into the platform's content distribution, production trends, viewer preferences, and global impact. From the data, several key takeaways emerge:

**Content Diversity and Global Reach:** Netflix's library showcases a rich diversity of content, with a focus on both TV shows and movies. The platform's international TV shows and the popularity of crime and kids' TV genres underscore the global audience's appetite for varied storytelling from different cultures and genres.

**Production Trends:** Over the years, Netflix has experienced rapid growth in content production. The surge in TV shows and movies from 2016 to 2020 reflects the streaming industry's evolving landscape, with platforms like Netflix responding to the demand for original content.

**Global Influences:** The dominance of the United States in content production highlights its historical and industrial strength, while the rise of Indian content underscores the influence of factors like growing middle-class populations, disposable incomes, and the popularity of streaming services.

**Regional Success Stories:** The prominence of South Korean dramas in the TV show market demonstrates the power of the Korean Wave, while Canada's financial support for TV shows has attracted both domestic and foreign investment.

**Viewer Engagement:** The popularity of Japanese voice actors, crime TV shows, kids' TV, British TV shows, and documentaries showcases viewers' diverse interests, from crime thrillers to educational content, across cultures and genres.

**Quality and Collaboration:** The involvement of prolific directors and actors suggests Netflix's emphasis on quality and collaboration, both within and beyond traditional entertainment industries.

In essence, the EDA illustrates Netflix's commitment to catering to a global audience by offering diverse, engaging, and high-quality content. The platform's strategic content production, collaborations with industry leaders, and focus on viewer preferences position it as a frontrunner in the evolving world of entertainment streaming.

### **Conclusions drawn from ML Model**

* Implemented **K-Means Clustering and Agglomerative Hierarchical Clustering** to cluster the Netflix movies and TV shows dataset.
* The optimal number of clusters from **K-means is 4**, whereas for **Agglomerative Hierarchical Clustering the optimal number of clusters is 2**.
* We chose the **Silhouette Score as the evaluation metric** over the distortion score because it provides a more intuitive and interpretable result. The Silhouette score is also less sensitive to the shape of the clusters.
* Built a **Recommendation system** that can help Netflix **improve user experience and reduce subscriber churn** by recommending titles similar to a given title, ranked by their cosine similarity scores.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***