<a href="https://colab.research.google.com/github/vennelaharini/Netflix-Movies-and-TV-Shows-Clustering/blob/main/Netflix.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV Shows Clustering


##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

In this project, we explored the Netflix Movies and TV Shows dataset to uncover hidden patterns, perform clustering, and gain deeper insights into content types, country distributions, and other features. The goal was to prepare the dataset for analysis and machine learning tasks like clustering by performing thorough data cleaning, visualization, and hypothesis testing.

To begin, we handled missing values across various columns. For categorical fields such as country and rating, we imputed missing values using mode or filled them with a placeholder like "Unknown" to retain consistency. For the date_added column, forward fill (ffill) was applied after sorting data by release year, ensuring logical continuity in temporal data. This allowed us to preserve the structure of the dataset without losing records unnecessarily.

Outlier treatment was another key part of preprocessing. We applied the Z-score method to numerical fields such as duration_num, which helped in identifying extreme values that lay beyond three standard deviations from the mean. These outliers were removed to ensure better performance and accuracy of clustering algorithms that are sensitive to data spread and scale.
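As a minimal sketch of the Z-score rule described above, the snippet below uses only the standard library on toy values. The helper name `zscore_filter`, the sample durations, and the choice of the population standard deviation are illustrative assumptions, not the notebook's actual implementation.

```python
import statistics


def zscore_filter(values, threshold=3.0):
    """Keep only values whose absolute Z-score is at most `threshold`."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed choice)
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return list(values)
    return [v for v in values if abs((v - mean) / std) <= threshold]


# Toy durations in minutes (hypothetical): twenty typical movies and one extreme value
durations = [90] * 10 + [100] * 10 + [1000]
filtered = zscore_filter(durations)

print(len(durations), len(filtered))
```

In this toy sample the 1000-minute value lies more than three standard deviations from the mean and is dropped, while the other 20 values are kept.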

For categorical encoding, we used both Label Encoding (for binary variables like type) and One-Hot Encoding (for multi-class variables like rating). These methods converted text-based columns into a numerical format suitable for machine learning algorithms. Label encoding was efficient for binary classification, while one-hot encoding avoided introducing false ordinal relationships in non-binary fields.
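To make concrete what the two encodings produce, here is a small library-free sketch. The helpers `label_encode` and `one_hot_encode`, the `rating_` prefix, and the sample values are assumptions for illustration; in the notebook itself this step would typically rely on a library encoder.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


def one_hot_encode(values, prefix):
    """Return one dict per row with a 0/1 indicator for every category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Toy values (assumed for illustration only)
types = ["Movie", "TV Show", "Movie"]
ratings = ["TV-MA", "PG", "TV-14"]

type_codes, type_mapping = label_encode(types)
rating_rows = one_hot_encode(ratings, prefix="rating")

print(type_mapping)
print(type_codes)
print(rating_rows[0])
```

Label encoding gives one integer column, which is harmless for a binary field. One-hot encoding gives one indicator column per rating, so no artificial ordering is implied between ratings.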

Next, we performed exploratory data visualization to understand relationships between key variables. Using libraries like seaborn and matplotlib, we created various plots including count plots, pair plots, and bar charts. For instance, a count plot of type revealed that Netflix has more movies than TV shows. A country vs type bar chart showed that the U.S., India, and the UK dominate in content production, providing the basis for one of our hypotheses.

We defined three hypothetical statements based on our EDA. For Hypothesis 3—"The top-producing countries are concentrated in a few regions"—we applied a Chi-Square Goodness-of-Fit Test. The null hypothesis assumed uniform content distribution across countries, while the alternative suggested significant differences. The statistical test showed a low p-value, leading us to reject the null hypothesis and conclude that content production is indeed concentrated in a few countries. This justified our earlier observations and visualizations.

In summary, this project demonstrated a complete pipeline from data cleaning and encoding to visualization and textual analysis. Each preprocessing step was chosen based on the nature of the data and the goal of enabling clustering. The insights derived—such as content concentration by country and duration patterns across types—laid the foundation for future clustering and recommendation system models.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


To analyze and cluster Netflix Movies and TV Shows based on various features such as type, country, genre, and description by cleaning and preprocessing the dataset, visualizing key patterns, and applying suitable clustering techniques to uncover meaningful content groupings.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mplt
import matplotlib as mpl
import numpy as np
import plotly.express as px
import plotly as plt

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
file_path='/content/drive/MyDrive/LabMentics/Netflix/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv'
df=pd.read_csv(file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
df.isnull().sum().plot(kind='bar')

### What did you know about your dataset?

The dataset contains information about Netflix Movies and TV Shows, including title, type, release year, country, genre, duration, and descriptions. It shows that Netflix has more movies than TV shows, most content comes from the U.S., India, and the U.K., and genres like Drama and International Movies are highly popular.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

**Variables Description (Short):**

**type:** The type of content (Movie or TV Show).

**title:** The title of the content.

**director:** The director of the content.

**cast:** The cast involved in the content.

**country:** The country where the content was produced.

**date_added:** The date when the content was added to Netflix.

**release_year:** The year the content was released.

**rating:** The rating of the content (e.g., PG, TV-MA).

**duration:** The duration of the content (for Movies, in minutes; for TV Shows, in number of seasons).

**listed_in:** The genre(s) the content falls under (e.g., Drama, Comedy).

**description:** A brief description or summary of the content.
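Because duration mixes two units, it helps to split each value into a number and a unit before any analysis. The sketch below assumes values shaped like "90 min" or "2 Seasons"; the helper `parse_duration` and the sample strings are hypothetical.

```python
import re


def parse_duration(duration):
    """Split a duration string into (numeric value, unit), or (None, None) if unparseable."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", duration)
    if match is None:
        return None, None
    return float(match.group(1)), match.group(2)


# Sample strings in the assumed format
parsed = [parse_duration(d) for d in ["90 min", "2 Seasons", "1 Season", "unknown"]]
print(parsed)
```

Keeping the unit next to the number makes it harder to accidentally compare minutes with seasons later on.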

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df1=df.copy()
df1.drop_duplicates(inplace=True)
df1['country'] = df1['country'].fillna('Unknown')
df1['director'] = df1['director'].fillna('Unknown')
df1['cast'] = df1['cast'].fillna('Unknown')
df1['rating'] = df1['rating'].fillna(df1['rating'].mode()[0])
df1['date_added'] = df1['date_added'].ffill()
# Convert 'date_added' to datetime
df1['date_added'] = pd.to_datetime(df1['date_added'], errors='coerce')
# Extract year and month from 'date_added'
df1['year_added'] = df1['date_added'].dt.year
df1['month_added'] = df1['date_added'].dt.month
# Convert 'duration' to numeric (minutes for Movies, seasons for TV Shows)
# Raw strings avoid invalid-escape warnings; expand=False returns a Series
df1['duration_num'] = df1['duration'].str.extract(r'(\d+)', expand=False).astype(float)
df1['duration_type'] = df1['duration'].str.extract(r'([a-zA-Z]+)', expand=False)

# Create binary flags for type
df1['is_movie'] = df1['type'].apply(lambda x: 1 if x == 'Movie' else 0)
# Clean text columns
for col in ['title', 'director', 'cast', 'country', 'rating', 'listed_in']:
    df1[col] = df1[col].str.strip()

# Final cleaned dataset ready for analysis
df1.head()

### What all manipulations have you done and insights you found?

**Data Manipulations:**
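  * Missing Value Imputation: Filled missing values in country, director, and cast with 'Unknown', filled rating with its mode, and used forward fill for date_added.

  * Feature Extraction: Converted date_added to datetime and derived year_added, month_added, duration_num, duration_type, and is_movie.

  * Outlier Removal: Z-score method to remove extreme values in numerical columns like duration_num (applied in the later Feature Engineering & Data Pre-processing section, not in the code above).

  * Categorical Encoding: Label Encoding for binary columns and One-Hot Encoding for multi-class columns (also applied in the later pre-processing section).

  * Text Preprocessing: Expanded contractions, converted text to lowercase, removed punctuation, URLs, and stopwords, and performed tokenization and lemmatization (also applied in the later pre-processing section).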
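The text preprocessing step can be sketched in simplified form with the standard library alone. This version covers only lowercasing, URL and punctuation removal, whitespace tokenization, and stopword filtering. It does not expand contractions or lemmatize, and the tiny `STOPWORDS` set and the `preprocess` helper are hypothetical stand-ins for a fuller pipeline.

```python
import re
import string

# Deliberately tiny, hypothetical stopword list; a real pipeline would use a fuller one
STOPWORDS = {"a", "an", "the", "and", "of", "in", "to", "is", "at"}


def preprocess(text):
    """Lowercase, strip URLs and punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = text.split()  # simple whitespace tokenization
    return [token for token in tokens if token not in STOPWORDS]


tokens = preprocess("A detective returns to the city: see https://example.com!")
print(tokens)
```

URLs are removed before punctuation so that the pattern can still match the `://` part of each link.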

**Insights Found:**
  * Content Distribution: Movies are more prevalent than TV shows on Netflix.

  * Top-Producing Countries: Content production is highly concentrated in a few countries, especially the US, India, and the UK.

  * Duration Patterns: Movies tend to have a wider range of durations, whereas TV shows have more consistent lengths.

  * Trend Over Time: The number of Netflix titles added has increased over the years, indicating platform growth.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Count of Movies vs TV Shows

In [None]:
# Chart - 1 visualization code
sns.countplot(x='type', data=df1)
mplt.title('Count of Movies vs TV Shows')

##### 1. Why did you pick the specific chart?

The count plot clearly shows the distribution between Movies and TV Shows, making it ideal to compare the volume of each content type.



##### 2. What is/are the insight(s) found from the chart?

Netflix has more Movies than TV Shows, indicating a stronger focus on movie content in its library.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive business impact***

 * Yes, knowing the dominance of Movies helps in **content planning**, user segmentation, and improving **personalized recommendations** based on user preferences.

***Negative growth***
 * Yes, if Netflix **underinvests in TV Shows**, it may lose long-term engagement, as **TV Shows encourage binge-watching and retention**. A low share might signal a gap in user demand.

#### Chart - 2 - Top 10 Countries Producing Netflix Content

In [None]:
# Chart - 2 visualization code
# Exclude the 'Unknown' placeholder used for missing countries
top_countries = df1.loc[df1['country'] != 'Unknown', 'country'].value_counts().head(10)
top_countries.plot(kind='bar')
mplt.title('Top 10 Countries Producing Netflix Content')

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing categorical data like countries and their content volume. It clearly shows how much content each top country contributes.

##### 2. What is/are the insight(s) found from the chart?

The United States dominates Netflix content production, followed by India, the UK, and others. These countries have strong content pipelines and user bases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Netflix can focus marketing and licensing efforts in high-contributing countries to strengthen its catalog and user engagement.

**Negative insight :**

Yes. Underrepresentation of other regions (e.g., African or Middle Eastern countries) shows limited diversity, which may hinder global growth. Investing in regional content could improve global reach.

#### Chart - 3 - Top 10 Genres/Listed_in Categories

In [None]:
# Chart - 3 visualization code
from collections import Counter
genre_split = df1['listed_in'].dropna().apply(lambda x: x.split(', '))
all_genres = [genre for sublist in genre_split for genre in sublist]
genre_counts = pd.Series(Counter(all_genres)).sort_values(ascending=False).head(10)
genre_counts.plot(kind='bar')
mplt.title('Top 10 Genres on Netflix')

##### 1. Why did you pick the specific chart?

To identify the most popular content categories on Netflix and understand audience content preferences using a simple bar chart.

##### 2. What is/are the insight(s) found from the chart?

Genres like Dramas, Comedies, and Documentaries dominate Netflix, indicating viewer demand for story-rich and informative content.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Focusing production and acquisitions on top-performing genres can boost engagement, user retention, and content ROI.

**negative:**

Yes. Over-reliance on a few genres may limit content diversity, causing viewer fatigue and loss of niche audiences who prefer underrepresented genres like Science & Nature or Kids’ shows.

#### Chart - 4 - Content Added Over Years

In [None]:
# Chart - 4 visualization code
df1['year_added'].value_counts().sort_index().plot(kind='line')
mplt.title('Content Added to Netflix Over the Years')

##### 1. Why did you pick the specific chart?

The Content Added Over Years chart shows the trend of how Netflix's content library has grown over time, making it ideal for understanding platform expansion and content strategy.

##### 2. What is/are the insight(s) found from the chart?

A major rise in content additions was observed between 2016 and 2020, indicating aggressive growth; however, a dip post-2020 suggests a slowdown, possibly due to pandemic-related production halts or to incomplete data for the most recent year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it helps identify peak content acquisition years and plan future investments by understanding past trends and production bottlenecks.

**negative:**

Yes, the post-2020 decline may signal reduced content variety, affecting user retention. Understanding this can drive decisions to boost content sourcing and production to maintain subscriber engagement.

#### Chart - 5 - Monthly Trend of Content Addition

In [None]:
# Chart - 5 visualization code
df1['month_added'].value_counts().sort_index().plot(kind='bar')
mplt.title('Monthly Trend of Netflix Content Addition')

##### 1. Why did you pick the specific chart?

The Monthly Trend of Content Addition chart reveals how frequently Netflix adds content throughout the year, highlighting seasonal or campaign-driven spikes.

##### 2. What is/are the insight(s) found from the chart?

Most content is added between July and October, suggesting strategic releases during mid-year and fall seasons, possibly to boost engagement or align with holidays.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By aligning marketing and content promotions with high addition months, Netflix can maximize visibility and user retention.

**negative:**

Yes. Low content additions in early months (e.g., Jan–March) may reduce engagement and new subscriptions during that period. A more balanced release strategy could help maintain consistent viewer interest year-round.

#### Chart - 6 - Top 10 Directors by Content Count

In [None]:
# Chart - 6 visualization code
top_directors = df1[df1['director'] != 'Unknown']['director'].value_counts().head(10)
top_directors.plot(kind='bar')
mplt.title('Top 10 Directors on Netflix')

##### 1. Why did you pick the specific chart?

The bar chart is ideal for comparing discrete categories (directors) with content count, making it easy to identify the most prolific directors on Netflix.

##### 2. What is/are the insight(s) found from the chart?

The chart shows which directors have the most titles on Netflix, indicating their popularity or frequent collaborations with Netflix.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Netflix can strengthen partnerships with high-performing directors to retain and grow viewership, as they already deliver content that aligns with user interests.

**negative:**

Possibly. Over-reliance on a few directors may limit content diversity, which can reduce audience engagement over time. It's important to balance popular creators with fresh talent.

#### Chart - 7 - Distribution of Ratings

In [None]:
# Chart - 7 visualization code
sns.countplot(y='rating', data=df1, order=df1['rating'].value_counts().index)
mplt.title('Distribution of Ratings')

##### 1. Why did you pick the specific chart?

The Distribution of Ratings chart is ideal for visualizing how Netflix content is categorized by age group, helping identify content suitability for different audiences.

##### 2. What is/are the insight(s) found from the chart?

Most titles are rated TV-MA and TV-14, indicating Netflix heavily targets mature and teen audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes — it shows content is tailored to major viewing groups, guiding marketing and production strategies to retain and grow the subscriber base.

**negative:**

Yes — fewer titles rated for children or general audiences may alienate family viewers. Investing in more family-friendly content could fill this gap and expand audience reach.

#### Chart - 8 - Distribution of Duration (Movies Only)


In [None]:
# Chart - 8 visualization code
sns.histplot(df1[df1['type'] == 'Movie']['duration_num'].dropna(), bins=20)
mplt.title('Distribution of Movie Duration (minutes)')

##### 1. Why did you pick the specific chart?

The Distribution of Duration (Movies Only) chart, specifically a histogram, is chosen to understand how movie durations are distributed across the dataset. It shows the frequency of different durations and helps identify trends in movie length.

##### 2. What is/are the insight(s) found from the chart?

  * The majority of movies have durations in the range of 80-120 minutes.

  * A smaller proportion of movies have longer durations, indicating that most content is focused on shorter formats.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, understanding movie duration distribution helps in content strategy. If shorter movies are more popular, the platform can focus on producing more compact, engaging content to align with viewer preferences.

**Negative Growth Insight:**

If longer movies are significantly underrepresented, it may indicate an underutilization of long-form content, leading to a potential gap in offerings for certain audiences who prefer detailed, longer movies.

#### Chart - 9 - Distribution of Seasons (TV Shows Only)

In [None]:
# Chart - 9 visualization code
sns.countplot(x='duration_num', data=df1[df1['type'] == 'TV Show'])
mplt.title('TV Shows by Number of Seasons')

##### 1. Why did you pick the specific chart?

The Distribution of Seasons (TV Shows Only) chart is a countplot (or bar chart) that visually represents the number of TV shows categorized by their season count. It's chosen because it highlights trends in the number of seasons for TV shows, which is useful for understanding the content structure.

##### 2. What is/are the insight(s) found from the chart?

  * Most TV shows are short-term with fewer seasons (typically 1-3 seasons).

  * A smaller number of TV shows have longer runs (5+ seasons), indicating the presence of more popular, longer-running shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding the typical season length helps in:

  * Content Planning: For creating shows with a longer shelf life.

  * Targeting: Understanding audience preferences for shorter or longer TV shows.

**negative:**

  * Shorter shows dominate: If most TV shows have fewer seasons, it could indicate a lack of long-term engagement. This could suggest a gap in building lasting, loyal viewership, potentially leading to negative growth in subscriber retention for platforms.

#### Chart - 10 - Pie Chart – Content Type Share

In [None]:
# Chart - 10 visualization code
df1['type'].value_counts().plot(kind='pie', autopct='%1.1f%%')
mplt.title('Movies vs TV Shows Share on Netflix')

##### 1. Why did you pick the specific chart?

The Pie Chart is used to show the proportion of Movies vs TV Shows on Netflix. It provides an easy-to-understand visual representation of the content type distribution, highlighting the share of each category.

##### 2. What is/are the insight(s) found from the chart?

The Pie Chart shows the relative percentage of Movies and TV Shows in the dataset. For example, if Movies make up 60% and TV Shows 40%, it indicates that movies dominate the platform’s content offering.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

  * **Positive Impact:** Insights from the chart can help Netflix understand which content type is more popular and tailor marketing or acquisition strategies accordingly (e.g., investing more in Movies if they dominate).

  * **Negative Growth:** If TV Shows are significantly less represented, it might signal a missed opportunity in that segment, potentially leading to lower engagement for users who prefer TV Shows.

#### Chart - 11 - Country vs Type Chart

In [None]:
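# Chart - 11 visualization code
# Top 10 countries by title count, excluding the 'Unknown' placeholder for missing values
top_countries = df1.loc[df1['country'] != 'Unknown', 'country'].value_counts().head(10).index
country_type_df = df1[df1['country'].isin(top_countries)]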
country_type_crosstab = pd.crosstab(country_type_df['country'], country_type_df['type'])
country_type_crosstab.plot(kind='bar', figsize=(10,6))
mplt.title('Top 10 Countries: Movie vs TV Show Count')
mplt.xlabel('Country')
mplt.ylabel('Number of Titles')
mplt.xticks(rotation=45)
mplt.legend(title='Content Type')
mplt.tight_layout()
mplt.show()

##### 1. Why did you pick the specific chart?

The grouped bar chart is chosen because it effectively compares the number of Movies and TV Shows for the top countries, making it easy to visually differentiate between content types across countries.

##### 2. What is/are the insight(s) found from the chart?

  * Some countries may have a higher concentration of Movies or TV Shows.

  * Certain countries may produce a more diverse range of content (both Movies and TV Shows).

  * Content distribution varies widely by country.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

  * **Positive Impact:** Identifying countries with high production volumes helps target regional content strategies, tailoring marketing and production to maximize content consumption.

  * **Negative Growth:** If a country shows a disproportionate focus on one type (e.g., mostly TV Shows, but limited Movies), it may indicate an imbalance in content strategy, limiting audience diversity and engagement.

#### Chart - 12 - Genre-wise Movie vs TV Show Count

In [None]:
# Chart - 12 visualization code
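# Split the comma-separated genre string into a list so explode() yields one row per genre
genre_df = df1.assign(listed_in=df1['listed_in'].str.split(',')).explode('listed_in')
genre_df['listed_in'] = genre_df['listed_in'].str.strip()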
sns.countplot(data=genre_df[genre_df['listed_in'].isin(genre_counts.index)],
              x='listed_in', hue='type')
mplt.title('Genre-wise Count of Movies vs TV Shows')
mplt.xticks(rotation=45)

##### 1. Why did you pick the specific chart?

The Genre-wise Movie vs TV Show Count chart was chosen because it visually compares the distribution of Movies and TV Shows across different genres. A grouped bar chart (a count plot with type as the hue) is suitable for this as it shows, for each genre, how titles are split between Movies and TV Shows.

##### 2. What is/are the insight(s) found from the chart?

  * Genre Popularity: Some genres, like Drama or Comedy, might have a higher number of Movies or TV Shows, while others could be dominated by one content type (e.g., Action may have more Movies).

  * Content Strategy: Certain genres might be underserved in one category (Movies or TV Shows), offering a potential opportunity for content expansion.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

  * **Positive Impact:** If a genre is underrepresented in one content type (e.g., more Drama TV Shows but fewer Movies), creating more content in that area could attract more subscribers.

  * **Negative Growth:** If a genre has an overwhelming focus on one content type while the other is neglected (e.g., too many Movies and not enough TV Shows), it may limit content diversity and reduce audience engagement for subscribers who prefer the underrepresented type.

#### Chart - 13 - Trend of Movie Releases by Year

In [None]:
# Chart - 13 visualization code
df1[df1['type'] == 'Movie']['release_year'].value_counts().sort_index().plot()
mplt.title('Movie Releases Over Years')

##### 1. Why did you pick the specific chart?

I picked the Trend of Movie Releases by Year line chart because it clearly shows the temporal trend in movie releases, helping to identify growth patterns and fluctuations in Netflix's content over time.

##### 2. What is/are the insight(s) found from the chart?

  * There is a steady increase in movie releases over the years, particularly in the last few years.

  * Fluctuations might be seen during certain periods, possibly due to external factors like global events or business strategies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help Netflix understand content production trends, enabling better planning for new releases and marketing strategies. Increased releases indicate strong growth in the content library.

**negative:**

If the chart shows declining releases in certain years, it might indicate lack of investment in content, potentially affecting user engagement and subscription growth. This would negatively impact business if content stagnation occurs.

#### Chart - 14 - Correlation Heatmap (Numerical Features)

In [None]:
# Correlation Heatmap visualization code
sns.heatmap(df1[['duration_num', 'year_added', 'month_added']].corr(), annot=True, cmap='coolwarm')
mplt.title('Correlation Between Numerical Variables')

##### 1. Why did you pick the specific chart?

A Correlation Heatmap was chosen because it visually represents the relationships between numerical features like duration_num, year_added, and month_added. It helps quickly identify strong or weak correlations between variables.

##### 2. What is/are the insight(s) found from the chart?

Strong positive/negative correlations show how certain variables are related. For example, a correlation between duration_num and year_added might indicate longer movies or series are being released in specific years. A low correlation might suggest no meaningful relationship.
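Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. The sketch below works through that formula on toy numbers; the helper `pearson` and the sample lists are assumptions for illustration, and both inputs are assumed to have non-zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy data (hypothetical): one perfectly increasing and one perfectly decreasing relationship
r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(r_positive, r_negative)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship.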

#### Chart - 15 - Pair Plot

In [None]:
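# Pair Plot visualization code
# Build the frame used by the pair plot: numerical features plus 'type' for the hue
pairplot_df = df1[['duration_num', 'year_added', 'month_added', 'type']].dropna()
# Sample at most 1000 rows to keep plotting fast
sample_df = pairplot_df.sample(min(1000, len(pairplot_df)), random_state=42)
sns.pairplot(sample_df, hue='type', palette='Set2')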

##### 1. Why did you pick the specific chart?

The Pair Plot was chosen because it allows for visualizing the relationships between multiple numerical variables in a dataset, highlighting how variables like duration_num, year_added, and month_added relate to each other and to the content type (Movie or TV Show).

##### 2. What is/are the insight(s) found from the chart?

The **Pair Plot** reveals patterns such as:

**Duration:** Movies tend to have a wider range of durations, while TV shows are clustered in smaller ranges (usually fewer seasons).

**Year Added:** The trend over the years shows how Netflix has increasingly added content.

**Content Type:** The chart shows that movies and TV shows differ in terms of *duration and seasonality*, with TV shows having more consistent season counts.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1.   Movies tend to have a longer duration compared to TV Shows on Netflix.
2.   The number of TV Shows added to Netflix has increased more significantly than Movies over the years
3.   The top-producing countries are mostly concentrated in a few regions (e.g., US, India, UK).





### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H₀):** There is no significant difference in the duration between Movies and TV Shows on Netflix.
(µ_movie = µ_tv_show)

**Alternate Hypothesis (H₁):** Movies have a significantly longer duration compared to TV Shows on Netflix.
(µ_movie > µ_tv_show)


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats
movies_duration = df1[df1['type'] == 'Movie']['duration_num'].dropna()
tv_shows_duration = df1[df1['type'] == 'TV Show']['duration_num'].dropna()
t_stat, p_value = stats.ttest_ind(movies_duration, tv_shows_duration, alternative='greater')
print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

We used a one-sided independent two-sample t-test (stats.ttest_ind with alternative='greater').

If P-Value < 0.05: We reject the null hypothesis and conclude that movies tend to have a significantly longer duration than TV shows.

If P-Value ≥ 0.05: We fail to reject the null hypothesis, meaning the data do not show that movies have significantly longer durations than TV shows.

##### Why did you choose the specific statistical test?

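We are comparing the means of two independent groups: Movies and TV Shows.

The sample sizes for both groups are likely large enough for the t-test to be applicable (assuming we have a sufficient number of movies and TV shows in the dataset).

We assume that the duration data is normally distributed for both movies and TV shows.

Note that duration_num is measured in minutes for Movies and in seasons for TV Shows, so the test compares raw numeric values in different units and its result should be interpreted with caution.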

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant difference in the yearly growth rate of TV Shows and Movies added to Netflix.

Alternative Hypothesis (H₁): TV Shows have a significantly higher yearly growth rate than Movies added to Netflix.

#### 2. Perform an appropriate statistical test.

In [None]:
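# Perform Statistical Test to obtain P-Value
# Count titles added per year and type
yearly_data = df1.groupby(['year_added', 'type']).size().unstack().fillna(0)
# Year-over-year change in the number of titles added
movies_growth = yearly_data['Movie'].diff().dropna()
tvshows_growth = yearly_data['TV Show'].diff().dropna()
# One-sided Welch's t-test, matching H1 (TV Shows grow more than Movies)
t_stat, p_value = stats.ttest_ind(tvshows_growth, movies_growth, equal_var=False, alternative='greater')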

print(f"T-Statistic: {t_stat}")
print(f"P-Value: {p_value}")

##### Which statistical test have you done to obtain P-Value?

We used the Independent Samples T-Test (Welch’s T-test).

##### Why did you choose the specific statistical test?

The t-test compares the means of two independent samples, in this case the yearly growth (year-over-year change in the number of titles added) of TV Shows vs Movies.

We used Welch’s T-test (by setting equal_var=False) because the variances in both groups may not be equal.
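
To make the meaning of "yearly growth" concrete, here is a small sketch on made-up counts. `DataFrame.diff()` gives the absolute year-over-year change, whereas `DataFrame.pct_change()` gives a relative growth rate. The numbers are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical yearly counts of titles added (not taken from the Netflix data)
yearly_data = pd.DataFrame(
    {"Movie": [10, 20, 40], "TV Show": [5, 10, 30]},
    index=pd.Index([2017, 2018, 2019], name="year_added"),
)

# Absolute year-over-year change; the first year has no predecessor and is dropped
abs_change = yearly_data.diff().dropna()

# Relative year-over-year growth rate (1.0 means the count doubled)
growth_rate = yearly_data.pct_change().dropna()

print(abs_change)
print(growth_rate)
```

In this toy table the two measures can rank the groups differently, so which one is passed to the t-test depends on whether the hypothesis is about absolute or relative growth.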

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.


Null Hypothesis (H₀): Content production is evenly distributed across all countries — no significant difference among countries.

Alternative Hypothesis (H₁): Content production is not evenly distributed — some countries produce significantly more content than others.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chisquare
top_countries = df1['country'].value_counts().head(5)
observed = top_countries.values

# Assume uniform expected distribution
expected = [sum(observed)/len(observed)] * len(observed)

# Chi-Square Goodness-of-Fit Test
chi_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

print("Chi-Square Statistic:", chi_stat)
print("P-Value:", p_value)

##### Which statistical test have you done to obtain P-Value?

Test Used: Chi-Square Goodness-of-Fit Test

##### Why did you choose the specific statistical test?

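The Chi-Square test is appropriate when comparing categorical data distributions (e.g., country-wise content counts) against an expected distribution.

We are testing whether the observed frequencies differ significantly from a uniform expectation. The test is applied to the five countries with the most titles, so the conclusion covers these countries rather than all countries in the dataset.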
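
The following sketch shows what the goodness-of-fit statistic measures by computing it directly from its definition and comparing it with `scipy.stats.chisquare`. The three counts are hypothetical and only serve to keep the arithmetic easy to follow.

```python
from scipy.stats import chisquare

# Hypothetical title counts for three countries (illustrative only)
observed = [50, 30, 20]

# Uniform expectation: every category gets the same share of the total
expected = [sum(observed) / len(observed)] * len(observed)

# Chi-square statistic from its definition: sum of (observed - expected)^2 / expected
chi_manual = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Same test via SciPy; degrees of freedom = number of categories - 1
chi_stat, p_value = chisquare(f_obs=observed, f_exp=expected)

print(f"manual statistic: {chi_manual:.4f}")
print(f"scipy statistic:  {chi_stat:.4f}, p-value: {p_value:.6f}")
```

The further the observed counts are from the uniform expectation, the larger the statistic and the smaller the p-value.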

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
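# Handling Missing Values & Missing Value Imputation
# Assign the result back instead of using inplace=True on a column selection,
# which may not update df1 in recent pandas versions
df1['country'] = df1['country'].fillna('Unknown')
# Forward-fill missing date_added values from the preceding row
df1['date_added'] = df1['date_added'].ffill()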

#### What all missing value imputation techniques have you used and why did you use those techniques?

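The country column is categorical; replacing missing entries with the placeholder 'Unknown' (or, alternatively, with the most common value) keeps every record in the dataset. The placeholder has the advantage that it does not attribute a title to a country it may not belong to.

For date_added, forward fill propagates the last known value to the rows that follow. This is only meaningful when the rows are in a sensible order, for example sorted by release_year.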
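
A minimal sketch of the three imputation strategies on a made-up table (placeholder, mode, and forward fill after sorting). The DataFrame `toy` and its values are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical rows with gaps in three columns (not the Netflix data)
toy = pd.DataFrame({
    "release_year": [2019, 2017, 2018, 2020],
    "country": ["India", None, "United States", None],
    "rating": ["TV-MA", "TV-MA", None, "PG"],
    "date_added": ["2019-05-01", "2017-03-10", None, "2020-01-15"],
})

# Placeholder imputation: keep the row, mark the value as unknown
toy["country"] = toy["country"].fillna("Unknown")

# Mode imputation: fill with the most frequent rating
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])

# Forward fill only makes sense on ordered rows, so sort first
toy = toy.sort_values("release_year")
toy["date_added"] = toy["date_added"].ffill()

print(toy)
```

Note that a forward fill leaves the first row empty if its value is missing, since there is nothing to propagate from.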

### 2. Handling Outliers

In [None]:
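# Handling Outliers & Outlier treatments
from scipy.stats import zscore
# nan_policy='omit' ignores missing durations when computing the mean and std;
# without it, a single NaN would turn every z-score into NaN
df1['z_score'] = zscore(df1['duration_num'], nan_policy='omit')
# Keep rows within 3 standard deviations (rows with a NaN z-score are dropped too)
df1 = df1[df1['z_score'].abs() < 3]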

##### What all outlier treatment techniques have you used and why did you use those techniques?

Z-score helps identify data points that lie more than 3 standard deviations from the mean, which are commonly treated as outliers when the data is approximately normally distributed.
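
The sketch below illustrates the 3-standard-deviation rule on made-up numbers and shows how missing values behave in the filter. It computes the z-score with plain pandas (population standard deviation, `ddof=0`, which is also the default of `scipy.stats.zscore`); the series `durations` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical durations: twenty typical values, one extreme value, one missing value
durations = pd.Series([100.0] * 20 + [500.0, np.nan], name="duration_num")

# Z-score with population standard deviation (ddof=0); pandas skips NaN by default
z = (durations - durations.mean()) / durations.std(ddof=0)

# Keep rows within 3 standard deviations; NaN compares as False and is dropped
within_3_sd = durations[z.abs() < 3]

# Variant that also keeps rows whose duration is missing
within_3_sd_or_missing = durations[(z.abs() < 3) | z.isna()]

print(f"z-score of the extreme value: {z.iloc[20]:.2f}")
print(len(durations), len(within_3_sd), len(within_3_sd_or_missing))
```

Whether rows with a missing duration should be dropped together with the outliers is a separate decision, which is why the sketch shows both variants.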

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder
df1['type'] = LabelEncoder().fit_transform(df1['type'])
df1 = pd.get_dummies(df1, columns=['rating'], drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?

Label Encoding is simple and effective for binary values.

One-Hot Encoding avoids introducing ordinal relationships in nominal categories.
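
A small sketch of both encodings on a made-up table. For the binary column it uses `pd.factorize(..., sort=True)` as a stand-in for `LabelEncoder` (both assign integer codes in sorted order of the labels); the DataFrame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical sample with one binary and one multi-class column
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "rating": ["PG", "TV-MA", "G", "PG"],
})

# Integer codes for the binary column; sort=True assigns codes in alphabetical order
toy["type"], type_classes = pd.factorize(toy["type"], sort=True)

# One-hot encode rating; drop_first=True drops the first category ("G") as the baseline
encoded = pd.get_dummies(toy, columns=["rating"], drop_first=True)

print(list(type_classes))
print(encoded)
```

With `drop_first=True`, a row whose remaining rating columns are all zero belongs to the dropped baseline category.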



### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing
df1['description'] = df1['description'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
df1['description'] = df1['description'].str.translate(str.maketrans('', '', string.punctuation))

#### 4. Removing URLs & Removing words that contain digits

In [None]:
# Remove URLs & Remove words that contain digits
df1['description'] = df1['description'].str.replace(r"http\S+|www.\S+", "", regex=True)
df1['description'] = df1['description'].str.replace(r'\w*\d\w*', '', regex=True)
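
For reference, the cleaning steps can also be bundled into one function and tried on a single string. This is a sketch under assumptions: `clean_text` is a hypothetical helper, it removes URLs before stripping punctuation (so the URL pattern still sees `://` and dots), and it additionally collapses leftover whitespace.

```python
import re
import string


def clean_text(text: str) -> str:
    """Apply the cleaning steps to a single description string."""
    # Remove URLs first, while their punctuation is still intact
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Lower-case
    text = text.lower()
    # Strip ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove tokens that contain at least one digit
    text = re.sub(r"\w*\d\w*", "", text)
    # Collapse repeated whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Visit https://example.com NOW, it's 2nd to none!"))
# prints: visit now its to none
```

On a DataFrame, such a helper could be applied with `df1['description'].apply(clean_text)`, assuming the column contains no missing values.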

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

##### What all feature selection methods have you used and why?

Answer Here.

##### Which features did you find important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

##### What data splitting ratio have you used and why?

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

# **Conclusion**

Through data cleaning, preprocessing, and visualization, we found that Netflix's content distribution is concentrated in a few key countries, with movies generally having longer durations than TV shows. The dataset also reveals a significant increase in content added over time. These insights provide a foundation for building clustering models to better understand content patterns and improve recommendations on the platform.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***