<a href="https://colab.research.google.com/github/vimal-139/Netflix-Movies-and-TV-show-clustering/blob/main/Netflix_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Netflix Movies and TV show clustering



Project Type - Unsupervised ML

Contribution - Individual

Team Member 1 - Vimal Kumar Hoon




# **Project Summary -**

The Netflix Movies and TV Show Clustering project is an attempt to group Netflix's content based on its similarity using machine learning techniques. The aim of this project is to provide better recommendations to Netflix users by clustering movies and TV shows into similar groups based on their plot, genre, cast, and other features.

The project uses an unsupervised learning algorithm, K-Means, to cluster the content into similar groups. The data used for this project includes information such as movie and TV show titles, genres, plot summaries, cast, directors, and production companies.

The clustering algorithm is applied to this data to group movies and TV shows into clusters based on their similarity. Once the clusters are formed, the project uses visualizations such as heatmaps and dendrograms to help users understand how the content is grouped and make better recommendations based on these clusters.
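As a rough illustration of the clustering step described above, the sketch below vectorizes a handful of made-up descriptions with `CountVectorizer`, fits `KMeans`, and computes the silhouette score. The toy texts, `n_clusters=2`, and `random_state=42` are assumptions chosen for the example, not values taken from this project.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# Toy descriptions (hypothetical, not rows from the Netflix dataset)
toy_descriptions = [
    "detective solves murder case in the city",
    "police detective hunts a murder suspect",
    "murder case shocks city police",
    "cartoon puppies rescue friends",
    "cartoon animals rescue a lost puppy",
    "friendly puppies help animal friends",
]

# Turn each description into a bag-of-words count vector
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(toy_descriptions)

# Group the vectors into two clusters (the number of clusters is an assumption)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means better separated clusters
score = silhouette_score(X, labels)
print(labels, round(score, 3))
```

On real data the number of clusters would normally be chosen by comparing silhouette scores (or an elbow plot) over several candidate values.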

Overall, this project aims to improve the Netflix recommendation system and enhance the user experience by providing better suggestions based on a user's watching history and preferences.

Finally, a content-based recommender system was built from the similarity matrix obtained through cosine similarity. Given a show the user has watched, it returns the 10 most similar titles. In summary, the study identified key trends in the Netflix dataset, including the growth of movies versus TV shows, the busiest period for adding new content, and the target age groups of the content. Through clustering and a content-based recommender system, it provides recommendations based on the user's viewing history. The analysis offers a foundation for further research into the factors influencing the popularity of movies and TV shows on Netflix.
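The recommender idea above can be sketched on a toy catalogue as follows. The titles, the `tags` column, and the `recommend` helper are hypothetical names introduced for illustration only; the sketch only assumes that each title has some text that can be vectorized and compared with cosine similarity.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue (hypothetical titles and tags)
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C", "Show D"],
    "tags": [
        "crime drama detective",
        "crime thriller detective",
        "kids animation comedy",
        "kids animation adventure",
    ],
})

# Bag-of-words vectors and the pairwise cosine similarity matrix
vectors = CountVectorizer().fit_transform(toy["tags"])
similarity = cosine_similarity(vectors)


def recommend(title, top_n=10):
    """Return up to top_n titles most similar to `title` (hypothetical helper)."""
    idx = toy.index[toy["title"] == title][0]
    # Similarity of the chosen title to every other title, excluding itself
    scores = pd.Series(similarity[idx], index=toy["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend("Show A", top_n=1))
```

Because the scores depend only on the text features, this kind of recommender needs no user ratings, which suits a dataset that contains catalogue metadata only.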

# **GitHub Link -**

# **Problem Statement**


The dataset used for this project contains information on TV shows and movies that were available on Netflix in 2019. The data was collected from Almabetter School, which is a Data Science oriented school.

In 2018, a report was released that highlighted how the number of TV shows on Netflix has increased significantly since 2010, while the number of movies has decreased by over 2,000 titles. This dataset provides an opportunity to further explore these trends and uncover additional insights.

The integration of external datasets such as IMDB ratings and Rotten Tomatoes could also provide interesting findings.

The project objectives include conducting exploratory data analysis, understanding the type of content available in different countries, investigating whether Netflix is increasingly focused on TV shows over movies in recent years, and clustering similar content using text-based features.

In this project, you are required to do the following:

1. Exploratory Data Analysis.

2. Understanding what type of content is available in different countries.

3. Determining whether Netflix has increasingly focused on TV shows rather than movies in recent years.

4. Clustering similar content by matching text-based features.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule. 

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is answered for each and every algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Importing the libraries
import numpy as np
import pandas as pd
import math  # `numpy.math` was only an alias of the standard library module and is removed in NumPy 2.0
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import missingno as msno
%matplotlib inline

# Word Cloud library
from wordcloud import WordCloud, STOPWORDS

# Libraries used for textual data preprocessing
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from scipy.stats import ttest_ind
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

# Libraries used for cluster implementation
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as shc

# Library used for building the recommendation system
from sklearn.metrics.pairwise import cosine_similarity

# Warnings library, used to suppress warnings raised during execution.
import warnings
warnings.filterwarnings('ignore')


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Loading CSV File
nd = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv')

### Dataset First View

In [None]:
# Dataset First Look
nd.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
nd.shape

### Dataset Information

In [None]:
# Dataset Info
nd.info()

#### Duplicate Values

In [None]:
# Duplicate Value Count in Dataset
# count the number of duplicate rows in the Netflix DataFrame(nd)
duplicate_rows = nd[nd.duplicated()]
duplicate_count = len(duplicate_rows)

# print the result
print("Number of duplicate rows: ", duplicate_count)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
nd.isnull().sum()

In [None]:
#total null values
nd.isnull().sum().sum()

In [None]:
# Visualizing the missing values
# Nullity correlation heatmap: shows how strongly missing values in one column coincide with another
msno.heatmap(nd, figsize=(8, 6))
plt.show()

# Bar chart of the number of non-null values in each column
msno.bar(nd, color='red', figsize=(8, 6))
plt.show()

### What did you know about your dataset?

The Netflix dataset contains 7787 rows and 12 columns. Five columns contain missing values (director, cast, country, date_added, and rating), for a total of 3631 missing values.
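To see how the missing values are spread across columns, a per-column summary can help. The sketch below is a minimal example on a toy frame standing in for `nd`; the `missing_summary` helper and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, largest first (hypothetical helper)."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `nd` (made-up values)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [np.nan, "X", np.nan, np.nan],
    "country": ["India", np.nan, "Japan", "India"],
})
print(missing_summary(toy))
```

Calling `missing_summary(nd)` on the real data would show which columns dominate the total and therefore which ones need imputation rather than row removal.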

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
nd.columns

In [None]:
# Dataset Describe
nd.describe()

### Variables Description 

**show_id** : Unique ID for every Movie / Tv Show

**type** : Identifier - A Movie or TV Show

**title** : Title of the Movie / Tv Show

**director** : Director of the Movie

**cast** : Actors involved in the movie / show

**country** : Country where the movie / show was produced

**date_added** : Date it was added on Netflix

**release_year** : Actual release year of the movie / show

**rating** : TV Rating of the movie / show

**duration** : Total Duration - in minutes or number of seasons

**listed_in** : Genre

**description**: The Summary description

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Iterate over each column in DataFrame.
for col in nd.columns:
    unique_values = nd[col].unique()
    print(col, unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Handling null values (assigning the result back avoids chained in-place modification)
nd['cast'] = nd['cast'].fillna('No cast')
nd['country'] = nd['country'].fillna(nd['country'].mode()[0])

In [None]:
# 'date_added' and 'rating' have null values in only a small portion of the rows, so we drop those rows
nd.dropna(subset=['date_added','rating'],inplace=True)

In [None]:
# checking again whether any null values remain
nd.isnull().sum()

### What all manipulations have you done and insights you found?

The dataset underwent several manipulations: null values in the 'cast' column were filled with 'No cast', null values in the 'country' column were filled with the mode of the column, and rows with null values in the 'date_added' and 'rating' columns were dropped. The 'director' column was intended to be dropped as well, although the wrangling cells above do not include that step.

These manipulations point to a few insights. The value used to fill the 'country' column is the most common country for Netflix content in the dataset. The 'cast' column is treated as an important feature, since its null values were filled rather than discarded in order to keep those rows.

Only the rows with null 'date_added' or 'rating' values were dropped, because they make up a small portion of the data; the columns themselves are kept and are used in the charts that follow. The 'director' column, by contrast, is not considered a useful feature for the analysis at hand.
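The null handling described above can also be wrapped in a function that returns a new frame instead of modifying the original in place. This is only a sketch: `clean_netflix` is a hypothetical helper, the toy rows are made up, and resetting the index is an extra step not present in the wrangling cells.

```python
import numpy as np
import pandas as pd


def clean_netflix(df):
    """Fill 'cast' and 'country', drop rows missing 'date_added' or 'rating' (hypothetical helper)."""
    out = df.copy()
    out["cast"] = out["cast"].fillna("No cast")
    # Fill missing countries with the most frequent country
    out["country"] = out["country"].fillna(out["country"].mode()[0])
    out = out.dropna(subset=["date_added", "rating"])
    return out.reset_index(drop=True)


# Toy frame with the same column names as the dataset (made-up values)
toy = pd.DataFrame({
    "cast": [np.nan, "Actor 1", "Actor 2"],
    "country": ["India", np.nan, "India"],
    "date_added": ["January 1, 2020", "May 5, 2019", np.nan],
    "rating": ["TV-MA", "PG", "R"],
})
cleaned = clean_netflix(toy)
print(cleaned)
```

Returning a copy keeps the raw data available for comparison, which makes it easier to check how many rows each step removed.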

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - **1.Type**

In [None]:
# Chart - 1 visualization code
sns.set_style('darkgrid')
plt.figure(figsize=(3, 6))
sns.countplot(x='type', data=nd, palette='magma')
#labeling of values
plt.title('\nNumber of Movies and TV Shows\n', fontsize=17)
plt.xlabel('\nType', fontsize=12)
plt.ylabel('Count\n', fontsize=12)
#Visualization of number of movies and tv shows
plt.show()

##### 1. Why did you pick the specific chart?

A Countplot Chart is a type of Bar Chart that can be an effective option for visualizing categorical data. This chart is particularly useful for displaying the frequency of each category in a clear and easily interpretable way. Thus, a Countplot Chart can be an excellent choice for visualizing data on the number of movies and TV shows on Netflix, which is a categorical variable.

##### 2. What is/are the insight(s) found from the chart?

The number of movies on Netflix is greater than the number of TV shows, with 5372 movies and 2398 TV shows currently available on the platform.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

While the insight that there are more movies than TV shows on Netflix may not have a significant positive or negative business impact on its own, it can be used in conjunction with other insights and data to inform business decisions.

For instance, if Netflix observes that TV shows are more popular among its subscribers than movies, it may decide to focus more on acquiring TV show content. Alternatively, if it sees that its original movie productions are gaining popularity, it may choose to invest more in that area.

However, ignoring the preferences of its subscribers and continuing to acquire movies over TV shows could lead to negative growth. This could potentially cause Netflix to lose subscribers who are looking for more TV show content. Additionally, if competitors begin to offer more TV shows, Netflix may lose market share if it does not respond by acquiring more TV show content. Thus, while the specific insight that there are more movies than TV shows on Netflix may not have a significant impact on its own, it can be part of a broader set of considerations that impact business decisions.

#### Chart - **2.Rating**

In [None]:
# Chart - 2 Visualization Code
nd['rating']

In [None]:
ratings = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

nd['target_ages'] = nd['rating'].map(ratings)

In [None]:
# convert 'type' column to categorical data type
nd['type'] = pd.Categorical(nd['type'])

# create a new categorical column 'target_ages' with specified categories
nd['target_ages'] = pd.Categorical(nd['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])
nd


In [None]:
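# splitting the data into two DataFrames: TV shows and movies (copies, so new columns can be added safely)
tv_shows = nd[nd['type'] == 'TV Show'].copy()
movies = nd[nd['type'] == 'Movie'].copy()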

# group TV shows by 'rating' and count the number of shows in each rating category
tv_ratings = tv_shows.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)

fig_dims = (12,6)

fig, ax = plt.subplots(figsize=fig_dims)

# create a point plot using Seaborn's pointplot() function, with 'rating' on the x-axis and 'count' on the y-axis
sns.pointplot(x='rating',y='count',data=tv_ratings, palette='magma')

# set the plot title and font size
plt.title('TV Show Ratings\n', size=20)

plt.show()


In [None]:
# create a color palette for the different target age groups
colors = ["#FFC300", "#FF5733", "#C70039", "#900C3F"]

# plot a countplot to show the movie ratings based on target age groups
plt.figure(figsize=(14,6))
plt.title('\nMovie Ratings by Target Age Group\n', size = 20)

sns.countplot(x=movies['rating'], hue=movies['target_ages'], data=movies, 
              order=movies['rating'].value_counts().index, palette=colors)

# add a legend to the plot
plt.legend(title='Target Age Group', loc='upper right', labels=['Kids', 'Older Kids', 'Teens', 'Adults'])

plt.show()


##### 1. Why did you pick the specific chart?

I chose these charts because they show the distribution of ratings in a clear and concise manner. The point plot for TV shows is ordered by count from highest to lowest, which emphasizes the dominance of TV-MA, and the count plot for movies allows easy comparison between ratings and target age groups. Together they provide a quick overview of the ratings landscape on Netflix.

##### 2. What is/are the insight(s) found from the chart?

TV-MA is the most common rating for both movies and TV shows in the dataset, which suggests that a significant portion of the content available on Netflix is intended for adult audiences. These findings indicate that Netflix's catalogue is geared towards a primarily adult demographic, with a focus on mature and potentially controversial themes.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from analyzing the Netflix dataset can have a positive impact on the streaming giant's business strategy. For instance, understanding that the TV-MA rating is the most common for both movies and TV shows can inform Netflix's decision to produce and acquire content that appeals to adult audiences. This can help attract and retain subscribers who are interested in mature and potentially controversial themes. Furthermore, comprehending the target age groups for different ratings can help Netflix customize its marketing and promotional efforts to specific audiences.

However, there is a potential negative impact as well. Some subscribers may be put off by the prevalence of mature content, especially if they are searching for family-friendly programming. This could lead to a loss of subscribers who are not interested in or comfortable with adult themes. Therefore, it is crucial for Netflix to balance its content offerings to appeal to a wide range of viewers and avoid alienating any particular demographic. By doing so, Netflix can increase its subscriber base and maintain its position as a leading player in the streaming industry.
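When ratings are mapped to target age groups with a dictionary, any rating code missing from the dictionary silently becomes `NaN`. The sketch below shows one way to check for that and to compute the share of each age group. It uses a subset of the mapping and a toy frame; the code `"XYZ"` is made up to trigger the unmapped case.

```python
import pandas as pd

# Subset of the rating-to-audience mapping used in this section
ratings = {"TV-MA": "Adults", "TV-14": "Teens", "PG": "Older Kids", "TV-Y": "Kids"}

# Toy frame; "XYZ" is a made-up rating code that is not in the mapping
toy = pd.DataFrame({"rating": ["TV-MA", "PG", "TV-Y", "TV-14", "XYZ"]})
toy["target_ages"] = toy["rating"].map(ratings)

# Ratings that are missing from the dictionary map to NaN
unmapped = toy.loc[toy["target_ages"].isna(), "rating"].unique().tolist()
print(unmapped)

# Share of titles per age group (NaN rows are excluded by value_counts)
age_share = toy["target_ages"].value_counts(normalize=True)
print(age_share)
```

Running the same check on `nd` after the mapping step would confirm that every rating in the dataset has been assigned an age group.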

#### Chart - **3.Release Year**

In [None]:
#Creating a line chart to visualize the number of movies and TV shows released each year
#Extracting the count of movies and TV shows for each year
movies_year = movies['release_year'].value_counts().sort_index(ascending=False)
tvshows_year = tv_shows['release_year'].value_counts().sort_index(ascending=False)

#Creating a line plot using Seaborn
sns.set(style='whitegrid', font_scale=1.2)
fig, ax = plt.subplots(figsize=(12, 7))

ax = sns.lineplot(x=movies_year.index, y=movies_year.values, color='red', label='Movies', linewidth=2.5, marker='o')
ax = sns.lineplot(x=tvshows_year.index, y=tvshows_year.values, color='green', label='TV Shows', linewidth=2.5, marker='o')

#Customizing the plot
plt.xticks(rotation=90)
ax.set_xlabel('\nRelease Year', fontsize=14)
ax.set_ylabel('Number of Titles\n', fontsize=14)
ax.set_title('\nProduction Growth Yearly\n', fontsize=20, pad=15)
plt.legend(fontsize=14)

plt.show()

In [None]:
# Select the 20 release years from 2001 to 2020 (range() excludes its end value)
last_20_years = list(range(2001, 2021))

# Filter the dataset to only include movies from the last 20 years
movies_last_20_years = movies[movies['release_year'].isin(last_20_years)]

# Create a count plot of the number of movies released per year
plt.figure(figsize=(12,6))
sns.countplot(x='release_year', data=movies_last_20_years, palette='turbo', order=last_20_years)
plt.xticks(rotation=45, ha='right')
plt.xlabel('\nYear of Release')
plt.ylabel('Number of Movies Released\n')
plt.title('\nNumber of Movies Released per Year in the Last 20 Years\n', fontsize=20)
plt.show()

In [None]:
tvshows_year

In [None]:
# filter for movies released from 2008 onward
movies_last_15_years = movies[movies['release_year'] >= 2008]

# create a countplot with horizontal bars, showing the 15 years with the most releases
plt.figure(figsize=(10,6))
sns.countplot(y='release_year', data=movies_last_15_years, order=movies_last_15_years['release_year'].value_counts().index[:15])
plt.title('Number of Movies Released per Year (2008 onward)', fontsize=16)
plt.xlabel('Number of Movies')
plt.ylabel('Release Year')
plt.show()

In [None]:
nd

In [None]:
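# adding a column for the month in which each title was added to Netflix
# (stripping whitespace first so every date string parses the same way)
nd['month'] = pd.to_datetime(nd['date_added'].str.strip()).dt.month
nd.head()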

##### 1. Why did you pick the specific chart?

A line chart was used to compare the number of movies and TV shows by release year, because it makes trends over time easy to follow and allows a direct comparison between the two categories. Count plots were then used to look more closely at the most recent years.

The line chart shows that the number of movies grows at a faster rate than the number of TV shows. It also highlights the overall increase in both categories after 2015, followed by a decline after 2020. By showing the changes in the quantity of content over time, it provides useful insights into the growth and evolution of Netflix's content strategy.

##### 2. What is/are the insight(s) found from the chart?

Based on the dataset, 2017 and 2018 are the release years with the most movies, while 2020 is the release year with the most TV shows. The number of movies grows faster than the number of TV shows, which suggests that Netflix has put more emphasis on acquiring and producing movies than TV shows.

Since 2015, there has been a significant increase in the number of movie and TV show titles on Netflix, suggesting that the company has been steadily expanding its content library. There is a notable drop in titles released after 2020, which could be attributed to the impact of the COVID-19 pandemic on the entertainment industry, or simply to the date on which the dataset was collected.

In summary, the data suggests that Netflix has been focusing more on increasing its movie content than TV shows, as seen by the higher growth rate of movies compared to TV shows.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Incorporating these insights could potentially benefit Netflix by increasing the appeal of its content and retaining its audience. However, the significant drop in titles released after 2020 could signal challenges in production or investment in content creation that may lead to negative growth if not addressed. Therefore, it is crucial for Netflix to stay up to date with market trends and adjust its strategies accordingly to ensure continued success. In conclusion, while these insights can provide valuable information, they must be analyzed and interpreted in the context of the current streaming landscape to make informed business decisions.
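One way to check the movies-versus-TV question directly is to look at the share of each type among the titles added in each year, rather than at raw counts. The sketch below does this with `pd.crosstab` on a toy frame; the rows are made up, and the date format `"%B %d, %Y"` is an assumption about how `date_added` is written.

```python
import pandas as pd

# Toy frame with the same column names as the dataset (made-up rows)
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "date_added": ["January 5, 2018", "March 9, 2018", "July 1, 2018",
                   "May 2, 2019", "June 3, 2019", " August 14, 2019"],
})

# Parse the date strings (stripping stray whitespace) and keep the year
toy["year_added"] = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y"
).dt.year

# Row-normalized crosstab: each row sums to 1 and gives the share per type
share = pd.crosstab(toy["year_added"], toy["type"], normalize="index")
print(share)
```

A rising `TV Show` share across years would support the claim that Netflix is shifting towards TV, independently of how fast the catalogue as a whole is growing.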

#### Chart - **4.Release_month**

In [None]:
# Chart - 4 visualization code
# visualization of the month in which titles were added to Netflix
plt.figure(figsize=(12, 10))
sns.countplot(x='month', data=nd, palette='plasma')
plt.title('Countplot of Month\n')
plt.xlabel('Month')
plt.ylabel('Count')
plt.show()

In [None]:
#Countplot of Month by Type
fig, ax = plt.subplots(figsize=(15, 6))

sns.countplot(x='month', hue='type', data=nd, palette='plasma', ax=ax, edgecolor='black', linewidth=2.5)
ax.set_title('Countplot of Month by Type', fontsize=20)
ax.set_xlabel('Month', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.legend(fontsize=12, title='Type', title_fontsize=12)
sns.despine()
plt.show()

##### 1. Why did you pick the specific chart?

The countplot with hue was chosen as the best chart to visualize and compare the number of movies and TV shows added to Netflix each month. It allows for easy identification of patterns and trends in the data, with the use of hue allowing for clear comparison between the contributions of movies and TV shows to the total count for each month.

The countplot with hue shows that from October to January, there was a peak in the number of movies and TV shows added to Netflix. This information can be valuable for Netflix and content creators, as it may suggest a time period when viewers are more likely to be interested in watching new content, and thus, a potentially more profitable time to release new content. Overall, the insights gained from the analysis of the countplot can help Netflix and content creators make more informed decisions about when to release new content to maximize audience engagement and revenue.

##### 2. What is/are the insight(s) found from the chart?

According to the countplot, it appears that Netflix adds the highest number of movies and TV shows during the period between October and January. This period seems to be the busiest time of year for Netflix in terms of adding new content to its platform.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insight that the most content is added to Netflix from October to January can potentially help create a positive business impact. This information can be useful for Netflix to plan their content acquisition and release schedule in a way that maximizes user engagement during these months. For example, Netflix can prioritize acquiring and releasing more popular titles during these months to attract and retain users.

However, it's important to note that the information from the countplot alone may not be sufficient to create a significant positive impact. Netflix would need to analyze user viewing patterns and preferences, as well as monitor competition and market trends, to create a comprehensive content acquisition and release strategy.

Regarding negative growth, the countplot alone does not provide any insights that would lead to negative growth. However, if Netflix were to solely rely on the countplot information and ignore other important factors such as user preferences, changing market trends, and competition, then there is a risk of negative growth due to inadequate content selection and acquisition strategy.
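The monthly pattern can also be tabulated instead of read off the chart. The following sketch counts titles per calendar month and keeps the months in calendar order; the toy dates and the `"%B %d, %Y"` format are assumptions made for the example.

```python
import calendar

import pandas as pd

# Toy 'date_added' values (made up)
toy_dates = pd.Series([
    "October 1, 2019", "December 15, 2019", "December 31, 2018", "March 3, 2020",
])
added = pd.to_datetime(toy_dates.str.strip(), format="%B %d, %Y")

# Month names in calendar order; calendar.month_name[0] is an empty string
month_order = list(calendar.month_name)[1:]

# Count titles per month, filling months without titles with zero
monthly_counts = (
    added.dt.month_name()
    .value_counts()
    .reindex(month_order, fill_value=0)
)
print(monthly_counts)
```

Applied to the real column, `monthly_counts.idxmax()` would name the single busiest month, which makes the October to January observation easier to verify.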

#### Chart - **5.Genre**

In [None]:
# Chart - 5 visualization code
#Top 10 genres of movies
top10_movies = movies['listed_in'].value_counts().index[0:10]
#Visualization of code
plt.figure(figsize=(14, 6))
sns.countplot(y='listed_in', data=movies, order=top10_movies, palette='plasma')
plt.title('\nTop 10 Genres of Movies\n', fontsize=18, fontweight='bold')
plt.xlabel('Count', fontsize=14)
plt.ylabel('Genre\n', fontsize=14)
sns.despine()
plt.tight_layout()
plt.show()

In [None]:
#Top 10 Genres of Tv shows
top10_tvshows = tv_shows['listed_in'].value_counts().index[0:10]
#Visualization
plt.figure(figsize=(14, 6))
sns.countplot(y='listed_in', data=tv_shows, order=top10_tvshows, palette='rainbow')
plt.title('\nTop 10 Genres of TV Shows\n', fontsize=16, fontweight='bold')
plt.xlabel('Count', fontsize=14)
plt.ylabel('Genre', fontsize=14)
sns.despine()
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

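Horizontal count plots were chosen to show the ten most frequent genres for movies and for TV shows. Long genre labels remain readable on the vertical axis, and the bar lengths make the counts easy to compare.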

##### 2. What is/are the insight(s) found from the chart?

In the chart, kids' TV is the most frequent genre among TV shows, which makes it one of the most prominent categories on the platform. It covers a diverse selection of animated and live-action shows suitable for children of all ages, from well-known classics to new series.

In addition to offering a fun and entertaining viewing experience, the kids TV category is designed with parental controls to provide a safe and secure environment for children to watch their favorite shows. Parents can set age-appropriate filters, monitor viewing history, and limit access to specific content, ensuring that their children are only watching shows that are suitable for their age and maturity level.

With so many options available, Netflix's kids TV category has become a go-to destination for families looking for high-quality, entertaining, and educational content. The shows available not only provide entertainment but also valuable lessons that can help children learn and grow.

Overall, Netflix's kids TV category remains a top genre on the platform, and for a good reason. Its diverse and extensive selection, coupled with parental controls, provides a great viewing experience for children while giving parents peace of mind. Whether you're looking for a way to entertain your little ones or bond with your family over a great show, the kids TV category is the perfect place to start.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The analysis reveals that kids TV is the most common genre for TV shows on Netflix, featuring a wide range of educational and entertaining content suitable for children of all ages. Examples of well-known shows in this genre include "Paw Patrol", "Peppa Pig", and "The Magic School Bus".

Using this information, Netflix can make data-driven decisions to positively impact their business. By prioritizing their most popular genre, Netflix can increase their investment in producing high-quality kids shows and promoting them to parents with young children to attract and retain their target audience.

However, there may be negative growth associated with this trend if Netflix becomes too focused on kids TV and ignores other genres, leading to a loss of older viewers looking for more mature content. Moreover, if the quality of their kids programming declines or they lose the rights to popular shows, this could also hurt their business. Therefore, it's essential for Netflix to strike a balance between catering to their core audience while still offering a diverse range of content to appeal to a broader audience.
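The count plots above treat each full `listed_in` value as one genre. Assuming `listed_in` holds several comma-separated genre labels per title, the individual labels can be counted by splitting and exploding the column, as in this sketch on made-up rows.

```python
import pandas as pd

# Toy 'listed_in' values (made up; comma-separated labels are an assumption)
toy = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "Documentaries",
    ],
})

# Split each entry into a list, put every label on its own row, then count
genre_counts = (
    toy["listed_in"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(genre_counts)
```

Counting single labels this way can change the ranking, because a genre that mostly appears in combination with others is undercounted when whole strings are compared.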

#### Chart - **6.Duration**

In [None]:
# Chart - 6 visualization code
# Create a figure and set its size
plt.figure(figsize=(10, 7))

# Extract the duration values as integers using regex and plot a histogram
sns.histplot(movies['duration'].str.extract(r'(\d+)', expand=False).astype(int), kde=False)

# Set the title of the plot
plt.title('Distribution of Movie Durations', fontweight='bold')

# Set the x-axis label
plt.xlabel('Duration (minutes)')

# Set the y-axis label
plt.ylabel('Count')

# Show the plot
plt.show()

In [None]:
# Set the figure size
plt.figure(figsize=(30, 6))

# Create a count plot of TV show durations
sns.countplot(x=tv_shows['duration'], data=tv_shows, order=tv_shows['duration'].value_counts().index, palette='rainbow')

# Set the title of the plot
plt.title("\nDistribution of TV Show Durations\n", fontweight='bold', fontsize=20)

# Set the x-axis label
plt.xlabel("Duration (seasons)")

# Set the y-axis label
plt.ylabel("Count")

# Rotate the x-axis labels
plt.xticks(rotation=90)

# Show the plot
plt.show()

In [None]:
# Extract the duration values as integers using regex
movies = movies.assign(minute=pd.to_numeric(movies['duration'].str.extract(r'(\d+)', expand=False)))

# Calculate the average movie duration by rating
duration_year = movies.groupby(['rating'])['minute'].mean()

# Create a DataFrame to store the results and sort by average duration
duration_nd = pd.DataFrame(duration_year).sort_values('minute')

# Set the figure size
plt.figure(figsize=(12, 6))

# Create a bar plot of the average movie duration by rating
ax = sns.barplot(x=duration_nd.index, y=duration_nd.minute)

# Set the title of the plot
plt.title("Average Movie Duration by Rating\n", fontweight='bold')

# Set the x-axis label
plt.xlabel("Rating")

# Set the y-axis label
plt.ylabel("Average Duration (minutes)")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

Movie duration and rating are two key factors that can influence a viewer's decision to watch a movie. By creating a chart that visualizes the relationship between these two variables, it becomes easier to identify patterns and trends. For example, the bar chart of average duration by rating shows that NC-17 movies tend to have longer runtimes than movies with other ratings, which could be a useful insight for filmmakers and movie studios.

Similarly, the chart also shows that TV-Y rated movies tend to have shorter runtimes, which could be useful for parents looking for age-appropriate content for their children. Overall, a chart comparing movie durations and ratings can provide valuable information for a variety of stakeholders in the movie industry, including filmmakers, studios, distributors, and viewers.

##### 2. What is/are the insight(s) found from the chart?

When analyzing the movie durations, it was observed that the majority of the movies have a duration between 50 and 150 minutes. The TV shows, on the other hand, include a large number of single-season shows, which may indicate that many of the TV shows on Netflix are relatively new.

Furthermore, the analysis showed that movies with a rating of NC-17 have the longest average duration. This might be because the movies with such a rating can explore more mature themes and include more explicit content, which requires a longer runtime to tell a compelling story.

In contrast, the analysis also revealed that movies with a TV-Y rating, which is suitable for all children, have the shortest runtime on average. This suggests that the movies with this rating tend to be shorter and may have simpler plots and themes that are suitable for younger audiences.
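An average can be misleading when a rating category contains only a handful of titles, so it can help to report the number of movies behind each bar. The sketch below is a minimal illustration on made-up rows (not the Netflix data); the function name `duration_summary_by_rating` and the toy frame are hypothetical, while the `rating` and `duration` column names follow the notebook.

```python
import pandas as pd

def duration_summary_by_rating(movies):
    """Return mean runtime and number of titles per rating, sorted by mean runtime."""
    # Pull the leading number out of strings such as "90 min"
    minutes = pd.to_numeric(movies["duration"].str.extract(r"(\d+)", expand=False))
    summary = minutes.groupby(movies["rating"]).agg(mean_minutes="mean", n_titles="count")
    return summary.sort_values("mean_minutes")

# Made-up toy rows for illustration only
toy_movies = pd.DataFrame({
    "rating": ["TV-Y", "TV-Y", "NC-17", "PG", "PG", "PG"],
    "duration": ["30 min", "40 min", "120 min", "100 min", "90 min", "110 min"],
})

summary = duration_summary_by_rating(toy_movies)
print(summary)
```

A category with a very small `n_titles` deserves a cautious reading, because a single long or short movie can move its average substantially.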

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially help create a positive business impact as it allows movie studios and streaming platforms to better understand their audience and tailor their content accordingly. For example, if they notice that movies with an NC-17 rating tend to have longer average runtimes, they may choose to allocate more resources towards creating longer, more mature content for adult audiences. Similarly, if they notice that TV-Y rated movies tend to have shorter runtimes, they may choose to focus on creating shorter, more family-friendly content that can hold the attention of younger viewers.

However, there could also be insights that lead to negative growth. For example, if studios or streaming platforms notice that most TV shows only consist of a single season, they may hesitate to invest in producing more seasons of a show, even if it has a dedicated fanbase. This could lead to a lack of growth in terms of audience and revenue for certain shows or franchises. Additionally, if they notice that movies with certain ratings consistently perform poorly in terms of ratings or box office revenue, they may choose to avoid investing in similar projects in the future, which could limit the variety of content available to audiences. Ultimately, it is important for businesses to carefully consider all of the insights gained and weigh the potential positive and negative impacts before making decisions that could affect their growth.

#### Chart - **7.Country**

In [None]:
# Chart - 7 visualization code
# create a figure with the desired size
plt.figure(figsize=(18,5))

# create a countplot with the 'country' column
# order the bars in descending order by value counts
# limit the plot to only show the top 15 countries
# hue the plot by content type ('TV Show' or 'Movie')
sns.countplot(x=nd['country'], order=nd['country'].value_counts().index[0:15], hue=nd['type'], palette='rainbow')

# rotate the x-axis tick labels by 50 degrees for better visibility
plt.xticks(rotation=50)

# set the plot title and font size
plt.title('Top 15 countries with most contents', fontsize=15, fontweight='bold')

# show the plot
plt.show()

In [None]:
# Number of titles per country, in descending order
country=nd['country'].value_counts().reset_index()
country

In [None]:
# Top 10 countries by count of titles
top_countries = nd['country'].value_counts().head(10).index

# Create a dataframe with count of movie and TV show for each country
content_data = nd.loc[nd['country'].isin(top_countries)].groupby(['country', 'type']).size().unstack().fillna(0)
content_data['total'] = content_data.sum(axis=1)

# Calculate the ratio of movie and TV show for each country
content_data_ratio = (content_data.iloc[:, :-1].div(content_data['total'], axis=0)[['Movie', 'TV Show']] * 100)

# Sort the dataframe by movie ratio and plot the horizontal bar chart
ax = content_data_ratio.sort_values(by='Movie').plot(kind='barh', stacked=True, figsize=(12, 8))

# Set the x-axis label and title
ax.set_xlabel('Ratio of Titles (%)', fontsize=14)
ax.set_title('Ratio of Movies and TV Shows by Country\n', fontsize=18)

# Set the legend
handles, labels = ax.get_legend_handles_labels()
ax.legend(reversed(handles), reversed(labels), fontsize=12, loc='upper right')


In [None]:
# Preparing data for heatmap
nd['count'] = 1
# Top 10 countries by number of titles (summing only 'count' avoids a column clash on the grouping key)
data = nd.groupby('country')['count'].sum().sort_values(ascending=False).head(10).index


nd_heatmap = nd.loc[nd['country'].isin(data)]
nd_heatmap = pd.crosstab(nd_heatmap['country'],nd_heatmap['target_ages'],normalize = "index").T
nd_heatmap

##### 1. Why did you pick the specific chart?

A count plot with bars ordered by title count makes it easy to compare how much content each country contributes, and splitting each bar by type separates movies from TV shows. A stacked horizontal bar chart then shows the ratio of movies to TV shows for the top countries. Together, these charts show that the United States has the largest content library on Netflix, followed by India, with India having the most movies.

##### 2. What is/are the insight(s) found from the chart?

The United States has the highest number of titles on Netflix, followed by India. India has the highest number of movies on Netflix.
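Counts and shares answer different questions, so it is worth tabulating both before drawing conclusions about which country has "the most" movies. The following is a minimal sketch on made-up rows (not the Netflix data); the function name `titles_by_country_and_type` and the `movie_share_pct` column are hypothetical, while `country` and `type` follow the notebook.

```python
import pandas as pd

def titles_by_country_and_type(df, top_n=10):
    """Count movies and TV shows per country and add the movie share in percent."""
    top = df["country"].value_counts().head(top_n).index
    subset = df[df["country"].isin(top)]
    table = pd.crosstab(subset["country"], subset["type"])
    table["total"] = table.sum(axis=1)
    table["movie_share_pct"] = 100 * table["Movie"] / table["total"]
    return table.sort_values("total", ascending=False)

# Made-up toy rows for illustration only
toy = pd.DataFrame({
    "country": ["United States"] * 5 + ["India"] * 4 + ["Japan"] * 3,
    "type": ["Movie"] * 3 + ["TV Show"] * 2      # United States
            + ["Movie"] * 3 + ["TV Show"]        # India
            + ["Movie"] + ["TV Show"] * 2,       # Japan
})

table = titles_by_country_and_type(toy)
print(table)
```

In this toy frame the two leading countries have the same number of movies, yet their movie shares differ, which is why the `Movie` column and the `movie_share_pct` column should be read separately.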

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

According to our analysis, the United States has the highest number of titles on Netflix, followed by India. Interestingly, India has the highest number of movies on Netflix.

These insights can be useful for Netflix in a number of ways. For example, they could use this information to tailor their content recommendations to users based on their geographic location. They could also use this information to determine which types of content to focus on producing in the future.

However, there are also some potential negative impacts to consider. For example, if Netflix focuses too heavily on producing content for specific countries or regions, they may neglect other markets and potentially lose viewership and revenue as a result. Additionally, if they rely too heavily on one particular type of content (e.g. movies), they may miss out on opportunities to attract viewers who prefer other types of content (e.g. TV shows or documentaries).

Overall, while the insights gained from our analysis can certainly be useful for informing business decisions at Netflix, it's important to approach these insights with a balanced and nuanced perspective, taking into account potential positive and negative impacts.

#### Chart - **8.Originals**

In [None]:
# Chart - 8 visualization code
nd['date_added'] = pd.to_datetime(nd['date_added'])
movies['year_added'] = nd['date_added'].dt.year
nd

In [None]:
# Create a new column 'is_original': a movie is treated as an original when its release year
# equals the year it was added to Netflix (a proxy used in this analysis)
movies['is_original'] = np.where(movies['release_year'] == movies['year_added'], 'Yes', 'No')

# Create a pie chart showing the proportion of originals and non-originals in the dataset
fig, ax = plt.subplots(figsize=(5,5), facecolor="#363336")
ax.patch.set_facecolor('#363336')

# Define the explode parameter to separate the slices
explode = (0, 0.1)

# Count the movies in each category in a fixed order so the labels always match the slices
original_counts = movies['is_original'].value_counts().reindex(['No', 'Yes'], fill_value=0)
ax.pie(original_counts, explode=explode, autopct='%.2f%%', labels=['Non-Originals', 'Originals'],
       shadow=True, startangle=90, textprops={'color': "black", 'fontsize': 20}, colors=['yellow', '#F5E9F5'])

# Add a title to the plot
ax.set_title("Proportion of Original vs Non-Original Movies", color='white', fontsize=20)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart shows the share of originals versus all other movies at a glance.

##### 2. What is/are the insight(s) found from the chart?

While Netflix is renowned for its original content, it's worth noting that only about 30% of the movies on the platform count as originals under the proxy used here (release year equal to the year the movie was added to Netflix). The other 70% of movies come from different sources, such as theaters or other streaming platforms.

This statistic emphasizes Netflix's extensive collection of movies gathered over the years, offering audiences an extensive range of content from all around the globe. Netflix provides everything from timeless Hollywood classics to international films, satisfying the diverse preferences and interests of their viewers.

The next time you browse through Netflix's vast movie library, keep in mind that only a small fraction of it is original content. The majority of the movies you see are obtained and added to the platform, providing a seemingly never-ending supply of entertainment alternatives for viewers.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

To summarize, Netflix's strategy of acquiring and producing content has both positive and negative implications. While acquiring popular content allows them to offer a wider variety of content without incurring high costs, producing original content allows them to differentiate themselves from competitors and retain customers. However, if their original content is not as popular, it could lead to a decline in subscribers. Furthermore, relying too heavily on acquired content could lead to increased costs and decreased profitability if they cannot negotiate favorable licensing agreements. Therefore, it is important for Netflix to find a balance between acquiring and producing content to ensure continued success.

#### Chart - 9 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Preparing data for heatmap
nd['count'] = 1
# Top 10 countries by number of titles (summing only 'count' avoids a column clash on the grouping key)
data = nd.groupby('country')['count'].sum().sort_values(ascending=False).head(10).index


nd_heatmap = nd.loc[nd['country'].isin(data)]
nd_heatmap = pd.crosstab(nd_heatmap['country'],nd_heatmap['target_ages'],normalize = "index").T
nd_heatmap

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Set the order of countries and age groups for the heatmap
country_order = ['United States', 'India', 'United Kingdom', 'Canada', 'Japan', 'France', 'South Korea', 'Spain', 'Mexico']
age_order = ['Adults', 'Teens', 'Older Kids', 'Kids']

# Create a heatmap using seaborn library
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
sns.heatmap(data=nd_heatmap.loc[age_order, country_order], cmap='YlGnBu', square=True, linewidth=2.5,
            cbar_kws={'label': 'Percentage of Content'}, annot=True, fmt='.0%', vmax=0.6, vmin=0.05, ax=ax,
            annot_kws={'fontsize': 12})

# Set the title and axis labels
ax.set_title('Distribution of Content Ratings in Different Countries', fontsize=16)
ax.set_xlabel('Country', fontsize=12)
ax.set_ylabel('Target Age Group', fontsize=12)

# Display the plot
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap makes it easy to compare, country by country, the share of content aimed at each target age group.

##### 2. What is/are the insight(s) found from the chart?

The US and UK are closely aligned in their Netflix target ages, but radically different from, for example, India or Japan.

Also, Mexico and Spain have a similar distribution of content across age groups.
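The visual impression that two countries are "closely aligned" can be backed by a number. One option, sketched below on made-up rows (not the Netflix data), is the total variation distance between the age-group distributions of each pair of countries: 0 means identical profiles and 1 means completely disjoint ones. The function name `age_profile_distance` is hypothetical, and the choice of distance is an assumption rather than the notebook's method; `country` and `target_ages` follow the notebook.

```python
import pandas as pd

def age_profile_distance(df):
    """Pairwise total variation distance between countries' target-age distributions."""
    # Each row of `profile` is one country's distribution over age groups (sums to 1)
    profile = pd.crosstab(df["country"], df["target_ages"], normalize="index")
    countries = profile.index
    dist = pd.DataFrame(0.0, index=countries, columns=countries)
    for a in countries:
        for b in countries:
            dist.loc[a, b] = 0.5 * (profile.loc[a] - profile.loc[b]).abs().sum()
    return dist

# Made-up toy rows for illustration only
toy = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "target_ages": ["Adults", "Adults", "Teens", "Kids",
                    "Adults", "Adults", "Teens", "Kids",
                    "Kids", "Kids", "Kids", "Teens"],
})

dist = age_profile_distance(toy)
print(dist.round(2))
```

Applied to the heatmap data, small off-diagonal values would support statements such as "the US and UK are closely aligned", while large values would flag the pairs that differ most.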

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

The three hypothetical statements are:

1. The United States has the highest number of titles on Netflix, followed by India. India has the highest number of movies on Netflix.
2. According to the countplot, Netflix appears to add the highest number of movies and TV shows between October and January. This period seems to be the busiest time of year for Netflix in terms of adding new content to its platform.
3. The number of movies on Netflix is greater than the number of TV shows, with 5372 movies and 2398 TV shows currently available on the platform.



### Hypothetical Statement - Netflix has the highest number of content in the United States, followed by India. India has the highest number of movies on Netflix.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): The mean release year of movies on Netflix from the United States is equal to the mean release year of movies on Netflix from India.

Alternative hypothesis (H1): The mean release year of movies on Netflix from the United States is greater than the mean release year of movies on Netflix from India.

#### 2. Perform an appropriate statistical test.

In [None]:
# Compare the mean release year of movies from the United States and from India.
# Note: the test is run on 'release_year', so it compares release years, not title counts.

# Filter the data to include only movies
movies = nd[nd['type'] == 'Movie']

# Filter the data to include only movies from the United States and India
us_movies = movies[movies['country'] == 'United States']
india_movies = movies[movies['country'] == 'India']

# Perform a one-sided two-sample t-test with unequal variances (Welch's t-test)
# The 'alternative' argument requires SciPy 1.6 or later
from scipy.stats import ttest_ind
t, p = ttest_ind(us_movies['release_year'], india_movies['release_year'], equal_var=False, alternative='greater')

# Set the significance level
alpha = 0.05

# Print the results of the test
if p < alpha:
    print("Reject the null hypothesis. The mean release year of US movies on Netflix is significantly greater than that of Indian movies.")
else:
    print("Fail to reject the null hypothesis. The mean release year of US movies on Netflix is not significantly greater than that of Indian movies.")


##### Which statistical test have you done to obtain P-Value?

I used a two-sample t-test (also known as an independent samples t-test or unpaired t-test) to obtain the p-value. Specifically, I used the ttest_ind function from the scipy.stats module to perform the t-test. This test is appropriate for comparing the means of two independent samples, which is what we're doing here by comparing the release years of movies on Netflix from the United States and from India.

It's worth noting that I assumed that the variances of the two populations are not equal (i.e., I set equal_var=False in the ttest_ind function), since it's reasonable to expect that the spread of release years could differ between the two countries. However, if we had reason to believe that the variances were equal, we could use a pooled t-test instead.

##### Why did you choose the specific statistical test?

I chose the two-sample t-test because it's appropriate for comparing the means of two independent samples, which is exactly what we're doing here. We have two independent samples of movies on Netflix, one from the United States and one from India, and we want to test whether the mean of the tested variable (release year) differs between them.

The t-test is also appropriate because the population standard deviations are unknown, so we need to use the sample standard deviations to estimate them.

Additionally, the t-test assumes that the data are approximately normally distributed. Release years may be skewed toward recent years, but with samples of this size the sample means are approximately normal by the central limit theorem, so the assumption is reasonable in practice.

Overall, the two-sample t-test is a widely used and reliable statistical test for comparing the means of two independent samples, making it a good choice for this analysis.
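The t-test above is run on `release_year`, so it does not by itself compare how many movies each country has. To compare title counts directly, one option (a swapped-in technique, not the notebook's method) is a chi-square goodness-of-fit test of the two movie counts against an equal split. The sketch below uses made-up rows (not the Netflix data), and the function name `compare_movie_counts` is hypothetical.

```python
import pandas as pd
from scipy.stats import chisquare

def compare_movie_counts(df, country_a, country_b):
    """Chi-square goodness-of-fit test of equal movie counts for two countries."""
    movies = df[df["type"] == "Movie"]
    count_a = int((movies["country"] == country_a).sum())
    count_b = int((movies["country"] == country_b).sum())
    # With no expected frequencies supplied, chisquare assumes an equal split
    stat, p_value = chisquare([count_a, count_b])
    return count_a, count_b, stat, p_value

# Made-up toy rows for illustration only
toy = pd.DataFrame({
    "type": ["Movie"] * 40 + ["TV Show"] * 5,
    "country": ["United States"] * 30 + ["India"] * 10 + ["United States"] * 5,
})

us_n, india_n, chi2_stat, p_value = compare_movie_counts(toy, "United States", "India")
print("US movies:", us_n, "India movies:", india_n)
print("chi-square:", chi2_stat, "p-value:", p_value)
```

A small p-value here would indicate that the two counts differ by more than an equal split would plausibly produce; the direction of the difference is read from the counts themselves.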

### Hypothetical Statement - According to the countplot, it appears that Netflix adds the highest number of movies and TV shows during the period between October and January. This period seems to be the busiest time of year for Netflix in terms of adding new content to its platform.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis (H0): There is no significant difference in the number of movies and TV shows added by Netflix across different months.

Alternative hypothesis (H1): There is a significant difference in the number of movies and TV shows added by Netflix across different months.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Convert the "date_added" column to datetime format
nd["date_added"] = pd.to_datetime(nd["date_added"])

# Extract the month from the "date_added" column
nd["month_added"] = nd["date_added"].dt.month_name()

# Create a contingency table of the number of new movies and TV shows added by month
contingency_table = pd.crosstab(nd["type"], nd["month_added"])

# Perform a chi-square test for independence
chi2_statistic, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("Chi-square statistic:", chi2_statistic)
print("P-value:", p_value)

##### Which statistical test have you done to obtain P-Value?

To obtain the p-value, we performed a chi-square test for independence, which is used to determine whether there is a significant association between two categorical variables. Here the contingency table crosses content type (movie or TV show) with the month in which a title was added, so the test checks whether the mix of movies and TV shows added to Netflix varies by month.

The test compares the observed frequencies in the contingency table with the frequencies expected under the assumption of independence. The test statistic is the sum, over all cells, of the squared difference between the observed and expected frequencies divided by the expected frequency, and it follows a chi-square distribution under the null hypothesis. The p-value is the probability of obtaining a test statistic at least as extreme as the observed one, assuming the null hypothesis (independence) is true. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant association between the two variables.

##### Why did you choose the specific statistical test?

The chi-square test for independence was chosen to test for a potential association between two categorical variables: the month in which a title was added and its type (movie or TV show). This test is commonly used for this type of analysis and allows us to calculate a p-value, which indicates the strength of evidence against the null hypothesis of independence. If the p-value is less than the significance level (usually 0.05), we reject the null hypothesis and conclude that there is a significant association between the two variables. Hence, the chi-square test for independence is an appropriate statistical test to use for this analysis.
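The contingency-table test looks at how the type mix relates to the month. To ask whether total additions are spread evenly over the year (the "busiest period" part of the statement), a chi-square goodness-of-fit test on the monthly totals is one option. This is a sketch under stated assumptions: the function name `monthly_uniformity_test` is hypothetical, the toy dates are made up (not the Netflix data), and the expected count is taken to be equal for every month, which ignores differences in month length.

```python
import pandas as pd
from scipy.stats import chisquare

def monthly_uniformity_test(date_added):
    """Chi-square goodness-of-fit test of equal title counts across the 12 months."""
    months = pd.to_datetime(date_added).dt.month
    # Make sure every month appears, even if no title was added in it
    counts = months.value_counts().reindex(range(1, 13), fill_value=0).sort_index()
    stat, p_value = chisquare(counts.to_numpy())
    return counts, stat, p_value

# Made-up toy dates: two titles in each of January to November, fourteen in December
toy_dates = pd.Series(
    [f"2020-{m:02d}-{d:02d}" for m in range(1, 12) for d in (1, 15)]
    + [f"2020-12-{d:02d}" for d in range(1, 15)]
)

toy_counts, toy_stat, toy_p = monthly_uniformity_test(toy_dates)
print(toy_counts.to_dict())
print("chi-square:", toy_stat, "p-value:", toy_p)
```

On real data, rejecting the null hypothesis would only say that additions are uneven across months; which months are the busiest still has to be read from the counts.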

### Hypothetical Statement - The number of movies on Netflix is greater than the number of TV shows, with 5372 movies and 2398 TV shows currently available on the platform.

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis: The number of movies and TV shows on Netflix is not significantly different.

Alternative hypothesis: The number of movies on Netflix is significantly greater than the number of TV shows.

#### 2. Perform an appropriate statistical test.

In [None]:
from statsmodels.stats.proportion import proportions_ztest
# Count the number of movies and TV shows
n_movies = nd[nd['type'] == 'Movie'].count()['type']
n_tv_shows = nd[nd['type'] == 'TV Show'].count()['type']

# Set the counts and sample sizes for the z-test
counts = [n_movies, n_tv_shows]
nobs = [len(nd), len(nd)]

# Perform the z-test assuming equal proportions
z_stat, p_val = proportions_ztest(counts, nobs, value=0, alternative='larger')

# Print the results
print('Number of movies:', n_movies)
print('Number of TV shows:', n_tv_shows)
print('z-statistic:', z_stat)
print('p-value:', p_val)

##### Which statistical test have you done to obtain P-Value?

To test the hypothesis that the proportion of movies on Netflix is greater than the proportion of TV shows, we used a two-sample z-test for proportions. This test compares the proportion of successes in two independent samples, which in this case are the proportion of movies and TV shows on Netflix. We set the null hypothesis as the proportion of movies and TV shows on Netflix being equal and the alternative hypothesis as the proportion of movies being greater than TV shows.

To perform the test, we used the proportions_ztest() function from the statsmodels library. This function takes the number of successes and sample size for each sample as input and returns the z-score and p-value for the test. We set the alternative argument to 'larger' to test for a one-tailed hypothesis where we are interested in the proportion of movies being greater than TV shows.

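It is important to note that the validity of the test relies on the assumption that the samples are independent, and that the sample sizes are large enough for the central limit theorem to hold. Here both counts come from the same catalogue and add up to its size, so the two proportions are not strictly independent and the result should be interpreted with that in mind.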

##### Why did you choose the specific statistical test?

The two-sample z-test for proportions was chosen to compare the number of movies and TV shows on Netflix due to the categorical nature of the data and the need to test for a significant difference between the proportions of these categories in the population. This test is appropriate for comparing the proportion of successes (in this case, movies or TV shows) in two independent samples, and it assumes that the sample sizes are large enough to apply the normal approximation to the binomial distribution. By using the proportions_ztest() function from the statsmodels library, we were able to calculate the z-score and p-value for the test based on the sample proportions, sample sizes, and specified null hypothesis value.
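Because every title is either a movie or a TV show, the same question can also be framed as a one-sample test of whether the movie share exceeds 50%. The sketch below computes the z-statistic by hand from the counts quoted in the statement (5372 movies and 2398 TV shows), using the null proportion in the standard error. The function name `one_sample_proportion_ztest` is hypothetical, and this is an alternative framing rather than the notebook's method.

```python
from math import sqrt
from scipy.stats import norm

def one_sample_proportion_ztest(successes, n, p0=0.5):
    """One-sided z-test of H0: p = p0 against H1: p > p0."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)   # standard error under the null
    z = (p_hat - p0) / standard_error
    p_value = norm.sf(z)                       # upper-tail probability
    return z, p_value

# Counts quoted in the hypothetical statement
n_movies, n_tv_shows = 5372, 2398

z_stat, p_val = one_sample_proportion_ztest(n_movies, n_movies + n_tv_shows)
print("z-statistic:", z_stat)
print("p-value:", p_val)
```

This framing uses a single sample (the whole catalogue) and a single proportion, so it does not rely on treating the movie and TV show counts as independent samples.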