# **Project Name**    - Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Content analysis is extremely important for any OTT platform aiming to retain viewers and stay competitive in a crowded digital entertainment space.

Amazon Prime, one of the leading streaming platforms globally, offers a diverse collection of movies and TV shows across genres, languages, and regions. With the increasing demand for personalized and high-quality content, it becomes essential to understand what types of content perform best and how user preferences evolve over time.

In this project, exploratory data analysis is performed on Amazon Prime's content catalog to uncover patterns and trends related to content type, release years, genre popularity, and viewer ratings. By analyzing these aspects, we aim to provide insights that can help drive content acquisition strategies, improve customer satisfaction, and enhance overall platform performance.

This analysis will support business decisions by identifying what kind of content to prioritize, when to release it, and which genres resonate most with the audience—ultimately contributing to better engagement and retention.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


With the growing demand for personalized and high-quality content on OTT platforms, there is a need to analyze Amazon Prime's vast content library to identify trends and patterns. This project aims to explore the platform’s catalog to understand viewer preferences, content performance, and genre popularity to support better content acquisition and engagement strategies.










#### **Define Your Business Objective?**

Enhance user engagement and satisfaction by optimizing content offerings and recommendations on Amazon Prime Video.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

df1 = pd.read_csv("/content/drive/My Drive/ Project2 dataset/titles.csv")
df2 = pd.read_csv("/content/drive/My Drive/ Project2 dataset/credits.csv")

In [None]:
# First 5 rows of the titles dataset
df1.head()

### Dataset First View

In [None]:
# First 5 rows of the credits dataset
df2.head()

In [None]:
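# Left-merge credits (df2) with titles (df1) on the shared title id,
# so the result has one row per credited person
df = pd.merge(df2, df1, on='id', how="left")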
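
A left merge from credits onto titles keeps every credit row, so it can be worth checking that each credit actually found its title. The following is a minimal sketch on two made-up frames; `toy_titles` and `toy_credits` are hypothetical stand-ins for `df1` and `df2`, and the `validate` and `indicator` options are an addition that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical stand-ins for the titles (df1) and credits (df2) tables
toy_titles = pd.DataFrame({
    "id": ["tm1", "ts2"],
    "title": ["Movie A", "Show B"],
    "type": ["MOVIE", "SHOW"],
})
toy_credits = pd.DataFrame({
    "id": ["tm1", "tm1", "ts2", "tm9"],
    "name": ["Person 1", "Person 2", "Person 3", "Person 4"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})

# validate="many_to_one" raises a MergeError if an id repeats in the titles table;
# indicator=True adds a "_merge" column saying where each row came from
merged = pd.merge(toy_credits, toy_titles, on="id", how="left",
                  validate="many_to_one", indicator=True)

# Credits whose id has no matching title keep NaN in every title column
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows after merge,", len(unmatched), "without a matching title")
```

If `unmatched` is non-empty on the real data, those rows would explain part of the missing values seen later in the title columns.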


In [None]:
# Dataset First Look
df

### Variables Description

**id**: The title ID on JustWatch.

**title**: The name of the title.

**type**: TV show or movie.

**description**: A brief description.

**release_year**: The release year.

**age_certification**: The age certification.

**runtime**: The length of the episode (SHOW) or movie.

**genres**: A list of genres.

**production_countries**: A list of countries that produced the title.

**seasons**: Number of seasons if it's a SHOW.

**imdb_id**: The title ID on IMDB.

**imdb_score**: Score on IMDB.

**imdb_votes**: Votes on IMDB.

**tmdb_popularity**: Popularity on TMDB.

**tmdb_score**: Score on TMDB.

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(df.duplicated().sum())

Insights:

There are 168 duplicate rows, which will be dropped during data wrangling.
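
Before dropping duplicates it can help to look at them. The sketch below uses a small made-up frame (`toy` is hypothetical, not the project data) to show how `duplicated()` counts repeats and how `keep=False` exposes every copy.

```python
import pandas as pd

# Hypothetical credit rows; the first two rows are exact copies of each other
toy = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts2"],
    "name": ["Person 1", "Person 1", "Person 2", "Person 3"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

# duplicated() marks every repeat after the first occurrence, so its sum is
# the number of rows that drop_duplicates() would remove
n_duplicates = int(toy.duplicated().sum())

# keep=False marks all copies, which makes the repeated rows easy to inspect
all_copies = toy[toy.duplicated(keep=False)]

deduplicated = toy.drop_duplicates()
print(n_duplicates, len(all_copies), len(deduplicated))
```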

#### Missing Values/Null Values

In [None]:
df.dtypes

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Convert placeholder strings to NaN, then recount the missing values
df.replace(["null", "NULL", "NaN", "None"], np.nan, inplace=True)
df.isnull().sum()
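
Raw counts are easier to judge next to percentages. This is a small sketch, assuming the same placeholder strings as above; the frame `toy` and the name `missing_summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with both placeholder strings and real missing values
toy = pd.DataFrame({
    "age_certification": ["R", "null", None, "PG"],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
})

# Turn placeholder strings into real missing values
toy = toy.replace(["null", "NULL", "NaN", "None"], np.nan)

# One row per column: absolute count and share of missing values
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("percent", ascending=False)
print(missing_summary)
```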

### What did you know about your dataset?
This dataset lists the titles (movies and TV shows) available on Amazon Prime Video in the United States, together with their cast and crew credits, so that the catalog can be analyzed for interesting patterns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset statistical description for numerical columns

df.describe().T


In [None]:
#for categorical columns
df.describe(include=["object"]).T

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()


## 3. ***Data Wrangling***

### Data Wrangling Code

Step1: Handling Duplicates

In [None]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

Step2: Handling Missing Values

In [None]:
df.isnull().sum()

In [None]:
# character: dtype object
df['character'].mode()
# Fill missing character values with 'Himself', the value used here as the column's mode
df['character'] = df['character'].fillna('Himself')
df.isnull().sum()

In [None]:
# description: dtype object; count() returns the number of non-null values
df["description"].count()

Insights:

The 91 rows with a missing description are a very small fraction (~0.15%) of the data, so dropping them won't affect overall data quality.

In [None]:
df.dropna(subset=["description"],inplace=True)
df.isnull().sum()

In [None]:
# age_certification:datatype	->object
df["age_certification"].mode()

In [None]:
# Fill missing age_certification values with "R"
df["age_certification"] = df["age_certification"].fillna("R")
df.isnull().sum()

In [None]:
# seasons: dtype float64
df["seasons"].skew()  # the data is positively skewed (right-skewed)

In [None]:
df["seasons"].median()

In [None]:
# Right-skewed column, so fill missing values with the median
df["seasons"] = df["seasons"].fillna(df["seasons"].median())
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
#datatype: object
df['imdb_id'].count()

In [None]:
# imdb_id: dtype object
# Replace missing 'imdb_id' values with 'unknown'.
# Each IMDb id is unique, so filling with the mode would be wrong.
df['imdb_id'] = df['imdb_id'].fillna('unknown')
df.isnull().sum()

In [None]:
#datatype: float64
df["imdb_score"].skew()

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df,x='imdb_score',bins=20,kde=True)
plt.show()

In [None]:
# The data is only slightly left-skewed (skewness magnitude of about 0.5), so the mean is a reasonable fill value

# Replace missing 'imdb_score' values with the mean
df['imdb_score'] = df['imdb_score'].fillna(df['imdb_score'].mean())
df.isnull().sum()


In [None]:
#df['imdb_votes'] ->float
df['imdb_votes'].skew()

In [None]:
# Fill missing values with the median
df['imdb_votes'] = df['imdb_votes'].fillna(df['imdb_votes'].median())
df.isnull().sum()

In [None]:

# Only 13 values are missing, so these rows can be dropped as well
df.dropna(subset=['tmdb_popularity'], inplace=True)
df.isnull().sum()


In [None]:
df['tmdb_score'].skew()

In [None]:
#check distribution using histogram
plt.figure(figsize=(10,5))
sns.histplot(df,x='tmdb_score',bins=20,kde=True)
plt.show()

In [None]:
# The histogram shows the data is roughly symmetric
# Fill missing tmdb_score values with the mean
df['tmdb_score'] = df['tmdb_score'].fillna(df['tmdb_score'].mean())
df.isnull().sum()

### What all manipulations have you done and insights you found?
* The dataset initially contained 168 duplicate entries, which were removed using the drop_duplicates() method.

* Several columns had missing values, which were handled as follows :

 - Rows with missing values in 'tmdb_popularity' or 'description' were dropped from the dataset.

 - Categorical columns such as 'age_certification' and 'character' (object types) had missing values that were filled using the mode.

 - Missing 'imdb_id' values were replaced with 'unknown', because each id is unique and the mode would be meaningless.

 - Numerical columns like 'seasons' and 'imdb_votes', which were skewed, had their missing values filled using the median.

 - For columns like 'imdb_score' and 'tmdb_score', whose distributions were roughly symmetric, missing values were imputed using the mean.
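
The median-versus-mean rule described above can be written as one small helper. This is only a sketch: the function name `impute_numeric` and the cut-off `skew_threshold=0.5` are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

def impute_numeric(series, skew_threshold=0.5):
    """Fill NaN with the median if the column is clearly skewed, else with the mean.

    skew_threshold is an illustrative cut-off, not a value from the notebook.
    """
    if abs(series.skew()) > skew_threshold:
        fill_value = series.median()
    else:
        fill_value = series.mean()
    return series.fillna(fill_value)

# Made-up examples: one column with a long right tail, one symmetric column
skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 100.0, np.nan])
symmetric = pd.Series([1.0, 2.0, 3.0, np.nan])

filled_skewed = impute_numeric(skewed)        # NaN replaced by the median
filled_symmetric = impute_numeric(symmetric)  # NaN replaced by the mean
print(filled_skewed.iloc[-1], filled_symmetric.iloc[-1])
```

Returning a new Series and assigning it back (`df[col] = impute_numeric(df[col])`) avoids relying on `inplace=True` on a column selection.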

In [None]:
#numerical columns
numeric_cols= df.select_dtypes(include=['int64','float64'])
numeric_cols

In [None]:
##categorical columns
categorical_cols= df.select_dtypes(include=['object'])
categorical_cols

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
df.head()

#### Chart - 1
Histogram

**Visualize the distribution of content releases**

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(df,x='release_year',bins=20,kde=True)
plt.show()

##### 1. Why did you pick the specific chart?

We selected a histogram of release_year to clearly show the distribution of release years over time. It effectively highlights patterns such as the increase in recent releases and the left-skewed nature of the data.

##### 2. What is/are the insight(s) found from the chart?

The distribution of movie release years is left-skewed (skewness = -1.14), meaning a few very old movies pull the average earlier.

Most movies were released between 2010 and 2022, with a peak around 2020.

Mean release year is 1996, but this is lower than the median (2009) due to the influence of older movies (as early as 1912).

Movie production was low before 1960, then steadily increased, with a major boom after 2000.

The surge in recent releases reflects the growth of digital platforms and filmmaking technology.
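
The relationship between left skew, mean, and median can be checked numerically. The sketch below uses a handful of made-up release years (not the project data) in which a few very old titles pull the mean below the median.

```python
import pandas as pd

# Made-up release years: mostly recent, with two very old titles
years = pd.Series([1912, 1950, 2005, 2010, 2015, 2018, 2020, 2021])

mean_year = years.mean()
median_year = years.median()
skewness = years.skew()  # negative values indicate a left-skewed distribution

print(f"mean={mean_year:.1f}, median={median_year:.1f}, skew={skewness:.2f}")
```

On the real data the same three calls on `df['release_year']` would reproduce the figures quoted in the insights.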

##### 3. Will the gained insights help creating a positive business impact?
 Movies involving actors show greater variability in runtimes, reflecting diverse content that can attract a wider audience and enhance recommendation systems. However, excessive runtimes may lower viewer engagement, highlighting the need to balance creativity with audience viewing habits.










#### Chart - 2
Histogram

**Illustrate the distribution of runtimes across various types of content**

In [None]:
plt.figure(figsize=(10,5))
# sns.histplot returns a matplotlib Axes, so name it ax rather than fig
ax = sns.histplot(df, x='runtime', bins=20, kde=True, color="green")

# Iterate through the patches (bars) of the histogram
for p in ax.patches:
    # x is the horizontal centre of the bar, y is its height (the count)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    # Add a text label with the count above the bar
    ax.text(x, y, f"{int(y)}", ha="center", va="bottom")
plt.show()

##### 1. Why did you pick the specific chart?
We selected the histogram to show the distribution of movie runtimes. This helps in understanding how movies vary in length, revealing patterns and outliers in runtime.

##### 2. What is/are the insight(s) found from the chart?
* Most movies have a runtime between 80 and 120 minutes.
* Very few movies run for 180 minutes.
* A few movies exceed 500 minutes, indicating outliers.
* The distribution is right-skewed, indicating a small number of long-duration movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart reveals that most movies have runtimes between 80–120 minutes, helping businesses tailor content accordingly. Long-duration movies are rare and could lead to poor performance if not strategically planned. This data-driven insight supports informed decisions in production and distribution for better audience engagement and profitability.

#### Chart - 3
Box Plot

 **Check whether any movies have a runtime far from that of the other movies**

In [None]:
# Box plot to check for outliers in runtime
plt.figure(figsize=(10,6))
sns.boxplot(data=df, x='runtime', color='lightgreen')
plt.title("Box Plot of Runtime")
plt.show()

##### 1. Why did you pick the specific chart?
 The box plot clearly highlights whether any movies have significantly longer or shorter durations than the others.


##### 2. What is/are the insight(s) found from the chart?
* There are several outliers: movies with runtimes significantly higher than the rest, exceeding 300 minutes, and some even above 500 minutes.
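
The points a box plot draws beyond its whiskers follow the 1.5 × IQR rule by default, and the same rule can be applied directly to list the outlying values. The helper `iqr_outliers` and the sample runtimes below are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up runtimes in minutes, with one extreme value
runtimes = pd.Series([85, 90, 95, 100, 105, 110, 120, 540])
outliers = iqr_outliers(runtimes)
print(outliers.tolist())
```

Applied to `df['runtime']`, this would return the rows behind the dots on the chart, which makes it possible to inspect which titles they are.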


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Extremely long movies might need to be marketed differently, or split into parts/series for better user engagement.

* However, if such long runtimes are not aligned with audience preferences, promoting them heavily might lead to negative growth, such as decreased engagement or higher bounce rates.

#### Chart - 4
PIE PLOT


**Show the proportion of each content type.**

In [None]:
type_counts = df['type'].value_counts()
type_counts

In [None]:
#find unique values
df['type'].unique()

In [None]:
type_counts = df['type'].value_counts()
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%')
plt.title("Distribution of Content Type")
plt.axis('equal')
plt.show()


##### 1. Why did you pick the specific chart?
We selected the pie chart to visually represent the proportion of each content type in the dataset, making it easier to compare the distribution between "Movie" and "Show."


##### 2. What is/are the insight(s) found from the chart?

* The chart reveals that Movies make up 93.3% of the content, while Shows account for only 6.7%.
* This indicates a significant dominance of movie content in the dataset.
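
Because `df` is the credits-to-titles merge, each title appears once per credited person, so the shares above are shares of credit rows. A minimal sketch, on a made-up frame, of how row-level and title-level shares can differ; deduplicating on `id` is an assumption about what a title-level view would need, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical merged rows: one movie with three credits, one show with one credit
toy = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "ts2"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"],
})

# Share of rows per type (what a pie chart over the merged frame shows)
row_share = toy["type"].value_counts(normalize=True).mul(100).round(1)

# Share of distinct titles per type: keep one row per title id first
title_share = (toy.drop_duplicates(subset="id")["type"]
               .value_counts(normalize=True).mul(100).round(1))

print(row_share.to_dict(), title_share.to_dict())
```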

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 * Knowing that movies dominate the platform can help content strategists and marketers focus their efforts on optimizing movie-related content, improving recommendations, and attracting more viewers interested in movies.

* However, the lack of shows (only 6.7%) could be a point of concern. It might indicate a content gap that, if not addressed, could lead to negative growth by losing potential viewers who prefer watching series or episodic content. This insight can guide the business to consider expanding its show library to cater to a broader audience and improve user retention.

#### Chart - 5

PIE CHART

**Show the proportion of different roles.**

In [None]:
plt.figure(figsize=(10, 6))
# Count each role once and reuse the result for both the values and the labels
role_counts = df["role"].value_counts()
plt.pie(role_counts, labels=role_counts.index, autopct="%1.1f%%")
plt.show()

##### 1. Why did you pick the specific chart?

We selected a pie chart to visually represent the proportion of different roles in the dataset (e.g., Actor, Director, Unknown).

##### 2. What is/are the insight(s) found from the chart?

* The majority of entries in the dataset are for the role "ACTOR" (92.6%), which heavily dominates the dataset.

* "DIRECTOR" roles make up a much smaller percentage (6.7%).



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart reveals a strong focus on actors, enabling targeted marketing and actor-centric strategies.
However, limited data on other roles may cause biased insights and missed opportunities.
Balancing the dataset can lead to more comprehensive and inclusive business decisions.

#### Chart - 6
COUNT PLOT

**How many records belong to each certification category**

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(data =df, x="age_certification", palette ='viridis')
plt.show()

##### 1. Why did you pick the specific chart?

Because age_certification contains discrete categories (like R, PG, G, etc.), a count plot provides a clear and immediate understanding of how many records belong to each certification category.

##### 2. What is/are the insight(s) found from the chart?

* The 'R' (Restricted) certification has the highest number of entries by a large margin, followed by 'PG-13' and 'PG'.

* Certifications like 'TV-Y7', 'NC-17', 'TV-14', 'TV-MA', and 'TV-Y' have very few entries, indicating they are underrepresented.

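* There is a notable imbalance in the distribution of age certifications in the dataset. Part of the 'R' count comes from the missing values that were filled with 'R' during data wrangling.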

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart reveals a heavy focus on 'R' rated content, which supports mature audience engagement but limits appeal to younger viewers. This creates an opportunity to diversify content for broader reach while highlighting a risk of alienating family-friendly segments. Balancing age certifications can drive positive business growth.

###  **BIVARIATE ANALYSIS**

Analyzes the relationship between two variables.
1.  **Numerical vs. Numerical:**
- Scatter Plot
2.  **Numerical vs. Date:**
- Line Chart
3.  **Categorical vs. Numerical:**

- Bar Chart
- Box Plot

#### Chart - 7


Scatter Plot

**How does TMDB popularity depend on release year?**

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(data=df, x='release_year', y='tmdb_popularity')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot clearly shows the relationship between release year and TMDB popularity, helping to identify trends and outliers over time.

##### 2. What is/are the insight(s) found from the chart?

* There is a noticeable increase in TMDB popularity for movies released after the year 2000, especially after 2010.

* Most movies released before 1980 have relatively lower popularity scores, indicating limited audience engagement or data availability.

* The chart shows a few very high popularity outliers in recent years, which might be blockbuster hits or highly marketed releases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on newer movies can drive more user engagement and platform growth.

Relying too much on older, less popular movies without promotion may negatively impact viewer interest.

#### Chart - 8

SCATTER PLOT

**Analyze how TMDB popularity correlates with TMDB score.**

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(data=df, x='tmdb_score', y='tmdb_popularity')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was selected to show the relationship between TMDb score (a measure of quality) and TMDb popularity (a measure of audience engagement). It shows how the two variables relate and the trend between them.

##### 2. What is/are the insight(s) found from the chart?

* Most titles with high TMDb popularity tend to have moderate to low TMDb scores.

* Titles with very high TMDb scores do not necessarily have high popularity, indicating that critical acclaim doesn't always correlate with mass appeal.

* A few outliers have exceptionally high popularity regardless of their score.
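
A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation on made-up values (the numbers are illustrative and say nothing about the real data).

```python
import pandas as pd

# Made-up scores and popularity values
toy = pd.DataFrame({
    "tmdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_popularity": [50.0, 10.0, 30.0, 5.0, 20.0],
})

# DataFrame.corr() uses the Pearson coefficient by default
corr_matrix = toy[["tmdb_score", "tmdb_popularity"]].corr()
r = corr_matrix.loc["tmdb_score", "tmdb_popularity"]
print(f"Pearson r = {r:.2f}")
```

A value of `r` close to 0 on the real columns would support the reading that score and popularity are only weakly related.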

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that higher popularity doesn't always align with high TMDb scores, revealing audience preference for entertainment over critical acclaim. This suggests an opportunity to invest in engaging content, even if not highly rated. Balancing quality with popularity can optimize viewer satisfaction and drive business growth.

#### Chart - 9

SCATTER PLOT

**What is the relationship between movie runtime and the number of votes received?**

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(data=df, x='runtime', y='imdb_votes')
plt.show()

##### 1. Why did you pick the specific chart?

The scatterplot was chosen to visually analyze the relationship between a movie's runtime and the number of IMDb votes it received. This helps identify patterns or trends that could influence viewer engagement.

##### 2. What is/are the insight(s) found from the chart?

* Most movies have a runtime between 50 and 180 minutes.
* Movies within this range tend to have more IMDb votes than very short or very long movies.
* Extremely long runtimes (above 200 mins) do not consistently receive high votes, suggesting diminishing audience interest or limited viewership.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that movies with runtimes between 80–150 minutes attract more IMDb votes, revealing a sweet spot in audience engagement. Extremely short or long movies tend to be less popular, highlighting that length impacts viewership. Investing in content within the ideal runtime range can boost user satisfaction and support business growth by aligning with audience preferences.

#### Chart - 10


LINE CHART

**Show how the number of TV show seasons has changed over the years.**

In [None]:
plt.figure(figsize=(10,5))
sns.lineplot(data=df, x='release_year', y='seasons',color="green")
plt.show()

##### 1. Why did you pick the specific chart?

This line plot was selected to visualize the trend of average TV show seasons over the years. It helps track how the number of seasons has evolved from the early 1900s to recent years.

##### 2. What is/are the insight(s) found from the chart?

* From 1920 to 1940, the average number of seasons remained relatively stable.
* The chart shows a rise in the average number of TV show seasons peaking around 2000, followed by a sharp decline after 2015, indicating a shift in audience engagement toward fewer seasons.
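
Since `seasons` only applies to shows and its missing values were filled with the median during wrangling, an average over all rows also includes movie rows. A hedged sketch of computing the yearly average for shows only, on made-up data; the `'SHOW'` label is assumed from the variable descriptions.

```python
import pandas as pd

# Made-up rows; the movie rows carry a filled-in seasons value of 1.0
toy = pd.DataFrame({
    "type": ["SHOW", "SHOW", "SHOW", "MOVIE", "MOVIE"],
    "release_year": [2000, 2000, 2016, 2000, 2016],
    "seasons": [6.0, 4.0, 1.0, 1.0, 1.0],
})

# Average over every row, as a line plot over the whole frame would do
avg_all = toy.groupby("release_year")["seasons"].mean()

# Average over shows only
shows_only = toy[toy["type"] == "SHOW"]
avg_shows = shows_only.groupby("release_year")["seasons"].mean()

print(avg_all.round(2).to_dict(), avg_shows.to_dict())
```

Passing `shows_only` to `sns.lineplot` instead of the full frame would plot the show-level trend directly.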

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows a rise in the average number of TV show seasons peaking around 2000, followed by a sharp decline after 2015. This suggests a shift toward shorter series, likely due to changing audience preferences and streaming trends. The insights can help businesses adapt content strategies for better engagement and cost efficiency.

#### Chart - 11

LINE PLOT

**Visualize the change in TMDB popularity across years.**



In [None]:
plt.figure(figsize=(10,5))
sns.lineplot(data=df, x='release_year', y='tmdb_popularity',color="green")
plt.show()

##### 1. Why did you pick the specific chart?

A line plot was selected to show how popularity changes over the years, since a line chart clearly shows how one variable changes with respect to another.

##### 2. What is/are the insight(s) found from the chart?

* The popularity of movies or shows was relatively low and stable until the 1980s.

* There are noticeable spikes in popularity around the late 1990s and early 2000s, possibly due to blockbuster movies or digital advancements.

* A very sharp increase is seen post-2018, with an unusually high spike around 2022–2023, which may be due to increased streaming consumption, content diversification, or specific popular releases.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows a steady rise in TMDb popularity, with a significant spike after 2018, peaking around 2022.
This suggests increased viewer interest likely driven by streaming platforms and high-content demand.
The insight helps businesses focus on recent trends, though earlier low engagement may hint at underutilized past opportunities.



#### Chart - 12

LINE CHART

**How has the average runtime of movies evolved over the years?**

In [None]:

plt.figure(figsize=(10,5))
sns.lineplot(data=df, x='release_year', y='runtime',color="green")
plt.show()

##### 1. Why did you pick the specific chart?

Line Chart effectively displays the trend of movie runtimes over time. This type of chart is ideal for visualizing how a continuous variable (runtime) changes over another continuous variable (release year), allowing us to observe fluctuations, patterns, and long-term trends in movie durations.

##### 2. What is/are the insight(s) found from the chart?

* The average runtime of movies had a sharp spike around the early 1910s.

* After that, there was a general drop, followed by fluctuations until around the 1950s.

* From 1960 onward, the average runtime has remained relatively stable, generally between 90 and 110 minutes.

* There is a slight downward trend in average runtime in the most recent years (post-2010s).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that average movie runtimes have generally stabilized between 90 to 110 minutes since the 1960s, with a notable dip in recent years. This suggests changing audience preferences, possibly favoring shorter content due to digital consumption habits. The insight helps businesses align production with modern trends, while longer runtimes in earlier eras hint at past audience expectations that may no longer be effective.

#### Chart - 13

BAR PLOT



**Compare IMDb votes across roles (actors vs. directors)**

In [None]:
# Plotted for my own understanding: mean IMDb votes per role
plt.figure(figsize=(10,5))
sns.barplot(data=df,x='role', y='imdb_votes')
plt.xticks(rotation=45)
plt.title("Average IMDb Votes by Role")
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot compares the average of a numerical variable across categories, with error bars indicating the uncertainty of each average. This allows us to easily compare how imdb_votes differs between actor and director credits.

##### 2. What is/are the insight(s) found from the chart?

* Actor credits are associated with higher average IMDb votes than director credits.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights suggest that focusing on strong lead actors and analyzing role-based involvement can drive business strategies like content investment and marketing. However, high variance in audience engagement and over-reliance on specific roles may pose risks, potentially misguiding decisions without considering other factors like genre and production quality.

#### Chart - 14

BOX PLOT

**Compare the runtime distribution of movies and shows**

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(data=df,x='type', y='runtime')
plt.xticks(rotation=45)
plt.title("Distribution of Runtime Across Content Types")
plt.show()


##### 1. Why did you pick the specific chart?

A box plot was chosen because it is ideal for visualizing the distribution of numerical data (runtime) across different categories (type).

##### 2. What is/are the insight(s) found from the chart?

* The median runtime of MOVIES is higher than that of SHOWS, which shows that movies generally have longer runtimes than shows.
* SHOWS have a wider IQR than MOVIES, meaning show runtimes are more spread out around their median.
* Movies have significantly more outliers than Shows indicating some movies are much longer than others.
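
The median and IQR claims read off a box plot can be confirmed with a grouped quantile table. The following sketch uses made-up runtimes; the column names `q1`, `median`, `q3`, and `iqr` are chosen here for illustration.

```python
import pandas as pd

# Made-up runtimes in minutes for each content type
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "runtime": [80, 90, 100, 110, 150, 20, 25, 45, 50, 60],
})

# One row per type, one column per quantile
quartiles = toy.groupby("type")["runtime"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]

# The IQR is the height of the box in the box plot
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

Running the same lines on `df` would show directly which content type has the larger IQR.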


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can create a positive business impact by helping platforms optimize content length based on user preferences, for example by promoting shorter shows for binge-watching audiences.
No significant negative growth is observed, but longer movie runtimes with many outliers could lead to viewer drop-off if not managed properly.

#### Chart - 15

BAR CHART

**Use a bar chart to show which content type has the higher IMDb score**

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=df,x='type', y='imdb_score')
plt.xticks(rotation=45)
plt.title("Imdb score of shows and movies")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is the most effective way to compare an average value across categories.

##### 2. What is/are the insight(s) found from the chart?

* Shows have a higher average imdb_score than Movies, which indicates that shows are rated better than movies.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights can create a positive business impact by encouraging investment in SHOWs, which generally receive higher and more consistent IMDB scores, indicating better audience reception. However, the wider score variability and lower ratings in MOVIEs may lead to negative growth if not addressed, as inconsistent quality could reduce viewer trust and engagement.

#### Chart - 16

BAR CHART

**Compare the average runtime across content types**

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=df,x='type', y='runtime')
plt.xticks(rotation=45)
plt.title("Average Runtime of Movies and Shows")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is the most effective way to compare the average runtime across different types of content.

##### 2. What is/are the insight(s) found from the chart?

* Movies have a significantly higher average runtime compared to Shows.

* Shows tend to have a shorter average runtime, possibly because they consist of multiple episodes, each with a shorter duration, while movies are single, longer-form content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights show that movies have a longer average runtime than shows, which can guide content scheduling and user engagement strategies. This can create a positive business impact by aligning content with viewer preferences. However, over-prioritizing runtime over quality may lead to negative viewer experiences.

#### Chart - 17

BAR PLOT

**Which country's TV shows and films are most popular on TMDB?**

In [None]:
plt.figure(figsize=(10, 5))
# Get the top 10 production_countries values by total (summed) tmdb_popularity
top_10_countries = df.groupby('production_countries')['tmdb_popularity'].sum().nlargest(10).index
# Filter the DataFrame to include only those top 10 values
filtered_df = df[df['production_countries'].isin(top_10_countries)]
# Create the bar plot (sns.barplot shows the mean tmdb_popularity per category)
sns.barplot(data=filtered_df, x='production_countries', y='tmdb_popularity')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.title("Top 10 Production Countries by TMDB Popularity")
plt.tight_layout()  # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are excellent for comparing values across categories. Here, it allows us to easily compare the TMDB popularity scores of the top 10 production countries.

##### 2. What is/are the insight(s) found from the chart?

* The United States and Japan have significantly higher TMDB popularity than all other countries. This suggests that the majority of popular content on TMDB is produced in the US and Japan.
* While the US and Japan lead, countries like India (IN), the United Kingdom (GB), and Canada (CA) also appear, showing a global contribution to popular content, albeit at a smaller scale.
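
Grouping on the raw `production_countries` column treats each distinct list of countries as its own category. The sketch below shows one way to attribute popularity to individual countries instead; it assumes the column stores lists as strings such as `"['US', 'JP']"`, which should be checked against the real data, and all values are made up.

```python
import ast
import pandas as pd

# Made-up titles; production_countries is assumed to hold stringified lists
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "production_countries": ["['US']", "['US', 'JP']", "['JP']", "[]"],
    "tmdb_popularity": [10.0, 20.0, 5.0, 7.0],
})

# Keep one row per title so titles with many credits are not counted repeatedly
titles = toy.drop_duplicates(subset="id").copy()

# Parse each string into a real list, then give every country its own row
titles["country"] = titles["production_countries"].apply(ast.literal_eval)
per_country = titles.explode("country")

# Titles with an empty list become NaN and are left out by groupby
totals = (per_country.groupby("country")["tmdb_popularity"]
          .sum().sort_values(ascending=False))
print(totals.to_dict())
```

With this layout a co-production counts toward each of its countries, which is a design choice rather than the only reasonable one.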

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

The insights show that the US and Japan dominate in TMDB popularity, which suggests that focusing on US- and Japan-based or collaborative productions can increase global viewership and drive business growth. This can create a positive impact by aligning content strategy with audience demand. However, over-reliance on US and Japanese content may limit regional diversity and lead to missed opportunities in other growing markets.
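If `production_countries` stores several countries per title as a list-like string such as `"['US', 'JP']"` (an assumption here, not something this section confirms), grouping on the raw column treats every combination as its own category. The following is a minimal sketch of one way to count each country separately. The helper name `top_countries_by_popularity` and the toy data are illustrative and not part of the notebook.

```python
import ast

import pandas as pd


def top_countries_by_popularity(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Sum tmdb_popularity per individual country and return the n largest totals."""

    def parse(value):
        # Assumption: values look like "['US', 'GB']"; anything that cannot be
        # parsed as a list is kept as a single label.
        if isinstance(value, (list, tuple)):
            return list(value)
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError, TypeError):
            return [value]
        return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]

    exploded = (
        df.assign(country=df["production_countries"].map(parse))
        .explode("country")  # one row per (title, country); empty lists become NaN
        .dropna(subset=["country", "tmdb_popularity"])
    )
    return exploded.groupby("country")["tmdb_popularity"].sum().nlargest(n)


# Toy data standing in for the real catalog (values are made up).
sample = pd.DataFrame(
    {
        "production_countries": ["['US']", "['US', 'JP']", "['IN']", "[]"],
        "tmdb_popularity": [10.0, 4.0, 3.0, 1.0],
    }
)
result = top_countries_by_popularity(sample)
print(result)
```

With this approach a co-production adds its full popularity to every country it lists, so the totals are not additive across countries. Whether that is the right weighting is a design choice.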

#### Chart 18

BAR PLOT

**Show the content type with the highest number of seasons using a bar chart.**

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=df,x='type', y='seasons')
plt.xticks(rotation=45)
plt.title("Seasons in Shows and Movies")
plt.show()

1. Why did you pick the specific chart?

A bar chart is ideal for comparing categorical data, and in this case, it helps quickly identify which content type has the highest average number of seasons.

2. What is/are the insight(s) found from the chart?

* Shows have a much higher average number of seasons than movies. This is expected: a movie is one-time content, while shows often span multiple seasons.

* This insight highlights that shows tend to offer longer-term content engagement.

3. Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with specific reasons.

The insight supports investing in shows for longer viewer engagement, leading to positive business impact, with no indication of negative growth.
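The bar heights can be cross-checked numerically. The sketch below assumes the `type` column holds labels such as `MOVIE` and `SHOW` and that `seasons` is missing for movies. Both are assumptions about the dataset, and the toy rows are made up.

```python
import pandas as pd

# Toy rows; in the notebook the same aggregation would run on `df`.
sample = pd.DataFrame(
    {
        "type": ["MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
        "seasons": [float("nan"), float("nan"), 1.0, 3.0, 8.0],
    }
)

summary = sample.groupby("type")["seasons"].agg(
    titles="size",          # all rows, including those with missing seasons
    with_seasons="count",   # rows where seasons is not missing
    mean_seasons="mean",    # missing values are ignored by the mean
)
print(summary)
```

Comparing `titles` with `with_seasons` shows how many rows actually contribute to each bar, which matters if one content type has no season data at all.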

#### Chart - 19

PAIR PLOT

In [None]:
# Pair plot of the numeric columns
# sns.pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was chosen to explore relationships between multiple numeric features and detect patterns, correlations, and outliers.

##### 2. What is/are the insight(s) found from the chart?

* The pair plot suggests a positive relationship between IMDb votes and IMDb scores: titles with many votes tend to be rated somewhat higher. The relationship is weak, though (the heatmap in Chart 20 puts the correlation at 0.26).

* TMDb popularity also shows a mild positive trend with TMDb scores.

* Runtime and release year vary widely without a clear trend against the other variables. This reflects the diversity of the catalog and suggests room for personalized recommendations and content segmentation.
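Visual impressions from a pair plot are easier to trust when backed by numbers. Vote counts are usually heavily skewed, so a rank-based (Spearman) correlation can differ noticeably from the linear (Pearson) one. The sketch below computes both for `imdb_votes` and `imdb_score`. The function name and the toy values are illustrative only.

```python
import pandas as pd


def votes_score_correlations(df: pd.DataFrame) -> tuple:
    """Return (Pearson, Spearman) correlation between imdb_votes and imdb_score."""
    pair = df[["imdb_votes", "imdb_score"]].dropna()
    pearson = pair["imdb_votes"].corr(pair["imdb_score"])
    # Spearman correlation is the Pearson correlation of the ranks.
    spearman = pair["imdb_votes"].rank().corr(pair["imdb_score"].rank())
    return pearson, spearman


# Made-up values: scores rise steadily while votes grow by orders of magnitude.
sample = pd.DataFrame(
    {
        "imdb_votes": [10, 100, 1_000, 10_000, 1_000_000],
        "imdb_score": [5.0, 5.5, 6.0, 6.5, 7.0],
    }
)
pearson, spearman = votes_score_correlations(sample)
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

In this toy example the ranks agree perfectly while the linear correlation is lower, because one title with a very large vote count dominates the Pearson calculation.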

#### Chart-20

HEAT-MAP

In [None]:
#Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(numeric_cols.corr(),annot=True,cmap='GnBu')
plt.show()


##### 1. Why did you pick the specific chart?

The heatmap was chosen because it is an excellent way to visually represent the correlation between multiple numerical variables in a grid format.

##### 2. What is/are the insight(s) found from the chart?

* imdb_score has a moderate positive correlation (0.6) with tmdb_score, which makes sense as both are scoring systems evaluating content.

* imdb_votes shows a weak positive correlation with both imdb_score (0.26) and tmdb_score (0.22), indicating that content with more votes tends to score slightly higher.

* runtime, seasons, and release_year show weak correlations with most variables, suggesting they may not strongly influence ratings or popularity.

* There is no strong multicollinearity, as most correlation values are low to moderate, making it safer to use multiple features in a model.
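The multicollinearity check can also be done programmatically rather than by reading the heatmap. The sketch below lists every pair of numeric columns whose absolute correlation reaches a threshold. The 0.8 cutoff is a common rule of thumb chosen for illustration, not a value from this notebook, and the helper name and toy data are hypothetical.

```python
import pandas as pd


def strongly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if pd.notna(r) and abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 2)))
    return pairs


# Made-up data: the two score columns are identical, runtime is unrelated.
sample = pd.DataFrame(
    {
        "imdb_score": [6.0, 7.0, 8.0, 5.0, 9.0],
        "tmdb_score": [6.0, 7.0, 8.0, 5.0, 9.0],
        "runtime": [100, 90, 100, 95, 95],
    }
)
pairs = strongly_correlated_pairs(sample)
print(pairs)
```

An empty list at the chosen threshold would be consistent with the reading that no pair of features is strongly collinear.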

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To increase user engagement and satisfaction on Amazon Prime Video, the EDA findings suggest the following:

* Leverage recent releases – Promote and prioritize movies released after 2020, as they dominate the platform and align with current user interest.

* Diversify content types – Add more TV shows to balance with the high number of movies and boost long-term engagement.

* Expand genre variety – Introduce more niche genres like Sci-Fi, Documentary, and Horror to attract a wider audience.

* Offer content for all age groups – Ensure a mix of family-friendly and mature content based on ratings analysis.

* Enhance regional content – Invest more in content from countries like India and the UK to attract global viewers.

* Monitor content performance – Regularly analyze viewer behavior to keep improving recommendations and content strategy.

# **Conclusion**

* Amazon Prime should increase the number of high-quality TV shows, as they offer stronger viewer engagement and longer watch time. Combining this with strategic genre expansion, localization, and personalized recommendations will help drive higher user satisfaction and retention.

* Popular genres like Drama, Comedy, and Action dominate, but there is room for expanding into niche genres to attract diverse viewers.

* The platform features a large volume of content released after 2020, highlighting a strategy focused on modern and up-to-date offerings.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***