<a href="https://colab.research.google.com/github/shrutimore-com/EDA-Analysis/blob/main/Amazon_Prime_TV_Shows_and_Movies_EDA_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Amazon Prime TV Shows and Movies**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This project focuses on performing a detailed Exploratory Data Analysis (EDA) of Amazon Prime Video’s catalog to gain meaningful insights into its content distribution, trends, and audience preferences. Using Python (Pandas, NumPy, Matplotlib, and Seaborn), I analyzed datasets containing information about movies and TV shows available on the platform, including their release year, genre, ratings, duration, country, and crew details.

The main goal of this analysis was to uncover patterns in content strategy, such as the dominance of specific genres, trends in production over the years, and geographical diversity of titles. Visualizations like bar charts, scatter plots, and heatmaps were used to identify correlations and interpret content-based trends.

Through this project, I developed data cleaning, preprocessing, and visualization skills, and derived insights into Amazon Prime’s content mix that can inform comparisons with competitors like Netflix.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Business overview**

1.  Amazon Prime Video offers a vast collection of movies and TV shows across various regions and genres. Understanding the proportion and distribution of this content is key to identifying audience preferences and platform strategy.
2.  Over the years, streaming platforms have evolved rapidly. Analyzing how the number of titles changes over time can reveal strategic expansion periods or global streaming trends.
3.  Amazon Prime Video streams content from various countries to attract a global audience. Knowing which regions contribute most content can help understand localization and diversity strategies.
4.  Genre preference insights help streaming platforms understand what kind of content attracts more viewers and engagement.
5.  Ratings and votes are critical indicators of audience satisfaction and engagement. Analyzing this relationship helps understand viewer behavior and popularity trends.
6.  The involvement of certain directors, actors, and producers can influence a show's popularity and ratings.

#### **Define Your Business Objective?**

The primary business objective of this project is to analyze Amazon Prime Video’s catalog of movies and TV shows to uncover key trends, patterns, and insights that can help understand the platform’s content strategy, audience engagement, and market direction.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import ast

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

# Display plots in notebook
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
title_df = pd.read_csv("titles.csv")
credits_df = pd.read_csv("credits.csv")

### Dataset First View

In [None]:
# Datasets First Look--For Titles
print("Title DataFrame:")
print(title_df.head())



In [None]:
# Datasets First Look--For Credits
print("\nCredits DataFrame:")
print(credits_df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(title_df.shape)

In [None]:
print(credits_df.shape)

### Dataset Information

In [None]:
# Dataset Info
title_df.info()

In [None]:
credits_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
title_duplicates = title_df.duplicated().sum()
print(f"Duplicate rows in title.csv: {title_duplicates}")


In [None]:
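# Drop duplicates; .copy() gives an independent DataFrame so the column
# assignments further down do not raise SettingWithCopyWarning
title_df_clean = title_df.drop_duplicates().copy()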

In [None]:
print(title_df_clean.shape)

In [None]:
credits_duplicates = credits_df.duplicated().sum()
print(f"Duplicate rows in credits.csv: {credits_duplicates}")

In [None]:
# Drop duplicates; .copy() gives an independent DataFrame for later assignments
credits_df_clean = credits_df.drop_duplicates().copy()

In [None]:
print(credits_df_clean.shape)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(title_df_clean.isnull().sum())


In [None]:
# Handle missing values with placeholder values
# description column
title_df_clean['description'] = title_df_clean['description'].fillna("No Description Available")
# age_certification column
title_df_clean['age_certification'] = title_df_clean['age_certification'].fillna("Unknown")
# seasons column
title_df_clean['seasons'] = title_df_clean['seasons'].fillna(0).astype(int)
# imdb_id column
title_df_clean['imdb_id'] = title_df_clean['imdb_id'].fillna("Unknown")
# imdb_votes column
title_df_clean['imdb_votes'] = title_df_clean['imdb_votes'].fillna(0)
# tmdb_popularity column
title_df_clean['tmdb_popularity'] = title_df_clean['tmdb_popularity'].fillna(0)




In [None]:
print(title_df_clean.isnull().sum())

In [None]:

title_df_clean['release_year'].value_counts().sort_index().head(10)





In [None]:
print(credits_df_clean.isnull().sum())

In [None]:
#handle missing value for credits dataset
#character column
credits_df_clean['character'] = credits_df_clean['character'].fillna('Not Applicable')

In [None]:
print(credits_df_clean.isnull().sum())

In [None]:
print(title_df_clean.shape)

In [None]:
print(credits_df_clean.shape)

### What did you know about your dataset?

The dataset describes the movies and TV shows available on Amazon Prime Video. titles.csv holds the titles themselves, and credits.csv lists the cast and directors of those titles.
Both titles.csv and credits.csv were checked for missing and duplicate values. Duplicates were removed and missing values were filled with placeholder values to keep the data consistent.

titles.csv has 9868 rows and 15 columns, and credits.csv has 124179 rows and 5 columns.
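
The checks described above can be collected into one small helper. This is only a sketch: `quality_report` is a hypothetical name, and the toy frame stands in for titles.csv with made-up values.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: dtype, missing count and missing share per column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
    })
    # Columns with the most missing values come first
    return report.sort_values("missing", ascending=False)


# Toy stand-in for titles.csv (invented values, for illustration only)
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t2", "t3"],
    "title": ["A", "B", "B", "C"],
    "seasons": [None, 2.0, 2.0, None],
})

print("Duplicate rows:", toy_titles.duplicated().sum())
print(quality_report(toy_titles))
```

Running the same helper before and after cleaning makes it easy to confirm that no unexpected nulls remain.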

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
title_df_clean.columns


In [None]:
credits_df_clean.columns

In [None]:
# Dataset Describe
title_df_clean.describe()


In [None]:
credits_df_clean.describe()

### Variables Description

#### titles.csv
1. id: The title ID on JustWatch
2. title: The name of the title.
3. type: TV show or movie.
4. description: A brief description.
5. release_year: The release year.
6. age_certification: The age certification.
7. runtime: The length of the episode (SHOW) or movie.
8. genres: A list of genres.
9. production_countries: A list of countries that produced the title.
10. seasons: Number of seasons if it's a SHOW.
11. imdb_id: The title ID on IMDB.
12. imdb_score: Score on IMDB.
13. imdb_votes: Votes on IMDB.
14. tmdb_popularity: Popularity on TMDB.
15. tmdb_score: Score on TMDB.


#### credits.csv
1. person_id: The person ID on JustWatch.
2. id: The title ID on JustWatch.
3. name: The actor or director's name.
4. character: The character name.
5. role: ACTOR or DIRECTOR.

### Check Unique Values for each variable.

In [None]:
#checking unique value for each variable.
title_df_clean.nunique()

In [None]:
for col in title_df_clean.columns:
    print(f"\nColumn: {col}")
    print(title_df_clean[col].unique()[:10])   # show first 10 unique values
    print("Total unique:", title_df_clean[col].nunique())


In [None]:
#checking unique value for each variable.
credits_df_clean.nunique()

In [None]:
# Check Unique Values for each variable.
for col in credits_df_clean.columns:
    print(f"\nColumn: {col}")
    print(credits_df_clean[col].unique()[:20])   # show first 20 unique values
    print("Total unique:", credits_df_clean[col].nunique())



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#Create a release_date column from release year.
title_df_clean['release_date'] = pd.to_datetime(title_df_clean['release_year'].astype(str) + '-01-01')

In [None]:
# Insert 'release_date' right after 'release_year'
cols = list(title_df_clean.columns)
cols.remove('release_date')
year_index = cols.index('release_year')
cols.insert(year_index + 1, 'release_date')
title_df_clean = title_df_clean[cols]
title_df_clean.head()


In [None]:
print(title_df_clean.head())

In [None]:
#Create separate actors and directors columns .
# Actors
Actor_df = credits_df_clean[credits_df_clean['role'] == 'ACTOR'] \
    .groupby("id")['name'] \
    .apply(lambda x: ', '.join(sorted(set(x.dropna().astype(str))))) \
    .reset_index(name="ACTOR")

# directors
director_df = credits_df_clean[credits_df_clean['role'] != 'ACTOR'] \
    .groupby("id")['name'] \
    .apply(lambda x: ', '.join(sorted(set(x.dropna().astype(str))))) \
    .reset_index(name="Director")


New_credits_df= pd.merge(Actor_df, director_df, on="id", how="outer")

In [None]:
print(New_credits_df)

In [None]:

# Write your code to make your dataset analysis ready.
# Reset index after cleaning
title_df_clean.reset_index(drop=True, inplace=True)
New_credits_df.reset_index(drop=True, inplace=True)

# Merge on 'id' column
merged_df = pd.merge(title_df_clean, New_credits_df, on='id', how='left')

print("Merged DataFrame Shape:", merged_df.shape)

In [None]:
print(merged_df.head())

### What all manipulations have you done and insights you found?

I performed the following manipulations on the datasets:

1. Loaded Data: Imported titles.csv and credits.csv using pandas.

2. Duplicate Removal: Used drop_duplicates() to remove duplicate records from both datasets.

3. Null Value Treatment: Checked missing values using isnull().sum() and filled them with placeholders: "No Description Available" for description, "Unknown" for age_certification and imdb_id, 0 for seasons, imdb_votes, and tmdb_popularity, and "Not Applicable" for character in the credits dataset.

4. Derived Columns: Created a release_date column from release_year, and aggregated the credits into one ACTOR and one Director column per title.

5. Checked Unique Values: For each column, the number of unique values was counted to understand data diversity and structure.

6. Reset Index: Reset the index after cleaning for a cleaner dataframe structure.

7. Merged Data: Used pd.merge() on the id column to combine the title and credits datasets for joint analysis.

I also found the following insights in the datasets:

1. Content Type Distribution: The dataset contains both TV shows and movies.
   Most entries are movies, suggesting Amazon Prime focuses more on movies.

2. Genre Variety: The genres column often contains multiple genres per title. Checking the unique genre combinations showed that Drama, Comedy, and Romance are the most frequent genres.

3. IMDb Score: The average IMDb score is around 5.9, and most scores range from 5.0 to 8.5, suggesting moderate to high quality content. A few titles have ratings above 9, which may be Amazon Originals or critically acclaimed shows.

4. Data Range: The content spans multiple decades. The release_year ranges from as early as 1912 to recent years, showing that Amazon Prime hosts both old classics and new releases.

5. Sparse Columns: Columns like age_certification and seasons have many missing values, so they were filled with placeholder values ("Unknown" and 0) and kept for later charts.

6. Unique Value Check: Helped identify categorical vs numerical variables, which is useful for deciding how to group or filter data in visualizations later.

7. Credits Dataset Insights: The credits.csv file includes person_id, name, character, role, and id. Many actors and directors have worked on multiple titles, which can help identify frequently featured artists.
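
The merge step can also be checked for silent row duplication. The sketch below uses toy tables with invented values in place of the cleaned data; `validate` and `indicator` are standard `DataFrame.merge` parameters, but using them here is my suggestion, not something the notebook does.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (invented values)
titles = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Ann", "Bob", "Cy"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# One row per title: names joined per role, similar to the wrangling step
people = (
    credits.groupby(["id", "role"])["name"]
    .agg(lambda s: ", ".join(sorted(set(s))))
    .unstack("role")
    .reset_index()
)

# validate= raises MergeError if either side has repeated ids;
# indicator= adds a `_merge` column marking titles without any credits.
merged = titles.merge(people, on="id", how="left",
                      validate="one_to_one", indicator=True)

print(merged)
print(merged["_merge"].value_counts())
```

If the merged frame has the same number of rows as the title table, the join did not multiply any titles.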

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart 1-pie chart

In [None]:
# Count Movies vs TV Shows
type_counts = merged_df['type'].value_counts()
print(type_counts)

# Create pie chart
plt.figure(figsize=(6,6))
plt.pie(type_counts,
        labels=type_counts.index,
        autopct='%1.1f%%',
        startangle=90,
        colors=['skyblue','tomato'],
        wedgeprops={'edgecolor':'black'})

plt.title("Movies vs TV Shows on Amazon Prime", fontsize=15, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?


1.   Quick snapshot of the platform content mix.

2.   Useful for answering “Does Prime focus more on Movies or TV Shows?”.





##### 2. What is/are the insight(s) found from the chart?


1.   Shows the proportion of movies vs TV shows on Amazon Prime.

2.   Movies far outnumber TV shows, which suggests Amazon is more focused on building a movie-heavy library than long-form episodic content.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact like:

1.   A large movie library attracts casual viewers who prefer short, 2–3 hour content.

2.   Movies help Amazon compete directly with Netflix and Disney+ for blockbuster releases.

3.   A wide selection of movies increases variety and choice, appealing to a broad audience.


**Negative Impact**:


1.   TV Shows usually drive longer user engagement and retention (since episodes keep people coming back).

2.   By underinvesting in TV Shows, Amazon risks losing subscribers who prefer binge-worthy series like those on Netflix or Disney+.


In short, movies dominate, which is good for attracting viewers, but Amazon should balance them with more high-quality TV shows to keep subscribers engaged long-term.
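
To quote the split as numbers rather than reading it off the pie, the counts and shares can be put side by side. A minimal sketch with invented `type` values; on the real data `toy` would be `merged_df`.

```python
import pandas as pd

# Invented `type` values for illustration; the real column comes from titles.csv
toy = pd.DataFrame({"type": ["MOVIE", "MOVIE", "MOVIE", "SHOW"]})

summary = pd.DataFrame({
    "count": toy["type"].value_counts(),
    "share_pct": (toy["type"].value_counts(normalize=True) * 100).round(1),
})
print(summary)
```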












#### Chart 2-linechart

In [None]:
# Trends over time

# Count number of titles per year
titles_per_year = merged_df['release_date'].value_counts().sort_index()

# Plot
plt.figure(figsize=(12,6))
plt.plot(titles_per_year.index, titles_per_year.values, marker='o')
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.title("Trends Over Time: Number of Titles on Amazon Prime Video by Release Year")
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?



1.   Shows growth & evolution of content library.


2.   Helps spot if Prime is adding more content recently or had peak years.



##### 2. What is/are the insight(s) found from the chart?


1.   From 2000 onwards, the number of titles per release year shows a gradual increase.

2.   Titles released after 2020 show a major expansion in the content library.

This suggests a strategic shift, likely to meet the increasing demand for streaming during and after COVID-19. Note that the chart counts titles by release year, not by the date they were added to Prime.




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact**

1.   Subscriber Growth → More fresh titles attract new customers.

2.   Global Market Push → The post-2020 spike may reflect Amazon’s investment in regional & international content, expanding its market share.



**Negative Insights**

1.   Content Discovery Issues → Too many titles can overwhelm users without a strong recommendation system, making it hard to find gems.
2.   Unsustainable Growth → If the spike was temporary, users might feel content is slowing down in later years, hurting loyalty.

Overall, Amazon Prime's catalog grew steadily for release years after 2000 and expanded aggressively after 2020, likely to capture streaming demand. While this boosted growth, Amazon must ensure quality and discoverability to avoid a negative impact.
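
The growth described above can be quantified with a year-over-year change. This is a sketch on invented release years; it counts titles by `release_year`, which is the year a title was released and not necessarily the year it joined the catalog.

```python
import pandas as pd

# Invented release years for illustration
toy = pd.DataFrame({"release_year": [2018, 2019, 2019, 2020, 2020, 2020, 2021]})

per_year = toy["release_year"].value_counts().sort_index()
trend = pd.DataFrame({
    "titles": per_year,
    # Change relative to the previous year present in the data;
    # years with no titles would need a reindex first
    "yoy_change_pct": (per_year.pct_change() * 100).round(1),
})
print(trend)
```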






#### Chart 3-bar chart

In [None]:
## Explode genre column
def safe_genre_list(x):
    try:
        # Case 1: if it's already a list
        if isinstance(x, list):
            return x

        # Case 2: if it's a stringified list like "['Drama', 'Action']"
        elif isinstance(x, str) and x.startswith('['):
            return ast.literal_eval(x)

        # Case 3: if it's a comma-separated string like 'Drama, Action'
        elif isinstance(x, str):
            return [i.strip() for i in x.split(',') if i.strip()]

        # Case 4: if it's null or unrecognized
        else:
            return ['Unknown']
    except (ValueError, SyntaxError):
        # ast.literal_eval raises these on malformed list strings
        return ['Unknown']

# Apply to the DataFrame
merged_df['genres'] = merged_df['genres'].apply(safe_genre_list)
#explode genres
exploded_genres = merged_df.explode('genres')
#count genres frequency
genre_counts = exploded_genres['genres'].value_counts().head(10)


In [None]:
# Visualization code: top 10 genres
# Plot the top genres
plt.figure(figsize=(10,6))
sns.barplot(x=genre_counts.values, y=genre_counts.index)
plt.title("Top 10 Dominant Genres on Amazon Prime Video")
plt.xlabel("Number of Titles")
plt.ylabel("Genres")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?



1.   Clear way to compare popularity of genres (Drama, Comedy etc).


2.   Bar charts are best for categorical comparisons.



##### 2. What is/are the insight(s) found from the chart?


1.   Drama is the most dominant genre on Amazon Prime, making it the top choice in the catalog.
2.   After Drama, the next most common genres are Comedy, Thriller, Action, and Romance.
3.   This indicates Amazon Prime’s focus on story-driven and emotional content, supported by lighter (Comedy) and engaging (Thriller/Action) genres.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact**


1.   Wide Audience → Drama and Comedy have broad reach across age groups and regions.
2.   Strong Engagement → Thriller and Action genres drive binge-watching and high excitement.



**Negative Impact**

1.   Over-saturation of Drama → Too much focus on Drama may reduce variety, making users feel the catalog lacks freshness.


2.   Underrepresentation of other genres → Genres like European, Family, Horror, or Documentaries may be limited, meaning Prime might miss opportunities in fast-growing niche markets.

Drama dominates Amazon Prime’s catalog, supported by Comedy, Thriller, Action, and Romance. This strengthens broad appeal but also risks genre saturation unless Prime diversifies into emerging genres.
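
Because a title can carry several genres, raw genre counts add up to more than the number of titles. A rough sketch of expressing each genre as a share of titles instead, assuming `genres` is stored as a stringified list (the values below are invented):

```python
import ast

import pandas as pd

# Invented rows; `genres` is assumed to be a stringified Python list
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "genres": ["['drama', 'comedy']", "['drama']", "['thriller', 'drama']", "[]"],
})

toy["genres"] = toy["genres"].apply(ast.literal_eval)
# explode() turns an empty list into NaN, so drop those rows afterwards
exploded = toy.explode("genres").dropna(subset=["genres"])

# Share of all titles that carry each genre (shares can sum to more than 100)
genre_share = (
    exploded.groupby("genres")["id"].nunique()
    .div(toy["id"].nunique())
    .mul(100)
    .sort_values(ascending=False)
)
print(genre_share)
```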






#### Chart 4-Toggle vertical bar chart

In [None]:
# --- Prepare data  ---
# Ensure missing certs are labeled
merged_df['age_certification'] = merged_df['age_certification'].fillna("Unknown")

# Order categories by overall frequency
age_order = merged_df['age_certification'].value_counts().index.tolist()

# counts by age_certification and type
counts = (merged_df
          .groupby(['age_certification', 'type'])
          .size()
          .unstack(fill_value=0)
          .reindex(age_order))

# overall counts (sum across types)
counts['All'] = counts.sum(axis=1)

# percentages relative to column totals (so percent within each type column)
pct = counts.divide(counts.sum(axis=0), axis=1) * 100

# Build traces (one trace per column: Movies, TV Show, All)
cols = counts.columns.tolist()   # e.g. ['Movie', 'TV Show', 'All'] depending on your 'type' values
fig = go.Figure()

for col in cols:
    fig.add_trace(
        go.Bar(
            x=counts.index,
            y=counts[col],
            name=str(col),
            visible=True   # count traces start visible; the buttons toggle the view
        )
    )

# Add second set of traces for percentages (same order)
for col in cols:
    fig.add_trace(
        go.Bar(
            x=pct.index,
            y=pct[col],
            name=str(col) + " (%)",
            visible=False  # start hidden; will be shown when "Percent" button clicked
        )
    )

# Buttons: show Count traces (first len(cols) traces) or Percent traces (next len(cols) traces)
n = len(cols)
buttons = [
    dict(
        label="Count",
        method="update",
        args=[{"visible": [True]*n + [False]*n},
              {"yaxis": {"title": "Number of Titles"},
               "title": "Age Certification Distribution — Counts"}]
    ),
    dict(
        label="Percent",
        method="update",
        args=[{"visible": [False]*n + [True]*n},
              {"yaxis": {"title": "Percentage (%)"},
               "title": "Age Certification Distribution — Percentages"}]
    ),
]

# Layout
fig.update_layout(
    updatemenus=[
        dict(
            type="buttons",
            buttons=buttons,
            direction="left",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.0,
            xanchor="left",
            y=1.15,
            yanchor="top"
        ),
    ],
    barmode='group',
    title="Age Certification Distribution — Counts",
    xaxis_title="Age Certification",
    yaxis_title="Number of Titles",
    legend_title="Type",
    template="plotly_white",
    width=900,
    height=500
)

fig.update_xaxes(tickangle=45)
fig.show()



##### 1. Why did you pick the specific chart?



1.   Shows how much content is family-friendly vs adult-oriented.


2.   Toggle (Counts vs Percentages) adds flexibility → absolute numbers + proportions.



##### 2. What is/are the insight(s) found from the chart?

1.   A large share of titles fall into the Unknown certification category.

2.   Movies are assigned certifications such as R, PG-13, PG, and G.

3.   TV Shows are assigned TV-specific ratings such as TV-MA, TV-14, TV-PG, TV-Y, TV-G, and TV-Y7.

This shows that Amazon Prime uses different rating systems for movies and TV shows, reflecting US rating standards (MPAA ratings for movies, TV Parental Guidelines for shows).
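
The claim that movies and shows use separate rating systems can be checked with a cross-tabulation. A minimal sketch on invented rows; on the real data the two columns would come from `merged_df`.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW", "MOVIE"],
    "age_certification": ["R", "PG-13", "TV-MA", "TV-14", "Unknown"],
})

table = pd.crosstab(toy["age_certification"], toy["type"])

# Certifications that appear under more than one content type,
# ignoring the "Unknown" filler label
shared = table[(table > 0).sum(axis=1) > 1].index.difference(["Unknown"])

print(table)
print("Shared certifications:", list(shared))
```

An empty list of shared certifications supports the insight. Any label that shows up would point to titles worth inspecting.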





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**:
1. Knowing how much content is family-friendly vs adult-oriented helps Amazon position the catalog for different audiences.
2. Because movies and TV shows carry separate rating systems, consistent certification labels can support parental controls and age-based filtering.

**Negative Impact**:
1. The large share of titles with Unknown certification weakens parental controls and can reduce trust among family audiences.
2. If missing certifications are not filled in, age-based filtering may surface unsuitable titles, which can lead to dissatisfaction and churn.

#### Chart 5 -*Histogram*

In [None]:
# IMDb score distribution
plt.figure(figsize=(10,6))
plt.hist(title_df_clean['imdb_score'].dropna(), bins=20, color='skyblue', edgecolor='black')

plt.title("IMDb Score Distribution", fontsize=14, fontweight='bold')
plt.xlabel("IMDb Score")
plt.ylabel("Number of Titles")
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()



##### 1. Why did you pick the specific chart?



1.   Best way to see audience rating spread.


2.   Helps identify if Prime content is generally well-rated or mixed.



##### 2. What is/are the insight(s) found from the chart?


1.   Most titles are rated between 5 and 7, showing that the majority of Amazon Prime content falls in the average-to-good range.

2.   Very few titles are below 3 (poorly rated) or above 8 (exceptionally rated).

3.   The distribution is roughly bell-shaped, peaking around 6.
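
The ranges quoted above can be backed by numbers by binning the scores. A sketch with invented scores; the bin edges are my assumption, chosen to match the ranges mentioned in the insights.

```python
import pandas as pd

# Invented scores for illustration; None mimics a missing imdb_score
scores = pd.Series([2.5, 5.1, 5.9, 6.4, 6.8, 7.2, 8.3, None], name="imdb_score")

# Bins are right-closed: "5-7" means 5 < score <= 7
bands = pd.cut(
    scores.dropna(),
    bins=[0, 3, 5, 7, 8, 10],
    labels=["<=3", "3-5", "5-7", "7-8", "8-10"],
)
share = bands.value_counts(normalize=True).sort_index().mul(100).round(1)

print(share)
# Skewness near 0 is consistent with a roughly symmetric, bell-like shape
print("skewness:", round(scores.skew(), 2))
```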



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact**

1.   Shows/movies with higher IMDb scores (7+) can be highlighted in recommendations to attract and retain quality-seeking viewers.

2.   Identifying low-rated content (<3) allows Amazon to phase out or improve recommendations for those titles.


**Negative Impact**

1.   The lack of highly rated content (>8) may reduce Amazon Prime’s competitiveness against rivals like Netflix or Disney+, which emphasize critically acclaimed shows.

2.   Over-concentration in average ratings (5–7) could make the platform feel less exciting for premium subscribers who expect more high-quality shows.





#### Chart 6 -pairplot

In [None]:
# Select numerical columns of interest
num_cols = ["imdb_score", "imdb_votes", "tmdb_popularity", "tmdb_score"]

# Pairplot with hue (Movies vs TV Shows)
sns.pairplot(merged_df[num_cols + ["type"]], hue="type", palette="Set2", diag_kind="kde")
plt.show()

##### 1. Why did you pick the specific chart?



1.   Explores how ratings, votes, and popularity interact.


2.   Hue (Movies vs TV Shows) lets us compare both categories simultaneously.



##### 2. What is/are the insight(s) found from the chart?


1.   Movies generally receive more votes and popularity spikes compared to TV shows, suggesting larger one-time audience engagement.

2.   IMDb and TMDB scores are positively correlated → a highly rated movie/show on one platform is usually rated high on the other.

3.   TV shows form separate clusters with moderate popularity and steadier rating trends → indicates consistent but smaller viewer groups.

4.   Outliers exist:

     a.Some movies are highly popular but not highly rated (overhyped titles).

     b.Some are highly rated but low popularity (hidden gems).







##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact**

1. Helps platforms identify hidden gems (high ratings but low popularity) → these can be promoted to boost engagement.

2. Understanding different audience patterns for Movies vs TV Shows can guide better content recommendations and marketing strategy

**Negative Impact**

1.   Overhyped but poorly rated titles may lead to viewer dissatisfaction, causing churn.

2.   Heavy clustering of TV shows in moderate popularity zones may indicate difficulty in achieving blockbuster hits for TV compared to movies.





#### Chart 7-Scatter Plot

In [None]:
#IMDb Score vs IMDb Votes
plt.figure(figsize=(10,6))
plt.scatter(title_df_clean['imdb_score'], title_df_clean['imdb_votes'], alpha=0.5, color='teal')

plt.title("IMDb Score vs IMDb Votes", fontsize=14, fontweight='bold')
plt.xlabel("IMDb Score")
plt.ylabel("IMDb Votes")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()



##### 1. Why did you pick the specific chart?



1.   Focuses on specific relationships.


2.   Shows whether higher ratings mean more audience votes, or whether popularity aligns with quality.



##### 2. What is/are the insight(s) found from the chart?

1.   A small group of titles with 700K–1.2M votes lies between IMDb scores of 7 and 9. These are mainstream blockbusters or hit shows, which suggests that popular, higher-quality titles attract both more viewers and more engagement.

2.   Titles with scores below 5 generally have very low votes.

3.   The densest region is 5–8 IMDb score with moderate votes. This matches IMDb’s general distribution, where most titles fall in the middle.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact**
1.    High-score & high-vote titles drive engagement → These blockbusters bring in massive audience traffic and should be promoted heavily.

2.    High-score but low-vote titles (hidden gems) → Opportunity for Amazon Prime to highlight these in recommendation systems or marketing campaigns to boost watch time.
3.    Middle cluster (5–8 IMDb score) → Represents the bulk of the catalog. Ensuring steady availability of this type of content keeps a large base of viewers engaged.

**Negative Impact**
1.    Low-score titles with low votes → These underperforming titles may drag down platform reputation if surfaced in recommendations.

2.    Overhyped titles (average score, very high votes) → Can create audience dissatisfaction if popularity doesn’t match quality.
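
The "hidden gem" and "overhyped" groups can be pulled out with simple thresholds. This is a sketch only: the rows are invented, and the cut-offs (7.5, 6.5, and the vote quartiles) are assumptions to tune against the real data.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "imdb_score": [8.6, 8.1, 6.0, 5.5, 4.2, 6.2],
    "imdb_votes": [900, 850_000, 40_000, 700_000, 300, 12_000],
})

score_hi = 7.5                                # assumed cut-off for a "high" score
score_mid = 6.5                               # assumed ceiling for an "average" score
votes_lo = toy["imdb_votes"].quantile(0.25)   # bottom quartile of votes
votes_hi = toy["imdb_votes"].quantile(0.75)   # top quartile of votes

# High score but few votes
hidden_gems = toy[(toy["imdb_score"] >= score_hi) & (toy["imdb_votes"] <= votes_lo)]
# Average-or-lower score but many votes
overhyped = toy[(toy["imdb_score"] < score_mid) & (toy["imdb_votes"] >= votes_hi)]

print("Hidden gems:", hidden_gems["title"].tolist())
print("Overhyped:", overhyped["title"].tolist())
```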

#### Chart 8-correlation heatmap

In [None]:
# Select only numeric columns (include='number' covers every integer and float dtype)
numeric_cols = title_df_clean.select_dtypes(include='number')

plt.figure(figsize=(10,6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap="coolwarm", fmt=".2f")

plt.title("Correlation Heatmap of Numeric Features", fontsize=14, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?



1.   Condenses relationships into one easy-to-read correlation matrix.


2.   Quickly highlights if some metrics are redundant or strongly linked.



##### 2. What is/are the insight(s) found from the chart?

1.    IMDb Score & TMDb Score (0.58): Both rating systems generally agree, but not perfectly → audiences on IMDb and TMDb have slightly different scoring behavior.
2.    IMDb Votes & TMDb Popularity (0.25): Popularity partly depends on votes but is also influenced by marketing, trending topics, or star power.
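
Vote counts are heavily skewed, so a rank-based correlation is a useful cross-check on the Pearson values in the heatmap. The sketch below uses invented numbers; Spearman is my suggested alternative, not what the chart above computes.

```python
import pandas as pd

# Invented values: votes grow by orders of magnitude, popularity grows steadily
toy = pd.DataFrame({
    "imdb_votes": [10, 100, 1_000, 10_000, 1_000_000],
    "tmdb_popularity": [1.0, 2.0, 3.0, 4.0, 5.0],
})

pearson = toy.corr(method="pearson").loc["imdb_votes", "tmdb_popularity"]
spearman = toy.corr(method="spearman").loc["imdb_votes", "tmdb_popularity"]

# Pearson is pulled down by the single extreme vote count;
# Spearman only looks at the ordering of the values
print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Passing `method="spearman"` to `numeric_cols.corr()` would give the rank-based version of the same heatmap.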

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact**
1.  Consistency across IMDb and TMDb scores helps validate audience opinions across platforms.

2.  Weak correlation between votes and popularity suggests there are “hidden gems” (high scores, low popularity) that Amazon can promote.

3.  Clear distinction between movies vs TV shows (runtime vs seasons) validates the dataset quality.

**Negative Impact**
1.  Popularity doesn’t always align with ratings → some titles may be overhyped but low-rated, leading to audience disappointment.

2.  Low correlation between release year and ratings suggests newer content is not guaranteed to rate better, so investing in titles based on recency alone is risky.

#### Chart 9-Bubble Chart

In [None]:
# IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?

plt.figure(figsize=(12,8))
# Main scatter plot
plt.scatter(
    merged_df['imdb_score'],
    merged_df['imdb_votes'],
    s=merged_df['imdb_votes']**0.5,  # Bubble size proportional to votes
    alpha=0.5,
    c=merged_df['imdb_score'],
    cmap='coolwarm'
)

plt.xlabel('IMDb Rating')
plt.ylabel('Number of Votes (Popularity)')
plt.title('Popularity vs IMDb Rating (Bubble Chart)')
plt.yscale('log')
plt.colorbar(label='IMDb Rating')

# Identify top-rated and most-voted titles
top_rated = merged_df.loc[merged_df['imdb_score'].idxmax()]
top_voted = merged_df.loc[merged_df['imdb_votes'].idxmax()]

# Highlight Top-rated (big bubble + purple outline)
plt.scatter(
    top_rated['imdb_score'], top_rated['imdb_votes'],
    s=500, facecolors='none', edgecolors='purple', linewidths=2.5
)

# Highlight Most-voted (big bubble + green outline)
plt.scatter(
    top_voted['imdb_score'], top_voted['imdb_votes'],
    s=500, facecolors='none', edgecolors='green', linewidths=2.5
)

# Top-rated label (purple text)
plt.annotate(
    f"{top_rated['title']} (Top Rated)",
    (top_rated['imdb_score'], top_rated['imdb_votes']),
    xytext=(-80,-30), textcoords='offset points', fontsize=10, color='purple',
    arrowprops=dict(arrowstyle="->", color='purple', lw=1.5)
)

# Most-voted label (green text)
plt.annotate(
    f"{top_voted['title']} (Most Voted)",
    (top_voted['imdb_score'], top_voted['imdb_votes']),
    xytext=(10,-15), textcoords='offset points', fontsize=10, color='green',
    arrowprops=dict(arrowstyle="->", color='green', lw=1.5)
)

plt.show()


##### 1. Why did you pick the specific chart?



*   Bubble size adds a third dimension (e.g., popularity), making it visually engaging.


*   Great way to end, because it moves from general patterns to specific star performers.



##### 2. What is/are the insight(s) found from the chart?

1.   Titanic sits high on the Y-axis (around 1 million votes), showing extreme popularity, though its rating is good but not the highest.

2.   Pawankhind is close to the perfect rating (near 10), but has far fewer votes.

3.   The top-rated movies vary widely in vote counts — some are barely voted on, others have decent visibility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact**

1. Popular Movies Are Generally Well-Rated:Many of the most voted movies (like Titanic) fall within the 7–8.5 IMDb rating range.
This suggests a positive correlation between popularity and quality — widely watched movies tend to be well-received.

2. High Ratings Exist Across Vote Counts:Some movies with lower vote counts still have very high ratings (e.g., Pawankhind).
Indicates there is room for niche or regional films to stand out for quality, even without mass visibility.

**Negative Impact**

1.  Some extremely popular movies have moderate ratings, suggesting that hype or brand value can overshadow quality.
2.  Movies like Pawankhind have near-perfect ratings but relatively low vote counts, so such scores rest on a small number of voters and may not hold up with a wider audience.
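
One way to compare titles with very different vote counts is a vote-weighted (Bayesian-style) average that pulls low-vote scores toward the catalog mean. This is a swapped-in technique for illustration, not part of the analysis above; the rows are invented and the threshold `m` is an assumption.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    "title": ["Niche film", "Blockbuster", "Average title"],
    "imdb_score": [9.8, 8.2, 6.0],
    "imdb_votes": [50, 1_000_000, 20_000],
})

C = toy["imdb_score"].mean()   # prior: mean score across the catalog
m = 5_000                      # assumed number of votes needed to trust a score

v, R = toy["imdb_votes"], toy["imdb_score"]
# Few votes -> result stays close to C; many votes -> result stays close to R
toy["weighted_score"] = (v / (v + m)) * R + (m / (v + m)) * C

ranked = toy.sort_values("weighted_score", ascending=False)
print(ranked[["title", "imdb_score", "imdb_votes", "weighted_score"]])
```

With this weighting the low-vote title no longer tops the ranking by raw score alone, which makes "top rated" lists less sensitive to small samples.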

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1.  Diversify Genres:
Reduce over-reliance on Drama and Comedy by adding more Documentaries, Horror, and Family content to attract niche audiences.

2.  Increase TV Show Content:
With 86% movies, Amazon Prime should expand TV series and limited shows to boost long-term engagement.

3.  Improve Metadata & Certifications:
Fill missing age certifications and standardize ratings to enhance user trust and parental control features.

4.  Focus on Quality Over Quantity:
Invest in high-rated originals and remove or deprioritize low-rated titles to maintain a strong brand image.

5.  Promote Hidden Gems:
Highlight highly rated but less popular titles through targeted recommendations and marketing campaigns.

6.  Leverage Growth After 2020:
Continue expanding content libraries and regional offerings to maintain post-COVID streaming momentum.

7.  Use Analytics Continuously:
Build dashboards to track genre trends, audience ratings, and engagement for data-driven decisions.

# **Conclusion**

Amazon Prime Video’s content library is movie-dominant, globally diverse, and steadily expanding — especially post-2020.
The platform’s strong focus on Drama and Comedy ensures mass appeal, but overreliance on these genres may limit future growth in niche markets.
Ratings data shows consistent, moderate-quality content with a few standout titles.
Audience engagement patterns indicate that popularity and quality don’t always overlap — suggesting Amazon Prime balances between commercial success and artistic diversity.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***