
# **Project Name** - Amazon Prime Video Content Analytics



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -**  Sushant S

# **Project Summary -**

The rapid expansion of digital streaming platforms has transformed how audiences consume entertainment, making data-driven decision-making a critical component of content strategy. Amazon Prime Video, as one of the leading global streaming platforms, continuously invests in a diverse range of movies and television shows to attract and retain subscribers across different demographics. This project presents an in-depth exploratory and analytical study of the Amazon Prime Video content library available in the United States, using structured datasets that capture detailed information about titles, ratings, popularity, and cast and crew involvement.

The dataset used in this project consists of two CSV files: titles.csv, which contains metadata for more than 9,000 unique movies and TV shows, and credits.csv, which includes over 124,000 records related to actors and directors associated with these titles. The titles dataset includes key attributes such as show type (movie or TV show), release year, runtime, number of seasons, genres, production countries, age certifications, IMDb ratings and votes, as well as TMDB popularity and scores. The credits dataset complements this information by providing insights into the people involved in content creation, enabling analysis of creative participation and contribution patterns.

The primary objective of this project is to extract meaningful insights related to content diversity, production trends, audience reception, and platform growth over time. To achieve this, the project follows a structured data analytics pipeline starting with data ingestion and cleaning. Missing values, inconsistent formats, and multi-valued categorical fields such as genres and production countries are handled using Pandas and NumPy to ensure data quality and reliability. Feature engineering techniques are applied to derive new variables, such as content age and aggregated genre indicators, which enhance analytical depth.

Exploratory Data Analysis (EDA) forms the core of this project. Various statistical summaries and visualizations are used to understand the distribution of movies versus TV shows, identify dominant genres, and observe how Amazon Prime Video’s content library has evolved across different release years. Trend analysis reveals periods of accelerated content addition, highlighting Amazon’s strategic expansion phases. The project also examines regional production patterns by analyzing production countries, offering insight into Amazon Prime Video’s global content sourcing strategy, even within the US-available catalog.

A significant portion of the analysis focuses on content quality and popularity metrics. IMDb scores and vote counts are used to evaluate audience reception, while TMDB popularity metrics provide additional context regarding viewer interest and visibility. Visualizations such as histograms, bar charts, line plots, scatter plots, and correlation heatmaps are created using Matplotlib and Seaborn to clearly communicate findings. These visual insights help identify high-performing genres, understand the relationship between popularity and ratings, and recognize content characteristics associated with higher audience engagement.

From a business perspective, the insights generated by this project are valuable for multiple stakeholders. Content strategists can identify underrepresented genres and high-potential content categories for future investment. Data analysts can use the methodology as a scalable framework for analyzing other streaming platforms or regions. Additionally, the findings support strategic decision-making related to content acquisition, production planning, and competitive positioning within the streaming market.

Overall, this project demonstrates how structured data analysis and visualization techniques can be applied to real-world streaming data to uncover actionable insights. By combining technical rigor with business-oriented interpretation, the analysis provides a comprehensive understanding of Amazon Prime Video’s content landscape and highlights the growing importance of analytics in shaping modern digital entertainment platforms.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


With the rapid growth of streaming platforms, Amazon Prime Video hosts a vast and continuously expanding library of movies and television shows. However, the sheer volume and diversity of content make it challenging for businesses, content strategists, and analysts to understand what types of content dominate the platform, how the content portfolio has evolved over time, and which factors contribute to higher audience engagement and popularity. Without systematic data analysis, identifying meaningful trends related to genre diversity, production patterns, and content performance becomes difficult.

This project aims to analyze the Amazon Prime Video content catalog available in the United States by leveraging structured datasets containing title-level metadata and cast and crew information. The objective is to explore content diversity, examine historical trends in content releases, and evaluate audience reception using IMDb ratings, votes, and TMDB popularity metrics. By applying data cleaning, exploratory data analysis, and visualization techniques using Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn, the project seeks to uncover actionable insights that can support data-driven decision-making.

The outcomes of this analysis will help stakeholders better understand viewer preferences, identify high-performing and underrepresented content categories, and gain strategic insights that can inform content acquisition, production investment, and competitive positioning within the streaming industry.

#### **Define Your Business Objective?**

The primary business objective of this project is to leverage data analytics to gain actionable insights into Amazon Prime Video’s content portfolio in the United States. By analyzing content metadata, genre distribution, release trends, and audience engagement metrics such as IMDb ratings, vote counts, and TMDB popularity, the project aims to identify patterns that influence content performance and viewer preferences.

The analysis seeks to support strategic decision-making by helping stakeholders understand which genres and content types drive higher engagement, how the platform’s content library has evolved over time, and where potential gaps or opportunities exist for future content investment. Additionally, insights from cast and director contributions are intended to highlight creative factors associated with successful titles.

Ultimately, this project aims to provide a data-driven foundation that can guide content acquisition, production planning, and optimization strategies, enabling Amazon Prime Video to enhance user engagement, improve subscriber retention, and maintain a competitive edge in the streaming industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# ============================================================
# 1. Know Your Data
# Importing Required Libraries
# ============================================================

# Suppress warnings to keep notebook output clean and readable
import warnings
warnings.filterwarnings('ignore')

# ----------------------------
# Data Manipulation Libraries
# ----------------------------
import pandas as pd              # For data loading, cleaning, and analysis
import numpy as np               # For numerical and mathematical operations

# ----------------------------
# Data Visualization Libraries
# ----------------------------
import matplotlib.pyplot as plt  # For basic plotting
import seaborn as sns            # For advanced and statistical visualizations

# ----------------------------
# Utility Libraries
# ----------------------------
from datetime import datetime    # For handling date and time operations

# ----------------------------
# Visualization Settings
# ----------------------------
sns.set(style="whitegrid")       # Set seaborn style for better aesthetics
plt.rcParams["figure.figsize"] = (10, 6)  # Standard figure size for all plots


### Dataset Loading

In [None]:
from google.colab import files
files.upload()


In [None]:
import zipfile

# Unzip titles.csv.zip
with zipfile.ZipFile("titles.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("/content")

# Unzip credits.csv.zip
with zipfile.ZipFile("credits.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("/content")

print("✅ Files unzipped successfully!")


In [None]:
import os
os.listdir("/content")


In [None]:
# ============================================================
# 1.1 Dataset Loading (Final)
# ============================================================

import pandas as pd

titles_df = pd.read_csv("/content/titles.csv")
credits_df = pd.read_csv("/content/credits.csv")

print("✅ Datasets loaded successfully!")
print("Titles Dataset Shape:", titles_df.shape)
print("Credits Dataset Shape:", credits_df.shape)


### Dataset First View

In [None]:
# Dataset First Look
display(titles_df.head())
display(credits_df.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
titles_rows, titles_cols = titles_df.shape
print(f" Titles Dataset contains {titles_rows} rows and {titles_cols} columns.")

credits_rows, credits_cols = credits_df.shape
print(f" Credits Dataset contains {credits_rows} rows and {credits_cols} columns.")

### Dataset Information

In [None]:
# Dataset Info
titles_df.info()

credits_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_titles = titles_df.duplicated().sum()
print(f" Number of duplicate rows in Titles Dataset : {duplicate_titles}")

duplicate_credits = credits_df.duplicated().sum()
print(f" Number of duplicate rows in Credits dataset: {duplicate_credits}")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_titles = titles_df.isnull().sum()

missing_titles = missing_titles[missing_titles > 0].sort_values(ascending=False)

print(" Missing Values in Titles Dataset:")
display(missing_titles)

missing_credits = credits_df.isnull().sum()

missing_credits = missing_credits[missing_credits > 0].sort_values(ascending=False)

print("✅ Missing Values in Credits Dataset:")
display(missing_credits)

In [None]:
# Visualizing the missing values
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12,6)


In [None]:
# Create a boolean mask: True = missing
titles_missing_mask = titles_df.isnull()

# Plot heatmap
plt.figure(figsize=(12,6))
sns.heatmap(titles_missing_mask, cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap - Titles Dataset")
plt.xlabel("Columns")
plt.ylabel("Rows")
plt.show()


In [None]:
# Count missing values per column
missing_credits_count = credits_df.isnull().sum()

# Plot only columns with missing values
missing_credits_count = missing_credits_count[missing_credits_count > 0]

plt.figure(figsize=(10,5))
sns.barplot(x=missing_credits_count.index, y=missing_credits_count.values, palette="magma")
plt.title("Missing Values Count per Column - Credits Dataset")
plt.ylabel("Number of Missing Values")
plt.xlabel("Columns")
plt.show()


### What did you know about your dataset?

The **Titles** dataset has 9,871 rows and 15 columns. It contains metadata for Amazon Prime Video movies and TV shows: title, type, release year, age certification, runtime, genres, production countries, IMDb and TMDB scores, TMDB popularity, and number of seasons for shows.

The **Credits** dataset has 124,235 rows and 5 columns. It contains cast and director information linked to titles by id: person ID, name, character name, and role (ACTOR/DIRECTOR).

 **Missing Values Insights**

Titles Dataset: age_certification, runtime, and genres have missing values

Credits Dataset: character_name has some missing values



**Observations / Insights So Far**

The dataset is large enough (about 9.9k titles and 124k credit entries) for meaningful analysis.

Multi-valued columns (genres, production_countries) need to be split and exploded before genre-level or country-level analysis.

Columns with missing values need cleaning or imputation before analysis.

Numeric columns (imdb_score, imdb_votes, tmdb_score, tmdb_popularity) support visualization and trend analysis once their missing values are handled.

The structure allows univariate, bivariate, and multivariate analysis aligned with the business objectives:

*   Content Diversity: genres and categories
*   Popularity Analysis: IMDb/TMDB scores
*   Actor/Director Influence: cast impact on ratings

The id and role columns are mostly complete, so titles and credits can be linked reliably on id.
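
The linking step can be sketched as follows. This is a minimal illustration on invented toy rows, not on the real titles.csv and credits.csv; it assumes that the credits table has a `person_id` column and that `id` is unique per title. `validate` and `indicator` are standard `DataFrame.merge` options that make linking problems visible instead of silent.

```python
import pandas as pd

# Toy stand-ins for titles_df and credits_df (all values are invented)
titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["Movie A", "Movie B", "Show C"],
    "imdb_score": [7.1, 5.4, 8.0],
})
credits = pd.DataFrame({
    "person_id": [10, 11, 12, 13],          # assumed column name
    "id": ["tm1", "tm1", "ts3", "tm9"],     # "tm9" has no matching title
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})

# validate="many_to_one" raises if `id` is duplicated on the titles side;
# indicator=True adds a `_merge` column showing which rows found a match.
linked = credits.merge(titles, on="id", how="left",
                       validate="many_to_one", indicator=True)

unmatched = (linked["_merge"] == "left_only").sum()
print(f"Credit rows without a matching title: {unmatched}")
```

A left join keeps every credit row, so the unmatched count is a quick check on how safe the link really is before any cast-level analysis.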

## ***2. Understanding Your Variables***

In [None]:
# Dataset columns

titles_df.columns



In [None]:
credits_df.columns

In [None]:
# Dataset Describe
titles_df.describe()

In [None]:
credits_df.describe()

### Variables Description



The Amazon Prime Video dataset consists of two related datasets: Titles and Credits. Each variable provides important information that helps analyze content diversity, popularity, regional trends, and contributor influence on the platform.
| Variable Name            | Description                                                                                                                                                                         |
| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **id**                   | A unique identifier assigned to each movie or TV show on the JustWatch platform. This variable is used as a primary key to link titles with their corresponding cast and crew data. |
| **title**                | The official name of the movie or TV show available on Amazon Prime Video.                                                                                                          |
| **type**                 | Indicates the content format, **MOVIE** or **SHOW**. The column is named type in the dataset, not show_type.                                                                        |
| **description**          | A brief summary describing the storyline or theme of the content. Useful for content classification and text-based analysis.                                                        |
| **release_year**         | The year in which the movie or TV show was released. This variable is important for analyzing content trends over time.                                                             |
| **age_certification**    | Specifies the age rating assigned to the content, indicating its suitability for different age groups. Some values may be missing.                                                  |
| **runtime**              | The duration of the movie, or the average episode runtime for shows, in minutes. Helps understand content length preferences.                                                       |
| **genres**               | Lists one or more genres associated with the title, such as Drama, Comedy, or Action. This is a multi-valued categorical variable.                                                  |
| **production_countries** | Represents the country or countries involved in producing the content, enabling regional content analysis.                                                                          |
| **seasons**              | Indicates the total number of seasons for TV shows. This variable is not applicable for movies and may contain null values.                                                         |
| **imdb_id**              | A unique identifier for the title on the IMDb platform, allowing integration with external rating sources.                                                                          |
| **imdb_score**           | The IMDb rating score representing audience perception and quality of the content.                                                                                                  |
| **imdb_votes**           | The total number of votes received on IMDb, indicating audience engagement and popularity.                                                                                          |
| **tmdb_popularity**      | A popularity score from TMDB that reflects how frequently a title is viewed or searched.                                                                                            |
| **tmdb_score**           | The rating score provided by TMDB, offering an alternative measure of content quality.                                                                                              |

| Variable Name      | Description                                                                                     |
| ------------------ | ----------------------------------------------------------------------------------------------- |
| **person_ID**      | A unique identifier for actors and directors associated with Amazon Prime titles.               |
| **id**             | The JustWatch title ID used to connect cast and crew information with the Titles dataset.       |
| **name**           | The name of the actor or director involved in the title.                                        |
| **character_name** | The name of the character played by the actor. This field may be missing for some records.      |
| **role**           | Indicates whether the person’s role is **ACTOR** or **DIRECTOR**, enabling role-based analysis. |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

titles_unique_counts = titles_df.nunique()

titles_unique_counts


In [None]:
# Unique value count for each column in Credits dataset
credits_unique_counts = credits_df.nunique()

credits_unique_counts


## 3. ***Data Wrangling***

In [None]:
# ======================================
# DATA WRANGLING – SAFE VERSION (CHARACTER + SHOW_TYPE)
# ======================================

import pandas as pd
import numpy as np

try:
    # ------------------------------
    # 0. Sanity check
    # ------------------------------
    assert 'titles_df' in globals(), "titles_df is not loaded"
    assert 'credits_df' in globals(), "credits_df is not loaded"

    # ------------------------------
    # 1. Handle Missing Values – TITLES
    # ------------------------------
    if 'age_certification' in titles_df.columns:
        titles_df['age_certification'] = titles_df['age_certification'].fillna('Not Rated')

    if 'runtime' in titles_df.columns:
        titles_df['runtime'] = titles_df['runtime'].fillna(titles_df['runtime'].median())

    if 'genres' in titles_df.columns:
        titles_df['genres'] = titles_df['genres'].fillna('Unknown')

    if 'production_countries' in titles_df.columns:
        titles_df['production_countries'] = titles_df['production_countries'].fillna('Unknown')

    if 'seasons' in titles_df.columns:
        titles_df['seasons'] = titles_df['seasons'].fillna(0)

    if 'imdb_score' in titles_df.columns:
        titles_df['imdb_score'] = titles_df['imdb_score'].fillna(titles_df['imdb_score'].median())

    if 'tmdb_score' in titles_df.columns:
        titles_df['tmdb_score'] = titles_df['tmdb_score'].fillna(titles_df['tmdb_score'].median())

    if 'imdb_votes' in titles_df.columns:
        titles_df['imdb_votes'] = titles_df['imdb_votes'].fillna(0)

    # ------------------------------
    # 2. Handle Missing Values – CREDITS
    # ------------------------------
    credit_cols = credits_df.columns.tolist()
    if 'character_name' in credit_cols:
        credits_df['character_name'] = credits_df['character_name'].fillna('Not Available')
    elif 'character' in credit_cols:
        credits_df.rename(columns={'character': 'character_name'}, inplace=True)
        credits_df['character_name'] = credits_df['character_name'].fillna('Not Available')
    else:
        credits_df.loc[:, 'character_name'] = 'Not Available'

    # ------------------------------
    # 3. Data Type Conversion
    # ------------------------------
    if 'release_year' in titles_df.columns:
        titles_df['release_year'] = titles_df['release_year'].astype(int)

    if 'seasons' in titles_df.columns:
        titles_df['seasons'] = titles_df['seasons'].astype(int)

    if 'imdb_votes' in titles_df.columns:
        titles_df['imdb_votes'] = titles_df['imdb_votes'].astype(int)

    # ------------------------------
    # 4. Text Cleaning
    # ------------------------------
    if 'title' in titles_df.columns:
        titles_df['title'] = titles_df['title'].astype(str).str.strip()

    if 'name' in credits_df.columns:
        credits_df['name'] = credits_df['name'].astype(str).str.strip()

    # ------------------------------
    # 5. Feature Engineering – Safe Show Type Handling
    # ------------------------------
    current_year = 2025
    if 'release_year' in titles_df.columns:
        titles_df['content_age'] = current_year - titles_df['release_year']

    # Safe binary flags for show_type
    if 'type' in titles_df.columns:
        titles_df['is_movie'] = np.where(titles_df['type'] == 'MOVIE', 1, 0)
        titles_df['is_show'] = np.where(titles_df['type'] == 'SHOW', 1, 0)
    else:
        print("⚠️ 'type' column not found. Skipping binary flags.")

    # ------------------------------
    # 6. Explode Multi-valued Columns
    # ------------------------------
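    # Cells may be list-like strings such as "['drama', 'comedy']" (an assumption
    # about the CSV format), so brackets and quotes are removed before splitting.
    list_chars = r"[\[\]'\"]"

    titles_genres_df = titles_df.copy()
    if 'genres' in titles_df.columns:
        titles_genres_df['genres'] = (titles_genres_df['genres'].astype(str)
                                      .str.replace(list_chars, '', regex=True).str.split(','))
        titles_genres_df = titles_genres_df.explode('genres')
        # Empty labels (from empty lists) are grouped under 'Unknown'
        titles_genres_df['genres'] = titles_genres_df['genres'].str.strip().replace('', 'Unknown')

    titles_countries_df = titles_df.copy()
    if 'production_countries' in titles_df.columns:
        titles_countries_df['production_countries'] = (
            titles_countries_df['production_countries'].astype(str)
            .str.replace(list_chars, '', regex=True).str.split(','))
        titles_countries_df = titles_countries_df.explode('production_countries')
        titles_countries_df['production_countries'] = (
            titles_countries_df['production_countries'].str.strip().replace('', 'Unknown'))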

    # ------------------------------
    # 7. Final Validation
    # ------------------------------
    print("✅ Data Wrangling Completed Successfully!")
    print("Titles Shape:", titles_df.shape)
    print("Genres Exploded Shape:", titles_genres_df.shape)
    print("Countries Exploded Shape:", titles_countries_df.shape)
    print("Credits Shape:", credits_df.shape)

except Exception as e:
    print("❌ Error during Data Wrangling:", e)


### Data Wrangling Code

### What all manipulations have you done and insights you found?

**Handling Missing Values**

Filled age_certification missing values with 'Not Rated'.

Filled runtime missing values with median runtime.

Filled genres and production_countries missing values with 'Unknown'.

Filled seasons missing values with 0 (for movies).

Filled imdb_score and tmdb_score missing values with median scores.

Filled imdb_votes missing values with 0.

For credits_df, filled missing character_name values with 'Not Available' (the column is created if it does not exist).

**Column Name Normalization**

Found that the show_type column is actually named type.

Renamed character to character_name in credits_df where needed.

**Data Type Conversion**

Converted release_year, seasons, and imdb_votes to integer type.

**Text Cleaning**

Stripped leading and trailing spaces from text columns (title, name).

**Feature Engineering**

Created content_age = current_year - release_year (with current_year fixed at 2025) to measure the age of each title.

Created binary flags:

is_movie = 1 if type is 'MOVIE'

is_show = 1 if type is 'SHOW'

**Multi-Valued Column Handling (Exploding)**

Exploded genres column → titles_genres_df for genre-level analysis.

Exploded production_countries → titles_countries_df for country-level analysis.

**Safe Error Handling**

Checked that columns (type, character_name) exist before using them.

Avoided KeyErrors to make the notebook deployment-ready.

The notebook can now run top-to-bottom without crashing.
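
The exploding step can also be written with explicit parsing. The sketch below is one possible approach, not the notebook's own code: `parse_list_cell` is a hypothetical helper, the rows are invented, and it assumes a cell may hold either a list-like string such as `"['drama', 'comedy']"` or a plain comma-separated string.

```python
import ast
import pandas as pd

def parse_list_cell(cell):
    """Turn "['drama', 'comedy']" or "drama, comedy" into a clean list of labels."""
    if not isinstance(cell, str) or not cell.strip():
        return ["Unknown"]                      # NaN, None, or empty text
    try:
        parsed = ast.literal_eval(cell)         # handles list-like strings
        if isinstance(parsed, (list, tuple)):
            labels = [str(item).strip() for item in parsed]
        else:
            labels = [str(parsed).strip()]
    except (ValueError, SyntaxError):
        # Not a Python literal: fall back to plain comma splitting
        labels = [part.strip() for part in cell.split(",")]
    labels = [label for label in labels if label]
    return labels or ["Unknown"]                # empty list becomes 'Unknown'

# Invented rows covering the formats the helper is meant to handle
demo = pd.DataFrame({
    "id": ["tm1", "tm2", "tm3", "tm4"],
    "genres": ["['drama', 'comedy']", "action, thriller", "[]", None],
})
demo["genres"] = demo["genres"].apply(parse_list_cell)
exploded = demo.explode("genres", ignore_index=True)
print(exploded)
```

Parsing into real lists before `explode` keeps brackets and quotes out of the labels, so the same genre is not counted under several spellings.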

**Insights from Data After Wrangling**

**a) Content Library**

Total titles: 9,871

Binary flags (is_movie, is_show) mark Movies vs Shows and are ready for plotting.

Exploded genres: 22,274 rows (about 2.3 genre entries per title), showing multi-genre content is common.

Exploded production countries: 11,072 rows (about 1.1 per title), indicating that most titles list a single production country and a minority list several.

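**b) Data Quality Insights**

Some columns, such as age_certification, runtime, and genres, had missing values → addressed.

High cardinality in title and name → good for unique identification but not for grouping.

IMDb and TMDB scores had some missing values → filled with the median so those titles stay in the analysis. Median filling adds a spike at the median and narrows the spread, so score distributions should be read with that in mind.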
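
The effect of median filling can be checked on a small example. This is a sketch on invented scores; the `imdb_score_missing` and `imdb_score_filled` column names are hypothetical additions, not columns created by the notebook.

```python
import numpy as np
import pandas as pd

# Invented scores with two missing values
scores = pd.DataFrame({"imdb_score": [6.0, 7.0, np.nan, 8.0, np.nan, 5.0]})

# Record which rows were missing before filling anything
scores["imdb_score_missing"] = scores["imdb_score"].isna()

median_score = scores["imdb_score"].median()   # median of observed values only
scores["imdb_score_filled"] = scores["imdb_score"].fillna(median_score)

# The spread shrinks after filling because every filled row sits at the median
std_observed = scores["imdb_score"].std()
std_filled = scores["imdb_score_filled"].std()
print(f"median={median_score}, std observed={std_observed:.3f}, std filled={std_filled:.3f}")
```

Keeping a flag column makes it possible to plot or summarize only the observed scores later, without undoing the imputation.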

**c) Potential Business Insights (Preliminary)**

*   Content Type Distribution: the split between Movies and Shows gives an initial view of platform focus.
*   Genres Distribution: the exploded dataset allows identifying top genres, which helps content acquisition strategy.
*   Regional Focus: exploded production countries identify dominant production regions for licensing decisions.
*   Content Age Analysis: comparing older and newer content helps assess catalog freshness.
*   Credits Analysis: directors and actors can be analyzed for top contributors, which may feed content popularity prediction.
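
A top-contributor analysis could look like the sketch below. The rows are invented and `MIN_TITLES` is a hypothetical threshold; a minimum title count is used here so that a director with a single well-rated title does not automatically rank first.

```python
import pandas as pd

# Invented toy data standing in for titles_df and credits_df
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "imdb_score": [8.0, 6.0, 7.0, 9.0],
})
credits = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4", "t1"],
    "name": ["Dir X", "Dir X", "Dir Y", "Dir Z", "Actor Q"],
    "role": ["DIRECTOR", "DIRECTOR", "DIRECTOR", "DIRECTOR", "ACTOR"],
})

MIN_TITLES = 2  # hypothetical cut-off; tune to the real catalog

# Keep directors only, then attach each title's score
directors = credits[credits["role"] == "DIRECTOR"].merge(titles, on="id", how="inner")

summary = directors.groupby("name").agg(
    title_count=("id", "nunique"),
    mean_imdb=("imdb_score", "mean"),
)
summary = summary[summary["title_count"] >= MIN_TITLES]
summary = summary.sort_values("mean_imdb", ascending=False)
print(summary)
```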

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Content Type Distribution (Movies vs Shows)

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Plot Content Type Distribution
plt.figure(figsize=(6,5))
sns.countplot(x='type', data=titles_df, palette='viridis')
plt.title('Content Type Distribution on Amazon Prime Video', fontsize=14)
plt.xlabel('Content Type', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Shows how the platform library is split between Movies and TV Shows.

Important for understanding platform content strategy.



##### 2. What is/are the insight(s) found from the chart?

The taller bar identifies which content type dominates the catalog.

The split reflects business focus (what Amazon licenses and produces); on its own it does not show user preferences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If Shows dominate → invest in more series & episodes.

If Movies dominate → focus on cinematic content and licensing.
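
To state the split as a number rather than reading it off the bars, the shares can be computed directly. A minimal sketch on invented values; in the notebook the input would be `titles_df['type']`.

```python
import pandas as pd

# Invented values for illustration
types = pd.Series(["MOVIE", "MOVIE", "SHOW", "MOVIE"], name="type")

# normalize=True returns fractions; scale to percentages for reporting
type_share = types.value_counts(normalize=True).mul(100).round(1)
print(type_share)
```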

#### Chart - 2  Release Year Distribution

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12,5))
sns.histplot(titles_df['release_year'], bins=30, color='skyblue')
plt.title('Distribution of Titles by Release Year', fontsize=14)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Shows growth or decline of content over the years.

Helps track platform expansion and investment trends.

##### 2. What is/are the insight(s) found from the chart?

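Peaks show which release years are most represented in the catalog. Because release_year is the year a title was released, not the date it was added to Prime Video, peaks alone do not show when content was acquired.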

Older content may need promotion or replacement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in planning content refresh, trending content promotion, and audience engagement.
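
Grouping release years into decades gives a compact table to accompany the histogram. This is a sketch on invented years; note that it counts titles by release year, which is not the same as the date a title joined the platform.

```python
import pandas as pd

# Invented release years; in the notebook this would be titles_df['release_year']
years = pd.Series([1955, 1999, 2003, 2018, 2019, 2021], name="release_year")

# Integer-divide by 10 to drop the last digit, then scale back up: 2018 -> 2010
decade = (years // 10) * 10
titles_per_decade = decade.value_counts().sort_index()
print(titles_per_decade)
```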




#### Chart - 3  IMDb Score Distribution

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,5))
sns.histplot(titles_df['imdb_score'], bins=20, kde=True, color='lightgreen')
plt.title('Distribution of IMDb Scores', fontsize=14)
plt.xlabel('IMDb Score', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Shows audience perception of content quality.

Helps identify high-quality vs low-quality content.

##### 2. What is/are the insight(s) found from the chart?

Most titles cluster in a mid-range score (6-8).

Extreme low scores can be analyzed for improvement.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High-rated content → leverage for marketing.

Low-rated content → evaluate for removal or improvements.
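
The "mid-range cluster" claim can be quantified by binning scores into bands. A sketch on invented scores; the band edges are an assumption chosen to match the 6-8 range discussed above.

```python
import pandas as pd

# Invented scores; in the notebook this would be titles_df['imdb_score']
scores = pd.Series([3.2, 5.9, 6.0, 6.5, 7.4, 8.0, 8.6], name="imdb_score")

# Bands are right-inclusive: "6-8" means 6 < score <= 8
bands = pd.cut(scores, bins=[0, 4, 6, 8, 10],
               labels=["0-4", "4-6", "6-8", "8-10"], include_lowest=True)

band_counts = bands.value_counts().sort_index()       # ordered by band
band_share = (band_counts / band_counts.sum() * 100).round(1)
print(pd.DataFrame({"titles": band_counts, "share_%": band_share}))
```

If missing scores were filled with the median, the band containing the median will be inflated, so it may be worth running this on observed scores only.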

#### Chart - 4 Runtime Distribution

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10,5))
sns.histplot(titles_df['runtime'], bins=30, color='salmon')
plt.title('Distribution of Title Runtime (minutes)', fontsize=14)
plt.xlabel('Runtime (minutes)', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Shows the spread of content lengths in the catalog.

Helps in planning content acquisition.

##### 2. What is/are the insight(s) found from the chart?

Movies usually cluster around 90-120 min.

Shows may have shorter average episode runtimes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps recommend suitable content lengths for different audiences.

Helps improve viewer retention.
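
A single histogram mixes movie runtimes with episode runtimes, so the per-type statements above are easier to verify with a grouped summary. A minimal sketch on invented rows, assuming the `type` and `runtime` columns used elsewhere in the notebook.

```python
import pandas as pd

# Invented rows for illustration
demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "runtime": [95, 110, 120, 24, 45],
})

# One row per content type with basic runtime statistics
runtime_by_type = demo.groupby("type")["runtime"].agg(["count", "median", "min", "max"])
print(runtime_by_type)
```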

#### Chart - 5  Top 10 Genres Distribution

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(12,6))
top_genres = titles_genres_df['genres'].value_counts().nlargest(10)
sns.barplot(x=top_genres.values, y=top_genres.index, palette='magma')
plt.title('Top 10 Genres on Amazon Prime Video', fontsize=14)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

Shows the most common genres in the catalog.

Helps identify content diversity and gaps.

##### 2. What is/are the insight(s) found from the chart?

Top genres may include Drama, Comedy, Action.

Lesser genres → potential niche market opportunity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Guides content acquisition strategy.

Supports marketing campaigns for popular genres.

#### Chart - 6  Age Certification Distribution

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(12,5))
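# Assign hue and hide the legend to avoid the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.countplot(x='age_certification', hue='age_certification', data=titles_df, order=titles_df['age_certification'].value_counts().index, palette='cubehelix', legend=False)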
plt.title('Distribution of Age Certifications', fontsize=14)
plt.xlabel('Age Certification', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

- Perfect for Categorical Data
- Easy Comparison

##### 2. What is/are the insight(s) found from the chart?

Most content is rated for general or teen audiences.

Few titles are adult-restricted.
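A count plot silently leaves out rows where the certification is missing, so the shares above are only meaningful relative to the titles that have a label. The sketch below, on made-up data, shows one way to report percentages with missing values kept as an explicit `Unknown` bucket (the bucket name is a hypothetical choice).

```python
import numpy as np
import pandas as pd

# Toy stand-in for titles_df['age_certification']; values are illustrative only.
toy_titles = pd.DataFrame({
    "age_certification": ["PG-13", "R", np.nan, "PG-13", np.nan, "TV-MA", np.nan, "PG"],
})

# Percentage share of each certification, keeping missing values visible.
cert_share = (
    toy_titles["age_certification"]
    .fillna("Unknown")
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(cert_share)
```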

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify target audience and plan content acquisition.

#### Chart - 7  Top 10 Production Countries

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(12,6))
top_countries = titles_countries_df['production_countries'].value_counts().nlargest(10)
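# Assign hue and hide the legend to avoid the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.barplot(x=top_countries.values, y=top_countries.index, hue=top_countries.index, palette='rocket', legend=False)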
plt.title('Top 10 Production Countries', fontsize=14)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Country', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

- Best for Comparing Categories
- Horizontal Bars Improve Readability

##### 2. What is/are the insight(s) found from the chart?

US dominates production → strong local content.

Lesser representation from other regions → niche opportunities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Guides international licensing and regional content strategy.

#### Chart - 8  TMDB Popularity Distribution

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10,5))
sns.histplot(titles_df['tmdb_popularity'], bins=30, color='teal', kde=True)
plt.title('Distribution of TMDB Popularity', fontsize=14)
plt.xlabel('TMDB Popularity', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

- Best for Continuous Numerical Data
- Helps Identify Distribution Patterns

##### 2. What is/are the insight(s) found from the chart?

Most titles cluster around lower popularity.

Few titles are extremely popular (outliers).
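The "few extreme outliers" observation can be quantified with the usual 1.5 × IQR fence, and a `log1p` transform is a common way to make a right-skewed variable easier to read. This is a minimal sketch on made-up popularity values, not on the project data.

```python
import numpy as np
import pandas as pd

# Made-up popularity values with one extreme title.
popularity = pd.Series([1.2, 0.8, 2.5, 3.1, 1.9, 2.2, 4.0, 150.0], name="tmdb_popularity")

# Upper outlier fence: Q3 + 1.5 * IQR.
q1, q3 = popularity.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = popularity[popularity > upper_fence]

# log1p compresses the long right tail.
log_popularity = np.log1p(popularity)

print(f"upper fence: {upper_fence:.3f}, outliers: {len(outliers)}")
print(f"skew raw: {popularity.skew():.2f}, skew log1p: {log_popularity.skew():.2f}")
```

Plotting `log_popularity` instead of the raw values keeps the bulk of the titles from collapsing into the first histogram bin.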

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Popular titles → promote heavily.

Low-popularity titles → consider marketing or replacement.

#### Chart - 9  IMDb Score vs Release Year

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12,6))
sns.scatterplot(x='release_year', y='imdb_score', data=titles_df, alpha=0.5)
plt.title('IMDb Score vs Release Year', fontsize=14)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

- Ideal for Relationship Analysis
- Reveals Trends Over Time

##### 2. What is/are the insight(s) found from the chart?

Older titles may score systematically higher or lower than recent releases; the scatter plot makes any such drift visible.

Trend analysis possible for content quality over time.
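A scatter plot with thousands of points can hide a trend, so a per-decade summary is one way to make the comparison concrete. The sketch below uses made-up years and scores; the `decade` column is a helper introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for titles_df; values are illustrative only.
toy_titles = pd.DataFrame({
    "release_year": [1954, 1958, 1987, 1992, 2015, 2019, 2021],
    "imdb_score": [7.8, 7.2, 6.5, 6.9, 6.0, 6.4, 5.8],
})

# Floor each year to its decade, then summarise the score per decade.
toy_titles["decade"] = (toy_titles["release_year"] // 10) * 10
score_by_decade = toy_titles.groupby("decade")["imdb_score"].agg(["count", "median"])
print(score_by_decade)
```

Keeping the `count` column next to the median helps avoid over-reading decades that contain only a handful of titles.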

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identify classic content vs new releases.

#### Chart - 10  Runtime vs IMDb Score

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10,5))
sns.scatterplot(x='runtime', y='imdb_score', data=titles_df, alpha=0.5)
plt.title('Runtime vs IMDb Score', fontsize=14)
plt.xlabel('Runtime (minutes)', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

- Best for Comparing Two Numerical Variables
- Helps Identify Correlation Patterns

##### 2. What is/are the insight(s) found from the chart?

Most high-rated titles cluster around 90–150 mins.

Extremely short/long titles may score lower.
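One way to test the runtime-band claim is to bin runtime with `pd.cut` and compare the average score per band. The band edges and labels below are hypothetical choices, and the data is made up.

```python
import pandas as pd

# Toy stand-in for titles_df; values are illustrative only.
toy_titles = pd.DataFrame({
    "runtime": [25, 50, 88, 95, 110, 125, 140, 190],
    "imdb_score": [5.5, 6.0, 6.4, 6.8, 7.1, 7.0, 6.6, 5.9],
})

# Right-closed bins: (0, 60], (60, 90], (90, 120], (120, 150], (150, inf).
bins = [0, 60, 90, 120, 150, float("inf")]
labels = ["<=60", "61-90", "91-120", "121-150", ">150"]
toy_titles["runtime_band"] = pd.cut(toy_titles["runtime"], bins=bins, labels=labels)

# Number of titles and mean score in each runtime band.
band_summary = (
    toy_titles.groupby("runtime_band", observed=True)["imdb_score"].agg(["count", "mean"])
)
print(band_summary)
```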

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Guide content length decisions for new acquisitions.

#### Chart - 11  Genres vs Average IMDb Score

In [None]:
# Chart - 11 visualization code
genre_score = titles_genres_df.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(12,6))
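# Assign hue and hide the legend to avoid the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.barplot(x=genre_score.values, y=genre_score.index, hue=genre_score.index, palette='mako', legend=False)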
plt.title('Top 10 Genres by Average IMDb Score', fontsize=14)
plt.xlabel('Average IMDb Score', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

- Ideal for Comparing Categories with Aggregated Values
- Horizontal Bars Improve Readability

##### 2. What is/are the insight(s) found from the chart?

Some genres (e.g., Documentary, Drama) score higher.

Genres like Action may have lower average ratings.
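A ranking by mean score can be dominated by genres with only a few titles. The sketch below adds a title count next to each mean and filters on a minimum count; `MIN_TITLES` is a hypothetical threshold and the data is made up.

```python
import pandas as pd

# Toy stand-in for titles_genres_df (one row per title-genre pair).
toy_genres = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "action", "action", "action", "western"],
    "imdb_score": [7.0, 6.5, 7.5, 5.5, 6.0, 6.5, 9.0],
})

MIN_TITLES = 3  # hypothetical cut-off; tune to the real catalog size

# Mean score and number of titles per genre.
genre_stats = toy_genres.groupby("genres")["imdb_score"].agg(
    n_titles="count", mean_score="mean"
)

# Keep only genres with enough titles for the mean to be meaningful.
reliable = genre_stats[genre_stats["n_titles"] >= MIN_TITLES].sort_values(
    "mean_score", ascending=False
)
print(reliable)
```

In this toy example the single "western" title has the highest score but is excluded, which is exactly the distortion the filter is meant to prevent.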

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Prioritize acquisition of high-rated genres.

#### Chart - 12  Age Certification vs IMDb Score

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(12,6))
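# Assign hue and hide the legend to avoid the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.boxplot(x='age_certification', y='imdb_score', hue='age_certification', data=titles_df, palette='Set2', legend=False)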
plt.title('IMDb Score vs Age Certification', fontsize=14)
plt.xlabel('Age Certification', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

- Ideal for Comparing Distributions Across Categories
- Shows More Than Just Averages

##### 2. What is/are the insight(s) found from the chart?

Family/Teen content generally has mid-range scores.

Adult content varies widely.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps tailor content quality to target audience.

#### Chart - 13  Top 10 Directors by Number of Titles

In [None]:
# Chart - 13 visualization code
top_directors = credits_df[credits_df['role']=='DIRECTOR']['name'].value_counts().nlargest(10)
plt.figure(figsize=(12,6))
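# Assign hue and hide the legend to avoid the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.barplot(x=top_directors.values, y=top_directors.index, hue=top_directors.index, palette='cividis', legend=False)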
plt.title('Top 10 Directors by Number of Titles', fontsize=14)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Director', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

- Ideal for Comparing Categorical Frequencies
- Horizontal Bars Improve Readability

##### 2. What is/are the insight(s) found from the chart?

A few directors dominate Amazon Prime content.
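`value_counts()` on `name` counts credit rows, which can overstate a director's output if a title lists the same person more than once. A hedged alternative is to count distinct titles per director. The sketch assumes `credits_df` has an `id` column identifying the title; the toy rows are made up.

```python
import pandas as pd

# Toy stand-in for credits_df; 'id' is assumed to be the title identifier.
toy_credits = pd.DataFrame({
    "id": ["tm1", "tm1", "tm2", "tm3", "tm3"],
    "name": ["Jane Doe", "Jane Doe", "Jane Doe", "John Roe", "Sam Poe"],
    "role": ["DIRECTOR", "DIRECTOR", "DIRECTOR", "DIRECTOR", "ACTOR"],
})

directors = toy_credits[toy_credits["role"] == "DIRECTOR"]

# Row count per name (what value_counts reports) vs. distinct titles per name.
rows_per_director = directors["name"].value_counts()
titles_per_director = directors.groupby("name")["id"].nunique().sort_values(ascending=False)

print(rows_per_director)
print(titles_per_director)
```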

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on partnerships with top directors.

#### Chart - 14 - Correlation Heatmap (Numerical variables)

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Select numerical columns for correlation
numerical_cols = [
    'release_year',
    'runtime',
    'seasons',
    'imdb_score',
    'imdb_votes',
    'tmdb_score',
    'tmdb_popularity',
    'content_age'
]

# Compute correlation matrix
corr_matrix = titles_df[numerical_cols].corr()

# Plot heatmap
plt.figure(figsize=(12,8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    linewidths=0.5
)

plt.title('Correlation Heatmap of Numerical Features', fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap is ideal for understanding relationships between multiple numerical variables at once.

It helps identify:

- Strong positive correlations
- Strong negative correlations
- Variables with no meaningful relationship

This chart covers the multivariate (M) part of the UBM (Univariate, Bivariate, Multivariate) rule.

##### 2. What is/are the insight(s) found from the chart?

- **Content Age vs Release Year:** strong negative correlation. As release year increases, content age decreases. This is expected when content age is computed from release year, so it works as a sanity check on the data, not as a new finding.
- **IMDb Votes vs TMDB Popularity:** moderate to strong positive correlation. Popular titles tend to receive more audience votes.
- **IMDb Score vs IMDb Votes:** weak to moderate positive correlation. Highly rated content does not always receive the highest number of votes.
- **Runtime vs IMDb Score:** very weak correlation. Longer runtime does not guarantee better ratings.
- **Seasons vs IMDb Score:** near-zero correlation. Having more seasons does not necessarily imply higher quality.
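`DataFrame.corr()` defaults to Pearson correlation, which is sensitive to heavy right tails such as those in votes and popularity. The sketch below, on made-up values, compares Pearson with rank-based Spearman and lists the variable pairs ordered by absolute correlation, which can be easier to scan than a full heatmap.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few numerical columns of titles_df; values are illustrative only.
toy = pd.DataFrame({
    "imdb_votes": [120, 450, 900, 3000, 15000, 250000],
    "tmdb_popularity": [1.1, 1.8, 2.0, 4.5, 9.0, 60.0],
    "runtime": [95, 42, 110, 88, 130, 101],
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

# Keep each pair once (upper triangle), then sort by absolute Spearman correlation.
upper = np.triu(np.ones(spearman.shape, dtype=bool), k=1)
pairs = spearman.where(upper).stack().dropna().sort_values(key=abs, ascending=False)

print(pairs)
print("Pearson votes vs popularity:", round(pearson.loc["imdb_votes", "tmdb_popularity"], 3))
```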

##### 3. Will the gained insights help creating a positive business impact?

- Helps identify which factors are actually associated with popularity and engagement.
- Confirms that the audience engagement metrics (votes and popularity) move together.
- Avoids incorrect assumptions (e.g., longer runtime = better content).

#### Chart - 15 - Pair Plot of Key Numerical Variables

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Select important numerical features
pairplot_cols = [
    'runtime',
    'imdb_score',
    'imdb_votes',
    'tmdb_score',
    'tmdb_popularity'
]

# Create pair plot
sns.pairplot(
    titles_df[pairplot_cols],
    diag_kind='kde',
    corner=True
)

plt.suptitle('Pair Plot of Key Numerical Features', y=1.02, fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot allows us to analyze pairwise relationships between multiple numerical variables simultaneously.

It combines:

- Scatter plots (relationship analysis)
- Distribution plots (data spread)

This visualization is ideal for multivariate analysis (M) and supports exploratory data analysis (EDA).

##### 2. What is/are the insight(s) found from the chart?

- **IMDb Votes vs TMDB Popularity:** clear upward trend indicating a positive relationship. Popular titles attract higher audience engagement.
- **IMDb Score vs IMDb Votes:** weak positive trend. Highly rated titles do not always receive massive vote counts.
- **Runtime vs IMDb Score:** no strong visible pattern. Runtime shows no clear relationship with audience rating.
- **TMDB Score vs IMDb Score:** mild positive correlation. Indicates consistency between platforms, but not perfect alignment.
- **Distributions:** IMDb scores are concentrated between 6 and 8. Popularity and votes are right-skewed, with a few extreme outliers.

##### 3. Will the gained insights help creating a positive business impact?

- Helps identify key drivers of popularity (votes & popularity move together).
- Confirms that content quality ≠ content length, avoiding misguided investment.
- Useful for building recommendation systems and predictive models.

#### Chart - 16  IMDb Votes vs IMDb Score

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(x='imdb_votes', y='imdb_score', data=titles_df, alpha=0.5)
plt.title('IMDb Votes vs IMDb Score', fontsize=14)
plt.xlabel('IMDb Votes', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

The objective is to analyze the relationship between IMDb vote count and IMDb score to understand whether titles with more audience engagement tend to receive higher or lower ratings.

##### 2. What is/are the insight(s) found from the chart?

Highly voted titles tend to have higher visibility.

Some outliers have low votes but high scores.
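The low-votes, high-score outliers can be listed explicitly with a simple filter. The thresholds below (`SCORE_MIN`, and the median vote count as `VOTES_MAX`) are hypothetical choices, and the titles are made up.

```python
import pandas as pd

# Toy stand-in for titles_df; values are illustrative only.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_score": [8.6, 8.2, 6.1, 7.9, 8.4],
    "imdb_votes": [45, 120000, 300, 80000, 900],
})

SCORE_MIN = 8.0                                   # hypothetical "high score" threshold
VOTES_MAX = toy_titles["imdb_votes"].median()     # hypothetical "low votes" threshold

# Titles that score highly but have fewer votes than the median title.
high_score_low_votes = toy_titles[
    (toy_titles["imdb_score"] >= SCORE_MIN) & (toy_titles["imdb_votes"] < VOTES_MAX)
]
print(high_score_low_votes)
```

A score backed by very few votes is less reliable than one backed by many, so such titles are candidates for a closer look, not automatic promotion.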

##### 3. Will the gained insights help creating a positive business impact?

Focus marketing on high-vote, high-score titles.

#### Chart - 17  TMDB Score vs TMDB Popularity

In [None]:
plt.figure(figsize=(10,5))
sns.scatterplot(x='tmdb_score', y='tmdb_popularity', data=titles_df, alpha=0.5)
plt.title('TMDB Score vs Popularity', fontsize=14)
plt.xlabel('TMDB Score', fontsize=12)
plt.ylabel('TMDB Popularity', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

The objective is to analyze the relationship between TMDB score (content quality) and TMDB popularity (audience reach) to understand whether higher-rated titles are also more popular.

##### 2. What is/are the insight(s) found from the chart?

High TMDB popularity doesn’t always mean high score.

##### 3. Will the gained insights help creating a positive business impact?

Popularity campaigns needed for moderately-rated titles.

#### Chart - 18  Genre vs Number of Titles

In [None]:
plt.figure(figsize=(12,6))
genre_counts = titles_genres_df['genres'].value_counts().head(20)
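# Assign hue and hide the legend to avoid the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.barplot(x=genre_counts.values, y=genre_counts.index, hue=genre_counts.index, palette='coolwarm', legend=False)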
plt.title('Top 20 Genres by Number of Titles', fontsize=14)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Genre', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

The objective is to analyze how titles are distributed across genres and identify which genres dominate the content library.

##### 2. What is/are the insight(s) found from the chart?

Identifies most prolific genres.

##### 3. Will the gained insights help creating a positive business impact?

Guides acquisition for popular genres and helps fill content gaps.

#### Chart - 19  Content Age Distribution

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(titles_df['content_age'], bins=30, color='gold', kde=True)
plt.title('Distribution of Content Age', fontsize=14)
plt.xlabel('Content Age (Years)', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.show()


##### 1. Why did you pick the specific chart?

The objective is to analyze the age of content available on the platform to understand how recent or old the catalog is.

##### 2. What is/are the insight(s) found from the chart?

Most content is 1–10 years old.

Older content is rare → potential for classic content marketing.
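For reference, here is one way a `content_age` feature and the age shares could be computed. The notebook section shown here does not state how the feature was built, so the formula, the `REFERENCE_YEAR` value, and the band edges are all assumptions; the data is made up.

```python
import pandas as pd

REFERENCE_YEAR = 2023  # hypothetical; use the year the dataset was collected

# Toy stand-in for titles_df['release_year'].
toy_titles = pd.DataFrame({"release_year": [1950, 2005, 2018, 2022]})

# Assumed definition: age in years relative to the reference year.
toy_titles["content_age"] = REFERENCE_YEAR - toy_titles["release_year"]

# Share of titles per age band; bins are right-closed: (-1, 10], (10, 30], (30, 200].
age_bands = pd.cut(
    toy_titles["content_age"], bins=[-1, 10, 30, 200], labels=["0-10", "11-30", "31+"]
)
age_share = age_bands.value_counts(normalize=True)
print(age_share)
```

Under this definition `content_age` is a linear function of `release_year`, which is why the two show a perfect negative correlation in any heatmap.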

##### 3. Will the gained insights help creating a positive business impact?

Identify old but popular titles → promotional opportunities.

#### Chart - 20  IMDb Score vs Genre (Top 10 Genres)

In [None]:
top_genres_10 = titles_genres_df['genres'].value_counts().nlargest(10).index
plt.figure(figsize=(12,6))
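# Assign hue and hide the legend to avoid the palette-without-hue deprecation warning (seaborn >= 0.13)
sns.boxplot(x='genres', y='imdb_score', hue='genres', data=titles_genres_df[titles_genres_df['genres'].isin(top_genres_10)], palette='Paired', legend=False)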
plt.title('IMDb Score Distribution Across Top 10 Genres', fontsize=14)
plt.xlabel('Genre', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

The objective is to compare the distribution of IMDb scores across the top 10 most common genres to understand how audience ratings vary within and between genres.

##### 2. What is/are the insight(s) found from the chart?

Some genres consistently score higher than others (e.g., Drama, Documentary).

##### 3. Will the gained insights help creating a positive business impact?

Focus acquisition on high-quality genres to improve user satisfaction.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Optimize Content Portfolio Based on Viewer Preferences

The analysis shows that certain genres (such as Drama, Comedy, and Documentary) dominate the platform and also receive higher average IMDb scores.

Amazon Prime should prioritize acquisition and production of high-performing genres while gradually experimenting with under-represented genres to attract niche audiences.

Business Benefit:
Improves user satisfaction and retention by aligning content with audience interests.

2. Focus on Quality Over Quantity

Correlation and pair plot analysis indicate that longer runtime or more seasons do not guarantee higher ratings.

This suggests that ratings depend more on factors such as storytelling quality and audience appeal than on content length.

Business Benefit:
Optimizes content investment and prevents unnecessary production costs.

3. Strengthen High-Engagement Content Promotion

Titles with high IMDb votes and TMDB popularity show strong audience engagement.

Some high-rated titles have relatively low popularity, indicating under-promotion.

Business Benefit:
Targeted marketing of high-quality but low-visibility content can increase watch time and platform usage.

4. Regional & International Content Expansion

Production country analysis highlights dominance of a few regions.

There is an opportunity to expand international and regional content to cater to diverse audiences.

Business Benefit:
Drives subscription growth in new markets and improves global platform appeal.

5. Maintain a Fresh and Balanced Content Library

Release year and content age analysis reveal a mix of old and new titles.

Regularly refreshing the catalog while promoting popular classic content can balance nostalgia and novelty.

Business Benefit:
Keeps the platform engaging for both new and long-term subscribers.

6. Leverage Talent-Driven Content Strategy

Certain actors and directors appear frequently across the catalog, which may indicate a reliable track record.

Strategic collaborations with proven talent can enhance content success.

Business Benefit:
Improves predictability of content performance and marketing effectiveness.

Overall Recommendation

Amazon Prime Video should adopt a data-driven content strategy focusing on:

- High-quality genres
- Targeted promotion
- Smart content investment
- Regional diversification

This approach will maximize user engagement, improve return on investment, and support sustainable growth in a competitive streaming market.

# **Conclusion**



This project presents a comprehensive exploratory data analysis of the Amazon Prime Video content library in the United States, using a combination of data wrangling, visualization, and statistical exploration techniques. The objective was to understand content diversity, audience preferences, performance indicators, and strategic opportunities that can support data-driven decision-making in the competitive streaming industry.

Through systematic data cleaning and preprocessing, the dataset was transformed into a reliable and analysis-ready form. Missing values were handled appropriately, multi-valued fields such as genres and production countries were normalized by exploding them into one row per value, and new features such as content age and content-type indicators were engineered. These steps made the analysis robust and reproducible.

Univariate analysis provided insights into the overall structure of Amazon Prime’s content library, revealing patterns in content type distribution, release trends, runtime, age certifications, and genre dominance. The platform demonstrates a strong focus on specific genres and a mix of both movies and television shows, reflecting a strategy aimed at appealing to a broad audience base.

Bivariate analysis helped uncover relationships between key variables such as ratings, popularity, runtime, release year, and genres. It was observed that content quality, as measured by IMDb and TMDB scores, does not strongly depend on runtime or the number of seasons, suggesting that factors such as storytelling quality matter more than content length. Certain genres consistently performed better in terms of audience ratings, indicating clear genre-based preferences.

Multivariate analysis using correlation heatmaps and pair plots offered deeper insights into how numerical variables interact. Audience engagement metrics such as IMDb votes and TMDB popularity showed positive correlation, while ratings displayed weaker relationships with other quantitative features. These findings highlight that popularity and visibility play a major role in driving engagement, sometimes independent of critical ratings.

Overall, the analysis demonstrates the value of leveraging data analytics to guide content strategy, marketing efforts, and investment decisions. By focusing on high-performing genres, promoting quality content more effectively, expanding regional diversity, and optimizing production investments, Amazon Prime Video can enhance user engagement, improve content ROI, and maintain a competitive edge in the streaming market.

This project successfully illustrates how structured exploratory data analysis and thoughtful visualization can translate raw data into actionable business insights, reinforcing the importance of data-driven decision-making in the digital entertainment industry.
