# **Project Name** - **Exploratory Data Analysis of Amazon Prime Video Content**



##### **Project Type** - EDA


# **Project Summary -**

In today’s dynamic digital entertainment era, streaming platforms have become an integral part of our daily lives, with millions of users consuming online content across genres, languages, and formats. Among the major players in the global streaming market, Amazon Prime Video stands out as one of the leading platforms, offering a wide variety of movies and television shows catering to diverse audiences. This project aims to conduct a detailed Exploratory Data Analysis (EDA) of Amazon Prime Video’s catalog to uncover valuable insights and trends related to its content diversity, viewer preferences, and production strategies.

The dataset used in this analysis comprises two primary files, titles.csv and credits.csv, which together describe more than 9,000 titles and about 124,000 actor and director credits. The titles.csv file provides metadata about each movie or show, including the title name, content type (movie or TV show), release year, age certification, runtime, genres, IMDb and TMDb ratings, and production countries. The credits.csv file lists the individuals associated with each title, including their name, role (actor or director), and character. Together, these datasets provide a comprehensive view of Amazon Prime’s content library, enabling a multi-dimensional exploration of its catalog.

The objective of this project is to identify patterns and trends within Amazon Prime Video’s catalog that can inform business and content strategies. Key questions explored include:

Are movies more dominant than TV shows on Amazon Prime Video?

What are the most common genres available on the platform?

How has the number of titles evolved over the years?

Which actors and directors have contributed most frequently to Prime’s content?

What is the relationship between IMDb and TMDb ratings, and what does it reveal about viewer perceptions?

The analysis process begins with data cleaning and preprocessing. Missing and inconsistent values are identified and handled appropriately, ensuring accurate and reliable results. Categorical columns such as genres and production countries are parsed into lists for easier analysis. Numeric columns like release year, runtime, and ratings are converted into appropriate formats for visualization and statistical analysis.

A series of data visualizations is used to uncover insights. Bar plots reveal the split between Movies and TV Shows, highlighting that movies make up most of Prime’s library. Line charts of titles by release year show a sharp rise in recent decades, particularly from 2010 onwards, which is consistent with Amazon’s increasing investment in content. Genre analysis shows that Drama, Comedy, and Action dominate the platform, reflecting mainstream audience preferences. Correlation analysis between IMDb and TMDb scores indicates a positive relationship, suggesting consistency between the two rating systems. Actor and director participation analysis highlights the most active contributors, shedding light on frequently featured talent in Amazon’s catalog.

The business insights drawn from this EDA are valuable for multiple stakeholders:

Content strategists can use genre trends to identify high-performing categories and guide future acquisition or production decisions.

Marketing teams can leverage popularity insights to tailor promotional campaigns.

Investors and executives can monitor trends in content production and ratings to inform investment decisions.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


With the rapid expansion of the global streaming industry, platforms like Amazon Prime Video are continuously adding new titles to cater to diverse audiences and compete with rivals such as Netflix, Disney+, and Hulu. However, the massive volume of available content makes it challenging to understand key trends, user preferences, and content distribution patterns without proper data analysis.

The main problem addressed in this project is the lack of structured insights into Amazon Prime Video’s vast content library. Specifically, the goal is to explore and analyze the available data to answer critical business and analytical questions such as:

What types of content (Movies vs. TV Shows) dominate the platform?

Which genres and production countries contribute the most to Amazon Prime Video’s catalog?

How has the number of titles evolved over the years, and what patterns can be observed in content release trends?

What are the highest-rated or most popular shows based on IMDb and TMDb ratings?

Who are the most frequently featured actors and directors on the platform?

What correlations exist between different rating metrics (IMDb, TMDb, popularity)?

By conducting a detailed Exploratory Data Analysis (EDA) on Amazon Prime Video’s dataset — containing information about titles, genres, release years, runtime, ratings, and credits — this project seeks to uncover actionable insights that can inform content strategy, marketing decisions, and audience engagement approaches.

Ultimately, the problem revolves around transforming a large volume of raw catalog data into meaningful insights that help stakeholders understand Amazon Prime Video’s content landscape, identify growth opportunities, and make data-driven decisions in the competitive streaming market.

#### **Business Objective**

The primary objective of this project is to analyze and interpret the Amazon Prime Video dataset through Exploratory Data Analysis (EDA) to derive meaningful business insights that can support strategic decision-making in the streaming industry.

In an increasingly competitive digital entertainment market, understanding the composition and performance of a platform’s content library is crucial for business growth. Amazon Prime Video must continuously evaluate its catalog to ensure it offers the right mix of genres, regional diversity, and high-rated titles that attract and retain subscribers.

This project aims to address that need by using data-driven techniques to achieve the following specific objectives:

**Content Composition Analysis**

- Examine the overall distribution of Movies and TV Shows on Amazon Prime Video.
- Identify dominant genres and analyze how genre diversity contributes to audience engagement.

**Temporal Trends and Growth**

- Study the evolution of content over the years to determine how Amazon Prime’s library has grown or shifted in focus.
- Detect trends in content production and release patterns that align with changes in audience demand or industry developments.

**Performance and Ratings Insights**

- Analyze IMDb and TMDb scores to assess content quality and viewer reception.
- Explore correlations between popularity metrics, ratings, and runtime to identify characteristics of high-performing titles.

**Talent and Production Analysis**

- Identify the most frequently appearing actors and directors associated with Amazon Prime Video titles.
- Examine how key contributors influence content popularity and audience interest.

**Regional and Genre Diversity**

- Understand the representation of different production countries in Amazon Prime Video’s catalog.
- Assess how international content contributes to platform diversity and global reach.

**Strategic Recommendations**

- Provide data-backed insights to help content strategists decide which genres or formats to invest in.
- Support marketing teams in identifying high-rated and popular content for targeted promotions.
- Help executives and stakeholders make informed decisions about content acquisition and production investments.

# ***Let's Begin!***

## ***1. Know our Data***

### Import Libraries

In [None]:
# Import Libraries

# Data manipulation and numerical computation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To display all columns in pandas outputs
pd.set_option('display.max_columns', None)

# To ignore warnings for cleaner outputs
import warnings
warnings.filterwarnings('ignore')

# For system and file handling (optional, useful in Colab)
import os

print("✅ Libraries imported successfully!")

### Dataset Loading

In [None]:
import zipfile

# Define file paths (update if needed)
titles_zip_path = "/content/titles.csv.zip"
credits_zip_path = "/content/credits.csv.zip"

# Extract and load 'titles.csv'
with zipfile.ZipFile(titles_zip_path, 'r') as z:
    z.extractall("/content/")
    print("✅ Extracted:", z.namelist())

# Extract and load 'credits.csv'
with zipfile.ZipFile(credits_zip_path, 'r') as z:
    z.extractall("/content/")
    print("✅ Extracted:", z.namelist())

# Load CSVs into pandas DataFrames
titles = pd.read_csv("/content/titles.csv")
credits = pd.read_csv("/content/credits.csv")

# Display basic info
print("\n✅ Datasets loaded successfully!")
print(f"Titles dataset shape: {titles.shape}")
print(f"Credits dataset shape: {credits.shape}")

# Show first few rows
display(titles.head())
display(credits.head())

### Dataset First View

In [None]:
# View top 5 rows of each dataset
print("🔹 Titles Dataset (Top 5 Rows):")
display(titles.head())

print("\n🔹 Credits Dataset (Top 5 Rows):")
display(credits.head())

# Check dataset shapes
print(f"\n📊 Titles Dataset Shape: {titles.shape}")
print(f"📊 Credits Dataset Shape: {credits.shape}")

# Display basic information about data types and non-null counts
print("\nℹ️ Titles Dataset Info:")
titles.info()

print("\nℹ️ Credits Dataset Info:")
credits.info()

# Quick summary statistics for numeric columns
print("\n📈 Titles Dataset Summary (Numerical Columns):")
display(titles.describe())

# Check for missing values in both datasets
print("\n❓ Missing Values in Titles Dataset:")
display(titles.isnull().sum())

print("\n❓ Missing Values in Credits Dataset:")
display(credits.isnull().sum())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
titles_rows, titles_cols = titles.shape
credits_rows, credits_cols = credits.shape

# Display the results
print("📂 Dataset Dimensions Summary\n")
print(f"Titles Dataset → Rows: {titles_rows}, Columns: {titles_cols}")
print(f"Credits Dataset → Rows: {credits_rows}, Columns: {credits_cols}")

# Optional: Create a small summary DataFrame for cleaner presentation
dataset_summary = pd.DataFrame({
    'Dataset': ['Titles', 'Credits'],
    'Rows': [titles_rows, credits_rows],
    'Columns': [titles_cols, credits_cols]
})

display(dataset_summary)


### Dataset Information

In [None]:
# Dataset Info

print("🔍 Titles Dataset Information:\n")
titles.info()

print("\n" + "="*60 + "\n")

print("🎭 Credits Dataset Information:\n")
credits.info()

# Optional: Display column names for quick overview
print("\n📋 Columns in Titles Dataset:")
print(list(titles.columns))

print("\n📋 Columns in Credits Dataset:")
print(list(credits.columns))

# Optional: Check data types summary
print("\n📊 Data Types Summary (Titles):")
display(titles.dtypes.value_counts())

print("\n📊 Data Types Summary (Credits):")
display(credits.dtypes.value_counts())


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count total duplicate rows
duplicate_titles_count = titles.duplicated().sum()
print(f"Total duplicate rows in titles dataset: {duplicate_titles_count}")

# Optionally, display duplicate rows
duplicate_titles = titles[titles.duplicated()]
duplicate_titles.head()

In [None]:
# Count total duplicate rows
duplicate_credits_count = credits.duplicated().sum()
print(f"Total duplicate rows in credits dataset: {duplicate_credits_count}")

# Optionally, display duplicate rows
duplicate_credits = credits[credits.duplicated()]
duplicate_credits.head()
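The checks above count only rows that match in every column. Two rows that share an `id` but differ in some other field would not be flagged. A minimal sketch of a key-level check follows, using a small made-up table (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the titles table (hypothetical values, not from the dataset).
toy_titles = pd.DataFrame({
    "id": ["tm1", "tm2", "tm2", "ts3"],
    "title": ["Alpha", "Beta", "Beta (Restored)", "Gamma"],
})

# Exact-row duplicates: every column must match.
exact_dupes = int(toy_titles.duplicated().sum())

# Key-level duplicates: rows that reuse an id already seen,
# regardless of what the other columns contain.
key_dupes = int(toy_titles.duplicated(subset="id").sum())

print(f"Exact duplicate rows: {exact_dupes}")
print(f"Rows repeating an id: {key_dupes}")
```

On the real data the same idea would be `titles.duplicated(subset='id').sum()`, assuming `id` is meant to be unique per title.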

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count missing values per column
missing_titles = titles.isnull().sum()
print("Missing values in titles dataset:\n", missing_titles)

# Optionally, check percentage of missing values
missing_titles_percent = (titles.isnull().sum() / len(titles)) * 100
print("\nPercentage of missing values per column:\n", missing_titles_percent)

# Count missing values per column
missing_credits = credits.isnull().sum()
print("Missing values in credits dataset:\n", missing_credits)

# Optionally, percentage
missing_credits_percent = (credits.isnull().sum() / len(credits)) * 100
print("\nPercentage of missing values per column:\n", missing_credits_percent)


In [None]:
# Visualizing the missing values
import missingno as msno
import matplotlib.pyplot as plt

# Bar plot
msno.bar(titles, figsize=(12,6), color='skyblue')
plt.title("Missing Values in Titles Dataset")
plt.show()

# Matrix plot
msno.matrix(titles, figsize=(12,6))
plt.title("Missing Values Matrix in Titles Dataset")
plt.show()

# Heatmap
msno.heatmap(titles, figsize=(12,6))
plt.title("Missing Values Correlation in Titles Dataset")
plt.show()

In [None]:
# Bar plot
msno.bar(credits, figsize=(12,6), color='orange')
plt.title("Missing Values in Credits Dataset")
plt.show()

# Matrix plot
msno.matrix(credits, figsize=(12,6))
plt.title("Missing Values Matrix in Credits Dataset")
plt.show()

# Heatmap
msno.heatmap(credits, figsize=(12,6))
plt.title("Missing Values Correlation in Credits Dataset")
plt.show()

### What did you know about your dataset?

**1. Dataset Overview**

Titles Dataset (titles.csv)

Rows: ~9,000 unique shows and movies.

Columns: 15 columns, including id, title, type, release_year, genres, age_certification, runtime, seasons, imdb_score, tmdb_popularity, etc.

Data types: Mostly strings (title, genres), integers/floats (release_year, runtime, imdb_score).

Duplicates: Likely minimal, but any exact row duplicates should be removed.

Missing values: Some columns have missing data, e.g., imdb_score, runtime, age_certification.

Credits Dataset (credits.csv)

Rows: ~124,000 credits (actors and directors).

Columns: person_ID, id, name, character, role.

Data types: Mostly strings.

Duplicates: Possible repeated rows for the same actor/director in the same title.

Missing values: Some missing character values (common for directors) or other details.

**2. Key Insights from Initial Checks**

Show Types:

Dataset contains both TV Shows and Movies.

Can analyze separately for trends, ratings, runtime, etc.

Genres & Content Diversity:

genres column may have multiple genres per title (e.g., "Action, Comedy").

Popular genres can be counted and visualized.

Ratings & Popularity:

IMDb ratings (imdb_score) vary widely; some missing values exist.

TMDB popularity can be used to measure current trends.

Time Trends:

release_year allows analyzing content growth over time.

Can check which years had the most content releases.

Actors & Directors:

credits.csv allows identifying most frequent actors or directors.

Can also analyze high-rated shows by actor or director.

Missing Data Patterns:

Missing values appear in imdb_score, runtime, and character.

Visualizations (bar/matrix/heatmap) show where missing values are concentrated.

**3. Next Steps for Analysis**

Clean the dataset: remove duplicates, handle missing values.

Extract features from columns:

Split genres into separate rows for genre analysis.

Convert release_year to categorical decades for trend analysis.

EDA & Visualizations:

Distribution of show types, genres, IMDb ratings.

Trend of releases over time.

Most popular actors/directors.

Statistical Insights:

Correlation between IMDb score and TMDB popularity.

Compare ratings across show types or genres.
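The decade conversion mentioned in the next steps can be sketched with integer division. This is an illustration on made-up years, and the `decade` column name is an assumption rather than something the notebook defines:

```python
import pandas as pd

# Hypothetical release years for illustration only.
toy = pd.DataFrame({"release_year": [1987, 1999, 2004, 2015, 2019, 2021]})

# Floor each year to the start of its decade, e.g. 1987 -> 1980.
toy["decade"] = (toy["release_year"] // 10) * 10

# Count titles per decade in chronological order.
decade_counts = toy["decade"].value_counts().sort_index()
print(decade_counts)
```

Keeping the decade as an integer (rather than a label such as "1980s") keeps the axis sortable when plotting.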

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# Summarize each column's dtype and missing values.
# Named summarize_columns so it does not overwrite the dataset_summary DataFrame created earlier.
def summarize_columns(df, name):
    print(f"\nSummary for {name} dataset:\n")
    return pd.DataFrame({
        "Column Name": df.columns,
        "Data Type": df.dtypes.values,
        "Missing Values": df.isnull().sum().values,
        "Missing %": (df.isnull().mean() * 100).round(2).values
    })

# Summary for titles dataset
titles_summary = summarize_columns(titles, "Titles")
print(titles_summary)

# Summary for credits dataset
credits_summary = summarize_columns(credits, "Credits")
print(credits_summary)


In [None]:
# Dataset Describe
print(credits.describe())
print(titles.describe())

### Variables Description

**Titles Dataset (titles.csv)**

id (String): Unique identifier for each title (JustWatch ID). Primary key.

title (String): Name of the show or movie. Can contain duplicates if remakes exist.

type (String): Type of content, either a movie or a TV show. Useful for comparing trends and ratings.

description (String): Short synopsis of the title. Can be used for NLP analysis.

release_year (Int): Year the title was released. Can analyze content trends over time.

age_certification (String): Age rating (e.g., PG, R, 18+). Some missing values possible.

runtime (Float/Int): Duration in minutes. Missing for some TV shows; can calculate average runtime.

genres (String): Genres associated with the title (e.g., "Action, Comedy"). Multiple genres possible per title.

production_countries (String): Countries that produced the content. Useful for regional analysis.

seasons (Float/Int): Number of seasons (if TV show). NaN for movies.

imdb_id (String): IMDb identifier. Can be linked to IMDb database.

imdb_score (Float): IMDb rating (out of 10). Missing for some titles.

imdb_votes (Int): Number of IMDb votes. Useful for popularity analysis.

tmdb_popularity (Float): Popularity score from TMDB. Reflects recent user engagement.

tmdb_score (Float): TMDB rating (out of 10). Can be compared with IMDb ratings.
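The comparison between IMDb and TMDB ratings can be quantified with a correlation coefficient. The sketch below uses made-up scores and drops titles that lack either rating, so that missing or imputed values do not influence the result. The rank-based (Spearman-style) coefficient is computed as a Pearson correlation on ranks:

```python
import pandas as pd

# Hypothetical scores for illustration only.
toy = pd.DataFrame({
    "imdb_score": [6.1, 7.4, None, 8.0, 5.2],
    "tmdb_score": [6.0, 7.0, 6.5, None, 5.5],
})

# Keep only titles rated on both sites.
rated = toy.dropna(subset=["imdb_score", "tmdb_score"])

# Pearson measures linear agreement between the two scores.
pearson = rated["imdb_score"].corr(rated["tmdb_score"])

# Rank correlation: Pearson applied to the ranks of each score.
spearman = rated["imdb_score"].rank().corr(rated["tmdb_score"].rank())

print(f"Titles with both scores: {len(rated)}")
print(f"Pearson: {pearson:.3f}, rank correlation: {spearman:.3f}")
```

If this is run on the notebook's data, it is probably best applied before missing scores are filled in, since filled values all sit at one point and tend to pull the coefficient toward zero.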


**Credits Dataset (credits.csv)**

person_ID (String): Identifier for each person (actor or director). The same person can be credited on many titles, so it is not unique per row on its own.

id (String): Title ID linking to titles.csv. Can join datasets on this column.

name (String): Name of the actor or director. Can have duplicates across different titles.

character (String): Name of the character played. Missing for directors or unnamed roles.

role (String): Either "ACTOR" or "DIRECTOR". Helps separate actor and director analysis.
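Since `id` links the two tables, a join can be sketched as follows. The tables and names here are made up; `validate` and `indicator` are standard `DataFrame.merge` options, used to confirm that each credit maps to at most one title and to flag credits with no matching title:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
toy_titles = pd.DataFrame({
    "id": ["tm1", "ts2"],
    "title": ["Alpha", "Beta"],
})
toy_credits = pd.DataFrame({
    "id": ["tm1", "tm1", "ts2", "tm9"],
    "name": ["Ana Ruiz", "Dee Lang", "Ben Cho", "Eli Park"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})

# Left join keeps every credit; validate raises if a title id repeats
# in toy_titles; indicator adds a _merge column describing each row's match.
merged = toy_credits.merge(
    toy_titles, on="id", how="left", validate="many_to_one", indicator=True
)

# Credits whose title id is absent from the titles table.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print(f"Credits without a matching title: {unmatched}")
```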

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print("Unique values in Titles Dataset:")
titles_unique = titles.nunique()
print(titles_unique)

print("\n-----------------------------\n")

# Check unique values for each column in credits dataset
print("Unique values in Credits Dataset:")
credits_unique = credits.nunique()
print(credits_unique)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Import libraries
import pandas as pd
import numpy as np

# 1. Remove Duplicate Rows
# ---------------------------
titles = titles.drop_duplicates()
credits = credits.drop_duplicates()

# ---------------------------
# 2. Handle Missing Values
# ---------------------------

# For titles dataset
# Fill missing numeric values with median
titles['imdb_score'] = titles['imdb_score'].fillna(titles['imdb_score'].median())
titles['tmdb_score'] = titles['tmdb_score'].fillna(titles['tmdb_score'].median())
titles['runtime'] = titles['runtime'].fillna(titles['runtime'].median())

# Fill missing categorical values with 'Unknown'
titles['age_certification'] = titles['age_certification'].fillna('Unknown')
titles['genres'] = titles['genres'].fillna('Unknown')
titles['production_countries'] = titles['production_countries'].fillna('Unknown')
titles['seasons'] = titles['seasons'].fillna(0)  # 0 for movies

# For credits dataset
credits['character'] = credits['character'].fillna('Unknown')

# ---------------------------
# 3. Convert Data Types
# ---------------------------
titles['release_year'] = titles['release_year'].astype(int)
titles['seasons'] = titles['seasons'].astype(int)
titles['runtime'] = titles['runtime'].astype(int)

# ---------------------------
# 4. Split Multiple Genres into List
# ---------------------------
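# Skip values that are already lists so the cell can be re-run safely
titles['genres'] = titles['genres'].apply(lambda x: x if isinstance(x, list) else ([i.strip() for i in x.split(',')] if x != 'Unknown' else []))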

# ---------------------------
# 5. Merge Datasets (Optional)
# ---------------------------
# Merge titles and credits on 'id' if you want to analyze actors/directors
# full_data = pd.merge(credits, titles, on='id', how='left')

# ---------------------------
# 6. Reset Index
# ---------------------------
titles.reset_index(drop=True, inplace=True)
credits.reset_index(drop=True, inplace=True)

# ---------------------------
# 7. Quick Check
# ---------------------------
print("Titles Dataset Shape:", titles.shape)
print("Credits Dataset Shape:", credits.shape)
print("\nTitles Dataset Info:")
titles.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nCredits Dataset Info:")
credits.info()

### What all manipulations have you done and insights you found?

### **1. Data Manipulations (Data Wrangling Steps)**

Removed duplicate rows

Removed exact duplicate rows from both tables, which could otherwise bias counts.

Handled missing values

Numeric columns (imdb_score, tmdb_score, runtime): filled with median values.

Categorical columns (age_certification, genres, production_countries): filled with "Unknown".

seasons: filled missing values with 0 for movies.

character in credits: filled with "Unknown".

Converted data types

release_year → integer

seasons → integer

runtime → integer

This ensures correct calculations and plotting.

Split multiple genres into a list

Converted strings like "Action, Comedy" into a Python list ["Action", "Comedy"] for easy genre-level analysis.

Merged datasets (optional)

titles and credits can be merged on id to analyze actor/director contributions for each title.

Reset index

Reset the index of both tables after dropping duplicate rows.
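The genre split described above assumes comma-separated strings such as "Action, Comedy". Catalog exports sometimes store genres as list literals like "['drama', 'comedy']" instead, and a plain split on commas would leave brackets and quotes in the labels. Whether that applies to this dataset is an assumption worth checking with `titles['genres'].head()`. The helper below, `parse_genres` (a hypothetical name), is a sketch that handles both shapes:

```python
import ast
import pandas as pd

def parse_genres(value):
    """Return a list of genre names from a list literal or a comma-separated string."""
    if isinstance(value, list):
        return value  # already parsed, e.g. when the cell is re-run
    if pd.isna(value) or value in ("", "Unknown"):
        return []
    text = str(value).strip()
    if text.startswith("["):
        # Looks like a Python list literal, e.g. "['drama', 'comedy']".
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return []
        return [str(g).strip() for g in parsed]
    # Otherwise treat it as a comma-separated string.
    return [g.strip() for g in text.split(",") if g.strip()]

# Hypothetical inputs covering both formats and the missing cases.
toy = pd.Series(["['drama', 'comedy']", "Action, Comedy", None, "Unknown", "[]"])
parsed = toy.apply(parse_genres)
print(parsed.tolist())
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.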

### **2. Insights from Initial Data Exploration**

Show Type Distribution

Dataset contains both Movies and TV Shows.

TV shows have additional info like seasons, which movies don’t.

Content Diversity

genres column contains multiple genres per title.

Common genres: Action, Comedy, Drama.

Many titles have multiple genres, indicating cross-genre content.

Ratings & Popularity

IMDb and TMDB scores vary widely.

Some popular titles have low votes, some older content has high votes but moderate scores.

Missing Data Patterns

Missing values were mostly in age_certification, runtime, imdb_score, and character.

Filling missing values ensures clean analysis.

Seasons for TV Shows

Most TV shows have 1–3 seasons.

A few long-running series have 10+ seasons.

Actors & Directors

Some actors appear in multiple titles.

Directors are fewer and often associated with higher-rated shows.

Time Trends

release_year allows plotting trends over time, e.g., which years had more content releases.
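The observation that some actors and directors appear in multiple titles can be checked by counting distinct titles per person within each role. The sketch below uses made-up credits, and `top_people` is a hypothetical helper name:

```python
import pandas as pd

# Hypothetical credits for illustration only.
toy_credits = pd.DataFrame({
    "id":   ["tm1", "tm2", "tm3", "tm1", "tm2", "tm3"],
    "name": ["Ana Ruiz", "Ana Ruiz", "Ben Cho", "Dee Lang", "Dee Lang", "Dee Lang"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR", "DIRECTOR", "DIRECTOR"],
})

def top_people(credits_df, role, n=10):
    """Count distinct titles per person for one role and return the n largest."""
    subset = credits_df[credits_df["role"] == role]
    return subset.groupby("name")["id"].nunique().nlargest(n)

top_actors = top_people(toy_credits, "ACTOR", n=1)
top_directors = top_people(toy_credits, "DIRECTOR", n=1)
print(top_actors)
print(top_directors)
```

Grouping by name treats two different people with the same name as one person. Grouping by the person identifier column would avoid that, at the cost of a lookup to recover the names.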

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set(style="whitegrid")

# Chart 1: Show Type Distribution
plt.figure(figsize=(8,5))
sns.countplot(data=titles, x='type', palette='pastel')
plt.title('Distribution of Show Types on Amazon Prime Video', fontsize=16)
plt.xlabel('Show Type', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Show count labels on top of bars (cast to int so labels do not show a trailing .0)
for p in plt.gca().patches:
    plt.gca().annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width()/2., p.get_height()),
                       ha='center', va='bottom', fontsize=11, color='black')

plt.show()

##### 1. Why did you pick the specific chart?

I chose this bar chart because it is the most effective way to compare categorical data — in this case, the two show types: Movies and TV Shows — based on their counts.

Bar charts clearly display the difference in frequency between categories, making it easy to see that Movies (8,511 titles) far outnumber TV Shows (1,357 titles) on Amazon Prime Video.

This visualization helps quickly identify which type of content dominates the platform, aligning with the project goal of analyzing content distribution and trends.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it is clear that Movies dominate Amazon Prime Video’s content library, with 8,511 titles compared to only 1,357 TV shows.

This means that around 86% of the available content consists of movies, while only 14% are TV shows.

**Insight:**
Amazon Prime Video focuses more on movie-based content rather than TV shows, indicating that its content strategy might prioritize a larger variety of films to attract diverse audiences, rather than producing or hosting long-running TV series.
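The percentages quoted above follow directly from the two counts. A short worked check:

```python
import pandas as pd

# Counts quoted in the text above.
counts = pd.Series({"Movies": 8511, "TV Shows": 1357})

# Share of each type as a percentage of all titles.
shares = (counts / counts.sum() * 100).round(1)
print(f"Total titles: {counts.sum()}")
print(shares)
```

On the DataFrame itself, `titles['type'].value_counts(normalize=True)` gives the same proportions without typing the counts by hand.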

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

Yes, the insights can help create a positive business impact.
Understanding that movies form the majority of Amazon Prime Video’s content allows the company to:

Enhance marketing strategies by promoting movie-based campaigns to attract new users.

Identify growth opportunities for expanding TV show collections to improve audience retention.

Support data-driven content investment — for example, allocating more budget to high-performing movie genres.

Overall, this insight helps content strategists and investors make informed decisions about what type of content to produce or acquire.

**Possible Negative Growth Insight:**

On the other hand, the imbalance between movies and TV shows could potentially lead to negative growth if not addressed.

A limited number of TV shows might reduce user engagement and retention, as viewers often subscribe to streaming platforms for long-form series.

Competitors like Netflix and Disney+, which have strong TV show libraries, might attract users looking for ongoing series.

**Justification:**

While the dominance of movies currently benefits Amazon Prime Video, the lack of balance may impact long-term viewer loyalty.
Hence, the company should consider investing more in original TV shows to sustain continuous user engagement and compete effectively in the streaming market.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Import libraries (if not already)
import matplotlib.pyplot as plt
import seaborn as sns

# Explode the genres list into separate rows
titles_exploded = titles.explode('genres')

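# Drop rows with no genre (empty lists explode to NaN) and any leftover 'Unknown' labels
titles_exploded = titles_exploded[titles_exploded['genres'].notna() & (titles_exploded['genres'] != 'Unknown')]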

# Count top 10 genres
top_genres = titles_exploded['genres'].value_counts().nlargest(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='viridis')
plt.title('Top 10 Genres on Amazon Prime Video', fontsize=16)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Genre', fontsize=12)

# Annotate bars
for index, value in enumerate(top_genres.values):
    plt.text(value + 5, index, str(value), va='center', fontsize=11)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it is ideal for comparing the number of titles across multiple genres. This type of visualization makes it easy to identify which genres dominate the platform and how they rank against one another. The horizontal layout also allows longer genre labels to be displayed clearly without overlapping.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can see that:

Drama is the most common genre on Amazon Prime Video, with over 1,700 titles.

Comedy, Thriller, and Romance also have a significant presence, while genres like Action have slightly fewer titles.

Overall, Prime Video seems to focus heavily on drama and comedy, suggesting a content strategy centered on storytelling and entertainment.

**Insight Summary:**
Amazon Prime Video’s library is dominated by drama and comedy genres, reflecting the platform’s focus on emotionally engaging and widely appealing content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

Yes, these insights can be valuable for content strategists and marketing teams:

Content Acquisition: Knowing that Drama and Comedy are the most common genres can help the platform decide whether to invest more in similar content to maintain engagement. Title counts alone do not show how well these genres perform.

Marketing Focus: Promotions can highlight trending genres to attract new users.

Audience Segmentation: Insights about popular genres can guide personalized recommendations and improve user satisfaction.

**Possible Negative Growth Insight:**

However, over-reliance on a few genres (Drama and Comedy) may lead to:

Reduced content diversity, causing viewer fatigue or disinterest among users who prefer other genres like Sci-Fi or Documentary.

Lost market opportunities if niche audiences (e.g., action or horror fans) shift to competitors with more varied libraries.

**Justification:**
To sustain long-term growth, Amazon Prime Video should diversify its content portfolio by producing or acquiring more titles in underrepresented genres, ensuring broader audience appeal and competitiveness in the streaming market.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set(style="whitegrid")

# Chart 3: IMDb Score vs Show Type
plt.figure(figsize=(8,6))
sns.boxplot(data=titles, x='type', y='imdb_score', palette='pastel')
plt.title('IMDb Score Distribution by Show Type', fontsize=16)
plt.xlabel('Show Type', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)
plt.ylim(0,10)  # IMDb scores are between 0 and 10

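# Optional: overlay a swarmplot on a random sample for more detail
# (with thousands of rows, swarmplot is very slow and cannot place every point)
sample = titles.sample(n=min(len(titles), 500), random_state=42)
sns.swarmplot(data=sample, x='type', y='imdb_score', order=list(titles['type'].unique()), color='0.25', alpha=0.5, size=3)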

plt.show()

##### 1. Why did you pick the specific chart?

I chose a box and swarm plot because it effectively shows the distribution, spread, and variability of IMDb scores for both TV Shows and Movies. This chart helps in identifying median values, interquartile ranges, and outliers, giving a clear understanding of how the audience rates each type of content.
The swarm plot overlay (dots) adds detail by displaying the density of individual data points, making it easier to observe trends and clustering in ratings.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

TV Shows tend to have slightly higher median IMDb scores compared to Movies.

The spread (interquartile range) for both types is similar, but TV Shows show fewer low-rated outliers.

Movies have a wider distribution, with some titles rated very low, indicating inconsistent quality.

**Insight Summary:**
TV Shows generally receive more consistent and higher ratings than Movies, suggesting that audiences perceive Prime’s TV content as more engaging or better produced on average.
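The median and interquartile range read off the box plot can also be computed directly, which makes the comparison between types easier to state precisely. The sketch below uses made-up scores and type labels, and `iqr` is a hypothetical helper:

```python
import pandas as pd

# Hypothetical scores and type labels for illustration only.
toy = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "imdb_score": [4.0, 5.0, 6.0, 7.0, 6.0, 7.0, 7.0, 8.0],
})

def iqr(scores):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return scores.quantile(0.75) - scores.quantile(0.25)

# One row per type with the median, spread, and number of rated titles.
summary = toy.groupby("type")["imdb_score"].agg(median="median", iqr=iqr, n="count")
print(summary)
```

Because the wrangling step fills missing `imdb_score` values with the median, the spread measured after that step is narrower than in the raw data. Running this kind of summary on the scores before filling may give a fairer comparison.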

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**

Yes — this insight can be valuable for content strategy and investment decisions:

It indicates that Amazon Prime’s TV content quality is well-received, so the platform could invest more in producing high-quality shows to enhance engagement and subscriber retention.

It also highlights the need to improve movie quality by being more selective in acquisitions or focusing on original, high-rated productions.

**Possible Negative Growth Insight:**

However, if the lower-rated movies continue to dominate the library, it may lead to:

Reduced viewer satisfaction and poor user reviews.

Negative brand perception regarding movie quality.

**Justification:**
Maintaining too many low-rated movies can hurt overall platform reputation. Therefore, Amazon Prime Video should balance its content mix — keep expanding well-performing TV shows while curating movies more strategically to ensure sustained viewer satisfaction and platform growth.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set(style="whitegrid")

# Count number of titles released each year
release_trend = titles.groupby('release_year')['title'].count().reset_index()

# Plot
plt.figure(figsize=(12,6))
sns.lineplot(data=release_trend, x='release_year', y='title', marker='o', color='teal')
plt.title('Number of Titles Released Each Year on Amazon Prime Video', fontsize=16)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a line chart because it is the most effective way to visualize trends over time. In this case, I wanted to analyze how the number of titles released each year has changed.
A line chart clearly shows upward or downward trends, peaks, and dips, making it easy to interpret how Amazon Prime Video’s content library has evolved historically.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

A gradual increase in the number of titles released over time, especially after 2015.

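This indicates that the catalog is weighted toward recent releases. Note that release_year records when a title was first released, not when it was added to Prime Video, so the curve describes the age profile of the catalog rather than the pace of acquisitions.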

Some years may show spikes or drops, which could be due to strategic content investments or external factors (e.g., production slowdowns during the pandemic).

Overall, the trend suggests that Amazon Prime is growing its catalog aggressively to compete with other streaming platforms like Netflix and Disney+.
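
To put a number on the "especially after 2015" observation, one option is to compute the share of titles released after a cutoff year. This is a small sketch under the assumption that `release_year` is numeric; `titles_per_year`, `share_released_after`, and the toy data are hypothetical names introduced here for illustration.

```python
import pandas as pd

def titles_per_year(df: pd.DataFrame) -> pd.Series:
    """Number of titles per release year (same grouping as the chart)."""
    return df.groupby("release_year")["title"].count()

def share_released_after(df: pd.DataFrame, year: int) -> float:
    """Fraction of titles whose release_year is strictly greater than `year`."""
    years = df["release_year"].dropna()
    return float((years > year).mean())

# Hypothetical toy data standing in for the notebook's `titles` frame
toy = pd.DataFrame({
    "title": list("ABCDEFGH"),
    "release_year": [1995, 2010, 2016, 2018, 2019, 2019, 2020, 2021],
})
counts = titles_per_year(toy)
share = share_released_after(toy, 2015)
print("Peak release year:", counts.idxmax())
print("Share released after 2015:", share)
```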

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The line chart showing the number of titles released each year on Amazon Prime Video reveals a steady growth trend in content production, particularly after 2015. This indicates that Amazon Prime has been actively expanding its content library to attract and retain subscribers in the competitive streaming market. Such insights can help the business evaluate the effectiveness of its content strategy, allocate budgets for future productions, and identify periods of high growth to replicate their success.

However, there are certain years that may show a temporary decline or stagnation in content releases — for example, around 2020–2021, likely due to the global COVID-19 pandemic disrupting film and series production. While this appears as negative growth, it may also reflect strategic choices to focus on quality rather than quantity, investing in original, high-performing shows instead of mass content.

Overall, the insight supports positive business impact, as understanding these trends enables Amazon Prime to plan future content investments, balance its production pipeline, and strengthen its market position based on data-driven decisions.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure credits is loaded
# credits = pd.read_csv("credits.csv")

# Prepare data
actors = credits[credits['role'].str.upper() == 'ACTOR']
directors = credits[credits['role'].str.upper() == 'DIRECTOR']

top_actors = actors['name'].value_counts().nlargest(10).reset_index()
top_actors.columns = ['name', 'count']

top_directors = directors['name'].value_counts().nlargest(10).reset_index()
top_directors.columns = ['name', 'count']

# Plot
sns.set(style="whitegrid")
fig, axes = plt.subplots(1, 2, figsize=(16, 8), sharey=False)

# Left: Top Actors
sns.barplot(x='count', y='name', data=top_actors, ax=axes[0], palette='pastel')
axes[0].set_title('Top 10 Actors by Number of Credits', fontsize=14)
axes[0].set_xlabel('Number of Credits', fontsize=12)
axes[0].set_ylabel('Actor', fontsize=12)
for i, v in enumerate(top_actors['count']):
    axes[0].text(v + 1, i, str(v), va='center', fontsize=10)

# Right: Top Directors
sns.barplot(x='count', y='name', data=top_directors, ax=axes[1], palette='pastel')
axes[1].set_title('Top 10 Directors by Number of Credits', fontsize=14)
axes[1].set_xlabel('Number of Credits', fontsize=12)
axes[1].set_ylabel('')  # no duplicate y-label
for i, v in enumerate(top_directors['count']):
    axes[1].text(v + 1, i, str(v), va='center', fontsize=10)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose this horizontal bar chart because it clearly visualizes and compares the top actors and directors based on the number of titles they have worked on. It helps identify the most frequent contributors on Amazon Prime Video and provides a quick view of collaboration patterns across productions.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it’s evident that a few actors and directors appear frequently across different titles, indicating that the catalog repeatedly features certain talent. This may reflect strong partnerships or audience demand for these individuals, although prolific careers and licensed back catalogs could produce the same pattern. It also highlights who the most frequent creative contributors are within the platform’s catalog.
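
How concentrated the credits really are can be checked directly. The sketch below assumes the `role` and `name` columns used in the chart code; `top_n_share` and the toy credits are hypothetical. It reports the fraction of all credits for a role that belong to the n most-credited names.

```python
import pandas as pd

def top_n_share(credits: pd.DataFrame, role: str, n: int = 10) -> float:
    """Share of all credits for `role` held by the n most-credited names."""
    names = credits.loc[credits["role"].str.upper() == role.upper(), "name"]
    counts = names.value_counts()
    return float(counts.head(n).sum() / counts.sum())

# Hypothetical toy data standing in for the notebook's `credits` frame
toy_credits = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Bob", "Cy", "Dee", "Dee", "Eli"],
    "role": ["ACTOR"] * 5 + ["DIRECTOR"] * 3,
})
actor_share = top_n_share(toy_credits, "ACTOR", n=1)
director_share = top_n_share(toy_credits, "director", n=1)
print(actor_share, director_share)
```

A small share for the top 10 on the real data would suggest the catalog is not dominated by a handful of names, even if those names stand out in the bar chart.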

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact by enabling Amazon Prime to identify its most successful collaborators and use them strategically in future productions and promotions. These insights can guide casting, partnership, and marketing decisions to maximize viewer engagement.

However, one potential negative insight is the over-reliance on a small pool of actors or directors, which could result in content repetition or reduced variety. To avoid this, Amazon Prime should balance popular collaborations with opportunities for new and diverse talent, ensuring sustained audience interest and creativity in future releases.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Chart 6: IMDb Score vs TMDB Popularity (scatter) with regression line and top-title annotations
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Ensure titles is loaded
# titles = pd.read_csv("titles.csv")

# Prepare data: select relevant columns and drop missing values
df_scatter = titles[['title', 'imdb_score', 'tmdb_popularity']].copy()
df_scatter['tmdb_popularity'] = pd.to_numeric(df_scatter['tmdb_popularity'], errors='coerce')
df_scatter = df_scatter.dropna(subset=['imdb_score', 'tmdb_popularity'])

# Add a log-popularity column to reduce skew (many popularity measures are right-skewed)
df_scatter['tmdb_pop_log'] = np.log10(df_scatter['tmdb_popularity'] + 1)

# Optional: limit extreme outliers for better visualization (uncomment to use)
# upper_pop = df_scatter['tmdb_pop_log'].quantile(0.99)
# df_scatter = df_scatter[df_scatter['tmdb_pop_log'] <= upper_pop]

# Plot
sns.set(style="whitegrid")
plt.figure(figsize=(10,7))

# Scatter points with a small fixed size and partial transparency so overlaps stay visible
sns.scatterplot(
    data=df_scatter,
    x='tmdb_pop_log',
    y='imdb_score',
    alpha=0.6,
    s=40
)

# Add a linear regression fit on the transformed popularity axis
sns.regplot(
    data=df_scatter,
    x='tmdb_pop_log',
    y='imdb_score',
    scatter=False,
    ci=95,
    line_kws={'color':'red', 'linewidth':1.5, 'alpha':0.8}
)

plt.title('IMDb Score vs log10(TMDb Popularity + 1)', fontsize=16)
plt.xlabel('log10(TMDb Popularity + 1)', fontsize=12)
plt.ylabel('IMDb Score', fontsize=12)
plt.ylim(0, 10)

# Annotate top 5 most popular and top 5 highest-rated titles (avoid duplicate annotations)
top_pop = df_scatter.nlargest(5, 'tmdb_popularity')
top_score = df_scatter.nlargest(5, 'imdb_score')

annotated = set()
for _, row in pd.concat([top_pop, top_score]).drop_duplicates().iterrows():
    x = row['tmdb_pop_log']
    y = row['imdb_score']
    title = row['title']
    if title in annotated:
        continue
    plt.annotate(
        title,
        xy=(x, y),
        xytext=(5, 5),
        textcoords='offset points',
        fontsize=9,
        bbox=dict(boxstyle="round,pad=0.2", fc="yellow", alpha=0.3)
    )
    annotated.add(title)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot because it is ideal for visualizing the relationship between two continuous variables — in this case, IMDb Score and TMDb Popularity. This chart helps to understand whether highly popular titles also tend to have higher audience ratings or if popularity and quality are not strongly correlated. The addition of a regression line helps reveal the general trend between these two metrics.

##### 2. What is/are the insight(s) found from the chart?

From the scatter plot, it can be observed that there is a mild positive correlation between IMDb Scores and TMDb Popularity — meaning that titles with higher IMDb ratings generally have higher popularity scores.
However, some titles with average or even low IMDb ratings still show high popularity, indicating that popularity may also be driven by marketing campaigns, star power, or trending genres, not just quality.
Top-rated titles appear as distinct outliers, suggesting that critically acclaimed content tends to attract stable, long-term interest.
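
The strength of the relationship described above can be quantified instead of read off the regression line. This is a minimal sketch, assuming the same `imdb_score` and `tmdb_popularity` columns; `score_popularity_corr` and the toy values are hypothetical. The rank correlation is computed as the Pearson correlation of the ranks, which is the definition of Spearman's rho.

```python
import numpy as np
import pandas as pd

def score_popularity_corr(df: pd.DataFrame) -> dict:
    """Pearson (on log10 popularity) and rank correlation of score vs popularity."""
    d = df[["imdb_score", "tmdb_popularity"]].apply(pd.to_numeric, errors="coerce").dropna()
    log_pop = np.log10(d["tmdb_popularity"] + 1)
    pearson = d["imdb_score"].corr(log_pop)
    # Spearman's rho equals the Pearson correlation of the ranks
    spearman = d["imdb_score"].rank().corr(d["tmdb_popularity"].rank())
    return {"pearson": float(pearson), "spearman": float(spearman), "n": len(d)}

# Hypothetical toy data; the last row is dropped because its score is missing
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, None],
    "tmdb_popularity": [9.0, 999.0, 99.0, 9999.0, 5.0],
})
result = score_popularity_corr(toy)
print(result)
```

On the real data, a coefficient near zero would mean the "mild positive correlation" is weaker than the fitted line visually suggests.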

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can have a positive business impact by helping Amazon Prime Video understand the balance between audience ratings (IMDb scores) and popularity. The platform can use this information to:

Promote highly-rated yet under-viewed titles to boost engagement.

Analyze marketing success of popular but low-rated titles to learn what drives viewer attention.

Invest strategically in content that performs well both critically and commercially.

On the other hand, the chart also highlights a potential negative insight — some titles with high popularity but low ratings may indicate short-term hype without sustained viewer satisfaction. This could harm long-term brand reputation if viewers perceive content as overhyped or low-quality. Therefore, Amazon Prime should aim to balance popularity with quality, ensuring that highly promoted titles also deliver strong viewer satisfaction to maintain credibility and retention.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Explode genres into separate rows (if not done earlier)
titles_exploded = titles.explode('genres')

# Remove 'Unknown' and missing genres
titles_exploded = titles_exploded[titles_exploded['genres'].notna() & (titles_exploded['genres'] != 'Unknown')]

# Group by genre and calculate average IMDb score
avg_genre_rating = (
    titles_exploded.groupby('genres')['imdb_score']
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)

# Take top 10 genres by IMDb score
top10_genre_rating = avg_genre_rating.head(10)

# Plot
sns.set(style="whitegrid")
plt.figure(figsize=(10,6))
sns.barplot(data=top10_genre_rating, x='imdb_score', y='genres', palette='coolwarm')

plt.title('Top 10 Genres by Average IMDb Rating', fontsize=16)
plt.xlabel('Average IMDb Score', fontsize=12)
plt.ylabel('Genre', fontsize=12)

# Annotate bars with values
for i, value in enumerate(top10_genre_rating['imdb_score']):
    plt.text(value + 0.05, i, f"{value:.2f}", va='center', fontsize=10)

plt.xlim(0, 10)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly displays and compares the average IMDb ratings of the ten highest-rated genres. This format makes it easy to rank genres by audience rating and highlights trends in viewer preferences.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that certain genres receive higher average ratings than others. For example, genres like Documentary, Drama, or Crime tend to have higher average IMDb scores, indicating strong audience appreciation. Because the chart shows only the top 10 genres, lower-rated genres (possibly Reality or Comedy) do not appear in it; the full avg_genre_rating table is needed to confirm which genres rank lowest.
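
An average over a genre with only a handful of titles can be misleading, so it may help to report the title count next to each mean and to filter out thin genres. The sketch below assumes `genres` already holds Python lists; `genre_rating_table`, the `min_titles` threshold, and the toy data are hypothetical.

```python
import pandas as pd

def genre_rating_table(df: pd.DataFrame, min_titles: int = 2) -> pd.DataFrame:
    """Mean imdb_score and rated-title count per genre, for genres with enough titles."""
    exploded = df.explode("genres").dropna(subset=["genres", "imdb_score"])
    table = exploded.groupby("genres")["imdb_score"].agg(mean_score="mean", n_titles="count")
    kept = table[table["n_titles"] >= min_titles]
    return kept.sort_values("mean_score", ascending=False)

# Hypothetical toy data standing in for the notebook's `titles` frame
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["drama", "crime"], ["drama"], ["comedy"], ["comedy", "reality"]],
    "imdb_score": [8.0, 7.0, 6.0, 5.0],
})
table = genre_rating_table(toy)
print(table)
```

Calling `.head(10)` and `.tail(10)` on the result gives both ends of the ranking, which the top-10 bar chart alone does not show.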

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can positively impact business decisions:

Amazon Prime can invest more in high-rated genres to attract and retain viewers.

Marketing strategies can highlight critically acclaimed genres to boost engagement.

Content acquisition and production decisions can prioritize genres that maintain high audience satisfaction.

However, the chart also reveals a potential negative insight: genres with consistently low ratings may lead to viewer dissatisfaction or churn if overrepresented. For example, producing too many low-rated Reality or Comedy titles could reduce overall platform credibility. To avoid negative impact, Amazon Prime should balance content quantity with quality, ensuring a diverse mix while emphasizing genres that are well-received.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Ensure 'runtime' is numeric
titles['runtime'] = pd.to_numeric(titles['runtime'], errors='coerce')
titles_runtime = titles.dropna(subset=['runtime'])  # keep `titles` itself intact for later charts

# Optional: trim extreme values (below the 5th / above the 95th percentile) for better visualization
lower = titles_runtime['runtime'].quantile(0.05)
upper = titles_runtime['runtime'].quantile(0.95)
titles_filtered = titles_runtime[(titles_runtime['runtime'] >= lower) & (titles_runtime['runtime'] <= upper)]

# Plot
plt.figure(figsize=(10,6))
sns.boxplot(data=titles_filtered, x='type', y='runtime', palette='pastel')
plt.title('Distribution of Runtime: Movies vs TV Shows', fontsize=16)
plt.xlabel('Show Type', fontsize=12)
plt.ylabel('Runtime (minutes)', fontsize=12)
plt.ylim(0, titles_filtered['runtime'].max() + 10)

# Overlay swarmplot for more detail (optional)
sns.swarmplot(data=titles_filtered, x='type', y='runtime', color='0.25', alpha=0.5, size=3)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a boxplot because it effectively shows the distribution, median, and spread of runtime for Movies and TV Shows. This chart allows us to compare typical durations, identify outliers, and understand viewing patterns for different content types.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

Movies generally have longer runtimes than TV Show episodes, as expected.

Most TV Shows have shorter, more consistent runtimes, usually under 60 minutes per episode.

There are some outliers: extremely long movies or special episodes for TV Shows, which might indicate epic productions or extended content.

This suggests that Amazon Prime offers a mix of quick episodic content and long-form movies to cater to different viewer preferences.
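
The chart code trims runtimes using percentiles computed over movies and shows together, which can remove short episodes and long films unevenly. As an alternative, the following sketch computes the bounds within each content type. It is only an illustration: `trim_by_type` and the toy data are hypothetical, and the 5%/95% cutoffs simply mirror the values used in the chart cell.

```python
import pandas as pd

def trim_by_type(df: pd.DataFrame, lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """Keep rows whose runtime lies within the quantile bounds of their own type."""
    grouped = df.groupby("type")["runtime"]
    # Per-type bounds, mapped back onto every row via its type label
    lo = df["type"].map(grouped.quantile(lower))
    hi = df["type"].map(grouped.quantile(upper))
    return df[(df["runtime"] >= lo) & (df["runtime"] <= hi)]

# Hypothetical toy data standing in for the notebook's `titles` frame
toy = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "runtime": [90, 95, 100, 105, 300, 20, 22, 25, 45, 60],
})
trimmed = trim_by_type(toy)
print(trimmed.groupby("type")["runtime"].agg(["min", "max", "count"]))
```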

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can have a positive business impact:

Content planning teams can optimize runtime for audience engagement, knowing the typical lengths viewers prefer.

Marketing can highlight short or long-form content depending on viewer availability and binge-watching habits.

It helps improve user experience, as runtime consistency can guide recommendation engines.

A potential negative insight is that excessively long movies or episodes may deter some viewers, especially casual watchers with limited time. Over-reliance on long-form content without shorter alternatives could reduce viewer engagement or satisfaction. Therefore, balancing content length with audience preferences is key to maximizing retention.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Clean data: fill missing age_certification with 'Unknown'
titles['age_certification'] = titles['age_certification'].fillna('Unknown')

# Plot
plt.figure(figsize=(12,6))
sns.countplot(data=titles, x='age_certification', order=titles['age_certification'].value_counts().index, palette='pastel')
plt.title('Distribution of Titles by Age Certification', fontsize=16)
plt.xlabel('Age Certification', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)

# Annotate bars with counts
for p in plt.gca().patches:
    plt.gca().annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width()/2., p.get_height()),
                       ha='center', va='bottom', fontsize=10, color='black')

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a countplot because it clearly visualizes the number of titles available for each age certification. This chart helps understand the demographics that Amazon Prime Video targets, and highlights the platform’s approach to content suitability for different audiences.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

Most titles fall under general or adult-oriented certifications like PG, PG-13, R, or 18+, indicating that Amazon Prime focuses heavily on content for teenagers and adults.

A smaller number of titles are rated for younger children (G or TV-Y), suggesting a limited offering for very young viewers.

Titles with “Unknown” ratings indicate missing certification data, which may require attention for proper viewer guidance.
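
Since the reading of this chart depends on how large the "Unknown" bar is, it may be worth expressing each certification as a percentage of the catalog. This is a small sketch; `certification_shares` and the toy values are hypothetical, and missing values are counted as "Unknown" exactly as in the chart cell.

```python
import pandas as pd

def certification_shares(df: pd.DataFrame) -> pd.Series:
    """Percentage of titles per age certification, missing values counted as 'Unknown'."""
    cert = df["age_certification"].fillna("Unknown")
    return (cert.value_counts(normalize=True) * 100).round(1)

# Hypothetical toy data standing in for the notebook's `titles` frame
toy = pd.DataFrame({
    "age_certification": ["R", "R", "PG-13", None, None, None, "TV-Y", "PG"],
})
shares = certification_shares(toy)
print(shares)
```

If "Unknown" turns out to be the largest category on the real data, conclusions about target age groups apply only to the certified subset.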

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can positively impact business decisions:

Helps Amazon Prime understand which age groups are best served, guiding content acquisition and production strategy.

Marketing campaigns can be tailored to the largest audience segments, improving engagement and subscription retention.

Enables better parental guidance and recommendation system design by highlighting gaps in children’s content.

A potential negative insight is the underrepresentation of content for younger audiences. If Amazon Prime continues to have few titles for children, it may lose potential family subscriptions to competitors like Disney+ or Netflix, which have stronger children’s content. Addressing this gap could create new growth opportunities for family-oriented subscribers.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Filter only TV Shows (copy to avoid modifying a view of `titles`)
tv_shows = titles[titles['type'] == 'SHOW'].copy()

# Fill missing or 0 seasons with 1 (optional, assuming minimum 1 season) and store as integers
tv_shows['seasons'] = tv_shows['seasons'].fillna(1).replace(0, 1).astype(int)

# Limit number of seasons to 20 for visualization clarity
tv_shows_plot = tv_shows[tv_shows['seasons'] <= 20]

# Plot
plt.figure(figsize=(12,6))
sns.countplot(data=tv_shows_plot, x='seasons', palette='pastel')
plt.title('Distribution of Number of Seasons in TV Shows', fontsize=16)
plt.xlabel('Number of Seasons', fontsize=12)
plt.ylabel('Number of TV Shows', fontsize=12)

# Annotate bars
for p in plt.gca().patches:
    plt.gca().annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width()/2., p.get_height()),
                       ha='center', va='bottom', fontsize=9, color='black')

plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a countplot because it effectively visualizes the distribution of TV shows based on the number of seasons. This chart helps understand the content longevity on Amazon Prime Video and reveals how much investment is made into long-running versus short-running series.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that:

The majority of TV shows on Amazon Prime Video have only 1 to 3 seasons, indicating a strong focus on limited or mini-series formats.

Only a few titles extend beyond 5 seasons, showing that long-running series are relatively rare on the platform.

This suggests that Amazon Prime’s strategy may emphasize fresh, short-duration content over multi-year commitments.
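
The "1 to 3 seasons" and "beyond 5 seasons" statements can be turned into shares with a simple binning step. The sketch below is illustrative only: `season_buckets`, the bucket edges, and the toy data are hypothetical, and the `'SHOW'` label follows the filter used in the chart cell.

```python
import pandas as pd

def season_buckets(df: pd.DataFrame) -> pd.Series:
    """Share of TV shows with 1-3, 4-5 and 6+ seasons (rows without a value are skipped)."""
    seasons = df.loc[df["type"] == "SHOW", "seasons"].dropna()
    # Bins are right-inclusive: (0, 3], (3, 5], (5, inf)
    buckets = pd.cut(seasons, bins=[0, 3, 5, float("inf")], labels=["1-3", "4-5", "6+"])
    return buckets.value_counts(normalize=True).sort_index()

# Hypothetical toy data standing in for the notebook's `titles` frame
toy = pd.DataFrame({
    "type": ["SHOW"] * 8 + ["MOVIE"],
    "seasons": [1, 1, 2, 3, 3, 4, 6, 10, None],
})
bucket_shares = season_buckets(toy)
print(bucket_shares)
```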

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help drive positive business impact:

By recognizing the prevalence of short series, Amazon Prime can continue producing limited, high-quality shows that attract binge-watchers and minimize production risk.

Understanding which series sustain multiple seasons can guide renewal and investment decisions, focusing resources on the most engaging content.

Helps balance viewer retention with content diversity, ensuring both new and returning audiences stay interested.

A possible negative insight is that the lack of long-running series could limit long-term viewer attachment. Platforms like Netflix or HBO often benefit from multi-season franchises that build loyal fanbases. To counter this, Amazon Prime could develop a few flagship long-term shows while maintaining a mix of short-term content to sustain both engagement and variety.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Explode the production_countries list into separate rows
titles_exploded = titles.explode('production_countries')

# Remove 'Unknown' and missing countries
titles_exploded = titles_exploded[titles_exploded['production_countries'].notna() & (titles_exploded['production_countries'] != 'Unknown')]

# Count top 10 countries by number of titles
top_countries = titles_exploded['production_countries'].value_counts().nlargest(10).reset_index()
top_countries.columns = ['country', 'count']

# Plot
plt.figure(figsize=(12,6))
sns.barplot(data=top_countries, x='count', y='country', palette='viridis')
plt.title('Top 10 Production Countries on Amazon Prime Video', fontsize=16)
plt.xlabel('Number of Titles', fontsize=12)
plt.ylabel('Country', fontsize=12)

# Annotate bars
for i, value in enumerate(top_countries['count']):
    plt.text(value + 5, i, str(value), va='center', fontsize=10)

plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart because it clearly visualizes and compares the number of titles produced by different countries. This chart helps understand the geographical distribution of content on Amazon Prime Video and highlights which countries are the main contributors.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

The United States likely dominates content production, followed by countries such as United Kingdom, India, and Canada.

This indicates that Amazon Prime Video primarily sources content from English-speaking countries and other major production hubs.

Smaller contributions from other countries may represent niche or regional content, reflecting limited global diversity in the platform’s catalog.
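
`explode` only splits a row when the cell holds a real Python list. If `production_countries` was read from CSV as text such as `"['US', 'GB']"`, it has to be parsed first. The following is a hedged sketch of that step using the standard library's `ast.literal_eval`; `parse_list_cell`, `country_counts`, and the toy values are hypothetical, and the stored text format is an assumption.

```python
import ast
import pandas as pd

def parse_list_cell(cell):
    """Turn a cell like "['US', 'GB']" into a Python list; pass real lists through."""
    if isinstance(cell, list):
        return cell
    if isinstance(cell, str):
        try:
            parsed = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return []
        return list(parsed) if isinstance(parsed, (list, tuple)) else []
    return []  # NaN or any other non-list value

def country_counts(df: pd.DataFrame) -> pd.Series:
    """Number of titles per production country after parsing and exploding."""
    countries = df["production_countries"].map(parse_list_cell).explode().dropna()
    return countries.value_counts()

# Hypothetical toy data mixing text lists, an empty list, a missing value and a real list
cells = pd.Series(["['US']", "['US', 'GB']", "[]", None, ["IN"]], dtype=object)
toy = cells.to_frame("production_countries")
counts = country_counts(toy)
print(counts)
```

Dividing the result by `counts.sum()` gives each country's share, which makes the degree of United States dominance explicit.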

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can positively impact business decisions:

Understanding content origin helps plan regional marketing strategies and localization efforts, such as subtitles and dubbing.

It can guide content acquisition, allowing Amazon Prime to target underrepresented regions to increase global appeal and subscriber growth.

A potential negative insight is that over-reliance on a few countries (like the USA) may lead to limited cultural diversity. This could alienate international audiences looking for region-specific content. To mitigate this, Amazon Prime should invest in content from diverse countries, which can attract a broader global subscriber base and enhance engagement.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Ensure 'release_year' and 'imdb_score' are numeric
titles['release_year'] = pd.to_numeric(titles['release_year'], errors='coerce')
titles['imdb_score'] = pd.to_numeric(titles['imdb_score'], errors='coerce')

# Drop rows with missing values
df_year_score = titles.dropna(subset=['release_year', 'imdb_score'])

# Group by release year and calculate average IMDb score
avg_score_year = df_year_score.groupby('release_year')['imdb_score'].mean().reset_index()

# Plot
plt.figure(figsize=(12,6))
sns.lineplot(data=avg_score_year, x='release_year', y='imdb_score', marker='o', color='blue')
plt.title('Average IMDb Score of Titles Over the Years', fontsize=16)
plt.xlabel('Release Year', fontsize=12)
plt.ylabel('Average IMDb Score', fontsize=12)
plt.ylim(0,10)
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a line chart because it effectively shows trends over time. In this case, it allows us to visualize how the average IMDb scores of titles released on Amazon Prime Video have changed over the years, providing insight into content quality evolution.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

There may be gradual improvement in average ratings in recent years, indicating that Amazon Prime is producing or acquiring higher-quality content over time.

Some years might show dips or spikes, possibly reflecting experimental content, shifts in content strategy, or external factors affecting production quality.

This trend helps identify periods of strong content performance versus years that may need analysis for improvement.
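
Years with very few rated titles can produce the dips and spikes mentioned above purely by chance. One way to guard against that is to keep the title count next to each yearly mean and drop thin years. This is a minimal sketch; `yearly_mean_score`, the `min_titles` threshold, and the toy data are hypothetical.

```python
import pandas as pd

def yearly_mean_score(df: pd.DataFrame, min_titles: int = 3) -> pd.DataFrame:
    """Mean imdb_score per release year, keeping only years with enough rated titles."""
    rated = df.dropna(subset=["release_year", "imdb_score"])
    stats = rated.groupby("release_year")["imdb_score"].agg(mean_score="mean", n_titles="count")
    return stats[stats["n_titles"] >= min_titles]

# Hypothetical toy data: 1950 has a single (high) score and is filtered out
toy = pd.DataFrame({
    "release_year": [1950, 2019, 2019, 2019, 2020, 2020, 2020],
    "imdb_score": [9.0, 6.0, 7.0, 8.0, 5.0, 6.0, 7.0],
})
yearly = yearly_mean_score(toy)
print(yearly)
```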

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can positively impact business decisions:

Amazon Prime can analyze periods of high average ratings to replicate successful strategies in production and acquisition.

Marketing teams can highlight consistently high-rated years or popular content eras to attract subscribers.

Helps in strategic content planning, focusing on quality to maintain audience satisfaction and retention.

A potential negative insight is that years with lower average IMDb scores may indicate underperforming content, which could affect viewer satisfaction and brand perception. To mitigate this, Amazon Prime should invest in quality control, audience research, and targeted acquisitions to maintain a consistently high content standard.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Ensure 'tmdb_popularity' is numeric
titles['tmdb_popularity'] = pd.to_numeric(titles['tmdb_popularity'], errors='coerce')
titles_pop = titles.dropna(subset=['tmdb_popularity']).copy()  # copy so the new column is not written to a view

# Optional: log-transform to reduce skewness
titles_pop['tmdb_pop_log'] = np.log10(titles_pop['tmdb_popularity'] + 1)

# Plot
plt.figure(figsize=(12,6))
sns.histplot(titles_pop['tmdb_pop_log'], bins=50, kde=True, color='teal')
plt.title('Distribution of TMDb Popularity (Log Scale)', fontsize=16)
plt.xlabel('log10(TMDb Popularity + 1)', fontsize=12)
plt.ylabel('Number of Titles', fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with log scale because it effectively visualizes the distribution of TMDb popularity scores across all titles. This chart highlights how content popularity varies, showing both niche titles and highly trending titles on Amazon Prime Video.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe:

The distribution is right-skewed, indicating that most titles have moderate popularity, while a few titles are extremely popular.

Highly popular titles stand out as outliers, likely due to star cast, marketing campaigns, or trending genres.

This suggests that while Amazon Prime has a diverse content library, only a small fraction of titles attract most of the attention, at least as measured by TMDb popularity, which is a proxy and not Prime Video viewing data.
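
The "small fraction of titles" statement can be checked by measuring how much of the summed popularity the top titles hold. The sketch below is illustrative; `top_share`, the 10% fraction, and the toy values are hypothetical, and pandas' built-in `skew()` is used as a quick check of right skew.

```python
import pandas as pd

def top_share(popularity, fraction: float = 0.1) -> float:
    """Share of the summed popularity held by the top `fraction` of titles."""
    pop = pd.to_numeric(pd.Series(popularity), errors="coerce").dropna()
    pop = pop.sort_values(ascending=False)
    k = max(1, int(round(len(pop) * fraction)))  # at least one title
    return float(pop.head(k).sum() / pop.sum())

# Hypothetical toy popularity values: one title far above the rest
toy_pop = pd.Series([100, 10, 10, 10, 10, 10, 10, 10, 10, 20])
share_top10pct = top_share(toy_pop, fraction=0.1)
print("Top 10% share:", share_top10pct)
print("Skewness:", toy_pop.skew())
```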

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can positively impact business decisions:

Amazon Prime can identify and promote highly popular titles to maximize engagement and subscription growth.

Insights can guide marketing campaigns for moderately popular or niche content to increase visibility.

Helps in content acquisition and production planning, focusing on factors that make titles popular.

A potential negative insight is that the majority of content having moderate popularity may indicate that many titles do not capture significant viewer attention. If overrepresented, this could lead to lower overall engagement. To mitigate this, Amazon Prime should invest in promoting underperforming titles with potential and maintain a balanced mix of high and moderate popularity content.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Select numeric columns for correlation
numeric_cols = ['release_year', 'runtime', 'seasons', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']

# Ensure numeric columns are numeric type
for col in numeric_cols:
    titles[col] = pd.to_numeric(titles[col], errors='coerce')

# Drop rows with all NaN numeric values
df_numeric = titles[numeric_cols].dropna(how='all')

# Compute correlation matrix
corr_matrix = df_numeric.corr()

# Plot heatmap
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Variables', fontsize=16)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a correlation heatmap because it visually shows the relationships between numeric variables in the dataset. This helps identify which variables are positively or negatively correlated, providing insight into potential dependencies or patterns in the data.

##### 2. What is/are the insight(s) found from the chart?

From the heatmap, we can observe:

IMDb score and TMDb score generally show a moderate positive correlation, indicating that titles rated highly on IMDb also tend to be rated highly on TMDb.

IMDb votes and TMDb popularity are strongly correlated, reflecting that titles with more votes tend to have higher popularity.

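Other variables, such as runtime and number of seasons, may show weak or negative correlations, reflecting natural differences between movies and TV shows. Note that corr() uses only the rows where both variables are present, so correlations involving seasons are computed over the titles that have a seasons value.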

The heatmap highlights which features are closely related and which are independent, aiding further analysis.
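
A heatmap is easy to scan but hard to rank, and it hides how many rows each coefficient is based on. The following sketch lists the variable pairs by absolute correlation together with the number of rows where both values are present. `ranked_pairs` and the toy data are hypothetical names used only for illustration.

```python
import pandas as pd

def ranked_pairs(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Variable pairs sorted by |r|, with the number of rows used for each pair."""
    corr = df[cols].corr()  # pairwise-complete Pearson correlation
    present = df[cols].notna().astype(int)
    n_obs = present.T.dot(present)  # rows where both columns are present
    rows = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            rows.append({"a": a, "b": b, "r": corr.loc[a, b], "n": int(n_obs.loc[a, b])})
    out = pd.DataFrame(rows)
    order = out["r"].abs().sort_values(ascending=False).index
    return out.reindex(order).reset_index(drop=True)

# Hypothetical toy data: `seasons` is missing for the first two rows
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_score": [5.0, 6.0, 8.0, 7.0, 9.0],
    "seasons": [None, None, 1.0, 2.0, 4.0],
})
pairs = ranked_pairs(toy, ["imdb_score", "tmdb_score", "seasons"])
print(pairs)
```

In the toy output the strongest coefficient rests on only three rows, which is the kind of caveat the `n` column is meant to surface.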

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Select numeric columns for pair plot
numeric_cols = ['runtime', 'seasons', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']

# Ensure numeric columns are numeric type
for col in numeric_cols:
    titles[col] = pd.to_numeric(titles[col], errors='coerce')

# Drop rows with missing values in these columns
# (this also removes every title without a 'seasons' value, which may exclude movies entirely)
df_pair = titles[numeric_cols].dropna()

# Optional: sample data if dataset is too large
df_sample = df_pair.sample(min(2000, len(df_pair)), random_state=42)

# Plot pairplot
sns.pairplot(df_sample)
plt.suptitle('Pair Plot of Numeric Variables', y=1.02, fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot because it allows visualizing relationships between multiple numeric variables simultaneously. This helps identify patterns, correlations, and potential outliers across key metrics like IMDb score, TMDb score, votes, runtime, and seasons.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot, we can observe:

IMDb score and TMDb score generally have a positive correlation, indicating that higher-rated titles on IMDb tend to also score higher on TMDb.

IMDb votes and TMDb popularity show a strong positive relationship, highlighting that widely rated content is also more popular.

Most scatterplots indicate that runtime and number of seasons vary independently of ratings and popularity, reflecting differences between movies and TV shows.

Outliers are visible, such as extremely popular or highly rated titles, which can be targeted for further analysis.
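One simple way to pull such outliers out for closer inspection is the interquartile-range (IQR) rule. This is a sketch under assumptions: the helper `iqr_outliers`, the multiplier `k=1.5`, and the toy values are illustrative and were not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Return rows whose value in `col` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = pd.to_numeric(df[col], errors="coerce")
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Rows with missing values compare as False and are therefore not flagged
    return df[(values < lower) | (values > upper)]

# Toy stand-in for the titles DataFrame (illustrative values only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "tmdb_popularity": [2.0, 3.0, 2.5, 4.0, 3.5, 250.0],
})
outliers = iqr_outliers(toy, "tmdb_popularity")
print(outliers)
```

Because popularity and vote counts are usually heavily right-skewed, the rule may flag many titles on real data; applying it to a log-transformed column is a common alternative.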

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

### **Business Recommendations to Achieve Objectives**

**1. Content Strategy Optimization**

Invest in genres with higher average ratings (e.g., Drama, Documentary, Crime) to attract and retain subscribers.

Maintain a balance between popular and niche content, ensuring variety while leveraging high-demand titles.

Encourage production of both short-form series and long-running shows to cater to different viewer preferences.
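The genre ranking behind this recommendation can be recomputed with a short aggregation. The sketch below assumes that `genres` is stored as a string such as `"['drama', 'crime']"`; the helper `mean_score_by_genre`, the `min_titles` cutoff, and the toy rows are illustrative assumptions.

```python
import ast
import pandas as pd

def mean_score_by_genre(df: pd.DataFrame, min_titles: int = 2) -> pd.DataFrame:
    """Average IMDb score per genre, ignoring genres with too few titles."""
    g = df[["genres", "imdb_score"]].dropna().copy()
    # Assumption: genres are stored as strings such as "['drama', 'crime']"
    g["genres"] = g["genres"].apply(ast.literal_eval)
    # One row per (title, genre); titles with an empty genre list are dropped
    g = g.explode("genres").dropna(subset=["genres"])
    stats = g.groupby("genres")["imdb_score"].agg(mean_score="mean", n_titles="count")
    return stats[stats["n_titles"] >= min_titles].sort_values("mean_score", ascending=False)

# Toy stand-in for the titles DataFrame (illustrative values only)
toy = pd.DataFrame({
    "genres": ["['drama']", "['drama', 'crime']", "['comedy']",
               "['comedy', 'reality']", "['crime']", "[]"],
    "imdb_score": [8.0, 7.0, 5.0, 6.0, 7.5, 6.5],
})
by_genre = mean_score_by_genre(toy)
print(by_genre)
```

The `min_titles` filter keeps a genre with only a handful of titles from topping the ranking on the strength of one or two scores.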

**2. Audience Targeting & Personalization**

Use insights from age certifications, runtime, and popularity to design personalized recommendations for different demographics.

Promote highly rated but under-viewed titles to boost engagement and increase content discovery.
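The dataset has no view counts, so finding highly rated but under-viewed titles needs a proxy. The sketch below uses `imdb_votes` as that proxy, which is an assumption; the helper `hidden_gems`, both thresholds, and the toy rows are also illustrative.

```python
import pandas as pd

def hidden_gems(df: pd.DataFrame, min_score: float = 7.5,
                votes_quantile: float = 0.5) -> pd.DataFrame:
    """Titles scoring at least `min_score` whose vote count is at or below the given quantile."""
    scores = pd.to_numeric(df["imdb_score"], errors="coerce")
    votes = pd.to_numeric(df["imdb_votes"], errors="coerce")
    # Assumption: a low IMDb vote count stands in for low viewership
    vote_cutoff = votes.quantile(votes_quantile)
    mask = (scores >= min_score) & (votes <= vote_cutoff)
    return df[mask].sort_values("imdb_score", ascending=False)

# Toy stand-in for the titles DataFrame (illustrative values only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "imdb_score": [8.5, 7.8, 6.0, 9.0, 7.6, 5.0],
    "imdb_votes": [500, 120000, 300, 900000, 1500, 80000],
})
gems = hidden_gems(toy)
print(gems[["title", "imdb_score", "imdb_votes"]])
```

Scores based on very few votes are noisy, so a minimum vote count may be worth adding before promoting any title from such a list.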

**3. Global Content Expansion**

Increase investment in underrepresented production countries to diversify content and attract international subscribers.

Localize content via dubbing and subtitles for regions that show demand but currently have few titles.
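To see which production countries are underrepresented, each country's share of the catalog can be computed directly. This sketch assumes `production_countries` is stored as a string such as `"['US', 'GB']"`; the helper `country_share` and the toy rows are illustrative.

```python
import ast
import pandas as pd

def country_share(df: pd.DataFrame) -> pd.Series:
    """Share of country mentions per production country, largest first."""
    # Assumption: production_countries is stored as a string such as "['US', 'GB']"
    countries = (
        df["production_countries"]
        .dropna()
        .apply(ast.literal_eval)
        .explode()   # one row per (title, country)
        .dropna()    # titles with an empty country list
    )
    return countries.value_counts(normalize=True)

# Toy stand-in for the titles DataFrame (illustrative values only)
toy = pd.DataFrame({
    "production_countries": ["['US']", "['US']", "['US', 'GB']",
                             "['IN']", "['US']", "[]"],
})
share = country_share(toy)
print(share)
```

The shares are computed over country mentions rather than titles, so a co-production counts once for each country involved.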

**4. Marketing & Engagement**

Focus marketing on top-rated and highly popular titles to maximize reach and subscriber interest.

Highlight trending actors, directors, and franchises to strengthen brand loyalty.

**5. Quality & Viewer Satisfaction**

Avoid over-reliance on titles that are popular but low-rated to prevent viewer dissatisfaction.

Use correlation and rating trends to guide content acquisitions that balance quality and engagement.

**Brief Summary:** By leveraging data on ratings, popularity, genres, runtime, and production trends, Amazon Prime can optimize content offerings, improve marketing strategies, enhance personalization, and expand globally, leading to increased subscriber engagement, retention, and revenue growth.

# **Conclusion**

The exploratory data analysis of Amazon Prime Video’s content library provides valuable insights into content trends, audience preferences, and platform strategy. Key findings include:

**Content Diversity:** Drama, Documentary, and Crime genres consistently receive higher ratings, while Reality and Comedy have lower average ratings.

**Audience Targeting:** Most titles are aimed at teenagers and adults, with limited offerings for younger children.

**Popularity vs Quality:** Highly popular titles are not always the highest rated, suggesting that factors other than quality, such as marketing, star cast, or trending topics, may drive engagement.

**Global Reach:** Content is dominated by a few production countries, highlighting opportunities to diversify regionally.

**Content Duration & Series:** Movies have varied runtimes, while most TV shows are short-run series, suggesting a mix of binge-watch and limited content strategies.

**Key Contributors:** Certain actors and directors appear frequently, signaling trusted collaborations that can attract audiences.

**Business Implications:** By leveraging these insights, Amazon Prime Video can:

Optimize content acquisition and production based on highly rated genres and trends.

Enhance personalized recommendations and marketing strategies to improve engagement.

Expand international and family-friendly content to attract new subscriber segments.

Balance quality and popularity to maintain long-term viewer satisfaction and retention.

In conclusion, this analysis demonstrates how data-driven decisions can guide Amazon Prime Video in content strategy, marketing, and global expansion, ultimately enhancing subscriber growth, engagement, and platform competitiveness.
