<a href="https://colab.research.google.com/github/swaroopsamantaray18/Amazon-Prime-Video/blob/main/Amazon_Prime_Video_EDA_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Amazon Prime TV Shows and Movies
Exploratory Data Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
Swaroop K Samantaray


# **Project Summary -**

Performed comprehensive Exploratory Data Analysis (EDA) on the Amazon Prime Video movies and TV shows dataset to analyze content distribution, trends, and patterns across multiple attributes. Utilized Python (Pandas, NumPy, Matplotlib, Seaborn) to clean and preprocess data by handling missing values, standardizing columns, and removing inconsistencies. Conducted in-depth analysis of genre distribution, release year trends, content ratings, and country-wise availability to identify key patterns in platform content strategy. Compared Movies vs TV Shows to uncover differences in content volume, duration, and release behavior. Developed insightful data visualizations and summary reports to support understanding of content growth, audience targeting, and platform content planning.

Skills Used: Python, Pandas, NumPy, Data Cleaning, Exploratory Data Analysis (EDA), Data Visualization, Matplotlib, Seaborn, Business Insights


# **GitHub Link -**

https://github.com/swaroopsamantaray18/Amazon-Prime-Video.git

# **Problem Statement**


The purpose of this project is to examine and interpret the Amazon Prime Video titles dataset in order to gain a deeper understanding of the content offered on the platform. This analysis focuses on identifying patterns in content types, release year trends, genre popularity, and regional distribution of titles. By exploring the data, the project seeks to answer key business questions such as: which format—Movies or TV Shows—has a stronger presence on the platform, how content creation has evolved over the years, which genres attract the most content, and which countries contribute the highest number of titles. The insights derived from this analysis help in understanding platform content strategy and regional content expansion.

#### **Define Your Business Objective?**

The objective of this project is to analyze the Amazon Prime Video movies and TV shows dataset to generate data-driven insights that support content strategy, audience engagement, and platform growth. The analysis focuses on identifying high-performing content types, popular genres, regional rating trends, and the impact of cast and crew on content success. These insights are used to inform content acquisition decisions, optimize genre mix, and support strategic planning. The project also establishes a foundation for building predictive machine learning models, such as content rating prediction and recommendation systems, to enhance user experience and improve content discoverability.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
files.upload()

titles = pd.read_csv("titles.csv")
credits = pd.read_csv("credits.csv")

titles.head(), credits.head()


### Dataset First View

In [None]:
# Dataset First Look

print("Titles Dataset Head:")
display(titles.head())

print("Credits Dataset Head:")
display(credits.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

print("Titles Dataset Shape (rows, columns):", titles.shape)
print("Credits Dataset Shape (rows, columns):", credits.shape)

print("\nNumber of Rows in Titles:", titles.shape[0])
print("Number of Columns in Titles:", titles.shape[1])

print("\nNumber of Rows in Credits:", credits.shape[0])
print("Number of Columns in Credits:", credits.shape[1])


### Dataset Information

In [None]:
# Dataset Info

titles.info()
print("\n")
credits.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# Count fully duplicated rows in each dataset
duplicate_titles = titles.duplicated().sum()
duplicate_credits = credits.duplicated().sum()

print("Number of duplicate rows in Titles dataset:", duplicate_titles)
print("Number of duplicate rows in Credits dataset:", duplicate_credits)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# Treat common placeholder values as missing, in case nulls are stored in different formats.
# Note: 0 is in the list, so genuine zeros (for example zero votes) are also counted as missing.
placeholders = [0, '', ' ', 'NA', 'N/A', 'null', 'None', '?', 'unknown']
titles.replace(placeholders, np.nan, inplace=True)
credits.replace(placeholders, np.nan, inplace=True)
missing_cols_titles = titles.isnull().sum()
missing_cols_credits = credits.isnull().sum()
print(missing_cols_titles[missing_cols_titles > 0])
print('\n-------------------\n')
print(missing_cols_credits[missing_cols_credits > 0])


In [None]:
# Visualizing the missing values

plt.figure(figsize=(12,6))
sns.heatmap(titles.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap - Titles Dataset")
plt.show()

plt.figure(figsize=(12,6))
sns.heatmap(credits.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values Heatmap - Credits Dataset")
plt.show()


### What did you know about your dataset?

The dataset consists of detailed metadata for Amazon Prime Video movies and TV shows, including title, content type, duration, genres, descriptions, release year, IMDb ratings, and popularity metrics. A supplementary dataset provides cast and crew information, enabling deeper analysis of talent influence on content performance. Initial data assessment identified missing values, duplicate records, and mixed data types (numeric, categorical, and multi-value fields), which were addressed through structured data cleaning and preprocessing. This dataset serves as a strong foundation for trend analysis, content performance evaluation, and predictive modeling to support data-driven content strategy decisions.
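
The missing-value assessment described above can also be summarized as a table of counts and percentages per column. The snippet below is a minimal sketch, assuming a small hand-made frame (`sample_titles`) in place of the real `titles.csv`; the helper name `missing_summary` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titles.csv; the real file has many more rows and columns.
sample_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "imdb_score": [7.1, np.nan, 6.4, np.nan],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(sample_titles)
print(summary)
```

Calling `missing_summary(titles)` and `missing_summary(credits)` on the real data would give a compact companion to the heatmaps.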

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns


print("Columns in Titles Dataset:")
print(titles.columns.tolist())

print("\nColumns in Credits Dataset:")
print(credits.columns.tolist())


In [None]:
# Dataset Describe


print("Titles Dataset Summary Statistics:")
display(titles.describe())

print("\nCredits Dataset Summary Statistics:")
display(credits.describe())


### Variables Description

The dataset contains information about Amazon Prime Video titles (movies and TV shows).
Below is a description of the main variables in the two datasets (titles and credits):

**Titles Dataset Variables**

| Variable                 | Description                                  |
| ------------------------ | -------------------------------------------- |
| **id**                   | Unique identifier for each movie/TV show.    |
| **title**                | Name of the movie or TV show.                |
| **type**                 | Content type → “MOVIE” or “SHOW”.            |
| **description**          | Short summary of the movie/TV show.          |
| **release_year**         | Year the title was released.                 |
| **age_certification**    | Content rating (ex: PG-13, R, Not Rated).    |
| **runtime**              | Duration of the movie or episode in minutes. |
| **genres**               | List of genres (ex: comedy, drama, action).  |
| **production_countries** | Countries where the content was produced.    |
| **seasons**              | Number of seasons (only for TV shows).       |
| **imdb_id**              | IMDb ID of the title.                        |
| **imdb_score**           | IMDb rating (0–10).                          |
| **imdb_votes**           | Number of IMDb votes.                        |
| **tmdb_popularity**      | Popularity score from TMDB.                  |
| **tmdb_score**           | TMDB rating.                                 |


**Credits Dataset Variables**



| Variable      | Description                                          |
| ------------- | ---------------------------------------------------- |
| **id**        | Corresponds to the title ID from the titles dataset. |
| **cast_id**   | Unique identifier for each cast/crew entry.          |
| **character** | Character name (for actors).                         |
| **name**      | Actor/actress/director name.                         |
| **role**      | Role type → “ACTOR”, “DIRECTOR”, etc.                |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print("Unique Values in Titles Dataset:")
for col in titles.columns:
    print(f"{col}: {titles[col].nunique()}")

print("\n----------------------------------------\n")

# Check Unique Values for each variable in Credits dataset
print("Unique Values in Credits Dataset:")
for col in credits.columns:
    print(f"{col}: {credits[col].nunique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# 5. Remove Duplicate Rows

titles = titles.drop_duplicates()
credits = credits.drop_duplicates()

print("Titles after dedup:", titles.shape)
print("Credits after dedup:", credits.shape)

# 8. Drop Unnecessary Columns

titles.drop(columns=["imdb_id"], inplace=True, errors="ignore")

# 9. Missing Value Treatment (Business Logic)

# Text columns
titles["description"] = titles["description"].fillna("Not Available")

# Categorical
titles["age_certification"] = titles["age_certification"].fillna(
    titles["age_certification"].mode()[0]
)

# TV Shows only
titles["seasons"] = titles["seasons"].fillna(0)

# Numeric columns
num_cols = ["imdb_score", "imdb_votes", "tmdb_popularity", "tmdb_score", "runtime"]

for col in num_cols:
    if col in titles.columns:
        titles[col] = titles[col].fillna(titles[col].median())

# Credits missing values
credits["character"] = credits["character"].fillna("Not Specified")

# 10. Data Type Corrections

titles["release_year"] = titles["release_year"].astype(int)
titles["seasons"] = titles["seasons"].astype(int)

# 11. Clean List-Type Columns (Genres)

titles["genres"] = (
    titles["genres"]
    .astype(str)
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.replace("'", "", regex=False)
)

# 12. Filter Useful Credit Roles

credits = credits[credits["role"].isin(["ACTOR", "DIRECTOR"])]


# 13. Merge Titles & Credits

df = pd.merge(titles, credits, on="id", how="left")

print("Merged dataset shape:", df.shape)




### What all manipulations have you done and insights you found?

Data Wrangling performed:

1. Duplicates & basic cleanup
   - Removed duplicate rows from both datasets.
   - Dropped the imdb_id column, which is not needed for the analysis.

2. Missing value handling
   - Filled missing description with 'Not Available' and missing character names in credits with 'Not Specified'.
   - Filled missing age_certification with the most frequent certification (mode).
   - Filled numeric missing values (runtime, imdb_score, imdb_votes, tmdb_score, tmdb_popularity) with medians.
   - Filled missing seasons with 0 (movies have no seasons).

3. Parsing & type conversion
   - Stripped the brackets and quotes from the string-coded genres field, leaving a comma-separated string.
   - Converted release_year and seasons to integers.

4. Feature engineering
   - num_genres: number of genres per title.
   - primary_genre: first genre in the list (useful for grouping).
   - is_movie: binary flag for Movie vs Show.
   - release_decade: decade bucket for release_year (e.g., 1990s).
   - cast_count: number of unique actors per title (aggregated from credits).
   - directors: comma-separated list of directors per title (aggregated).
   - popularity_bucket: quartile-based bucket from tmdb_popularity (Low/Medium/High/Very High).

5. Merging
   - Kept only ACTOR and DIRECTOR rows in credits and left-merged them onto the titles dataframe using id as the key.

Key insights discovered during wrangling:
- The dataset contains many list-like fields (genres, production_countries) that must be parsed before analysis.
- A significant portion of titles has missing IMDb-related fields; median imputation makes later analysis robust.
- Primary genre and num_genres provide compact, useful features for visualization and modeling.
- cast_count and director(s) enrichment allow us to explore relationships between people involved and title popularity/ratings.
- Creating release_decade and popularity_bucket helps with easy grouping and trend analysis.

These cleaned and engineered features prepare the data for univariate/bivariate visualizations and for building ML models (e.g., popularity/rating prediction or a content recommender).
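
The wrangling cell does not show code for the engineered features listed in this section. The snippet below is a minimal sketch of how they could be derived, assuming small hand-made frames (`sample_titles`, `sample_credits`) shaped like the two CSV files. The helper `add_features` and the "unknown"/"Unknown" fill values are hypothetical choices, not the notebook's confirmed method.

```python
import pandas as pd

# Hypothetical samples in the shape of titles.csv and credits.csv.
sample_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "release_year": [1994, 2019, 2007, 1988],
    "genres": ["['drama', 'crime']", "['comedy']", "[]", "['action', 'thriller', 'scifi']"],
    "tmdb_popularity": [12.5, 80.0, 3.2, 40.1],
})
sample_credits = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t4"],
    "name": ["A", "B", "C", "A", "D"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "ACTOR", "DIRECTOR"],
})


def add_features(titles: pd.DataFrame, credits: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of titles with the derived columns described above."""
    out = titles.copy()

    # Turn "['drama', 'crime']" (or "drama, crime") into a real Python list
    genre_lists = (
        out["genres"].astype(str)
        .str.replace(r"[\[\]']", "", regex=True)
        .apply(lambda s: [g.strip() for g in s.split(",") if g.strip()])
    )
    out["num_genres"] = genre_lists.apply(len)
    out["primary_genre"] = genre_lists.apply(lambda g: g[0] if g else "unknown")

    out["is_movie"] = (out["type"] == "MOVIE").astype(int)
    out["release_decade"] = (out["release_year"] // 10 * 10).astype(str) + "s"

    # Quartile buckets; qcut raises an error if the quartile edges are not unique
    out["popularity_bucket"] = pd.qcut(
        out["tmdb_popularity"], q=4, labels=["Low", "Medium", "High", "Very High"]
    )

    # Aggregate credits to one value per title before attaching them
    actors = credits[credits["role"] == "ACTOR"].groupby("id")["name"].nunique()
    directors = credits[credits["role"] == "DIRECTOR"].groupby("id")["name"].agg(", ".join)
    out["cast_count"] = out["id"].map(actors).fillna(0).astype(int)
    out["directors"] = out["id"].map(directors).fillna("Unknown")
    return out


featured = add_features(sample_titles, sample_credits)
print(featured[["id", "num_genres", "primary_genre", "release_decade", "cast_count"]])
```

Aggregating credits to one row per title before attaching them keeps the titles table at one row per title, which matters for title-level statistics.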


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code


plt.figure(figsize=(8,5))
sns.countplot(x=titles['type'], palette='viridis')
plt.title("Distribution of Movies vs TV Shows")
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

I chose this chart because understanding the distribution of content types (Movies vs TV Shows) is an essential first step in Exploratory Data Analysis. This chart helps me quickly identify which type of content dominates the dataset. It also provides a clear starting point for further analysis such as genre comparison, runtime patterns, and rating trends. The bar chart is simple, easy to understand, and effectively highlights the overall structure of the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Amazon Prime Video has significantly more movies than TV shows, which indicates that the catalog is weighted toward movie-based content. The count describes what the platform offers rather than what users watch, so it says more about how Amazon Prime allocates its catalog than about viewing behavior. Knowing this distribution is useful for deeper analysis such as identifying popular genres within movies, comparing ratings between movies and shows, and analyzing content release trends.
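
To put a number on "significantly more", the share of each content type can be computed directly. This is a minimal sketch on a hypothetical four-row series (`sample_types`); on the real data the same call would be applied to `titles['type']`.

```python
import pandas as pd

# Hypothetical sample; replace with titles['type'] in the notebook.
sample_types = pd.Series(["MOVIE", "MOVIE", "MOVIE", "SHOW"], name="type")

# normalize=True returns proportions instead of raw counts
type_share = sample_types.value_counts(normalize=True).mul(100).round(1)
print(type_share)
```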

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights gained from this chart can help create a positive business impact. Knowing that Amazon Prime has a higher number of movies compared to TV shows helps the platform understand where its content strength lies. If movies are dominating the platform, Amazon can focus on acquiring more high-quality films, improving movie recommendations, and targeting users who prefer movie consumption. This can increase user engagement and retention, which directly contributes to business growth.

At the same time, the insight also highlights a potential area for improvement. The noticeably lower number of TV shows may indicate a content gap. TV shows often have longer watch times and higher user retention compared to movies. If Amazon Prime invests in producing or acquiring more TV shows, especially in popular genres, it can lead to increased user engagement and long-term platform growth.

So this insight is a positive signal for improving content strategy, but it also shows an opportunity:
Amazon Prime may be lagging behind competitors like Netflix in the quantity of TV series available. If not addressed, this could lead to negative growth or loss of viewers who prefer long-form TV content. Strengthening the TV show catalog can help prevent this and boost platform competitiveness.

#### Chart - 2

In [None]:
# Chart - 2 visualization code


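# Split each title's genre string into individual genre labels.
# Works whether `genres` holds "['drama', 'comedy']" or the cleaned "drama, comedy";
# iterating over the strings directly would yield single characters instead of genres.
all_genres = (
    titles['genres']
    .astype(str)
    .str.replace(r"[\[\]']", "", regex=True)
    .str.split(",")
    .explode()
    .str.strip()
)
all_genres = all_genres[~all_genres.isin(["", "nan"])]

# Count titles per genre and keep the 10 most common
genre_counts = all_genres.value_counts().head(10)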

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=genre_counts.values, y=genre_counts.index, palette='viridis')
plt.title("Top 10 Most Common Genres on Amazon Prime")
plt.xlabel("Count")
plt.ylabel("Genre")
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because genres are one of the most important variables for understanding user preferences and platform content trends. Analyzing the top genres helps identify what type of content is most commonly produced and consumed. This chart also provides clear direction for business decisions such as content acquisition, marketing strategy, and recommendation system improvements. Visualizing genres gives an essential overview of the platform’s content diversity.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that genres like Drama, Comedy, Action, and Thriller are the most dominant on Amazon Prime. This indicates that the platform invests heavily in content that appeals to a wider audience. Drama being the top genre suggests strong storytelling demand, while Comedy and Action highlight user interest in entertainment and fast-paced content. This insight helps the platform focus on strengthening these popular genres while identifying gaps in less represented genres such as Documentary or Animation.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights gained from the genre analysis can create a strong positive business impact. Identifying the most popular genres such as Drama, Comedy, Action, and Thriller shows what type of content attracts the most viewers. This helps Amazon Prime invest more in high-demand genres, plan content acquisition, improve recommendation systems, and run targeted marketing campaigns for popular categories. By knowing what users prefer, Amazon can deliver more engaging content, leading to higher watch time and improved customer satisfaction.

However, the chart also reveals a potential negative growth insight: some genres, such as Animation, Documentary, or Sci-Fi, may be underrepresented. If the platform does not diversify its content, it risks losing users who prefer niche genres. Competitors like Netflix or Disney+ may attract these audiences by offering stronger content libraries in these categories. Therefore, Amazon Prime should consider strengthening weaker genres to avoid losing users and to provide a more balanced content catalog.

Overall, the insight is largely positive but also highlights an important opportunity: increasing content variety can prevent negative growth and help Amazon Prime stay competitive.

#### Chart - 3

In [None]:
# Chart - 3 visualization code


plt.figure(figsize=(10,6))
sns.histplot(titles['imdb_score'], bins=20, kde=True, color='purple')
plt.title("Distribution of IMDb Scores")
plt.xlabel("IMDb Score")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because IMDb score is one of the most important indicators of content quality. Understanding the distribution of ratings helps identify whether Amazon Prime hosts primarily high-quality content, average content, or low-rated content. A histogram is the best way to visualize how ratings are spread across the platform. This chart also helps guide decisions about content improvement, recommendations, and customer satisfaction.


##### 2. What is/are the insight(s) found from the chart?

The IMDb score distribution shows that most Amazon Prime titles fall within the 6 to 8 score range, indicating that the platform hosts mostly average to above-average content. Very low-rated content (below 4) appears to be minimal, which is a positive sign for content quality. However, the number of titles rated above 8 is also limited, suggesting that Amazon Prime could further invest in highly acclaimed titles to improve platform prestige and user engagement.
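
The score ranges mentioned above can be checked numerically by binning the scores. The snippet is a minimal sketch on a hypothetical list of scores; the band edges (4, 6, 8) follow the text, and the band labels are my own.

```python
import pandas as pd

# Hypothetical scores; replace with titles['imdb_score'] in the notebook.
scores = pd.Series([3.5, 5.9, 6.0, 6.8, 7.4, 8.0, 8.6])

# Right-closed bins: (0, 4], (4, 6], (6, 8], (8, 10]
bands = pd.cut(
    scores,
    bins=[0, 4, 6, 8, 10],
    labels=["4 or below", "4 to 6", "6 to 8", "above 8"],
    include_lowest=True,
)

band_counts = bands.value_counts(sort=False)
band_share = (band_counts / len(scores) * 100).round(1)
print(band_share)
```

Because missing scores were filled with the median during wrangling, the band containing the median is somewhat inflated on the real data.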


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights gained from the IMDb score distribution can help create a positive business impact. The chart shows that the majority of Amazon Prime titles fall in the 6–8 IMDb score range, which means the platform has a strong library of average to above-average quality content. This is beneficial for business because higher-rated content increases user satisfaction, encourages longer watch time, and supports customer retention. It also helps improve the recommendation engine by suggesting well-rated titles to viewers.

At the same time, the chart reveals a mild negative insight: there are relatively fewer highly-rated titles (scores above 8). This can limit Amazon Prime’s appeal to users who prefer premium, critically acclaimed content. Competitors who offer more top-rated titles may attract users looking for award-winning or high-quality content. Therefore, Amazon Prime may experience slower growth in this segment unless it invests more in high-quality originals or licensed content.

Overall, the insight is positive but also highlights an opportunity: increasing the number of top-rated titles can strengthen Amazon Prime’s competitive position and prevent potential negative growth.

#### Chart - 4

In [None]:
# Chart - 4 visualization code


movies = titles[titles['type'] == 'MOVIE']

plt.figure(figsize=(10,6))
sns.histplot(movies['runtime'], bins=25, kde=True, color='darkgreen')
plt.title("Distribution of Movie Runtime on Amazon Prime")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because runtime is an important factor that affects user engagement and viewing behavior. Understanding how long most movies are helps identify the platform’s content pattern and audience preferences. A distribution chart is the best way to visualize how runtimes vary across the catalog and whether most movies are short, average length, or long. This insight helps Amazon optimize content acquisition, production strategy, and recommendation algorithms.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that most Amazon Prime movies have a runtime between 80 and 120 minutes, which aligns with the standard length of commercial films. Very short movies (below 60 minutes) and very long movies (above 150 minutes) are relatively rare. This indicates that Amazon Prime focuses mainly on mainstream, easy-to-watch movie lengths that appeal to a wide audience. This insight can guide the platform in selecting future content that matches viewer expectations and continues to drive engagement.
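
The 80 to 120 minute claim can be backed with a quick share calculation. This is a minimal sketch on hypothetical runtimes; in the notebook the same logic would be applied to `movies['runtime']`.

```python
import pandas as pd

# Hypothetical movie runtimes in minutes; replace with movies['runtime'].
runtimes = pd.Series([45, 85, 92, 101, 118, 120, 155])

in_standard_range = runtimes.between(80, 120)  # inclusive on both ends
very_short = runtimes < 60
very_long = runtimes > 150

# The mean of a boolean Series is the fraction of True values
print("80-120 min share (%):", round(in_standard_range.mean() * 100, 1))
print("Below 60 min:", int(very_short.sum()), "| Above 150 min:", int(very_long.sum()))
```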


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights obtained from the runtime distribution chart can help create a positive business impact. The chart shows that most Amazon Prime movies fall within the 80–120 minute range, which aligns with the standard and preferred movie length for the majority of viewers. This means Amazon Prime is providing content in the ideal duration range that keeps viewers engaged without causing fatigue. This supports higher completion rates, increased user satisfaction, and better platform engagement.

However, a potential negative insight is the lack of variety in runtime. Very short films (below 60 minutes) and long-format content (above 150 minutes) are limited. Viewers who prefer mini-films, extended cuts, or long-form storytelling may feel underserved. Competitors who offer more diverse runtime formats may attract this segment of the audience. If Amazon does not diversify its content lengths, it might experience slower growth among users who prefer short films, documentaries, or long-duration blockbusters.

In summary, the overall insight is positive because Amazon meets general viewing expectations, but increasing diversity in runtime can help avoid negative growth and serve a wider range of audience preferences.

#### Chart - 5

In [None]:
# Chart - 5 visualization code


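# Split each title's production_countries string (e.g. "['US', 'GB']") into country codes.
# Iterating over the strings directly would yield single characters instead of countries.
all_countries = (
    titles['production_countries']
    .astype(str)
    .str.replace(r"[\[\]']", "", regex=True)
    .str.split(",")
    .explode()
    .str.strip()
)
all_countries = all_countries[~all_countries.isin(["", "nan"])]

# Count occurrences and keep the 10 most common
country_counts = all_countries.value_counts().head(10)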

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x=country_counts.values, y=country_counts.index, palette='magma')
plt.title("Top 10 Production Countries on Amazon Prime")
plt.xlabel("Count")
plt.ylabel("Country")
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because production countries play a key role in understanding the cultural diversity and global reach of Amazon Prime’s content library. Identifying which countries contribute the most content helps reveal content acquisition patterns, regional preferences, and the platform’s international expansion strategy. This chart is important for understanding how global or localized the platform truly is.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that the United States dominates content production on Amazon Prime, followed by countries like the United Kingdom, India, Canada, and other leading film industries. This indicates that the platform has a strong presence in Western and English-speaking markets. The presence of countries like India also suggests Amazon Prime is investing in diverse regional content. However, contributions from smaller countries are minimal, indicating limited global representation.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, these insights can help create a positive business impact. Knowing that the United States and the United Kingdom dominate production helps Amazon Prime strengthen partnerships with studios in these countries and continue delivering popular content. High contribution from India shows potential in emerging markets and encourages further investment in regional content to gain more subscribers.

However, the chart also highlights a possible negative insight: content diversity across smaller countries is limited. This can discourage international users who want culturally specific or localized content. Competitors like Netflix have stronger global representation, which can attract viewers seeking international content variety. To avoid negative growth, Amazon Prime should expand content acquisition from underrepresented regions and increase global diversity in its catalog.


#### Chart - 6

In [None]:
# Chart - 6 visualization code


release_trend = titles['release_year'].value_counts().sort_index()

plt.figure(figsize=(12,6))
plt.plot(release_trend.index, release_trend.values, marker='o', color='blue')
plt.title("Number of Titles Released Over the Years")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because understanding release trends over the years is important for identifying platform growth, production strategies, and content expansion. A time-series chart clearly shows whether Amazon Prime is increasing or decreasing its yearly content output. This visualization is essential for understanding long-term patterns, market behavior, and user demand trends.


##### 2. What is/are the insight(s) found from the chart?

The chart shows a clear upward trend in the number of titles by release year, with a significant increase in recent years. Because release_year records when a title was originally released rather than when it was added to the platform, this means the catalog is weighted toward recent releases. There may be slight drops or fluctuations in certain years, which could be due to industry slowdowns, licensing issues, or external factors like the COVID-19 pandemic. Overall, newer titles make up a steadily growing share of Amazon Prime's content library.
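
Year-to-year fluctuations are easier to judge when years with no titles appear as zeros and the series is smoothed. The snippet is a minimal sketch on hypothetical release years; the 3-year window is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical release years; replace with titles['release_year'].
years = pd.Series([2015, 2015, 2016, 2018, 2018, 2018, 2019])

per_year = years.value_counts().sort_index()

# value_counts skips years with no titles, so fill the gaps with 0
full_range = list(range(per_year.index.min(), per_year.index.max() + 1))
per_year = per_year.reindex(full_range, fill_value=0)

# Centered 3-year rolling mean to smooth single-year dips and spikes
smoothed = per_year.rolling(window=3, center=True, min_periods=1).mean()
print(pd.DataFrame({"titles": per_year, "rolling_mean": smoothed.round(2)}))
```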


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights from this chart can create a positive business impact. A rising trend in yearly releases indicates that Amazon Prime is actively investing in expanding its content catalog. This helps attract new subscribers, retain existing users, and improve competitiveness against platforms like Netflix and Disney+. Consistent content growth also helps Amazon Prime keep up with changing viewer preferences and global entertainment trends.

However, any years showing decline or stagnation may indicate potential risks. A drop in releases can lead to reduced engagement, slower subscriber growth, and dissatisfaction among users who expect fresh content. If competitors release more titles during those years, Amazon Prime could experience negative growth. To avoid this, Amazon needs to ensure consistent investment in new releases and maintain a balanced content pipeline.


#### Chart - 7

In [None]:
# Chart - 7 visualization code


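# cast_count is not created in the wrangling cell, so derive it here if it is missing:
# the number of distinct actors credited on each title
if 'cast_count' not in titles.columns:
    actors_per_title = credits[credits['role'] == 'ACTOR'].groupby('id')['name'].nunique()
    titles['cast_count'] = titles['id'].map(actors_per_title).fillna(0).astype(int)

numeric_cols = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity', 'cast_count']

corr_matrix = titles[numeric_cols].corr()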

plt.figure(figsize=(10,6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap of Numerical Variables")
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because a correlation heatmap is one of the most effective ways to understand relationships between numerical variables. It helps reveal which factors are strongly connected, weakly related, or completely independent. This is essential for making data-driven decisions and identifying which variables influence ratings, popularity, or engagement. A heatmap provides a clear visual representation that simplifies complex numeric relationships.


##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap shows that IMDb score and IMDb votes have a moderate positive correlation, meaning higher-rated content tends to receive more audience engagement. TMDB popularity also shows correlation with IMDb votes, indicating popular titles attract more user interaction. Runtime has very weak correlation with ratings, showing that movie length does not strongly impact quality perception. Cast count has a mild relationship with popularity, suggesting titles with larger casts attract slightly more viewers.
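
Vote counts and popularity scores tend to be heavily skewed, so a rank-based (Spearman) correlation is a useful cross-check on the default Pearson values in the heatmap. This is a minimal sketch on hypothetical numbers, not results from the dataset.

```python
import pandas as pd

# Hypothetical values with one very large vote count, to mimic a skewed column.
sample = pd.DataFrame({
    "imdb_score": [5.1, 6.0, 6.5, 7.2, 8.1],
    "imdb_votes": [120, 900, 2500, 40000, 900000],
})

# Pearson measures linear association; Spearman measures monotonic association on ranks
pearson = sample.corr(method="pearson").loc["imdb_score", "imdb_votes"]
spearman = sample.corr(method="spearman").loc["imdb_score", "imdb_votes"]

print("Pearson: ", round(pearson, 2))
print("Spearman:", round(spearman, 2))
```

When the two coefficients differ a lot, the relationship is likely monotonic but not linear, or it is driven by a few extreme titles.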


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, the insights from the heatmap can create a strong positive business impact. Knowing that IMDb score, IMDb votes, and TMDB popularity are related helps Amazon Prime understand which titles attract more engagement. This supports better recommendation algorithms, content acquisition decisions, and promotional strategies. It also shows that investing in high-quality content increases audience interaction.

A potential negative insight is that runtime and cast count show low correlation with ratings or popularity. This means simply increasing movie length or featuring many actors will not guarantee success. Relying on such assumptions could lead to ineffective investments. Amazon Prime must focus on content quality and storytelling rather than production volume to avoid negative growth.


#### Chart - 8

In [None]:
# Chart - 8 visualization code


plt.figure(figsize=(10,6))
sns.scatterplot(data=titles, x='imdb_score', y='tmdb_popularity', alpha=0.6, color='teal')
plt.title("Relationship Between IMDb Score and TMDB Popularity")
plt.xlabel("IMDb Score")
plt.ylabel("TMDB Popularity")
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because it helps identify the relationship between content quality and user popularity. IMDb score reflects how good or well-received a title is, while TMDB popularity measures how much attention the title gets. A scatter plot is the best choice to examine whether high-rated titles are also popular among viewers. This relationship is important for understanding user behavior, recommendation quality, and content performance.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that while some highly rated titles also have high popularity, many titles with good IMDb scores have moderate popularity levels. There are also several low-rated titles that still manage to attract attention. This suggests that popularity is not solely driven by quality — marketing, cast, genre, or trending topics may also influence viewer interest. The overall trend shows only a weak positive relationship between IMDb score and popularity.
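A scatter plot shows the shape of the relationship, but a coefficient makes "weak positive" easier to compare across charts. The sketch below is illustrative only: `toy_titles` and its values are made up, standing in for the `titles` DataFrame, and Spearman's coefficient is computed as the Pearson correlation of ranks.

```python
import pandas as pd

# Toy stand-in for the `titles` DataFrame; the values are invented for illustration.
toy_titles = pd.DataFrame({
    "imdb_score": [5.1, 6.3, 7.0, 7.8, 8.4, 6.9],
    "tmdb_popularity": [12.0, 3.5, 40.2, 8.1, 95.0, 2.2],
})

# Keep only rows where both values are present.
pair = toy_titles[["imdb_score", "tmdb_popularity"]].dropna()

# Pearson measures linear association.
pearson_r = pair["imdb_score"].corr(pair["tmdb_popularity"])

# Spearman measures monotonic association: it is the Pearson correlation
# of the ranks, so extreme popularity values have less influence.
spearman_r = pair["imdb_score"].rank().corr(pair["tmdb_popularity"].rank())

print(f"Pearson r:  {pearson_r:.2f}")
print(f"Spearman r: {spearman_r:.2f}")
```

If the two coefficients differ noticeably on the real data, a handful of very popular titles is probably driving the linear estimate.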


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight can help create a positive business impact by showing Amazon Prime that relying only on high ratings is not enough to drive popularity. Marketing efforts, casting decisions, and content promotion play a major role in increasing audience engagement. This helps Amazon improve strategies to boost visibility for high-quality titles that are currently under-watched.

The negative insight is that several low-rated titles still appear popular, which may dilute platform quality perception if users repeatedly encounter average or low-rated content. This could harm long-term user satisfaction. To avoid negative growth, Amazon should highlight high-quality titles more effectively and ensure low-rated content does not dominate user recommendations.


#### Chart - 9

In [None]:
# Chart - 9 visualization code


top_popular = titles[['title', 'tmdb_popularity']].sort_values(by='tmdb_popularity', ascending=False).head(15)

plt.figure(figsize=(12,6))
sns.barplot(data=top_popular, x='tmdb_popularity', y='title', palette='cool')
plt.title("Top 15 Most Popular Titles on Amazon Prime (TMDB Popularity)")
plt.xlabel("TMDB Popularity Score")
plt.ylabel("Title")
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because identifying the most popular titles helps understand user interest patterns and content that is driving the most engagement on Amazon Prime. Popular content often plays a large role in attracting new subscribers and keeping current users active. A bar chart is the best way to visualize and compare the popularity scores across multiple titles.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that a few titles have extremely high TMDB popularity scores compared to others, indicating that Amazon Prime has several strong-performing hits that draw significant attention. These titles may include trending releases, original content, or highly rated movies/shows. The popularity drops sharply after the top few titles, suggesting that user interest is concentrated on a limited number of key shows or movies.
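The concentration described above can be expressed as a single number: the share of total popularity held by the top few titles. This is a minimal sketch on invented data; `toy_titles` and the helper `top_k_share` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for `titles`; the numbers are invented for illustration.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "tmdb_popularity": [500.0, 300.0, 50.0, 50.0, 50.0, 50.0],
})


def top_k_share(popularity: pd.Series, k: int) -> float:
    """Return the fraction of total popularity held by the k highest values."""
    clean = popularity.dropna()
    total = clean.sum()
    if total == 0:
        return 0.0
    return float(clean.nlargest(k).sum() / total)


share = top_k_share(toy_titles["tmdb_popularity"], k=2)
print(f"Top 2 titles hold {share:.0%} of total popularity")
```

On the real dataset, the same helper could be called with `titles['tmdb_popularity']` and `k=15` to match the chart.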


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a strong positive business impact. Knowing which titles are most popular helps Amazon Prime focus on promoting similar content, investing in sequels, expanding franchises, or increasing marketing for top-performing categories. Popular titles can also guide recommendation algorithms to improve user engagement and retention.

However, a negative insight is that popularity is heavily concentrated in a small group of titles. If Amazon Prime depends too much on a few hits, it may lead to uneven growth and drop in user activity if those titles lose traction. To avoid negative growth, Amazon must diversify its content portfolio and develop more consistently popular shows and movies rather than relying on a few blockbusters.


#### Chart - 10

In [None]:
# Chart - 10 visualization code

genre_rows = titles.explode('genres')

# Group by genre and calculate average IMDb score
genre_imdb = genre_rows.groupby('genres')['imdb_score'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12,6))
sns.barplot(x=genre_imdb.values, y=genre_imdb.index, palette='plasma')
plt.title("Top 10 Genres by Average IMDb Score")
plt.xlabel("Average IMDb Score")
plt.ylabel("Genre")
plt.show()


##### 1. Why did you pick the specific chart?

I selected this chart because understanding which genres have the highest average IMDb scores helps identify the types of content that perform best in terms of audience quality perception. This chart allows Amazon Prime to see which genres consistently deliver high-quality content and which ones may need improvement. A bar chart is ideal because it makes comparison between genres simple and clear.


##### 2. What is/are the insight(s) found from the chart?

The chart shows that certain genres—such as Documentary, History, Biography, and Drama—tend to have higher average IMDb ratings compared to others. These genres are generally associated with strong storytelling, real-life events, and critical acclaim. On the other hand, genres like Horror or certain commercial categories may have lower average ratings. This insight highlights that more serious and educational genres often deliver higher-quality content.
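A genre with only a handful of titles can top an average-score ranking by chance, so it may be worth reporting the title count next to the mean. The sketch below assumes `genres` holds Python lists, as the Chart 10 code implies; `toy_titles` and the `MIN_TITLES` threshold are illustrative choices.

```python
import pandas as pd

# Toy stand-in for `titles`, with list-valued genres; values are invented.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["drama", "history"], ["drama"], ["horror"], ["drama", "horror"]],
    "imdb_score": [8.0, 7.0, 5.0, 6.0],
})

# One row per (title, genre) pair.
genre_rows = toy_titles.explode("genres")

# Average score and number of rated titles for each genre.
genre_stats = genre_rows.groupby("genres")["imdb_score"].agg(
    mean_score="mean", n_titles="count"
)

# Hypothetical threshold: ignore genres backed by too few titles.
MIN_TITLES = 2
reliable = genre_stats[genre_stats["n_titles"] >= MIN_TITLES].sort_values(
    "mean_score", ascending=False
)
print(reliable)
```

Here "history" is dropped because a single title supports its average, while "drama" and "horror" remain.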


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can create a positive business impact. By knowing which genres earn consistently high ratings, Amazon Prime can invest more in these genres to enhance platform quality, attract high-value viewers, and improve brand reputation. High-rated genres often lead to stronger word-of-mouth marketing and more loyal subscribers.

However, a potential negative insight is that Amazon Prime may have underinvestment in high-rated genres such as Documentary or Biography. If competitors focus more heavily on these genres, Prime could lose users who prefer critically acclaimed content. Additionally, lower-rated genres may affect user satisfaction if they dominate recommendations. To avoid negative growth, Amazon should balance its content strategy by expanding high-quality genres while still offering a variety of entertainment categories.


#### Chart - 11

In [None]:
# Chart - 11 visualization code


plt.figure(figsize=(12,6))
top_actors = credits['name'].value_counts().head(10)

sns.barplot(x=top_actors.values, y=top_actors.index, palette='viridis')
plt.title("Top 10 Most Frequent Actors on Amazon Prime Video", fontsize=14)
plt.xlabel("Number of Appearances")
plt.ylabel("Actor Name")
plt.show()


##### 1. Why did you pick the specific chart?

I selected a bar chart because it is the most effective way to compare the frequency of occurrences among different actors.
Since we want to see which actors appear the most, a bar chart clearly shows the differences in counts and makes ranking easy to understand.

##### 2. What is/are the insight(s) found from the chart?

A few actors appear significantly more often in Amazon Prime Video content compared to others.

The chart highlights the top 10 most frequently appearing actors, showing the platform’s casting trends.

This indicates that Amazon Prime may have strong collaborations with certain actors or that these actors appear in multiple shows/movies.
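Counting every value in `credits['name']` can mix actors with other credited people if the credits table records more than one kind of role. The sketch below assumes a `role` column with values such as `ACTOR` and `DIRECTOR`; that column name and its values are assumptions and should be checked against the real `credits` DataFrame before use.

```python
import pandas as pd

# Toy stand-in for `credits`; the names and the `role` column are assumptions.
toy_credits = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bo Kim", "Bo Kim", "Bo Kim", "Cy Roe"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})

# Keep acting credits only, so directors are not counted as actors.
actors_only = toy_credits[toy_credits["role"].str.upper() == "ACTOR"]

# Rank actors by number of acting credits.
top_actors = actors_only["name"].value_counts().head(10)
print(top_actors)
```

In this toy data "Bo Kim" would lead an unfiltered count with three credits, but has only one acting credit once the filter is applied.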

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, definitely.
These insights can help the platform in several ways:

Positive Business Impact:

Amazon can identify popular or frequently featured actors and create targeted marketing campaigns around them.

Helps in future casting decisions by understanding who already has strong visibility on the platform.

Assists in enhancing content recommendations, improving user engagement.

**Any insights leading to negative growth?**

Yes, there can be:

If the same set of actors is appearing repeatedly, it may indicate lack of diversity in casting.

This can lead to viewer fatigue, reducing engagement and causing negative growth if audiences feel content is repetitive.

It shows the need to explore more varied talent to maintain audience interest.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# IMDB Rating Distribution (Histogram + KDE Plot)

plt.figure(figsize=(12,6))
sns.histplot(titles['imdb_score'], kde=True, bins=30, color='skyblue')
plt.title("Distribution of IMDB Ratings", fontsize=14)
plt.xlabel("IMDB Rating")
plt.ylabel("Frequency")
plt.show()


##### 1. Why did you pick the specific chart?

I selected the IMDB Rating Distribution Chart because it helps visualize how audience ratings are spread across all titles. A histogram with a KDE curve is the most effective chart type to understand central tendency, spread, skewness, and rating quality patterns in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Most IMDB ratings fall between 6 and 8, indicating above-average viewer satisfaction.

Very few titles have extremely low ratings (below 4) or extremely high ratings (above 9).

The distribution is slightly skewed rather than symmetric: titles cluster in the moderate range, and moderately rated titles outnumber highly rated ones.
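The direction of skew is easy to misread from a histogram, so it can help to check it numerically. The sketch below uses invented scores in `toy_scores`; on the real data the same calls could be applied to `titles['imdb_score']`.

```python
import pandas as pd

# Invented scores for illustration; most values sit between 6 and 8.
toy_scores = pd.Series(
    [2.5, 4.0, 5.5, 6.0, 6.5, 6.8, 7.0, 7.2, 7.5, 8.0], name="imdb_score"
)

clean = toy_scores.dropna()

# Sample skewness: negative means a longer tail toward low scores,
# positive means a longer tail toward high scores.
skewness = clean.skew()

# A mean below the median is another sign of a tail toward low scores.
mean_score = clean.mean()
median_score = clean.median()

if skewness > 0:
    direction = "right (positive)"
elif skewness < 0:
    direction = "left (negative)"
else:
    direction = "none"

print(f"skewness={skewness:.2f}, mean={mean_score:.2f}, median={median_score:.2f}")
print("Skew direction:", direction)
```

In this toy sample the few low scores pull the mean below the median, so the skew is negative; the real dataset may behave differently.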

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help Amazon Prime Video make positive business decisions:

**Positive Impact:**

Identifying the rating range that users prefer helps the platform prioritize quality content for recommendations.

Content teams can focus on genres or creators who consistently produce highly-rated titles.

Marketing can highlight high-rated shows to attract more viewership.

**Insights That May Indicate Negative Growth (with reason):**

The lack of many high-rated titles (IMDB 8.5+) shows that top-quality content is limited.

If this gap continues, Prime Video might lose users to competitors with more high-rated originals (e.g., Netflix or Disney+).

#### Chart - 13

In [None]:
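# Chart - 13 visualization code
# Assumes df holds one genre per row (e.g. an exploded copy of titles);
# otherwise value_counts() ranks genre combinations, not single genres.
top_genres = df["genres"].value_counts().head(10)
plt.figure(figsize=(10,5))
top_genres.plot(kind="bar")
plt.title("Top 10 Genres")
plt.xlabel("Genre")
plt.ylabel("Number of Titles")
plt.show()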


##### 1. Why did you pick the specific chart?

A bar chart was selected because the goal is to compare the frequency of categorical values (genres) and quickly identify which genres dominate the content library.

Bar charts are ideal when:

- Categories are discrete (genres)
- We want a ranking (top 10)
- We want easy visual comparison
- Non-technical business users must understand the chart instantly

This chart makes it immediately clear which genres Amazon Prime invests in most.

##### 2. What is/are the insight(s) found from the chart?

A small number of genres dominate the platform
→ Indicates a focused content acquisition strategy.

Drama, Comedy, Action, and Thriller appear most frequently
→ These genres have consistently high audience demand.

Niche genres appear less often
→ Suggests either low demand or underinvestment opportunities.

Content diversity is uneven
→ Platform relies heavily on proven genres for engagement and retention.
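Counting single genres rather than genre combinations requires one row per genre. The sketch below assumes the raw `genres` column may be stored as list-like strings such as `"['drama', 'comedy']"`; that storage format is an assumption, and `toy_titles` and `to_list` are hypothetical names.

```python
import ast

import pandas as pd

# Toy stand-in for `titles`, with genres stored as list-like strings.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'comedy']", "['drama']", "['comedy', 'action']"],
})


def to_list(value):
    """Parse a list-like string into a list; pass real lists through."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return []
        return parsed if isinstance(parsed, list) else []
    return []


# One row per genre, then count how many titles carry each genre.
genre_counts = toy_titles["genres"].apply(to_list).explode().value_counts()
print(genre_counts)
```

A title with several genres contributes to each of them, so the counts sum to more than the number of titles.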

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The genre distribution analysis identifies the content categories that drive the majority of Amazon Prime’s library, highlighting where the platform focuses its investment. These insights help optimize content acquisition strategy, ensuring resources are allocated to high-demand genres that maximize viewer engagement and retention. Understanding genre dominance also supports better content recommendations, targeted marketing campaigns, and regional content planning, ultimately improving watch time, customer satisfaction, and subscription growth. This analysis provides a data-driven foundation for strategic decisions that directly contribute to platform performance and revenue growth.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 visualization code (Correlation Heatmap)

import numpy as np

# pick numeric columns we care about (only keep those present in titles)
candidate_cols = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity', 'cast_count']
numeric_cols = [c for c in candidate_cols if c in titles.columns]

# if nothing found fallback to all numeric columns
if len(numeric_cols) == 0:
    numeric_cols = titles.select_dtypes(include=[np.number]).columns.tolist()

print("Using numeric columns for correlation:", numeric_cols)

# compute correlation (drop rows where all selected cols are NaN)
corr_df = titles[numeric_cols].dropna(how='all').corr()

# plot
plt.figure(figsize=(10, 7))
mask = np.triu(np.ones_like(corr_df, dtype=bool))   # optional: mask upper triangle
sns.heatmap(corr_df, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5, mask=mask, vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Numerical Features")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked a correlation heatmap because it is the best visual tool to understand relationships between multiple numerical variables at once.
It helps quickly identify which features are strongly related, weakly related, or not related at all.
This is especially useful for EDA and selecting important variables for machine learning or business insights.

##### 2. What is/are the insight(s) found from the chart?

Some numerical features show positive correlation, meaning when one increases, the other tends to increase as well.

For example, variables like IMDB score and IMDB votes often show a positive correlation — highly rated titles tend to have more audience engagement.

TMDB popularity may correlate with IMDB votes, indicating titles that are widely viewed on one platform tend to be popular across others too.

Weak or near-zero correlations reveal that certain features do not influence each other, meaning they carry independent information.
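Reading individual cells off a heatmap gets tedious as the number of features grows. One option is to flatten the correlation matrix into a ranked list of feature pairs. This is a sketch on invented numbers; `toy` stands in for the numeric columns of `titles`.

```python
import numpy as np
import pandas as pd

# Toy numeric columns; the values are invented for illustration.
toy = pd.DataFrame({
    "runtime": [90, 120, 45, 100, 60],
    "imdb_score": [6.0, 7.5, 5.0, 8.0, 6.5],
    "imdb_votes": [1000, 5000, 200, 9000, 1500],
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears once
# and the diagonal of self-correlations is dropped.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)

# Flatten to (feature_a, feature_b) -> r and rank by absolute strength.
pairs = (
    upper.stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom.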

#### Chart - 15 - Pair Plot

In [None]:
# Chart - 15 visualization code (Pair Plot)

import seaborn as sns
import numpy as np

# Select useful numeric columns (only keep those that exist)
candidate_cols = ['runtime', 'imdb_score', 'imdb_votes', 'tmdb_score', 'tmdb_popularity']
numeric_cols = [c for c in candidate_cols if c in titles.columns]

# If no candidate columns found, fallback to all numeric
if len(numeric_cols) == 0:
    numeric_cols = titles.select_dtypes(include=[np.number]).columns.tolist()

print("Using numeric columns for pair plot:", numeric_cols)

# Drop rows with missing values for these numeric columns
pairplot_df = titles[numeric_cols].dropna()

# Create pair plot
sns.pairplot(pairplot_df, diag_kind='kde', corner=True)
plt.suptitle("Pair Plot of Numerical Features", y=1.02, fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

I selected the Pair Plot because it is one of the best ways to visualize relationships between multiple numerical variables at the same time.
It helps identify trends, correlations, distributions, and patterns across features such as runtime, IMDb score, IMDb votes, TMDB score, and popularity.
This chart also makes it easier to detect outliers and understand how features relate to each other from both a statistical and visual perspective.

##### 2. What is/are the insight(s) found from the chart?

From the pair plot, we can observe the following insights:

Movies with higher IMDb scores generally tend to have higher IMDb votes, indicating that well-rated movies attract more audience engagement.

TMDB popularity varies widely; it shows a moderate correlation with IMDb votes and a weaker one with IMDb scores.

Runtime does not show a strong direct relationship with popularity or rating, meaning longer movies are not necessarily more popular or higher rated.

Distributions of numeric features display clustering and skewness, helping us understand the overall behavior of movies on Amazon Prime Video.
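Heavily skewed features such as vote counts and popularity tend to squash most points into one corner of a pair plot. A log transform before plotting is one common remedy. The sketch below is an illustration on invented values, not a step taken in the notebook; `toy` is a hypothetical stand-in for the skewed columns.

```python
import numpy as np
import pandas as pd

# Invented values with a long right tail, as vote counts often have.
toy = pd.DataFrame({
    "imdb_votes": [10, 100, 1000, 100000],
    "tmdb_popularity": [0.5, 2.0, 8.0, 500.0],
})

# log1p computes log(1 + x), which is defined at zero and compresses large values.
logged = np.log1p(toy)

# Compare skewness before and after the transform.
skew_before = toy.skew()
skew_after = logged.skew()
print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

The transformed frame could then be passed to `sns.pairplot` in place of the raw columns, with axis labels adjusted to mention the log scale.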

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

To help the client achieve the business objective of improving audience engagement, increasing viewership, and optimizing content strategy on Amazon Prime Video, the following suggestions are recommended:

**1. Focus on High-Demand Genres**

Based on the analysis, Drama, Comedy, Action, and Thriller are the most frequent genres in the catalog, which suggests sustained audience demand. The client should prioritize acquiring and producing content in these genres to attract a larger audience.

**2. Invest in High-Rated and High-Engagement Titles**

Movies and shows with higher IMDb scores tend to have more IMDb votes and, to a lesser extent, higher TMDB popularity. The platform should promote top-rated titles, highlight them on the homepage, and use them in marketing campaigns.

**3. Improve Content for Under-performing Categories**

Some genres or content types show low viewer engagement. Analyze these areas and either improve content quality or reduce investment in poorly performing categories.

**4. Leverage Insights to Personalize Recommendations**

Use user behavior and genre popularity insights to personalize homepage recommendations, increasing watch time and customer satisfaction.

**5. Optimize Runtime Strategy**

Since runtime does not strongly impact popularity, the platform can provide a balanced mix of short and long content to cater to different user preferences.

**6. Strengthen Data-Driven Decision Making**

Use insights from correlations and visualizations. For example, titles with more IMDb votes tend to have higher TMDB popularity. Relationships like this help decide which titles to renew, promote, or discontinue.

# **Conclusion**

The analysis of the Amazon Prime Video Titles and Credits datasets provided meaningful insights into the platform’s content distribution, audience preferences, and performance trends. By exploring genre popularity, ratings, runtime patterns, and cast contributions, we gained a deeper understanding of the factors that influence content success.

The findings highlight that genres such as Drama, Comedy, and Action remain dominant. IMDb ratings, IMDb votes, and TMDB popularity are positively correlated, though mostly at weak-to-moderate strength, which makes them useful but imperfect indicators of audience interest. Visualization of missing values, duplicates, and variable structures enabled a clearer understanding of dataset quality, ensuring accurate interpretations.

Overall, this analysis helps identify what type of content performs well and reveals opportunities for strategic improvements. These insights not only support data-driven decisions but also guide the platform toward better content curation, improved user engagement, and a stronger competitive edge.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***