# **Project Name**    - Amazon Prime Video Content Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Author - Shubham Kumar**



# **Project Summary -**

# 📄 Project Summary: Amazon Prime Video Content Analysis

## 🎯 Objective
This project analyzes the content library of **Amazon Prime Video** in the United States using two datasets—one with over 9,000 titles (`titles.csv`) and another with over 124,000 cast and crew entries (`credits.csv`). The goal is to extract insights related to **content diversity**, **regional availability**, **evolving trends**, and **audience engagement** using IMDb and TMDB metrics.

---

## 🧾 Dataset Description
- **titles.csv**: Metadata about each show/movie such as title, release year, runtime, genres, production countries, IMDb scores, TMDB popularity, etc.
- **credits.csv**: Information on actors and directors including their names, roles, and character names.

---

## 🔍 Analysis Areas
1. **Content Diversity** – Most popular genres and the distribution of Movies vs TV Shows  
2. **Regional Availability** – Top countries contributing to content  
3. **Trends Over Time** – How content production has evolved over the years  
4. **IMDb Ratings & Popularity** – Top-rated and most popular titles  
5. **Talent Insights** – Most frequent actors and directors on the platform

---

## 📊 Key Insights
- 🎬 **Top Genres**: *Drama*, *Comedy*, and *Action* dominate the content library  
- 📺 **Content Type**: Movies are more frequent, but TV shows have increased since 2015  
- 🌍 **Production Countries**: The USA, UK, and India are top contributors  
- ⭐ **Ratings**: Most titles score between 6.0–7.5 on IMDb, with several scoring above 8.5  
- 🧑‍🎬 **Frequent Talent**: A handful of directors and actors repeatedly feature across titles

---

## 🧰 Tools & Technologies Used
- **Pandas** – Data cleaning and manipulation  
- **NumPy** – Numerical operations  
- **Matplotlib & Seaborn** – Visualizations  
- **Python** – For scripting and logic implementation

---

## 🧠 Business Impact
- 🎯 Helps understand which genres and countries to invest in  
- 📈 Aids in planning future content strategies based on popularity and ratings  
- 🤝 Supports talent acquisition by identifying recurring actors and directors  
- 📊 Provides data-driven insights to improve user engagement and subscription growth



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 🧩 Problem Statement

With the rapidly growing content library on Amazon Prime Video, understanding the **diversity, distribution, and quality of its offerings** has become crucial for content strategists, business analysts, and entertainment industry stakeholders.

This project aims to analyze a comprehensive dataset of Amazon Prime Video titles available in the United States to uncover **data-driven insights** on:

- 📚 **Content Diversity**: What are the most common genres and show types (Movies vs TV Shows)?
- 🌍 **Regional Availability**: Which countries contribute most to Amazon Prime’s content library?
- 📈 **Trends Over Time**: How has the quantity and nature of content changed over the years?
- ⭐ **IMDb Ratings & Popularity**: What are the highest-rated or most popular shows and movies?
- 🎬 **Talent Insights**: Who are the most featured actors and directors?

By answering these questions, this analysis will help drive **content acquisition decisions** and **regional investment strategies**, and will help **improve audience engagement** through data-backed content recommendations.


#### **Define Your Business Objective?**

## 🎯 Business Objective

In the competitive landscape of digital streaming, platforms like **Amazon Prime Video** must consistently evaluate and adapt their content strategies to attract and retain subscribers. With an ever-growing library of movies and TV shows, it becomes essential to analyze:

- What types of content resonate most with audiences?
- Which regions contribute the most to the content pool?
- How has content evolved over time in terms of quantity and quality?

The primary business objective of this project is to leverage data from Amazon Prime's content catalog to uncover **key patterns, trends, and insights** that can help:

- 📈 Drive smarter **content acquisition** decisions  
- 🎯 Identify high-performing **genres and regions**  
- 🔍 Evaluate **audience preferences** using IMDb and TMDB metrics  
- 💡 Inform **marketing, talent selection, and regional expansion** strategies

These insights will ultimately support Amazon Prime Video in enhancing **user engagement**, optimizing **content investments**, and maintaining a competitive edge in the streaming industry.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
data_credits = pd.read_csv('credits.csv')
data_titles = pd.read_csv('titles.csv')
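
The guidelines ask for exception handling, so the loading step could be wrapped in a small defensive helper. This is a minimal sketch, not the notebook's actual loader: `load_csv` and its `required_columns` parameter are hypothetical names introduced here for illustration.

```python
from pathlib import Path

import pandas as pd


def load_csv(path, required_columns=()):
    """Read a CSV file and fail early with a readable message.

    `load_csv` and `required_columns` are illustrative names, not part of
    the original notebook.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected data file not found: {path}")
    frame = pd.read_csv(path)
    # Check that the columns the later cells rely on are present.
    missing = [col for col in required_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"{path.name} is missing columns: {missing}")
    return frame
```

With this helper, a call such as `load_csv('titles.csv', required_columns=['id', 'title', 'type'])` would stop with a clear message if the file or a key column is absent, instead of failing later in the notebook.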

### Dataset First View

In [None]:
# Dataset First Look
data_titles.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(data_titles.shape)
print(data_credits.shape)

### Dataset Information

In [None]:
# Dataset Info
data_titles.info()

In [None]:
data_credits.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(data_titles.duplicated().sum())
print(data_credits.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(data_titles.isnull().sum())
print(data_credits.isnull().sum())

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12,6))
sns.heatmap(data_titles.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values of Titles Heatmap")
plt.figure(figsize=(12,6))
sns.heatmap(data_credits.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Values of Credits Heatmap")
plt.show()

### What did you know about your dataset?

## 🔍 Understanding Missing Values in the Dataset

Analyzing the missing values in both `titles.csv` and `credits.csv` gives us key insights into the **quality and limitations** of the data. Here's what we observed:

---

### 📁 Titles Dataset

| Column                 | Missing Count | Interpretation |
|------------------------|----------------|----------------|
| `description`          | 119            | A few titles lack a plot summary or synopsis, likely newer or less-known content. |
| `age_certification`    | 6487           | Roughly **two-thirds** (about 66%) of titles are missing age ratings, which may indicate older content, international titles, or gaps in content moderation metadata. |
| `seasons`              | 8514           | This is expected since the `seasons` field is only relevant for **TV Shows**. The high number reflects the dominance of **movies** in the dataset. |
| `imdb_id`              | 667            | Some titles aren't linked to IMDb, possibly due to regional exclusives or metadata gaps. |
| `imdb_score` / `votes` | ~1000+         | IMDb rating data is missing for many titles. These could be new releases or lesser-known content with minimal user engagement. |
| `tmdb_score`           | 2082           | TMDB metadata is missing for over **20%** of titles, indicating limited popularity or recent additions. |
| `tmdb_popularity`      | 547            | Much better coverage than TMDB score (about 5.5% missing); useful for identifying trending titles. |
| `genres`, `runtime`, `release_year`, `title`, `id`, `type`, `production_countries` | ✅ No missing values | These are essential fields and appear to be well-maintained. |

---

### 📁 Credits Dataset

| Column         | Missing Count | Interpretation |
|----------------|----------------|----------------|
| `character`    | 16,287         | Many entries don’t have a `character` name. This is common in large datasets where not all actors are assigned a specific role name, especially for **minor or background characters**. |
| `person_id`, `id`, `name`, `role` | ✅ No missing values | Actor/director metadata is clean and consistent. These fields can be used reliably for analysis. |

---

### 💡 Key Takeaways

- The dataset is **generally clean**, especially for structural metadata like title names, IDs, types, and release years.
- **Rating and popularity data** has notable gaps, so trends based on IMDb or TMDB may not represent the entire catalog.
- **Age certification and character name** columns have significant missingness, which might limit certain types of content filtering or detailed cast analysis.
- **TV show-specific fields** like `seasons` naturally have many missing values due to the prevalence of movies.

We will account for these gaps during analysis by:
- Ignoring or filling missing values where appropriate
- Filtering analyses to rows with complete data (e.g., for IMDb trend analysis)
- Avoiding misinterpretation of fields like `seasons` for movie-type titles
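
When percentages are quoted next to raw missing counts, it helps to compute both in one place. The sketch below uses a hypothetical helper, `missing_summary`, and a made-up toy frame; on the real data it would be called as `missing_summary(data_titles)`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for titles.csv (values are made up for illustration).
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
    "imdb_score": [7.1, np.nan, 6.4, 8.0],
})
report = missing_summary(toy_titles)
print(report)
```

Columns with no gaps are left out of the report, so the output stays short even for wide tables.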


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(data_titles.columns)
print(data_credits.columns)

In [None]:
# Dataset Describe
print(data_titles.describe())


In [None]:
print(data_credits.describe())

### Variables Description

## 🧾 Variable Description: Titles Dataset

The `titles.csv` file contains metadata for 9,871 unique titles (movies and TV shows) available on Amazon Prime Video. It consists of 15 columns:

| Column                  | Data Type | Description |
|-------------------------|-----------|-------------|
| **id**                  | `object`  | Unique identifier for each title on JustWatch. |
| **title**               | `object`  | Name of the movie or TV show. |
| **type**                | `object`  | Indicates whether the content is a `MOVIE` or `SHOW`. |
| **description**         | `object`  | Brief summary or synopsis of the title. |
| **release_year**        | `int64`   | Year the content was released. |
| **age_certification**   | `object`  | Age rating of the content (e.g., PG-13, R). Missing for many titles. |
| **runtime**             | `int64`   | Duration of the title in minutes. For shows, this is per episode. |
| **genres**              | `object`  | List of genres (as a stringified list, e.g., `["Drama", "Action"]`). |
| **production_countries**| `object`  | List of countries where the title was produced. |
| **seasons**             | `float64` | Number of seasons (only applicable to TV shows). Missing for movies. |
| **imdb_id**             | `object`  | External IMDb identifier for the title. |
| **imdb_score**          | `float64` | IMDb rating (1 to 10 scale). |
| **imdb_votes**          | `float64` | Number of votes received on IMDb. |
| **tmdb_popularity**     | `float64` | Popularity metric from The Movie Database (TMDB). |
| **tmdb_score**          | `float64` | TMDB rating (1 to 10 scale). |

---

### 📌 Notes:
- Most essential fields (`title`, `type`, `release_year`, `genres`, `runtime`) are complete and clean.
- `seasons` is relevant only for TV shows, which explains its high number of missing values.
- Rating-related fields (`imdb_score`, `tmdb_score`) are partially missing, so popularity analysis should account for that.
- `genres` and `production_countries` may require transformation (e.g., using `ast.literal_eval`) for analysis.
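
As a minimal sketch of the transformation mentioned in the last note, the helper below parses a stringified list with `ast.literal_eval` and falls back to an empty list for missing or malformed cells. `parse_list_cell` is a hypothetical name, and the sample strings are made up; the exact formatting in the real CSV is an assumption.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a stringified list such as "['drama', 'comedy']" into a real list.

    Returns an empty list for missing or malformed cells instead of raising.
    """
    if isinstance(value, list):
        return value
    if pd.isna(value):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


toy = pd.Series(["['drama', 'comedy']", "[]", None, "not a list"])
parsed = toy.apply(parse_list_cell)
print(parsed.tolist())  # [['drama', 'comedy'], [], [], []]
```

Returning an empty list keeps later `explode` calls from failing on a single bad row, at the cost of silently hiding malformed cells; counting the fallbacks would be a reasonable extension.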



## 🧾 Variable Description: Credits Dataset

The `credits.csv` file contains cast and crew information for titles available on Amazon Prime Video. It includes 124,235 records with 5 columns:

| Column      | Data Type | Description |
|-------------|-----------|-------------|
| **person_id** | `int64`  | A unique identifier for each person (actor or director) on JustWatch. |
| **id**        | `object` | Title ID that links each person to a specific show or movie from the `titles.csv` dataset. |
| **name**      | `object` | Full name of the individual (actor or director). |
| **character** | `object` | The name of the character portrayed (for actors). This field has ~13% missing values, typically for directors or minor roles. |
| **role**      | `object` | The professional role of the person in the title — either `ACTOR` or `DIRECTOR`. |

---

### 📌 Notes:
- All columns except `character` are complete with no missing values.
- The dataset allows linking titles with the people involved for **actor/director analysis**, identifying **top contributors**, and studying **collaborative patterns**.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print(data_titles.nunique())
print(data_credits.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:

import ast

# Start from the frames loaded earlier; the copy keeps the raw titles data untouched
titles = data_titles.copy()
credits = data_credits

# Handle missing values
titles['description'] = titles['description'].fillna('No description available')
titles['age_certification'] = titles['age_certification'].fillna('Unknown')
titles['imdb_id'] = titles['imdb_id'].fillna('Unknown')

# Impute rating fields: mean for scores, 0 for votes and popularity.
# Note: mean-filled scores create a spike at the mean in later score charts.
titles['imdb_score'] = titles['imdb_score'].fillna(titles['imdb_score'].mean())
titles['imdb_votes'] = titles['imdb_votes'].fillna(0)
titles['tmdb_popularity'] = titles['tmdb_popularity'].fillna(0)
titles['tmdb_score'] = titles['tmdb_score'].fillna(titles['tmdb_score'].mean())

# Handle 'seasons' column
titles['seasons'] = titles['seasons'].fillna(0).astype('int')

# Convert stringified lists to actual lists
titles['genres'] = titles['genres'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else [])
titles['production_countries'] = titles['production_countries'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else [])

#Normalize and clean lists
titles['genres'] = titles['genres'].apply(lambda x: [g.strip().title() for g in x])
titles['production_countries'] = titles['production_countries'].apply(lambda x: [c.strip().upper() for c in x])

# Feature engineering
titles['decade'] = (titles['release_year'] // 10) * 10
titles['is_show'] = titles['type'].str.upper() == 'SHOW'

#Convert list columns to strings temporarily for deduplication
titles['genres_str'] = titles['genres'].astype(str)
titles['production_countries_str'] = titles['production_countries'].astype(str)

#Drop duplicates safely
titles.drop_duplicates(subset=['id', 'title', 'type', 'release_year', 'genres_str'], inplace=True)

# Drop temp string columns
titles.drop(columns=['genres_str', 'production_countries_str'], inplace=True)

#Explode for analysis-ready genres/countries
titles_genres_exploded = titles.explode('genres')
titles_countries_exploded = titles.explode('production_countries')


#Create a fresh copy
credits = credits.copy()

# Fill missing 'character' with placeholder
credits['character'] = credits['character'].fillna('Unknown')

# Drop duplicates
credits.drop_duplicates(inplace=True)

# Clean name column
credits['name'] = credits['name'].str.strip()

merged = pd.merge(credits, titles, on='id', how='left')

print("Cleaned Titles shape:", titles.shape)
print("Cleaned Credits shape:", credits.shape)
print("Merged Dataset shape:", merged.shape)


In [None]:
merged.head(5)

### What all manipulations have you done and insights you found?

We cleaned and transformed both `titles.csv` and `credits.csv` to prepare the Amazon Prime dataset for analysis. For the titles dataset, we filled missing `description`, `age_certification`, and `imdb_id` values with placeholders, filled missing `imdb_score` and `tmdb_score` values with the column mean, and filled missing `imdb_votes` and `tmdb_popularity` values with 0. The `seasons` column, relevant only for shows, was filled with 0 and converted to integer for uniformity. Stringified list fields such as `genres` and `production_countries` were converted to Python lists, normalized for casing and whitespace, and later exploded to support genre-wise and country-wise analysis.

Feature engineering included creating a decade column to observe content trends over time and a boolean is_show column to distinguish TV shows from movies. To safely drop duplicates without encountering type errors, list columns were temporarily converted to strings. The credits dataset was also cleaned by filling missing character entries with "Unknown" and stripping whitespace from names. Duplicates were removed from both datasets to maintain data integrity.

Finally, we merged the two datasets on the `id` field (a left join from credits to titles) to enable combined analysis of content and cast. These manipulations support analyses such as identifying top genres, dominant content-producing countries, trends in content production across decades, IMDb and TMDB rating distributions, and the most frequent actors or directors. Key insights include a concentration of releases after 2000, with movies dominating over TV shows; Drama and Comedy leading the platform's genres; and a skewed popularity distribution in which a few titles account for very high IMDb vote counts or TMDB popularity. The dataset is now structured and ready for analysis.
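
A left join like the one described above can silently carry credit rows whose title is missing, or multiply rows if a title `id` is duplicated. The sketch below shows one way to check for both, using the `validate` and `indicator` options of `pd.merge` on made-up toy tables; it is an illustration, not the notebook's actual merge.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (values are made up for illustration).
toy_titles = pd.DataFrame({
    "id": ["t1", "t2"],
    "title": ["First", "Second"],
})
toy_credits = pd.DataFrame({
    "person_id": [1, 2, 3],
    "id": ["t1", "t1", "t9"],   # "t9" has no matching title
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

checked = pd.merge(
    toy_credits,
    toy_titles,
    on="id",
    how="left",
    validate="many_to_one",  # raises MergeError if a title id is duplicated
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} credit row(s) have no matching title")
```

Rows flagged `left_only` would have missing values in every title column, so it is worth deciding whether to drop them before any cast-level analysis.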


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
top_genres = titles.explode('genres')['genres'].value_counts().head(10)

plt.figure(figsize=(10, 6))
sns.barplot(y=top_genres.index, x=top_genres.values, palette='viridis')
plt.title('Top 10 Most Common Genres on Amazon Prime')
plt.xlabel('Number of Titles')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are ideal for categorical comparisons. This helps us visualize which genres dominate the platform.

##### 2. What is/are the insight(s) found from the chart?

Drama, Comedy, and Action are the most prevalent genres, which suggests the catalog leans toward broad-appeal categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding top genres helps the content team invest in categories with proven success, and improve viewer retention by promoting similar shows.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
yearly_counts = titles['release_year'].value_counts().sort_index()

plt.figure(figsize=(10, 5))
sns.lineplot(x=yearly_counts.index, y=yearly_counts.values, marker='o')
plt.title('Number of Releases Per Year')
plt.xlabel('Year')
plt.ylabel('Number of Titles Released')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Line charts are best for time-based trends. I used it to analyze content growth over time.


##### 2. What is/are the insight(s) found from the chart?

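The number of titles grows steadily for release years after 2010 and peaks around 2020. Note that the chart counts titles by release year, not by the date they were added to the platform, so it shows a catalog weighted toward recent releases rather than a record of platform additions.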

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This supports expansion strategies and highlights the release years that are best represented in the catalog, which is a useful input for modeling content release schedules. Release counts alone do not measure viewer engagement.
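
The yearly counts above can also be split by content type, which is one way to check whether shows have grown relative to movies in recent release years. This is a sketch on made-up toy data; on the real table the same call would use `titles['release_year']` and `titles['type']`.

```python
import pandas as pd

# Toy stand-in for the titles table (values are made up for illustration).
toy_titles = pd.DataFrame({
    "release_year": [2014, 2014, 2016, 2016, 2016, 2020],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
})

# Rows are release years, columns are content types, cells are title counts.
per_year = pd.crosstab(toy_titles["release_year"], toy_titles["type"])

# Share of shows among each year's releases.
show_share = per_year["SHOW"] / per_year.sum(axis=1)
print(per_year)
print(show_share)
```

Calling `per_year.plot()` would then draw one line per content type on a shared year axis.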

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8, 5))
sns.boxplot(x='type', y='imdb_score', data=titles)
plt.title('Distribution of IMDb Scores by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.show()


##### 1. Why did you pick the specific chart?

Boxplots show the distribution, median, and outliers, which makes them useful for comparing quality perception across content types.

##### 2. What is/are the insight(s) found from the chart?

TV Shows have a slightly wider score distribution, but both content types mostly center between 5.5–7.5 IMDb.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights help in optimizing recommendation algorithms or investing more in content types that receive higher audience ratings.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='imdb_score', y='tmdb_popularity', data=titles, alpha=0.5)
plt.title('TMDB Popularity vs IMDb Score')
plt.xlabel('IMDb Score')
plt.ylabel('TMDB Popularity')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots help observe relationships between two continuous variables. Here they compare the IMDb audience rating with the TMDB popularity metric.

##### 2. What is/are the insight(s) found from the chart?

Many high-popularity titles cluster between IMDb scores 6.0–8.0. A few low-rated titles also show high TMDB popularity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify “hidden gems” or poorly-rated yet popular content that could be retargeted or repackaged for better engagement.
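
One way to turn the "hidden gems" idea into a concrete list is to combine a high-score threshold with a low-popularity threshold. The sketch below uses quartile cut-offs on made-up toy data; the thresholds are arbitrary illustration choices, and on the real table it would be safer to restrict the filter to rows whose score and popularity were not imputed.

```python
import pandas as pd

# Toy stand-in for the titles table (values are made up for illustration).
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "imdb_score": [8.6, 4.2, 7.9, 6.0, 8.8, 7.1],
    "tmdb_popularity": [1.5, 90.0, 40.0, 5.0, 2.0, 30.0],
})

# Thresholds are assumptions: top quartile of score, bottom quartile of popularity.
score_cut = toy_titles["imdb_score"].quantile(0.75)
popularity_cut = toy_titles["tmdb_popularity"].quantile(0.25)

# Highly rated titles that rarely surface in popularity rankings.
hidden_gems = toy_titles[
    (toy_titles["imdb_score"] >= score_cut)
    & (toy_titles["tmdb_popularity"] <= popularity_cut)
]
print(hidden_gems["title"].tolist())  # ['A', 'E']
```

Flipping both comparisons would select the opposite group, low-rated but popular titles.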

#### Chart - 5

In [None]:
# Chart - 5 visualization code
type_counts = titles['type'].value_counts()

plt.figure(figsize=(6, 6))
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Content Distribution: Movies vs TV Shows')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts quickly show proportion — here we compare the platform's content balance.

##### 2. What is/are the insight(s) found from the chart?

Roughly 85% of the content is movies, highlighting Amazon Prime’s bias toward movie offerings over shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the goal is more long-term viewer retention, Amazon could invest more in shows, which keep viewers engaged longer.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 5))
sns.histplot(titles['runtime'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of Content Runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

To understand the typical length of movies and episodes. A histogram with a KDE overlay shows the shape of the runtime distribution.

##### 2. What is/are the insight(s) found from the chart?

Most runtimes cluster around 90–120 minutes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

It helps shape duration-based content recommendations for individual users.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 5))
sns.countplot(data=titles, x='age_certification', order=titles['age_certification'].value_counts().index, palette='Set2')
plt.title('Content Distribution by Age Certification')
plt.xlabel('Age Rating')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

To see how content is distributed across age certifications.

##### 2. What is/are the insight(s) found from the chart?

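Because missing ratings were filled with `Unknown` during wrangling, that category is the largest bar. Among rated titles, a large share carries ratings such as TV-14 or PG-13, which target teens and older viewers rather than general audiences.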

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps in age-focused marketing strategies or parental controls.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
top5_genres = titles.explode('genres')['genres'].value_counts().head(5).index
top_genres = titles[titles['genres'].apply(lambda x: any(g in x for g in top5_genres))]
top_genres = top_genres.explode('genres')
top_genres = top_genres[top_genres['genres'].isin(top5_genres)]

plt.figure(figsize=(10, 6))
sns.violinplot(x='genres', y='imdb_score', data=top_genres, palette='Accent')
plt.title('IMDb Score Distribution for Top 5 Genres')
plt.xlabel('Genre')
plt.ylabel('IMDb Score')
plt.show()


##### 1. Why did you pick the specific chart?

To view the spread of IMDb scores across genres. A violin plot shows both the range and the density of scores within each genre.

##### 2. What is/are the insight(s) found from the chart?

Some genres like Drama have wider and higher score ranges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Genre-based acquisition can be aligned with performance.
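
To align acquisition with performance, the violin plot can be backed by a per-genre summary table. This is a minimal sketch on made-up toy data, assuming `genres` already holds Python lists as it does after the wrangling step.

```python
import pandas as pd

# Toy stand-in for the titles table (values are made up for illustration).
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Drama", "Comedy"], ["Drama"], ["Action"]],
    "imdb_score": [8.0, 6.0, 5.0],
})

# One row per (title, genre) pair, then aggregate scores within each genre.
genre_stats = (
    toy_titles.explode("genres")
    .groupby("genres")["imdb_score"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(genre_stats)
```

Reporting the count next to the median makes it easier to spot genres whose high score rests on only a handful of titles.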

#### Chart - 9

In [None]:
# Chart - 9 visualization code
country_counts = titles.explode('production_countries')['production_countries'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=country_counts.values, y=country_counts.index, palette='coolwarm')
plt.title('Top 10 Content-Producing Countries')
plt.xlabel('Number of Titles')
plt.ylabel('Country')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the geographic diversity of content.

##### 2. What is/are the insight(s) found from the chart?

USA leads, followed by UK and India.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps localize marketing and acquisition strategy as well as the recommendations.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
top_directors = credits[credits['role'] == 'DIRECTOR']['name'].value_counts().head(10)

plt.figure(figsize=(10, 5))
sns.barplot(x=top_directors.values, y=top_directors.index, palette='crest')
plt.title('Top 10 Directors by Number of Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Director')
plt.show()


##### 1. Why did you pick the specific chart?

Identifies the most active directors on the platform.



##### 2. What is/are the insight(s) found from the chart?

Some directors repeatedly contribute content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps with talent management and repeated collaboration strategies.
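
`value_counts()` on `name` counts credit rows, and it merges different people who share a name. As a hedged alternative, the sketch below counts distinct title ids per `person_id` on made-up toy data; whether the two approaches differ on the real data has not been checked here.

```python
import pandas as pd

# Toy stand-in for the credits table (values are made up for illustration).
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3", "t3"],
    "name": ["Ana Ruiz", "Ana Ruiz", "Ana Ruiz", "Li Wei", "Sam Roy"],
    "person_id": [10, 10, 10, 20, 30],
    "role": ["DIRECTOR", "DIRECTOR", "DIRECTOR", "DIRECTOR", "ACTOR"],
})

directors = toy_credits[toy_credits["role"] == "DIRECTOR"]

# Count distinct title ids per person; person_id separates people who share a name.
titles_per_director = (
    directors.groupby(["person_id", "name"])["id"]
    .nunique()
    .sort_values(ascending=False)
)
print(titles_per_director)
```

In this toy table a row count would credit the first director with three titles, while the distinct-title count gives two.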

#### Chart - 11

In [None]:
# Chart - 11 visualization code
rating_cats = pd.cut(titles['imdb_score'], bins=[0, 5, 6.5, 7.5, 10], labels=['Low', 'Average', 'Good', 'Excellent'])

plt.figure(figsize=(6,6))
rating_cats.value_counts().plot.pie(autopct='%1.1f%%', startangle=90, colors=sns.color_palette('pastel'), wedgeprops={'width':0.4})
plt.title('IMDb Rating Categories')
plt.ylabel('')
plt.show()


##### 1. Why did you pick the specific chart?

To segment content by rating quality.

##### 2. What is/are the insight(s) found from the chart?

Most content falls between Average and Good.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifies content quality gaps for investment.
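
The category boundaries matter when reading this chart: `pd.cut` uses right-inclusive intervals by default, so a score of exactly 5.0 lands in `Low` and 6.5 lands in `Average`. A small sketch with made-up scores shows the behavior.

```python
import pandas as pd

scores = pd.Series([4.9, 5.0, 6.5, 7.0, 9.1])

# Default intervals are right-inclusive: (0, 5], (5, 6.5], (6.5, 7.5], (7.5, 10].
bands = pd.cut(
    scores,
    bins=[0, 5, 6.5, 7.5, 10],
    labels=["Low", "Average", "Good", "Excellent"],
)
print(bands.tolist())  # ['Low', 'Low', 'Average', 'Good', 'Excellent']
```

Passing `right=False` would make the intervals left-inclusive instead, which moves boundary scores into the next band up.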

#### Chart - 12

In [None]:
# Chart - 12 visualization code
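# A swarm plot positions every point individually, so plot a random sample
# instead of the full table to keep rendering fast and the points legible.
swarm_sample = titles.sample(n=min(1000, len(titles)), random_state=42)

plt.figure(figsize=(10, 6))
sns.swarmplot(data=swarm_sample, x='type', y='runtime', hue='imdb_score', palette='viridis', size=3)
plt.title('Runtime vs IMDb Score by Type (random sample of titles)')
plt.xlabel('Content Type')
plt.ylabel('Runtime (min)')
plt.legend(title='IMDb Score', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()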


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
numeric_cols = ['release_year', 'runtime', 'seasons', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
correlation = titles[numeric_cols].corr()

sns.heatmap(correlation, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is ideal for visualizing correlation among numerical variables. It quickly shows how strongly variables are related to each other (positively or negatively).

##### 2. What is/are the insight(s) found from the chart?

imdb_score has a moderate positive correlation with tmdb_score (~0.49), showing consistency in audience perception across platforms.

imdb_votes has low correlation with scores but high variance — indicating some popular shows may not necessarily be highly rated.

seasons has little to no correlation with other variables, suggesting that the number of seasons doesn’t directly affect ratings or popularity.
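
One caveat when reading these correlations: the wrangling step filled missing scores with the column mean, and mean-filled rows tend to pull correlations toward zero. The sketch below illustrates the effect on made-up numbers by comparing a correlation on complete pairs with one computed after mean imputation; it does not reproduce the values in the heatmap.

```python
import numpy as np
import pandas as pd

# Toy scores (made up): the last two rows have no IMDb score.
toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, np.nan, np.nan],
    "tmdb_score": [5.0, 6.0, 7.0, 8.0, 9.0, 3.0],
})

# Series.corr() drops missing pairs, so only the four complete rows are used.
corr_complete = toy["imdb_score"].corr(toy["tmdb_score"])

# Filling gaps with the mean adds points that carry no real relationship.
filled = toy.fillna({"imdb_score": toy["imdb_score"].mean()})
corr_filled = filled["imdb_score"].corr(filled["tmdb_score"])

print(round(corr_complete, 3), round(corr_filled, 3))
```

Computing the heatmap on the un-imputed columns, and letting `corr()` skip missing pairs, would be one way to avoid this dilution.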

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(titles[numeric_cols], corner=True, diag_kind='kde')
plt.suptitle('Pairplot of Amazon Prime Title Attributes', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pairplot offers an in-depth look at relationships and distributions across multiple features. With `diag_kind='kde'`, it combines pairwise scatter plots with KDE curves on the diagonal in a single figure.

##### 2. What is/are the insight(s) found from the chart?

Most variables (e.g., tmdb_score, imdb_score, runtime) are normally distributed or slightly right-skewed.

Popularity (tmdb_popularity) has an extremely skewed distribution with clear outliers.

imdb_votes and runtime show no visible correlation, reinforcing the diversity of content formats.
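
For a heavily skewed variable such as `tmdb_popularity`, a log transform often makes the scatter panels easier to read. This is a minimal sketch using `np.log1p` on made-up values; applying it to the real column before plotting is a suggestion, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up popularity values with one extreme outlier.
popularity = pd.Series([0.0, 1.0, 3.0, 8.0, 500.0])

# log1p(x) = log(1 + x) is defined at zero and compresses the long right tail.
log_popularity = np.log1p(popularity)

print(popularity.skew(), log_popularity.skew())
```

`log1p` is used rather than a plain logarithm because missing popularity values were filled with 0 during wrangling, and `log(0)` is undefined.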

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

📌 **Strategic Recommendations to Achieve Business Objectives**

1. 🎯 **Focus on High-Performing Genres**
   - **Insight:** Drama, Comedy, and Action dominate the content library, and Drama shows a wider and higher IMDb score range among the top genres.
   - **Recommendation:** Prioritize investment in top-performing genres that resonate with users. Launch targeted genre-based campaigns and explore sub-genres within these to diversify at lower risk.

2. 📈 **Leverage Temporal Trends**
   - **Insight:** The number of titles rises sharply for release years after 2015 and peaks around 2019–2020.
   - **Recommendation:** Use release-year trends as a starting point for planning release windows. Month-level release and engagement data, which this dataset does not include, would be needed to align promotions with specific months or seasons.

3. 🌍 **Expand Localized Content Production**
   - **Insight:** The USA, UK, and India are the top content contributors. Indian content appears to be growing, although this trend was not charted separately in this notebook.
   - **Recommendation:** Strengthen regional content acquisition, particularly in emerging markets like India and Latin America. Offer dubbing and subtitles in regional languages to increase reach.

4. ⭐ **Capitalize on High IMDb/Popularity Titles**
   - **Insight:** Some titles with modest scores are highly popular, and vice versa.
   - **Recommendation:** Create two-tier promotion strategies: highlight critically acclaimed content for quality seekers and popular viral hits for broader reach. Use viewer behavior to push recommendations.

5. 👪 **Segment Content by Age Certification**
   - **Insight:** Age certification is missing for most titles (6,487), and rated titles are not clearly segmented.
   - **Recommendation:** Fill in missing certifications where possible, then create clear content hubs by age certification, such as “Family Friendly,” “Teen Thrills,” and “Mature Originals,” to improve discoverability and retention across demographics.

6. 📊 **Use Data-Driven Talent Collaboration**
   - **Insight:** A small number of directors and actors appear repeatedly across the content library.
   - **Recommendation:** Strengthen partnerships with proven creators for new productions, and use viewer rating data to identify high-impact talent for future projects.

By aligning these strategies with data-backed decisions, the client can effectively boost content relevance, customer satisfaction, and ROI, all while maintaining a competitive edge in the dynamic streaming industry.



# **Conclusion**

## ✅ Conclusion

This project provided a comprehensive analysis of Amazon Prime Video's content library using a dataset of over 9,000 titles and 124,000 cast/crew records. Through rigorous data wrangling, exploratory analysis, and visualizations, we uncovered valuable insights into content diversity, production trends, audience preferences, and engagement metrics.

Key findings show that Amazon Prime predominantly hosts movie content, with **Drama**, **Comedy**, and **Action** being the most frequent and high-performing genres. The content library has significantly expanded post-2015, aligning with global streaming trends. While IMDb and TMDB ratings generally hover around average scores, a few standout titles display exceptionally high popularity. The USA remains the dominant content producer, though countries like the UK and India also play a significant role.

Our analysis also highlighted that certain directors and actors repeatedly contribute to the platform, suggesting potential for focused partnerships. We identified key opportunities in **content segmentation**, **regional expansion**, **genre prioritization**, and **rating-based marketing strategies**.

Overall, the dataset is now fully cleaned and analysis-ready, providing a strong foundation for deeper business intelligence, viewer behavior modeling, and data-driven decision-making. These insights can guide Amazon Prime Video in enhancing content strategy, increasing subscriber satisfaction, and driving long-term business growth in a competitive streaming landscape.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***