<a href="https://colab.research.google.com/github/rahulv8700/EDA-PROJECT/blob/main/Capstone_(2)_Project_Amazon_Prime_TV_Shows_and_Movies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Exploratory Data Analysis on Amazon Prime TV Shows & Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The global streaming industry has grown rapidly in recent years, with platforms like Amazon Prime Video competing to capture diverse audiences through vast and varied content libraries. With an ever-expanding catalog of TV shows and movies, it becomes essential for companies to leverage data-driven insights to understand audience preferences, optimize content strategies, and maintain a competitive edge. This project aims to analyze Amazon Prime Video’s catalog using exploratory data analysis (EDA) to uncover patterns in genres, ratings, regional content distribution, and popularity trends.

The dataset used in this project consists of two CSV files, combining over 9,000 unique titles and 124,000 credits. The titles.csv file contains attributes such as title name, type (movie or TV show), release year, age certification, runtime, number of seasons, genres, IMDb and TMDb scores, and production countries. The credits.csv file provides information about actors and directors, including their names, roles, and associations with specific titles. Together, these files offer both content-level and person-level insights, enabling a holistic exploration of Amazon Prime Video’s offerings.

The project begins with data loading and cleaning. Initial steps involved handling missing values, standardizing formats (such as runtime, release year, and certification categories), and merging datasets where necessary. Special attention was given to categorical columns like genres and production countries, which often contain multiple values per record. Cleaning this data ensured that meaningful aggregations and visualizations could be performed effectively.

Once the dataset was prepared, the exploratory data analysis (EDA) phase was conducted. The first observation was the distribution of content type: Amazon Prime Video has a significantly larger share of movies compared to TV shows, reflecting its emphasis on film-based entertainment. A time-series analysis of release years showed that the bulk of Prime’s catalog consists of recent titles, particularly from 2010 onwards, aligning with the broader boom in digital streaming services during this period.

The analysis of genres revealed that Drama and Comedy dominate the platform, followed by Action, Thriller, and Romance. This suggests a balanced mix of entertainment, catering to both light-hearted and intense viewing preferences. Interestingly, niche genres such as Documentary and Animation also have significant representation, indicating Prime’s efforts to provide variety.

Looking at regional contributions, the United States and India emerged as the top producers of content, together accounting for a majority of titles on the platform. This reflects Amazon Prime’s strong presence in these two markets, which are among its largest customer bases. Other countries such as the United Kingdom, Canada, and Japan also contribute meaningfully, demonstrating the platform’s global strategy.

The age certification analysis highlighted that most certified titles fall under categories such as TV-MA, R, and PG-13, showing that the catalog leans toward teen and adult audiences. Content rated for young children and families is comparatively limited.

In terms of ratings and popularity, IMDb and TMDb scores were used to assess content quality and viewer reception. Several titles achieved exceptionally high scores, demonstrating that Prime’s catalog includes critically acclaimed works. The correlation between IMDb scores and popularity metrics showed that while higher ratings often correspond to popularity, some lower-rated titles still gained traction, possibly due to strong marketing or trending actors.

The credits data allowed for the identification of the most frequent actors and directors featured on Prime Video. A few individuals appeared repeatedly across multiple titles, indicating their significant contribution to the platform’s content. This insight could help in understanding star power and its influence on audience engagement.

Visualizations such as bar charts, line plots, histograms, scatter plots, and heatmaps were employed throughout the project to make insights more interpretable. For example, line plots illustrated trends in content releases over decades, while bar charts showcased genre frequency and country-wise distribution. A heatmap of correlations between IMDb scores, votes, and popularity offered a deeper view of how audience ratings relate to engagement.

In conclusion, this exploratory analysis provided valuable insights into Amazon Prime Video’s catalog. The platform is primarily dominated by movies, with drama and comedy as the most common genres. The majority of titles are recent releases, reflecting Prime’s strategy of providing up-to-date and relevant content. The dominance of the United States and India in production highlights the importance of these regions in Amazon’s global business strategy. Finally, age certification patterns and audience ratings further confirm that Amazon Prime caters to a diverse audience base, ranging from family-friendly to mature content.

These findings can be useful not only for business decision-making but also for guiding recommendation systems, content acquisition strategies, and marketing campaigns. As the streaming industry continues to grow, leveraging such data-driven insights will remain a key factor in ensuring sustained success and audience satisfaction.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The problem is to analyze Amazon Prime Video’s content dataset through Exploratory Data Analysis (EDA) in order to answer key questions such as:



*   What genres, content types (movies vs. shows), and certifications dominate the platform?
*   Which countries contribute the most to Amazon Prime Video’s catalog?
*   What are the patterns in IMDb ratings, votes, and popularity scores?
*   Who are the most frequent actors and directors featured on the platform?
*   How has the distribution of content evolved across years?















#### **Define Your Business Objective?**

The objective of this project is to analyze Amazon Prime Video’s catalog through Exploratory Data Analysis (EDA) to uncover insights into content diversity, regional contributions, audience preferences, and key contributors. These insights will help improve recommendation systems, guide content acquisition strategies, and enhance user engagement in the competitive streaming industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import gdown


### Dataset Loading

In [None]:
# Google Drive shared file IDs
titles_id = "19D49q8Da-o4PT1Ki01EvckCd_ynA189I"
credits_id = "1kB19bUSVDlUw7wro7NnD_aTKdwk47-ng"

# Construct proper download URLs
titles_url = f"https://drive.google.com/uc?id={titles_id}&export=download"
credits_url = f"https://drive.google.com/uc?id={credits_id}&export=download"

# Download files
gdown.download(titles_url, "titles.csv", quiet=False)
gdown.download(credits_url, "credits.csv", quiet=False)


# Load into pandas
titles = pd.read_csv("titles.csv")
credits = pd.read_csv("credits.csv")
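
The guidelines above reward exception handling, so the loading step could be wrapped in a small helper that fails with a readable message. This is a minimal sketch: `load_csv` and its `required_columns` parameter are hypothetical names introduced here, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(path, required_columns=()):
    """Read a CSV and fail early with a readable message (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset at {path}; was the download step run?")
    df = pd.read_csv(path)
    # Report any expected columns that are absent from the file
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"{path.name} is missing expected columns: {missing}")
    return df
```

A call such as `load_csv("titles.csv", required_columns=["id", "type"])` would then stop the notebook at the loading cell rather than at a later chart.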


### Dataset First View

In [None]:
# Dataset First Look
print("Titles Data:")
display(titles.head())

print("Credits Data:")
display(credits.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Titles Data:")
print("Number of Rows:", titles.shape[0])
print("Number of Columns:", titles.shape[1])

print("Credits Data:")
print("Number of Rows:", credits.shape[0])
print("Number of Columns:", credits.shape[1])

### Dataset Information

In [None]:
# Dataset Info
print("Titles Data:")
print(titles.info())

print("Credits Data:")
print(credits.info())

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Titles Data:")
print("Number of Duplicate Rows:", titles.duplicated().sum())

print("Credits Data:")
print("Number of Duplicate Rows:", credits.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Titles Data:")
print(titles.isnull().sum())

print("Credits Data:")
print(credits.isnull().sum())

In [None]:
# Visualizing the missing values


# For titles dataset
plt.figure(figsize=(12,6))
sns.heatmap(titles.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values in Titles Dataset")
plt.show()

# For credits dataset
plt.figure(figsize=(12,6))
sns.heatmap(credits.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values in Credits Dataset")
plt.show()

# Titles dataset missing %
titles_missing = titles.isnull().mean() * 100

plt.figure(figsize=(10,5))
titles_missing[titles_missing > 0].sort_values(ascending=False).plot(kind="bar", color="salmon")
plt.ylabel("Percentage of Missing Values")
plt.title("Missing Values Percentage per Column - Titles Dataset")
plt.show()

# Credits dataset missing %
credits_missing = credits.isnull().mean() * 100

plt.figure(figsize=(10,5))
credits_missing[credits_missing > 0].sort_values(ascending=False).plot(kind="bar", color="skyblue")
plt.ylabel("Percentage of Missing Values")
plt.title("Missing Values Percentage per Column - Credits Dataset")
plt.show()



### What did you know about your dataset?

**1. Titles Dataset**

Contains information about **9,000 unique** shows & movies on Amazon Prime (U.S. region).

**15 columns** describing each title.

Key columns:

*  id: Unique ID of the title (used to link with credits dataset).

*  title: Name of the show or movie.

*  type: Whether the title is a Movie or a TV Show.

*  release_year: The year it was released.

*  age_certification: The audience rating (ex: PG, R, etc.).

*  runtime: Duration in minutes (for movies) or per episode (for shows).

*  genres: Genre(s) of the content (ex: Drama, Comedy, Thriller).

*  production_countries: The countries that produced the title.

*  seasons: Number of seasons (only for TV shows).

*  imdb_score & imdb_votes: IMDb ratings and votes.

*  tmdb_popularity & tmdb_score: Popularity and rating from TMDB.

**2. Credits Dataset**

Contains information about **124,000** actors and directors linked to the titles.

**5 columns** describing people involved in production.

**Key columns:**

*  person_id: Unique ID of the person.

*  id: Title ID (foreign key that links to the Titles dataset).

*  name: Name of the actor or director.

*  character_name: Name of the character (for actors).

*  role: Whether the person is an ACTOR or DIRECTOR.

**3. Data Types**

A mix of **categorical** data (e.g., title, genre, country, role) and **numerical data** (e.g., runtime, release_year, IMDb score, votes, popularity).

This makes it suitable for both **descriptive statistics** and **visualizations.**

**4. Potential Issues**

*  Missing values in some columns (e.g., age_certification, imdb_score, seasons).

*  Possible duplicate rows in the credits dataset. Repeated title `id` values are expected there, since each title has many cast and crew rows; only fully identical rows count as duplicates.

*  Genres and production countries are stored as lists/strings, which may need cleaning before analysis.
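
One way to handle the list-like columns mentioned above is to parse each cell into a real Python list before counting. The sketch below assumes cells look like `"['drama', 'comedy']"` and falls back to comma splitting otherwise; `parse_list_cell` is a hypothetical helper and the sample values are illustrative.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a cell such as "['drama', 'comedy']" into a list of strings (hypothetical helper)."""
    if pd.isna(value):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat it as a plain comma-separated string
        parsed = str(value).split(",")
    if not isinstance(parsed, (list, tuple, set)):
        parsed = [parsed]
    return [str(item).strip() for item in parsed if str(item).strip()]


# Illustrative cells covering the list-like, empty, missing, and plain cases
sample = pd.Series(["['drama', 'comedy']", "[]", None, "US, IN"])
exploded = sample.apply(parse_list_cell).explode().dropna()
```

After this step, `exploded.value_counts()` counts clean labels rather than fragments that still carry brackets or quotes.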

**5. What Can We Analyze?**

From this dataset, we can answer questions like:

*  What percentage of content is movies vs. TV shows?

*  Which genres are most common?

*  Which years saw the most releases?

*  What are the highest-rated titles (IMDb & TMDB)?

*  Which actors or directors appear the most?

*  How diverse is Amazon Prime’s catalog in terms of production countries?


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles Data:")
print(titles.columns)

print("Credits Data:")
print(credits.columns)


In [None]:
# Dataset Describe
print("Titles Data:")
display(titles.describe())

print("Credits Data:")
display(credits.describe())

### Variables Description

**Titles Dataset**  
- `id` → Unique identifier for each title  
- `title` → Name of the movie or TV show  
- `type` → Type of content – Movie or TV Show  
- `release_year` → Year of release  
- `age_certification` → Age rating (e.g., PG, R, TV-MA)  
- `runtime` → Duration in minutes (per episode for TV shows)  
- `genres` → Genre(s) associated with the title  
- `production_countries` → Producing country/countries  
- `seasons` → Number of seasons (null for movies)  
- `imdb_score` → IMDb rating (0–10)  
- `imdb_votes` → Number of IMDb votes  
- `tmdb_popularity` → Popularity score on TMDB  
- `tmdb_score` → Average rating score on TMDB  

**Credits Dataset**  
- `person_id` → Unique identifier for each actor/director  
- `id` → Foreign key linking to `titles.id`  
- `name` → Name of the person (actor/director)  
- `character_name` → Character played (for actors)  
- `role` → Role of the person – ACTOR or DIRECTOR  


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique Values in Titles Dataset:")
for column in titles.columns:
    unique_values = titles[column].unique()
    print(f"{column}: {unique_values}")

print("\nUnique Values in Credits Dataset:")
for column in credits.columns:
    unique_values = credits[column].unique()
    print(f"{column}: {unique_values}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

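# Remove exact duplicate rows flagged in the duplicate check above
titles = titles.drop_duplicates()
credits = credits.drop_duplicates()

# Handle missing values (assign back instead of chained inplace fillna,
# which newer pandas versions warn about)
titles['age_certification'] = titles['age_certification'].fillna("Unknown")
titles['seasons'] = titles['seasons'].fillna(0)   # Movies don’t have seasons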

# Fix Data Types

titles['release_year'] = pd.to_numeric(titles['release_year'], errors='coerce')
titles['runtime'] = pd.to_numeric(titles['runtime'], errors='coerce')
titles['seasons'] = pd.to_numeric(titles['seasons'], errors='coerce')

# Merge Both Datasets

prime_data = titles.merge(credits, on="id", how="left")

print("Final Dataset Shape:", prime_data.shape)
print(prime_data.head())
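
Because each title has many credit rows, the left merge above multiplies title rows. A hedged sketch of how the merge could be checked uses the `validate` and `indicator` options of `DataFrame.merge`. The two small frames below are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Toy frames standing in for titles and credits (illustrative data only)
titles_demo = pd.DataFrame({"id": ["t1", "t2"], "title": ["A", "B"]})
credits_demo = pd.DataFrame(
    {"id": ["t1", "t1"], "name": ["X", "Y"], "role": ["ACTOR", "DIRECTOR"]}
)

merged = titles_demo.merge(
    credits_demo,
    on="id",
    how="left",
    validate="one_to_many",  # raises MergeError if a title id repeats on the left
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Titles that have no matching credit rows
titles_without_credits = (merged["_merge"] == "left_only").sum()
```

Title-level statistics such as mean IMDb score are safer to compute on `titles` itself, since the merged frame repeats each title once per credited person.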


### What all manipulations have you done and insights you found?

  

###  Data Manipulations Done:
1. **Data Loading**
   - Imported `titles.csv` and `credits.csv` into Pandas DataFrames.  
   - Verified shape and structure of both datasets.  

2. **Duplicate Handling**
   - Checked duplicates in both datasets.  
   - Removed duplicate rows to ensure clean analysis.  

3. **Missing Value Treatment**
   - Identified missing values using `isnull().sum()`.  
   - Visualized missing data with heatmaps/barplots.  
   - Imputed missing `age_certification` values with "Unknown" and missing `seasons` values with 0 (movies have no seasons).  

4. **Data Type Conversion**
   - Converted numeric columns (`release_year`, `runtime`, `seasons`) into correct data types.  
   - Ensured categorical variables (`genres`, `production_countries`, `type`) remain categorical.  

5. **Understanding Variables**
   - Used `.info()` and `.describe()` to understand numerical & categorical variables.  
   - Explored unique values in `genres`, `production_countries`, and `type`.  

---

###  Initial Insights from Data:
1. **Dataset Size**
   - ~9k unique titles in `titles.csv`.  
   - ~124k records of actors/directors in `credits.csv`.  

2. **Content Type Distribution**
   - Amazon Prime is **movie-heavy** compared to TV shows.  

3. **Release Year Trends**
   - Majority of content released after **2000**.  
   - Significant increase in titles after **2010**, indicating platform expansion.  

4. **Runtime & Seasons**
   - Movies mostly range **90–120 minutes**.  
   - TV shows vary widely, most between **1–10 seasons**.  

5. **IMDb Ratings**
   - Average IMDb rating: **6–7** (above average quality).  
   - Some highly rated titles above **8+**, key highlights for Prime.  

6. **Genre Distribution**
   - Popular genres: **Drama, Comedy, Action, Thriller** dominate.  

7. **Production Countries**
   - Majority titles produced in the **United States**, followed by **India** and some European countries.  

---

✅ With these cleaning steps, the dataset is now **ready for Exploratory Data Analysis (EDA)** to uncover deeper insights into Amazon Prime Video’s catalog.  


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Chart - 1 --> Content Type Distribution (Movies vs TV Shows)**

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6,5))
sns.countplot(data=titles, x='type')
plt.title("Content Type Distribution (Movies vs TV Shows)")
plt.xlabel("Content Type")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A simple countplot helps us understand the basic composition of Prime Video’s library. Knowing whether the platform is skewed more toward movies or TV shows sets the foundation for deeper analysis.

##### 2. What is/are the insight(s) found from the chart?


*  The platform has a much larger proportion of movies compared to TV shows.

*  TV shows are relatively fewer, indicating a smaller catalog for long-term binge-watching content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **Positive:** Having a larger movie library attracts casual viewers.

*  **Negative:** Limited TV show collection may discourage subscribers who prefer long-term engagement through series. Prime could invest in more original shows to improve retention.
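
To put numbers on the movie versus show split, the counts can be expressed as percentages. This is a small sketch on illustrative values; in the notebook the input would be `titles['type']`.

```python
import pandas as pd

# Illustrative values only; in the notebook this would be titles["type"]
content_type = pd.Series(["MOVIE", "MOVIE", "MOVIE", "SHOW"])

# normalize=True returns proportions instead of raw counts
share_pct = content_type.value_counts(normalize=True).mul(100).round(1)
```

Reporting the share alongside the count plot makes the "movie-heavy" claim easier to compare across platforms.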

#### **Chart 2 -->  Titles by Release Year**

In [None]:
# Chart - 2 visualization code
titles_per_year = titles['release_year'].value_counts().sort_index()
plt.figure(figsize=(12,6))
titles_per_year.plot(kind='line', marker='o')
plt.title("Number of Titles Released Over the Years")
plt.xlabel("Year")
plt.ylabel("Count of Titles")
plt.show()

##### 1. Why did you pick the specific chart?

A time-series line chart is useful to visualize growth trends and expansion phases.

##### 2. What is/are the insight(s) found from the chart?

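*  The number of titles per release year rises sharply after 2010, so the catalog skews toward recent releases.

*  This is consistent with a push to expand content volume as Amazon Prime Video gained popularity, although `release_year` records when a title was released, not when it joined the platform.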

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* **Positive:** The growth in recent titles suggests that acquisition and production efforts have expanded the catalog.

* **Negative:** Rapid growth may include low-quality titles, which can dilute overall content quality. A balance is required.
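
The post-2010 skew can be quantified by grouping release years into decades. The sketch below uses made-up years; in the notebook the input would be `titles['release_year']`.

```python
import pandas as pd

release_year = pd.Series([1995, 2008, 2012, 2015, 2019, 2021])  # illustrative

# Floor each year to the start of its decade, e.g. 2015 -> 2010
decade = (release_year // 10) * 10
titles_per_decade = decade.value_counts().sort_index()

# Percentage of titles released in 2010 or later
share_since_2010 = (release_year >= 2010).mean() * 100
```

A decade-level table is less noisy than the yearly line and gives a single figure to quote for the share of recent titles.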

#### **Chart 3 -->  Age Certification Distribution**

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,5))
sns.countplot(
    data=titles,
    x='age_certification',
    order=titles['age_certification'].value_counts().index,
    palette='Set3',
    hue='age_certification',
    legend=False
)
plt.title("Distribution of Age Certifications")
plt.xlabel("Age Rating")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

*  Shows content maturity focus (family vs adult content).

*  Important for understanding audience targeting.

##### 2. What is/are the insight(s) found from the chart?

*  Most certified titles carry teen or adult ratings such as PG-13, R, and TV-MA; the "Unknown" bar counts titles whose certification was missing before wrangling.

*  Family/kids content is relatively less.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **Positive Business Impact:** Teen/adult content attracts mass market viewers.

*  **Negative Growth Risk:** Lack of family-friendly content may reduce appeal to households with kids.
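
The family versus adult comparison is easier to read when individual certifications are grouped into audience bands. The mapping below is an assumed grouping of common US ratings, introduced here for illustration; it is not part of the dataset.

```python
import pandas as pd

# Assumed grouping of US ratings into audience bands; adjust as needed
AUDIENCE_BAND = {
    "G": "Kids/Family", "TV-Y": "Kids/Family", "TV-Y7": "Kids/Family", "TV-G": "Kids/Family",
    "PG": "General", "TV-PG": "General",
    "PG-13": "Teen", "TV-14": "Teen",
    "R": "Adult", "NC-17": "Adult", "TV-MA": "Adult",
}

certs = pd.Series(["R", "PG-13", "TV-MA", "G", "Unknown"])  # illustrative

# Anything outside the mapping (including "Unknown") stays in its own band
bands = certs.map(AUDIENCE_BAND).fillna("Unknown")
band_counts = bands.value_counts()
```

Keeping "Unknown" as its own band avoids silently attributing uncertified titles to any audience.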

#### **Chart 4 -->  Runtime Distribution (Movies)**


In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,5))
movies = titles[titles['type']=='MOVIE']
sns.histplot(movies['runtime'].dropna(), bins=30, kde=True)
plt.title("Distribution of Movie Runtimes")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

*  Histogram shows how most movies are distributed by length.

*  Useful for understanding user viewing convenience.

##### 2. What is/are the insight(s) found from the chart?

*  Most movies are around 90–120 minutes.

*  Very few extremely short or very long movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **Positive Business Impact:** Ideal runtimes meet audience expectations.

*  **Negative Growth Risk:** Lack of experimental or short-format content could miss out on casual/mobile users.
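
The "90–120 minutes" observation can be backed with a simple share and quartiles. This is a sketch on illustrative runtimes; in the notebook the input would be `movies['runtime']`.

```python
import pandas as pd

runtime = pd.Series([45, 88, 95, 110, 120, 150, None])  # minutes, illustrative
runtime = runtime.dropna()

# between() is inclusive on both ends by default
in_range_pct = runtime.between(90, 120).mean() * 100
quartiles = runtime.quantile([0.25, 0.5, 0.75])
```

The quartiles show where the bulk of runtimes sit without depending on the histogram's bin width.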

#### **Chart 5 --> Genre Distribution**

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10,6))
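# Genres are stored as list-like strings (assumed format: "['drama', 'comedy']"),
# so strip brackets and quotes before splitting on commas
all_genres = (titles['genres'].dropna()
              .str.replace(r"[\[\]']", "", regex=True)
              .str.split(',').explode().str.strip())
all_genres = all_genres[all_genres != ""]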
top_genres = all_genres.value_counts().head(10)
sns.barplot(x=top_genres.values, y=top_genres.index)
plt.title("Top 10 Genres on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Genre")
plt.show()

##### 1. Why did you pick the specific chart?

Tells us which genres Amazon Prime prioritizes.


##### 2. What is/are the insight(s) found from the chart?

*  Top genres are Drama, Comedy, Action, and Romance.

*  Niche genres like Horror or Sci-Fi have smaller presence.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Drama & Comedy cover wide audience taste.

**Negative:** Sci-Fi/Horror fans may move to competitors like Netflix with stronger niche content.

#### **Chart 6 --> Top 10 Countries by Content**

In [None]:
# Chart - 6 visualization code
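# Country codes are stored as list-like strings (assumed format: "['US', 'IN']"),
# so strip brackets and quotes before splitting on commas
countries = (titles['production_countries'].dropna()
             .str.replace(r"[\[\]']", "", regex=True)
             .str.split(',').explode().str.strip())
countries[countries != ""].value_counts().head(10).plot(kind='bar')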
plt.title("Top Countries by Content")
plt.xlabel("Country")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

To identify which countries contribute the most content, helping understand diversity.

##### 2. What is/are the insight(s) found from the chart?

The United States dominates, followed by India and a few other countries.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** US dominance ensures global appeal.

**Negative:** Over-reliance on few countries reduces cultural variety.

#### **Chart 7 --> Top 15 Actors**




In [None]:
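# Chart - 7 visualization code
plt.figure(figsize=(10,6))

top_actors = credits[credits['role']=="ACTOR"]['name'].value_counts().head(15)

# Assign hue so the palette applies without a seaborn deprecation warning
sns.barplot(x=top_actors.values, y=top_actors.index, hue=top_actors.index,
            palette="magma", legend=False)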
plt.title("Top 15 Actors on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Actor")
plt.show()

##### 1. Why did you pick the specific chart?

Shows which actors have the most presence on Prime, useful for fan-driven subscriptions.

##### 2. What is/are the insight(s) found from the chart?

A handful of actors appear in many titles; such frequently credited names may act as strong audience pullers.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Helps Prime promote star-driven content.

**Negative:** Too much focus on a few actors may make the library feel repetitive.
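
The chart above counts credit rows per name, which can overstate "Number of Titles" when one actor holds several credits in the same title. A hedged alternative counts distinct title ids per actor. The data below is illustrative.

```python
import pandas as pd

# "Actor A" has two credit rows for title t1 (illustrative data)
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3"],
    "name": ["Actor A", "Actor A", "Actor A", "Actor B"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "ACTOR"],
})

actors = credits_demo[credits_demo["role"] == "ACTOR"]

# Credit rows per name (what value_counts measures)
rows_per_actor = actors["name"].value_counts()

# Distinct titles per name
titles_per_actor = actors.groupby("name")["id"].nunique().sort_values(ascending=False)
```

Names are not guaranteed to be unique, so grouping on `person_id` instead of `name` may be the safer key on the real data.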

#### **Chart 8 --> Top 15 Directors**

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10,6))

top_directors = credits[credits['role']=="DIRECTOR"]['name'].value_counts().head(15).reset_index()
top_directors.columns = ['Director', 'Count']

sns.barplot(data=top_directors, x="Count", y="Director", hue="Director",
            palette="plasma", dodge=False, legend=False)

plt.title("Top 15 Directors on Amazon Prime")
plt.xlabel("Number of Titles")
plt.ylabel("Director")
plt.show()

##### 1. Why did you pick the specific chart?

Directors are key to content quality and branding.

##### 2. What is/are the insight(s) found from the chart?

A few directors account for a disproportionate number of titles, which may indicate recurring collaborations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Partnerships with popular directors can increase subscriber trust.

**Negative:** Risk of dependency on a small creative pool.

#### **Chart 9 --> IMDb Score Distribution**

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,6))
sns.histplot(titles['imdb_score'].dropna(), bins=30, kde=True)
plt.title("Distribution of IMDb Scores")
plt.xlabel("IMDb Score")
plt.ylabel("Frequency")
plt.show()

##### 1. Why did you pick the specific chart?

Audience perception of quality is reflected in IMDb scores.



##### 2. What is/are the insight(s) found from the chart?

Most titles cluster between 6–8, suggesting average-to-good quality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** High-rated titles improve platform reputation.

**Negative:** Too many low-rated titles can reduce brand value.

#### **Chart 10 --> IMDb Votes Distribution**

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10,6))
sns.histplot(titles['imdb_votes'].dropna(), bins=30, color="blue")
plt.title("IMDb Votes Distribution of Titles")
plt.xlabel("Number of Votes")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Measures engagement — how many people actually rate the shows/movies.

##### 2. What is/are the insight(s) found from the chart?

Few titles get very high votes → popular globally.

Many titles have fewer votes → niche audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Identifies “flagship” titles for promotion.

**Negative:** Low engagement content may waste licensing costs.
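
Vote counts are heavily right-skewed, so a linear histogram puts almost every title in the first bin. A minimal sketch of a log-scale view, using illustrative numbers rather than the real `imdb_votes` column:

```python
import numpy as np
import pandas as pd

votes = pd.Series([12, 150, 900, 4_000, 75_000, 1_200_000])  # illustrative

# log10 compresses the long tail; zero-vote titles are excluded first
log_votes = np.log10(votes[votes > 0])

# One bin per order of magnitude between 10 and 10 million votes
counts, edges = np.histogram(log_votes, bins=6, range=(1, 7))
```

In the chart cell, passing `log_scale=True` to `sns.histplot` gives a similar view without transforming the column by hand.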

#### **Chart 11 --> TMDb Popularity Distribution**

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10,6))
sns.histplot(titles['tmdb_popularity'].dropna(), bins=30, color="purple")
plt.title("TMDb Popularity Distribution")
plt.xlabel("Popularity Score")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Shows trending titles on another platform (TMDb), indicating broader appeal.

##### 2. What is/are the insight(s) found from the chart?

Few titles achieve very high popularity → viral hits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Helps Prime market trending titles.

**Negative:** Over-focus on “trending” may ignore niche loyal audiences.

#### **Chart 12 --> Runtime vs IMDb Score**

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(10,6))
sns.scatterplot(x='runtime', y='imdb_score', data=titles, alpha=0.6, color="green")
plt.title("Runtime vs IMDb Score")
plt.xlabel("Runtime (minutes)")
plt.ylabel("IMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

Tests if longer movies/shows perform better or worse.

##### 2. What is/are the insight(s) found from the chart?

No strong correlation; both short and long content can succeed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Encourages experimenting with different lengths.

**Negative:** Very long runtimes may deter casual viewers.
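
The "no strong correlation" reading of the scatter plot can be checked with correlation coefficients. The sketch below uses illustrative values and computes Spearman's coefficient as the Pearson correlation of ranks, which avoids any extra dependency.

```python
import pandas as pd

df = pd.DataFrame({
    "runtime": [30, 60, 90, 120, 150, None],
    "imdb_score": [6.0, 7.5, 5.5, 7.0, 6.5, 8.0],
})  # illustrative

# Keep only rows where both values are present
pair = df[["runtime", "imdb_score"]].dropna()

# Pearson measures linear association
pearson_r = pair["runtime"].corr(pair["imdb_score"])

# Spearman measures monotonic association (Pearson on ranks)
spearman_r = pair["runtime"].rank().corr(pair["imdb_score"].rank())
```

Values near zero for both coefficients would support the conclusion drawn from the scatter plot. Movies and shows may be worth checking separately, since their runtimes are measured differently.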

#### **Chart 13 --> Titles by Release Year (Trend Line)**

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10,6))
titles_per_year = titles['release_year'].value_counts().sort_index()
sns.lineplot(x=titles_per_year.index, y=titles_per_year.values, marker="o", color="red")
plt.title("Titles on Amazon Prime by Release Year")
plt.xlabel("Release Year")
plt.ylabel("Number of Titles")
plt.show()


##### 1. Why did you pick the specific chart?

Shows expansion trend of Prime’s library.

##### 2. What is/are the insight(s) found from the chart?

Title counts rise steeply for recent release years, consistent with Prime’s global push. As in Chart 2, the x-axis is `release_year`, not the date a title was added to the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Rapid growth attracts more subscribers.

**Negative:** Too much quantity may dilute quality.

#### **Chart 14 -->  Correlation Heatmap**

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))
sns.heatmap(titles[['imdb_score','imdb_votes','tmdb_popularity','tmdb_score']].corr(),
            annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Heatmap of Key Metrics")
plt.show()

##### 1. Why did you pick the specific chart?

To see how ratings, votes, and popularity are related.

##### 2. What is/are the insight(s) found from the chart?

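IMDb votes and TMDb popularity often show a strong correlation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Helps Amazon predict which titles will trend.

**Negative:** Overreliance on correlated factors may ignore new experimental titles.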

#### **Chart 15 --> Pair Plot**

In [None]:
# Pair Plot visualization code
sns.pairplot(titles[['imdb_score','imdb_votes','tmdb_popularity','tmdb_score']].dropna(),
             diag_kind='kde')
plt.suptitle("Pairplot of IMDb and TMDb Metrics", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

Helps visualize multiple relationships together.

##### 2. What is/are the insight(s) found from the chart?

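Confirms positive relationships between votes, scores, and popularity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive:** Multi-dimensional analysis improves content strategy.

**Negative:** Complex interpretation may mislead non-technical managers.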

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.


**Suggested Recommendations to Achieve Business Objective**

- **Content Strategy**
  * Focus on the genres that dominate the catalog (Drama, Comedy, and Action).
  * Continue adding movies, since they make up the majority of titles, but also grow TV shows, as long-running series help with subscriber retention.

- **Target Audience Expansion**
  * Use the age certification insights to identify underserved segments (e.g., family-friendly or teen content).
  * Balance mature content with content suitable for all age groups to widen the audience base.

- **Global Expansion**
  * Content is concentrated in a few producing countries. Increase investment in regional content (India, Korea, Spain, etc.), since non-English shows can have strong international appeal.

- **Talent & Partnerships**
  * Top directors and actors appear repeatedly across the catalog; partnering with them may help deliver consistent performance.
  * Encourage collaborations with new talent to diversify the portfolio.

- **User Engagement & Retention**
  * Runtime analysis shows that most movies run 90–120 minutes, a balanced length that likely matches audience expectations.
  * Mix shorter shows (mini-series) for binge-watchers with longer premium productions to maximize retention.

- **Improving Ratings & Reviews**
  * Titles with higher IMDb scores build credibility. Focus on producing quality-driven content instead of just increasing quantity.
  * Analyze low-rated content genres to avoid overspending on formats that do not perform well.
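
The last recommendation, analyzing low-rated genres, could start from a per-genre summary of IMDb scores. The sketch below assumes the `genres` column has already been parsed into Python lists and uses illustrative rows; `titles_demo` is a stand-in for the real frame.

```python
import pandas as pd

titles_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["drama", "comedy"], ["drama"], ["horror"]],  # already parsed into lists
    "imdb_score": [8.0, 6.0, 4.0],
})  # illustrative

# One row per (title, genre), then mean score and title count per genre
by_genre = (
    titles_demo.explode("genres")
    .groupby("genres")["imdb_score"]
    .agg(mean_score="mean", n_titles="count")
    .sort_values("mean_score")
)
```

Reporting `n_titles` next to the mean guards against reading too much into genres that have only a handful of titles.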







# **Conclusion**

The analysis of Amazon Prime titles provides meaningful insights into content distribution and catalog trends. Movies dominate the platform, while TV shows remain a smaller segment that plays a crucial role in long-term subscriber engagement. Drama, Comedy, and Action are the most common genres, and expanding regional content presents new growth opportunities. Runtime patterns and age certification insights highlight the importance of balancing diverse content for all audience groups.

From a business perspective, these insights help Amazon Prime strengthen its content strategy, improve customer satisfaction, and expand into untapped markets. By focusing on quality-driven productions, leveraging successful directors/actors, and analyzing underperforming segments, Prime can optimize investments and ensure sustained growth in a highly competitive streaming industry.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***