<a href="https://colab.research.google.com/github/sumitmishra523/Amazon-Prime-TV-Shows-and-Movies-EDA-/blob/main/module_2_EDA_project_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Prime TV Show Analysis Project



##### **Project Type**    - EDA
##### **Contribution**    -  Individual
 Name ~ Sumit mishra

# **Project Summary -**

Project Title ~ Amazon Prime TV Shows – Exploratory Data Analysis

Objective  ~ To explore and analyze TV show data from Amazon Prime to identify trends, content distribution, and key insights related to genres, release years, ratings, and content origin.

Tools Used ~  Python, Pandas, NumPy, Matplotlib, Seaborn

Approach ~

* Data loading and inspection
* Missing value treatment
* Data type correction and formatting
* Univariate and bivariate analysis
* Visualization of trends and distributions


Outcome ~ Derived insights on genre distribution, country-wise content contribution, release year trends, target audience, and content rating patterns.







# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Amazon Prime hosts a vast library of TV shows from various genres, countries, and time periods. However, understanding the composition, trends, and patterns in this content library requires structured analysis. The lack of clear visibility into content distribution limits strategic decisions in areas like marketing, content acquisition, and user targeting.

This project aims to:

* Identify dominant genres and popular content categories
* Analyze content release trends over time
* Understand the target audience through age ratings
* Examine the geographical distribution of TV shows
* Detect missing or inconsistent data affecting content insights


The goal is to generate data-driven insights that can support content strategy and platform optimization.

#### **Define Your Business Objective?**

To analyze the Amazon Prime TV Shows dataset and extract actionable insights that can assist in:

* Enhancing content acquisition strategy based on genre, region, and release trends
* Understanding viewer segmentation through age ratings and show types
* Identifying underrepresented categories or regions for content expansion
* Improving user experience by aligning content offerings with demand patterns
* Supporting data-driven decisions for marketing and recommendation systems


This analysis enables stakeholders to make informed choices for platform growth, customer retention, and competitive positioning in the OTT space.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Render plots inline in the notebook
%matplotlib inline



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/titles.csv')
df1 = pd.read_csv('/content/drive/MyDrive/credits.csv')
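
Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper. The sketch below is one possible approach, not the notebook's actual loader: `load_csv` is a hypothetical helper name, and the demo writes a tiny temporary file with made-up rows rather than touching the Drive paths used above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, raising a clear error if it is missing or empty."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    try:
        return pd.read_csv(path)
    except pd.errors.EmptyDataError as exc:
        raise ValueError(f"Dataset is empty: {path}") from exc


# Demo on a throwaway file (toy rows, not the real titles.csv)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "titles_demo.csv"
    demo_path.write_text("id,title\nt1,Example A\nt2,Example B\n")
    demo_df = load_csv(demo_path)
    print(demo_df.shape)

    try:
        load_csv(Path(tmp) / "missing.csv")
    except FileNotFoundError as err:
        missing_error = str(err)
        print(missing_error)
```

Failing early with a readable message makes it easier to tell a wrong path apart from a genuinely empty file when the notebook is run in one go.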


Merge the two datasets on the shared `id` column

In [None]:
merged_df = pd.merge(df, df1, on='id', how='inner')
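
An inner merge silently drops titles that have no credits and credits that match no title. A minimal sketch of how this could be checked is shown below; it uses toy stand-ins for the two files (all values are made up) and assumes that `id` is unique in the titles table.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (values are made up)
titles = pd.DataFrame({"id": ["t1", "t2", "t3"], "title": ["A", "B", "C"]})
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t9"],
    "name": ["P1", "P2", "P3", "P4"],
})

# An outer merge with indicator=True labels where each row came from
check = pd.merge(titles, credits, on="id", how="outer", indicator=True)
merge_counts = check["_merge"].value_counts()
print(merge_counts)

# validate= raises a MergeError if ids are not unique on the titles side
inner = pd.merge(titles, credits, on="id", how="inner", validate="one_to_many")
print(inner.shape)
```

The `left_only` and `right_only` counts show how many rows an inner merge would discard from each side.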


### Dataset First View

In [None]:
merged_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
merged_df.shape

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
#Duplicate Rows count
duplicate_df = merged_df.duplicated()
print("Number of duplicate rows:", duplicate_df.sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count of null values in each column
merged_df.isnull().sum()


In [None]:
# Visualizing the missing values
plt.figure(figsize=(15, 8))

# Plot missing values heatmap
sns.heatmap(merged_df.isnull(), cbar=False, cmap="YlGnBu", yticklabels=False)

# Add titles and labels
plt.title('Missing Values Heatmap (Seaborn)', fontsize=19)
plt.xlabel(' Data Columns')
plt.ylabel('Row Index')

plt.show()

### What did you know about your dataset?

The dataset represents a rich catalog of entertainment content, combining titles with their associated cast and crew information. It is structured in two parts, titles.csv and credits.csv, which have been merged to create a comprehensive view of both content and contributors.

🧾 Key Insights:

**Content Scope:**
The dataset includes a diverse mix of movies and shows from different genres, release years, and durations, offering strong potential for analysis of content trends over time.

**Entity Relationship:**

* Each title (movie or show) has a unique ID, linked to multiple people in the credits file.
* The merged structure enables analysis of how actors, directors, and crew are distributed across the content catalog.

**Missing & Duplicate Data:**

* Identified missing values in columns such as genres and runtime, which may affect recommendation models or content filtering unless handled properly.
* Detected duplicate rows, suggesting the need for data cleaning to ensure accuracy in analysis.

**Usability:**
The dataset is highly suitable for tasks such as:

* ⭐ IMDb rating trend analysis
* 🎬 Actor/Director profiling
* 🧠 Content-based filtering
* 📊 Visual storytelling (via genres, types, timelines)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe()

### Variables Description

## 🧾 Variable Description

| Column Name          | Description                                                  |
|----------------------|--------------------------------------------------------------|
| `id`                 | Unique identifier for each title (used to link both files)   |
| `title`              | Title of the movie or TV show                                |
| `type`               | Type of content – either `"MOVIE"` or `"SHOW"`               |
| `description`        | Short summary of the plot or theme                           |
| `release_year`       | Year in which the title was released                         |
| `age_certification`  | Content rating (e.g., PG, R, TV-MA)                          |
| `runtime`            | Duration in minutes                                          |
| `genres`             | Comma-separated list of genres (e.g., Drama, Comedy)         |
| `production_countries` | Countries where the title was produced                     |
| `seasons`            | Number of seasons (for shows)                                |
| `imdb_id`            | IMDb identifier (if available)                               |
| `imdb_score`         | IMDb rating score (0 to 10)                                  |
| `imdb_votes`         | Number of votes the title received on IMDb                   |
| `tmdb_score`         | TMDb score (if available)                                    |
| `tmdb_popularity`    | Popularity metric from TMDb                                  |
| `name`               | Name of the cast or crew member                              |
| `character`          | Character played (for actors)                                |
| `role`               | Role of the person (e.g., ACTOR, DIRECTOR, etc.)             |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Show number of unique values for each column
merged_df.nunique()


## 3. ***Data Wrangling***

**converting data types**

In [None]:
# Convert 'seasons' and 'imdb_votes' to nullable integers (invalid entries become <NA>)
merged_df['seasons'] = pd.to_numeric(merged_df['seasons'], errors='coerce').astype('Int64')
merged_df['imdb_votes'] = pd.to_numeric(merged_df['imdb_votes'], errors='coerce').astype('Int64')

**Removing duplicates**

In [None]:
# Write your code to make your dataset analysis ready.
# 👀 Count duplicate rows before dropping them
duplicate_count = merged_df.duplicated().sum()
# drop_duplicates() returns a new DataFrame, so assign the result back
merged_df = merged_df.drop_duplicates().reset_index(drop=True)
print(f"Number of duplicate rows removed: {duplicate_count}")
# DataFrame now has 124,179 unique rows (124,347 - 168).
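
Note that `DataFrame.drop_duplicates` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A small sketch on toy data (values are made up) illustrates the difference:

```python
import pandas as pd

toy = pd.DataFrame({"id": ["t1", "t1", "t2"], "name": ["P1", "P1", "P2"]})

toy.drop_duplicates()              # result discarded; toy is unchanged
rows_without_assignment = len(toy)

deduped = toy.drop_duplicates().reset_index(drop=True)  # keep the result
rows_with_assignment = len(deduped)

print(rows_without_assignment, rows_with_assignment)
```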

**Handling null/NaN values by filling in valid values**

In [None]:
# Fill missing values with sensible defaults instead of dropping rows
merged_df['age_certification'] = merged_df['age_certification'].fillna('Not Rated')
merged_df['runtime'] = merged_df['runtime'].fillna(merged_df['runtime'].median())
# imdb_votes is a nullable integer column, so round the median to a whole number before filling
merged_df['imdb_votes'] = merged_df['imdb_votes'].fillna(int(round(merged_df['imdb_votes'].median())))
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].median())
merged_df['genres'] = merged_df['genres'].fillna('Unknown')
merged_df['production_countries'] = merged_df['production_countries'].fillna('Unknown')
# seasons is missing for movies, so fill with 0
merged_df['seasons'] = merged_df['seasons'].fillna(0)
# 🎯 Fill missing IMDb scores with the median, which leaves the median rating unchanged
merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].median())
# Check whether any important columns still contain null values that could distort the insights
merged_df.isna().sum()
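
Median filling puts every missing value at the same point, which can create an artificial spike in later histograms. One option, sketched below on toy data, is to record which values were imputed before filling them. The `imdb_score_imputed` flag column is a hypothetical name and is not part of the original dataset.

```python
import pandas as pd

toy = pd.DataFrame({"imdb_score": [6.0, None, 8.0, 7.0, None]})

# Record which rows were missing before filling them
toy["imdb_score_imputed"] = toy["imdb_score"].isna()
toy["imdb_score"] = toy["imdb_score"].fillna(toy["imdb_score"].median())

# Later analyses can exclude the filled rows when needed
observed_only = toy.loc[~toy["imdb_score_imputed"], "imdb_score"]
print(toy)
print(observed_only.mean())
```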

In [None]:
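# Select relevant numeric columns
columns_to_plot = ['runtime', 'imdb_votes', 'imdb_score', 'tmdb_popularity', 'tmdb_score']
plt.figure(figsize=(12, 6))
# Plot the cleaned, merged data (not the raw titles table); cast to float
# so the nullable Int64 vote counts plot like the other numeric columns
sns.boxplot(data=merged_df[columns_to_plot].astype(float))
plt.title("Boxplot for Outlier Detection in Amazon Prime Data")
plt.xticks(rotation=45)
plt.show()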

Popular shows and blockbuster movies naturally gather hundreds, thousands, or even millions of votes, so some very high values in `imdb_votes` are expected. The other columns look reasonable, so the high `imdb_votes` values are kept rather than removed as outliers.
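
The box plot flags points beyond 1.5 times the interquartile range (seaborn's default whisker length). To count such points rather than eyeball them, the same rule can be written out directly. This is a minimal sketch on made-up vote counts; `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy vote counts with one very popular title (values are made up)
votes = pd.Series([100, 120, 150, 130, 110, 50000], name="imdb_votes")
mask = iqr_outlier_mask(votes)
print(votes[mask])
```

Counting the flagged rows per column gives a concrete basis for deciding whether to keep, cap, or transform them.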

### What all manipulations have you done and insights you found?

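The following manipulations were applied to `merged_df`:

* Converted `seasons` and `imdb_votes` to the nullable integer type `Int64`.
* Removed duplicate rows.
* Filled missing `age_certification` with 'Not Rated', missing `genres` and `production_countries` with 'Unknown', and missing `seasons` with 0.
* Filled missing `runtime`, `imdb_votes`, `tmdb_score`, and `imdb_score` with the column median.
* Checked the numeric columns for outliers with a box plot; the high `imdb_votes` values were kept because popular titles naturally collect many votes.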

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

sns.histplot(merged_df['imdb_score'], bins=20, kde=True)
plt.title("Distribution of IMDb Scores")
plt.xlabel("IMDb Score")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is ideal for visualizing the distribution of continuous numerical data such as IMDb scores. The KDE line helps show the shape and probability density of the distribution.

##### 2. What is/are the insight(s) found from the chart?

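Most titles have IMDb scores between 6 and 8, with counts falling off sharply outside that range. Very few titles score below 4 or above 9, indicating that the majority of Prime Video content falls in the mid-to-high quality range as rated by users. Because missing scores were filled with the median, part of the peak in the middle of the distribution comes from imputed values rather than real ratings.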
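
The merged table has one row per (title, person) pair, so a histogram drawn from it weights each title by the size of its cast. A sketch of how the share of scores between 6 and 8 could be compared at row level and at title level, using toy data with made-up values:

```python
import pandas as pd

# Toy merged table: t1 has three credited people, t2 and t3 have one each
merged = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3"],
    "imdb_score": [7.0, 7.0, 7.0, 4.0, 9.0],
})

# Keep one row per title before describing title-level columns
title_level = merged.drop_duplicates(subset="id")

share_rows = merged["imdb_score"].between(6, 8).mean()         # per credit row
share_titles = title_level["imdb_score"].between(6, 8).mean()  # per title
print(round(share_rows, 2), round(share_titles, 2))
```

If the two shares differ noticeably on the real data, plotting from the title-level table would give a more faithful picture of the catalog.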

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.
The insights derived from EDA can lead to strategic improvements and positive business outcomes, such as:

Content Strategy Optimization:
Understanding genre popularity, country-wise contribution, and IMDb rating patterns helps Amazon prioritize what kind of shows to acquire or promote.

User Retention & Personalization:
Age rating trends and genre preferences can improve recommendation systems, keeping users engaged longer.

Content Quality Assurance:
Detecting that most shows lie between IMDb 6–8 indicates stable content quality. Focus can now shift to improving or replacing underperforming content.

Market Expansion Decisions:
If certain countries or genres are underrepresented, Amazon can fill those gaps to attract new demographics.                                               


Some insights may highlight areas of concern, but none of them indicate negative growth.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
sns.histplot(merged_df['runtime'], bins=30, kde=True, color='orange')
plt.title("Distribution of Runtimes")
plt.xlabel("Runtime (minutes)")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is ideal to visualize the spread and frequency of numerical data, in this case, the runtime (in minutes) of TV shows. It helps detect central tendency, skewness, and outliers in runtime values.

##### 2. What is/are the insight(s) found from the chart?

The majority of TV shows on Amazon Prime have runtimes between 20 and 60 minutes, which aligns with typical episode lengths. Very few exceed 100 minutes, indicating a clear focus on short to mid-length episodic content rather than long-format or feature-length content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Most shows are 20–60 mins, which suits binge-watching and today’s short attention spans.

Helps increase completion rates and keeps users engaged.


Very few long-format shows may lead to lack of content variety.

Could alienate users looking for in-depth or special content.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.countplot(x='type', data=merged_df, palette='Set2')
plt.title("Count of Movies vs Shows")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

To compare the total number of movies vs TV shows available on Amazon Prime.



##### 2. What is/are the insight(s) found from the chart?

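Movies significantly outnumber shows, indicating a movie-heavy library. Note that the chart counts rows of the merged table (one per credited person), so titles with large casts weigh more than titles with small casts.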
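
To compare the catalog by number of titles rather than number of credit rows, distinct `id` values can be counted per type. A minimal sketch on toy data (values are made up):

```python
import pandas as pd

merged = pd.DataFrame({
    "id": ["m1", "m1", "m1", "m1", "s1", "s2"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
})

rows_per_type = merged["type"].value_counts()              # credit rows
titles_per_type = merged.groupby("type")["id"].nunique()   # distinct titles
print(rows_per_type)
print(titles_per_type)
```

In this toy example the row count favors movies while the title count favors shows, which is why the two views are worth checking side by side.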

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on movies meets demand for quick, one-time content consumption.

Good for casual users who prefer short engagement.

Fewer shows can affect long-term user retention, as series keep users hooked over time.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
sns.countplot(y='age_certification', data=merged_df,
              order=merged_df['age_certification'].value_counts().index,
              palette='Set3')
plt.title("Distribution of Age Certifications")
plt.xlabel("Count")
plt.ylabel("Age Certification")
plt.show()

##### 1. Why did you pick the specific chart?

To analyze age-based classification of content and understand the target audience segments on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

A large portion of content is Not Rated, followed by R-rated and PG-13, indicating a focus on mature audiences. There's limited content for kids and general audience.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Strong presence of teen and adult content matches OTT trends and viewer demand for mature genres.

High number of “Not Rated” content may reflect data inconsistency or unreviewed content, affecting parental control and personalization.

Low kids/family content may limit platform’s reach in multi-user households.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
sns.boxplot(x='type', y='imdb_score', data=merged_df)
plt.title("IMDb Score Distribution by Type")
plt.show()

##### 1. Why did you pick the specific chart?

Boxplot helps compare the rating distribution between movies and TV shows, showing median, spread, and outliers clearly.

##### 2. What is/are the insight(s) found from the chart?

IMDb scores for both shows and movies are similar, but TV shows have slightly higher median ratings and more outliers. This suggests more variability in show quality compared to movies

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

TV shows often receive better audience response, indicating strong engagement and content depth.

High number of outliers in shows may signal inconsistent quality, which can confuse or frustrate users.


#### Chart - 6

In [None]:
# Chart - 6 visualization code
sns.scatterplot(x='runtime', y='imdb_score', data=merged_df, alpha=0.4)
plt.title("Runtime vs IMDb Score")
plt.xlabel("Runtime")
plt.ylabel("IMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

To observe any relationship or trend between the length of content (runtime) and its IMDb rating.



##### 2. What is/are the insight(s) found from the chart?

There's no strong correlation — both short and long shows receive a wide range of ratings. Most content clusters below 150 minutes runtime, regardless of score.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Viewers are not biased by length — both short and long content can perform well.

Gives flexibility to creators and content teams.

No clear trend means runtime alone can’t predict success, so focus should be on storytelling quality, not just duration.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
age_group = merged_df.groupby('age_certification')['imdb_score'].mean().reset_index()
sns.barplot(x='imdb_score', y='age_certification', data=age_group, palette='magma')
plt.title("Average IMDb Score by Age Certification")
plt.xlabel("Average Score")
plt.ylabel("Age Certification")
plt.show()

##### 1. Why did you pick the specific chart?

To compare audience ratings across age categories, helping understand which type of content performs best.

##### 2. What is/are the insight(s) found from the chart?

Kids’ content (TV-Y, TV-Y7, TV-PG) has the highest average IMDb scores, while adult-rated or not-rated content tends to score lower.
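
An average can be driven by a handful of rows, so it helps to report the group size next to each mean. The sketch below shows one way to do that with named aggregation; the scores are made up and the column names `mean_score` and `n_rows` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "age_certification": ["TV-Y", "R", "R", "R", "PG-13", "PG-13"],
    "imdb_score": [9.0, 6.0, 7.0, 5.0, 6.5, 7.5],
})

# Mean score and number of rows per certification
summary = (
    toy.groupby("age_certification")["imdb_score"]
    .agg(mean_score="mean", n_rows="count")
    .sort_values("mean_score", ascending=False)
)
print(summary)
```

A certification that ranks first with very few rows behind it deserves more caution than one backed by thousands.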

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

High ratings for family/kids content show strong audience satisfaction — a growth area to tap into.

Lower ratings in R-rated or unrated content may reflect quality issues or misaligned expectations.


#### Chart - 8

In [None]:
# Chart - 8 visualization code
top_years = merged_df['release_year'].value_counts().nlargest(20).sort_index()
top_years.plot(kind='bar', color='teal')
plt.title("Top 20 Most Active Release Years")
plt.xlabel("Year")
plt.ylabel("Number of Titles")
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To show how content production or acquisition has evolved over the years — identifying growth spikes and activity trends.

##### 2. What is/are the insight(s) found from the chart?

Massive growth in content release observed post 2010, peaking around 2017–2020. It highlights Amazon Prime’s aggressive push into content expansion in recent years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Increasing yearly releases reflect strong investment, platform scaling, and market competitiveness.

More recent content helps attract users looking for fresh and relevant content.

Sudden dips after 2020 (possibly due to COVID or oversaturation) might indicate production slowdowns or content fatigue.



#### Chart - 9

In [None]:
# Chart - 9 visualization code
sns.scatterplot(x='runtime', y='imdb_score', hue='type', data=merged_df, alpha=0.4)
plt.title("Runtime vs IMDb Score by Type")
plt.show()

##### 1. Why did you pick the specific chart?

To compare runtime vs IMDb score while also visually separating movies and TV shows for deeper insights.

##### 2. What is/are the insight(s) found from the chart?

TV shows (blue) cluster around shorter runtimes (mostly <60 mins) with stable IMDb scores.

Movies (orange) have more spread in runtime (up to 500 mins) but still show no strong correlation with ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Viewers appreciate both shows and movies across different lengths — shows consistent demand across formats.

Flexibility to experiment with runtimes for both content types.

No runtime advantage in predicting success. So longer runtime ≠ better rating, which may waste budget if not executed well

#### Chart - 10

In [None]:
# Chart - 10 visualization code
corr_cols = ['imdb_score', 'imdb_votes', 'runtime', 'tmdb_score']
sns.heatmap(merged_df[corr_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

##### 1. Why did you pick the specific chart?

To quantify the relationships between numeric features like IMDb score, runtime, and votes — visually highlights strong/weak correlations.

##### 2. What is/are the insight(s) found from the chart?

IMDb score is weakly positively correlated with IMDb votes (0.26), meaning titles with more votes tend to have slightly higher ratings.

Runtime has a very weak correlation with IMDb and TMDb scores, so longer titles don't guarantee higher ratings.
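
`DataFrame.corr` computes Pearson correlation by default, which measures linear association and is sensitive to heavily skewed columns such as vote counts. A rank-based Spearman correlation is a common complement. The sketch below compares the two on toy data (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 8.0],
    "imdb_votes": [10, 100, 1000, 10000, 1000000],
})

pearson = toy["imdb_score"].corr(toy["imdb_votes"])                      # linear
spearman = toy["imdb_score"].corr(toy["imdb_votes"], method="spearman")  # rank-based
print(round(pearson, 2), round(spearman, 2))
```

Here the relationship is perfectly monotonic, so Spearman is higher than Pearson; a similar gap on the real data would suggest the votes column benefits from a log scale.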

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The positive link between votes and ratings is weak and does not show causation, but it suggests that widely watched titles tend to be perceived somewhat better, so increasing viewer engagement may help visibility.

Runtime doesn't significantly affect performance, so teams focusing too much on show or movie length might be wasting effort.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
top_actors = merged_df[merged_df['role'] == 'ACTOR']['name'].value_counts().nlargest(10)
sns.barplot(x=top_actors.values, y=top_actors.index, palette='viridis')
plt.title("Top 10 Most Frequent Actors")
plt.xlabel("Appearances")
plt.ylabel("Actor")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart was selected to effectively compare the frequency of appearances for the top ten actors. Its clear, horizontal bars provide an intuitive visualization of the data, making it easy to rank and compare the actors based on their appearance counts.

##### 2. What is/are the insight(s) found from the chart?

 * George 'Gabby' Hayes is the most frequently appearing actor, with a significantly higher count than any other actor on the list.
 * The top three actors (George 'Gabby' Hayes, Roy Rogers, and Bess Flowers) have a notably higher frequency than the rest of the actors.
 * The frequency of appearances decreases progressively down the list, with the bottom half of the actors having a much smaller difference in their appearance counts compared to the top actors.
 * The distribution shows a clear long-tail effect, with a few actors dominating the appearance count, while the majority have lower and more evenly distributed frequencies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from this chart can drive positive business impact in several ways:
 * Content Strategy: Studios or production companies can leverage the popularity of top-ranking actors like George 'Gabby' Hayes to inform future casting decisions or to promote existing content featuring them. This data can help in developing a content strategy that capitalizes on proven audience interest.
 * Marketing and Promotion: The most frequent actors can be used as key marketing assets. By highlighting their involvement, businesses can attract a larger audience and increase viewership, leading to higher engagement and potential revenue.
 * Talent Management: This information is valuable for talent agencies and managers to understand an actor's market value and to negotiate contracts based on their historical prominence and appeal.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
country_scores = merged_df.groupby('production_countries')['imdb_score'].mean().nlargest(10).reset_index()
sns.barplot(x='imdb_score', y='production_countries', data=country_scores, palette='cubehelix')
plt.title("Top 10 Countries by Avg IMDb Score")
plt.xlabel("Average IMDb Score")
plt.ylabel("Country")
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart is ideal here because it allows for easy reading of the long country group names and clearly shows the ranking of average IMDb scores from highest to lowest.

##### 2. What is/are the insight(s) found from the chart?

 * Collaborations Lead: Multi-country collaborations, particularly those involving 'AU', 'CA', and 'GB', achieve the highest average IMDb scores.
 * Top Contributors: The United States, Australia, and Great Britain frequently appear in the highest-rated country clusters.
 * Global Quality: High-scoring productions are not confined to a single geographic region.
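
Grouping on the raw `production_countries` value treats every country combination as its own group, and a group with a single title can top the ranking. A sketch of a per-country alternative with a minimum group size follows. It assumes the column stores list-like strings such as `"['US', 'GB']"`, which should be checked against the real file, and `MIN_TITLES` is a hypothetical threshold.

```python
import ast

import pandas as pd

# Toy title-level rows; assumes countries are stored as list-like strings
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "production_countries": ["['US']", "['US', 'GB']", "['GB']", "['AU']"],
    "imdb_score": [6.0, 8.0, 7.0, 9.5],
})

# Parse each string into a list, then expand to one row per country
titles["country"] = titles["production_countries"].apply(ast.literal_eval)
by_country = titles.explode("country")

MIN_TITLES = 2  # hypothetical threshold; tune to the real data
summary = by_country.groupby("country")["imdb_score"].agg(
    mean_score="mean", n_titles="count"
)
# Keep only countries with enough titles, then rank by mean score
country_scores = summary[summary["n_titles"] >= MIN_TITLES].sort_values(
    "mean_score", ascending=False
)
print(country_scores)
```

In this toy example the single-title country with the highest score is filtered out, leaving only countries with enough titles to support an average.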

#### Chart - 13 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
corr_cols = ['imdb_score', 'imdb_votes', 'runtime', 'tmdb_score']

# Compute correlation matrix
correlation_matrix = merged_df[corr_cols].corr()

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Heatmap: IMDb Score, Votes, Runtime, TMDb Score")
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is ideal for visualizing correlation among multiple numeric variables. It gives a quick overview of how strongly variables are related to one another, using color gradients

##### 2. What is/are the insight(s) found from the chart?

imdb_score and tmdb_score might be positively correlated, showing consistency in rating platforms.

imdb_votes might have low correlation with scores, revealing that popular titles aren't always the highest rated.

Runtime might show weak or no correlation with ratings.



#### Chart - 14 - Pair Plot

In [None]:
# Pair Plot visualization code
# Create pairplot with type as hue
sns.pairplot(
    merged_df[['imdb_score', 'imdb_votes', 'runtime', 'type']],
    hue='type',
    palette='husl',
    corner=True
)
plt.suptitle("Pair Plot: Scores, Votes, Runtime by Type", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

To explore relationships between multiple numeric features simultaneously and see how they differ for Movies vs TV Shows.

##### 2. What is/are the insight(s) found from the chart?

Clusters or patterns among MOVIE vs SHOW

Runtime vs IMDb score may show that longer shows are not always better rated

IMDb votes vs score can reveal high vote count doesn’t guarantee high ratings

Compare whether movies or shows tend to dominate high-scoring or long-duration zones

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Prioritize TV Shows & Family Content:
TV shows have better audience ratings, and kids/family-rated content shows higher satisfaction. Expanding in these areas can improve engagement and retention.

Fix Missing Age Ratings:
A large portion of content is marked "Not Rated", which affects parental controls and personalization. Cleaning this data will improve the user experience.

Focus on Quality, Not Runtime:
There's no strong link between runtime and ratings. Instead of producing longer content, invest in strong scripts and storytelling.

Encourage User Reviews & Ratings:
IMDb votes correlate weakly but positively with scores (0.26). Prompting users to rate content can build trust and improve discoverability.

Maintain Consistent Quality:
Content production increased after 2015, but maintaining quality is essential to avoid audience drop-off and platform fatigue.

# **Conclusion**

The exploratory data analysis on Amazon Prime Video content revealed several important trends and opportunities. The platform has a strong focus on movies, but TV shows tend to receive better and more consistent ratings, making them valuable for long-term engagement.

IMDb scores are mostly concentrated between 6 and 8, showing overall user satisfaction. However, there's room to improve the quality of both highly-rated and low-rated content. Content certified for family and younger audiences performs exceptionally well, yet it remains underrepresented.

Most content falls between 30–120 minutes, and runtime doesn't show a strong connection with ratings — highlighting that viewers value quality more than length. The spike in content releases after 2015 reflects Amazon's aggressive content expansion, but maintaining quality amid growth is crucial.

User engagement, especially through reviews and votes, plays a key role in perceived content value. Improved metadata, more consistent age ratings, and better content diversity will enhance the user experience.

In conclusion, Amazon Prime can further strengthen its position by focusing on high-quality shows, expanding family-friendly offerings, and optimizing content based on real viewer preferences.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***