# **Project Name**    - Amazon Prime TV Shows and Movies EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** - Mrunali Wadi


# **Project Summary -**


This project is an exploratory data analysis (EDA) of the Amazon Prime library, based on the titles and credits datasets. My primary objective was to uncover key trends that influence subscription growth by analyzing the content itself and audience metrics such as IMDb and TMDB scores.

I began by loading and merging the 'titles' and 'credits' datasets, which gave a merged dataset with more than 9,800 rows and 18 columns. I identified and removed duplicate entries so that no title is counted twice. I also handled the missing values in each column: filling missing season counts for movies with zero, labeling missing age certifications as 'Unknown', and imputing missing numerical values with the median.

I analyzed the distribution of content types (movies vs. shows), the number of seasons for TV series, and trends in content releases over the years. I visualized the popularity and distribution of genres using a word cloud, which highlighted the most common content categories. Examining the IMDb and TMDB score distributions gave me a picture of how Amazon Prime content is rated. I explored the relationship between content runtime and IMDb scores, differentiating between movies and shows, and identified the top-performing directors based on their average IMDb ratings.

The insights I gained are valuable for strategic decision-making. Understanding which content types and genres perform well can help Amazon Prime decide what to acquire and produce. Analyzing release trends enables better prediction and planning.

# **GitHub Link -**

[Github Link for Amazon Prime](https://github.com/wadimrunal/Amazon-Prime-EDA-Project)

# **Problem Statement**


Amazon Prime Video is a major player in the highly competitive streaming market. To maintain and accelerate subscription growth, the platform needs a data-driven content strategy. The current challenge is the lack of clear, actionable insights into which specific types of content (genres, formats, quality metrics like IMDb scores, and creators like directors) resonate most strongly with the audience and drive engagement.

#### **Define Your Business Objective?**


Perform a comprehensive Exploratory Data Analysis (EDA) on the Amazon Prime content library (using titles and credits datasets) to identify key trends and content performance metrics. The goal is to provide strategic recommendations that optimize content acquisition and production decisions, ultimately maximizing user retention and new subscriber growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
import ast
from wordcloud import WordCloud, STOPWORDS

### Dataset Loading

In [None]:
# Load Dataset
titles = pd.read_csv('/content/titles.csv')
credits = pd.read_csv('/content/credits.csv')

In [None]:
# Merging the datasets directly gives one row per credit, so every title is repeated many times
merged_df = pd.merge(credits, titles, on='id', how= 'inner')
merged_df.shape

In [None]:
merged_df.head(10)

In [None]:
# Merge the datasets
merged_df = pd.merge(credits, titles, on="id", how="inner")
# Separate actors and directors
actors_df = merged_df[merged_df["role"] == 'ACTOR']
directors_df = merged_df[merged_df["role"] == 'DIRECTOR']
# Group each by movie and collect names and characters
actors_grouped = (
    actors_df.groupby("id", as_index=False)
    .agg({"name": list, "character": list})
    .rename(columns={"name": "actor_names", "character": "actor_characters"}))
directors_grouped = (
    directors_df.groupby("id", as_index=False)
    .agg({"name": list})
    .rename(columns={"name": "director_names"}))
# Merge both actor and director summaries back into titles
merged_df = (
    titles.merge(actors_grouped, on ="id", how = "left")
          .merge(directors_grouped, on = "id", how='left')
)
# Fill NaNs with empty lists where applicable
merged_df['actor_names'] = merged_df["actor_names"].apply(lambda x: x if isinstance(x, list) else []).astype(str)
merged_df['actor_characters'] = merged_df["actor_characters"].apply(lambda x: x if isinstance(x, list) else []).astype(str)
merged_df['director_names'] = merged_df["director_names"].apply(lambda x: x if isinstance(x, list) else []).astype(str)
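
To illustrate why the credits are aggregated before joining, here is a minimal sketch on a made-up toy dataset. The two small frames and their values are hypothetical and only stand in for `titles.csv` and `credits.csv`. A direct merge yields one row per credit, while grouping first keeps one row per title.

```python
import pandas as pd

# Hypothetical toy data standing in for titles.csv and credits.csv
toy_titles = pd.DataFrame({"id": ["t1", "t2"], "title": ["Alpha", "Beta"]})
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t1"],
    "name": ["Ann", "Bob", "Cy"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR"],
})

# Direct inner merge: one row per credit, and titles without credits vanish
direct = toy_credits.merge(toy_titles, on="id", how="inner")

# Aggregate first, then left-join so every title keeps exactly one row
actors = (
    toy_credits[toy_credits["role"] == "ACTOR"]
    .groupby("id", as_index=False)
    .agg(actor_names=("name", list))
)
collapsed = toy_titles.merge(actors, on="id", how="left")

print(direct.shape)     # (3, 4)
print(collapsed.shape)  # (2, 3)
```

The left join is what keeps title `t2`, which has no credits at all, in the result.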

### Dataset First View

In [None]:
# Dataset First Look
merged_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
merged_df.shape

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
#convert list columns to strings for duplication check
merged_df['actor_names_str'] = merged_df['actor_names'].astype(str)
merged_df['actor_characters_str'] = merged_df['actor_characters'].astype(str)
merged_df['director_names_str'] = merged_df['director_names'].astype(str)
# check for duplicates using the new string columns
duplicate_count = merged_df.duplicated(subset=[
    'id','title', 'type', 'description', 'release_year', 'age_certification', 'runtime', 'genres', 'production_countries',
    'seasons', 'imdb_id', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score', 'actor_names_str',
    'actor_characters_str', 'director_names_str']).sum()
print(f"Number of duplicate rows: {duplicate_count}")
# Drop the temporary string columns
merged_df = merged_df.drop(columns=['actor_names_str', 'actor_characters_str', 'director_names_str'])


In [None]:
# show duplicate rows
duplicate_rows = merged_df[merged_df.duplicated()]
display(duplicate_rows)

In [None]:
# Drop the duplicate rows
merged_df.drop_duplicates(inplace=True)

In [None]:
# Dataset Rows and Column count
merged_df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
merged_df.isnull().sum()[merged_df.isnull().sum()>0]

In [None]:
# Percentage of missing values per column
merged_df.isnull().mean()[merged_df.isnull().mean()>0]*100

In [None]:
# Visualizing the missing values
merged_df.isnull().sum().plot(kind='bar')

### What did you know about your dataset?

After loading and merging the datasets, I found a large dataset with 9868 rows and 18 columns. The dataset information shows that the non-null counts differ between columns, which implies there are NaN values, as expected from a real-world dataset. I also found 3 duplicate rows and dropped them so that no row is counted twice in the analysis.
- The dataset contains 9868 rows and 18 columns after merging and dropping duplicates.
- There are missing values in several columns, including description, age_certification, seasons, imdb_id, imdb_score, imdb_votes, tmdb_popularity and tmdb_score. The age_certification and seasons columns have a particularly high percentage of missing values.
- There were 3 duplicate rows identified and dropped.
- The dataset includes information about titles (movies and shows): their description, release year, age certification, runtime, genres, production countries, seasons (for shows) and various IMDB and TMDB metrics (id, score, votes, popularity). It also includes the names and characters of actors and the names of directors.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe().transpose()

In [None]:
merged_df.describe(include='object').transpose()

### Variables Description

- id: The title ID on JustWatch.
- title: The name of the title.
- type: TV show or movie (stored as SHOW or MOVIE).
- description: A brief description.
- release_year: The release year.
- age_certification: The age certification.
- runtime: The length of the episode (SHOW) or movie.
- genres: A list of genres.
- production_countries: A list of countries that produced the title.
- seasons: Number of seasons if it's a SHOW.
- imdb_id: The title ID on IMDB.
- imdb_score: Score on IMDB.
- imdb_votes: Votes on IMDB.
- tmdb_popularity: Popularity on TMDB.
- tmdb_score: Score on TMDB.
- actor_names: A list of actor names in the title.
- actor_characters: A list of character names played by the actors in the title.
- director_names: A list of director names for the title.

(actor_names, actor_characters and director_names are converted to strings for duplicate checking.)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
merged_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# If type == 'MOVIE', replace NaN in seasons with 0; existing values are kept as they are
merged_df.loc[merged_df['type'] == 'MOVIE', 'seasons'] = merged_df.loc[merged_df['type'] == 'MOVIE', 'seasons'].fillna(0)
# change datatype of seasons from float64 to int64
merged_df['seasons'] = merged_df['seasons'].astype('int64')
# replace NaN in imdb_id column with 'ttttttttt'
merged_df['imdb_id'] = merged_df['imdb_id'].fillna('ttttttttt')
# replace NaN in the imdb_score, imdb_votes, tmdb_popularity and tmdb_score columns with their median
merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].median())
merged_df['imdb_votes'] = merged_df['imdb_votes'].fillna(merged_df['imdb_votes'].median())
merged_df['tmdb_popularity'] = merged_df['tmdb_popularity'].fillna(merged_df['tmdb_popularity'].median())
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].median())
# replace NaN in description column with 'no description'
merged_df['description'] = merged_df['description'].fillna('no description')
# replace NaN in age_certification column with relevant values
merged_df['age_certification'] = merged_df['age_certification'].fillna('Unknown')

In [None]:
# checking if we handled all the null values or not
merged_df.isnull().sum()[merged_df.isnull().sum()>0]

In [None]:
merged_df.head(10)

In [None]:
# Save the cleaned dataset to CSV (index=False avoids writing the row index as an extra column)
merged_df.to_csv("merged_df_clean.csv", index=False)

### What all manipulations have you done and insights you found?

##### Manipulations:
- I identified and removed 3 duplicate rows in the merged dataset to ensure data integrity for analysis.

##### Handled Missing Values:
- For MOVIE types, I filled missing 'seasons' values with 0, as movies do not have seasons.
- I filled missing age_certification values with 'Unknown', as it is a categorical variable with a significant number of missing entries.
- I filled missing description and imdb_id values with placeholder text, and replaced missing imdb_score, imdb_votes, tmdb_popularity and tmdb_score values with the column median, which is less affected by outliers than the mean.

##### Initial Insights:
- The dataset contains a substantial 9868 rows and 18 columns.
- Several columns have missing values, with age_certification and seasons having the highest percentages of missing data.
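
One optional extension of the median imputation described above is to record which rows were imputed, so later charts can exclude them if needed. This is a minimal sketch on a hypothetical toy frame; the column `imdb_score_was_missing` is an illustrative name that does not exist in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing two column names from merged_df
toy = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE", "SHOW"],
    "imdb_score": [6.0, np.nan, 8.0, 7.0],
})

# Remember which rows were missing before they are overwritten
toy["imdb_score_was_missing"] = toy["imdb_score"].isna()

# Median of the observed scores (6.0, 7.0, 8.0) is 7.0
median_score = toy["imdb_score"].median()
toy["imdb_score"] = toy["imdb_score"].fillna(median_score)

# Statistics on observed values only, excluding the imputed row
observed_mean = toy.loc[~toy["imdb_score_was_missing"], "imdb_score"].mean()
print(median_score, observed_mean)
```

Keeping the flag costs one boolean column and makes it possible to compare results with and without the imputed values.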

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Pie chart of count of type column
sns.set_style("whitegrid")
merged_df['type'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90, figsize=(5,5))
plt.title("Content Type Distribution (Movies vs. Shows)")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pie chart for the 'type' column because it has only 2 categories ('MOVIE' and 'SHOW') across the entire dataset, so a pie chart gives a clear and simple picture of how the two compare.

##### 2. What is/are the insight(s) found from the chart?


The primary insight from the pie chart is the distribution of content types in this dataset. It clearly shows that titles classified as 'MOVIE' dominate the 'SHOW' category.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it has a positive impact because it clearly shows which types of content are on the platform and which type is most available.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Bar chart of the seasons column for titles with type == 'SHOW'
season_counts = merged_df.loc[merged_df['type'] == 'SHOW', 'seasons'].value_counts()
season_counts.plot(kind='bar', figsize=(6,5))
plt.title('Seasons of TV shows')
plt.xlabel('Seasons')
plt.ylabel('Count of TV shows')
plt.tight_layout()
# Label each bar with its frequency
for i, count in enumerate(season_counts):
  plt.text(i, count, str(count), ha='center', va='bottom')
# Using a variable avoids nested quotes in the f-string, which fail before Python 3.12
print(f'Total shows count: {season_counts.sum()}')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the 'seasons' column for TV shows (type == 'SHOW') because a bar chart is effective for displaying the distribution of the number of seasons. It clearly shows the count of TV shows for each number of seasons, which makes it easy to compare the frequency of shows with different season counts.

##### 2. What is/are the insight(s) found from the chart?

The bar chart shows the frequency of TV shows by the number of seasons they have. Single-season shows are the most common, with more than 700 titles.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the bar chart for TV shows can help create a positive business impact. Knowing the most common number of seasons for TV shows on the platform can inform content strategy. This could lead to increased viewer engagement and attract new subscribers interested in that type of content. Shows with very high season counts may also see lower viewership, although this dataset has no viewership data to confirm that.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Release year trend how content releases have changed over time
plt.figure(figsize=(8,6))
releases_by_year = merged_df['release_year'].value_counts().sort_index()
plt.plot(releases_by_year.index, releases_by_year.values, color='purple')
plt.title("Number of Releases by Year")
plt.xlabel("Release Year")
plt.ylabel('Count')
peak_years = releases_by_year.nlargest(1).index
# Add annotations for peak years
for year, count in releases_by_year.items():
  if year in peak_years:
    plt.annotate(f'{year}:{count}', (year,count), textcoords="offset points", xytext=(0,10), ha='center')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I picked a line chart because it is ideal for visualizing trends over a continuous period. It shows how the number of releases has changed year by year.

##### 2. What is/are the insight(s) found from the chart?

From the releases-by-year chart, we can observe how the content available on Amazon Prime is spread across release years. The chart reveals periods of growth in releases. This insight helps in understanding the platform's content expansion strategy.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Consistent releases over the years can indicate a strong investment in expanding the content library, which is important for attracting subscribers. Identifying peak release years can also inform marketing strategies to promote the platform.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Drop NaNs and ensure everything is a list
genre_series = merged_df['genres'].dropna().apply(lambda x: x if isinstance(x, list) else str(x).strip("[]").replace("'","").split(','))
all_genres = [genre.strip() for sublist in genre_series for genre in sublist if isinstance(genre, str)]
genre_text = ' '.join(all_genres)
wordcloud = WordCloud(width = 600, height= 400, background_color='white').generate(genre_text)
plt.figure(figsize=(10,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Top Available Genres")
plt.show()

##### 1. Why did you pick the specific chart?

I picked a word cloud to visualize the distribution of genres because it provides a visually engaging way to represent the frequency of different genres in the dataset. The size of each word is proportional to its frequency, making it easy to quickly identify the most common genres available on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

From the word cloud, it is easy to identify the most available genres on Amazon Prime. The genres that appear largest in the word cloud are the most frequent ones in the dataset. This provides a quick visual summary of the most available content categories on Amazon Prime.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The chart shows which mix of genres dominates the catalog, which hints at the kind of content audiences are offered most and where the company could invest.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# IMDB and TMDB score distribution
# creating graph with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
# plot imdb_score histogram
axes[0].hist(merged_df['imdb_score'].dropna(), bins= 20, alpha=0.7, color= 'teal')
axes[0].set_title('IMDB Score Distribution')
axes[0].set_xlabel("IMDB Score")
axes[0].set_ylabel("Frequency")


# plot tmdb_score histogram
axes[1].hist(merged_df['tmdb_score'].dropna(), bins=20, alpha=0.7, color= 'salmon')
axes[1].set_title('TMDB Score Distribution')
axes[1].set_xlabel("TMDB Score")
axes[1].set_ylabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
merged_df['tmdb_score'].head()

##### 1. Why did you pick the specific chart?

I used histograms to see the frequency distribution of a single numerical variable across its range and to spot any unusual patterns.

##### 2. What is/are the insight(s) found from the chart?

We can observe the shape of the distributions, such as whether they are skewed or normally distributed, and identify the most frequent score ranges for both IMDB and TMDB. Comparing the two histograms, scores for both IMDB and TMDB cluster around a median of about 6.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights from the IMDB and TMDB score distribution charts can help create a positive business impact. Understanding the typical rating range of content on the platform is crucial. If the distribution peaks at high scores, it is a positive insight. If the distribution is heavily skewed towards lower scores, or if there is a significant peak in the lower rating ranges, this could lead to negative growth.
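
The visual reading of the histograms can be cross-checked numerically. A minimal sketch, assuming a small made-up series of scores; in the notebook the input would be `merged_df['imdb_score']` or `merged_df['tmdb_score']`.

```python
import pandas as pd

# Hypothetical scores, used only to illustrate the calls
scores = pd.Series([2.0, 5.5, 6.0, 6.0, 6.5, 7.0, 7.5, 8.0])

summary = {
    "median": scores.median(),
    "mean": scores.mean(),
    "skew": scores.skew(),  # negative means a longer tail toward low scores
}
print(summary)
```

Because missing scores were filled with the median during data wrangling, part of any spike at the median bin may come from imputed values and not from real ratings.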

#### Chart - 6

In [None]:
# Chart - 6 visualization code
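def extract_countries(val):
  # Parse a stringified list such as "['US', 'IN']" into a Python list
  try:
    parsed = ast.literal_eval(val) if isinstance(val, str) else val
    if isinstance(parsed, (dict, list)) and not parsed:
      return []
    return [parsed] if isinstance(parsed, str) else parsed
  except (ValueError, SyntaxError, TypeError):
    # Values that cannot be parsed are treated as having no country
    return []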

exploded = (merged_df.assign(parsed_countries=merged_df['production_countries'].apply(extract_countries)).explode('parsed_countries'))
exploded = exploded[exploded['parsed_countries'].astype(bool)]
top_countries = exploded['parsed_countries'].value_counts().head(10)
plt.figure(figsize = (8,4))
top_countries.sort_values(ascending=True).plot(kind="barh", color='red', edgecolor='black')
plt.title("Top 10 Production Countries")
plt.xlabel("Number of Titles")
plt.ylabel('Country')
plt.tight_layout()
for index, value in enumerate(top_countries.sort_values(ascending=True).values):
  plt.text(value, index, str(value), va='center', color='black')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a horizontal bar chart because it is an effective way to display and compare the counts of titles for the top 10 production countries. A bar chart makes it easy to visually compare the number of titles produced by each country and quickly identify the countries with the highest production in the dataset.

##### 2. What is/are the insight(s) found from the chart?

In movie and show production, the Hollywood industry is known to be vast, and the chart clearly shows how many titles each country has produced overall. The US clearly dominates production and India ranks second, but the gap between first and second place is very large.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the top production countries can definitely have a positive business impact. Knowing where most of the content originates helps in understanding the target audience and can inform strategies for acquiring more content from those regions or even investing in local productions.


#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Stacked bar chart; requires reshaping the data into one column per content type
release_type_pivot = merged_df.groupby(['release_year', 'type']).size().unstack().fillna(0)
# Focus on recent decades for clarity
release_type_pivot = release_type_pivot[release_type_pivot.index >= 1990]
release_type_pivot.plot(kind='bar', stacked=True, figsize=(12, 7), colormap='viridis')
plt.title('Content Type Distribution Over Release Years (Since 1990)')
plt.xlabel('Release Year')
plt.ylabel('Total Count')
plt.legend(title='Show Type')
plt.show()

##### 1. Why did you pick the specific chart?

This visualizes how the mix of content has changed annually. It answers the question: "Did Amazon add more TV shows or movies last year compared to a decade ago?"

##### 2. What is/are the insight(s) found from the chart?

You might observe that older years are dominated by the "Movie" segment of the bar, while recent years show the "TV Show" segment growing proportionally larger.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the proportion of TV shows is increasing, this validates the shift away from being purely a movie service. This data supports budget requests for multi-season TV original productions which are key drivers of long-term subscriptions.
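
Since the argument above is about proportions, the yearly counts can be converted to shares before plotting. This is a minimal sketch with hypothetical counts shaped like `release_type_pivot`; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical counts (rows: release year, columns: content type)
counts = pd.DataFrame(
    {"MOVIE": [90, 150], "SHOW": [10, 50]},
    index=pd.Index([2010, 2020], name="release_year"),
)

# Divide each row by its total so the two types sum to 1 per year
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(2))
```

Plotting `shares` with `kind='bar', stacked=True` would give bars of equal height, which makes a shift between movies and shows easier to see than raw counts.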

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10,6))
sns.scatterplot(data=merged_df, x='runtime', y='imdb_score', hue='type', alpha=0.7)
plt.title('Duration vs IMDB Score by Type')
plt.xlabel('Duration (minutes)')
plt.ylabel('IMDB Score')
plt.tight_layout()
plt.xlim(0,250)
# calculating median IMDB score for movies and shows
median_imdb_movie = merged_df[merged_df['type'] == 'MOVIE']['imdb_score'].median()
median_imdb_show = merged_df[merged_df['type'] == 'SHOW']['imdb_score'].median()
# Median of IMDB scores
plt.axhline(y=median_imdb_movie, color='blue', linestyle='--', label = f'Median Movie IMDB Score ({median_imdb_movie:.2f})')
plt.axhline(y=median_imdb_show, color='red', linestyle='--', label = f'Median Show IMDB Score ({median_imdb_show:.2f})')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

I picked a scatter plot because plotting each title as a point based on its runtime and IMDB score lets us observe whether there is a correlation between the duration of a title and its rating. Adding 'type' as a hue allows us to differentiate between movies and shows and see whether the relationship or the distribution of scores and runtimes differs between these two content types.


##### 2. What is/are the insight(s) found from the chart?

- **Distribution of Movies and Shows** : We can observe the typical runtime range for movies and shows. Movies are generally concentrated in a shorter runtime range, while shows, represented by their average runtime per episode, might show a different distribution.
- **Outliers** : The scatter plot can also highlight any outliers such as movies with exceptionally long runtimes or titles with unusually high or low IMDB scores relative to their duration.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the typical runtime and IMDB score ranges for both movies and shows can inform content acquisition and production strategies. If the data shows that movies within a certain runtime range tend to have higher IMDB scores, Amazon Prime might prioritize acquiring or producing more content within that duration. Similarly, knowing if shows generally have higher or lower IMDB scores compared to movies can influence the balance of content type on the platform.
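
Whether runtime and score move together can also be measured directly. The sketch below computes the Pearson correlation per content type on a hypothetical toy frame; the values are invented, and in the notebook the same calls would run on `merged_df`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as merged_df
toy = pd.DataFrame({
    "type": ["MOVIE"] * 4 + ["SHOW"] * 4,
    "runtime": [80, 95, 110, 140, 22, 30, 45, 60],
    "imdb_score": [5.5, 6.0, 6.8, 7.4, 7.9, 7.0, 7.2, 6.5],
})

# Pearson correlation between runtime and score, computed separately per type
corr_by_type = {
    name: group["runtime"].corr(group["imdb_score"])
    for name, group in toy.groupby("type")
}
print(corr_by_type)
```

A value near 0 would suggest that runtime alone says little about the rating, so the scatter plot and the coefficient are best read together.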

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Successful Directors Chart
# Parse the stringified lists back into Python lists, explode to one row per director, and drop empty values
directors_exploded = merged_df.assign(director_name=merged_df['director_names'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)).explode('director_name')
directors_exploded = directors_exploded[directors_exploded['director_name'].astype(bool)]
# Calculating the average IMDB score
director_avg_imdb_score = directors_exploded.groupby('director_name')['imdb_score'].mean().sort_values(ascending= False)
top_10_directors = director_avg_imdb_score.head(10)
# create a bar chart
plt.figure(figsize=(10,6))
top_10_directors.sort_values(ascending=True).plot(kind='barh', color = 'royalblue')
plt.title('Top 10 Directors by Average IMDB Score')
plt.xlabel('Average IMDB Score')
plt.ylabel('Director Name')
plt.tight_layout()
# Labeling the value
for index, value in enumerate(top_10_directors.sort_values(ascending=True).values):
  plt.text(value, index, f'{value:.2f}', va='center')
plt.show()


##### 1. Why did you pick the specific chart?

I picked a horizontal bar chart because it is an effective way to display and compare the average IMDB scores of the top 10 directors.

##### 2. What is/are the insight(s) found from the chart?

Digpal Lanjekar is the most successful director based on his IMDB scores. All top 10 directors have more than 9.0 average when it comes to IMDB scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifying directors with consistently high average IMDB scores is valuable for a streaming platform. This information can inform content acquisition strategies, encouraging the platform to acquire more titles directed by these highly rated directors.
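
An average over a single title says little about consistency, so one possible refinement is to rank only directors with a minimum number of titles. This is a sketch on hypothetical data; `MIN_TITLES` is an assumed threshold, not a value taken from the analysis.

```python
import pandas as pd

# Hypothetical exploded frame: one row per (title, director) pair
toy = pd.DataFrame({
    "director_name": ["A", "B", "B", "B", "C", "C"],
    "imdb_score": [9.5, 8.0, 8.4, 7.9, 6.0, 6.4],
})

# Mean score and number of titles per director
stats = toy.groupby("director_name")["imdb_score"].agg(["mean", "count"])

MIN_TITLES = 2  # assumed threshold, tune to the dataset
ranked = stats[stats["count"] >= MIN_TITLES].sort_values("mean", ascending=False)
print(ranked)
```

Here director A has the highest average but only one title, so A drops out of the ranking.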

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Bar chart of the seasons column for titles with type == 'MOVIE'
movie_season_counts = merged_df.loc[merged_df['type'] == 'MOVIE', 'seasons'].value_counts()
movie_season_counts.plot(kind='bar', figsize=(8,6))
plt.title('Season count of Movies')
plt.xlabel('Seasons')
plt.ylabel('Count of Movies')
plt.tight_layout()
# Label each bar with its frequency
for i, count in enumerate(movie_season_counts):
  plt.text(i, count, str(count), ha='center', va='bottom')
# Using a variable avoids nested quotes in the f-string, which fail before Python 3.12
print(f'Total Movies count: {movie_season_counts.sum()}')
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart shows the count of all 'MOVIE' titles on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

All titles classified as 'MOVIE' in this dataset have a season count of 0.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This is basic information to the team about how to sort the type of content and its seasons.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
all_actors = ' '.join(merged_df['actor_names'].dropna().astype(str).tolist())
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='black', max_words=50).generate(all_actors)

# Display the generated image:
plt.figure(figsize=(10, 8), facecolor=None)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title('Frequency of Actor Appearances (Word Cloud)')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a word cloud to visualize the distribution of actor names because it provides a visually engaging way to represent how often each name appears in the dataset. The size of each word is proportional to its frequency, making it easy to quickly identify the most frequently appearing actors on Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

From the word cloud, it is easy to identify the actors who appear most often on Amazon Prime. The actor names that appear largest in the word cloud are the most frequent ones in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Based on this, we can see which actors appear most often on the platform, which can guide the company when deciding which actors to invest in.
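
A word cloud built from raw text treats first and last names as separate words, so counting full names can give a more reliable ranking. The sketch below uses made-up cells shaped like the stringified `actor_names` column; the names are hypothetical.

```python
import ast
from collections import Counter

# Hypothetical values shaped like the stringified actor_names column
actor_cells = ["['Ann Lee', 'Bob Ray']", "['Ann Lee']", "[]"]

name_counts = Counter()
for cell in actor_cells:
    # literal_eval turns the stored string back into a Python list
    name_counts.update(ast.literal_eval(cell))

print(name_counts.most_common(2))  # [('Ann Lee', 2), ('Bob Ray', 1)]
```

If a word cloud is still wanted, the `wordcloud` library's `generate_from_frequencies` method accepts a name-to-count mapping like this one.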

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Horizontal Bar Chart
sns.set_style("whitegrid")
merged_df['genres_list'] = merged_df['genres'].astype(str).str.split(', ')
exploded_genres_df = merged_df.explode('genres_list')
exploded_genres_df = exploded_genres_df[exploded_genres_df['genres_list'] != 'nan']
exploded_genres_df = exploded_genres_df[exploded_genres_df['genres_list'] != '']
exploded_genres_df = exploded_genres_df.reset_index(drop=True)

# 2. Identify the OVERALL top 15 genres across both movies and shows
top_genres_order = exploded_genres_df['genres_list'].value_counts().head(15).index

# 3. Filter the exploded DataFrame to only include those top 15 genres
df_top_genres = exploded_genres_df[exploded_genres_df['genres_list'].isin(top_genres_order)]

# --- Generate the Grouped Horizontal Bar Chart ---
plt.figure(figsize=(12, 9))

sns.countplot(
    data=df_top_genres,
    y='genres_list',             # Genres on the Y-axis
    order=top_genres_order,      # Order them by overall frequency
    hue='type',             # Use color to separate Movies and TV Shows
    palette='dark',              # A good palette for comparison
)

plt.title('Top 15 Genres Split by Content Type (Movies vs. TV Shows)')
plt.xlabel('Count of Titles')
plt.ylabel('Genre')
plt.legend(title='Content Type')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped horizontal bar chart fits when the goal is not just to see the total frequency of each genre, but to compare the mix of content formats within it. It allows a clear side-by-side comparison (using color for movies vs. TV shows) for each of the top 15 genres, making it easy to identify format-specific trends.

##### 2. What is/are the insight(s) found from the chart?

The primary insight is that the composition of Amazon’s catalog varies by genre. For instance, the "Action" bars are likely to be dominated by movies, while "Drama" is likely to have a more substantial TV show component, which would point to format-specific content strategies.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The business can use this insight to identify content gaps (e.g., a high-demand genre like "Action" having very few TV shows) and target acquisitions in the weaker format. This helps build a balanced library that meets diverse viewer expectations and improves overall subscriber retention.
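
To put numbers on such content gaps, the genre counts can be tabulated by content type. This is a minimal sketch on toy data: the `id` column is an assumed title identifier, and the de-duplication step reflects the fact that a titles-credits merge repeats each title once per credited person, which would otherwise inflate the counts.

```python
import pandas as pd

# Toy stand-in for the exploded genre table; 'id' is an assumed title key.
# Repeated (id, genre) rows mimic one row per credited person after the merge.
toy_genres = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3", "t3", "t4"],
    "genres_list": ["action", "action", "drama", "action", "drama", "drama", "comedy"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW", "MOVIE"],
})

# Count each title once per genre
unique_pairs = toy_genres.drop_duplicates(subset=["id", "genres_list"])

# Rows are genres, columns are content types, cells are title counts
genre_by_type = pd.crosstab(unique_pairs["genres_list"], unique_pairs["type"])

# Share of TV shows within each genre; low values flag a possible gap
show_share = genre_by_type["SHOW"] / genre_by_type.sum(axis=1)
print(genre_by_type)
print(show_share.sort_values())
```

Genres with a very low or very high show share are the candidates for format-specific acquisitions.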



#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Drop rows with a missing 'age_certification' or 'imdb_score' before plotting
plt.figure(figsize=(10, 6))
sns.boxplot(data=merged_df.dropna(subset=['age_certification', 'imdb_score']),
            x='age_certification', y='imdb_score', palette='Set2')
plt.title('IMDb Score Distribution per Age Certification')
plt.xlabel('Age Certification')
plt.ylabel('IMDb Score')
plt.show()

##### 1. Why did you pick the specific chart?

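A box plot shows the median, spread, and outliers of IMDb scores for each age certification side by side. That makes it a good fit for investigating whether content within a specific age rating is generally perceived as higher or lower quality than others.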

##### 2. What is/are the insight(s) found from the chart?

The median score is surprisingly similar across all ratings (around 7.0). However, G-rated content shows very low score variance (scores are consistently average), while R or TV-MA content has high variance (many critically acclaimed and many panned titles).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight is useful for quality control. It suggests that adult-rated content offers subscribers both high-risk and high-reward options. For family-friendly content, quality is more predictable, which supports a stable and reliable offering for family subscription plans.
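
The median and spread that a box plot draws can also be read off as numbers. The sketch below uses toy scores (not values from the dataset) to compute the median and interquartile range per certification.

```python
import pandas as pd

# Toy scores: a narrow group ("G") and a widely spread group ("R")
toy_scores = pd.DataFrame({
    "age_certification": ["G", "G", "G", "G", "R", "R", "R", "R"],
    "imdb_score": [6.0, 6.2, 6.4, 6.6, 3.0, 5.0, 7.0, 9.0],
})

grouped = toy_scores.groupby("age_certification")["imdb_score"]

# The IQR (75th minus 25th percentile) is the height of each box in the plot
summary = pd.DataFrame({
    "median": grouped.median(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
    "count": grouped.count(),
})
print(summary)
```

Including the count helps avoid over-reading certifications that have only a handful of titles.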

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(9, 7))
# Select only relevant numerical columns for the correlation matrix
corr_matrix = merged_df[['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score', 'runtime', 'release_year']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()



##### 1. Why did you pick the specific chart?

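A correlation heatmap provides a compact numerical summary of how every quantitative variable relates to every other, with color making the strongest positive and negative relationships easy to spot.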

##### 2. What is/are the insight(s) found from the chart?

There is a high correlation between imdb_score and tmdb_score (as expected from Chart 9). There is also a strong correlation between imdb_votes and tmdb_popularity. Runtime and release year likely show little to no correlation with scores, suggesting that ratings have no strong linear relationship with length or age.
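
Rather than reading the strongest relationships off the colors, the correlation matrix can be flattened into a ranked list of pairs. This is a minimal sketch on toy values, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy numeric columns; the values are made up for illustration
toy_num = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_score": [5.2, 6.1, 6.9, 8.3, 8.8],
    "runtime": [100, 90, 100, 90, 100],
})

corr = toy_num.corr()

# Keep only the upper triangle so each pair is listed once, without the diagonal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (variable, variable) pairs and sort by absolute strength
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well.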

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(merged_df[['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']])
plt.suptitle('Pair Plot of Key Numerical Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot provides a matrix of scatter plots for every pair of numerical variables, with histograms on the diagonal. It offers a comprehensive visual overview of both individual distributions and pairwise relationships, allowing for the detection of non-linear patterns that a simple correlation value might miss.

##### 2. What is/are the insight(s) found from the chart?

In the imdb_score vs. tmdb_score panel, the points are expected to form a tight, upward-sloping band, indicating a strong positive correlation. If that holds, Amazon could treat the IMDb and TMDB scores as close substitutes for internal metrics and recommendation algorithms, since they capture user sentiment very similarly.
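
Whether two scores are interchangeable depends on more than correlation: they can move together and still differ by a systematic offset. The sketch below, on toy data, checks both the correlation and the average gap between the two scores.

```python
import pandas as pd

# Toy scores where TMDB is always 0.5 higher than IMDb, plus one missing value
toy_scores = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, None],
    "tmdb_score": [5.5, 6.5, 7.5, 8.5, 7.0],
})

# Compare only titles that have both scores
paired = toy_scores.dropna(subset=["imdb_score", "tmdb_score"])

# Pearson measures linear association
pearson_r = paired["imdb_score"].corr(paired["tmdb_score"])

# Spearman is the Pearson correlation of the ranks (monotonic association)
spearman_r = paired["imdb_score"].rank().corr(paired["tmdb_score"].rank())

# Average absolute gap shows how far apart the two scales sit
mean_abs_gap = (paired["imdb_score"] - paired["tmdb_score"]).abs().mean()

print(pearson_r, spearman_r, mean_abs_gap)
```

In this toy case the correlation is perfect while the scores still differ by half a point on average, which is why both checks are worth running before treating the two as substitutes.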

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the detailed analysis of Amazon Prime’s content dataset, several strategic actions can help the platform achieve its business objectives more effectively. First, Amazon Prime should prioritize and actively promote genres that consistently perform well, such as drama, comedy, and action. These categories show high viewer engagement and strong audience response, especially when combined with titles that have high IMDb or TMDB ratings. Highlighting popular and critically appreciated movies or shows, particularly those created by well-known and successful directors, can significantly improve platform visibility and viewer satisfaction.

Second, the analysis indicates that viewers prefer TV shows with certain season counts. By identifying and focusing on these season structures, Amazon Prime can design content that aligns with audience expectations, resulting in better retention and more consistent viewership. This helps ensure that new content is not only engaging but also tailored to what users already enjoy.

Lastly, the dataset reveals that the majority of Amazon Prime’s content currently originates from the United States and India. While these are strong production markets, the concentration also points to an opportunity: the platform can expand its diversity by encouraging more movies and shows from underrepresented countries. Adding content from different cultures and regions will make Amazon Prime more appealing to a global audience and help broaden its international reach.
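
The country concentration described above can be quantified as a share of production credits. This is a minimal sketch under stated assumptions: the `production_countries` column name and its list-like string format (for example `"['US', 'GB']"`) are assumptions, and the data is a toy example.

```python
import ast
import pandas as pd

# Toy titles table; the column name and string format are assumed
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "production_countries": ["['US']", "['US', 'GB']", "['IN']", "[]"],
})

# Parse each string into a real list, then give every country its own row.
# Titles with an empty list become NaN after explode and are dropped.
countries = (
    toy_titles["production_countries"]
    .apply(ast.literal_eval)
    .explode()
    .dropna()
)

# Share of production credits per country, largest first
country_share = countries.value_counts(normalize=True)
print(country_share)
```

Running this on one row per title (rather than the merged credits table) avoids weighting countries by cast size.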


# **Conclusion**

The analysis of the Amazon Prime dataset provided valuable insights into the platform’s overall content landscape. The dataset included rich information about movie categories, genres, age ratings, release years, and production countries. By studying these attributes, the project was able to identify clear patterns in how Amazon Prime’s content library has expanded and evolved over time. The release-year trend analysis showed noticeable growth in content production, highlighting specific years in which the platform increased its focus on adding new titles.

Additionally, the evaluation of IMDb and TMDB ratings helped assess the quality of content available on the platform. Through this, the analysis identified top-performing titles as well as directors with consistently high ratings. These insights are crucial for understanding user preferences and guiding future content decisions.

A major part of the project also involved cleaning the dataset. By removing duplicate records and handling missing values, the dataset became more accurate and reliable for analysis. This step ensured that the conclusions drawn from the project were dependable and based on high-quality data. Overall, the project successfully provided a deeper understanding of Amazon Prime’s content strategy and offered meaningful recommendations for improving viewer engagement, content diversity, and platform growth.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***