# **Project Name**    - ***Amazon Prime TV Shows and Movies - EDA***



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 - Sharath Yelle**

# **Project Summary -**

This project involves exploratory data analysis of Amazon Prime’s catalog of TV shows and movies to uncover key trends and insights. Using structured data, the project examines various aspects such as content type distribution (Movies vs TV Shows), popular genres, IMDb ratings, release year trends, and country-wise content availability. Visualizations and data summaries help identify which genres are most common, how content ratings are distributed, how Prime’s content library has evolved over time, and which regions contribute the most content. The findings provide a deeper understanding of Amazon Prime's content strategy and offer data-driven recommendations to enhance viewer engagement and platform performance.



# **GitHub Link -**

[GitHub Repo Link](https://github.com/sharath1102/Amazon-Prime-TV-Shows-and-Movies---EDA)

# **Problem Statement**


With increasing competition in the OTT (Over-The-Top) streaming industry, it is essential for platforms like Amazon Prime to understand viewer preferences, content performance, and distribution trends. This project analyzes Amazon Prime's catalog of TV shows and movies to uncover insights about genre popularity, content type distribution, ratings, release trends, and geographic representation. The goal is to identify patterns that can guide data-driven decisions for content strategy and user engagement.

#### **Define Your Business Objective?**

The objective is to analyze Amazon Prime’s TV shows and movies dataset to uncover meaningful insights about content distribution, genre popularity, ratings, and temporal trends. These insights should help stakeholders understand viewer preferences and content performance, enabling data-driven decisions for improving content curation, user engagement, and competitive positioning in the OTT streaming market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions in this format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load the credits and titles tables from the zipped CSV files
credits = pd.read_csv("/content/credits.csv.zip")
titles = pd.read_csv("/content/titles.csv.zip")

### Dataset First View

In [None]:
credits.head()

In [None]:
credits.id.nunique()

In [None]:
titles.head()

In [None]:
# Merge the two datasets on the common column `id`
merged_df = pd.merge(titles, credits, on='id')
merged_df.head()
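
`pd.merge` performs an inner join by default, so a title with no matching row in `credits` does not appear in `merged_df`. The following is a minimal sketch, on hypothetical toy tables (`titles_demo` and `credits_demo` are made up for illustration, not the real files), of how `how='left'`, `indicator=True`, and `validate` could be used to see which titles an inner join would drop.

```python
import pandas as pd

# Toy stand-ins for the real `titles` and `credits` tables (hypothetical rows).
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["Alpha", "Beta", "Gamma"],
})
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Ann", "Bob", "Cy"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# how="left" keeps titles that have no credits, indicator=True adds a `_merge`
# column, and validate="one_to_many" raises if `id` repeats in titles_demo.
checked = pd.merge(
    titles_demo, credits_demo, on="id",
    how="left", indicator=True, validate="one_to_many",
)

# Titles that an inner join would have dropped silently.
titles_without_credits = checked.loc[checked["_merge"] == "left_only", "id"].tolist()
print(titles_without_credits)  # ['t3']
print(len(checked))            # 4
```

Whether dropping credit-less titles matters depends on the question: it is harmless for cast and crew analysis but can undercount the catalog in title-level charts.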


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(credits.shape)
print(titles.shape)
print(merged_df.shape)

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
merged_df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
merged_df.isnull().sum()

In [None]:
# Visualizing the missing values
merged_df.isnull().sum().plot(kind='bar')

### What did you know about your dataset?

Based on the initial inspection of the merged dataset, here's what I know:

The dataset contains **124,347 rows and 19 columns**.\
It includes information about Amazon Prime Video titles (movies and TV shows) and the cast and crew associated with them.\
The columns include details such as title ID, title, type (movie or show), description, release year, age certification, runtime, genres, production countries, seasons (for TV shows), IMDb ID, IMDb score, IMDb votes, TMDB popularity, TMDB score, person ID, name, character, and role.

There are **168 duplicate rows** in the dataset.\
Several columns have missing values, notably `age_certification`, `seasons`, `imdb_id`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, `tmdb_score`, and `character`.\
**`seasons`** has the highest number of missing values, which is expected because it is only relevant for TV shows.\
`age_certification` also has a significant number of missing values.

This initial look suggests that some data cleaning will be necessary, particularly handling missing values and duplicate rows, before proceeding with the exploratory data analysis.
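
Raw null counts are easier to judge as a share of rows. The sketch below uses a hypothetical four-row frame (`demo`, invented for illustration) to show a count-and-percent summary, plus a check of whether the missing `seasons` values sit only in movie rows, which is the expectation stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would use `merged_df` instead.
demo = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "seasons": [np.nan, np.nan, 2.0, np.nan],
    "imdb_score": [6.1, np.nan, 7.4, 5.0],
})

# Count and percentage of missing values per column, worst column first.
missing_summary = pd.DataFrame({
    "missing": demo.isnull().sum(),
    "percent": (demo.isnull().mean() * 100).round(1),
}).sort_values("percent", ascending=False)
print(missing_summary)

# Share of missing `seasons` values within each content type.
seasons_missing_by_type = demo["seasons"].isnull().groupby(demo["type"]).mean()
print(seasons_missing_by_type)
```

If the share for shows were well above zero on the real data, that would point to a genuine data gap rather than the structural "movies have no seasons" case.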

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
merged_df.columns

In [None]:
# Dataset Describe
merged_df.describe()

### Variables Description



1. id: Unique identifier for each title (object).
2. title: Title of the movie or TV show (object).
3. type: Type of content (MOVIE or SHOW) (object).
4. description: A brief description of the title (object).
5. release_year: The year the title was released (int64).
6. age_certification: Age rating for the title (object).
7. runtime: Duration of the movie in minutes or average episode runtime for TV shows (int64).
8. genres: List of genres associated with the title (object).
9. production_countries: List of countries where the title was produced (object).
10. seasons: Number of seasons for TV shows (float64). This is null for movies.
11. imdb_id: IMDb identifier for the title (object).
12. imdb_score: IMDb rating of the title (float64).
13. imdb_votes: Number of IMDb votes for the title (float64).
14. tmdb_popularity: Popularity score from The Movie Database (float64).
15. tmdb_score: TMDB rating of the title (float64).
16. person_id: Unique identifier for each person (actor or director) (int64).
17. name: Name of the person (object).
18. character: The character played by the actor (object). This is null for directors.
19. role: Role of the person (ACTOR or DIRECTOR) (object).






### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
merged_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Handling Duplicate Values
# Handling Missing Values
# Convert Data Types
# Feature Engineering

## Handle duplicate values

### Subtask:
Remove duplicate rows from the merged DataFrame.


**Reasoning**:
Exact duplicate rows are removed with `.drop_duplicates()`. The removal is then verified by checking that `.duplicated().sum()` returns 0.
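
Before dropping, it can be useful to look at the duplicated rows themselves. The following sketch uses a hypothetical toy frame (`demo`, with invented values) to show `duplicated(keep=False)`, which flags every member of a duplicate group, and a before/after row count.

```python
import pandas as pd

# Hypothetical credit rows; the first two are exact duplicates.
demo = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2"],
    "person_id": [10, 10, 11, 12],
    "role": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

# keep=False marks all copies, not just the later ones, so they can be inspected.
dupe_groups = demo[demo.duplicated(keep=False)]
print(dupe_groups)

# Drop exact duplicates and report how many rows were removed.
rows_before = len(demo)
deduped = demo.drop_duplicates(ignore_index=True)
rows_removed = rows_before - len(deduped)
print(rows_removed)  # 1
```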



In [None]:
merged_df.drop_duplicates(inplace=True)
print(merged_df.duplicated().sum())

## Handle missing values

### Subtask:
Address the missing values in the relevant columns. This may involve imputation, dropping rows/columns, or other strategies based on the nature of the missing data and the planned analysis.


**Reasoning**:
Filling missing values with `.fillna(..., inplace=True)` on a selected column produced a `FutureWarning`, so each column is filled by assigning the result of `.fillna()` back to that column. All fill operations are combined in one code block, and the strategy for each column is described below.




1. **IMDb and TMDB numeric columns:** for scores and votes, the mean, the median, or 0 are all possible fill values.
2. **`imdb_votes` and `tmdb_popularity`:** a value of 0 plausibly represents unknown or unrated content, so these columns are filled with 0.
3. **`imdb_score` and `tmdb_score`:** these are filled with their respective column means to avoid biasing the data toward lower scores. Filling with a fixed value (such as 0) or leaving NaN and handling those cases during analysis would also be valid.
4. **`imdb_id`:** this is an identifier of type object, so no replacement value is a valid identifier. Leaving it as NaN is generally preferred in that situation; the code below instead fills it with the placeholder 'Unknown' for consistency with the other text columns.




In [None]:
# Fill missing values in 'description'
merged_df['description'] = merged_df['description'].fillna('No description available')

# Fill missing values in 'age_certification'
merged_df['age_certification'] = merged_df['age_certification'].fillna('Not Rated')

# Fill missing values in 'seasons'
merged_df['seasons'] = merged_df['seasons'].fillna(0)

# Fill missing values in 'imdb_score'
merged_df['imdb_score'] = merged_df['imdb_score'].fillna(merged_df['imdb_score'].mean())

# Fill missing values in 'imdb_votes'
merged_df['imdb_votes'] = merged_df['imdb_votes'].fillna(0)  # Filling votes with 0 makes sense for unvoted content

# Fill missing values in 'tmdb_popularity'
merged_df['tmdb_popularity'] = merged_df['tmdb_popularity'].fillna(0)   # Filling popularity with 0 makes sense for content with no popularity data

# Fill missing values in 'tmdb_score'
merged_df['tmdb_score'] = merged_df['tmdb_score'].fillna(merged_df['tmdb_score'].mean())

# Fill missing values in 'imdb_id'
# For 'imdb_id', leaving as NaN or filling with a placeholder like 'Unknown'
# Let's fill with 'Unknown' for consistency, although NaN is also acceptable.
merged_df['imdb_id'] = merged_df['imdb_id'].fillna('Unknown')

# Fill missing values in 'character'
# The 'character' column has missing values, which are likely for rows where the 'role' is 'DIRECTOR'.
# We can fill these missing 'character' values with 'N/A' or 'Not Applicable' as directors don't play characters.
merged_df['character'] = merged_df['character'].fillna('N/A')

# Verify that missing values have been handled
print("Missing values after handling:")
print(merged_df.isnull().sum())

In [None]:
# Double check the number of unique values after dropping duplicates and filling missing values
print("\nUnique values after data wrangling:")
print(merged_df.nunique())

# Display info and head to see the data types and structure after wrangling
print("\nInfo after data wrangling:")
merged_df.info()

In [None]:
# Display data types to check if conversion is needed
print("\nData types before conversion:")
print(merged_df.dtypes)

# **Convert data types**
**Subtask:**
*Convert data types of columns if necessary for analysis (e.g., converting 'genres' and 'production_countries' from string representations of lists to actual lists).*

- `release_year` is already int64, which is suitable.
- `runtime` is already int64, which is suitable.
- `seasons` was float64 due to NaNs; it is now filled with 0 but is still float64. It can be converted to int.
- `imdb_score` is float64, suitable for scores.
- `imdb_votes` was float64 due to NaNs; it is now filled with 0 but is still float64. Votes are whole numbers, so it can be converted to int.
- `tmdb_popularity` was float64 due to NaNs; it is now filled with 0 but is still float64. This is a score and can remain float.
- `tmdb_score` is float64, suitable for scores.
- `person_id` is int64, suitable.
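
Filling `seasons` with 0 makes the integer conversion possible, but it also makes movies look like shows with zero seasons in any average. As an alternative sketch (not what this notebook does), pandas' nullable `Int64` dtype stores integers while keeping missing values as `<NA>`. The `demo` frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one movie (no seasons) and two shows.
demo = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "SHOW"],
    "seasons": [np.nan, 3.0, 1.0],
})

# Approach used in this notebook: 0 as a placeholder, then a plain int column.
filled = demo["seasons"].fillna(0).astype(int)

# Alternative: nullable integer dtype, missing values stay missing.
nullable = demo["seasons"].astype("Int64")

print(filled.mean())    # the placeholder 0 lowers the average
print(nullable.mean())  # 2.0, computed over the two shows only
```

Either choice works as long as later aggregations on `seasons` filter to `type == 'SHOW'` when the placeholder approach is used.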

In [None]:
import ast

# Convert 'seasons' to integer type
# Missing values were already filled with 0; errors='coerce' followed by fillna(0)
# only guards against stray non-numeric entries before the cast
merged_df['seasons'] = pd.to_numeric(merged_df['seasons'], errors='coerce').fillna(0).astype(int)

# Convert 'imdb_votes' to integer type
merged_df['imdb_votes'] = pd.to_numeric(merged_df['imdb_votes'], errors='coerce').fillna(0).astype(int)



In [None]:
# 'type' and 'role' are categorical. Let's check their unique values to see if they can be converted to 'category' dtype to save memory.
print("\nUnique values for 'type':", merged_df['type'].unique())
print("Unique values for 'role':", merged_df['role'].unique())

In [None]:
# Convert categorical columns to 'category' dtype
merged_df['type'] = merged_df['type'].astype('category')
merged_df['role'] = merged_df['role'].astype('category')

# Convert string representations of lists to actual lists for 'genres' and 'production_countries'
merged_df['genres'] = merged_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
merged_df['production_countries'] = merged_df['production_countries'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
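
`ast.literal_eval` raises an exception on a malformed string, which would stop the whole `apply`. The following is a hedged sketch of a more defensive parser; `parse_list` is a hypothetical helper name and the sample strings are invented.

```python
import ast
import pandas as pd

def parse_list(value):
    """Return a Python list parsed from `value`, or an empty list on failure."""
    if isinstance(value, list):
        return value          # already parsed
    if not isinstance(value, str):
        return []             # None, NaN, or another non-string value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []             # string is not a valid Python literal
    return parsed if isinstance(parsed, list) else []

# Hypothetical raw column mixing valid, empty, malformed, and missing entries.
raw = pd.Series(["['comedy', 'drama']", "[]", "not a list", None, "['US']"])
parsed = raw.apply(parse_list)
print(parsed.tolist())
# [['comedy', 'drama'], [], [], [], ['US']]
```

Returning an empty list for bad input keeps the column uniformly list-typed, which is what `explode` and the list comprehensions used in the charts expect.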

In [None]:

print("\nUnique values for 'age_certification':", merged_df['age_certification'].unique())
merged_df['age_certification'] = merged_df['age_certification'].astype('category')

In [None]:
# 'genres' and 'production_countries' now hold Python lists (converted above with
# ast.literal_eval), so pandas still reports their dtype as object.
# 'description', 'title', 'imdb_id', 'character', 'name' are string/object types, which are appropriate.

# Display data types after conversion
print("\nData types after conversion:")
print(merged_df.dtypes)

# **Feature engineering**
**Subtask:**
Create new features if needed for analysis (e.g., extracting the primary genre or country).

 *Define a function to extract the first element from a list and apply it to the 'genres' and 'production_countries' columns to create new 'primary_genre' and 'primary_country' columns. Then, display the head of the DataFrame to show the new columns.*

In [None]:
# Define a function to extract the first element from a list
def get_first_element(list_column):
  """
  Extracts the first element from a list of strings.

  Args:
    list_column: A list of strings or a non-list value.

  Returns:
    The first element of the list if it's a non-empty list,
    otherwise returns 'Unknown'.
  """
  if isinstance(list_column, list) and len(list_column) > 0:
    return list_column[0]
  else:
    return 'Unknown'

# Apply the function to create 'primary_genre' column
merged_df['primary_genre'] = merged_df['genres'].apply(get_first_element)

# Apply the function to create 'primary_country' column
merged_df['primary_country'] = merged_df['production_countries'].apply(get_first_element)
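
The same first-element extraction can also be sketched with the `.str` accessor, which indexes into list values as well as strings and yields NaN for empty lists. The `demo` frame is hypothetical.

```python
import pandas as pd

# Hypothetical parsed genre lists, including one empty list.
demo = pd.DataFrame({
    "genres": [["comedy", "drama"], [], ["documentation"]],
})

# .str[0] takes the first element of each list; empty lists give NaN,
# which is then replaced with the same 'Unknown' placeholder.
demo["primary_genre"] = demo["genres"].str[0].fillna("Unknown")
print(demo["primary_genre"].tolist())
# ['comedy', 'Unknown', 'documentation']
```

Either version treats the first listed genre or country as the primary one. That is an assumption about how the source lists are ordered, so `primary_genre` is best used for high-level grouping only.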

In [None]:
# Display the head of the dataframe to see the new 'primary_genre' and 'primary_country' columns
print("\nDataFrame head after feature engineering:")
merged_df.head()

### What all manipulations have you done and insights you found?

The following summarizes the data manipulations performed and the initial insights gained during the "Know Your Data" and "Data Wrangling" phases:

**Data Manipulations Performed:**

1.  **Dataset Loading:** Loaded two CSV files, `credits.csv.zip` and `titles.csv.zip`, into pandas DataFrames named `credits` and `titles` respectively.
2.  **Dataset Merging:** Merged the `titles` and `credits` DataFrames into a single DataFrame named `merged_df` using the common column `id`.
3.  **Duplicate Handling:** Identified and removed 168 duplicate rows from the `merged_df` using the `drop_duplicates()` method with `inplace=True`.
4.  **Missing Value Handling:** Addressed missing values in several columns:
    *   `description`: Filled missing values with the string 'No description available'.
    *   `age_certification`: Filled missing values with the string 'Not Rated'.
    *   `seasons`: Filled missing values with `0`.
    *   `imdb_score`: Filled missing values with the mean of the existing `imdb_score` values.
    *   `imdb_votes`: Filled missing values with `0`.
    *   `tmdb_popularity`: Filled missing values with `0`.
    *   `tmdb_score`: Filled missing values with the mean of the existing `tmdb_score` values.
    *   `imdb_id`: Filled missing values with the string 'Unknown'.
    *   `character`: Filled missing values with the string 'N/A'.
5.  **Data Type Conversion:**
    *   Converted the `seasons` column from float64 to integer using `pd.to_numeric` and `astype(int)`.
    *   Converted the `imdb_votes` column from float64 to integer using `pd.to_numeric` and `astype(int)`.
    *   Converted the `type` and `role` columns to the 'category' data type.
    *   Converted the string representation of lists in the `genres` and `production_countries` columns into actual Python lists using `ast.literal_eval`.
    *   Converted the `age_certification` column to the 'category' data type.
6.  **Feature Engineering:**
    *   Created a new column `primary_genre` by extracting the first genre from the `genres` list for each row. If the list was empty or the data was not a list, it was filled with 'Unknown'.
    *   Created a new column `primary_country` by extracting the first production country from the `production_countries` list for each row. If the list was empty or the data was not a list, it was filled with 'Unknown'.

**Initial Insights Found:**

1.  **Dataset Size and Structure:** The merged dataset is large, with over 124,000 rows. Each row pairs one title with one credited person, so this figure counts credits rather than distinct titles. The 19 columns provide detailed information across various aspects of the content.
2.  **Duplicate Data:** The presence of 168 duplicate rows highlighted the need for cleaning before analysis to ensure accurate counts and statistics.
3.  **Missing Data Significance:** Several columns had a significant number of missing values, particularly `seasons`, `age_certification`, and rating/popularity scores. This indicated that not all titles had complete metadata, and appropriate handling (like filling with 0, mean, or 'Unknown') was necessary to proceed with analysis without losing too much data. The high number of missing values in `seasons` was expected for movies.
4.  **Variable Types:** The variables included a mix of object (string), integer, and float types. The need for data type conversion for columns like `seasons` and `imdb_votes` was identified to ensure they were in appropriate numeric formats for analysis. Converting categorical columns to the 'category' dtype was a good step for memory efficiency.
5.  **List Representation:** The `genres` and `production_countries` columns were stored as string representations of lists, requiring conversion to actual lists for easier processing and extraction of individual genres or countries.
6.  **Data Ready for Analysis:** After handling duplicates, missing values, and converting data types, the `merged_df` is now in a more suitable format for exploratory data analysis, including the newly engineered `primary_genre` and `primary_country` features which will be useful for high-level categorical analysis.

These initial manipulations and insights form the foundation for the subsequent exploratory data analysis (EDA) steps outlined in the project structure. The next phases will likely involve analyzing the distributions of these variables, looking for relationships between them, and creating visualizations to uncover deeper patterns in the Amazon Prime catalog.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
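# The merged table has one row per (title, person) credit, so keep one row per
# title before counting; otherwise titles with large casts are over-counted
unique_titles = merged_df.drop_duplicates(subset='id')

# Flatten the list of genres and count occurrences (value_counts sorts descending)
all_genres = [genre for genres_list in unique_titles['genres'] for genre in genres_list]
genre_counts = pd.Series(all_genres).value_counts()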

# Consider the top N genres for clarity in visualization
top_n_genres = genre_counts.head(20) # Adjust N as needed

# Chart visualization code
plt.figure(figsize=(12, 8))
sns.barplot(x=top_n_genres.index, y=top_n_genres.values, palette='viridis')
plt.xticks(rotation=90)
plt.xlabel("Genre")
plt.ylabel("Number of Titles")
plt.title("Top Genres by Number of Titles")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is ideal for displaying the frequency distribution of categorical data. In this case, it effectively shows the count of titles for each genre, making it easy to compare the popularity of different genres at a glance.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the most frequent genres in the Amazon Prime catalog based on the dataset. Genres like 'Comedy', 'Action', 'Thriller' and 'Documentation' appear to be the most common, indicating a stronger presence of these content types.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight can have a positive business impact. Understanding the most prevalent genres helps Amazon Prime identify their current content strengths. This information can inform content acquisition strategies (e.g., investing more in popular genres or diversifying into less represented ones), marketing campaigns (targeting audiences interested in the dominant genres), and potentially content production decisions.

---------------------------------------------------------
While the chart itself shows distribution, a potential *indirect* negative insight could be if the platform is heavily saturated with certain genres that are not performing well in terms of viewership or engagement despite high content volume. However, based solely on the frequency chart, we cannot confirm negative growth. The chart just shows volume, not performance. Over-reliance on saturated genres without audience demand could lead to diminishing returns on content investment, potentially impacting subscriber growth or retention if viewers seek more variety or higher-quality content in other areas. Further analysis combining genre frequency with performance metrics (like IMDb score, TMDB popularity, or internal viewership data) would be needed to identify such a negative trend.
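
The cross-check suggested above, genre volume against a performance metric, can be sketched as follows. The `titles_demo` frame and its scores are invented for illustration and say nothing about the real catalog.

```python
import pandas as pd

# Hypothetical title-level rows (one row per title, genres already parsed).
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "genres": [["comedy"], ["comedy", "drama"], ["drama"], ["comedy"]],
    "imdb_score": [5.0, 6.0, 8.0, 7.0],
})

# One row per (title, genre) pair, then volume and mean rating per genre.
genre_stats = (
    titles_demo.explode("genres")
    .groupby("genres")
    .agg(n_titles=("id", "nunique"), mean_imdb=("imdb_score", "mean"))
    .sort_values("n_titles", ascending=False)
)
print(genre_stats)
```

A genre with a high `n_titles` and a low `mean_imdb` would be a candidate for the "saturated but underperforming" case. On the real data, rows whose score was imputed should be excluded first so the means are not pulled toward the overall average.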

#### Chart - 2

In [None]:
# Heatmap of genres by country (count of each genre within each primary country)

# Work at title level: the merged table repeats each title once per credited person
title_level = merged_df.drop_duplicates(subset='id')

# Explode only 'genres'. The grouping key is 'primary_country', so also exploding
# 'production_countries' would multiply the counts of multi-country titles
exploded_df = title_level.explode('genres')

# Count titles for each (primary_country, genre) pair and pivot genres into columns
genre_country_counts = exploded_df.groupby(['primary_country', 'genres']).size().unstack(fill_value=0)

# Consider top N countries and genres for a readable heatmap
top_countries = title_level['primary_country'].value_counts().head(15).index # Top 15 countries
top_genres = genre_counts.head(15).index # Top 15 genres from previous analysis

# Keep only the top countries and genres (reindex fills any absent pair with 0)
heatmap_data = genre_country_counts.reindex(index=top_countries, columns=top_genres, fill_value=0)

# Chart visualization code
plt.figure(figsize=(15, 10))
sns.heatmap(heatmap_data, annot=False, cmap="viridis")
plt.xlabel("Genre")
plt.ylabel("Country")
plt.title("Heatmap of Genre Distribution by Top 15 Countries")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is an excellent visualization to show the relationship between two categorical variables and a quantitative variable (count). In this case, it effectively displays the concentration of different genres within the top production countries, allowing for easy identification of which countries produce the most content in specific genres.

##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals which genres are most prevalent in the top content-producing countries. For example, it's likely that the 'United States' will have high counts across many genres, reflecting its large production volume. Other countries might show specialization in certain genres (e.g., a particular country being dominant in 'Documentation' or 'Comedy'). It highlights the global distribution and focus of content production by genre.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight is highly valuable for a streaming service like Amazon Prime. It helps understand the source countries for popular genres. This can inform content licensing and acquisition strategies, identifying which countries are major suppliers of content in high-demand genres. It can also guide regional content strategies: understanding which genres are dominant in a specific country can help tailor content promotion and acquisition for that region. For instance, if a certain country predominantly produces 'Drama', Prime could focus on acquiring drama titles from that region for a global or specific regional audience.

---------------------------------------------------------
An indirect negative insight could arise if a country that is a major source of content for Amazon Prime is producing a high volume of content in genres that are not performing well on the platform (low viewership, engagement, etc.). This would indicate a misalignment between content supply and viewer demand from that specific source. Additionally, if the heatmap shows a lack of content diversity from key production countries, it might indicate a potential blind spot in content acquisition, potentially limiting viewer options and impacting retention for those seeking variety. However, the heatmap itself only shows volume, not performance. To identify negative growth trends, this genre-country distribution data would need to be cross-referenced with performance metrics.
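
Raw counts make large producers dominate every cell, which can hide the genre specialization discussed above. As a sketch on hypothetical data (the `titles_demo` frame is invented), `pd.crosstab` builds the same country-by-genre table, and `normalize='index'` turns each country's row into shares that sum to 1.

```python
import pandas as pd

# Hypothetical title-level rows with parsed genre lists.
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "primary_country": ["US", "US", "IN"],
    "genres": [["comedy", "drama"], ["comedy"], ["drama"]],
})

# One row per (title, genre) pair.
exploded = titles_demo.explode("genres")

# Absolute counts per country and genre.
genre_by_country = pd.crosstab(exploded["primary_country"], exploded["genres"])

# Row-normalized shares: each country's genre mix, independent of its volume.
genre_share = pd.crosstab(
    exploded["primary_country"], exploded["genres"], normalize="index"
)
print(genre_by_country)
print(genre_share.round(2))
```

Passing `genre_share` to `sns.heatmap` instead of the count table would show each country's genre mix on a common 0 to 1 scale.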


#### Chart - 3

In [None]:
# Line chart: number of titles per release year

# The merged table has one row per (title, person) credit, so count unique
# title ids per year rather than rows
titles_per_year = merged_df.groupby('release_year')['id'].nunique().reset_index()

# Sort by release year
titles_per_year = titles_per_year.sort_values(by='release_year')

# Consider a reasonable range of years for the chart
# Let's filter for years where there's a substantial number of releases, e.g., after 1980
titles_per_year_filtered = titles_per_year[titles_per_year['release_year'] >= 1980]

# Chart visualization code
plt.figure(figsize=(14, 7))
sns.lineplot(data=titles_per_year_filtered, x='release_year', y='id')
plt.xlabel("Release Year")
plt.ylabel("Number of New Titles Released")
plt.title("Number of New Titles Released per Release Year")
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A line chart is suitable for showing trends over time. Here it shows how many titles in the catalog were released in each year, which makes growth patterns, fluctuations, and plateaus easy to see. Note that `release_year` is the year a title was originally released, not the year it was added to Amazon Prime.

##### 2. What is/are the insight(s) found from the chart?

The line chart shows how the Amazon Prime Video catalog is distributed by release year. It likely reveals a significant increase in the number of titles released in recent years, suggesting a library weighted toward recent content, possibly correlating with the growth of the streaming market. Because the dataset has no column recording when a title joined the platform, the curve cannot be read directly as the pace of content addition. There might be dips or plateaus in certain years, which could correspond to production cycles or strategic shifts.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, this insight is crucial for business strategy. Understanding the rate of content acquisition/release helps in forecasting content costs, evaluating the pace of library growth relative to competitors, and assessing the effectiveness of content investment over time. A strong upward trend suggests a healthy investment in expanding the library, which is generally positive for attracting and retaining subscribers seeking new content.
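
A `groupby` only returns years that actually occur in the data, so a line plot silently connects across any year with no releases. A minimal sketch, on hypothetical data, of reindexing to the full year range so those gaps show up as zeros:

```python
import pandas as pd

# Hypothetical title-level rows; no title was released in 2019.
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "release_year": [2018, 2018, 2020, 2021],
})

# Unique titles per release year (years without releases are absent here).
per_year = titles_demo.groupby("release_year")["id"].nunique()

# Reindex over every year in the observed range, filling gaps with 0.
full_range = range(per_year.index.min(), per_year.index.max() + 1)
per_year_filled = per_year.reindex(full_range, fill_value=0)
print(per_year_filled.tolist())  # [2, 0, 1, 1]
```

This matters most for the early decades of a catalog, where sparse years are more likely than after 1980.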

#### Chart - 4

In [None]:
# Prepare data for stacked area chart: Titles per release_year grouped by type or genre

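# Group by release_year and type, count unique titles
# ('type' is a category column; observed=True keeps only combinations present in the data)
titles_by_year_type = merged_df.groupby(['release_year', 'type'], observed=True)['id'].nunique().reset_index()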

# Pivot the table to get years as index, types as columns, and counts as values
titles_by_year_type_pivot = titles_by_year_type.pivot(index='release_year', columns='type', values='id').fillna(0)

# Filter for a reasonable range of years for visualization
titles_by_year_type_filtered = titles_by_year_type_pivot[titles_by_year_type_pivot.index >= 1980]

# Chart visualization code - Stacked Area Chart by Type
plt.figure(figsize=(14, 8))
plt.stackplot(titles_by_year_type_filtered.index,
              titles_by_year_type_filtered['MOVIE'],
              titles_by_year_type_filtered['SHOW'],
              labels=['Movie', 'TV Show'],
              alpha=0.8)

plt.xlabel("Release Year")
plt.ylabel("Number of New Titles Released")
plt.title("Number of New Titles Released per Year by Type")
plt.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

# Prepare data for stacked area chart: Titles per release_year grouped by primary_genre

# Group by release_year and primary_genre, count unique titles
titles_by_year_genre = merged_df.groupby(['release_year', 'primary_genre'])['id'].nunique().reset_index()

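# Get the top N primary genres for clarity (e.g., top 10 or 15)
# Rank on one row per title so that large casts do not inflate a genre's count
top_n_genres = merged_df.drop_duplicates(subset='id')['primary_genre'].value_counts().head(10).index.tolist() # Adjust N as needed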

# Filter the data to include only top N genres
titles_by_year_genre_filtered = titles_by_year_genre[titles_by_year_genre['primary_genre'].isin(top_n_genres)]

# Pivot the table to get years as index, genres as columns, and counts as values
titles_by_year_genre_pivot = titles_by_year_genre_filtered.pivot(index='release_year', columns='primary_genre', values='id').fillna(0)

# Ensure the pivot table index is sorted
titles_by_year_genre_pivot = titles_by_year_genre_pivot.sort_index()

# Filter for a reasonable range of years for visualization
titles_by_year_genre_pivot_filtered = titles_by_year_genre_pivot[titles_by_year_genre_pivot.index >= 1980]


# Chart visualization code - Stacked Area Chart by Primary Genre
plt.figure(figsize=(16, 10))
plt.stackplot(titles_by_year_genre_pivot_filtered.index,
              titles_by_year_genre_pivot_filtered.values.T, # Transpose values to stack correctly
              labels=titles_by_year_genre_pivot_filtered.columns,
              alpha=0.8)

plt.xlabel("Release Year")
plt.ylabel("Number of New Titles Released")
plt.title("Number of New Titles Released per Year by Top Genres")
plt.legend(loc='upper left', bbox_to_anchor=(1, 1)) # Place legend outside plot
plt.grid(True)
plt.tight_layout(rect=[0, 0, 0.85, 1]) # Adjust layout to make space for legend
plt.show()

##### 1. Why did you pick the specific chart?

A stacked area chart is excellent for showing how the composition of a total changes over time. By stacking the counts of titles grouped by 'type' (Movie vs. TV Show) or 'primary_genre' on top of each other, we can see the overall trend in content volume over the years (like the line chart) while simultaneously observing the proportional contribution of each category (type or genre) to that total year by year. This allows us to identify shifts in content strategy, such as an increasing focus on TV shows versus movies, or the rise/fall of specific genres over time.
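
To read the proportional contribution directly, rather than inferring it from band thickness, the yearly counts can be normalized so that each year sums to 100%. The sketch below is a minimal illustration on a made-up table: the counts are hypothetical and not taken from the dataset, and only the `MOVIE`/`SHOW` column names follow the pivot used above.

```python
import pandas as pd

# Hypothetical yearly counts, shaped like the pivoted table passed to plt.stackplot
counts = pd.DataFrame(
    {"MOVIE": [80, 90, 120], "SHOW": [20, 30, 80]},
    index=pd.Index([2019, 2020, 2021], name="release_year"),
)

# Divide each row by its row total so every year sums to 100%
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

Passing `shares` to `plt.stackplot` in place of the raw counts would give a 100% stacked area chart, which makes composition shifts easier to compare across years with very different totals.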


##### 2. What is/are the insight(s) found from the chart?

- **By Type:** The chart shows the yearly split of new releases between Movies and TV Shows. Movies are likely to dominate historically, while a widening TV Show band in recent years would indicate a strategic shift towards episodic content.
- **By Genre:** The chart shows the yearly release trends for the top 10 genres. It can reveal periods when certain genres surged (e.g., 'Comedy' peaking in the 2000s, or 'Documentation' increasing rapidly in recent years), how the genre mix of the library has evolved, and which genres consistently contribute a large volume of content.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, significantly. These charts provide insights into Amazon Prime's historical content strategy.
- By Type: Understanding the balance between Movies and TV Shows added each year helps evaluate if the platform is aligning with viewer consumption trends (e.g., are viewers spending more time on shows?). It can inform future investment in either category to meet demand and stay competitive. If TV shows are driving engagement, increasing the focus on show acquisition/production would be a positive impact.
- By Genre: Tracking genre trends helps identify evergreen genres, emerging popular genres, or genres that are being phased out. This can guide future content acquisition, development, and marketing. Investing in genres showing sustained or growing popularity is a positive business impact. Identifying genres that are decreasing in volume despite potential demand could highlight opportunities or gaps.


#### Chart - 5

In [None]:
# box plot of runtime trends over years

# Filter data for movies, keeping one row per title
# (merged_df repeats each title once per cast/crew credit, which would weight the mean by cast size)
movies_df = merged_df[merged_df['type'] == 'MOVIE'].drop_duplicates(subset='id').copy()

# Calculate the mean runtime of the unique movies released in each year
mean_runtime_per_year = movies_df.groupby('release_year')['runtime'].mean().reset_index()

# Consider a reasonable range of years for visualization
# Let's filter for years where there's a substantial number of movies, e.g., after 1980
mean_runtime_per_year_filtered = mean_runtime_per_year[mean_runtime_per_year['release_year'] >= 1980]

# Chart visualization code - Line chart for mean movie runtime over years
plt.figure(figsize=(14, 7))
sns.lineplot(data=mean_runtime_per_year_filtered, x='release_year', y='runtime')
plt.xlabel("Release Year")
plt.ylabel("Mean Movie Runtime (minutes)")
plt.title("Mean Movie Runtime Trend Over Release Years")
plt.grid(True)
plt.tight_layout()
plt.show()

# Prepare data for box plot of runtime trends over years
# We can sample the data or group into bins if the number of years is too large for a box plot
# Let's group by decades for better visualization with box plots
movies_df['decade'] = (movies_df['release_year'] // 10) * 10

# Filter out decades with very few entries if necessary, or focus on recent decades
# Let's focus on decades from 1980s onwards
movies_df_filtered_decades = movies_df[movies_df['decade'] >= 1980].copy()

# Remove any outliers or extreme values in runtime if needed for better box plot clarity
# For example, movies with runtime > 240 mins might be considered outliers for this analysis
movies_df_filtered_decades = movies_df_filtered_decades[movies_df_filtered_decades['runtime'] < 240]


# Chart visualization code - Box plot of movie runtime distribution per decade
plt.figure(figsize=(14, 8))
# Ensure the decades are sorted for the x-axis
sns.boxplot(data=movies_df_filtered_decades.sort_values(by='decade'), x='decade', y='runtime', palette='viridis')
plt.xlabel("Release Decade")
plt.ylabel("Movie Runtime (minutes)")
plt.title("Distribution of Movie Runtime by Release Decade")
plt.grid(axis='y', linestyle='--')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A box plot is ideal for visualizing the distribution of a numerical variable (runtime) across different categories (decades of release). Unlike a line chart showing only the mean, the box plot displays the median, quartiles, whiskers (showing range), and outliers for each decade. This provides a richer understanding of how the typical movie runtime has varied over time, including the spread and skewness of the data within each period.


##### 2. What is/are the insight(s) found from the chart?

The box plot reveals insights into the evolution of movie runtimes. We can observe:
1. **Median Runtime:** How the central tendency (median, represented by the line inside the box) of movie runtimes has changed across decades.
2. **Runtime Spread:** How consistent or variable movie runtimes were within each decade (shown by the height of the box, representing the interquartile range).
3. **Outliers:** If there are significantly longer or shorter movies in certain decades (shown by individual points beyond the whiskers).

It might show a relatively stable median runtime over recent decades, but perhaps changes in the variability or the presence of more extremely long films.
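
The quantities a box plot draws can also be computed as a table, which is handy for quoting exact medians and outlier counts. The following is a minimal sketch on made-up runtimes (the values are illustrative only); the helper `box_stats` is a hypothetical name, and the 1.5 × IQR outlier rule is the usual box plot default rather than something confirmed for this notebook.

```python
import pandas as pd

# Hypothetical runtimes (minutes) for a handful of movies; values are illustrative only
movies = pd.DataFrame({
    "release_year": [1985, 1988, 1989, 1994, 1996, 1999, 2003, 2005, 2008, 2009],
    "runtime": [90, 95, 100, 88, 105, 110, 92, 98, 101, 200],
})
movies["decade"] = (movies["release_year"] // 10) * 10


def box_stats(runtime: pd.Series) -> pd.Series:
    """Return the quantities a box plot draws for one group."""
    q1, median, q3 = runtime.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points further than 1.5 * IQR from the box are drawn as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((runtime < lower) | (runtime > upper)).sum())
    return pd.Series(
        {"median": median, "q1": q1, "q3": q3, "iqr": iqr, "n_outliers": n_outliers}
    )


# One row per decade, one column per statistic
stats = movies.groupby("decade")["runtime"].apply(box_stats).unstack()
print(stats)
```

In this toy sample only the 2000s group contains a point beyond the upper whisker (the 200-minute film), which is the kind of observation the box plot surfaces visually.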


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding typical movie runtimes and their trends can inform content strategy.
- **Viewer Preference:** Knowing the common runtime ranges can help align acquisition or production with audience expectations. If viewers prefer shorter or longer formats, this data can guide decisions.
- **Platform Design:** Runtime data influences features like autoplay, recommendations, and user interface design.
- **Content Packaging:** It can help in grouping content or suggesting watch times.

For example, if there's a trend towards slightly longer films, ensuring the platform handles longer playback sessions smoothly is important.


#### Chart - 6

In [None]:
# Scatter Plot: IMDb Score vs. TMDb Popularity (color by type)

import matplotlib.pyplot as plt
# Chart visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=merged_df, x='imdb_score', y='tmdb_popularity', hue='type', alpha=0.6)
plt.xlabel("IMDb Score")
plt.ylabel("TMDb Popularity")
plt.title("Scatter Plot: IMDb Score vs. TMDb Popularity (Colored by Type)")
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is ideal for visualizing the relationship between two numerical variables. By plotting IMDb Score against TMDb Popularity, we can observe if there's a correlation between how titles are rated on IMDb and how popular they are on TMDb. Coloring the points by 'type' (Movie or TV Show) adds another layer of insight, allowing us to see if this relationship differs for movies and TV shows.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot of IMDb Score vs. TMDb Popularity (colored by type) supports several potential insights:

- **Correlation between IMDb Score and TMDb Popularity:** The plot lets us visually assess whether the two are positively correlated. A general trend where titles with higher IMDb scores also have higher TMDb popularity would suggest that critically well-regarded content tends to be more popular.
- **Differences between Movies and TV Shows:** The color coding by type shows whether the relationship between score and popularity differs for Movies and TV Shows. For example, TV Shows might cluster in a different area of the plot or show a stronger or weaker correlation than Movies.
- **High-popularity, low-score content:** The plot can reveal titles with high TMDb popularity but relatively low IMDb scores. These could be titles that are heavily marketed, controversial, or appeal to a broad audience despite not being critically acclaimed.
- **High-score, low-popularity content:** Conversely, the plot might show titles with high IMDb scores but low TMDb popularity. These could be hidden gems, niche content, or older titles that are critically well-regarded but not currently trending.
- **Distribution and clusters:** The spread of the points shows the overall distribution of content in terms of score and popularity, including whether there are distinct clusters of content with similar score/popularity profiles.
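
The visual impression of correlation can be backed by a number per content type. The sketch below is a minimal example on made-up values (none come from the dataset); it computes a rank-based (Spearman) correlation by ranking both columns and then taking the ordinary Pearson correlation of the ranks, which is an illustrative choice because popularity tends to be heavily skewed.

```python
import pandas as pd

# Hypothetical title-level sample; scores and popularity values are made up
titles = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0, 5.5, 6.5, 7.0, 8.0, 8.5],
    "tmdb_popularity": [2.0, 3.0, 10.0, 40.0, 300.0, 50.0, 5.0, 20.0, 8.0, 12.0],
})

# Spearman correlation = Pearson correlation of the ranks, so a few
# extremely popular titles do not dominate the result
corr_by_type = {
    content_type: group["imdb_score"].rank().corr(group["tmdb_popularity"].rank())
    for content_type, group in titles.groupby("type")
}
print(corr_by_type)
```

In this toy data the movies are perfectly monotonic while the shows are weakly negative, which illustrates how the relationship can differ by type even when the combined scatter looks like a single cloud.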

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Insights that lead to negative growth:

An insight that could potentially lead to negative growth is if the chart reveals a significant portion of the content library, especially in high-volume categories, falls into the "low IMDb score, low TMDb popularity" area. This would indicate a substantial amount of content that is neither critically acclaimed nor popular with viewers, suggesting poor content acquisition or production decisions in those areas. Continuing to add such content would likely lead to decreased user engagement, lower satisfaction, and potentially increased churn, negatively impacting subscriber growth and retention.
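
One way to quantify the size of that "low score, low popularity" group is to split titles into four quadrants and report the share in each. This is a minimal sketch on made-up values; using the medians as cut-offs is an assumption for illustration, and any business-defined thresholds could be substituted.

```python
import pandas as pd

# Hypothetical title-level sample; values are illustrative only
titles = pd.DataFrame({
    "title": list("ABCDEF"),
    "imdb_score": [4.0, 5.0, 8.5, 8.0, 4.5, 7.5],
    "tmdb_popularity": [1.0, 2.0, 1.5, 30.0, 25.0, 40.0],
})

# Assumed cut-offs: the median of each metric
high_score = titles["imdb_score"] >= titles["imdb_score"].median()
high_pop = titles["tmdb_popularity"] >= titles["tmdb_popularity"].median()

# Label every title with its quadrant
titles["quadrant"] = "low score / low popularity"
titles.loc[high_score & high_pop, "quadrant"] = "high score / high popularity"
titles.loc[high_score & ~high_pop, "quadrant"] = "high score / low popularity"
titles.loc[~high_score & high_pop, "quadrant"] = "low score / high popularity"

# Share of the catalog in each quadrant
quadrant_share = titles["quadrant"].value_counts(normalize=True)
print(quadrant_share)
```

Grouping by genre or type before computing the shares would show which categories contribute most to the weakest quadrant.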



#### Chart - 7

In [None]:
# Histogram / KDE Plot: Distribution of imdb_score, tmdb_score, and tmdb_popularity

import matplotlib.pyplot as plt
# Chart visualization code
plt.figure(figsize=(12, 6))
sns.histplot(data=merged_df, x='imdb_score', kde=True, hue='type', multiple='stack', palette='viridis', bins=30)
plt.xlabel("IMDb Score")
plt.ylabel("Frequency")
plt.title("Distribution of IMDb Scores by Content Type")
plt.tight_layout()
plt.show()

plt.figure(figsize=(12, 6))
sns.histplot(data=merged_df, x='tmdb_score', kde=True, hue='type', multiple='stack', palette='viridis', bins=30)
plt.xlabel("TMDb Score")
plt.ylabel("Frequency")
plt.title("Distribution of TMDb Scores by Content Type")
plt.tight_layout()
plt.show()

plt.figure(figsize=(12, 6))
sns.histplot(data=merged_df, x='tmdb_popularity', kde=True, hue='type', multiple='stack', palette='viridis', bins=50) # More bins for popularity
plt.xlabel("TMDb Popularity")
plt.ylabel("Frequency")
plt.title("Distribution of TMDb Popularity by Content Type")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Top 10 Bar Chart: Top-rated titles (based on imdb_score and tmdb_popularity)

import matplotlib.pyplot as plt
# Calculate average IMDb score and TMDb popularity for each title
# Group by unique title ID to avoid counting multiple cast/crew rows for the same title
# and calculate the mean of the scores/popularity.
title_ratings = merged_df.groupby('id').agg(
    title=('title', 'first'),
    type=('type', 'first'),
    imdb_score=('imdb_score', 'mean'),
    tmdb_popularity=('tmdb_popularity', 'mean')
).reset_index()

# Separate data for Movies and TV Shows
movies_ratings = title_ratings[title_ratings['type'] == 'MOVIE']
shows_ratings = title_ratings[title_ratings['type'] == 'SHOW']

# Get Top 10 Movies by IMDb Score
top_10_movies_imdb = movies_ratings.sort_values(by='imdb_score', ascending=False).head(10)

# Get Top 10 TV Shows by IMDb Score
top_10_shows_imdb = shows_ratings.sort_values(by='imdb_score', ascending=False).head(10)

# Get Top 10 Movies by TMDb Popularity
top_10_movies_tmdb_pop = movies_ratings.sort_values(by='tmdb_popularity', ascending=False).head(10)

# Get Top 10 TV Shows by TMDb Popularity
top_10_shows_tmdb_pop = shows_ratings.sort_values(by='tmdb_popularity', ascending=False).head(10)


# Chart visualization code - Top 10 Movies by IMDb Score
plt.figure(figsize=(12, 7))
sns.barplot(x='imdb_score', y='title', data=top_10_movies_imdb, palette='viridis')
plt.xlabel("IMDb Score")
plt.ylabel("Movie Title")
plt.title("Top 10 Movies by IMDb Score")
plt.tight_layout()
plt.show()

# Chart visualization code - Top 10 TV Shows by IMDb Score
plt.figure(figsize=(12, 7))
sns.barplot(x='imdb_score', y='title', data=top_10_shows_imdb, palette='viridis')
plt.xlabel("IMDb Score")
plt.ylabel("TV Show Title")
plt.title("Top 10 TV Shows by IMDb Score")
plt.tight_layout()
plt.show()

# Chart visualization code - Top 10 Movies by TMDb Popularity
plt.figure(figsize=(12, 7))
sns.barplot(x='tmdb_popularity', y='title', data=top_10_movies_tmdb_pop, palette='viridis')
plt.xlabel("TMDb Popularity")
plt.ylabel("Movie Title")
plt.title("Top 10 Movies by TMDb Popularity")
plt.tight_layout()
plt.show()

# Chart visualization code - Top 10 TV Shows by TMDb Popularity
plt.figure(figsize=(12, 7))
sns.barplot(x='tmdb_popularity', y='title', data=top_10_shows_tmdb_pop, palette='viridis')
plt.xlabel("TMDb Popularity")
plt.ylabel("TV Show Title")
plt.title("Top 10 TV Shows by TMDb Popularity")
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Heatmap: IMDb score by genre and age certification

import matplotlib.pyplot as plt
# Group by age certification and calculate the mean IMDb score
age_certification_imdb = merged_df.groupby('age_certification')['imdb_score'].mean().reset_index()

# Sort by mean IMDb score
age_certification_imdb = age_certification_imdb.sort_values(by='imdb_score', ascending=False)

# Chart visualization code - Bar plot of mean IMDb score by age certification
plt.figure(figsize=(10, 6))
sns.barplot(x='age_certification', y='imdb_score', data=age_certification_imdb, palette='viridis')
plt.xlabel("Age Certification")
plt.ylabel("Mean IMDb Score")
plt.title("Mean IMDb Score by Age Certification")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


# Prepare data for heatmap: IMDb score by genre and age certification
# Explode genres
exploded_genres_df = merged_df.explode('genres')

# Group by age_certification and genre and calculate the mean IMDb score
# Ensure we handle potential NaNs after grouping if any groups have no score data (though we filled scores earlier)
genre_age_imdb_pivot = exploded_genres_df.groupby(['age_certification', 'genres'])['imdb_score'].mean().unstack()

# Consider a reasonable number of age certifications and genres for the heatmap
# Let's use all age certifications since there are not too many
# Let's use the top 15 genres from the previous analysis for better readability
top_genres_for_heatmap = genre_counts.head(15).index.tolist()

# Filter the pivot table to include only top genres
# Need to handle cases where a top genre might not exist for a specific age certification
heatmap_data_genre_age = genre_age_imdb_pivot.reindex(columns=top_genres_for_heatmap)


# Chart visualization code - Heatmap of mean IMDb score by Age Certification and Genre
plt.figure(figsize=(16, 10))
sns.heatmap(heatmap_data_genre_age, annot=True, fmt=".2f", cmap="YlGnBu", linewidths=.5)
plt.xlabel("Genre")
plt.ylabel("Age Certification")
plt.title("Heatmap of Mean IMDb Score by Age Certification and Genre")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

- The first chart, a bar chart, gives a quick comparison of the average IMDb scores across age certifications. This provides a high-level overview of which age ratings are associated with higher-rated content on average.
- The second chart, a heatmap, visualizes mean IMDb scores across two categorical variables simultaneously: Age Certification and Genre. It makes it easy to identify specific genre-age rating combinations that tend to have higher or lower average IMDb scores, highlighting potential areas of strength or weakness in content quality within specific niches.


##### 2. What is/are the insight(s) found from the chart?

- **From the bar chart:** We can see which age certifications, on average, correspond to higher IMDb scores. This might show that certain certifications (e.g., R, PG-13, or even specific children's ratings) are associated with more critically acclaimed content than others (like TV-MA or G). The 'Not Rated' category might have a varied average score depending on the quality of the included content.
- **From the heatmap:** This chart provides a more granular view. It can reveal, for example, that 'Drama' content with an 'R' certification has a high average IMDb score, while 'Comedy' content with a 'G' certification might have a lower average score. It highlights the performance (based on IMDb score) of each genre for specific target audiences (represented by age certification). Some cells might show 'NaN', indicating no content for that genre-certification combination, while others might sit close to the overall average because missing scores were filled with it earlier.
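
Because a heatmap cell can be based on very few titles, it can help to compute the count alongside the mean and blank out sparse cells before plotting. The sketch below uses made-up rows with one genre per row; the `MIN_TITLES` threshold and the `genre` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical exploded rows (one genre per row); scores are made up
rows = pd.DataFrame({
    "age_certification": ["R", "R", "R", "PG", "PG", "G"],
    "genre": ["drama", "drama", "comedy", "comedy", "comedy", "comedy"],
    "imdb_score": [7.5, 8.5, 6.0, 5.0, 6.0, 4.0],
})

grouped = rows.groupby(["age_certification", "genre"])["imdb_score"]
mean_scores = grouped.mean().unstack()  # cell value shown in the heatmap
n_titles = grouped.count().unstack()    # how many rows support each cell

MIN_TITLES = 2  # assumed threshold; tune to the size of the catalog

# Keep a mean only where enough rows support it; other cells become NaN
reliable_means = mean_scores.where(n_titles >= MIN_TITLES)
print(reliable_means)
```

Passing `reliable_means` to `sns.heatmap` would leave the masked cells blank, since seaborn does not draw cells with missing values.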


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, significantly.
- By Age Certification: Understanding which age ratings are associated with higher average scores helps prioritize content acquisition or production for those demographics/rating categories if the goal is to increase overall perceived content quality (based on IMDb scores).
- By Age Certification and Genre: The heatmap provides actionable insights. If the heatmap shows high scores for a specific genre-certification pair (e.g., 'Action' with 'PG-13'), it suggests that Amazon Prime has well-regarded content in that niche, which can be leveraged for marketing and recommendations. Conversely, if a major genre-certification combination has consistently low scores, it might indicate a need to improve the quality of content acquired or produced in that specific area, potentially by focusing on better-reviewed titles or shifting strategy. This can directly inform content development, acquisition, and catalog curation to meet audience expectations for quality within different content niches.


#### Chart - 10

In [None]:
# Age Suitability: Bar chart of content count per age_certification

import matplotlib.pyplot as plt
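# Count the number of unique titles per age certification
# (merged_df has one row per credit, so deduplicate on title id first)
age_certification_counts = merged_df.drop_duplicates(subset='id')['age_certification'].value_counts().reset_index()
age_certification_counts.columns = ['age_certification', 'count']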

# Chart visualization code
plt.figure(figsize=(10, 6))
sns.barplot(x='age_certification', y='count', data=age_certification_counts, palette='viridis')
plt.xlabel("Age Certification")
plt.ylabel("Number of Titles")
plt.title("Content Count by Age Certification")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart is used to display the frequency of different age certifications in the dataset. It clearly shows the distribution of content across different age ratings, making it easy to see which certifications are most common.


##### 2. What is/are the insight(s) found from the chart?

The chart shows the volume of content available for different age groups or sensitivities. It likely reveals which age certifications (e.g., PG, PG-13, TV-MA, Not Rated) have the largest number of titles. This gives insight into the target demographics the platform caters to most heavily. The 'Not Rated' category will also show its proportion, indicating how much content lacks a standard certification.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding the age suitability distribution of the content library is crucial for several reasons:
- **Target Audience:** It helps confirm if the content mix aligns with the platform's desired target audience segments.
- **Content Gaps:** It can highlight areas where the platform might be lacking content (e.g., insufficient content for younger audiences if that's a strategic goal, or too much unrated content).
- **Marketing:** This data can inform marketing efforts to highlight the breadth of content available for specific age groups (e.g., promoting family-friendly content if G/PG titles are abundant).
- **Compliance:** Knowing the distribution helps ensure the platform meets regional content regulations regarding age ratings.
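
For the points above, percentages are often easier to act on than raw counts. The following is a minimal sketch on a made-up table in which one title appears twice, mimicking a merged credits table; filling missing certifications with `'Not Rated'` is an assumption for this example.

```python
import pandas as pd

# Hypothetical rows; title t1 is repeated, as it would be with several credits
titles = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t3", "t4"],
    "age_certification": ["R", "R", "PG-13", "R", None],
})

cert_share = (
    titles.drop_duplicates(subset="id")["age_certification"]  # one row per title
    .fillna("Not Rated")                                      # assumed label for missing values
    .value_counts(normalize=True)                             # fractions instead of counts
    .mul(100)                                                 # convert to percentages
)
print(cert_share)
```

Here half of the unique titles are rated R and a quarter are unrated, a summary that can be compared directly with the intended audience mix.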


#### Chart - 11

In [None]:
#  Role Analysis: Count of role (e.g., Actor, Director)

import matplotlib.pyplot as plt
# Count the number of occurrences for each role
role_counts = merged_df['role'].value_counts().reset_index()
role_counts.columns = ['role', 'count']

# Chart visualization code
plt.figure(figsize=(8, 6))
sns.barplot(x='role', y='count', data=role_counts, palette='viridis')
plt.xlabel("Role")
plt.ylabel("Number of Credits") # Number of times a role appears in the dataset
plt.title("Distribution of Credits by Role (Actor vs. Director)")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is ideal for comparing the frequency of different categories. Here, it effectively shows the count of entries for each distinct role (Actor and Director) within the credits data, providing a clear comparison of how often each role is represented in the dataset.


##### 2. What is/are the insight(s) found from the chart?

The chart clearly shows the disparity in the number of credits for Actors compared to Directors. It is evident that there are significantly more entries for 'ACTOR' than 'DIRECTOR'. This is expected as typically a movie or TV show has many actors but only one or a few directors.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps in understanding the structure of the credits data and can inform how to analyze cast and crew information. For example, when analyzing the prominence of individuals, one might need to normalize counts or focus on metrics other than just the raw number of credits. It confirms that actor-centric analysis will involve a much larger volume of data points compared to director-centric analysis. This could be useful for optimizing database queries or processing pipelines depending on the focus (e.g., building an actor recommendation system vs. a director filmography feature).
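
One simple normalization is the average number of credits per title for each role. The sketch below uses a tiny made-up credits table; the `id` and `role` column names follow the merged data, while the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical credits: one row per (title, person, role)
credits = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t1", "t2", "t2", "t2"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR", "ACTOR", "ACTOR", "DIRECTOR"],
})

# Count credits per title and role, then average those counts across titles
credits_per_title = (
    credits.groupby(["id", "role"]).size()
    .unstack(fill_value=0)
    .mean()
)
print(credits_per_title)
```

Reading the result as "about 2.5 actors and 1 director per title" explains the gap in raw credit counts without implying that directors are under-represented.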


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap

import matplotlib.pyplot as plt
import numpy as np
# Chart visualization code - Correlation Heatmap
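# Select numerical columns for correlation analysis
# Drop identifier columns such as 'person_id', whose correlations are not meaningful
numerical_cols = merged_df.select_dtypes(include=np.number).drop(columns=['person_id'], errors='ignore')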

# Calculate the correlation matrix
correlation_matrix = numerical_cols.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title("Correlation Heatmap of Numerical Features")
plt.tight_layout()
plt.show()



##### 1. Why did you pick the specific chart?

A heatmap is an ideal visualization to display the correlation matrix between numerical variables. It uses color intensity to represent the strength and direction (positive or negative) of the correlation coefficient between each pair of variables, making it easy to quickly identify which numerical features are highly correlated with each other. The `annot=True` feature displays the correlation values on the heatmap, adding precision to the visual interpretation.


##### 2. What is/are the insight(s) found from the chart?

The heatmap reveals the linear relationships between numerical features such as `release_year`, `runtime`, `seasons`, `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score`.
Key insights might include:
- Correlations between different rating systems (e.g., `imdb_score` vs. `tmdb_score`). A high positive correlation would indicate that titles with high IMDb scores also tend to have high TMDb scores.
- Correlations between ratings/popularity and votes (e.g., `imdb_score` vs. `imdb_votes`). Higher scores may correlate positively with more votes, as popular or well-regarded content often receives more attention and ratings.
- Correlation between `seasons` and other metrics. `Seasons` will likely have low or negligible correlation with metrics for movies, as the value is 0 for movies. For TV shows, it might show some correlation with `imdb_votes` or `tmdb_popularity` if longer-running shows tend to accumulate more votes/popularity.
- Correlation between `release_year` and other metrics. This could reveal if newer content tends to have different characteristics (e.g., potentially higher `tmdb_popularity` due to recent buzz, or perhaps no strong trend in scores/runtime based solely on year).
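
To turn the matrix into a ranked list of the strongest relationships, the upper triangle can be flattened and sorted by absolute value. This sketch uses made-up rows in which titles repeat once per credit; deduplicating on `id` before correlating is a suggested precaution here, not something the analysis above is confirmed to do.

```python
import numpy as np
import pandas as pd

# Hypothetical credit-level rows; title-level values repeat for each credit
credit_rows = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3", "t3", "t4"],
    "imdb_score": [8.0, 8.0, 8.0, 6.0, 7.0, 7.0, 5.0],
    "tmdb_score": [7.8, 7.8, 7.8, 6.1, 7.2, 7.2, 5.5],
    "runtime": [120, 120, 120, 95, 140, 140, 90],
})

# One row per title, so titles with large casts are not over-weighted
title_level = credit_rows.drop_duplicates(subset="id").drop(columns="id")
corr = title_level.corr()

# Keep only the upper triangle so each pair appears once, then rank by |r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
ranked_pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(ranked_pairs)
```

The ranked series gives a compact companion to the heatmap when only the few strongest pairs need to be reported.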


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot

import matplotlib.pyplot as plt
# Select numerical columns for the pair plot
# Exclude columns like 'person_id', 'release_year', 'decade' as they might not be relevant
# for a pairwise scatter plot with scores/popularity/runtime/votes
numerical_cols_for_pairplot = merged_df[['runtime', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']]

# Due to the large number of data points, sampling or focusing on specific variables might be needed
# A full pair plot on >124k rows might be slow and the scatter plots hard to read due to overplotting.
# Let's sample the data for the pair plot or focus on the correlation heatmap which is more suitable for
# seeing all pairwise correlations quickly.
# However, if a visual representation of pairwise distributions is desired, let's sample.

# Sample the data (e.g., 10% of the data) for a quicker and more readable pair plot
sampled_df_for_pairplot = numerical_cols_for_pairplot.sample(frac=0.1, random_state=42)

# Drop rows with NaN values in the selected columns if any (though we filled NaNs)
sampled_df_for_pairplot = sampled_df_for_pairplot.dropna()


# Chart visualization code - Pair Plot (on sampled data)
# sns.pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(sampled_df_for_pairplot, diag_kind='kde') # diag_kind='kde' or 'hist'
plt.suptitle("Pair Plot of Selected Numerical Features (Sampled Data)", y=1.02) # Add title slightly above plots
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is a multivariate visualization that creates a grid of scatter plots for every possible pair of selected numerical variables, along with histograms or kernel density estimates (KDE) for each variable on the diagonal. It's useful for visualizing pairwise relationships (linear or non-linear) and the distribution of individual variables simultaneously. While computationally intensive and potentially messy with large datasets, it provides a comprehensive overview of how each selected numerical variable relates to every other.


##### 2. What is/are the insight(s) found from the chart?

The pair plot allows us to visually inspect the pairwise relationships between numerical features like runtime, different scores (IMDb, TMDb), popularity, and votes.
- **Scatter Plots:** The off-diagonal scatter plots show how one variable changes with respect to another. For example, the plot between `imdb_score` and `imdb_votes` might show an upward trend, indicating that titles with higher scores tend to receive more votes, though potentially with significant spread. The relationship between `runtime` and scores/popularity can also be observed.
- **Diagonal Plots:** The diagonal plots (histograms or KDEs) show the distribution of each individual variable (e.g., the distribution of `imdb_score` values, or `tmdb_popularity`).

While potentially challenging to interpret due to data volume and overplotting, the pair plot offers a preliminary visual scan for potential correlations, clusters, or non-linear relationships between these key numerical metrics.
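
Heavy-tailed columns tend to squash most points into one corner of a scatter panel. A common remedy is to plot them on a log scale; the sketch below applies `np.log1p` to a made-up sample. Treating `imdb_votes` and `tmdb_popularity` as the skewed columns is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; values are illustrative only
sample = pd.DataFrame({
    "imdb_score": [6.1, 7.4, 5.2, 8.0, 6.8],
    "imdb_votes": [120, 45_000, 35, 1_200_000, 9_800],
    "tmdb_popularity": [1.2, 35.0, 0.6, 410.0, 12.5],
})

skewed_cols = ["imdb_votes", "tmdb_popularity"]  # assumed heavy-tailed columns

# log1p(x) = log(1 + x) compresses large values and stays defined at x = 0
transformed = sample.assign(
    **{f"log1p_{col}": np.log1p(sample[col]) for col in skewed_cols}
).drop(columns=skewed_cols)
print(transformed.round(2))
```

Passing `transformed` to `sns.pairplot` would spread the points more evenly, at the cost of axes that are read in log units.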


## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the exploratory analysis, the following actions are suggested. Each depends on what the corresponding chart shows for the current catalog:

- **Balance content types:** Track the yearly split between Movies and TV Shows, and increase show acquisition or production if shows are driving engagement.
- **Prioritize genres with momentum:** Invest in genres showing sustained or growing release volume, and review genres whose volume is falling despite potential demand.
- **Raise quality in weak niches:** Use the genre and age-certification heatmap to find combinations with consistently low average IMDb scores, and favor better-reviewed titles there.
- **Promote undervalued titles:** Surface titles with high IMDb scores but low TMDb popularity through marketing and recommendations.
- **Limit low-performing additions:** Avoid adding further content in categories where titles cluster at low IMDb scores and low TMDb popularity, since this risks lower engagement and higher churn.
- **Close age-rating gaps:** Compare the age-certification distribution with the target audience segments, and reduce the share of unrated content where possible.

# **Conclusion**

Based on the extensive exploratory data analysis of the Amazon Prime TV shows and movies dataset, several key insights have been uncovered regarding content distribution, genre popularity, rating trends, and the evolution of the content library over time.

The analysis revealed that the library is dominated by certain genres, with a clear trend of increasing content volume in recent years, particularly for TV shows. We also observed varying average IMDb scores across different age certifications and genre combinations, highlighting areas of potential strength and weakness in content quality within specific niches. The relationship between IMDb scores and TMDB popularity was explored, identifying both critically acclaimed popular titles and potentially undervalued content.

The data wrangling process successfully handled duplicate entries and missing values, ensuring a clean and reliable dataset for analysis. Feature engineering, such as extracting primary genres and countries, provided valuable dimensions for visualization and interpretation.

In conclusion, the EDA has provided a solid foundation for understanding the Amazon Prime content landscape. The insights gained can directly inform strategic decisions related to content acquisition, production, curation, and marketing to optimize the platform's offerings and enhance user engagement in the competitive streaming market. Further analysis could delve deeper into specific genre performance, audience demographics, and the impact of production countries on content characteristics and success.
