# **Project Name**    - Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA (Exploratory Data Analysis)
##### **Contribution**    - Individual


# **Project Summary -**

Streaming platforms like Amazon Prime Video have transformed entertainment consumption, offering thousands of movies and TV shows across genres and countries. Understanding content trends and audience preferences is essential for data-driven decision-making. This project performs Exploratory Data Analysis (EDA) on Amazon Prime Video’s dataset to uncover insights about its content and contributors.

The dataset includes two CSV files: titles.csv with over 9,000 titles and 15 attributes (title, type, genre, release year, IMDb and TMDB scores, runtime, production countries, and seasons for TV shows), and credits.csv with over 124,000 records of actors and directors, including their names, roles, and character names. Together, these datasets provide a detailed view of the platform’s content library.

The project aims to answer key questions: Which type of content dominates? What are the most common genres and producing countries? How has content release evolved over time? Who are the most frequent actors and directors? How do IMDb ratings and popularity scores vary? Insights from this analysis help stakeholders make informed decisions on content strategy, marketing, and audience engagement.

EDA begins with data exploration, checking column types, missing values, and distributions. Data cleaning handles missing values and ensures consistent formats. Hypotheses guide the analysis, such as movies being more frequent than TV shows, Drama and Comedy as dominant genres, and the US as the leading content producer.

Visualizations include bar charts for content types and top genres, line charts for yearly trends, histograms for IMDb ratings, and plots for top countries and frequent actors/directors. Each plot is accompanied by remarks to interpret findings in a business context.

Key insights show that movies outnumber TV shows; Drama, Comedy, and Action are the most common genres; and content production has increased significantly after 2015. IMDb ratings mostly range from 6 to 8, indicating general audience approval. The United States, India, and the UK are leading producers, while certain actors and directors frequently appear in popular titles, highlighting their influence.

In conclusion, this EDA provides actionable insights for Amazon Prime stakeholders. The structured workflow—from exploration and cleaning to visualization and interpretation—demonstrates a practical approach for analyzing datasets, supporting informed business decisions and strategic content planning.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Which genres dominate the platform?

How is content distributed across regions and years?

What are the trends in IMDb ratings and popularity?

Who are the top actors/directors contributing to popular content?

#### **Define Your Business Objective?**

The objective of this project is to analyze Amazon Prime Video’s content library to uncover trends, audience preferences, and content performance. It aims to identify popular content types, genres, producing countries, and key contributors to support data-driven decisions. The insights will help content creators, marketers, and platform managers optimize content strategy and improve audience engagement.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
import zipfile
with zipfile.ZipFile("titles.csv.zip", "r") as zip_ref:
    zip_ref.extractall()
with zipfile.ZipFile("credits.csv.zip", "r") as zip_ref:
    zip_ref.extractall()

### Dataset Loading

In [None]:
# Load Dataset
titles = pd.read_csv("titles.csv")
credits = pd.read_csv("credits.csv")

### Dataset First View

In [None]:
# Dataset First Look

# For titles
print("First 5 rows of Titles Dataset:")
print(titles.head())

# For Credits
print("\nFirst 5 rows of Credits Dataset:")
print(credits.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# For Titles
print("Titles Rows & Columns:")
print(titles.shape)

# For Credits
print("\nCredits Rows & columns:")
print(credits.shape)

### Dataset Information

In [None]:
# Dataset Info

# For Titles
print("Titles info:")
titles.info()

# For Credits
print("\nCredits info:")
credits.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# For titles
print("Duplicate Value Count in Titles:")
print(titles.duplicated().sum())

# For Credits
print("\nDuplicate Value Count in Credits:")
print(credits.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing Values in Titles Dataset:")
print(titles.isnull().sum())

print("\nMissing Values in Credits Dataset:")
print(credits.isnull().sum())

In [None]:
# Visualizing the missing values

# For titles dataset
plt.figure(figsize=(10,6))
sns.heatmap(titles.isnull(), cbar=False)
plt.title("Missing Values in Titles Dataset")
plt.show()

# For credits dataset
plt.figure(figsize=(10,6))
sns.heatmap(credits.isnull(), cbar=False)
plt.title("Missing Values in Credits Dataset")
plt.show()

### What did you know about your dataset?

**The Titles dataset** contains information about movies and TV shows, including ID, title, type, description, release year, genre, runtime, seasons, ratings, and popularity metrics.

**The Credits dataset** contains information about actors and directors associated with each title, including person ID, name, role, and character name.

Some columns have missing values:

*   In **Titles**, missing data exists in description, age_certification, seasons, imdb_score, imdb_votes, tmdb_score, and tmdb_popularity.
*   In **Credits**, the character column has many missing values, while other columns are complete.


Most identifier columns (id, person_id, title, name) have no missing values, which is good.

seasons is missing for many rows because movies don’t have seasons, which is expected.

Duplicate rows are minimal or zero, so the dataset is mostly clean and reliable.

Overall, the dataset provides a **comprehensive view of Amazon Prime content and contributors**, with a few gaps that can be handled during data cleaning.
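The observations above can be backed by a per-column summary. The following is a minimal sketch, assuming a toy frame with invented values in place of titles.csv; `missing_summary` is a hypothetical helper, not part of the original notebook. It also checks whether missing seasons are confined to movies, which is what a structural gap would look like.

```python
import pandas as pd

# Toy stand-in for titles.csv; all values are invented for illustration
toy_titles = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "seasons": [None, None, 2.0, None],
    "imdb_score": [7.1, None, 8.0, 6.5],
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy_titles)
print(summary)

# Share of rows with missing seasons, split by content type
seasons_missing_by_type = toy_titles["seasons"].isnull().groupby(toy_titles["type"]).mean()
print(seasons_missing_by_type)
```

On the real data, the same call would be `missing_summary(titles)`; a percentage is usually easier to interpret than a raw count when the two files differ so much in size.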

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# For Titles
print("Titles Columns:")
print(titles.columns)

# For Credits
print("\nCredits Columns:")
print(credits.columns)

In [None]:
# Dataset Describe

# For Titles dataset
print("Titles Dataset Description:")
print(titles.describe(include='all'))  # include='all' shows numeric + non-numeric columns

# For Credits dataset
print("\nCredits Dataset Description:")
print(credits.describe(include='all'))

### Variables Description

**Titles Dataset:**
The Titles dataset contains information about all movies and TV shows on Amazon Prime Video. Key variables include:

id: Unique identifier for each title.

title: Name of the movie or TV show.

type: Specifies whether the content is a movie or a TV show (stored as MOVIE or SHOW).

description: Short synopsis of the title.

release_year: Year the title was released.

age_certification: Age rating (e.g., PG, R), may be missing for some titles.

runtime: Duration of the movie or episode in minutes.

genres: Genre(s) of the title.

production_countries: Countries where the title was produced.

seasons: Number of seasons (for TV Shows only).

imdb_id, imdb_score, imdb_votes: IMDb identifiers, rating, and number of votes.

tmdb_popularity, tmdb_score: Popularity and rating on TMDb.

**Credits Dataset:**
The Credits dataset contains information about actors and directors for each title. Key variables include:

person_id: Unique ID of the actor or director.

id: Title ID linking to the Titles dataset.

name: Name of the actor or director.

character: Character name (if the person is an actor).

role: Role type — ACTOR or DIRECTOR.
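Since `id` in Credits links to `id` in Titles, it can be worth confirming that every credit points at a known title before combining the two files. The sketch below assumes toy frames with invented rows; it uses the `indicator` option of `DataFrame.merge` to flag credits without a matching title.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv; all rows are invented
toy_titles = pd.DataFrame({"id": ["t1", "t2"], "title": ["A", "B"]})
toy_credits = pd.DataFrame({
    "person_id": [1, 2, 3],
    "id": ["t1", "t1", "t9"],  # "t9" has no matching title
    "name": ["P1", "P2", "P3"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# A left join keeps every credit; the _merge column records where each row was found
merged = toy_credits.merge(toy_titles, on="id", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"]
print(f"Credits without a matching title: {len(orphans)}")
```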

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique Values in Titles Dataset:")
print(titles.nunique())

print("\nUnique values in Credits Dataset:")
print(credits.nunique())

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Titles dataset
titles = titles.drop_duplicates().reset_index(drop=True)  # remove duplicates and reset the index

# Handle missing values
titles['description'] = titles['description'].fillna('No description')
titles['age_certification'] = titles['age_certification'].fillna('Not Rated')
titles['seasons'] = titles['seasons'].fillna(0)  # 0 for movies
titles['imdb_score'] = titles['imdb_score'].fillna(titles['imdb_score'].mean())
titles['imdb_votes'] = titles['imdb_votes'].fillna(0)
titles['tmdb_popularity'] = titles['tmdb_popularity'].fillna(titles['tmdb_popularity'].mean())
titles['tmdb_score'] = titles['tmdb_score'].fillna(titles['tmdb_score'].mean())

# Convert data types
# Plain column assignment is used here because, in recent pandas versions,
# assigning through .loc[:, col] writes into the existing column and can keep its old dtype
titles['release_year'] = titles['release_year'].astype(int)
titles['seasons'] = titles['seasons'].astype(int)

# Credits dataset
credits = credits.drop_duplicates().reset_index(drop=True)  # remove duplicates and reset the index
credits['character'] = credits['character'].fillna('Unknown')

# Check shapes
print("Titles dataset shape:", titles.shape)
print("Credits dataset shape:", credits.shape)

# Check missing values after wrangling
print("\nMissing values after wrangling (Titles):")
print(titles.isnull().sum())

print("\nMissing values after wrangling (Credits):")
print(credits.isnull().sum())

# First 5 rows of Credits dataset
print("\nFirst 5 rows of Credits dataset:")
print(credits.head())

### What all manipulations have you done and insights you found?

**Data Manipulations Done**:

1. **Dropped duplicate rows** in both Titles and Credits datasets to ensure data uniqueness.

2. **Handled missing values:**

*   Filled missing description with “No description” in Titles.
*   Filled missing age_certification with “Not Rated”.
*   Filled missing seasons with 0 for movies.
*   Filled numeric missing values (imdb_score, tmdb_score, tmdb_popularity) with the column mean.
*   Filled missing imdb_votes with 0.
*   Filled missing character in Credits with “Unknown”.

3. **Converted data types:** release_year and seasons to integers for proper analysis.

4. **Reset index** in both datasets to keep the dataframe neat after cleaning.

**Insights Found**
*   Most identifier columns (id, person_id, title, name) were already complete — no missing values.
*   Missing values were mostly in optional or descriptive columns, e.g., seasons (for movies), age_certification, character.
*   Numeric columns like imdb_score, tmdb_score, and tmdb_popularity had some missing values, which could affect analysis if not handled.
*   After cleaning, both datasets are ready for analysis, with no missing values and proper data types.
*   This cleaning ensures that subsequent visualizations and statistical analysis will be accurate and meaningful.
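Mean imputation keeps the column average unchanged but narrows the spread of the scores, which matters for later histograms. The following is a minimal sketch with invented scores; the `imdb_score_imputed` flag column is a hypothetical addition, not something the notebook creates.

```python
import pandas as pd

# Toy scores with gaps; values are invented for illustration
toy = pd.DataFrame({"imdb_score": [6.0, None, 8.0, None]})

# Record which rows were missing before filling them
toy["imdb_score_imputed"] = toy["imdb_score"].isnull()
toy["imdb_score"] = toy["imdb_score"].fillna(toy["imdb_score"].mean())

# Compare the spread with and without the filled-in rows
std_all = toy["imdb_score"].std()
std_observed = toy.loc[~toy["imdb_score_imputed"], "imdb_score"].std()
print(f"std with imputed rows: {std_all:.3f}, observed only: {std_observed:.3f}")
```

Keeping the flag makes it possible to plot or summarize only the observed ratings when the imputed rows would distort the picture.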


## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Count of Movies vs TV Shows
plt.figure(figsize=(8,5))
sns.countplot(data=titles, x='type', hue='type', palette='viridis', dodge=False)
plt.title("Distribution of Titles by Type (Movies vs TV Shows)")
plt.xlabel("Title Type")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a count plot (bar chart) here because the data is categorical (type = Movie or TV Show). A bar chart is ideal for showing how many titles belong to each category. It makes it very easy to compare the two and clearly see which dominates.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can see the distribution of titles on Amazon Prime Video:

*   The platform has more Movies than TV Shows.
*   This indicates that Amazon Prime’s content library is heavily focused on Movies, which can influence user preferences, marketing strategies, and content acquisition decisions.
*   Stakeholders can use this insight to balance content types or invest further in the category that drives higher engagement.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   Knowing that Amazon Prime has more Movies than TV Shows helps content teams and marketers target users who prefer movies, plan promotions, and invest in acquiring more popular movie content.
*   It also helps in strategic decisions like creating bundles, recommending titles, or producing original movies to drive subscriptions.

**Potential Negative Growth:**
*   The heavy focus on Movies may lead to underrepresentation of TV Shows, which could limit engagement for users who prefer series.
*   If competitors offer a more balanced library, Amazon Prime could lose potential subscribers who prioritize TV Shows over Movies.

**Justification:**
*   The insight highlights a content imbalance, which is a key factor that can positively or negatively affect user retention and growth depending on how the business leverages it.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Chart 2: Number of Titles Released per Year (Line Chart)
# release_year is the year a title was released, not the year it was added to the platform
year_counts = titles['release_year'].value_counts().sort_index()
plt.figure(figsize=(10,6))
plt.plot(year_counts.index, year_counts.values, marker='o', linestyle='-', color='green')
plt.title('Number of Titles Released per Year')
plt.xlabel('Year')
plt.ylabel('Number of Titles')
plt.grid(True)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a line chart because the data involves years (a continuous variable) and the number of titles released per year (a trend over time). A line chart is the best way to show changes and growth patterns across time. It allows us to easily spot peaks, drops, or trends in title releases.
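One detail worth checking for a yearly line chart: `value_counts` only returns years that occur in the data, so a year with no titles is skipped and the line is drawn straight across the gap. A minimal sketch, assuming an invented list of release years, fills those years with zero before plotting.

```python
import pandas as pd

# Toy release years; 2016 is deliberately absent
toy_years = pd.Series([2015, 2015, 2017, 2018, 2018, 2018], name="release_year")

year_counts = toy_years.value_counts().sort_index()

# Reindex over every year in the range so missing years show up as 0
full_range = list(range(year_counts.index.min(), year_counts.index.max() + 1))
year_counts_full = year_counts.reindex(full_range, fill_value=0)
print(year_counts_full)
```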

##### 2. What is/are the insight(s) found from the chart?


*   The number of titles per release year increases significantly after 2015.
*   The catalog is therefore weighted toward recent releases, with earlier years contributing fewer titles per year.

Overall, the line chart points to a library that has grown mainly through recent content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business impact:**
Yes, the gained insights can help create a positive business impact. A rising number of recent releases gives Amazon a steady supply of new titles to promote, and highlighting them in recommendations and marketing can increase viewer engagement.

**Negative growth:**
On the other hand, the smaller number of older titles may point to a gap. This does not necessarily lead to negative growth, but viewers looking for classic films and shows may find less to watch, and competitors could capture that audience. Growth in volume also does not guarantee quality, so adding titles without regard to ratings could raise costs with limited returns.

**Justification:**
Thus, the insight suggests a balanced strategy: keep adding recent content for mainstream growth, while selectively licensing older titles to diversify the catalog.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Filter only actors
actors_df = credits[credits['role'] == 'ACTOR']

# Count top 10 actors by number of titles
top_actors = actors_df['name'].value_counts().head(10)
plt.figure(figsize=(10,6))
top_actors.plot(kind='barh', color='orange')
plt.title('Top 10 Actors by Number of Titles')
plt.xlabel('Number of Titles')
plt.ylabel('Actor')
plt.gca().invert_yaxis()  # Highest at top
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart since we’re comparing actors based on the count of titles. A bar chart is well-suited for categorical comparisons, and making it horizontal ensures the longer actor names fit neatly. It makes it very easy to see which actor appears most often.
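Counting rows per name measures credits, not titles: an actor who plays two characters in one title is counted twice. The sketch below, using invented credits, contrasts the row count with the number of distinct title IDs per actor. On the real data, grouping by `person_id` instead of `name` would also avoid merging different people who share a name; that is a suggestion, not what the notebook does.

```python
import pandas as pd

# Toy credits; names and rows are invented for illustration
toy_credits = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t2", "t3"],
    "name": ["Ann Lee", "Ann Lee", "Ann Lee", "Bo Kim", "Bo Kim"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "ACTOR", "DIRECTOR"],
    "character": ["Twin A", "Twin B", "Lead", "Lead", "Unknown"],
})

actors = toy_credits[toy_credits["role"] == "ACTOR"]

# Number of credit rows per actor (what value_counts measures)
credit_rows = actors["name"].value_counts()

# Number of distinct titles per actor
distinct_titles = actors.groupby("name")["id"].nunique().sort_values(ascending=False)

print(credit_rows)
print(distinct_titles)
```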

##### 2. What is/are the insight(s) found from the chart?

*   A small group of actors appears in far more titles than the rest; the chart ranks the ten with the most credits.
*   The count reflects how often an actor appears in the catalog, not how popular or well rated those titles are.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*  **Positive Business Impact:**
Yes, the insights are useful. Frequently appearing actors are familiar faces, so Amazon can feature their titles in recommendations and promotions to build on existing audience recognition.
*   **Negative Growth Concern:**
However, over-reliance on a small set of actors may reduce variety. Subscribers looking for fresh talent may find the catalog repetitive, which could slow engagement.
*   **Justification:**
Balancing well-known actors with new or underrepresented talent can help Amazon Prime keep its catalog recognizable while still offering variety.






#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Chart 4: Distribution of Titles by Genre (Pie Chart)

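import ast

# Parse genre strings such as "['drama', 'comedy']" into lists, then explode into separate rows
# ast.literal_eval only evaluates Python literals, so it is safer than eval on file contents
def parse_genres(value):
    if isinstance(value, list):  # already parsed, e.g. when the cell is re-run
        return value
    if isinstance(value, str) and value.startswith('['):
        return ast.literal_eval(value)
    return [value]

titles['genres'] = titles['genres'].apply(parse_genres)
titles_exploded = titles.explode('genres')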

# Count genres
genre_counts = titles_exploded['genres'].value_counts()

# Plot pie chart (top 10 only for clarity)
plt.figure(figsize=(8,8))
genre_counts.head(10).plot.pie(autopct='%1.1f%%', startangle=140, shadow=True)
plt.ylabel('')  # remove y-label
plt.title('Top 10 Genres on Amazon Prime')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a pie chart here because genres represent categorical distribution, and we want to show proportions. A pie chart makes it easy to visualize the share of each genre among all titles. By showing only the top 10 genres, we keep the chart readable and highlight the most dominant genres.
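Because one title can carry several genres, a pie chart built from exploded rows shows shares of genre tags, which is not the same as the share of titles that belong to a genre. The following sketch, with invented rows, computes both so the difference is visible.

```python
import ast
import pandas as pd

# Toy titles; genre strings mimic the list-as-text format, values are invented
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3", "t4"],
    "genres": ["['drama', 'comedy']", "['drama']", "['drama', 'thriller']", "[]"],
})

# Parse the text into real lists, then create one row per (title, genre) pair
toy["genre_list"] = toy["genres"].apply(ast.literal_eval)
exploded = toy.explode("genre_list").dropna(subset=["genre_list"])

# Share of all genre tags (what the pie chart slices represent)
tag_share = exploded["genre_list"].value_counts(normalize=True)

# Share of titles that carry each genre (can sum to more than 1)
title_share = exploded.groupby("genre_list")["id"].nunique() / len(toy)

print(tag_share)
print(title_share)
```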

##### 2. What is/are the insight(s) found from the chart?

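*   The platform has a wide variety of genres, but only a few dominate: Drama leads by a wide margin, Comedy is a clear second, and Thriller, Action, and Romance form a mid-tier cluster.
*   Crime, Documentary, and Horror are noticeably smaller, and Family and European are the least frequent within the top 10.
*   Because the pie shows only the top 10 genres and a title can carry several genres, the percentages are shares of genre tags within the top 10, not shares of all titles.
*   This shows Amazon Prime focuses heavily on mainstream entertainment categories.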

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
Yes, the insights are useful. Since the chart shows that the majority of titles fall under popular genres like Drama, Comedy, and Action, Amazon can prioritize acquiring or producing more content in these categories. This helps attract a wider audience base and ensures strong engagement, as these genres appeal to the largest group of subscribers.

**Negative Growth Concern:**
However, heavy focus on only a few genres may lead to reduced diversity in the catalog. Audiences with niche preferences, such as Documentary or Musical lovers, might feel ignored, resulting in dissatisfaction and potential subscriber loss.

**Justification:**
By balancing investments between mainstream genres (for mass appeal) and niche genres (for unique content variety), Amazon can maximize viewership, prevent content fatigue, and avoid negative growth risks caused by limited genre diversity.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Clean up countries column
titles['production_countries'] = titles['production_countries'].fillna("Unknown")

# Convert string list like "['US']" → "US"
titles['production_countries'] = titles['production_countries'].astype(str).str.replace(r"[\[\]\'\"]", "", regex=True)

# If multiple countries, take the first one; titles with an empty country list become "Unknown"
titles['main_country'] = titles['production_countries'].apply(lambda x: x.split(",")[0].strip() or "Unknown")

# Now count top 10
country_counts = titles['main_country'].value_counts().head(10)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(y=country_counts.index, x=country_counts.values)
plt.title("Top 10 Countries by Number of Titles")
plt.xlabel("Number of Titles")
plt.ylabel("Production Country")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart here because we’re comparing countries and their contribution to the number of titles. Bar charts are ideal for categorical comparisons, and the horizontal layout makes country names easier to read. It clearly shows which countries dominate content production.

##### 2. What is/are the insight(s) found from the chart?

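*   Content production is heavily dominated by a few countries: the United States leads, India also has a strong presence, and the United Kingdom, Canada, and Japan contribute significantly but far less.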
*   Smaller countries contribute very little, indicating regional imbalance in the content library.
*   This highlights Prime’s strong focus on Western and Indian audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
Yes, the insights are useful. Since the chart shows that the majority of Amazon Prime content comes from countries like the United States and India, Amazon can continue to strengthen partnerships with production houses in these regions. This ensures a consistent flow of content that already appeals to large subscriber bases.

**Negative Growth Concern:**
However, over-reliance on just a few countries may reduce content diversity. Audiences from other regions may feel underrepresented, which could lead to slower growth in international markets. For example, if countries in Europe, Africa, or Latin America are underrepresented, Amazon Prime might lose potential subscribers from those regions.

**Justification:**
Balancing content production between dominant countries and underrepresented regions can help Amazon Prime expand its global reach, reduce the risk of negative growth, and improve customer satisfaction worldwide.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Chart 6 - Top 10 Content Ratings
plt.figure(figsize=(10,6))
titles['age_certification'].value_counts().head(10).plot(kind='bar', color='skyblue')
plt.title('Top 10 Content Ratings on Amazon Prime')
plt.xlabel('Age Certification')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is chosen because we are dealing with categorical data (age ratings like 13+, 16+, etc.) and need to compare their frequency. It makes it easy to see which certification dominates Amazon Prime content.

##### 2. What is/are the insight(s) found from the chart?

*   The majority of content falls under categories like 13+ and 16+, meaning Prime caters heavily to teenagers and young adults.
*   Fewer titles exist in the 7+ or All Ages categories, suggesting less family/kids-oriented content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** Amazon can continue focusing on teen/young adult categories since they make up the largest share of the catalog.

**Negative Growth Concern:** Underrepresentation of kids/family content could limit growth in households with children. For example, Netflix invests heavily in kids’ shows, which may attract families away from Prime.

**Justification:** A balanced investment in both youth-focused and family-friendly content ensures broader audience engagement and prevents losing family subscribers.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Chart 7 - Word Cloud of Names in the Credits Dataset

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all names (actors and directors) into one big string
text = " ".join(credits['name'].astype(str))

# Generate word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display it
plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a word cloud because it visually highlights the most frequent names (actors/directors) in the dataset. Bigger words instantly show dominance, making it easier to identify key contributors compared to tables or bar charts.
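A word cloud built from one joined string counts words, so a first name shared by many people can outweigh any single person. The sketch below uses invented names to show how word counts and full-name counts can disagree.

```python
from collections import Counter

# Toy name list; names and frequencies are invented for illustration
names = [
    "Michael Adams", "Michael Brown", "Michael Clark",
    "Rita Stone", "Rita Stone",
]

# Word-level counts, similar in spirit to what a word cloud sizes its words by
word_counts = Counter(" ".join(names).split())

# Person-level counts, one entry per full name
full_name_counts = Counter(names)

print(word_counts.most_common(1))
print(full_name_counts.most_common(1))
```

If the `wordcloud` package is available, person-level counts like `full_name_counts` can be passed to `WordCloud.generate_from_frequencies` so that each full name is sized as one unit.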

##### 2. What is/are the insight(s) found from the chart?

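*  A few words appear much larger than the rest.
*  Names like Michael, David, and John stand out. Because the word cloud is built from the words in the joined text, these are common first names shared by many different people rather than individual contributors.
*  Most other names have a much smaller presence.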

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
Amazon can strengthen its brand by continuing collaborations with the most frequent and recognizable actors/directors, as audiences are naturally attracted to familiar faces. This builds loyalty and increases the chances of high viewership for new releases featuring these individuals.

**Negative Growth Concern:**
Over-reliance on a small set of actors/directors may limit creative diversity and discourage subscribers who are looking for fresh talent or unique storytelling. Competitors like Netflix often diversify their cast and crew, which helps them appeal to a broader audience base.

**Justification:**
A balanced approach is essential. While retaining partnerships with high-profile names secures consistent popularity, Amazon should also invest in showcasing new or underrepresented talent. This not only reduces the risk of content fatigue but also allows Prime to attract niche audiences and differentiate itself from competitors.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Filter only directors
directors = credits[credits['role'] == 'DIRECTOR']

# Count appearances of each director
top_directors = directors['name'].value_counts().head(10)

# Plot
plt.figure(figsize=(10,6))
top_directors.plot(kind='bar', color='salmon', edgecolor='black')
plt.title("Top 10 Directors with Most Titles", fontsize=14)
plt.xlabel("Director", fontsize=12)
plt.ylabel("Number of Titles Directed", fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar chart because it provides a clear and simple comparison of how many titles each director has contributed. Since we’re focusing on the top 10 directors, the bar chart is the most effective way to highlight who is leading in content production.

##### 2. What is/are the insight(s) found from the chart?

*   Joseph Kane has directed the most titles among all directors.
*   Sam Newfield and Jay Chapman also have high contributions.
*   The majority of directors in the top 10 have directed between 15–40 titles.
*   There is a visible gap between the top 3 directors and the rest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** High-output directors like Joseph Kane and Sam Newfield were active mainly in the mid-20th century, so their high counts point to a deep back catalog of classic films that Amazon can package into curated collections.

**Negative Growth Concern:** Relying heavily on the same set of directors may result in less creative diversity, potentially leading to audience fatigue.

**Justification**: Balancing partnerships between proven high-output directors and fresh new talent will help Amazon maintain both content volume and originality.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Filter only movies (make a copy)
movies = titles[titles['type'] == 'MOVIE'].copy()

# Create a clean runtime column
movies['runtime_clean'] = (
    movies['runtime']
    .astype(str)               # make sure it's string
    .str.replace('min', '')    # remove 'min'
    .str.strip()               # remove extra spaces
)

# Convert to numeric safely
movies['runtime_num'] = pd.to_numeric(movies['runtime_clean'], errors='coerce')

# Drop missing values
movies = movies.dropna(subset=['runtime_num'])

# Plot histogram with a KDE curve
plt.figure(figsize=(10,6))
sns.histplot(movies['runtime_num'], bins=20, kde=True, color='skyblue', edgecolor='black')
plt.title('Distribution of Movie Runtimes on Amazon Prime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('Number of Movies')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram/KDE for movie durations because duration is a continuous variable. A histogram shows how frequently different ranges of durations occur, while the KDE curve smooths the distribution to highlight common runtime ranges.

##### 2. What is/are the insight(s) found from the chart?

*   Most movies fall between ~80 to 120 minutes – this is the peak range of the histogram.
*   Few very short movies (<60 mins) and very long ones (>180 mins) – these are rare cases (outliers).
*   The distribution looks a bit like a bell curve but is slightly right-skewed, meaning a tail of longer movies stretches to the right while most titles sit at or below the average.
*   The average runtime seems to be around 100 minutes, which matches the typical industry standard for feature films.
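
The visual impressions above can be checked numerically. The sketch below is one possible way to do it: `runtime_summary` is a hypothetical helper, and the sample values are invented for illustration. A mean above the median together with a positive skewness value is consistent with a right-skewed distribution.

```python
import pandas as pd

def runtime_summary(runtimes: pd.Series) -> dict:
    """Return mean, median, and skewness of a runtime column (in minutes)."""
    # Coerce anything non-numeric to NaN and drop it before summarising.
    clean = pd.to_numeric(runtimes, errors="coerce").dropna()
    return {
        "mean": float(clean.mean()),
        "median": float(clean.median()),
        "skew": float(clean.skew()),  # > 0 suggests a longer right tail
    }

# Hypothetical sample runtimes, not taken from titles.csv
sample_runtimes = pd.Series([45, 85, 90, 95, 100, 110, 120, 150, 200])
print(runtime_summary(sample_runtimes))
```

In this notebook the same function could be called as `runtime_summary(movies['runtime_num'])`.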

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
*   **Content Planning:** Knowing that most movies in the catalogue are 80–120 minutes long gives streaming platforms (like Amazon Prime) a baseline for what a typical title looks like. Whether this range actually retains audiences best would need viewing data, which this dataset does not include.
*   **Recommendation Engine:** Runtime can be used as one feature in recommendations, for example by surfacing typical-length movies to users who tend to finish them.
*   **Production Strategy:** Studios can design new content around this range to meet audience expectations.

**Negative Growth Risks:**
*   Very short movies (<60 mins) may disappoint audiences, leading to lower satisfaction and churn.
*   Very long movies (>180 mins) might see reduced completion rates, meaning users may drop off midway, affecting engagement metrics.

**Justification:**
While most runtimes align well with audience preferences (positive impact), outlier runtimes (too short or too long) can negatively affect user satisfaction and watch time, which may hurt growth if produced in large numbers.
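
To judge whether outlier runtimes exist "in large numbers", the share of movies in each runtime band can be computed. The following is a minimal sketch, assuming runtimes are whole minutes; the band edges mirror the thresholds used above (under 60, 80 to 120, over 180), and the name `runtime_bucket_shares` and the sample values are hypothetical.

```python
import pandas as pd

# Band edges for whole-minute runtimes: [0, 59], (59, 79], (79, 120], (120, 180], (180, inf)
BINS = [0, 59, 79, 120, 180, float("inf")]
LABELS = ["<60", "60-79", "80-120", "121-180", ">180"]

def runtime_bucket_shares(runtimes: pd.Series) -> pd.Series:
    """Share of titles falling in each runtime band."""
    clean = runtimes.dropna()
    buckets = pd.cut(clean, bins=BINS, labels=LABELS, include_lowest=True)
    shares = buckets.astype(str).value_counts(normalize=True)
    # Reindex so empty bands still appear, with a share of 0.
    return shares.reindex(LABELS, fill_value=0.0)

# Hypothetical sample runtimes, not taken from titles.csv
sample_runtimes = pd.Series([45, 85, 90, 95, 100, 110, 120, 150, 200])
print(runtime_bucket_shares(sample_runtimes))
```

If the `<60` and `>180` bands each hold only a small percentage of titles, the risks described above apply to a small part of the library.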

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Count number of titles released per year
titles_per_year = titles['release_year'].value_counts().sort_index()

plt.figure(figsize=(15,6))
plt.plot(titles_per_year.index, titles_per_year.values, marker='o', linestyle='-')

plt.title("Number of Titles Released Over the Years", fontsize=14)
plt.xlabel("Release Year", fontsize=12)
plt.ylabel("Number of Titles", fontsize=12)

# Label every 5th year present in the data and rotate the labels for readability
plt.xticks(titles_per_year.index[::5], rotation=45)

plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a line chart because it is best suited to show trends over time. Since we are analyzing how the number of titles released has changed across years, the line chart makes it easy to observe increases, decreases, and overall growth patterns.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how the titles in the catalogue are distributed by release year.
*   The number of releases rises steadily in recent decades, with the sharpest growth after about 2015, showing rising investment in content.
*   Sharp spikes indicate years of heavy content production, while sudden dips may be due to external factors such as market slowdowns, regulations, or global events.
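
Spikes and dips are easier to pin down with year-over-year differences than by eye. The sketch below is one way to compute them; `yearly_release_changes` is a hypothetical helper and the sample years are invented. It also fills in years with no titles as zero, which `value_counts()` alone would leave out of the line chart.

```python
import pandas as pd

def yearly_release_changes(release_years: pd.Series) -> pd.DataFrame:
    """Titles per year (missing years filled with 0) and year-over-year change."""
    years = release_years.dropna().astype(int)
    counts = years.value_counts().sort_index()
    # Build a gap-free year range so absent years show up as 0 titles.
    full_range = pd.RangeIndex(int(counts.index.min()), int(counts.index.max()) + 1)
    counts = counts.reindex(full_range, fill_value=0)
    return pd.DataFrame({"titles": counts, "yoy_change": counts.diff()})

# Hypothetical sample release years, not taken from titles.csv
sample_years = pd.Series([2015, 2015, 2016, 2018, 2018, 2018])
print(yearly_release_changes(sample_years))
```

Sorting the result by `yoy_change` would list the years with the largest rises and falls; in this notebook the input would be `titles['release_year']`.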

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive business impact:**
Yes, the insights can help create a positive business impact because they highlight the years of rapid content growth, which is useful for understanding what strategies worked in the past. A consistent rise indicates an expanding library, which can support audience engagement and platform growth.

**Negative growth:**
On the negative side, if the chart shows dips in certain years, it may indicate reduced investment in new titles or challenges in content acquisition/production. Such declines could hurt user retention, as fewer new titles may reduce viewer interest. Recognizing these patterns allows the business to avoid repeating past mistakes and maintain steady growth.

**Justification:**
The number of titles per release year is a proxy for library growth, not a direct measure of viewership. An upward trend shows that the platform is actively expanding its library, while dips flag release years that are thinly represented and may deserve targeted acquisition.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

By leveraging insights on the most common genres (Drama, Comedy, and Action), the most frequent actors and directors, typical movie runtimes, and release trends over time, Amazon Prime can work to increase user engagement, attract new subscribers, and allocate content investment more efficiently.

# **Conclusion**

The analysis highlighted not only the most common genres and most frequent creators but also patterns in content release over the years, the distribution of movie runtimes, and IMDb rating patterns. It can help Amazon Prime identify gaps in content, form hypotheses about which types of shows or movies are likely to perform well, and make strategic decisions about investments and marketing. Overall, data-driven insights from the Titles and Credits datasets provide a strong foundation for optimizing the platform’s content library and enhancing viewer satisfaction.
