# **Project Name**    - Amazon Prime TV Shows and Movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

# Introduction
This project is an Exploratory Data Analysis (EDA) of Amazon Prime TV Shows and Movies. The dataset includes titles, release years, genres, IMDb scores, and credits information. The goal is to clean the data, visualize important trends, and extract insights about Amazon Prime’s content library.






# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

The main business objectives of this EDA project are:

- Understand the types of content available on Amazon Prime (movies vs. TV shows).
- Analyze the popularity of different genres to identify viewer preferences.
- Assess the distribution of IMDb scores to understand content quality.
- Identify trends in release years to highlight content production growth.
- Summarize key insights to support content acquisition and marketing decisions.

This analysis can help Amazon Prime make better data-driven decisions about content creation, licensing, and promotion.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints: Do the visualization in a structured way, following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set a clean style for plots
sns.set(style="whitegrid")


### Dataset Loading

In [None]:
# Load Dataset
titles_df = pd.read_excel('/content/titles.csv.xlsx')
credits_df = pd.read_excel('/content/credits.csv.xlsx')
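
The guidelines above ask for exception handling and a notebook that runs in one go. A minimal sketch of a more defensive loader is shown below. `load_table` is a hypothetical helper, not part of the original notebook, and it assumes the file type can be chosen from the final extension (so `titles.csv.xlsx` is read as Excel).

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Load a table from an Excel or CSV file, chosen by the final file extension."""
    path = Path(path)
    if not path.exists():
        # Fail early with a readable message instead of a long pandas traceback
        raise FileNotFoundError(f"Dataset not found: {path}")
    if path.suffix.lower() in {".xlsx", ".xls"}:
        return pd.read_excel(path)
    return pd.read_csv(path)


# Usage with the paths used in this notebook (left commented so the sketch runs anywhere):
# titles_df = load_table('/content/titles.csv.xlsx')
# credits_df = load_table('/content/credits.csv.xlsx')
```

Raising `FileNotFoundError` with the full path makes a wrong upload location obvious when the notebook is re-run in a fresh session.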



### Dataset First View

In [None]:
# Dataset First Look
# Show first 5 rows of each dataset
print("Titles Dataset:")
print(titles_df.head())

print("\nCredits Dataset:")
print(credits_df.head())


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Titles Dataset Shape (Rows, Columns):", titles_df.shape)
print("Credits Dataset Shape (Rows, Columns):", credits_df.shape)


### Dataset Information

In [None]:
# Dataset Info
# info() prints its report directly and returns None, so it is not wrapped in print()
print("Titles Dataset Info:")
titles_df.info()

print("\nCredits Dataset Info:")
credits_df.info()


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Duplicate rows in Titles Dataset:", titles_df.duplicated().sum())
print("Duplicate rows in Credits Dataset:", credits_df.duplicated().sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print("Missing values in Titles Dataset:")
print(titles_df.isnull().sum())

print("\nMissing values in Credits Dataset:")
print(credits_df.isnull().sum())
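
Raw null counts are easier to judge as a share of all rows. The sketch below shows one way to summarize them. `missing_summary` is a hypothetical helper and `demo` is a toy frame standing in for `titles_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "imdb_score": [7.1, np.nan, np.nan, 6.0],
    "release_year": [2001, 2015, 2019, 2020],
})
print(missing_summary(demo))
```

Calling `missing_summary(titles_df)` in the notebook would show which columns need the most attention during wrangling.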


In [None]:
# Visualizing the missing values
plt.figure(figsize=(10,6))
sns.heatmap(titles_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap - Titles Dataset')
plt.show()

plt.figure(figsize=(10,6))
sns.heatmap(credits_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap - Credits Dataset')
plt.show()


### What did you know about your dataset?

- The dataset contains **Amazon Prime TV Shows and Movies**.
- It has **two main parts**:
  - 'titles.csv.xlsx': Information about titles, genres, type (movie/TV show), release years, IMDb scores, and more.
  - 'credits.csv.xlsx': Information about the cast and crew for each title.
- The **titles dataset** has around [insert number] rows and [insert number] columns.
- The **credits dataset** has around [insert number] rows and [insert number] columns.
- Key columns include:
  - title: Name of the content
  - type: Movie or TV show
  - release_year: Year of release
  - genres: Content genres (action, drama, etc.)
  - imdb_score: IMDb rating of the content
- The dataset will help us understand:
  - Content distribution by type and genre
  - Popularity and quality of content (via IMDb scores)
  - Trends in Amazon Prime’s content release over time


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Titles Dataset Columns:")
print(titles_df.columns)

print("\nCredits Dataset Columns:")
print(credits_df.columns)


In [None]:
# Dataset Describe
print("Titles Dataset - Statistical Summary:")
print(titles_df.describe())

print("\nCredits Dataset - Statistical Summary:")
print(credits_df.describe())


### Variables Description

**titles.csv.xlsx columns:**
- id: Unique identifier for the title
- title: Name of the content
- type: Whether it’s a Movie or TV Show
- release_year: Year of release
- age_certification: Content rating (PG, R, etc.)
- genres: Genres associated with the content
- imdb_score: IMDb rating
- duration: Duration (in minutes or seasons)

**credits.csv.xlsx columns:**
- id: Unique identifier (to link with titles)
- title: Name of the content
- cast: Main actors in the content
- crew: Key crew members (directors, writers, etc.)
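
Since `id` links the two tables, they can be joined for combined analysis. The following is a sketch on toy data only. The frames `titles_demo` and `credits_demo` are hypothetical, and the real files may have different or additional columns.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
titles_demo = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["Alpha", "Beta", "Gamma"],
    "type": ["Movie", "TV Show", "Movie"],
})
credits_demo = pd.DataFrame({
    "id": ["t1", "t1", "t2", "t9"],
    "cast": ["Ann", "Bob", "Cy", "Dee"],
})

# Left join keeps every credit row; the indicator column flags credits with no matching title
merged = credits_demo.merge(
    titles_demo[["id", "title", "type"]],
    on="id",
    how="left",
    validate="many_to_one",  # raises if 'id' is not unique in the titles table
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print("Credits without a matching title:", len(unmatched))
```

If both tables carry a `title` column, pandas appends `_x` and `_y` suffixes to the overlapping names, so selecting only the needed title columns before merging keeps the result tidy.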


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique Values in Titles Dataset:")
for column in titles_df.columns:
    print(f"{column}: {titles_df[column].nunique()}")

print("\nUnique Values in Credits Dataset:")
for column in credits_df.columns:
    print(f"{column}: {credits_df[column].nunique()}")


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#  Remove duplicate rows
titles_df = titles_df.drop_duplicates()
credits_df = credits_df.drop_duplicates()

#  Drop rows with missing 'title' in titles dataset (crucial data)
titles_df = titles_df.dropna(subset=['title'])

# Fill missing numeric values (imdb_score) with mean
titles_df['imdb_score'] = titles_df['imdb_score'].fillna(titles_df['imdb_score'].mean())

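# Fill remaining missing values in non-numeric columns with 'Unknown'
# (numeric columns are skipped so they keep their numeric dtype for later charts)
titles_text_cols = titles_df.select_dtypes(exclude='number').columns
titles_df[titles_text_cols] = titles_df[titles_text_cols].fillna('Unknown')
credits_text_cols = credits_df.select_dtypes(exclude='number').columns
credits_df[credits_text_cols] = credits_df[credits_text_cols].fillna('Unknown')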

#  Fix 'genres' column: convert to list of genres
titles_df['genres'] = titles_df['genres'].apply(lambda x: [genre.strip() for genre in str(x).split(',')] if x != 'Unknown' else ['Unknown'])
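
Mean imputation is what this notebook uses. As a possible alternative, the sketch below fills each gap with the median score of the same content type and keeps a flag marking which values were imputed. The toy frame and the `imdb_score_missing` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "imdb_score": [5.0, 7.0, np.nan, 8.0, np.nan, 6.0],
})

# Record which scores were originally missing before filling anything
demo["imdb_score_missing"] = demo["imdb_score"].isna()

# Fill each gap with the median score of the same content type
group_median = demo.groupby("type")["imdb_score"].transform("median")
demo["imdb_score"] = demo["imdb_score"].fillna(group_median)
print(demo)
```

The flag column makes it possible to exclude imputed rows later, for example when plotting the score distribution.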

### What all manipulations have you done and insights you found?

###  Data Wrangling Summary and Insights

 **Manipulations Done:**
- Removed duplicate rows in both datasets to avoid repeated data.
- Dropped rows with missing 'title' information, as this is essential.
- Filled missing IMDb scores with the column mean to avoid losing data.
- Replaced other missing values (like age certification) with 'Unknown' to maintain consistency.
- Transformed the 'genres' column to a list for easier analysis and visualization.

 **Insights Found:**
- Around [fill in number] duplicates were removed.
- The 'genres' column often contained multiple genres separated by commas, which we split into individual genres.
- IMDb scores had some missing entries. Filling them with the mean keeps every title in the analysis, but it places all imputed titles at a single value, so the score distribution looks narrower than it really is.
- The dataset is now **clean and ready** for visual analysis and further exploration!
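
Once `genres` holds lists, the most common genres can be counted by expanding each list into one row per genre. The sketch below is illustrative. `parse_genres` is a hypothetical helper, and its handling of stringified lists such as `"['drama', 'comedy']"` is an assumption about how the raw column might be stored.

```python
import ast

import pandas as pd


def parse_genres(value):
    """Turn one raw genres cell into a list of genre names."""
    if isinstance(value, list):
        return value
    text = str(value).strip()
    if text in ("", "Unknown", "nan", "[]"):
        return ["Unknown"]
    if text.startswith("["):
        # Assumed case: the cell is a stringified list such as "['drama', 'comedy']"
        try:
            return [str(g).strip() for g in ast.literal_eval(text)] or ["Unknown"]
        except (ValueError, SyntaxError):
            pass  # fall through to plain comma splitting
    return [part.strip() for part in text.strip("[]").split(",") if part.strip()]


# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma", "Delta"],
    "genres": ["drama, comedy", "['drama', 'action']", "Unknown", "comedy"],
})
demo["genres"] = demo["genres"].apply(parse_genres)

# One row per (title, genre) pair, then count how often each genre appears
genre_counts = demo.explode("genres")["genres"].value_counts()
print(genre_counts.head(10))
```

The resulting counts could feed a horizontal bar chart of the top genres.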


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6,4))
sns.countplot(x='type', data=titles_df, palette='Set2')
plt.title('Count of Movies vs TV Shows')
plt.xlabel('Type')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar plot (countplot) because it clearly shows the count of each category—Movies and TV Shows—making it easy to compare how many titles Amazon Prime has for each type.

##### 2. What is/are the insight(s) found from the chart?

i) Amazon Prime has more Movies than TV Shows in its content library.  
ii) This suggests a strong focus on movies for their audience.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes!**  
This insight helps Amazon Prime understand their content balance. If TV Shows are popular among users, this insight suggests they should invest more in TV Shows to balance the content library.

**No negative growth insight:**  
This finding doesn’t show a negative trend directly, but it can highlight **a potential opportunity gap** for Amazon Prime to create more TV Shows.
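
The movie versus TV show balance can also be stated as numbers rather than read off the bars. A minimal sketch on toy data follows. The labels `Movie` and `TV Show` are assumed values of the `type` column.

```python
import pandas as pd

# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

# Absolute counts and percentage share for each content type
type_summary = pd.DataFrame({
    "count": demo["type"].value_counts(),
    "share_pct": (demo["type"].value_counts(normalize=True) * 100).round(1),
})
print(type_summary)
```

Running the same two lines on `titles_df['type']` would give the exact split to quote alongside the chart.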


#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8,4))
sns.histplot(titles_df['imdb_score'].dropna(), bins=20, kde=True, color='skyblue')
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram with a KDE curve because it clearly shows how IMDb scores are distributed among Amazon Prime’s content. This helps understand the quality of content and whether most shows/movies have higher or lower ratings.


##### 2. What is/are the insight(s) found from the chart?

i) Most IMDb scores are between [6–8], showing that the content is generally rated as average to good.  
ii) There are few titles with very low or very high scores, indicating that Amazon Prime's content tends to be consistent in quality.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes!**  
Understanding the IMDb score distribution helps Amazon Prime ensure they’re providing good-quality content to retain and attract users.

**No negative growth insight:**  
Few titles have very low IMDb scores, which is a positive sign for user satisfaction. The chart alone cannot show whether Amazon Prime filters out low-quality content, and the mean-imputed scores add extra weight to the middle of the distribution.
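
A claim such as "most scores fall in a given range" can be checked directly. The sketch below computes the share of scores inside a range, plus the quartiles. `share_in_range` is a hypothetical helper and the scores are toy values.

```python
import pandas as pd


def share_in_range(scores, low, high):
    """Fraction of non-missing scores that fall within [low, high], bounds included."""
    scores = pd.Series(scores).dropna()
    if scores.empty:
        return float("nan")
    return float(scores.between(low, high).mean())


# Toy data (hypothetical), not the real dataset
demo_scores = pd.Series([4.5, 6.0, 6.8, 7.2, 8.0, 9.1, None])
print(f"Share between 6 and 8: {share_in_range(demo_scores, 6, 8):.0%}")
print(demo_scores.quantile([0.25, 0.5, 0.75]))
```

Applying this to the scores before imputation gives a figure that is not inflated by the filled-in mean values.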


#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(12,6))
ax = sns.countplot(x='release_year', data=titles_df,
                    order=titles_df['release_year'].value_counts().index.sort_values(),
                    palette='coolwarm')

plt.title('Number of Titles Released by Year')
plt.xlabel('Release Year')
plt.ylabel('Count')

# Show only every 5th year label for readability
for index, label in enumerate(ax.get_xticklabels()):
    if index % 5 != 0:
        label.set_visible(False)

plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart to visualize the trend of content releases over the years. This helps identify how Amazon Prime’s production or licensing has grown and changed over time.

##### 2. What is/are the insight(s) found from the chart?

i) The number of titles by release year rises noticeably over the last decade, so the catalog skews toward recent releases.  
ii) Because `release_year` is the year a title was released, not the year it joined Amazon Prime, this points to a focus on fresh content rather than measuring library growth directly.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes!**  
The trend shows Amazon Prime’s growing investment in content creation. Understanding this can help them identify peak years of content growth and plan future strategies.

**No negative growth insight:**  
This chart highlights continuous growth, which is positive for the business.
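
Yearly bars can be noisy, so grouping release years into decades gives a compact view of the same trend. A small sketch on toy data:

```python
import pandas as pd

# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({"release_year": [1994, 1999, 2005, 2012, 2015, 2019, 2021]})

# Integer-divide by 10 to bucket each year into its decade (1994 -> 1990)
demo["decade"] = (demo["release_year"] // 10) * 10
titles_per_decade = demo.groupby("decade").size()
print(titles_per_decade)
```

The most recent decade is only partly covered by any dataset snapshot, so its count is not directly comparable with complete decades.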


#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,5))
sns.boxplot(x='type', y='imdb_score', data=titles_df, palette='Pastel1')
plt.title('IMDb Score Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a boxplot because it shows the spread and median of IMDb scores for Movies and TV Shows separately. It helps compare the quality of content in each category.

##### 2. What is/are the insight(s) found from the chart?

i) Both Movies and TV Shows have similar median IMDb scores, suggesting comparable content quality.  
ii) TV Shows show slightly more variation (wider box), indicating a broader range of ratings.


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes!**  
This insight helps Amazon Prime understand that both content types are well-received by viewers. They can continue investing in both types to keep users engaged.

**No negative growth insight:**  
This chart suggests both categories are balanced in terms of quality, which is good for user satisfaction and retention.
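
The median and box width seen in a boxplot can be reported as numbers. The sketch below computes count, median, and interquartile range per content type on toy data. The `iqr` helper is defined here for illustration.

```python
import pandas as pd

# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({
    "type": ["Movie"] * 4 + ["TV Show"] * 4,
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 4.0, 6.0, 8.0, 10.0],
})


def iqr(series):
    """Interquartile range: the height of the box in a boxplot."""
    return series.quantile(0.75) - series.quantile(0.25)


# Named aggregation gives one labelled column per statistic
score_stats = demo.groupby("type")["imdb_score"].agg(
    count="count", median="median", iqr=iqr
)
print(score_stats)
```

A larger `iqr` for one type corresponds to the wider box described above.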


#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8,5))
sns.countplot(y='age_certification', data=titles_df, order=titles_df['age_certification'].value_counts().index, palette='Set3')
plt.title('Age Certification Distribution')
plt.xlabel('Count')
plt.ylabel('Age Certification')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar plot to show how Amazon Prime’s content is distributed by age rating. It helps understand what audience segments are most targeted by their content library.

##### 2. What is/are the insight(s) found from the chart?

i) The majority of titles are in the [insert top 1–2 certifications] categories, indicating a focus on content suitable for [general audiences / mature audiences].  
ii) Some certifications have fewer titles, showing potential opportunities to expand content for those age groups.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes!**  
Understanding which age certifications are most represented can help Amazon Prime ensure they’re catering to different audience segments and find gaps to expand viewership.

**No negative growth insight:**  
No major negative growth insight is seen here, but it suggests a chance to diversify content for underrepresented age groups.
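
Because missing certifications were replaced with 'Unknown' during wrangling, that label appears as its own bar. The sketch below cross-tabulates certification against content type and reports the 'Unknown' share. The certification labels are toy values chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "age_certification": ["R", "PG", "Unknown", "TV-MA", "Unknown"],
})

# Rows are certifications, columns are content types, cells are title counts
cert_by_type = pd.crosstab(demo["age_certification"], demo["type"])
unknown_share = (demo["age_certification"] == "Unknown").mean()
print(cert_by_type)
print(f"Share of titles with unknown certification: {unknown_share:.0%}")
```

A high 'Unknown' share would mean the certification chart describes only part of the catalog.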


#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here

##### 2. What is/are the insight(s) found from the chart?

Answer Here


##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here


#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Select only numeric columns for correlation
numeric_cols = titles_df.select_dtypes(include=[np.number])

plt.figure(figsize=(8,6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numeric Features')
plt.show()



##### 1. Why did you pick the specific chart?

I chose a correlation heatmap because it clearly shows the strength of relationships between numerical features in a single view. This is helpful for identifying potential patterns or dependencies in the dataset.

##### 2. What is/are the insight(s) found from the chart?

i) There’s little to no correlation between `imdb_score` and `release_year`, showing that newer or older content does not necessarily have higher or lower ratings.  
ii) Any other numeric columns (if present) also show low correlation, meaning they are largely independent of each other.
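
The heatmap uses Pearson correlation by default, which only captures linear association. As a cross-check, the sketch below also computes Spearman correlation, obtained here as Pearson correlation on ranks. The values are toy data, not results from the real dataset.

```python
import pandas as pd

# Toy data (hypothetical), not the real dataset
demo = pd.DataFrame({
    "release_year": [1990, 2000, 2005, 2010, 2015, 2020],
    "imdb_score": [7.5, 6.1, 7.0, 5.9, 7.2, 6.4],
})

# Pearson: strength of the linear relationship
pearson = demo["release_year"].corr(demo["imdb_score"])

# Spearman: Pearson correlation computed on the ranks, sensitive to any monotonic trend
spearman = demo["release_year"].rank().corr(demo["imdb_score"].rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If both values stay near zero on the real data, the "little to no correlation" reading holds for monotonic trends as well as linear ones.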


#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select only numeric columns for the pairplot
numeric_cols = titles_df.select_dtypes(include=[np.number])

# No palette argument: without a hue variable it has no effect
sns.pairplot(numeric_cols, diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Numeric Features', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a pair plot because it visually shows how numeric variables relate to each other across multiple pairings in a single view. It’s a great way to detect patterns, clusters, or correlations in the data.

##### 2. What is/are the insight(s) found from the chart?

i) There’s no clear trend between imdb_score and release_year, reinforcing that content quality doesn’t depend on release date.  
ii) Other numeric columns also show limited correlation, suggesting that numeric features in this dataset are largely independent.


## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

 **Diversify Content Portfolio:**  
The analysis shows that Amazon Prime has more movies than TV shows. To attract a broader audience, investing in more TV shows — especially in trending genres — can help boost subscriptions.

**Focus on Popular Genres:**  
Counting titles per genre identifies the most common genres in the catalog. The client should continue to acquire or produce content in the genres that resonate with viewers to maintain high engagement.

**Maintain Quality Across Release Years:**  
IMDb scores are consistent regardless of release year. This means Amazon Prime should prioritize quality in every new release to ensure long-term user satisfaction.

**Explore Opportunities in Underrepresented Age Certifications:**  
The age certification analysis shows some ratings have fewer titles. Expanding content in these areas can help capture niche audiences.

**Leverage Data for Personalized Recommendations:**  
Using these insights, Amazon Prime can personalize recommendations for users based on their preferred genres, ratings, and content types — improving user experience and retention.

**Regularly Monitor IMDb Ratings and Feedback:**  
IMDb scores provide a valuable indicator of content success. Amazon Prime should monitor these scores and user feedback to fine-tune content strategies.

---

**Business Impact:**  
Implementing these recommendations can help Amazon Prime:  
- Boost subscriber retention  
- Attract new viewers  
- Fill gaps in the content library  
- Enhance competitive advantage in the streaming market


# **Conclusion**

In this EDA project, we analyzed Amazon Prime’s TV Shows and Movies datasets to identify key trends and insights. Our analysis showed:

- Movies are more numerous than TV shows on the platform.
- IMDb scores are generally in the [6–8] range, showing consistent average-to-good ratings.
- There has been significant growth in content releases over the past decade.
- Age certifications reveal a focus on certain audience segments, while others have room for expansion.
- Numeric features (like IMDb score and release year) are largely independent, meaning content quality doesn’t depend on the release year.

**Business Recommendations:**  
Amazon Prime should continue to prioritize quality in every release, explore gaps in age certifications and TV shows, and use this data to personalize recommendations and strengthen their content strategy.

**Next Steps:**  
Further analysis can include studying actor/director popularity, regional content trends, and subscriber engagement patterns to refine Amazon Prime’s growth strategies.

This project demonstrates the power of data-driven insights in optimizing streaming content and growing user engagement.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***