<a href="https://colab.research.google.com/github/rajaramnivas/Amazon-prime-data-analysis/blob/main/Copy_of_amazonprime.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Amazon Prime TV Shows and Movies Analysis  



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member**     - RAJARAM S


# **Project Summary -**

#### **Introduction**

Streaming platforms like Amazon Prime Video have transformed the entertainment industry by providing on-demand access to a vast collection of TV shows, movies, and original content. Understanding content distribution, audience preferences, and market trends is essential for optimizing content strategies and improving user engagement. This project focuses on Exploratory Data Analysis (EDA) to uncover patterns in Amazon Prime's content library, including genre distribution, release trends, rating patterns, and missing data analysis.

#### **Objective**

The goal of this analysis is to extract meaningful insights from Amazon Prime's dataset by exploring:

- The distribution of TV shows and movies.
- Popular genres and their frequency.
- Trends in content production over the years.
- IMDb rating distribution and its impact on content.
- Missing data analysis to ensure data quality.

By leveraging data analytics, we aim to provide valuable insights that can assist stakeholders in making data-driven decisions for content acquisition and recommendation algorithms.

#### **Data and Methodology**

The dataset consists of two key files:

- **Titles Dataset:** Contains metadata about TV shows and movies, including title, type (movie or TV show), genre, IMDb rating, release year, and country of origin.
- **Credits Dataset:** Provides information about the cast and crew of each title.

To gain deeper insights, the two datasets were merged on a common identifier (`id`). The following EDA techniques were applied:

- **Data Cleaning:** Handling missing values, removing duplicates, and standardizing formats.
- **Visualization:** Using Python's Matplotlib and Seaborn libraries to create meaningful charts and graphs.
- **Statistical Analysis:** Identifying trends and patterns based on numerical data like IMDb ratings and release years.

#### **Key Findings**

**Movies vs. TV Shows Distribution**

- The dataset contains more movies than TV shows, suggesting that the catalog leans toward film content.

**Popular Genres**

- The most common genres include Drama, Comedy, Action, and Thriller.
- Drama is the dominant genre, reflecting a strong audience preference for storytelling-driven content.

**Release Year Trends**

- The number of movie and TV show releases has increased significantly in recent years.
- A peak in content production is observed post-2010, indicating the streaming boom era.

**IMDb Rating Distribution**

- Most movies and shows receive IMDb ratings between 5.0 and 8.0.
- Few titles have exceptionally high or low ratings, so most content falls within a middle quality range.

**Missing Data Analysis**

- Some columns, such as cast and IMDb ratings, contain missing values.
- Handling missing data is crucial for building reliable recommendation models and analytics tools.

#### **Business Implications**

- **Content Strategy:** Amazon Prime can prioritize the dominant genres, such as Drama and Thriller, to attract more viewers.
- **Production Trends:** The surge in content production after 2010 aligns with increased streaming demand, and further investment in original content may boost engagement.
- **User Experience:** By improving IMDb rating-based recommendations, Amazon Prime can enhance personalized suggestions for users.

# **GitHub Link -**

https://github.com/rajaramnivas

# **Problem Statement**
With the rise of streaming platforms, understanding content trends, audience preferences, and performance metrics is crucial for optimizing user experience. This project aims to analyze Amazon Prime's TV shows and movies dataset to uncover insights into:

- The distribution of content types (movies vs. TV shows).
- Popular genres and their trends over time.
- IMDb rating patterns and their impact on audience engagement.
- Content availability across different countries.

By leveraging Exploratory Data Analysis (EDA), this study will provide data-driven recommendations to improve content acquisition strategies and enhance user engagement on Amazon Prime Video.



### **Business Objective**  

The primary objective of this project is to analyze Amazon Prime's TV shows and movies dataset to extract valuable insights that can help optimize content strategy and enhance user engagement.  

#### **Key Goals:**  
1. **Understand Content Distribution:**  
   - Analyze the proportion of movies vs. TV shows.  
   - Identify trends in content availability over the years.  

2. **Analyze Genre Popularity:**  
   - Determine the most common and highest-rated genres (the dataset has no viewing figures).  
   - Help Amazon Prime focus on high-demand genres for future content acquisition.  

3. **Evaluate IMDb Ratings & Viewer Preferences:**  
   - Identify the distribution of IMDb ratings.  
   - Understand factors influencing high-rated content to improve recommendations.  

4. **Regional Content Insights:**  
   - Assess content availability across different countries.  
   - Help optimize regional content offerings based on demand.  

5. **Business Impact:**  
   - Improve Amazon Prime's recommendation algorithms.  
   - Guide content acquisition decisions with data-driven insights.  
   - Enhance customer satisfaction by offering content aligned with audience preferences.  




# **General Guidelines**
### **General Guidelines for Amazon Prime TV Shows and Movies Data Analysis**  

#### **1. Understanding the Dataset**  
- Review the dataset structure, including columns like title, type (movie/TV show), genre, release year, IMDb rating, and country.  
- Identify missing values and clean the data accordingly.  
- Merge datasets if multiple files are provided (e.g., titles and credits).  

#### **2. Exploratory Data Analysis (EDA)**  
- Use statistical summaries (`.describe()`, `.info()`) to understand numerical distributions.  
- Identify patterns in data through visualizations.  
- Analyze content distribution (TV shows vs. movies).  
- Explore genre popularity and trends over time.  
- Examine IMDb rating distributions to assess content quality.  

#### **3. Data Visualization Best Practices**  
- Use bar charts for categorical comparisons (e.g., top genres).  
- Apply line charts to show trends over time (e.g., content release per year).  
- Utilize histograms to display rating distributions.  
- Use pie charts sparingly for proportion-based insights (e.g., movies vs. TV shows).  
- Implement heatmaps to analyze correlations between numerical variables.  

#### **4. Business Insights & Recommendations**  
- Identify high-performing content categories based on ratings and popularity (e.g., IMDb votes, TMDb popularity).  
- Suggest content strategies for Amazon Prime based on genre popularity.  
- Provide regional content analysis to optimize market segmentation.  
- Offer data-driven recommendations for content acquisition and platform improvements.  

#### **5. Documentation & Reporting**  
- Clearly define the problem statement and business objectives.  
- Structure the report with **Introduction, Methodology, Findings, and Recommendations**.  
- Include key visualizations with proper labels and interpretations.  
- Summarize actionable insights concisely.  
  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints: Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

### Dataset First View

In [None]:
import gdown
import pandas as pd

# File IDs from Google Drive links
file_ids = {
    "data1.csv": "1595FMqlfuQ5vGmHabxldymAkcnyN_RJa",  # Replace extension based on actual format
    "data2.csv": "1fiFrK07KLZ1Ah14ZiSxoqfFB4BYtQtIT"
}

# Download files
for filename, file_id in file_ids.items():
    gdown.download(f"https://drive.google.com/uc?id={file_id}", filename, quiet=False)

# Load the data into Pandas
df1 = pd.read_csv("data1.csv")  # Change to read_excel() for Excel files
df2 = pd.read_csv("data2.csv")

# Display first few rows of each dataset
print("First file preview:")
display(df1.head())

print("\nSecond file preview:")
display(df2.head())


In [None]:
# Inner join on the shared `id` column; rows whose `id` appears in only one file are dropped
merged_df = df1.merge(df2, on="id", how="inner")

In [None]:
merged_df.head()
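
Because a credits file lists several people per title, a merge on `id` is expected to repeat each title once per credited person. The sketch below, which uses small made-up frames rather than the project files, shows one way to check that assumption with the `validate` and `indicator` options of `DataFrame.merge`. The frame names `titles` and `credits` and all values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the titles and credits files (hypothetical values).
titles = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["A", "B", "C"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Actor 1", "Actor 2", "Actor 3"],
    "role": ["ACTOR", "ACTOR", "DIRECTOR"],
})

# validate="one_to_many" raises MergeError if `id` is duplicated in titles.
# indicator=True adds a `_merge` column showing where each row came from.
checked = titles.merge(
    credits, on="id", how="left", validate="one_to_many", indicator=True
)

# Number of credit rows attached to each title.
rows_per_title = checked.groupby("id")["name"].count()

# Titles that an inner merge would silently drop.
titles_without_credits = checked.loc[
    checked["_merge"] == "left_only", "id"
].tolist()

print(rows_per_title.to_dict())
print(titles_without_credits)
```

A left merge is used here only to make unmatched titles visible; whether to keep them is a separate decision.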

### Dataset Rows & Columns count

In [None]:
print("No of rows:",merged_df.shape[0])
print("No of columns:",merged_df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
merged_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("duplicate value count:",merged_df.duplicated().sum())

In [None]:
# Remove exact duplicate rows
merged_df.drop_duplicates(inplace=True)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values=merged_df.isnull().sum()
print(missing_values)

In [None]:
# Visualizing the missing values
missing_values = missing_values[missing_values > 0]  # Only show columns with missing values

plt.figure(figsize=(10, 5))
missing_values.plot(kind="bar", color="orange")
plt.xlabel("Columns")
plt.ylabel("Count of Missing Values")
plt.title("Missing Values Per Column")
plt.xticks(rotation=45)
plt.show()
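
Raw counts are hard to compare across columns without knowing the number of rows, so a percentage view can complement the bar chart. The following is a minimal sketch on a made-up frame (the name `toy` and its values are hypothetical); the same expressions would apply to `merged_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in two columns.
toy = pd.DataFrame({
    "imdb_score": [7.1, np.nan, 6.4, np.nan],
    "seasons": [np.nan, 2.0, np.nan, np.nan],
    "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
})

# isnull().mean() gives the fraction of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns with gaps, worst first.
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_pct", ascending=False)
)
print(missing_summary)
```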

### What did you know about your dataset?

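The data comes from two CSV files, a titles file and a credits file, joined on the shared `id` column with an inner merge. Because the credits file lists cast and crew, the merged frame holds one row per title-person pair rather than one row per title, so title-level statistics need de-duplication on `id`. Exact duplicate rows are dropped, and the columns `character`, `age_certification`, `imdb_score`, `tmdb_score`, and `seasons` are the ones filled in the wrangling step because they contain missing values.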

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:")
for col in merged_df.columns:
    print("-", col)

In [None]:
# Dataset Describe
# Summary of numerical columns
print(merged_df.describe())
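
`describe()` does not report skewness, which matters when characterizing a rating distribution. As a hedged sketch on made-up ratings (the series below is hypothetical, not taken from the dataset), `Series.skew()` gives the sign of the skew and `Series.between()` gives the share of values inside a range:

```python
import pandas as pd

# Hypothetical ratings with a long tail toward low scores.
ratings = pd.Series(
    [2.0, 5.5, 6.0, 6.5, 7.0, 7.0, 7.5, 8.0], name="imdb_score"
)

# Negative skew: tail on the low side; positive skew: tail on the high side.
skewness = ratings.skew()

# Share of ratings between 5.0 and 8.0 (both bounds inclusive).
share_5_to_8 = ratings.between(5.0, 8.0).mean()

print(f"skewness: {skewness:.2f}")
print(f"share in [5.0, 8.0]: {share_5_to_8:.3f}")
```

Running the same two lines on the title-level `imdb_score` column would show which way the real distribution leans.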

### Variables Description

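Columns used in this notebook:

- `id`: identifier shared by both files; used as the merge key.
- `title`: name of the movie or show.
- `type`: content type (movie or show).
- `genres`: genres of the title, stored as a list-like string in the CSV and parsed into a Python list.
- `production_countries`: countries of origin, stored and parsed the same way as `genres`.
- `release_year`: year the title was released.
- `runtime`: running time of the title.
- `seasons`: number of seasons; missing values are filled with 0.
- `age_certification`: age rating label.
- `imdb_score`, `imdb_votes`: IMDb rating and number of IMDb votes.
- `tmdb_score`, `tmdb_popularity`: TMDb rating and TMDb popularity.
- `character`: character played by the credited person; missing values are filled with "Unknown".
- `role`: role of the credited person in the title.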

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
print("Unique value counts for each column:")
unique_values = merged_df.nunique()
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Make the dataset analysis ready.
# Handle missing values: placeholder labels for text columns, medians for scores, 0 for seasons
merged_df.fillna({
    "character": "Unknown",
    "age_certification": "Not Rated",
    "imdb_score": merged_df["imdb_score"].median(),
    "tmdb_score": merged_df["tmdb_score"].median(),
    "seasons": 0
}, inplace=True)
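
Filling scores with a single median hides which values were imputed and ignores differences between movies and shows. One possible variation, sketched on a made-up frame (the name `toy` and the flag column `imdb_score_imputed` are hypothetical), records the imputed rows first and then fills with the median of each content type:

```python
import numpy as np
import pandas as pd

# Hypothetical title-level sample.
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [6.0, 8.0, np.nan, 7.5, np.nan],
})

# Remember which rows had no score before filling anything.
toy["imdb_score_imputed"] = toy["imdb_score"].isna()

# Median of the observed scores within each content type, aligned row by row.
group_median = toy.groupby("type")["imdb_score"].transform("median")
toy["imdb_score"] = toy["imdb_score"].fillna(group_median)

print(toy)
```

The flag column makes it possible to exclude imputed values later, for example when plotting a rating distribution.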

In [None]:
import ast

# Convert 'genres' & 'production_countries' from string to list
merged_df["genres"] = merged_df["genres"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])
merged_df["production_countries"] = merged_df["production_countries"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])

# Convert 'release_year' and 'runtime' to integer (if not already)
merged_df["release_year"] = merged_df["release_year"].astype(int)
merged_df["runtime"] = merged_df["runtime"].astype(int)

# Convert 'seasons' to integer (some values might be float due to NaNs)
merged_df["seasons"] = merged_df["seasons"].astype(int)

# Convert 'imdb_score', 'tmdb_score', 'imdb_votes', and 'tmdb_popularity' to float
merged_df["imdb_score"] = merged_df["imdb_score"].astype(float)
merged_df["tmdb_score"] = merged_df["tmdb_score"].astype(float)
merged_df["imdb_votes"] = merged_df["imdb_votes"].astype(float)
merged_df["tmdb_popularity"] = merged_df["tmdb_popularity"].astype(float)

print("\n Data Type Fixing Completed!")
print(merged_df.dtypes)
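
`ast.literal_eval` raises an exception on malformed input, which would stop the notebook partway through. Since the guidelines ask for exception handling, here is a small sketch of a tolerant parser; the helper name `parse_list` and the sample strings are hypothetical.

```python
import ast

import pandas as pd


def parse_list(value):
    """Parse a list-like string such as "['drama', 'comedy']"; return [] on failure."""
    if isinstance(value, list):
        return value  # already parsed
    if not isinstance(value, str):
        return []  # NaN, None, or another non-string value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # malformed string
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Hypothetical raw values, including a missing entry and a truncated string.
raw = pd.Series(["['drama', 'comedy']", "[]", None, "['drama'", "['US']"])
parsed = raw.apply(parse_list)
print(parsed.tolist())
```

Returning an empty list for bad input keeps later list comprehensions over the column safe, at the cost of silently hiding malformed rows, so counting how many rows fall back to `[]` is a reasonable follow-up check.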

In [None]:
#Normalize categorical data
# Convert 'type' column to uppercase (MOVIE/SHOW)
merged_df["type"] = merged_df["type"].str.upper()

# Convert 'role' column to lowercase
merged_df["role"] = merged_df["role"].str.lower()

# Strip spaces and standardize 'age_certification'
merged_df["age_certification"] = merged_df["age_certification"].str.strip().str.upper()

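# Keep recognized film and TV certifications; label anything else as UNKNOWN.
# TV labels are included so that shows are not all collapsed into UNKNOWN.
valid_certifications = {
    "G", "PG", "PG-13", "R", "NC-17",                    # film ratings
    "TV-Y", "TV-Y7", "TV-G", "TV-PG", "TV-14", "TV-MA",  # TV ratings
    "NOT RATED",
}
merged_df["age_certification"] = merged_df["age_certification"].where(
    merged_df["age_certification"].isin(valid_certifications), "UNKNOWN"
)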

print("\nCategorical Data Normalization Completed!")
print(merged_df[["type", "role", "age_certification"]].head())
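
Before collapsing unexpected labels into `UNKNOWN`, it can help to list which values fall outside the accepted set, so that legitimate labels are not lost by accident. A minimal sketch with made-up values follows; the series `cert` and the set `known` are hypothetical and deliberately small.

```python
import pandas as pd

# Hypothetical raw certification labels with inconsistent case and spacing.
cert = pd.Series(["R", " pg-13", "TV-MA", "Not Rated", "tv-14"])

# Same normalization as in the cell above: strip spaces, then upper-case.
cleaned = cert.str.strip().str.upper()

# A deliberately small accepted set, for illustration only.
known = {"G", "PG", "PG-13", "R", "NC-17", "NOT RATED"}

# Labels that would be replaced by UNKNOWN.
unmapped = sorted(set(cleaned) - known)
print(unmapped)

# Apply the replacement only after reviewing `unmapped`.
normalized = cleaned.where(cleaned.isin(known), "UNKNOWN")
print(normalized.tolist())
```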

### What all manipulations have you done and insights you found?

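The following manipulations were applied to `merged_df`:

- Exact duplicate rows were removed.
- Missing values were filled: "Unknown" for `character`, "Not Rated" for `age_certification`, the column median for `imdb_score` and `tmdb_score`, and 0 for `seasons`.
- `genres` and `production_countries` were parsed from strings into Python lists.
- `release_year`, `runtime`, and `seasons` were cast to integers; `imdb_score`, `tmdb_score`, `imdb_votes`, and `tmdb_popularity` were cast to floats.
- `type` was upper-cased, `role` was lower-cased, and `age_certification` was stripped, upper-cased, and restricted to a fixed set of labels, with everything else set to "UNKNOWN".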

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
from collections import Counter

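# The merge yields one row per title-person pair, so keep one row per title first
titles_df = merged_df.drop_duplicates(subset="id")

# Flatten the list of genres (since genres are stored as lists)
all_genres = [genre for sublist in titles_df["genres"] for genre in sublist]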

# Count the occurrences of each genre
genre_counts = Counter(all_genres).most_common(10)  # Top 10 genres

# Convert to a DataFrame
genre_df = pd.DataFrame(genre_counts, columns=["Genre", "Count"])

# Plot the bar chart (fixing the warning)
plt.figure(figsize=(13, 5))
sns.barplot(data=genre_df, x="Count", y="Genre", hue="Genre", palette="plasma", dodge=False, legend=False)

# Add labels and title
plt.xlabel("Number of Titles", fontsize=12, fontweight="bold")
plt.ylabel("Genre", fontsize=12, fontweight="bold")
plt.title("Top 10 Most Common Genres on Amazon Prime", fontsize=14, fontweight="bold")

# Show values on bars
for index, value in enumerate(genre_df["Count"]):
    plt.text(value + 5, index, str(value), va="center", fontsize=12, fontweight="bold")

plt.show()
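
When a titles table is merged with credits, every title appears once per credited person, so counting genres over all rows weights each title by the size of its cast. The sketch below uses a made-up frame (the name `merged_toy` and its values are hypothetical) to contrast row-level and title-level counts, using `Series.explode()` as an alternative to a list comprehension.

```python
import pandas as pd

# Hypothetical merged frame: title t1 has three credit rows, t2 and t3 have one.
merged_toy = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3"],
    "genres": [["drama", "comedy"]] * 3 + [["drama"], ["thriller"]],
})

# Row-level counts: t1's genres are counted three times.
row_level = merged_toy["genres"].explode().value_counts()

# Title-level counts: one row per `id` before exploding.
titles_only = merged_toy.drop_duplicates(subset="id")
title_level = titles_only["genres"].explode().value_counts()

print(row_level.to_dict())
print(title_level.to_dict())
```

Only the title-level counts match an axis labelled "Number of Titles".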

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Set figure size
plt.figure(figsize=(8, 6))

# Create scatter plot (one point per title, not per cast/crew row)
sns.scatterplot(data=merged_df.drop_duplicates(subset="id"), x="imdb_score", y="tmdb_score", alpha=0.5, color="purple")

# Add labels and title
plt.xlabel("IMDb Rating", fontsize=12, fontweight="bold")
plt.ylabel("TMDb Rating", fontsize=12, fontweight="bold")
plt.title("Correlation Between IMDb and TMDb Ratings", fontsize=14, fontweight="bold")

# Show the plot
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()
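
A scatter plot shows the shape of the relationship, and a coefficient puts a number on it. The sketch below computes Pearson and Spearman correlations on made-up scores (the frame `scores` is hypothetical). Spearman is computed here as the Pearson correlation of the ranks, which is its definition and needs only pandas.

```python
import pandas as pd

# Hypothetical paired ratings for five titles.
scores = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_score": [5.5, 5.9, 7.2, 7.8, 9.5],
})

# Pearson: strength of the linear relationship.
pearson = scores["imdb_score"].corr(scores["tmdb_score"])

# Spearman: Pearson correlation of the ranks, i.e. monotonic agreement.
spearman = scores["imdb_score"].rank().corr(scores["tmdb_score"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If scores were filled with a median earlier, excluding the imputed rows before computing the coefficient avoids pulling it toward zero.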

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Count the number of Movies and TV Shows (one row per title, not per cast/crew row)
content_counts = merged_df.drop_duplicates(subset="id")["type"].value_counts()

# Define colors for better contrast
colors = ["orange", "green"]

# Set figure size
plt.figure(figsize=(8, 8))

# Create a pie chart with a center circle (Donut Chart)
plt.pie(content_counts, labels=content_counts.index, autopct="%1.1f%%", colors=colors, startangle=140, wedgeprops={"edgecolor": "black"})
plt.gca().add_artist(plt.Circle((0, 0), 0.5, color="white"))  # Create the donut hole

# Add title
plt.title("Distribution of Movies vs. TV Shows on Amazon Prime", fontsize=14, fontweight="bold")

# Show the plot
plt.show()

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Calculate the average IMDb rating for Movies and TV Shows (one row per title)
content_type_ratings = merged_df.drop_duplicates(subset="id").groupby("type")["imdb_score"].mean()

# Define bright colors
colors = ["green", "purple"]

# Set figure size
plt.figure(figsize=(8, 6))

# Create a bar chart; a pie chart would show each mean as a share of their sum, which has no meaning
plt.bar(content_type_ratings.index, content_type_ratings.values, color=colors, edgecolor="black")

# Add labels and title
plt.xlabel("Content Type", fontsize=12, fontweight="bold")
plt.ylabel("Average IMDb Rating", fontsize=12, fontweight="bold")
plt.title("Average IMDb Ratings: Movies vs. TV Shows on Amazon Prime", fontsize=14, fontweight="bold")

# Show the plot
plt.show()
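
An average on its own hides how many titles stand behind it and how sensitive it is to outliers. As a small sketch on made-up data (the frame `toy` is hypothetical), `groupby(...).agg(...)` can report the mean, median, and count side by side:

```python
import pandas as pd

# Hypothetical title-level sample.
toy = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [6.0, 7.0, 8.0, 7.0, 9.0],
})

# One row per content type, with three summary statistics as columns.
summary = toy.groupby("type")["imdb_score"].agg(["mean", "median", "count"])
print(summary)
```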

#### Chart - 8

In [None]:
# Chart - 8 visualization code
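# Select numerical columns for correlation, using one row per title.
# person_id (dropped only if present) is an identifier, not a measurement.
numeric_cols = merged_df.drop_duplicates(subset="id").select_dtypes(include="number").drop(columns=["person_id"], errors="ignore")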

# Compute the correlation matrix
correlation_matrix = numeric_cols.corr()

# Set figure size
plt.figure(figsize=(10, 6))

# Create heatmap
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)

# Add title
plt.title("Correlation Heatmap of Amazon Prime Content Features", fontsize=14, fontweight="bold")

# Show the plot
plt.show()

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Count number of titles released per year (one row per title, not per cast/crew row)
yearly_trend = merged_df.drop_duplicates(subset="id")["release_year"].value_counts().sort_index()

# Set figure size
plt.figure(figsize=(10, 5))

# Create line plot
sns.lineplot(x=yearly_trend.index, y=yearly_trend.values, marker="o", color="blue")

# Add labels and title
plt.xlabel("Release Year", fontsize=12, fontweight="bold")
plt.ylabel("Number of Titles", fontsize=12, fontweight="bold")
plt.title("Trend of Movies and TV Shows Production Over the Years", fontsize=14, fontweight="bold")

# Show the plot
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()
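
Yearly counts can be noisy, so grouping by decade or computing the share of titles released after a cutoff year gives a more direct check on a trend claim. The sketch below uses a made-up title-level frame (the name `titles` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical title-level sample.
titles = pd.DataFrame({
    "id": [f"t{i}" for i in range(8)],
    "release_year": [1995, 2008, 2011, 2011, 2015, 2019, 2019, 2019],
})

# Titles per year, in chronological order.
per_year = titles["release_year"].value_counts().sort_index()

# Integer division maps each year to the start of its decade.
decade = (titles["release_year"] // 10) * 10
per_decade = titles.groupby(decade)["id"].count()

# Fraction of titles released after 2010.
share_after_2010 = (titles["release_year"] > 2010).mean()

print(per_decade.to_dict())
print(f"share released after 2010: {share_after_2010:.2f}")
```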

#### Chart - 10 - Pair Plot

In [None]:
# Pair Plot visualization code (one row per title, not per cast/crew row)
numeric_cols = merged_df.drop_duplicates(subset="id")[["imdb_score", "tmdb_score", "runtime", "imdb_votes", "tmdb_popularity"]]

# Create pair plot
sns.pairplot(numeric_cols, diag_kind="kde", corner=True, plot_kws={"alpha": 0.5})

# Show the plot
plt.show()

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
To achieve the business objective, the client should base content decisions on the patterns found in the merged titles and credits data. A structured approach:

1. **Prioritize dominant genres**
   - Drama, Comedy, Action, and Thriller are the most common genres in the catalog.
   - Compare genre frequency with ratings before committing to further acquisitions in a genre.
2. **Build on the post-2010 growth in content**
   - Releases rise sharply after 2010, in line with the growth of streaming.
   - Further investment in recent and original content may boost engagement.
3. **Use ratings in recommendations**
   - Most titles are rated between 5.0 and 8.0 on IMDb, so ratings can help surface the stronger titles.
   - IMDb and TMDb scores can be combined with vote counts and popularity to rank titles.
4. **Review regional content**
   - Use the production country of each title to assess how content is spread across regions.
   - Adjust regional offerings where the catalog is thin.
5. **Improve data quality**
   - Fill or flag missing values in cast and rating columns.
   - Keep title-level and credit-level statistics separate to avoid counting a title once per cast member.

# **Conclusion**

In this analysis, we merged the titles and credits datasets on the `id` column and explored the result. The key takeaways include:

1. **Data Quality & Integrity:** Missing values and inconsistent formats were identified and handled, giving a cleaner dataset for analysis.

2. **Content Insights:** The catalog contains more movies than TV shows, Drama is the dominant genre, releases rise sharply after 2010, and most IMDb ratings fall between 5.0 and 8.0.

3. **Strategic Decision-Making:** These patterns can inform content acquisition and rating-based recommendations.

4. **Future Recommendations:** Extend the analysis to regional content and keep improving data quality so that later models rest on reliable inputs.

**Final Thought:** This analysis provides a foundation for data-driven content decisions on Amazon Prime Video.






