# **Project Name** - Exploratory Data Analysis of Amazon Prime Video


# **Project Summary -**
The goal of this project is to explore and analyze Amazon Prime Video’s content catalog by examining features such as content type, genres, age certifications, scores (IMDb/TMDB), and runtime to derive meaningful insights.

Steps:-

Data Collection and Cleaning : The project used two datasets containing information about titles and credits for movies and TV shows on Amazon Prime Video. The data was cleaned by handling missing values, removing duplicates, and ensuring data consistency.


Data Visualization : Various visualization techniques were used, including box plots, histograms, bar plots, scatter plots, line plots, pie charts, and violin plots. These visualizations helped uncover patterns and relationships within the data.

Insights : The analysis revealed several key insights:

Amazon Prime Video has more movies than TV shows. Drama and comedy are the most common genres.

Average runtimes fluctuate over the years, with at most a slight decrease in recent years, and the average number of seasons shows no clear long-term trend.

Average IMDb scores have remained relatively stable across release years.

The United States is the primary content producer, followed by India and the United Kingdom.

This EDA provides valuable information for understanding the landscape of Amazon Prime Video's content. Content creators, platform strategists, and viewers can benefit from this analysis for making informed decisions. Future research could focus on specific genres, regions, or content types for deeper insights.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**
The rapid growth of Over-The-Top (OTT) platforms has intensified the competition among streaming services like Netflix, Hulu, Disney+, and Amazon Prime Video. In order to maintain user engagement and expand their subscriber base, these platforms must offer content that aligns with audience preferences across regions, genres, and demographics.

This project aims to perform an Exploratory Data Analysis (EDA) on Amazon Prime Video's content catalog to uncover insights related to the type, genre, age certification, release year, and rating of the content offered. By analyzing this data, we seek to identify trends and patterns that can support content strategy, audience targeting, and competitive positioning.


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import ast
import warnings

In [None]:
#import first dataset
titles1= pd.read_csv('/content/titles.csv')

In [None]:
#import second dataset
credits1 = pd.read_csv('/content/credits.csv')

In [None]:
#merging of the two dataset
data= pd.merge(titles1,credits1,on='id')

In [None]:
#printing merged dataset
data

In [None]:
data.columns


In [None]:
#information of dataset
data.info()


# DATASET DESCRIPTION

The dataset is related to movies and TV shows available on Amazon Prime Video. It's formed by merging two datasets: titles.csv and credits.csv.

The titles.csv dataset contains information about movies and TV shows available on Amazon Prime Video. Each row represents a single title.

credits.csv dataset contains information about the people involved in the production of the titles (actors, directors, etc.). Each row represents a person's involvement in a specific title.
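
Because credits.csv holds one row per person per title, merging on `id` is a one-to-many join: each title row is repeated once per credited person, and titles without credits drop out of an inner join. The sketch below illustrates this on small made-up frames (`titles_toy` and `credits_toy` are hypothetical stand-ins, not the real files); `how` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (illustrative values only).
titles_toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["A", "B", "C"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
})
credits_toy = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "person_id": [10, 11, 12],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Inner join (the pd.merge default): titles without credits are dropped,
# and titles with several credits appear once per credited person.
# validate="one_to_many" raises if "id" is not unique in the left frame.
merged_toy = pd.merge(titles_toy, credits_toy, on="id", how="inner",
                      validate="one_to_many")

rows_per_title = merged_toy.groupby("id").size()
print(merged_toy)
print(rows_per_title)
```

In this toy case, title `t1` appears twice and `t3` disappears, so any count taken on the merged frame is a count of credits rather than of titles.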

DATASET COLUMN DESCRIPTION

id: Unique identifier for each title (movie or TV show)

title: Title of the movie or TV show

type: Whether it's a 'SHOW' (TV show) or a 'MOVIE'

description: A brief description of the title

release_year: The year the title was released

age_certification: Age rating or certification (e.g., PG-13, TV-MA)

runtime: Duration of the movie or TV show in minutes

genres: List of genres associated with the title (e.g., ['comedy', 'drama'])

production_countries: List of countries where the title was produced

seasons: Number of seasons (for TV shows only)

imdb_id: IMDb identifier for the title

imdb_score: IMDb rating of the title

imdb_votes: Number of votes the title received on IMDb

tmdb_popularity: Popularity score from TMDb (The Movie Database)

tmdb_score: Rating score from TMDb

person_id: Unique identifier for a person involved in the production (actor, director, etc.)

name: Name of the person (actor, director, etc.)

character: The character played by the person (for actors)

role: The role of the person in the production (e.g., 'ACTOR', 'DIRECTOR')

# MISSING VALUES OR NULL VALUES


In [None]:

#check for null values
data.isnull().sum()

In [None]:
#Dropping null values in description
data.dropna(subset = ['description'], inplace = True)

In [None]:
#Filling null values of age certification column with the mode
data['age_certification'] = data['age_certification'].fillna(data['age_certification'].mode()[0])

In [None]:
#replacing null values in seasons column with 0
data['seasons'] = data['seasons'].fillna(0)

#Dropping all rows with a null imdb_id
data.dropna(subset = ['imdb_id'], inplace = True)

#replacing all null entries of imdb score with the mean
data['imdb_score'] = data['imdb_score'].fillna(round(data['imdb_score'].mean(),1))

#replacing all the null values of imdb votes with 0
data['imdb_votes'] = data['imdb_votes'].fillna(0)

# replace null values in 'tmdb_popularity' column with mean
data['tmdb_popularity'] = data['tmdb_popularity'].fillna(round(data['tmdb_popularity'].mean(),2))

# replace null values in 'tmdb_score' column with mean
data['tmdb_score'] = data['tmdb_score'].fillna(round(data['tmdb_score'].mean(),1))

#replacing null values in character with 'unknown'
data['character'] = data['character'].fillna('unknown')

data.isnull().sum()

All null values have now been handled.
The next step is to check for duplicate rows.

In [None]:
data.duplicated().sum()

There are 168 duplicate rows, which need to be dropped.


In [None]:
data.drop_duplicates(inplace = True)
#again checking for duplicate data
data.duplicated().sum()

In [None]:

data.describe()

The dataset is now cleaned.

# DATA MANIPULATION


Manipulation performed on the given data is as follows:-

Data Loading and Merging : Two datasets, titles.csv and credits.csv, were loaded and merged based on the common column 'id'.

Handling Missing Values :
Null values in the 'description' and 'imdb_id' columns were dropped.
Null values in 'age_certification' were replaced with the mode.
Null values in 'seasons', 'imdb_votes', and 'character' were replaced with 0, 0, and 'unknown', respectively.
Null values in 'imdb_score', 'tmdb_popularity', and 'tmdb_score' were replaced with their respective means.

Handling Duplicates : Duplicate rows were identified and removed from the dataset.
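
The missing-value steps listed above can be sketched on a small made-up frame. This is only an illustration under assumed values (the frame `toy` is hypothetical), and it assigns the result of `fillna` rather than using `inplace=True` on a single column, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values only).
toy = pd.DataFrame({
    "description": ["a", None, "c", "d", "e"],
    "age_certification": ["R", "PG", None, "R", "PG-13"],
    "seasons": [np.nan, 2.0, np.nan, 1.0, np.nan],
    "imdb_score": [6.0, 7.0, np.nan, 8.0, 7.0],
})

# Drop rows where a required text field is missing.
toy = toy.dropna(subset=["description"])

# Fill the remaining gaps column by column: mode for the categorical
# column, 0 for seasons, and the rounded mean for the score.
toy = toy.fillna({
    "age_certification": toy["age_certification"].mode()[0],
    "seasons": 0,
    "imdb_score": round(toy["imdb_score"].mean(), 1),
})
print(toy)
```

Passing a dictionary to `DataFrame.fillna` keeps all the fill rules in one place, which makes it easier to see which columns were imputed and how.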

# DATA VISUALIZATION

Data Visualization is the graphical representation of information and data using visual elements like charts, graphs, maps, and dashboards. It helps transform complex numerical or textual data into visual formats that are easier to understand, analyze, and communicate.

# UNIVARIATE VISUALIZATION
Definition: Involves one variable

Goal: Understand the distribution or frequency of a single variable
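
One caveat for the univariate plots: the merged dataset has one row per credit, so title-level variables such as runtime or imdb_score are repeated once per credited person, and titles with large casts weigh more heavily in any distribution. A minimal sketch of the effect, using a hypothetical toy frame and `drop_duplicates(subset="id")` to get back to one row per title:

```python
import pandas as pd

# Toy merged frame: title "t1" has three credits, "t2" has one
# (illustrative values only).
merged_toy = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2"],
    "runtime": [120, 120, 120, 30],
    "person_id": [1, 2, 3, 4],
})

# Credit-level mean: t1's runtime is counted three times.
credit_level_mean = merged_toy["runtime"].mean()

# Title-level mean: keep one row per title before summarizing.
titles_only = merged_toy.drop_duplicates(subset="id")
title_level_mean = titles_only["runtime"].mean()

print(credit_level_mean, title_level_mean)
```

Whether credit-level or title-level weighting is appropriate depends on the question; for statements about "titles", the title-level frame is the closer match.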

# BOX PLOT

1.Box plot of tmdb popularity


In [None]:

# Box plot of tmdb_popularity
plt.figure(figsize = (8,6))
sns.boxplot(data = data, y = 'tmdb_popularity')
plt.title('Box plot of tmdb_popularity\n', color = 'brown')
plt.show()


INSIGHTS

From the plot, we can see that the majority of titles have a relatively low TMDB popularity score, as indicated by the compressed box, alongside a large number of outliers with much higher popularity. This suggests a strongly right-skewed distribution: most content has low popularity, while a few highly popular titles stretch the upper tail.
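
To make the notion of "outliers" concrete, the following sketch applies the 1.5 × IQR rule that box plots use by default to a small made-up series. The values in `popularity` are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative popularity values: mostly small, with one very large value.
popularity = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 60.0])

q1 = popularity.quantile(0.25)
q3 = popularity.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = popularity[(popularity < lower_fence) | (popularity > upper_fence)]
print(outliers.tolist())
```

The same calculation on `data['tmdb_popularity']` would give the number of points plotted beyond the whiskers.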

2.Box plot of imdb score


In [None]:
# Box plot of imdb_score
plt.figure(figsize = (8,6))
sns.boxplot(data = data, y = 'imdb_score')
plt.title('Box plot of imdb_score\n', color = 'brown')
plt.show()

INSIGHTS

From this box plot, we can see that the IMDb scores are relatively spread out, with a median score around 6.0. There are some outliers on both the lower and higher ends, indicating a few titles with significantly lower or higher scores compared to the majority.

3.Box plot of runtime

In [None]:
#Box plot of runtime
plt.figure(figsize = (8,6))
sns.boxplot(data = data, x = 'runtime')
plt.title('Box plot of runtime\n', color = 'brown')
plt.show()


INSIGHTS

From this box plot, we can see that the runtime of titles varies, with a median runtime around 90 minutes. There are several outliers with much longer runtimes, which likely represent movies or longer TV show episodes. The majority of content appears to be within a typical movie or episode length range.

4.Histogram of Release_year


In [None]:
plt.figure(figsize = (8,6))
sns.histplot(data = data, x = 'release_year', bins = 20, color = 'blue', kde = True)
plt.title('Histogram of release year\n', color = 'blue')
plt.show()

INSIGHTS

From this histogram, we can observe that the number of titles in the catalog rises with release year, with a significant peak in recent years. Most content was released in the 21st century, particularly in the last two decades. This suggests that Amazon Prime Video's library leans toward newer releases.

5.Histogram of TMDB score


In [None]:
plt.figure(figsize = (8,6))
sns.histplot(data = data, x = 'tmdb_score', bins = 20, color = 'blue', kde = True)
plt.title('Histogram of tmdb score\n', color = 'blue')
plt.show()

INSIGHTS

Based on this histogram, we can see that the TMDB scores are somewhat normally distributed, with a peak around the mean score (which we saw was around 6.0 in the .describe() output). Part of that peak comes from the cleaning step, since missing scores were filled with the mean. There are fewer titles with very low or very high scores, and the majority of content has a TMDB score in the middle range.

6.Histogram of Types

In [None]:
plt.figure(figsize = (8,6))
#type is categorical, so bins and a KDE curve do not apply
sns.histplot(data = data, x = 'type', color = 'blue')
plt.title('Histogram of types\n', color = 'blue')
plt.show()

INSIGHTS

From this histogram, we can clearly see that there are significantly more "MOVIE" titles than "SHOW" titles in the dataset. This indicates that Amazon Prime Video's content library is heavily skewed towards movies.

# BAR PLOT/COUNT PLOT

A bar plot is a type of chart that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent.

7. Bar plot of type

In [None]:

#Bar plot of type
a = data.type.value_counts()
plt.figure(figsize=(8,6))
sns.barplot(x = a.index, y= a.values, width=0.5, color='cyan', edgecolor = 'black')
plt.title('Bar plot of type(Movie or TV show)\n', color = 'brown')
plt.ylabel('count')
plt.show()

INSIGHTS

Similar to the histogram of types, this bar plot clearly shows that there are significantly more movies than TV shows in the dataset. This further reinforces the observation that Amazon Prime Video's content library has a strong emphasis on movies.

8.Bar plot of Age_certification


In [None]:
b = data.age_certification.value_counts()
plt.figure(figsize = (8,6))
sns.barplot(x = b.index, y = b.values, color= 'yellow', edgecolor = 'black')
plt.title('Bar plot of age certification\n', color = 'brown')
plt.ylabel('count')
plt.show()

INSIGHTS

From this bar plot, we can see which age certifications are most common in the Amazon Prime Video catalog. It appears that the 'TV-MA' certification has the highest count, followed by 'R' and 'PG-13'. This suggests a significant portion of the content is geared towards mature audiences. Note that missing certifications were filled with the mode during cleaning, which inflates the count of the most common category.

9. Bar plot of Genres

In [None]:
genre_counts = data['genres'].value_counts()
a = genre_counts[genre_counts>1500]
plt.figure(figsize = (8,6))
sns.barplot(y = a.index, x = a.values, color = 'orange', edgecolor = 'black', orient='h' )
plt.title('Bar plot of genres\n', color = 'brown')
plt.xlabel('count')
plt.show()

INSIGHTS

Because the plot counts the raw genres column, each bar is a full genre list rather than a single genre. Based on this bar plot, 'drama' is the most prevalent entry, followed by 'comedy'. Combinations like 'drama, romance' and single genres like 'horror' also appear frequently. This suggests that drama and comedy are key content pillars for Amazon Prime Video, with a significant presence of romance, horror, and documentary content, as well as titles that combine drama and comedy.
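
To count individual genres instead of genre combinations, the list-like strings can be parsed and expanded. The sketch below assumes the genres column stores strings such as "['comedy', 'drama']" (an assumption based on the column description) and uses a hypothetical toy frame; `ast.literal_eval` and `DataFrame.explode` are standard library and pandas calls.

```python
import ast
import pandas as pd

# Toy title-level frame; genres are stored as list-like strings
# (illustrative values only).
toy = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "genres": ["['drama', 'romance']", "['comedy', 'drama']", "[]"],
})

# Parse each string into a real Python list, then give every genre its own row.
toy["genre_list"] = toy["genres"].apply(ast.literal_eval)
genre_counts = toy.explode("genre_list")["genre_list"].value_counts()
print(genre_counts)
```

Titles with an empty genre list become missing values after `explode` and are left out of `value_counts` by default.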

10.Bar plot of production countries

In [None]:
prod_countries = data['production_countries'].value_counts()
countries = prod_countries[prod_countries >1000]
plt.figure(figsize=(8,6))
sns.barplot(x = countries.values, y = countries.index, orient='h', color = 'lightblue', edgecolor = 'black')
plt.title('Bar plot of production countries\n', color = 'brown')
plt.show()

INSIGHTS

This bar plot shows the most frequent values of production_countries (those with more than 1000 rows) for content on Amazon Prime Video. The length of each bar represents the number of rows for that country or combination of countries.

From this bar plot, it's evident that the United States ('US') is the primary production country by a significant margin, followed by India ('IN') and the United Kingdom ('GB'). There are also notable contributions from Canada ('CA'), Japan ('JP'), and Australia ('AU'), as well as co-productions such as 'CA' with 'US' and 'GB' with 'US'. This indicates that a large portion of Amazon Prime Video's content is produced in these key regions, with a strong focus on the US.

# BIVARIATE ANALYSIS

# SCATTER PLOT
A scatter plot is a type of data visualization that uses dots to represent the values of two numerical variables. Scatter plots are used to observe relationships between variables.

Scatter plot of imdb scores vs imdb votes

In [None]:
sns.scatterplot(data = data, x = 'imdb_votes', y = 'imdb_score', color = 'blue')
plt.title('Scatter plot of imdb_vote and imdb_scores\n', color = 'brown')
plt.show()

INSIGHTS

1.Each point on the plot represents a title, with its position determined by its IMDb vote count on the x-axis and its IMDb score on the y-axis.

2.The scatter plot helps visualize if there's a correlation between the number of votes a title receives and its score.

From this plot, we can observe that there is a general trend where titles with a higher number of IMDb votes tend to have a wider range of IMDb scores, and some of the highest scores are associated with titles that have a significant number of votes. However, there are also many titles with high IMDb scores that have relatively few votes, and a large cluster of titles with low vote counts covering a broad spectrum of scores. This suggests that while popular titles (with more votes) can achieve high scores, a high number of votes doesn't guarantee a high score, and less popular titles can also have high ratings.
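
The visual impression from the scatter plot can be checked with a correlation coefficient. The following is a minimal sketch on made-up numbers (the frame `toy` is hypothetical); it uses `DataFrame.corr`, which supports both the Pearson and Spearman methods.

```python
import pandas as pd

# Illustrative votes and scores (not taken from the dataset).
toy = pd.DataFrame({
    "imdb_votes": [10, 200, 3_000, 40_000, 500_000],
    "imdb_score": [5.1, 7.9, 6.2, 6.8, 8.3],
})

# Pearson measures linear association; Spearman works on ranks and is
# less sensitive to the heavy right skew typical of vote counts.
pearson = toy.corr(method="pearson").loc["imdb_votes", "imdb_score"]
spearman = toy.corr(method="spearman").loc["imdb_votes", "imdb_score"]
print(round(pearson, 2), round(spearman, 2))
```

For skewed variables such as vote counts, the rank-based coefficient is often the more informative of the two.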

# LINE PLOT

A line plot connects data points with a line to show trends in a continuous variable, typically over time.



Line plot of runtime v/s release_year

In [None]:
plt.figure(figsize = (8,6))
sns.lineplot(data = data, y = 'runtime', x = 'release_year')
plt.title('line plot of runtime v/s release_year\n', color = 'brown')
plt.show()

INSIGHTS

From this plot, we can observe fluctuations in the average runtime of titles over the years. There doesn't appear to be a strong, consistent upward or downward trend across the entire history, but there are periods where the average runtime seems to increase or decrease. For instance, there might be a slight decrease in average runtime in more recent years, or perhaps more variability in earlier years due to different types of content or data availability.
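
When several rows share the same x value, seaborn's `lineplot` by default draws the mean of y at each x together with a confidence band. The same yearly averages can be computed directly, as in this sketch on made-up rows (the frame `toy` is hypothetical), which also shows how many rows sit behind each average.

```python
import pandas as pd

# Illustrative rows: several titles per release year.
toy = pd.DataFrame({
    "release_year": [2019, 2019, 2020, 2020, 2020, 2021],
    "runtime": [100, 80, 95, 85, 90, 60],
})

# Mean runtime per year, plus the number of rows behind each mean;
# years with few rows give noisy averages.
yearly = toy.groupby("release_year")["runtime"].agg(["mean", "count"])
print(yearly)
```

Looking at the count column helps explain why early years, which usually have few titles, tend to fluctuate more.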

Line plot of seasons v/s release_year


In [None]:
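#seasons only applies to TV shows; movies were filled with 0 during cleaning, so keep shows only
shows = data[data['type'] == 'SHOW']
plt.figure(figsize = (8,6))
sns.lineplot(data = shows, y = 'seasons', x = 'release_year')
plt.title('line plot of seasons v/s release_year\n', color = 'brown')
plt.show()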

INSIGHTS

From this plot, we can see that the average number of seasons for TV shows has varied significantly over the years. There doesn't seem to be a clear long-term trend, but there are peaks and dips in certain years. This could be due to various factors like changes in programming strategies, the popularity of shows, or the availability of data.

Line plot of imdb_score v/s release_year

In [None]:
plt.figure(figsize = (8,6))
sns.lineplot(data = data, y = 'imdb_score', x = 'release_year')
plt.title('line plot of imdb_score v/s release_year\n', color = 'brown')
plt.show()

INSIGHTS

From this plot, we can observe fluctuations in the average IMDb score over the years. There doesn't appear to be a strong, consistent upward or downward trend across the entire history. There might be some periods with higher or lower average scores, but overall, the average IMDb score seems to have remained relatively stable with variations year by year.

# PIE CHART
A pie chart is used to visualize the proportions of different categories within a whole dataset. It is a circular statistical graph in which each slice represents a category, and the size of the slice is proportional to the category's contribution to the overall data.

Pie chart of type(Movie or TV show)

In [None]:

plt.figure(figsize = (8,6))
plt.pie(data['type'].value_counts(), labels=data['type'].value_counts().index, autopct = '%.2f%%')
plt.title('Pie chart of type(Movie or TV show)\n', color = 'brown')
plt.show()

INSIGHTS

1.Each slice of the pie represents a content type.

2.The size of each slice is proportional to the percentage of titles of that type.

From this pie chart, we can clearly see the percentage distribution between movies and TV shows. The larger slice represents the dominant content type, which is movies. This confirms the observation from the earlier histogram and bar plot that Amazon Prime Video has a significantly higher number of movies compared to TV shows in its catalog.
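
The percentages shown on the pie chart can also be computed directly with `value_counts(normalize=True)`. A small sketch on a made-up series (`types` is hypothetical, not the real column):

```python
import pandas as pd

# Illustrative content types: three movies and one show.
types = pd.Series(["MOVIE", "MOVIE", "MOVIE", "SHOW"])

# normalize=True returns fractions; multiply by 100 for percentages.
shares = types.value_counts(normalize=True).mul(100).round(2)
print(shares)
```

Having the numbers in a table makes it easier to quote exact shares in the write-up than reading them off the chart.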

# MULTIVARIATE ANALYSIS

# PAIR PLOT

Pair plot visualizes the pairwise relationship between variables. It includes scatter plots for relationships and histogram or density plots for individual distributions.

In [None]:
sns.pairplot(data)
plt.show()

# HEATMAP
A heatmap uses color intensity to represent the strength of relationships (correlations) between numeric variables in a dataset.

In [None]:
# Select relevant numerical columns
heatmap_df = data[['imdb_score', 'tmdb_score', 'runtime']].dropna()

# Compute correlation matrix
corr = heatmap_df.corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5, fmt=".2f")
plt.title("Correlation Heatmap: IMDb Score, TMDB Score, Runtime")
plt.tight_layout()
plt.show()

# Conclusion: Exploratory Data Analysis of Amazon Prime Video

This EDA project provided meaningful insights into Amazon Prime Video's content library through a detailed analysis of various content attributes including type, genre, age certification, runtime, and ratings (IMDb & TMDB).



1.Content Type:

The platform is dominated by movies, but shows are also significantly present.

Shows typically have lower runtimes per episode, whereas movies cluster around the 90–120 minute range.


2.Genres:

Drama and Comedy are the most common genres.

Action, Thriller, and Documentary also appear frequently, highlighting a diverse content mix.


3.Age Certification:

Most content is rated for mature audiences (e.g., TV-MA, R).

Box plots reveal that mature-rated content tends to have slightly higher IMDb scores and longer runtimes.


4.Ratings (IMDb & TMDB):

IMDb and TMDB scores show moderate to strong correlation, suggesting consistent user rating patterns across platforms.

The majority of content scores between 5 and 7, with relatively few outliers above 8.5.


5.Runtime Distribution:

A large portion of content falls within 60–120 minutes.

Short-form content (under 30 minutes) is rare.


6.Multivariate Patterns:

Scatter plots and pair plots showed weak correlation between runtime and ratings, but genre and age certification do influence scores.

Heatmaps confirmed that runtime is not strongly correlated with rating scores.

