# **Project Name**    - Amazon Prime TV shows and movies



##### **Project Type**    - EDA
##### **Contribution**    - Individual
  


# **Project Summary -**

The objective of this project was to explore and analyse two datasets, Titles and Credits, taken from Amazon Prime Video. The aim was to clean and standardize the data and derive business-level insights that can help stakeholders in the streaming industry make informed decisions. The project focuses not only on technical data processing but also on extracting real-world, actionable insights from the data.

### Dataset Overview

The Titles dataset contains information about various TV shows and movies available on the platform, including metadata such as title names, release years, age certifications, runtimes, genres, production countries, IMDb and TMDB scores, and popularity metrics. It also differentiates between movies and TV shows, along with the number of seasons for shows.

The Credits dataset provides a mapping between titles and the people involved in their creation, primarily actors and directors. It contains each person's name, role, and, in the case of actors, the character names they portrayed.

### Data Preparation and Cleaning
Data cleaning was a crucial first step to ensure accuracy and consistency. This included:

* Standardizing textual fields (e.g., converting age certifications to uppercase and stripping whitespace).

* Handling missing values appropriately, either through imputation or removal depending on the case.

* Splitting and processing list-type fields (e.g., genres, production countries) for analysis.

* Checking numerical fields such as IMDb scores, IMDb votes, tmdb_popularity, and tmdb_score for outliers.

### Key Business-Level Insights
From this cleaned data, several meaningful insights were derived:

* Genre Popularity Trends – Identified the most produced and most popular genres over the years, highlighting shifts in audience preferences.

* Content Type Distribution – Measured the ratio of movies to TV shows, revealing trends toward serialized content versus standalone films.

* Country Contribution Analysis – Determined which production countries dominate the catalog and their correlation with higher IMDb or TMDB scores.

* Consistent Performers – Analyzed whether certain directors or actors consistently appear in high-rated or highly popular titles.

* Age Certification Patterns – Mapped how age ratings align with popularity, providing guidance for future content targeting specific demographics.

### Business Relevance
The insights obtained from this analysis are helpful for content acquisition, marketing strategies, and production investment decisions. For example, knowing that a certain genre is consistently trending allows streaming platforms to prioritize acquiring or producing content in that category. Similarly, identifying top-performing actors or directors can inform casting decisions for new productions.

### Tools & Technologies Used
* Python: Data cleaning, transformation, and analysis using Pandas and NumPy.

* Matplotlib & Seaborn: Data visualization to identify patterns and trends.

* Jupyter Notebook: Interactive analysis and reporting.

# **GitHub Link -**

https://github.com/saikatM333/Netflix_EDA.git


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**

The streaming industry has become highly competitive, with multiple platforms vying for audience attention through diverse and engaging content. Amazon Prime Video, as one of the major players in this space, offers a vast library of movies and shows across various genres, regions, and formats.

In such a market, understanding content diversity, regional distribution, evolving trends, and audience preferences is essential for sustaining subscriber growth and improving engagement. Data-driven insights enable platforms to invest in the right content, cater to regional demands, and enhance overall viewer satisfaction.

This dataset has been compiled to analyze all shows and movies available on Amazon Prime Video, with the objective of identifying patterns and trends that can guide business strategies. The analysis focuses on:

* Content Diversity: Determining which genres and categories dominate the platform.

* Regional Availability: Understanding how content distribution varies across countries and regions.

* Trends Over Time: Examining how the content library has evolved over the years.

* Ratings & Popularity: Assessing IMDb ratings, TMDB scores, and popularity to identify high-performing titles.

By analyzing these factors, businesses, content creators, and data analysts can uncover actionable insights to optimize content acquisition, identify audience preferences, address content gaps, and enhance engagement—ultimately strengthening Amazon Prime Video’s competitive position in the streaming industry.



#### **Define Your Business Objective?**

***TO IDENTIFY HIGH-PERFORMING SHOWS, MOVIES, ACTORS BASED ON RATINGS, TRENDS TO INCREASE THE PLATFORM ENGAGEMENT***

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
credits_data =  pd.read_csv('credits.csv')
titles_data =  pd.read_csv('titles.csv')



### Dataset First View

In [None]:
# first we will look at the titles data
titles_data.head()

In [None]:
# second we will look at the credits dataset
credits_data.head()


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Credits row column count" , credits_data.shape)
print("Titles row column count" , titles_data.shape)

### Dataset Information

In [None]:
# Dataset Info
# first we will look at the titles dataset
titles_data.info()

In [None]:
# second we will look at the credits dataset
credits_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("number of duplicated rows in titles :", titles_data.duplicated().sum())
print("number of duplicated rows in credits :", credits_data.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# copying the data
titles_cleaned_data =  titles_data.copy()
credits_cleaned_data =  credits_data.copy()


*Checking the number of null values in each column of the titles and credits data (including imdb_score, tmdb_score and tmdb_popularity in titles)*

In [None]:

# Missing Values/Null Values Count
def null_missing_val(df):
    features_null =dict()
    try:
        cols =  df.columns
        for col in cols :
            if df[col].isnull().sum() != 0:
                features_null[col] =  df[col].isnull().sum()
    except Exception as e :
        print( 'Error: ',e)
    return features_null

In [None]:
titles_null =  null_missing_val(titles_cleaned_data)
titles_null

In [None]:
credits_null =  null_missing_val(credits_cleaned_data)
credits_null

In [None]:
# Visualizing the missing values
# dict views are converted to lists so seaborn receives plain sequences
sns.barplot(y = list(titles_null.values()) , x =  list(titles_null.keys()))
plt.xticks(rotation = 90)
plt.title('Null values in titles')
plt.show()
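
Raw null counts are easier to compare when shown alongside percentages. The sketch below is only an illustration on a hypothetical toy frame (`toy_titles` is not part of the dataset); the same pattern could be applied to `titles_cleaned_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titles_cleaned_data
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, np.nan, 6.4, np.nan],
    "seasons": [np.nan, np.nan, 2.0, np.nan],
})

# Count and percentage of missing values per column
null_counts = toy_titles.isnull().sum()
null_summary = pd.DataFrame({
    "null_count": null_counts,
    "null_pct": (null_counts / len(toy_titles) * 100).round(1),
})

# Keep only columns that actually have gaps, largest first
null_summary = null_summary[null_summary["null_count"] > 0].sort_values(
    "null_count", ascending=False
)
print(null_summary)
```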


### What did you know about your dataset?

The datasets describe the movies and shows available on the platform and the actors and directors who worked on that content.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("list of columns in credits dataset")
credits_data.columns

In [None]:
print('list of columns in titles')
titles_data.columns

In [None]:
# Dataset Describe
# some basic insights about the numerical columns of the titles data

titles_data.describe()

In [None]:
# some basic insights about the categorical/object columns of the titles data

titles_data.describe(include =  'object')

In [None]:
# some basic insights about the numerical columns of the credits data

credits_data.describe()

In [None]:
# some basic insights about the categorical/object columns of the credits data

credits_data.describe( include =  'object')

### Variables Description

*titles dataset*
* id: The title ID on JustWatch.
* title: The name of the title.
* show_type: TV show or movie.
* description: A brief description.
* release_year: The release year.
* age_certification: The age certification.
* runtime: The length of the episode (SHOW) or movie.
* genres: A list of genres.
* production_countries: A list of countries that produced the title.
* seasons: Number of seasons if it's a SHOW.
* imdb_id: The title ID on IMDB.
* imdb_score: Score on IMDB.
* imdb_votes: Votes on IMDB.
* tmdb_popularity: Popularity on TMDB.
* tmdb_score: Score on TMDB.

*Credits dataset*
* person_id: The person ID on JustWatch.
* id: The title ID on JustWatch.
* name: The actor or director's name.
* character_name: The character name.
* role: ACTOR or DIRECTOR.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
def unique_values_each_variable(df):
    '''
    parameter1 - df : dataset
    the function prints the number of unique values for each variable/column
    '''
    try:
        for col in df.columns :
            print(f'unique values of {col}: {df[col].nunique()}')
    except Exception as e:
        print("Error: ", e)

In [None]:
unique_values_each_variable(titles_cleaned_data)

In [None]:
unique_values_each_variable(credits_cleaned_data)

## 3. ***Data Wrangling***

In [None]:
# Cleaning the titles data
# handling null values of imdb_score (filled with the median)


titles_cleaned_data['imdb_score']=titles_cleaned_data['imdb_score'].fillna(titles_cleaned_data['imdb_score'].median())
titles_cleaned_data['imdb_score'].isnull().sum()

In [None]:
# handling null values of tmdb_score (filled with the median)
titles_cleaned_data['tmdb_score']=titles_cleaned_data['tmdb_score'].fillna(titles_cleaned_data['tmdb_score'].median())
titles_cleaned_data['tmdb_score'].isnull().sum()

In [None]:
# handling null values of tmdb_popularity (filled with the median)
titles_cleaned_data['tmdb_popularity']=titles_cleaned_data['tmdb_popularity'].fillna(titles_cleaned_data['tmdb_popularity'].median())
titles_cleaned_data['tmdb_popularity'].isnull().sum()



In [None]:
def handle_outlier(df, columns):
    '''
    parameter1 - df : dataset
    parameter2 - columns : list of column names
    this function takes the dataset and a list of columns, computes the
    lower and upper IQR bounds for each column, and drops the rows whose
    value in that column does not fall within the range
    '''
    try :
        print("number of rows before outlier removal",df.shape)
        for col in columns:
            if col in df.columns:
                Q1 = df[col].quantile(0.25)
                Q3 = df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                df = df.loc[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
                print(f"Outliers removed from column '{col}'")
            else:
                print(f"Column '{col}' not found.")
        print("number of rows after outlier removal",df.shape)
    except Exception as e :
        print("Error" ,e )
    return df
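
To make the IQR rule used in `handle_outlier` concrete, here is a small worked example on a made-up series (the numbers are illustrative only, not taken from the dataset).

```python
import pandas as pd

# Made-up values with one obvious extreme point
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Same 1.5 * IQR fences as in handle_outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

kept = values[(values >= lower_bound) & (values <= upper_bound)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, bounds=({lower_bound}, {upper_bound})")
print("kept:", kept.tolist())
```

Here only the value 100 falls outside the fences, so it is the only one dropped.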


In [None]:
# first reduce the skew of tmdb_popularity with a log transformation
# and store the result in a new column
titles_cleaned_data['tmdb_popularity_normalized'] = np.log(titles_cleaned_data['tmdb_popularity'])
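
One caveat worth keeping in mind: `np.log(0)` is `-inf`. Whether `tmdb_popularity` actually contains zeros is not checked here, so the sketch below only illustrates the behaviour on made-up values and shows `np.log1p` as one possible alternative, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up popularity values, including a zero
popularity = pd.Series([0.0, 0.6, 3.5, 120.0])

# Plain log: the zero becomes -inf (the divide warning is silenced for the demo)
with np.errstate(divide="ignore"):
    log_values = np.log(popularity)

# log1p computes log(1 + x), which stays finite for x >= 0
log1p_values = np.log1p(popularity)

print(log_values.tolist())
print(log1p_values.round(3).tolist())
```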


In [None]:
# remove the outliers from tmdb_popularity_normalized.
# imdb_score and tmdb_score are not included because all of their values lie in the valid range of 1-10.
col =  [ 'tmdb_popularity_normalized']
titles_cleaned_data_no_outlier =  handle_outlier( titles_cleaned_data , col)


In [None]:

titles_cleaned_data.info()

In [None]:
# in the titles dataset each genres value is a string such as "['romance', 'war', 'drama']"
# Step 1: convert the string to an actual Python list
import ast
titles_cleaned_data_no_outlier['genres'] = titles_cleaned_data_no_outlier['genres'].apply(ast.literal_eval)

# Step 2: Explode the list into multiple rows


titles_cleaned_data_no_outlier=  titles_cleaned_data_no_outlier.explode('genres').reset_index(drop = True)
titles_cleaned_data_no_outlier



In [None]:

# in the titles dataset each production_countries value is a string such as "['US']"
# first convert the string to a list,
# then explode the list so each country in production_countries gets its own row.

titles_cleaned_data_no_outlier['production_countries'] =  titles_cleaned_data_no_outlier['production_countries'].apply(ast.literal_eval)
titles_cleaned_data_no_outlier =  titles_cleaned_data_no_outlier.explode('production_countries').reset_index(drop =  True)
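
The effect of the two explode steps on row counts can be seen in a tiny example. The frame below is hypothetical (`toy` and its ids are made up); it shows that a title with two genres and two countries becomes four rows, and that an empty list becomes a missing value.

```python
import ast
import pandas as pd

# Hypothetical toy frame with stringified lists, like the raw titles data
toy = pd.DataFrame({
    "id": ["tm1", "ts2"],
    "genres": ["['drama', 'war']", "['comedy']"],
    "production_countries": ["['US', 'GB']", "[]"],
})

for col in ["genres", "production_countries"]:
    # Parse the string into a real list, then give each element its own row
    toy[col] = toy[col].apply(ast.literal_eval)
    toy = toy.explode(col).reset_index(drop=True)

print(toy)
print(toy["id"].value_counts())
```

This is why later cells that count titles call `drop_duplicates` on `id` first.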


In [None]:
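# standardize the age_certification column (strip whitespace, upper case) and store the result
titles_cleaned_data_no_outlier['age_certification'] = titles_cleaned_data_no_outlier['age_certification'].apply(lambda x : x.strip().upper() if pd.notna(x) else x)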

In [None]:
# standardizing the role column in the credits dataset
credits_cleaned_data['role'] =  credits_cleaned_data['role'].apply(lambda x : x.strip().upper() if pd.notna(x) else x)
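
Text columns can also be standardized with the vectorized `.str` accessor, which leaves missing values untouched. The series below is a made-up example, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up certification values with stray whitespace, mixed case and a missing entry
cert = pd.Series([" pg-13", "r ", np.nan, "TV-MA"])

# strip() then upper(); NaN passes through unchanged
cleaned = cert.str.strip().str.upper()
print(cleaned.tolist())
```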

In [None]:
# storing back the final cleaned data set
titles_df =  titles_cleaned_data_no_outlier
credits_df = credits_cleaned_data

### Data Wrangling Code

In [None]:
# getting number of shows and movies

number_shows_movies =  titles_cleaned_data_no_outlier.drop_duplicates(subset='id').value_counts( 'type').reset_index()
number_shows_movies

In [None]:
# popularity of show vs movies

popularity_show_movies =  titles_cleaned_data_no_outlier.groupby('type')['tmdb_popularity_normalized'].mean()
popularity_show_movies

In [None]:
# getting the top 10 production countries by number of titles
consistent_content_upload =  titles_df.drop_duplicates(subset='id').value_counts('production_countries').reset_index().sort_values(by = 'count' , ascending= False).head(10)
consistent_content_upload

In [None]:
# calculating the most popular production countries with more than 75 titles
top_popular_countries_sorted = titles_df.drop_duplicates(subset=['id']).groupby('production_countries').agg(

    tmdb_popularity_min = ('tmdb_popularity' , 'min'),
    tmdb_popularity_max = ('tmdb_popularity' , 'max'),
    tmdb_popularity_mean = ('tmdb_popularity' , 'mean'),
    count=('id', 'count')
).reset_index().sort_values(by  =  'tmdb_popularity_mean' , ascending = False)

top_popular_countries_sorted[top_popular_countries_sorted['count'] > 75]


In [None]:
# countries with more than 75 titles whose average imdb_score is above the overall mean imdb_score
# the overall mean is computed on unique titles so exploded rows are not counted several times
imdb_mean = titles_df.drop_duplicates(subset=['id'])['imdb_score'].mean()
top_Countries_by_imdb_score = titles_df.drop_duplicates(subset=['id']).groupby('production_countries').agg(
    imdb_score_mean = ('imdb_score' , 'mean'),
    count=('id', 'count')
).reset_index().sort_values(by  =  'imdb_score_mean' , ascending = False)

top_Countries_by_imdb_score[(top_Countries_by_imdb_score['imdb_score_mean']>  imdb_mean) & (top_Countries_by_imdb_score['count']> 75) ]


In [None]:
def get_most_popular_genres_by_country(df, country_code):
    ''' Parameter1 - df : dataset
        Parameter2 - country_code : 'US' , 'IN' , 'KE', 'HU'
        this function takes the dataset and a country code and
        returns the genres produced by that country, sorted from most popular to least popular
    '''
    try:
        # average normalized popularity per (genre, country) pair
        county_wise__most_popluar_ganre =  df.groupby(['genres' , 'production_countries'])['tmdb_popularity_normalized'].mean().reset_index()
        return county_wise__most_popluar_ganre.loc[ county_wise__most_popluar_ganre['production_countries'] == country_code ].sort_values(by = 'tmdb_popularity_normalized' , ascending =  False)


    except Exception as e:
        print(f" error: {e}")
        return None


In [None]:
# getting the genres produced by the US, sorted in descending order of tmdb_popularity_normalized
get_most_popular_genres_by_country(titles_df , 'US')

In [None]:
# getting the genres produced by India, sorted in descending order of tmdb_popularity_normalized

get_most_popular_genres_by_country(titles_df , 'IN')

In [None]:
# creating a new column decade
titles_df['decade'] =  (titles_df['release_year']//10) * 10
trend_df = titles_df.drop_duplicates(subset='id').groupby(['decade', 'genres'])['tmdb_popularity'].mean().reset_index()
trend_df

In [None]:
# how the popularity of shows and movies has changed over the decades
trend_shows_movies_popularity =  titles_df.groupby(['type' , 'decade'])['tmdb_popularity'].mean().reset_index()
trend_shows_movies_popularity

In [None]:

# merge the titles and credits to get more insights
credits_titles_data  =  pd.merge(credits_df , titles_df , on = 'id' , how = 'inner')
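
Because `titles_df` holds one row per title, genre and country combination, the merge repeats every credit for each of those rows. The toy frames below are hypothetical and only illustrate the row multiplication and one way to get back to one row per person and title.

```python
import pandas as pd

# Hypothetical exploded titles: one title, 2 genres x 2 countries = 4 rows
titles_exploded = pd.DataFrame({
    "id": ["tm1", "tm1", "tm1", "tm1"],
    "genres": ["drama", "drama", "war", "war"],
    "production_countries": ["US", "GB", "US", "GB"],
    "imdb_score": [7.0, 7.0, 7.0, 7.0],
})

# Hypothetical credits: one director and one actor for that title
credits = pd.DataFrame({
    "person_id": [10, 11],
    "id": ["tm1", "tm1"],
    "role": ["DIRECTOR", "ACTOR"],
})

merged = pd.merge(credits, titles_exploded, on="id", how="inner")
print(len(merged))  # each of the 2 credits is repeated for all 4 title rows

# Collapse back to one row per (title, person) before counting appearances
per_person_title = merged.drop_duplicates(subset=["id", "person_id"])
print(len(per_person_title))
```

Counts computed on the merged data (for example, number of titles per actor) are inflated unless this kind of deduplication is applied first.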


In [None]:
# list of columns in merged dataset
credits_titles_data.columns

In [None]:
# get the first 5 rows of credits_titles_data
credits_titles_data.head()

In [None]:
# number of unique titles and unique persons
print(credits_titles_data['id'].nunique())
print(credits_titles_data['person_id'].nunique())

In [None]:
# top performing actors in each genre
def get_most_popular_actor(df , genre = 'no_input'):
  '''
  parameter1 - df : dataset
  parameter2 - genre : 'action' , 'comedy' ,'sci-fi' | leave unset (default 'no_input') for top actors overall.
  it returns the 10 actors (by person_id) with the highest tmdb_popularity in the given genre
  '''
  try:
    if (genre != 'no_input'):
      df =  df.loc[(df['genres']==genre) &(df['role'] =='ACTOR')]
      return df.groupby(['genres','person_id'])['tmdb_popularity'].max().reset_index().sort_values(by ='tmdb_popularity', ascending =  False).head(10)
    else :
      df =  df.loc[df['role'] =='ACTOR']
      return df.groupby(['genres','person_id'])['tmdb_popularity'].max().reset_index().sort_values(by ='tmdb_popularity', ascending =  False).head(10)
  except Exception as e :
    print("Error" , e )

In [None]:
top_10_actor_perticular_genre= get_most_popular_actor(credits_titles_data , 'action')
top_10_actor_perticular_genre

In [None]:
top_10_actor  =  get_most_popular_actor(credits_titles_data)
top_10_actor

In [None]:
def get_most_popular_production(df , genre):
  '''
  parameter1 - df : dataset
  parameter2 - genre : 'action' , 'comedy' ,'sci-fi'
  it returns the 10 production countries with the highest average imdb_score in the given genre
  '''
  try:
    df =  df.loc[(df['genres']==genre)]
    return df.groupby('production_countries')['imdb_score'].mean().reset_index().sort_values(by ='imdb_score', ascending =  False).head(10)
  except Exception as e :
    print("Error" , e )


In [None]:
get_most_popular_production(titles_df, 'action')

In [None]:
# number of titles in each decade
decade_count =  titles_df.drop_duplicates('id').value_counts('decade').reset_index()
decade_count.rename({'count':'number_of_content_upload'},inplace =  True , axis =  1)
decade_count

In [None]:
# function for a particular actor's/director's popularity over the decades
# given an actor, it will return their popularity across the decades

credits_titles_data.columns

In [None]:
# Top 10% IMDb score movies
top_movies = titles_df.drop_duplicates('id').sort_values(by='imdb_score', ascending=False)
top_movies = top_movies[top_movies['imdb_score'] >= top_movies['imdb_score'].quantile(0.90)]

top_movies


In [None]:
# unique age certification
list_of_certi = titles_df['age_certification'].dropna().unique().tolist()
print(list_of_certi)

In [None]:
#unique list of genre
list_of_genre =  titles_df['genres'].dropna().unique().tolist()
print(list_of_genre)

In [None]:
#unique list of production
list_of_production =  titles_df['production_countries'].dropna().unique().tolist()
print(list_of_production)

In [None]:
titles_df[['title','type','runtime']].head(20)

In [None]:

# divide the runtime into 4 categories (Short, Medium, Long, Epic) and
# analyse the trend in tmdb popularity, imdb score and tmdb score for movies

titles_df['duration_category'] = pd.cut(titles_df['runtime'], bins=[0, 30, 90, 180, 500], labels=["Short", "Medium", "Long", "Epic"])
duration_trend = titles_df[titles_df['type']== 'MOVIE'].groupby('duration_category').agg(
    tmdb_popularity_mean = ('tmdb_popularity' , 'mean'),
    imdb_score_mean =('imdb_score','mean'),
    tmdb_score_mean =('tmdb_score','mean')
).reset_index()
duration_trend
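
The bins passed to `pd.cut` are right-closed by default, so the edges matter. The runtimes below are made up to show where boundary values land; note that a runtime of 0 or anything above 500 falls outside all bins and becomes missing.

```python
import pandas as pd

# Made-up runtimes chosen to sit on or around the bin edges
runtimes = pd.Series([0, 25, 30, 31, 90, 120, 181, 600])

categories = pd.cut(
    runtimes,
    bins=[0, 30, 90, 180, 500],
    labels=["Short", "Medium", "Long", "Epic"],
)

# Intervals are (0, 30], (30, 90], (90, 180], (180, 500]
print(pd.DataFrame({"runtime": runtimes, "duration_category": categories}))
```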


In [None]:
# effect of age certification on popularity
titles_df.groupby('age_certification')['tmdb_popularity'].mean().sort_values(ascending=False)


In [None]:
def get_top_directors(df):
    '''
    parameter1 df : dataset
    the function returns the top 10 directors based on the mean tmdb_popularity of the titles they directed
    '''
    return df[df['role'] == 'DIRECTOR'].groupby('person_id')['tmdb_popularity'].mean().reset_index().sort_values(by='tmdb_popularity', ascending=False).head(10)


In [None]:
get_top_directors(credits_titles_data)

In [None]:

def consistent_performer(df  , role ):
  '''
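  parameter1 df : dataset
  parameter2 role : 'ACTOR' , 'DIRECTOR'
  returns the persons with the given role that have more than 5 rows in df,
  sorted by mean tmdb_popularity in descending order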
  '''
  try:
    popularity = df.loc[df['role'] ==role,['id','person_id','name' , 'tmdb_popularity']  ].groupby('person_id').agg(
    mean_popularity =  ('tmdb_popularity' , 'mean') ,
    count = ('id' , 'count')
    ).query('count>5').sort_values(by =  'mean_popularity' , ascending  =  False).reset_index()
    return popularity
  except Exception as e :
    print('Error', e )

In [None]:
# getting consistent actors by performance
consitent_actors = consistent_performer(credits_titles_data , 'ACTOR')
consitent_actors

In [None]:
# getting consistent directors by performance
consitent_director = consistent_performer(credits_titles_data , 'DIRECTOR')
consitent_director

In [None]:
# getting the director-actor pairs that consistently perform well.
def get_top_director_actor_pairs_by_imdb(df , min_work = 2 ):
  """
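  parameter1 df - merged credits/titles dataframe.
  parameter2 min_work - int, minimum number of titles a pair must share (e.g. 1, 2, 3).
  Finds the director-actor pairs that produce the highest average IMDb scores.
  """
  try :
    # keep one row per (title, person); the merged data repeats rows for every genre and country,
    # and deduplicating on 'id' alone would keep only one director and one actor per title
    directors = df.loc[df['role'] == 'DIRECTOR' , ['id', 'person_id', 'name', 'imdb_score']].drop_duplicates(subset = ['id', 'person_id'])
    directors =  directors.rename(columns={'person_id': 'director_id', 'name': 'director_name'})

    actors =  df.loc[df['role'] == 'ACTOR' , ['id', 'person_id', 'name', 'imdb_score']].drop_duplicates(subset = ['id', 'person_id'])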
    actors = actors.rename(columns={'person_id': 'actor_id', 'name': 'actor_name'})
    pairs = pd.merge(directors, actors, on=['id', 'imdb_score'])

    pair_stats = (pairs.groupby(['director_id', 'director_name', 'actor_id', 'actor_name'])
            .agg(
                avg_imdb_score=('imdb_score', 'mean'),
                work_count=('id', 'count')
            ).reset_index())

        # Filter by minimum work done together
    pair_stats = pair_stats[pair_stats['work_count'] >= min_work]

        # Sort by IMDb score
    pair_stats = pair_stats.sort_values(by='avg_imdb_score', ascending=False)

    return pair_stats

  except Exception as e :
    print('Error', e)

In [None]:
top_director_actor_pairs = get_top_director_actor_pairs_by_imdb(credits_titles_data, min_work=2)
top_director_actor_pairs[top_director_actor_pairs['avg_imdb_score'] > 7.5]

In [None]:
# identify the shows/movies which are not popular but have a high imdb_score (top 15% imdb_score, bottom 25% normalized popularity)
imdb_score_threshold =  titles_df['imdb_score'].quantile(0.85)
popularity_threshold =  titles_df['tmdb_popularity_normalized'].quantile(0.25)

high_imdb_score_low_popularity = titles_df[(titles_df['imdb_score']>= imdb_score_threshold)& (titles_df['tmdb_popularity_normalized'] <= popularity_threshold)].drop_duplicates('id')

high_imdb_score_low_popularity

In [None]:
# identify the shows/movies which are popular but have a low imdb_score (bottom 25% imdb_score, top 20% normalized popularity)
imdb_score_threshold =  titles_df['imdb_score'].quantile(0.25)
popularity_threshold =  titles_df['tmdb_popularity_normalized'].quantile(0.80)

low_imdb_score_high_popularity = titles_df[(titles_df['imdb_score']<= imdb_score_threshold)& (titles_df['tmdb_popularity_normalized'] >= popularity_threshold)].drop_duplicates('id')

low_imdb_score_high_popularity

### What all manipulations have you done and insights you found?

* Data Manipulations
    * Missing value handling - replaced nulls in imdb_score, tmdb_score and tmdb_popularity with the column median.

    * Outlier treatment
        * Normalized tmdb_popularity using a log transformation and created tmdb_popularity_normalized.
        * Removed outliers from tmdb_popularity_normalized using the IQR rule.
    * Data transformation and standardization
        * Converted genres and production_countries from stringified lists to actual lists, then exploded them into individual rows.
        * Standardized age_certification values by converting to upper case and stripping spaces.
        * Created a decade column from release_year.
        * Categorized runtime into Short, Medium, Long and Epic.
    * Data merging
        * Merged the titles and credits datasets on id.
* Insights Derived
    * The number of movies is almost 8 times the number of shows.
    * Shows are more popular on average than movies.
    * The top 5 countries by number of titles are US, IN, GB, CA, JP.
    * The countries with the highest average popularity are KE, HU, TN.
    * Some of the production countries with an average IMDb score greater than 8 are CU, PF, CR, SY.
    * In the action genre, the highest-rated production countries are BA, SI, FJ, PF, GR.
    * In the sci-fi genre, the highest-rated production countries are MY, CL, CN, JP, RS.
    * The 2010s and 2020s have the highest content counts, and the amount of content grows from decade to decade.
    * India and the US both contribute movies with high imdb_score.
    * TV-MA and TV-Y7 have the highest tmdb_popularity.


![alt text](image.png)
![alt text](image-1.png)

Most popular production countries in the sci-fi genre


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart -1  histogram plot for imdb_score , tmdb_score , popularity

In [None]:
# Chart - 1 visualization code
fig , axs =  plt.subplots(1,3 , figsize =  (18,6))
features =  ['imdb_score' , 'tmdb_score' , 'tmdb_popularity']
heading =  ['IMDB Score Distribution' , 'TMDB Score Distribution' , 'TMDB Popularity Distribution']
for i in range(len(features)):
  sns.histplot(x= features[i] , data  =  titles_data , kde =  True , ax=  axs[i])
  # label the subplot that was just drawn (plt.xlabel would only label the current axes)
  axs[i].set_xlabel(features[i])
  axs[i].set_title(heading[i])
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To understand the distribution of the variables and the skewness in their distributions.

##### 2. What is/are the insight(s) found from the chart?

imdb_score and tmdb_score are left skewed, and tmdb_popularity is highly right skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps with data cleaning: knowing the skewness of each variable helps us choose an appropriate statistic (such as the median) for filling null values.
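
Skewness can also be checked numerically with `Series.skew()`. The two series below are made up purely to illustrate the sign convention: a long right tail gives a positive value and pulls the mean above the median, while a long left tail does the opposite.

```python
import pandas as pd

# Made-up data: one with a long right tail, one with a long left tail
right_skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 50])
left_skewed = pd.Series([1, 7, 8, 8, 9, 9, 9, 10])

for name, s in [("right_skewed", right_skewed), ("left_skewed", left_skewed)]:
    # Compare skewness with the gap between mean and median
    print(f"{name}: skew={s.skew():.2f}, mean={s.mean():.2f}, median={s.median():.2f}")
```

The same call could be applied to columns such as `titles_data['imdb_score']` to confirm what the histograms suggest.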

#### Chart - 2 box plot for imdb_score , tmdb_score , tmdb_popularity

In [None]:
# Chart - 2 visualization code

fig , axs =  plt.subplots(5,1 , figsize =  (15,14))
features =  ['imdb_score' , 'tmdb_score' , 'tmdb_popularity' ,'tmdb_popularity_normalized']
heading =  ['IMDB Score Box Plot' , 'TMDB Score Box Plot' , 'TMDB Popularity','TMDB Popularity Normalized Box Plot']

# box plots before outlier removal
for i in range(len(features)):
  sns.boxplot(x  =  features[i] , data  =  titles_cleaned_data , ax= axs[i])
  axs[i].set_xlabel(features[i])
  axs[i].set_title(heading[i])
# box plot of the normalized popularity after outlier removal
sns.boxplot(x  = 'tmdb_popularity_normalized'  , data  =  titles_df , ax= axs[4])
axs[4].set_xlabel('tmdb_popularity_normalized')
axs[4].set_title('TMDB Popularity Normalized Box Plot (after outlier removal)')
plt.tight_layout(pad=3.0)
plt.show()

##### 1. Why did you pick the specific chart?

Box plots are an efficient way to detect outliers and understand the distribution of values.

##### 2. What is/are the insight(s) found from the chart?

For imdb_score and tmdb_score, extreme values exist, but we cannot label them as outliers because both scores lie in the valid range of 1 to 10.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No, there is no indication of negative growth. The chart only confirms that imdb_score and tmdb_score are in the correct range, and that tmdb_popularity needs to be brought into a narrower range by normalizing it and then removing its outliers.

#### Chart - 3 pie_chart movies and shows

In [None]:
# Chart - 3 visualization code

plt.pie(x= number_shows_movies['count'] , labels = number_shows_movies['type'] , autopct = '%1.1f%%' )
plt.show()

##### 1. Why did you pick the specific chart?

* A pie chart effectively shows the share of each class.

##### 2. What is/are the insight(s) found from the chart?

The number of shows is far lower than the number of movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Shows have higher average popularity and tend to be more engaging, which drives the engagement rate on the platform; because comparatively few shows are available, the platform misses out on that engagement.



#### Chart - 4 histogram genre count

In [None]:
# Chart - 4 visualization code

# count of rows per genre in the exploded titles data
sns.histplot(x  =  'genres' , data  =  titles_df)
plt.xticks(rotation =90)
plt.show()

##### 1. Why did you pick the specific chart?

Histogram plots are a good way to see the count of each unique value within a single variable.

##### 2. What is/are the insight(s) found from the chart?

There is a strong imbalance in the production of genres, which shows the dominance of a few genres on the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes:
* Genres like sci-fi and fantasy have high popularity, but relatively little content is produced in them, which runs against audience preferences.
* Viewers have less choice across different types of content, because the catalog is saturated with drama.
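
The imbalance between genres can also be expressed as a percentage share. The sketch below uses a made-up genre column; on the real data the same call could be applied to `titles_df['genres']`.

```python
import pandas as pd

# Made-up genre labels standing in for titles_df['genres']
genres = pd.Series(["drama", "drama", "drama", "comedy", "comedy", "scifi"])

# Share of each genre in percent, largest first
genre_share = genres.value_counts(normalize=True).mul(100).round(1)
print(genre_share)
```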

#### Chart - 5 pie chart number of high imdb score movies (above 90th percentile) by countries

In [None]:
# Chart - 5 visualization code
country_counts = top_movies['production_countries'].value_counts().reset_index()
country_counts.columns = ['production_countries', 'count']

 # Group top 6 and rest as "Others"
top_6 = country_counts[:6]
others = country_counts[6:]
others_sum = others['count'].sum()

# Append "Others" to the top 6
final_counts = top_6.copy()
final_counts.loc[len(top_6)] = ['Others', others_sum]


plt.figure(figsize=(8, 8))
plt.pie(final_counts['count'], labels=final_counts['production_countries'], autopct='%1.1f%%', startangle=140)
plt.title('Top 10% IMDb Movies by Production Country (Grouped)')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart effectively shows each class's percentage contribution to the total.

##### 2. What is/are the insight(s) found from the chart?

IN, US, GB, CA, and CN are among the top contributors to the top 10 percent of movies by IMDb score. They lead the race in producing good content.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

No, the chart gives no indication of negative growth. It only shows which production countries contribute the movies with an IMDb score above the 90th percentile.
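
The chart relies on `top_movies`, which holds the movies above the 90th percentile of IMDb score. A minimal sketch of how such a subset could be derived is below. It assumes a strict "greater than the 0.90 quantile" cut-off, and the sample data and the names `sample_movies` and `top_sample` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of movies, one production country each.
sample_movies = pd.DataFrame({
    "id": [f"m{i}" for i in range(10)],
    "production_countries": ["US", "IN", "US", "GB", "IN", "US", "CA", "IN", "US", "GB"],
    "imdb_score": [5.1, 6.0, 6.4, 7.0, 7.2, 7.5, 7.9, 8.3, 8.8, 9.1],
})

# Score that 90% of movies fall at or below (linear interpolation by default).
threshold = sample_movies["imdb_score"].quantile(0.90)

# Keep only movies strictly above that threshold.
top_sample = sample_movies[sample_movies["imdb_score"] > threshold]

# Share of top movies per production country.
country_share = top_sample["production_countries"].value_counts(normalize=True)
print(threshold)
print(country_share)
```

Whether the cut-off should be strict (`>`) or inclusive (`>=`) is a design choice. It changes the subset only when several movies sit exactly on the threshold.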

#### Chart - 6 genre vs tmdb_popularity barplot

In [None]:
# Chart - 6 visualization code
sns.barplot(x  =  'genres' ,  y =  'tmdb_popularity' , data  =  titles_df)
plt.xticks(rotation =90)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart works best for categorical values and bivariate analysis, and it presents the data with a clear message.

##### 2. What is/are the insight(s) found from the chart?

* Animation, sci-fi, fantasy, and european are highly popular genres, liked by audiences far more on average than other genres.

* Documentation and western are the least popular genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The western, music, and documentation genres have very low popularity scores, which means this content reduces the engagement rate on the platform because audiences do not favour it. Investing more in these categories will therefore yield a lower return on investment.

#### Chart - 7 bar plot type of content( movie , show) vs tmdb_popularity

In [None]:
# Chart - 7 visualization code
# shows and movies popularity (mean per type, one row per title)
type_popularity = titles_df.drop_duplicates(subset='id').groupby('type')['tmdb_popularity_normalized'].mean().reset_index()
sns.barplot(x='type', y='tmdb_popularity_normalized', data=type_popularity)
plt.show()

In [None]:
titles_df.drop_duplicates(subset='id').groupby('type')['tmdb_popularity_normalized'].mean().reset_index()

##### 1. Why did you pick the specific chart?

Bar charts are best for plotting a continuous variable against a categorical one.

##### 2. What is/are the insight(s) found from the chart?

Movies are, on average, less popular than shows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Very few shows are produced even though shows are highly popular with the audience, and this mismatch works against business growth.
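
The mismatch between catalogue share and popularity can be put side by side in one table. The sketch below assumes, as the `drop_duplicates(subset='id')` call in the chart code suggests, that a title can appear on several rows. The sample rows and the name `sample` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: repeated ids mimic one row per (title, genre) pair.
sample = pd.DataFrame({
    "id": ["m1", "m1", "m2", "m3", "s1", "s1", "s1"],
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "tmdb_popularity": [4.0, 4.0, 6.0, 8.0, 20.0, 20.0, 20.0],
})

# Deduplicate first so multi-genre titles are not counted more than once.
unique_titles = sample.drop_duplicates(subset="id")

summary = unique_titles.groupby("type").agg(
    title_count=("id", "count"),
    mean_popularity=("tmdb_popularity", "mean"),
)
summary["catalogue_share"] = summary["title_count"] / summary["title_count"].sum()
print(summary)
```

In this toy sample, shows make up a quarter of the catalogue but have the higher mean popularity, which is the pattern described above.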


#### Chart - 8 barplot genre vs IMDB score

In [None]:
# Chart - 8 visualization code
sns.barplot(x = 'genres', y =  'imdb_score'  , data  = titles_df  )
plt.xticks(rotation=90)
plt.title('Genre Imdb Score')
plt.xlabel('Genre')
plt.ylabel('IMDB score')
plt.show()

##### 1. Why did you pick the specific chart?

Bar plots are good for representing numerical vs categorical values. The plot shows the mean IMDb score of each genre, which indicates how well each genre is liked.

##### 2. What is/are the insight(s) found from the chart?

There is no significant variation in IMDb score across genres.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* There is room to increase the amount of good horror content.
* The documentation genre has a high IMDb score, which means audiences like the content, but as shown earlier, very few titles are produced in this genre.
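
One way to find genres that are well rated but thinly supplied is to combine the mean score with the title count per genre. The sketch below is illustrative only: the sample values are invented, and the two cut-offs (above the overall mean score, at or below the median title count) are assumptions, not rules used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample with one row per (title, genre) pair.
sample = pd.DataFrame({
    "genres": ["drama", "drama", "drama", "drama",
               "documentation", "horror", "horror", "documentation"],
    "imdb_score": [6.0, 6.5, 7.0, 6.5, 7.8, 4.5, 5.5, 7.4],
})

# Mean score and number of titles per genre.
genre_summary = (
    sample.groupby("genres")["imdb_score"]
    .agg(mean_score="mean", title_count="count")
    .sort_values("mean_score", ascending=False)
)

# Assumed cut-offs: better than average score, no more titles than the median genre.
overall_mean = sample["imdb_score"].mean()
median_count = genre_summary["title_count"].median()
under_supplied = genre_summary[
    (genre_summary["mean_score"] > overall_mean)
    & (genre_summary["title_count"] <= median_count)
]
print(under_supplied)
```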

#### Chart - 9 bar plot duration vs (tmdb_popularity , imdb_score, tmdb_score)

In [None]:
# Chart - 9 visualization code
melted = duration_trend.melt(id_vars="duration_category",
                             value_vars=["tmdb_popularity_mean", "imdb_score_mean", "tmdb_score_mean"],
                             var_name="Metric",
                             value_name="Value")

# Replace column names for better display
melted['Metric'] = melted['Metric'].replace({
    'tmdb_popularity_mean': 'TMDb Popularity',
    'imdb_score_mean': 'IMDb Score',
    'tmdb_score_mean': 'TMDb Score'
})

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(data=melted, x='duration_category', y='Value', hue='Metric')

# Labels and title
plt.title('Average Scores & Popularity by Duration Category')
plt.xlabel('Duration Category')
plt.ylabel('Average Value')
plt.legend(title='Metric')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots are excellent for representing categorical vs continuous variables.

##### 2. What is/are the insight(s) found from the chart?

* Epic content (movies) is more popular and has higher IMDb and TMDb scores than the other categories.

* Long content also performs well, but less so than epic movies.

* Popularity is very low for short and medium movies.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Audiences do not like short content: it shows the lowest popularity along with low IMDb and TMDb scores.
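
The chart reads from `duration_trend`, which holds one row per duration category. A minimal sketch of how such a table could be built with `pd.cut` follows. The bin edges (60, 120, and 180 minutes), the sample values, and the name `duration_trend_sketch` are assumptions for illustration, and the actual boundaries used in this notebook may differ.

```python
import pandas as pd

# Hypothetical sample of movies.
sample = pd.DataFrame({
    "runtime": [25, 45, 80, 95, 130, 150, 190, 210],
    "tmdb_popularity": [2.0, 3.0, 5.0, 6.0, 9.0, 10.0, 14.0, 15.0],
    "imdb_score": [5.5, 5.8, 6.0, 6.2, 6.8, 7.0, 7.6, 7.8],
    "tmdb_score": [5.6, 5.9, 6.1, 6.3, 6.9, 7.1, 7.5, 7.7],
})

# Assumed bin edges in minutes; intervals are right-inclusive, e.g. (60, 120].
bins = [0, 60, 120, 180, float("inf")]
labels = ["short", "medium", "long", "epic"]
sample["duration_category"] = pd.cut(sample["runtime"], bins=bins, labels=labels)

# Mean of each metric per category, with column names matching the melt step.
duration_trend_sketch = (
    sample.groupby("duration_category", observed=True)
    .agg(
        tmdb_popularity_mean=("tmdb_popularity", "mean"),
        imdb_score_mean=("imdb_score", "mean"),
        tmdb_score_mean=("tmdb_score", "mean"),
    )
    .reset_index()
)
print(duration_trend_sketch)
```

The output columns use the same `*_mean` names that the `melt` call in the chart code expects.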

#### Chart - 10 line plot trend over the decades

In [None]:
# Chart - 10 visualization code

unique_genres = trend_df['genres'].unique()
n = len(unique_genres)

cols = 3
rows = (n + cols - 1) // cols  # Ceiling division

# Set global style before creating the figure so the axes pick it up
sns.set_theme(style="whitegrid")

# Define figure and axes
fig, axes = plt.subplots(rows, cols, figsize=(18, 5 * rows), sharex=False, sharey=False)
axes = axes.flatten()

# Define a nice color palette
palette = sns.color_palette("Set2", n_colors=n)

for i, genre in enumerate(unique_genres):
    genre_data = trend_df[trend_df['genres'] == genre]
    sns.lineplot(
        data=genre_data,
        x='decade',
        y='tmdb_popularity',
        ax=axes[i],
        color=palette[i % len(palette)],
        marker='o',
        linewidth=2
    )
    axes[i].set_title(f'Genre: {genre}', fontsize=14, weight='bold')
    axes[i].set_xlabel('Decade', fontsize=12)
    axes[i].set_ylabel('Avg Popularity', fontsize=12)
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].grid(True)

# Hide unused subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.suptitle(' TMDb Popularity Trend per Genre (by Decade)', fontsize=18, weight='bold', y=1.02)
plt.tight_layout()
plt.subplots_adjust(top=1)  # Adjust for suptitle space
plt.show()


##### 1. Why did you pick the specific chart?

Line charts are best for comparing trends, especially when a trend has to be analysed over time.

##### 2. What is/are the insight(s) found from the chart?

* Action, comedy, drama, thriller, and crime are generally growing in popularity.

* Animation shows a decreasing trend.

* Horror movies show a dip in popularity over the last two decades.

* Sci-fi has had high popularity over the last two decades.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes.
* Some genres, such as animation, show a negative popularity trend. Since animation is one of the most popular genres, this clearly indicates a shift in audience interest over time.

* Some genres, such as sport, fluctuate from decade to decade. This suggests that the genre itself has no steady following and that its popularity depends on how good the content produced in a given decade was.
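
The per-genre trends come from `trend_df`, which has one row per genre and decade. Below is a minimal sketch of how such a table could be derived from release years, along with a simple first-to-last-decade change per genre. The sample values and the `release_year` column name are assumptions, and the change measure is an illustrative summary rather than something computed elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample with one row per (title, genre) pair.
sample = pd.DataFrame({
    "release_year": [1994, 1999, 2005, 2012, 2004, 2016, 2018],
    "genres": ["animation", "animation", "animation", "animation",
               "scifi", "scifi", "scifi"],
    "tmdb_popularity": [12.0, 10.0, 8.0, 6.0, 4.0, 7.0, 9.0],
})

# Floor each year to the start of its decade, e.g. 1994 -> 1990.
sample["decade"] = (sample["release_year"] // 10) * 10

# Mean popularity per genre and decade.
trend_sketch = (
    sample.groupby(["genres", "decade"], as_index=False)["tmdb_popularity"].mean()
)

# Change in mean popularity between the first and last decade of each genre.
change = (
    trend_sketch.sort_values("decade")
    .groupby("genres")["tmdb_popularity"]
    .agg(lambda s: s.iloc[-1] - s.iloc[0])
)
print(trend_sketch)
print(change)
```

A negative value in `change` corresponds to a downward line in the chart, and a positive value to an upward one.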



#### Chart - 11 line plot decade vs popularity

In [None]:
# Chart - 11 visualization code

# Popularity trend across decades, shows vs movies
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
palette = sns.color_palette("Set2", n_colors=2)

movie_trend = trend_shows_movies_popularity[trend_shows_movies_popularity['type'] == 'MOVIE']
sns.lineplot(data=movie_trend, x='decade', y='tmdb_popularity', ax=axs[0], color=palette[1])
axs[0].set_title('Movie trend over the decades')

show_trend = trend_shows_movies_popularity[trend_shows_movies_popularity['type'] == 'SHOW']
sns.lineplot(data=show_trend, x='decade', y='tmdb_popularity', ax=axs[1], color=palette[0])
axs[1].set_title('Show trend over the decades')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A line chart is best suited to timeline analysis. Here movies and shows are plotted separately so that each trend is clearly visible.

##### 2. What is/are the insight(s) found from the chart?

* For movies, popularity grows roughly linearly from the 1930s to the 2000s, then dips from the 2000s to the 2020s.

* Shows are consistently more popular than movies across the decades, although their popularity fluctuates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. From the 2000s to the 2020s the popularity of movies dips. A dip in the popularity of the content being created means audiences are losing interest in the content on the platform.

#### Chart - 12 pair plot

In [None]:
# Chart - 12 visualization code
movies = titles_cleaned_data_no_outlier[titles_cleaned_data_no_outlier['type'] == 'MOVIE'].drop_duplicates(subset='id')
shows = titles_cleaned_data_no_outlier[titles_cleaned_data_no_outlier['type'] == 'SHOW'].drop_duplicates(subset='id')
sns.pairplot(movies[['runtime', 'imdb_score', 'tmdb_score', 'tmdb_popularity']])
plt.suptitle("Movies: Runtime vs Ratings/Popularity", y=1.02)
plt.show()

# Print the correlation matrix explicitly; a bare expression mid-cell displays nothing
print(shows[['seasons', 'imdb_score', 'tmdb_score', 'tmdb_popularity']].corr())
sns.pairplot(shows[['seasons', 'imdb_score', 'tmdb_score', 'tmdb_popularity']])
plt.suptitle("Shows: Seasons vs Ratings/Popularity", y=1.02)
plt.show()






##### 1. Why did you pick the specific chart?

A pair plot shows the relationships between multiple variables at once and communicates them to clients effectively.

##### 2. What is/are the insight(s) found from the chart?

imdb_score generally drops once runtime exceeds 175. Below that, many movies have a high score.

Many high-popularity movies have a runtime between 75 and 175.

Below a runtime of 75, popularity drops again.

All movies with a tmdb_score above 8 have a runtime between 75 and 185.

Many shows with 1 to 7 seasons have a high imdb_score.

Shows with more than 20 seasons see a significant drop in popularity, imdb_score, and tmdb_score.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. If runtime goes above 175, popularity, imdb_score, and tmdb_score drop significantly.

If a show has more than 20 seasons, popularity, imdb_score, and tmdb_score also drop.

#### Chart - 13 correlation imdb_score, imdb_votes, tmdb_popularity, tmdb_score

In [None]:
# Chart - 13 visualization code
score_cols = ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
# One row per title, keeping only titles with all four metrics present
titles_data_for_relation = titles_data.drop_duplicates(subset='id').dropna(subset=score_cols)
sns.heatmap(titles_data_for_relation[score_cols].corr(), annot=True, fmt='.2f')

plt.title('Correlation Between IMDb & TMDb Scores/Votes/Popularity')
plt.xticks(rotation =  90)
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap shows how imdb_score, imdb_votes, tmdb_popularity, and tmdb_score are related to each other.

##### 2. What is/are the insight(s) found from the chart?

There is no strong relationship among the variables, but imdb_score and tmdb_score have a relationship of medium strength.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

IMDb and TMDb metrics should be analysed independently for each platform, because the correlation between them is not high.
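
Reading a heatmap by eye gets harder as the number of variables grows, so it can help to list the variable pairs ranked by the absolute value of their correlation. The sketch below uses a small made-up sample, and the numbers do not come from the dataset. The names `sample` and `ranked` are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Hypothetical sample; values chosen only to make the arithmetic easy to follow.
sample = pd.DataFrame({
    "imdb_score": [1.0, 2.0, 3.0, 4.0, 5.0],
    "tmdb_score": [2.0, 1.0, 4.0, 3.0, 5.0],
    "tmdb_popularity": [2.0, 4.0, 1.0, 5.0, 3.0],
})

# Pearson correlation matrix (the pandas default).
corr = sample.corr()

# Collect each unordered pair of columns once.
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})

# Rank pairs by strength, ignoring the sign.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this toy sample the two score columns form the strongest pair, which mirrors the qualitative finding above without reproducing its actual values.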

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.



*   Increase the content in the documentation and reality genres. Audiences like them, but few titles are uploaded: documentation (< 1500) and reality (< 100).

*   Content uploads have dropped over the last decade, so uploads need to increase.

*   Many countries do not upload content consistently. Prompt them to upload content.

*   Increase the number of shows uploaded, since shows generally tend to be more popular, which will increase engagement on the platform.

*   Promote the marketing of movies with a high IMDb score (above the 80th percentile) and low popularity (below the 25th percentile).

*   Promote the hype around the most popular movies with lower IMDb scores so that trending titles reach and are watched by the largest possible audience.

*   A few countries have high popularity but low IMDb scores, which is a result of regional movies, such as Korean ones.

*   Region-based shows and movies should be promoted, since a movie may lack high popularity because of limited global promotion or low production while still being liked by locals.
*   History, documentation, war, and animation have very high IMDb scores and should be promoted globally.

*   Give loyalty awards to consistent content producers, and offer awards to inconsistent producers for producing new content, to promote content creation.

* Content length should be kept at 90-200 minutes to gain more popularity.
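
The recommendation to promote well-rated but little-seen movies can be turned into a simple filter. The sketch below uses the 80th and 25th percentile cut-offs named in the list above. The sample data and the names `sample` and `hidden_gems` are hypothetical.

```python
import pandas as pd

# Hypothetical sample of movies.
sample = pd.DataFrame({
    "title": list("ABCDEFGH"),
    "imdb_score": [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.5, 9.0],
    "tmdb_popularity": [40.0, 35.0, 30.0, 25.0, 20.0, 15.0, 2.0, 50.0],
})

# Percentile cut-offs taken from the recommendation above.
score_cut = sample["imdb_score"].quantile(0.80)
popularity_cut = sample["tmdb_popularity"].quantile(0.25)

# High score but low popularity: candidates for extra marketing.
hidden_gems = sample[
    (sample["imdb_score"] > score_cut)
    & (sample["tmdb_popularity"] < popularity_cut)
]
print(hidden_gems)
```

The same pattern with the comparisons flipped would select the popular but lower-rated titles mentioned in the next recommendation.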

# **Conclusion**

* The duration category field is closely related to audience rating and popularity scores.
* The IMDb and TMDb score fields are positively correlated, while popularity is influenced by additional factors such as trends and marketing.
* Long and epic content (around 90-180 minutes) has received higher ratings and better popularity than shorter or medium-length content.
* The number of titles uploaded has increased significantly over the last few decades, but there is a dip in the last two decades.
* Short-duration content generally underperforms in both ratings and popularity.
* India and the USA are the most consistent content uploaders.
* The USA and India are the top countries producing content with an IMDb rating above the 90th percentile.
* Local or regional content has shown higher popularity within its particular region.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***