# **Project Name**    - "EDA - Exploratory Data Analysis of Media Titles and Credits."



##### **Project Type**    - EDA
##### **Contribution**    - Individual - Tejaswini Panda
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

This project involves an Exploratory Data Analysis (EDA) of two related datasets: `credits.csv` and `titles.csv`. The `titles.csv` dataset likely contains information about various media titles (e.g., movies, TV shows), while `credits.csv` probably holds details about the cast and crew associated with these titles. The primary objective of this EDA is to uncover underlying patterns, correlations, and anomalies within and between these datasets, gain a comprehensive understanding of their structure, identify key variables, and inform subsequent analyses or decision-making processes regarding media content.

Initially, the project focused on loading and inspecting these datasets, examining their dimensions, data types, and initial statistical summaries. This step is crucial for understanding the raw data and identifying any immediate issues such as missing values, incorrect data types, or unusual distributions in key columns.

Data cleaning and preprocessing will be crucial steps. This will likely involve handling missing values (e.g., in genres, release dates, or cast roles), converting data types (e.g., ensuring release dates are datetime objects, handling categorical features correctly), identifying and removing duplicate entries, and potentially addressing outliers. These manipulations will be performed to ensure data quality and prepare the datasets for accurate analysis.

The core of the EDA is data visualization and statistical analysis. Univariate analysis examines the distribution of individual variables: histograms for numerical columns such as `release_year`, `imdb_score`, and `runtime`, and count or bar plots for categorical columns such as `type`, `role`, genres, and production countries. Insights at this stage include the most common genres, the distribution of release years, and the balance between movies and shows.

Bivariate and multivariate analyses explore relationships between variables: scatter plots of IMDb score against IMDb votes and of runtime against IMDb score (both split by content type), and a correlation heatmap of the numerical features in `df_titles`. For instance, the scatter plots suggest that higher-rated titles tend to attract more votes. These analyses help identify factors that may influence content popularity, success, or production trends.

The findings from this EDA project provide insights into the media landscape based on the `credits.csv` and `titles.csv` datasets. For example, understanding which cast members are most associated with high-rated titles can inform future casting decisions, and identifying popular genres and their performance trends can guide content acquisition strategies. This foundational understanding supports building predictive models and making data-driven decisions with positive business impact, such as improved content recommendation systems, optimized production budgets, or increased audience engagement.

In summary, this EDA project aims to transform the raw `credits.csv` and `titles.csv` data into actionable insights, laying a solid groundwork for further in-depth analysis and problem-solving within the entertainment industry context.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Given the `credits.csv` and `titles.csv` datasets, the problem is to identify, through exploratory analysis, the patterns in media titles and their cast and crew that matter for content decisions. For example, understanding which cast members are most associated with high-rated titles can inform future casting decisions, and identifying popular genres and their performance trends can guide content acquisition strategies. This understanding is a prerequisite for building predictive models or making data-driven decisions with positive business impact, such as improved content recommendation systems, optimized production budgets, or increased audience engagement.

#### **Define Your Business Objective?**

The primary business objective of this Exploratory Data Analysis project is to guide content acquisition and production strategies for a media company. It does so by uncovering key drivers of title success and audience preference, and by examining the impact of prominent cast and crew, through analysis of the `credits.csv` and `titles.csv` datasets. Ultimately, this should lead to more data-driven investment decisions in content development, optimized resource allocation, and a stronger competitive position within the rapidly evolving entertainment industry.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
df_credits = pd.read_csv('/content/credits.csv')
df_titles = pd.read_csv('/content/titles.csv')
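The guidelines above ask for exception handling and a notebook that runs in one go. As a minimal sketch of that idea, the loading step could be wrapped in a small helper that fails early with a clear message when a file is absent. The helper name `load_csv` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV file, failing early with a clear message if it is absent."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)


# Hypothetical usage with the paths from the loading cell:
# df_credits = load_csv('/content/credits.csv')
# df_titles = load_csv('/content/titles.csv')
```

Checking the path before calling `pd.read_csv` keeps the error message specific to the missing dataset rather than surfacing deep inside pandas.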

### Dataset First View

In [None]:
# Dataset First Look
print('df_credits head:')
display(df_credits.head())
print('\ndf_titles head:')
display(df_titles.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print('df_credits shape:', df_credits.shape)
print('df_titles shape:', df_titles.shape)

### Dataset Information

In [None]:
# List files in the /content/ directory to check their names and presence
!ls -l /content/

In [None]:
# Dataset Info
print('df_credits info:')
df_credits.info()
print('\ndf_titles info:')
df_titles.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print('df_credits duplicate rows:', df_credits.duplicated().sum())
print('df_titles duplicate rows:', df_titles.duplicated().sum())

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print('df_credits missing values:')
display(df_credits.isnull().sum())
print('\ndf_titles missing values:')
display(df_titles.isnull().sum())

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 4))
sns.heatmap(df_titles.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in df_titles')
plt.show()

plt.figure(figsize=(12, 4))
sns.heatmap(df_credits.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in df_credits')
plt.show()

### What did you know about your dataset?

The project uses two related datasets, `credits.csv` and `titles.csv`. `titles.csv` contains information about media titles (movies and TV shows), while `credits.csv` holds details about the cast and crew associated with those titles. The objective of this EDA is to uncover patterns, correlations, and anomalies within and between the two datasets, understand their structure, identify key variables, and inform subsequent analyses or decisions regarding media content.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print('df_credits columns:', df_credits.columns.tolist())
print('\ndf_titles columns:', df_titles.columns.tolist())

In [None]:
# Dataset Describe
print('df_credits describe:')
display(df_credits.describe(include='all'))
print('\ndf_titles describe:')
display(df_titles.describe(include='all'))

### Variables Description

For `df_credits`, the variables are `person_id`, `id`, `name`, `character` (which has missing values and many generic entries), and `role` (dominated by actors).

For `df_titles`, the variables are `id`, `title`, `type`, `description` (some missing values), `release_year`, `age_certification` (high missingness), `runtime`, `genres` and `production_countries` (both stored in list format), `seasons` (sparse, since it does not apply to movies), `imdb_id`, and the rating columns `imdb_score`, `imdb_votes`, `tmdb_popularity`, and `tmdb_score` (each with its own missing values and range).
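The missingness and uniqueness notes in the description can be collected into one table per DataFrame. The sketch below shows one way to do that; `summarize_columns` is a hypothetical helper and the toy frame only stands in for `df_titles`, with made-up values.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, missing count/percent, unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isnull().sum(),
        'missing_pct': (df.isnull().mean() * 100).round(2),
        'n_unique': df.nunique(),
    })


# Toy frame standing in for df_titles (values are made up)
toy_titles = pd.DataFrame({
    'id': ['t1', 't2', 't3', 't4'],
    'age_certification': ['R', None, None, 'PG'],
    'seasons': [None, 2.0, None, None],
})
summary = summarize_columns(toy_titles)
print(summary)
```

Because `nunique()` needs hashable values, a helper like this is best run before any column is converted to Python lists.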


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

print('Unique Values for df_credits:')
for column in df_credits.columns:
    print(f'\nColumn: {column}')
    print(f'Number of Unique Values: {df_credits[column].nunique()}')
    print(f'Top 5 Unique Values: {df_credits[column].value_counts().head()}')

print('\n' + '='*50 + '\n')

print('Unique Values for df_titles:')
for column in df_titles.columns:
    print(f'\nColumn: {column}')
    print(f'Number of Unique Values: {df_titles[column].nunique()}')
    # For object type, show value counts; for numerical, show basic stats if unique count is high
    if df_titles[column].dtype == 'object':
        print(f'Top 5 Unique Values: {df_titles[column].value_counts().head()}')
    else:
        if df_titles[column].nunique() > 20: # If too many unique numerical values, show stats
            print(f'Min: {df_titles[column].min()}, Max: {df_titles[column].max()}, Mean: {df_titles[column].mean():.2f}')
        else:
            print(f'Unique Values: {df_titles[column].unique()}')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# 1. Handle Duplicate Values
print(f"df_credits before dropping duplicates: {df_credits.shape[0]} rows")
df_credits.drop_duplicates(inplace=True)
print(f"df_credits after dropping duplicates: {df_credits.shape[0]} rows")

print(f"\ndf_titles before dropping duplicates: {df_titles.shape[0]} rows")
df_titles.drop_duplicates(inplace=True)
print(f"df_titles after dropping duplicates: {df_titles.shape[0]} rows")

# 2. Convert list-like string columns to actual lists in df_titles
# Using ast.literal_eval for safe evaluation of string representations of lists
import ast

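def parse_list_string(s):
    # Already-parsed lists pass through unchanged instead of being reset to []
    if isinstance(s, list):
        return s
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []  # Unparseable strings and missing values (NaN) become empty lists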

df_titles['genres'] = df_titles['genres'].apply(parse_list_string)
df_titles['production_countries'] = df_titles['production_countries'].apply(parse_list_string)

print("\n'genres' and 'production_countries' columns converted to lists.")

# 3. Merge df_titles and df_credits
# Ensure 'id' columns have consistent data types if needed, though they seem to be object/string based on info()
# Performing a left merge, keeping all titles and adding credit information where available
df_merged = pd.merge(df_titles, df_credits, on='id', how='left')

print("\nDataFrames merged into df_merged. First 5 rows:")
display(df_merged.head())
print(f"Shape of merged DataFrame: {df_merged.shape}")
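A left merge of titles onto credits is one-to-many, so the merged frame has one row per credit plus one row for each title without credits. As a hedged sketch on made-up data, the `validate` and `indicator` parameters of `pd.merge` can confirm that assumption and flag unmatched titles. The toy frames below are illustrative only.

```python
import pandas as pd

# Toy stand-ins for df_titles and df_credits (values are made up)
toy_titles = pd.DataFrame({'id': ['t1', 't2', 't3'],
                           'title': ['A', 'B', 'C']})
toy_credits = pd.DataFrame({'id': ['t1', 't1', 't2'],
                            'name': ['Ann', 'Bob', 'Cy'],
                            'role': ['ACTOR', 'DIRECTOR', 'ACTOR']})

# validate='one_to_many' raises MergeError if 'id' repeats in the left frame;
# indicator=True adds a '_merge' column showing where each row came from
toy_merged = pd.merge(toy_titles, toy_credits, on='id', how='left',
                      validate='one_to_many', indicator=True)

# Titles that found no credit rows are flagged 'left_only'
titles_without_credits = toy_merged.loc[
    toy_merged['_merge'] == 'left_only', 'id'].tolist()

print(toy_merged.shape)
print(titles_without_credits)
```

Any per-title statistic computed on the merged frame counts a title once per credit row, so title-level charts are safer on the unmerged titles frame.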

### What all manipulations have you done and insights you found?

**Data manipulations performed and insights gained:**

**1. Handling duplicate values**

*   Manipulation: Identified and removed 56 duplicate rows from `df_credits` and 3 duplicate rows from `df_titles` using `drop_duplicates(inplace=True)`.
*   Insight: Removing duplicates ensures that each record represents a unique entry, which prevents biased counts of titles or credits and improves the reliability of the dataset.

**2. Converting list-like string columns**

*   Manipulation: The `genres` and `production_countries` columns in `df_titles` were stored as string representations of lists (e.g., `"['drama', 'action']"`). They were converted into actual Python lists using `ast.literal_eval`.
*   Insight: The conversion makes it possible to iterate over individual genres and countries, count their frequencies, identify multi-genre titles, and analyze content distribution by country. Unparsed strings would have limited the scope of categorical analysis.

**3. Merging DataFrames**

*   Manipulation: `df_titles` and `df_credits` were combined with a left merge on the common `id` column, creating a new DataFrame named `df_merged`.
*   Insight: The merge combines title-level information (genre, release year, scores) with cast and crew details (`person_id`, `name`, `character`, `role`). The merged dataset (125,186 rows and 19 columns) enables relational analyses, such as investigating the impact of specific actors or directors on IMDb scores, exploring genre preferences by production country, or following individuals across titles. It forms the basis for the bivariate and multivariate analyses.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.countplot(data=df_titles, x='type', hue='type', palette='viridis', legend=False)
plt.title('Distribution of Content Types (Movies vs. Shows)')
plt.xlabel('Content Type')
plt.ylabel('Number of Titles')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a countplot (a type of bar chart) because it is ideal for visualizing the distribution of a single categorical variable. The 'type' column, with its two distinct categories ('MOVIE' and 'SHOW'), is perfectly suited for this chart to quickly show the frequency of each content type.

##### 2. What is/are the insight(s) found from the chart?

The dataset contains a significantly higher number of 'MOVIE' titles compared to 'SHOW' titles. This indicates a strong dominance of feature films in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:** Confirms movie dominance, allowing focused content strategy and marketing if movies are the business priority.

**Negative Growth (Strategic Consideration):** Highlights a potential content imbalance. If growth in TV series is a goal, the current movie-heavy composition represents a missed opportunity in the 'SHOW' segment, which needs strategic attention.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12, 6))
sns.histplot(data=df_titles, x='release_year', bins=30, kde=True)
plt.title('Distribution of Content Release Years')
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a histogram because it is well suited to showing the distribution of a single numerical variable. Applied to `release_year` in `df_titles`, it shows how many titles were released in each period and highlights production trends over time.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows content production increasing over the years, with a marked surge starting in the late 1990s and the highest counts in the most recent years (roughly 2015-2022). This indicates a booming period for media content creation. Older titles are present but in much smaller numbers.
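To back the visual trend with numbers, titles can be counted per decade. The following is a minimal sketch on made-up release years; the variable names are illustrative and the real `release_year` column would replace `toy_years`.

```python
import pandas as pd

# Made-up release years standing in for df_titles['release_year']
toy_years = pd.Series([1985, 1999, 2004, 2016, 2018, 2019, 2021],
                      name='release_year')

# Integer-divide by 10 and scale back up to bucket each year into its decade
decade = (toy_years // 10) * 10
titles_per_decade = decade.value_counts().sort_index()
print(titles_per_decade)
```

A table like this makes it easy to state how much of the catalogue comes from each period instead of reading bar heights off the histogram.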

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:** The chart shows a significant increase in content production, especially in recent years. This suggests opportunities for increased investment in content creation and acquisition to meet market demand, and for targeted marketing of newer titles.

**Insights leading to negative growth (or areas for strategic consideration):** The booming production also implies a highly competitive and potentially saturated market. Relying on volume without differentiation could lead to diminishing returns or a decline in audience engagement. Failing to keep pace with modern content trends also risks inefficient resource allocation.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
sns.histplot(df_titles['imdb_score'].dropna(), bins=20, kde=True, color='skyblue')
plt.title('Distribution of IMDb Scores')
plt.xlabel('IMDb Score')
plt.ylabel('Number of Titles')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a histogram with a KDE overlay because it suits a single continuous variable such as `imdb_score` in `df_titles`. It shows how titles are rated on IMDb, highlighting common score ranges and potential outliers.

##### 2. What is/are the insight(s) found from the chart?

IMDb scores are concentrated in the middle range, with a prominent peak between 6.0 and 7.0. The distribution is roughly bell-shaped, suggesting that most titles receive average to above-average ratings. Few titles receive extremely low (below 3.0) or extremely high (above 8.5) scores.
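The "roughly bell-shaped" reading can be checked numerically. The sketch below, on made-up scores, compares mean and median, computes the sample skewness with pandas' `Series.skew()`, and measures the share of titles in the 6.0-7.0 band. The toy values and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up scores standing in for df_titles['imdb_score'].dropna()
toy_scores = pd.Series([4.1, 5.8, 6.2, 6.4, 6.5, 6.7, 6.9, 7.1, 7.4, 8.6],
                       name='imdb_score')

score_stats = {
    'mean': toy_scores.mean(),
    'median': toy_scores.median(),
    'skew': toy_scores.skew(),  # near 0 for a symmetric distribution
}

# Fraction of titles whose score lies in [6.0, 7.0] (bounds inclusive)
share_6_to_7 = toy_scores.between(6.0, 7.0).mean()

print(score_stats)
print(share_6_to_7)
```

A mean close to the median and a skewness near zero would support the symmetric description; a clearly negative skewness would indicate a longer tail of low-rated titles.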

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

*   **Content Curation:** Helps platforms focus on acquiring high-quality content (e.g., above-average IMDb scores) to boost subscriber satisfaction.
*   **Benchmarking:** The typical 6.0-7.0 score range acts as a benchmark for evaluating new content.
*   **Audience Targeting:** Highlighting critically acclaimed titles can attract discerning viewers.

**Insights leading to negative growth (or areas for strategic consideration):**

*   **Risk of Mediocrity:** Over-reliance on average-scoring content can lead to market saturation and a lack of differentiation.
*   **Ignoring Niche Content:** Focusing too much on the mean might cause businesses to miss valuable, highly rated niche content that caters to diverse audiences.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(7, 5))
sns.countplot(data=df_credits, x='role', hue='role', palette='coolwarm', legend=False)
plt.title('Distribution of Roles in Credits')
plt.xlabel('Role')
plt.ylabel('Number of Entries')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a count plot because it shows the frequency of each category of a single categorical variable. Applied to `role` (ACTOR vs. DIRECTOR) in `df_credits`, it gives a clear breakdown of the professionals listed in the credits.

##### 2. What is/are the insight(s) found from the chart?

The `df_credits` dataset records far more 'ACTOR' entries than 'DIRECTOR' entries. This is expected, since acting credits are typically more numerous per title than directing credits, and it means the dataset is weighted toward acting credits.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

*   **Talent Management Strategy:** Helps inform recruitment strategies for talent agencies or production houses, especially given the high number of acting credits.
*   **Data Interpretation:** Crucial for understanding the dataset's bias towards acting credits in any analysis of individual contributions.

**Insights leading to negative growth (or areas for strategic consideration):**

*   **Potential Data Imbalance:** While not directly leading to negative growth, the high count of actors versus directors highlights a data imbalance. If the business objective requires a balanced analysis of creative roles (e.g., a director's influence), this bias might limit the scope or require external data, and could lead to incomplete strategic decisions if not acknowledged.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(12, 6))
sns.histplot(df_titles['runtime'].dropna(), bins=30, kde=True, color='purple')
plt.title('Distribution of Content Runtime (Minutes)')
plt.xlabel('Runtime (Minutes)')
plt.ylabel('Number of Titles')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a histogram because it shows the distribution of a single numerical variable. Applied to `runtime` in `df_titles`, it reveals the typical duration of movies and shows in the dataset and exposes any unusually short or long content.

##### 2. What is/are the insight(s) found from the chart?

The runtime distribution is right-skewed. It has a prominent peak (mode) around 90-100 minutes, the number of titles drops quickly as runtime increases, and a long tail indicates some very long content. A significant majority of titles in the dataset are therefore feature-film length or shorter.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

*   **Content Scheduling & User Experience:** Optimizes scheduling based on typical content durations (e.g., the 90-100 minute peak).
*   **Production & Acquisition Strategy:** Informs planning of new content or acquiring titles that match audience expectations for length.
*   **Tailored Marketing:** Enables specific marketing for short-form or exceptionally long content.

**Insights leading to negative growth (or areas for strategic consideration):**

*   **Over-saturation in a Specific Runtime:** Over-investment in the common 90-100 minute range without variety risks audience fatigue and missing preferences for shorter or longer content.
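Since the `runtime` column mixes movies and shows, a per-type summary helps separate the two populations before interpreting the combined histogram. This is a minimal sketch on made-up rows; the toy frame stands in for `df_titles` and its values are assumptions.

```python
import pandas as pd

# Made-up rows standing in for df_titles[['type', 'runtime']]
toy_titles = pd.DataFrame({
    'type': ['MOVIE', 'MOVIE', 'MOVIE', 'SHOW', 'SHOW'],
    'runtime': [95, 102, 88, 24, 45],
})

# Count, median, minimum, and maximum runtime for each content type
runtime_by_type = (toy_titles.groupby('type')['runtime']
                   .agg(['count', 'median', 'min', 'max']))
print(runtime_by_type)
```

If the medians differ substantially between the two types, the shape of the overall histogram reflects the mix of content types as much as any runtime preference.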

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# First, explode the 'genres' column to get individual genres for counting
genres_exploded = df_titles['genres'].explode()

# Get the top 10 most frequent genres
top_genres = genres_exploded.value_counts().head(10)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_genres.index, y=top_genres.values, hue=top_genres.index, palette='viridis', legend=False)
plt.title('Top 10 Most Frequent Genres')
plt.xlabel('Genre')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

Because the `genres` column in `df_titles` contains lists, it is first exploded so that each individual genre can be counted. A bar chart of the top 10 most frequent genres then gives a clear picture of the dominant content types.

##### 2. What is/are the insight(s) found from the chart?

This chart reveals the dominant genres within the dataset. Genres such as 'drama', 'comedy', and 'documentation' are among the most frequent, and the bar heights show their relative abundance among the top selections. Titles with an empty genre list (`[]`) carry no genre information and are not shown here: exploding an empty list yields NaN, which `value_counts()` drops by default, so these titles need to be counted separately.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

*   **Content Acquisition & Production:** Identifying the most frequent genres can directly inform content acquisition strategies (what to license) and production decisions (what to greenlight).
*   **Marketing Strategy:** Knowing popular genres helps in tailoring marketing campaigns and categorizing content effectively.

**Insights leading to negative growth (or areas for strategic consideration):**

*   **Market Saturation & Lack of Differentiation:** Focusing solely on popular genres can lead to market saturation, making it harder to differentiate and attract new subscribers.
*   **Ignoring Niche Markets:** Less frequent genres, while not broadly popular, might cater to dedicated niche audiences. Ignoring these can lead to missed opportunities.
*   **Data Quality Concern:** Titles with an empty genre list (`[]`) have missing or unclassified genre information. They are dropped from the exploded counts, so ignoring them could lead to inaccurate analyses.
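How empty genre lists behave under `explode()` is easy to verify on a toy frame. The sketch below uses made-up rows to show that an empty list becomes NaN, that `value_counts()` omits it by default, and how the affected titles can be counted directly. All names and values here are illustrative assumptions.

```python
import pandas as pd

# Made-up rows standing in for df_titles after the list conversion
toy_titles = pd.DataFrame({
    'id': ['t1', 't2', 't3'],
    'genres': [['drama', 'comedy'], [], ['drama']],
})

# An empty list explodes to a single NaN row for that title
exploded = toy_titles['genres'].explode()

# value_counts() drops NaN by default, so the empty entry is not counted
genre_counts = exploded.value_counts()

# Count titles with no genre directly from the list lengths
n_without_genre = int((toy_titles['genres'].apply(len) == 0).sum())

print(genre_counts)
print(n_without_genre)
```

The same check applies to any other list-valued column, such as `production_countries`.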

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# First, explode the 'production_countries' column to get individual countries for counting
countries_exploded = df_titles['production_countries'].explode()

# Get the top 10 most frequent countries
top_countries = countries_exploded.value_counts().head(10)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_countries.index, y=top_countries.values, hue=top_countries.index, palette='crest', legend=False)
plt.title('Top 10 Most Frequent Production Countries')
plt.xlabel('Production Country')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

As with genres, the `production_countries` column contains lists, so it is first exploded to count each individual country. A bar chart of the top 10 most frequent production countries then shows clearly which countries are most prolific in content production.

##### 2. What is/are the insight(s) found from the chart?

This chart highlights the countries with the highest volume of content production within the dataset. Countries such as the United States ('US') and India ('IN') dominate, indicating their significant role in the media landscape represented here, and the bar heights show the relative output of the major content hubs. Titles with no specified production country (an empty list, `[]`) do not appear in the chart, because exploding an empty list yields NaN and `value_counts()` drops it by default. They point to a data quality issue that should be quantified separately.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

*   **Content Sourcing & Market Expansion:** Identifying dominant production countries can guide content acquisition strategies and help diversify the library or target specific regional audiences.
*   **Co-production Opportunities:** Understanding major producers can inform potential co-production partnerships.
*   **Geographic Marketing:** Knowing the origin of content can help tailor marketing efforts to relevant audiences.

**Insights leading to negative growth (or areas for strategic consideration):**

*   **Over-reliance on Dominant Markets:** Exclusive reliance on major production countries could lead to a lack of diversity in content, potentially limiting audience growth and competitive differentiation.
*   **Cultural Specificity Issues:** Content from dominant countries might not always translate well culturally, leading to poor audience reception.
*   **Data Quality Concern:** Titles with an empty production country list (`[]`) have missing information, which could lead to an incomplete understanding of content origins and flawed strategic decisions.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(12, 7))
sns.scatterplot(data=df_titles, x='imdb_score', y='imdb_votes', alpha=0.6, hue='type', palette='viridis')
plt.title('IMDb Score vs. IMDb Votes')
plt.xlabel('IMDb Score')
plt.ylabel('IMDb Votes')
plt.yscale('log') # Use log scale for votes due to wide range
plt.grid(axis='both', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

I picked a scatter plot because it shows the relationship between two numerical variables, here `imdb_score` and `imdb_votes` from `df_titles`. It helps reveal whether a title's rating correlates with how many people have rated it. Votes are plotted on a log scale because of their wide range.

##### 2. What is/are the insight(s) found from the chart?

*   Titles with higher IMDb scores tend to have a larger number of IMDb votes, suggesting that well-received content often garners more attention and engagement.
*   There is also a significant cluster of titles with average scores (around 6.0-7.0) whose votes range from very few to many.
*   Extremely high-voted titles are concentrated in the mid-to-high score range.
*   The content type (movie vs. show) might show slight differences in voting patterns, with movies potentially having a broader range of high vote counts.
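The tendency for higher-scored titles to attract more votes can be quantified with a rank correlation, which is unaffected by the log scaling used on the plot. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks on made-up data; the toy frame and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for df_titles[['imdb_score', 'imdb_votes']]
toy_titles = pd.DataFrame({
    'imdb_score': [5.0, 6.0, 6.5, 7.0, 8.0, np.nan],
    'imdb_votes': [100, 1000, 800, 50000, 900000, 300],
})

# Keep only rows where both values are present
pairs = toy_titles[['imdb_score', 'imdb_votes']].dropna()

# Spearman correlation = Pearson correlation computed on the ranks
spearman_rho = pairs['imdb_score'].rank().corr(pairs['imdb_votes'].rank())
print(round(spearman_rho, 3))
```

A coefficient near zero on the real data would mean the visual trend is weaker than it looks, so the number is worth reporting alongside the chart.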

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

*   **Content Identification & Promotion:** Content with high IMDb scores and a large number of votes is widely recognized and generally well regarded. Promoting it can boost engagement and subscriptions.
*   **Quality vs. Popularity Balance:** Helps understand the balance between perceived quality and audience reach, informing strategies for content acquisition and production.
*   **Early Detection:** Rapidly increasing vote counts on new content can signal growing popularity, prompting earlier marketing boosts.

**Insights leading to negative growth (or areas for strategic consideration):**

*   **Misinterpretation of Low Votes:** Disregarding older, high-scoring content with low votes could mean missing underrated gems. Conversely, over-investing in high-voted but average-scoring content might lead to audience fatigue or a perception of generic offerings. Suboptimal decisions of either kind could lead to negative growth.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(12, 7))
sns.scatterplot(data=df_titles, x='runtime', y='imdb_score', hue='type', palette='magma', alpha=0.7)
plt.title('Content Runtime vs. IMDb Score (by Type)')
plt.xlabel('Runtime (Minutes)')
plt.ylabel('IMDb Score')
plt.grid(axis='both', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

For this chart, I'll create a scatter plot to examine the relationship between runtime and imdb_score from the df_titles DataFrame. We'll also differentiate between 'MOVIE' and 'SHOW' types to see if there are any differences in their runtime-score patterns. This will help us understand if content length correlates with its rating.

##### 2. What is/are the insight(s) found from the chart?

For Content Runtime vs. IMDb Score (by Type), the key insights are:

- Most highly rated titles (IMDb score above 7.0) fall within a moderate runtime range, often between 60 and 180 minutes, especially for movies.
- While there are outliers, extremely short content (e.g., under 30 minutes) and extremely long content (e.g., over 200 minutes) generally have fewer high scores.
- TV shows (type='SHOW') typically have much shorter runtimes per episode but can still achieve high IMDb scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The pattern suggests an optimal runtime 'sweet spot' for critical reception, particularly for movies, which can inform runtime targets when producing or acquiring content.
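
The runtime ranges quoted for this chart can be checked numerically by binning runtime and computing the share of high-scoring titles per bin. This is only a sketch: `sample_titles` is a hypothetical stand-in for `df_titles`, and the bin edges are assumptions chosen to mirror the ranges mentioned in the text.

```python
import pandas as pd

# Hypothetical stand-in for df_titles (values invented for illustration).
sample_titles = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "MOVIE", "MOVIE"],
    "runtime": [25, 95, 110, 150, 22, 45, 210, 88],
    "imdb_score": [6.1, 7.4, 6.5, 7.8, 8.2, 6.9, 6.0, 7.1],
})

# Assumed bin edges; intervals are right-closed, e.g. (60, 180].
bins = [0, 30, 60, 180, float("inf")]
labels = ["<=30", "31-60", "61-180", ">180"]
sample_titles["runtime_bin"] = pd.cut(sample_titles["runtime"], bins=bins, labels=labels)

# Flag titles scoring above 7.0, then summarize per (type, runtime_bin).
sample_titles["high_score"] = sample_titles["imdb_score"] > 7.0
summary = (
    sample_titles
    .groupby(["type", "runtime_bin"], observed=True)["high_score"]
    .agg(share_high="mean", n_titles="size")
)
print(summary)
```

Reporting `n_titles` next to the share makes it easier to spot bins where a high or low share rests on only a handful of titles.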


#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Select only numerical columns for correlation analysis
numerical_df_titles = df_titles.select_dtypes(include=['float64', 'int64'])

plt.figure(figsize=(10, 8))
sns.heatmap(numerical_df_titles.corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Heatmap of Numerical Features in df_titles')
plt.show()

##### 1. Why did you pick the specific chart?

For this chart, I will generate a correlation heatmap of the numerical columns in the df_titles DataFrame. This will allow us to visually assess the linear relationships between variables like release_year, runtime, imdb_score, imdb_votes, tmdb_popularity, and tmdb_score.

##### 2. What is/are the insight(s) found from the chart?

For the Correlation Heatmap of Numerical Features in df_titles, the key insights are:

- Strong Positive Correlations: imdb_score and tmdb_score are highly positively correlated, which is expected as both represent ratings. Similarly, imdb_votes and tmdb_popularity often show a strong positive correlation, indicating that popular titles tend to receive more votes.
- Moderate Correlations: runtime might show a slight positive correlation with imdb_score or imdb_votes, suggesting that longer content can be well-received or widely viewed, but it's not a definitive rule.
- Weak/No Correlations: release_year typically has weak or negligible correlations with scores or votes, suggesting that the year of release alone doesn't directly dictate a title's quality or popularity (though older titles might have fewer votes due to less recent exposure).

These insights help in understanding which metrics move together and which are independent.
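
To make statements like "strong" or "weak" correlation concrete, the correlation matrix can be flattened into a ranked list of column pairs. The sketch below uses a small hypothetical numeric frame in place of `numerical_df_titles`; the values are invented and only illustrate the mechanics.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for the numeric columns of df_titles.
sample_numeric = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "tmdb_score": [5.2, 6.1, 6.8, 8.3, 8.9],
    "runtime": [100, 90, 120, 95, 110],
})

corr = sample_numeric.corr()

# Collect each unordered column pair once (no diagonal, no mirrored duplicates).
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})

# Order pairs by absolute correlation, strongest first.
strongest_first = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(strongest_first)
```

Sorting by absolute value keeps strong negative relationships near the top as well, which a heatmap colour scale can make easy to overlook.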

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

- Feature Selection for Modeling: Understanding correlations is critical for building predictive models. If imdb_score and tmdb_score are highly correlated, one might be redundant in a model, or one could be used to impute missing values for the other. This optimizes model performance and prevents multicollinearity issues.
- Content Valuation: High correlation between imdb_votes and tmdb_popularity suggests that titles that generate significant engagement on one platform are likely popular on others. This can inform content acquisition strategies, highlighting content that has broad appeal.
- Data Validation: Strong expected correlations (e.g., between different score metrics) confirm data consistency, while unexpected weak correlations might flag data quality issues or interesting nuances.

Insights leading to negative growth (or areas for strategic consideration):

- Misleading Correlations: While correlations show relationships, they do not imply causation. Acting solely on correlations without deeper investigation (e.g., assuming higher runtime causes higher scores) could lead to flawed content strategies. For example, producing longer films based on a weak positive correlation might not guarantee higher scores and could lead to increased production costs and potentially lower audience completion rates if quality isn't maintained.
- Over-reliance on Popularity Metrics: If a business exclusively optimizes for highly correlated popularity metrics (like votes and popularity scores) without considering other factors (e.g., critical acclaim, artistic value, niche appeal), it might lead to a homogeneous content library, potentially alienating diverse audience segments and hindering long-term innovation.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Calculate average IMDb score by age_certification
# (groupby excludes NaN certifications and mean() skips NaN scores by default)
avg_score_by_age = df_titles.groupby('age_certification')['imdb_score'].mean().sort_values(ascending=False)

plt.figure(figsize=(12, 7))
sns.barplot(x=avg_score_by_age.index, y=avg_score_by_age.values, hue=avg_score_by_age.index, palette='Spectral', legend=False)
plt.title('Average IMDb Score by Age Certification')
plt.xlabel('Age Certification')
plt.ylabel('Average IMDb Score')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

For this chart, I'll visualize the average IMDb score for each age_certification category from the df_titles DataFrame. This will help us understand if certain age ratings are generally associated with higher or lower IMDb scores.

##### 2. What is/are the insight(s) found from the chart?

For Average IMDb Score by Age Certification, the key insight is:

The bar chart for 'Average IMDb Score by Age Certification' typically reveals that certain age certifications (e.g., those for mature audiences like 'TV-MA' or 'R') might have slightly higher average IMDb scores compared to more general audience ratings (e.g., 'G' or 'TV-Y'). This could be attributed to more complex storytelling, thematic depth, or target audience preferences often associated with mature content. Conversely, content rated for younger audiences might have a wider range of average scores or cluster slightly lower, reflecting different critical criteria or broader appeal. The ordering helps to quickly identify the 'best-rated' age categories.
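
An average per certification is only as trustworthy as the number of titles behind it, so it can help to report the group size and median alongside the mean. The following is a minimal sketch on a hypothetical stand-in for `df_titles`; the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for df_titles (values invented for illustration).
sample_titles = pd.DataFrame({
    "age_certification": ["TV-MA", "TV-MA", "R", "G", None, "TV-MA", "R", None],
    "imdb_score": [7.5, 8.0, 6.5, 6.0, 7.0, None, 7.5, 5.5],
})

# groupby drops rows whose key is missing; count/mean/median ignore NaN scores.
by_cert = (
    sample_titles
    .groupby("age_certification")["imdb_score"]
    .agg(n_scored="count", mean_score="mean", median_score="median")
    .sort_values("mean_score", ascending=False)
)
print(by_cert)
```

A certification that ranks first on the mean but has a very small `n_scored` is better treated as tentative than as the "best-rated" category.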

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

- Content Targeting & Development: If a media company aims to produce or acquire critically acclaimed content, this insight can guide them towards age certifications that historically receive higher average scores.
- Marketing Strategy: Knowing which age certifications resonate more positively with critics can help tailor marketing messages.
- Audience Segmentation: Understanding the typical score distribution across age groups helps in segmenting content and anticipating audience reception.

Insights leading to negative growth (or areas for strategic consideration):

- Misinterpretation of Correlation: Over-focusing exclusively on producing content for higher age certifications to chase higher scores might alienate broader family audiences, leading to a narrower market reach and potentially limiting overall subscriber growth.
- Excluding Mass Market Appeal: Neglecting 'G' or 'PG' content based solely on average IMDb scores could result in significant missed revenue opportunities and a failure to cater to a very large and profitable audience segment.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(12, 7))
sns.countplot(data=df_titles, x='age_certification', hue='age_certification', palette='viridis', legend=False)
plt.title('Distribution of Age Certifications')
plt.xlabel('Age Certification')
plt.ylabel('Number of Titles')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

For this chart, I'll visualize the distribution of age_certification from the df_titles DataFrame. This will help us understand the breakdown of age ratings across the titles, which is important for content targeting and audience segmentation.

##### 2. What is/are the insight(s) found from the chart?

For Distribution of Age Certifications, the key insight is:

The chart reveals the prevalence of different age ratings within the df_titles dataset. Typically, it will show that certain certifications (e.g., 'R', 'PG-13', 'G', or 'TV-MA') are significantly more common than others. This indicates the primary audience demographics that the content is aimed at. The presence of a large number of NaN values (missing age certifications) also highlights a data quality issue, suggesting that age ratings are not consistently available for all titles.
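
As far as I know, a count plot of a column does not draw a bar for missing values, so the extent of missing certifications is easier to confirm with a direct count. The sketch below uses a hypothetical stand-in for `df_titles`, and the `"Missing"` label is an arbitrary placeholder chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_titles (values invented for illustration).
sample_titles = pd.DataFrame({
    "age_certification": ["TV-MA", "R", None, "TV-MA", None],
})

# Give missing certifications an explicit label so they show up in counts
# (and would show up as their own bar if this column were plotted).
cert_filled = sample_titles["age_certification"].fillna("Missing")
cert_counts = cert_filled.value_counts()

# Fraction of titles with no certification at all.
missing_share = sample_titles["age_certification"].isna().mean()

print(cert_counts)
print(f"Share missing: {missing_share:.0%}")
```

Plotting `cert_filled` instead of the raw column would make the data quality gap visible in the chart itself.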

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Content Strategy & Curation: Understanding dominant age certifications helps curate content libraries effectively for target audiences.
Marketing & Audience Segmentation: Informs marketing campaigns and allows for precise targeting of promotional efforts.
Compliance: Crucial for ensuring compliance with content regulations and parental control features.
Insights leading to negative growth (or areas for strategic consideration):

Over-specialization Risk: If content is heavily skewed towards certain age certifications, it might alienate other audience segments, limiting growth.
Data Quality Gap: Significant missing age_certification values (NaNs) can lead to incomplete audience understanding, hindering effective categorization, marketing, and compliance, potentially impacting growth.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(10, 6))
sns.violinplot(data=df_titles, x='type', y='runtime', hue='type', palette='coolwarm', legend=False)
plt.title('Runtime Distribution by Content Type (Movies vs. Shows)')
plt.xlabel('Content Type')
plt.ylabel('Runtime (Minutes)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

For this chart, I'll visualize the runtime distribution of content, separated by whether it's a 'MOVIE' or a 'SHOW'. This will help us clearly see the typical lengths for each content type and their variations.

##### 2. What is/are the insight(s) found from the chart?

For Runtime Distribution by Content Type (Movies vs. Shows), the key insight is:

This violin plot clearly illustrates distinct runtime patterns for 'MOVIE' and 'SHOW' content. Movies typically exhibit a much tighter distribution with a median runtime concentrated around 90-120 minutes, confirming the standard feature film length. Shows, on the other hand, have a much shorter median runtime (likely representing episode length) but show a wider, more spread-out distribution, indicating variability in episode durations or possibly the inclusion of specials/mini-series within the 'SHOW' category. There are also outliers, with some movies being very short or very long, and some show episodes being exceptionally long.
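
Claims about a "tighter" or "wider" distribution can be backed with quartiles per content type. This is a minimal sketch on a hypothetical stand-in for `df_titles`, with invented values; the interquartile range (IQR) serves as a simple measure of spread.

```python
import pandas as pd

# Hypothetical stand-in for df_titles (values invented for illustration).
sample_titles = pd.DataFrame({
    "type": ["MOVIE"] * 5 + ["SHOW"] * 5,
    "runtime": [80, 90, 100, 110, 120, 10, 25, 45, 60, 90],
})

# Quartiles of runtime for each content type, one row per type.
quartiles = (
    sample_titles.groupby("type")["runtime"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q25", "median", "q75"]

# Interquartile range: width of the middle 50% of runtimes.
quartiles["iqr"] = quartiles["q75"] - quartiles["q25"]
print(quartiles)
```

Comparing `iqr` between the two rows gives a direct check of which content type has the more spread-out runtimes.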


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

- Content Strategy & Production: This insight directly informs content development. A studio can tailor production efforts to specific content types, knowing the expected runtime for movies versus the per-episode runtime for shows. This can optimize budgeting, production timelines, and creative direction.
- Platform Design & User Experience: Streaming platforms can use this to design better user interfaces and recommendation systems. For example, categorizing content by typical length helps users quickly find what they're looking for (e.g., a 'quick watch' vs. a 'feature film').
- Scheduling & Curation: Broadcasters or streaming services can optimize their schedules or content curation based on the typical duration of content types, ensuring a balanced offering and efficient use of airtime/platform space.

Insights leading to negative growth (or areas for strategic consideration):

- Misaligning Content with Expectations: Producing a 'movie' with a runtime significantly outside the expected 90-120 minute range without a strong creative justification could lead to audience dissatisfaction or lower critical reception if viewer expectations are not met. For example, an overly long movie might lead to viewer fatigue, while an unusually short one might feel incomplete. Such misalignments could result in lower viewership, reduced engagement, and ultimately negative growth for specific titles or a content library if it becomes a pattern.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Chart - 14 visualization code
# Select only numerical columns from df_merged for correlation analysis
numerical_df_merged = df_merged.select_dtypes(include=['float64', 'int64'])

plt.figure(figsize=(12, 10))
sns.heatmap(numerical_df_merged.corr(), annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Heatmap of Numerical Features in Merged DataFrame')
plt.show()

##### 1. Why did you pick the specific chart?

Correlation Heatmap! I will generate a correlation heatmap using the numerical columns from our df_merged DataFrame. This will allow us to re-examine the linear relationships between numerical variables within the combined dataset.

##### 2. What is/are the insight(s) found from the chart?

For the Correlation Heatmap of Numerical Features in Merged DataFrame, the key insights are:

This correlation heatmap for df_merged largely reiterates insights from the df_titles numerical correlation (Chart 10). Key observations include:

- Strong Positive Correlations: High correlation between imdb_score and tmdb_score (as expected) and between imdb_votes and tmdb_popularity.
- Weak/Negligible Correlations: release_year generally shows very weak correlation with score or popularity metrics. person_id, being an identifier, also shows negligible correlation with other numerical features.
- Runtime: Shows weak to moderate correlations with other metrics, suggesting that while runtime is a factor, it's not a primary driver of scores or popularity in a linear fashion.

Overall, the heatmap confirms that the relationships observed in the individual df_titles DataFrame persist within the merged dataset, with person_id not introducing new meaningful linear correlations with numerical attributes.
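
One caveat worth checking: if df_merged holds one row per credit, each title's scores repeat once per cast or crew member, so titles with large casts weigh more heavily in the correlations. The sketch below reduces the data to one row per title first. It assumes a title key column named `id`, which is a guess about the merge key, and uses invented values.

```python
import pandas as pd

# Hypothetical stand-in for df_merged: one row per (title, person) credit.
# The "id" column name is an assumption about the title key.
sample_merged = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t2", "t3", "t3"],
    "person_id": [1, 2, 3, 4, 5, 6],
    "imdb_score": [8.0, 8.0, 8.0, 6.0, 7.0, 7.0],
    "imdb_votes": [900, 900, 900, 100, 500, 500],
})

# Keep one row per title and drop the identifier before correlating.
title_level = (
    sample_merged
    .drop_duplicates(subset="id")
    .drop(columns=["person_id"])
    .select_dtypes(include="number")
)
corr_titles = title_level.corr()
print(corr_titles)
```

Comparing `corr_titles` with the credit-level heatmap shows whether the repeated rows change the picture in any meaningful way.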

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Select only numerical columns for the pair plot
numerical_df_merged = df_merged.select_dtypes(include=['float64', 'int64'])

# Drop columns that are identifiers or have too many unique values for a meaningful pair plot
# 'person_id' is an identifier and 'seasons' is very sparse, 'release_year' is better visualized as a histogram already
columns_for_pairplot = numerical_df_merged.drop(columns=['person_id', 'seasons', 'release_year'], errors='ignore')

sns.pairplot(columns_for_pairplot.dropna(), diag_kind='kde', plot_kws={'alpha':0.6})
plt.suptitle('Pair Plot of Numerical Features in Merged DataFrame', y=1.02) # Adjust suptitle position
plt.show()

##### 1. Why did you pick the specific chart?

Pair Plot! This chart will allow us to visualize the distributions of all numerical variables and their pairwise relationships within our df_merged DataFrame. It's a great way to get a high-level overview of potential correlations and patterns.

##### 2. What is/are the insight(s) found from the chart?

For the Pair Plot of Numerical Features in Merged DataFrame, the key insights are:

- Diagonal Distributions: The KDEs on the diagonal show the individual distributions (e.g., imdb_score and tmdb_score tend to be somewhat normally distributed around the mean, while imdb_votes and tmdb_popularity are heavily skewed right, indicating many titles with low engagement and a few with extremely high engagement).
- Pairwise Relationships: The scatter plots reveal a clear positive linear relationship between imdb_score and tmdb_score, and similarly between imdb_votes and tmdb_popularity. This confirms their strong correlation. Other pairs, like runtime vs. scores, show a weaker, more scattered relationship, indicating runtime alone isn't a strong linear predictor of scores.
- Outliers: The scatter plots can highlight potential outliers or interesting clusters that might not be obvious from summary statistics alone. For instance, titles with extremely high votes but average scores, or vice versa.

Overall, the pair plot provides a richer, visual understanding of the data's structure and interdependencies, making complex relationships more intuitive.
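
Because imdb_votes and tmdb_popularity are heavily right-skewed, their scatter panels tend to bunch near the origin. A log transform before plotting is one common remedy. The sketch below shows the idea on invented vote counts; `log_imdb_votes` is a hypothetical helper column, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Hypothetical vote counts with a long right tail (values invented).
sample = pd.DataFrame({"imdb_votes": [0, 9, 99, 999, 99999]})

skew_before = sample["imdb_votes"].skew()

# log1p computes log(1 + x), so titles with zero votes stay defined at 0.
sample["log_imdb_votes"] = np.log1p(sample["imdb_votes"])
skew_after = sample["log_imdb_votes"].skew()

print(f"Skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Passing the transformed column to the pair plot in place of the raw one would spread the low-engagement titles out and make any relationship with scores easier to see.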

#### Chart - 16 - Count Plot

In [None]:
# Chart - 16 visualization code

# Get the top 10 most frequent characters, dropping NaN values
top_characters = df_credits['character'].value_counts().head(10)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_characters.index, y=top_characters.values, hue=top_characters.index, palette='magma', legend=False)
plt.title('Top 10 Most Frequent Characters in Credits')
plt.xlabel('Character Name')
plt.ylabel('Number of Appearances')
plt.xticks(rotation=45, ha='right') # Rotate labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

##### 1. Why did you pick the specific chart?

Count Plot! For this chart, I'll visualize the top 10 most frequently appearing characters from the df_credits DataFrame. This will give us an interesting look at common roles or recurrent individuals in the credits.

##### 2. What is/are the insight(s) found from the chart?

For Top 10 Most Frequent Characters in Credits, the key insight is:

This chart reveals that generic character names like 'Himself' and 'Self' are overwhelmingly the most frequent entries. This suggests a significant portion of the credits.csv data might pertain to documentaries, reality shows, or archival footage where individuals appear as themselves, rather than playing fictional characters. After these generic entries, other common character types might emerge, indicating prevalent tropes or types of roles within the content.
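
To see which character types emerge once the generic self-appearances are set aside, those entries can be filtered out before counting. This is a sketch on a hypothetical stand-in for `df_credits`; the set of generic labels is an assumption and would need adjusting to the spellings actually present in the data.

```python
import pandas as pd

# Hypothetical stand-in for df_credits (values invented for illustration).
sample_credits = pd.DataFrame({
    "character": ["Himself", "Self", "himself ", "Narrator",
                  "Batman", "Batman", None, "Herself"],
})

# Assumed list of generic self-appearance labels, compared case-insensitively.
generic_labels = {"himself", "herself", "self"}

names = sample_credits["character"].dropna().str.strip()
is_generic = names.str.lower().isin(generic_labels)

# Most frequent character names after removing generic entries.
top_fictional = names[~is_generic].value_counts().head(10)
print(f"Generic entries: {is_generic.sum()} of {len(names)}")
print(top_fictional)
```

The count of generic entries also gives a rough sense of how much of the credits data comes from non-fiction formats.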

#### Chart - 17 - Pie Chart

In [None]:
# Chart - 17 visualization code
role_counts = df_credits['role'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(role_counts, labels=role_counts.index, autopct='%1.1f%%', startangle=90, colors=['skyblue', 'lightcoral'])
plt.title('Proportion of Roles (Actor vs. Director)')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

##### 1. Why did you pick the specific chart?

Pie Chart! For this chart, we'll visualize the distribution of roles (ACTOR vs. DIRECTOR) from the df_credits DataFrame using a pie chart. This will clearly show the proportion of each role.

##### 2. What is/are the insight(s) found from the chart?

For Proportion of Roles (Actor vs. Director), the key insight is:

The pie chart for 'Proportion of Roles (Actor vs. Director)' clearly shows that 'ACTOR' roles constitute an overwhelmingly larger proportion of the df_credits dataset compared to 'DIRECTOR' roles. This indicates that the dataset is heavily weighted towards acting credits, which is typical for film and television productions where numerous actors appear in a single title, while there is usually only one or a few directors.
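
The explanation that each title has many actors but only one or a few directors can be checked by counting credits per title and role. The sketch below assumes a title key column named `id` in the credits data, which is a guess, and uses invented rows.

```python
import pandas as pd

# Hypothetical stand-in for df_credits; "id" is an assumed title key.
sample_credits = pd.DataFrame({
    "id": ["t1", "t1", "t1", "t1", "t2", "t2", "t3"],
    "role": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR", "ACTOR", "DIRECTOR", "ACTOR"],
})

# Rows: titles, columns: roles, values: number of credits (0 if none).
per_title = sample_credits.groupby(["id", "role"]).size().unstack(fill_value=0)

# Average number of credits per title for each role.
avg_per_title = per_title.mean()
print(per_title)
print(avg_per_title)
```

Titles with zero directors in `per_title` may also point to incomplete credits rather than a genuine absence.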

#### Chart - 18 - Box Plot

In [None]:
# Chart - 18 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_titles, x='type', y='imdb_score', hue='type', palette='viridis', legend=False)
plt.title('IMDb Score Distribution by Content Type')
plt.xlabel('Content Type')
plt.ylabel('IMDb Score')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

Box Plot! I'll create a box plot to visualize the distribution of imdb_score for 'MOVIE' and 'SHOW' content types. This will help us compare the median scores, spread, and identify outliers for each type.

##### 2. What is/are the insight(s) found from the chart?

For IMDb Score Distribution by Content Type, the key insights are:

- Median Score: We can observe if the median IMDb score for Movies is generally higher, lower, or similar to that of Shows.
- Score Spread: The interquartile range (the box itself) shows how concentrated the middle 50% of scores are for each type. One might have a wider spread than the other.
- Outliers: The plot highlights individual titles with unusually high or low scores that fall outside the typical range for their respective content type.

Often, Movies and Shows might have comparable median scores, but Shows might exhibit a slightly wider range of scores, possibly due to varying episode quality or specific series having very polarizing ratings.

#### Chart - 19 - Tree Map

In [None]:
# Chart - 19 visualization code
# First, ensure you have squarify installed: !pip install squarify
!pip install squarify
import squarify
import matplotlib.cm as cm

# Get top 10 genres again from genres_exploded (assuming it's still available)
# If not, re-run the explosion from Chart 6
if 'genres_exploded' not in locals():
    genres_exploded = df_titles['genres'].explode()

top_genres_for_treemap = genres_exploded.value_counts().head(10)

# Filter out the empty list/unknown genre if it's in top 10 for better visualization
top_genres_for_treemap = top_genres_for_treemap[top_genres_for_treemap.index != '[]']

# Prepare data for treemap
labels = [f'{genre}\n({count})' for genre, count in top_genres_for_treemap.items()]
sizes = top_genres_for_treemap.values

# Generate a color palette (e.g., using matplotlib's viridis colormap)
colors = [cm.viridis(i / float(len(labels))) for i in range(len(labels))]

plt.figure(figsize=(15, 8))
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=0.8)
plt.title('Treemap of Top 10 Most Frequent Genres')
plt.axis('off')
plt.show()

##### 1. Why did you pick the specific chart?

Tree Map! For this chart, we'll visualize the distribution of popular genres using a treemap. This type of chart is excellent for displaying hierarchical data and part-to-whole relationships, giving us an immediate sense of the most dominant genres by their area.

##### 2. What is/are the insight(s) found from the chart?

For the Treemap of Top 10 Most Frequent Genres, the key insights are:

The treemap clearly illustrates the relative dominance of the top genres. Large rectangles correspond to more frequent genres (like 'drama' and 'comedy'), visually confirming their widespread presence. The varying sizes of the rectangles immediately convey the proportional representation of each genre within the dataset, reinforcing insights from the bar chart (Chart 6) but in a different, space-filling format. It also highlights how rapidly the proportion decreases for less frequent genres, even within the top 10.
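
The proportions shown in a treemap depend on genres being counted individually. If the genres column stores text such as "['drama', 'comedy']" rather than real lists, `explode()` leaves each string untouched, so the cells need parsing first. The sketch below is one way to do that under that assumption; `parse_genres` is a hypothetical helper and the sample values are invented.

```python
import ast

import pandas as pd

# Hypothetical stand-in for df_titles with genres stored as strings.
sample_titles = pd.DataFrame({
    "genres": ["['drama', 'comedy']", "['drama']", "[]",
               "['comedy', 'thriller', 'drama']"],
})


def parse_genres(value):
    """Return a list of genres for a cell that is a list or a list-like string."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        return ast.literal_eval(value)
    return []


# Empty lists explode to NaN, so dropna() removes titles without genres.
genres_exploded_sample = sample_titles["genres"].apply(parse_genres).explode().dropna()

# Share of all genre tags held by each genre.
genre_share = genres_exploded_sample.value_counts(normalize=True)
print(genre_share)
```

With real lists, the empty-genre case disappears on its own instead of needing a filter on the literal string '[]'.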

#### Chart - 20 - Heat Map

In [None]:
# Chart - 20 visualization code
# Bin release_year into decades for better visualization
df_titles_copy = df_titles.copy() # Work on a copy to avoid modifying original df_titles
df_titles_copy['release_decade'] = (df_titles_copy['release_year'] // 10) * 10

# Create a pivot table for average IMDb score by type and release decade
pivot_table = df_titles_copy.pivot_table(index='type', columns='release_decade', values='imdb_score', aggfunc='mean')

plt.figure(figsize=(15, 7))
sns.heatmap(pivot_table, annot=True, fmt=".1f", cmap="YlGnBu", linewidths=.5)
plt.title('Average IMDb Score by Content Type and Release Decade')
plt.xlabel('Release Decade')
plt.ylabel('Content Type')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Heat Map! For this chart, I'll generate a heatmap that displays the average IMDb score, broken down by content type (Movie/Show) and release_decade. This will give us valuable insights into how average quality or critical reception has changed over time for different content formats.

##### 2. What is/are the insight(s) found from the chart?

For Average IMDb Score by Content Type and Release Decade, the key insights are:

- Decadal Trends: We can observe if there's an upward or downward trend in average IMDb scores for movies and shows across different decades. For instance, certain decades might show higher average scores, indicating periods of critically acclaimed content production.
- Type Comparison: It allows for a direct comparison of how movies and shows fared in terms of average IMDb scores within the same decade. One content type might consistently outscore the other in specific periods.
- Performance Gaps/Peaks: The heatmap highlights decades where a particular content type might have significantly high or low average scores, pointing to peak creative periods or areas needing attention. For example, older movies might have generally higher scores than very recent ones, or vice versa, and similar patterns might appear for shows.
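
Decade averages for early periods may rest on very few titles, which can make a cell look like a peak when it is mostly noise. One way to guard against this is to build a matching count table and mask cells below a minimum sample size. The sketch uses invented data, and `min_titles` is an arbitrary threshold chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_titles (values invented for illustration).
sample_titles = pd.DataFrame({
    "type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "MOVIE"],
    "release_year": [1994, 1999, 2015, 2012, 2019, 1958],
    "imdb_score": [8.0, 7.0, 6.0, 7.5, 8.5, 8.2],
})
sample_titles["release_decade"] = (sample_titles["release_year"] // 10) * 10

# Same pivot twice: once for the mean score, once for the number of scored titles.
mean_scores = sample_titles.pivot_table(
    index="type", columns="release_decade", values="imdb_score", aggfunc="mean"
)
n_titles = sample_titles.pivot_table(
    index="type", columns="release_decade", values="imdb_score", aggfunc="count"
)

# Assumed threshold: hide averages based on fewer than this many titles.
min_titles = 2
reliable_means = mean_scores.where(n_titles >= min_titles)
print(reliable_means)
```

Passing `reliable_means` to the heatmap would leave under-supported cells blank, so only averages with enough titles behind them are compared.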

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To achieve the business objective, the client should:

- Leverage Content Type & Genre Dominance: Focus investments on popular movie types and top genres (drama, comedy, thriller) that show consistent audience interest.
- Optimize Runtime & Age Certification: For movies, aim for the 90-180 minute 'sweet spot' for critical success. Balance high-scoring mature content with mass-market family content for broader appeal.
- Strategic Talent Acquisition: Use unique person IDs to track and invest in actors/directors historically associated with successful projects, improving casting decisions.
- Monitor Production Trends: Adapt to the increasing content output in recent decades. Continue sourcing from dominant countries (US, India) but also explore new markets.
- Improve Data Quality: Address missing values (e.g., age_certification) and ensure consistent data formats (e.g., actual lists for genres/countries) for more reliable analysis and better decision-making.

# **Conclusion**

This EDA project analyzed credits.csv and titles.csv, revealing key insights into content trends and talent dynamics. We found a dominance of movies and a surge in recent production, with genres like drama and comedy leading. IMDb scores cluster around 6.0-7.0, and the rating metrics (imdb_score and tmdb_score) are strongly correlated with each other. The runtime analysis distinguished movies (90-180 min) from shows (shorter episodes). Key production countries are the US and India, while actors significantly outnumber directors in credits. These findings provide a data-driven foundation for optimizing content strategies, talent acquisition, and resource allocation within the entertainment industry, supporting more informed business decisions.
