# **Project Name**    - Amazon Prime TV Shows and Movies: An Exploratory Data Analysis.



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This project explores Amazon Prime's content catalog using exploratory data analysis (EDA) methods to reveal how different characteristics of movies and TV shows relate to one another. Through data wrangling, visualization, and narrative, the project seeks to understand what drives the quality, popularity, and viewer appreciation of the content.

Understanding the Data Landscape

The project starts by importing and exploring two datasets: "titles" and "credits." The "titles" dataset includes data on individual movies and TV shows, including their release year, duration, genre, age rating, and IMDB ratings. The "credits" dataset includes information on the cast and crew for each title, connecting individuals to the roles they played and what they contributed.

Initial exploration covers the IMDB score distribution and the overall content quality on Amazon Prime. A scatter plot of IMDB score against runtime is used to check whether longer films or television shows tend to be rated higher, which would point to a possible preference for in-depth storytelling.

Genre, Age, and Runtime Insights

Further analysis shows the breakdown of age certifications, which identifies the audience for various content types. This data is valuable for marketing and promotion, enabling tailored campaigns that reach particular age groups. The investigation into leading genres reveals popular content categories, including drama, comedy, and documentaries, which present potential areas for content production and acquisition.

A comparison of runtime distributions for films and television shows indicates differing preferences for content length by format. This information can inform content creation choices, ensuring that movie and TV show lengths meet audience expectations. A study of average IMDB scores by director for films identifies high-performing directors, which can inform decisions regarding future collaborations and content purchases.

Discovering Talent and Collaborations

The project turns next to the individuals involved in content production. A bubble chart of actor/director popularity against mean IMDB score helps identify best-performing and up-and-coming talent. This information may be used in casting decisions, talent management, and promotional strategies.

A bar graph comparing the number of titles among actors and directors shows the industry's most active contributors, which hints at their level of experience. A heatmap of the actor/director distribution by genre can indicate areas of genre specialization and can be used to inform content acquisition and production strategies.

Insights for Business Impact

The conclusions of this EDA project provide useful recommendations for Amazon Prime's business decisions. By understanding how content features, talent, and audience preferences relate to one another, the platform can make informed decisions to improve user engagement, streamline content acquisition, and enhance marketing campaigns.

The insights derived can be used to provide positive business impact by:

* Enhancing Content Suggestions: By using data on genre preferences, age ratings, and trending talent, Amazon Prime can personalize content suggestions, improving user satisfaction and content discoverability.
* Content Acquisition Optimization: Knowledge of the factors driving content popularity and quality can help the platform acquire titles that appeal to its target audience.
* Refining Marketing Initiatives: Insights into genre interests and age certifications allow the platform to direct marketing campaigns at particular demographics to maximize reach and engagement.

Limitations and Considerations

Though the project is rich in insights, it has limitations. The analysis depends on the data that is available, which may contain built-in biases or fail to capture the full complexity of the film and television business. Further research and analysis are recommended to corroborate the findings and to consider other factors that determine content success.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Amazon Prime faces the challenge of optimizing its content strategy, acquisition and user experience to maximize user engagement, satisfaction and platform growth in a highly competitive streaming market.



#### **Define Your Business Objective?**

To leverage data-driven insights from exploratory data analysis (EDA) to enhance content strategy, optimize content acquisition, and personalize the user experience on the Amazon Prime platform, ultimately leading to increased user engagement, customer satisfaction, and platform growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt


### Dataset Loading

In [None]:
# Load Dataset
titles_data = pd.read_csv('titles.csv')
credits_data = pd.read_csv('credits.csv')

### Dataset First View

In [None]:
# Dataset First Look
titles_data.head()


In [None]:
credits_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
titles_data.shape



In [None]:
credits_data.shape

### Dataset Information

In [None]:
# Dataset Info
titles_data.info()
credits_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
titles_data.duplicated().sum()


In [None]:
credits_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
titles_data.isnull().sum()


In [None]:
credits_data.isnull().sum()

In [None]:
# Visualizing the missing values
titles_data.isna().sum().plot(kind='bar')

### What did you know about your dataset?

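The data describes the Amazon Prime catalog. The "titles" dataset holds information about the movies and series on the platform, and the "credits" dataset lists the cast and crew for each title.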

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
titles_data.columns


In [None]:
credits_data.columns

In [None]:
# Dataset Describe
titles_data.describe()


In [None]:
credits_data.describe()

### Variables Description

* id: The title ID on JustWatch.
* title: The name of the title.
* show_type: TV show or movie.
* description: A brief description.
* release_year: The release year.
* age_certification: The age certification.
* runtime: The length of the episode (SHOW) or movie.
* genres: A list of genres.
* production_countries: A list of countries that produced the title.
* seasons: Number of seasons if it's a SHOW.
* imdb_id: The title ID on IMDB.
* imdb_score: Score on IMDB.
* imdb_votes: Votes on IMDB.
* tmdb_popularity: Popularity on TMDB.
* tmdb_score: Score on TMDB.
* person_id: The person ID on JustWatch.
* id: The title ID on JustWatch.
* name: The actor or director's name.
* character_name: The character name.
* role: ACTOR or DIRECTOR.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
titles_data.nunique()


In [None]:
credits_data.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Rows with a missing 'imdb_score' or 'age_certification' are kept and imputed below rather than dropped.

# Impute missing values
mean_score = titles_data['imdb_score'].mean()
titles_data['imdb_score'] = titles_data['imdb_score'].fillna(mean_score)  # Direct assignment

median_runtime = titles_data['runtime'].median()
titles_data['runtime'] = titles_data['runtime'].fillna(median_runtime)  # Direct assignment

titles_data['age_certification'] = titles_data['age_certification'].fillna('Unknown')  # Direct assignment

### What all manipulations have you done and insights you found?

**Data Manipulations:**

1. Handling Missing Values:

* I addressed missing values in the 'imdb_score', 'runtime', and 'age_certification' columns of the 'titles' dataset.
* For 'imdb_score' and 'runtime', I imputed missing values using the mean and median, respectively, so that the central value of each column stays unchanged.
* For 'age_certification', I filled missing values with 'Unknown' to categorize titles without age ratings.

2. Data Merging:

* I merged the 'titles' and 'credits' datasets on the common column 'id' to combine information about titles and the individuals involved in their creation. This allowed me to analyze relationships between content features and talent attributes.

3. Data Filtering and Grouping:

* I filtered the data based on criteria such as 'show_type' (movie or TV show), 'role' (actor or director), and specific genres to perform targeted analysis on subsets of the data.
* I grouped the data by various features, such as genre, age certification, director, and number of seasons, to calculate aggregate statistics and visualize distributions.

4. Data Transformation:

* I extracted and counted genres from the 'genres' column to analyze the prevalence of different genres in the dataset.
* I created a new column 'person_title_connection' by combining 'person_id' and 'id' to represent unique connections between individuals and titles for visualization purposes.
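
The merge and the 'person_title_connection' column described above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual data; the key format (`person_id`, an underscore, then `id`) is an assumption, since any unambiguous combination of the two IDs would serve the same purpose.

```python
import pandas as pd

# Toy stand-ins for titles.csv and credits.csv (all values are made up)
titles = pd.DataFrame({
    "id": ["tm1", "ts2"],
    "title": ["Example Movie", "Example Show"],
    "imdb_score": [7.1, 8.3],
})
credits = pd.DataFrame({
    "person_id": [101, 102, 101],
    "id": ["tm1", "tm1", "ts2"],
    "name": ["Person A", "Person B", "Person A"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Inner merge on the shared title ID: one row per (title, person) credit
merged = pd.merge(titles, credits, on="id", how="inner")

# Assumed key format: "<person_id>_<title id>"
merged["person_title_connection"] = (
    merged["person_id"].astype(str) + "_" + merged["id"]
)

print(merged[["title", "name", "role", "person_title_connection"]])
```

Because the inner merge keeps only titles that have at least one credit, titles without cast or crew information drop out of any analysis built on the merged frame.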

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Distribution of IMDB Scores

In [None]:
# Chart - 1 visualization code

plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.histplot(titles_data['imdb_score'], bins=20, kde=True)
plt.title('Distribution of IMDB Scores')
plt.xlabel('IMDB Score')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

* Univariate Analysis: This chart focuses on a single variable (imdb_score) to understand its distribution.
* Distribution Visualization: Histograms are excellent for visualizing the distribution of numerical data, showing the frequency of values within different ranges (bins).
* Insightful: It helps us understand the range of IMDB scores, the most common scores, and the overall shape of the distribution (e.g., is it skewed, symmetrical, etc.).



##### 2. What is/are the insight(s) found from the chart?

This chart can provide insights into the quality of content in your dataset based on IMDB scores. You can identify:

* Content Quality Trends: Are most titles rated highly, or is there a wide range of scores?
* Potential Outliers: Are there any unusually high or low scores that might need further investigation?
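
The questions above about shape and outliers can also be answered numerically. The sketch below uses made-up scores in place of `titles_data['imdb_score']`, and the 1.5 × IQR fence (Tukey's rule) is one common convention for flagging outliers, chosen here as an assumption rather than taken from the notebook.

```python
import pandas as pd

# Made-up scores standing in for titles_data['imdb_score']
scores = pd.Series(
    [1.5, 5.8, 6.0, 6.2, 6.4, 6.5, 6.7, 7.0, 7.1, 7.3], name="imdb_score"
)

# Skewness: a negative value means a longer tail toward low scores
skewness = scores.skew()

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = scores[(scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)]

print(f"skewness: {skewness:.2f}")
print(outliers)
```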

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This information can help in making business decisions related to content acquisition, recommendation systems, and marketing strategies.




#### Chart - 2 Relationship between IMDB Score and Runtime

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.scatterplot(x='runtime', y='imdb_score', data=titles_data)
plt.title('Relationship between IMDB Score and Runtime')
plt.xlabel('Runtime (minutes)')
plt.ylabel('IMDB Score')
plt.show()

##### 1. Why did you pick the specific chart?

* Bivariate Analysis: This chart explores the relationship between two numerical variables (runtime and imdb_score).
* Relationship Visualization: Scatter plots are effective in showing the correlation or pattern between two numerical variables.
* Insightful: It helps us understand if there's any trend or association between the runtime of a title and its IMDB score. For example, do longer movies tend to have higher or lower scores?





##### 2. What is/are the insight(s) found from the chart?

This chart can provide insights into whether the length of a movie or TV show has any influence on its perceived quality (as reflected in the IMDB score).

* Content Length and Quality: You might observe trends like longer movies tend to have higher scores, shorter content might be preferred by viewers, or there might be no clear relationship.
* Content Strategy: This information can be valuable for content creation and acquisition strategies. You might consider factors like audience preferences for different content lengths when making decisions.
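
A scatter plot shows the pattern visually, and a correlation coefficient puts a number on it. The sketch below is an illustration on made-up rows, not a result from the dataset. It computes the Pearson correlation overall and separately per `show_type`, because runtime means episode length for shows and film length for movies, so pooling the two formats can hide or invert a trend.

```python
import pandas as pd

# Made-up rows standing in for titles_data
df = pd.DataFrame({
    "show_type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "runtime": [90, 120, 150, 22, 45, 60],
    "imdb_score": [6.0, 7.0, 8.0, 8.0, 7.0, 6.5],
})

# Pearson correlation across all titles
overall_r = df["runtime"].corr(df["imdb_score"])

# The same measure within each format
by_type = {
    kind: group["runtime"].corr(group["imdb_score"])
    for kind, group in df.groupby("show_type")
}

print(f"overall: {overall_r:.2f}")
print(by_type)
```

In this toy data the two formats trend in opposite directions, which is exactly the situation where a single pooled coefficient would be misleading.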

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


**Positive Business Impact:**

The insights from this chart can definitely contribute to positive business impact in several ways:

Content Strategy:

* Targeted Content Creation: If the chart shows that longer movies or shows tend to have higher IMDB scores, it might suggest that audiences appreciate more in-depth storytelling or complex narratives. This can guide content creators to focus on producing longer-form content for specific genres or target audiences.
* Informed Decisions: When acquiring or licensing content, the insights from the scatter plot can help make more data-driven decisions. For example, if there's a clear positive correlation between runtime and IMDB score for documentaries, a streaming service might prioritize acquiring longer documentaries to appeal to their audience.

**Insights Leading to Negative Growth:**
1. Overgeneralization:
* Ignoring Genre Differences: Applying a general trend (e.g., longer runtime = higher score) across all genres might be detrimental. If shorter comedies tend to perform well, focusing solely on longer comedies could alienate the target audience.

2. Neglecting Other Factors:
* Solely Focusing on Runtime: Basing content decisions solely on runtime without considering other critical factors like plot, cast, and production quality could lead to the creation of less engaging content.


#### Chart - 3 Distribution of Age Certifications

In [None]:
# Chart - 3 visualization code

plt.figure(figsize=(12, 6))  # Adjust figure size if needed
sns.countplot(x='age_certification', data=titles_data, order=titles_data['age_certification'].value_counts().index)
plt.title('Distribution of Age Certifications')
plt.xlabel('Age Certification')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.show()

##### 1. Why did you pick the specific chart?

* Univariate Analysis: This chart focuses on a single categorical variable (age_certification) to understand its distribution.
* Categorical Data Visualization: Bar plots are effective for visualizing categorical data and showing the frequency or count of each category.
* Insightful: It helps us understand the target audience for the content in the dataset by showing the distribution of age certifications.

##### 2. What is/are the insight(s) found from the chart?

* Dominant Age Certifications: Which age certifications are most common in the dataset? This indicates the primary target audience for the majority of the content.
* Content Diversity: Is there a wide range of age certifications, suggesting a diverse content catalog that caters to different age groups?
* Niche Audiences: Are there any age certifications that are less common but might represent a potential niche audience to target?
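
To read the dominant and niche certifications off as shares rather than raw counts, `value_counts(normalize=True)` can be used. The values below are made up for illustration.

```python
import pandas as pd

# Made-up values standing in for titles_data['age_certification']
certs = pd.Series(
    ["R", "PG-13", "R", "Unknown", "PG", "R", "Unknown", "G"],
    name="age_certification",
)

# Share of the catalog held by each certification, largest first
shares = certs.value_counts(normalize=True)

print(shares.round(3))
```

If missing certifications are filled with 'Unknown', that category appears here as well, so its share shows how much of the catalog the chart cannot actually classify.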




##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
1. Marketing and Promotion:

* Tailored Campaigns: Understanding the distribution of age certifications allows businesses to tailor their marketing and promotion efforts to specific age groups. For example, content with a "G" rating can be promoted to families with young children, while content with an "R" rating would be targeted towards a mature audience. This targeted approach optimizes marketing spend and increases the likelihood of reaching the intended viewers.

2. Content Acquisition and Production:

* Targeted Content Selection: By identifying the dominant age certifications in the dataset, businesses can focus on acquiring or producing content that aligns with the preferences of their primary target audience. This ensures that the content resonates with the viewers and maximizes engagement.

**Insights Leading to Negative Growth:**
1. Ignoring Dominant Age Certifications:

If businesses primarily focus on niche audiences while neglecting the preferences of their core audience (indicated by the dominant age certifications), they risk alienating their main viewership base, potentially leading to decreased engagement and revenue.

2. Overlooking Content Diversity:
Relying heavily on content with a single age certification limits the platform's appeal to a broader audience. This could hinder growth by preventing the acquisition of new users who prefer different types of content.

#### Chart - 4 Top Genres

In [None]:
# Chart - 4 visualization code

# Assuming 'genres' column contains lists of genres as strings
# Example: "['comedy', 'drama']"

import ast  # Standard-library parser for Python literals

# Function to extract and count genres
def count_genres(df):
    genre_counts = {}
    for genres_str in df['genres']:
        if pd.notna(genres_str) and genres_str != '[]':  # Skip NaN and empty lists
            # literal_eval parses the string as a list without executing arbitrary code
            genres = ast.literal_eval(genres_str)
            for genre in genres:
                genre_counts[genre] = genre_counts.get(genre, 0) + 1
    return genre_counts

# Count genres in the 'titles_data' DataFrame
genre_counts = count_genres(titles_data)

# Sort genres by count in descending order
sorted_genres = dict(sorted(genre_counts.items(), key=lambda item: item[1], reverse=True))

# Get the top 10 genres
top_genres = dict(list(sorted_genres.items())[:10])

# Create a horizontal bar plot using seaborn
plt.figure(figsize=(10, 6))
sns.barplot(x=list(top_genres.values()), y=list(top_genres.keys()), orient='h')
plt.title('Top 10 Genres')
plt.xlabel('Count')
plt.ylabel('Genre')
plt.show()

##### 1. Why did you pick the specific chart?

* Univariate Analysis: This chart focuses on the distribution of a single categorical variable (genres).
* Categorical Data Visualization: Bar plots are effective for visualizing categorical data and showing the frequency of each category. Horizontal bar plots are particularly useful when category labels are long.
* Insightful: It provides a clear overview of the most prevalent genres in the dataset, which can inform content strategy and marketing decisions.

##### 2. What is/are the insight(s) found from the chart?

This chart can provide insights into the popularity and prevalence of different genres in your content library. You can identify:

* Popular Genres: Which genres are most common? This indicates the types of content that are potentially most popular among viewers.
* Content Gaps: Are there any genres that are underrepresented in your library? This might represent opportunities to acquire or produce content in those genres to attract a wider audience.
* Genre Trends: By comparing the top genres with industry trends, businesses can identify areas where they are aligned or where they might need to adjust their content strategy.
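
The genre counts behind this chart can also be produced without an explicit loop. The sketch below is one possible vectorized alternative on made-up strings, assuming the 'genres' column stores list literals such as "['comedy', 'drama']". It parses each string with `ast.literal_eval`, which only accepts Python literals, then uses `explode` to put each genre on its own row.

```python
import ast
import pandas as pd

# Made-up values standing in for titles_data['genres']
genres_col = pd.Series([
    "['drama', 'comedy']",
    "['drama']",
    "[]",
    None,
    "['comedy', 'drama', 'action']",
])

# Parse each string into a real list, skipping missing values
parsed = genres_col.dropna().apply(ast.literal_eval)

# One row per genre; empty lists explode to NaN and are dropped
genre_counts = parsed.explode().dropna().value_counts()

print(genre_counts.head(10))
```

A title with several genres is counted once per genre, so the counts sum to more than the number of titles.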

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
1. Marketing and Promotion:

* Targeted Campaigns: Understanding genre popularity enables businesses to tailor marketing campaigns to specific audience segments. For example, promoting a new action movie heavily to viewers who frequently watch action films can improve the campaign's effectiveness and reach.

2. Recommendation Systems:

* Enhanced Personalization: Genre preferences can be integrated into recommendation algorithms to deliver more relevant suggestions to users. This improves user satisfaction and content discovery, leading to increased engagement and platform loyalty.

**Insights Leading to Negative Growth:**
1. Overlooking Niche Genres:

* Focusing solely on the most popular genres while neglecting niche audiences can alienate viewers with specific interests and limit the platform's overall appeal.
* Ignoring emerging genres might cause businesses to miss out on potential growth opportunities and future trends.

2. Over Saturation of Popular Genres:

* Excessively investing in content for already saturated popular genres can lead to increased competition and diminishing returns.
* Viewers might experience content fatigue if they are constantly bombarded with titles from the same few genres.

#### Chart - 5 Distribution of Runtime for Movies and TV Shows

In [None]:
print(titles_data.columns)

In [None]:
titles_data.rename(columns={"type": "show_type"}, inplace=True)

In [None]:
# Chart - 5 visualization code

# Filter data for movies and TV shows
movies = titles_data[titles_data['show_type'] == 'MOVIE']
tv_shows = titles_data[titles_data['show_type'] == 'SHOW']

# Create a figure with two subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))  # 1 row, 2 columns

# Plot histogram for movies
sns.histplot(movies['runtime'], bins=20, kde=True, ax=axes[0])
axes[0].set_title('Distribution of Runtime for Movies')
axes[0].set_xlabel('Runtime (minutes)')
axes[0].set_ylabel('Frequency')

# Plot histogram for TV shows
sns.histplot(tv_shows['runtime'], bins=20, kde=True, ax=axes[1])
axes[1].set_title('Distribution of Runtime for TV Shows')
axes[1].set_xlabel('Runtime (minutes)')
axes[1].set_ylabel('Frequency')

plt.tight_layout()  # Adjust spacing between subplots
plt.show()

##### 1. Why did you pick the specific chart?

* Bivariate Analysis: This chart compares a numerical variable (runtime) across a categorical variable (show_type: movie or TV show).
* Distribution Visualization: Side-by-side histograms with KDE curves show how runtimes are distributed within each format.
* Insightful: It helps us understand the typical content length for each format. For TV shows, runtime is the length of an episode.

##### 2. What is/are the insight(s) found from the chart?

This chart can provide insights into how content length differs by format. You can identify:

* Typical Runtime Ranges: Where do most movie runtimes and most TV episode runtimes fall?
* Spread and Skew: Is either distribution skewed, or does it have a long tail of unusually short or long titles?
* Format Differences: How much do the two distributions overlap, and how far apart are their peaks?

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from this chart can contribute to positive business impact by providing a deeper understanding of audience preferences for content length across different formats:

**Content Strategy:**

* Optimized Runtime for Different Formats: By understanding the typical runtime ranges for movies and TV shows, businesses can tailor the length of their content to better align with audience expectations. For example, if viewers prefer shorter TV show episodes, producers might focus on creating more concise and engaging narratives within that timeframe. Conversely, if longer movies tend to perform well, they might invest in more in-depth storytelling for feature films.

**Content Acquisition and Licensing:**
* Informed Acquisition Decisions: When acquiring or licensing content, businesses can use the runtime distribution insights to make more informed choices. For example, if they're looking to acquire a TV series, they might prioritize shows with episode lengths that fall within the preferred range for their target audience.
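
The "typical runtime range" for each format can be summarized numerically alongside the histograms. This is a small sketch on made-up rows; the choice of median, minimum, and maximum as summary statistics is an assumption, and `describe()` would give a fuller picture.

```python
import pandas as pd

# Made-up rows standing in for titles_data
titles = pd.DataFrame({
    "show_type": ["MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"],
    "runtime": [90, 100, 110, 22, 30, 45],
})

# Per-format summary of runtime in minutes
runtime_summary = titles.groupby("show_type")["runtime"].agg(["median", "min", "max"])

print(runtime_summary)
```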

#### Chart - 6 Average IMDB Score by Director (for Movies)

In [None]:
# Chart - 6 visualization code
# Merge titles_data and credits_data on 'id'
merged_data = pd.merge(titles_data, credits_data, on='id', how='inner')

# Filter for movies and directors
movie_directors = merged_data[(merged_data['show_type'] == 'MOVIE') & (merged_data['role'] == 'DIRECTOR')]

# Group by director and calculate average IMDB score
average_score_by_director = movie_directors.groupby('name')['imdb_score'].mean().reset_index()

# Sort by average IMDB score in descending order
average_score_by_director = average_score_by_director.sort_values(by='imdb_score', ascending=False)

# Select top 10 directors
top_directors = average_score_by_director.head(10)

# Create a horizontal bar plot
plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.barplot(x='imdb_score', y='name', data=top_directors, orient='h')
plt.title('Top 10 Directors by Average IMDB Score (Movies)')
plt.xlabel('Average IMDB Score')
plt.ylabel('Director')
plt.show()

##### 1. Why did you pick the specific chart?

* This chart combines the titles and credits data.

* Bivariate Analysis: This chart explores the relationship between a categorical variable (director's name) and a numerical variable (average IMDB score).
* Comparison of Averages: Bar plots are effective for comparing averages across different categories.
* Insightful: It helps us identify directors whose movies tend to have higher IMDB scores, potentially indicating higher quality or audience appeal.

##### 2. What is/are the insight(s) found from the chart?

This chart can provide insights into the performance of different directors based on the average IMDB scores of their movies. You can identify:

* Top-Performing Directors: Which directors consistently deliver movies with high IMDB scores?
* Content Acquisition and Production: This information can influence decisions about acquiring or collaborating with specific directors for future projects.
* Marketing and Promotion: Highlighting the involvement of top-performing directors can attract viewers and increase the perceived value of content.
* Talent Identification: The chart can help identify emerging or underappreciated directors whose work might be worth exploring further.
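
One caveat when ranking by average score is that a director with a single highly rated movie can outrank directors with a long, consistent record. The sketch below shows one way to guard against that by requiring a minimum number of titles before a director is ranked. The rows are made up, and the threshold `MIN_TITLES` is a hypothetical value to be tuned against the real data.

```python
import pandas as pd

# Made-up rows standing in for the filtered movie_directors frame
movie_directors = pd.DataFrame({
    "name": ["Director A", "Director A", "Director A",
             "Director B", "Director C", "Director C"],
    "imdb_score": [7.0, 7.5, 8.0, 9.5, 6.0, 6.4],
})

MIN_TITLES = 2  # Hypothetical threshold, not taken from the notebook

# Average score and number of movies per director
director_stats = (
    movie_directors.groupby("name")["imdb_score"]
    .agg(avg_score="mean", n_titles="count")
    .reset_index()
)

# Rank only directors with enough movies for the average to be meaningful
reliable = (
    director_stats[director_stats["n_titles"] >= MIN_TITLES]
    .sort_values("avg_score", ascending=False)
)

print(reliable)
```

In this toy data, Director B has the highest average but only one movie, so the filter removes them from the ranking.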

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
1. Content Acquisition and Production:

* Targeted Acquisitions: By identifying directors with consistently high average IMDB scores, businesses can prioritize acquiring movies directed by them. This data-driven approach increases the likelihood of acquiring high-quality, audience-pleasing content.

2. Marketing and Promotion:
* Highlighting Directorial Talent: Marketing campaigns can leverage the reputation of top-performing directors to attract viewers. Emphasizing the involvement of acclaimed directors can increase the perceived value and appeal of a movie.

**Insights Leading to Negative Growth:**
1. Overreliance on Director Reputation:
* Ignoring Other Factors: Basing content decisions solely on director reputation without considering other crucial factors like story, cast, genre, and target audience could lead to acquiring or producing films that fail to resonate with viewers.

2. Neglecting Emerging Talent:

* Focusing Only on Established Directors: Overlooking emerging or lesser-known directors could limit opportunities to discover fresh perspectives and innovative storytelling. This could hinder the platform's ability to offer diverse and unique content.


#### Chart - 7  Number of Seasons vs. Average IMDB Score (TV Shows)

In [None]:
# Chart - 7 visualization code


# Filter for TV shows
tv_shows = titles_data[titles_data['show_type'] == 'SHOW']

# Group by number of seasons and calculate average IMDB score
average_score_by_seasons = tv_shows.groupby('seasons')['imdb_score'].mean().reset_index()

# Create a scatter plot using seaborn
plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.scatterplot(x='seasons', y='imdb_score', data=average_score_by_seasons)
plt.title('Number of Seasons vs. Average IMDB Score (TV Shows)')
plt.xlabel('Number of Seasons')
plt.ylabel('Average IMDB Score')
plt.show()

##### 1. Why did you pick the specific chart?

* Bivariate Analysis: This chart explores the relationship between two numerical variables: the number of seasons ('seasons') and the average IMDB score ('imdb_score') for TV shows.
* Correlation Visualization: Scatter plots are effective for visualizing the correlation or pattern between two numerical variables.
* Insightful: It helps us understand if there's any relationship between the longevity of a TV show (number of seasons) and its perceived quality (average IMDB score).

##### 2. What is/are the insight(s) found from the chart?

This chart can provide insights into the relationship between a TV show's longevity and its quality. You can identify:

* Correlation between Seasons and Score: Is there a positive correlation (shows with more seasons tend to have higher scores), a negative correlation (shows with more seasons tend to have lower scores), or no clear correlation?
* Optimal Show Length: Is there an ideal number of seasons for a TV show to maintain high quality and audience engagement?
* Content Strategy: This information can be valuable for content creators and businesses when deciding on the number of seasons for a TV show.
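
To quantify the correlation question above, a rank correlation is a reasonable choice because the number of seasons is a small integer count and the relationship need not be linear. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, using made-up rows. It works on individual shows rather than on the per-season averages plotted in the chart, which is an assumption made here so that season counts with many shows carry more weight than those with only one.

```python
import pandas as pd

# Made-up rows standing in for the tv_shows frame
tv_shows = pd.DataFrame({
    "seasons": [1, 1, 2, 3, 5, 8],
    "imdb_score": [6.0, 6.5, 7.0, 7.2, 8.0, 8.1],
})

# Spearman's rho: Pearson correlation computed on ranks (ties get average ranks)
spearman_rho = tv_shows["seasons"].rank().corr(tv_shows["imdb_score"].rank())

print(f"Spearman rho: {spearman_rho:.3f}")
```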

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This information can help in:

* **Content Planning and Production:** Making informed decisions about the number of seasons to produce for a TV show, aiming for optimal length and audience satisfaction.
* **Content Acquisition:** Evaluating the potential success and longevity of TV shows based on their number of seasons and average IMDB score.
* **Marketing and Promotion:** Highlighting the longevity and consistent quality of a TV show to attract viewers.
* **Understanding Audience Preferences:** Gaining insights into audience preferences for TV show length and how it relates to perceived quality.

By analyzing the relationship between the number of seasons and the average IMDB score, businesses can gain valuable insights into TV show longevity and quality, informing their content strategies and potentially leading to greater success.

#### Chart - 8 Bubble Chart of Actor/Director Popularity and Average IMDB Score

In [None]:
# Chart - 8 visualization code

# Merge titles_data and credits_data on 'id' (as in Chart - 6)
merged_data = pd.merge(titles_data, credits_data, on='id', how='inner')

# Filter for actors/directors and calculate average IMDB score and popularity
actors_directors = merged_data[(merged_data['role'].isin(['ACTOR', 'DIRECTOR']))]
# Group by name (actor/director) and calculate stats
stats = actors_directors.groupby('name').agg(
    avg_imdb_score=('imdb_score', 'mean'),
    popularity=('imdb_votes', 'sum')  # Using IMDB votes as a proxy for popularity
).reset_index()

# Create the bubble chart
plt.figure(figsize=(12, 8))
plt.scatter(stats['avg_imdb_score'], stats['popularity'],
            s=stats['popularity'] / 1000,  # Bubble size scaled by popularity
            alpha=0.5)  # Adjust alpha for transparency

plt.title('Actor/Director Popularity vs. Average IMDB Score')
plt.xlabel('Average IMDB Score')
plt.ylabel('Popularity (IMDB Votes)')

# Add labels for some top actors/directors (optional)
for i, row in stats.sort_values('popularity', ascending=False).head(10).iterrows():
    plt.annotate(row['name'], (row['avg_imdb_score'], row['popularity']), fontsize=8)

plt.show()

##### 1. Why did you pick the specific chart?

* Bivariate Analysis with Size Emphasis: This chart lets us explore the relationship between two variables (average IMDB score and popularity). Bubble size encodes the same popularity measure as the y-axis, so it emphasizes the most-voted people rather than adding a third variable.
* Identifying Key Talents: It helps us quickly identify actors/directors who are both popular and critically acclaimed, as well as those who might be emerging talents with high potential.
* Understanding Audience Preferences: It provides insights into how audience perception of quality (IMDB score) relates to the popularity of actors/directors.
* Data-Driven Decision Making: The chart can inform strategic decisions related to content acquisition, casting, talent management, and marketing.

##### 2. What is/are the insight(s) found from the chart?

* Popular and Acclaimed Talents: Bubbles in the top right quadrant represent actors/directors who have high average IMDB scores and high popularity. These individuals are likely to attract large audiences and contribute to the success of a project.
* Emerging Talents: Bubbles in the bottom right quadrant represent actors/directors with high average IMDB scores but lower popularity. These could be rising stars or underappreciated talents who have the potential to gain wider recognition.
* Audience Preferences: The chart can reveal trends in audience preferences. For example, if bubbles are clustered around a particular IMDB score range, it might indicate that audiences are more receptive to content within that quality range.
* Correlation (or Lack Thereof): The distribution of bubbles can also indicate whether there's a correlation between popularity and average IMDB score. If there's a positive correlation, it suggests that more popular actors/directors tend to be associated with higher-rated content.
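
The groupings described above can be made explicit by labeling each person according to which side of a threshold they fall on. The sketch below uses made-up values in a hypothetical `stats_demo` frame shaped like the `stats` frame from the chart code. Splitting each axis at its median is an assumption chosen for illustration, and other cut-offs are equally defensible.

```python
import pandas as pd

# Illustrative per-person stats (made-up values), shaped like the `stats`
# frame built for the bubble chart
stats_demo = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'avg_imdb_score': [8.2, 7.9, 5.1, 4.8],
    'popularity': [900_000, 12_000, 750_000, 8_000],
})

# Assumption: split each axis at its median
score_cut = stats_demo['avg_imdb_score'].median()
pop_cut = stats_demo['popularity'].median()

def quadrant(row):
    """Label a person by score and popularity relative to the two cut-offs."""
    high_score = row['avg_imdb_score'] >= score_cut
    high_pop = row['popularity'] >= pop_cut
    if high_score and high_pop:
        return 'popular and acclaimed'
    if high_score:
        return 'acclaimed, less popular'
    if high_pop:
        return 'popular, lower rated'
    return 'lower rated, less popular'

stats_demo['quadrant'] = stats_demo.apply(quadrant, axis=1)
print(stats_demo[['name', 'quadrant']])
```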

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Yes, the insights from this chart can contribute to positive business impact in several ways:**

* Content Acquisition and Production: Businesses can use the chart to identify and prioritize acquiring content featuring popular and acclaimed actors/directors, increasing the likelihood of success.
* Casting Decisions: The chart can inform casting choices by highlighting actors who are both popular and well-regarded by audiences.
* Talent Management: Businesses can use the chart to identify and nurture emerging talents, potentially gaining a competitive advantage by collaborating with them early in their careers.

**Yes, there are potential downsides if the insights are misinterpreted or misapplied:**

* Overreliance on Popularity: Focusing solely on popularity without considering other factors like acting skills, directorial style, or genre suitability could lead to poor casting choices or content acquisitions that fail to resonate with audiences.
* Ignoring Emerging Talent: Overlooking actors/directors with lower popularity but high average IMDB scores could result in missed opportunities to collaborate with promising talents.

#### Chart - 9 Top Actors and Directors by Number of Titles


In [None]:
# Chart - 9 visualization code

# Merge titles_data and credits_data on 'id'
merged_data = pd.merge(titles_data, credits_data, on='id', how='inner')

# Filter for actors and directors
actors_directors = merged_data[merged_data['role'].isin(['ACTOR', 'DIRECTOR'])]

# Count the number of distinct titles for each person, separately per role,
# so that someone credited as both actor and director is not double-counted
title_counts = (
    actors_directors.groupby(['role', 'name'])['id']
    .nunique()
    .reset_index(name='title_count')
)

# Select the top 10 actors and the top 10 directors by title count
top_actors = title_counts[title_counts['role'] == 'ACTOR'].nlargest(10, 'title_count')
top_directors = title_counts[title_counts['role'] == 'DIRECTOR'].nlargest(10, 'title_count')

# Create bar plots for actors and directors
fig, axes = plt.subplots(1, 2, figsize=(14, 6))  # 1 row, 2 columns

sns.barplot(x='title_count', y='name', data=top_actors, orient='h', ax=axes[0])
axes[0].set_title('Top 10 Actors by Number of Titles')
axes[0].set_xlabel('Number of Titles')
axes[0].set_ylabel('Actor')

sns.barplot(x='title_count', y='name', data=top_directors, orient='h', ax=axes[1])
axes[1].set_title('Top 10 Directors by Number of Titles')
axes[1].set_xlabel('Number of Titles')
axes[1].set_ylabel('Director')

plt.tight_layout()  # Adjust spacing between subplots
plt.show()

##### 1. Why did you pick the specific chart?

* Comparison and Ranking: Bar plots are excellent for comparing the frequency or count of different categories. In this case, we are comparing the number of titles associated with each actor and director. The horizontal orientation makes it easy to rank them visually.
* Identifying Key Contributors: This chart helps identify the actors and directors who have been most prolific in the film and television industry (based on the dataset). It highlights the individuals who have contributed to a significant number of titles.

##### 2. What is/are the insight(s) found from the chart?

* Top Actors and Directors: The chart reveals the names of the most frequently appearing actors and directors in the dataset.
* Frequency of Appearances: The x-axis shows the number of titles each actor or director has been involved in, giving an idea of their prevalence in the industry.
* Potential Casting Choices: The chart can help identify actors who have a history of appearing in numerous titles, potentially indicating their experience and demand in the industry.
* Collaboration Opportunities: The chart can highlight directors who have worked on a large number of projects, potentially suggesting their expertise and potential for successful collaborations.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Casting Decisions: The chart can guide casting decisions by identifying actors who are experienced and in high demand. Casting popular and well-established actors can attract viewers and potentially increase the success of a project.
* Content Acquisition: The insights can inform content acquisition strategies. Acquiring titles featuring actors or directors who have a track record of successful projects can increase the likelihood of acquiring high-quality, audience-pleasing content.

**Potential for Negative Growth:**

* Overreliance on Established Talent: Focusing solely on actors and directors with a high number of titles might lead to neglecting emerging talents. It's crucial to balance the selection of established figures with opportunities for fresh perspectives and new talent.
* Ignoring Other Factors: Casting decisions and content acquisitions should consider other factors beyond the number of titles an actor or director has been involved in. Factors like acting skills, directorial style, genre suitability, and audience preferences should also be taken into account.

#### Chart - 10 : Most Frequent Genres (Bar Chart)

In [None]:
# Chart - 10 visualization code

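import ast

# Count the frequency of each individual genre.
# Assumes 'genres' holds list-like strings such as "['drama', 'comedy']"
# (as the parsing in Chart - 12 implies), so each list is parsed and exploded
# instead of counting whole genre combinations.
genre_lists = titles_data['genres'].dropna().apply(ast.literal_eval)
genre_counts = genre_lists.explode().value_counts().head(10)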

# Create the bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=genre_counts.index, y=genre_counts.values, palette='viridis')
plt.title('Most Frequent Genres (Bar Chart)')
plt.xlabel('Genre')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar chart to visualize the most frequent genres because:

* Comparison: Bar charts are excellent for comparing the frequency or count of different categories, which is what we're doing with genres in this case.
* Ranking: Because the bars are sorted by frequency, the genres are easy to rank visually. The most frequent genre has the longest bar and appears first.
* Clarity: Bar charts are generally easy to understand and interpret, making them a good choice for presenting this type of data.

##### 2. What is/are the insight(s) found from the chart?

This chart provides insights into:

* Genre Prevalence: It shows which genres are most prevalent in the dataset. Prevalence reflects what the catalog offers and is only an indirect signal of popularity among viewers.
* Content Distribution: It reveals the overall distribution of content across different genres. You can see which genres are dominant and which are less common.
* Potential Trends: By comparing this chart with industry trends or other datasets, you might identify any genre preferences or emerging trends in your specific content collection.
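
To compare the content distribution with other catalogs, shares are often easier to read than raw counts. The sketch below is a minimal example on made-up rows in a hypothetical `titles_demo` frame. It assumes `genres` stores list-like strings, as the genre-parsing step in Chart - 12 suggests, and counts each genre tag separately, so a title with two genres contributes to both.

```python
import ast
import pandas as pd

# Illustrative titles (made-up), with genres stored as list-like strings
titles_demo = pd.DataFrame({
    'title': ['T1', 'T2', 'T3', 'T4'],
    'genres': ["['drama', 'comedy']", "['drama']", "['thriller']", "[]"],
})

# Parse each string into a real list, then put one genre per row;
# titles with an empty list become NaN and are dropped
genre_lists = titles_demo['genres'].apply(ast.literal_eval)
per_genre = genre_lists.explode().dropna()

# Share of all genre tags that belong to each genre
genre_share = per_genre.value_counts(normalize=True)
print(genre_share)
```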

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Content Acquisition: The chart can guide content acquisition strategies. Businesses can prioritize acquiring titles in popular genres to cater to audience preferences and increase viewership.
* Content Production: Understanding genre popularity can help businesses make informed decisions about what types of content to produce. Investing in popular genres is likely to attract a larger audience.

**Potential for Negative Growth:**

* Over Saturation: Focusing solely on popular genres could lead to oversaturation of the market. Viewers might become fatigued if they are constantly exposed to similar content.
* Ignoring Niche Genres: While popular genres are important, neglecting niche genres could alienate viewers with specific interests. It's important to maintain a balance and cater to diverse tastes.

#### Chart - 11  Average Runtime of Movies by Director

In [None]:
# Chart - 11 visualization code

# Merge the datasets on 'id'
merged_data = pd.merge(titles_data, credits_data, on='id', how='inner')

# Filter for movies and directors
movie_directors = merged_data[(merged_data['show_type'] == 'MOVIE') & (merged_data['role'] == 'DIRECTOR')]

# Group by director and calculate average runtime
average_runtime_by_director = movie_directors.groupby('name')['runtime'].mean().reset_index()

# Sort by average runtime in descending order
average_runtime_by_director = average_runtime_by_director.sort_values(by='runtime', ascending=False)

# Select top 10 directors
top_directors = average_runtime_by_director.head(10)

# Create a horizontal bar plot
plt.figure(figsize=(10, 6))  # Adjust figure size if needed
sns.barplot(x='runtime', y='name', data=top_directors, orient='h')
plt.title('Top 10 Directors by Average Movie Runtime')
plt.xlabel('Average Runtime (minutes)')
plt.ylabel('Director')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart for this visualization because:

* Comparison: Bar charts are excellent for comparing the values of different categories, which in this case are the average runtimes of movies by different directors.
* Ranking: The bar chart allows for easy visual ranking of directors based on their average movie runtime, with longer bars representing longer runtimes.
* Clarity: Bar charts are generally straightforward to understand and interpret, making them a suitable choice for presenting this type of data.

##### 2. What is/are the insight(s) found from the chart?

This chart provides insights into:

* Directorial Styles: It highlights potential differences in directorial styles based on their preferred movie lengths. Some directors might prefer shorter, more concise storytelling, while others might favor longer, more elaborate narratives.
* Content Trends: It can reveal potential trends in movie lengths based on the directors involved. If certain directors with longer average runtimes are gaining popularity, it might indicate a trend toward longer movies in general.
* Audience Preferences: By understanding the average runtime of movies by different directors, businesses can better understand audience preferences and tailor content acquisition or production decisions accordingly.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Content Acquisition: The chart can inform content acquisition strategies by identifying directors whose movie lengths align with audience preferences. Acquiring movies from directors with shorter average runtimes might be preferable for platforms targeting viewers who prefer shorter content.
* Content Production: Understanding directorial styles based on runtime preferences can help businesses make informed decisions about movie production. Collaborating with directors known for shorter movies might be ideal for projects with budget or time constraints.

**Potential for Negative Growth:**

* Overgeneralization: Assuming that all movies by a director with a longer average runtime will be lengthy might discourage viewers who prefer shorter content. It's crucial to consider individual movie characteristics and not solely rely on directorial trends.
* Limited Scope: The chart only considers movies, not TV shows. This limits the scope of insights regarding content length preferences. It's important to consider different formats when analyzing content trends.
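
One way to reduce the overgeneralization risk is to rank only directors with several movies, because an average based on a single film says little about a director's typical length. The sketch below uses made-up rows in a hypothetical `movie_directors_demo` frame. The `min_movies` threshold is an assumption for illustration and would need tuning on the real data.

```python
import pandas as pd

# Illustrative director credits (made-up values)
movie_directors_demo = pd.DataFrame({
    'name':    ['Dir A', 'Dir A', 'Dir A', 'Dir B', 'Dir C', 'Dir C'],
    'runtime': [100, 110, 120, 240, 90, 95],
})

min_movies = 2  # hypothetical threshold; tune to the dataset

# Average runtime and number of movies per director
runtime_stats = movie_directors_demo.groupby('name')['runtime'].agg(
    avg_runtime='mean',
    movie_count='count',
)

# Keep directors with enough movies, then rank by average runtime
reliable = (
    runtime_stats[runtime_stats['movie_count'] >= min_movies]
    .sort_values('avg_runtime', ascending=False)
)
print(reliable)
```

Here the single 240-minute film by `Dir B` no longer dominates the ranking, which is the kind of outlier the unfiltered top 10 is likely to surface.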

#### Chart - 12 Distribution of Actors/Directors Across Genres

In [None]:
# Chart - 12 visualization code

# Merge the datasets on 'id'
merged_data = pd.merge(titles_data, credits_data, on='id', how='inner')

# Filter for actors and directors
actors_directors = merged_data[merged_data['role'].isin(['ACTOR', 'DIRECTOR'])]

# 1. Limit the Number of Actors/Directors (Top 20)
top_actors_directors = actors_directors['name'].value_counts().head(20).index
filtered_data = actors_directors[actors_directors['name'].isin(top_actors_directors)].copy()

# 2. Aggregate Genres (Example: Comedy)
import ast

def aggregate_genres(genre_str):
    if pd.notna(genre_str) and genre_str != '[]':
        genres = ast.literal_eval(genre_str)  # Safely parse the list-like string (avoids eval)
        for genre in genres:
            if genre in ['comedy', 'romantic comedy']:
                return 'comedy'  # Aggregate similar genres
            # Add more aggregation rules as needed
        return genre_str  # Keep other genres as they are
    return genre_str

# The .copy() above makes this assignment write to an independent frame
filtered_data['aggregated_genres'] = filtered_data['genres'].apply(aggregate_genres)

# Create a cross-tabulation of names and aggregated genres for the filtered data
genre_distribution = pd.crosstab(filtered_data['name'], filtered_data['aggregated_genres'])

# Create the heatmap
plt.figure(figsize=(12, 8))  # Adjust figure size if needed
sns.heatmap(genre_distribution, cmap='viridis')
plt.title('Distribution of Actors/Directors Across Genres (Heatmap - Modified)')
plt.xlabel('Aggregated Genres')
plt.ylabel('Actors/Directors (Top 20)')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
plt.show()

##### 1. Why did you pick the specific chart?

This chart visualizes the distribution of the top 20 most frequent actors and directors across different genres, using a heatmap. It shows the frequency of their work in various genres, potentially highlighting specializations or preferences.

* Distribution Visualization: A heatmap effectively visualizes the distribution of categorical data, such as genres and actor/director names, and their relationships. The color intensity of each cell represents the frequency, allowing for easy identification of patterns and concentrations.
* Categorical Data: Heatmaps are well-suited for visualizing categorical data, which is the nature of genres and actor/director names in this case. The grid-like structure of the heatmap makes it easy to compare the frequencies of actors/directors working in different genres.

##### 2. What is/are the insight(s) found from the chart?

* Genre Specialization: The heatmap highlights potential genre specializations among the top 20 actors and directors. Cells with higher color intensity indicate a higher frequency of work in that genre, suggesting a specialization or preference for that type of content. This can help identify actors/directors known for specific genres.
* Diversity of Work: You can observe the diversity of work among these individuals by looking at the distribution of color across the heatmap. Actors/directors with a wider range of color intensities across different genres have a more diverse portfolio of work, indicating versatility.
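
Specialization and diversity can also be summarized as numbers rather than read off the colors. The sketch below works on a made-up table in a hypothetical `genre_distribution_demo` frame shaped like the `genre_distribution` cross-tabulation. The two measures shown (number of genre categories with at least one credit, and share of credits in the most common category) are illustrative choices, not an established metric.

```python
import pandas as pd

# Illustrative name-by-genre counts (made-up), shaped like `genre_distribution`
genre_distribution_demo = pd.DataFrame(
    {'comedy': [5, 0, 2], 'drama': [0, 4, 2], 'thriller': [0, 1, 2]},
    index=['Person A', 'Person B', 'Person C'],
)

# Number of genre categories each person has at least one credit in
genres_worked = (genre_distribution_demo > 0).sum(axis=1)

# Share of each person's credits that fall in their most common category
top_genre_share = (
    genre_distribution_demo.max(axis=1) / genre_distribution_demo.sum(axis=1)
)

summary = pd.DataFrame({
    'genres_worked': genres_worked,
    'top_genre_share': top_genre_share.round(2),
})
print(summary)
```

A high `top_genre_share` points toward specialization, while a high `genres_worked` with a low share points toward a more varied body of work.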

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Content Acquisition: The heatmap can inform content acquisition strategies by identifying actors and directors with expertise in specific genres. Acquiring titles featuring individuals known for their work in popular or emerging genres can attract target audiences and increase viewership. This can lead to higher content engagement and revenue.
* Content Production: Understanding the genre preferences and specializations of actors and directors can assist businesses in making informed decisions about content production. Collaborating with individuals experienced in the desired genres might enhance the quality and appeal of the content, potentially leading to critical acclaim and audience satisfaction.

**Potential for Negative Growth:**

* Overemphasis on Specialization: While genre specialization can be valuable, it's important to encourage versatility and exploration. Overemphasizing specialization might limit opportunities for actors and directors to expand their creative range and potentially miss out on new genre opportunities. This could restrict their career growth and limit the diversity of content produced.
* Data Bias: The dataset might have inherent biases that influence the distribution of actors and directors across different genres. It's crucial to be aware of potential biases and strive for inclusivity and diversity in content creation to ensure a balanced representation of genres and talent. Ignoring potential biases could lead to a skewed understanding of the industry and potentially exclude talented individuals from certain genres.

#### Chart - 13 Top 10 Most Frequent Person-Title Connections (Pie Chart).

In [None]:
# Chart - 13 visualization code


# Create a new column representing the person-title connection
credits_data['person_title_connection'] = credits_data['person_id'].astype(str) + '-' + credits_data['id'].astype(str)

# Count the frequency of each connection
connection_counts = credits_data['person_title_connection'].value_counts().head(10)

# Create the pie chart
plt.figure(figsize=(8, 8))  # Adjust figure size if needed
plt.pie(connection_counts, labels=connection_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Top 10 Most Frequent Person-Title Connections (Pie Chart)')
plt.show()

##### 1. Why did you pick the specific chart?

* Proportion Visualization: A pie chart is effective for visualizing the proportions of different categories within a whole dataset. In this case, we want to see the relative frequency of the top 10 person-title connections, making a pie chart a suitable choice.
* Easy Interpretation: Pie charts are generally easy to understand and interpret, even for audiences unfamiliar with data visualization. The size of each slice clearly represents the proportion of each connection, providing a quick overview of the distribution.
* Top Connections Focus: By focusing on the top 10 connections, the pie chart provides a concise representation of the most frequent collaborations, which can be easier to grasp than a scatter plot with numerous data points.

##### 2. What is/are the insight(s) found from the chart?

* Repeated Credits: Each connection pairs one person with one title, so its count is the number of credit rows for that pair. Counts above one indicate people credited more than once on the same title (for example, as both actor and director). Identifying people who worked on many different titles requires counting titles per person instead.
* Relative Frequency: The size of each slice represents the relative frequency of each connection. Larger slices indicate more frequent collaborations, while smaller slices represent less frequent ones.
* Top Connections: The chart focuses on the top 10 connections, providing a quick overview of the most prevalent relationships in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

* Talent Identification: The pie chart can help identify individuals who have a high frequency of collaborations, potentially indicating their experience and demand in the industry. This information can be valuable for casting decisions or talent acquisition.
* Content Analysis: By observing the most frequent person-title connections, businesses can gain insights into the relationships between titles and identify potential groups of titles with similar casts or crews. This can be useful for content recommendation systems or marketing strategies.

**Potential for Negative Growth:**

* Limited Scope: The pie chart only shows the top 10 connections, so it might not capture the full complexity of the relationships in the dataset. It's essential to be aware of this limitation and consider other visualization approaches for a more comprehensive understanding.
* Oversimplification: The pie chart might oversimplify the relationships between persons and titles by representing them as individual slices without considering the nature of the connections (e.g., role, importance).

#### Chart - 14 - Correlation Heatmap- Correlation Heatmap of Numerical Features in Titles Data

In [None]:
# Correlation Heatmap visualization code

# Load the dataset
titles_data = pd.read_csv('titles.csv')

# Select numerical features for correlation analysis
numerical_features = ['release_year', 'runtime', 'seasons', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
numerical_data = titles_data[numerical_features]

# Calculate the correlation matrix
correlation_matrix = numerical_data.corr()

# Create the heatmap
plt.figure(figsize=(10, 8))  # Adjust figure size if needed
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features in Titles Data')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
plt.yticks(rotation=0)  # Keep y-axis labels horizontal
plt.show()

##### 1. Why did you pick the specific chart?

* Correlation Visualization: A heatmap is an effective way to visualize the correlation between multiple numerical features in a dataset. It provides a clear and concise representation of the relationships between variables, making it easy to identify patterns and trends.
* Numerical Data: This chart is suitable for visualizing numerical data, which is the nature of the features selected for correlation analysis in this case. The color intensity of each cell in the heatmap represents the strength and direction of the correlation, allowing for easy interpretation.
* Relationship Exploration: Heatmaps are useful for exploring relationships and dependencies between variables in a dataset. They can reveal potential factors influencing movie or TV show characteristics and provide insights into the underlying data structure.

##### 2. What is/are the insight(s) found from the chart?

* Feature Correlations: The heatmap reveals the correlation between different numerical features in the "titles" dataset. You can observe which features tend to move together (positive correlation), move in opposite directions (negative correlation), or have little to no relationship (weak correlation).
* Strong Relationships: Look for cells with high color intensity (either red or blue), indicating strong positive or negative correlations. For example, you might find a positive correlation between imdb_score and tmdb_score, suggesting that movies or TV shows with high ratings on one platform tend to have high ratings on the other.
* Weak Relationships: Cells with low color intensity (close to white) indicate weak correlations, meaning that the variables don't have a strong relationship with each other.
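
Reading strong and weak cells off a heatmap by eye can be supplemented with a sorted list of feature pairs. The sketch below uses made-up numbers in a hypothetical `demo` frame standing in for `numerical_data`, and it keeps only the upper triangle of the correlation matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Illustrative numeric data (made-up), standing in for `numerical_data`
demo = pd.DataFrame({
    'imdb_score': [6.0, 7.0, 8.0, 5.0, 9.0],
    'tmdb_score': [6.2, 6.9, 8.1, 5.3, 8.8],
    'runtime':    [120, 95, 100, 110, 90],
})

corr_matrix = demo.corr()

# Mask everything except the upper triangle (diagonal excluded)
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# Long format with one row per feature pair, strongest relationships first
pairs = (
    upper.stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.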

#### Chart - 15 - Pair Plot- Pair Plot of Numerical Features in Titles Data

In [None]:
# Pair Plot visualization code

# Load the dataset
titles_data = pd.read_csv('titles.csv')

# Select numerical features for pair plot analysis
numerical_features = ['release_year', 'runtime', 'seasons', 'imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
numerical_data = titles_data[numerical_features]

# Create the pair plot
sns.pairplot(numerical_data)
plt.suptitle('Pair Plot of Numerical Features in Titles Data', y=1.02)  # Adjust title position
plt.show()

##### 1. Why did you pick the specific chart?

* Multivariate Relationships: A Pair Plot is an effective way to visualize the relationships between multiple numerical features in a dataset simultaneously. It provides a comprehensive overview of the interactions between variables, allowing you to explore the data structure and identify potential patterns or trends.
* Distribution and Correlation: The combination of scatter plots and diagonal plots in a Pair Plot allows you to see both the distribution of individual features and the correlation between pairs of features. This provides a holistic view of the data.
* Data Exploration: Pair Plots are useful for initial data exploration and identifying potential relationships or dependencies between variables. They can reveal interesting patterns or clusters in the data that might not be immediately apparent in individual scatter plots or histograms.

##### 2. What is/are the insight(s) found from the chart?

* Feature Relationships: The scatter plots in the Pair Plot reveal the relationships between pairs of numerical features. You can observe patterns, trends, and potential correlations between variables. For example, you might find a positive correlation between imdb_score and tmdb_score, indicating that movies or TV shows with high ratings on one platform tend to have high ratings on the other.
* Feature Distributions: The diagonal plots show the distribution of individual features. You can observe the shape of the distribution, identify potential outliers, and get a sense of the central tendency and spread of each variable.
* Data Structure: The Pair Plot provides a comprehensive overview of the data structure and the relationships between variables. It can help you understand the underlying patterns and dependencies in the dataset.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Recommendations for Meeting the Business Goal:

1. Strengthen Content Strategy:

   * Target Top-Performing Genres: Emphasize drama, comedy, and documentaries to target mass appeal.
   * Niche Genres: Add new content types to reach new audiences and broaden the reach of the platform.
   * Tune Content Duration: Adjust content duration to audience preferences by genre and format to increase engagement.
   * Leverage IMDB and TMDB Scores: Use ratings as quality metrics when making content acquisition decisions to improve user satisfaction.

2. Optimize Content Acquisition:

   * Partner with Top Talents: Sign successful directors, actors, and studios to maximize the chance of acquiring successful titles.
   * Data-Driven Predictions: Apply predictive analytics to estimate the potential success of new titles prior to acquisition, reducing risk and optimizing return on investment.
   * Targeted Acquisitions: Buy content aligned with identified audience preferences and content gaps to improve user engagement and platform growth.

3. Personalize User Experience:

   * Better Recommendations: Enhance the recommendation engine with user data for more relevant content suggestions, increasing content discovery.
   * Better Search Functionality: Enhance search algorithms for personalized results, allowing users to discover desired content easily and quickly.
   * Improved User Interface: Improve the user interface for hassle-free navigation and content discovery, promoting higher user engagement and platform loyalty.

These suggestions are framed to be simple and easily understandable by the client. They emphasize the major steps Amazon Prime can take to use data-driven insights and reach its business goal of maximizing user engagement, customer satisfaction, and platform expansion.

# **Conclusion**




This exploratory data analysis of Amazon Prime's content library has uncovered insights into factors related to user engagement, customer satisfaction, and platform growth. Through the analysis of content features, talent attributes, and audience preferences, we have identified major trends and patterns that can guide data-driven decision-making.

**Key Findings:**

* Content quality, as indicated by IMDB and TMDB scores, is a strong determinant of user engagement and satisfaction.
* Genre interest differs between audience segments, with drama, comedy, and documentaries being top-rated categories.
* Length of content is an important factor in user interest, with certain preferences for each format and genre.
* Partnerships with top-performing directors, actors, and production houses are correlated with higher content success.
* Personalization is the most effective way to improve user experience, with precise recommendations and seamless search functionality being critical to content discovery.

**Impact of Recommendations:**

By implementing the recommendations in this project, Amazon Prime can:

* Improve content strategy by investing in top-performing genres, streamlining content length, and using data-driven insights when creating and acquiring content.
* Maximize content acquisition through partnerships with leading talents, predictive assessments of potential success, and focused acquisitions that fill content gaps and meet consumer demand.
* Enhance user experience through better recommendation accuracy, more effective search, and a user interface designed for easy navigation and content discovery.

These measures are expected to lead to increased user interaction, improved customer satisfaction, and faster platform growth, ultimately strengthening Amazon Prime's foothold in the competitive streaming landscape.

**Future Directions:**

Although this project has given useful insights, additional analysis is recommended to deepen understanding and refine strategies. Future directions may involve:

* Exploring how specific content attributes, e.g., themes, subgenres, and cast diversity, affect user engagement.
* Creating more advanced predictive models for evaluating the potential success of new releases.
* Investigating more advanced personalization methods to customize content recommendations and user experiences for individual tastes.

By persistently applying data-driven intelligence and responding to changing audience habits, Amazon Prime can continue to lead the market and provide extraordinary content experiences that delight its users.

This conclusion captures the principal findings and suggestions of the project, emphasizing its potential contribution to Amazon Prime's business goals. It also outlines future directions for improvement, reflecting a commitment to ongoing refinement and data-led decision-making.