<center>
    <h1>Project Title: Exploratory Data Analysis on Amazon Prime Video Content</h1>
</center>

<body><h3>Project Description:</h3>
<p>
In this project, we perform an exploratory data analysis (EDA) of a dataset describing the TV shows and movies available on Amazon Prime Video in the United States. The dataset covers more than 9,000 unique titles, with details such as genres, release year, ratings, popularity, and production countries. A second dataset provides credits information for more than 124,000 actors and directors.
    </p>
    <p>The primary objective of this analysis is to uncover insights related to:

<ul><li>Content Diversity: Identify the most dominant genres on the platform.</li>

<li>Regional Availability: Understand how content is distributed across different production countries.</li>

<li>Trends Over Time: Observe how Amazon Prime’s content library has evolved over the years.</li>

<li>IMDb Ratings and Popularity: Discover the highest-rated and most popular titles.</li></ul></p>

<p>
We use the Python libraries Pandas, NumPy, Matplotlib, and Seaborn for data manipulation and visualization, with GeoPandas for the country map. The analysis includes at least five distinct visualizations to communicate trends and patterns in the data. The goal is to generate business-relevant insights that could influence content strategy, user engagement, and platform growth.
</p></body>



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ast
from collections import Counter
import seaborn as sns
import geopandas as gpd

In [None]:
titles=pd.read_csv("AlmaBetterProjects/Project2/titles.csv")

In [None]:
print(titles.describe())  # summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
print(titles.shape)  # shape of the dataset as (rows, columns)

<h3>Finding null values in data</h3>

In [None]:
titles.isnull().sum()

<h3>Handling missing values</h3>

In [None]:
# Drop rows that have no IMDb identifier
titles = titles.dropna(subset=['imdb_id'])

In [None]:
# Impute missing numeric scores with each column's median
score_cols = ['imdb_score', 'imdb_votes', 'tmdb_popularity', 'tmdb_score']
titles[score_cols] = titles[score_cols].fillna(titles[score_cols].median())
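<p>The median imputation above can be checked on a small example. The sketch below uses a made-up two-column table (the values are illustrative, not taken from the dataset) to show that <code>median()</code> skips NaN and that <code>fillna</code> with a Series aligns on column names.</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real `titles` table; values are invented for illustration
toy = pd.DataFrame({
    "imdb_score": [7.0, np.nan, 5.0, 9.0],
    "imdb_votes": [100.0, 300.0, np.nan, 200.0],
})
num_cols = ["imdb_score", "imdb_votes"]

# Fraction of missing values per column before imputing
missing_share = toy[num_cols].isna().mean()

# Column-wise medians; NaN entries are skipped by default
medians = toy[num_cols].median()

# fillna with a Series fills each column with the value under its own name
filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(medians)

print(missing_share)
print(filled)
```

<p>Looking at the missing share first helps judge whether a median fill is reasonable: a column that is mostly empty would end up dominated by the imputed value.</p>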


In [None]:
# Fill the remaining gaps with placeholder values
titles['description'] = titles['description'].fillna("description not available")
titles['age_certification'] = titles['age_certification'].fillna("unknown")
titles['seasons'] = titles['seasons'].fillna(1)

In [None]:
titles.isnull().sum()

<h3>Content Diversity: What genres and categories dominate the platform?</h3>

In [None]:
# Count titles per content type
data = titles['type'].value_counts()

labels = data.index
values = data.values

# Colors: light pink and dark pink
colors = ['#ffb6c1', '#db7093']  # lightpink, palevioletred

# Explode the smaller slice for effect (chosen by value, not by position)
explode = [0.2 if v == values.min() else 0 for v in values]

# Create the pie chart
plt.figure(figsize=(8, 8))  # Make the chart bigger overall
patches, texts, autotexts = plt.pie(
    values,
    labels=labels,
    colors=colors,
    autopct='%1.1f%%',
    shadow=True,
    explode=explode,
    textprops={'fontsize': 14}  # Set font size for labels and % values
)

# Set font size for percentage values separately if needed
for autotext in autotexts:
    autotext.set_fontsize(16)
    autotext.set_color('black')

# Title
plt.title('Exploring Categories', fontsize=18)

plt.axis('equal')  # Ensure the pie is circular
plt.show()


In [None]:
# Genres are stored as stringified lists (e.g. "['drama', 'comedy']"),
# so parse each cell into a real Python list with ast.literal_eval
titles['genres'] = titles['genres'].apply(
    lambda x: ast.literal_eval(x) if pd.notnull(x) and x != '[]' else []
)
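<p>A slightly more defensive variant of this parsing step is sketched below. The helper name <code>parse_list_cell</code> and the sample strings are hypothetical; the idea is to return an empty list for missing or malformed cells rather than raising.</p>

```python
import ast
import pandas as pd

def parse_list_cell(value):
    """Parse a stringified list; return [] for missing or malformed input."""
    if not isinstance(value, str):
        return []  # NaN, None, or any other non-string value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # text that is not a valid Python literal
    return parsed if isinstance(parsed, list) else []

# Invented sample cells: a normal list, an empty list, a missing value, bad text
raw = pd.Series(["['drama', 'comedy']", "[]", None, "not a list"])
parsed = raw.apply(parse_list_cell)
print(parsed.tolist())
```

<p>Swallowing errors this way trades visibility for robustness, so it is worth counting how many cells fell back to <code>[]</code> before relying on the result.</p>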

In [None]:
# Flatten all genres into a single list
all_genres = [genre for sublist in titles['genres'] for genre in sublist]

# Use Counter to get frequency of each genre
genre_counts = Counter(all_genres)

# Convert to a DataFrame sorted by frequency (drop=True avoids a leftover 'index' column)
genre_titles = (
    pd.DataFrame(genre_counts.items(), columns=['Genre', 'Count'])
    .sort_values(by='Count', ascending=False)
    .reset_index(drop=True)
)

print(genre_titles.head())
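<p>The same frequency table can also be built without leaving pandas. The sketch below, on an invented three-row table, uses <code>explode</code> and <code>value_counts</code> as an alternative to the <code>Counter</code> approach; it is not the method used in this notebook.</p>

```python
import pandas as pd

# Invented sample; the real `titles` table has one list of genres per title
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["drama", "comedy"], ["drama"], []],
})

genre_table = (
    toy["genres"]
    .explode()          # one row per (title, genre); empty lists become NaN
    .dropna()
    .value_counts()     # sorted by frequency, descending
    .rename_axis("Genre")
    .reset_index(name="Count")
)
print(genre_table)
```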


In [None]:
# ----------------------------
# Content Diversity - Donut Chart for Top Genres
# ----------------------------

# Get the top 10 genres for the donut chart
top_10_genres = genre_titles.head(10)

plt.figure(figsize=(10, 8))
# Create a pie chart
plt.pie(top_10_genres['Count'], labels=top_10_genres['Genre'], autopct='%1.1f%%', startangle=140, pctdistance=0.85, colors=sns.color_palette("viridis", n_colors=10))

# Draw a circle in the middle to make it a donut chart
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title("Top 10 Most Common Genres on Amazon Prime", fontsize=18, weight='bold')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()

<h3>Regional Availability: How does content distribution vary across different regions?</h3>

In [None]:
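# production_countries is also a stringified list, so parse it the same way as genres.
# Casting to str first would turn NaN into the text 'nan', which literal_eval rejects.
titles['production_countries'] = titles['production_countries'].apply(
    lambda x: ast.literal_eval(x) if pd.notnull(x) and x != '[]' else []
)

# Flatten all production countries into a single list
# (the genre_* variable names are reused here and now hold country data)
all_genres = [country for sublist in titles['production_countries'] for country in sublist]

# Use Counter to get the frequency of each country
genre_counts = Counter(all_genres)

# Fold the long-form name into the 'US' code so the country is counted once
genre_counts['US'] = genre_counts['US'] + genre_counts['United States of America']
print(genre_counts['US'])
del genre_counts['United States of America']

genre_titles = pd.DataFrame(genre_counts.items(), columns=['name', 'Count']).sort_values(by='Count', ascending=False)

print(genre_titles['name'].unique())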


In [None]:
country_code = {
    'US': 'United States of America',
    'IN': 'India',
    'GB': 'United Kingdom',
    'CA': 'Canada',
    'FR': 'France',
    'JP': 'Japan',
    'AU': 'Australia',
    'DE': 'Germany',
    'IT': 'Italy',
    'CN': 'China',
    'ES': 'Spain',
    'HK': 'Hong Kong',
    'MX': 'Mexico',
    'KR': 'Korea, Republic of',
    'RU': 'Russian Federation',
    'BE': 'Belgium',
    'IE': 'Ireland',
    'BR': 'Brazil',
    'IL': 'Israel',
    'NZ': 'New Zealand',
    'ZA': 'South Africa',
    'NL': 'Netherlands',
    'NG': 'Nigeria',
    'NO': 'Norway',
    'DK': 'Denmark',
    'TH': 'Thailand',
    'SE': 'Sweden',
    'AR': 'Argentina',
    'CZ': 'Czechia',
    'CH': 'Switzerland',
    'PH': 'Philippines',
    'PL': 'Poland',
    'SK': 'Slovakia',
    'AT': 'Austria',
    'CL': 'Chile',
    'LU': 'Luxembourg',
    'IR': 'Iran, Islamic Republic of',
    'TW': 'Taiwan, Province of China',
    'GR': 'Greece',
    'CO': 'Colombia',
    'FI': 'Finland',
    'RO': 'Romania',
    'UA': 'Ukraine',
    'HU': 'Hungary',
    'AE': 'United Arab Emirates',
    'MY': 'Malaysia',
    'MA': 'Morocco',
    'ID': 'Indonesia',
    'AF': 'Afghanistan',
    'VE': 'Venezuela, Bolivarian Republic of',
    'PR': 'Puerto Rico',
    'EG': 'Egypt',
    'VN': 'Viet Nam',
    'PT': 'Portugal',
    'IS': 'Iceland',
    'TR': 'Turkey',
    'RS': 'Serbia',
    'UY': 'Uruguay',
    'SG': 'Singapore',
    'EE': 'Estonia',
    'KE': 'Kenya',
    'MN': 'Mongolia',
    'QA': 'Qatar',
    'GE': 'Georgia',
    'BO': 'Bolivia, Plurinational State of',
    'PA': 'Panama',
    'CU': 'Cuba',
    'PS': 'Palestine, State of',
    'IO': 'British Indian Ocean Territory',
    'LV': 'Latvia',
    'CR': 'Costa Rica',
    'LB': 'Lebanon',
    'PK': 'Pakistan',
    'TT': 'Trinidad and Tobago',
    'AL': 'Albania',
    'BD': 'Bangladesh',
    'HR': 'Croatia',
    'FJ': 'Fiji',
    'LI': 'Liechtenstein',
    'SI': 'Slovenia',
    'BA': 'Bosnia and Herzegovina',
    'BG': 'Bulgaria',
    'LT': 'Lithuania',
    'JM': 'Jamaica',
    'KZ': 'Kazakhstan',
    'DO': 'Dominican Republic',
    'CY': 'Cyprus',
    'CM': 'Cameroon',
    'SY': 'Syrian Arab Republic',
    'AM': 'Armenia',
    'MT': 'Malta',
    'EC': 'Ecuador',
    'PF': 'French Polynesia',
    'ET': 'Ethiopia',
    'GQ': 'Equatorial Guinea',
    'PY': 'Paraguay',
    'MC': 'Monaco',
    'UG': 'Uganda',
    'SV': 'El Salvador',
    'CI': "Côte d'Ivoire",
    'JO': 'Jordan',
    'BM': 'Bermuda',
    'SO': 'Somalia',
    'SZ': 'Eswatini',
    'KH': 'Cambodia',
    'AQ': 'Antarctica',
    'TC': 'Turks and Caicos Islands',
    'PE': 'Peru',
    'TN': 'Tunisia',
    'LY': 'Libya',
    'XX': 'Unknown',
    'YU': 'Yugoslavia',
    'SU': 'Soviet Union',
    'XK': 'Kosovo',
    'XC': 'Czechoslovakia',
    'AN': 'Netherlands Antilles'
}


In [None]:
genre_titles['name'] = genre_titles['name'].str.strip()
# Map the two-letter codes to full country names
genre_titles['name'] = genre_titles['name'].replace(country_code)

# Load the Natural Earth country polygons.
# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0. On newer
# versions, download the Natural Earth countries file and pass its path to gpd.read_file.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Merge the counts with the world GeoDataFrame.
# Country names do not always match exactly between the two tables, so a few may need renaming.
merged = world.merge(genre_titles, how='inner', on='name')
#merged = world.set_index('name').join(genre_titles.set_index('name'))
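<p>Because an inner merge silently drops names that do not match, it helps to list the mismatches first. The sketch below uses two small invented tables in place of <code>world</code> and the country counts. The counts are made up, and the corrected spellings in <code>name_fixes</code> are assumptions about how the map layer spells those countries.</p>

```python
import pandas as pd

# Hypothetical stand-ins: `world_names` mimics the map layer's `name` column,
# `counts` mimics the country table after the code-to-name replacement
world_names = pd.DataFrame({"name": ["India", "South Korea", "Russia"]})
counts = pd.DataFrame({
    "name": ["India", "Korea, Republic of", "Russian Federation"],
    "Count": [1050, 30, 12],
})

# indicator=True adds a `_merge` column that says where each row was found
check = counts.merge(world_names, on="name", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "name"].tolist()
print(unmatched)

# Illustrative renames only; verify them against the actual map layer
name_fixes = {"Korea, Republic of": "South Korea", "Russian Federation": "Russia"}
counts["name"] = counts["name"].replace(name_fixes)
matched = counts.merge(world_names, on="name", how="inner")
print(len(matched))
```

<p>Running the check before and after the renames shows directly how many countries would otherwise be missing from the map.</p>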

In [None]:



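actual_vmax = merged['Count'].max()

# Bin the counts so the heavily skewed distribution stays readable on the map.
# Only edges below the observed maximum are kept, so the bin edges stay strictly
# increasing even when the largest count is 5000 or less.
edges = [0, 10, 100, 500, 1000, 3000, 5000]
bin_edges = [e for e in edges if e < actual_vmax] + [actual_vmax]
merged['bins'] = pd.cut(merged['Count'], bins=bin_edges, labels=False)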

fig, ax = plt.subplots(1, 1, figsize=(15, 8))
merged.plot(column='bins', ax=ax, cmap='PuRd', legend=True,
            missing_kwds={'color': 'white'})

ax.set_title('Distribution of Amazon Prime shows and movies by Country', fontsize=16)
plt.axis('off')

plt.show()

In [None]:
top10 = genre_titles.head(10)  # ten rows, matching the chart title

lavender = '#B57EDC'  # Soft lavender

plt.figure(figsize=(10, 6))
# A single color is passed via `color`; `palette` is intended for use with `hue`
sns.barplot(x='Count', y='name', data=top10, color=lavender)
plt.title('Top 10 Countries with Most Amazon Prime Content')
plt.xlabel('Number of Shows/Movies',fontsize=14)
plt.ylabel('Country',fontsize=14)
plt.tight_layout()
plt.show()


<h3>Trends Over Time: How has Amazon Prime’s content library evolved?</h3>

In [None]:
# Prepare data
total_per_year = titles.groupby(['release_year']).size().reset_index(name='count')
total_per_year['release_year'] = total_per_year['release_year'].astype(str)  # convert to string

plt.figure(figsize=(14, 6))
barplot = sns.barplot(x='release_year', y='count', data=total_per_year, color='#B57EDC')

# Titles are grouped by release year, not by the date they were added to the platform
plt.title('Amazon Prime Titles by Release Year', fontsize=14)
plt.xlabel('Release Year')
plt.ylabel('Number of Titles')

# Label only the years divisible by 10 to keep the axis readable
xtick_positions = barplot.get_xticks()
xtick_labels = total_per_year['release_year'].tolist()
new_labels = [label if int(label) % 10 == 0 else '' for label in xtick_labels]

# Pin the tick positions before assigning labels so ticks and labels stay aligned
barplot.set_xticks(xtick_positions)
barplot.set_xticklabels(new_labels)

plt.tight_layout()
plt.show()
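<p>With one bar per release year the chart gets crowded, so grouping by decade can make the long-run trend easier to read. A minimal sketch on invented release years:</p>

```python
import pandas as pd

# Invented release years; the real column is titles['release_year']
toy = pd.DataFrame({"release_year": [1994, 1999, 2003, 2015, 2019, 2021]})

# Integer division floors each year to the start of its decade
toy["decade"] = (toy["release_year"] // 10) * 10

per_decade = toy.groupby("decade").size().reset_index(name="count")
print(per_decade)
```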


In [None]:
grouped = titles.groupby(['release_year', 'type']).size().unstack(fill_value=0)

# Ensure all years are sorted, then keep the 30 most recent release years
grouped = grouped.sort_index()
grouped = grouped.tail(30)


# Plotting
x = np.arange(len(grouped.index))
width = 0.35

fig, ax = plt.subplots(figsize=(15, 5))
bar1 = ax.bar(x - width/2, grouped['SHOW'], width, label='TV Show', color='#a678de')
bar2 = ax.bar(x + width/2, grouped['MOVIE'], width, label='Movie', color='#6ad49b')

ax.set_xlabel('Release Year')
ax.set_ylabel('Number of Titles')
ax.set_title('Amazon Prime Titles by Release Year and Type')
ax.set_xticks(x)
ax.set_xticklabels(grouped.index, rotation=90)
ax.legend()

plt.tight_layout()
plt.show()
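<p>Raw counts mix two effects: the catalogue growing overall and the balance between movies and shows shifting. Dividing each row by its total separates the second effect. The sketch below shows the idea on invented rows and is not part of the original analysis.</p>

```python
import pandas as pd

# Invented rows standing in for the real `titles` table
toy = pd.DataFrame({
    "release_year": [2020, 2020, 2020, 2021, 2021],
    "type": ["MOVIE", "MOVIE", "SHOW", "MOVIE", "SHOW"],
})

counts_by_type = toy.groupby(["release_year", "type"]).size().unstack(fill_value=0)

# Divide each row by its total to get the share of each type per year
share = counts_by_type.div(counts_by_type.sum(axis=1), axis=0)
print(share)
```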

<h3>Ratings Distribution - IMDb Ratings & Popularity: What are the highest-rated or most popular shows on the platform?</h3>

In [None]:
# IMDb score against number of votes, colored by content type
plt.figure(figsize=(12, 7))
sns.scatterplot(data=titles, x='imdb_votes', y='imdb_score', hue='type', palette='deep', alpha=0.7)
plt.title("IMDb Score vs. IMDb Votes on Amazon Prime", fontsize=18, weight='bold')
plt.xlabel("IMDb Votes (Log Scale)", fontsize=14)
plt.ylabel("IMDb Score", fontsize=14)
plt.xscale('log') # Use log scale for votes as they can vary widely, making the plot more readable
plt.grid(True, which="both", ls="--", c=".7")
plt.tight_layout()
plt.show()

In [None]:
titles.groupby('type')['imdb_score'].agg(['max','min','mean'])

In [None]:
titles['type'].value_counts()

In [None]:
print(titles[(titles['type'] == 'MOVIE') & (titles['imdb_score'] == 9.9)]['title'])

In [None]:
print(titles[(titles['type'] == 'SHOW') & (titles['imdb_score'] == 9.7)]['title'])
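<p>The two lookups above type in the maximum scores by hand. The sketch below finds the top title per type programmatically and repeats the lookup with a minimum-vote filter, since a very high score backed by few votes may not be representative. The rows and the <code>min_votes</code> threshold are invented for illustration.</p>

```python
import pandas as pd

# Invented rows standing in for the real `titles` table
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
    "imdb_score": [9.9, 8.1, 9.7, 9.0],
    "imdb_votes": [12, 50000, 300000, 40],
})
cols = ["type", "title", "imdb_score"]

# Row label of the highest score within each type
best = toy.loc[toy.groupby("type")["imdb_score"].idxmax(), cols]

# Same lookup restricted to titles with enough votes (threshold is arbitrary)
min_votes = 1000
popular = toy[toy["imdb_votes"] >= min_votes]
best_popular = popular.loc[popular.groupby("type")["imdb_score"].idxmax(), cols]

print(best)
print(best_popular)
```

<p>In this toy data the top movie changes once the vote filter is applied, which is the situation the filter is meant to expose.</p>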