# 📊 **Netflix Content Analysis** 🔍
### **Comprehensive Exploration of Content Trends, Categories, and Global Distribution**

This report delves into the **distribution of content by type**, **geographical production trends**, **seasonal patterns**, and more, using Netflix's vast library of movies and TV shows. We analyze various aspects, from the **growth in new content over the years** to the **dominant genres** and **countries of origin**, and even explore the impact of **TMDB ratings**.

✨ Key Insights:
- The evolution of Netflix’s content library from 2008 to 2021.
- Global distribution of movies and TV shows across different countries 🌍.
- Trend analysis of content addition over the years 📈.
- Detailed breakdown of **top categories** and their popularity.

Let’s dive into the data and uncover the fascinating insights behind Netflix’s expansive content library! 🚀


In [4]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import folium
import requests
import json
from country_coords import country_coords

In [5]:
df = pd.read_csv("cleaned_netflix.csv")

In [6]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added
0,0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Not Available,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021,September
1,1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021,September


In [7]:
df.drop(columns="Unnamed: 0", inplace=True)

In [8]:
df.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,month_added
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Not Available,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021,September
1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021,September


In [9]:
# Convert column to datetime
df["date_added"] = pd.to_datetime(df["date_added"])

In [10]:
# Create month_added column
df["month_added"] = df["date_added"].dt.month

In [11]:
months_dict = {
    1: "January",
    2: "February",
    3: "March",
    4: "April",
    5: "May",
    6: "June",
    7: "July",
    8: "August",
    9: "September",
    10: "October",
    11: "November",
    12: "December"
}


In [12]:
df["month_added"] = df["month_added"].map(months_dict)
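
As a possible shortcut for the month mapping above, pandas can return month names directly through the `.dt.month_name()` accessor, which would replace the dictionary lookup. The sketch below uses two `date_added` values taken from the sample rows; the variable names are illustrative only.

```python
import pandas as pd

# Two date_added values copied from the sample rows shown earlier
dates = pd.to_datetime(pd.Series(["2021-09-25", "2021-09-24"]))

# month_name() returns the English month name for each timestamp
month_names = dates.dt.month_name()
print(month_names.tolist())
```

In the notebook this would correspond to something like `df["date_added"].dt.month_name()`, assuming `date_added` has already been converted to datetime.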

# Distribution of Content by Type (Movies vs. TV Shows) 🎬📺

In this analysis, we explored the distribution of content available on Netflix. We focused on **Movies** and **TV Shows** to see how they are represented in the dataset.

The data reveals:
- A **majority of content** is in the form of **Movies**.
- The **TV Shows** count, while substantial, is significantly lower than movies.
  
This provides insights into the content focus on the platform.

## Key Findings:
- **Movies:** about 70% of the content (6,126 titles)
- **TV Shows:** about 30% of the content (2,664 titles)
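
The percentages above can be reproduced from the per-type counts that the `groupby` in this notebook returns (6,126 movies and 2,664 TV shows). A minimal sketch, with the counts typed in by hand so that it runs without the CSV:

```python
import pandas as pd

# Counts copied from the groupby output in this notebook
counts = pd.Series({"Movie": 6126, "TV Show": 2664})

# Share of each type as a percentage of all titles, rounded to one decimal
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

On the full dataframe, `df["type"].value_counts(normalize=True)` should give the same shares as fractions.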


## Movies vs TV Shows

In [13]:
movies_vs_shows = df.groupby("type").count()[["show_id"]].reset_index()
movies_vs_shows

Unnamed: 0,type,show_id
0,Movie,6126
1,TV Show,2664


In [14]:
fig = px.bar(
    data_frame=movies_vs_shows,
    x="type",
    y="show_id",
    title="Distribution of Netflix Content",
    text="show_id",
    color="type", 
    color_discrete_map={
        "Movie": "#E50914",  
        "TV Show": "#221F1F" 
    }
)

# Customizing labels
fig.update_layout(
    xaxis_title="Type of Content",  # Update X-axis label
    yaxis_title="Number of Shows",  # Update Y-axis label
    title_font=dict(size=24, color="#E50914")  
)

# Show the figure
fig.show()

# Volume of New Content Added Over the Years 📈


Through a **trend analysis**, we can see how the volume of new content has grown on Netflix. Based on the data, we observe a **sharp increase** in the number of shows added to Netflix, especially after 2015.

## Content Added by Year:
- **2008**: 2 shows
- **2009**: 2 shows
- **2010**: 1 show
- **2011**: 13 shows
- **2012**: 3 shows
- **2013**: 11 shows
- **2014**: 24 shows
- **2015**: 82 shows
- **2016**: 426 shows
- **2017**: 1,185 shows
- **2018**: 1,648 shows
- **2019**: 2,016 shows
- **2020**: 1,879 shows
- **2021**: 1,498 shows

### Key Insights:
- **Pre-2015**: The volume of new content added to Netflix was **relatively low**, never exceeding 24 shows in a year.
- **Post-2015**: There was a **sharp increase**, with a **peak in 2019** (2,016 shows).
- **2020 and 2021**: Additions fell to 1,879 and then 1,498 shows, a drop of roughly 20% from 2020 to 2021, though both years remain far above the pre-2016 level. The most recent sample rows in this dataset were added in September 2021, so the 2021 figure may cover only part of the year.

## Trend Visualization:
- **2016-2019**: Significant growth in content addition (from 426 to 2,016 shows).
- **2020-2021**: A decline after the **2019** peak.
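
To put numbers on the trend, the year-over-year change can be computed from the yearly counts listed above. This is a small sketch with the 2015 to 2021 counts typed in by hand; the series name `titles_added` is illustrative.

```python
import pandas as pd

# Yearly counts copied from the "Content Added by Year" list
yearly = pd.Series(
    {2015: 82, 2016: 426, 2017: 1185, 2018: 1648, 2019: 2016, 2020: 1879, 2021: 1498},
    name="titles_added",
)

# Percentage change relative to the previous year (the first year has no predecessor, so it is NaN)
growth_pct = yearly.pct_change().mul(100).round(1)

# Year with the largest number of additions
peak_year = yearly.idxmax()

print(growth_pct)
print("Peak year:", peak_year)
```

Positive values mark growth and negative values mark a decline, which makes the turn after the peak easy to spot.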


In [15]:
yearly_content = df.groupby("year_added").count()[["show_id"]].reset_index()
yearly_content

Unnamed: 0,year_added,show_id
0,2008,2
1,2009,2
2,2010,1
3,2011,13
4,2012,3
5,2013,11
6,2014,24
7,2015,82
8,2016,426
9,2017,1185


In [16]:
fig = px.line(
    data_frame=yearly_content,
    x="year_added",
    y="show_id",
    title="Volume of new content over the years"
    
)

fig.update_traces(
    line_color="#E50914",  # Netflix red for the line
    line_width=3  # Optional: Set the thickness of the line
)

# Customizing labels
fig.update_layout(
    xaxis_title="Year",  # Update X-axis label
    yaxis_title="Number of Content",  # Update Y-axis label
    title_font=dict(size=24, color="#E50914")  
)

# Show the figure
fig.show()

## Most Common Categories in the Dataset 📊

Netflix offers a wide range of content, and in this analysis, we dive into the **most common categories** within the dataset. Based on the content data, here are the top categories on Netflix.

📅 **Top Categories:**
1. **International Movies**: 2,752 titles
2. **Dramas**: 2,426 titles
3. **Comedies**: 1,674 titles
4. **International TV Shows**: 1,349 titles
5. **Documentaries**: 869 titles


## Key Insights:
- **International Movies** account for the largest portion of content, with a total of **2,752 titles**.
- **Dramas** and **Comedies** are also highly represented, contributing to Netflix's broad genre variety.
- **International TV Shows** and **Documentaries** are also significant categories, showcasing the global and educational focus of Netflix's offerings.

### What This Means:
- Netflix's library is **heavily focused on international content**, with both **International Movies** and **International TV Shows** having a prominent presence.
- **Dramas** and **Comedies** are the core genres, which align with global preferences for storytelling.
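
One point worth keeping in mind when reading these counts: a title can list several categories in `listed_in`, so the totals count category tags and one title can contribute to more than one category. The sketch below shows this on three toy rows (the first two `listed_in` values come from the sample rows, the third is made up for illustration) and checks that the `get_dummies` approach used in this notebook agrees with a split-and-explode count.

```python
import pandas as pd

# Toy data: the third row is a made-up example, not taken from the dataset
sample = pd.DataFrame({
    "listed_in": [
        "Documentaries",
        "International TV Shows, TV Dramas, TV Mysteries",
        "Dramas, International Movies",
    ]
})

# Approach used in the notebook: one indicator column per category, then sum
dummies = sample["listed_in"].str.get_dummies(sep=", ").sum()

# Alternative: split each string into a list, give every tag its own row, then count
counts = sample["listed_in"].str.split(", ").explode().value_counts()

print(dummies.sort_index())
print(counts.sort_index())
```

Both approaches count six tags across three titles, which is why category totals can add up to more than the number of titles.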


In [17]:
top5_categories = df['listed_in'].str.get_dummies(sep=", ").sum().sort_values(ascending=False).head(5)
top5_categories

International Movies      2752
Dramas                    2426
Comedies                  1674
International TV Shows    1349
Documentaries              869
dtype: int64

In [18]:

fig = px.bar(
    data_frame=top5_categories,
    title="Top 5 Categories",
    text_auto=True,
    color_discrete_sequence=['#221F1F']
)


# Customizing labels
fig.update_layout(
    xaxis_title="Category",  # Update X-axis label
    yaxis_title="Number of Content",  # Update Y-axis label
    title_font=dict(size=24, color="#E50914")  
)

# Show the figure
fig.show()

## API Integration

###  **API Integrated Analysis Using TMDB API**  
🎥 **TMDB API Ratings Analysis**  
By integrating **TMDB API** ratings with our dataset, we identified which of TMDB's top-rated movies are also available on Netflix.

**Findings:**  
- The highest-rated TMDB movie that also appears in the Netflix dataset is **Schindler's List**, with a rating of **8.567**.
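
As a rough sketch of the parsing step, the function below turns one decoded `top_rated` page into rows with the four fields this notebook reads (`id`, `title`, `vote_average`, `release_date`). The function name `parse_top_rated` and the hand-written `sample_page` are assumptions for illustration; the sample is not a real API response, although its values match the first two rows of the output shown in this notebook.

```python
import pandas as pd

def parse_top_rated(payload):
    """Turn one decoded `top_rated` page into a list of row dicts (hypothetical helper)."""
    rows = []
    # .get() with defaults avoids a KeyError if a page or a field is missing
    for movie in payload.get("results", []):
        rows.append({
            "movie_id": movie.get("id"),
            "title": movie.get("title"),
            "rating": movie.get("vote_average"),
            "release_date": movie.get("release_date"),
        })
    return rows

# Hand-written sample shaped like the fields the notebook reads
sample_page = {"results": [
    {"id": 278, "title": "The Shawshank Redemption", "vote_average": 8.708, "release_date": "1994-09-23"},
    {"id": 238, "title": "The Godfather", "vote_average": 8.7, "release_date": "1972-03-14"},
]}

df_sample = pd.DataFrame(parse_top_rated(sample_page))
print(df_sample)
```

Keeping the parsing separate from the HTTP request makes it possible to check the row-building logic without calling the API.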


In [19]:
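import os

# Read the TMDB key from the environment instead of hard-coding it in the notebook
# (the variable name TMDB_API_KEY is an assumption; use whatever name you export)
API_KEY = os.environ["TMDB_API_KEY"]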

In [20]:
# Create an empty list to hold movie data
top_movies = []

# Fetch the top 200 movies: 10 pages of 20 movies each
for page in range(1, 11):
    url = f"https://api.themoviedb.org/3/movie/top_rated?api_key={API_KEY}&page={page}"
    response = requests.get(url, timeout=10)

    # Check the status before parsing, so an error response is never read as movie data
    if response.status_code != 200:
        print(f"Failed to fetch data for page {page} (status {response.status_code})")
        break

    for movie in response.json()["results"]:
        top_movies.append({
            "movie_id": movie["id"],
            "title": movie["title"],
            "rating": movie["vote_average"],
            "release_date": movie["release_date"],
        })

In [21]:
df_top_movies = pd.DataFrame(top_movies)
df_top_movies.head()

Unnamed: 0,movie_id,title,rating,release_date
0,278,The Shawshank Redemption,8.708,1994-09-23
1,238,The Godfather,8.7,1972-03-14
2,240,The Godfather Part II,8.572,1974-12-20
3,424,Schindler's List,8.567,1993-12-15
4,389,12 Angry Men,8.5,1957-04-10


In [22]:
# Merge the two dataframes and keep only movies that are both on Netflix and in TMDB's top-rated list
# (the right merge keeps every TMDB row; dropna() then removes the rows with no Netflix match)
merged_df = df.merge(df_top_movies, on="title", how="right").dropna()

In [23]:
merged_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating_x,duration,listed_in,description,year_added,month_added,movie_id,rating_y,release_date
3,s7958,Movie,Schindler's List,Steven Spielberg,"Liam Neeson, Ben Kingsley, Ralph Fiennes, Caro...",United States,2018-04-01,1993.0,R,195 min,"Classic Movies, Dramas",Oskar Schindler becomes an unlikely humanitari...,2018.0,April,424,8.567,1993-12-15
10,s7803,Movie,Pulp Fiction,Quentin Tarantino,"John Travolta, Samuel L. Jackson, Uma Thurman,...",United States,2019-01-01,1994.0,R,154 min,"Classic Movies, Cult Movies, Dramas",This stylized crime caper weaves together stor...,2019.0,January,680,8.488,1994-09-10
12,s8405,Movie,The Lord of the Rings: The Return of the King,Peter Jackson,"Elijah Wood, Ian McKellen, Liv Tyler, Viggo Mo...","New Zealand, United States",2020-01-01,2003.0,PG-13,201 min,"Action & Adventure, Sci-Fi & Fantasy",Aragorn is revealed as the heir to the ancient...,2020.0,January,122,8.483,2003-12-17
18,s6881,Movie,GoodFellas,Martin Scorsese,"Robert De Niro, Ray Liotta, Joe Pesci, Lorrain...",United States,2021-01-01,1990.0,R,145 min,"Classic Movies, Dramas",Former mobster Henry Hill recounts his colorfu...,2021.0,January,769,8.456,1990-09-12
24,s2590,Movie,Psycho,Mysskin,"Udhayanidhi Stalin, Aditi Rao Hydari, Nithya M...",India,2020-05-01,2020.0,TV-MA,143 min,"Horror Movies, International Movies, Thrillers",As a visually impaired man attempts to rescue ...,2020.0,May,539,8.427,1960-06-22


**Some of the matched movies do not have the same release year. Different movies in the two dataframes can share a title (for example, the 2020 film Psycho on Netflix was matched with TMDB's 1960 Psycho), so merging on title alone produces false matches. To filter these out, we have to match the release years from the two dataframes.**
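
The title collision can be reproduced on a tiny example. The sketch below uses two rows modeled on the merged output in this notebook (Psycho and Pulp Fiction) and joins on both title and release year in one step, as an alternative to merging on title and filtering afterwards. The dataframe names `netflix` and `tmdb` are illustrative.

```python
import pandas as pd

# Netflix side: the 2020 Psycho and the 1994 Pulp Fiction
netflix = pd.DataFrame({
    "title": ["Psycho", "Pulp Fiction"],
    "release_year": [2020, 1994],
})

# TMDB side: the 1960 Psycho and the 1994 Pulp Fiction
tmdb = pd.DataFrame({
    "title": ["Psycho", "Pulp Fiction"],
    "rating": [8.427, 8.488],
    "release_date": ["1960-06-22", "1994-09-10"],
})
tmdb["release_year"] = pd.to_datetime(tmdb["release_date"]).dt.year

# Merging on title alone pairs the two different Psycho films
by_title = netflix.merge(tmdb, on="title", how="inner")

# Merging on title and year keeps only the true match
matched = netflix.merge(tmdb, on=["title", "release_year"], how="inner")

print(len(by_title), "rows when merging on title only")
print(matched["title"].tolist())
```

A movie whose release year differs by one between the two sources would be dropped by this stricter join, so the exact-year match is a trade-off rather than a guarantee.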

In [24]:
merged_df["release_year"]=merged_df["release_year"].astype('int')

In [25]:
merged_df["release_date"]=pd.to_datetime(merged_df["release_date"])

In [26]:
merged_df["TMDB_release_year"] = merged_df["release_date"].dt.year

In [27]:
merged_df = merged_df[merged_df['TMDB_release_year'] == merged_df['release_year']]

In [28]:
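# Sort explicitly by TMDB rating rather than relying on the row order left by the merge
top5_movies = merged_df.sort_values("rating_y", ascending=False).head(5)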

In [29]:

fig = px.bar(
    data_frame=top5_movies,
    x="title",
    y="rating_y",
    title="Top 5 Movies",
    text_auto=True,
    color_discrete_sequence=['#221F1F']
)


# Customizing labels
fig.update_layout(
    xaxis_title="Movie Title",  # Update X-axis label
    yaxis_title="Rating",  # Update Y-axis label
    title_font=dict(size=24, color="#E50914")  
)

# Show the figure
fig.show()

## Location-Based Analysis

### **Where Are Most of the Movies and TV Shows Produced? (Using Folium)**  
🌍 **Geospatial Analysis of Movie and TV Show Production**  
By using **Folium** and the country data, we mapped the production countries of Netflix content.

**Key Findings:**
- **United States** is by far the largest producer with **3680** titles.
- **India** follows with **1046** titles, then the **United Kingdom** (803), **Canada** (445), **France** (393), and **Japan** (316).


The visualization shows the global distribution of Netflix content across different countries. 
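
The map code in this notebook sets each circle's radius to `count ** 0.5`. A likely reason for the square root is that a circle's area grows with the square of its radius, so this scaling keeps the visible area proportional to the number of titles. The sketch below checks that on three of the counts quoted above; it does not need Folium.

```python
import math

# Title counts copied from the key findings
counts = {"United States": 3680, "India": 1046, "United Kingdom": 803}

# Same scaling as the map: radius is the square root of the count
radii = {country: math.sqrt(n) for country, n in counts.items()}

# Area is proportional to radius squared, so area ratios should equal count ratios
area_ratio = (radii["United States"] / radii["India"]) ** 2
count_ratio = counts["United States"] / counts["India"]

for country, radius in radii.items():
    print(f"{country}: radius {radius:.1f}")
print(f"area ratio {area_ratio:.2f}, count ratio {count_ratio:.2f}")
```

With a linear radius the United States circle would cover about twelve times the area that its share of titles warrants relative to India, which would overstate the gap visually.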


In [30]:
country_content = df['country'].str.get_dummies(sep=", ").sum().sort_values(ascending=False)

In [31]:
# Convert to DataFrame for easier manipulation
country_data = pd.DataFrame(list(country_content.items()), columns=["Country", "Content_Count"])

# Initialize a map centered globally
world_map = folium.Map(location=[20, 0], zoom_start=2)

# Loop through the country_data DataFrame
for index, row in country_data.iterrows():
    country = row['Country']
    count = row['Content_Count']
    
    # Get the coordinates from the dictionary
    coords = country_coords.get(country, None)
    
    # Only add circles for valid coordinates
    if coords and coords != [0, 0]:
        folium.CircleMarker(
            location=coords,
            radius=count ** 0.5,  # Scale circle size by count
            color="red",  # Circle outline color
            fill=True,
            fill_color="black",  # Circle fill color
            fill_opacity=0.7,
            popup=f"{country}: {count} contents",  # Popup with country and content count
        ).add_to(world_map)

#Save the map 
world_map.save("country_content_map.html")
#Display the map
world_map

## Seasonal Analysis

### **Are There Specific Times of the Year When More Content Is Added?**
📅 **Trend Analysis of Netflix Content Addition by Month**  
Using the dataset, we analyzed the number of shows and movies added to Netflix by month to determine if there are seasonal trends.

**Key Findings:**
- **July** is the peak month for Netflix content additions, with **827** titles added.
- **December** follows closely with **812** titles, while **September** comes in third with **769**.
- The months with the fewest additions are **February** with **562** and **May** with **632** titles added.

The month-wise distribution indicates that content addition is higher during the mid-to-late part of the year (July to December), which may reflect seasonal content releases or business strategies.
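
The chart in this section sorts months by count, which is useful for ranking but hides the calendar pattern. One way to view seasonality is to put the months back in calendar order. The sketch below does this for the five months quoted above, using the standard library's `calendar.month_name`; the variable names are illustrative.

```python
import calendar

import pandas as pd

# Monthly counts copied from the key findings (only the five months quoted there)
monthly = pd.Series({
    "July": 827,
    "December": 812,
    "September": 769,
    "May": 632,
    "February": 562,
})

# calendar.month_name starts with an empty string at index 0, so skip it
month_order = list(calendar.month_name)[1:]

# Keep calendar order, restricted to the months present in the data
ordered = monthly.reindex([m for m in month_order if m in monthly.index])
print(ordered)
```

On the full `monthly_added_content` dataframe, the same ordering could be applied before plotting, or passed to Plotly through its `category_orders` argument.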


In [32]:
monthly_added_content = df.groupby("month_added").count().\
                            sort_values(by="show_id", ascending=False)[["show_id"]].\
                            reset_index()


In [33]:

fig = px.bar(
    data_frame=monthly_added_content,
    x="month_added",
    y="show_id",
    title="Number of added content per Month",
    text_auto=True,
    color_discrete_sequence=['#221F1F']
)


# Customizing labels
fig.update_layout(
    xaxis_title="Month",  # Update X-axis label
    yaxis_title="Total content",  # Update Y-axis label
    title_font=dict(size=24, color="#E50914")  
)

# Show the figure
fig.show()