Analyze the data and generate insights that could help Netflix in deciding which type of shows/movies to produce and how they can grow the business in different countries.

**Data Description**
The dataset provided to you in netflix_data.csv consists of a list of all the TV shows/movies available on Netflix.

- show_id - a unique ID for every movie/show
- type - identifier: a movie or TV show
- title - the title of the movie/show
- director - the name of the director of the movie/show
- cast - actors involved in the movie/show
- country - a country where the movie/show was produced
- date_added - date it was added on Netflix
- release_year - the actual release year of the movie/show
- rating - TV rating of the movie/show
- duration - total duration in minutes or number of seasons
- listed_in - genre
- description - the summary description

You can start by exploring a few questions:

- What type of content is available in different countries?
- How has the number of movies released per year changed over the last 20-30 years?
- Comparison of TV shows vs. movies.
- What is the best time to launch a TV show?
- Analysis of actors/directors of different types of shows/movies.
- Does Netflix have more focus on TV shows than movies in recent years?
- Understanding what content is available in different countries.

**Practicalities**
- The exploration should have a goal. As you explore the data, keep in mind that you want to answer which type of shows to produce and how to grow the business. Ensure each recommendation is backed by data. The company is looking for data-driven insights, not personal opinions or anecdotes. Assume that you are presenting your findings to business executives who have only a basic understanding of data science. Avoid unnecessary technical jargon.

## Setup and Imports

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# create a spark session
spark = SparkSession.builder.appName("PySpark Netflix Analysis").getOrCreate()

## Load Data

In [0]:
df = spark.read.option("header", True).option("inferSchema", True).csv('/Volumes/workspace/data/csv_files/netflix_data.csv')

num_rows = df.count()
num_cols = len(df.columns)
print(f"Dataset Shape: ({num_rows}, {num_cols})")

# show() prints the rows itself and returns None, so it is not wrapped in display()
df.show(5, truncate=False)

Dataset Shape: (1500, 12)
+-------+-----+------------------+------------+---------------------------------------------------------------------+--------------+------------------+------------+------+--------+---------------------------+--------------------------------------------------------------------------+
|show_id|type |title             |director    |cast                                                                 |country       |date_added        |release_year|rating|duration|listed_in                  |description                                                               |
+-------+-----+------------------+------------+---------------------------------------------------------------------+--------------+------------------+------------+------+--------+---------------------------+--------------------------------------------------------------------------+
|s0001  |Movie|The Emerald World |Liam Lee    |Taylor Singh, Jamie Patel, Arjun Singh                               |Canad

## Data Cleaning

### Strip White Spaces

In [0]:
# get all string columns; simpleString() gives 'string' no matter how the Spark version prints StringType
string_cols = [field.name for field in df.schema.fields if field.dataType.simpleString() == 'string']

# Strip white spaces. PySpark operates column wise, so we loop over string columns instead of using applymap
for c in string_cols:
    df = df.withColumn(c, trim(col(c)))

df.show(5, truncate=False)

+-------+-----+------------------+------------+---------------------------------------------------------------------+--------------+------------------+------------+------+--------+---------------------------+--------------------------------------------------------------------------+
|show_id|type |title             |director    |cast                                                                 |country       |date_added        |release_year|rating|duration|listed_in                  |description                                                               |
+-------+-----+------------------+------------+---------------------------------------------------------------------+--------------+------------------+------------+------+--------+---------------------------+--------------------------------------------------------------------------+
|s0001  |Movie|The Emerald World |Liam Lee    |Taylor Singh, Jamie Patel, Arjun Singh                               |Canada        |February 10, 202

### Fill Missing Values

In [0]:
# Define replacement value per column

fill_values = {
    'director': 'Unknown',
    'country': 'Unknown',
    'cast': 'NA',
    'listed_in': 'NA'
}

# Fill missing values
df = df.fillna(fill_values)

df.show(5, truncate=False)

+-------+-----+------------------+------------+---------------------------------------------------------------------+--------------+------------------+------------+------+--------+---------------------------+--------------------------------------------------------------------------+
|show_id|type |title             |director    |cast                                                                 |country       |date_added        |release_year|rating|duration|listed_in                  |description                                                               |
+-------+-----+------------------+------------+---------------------------------------------------------------------+--------------+------------------+------------+------+--------+---------------------------+--------------------------------------------------------------------------+
|s0001  |Movie|The Emerald World |Liam Lee    |Taylor Singh, Jamie Patel, Arjun Singh                               |Canada        |February 10, 202

### Convert Date

In [0]:
# Assuming the format is something like 'September 9, 2019'
df = df.withColumn(
    'date_added',
    to_date(col('date_added'), 'MMMM d, yyyy')
)

df = df.withColumn('added_year', year(col('date_added')))\
       .withColumn('added_month', month(col('date_added')))

# Show result
df.select('date_added', 'added_year', 'added_month').show(5, truncate=False)

+----------+----------+-----------+
|date_added|added_year|added_month|
+----------+----------+-----------+
|2021-02-10|2021      |2          |
|2022-05-11|2022      |5          |
|2017-03-31|2017      |3          |
|2015-02-28|2015      |2          |
|2022-09-07|2022      |9          |
+----------+----------+-----------+
only showing top 5 rows
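
Depending on Spark's ANSI setting, a `date_added` string that does not match the pattern either becomes null or raises an error, so it helps to know how strict the format is. The sketch below is a plain-Python analogue, not PySpark: `parse_added` is a hypothetical helper, the sample strings are made up, and `%B` assumes an English locale.

```python
from datetime import date, datetime
from typing import Optional

def parse_added(text: Optional[str]) -> Optional[date]:
    """Parse strings like 'September 9, 2019'; return None when the format does not match."""
    if text is None:
        return None
    try:
        # %B = full month name, %d = day of month, %Y = four-digit year
        return datetime.strptime(text.strip(), "%B %d, %Y").date()
    except ValueError:
        return None

# Hypothetical samples: two well-formed values, one ISO date, one missing value
samples = ["September 9, 2019", " February 10, 2021", "2019-09-09", None]
parsed = [parse_added(s) for s in samples]
failed = sum(p is None for p in parsed)
print(parsed, failed)
```

Counting the unparsed values this way shows how many rows would drop out of the month and year charts later in the notebook.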


### Duration Cleaning

In [0]:
# Extract duration time(int) and duration type

'''
regexp_extract - function that extract a part of string using a regular expression
r'(\d+)' - pattern
\d - digit (0-9)
+ - one or more
() - define a capturing group
last argument - specifies which capturing group to return

'''

df = df.withColumn(
    'duration_int', 
         when(col('duration').contains('Season'), regexp_extract(col('duration'), r'(\d+)', 1).cast('int'))
        .when(col('duration').contains('min'), regexp_extract(col('duration'), r'(\d+)', 1).cast('int'))
        .otherwise(None)
    )

df = df.withColumn(
    'duration_type',
    when(col('duration').contains('Season'), 'Season')
    .when(col('duration').contains('min'), 'Minutes')
    .otherwise(None)
)

df.select('duration', 'duration_int', 'duration_type').show(5, truncate=False)


+--------+------------+-------------+
|duration|duration_int|duration_type|
+--------+------------+-------------+
|94 min  |94          |Minutes      |
|125 min |125         |Minutes      |
|134 min |134         |Minutes      |
|85 min  |85          |Minutes      |
|143 min |143         |Minutes      |
+--------+------------+-------------+
only showing top 5 rows
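
The regular expression described in the cell above can be tried out on single strings without a Spark session. This is a minimal plain-Python sketch of the same rule, assuming the values look like `94 min` or `2 Seasons`; `parse_duration` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional, Tuple

def parse_duration(text: Optional[str]) -> Tuple[Optional[int], Optional[str]]:
    """Return (number, unit) for strings such as '94 min' or '2 Seasons'."""
    if not text:
        return None, None
    # Decide the unit first, mirroring the when/otherwise chain
    if "Season" in text:
        unit = "Season"
    elif "min" in text:
        unit = "Minutes"
    else:
        return None, None
    # (\d+) captures the first run of digits, like capture group 1 in regexp_extract
    match = re.search(r"(\d+)", text)
    number = int(match.group(1)) if match else None
    return number, unit

for value in ["94 min", "2 Seasons", "1 Season", "unknown"]:
    print(value, "->", parse_duration(value))
```

Because `'Season'` is a substring of `'Seasons'`, a single check covers both the singular and plural forms.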


### Drop duplicates

In [0]:
df = df.dropDuplicates(['show_id'])

## Exploratory Data Analysis (EDA)

### Missing values

In [0]:
# calculate missing values per column

missing_df = df.select([
    count(when(col(c).isNull(), c)).alias(c)
    for c in df.columns
    ])

#convert to Pandas for visualization
missing_pd = missing_df.toPandas()

fig = px.imshow(
    missing_pd.T,
    text_auto=True,
    color_continuous_scale='Reds',
    labels=dict(x="Missing Values", y="Columns")
)

fig.update_layout(
    title="Missing Values Heatmap",
    yaxis_title="",
    xaxis_title=""
)

fig.show()

### Distribution of TV Shows and Movies

In [0]:
# Aggregate count of 'type' column
type_counts = df.groupBy('type').agg(count('*').alias('count')).toPandas()

fig = px.bar(
    type_counts,
    x='type',
    y='count',
    color='type',
    color_discrete_sequence=px.colors.qualitative.Set2,  # 'type' is categorical, so a discrete palette applies
    title='Distribution of Movies vs TV Shows',
    text='count'
)

fig.show()

### Top Countries

In [0]:
# Split country column by comma, explode into multiple rows and strip whitespaces

countries_df = df.select(
    explode(
        split(col('country'), ',')
    ).alias('country')
    ).withColumn('country', trim(col('country')))

# Filter out null or empty strings
countries_df = countries_df.filter(col('country').isNotNull() & (col('country') != ''))

# count occurrences and get Top 10
top_countries = countries_df.groupBy('country')\
    .count()\
    .orderBy(desc('count'))\
    .limit(10)

top_countries.show(truncate=False)


+--------------+-----+
|country       |count|
+--------------+-----+
|United States |493  |
|India         |320  |
|United Kingdom|189  |
|Canada        |137  |
|South Korea   |86   |
|Japan         |73   |
|Germany       |58   |
|France        |54   |
|Brazil        |31   |
|Unknown       |30   |
+--------------+-----+
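
Since a title can list several countries, the split-and-explode step counts a co-production once per country, so these counts are country credits rather than distinct titles. The sketch below shows the same split, trim, and count logic in plain Python, not PySpark; the rows and the `count_values` helper are hypothetical.

```python
from collections import Counter

# Hypothetical cells for illustration; the real values come from the country column
rows = ["United States, India", "India", "Canada ,United States", "", "Unknown"]

def count_values(cells):
    """Split each cell on commas, trim the pieces, drop empties, and count them."""
    counts = Counter()
    for cell in cells:
        for piece in cell.split(","):
            name = piece.strip()
            if name:
                counts[name] += 1
    return counts

counts = count_values(rows)
print(counts.most_common(3))
```

Five input cells yield six credits here, because two of the titles are co-productions.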



In [0]:
# convert to Pandas
top_countries_pd = top_countries.toPandas()

# Calculate each country's share of the top-10 total (not of the whole catalog)
total_count = top_countries_pd['count'].sum()
top_countries_pd['percentage'] = (top_countries_pd['count'] / total_count * 100).round().astype(int)

# Create label combining count and percentage
top_countries_pd['label'] = top_countries_pd.apply(
    lambda row: f"{row['count']} ({row['percentage']}%)", axis=1
)

# Bar Chart
fig = px.bar(
    top_countries_pd,
    x='count',
    y='country',
    orientation='h',
    text='label',  # Use combined label
    color='count',
    color_continuous_scale='Viridis',
    title='Top 10 Countries by Number of Titles'
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Country',
    yaxis=dict(autorange="reversed")
)

fig.show()

### Top Genres

In [0]:
# get each value in separate rows
genres_df = df.select(
    explode(
        split(col('listed_in'), ',')
    ).alias('genre')
).withColumn('genre', trim(col('genre')))

# filter out null or empty strings
genres_df = genres_df.filter(col('genre').isNotNull() & (col('genre') != ''))

# count occurrences and get Top 10
top_genres = genres_df.groupBy('genre')\
                    .count()\
                    .orderBy(desc('count'))\
                    .limit(10)

top_genres.show(truncate=False)

+-----------+-----+
|genre      |count|
+-----------+-----+
|Drama      |575  |
|Comedy     |523  |
|Action     |470  |
|Romance    |314  |
|Thriller   |283  |
|Documentary|218  |
|Sci-Fi     |179  |
|Adventure  |166  |
|Horror     |153  |
|Animation  |74   |
+-----------+-----+



In [0]:
# convert to Pandas
top_genres_pd = top_genres.toPandas()

# Calculate each genre's share of the top-10 total (not of the whole catalog)
total_count = top_genres_pd['count'].sum()
top_genres_pd['percentage'] = (top_genres_pd['count'] / total_count * 100).round().astype(int)

# Create label combining count and percentage
top_genres_pd['label'] = top_genres_pd.apply(
    lambda row: f"{row['count']} ({row['percentage']}%)", axis=1
)

# horizontal bar chart
fig = px.bar(
    top_genres_pd,
    x='count',
    y='genre',
    orientation='h',
    text='label',  # Use combined label
    color='count',
    color_continuous_scale='Cividis',
    title='Top 10 Genres by Number of Titles'
)

fig.update_layout(
    xaxis_title='Count',
    yaxis_title='Genre',
    yaxis=dict(autorange="reversed")
)

fig.show()

### Release Year Trend

In [0]:
# filter for release year >= 1980

df_filtered = df.filter(col('release_year') >=1980)

# count number of titles per year
release_year_counts = df_filtered.groupBy('release_year').agg(count('*').alias('count')).orderBy('release_year')

# convert to Pandas
release_year_pd = release_year_counts.toPandas()

# bar chart using Plotly
fig = px.bar(
    release_year_pd,
    x='release_year',
    y='count',
    color='count',
    color_continuous_scale='Teal',
    title='Content Released by Year'
)

fig.update_layout(
    yaxis_title='Count',
    xaxis_title='Year'
)

fig.show()


## Questions

### Q1. What type of content is available in different countries?

In [0]:
# split of country column
df_splitted = df.withColumn("country", split(col("country"), ","))

# Explode the array to get separate rows
df_exploded = df_splitted.withColumn("country", explode(col("country")))

# Trim whitespace
df_trimmed = df_exploded.withColumn("country", trim(col("country")))

# get top 9 countries
top_countries = df_trimmed.groupBy("country")\
    .count()\
    .orderBy(desc("count"))\
    .limit(9)\
    .select("country")

subset = df_trimmed.join(top_countries, on="country", how="inner")


In [0]:
fig = px.histogram(
    subset.toPandas(),
    y="country",
    color="type",
    barmode="group",
    title="Movies vs TV Shows by Country",
    color_discrete_sequence=px.colors.qualitative.Set2
    )

fig.update_layout(
    xaxis_title = "Count",
    yaxis_title = "Country",
    legend_title_text = "Type",
    height= 600,
    width= 1000
)

fig.show()

### Q2. How has the number of movies released per year changed over the last 20-30 years?

In [0]:
movies = df.filter(df["type"] == "Movie")

# movie count per year
movies_per_year = (
    movies.groupBy("release_year")
        .count()
        .orderBy("release_year")
        .toPandas()
)

fig = px.line(
    movies_per_year,
    x= "release_year",
    y="count",
    title = "Movies Released per Year",
    markers = True
)

fig.update_layout(
    xaxis_title= "Release Year",
    yaxis_title= "Count",
    height= 500,
    width= 900
)

fig.show()

In [0]:
shows = df.filter(df["type"] == "TV Show")

# TV show count per year
shows_per_year = (
    shows.groupBy("release_year")
        .count()
        .orderBy("release_year")
        .toPandas()
)

fig = px.line(
    shows_per_year,
    x= "release_year",
    y="count",
    title = "Shows Released per Year",
    markers = True
)

fig.update_layout(
    xaxis_title= "Release Year",
    yaxis_title= "Count",
    height= 500,
    width= 900
)

fig.show()

### Q3. Comparison of TV shows vs. movies.

In [0]:
# Aggregate count of 'type' column
type_counts = df.groupBy('type').agg(count('*').alias('count')).toPandas()

fig = px.bar(
    type_counts,
    x='type',
    y='count',
    color='type',
    color_discrete_sequence=px.colors.qualitative.Set2,  # 'type' is categorical, so a discrete palette applies
    title='Distribution of Movies vs TV Shows',
    text='count'
)

fig.show()

### Q4. What is the best time to launch a TV show?

In [0]:
tv = df.filter(df["type"]== "TV Show")

# Extract month data

tv_months = (
    tv.filter(tv["added_month"].isNotNull())
    .withColumn("added_month",tv["added_month"].cast("int"))
    .groupBy("added_month")
    .count()
    .orderBy("added_month")
    .toPandas()
)

#line chart

fig = px.line(
    tv_months,
    x="added_month",
    y="count",
    title="TV Shows Added per Month",
    markers=True
)

fig.update_layout(
    xaxis = dict(
        title = "Month",
        tickmode="array",
        tickvals = list(range(1,13)),
        ticktext = [
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
        ]
    ),
    yaxis_title = "Count",
    height= 500,
    width= 900
)

fig.show()

### Q5. Analysis of actors/directors of different types of shows/movies.

In [0]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from pyspark.sql.window import Window

def get_top_by_type(df, col_name, alias, n=10):

    exploded = (
         df.withColumn(alias, explode(split(col(col_name), ",")))
          .withColumn(alias, trim(col(alias)))
          .filter(
              (col(alias).isNotNull()) &
              (col(alias) != "") &
              (~lower(col(alias)).isin("unknown","n/a","na","none"))
          )
          .groupBy("type", alias)
          .agg(count("*").alias("count"))
    )

    w = Window.partitionBy("type").orderBy(desc("count"))
    top_n = exploded.withColumn("rank",row_number().over(w)).filter(col("rank") <= n)

    return top_n.drop("rank")
 

# get data
actors_top10 = get_top_by_type(df, "cast", "name")
directors_top10 = get_top_by_type(df, "director", "name")

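# sort in pandas: Spark does not guarantee row order after the window filter
actors_movies = actors_top10.filter(col("type") == "Movie").toPandas().sort_values("count", ascending=False)
actors_tv = actors_top10.filter(col("type") == "TV Show").toPandas().sort_values("count", ascending=False)
directors_movies = directors_top10.filter(col("type") == "Movie").toPandas().sort_values("count", ascending=False)
directors_tv = directors_top10.filter(col("type") == "TV Show").toPandas().sort_values("count", ascending=False)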

# plot subplots

fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        "Top 10 Movie Actors", "Top 10 TV Show Actors",
        "Top 10 Movie Directors", "Top 10 TV Show Directors"
    ],
    horizontal_spacing=0.15, vertical_spacing=0.15
)

def add_bar(row, col_idx, data, color):
    # col_idx avoids shadowing pyspark.sql.functions.col inside this helper
    fig.add_trace(
        go.Bar(x=data["count"], y=data["name"], orientation="h", marker_color=color),
        row=row, col=col_idx
    )

# add all charts
add_bar(1, 1, actors_movies, "#EF5538")
add_bar(1, 2, actors_tv, "#636EFA")
add_bar(2, 1, directors_movies, "#00CC96")
add_bar(2, 2, directors_tv, "#AB63FA")

fig.update_layout(
    height=900, width=1100,
    title_text= "Top 10 Actors and Directors by Type",
    showlegend=False,
    template="plotly_white"
)

fig.update_yaxes(autorange="reversed")

fig.show()
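
The window step in `get_top_by_type` keeps the `n` highest-count names within each content type. As a rough illustration of that ranking logic in plain Python (not PySpark), the sketch below uses hypothetical records and a hypothetical `top_n_per_type` helper.

```python
from collections import defaultdict

# Hypothetical (type, name, count) records standing in for the grouped Spark output
records = [
    ("Movie", "Actor A", 5), ("Movie", "Actor B", 9), ("Movie", "Actor C", 7),
    ("TV Show", "Actor A", 4), ("TV Show", "Actor D", 6),
]

def top_n_per_type(rows, n):
    """Keep the n highest-count names within each type, highest first."""
    by_type = defaultdict(list)
    for kind, name, count in rows:
        by_type[kind].append((name, count))
    return {
        kind: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for kind, items in by_type.items()
    }

top2 = top_n_per_type(records, 2)
print(top2)
```

Ranking within each type separately means a name can appear in both the movie and the TV show lists.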


    

### Q6. Does Netflix have more focus on TV Shows than movies in recent years?

In [0]:
#trend Data
trend = (
    df.filter(col("added_year").isNotNull())
    .groupBy("added_year", "type")
    .agg(count("*").alias("count"))
    .orderBy("added_year")
)

trend_pd = trend.toPandas()

fig = px.line(
    trend_pd,
    x="added_year",
    y="count",
    color="type",
    markers=True,
    title="Trend: Netflix content type over time",
    labels = {
        "added_year": "Year",
        "count": "Number of Titles",
        "type": "Content Type"
    },
    template="plotly_white"
)

fig.update_layout(
    height=600,
    width=900,
    title_font=dict(size=20),
    legend_title_text="Type"
)

fig.show()
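
Raw counts can rise for both types at once, so the question of focus may be easier to read as the TV show share of each year's additions. The sketch below is plain Python with hypothetical numbers; `tv_share_by_year` is an illustrative helper and the real inputs would come from `trend_pd`.

```python
# Hypothetical (added_year, type, count) rows; real values come from trend_pd
trend_rows = [
    (2019, "Movie", 80), (2019, "TV Show", 20),
    (2020, "Movie", 90), (2020, "TV Show", 60),
]

def tv_share_by_year(rows):
    """Return {year: TV show share of all titles added that year, in percent}."""
    totals, tv = {}, {}
    for year, kind, count in rows:
        totals[year] = totals.get(year, 0) + count
        if kind == "TV Show":
            tv[year] = tv.get(year, 0) + count
    return {year: 100 * tv.get(year, 0) / total for year, total in sorted(totals.items())}

shares = tv_share_by_year(trend_rows)
print(shares)
```

In this made-up example movies still grow in absolute terms, yet the TV show share doubles, which is the kind of shift a counts-only chart can hide.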


### Q7. Understanding what content is available in different countries.

In [0]:
# explode countries
heat = df.withColumn("country", explode(split(col("country"), ",")))
heat = heat.withColumn("country", trim(col("country")))

# explode genres (chain on heat so the exploded country rows are kept)
heat = heat.withColumn("genre", explode(split(col("listed_in"), ",")))
heat = heat.withColumn("genre", trim(col("genre")))

# top 6 countries

top_countries = (
    heat.groupBy("country")
    .count()
    .orderBy(col("count").desc())
    .limit(6)
)

heat = heat.join(top_countries.select("country"), on="country", how="inner")

cross = heat.groupBy("country").pivot("genre").count().fillna(0)

cross_pd = cross.toPandas()
cross_pd.set_index('country', inplace=True)

# calculate value/max for each row
normalized_values = cross_pd.div(cross_pd.max(axis=1), axis=0).fillna(0)

countries = cross_pd.index.tolist()
genres = cross_pd.columns.tolist()
values = normalized_values.values.tolist()

fig = go.Figure(
    data = go.Heatmap(
        z=values,
        x=genres,
        y=countries,
        colorscale="Viridis",
        zmin=0,
        zmax=1
    )
)

fig.update_layout(
    title = 'Genres by Country (Top 6)',
    xaxis_title= 'Genre',
    yaxis_title='Country'
)

fig.show()
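
The heatmap divides each country's genre counts by that country's largest count, so colors compare genres within a country rather than volumes across countries. A minimal plain-Python sketch of that row normalization follows; the counts and the `normalize_rows` helper are hypothetical.

```python
# Hypothetical genre counts per country; real values come from cross_pd
counts = {
    "India": {"Drama": 40, "Comedy": 20, "Action": 10},
    "Japan": {"Drama": 5, "Comedy": 10, "Action": 0},
}

def normalize_rows(table):
    """Divide every value by its row maximum; an all-zero row stays at 0."""
    result = {}
    for country, genres in table.items():
        peak = max(genres.values())
        result[country] = {g: (v / peak if peak else 0.0) for g, v in genres.items()}
    return result

normalized = normalize_rows(counts)
print(normalized)
```

Each country's top genre lands at 1.0 regardless of catalog size, so a small market and a large one can be read on the same color scale.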


## Insights

1. Content Type Distribution

- Movies make up nearly 65% of the catalog and TV shows about 35%.
- Movies still dominate, but TV shows hold about a one-third share and have been rising in recent years.

#### Recommendation
- Keep a healthy mix of both formats. Episodic content drives audience engagement, which boosts retention.

2. Trend of releases

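- The number of releases per year rises steeply.
- The most recent years continue to show growth.

#### Recommendation
- TV shows: time strategic releases for Q3/Q4. This rests on the count of TV shows added per month, since the dataset has no viewership data.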

3. Country-wise content production

- About 30% of titles come from Asia (India, Japan, South Korea), which shows strong global diversification.
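
The Asia figure can be cross-checked against the counts printed in the Top Countries table. The sketch below uses only those ten rows, so it measures the share of top-10 country credits (with `Unknown` included) rather than the share of the full catalog.

```python
# Counts copied from the Top Countries table (top 10 only, 'Unknown' included)
top10 = {
    "United States": 493, "India": 320, "United Kingdom": 189, "Canada": 137,
    "South Korea": 86, "Japan": 73, "Germany": 58, "France": 54,
    "Brazil": 31, "Unknown": 30,
}
asia = ["India", "South Korea", "Japan"]

asia_total = sum(top10[c] for c in asia)
top10_total = sum(top10.values())
asia_share = 100 * asia_total / top10_total
print(f"{asia_total} of {top10_total} = {asia_share:.1f}%")
```

On this basis the three Asian countries account for 479 of 1,471 credits, about 32.6%.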

#### Recommendation
- To strengthen regional appeal, invest in local-language originals in Asia and Europe. Co-productions can capture bilingual markets.

4. Genre Popularity

- Drama, Comedy, and Action dominate the catalog.

#### Recommendation

- Maintain strong investment in these genres.
- Add Documentary and Sci-Fi titles to attract niche audiences.

5. Top Actors and Directors
- The top names are diverse, which gives the data good variety.

#### Recommendation
- Use the top recurring actors and directors as inputs to recommendation engines.
- Feed genre-country-actor combinations into recommendation engines.

#### Summary

1. Increase regional co-productions
2. Focus on binge-friendly TV series
3. Maintain the Drama, Action, and Comedy pipeline
4. Focus marketing on Q3/Q4 releases
5. Build recommendation models on genre-country-actor trends

