In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

## **Step 1: Exploration and Initial Diagnosis**

### 📊 Load Dataset
We read the CSV files that contain information about movies and TV shows on streaming platforms.

In [None]:
# pd.read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed.
df1 = pd.read_csv("/content/drive/MyDrive/GENAI/Week3/Hackathon/MoviesOnStreamingPlatforms.csv")
df1.head()

In [None]:
df1["Type"] = "Movies"

In [None]:
df1.drop_duplicates(inplace=True)

In [None]:
# pd.read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed.
df2 = pd.read_csv("/content/drive/MyDrive/GENAI/Week3/Hackathon/tv_shows.csv")
df2.head()

In [None]:
df2["Type"] = "TV Shows"

In [None]:
df2.drop(columns=['IMDb'], inplace=True)

In [None]:
df2.drop_duplicates(inplace=True)

We added a column to each DataFrame to specify whether the content is a movie or a series, and then concatenated the two DataFrames.

In [None]:
df3 = pd.concat([df1, df2], axis=0)
df3.head()
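As a minimal sketch of the tag-and-concatenate step, the toy frames below are hypothetical stand-ins for `df1` and `df2`. Passing `ignore_index=True` is an assumption that goes beyond the original call: it rebuilds a unique row index instead of keeping each frame's own labels, which can otherwise repeat.

```python
import pandas as pd

# Hypothetical toy frames standing in for df1 (movies) and df2 (TV shows).
movies = pd.DataFrame({"Title": ["Inception", "Up"], "Year": [2010, 2009]})
shows = pd.DataFrame({"Title": ["Dark"], "Year": [2017]})

# Tag each frame with its content type before stacking them.
movies["Type"] = "Movies"
shows["Type"] = "TV Shows"

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0.
combined = pd.concat([movies, shows], axis=0, ignore_index=True)
print(combined)
```

With a unique index, later label-based operations such as `loc` or index-aligned `concat(axis=1)` refer to exactly one row per label.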

In [None]:
df3.shape

We have a total of 14,883 movies and TV shows.

Using the TMDB API, we retrieved a column missing from both datasets: the genre of each movie or series. We took the genres of the first search result for each title and added them to the concatenated DataFrame, as they will be useful for later analyses.

In [None]:
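import os
import requests
from tqdm import tqdm

tqdm.pandas()

# Read the TMDB key from the environment instead of hard-coding it in the notebook.
API_KEY = os.environ["TMDB_API_KEY"]
BASE_URL = "https://api.themoviedb.org/3"

# Map TMDB genre ids to genre names (movie genre list).
mapping = requests.get(
    f"{BASE_URL}/genre/movie/list",
    params={"api_key": API_KEY, "language": "en-US"},
).json()['genres']
id2name = {g['id']: g['name'] for g in mapping}

def fetch_tmdb_genres(title):
    """Return the genre names of the first TMDB search hit for `title`, or []."""
    # Passing the title through `params` URL-encodes spaces, '&', '#', etc.
    res = requests.get(
        f"{BASE_URL}/search/movie",
        params={"api_key": API_KEY, "query": title},
    ).json()
    results = res.get('results') or []
    if results:
        genre_ids = results[0].get('genre_ids', [])
        return [id2name[gid] for gid in genre_ids if gid in id2name]
    return []

df3['GenresList'] = df3['Title'].progress_apply(fetch_tmdb_genres)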
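Because the lookup runs once per row, repeated titles trigger repeated requests. The sketch below shows one way to memoize lookups with `functools.lru_cache`; `fake_lookup` is a hypothetical stand-in for a network call such as `fetch_tmdb_genres`, so the example runs offline.

```python
from functools import lru_cache

calls = []  # records which titles actually triggered a (simulated) lookup

def fake_lookup(title):
    # Hypothetical stand-in for a network call; returns a tuple of genre names.
    calls.append(title)
    return ("Drama",) if title == "Dark" else ()

@lru_cache(maxsize=None)
def cached_genres(title):
    # lru_cache requires hashable arguments; returning a tuple keeps the
    # cached value immutable so callers cannot alter it by accident.
    return fake_lookup(title)

titles = ["Dark", "Up", "Dark", "Dark"]
genres = [list(cached_genres(t)) for t in titles]
print(genres)
print(len(calls))  # number of lookups actually performed
```

The cache lives only for the current session; persisting results to disk, as the notebook does with a CSV file, covers reruns.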

In [None]:
df3 = pd.read_csv("/content/drive/MyDrive/GENAI/Week3/Hackathon/df3.csv")

In [None]:
df3.rename(columns={"GenresList" : "Genre"}, inplace=True)

In [None]:
df3.head()

In [None]:
df3.shape

In [None]:
df3.to_csv("/content/drive/MyDrive/GENAI/Week3/Hackathon/df3.csv")

In [None]:
df = pd.read_csv("/content/drive/MyDrive/GENAI/Week3/Hackathon/df3.csv")

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isna().mean() * 100

# **Step 2: Data Cleaning and Preparation**

The 'Genre' column was read back from the CSV file as strings (dtype object), so we parsed each value into a proper Python list.

In [None]:
import ast

df["Genre"] = df["Genre"].apply(lambda x: ast.literal_eval(x))
df.head()
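The sketch below illustrates why this parsing step is needed, using a hypothetical two-row frame and an in-memory buffer in place of a file on Drive. Lists written to CSV come back as their string representation, and `ast.literal_eval` turns them into lists again.

```python
import ast
import io
import pandas as pd

# Hypothetical toy data; the second title has no genre.
original = pd.DataFrame({
    "Title": ["Up", "Dark"],
    "Genre": [["Animation", "Family"], []],
})

# Writing to CSV stores each list as text, e.g. "['Animation', 'Family']".
buffer = io.StringIO()
original.to_csv(buffer, index=False)  # index=False avoids an 'Unnamed: 0' column
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, "Genre"]))  # a str, not a list

# ast.literal_eval parses Python literals only; it does not execute code.
reloaded["Genre"] = reloaded["Genre"].apply(ast.literal_eval)
print(reloaded.loc[0, "Genre"])
```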

In [None]:
df = df[df["Genre"].apply(lambda x : len(x) > 0)]

In [None]:
df.shape

We removed the rows where the genre was not specified, leaving 11,693 titles (movies and TV shows).

In [None]:
df.head()

In [None]:
df.drop(columns=['Age', "ID", "Unnamed: 0"], inplace=True)

In [None]:
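# .copy() makes df an independent frame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning.
df = df[df['Rotten Tomatoes'].notna()].copy()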

We removed columns that contained no useful information for our analysis, as well as the rows with a missing Rotten Tomatoes score.

In [None]:
df["grades"] = df["Rotten Tomatoes"].str.replace("/100", "").astype(float)

In [None]:
df["grades"] = df["grades"] / 100
df.drop(columns=['Rotten Tomatoes'], inplace=True)

In [None]:
df.head()

We converted the Rotten Tomatoes scores from strings such as "85/100" to floats in the 0 to 1 range, stored in the `grades` column.
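A minimal sketch of this conversion on hypothetical values follows. Using `regex=False` is an assumption added here so that "/100" is treated as a literal string rather than a pattern.

```python
import pandas as pd

# Hypothetical raw scores in the same "NN/100" format as the dataset.
scores = pd.Series(["85/100", "40/100", "100/100"])

# Strip the "/100" suffix, convert to float, then rescale to the 0-1 range.
grades = scores.str.replace("/100", "", regex=False).astype(float) / 100
print(grades.tolist())
```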

In [None]:
df_normalized = df.copy()

We applied label encoding to the textual columns and multi-hot encoding to the genres (one binary column per genre) in order to compute a correlation matrix and identify dependencies between features.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
genre_dummies = pd.DataFrame(mlb.fit_transform(df_normalized['Genre']), columns=mlb.classes_, index=df_normalized.index)
df_normalized = pd.concat([df_normalized, genre_dummies], axis=1)
df_normalized.drop(columns=['Genre'], inplace=True)
df_normalized

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_normalized["Type"] = le.fit_transform(df_normalized["Type"])
df_normalized["Title"] = le.fit_transform(df_normalized["Title"])
df_normalized.head()
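To make the genre encoding concrete, here is a small sketch on hypothetical titles. `MultiLabelBinarizer` creates one 0/1 column per genre, so a title with several genres gets several ones in its row.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data with list-valued genres.
toy = pd.DataFrame({
    "Title": ["Up", "Dark", "Heat"],
    "Genre": [["Animation", "Family"], ["Drama"], ["Drama", "Crime"]],
})

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(toy["Genre"]),
    columns=mlb.classes_,  # genre names, sorted alphabetically
    index=toy.index,       # keep row alignment with the source frame
)
print(encoded)
```

Reusing the source index matters when the encoded frame is concatenated back with `axis=1`, because pandas aligns rows by index label.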

In [None]:
plt.figure(figsize=(15, 10))
corr = df_normalized.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

The correlation matrix shows weak linear relationships overall. Some related genres like Adventure, Fantasy, and Animation are moderately correlated, and family movies tend to have higher ratings. Platform availability is mostly exclusive, with a slight negative correlation between Netflix and Prime Video. These insights suggest limited linear dependencies.
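A large annotated heatmap can be hard to read, so one complementary option is to list feature pairs ranked by absolute correlation. The sketch below uses hypothetical binary columns; the names only mimic those in the notebook.

```python
from itertools import combinations
import pandas as pd

# Hypothetical 0/1 indicator columns.
toy = pd.DataFrame({
    "Adventure": [1, 1, 0, 0, 1, 0],
    "Fantasy":   [1, 1, 0, 0, 0, 0],
    "Netflix":   [1, 0, 1, 0, 1, 0],
})

corr = toy.corr()

# Keep each unordered pair once (no diagonal, no mirrored duplicates).
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
)

# Rank by strength of the linear relationship, regardless of sign.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```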

# **Step 3: Analysis**

## **What are the most common genres for top-rated shows and movies across platforms?**

In [None]:
df.head()

In [None]:
df.to_csv("df.csv")

In [None]:
df = pd.read_csv("df.csv")

In [None]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
import ast

df['Genre'] = df['Genre'].apply(lambda x: ast.literal_eval(x))

all_genres = df.explode('Genre')

In [None]:
all_genres.head()

In [None]:
all_genres["Genre"].nunique()

In [None]:
import ipywidgets as widgets

rating_slider = widgets.FloatSlider(
    value = 0.8,
    min = 0.0,
    max = 1.0,
    step = 0.05,
    description = 'Minimum Rating:',
    disabled = False,
    continuous_update = False,
    orientation = 'horizontal',
    readout = True,
)

number_genres = widgets.IntSlider(
    value = 10,
    min = 1,
    max = 19,
    step = 1,
    description = 'Number of Genres:',
    disabled = False,
    continuous_update = False,
)

def plot_genre_distribution(min_rating, num_genres):
  # Keep only titles at or above the selected rating, then count titles per genre.
  df_filtered = all_genres[all_genres['grades'] >= min_rating]
  grouped_df = df_filtered['Genre'].value_counts().head(num_genres)
  plt.figure(figsize=(14, 7))
  sns.barplot(x=grouped_df.index, y=grouped_df.values)
  # The x-axis holds genres; the y-axis holds the number of titles.
  plt.xlabel("Genre")
  plt.ylabel("Number of titles")
  plt.xticks(rotation=45, ha="right")
  plt.title(f"Top {num_genres} most frequent genres (rating ≥ {min_rating})")
  plt.show()

widgets.interactive(plot_genre_distribution, min_rating=rating_slider, num_genres=number_genres)

The bar chart and data reveal that among top-rated content (rating ≥ 0.8), the most common genres across platforms are:

- Drama (208 titles)
- Comedy (137 titles)
- Adventure (109 titles)
- Action (106 titles)

These genres dominate the top-rated segment by number of titles, with Drama clearly leading. A high count does not by itself show that a genre is rated better on average, since these genres may simply be more common in the catalog overall.
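Raw counts favor genres that are frequent overall. A complementary view, sketched below on hypothetical data, is the share of each genre's titles that reach the rating threshold. The 0.8 threshold matches the slider default; everything else is illustrative.

```python
import pandas as pd

# Hypothetical toy data with list-valued genres and 0-1 grades.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Genre": [["Drama", "Comedy"], ["Drama"], ["Action"], ["Drama", "Action"]],
    "grades": [0.9, 0.85, 0.5, 0.6],
})

# One row per (title, genre) pair.
exploded = toy.explode("Genre")

# Titles per genre among top-rated content, and overall.
top_counts = exploded.loc[exploded["grades"] >= 0.8, "Genre"].value_counts()
total_counts = exploded["Genre"].value_counts()

# Share of each genre's titles that are top-rated; genres with no
# top-rated title get 0 instead of NaN.
share_top = (top_counts / total_counts).fillna(0).sort_values(ascending=False)
print(share_top)
```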

## **How do the release years of shows and movies relate to their average ratings?**

In [None]:
df.head()

In [None]:
df_grouped = df.groupby('Year')['grades'].mean()
df_grouped.head()

In [None]:
plt.figure(figsize=(14, 7))
plt.plot(df_grouped.index, df_grouped.values, marker = 'o')
plt.xlabel("Year")
plt.ylabel("Average rating")
plt.title("Average rating by release year")
plt.grid(True)
plt.show()

In [None]:
df[(df["Year"]>=1920) & (df["Year"]<=1925)]

In [None]:
df[(df["Year"]<2010) & (df["Type"] == "Movies")]["grades"].mean()

In [None]:
df[(df["Year"]<2010) & (df["Type"] == "TV Shows")]["grades"].mean()

In [None]:
df[(df["Year"]>=2010) & (df["Type"] == "Movies")]["grades"].mean()

In [None]:
df[(df["Year"]>=2010) & (df["Type"] == "TV Shows")]["grades"].mean()

In [None]:
df_decade = df.copy()

In [None]:
df_decade["decade"] = df_decade["Year"].apply(lambda x: (x // 10) * 10)

In [None]:
from scipy.stats import f_oneway

groups = [group["grades"] for _, group in df_decade.groupby("decade")]
f_statistic, p_value = f_oneway(*groups)

if p_value < 0.05:
    print("There is a significant difference in average ratings between decades.")
else:
    print("There is no significant difference in average ratings between decades.")

The years between 1950 and 2000 show a stable trend in average ratings, often around 0.55 to 0.60.

The early years of cinema (before 1940) display more variable averages, likely due to a limited number of titles.

Since the 2010s, there has been a slight decline in average ratings, driven by lower scores for both films and series. This may reflect stricter critical standards or a drop in content quality.
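The point about early years resting on few titles can be checked by reporting the number of titles next to each decade's mean. The sketch below uses hypothetical values. Dropping decades with a single title before the ANOVA is an assumption made here, not a requirement of `f_oneway`.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: release year and 0-1 grade.
toy = pd.DataFrame({
    "Year":   [1925, 1955, 1958, 1994, 1999, 2012, 2015, 2018],
    "grades": [0.90, 0.60, 0.58, 0.57, 0.61, 0.50, 0.48, 0.52],
})

# Vectorized decade computation: 1958 -> 1950.
toy["decade"] = (toy["Year"] // 10) * 10

# Mean and number of titles per decade; a mean over one title is fragile.
summary = toy.groupby("decade")["grades"].agg(["mean", "count"])
print(summary)

# A single-title decade says nothing about within-decade spread,
# so this sketch leaves such decades out of the test.
groups = [g["grades"].to_numpy() for _, g in toy.groupby("decade") if len(g) >= 2]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```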

## **How does the availability of movies and shows differ across platforms like Netflix, Hulu, and Disney+?**

In [None]:
# Each platform column is a 0/1 flag, so its sum is the number of available titles.
value = {
  "Netflix" : df["Netflix"].sum(),
  "Hulu" : df["Hulu"].sum(),
  "Disney+" : df["Disney+"].sum(),
  "Prime Video" : df["Prime Video"].sum()
}

In [None]:
value = pd.Series(value)

plt.figure(figsize=(8, 6))
value.sort_values().plot(kind='barh')
plt.title("Number of Titles Available by Platform")
plt.xlabel("Number of Titles")
plt.tight_layout()
plt.show()

Netflix clearly leads in terms of the number of available titles.
It is followed by Prime Video and then Hulu.
Disney+ comes last, likely due to its more focused and franchise-driven catalog.
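The totals above combine movies and TV shows. As a sketch on hypothetical data, the same 0/1 platform flags can be split by `Type` and used to count titles available on exactly one platform.

```python
import pandas as pd

# Hypothetical toy data using the same 0/1 platform flags as the dataset.
toy = pd.DataFrame({
    "Type":        ["Movies", "Movies", "TV Shows", "TV Shows"],
    "Netflix":     [1, 0, 1, 0],
    "Hulu":        [0, 1, 1, 0],
    "Prime Video": [1, 1, 0, 0],
    "Disney+":     [0, 0, 0, 1],
})
platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# Number of titles per platform, separately for movies and TV shows.
by_type = toy.groupby("Type")[platforms].sum()
print(by_type)

# A title is exclusive when exactly one platform flag is set.
exclusive = int((toy[platforms].sum(axis=1) == 1).sum())
print(exclusive)
```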