# Data Visualization
# 1. Introduction
## Dataset Description
This Jupyter Notebook explores a dataset centered around movies, organized into three main components:
1. **The Movie Dataset**: This dataset provides detailed information about individual films. It is fragmented into multiple dataframes that are linked through a unique movie identifier key.
2. **The Oscar Awards Dataset**: This dataset contains comprehensive records of every nomination and winner since the first Oscars ceremony.
3. **The Rotten Tomatoes Review Dataset**: This dataset focuses on the reception of movies by critics, with data sourced from the review aggregator Rotten Tomatoes.

The Movie, Oscar, and Review datasets are not interconnected: they originate from entirely separate sources and therefore lack a shared unique identifier for movies.<br>
Consequently, analyzing and visualizing the data presents additional challenges, because there is little information with which to relate a movie's attributes to its awards and critical reception.

## Methodology
The analysis follows a structured methodology, which includes the following steps:
1. **Prediction**: For the *In-Depth Visualization*, the analysts develop hypotheses and predictions to encourage critical thinking and expose common misconceptions.
2. **Analysis**: Conducting *Simple* and *In-Depth* exploration of the datasets to identify patterns, trends, and relationships.
3. **Visualization**: Creating meaningful and creative visual representations of the data to enhance understanding and interpretation.
4. **Conclusion**: Summarizing the findings and insights from the analysis and visualizations, and comparing them with the initial hypotheses.

## Visualization Technologies
A variety of Python libraries are employed to create static, dynamic, interactive, and geographic visualizations. The following libraries are used:
- **Plotly**: For creating interactive and dynamic plots.
- **Geopandas**: For handling and visualizing geographic data.
- **Seaborn**: For generating aesthetically pleasing statistical graphics.
- **Matplotlib**: For static visualizations.
- **Folium**: For creating interactive maps and geographic visualizations.

These tools enable a diverse range of visualization techniques, enhancing the ability to explore and interpret the data effectively.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium as fm
import plotly.express as px

# Used for the genre word cloud
from wordcloud import WordCloud

# 2. Exploratory Data Analysis (EDA)

## Simple Visualizations

### Correlation between runtime and rating

In [None]:
# Load the movies dataset and preview it
movies_df = pd.read_csv('clean_datasets/movies.csv')
movies_df

In [None]:
# Visualize potential outliers using a boxplot for both variables
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sns.boxplot(y="runtime_in_minutes", data=movies_df, ax=axes[0])
axes[0].set_title("Boxplot of Movie Duration", fontsize=14)
axes[0].set_ylabel("Duration (minutes)")

sns.boxplot(y="rating", data=movies_df, ax=axes[1])
axes[1].set_title("Boxplot of Average Rating", fontsize=14)
axes[1].set_ylabel("Average Rating")

plt.tight_layout()
plt.show()

In [None]:
# Exclude runtime outliers and TV series by keeping runtimes in (0, 200] minutes
filtered_df = movies_df[
    (movies_df["runtime_in_minutes"] > 0) &
    (movies_df["runtime_in_minutes"] <= 200)
]

plt.figure(figsize=(10, 6))
sns.regplot(x="runtime_in_minutes", y="rating", data=filtered_df, scatter_kws={"s": 50, "alpha": 0.7},
            line_kws={"color": "red"})
plt.title("Correlation Between Movie Duration and Average Rating", fontsize=16)
plt.xlabel("Duration (minutes)", fontsize=12)
plt.ylabel("Average Rating", fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
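
The regression line gives a visual impression of the relationship, and a correlation coefficient can put a number on it. The following is a minimal sketch on invented toy data (the values are not taken from `movies.csv`); it applies the same runtime filter and computes the Pearson coefficient, plus a Spearman coefficient obtained as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset
toy_movies = pd.DataFrame({
    "runtime_in_minutes": [0, 85, 95, 110, 125, 140, 600],
    "rating": [3.0, 2.9, 3.1, 3.4, 3.6, 3.9, 2.0],
})

# Same runtime filter as in the notebook: keep runtimes in (0, 200]
toy_filtered = toy_movies[
    (toy_movies["runtime_in_minutes"] > 0)
    & (toy_movies["runtime_in_minutes"] <= 200)
]

runtime = toy_filtered["runtime_in_minutes"]
rating = toy_filtered["rating"]

# Pearson measures linear association
pearson_r = runtime.corr(rating)

# Spearman measures monotonic association; it equals Pearson on the ranks
spearman_r = runtime.rank().corr(rating.rank())

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```

On the real data, the same two calls on `filtered_df` would show whether the slope of the red line reflects a strong or a weak association.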

### Geographic and Temporal distribution of Movie Production

Geographical and temporal distribution refers to the analysis of how data is spread across different locations (geographical) and over time (temporal). In this case, it shows how movie productions vary by country and year.
Understanding these patterns can help identify trends, highlight regional differences, and provide information about the evolution of the global film industry over the years.

In [None]:
# Load the datasets
movies_df = pd.read_csv('clean_datasets/movies.csv')
countries_df = pd.read_csv('clean_datasets/countries.csv')

# Merge datasets
df = pd.merge(movies_df, countries_df, left_on='id', right_on='movie_id')

# Clean the merged dataset
df = df.dropna(subset=["release_year"])
df = df.drop(columns=["movie_id", "Unnamed: 0"])
df = df.set_index("id")
df = df[df['release_year'] <= 2023]

# Group by release year and country, and count the number of movies
movie_counts = df.groupby(['release_year', 'country']).size().reset_index(name='movie_count')

After merging the datasets on *id* (from **movies_df**) and *movie_id* (from **countries_df**), the merged dataset is cleaned by removing rows with missing release years and unnecessary columns. <br>
The release_year filter ensures only movies released up to 2023 are included, as the dataset is not updated beyond that year. <br>
Finally, the data is grouped by release_year and country to count the number of movies produced each year in each country.
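
The merge, clean, and group steps described above can be sketched on a small example. The two toy frames below are invented stand-ins for `movies.csv` and `countries.csv`; only the column names `id`, `movie_id`, `release_year`, and `country` come from the notebook.

```python
import pandas as pd

# Toy stand-ins for movies.csv and countries.csv (values are invented)
toy_movies = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "B", "C", "D"],
    "release_year": [2000.0, 2000.0, None, 2030.0],
})
toy_countries = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 4],
    "country": ["Italy", "France", "Italy", "Japan", "Japan"],
})

# Inner join: one output row per (movie, country) pair
toy_df = pd.merge(toy_movies, toy_countries, left_on="id", right_on="movie_id")

# Drop rows without a release year, then rows beyond the cutoff year
toy_df = toy_df.dropna(subset=["release_year"])
toy_df = toy_df[toy_df["release_year"] <= 2023]

# Count rows per (release_year, country) group
toy_counts = (
    toy_df.groupby(["release_year", "country"])
    .size()
    .reset_index(name="movie_count")
)
print(toy_counts)
```

Note that a co-production (movie 1 in the toy data) contributes one row to each of its countries, so it is counted once per country.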

In [None]:
# Build the figure first, then show it (show() returns None, so it must not be chained into the assignment)
fig = px.choropleth(movie_counts,
                    locations='country',  # Country names in the 'country' column
                    color='movie_count',  # Number of movies per country
                    hover_name='country',
                    color_continuous_scale="Viridis",
                    title="Number of Movies Released by Country Over Time",
                    labels={'movie_count': 'Number of Movies'},
                    locationmode="country names",  # Ensure matching by country names
                    animation_frame="release_year",  # Animate by release year
                    animation_group="country",  # Group by country during animation
                    template="plotly_dark")
fig.show()

### Studio Influence on Genre

This plot is useful for understanding the dominance of specific studios within each genre. By visualizing which studio produces the most movies in each genre, you can identify trends in studio specialization. <br>
This type of analysis helps to understand how studios influence the landscape of different film genres, offering information about industry trends and competition.

In [None]:
# Load the datasets
studios_df = pd.read_csv('clean_datasets/studios.csv')
genres_df = pd.read_csv('clean_datasets/genres.csv')

# Merge the datasets on the shared movie identifier
df = pd.merge(studios_df, genres_df, on='movie_id')

# Clean the merged dataset
df = df.drop(columns=["Unnamed: 0_x", "Unnamed: 0_y"])

# Count the movies for each (studio, genre) pair
studio_genre_counts = df.groupby(['studio', 'genre']).size().reset_index(name='movie_count')

# For each genre, keep the row of the studio with the highest movie count
dominant_studios = studio_genre_counts.loc[studio_genre_counts.groupby('genre')['movie_count'].idxmax()]

The dominant studio for each genre is identified by selecting the studio with the highest movie count within each genre.
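
The `groupby(...).idxmax()` selection can be illustrated on a small example. The studio names and counts below are invented; the sketch only shows how the row lookup behaves, including what happens on a tie.

```python
import pandas as pd

# Invented counts for illustration; "Comedy" contains a tie on purpose
toy_counts = pd.DataFrame({
    "studio": ["Studio A", "Studio B", "Studio A", "Studio C"],
    "genre": ["Drama", "Drama", "Comedy", "Comedy"],
    "movie_count": [10, 7, 4, 4],
})

# idxmax returns, per genre, the index label of the row with the largest count
top_rows = toy_counts.groupby("genre")["movie_count"].idxmax()

# .loc then pulls those rows back out of the original frame
toy_dominant = toy_counts.loc[top_rows]
print(toy_dominant)
```

When two studios are tied within a genre, `idxmax` keeps only the first matching row, so the treemap would show a single studio for that genre even though another has the same count.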

In [None]:
# Treemap
# Build the figure first, then show it (show() returns None)
fig = px.treemap(dominant_studios,
                 path=['genre', 'studio'],
                 values='movie_count',
                 color='movie_count',
                 color_continuous_scale='Viridis',
                 title='Dominant Studio for Each Genre')
fig.show()

### Top 5 Studios for each Genre

This data visualization helps identify which studios dominate specific genres, providing insights into the distribution of movie production across genres.

In [None]:
# TODO: make the plot more complete by adding the movie counts of the other studios

In [None]:
# Get the top 5 studios for each genre
top_studios_per_genre = (
    studio_genre_counts
    .sort_values(
        by=['genre', 'movie_count'],
        ascending=[True, False]
    )
    .groupby('genre')
    .head(5)
)
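
One possible way to approach the TODO above is to collapse every studio outside the top N into a single "Other" row per genre. This is only a sketch on invented data, not the notebook's method; `TOP_N`, the `"Other"` label, and the toy studio names are assumptions made for the example.

```python
import pandas as pd

TOP_N = 2  # the notebook uses 5; 2 keeps the toy output short

# Invented (studio, genre) counts for illustration
toy_counts = pd.DataFrame({
    "studio": ["S1", "S2", "S3", "S4", "S1", "S2", "S3"],
    "genre": ["Drama"] * 4 + ["Comedy"] * 3,
    "movie_count": [10, 7, 3, 2, 5, 4, 1],
})

ranked = toy_counts.sort_values(["genre", "movie_count"], ascending=[True, False])

# Top-N rows per genre, same sort-then-head pattern as in the notebook
top_n = ranked.groupby("genre").head(TOP_N)

# Everything not in the top N: rows whose index is absent from top_n
rest = ranked.drop(index=top_n.index)

# Collapse the remaining studios into a single "Other" row per genre
other = (
    rest.groupby("genre", as_index=False)["movie_count"].sum()
    .assign(studio="Other")
)

top_with_other = pd.concat([top_n, other], ignore_index=True)
print(top_with_other)
```

With this shape, each genre's rows sum to the genre's total movie count, so a stacked bar per genre would show the top studios against the rest of the market.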

In [None]:
# Bar Plot
fig = px.bar(
    top_studios_per_genre,
    x='studio',
    y='movie_count',
    color='genre',
    title="Top 5 Studios per Genre",
    labels={
        "studio": "Studio",
        "movie_count": "Number of Movies",
        "genre": "Genre"
    },
    category_orders={
        "genre": top_studios_per_genre['genre'].unique()
    },
    color_discrete_sequence=px.colors.qualitative.Set1,
)

fig.update_layout(
    xaxis_title="Studio",
    yaxis_title="Number of Movies",
    xaxis_tickangle=-45,
    barmode='stack',
    height=600,
)

fig.show()

### Most Frequent Genres

Knowing the most frequent genres in a movie database offers valuable insights into audience preferences and trends, helping studios, distributors, and streaming platforms better understand what types of films are currently in demand.

In [None]:
# Load dataset
genres_df = pd.read_csv('clean_datasets/genres.csv')

# Count occurrences of each genre (rename_axis keeps the 'genre' column name consistent across pandas versions)
genre_movies = genres_df['genre'].value_counts().rename_axis('genre').reset_index(name='movie_count')

In [None]:
# Convert the DataFrame to a dictionary
genre_dict = genre_movies.set_index('genre')['movie_count'].to_dict()

# Generate the word cloud using the frequencies from the dictionary
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(genre_dict)

# Plot the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='kaiser')
plt.axis('off')
plt.title('Word Cloud of Genres')
plt.show()

### Most Popular Release Type per Country (To be completed)

In [None]:
# Load dataset
release_df = pd.read_csv('clean_datasets/releases.csv')

country_release_counts = release_df.groupby(['country', 'distribution_format']).size().reset_index(name='movies')
most_popular_release = country_release_counts.loc[country_release_counts.groupby('country')['movies'].idxmax()]

# TODO: Too many rows to get a readable plot

### Distribution of Countries by Release Format

Understanding the distribution of release formats by country helps identify global trends in movie distribution, highlighting which formats are most commonly adopted.

In [None]:
countries_per_format = release_df.groupby('distribution_format')['country'].nunique().reset_index(name='country counts')

plt.figure(figsize=(8, 8))
plt.pie(countries_per_format['country counts'],
        labels=countries_per_format['distribution_format'],
        autopct='%1.1f%%', startangle=90,
        colors=plt.cm.Paired.colors)
plt.title('Proportion of Countries for Each Distribution Format')
plt.axis('equal')
plt.show()
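
When reading this pie chart, it may help to see what `groupby(...).nunique()` actually counts. The sketch below uses invented release rows (the format names `Theatrical` and `Digital` are placeholders, not necessarily values in `releases.csv`) and shows that a country releasing in several formats is counted once under each of them.

```python
import pandas as pd

# Invented release rows; Japan/Theatrical is duplicated on purpose
toy_releases = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "Japan", "Japan", "Japan"],
    "distribution_format": [
        "Theatrical", "Digital", "Theatrical",
        "Theatrical", "Digital", "Theatrical",
    ],
})

# Distinct countries seen under each format (duplicate pairs count once)
per_format = (
    toy_releases.groupby("distribution_format")["country"]
    .nunique()
    .reset_index(name="country_count")
)

# Share of the summed counts, which is what the pie chart displays
per_format["share"] = per_format["country_count"] / per_format["country_count"].sum()

total_distinct = toy_releases["country"].nunique()
print(per_format)
print("Distinct countries overall:", total_distinct)
```

Because the per-format counts overlap, their sum can exceed the number of distinct countries, so each slice is a share of country-format pairs rather than a share of countries.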