# Data Visualization
# 1. Introduction
## Dataset Description
This Jupyter Notebook explores a dataset centered around movies, organized into three main components:
1. **The Movie Dataset**: This dataset provides detailed information about individual films. It is fragmented into multiple dataframes that are linked through a unique movie identifier key.
2. **The Oscar Awards Dataset**: This dataset contains comprehensive records of every nomination and winner since the first ceremony of the Oscars.
3. **The Rotten Tomatoes Review Dataset**: This dataset focuses on the reception of movies by critics, with data sourced from the review aggregator Rotten Tomatoes.

The Movie, Oscar, and Review datasets cannot be joined directly because they lack a shared unique identifier for movies. This happens because these datasets originate from entirely separate sources.<br>
Consequently, analyzing and visualizing the data presents additional challenges, as there is little information with which to link a movie's details to its awards and critical reception.

## Methodology
The analysis follows a structured methodology, which includes the following steps:
1. **Introduction**: A brief introduction to the objective of the analysis.
2. **Prediction**: For the *In-Depth Visualization*, the analysts develop hypotheses and predictions to encourage critical thinking and expose common misconceptions.
3. **Analysis and Visualization**: Conducting *Simple* and *In-Depth* exploration of the datasets to identify patterns, trends, and relationships. Creating meaningful and creative visual representations of the data to enhance understanding and interpretation.
4. **Conclusion**: Summarizing findings and deriving insights from the analysis and visualizations, and comparing them with the earlier hypotheses.

## Visualization Technologies
A variety of Python libraries are employed to create static, dynamic, interactive, and geographic visualizations. The following libraries are used:
- **Plotly**: For creating interactive and dynamic plots.
- **Geopandas**: For handling and visualizing geographic data.
- **Seaborn**: For generating aesthetically pleasing statistical graphics.
- **Matplotlib**: For static visualizations.
- **Folium**: For creating interactive maps and geographic visualizations.

These tools enable a diverse range of visualization techniques, enhancing the ability to explore and interpret the data effectively.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium as fm
import plotly.express as px
import plotly.graph_objects as go

# Word cloud generator (used for the genre frequency plot)
from wordcloud import WordCloud

# 2. Exploratory Data Analysis (EDA)

### *Mapping the Silver Screen*: A Geographical and Temporal Analysis

***Why is it useful to know?***
<br> Analyzing the geographical and temporal distribution of movie production shows how the data is spread across locations and over time.
These patterns improve our understanding of trends, regional differences, and the evolution of the global film industry.

***Prediction***
<br> Movie production will most likely increase gradually over the years, with major global producers such as the United States and India likely playing leading roles in this trend.

In [None]:
movies_df = pd.read_csv('clean_datasets/movies.csv')
countries_df = pd.read_csv('clean_datasets/countries.csv')

df = pd.merge(movies_df, countries_df, left_on='id', right_on='movie_id')

df = df.dropna(subset=["release_year"])
df = df.set_index("id")
df = df[df['release_year'] <= 2023]

In [None]:
movie_counts = df.groupby(['release_year', 'country']).size().reset_index(name='movie_count')

***Analysis of the Procedure***

1. *Loading the Datasets*:
   The datasets containing information on movies and countries were loaded using Pandas.

2. *Merging Datasets*:
   The `movies_df` and `countries_df` were merged on the `id` and `movie_id` columns to create a unified dataset.

3. *Cleaning the Data*:
   - Rows with missing `release_year` were removed.
   - The dataset was filtered to include only movies released up to 2023, as the dataset is not updated beyond that year.
<br> <br>
4. *Grouping and Counting*:
   The cleaned dataset was grouped by `release_year` and `country` to count the number of movies produced each year in each country.
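
The steps above can be reproduced on a toy example. The sketch below uses two made-up frames whose column names (`id`, `release_year`, `movie_id`, `country`) mirror those used in this notebook; the rows themselves are invented purely for illustration.

```python
import pandas as pd

# Toy stand-ins for movies.csv and countries.csv (invented rows).
movies = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "release_year": [1999, 1999, None, 2031],
})
countries = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 4],
    "country": ["USA", "India", "USA", "Italy", "USA"],
})

# An inner merge yields one row per (movie, country) pair, so a
# co-production is counted once for each participating country.
merged = pd.merge(movies, countries, left_on="id", right_on="movie_id")

# Drop rows without a year and keep releases up to 2023.
merged = merged.dropna(subset=["release_year"])
merged = merged[merged["release_year"] <= 2023]

# Count movies per year and country.
toy_counts = (
    merged.groupby(["release_year", "country"])
    .size()
    .reset_index(name="movie_count")
)
print(toy_counts)
```

Because the merge duplicates a movie for every country it is linked to, the per-country counts may add up to more than the number of distinct movies.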

In [None]:
fig = px.choropleth(movie_counts,
                    locations='country', # Country names in the 'country' column
                    color='movie_count', # Number of movies per country
                    hover_name='country',
                    color_continuous_scale="Viridis",
                    title="Number of Movies Produced by Country Over Time",
                    labels={'movie_count': 'Number of Movies'},
                    locationmode="country names", # Ensure matching by country names
                    animation_frame="release_year", # Animate by release year
                    animation_group="country", # Group by country during animation
                    template="plotly_dark")
# Keep the figure in `fig`: chaining .show() would assign its return value (None) instead
fig.update_layout(
    width=1000,
    height=600,
    title={'x': 0.5, 'xanchor': 'center', 'y': 0.95}
)
fig.show()

***Conclusions***

As expected, movie production increased over the years, with the United States and India playing fundamental roles in this evolution of the film industry. The small number of movies visible in the first frames of the choropleth animation likely reflects gaps in the dataset's coverage of the early years.

### *The Studio Power Play*: Studio Dominance Across Film Genres

***Why is it useful to know?***
<br> Understanding the dominance of specific studios within each genre helps reveal trends in studio specialization. By identifying which studios are most active in particular genres, this analysis provides valuable insights into how different studios influence the film industry.

***Prediction***
<br> Given the presence of major studios like BBC and Warner Bros. Pictures in the database, it's likely that certain studios will dominate specific genres. It's also highly probable that the same studio will appear in multiple genres.

In [None]:
studios_df = pd.read_csv('clean_datasets/studios.csv')
genres_df = pd.read_csv('clean_datasets/genres.csv')

df = pd.merge(studios_df, genres_df, on='movie_id')

In [None]:
studio_genre_counts = df.groupby(['studio', 'genre']).size().reset_index(name='movie_count')

top_studios_per_genre = (
    studio_genre_counts
    .sort_values(
        by=['genre', 'movie_count'],
        ascending=[True, False])
    .groupby('genre')
    .head(10)
    .copy()  # independent frame, so adding the 'rank' column below is safe
)

top_studios_per_genre['rank'] = top_studios_per_genre.groupby('genre')['movie_count'].rank(method='first', ascending=False)

***Analysis of the Procedure***
1. *Loading the Datasets*:
   The datasets containing information about studios and genres were loaded using Pandas.

2. *Merging the Datasets*:
   The `studios_df` and `genres_df` datasets were merged on the `movie_id` column to associate each movie with its respective studio and genre.

3. *Counting Movies by Studio and Genre*:
   The dataset was grouped by both `studio` and `genre` to count the number of movies each studio produced within each genre.

4. *Selecting Top Studios*:
   The data was sorted by `genre` and `movie_count`, and the 10 studios producing the most movies in each genre were selected. Limiting the selection to 10 keeps the plot readable.

5. *Ranking the Studios*:
   A rank was assigned to the studios within each genre based on the number of movies they produced.
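
The "sort, then take the head of each group" pattern described above can be checked on a small example. This is a minimal sketch with invented studio names (`A`, `B`, `C`) and `top_n = 2` instead of 10; the column names follow the notebook.

```python
import pandas as pd

# Invented (studio, genre) pairs, one row per movie.
pairs = pd.DataFrame({
    "studio": ["A", "A", "A", "B", "B", "C", "A", "C", "C"],
    "genre":  ["Drama", "Drama", "Drama", "Drama", "Drama", "Drama",
               "Comedy", "Comedy", "Comedy"],
})

counts = pairs.groupby(["studio", "genre"]).size().reset_index(name="movie_count")

# Sort so that, inside each genre, the largest counts come first;
# head(n) then keeps the first n rows of every genre group.
top_n = 2
top = (
    counts.sort_values(["genre", "movie_count"], ascending=[True, False])
    .groupby("genre")
    .head(top_n)
    .copy()
)

# method="first" breaks ties by order of appearance, giving ranks 1..n.
top["rank"] = (
    top.groupby("genre")["movie_count"]
    .rank(method="first", ascending=False)
    .astype(int)
)
print(top)
```

Studio `C` has a Drama movie but is dropped from that genre because it falls outside the top two, while it still leads Comedy. This is the same effect that lets one studio appear under several genres in the treemap.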

In [None]:
fig = px.treemap(
    top_studios_per_genre,
    path=['genre', 'studio'],
    values='movie_count',
    color='movie_count',
    color_continuous_scale='Viridis',
    title='Top 10 Dominant Studios for Each Genre'
)
fig.update_traces(textinfo='label+value', marker=dict(cornerradius=3))
fig.update_layout(
    coloraxis_colorbar=dict(title="Number of Movies"),
    height=600,
    title={'x': 0.5, 'xanchor': 'center', 'y': 0.95})
fig.show()

***Conclusions***
<br> In line with our expectations, studios such as BBC in Drama and Warner Bros. Pictures in Comedy have shown a strong presence. In addition, studios such as Paramount, Toei Company and BBC itself are found in different genres.

### *Genre Giants*: Analysis of the Most Frequent Genres

***Why is it useful to know?***
<br> Identifying the most popular movie genres in the dataset reveals which types of films are most commonly produced, providing insights into historical production trends and audience preferences.

***Prediction***
<br> Making predictions based on the data can be challenging, as during the data cleaning phase, we discovered that the dataset also includes TV series, documentaries, and short films. This diversity of content types may introduce inconsistencies, making it difficult to draw clear conclusions or forecast future trends.

In [None]:
genres_df = pd.read_csv('clean_datasets/genres.csv')

In [None]:
genre_movies = genres_df['genre'].value_counts().reset_index(name='movie_count')

genre_dict = genre_movies.set_index('genre')['movie_count'].to_dict()

***Analysis of the Procedure***

1. *Loading the Dataset*:
   The dataset containing information about genres was loaded using Pandas.

2. *Counting genre occurrences*:
   The occurrences of each genre were counted using the `value_counts()` function.

3. *Creating a Dictionary*:
   The genre counts were converted into a dictionary mapping each genre to its frequency, the input format expected by `generate_from_frequencies`.
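
The genre-to-frequency mapping can also be built without an intermediate DataFrame. The sketch below (toy data, hypothetical variable names) calls `to_dict()` directly on the result of `value_counts()`. This may be more robust across pandas versions, since the column names produced by `value_counts().reset_index(...)` depend on the installed pandas version.

```python
import pandas as pd

# Invented genre rows, one per (movie, genre) association.
genres = pd.DataFrame({"genre": ["Drama", "Comedy", "Drama", "Documentary", "Drama"]})

# value_counts() returns a Series indexed by genre, so to_dict()
# directly yields the {genre: frequency} mapping.
genre_freq = genres["genre"].value_counts().to_dict()
print(genre_freq)
```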

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(genre_dict)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='kaiser')
plt.axis('off')
plt.title('Word Cloud of Genres')
plt.show()

***Conclusion***
<br> The result is only partly surprising. 'Documentary' is among the largest genres, which fits the mix of content types found during data cleaning, while the fact that 'drama' is the largest gives some assurance about the reliability of our dataset.

### *Distribution Across Borders and Formats*

***Why is it useful to know?***
<br> Analyzing the distribution of movies by country and format provides insights into global movie release trends. It highlights the countries where films are most commonly released and the distribution formats that dominate different markets.

***Prediction***
<br> We predicted that the most common distribution format would be Theatrical, as it has traditionally been the primary method for movie releases worldwide.

In [None]:
release_df = pd.read_csv('clean_datasets/releases.csv')

country_release_counts = release_df.groupby(['country', 'distribution_format']).size().reset_index(name='movies')

countries = country_release_counts['country'].unique()

***Analysis of the Procedure***

1. *Loading the Dataset*:
   The dataset containing information about releases was loaded using Pandas.

2. *Grouping the Data*:
   The data was grouped by `country` and `distribution_format`, counting the number of movies released in each country across different formats.

3. *Visualization*:
    To keep the plot readable, we added a dropdown menu that shows the data for one country at a time.
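
The dropdown works by giving every button a boolean visibility mask with one entry per trace. The following sketch builds such button dictionaries with plain Python. The helper name `build_buttons` and the three country names are illustrative assumptions; the dictionary keys follow the structure passed to `updatemenus` in this notebook.

```python
def build_buttons(names):
    """Return one Plotly-style 'update' button per name.

    Button j carries a boolean mask that is True only at position j,
    so selecting it shows trace j and hides every other trace.
    """
    n = len(names)
    return [
        {
            "label": name,
            "method": "update",
            "args": [
                {"visible": [i == j for i in range(n)]},
                {"title": f"Movie Distribution by Format in {name}"},
            ],
        }
        for j, name in enumerate(names)
    ]

buttons = build_buttons(["France", "Italy", "Japan"])
print(buttons[1]["args"][0]["visible"])  # mask for the second button
```

Since the masks only take effect when a button is clicked, the trace that is visible before any click should be the one at the menu's `active` index; otherwise the label and the bars shown initially disagree.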

In [None]:
fig = go.Figure()
# Add a trace for each country
for country in countries:
    country_data = country_release_counts[country_release_counts['country'] == country]
    fig.add_trace(go.Bar(
        x=country_data['distribution_format'],
        y=country_data['movies'],
        name=country,
        visible=False
    ))

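# Show the trace that matches the dropdown's initially active entry ('Italy')
fig.data[list(countries).index('Italy')].visible = True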

fig.update_layout(
    title="Movie Distribution by Format per Country",
    xaxis_title="Distribution Format",
    yaxis_title="Number of Movies",
    updatemenus=[{
        'buttons': [
            {'label': country,
             'method': 'update',
             'args': [{'visible': [i == j for i in range(len(countries))]},
                      {'title': f"Movie Distribution by Format in {country}"}]}
            for j, country in enumerate(countries)
        ],
        'direction': 'down',
        'showactive': True,
        'active': list(countries).index('Italy'),
        'x': 1,
        'xanchor': 'right',
        'y': 1.1,
        'yanchor': 'bottom',
        'font': {
            'size': 14
        }
    }],
    width=1000,
    height=600
)

fig.show()

#### ***A Check on Distribution of Countries by Release Format***

In [None]:
countries_per_format = release_df.groupby('distribution_format')['country'].nunique().reset_index(name='country counts')

The dataset was grouped by `distribution_format` (e.g., theatrical, streaming), and the number of distinct `country` values was counted within each group. The result shows in how many countries each distribution format is used.
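
To make the difference between counting releases and counting countries explicit, here is a small sketch on invented rows (column names as in `releases.csv`, everything else assumed).

```python
import pandas as pd

# Invented release rows.
releases = pd.DataFrame({
    "country": ["USA", "USA", "Italy", "Italy", "Japan", "USA"],
    "distribution_format": ["Theatrical", "Digital", "Theatrical",
                            "Theatrical", "Digital", "Theatrical"],
})

# size() counts rows (releases); nunique() counts distinct countries.
rows_per_format = releases.groupby("distribution_format").size()
countries_per_fmt = releases.groupby("distribution_format")["country"].nunique()
print(rows_per_format.to_dict(), countries_per_fmt.to_dict())
```

In this toy data, Theatrical has twice as many releases as Digital, yet both formats are used in two countries. A country is counted once under every format it uses, so the pie slices are shares of the summed per-format country counts, not shares of the total number of countries.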

In [None]:
plt.figure(figsize=(8, 8))
plt.pie(countries_per_format['country counts'],
        labels=countries_per_format['distribution_format'],
        autopct='%1.1f%%', startangle=90,
        colors=plt.cm.Paired.colors)
plt.title('Proportion of Countries for Each Distribution Format')
plt.axis('equal')
plt.show()

***Conclusion***
<br> This analysis explores global film distribution across various formats, revealing producer preferences by country. Surprisingly, digital distribution is the most popular format, though by a narrow margin.

### *Seasons of Cinema*: Best Release Season by Country

***Why is it useful to know?***
<br> Understanding which is the best release season for movies in different countries provides insights into each country's seasonal preferences, also highlighting their cultural differences.

***Prediction***
<br> Since we don't know the ideal release season for films, forecasting the outcome is difficult, but we suspect the results will vary from country to country.

In [None]:
from utils.utils import get_specific_season, season_colors

In [None]:
release_df = pd.read_csv('clean_datasets/releases.csv')

release_df['date'] = pd.to_datetime(release_df['date'])
release_df = release_df[release_df['date'] >= '1970-01-01']
release_df['season'] = release_df['date'].apply(get_specific_season)

In [None]:
grouped = release_df.groupby(['country', 'season']).size().reset_index(name='movie_count')
best_season = grouped.loc[grouped.groupby('country')['movie_count'].idxmax()]

top_20_countries = best_season.groupby('country')['movie_count'].sum().nlargest(20).index
best_season_top_20 = best_season[best_season['country'].isin(top_20_countries)]

***Analysis of the Procedure***
1. *Loading the Dataset*:
   The dataset containing releases information was loaded using Pandas.

2. *Data Preparation*:
   - The `date` column was converted to a datetime format for easier manipulation.
   - The dataset was filtered to include only movie releases from 1970 onwards, which keeps the plot readable and meaningful.
   - A `season` column was added by applying the `get_specific_season` function, which categorizes each release date into a season.
<br> <br>
3. *Grouping and Counting Movies*:
   The dataset was grouped by `country` and `season`, and the number of movies released in each group was counted.

4. *Identifying the Best Season*:
   For each country, the season with the highest movie count was identified as the best release season.

5. *Filtering Top Countries*:
   Countries were ranked by the movie count of their best season (each country has a single row in `best_season`, so the sum is simply that count), and the dataset was filtered to keep only the 20 highest-ranked countries.
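
The season labelling and the `idxmax` selection can be sketched as follows. The helper `get_specific_season` lives in `utils` and is not shown in this notebook, so `month_to_season` below is a hypothetical stand-in that assumes meteorological Northern Hemisphere seasons; the rows are invented.

```python
import pandas as pd

def month_to_season(date):
    """Hypothetical stand-in for get_specific_season (meteorological seasons)."""
    return {12: "Winter", 1: "Winter", 2: "Winter",
            3: "Spring", 4: "Spring", 5: "Spring",
            6: "Summer", 7: "Summer", 8: "Summer",
            9: "Autumn", 10: "Autumn", 11: "Autumn"}[date.month]

# Invented release rows; the 1960 entry is removed by the date filter.
releases = pd.DataFrame({
    "country": ["USA", "USA", "USA", "Italy", "Italy", "Italy"],
    "date": pd.to_datetime(["1995-12-20", "2001-01-15", "2010-07-04",
                            "1999-09-10", "2005-10-01", "1960-06-01"]),
})
releases = releases[releases["date"] >= "1970-01-01"].copy()
releases["season"] = releases["date"].apply(month_to_season)

per_season = releases.groupby(["country", "season"]).size().reset_index(name="movie_count")

# idxmax returns, per country, the row label of the largest count;
# .loc then pulls exactly those rows out of the grouped table.
best = per_season.loc[per_season.groupby("country")["movie_count"].idxmax()]
print(best)
```

`idxmax` keeps a single row per country, so when two seasons tie only the first one encountered is reported.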

In [None]:
fig = px.sunburst(best_season_top_20,
                  path=['season', 'country'],
                  values='movie_count',
                  color='season',
                  color_discrete_map=season_colors,
                  title='Best Movie Release Seasons by Country')
fig.update_layout(
    width=800,
    height=600,
)
fig.show()

***Conclusions***
<br> The plot reveals winter as the preferred season for film releases. This result is probably heavily influenced by data from the USA, where the most releases took place.

### *Lights, Clusters, Action*: Network Graph of Actors' Cultural Spheres

***Who are the most interconnected actors?***
<br> Some actors have forged extensive networks by working alongside a variety of co-stars in multiple projects.
<br> In this analysis, we will explore the relationships between actors by constructing a network graph that reveals their connections through shared films. In the graph each node will represent an actor and a shared film will form the connection between them.
<br> The network will help us identify the most central figures in the film industry, based on how many collaborations they've had, and discover clusters of actors who frequently collaborate.

***What do we expect from this analysis?***
- *Stefano*: The dataset spans far back in time, making it challenging to predict which actor or group of actors will be the most represented. However, it is likely that Hollywood stars will form one of the dominant clusters, representing a significant portion of the dataset, as observed in previous analyses.
- *Samuele*: The dataset is vast and includes not only movies but also numerous documentaries, which causes many historical figures featured in them to appear as actors. I expect clustering based on the nationality of movies and actors, with the largest cluster likely being dominated by Hollywood.
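
Before building the full graph, it may help to see how a co-appearance network and a centrality score can be derived from credits. The helper `network_graph` used in the next section is not shown in this notebook, so the sketch below is not its implementation. It only assumes that two actors are connected when they share a film and that centrality is degree centrality; the credits are invented.

```python
from collections import defaultdict
from itertools import combinations

# Invented (movie_id, actor) credits.
credits = [
    (1, "Ann"), (1, "Bob"), (1, "Cid"),
    (2, "Ann"), (2, "Dee"),
    (3, "Eve"),
]

# Collect the cast of every movie.
cast = defaultdict(set)
for movie_id, actor in credits:
    cast[movie_id].add(actor)

# Two actors are linked when they share at least one movie.
neighbors = defaultdict(set)
for actors in cast.values():
    for a, b in combinations(sorted(actors), 2):
        neighbors[a].add(b)
        neighbors[b].add(a)

# Degree centrality: fraction of the other nodes a node is linked to.
all_actors = {actor for _, actor in credits}
degree_centrality = {
    actor: len(neighbors[actor]) / (len(all_actors) - 1)
    for actor in all_actors
}
print(sorted(degree_centrality.items(), key=lambda kv: -kv[1]))
```

In this toy network, `Ann` is the most central actor because she bridges the casts of two films, while `Eve`, who appears alone, has a centrality of zero.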

#### Analysis and Visualization

In [None]:
# Import utils
from utils.actor_graph_network_utils import network_graph

In [None]:
# Loading the datasets
movies_df = pd.read_csv("clean_datasets/movies.csv")
actors_df = pd.read_csv("clean_datasets/actors.csv")

In [None]:
# Define the function that will draw the graph
def draw_network_graph(graph):
    # The centrality of a node is the share of the other nodes in the network it is connected to.
    centrality = graph.get('centrality')

    # Draw the edge Scatter
    edge_x = graph.get('edges')[0]
    edge_y = graph.get('edges')[1]
    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines'
    )

    # Draw the node scatter
    node_x = graph.get('nodes')[0]
    node_y = graph.get('nodes')[1]
    node_sizes = graph.get('nodes')[2]
    node_text = graph.get('nodes')[3]
    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        marker=dict(
            size=node_sizes,
            color=list(centrality.values()),
            colorscale='YlGnBu',
            colorbar=dict(
                title="Node Connections"
            ),
            line=dict(width=1, color='#333')
        ),
        text=node_text,
        hoverinfo='text'
    )

    # Merge the node and edge scatter into a single plot
    fig = go.Figure(
        data=[edge_trace, node_trace],
        layout=go.Layout(
            title=dict(
                text="<br>Network graph of the 1000 most credited actors",
                font=dict(size=16)
            ),
            showlegend=False,
            hovermode='closest',
            margin=dict(b=20, l=5, r=5, t=40),
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
    )
    fig.show()

In [None]:
# Preparing the data
## Removing unneeded columns
actors = actors_df.drop(columns=['role'])
## Taking the first 1000 most common actors to have a lighter analysis
actors = actors[actors['name'].isin(actors.value_counts(subset='name').head(1000).index)]

In [None]:
# Build all graph info
graph = network_graph(actors)

In [None]:
draw_network_graph(graph)

The first version of the graph shows some clustering, but the separation between clusters is not very distinct.
<br> A dominant cluster contains nodes with the highest centrality, representing actors who collaborated extensively. These actors belong to an era ranging from the late 1800s to the 1970s. This could be due to a bias in the dataset, where older films and actors are more represented, maybe because of their historical significance.
<br><br>
However, the lack of clarity in the graph highlights the need for refinement. To address this, we will filter the data to include only actors from movies released from 1990 onward. This adjustment will allow us to focus on actors more recognizable to modern audiences and provide a clearer, more relevant representation of the network.

In [None]:
# Preparing the data
## Removing unneeded columns
actors = actors_df.drop(columns=['role'])
## Filtering actors from movies released from 1990 onward
movies = movies_df[movies_df['release_year'] >= 1990]
actors = actors[actors['movie_id'].isin(movies['id'])]
## Taking the first 1000 most common actors to have a lighter analysis
actors = actors[actors['name'].isin(actors.value_counts(subset='name').head(1000).index)]

In [None]:
# Build all graph info
graph = network_graph(actors)

In [None]:
draw_network_graph(graph)

In this updated graph, the clustering effect is significantly more pronounced, and we can clearly distinguish several distinct spheres:
- **Indian Sphere**: A cluster likely representing Bollywood actors and their connections.
- **Japanese Sphere**: A smaller but noticeable group representing actors predominantly from Japanese cinema.
- **Occidental Sphere**: The largest and most dominant cluster, centered around Hollywood stars and their collaborations.
- **Minor Spheres**: Other regional groups, such as Korean, German, and French cinema, forming smaller, less connected clusters.
<br><br>
This graph marks a substantial improvement from the previous version. The separation between the clusters is much clearer, and the identification of key groups is more intuitive. Additionally, the inclusion of actors from recent movies makes it easier to recognize names and associate them with modern cinema trends.

In [None]:
# A peek at the most interconnected actors of the modern era.
centrality = graph.get('centrality')
centrality = pd.DataFrame({'key': list(centrality.keys()), 'value': list(centrality.values())})
centrality[['key', 'value']].sort_values(by='value', ascending=False).head(15)

#### Conclusions

The analysis confirmed that the network graph forms distinct clusters primarily based on geographic and cultural spheres.
<br>As predicted, the Hollywood sphere dominates, reflecting the global influence of American cinema. Other significant clusters include the Indian sphere (Bollywood) and the Japanese sphere, both of which are more noticeable due to their unique cultural identities and relative isolation from globalization in the movie industry.
<br>By narrowing the dataset to focus on more recent movies (from the 1990s onward), the clustering effect became more pronounced, and recognizable modern actors appeared, aligning with our expectations.
<br>Older actors were initially overrepresented due to dataset bias, but filtering improved interpretability.

### *Lights and Shadows*: Poster Color Brightness Across Genres



***Does the brightness of movie posters reveal something about their genres?***
<br> Movie posters play a crucial role in setting the tone and attracting audiences, often using colors to evoke emotions that resonate with a film’s themes.
<br> In this analysis, we will examine the relationship between the brightness of movie posters and their associated genres by analyzing the average color composition of the posters. Each genre will be represented by its average Red, Green, and Blue (RGB) values, from which a brightness value is then computed.
<br> The analysis will help us highlight how brightness and color composition might represent the different genres. A 3D scatterplot will provide an interactive visualization of these relationships.
<br><br>
***What do we expect from this analysis?***
- *Stefano & Samuele*: We expect genres like horror and thriller to exhibit darker tones, reflecting their tense and dark themes, while comedy and family movies will likely feature brighter cheerful colors. Other genres should fall somewhere in between, forming a gradient of brightness that aligns with their emotional and thematic characteristics.
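
Brightness in this analysis is a weighted sum of the three color channels, in which green contributes the most and blue the least. The sketch below assumes that an average color is stored as an `r,g,b` string and uses the same weights as the computation in the next section; the helper names `parse_rgb` and `brightness` are illustrative and not part of the notebook.

```python
def parse_rgb(text):
    """Parse an 'r,g,b' string such as '12,34,56' into three ints."""
    r, g, b = (int(part) for part in text.split(","))
    return r, g, b

def brightness(r, g, b):
    """Weighted sum of the channels (Rec. 709 luminance weights).

    Applying the weights to 8-bit sRGB values without linearising
    them first is an approximation of relative luminance.
    """
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

for color in ["255,255,255", "255,0,0", "0,255,0", "0,0,255", "0,0,0"]:
    print(color, round(brightness(*parse_rgb(color)), 1))
```

With these weights, a pure green poster scores far brighter than a pure blue one, which is worth keeping in mind when comparing genres whose posters differ mainly in hue.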

#### Analysis and Visualization

**NOTE**: The posters_colors.csv dataset has been generated by the script poster_average_color.py, which processes all available posters in the dataset. The total size of the images is approximately 23 GB, making the execution of the script quite time-consuming. The file contains a commented section to perform benchmarks on portions of the dataset, and at the end of the file, the timing results from a relatively fast computer are provided. It is recommended not to run the script on the entire dataset but instead to download the final result directly from the link provided in the README.

In [None]:
# Loading the datasets
genres = pd.read_csv('clean_datasets/genres.csv')
posters_colors = pd.read_csv('clean_datasets/posters_colors.csv')

In [None]:
# Preparing the data
## Merge the two datasets
colors_genres = genres.merge(posters_colors, left_on='movie_id', right_on='id', how="inner")
## Removing unneeded columns
colors_genres = colors_genres.drop(columns=['movie_id', 'poster_path', 'poster_link'])
## Drop the null values
colors_genres = colors_genres.dropna(subset=["average_color"])
## Split the RGB values for easier handling
colors_genres[['r', 'g', 'b']] = colors_genres['average_color'].str.split(',', expand=True).astype(int)
## Calculate brightness with relative luminance formula (source: Wikipedia)
colors_genres['brightness'] = (
        0.2126 * colors_genres['r'] +
        0.7152 * colors_genres['g'] +
        0.0722 * colors_genres['b']
).astype(int)
## Group by genre to see average brightness
colors_genres = colors_genres.groupby('genre')[['r', 'g', 'b', 'brightness']].mean().reset_index()

In [None]:
# Create 3D scatterplot
fig = px.scatter_3d(
    colors_genres,
    x='r', y='g', z='b',
    color='brightness',  # Use brightness as the color scale
    text='genre',
    title='Genres Based on Average Brightness',
    color_continuous_scale='Greys_r'  # Use a grayscale color map
)
fig.update_layout(
    scene=dict(
        xaxis_title='Red (R)',
        yaxis_title='Green (G)',
        zaxis_title='Blue (B)',
        camera=dict(
            up=dict(x=0, y=0, z=2),
            center=dict(x=0, y=0, z=0),
            eye=dict(x=2, y=-1.5, z=0.5)
        )
    ),
    width=1200,
    height=600,
    title={'x': 0.5, 'xanchor': 'center', 'y': 0.95})
fig.show()

#### Conclusions

The analysis confirmed that the relationship between movie poster brightness and genres aligns with expected trends, forming clear distinctions in color usage based on thematic tones.
<br>
As predicted, darker genres like Horror, Thriller, and Mystery tend to have lower brightness levels, reflecting their association with suspense and fear. In contrast, lighter genres like Comedy, Family, and Animation exhibit higher brightness values, conveying a more cheerful and playful mood.
Interestingly, genres such as Drama and Science Fiction occupy intermediate positions, showcasing a mix of brightness levels that likely represent their diverse thematic elements.
<br>
This trend underscores the role of color psychology in shaping audience perceptions and highlights how genre conventions are reflected even in visual materials like movie posters.

### *Echoes of Theme*: Correlation between Movie Genre and Theme

***Why is it useful to know?***
<br> This plot is useful for examining how movie genres relate to their themes. By visualizing the most frequent genre-theme combinations, we can see which themes are most prevalent in specific genres. This analysis allows us to understand genre-specific thematic preferences, which could be useful for filmmakers and marketers in targeting audiences more effectively.

***Prediction***
<br> Based on the structure of the dataset and the presence of various genres, we expect to see certain themes emerge more frequently within specific genres.

In [None]:
movie_df = pd.read_csv('clean_datasets/movies.csv')
genre_df = pd.read_csv('clean_datasets/genres.csv')
theme_df = pd.read_csv('clean_datasets/themes.csv')

In [None]:
genre_theme_df = pd.merge(genre_df[['movie_id', 'genre']], theme_df[['movie_id', 'theme']], on='movie_id')
top_combinations = genre_theme_df.groupby(['genre', 'theme']).size().reset_index(name='count')

In [None]:
theme_df.value_counts(subset='theme')

***Analysis of the Procedure***
1. *Loading the Datasets*:
   The datasets containing information about movies, genres, and themes were loaded using Pandas.

2. *Merging Genre and Theme Data*:
   The `genre_df` and `theme_df` datasets were merged on the `movie_id` column to associate each movie with both its genre and its theme.

3. *Counting Genre-Theme Combinations*:
   The combined dataset (`genre_theme_df`) was grouped by both `genre` and `theme` to count how many times each genre-theme combination appears.

4. *Examining Theme Frequency*:
   The `theme_df` was analyzed to determine the most frequent themes in the dataset.
   The most frequent themes already hint at which genre-theme associations are likely to be common, especially since only 19 genres are present.
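
One property of the merge in step 2 is worth illustrating: when a movie has several genres and several themes, joining on `movie_id` produces every genre-theme pair for that movie. A minimal sketch with invented rows and placeholder theme labels (`T1`, `T2`, `T3`):

```python
import pandas as pd

# Invented rows: movie 1 has two genres and two themes.
genre_rows = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "genre": ["Drama", "Romance", "Horror"],
})
theme_rows = pd.DataFrame({
    "movie_id": [1, 1, 2],
    "theme": ["T1", "T2", "T3"],
})

# Many-to-many join: 2 genres x 2 themes = 4 rows for movie 1.
pairs = pd.merge(genre_rows, theme_rows, on="movie_id")
combo_counts = pairs.groupby(["genre", "theme"]).size().reset_index(name="count")
print(len(pairs))
print(combo_counts)
```

As a consequence, movies with many genres or themes contribute more pairs than others, and frequent themes tend to dominate the bubble plot across several genres.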

In [None]:
fig = px.scatter(top_combinations,
                 x='genre',
                 y='theme',
                 size='count',
                 color='count',
                 hover_name='theme',
                 hover_data={'genre': True, 'count': True},
                 title='Bubble Plot of Genre-Theme Combinations')
fig.update_layout(
    height=1200,
    xaxis_title="Genre",
    yaxis_title="Theme",
    xaxis_tickangle=45,
    yaxis_tickangle=0,
)
fig.show()

***Conclusions***
<br> As expected, the most frequent genre-theme associations involve the themes "Moving relationship stories," "Crude humor and satire," and "Horror, the undead and monster classics," which are also the most frequent themes in the themes dataset.

### *How Films Reach Audiences*: A Distribution Type Analysis

***Why is it useful to know?***
<br> This visualization helps us examine the distribution of film releases over time, with a particular focus on how different distribution formats have evolved before and after the year 2000. Understanding these trends allows us to track the industry's adaptation to new technologies and shifts in audience preferences.

***Prediction***
<br> An earlier visualization suggests that theatrical release is the primary distribution method both before and after 2000, but we expect to see a significant increase in digital formats.

In [None]:
releases_df = pd.read_csv('clean_datasets/releases.csv')
movies_df = pd.read_csv('clean_datasets/movies.csv')

In [None]:
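merged_df = pd.merge(releases_df, movies_df, left_on='movie_id', right_on='id')

# Drop rows without a release year: NaN < 2000 is False, so they would be labelled 'After 2000'
merged_df = merged_df.dropna(subset=['release_year'])
# Movies released in 2000 itself fall into 'After 2000'
merged_df['release_period'] = merged_df['release_year'].apply(lambda x: 'Before 2000' if x < 2000 else 'After 2000')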

distribution_counts = merged_df.groupby(['distribution_format', 'release_period']).size().reset_index(name='Count')

***Analysis of the Procedure***
1. *Loading the Datasets*:
   The datasets containing information about movies and releases were loaded using Pandas.

2. *Merging the Datasets*:
   The two datasets were merged on the `movie_id` (from `releases_df`) and `id` (from `movies_df`) columns to create a dataset containing both release information and movie details.

3. *Creating the Release Period Column*:
   A new column called `release_period` was added to the dataset, categorizing each movie as either "Before 2000" (release year earlier than 2000) or "After 2000" (release year 2000 or later).

4. *Counting Distribution Formats*:
   The dataset was grouped by `distribution_format` and `release_period` to count the occurrences of each distribution format in each period.
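
One detail worth checking when labelling periods with a comparison is how missing years behave. The sketch below, on an invented series of years, contrasts a plain `if/else` label with a variant that leaves missing years unlabelled; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented release years, including one missing value.
years = pd.Series([1985, 2000, 2015, np.nan], name="release_year")

# NaN < 2000 evaluates to False, so a plain if/else sends missing
# years to the "After 2000" branch.
naive = years.apply(lambda y: "Before 2000" if y < 2000 else "After 2000")

# Masking the missing values keeps them out of both periods.
period = pd.Series(
    np.where(years < 2000, "Before 2000", "After 2000"),
    index=years.index,
).where(years.notna())

print(naive.tolist())
print(period.tolist())
```

Rows left unlabelled are ignored by `groupby` by default, so they no longer inflate the count of the later period.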

In [None]:
plt.figure(figsize=(12, 7))

sns.barplot(x='Count', y='distribution_format', hue='release_period', data=distribution_counts, palette='viridis')

plt.xlabel('Number of Releases')
plt.ylabel('Distribution Format')
plt.title('Number of Releases by Distribution Format (Before and After 2000)')

plt.show()

***Conclusions***
<br> The plot confirms theatrical release as the leading distribution format and, as expected, shows a rise in digital releases after 2000. A surprising development is the increase in premieres, which may reflect a general growth in the popularity of cinema.
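
Raw counts can rise in every format simply because one period contains more films than the other. One possible complement, sketched below with invented numbers (not results from the dataset), is to convert the counts into shares within each period. The `Share` column and the toy dataframe are assumptions introduced only for this illustration.

```python
import pandas as pd

# Hypothetical counts in the same shape as distribution_counts (invented values)
counts_toy = pd.DataFrame({
    'distribution_format': ['Digital', 'Digital', 'Theatrical', 'Theatrical'],
    'release_period': ['Before 2000', 'After 2000', 'Before 2000', 'After 2000'],
    'Count': [10, 300, 90, 700],
})

# Divide each count by the total number of releases in its own period
period_totals = counts_toy.groupby('release_period')['Count'].transform('sum')
counts_toy['Share'] = counts_toy['Count'] / period_totals

print(counts_toy)
```

In this toy case the theatrical count grows from 90 to 700 while its share falls from 0.9 to 0.7, which shows how counts and shares can point in different directions.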

### *A Wider Stage*: Evolution in the Number of Oscar Categories Over the Years

***Why is it useful to know?***
<br> This visualization highlights the trend in the number of Oscar categories awarded over time. By examining how the number of categories has changed, we can gain insights into the evolving scope of the Academy Awards.

***Prediction***
<br> Based on historical trends, we expect to see an increase in the number of Oscar categories over time. The Academy may have added more categories as the film industry has grown and diversified.

In [None]:
oscars_df = pd.read_csv('clean_datasets/oscars.csv')

In [None]:
category_count_per_year = oscars_df.groupby('year_ceremony')['category'].nunique().reset_index(name='category_count')

***Analysis of the Procedure***
1. *Loading the Dataset*:
   The dataset containing information about the Oscars was loaded using Pandas.

2. *Counting Categories per Year*:
   The dataset was grouped by `year_ceremony`, and the number of unique Oscar categories for each year was counted using the `.nunique()` function.
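
To make the role of `.nunique()` concrete, here is a minimal sketch on invented rows. It contrasts counting distinct categories with counting all rows (nominations) per year; the toy dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical nominations: several rows can share the same category in a year
oscars_toy = pd.DataFrame({
    'year_ceremony': [1930, 1930, 1930, 1931, 1931, 1931],
    'category': ['ACTOR', 'ACTRESS', 'ACTOR', 'ACTOR', 'DIRECTING', 'WRITING'],
})

# Distinct categories per ceremony year (what the notebook plots)
category_count_toy = (
    oscars_toy.groupby('year_ceremony')['category']
    .nunique()
    .reset_index(name='category_count')
)

# For comparison: total rows per year, which counts nominations instead
nomination_count_toy = oscars_toy.groupby('year_ceremony').size()

print(category_count_toy)
print(nomination_count_toy)
```

Using `.size()` here would measure how many nominations were recorded, so `.nunique()` is the appropriate choice when the question is about the number of categories.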

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(category_count_per_year['year_ceremony'], category_count_per_year['category_count'], color='skyblue')

# Customize the plot
plt.title('Number of Oscar Categories per Year', fontsize=14)
plt.xlabel('Year of Ceremony', fontsize=12)
plt.ylabel('Number of Categories', fontsize=12)
plt.xticks(rotation=45)
plt.grid(True)

plt.tight_layout()
plt.show()

***Conclusions***
<br> The plot shows an unexpected picture. While one might assume that the number of Oscar categories would keep increasing over time, the data reveal a period of stability beginning around 1970. The only significant expansion occurred between the 1940s and the late 1950s, a period marked by the early growth of the Academy Awards.

### *Fresh or Rotten*: Trend of Review Types Over Time

In [None]:
reviews = pd.read_csv('clean_datasets/reviews.csv')

In [None]:
reviews['review_date'] = pd.to_datetime(reviews['review_date'])
reviews = reviews[reviews['review_date'] >= '2001-01-01']

# Group the data by month and review type
trend_data = (
    reviews.groupby([reviews['review_date'].dt.to_period('M'), 'type'])
    .size()
    .unstack(fill_value=0)
)

# Convert the index to datetime format
trend_data.index = trend_data.index.to_timestamp()

# Find the peaks and valleys for "Fresh"
fresh_peaks = trend_data[trend_data['Fresh'] == trend_data['Fresh'].max()]
fresh_valleys = trend_data[trend_data['Fresh'] == trend_data['Fresh'].min()]

# Find the peaks and valleys for "Rotten"
rotten_peaks = trend_data[trend_data['Rotten'] == trend_data['Rotten'].max()]
rotten_valleys = trend_data[trend_data['Rotten'] == trend_data['Rotten'].min()]

# Create the plot
plt.figure(figsize=(12, 6))
plt.plot(trend_data.index, trend_data['Fresh'], label='Fresh', color='green', marker='o')
plt.plot(trend_data.index, trend_data['Rotten'], label='Rotten', color='red', marker='o')

# Add markers for the peaks and valleys
plt.scatter(fresh_peaks.index, fresh_peaks['Fresh'], color='blue', label='Fresh peaks', s=100, zorder=5)
plt.scatter(fresh_valleys.index, fresh_valleys['Fresh'], color='cyan', label='Fresh valleys', s=100, zorder=5)

plt.scatter(rotten_peaks.index, rotten_peaks['Rotten'], color='orange', label='Rotten peaks', s=100, zorder=5)
plt.scatter(rotten_valleys.index, rotten_valleys['Rotten'], color='yellow', label='Rotten valleys', s=100, zorder=5)

# Add title and labels
plt.title('Trend of Fresh and Rotten Reviews with Peaks and Valleys', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Number of Reviews', fontsize=12)
plt.legend(title='Legend')
plt.grid(alpha=0.3)
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Make sure the dates are in datetime format
reviews['review_date'] = pd.to_datetime(reviews['review_date'])
reviews = reviews[reviews['review_date'] >= '2001-01-01']
# Keep only the reviews written by top critics
reviews = reviews[reviews['is_top_critic'] == True]

# Group the data by month and review type
trend_data = (
    reviews.groupby([reviews['review_date'].dt.to_period('M'), 'type'])
    .size()
    .unstack(fill_value=0)
)

# Convert the index to datetime format
trend_data.index = trend_data.index.to_timestamp()

# Create the plot
plt.figure(figsize=(12, 6))
plt.plot(trend_data.index, trend_data['Fresh'], label='Fresh', color='green', marker='o')
plt.plot(trend_data.index, trend_data['Rotten'], label='Rotten', color='red', marker='o')

# Add title and labels
plt.title('Trend of Fresh and Rotten Reviews (Top Critics)', fontsize=16)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Number of Reviews', fontsize=12)
plt.legend(title='Review type')
plt.grid(alpha=0.3)
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Find the peaks and valleys for "Fresh" reviews
fresh_peaks = trend_data[trend_data['Fresh'] == trend_data['Fresh'].max()]
fresh_valleys = trend_data[trend_data['Fresh'] == trend_data['Fresh'].min()]

# Find the peaks and valleys for "Rotten" reviews
rotten_peaks = trend_data[trend_data['Rotten'] == trend_data['Rotten'].max()]
rotten_valleys = trend_data[trend_data['Rotten'] == trend_data['Rotten'].min()]

# Print the results
print("Fresh peaks:")
print(fresh_peaks)

print("\nFresh valleys:")
print(fresh_valleys)

print("\nRotten peaks:")
print(rotten_peaks)

print("\nRotten valleys:")
print(rotten_valleys)