# Data Visualization
# 1. Introduction
## Dataset Description
This Jupyter Notebook explores a dataset centered around movies, organized into three main components:
1. **The Movie Dataset**: This dataset provides detailed information about individual films. It is split across multiple dataframes that are linked through a unique movie identifier key.
2. **The Oscar Awards Dataset**: This dataset contains comprehensive records of every nominee and winner since the first Oscars ceremony.
3. **The Rotten Tomatoes Review Dataset**: This dataset focuses on the reception of movies by critics, with data sourced from the review aggregator Rotten Tomatoes.

The Movie, Oscar, and Review datasets are not linked to one another: they originate from entirely separate sources and therefore lack a shared unique identifier for movies.<br>
Consequently, analyzing and visualizing the data is more challenging, because there is little information with which to relate a movie's performance in one dataset to its success in another.
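
One common workaround for a missing shared key is an approximate join on a normalized title plus release year. The sketch below only illustrates that idea and is not part of this notebook's pipeline: the column names (`title`, `film`, `release_year`, `year_film`), the helper `normalize_title`, and the toy rows are all hypothetical.

```python
import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase a title, drop punctuation, and collapse repeated whitespace."""
    cleaned = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


# Toy frames with made-up titles; the real datasets may use different column names.
movies = pd.DataFrame(
    {"title": ["Example Movie", "Another Film: Part II"], "release_year": [2001, 2005]}
)
oscars = pd.DataFrame(
    {
        "film": ["example movie", "Another Film, Part II", "Unmatched Title"],
        "year_film": [2001, 2005, 2010],
    }
)

# Build the same surrogate key (normalized title + year) on both sides.
movies_keyed = movies.assign(key=movies["title"].map(normalize_title), year=movies["release_year"])
oscars_keyed = oscars.assign(key=oscars["film"].map(normalize_title), year=oscars["year_film"])

# The inner join keeps only rows whose surrogate key appears in both frames.
linked = movies_keyed.merge(oscars_keyed, on=["key", "year"], how="inner")
print(linked[["title", "film", "year"]])
```

A join like this can miss films (alternate titles, differing year conventions) and can mismatch remakes released in the same year, so any linked result would need manual spot checks.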

## Methodology
The analysis follows a structured methodology that includes the following steps:
1. **Prediction**: For the *In-Depth Visualization*, the analysts develop hypotheses and predictions beforehand to encourage critical thinking and expose common misconceptions.
2. **Analysis**: Conducting *Simple* and *In-Depth* exploration of the datasets to identify patterns, trends, and relationships.
3. **Visualization**: Creating meaningful and creative visual representations of the data to enhance understanding and interpretation.
4. **Conclusion**: Summarizing findings, deriving insights from the analysis and visualizations, and comparing them with the initial hypotheses.

## Visualization Technologies
A variety of Python libraries are employed to create static, dynamic, interactive, and geographic visualizations. The following libraries are used:
- **Plotly**: For creating interactive and dynamic plots.
- **Geopandas**: For handling and visualizing geographic data.
- **Seaborn**: For generating aesthetically pleasing statistical graphics.
- **Matplotlib**: For static visualizations.
- **Folium**: For creating interactive maps and geographic visualizations.

These tools enable a diverse range of visualization techniques, enhancing the ability to explore and interpret the data effectively.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium as fm
import plotly.express as px

# 2. Exploratory Data Analysis (EDA)
## Simple Visualizations
### Correlation between runtime and rating

In [None]:
# Load the cleaned movies dataset and preview it
movies_df = pd.read_csv('clean_datasets/movies.csv')
movies_df

In [None]:
# Visualize potential outliers using a boxplot for both variables
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sns.boxplot(y="runtime_in_minutes", data=movies_df, ax=axes[0])
axes[0].set_title("Boxplot of Movie Duration", fontsize=14)
axes[0].set_ylabel("Duration (minutes)")

sns.boxplot(y="rating", data=movies_df, ax=axes[1])
axes[1].set_title("Boxplot of Average Rating", fontsize=14)
axes[1].set_ylabel("Average Rating")

plt.tight_layout()
plt.show()
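
The points a default boxplot draws as outliers are those lying beyond 1.5 × IQR from the quartiles. As a minimal sketch, those limits can also be computed numerically; the helper `iqr_bounds` and the toy series below are assumptions for illustration, not values taken from `movies_df`.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits beyond which a boxplot flags outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)


# Toy runtimes for illustration only; in the notebook this would be
# movies_df["runtime_in_minutes"].dropna().
runtimes = pd.Series([80, 90, 95, 100, 105, 110, 120, 600])

low, high = iqr_bounds(runtimes)
outliers = runtimes[(runtimes < low) | (runtimes > high)]
print(f"limits: [{low}, {high}], flagged: {outliers.tolist()}")
```

Comparing these limits with a fixed cutoff is one way to sanity-check a hand-picked threshold.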

In [None]:
# Keep only runtimes in (0, 200] minutes to exclude outliers and TV series
filtered_df = movies_df[
    (movies_df["runtime_in_minutes"] > 0) &
    (movies_df["runtime_in_minutes"] <= 200)
]

plt.figure(figsize=(10, 6))
sns.regplot(
    x="runtime_in_minutes",
    y="rating",
    data=filtered_df,
    scatter_kws={"s": 50, "alpha": 0.7},
    line_kws={"color": "red"},
)
plt.title("Correlation Between Movie Duration and Average Rating", fontsize=16)
plt.xlabel("Duration (minutes)", fontsize=12)
plt.ylabel("Average Rating", fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
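
The regression line shows the direction of the relationship but not its strength. A minimal sketch of quantifying it with a correlation coefficient follows; the helper `runtime_rating_correlation` is hypothetical and the toy frame is illustrative only, so it says nothing about the actual dataset.

```python
import pandas as pd


def runtime_rating_correlation(
    df: pd.DataFrame,
    x: str = "runtime_in_minutes",
    y: str = "rating",
    method: str = "pearson",
) -> float:
    """Correlation between two columns, ignoring rows where either value is missing."""
    pair = df[[x, y]].dropna()
    return float(pair[x].corr(pair[y], method=method))


# Toy frame for illustration only; in the notebook, pass filtered_df instead.
toy_df = pd.DataFrame(
    {
        "runtime_in_minutes": [90, 100, 110, 120, None],
        "rating": [6.0, 6.5, 7.0, 7.5, 8.0],
    }
)

r = runtime_rating_correlation(toy_df)
print(f"Pearson r = {r:.3f}")
```

Passing `method="spearman"` gives a rank-based alternative that is less sensitive to extreme runtimes.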