# Explore the National-Scale Environmental Pressure Indicators Dataset

## Introduction
This notebook provides an exploratory data analysis (EDA) of the "National-Scale Environmental Pressure Indicators from 1990 to 2050" dataset. This dataset contains historical (1990–2020) and forecasted (2021–2050) national-level data for key environmental and socioeconomic indicators, including Municipal Solid Waste (MSW) generation, greenhouse gas emissions (CO₂, CH₄, N₂O), GDP per capita (PPP), and population for 43 countries.

Learn more:
- Data Package DOI: [10.71728/senscience.k2f7-p5v9](https://doi.org/10.71728/senscience.k2f7-p5v9)

This notebook will guide you through loading, exploring, and visualizing the data to uncover key trends and insights.

### Install and import required libraries

In [None]:
# Install the Graphviz system dependencies, then mlcroissant from PyPI
!sudo apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
!pip install "mlcroissant[dev]"

In [None]:
import mlcroissant as mlc
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tabulate import tabulate

from IPython.display import Markdown, display

## 1. Data Loading
First, we'll load the historical and forecasted data into pandas DataFrames. The records are read with `mlcroissant` from the dataset's `fair2.json` metadata file.

In [None]:
# --- Load the dataset metadata from the FAIR² JSON file --- #

url = 'https://sen.science/doi/10.71728/senscience.k2f7-p5v9/fair2.json'
dataset = mlc.Dataset(url)
metadata = dataset.metadata.to_json()
print(f"{metadata['name']}: {metadata['description']}")

In [None]:
# List all the record sets available in the dataset
df_record_sets = pd.DataFrame(metadata["recordSet"])
columns_to_keep = {
    "@id": "Record Set ID",
    "description": "Description"
}
df_record_sets = df_record_sets[list(columns_to_keep.keys())]
df_record_sets = df_record_sets.rename(columns=columns_to_keep)

# Convert DataFrame to Markdown table
markdown_table = tabulate(df_record_sets, headers="keys", tablefmt="pipe", showindex=False)
# Render the table as Markdown in Jupyter
display(Markdown(markdown_table))

In [None]:
# Load the 'Historical' and 'Forecast' record sets by their full @id (see the table above)
historical_records = dataset.records(record_set='https://sen.science/doi/10.71728/senscience.k2f7-p5v9/recordsets/Historical')
forecast_records = dataset.records(record_set='https://sen.science/doi/10.71728/senscience.k2f7-p5v9/recordsets/Forecast')

historical_df = pd.DataFrame(historical_records)
forecast_df = pd.DataFrame(forecast_records)

print("Data loaded successfully from FAIR² JSON.")

## 2. Data Overview
Let's get a basic understanding of our datasets, including their shapes and data types.

In [None]:
print("Historical Data Info:")
historical_df.info()
print("\n" + "-"*50 + "\n")
print("Forecast Data Info:")
forecast_df.info()

## 3. Exploratory Data Analysis (EDA)
Now, let's dive deeper into the data to find patterns, anomalies, and relationships.

### 3.1 Check for Missing Values

In [None]:
print("Missing values in Historical Data:")
print(historical_df.isnull().sum())
print("\n" + "-"*50 + "\n")
print("Missing values in Forecast Data:")
print(forecast_df.isnull().sum())

The forecast data has missing values in the emissions columns. This is expected, since emissions are the quantities left to be predicted. For our EDA, we will focus on the variables that are present.
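Raw counts are easier to interpret as a share of rows. The following is a minimal sketch on a toy frame standing in for `forecast_df`; the values are made up for illustration and the column names are assumptions, not the dataset's actual contents.

```python
import numpy as np
import pandas as pd

# Toy stand-in for forecast_df; values are illustrative, not taken from the dataset.
toy_forecast = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "MSW generation": [10.0, 12.0, 5.0, 6.0],
    "CO2 emissions": [np.nan, np.nan, np.nan, np.nan],
    "Population": [1.0e6, 1.1e6, np.nan, 2.1e6],
})

# Share of missing entries per column, as a percentage.
missing_pct = toy_forecast.isnull().mean().mul(100).round(1)

# Columns that are entirely empty carry no information for the EDA.
empty_columns = missing_pct[missing_pct == 100.0].index.tolist()
usable_columns = missing_pct[missing_pct < 100.0].index.tolist()

print(missing_pct)
print("Entirely missing:", empty_columns)
```

Applied to the real `forecast_df`, the same two lines would separate columns that are fully empty from those that are only partly incomplete.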

### 3.2 Summary Statistics

In [None]:
display(Markdown("### Historical Data Summary Statistics"))
display(historical_df.describe())
display(Markdown("### Forecast Data Summary Statistics"))
display(forecast_df.describe())

In [None]:
FIELD_PREFIX = "https://sen.science/doi/10.71728/senscience.k2f7-p5v9/recordsets/{record_set}/fields/"

def clean_column_names(df, record_set):
    """Strip the record-set URL prefix and make column names human-readable."""
    prefix = FIELD_PREFIX.format(record_set=record_set)

    def prettify(col):
        if col.startswith(prefix):
            col = col[len(prefix):]
        # Normalize subscript digits so the replacements below match either spelling
        col = col.replace("CO₂", "CO2").replace("CH₄", "CH4").replace("N₂O", "N2O")
        # Insert a space before a capital letter that follows a lowercase letter
        col = ''.join([' ' + c if c.isupper() and i > 0 and col[i-1].islower() else c for i, c in enumerate(col)])
        col = col.replace("GDPPPPcapita", "GDP PPP/capita")
        col = col.replace("MSWgeneration", "MSW generation")
        col = col.replace("CO2emissions", "CO₂ emissions")
        col = col.replace("CH4emissions", "CH₄ emissions")
        col = col.replace("N2Oemissions", "N₂O emissions")
        col = col.replace("CountryName", "Country Name")
        return col.strip()

    df.columns = [prettify(col) for col in df.columns]
    return df

# The same cleaning applies to both record sets; only the URL prefix differs
historical_df = clean_column_names(historical_df, "Historical")
forecast_df = clean_column_names(forecast_df, "Forecast")

In [None]:
import re

def clean_binary_columns(df):
    # Helper to decode binary strings and convert to appropriate types
    def decode_value(val):
        if isinstance(val, bytes):
            val = val.decode('utf-8')
        elif isinstance(val, str) and val.startswith("b'") and val.endswith("'"):
            val = val[2:-1]
        # Convert numeric-looking strings to float or int
        if isinstance(val, str):
            val = val.strip()
            # Drop thousands separators only for the numeric check, so text
            # values that contain commas are left unchanged
            candidate = val.replace(',', '')
            if re.fullmatch(r'-?\d+\.\d+', candidate):
                return float(candidate)
            if re.fullmatch(r'-?\d+', candidate):
                return int(candidate)
        return val

    for col in df.columns:
        df[col] = df[col].apply(decode_value)
    return df

historical_df = clean_binary_columns(historical_df)
forecast_df = clean_binary_columns(forecast_df)
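
The cleaning step above can also be expressed column-wise. The following is a minimal sketch on a toy frame, assuming text fields arrive as `bytes`; the values and the choice of numeric columns are illustrative assumptions. Note that `pd.to_numeric(..., errors="coerce")` turns unparseable entries into NaN rather than leaving them as strings, which differs from `clean_binary_columns`.

```python
import pandas as pd

# Toy records shaped like raw byte-valued fields; values are illustrative only.
toy = pd.DataFrame({
    "Country Name": [b"Brazil", b"India"],
    "Attribute": [b"1990", b"1991"],
    "MSW generation": [b"1,234.5", b"n/a"],
})

def decode_bytes(value):
    """Decode bytes to str and pass every other value through unchanged."""
    return value.decode("utf-8") if isinstance(value, bytes) else value

# Decode every cell, column by column.
decoded = toy.apply(lambda col: col.map(decode_bytes))

# Convert only the columns expected to be numeric (an assumption for this toy frame).
for name in ["Attribute", "MSW generation"]:
    without_commas = decoded[name].str.replace(",", "", regex=False)
    decoded[name] = pd.to_numeric(without_commas, errors="coerce")

print(decoded.dtypes)
print(decoded)
```

Naming the numeric columns explicitly keeps text columns such as country names untouched, at the cost of having to maintain that list.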

### 3.3 Data Visualization

#### Global Distribution of Latest MSW Generation (Choropleth Map)

In [None]:
import plotly.graph_objects as go
import numpy as np
# Combine historical and forecast data for plotting
historical_df_subset = historical_df[['Country Name', 'Attribute', 'MSW generation']]
historical_df_subset = historical_df_subset.rename(columns={'Attribute': 'Year'})

forecast_df['Year'] = forecast_df['Attribute'].str.split('_').str[1]
forecast_df_subset = forecast_df[['Country Name', 'Year', 'MSW generation']]

combined_df = pd.concat([historical_df_subset, forecast_df_subset])
combined_df['Year'] = pd.to_numeric(combined_df['Year'])

# Prepare the latest available data for each country using combined_df
latest_year = combined_df.groupby('Country Name')['Year'].max().reset_index()
latest_data = pd.merge(combined_df, latest_year, on=['Country Name', 'Year'], how='inner')

# Log-transform MSW generation, flooring values at 1 so that log10 is defined (and non-negative)
latest_data['log10_MSW_capped'] = np.log10(latest_data['MSW generation'].clip(lower=1))

vmin = latest_data['log10_MSW_capped'].min()
vmax = latest_data['log10_MSW_capped'].max()

fig = go.Figure(go.Choropleth(
    locations=latest_data["Country Name"],
    locationmode="country names",
    z=latest_data["log10_MSW_capped"],
    colorscale="Viridis",
    zmin=vmin,
    zmax=vmax,
    marker_line_color="gray",
    marker_line_width=0.7,
    hovertext=latest_data["Country Name"] + "<br>MSW: " + latest_data["MSW generation"].round(0).astype(str),
    hoverinfo="text",
    showscale=True,
))

fig.update_layout(
    title_text="Latest MSW Generation by Country (log₁₀ scale)",
    geo=dict(showframe=False, showcoastlines=True)
)
fig.show()

The figure above is a choropleth map of the latest available Municipal Solid Waste (MSW) generation for each country in the dataset. Because the historical and forecast tables are combined before the latest year is selected, the value shown is the final forecast year for any country that has forecast rows. Colors follow a logarithmic (log₁₀) scale so that countries remain distinguishable across a wide range of values. Hovering over a country reveals its name and the corresponding MSW value.
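To see why a log₁₀ scale helps, compare where values land on a linear and a logarithmic color scale. This is a minimal sketch with hypothetical MSW values chosen only to span several orders of magnitude; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical MSW values spanning several orders of magnitude (illustrative only).
msw = np.array([0.0, 5.0e3, 2.0e5, 3.0e7, 2.5e8])

# Floor at 1 so that zeros map to log10(1) = 0 instead of -inf.
floored = np.clip(msw, 1, None)
log_msw = np.log10(floored)

# Rescale both versions to [0, 1], which is how a color scale positions them.
linear_pos = msw / msw.max()
log_pos = (log_msw - log_msw.min()) / (log_msw.max() - log_msw.min())

for value, lin, log in zip(msw, linear_pos, log_pos):
    print(f"{value:>14,.0f}  linear={lin:.3f}  log={log:.3f}")
```

On the linear scale every value except the largest is squeezed near zero, while the log scale spreads them across the color range.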

#### MSW Generation Trends for Selected Countries

In [None]:
# Combine historical and forecast data for plotting
historical_df_subset = historical_df[['Country Name', 'Attribute', 'MSW generation']]
historical_df_subset = historical_df_subset.rename(columns={'Attribute': 'Year'})

forecast_df['Year'] = forecast_df['Attribute'].str.split('_').str[1]
forecast_df_subset = forecast_df[['Country Name', 'Year', 'MSW generation']]

combined_df = pd.concat([historical_df_subset, forecast_df_subset])
combined_df['Year'] = pd.to_numeric(combined_df['Year'])

selected_countries = ['United States', 'China', 'India', 'Brazil']
plot_df = combined_df[combined_df['Country Name'].isin(selected_countries)]

plt.figure(figsize=(14, 8))
sns.lineplot(data=plot_df, x='Year', y='MSW generation', hue='Country Name')
plt.title('Historical and Forecasted MSW Generation for Selected Countries')
plt.ylabel('MSW Generation (tonnes/year)')
plt.xlabel('Year')
plt.grid(True)
plt.show()

The figure above shows the historical and forecasted Municipal Solid Waste (MSW) generation trends for selected countries (United States, China, India, Brazil) from 1990 to 2050. Each line represents a country, illustrating changes in annual MSW generation over time. This visualization highlights both past growth and future projections, allowing for comparison of waste generation trajectories among major economies.
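The trajectories can also be summarized numerically as the total percentage change between the first and last year. The sketch below uses a toy long-format frame shaped like `combined_df`; the countries and numbers are placeholders, not dataset values.

```python
import pandas as pd

# Toy long-format frame shaped like combined_df; numbers are illustrative only.
toy = pd.DataFrame({
    "Country Name": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990, 2020, 2050, 1990, 2020, 2050],
    "MSW generation": [100.0, 150.0, 200.0, 50.0, 100.0, 200.0],
})

# One row per year, one column per country.
wide = toy.pivot(index="Year", columns="Country Name", values="MSW generation").sort_index()

# Percentage change from the first to the last year in the table.
total_change_pct = (wide.iloc[-1] / wide.iloc[0] - 1) * 100

print(total_change_pct.round(1))
```

With the real data, `plot_df` could take the place of `toy`, provided each country has exactly one row per year.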

#### Correlation Matrix of Historical Data

In [None]:
plt.figure(figsize=(10, 8))
corr_matrix = historical_df[['Population', 'GDP PPP/capita', 'MSW generation', 'CO₂ emissions', 'CH₄ emissions', 'N₂O emissions']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Historical Environmental Indicators')
plt.show()
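
To read the strongest relationships off a correlation matrix without scanning the heatmap, the variable pairs can be ranked by absolute correlation. The sketch below uses a toy frame with made-up values; only the column names mirror the notebook.

```python
from itertools import combinations

import pandas as pd

# Toy frame with three indicators; values are illustrative only.
toy = pd.DataFrame({
    "Population": [1.0, 2.0, 3.0, 4.0, 5.0],
    "MSW generation": [2.0, 4.1, 6.2, 7.9, 10.0],
    "GDP PPP/capita": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()  # Pearson correlation by default

# Collect each unordered pair of columns once, skipping the diagonal.
pairs = pd.Series({(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)})

# Sort by absolute strength, strongest first, keeping the sign in the values.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked.round(2))
```

Sorting on the absolute value keeps strong negative relationships near the top instead of burying them at the bottom of the list.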

## 4. Conclusion
This notebook provided a preliminary exploratory data analysis of the National-Scale Environmental Pressure Indicators dataset. We loaded the historical and forecast data, checked for completeness, and generated summary statistics. 

Our visualizations revealed:
- **Global Distribution of Latest MSW Generation**: The choropleth map illustrated the latest available Municipal Solid Waste (MSW) generation for each country, using a log₁₀ scale. This visualization highlighted stark differences in waste generation across countries, with major economies and populous nations producing the most waste.
- **MSW Generation Trends**: We observed the historical and projected trends in MSW generation for key countries, highlighting significant increases over time, particularly for developing nations.
- **Correlations**: The correlation matrix of the historical data showed strong positive correlations between MSW generation and all three greenhouse gas emissions, as well as between GDP per capita and CO₂ emissions. This is consistent with a link between economic activity, consumption, and environmental impact, although correlation alone does not establish causation.

This initial analysis sets the stage for more advanced modeling, such as forecasting future emissions based on socioeconomic drivers or analyzing the decoupling of economic growth from environmental pressures.