<a href="https://colab.research.google.com/github/sjsu-cs133-f25/team5-climatechange-trends/blob/main/notebooks/04_interactive_dashboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Climate Change Dataset Pattern Discovery

### Discovery plan

- **Goal:** Identify structural patterns in the global warming dataset, such as correlated variable blocks, latent dimensions, or natural clusters among countries.
- **Hypothesis:** We expect strong positive correlations among greenhouse gas emissions, industrial activity, and fossil fuel use, and possibly inverse relationships with forest area or renewable energy use.
- **Why it matters:** Revealing these structures helps policymakers target leverage points (such as high-emission economies or correlated environmental indicators).
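
One way to check the hypothesis is a rank correlation matrix over the relevant columns. The sketch below runs on a small synthetic frame (the numbers are made up for illustration and are not drawn from the dataset); only the column names are borrowed from the data card. The frame name `toy` and the noise levels are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: NOT the real dataset, only an illustration.
rng = np.random.default_rng(0)
n = 200
fossil = rng.uniform(0, 100, n)

toy = pd.DataFrame({
    "Fossil_Fuel_Usage": fossil,
    # Constructed to rise with fossil fuel use
    "CO2_Emissions": 5.0 * fossil + rng.normal(0, 20, n),
    # Constructed to fall with fossil fuel use
    "Forest_Area": 80 - 0.5 * fossil + rng.normal(0, 5, n),
})

# Spearman is rank-based, so it is less sensitive to heavy tails than Pearson.
corr = toy.corr(method="spearman")
print(corr.round(2))
```

On the real data, the same call could be applied to `df[num_cols]` once the prep cell has run; whether the expected signs actually appear is exactly what the hypothesis leaves open.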

# Data Card


Dataset source & link, shape (rows/cols), units, time coverage:
*   Dataset source: Global Warming Dataset: 195 Countries (1900-2023)
*   Dataset link: https://www.kaggle.com/datasets/ankushpanday1/global-warming-dataset-195-countries-1900-2023
*   shape: 100,000 rows and 26 columns
*   units: vary by column (see the column dictionary below); values are numerical or categorical
*   time coverage: 1900-2023


Column dictionary (human-readable; short), key ID columns, target(s) if any:
*   Columns (Country and Year are the key ID columns):

      * Country - the country identifier
      * Year - the year, from 1900-2023
      * Temperature_Anomaly - difference in temperature from baseline, in °C
      * CO2_Emissions - total CO2 emissions, in metric tons
      * Population - number of people in the country
      * Forest Area - area of forest cover, in % of land area
      * GDP - Gross Domestic Product, in USD
      * Renewable_Energy_Usage - share of energy derived from renewable sources, in %
      * Methane_Emissions - total methane emissions, in metric tons CO2 equivalent
      * Sea_Level_Rise - change in sea level, in mm
      * Arctic_Ice_Extent - area covered by arctic ice, in million km²
      * Urbanization - population living in urban areas, in %
      * Deforestation_Rate - loss of forest area, in %
      * Extreme_Weather_Events - count of extreme events that occurred
      * Average_Rainfall - average precipitation, in mm
      * Solar_Energy_Potential - potential for solar energy, in kWh/m²
      * Waste_Management - score of the country's waste management practices, in %
      * Per_Capita_Emissions - total greenhouse gas emissions per capita, in tons per person
      * Industrial_Activity - industrial output or production, in %
      * Air_Pollution_Index - air quality index, from 0-300
      * Biodiversity_Index - measures the variety of species, in %
      * Ocean_Acidification - pH level of ocean
      * Fossil_Fuel_Usage - share of energy consumption from fossil fuels, in %
      * Energy_Consumption_Per_Capita - energy usage per person
      * Policy_Score - the country's climate policy performance score, from 0-100
      * Average_Temperature - average temperature, in °C



Missingness snapshot: which columns have NaNs and rough %:
*   NaNs: 14 missing values found
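
A per-column breakdown makes the snapshot more useful than a single total. The sketch below shows one way to produce it; the helper name `missingness_snapshot` and the toy frame are hypothetical, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

def missingness_snapshot(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of NaNs for each column that has any."""
    out = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(2),
    })
    # Keep only columns with at least one NaN, worst first
    return out[out["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "Year": [1900, 1901, 1902, 1903],
    "CO2_Emissions": [1.0, np.nan, 3.0, np.nan],
    "Forest_Area": [30.0, 31.0, np.nan, 33.0],
})

snapshot = missingness_snapshot(toy)
print(snapshot)
```

Calling `missingness_snapshot(df)` right after loading would list which columns carry the missing values and their rough share.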


Known quirks (e.g., mixed types, inconsistent labels):
*    The meaning of the values in some columns (such as the policy score) and the units of some columns are ambiguous.
*    The Country column does not give actual country names; it labels countries with numeric identifiers (for example, Country_1).
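
Because the labels are strings with a numeric suffix, plain string sorting orders them lexicographically rather than numerically. The sketch below assumes every label follows the `Country_<number>` pattern seen in the notebook output; the helper name `country_number` is hypothetical.

```python
def country_number(label: str) -> int:
    """Extract the numeric suffix from a label such as 'Country_12'.

    Assumes the 'Country_<number>' pattern; any other label raises an error.
    """
    return int(label.rsplit("_", 1)[1])

labels = ["Country_10", "Country_2", "Country_1", "Country_195"]

lexicographic = sorted(labels)               # plain string order
natural = sorted(labels, key=country_number)  # numeric order

print(lexicographic)
print(natural)
```

The same key function could be used to build an ordered category list for `Country` if numeric ordering matters in plots or tables.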


### Load

In [1]:
!pip install jupyter_bokeh



In [2]:
!pip install dash



In [3]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import kagglehub

from google.colab import files

# plotly library
from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.express as px

import panel as pn
import panel.widgets as pnw

from dash import Dash, dcc, html
from base64 import b64encode
import io

# download dataset
path = kagglehub.dataset_download("ankushpanday1/global-warming-dataset-195-countries-1900-2023")

# read csv file
df = pd.read_csv(path + "/global_warming_dataset.csv")

Using Colab cache for faster access to the 'global-warming-dataset-195-countries-1900-2023' dataset.


In [4]:
# Set up the JavaScript environment that Plotly needs in Google Colab; call this at the top of each plotting cell
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))

### Prep

In [5]:
# 1 Remove spaces
df = df.rename(columns=lambda c: c.strip().replace(" ", "_"))

# 2 Convert to numeric
df["Year"] = pd.to_numeric(df["Year"], errors="coerce").astype("Int64")
df["Country"] = df["Country"].astype("category")

num_cols = df.columns.difference(["Country", "Year"])
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")

# 3 Handle missing values
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

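# 4 Group by Country, Year
# Country is categorical, so observed=False keeps every Country x Year combination;
# pairs that never occur in the data come back as all-NaN rows.
df = df.groupby(["Country", "Year"], as_index=False, observed=False).mean(numeric_only=True)

# isna().any().sum() counts the columns that contain at least one NaN, not the number of missing cells
print("Missing Values:", df.isna().any().sum())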
print("Duplicates:", df.duplicated().sum())

Missing Values: 24
Duplicates: 0
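
The count of 24 after imputation can look surprising. A minimal sketch on toy data (the values and the names `toy`, `full`, and `seen` are made up) shows two effects that may explain it: grouping on a categorical key with `observed=False` adds rows for unseen Country/Year pairs, and `isna().any().sum()` counts columns rather than cells.

```python
import pandas as pd

# Toy data: country C has no 1901 record, and neither does B.
toy = pd.DataFrame({
    "Country": pd.Categorical(["A", "A", "B", "C"]),
    "Year": [1900, 1901, 1900, 1900],
    "CO2_Emissions": [1.0, 2.0, 3.0, 4.0],
})
keys = ["Country", "Year"]

# observed=False: the result is expanded to every Country x Year combination
full = toy.groupby(keys, as_index=False, observed=False).mean(numeric_only=True)

# observed=True: only combinations present in the data are kept
seen = toy.groupby(keys, as_index=False, observed=True).mean(numeric_only=True)

cols_with_nan = int(full.isna().any().sum())   # number of columns with any NaN
cells_with_nan = int(full.isna().sum().sum())  # number of missing cells

print(len(full), len(seen), cols_with_nan, cells_with_nan)
```

If the extra rows are not wanted, passing `observed=True` in the prep cell is one option; that changes the row count, so downstream outputs would need to be regenerated.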


### Transforms

In [6]:
# Make categories ordered for nicer sorting/plots.
gdp_order = pd.CategoricalDtype(
    categories=["Low Income", "Lower-Middle Income", "Upper-Middle Income", "High Income"],
    ordered=True
)

cat_order = pd.CategoricalDtype(
    categories=["Low", "Medium", "High"],
    ordered=True
)

# Temperature increase flag
df["Temp_Increased"] = df["Temperature_Anomaly"] > 0

# Air Quality Category
def pollution_category(x):
    if x > 200:     return "Very Unhealthy"
    elif x > 150:   return "Unhealthy"
    elif x > 100:   return "Unhealthy (Sensitive)"
    elif x > 50:    return "Moderate"
    else:           return "Good"
df["Air_Quality_Category"] = df["Air_Pollution_Index"].map(pollution_category)

# GDP per capita and Income group
df["GDP_per_capita"] = df["GDP"] / df["Population"]

def gdp_category(x):
    if x >= 13846:  return "High Income"
    elif x >= 4466: return "Upper-Middle Income"
    elif x >= 1136: return "Lower-Middle Income"
    else:           return "Low Income"
df["GDP_Income_Group"] = df["GDP_per_capita"].map(gdp_category).astype(gdp_order)

# CO2 per capita
df["CO2_per_capita"] = df["CO2_Emissions"] / df["Population"]

# Renewable Energy Usage Groups
def renewable_group(x):
    if x >= 50:     return "High"
    elif x >= 20:   return "Medium"
    else:           return "Low"
df["Renewable_Group"] = df["Renewable_Energy_Usage"].apply(renewable_group).astype(cat_order)

# Industrial Activity Groups (High: >= 80, Medium: 40 to < 80, Low: < 40)
def industrial_activity_group(x):
    if x >= 80:     return "High"
    elif x >= 40:   return "Medium"   # x < 80 is already implied by the branch above
    else:           return "Low"
df['Industrial_Activity_Group'] = df.Industrial_Activity.map(industrial_activity_group).astype(cat_order)

# Decades
def get_decade(year):
  decade = year // 10
  decade_str = str(decade * 10) + "'s"
  return decade_str

df['Decade'] = df.Year.map(get_decade)

# Log versions for heavy-tailed per-capita vars (for clearer structure)
df["log_CO2_per_capita"] = np.log1p(df["CO2_per_capita"])
df["log_GDP_per_capita"] = np.log1p(df["GDP_per_capita"])

df

Unnamed: 0,Country,Year,Temperature_Anomaly,CO2_Emissions,Population,Forest_Area,GDP,Renewable_Energy_Usage,Methane_Emissions,Sea_Level_Rise,...,Temp_Increased,Air_Quality_Category,GDP_per_capita,GDP_Income_Group,CO2_per_capita,Renewable_Group,Industrial_Activity_Group,Decade,log_CO2_per_capita,log_GDP_per_capita
0,Country_1,1900,-0.335027,3.984644e+08,3.750466e+08,27.856810,4.573252e+12,60.185651,5.169077e+06,24.478590,...,False,Unhealthy,12193.823826,Upper-Middle Income,1.062440,High,High,1900's,0.723890,9.408767
1,Country_1,1901,0.170373,8.440511e+08,1.001558e+09,69.848395,4.868018e+12,39.525191,2.619170e+06,11.040926,...,True,Unhealthy (Sensitive),4860.447686,Upper-Middle Income,0.842738,Medium,Medium,1900's,0.611253,8.489092
2,Country_1,1902,0.448391,7.090039e+08,3.604418e+08,50.116560,5.464041e+12,71.867926,6.380284e+06,27.972579,...,True,Unhealthy,15159.287549,High Income,1.967041,High,High,1900's,1.087565,9.626435
3,Country_1,1903,1.254878,7.388654e+08,9.164562e+08,68.083451,4.415206e+12,45.673511,5.175568e+06,13.719263,...,True,Unhealthy,4817.695221,Upper-Middle Income,0.806220,Medium,Medium,1900's,0.591236,8.480258
4,Country_1,1904,0.455433,5.804757e+08,4.870284e+08,21.907529,5.453627e+12,55.241938,3.343214e+06,11.041168,...,True,Unhealthy (Sensitive),11197.759534,Upper-Middle Income,1.191872,High,High,1900's,0.784756,9.323558
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24175,Country_99,2019,,,,,,,,,...,False,Good,,Low Income,,Low,Low,2010's,,
24176,Country_99,2020,1.109231,2.160874e+08,5.049405e+08,45.809165,9.250810e+12,0.301486,4.548326e+06,3.454109,...,True,Good,18320.594451,High Income,0.427946,Low,Low,2020's,0.356237,9.815836
24177,Country_99,2021,-0.299082,4.082414e+08,7.817556e+08,46.226872,5.454563e+12,36.308012,7.981679e+06,29.121755,...,False,Unhealthy (Sensitive),6977.324415,Upper-Middle Income,0.522211,Medium,High,2020's,0.420164,8.850564
24178,Country_99,2022,0.123644,5.541478e+08,4.541218e+08,69.822316,3.806007e+12,41.631804,2.784789e+06,27.854428,...,True,Unhealthy,8381.027328,Upper-Middle Income,1.220263,Medium,High,2020's,0.797625,9.033845
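
In the output above, the all-NaN row for Country_99 in 2019 is still labeled Good, Low Income, and Low: every comparison against NaN is False, so the if/elif functions fall through to their else branch. One possible alternative, sketched here on made-up values, is `pd.cut`, which leaves missing inputs missing. The helper name `income_group` is hypothetical; the thresholds are copied from `gdp_category`.

```python
import numpy as np
import pandas as pd

INCOME_LABELS = ["Low Income", "Lower-Middle Income", "Upper-Middle Income", "High Income"]
# Left-closed bins reproduce the ">=" comparisons used in gdp_category
INCOME_EDGES = [-np.inf, 1136, 4466, 13846, np.inf]

def income_group(gdp_per_capita: pd.Series) -> pd.Series:
    """Bin GDP per capita into ordered income groups; NaN inputs stay NaN."""
    return pd.cut(gdp_per_capita, bins=INCOME_EDGES, labels=INCOME_LABELS, right=False)

# Toy values chosen to sit on the thresholds, plus one missing value
toy = pd.Series([500.0, 1136.0, 4466.0, 13846.0, np.nan])
groups = income_group(toy)
print(groups)
```

The result is an ordered categorical, so it sorts in the same Low-to-High order that `gdp_order` defines.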


### Structured Figures

In [7]:
configure_plotly_browser_state()
pn.extension('plotly')

# Interactive Scatter Plot
df_no_GDP_per_capita_outliers = df[df.GDP_per_capita < 100000]

x_axis_cols = {"x-axis values": ["GDP_per_capita", "GDP"]}
y_axis_cols = {"y-axis values": ["CO2_Emissions", "Methane_Emissions", "Deforestation_Rate"]}

x_axis = pnw.Select(groups=x_axis_cols, name="X-Axis Options")
y_axis = pnw.Select(groups=y_axis_cols, name="Y-Axis Options")

def _scatter(x, y):
    fig = px.scatter(
        df_no_GDP_per_capita_outliers,
        x=x, y=y,
        color="Industrial_Activity_Group",
        width=1000, height=1000,
        title="Interactive Plot of GDP vs. Different Emissions and Deforestation Rate by Industrial Activity Groups"
    )
    fig.add_annotation(
        text="Note: GDP_per_capita ≥ 100,000 excluded.",
        xref="paper", yref="paper", x=0.0, y=-0.12, showarrow=False
    )
    return fig

plot_scatter = pn.bind(_scatter, x=x_axis, y=y_axis)

options = pn.Row(x_axis,
                 y_axis,
                 )

dashboard = pn.Column(options,
          plot_scatter
          )




In [8]:
configure_plotly_browser_state()
pn.extension('plotly')

# Interactive Pie Plot

col_groups = {"x-values": ["GDP_Income_Group", "Industrial_Activity_Group", "Renewable_Group"]}
col_groups_2 = {"y-values": ["CO2_Emissions", "Methane_Emissions", "Deforestation_Rate"]}

names = pnw.Select(groups=col_groups, name="Group (names)")
values = pnw.Select(groups=col_groups_2, name="Value (values)")


plot_pie = pn.bind(px.pie,
        df,
        names=names,
        values=values,
        width=1000,
        height=1000,
        title = "Interactive Pie Plot of Emissions and Deforestation Rate by GDP, Industrial Activity, and Renewable Groups",
        # labels={'GDP_Income_Group': 'GDP_Income_Group', 'Industrial_Activity_Group': 'Industrial_Activity_Group', 'Renewable_Group': 'Renewable_Group'}
               );

options = pn.Row(names,
                 values,
                 )

dashboard = pn.Column(options,
          plot_pie
          )
dashboard.save("plot_pie.html", embed=True, resources="inline")




In [9]:
configure_plotly_browser_state()

# Animated Scatter Plot
g = px.scatter(df,
               x="GDP_per_capita", y="CO2_Emissions",
               color="Industrial_Activity_Group",
               hover_name="Industrial_Activity_Group",
               title="GDP per capita vs. CO2 Emissions over the years by Industrial Activity Group",
               animation_frame='Year'
               )
g.update_layout(xaxis_range=[0, 20000])

dashboard = pn.Column(pn.pane.Plotly(g, height=600))
dashboard.save("plot_animated.html", embed=True, resources="inline")  # self-contained



# Full Dashboard

In [10]:
from panel.widgets import Tabulator

# Animated scatter (one frame per year)
g = px.scatter(
    df,
    x="GDP_per_capita", y="CO2_Emissions",
    color="Industrial_Activity_Group",
    hover_name="Industrial_Activity_Group",
    title="GDP per capita vs. CO₂ Emissions over the years by Industrial Activity Group",
    animation_frame='Year'
)
g.update_layout(xaxis_range=[0, 20000])
animated_view = pn.pane.Plotly(g, height=650)

# Pie
col_groups = {"x-values": ["GDP_Income_Group", "Industrial_Activity_Group", "Renewable_Group"]}
col_groups_2 = {"y-values": ["CO2_Emissions", "Methane_Emissions", "Deforestation_Rate"]}

names = pnw.Select(groups=col_groups, name="Group (names)")
values = pnw.Select(groups=col_groups_2, name="Value (values)")

plot_pie = pn.bind(
    px.pie, df,
    names=names, values=values,
    width=1000, height=800,
    title="Emissions / Deforestation by Groups"
)

options_pie = pn.Row(names, values)
pie_section = pn.Column(options_pie, plot_pie)

# Interactive scatter (axes chosen with the Select widgets)
df_no_GDP_per_capita_outliers = df[df.GDP_per_capita < 100000]

x_axis_cols = {"x-axis values": ["GDP_per_capita", "GDP"]}
y_axis_cols = {"y-axis values": ["CO2_Emissions", "Methane_Emissions", "Deforestation_Rate"]}

x_axis = pnw.Select(groups=x_axis_cols, name="X-Axis")
y_axis = pnw.Select(groups=y_axis_cols, name="Y-Axis")

def _scatter(x, y):
    fig = px.scatter(
        df_no_GDP_per_capita_outliers,
        x=x, y=y,
        color="Industrial_Activity_Group",
        width=1000, height=800,
        title="GDP vs. Emissions / Deforestation by Industrial Activity Groups"
    )
    fig.add_annotation(
        text="Note: GDP_per_capita ≥ 100,000 excluded.",
        xref="paper", yref="paper", x=0.0, y=-0.12, showarrow=False
    )
    return fig

plot_scatter = pn.bind(_scatter, x=x_axis, y=y_axis)
options_scatter = pn.Row(x_axis, y_axis)
scatter_section = pn.Column(options_scatter, plot_scatter)

# Data table
year = pnw.IntRangeSlider(name="Year", start=int(df["Year"].min()), end=int(df["Year"].max()),
                          value=(int(df["Year"].min()), int(df["Year"].max())))

fdf = pn.bind(lambda d, r: d[(d["Year"] >= r[0]) & (d["Year"] <= r[1])], df, year)

table = pn.bind(lambda d: Tabulator(
    d[["Country","Year","GDP_Income_Group","Industrial_Activity_Group",
       "GDP_per_capita","CO2_Emissions","Methane_Emissions","Deforestation_Rate"]]
      .sort_values(["Year","Country"]).head(200),
    page_size=20, pagination='local', sizing_mode="stretch_width"  # 'remote' pagination needs a live server, which a saved HTML file lacks
), fdf)

tabs = pn.Tabs(
    ("Scatter Plot", scatter_section),
    ("Pie Plot", pie_section),
    ("Animated Plot", animated_view),
    ("Data Table", pn.Column(year, table)),
)

dashboard = pn.Column("# Climate Dashboard (Panel + Plotly)", tabs)
dashboard.save("climate_dashboard.html", embed=True, resources="inline")
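
The year filter bound to the slider is an inline lambda, which makes it hard to test on its own. A minimal sketch of the same logic as a named function follows; the name `filter_years` and the toy frame are hypothetical, and the function could be passed to `pn.bind` in place of the lambda.

```python
import pandas as pd

def filter_years(frame: pd.DataFrame, year_range: tuple) -> pd.DataFrame:
    """Keep rows whose Year lies inside year_range, endpoints included."""
    start, end = year_range
    # Series.between is inclusive on both ends by default
    return frame[frame["Year"].between(start, end)]

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "Year": [1900, 1950, 2000, 2023],
    "CO2_Emissions": [1.0, 2.0, 3.0, 4.0],
})

subset = filter_years(toy, (1950, 2000))
print(subset)
```

Keeping the filter separate from the widget wiring lets it be checked without starting Panel.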

