# 04 Predictive Modeling (Simple and Explainable)

## Objectives

- Build a simple, explainable forecast model
- Evaluate performance with time-aware splits
- Export predictions for the dashboard

## Inputs

- data/processed/v1/environmental_trends_clean.csv

## Outputs

- data/processed/v1/model_predictions.csv

## Additional Comments

- Report limitations and avoid overclaiming

## Purpose and Context

This notebook builds a **simple, explainable forecasting model** to project temperature trends for 2025-2029. We prioritize transparency over complexity to ensure stakeholders can understand and appropriately trust (or question) the predictions.

**Connection to project guidelines:**

- **Ethics (LO1.1)**: Transparent modeling prevents "black box" predictions that obscure uncertainty
- **Communication (LO2.1)**: Simple models (linear regression) are easier to explain than complex ML algorithms
- **Limitations (LO3.2)**: We explicitly document what the model *cannot* do (no confidence intervals, no exogenous factors, linear assumption)
- **Responsible use**: Dashboard will clearly label forecasts as exploratory projections, not definitive predictions

**Why simple models matter:**

Complex models (neural networks, ensemble methods) might fit historical data better, but:
- We only have ~25 years per country (small dataset → overfitting risk)
- Non-technical users need to understand *how* predictions are made
- Ethical AI practice requires explainability, especially for public-facing climate tools
- Simpler models make limitations more obvious (reducing false confidence)

**Model limitations we acknowledge:**

1. Linear trend assumption (real climate may accelerate or plateau)
2. No external factors (ignores emissions changes, policy, economic shifts)
3. Short time series (only 2000-2024 available per country)
4. No uncertainty quantification (point estimates only, no confidence intervals)
5. Past trends may not continue if future conditions change dramatically

The dashboard will communicate these limitations clearly to prevent overreliance on the forecasts.

---


# Change working directory

In [1]:
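import os
# Move up one level so relative paths such as data/processed/... resolve
# from the project root. Run this cell once only: each run moves up again.
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
os.getcwd()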

'c:\\Users\\sergi\\OneDrive\\Documents\\Code Institute Data analytics\\Capstone project 3\\Global_environmental_trends_2000_2024\\global_env_trend'

# Load processed data

In [2]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
clean_path = "data/processed/v1/environmental_trends_clean.csv"
df = pd.read_csv(clean_path)
df.head()

|   | Year | Country | Avg_Temperature_degC | CO2_Emissions_tons_per_capita | Sea_Level_Rise_mm | Rainfall_mm | Population | Renewable_Energy_pct | Extreme_Weather_Events | Forest_Area_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2000 | United States | 13.5 | 20.2 | 0 | 715 | 282500000 | 6.2 | 38 | 33.1 |
| 1 | 2000 | China | 12.8 | 2.7 | 0 | 645 | 1267000000 | 16.5 | 24 | 18.8 |
| 2 | 2000 | Germany | 9.3 | 10.1 | 0 | 700 | 82200000 | 6.6 | 12 | 31.8 |
| 3 | 2000 | Brazil | 24.9 | 1.9 | 0 | 1760 | 175000000 | 83.7 | 18 | 65.4 |
| 4 | 2000 | Australia | 21.7 | 17.2 | 0 | 534 | 19200000 | 8.8 | 11 | 16.2 |


# Prepare features and target

**What we're doing:**

Setting up the data for our forecasting model:

1. **Remove missing values**: We need complete data (Country, Year, Temperature) for reliable predictions
2. **Remove duplicates**: Ensure each Country-Year combination appears only once
3. **Sort by Country and Year**: Keeps time series organized for each country
4. **Define split year (2018)**: We'll train models on 2000-2018, test on 2019-2024
5. **Define forecast horizon**: We'll predict 5 years into the future (2025-2029)

**Why per-country models?**

Different countries have different temperature trajectories due to:
- Geographic location (latitude, elevation, coastal vs inland)
- Local climate patterns
- Urbanization and land use changes

Using separate models for each country produces more realistic, country-specific forecasts rather than a one-size-fits-all global prediction.

In [None]:
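# Keep only rows with a country, a year, and a temperature reading.
df_model = df.dropna(subset=["Country", "Year", "Avg_Temperature_degC"]).copy()
# Order each country's series by year and keep one row per Country-Year pair.
df_model = df_model.sort_values(["Country", "Year"]).drop_duplicates(["Country", "Year"])
# Train on years up to and including split_year; test on later years.
split_year = 2018
# Forecast the five years after the last observed year.
max_year = int(df_model["Year"].max())
forecast_years = list(range(max_year + 1, max_year + 6))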
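As a quick check on the preparation steps above, the sketch below applies the same cleaning calls to a small made-up frame. It confirms that each Country-Year pair ends up unique and that the forecast horizon starts the year after the last observation. The frame `toy` and its values are illustrative assumptions, not project data.

```python
import pandas as pd

# Made-up rows: one duplicated Country-Year pair and one missing temperature.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [2023, 2023, 2024, 2023, 2024],
    "Avg_Temperature_degC": [10.0, 10.0, 10.2, None, 20.1],
})

# Same cleaning steps as the notebook cell, applied to the toy frame.
toy_model = toy.dropna(subset=["Country", "Year", "Avg_Temperature_degC"]).copy()
toy_model = toy_model.sort_values(["Country", "Year"]).drop_duplicates(["Country", "Year"])

# Each Country-Year pair should now appear exactly once.
assert not toy_model.duplicated(["Country", "Year"]).any()

# The forecast horizon is the five years after the last observed year.
toy_max_year = int(toy_model["Year"].max())
toy_forecast_years = list(range(toy_max_year + 1, toy_max_year + 6))
print(len(toy_model), toy_forecast_years)
```

Running the same duplicate check against `df_model` would be a cheap guard before fitting any models.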

# Time-aware split and model training

**Our modeling approach:**

For each country, we:
1. **Split data chronologically** at 2018 (train: 2000-2018, test: 2019-2024)
2. **Fit a simple linear regression** using only Year as a predictor
3. **Evaluate on test period** (2019-2024) using MAE and RMSE
4. **Generate forecasts** for 2025-2029

**Why linear regression?**

We chose the simplest possible model because:
- **Transparency**: Easy to explain to non-technical audiences ("temperature increases by X degrees per year")
- **Explainability**: Decision-makers can understand and trust the logic
- **Limited data**: We only have ~25 years per country, so complex models would overfit
- **Ethical practice**: Avoid black-box models that obscure uncertainty
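To illustrate the "X degrees per year" reading, here is a minimal sketch that fits the same kind of model to a synthetic series and reads the trend from the fitted slope. The data are made up, with a warming rate of 0.02 °C per year chosen purely for illustration, and the names `toy`, `slope_per_year`, and `slope_per_decade` are assumptions of this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic series: exactly 0.02 degC of warming per year from a 14.0 degC start.
years = np.arange(2000, 2019)
toy = pd.DataFrame({
    "Year": years,
    "Avg_Temperature_degC": 14.0 + 0.02 * (years - 2000),
})

model = LinearRegression()
model.fit(toy[["Year"]], toy["Avg_Temperature_degC"])

# With Year as the only predictor, the single coefficient is the yearly trend.
slope_per_year = float(model.coef_[0])
slope_per_decade = slope_per_year * 10
print(f"Trend: {slope_per_year:.3f} degC per year ({slope_per_decade:.2f} degC per decade)")
```

Because the model has one coefficient per country, that single number is the whole explanation a non-technical reader needs.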

**Understanding the metrics:**

- **MAE (Mean Absolute Error)**: Average absolute prediction error, in degrees Celsius (lower is better)
  - Example: MAE = 0.5°C means predictions are off by half a degree on average
- **RMSE (Root Mean Squared Error)**: Square root of the average squared error, also in degrees Celsius; it penalizes large errors more heavily than MAE (lower is better)
  - Example: RMSE = 0.7°C alongside MAE = 0.5°C suggests that some predictions miss by well over half a degree
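The difference between the two metrics is easiest to see on a small worked example. The numbers below are made up: three predictions miss by 0.1 °C and one misses by 1.0 °C. The single large miss pulls RMSE well above MAE.

```python
import numpy as np

# Made-up values: three small misses and one large miss of 1.0 degC.
actual = np.array([14.0, 14.2, 14.1, 14.5])
predicted = np.array([14.1, 14.1, 14.2, 13.5])

errors = predicted - actual

# MAE averages the absolute errors: (0.1 + 0.1 + 0.1 + 1.0) / 4.
mae = np.mean(np.abs(errors))

# RMSE squares each error before averaging, so the 1.0 degC miss dominates.
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE  = {mae:.3f} degC")
print(f"RMSE = {rmse:.3f} degC")
```

Here MAE is 0.325 °C while RMSE is about 0.507 °C. RMSE is never smaller than MAE, and a wide gap between them signals a few large errors rather than uniformly moderate ones.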

**Model limitations we must communicate:**

1. **Short time series**: Only 25 years of data per country limits accuracy
2. **Linear assumption**: Real climate trends may accelerate or plateau (non-linear)
3. **No exogenous factors**: We ignore emissions, policy changes, economic shifts
4. **Uncertainty not quantified**: We provide point estimates, not confidence intervals
5. **Past ≠ future**: Historical trends may not continue if conditions change dramatically

**Responsible use in dashboard:**

The dashboard will:
- Label forecasts clearly as "projections based on historical trends"
- Include a disclaimer about model simplicity and limitations
- Encourage users to view forecasts as exploratory, not definitive
- Avoid making policy recommendations based solely on these predictions

In [None]:
test_rows = []
forecast_rows = []

# Fit one linear trend per country so each gets its own slope and intercept.
for country, group in df_model.groupby("Country"):
    grp = group.sort_values("Year").drop_duplicates("Year")
    if len(grp) < 3:
        continue
    # Chronological split: no future years leak into training.
    train = grp[grp["Year"] <= split_year]
    test = grp[grp["Year"] > split_year]
    # A line needs at least two training points.
    if len(train) < 2:
        continue
    model = LinearRegression()
    model.fit(train[["Year"]], train["Avg_Temperature_degC"])
    if len(test) > 0:
        preds = model.predict(test[["Year"]])
        for year, pred, actual in zip(test["Year"], preds, test["Avg_Temperature_degC"]):
            test_rows.append({
                "Country": country,
                "Year": int(year),
                "Predicted_Avg_Temperature_degC": float(pred),
                "Actual_Avg_Temperature_degC": float(actual)
            })
    # Predict from a DataFrame with the same column name used in fit,
    # which avoids scikit-learn's feature-name warning.
    future = pd.DataFrame({"Year": forecast_years})
    for year, pred in zip(forecast_years, model.predict(future)):
        forecast_rows.append({
            "Country": country,
            "Year": int(year),
            "Predicted_Avg_Temperature_degC": float(pred)
        })

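test_df = pd.DataFrame(test_rows)
if len(test_df) > 0:
    mae = mean_absolute_error(test_df["Actual_Avg_Temperature_degC"], test_df["Predicted_Avg_Temperature_degC"])
    # The squared=False argument was deprecated in scikit-learn 1.4 and removed
    # in later releases, so take the square root of the MSE instead.
    rmse = mean_squared_error(test_df["Actual_Avg_Temperature_degC"], test_df["Predicted_Avg_Temperature_degC"]) ** 0.5
    metrics = (mae, rmse)
else:
    metrics = "No test data available after split_year; metrics skipped."
# A bare expression is displayed only at the top level of a cell, not inside if/else.
metrics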

(6.399958579881656, 7.1623526237200075)
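The MAE and RMSE above are pooled over every country and test year, so one number hides which countries the linear trend fits poorly. An MAE of about 6.4 °C is also far larger than the half-degree example used earlier, which is a reason to inspect the errors before relying on the forecasts. One way to do that is a per-country breakdown. The sketch below uses a made-up frame, `toy_test`, with the same column names as `test_df`; the helper columns `Abs_Error` and `Sq_Error` are names introduced for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for test_df, using the notebook's column names.
toy_test = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2019, 2020, 2019, 2020],
    "Predicted_Avg_Temperature_degC": [10.0, 10.0, 20.0, 20.0],
    "Actual_Avg_Temperature_degC": [10.5, 9.5, 23.0, 24.0],
})

error = toy_test["Predicted_Avg_Temperature_degC"] - toy_test["Actual_Avg_Temperature_degC"]
toy_test["Abs_Error"] = error.abs()
toy_test["Sq_Error"] = error ** 2

# Aggregate per country, then list the worst-fitting countries first.
grouped = toy_test.groupby("Country")
per_country = pd.DataFrame({
    "MAE": grouped["Abs_Error"].mean(),
    "RMSE": np.sqrt(grouped["Sq_Error"].mean()),
    "N": grouped.size(),
}).sort_values("MAE", ascending=False)

print(per_country)
```

In this toy case country B dominates the pooled error. Applying the same breakdown to `test_df` would show whether the large pooled error comes from a few countries or from all of them.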

# Export predictions

**What we're saving:**

The forecast dataset contains:
- **Year**: 2025, 2026, 2027, 2028, 2029
- **Country**: Each of the 19 countries with sufficient data
- **Predicted_Avg_Temperature_degC**: Forecasted temperature based on historical trend

This CSV file will be imported into Tableau to create the forecast visualization sheet.

**How the dashboard will use this:**

Users will see:
- Country-specific forecast lines (not a single global average)
- Year range from 2000-2029 (historical + forecast)
- Clear visual distinction between observed data and predictions
- Tooltips explaining the forecast methodology and limitations
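The exported file holds forecast rows only, so showing 2000-2029 with a clear visual distinction means combining it with the historical data somewhere. The notebook does not specify how this is done in Tableau. As one possible approach, the sketch below stacks observed and forecast rows into a single long table with a label column that a chart could use for color or line style. The column names `Temperature_degC` and `Series_Type` are hypothetical, and the values are made up.

```python
import pandas as pd

# Made-up observed and forecast rows using the notebook's column names.
observed = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": [2023, 2024],
    "Avg_Temperature_degC": [14.1, 14.2],
})
forecast = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": [2025, 2026],
    "Predicted_Avg_Temperature_degC": [14.3, 14.4],
})

# Give both frames one shared value column plus a label for the data source.
# "Temperature_degC" and "Series_Type" are hypothetical names for this sketch.
observed_long = observed.rename(
    columns={"Avg_Temperature_degC": "Temperature_degC"}
).assign(Series_Type="Observed")
forecast_long = forecast.rename(
    columns={"Predicted_Avg_Temperature_degC": "Temperature_degC"}
).assign(Series_Type="Forecast")

combined = (
    pd.concat([observed_long, forecast_long], ignore_index=True)
    .sort_values(["Country", "Year"])
    .reset_index(drop=True)
)
print(combined)
```

Keeping the label in the data rather than in the chart configuration means the observed/forecast distinction survives any later re-export.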

**Final data governance note:**

By versioning this output as `data/processed/v1/model_predictions.csv`:
- We maintain a record of exactly what predictions were shown
- Future model updates will go to v2, preserving reproducibility
- Anyone reviewing the dashboard can trace forecasts back to this notebook
- We uphold transparency and accountability standards for public-facing data products

In [None]:
preds_df = pd.DataFrame(forecast_rows)
preds_path = "data/processed/v1/model_predictions.csv"
preds_df.to_csv(preds_path, index=False)
preds_path

'data/processed/v1/model_predictions.csv'