# 04 Predictive Modeling (Simple and Explainable)

## Objectives

- Build a simple, explainable forecast model
- Evaluate performance with time-aware splits
- Export predictions for the dashboard

## Inputs

- data/processed/v1/environmental_trends_clean.csv

## Outputs

- data/processed/v1/model_predictions.csv

## Additional Comments

- Report limitations and avoid overclaiming

## Purpose and Context

This notebook builds a simple, explainable forecasting model to project temperature trends for 2025-2029. We prioritize transparency over complexity to ensure stakeholders can understand and appropriately trust (or question) the predictions.

This notebook connects to several project learning outcomes:

- **Ethics:** transparent modeling prevents "black box" predictions that obscure uncertainty.
- **Communication:** simple models like linear regression are easier to explain than complex machine learning algorithms.
- **Limitations:** we explicitly document what the model cannot do (no confidence intervals, no exogenous factors, linear assumption).
- **Responsible use:** the dashboard will clearly label forecasts as exploratory projections, not definitive predictions.

Simple models matter here for four reasons. Complex models such as neural networks or ensemble methods might fit the historical data better, but with only about 25 years of data per country they are prone to overfitting. Non-technical users need to understand how predictions are made. Ethical AI practice requires explainability, especially for public-facing climate tools. Finally, simpler models make their limitations more obvious, which reduces false confidence.

We acknowledge five model limitations:

- **Linear trend assumption:** real climate may accelerate or plateau rather than follow a straight line.
- **No external factors:** we ignore emissions changes, policy, and economic shifts.
- **Short time series:** only 2000-2024 data is available per country.
- **No uncertainty quantification:** we provide point estimates only, without confidence intervals.
- **Past trends may not continue:** future conditions could change dramatically.

The dashboard will communicate these limitations clearly to prevent overreliance on the forecasts.

---

# Change working directory

In [None]:
import os
from pathlib import Path

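# Try to read the notebook's directory from the IPython kernel (best effort)
try:
    from IPython import get_ipython
    notebook_dir = Path(get_ipython().kernel.comm_manager.kernel.notebook_dir) if hasattr(get_ipython(), 'kernel') else None
except Exception:
    # The attribute chain above is not guaranteed to exist; fall back to the checks below
    notebook_dir = None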

# If we got the notebook dir, use it; otherwise use absolute path
if notebook_dir and (notebook_dir / "jupyter_notebooks").exists():
    os.chdir(notebook_dir / "jupyter_notebooks" / "..")
elif (Path.cwd() / "jupyter_notebooks").exists():
    os.chdir(Path.cwd() / "jupyter_notebooks" / "..")
else:
    # Use explicit absolute path
    project_root = Path(r"c:\Users\sergi\OneDrive\Documents\Code Institute Data analytics\Capstone project 3\Global_environmental_trends_2000_2024\global_env_trend")
    os.chdir(project_root)

print(f"Working directory: {os.getcwd()}")

Working directory: c:\Users\sergi\OneDrive\Documents\Code Institute Data analytics\Capstone project 3
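
The working directory printed above is not guaranteed to be the project root, and the relative path `data/processed/v1/...` only resolves from that root. One possible alternative, sketched here under the assumption that the project root is the nearest folder containing `data/processed`, is to walk upward from the current directory until that marker is found. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "data/processed") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker folder is an assumption about the layout.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    return None


# Report what would be detected from the current directory (no chdir here)
root = find_project_root(Path.cwd())
print(f"Detected project root: {root}")
```

If a root is found, `os.chdir(root)` would make the relative paths used later in the notebook resolve without a machine-specific absolute path.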


# Load processed data

In [21]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
clean_path = "data/processed/v1/environmental_trends_clean.csv"
df = pd.read_csv(clean_path)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/processed/v1/environmental_trends_clean.csv'

# Prepare features and target

**What we're doing:**

Setting up the data for our forecasting model involves five steps:

- **Remove missing values:** we need complete data (Country, Year, Temperature) for reliable predictions.
- **Remove duplicates:** each Country-Year combination should appear only once.
- **Sort by Country and Year:** this keeps the time series organized for each country.
- **Define the split year (2018):** we train models on 2000-2018 and test on 2019-2024.
- **Define the forecast horizon:** we predict 5 years into the future, from 2025 to 2029.

Why per-country models? Different countries have different temperature trajectories due to geographic location (latitude, elevation, coastal versus inland), local climate patterns, and urbanization and land use changes. Using separate models for each country produces more realistic, country-specific forecasts rather than a one-size-fits-all global prediction.

In [7]:
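# Keep only rows with a country, a year, and a temperature value
df_model = df.dropna(subset=["Country", "Year", "Avg_Temperature_degC"]).copy()
# Order each country's series by year and keep one row per Country-Year pair
df_model = df_model.sort_values(["Country", "Year"]).drop_duplicates(["Country", "Year"])
# Train on years up to and including split_year; test on later years
split_year = 2018
max_year = int(df_model["Year"].max())
# Forecast the five years after the last observed year
forecast_years = list(range(max_year + 1, max_year + 6))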
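
As a rough illustration of what the preparation cell does, the sketch below applies the same steps to a tiny invented table. The values and the `toy_` names are made up for the example and are not project data.

```python
import pandas as pd

# Toy data (invented): one duplicated Country-Year row and one missing temperature
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [2023, 2023, 2024, 2024, 2023],
    "Avg_Temperature_degC": [10.0, 10.0, 10.2, None, 20.1],
})

# Same cleaning steps as the notebook cell above
toy_model = toy.dropna(subset=["Country", "Year", "Avg_Temperature_degC"]).copy()
toy_model = toy_model.sort_values(["Country", "Year"]).drop_duplicates(["Country", "Year"])

# Forecast horizon: the five years after the last observed year
toy_max_year = int(toy_model["Year"].max())
toy_forecast_years = list(range(toy_max_year + 1, toy_max_year + 6))

print(toy_model)
print(toy_forecast_years)
```

The missing-value row and the duplicate row are dropped, and the forecast horizon follows from the last observed year rather than being hard-coded.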

# Time-aware split and model training

**Our modeling approach:**

For each country, we follow a specific process. We split data chronologically at 2018 (train: 2000-2018, test: 2019-2024). We fit a simple linear regression using only Year as a predictor. We evaluate on the test period (2019-2024) using MAE and RMSE. We generate forecasts for 2025-2029.
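
The chronological split described above can be sketched for a single country as follows. The series is synthetic (a made-up straight line) and only serves to show that every training year precedes every test year.

```python
import pandas as pd

split_year = 2018  # same cut-off as the notebook

# Synthetic single-country series for 2000-2024 (values are invented)
years = list(range(2000, 2025))
series = pd.DataFrame({
    "Year": years,
    "Avg_Temperature_degC": [14.0 + 0.02 * (y - 2000) for y in years],
})

# Time-aware split: no shuffling, the test period lies strictly after training
train = series[series["Year"] <= split_year]
test = series[series["Year"] > split_year]

print(len(train), len(test))
print(train["Year"].max(), test["Year"].min())
```

Unlike a random split, this keeps future observations out of the training data, which is closer to how the forecast will actually be used.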

Why linear regression? We chose the simplest possible model for four reasons. It is transparent: the result is easy to explain to non-technical audiences ("temperature increases by X degrees per year"). It is explainable: decision-makers can follow the logic and judge how far to trust it. It suits the limited data: with only about 25 years per country, complex models would overfit. It supports ethical practice: we avoid black-box models that obscure uncertainty.
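
To make the "X degrees per year" statement concrete, here is a minimal sketch of reading the fitted slope from a `LinearRegression` model. The training series is synthetic with a known slope of 0.03, and the variable names `slope_per_year` and `slope_per_decade` are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training series for 2000-2018 with a known trend (values invented)
years = list(range(2000, 2019))
train = pd.DataFrame({
    "Year": years,
    "Avg_Temperature_degC": [14.0 + 0.03 * (y - 2000) for y in years],
})

model = LinearRegression()
model.fit(train[["Year"]], train["Avg_Temperature_degC"])

# With Year as the only feature, coef_[0] is the change in degC per year
slope_per_year = float(model.coef_[0])
slope_per_decade = slope_per_year * 10
print(f"Trend: {slope_per_year:+.3f} degC per year ({slope_per_decade:+.2f} degC per decade)")
```

A single number per country is what makes this model easy to communicate: the whole forecast is that slope extended forward.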

Two metrics summarize model performance on the test period. Mean Absolute Error (MAE) is the average absolute prediction error in degrees Celsius; lower is better. For example, an MAE of 0.5°C means predictions are off by half a degree on average. Root Mean Squared Error (RMSE) is the square root of the average squared error, so it penalizes larger errors more heavily; lower is better. For example, an RMSE of 0.7°C indicates a typical deviation of about 0.7°C, with large misses counting for more than small ones.
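
The two metrics can be worked through by hand on a few invented numbers. This sketch uses plain Python rather than scikit-learn so that each step of the arithmetic is visible.

```python
import math

# Invented example values in degrees Celsius
actual = [14.0, 14.5, 15.0, 15.5]
predicted = [14.5, 14.5, 14.5, 16.5]

# Signed errors: 0.5, 0.0, -0.5, 1.0
errors = [p - a for p, a in zip(predicted, actual)]

# MAE: mean of absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean of squared errors
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE = {mae:.3f} degC, RMSE = {rmse:.3f} degC")
```

Here the single 1.0°C miss pulls RMSE (about 0.61°C) above MAE (0.5°C), which is the "penalizes larger errors" behavior in action.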

We must communicate the following model limitations. The time series are short: only about 25 years of data per country limits accuracy. The trend is assumed to be linear, whereas real climate trends may accelerate or plateau. Exogenous factors are excluded: we ignore emissions, policy changes, and economic shifts. Uncertainty is not quantified: we provide point estimates, not confidence intervals. Finally, the past may not predict the future: historical trends may not continue if conditions change dramatically.

Responsible use in the dashboard requires clear communication. The dashboard will label forecasts as "projections based on historical trends," include a disclaimer about model simplicity and limitations, encourage users to view forecasts as exploratory rather than definitive, and avoid making policy recommendations based solely on these predictions.

In [8]:
test_rows = []
forecast_rows = []

for country, group in df_model.groupby("Country"):
    grp = group.sort_values("Year").drop_duplicates("Year")
    if len(grp) < 3:
        continue
    train = grp[grp["Year"] <= split_year]
    test = grp[grp["Year"] > split_year]
    if len(train) < 2:
        continue
    model = LinearRegression()
    model.fit(train[["Year"]], train["Avg_Temperature_degC"])
    if len(test) > 0:
        preds = model.predict(test[["Year"]])
        # Pair each test year with its prediction and the observed value
        for year, pred, actual in zip(test["Year"], preds, test["Avg_Temperature_degC"]):
            test_rows.append({
                "Country": country,
                "Year": int(year),
                "Predicted_Avg_Temperature_degC": float(pred),
                "Actual_Avg_Temperature_degC": float(actual)
            })
    # Predict from a DataFrame so the feature name matches the one used in fit
    future = pd.DataFrame({"Year": forecast_years})
    for year, pred in zip(forecast_years, model.predict(future)):
        forecast_rows.append({
            "Country": country,
            "Year": int(year),
            "Predicted_Avg_Temperature_degC": float(pred)
        })

test_df = pd.DataFrame(test_rows)
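if len(test_df) > 0:
    mae = mean_absolute_error(test_df["Actual_Avg_Temperature_degC"], test_df["Predicted_Avg_Temperature_degC"])
    # Take the square root of MSE directly; the squared=False argument is deprecated in newer scikit-learn releases
    rmse = mean_squared_error(test_df["Actual_Avg_Temperature_degC"], test_df["Predicted_Avg_Temperature_degC"]) ** 0.5
    # Expressions inside an if/else block are not displayed by the notebook, so print explicitly
    print(f"MAE: {mae:.3f} degC, RMSE: {rmse:.3f} degC")
else:
    print("No test data available after split_year; metrics skipped.")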
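
The cell above pools all countries into one MAE and one RMSE. Because a separate model is fitted per country, it may also be useful to see which countries the linear trend fits worst. The sketch below shows one way to do that on a toy stand-in for `test_df`; the values and the `Abs_Error` column name are invented for illustration.

```python
import pandas as pd

# Toy stand-in for test_df (values invented)
toy_test_df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2019, 2020, 2019, 2020],
    "Predicted_Avg_Temperature_degC": [10.5, 10.5, 20.0, 21.5],
    "Actual_Avg_Temperature_degC": [10.0, 11.0, 20.0, 20.0],
})

# Absolute error per row (Abs_Error is a hypothetical helper column)
toy_test_df["Abs_Error"] = (
    toy_test_df["Predicted_Avg_Temperature_degC"]
    - toy_test_df["Actual_Avg_Temperature_degC"]
).abs()

# MAE per country, worst-fitting country first
per_country_mae = (
    toy_test_df.groupby("Country")["Abs_Error"].mean().sort_values(ascending=False)
)
print(per_country_mae)
```

With only six test years per country, these per-country figures would be noisy and are best read as a rough ranking rather than a precise score.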



# Export predictions

**What we're saving:**

The forecast dataset contains Year (2025, 2026, 2027, 2028, 2029), Country (each of the 19 countries with sufficient data), and Predicted_Avg_Temperature_degC (forecasted temperature based on historical trend). This CSV file will be imported into Tableau to create the forecast visualization sheet.

The dashboard will use this file in several ways. Users will see country-specific forecast lines rather than a single global average. The year range 2000-2029 combines historical data with forecasts. Observed data and predictions are visually distinct. Tooltips explain the forecast methodology and limitations.
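
One way the combined 2000-2029 view could be prepared before it reaches the dashboard is to stack observed and forecast rows into a single long table with a label column. This is only a sketch on toy data: the column names `Temperature_degC` and `Series_Type` are assumptions, and the notebook itself exports the forecast rows only.

```python
import pandas as pd

# Toy stand-ins (values invented) for the observed data and the forecast export
observed = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": [2023, 2024],
    "Avg_Temperature_degC": [10.0, 10.2],
})
forecast = pd.DataFrame({
    "Country": ["A", "A"],
    "Year": [2025, 2026],
    "Predicted_Avg_Temperature_degC": [10.3, 10.4],
})

# Give both tables a shared value column and a label (hypothetical names)
observed_long = observed.rename(
    columns={"Avg_Temperature_degC": "Temperature_degC"}
).assign(Series_Type="Observed")
forecast_long = forecast.rename(
    columns={"Predicted_Avg_Temperature_degC": "Temperature_degC"}
).assign(Series_Type="Forecast")

combined = pd.concat([observed_long, forecast_long], ignore_index=True)
print(combined)
```

A label column like this lets the visualization style forecast points differently (for example, a dashed line) without relying on the year alone.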

Final data governance note: By versioning this output as data/processed/v1/model_predictions.csv, we maintain a record of exactly what predictions were shown. Future model updates will go to v2, preserving reproducibility. Anyone reviewing the dashboard can trace forecasts back to this notebook. We uphold transparency and accountability standards for public-facing data products.

In [None]:
preds_df = pd.DataFrame(forecast_rows)
preds_path = "data/processed/v1/model_predictions.csv"
preds_df.to_csv(preds_path, index=False)
preds_path

'data/processed/v1/model_predictions.csv'