# GAM (Generalized Additive Models) Exploration

This notebook explores algorithms for fitting Generalized Additive Models (GAMs) to data.

## Overview

GAMs extend generalized linear models by replacing each linear term with a smooth function of a predictor. The response can then depend non-linearly on each predictor, while the contributions of the predictors still add up, which keeps each effect easy to plot and interpret.

## Key Concepts

1. **Additive Structure**: GAMs assume that the expected response (after applying a link function) is a sum of smooth functions of the individual predictors
2. **Smooth Functions**: Each predictor's effect is represented by a spline or another smoother, which captures non-linear relationships
3. **Penalized Estimation**: Use penalties to control the smoothness of the fitted functions
4. **Model Selection**: Choose appropriate smoothness parameters and model complexity
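
The first and third concepts can be written compactly. The notation below is introduced here for illustration and is not taken from any particular library: $g$ is the link function, $f_j$ is the smooth function for predictor $x_j$, and $\lambda_j \ge 0$ is its smoothing parameter.

$$
g\big(\mathbb{E}[y \mid x_1, \dots, x_p]\big) = \beta_0 + \sum_{j=1}^{p} f_j(x_j)
$$

For a Gaussian response with the identity link, one common form of the penalized least-squares objective is:

$$
\min_{\beta_0, f_1, \dots, f_p} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(x)^2 \, dx
$$

Larger values of $\lambda_j$ push $f_j$ toward a straight line, and $\lambda_j = 0$ leaves it unpenalized. Individual libraries may use a different but related roughness penalty.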

## Libraries for GAMs

- **pygam**: Python implementation of GAMs
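- **scikit-learn**: No dedicated GAM estimator; `SplineTransformer` combined with a linear model gives an additive spline fit
- **statsmodels**: GAM implementation (`GLMGam` with B-spline smoothers)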
- **mgcv**: R package (via rpy2)

## Analysis Plan

1. **Data Preparation**: Load and prepare weather data for GAM analysis
2. **Basic GAM Fitting**: Fit simple GAMs to temperature data
3. **Algorithm Comparison**: Compare different GAM fitting algorithms
4. **Model Selection**: Explore methods for choosing optimal smoothness parameters
5. **Visualization**: Create plots to visualize fitted smooth functions
6. **Performance Evaluation**: Assess model performance and interpretability
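
Steps 3 and 4 of the plan depend on how the smoothing parameter is chosen. As a rough illustration, the sketch below fits a penalized smoother with plain NumPy and picks the smoothing parameter by generalized cross-validation (GCV). It is not how pygam or any other library does this. The harmonic basis, the `k**4` roughness weights, the grid of candidate values, and all function names are assumptions made for this example.

```python
import numpy as np

def harmonic_basis(day, n_harmonics=4, period=365.0):
    """Design matrix with an intercept and sine/cosine pairs (periodic in `period`)."""
    day = np.asarray(day, dtype=float)
    cols = [np.ones_like(day)]
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * day / period
        cols.extend([np.sin(angle), np.cos(angle)])
    return np.column_stack(cols)

def roughness_penalty(n_harmonics=4):
    """Diagonal penalty: 0 for the intercept, k**4 for both terms of harmonic k."""
    weights = [0.0]
    for k in range(1, n_harmonics + 1):
        weights.extend([k ** 4, k ** 4])
    return np.diag(np.array(weights, dtype=float))

def fit_penalized(X, y, S, lam):
    """Solve (X'X + lam * S) beta = X'y; also return the effective degrees of freedom."""
    XtX = X.T @ X
    A = XtX + lam * S
    beta = np.linalg.solve(A, X.T @ y)
    edf = np.trace(np.linalg.solve(A, XtX))
    return beta, edf

def gcv_score(X, y, S, lam):
    """Generalized cross-validation score: n * RSS / (n - edf)**2."""
    beta, edf = fit_penalized(X, y, S, lam)
    rss = np.sum((y - X @ beta) ** 2)
    n = len(y)
    return n * rss / (n - edf) ** 2

# Synthetic seasonal temperatures: a sine wave around 15 degrees plus noise
rng = np.random.default_rng(42)
days = np.arange(1, 366)
y = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, days.size)

X = harmonic_basis(days)
S = roughness_penalty()
lams = np.logspace(-3, 3, 13)  # candidate smoothing parameters
scores = [gcv_score(X, y, S, lam) for lam in lams]
best_lam = lams[int(np.argmin(scores))]

beta, edf = fit_penalized(X, y, S, best_lam)
rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
print(f"best lambda = {best_lam:.3g}, edf = {edf:.2f}, RMSE = {rmse:.2f}")
```

The `k**4` weights reflect that, for a sine or cosine of frequency `k`, the integrated squared second derivative grows in proportion to `k**4`; the constant factor is absorbed into the smoothing parameter. The effective degrees of freedom shrink as the smoothing parameter grows, which is the trade-off GCV balances against the residual sum of squares.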


## Step 1: Setup and Data Loading

First, we'll set up the environment and load the necessary libraries for GAM analysis.


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")


Libraries imported successfully!


## Step 2: Install and Import GAM Libraries

This cell imports pygam, a popular Python GAM implementation, installing it first if it is missing. It then checks whether the statsmodels GAM module is also available.


In [2]:
# Install pygam if not already installed
try:
    from pygam import LinearGAM, s, f
    print("pygam already installed")
except ImportError:
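    print("Installing pygam...")
    import subprocess
    import sys
    # Use the running interpreter's pip so the package lands in this kernel's environment
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pygam"])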
    from pygam import LinearGAM, s, f
    print("pygam installed successfully")

# Also try to import statsmodels GAM
try:
    from statsmodels.gam.api import GLMGam, BSplines
    print("statsmodels GAM available")
except ImportError:
    print("statsmodels GAM not available")

print("GAM libraries ready!")


pygam already installed
statsmodels GAM not available
GAM libraries ready!


## Step 3: Load Weather Data

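We load the average temperature (TAVG) records for 2020 through 2025 from the combined weather dataset used in our previous analysis. If the file is not found, the cell falls back to synthetic seasonal data so the rest of the notebook can still run.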


In [3]:
# Load weather data
import dask.dataframe as dd

# Load TAVG data from the combined weather dataset
try:
    tavg_data = dd.read_parquet('../../../weather_data/weather_1950_2025_combined.parquet').query(
        "ELEMENT == 'TAVG' and year >= 2020 and year <= 2025"
    ).compute()
    
    print(f"Loaded {len(tavg_data)} TAVG records")
    print(f"Years: {sorted(tavg_data['year'].unique())}")
    print(f"Stations: {tavg_data['ID'].nunique()}")
    
except FileNotFoundError:
    print("Weather data file not found. Creating synthetic data for demonstration.")
    
    # Create synthetic temperature data for demonstration
    np.random.seed(42)
    n_stations = 100
    n_days = 365
    
    # Create synthetic temperature data with seasonal patterns
    days = np.arange(1, n_days + 1)
    
    # Base temperature with seasonal variation
    base_temp = 15 + 10 * np.sin(2 * np.pi * days / 365)
    
    # Add station-specific variations
    station_effects = np.random.normal(0, 5, n_stations)
    
    # Build a long-format table: one row per station and day.
    # Note: these columns are illustrative and do not mirror the parquet file's schema.
    temp_data = []
    for i in range(n_stations):
        station_temp = base_temp + station_effects[i] + np.random.normal(0, 2, n_days)
        temp_data.append(pd.DataFrame({
            'station_id': f'STATION_{i:03d}',
            'latitude': np.random.uniform(20, 60),
            'longitude': np.random.uniform(-120, -70),
            'day_of_year': days,
            'temperature': station_temp
        }))
    
    tavg_data = pd.concat(temp_data, ignore_index=True)
    print(f"Created synthetic data with {len(tavg_data)} records")


Loaded 32639 TAVG records
Years: [2020, 2022, 2023, 2024, 2025]
Stations: 7017
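
The output above lists no records for 2021, even though the filter spans 2020 to 2025. A small check like the following sketch can flag such gaps before any model is fitted. `missing_years` is a hypothetical helper written for this example, not part of the original notebook.

```python
def missing_years(years, start, end):
    """Return the years in [start, end] that do not appear in `years`."""
    present = {int(y) for y in years}
    return [y for y in range(start, end + 1) if y not in present]

# Years as printed in the cell output above
observed = [2020, 2022, 2023, 2024, 2025]
gaps = missing_years(observed, 2020, 2025)
print(gaps)
```

In the notebook, the same helper could be called as `missing_years(tavg_data['year'].unique(), 2020, 2025)`.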


## Step 4: Basic GAM Fitting

Now let's fit our first GAM using pygam. We'll start with a simple model that predicts temperature from a single smooth function of day of year.
