<a href="https://colab.research.google.com/github/zia207/01_Generalized_Linear_Models_Python/blob/main/Notebook/02_01_08_00_glm_gam_introduction_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 8. Generalized Additive Model (GAM)

Generalized Additive Models (GAMs) are powerful tools for modeling complex, nonlinear relationships in data. They combine the flexibility of nonparametric models with the interpretability of linear models. This tutorial will guide you through the fundamentals of GAMs in Python. To enhance your understanding, we will build a GAM from scratch without using any external packages. This approach will illustrate the core principles behind GAM modeling, including smoothing, combining predictor effects, and estimating model parameters. We will also explore various packages for fitting, analyzing, and visualizing GAMs, including popular libraries such as `pyGAM` and `statsmodels`. These packages provide functions to fit GAMs with several types of smoothers, diagnostics, and model selection criteria. Additionally, we will look at packages designed for GAM visualization and model assessment, equipping you with tools to evaluate and interpret complex relationships in your data.

By the end of this tutorial, you will understand how to use GAMs in Python to uncover intricate relationships in your data. You will gain practical skills in fitting, interpreting, and diagnosing GAMs, along with insights into the mathematical principles behind them.

## Overview

A Generalized Additive Model (GAM) is an extension of traditional linear regression models that allows for more flexibility by modeling the relationship between the response variable and each predictor variable as a smooth, non-linear function. This flexibility makes GAMs especially useful when relationships between predictors and the outcome are complex and cannot be adequately captured by a linear model.



### Structure of a GAM

A Generalized Additive Model (GAM) is defined as:

$$ g(E(y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_n(x_n) $$

where:

-   $y$ is the dependent variable, or the outcome we’re predicting.
-   $E(y)$ represents the expected value of $y$.
-   $g$ is a link function that links the predictors to the expected value of $y$.
-   $\beta_0$ is the intercept, representing the baseline value of $y$.
-   $f_1(x_1), f_2(x_2), \dots, f_n(x_n)$ are smooth, flexible functions for each predictor variable $x_1, x_2, \dots, x_n$.

In GAMs, instead of assuming each predictor affects the outcome linearly, each predictor has its own flexible function, allowing it to influence the outcome in potentially complex, nonlinear ways.

### Components of a GAM

To better understand GAMs, let's look at the main components:

a.  **Non-Parametric Smooth Functions** $f_i(x_i)$

Each predictor $x_i$ has its own smooth function, $f_i(x_i)$, which is designed to capture the potentially nonlinear relationship between $x_i$ and $y$. These functions are often estimated using methods like:

-   **Splines**: Splines (e.g., cubic splines) are piecewise polynomials joined smoothly at certain points (called knots). They allow for a smooth curve without specifying an exact form for the relationship.
-   **Local Regression (LOESS/LOWESS)**: A non-parametric regression technique that fits simple models to localized subsets of data, offering a smooth curve without needing a specific functional form.
-   **Kernel Smoothing**: Kernel-based methods allow estimating smooth functions by averaging nearby points, with the weights determined by a kernel function.

These smoothing methods help model the relationship between each predictor and the outcome in a flexible, data-driven way.
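As a rough illustration of the kernel smoothing idea listed above, the sketch below computes a Nadaraya-Watson estimate using only the standard library. The function names `gaussian_kernel` and `kernel_smooth`, the Gaussian kernel, the bandwidth value, and the toy data are all assumptions made for this example.

```python
import math

def gaussian_kernel(u):
    """Gaussian kernel without its normalizing constant (it cancels in the ratio)."""
    return math.exp(-0.5 * u * u)

def kernel_smooth(x_train, y_train, x0, bandwidth):
    """Nadaraya-Watson estimate at x0: a kernel-weighted average of y_train."""
    weights = [gaussian_kernel((x0 - xi) / bandwidth) for xi in x_train]
    total = sum(weights)
    return sum(w * yi for w, yi in zip(weights, y_train)) / total

# Toy data: y = x^2 on an evenly spaced grid, no noise
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]

# The estimate at 0 is slightly above the true value 0 because
# neighbouring points with larger y are averaged in.
fit_at_zero = kernel_smooth(xs, ys, 0.0, bandwidth=0.2)
print(round(fit_at_zero, 3))
```

A larger bandwidth averages over more neighbours and gives a smoother but more biased curve; a smaller one follows the data more closely.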

b.  **Generalized Framework (Link Function)** $g$

The link function $g$ relates the additive predictor to the expected value of $y$, written $\mu = E(y)$ below. Some commonly used link functions are:

-   **Identity Link**: $g(\mu) = \mu$, used for continuous outcomes (e.g., linear regression).

-   **Log Link**: $g(\mu) = \ln(\mu)$, used for positive outcomes such as counts.

-   **Logit Link**: $g(\mu) = \ln\left(\frac{\mu}{1 - \mu}\right)$, used for binary or proportion data.

By choosing an appropriate link function, GAMs can model different types of outcome distributions (continuous, binary, count data, etc.).
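To make the role of the link concrete, here is a small sketch of the three links listed above together with their inverses. The dictionary `LINKS` and the helper `predict_mean` are names invented for this example. The inverse link is what maps the additive predictor back to the scale of the outcome.

```python
import math

# Each entry maps a link name to (link g, inverse link g^{-1}).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

def predict_mean(eta, link="identity"):
    """Map an additive predictor eta back to the mean scale E(y)."""
    _, inverse = LINKS[link]
    return inverse(eta)

print(predict_mean(0.0, "logit"))  # probability implied by eta = 0
print(predict_mean(0.0, "log"))    # expected count implied by eta = 0
```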

c.  **Additivity Assumption**

GAMs assume that the predictors' effects add up on the scale of the link function: each smooth term depends on a single predictor, so by default there are no interactions between predictors (although interaction terms can be added explicitly). This additivity makes GAMs interpretable because we can examine the effect of each predictor individually.
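The following sketch illustrates what additivity means in practice. The two smooth terms `f1` and `f2` and the intercept value are hypothetical, chosen only for illustration. Changing $x_1$ shifts the additive predictor by the same amount regardless of the value of $x_2$.

```python
import math

def f1(x1):
    """Hypothetical smooth effect of the first predictor."""
    return math.sin(x1)

def f2(x2):
    """Hypothetical smooth effect of the second predictor."""
    return 0.1 * x2 ** 2

def eta(x1, x2, beta0=0.5):
    """Additive predictor: intercept plus one smooth term per predictor."""
    return beta0 + f1(x1) + f2(x2)

# Moving x1 from 0 to 1 shifts eta by f1(1) - f1(0), whatever x2 is.
shift_low = eta(1.0, 0.0) - eta(0.0, 0.0)
shift_high = eta(1.0, 3.0) - eta(0.0, 3.0)
print(shift_low, shift_high)
```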

### Estimation and Fitting GAMs

To estimate a GAM, the following steps are typically involved:

a.  **Choosing the Smoothness of** $f_i(x_i)$

Each smooth function $f_i(x_i)$ needs to be tuned for “smoothness.” If $f_i(x_i)$ is too flexible, the model might overfit the data, capturing noise rather than true relationships. Conversely, if $f_i(x_i)$ is too rigid, it may miss important trends. Regularization methods, such as penalizing the complexity of $f_i(x_i)$, help control this balance. The degree of smoothness is often chosen by minimizing a model selection criterion like **Generalized Cross-Validation (GCV)** or **Akaike Information Criterion (AIC)**.
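The sketch below shows one way GCV can pick a degree of smoothness. As a simplification, it swaps the penalized spline for a Gaussian kernel smoother and treats the bandwidth as the smoothness knob. The data, the candidate bandwidths, and the function names are assumptions for this example. For any linear smoother $\hat{y} = S y$, the score is $\text{GCV} = n \cdot \text{RSS} / (n - \text{tr}(S))^2$.

```python
import math
import random

def smoother_row(x_train, x0, bandwidth):
    """Normalized Gaussian kernel weights: one row of the smoother matrix S."""
    raw = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x_train]
    total = sum(raw)
    return [w / total for w in raw]

def gcv_score(x, y, bandwidth):
    """GCV = n * RSS / (n - trace(S))^2 for a linear smoother y_hat = S y."""
    n = len(x)
    rss = 0.0
    trace = 0.0
    for i in range(n):
        row = smoother_row(x, x[i], bandwidth)
        fitted = sum(w * yj for w, yj in zip(row, y))
        rss += (y[i] - fitted) ** 2
        trace += row[i]  # diagonal element S_ii
    return n * rss / (n - trace) ** 2

# Synthetic data: a sine curve plus Gaussian noise (fixed seed)
random.seed(0)
x = [i * 2 * math.pi / 99 for i in range(100)]
y = [math.sin(xi) + random.gauss(0.0, 0.3) for xi in x]

# Score each candidate bandwidth and keep the one with the lowest GCV
candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
scores = {h: gcv_score(x, y, h) for h in candidates}
best_bandwidth = min(scores, key=scores.get)
print(best_bandwidth)
```

The trace of $S$ plays the role of the effective degrees of freedom, so GCV penalizes very wiggly fits even when their residual sum of squares is small.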

b.  **Estimating Coefficients**

The intercept $\beta_0$ and the functions $f_i(x_i)$ are estimated by maximizing the likelihood of the model (or minimizing a loss function). This is often done using iterative algorithms, like backfitting, that repeatedly refit each function while keeping the others fixed until convergence.
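A minimal backfitting sketch for a Gaussian response with an identity link is given below. It assumes a Gaussian kernel smoother in place of splines, a fixed bandwidth, and synthetic data. The names `kernel_smooth_all` and `backfit` are made up for this example. Each pass smooths the partial residuals of one predictor while the other terms are held fixed.

```python
import math
import random

def kernel_smooth_all(x, r, bandwidth):
    """Smooth the values r against x with a Gaussian kernel, at every x[i]."""
    fitted = []
    for x0 in x:
        w = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x]
        fitted.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return fitted

def backfit(columns, y, bandwidth=0.3, n_iter=20, tol=1e-6):
    """Fit y ~ beta0 + sum_j f_j(x_j) by backfitting with kernel smoothers."""
    n = len(y)
    beta0 = sum(y) / n
    f = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        max_change = 0.0
        for j, xj in enumerate(columns):
            # Partial residuals: remove the intercept and all other terms
            partial = [
                y[i] - beta0 - sum(f[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            new_fj = kernel_smooth_all(xj, partial, bandwidth)
            mean_fj = sum(new_fj) / n
            new_fj = [v - mean_fj for v in new_fj]  # center for identifiability
            change = max(abs(a - b) for a, b in zip(new_fj, f[j]))
            max_change = max(max_change, change)
            f[j] = new_fj
        if max_change < tol:
            break
    return beta0, f

# Synthetic data: y = 2 + sin(x1) + x2^2 + noise
random.seed(1)
n = 150
x1 = [random.uniform(-3, 3) for _ in range(n)]
x2 = [random.uniform(-1, 1) for _ in range(n)]
y = [2 + math.sin(a) + b * b + random.gauss(0, 0.1) for a, b in zip(x1, x2)]

beta0, (f1_hat, f2_hat) = backfit([x1, x2], y)
fitted = [beta0 + a + b for a, b in zip(f1_hat, f2_hat)]
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
tss = sum((yi - beta0) ** 2 for yi in y)
r_squared = 1 - rss / tss
print(f"R^2 = {r_squared:.3f}")
```

Centering each estimated function keeps the intercept identifiable, since a constant could otherwise be moved freely between $\beta_0$ and any $f_i$.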

c.  **Diagnostics and Model Evaluation**

Once fitted, a GAM can be evaluated using:

-   **Residual analysis**: Plotting residuals to check for patterns, which can indicate model misfit.
-   **Cross-validation**: Splitting data into training and test sets to check predictive performance.
-   **Model selection criteria**: Metrics like AIC or GCV to compare different model specifications.
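As a small example of residual analysis, the sketch below fits a straight line to data that are actually quadratic and then checks whether the residuals still carry structure. The data and the helper names `fit_line` and `pearson` are assumptions for this illustration. A strong correlation between the residuals and $x^2$ signals that a linear term is not flexible enough.

```python
import math
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a straight-line fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Synthetic data with a quadratic trend (fixed seed)
random.seed(2)
x = [random.uniform(-2, 2) for _ in range(200)]
y = [xi ** 2 + random.gauss(0, 0.2) for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Leftover structure: residuals of the linear fit still track x^2
pattern = pearson(residuals, [xi ** 2 for xi in x])
print(f"correlation of residuals with x^2: {pattern:.2f}")
```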

### Applications of GAMs

a.  **Ecology and Environmental Science**

In ecology, GAMs are often used to study the relationship between species abundance and environmental factors (e.g., temperature, rainfall). For example, one might model how fish population changes with water temperature, salinity, and nutrient levels in a non-linear manner.

b.  **Economics**

In economics, GAMs can model relationships that are not strictly linear, like how consumer spending varies with income and other factors. GAMs allow each of these factors to influence spending in complex, nonlinear ways.

c.  **Medicine and Public Health**

In medical research, GAMs are used to model the effects of income, dosage levels, or health metrics on patient outcomes, where the relationship might be nonlinear (e.g., the effect of dosage on blood pressure might increase up to a point and then level off).

d.  **Marketing and Social Science**

In marketing, GAMs are useful to understand how advertising spend, customer demographics, and other factors impact customer engagement or sales, which often have nonlinear effects.


## Generalized Additive Models (GAMs) in Python

While Python doesn’t have as many native GAM libraries as R, several powerful packages enable fitting Generalized Additive Models with flexibility and performance. Below is a guide to the most popular Python libraries for GAMs:

### 1. **`pyGAM`**

- **`pyGAM`** is the most comprehensive and widely used Python library for fitting GAMs. It is inspired by R’s `mgcv` and builds its smooth terms from penalized B-splines (P-splines).

- **Key Features:**
  - Offers a variety of term types: `s()` for univariate splines, `f()` for factors, `l()` for linear terms, and `te()` for tensor products (multivariate smooths).
  - Supports multiple distributions and link functions via the `distribution=` and `link=` parameters.
  - Built-in grid search (`gridsearch()`) for smoothing parameter selection, scored by generalized cross-validation (GCV) or a related criterion.
  - Uses `scipy` and `numpy` under the hood; integrates well with the scientific Python stack.

- **Example:**
  ```python
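  from pygam import LinearGAM, s, f
  from pygam.datasets import wage

  # Wage data: columns are year (0), age (1), and education (2, categorical)
  X, y = wage()
  # Spline terms for year and age, a factor term for education
  gam = LinearGAM(s(0) + s(1) + f(2)).fit(X, y)
  gam.summary()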
  ```


### 2. **`statsmodels` (Limited GAM Support)**

- `statsmodels` supports **Generalized Linear Models (GLMs)** and **splines via basis functions** (e.g., using `bs()` or `cr()` from `patsy`). It also provides a penalized-spline GAM class, `GLMGam`, in `statsmodels.gam.api`, although its GAM tooling is narrower than that of `pyGAM` or R’s `mgcv`.

- **Key Features:**
  - You can manually construct spline basis expansions and fit them in a GLM framework.
  - Good for simple additive models with pre-specified knots or degrees of freedom.
  - The manual basis-expansion approach has no penalization or automatic smoothing parameter selection, unlike `pyGAM` or R’s `mgcv`.

- **Example:**
  ```python
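  import pandas as pd
  import patsy
  import statsmodels.api as sm
  from pygam.datasets import wage

  # Build a DataFrame from pyGAM's wage data (columns: year, age, education code)
  X, y = wage()
  df = pd.DataFrame(X, columns=["year", "age", "education"])
  df["wage"] = y

  # Expand age into a B-spline basis; treat the education code as categorical
  y_mat, X_mat = patsy.dmatrices("wage ~ bs(age, df=5) + C(education)", data=df, return_type='dataframe')
  model = sm.GLM(y_mat, X_mat, family=sm.families.Gaussian()).fit()
  print(model.summary())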
  ```


### 3. **`scikit-learn` + Custom Splines (Manual Approach)**

- `scikit-learn` does not directly support GAMs, but you can approximate them by:
  - Using `SplineTransformer` (available in scikit-learn ≥ 1.0) to generate spline basis features.
  - Fitting a penalized linear model (e.g., `Ridge`, `Lasso`) or GLM via `sklearn.linear_model`.

- **Key Features:**
  - Good for educational or prototyping purposes.
  - Requires manual tuning of knots and penalties.
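  - Limited distribution families: beyond Gaussian linear models, only those covered by estimators such as `LogisticRegression`, `PoissonRegressor`, `GammaRegressor`, and `TweedieRegressor`.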

- **Example:**
  ```python
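  import numpy as np
  from sklearn.preprocessing import SplineTransformer
  from sklearn.linear_model import Ridge
  from sklearn.pipeline import Pipeline

  # Synthetic 1-D data for illustration: a noisy sine curve
  rng = np.random.default_rng(0)
  X = np.sort(rng.uniform(0, 10, 200))
  y = np.sin(X) + rng.normal(scale=0.3, size=X.size)

  # Cubic B-spline basis followed by a ridge penalty on the basis coefficients
  spline = SplineTransformer(n_knots=10, degree=3)
  model = Pipeline([
      ('spline', spline),
      ('ridge', Ridge(alpha=1.0))
  ])
  model.fit(X.reshape(-1, 1), y)  # scikit-learn expects a 2-D feature array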
  ```


### 4. **`GAMboost` / `pyGAM` Alternatives (Less Common)**

- Options such as R’s `gamboost` (from the `mboost` package, reachable from Python through an `rpy2` wrapper) or `interpret.glassbox` (from Microsoft’s InterpretML) offer GAM-like models, especially for interpretability.

- **InterpretML’s ExplainableBoostingMachine (EBM):**
  - Not strictly a statistical GAM, but fits additive models using boosting.
  - Excellent for high-dimensional, interpretable models.
  - Can automatically detect and include pairwise interactions, and provides visual explanations.

  ```python
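  from interpret.glassbox import ExplainableBoostingRegressor
  from interpret import show

  # X, y: feature matrix and continuous target (for example, from pygam.datasets.wage())
  ebm = ExplainableBoostingRegressor()
  ebm.fit(X, y)
  show(ebm.explain_global())  # interactive view of each term's contribution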
  ```


## Summary and Conclusions

A Generalized Additive Model (GAM) is a powerful extension of GLMs that uses smooth functions to model non-linear relationships in a flexible yet interpretable way. It strikes a balance between parametric models and fully non-parametric or black-box models (like neural networks), making it popular in ecology, medicine, finance, and social sciences.

## Resources

1. pyGAM: best for stats-style GAMs (like R’s mgcv)  
   Docs: https://pygam.readthedocs.io

2. InterpretML (EBM): best for interpretable ML and auto-GAMs  
   Site: https://interpret.ml

3. ISLR Python Examples: learn theory and code  
   GitHub: https://github.com/JWarmenhoven/ISLR-python

4. StatQuest Video: GAMs in <15 mins  
   Link: https://youtu.be/8BoWqSOvMnY





