<a href="https://colab.research.google.com/github/zia207/01_Generalized_Linear_Models_Python/blob/main/Notebook/02_01_07_00_glm_non_normal_introduction_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 8. Generalized Linear Models (GLM) for Non-Normal Continuous Data

Generalized Linear Models (GLMs) are a flexible extension of ordinary linear regression that allow the response variable to follow a distribution other than the normal (Gaussian). They are particularly useful for non-normal continuous data, such as positively skewed values (e.g., costs, times), proportions, or rates, where assumptions like normality, homoscedasticity, or an unbounded range do not hold.

A GLM has three core components:
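- **Random component**: The probability distribution of the response variable $Y$, typically from the exponential dispersion family (e.g., Gamma, Inverse Gaussian). Some models covered here, such as Beta regression, keep the same mean-plus-link structure without belonging strictly to that family.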
- **Systematic component**: A linear predictor $\eta = X\beta$, where $X$ is the design matrix of predictors and $\beta$ are coefficients.
- **Link function**: $g(\mu) = \eta$, connecting the expected value $\mu = E(Y)$ to the linear predictor. Common links include log, logit, or inverse.
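To make the three components concrete, here is a minimal sketch of how a linear predictor is turned into a mean through different inverse links. The design matrix `X`, the coefficients `beta`, and the `LINKS` dictionary are hypothetical names chosen for illustration only.

```python
import numpy as np

# Each entry pairs a link g (mean -> linear predictor) with its inverse.
LINKS = {
    "log": (np.log, np.exp),
    "logit": (lambda mu: np.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + np.exp(-eta))),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

# Hypothetical design matrix (intercept + one predictor) and coefficients.
X = np.column_stack([np.ones(4), np.array([0.0, 0.5, 1.0, 1.5])])
beta = np.array([0.2, 0.8])

eta = X @ beta  # systematic component: eta = X beta

for name, (g, g_inv) in LINKS.items():
    mu = g_inv(eta)  # mean implied by this link for the same linear predictor
    print(name, np.round(mu, 3))
```

The same `eta` gives a positive mean under the log link and a mean inside (0, 1) under the logit link, which is why the link is chosen to match the range of the response.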

For non-normal continuous data, GLMs handle positive-only values, bounded ranges (e.g., [0,1]), and data with excess zeros or boundary inflation. They are typically estimated by maximum likelihood (or quasi-likelihood), and diagnostics such as deviance residuals are used to assess fit. The sections below describe each model in turn, focusing on its distribution, link functions, typical applications, and key features.

## Gamma Regression (A Special Case of Tweedie Regression)

Gamma regression is a GLM for positive continuous response variables ($Y > 0$) that are right-skewed and heteroscedastic, where the variance grows with the square of the mean ($\text{Var}(Y) = \phi \mu^2$, with dispersion parameter $\phi$).

- **Distribution**: Gamma, parameterized by shape $\alpha$ and rate $\beta$, or equivalently by mean $\mu = \alpha / \beta$ and shape $\alpha = 1/\phi$. In the shape-rate form the density is $f(y) = \frac{\beta^\alpha}{\Gamma(\alpha)} y^{\alpha-1} e^{-\beta y}$ for $y > 0$.
- **Link function**: Commonly log ($g(\mu) = \log(\mu)$) for multiplicative effects; alternatives include the canonical inverse link ($g(\mu) = 1/\mu$) and the identity link.
- **Applications**: Modeling waiting times, insurance claims, rainfall amounts, or lifetimes (e.g., time to failure in reliability analysis).
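- **Key features**: Suited to continuous positive outcomes whose standard deviation is proportional to the mean (constant coefficient of variation); no upper bound, but requires strict positivity, so data with exact zeros need adjustments such as a two-part (hurdle) model or a Tweedie model.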
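As an illustration of how such a model can be fitted, the sketch below implements iteratively reweighted least squares (IRLS) for a Gamma GLM with log link on simulated data. The function name `fit_gamma_log_irls`, the simulated coefficients, and the Pearson-based dispersion estimate are assumptions made for this example, not part of any library.

```python
import numpy as np

def fit_gamma_log_irls(X, y, n_iter=25, tol=1e-10):
    """Fit a Gamma GLM with log link by iteratively reweighted least squares."""
    # Start from a constant-mean model (assumes column 0 of X is the intercept).
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        # With a log link and variance function mu^2 the IRLS weights equal 1,
        # so each step is an ordinary least-squares fit to the working response.
        z = eta + (y - mu) / mu
        beta_new, *_ = np.linalg.lstsq(X, z, rcond=None)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    mu = np.exp(X @ beta)
    # Pearson estimate of the dispersion phi.
    phi = np.sum(((y - mu) / mu) ** 2) / (len(y) - X.shape[1])
    return beta, phi

# Simulated data: shape = 1 / phi, so shape 2 corresponds to phi = 0.5.
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([0.5, 1.0])
shape = 2.0
mu_true = np.exp(X @ true_beta)
y = rng.gamma(shape, mu_true / shape)  # scale = mu / shape gives E[Y] = mu

beta_hat, phi_hat = fit_gamma_log_irls(X, y)
print(beta_hat, phi_hat)
```

In practice one would normally rely on a library fit instead; in statsmodels the corresponding call should be `sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()`.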

## Inverse Gaussian Regression (A Special Case of Tweedie Regression)

Inverse Gaussian (IG) regression models positive continuous data ($Y > 0$) with even stronger right-skewness than Gamma, where variance is cubic in the mean ( $\text{Var}(Y) = \phi \mu^3$).

- **Distribution**: Inverse Gaussian (also called Wald distribution), with density $f(y) = \sqrt{\frac{\lambda}{2\pi y^3}} \exp\left( -\frac{\lambda (y - \mu)^2}{2 \mu^2 y} \right)$, where $\mu$ is the mean and $\lambda = 1/\phi$ is the shape parameter.
- **Link function**: The canonical link is inverse squared ($g(\mu) = 1/\mu^2$); the log link is often preferred for interpretability.
- **Applications**: Time to first passage in Brownian motion (e.g., stock prices hitting a barrier), insurance claims with high variability, or degradation processes in engineering.
- **Key features**: Useful when variability grows very rapidly with the mean (variance proportional to $\mu^3$) and the right tail is long; it is a member of the exponential dispersion family, so standard GLM fitting applies.
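The density and the cubic variance relation can be checked numerically. This is a small sketch with arbitrary illustrative values $\mu = 2$ and $\lambda = 4$; it relies on NumPy's Wald sampler, whose `scale` argument plays the role of $\lambda$.

```python
import numpy as np

def inverse_gaussian_pdf(y, mu, lam):
    """Inverse Gaussian density with mean mu and shape lam (= 1 / phi)."""
    return (np.sqrt(lam / (2.0 * np.pi * y**3))
            * np.exp(-lam * (y - mu) ** 2 / (2.0 * mu**2 * y)))

mu, lam = 2.0, 4.0  # illustrative values only

# Riemann-sum check that the density integrates to about 1.
grid = np.linspace(1e-6, 200.0, 2_000_001)
dy = grid[1] - grid[0]
total_mass = np.sum(inverse_gaussian_pdf(grid, mu, lam)) * dy

# Simulated draws: the sample variance should be close to mu^3 / lam.
rng = np.random.default_rng(1)
draws = rng.wald(mu, lam, size=200_000)

print("integral of density:", total_mass)
print("sample mean:", draws.mean(), "sample variance:", draws.var())
print("theoretical variance mu^3 / lam:", mu**3 / lam)
```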

## Beta Regression

Beta regression is designed for continuous responses bounded between 0 and 1 (exclusive, i.e., $0 < Y < 1$), such as proportions, fractions, or rates.

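- **Distribution**: Beta, with density $f(y) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} y^{\alpha-1} (1-y)^{\beta-1}$, reparameterized by mean $\mu = \alpha / (\alpha + \beta)$ and precision $\phi = \alpha + \beta$ (variance $\text{Var}(Y) = \mu(1-\mu)/(\phi+1)$). Note that $\phi$ here is a precision, so larger values mean less variance, unlike the dispersion $\phi$ used for Gamma and Inverse Gaussian regression.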
- **Link function**: Logit ( $g(\mu) = \log(\mu / (1-\mu))$) is standard, allowing interpretation like logistic regression; alternatives include probit or log-log.
- **Applications**: Proportions like exam pass rates, market shares, or soil composition fractions (e.g., clay percentage).
- **Key features**: Handles heteroscedasticity inherent in bounded data (variance peaks at $\mu = 0.5$); assumes no exact 0s or 1s. If they are present, transform the response (e.g., $(y(n-1) + 0.5)/n$ for sample size $n$) or use inflated variants.
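The mean-precision parameterization and the boundary transformation above can be sketched in a few lines. The helper names `beta_shapes` and `squeeze_unit_interval` and the example values are hypothetical.

```python
import numpy as np

def beta_shapes(mu, phi):
    """Convert mean mu and precision phi to the usual shape parameters."""
    return mu * phi, (1.0 - mu) * phi

def squeeze_unit_interval(y):
    """Pull exact 0s and 1s into (0, 1) using (y * (n - 1) + 0.5) / n."""
    y = np.asarray(y, dtype=float)
    n = y.size
    return (y * (n - 1) + 0.5) / n

mu, phi = 0.3, 10.0
a, b = beta_shapes(mu, phi)

# The variance computed from the shapes agrees with mu (1 - mu) / (phi + 1).
var_shapes = a * b / ((a + b) ** 2 * (a + b + 1.0))
var_mean_precision = mu * (1.0 - mu) / (phi + 1.0)

y_raw = np.array([0.0, 0.12, 0.5, 0.97, 1.0])
y_adj = squeeze_unit_interval(y_raw)

print(a, b, var_shapes, var_mean_precision)
print(y_adj)
```

The transformation shrinks every observation slightly toward 0.5, so it is best seen as a pragmatic fix when boundary values are rare.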

## Zero-One Inflated Beta Regression

This extends beta regression for data in [0,1] with excess zeros and/or ones (e.g., many observations at boundaries), combining a discrete component for 0/1 with a continuous beta for (0,1).

- **Distribution**: Mixture model with probability mass at 0 and 1 (modeled via logistic or multinomial regression) and a beta density for interior values. Formally, $P(Y=0) = \pi_0$, $P(Y=1) = \pi_1$, and for $0 < y < 1$ the density is $(1 - \pi_0 - \pi_1)\, f_{\text{Beta}}(y; \mu, \phi)$.
- **Link functions**: Logit for inflation probabilities $\pi_0, \pi_1$; logit for beta mean $\mu$.
- **Applications**: Proportions with structural zeros/ones, like insurance claim ratios (many 0% or 100% claims), voter turnout fractions, or disease prevalence with perfect cures/absences.
- **Key features**: Addresses inflation by separately modeling boundaries; can be zero-inflated (only excess 0s), one-inflated, or both.
- **Estimation**: Maximum likelihood on the mixture likelihood, e.g., with R's `gamlss` package (the `BEINF` family), or Bayesian MCMC with the `zoib` package.
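The mixture structure is easiest to see by simulating from it. The sketch below draws from a zero-one inflated beta with illustrative parameter values and compares the sample mean with the mixture mean $\pi_1 + (1 - \pi_0 - \pi_1)\mu$; the function name `sample_zoib` is hypothetical.

```python
import numpy as np

def sample_zoib(n, pi0, pi1, mu, phi, rng):
    """Draw from a zero-one inflated beta: mass at 0 and 1, beta inside (0, 1)."""
    component = rng.choice(3, size=n, p=[pi0, pi1, 1.0 - pi0 - pi1])
    y = rng.beta(mu * phi, (1.0 - mu) * phi, size=n)  # interior draws
    y[component == 0] = 0.0  # structural zeros
    y[component == 1] = 1.0  # structural ones
    return y

rng = np.random.default_rng(2)
pi0, pi1, mu, phi = 0.15, 0.05, 0.4, 8.0  # illustrative values only
y = sample_zoib(100_000, pi0, pi1, mu, phi, rng)

# The overall mean mixes the mass at 1 with the beta mean.
expected_mean = pi1 + (1.0 - pi0 - pi1) * mu

print("share of zeros:", (y == 0).mean(), "share of ones:", (y == 1).mean())
print("sample mean:", y.mean(), "mixture mean:", expected_mean)
```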

## Fractional Regression Models

Fractional regression (also called fractional logit/probit) models responses that are fractions or proportions in [0,1], including boundaries, without assuming a beta distribution. It's quasi-likelihood based, focusing on the mean structure.

- **Distribution**: No full distribution is assumed; a Bernoulli (quasi-binomial) quasi-likelihood is used, with working variance $\text{Var}(Y) \propto \mu(1-\mu)$, treating the response like a binomial proportion even though it is continuous.
- **Link function**: Logit (for fractional logit) or probit, mapping $\mu$ to the real line.
- **Applications**: Economic shares (e.g., budget allocations), participation rates, or any bounded ratio where exact 0s/1s occur naturally.
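- **Key features**: Robust to distribution misspecification as long as the mean is correctly specified; handles 0s/1s without inflation models; differs from beta regression in that it specifies only the mean rather than a full likelihood, so inference typically relies on robust standard errors.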
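A fractional logit fit amounts to maximizing the Bernoulli quasi-log-likelihood with a response in [0, 1]. The following sketch does this with Newton's method on simulated fractions that include exact 0s and 1s; the function name `fit_fractional_logit` and the simulation design are assumptions for illustration.

```python
import numpy as np

def fit_fractional_logit(X, y, n_iter=50, tol=1e-10):
    """Maximize the Bernoulli quasi-log-likelihood for y in [0, 1] by Newton's method."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (y - mu)                          # gradient
        info = X.T @ (X * (mu * (1.0 - mu))[:, None])   # negative Hessian
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
# Fractions out of 4 trials take values in {0, 0.25, 0.5, 0.75, 1},
# so exact 0s and 1s occur while the mean is still p.
y = rng.binomial(4, p) / 4.0

beta_hat = fit_fractional_logit(X, y)
print(beta_hat, (y == 0).mean(), (y == 1).mean())
```

This sketch returns point estimates only. For inference, a robust (sandwich) covariance would be needed; in statsmodels this should correspond to fitting `sm.GLM` with `family=sm.families.Binomial()` and a robust `cov_type` such as `"HC1"`.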

## Dirichlet Regression

Dirichlet regression models multivariate continuous responses that are compositional (proportions summing to 1, e.g., $Y = (Y_1, \dots, Y_K)$ with $\sum Y_k = 1$, each $Y_k > 0$).

- **Distribution**: Dirichlet, a multivariate beta, with density involving gamma functions and parameters $\alpha_1, \dots, \alpha_K$; mean $\mu_k = \alpha_k / \sum \alpha_j$, precision related to $\sum \alpha_j$.
- **Link function**: Multinomial logit for the means ($g(\mu_k) = \log(\mu_k / \mu_K)$ for categories $k = 1, \dots, K-1$, with category $K$ as the reference), often with a separate model for precision.
- **Applications**: Compositional data like market shares across brands, soil nutrient breakdowns, or budget allocations by category.
- **Key features**: Accounts for the sum-to-one constraint and the dependence it induces between components; every component must be strictly positive, so compositions containing exact zeros require extensions such as zero-inflated variants.
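The mapping from linear predictors to Dirichlet parameters can be sketched as follows, assuming the mean-precision form $\alpha_k = \mu_k \cdot \sum_j \alpha_j$. The linear predictor values, the precision, and the helper name are hypothetical.

```python
import numpy as np

def means_from_linear_predictors(eta):
    """Inverse multinomial logit with the last category as reference (eta_K = 0)."""
    eta_full = np.append(eta, 0.0)
    w = np.exp(eta_full - eta_full.max())  # subtract the max for numerical stability
    return w / w.sum()

eta = np.array([0.4, -0.3])  # hypothetical predictors for categories 1..K-1 (K = 3)
precision = 20.0             # hypothetical precision, the sum of the alphas

mu = means_from_linear_predictors(eta)
alpha = mu * precision       # Dirichlet parameters implied by mean and precision

rng = np.random.default_rng(4)
draws = rng.dirichlet(alpha, size=50_000)

print("model means:", np.round(mu, 4))
print("sample means:", np.round(draws.mean(axis=0), 4))
print("row sums:", draws.sum(axis=1)[:3])
```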

## Tweedie Regression

Tweedie regression uses the Tweedie distribution, a flexible exponential dispersion family for continuous data with a point mass at zero and positive skew, unifying several GLMs.

- **Distribution**: Tweedie, with variance $\text{Var}(Y) = \phi \mu^p$ (power parameter $p$); special cases are normal ($p=0$), Poisson ($p=1$), Gamma ($p=2$), and Inverse Gaussian ($p=3$). For $1 < p < 2$, it is a compound Poisson-Gamma distribution (exact zeros plus positive continuous values).
- **Link function**: Log for positive means; power links possible.
- **Applications**: Insurance (claims with many zeros and large positives), ecology (species abundance), or rainfall (dry days + amounts).
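- **Key features**: For $1 < p < 2$, handles exact zeros and overdispersion within a single model rather than through a separate zero component; $p$ can be estimated (e.g., by profile likelihood) or fixed in advance.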
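The compound Poisson-Gamma case can be simulated directly, which shows where the exact zeros come from. The sketch below uses the standard mapping from $(\mu, \phi, p)$ to a Poisson rate and Gamma shape and scale; the function name `sample_tweedie_cpg` and the parameter values are illustrative assumptions.

```python
import numpy as np

def sample_tweedie_cpg(n, mu, phi, p, rng):
    """Simulate Tweedie draws for 1 < p < 2 as a Poisson sum of Gamma variables."""
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))   # Poisson rate
    shape = (2.0 - p) / (p - 1.0)               # Gamma shape of each summand
    scale = phi * (p - 1.0) * mu ** (p - 1.0)   # Gamma scale of each summand
    counts = rng.poisson(lam, size=n)
    y = np.zeros(n)                             # zero events give an exact zero
    positive = counts > 0
    # A sum of k iid Gamma(shape, scale) variables is Gamma(k * shape, scale).
    y[positive] = rng.gamma(counts[positive] * shape, scale)
    return y, lam

rng = np.random.default_rng(5)
mu, phi, p = 2.0, 1.0, 1.5  # illustrative values only
y, lam = sample_tweedie_cpg(200_000, mu, phi, p, rng)

print("share of exact zeros:", (y == 0).mean(), "expected:", np.exp(-lam))
print("mean:", y.mean(), "variance:", y.var(), "phi * mu^p:", phi * mu**p)
```

The share of zeros equals the Poisson probability of no events, $e^{-\lambda}$, so it is tied to $\mu$, $\phi$, and $p$ rather than modeled separately.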

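The following table compares Tweedie regression with its two special cases, Gamma and Inverse Gaussian regression.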

| Feature                   | Tweedie Regression                     | Gamma Regression                       | Inverse Gaussian Regression              |
| :------------------------ | :------------------------------------- | :------------------------------------- | :--------------------------------------- |
| **Variance Power ($p$)**  | Variable (estimated or chosen)         | Fixed at $p=2$                         | Fixed at $p=3$                           |
| **Variance-Mean Relation** | $\text{Var}(Y) = \phi \mu^p$          | $\text{Var}(Y) = \phi \mu^2$           | $\text{Var}(Y) = \phi \mu^3$             |
| **Underlying Distribution** | Tweedie family (flexible)              | Gamma distribution                     | Inverse Gaussian distribution            |
| **Data Characteristics**  | Flexible: counts, continuous with zeros, positive continuous, skewed | Positive continuous, often skewed, no zeros. Variance increases quadratically. | Positive continuous, highly skewed, no zeros. Variance increases cubically. |
| **Handling of Zeros**     | Can handle (especially for $1<p<2$)    | No                                     | No                                       |
| **Generality**            | General framework (encompasses others) | Specific case of Tweedie ($p=2$)       | Specific case of Tweedie ($p=3$)         |

If subject knowledge or diagnostics indicate that the response follows a Gamma or Inverse Gaussian distribution, fit that model directly. If the variance-mean relationship is uncertain, or the data contain exact zeros that neither model allows, Tweedie regression is the more flexible choice because the power $p$ can be estimated, letting the data determine the variance-mean relationship.
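A rough way to see which power $p$ the data suggest is to compare group variances with group means on the log scale, since $\text{Var}(Y) = \phi \mu^p$ implies $\log \text{Var}(Y) = \log \phi + p \log \mu$. The sketch below applies this to simulated Gamma data; it is an exploratory diagnostic under assumed grouping, not the likelihood-based estimation of $p$ used when fitting a Tweedie model.

```python
import numpy as np

rng = np.random.default_rng(6)
group_means = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # hypothetical group means
shape = 2.0  # Gamma shape, so Var(Y) = mu^2 / shape: true p = 2, phi = 0.5

sample_means, sample_vars = [], []
for m in group_means:
    y = rng.gamma(shape, m / shape, size=20_000)
    sample_means.append(y.mean())
    sample_vars.append(y.var(ddof=1))

# Slope of the straight-line fit estimates p; the intercept estimates log(phi).
p_hat, log_phi_hat = np.polyfit(np.log(sample_means), np.log(sample_vars), 1)
phi_hat = np.exp(log_phi_hat)

print("estimated p:", p_hat, "estimated phi:", phi_hat)
```

A slope near 2 points toward a Gamma model, a slope near 3 toward Inverse Gaussian, and values between 1 and 2 toward the compound Poisson-Gamma range.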

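Most of these models are available in standard statistical software (e.g., R packages and, for several of them, Python's statsmodels); the choice among them depends on data characteristics such as range, skewness, and the presence of zeros. After fitting, always check residuals, and compare AIC/BIC across candidate models fitted by full likelihood to the same response (quasi-likelihood fits such as fractional regression do not provide a true AIC). If the data violate a model's assumptions, consider generalized additive models (GAMs) or other extensions.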
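As one example of a residual check, deviance residuals for a Gamma model can be computed by hand from the unit deviance $2\left[(y - \mu)/\mu - \log(y/\mu)\right]$. The observations and fitted means below are made-up values for illustration.

```python
import numpy as np

def gamma_deviance_residuals(y, mu):
    """Signed square roots of the Gamma unit deviances."""
    unit_dev = 2.0 * ((y - mu) / mu - np.log(y / mu))
    # Clip tiny negative values caused by rounding before taking the root.
    return np.sign(y - mu) * np.sqrt(np.maximum(unit_dev, 0.0))

y = np.array([0.5, 1.0, 2.0, 4.0])   # hypothetical observations
mu = np.array([1.0, 1.0, 1.5, 2.0])  # hypothetical fitted means

resid = gamma_deviance_residuals(y, mu)
deviance = np.sum(resid**2)          # total (unscaled) deviance

print(np.round(resid, 3), round(float(deviance), 3))
```

Plotting such residuals against the fitted means is a common way to spot a misspecified variance function: a funnel shape suggests the assumed power of $\mu$ is too low or too high.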

## Summary and Conclusion

In summary, Generalized Linear Models (GLMs) provide a flexible framework for modeling non-normal continuous data by pairing a response distribution with a link function suited to the data. Key models include: Gamma regression for positively skewed positive data, Inverse Gaussian regression for more heavily skewed positive data, Beta regression for proportions, Zero-One Inflated Beta regression for proportions with excess boundary values, Fractional regression for bounded ratios, Dirichlet regression for compositional data, and Tweedie regression for data that mix exact zeros with positive values. Each model addresses a specific challenge such as skewness, boundedness, or zero-inflation, which makes GLMs useful across fields like economics, ecology, and engineering. Proper model selection, fitting, and diagnostics remain crucial for accurate inference and prediction.