# Poisson GLM: Horseshoe Crab Mating Example

This example uses data from a study of nesting horseshoe crabs. Each female horseshoe crab in the study had a male crab resident in her nest. The study investigated factors affecting whether the female crab had any other males, called satellites, residing nearby, and if so, how many. The factors included the female crab's color, spine condition, weight, and carapace width.

From Alan Agresti, *Categorical Data Analysis,* Second Edition, Wiley, 2002, pg. 126 (https://mathdept.iut.ac.ir/sites/mathdept.iut.ac.ir/files/AGRESTI.PDF).

In [None]:
'''
Imports
'''
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

%matplotlib inline

In [None]:
'''
Read the input file into a pandas DataFrame. The columns have the following meanings:
1. Color: 1 (light medium), 2 (medium), 3 (dark medium), 4 (dark)
2. Spine condition: 1 (both good), 2 (one worn or broken), 3 (both worn or broken)
3. Carapace width: in cm
4. Weight: in kg
5. Number of satellites
'''
# Use a raw string for the regex separator so that '\s' is not treated as a string escape.
df0 = pd.read_csv('CrabSatellites.ssv', sep=r'\s+', header=None, names=['Color', 'Spine', 'Width', 'Weight', 'Satellites'])
df0['Color'] = df0['Color'].astype('category')
df0['Spine'] = df0['Spine'].astype('category')
print(df0.head())
print()
print(df0.describe())

In [None]:
'''
Plot number of satellites versus carapace width.
'''
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,6))
ax[0].scatter(df0.Width, df0.Satellites, alpha=0.2)
ax[0].set_xlabel('Carapace Width (cm)', fontsize=15)
ax[0].set_ylabel('Number of Satellites', fontsize=15)

bins = np.linspace(23.25, 29.25, 7)
df0['bin'] = np.digitize(df0.Width, bins=bins)
bingroups = df0.groupby('bin')
widths = bingroups['Width'].agg(['mean', 'sem'])
satels = bingroups['Satellites'].agg(['mean', 'sem'])
xval = widths.iloc[:,0].tolist()
xerr = widths.iloc[:,1].tolist()
yval = satels.iloc[:,0].tolist()
yerr = satels.iloc[:,1].tolist()
ax[1].errorbar(x=xval, y=yval, yerr=yerr, xerr=xerr, linestyle='none')
ax[1].set_xlabel('Mean Carapace Width in Bins (cm)', fontsize=15)
ax[1].set_ylabel('Mean Number of Satellites', fontsize=15)

plt.show()

In [None]:
'''
Fit a Poisson GLM. For now we consider a single predictor, the carapace width,
and wish to predict the number of satellites.
The link function below is sm.families.links.Log(), but one could also try sm.families.links.Identity().
'''
formula = 'Satellites ~ Width'
# Recent statsmodels releases expect a link instance such as links.Log();
# older releases used the lowercase class names (links.log, links.identity).
PoissonModel = smf.glm(formula=formula, data=df0, family=sm.families.Poisson(link=sm.families.links.Log()))
PoissonResults = PoissonModel.fit()
print(PoissonResults.summary())

### Interpretation of the results:

- For the log link: $\mu = \exp(-3.30 + 0.16\,x) = 0.037\,(1.18)^{x}$. If the carapace width increases by 1 cm, the estimated mean number of satellites is multiplied by $e^{0.16} \approx 1.18$, that is, it increases by 18%.

- For the identity link: $\mu = -11.53 + 0.55\,x$. If the carapace width increases by 1 cm, the estimated mean number of satellites increases by 0.55. Equivalently, an increase in width of about 2 cm ($1/0.55 \approx 1.8$ cm) is associated with one extra satellite.

- Note, however, that the residual deviance is large: more than 500 on 171 degrees of freedom, whereas for a well-fitting model the deviance should be roughly comparable to the degrees of freedom. More on this later.
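
The arithmetic behind the first two bullets can be checked with a few lines of Python. This is a minimal sketch that copies the coefficient values quoted above rather than refitting the model; the helper names `mean_log_link` and `mean_identity_link` and the 26 cm evaluation point are illustrative choices, not part of the original analysis.

```python
import math

# Coefficients copied from the two fits quoted above (log link and identity link).
log_intercept, log_slope = -3.3048, 0.1640
id_intercept, id_slope = -11.5318, 0.5495

# Log link: a 1 cm increase in width multiplies the mean by exp(slope).
rate_ratio = math.exp(log_slope)
percent_increase = 100.0 * (rate_ratio - 1.0)

# Identity link: width increase (in cm) associated with one extra satellite.
cm_per_satellite = 1.0 / id_slope


def mean_log_link(width_cm):
    """Estimated mean number of satellites under the log link."""
    return math.exp(log_intercept + log_slope * width_cm)


def mean_identity_link(width_cm):
    """Estimated mean number of satellites under the identity link."""
    return id_intercept + id_slope * width_cm


print(f"rate ratio per cm: {rate_ratio:.3f} ({percent_increase:.1f}% increase)")
print(f"cm per extra satellite: {cm_per_satellite:.2f}")
print(f"mean at 26 cm: log link {mean_log_link(26.0):.2f}, "
      f"identity link {mean_identity_link(26.0):.2f}")
```

Both links give similar predictions near the middle of the observed width range, which is why the choice between them matters mostly at the extremes.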

In [None]:
'''
Superimpose fit results on data points.
'''
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6,6))
ax.errorbar(x=xval, y=yval, yerr=yerr, xerr=xerr, linestyle='none')
idfit = -11.5318 + 0.5495*np.array(xval)
logfit = np.exp(-3.3048 + 0.1640*np.array(xval))
ax.plot(xval, idfit, label='Identity link')
ax.plot(xval, logfit, label='Log link')
ax.set_xlabel('Mean Carapace Width in Bins (cm)', fontsize=15)
ax.set_ylabel('Mean Number of Satellites', fontsize=15)
ax.legend(prop={'size': 15}, loc="upper left")
plt.show()

- The fit looks good for both link functions. 
- As noted before, however, the deviance is large. While this doesn't necessarily affect the coefficient estimates, it may affect the standard errors of the coefficients.
- The large deviance could indicate that the proposed model (Poisson) is incorrect, or that there is overdispersion. Let's check the latter first.

In [None]:
'''
Check for overdispersion.
'''
satels4 = bingroups['Satellites'].agg(['count', 'sum', 'mean', 'var'])
print(satels4)

Note how the variance in each width bin is considerably larger than the mean. For Poisson counts, mean and variance should be equal. This non-Poisson behavior is called **overdispersion**. Possible causes include:
- **Missing Features**: We only used carapace width to predict the number of satellites in our model. Maybe the number of satellites is Poisson distributed at each fixed combination of width, weight, spine condition, color, and perhaps some other features we didn't think about. By using only one predictor out of four or more, we caused the response variable to be a mixture of Poisson populations, each with its own mean.
- **The True Model Is Not Poisson**: Another possible model is the negative binomial, which has an extra parameter to model variability.
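
The mixture effect described in the first bullet can be illustrated numerically. The sketch below assumes a hypothetical population made of two equally likely Poisson subpopulations with means 1 and 5 (values chosen purely for illustration, not taken from the crab data) and computes the mean and variance of the mixture directly from its probability mass function.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability mass function, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


# Hypothetical subpopulations: illustrative means and weights only.
means = [1.0, 5.0]
weights = [0.5, 0.5]


def mixture_pmf(y):
    """Probability of count y when the subpopulation is unobserved."""
    return sum(w * poisson_pmf(y, m) for w, m in zip(weights, means))


# Sum over a range wide enough that the truncated tail is negligible.
ys = range(0, 100)
mix_mean = sum(y * mixture_pmf(y) for y in ys)
mix_var = sum((y - mix_mean) ** 2 * mixture_pmf(y) for y in ys)

print(f"mixture mean:     {mix_mean:.4f}")
print(f"mixture variance: {mix_var:.4f}")
print(f"variance / mean:  {mix_var / mix_mean:.4f}")
```

Each component has variance equal to its mean, but the mixture picks up the extra between-group variance of the means (the law of total variance), so its variance exceeds its mean.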

Note that overdispersion is not a problem in ordinary linear regression with normally distributed response variables. This is because the normal distribution has a separate parameter to describe variability.

One way to fix overdispersion is to try a different model, one with an additional parameter to allow modeling of variability. A good candidate for the case at hand is the negative binomial GLM. The negative binomial distribution has probability mass function:
\begin{equation}
f(y;k,\mu) \;=\; \frac{\Gamma(y+k)}{\Gamma(k)\,\Gamma(y+1)}\;\left(\frac{k}{\mu+k}\right)^{k}\;\left(1-\frac{k}{\mu+k}\right)^{y},
\end{equation}
where $y=0,1,2,\ldots$, and $k>0$ and $\mu>0$ are parameters. When $k$ is an integer, $f(y;k,\mu)$ can be interpreted as the probability of seeing $y$ "failures" before we reach $k$ "successes", where the probability of a single success is $k/(\mu+k)$.

The mean and variance of the negative binomial distribution are given by:
\begin{equation}
{\rm E}(Y) = \mu\quad\textrm{ and }\quad {\rm Var}(Y) = \mu + \frac{\mu^{2}}{k}.
\end{equation}
Note that mean and variance are not in a one-to-one relationship since the variance involves the new parameter $k$. For large $k$ we recover the Poisson distribution (where variance equals mean).
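
As a sanity check on the formulas above, the probability mass function can be coded directly and its moments computed by summation. This is a minimal sketch: the function name `negbin_pmf` and the parameter values $k=1$, $\mu=3$ are illustrative assumptions, and the large-$k$ comparison uses $k=10^4$ as an arbitrary stand-in for "large".

```python
import math


def negbin_pmf(y, k, mu):
    """Negative binomial pmf f(y; k, mu), computed on the log scale."""
    p = k / (mu + k)  # probability of a single "success"
    log_pmf = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
               + k * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)


def poisson_pmf(y, mu):
    """Poisson pmf, used for the large-k comparison."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


# Illustrative parameter values (not estimated from the crab data).
k, mu = 1.0, 3.0
ys = range(0, 2000)  # wide enough that the truncated tail is negligible

total = sum(negbin_pmf(y, k, mu) for y in ys)
nb_mean = sum(y * negbin_pmf(y, k, mu) for y in ys)
nb_var = sum((y - nb_mean) ** 2 * negbin_pmf(y, k, mu) for y in ys)

print(f"sum of pmf: {total:.6f}")
print(f"mean:       {nb_mean:.6f}   (formula: {mu:.6f})")
print(f"variance:   {nb_var:.6f}   (formula: {mu + mu**2 / k:.6f})")

# For large k the pmf approaches the Poisson pmf with the same mean.
k_large = 1.0e4
max_diff = max(abs(negbin_pmf(y, k_large, mu) - poisson_pmf(y, mu))
               for y in range(0, 30))
print(f"max |NB - Poisson| for k = {k_large:g}: {max_diff:.2e}")
```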

In [None]:
'''
Fit a negative binomial GLM.
The parameter alpha in the glm call below corresponds to 1/k in the formula for the negative
binomial distribution.
Try playing around with the link function (log or identity) and alpha values (from 0.1 to 2.0).
'''
formula = 'Satellites ~ Width'
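# Recent statsmodels releases expect a link instance such as links.Log();
# older releases used the lowercase class names (links.log, links.identity).
NegBinModel = smf.glm(formula=formula, data=df0,
                      family=sm.families.NegativeBinomial(link=sm.families.links.Log(), alpha=1.0))
NegBinResults = NegBinModel.fit()
print(NegBinResults.summary())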

The deviance is substantially better for the negative binomial model! Note that the standard errors of the coefficients have doubled compared to the results of the Poisson GLM. The coefficients themselves have changed much less.

In [None]:
'''
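Fit a Poisson GLM to all the predictors.
'''
formula = 'Satellites ~ Color+Spine+Width+Weight'
# Recent statsmodels releases expect a link instance such as links.Log();
# older releases used the lowercase class names (links.log, links.identity).
PoissonModel = smf.glm(formula=formula, data=df0, family=sm.families.Poisson(link=sm.families.links.Log()))
PoissonResults = PoissonModel.fit()
print(PoissonResults.summary())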

Including all four predictors has not improved the deviance. In fact, spine condition appears to be useless as a predictor, and width is also uninformative in the presence of weight. This is intuitive, since width and weight must be strongly correlated.

In conclusion, adding missing features has not improved this model, but changing from Poisson to negative binomial has.