# Weighted Generalized Linear Models

In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

## Weighted GLM: Poisson response data

### Load data

In this example, we'll use the affairs dataset and a handful of exogenous variables to predict the rate of extramarital affairs.

Weights will be generated to show that `freq_weights` are equivalent to repeating records of data, whereas `var_weights` are equivalent to averaging records.

In [None]:
print(sm.datasets.fair.NOTE)

Load the data into a pandas dataframe.

In [None]:
data = sm.datasets.fair.load_pandas().data

The dependent variable is `affairs`.

In [None]:
data.describe()

In [None]:
data[:3]

For the weights, we'll randomly generate an array. 

In [None]:
np.random.seed(42)
weights = np.random.randint(1, 4, size=len(data))

In [None]:
weights.max()
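As a quick sanity check on the weight range: the upper bound passed to `randint` is exclusive, so the draws fall in {1, 2, 3}. The sketch below uses a local `RandomState` and a hypothetical array name (`demo_weights`) so that it does not disturb the global seed set above.

```python
import numpy as np

# Hypothetical stand-in for the notebook's `weights`; a local generator
# keeps the global random state untouched.
demo_weights = np.random.RandomState(42).randint(1, 4, size=1000)

# bincount gives the number of draws equal to 0, 1, 2 and 3.
counts = np.bincount(demo_weights, minlength=4)
print("draws per weight value 1..3:", counts[1:])
print("min:", demo_weights.min(), "max:", demo_weights.max())
```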

### Frequency weights are equivalent to *repeated* records

First, let's demonstrate that `freq_weights` are equivalent to repeating records. We'll create a new dataset, called `data_repeated`, in which each record is repeated as many times as its weight.

In [None]:
data_repeated = np.repeat(np.array(data), weights, axis=0)
data_repeated = pd.DataFrame(data_repeated)
data_repeated.columns = data.columns

In [None]:
data_repeated.describe()
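The same repetition can also be done without the round trip through a NumPy array. The sketch below is an alternative under assumed toy data (`toy` and `toy_weights` are hypothetical names, not part of the notebook); it repeats index labels with `Index.repeat`, which keeps column names and dtypes as they are.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `data` and `weights`.
toy = pd.DataFrame({"affairs": [0.0, 1.5, 3.0], "age": [22.0, 27.0, 32.0]})
toy_weights = np.array([1, 3, 2])

# Approach used in the cell above: repeat rows of the underlying array.
via_numpy = pd.DataFrame(np.repeat(np.array(toy), toy_weights, axis=0),
                         columns=toy.columns)

# Alternative: repeat the index labels and select those rows.
via_index = toy.loc[toy.index.repeat(toy_weights)].reset_index(drop=True)

print(via_index)
```

Both results have `toy_weights.sum()` rows, one copy of each record per unit of weight.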

In [None]:
glm = smf.glm('affairs ~ rate_marriage + age + yrs_married',
              data=data, family=sm.families.Poisson(), freq_weights=weights)
res = glm.fit()
print(res.summary())

In [None]:
glm_repeated = smf.glm('affairs ~ rate_marriage + age + yrs_married',
                       data=data_repeated, family=sm.families.Poisson())
res = glm_repeated.fit()
print(res.summary())

The parameters, standard errors, degrees of freedom, likelihood, deviance, and Pearson chi2 are all the same! The only difference is the number of observations, which is intentionally higher in the repeated dataset.

### Variance weights are equivalent to *averaging* records

We'll use the same weights as in the `freq_weights` example. This time we need a new dataset, called `data_unaveraged`, whose group averages reproduce the original `endog`. For each original record, we multiply `endog` by the weight and then add `weight - 1` extra records with `endog` equal to 0, so that the mean of `endog` over each group of records equals the original value.

In [None]:
data_unaveraged = data.copy()
data_unaveraged['affairs'] *= weights
data_extra = np.repeat(np.array(data_unaveraged), weights - 1, axis=0)
data_extra = pd.DataFrame(data_extra)
data_extra.columns = data.columns
data_extra['affairs'] = 0
data_unaveraged = pd.concat((data_unaveraged, data_extra), axis=0)

In [None]:
glm_unaveraged = smf.glm('affairs ~ rate_marriage + age + yrs_married',
                         data=data_unaveraged, family=sm.families.Poisson())
res = glm_unaveraged.fit()
print(res.summary())

In [None]:
glm = smf.glm('affairs ~ rate_marriage + age + yrs_married',
              data=data, family=sm.families.Poisson(),
              var_weights=weights)
res = glm.fit()
print(res.summary())

In this example, think of the original `data` as an *average* of the `data_unaveraged` data. Using `var_weights`, we get the same parameters and standard errors, but the likelihood, degrees of freedom, and deviance are different.
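One way to see why the parameters and standard errors agree is to compare the score vector and the Fisher information of the two formulations at an arbitrary coefficient vector. The sketch below hand-codes both quantities for a Poisson model with log link; it is an illustration on assumed synthetic data, not a reproduction of statsmodels internals, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(2.0, size=n).astype(float)
w = rng.randint(1, 4, size=n)
beta = np.array([0.2, -0.1])  # arbitrary evaluation point, not a fitted value


def score_and_info(X, y, beta, var_weights=None):
    """Score and Fisher information for a Poisson GLM with log link."""
    vw = np.ones(len(y)) if var_weights is None else var_weights
    mu = np.exp(X @ beta)
    score = X.T @ (vw * (y - mu))
    info = X.T @ (X * (vw * mu)[:, None])
    return score, info


# Un-averaged data: endog scaled by the weight, plus (w - 1) zero records.
X_un = np.concatenate([X, np.repeat(X, w - 1, axis=0)])
y_un = np.concatenate([y * w, np.zeros(int((w - 1).sum()))])

s_avg, i_avg = score_and_info(X, y, beta, var_weights=w)
s_un, i_un = score_and_info(X_un, y_un, beta)

print("scores match:     ", np.allclose(s_avg, s_un))
print("information match:", np.allclose(i_avg, i_un))
```

Because both the estimating equations and the information matrix coincide, the estimates and their standard errors coincide as well, while the log-likelihood values need not.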

### Special case with `log` link: Exposure is equivalent to *aggregating* data

We'll continue to use the same weights, but now we will un-aggregate by adding additional records with `endog` of 0. The *sum* of the endog will be the same.

In [None]:
data_unaggregated = data.copy()
data_extra = np.repeat(np.array(data), weights - 1, axis=0)
data_extra = pd.DataFrame(data_extra)
data_extra.columns = data.columns
data_extra['affairs'] = 0
data_unaggregated = pd.concat((data_unaggregated, data_extra), axis=0)

In [None]:
glm_unaggregated = smf.glm('affairs ~ rate_marriage + age + yrs_married',
                           data=data_unaggregated, family=sm.families.Poisson())
res = glm_unaggregated.fit()
print(res.summary())

In [None]:
glm_exposure = smf.glm('affairs ~ rate_marriage + age + yrs_married',
                       data=data, family=sm.families.Poisson(),
                       exposure=weights)
res = glm_exposure.fit()
print(res.summary())

Similar to the averaging example, exposure provides the same parameters and standard errors, but different degrees of freedom, likelihood, and deviance. This example only works for the `Poisson` family because the variance is equal to the expectation. You can also match using the `Tweedie` family with exposure by including `var_weights = weight ** (var_power - 1)`.