# Absorbing Regression

An absorbing regression is a model of the form 

$$ y_i = x_i \beta + z_i \gamma +\epsilon_i $$

where interest is in $\beta$ and not $\gamma$. $z_i$ may be high-dimensional, and its dimension may grow with the sample size (e.g., a matrix of fixed effects).
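To make the idea of absorbing concrete, the following is a minimal sketch on toy data (not the data simulated in this notebook) with a single set of group dummies. It assumes only NumPy, and all names and sizes are illustrative. Partialling the dummies out of both $y$ and $x$ and then regressing the residuals should give the same estimate of $\beta$ as including the dummies explicitly (the Frisch-Waugh-Lovell result).

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_groups = 600, 30

# Balanced toy grouping: every group has n / n_groups observations
group = rng.permutation(np.arange(n) % n_groups)
alpha = rng.standard_normal(n_groups)

# Regressors are correlated with the group effects, so ignoring z would bias beta
x = rng.standard_normal((n, 2)) + 0.5 * alpha[group][:, None]
y = x @ np.ones(2) + alpha[group] + rng.standard_normal(n)

# Dense dummy matrix z (only feasible because the example is tiny)
z = np.eye(n_groups)[group]

# (a) Regress y on [x, z] and keep the coefficients on x
beta_full = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0][:2]


def partial_out(v, z):
    """Return the residual of v after least-squares projection onto z."""
    return v - z @ np.linalg.lstsq(z, v, rcond=None)[0]


# (b) Absorb z from both sides, then regress residual on residual
beta_abs = np.linalg.lstsq(partial_out(x, z), partial_out(y, z), rcond=None)[0]

print(beta_full)
print(beta_abs)
```

Route (b) never needs the coefficients on $z_i$, which is what makes it attractive when $z_i$ has tens of thousands of columns.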

This notebook shows how this type of model can be fit to a simulated data set that mirrors some used in practice. There are two effects: one for the worker's state (small) and one for the worker's firm (large).

In [1]:
import numpy as np
import pandas as pd

rs = np.random.RandomState(0)
nobs = 250000
# 50 states, assigned to workers at random
state_id = rs.randint(50, size=nobs)
state_effects = rs.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id]
# 5 workers/firm, on average
firm_id = rs.randint(nobs // 5, size=nobs)
firm_effects = rs.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id]
# Categorical variables that will be absorbed
cats = pd.DataFrame({"state": pd.Categorical(state_id), "firm": pd.Categorical(firm_id)})
eps = rs.standard_normal(nobs)
# A constant and two continuous regressors
x = rs.standard_normal((nobs, 2))
x = np.column_stack([np.ones(nobs), x])
# Every coefficient on x, including the constant, is 1
y = x.sum(1) + firm_effects + state_effects + eps

## Including a constant
The estimator can estimate an intercept even when all dummies are included. This is done using a mathematical trick, and the intercept is not usually meaningful. The estimate is computed as if the dummies were orthogonalized to a constant.

In [2]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x, absorb=cats)
print(mod.fit())

                         Absorbing LS Estimation Summary                          
Dep. Variable:              dependent   R-squared:                          0.8462
Estimator:               Absorbing LS   Adj. R-squared:                     0.8080
No. Observations:              250000   F-statistic:                     4.944e+05
Date:                Mon, Dec 14 2020   P-value (F-stat):                   0.0000
Time:                        16:46:25   Distribution:                      chi2(2)
Cov. Estimator:                robust   R-squared (No Effects):             0.6665
                                        Varaibles Absorbed:               4.97e+04
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog.0         0.7737     0.0018     432.47     0.0000      0.7702      0.7772
exog.1         1.001

## Excluding the constant
If the constant is dropped, the other coefficients are identical since the dummies span the constant.
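The claim that a full set of dummies spans the constant can be checked on a toy example. This is a small NumPy sketch with made-up state labels, not the notebook's data: each row of the dummy matrix contains exactly one 1, so the columns sum to a vector of ones and a constant adds no new information.

```python
import numpy as np

# Hypothetical state labels for eight workers
state = np.array([0, 1, 2, 0, 1, 2, 2, 0])
dummies = np.eye(state.max() + 1)[state]
ones = np.ones(len(state))

# Each observation belongs to exactly one category
row_sums = dummies.sum(axis=1)

# Projecting the constant onto the dummies leaves no residual
coef = np.linalg.lstsq(dummies, ones, rcond=None)[0]
resid = ones - dummies @ coef

# Adding a constant column does not increase the rank
with_const = np.column_stack([ones, dummies])
rank = np.linalg.matrix_rank(with_const)

print(row_sums)
print(np.abs(resid).max())
print(rank, with_const.shape[1])
```

Because the constant lies in the column space of the dummies, partialling out the dummies removes it entirely, so the slope estimates do not depend on whether it is included.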

In [3]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x[:,1:], absorb=cats)
print(mod.fit())

                         Absorbing LS Estimation Summary                          
Dep. Variable:              dependent   R-squared:                          0.8462
Estimator:               Absorbing LS   Adj. R-squared:                     0.8080
No. Observations:              250000   F-statistic:                     4.944e+05
Date:                Mon, Dec 14 2020   P-value (F-stat):                   0.0000
Time:                        16:46:25   Distribution:                      chi2(2)
Cov. Estimator:                robust   R-squared (No Effects):             0.6665
                                        Varaibles Absorbed:               4.97e+04
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog.0         1.0012     0.0020     498.01     0.0000      0.9973      1.0051
exog.1         0.998

## Optimization Options
LSMR is iterative and does not have a closed form. The tolerance can be set using `lsmr_options`, which is a dictionary of options passed to LSMR. See [scipy.sparse.linalg.lsmr](https://docs.scipy.org/doc/scipy-1.2.1/reference/generated/scipy.sparse.linalg.lsmr.html#scipy.sparse.linalg.lsmr) for details on the options.
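To see what these options control, here is a hedged sketch that calls `scipy.sparse.linalg.lsmr` directly on a small sparse matrix of two sets of dummies. The data, sizes, and tolerance values are illustrative assumptions, and the code is not a description of how `AbsorbingLS` works internally. It only shows that a dictionary with keys such as `atol`, `btol`, `maxiter`, and `show` maps onto LSMR's keyword arguments.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsmr

rng = np.random.default_rng(0)
n, n_states, n_firms = 2000, 10, 100
state = rng.integers(n_states, size=n)
firm = rng.permutation(np.arange(n) % n_firms)


def dummy_matrix(codes, n_cols):
    """Sparse indicator matrix with one column per category."""
    rows = np.arange(len(codes))
    return sparse.csr_matrix((np.ones(len(codes)), (rows, codes)), shape=(len(codes), n_cols))


# Stack both sets of dummies side by side
D = sparse.hstack([dummy_matrix(state, n_states), dummy_matrix(firm, n_firms)], format="csr")

# A variable containing both effects plus noise
v = rng.standard_normal(n_states)[state] + rng.standard_normal(n_firms)[firm]
v = v + rng.standard_normal(n)

# Illustrative options; tighter tolerances usually cost more iterations
options = {"atol": 1e-10, "btol": 1e-10, "maxiter": 1000, "show": False}
coef, istop, itn = lsmr(D, v, **options)[:3]

# The residual is the variable with both sets of effects purged
resid = v - D @ coef
print(istop, itn)
print(np.abs(D.T @ resid).max())
```

At convergence the purged variable is orthogonal to every dummy column, so `D.T @ resid` should be close to zero. Loosening `atol` and `btol` trades some of that accuracy for fewer iterations.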

Below, `use_cache` is set to `False` to ensure that LSMR is run. By default, the exogenous variables with the effects purged are cached. LSMR is run once for the dependent variable and once for each column in exog.

In [4]:
from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(y, x[:,1:], absorb=cats)
res = mod.fit(use_cache=False, lsmr_options={"show": True})

 
LSMR            Least-squares solution of  Ax = b

The matrix A has 250000 rows and 49702 columns
damp = 0.00000000000000e+00

atol = 1.00e-08                 conlim = 1.00e+08

btol = 1.00e-08             maxiter =    49702

 
   itn      x(1)       norm r    norm Ar  compatible   LS      norm A   cond A
     0  0.00000e+00  1.205e+03  1.030e+03   1.0e+00  7.1e-04
     1  1.49975e+01  8.358e+02  3.105e+02   6.9e-01  3.1e-01  1.2e+00  1.0e+00
     2 -1.34026e+00  7.825e+02  9.467e+01   6.5e-01  7.2e-02  1.7e+00  1.1e+00
     3 -1.27956e+00  7.758e+02  3.632e+01   6.4e-01  2.4e-02  2.0e+00  1.3e+00
     4 -3.42453e+00  7.745e+02  7.430e-01   6.4e-01  4.3e-04  2.2e+00  1.4e+00
     5 -3.42346e+00  7.745e+02  3.038e-01   6.4e-01  1.6e-04  2.4e+00  1.4e+00
     6 -3.34778e+00  7.745e+02  1.196e-02   6.4e-01  5.8e-06  2.6e+00  1.5e+00
     7 -3.34682e+00  7.745e+02  1.145e-02   6.4e-01  5.6e-06  2.7e+00  3.9e+00
     8 -3.34463e+00  7.745e+02  1.135e-02   6.4e-01  5.1e-06  2.9e+00  1.1e+0