**Fin 585R**  
**Diether**<br> 
**Class Notes**<br> 
**Fixed Effects Regression Models**  

**Fama MacBeth: What Problems Does it Solve?**

We can use Fama-MacBeth (1973) regressions to test the prediction of the CAPM that all differences in expected returns are due to differences in beta. In other words, we test whether the security market line holds in a cross-sectional setting where the y-variable is expected returns and the x-variable is beta:<br><br>
$$
E(r_i) = E(r_{0M}) + \beta_{iM}\bigl[E(r_M) - E(r_{0M}) \bigr]
$$
<br>For example, we can test whether the SML holds using the following specification:<br><br>
$$
r_{it} = \gamma_{0t} + \gamma_{1t}\hat{\beta}_{it} + \gamma_{2t}log(ME_{i,t-1})
                     + \gamma_{3t}log([\tfrac{B}{M}]_{i,t-1}) + \nu_{it}
$$
<br>The CAPM predicts the following for the gamma coefficients: $\gamma_{1} = E(r_M) - E(r_{0M})$, $\gamma_{2} = 0$, and $\gamma_3 = 0$.

**Fama-MacBeth solves two econometric problems**:

1. It soaks up unobserved period by period variation. In Fama-MacBeth we run cross-sectional regressions period by period, so the estimation is done within each cross-section.<br><br>

2. Fama-MacBeth doesn't assume the error terms are IID across stocks like the standard linear model does. The standard errors reflect the time series distribution of the estimated coefficients. As such they should correctly reflect the cross-correlated nature of the error terms.

Fama-MacBeth has another estimation advantage. Because it is estimated period by period, betas and the other explanatory variables are free to vary over time (this loosens the stationarity assumption).
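To make the two-step procedure concrete, here is a minimal sketch on simulated data. The panel, the column names (`caldt`, `x`, `ret`), and the helper `cross_section_ols` are assumptions made up for illustration; this is not the course dataset or the problem-set code. Step 1 runs one cross-sectional regression per month, and step 2 uses the time series of the monthly coefficients to get the point estimates and standard errors.

```python
import numpy as np
import pandas as pd

# Simulated panel (illustrative assumption): 60 months, 200 stocks,
# a true slope of 0.5 on x, and a common shock that hits every stock in a month.
rng = np.random.default_rng(0)
T, N = 60, 200
panel = pd.DataFrame({
    "caldt": np.repeat(np.arange(T), N),
    "x": rng.normal(size=T * N),
})
month_shock = np.repeat(rng.normal(scale=2.0, size=T), N)
panel["ret"] = 1.0 + 0.5 * panel["x"] + month_shock + rng.normal(size=T * N)


def cross_section_ols(g):
    """OLS of ret on a constant and x using one month's cross-section."""
    X = np.column_stack([np.ones(len(g)), g["x"].to_numpy()])
    coef, *_ = np.linalg.lstsq(X, g["ret"].to_numpy(), rcond=None)
    return pd.Series(coef, index=["const", "x"])


# Step 1: one regression per month; rows are months, columns are coefficients.
gammas = pd.DataFrame(
    {date: cross_section_ols(g) for date, g in panel.groupby("caldt")}
).T

# Step 2: the estimate is the time-series mean of the monthly coefficients,
# and its standard error is the time-series standard error of that mean.
fm_mean = gammas.mean()
fm_se = gammas.std(ddof=1) / np.sqrt(len(gammas))
fm_t = fm_mean / fm_se

print(pd.DataFrame({"mean": fm_mean, "se": fm_se, "t": fm_t}))
```

The common monthly shock moves the monthly intercepts around but is absorbed within each cross-section, so it does not contaminate the slope on `x`.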

A final advantage of Fama-MacBeth is it can be implemented in systems with limited memory and processing power. It may take a while to compute on systems with little RAM and processing power, but it only ever estimates regressions that are a small sub-sample of the overall data. In the 1970s when Fama-MacBeth was introduced, working around this constraint was a big deal. 

**Alternative: Fixed Effects and Clustered Standard Errors**

Fama-MacBeth has remained a popular estimation technique for cross-sectional tests of asset pricing models in the finance literature. That said, another way to test the CAPM cross-sectionally is a linear regression with calendar-time fixed effects and standard errors clustered on calendar time. (This approach became popular in other areas of economics for similar kinds of empirical tests after Fama-MacBeth was developed.) Adding calendar-time fixed effects to a linear model soaks up unobserved period-by-period variation, so the slope coefficients are identified from cross-sectional variation within each time period. Additionally, the standard errors can be adjusted for cross-correlation within each time period by using a sandwich estimator for the variance/covariance matrix that is robust to cross-correlation at the calendar-time level (often called clustered standard errors). Thus Fama-MacBeth and a linear regression with fixed effects and clustered standard errors solve the same econometric problems. They are not mechanically or mathematically equivalent, but they typically yield quite similar estimates and standard errors.
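For reference, the textbook form of the cluster-robust sandwich estimator is sketched below. The notation is an assumption introduced here: $X_t$ stacks the regressors for all stocks in period $t$, $\hat{\nu}_t$ stacks the corresponding residuals, and $X$ stacks all periods. The formula is written without the finite-sample corrections that software typically applies.

$$
\widehat{Var}(\hat{\gamma}) = (X'X)^{-1}\Bigl(\sum_{t=1}^{T} X_t'\hat{\nu}_t\hat{\nu}_t'X_t\Bigr)(X'X)^{-1}
$$

Because the middle term sums the outer products period by period, residuals may be arbitrarily correlated across stocks within a period, but they are treated as uncorrelated across periods.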

The rest of the notebook demonstrates how to perform a cross-sectional test of the CAPM using a linear fixed effects model and clustered standard errors.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

**Data**

Let's use the data from the problem set where we tested the CAPM using Fama-MacBeth regressions.

In [2]:
df = pd.read_csv('https://diether.org/prephd/11-mstk_fm_29-63.csv',parse_dates=['caldt'])
df.head()

Unnamed: 0,permno,caldt,ret,beta,melag,bmlag
0,10006,1929-02-28,0.002519,0.63048,59.55,1.077
1,10006,1929-03-28,0.027638,0.62575,59.7,1.077
2,10006,1929-04-30,-0.022333,0.60519,60.45,1.077
3,10006,1929-05-31,-0.045685,0.60637,59.1,1.077
4,10006,1929-06-29,0.042553,0.60761,56.4,1.077


In [3]:
df['ret'] *= 100                      # returns in percent
df['logme'] = np.log(df['melag'])     # log of lagged market equity
df['logbm'] = np.log(df['bmlag'])     # log of lagged book-to-market

**Clustering standard errors**

First, how do you cluster standard errors in statsmodels? The `fit` method takes a `cov_type` parameter, and clustered standard errors are one of the options. We can run a simple pooled OLS regression like the following.<br><br>
$$
r_{it} = \gamma_{0} + \gamma_{1}\hat{\beta}_{it} + \gamma_{2}log(ME_{i,t-1})
                     + \gamma_{3}log([\tfrac{B}{M}]_{i,t-1}) + \nu_{it}
$$

But instead of using a standard variance/covariance matrix, we adjust the standard errors to allow for clustering at the calendar-month level.

The code below first runs the regression without clustering the standard errors, and then estimates the same regression with standard errors clustered on calendar date. Notice how the standard errors change.

**No clustering**

In [5]:
# Just a pooled regression; not representing cross section correctly.
r0 = smf.ols("ret ~ beta + logme + logbm",data=df).fit()
print(r0.summary())

                            OLS Regression Results                            
Dep. Variable:                    ret   R-squared:                       0.010
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     932.7
Date:                Tue, 14 Feb 2023   Prob (F-statistic):               0.00
Time:                        14:37:14   Log-Likelihood:            -1.0779e+06
No. Observations:              279854   AIC:                         2.156e+06
Df Residuals:                  279850   BIC:                         2.156e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4724      0.080     30.751      0.0

**Standard errors that allow for clustering on calendar date**

In [7]:
# This is still a pooled regression, not a cross-sectional one, but it fixes the standard errors.
# Look at the cov_type parameter on fit.
# The summary reports z-stats because we rely on the Central Limit Theorem now.
r1 = (smf.ols("ret ~ beta + logme + logbm",data=df)
      .fit(cov_type='cluster',cov_kwds={'groups':df['caldt']}))
print(r1.summary())

                            OLS Regression Results                            
Dep. Variable:                    ret   R-squared:                       0.010
Model:                            OLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     5.826
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           0.000661
Time:                        14:38:32   Log-Likelihood:            -1.0779e+06
No. Observations:              279854   AIC:                         2.156e+06
Df Residuals:                  279850   BIC:                         2.156e+06
Df Model:                           3                                         
Covariance Type:              cluster                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.4724      0.463      5.341      0.0
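The same pattern can be reproduced on simulated data, which makes it easier to see where the difference comes from. The sketch below is an illustration under assumed names (`sim`, `month`, `x`); it is not the course dataset. Both the regressor and the error share a common month-level component, so observations within a month are not independent.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel (illustrative assumption): 100 months, 50 stocks per month.
rng = np.random.default_rng(1)
T, N = 100, 50
month = np.repeat(np.arange(T), N)

# Regressor and error each have a component common to all stocks in a month.
x = rng.normal(size=T)[month] + rng.normal(size=T * N)
e = rng.normal(size=T)[month] + rng.normal(size=T * N)
sim = pd.DataFrame({"month": month, "x": x, "ret": 1.0 + 0.5 * x + e})

# Same pooled regression twice; only the covariance estimator differs.
fit_plain = smf.ols("ret ~ x", data=sim).fit()
fit_cluster = smf.ols("ret ~ x", data=sim).fit(
    cov_type="cluster", cov_kwds={"groups": sim["month"]}
)

# Side-by-side standard errors: the coefficients are identical by construction.
se_table = pd.DataFrame({"nonrobust": fit_plain.bse, "clustered": fit_cluster.bse})
print(se_table)
```

Clustering leaves the point estimates untouched and only changes the standard errors. In this setup the nonrobust standard errors are too small because they treat the 5,000 observations as independent when there are really only 100 independent months.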

**Fixed Effects**

You can easily implement fixed effects by remembering that fixed effects are just dummy variables. `Statsmodels` makes it easy to create fixed effects using the categorical function (`C`) in the patsy formula interface. The problem with this approach is that execution time gets long and it uses a **lot of RAM** when the dataset is large. Still, it's easy to implement. For example, consider the following regression model with calendar-time fixed effects ($d_t$ are the time-based dummy variables):<br><br>
$$
r_{it} = d_{t} + \gamma_{1}\hat{\beta}_{it} + \gamma_{2}log(ME_{i,t-1})
                     + \gamma_{3}log([\tfrac{B}{M}]_{i,t-1}) + \nu_{it}
$$

I also continue to cluster the standard errors at the calendar-month level. The code for such a fixed effects model is below. Feel free to try it if you want. It may take a minute or two to run, assuming your machine has enough memory (it probably does; this dataset isn't that large).

In [9]:
# We need an indicator variable for each month.
# The C() 'categorical' function generates one dummy column per month.
# The -1 drops the intercept so every month gets its own dummy.
# This is equivalent to absorbing a fixed effect for each month.
# Inverting X'X takes a minute.
r2 = (smf.ols("ret ~ -1 + beta + logme + logbm + C(caldt)",data=df)
      .fit(cov_type='cluster',cov_kwds={'groups':df['caldt']}))
print(r2.summary())


                            OLS Regression Results                            
Dep. Variable:                    ret   R-squared:                       0.393
Model:                            OLS   Adj. R-squared:                  0.392
Method:                 Least Squares   F-statistic:                       nan
Date:                Tue, 14 Feb 2023   Prob (F-statistic):                nan
Time:                        14:41:51   Log-Likelihood:            -1.0094e+06
No. Observations:              279854   AIC:                         2.020e+06
Df Residuals:                  279437   BIC:                         2.024e+06
Df Model:                         416                                         
Covariance Type:              cluster                                         
                                                 coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------------

**Efficient implementation of one way fixed effects**

An alternative is to implement the fixed effects by demeaning the data (see a good econometrics book for details). Below is a little function I wrote called `my_areg` (it basically has the same core functionality as Stata's areg command). It allows you to estimate one-way fixed effects models and cluster the standard errors: 

In [13]:
import patsy
import statsmodels.api as sm


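def my_areg(formula,data=None,absorb=None,cluster=None):
    # Build the outcome and design matrices (including the intercept) from the formula
    y,X = patsy.dmatrices(formula,data,return_type='dataframe')

    # Demean y within each absorb group, then add back the grand mean
    # so the intercept is still estimated
    ybar = y.mean()
    y = y -  y.groupby(data[absorb]).transform('mean') + ybar
    
    # Same within-group transformation for the regressors
    Xbar = X.mean()
    X = X - X.groupby(data[absorb]).transform('mean') + Xbar
    
    reg = sm.OLS(y,X)
    # Account for df loss from FE transform
    reg.df_resid -= (data[absorb].nunique() - 1)
    
    # Cluster the standard errors on the cluster column
    return reg.fit(cov_type='cluster',cov_kwds={'groups':data[cluster]})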


In [14]:
r3 = my_areg('ret ~ beta + logme + logbm',data=df,absorb='caldt',cluster='caldt')
print(r3.summary())

                            OLS Regression Results                            
Dep. Variable:                    ret   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     9.845
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           2.77e-06
Time:                        14:46:31   Log-Likelihood:            -1.0094e+06
No. Observations:              279854   AIC:                         2.019e+06
Df Residuals:                  279437   BIC:                         2.019e+06
Df Model:                           3                                         
Covariance Type:              cluster                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0681      0.237      8.738      0.0
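As a sanity check on the demeaning approach, the sketch below compares the dummy-variable regression with the within-month demeaned regression on a small simulated panel. The data and names (`toy`, `month`, `x`) are assumptions for illustration only. The slope estimates should agree to numerical precision.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel (illustrative assumption): 24 months, 40 stocks per month.
rng = np.random.default_rng(2)
T, N = 24, 40
month = np.repeat(np.arange(T), N)
month_effect = rng.normal(scale=3.0, size=T)

# x is correlated with the month effect, so ignoring it would bias the slope.
x = rng.normal(size=T * N) + 0.5 * month_effect[month]
ret = month_effect[month] + 0.8 * x + rng.normal(size=T * N)
toy = pd.DataFrame({"month": month, "x": x, "ret": ret})

# (a) Month dummies via C(); one dummy per month replaces the intercept.
dummy_fit = smf.ols("ret ~ -1 + x + C(month)", data=toy).fit()

# (b) Subtract month means from ret and x, then regress without a constant.
within = toy[["ret", "x"]] - toy.groupby("month")[["ret", "x"]].transform("mean")
demean_fit = smf.ols("ret ~ -1 + x", data=within).fit()

slope_dummy = dummy_fit.params["x"]
slope_demean = demean_fit.params["x"]
print(slope_dummy, slope_demean)
```

The slopes match, but the default standard errors from the demeaned regression do not account for the month means that were estimated along the way. That is why `my_areg` above reduces `df_resid` by the number of absorbed groups.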

**Use the Fin 585R Library: areg is built in**

In [12]:
from finance_byu.regression import areg

r4 = areg('ret ~ beta + logme + logbm',data=df,absorb='caldt',cluster='caldt')
print(r4.summary())

                            OLS Regression Results                            
Dep. Variable:                    ret   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     9.845
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           2.77e-06
Time:                        14:46:27   Log-Likelihood:            -1.0094e+06
No. Observations:              279854   AIC:                         2.019e+06
Df Residuals:                  279437   BIC:                         2.019e+06
Df Model:                           3                                         
Covariance Type:              cluster                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0681      0.237      8.738      0.0