# Basics of reghdfe

We are interested in the estimation of a linear model such as 

\begin{equation}
y=\mathbf{X}\beta + \mathbf{D}\alpha + \epsilon
\end{equation}

where $\mathbf{D}=[\mathbf{D}_1 ~\mathbf{D}_2 ~ \dots ~ \mathbf{D}_F]$ collects $F$ sets of fixed effects and each $\mathbf{D}_i$ is a matrix of dummy variables built from one categorical (indicator) variable.
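
It can help to write down what "absorbing" $\mathbf{D}$ means before looking at the commands. The following is the standard Frisch-Waugh-Lovell restatement of the model, not a description of the algorithm **reghdfe** uses internally; the symbol $\mathbf{M}_{\mathbf{D}}$ is notation introduced here, and $(\mathbf{D}'\mathbf{D})^{-}$ denotes a generalized inverse (an ordinary inverse suffices when $F=1$).

\begin{equation}
\mathbf{M}_{\mathbf{D}} = \mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-}\mathbf{D}', \qquad \mathbf{M}_{\mathbf{D}}\,y = \mathbf{M}_{\mathbf{D}}\mathbf{X}\beta + \mathbf{M}_{\mathbf{D}}\epsilon
\end{equation}

Because $\mathbf{M}_{\mathbf{D}}\mathbf{D}=0$, the term $\mathbf{D}\alpha$ drops out, and OLS on the transformed variables gives the same $\hat\beta$ as the regression with all the dummies. With $F=1$, $\mathbf{M}_{\mathbf{D}}$ simply subtracts the group mean from each variable.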

Consider the following dataset:

In [1]:
estimates clear
use fakedata, clear
sum
list in 1/10





    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           i |        100       12.48    5.292419          1         20
           j |        100        8.37    4.409941          1         15
           t |        100        3.76    2.283538          1          9
           y |        100    -1.92123    10.76019    -28.771     27.271
          x1 |        100         .32    2.898554         -5          5
-------------+---------------------------------------------------------
          x2 |        100         .36    2.976406         -5          5
         zi1 |        100    .4886572    .2834865   .1086679   .8948932


     +------------------------------------------+
     | i   j   t         y   x1   x2        zi1 |
     |------------------------------------------|
  1. | 1   1   1   -12.941    1   -2   .1369841 |
  2. | 2   1   1   -13.734    0    1   .6432207 |
  3. | 2   1   2   -11.3

# Regression with one fixed effect (F=1)

Official Stata offers three main ways to run a regression with one fixed effect: **regress** with indicator variables, **areg**, and **xtreg, fe**.

In [2]:
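* OLS with an explicit dummy for each value of i
qui regress y x1 x2 i.i
estimates store reg_1
* areg absorbs the dummies instead of reporting them
qui areg y x1 x2, absorb(i)
estimates store areg_1
* xtreg, fe applies the within transformation (requires xtset)
qui xtset i
qui xtreg y x1 x2, fe
estimates store xtreg_1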


## Estimation with **reghdfe**

In [3]:
which reghdfe
qui reghdfe y x1 x2 i.i
estimates store reghdfe_1a
qui reghdfe y x1 x2, absorb(i)
estimates store reghdfe_1b


/Users/miguelportela/Library/Application Support/Stata/ado/plus/r/reghdfe.ado
*! version 6.12.2 02Nov2021






In [4]:
estimates table *_1 reghdfe_1a reghdfe_1b, keep(x1 x2) b(%7.4f) se(%7.4f) stats(N r2 r2_a)


----------------------------------------------------------------
    Variable |  reg_1    areg_1    xtreg_1   regh~1a   regh~1b  
-------------+--------------------------------------------------
          x1 |  1.0659    1.0659    1.0659    1.0659    1.0659  
             |  0.2885    0.2885    0.2885    0.2885    0.2885  
          x2 | -0.6755   -0.6755   -0.6755   -0.6755   -0.6755  
             |  0.2863    0.2863    0.2863    0.2863    0.2863  
-------------+--------------------------------------------------
           N |     100       100       100       100        97  
          r2 |  0.5832    0.5832    0.1990    0.5832    0.5777  
        r2_a |  0.4710    0.4710   -0.0166    0.4710    0.4803  
----------------------------------------------------------------
                                                    Legend: b/se


Coefficients and standard errors are identical in all five regressions, but:
- **xtreg** reports the "within" $R^2$ and adjusted $R^2$
- **reghdfe** with the **absorb** option reports fewer observations (97 instead of 100) because it drops singletons!
- Dropping singletons affects the calculation of $R^2$ and adjusted $R^2$
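
With a single fixed effect, a singleton is an observation that is alone in its group. A quick way to find them, using only official Stata commands, is sketched below; the variable name `n_i` is made up for illustration, and the count should correspond to the gap between the two sample sizes in the table above.

```stata
use fakedata, clear
* number of observations sharing each value of i
bysort i: generate n_i = _N
* a singleton is an observation that is alone in its group
count if n_i == 1
* show which observations they are
list i j t y if n_i == 1
```

With several fixed effects the search has to be repeated until no singletons remain, since dropping an observation from one group can create a singleton in another.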

## What if the robust option is included?

In [5]:
qui regress y x1 x2 i.i, vce(robust)
estimates store reg_2a
qui areg y x1 x2, absorb(i) vce(robust)
estimates store areg_2a
qui xtreg y x1 x2, fe vce(robust) 
estimates store xtreg_2a
qui reghdfe y x1 x2, absorb(i) vce(robust) 
estimates store reghdfe_2a
estimates table *_2a, keep(x1 x2) b(%7.4f) se(%7.4f) stats(N)











------------------------------------------------------
    Variable | reg_2a    areg_2a   xtre~2a   regh~2a  
-------------+----------------------------------------
          x1 |  1.0659    1.0659    1.0659    1.0659  
             |  0.2906    0.2906    0.2956    0.2862  
          x2 | -0.6755   -0.6755   -0.6755   -0.6755  
             |  0.3025    0.3025    0.3148    0.2979  
-------------+----------------------------------------
           N |     100       100       100        97  
------------------------------------------------------
                                          Legend: b/se


- **regress** and **areg** produce the same results
- **xtreg** produces different standard errors, which are in fact clustered
- **reghdfe** also produces standard errors that differ from all the others


What is going on?

- **xtreg** assumes that the data are clustered on **i**. With **xtreg, fe** the option **vce(robust)** is equivalent to **vce(cluster i)**, where **i** is the panel variable
- **reghdfe** produces different results because it drops singletons by default!
- Keeping singletons can bias your standard errors! If you nevertheless want to keep them, use the option **keepsingletons**
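
To check that singletons are what separates **reghdfe** from **areg** here, one can rerun **reghdfe** with **keepsingletons** and compare. This is only a sketch: the names `areg_chk` and `reghdfe_chk` are made up, and whether the two columns agree exactly is something to verify rather than assume.

```stata
use fakedata, clear
* robust standard errors from areg (all 100 observations)
quietly areg y x1 x2, absorb(i) vce(robust)
estimates store areg_chk
* same model in reghdfe, but keeping the singletons
quietly reghdfe y x1 x2, absorb(i) vce(robust) keepsingletons
estimates store reghdfe_chk
estimates table areg_chk reghdfe_chk, keep(x1 x2) b(%7.4f) se(%7.4f) stats(N)
* ratio of the small-sample factors N/(N-K) with 100 versus 97 observations
display sqrt(100/97)
```

Each dropped singleton removes one observation and one fixed effect, so N-K does not change and only N differs in the factor N/(N-K). The resulting ratio of about 1.015 is close to the ratio of the standard errors reported above (0.2906/0.2862).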

## What if the cluster option is included?

In [6]:
qui regress y x1 x2 i.i, vce(cluster i)
estimates store reg_2b
qui areg y x1 x2, absorb(i)  vce(cluster i)
estimates store areg_2b
qui xtreg y x1 x2, fe vce(cluster i) 
estimates store xtreg_2b
qui reghdfe y x1 x2, absorb(i)  vce(cluster i) 
estimates store reghdfe_2b
qui reghdfe y x1 x2, absorb(i)  vce(cluster i) keepsingletons 
estimates store reghdfe_2ba

estimates table *_2b *_2ba, keep(x1 x2) b(%7.4f) se(%7.4f) stats(N)













----------------------------------------------------------------
    Variable | reg_2b    areg_2b   xtre~2b   regh~2b   reghd~a  
-------------+--------------------------------------------------
          x1 |  1.0659    1.0659    1.0659    1.0659    1.0659  
             |  0.3296    0.3296    0.2956    0.2971    0.2956  
          x2 | -0.6755   -0.6755   -0.6755   -0.6755   -0.6755  
             |  0.3511    0.3511    0.3148    0.3164    0.3148  
-------------+--------------------------------------------------
           N |     100       100       100        97       100  
----------------------------------------------------------------
                                                    Legend: b/se


- **regress** and **areg** produce the same results: they adjust the degrees of freedom by the number of coefficients in the fixed effects
- **xtreg** does not count the fixed effects in the degrees of freedom, which is appropriate if the fixed effects are nested within clusters
- **xtreg** produces the same results as with the **vce(robust)** option
- since the fixed effect is nested within the cluster, **reghdfe** (with option **keepsingletons**) produces the same standard errors as **xtreg**
- you can force **xtreg** to use the same correction as **areg** (option **dfadj**)
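
The last point can be checked directly. A minimal sketch, assuming the same `fakedata` file and using made-up names for the stored estimates:

```stata
use fakedata, clear
quietly xtset i
* areg counts the absorbed fixed effects in the degrees of freedom
quietly areg y x1 x2, absorb(i) vce(cluster i)
estimates store areg_cl
* xtreg, fe does not count them by default
quietly xtreg y x1 x2, fe vce(cluster i)
estimates store xtreg_cl
* dfadj asks xtreg to apply the same degrees-of-freedom correction as areg
quietly xtreg y x1 x2, fe vce(cluster i) dfadj
estimates store xtreg_dfadj
estimates table areg_cl xtreg_cl xtreg_dfadj, keep(x1 x2) b(%7.4f) se(%7.4f) stats(N)
```

If the correction works as described, the first and third columns should show the same standard errors, while the second reproduces the smaller **xtreg** values from the table above.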

Now we cluster on the variable *j*


In [7]:
qui regress y x1 x2 i.i, vce(cluster j)
estimates store reg_2c
qui areg y x1 x2, absorb(i)  vce(cluster j)
estimates store areg_2c
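* xtreg is commented out: it exits with an error because the panels (i) are not nested within the clusters (j)
*qui xtreg y x1 x2, fe vce(cluster j) 
*estimates store xtreg_2c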
qui reghdfe y x1 x2, absorb(i)  vce(cluster j) 
estimates store reghdfe_2c
qui reghdfe y x1 x2, absorb(i)  vce(cluster j) keepsingletons 
estimates store reghdfe_2ca

estimates table *_2c *_2ca, keep(x1 x2) b(%7.4f) se(%7.4f) stats(N)











------------------------------------------------------
    Variable | reg_2c    areg_2c   regh~2c   reghd~a  
-------------+----------------------------------------
          x1 |  1.0659    1.0659    1.0659    1.0659  
             |  0.2562    0.2562    0.2523    0.2562  
          x2 | -0.6755   -0.6755   -0.6755   -0.6755  
             |  0.1972    0.1972    0.1942    0.1972  
-------------+----------------------------------------
           N |     100       100        97       100  
------------------------------------------------------
                                          Legend: b/se


- **xtreg** refuses to estimate the model because the panels (*i*) are not nested within the clusters (*j*)
- now **reghdfe** produces the same results as **regress** and **areg** (if we keep singletons)
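
If you still want **xtreg** in this comparison, one possibility is to switch off its nesting check. The sketch below relies on the **nonest** option of **xtreg, fe**; treat its availability and behaviour as an assumption to confirm with `help xtreg` in your Stata version. **dfadj** is added because the fixed effects are no longer nested within the clusters, and the name `xtreg_2c_nonest` is made up.

```stata
use fakedata, clear
quietly xtset i
* nonest: do not require panels to be nested within clusters (assumed option)
* dfadj: count the fixed effects in the degrees of freedom, as areg does
quietly xtreg y x1 x2, fe vce(cluster j) nonest dfadj
estimates store xtreg_2c_nonest
estimates table xtreg_2c_nonest, keep(x1 x2) b(%7.4f) se(%7.4f) stats(N)
```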

## Other reasons why reghdfe is better!

In [8]:
clear all
use data1fe



(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)


This is a subset of the dataset *nlswork*. 

In [9]:
%browse

| idcode | year | birth_yr | age | race | tenure | ln_wage |
|---|---|---|---|---|---|---|
| 1 | 70 | 51 | 18 | black | 0.083333336 | 1.451214 |
| 1 | 71 | 51 | 19 | black | 0.083333336 | 1.0286198 |
| 1 | 72 | 51 | 20 | black | 0.91666669 | 1.5899774 |
| 1 | 73 | 51 | 21 | black | 0.083333336 | 1.7802728 |
| 1 | 75 | 51 | 23 | black | 0.16666667 | 1.7770116 |
| 1 | 77 | 51 | 25 | black | 1.5 | 1.7786806 |
| 1 | 78 | 51 | 26 | black | 0.083333336 | 2.4939759 |
| 1 | 80 | 51 | 28 | black | 1.8333334 | 2.5517154 |
| 1 | 83 | 51 | 31 | black | 0.66666669 | 2.4202614 |
| 1 | 85 | 51 | 33 | black | 1.9166666 | 2.6141725 |


In [10]:
qui xtreg ln_wage age tenure i.year, fe
estimates store xtreg_3a
qui areg ln_wage age tenure i.year, absorb(idcode)
estimates store areg_3a
qui xtreg ln_wage age tenure ib(69).year, fe
estimates store xtreg_3b
qui areg ln_wage age tenure ib(69).year, absorb(idcode)
estimates store areg_3b
qui reghdfe ln_wage age tenure, absorb(idcode year)
estimates store reghdfe_3c
estimates table *_3a *_3b *_3c, keep(age tenure) b(%7.4f) se(%7.4f) stats(N)














--------------------------------------------------------------------------
    Variable | xtreg_3a     areg_3a    xtreg_3b     areg_3b    reghdf~3c  
-------------+------------------------------------------------------------
         age |    0.0152      0.0152      0.0114      0.0114   (omitted)  
             |    0.0008      0.0008      0.0008      0.0008              
      tenure |    0.0219      0.0219      0.0219      0.0219      0.0219  
             |    0.0009      0.0009      0.0009      0.0009      0.0009  
-------------+------------------------------------------------------------
           N |     23206       23206       23206       23206       22654  
--------------------------------------------------------------------------
                                                              Legend: b/se


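- **xtreg** and **areg** report a coefficient for *age* even though it is not identified: *age* is collinear with the individual and year fixed effects, which is why **reghdfe** reports it as omitted
- the estimate is arbitrary: it changes from 0.0152 to 0.0114 when the base category of the year fixed effect is changed!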
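
Why is *age* not identified? In the ten rows listed above, *age* equals *year* minus *birth_yr* minus one, and *birth_yr* is constant within *idcode*. If that relation holds throughout the file (an assumption, since only ten rows are shown), *age* is the sum of an individual effect and a year effect. A simple check with official Stata commands is to regress *age* on the two sets of fixed effects:

```stata
use data1fe, clear
* explain age with year dummies while absorbing the individual effects
areg age i.year, absorb(idcode)
* an R-squared of (numerically) 1 means age has no independent variation left
display e(r2)
```

An $R^2$ of one would confirm that no variation in *age* is left to identify its coefficient once both fixed effects are in the model.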

## More reasons

- **reghdfe** is usually faster than **areg** or **xtreg**
- **reghdfe** can compute standard errors with multi-way clustering (Note: with large datasets the order in which you list the cluster variables affects performance, so put the variable with the highest number of distinct values first)
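
As an illustration of multi-way clustering, the sketch below clusters on both *i* and *j* in the small `fakedata` file used earlier. It is meant only to show the syntax: with so few clusters the resulting standard errors should not be taken seriously. *i* is listed first because it takes more distinct values than *j* in the summary table at the top.

```stata
use fakedata, clear
* two-way clustered standard errors: list both variables inside vce(cluster ...)
reghdfe y x1 x2, absorb(i) vce(cluster i j)
```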

# Regression with more than one fixed effect (F>1)

With **reghdfe** you can absorb multiple fixed effects: just list them in the **absorb** option.

In [115]:
clear all
use nlswork



(National Longitudinal Survey of Young Women, 14-24 years old in 1968)


In [116]:
qui reghdfe ln_wage ttl_exp union not_smsa nev_mar i.year, absorb(idcode)
estimates store reghdfe_4a
qui reghdfe ln_wage ttl_exp union not_smsa nev_mar, absorb(idcode year)
estimates store reghdfe_4b
estimates table *_4a *_4b, keep(ttl_exp union not_smsa nev_mar) b(%7.4f) se(%7.4f) stats(N)







----------------------------------
    Variable | regh~4a   regh~4b  
-------------+--------------------
     ttl_exp |  0.0437    0.0437  
             |  0.0016    0.0016  
       union |  0.1010    0.1010  
             |  0.0069    0.0069  
    not_smsa | -0.0943   -0.0943  
             |  0.0124    0.0124  
     nev_mar | -0.0209   -0.0209  
             |  0.0102    0.0102  
-------------+--------------------
           N |   18558     18558  
----------------------------------
                      legend: b/se


And we can add a third fixed effect:

In [117]:
reghdfe ln_wage ttl_exp union not_smsa nev_mar, absorb(idcode year occ_code)

(dropped 665 singleton observations)
(MWFE estimator converged in 17 iterations)

HDFE Linear regression                            Number of obs   =     18,486
Absorbing 3 HDFE groups                           F(   4,  14982) =     245.28
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.7636
                                                  Adj R-squared   =     0.7083
                                                  Within R-sq.    =     0.0615
                                                  Root MSE        =     0.2508

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     ttl_exp |    .041726   .0015802    26.41   0.000     .0386286    .0448234
       union |   .0922451   .0068775    13.41   

It is also possible to save the estimates of the fixed effects:

In [118]:
reghdfe ln_wage ttl_exp union not_smsa nev_mar, absorb(fe1=idcode fe2=year fe3=occ_code)
reghdfe ln_wage ttl_exp union not_smsa nev_mar fe1 fe2 fe3


(dropped 665 singleton observations)
(MWFE estimator converged in 17 iterations)

HDFE Linear regression                            Number of obs   =     18,486
Absorbing 3 HDFE groups                           F(   4,  14982) =     245.28
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.7636
                                                  Adj R-squared   =     0.7083
                                                  Within R-sq.    =     0.0615
                                                  Root MSE        =     0.2508

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     ttl_exp |    .041726   .0015802    26.41   0.000     .0386286    .0448234
       union |   .0922451   .0068775    13.41  

It is also possible to absorb interactions between a fixed effect and a continuous variable (heterogeneous slopes):

In [119]:
reghdfe ln_wage union not_smsa nev_mar, absorb(idcode i.year##c.ttl_exp)

(dropped 666 singleton observations)
(MWFE estimator converged in 11 iterations)

HDFE Linear regression                            Number of obs   =     18,558
Absorbing 2 HDFE groups                           F(   3,  15048) =      85.58
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.7587
                                                  Adj R-squared   =     0.7024
                                                  Within R-sq.    =     0.0168
                                                  Root MSE        =     0.2535

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       union |   .0979223   .0068881    14.22   0.000     .0844207    .1114238
    not_smsa |    -.09025   .0123234    -7.32   

In [120]:
reghdfe ln_wage union not_smsa nev_mar, absorb(i.idcode##c.ttl_exp year)

(dropped 666 singleton observations)
(MWFE estimator converged in 11 iterations)

HDFE Linear regression                            Number of obs   =     18,558
Absorbing 2 HDFE groups                           F(   3,  11583) =      41.30
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.8601
                                                  Adj R-squared   =     0.7758
                                                  Within R-sq.    =     0.0106
                                                  Root MSE        =     0.2200

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       union |    .077019   .0074498    10.34   0.000     .0624162    .0916219
    not_smsa |   -.061414   .0157465    -3.90   
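
The two **absorb** specifications above can be written out as models with heterogeneous slopes. In this sketch the symbols $\alpha_i$, $\gamma_t$, $\delta_t$ and $\theta_i$ are notation introduced here, $y_{it}$ stands for *ln_wage*, $z_{it}$ for *ttl_exp*, and $\mathbf{x}_{it}$ for the remaining regressors. `absorb(idcode i.year##c.ttl_exp)` corresponds to

\begin{equation}
y_{it} = \mathbf{x}_{it}'\beta + \alpha_i + \gamma_t + \delta_t z_{it} + \epsilon_{it}
\end{equation}

that is, individual effects, year effects and a year-specific slope on *ttl_exp*, while `absorb(i.idcode##c.ttl_exp year)` corresponds to

\begin{equation}
y_{it} = \mathbf{x}_{it}'\beta + \alpha_i + \theta_i z_{it} + \gamma_t + \epsilon_{it}
\end{equation}

with an individual-specific slope instead. As in standard factor-variable notation, `##` includes both the intercepts and the slopes, whereas a single `#` would absorb the slopes only.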

There is much more to **reghdfe**. Places to check are:

 - [Sergio Correia's website](http://scorreia.com/)
 - [Sergio Correia's GitHub](https://github.com/sergiocorreia)
 - The related packages:
     - [ivreghdfe](https://github.com/sergiocorreia/ivreghdfe)
     - [ppmlhdfe](https://github.com/sergiocorreia/ppmlhdfe)
     - [sumhdfe](https://github.com/ed-dehaan/sumhdfe)