# ECON 490: Good Regression Practice (14)


## Pre-requisites: 
--- 
1. Econometric approaches to linear regression taught in ECON 326.
2. Importing data into Stata.
3. Creating new variables using `generate`.
4. Creating and interpreting dummy variables.


## Learning Outcomes:  
---
- Know how to deal with atypical values (outliers) by trimming or winsorizing data
- Be able to recognize and address empirical issues such as heteroskedasticity and multicollinearity.


At this point, you have learned how powerful a tool OLS can be. However, when we encounter real-world data, there are many problems that could arise. The purpose of this module is to help you understand and avoid such problems.


## 14.1 Outliers 

Although it is *very unlikely*, sometimes our regression results can be driven by atypical values in our variables of interest. While there is no formal test that settles whether this is happening, it always helps to have a table of summary statistics showing the range of values that the variables in our analysis take.

For example, we might construct a dependent variable that contains the wage growth of workers and see that some of them grew their wage by more than 400%. One might wonder whether this massive change is real or an error made by the statisticians who produced the dataset. Even if the changes are correct, when only a couple of observations have such large growth rates, one can argue that these outliers are a main driver of the results we obtain. In that case, our analysis rests on results that are not representative of the majority of our observations. The standard practice in these cases is to either winsorize or trim the observations used in that regression. Both practices limit the influence of outlier values in the dependent variable, winsorizing by capping them and trimming by dropping them, so that we can produce more realistic results.


<div class="alert alert-block alert-warning">
    
<b>Warning:</b> You should only consider fixing outliers when there is a clear reason to address this issue. Do not apply the tools below if the summary statistics of your data look sensible to you, with no abnormal values.
    
</div>

### 14.1.1 Winsorize 

Winsorizing is the process of limiting extreme values in the data to reduce the effect of (possibly erroneous) outliers. It consists of replacing values below the $a$-th percentile with that percentile's value, and values above the $b$-th percentile with that percentile's value. Consider the following example:


In [1]:
clear* 

use fake_data, clear 

In [3]:
su earnings, d


                          Earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%     2037.231       8.881357
 5%     5147.155       10.06454
10%     8086.952       10.29567       Obs           2,861,772
25%     16701.25       11.60582       Sum of Wgt.   2,861,772

50%     36511.77                      Mean           71809.22
                        Largest       Std. Dev.      203384.9
75%     78840.01       5.15e+07
90%     157791.9       6.36e+07       Variance       4.14e+10
95%     240020.3       7.03e+07       Skewness       345.8759
99%     540524.9       1.90e+08       Kurtosis       282741.2


We can see from the summary statistics that earnings at the 1st percentile are 2,037, while the smallest value of earnings is 8.88. The same divergence occurs at the top: earnings at the 99th percentile are only 540,524.9, while the largest earner received 190,449,648. This table shows us there are large outliers in our dependent variable.

Therefore, we want to limit the influence of these outliers by winsorizing. We replace all values below the 1st percentile with the value of the 1st percentile, and all values above the 99th percentile with the value of the 99th percentile.

Recall that `summarize` stores its results in `r()`, and that we can display them using `return list`.

In [4]:
return list


scalars:
                r(p99) =  540524.875
                r(p95) =  240020.28125
                r(p90) =  157791.9375
                r(p75) =  78840.0078125
                r(p50) =  36511.765625
                r(p25) =  16701.251953125
                r(p10) =  8086.9521484375
                 r(p5) =  5147.1552734375
                 r(p1) =  2037.231323242188
                r(max) =  190449648
                r(min) =  8.881357192993164
                r(sum) =  205501618680.3482
           r(kurtosis) =  282741.2128004082
           r(skewness) =  345.8759356723561
                 r(sd) =  203384.9162541108
                r(Var) =  41365424159.69167
               r(mean) =  71809.22123787226
              r(sum_w) =  2861772
                  r(N) =  2861772


We winsorize by first creating a new variable with the same values as `earnings`, which we will call `earnings_winsor`. Then we replace the values of `earnings_winsor` with the value of the 1st percentile (stored by Stata as `r(p1)`) if earnings are smaller than the 1st percentile. We do the same for the 99th percentile. We store the winsorized version of the dependent variable as a separate variable purely for organizational purposes, so that the original values stay available.

In [5]:
cap drop earnings_winsor
gen earnings_winsor = earnings
replace earnings_winsor = r(p1) if earnings_winsor<r(p1)
replace earnings_winsor = r(p99) if earnings_winsor>r(p99)



(28,617 real changes made)

(28,617 real changes made)


In [6]:
su earnings earnings_winsor


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    earnings |  2,861,772    71809.22    203384.9   8.881357   1.90e+08
earnings_w~r |  2,861,772    67309.14    88887.91   2037.231   540524.9


The typical choice of cutpoints is the 1st and 99th percentiles. By construction, this shouldn't affect your main results if outliers were not an issue, since it only recodes roughly 2% of the data points.
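Because any later r-class command (such as another `summarize`) overwrites `r(p1)` and `r(p99)`, one option is to copy the cutoffs into scalars before using them. The following is a minimal sketch of that approach, assuming `fake_data` is loaded; the names `p1_cut`, `p99_cut` and `earnings_w2` are illustrative and not part of the dataset.

```stata
* Compute the percentiles once and keep them in scalars (illustrative names)
quietly summarize earnings, detail
scalar p1_cut  = r(p1)     // 1st percentile
scalar p99_cut = r(p99)    // 99th percentile

* Work on a copy so the original variable is untouched
capture drop earnings_w2
generate double earnings_w2 = earnings

* Cap values below the 1st and above the 99th percentile
replace earnings_w2 = scalar(p1_cut)  if earnings_w2 < scalar(p1_cut)
* Missing values compare as larger than any number, so exclude them explicitly
replace earnings_w2 = scalar(p99_cut) if earnings_w2 > scalar(p99_cut) & !missing(earnings_w2)

* The minimum and maximum of the copy should now equal the two cutoffs
summarize earnings earnings_w2
```

Writing `scalar(p1_cut)` rather than just `p1_cut` guards against a variable with the same name taking precedence over the scalar.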

### 14.1.2 Trim 

Trimming consists of replacing values below the $a$-th percentile with a missing value, and values above the $b$-th percentile with a missing value. The idea is that when the variable is missing for an observation, that observation won't be used in the regression: by design, `regress` drops observations with missing values in any of the variables it uses. Consider the following example:

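In [7]:
* r(p1) and r(p99) were overwritten by the last summarize, so recompute them
quietly su earnings, d

cap drop earnings_trim
gen earnings_trim = earnings
replace earnings_trim = . if earnings_trim<r(p1)
* missing values count as larger than any number, so exclude them here
replace earnings_trim = . if earnings_trim>r(p99) & !missing(earnings_trim)



(28,617 real changes made, 28,617 to missing)

(28,617 real changes made, 28,617 to missing)
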
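Trimming can also be done without creating a new variable, by restricting the estimation sample with an `if` qualifier. The sketch below assumes `fake_data` is loaded and contains `earnings` and `age`; the scalar names `lo_cut` and `hi_cut` are illustrative.

```stata
* Store the 1st and 99th percentiles of earnings (illustrative scalar names)
quietly summarize earnings, detail
scalar lo_cut = r(p1)
scalar hi_cut = r(p99)

* How many observations fall outside the cutoffs?
count if !inrange(earnings, scalar(lo_cut), scalar(hi_cut))

* Regression on the full sample, then on the trimmed sample
regress earnings age
regress earnings age if inrange(earnings, scalar(lo_cut), scalar(hi_cut))
```

Comparing the number of observations and the coefficients across the two regressions gives a quick sense of how much the extreme values matter for the estimates.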


## 14.2 Multicollinearity 

If one variable is a linear combination of others, the variables are multicollinear. Stata will not allow you to include variables in a regression that are perfect linear combinations of one another, such as a constant, a dummy variable for male, and a dummy variable for female (since female = 1 - male). If you try this yourself, you will see that one of those variables is dropped from the regression.


In [10]:
cap drop male
gen male = sex == "M"

cap drop female 
gen female = sex == "F"

In [11]:
reg earnings male female

note: female omitted because of collinearity

      Source |       SS           df       MS      Number of obs   = 2,861,772
-------------+----------------------------------   F(1, 2861770)   =  18223.01
       Model |  7.4903e+14         1  7.4903e+14   Prob > F        =    0.0000
    Residual |  1.1763e+17 2,861,770  4.1104e+10   R-squared       =    0.0063
-------------+----------------------------------   Adj R-squared   =    0.0063
       Total |  1.1838e+17 2,861,771  4.1365e+10   Root MSE        =    2.0e+05

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   33787.69   250.2928   134.99   0.000     33297.12    34278.25
      female |          0  (omitted)
       _cons |   50050.77   200.8553   249.19   0.000      49657.1    50444.44
------------------------------------------------------------------------------

Is this a problem? Not really. Multicollinearity is a sign that a variable is not adding new information. Notice that with the constant term and a male dummy we can already recover the mean earnings of females. In this case, the constant term *is* by construction the mean earnings of females, and the male dummy gives the premium that males receive.
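One way to see that no information is lost is to drop the constant instead of one of the dummies. This is a minimal sketch, assuming `male` and `female` were created as above and that `sex` only takes the values "M" and "F".

```stata
* Without a constant, both dummies can enter and each coefficient is a group mean
regress earnings male female, noconstant

* Cross-check against the group means computed directly
tabstat earnings, by(male) statistics(mean)
```

The coefficient on `female` should match `_cons` from the regression above, and the coefficient on `male` should match `_cons` plus the male coefficient, since both specifications describe the same two group means.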

While there are some statistical tests for multicollinearity, nothing beats having the right intuition when running a regression. If there is an obvious case where two variables contain basically the same information, you should avoid including both in the analysis. For instance, we could have an age variable that takes non-integer values based on the months (e.g. if a baby is 1 year and 1 month old, it is coded as 1.083) versus an integer age variable. Both contain basically the same information, even though they are not perfectly collinear. Stata might still produce results, but the coefficients on these two variables may be very imprecisely estimated, with large standard errors, because the data cannot separate the effect of one variable from the other.
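A rough way to spot near-collinear regressors is to look at their correlation and at the variance inflation factors reported by `estat vif` after `regress`. The sketch below assumes `fake_data` is loaded with `earnings` and `age`; `age_noisy` is a hypothetical second age measure constructed only for this illustration, and the seed value is arbitrary.

```stata
* Build a hypothetical second measure of age that is almost identical to the first
set seed 490
capture drop age_noisy
generate age_noisy = age + rnormal(0, 0.1)

* The two measures are almost perfectly correlated
correlate age age_noisy

* Stata keeps both variables, but their standard errors are inflated
regress earnings age age_noisy
estat vif
```

Large variance inflation factors signal that the regressors carry nearly the same information, which is the situation described above.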

## 14.3 Heteroskedasticity 

When we run a linear regression we basically split the outcome into a (linear) part explained by observables and an error term:
$$ y_i = a + b x_i + e_i$$ 

This is why it's also called a linear projection. The standard errors of our coefficients depend on $e_i^2$ (as you might remember from previous econometrics courses). Heteroskedasticity refers to the case where the variance of this projection error depends on the observables $x_i$. For instance, the variance of wages tends to be higher for the college educated (there are some people with very high wages), whereas it is smaller for the non-college educated (they tend to be concentrated in lower-paying jobs). By default, Stata assumes that the variance does not depend on the observables, which is known as homoskedasticity. It is safe to say that this is an incredibly restrictive assumption.
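As a compact statement of the two cases, written with a generic conditional-variance function $\sigma^2(\cdot)$ (notation introduced here only for illustration):

$$ \operatorname{Var}(e_i \mid x_i) = \sigma^2 \quad \text{(homoskedasticity)} \qquad \text{versus} \qquad \operatorname{Var}(e_i \mid x_i) = \sigma^2(x_i) \quad \text{(heteroskedasticity)} $$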

While there are tests for heteroskedasticity, the standard practice among applied econometricians is to include the option `robust` at the end of the `regress` command.



In [13]:
cap drop logearn 
gen logearn = log(earnings)


regress logearn age treated , robust





Linear regression                               Number of obs     =  2,861,772
                                                F(2, 2861769)     >   99999.00
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1301
                                                Root MSE          =      1.097

------------------------------------------------------------------------------
             |               Robust
     logearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0060827   .0000641    94.88   0.000      .005957    .0062084
     treated |  -.8178721   .0013369  -611.79   0.000    -.8204923   -.8152519
       _cons |   10.64645   .0024892  4277.02   0.000     10.64157    10.65132
------------------------------------------------------------------------------


The best thing is that robust standard errors remain valid whether or not the errors are heteroskedastic, as long as the sample is not very small. This property is known as consistency. Therefore, there is no reason not to use robust standard errors in our ECON 490 project.
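To see what the option changes, it can help to run the same regression with conventional and with robust standard errors. This is a minimal sketch, assuming `logearn`, `age` and `treated` exist as created above. It uses `estat hettest`, Stata's Breusch-Pagan/Cook-Weisberg test, as one example of the tests mentioned earlier, and `vce(robust)`, which requests the same standard errors as `robust`.

```stata
* Conventional (homoskedastic) standard errors, followed by a heteroskedasticity test
regress logearn age treated
estat hettest

* Same coefficients, heteroskedasticity-robust standard errors
regress logearn age treated, vce(robust)
```

The point estimates are identical in the two regressions; only the standard errors, t-statistics and confidence intervals change.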

## 14.4 Wrap up 
In this module we learned how to deal with outliers. Data cleaning is one of the most important parts of a research project, as it is the first and most common place to make mistakes. Outliers and missing observations need to be taken care of so that we are able to produce reliable results. The two ways to deal with outliers are winsorizing and trimming the data.

Just a word of caution: the subject of outliers can be very subjective. Unless they are extremely different from the rest of the data (as in the case of our fake data), it's not always easy to decide whether outliers should be removed. In the majority of cases, outliers are kept in the analysis unless it is known that the data were collected erroneously or it is very clear that those observations are producing unrealistic results.

We also learned about heteroskedasticity and multicollinearity. Multicollinearity can arise from including the output gap and GDP in the same model. It can also arise from including all the categories of a qualitative variable as dummy variables. Although including perfectly collinear variables doesn't cause major problems, since Stata simply drops one of them, these variables don't add any explanatory value. Also, in most cases variables are not perfectly collinear, and including many covariates that are somewhat collinear in a regression increases the variance of the estimates and can lead to over-fitting, making the results unreliable. For heteroskedasticity, the practical remedy is to add the `robust` option to `regress`.