## Chapter 8

In [1]:
import pandas as pd
import statsmodels.api as sm
import numpy as np

In [2]:
# Exercise 1
sleep75 = pd.read_stata("stata/SLEEP75.DTA")

y = sleep75.sleep
X = sm.add_constant(sleep75[["totwrk", "educ", "age", "agesq", "yngkid", "male"]])
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  sleep   R-squared:                       0.123
Model:                            OLS   Adj. R-squared:                  0.115
Method:                 Least Squares   F-statistic:                     16.30
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.28e-17
Time:                        23:53:52   Log-Likelihood:                -5259.3
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     699   BIC:                         1.056e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3840.8521    239.414     16.043      0.0

In [3]:
sleep75["u_squared"] = model.resid ** 2

model = sm.OLS(sleep75.u_squared, sm.add_constant(sleep75.male)).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:              u_squared   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.117
Date:                Sun, 24 May 2020   Prob (F-statistic):              0.291
Time:                        23:53:52   Log-Likelihood:                -10032.
No. Observations:                 706   AIC:                         2.007e+04
Df Residuals:                     704   BIC:                         2.008e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1.894e+05   2.05e+04      9.216      0.0

C1.i $Var(u_i|male_i) = \pi_0 + \pi_1 male_i$

C1.ii Results above. The coefficient on male is negative, which implies that the estimated error variance is higher for women than for men.

C1.iii The t-statistic on male is just over 1 in absolute value (p-value 0.291), so the variance of $u$ is not statistically different for men and women.
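
A minimal sketch of why regressing squared residuals on a single dummy compares variances across groups: the intercept is the mean squared residual of the base group and the slope is the difference in group means. The data are simulated (an assumption, not SLEEP75), and the names `male` and `u_sq` are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins (assumption: these are not the SLEEP75 data)
male = rng.integers(0, 2, size=500)
u = rng.normal(scale=np.where(male == 1, 1.0, 1.5))  # larger variance for women
u_sq = u ** 2

# OLS of u_sq on a constant and the dummy
X = np.column_stack([np.ones(len(male)), male])
pi0, pi1 = np.linalg.lstsq(X, u_sq, rcond=None)[0]

mean_women = u_sq[male == 0].mean()
mean_men = u_sq[male == 1].mean()
print(pi0, mean_women)             # intercept vs. mean of u_sq for women
print(pi1, mean_men - mean_women)  # slope vs. difference in group means
```

A negative slope therefore means a smaller average squared residual for the group coded 1, which is how C1.ii is read.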

In [4]:
# Exercise 2
hprice1 = pd.read_stata("stata/hprice1.dta")

y = hprice1.price
X = sm.add_constant(hprice1[["lotsize", "sqrft", "bdrms"]])
model = sm.OLS(y, X).fit()
model_hc = model.get_robustcov_results(cov_type = "HC3")
model_summary = model_hc.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.672
Model:                            OLS   Adj. R-squared:                  0.661
Method:                 Least Squares   F-statistic:                     19.54
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.06e-09
Time:                        23:53:52   Log-Likelihood:                -482.88
No. Observations:                  88   AIC:                             973.8
Df Residuals:                      84   BIC:                             983.7
Df Model:                           3                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -21.7703     41.033     -0.531      0.5

In [5]:
y = hprice1.lprice
X = sm.add_constant(hprice1[["llotsize", "lsqrft", "bdrms"]])
model = sm.OLS(y, X).fit()
model_hc = model.get_robustcov_results(cov_type = "HC3")
model_summary = model_hc.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                 lprice   R-squared:                       0.643
Model:                            OLS   Adj. R-squared:                  0.630
Method:                 Least Squares   F-statistic:                     44.82
Date:                Sun, 24 May 2020   Prob (F-statistic):           2.14e-17
Time:                        23:53:52   Log-Likelihood:                 25.861
No. Observations:                  88   AIC:                            -43.72
Df Residuals:                      84   BIC:                            -33.81
Df Model:                           3                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2970      0.850     -1.525      0.1

C2.i The robust standard errors are substantially larger than the usual OLS ones across the board, particularly for lotsize, which is no longer significant.

C2.ii In the log model the robust standard errors are again higher, but not as dramatically.

C2.iii Taking logs of the variables can mitigate the effects of heteroskedasticity.
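
For reference, a rough sketch of what a heteroskedasticity-robust covariance matrix computes, written with plain NumPy on simulated data (an assumption, not hprice1). HC0 uses the squared residuals as weights in the "sandwich"; HC3 inflates each one by $1/(1-h_{ii})^2$, where $h_{ii}$ is the leverage. The helper name `sandwich` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])
# Error standard deviation grows with x, so the errors are heteroskedastic
y = 1.0 + 2.0 * x + rng.normal(scale=x)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Leverage values h_ii: the diagonal of X (X'X)^{-1} X'
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

def sandwich(weights):
    """Return (X'X)^{-1} X' diag(weights) X (X'X)^{-1}."""
    meat = X.T @ (X * weights[:, None])
    return XtX_inv @ meat @ XtX_inv

sigma2 = (resid @ resid) / (n - X.shape[1])
se_nonrobust = np.sqrt(sigma2 * np.diag(XtX_inv))
se_hc0 = np.sqrt(np.diag(sandwich(resid ** 2)))
se_hc3 = np.sqrt(np.diag(sandwich(resid ** 2 / (1 - h) ** 2)))
print(se_nonrobust, se_hc0, se_hc3)
```

Because $1/(1-h_{ii})^2 \geq 1$, HC3 standard errors are never smaller than HC0 ones, which is part of why the HC3 results in this exercise look conservative in a sample of 88.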

In [6]:
# Exercise 3
import statsmodels.stats.diagnostic as smd

smd.het_white(model.resid, X)

(9.549448521078983, 0.3881743289213765, 1.0549560917353393, 0.4053127292220229)

In [7]:
hprice1["u_sq"] = model.resid ** 2  # the White test regresses the squared residuals
hprice1["llotsizesq"] = hprice1.llotsize ** 2
hprice1["lsqrftsq"] = hprice1.lsqrft ** 2
hprice1["bdrmssq"] = hprice1.bdrms ** 2
hprice1["llotsizelsqrft"] = hprice1.llotsize * hprice1.lsqrft
hprice1["llotsizebdrms"] = hprice1.llotsize * hprice1.bdrms
hprice1["lsqrftbdrms"] = hprice1.lsqrft * hprice1.bdrms

y = hprice1.u_sq
X = sm.add_constant(hprice1[["llotsize", "lsqrft", "bdrms", "llotsizesq", "lsqrftsq", "bdrmssq", "llotsizelsqrft", "llotsizebdrms", "lsqrftbdrms"]])
model = sm.OLS(y, X).fit()

model.f_test("(llotsize = 0), (lsqrft = 0), (bdrms = 0), (llotsizesq = 0), (lsqrftsq = 0), (bdrmssq = 0), (llotsizelsqrft = 0), (llotsizebdrms = 0), (lsqrftbdrms = 0)")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[1.05495609]]), p=0.4053127292220229, df_denom=78, df_num=9>

C3 Both the F version (F = 1.05, p-value 0.405) and the LM version (LM = 9.55, p-value 0.388) of the White test fail to reject the null of homoskedasticity at any conventional level.
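
A quick arithmetic check, using only numbers already printed by `het_white` after In [6]: the LM statistic is $n R^2$ from the auxiliary regression, so the $R^2$ can be backed out and turned into the F statistic. The variable names are illustrative.

```python
# Values printed by het_white after In [6]
lm_stat = 9.549448521078983
n_obs = 88     # observations in hprice1
n_restr = 9    # 3 levels + 3 squares + 3 cross products

r_squared = lm_stat / n_obs  # LM = n * R^2 from the auxiliary regression
f_stat = (r_squared / n_restr) / ((1 - r_squared) / (n_obs - n_restr - 1))
print(r_squared, f_stat)
```

The F value recovered this way agrees with the third element of the `het_white` tuple, so a manual White regression on the squared residuals should reproduce it.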

In [8]:
# Exercise 4
vote1 = pd.read_stata("stata/VOTE1.DTA")

y = vote1.voteA
X = sm.add_constant(vote1[["prtystrA", "democA", "lexpendA", "lexpendB"]])
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)
vote1["u_sq"] = model.resid ** 2
vote1["y_hat"] = model.fittedvalues
vote1["y_hat_sq"] = model.fittedvalues ** 2
print(sm.OLS(model.resid, X).fit().summary())

                            OLS Regression Results                            
Dep. Variable:                  voteA   R-squared:                       0.801
Model:                            OLS   Adj. R-squared:                  0.796
Method:                 Least Squares   F-statistic:                     169.2
Date:                Sun, 24 May 2020   Prob (F-statistic):           8.09e-58
Time:                        23:53:53   Log-Likelihood:                -593.20
No. Observations:                 173   AIC:                             1196.
Df Residuals:                     168   BIC:                             1212.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         37.6614      4.736      7.952      0.0

In [9]:
r_2 = sm.OLS(model.resid ** 2, X).fit().rsquared
print((r_2 / 4)/((1 - r_2)/(173 - 4 - 1)))
print(r_2 * model.nobs)
print(smd.het_breuschpagan(model.resid, X))

2.3301128267408124
9.093356486631743
(9.093356486631743, 0.05880790411090132, 2.330112826740811, 0.058057501107013666)


In [10]:
y = vote1.u_sq
X = sm.add_constant(vote1[["y_hat", "y_hat_sq"]])
model = sm.OLS(y, X).fit()
f_stat = (model.rsquared / model.df_model) / ((1 - model.rsquared) / (model.nobs - model.df_model - 1))
print(f_stat)

2.7858276132900985


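C4.i $R^2$ is 0 because OLS residuals are, by construction, uncorrelated with the explanatory variables in the sample.

C4.ii The Breusch-Pagan F statistic is about 2.33 with a p-value of 0.058, not enough to reject homoskedasticity at the 5% level.

C4.iii The F statistic for the special case of the White test is about 2.786 with 2 and 170 degrees of freedom (p-value about 0.065), significant at the 10% level but not at the 5% level, so the evidence of heteroskedasticity is weak.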
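
The cells above print F statistics without p-values. As a hedged sketch, two closed forms give the tail probabilities without extra libraries: one for an F distribution with exactly 2 numerator degrees of freedom, and one for a chi-square with an even number of degrees of freedom. Both are special cases; the function names are made up here, and in general something like `scipy.stats.f.sf` or `scipy.stats.chi2.sf` would be used instead.

```python
import math

def f_sf_num2(f_stat, df_denom):
    """P(F > f_stat) for an F distribution with 2 and df_denom degrees of freedom."""
    return (1 + 2 * f_stat / df_denom) ** (-df_denom / 2)

def chi2_sf_even(x, df):
    """P(chi2 > x) for a chi-square with an even number of degrees of freedom."""
    half = x / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms

# Special-case White test from In [10]: F = 2.7858 with (2, 173 - 2 - 1) df
p_white = f_sf_num2(2.7858276132900985, 170)
# Breusch-Pagan LM statistic from In [9]: 9.0934 with 4 df
p_bp_lm = chi2_sf_even(9.093356486631743, 4)
print(p_white, p_bp_lm)
```

The second value can be compared with the LM p-value in the `het_breuschpagan` tuple printed after In [9].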

In [11]:
# Exercise 5
pntsprd = pd.read_stata("stata/PNTSPRD.DTA")

y = pntsprd.sprdcvr
X = np.ones(len(pntsprd))  # constant-only regressor, one row per game
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                sprdcvr   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                       nan
Date:                Sun, 24 May 2020   Prob (F-statistic):                nan
Time:                        23:53:53   Log-Likelihood:                -401.10
No. Observations:                 553   AIC:                             804.2
Df Residuals:                     552   BIC:                             808.5
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5154      0.021     24.228      0.0

In [12]:
pntsprd.neutral.sum()

35

In [13]:
X = sm.add_constant(pntsprd[["favhome", "neutral", "fav25", "und25"]])
model = sm.OLS(y, X).fit()
model_hc = model.get_robustcov_results(cov_type = "HC3")
model_summary = model_hc.summary()
print(model.summary())
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                sprdcvr   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.4674
Date:                Sun, 24 May 2020   Prob (F-statistic):              0.760
Time:                        23:53:53   Log-Likelihood:                -400.16
No. Observations:                 553   AIC:                             810.3
Df Residuals:                     548   BIC:                             831.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4896      0.045     10.938      0.0

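C5.i The estimated probability that the spread is covered is 0.515. The reported t-statistic tests a mean of zero; for $H_0: \mu = 0.5$ it is $(0.5154 - 0.5)/0.021 \approx 0.7$, so the probability is not significantly different from 0.5.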

C5.ii 35 games were played on a neutral court

C5.iii No variables are statistically significant. neutral appears to be the most practically significant, but its confidence interval is wide enough that we should not treat its effect as different from zero.

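C5.iv Under the null hypothesis all slope coefficients are zero, so $P(sprdcvr = 1|x) = \beta_0$ does not depend on $x$ and $Var(sprdcvr|x) = \beta_0(1 - \beta_0)$ is constant. There is no heteroskedasticity under $H_0$.

C5.v The F statistic for joint significance is 0.47 (p-value 0.76), so we fail to reject the null that these variables have no bearing on whether the spread is covered.

C5.vi We could not predict whether the Las Vegas spread will be covered using this information.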
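
A small worked check for C5.i, using only the rounded coefficient and sample size printed above. For a 0/1 outcome regressed on a constant, the standard error follows directly from the sample proportion, so the t-statistic against 0.5 can be computed by hand. Variable names are illustrative.

```python
import math

p_hat = 0.5154  # coefficient on the constant, i.e. the sample mean of sprdcvr
n = 553         # number of games

# For a 0/1 variable, sum((y - p_hat)^2) = n * p_hat * (1 - p_hat),
# so the OLS standard error of the mean is sqrt(p_hat * (1 - p_hat) / (n - 1)).
se = math.sqrt(p_hat * (1 - p_hat) / (n - 1))

t_zero = p_hat / se          # what the summary reports (H0: mean = 0)
t_half = (p_hat - 0.5) / se  # the test the exercise asks about (H0: mean = 0.5)
print(round(se, 4), round(t_zero, 2), round(t_half, 2))
```

The first t-statistic should land close to the 24.228 in the summary (up to rounding of the coefficient), while the second is the one relevant to the question.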

In [14]:
# Exercise 6
crime1 = pd.read_stata("stata/CRIME1.DTA")
crime1["arr86"] = (crime1.narr86 > 0).astype("int8")

y = crime1.arr86
X = sm.add_constant(crime1[["pcnv", "avgsen", "tottime", "ptime86", "qemp86"]])
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  arr86   R-squared:                       0.047
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     27.03
Date:                Sun, 24 May 2020   Prob (F-statistic):           9.09e-27
Time:                        23:53:53   Log-Likelihood:                -1609.7
No. Observations:                2725   AIC:                             3231.
Df Residuals:                    2719   BIC:                             3267.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.4406      0.017     25.568      0.0

In [15]:
print(model.fittedvalues.min())
print(model.fittedvalues.max())

0.0066431248405761645
0.5576897210365442


In [16]:
crime1["h"] = 1 / (model.fittedvalues * (1 - model.fittedvalues))
crime1["h_hat_sqrt"] = np.sqrt(model.fittedvalues * (1 - model.fittedvalues))
crime1["wconst"] = 1 / crime1.h_hat_sqrt
crime1["wpcnv"] = crime1.pcnv / crime1.h_hat_sqrt
crime1["wavgsen"] = crime1.avgsen / crime1.h_hat_sqrt
crime1["wtottime"] = crime1.tottime / crime1.h_hat_sqrt
crime1["wptime86"] = crime1.ptime86 / crime1.h_hat_sqrt
crime1["wqemp86"] = crime1.qemp86 / crime1.h_hat_sqrt
crime1["warr86"] = crime1.arr86 / crime1.h_hat_sqrt

y = crime1.warr86
X = crime1[["wconst", "wpcnv", "wavgsen", "wtottime", "wptime86", "wqemp86"]]
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

y = crime1.arr86
X = sm.add_constant(crime1[["pcnv", "avgsen", "tottime", "ptime86", "qemp86"]])
model = sm.WLS(y, X, weights = crime1.h).fit()
model_summary = model.summary()
print(model_summary)

                                 OLS Regression Results                                
Dep. Variable:                 warr86   R-squared (uncentered):                   0.294
Model:                            OLS   Adj. R-squared (uncentered):              0.292
Method:                 Least Squares   F-statistic:                              188.3
Date:                Sun, 24 May 2020   Prob (F-statistic):                   5.28e-201
Time:                        23:53:53   Log-Likelihood:                         -3838.7
No. Observations:                2725   AIC:                                      7689.
Df Residuals:                    2719   BIC:                                      7725.
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [17]:
model.f_test("(avgsen = 0), (tottime = 0)")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[0.88493951]]), p=0.41285798000947516, df_denom=2.72e+03, df_num=2>

C6.i All fitted values lie strictly between 0 and 1 (the smallest is about 0.007 and the largest about 0.558), so the WLS weights are well defined.

C6.ii WLS and manually weighting both produce the same results

C6.iii avgsen and tottime are not jointly significant at the 5% level (F = 0.88, p-value 0.41).
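
A sketch of why the two routes in In [16] agree, on simulated data (an assumption, not CRIME1): dividing every column, including the constant, by $\sqrt{h}$ and running OLS solves the same normal equations as WLS with weights $1/h$. Here `h` is the variance function itself; note that in In [16] the column named `h` already holds the weight $1/(\hat{y}(1-\hat{y}))$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = np.clip(0.4 + 0.1 * x, 0.05, 0.95)  # simulated response probabilities
y = (rng.uniform(size=n) < p).astype(float)

h = p * (1 - p)  # variance function of a binary outcome
w = 1 / h        # WLS weights

# Route 1: OLS on data divided by sqrt(h), including the constant column
Xw = X / np.sqrt(h)[:, None]
yw = y / np.sqrt(h)
beta_transformed = np.linalg.lstsq(Xw, yw, rcond=None)[0]

# Route 2: weighted normal equations (X' W X) b = X' W y
beta_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
print(beta_transformed, beta_wls)
```

The coefficients match, but the reported $R^2$ need not: the transformed regression has no plain constant column, which is why its summary shows an uncentered $R^2$.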

In [18]:
# Exercise 7
loanapp = pd.read_stata("stata/loanapp.dta")

loanapp_reg = loanapp[["approve", "white", "hrat", "obrat", "loanprc", "unem", "male", "married", "dep", "sch", "cosign", "chist", "pubrec", "mortlat1", "mortlat2", "vr"]].dropna()

y = loanapp_reg.approve
X = sm.add_constant(loanapp_reg[["white", "hrat", "obrat", "loanprc", "unem", "male", "married", "dep", "sch", "cosign", "chist", "pubrec", "mortlat1", "mortlat2", "vr"]])
model = sm.OLS(y, X).fit()
model_hc = model.get_robustcov_results(cov_type = "HC3")
model_summary = model_hc.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                approve   R-squared:                       0.166
Model:                            OLS   Adj. R-squared:                  0.159
Method:                 Least Squares   F-statistic:                     14.56
Date:                Sun, 24 May 2020   Prob (F-statistic):           5.63e-36
Time:                        23:53:53   Log-Likelihood:                -429.26
No. Observations:                1971   AIC:                             890.5
Df Residuals:                    1955   BIC:                             979.9
Df Model:                          15                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.9367      0.060     15.560      0.0

In [19]:
print(model_hc.fittedvalues.min(), model_hc.fittedvalues.max())

0.2273447305415921 1.1729878406795526


C7.i The robust confidence interval is wider. Insignificant with WLS but still significant with HC3

C7.ii There are fitted values greater than 1 (the maximum is about 1.17). For those observations $\hat{y}(1-\hat{y})$ is negative, so we cannot directly apply WLS without some kind of adjustment.
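
One possible adjustment, sketched under assumptions: clip the fitted values into (0, 1) before forming the variance estimates. The bounds 0.01 and 0.99 are an ad hoc choice and the fitted values below are hypothetical, picked only to mimic the out-of-range maximum reported above.

```python
import numpy as np

# Hypothetical fitted values from a linear probability model
y_hat = np.array([0.23, 0.60, 0.95, 1.02, 1.17])

raw_h = y_hat * (1 - y_hat)           # negative wherever y_hat > 1
clipped = np.clip(y_hat, 0.01, 0.99)  # ad hoc adjustment (assumed bounds)
h = clipped * (1 - clipped)           # strictly positive variance estimates
weights = 1 / h
print(raw_h)
print(weights)
```

Clipped observations receive very large weights, so results can be sensitive to the bounds; sticking with OLS and robust standard errors, as in In [18], avoids the choice altogether.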

In [20]:
# Exercise 8
gpa1 = pd.read_stata("stata/GPA1.DTA")

y = gpa1.colGPA
X = sm.add_constant(gpa1[["hsGPA", "ACT", "skipped", "PC"]])
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

gpa1["u_sq"] = model.resid ** 2
gpa1["y_hat"] = model.fittedvalues
gpa1["y_hat_sq"] = gpa1.y_hat ** 2

print(smd.het_white(model.resid, X))

                            OLS Regression Results                            
Dep. Variable:                 colGPA   R-squared:                       0.259
Model:                            OLS   Adj. R-squared:                  0.237
Method:                 Least Squares   F-statistic:                     11.90
Date:                Sun, 24 May 2020   Prob (F-statistic):           2.55e-08
Time:                        23:53:53   Log-Likelihood:                -39.098
No. Observations:                 141   AIC:                             88.20
Df Residuals:                     136   BIC:                             102.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3565      0.328      4.142      0.0

In [21]:
y = gpa1.u_sq
X = sm.add_constant(gpa1[["y_hat", "y_hat_sq"]])
model = sm.OLS(y, X).fit()
gpa1["h_hat"] = model.fittedvalues

print((model.rsquared / model.df_model) / ((1 - model.rsquared) / (model.nobs - model.df_model - 1)))
print(gpa1.h_hat.min())

3.581494488337213
0.027381357121391055


In [22]:
y = gpa1.colGPA
X = sm.add_constant(gpa1[["hsGPA", "ACT", "skipped", "PC"]])
model = sm.WLS(y, X, weights = 1 / gpa1.h_hat).fit()
model_summary = model.summary()
print(model_summary)

                            WLS Regression Results                            
Dep. Variable:                 colGPA   R-squared:                       0.306
Model:                            WLS   Adj. R-squared:                  0.286
Method:                 Least Squares   F-statistic:                     15.01
Date:                Sun, 24 May 2020   Prob (F-statistic):           3.49e-10
Time:                        23:53:53   Log-Likelihood:                -35.364
No. Observations:                 141   AIC:                             80.73
Df Residuals:                     136   BIC:                             95.47
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4016      0.298      4.696      0.0

In [23]:
model_hc = model.get_robustcov_results(cov_type = "HC3")
model_summary = model_hc.summary()
print(model_summary)

                            WLS Regression Results                            
Dep. Variable:                 colGPA   R-squared:                       0.306
Model:                            WLS   Adj. R-squared:                  0.286
Method:                 Least Squares   F-statistic:                     20.28
Date:                Sun, 24 May 2020   Prob (F-statistic):           4.03e-13
Time:                        23:53:53   Log-Likelihood:                -35.364
No. Observations:                 141   AIC:                             80.73
Df Residuals:                     136   BIC:                             95.47
Df Model:                           4                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4016      0.324      4.327      0.0

C8.i Results are at the top

C8.ii The F statistic is about 3.58 with 2 and 138 degrees of freedom (p-value about 0.03), statistically significant at the 5% level, which suggests there is heteroskedasticity

C8.iii The estimates and their significance have both increased slightly

C8.iv The robust standard errors are slightly larger but not enough to change significance

In [24]:
# Exercise 9
smoke = pd.read_stata("stata/SMOKE.DTA")

y = smoke.cigs
X = sm.add_constant(smoke[["lincome", "lcigpric", "educ", "age", "agesq", "restaurn"]])
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                   cigs   R-squared:                       0.053
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     7.423
Date:                Sun, 24 May 2020   Prob (F-statistic):           9.50e-08
Time:                        23:53:54   Log-Likelihood:                -3236.2
No. Observations:                 807   AIC:                             6486.
Df Residuals:                     800   BIC:                             6519.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.6398     24.079     -0.151      0.8

In [25]:
smoke["h"] = np.exp(sm.OLS(np.log(np.power(model.resid,2)), X).fit().fittedvalues)
model = sm.WLS(y, X, weights = 1 / smoke.h).fit()
model_summary = model.summary()
print(model_summary)

                            WLS Regression Results                            
Dep. Variable:                   cigs   R-squared:                       0.113
Model:                            WLS   Adj. R-squared:                  0.107
Method:                 Least Squares   F-statistic:                     17.06
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.32e-18
Time:                        23:53:54   Log-Likelihood:                -3207.8
No. Observations:                 807   AIC:                             6430.
Df Residuals:                     800   BIC:                             6462.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6355     17.803      0.317      0.7

In [26]:
print(smd.het_white(model.resid, X))
smoke["u"] = model.resid / np.sqrt(smoke.h)
smoke["y"] = model.fittedvalues / np.sqrt(smoke.h)
smoke["u_sq"] = np.power(smoke.u, 2)
smoke["y_sq"] = np.power(smoke.y, 2)

model = sm.OLS(smoke.u_sq, sm.add_constant(smoke[["y", "y_sq"]])).fit()
print((model.rsquared / model.df_model) / ((1 - model.rsquared) / (model.nobs - model.df_model - 1)))

(49.656911597129984, 0.0023485874759439757, 2.0483212193376925, 0.0019437274024845035)
11.153696142595692


In [27]:
model = sm.WLS(y, X, weights = 1 / smoke.h).fit().get_robustcov_results(cov_type = "HC3")
model_summary = model.summary()
print(model_summary)

                            WLS Regression Results                            
Dep. Variable:                   cigs   R-squared:                       0.113
Model:                            WLS   Adj. R-squared:                  0.107
Method:                 Least Squares   F-statistic:                     18.77
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.69e-20
Time:                        23:53:54   Log-Likelihood:                -3207.8
No. Observations:                 807   AIC:                             6430.
Df Residuals:                     800   BIC:                             6462.
Df Model:                           6                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6355     41.225      0.137      0.8

C9.i OLS results are above

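C9.iii There is evidence of heteroskedasticity using both the special form of the White test applied to the weighted residuals (F about 11.15) and the standard White test (LM = 49.66, p-value 0.002)

C9.iv It may be that the form of heteroskedasticity is not properly specified, in which case the usual WLS standard errors are not valid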

C9.v Reported above
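
The feasible GLS steps used in In [25] can be sketched as follows with plain NumPy on simulated data (an assumption, not SMOKE). The variance function is modelled as $\exp(x\delta)$, which keeps the estimated variances positive. The helper `ols` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(0, 2, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(scale=np.exp(0.5 * x))  # variance rises with x

def ols(X, y):
    """Least-squares coefficients of y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: OLS residuals
u = y - X @ ols(X, y)
# Step 2: regress log(u^2) on the regressors and exponentiate the fit
g_hat = X @ ols(X, np.log(u ** 2))
h_hat = np.exp(g_hat)  # always positive
# Step 3: WLS with weights 1 / h_hat, done as OLS on rescaled data
scale = np.sqrt(h_hat)
beta_fgls = ols(X / scale[:, None], y / scale)
print(beta_fgls)
```

If the exponential form is wrong, the weighted residuals can still be heteroskedastic, which is the situation C9.iii and C9.iv describe and the reason for the robust WLS errors in In [27].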

In [28]:
# Exercise 10

k401subs = pd.read_stata("stata/401ksubs.dta")

y = k401subs.e401k
X = sm.add_constant(k401subs[["inc", "incsq", "age", "agesq", "male"]])
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)
print(model.get_robustcov_results(cov_type = "HC3").summary())

                            OLS Regression Results                            
Dep. Variable:                  e401k   R-squared:                       0.094
Model:                            OLS   Adj. R-squared:                  0.094
Method:                 Least Squares   F-statistic:                     193.0
Date:                Sun, 24 May 2020   Prob (F-statistic):          3.41e-196
Time:                        23:53:54   Log-Likelihood:                -6051.5
No. Observations:                9275   AIC:                         1.211e+04
Df Residuals:                    9269   BIC:                         1.216e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.5063      0.081     -6.243      0.0

In [29]:
y = np.power(model.resid, 2)
k401subs["y_hat"] = model.fittedvalues
k401subs["y_hat_sq"] = np.power(k401subs.y_hat, 2)
X = sm.add_constant(k401subs[["y_hat", "y_hat_sq"]])

het_model = sm.OLS(y, X).fit()
print((het_model.rsquared / het_model.df_model) / ((1 - het_model.rsquared)/(het_model.nobs - het_model.df_model - 1)))

310.3228196883429


In [30]:
print(model.fittedvalues.min(), model.fittedvalues.max())

0.029917158729936277 0.6971898835393978


In [31]:
k401subs["h"] = model.fittedvalues * (1 - model.fittedvalues)
y = k401subs.e401k
X = sm.add_constant(k401subs[["inc", "incsq", "age", "agesq", "male"]])
model = sm.WLS(y, X, weights = 1 / k401subs.h).fit()
print(model.summary())

                            WLS Regression Results                            
Dep. Variable:                  e401k   R-squared:                       0.108
Model:                            WLS   Adj. R-squared:                  0.107
Method:                 Least Squares   F-statistic:                     224.2
Date:                Sun, 24 May 2020   Prob (F-statistic):          1.28e-226
Time:                        23:53:54   Log-Likelihood:                -5953.5
No. Observations:                9275   AIC:                         1.192e+04
Df Residuals:                    9269   BIC:                         1.196e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.4880      0.076     -6.456      0.0

C10.i There are not any substantial differences between the OLS and heteroskedasticity robust errors.

C10.ii Not one to be solved computationally, but in broad strokes: this stems from $Var(y|x) = p(x)(1-p(x)) = p(x) - p(x)^2$, which can be cast in error form and written as a regression model, $u^2 = \delta_0 + \delta_1 p(x) + \delta_2 p(x)^2 + v$. If the linear probability model is correct, the intercept is 0 and the coefficients on the level and squared terms are 1 and -1.

C10.iii The F statistic is very large, 310.32, which is strong evidence of heteroskedasticity

C10.iv All the fitted values fall between 0 and 1. There are no substantial differences between the two models.
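
To illustrate the variance function of a binary outcome, here is a small simulation under assumptions (known response probabilities, simulated data, not 401ksubs). Since $E[u^2|x] = p(x) - p(x)^2$, regressing the squared errors on $p$ and $p^2$ should give an intercept near 0 and slopes near 1 and -1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
p = rng.uniform(0.1, 0.9, size=n)            # true response probabilities
y = (rng.uniform(size=n) < p).astype(float)  # Bernoulli(p) outcomes
u_sq = (y - p) ** 2                          # squared errors, E[u_sq | p] = p - p^2

Z = np.column_stack([np.ones(n), p, p ** 2])
delta0, delta1, delta2 = np.linalg.lstsq(Z, u_sq, rcond=None)[0]
print(delta0, delta1, delta2)
```

In the exercise the true probabilities are unknown, so the fitted values and their squares stand in for $p$ and $p^2$, as in In [29].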

In [32]:
# Exercise 11

k401subs = pd.read_stata("stata/401ksubs.dta")
k401subs["e401kinc"] = k401subs.e401k * k401subs.inc
k401subs["agesqdemean"] = (k401subs.age.astype("int32") - 25) ** 2
k401subs = k401subs[k401subs.fsize == 1]

y = k401subs.nettfa
X = sm.add_constant(k401subs[["inc", "agesqdemean", "male", "e401k", "e401kinc"]])
model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)
print(model.get_robustcov_results(cov_type = "HC3").summary())

                            OLS Regression Results                            
Dep. Variable:                 nettfa   R-squared:                       0.131
Model:                            OLS   Adj. R-squared:                  0.129
Method:                 Least Squares   F-statistic:                     60.74
Date:                Sun, 24 May 2020   Prob (F-statistic):           4.43e-59
Time:                        23:53:54   Log-Likelihood:                -10511.
No. Observations:                2017   AIC:                         2.103e+04
Df Residuals:                    2011   BIC:                         2.107e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const         -17.1956      2.820     -6.097      