## Chapter 4

Assumptions required for these tools to work (statistical inference). Classical linear model (CLM) for cross-sectional regression

* Linear in parameters: Can be written as $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + u$
* Random sampling: We have a random sample of $n$ observations that follow the population model as above
* No perfect collinearity: None of the independent variables is a constant and there is no exact linear relationship among the independent variables
* Zero conditional mean: The error $u$ has an expected value of zero given any values of the explanatory variables, or $E(u|x_1,x_2,...,x_k) = 0$ (Unobserved factors are, on average, unrelated to the explanatory variable)
* Homoskedasticity: The error $u$ has the same variance given any values of the explanatory variables, or $Var(u|x_1, x_2,...,x_k)=\sigma^2$
* Normality: The population error $u$ is independent of the explanatory variales $x_1, x_2, ... x_k$ and is normally distributed with zero mean and variance $\sigma^2$, or $u~Normal(0,\sigma^2)$ (this is for small samples)

In [1]:
import pandas as pd
import statsmodels.api as sm

In [2]:
#Exercise C1
vote1 = pd.read_stata("stata/VOTE1.DTA")

y = vote1.voteA
X = sm.add_constant(vote1[["lexpendA", "lexpendB", "prtystrA"]])
model = sm.OLS(y,X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  voteA   R-squared:                       0.793
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     215.2
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.76e-57
Time:                        23:23:52   Log-Likelihood:                -596.86
No. Observations:                 173   AIC:                             1202.
Df Residuals:                     169   BIC:                             1214.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         45.0789      3.926     11.481      0.0

In [3]:
print(model.f_test("lexpendA = -lexpendB"))

<F test: F=array([[0.99630884]]), p=0.31963233050214857, df_denom=169, df_num=1>


In [4]:
vote1["expend_diff"] = vote1.lexpendB - vote1.lexpendA
X = sm.add_constant(vote1[["lexpendA", "expend_diff", "prtystrA"]])
model = sm.OLS(y,X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  voteA   R-squared:                       0.793
Model:                            OLS   Adj. R-squared:                  0.789
Method:                 Least Squares   F-statistic:                     215.2
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.76e-57
Time:                        23:23:52   Log-Likelihood:                -596.86
No. Observations:                 173   AIC:                             1202.
Df Residuals:                     169   BIC:                             1214.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          45.0789      3.926     11.481      

C1.i The interpretation is the percent increase of an extra dollar of campaign spending for candidate A

C1.ii $\beta_1 = -\beta_2$ or $\beta_1 + \beta_2 = 0$

C1.iii Both coefficients are statistically significant and appear to have the same absolute value. The hypothesis is formally tested and there is not sufficicent evidence to reject.

C1.iv By adding the two values and putting it into the regression the result is small with a t-statistic of -0.998, failing to reject the null hypothesis that it is different from zero.

In [5]:
#Exercise C2
lawsch85 = pd.read_stata("stata/LAWSCH85.DTA")
lawsch85_reg = lawsch85[["lsalary", "LSAT", "GPA", "llibvol", "lcost", "rank"]].dropna()

y = lawsch85_reg.lsalary
X = sm.add_constant(lawsch85_reg[["LSAT", "GPA", "llibvol", "lcost", "rank"]])
model = sm.OLS(y,X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                lsalary   R-squared:                       0.842
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     138.2
Date:                Sun, 24 May 2020   Prob (F-statistic):           2.93e-50
Time:                        23:23:52   Log-Likelihood:                 107.33
No. Observations:                 136   AIC:                            -202.7
Df Residuals:                     130   BIC:                            -185.2
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.3432      0.533     15.667      0.0

In [6]:
print(model.f_test("LSAT, GPA"))

<F test: F=array([[9.95175399]]), p=9.518119466666694e-05, df_denom=130, df_num=2>


In [7]:
lawsch85_reg = lawsch85[["lsalary", "LSAT", "GPA", "llibvol", "lcost", "rank", "clsize", "faculty"]].dropna()

y = lawsch85_reg.lsalary
X = sm.add_constant(lawsch85_reg[["LSAT", "GPA", "llibvol", "lcost", "rank", "clsize", "faculty"]])
model = sm.OLS(y,X).fit()
model_summary = model.summary()
print(model_summary)
print(model.f_test("clsize, faculty"))

                            OLS Regression Results                            
Dep. Variable:                lsalary   R-squared:                       0.844
Model:                            OLS   Adj. R-squared:                  0.835
Method:                 Least Squares   F-statistic:                     95.05
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.51e-46
Time:                        23:23:52   Log-Likelihood:                 103.77
No. Observations:                 131   AIC:                            -191.5
Df Residuals:                     123   BIC:                            -168.5
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.4159      0.552     15.239      0.0

C2.i The null is $\beta_{rank} = 0$ This null hypothesis is soundly rejected with a t-statistic of -9.541 and the coefficient indicates that there is a small negative relationship between rank and starting salary.

C2.ii Only GPA is individually significant (with a large effect), though the pair are jointly significant with a p-value of 0.03

C2.iii An F-test on the joint significance of class size and faculty fails to reject the null hypothesis that they are significantly different from zero. We do not have evidence to suggest that their inclusion would improve the model.

C2.iv Publications from faculty or other indications of professor quality. Location of school.

In [8]:
#Exercise C3
hprice1 = pd.read_stata("stata/hprice1.dta")

y = hprice1.lprice
X = sm.add_constant(hprice1[["sqrft", "bdrms"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                 lprice   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.579
Method:                 Least Squares   F-statistic:                     60.73
Date:                Sun, 24 May 2020   Prob (F-statistic):           4.17e-17
Time:                        23:23:52   Log-Likelihood:                 19.592
No. Observations:                  88   AIC:                            -33.18
Df Residuals:                      85   BIC:                            -25.75
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.7660      0.097     49.112      0.0

In [9]:
hprice1["modB1"] = hprice1.sqrft - (150 * hprice1.bdrms)
X_a = sm.add_constant(hprice1[["modB1", "bdrms"]])

model = sm.OLS(y, X_a).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                 lprice   R-squared:                       0.588
Model:                            OLS   Adj. R-squared:                  0.579
Method:                 Least Squares   F-statistic:                     60.73
Date:                Sun, 24 May 2020   Prob (F-statistic):           4.17e-17
Time:                        23:23:52   Log-Likelihood:                 19.592
No. Observations:                  88   AIC:                            -33.18
Df Residuals:                      85   BIC:                            -25.75
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.7660      0.097     49.112      0.0

In [10]:
conf_inv = (model.params[2] - (2.06 * model.bse[2]), model.params[2] + (2.06 * model.bse[2]))
print(conf_inv)

(0.030660261764766737, 0.140942421648557)


C3.iii [0.0306, 0.1409]

In [11]:
#Exercise C4
bwght = pd.read_stata("stata/BWGHT.DTA")
bwght_reduced = bwght[["bwght", "cigs", "parity", "faminc"]]

y = bwght_reduced.bwght
X = sm.add_constant(bwght_reduced[["cigs", "parity", "faminc"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  bwght   R-squared:                       0.035
Model:                            OLS   Adj. R-squared:                  0.033
Method:                 Least Squares   F-statistic:                     16.63
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.28e-10
Time:                        23:23:52   Log-Likelihood:                -6126.8
No. Observations:                1388   AIC:                         1.226e+04
Df Residuals:                    1384   BIC:                         1.228e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        114.2143      1.469     77.734      0.0

In [12]:
((0.0387 - 0.035) / (1 - 0.0387)) * (1185/2)

2.280505565380211

C4 The r-squared for all observations is 0.035 compared to the restricted r-squared of 0.0364. This would reject the null and determine motheduc and fatheduc are jointly significant.

In [13]:
#Exercise C5
mlb1 = pd.read_stata("stata/MLB1.DTA")

y = mlb1.lsalary
X = sm.add_constant(mlb1[["years", "gamesyr", "bavg", "hrunsyr"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                lsalary   R-squared:                       0.625
Model:                            OLS   Adj. R-squared:                  0.621
Method:                 Least Squares   F-statistic:                     145.2
Date:                Sun, 24 May 2020   Prob (F-statistic):           6.98e-73
Time:                        23:23:53   Log-Likelihood:                -386.25
No. Observations:                 353   AIC:                             782.5
Df Residuals:                     348   BIC:                             801.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         11.0209      0.266     41.476      0.0

In [14]:
y = mlb1.lsalary
X = sm.add_constant(mlb1[["years", "gamesyr", "bavg", "hrunsyr", "runsyr", "fldperc", "sbasesyr"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                lsalary   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.632
Method:                 Least Squares   F-statistic:                     87.25
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.84e-72
Time:                        23:23:53   Log-Likelihood:                -379.71
No. Observations:                 353   AIC:                             775.4
Df Residuals:                     345   BIC:                             806.3
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.4083      2.003      5.196      0.0

In [15]:
model.f_test("(bavg = 0), (fldperc = 0), (sbasesyr = 0)")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[0.68500393]]), p=0.5617089404843013, df_denom=345, df_num=3>

C5.i hrunsyr becomes statistically significant and the size of the coefficient almost doubles

C5.ii runsyr is (individually) statistically significant

C5.iii We fail to reject the null hypothesis that the three coefficients are zero (jointly insignificant)

In [16]:
#Exercise C6
wage2 = pd.read_stata("stata/WAGE2.DTA")

y = wage2.lwage
X = sm.add_constant(wage2[["educ", "exper", "tenure"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.155
Model:                            OLS   Adj. R-squared:                  0.152
Method:                 Least Squares   F-statistic:                     56.97
Date:                Sun, 24 May 2020   Prob (F-statistic):           8.12e-34
Time:                        23:23:53   Log-Likelihood:                -438.84
No. Observations:                 935   AIC:                             885.7
Df Residuals:                     931   BIC:                             905.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.4967      0.111     49.731      0.0

In [17]:
model.f_test("exper = tenure")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[0.16963719]]), p=0.6805292607474092, df_denom=931, df_num=1>

C6.i The null hypothesis is $\beta_2 = \beta_3$ or $\beta_2 - \beta_3 = 0$

C6.ii The F-test has a p-value of 0.68. We fail to reject the null hypothesis that general workforce experience is different from tenure

In [18]:
#Exercise C7
twoyear = pd.read_stata("stata/twoyear.dta")

print(twoyear.phsrank.min())
print(twoyear.phsrank.max())
print(twoyear.phsrank.mean())

0
99
56.15703090344522


In [19]:
y = twoyear.lwage
X = sm.add_constant(twoyear[["jc", "totcoll", "exper", "phsrank"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.223
Model:                            OLS   Adj. R-squared:                  0.222
Method:                 Least Squares   F-statistic:                     483.8
Date:                Sun, 24 May 2020   Prob (F-statistic):               0.00
Time:                        23:23:53   Log-Likelihood:                -3887.9
No. Observations:                6763   AIC:                             7786.
Df Residuals:                    6758   BIC:                             7820.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4587      0.024     61.756      0.0

In [20]:
y = twoyear.lwage
X = sm.add_constant(twoyear[["jc", "totcoll", "exper", "phsrank", "id"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.223
Model:                            OLS   Adj. R-squared:                  0.222
Method:                 Least Squares   F-statistic:                     387.1
Date:                Sun, 24 May 2020   Prob (F-statistic):               0.00
Time:                        23:23:53   Log-Likelihood:                -3887.7
No. Observations:                6763   AIC:                             7787.
Df Residuals:                    6757   BIC:                             7828.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.4522      0.026     56.750      0.0

C7.ii A 10 percentage point increase is worth about 0.3%. The variable is not statistically significant

C7.iii phsrank does not not substantively change the conclusions on the returns to two and four year colleges. The coefficents are largely unchanged and fall within the confidence intervals. The effect of phsrank is small and neither economically nor statistically significant.

C7.iv The id is effecitvely random and so we would not expect it to systematically affect wages. The two-sided p-value is 0.507

In [21]:
#Exercise C8
ksubs = pd.read_stata("stata/401ksubs.dta")
single = ksubs[ksubs.fsize == 1]
print(single.shape)

y = single.nettfa
X = sm.add_constant(single[["inc", "age"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

(2017, 11)
                            OLS Regression Results                            
Dep. Variable:                 nettfa   R-squared:                       0.119
Model:                            OLS   Adj. R-squared:                  0.118
Method:                 Least Squares   F-statistic:                     136.5
Date:                Sun, 24 May 2020   Prob (F-statistic):           2.63e-56
Time:                        23:23:53   Log-Likelihood:                -10524.
No. Observations:                2017   AIC:                         2.105e+04
Df Residuals:                    2014   BIC:                         2.107e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -43.0398      4.080    -10.5

In [22]:
model.t_test("age = 1")

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.8427      0.092     -1.710      0.087       0.662       1.023

In [23]:
model.t_test("age = 1").pvalue / 2

0.04371513880356598

In [24]:
y = single.nettfa
X = sm.add_constant(single[["inc"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                 nettfa   R-squared:                       0.083
Model:                            OLS   Adj. R-squared:                  0.082
Method:                 Least Squares   F-statistic:                     181.6
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.08e-39
Time:                        23:23:53   Log-Likelihood:                -10565.
No. Observations:                2017   AIC:                         2.113e+04
Df Residuals:                    2015   BIC:                         2.115e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -10.5710      2.061     -5.130      0.0

In [25]:
single.inc.corr(single.age)

0.03905864335541976

C8.i There are 2017 single person households

C8.ii Possibly surprising that age increases after accounting for income? That said, assets probably should increase over time.

C8.iii The comparison of age/income = 0 is a negative value. Babies do not start with debt

C8.iv The p-value is 0.0437 for the one sided test, meaning we fail to reject the null hypothesis at the 1% level.

C8.v The coefficient is not very different. This is possibly because age is uncorrelated with income.

In [26]:
#Exercise C9
discrim = pd.read_stata("stata/discrim.dta")
discrim = discrim[["lpsoda", "prpblck", "lincome", "prppov"]].dropna()

y = discrim.lpsoda
X = sm.add_constant(discrim[["prpblck", "lincome", "prppov"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                 lpsoda   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                  0.080
Method:                 Least Squares   F-statistic:                     12.60
Date:                Sun, 24 May 2020   Prob (F-statistic):           6.92e-08
Time:                        23:23:53   Log-Likelihood:                 439.04
No. Observations:                 401   AIC:                            -870.1
Df Residuals:                     397   BIC:                            -854.1
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.4633      0.294     -4.982      0.0

In [27]:
discrim.prppov.corr(discrim.lincome)

-0.8402069122771414

In [28]:
discrim = pd.read_stata("stata/discrim.dta")
discrim = discrim[["lpsoda", "prpblck", "lincome", "prppov", "lhseval"]].dropna()

y = discrim.lpsoda
X = sm.add_constant(discrim[["prpblck", "lincome", "prppov", "lhseval"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                 lpsoda   R-squared:                       0.184
Model:                            OLS   Adj. R-squared:                  0.176
Method:                 Least Squares   F-statistic:                     22.31
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.24e-16
Time:                        23:23:53   Log-Likelihood:                 461.55
No. Observations:                 401   AIC:                            -913.1
Df Residuals:                     396   BIC:                            -893.1
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.8415      0.292     -2.878      0.0

In [29]:
model.f_test("lincome = 0, prppov = 0")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[3.52268494]]), p=0.030448605108275816, df_denom=396, df_num=2>

C9.i prpblck is statistically different from 0 at the 5% level but not at the 1% level

C9.ii The correlation between log income and proportion in poverty is -0.84, highly correlated. This means multicollinearity, but each coefficient is statistically significant

C9.iii A 1% increase in the median housing value translates in a 0.12% increase in the price. The p-value is 0.000 very small)

C9.iv A test of joint significance has a p-value of 0.030 allowing us to say the variables are jointly significant. This likely means that too many variables are correlated.

C9.v The most recent regression does not violate any assumptions of the model but is a better fit. This would be the preferable model if we were not looking for information on the other variables.

In [30]:
#Exercise C10
elem94_95 = pd.read_stata("stata/elem94_95.dta")

y = elem94_95.lavgsal
X = sm.add_constant(elem94_95[["bs"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                lavgsal   R-squared:                       0.015
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     28.23
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.21e-07
Time:                        23:23:54   Log-Likelihood:                 85.171
No. Observations:                1848   AIC:                            -166.3
Df Residuals:                    1846   BIC:                            -155.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.7479      0.052    208.042      0.0

In [31]:
model.t_test("bs = -1")

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0            -0.7951      0.150      1.369      0.171      -1.089      -0.502

In [32]:
X = sm.add_constant(elem94_95[["bs", "lenrol", "lstaff"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                lavgsal   R-squared:                       0.482
Model:                            OLS   Adj. R-squared:                  0.481
Method:                 Least Squares   F-statistic:                     572.0
Date:                Sun, 24 May 2020   Prob (F-statistic):          9.17e-263
Time:                        23:23:54   Log-Likelihood:                 679.00
No. Observations:                1848   AIC:                            -1350.
Df Residuals:                    1844   BIC:                            -1328.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         13.9530      0.107    130.118      0.0

In [33]:
X = sm.add_constant(elem94_95[["bs", "lenrol", "lstaff", "lunch"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                lavgsal   R-squared:                       0.488
Model:                            OLS   Adj. R-squared:                  0.487
Method:                 Least Squares   F-statistic:                     439.4
Date:                Sun, 24 May 2020   Prob (F-statistic):          4.22e-266
Time:                        23:23:54   Log-Likelihood:                 689.98
No. Observations:                1848   AIC:                            -1370.
Df Residuals:                    1843   BIC:                            -1342.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         13.8315      0.110    126.055      0.0

C10.i The slope is statistically different from 0 but not statistically different from 1 (at the 5% level)

C10.ii The coefficient becomes smaller, as with table 4.1 (and is now statistically different from 1)

C10.iii lenrol/lstaff removed some of the uncertainty with regards to bs by being added

C10.iv lstaff is the largest negative value and statistically significant. It likely represents less money to spread among staff when everything is held equal

C10.v The effect is essentially 0. They are not being compensated

C10.vi The patterns are broadly consistent

In [34]:
#Exercise C11
htv = pd.read_stata("stata/HTV.DTA")
htv["abil2"] = htv.abil ** 2

y = htv.educ
X = sm.add_constant(htv[["motheduc", "fatheduc", "abil", "abil2"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                   educ   R-squared:                       0.444
Model:                            OLS   Adj. R-squared:                  0.443
Method:                 Least Squares   F-statistic:                     244.9
Date:                Sun, 24 May 2020   Prob (F-statistic):          1.34e-154
Time:                        23:23:54   Log-Likelihood:                -2436.6
No. Observations:                1230   AIC:                             4883.
Df Residuals:                    1225   BIC:                             4909.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.2402      0.287     28.671      0.0

In [35]:
model.t_test("motheduc = fatheduc")

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.0812      0.042      1.936      0.053      -0.001       0.163

In [36]:
X = sm.add_constant(htv[["motheduc", "fatheduc", "abil", "abil2", "tuit17", "tuit18"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                   educ   R-squared:                       0.445
Model:                            OLS   Adj. R-squared:                  0.442
Method:                 Least Squares   F-statistic:                     163.5
Date:                Sun, 24 May 2020   Prob (F-statistic):          1.43e-152
Time:                        23:23:54   Log-Likelihood:                -2435.8
No. Observations:                1230   AIC:                             4886.
Df Residuals:                    1223   BIC:                             4921.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.0819      0.313     25.840      0.0

In [37]:
model.f_test("tuit17, tuit18")

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[0.83932887]]), p=0.43224904059225544, df_denom=1.22e+03, df_num=2>

In [38]:
htv.tuit17.corr(htv.tuit18)

0.9808332601038466

In [39]:
htv["tuit_means"] = (htv.tuit17 + htv.tuit18).mean() # This cell does not produce an output because it is an assignment

In [40]:
X = sm.add_constant(htv[["motheduc", "fatheduc", "abil", "abil2", "tuit_means"]])

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                   educ   R-squared:                       0.444
Model:                            OLS   Adj. R-squared:                  0.443
Method:                 Least Squares   F-statistic:                     244.9
Date:                Sun, 24 May 2020   Prob (F-statistic):          1.34e-154
Time:                        23:23:54   Log-Likelihood:                -2436.6
No. Observations:                1230   AIC:                             4883.
Df Residuals:                    1225   BIC:                             4909.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
motheduc       0.1901      0.028      6.767      0.0

C11.i We reject the null that the quadratic term is equal to 0, implying that the relationship is quadratic

C11.ii The p-value for the test of equality is 0.053, failing to reject the null at 5%

C11.iii The F-test has a p-value of 0.432 meaning that the results are not jointly significant

C11.iv The correlation is 0.98. Using the average is preferable since the two are highly correlated, removing the collinearity.

C11.v The causal interpretation is odd since higher costs would imply lower consumption of education. What is likely happening is that students who are getting a higher education must pay tuition while those who stop their studies do not. Tuition is necessarily correlated with higher education, but the cause is reversed.