# Chapter 2

Assumptions required for the simple regression model (Gauss-Markov assumptions for simple regression)

* Linear in parameters: In the population model, the dependent variable $y$ is related to the independent variable $x$ and the error $u$ by the linear equation $y = \beta_0 + \beta_1 x + u$
* Random sampling: We have a random sample drawn from the population model in the previous assumption
* Sample variation in the explanatory variable: The sample values of $x$ are not all the same
* Zero conditional mean: The error $u$ has an expected value of 0 given any value of the explanatory variable, $E(u|x) = 0$
* Homoskedasticity: The error $u$ has the same variance given any value of $x$, $Var(u|x) = \sigma^2$
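
The sample-variation assumption can be seen directly in the OLS formulas: the slope divides by the sum of squared deviations of $x$, which is zero when $x$ is constant. Below is a minimal sketch using only the standard library. The helper name `ols_simple` and the four data points are hypothetical and are not part of the exercises.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the OLS line for a single regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of x: zero when x is constant in the sample.
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    if sst_x == 0:
        raise ValueError("x has no sample variation; the slope is undefined")
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sst_x
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
b0, b1 = ols_simple([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(b0, b1)
```

Calling `ols_simple` with a constant `x` raises `ValueError`, which is the failure the third assumption rules out.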

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Exercise C1
s401k = pd.read_stata("stata/401K.DTA")

In [3]:
print(s401k.prate.mean())
print(s401k.mrate.mean())

87.36289978027344
0.7315128445625305


In [4]:
import statsmodels.api as sm

y = s401k.prate
X = s401k.mrate
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  prate   R-squared:                       0.075
Model:                            OLS   Adj. R-squared:                  0.074
Method:                 Least Squares   F-statistic:                     123.7
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.10e-27
Time:                        23:17:52   Log-Likelihood:                -6437.0
No. Observations:                1534   AIC:                         1.288e+04
Df Residuals:                    1532   BIC:                         1.289e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         83.0755      0.563    147.484      0.0

In [5]:
X_p = pd.DataFrame({"const" : [1.0], "mrate" : [3.5]})
model.predict(exog=X_p)

0    103.589233
dtype: float64

C1.i Average participation rate is 87.36%, and the average match rate is 73 cents for every dollar contributed

C1.ii (Full summary above), N = 1534, Intercept = 83.0755, mrate = 5.8611, R-Squared 0.075

C1.iii The intercept could be considered the 'base' level of 401K participation without a matching contribution (a predicted 83% of the workforce participate with no inducement). The coefficient on mrate is the change in participation for a one-dollar increase in the match rate (that is, each additional dollar of match per dollar contributed is associated with a little under 6 percentage points of additional participation)

C1.iv The predicted prate is 103.589233. This prediction implies that more than 100% of eligible employees will participate when the employer matches \\$3.50 for every dollar contributed, which is impossible. The result reflects the linear model's constant slope, which keeps raising the prediction as mrate grows, rather than any observed outcome.

C1.v 7.5% of the variation in the participation rate is explained by the matching rate. This is not a very large amount explained by the matching rate alone.
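
The C1.iv prediction can be checked by hand from the rounded coefficients in C1.ii. The sketch below also finds the match rate at which the fitted line crosses 100%. The function name `predicted_prate` is made up for illustration.

```python
# Rounded coefficients as reported in C1.ii.
intercept = 83.0755
slope_mrate = 5.8611


def predicted_prate(mrate):
    """Fitted participation rate (in percent) for a given match rate."""
    return intercept + slope_mrate * mrate


pred = predicted_prate(3.5)
print(round(pred, 2))  # 103.59

# Match rate at which the fitted line reaches 100% participation.
mrate_at_100 = (100 - intercept) / slope_mrate
print(round(mrate_at_100, 2))
```

Any match rate above that crossing point gives a fitted value over 100%, so the linear form is best read as a local approximation.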

In [6]:
#Exercise C2

ceosal2 = pd.read_stata("stata/CEOSAL2.DTA")
print(ceosal2.salary.mean())
print(ceosal2.ceoten.mean())

865.8644067796611
7.954802259887006


In [7]:
print(ceosal2[ceosal2.ceoten == 0].shape[0])
print(ceosal2.ceoten.max())

5
37


In [8]:
y = ceosal2.lsalary
X = ceosal2.ceoten
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                lsalary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.334
Date:                Sun, 24 May 2020   Prob (F-statistic):              0.128
Time:                        23:17:53   Log-Likelihood:                -160.84
No. Observations:                 177   AIC:                             325.7
Df Residuals:                     175   BIC:                             332.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.5055      0.068     95.682      0.0

C2.i The average salary is \\$865,864 and the average tenure is 7.95 years

C2.ii Five CEOs are in their first year. The longest tenure is 37 years

C2.iii N = 177, Intercept: 6.5055, ceoten = 0.0097, R-Squared 0.013. The approximate percentage increase in salary given one more year of being CEO is about 1%.
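
The "about 1%" in C2.iii uses the usual log-level approximation $\%\Delta salary \approx 100 \cdot \beta_1$. As a rough check, the sketch below compares it with the exact change $100 \cdot (e^{\beta_1} - 1)$, using the rounded slope from C2.iii. The variable names are illustrative only.

```python
import math

beta_ceoten = 0.0097  # rounded slope reported in C2.iii

# Approximate percentage change from one more year of tenure.
approx_pct = 100 * beta_ceoten

# Exact percentage change implied by the log-level model.
exact_pct = 100 * (math.exp(beta_ceoten) - 1)

print(approx_pct, exact_pct)
```

For a coefficient this small the two figures agree to two decimal places, so the approximation is harmless here.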

In [9]:
#Exercise C3

sleep75 = pd.read_stata("stata/SLEEP75.DTA")

y = sleep75.sleep
X = sm.add_constant(sleep75.totwrk)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  sleep   R-squared:                       0.103
Model:                            OLS   Adj. R-squared:                  0.102
Method:                 Least Squares   F-statistic:                     81.09
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.99e-18
Time:                        23:17:53   Log-Likelihood:                -5267.1
No. Observations:                 706   AIC:                         1.054e+04
Df Residuals:                     704   BIC:                         1.055e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3586.3770     38.912     92.165      0.0

In [10]:
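# Select the slope by label; positional [] indexing on a labelled Series is deprecated in recent pandas
60 * 2 * model.params["totwrk"]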

-18.089498894279217

C3.i n = 706, Intercept: 3586.3770, totwrk: -0.1507, R-squared: 0.103. The intercept is the predicted amount of sleep when totwrk is zero (people who do not work are predicted to sleep 3586.38 minutes a week).

C3.ii Increasing work by 2 hours a week is predicted to reduce sleep by about 18 minutes a week. This is not a particularly large effect
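
To put the C3 numbers on a more familiar scale, the sketch below redoes the arithmetic with the rounded coefficients from C3.i and converts the intercept into hours of sleep per night. The notebook cell above uses the unrounded slope, which is why it reports -18.09 rather than -18.08.

```python
intercept = 3586.3770   # predicted minutes of sleep per week at totwrk = 0 (C3.i)
slope_totwrk = -0.1507  # change in sleep minutes per extra minute of work (C3.i)

# Two more hours of work per week, expressed in minutes.
extra_work = 2 * 60
change_per_week = slope_totwrk * extra_work
print(round(change_per_week, 2))  # -18.08

# The intercept expressed as hours of sleep per night.
hours_per_night = intercept / 7 / 60
print(round(hours_per_night, 2))
```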

In [11]:
#Exercise C4

wage2 = pd.read_stata("stata/WAGE2.DTA")

print(wage2.wage.mean())
print(wage2.IQ.mean())
print(wage2.IQ.std())

957.9454545454546
101.28235294117647
15.052636370265098


In [12]:
y = wage2.wage
X = sm.add_constant(wage2.IQ)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.096
Model:                            OLS   Adj. R-squared:                  0.095
Method:                 Least Squares   F-statistic:                     98.55
Date:                Sun, 24 May 2020   Prob (F-statistic):           3.79e-22
Time:                        23:17:53   Log-Likelihood:                -6891.4
No. Observations:                 935   AIC:                         1.379e+04
Df Residuals:                     933   BIC:                         1.380e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        116.9916     85.642      1.366      0.1

In [13]:
model.params["IQ"] * 15

124.5459646235166

In [14]:
y = wage2.lwage
X = sm.add_constant(wage2.IQ)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.099
Model:                            OLS   Adj. R-squared:                  0.098
Method:                 Least Squares   F-statistic:                     102.6
Date:                Sun, 24 May 2020   Prob (F-statistic):           5.93e-23
Time:                        23:17:53   Log-Likelihood:                -468.85
No. Observations:                 935   AIC:                             941.7
Df Residuals:                     933   BIC:                             951.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.8870      0.089     66.131      0.0

In [15]:
15 * model.params["IQ"]

0.1321073365345066

C4.i The average monthly wage is \\$957.95 and the average IQ is 101 with a standard deviation of 15

C4.ii A one-point increase in IQ raises the predicted monthly wage by \\$8.30. An IQ increase of 15 points translates to an additional \\$124.55 a month. IQ explains a little under 10% of the variation in wages.

C4.iii A 15-point increase in IQ translates to approximately a 13.2% increase in wage
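
The 13.2% in C4.iii is the log approximation $100 \cdot 15 \cdot \beta_1$. For a change this large the approximation starts to drift, so the sketch below compares it with the exact figure $100 \cdot (e^{15 \beta_1} - 1)$. The input is the value printed by cell In [15]; the variable names are illustrative.

```python
import math

# 15 * slope from the lwage regression, as printed by cell In [15].
log_change = 0.1321073365345066

# Approximate percentage change in wage.
approx_pct = 100 * log_change

# Exact percentage change implied by the log-level model.
exact_pct = 100 * (math.exp(log_change) - 1)

print(round(approx_pct, 1), round(exact_pct, 1))
```

The exact change comes out roughly one percentage point higher than the approximation.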

In [16]:
#Exercise C5

rdchem = pd.read_stata("stata/RDCHEM.DTA")

y = rdchem.lrd
X = sm.add_constant(rdchem.lsales)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                    lrd   R-squared:                       0.910
Model:                            OLS   Adj. R-squared:                  0.907
Method:                 Least Squares   F-statistic:                     302.7
Date:                Sun, 24 May 2020   Prob (F-statistic):           3.20e-17
Time:                        23:17:53   Log-Likelihood:                -24.021
No. Observations:                  32   AIC:                             52.04
Df Residuals:                      30   BIC:                             54.97
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -4.1047      0.453     -9.066      0.0

C5.i $\log(rd) = \beta_0 + \beta_1 \log(sales) + u$. The elasticity of rd with respect to sales is $\beta_1$

C5.ii A 1% increase in sales is estimated to increase research and development spending by about 1.08%
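
In a constant-elasticity model the intercept cancels out of percentage changes, so the elasticity alone determines the response. The sketch below takes the 1.08 from C5.ii and computes the exact change in rd for a given percentage change in sales. The helper `pct_change_rd` is hypothetical.

```python
elasticity = 1.08  # estimated elasticity of rd with respect to sales (C5.ii)


def pct_change_rd(pct_change_sales):
    """Exact % change in rd implied by rd = A * sales ** elasticity."""
    return 100 * ((1 + pct_change_sales / 100) ** elasticity - 1)


print(round(pct_change_rd(1), 2))
print(round(pct_change_rd(10), 2))
```

For a 1% change the exact answer is essentially the elasticity itself. For larger changes it is slightly more than the elasticity times the percentage change, because the elasticity is above one.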

In [17]:
#Exercise C6

meap93 = pd.read_stata("stata/MEAP93.DTA")

y = meap93.math10
X = sm.add_constant(meap93.lexpend)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                 math10   R-squared:                       0.030
Model:                            OLS   Adj. R-squared:                  0.027
Method:                 Least Squares   F-statistic:                     12.41
Date:                Sun, 24 May 2020   Prob (F-statistic):           0.000475
Time:                        23:17:53   Log-Likelihood:                -1531.4
No. Observations:                 408   AIC:                             3067.
Df Residuals:                     406   BIC:                             3075.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -69.3411     26.530     -2.614      0.0

C6.i Diminishing effects seem more appropriate. Initial investments in education have intuitive effects (supplies, access to technology), but past a certain point, additional expenditure cannot improve learning outcomes

C6.ii In a level-log model, $\Delta math10 \approx (\beta_1/100) \cdot \%\Delta expend$. Setting $\%\Delta expend = 10$ gives $\Delta math10 \approx \beta_1/10$, so $\beta_1/10$ is the percentage point change in math10 for a 10% increase in spending.

C6.iii N = 408, Intercept = -69.3411, lexpend = 11.1644, R-squared = 0.030

C6.iv A 10% increase in spending is estimated to increase math10 by about 1.1 percentage points.

C6.v Math scores are not particularly high in this data set, so fitted values above 100 are not much of a concern
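
The C6.ii and C6.iv arithmetic can be sketched as follows, using the rounded lexpend coefficient from C6.iii. The second figure uses $\beta_1 \cdot \log(1.10)$, the change implied by the model without the small-change approximation. The variable names are illustrative.

```python
import math

beta_lexpend = 11.1644  # rounded slope on log(expend) from C6.iii

# Approximation from C6.ii: (beta_1 / 100) * 10 percentage points.
approx_change = beta_lexpend / 100 * 10

# Change implied by the model itself for a 10% increase in spending.
exact_change = beta_lexpend * math.log(1.10)

print(round(approx_change, 2), round(exact_change, 2))
```

Both figures are in percentage points of math10, not percent of the current pass rate.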

In [18]:
#Exercise C7

charity = pd.read_stata("stata/charity.dta")
print(charity.gift.mean())
print(charity[charity.gift == 0].shape[0] / charity.shape[0])

7.444470477975632
0.6000468603561387


In [19]:
print(charity.mailsyear.mean())
print(charity.mailsyear.min())
print(charity.mailsyear.max())

2.0495548248291016
0.25
3.5


In [20]:
y = charity.gift
X = sm.add_constant(charity.mailsyear)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                   gift   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.014
Method:                 Least Squares   F-statistic:                     59.65
Date:                Sun, 24 May 2020   Prob (F-statistic):           1.40e-14
Time:                        23:17:53   Log-Likelihood:                -17602.
No. Observations:                4268   AIC:                         3.521e+04
Df Residuals:                    4266   BIC:                         3.522e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.0141      0.739      2.724      0.0

In [21]:
model.predict(X).min()

2.6764663981855916

C7.i The average gift is worth 7.44 Dutch guilders. 60% of people did not give a gift

C7.ii The average number of mailings a year is about 2. The minimum number is 0.25 and the maximum is 3.5

C7.iii N = 4268, Intercept = 2.0141, mailsyear = 2.6495, R-squared = 0.014

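C7.iv The slope means that each additional mailing per year is associated with an average gift about 2.65 guilders larger (the intercept, not the slope, is the predicted gift without any mailings). If the cost per mailing is one guilder, the charity expects a net gain of about 1.65 guilders per mailing. This does not imply a net gain on every mailing, since the estimate refers to the average.

C7.v The smallest predicted contribution is 2.68. The simple regression cannot predict a zero gift in this sample: the intercept and the slope are both positive, and it is not possible to send a negative number of mailings.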
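
The C7.iv and C7.v figures follow from the rounded coefficients in C7.iii. In the sketch below, the one-guilder cost is the assumption stated in C7.iv, and the variable names are illustrative.

```python
intercept = 2.0141        # predicted gift with zero mailings (C7.iii)
slope_mailsyear = 2.6495  # extra guilders per additional mailing per year (C7.iii)
cost_per_mailing = 1.0    # guilders, as assumed in C7.iv

# Expected net gain from one more mailing per year.
net_gain = slope_mailsyear - cost_per_mailing
print(round(net_gain, 4))

# Smallest fitted value: evaluated at the sample minimum of mailsyear (0.25).
min_pred = intercept + slope_mailsyear * 0.25
print(round(min_pred, 4))

# Number of mailings that would be needed for a fitted gift of zero.
zero_gift_mailings = -intercept / slope_mailsyear
print(round(zero_gift_mailings, 2))
```

The last value is negative, which is why no fitted value in the sample can reach zero.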

In [22]:
# Exercise C8

uni = np.random.uniform(low = 0, high = 10, size = 500)
norm = np.random.normal(size = 500) * 6
print(uni.mean())
print(uni.std())

5.1940270232527705
2.9336202150995585


In [23]:
print(norm.mean())
print(norm.std())

-0.34449076108098037
5.916322190618815


In [24]:
y = 1 + 2 * uni + norm

In [25]:
X = sm.add_constant(uni)

model = sm.OLS(y, X).fit()
model_summary = model.summary()
print(model_summary)

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.500
Model:                            OLS   Adj. R-squared:                  0.499
Method:                 Least Squares   F-statistic:                     498.3
Date:                Sun, 24 May 2020   Prob (F-statistic):           5.13e-77
Time:                        23:17:53   Log-Likelihood:                -1598.3
No. Observations:                 500   AIC:                             3201.
Df Residuals:                     498   BIC:                             3209.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5655      0.539      1.049      0.2

In [26]:
print(model.resid.sum())
print((model.resid * uni).sum())

-1.6200374375330284e-12
-6.0822458181064576e-12


In [27]:
print(norm.sum())
print((norm * uni).sum())

-172.24538054049017
-820.0611120576277


C8.i The mean of x is 5.19 and the standard deviation is 2.93

C8.ii The sample average of u is -0.34, not exactly zero, since we are dealing with a finite sample of 500 draws. It would be extraordinary if it produced exactly 0. The sample standard deviation is 5.92

C8.iii The estimates are close to the population values of 1 and 2, which both fall within the respective confidence intervals. They are not exact, however, since they are computed from a sample.

C8.iv Both sums are zero up to floating-point rounding error (on the order of $10^{-12}$)

C8.v With the errors in place of the residuals, the sums are -172.25 and -820.06, so the equations do not hold. This is expected, since equation 2.60 is a property of the OLS residuals, not of the errors

C8.vi There will be slight changes with each reproduction (if the notebook is rerun the values in the answers above will likely be different from what is reported)
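
One way to make the C8 numbers repeatable is to seed the random number generator. The sketch below is a standard-library version of the same experiment, not the notebook's NumPy code: the function `simulate` and the seed values are hypothetical, and OLS is computed from the closed-form formulas rather than with statsmodels.

```python
import random


def simulate(seed, n=500):
    """Draw x ~ Uniform(0, 10), u ~ Normal(0, 6), set y = 1 + 2x + u, and fit OLS."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]
    u = [rng.gauss(0, 6) for _ in range(n)]
    y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return x, u, resid, intercept, slope


x, u, resid, b0, b1 = simulate(seed=0)
print(b0, b1)

# Residuals satisfy the OLS first-order conditions up to rounding error.
print(sum(resid), sum(xi * ri for xi, ri in zip(x, resid)))

# The errors themselves do not.
print(sum(u), sum(xi * ui for xi, ui in zip(x, u)))
```

Rerunning with the same seed reproduces the estimates exactly, while a different seed gives a different sample and slightly different estimates.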