In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("ps5.ipynb")

# Econ 140 – Problem Set 5

Before getting started on the assignment, run the cell at the very top that imports `otter` and the cell below which will import the packages we need.

**Important:** As mentioned in problem set 0, if you leave this notebook alone for a while and come back, to save memory datahub will "forget" which code cells you have run, and you may need to restart your kernel and run all of the cells from the top. That includes this code cell that imports packages. If you get `<something> not defined` errors, this is because you didn't run an earlier code cell that you needed to run. It might be this cell or the `otter` cell above.

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

---

## Problem 1. Instrumental Variable Estimation

Consumption of gasoline is a critical component of household expenditures, and increasingly, it is the focus of intense public policy debate given the concern over greenhouse gas emissions. For these reasons alone, economists would like to find accurate estimates of the price elasticity of demand for gasoline by American consumers. The data file `gasoline.csv` contains monthly data on U.S. consumption of gasoline from 1978 to 2002.

In [3]:
gas = pd.read_csv("gasoline.csv")
gas.head()

Unnamed: 0,obs,carsales,persincome,pricegas,quantgas,transindex
0,1978:01,10.07,1756,64.8,6681.0,59.6
1,1978:02,10.45,1756,64.7,6876.0,59.7
2,1978:03,10.953,1756,64.7,7255.0,59.9
3,1978:04,11.786,1821,64.9,7202.0,60.3
4,1978:05,11.804,1821,65.5,7724.0,61.0


<!-- BEGIN QUESTION -->

**Question 1.a.**
Estimate a simple linear demand equation by regressing the quantity of gas `quantgas` consumed on the price of a gallon of gas `pricegas`. What is your estimate of the price coefficient from the OLS estimation? Remember to use robust standard errors, and to always include a constant.

<!--
BEGIN QUESTION
name: q1_a
manual: true
-->

In [18]:
y_1a = gas['quantgas']
X_1a = sm.add_constant(gas[['pricegas']])
OLS_model_1a = sm.OLS(y_1a, X_1a)
results1a = OLS_model_1a.fit(cov_type='HC1')
results1a.summary()

0,1,2,3
Dep. Variable:,quantgas,R-squared:,0.046
Model:,OLS,Adj. R-squared:,0.043
Method:,Least Squares,F-statistic:,13.84
Date:,"Mon, 03 May 2021",Prob (F-statistic):,0.000239
Time:,08:34:57,Log-Likelihood:,-2356.4
No. Observations:,296,AIC:,4717.0
Df Residuals:,294,BIC:,4724.0
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,6531.8301,223.281,29.254,0.000,6094.208,6969.453
pricegas,7.8252,2.104,3.720,0.000,3.702,11.948

0,1,2,3
Omnibus:,11.752,Durbin-Watson:,0.191
Prob(Omnibus):,0.003,Jarque-Bera (JB):,5.598
Skew:,0.045,Prob(JB):,0.0609
Kurtosis:,2.332,Cond. No.,696.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.b.**
Use your OLSEs to express the price elasticity of demand evaluated at the average price of gas. Does it make economic sense?

*Hint: Express the price elasticity when demand is linear.*

<!--
BEGIN QUESTION
name: q1_b
manual: true
-->

In [19]:
gas[['pricegas']].mean()
elasticity = (-7.825)*(114.88)/((-7.825)*(114.88)-6531.93)
elasticity

0.1209732486092469

Since this is not a log-log model, the coefficient on price is not itself the price elasticity of demand. We can instead calculate the elasticity from the linear demand equation $Q={\beta}_{0} - {\beta}_{1}P$:
$$ e=\frac{P}{Q}\frac{dQ}{dP}=\frac{P}{{\beta}_{0} - {\beta}_{1}P}(-{\beta}_{1}) = \frac{{\beta}_{1}P}{{\beta}_{1}P-{\beta}_{0}} $$
From the OLS results, ${\beta}_{0}=6531.83$ and, because the demand equation is written with a minus sign in front of ${\beta}_{1}$, ${\beta}_{1}=-7.825$. Using the average price of 114.88, the elasticity is $(-7.825)(114.88)/((-7.825)(114.88)-6531.83)$, which is 0.121.
Thus, the price elasticity of demand evaluated at the average price of gas is 0.121. Because it is positive, it does not make economic sense: quantity demanded should fall, not rise, when the price increases.
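
The same arithmetic can be sketched in a few lines of Python. This is only an illustration that plugs in the intercept, slope, and mean price reported above; the variable names are chosen here for readability and are not part of the assignment.

```python
# Values copied from the OLS output and the answer above.
const = 6531.8301      # intercept from regressing quantgas on pricegas
slope = 7.8252         # estimated dQ/dP (coefficient on pricegas)
mean_price = 114.88    # average of pricegas

# Fitted quantity at the mean price, then elasticity e = (dQ/dP) * P / Q.
fitted_quantity = const + slope * mean_price
elasticity = slope * mean_price / fitted_quantity
print(round(elasticity, 3))
```

Writing the elasticity as slope times price over fitted quantity avoids the sign bookkeeping in the $Q={\beta}_{0} - {\beta}_{1}P$ form and gives the same number.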


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.c.**
Now introduce per capita personal income `persincome` as a regressor in the linear demand model and re-estimate using OLS. How has your estimate of price coefficient changed?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_c
manual: true
-->

In [20]:
y_1c = gas['quantgas']
X_1c = sm.add_constant(gas[['pricegas', 'persincome']])
OLS_model_1c = sm.OLS(y_1c, X_1c)
results1c = OLS_model_1c.fit(cov_type='HC1')
results1c.summary()

0,1,2,3
Dep. Variable:,quantgas,R-squared:,0.759
Model:,OLS,Adj. R-squared:,0.757
Method:,Least Squares,F-statistic:,520.9
Date:,"Mon, 03 May 2021",Prob (F-statistic):,3.32e-97
Time:,08:35:02,Log-Likelihood:,-2152.8
No. Observations:,296,AIC:,4312.0
Df Residuals:,293,BIC:,4323.0
Df Model:,2,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,6632.9609,168.570,39.348,0.000,6302.569,6963.352
pricegas,-6.8606,1.361,-5.041,0.000,-9.528,-4.193
persincome,0.3188,0.010,32.050,0.000,0.299,0.338

0,1,2,3
Omnibus:,2.611,Durbin-Watson:,0.757
Prob(Omnibus):,0.271,Jarque-Bera (JB):,2.432
Skew:,0.127,Prob(JB):,0.296
Kurtosis:,3.364,Cond. No.,32200.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.d.**
Explain.

<!--
BEGIN QUESTION
name: q1_d
manual: true
-->

After incorporating the persincome variable, the coefficient on price becomes -6.86 with a z-statistic of -5.041, which is highly significant. Because the sign of the price coefficient flips from positive to negative, the implied price elasticity of demand now makes economic sense. Therefore, we can treat persincome as an exogenous control variable whose inclusion in the model reduces the omitted variable bias.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.e.**
Do you think that the above regression suffers from omitted variable bias? If so, can you determine the sign of the bias?

<!--
BEGIN QUESTION
name: q1_e
manual: true
-->

Since the coefficient on the price variable moves from positive to negative between the regression in (a) and the regression in (c), the regression without persincome does appear to suffer from omitted variable bias. The bias is upward (positive) because persincome has a positive effect on quantgas and is positively correlated with pricegas.
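
The size and sign of the bias can be checked against the two sets of estimates. For OLS on the same sample, the short-regression slope equals the long-regression slope plus the income coefficient times the slope from regressing persincome on pricegas. The sketch below uses only the reported (rounded) coefficients; the auxiliary slope `delta` is backed out from them rather than estimated directly, so treat it as approximate.

```python
b_short = 7.8252     # pricegas coefficient without persincome (part a)
b_long = -6.8606     # pricegas coefficient with persincome (part c)
b_income = 0.3188    # persincome coefficient (part c)

bias = b_short - b_long      # omitted variable bias in the short regression
delta = bias / b_income      # implied slope of persincome on pricegas
print(round(bias, 4), round(delta, 2))
```

Both `bias` and `delta` are positive, which matches the upward bias described above.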

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.f.**
Give reasons why you should suspect that the gasoline price would be correlated with the error term even after you introduced personal income into the regression. Would the monthly sales of autos in the U.S. (`carsales`) serve as a good instrument for the price of gas? Explain.

<!--
BEGIN QUESTION
name: q1_f
manual: true
-->

There are likely other factors, not included in this regression model, that affect the demand for gasoline besides the gasoline price and personal income. For instance, automakers may improve fuel efficiency in response to changes in the cost of driving, which certainly includes the cost of gasoline. In addition, because a separate supply equation determines the intersection of supply and demand, price and quantity are determined simultaneously. carsales could potentially serve as a good instrument for the price of gas since it is correlated with the price of gasoline, but is not endogenously determined by the demand for gasoline.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.g.**
Estimate the first stage of a two stage least squares estimation by regressing price of gasoline on the sales of cars. Also include personal income. Perform a test that determines whether car sales is a “strong instrument.”

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_g
manual: true
-->

In [21]:
x_1g = gas['pricegas']
Z_1g = sm.add_constant(gas[['carsales', 'persincome']])
model_1g = sm.OLS(x_1g, Z_1g)
results_1g = model_1g.fit(cov_type='HC1')
results_1g.summary()

0,1,2,3
Dep. Variable:,pricegas,R-squared:,0.308
Model:,OLS,Adj. R-squared:,0.303
Method:,Least Squares,F-statistic:,43.63
Date:,"Mon, 03 May 2021",Prob (F-statistic):,2.61e-17
Time:,08:35:10,Log-Likelihood:,-1245.0
No. Observations:,296,AIC:,2496.0
Df Residuals:,293,BIC:,2507.0
Df Model:,2,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,162.2362,10.132,16.013,0.000,142.378,182.094
carsales,-6.3378,0.957,-6.624,0.000,-8.213,-4.463
persincome,0.0023,0.001,3.788,0.000,0.001,0.003

0,1,2,3
Omnibus:,10.733,Durbin-Watson:,0.181
Prob(Omnibus):,0.005,Jarque-Bera (JB):,6.829
Skew:,0.22,Prob(JB):,0.0329
Kurtosis:,2.4,Cond. No.,55400.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.h.**
Explain.

<!--
BEGIN QUESTION
name: q1_h
manual: true
-->

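From the first stage regression results, the coefficient on carsales has a z-statistic of -6.624, so the F-statistic for excluding this single instrument is approximately $(-6.624)^2 \approx 43.9$ with a p-value of almost zero. (The reported overall F-statistic of 43.63 tests carsales and persincome jointly, so it is not the instrument-strength test itself.) Since F>10, we can reject the null hypothesis that car sales is a weak instrument.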
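
With a single instrument, the first-stage F-statistic for that instrument is the square of its test statistic, so the rule-of-thumb check can be sketched directly from the reported output. The threshold of 10 is the usual rule of thumb; the variable names are illustrative.

```python
z_carsales = -6.624              # z-statistic on carsales in the first stage
first_stage_F = z_carsales ** 2  # F-statistic for excluding the one instrument
is_strong = first_stage_F > 10   # rule-of-thumb threshold for a strong instrument
print(round(first_stage_F, 2), is_strong)
```

With the fitted first-stage results in scope, the same restriction could also be tested with the statsmodels `f_test` method, for example `results_1g.f_test("carsales = 0")`.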

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.i.**
Can you suggest another instrument that is likely to be a better instrument than car sales?

<!--
BEGIN QUESTION
name: q1_i
manual: true
-->

A good instrument should have both instrument relevance and instrument exogeneity. Because we are looking at the demand equation of a supply-demand model, we can choose supply factors that are exogeneous in the supply equation and are not determined by the demand equation. For example, if we can obtain data on the available amount of major oil reserves in the world, we are able to get a variable that affects the supply of gasoline but is not related with the market demand for gasoline among the drivers.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.j.**
Now perform the second stage of the TSLS estimation and report any change in the size of the coefficient on gasoline price as a result of using the instrumental variable.

*Hint: `results.fittedvalues` will give you an array of the $\hat y$ values.*

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_j
manual: true
-->

In [22]:
gas['pricegas_hat'] = results_1g.fittedvalues
y_1j = gas['quantgas']
X_1j = sm.add_constant(gas[['pricegas_hat', 'persincome']])
model_1j = sm.OLS(y_1j, X_1j)
results1j = model_1j.fit(cov_type='HC1')
results1j.summary()

0,1,2,3
Dep. Variable:,quantgas,R-squared:,0.751
Model:,OLS,Adj. R-squared:,0.749
Method:,Least Squares,F-statistic:,415.2
Date:,"Mon, 03 May 2021",Prob (F-statistic):,3.11e-86
Time:,08:35:15,Log-Likelihood:,-2157.8
No. Observations:,296,AIC:,4322.0
Df Residuals:,293,BIC:,4333.0
Df Model:,2,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,7399.7376,402.835,18.369,0.000,6610.196,8189.279
pricegas_hat,-14.9491,3.923,-3.810,0.000,-22.639,-7.260
persincome,0.3515,0.016,21.871,0.000,0.320,0.383

0,1,2,3
Omnibus:,7.832,Durbin-Watson:,0.794
Prob(Omnibus):,0.02,Jarque-Bera (JB):,9.249
Skew:,0.255,Prob(JB):,0.00981
Kurtosis:,3.7,Cond. No.,76300.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.k.**
Explain.

<!--
BEGIN QUESTION
name: q1_k
manual: true
-->

We can see that the coefficient on gasoline price is now -14.949, which is more than twice the magnitude of the OLS estimate of -6.86 and implies more price-sensitive demand. Provided carsales is a valid instrument, this should be a more accurate estimate of the coefficient, since the part of the gasoline price that is correlated with the error term is removed by using the instrumental variable.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.l.**
Is the TSLS estimate of the price coefficient statistically significant? Do you have any reason to doubt the reported values of the standard errors from the second stage? Explain.

<!--
BEGIN QUESTION
name: q1_l
manual: true
-->

The TSLS estimate of the price coefficient is statistically significant, with a z-statistic of -3.81 and a p-value of virtually zero.
However, there are reasons to doubt the reported standard errors from the second stage. The second-stage OLS treats the fitted values pricegas_hat as if they were ordinary exogenous data: it computes residuals using pricegas_hat rather than the actual pricegas, and it ignores that pricegas_hat was itself estimated in the first stage. The reported second-stage standard errors are therefore incorrect, and the TSLS estimator is in general less efficient than the OLS estimator.
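
The problem with the second-stage standard errors can be illustrated on synthetic data. The sketch below is not the gasoline data: the data-generating process, sample size, and all variable names are assumptions made for illustration. It runs the two stages by hand with NumPy and compares heteroskedasticity-robust (HC0) standard errors built from the naive residuals (using the fitted regressor) with those built from the correct TSLS residuals (using the actual regressor).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)                             # instrument
u = rng.normal(size=n)                             # structural error
x = 1.0 + 0.8 * z + 0.5 * u + rng.normal(size=n)   # endogenous regressor
y = 2.0 - 1.5 * x + u                              # true slope is -1.5

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

# First stage: project the regressors on the instruments.
pi_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)
X_hat = Z @ pi_hat

# Second stage: regress y on the fitted regressors.
beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

resid_naive = y - X_hat @ beta    # what a second-stage OLS routine uses
resid_correct = y - X @ beta      # TSLS residuals use the actual regressor

bread = np.linalg.inv(X_hat.T @ X_hat)

def hc0_se(resid):
    """Sandwich standard errors for the given residual vector."""
    meat = X_hat.T @ (X_hat * resid[:, None] ** 2)
    return np.sqrt(np.diag(bread @ meat @ bread))

se_naive = hc0_se(resid_naive)
se_correct = hc0_se(resid_correct)
print(beta, se_naive, se_correct)
```

The point estimates are the same either way; only the residuals, and therefore the standard errors, differ. Dedicated IV routines apply the corrected residuals automatically.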

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.m.**
Suppose you were instead interested in studying how the supply of gas is influenced by its price. Would you feel comfortable regressing the quantity of gas produced on its price? Why?

<!--
BEGIN QUESTION
name: q1_m
manual: true
-->

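No, we would not feel comfortable regressing the quantity of gas produced on its price, since there is still simultaneity bias: price and quantity are jointly determined by the intersection of supply and demand, so price is correlated with the error term in the supply equation.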

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.n.**
Also included in the dataset is the BLS monthly price index for consumer purchases of “transportation services” over the same sample period `transindex`. Perform TSLS estimation using this price index as an instrument. Evaluate the results of the first and second stages.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_n
manual: true
-->

In [4]:
y_1n_a = gas['pricegas']
X_1n_a = sm.add_constant(gas[['transindex', 'persincome']])
model_1n_a = sm.OLS(y_1n_a, X_1n_a)
results_1n_a = model_1n_a.fit(cov_type = 'HC1')
gas['X_hat'] = results_1n_a.fittedvalues
y_1n_b = gas['quantgas']
X_1n_b = sm.add_constant(gas[['X_hat', 'persincome']])
model_1n_b = sm.OLS(y_1n_b, X_1n_b)
results_1n_b = model_1n_b.fit(cov_type = 'HC1')
results_1n_b.summary()

0,1,2,3
Dep. Variable:,quantgas,R-squared:,0.822
Model:,OLS,Adj. R-squared:,0.821
Method:,Least Squares,F-statistic:,628.6
Date:,"Mon, 03 May 2021",Prob (F-statistic):,1.01e-106
Time:,08:02:58,Log-Likelihood:,-2107.7
No. Observations:,296,AIC:,4221.0
Df Residuals:,293,BIC:,4233.0
Df Model:,2,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,8632.8412,269.222,32.066,0.000,8105.175,9160.507
X_hat,-27.9567,2.730,-10.241,0.000,-33.307,-22.606
persincome,0.4040,0.014,29.901,0.000,0.378,0.430

0,1,2,3
Omnibus:,3.557,Durbin-Watson:,1.052
Prob(Omnibus):,0.169,Jarque-Bera (JB):,3.278
Skew:,-0.248,Prob(JB):,0.194
Kurtosis:,3.14,Cond. No.,67700.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.o.**
Explain.

<!--
BEGIN QUESTION
name: q1_o
manual: true
-->

In the TSLS estimation using the price index as an instrument, the result of the first stage shows that transindex is a strong instrument. The second stage gives a price coefficient of -27.96, which is larger in magnitude than both the estimate using carsales (-14.95) and the OLS estimate (-6.86), implying more price-sensitive demand.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.p.**
Assume that you are told that at least one of the instruments above is not exogenous (it could be both). Based on your empirical results using these data, decide what you consider the “best” estimate of the price coefficient. It doesn't have to be one of the above instruments. Explain your reasoning.

<!--
BEGIN QUESTION
name: q1_p
manual: true
-->

It is very likely that neither of the instruments above is exogenous, because an increase in transindex or carsales plausibly shifts the demand for gasoline directly. Between the two, the estimate based on carsales (-14.95) is preferable to the one based on transindex (-27.96), as it implies less elastic demand. However, neither of them is a good estimate, and the “best” estimate would come from an instrument on the supply side.

<!-- END QUESTION -->



---

## Problem 2. Experiments

Senior management at Ctrip, China's largest travel agency, is interested in allowing their Shanghai call center employees to work from home (telecommute). Allowing telecommuting may not only reduce office rental costs but it may also lower the high attrition rates the firm was experiencing by saving the employees from long commutes. However, management is also worried that employees may be less productive if they telecommute. To determine the effects of telecommuting on productivity, Ctrip decided to run an experiment wherein participants were allowed to work from home for several days over a 9 month period. They asked employees in the airfare and hotel departments whether they would be interested in volunteering for this experiment, and not all employees agreed to participate. Each employee who volunteered for the experiment was then assigned a random share of work days over the 9 months that they must work from home. The file `ctrip.csv` contains data from all 994 employees of Ctrip. 

| Variable | Description | Units |
|-|-|-|
| **personid** | person ID |  |
| **age** | age | years |
| **tenure** | tenure at Ctrip | months |
| **grosswage** | monthly gross salary | 1000s of CNY |
| **children** | whether person has children |  |
| **bedroom** | whether person has independent bedroom to work in |  |
| **commute** | daily commute in minutes | minutes |
| **men** | whether person is male |  |
| **married** | whether person is married |  |
| **volunteer** | whether person volunteers for experiment (work from home) |  |
| **high_educ** | tertiary education and above |  |
| **WFHShare** | share of work days worked from home during experiment |  |
| **calls** | average number of calls taken per week during experiment |  |

In [5]:
ctrip = pd.read_csv("ctrip.csv")
ctrip.head()

Unnamed: 0,personid,age,tenure,grosswage,children,bedroom,commute,men,married,volunteer,high_educ,WFHShare,calls
0,3224,30.0,113.0,3.824882,no,no,40.0,1.0,0.0,0.0,0.0,,
1,3906,33.0,96.0,2.737547,yes,yes,180.0,0.0,1.0,1.0,0.0,0.0,342.0
2,4118,31.0,94.0,3.46038,yes,no,180.0,0.0,1.0,0.0,1.0,,
3,4122,30.0,94.0,4.096246,no,no,180.0,0.0,0.0,0.0,0.0,,
4,4164,28.0,25.0,7.2532,no,yes,65.0,0.0,1.0,1.0,1.0,0.0,172.0


<!-- BEGIN QUESTION -->

**Question 2.a.**
What percentage of employees volunteered to participate in the experiment?

*Hint: Check out the [`Series.value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) function.*

<!--
BEGIN QUESTION
name: q2_a
manual: true
-->

In [6]:
ctrip['volunteer'].value_counts()[1] / 994

0.506036217303823

Around 50.6% of the employees volunteered in this experiment.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.b.i.**
Use the variables `commute` as a dependent variable in a bivariate linear regression where `volunteer` is the explanatory variable.

<!--
BEGIN QUESTION
name: q2_b_1
manual: true
-->

In [8]:
y_2b = ctrip['commute']
x_2b = sm.add_constant(ctrip['volunteer'])
ols_model = sm.OLS(y_2b, x_2b)
results_2b = ols_model.fit(cov_type = 'HC1')
results_2b.summary()

0,1,2,3
Dep. Variable:,commute,R-squared:,0.011
Model:,OLS,Adj. R-squared:,0.01
Method:,Least Squares,F-statistic:,11.46
Date:,"Mon, 03 May 2021",Prob (F-statistic):,0.000739
Time:,08:12:24,Log-Likelihood:,-5413.0
No. Observations:,994,AIC:,10830.0
Df Residuals:,992,BIC:,10840.0
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,74.4656,2.316,32.152,0.000,69.926,79.005
volunteer,12.0318,3.554,3.385,0.001,5.066,18.998

0,1,2,3
Omnibus:,122.652,Durbin-Watson:,1.591
Prob(Omnibus):,0.0,Jarque-Bera (JB):,167.975
Skew:,0.993,Prob(JB):,3.35e-37
Kurtosis:,3.331,Cond. No.,2.63


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.b.ii.**
Interpret the coefficient on `volunteer` and comment on its statistical significance.

<!--
BEGIN QUESTION
name: q2_b_2
manual: true
-->

The coefficient on volunteer is 12.0318, which means that the average daily commute time for people
who volunteered for the experiment is 12.0318 minutes longer than those who were not willing to
volunteer. It is statistically significant since the p-value is 0.001, which is very close to zero and is
smaller than 0.01 and 0.05.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.c.i.**
Use the variable `tenure` as a dependent variable in a bivariate linear regression where `volunteer` is the explanatory variable.

<!--
BEGIN QUESTION
name: q2_c_1
manual: true
-->

In [9]:
y_2c = ctrip['tenure']
x_2c = sm.add_constant(ctrip['volunteer'])
ols_model2 = sm.OLS(y_2c, x_2c)
results2c = ols_model2.fit(cov_type = 'HC1')
results2c.summary()

0,1,2,3
Dep. Variable:,tenure,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.006
Method:,Least Squares,F-statistic:,7.451
Date:,"Mon, 03 May 2021",Prob (F-statistic):,0.00645
Time:,08:13:12,Log-Likelihood:,-4431.3
No. Observations:,994,AIC:,8867.0
Df Residuals:,992,BIC:,8876.0
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,26.8422,0.972,27.624,0.000,24.938,28.747
volunteer,-3.6235,1.327,-2.730,0.006,-6.225,-1.022

0,1,2,3
Omnibus:,97.416,Durbin-Watson:,0.099
Prob(Omnibus):,0.0,Jarque-Bera (JB):,124.805
Skew:,0.856,Prob(JB):,7.93e-28
Kurtosis:,3.292,Cond. No.,2.63


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.c.ii.**
Interpret the coefficient on `volunteer` and comment on its statistical significance.

<!--
BEGIN QUESTION
name: q2_c_2
manual: true
-->

The coefficient on volunteer is -3.6235, which means that people who volunteered have average
tenure that is 3.6235 months shorter than those who did not. The coefficient is still statistically
significant with a p-value of 0.006, which is smaller than 0.01 and 0.05.
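
The reported p-values for the two `volunteer` coefficients can be reproduced from their z-statistics using the standard normal distribution. The sketch below uses only the standard library; `two_sided_p` is a helper name introduced here for illustration.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

p_commute = two_sided_p(3.385)    # volunteer coefficient, commute regression
p_tenure = two_sided_p(-2.730)    # volunteer coefficient, tenure regression
print(round(p_commute, 3), round(p_tenure, 3))
```

Both values fall below 0.01, consistent with the significance statements in the two answers.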

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.d.i.**
Impressed by your recent econometrics training, Ctrip hires you as a consultant to analyze the results from their experiment. To begin with, you estimate a bivariate linear regression model of the productivity of workers, measured by the log of the average number of calls taken per week (call this variable `ln_calls`), on the variable `WFHShare` (work from home share). 

*Hint: Add the argument [`missing='drop'`](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html) when constructing your OLS model to drop the missing entries.*

<!--
BEGIN QUESTION
name: q2_d_1
manual: true
-->

In [10]:
ctrip['ln_calls'] = np.log(ctrip['calls'])
y_2d = ctrip['ln_calls']
x_2d = sm.add_constant(ctrip['WFHShare'])
ols_model3 = sm.OLS(y_2d, x_2d, missing = 'drop')
results2d = ols_model3.fit(cov_type = 'HC1')
results2d.summary()

0,1,2,3
Dep. Variable:,ln_calls,R-squared:,0.163
Model:,OLS,Adj. R-squared:,0.161
Method:,Least Squares,F-statistic:,142.6
Date:,"Mon, 03 May 2021",Prob (F-statistic):,4.23e-29
Time:,08:14:07,Log-Likelihood:,-517.61
No. Observations:,503,AIC:,1039.0
Df Residuals:,501,BIC:,1048.0
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,5.4442,0.062,87.180,0.000,5.322,5.567
WFHShare,0.9753,0.082,11.942,0.000,0.815,1.135

0,1,2,3
Omnibus:,396.98,Durbin-Watson:,1.82
Prob(Omnibus):,0.0,Jarque-Bera (JB):,8757.799
Skew:,-3.269,Prob(JB):,0.0
Kurtosis:,22.368,Cond. No.,4.07


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.d.ii.**
Interpret the regression coefficient on `WFHShare` in words. Is the effect statistically significant?

<!--
BEGIN QUESTION
name: q2_d_2
manual: true
-->

The coefficient on WFHShare is 0.9753. Because the dependent variable is in logs and WFHShare is a share between 0 and 1, a one percentage point (0.01) increase in the share of work days worked from home during the experiment is associated with an increase of approximately 0.98% in the average number of calls taken per week. The effect is statistically significant since the p-value is 0.000 (smaller than 0.01 or 0.05).
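
Since this is a log-level regression, the coefficient converts to a percentage change in calls through the exponential function. A minimal sketch using the reported coefficient (the variable names are illustrative):

```python
import math

coef = 0.9753   # coefficient on WFHShare in the ln_calls regression

# Effect of a one percentage point (0.01) increase in the share.
approx_pct_per_point = 100 * coef * 0.01
exact_pct_per_point = 100 * (math.exp(coef * 0.01) - 1)

# Effect of moving from no days at home (0) to all days at home (1).
exact_pct_full = 100 * (math.exp(coef) - 1)
print(round(approx_pct_per_point, 2), round(exact_pct_per_point, 2), round(exact_pct_full, 1))
```

For small changes the approximation and the exact figure nearly coincide; for the full 0-to-1 change the exact percentage effect is much larger than 97.53%, which is why the coefficient should not be read as a percentage for large changes.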

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.e.**
Has the Ctrip company achieved the ideal of a randomized controlled experiment, so that we can view the estimated effects of working from home on productivity in causal terms?

<!--
BEGIN QUESTION
name: q2_e
manual: true
-->

No, Ctrip has not achieved an ideal randomized controlled experiment, because participation was voluntary: workers who did not volunteer might respond very differently to working from home. From the previous OLS results we know that volunteers have, on average, longer commutes and shorter tenure than workers who did not volunteer. Therefore, even though the company randomly assigned the share of work days over the 9 months that volunteers worked from home, the estimated effect applies to volunteers and may not generalize to all employees.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.g.i.**
Create a dummy variable called `longcommute` which is equal to one if the employee has a commute of greater than or equal to 120 minutes (i.e. 2 hours) and add it as a column to the `ctrip` DataFrame.

*Hint: First create a boolean column for `longcommute` then cast it into integers using [`Series.astype(int)`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html).*

<!--
BEGIN QUESTION
name: q2_g_1
manual: true
-->

In [11]:
ctrip['longcommute'] = (ctrip['commute'] >= 120).astype(int)
ctrip

Unnamed: 0,personid,age,tenure,grosswage,children,bedroom,commute,men,married,volunteer,high_educ,WFHShare,calls,ln_calls,longcommute
0,3224,30.0,113.0,3.824882,no,no,40.0,1.0,0.0,0.0,0.0,,,,0
1,3906,33.0,96.0,2.737547,yes,yes,180.0,0.0,1.0,1.0,0.0,0.000000,342.0,5.834811,1
2,4118,31.0,94.0,3.460380,yes,no,180.0,0.0,1.0,0.0,1.0,,,,1
3,4122,30.0,94.0,4.096246,no,no,180.0,0.0,0.0,0.0,0.0,,,,1
4,4164,28.0,25.0,7.253200,no,yes,65.0,0.0,1.0,1.0,1.0,0.000000,172.0,5.147494,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
989,48138,26.0,0.0,1.125520,no,no,40.0,1.0,0.0,0.0,1.0,,,,0
990,48372,24.0,0.0,0.620690,no,yes,50.0,0.0,0.0,1.0,0.0,0.676768,505.0,6.224558,0
991,48378,18.0,0.0,0.620690,no,yes,160.0,1.0,0.0,0.0,0.0,,,,1
992,48382,22.0,0.0,0.620690,no,no,80.0,1.0,0.0,0.0,0.0,,,,0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.g.ii.**
How would you expect that including `longcommute` as a second explanatory variable would alter the coefficient on `WFHShare` – would it increase, decrease, or stay the same? Explain.

<!--
BEGIN QUESTION
name: q2_g_2
manual: true
-->

I would expect the coefficient on WFHShare to stay the same after adding longcommute as a second explanatory variable. WFHShare, which is the share of work days worked from home during the experiment, is randomly assigned by the company and is therefore not correlated with longcommute. Thus, omitting longcommute causes no omitted variable bias, and the coefficient will not change after including it.
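
This argument can be illustrated with a small simulation. The sketch below uses synthetic data, not the Ctrip data: the sample size, coefficients, and variable names are assumptions made for illustration. A treatment share is drawn independently of a dummy regressor, and the coefficient on the share is compared with and without the dummy in the regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
share = rng.uniform(0, 1, size=n)     # randomly assigned treatment intensity
longc = rng.integers(0, 2, size=n)    # dummy drawn independently of `share`
y = 5.0 + 1.0 * share - 0.3 * longc + rng.normal(scale=0.5, size=n)

X_short = np.column_stack([np.ones(n), share])
X_long = np.column_stack([np.ones(n), share, longc])

b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
b_long, *_ = np.linalg.lstsq(X_long, y, rcond=None)
print(b_short[1], b_long[1])
```

The two slope estimates differ only by sampling noise, even though the dummy has a real effect on the outcome, because it is uncorrelated with the randomly assigned share.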

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.h.i.** 
Management believes that commute (the travel time from home to office and back) is an important determinant of a worker’s productivity. They have two hypotheses:

1. Employees who face a longer commute time are generally less productive than workers who have shorter commute times.
2. The effect of `WFHShare` on productivity is larger for those who face a longer commute.

Estimate a regression of `ln_calls`, with `WFHShare`, `longcommute`, and their interaction (call it `WFHShareXlongcommute`) as explanatory variables.

*Hint: Once again you will need to add the argument [`missing='drop'`](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html) when constructing your OLS model to drop the missing entries.*

<!--
BEGIN QUESTION
name: q2_h_1
manual: true
-->

In [13]:
ctrip['WFHShareXlongcommute'] = ctrip['WFHShare'] * ctrip['longcommute']
y_2h = ctrip['ln_calls']
x_2h = sm.add_constant(ctrip[['WFHShare', 'longcommute', 'WFHShareXlongcommute']])
ols_model4 = sm.OLS(y_2h, x_2h, missing = 'drop')
results2h = ols_model4.fit(cov_type = 'HC1')
results2h.summary()

0,1,2,3
Dep. Variable:,ln_calls,R-squared:,0.179
Model:,OLS,Adj. R-squared:,0.174
Method:,Least Squares,F-statistic:,179.4
Date:,"Mon, 03 May 2021",Prob (F-statistic):,6.79e-79
Time:,08:21:06,Log-Likelihood:,-512.63
No. Observations:,503,AIC:,1033.0
Df Residuals:,499,BIC:,1050.0
Df Model:,3,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,5.4398,0.095,57.061,0.000,5.253,5.627
WFHShare,0.8641,0.125,6.926,0.000,0.620,1.109
longcommute,0.0162,0.103,0.158,0.875,-0.186,0.218
WFHShareXlongcommute,0.3300,0.137,2.415,0.016,0.062,0.598

0,1,2,3
Omnibus:,392.423,Durbin-Watson:,1.841
Prob(Omnibus):,0.0,Jarque-Bera (JB):,8736.706
Skew:,-3.207,Prob(JB):,0.0
Kurtosis:,22.383,Cond. No.,9.76


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.h.ii.** 
Do your results support hypothesis (i), hypothesis (ii), both hypotheses, or neither one? Explain.

<!--
BEGIN QUESTION
name: q2_h_2
manual: true
-->

The results of the OLS regression above do not support hypothesis (i): the coefficient on longcommute is positive (0.0162) rather than negative and has a p-value of 0.875 (larger than 0.05, so it is not statistically significant at either the 1% or the 5% level).
In contrast, hypothesis (ii) is supported by my results. The coefficient on WFHShareXlongcommute is positive (0.3300) with a p-value of 0.016, which indicates statistical significance at the 5% level.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.i.** 
If the coefficient on `longcommute` is statistically insignificant, would this lead you to drop `longcommute` from the regression model in part (h)? Explain your answer.

<!--
BEGIN QUESTION
name: q2_i
manual: true
-->

I would not drop longcommute from the regression model in part (h) even if its coefficient is statistically insignificant. Without longcommute as an explanatory variable, the interaction term would also pick up any difference in the level of productivity between long and short commuters, so I would not be able to cleanly test whether the effect of WFHShare on productivity is stronger for workers who have a longer commute (hypothesis ii). The meaning of each coefficient from the OLS regression changes if we add or drop any of the explanatory variables.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.j.** 
Using the regression in part (h) and without estimating any other regression, write the estimated equation for the simple regression of `ln_calls` on `WFHShare` using only data for those with a commute of fewer than 120 minutes. You must show your solution to obtain full credit.

<!--
BEGIN QUESTION
name: q2_j
manual: true
-->

For workers with a commute of fewer than 120 minutes:
$$ \widehat{ln\_calls} = 5.4398 + 0.8641 \times WFHShare $$
For these workers, longcommute and WFHShareXlongcommute are both equal to zero, so these terms drop out of the estimated equation for ln_calls on WFHShare.
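
The substitution can be sketched in Python using the coefficients reported in part (h). The helper `implied_line` is a name introduced here for illustration; it returns the intercept and slope of the line implied for a given value of the dummy.

```python
coefs = {
    "const": 5.4398,
    "WFHShare": 0.8641,
    "longcommute": 0.0162,
    "WFHShareXlongcommute": 0.3300,
}

def implied_line(longcommute):
    """Intercept and slope of ln_calls on WFHShare for a given dummy value."""
    intercept = coefs["const"] + coefs["longcommute"] * longcommute
    slope = coefs["WFHShare"] + coefs["WFHShareXlongcommute"] * longcommute
    return round(intercept, 4), round(slope, 4)

short_commute = implied_line(0)   # commute under 120 minutes
long_commute = implied_line(1)    # commute of 120 minutes or more
print(short_commute, long_commute)
```

Setting the dummy to zero reproduces the equation above; setting it to one adds the longcommute coefficient to the intercept and the interaction coefficient to the slope.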

<!-- END QUESTION -->



---

## Problem 3. Natural Experiments

“Sin taxes” have not been the only way in which governments have attempted to reduce the consumption of cigarettes. In 1970, the U.S. passed a law that banned the advertising of cigarettes on radio and television. The ban took effect in 1971. The accompanying data file `cigads.csv` contains data on annual per capita consumption of tobacco measured in terms of “Annual grams of Tobacco Sold per Adult (15+)” for both the U.S. and Canada, 1968-1990 (`CIGSPC`). Also included in that file is a measure of the price of cigarettes given by the “Real Price of 20 grams Cents” for both countries (`PRICE`).

In [14]:
cigads = pd.read_csv("cigads.csv")
cigads.head()

Unnamed: 0,YEAR,COUNTRY,CIGSPC,PRICE
0,1964,CAN,3975,128
1,1965,CAN,4095,128
2,1966,CAN,4158,127
3,1967,CAN,4168,127
4,1968,CAN,3971,137


<!-- BEGIN QUESTION -->

**Question 3.a.**
Treating the ban in cigarette advertising as a quasi-experiment, perform a differences-in-differences analysis of the effect of the ban on the consumption of tobacco. Fill in the table that indicates the conclusion of your analysis.

The top left box with work has been done for you.

<!--
BEGIN QUESTION
name: q3_a
manual: true
-->

In [15]:
# Mean of annual grams of Tobacco Sold per Adult (15+) across the pre-treatment periods in Canada
pre_period = cigads[cigads['YEAR'] <= 1970]
np.mean(pre_period[pre_period['COUNTRY'] == "CAN"]['CIGSPC'])

4043.1428571428573

<!-- END QUESTION -->



|                  | Before  | After   | After - Before |
|:---------------- |:------- |:------- |:-------------- |
| **Canada**       | 4043.14 | 3601.80 | -441.34 |
| **USA**          | 4280.71 | 3804.05 | -476.66 |
| **USA - Canada** | 237.57  | 202.25  | -35.32 |

*Your explanation here*

The differences-in-differences estimate, (USA After - Before) minus (Canada After - Before), is -35.32. On average, annual tobacco consumption per adult in the U.S. fell by about 35.32 grams relative to Canada after the ban took effect.
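
The table above can be built directly from group means. The following is a minimal sketch on a small made-up panel (the numbers and the `USA` country label are illustrative assumptions, not values from `cigads.csv`); it computes the mean for each country-by-period cell and then differences twice.

```python
import pandas as pd

# Hypothetical toy data: two countries, two years before and two after a 1971 ban.
toy = pd.DataFrame({
    "YEAR":    [1969, 1970, 1971, 1972, 1969, 1970, 1971, 1972],
    "COUNTRY": ["CAN", "CAN", "CAN", "CAN", "USA", "USA", "USA", "USA"],
    "CIGSPC":  [100.0, 110.0, 90.0, 100.0, 120.0, 130.0, 95.0, 105.0],
})
toy["period"] = (toy["YEAR"] > 1970).map({False: "Before", True: "After"})

# Mean consumption in each country-by-period cell.
means = toy.pivot_table(index="COUNTRY", columns="period",
                        values="CIGSPC", aggfunc="mean")
means = means[["Before", "After"]]
means["After - Before"] = means["After"] - means["Before"]

# Differences-in-differences: change in the treated country (USA)
# minus change in the control country (CAN).
did = means.loc["USA", "After - Before"] - means.loc["CAN", "After - Before"]
print(means)
print("DiD estimate:", did)
```

In this toy panel consumption falls by 10 in Canada and by 25 in the U.S., so the estimate is -15.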

<!-- BEGIN QUESTION -->

**Question 3.b.i.**
Now create a dummy variable `post` indicating whether the ban was in effect in a given year, plus a dummy variable `treat` that equals 1 for the treatment group (i.e. the U.S.) and 0 for the control group (i.e. Canada). Regress tobacco consumption on these two dummies and on the interaction between the two (you can call this `treatpost`).

*Hint: Once again you will need to first create boolean columns and then cast them into integers using [`Series.astype(int)`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html).*

<!--
BEGIN QUESTION
name: q3_b_1
manual: true
-->

In [16]:
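# post = 1 for years in which the ban was in effect (1971 onward)
cigads['post'] = (cigads['YEAR'] > 1970).astype(int)
# NOTE: treat = 1 for Canada here, although the question defines the U.S. as the
# treatment group, so the signs on treat and treatpost are the reverse of that coding
cigads['treat'] = (cigads['COUNTRY'] == 'CAN').astype(int)
cigads['treatpost'] = cigads['post']*cigads['treat']
y_3b = cigads['CIGSPC']
X_3b = sm.add_constant(cigads[['post', 'treat', 'treatpost']])
model_3b = sm.OLS(y_3b, X_3b)
results_3b = model_3b.fit(cov_type = 'HC1')  # heteroskedasticity-robust standard errors
results_3b.summary()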

0,1,2,3
Dep. Variable:,CIGSPC,R-squared:,0.243
Model:,OLS,Adj. R-squared:,0.198
Method:,Least Squares,F-statistic:,13.82
Date:,"Mon, 03 May 2021",Prob (F-statistic):,1.09e-06
Time:,08:26:49,Log-Likelihood:,-400.28
No. Observations:,54,AIC:,808.6
Df Residuals:,50,BIC:,816.5
Df Model:,3,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4280.7143,50.433,84.880,0.000,4181.868,4379.560
post,-476.6643,102.532,-4.649,0.000,-677.623,-275.705
treat,-237.5714,63.652,-3.732,0.000,-362.328,-112.815
treatpost,35.3214,164.267,0.215,0.830,-286.636,357.279

0,1,2,3
Omnibus:,5.878,Durbin-Watson:,0.275
Prob(Omnibus):,0.053,Jarque-Bera (JB):,5.77
Skew:,-0.797,Prob(JB):,0.0559
Kurtosis:,2.843,Cond. No.,9.69


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.b.ii.**
How do your results compare to your diffs-in-diffs estimator?

<!--
BEGIN QUESTION
name: q3_b_2
manual: true
-->

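The coefficient on treatpost is 35.32 (p = 0.830), which has the same magnitude as the diffs-in-diffs estimator of -35.32 but the opposite sign. The sign differs because treat was coded as 1 for Canada in the regression above; with treat equal to 1 for the U.S., as the question specifies, the coefficient on treatpost would be -35.32, exactly the diffs-in-diffs estimate. Under either coding the estimate is not statistically different from zero.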
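
One way to check this equivalence, and to see how the coding of the treatment dummy sets the sign, is to fit the saturated regression on a toy data set. The sketch below uses made-up numbers and plain least squares via `np.linalg.lstsq` rather than `statsmodels`, so it reports coefficients only, with no standard errors.

```python
import numpy as np

# Hypothetical toy data (not from cigads.csv): consumption by country and period.
usa  = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 1 = U.S. (treated), 0 = Canada (control)
post = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # 1 = ban in effect
y    = np.array([120.0, 130.0, 95.0, 105.0, 100.0, 110.0, 90.0, 100.0])

def interaction_coef(treat):
    """OLS coefficient on treat*post in y ~ const + post + treat + treat*post."""
    X = np.column_stack([np.ones_like(y), post, treat, treat * post])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[3]

def cell(t, p):
    """Mean outcome for the cell with usa == t and post == p."""
    return y[(usa == t) & (post == p)].mean()

# DiD from the four cell means, with the U.S. as the treatment group.
did = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

coef_usa_treated = interaction_coef(usa)       # matches the DiD estimate
coef_can_treated = interaction_coef(1 - usa)   # same magnitude, opposite sign
print(did, coef_usa_treated, coef_can_treated)
```

Because the model has one parameter per cell, the interaction coefficient reproduces the difference of differences in cell means exactly, and swapping which country is labeled as treated only flips its sign.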

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.c.i.**
Finally, recognizing that price does also affect consumption, you introduce the price variable into the regression in (b). 

<!--
BEGIN QUESTION
name: q3_c_1
manual: true
-->

In [17]:
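# Add PRICE as a control to the regression from (b); the outcome y_3b is unchanged
X_3c = sm.add_constant(cigads[['post', 'treat', 'treatpost', 'PRICE']])
model_3c = sm.OLS(y_3b, X_3c)
results_3c = model_3c.fit(cov_type = 'HC1')  # heteroskedasticity-robust standard errors
results_3c.summary()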

0,1,2,3
Dep. Variable:,CIGSPC,R-squared:,0.854
Model:,OLS,Adj. R-squared:,0.842
Method:,Least Squares,F-statistic:,72.98
Date:,"Mon, 03 May 2021",Prob (F-statistic):,5.03e-20
Time:,08:31:28,Log-Likelihood:,-355.8
No. Observations:,54,AIC:,721.6
Df Residuals:,49,BIC:,731.5
Df Model:,4,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,5539.0026,106.688,51.918,0.000,5329.899,5748.106
post,-451.1424,70.652,-6.385,0.000,-589.619,-312.666
treat,60.8905,54.984,1.107,0.268,-46.875,168.656
treatpost,259.1679,83.122,3.118,0.002,96.252,422.083
PRICE,-11.8706,0.926,-12.812,0.000,-13.687,-10.055

0,1,2,3
Omnibus:,2.758,Durbin-Watson:,0.402
Prob(Omnibus):,0.252,Jarque-Bera (JB):,2.656
Skew:,0.5,Prob(JB):,0.265
Kurtosis:,2.575,Cond. No.,897.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.c.ii.**
Report your results and compare to those from (b).

<!--
BEGIN QUESTION
name: q3_c_2
manual: true
-->

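Compared with (b), the coefficient on treatpost rises from 35.32 (p = 0.830) to 259.17
(p = 0.002), so it becomes much larger and statistically significant at the 1% level.
The coefficient on PRICE is -11.87 and is itself highly significant, and the R-squared
rises from 0.243 to 0.854. Because treat equals 1 for Canada in these regressions, the
positive coefficient on treatpost means that, holding price constant, U.S. consumption
fell by about 259 grams per adult relative to Canada after the ban.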

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.d.**
Why would you expect that the price of a pack of cigarettes might be correlated with the error term? Note that some economists have argued that the advertising ban reduced competition among cigarette makers by eliminating one dimension on which they compete for customers, which in turn led to higher prices.

<!--
BEGIN QUESTION
name: q3_d
manual: true
-->

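We would expect the price of a pack of cigarettes to be correlated with the error term
because price and quantity are determined simultaneously by supply and demand: a shock
to demand that is captured in the error term also shifts the equilibrium price. In
addition, if the advertising ban reduced competition and raised prices, then price is
itself partly an outcome of the ban, so controlling for it can bias the estimated effect.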
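
The simultaneity argument can be illustrated with a small simulation. The sketch below assumes a linear demand curve with slope -1, a linear supply curve shifted by a cost variable `z`, and independent standard normal shocks; all of these are illustrative assumptions and have nothing to do with the cigarette data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

u = rng.normal(size=n)   # demand shock (the error term in the demand equation)
v = rng.normal(size=n)   # supply shock
z = rng.normal(size=n)   # supply shifter, e.g. a cost measure (hypothetical instrument)

# Demand: q = 10 - 1.0*p + u      Supply: q = 2 + 1.0*p + z + v
# Setting demand equal to supply gives the equilibrium price, which depends on u.
p = (10 - 2 + u - z - v) / 2.0
q = 10 - 1.0 * p + u

def cov(a, b):
    """Sample covariance between two 1-D arrays."""
    return np.cov(a, b)[0, 1]

corr_pu = np.corrcoef(p, u)[0, 1]            # price is correlated with the demand error
ols_slope = cov(q, p) / np.var(p, ddof=1)    # biased estimate of the demand slope
iv_slope = cov(q, z) / cov(p, z)             # uses only price variation driven by z

print("corr(p, u):", corr_pu)
print("OLS slope:", ols_slope)
print("IV slope: ", iv_slope)
```

Under these assumptions the OLS slope is far from the true demand slope of -1 (it is about -1/3 in expectation), while the instrumental-variable ratio is close to -1, because the supply shifter moves price without moving the demand error.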

<!-- END QUESTION -->



---

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()