## Causal Inference
# School of Information, University of Michigan 
## Week 3

### Resources:
- Course Manual, which can be found in Coursera
- [Instrumental Variables & Randomized Encouragement Trials: Driving Engagement of Learners](assets/MediumArticle.pdf)

## Part 1

### Background

Researchers at Coursera want to know whether a certain learning style causes learners to be more engaged and thus more likely to ultimately complete a course.

### Data

The data file lecture3.csv contains 7 variables for 49,808 learners on the online learning platform Coursera. Below are the descriptions of each variable in the data:

- *id*: a unique identifier for each learner
- *paid_enroll*: dummy variable equal to 1 if a learner has paid for enrollment, 0 otherwise
- *prv_wk_nbr*: the most recent course week a learner has completed, as a measure of how far the learner is into the class (e.g. if a learner most recently completed week 2 of a course, this variable is equal to 2)
- *prv_wk_min*: the minutes a learner spent on the platform in the previous week
- *message*: equal to 1 if a learner is in the treatment group (i.e. they received a message), 0 otherwise
- *binge*: equal to 1 if a learner has binged, 0 otherwise (bingeing is defined as completing one week of a course and starting the next week on the same day)
- *complete*: dummy variable that is equal to 1 if a learner completed the next week in the course, 0 otherwise


In [1]:
# Import statements. Run this cell.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy import stats
from linearmodels import IV2SLS

In [2]:
!pip install plotly==5.10.0

Collecting plotly==5.10.0
  Downloading plotly-5.10.0-py2.py3-none-any.whl (15.2 MB)
Collecting tenacity>=6.2.0
  Downloading tenacity-8.1.0-py3-none-any.whl (23 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.10.0 tenacity-8.1.0


In [3]:
#Uploading data for assignment. Run this cell.
data_coursera = pd.read_csv('assets/lecture3.csv')

# Display the first five rows of the dataframe.
data_coursera.head()

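| Unnamed: 0 | id | paid_enroll | prv_wk_nbr | prv_wk_min | message | binge | complete |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 193 | 0 | 1 | 1 |
| 1 | 2 | 0 | 5 | 194 | 0 | 1 | 1 |
| 2 | 3 | 1 | 1 | 45 | 0 | 0 | 1 |
| 3 | 4 | 1 | 4 | 118 | 0 | 0 | 1 |
| 4 | 5 | 0 | 5 | 247 | 0 | 1 | 1 |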


## Questions

We are interested in investigating whether bingeing, defined as completing and starting consecutive weeks of a course on the same day, increases the likelihood of completing the following week in a course.

**Note**: You can refer to the manual for the methods we use in the assignment if you need to. 

**Use the data_coursera dataframe uploaded above to answer the questions below unless otherwise specified.**

**1.** Using robust standard errors in the statsmodels module, regress the variable *complete* on *binge*. Assign the coefficient in front of *binge* to the variable `binge_coeff1_1` and ensure that its data type is float. (Round to four decimal places.) (1 pt)

In [4]:
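# Fit OLS of complete on binge, then recompute the covariance with HC1 robust standard errors.
model = smf.ols(formula='complete ~ binge', data=data_coursera).fit()
robust_model = model.get_robustcov_results(cov_type='HC1')
# Parameters are ordered [Intercept, binge]; position 1 is the binge coefficient.
binge_coeff1_1 = round(float(np.asarray(robust_model.params)[1]), 4)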

In [5]:
# Hidden Tests, checking value of binge_coeff1_1.

**2.** Now run the regression (using robust standard errors) one more time with additional controls: *paid_enroll*, *prv_wk_nbr*, *prv_wk_min*. Assign the coefficient in front of *binge* to the variable `binge_coeff1_2` and ensure that its data type is float. (Round to four decimal places.) (1 pt)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden Tests, checking value of binge_coeff1_2.

When the point estimate of interest (here, the coefficient on *binge*) changes drastically once further covariates are included, that is worrisome for causal inference (recall the regression sensitivity analysis). Intuitively, the positive correlation between bingeing and completion could simply reflect self-selection: more motivated learners are both more likely to binge and more likely to complete.

To overcome this problem, researchers at Coursera ran a randomized encouragement trial. They randomly split their learners into two groups. The treatment group received a message immediately after completing a week of material, encouraging them to start the next week right away (see below). The control group did not receive the message.

<img src="assets/Congratulations.png" alt="Treatment Message" style="width: 500px;"/>
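
The self-selection concern described above can be made concrete with a minimal sketch on made-up counts (not the Coursera data). It assumes a hypothetical unobserved `motivation` trait and a true bingeing effect of 10 percentage points within each motivation group; every number in `cells` is invented for illustration.

```python
# Hypothetical counts: (motivation, binge) -> (n_learners, n_completed).
# Motivated learners binge more often AND complete more often.
cells = {
    ("high", 1): (80, 64),
    ("high", 0): (20, 14),
    ("low", 1): (20, 8),
    ("low", 0): (80, 24),
}

def completion_rate(binge, motivation=None):
    """Completion rate for a binge status, pooled or within one motivation group."""
    n = done = 0
    for (m, b), (count, completed) in cells.items():
        if b == binge and (motivation is None or m == motivation):
            n += count
            done += completed
    return done / n

# Pooled comparison, ignoring motivation (what a regression without controls sees).
naive_gap = completion_rate(1) - completion_rate(0)

# Comparison within each motivation group (what controlling for motivation would see).
gap_high = completion_rate(1, "high") - completion_rate(0, "high")
gap_low = completion_rate(1, "low") - completion_rate(0, "low")

print(round(naive_gap, 4))                   # 0.34
print(round(gap_high, 4), round(gap_low, 4)) # 0.1 0.1
```

In this toy setup the pooled gap is more than three times the within-group gap, purely because of who chooses to binge. Since motivation is not observed in practice, it cannot simply be added as a control, which is what motivates the randomized encouragement design.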

## Part 2

### Questions 

We will use the binary variable *message* as our instrument to investigate the impact of bingeing on completion of the following week’s lecture.

**1.** Since messages were randomly assigned, we know that the independence assumption is satisfied. What does the exclusion restriction mean in this context? (2 pts)

**Note**: This question will be manually graded. 

YOUR ANSWER HERE

**2.** Let’s look at the first stage relationship.

**2a.** Using robust standard errors in the statsmodels module, regress variable *binge* on variable *message*. Assign the results (using the `.get_robustcov_results()` method) to the variable `robust_reg2_2a`. (1 pt)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden Tests, checking the coefficients and standard errors of robust_reg2_2a.

**2b.** Do we have a strong first stage? Explain. (1 pt)

**Note**: This question will be manually graded.

YOUR ANSWER HERE
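
One common way to judge first-stage strength is the rule of thumb that the first-stage F-statistic should exceed 10. The sketch below applies it to invented numbers (`first_stage_coef` and `first_stage_se` are hypothetical, not results from the assignment data); with a single instrument, the F-statistic is the squared t-statistic on that instrument.

```python
# Hypothetical first-stage output: coefficient on the instrument and its robust standard error.
first_stage_coef = 0.20
first_stage_se = 0.01

# With one instrument, the first-stage F-statistic equals the squared t-statistic.
t_stat = first_stage_coef / first_stage_se
f_stat = t_stat ** 2

# Rule of thumb: an F-statistic above 10 suggests the instrument is not weak.
is_strong = f_stat > 10
print(round(f_stat, 1), is_strong)
```

The threshold of 10 is a convention rather than a formal test, so it is best reported alongside the coefficient and its standard error.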

**3.**  Let’s look at the intention-to-treat effect.

**3a.** Calculate the “intention-to-treat” (ITT) effect by running the reduced form regression. That is, using robust standard errors, regress *complete* on *message*. Based on your regression results, how much does receiving a message change the likelihood of completing the next week? Assign this number (the coefficient in front of variable *message*) to the variable `l_change2_3` and ensure that its data type is float. (Round to four decimal places.) (1 pt)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden Tests, checking value of l_change2_3
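
Because *message* is binary, the reduced-form slope is the difference in mean completion between the two assignment groups. The sketch below checks this on six made-up records in plain Python; the data and variable names are illustrative assumptions only.

```python
# Hypothetical records: (message, complete) for six learners.
records = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0)]

z = [m for m, _ in records]
y = [c for _, c in records]
n = len(records)
z_bar = sum(z) / n
y_bar = sum(y) / n

# OLS slope of y on z: sum of cross-deviations over sum of squared deviations of z.
slope = (
    sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    / sum((zi - z_bar) ** 2 for zi in z)
)

# Difference in mean outcomes between the message and no-message groups.
mean_treated = sum(yi for zi, yi in zip(z, y) if zi == 1) / sum(z)
mean_control = sum(yi for zi, yi in zip(z, y) if zi == 0) / (n - sum(z))
diff_in_means = mean_treated - mean_control

print(round(slope, 4), round(diff_in_means, 4))
```

The two numbers agree, which is why the coefficient on *message* can be read directly as the change in completion probability from being assigned a message.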

**3b.** Based on the p-value, can you conclude that the coefficient on *message* is significant at the 5% level? Explain: report BOTH the p-value AND the decision rule used to determine whether the coefficient differs from 0 at the 5% significance level. (1 pt)

**Note**: This question will be manually graded.

YOUR ANSWER HERE
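
The decision rule can be sketched as follows, using invented values for the coefficient and its robust standard error (`coef` and `robust_se` are hypothetical). The sketch uses a standard normal approximation, which is an assumption that is reasonable for large samples; fitted statsmodels results also report p-values directly through their `pvalues` attribute.

```python
import math

# Hypothetical regression output, invented for illustration.
coef = 0.06
robust_se = 0.02

z_stat = coef / robust_se
# Two-sided p-value under a standard normal approximation.
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

# Decision rule: reject H0 (coefficient = 0) when the p-value is below alpha.
alpha = 0.05
reject_null = p_value < alpha
print(round(p_value, 4), reject_null)  # 0.0027 True
```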

**4.** The ITT doesn’t take into account that some users may not comply with the treatment assignment. With heterogeneous treatment effects, we have an additional assumption: monotonicity. This means that there are “no defiers” in the population. 
Explain what “no defiers” means in this context and what its implications are for the identification of treatment effects.
(2 pts)

**Note**: This question will be manually graded.

YOUR ANSWER HERE

**5.** Assuming the no-defiers assumption is satisfied, calculate the share of “always-takers,” which is given by the probability of bingeing when assigned to not receive a message. (You can calculate this value by dividing the number of learners who binged without receiving a message by the total number of learners in the no-message group.) Assign the value to the variable `at_share2_5` and ensure that its data type is float. (Round to four decimal places.) (1 pt)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden Tests, checking value of at_share2_5

**6.** Similarly, calculate the share of “never-takers,” which is given by the probability of not bingeing when assigned to receive a message. Assign the value to the variable `nt_share2_6` and ensure that its data type is float. (Round to four decimal places.) (1 pt)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden Tests, checking value of nt_share2_6.
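
The group shares described in Questions 5 and 6 can be sketched on made-up records. The counts below are invented (20 of 100 control learners binge, 40 of 100 treated learners binge) and the variable names are illustrative. Under the no-defiers assumption, whatever is left after removing always-takers and never-takers is the complier share.

```python
# Hypothetical records: (message, binge), built from invented counts.
records = (
    [(0, 1)] * 20 + [(0, 0)] * 80    # no-message group: 20 of 100 binge
    + [(1, 1)] * 40 + [(1, 0)] * 60  # message group: 40 of 100 binge
)

control = [b for m, b in records if m == 0]
treated = [b for m, b in records if m == 1]

# Always-takers binge even without the message: P(binge = 1 | message = 0).
always_taker_share = sum(control) / len(control)

# Never-takers do not binge even with the message: P(binge = 0 | message = 1).
never_taker_share = 1 - sum(treated) / len(treated)

# With no defiers, the remainder are compliers.
complier_share = 1 - always_taker_share - never_taker_share

print(round(always_taker_share, 4), round(never_taker_share, 4), round(complier_share, 4))
```

In this toy data the complier share equals the difference in bingeing rates between the two groups (0.40 minus 0.20), which is the first-stage effect of the message on bingeing.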

**7.** The ITT effect divided by the difference in compliance rates between the two groups (i.e. the effect of the instrument on the treatment) captures the causal effect of bingeing for compliers, the learners who binged as a result of the experiment. That is, the IV estimate we are interested in is equal to the reduced form divided by the first stage. Calculate the IV estimate manually: divide the reduced form coefficient from Part 2, Question 3a (rounded to four decimal places) by the first stage coefficient from Part 2, Question 2a (rounded to four decimal places). Assign the value to the variable `answer2_7` and ensure that its data type is float. (Round to four decimal places.) (1 pt)
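
In symbols, the ratio described above can be written as follows. The notation is introduced here for illustration only: $Y$ stands for *complete*, $D$ for *binge*, and $Z$ for *message*.

$$
\hat{\beta}_{IV} = \frac{\text{reduced form}}{\text{first stage}} = \frac{\hat{E}[Y \mid Z = 1] - \hat{E}[Y \mid Z = 0]}{\hat{E}[D \mid Z = 1] - \hat{E}[D \mid Z = 0]}
$$

As a purely hypothetical illustration, a reduced form of 0.06 and a first stage of 0.20 would give $\hat{\beta}_{IV} = 0.06 / 0.20 = 0.30$.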

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden Tests, checking value of answer2_7.

**8.** To obtain a measure of the precision of our IV estimate, we use the 2SLS method. Using robust standard errors, run a two-stage least squares regression in which the outcome variable is *complete*, the instrumented (endogenous) variable is *binge*, and the instrument is *message*. Use the `IV2SLS` class from the linearmodels library. Assign the fitted results (returned by the `.fit()` method) to the variable `iv2sls2_8`. (2 pts)

**Note**: Be sure to remove any NAs from the dataframe before proceeding. 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden Tests, checking the coefficients and standard errors of iv2sls2_8.
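
To show what the 2SLS point estimate and a robust standard error compute in the simplest case (one endogenous regressor, one instrument, plus a constant), here is a hand-rolled sketch in plain Python on invented counts. It is not the linearmodels implementation: the standard error uses an HC0-style sandwich formula with no small-sample adjustment, and whether it matches a given library's output depends on that library's settings.

```python
import math

# Hypothetical counts: (message, binge, complete) -> number of learners.
counts = {
    (0, 1, 1): 16, (0, 1, 0): 4, (0, 0, 1): 34, (0, 0, 0): 46,
    (1, 1, 1): 32, (1, 1, 0): 8, (1, 0, 1): 24, (1, 0, 0): 36,
}
rows = [key for key, k in counts.items() for _ in range(k)]
z = [r[0] for r in rows]  # instrument
x = [r[1] for r in rows]  # endogenous treatment
y = [r[2] for r in rows]  # outcome
n = len(rows)
z_bar, x_bar, y_bar = sum(z) / n, sum(x) / n, sum(y) / n

# Just-identified IV slope: cross-deviations of (z, y) over cross-deviations of (z, x).
s_zy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
s_zx = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
beta_iv = s_zy / s_zx
alpha_iv = y_bar - beta_iv * x_bar

# IV residuals use the actual x, not first-stage fitted values.
resid = [yi - alpha_iv - beta_iv * xi for xi, yi in zip(x, y)]

# HC0-style sandwich variance for the slope.
robust_var = sum(((zi - z_bar) * ei) ** 2 for zi, ei in zip(z, resid)) / s_zx ** 2
robust_se = math.sqrt(robust_var)

print(round(beta_iv, 4), round(robust_se, 4))
```

In these made-up counts the completion gap between groups is 0.06 and the bingeing gap is 0.20, so the slope matches the manual reduced-form-over-first-stage ratio of 0.30. The advantage of the regression form is that it also delivers a standard error.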