# A/B Testing the Udacity Website

Luopeiwen Yi

In these exercises, we'll be analyzing data on user behavior from an experiment run by Udacity, the online education company. More specifically, we'll be looking at a test Udacity ran to improve the onboarding process on their site.

Udacity's test is an example of an "A/B" test, in which some portion of users visiting a website (or using an app) are randomly selected to see a new version of the site. An analyst can then compare the behavior of users who see the new design to that of users who see the normal website to estimate the effect of rolling out the proposed changes to all users. While this kind of experiment has its own name in industry (A/B testing), to be clear, it's just a randomized experiment, and so everything we've learned about potential outcomes and randomized experiments applies here.
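
To make the "it's just a randomized experiment" point concrete, here is a minimal simulation sketch. The number of users, the 10% baseline conversion rate, and the 2-point effect are made-up illustration values, not Udacity's numbers.

```python
import numpy as np

rng = np.random.default_rng(42)

n_users = 100_000    # hypothetical number of visitors
true_effect = 0.02   # hypothetical lift in conversion probability

# Randomly assign each user to control (0) or treatment (1) with a coin flip
assignment = rng.integers(0, 2, size=n_users)

# Users convert with probability 0.10 under control and
# 0.10 + true_effect under treatment
p_convert = 0.10 + true_effect * assignment
converted = rng.random(n_users) < p_convert

# Because assignment was random, the difference in means estimates the ATE
ate_hat = converted[assignment == 1].mean() - converted[assignment == 0].mean()
print(f"Estimated ATE: {ate_hat:.4f} (true effect: {true_effect})")
```

The estimate should land close to the simulated effect, differing only by sampling noise.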

(Udacity has generously provided the data from this test under an Apache open-source license, and you can find their [original writeup here](https://www.kaggle.com/tammyrotem/ab-tests-with-python/notebook). If you're interested in learning more about A/B testing in particular, it seems only fair, while we use their data, to flag that they have a full course on the subject [here](https://www.udacity.com/course/ab-testing--ud257).)

## Udacity's Test

The test [is described by Udacity as follows](https://www.kaggle.com/tammyrotem/ab-tests-with-python/notebook): 

At the time of this experiment, Udacity courses currently have two options on the course overview page: "start free trial", and "access course materials".

**Current Conditions Before Change**

- If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first.
- If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.

**Description of Experimented Change**

- In the experiment, Udacity tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course.
- If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that Udacity courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.
- At this point, the student would have the option to continue enrolling in the free trial, or access the course materials for free instead. This [screenshot](images/udacity_checkyoureready.png) shows what the experiment looks like.

**Udacity's Hope is that...**:

> this might set clearer expectations for students upfront, thus reducing the number of frustrated students who left the free trial because they didn't have enough time -- without significantly reducing the number of students to continue past the free trial and eventually complete the course. If this hypothesis held true, Udacity could improve the overall student experience and improve coaches' capacity to support students who are likely to complete the course.



## Gradescope Autograding

Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file `exercise_abtesting.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:

```python
assert set(results.keys()) == {
    "ex4_avg_oec",
    "ex5_avg_guardrail",
    "ex7_ttest_pvalue",
    "ex9_ttest_pvalue_clicks",
    "ex10_num_obs",
    "ex11_guard_ate",
    "ex11_guard_pvalue",
    "ex11_oec_ate",
    "ex11_oec_pvalue",
    "ex14_se_treatment",
}
```


### Submission Limits

Please remember that you are **only allowed FOUR submissions to the autograder.** Your last submission (if you submit 4 or fewer times), or your fourth submission (if you submit more than 4 times), will determine your grade. Submissions that error out will **not** count against this total.

That's one more than usual in case there are issues with exercise clarity.

In [34]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import warnings
import statsmodels.api as sm

warnings.filterwarnings("ignore")

pd.set_option("mode.copy_on_write", True)

# Create a results dictionary
results = {}

## Import the Data

### Exercise 1

Begin by importing Udacity's data on user behavior [here.](https://github.com/nickeubank/MIDS_Data/tree/master/udacity_AB_testing) 

There are TWO datasets for this test: one for the control data (users who saw the original design), and one for the treatment data (users who saw the experimental design). Udacity decided to show their test site to 1/2 of visitors, so there are roughly the same number of users appearing in each dataset (though this is not a requirement of A/B tests).

Please remember to load the data directly from GitHub to assist the autograder.

In [35]:
# import control data
control = pd.read_csv(
    "https://media.githubusercontent.com/media/nickeubank/MIDS_Data/master/udacity_AB_testing/control_data.csv"
)

# import experiment data
experiment = pd.read_csv(
    "https://media.githubusercontent.com/media/nickeubank/MIDS_Data/master/udacity_AB_testing/experiment_data.csv"
)

In [36]:
control.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7723,687,134.0,70.0
1,"Sun, Oct 12",9102,779,147.0,70.0
2,"Mon, Oct 13",10511,909,167.0,95.0
3,"Tue, Oct 14",9871,836,156.0,105.0
4,"Wed, Oct 15",10014,837,163.0,64.0


In [37]:
control.isnull().sum()

Date            0
Pageviews       0
Clicks          0
Enrollments    14
Payments       14
dtype: int64

In [38]:
experiment.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments
0,"Sat, Oct 11",7716,686,105.0,34.0
1,"Sun, Oct 12",9288,785,116.0,91.0
2,"Mon, Oct 13",10480,884,145.0,79.0
3,"Tue, Oct 14",9867,827,138.0,92.0
4,"Wed, Oct 15",9793,832,140.0,94.0


In [39]:
experiment.isnull().sum()

Date            0
Pageviews       0
Clicks          0
Enrollments    14
Payments       14
dtype: int64

### Exercise 2

Explore the data. Can you identify the unit of observation of the data (e.g. what is represented by each row)?

(Note this is not the only way that A/B test data can be collected and/or reported — this is just what Udacity provided, presumably to help address privacy concerns.)

In [40]:
# Print the shapes of the datasets
print(f"Control dataset shape: {control.shape}")
print(f"Experiment dataset shape: {experiment.shape}")

# Print descriptive statistics
print("\nControl dataset descriptive statistics:")
print(control.describe())

print("\nExperiment dataset descriptive statistics:")
print(experiment.describe())

Control dataset shape: (37, 5)
Experiment dataset shape: (37, 5)

Control dataset descriptive statistics:
          Pageviews      Clicks  Enrollments    Payments
count     37.000000   37.000000    23.000000   23.000000
mean    9339.000000  766.972973   164.565217   88.391304
std      740.239563   68.286767    29.977000   20.650202
min     7434.000000  632.000000   110.000000   56.000000
25%     8896.000000  708.000000   146.500000   70.000000
50%     9420.000000  759.000000   162.000000   91.000000
75%     9871.000000  825.000000   175.000000  102.500000
max    10667.000000  909.000000   233.000000  128.000000

Experiment dataset descriptive statistics:
          Pageviews      Clicks  Enrollments    Payments
count     37.000000   37.000000    23.000000   23.000000
mean    9315.135135  765.540541   148.826087   84.565217
std      708.070781   64.578374    33.234227   23.060841
min     7664.000000  642.000000    94.000000   34.000000
25%     8881.000000  722.000000   127.000000   69.00

> The unit of observation is a day within one arm of the experiment: each row reports, for a single date, the number of unique users who viewed the page, clicked "Start Free Trial", enrolled in the trial, and eventually paid.
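
One way to check a guess about the unit of observation is to test what it implies about the data. The sketch below does this on a small hand-built frame (the hypothetical name `toy_rows`) that copies the first three control rows printed above; the same checks could be run on the full `control` and `experiment` frames.

```python
import pandas as pd

# Toy frame copying the first three control rows shown by control.head()
toy_rows = pd.DataFrame(
    {
        "Date": ["Sat, Oct 11", "Sun, Oct 12", "Mon, Oct 13"],
        "Pageviews": [7723, 9102, 10511],
        "Clicks": [687, 779, 909],
    }
)

# If each row is one day, no date should repeat within a single arm's file
one_row_per_day = toy_rows["Date"].is_unique
print(f"One row per day: {one_row_per_day}")

# Counts should also nest: clicks cannot exceed pageviews on a given day
clicks_nested = bool((toy_rows["Clicks"] <= toy_rows["Pageviews"]).all())
print(f"Clicks <= Pageviews on every row: {clicks_nested}")
```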

## Pick your measures

### Exercise 3

The easiest way to analyze this data is to stack it into a single dataset where each observation is a day-treatment-arm (so you should end up with two rows per day, one for those who are in the treated groups, and one for those who were in the control group). Note that currently nothing in the data identifies whether a given observation is a treatment group observation or a control group observation, so you'll want to make sure to add a "treatment" indicator variable.

The variables in the data are:

- Pageviews: number of unique users visiting homepage
- Clicks: number of those users clicking "Start Free Trial"
- Enrollments: Number of people enrolling in trial
- Payments: Number of people who eventually pay for the service. Note the `Payments` column reports payments for the users who first visited the site on the reported date, not payments occurring on the reported date.

In [41]:
# Add a treatment indicator variable
control["treatment"] = 0  # Indicates control group
experiment["treatment"] = 1  # Indicates experiment (treatment) group

# Combine the two datasets
stack_data = pd.concat([control, experiment], ignore_index=True)

# Display the shape of the combined dataset and a brief description
print(stack_data.shape)
print(stack_data.describe())

(74, 6)
          Pageviews      Clicks  Enrollments    Payments  treatment
count     74.000000   74.000000    46.000000   46.000000  74.000000
mean    9327.067568  766.256757   156.695652   86.478261   0.500000
std      719.455794   66.005616    32.289571   21.730408   0.503413
min     7434.000000  632.000000    94.000000   34.000000   0.000000
25%     8891.500000  713.000000   131.750000   70.000000   0.000000
50%     9379.500000  768.500000   153.500000   91.000000   0.500000
75%     9779.000000  826.500000   175.500000  100.750000   1.000000
max    10667.000000  909.000000   233.000000  128.000000   1.000000


In [42]:
stack_data.head()

Unnamed: 0,Date,Pageviews,Clicks,Enrollments,Payments,treatment
0,"Sat, Oct 11",7723,687,134.0,70.0,0
1,"Sun, Oct 12",9102,779,147.0,70.0,0
2,"Mon, Oct 13",10511,909,167.0,95.0,0
3,"Tue, Oct 14",9871,836,156.0,105.0,0
4,"Wed, Oct 15",10014,837,163.0,64.0,0


### Exercise 4

Given Udacity's goals, what outcome are they hoping will be impacted by their manipulation?

Or, to ask the same question in the language of the Potential Outcomes Framework, what is their $Y$?

Or to ask the same question in the language of Kohavi, Tang and Xu, what is their *Overall Evaluation Criterion (OEC)*?

(I'm only asking one question, I'm just trying to phrase it using different terminologies we've encountered to help you see how they all fit together)

When you feel like you have your answer, please compute it. Store the average value of the variable in `results` under the key `ex4_avg_oec`. **Please round your answer to 4 decimal places.**

NOTE: You'll probably notice you have two choices to make when it comes to actually computing the OEC. 

- You could probably imagine either computing a ratio or a difference of two things — please calculate the difference.
- You may also be unsure whether to normalize by `Clicks`. Normalizing by clicks will help account for variation that comes from day-to-day variation in users, so it's a good thing to do. With infinite data, you'd expect to get the same results without normalizing by `Clicks` (since on average the same share of users are in each arm of the experiment), but for finite data it's a good strategy. Note that this is only ok because users make the choice to click or not *before* they see different versions of the website (it is "pre-treatment").

Just to make sure you're on track, your measure should have an average value of *about* 9%.
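
As a quick arithmetic check, the sketch below applies one plausible reading of the OEC (enrollments minus payments, divided by clicks) to the first control row printed in Exercise 1. Treat the formula as an assumption about the intended measure rather than an official definition; the ratio form is shown only for contrast.

```python
# Values from the first control row shown in Exercise 1 (Sat, Oct 11)
enrollments = 134.0
payments = 70.0
clicks = 687

# Difference form of the OEC, normalized by clicks
oec_single_day = (enrollments - payments) / clicks

# Ratio form, shown only for contrast (the exercise asks for the difference)
payment_share = payments / enrollments

print(f"OEC (difference / clicks): {oec_single_day:.4f}")
print(f"Payments / Enrollments:    {payment_share:.4f}")
```

A single day will not equal the overall average, but it should sit in the same neighborhood as the "about 9%" sanity check.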

In [43]:
# Calculate the OEC for each row in the dataset
stack_data["OEC"] = (stack_data["Enrollments"] - stack_data["Payments"]) / stack_data[
    "Clicks"
]

# Calculate the overall average OEC across all data
overall_avg_oec = stack_data["OEC"].mean()

# Convert the overall average OEC to a percentage, rounded to 2 decimal places (for display only)
overall_avg_oec_percentage = round(overall_avg_oec * 100, 2)

results["ex4_avg_oec"] = round(overall_avg_oec, 4)

# Print the overall average OEC as a percentage
print(f"The overall average OEC is around {overall_avg_oec_percentage}%.")

The overall average OEC is around 9.41%.


> Given Udacity's goals, they are aiming to impact the ratio of users who enroll in the free trial and then proceed to make a payment, by reducing the number of frustrated students who left the free trial because they didn't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. The $Y$, or OEC, is the difference between enrollments and payments normalized by the number of clicks, which is a measure of how many users who enrolled did not proceed to make a payment.

### Exercise 5

Given Udacity's goals, what outcome are they hoping will *not* be impacted by their manipulation? In other words, what do they want to measure to ensure their treatment doesn't have unintended negative consequences that might be really costly to their operation?

Note that while this isn't how Kohavi, Tang, and Xu use the term "guardrail metrics" — they usually use the term to refer to things we measure to ensure the experiment is working the way it should — some people would also use the term "guardrail metrics" for something that could be impacted even if the experiment is working correctly, but which the organization wants to track to ensure they aren't impacted because they are deemed really important.

Again, please normalize by `Clicks`. Store the average value of this guardrail metric as `ex5_avg_guardrail` and **round your answer to 4 decimal places.**

> The outcome Udacity hopes will not be impacted by their manipulation is the number of payments, because they want to enhance the student experience by setting clearer expectations without reducing the overall number of students who decide to pay after the free trial. The guardrail metric is the payments-to-clicks ratio.

In [44]:
# Calculate the Payments-to-Clicks ratio for each row in the dataset
stack_data["Payments_to_Clicks"] = stack_data["Payments"] / stack_data["Clicks"]

# Compute the average Payments-to-Clicks ratio across all data
avg_payments_to_clicks = stack_data["Payments_to_Clicks"].mean()

# Express the average as a percentage, rounded to 2 decimal places (for display only)
avg_payments_to_clicks_rounded = round(avg_payments_to_clicks * 100, 2)

results["ex5_avg_guardrail"] = round(avg_payments_to_clicks, 4)

print(
    f"The average Payments-to-Clicks ratio is around {avg_payments_to_clicks_rounded}%."
)

The average Payments-to-Clicks ratio is around 11.58%.


## Validating The Data

### Exercise 6

Whenever you are working with experimental data, the first thing you want to do is verify that users actually were randomly sorted into the two arms of the experiment. In this data, half of users were supposed to be shown the old version of the site and half were supposed to see the new version.

`Pageviews` tells you how many unique users visited the welcome site we are experimenting on. `Pageviews` is what is sometimes called an "invariant" or "guardrail" variable, meaning that it shouldn't vary across treatment arms. After all, people have to visit the site before they get a chance to see the treatment, so there's no way that being assigned to treatment or control should affect the number of pageviews assigned to each group.

"Invariant" variables are also an example of what are known as "pre-treatment" variables, because pageviews are determined before users are manipulated in any way. That makes them analogous to gender or age in experiments where you have demographic data: a person's age and gender are determined before they experience any manipulations, so the value of any pre-treatment attributes should be the same across the two arms of our experiment. This is what we've previously called "checking for balance." If pre-treatment attributes aren't balanced, then we may worry our attempt to randomly assign people to different groups failed. Kohavi, Tang and Xu call this a "trust-based guardrail metric" because it helps us determine if we should trust our data.

To test the quality of the randomization, calculate the average number of pageviews for the treated group and for the control group. Do they look similar?

In [45]:
# Calculate the average number of pageviews for the control and experiment groups
avg_pageviews_control = stack_data[stack_data["treatment"] == 0]["Pageviews"].mean()
avg_pageviews_experiment = stack_data[stack_data["treatment"] == 1]["Pageviews"].mean()

print(
    f"Average pageviews for control group is around {round(avg_pageviews_control,4)}. Average pageviews for experiment group is around {round(avg_pageviews_experiment,4)}"
)

Average pageviews for control group is around 9339.0. Average pageviews for experiment group is around 9315.1351


> The average number of pageviews for the treated group and for the control group are very similar. The difference between the two averages is very small.

### Exercise 7

"Similar" is a tricky concept -- obviously, we expect *some* differences across groups since users were *randomly* divided across treatment arms. The question is whether the differences between groups are larger than we'd expect to emerge given our random assignment process. To evaluate this, let's use a `ttest` to test the statistical significance of the differences we see. 

**Note**: scipy functions don't always handle `pandas` objects gracefully, so when you use a scipy function, it's safest to pass the numpy vectors underlying your data with the `.values` attribute (e.g. `df.my_column.values`).

Does the difference in `pageviews` look statistically significant?

Store the resulting p-value in `ex7_ttest_pvalue` **rounded to four decimal places.**

In [46]:
# Extract the numpy vectors for pageviews from both groups
pageviews_control = stack_data[stack_data["treatment"] == 0]["Pageviews"].values
pageviews_experiment = stack_data[stack_data["treatment"] == 1]["Pageviews"].values

# Perform a t-test on pageviews
ttest_result_pageviews = ttest_ind(pageviews_control, pageviews_experiment)
print(
    "Performing a t-test to compare the pageviews between the control and experiment groups:"
)
ttest_result_pageviews

Performing a t-test to compare the pageviews between the control and experiment groups:


TtestResult(statistic=0.14171182982874964, pvalue=0.8877034068650902, df=72.0)

In [47]:
# Store the p-value, rounded to four decimal places
ex7_ttest_pvalue = round(ttest_result_pageviews.pvalue, 4)

results["ex7_ttest_pvalue"] = ex7_ttest_pvalue

print(
    f"The p-value from the t-test on pageviews between control and experiment groups is: {ex7_ttest_pvalue}"
)

The p-value from the t-test on pageviews between control and experiment groups is: 0.8877


> The difference in pageviews does not look statistically significant: the p-value (0.8877) is far above 0.05.
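
To see what `ttest_ind` is computing, the sketch below rebuilds the pooled-variance t-statistic by hand and compares it with scipy's result. It uses only the first five days of `Pageviews` from each arm (the rows printed by `.head()` above), so its numbers will not match the full-sample test. The Welch variant at the end is an optional alternative that drops the equal-variance assumption; it is not what the exercise asks for.

```python
import numpy as np
from scipy.stats import ttest_ind

# First five days of Pageviews from each arm, as printed by .head() above
control_pv = np.array([7723, 9102, 10511, 9871, 10014], dtype=float)
experiment_pv = np.array([7716, 9288, 10480, 9867, 9793], dtype=float)

n_c, n_e = len(control_pv), len(experiment_pv)

# Pooled variance: both groups are assumed to share one variance
pooled_var = (
    (n_c - 1) * control_pv.var(ddof=1) + (n_e - 1) * experiment_pv.var(ddof=1)
) / (n_c + n_e - 2)

# t = difference in means / standard error of that difference
se_diff = np.sqrt(pooled_var * (1 / n_c + 1 / n_e))
t_manual = (control_pv.mean() - experiment_pv.mean()) / se_diff

# scipy's default (equal_var=True) should give the same statistic
pooled_result = ttest_ind(control_pv, experiment_pv)
print(f"manual t = {t_manual:.4f}, scipy t = {pooled_result.statistic:.4f}")

# Welch's variant drops the equal-variance assumption
welch_result = ttest_ind(control_pv, experiment_pv, equal_var=False)
print(f"Welch p-value = {welch_result.pvalue:.4f}")
```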

### Exercise 8

`Pageviews` is not the only "pre-treatment" variable in this data we can use to evaluate balance/use as a guardrail metric. What other measure is pre-treatment? Review the description of the experiment if you're not sure.

> Clicks is another pre-treatment variable. Clicks, which measures the number of those users clicking "Start Free Trial", happens before users get a chance to see the treatments. 

### Exercise 9

Check if the other pre-treatment variable is also balanced. Store the p-value of your test of difference in `results` under the key `"ex9_ttest_pvalue_clicks"` **rounded to four decimal places.**

In [48]:
# Calculate the average number of clicks for the control and experiment groups
avg_clicks_control = stack_data[stack_data["treatment"] == 0]["Clicks"].mean()
avg_clicks_experiment = stack_data[stack_data["treatment"] == 1]["Clicks"].mean()

print(
    f"Average clicks for control group is around {round(avg_clicks_control,4)}. Average clicks for experiment group is around {round(avg_clicks_experiment,4)}"
)

Average clicks for control group is around 766.973. Average clicks for experiment group is around 765.5405


In [49]:
# Extract the numpy vectors for clicks from both groups
clicks_control = stack_data[stack_data["treatment"] == 0]["Clicks"].values
clicks_experiment = stack_data[stack_data["treatment"] == 1]["Clicks"].values

# Perform a t-test on  clicks
ttest_result_clicks = ttest_ind(clicks_control, clicks_experiment)
print(
    "Performing a t-test to compare the clicks between the control and experiment groups:"
)
ttest_result_clicks

Performing a t-test to compare the clicks between the control and experiment groups:


TtestResult(statistic=0.09270642968639531, pvalue=0.9263942642482703, df=72.0)

In [50]:
# Store the p-value, rounded to four decimal places
ex9_ttest_pvalue_clicks = round(ttest_result_clicks.pvalue, 4)
results["ex9_ttest_pvalue_clicks"] = ex9_ttest_pvalue_clicks

print(
    f"The p-value from the t-test on clicks between control and experiment groups is: {ex9_ttest_pvalue_clicks}"
)

The p-value from the t-test on clicks between control and experiment groups is: 0.9264


> - The average number of clicks for the treated group and for the control group are very similar. The difference between the two averages is minimal.
> - The difference in Clicks does not look statistically significant because the p-value (0.9264) is greater than 0.05.
> - The Clicks variable is balanced.

## Estimating the Effect of Experiment

### Exercise 10

Now that we've validated our randomization, our next task is to estimate our treatment effect. First, though, there's an issue with your data you've been able to largely ignore until now, but which you should get a grip on before estimating your treatment effect — can you tell what it is and what you should do about it?

Store the number of observations in your data *after* you've addressed this in `ex10_num_obs` (this is mostly meant as a way to sanity check your answer with the autograder).

In [51]:
# Checking for missing values in the dataset
missing_values = stack_data.isnull().sum()

print(f"Missing values in each column:\n{missing_values}")

Missing values in each column:
Date                   0
Pageviews              0
Clicks                 0
Enrollments           28
Payments              28
treatment              0
OEC                   28
Payments_to_Clicks    28
dtype: int64


> The issue with the data is the presence of missing values in the Enrollments and Payments columns. To address this issue, I can remove observations with missing values since these observations cannot contribute to the analysis of conversion rates or payment rates.
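
Before dropping rows, it can help to look at *which* days are missing outcomes and whether the two columns are missing together. The sketch below runs those checks on a hypothetical four-row frame (`toy_days`, with invented day labels and a made-up missingness pattern); the same lines could be pointed at `stack_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set: the last two days lack enrollment/payment data
toy_days = pd.DataFrame(
    {
        "Date": ["Day 1", "Day 2", "Day 3", "Day 4"],
        "Clicks": [687, 779, 909, 836],
        "Enrollments": [134.0, 147.0, np.nan, np.nan],
        "Payments": [70.0, 70.0, np.nan, np.nan],
    }
)

# Rows where either outcome is missing
missing_mask = toy_days[["Enrollments", "Payments"]].isnull().any(axis=1)
missing_dates = toy_days.loc[missing_mask, "Date"].tolist()
print("Days with missing outcomes:", missing_dates)

# Check whether the two columns are always missing on the same rows
same_pattern = bool(
    (toy_days["Enrollments"].isnull() == toy_days["Payments"].isnull()).all()
)
print(f"Enrollments and Payments missing together: {same_pattern}")

# Drop incomplete rows and report how many survive
toy_complete = toy_days.dropna(subset=["Enrollments", "Payments"])
print(f"Rows kept: {len(toy_complete)} of {len(toy_days)}")
```

If the missing rows cluster in a particular stretch of dates, that is worth noting in the write-up, since it affects which days the treatment-effect estimate actually covers.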

In [52]:
# Remove observations with missing values in the 'Enrollments' or 'Payments' columns
stack_data_new = stack_data.dropna(subset=["Enrollments", "Payments"])

# Display the number of observations after addressing missing values
ex10_num_obs = stack_data_new.shape[0]

results["ex10_num_obs"] = ex10_num_obs

print(f"The number of observations after addressing missing values is {ex10_num_obs}.")

The number of observations after addressing missing values is 46.



### Exercise 11

Now that we've established we have good balance (meaning we think randomization was likely successful), we can evaluate the effects of the experiment. Test whether the OEC and the metric you *don't* want affected have different average values in the control group and treatment group. 

Because we've randomized, this is a consistent estimate of the Average Treatment Effect of Udacity's website change.

Calculate the difference in means in your OEC and guardrail metrics using a simple t-test. Store the resulting effect estimates in `ex11_oec_ate` and `ex11_guard_ate` and p-values in `ex11_oec_pvalue` and `ex11_guard_pvalue`. **Please round all answers to 4 decimal places.** Report your ATE in *percentage points*, where `1` denotes 1 percentage point.


In [53]:
# Separating control and experiment groups for OEC and payment_to_clicks
oec_control = stack_data_new[stack_data_new["treatment"] == 0]["OEC"]
oec_experiment = stack_data_new[stack_data_new["treatment"] == 1]["OEC"]
payment_to_clicks_control = stack_data_new[stack_data_new["treatment"] == 0][
    "Payments_to_Clicks"
]
payment_to_clicks_experiment = stack_data_new[stack_data_new["treatment"] == 1][
    "Payments_to_Clicks"
]

In [54]:
# T-tests for OEC
ttest_result_oec = ttest_ind(oec_control, oec_experiment)

print(
    "Performing a t-test to compare the OEC between the control and experiment groups:"
)

ttest_result_oec

Performing a t-test to compare the OEC between the control and experiment groups:


TtestResult(statistic=1.5350318294600591, pvalue=0.1319361736551665, df=44.0)

In [55]:
# T-tests for payment_to_clicks
ttest_result_payment_to_clicks = ttest_ind(
    payment_to_clicks_control, payment_to_clicks_experiment
)

print(
    "Performing a t-test to compare the payments to clicks between the control and experiment groups:"
)

ttest_result_payment_to_clicks

Performing a t-test to compare the payments to clicks between the control and experiment groups:


TtestResult(statistic=0.5387777625331603, pvalue=0.5927558614268024, df=44.0)

In [56]:
# Average Treatment Effects (ATE) for OEC
ex11_oec_ate = (oec_experiment.mean() - oec_control.mean()) * 100

results["ex11_oec_ate"] = round(abs(ex11_oec_ate), 4)
print(
    f"The Average Treatment Effects (ATE) of Udacity's website change for OEC is around {round(ex11_oec_ate, 4)}%"
)

The Average Treatment Effects (ATE) of Udacity's website change for OEC is around -1.5888%


In [57]:
# Average Treatment Effects (ATE) for payment_to_clicks
ex11_guard_ate = (
    payment_to_clicks_experiment.mean() - payment_to_clicks_control.mean()
) * 100

results["ex11_guard_ate"] = round(abs(ex11_guard_ate), 4)
print(
    f"The Average Treatment Effects (ATE) of Udacity's website change for payment_to_clicks is around {round(ex11_guard_ate, 4)}%"
)

The Average Treatment Effects (ATE) of Udacity's website change for payment_to_clicks is around -0.4897%


In [58]:
# p_value for OEC
ex11_oec_pvalue = round(ttest_result_oec.pvalue, 4)

results["ex11_oec_pvalue"] = ex11_oec_pvalue
print(
    f"The p-value from the t-test on OEC between control and experiment groups is: {ex11_oec_pvalue}"
)

# p_value for payment_to_clicks
ex11_guard_pvalue = round(ttest_result_payment_to_clicks.pvalue, 4)

results["ex11_guard_pvalue"] = ex11_guard_pvalue
print(
    f"The p-value from the t-test on payment_to_clicks between control and experiment groups is: {ex11_guard_pvalue}"
)

The p-value from the t-test on OEC between control and experiment groups is: 0.1319
The p-value from the t-test on payment_to_clicks between control and experiment groups is: 0.5928


### Exercise 12

Do you feel that Udacity achieved their goal? Did their intervention cause them any problems? If they asked you "What would happen if we rolled this out to everyone?" what would you say?

As you answer this question, a small additional question: up until this point you've (presumably) been reporting the default p-values from the tools you are using. These, as you may recall from stats 101, are two-tailed p-values. Do those seem appropriate for your OEC?

> - I don't think Udacity clearly achieved their goal. The estimated Average Treatment Effect (ATE) for the OEC is -1.5888 percentage points, suggesting a decrease in the gap between enrollments and payments when normalized by the number of clicks, which could indicate a higher enrollment-to-payment conversion rate. However, the change in OEC is not statistically significant (p = 0.1319 > 0.05), so a difference of this size could plausibly have occurred by chance.
> - I don't think the intervention caused them any problems. The ATE for the payments-to-clicks ratio is -0.4897 percentage points, a minor decrease in the proportion of clicks leading to payments. This change is also not statistically significant (p = 0.5928 > 0.05), so it could plausibly have occurred by chance.
> - If asked about rolling this out to everyone, I would say I'm not sure what the impact would be and that we should be cautious. The intervention doesn't show a statistically significant negative impact on payments to clicks, but it doesn't show a statistically significant positive impact on the OEC either. Further analysis might be needed to understand the lack of statistically significant improvement and whether other metrics or qualitative feedback suggest the change improves the student experience or aligns with Udacity's long-term goals.
> - Two-tailed p-values are standard when we are open to finding effects in either direction (positive or negative). For the OEC, however, Udacity has a specific direction of improvement in mind (reducing the gap between enrollments and payments without decreasing the overall conversion), so a one-tailed p-value might be more appropriate. This would test the hypothesis that the treatment effect lies in that specific direction (improvement) rather than testing for any change.
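
The relationship between the two kinds of p-value can be sketched with scipy's `alternative` argument to `ttest_ind`. The data here are synthetic draws invented for illustration (the means, spread, and 23-day sample size are assumptions), so the printed numbers are not the notebook's results.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic per-day metric values (made up for illustration only)
control_metric = rng.normal(loc=0.10, scale=0.03, size=23)
treated_metric = rng.normal(loc=0.085, scale=0.03, size=23)

# Default: two-sided test of "the means differ"
two_sided = ttest_ind(treated_metric, control_metric)

# One-sided test of "the treated mean is lower than the control mean"
one_sided = ttest_ind(treated_metric, control_metric, alternative="less")

print(f"t statistic  = {two_sided.statistic:.4f}")
print(f"two-sided p  = {two_sided.pvalue:.4f}")
print(f"one-sided p  = {one_sided.pvalue:.4f}")
```

When the estimate falls in the hypothesized direction, the one-sided p-value is half the two-sided one. Applied to the OEC result above, the two-sided 0.1319 would become roughly 0.066, which is still above 0.05.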

### Exercise 13

One of the magic things about experiments is that all you have to do is compare averages to get an average treatment effect. However, you *can* do other things to try and increase the statistical power of your experiments, like add controls in a linear regression model. 

As you likely know, a bivariate regression is exactly equivalent to a t-test, so let's start by re-estimating the effect of treatment on your OEC using a linear regression. Can you replicate the results from your t-test? They shouldn't just be close—they should be numerically equivalent (i.e. exactly the same to the limits of floating point number precision). 
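
The equivalence can also be checked from first principles. The sketch below fits the bivariate regression with plain numpy on synthetic data (the effect size, noise level, and group sizes are invented) and compares the slope and its t-statistic with a pooled-variance t-test. It illustrates the identity only and is not a replacement for the `statsmodels` fit the exercise expects.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Synthetic outcome for 23 control and 23 treated observations (illustrative)
treat_sim = np.repeat([0.0, 1.0], 23)
y_sim = 0.10 - 0.015 * treat_sim + rng.normal(scale=0.03, size=treat_sim.size)

# OLS of y on a constant and the treatment dummy
X_sim = np.column_stack([np.ones_like(treat_sim), treat_sim])
beta, *_ = np.linalg.lstsq(X_sim, y_sim, rcond=None)

# Classical (homoskedastic) standard errors
resid = y_sim - X_sim @ beta
dof = len(y_sim) - X_sim.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_sim.T @ X_sim)))
t_ols = beta[1] / se[1]

# Pooled-variance t-test on the same data (treated minus control)
diff_means = y_sim[treat_sim == 1].mean() - y_sim[treat_sim == 0].mean()
t_test = ttest_ind(y_sim[treat_sim == 1], y_sim[treat_sim == 0])

print(f"OLS slope = {beta[1]:.6f}, difference in means = {diff_means:.6f}")
print(f"OLS t     = {t_ols:.6f}, t-test t            = {t_test.statistic:.6f}")
```

Note that the sign of the t-statistic depends on the order of the two groups passed to `ttest_ind`; the regression always reports treated minus control.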

In [59]:
# Prepare the independent variables (add a constant term for the intercept)
X = sm.add_constant(stack_data_new["treatment"])
# Prepare the dependent variable - OEC
y = stack_data_new["OEC"]

# Fit the linear regression model
lr_model = sm.OLS(y, X).fit()

# Display the model summary to replicate the t-test results through linear regression
linear_regression_model_summary = lr_model.summary()
linear_regression_model_summary

0,1,2,3
Dep. Variable:,OEC,R-squared:,0.051
Model:,OLS,Adj. R-squared:,0.029
Method:,Least Squares,F-statistic:,2.356
Date:,"Wed, 27 Mar 2024",Prob (F-statistic):,0.132
Time:,19:47:00,Log-Likelihood:,89.832
No. Observations:,46,AIC:,-175.7
Df Residuals:,44,BIC:,-172.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.1021,0.007,13.948,0.000,0.087,0.117
treatment,-0.0159,0.010,-1.535,0.132,-0.037,0.005

0,1,2,3
Omnibus:,14.16,Durbin-Watson:,1.908
Prob(Omnibus):,0.001,Jarque-Bera (JB):,15.205
Skew:,1.227,Prob(JB):,0.000499
Kurtosis:,4.383,Cond. No.,2.62


In [60]:
# Coefficient (Effect Size) for Treatment
treatment_coefficient = lr_model.params["treatment"]
# Convert the treatment effect size to a percentage
treatment_effect_percentage = treatment_coefficient * 100

# P-value for the Treatment effect
treatment_pvalue = lr_model.pvalues["treatment"]

print(f"Coefficient for Treatment is around {round(treatment_effect_percentage, 4)}%")
print(f"P-value for Treatment effect is around {round(treatment_pvalue,4)}")

Coefficient for Treatment is around -1.5888%
P-value for Treatment effect is around 0.1319


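> I replicated the results from my t-test: the regression coefficient on `treatment` (-1.5888 percentage points) equals the difference in means, and the p-value (0.1319) matches. The t-statistic has the opposite sign (-1.535 versus 1.535) only because the t-test was computed as control minus experiment.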

### Exercise 14

Now add indicator variables for the date of each observation. Do the standard errors on your `treatment` variable change? If so, in what direction?

Store your new standard error in `ex14_se_treatment`. Round your answer to 4 decimal places.

You should have found that your standard errors decreased by about 30%. This is why, although just comparing means *works*, adding any additional variables you have to your analysis can be helpful. All the usual rules for model specification still apply: for example, you still want to be careful about overfitting, which one could argue is part of what's happening here.

In many other cases, the effect of adding controls is likely to be larger. The date indicators we added are perfectly balanced between treatment and control, so we aren't adding much information to the model by including them. They account for some day-to-day variation (presumably in the types of people coming to the site), but they don't control for any residual baseline differences the way a control like "gender" or "age" might, since those kinds of individual-level attributes will never be perfectly balanced across treatment and control.
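
One way to see what the date indicators are doing: assuming every date contributes exactly one control and one treated observation, a regression with date indicators reduces to an analysis of within-day differences (the same arithmetic as a paired t-test). The sketch below uses synthetic data with a shared day effect (an assumption, not the Udacity data) and only the standard library. The point estimate is identical either way; only the standard error changes.

```python
import math
import random
from statistics import mean, stdev, variance

# Synthetic data (an assumption, not the Udacity data): each simulated day
# contributes one control and one treated observation that share a day effect.
rng = random.Random(1)
n_days = 23
true_effect = -0.015
day_effects = [rng.gauss(0.0, 0.03) for _ in range(n_days)]
control = [0.10 + d + rng.gauss(0.0, 0.01) for d in day_effects]
treated = [0.10 + true_effect + d + rng.gauss(0.0, 0.01) for d in day_effects]

# Without date indicators: difference in means with a pooled standard error.
estimate_pooled = mean(treated) - mean(control)
pooled_var = (variance(control) + variance(treated)) / 2
se_pooled = math.sqrt(pooled_var * 2 / n_days)

# With date indicators: in this balanced one-pair-per-day layout, the
# regression reduces to an analysis of the within-day differences.
diffs = [t - c for t, c in zip(treated, control)]
estimate_within = mean(diffs)
se_within = stdev(diffs) / math.sqrt(n_days)

print(f"estimate: {estimate_pooled:.4f} vs {estimate_within:.4f}")
print(f"std err:  {se_pooled:.4f} vs {se_within:.4f}")
```

The within-day standard error is smaller whenever same-day control and treated outcomes are positively correlated, which is what a shared day effect produces. The cost is degrees of freedom: one per date indicator.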

In [61]:
# Convert the "Date" column into indicator variables
stack_data_new_date_dummies = pd.get_dummies(
    stack_data_new, columns=["Date"], drop_first=True
)

In [62]:
# Prepare the independent variables
X_dummies = stack_data_new_date_dummies.drop(
    columns=[
        "Pageviews",
        "Clicks",
        "Enrollments",
        "Payments",
        "Payments_to_Clicks",
        "OEC",
    ]
)

# add a constant term for the intercept
X_dummies = sm.add_constant(X_dummies)

# Prepare the dependent variable OEC
y_dummies = stack_data_new_date_dummies["OEC"]

In [63]:
# Fit the linear regression model
lr_model_with_date = sm.OLS(y_dummies, X_dummies.astype(float)).fit()

# Display the model summary
linear_regression_model_summary_dummies = lr_model_with_date.summary()
linear_regression_model_summary_dummies

| | | | |
|---|---|---|---|
| Dep. Variable: | OEC | R-squared: | 0.806 |
| Model: | OLS | Adj. R-squared: | 0.602 |
| Method: | Least Squares | F-statistic: | 3.962 |
| Date: | Wed, 27 Mar 2024 | Prob (F-statistic): | 0.000978 |
| Time: | 19:47:00 | Log-Likelihood: | 126.29 |
| No. Observations: | 46 | AIC: | -204.6 |
| Df Residuals: | 22 | BIC: | -160.7 |
| Df Model: | 23 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.1079 | 0.016 | 6.651 | 0.000 | 0.074 | 0.142 |
| treatment | -0.0159 | 0.007 | -2.398 | 0.025 | -0.030 | -0.002 |
| Date_Fri, Oct 24 | 0.0445 | 0.022 | 1.983 | 0.060 | -0.002 | 0.091 |
| Date_Fri, Oct 31 | -0.0074 | 0.022 | -0.331 | 0.744 | -0.054 | 0.039 |
| Date_Mon, Oct 13 | -0.0231 | 0.022 | -1.026 | 0.316 | -0.070 | 0.024 |
| Date_Mon, Oct 20 | -0.0285 | 0.022 | -1.270 | 0.217 | -0.075 | 0.018 |
| Date_Mon, Oct 27 | 0.0328 | 0.022 | 1.458 | 0.159 | -0.014 | 0.079 |
| Date_Sat, Nov 1 | -0.0235 | 0.022 | -1.047 | 0.306 | -0.070 | 0.023 |
| Date_Sat, Oct 11 | -0.0017 | 0.022 | -0.074 | 0.941 | -0.048 | 0.045 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.871 | Durbin-Watson: | 1.863 |
| Prob(Omnibus): | 0.144 | Jarque-Bera (JB): | 3.826 |
| Skew: | 0.0 | Prob(JB): | 0.148 |
| Kurtosis: | 4.413 | Cond. No. | 27.3 |


In [64]:
# Extract the standard error for the treatment variable
se_treatment_with_date = lr_model_with_date.bse["treatment"]

ex14_se_treatment = round(se_treatment_with_date, 4)
print(f"the standard error for the treatment variable is around {ex14_se_treatment}")
results["ex14_se_treatment"] = ex14_se_treatment

the standard error for the treatment variable is around 0.0066


> After adding indicator variables for the date of each observation to the linear regression model, the standard error on my `treatment` variable fell from 0.010 to 0.0066, a decrease of roughly 30%.
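
As a quick arithmetic check, the snippet below recomputes the relative change from the rounded values printed in the two regression summaries. Because the inputs are rounded, the results are approximate and will differ slightly from what the unrounded model attributes would give.

```python
# Rounded values copied from the two regression summaries above.
se_without_dates = 0.010
se_with_dates = 0.0066
coef_treatment = -0.0159

relative_change = (se_with_dates - se_without_dates) / se_without_dates
print(f"Relative change in standard error: {relative_change:.0%}")

# The coefficient is the same in both models, so the t statistic scales
# inversely with the standard error.
print(f"t without date indicators: {coef_treatment / se_without_dates:.2f}")
print(f"t with date indicators:    {coef_treatment / se_with_dates:.2f}")
```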

### Exercise 15

Does this result have any impact on the recommendations you would offer Udacity?

>- After adding indicator variables for the date of each observation to the linear regression model, the p-value fell from 0.1319 to 0.025. The Average Treatment Effect (ATE) on the OEC is unchanged at −1.5888 percentage points, suggesting a decrease in the gap between enrollments and payments when normalized by the number of clicks, which could indicate a higher enrollment-to-payment conversion rate. The change in OEC is now statistically significant at the 5% level (p-value < 0.05). Therefore, I think Udacity achieved their goal.  
>- I would recommend that Udacity implement this intervention cautiously. While it shows a statistically significant improvement (a reduction) in the OEC, Udacity should still evaluate its practical significance before rolling out the intervention to all users. The intervention might set clearer expectations for students upfront, reducing the number of frustrated students who leave the free trial because they don't have enough time.
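
One way to reason about practical significance is to look at the confidence interval rather than only the p-value. The sketch below rebuilds the 95% interval for the treatment effect from the rounded coefficient and standard error in the model with date indicators. The critical value is hard-coded from a standard t-table for 22 residual degrees of freedom, which is an assumption of this sketch rather than something computed from the fitted model.

```python
# Rounded values from the regression with date indicators reported above.
coef_treatment = -0.0159
se_treatment = 0.0066

# Two-sided 95% critical value of the t distribution with 22 residual
# degrees of freedom, taken from a standard t-table (hard-coded assumption).
T_CRIT_DF22 = 2.074

margin = T_CRIT_DF22 * se_treatment
ci_low = coef_treatment - margin
ci_high = coef_treatment + margin
print(f"95% CI for the treatment effect: [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval excludes zero, consistent with the p-value of 0.025, but its upper end sits close to zero, so the plausible effect sizes range from fairly large to very small.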

In [65]:
# results

In [66]:
assert set(results.keys()) == {
    "ex4_avg_oec",
    "ex5_avg_guardrail",
    "ex7_ttest_pvalue",
    "ex9_ttest_pvalue_clicks",
    "ex10_num_obs",
    "ex11_guard_ate",
    "ex11_guard_pvalue",
    "ex11_oec_ate",
    "ex11_oec_pvalue",
    "ex14_se_treatment",
}