<a href="https://colab.research.google.com/github/tmckim/materials-fa24-colab/blob/main/lec_demos/lec17.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Before you start - Save this notebook!

When you open a new Colab notebook from WebCampus (like you hopefully did for this one), you cannot save changes. So it's best to save a copy of the notebook to your personal Drive with `"File > Save a copy in drive..."` **before** you do anything else.

The file will open in a new tab in your web browser and is automatically named something like "**Copy of lec17.ipynb**". You can rename it to just the title of the assignment, "**lec17.ipynb**". Make sure you keep an informative name (like the name of the assignment) so that you know which files to submit back to WebCampus for grading! More instructions on this are at the end of the notebook.


**Where does the notebook get saved in Google Drive?**

By default, the notebook will be copied to a folder called “Colab Notebooks” at the root (home directory) of your Google Drive. If you use this for other courses or personal code notebooks, I recommend creating a folder for this course and then moving the assignments AFTER you have completed them. <br>

I also recommend giving the folder where you save your notebooks (described above) a different name than the folder we create below, which stores the resources you need each time you work through a course notebook. These include any data files you will need, links to the images that appear in the notebook, and the files associated with the autograder for answer checking.<br>
You should select a name other than '**NS499-DataSci-course-materials**'. <br>
This folder gets overwritten with each assignment you work on in the course, so you should **NOT** store your notebooks in this folder that we use for course materials! <br><br>For example, you could create a folder called 'NS499-**notebooks**' or something along those lines.
___

# Import & Setup

In [1]:
# Step 1
# Setup and add files needed to access gdrive
from google.colab import drive                                   # these lines mount your gdrive to access the files we import below
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
# Step 2
# Change directory to the correct location in gdrive (modified way to do this from before)
import os
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/')

In [None]:
# Step 3
# Remove the files that were previously there- we will replace with all the old + new ones for this assignment
!rm -r materials-fa24-colab
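
The first time you run this step, the `materials-fa24-colab` folder may not exist yet, and `rm -r` then prints a "No such file or directory" message. That message is harmless. If you would rather do the cleanup from Python, here is a minimal sketch using `shutil.rmtree` with `ignore_errors=True`, which removes the folder when it is there and does nothing otherwise. The helper name `remove_old_materials` is hypothetical (it is not part of the course code), and the demo runs on a temporary directory rather than on your Drive.

```python
import os
import shutil
import tempfile

def remove_old_materials(folder):
    """Delete `folder` and everything in it; do nothing if it does not exist.
    Returns True when the folder is absent afterwards."""
    shutil.rmtree(folder, ignore_errors=True)
    return not os.path.exists(folder)

# Demo on a throwaway directory, so nothing in your Drive is touched
demo_root = tempfile.mkdtemp()
demo_folder = os.path.join(demo_root, 'materials-fa24-colab')
os.makedirs(os.path.join(demo_folder, 'lec_demos'))

first_call = remove_old_materials(demo_folder)    # folder existed, now removed
second_call = remove_old_materials(demo_folder)   # folder already gone, no error
print(first_call, second_call)
```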

In [None]:
# Step 4
# These lines clone (copy) all the files you will need from where I store the code+data for the course (github)
# Second part of the code copies the files to this location and folder in your own gdrive
!git clone https://github.com/tmckim/materials-fa24-colab '/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-fa24-colab/'

In [5]:
# Step 5
# Change directory into the folder where the resources for this assignment are stored in gdrive (modified way from before)
os.chdir('/content/gdrive/MyDrive/NS499-DataSci-course-materials/materials-fa24-colab/lec_demos/')

In [3]:
# Import packages and other things needed
# Don't change this cell; Just run this cell
# If you restart colab, make sure to run this cell again after the first ones above^

from datascience import *
import numpy as np
import warnings
warnings.simplefilter(action='ignore',category=np.VisibleDeprecationWarning)

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.rcParams["patch.force_edgecolor"] = True

## Learning Objectives ##

- use simulation to conduct hypothesis testing
- review simulation steps and process
- compare two samples using A/B testing
- compare two samples from a randomized controlled (clinical) trial 
- determine whether results are causal


In [6]:
# Read in table of data
births = Table.read_table('baby.csv')

In [None]:
# birth weight in ounces
# days in womb
# age of mother
# height of mother
# weight of mother- pounds
# mother smoking status

In [None]:
# How can we find the average for each group?
means_table = births.select('Birth Weight', 'Maternal Smoker').group('Maternal Smoker', np.average)
means_table

In [21]:
# Now we can manually calculate the difference based on our hypothesis above to see what the value is in the data
means = means_table.column(1)
observed_difference = means.item(1) - means.item(0)
observed_difference

-9.266142572024918

In [7]:
# Define our function - these are just the individual steps we performed above
# we have put them nicely together in a function so we can reuse this bit of code and it is flexible for another example
def difference_of_means(table, numeric_label, category_label):
    """Takes: name of table, column label of numerical variable,
    column label of group-label variable
    Returns: Difference of means of the two groups"""

    #table with the two relevant columns
    reduced = table.select(category_label, numeric_label)

    # table containing group means
    means_table = reduced.group(category_label, np.average)

    # array of group means
    means = means_table.column(1)

    return means.item(1) - means.item(0)

In [8]:
# Run the function
# Function inputs:
    # table = births
    # numeric_label = Birth Weight column
    # category_label = Maternal Smoker column
difference_of_means(births, 'Birth Weight', 'Maternal Smoker')

-9.266142572024918
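
To see what `difference_of_means` computes without the `datascience` library, here is a minimal sketch in plain Python. The six birth weights and their labels are made up for illustration (they are not rows from `baby.csv`), and the helper name `toy_difference_of_means` is hypothetical. It follows the same convention as the function above: the mean of the `True` group minus the mean of the `False` group.

```python
from statistics import mean

def toy_difference_of_means(values, labels):
    """Mean of the values labeled True minus mean of the values labeled False."""
    true_group = [v for v, lab in zip(values, labels) if lab]
    false_group = [v for v, lab in zip(values, labels) if not lab]
    return mean(true_group) - mean(false_group)

# Made-up data: birth weights in ounces and a True/False smoker label
weights = [120, 113, 128, 108, 136, 115]
smoker = [False, False, True, True, False, True]

# True group mean is 117, False group mean is 123, so the difference is -6
toy_diff = toy_difference_of_means(weights, smoker)
print(toy_diff)
```

The sign carries information: a negative value means the `True` group has the lower average.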

# Permutation Test

In [9]:
# Function that simulates one shuffle of the labels and calculates the test statistic
# Again, we did these steps individually, and are just putting them together nicely so we can reuse this code for a new dataset
def one_simulated_difference(table, numeric_label, category_label):
    """Takes: name of table, column label of numerical variable,
    column label of group-label variable
    Returns: Difference of means of the two groups after shuffling labels"""

    # array of shuffled labels
    shuffled_labels = table.sample(with_replacement = False).column(category_label)

    # table of numerical variable and shuffled labels
    shuffled_table = table.select(numeric_label).with_column('Shuffled Label', shuffled_labels)  # this is our simulated data

    return difference_of_means(shuffled_table, numeric_label, 'Shuffled Label')   # computes and returns the test statistic

In [None]:
# Test the function
one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')
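
Shuffling the labels keeps every birth weight and the size of each group the same. Only the pairing between weights and labels changes. The sketch below shows this in plain Python on made-up data. The helper name `toy_simulated_difference` is hypothetical, and unlike the function above it also returns the shuffled labels so you can inspect them.

```python
import random
from statistics import mean

def toy_simulated_difference(values, labels, rng):
    """Shuffle the labels, then return (mean of True group - mean of False group,
    shuffled labels)."""
    shuffled = list(labels)      # copy, so the original labels are left alone
    rng.shuffle(shuffled)        # random permutation: same labels, new order
    true_group = [v for v, lab in zip(values, shuffled) if lab]
    false_group = [v for v, lab in zip(values, shuffled) if not lab]
    return mean(true_group) - mean(false_group), shuffled

# Made-up data, not rows from baby.csv
weights = [120, 113, 128, 108, 136, 115]
smoker = [False, False, True, True, False, True]

rng = random.Random(17)          # fixed seed only so the demo is repeatable
sim_diff, shuffled_labels = toy_simulated_difference(weights, smoker, rng)
print(shuffled_labels, sim_diff)
```

However the shuffle comes out, there are still three `True` and three `False` labels, which is what sampling without replacement guarantees.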

In [13]:
# Simulation loop- under the *null* hypothesis
differences = make_array()

for i in np.arange(2500):
    new_difference = one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')
    differences = np.append(differences, new_difference)

In [None]:
# Create a table from the data array and plot the distribution
diff_tbl = Table().with_column('Simulated Differences', differences)
diff_tbl.hist()

print('Observed Difference:', observed_difference)
plt.title('Prediction Under the Null Hypothesis');

In [None]:
# Calculating the p-value

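# Count simulated differences at or below the observed difference (same <= comparison as the np.count_nonzero version)
p_val = diff_tbl.where('Simulated Differences', are.below_or_equal_to(observed_difference)).num_rows / 2500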
p_val

In [None]:
# Calculating the p-value another way
p_value = np.count_nonzero(differences <= observed_difference) / 2500
p_value
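
The p-value here is a proportion: the share of simulated differences that are at or below the observed one. The left tail is used on the assumption that the alternative hypothesis says babies of smokers weigh less on average. A minimal sketch with ten made-up simulated values (not output from the simulation above) and a hypothetical helper name, `left_tail_p_value`:

```python
def left_tail_p_value(simulated, observed):
    """Proportion of simulated statistics that are <= the observed statistic."""
    count = sum(1 for s in simulated if s <= observed)
    return count / len(simulated)

# Made-up simulated differences and observed value, for illustration only
toy_simulated = [-3.0, 1.5, 0.2, -0.8, 2.1, -1.9, 0.0, 4.2, -2.5, 0.7]
toy_observed = -2.5

# Two of the ten values (-3.0 and -2.5) are <= -2.5
toy_p = left_tail_p_value(toy_simulated, toy_observed)
print(toy_p)
```

Dividing by `len(simulated)` instead of a hard-coded count keeps the calculation correct if you change the number of repetitions.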

# Randomized Controlled Experiment

In [None]:
# Botox treatment for chronic pain
# 1 - improvement in outcome (pain)
# 0 - no change in outcome (pain)
botox = Table.read_table('bta.csv')
botox.show()

In [None]:
# What are the unique combinations of all of these values- would you use group or pivot?
...

In [None]:
# Average value of improvement for the outcome variable
...

# Testing the Hypothesis

In [None]:
# Use the function we defined above to calculate the difference
# We have a different table, and input columns
observed_diff = ...
observed_diff

In [None]:
# Run our permutation function
...

In [None]:
# Simulate many times
simulated_diffs = ...

for ... in ...(10000):
    sim_diff = ...
    simulated_diffs = ...

In [None]:
# Create a table and plot the distribution
col_name = 'Distance between groups'
Table().with_column(col_name, simulated_diffs).hist(col_name)

plt.scatter(observed_diff, -0.03,c='red'); # this adds our red dot for the observed value below the plot

Use a threshold of 1% (0.01) since this is a randomized controlled (clinical) trial.
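
Applying the threshold means comparing the p-value to the cutoff. The sketch below assumes the common convention that a p-value below the cutoff counts as evidence against the null hypothesis. The helper name `decide` is hypothetical, and the example p-values are made up and say nothing about the botox data.

```python
def decide(p_value, cutoff=0.01):
    """Return the test conclusion for a given p-value and cutoff."""
    if p_value < cutoff:
        return 'reject the null hypothesis'
    return 'fail to reject the null hypothesis'

# Made-up p-values, for illustration only
small_p = decide(0.004)   # below the 1% cutoff
large_p = decide(0.08)    # above the 1% cutoff
print(small_p)
print(large_p)
```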

In [None]:
# Calculate p-value
...

In [None]:
# The above code is similar to using np.count_nonzero


What can we conclude?