### Tutorial: Hypothesis Testing for Microtubule Catastrophe

#### Written by Joeyta Banerjee, Rashi Jeeda, and Mei Yi You 

This notebook outlines the process we used for hypothesis testing in the microtubule catastrophe experiment. Specifically, we compare the times to catastrophe for fluorescently labeled and unlabeled tubulin.

In [7]:
%load_ext autoreload
%autoreload 2
import os, sys
# Aliasing packages on import saves writing out the full name every time you call them
import pandas as pd
import holoviews as hv
import bokeh.io
import scipy
import numpy as np
import scipy.stats as st
import MCAT_pkg as mc 

import warnings
warnings.simplefilter('ignore')


In [8]:
mc?

Type:        module
String form: <module 'MCAT_pkg' from 'c:\\users\\rashi jeeda\\documents\\caltech\\classes\\3-1-bebi103a\\mcat\\mcat_pkg\\MCAT_pkg\\__init__.py'>
File:        c:\users\rashi jeeda\documents\caltech\classes\3-1-bebi103a\mcat\mcat_pkg\mcat_pkg\__init__.py
Docstring:
The contents of this package allow users to reproduce our microtubule catastrophe analysis. It includes:
Our methods for parsing the raw data
Methods for performing exploratory analysis
Statistical analysis modules (bootstrapping, hypothesis testing, MLE analysis, and model assessment)


In [9]:
# Identify the location that the data file is in  
data = "data/gardner_time_to_catastrophe_dic_tidy.csv"

# Read in the file as df

df = pd.read_csv(data, comment = "#")

In [10]:
# parse out the labeled and not_labeled data using our function
labeled, not_labeled = mc.separate_categories(df)

We can use mc.categorical_plot to plot ECDFs with confidence intervals for the two samples, so that we can see how much the two datasets overlap.

In [11]:
# plot the ECDFs for each of them with confidence intervals generated by bootstrapping
p = mc.categorical_plot(df, "time to catastrophe (s)", "labeled", conf_int = True, 
                        palette = ["limegreen", "silver"], order = [True, False])
p.background_fill_color = '#fafafa'
bokeh.io.show(p)
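
To make the idea behind a bootstrapped ECDF confidence interval concrete, here is a minimal NumPy sketch. It is not the code inside mc.categorical_plot; the function names `ecdf` and `ecdf_conf_band`, their parameters, and the fixed seed are our own assumptions for illustration.

```python
import numpy as np


def ecdf(data):
    """Return the sorted values and the ECDF height at each value."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


def ecdf_conf_band(data, x_eval, size=1000, ptiles=(2.5, 97.5), seed=0):
    """Percentile bootstrap band for the ECDF evaluated at x_eval.

    Hypothetical helper: resample the data with replacement `size` times,
    evaluate each resampled ECDF at x_eval, and take percentiles pointwise.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    n = len(data)

    reps = np.empty((size, len(x_eval)))
    for i in range(size):
        bs_sample = np.sort(rng.choice(data, size=n, replace=True))
        # Fraction of resampled points <= each evaluation point
        reps[i] = np.searchsorted(bs_sample, x_eval, side="right") / n

    low, high = np.percentile(reps, ptiles, axis=0)
    return low, high
```

With the data from this notebook, one could call `ecdf_conf_band(labeled, np.sort(labeled))` and do the same for `not_labeled`. Regions where the two bands overlap are regions where the data cannot distinguish the two distributions.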

For more quantitative metrics, we can perform a hypothesis test using bootstrapping, and also look at the mean and its confidence interval for each dataset.

In [12]:
# Generate samples (10,000)
bs_reps = mc.draw_bs_reps_test_stat(labeled, not_labeled, size=10000)

# Compute p-value
stat = st.ks_2samp(labeled, not_labeled)[0]
p_val = np.sum(bs_reps >= stat) / len(bs_reps)

print("p-value =", p_val)

p-value = 0.8361
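
For readers without access to the package, the sketch below shows one common way to build such a test: under the null hypothesis that both samples come from the same distribution, pool the data, resample two data sets of the original sizes, and record the Kolmogorov-Smirnov statistic each time. This is an assumption about the general approach, not necessarily how mc.draw_bs_reps_test_stat is implemented; `draw_bs_reps_ks` and the synthetic exponential data are hypothetical.

```python
import numpy as np
import scipy.stats as st


def draw_bs_reps_ks(x, y, size=1000, seed=0):
    """Bootstrap replicates of the two-sample KS statistic under the null
    hypothesis that x and y share one distribution (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate((x, y))

    reps = np.empty(size)
    for i in range(size):
        # Resample both groups from the pooled data, keeping group sizes
        bs_x = rng.choice(pooled, size=len(x), replace=True)
        bs_y = rng.choice(pooled, size=len(y), replace=True)
        reps[i] = st.ks_2samp(bs_x, bs_y)[0]
    return reps


# Synthetic stand-in data, NOT the Gardner measurements
rng = np.random.default_rng(1)
sample_a = rng.exponential(400.0, size=200)
sample_b = rng.exponential(400.0, size=100)

observed = st.ks_2samp(sample_a, sample_b)[0]
reps = draw_bs_reps_ks(sample_a, sample_b, size=200)

# Fraction of replicates at least as extreme as the observed statistic
p_value = np.sum(reps >= observed) / len(reps)
```

A large p-value, like the one above, means the observed difference between the ECDFs is typical of what we would see if labeling had no effect. It does not prove the two distributions are identical.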


In [13]:
p = mc.viz_compare_conf_int(labeled, not_labeled, xlabel = "Time to catastrophe (s)", 
                            label1 = "labeled", label2 = "not labeled")
p.background_fill_color = '#fafafa'
bokeh.io.show(p)

In [14]:
# generate bootstrap replicates for the mean of each data set, then take the mean and confidence intervals
bs_reps_mean_tubulin = mc.draw_bs_reps_mean(labeled, size=10000)
bs_reps_mean_no_tubulin = mc.draw_bs_reps_mean(not_labeled, size = 10000)

mean_tubulin = np.mean(bs_reps_mean_tubulin)
mean_no_tubulin = np.mean(bs_reps_mean_no_tubulin)

conf_int_tubulin = np.percentile(bs_reps_mean_tubulin, [2.5, 97.5])
conf_int_no_tubulin = np.percentile(bs_reps_mean_no_tubulin, [2.5, 97.5])

print('Mean time to catastrophe with labeled tubulin is {} with 95% conf int : {}'.format(mean_tubulin,
                                                                                         conf_int_tubulin))
print('Mean time to catastrophe with nonlabeled tubulin is {} with 95% conf int: {}'.format(mean_no_tubulin,
                                                                                           conf_int_no_tubulin))

Mean time to catastrophe with labeled tubulin is 441.2080947867299 with 95% conf int : [401.39751185 481.94372038]
Mean time to catastrophe with nonlabeled tubulin is 412.0603842105263 with 95% conf int: [353.68289474 476.26447368]
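
The replicates used above can be generated in a few lines of NumPy. The sketch below is our illustration of the standard percentile bootstrap for the mean, assuming that mc.draw_bs_reps_mean does something equivalent; the name `draw_bs_reps_mean_sketch` and the seed argument are hypothetical.

```python
import numpy as np


def draw_bs_reps_mean_sketch(data, size=1000, seed=0):
    """Draw `size` bootstrap replicates of the mean (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Each row holds the indices of one resample with replacement
    idx = rng.integers(0, len(data), size=(size, len(data)))
    return data[idx].mean(axis=1)


# Synthetic stand-in data, NOT the Gardner measurements
rng = np.random.default_rng(2)
times = rng.exponential(400.0, size=150)

bs_means = draw_bs_reps_mean_sketch(times, size=2000)
conf_int = np.percentile(bs_means, [2.5, 97.5])
```

Because the two 95% confidence intervals reported above overlap substantially, the means alone give no evidence of a difference between labeled and unlabeled tubulin, which agrees with the hypothesis test.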


We use the DKW inequality to further compare the two samples and to observe the bounds of their confidence intervals.

In [15]:
p1 = mc.plot_conf_int(labeled, "Confidence Interval for Labeled", "Time to catastrophe (s)")
p1.background_fill_color = '#fafafa'
bokeh.io.show(p1)

In [16]:
p2 = mc.plot_conf_int(not_labeled, "Confidence Interval for Not Labeled", "Time to catastrophe (s)", 
                      color = "dimgray", palette = ["silver"])
p2.background_fill_color = '#fafafa'
bokeh.io.show(p2)
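
The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality bounds how far an ECDF built from n points can stray from the true CDF: with probability at least 1 - alpha, the two differ by no more than epsilon = sqrt(ln(2 / alpha) / (2n)) everywhere. The sketch below computes that band directly. It is a minimal illustration, not the code inside mc.plot_conf_int, and `dkw_band` is a hypothetical name.

```python
import numpy as np


def dkw_band(data, alpha=0.05):
    """Return sorted data, lower and upper DKW bounds, and epsilon."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    ecdf_vals = np.arange(1, n + 1) / n

    # Half-width of the band from the DKW inequality
    epsilon = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

    # A CDF lives in [0, 1], so clip the band to that range
    lower = np.clip(ecdf_vals - epsilon, 0.0, 1.0)
    upper = np.clip(ecdf_vals + epsilon, 0.0, 1.0)
    return x, lower, upper, epsilon


# Synthetic stand-in data, NOT the Gardner measurements
rng = np.random.default_rng(3)
x, lower, upper, epsilon = dkw_band(rng.exponential(400.0, size=100))
```

Since epsilon depends only on n and alpha, the smaller sample always gets the wider band, so a wider band for one sample reflects its size rather than greater variability in the measurements.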

Call help() on any of these functions, for example help(mc.categorical_plot), for more information on the function and its arguments.

In [17]:
%load_ext watermark
%watermark -v -p MCAT_pkg,numpy,scipy,pandas,bokeh,iqplot,tqdm,jupyterlab

CPython 3.7.7
IPython 7.13.0

MCAT_pkg 0.0.1
numpy 1.18.1
scipy 1.4.1
pandas 0.24.2
bokeh 2.2.1
iqplot 0.1.6
tqdm 4.48.0
jupyterlab 1.2.6


Acknowledgements!

We thank the publishers of Gardner et al. for sharing their data, the BeBi103 TAs for their guidance, the makers of Poole for this website template, Rosita Fu and Griffin Chure for design inspiration, and of course, Justin Bois for his assistance, insight, and useful code!
