In [26]:
import pandas as pd 
import numpy as np
import os
import scipy.stats as st

import iqplot
import bokeh.io
import bokeh.plotting
from bokeh.models import Legend
import numba

import bebi103

bokeh.io.output_notebook()

# Aim 1: Data Validation for Microtubule Experiments #
### Comparing Labeled and Unlabeled tubulin performance ###
In the experiment conducted by Gardner et al., microtubules were labeled with fluorescent markers. We investigate whether or not these fluorescent markers influence tubulin performance, determined by time to catastrophe (s). We look at data gathered from unlabeled and labeled tubulin, and focus on three different comparisons: 
1. ECDF of labeled vs unlabeled tubulin
2. Mean time to catastrophe of labeled vs unlabeled tubulin
3. Hypothesis testing assuming identical distributions 

Each of these strategies checks whether or not the labeled or unlabeled tubulin datasets are different in some way. If a significant difference does exist, this means that the fluorescent markers have some sort of impact on microtubule performance. This would also mean that the labeled tubulin used in the subsequent experiments do not accurately depict microtubule behavior. In this way, we hope to validate the data collected by confirming that the fluorescent markers do not influence microtubule performance. <br />

To start our investigation, we read in the dataset and save the values in a tidy data frame. 

In [27]:
data_path = '../datasets'
file_path = os.path.join(data_path, 'gardner_time_to_catastrophe_dic_tidy.csv')

# get rid of the index column when loading data
df = pd.read_csv(file_path).iloc[:, 1:]
# replace True/False with labeled vs unlabeleed
df['labeled'] = df['labeled'].apply(lambda x: 'labeled tubulin' if x else 'unlabeled tubulin')

df

Unnamed: 0,time to catastrophe (s),labeled
0,470.0,labeled tubulin
1,1415.0,labeled tubulin
2,130.0,labeled tubulin
3,280.0,labeled tubulin
4,550.0,labeled tubulin
...,...,...
301,180.0,unlabeled tubulin
302,145.0,unlabeled tubulin
303,745.0,unlabeled tubulin
304,390.0,unlabeled tubulin


### 1. ECDF comparison ###
To determine whether or not microtubule performance is different between labeled and unlabeled tubulin, we first look at the cumulative distributions of the empirical data. If the cumulative distributions occupy the same areas, then the fluorescent markers probably do not have a strong effect on microtubule performance since unlabeled/labeled times to catastrophe are indistinguishable from each other. We use `iqplot` to display the respective ECDFs below and observe whether or not the unlabeled and labeled datasets are identically distributed.

In [28]:
p = iqplot.ecdf(
    data=df,
    q='time to catastrophe (s)', 
    cats='labeled',
    style='staircase',
    conf_int = True
)

bokeh.io.show(p)

By a quick, visual inspection of the plot, it looks like the catastrophe times for microtubules with labeled and unlabeled tubulin could be identically distributed. The confidence interval for the unlabeled tubulin almost always overlaps with the labeled tubulin confidence intervals. <br /> <br />
Since we are using the confidence intervals to check whether or not the datasets overlap, further investigation of confidence interval generation is worth exploring. The confidence intervals above were calculated with bootstrapping, but we can also use Dvoretzky-Kiefer-Wolfowitz Inequality (DKW) to compute confidence intervals for the ECDF. To start, we define an ecdf function that can compute the ecdf at an arbitrary x value. 

In [29]:
def ecdf(x, data):
    """
    This function computes the value of the ECDF built from a 1D array, data, 
    at arbitrary points, x, which can also be an array.
    
    x can be an integer or float, an array of ints, or a list of ints
    """
    data_sorted = np.sort(data)
    
    ecdf_li = []
    
    if type(x) == int or type(x) == float:
        index_tup = np.where(data_sorted <= x)[0]
        if index_tup.size == 0:
            ecdf = 0
        else:
            ecdf = (index_tup[-1] + 1) / len(data)
        ecdf_li.append(ecdf)
    else:
        for value in x:
            index_tup = np.where(data_sorted <= value)[0]
            if index_tup.size == 0:
                ecdf = 0
            else:
                ecdf = (index_tup[-1] + 1) / len(data)
            ecdf_li.append(ecdf)
    
    return np.array(ecdf_li)

The DKW inequality states that for any $\epsilon > 0$,

\begin{align}
P\left(\mathrm{sup}_x \left|F(x) - \hat{F}(x)\right| > \epsilon\right) \le 2\mathrm{e}^{-2 n \epsilon^2},
\end{align}

To calculate the DKW inequality for the microtubule catastrophe data then, we first calculate $\alpha$ which while be used to calculate $\epsilon$. For the 95% confidence interval: <br />
\begin{align}
100*(1-\alpha) & = 95 \\
1-\alpha & = 0.95 \\
0.05 &= \alpha \\
\end{align} <br /> 

Now we create a function to get $\epsilon$ for a given dataset (since $n$ might vary), which I will later use to calculate the upper and lower bounds. I use the calculated $\alpha$ value and the expression:  <br />
\begin{align}
\epsilon &= \sqrt{\frac{1}{2n} \log{\frac{2}{\alpha}}} \\
\end{align} 

In [30]:
alpha = 0.05

def calc_epsilon(data):
    n = len(data)
    x = 1/(2*n)
    l = np.log(2/alpha)
    return np.sqrt(x*l)

Next we create a function that returns the lower bound, given by the expression: <br />
\begin{align}
L(x) = \max\left(0, \hat{F}(x) - \epsilon\right),
\end{align}

In [31]:
def lower_bound(data):
    """
    for a given array of experimental data, 
    this function returns a list of the DKW lower bound values 
    corresponding to the given data points
    """
    ep = calc_epsilon(data) 
    l_bounds = []
    for x in data: 
        lower = np.maximum(0, x-ep)
        l_bounds.append(lower)
    return l_bounds      

Now we create a function that returns the upper bound, given by the expression: <br />
\begin{align}
U(x) = \min\left(1, \hat{F}(x) + \epsilon\right).
\end{align}

In [32]:
def upper_bound(data):
    """
    for a given array of experimental data, 
    this function returns a list of the DKW upper bound values 
    corresponding to the given data points
    """
    ep = calc_epsilon(data) 
    u_bounds = []
    for x in data: 
        upper = np.minimum(1, x+ep)
        u_bounds.append(upper)
    return u_bounds   

Now I want to plot the confidence intervals for the ECDFs of the labeled and unlabeled tubulin times to catastrophe. I use the previously made `unlabeled_tubulin` and `labeled tubulin` arrays, along with the `ecdf` method made in part e to obtain the ecdf values of the two arrays. I use the `lower_bound` and `upper_bound` methods just created to also store the lower and upper bounds for each dataset. <br /><br />
For the **unlabeled** tubulin dataset I create a dataframe that holds all the values needed for plotting. I sort the dataset by the value so I can plot in order.

In [33]:
values_u = []
ecdf_u = []
for x in unlabeled_tubulin:
    values_u.append(x)
    a = ecdf([x], unlabeled_tubulin)
    ecdf_u.append(a[0])
df_unlabeled = pd.DataFrame(data = {"value":values_u,
                                    "ecdf":ecdf_u, 
                                   }
                           )
df_unlabeled = df_unlabeled.sort_values(by = ["value"])
e = df_unlabeled.loc[:, "ecdf"]
lower_u = lower_bound(e)
upper_u = upper_bound(e)

df_unlabeled["lower_bound"] = lower_u
df_unlabeled["upper_bound"] = upper_u

df_unlabeled.head()

Unnamed: 0,value,ecdf,lower_bound,upper_bound
29,40.0,0.010526,0.0,0.149865
47,60.0,0.021053,0.0,0.160391
28,75.0,0.031579,0.0,0.170917
85,80.0,0.042105,0.0,0.181444
61,85.0,0.063158,0.0,0.202496


Now for the **labeled** tubulin dataset I repeat this procedure of creating a dataframe and then plotting. 

In [34]:
values_l = []
ecdf_l = []
for x in labeled_tubulin:
    values_l.append(x)
    a = ecdf([x], labeled_tubulin)
    ecdf_l.append(a[0])
df_labeled = pd.DataFrame(data = {"value":values_l,
                                  "ecdf":ecdf_l, 
                                 }
                         )
df_labeled = df_labeled.sort_values(by = ["value"])
e_l = df_labeled.loc[:, "ecdf"]
lower_l = lower_bound(e_l)
upper_l = upper_bound(e_l)

df_labeled["lower_bound"] = lower_l
df_labeled["upper_bound"] = upper_l

df_labeled.head()

Unnamed: 0,value,ecdf,lower_bound,upper_bound
10,55.0,0.004739,0.0,0.098235
15,60.0,0.009479,0.0,0.102974
34,65.0,0.018957,0.0,0.112453
5,65.0,0.018957,0.0,0.112453
41,75.0,0.023697,0.0,0.117192


Finally, I create the final plot which overlays the two ecdf's with their respective confidence intervals, calculated with the DKW lower/upper bound expressions. 

In [35]:
p2 = bokeh.plotting.figure(
    width=800,
    height=400,
    x_axis_label="time to catastrophe",
    y_axis_label="ecdf",
    title = "unlabeled vs. labeled tubulin",
)

e = p2.line(source = df_unlabeled, x = "value", y = "ecdf",
             color = "#b2abd2", line_width = 3, alpha = 0.7)

l = p2.circle(source = df_unlabeled, x = "value", y = "lower_bound",
             color = "#5e3c99", alpha = 0.5)

u = p2.circle(source = df_unlabeled, x = "value", y = "upper_bound",
             color = "#5e3c99", alpha = 0.5)

e_l = p2.line(source = df_labeled, x = "value", y = "ecdf",
               color = "#fdb863", line_width = 3, alpha = 0.7)

l_l = p2.circle(source = df_labeled, x = "value", y = "lower_bound",
               color = "#e66101", alpha = 0.3)

u_l = p2.circle(source = df_labeled, x = "value", y = "upper_bound",
               color = "#e66101", alpha = 0.3)


legend = Legend(items=[("unlabeled ecdf"   , [e]),
                       ("unlabeled lower bound" , [l]),
                       ("unlabeled upper bound" , [u]),
                       ("labeled ecdf"   , [e_l]),
                       ("labeled lower bound" , [l_l]),
                       ("labeled upper bound" , [u_l]),], location="center")

p2.add_layout(legend, 'right')
p2.legend.click_policy = "hide"

bokeh.io.show(p2)

The purple dots here show the unlabeled tubulin bounds while the orange dots show the labeled tubulin bounds, and the lines show the ecdf values. <br />
Comparing the lower bounds, the orange and purple dots seem to follow the same trajectory. This means that the unlabeled and labeled tubulin lower bound values are very similar. <br />
Comparing the upper bounds, the labeled values the purple dots are noticeably above the orange dots. This means that the unlabeled upper bound values are slightly higher than the labeled upper bound values, though the shape of the bounds is the same. <br />
Though the upper bound values are not as aligned as the lower bound values, it is still reasonable to conclude that these confidence intervals are very similar. Therefore, this quick visual check supports the hypothesis that microtubule times to catastrophe are identically distributed between unlabeled and labeled tubulin. <br /><br /> 
This conclusion matches what we found from the iqplot calculations.

### 2. Mean time to catastrophe comparison ###
Next, we compare the mean times to catastrophe between labeled and unlabeled tubulin to detect any possible differences in performance. If the mean times are close to each other, there is more reason to believe that the fluorescent markers do not affect microtubule performance. To check this, we use nonparametric bootstrapping to compute confidence intervals for the plug-in estimate for the mean time to catastrophe for each of the two conditions. First we define some functions to calculate our bootstrap replicate for each bootstrap sample. 

In [36]:
rg = np.random.default_rng()

# set up numpy arrays with values for the labeled and unlabeled tubulin
unlabeled_tubulin = df.loc[df['labeled'] == 'unlabeled tubulin', 'time to catastrophe (s)'].values
labeled_tubulin = df.loc[df['labeled'] == 'labeled tubulin', 'time to catastrophe (s)'].values

def generate_bootstrap_samples(data):
    """Draw N bootstrap samples from a 1D data set."""
    return rg.choice(data, size=len(data))

def bootstrap_reps_mean(data, N=1):
    """Draw boostrap replicates of the mean from 1D data set."""
    means = np.empty(N)
    for i in range(N):
        means[i] = np.mean(generate_bootstrap_samples(data))
    return means

Now we can generate 100,000 bootstrap samples for both the labeled and unlabeled and calculate the plug-in estimate for the mean.

In [37]:
unlabeled_means = bootstrap_reps_mean(unlabeled_tubulin, N=100000)
labeled_means = bootstrap_reps_mean(labeled_tubulin, N=100000)

unlabeled_mean_conf_int = np.percentile(unlabeled_means, [2.5, 97.5])
labeled_mean_conf_int = np.percentile(labeled_means, [2.5, 97.5])

print(f"Unlabeled tubulin time to catastrophe(s) confidence interval: [{unlabeled_mean_conf_int[0]:.2f}, {unlabeled_mean_conf_int[1]:.2f}]")
print(f"Labeled tubulin time to catastrophe(s) confidence interval: [{labeled_mean_conf_int[0]:.2f}, {labeled_mean_conf_int[1]:.2f}]")

Unlabeled tubulin time to catastrophe(s) confidence interval: [353.68, 476.89]
Labeled tubulin time to catastrophe(s) confidence interval: [401.90, 481.54]


We use the `bebi103` package to visually display these confidence intervals.

In [38]:
labeled_mean = labeled_tubulin.mean()
unlabeled_mean = unlabeled_tubulin.mean()

summaries = [
    dict(label = "unlabeled tubulin", estimate = unlabeled_mean, 
         conf_int = unlabeled_mean_conf_int),
    dict(label = "labeled tubulin", estimate = labeled_mean, 
         conf_int = labeled_mean_conf_int)
]
bokeh.io.show(
    bebi103.viz.confints(summaries)
)

The confidence intervals of the two categories have significant overlap. This calculation supports the previous conclusion from the ecdf since there is not a clear difference in microtubule performance between labeled and unlabeled samples. <br /> <br />
Again, since we are using the confidence intervals to check value overlaps it is worth double checking that our confidence interval generation is appropriate. In this case, we can double check our confidence intervals with a theoretical distribution rather than the empirical distribution. Specifically, we can use the normal distribution by the central limit theorem: 
\begin{align}
&\mu = \bar{x},\\[1em]
&\sigma^2 = \frac{1}{n(n-1)}\sum_{i=1}^n (x_i - \bar{x})^2,
\end{align}

We define a function to calculate the variance of the data set using this theoretical equation.

In [39]:
def calc_variance(data_array):
    """This function calculates the variance of a 1D data array"""
    n = data_array.size
    mean = data_array.mean()
    numer = 0
    
    for i in range(n):
        numer += (data_array[i] - mean) ** 2
    
    denom = n * (n-1)
    
    return numer/denom

Now we perform the calculation and visualize the confidence intervals. 

In [40]:
unlabeled_variance = calc_variance(unlabeled_tubulin)
labeled_variance = calc_variance(labeled_tubulin)

labeled_conf1 = st.norm.ppf(0.025, loc=labeled_mean, scale=np.sqrt(labeled_variance))
labeled_conf2 = st.norm.ppf(0.975, loc=labeled_mean, scale=np.sqrt(labeled_variance))

unlabeled_conf1 = st.norm.ppf(0.025, loc=unlabeled_mean, scale=np.sqrt(unlabeled_variance))
unlabeled_conf2 = st.norm.ppf(0.975, loc=unlabeled_mean, scale=np.sqrt(unlabeled_variance))

print(f"Unlabeled tubulin time to catastrophe(s) confidence interval: [{unlabeled_conf1:.2f}, {unlabeled_conf2:.2f}]")
print(f"Labeled tubulin time to catastrophe(s) confidence interval: [{labeled_conf1:.2f}, {labeled_conf2:.2f}]")

summaries = [
    dict(label = "unlabeled tubulin", estimate = unlabeled_mean, 
         conf_int = [unlabeled_conf1, unlabeled_conf2]),
    dict(label = "labeled tubulin", estimate = labeled_mean, 
         conf_int = [labeled_conf1, labeled_conf2])
]
bokeh.io.show(
    bebi103.viz.confints(summaries)
)

Unlabeled tubulin time to catastrophe(s) confidence interval: [350.84, 474.21]
Labeled tubulin time to catastrophe(s) confidence interval: [400.74, 480.68]


When comparing the confidence interval calculation from the theoretical distribution to the confidence interval derived from the empirical distribution, we can see that the confidence intervals are very similar. Again, there does not seem to be a significant difference between the times to catastrophe between the unlabeled and labeled tubulin.

### 3. Hypothesis testing assuming identical distributions ###
Next, we use a permutation hypothesis test to test the hypothesis that the distribution of catastrophe times for microtubules with labeled tubulin is the same as that for unlabeled tubulin.

#### Step 1: State the null hypothesis.
> The null hypothesis is that the time to catastrophe for labeled and unlabeled tubulin are identically distributed.

#### Step 2: Define a test statistic.
> For our first NHST experiment, the test statistic that will be used is difference in means. This test statistic will offer a good comparison to results acquired in part a and b, when we compared the confidence intervals of the means of the two categories. <br /> <br />
> For our second NHST experiment, the test statistic that will be used is difference in variance. Time to catastrophe can be modeled by a combination of exponential processes (as shown in HW6), and we know that we can get a normal approximation of the distribution using the equating moments or taylor expansion method. Since a normal distribution can be described by the mean and variance, we decided to also compare the difference of variance as a test statistic since we are interested in the question of if the labeled and unlabeled tubulin time to catastrophe come from the same distribution.  <br /> <br />
> For our third NHST experiment, the test statistic that will be used is difference in medians. We decided to conduct this test to compare a different parameter that can describe a distribution to offer more information about if the two categories come from the same distribution. This was mainly done out of curiosity.

#### Step 3: Simulate data acquisition for the scenario where the null hypothesis is true, many many times.

> We will concatenate the two data sets, randomly shuffle them, designate the first entries in the shuffled array to be a “labeled” data set and the rest to be a “unlabeled” data set. Our null hypothesis posits that both the labeled and unlabeled tubulin catastrophe times come from the same distribution, so our concatenated data set will include all points from both categories.

In [41]:
@numba.njit
def generate_perm_sample(x, y):
    """Generate a permutation sample."""
    combined_data = np.concatenate((x, y))
    np.random.shuffle(combined_data)
    half_index = int(combined_data.size / 2)

    return combined_data[:half_index], combined_data[half_index:]

#### Step 4a: (NHST Experiment 1) Compute the p-value (the fraction of simulations for which the test statistic (diff in mean) is at least as extreme as the test statistic computed from the measured data). Do this 10 million times.

In [42]:
@numba.njit
def generate_perm_reps_diff_mean(x, y, N):
    """Generate array of permuation replicates."""
    out = np.empty(N)
    for i in range(N):
        x_perm, y_perm = generate_perm_sample(x, y)
        out[i] = np.mean(x_perm) - np.mean(y_perm)

    return out

# Compute test statistic for original data set
diff_mean = np.mean(labeled_tubulin) - np.mean(unlabeled_tubulin)

# Draw replicates
perm_reps = generate_perm_reps_diff_mean(labeled_tubulin, unlabeled_tubulin, 10000000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_mean) / len(perm_reps)

print('NHST experiment 1: Difference in mean p-value =', p_val)

NHST experiment 1: Difference in mean p-value = 0.2058407


#### Step 4b: (NHST Experiment 2) Compute the p-value (the fraction of simulations for which the test statistic (diff in variance) is at least as extreme as the test statistic computed from the measured data). Do this 10 million times.

In [43]:
@numba.njit
def generate_perm_reps_diff_variance(x, y, N):
    """Generate array of permuation replicates."""
    out = np.empty(N)
    for i in range(N):
        x_perm, y_perm = generate_perm_sample(x, y)
        out[i] = np.var(x_perm) - np.var(y_perm)

    return out

# Compute test statistic for original data set
diff_variance = np.var(labeled_tubulin) - np.var(unlabeled_tubulin)

# Draw replicates
perm_reps = generate_perm_reps_diff_variance(labeled_tubulin, unlabeled_tubulin, 10000000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_variance) / len(perm_reps)

print('NHST experiment 2: Difference in variance p-value =', p_val)

NHST experiment 2: Difference in variance p-value = 0.5942287


#### Step 4c: (NHST Experiment 3) Compute the p-value (the fraction of simulations for which the test statistic (diff in median) is at least as extreme as the test statistic computed from the measured data). Do this 10 million times.

In [44]:
@numba.njit
def generate_perm_reps_diff_median(x, y, N):
    """Generate array of permuation replicates."""
    out = np.empty(N)
    for i in range(N):
        x_perm, y_perm = generate_perm_sample(x, y)
        out[i] = np.median(x_perm) - np.median(y_perm)

    return out

# Compute test statistic for original data set
diff_median = np.median(labeled_tubulin) - np.median(unlabeled_tubulin)

# Draw replicates
perm_reps = generate_perm_reps_diff_median(labeled_tubulin, unlabeled_tubulin, 10000000)

# Compute p-value
p_val = np.sum(perm_reps >= diff_median) / len(perm_reps)

print('NHST experiment 3: Difference in median p-value =', p_val)

NHST experiment 3: Difference in median p-value = 0.4767821


The p-value is 0.21 where the test statistic is the difference in means. <br />
The p-value is 0.59 where the test statistic is the difference in variance. <br />
The p-value is 0.48 where the test statistic is the difference in medians. <br />
This means that the probability of getting a difference of means, variances, and medians, respectively, as extreme as was observed under the null hypothesis is relatively high (the null hypothesis being that the control and test samples were drawn from identical distribution). Although this result does not confirm that we can reject the null hypothesis (p-value does not represent the probability that a given hypothesis is "true"), it supports the findings in the previous sections where we do not observe a strong influence of fluorescent markers on microtubule performance. 