In [1]:
import os

import pandas as pd 
import numpy as np
import scipy
import scipy.optimize
import scipy.special
import scipy.stats as st
import tqdm

import warnings

import bebi103
import iqplot


import bokeh.io
bokeh.io.output_notebook()



# Aim 2 continued: models for different tubulin concentrations #
### Does tubulin concentration influence microtubule behavior? ###
To further explore the possible models for microtubule catastrophe, we now look at times to catastrophe for microtubules that grew in environments with different tubulin concentrations. We start by reading in this data and creating some initial visualizations. 

In [2]:
data_path = '../datasets'
file_path = os.path.join(data_path, "gardner_mt_catastrophe_only_tubulin.csv")
df = pd.read_csv(file_path, header=9)
df = pd.melt(df).dropna()
df.rename(columns={'variable':'concentration', 'value':'time (s)'}, inplace=True)
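The tidying step above can be illustrated on a toy table. This is a minimal sketch, assuming the CSV is laid out with one column per concentration and padded with missing values; the numbers below are made up for illustration and are not from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one column per concentration, padded with NaN
wide = pd.DataFrame(
    {
        "7 uM": [25.0, 40.0, 40.0],
        "9 uM": [35.0, 50.0, np.nan],
    }
)

# Melt to one row per measurement and drop the padding rows
tidy = pd.melt(wide).dropna()
tidy = tidy.rename(columns={"variable": "concentration", "value": "time (s)"})

# Number of observations per concentration
counts = tidy.groupby("concentration").size()
print(tidy)
print(counts)
```

Counting with `groupby(...).size()` is one alternative to looping over hard-coded concentration names.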

To look at how big the groupings are, we check the number of observations per concentration. 

In [3]:
for name in ['7 uM', '9 uM', '10 uM', '12 uM', '14 uM']: 
    group = df[df['concentration'] == name]
    print(name, len(group))

7 uM 608
9 uM 255
10 uM 224
12 uM 692
14 uM 141


There is a similar number of 9 uM and 10 uM measurements (255 and 224). The 7 uM and 12 uM groups (608 and 692) are more than twice as large, and the 14 uM group is the smallest (141). These differences might reflect conditions that favor observable catastrophe events, but could equally be due to the experimental design. <br /><br />
We next use a stripbox plot to visualize how catastrophe times differ amongst the tubulin concentrations. 

In [4]:
p_stripbox = iqplot.stripbox(data=df, 
                             q="time (s)", 
                             order=['7 uM', '9 uM', '10 uM', '12 uM', '14 uM'],
                             cats="concentration", 
                             title="Catastrophe times by concentration")

bokeh.io.show(p_stripbox)

Looking at the medians in the box plots, the larger tubulin concentrations have slightly longer catastrophe times. However, this trend is difficult to see from the data points themselves. <br /> <br />
To better distinguish these differences, we next plot the ECDFs for the different concentrations. 

In [5]:
p = iqplot.ecdf(
        data = df,
        q = "time (s)",
        order=['7 uM', '9 uM', '10 uM', '12 uM', '14 uM'],
        cats = 'concentration',
        style = "staircase",
        title="Time to Catastrophe ECDF", 
        conf_int=True
) 


bokeh.io.show(p)

The ECDF plots show that the different concentrations have fairly different distributions, unlike what we saw with labeled vs unlabeled tubulin. This suggests that tubulin concentration **does** influence microtubule catastrophe. Furthermore, the plot above supports the trend we saw earlier: larger concentrations of tubulin lead to longer times to catastrophe. 

## MLE of parameters for the 12uM tubulin environment ## 
Now that we have established this dependence on tubulin concentration, we can investigate whether or not this dependence affects the fit of the proposed models to the empirical data. Previously, we found that both of our models captured our empirical dataset in the unlabeled vs labeled experiments. Here, we focus on the times to catastrophe for a higher concentration environment (12uM tubulin) and see if we observe the same results. 

### 1. Gamma distribution ###
We start by creating functions for MLE calculation, including a log-likelihood function and an optimizing function. 

In [6]:
def log_like_gamma(log_params, y):
    log_alpha, log_beta = log_params
    
    #convert log parameters
    alpha = np.exp(log_alpha)
    beta = np.exp(log_beta)
    
    return np.sum(scipy.stats.gamma.logpdf(y, alpha, loc=0, scale=1/beta))

def mle_gamma(y):
    opt = scipy.optimize.minimize(
            fun = lambda log_params, y: -log_like_gamma(log_params, y),
            x0 = np.array([1,1]),
            args = (y,),
            method = "BFGS"
            )
    return np.exp(opt.x)
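One way to gain confidence in an MLE routine like this is to run it on synthetic data with known parameters. The sketch below is not the notebook's `mle_gamma`: it is an illustrative variant (`mle_gamma_moment_start`, a hypothetical helper) that starts the optimizer from method-of-moments estimates and uses Nelder-Mead, then cross-checks against `scipy.stats.gamma.fit` with the location fixed at zero. The "true" parameters are made up.

```python
import numpy as np
import scipy.optimize
import scipy.stats as st


def log_like_gamma(log_params, y):
    """Gamma log-likelihood, parametrized by log(alpha) and log(beta)."""
    alpha, beta = np.exp(log_params)
    return np.sum(st.gamma.logpdf(y, alpha, loc=0, scale=1 / beta))


def mle_gamma_moment_start(y):
    """Gamma MLE started from method-of-moments estimates (illustrative helper)."""
    mean, var = np.mean(y), np.var(y)
    # Moment estimates: alpha ~ mean^2 / var, beta ~ mean / var
    x0 = np.log([mean**2 / var, mean / var])
    opt = scipy.optimize.minimize(
        fun=lambda log_params: -log_like_gamma(log_params, y),
        x0=x0,
        method="Nelder-Mead",
    )
    return np.exp(opt.x)


# Synthetic data with known (made-up) parameters
rng = np.random.default_rng(3252)
alpha_true, beta_true = 3.0, 0.008
y_sim = rng.gamma(alpha_true, 1 / beta_true, size=5000)

alpha_hat, beta_hat = mle_gamma_moment_start(y_sim)

# Independent cross-check with SciPy's built-in fitter (location fixed at 0)
alpha_ref, _, scale_ref = st.gamma.fit(y_sim, floc=0)
beta_ref = 1 / scale_ref

print(alpha_hat, beta_hat)
print(alpha_ref, beta_ref)
```

Starting near the optimum matters here because the initial guess `[1, 1]` in log space corresponds to a rate far from the scale of these data.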

Next we perform the MLE calculation on our dataset for the 12uM environment. 

In [7]:
y = df[df['concentration'] == '12 uM']['time (s)']

a_mle, b_mle = mle_gamma(y)

print("α MLE: ", a_mle)
print("β MLE: ", b_mle)

α MLE:  2.9152793621913
β MLE:  0.007660623571445881


Next we create a plot to visualize our proposed gamma model with the MLE parameters. 

In [8]:
rg = np.random.default_rng()
values = rg.gamma(a_mle, 1/b_mle, size=len(df[df['concentration'] == '12 uM']))

p = iqplot.ecdf(
        data = df[df['concentration'] == '12 uM'],
        q = "time (s)",
        cats = 'concentration',
        style = "staircase",
        title="Time to Catastrophe ECDF vs Gamma Model", 
        conf_int=True
) 


p_theoretical = iqplot.ecdf(
    data = values, 
    style = "staircase",
    p=p,
    palette=["#fdbb84"]
)

bokeh.io.show(p_theoretical)



A first plot comparing the data with a single draw from the MLE gamma model shows that the model performs reasonably well, but we need to draw many samples from the MLE-parameterized model to see the range of ECDFs it predicts. Consequently, we create a function that allows us to draw many samples from our MLE gamma distribution.

In [9]:
def draw_gamma(alpha, beta, size):
    # Use the function arguments rather than the global a_mle and b_mle
    return rg.gamma(alpha, 1/beta, size=size)

gamma_model_samples = np.array(
    [draw_gamma(a_mle, b_mle, size=len(df[df['concentration'] == '12 uM'])) for _ in range(100000)]
)

To see more clearly where the model and the data deviate, we plot the ECDFs after subtracting the MLE model's ECDF values. 

In [10]:
p = bebi103.viz.predictive_ecdf(
    samples=gamma_model_samples, data=df[df['concentration'] == '12 uM']['time (s)'], diff='ecdf', discrete=True, x_axis_label="n", color='orange', data_color='blue'
)

bokeh.io.show(p)

We see that the data (blue) fall outside the model's 95% envelope near n = 600 (the x-axis, labeled n, is the time to catastrophe in seconds). 
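The idea behind such a difference plot can be sketched with plain NumPy. This is not the `bebi103` implementation; it is a minimal illustration, assuming the difference is taken relative to the median of the model ECDFs. The data and model draws are made-up gamma samples, and `ecdf_at` is a hypothetical helper.

```python
import numpy as np


def ecdf_at(sample, x):
    """Evaluate the ECDF of `sample` at the points `x`."""
    return np.searchsorted(np.sort(sample), x, side="right") / len(sample)


rng = np.random.default_rng(0)

# Made-up stand-ins: "data" and model draws from the same gamma distribution
data = rng.gamma(3.0, 125.0, size=500)
model_samples = rng.gamma(3.0, 125.0, size=(1000, 500))

x = np.linspace(0, data.max(), 200)
model_ecdfs = np.array([ecdf_at(s, x) for s in model_samples])

# Median and central 95% envelope of the model ECDFs at each x
low, median, high = np.percentile(model_ecdfs, [2.5, 50, 97.5], axis=0)

# Difference-plot quantities: everything relative to the model median
data_diff = ecdf_at(data, x) - median
env_low, env_high = low - median, high - median

# Fraction of grid points where the data leave the 95% envelope
frac_outside = np.mean((data_diff < env_low) | (data_diff > env_high))
print(frac_outside)
```

Regions where the data curve leaves the envelope are where the model fails to generate data sets that look like the observations.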

### 2. Poisson process ###
Again, we use a two-exponential model (the sum of two exponentially distributed waiting times) to capture our Poisson process. We create functions for MLE calculation, including a log-likelihood function and an optimizing function. 

In [11]:
# 2 Exponential model
def log_like_two_exp(params, y):
    b1, b2 = params
    
    # Rates must be strictly positive (a zero rate would divide by zero below)
    if b1 <= 0 or b2 <= 0: 
        return -np.inf
    
    ## If b1 == b2, the model reduces to a Gamma distribution with alpha = 2 and rate beta = b1
    if b1 == b2 : 
        logx = st.gamma.logpdf(y, 2, loc=0, scale=1/b1)
        log_like = np.sum(logx)
    else : 
        logx1 = st.expon.logpdf(y, loc=0, scale=1/b1)
        logx2 = st.expon.logpdf(y, loc=0, scale=1/b2)

        lse_coeffs = np.tile([b2 / (b2-b1), -b1 / (b2-b1)], [len(y), 1]).transpose()

        log_likes = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)
        log_like = np.sum(log_likes) 
    
    return log_like

def mle_iid_two_exp(y, params_guess):
    """Perform maximum likelihood estimates for the rates b1 and b2 of the
    two-exponential model from i.i.d. measurements."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, y: -log_like_two_exp(params, y),
            x0=np.array(params_guess),
            args=(y,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)
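The special case for equal rates can be checked numerically: as `b2` approaches `b1`, the two-exponential log PDF should approach the log PDF of a Gamma distribution with shape 2 and rate `b1`. The sketch below uses made-up times and a made-up rate; `log_pdf_two_exp` is a hypothetical per-point version of the log-likelihood above.

```python
import numpy as np
import scipy.special
import scipy.stats as st


def log_pdf_two_exp(t, b1, b2):
    """Log PDF of the sum of two exponential waiting times with rates b1 != b2."""
    logx1 = st.expon.logpdf(t, loc=0, scale=1 / b1)
    logx2 = st.expon.logpdf(t, loc=0, scale=1 / b2)
    # Signed weights of the two exponential terms, broadcast across time points
    coeffs = np.array([[b2 / (b2 - b1)], [-b1 / (b2 - b1)]])
    return scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=coeffs)


t = np.array([10.0, 100.0, 400.0, 1000.0])  # made-up times in seconds
b = 0.005  # made-up rate in 1/s

# Nearly equal rates versus the exact Gamma(2, b) limit
log_pdf_close = log_pdf_two_exp(t, b, b * (1 + 1e-4))
log_pdf_gamma = st.gamma.logpdf(t, 2, loc=0, scale=1 / b)

max_abs_diff = np.max(np.abs(log_pdf_close - log_pdf_gamma))
print(max_abs_diff)
```

Because the weights grow like 1/(b2 - b1), the difference of the two terms loses precision as the rates get closer, which is one reason an optimizer can behave poorly near b1 = b2.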
        

Next we perform the MLE calculation for the 12uM concentration values. 

In [12]:
two_exp_mle = mle_iid_two_exp(y, params_guess=[b_mle, b_mle+1])

print("β1 MLE: ", two_exp_mle[0])
print("β2 MLE: ", two_exp_mle[1])

β1 MLE:  0.005252582386427601
β2 MLE:  0.005264081593813873


Now we can plot the CDF values with the ECDF values to compare. 

In [13]:
def analyticalCDF(t, B1, B2) : 
    """
    Return the CDF of the sum of two independent exponential waiting times (requires B1 != B2).
    
    parameters
    ---
    t = number or array of numbers corresponding to the timepoints to obtain CDF for
    B1 = number, the rate of Poisson process 1
    B2 = number, the rate of Poisson process 2
    
    returns
    ---
    CDF_values = a number or list of numbers corresponding to the CDF values of the provided t values. 
    """
    return ((B1 * B2) / (B2 - B1)) * ((1-np.exp(-B1 * t))/B1 - (1-np.exp(-B2 * t))/B2)

y = df[df['concentration'] == '12 uM']['time (s)']
n_theor = np.arange(0, y.max()+1)
two_exp_analytical = analyticalCDF(n_theor, two_exp_mle[0], two_exp_mle[1])


p = iqplot.ecdf(
    data = df[df['concentration'] == '12 uM'],
    q = "time (s)",
    cats = 'concentration',
    style = "staircase",
    title="Time to Catastrophe ECDF vs 2 Exp Model", 
    conf_int=True
) 


n_plot, cdf_plot = bebi103.viz.cdf_to_staircase(n_theor, two_exp_analytical)
p.line(n_plot, cdf_plot, line_color='orange', line_width=2)
bokeh.io.show(p)
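The analytical CDF can be sanity-checked against simulation. The sketch below, with made-up rates, compares the formula to the ECDF of simulated sums of two exponential waiting times; `two_exp_cdf` is a hypothetical stand-in that mirrors `analyticalCDF`.

```python
import numpy as np


def two_exp_cdf(t, b1, b2):
    """CDF of the sum of two exponential waiting times with rates b1 != b2."""
    return (b1 * b2 / (b2 - b1)) * (
        (1 - np.exp(-b1 * t)) / b1 - (1 - np.exp(-b2 * t)) / b2
    )


rng = np.random.default_rng(1)
b1, b2 = 0.004, 0.01  # made-up rates in 1/s
n = 200_000

# Simulate two successive exponential waiting times and sort for the ECDF
times = np.sort(rng.exponential(1 / b1, size=n) + rng.exponential(1 / b2, size=n))

t_grid = np.linspace(0, 2000, 401)
ecdf = np.searchsorted(times, t_grid, side="right") / n
cdf = two_exp_cdf(t_grid, b1, b2)

# Largest vertical gap between simulated ECDF and analytical CDF
max_gap = np.max(np.abs(ecdf - cdf))
print(max_gap)
```

A small gap here supports the formula, but it says nothing about whether the model describes the experimental data.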

A first plot comparing the ECDF with the analytical CDF of the MLE two-exponential model shows that the model performs reasonably well, but not as well as the Gamma model appeared to. To confirm, we draw many samples from our MLE two-exponential distribution. 

In [14]:
def gen_two_exp(b1, b2, size):
    """
    Draw samples of the sum of two exponential waiting times with rates b1 and b2.
    """
    t1 = rg.exponential(1/b1, size=size)
    t2 = rg.exponential(1/b2, size=size)
    return t1+t2

two_exp_model_samples = np.array(
    [gen_two_exp(two_exp_mle[0], two_exp_mle[1], size=len(df[df['concentration'] == '12 uM'])) for _ in range(100000)]
)

To see more clearly where the model and the data deviate, we plot the ECDFs after subtracting the MLE model's ECDF values. 

In [15]:
p = bebi103.viz.predictive_ecdf(
    samples=two_exp_model_samples, data=df[df['concentration'] == '12 uM']['time (s)'], diff='ecdf', discrete=True, x_axis_label="n", color='orange', data_color='blue'
)

bokeh.io.show(p)

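We see that the data (blue) fall outside the model's 95% envelope around n=300 and n=600. <br /> <br />
Comparing the Gamma model with the two-exponential Poisson model, we can see that the Gamma model fits the 12 uM data better. This may be because the Gamma model lets us fit the number of processes (the alpha parameter) instead of fixing it at 2, even though the two-exponential model allows the two rates to differ. Note that the two MLE rates are nearly equal (β1 ≈ 0.00525, β2 ≈ 0.00526), so the fitted two-exponential model is effectively a Gamma distribution with alpha = 2, whereas the Gamma MLE gives alpha ≈ 2.9. 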

## Further analysis with the favored model: Gamma distribution ##
Now that we have determined the model that seems to capture the dataset the best, we can perform further analysis of the microtubule times to catastrophe under different tubulin concentration environments. We start by obtaining the MLEs for each concentration. 

In [16]:
gamma_params = {}
for name in ['7 uM', '9 uM', '10 uM', '12 uM', '14 uM']: 
    group = df[df['concentration'] == name]
    y = group['time (s)']
    a_mle, b_mle = mle_gamma(y)
    
    gamma_params[name] = (a_mle, b_mle)
    
    print(name)
    print("α MLE: ", a_mle)
    print("β MLE: ", b_mle)
    
    gamma_model_samples = np.array([draw_gamma(a_mle, b_mle, size=len(y)) for _ in range(100000)])
    
    p = bebi103.viz.predictive_ecdf(
        samples=gamma_model_samples, data=y, discrete=True, x_axis_label="n", color='orange', data_color='blue'
    )

    bokeh.io.show(p)
    
    p = bebi103.viz.predictive_ecdf(
        samples=gamma_model_samples, data=y, diff='ecdf', discrete=True, x_axis_label="n", color='orange', data_color='blue'
    )

    bokeh.io.show(p)
    
    print('========================')

7 uM
α MLE:  2.4439098997770685
β MLE:  0.0075502906587536


9 uM
α MLE:  2.679863680538916
β MLE:  0.008779101026996812


10 uM
α MLE:  3.2108230199632306
β MLE:  0.009029809941070716


12 uM
α MLE:  2.9152793621913
β MLE:  0.007660623571445881


14 uM
α MLE:  3.3615058966445193
β MLE:  0.007174876425757962




To visualize the confidence intervals for alpha and beta, we can bootstrap samples from our MLE gamma distribution for each concentration. We create methods for this parametric bootstrapping. 

In [17]:
def draw_bs_sample(alpha, beta, size):
    """Draw a parametric bootstrap sample from a gamma distribution."""
    return rg.gamma(alpha, 1/beta, size=size)


def draw_bs_reps_params(alpha, beta, n_measurements, size=10000, progress_bar = False):
    """Draw bootstrap replicates of alpha and beta.""" 
    if progress_bar:
        iterator = tqdm.tqdm(range(size))
    else:
        iterator = range(size)
    
    out_a = []
    out_b = []
    for _ in iterator:
        y = draw_bs_sample(alpha, beta, n_measurements)
        a, b = mle_gamma(y)
        out_a.append(a)
        out_b.append(b)
    return np.array(out_a), np.array(out_b)
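The parametric bootstrap described above can be sketched end to end on synthetic data. To keep the example fast, this sketch swaps in `scipy.stats.gamma.fit` (location fixed at zero) for the notebook's `mle_gamma` and uses only 200 replicates; the "observed" data are made up, and `fit_gamma` is a hypothetical helper.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)

# Made-up "observed" data; in the notebook this would be one concentration's times
y_obs = rng.gamma(3.0, 1 / 0.008, size=300)


def fit_gamma(y):
    """Gamma MLE (alpha, beta) via SciPy's fitter with the location fixed at 0."""
    alpha, _, scale = st.gamma.fit(y, floc=0)
    return alpha, 1 / scale


alpha_hat, beta_hat = fit_gamma(y_obs)

# Parametric bootstrap: simulate from the fitted model, then refit each data set
reps = np.array(
    [
        fit_gamma(rng.gamma(alpha_hat, 1 / beta_hat, size=len(y_obs)))
        for _ in range(200)
    ]
)

# Percentile 95% confidence intervals for alpha and beta
alpha_ci = np.percentile(reps[:, 0], [2.5, 97.5])
beta_ci = np.percentile(reps[:, 1], [2.5, 97.5])
print(alpha_hat, alpha_ci)
print(beta_hat, beta_ci)
```

Each replicate has the same size as the observed data set, so the spread of the refitted parameters reflects the sampling variability expected at that sample size.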

Next we calculate the confidence intervals for each concentration. 

In [None]:
alpha_summaries = []

beta_summaries = []

for name in ['7 uM', '9 uM', '10 uM', '12 uM', '14 uM']: 
    group = df[df['concentration'] == name]
    print(name, gamma_params[name]) 
    
    a_mle, b_mle = gamma_params[name]
    
    g_params = draw_bs_reps_params(a_mle, b_mle, len(group), size=10000, progress_bar = True)
    
    a_params = g_params[0]
    b_params = g_params[1]

    conf_int_a = np.percentile(a_params, [2.5, 97.5])
    conf_int_b = np.percentile(b_params, [2.5, 97.5])

    alpha_summaries.append(dict(label = name, estimate = a_mle, conf_int = conf_int_a))
    beta_summaries.append(dict(label = name, estimate = b_mle, conf_int = conf_int_b))



7 uM (2.4439098997770685, 0.0075502906587536)



Finally, we visualize the confidence intervals. 

In [None]:
bokeh.io.show(
    bebi103.viz.confints(alpha_summaries, title="Alpha")
)

bokeh.io.show(
    bebi103.viz.confints(beta_summaries, title="Beta")
)

From these plots, we can more clearly see that the alpha estimates tend to increase with concentration, while the beta estimates peak at 9 or 10 uM. Because the mean of a gamma distribution is α/β, a smaller rate β at the higher tubulin concentrations means each step toward catastrophe takes longer, and a larger α means more such steps are needed, so the total time to catastrophe is longer still. This is consistent with the idea that more polymerization happens at higher concentrations, which would lead to longer catastrophe times. These MLE results also agree with the exploratory analysis, where higher tubulin concentrations correlated with longer catastrophe times. <br /> <br />
In summary, the gamma distribution fits reasonably well for all concentrations. The alpha estimates tend to increase with concentration, while beta peaks at the middle concentrations. 
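The interpretation above rests on how α and β combine. As a small worked example, the mean of a gamma distribution is α/β, so the MLEs reported earlier can be turned into a model-implied mean time to catastrophe for each concentration. The dictionary name `gamma_mles` is only illustrative; the values are copied from the printed output.

```python
# MLEs (alpha, beta) copied from the printed output earlier in the notebook
gamma_mles = {
    "7 uM": (2.4439098997770685, 0.0075502906587536),
    "9 uM": (2.679863680538916, 0.008779101026996812),
    "10 uM": (3.2108230199632306, 0.009029809941070716),
    "12 uM": (2.9152793621913, 0.007660623571445881),
    "14 uM": (3.3615058966445193, 0.007174876425757962),
}

# The mean of a gamma distribution with shape alpha and rate beta is alpha / beta
mean_times = {name: alpha / beta for name, (alpha, beta) in gamma_mles.items()}

for name, mean_time in mean_times.items():
    print(f"{name}: {mean_time:.0f} s")
```

These are point estimates only; they do not carry the uncertainty that the bootstrap confidence intervals are meant to show.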