# Homework 2.3: Microtubule catastrophe and ECDFs [SOLO] (30 pts)

[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/gardner_time_to_catastrophe_dic_tidy.csv)

<hr />

In a [future lesson](../../lessons/07/iqplot.iypnb), you will learn about **empirical cumulative distribution functions**, or ECDFs. They are a useful way to visualize how measured data are distributed. An ECDF evaluated at point _x_ is defined as

ECDF(_x_) = fraction of data points ≤ _x_.

The ECDF is defined on the entire real number line, with $\mathrm{ECDF}(x\to-\infty) = 0$ and $\mathrm{ECDF}(x\to\infty) = 1$. However, the ECDF is often plotted as discrete points, $\{(x_i, y_i)\}$, where for point $i$, $x_i$ is the value of the measured quantity and $y_i$ is $\mathrm{ECDF}(x_i)$. For example, if I have a set of measured data with values (1.1, –6.7, 2.3, 9.8, 2.3), the points on the ECDF plot are

| x      | y   |
|:------:|:---:|
| –6.7  |  0.2 |
| 1.1   |  0.4 |
| 2.3   |  0.6 |
| 2.3   |  0.8 |
| 9.8   |  1.0 |
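
As a minimal sketch of the definition above, the ECDF can be evaluated at any point on the real line by counting how many sorted data values are ≤ _x_. The helper name `ecdf_at` is hypothetical and used only for illustration.

```python
import numpy as np


def ecdf_at(x, data):
    """Return the fraction of entries in `data` that are <= x."""
    data_sorted = np.sort(np.asarray(data, dtype=float))
    # side="right" counts every entry <= x, including ties with x
    return np.searchsorted(data_sorted, x, side="right") / len(data_sorted)


# The example data from the text
data = np.array([1.1, -6.7, 2.3, 9.8, 2.3])

# Evaluate below the smallest value, at two data values, and above the largest
values = ecdf_at(np.array([-10.0, 1.1, 2.3, 100.0]), data)
```

Note that by the definition, ECDF(2.3) is 0.8 because four of the five values are ≤ 2.3. The table's extra point at (2.3, 0.6) comes from the plotting convention, which gives each repeated measurement its own dot.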

In this problem, you will use your newly acquired skills with Numpy and Bokeh to compute ECDFs from a real data set and plot them.

[Gardner, Zanic, and coworkers](http://dx.doi.org/10.1016/j.cell.2011.10.037) investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state. In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin.

We will look at the data used to generate Fig. 2a of their paper. In the end, you will generate a plot similar to that figure.

**a)** Write a function with the call signature `ecdfvals(data)`, which takes a one-dimensional Numpy array (or Pandas `Series`; the same construction of your function will work for both) of data and returns the `x` and `y` values for plotting the ECDF in the "dots" style, as in Fig. 2a of the Gardner, Zanic, et al. paper. As a reminder, 

> ECDF(*x*) = fraction of data points ≤ *x*.

When you write this function, you may only use base Python and the standard library, in addition to Numpy and Pandas.

In [1]:
# import statements
import numpy as np
import pandas as pd

# plot in bokeh
import bokeh.io
import bokeh.plotting

In [2]:
# take in a 1D array and return x and y for plotting the ECDF in dots style
def ecdfvals(data):
    # extract the unique values and how many times each one occurs
    # (repeated measurements collapse into a single point)
    x, counts = np.unique(data, return_counts=True)

    # normalize counts to fractions of the data set
    y = counts / np.sum(counts)
    # accumulate the fractions to get the ECDF value at each unique x
    y = np.cumsum(y)

    return x, y
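
The example table in the problem statement lists the repeated value 2.3 twice, at heights 0.6 and 0.8, so the "dots" style there uses one point per measurement. A possible sort-based sketch of that convention is below; the name `ecdfvals_sorted` is hypothetical and this is one option, not necessarily the intended solution.

```python
import numpy as np


def ecdfvals_sorted(data):
    """Hypothetical variant: one (x, y) point per measurement, ties included."""
    # Sorting works for both Numpy arrays and Pandas Series via np.asarray
    x = np.sort(np.asarray(data))
    # The i-th smallest measurement (1-indexed) is plotted at height i / n
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# The example data from the problem statement
x, y = ecdfvals_sorted(np.array([1.1, -6.7, 2.3, 9.8, 2.3]))
```

Sorting once and assigning heights with `np.arange` avoids looping over the values, and it only differs from a unique-value approach when measurements repeat.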

**b)** Use the `ecdfvals()` function that you wrote to plot the ECDFs shown in Fig. 2a of the Gardner, Zanic, et al. paper. By looking at this plot, do you think that the fluorescent labeling makes a difference in the onset of catastrophe? (We will do a more careful statistical inference later in the course, but for now, does it pass the eye test? Eye tests are an important part of EDA.) You can access the data set here: [https://s3.amazonaws.com/bebi103.caltech.edu/data/gardner_time_to_catastrophe_dic_tidy.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/gardner_time_to_catastrophe_dic_tidy.csv)

In [4]:
# read the CSV into a data frame and drop the unnamed index column
df = pd.read_csv("../data/gardner_time_to_catastrophe_dic_tidy.csv", header=0)
df = df.drop(columns=df.columns[0])

# separate catastrophe times for unlabeled (False) and labeled (True) tubulin
df_false = df.loc[~df["labeled"], "time to catastrophe (s)"]
df_true = df.loc[df["labeled"], "time to catastrophe (s)"]

# obtain values for plotting using ecdfvals function
x_false, y_false = ecdfvals(df_false)
x_true, y_true = ecdfvals(df_true)

df

Unnamed: 0,time to catastrophe (s),labeled
0,470.0,True
1,1415.0,True
2,130.0,True
3,280.0,True
4,550.0,True
...,...,...
301,180.0,False
302,145.0,False
303,745.0,False
304,390.0,False


<div class="alert alert-block alert-info">
Notebook did not run properly. I think you might have an issue with the line endings VS code uses? -1
</div>

In [5]:
# Enable viewing Bokeh plots in the notebook
bokeh.io.output_notebook()

p = bokeh.plotting.figure(
    width=400,
    height=300,
    x_axis_label="time to catastrophe (s)",
    y_axis_label="ECDF",
)

p.circle(
    x=x_true,
    y=y_true,
    legend_label="Labeled",
)

p.circle(
    x=x_false,
    y=y_false,
    legend_label="Unlabeled",
    color="orange"
)

p.legend.location = "bottom_right"

bokeh.io.show(p)

Visually, fluorescent labeling appears to slightly increase the time to catastrophe: the ECDF for labeled tubulin lies below the ECDF for unlabeled tubulin, especially around 200 to 400 s and 500 to 900 s. However, the difference is small enough that the eye test alone cannot settle it, and a statistical test would help here.
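
One rough way to put a number on the eye test is the largest vertical gap between the two ECDFs. The sketch below assumes a hypothetical helper, `max_ecdf_gap`, and runs on toy inputs rather than the catastrophe data. It gives a descriptive summary only and is not a substitute for the statistical inference mentioned in the problem statement.

```python
import numpy as np


def max_ecdf_gap(a, b):
    """Largest vertical distance between the ECDFs of samples a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Both ECDFs are step functions that only change at observed values,
    # so evaluating them at the pooled observations is sufficient.
    grid = np.concatenate((a, b))
    ecdf_a = np.searchsorted(a, grid, side="right") / len(a)
    ecdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))


# Toy inputs for illustration only; these are not values from the data set
gap = max_ecdf_gap([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
```

In the notebook, the same helper could be called as `max_ecdf_gap(df_true, df_false)` to compare the labeled and unlabeled measurements.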