# Microtubule catastrophe and ECDFs {#exr-mt-catastrophe-and-ecdfs}

[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/gardner_time_to_catastrophe_dic_tidy.csv)

<hr />

As a reminder, the empirical cumulative distribution function for a set of data points, evaluated at x, is

>ECDF(x) = fraction of data points ≤ x.

The ECDF is defined on the entire real number line, with $\mathrm{ECDF}(x\to-\infty) = 0$ and $\mathrm{ECDF}(x\to\infty) = 1$. However, the ECDF is often plotted as discrete points, $\{(x_i, y_i)\}$, where for point $i$, $x_i$ is the value of the measured quantity and $y_i$ is $\mathrm{ECDF}(x_i)$. For example, if I have a set of measured data with values (1.1, –6.7, 2.3, 9.8, 2.3), the points on the ECDF plot are

| x      | y   |
|:------:|:---:|
| –6.7  |  0.2 |
| 1.1   |  0.4 |
| 2.3   |  0.6 |
| 2.3   |  0.8 |
| 9.8   |  1.0 |

In this exercise, you will use a data set we will explore throughout the workshop. [Gardner, Zanic, and coworkers](http://dx.doi.org/10.1016/j.cell.2011.10.037) investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state. In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin.

We will look at the data used to generate Fig. 2a of their paper. In the end, you will generate a plot similar to that figure.

**a)** Write a function with the call signature `ecdfvals(data)`, which takes a one-dimensional Numpy array (or Polars `Series`; the same construction of your function will work for both) of data and returns the `x` and `y` values as Numpy arrays for plotting the ECDF in the "dots" style, as in Fig. 2a of the Gardner, Zanic, et al. paper. As a reminder, 

> ECDF(*x*) = fraction of data points ≤ *x*.

Assume that there are no NaNs in the input. When you write this function, you may only use base Python and the standard library, in addition to Numpy and Polars. (iqplot has this functionality built-in, but the point here is to build a more concrete understanding of what an ECDF is.)

**b)** Write a function, `ecdfvals_expr(col)`, that returns a Polars expression (a `pl.Expr`) that will compute the `y` values of an ECDF for a given column, `col`. Again, assume there are no NaNs in the column.

**c)** Use either the `ecdfvals()` function or the `ecdfvals_expr()` function that you wrote to plot the ECDFs shown in Fig. 2a of the Gardner, Zanic, et al. paper. By looking at this plot, do you think that the fluorescent labeling makes a difference in the onset of catastrophe? (We will do a more careful statistical inference later in the workshop, but for now, does it pass the eye test? Eye tests are an important part of EDA.) You can access the data set here: [https://s3.amazonaws.com/bebi103.caltech.edu/data/gardner_time_to_catastrophe_dic_tidy.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/gardner_time_to_catastrophe_dic_tidy.csv)

<br />

## Solution

<hr>

In [1]:
import numpy as np
import polars as pl

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

**a)** First, we write a function to compute the `x` and `y` values for plotting an ECDF. The `x` values are simply the sorted data, and the `y` values go from `1/n` to `1` in steps of `1/n`, where `n` is the total number of data points.

In [2]:
def ecdfvals(data):
    """
    Compute `x` and `y` values for plotting an ECDF.
    
    Parameters
    ----------
    data : array_like
        Array of data to be plotted as an ECDF.
        
    Returns
    -------
    x : array
        `x` values for plotting
    y : array
        `y` values for plotting    
    """
    return np.sort(data), np.arange(1, len(data)+1) / len(data)
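
As a quick sanity check, we can apply the same construction to the five example values from the problem statement and compare against the table given there. This is only a minimal sketch: the function is redefined so the snippet runs on its own, and the variable names (`x_val`, `y_val`, `ecdf_at_tie`) are chosen here for illustration.

```python
import numpy as np


def ecdfvals(data):
    """Same construction as above: sorted data and i/n for i = 1, ..., n."""
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)


# Example data from the problem statement
data = np.array([1.1, -6.7, 2.3, 9.8, 2.3])
x, y = ecdfvals(data)

# Print the (x, y) pairs to compare with the table in the problem statement
for x_val, y_val in zip(x, y):
    print(f"{x_val:5.1f}  {y_val:.1f}")

# Evaluate the ECDF directly from its definition at the repeated value 2.3
ecdf_at_tie = np.mean(data <= 2.3)
```

Note that the repeated value 2.3 gets two dots, at `y = 0.6` and `y = 0.8`, while the ECDF evaluated from its definition at 2.3 is 0.8 (four of the five points are ≤ 2.3). The "dots" style draws one dot per data point, so tied values appear as a vertical stack whose top dot sits at the ECDF value.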

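**b)** To write the function for application to a data frame, we do not have the luxury of sorting, because sorting one column would break its row-by-row correspondence with the other columns of the data frame. We can, however, use the `rank()` method for our purposes. With `method="ordinal"`, every entry gets a distinct rank from 1 to `n`, even when values are tied, so dividing by the number of entries gives the same `y` values as in part (a).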

In [3]:
def ecdfvals_expr(col):
    """Return a Polars expression for ECDF values"""
    return col.rank(method="ordinal") / col.len()

**c)** With our function in hand, we can easily compute the x and y values for the plot. We can pass Numpy arrays directly into `p.scatter()`.

In [4]:
# Load data set
df = pl.read_csv("../data/gardner_time_to_catastrophe_dic_tidy.csv")

# Compute x and y values for ECDFs
x_labeled, y_labeled = ecdfvals(
    df.filter(pl.col("labeled"))["time to catastrophe (s)"]
)
x_unlabeled, y_unlabeled = ecdfvals(
    df.filter(~pl.col("labeled"))["time to catastrophe (s)"]
)

# Make the plot
p = bokeh.plotting.figure(
    frame_width=450,
    frame_height=300,
    x_axis_label="time to catastrophe (min)",
    y_axis_label="ECDF",
)
p.scatter(x=x_labeled / 60, y=y_labeled, legend_label="labeled")
p.scatter(x=x_unlabeled / 60, y=y_unlabeled, color="orange", legend_label="unlabeled")
p.legend.location = "bottom_right"

bokeh.io.show(p)

Alternatively, we can use our Polars expression.

In [5]:
# Compute ECDF values separately for labeled and unlabeled tubulin
df = df.with_columns(
    ecdfvals_expr(pl.col("time to catastrophe (s)")).alias("ECDF").over("labeled"),
    (pl.col("time to catastrophe (s)") / 60).alias("time to catastrophe (min)"),
)

# Make the plot
p = bokeh.plotting.figure(
    frame_width=450,
    frame_height=300,
    x_axis_label="time to catastrophe (min)",
    y_axis_label="ECDF",
)
p.scatter(
    source=df.filter(pl.col("labeled")).to_dict(),
    x="time to catastrophe (min)",
    y="ECDF",
    legend_label="labeled",
)
p.scatter(
    source=df.filter(~pl.col("labeled")).to_dict(),
    x="time to catastrophe (min)",
    y="ECDF",
    color="orange",
    legend_label="unlabeled",
)
p.legend.location = "bottom_right"

bokeh.io.show(p)

There does not seem to be a major difference between the two treatments, at least not by eye.

## Computing environment

In [6]:
%load_ext watermark
%watermark -v -p numpy,polars,bokeh,jupyterlab

Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

numpy     : 2.2.6
polars    : 1.31.0
bokeh     : 3.7.3
jupyterlab: 4.4.5

