In a [future lesson](../../lessons/07/iqplot.iypnb), you will learn about **empirical cumulative distribution functions**, or ECDFs. These are a useful way to visualize how measured data are distributed. An ECDF evaluated at point _x_ is defined as

ECDF(_x_) = fraction of data points ≤ _x_.

The ECDF is defined on the entire real number line, with $\mathrm{ECDF}(x\to-\infty) = 0$ and $\mathrm{ECDF}(x\to\infty) = 1$. However, the ECDF is often plotted as discrete points, $\{(x_i, y_i)\}$, where for point $i$, $x_i$ is the value of the measured quantity and $y_i$ is $\mathrm{ECDF}(x_i)$. For example, if I have a set of measured data with values (1.1, –6.7, 2.3, 9.8, 2.3), the points on the ECDF plot are

| x      | y   |
|:------:|:---:|
| –6.7  |  0.2 |
| 1.1   |  0.4 |
| 2.3   |  0.6 |
| 2.3   |  0.8 |
| 9.8   |  1.0 |
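
As a minimal sketch of how the table above can be computed, the points are the sorted data paired with the values 1/n, 2/n, ..., 1. The function name `ecdf_points` is a hypothetical helper, not part of the assignment. Note that under this "dots" convention tied values (the two 2.3 entries) get successive heights, as in the table, even though ECDF(2.3) itself is 0.8.

```python
import numpy as np


def ecdf_points(data):
    """Return x and y values for plotting an ECDF as dots.

    Hypothetical helper: x is the sorted data, and y steps up by 1/n
    at each point, so ties receive successive heights.
    """
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# The example data from the text; x and y reproduce the rows of the table
x, y = ecdf_points([1.1, -6.7, 2.3, 9.8, 2.3])
for x_i, y_i in zip(x, y):
    print(x_i, y_i)
```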

In this problem, you will use your newly acquired skills using Numpy and Bokeh to compute ECDFs from a real data set and plot them.

[Gardner, Zanic, and coworkers](http://dx.doi.org/10.1016/j.cell.2011.10.037) investigated the dynamics of microtubule catastrophe, the switching of a microtubule from a growing to a shrinking state. In particular, they were interested in the time between the start of growth of a microtubule and the catastrophe event. They monitored microtubules by using tubulin (the monomer that comprises a microtubule) that was labeled with a fluorescent marker. As a control to make sure that fluorescent labels and exposure to laser light did not affect the microtubule dynamics, they performed a similar experiment using differential interference contrast (DIC) microscopy. They measured the time until catastrophe with labeled and unlabeled tubulin.

We will look at the data used to generate Fig. 2a of their paper. In the end, you will generate a plot similar to that figure.

<div class="alert alert-info">
Total: 19.5/30
</div>

**a)** Write a function with the call signature `ecdfvals(data)`, which takes a one-dimensional Numpy array (or Pandas `Series`; the same construction of your function will work for both) of data and returns the `x` and `y` values for plotting the ECDF in the "dots" style, as in Fig. 2a of the Gardner, Zanic, et al. paper. As a reminder, 

> ECDF(*x*) = fraction of data points ≤ *x*.

When you write this function, you may only use base Python and the standard library, in addition to Numpy and Pandas.


In [1]:
import numpy as np
import bokeh.io
import bokeh.plotting
import seaborn as sns
import pandas as pd
from collections import Counter

def ecdfvals(data):
    # Count how many times each unique value occurs in the data
    unique_num, counts = np.unique(data, return_counts=True)
    total_occurance = np.sum(counts)
    print(total_occurance)
    # Map each unique value to the fraction of data points equal to it;
    # a cumulative sum over the sorted values then gives the ECDF
    new_data = {}
    for value, count in zip(unique_num, counts):
        new_data[value] = count / total_occurance
    return new_data

In [3]:
import csv

filename = '../data/gardner_time_to_catastrophe_dic_tidy.csv'
# Read the CSV, skipping the header row; the context manager closes the file
with open(filename) as file:
    catastrophe_time = csv.reader(file)
    header = next(catastrophe_time)
    rows = [row for row in catastrophe_time]
# Keep only the second column (the values are still strings at this point)
catastrophe_time_extract = np.array(rows)[:, 1]
ECDF = ecdfvals(catastrophe_time_extract)

306


<div class="alert alert-info">
Notebook did not run - check data path. -1
</div>
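
One way to catch a wrong data path early is to check for the file before reading it. The sketch below is only an illustration: `load_tidy_csv` is a hypothetical helper, and it assumes the file is a plain CSV that `pd.read_csv` can parse with default settings.

```python
from pathlib import Path

import pandas as pd


def load_tidy_csv(path):
    """Read a CSV into a DataFrame, failing with a clear message if missing.

    Hypothetical helper; assumes a plain CSV with a header row.
    """
    path = Path(path)
    if not path.is_file():
        # Report the absolute path so a bad relative path is easy to spot
        raise FileNotFoundError(f"Data file not found: {path.resolve()}")
    return pd.read_csv(path)


# Usage, with the path from the cell above:
# df = load_tidy_csv('../data/gardner_time_to_catastrophe_dic_tidy.csv')
```

Because the error message shows the resolved absolute path, it is easier to see whether the notebook is being run from a different working directory than expected.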

In [4]:
# Build a table of the unique values and their fractions, sorted by value
df = pd.DataFrame(columns=["number", "occurence"])
df["number"] = [float(i) for i in ECDF.keys()]
df["occurence"] = list(ECDF.values())
df = df.sort_values("number")
# The cumulative sum of the fractions gives the ECDF value at each number
df["occurence"] = df["occurence"].cumsum(axis=0)
df.head()

|     | number | occurence |
|:---:|:------:|:---------:|
| 70  | 40.0   | 0.003268  |
| 96  | 55.0   | 0.006536  |
| 102 | 60.0   | 0.013072  |
| 111 | 65.0   | 0.019608  |
| 124 | 75.0   | 0.026144  |


In [5]:
p = bokeh.plotting.figure(
    width=400,
    height=300,
    x_axis_label='x',
    y_axis_label='y',
)
p.line(df["number"], df["occurence"])
bokeh.io.show(p)

<div class="alert alert-info">
Here, we wanted you to plot two separate ECDFs: one with the times to catastrophe for the labeled samples and one for the unlabeled samples. The idea behind ECDFs is that they summarize how the data are distributed. To see whether two experimental conditions differ, you need to compare their distributions with each other, which is not possible when they are combined into a single ECDF. -5
</div>
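
A minimal sketch of the per-condition approach, under stated assumptions: the column names `time` and `condition`, the helper names, and the toy values are all made up for illustration and are not taken from the data set. Each condition gets its own ECDF, and both are drawn as dots on one Bokeh figure so they can be compared.

```python
import numpy as np
import pandas as pd


def ecdf_points(data):
    """Hypothetical helper: sorted data as x, heights 1/n, 2/n, ..., 1 as y."""
    x = np.sort(np.asarray(data, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Toy stand-in for the tidy data (made-up values and assumed column names)
df = pd.DataFrame(
    {
        "time": [3.0, 1.0, 2.0, 4.0, 2.0],
        "condition": ["labeled", "labeled", "labeled", "unlabeled", "unlabeled"],
    }
)

# One ECDF per experimental condition
ecdfs = {
    condition: ecdf_points(group["time"])
    for condition, group in df.groupby("condition")
}


def plot_ecdfs(ecdfs):
    """Overlay the ECDFs as dots on a single figure."""
    # Imported here so the computation above runs without Bokeh installed
    import bokeh.plotting

    p = bokeh.plotting.figure(
        width=400,
        height=300,
        x_axis_label="time to catastrophe",
        y_axis_label="ECDF",
    )
    for (condition, (x, y)), color in zip(ecdfs.items(), ["orange", "gray"]):
        p.scatter(x, y, size=6, color=color, legend_label=condition)
    p.legend.location = "bottom_right"
    return p


# Usage in a notebook:
# import bokeh.io
# bokeh.io.show(plot_ecdfs(ecdfs))
```

If the two sets of dots lie nearly on top of each other, that suggests the labeling did not strongly change the distribution of times; a clear gap between them would suggest the opposite.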

<div class="alert alert-info">
No mention of eye test. -3
</div>

<div class="alert alert-info">
No submission tag: -5% (-1.5)
</div>