<div class="alert alert-info">
good job! 45/40
</div>

## Attributions

**Chris**: Bulk of problems (a)-(d)

**Emily**: Part (d) and edits to problems (a)-(c); documentation

**Naveen**: Section headers

## Python Setup

Import and initialize the modules.

In [1]:
# Imports

import numpy as np
import pandas as pd

import bokeh.io
import bokeh.plotting

from IPython.display import display # Compatibility fix

bokeh.io.output_notebook()

## Part (a): Reading data on microtubule catastrophe (Gardner, Zanic, et al.)

First let's load the data file to visualize what we're working with.

In [2]:
# Load data file stored in /data folder
df = pd.read_csv('../data/gardner_et_al_2011_time_to_catastrophe_dic.csv', comment='#')

df.head()

Unnamed: 0,time to catastrophe with labeled tubulin (s),time to catastrophe with unlabeled tubulin (s)
0,470,355.0
1,1415,425.0
2,130,540.0
3,280,265.0
4,550,1815.0


## Part (b): Tidying the data

We can see that this data isn't tidy because each row is not a single observation. For example, the times 470 s and 355 s share the first row even though they do not come from the same experimental setup: one is from an experiment with labeled tubulin and the other from an experiment with unlabeled tubulin.

Instead, each observational unit should have its own table, which we can accomplish by creating separate dataframes for labeled and unlabeled tubulin.

In [3]:
# Create new dataframes
labeled_df = (pd.DataFrame(data=df['time to catastrophe with labeled tubulin (s)'],
                           columns=['time to catastrophe with labeled tubulin (s)']))

unlabeled_df = (pd.DataFrame(data=df['time to catastrophe with unlabeled tubulin (s)'],
                           columns=['time to catastrophe with unlabeled tubulin (s)']))

# Remove the extra rows of NaN in the unlabeled dataframe, since there are fewer 
# trials of unlabeled than labeled
unlabeled_df = unlabeled_df.dropna(axis=0, how='any')

# Display the two new dataframes
display(labeled_df)
display(unlabeled_df)


Unnamed: 0,time to catastrophe with labeled tubulin (s)
0,470
1,1415
2,130
3,280
4,550
5,65
6,330
7,325
8,340
9,95


Unnamed: 0,time to catastrophe with unlabeled tubulin (s)
0,355.0
1,425.0
2,540.0
3,265.0
4,1815.0
5,160.0
6,370.0
7,460.0
8,190.0
9,130.0


<div class="alert alert-info">
good 15/15<br>
Watch out for unnecessary code (renaming the columns with the same name and extra parentheses when you create the individual dataframes).
</div>
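
As an alternative to two separate tables, the measurements could also be stacked into a single long-format table with one row per observation. The sketch below is only an illustration under stated assumptions: `wide` is a small stand-in built from the first rows shown in Part (a), with one value replaced by `NaN` to mimic the shorter unlabeled column, and the short names `labeled`, `unlabeled`, and `tubulin` are made up for this example.

```python
import numpy as np
import pandas as pd

labeled_col = 'time to catastrophe with labeled tubulin (s)'
unlabeled_col = 'time to catastrophe with unlabeled tubulin (s)'

# Stand-in for the wide table; the NaN is illustrative, not a real measurement
wide = pd.DataFrame({
    labeled_col: [470, 1415, 130],
    unlabeled_col: [355.0, 425.0, np.nan],
})

# One row per observation: a condition column plus a single time column
tidy = (
    wide.rename(columns={labeled_col: 'labeled', unlabeled_col: 'unlabeled'})
        .melt(var_name='tubulin', value_name='time to catastrophe (s)')
        .dropna()
        .reset_index(drop=True)
)

print(tidy)
```

In this layout, each condition can be selected with a boolean mask such as `tidy['tubulin'] == 'labeled'`.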

## Part (c): Plotting cumulative distributions

Now that our data is tidy, we can do a calculation of the empirical cumulative distribution function (ECDF). It is defined as ECDF(x) = fraction of data points ≤ x.
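
Written out for $n$ data points $x_1, \ldots, x_n$, the definition above reads as follows (the symbols $\hat{F}$ and $n$ are notation introduced here, not taken from the assignment):

$$
\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[x_i \le x]
$$

For example, taking only the five labeled times shown in Part (a) (470, 1415, 130, 280, 550), two of them (130 and 280) are at most 300, so $\hat{F}(300) = 2/5$.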

In [4]:
def ecdf_vals(data):
    """
    Compute the (x, y) value pairs to plot an ECDF from a numpy array or pandas series
    
    Parameters
    ----------
    data : 1D numpy array/pandas series
           data used to plot the ECDF
           
    Returns
    -------
    output : np.array of shape (2, data.size)
            returns an np array where the first row holds the sorted data values
            and the second row holds the cumulative distribution at each value
    
    """
    
    # Convert to a flat np.array if data is a dataframe
    # (.values replaces as_matrix(), which newer pandas versions no longer provide)
    if isinstance(data, pd.DataFrame):
        data = data.values.flatten()
    
    # Initialize x-y values
    xy = np.zeros([2, data.size])
    
    # Initialize fraction of data points <= x as y
    xy[1] = np.linspace(0, 1, data.size+1)[1:]
    
    # Sort data points and add to x
    xy[0] = np.sort(data)
    
    return xy

<div class="alert alert-info">
Your function is not exactly correct. The first y-value should be `1/len(data)`, not 0, since that value itself counts. It is also unnecessary to pack the x,y values into one array to then unpack them below.
</div>
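
As a minimal sketch of the second point in the feedback above, the x and y values can be returned as two separate arrays rather than packed into one. The name `ecdf_xy` is hypothetical, and the input is just the five labeled times shown in Part (a), not the full data set.

```python
import numpy as np

def ecdf_xy(data):
    """Hypothetical variant of ecdf_vals that returns x and y separately."""
    # Flatten so that lists, 1D arrays, and single-column dataframes all work
    x = np.sort(np.asarray(data, dtype=float).ravel())
    # y runs from 1/n up to 1 in steps of 1/n
    y = np.arange(1, x.size + 1) / x.size
    return x, y

x, y = ecdf_xy([470, 1415, 130, 280, 550])
print(x)
print(y)
```

The two arrays can then be passed directly to a plotting call without indexing into a combined array.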

Now we can visualize the data by plotting the ECDFs.

In [7]:
# Create figure
p = bokeh.plotting.figure(plot_height=300,
                          plot_width=500,
                          x_axis_label='x',
                          y_axis_label='y',
                          title = "ECDF of Time to Catastrophe for Labeled and Unlabeled Tubulins")

# Find ECDF x-y values
ecdf_labeled = ecdf_vals(labeled_df)
ecdf_unlabeled = ecdf_vals(unlabeled_df)

# Plot both labeled and unlabeled ECDF points
p.circle(ecdf_labeled[0], ecdf_labeled[1],
         color='dodgerblue', legend='Labeled')
p.circle(ecdf_unlabeled[0], ecdf_unlabeled[1],
         color='tomato', legend='Unlabeled')

# Changing some graph features
p.xaxis.axis_label = "Time to Catastrophe (s)"
p.yaxis.axis_label = "ECDF"
p.legend.location = 'bottom_right'

bokeh.io.show(p)

<div class="alert alert-info">
25/25
</div>

## Part (d): Plotting formal ECDFs

Now we'll plot the ECDFs as lines instead of points. To do this, we first create a function `plot_ecdf_formal` that prepares the x and y values for a line plot:

In [8]:
def plot_ecdf_formal(data):
    """
    Compute the (x, y) value pairs to plot an ECDF as a line plot
    from a numpy array or pandas series
    
    Parameters
    ----------
    data : 1D numpy array/pandas series
           data used to plot the ECDF
           
    Returns
    -------
    output : np.array of shape (2, 2*data.size - 1)
            returns an np array where the first row holds the x-values of the
            staircase corners and the second row holds the ECDF heights at those corners
    
    """
    
    # Convert to a flat np.array if data is a dataframe
    # (.values replaces as_matrix(), which newer pandas versions no longer provide)
    if isinstance(data, pd.DataFrame):
        data = data.values.flatten()
    
    # Initialize x-y values
    xy = np.zeros([2, 2*data.size - 1])
    
    # Heights of the interior steps, 1/n, 2/n, ..., (n-1)/n, stored in two identical rows
    y = np.zeros([2, data.size-1])
    y[0] = y[1] = np.linspace(0, 1, data.size+1)[1:-1]
    
    # Interleave the two rows so that each height appears twice in succession
    xy[1,:-1] = (y.T).flatten()
    
    # Set last value
    xy[1,-1] = 1
    
    # Duplicate every sorted data point except the first, then sort again
    # so that the two copies of each value sit next to each other
    x = np.sort(data)
    x = np.append(x, x[1:])
    xy[0] = np.sort(x)
    
    return xy

Next, we use the function above to create a Bokeh figure:

In [9]:
# Create figure
p = bokeh.plotting.figure(plot_height=300,
                          plot_width=500,
                          x_axis_label='x',
                          y_axis_label='y',
                          title = "Formal ECDF of Time to Catastrophe for Labeled/Unlabeled Tubulins")

# Find ECDF x-y values
form_ecdf_labeled = plot_ecdf_formal(labeled_df)
form_ecdf_unlabeled = plot_ecdf_formal(unlabeled_df)

# Plot both labeled and unlabeled ECDFs as lines
p.line(form_ecdf_labeled[0], form_ecdf_labeled[1],
         color='dodgerblue', legend='Labeled')
p.line(form_ecdf_unlabeled[0], form_ecdf_unlabeled[1],
         color='tomato', legend='Unlabeled')

# Changing some graph features
p.xaxis.axis_label = "Time to Catastrophe (s)"
p.yaxis.axis_label = "ECDF"
p.legend.location = 'bottom_right'

bokeh.io.show(p)

<div class="alert alert-info">
Same comment here, but nice otherwise 5/5

</div>
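
Following the feedback about packing, one possible way to build the staircase coordinates as two separate arrays is with `np.repeat`. This is a sketch under assumptions, not the graded solution: the name `staircase_ecdf` is hypothetical, and unlike `plot_ecdf_formal` it also includes the vertical rise from 0 at the smallest data point.

```python
import numpy as np

def staircase_ecdf(data):
    """Hypothetical helper: return x, y arrays tracing the ECDF as a staircase."""
    x_sorted = np.sort(np.asarray(data, dtype=float).ravel())
    n = x_sorted.size
    # Each data point appears twice: at the bottom and at the top of its step
    x = np.repeat(x_sorted, 2)
    # Heights 0, 1/n, ..., 1, each doubled, then trimmed so the curve
    # starts at 0 and ends at 1
    y = np.repeat(np.arange(n + 1) / n, 2)[1:-1]
    return x, y

x_steps, y_steps = staircase_ecdf([3, 1, 2])
print(x_steps)
print(y_steps)
```

The resulting arrays have length `2 * n` and can be passed to a line glyph in the same way as the rows of `xy` above.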