# <center> Using the Euler Characteristic Transform on Protein Data </center>

### <center> Sarah Percival </center>
<center> Department of Mathematics and Statistics, University of New Mexico</center>

# Welcome
This notebook focuses on the Euler Characteristic Transform. Email spercival@unm.edu with questions or comments.

# Goals
* Learn Python basics
* Apply the Euler Characteristic Transform (ECT) to protein data
* Make efficient, scalable computations

In [6]:
# Python comes with a basic set of commands, but we need to import some packages to compute the ECT
import numpy as np # matrix manipulation

import matplotlib.pyplot as plt # image plotting

import pandas as pd # viewing data

In [3]:
# This command allows us to zoom and drag plots.
%matplotlib notebook

In [9]:
# import the .pdb file as a Pandas dataframe
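
One possible way to do this is sketched below. It assumes the file follows the standard fixed-width PDB layout for `ATOM` records. The helper name `read_pdb_atoms` and the column names are illustrative choices, not part of the original notebook. The columns are ordered so that the atom name lands in column 2 (counting from 0).

```python
import pandas as pd


def read_pdb_atoms(lines):
    """Parse ATOM records from an iterable of PDB-format lines into a DataFrame.

    Hypothetical helper: field positions follow the fixed-width PDB ATOM layout.
    """
    rows = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM, TER, and other record types
        rows.append({
            "record": line[0:6].strip(),
            "serial": int(line[6:11]),
            "atom_name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "chain": line[21].strip(),
            "res_seq": int(line[22:26]),
            "x": float(line[30:38]),
            "y": float(line[38:46]),
            "z": float(line[46:54]),
        })
    return pd.DataFrame(rows)
```

Because an open file is an iterable of lines, it can be passed in directly, for example `with open("protein.pdb") as f: atoms = read_pdb_atoms(f)`, where `protein.pdb` is a placeholder file name.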

In [15]:
# the main protein structure comes from the protein "backbone", which consists of the atoms with values of
# 'N', 'C', or 'CA' in column 2
# select only the rows corresponding to atoms in the backbone
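
A minimal sketch of the row selection, assuming the atoms are stored in a Pandas dataframe. The function name `select_backbone` and the default column label `"atom_name"` are hypothetical. If your dataframe uses integer column labels, pass `name_col=2` instead.

```python
import pandas as pd

BACKBONE_ATOMS = ["N", "CA", "C"]


def select_backbone(atoms, name_col="atom_name"):
    """Keep only the rows whose atom name is a backbone atom, preserving file order."""
    mask = atoms[name_col].isin(BACKBONE_ATOMS)
    # reset the index so that consecutive rows are consecutive backbone atoms
    return atoms.loc[mask].reset_index(drop=True)
```

Resetting the index is a design choice: it makes it easy to treat row `i` and row `i + 1` as adjacent atoms later on.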

In [1]:
# plot the atoms
# the atoms are in order in the PDB files, so we can connect adjacent atoms to obtain protein shape
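
The sketch below shows one way to draw the ordered atoms in 3D with Matplotlib, joining each atom to the next one in the file. The function name `plot_backbone` is illustrative, and `coords` is assumed to be an array-like of shape (n, 3) holding the x, y, z coordinates.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_backbone(coords):
    """Plot ordered 3D points and connect consecutive points with line segments."""
    coords = np.asarray(coords, dtype=float)
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    # a single line through the points in file order traces the protein shape
    ax.plot(coords[:, 0], coords[:, 1], coords[:, 2], marker="o", markersize=3)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_zlabel("z")
    return ax
```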

If you can't see the plot above, try running the cell below.

In [None]:
%matplotlib inline

# Euler Characteristic Transform

Persistent homology is computationally expensive: the calculations it requires are complex and take a long time. If we want to analyze 200 million proteins, we may need something computationally simpler. That's where the Euler Characteristic Transform, or ECT, comes in.

In [8]:
# compute the Euler Characteristic curve in the given direction
# because proteins have one connected component with no loops, the first entry in the ECC should be 0
# and the last entry should be 1
# need to choose an appropriate start and end

def compute_ecc(data, start_time, end_time, normal):
    """Computes the Euler characteristic curve along a given direction.

    The direction is given by a normal vector. The sweeping plane starts at offset start_time along the normal
    and is moved incrementally along the normal until it reaches offset end_time.
    
    A plane is given by the equation a*x+b*y+c*z+d=0, where [a, b, c] is the normal and [x, y, z] is any point on
    the plane. Thus, at each step we have to calculate d to find all points such that a*x+b*y+c*z+d<0.
    
    Subtract the number of edges from the number of points to obtain the Euler characteristic at each time step.

    Parameters
    ----------
    data : array-like

        Input point cloud. Points are assumed to be in order.
        
    start_time : scalar
    
        Offset along the normal at which the sweep begins.
        
    end_time : scalar
    
        Offset along the normal at which the sweep ends.
        
    normal : array-like
    
        The direction along which to sweep the plane.

    Returns
    -------
    ecc : list

        The Euler characteristic curve in the given direction.
    """

How can the efficiency of the above function be improved? Use the cell magic %%timeit to test any improvements.
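
One possible direction is sketched here under stated assumptions. It is not the intended solution. The sketch assumes that edges join consecutive points, that a vertex is included when it lies strictly below the plane, and that an edge is included when both of its endpoints are. The name `ecc_sketch` and the `num_steps` parameter are hypothetical. Instead of looping over points at every step, it sorts the heights once and counts with `np.searchsorted`.

```python
import numpy as np


def ecc_sketch(points, normal, start, end, num_steps=100):
    """Euler characteristic (vertices minus edges) at evenly spaced plane offsets."""
    points = np.asarray(points, dtype=float)
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)  # use a unit normal

    # height of each point along the sweep direction
    heights = points @ normal
    # an edge between consecutive points appears once its higher endpoint is included
    edge_heights = np.maximum(heights[:-1], heights[1:])

    thresholds = np.linspace(start, end, num_steps)
    # side="left" counts entries strictly below each threshold
    n_vertices = np.searchsorted(np.sort(heights), thresholds, side="left")
    n_edges = np.searchsorted(np.sort(edge_heights), thresholds, side="left")
    return n_vertices - n_edges
```

If `start` lies below every point and `end` lies above every point, the first entry is 0 and the last entry is 1, which matches the check described for a single chain with no loops.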

Now it's time to apply the ECT to our data!

In [9]:
# choose a direction and use your function above to compute the ECC for that direction

Plot the ECC below.

In [2]:
# plot the curve
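
Since the Euler characteristic only changes when the plane passes a point, a step plot is a natural fit. The sketch below assumes you have an array of plane offsets and the matching ECC values. The function name `plot_ecc` and the axis labels are illustrative.

```python
import matplotlib.pyplot as plt


def plot_ecc(thresholds, ecc, label=None, ax=None):
    """Draw an Euler characteristic curve as a step function of the plane offset."""
    if ax is None:
        _, ax = plt.subplots()
    # where="post" holds each value until the next offset is reached
    ax.step(thresholds, ecc, where="post", label=label)
    ax.set_xlabel("plane offset along the normal")
    ax.set_ylabel("Euler characteristic")
    return ax
```

Passing the same `ax` to several calls overlays curves from different directions on one plot.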

Select a different angle and compute the ECC.

In [11]:
# write your new angle here
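
In three dimensions a direction is specified by two angles. The sketch below converts them to a unit normal vector, assuming the usual spherical convention in which `theta` is measured from the positive z-axis and `phi` from the positive x-axis, both in radians. The function name `direction_from_angles` is hypothetical.

```python
import numpy as np


def direction_from_angles(theta, phi):
    """Unit vector for polar angle theta (from +z) and azimuth phi (from +x), in radians."""
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])
```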

In [3]:
# plot the new ECC

Repeat the above three cells three more times with different angles, for a total of five different angles. How are the plots for each curve different? How are they similar? There are no wrong answers!

Once you have the curves for several different angles, try appending them together.

In [None]:
# concatenate the ECC vectors (arrays) into one long vector (array)
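
A minimal sketch of the concatenation using NumPy, assuming each ECC is a list or 1D array. The function name `concatenate_eccs` is illustrative.

```python
import numpy as np


def concatenate_eccs(eccs):
    """Join several Euler characteristic curves end to end into one 1D array."""
    return np.concatenate([np.asarray(ecc) for ecc in eccs])
```

The order of the curves matters: to compare two proteins, the directions should be listed in the same order for both.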


Bonus: use the cell magic %%timeit to time the ECC computation for the whole backbone vs. just the alpha carbons ('CA' atoms only).

In [None]:
# timing for whole backbone:
# timing for alpha carbons only:
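
Outside a notebook, the standard-library `timeit` module does the same job as `%%timeit`. The sketch below is only an illustration of the comparison. It uses random points as a synthetic stand-in for real coordinates, assumes the backbone rows repeat in N, CA, C order so that every third row is an alpha carbon, and uses a compact hypothetical `ecc_counts` function in place of your own `compute_ecc`.

```python
import timeit

import numpy as np


def ecc_counts(points, normal, thresholds):
    """Vertices minus edges strictly below each plane offset (edges join consecutive points)."""
    heights = np.asarray(points, dtype=float) @ np.asarray(normal, dtype=float)
    edge_heights = np.maximum(heights[:-1], heights[1:])
    n_vertices = np.searchsorted(np.sort(heights), thresholds, side="left")
    n_edges = np.searchsorted(np.sort(edge_heights), thresholds, side="left")
    return n_vertices - n_edges


def best_time(func, repeat=3, number=10):
    """Smallest average wall-clock time per call, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


rng = np.random.default_rng(0)
backbone = rng.normal(size=(3000, 3))  # synthetic stand-in, not real protein data
alpha_carbons = backbone[1::3]         # assumes rows repeat in N, CA, C order
normal = np.array([0.0, 0.0, 1.0])
thresholds = np.linspace(-5.0, 5.0, 200)

t_backbone = best_time(lambda: ecc_counts(backbone, normal, thresholds))
t_alpha = best_time(lambda: ecc_counts(alpha_carbons, normal, thresholds))
print(f"backbone: {t_backbone:.2e} s per call, alpha carbons only: {t_alpha:.2e} s per call")
```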