# Determining the position and orientation of a small molecule in an enzyme

#### Author: **Thomas Casey**
##### Location: Michigan State University
Work published in [J. Phys. Chem. B](https://doi.org/10.1021/jp404743d) and [Applied Mag. Res.](https://doi.org/10.1007/s00723-020-01288-w)

## Table of Contents
1. [Business Understanding](#Business-Understanding)
2. [Data Understanding](#Data-Understanding)
3. [Data Preparation](#Data-Preparation)
4. [Modeling](#Modeling)
5. [Evaluation](#Evaluation)
6. [Deployment 1](#Deployment-1)
7. [Deployment 2](#Deployment-2)

<a id="Business Understanding"> </a>
## Business Understanding

What are the precise spatial positions of the small molecules taurine and water relative to the center of the enzyme TauD just before the enzyme catalyzes the breakdown of taurine?


<a id="Data Understanding"> </a>
## Data Understanding

The available data come from a spectroscopic technique known as [Electron Paramagnetic Resonance (EPR)](https://en.wikipedia.org/wiki/Electron_paramagnetic_resonance), in particular [Electron Spin Echo Envelope Modulation (ESEEM) and HYperfine Sublevel CORrElation (HYSCORE)](https://doi.org/10.1002/0470862106.ia337). In brief, ESEEM yields one-dimensional arrays of signal amplitudes collected as a function of time (*the changing delay between two microwave pulses*). HYSCORE yields two-dimensional data where each dimension is also time (*each dimension is one of the two changing delays between a set of three microwave pulses*). The data are stored as binary files in a proprietary Bruker format. The files come in pairs: one contains the raw data and the other contains descriptive text.

<a id="Data Preparation"> </a>
## Data Preparation

Data are collected using commercially available [spectrometers](https://www.bruker.com/en/products-and-solutions/mr/epr-instruments.html) controlled by proprietary software.
<br>
<br>
The magnetic resonance community typically uses [MATLAB](http://www.mathworks.com/matlab) or Python to handle EPR data. The tool used for this study was [EasySpin](http://www.easyspin.org), the most widely used tool for EPR data import and modeling, running in MATLAB. For the purpose of this notebook I will translate as much as possible to Python.
<br>
<br>
EPR data are modeled using a well-established theoretical framework: information is extracted by using optimization algorithms to fit quantum mechanical expressions to the data. For this study I used built-in [EasySpin (*docs*)](https://easyspin.org/easyspin/documentation/) functions to supply the quantum mechanical expressions.
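As a rough illustration of fitting a model to data by optimization, the sketch below fits a damped cosine to a noisy synthetic trace with `scipy.optimize.least_squares` and reports the summed squared residual. The `toy_model` function, its parameters, and the synthetic data are hypothetical stand-ins chosen for this example; they are not the EasySpin quantum mechanical expressions used in the study.

```python
import numpy as np
from scipy.optimize import least_squares


def toy_model(params, t):
    """Hypothetical stand-in for a simulated trace: a damped cosine."""
    amplitude, freq_mhz, decay_us = params
    return amplitude * np.exp(-t / decay_us) * np.cos(2 * np.pi * freq_mhz * t)


def residuals(params, t, measured):
    """Difference between the model and the measured trace at each time point."""
    return toy_model(params, t) - measured


# Synthetic "measurement": known parameters plus Gaussian noise (not real data)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 400)            # time axis in microseconds
true_params = np.array([1.0, 2.2, 1.5])   # amplitude, frequency (MHz), decay (us)
measured = toy_model(true_params, t) + rng.normal(scale=0.02, size=t.size)

# Bounded least-squares fit starting from a rough initial guess
fit = least_squares(
    residuals,
    x0=[0.8, 2.15, 1.0],
    bounds=([0.0, 0.0, 0.1], [10.0, 10.0, 10.0]),
    args=(t, measured),
)

chi2 = np.sum(fit.fun ** 2)  # summed squared residual at the optimum
print("fitted parameters:", fit.x)
print("sum of squared residuals:", chi2)
```

Oscillatory models like this one have many local minima in the frequency parameter, so the initial guess needs to be reasonably close to the true value.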

For this notebook I will translate the MATLAB code to Python. The following code loads and prepares the data for fitting using DNPLab, a Python package for which I am a principal contributor. This package handles proprietary EPR spectrometer data formats.

**Import packages**

In [2]:
import sys
import os
import re

import dnplab  # required by the processing and import cells below
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

**Define a data processing function**

In [2]:
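def proc_ESE(ws):
    """Transform the "proc" data in the workspace from the time domain to the frequency domain."""

    dnplab.dnpNMR.remove_offset(ws, dim="ns")  # subtract the constant offset
    dnplab.dnpTools.baseline(ws, dim="ns", type="polynomial", order=6)  # subtract a 6th-order polynomial baseline
    dnplab.dnpNMR.window(ws, type="hamming", dim="ns")  # apodize with a Hamming window
    dnplab.dnpNMR.fourier_transform(ws, dim="ns")
    # keep the absolute value of the real part for the second half of the transformed axis
    ws["proc"].values = abs(ws["proc"].real.values[ws["proc"].shape[0] // 2:])

    return ws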
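To make the sequence of steps in `proc_ESE` concrete without DNPLab, here is a minimal NumPy sketch of the same idea applied to a synthetic trace: subtract a polynomial baseline, apply a Hamming window, Fourier transform, and locate the dominant frequency. The function `process_trace`, the synthetic decay, and the 14.5 MHz modulation are assumptions made for this example; it is not DNPLab's implementation and it returns the magnitude spectrum rather than the absolute value of the real part.

```python
import numpy as np


def process_trace(t_us, signal, poly_order=6):
    """Return (frequencies in MHz, magnitude spectrum) for a 1D time-domain trace."""
    # Rescale the time axis so the polynomial fit is numerically well behaved
    x = (t_us - t_us.mean()) / (t_us.max() - t_us.min())
    # Fit and subtract a polynomial baseline; the constant term removes the offset
    coeffs = np.polynomial.polynomial.polyfit(x, signal, poly_order)
    flattened = signal - np.polynomial.polynomial.polyval(x, coeffs)
    # Apodize with a Hamming window to reduce truncation artifacts
    windowed = flattened * np.hamming(flattened.size)
    # Real-input FFT gives only the non-negative frequencies
    dt = t_us[1] - t_us[0]
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(windowed.size, d=dt)  # MHz, because dt is in microseconds
    return freqs, spectrum


# Synthetic trace (not real data): decaying echo with a weak 14.5 MHz modulation
t_us = np.arange(512) * 0.008  # 8 ns steps
trace = 0.2 + np.exp(-t_us / 2.0) * (1.0 + 0.3 * np.cos(2 * np.pi * 14.5 * t_us))

freqs, spectrum = process_trace(t_us, trace)
peak_mhz = freqs[np.argmax(spectrum)]
print("dominant frequency (MHz):", peak_mhz)
```

The position of the maximum in `spectrum` plays the same role as the `data_maximum` value collected for each file in the next cell.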

**Retrieve and sort the data**

In [None]:
directory = "./resources/taud/"

files = os.listdir(directory)

date = []
samp_id = []
samp_num = []
experiment = []
mag_field = []
file_type = []
maximum_loc = []
loc_corrupt = []
for path in files:
    # skip macOS metadata files
    if ".DS_Store" in path:
        continue

    # split the file name at each run of digits, keeping the digits as separate items
    txt = re.split(r"(\d+)", path)

    # only process ESE data files (.DTA) whose names follow the expected pattern
    if len(txt) > 6 and txt[6] == ".DTA" and "ESE" in path:
        date.append(txt[1])
        samp_id.append(txt[2])
        samp_num.append(txt[3])
        experiment.append(txt[4])
        mag_field.append(float(txt[5]))  # numeric, so it can be plotted and used in calculations
        data = dnplab.dnpImport.load(os.path.join(directory, path))
        ws = dnplab.create_workspace("proc", data)
        ws["proc"].attrs["nmr_frequency"] = ws["proc"].attrs["frequency"]

        # transform to the frequency domain
        ws = proc_ESE(ws)

        # record the axis position of the largest value in the processed data
        maximum_loc.append(ws["proc"].coords["ns"][np.argmax(ws["proc"].real.values)])

table_dict = {"date": date,
              "samp_id": samp_id,
              "samp_num": samp_num,
              "experiment": experiment,
              "mag_field": mag_field,
              "data_maximum": maximum_loc
              }

df = pd.DataFrame(table_dict, columns=table_dict.keys())
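The indexing into `txt` above relies on how `re.split` behaves when the pattern contains a capturing group: the matched digit runs are kept in the result, so text and numbers alternate. The short example below shows this on a hypothetical file name; the real naming scheme in `./resources/taud/` is not shown in this notebook and may differ.

```python
import re

# Hypothetical file name, assumed to follow date + sample id + sample number
# + experiment + magnetic field + extension
name = "20130215TauD3ESE3400.DTA"

# The capturing group keeps each run of digits as its own list item
parts = re.split(r"(\d+)", name)
print(parts)
# ['', '20130215', 'TauD', '3', 'ESE', '3400', '.DTA']

# Map the items onto the column names used in the table
fields = {
    "date": parts[1],
    "samp_id": parts[2],
    "samp_num": parts[3],
    "experiment": parts[4],
    "mag_field": parts[5],
    "extension": parts[6],
}
print(fields)
```

Because the name starts with digits, the first item is an empty string, which is why the date sits at index 1 rather than index 0.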

**Scatter plot for quick visualization of the data**

In [None]:
sns.set_theme(style="darkgrid")
sns.relplot(y="data_maximum", x="mag_field", hue="samp_num", data=df)

<a id="Modeling"> </a>
## Modeling

**One dimensional model**

The location of the data maximum is directly related to the dominant frequency in the unprocessed time-domain data. The possible frequencies are determined by the nuclei present in the sample, here protons ($^{1}H$) and/or deuterium ($^{2}H$), so the nucleus can be identified from the dominant frequency. However, magnetic dipolar coupling introduces error by shifting the dominant frequency and/or splitting it into multiple frequencies. Nevertheless, the frequency region of the maximum should correlate with the magnetic field ($B_{0}$) according to the relationship

$freq = \gamma B_{0}$

where $\gamma$ is the gyromagnetic ratio of the nucleus, expressed in frequency units ($\gamma / 2\pi$). A training set can be constructed by predicting $freq$ for both nuclei with the above equation, using the magnetic field values in `mag_field`. Next, a test set can be constructed from $freq$ values slightly shifted from the predicted values of the training set. Once the model reliably picks the correct nucleus from the correlation of magnetic field and frequency in the test set, it can be used to guess the nucleus in the samples.

**Create training set**

In [1]:
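# 4.2576 and 0.6536 are gamma/(2*pi) in MHz per kG (42.576 and 6.536 MHz/T); mag_field is assumed to be in kG
fields = np.asarray(table_dict["mag_field"], dtype=float)
freq_1H = fields * 4.2576  # MHz, predicted 1H frequency
freq_2H = fields * 0.6536  # MHz, predicted 2H frequency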
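One way the training and test sets described above could be assembled and used is sketched below with a k-nearest-neighbors classifier from scikit-learn. The field range, the 5% frequency shift, the helper `make_set`, and the choice of classifier are all assumptions for illustration; the fields are taken to be in kG so that the gyromagnetic ratios match the constants used above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# gamma/(2*pi) in MHz per kG, the same constants as in the cell above
GAMMA_MHZ_PER_KG = {"1H": 4.2576, "2H": 0.6536}


def make_set(fields_kg, rel_shift=0.0, rng=None):
    """Return features [B0, freq] and nucleus labels for both nuclei at each field."""
    features, labels = [], []
    for nucleus, gamma in GAMMA_MHZ_PER_KG.items():
        freq = gamma * fields_kg
        if rel_shift:
            # mimic dipolar shifts with a random relative offset (hypothetical size)
            freq = freq * (1.0 + rng.uniform(-rel_shift, rel_shift, size=freq.size))
        features.append(np.column_stack([fields_kg, freq]))
        labels.extend([nucleus] * fields_kg.size)
    return np.vstack(features), np.array(labels)


rng = np.random.default_rng(0)
train_fields = np.linspace(3.0, 3.6, 25)        # kG, hypothetical field range
test_fields = rng.uniform(3.0, 3.6, size=40)    # kG

X_train, y_train = make_set(train_fields)                          # exact predictions
X_test, y_test = make_set(test_fields, rel_shift=0.05, rng=rng)    # shifted frequencies

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("test accuracy:", accuracy)

# Guess the nucleus for a single (field, frequency) pair
guess = clf.predict(np.array([[3.4, 14.3]]))[0]
print("predicted nucleus:", guess)
```

Because the two gyromagnetic ratios differ by more than a factor of six, the classes are well separated and even a very simple classifier should suffice; larger shifts or overlapping signals would make the problem harder.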


**Two dimensional model**

![MATLAB example of pre-process](http://)

<a id="Evaluation"> </a>
## Evaluation

<a id="Deployment 1"> </a>
### Deployment 1

##### **One-dimensional ESEEM data inform on the position of taurine in TauD**

Using the angles and distances that yielded the best $\chi^2$, a physical picture of the active site of TauD can be constructed.

![TauD results 1](http://)

<a id="Deployment 2"> </a>
### Deployment 2

##### **Two-dimensional HYSCORE data inform on the position of water in TauD**

Using the same methodology, the two-dimensional data can be modeled to yield a physical picture of the location of the water molecules in the active site of TauD.

![TauD results 2](http://)