# Increased Confidence in Metabolite Identity from CE-MS through Migration Time Prediction by Machine Learning with Scikit-Learn
---

> 07-December-2018


> Michael Loewen (qq123456) & Bill Zizek (qq268547)


> SCS 3253
--- 
## Introduction
**Capillary Electrophoresis (CE) - Mass Spectrometry (MS)** is a multidimensional analytical separation technique used to analyze biological samples, among other sample types. Upon injection, a voltage is applied across the capillary, and the compounds in a sample separate as they migrate through it according to differences in their electrophoretic mobility before reaching a time-of-flight (TOF) mass detector. Although CE-MS is a highly accurate and robust detection method, ambient and human variables can cause the same compound to migrate at different times under otherwise identical conditions. This variability has no impact on the detection and quantification of compounds. However, dependable migration times obtained from a predictive model could reduce the need for expensive chemical standards: a predicted migration time is additional evidence, alongside accurate mass and isotopic pattern information, that raises confidence in compound identification and thus increases the value of datasets containing unknowns.

Across multiple experimental runs, a compound may be observed several times, each with its own recorded migration time. This experimental data is annotated with relevant physicochemical properties and miscellaneous metadata to give context to the machine learning results. For each compound, chemical databases can be used to extract the properties needed to calculate relevant features such as effective mobility, molecular volume, and absolute compound mobility. After the model is trained and its performance profiled, further annotation with chemical compound classes can give context to the model's results.

### Variables
`Experiment #` is a unique identifier corresponding to one instrumental data acquisition. Each "experiment" takes 30-45 minutes and contains data for many different compounds. The data captured per `Experiment #` include a measured `m/z` value, `MT_sec`, and many other data points that are outside the scope of this study.

`m/z` is a compound's mass-to-charge ratio (observed mass divided by charge number). In this study, each `Compound` has its own distinct `m/z` value, but in practice many compounds can share the same `m/z` value.

`Compound` is a name identifier for drug compounds in this study. Compound identity has been confirmed already by `m/z`, relative migration time, and comigrating internal standards for unambiguous ID.

`MT_sec` is the migration time of a compound's peaks measured in seconds.

`FPhe_RMT` is a relative migration time (RMT) calculated per `Experiment #`: a compound's `MT_sec` is divided by the `MT_sec` of F-Phe within the same `Experiment #`.
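
As a minimal sketch of this ratio, the snippet below recomputes a relative migration time from `MT_sec`. The three rows are copied from the preview of `MT_data.csv` (experiment 1); the helper name `add_fphe_rmt` and the output column `FPhe_RMT_calc` are hypothetical, and the sketch assumes exactly one F-Phe peak per `Experiment #`.

```python
import pandas as pd

# Three rows copied from the data preview (Experiment # 1 only).
toy = pd.DataFrame({
    "Experiment #": [1, 1, 1],
    "Compound": ["Gabapentin", "Cotinine", "F-Phe"],
    "MT_sec": [1018.32, 967.56, 1186.5],
})


def add_fphe_rmt(df, reference="F-Phe"):
    """Return a copy of df with MT_sec divided by the reference MT_sec
    of the same experiment (hypothetical helper, not the authors' code)."""
    # One reference migration time per experiment, indexed by experiment.
    # Assumption: the reference compound appears exactly once per experiment.
    ref_mt = (
        df.loc[df["Compound"] == reference]
        .set_index("Experiment #")["MT_sec"]
    )
    out = df.copy()
    # Look up each row's reference time by experiment, then take the ratio.
    out["FPhe_RMT_calc"] = out["MT_sec"] / out["Experiment #"].map(ref_mt)
    return out


result = add_fphe_rmt(toy)
print(result)
```

By construction the reference compound itself gets a ratio of 1.0, which matches the F-Phe row in the preview.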

`Drug Class` is a family classification that relates to the respective `Compound`.

`Monoisotopic Mass` is the sum of the masses of the atoms in a molecule using the unbound, ground-state, rest mass of the principal (most abundant) isotope.
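
The relationship between `Monoisotopic Mass` and `m/z` can be illustrated with a short calculation. This is a sketch under an assumption that is not stated in the text: that ions are singly protonated, [M+H]+, which is consistent with the roughly 1.0073 Da offset between the two columns in the data preview. The function name `protonated_mz` is hypothetical.

```python
PROTON_MASS = 1.007276  # Da, rest mass of a proton (rounded)


def protonated_mz(monoisotopic_mass, charge=1):
    """Theoretical m/z of an ion carrying `charge` extra protons.

    Assumption: positive-ion mode with protonation as the only adduct.
    """
    return (monoisotopic_mass + charge * PROTON_MASS) / charge


# (Monoisotopic Mass, m/z) pairs copied from the data preview.
examples = {
    "Gabapentin": (171.125931, 172.1332),
    "Cotinine": (176.094955, 177.1022),
    "Mephedrone": (177.115364, 178.1226),
}

for name, (mass, listed_mz) in examples.items():
    predicted = protonated_mz(mass)
    # Difference between the listed and predicted m/z in parts per million.
    ppm = (listed_mz - predicted) / predicted * 1e6
    print(f"{name}: predicted {predicted:.4f}, listed {listed_mz:.4f}, diff {ppm:.2f} ppm")
```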

`Strongest Acidic/Basic pKa`, `logP`, and `Van der Waals volume` are physicochemical properties pertinent to the theory of Capillary Electrophoresis that contribute to differences in compound migration time.

---

## Data Import and Exploratory Inspection

In [1]:
import pandas as pd
data = pd.read_csv("./data/MT_data.csv")
data.head()

| Unnamed: 0 | Experiment # | m/z | Compound | MT_sec | FPhe_RMT | Drug Class | Monoisotopic Mass | Strongest Acidic pKa | Strongest Basic pKa | logP | Van der Waals volume |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 141.1435 | Amphetamine-d5 | 944.7 | 0.796207 | Internal Standard | 140.1362 |  | 10.01 | 1.8 | 144.86 |
| 1 | 1 | 172.1332 | Gabapentin | 1018.32 | 0.858255 | Other | 171.125931 | 4.63 | 9.91 | -1.27 | 176.16 |
| 2 | 1 | 177.1022 | Cotinine | 967.56 | 0.815474 | Other | 176.094955 |  | 4.79 | 0.21 | 165.31 |
| 3 | 1 | 178.1226 | Mephedrone | 1000.38 | 0.843135 | Amphetamine | 177.115364 |  | 8.05 | 2.12 | 181.62 |
| 4 | 1 | 184.0768 | F-Phe | 1186.5 | 1.0 | Internal Standard | 183.0696 | 1.86 | 9.45 | -1.04 | 160.46 |


![CompoundRMT Boxplot](./figures/compoundRMT_boxplot_clean.png)

In [2]:
compound_freq = pd.read_csv("./figures/compound_freqtable.csv")
compound_freq.head()

| Unnamed: 0 | Compound | n |
|---|---|---|
| 0 | 6-acetylmorphine | 27 |
| 1 | 7-aminoclonazepam | 27 |
| 2 | 7-aminoflunitrazepam | 27 |
| 3 | Amitriptyline | 27 |
| 4 | Amphetamine-d5 | 27 |


In [3]:
drugClass_freq = pd.read_csv("./figures/drugClass_freqtable.csv")
drugClass_freq

| Unnamed: 0 | Drug Class | n |
|---|---|---|
| 0 | Amphetamine | 108 |
| 1 | Anesthetic | 148 |
| 2 | Antidepressants | 281 |
| 3 | Internal Standard | 54 |
| 4 | Opioid | 512 |
| 5 | Other | 122 |
| 6 | Sedative | 189 |
| 7 | Stimulant | 81 |
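
The frequency tables above are loaded from precomputed CSV files, and how they were generated is not shown. As one possible way to derive the same kind of table directly from the measurement data, the sketch below counts rows per `Compound` and per `Drug Class` with pandas. The toy rows and the helper name `freq_table` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for the full measurement table.
toy = pd.DataFrame({
    "Compound": ["Gabapentin", "Cotinine", "Gabapentin", "Mephedrone"],
    "Drug Class": ["Other", "Other", "Other", "Amphetamine"],
})


def freq_table(df, column):
    """Count rows per unique value of `column`, sorted by that value."""
    # groupby sorts its keys by default; size() counts rows in each group.
    return df.groupby(column).size().reset_index(name="n")


compound_counts = freq_table(toy, "Compound")
class_counts = freq_table(toy, "Drug Class")
print(compound_counts)
print(class_counts)
```

Applied to the full dataset, the same call would be expected to produce one row per compound or drug class with its measurement count in `n`.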


![drugClassRMT Boxplot](./figures/drugClassRMT_box.png)

### Interpretations

There do not appear to be any outliers in `FPhe_RMT` for any individual drug. The RMT values also fall within an ideal window of approximately 0.75-1.1. In future work, amphetamines should perhaps be evaluated against an internal standard (IS) other than F-Phe to maintain a tighter window, but for the purposes of this study the current window is acceptable. Most individual compounds showed little migration time variability across experimental runs. Opioid measurements outnumber those of the other drug classes, but this should not affect the scope of this work.
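
The observations about variability and outliers were made from the boxplots. A numerical check could look like the sketch below, which summarizes `FPhe_RMT` per compound with a coefficient of variation and flags points outside 1.5 times the interquartile range, the usual boxplot whisker rule. The data are made up for illustration (compound "B" is given one deliberately extreme value), and the helper names `rmt_variability` and `flag_iqr_outliers` are hypothetical.

```python
import pandas as pd

# Made-up RMT values; compound "B" contains one deliberately extreme point.
toy = pd.DataFrame({
    "Compound": ["A"] * 5 + ["B"] * 5,
    "FPhe_RMT": [0.856, 0.857, 0.858, 0.859, 0.860,
                 0.813, 0.815, 0.815, 0.817, 0.900],
})


def rmt_variability(df):
    """Per-compound count, mean, standard deviation, and CV (%) of FPhe_RMT."""
    summary = df.groupby("Compound")["FPhe_RMT"].agg(["count", "mean", "std"])
    summary["cv_percent"] = 100 * summary["std"] / summary["mean"]
    return summary


def flag_iqr_outliers(df, k=1.5):
    """Boolean Series marking rows outside [Q1 - k*IQR, Q3 + k*IQR]
    of their own compound."""
    grouped = df.groupby("Compound")["FPhe_RMT"]
    # transform broadcasts each group's quartile back onto its rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (df["FPhe_RMT"] < q1 - k * iqr) | (df["FPhe_RMT"] > q3 + k * iqr)


summary = rmt_variability(toy)
flags = flag_iqr_outliers(toy)
print(summary)
print(toy[flags])
```

A low `cv_percent` for a compound corresponds to the tight boxes seen in the per-compound plot, while any flagged rows would be candidates for closer inspection.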