## Introduction

This notebook introduces the data available in this project and shows how to access it.

### Data

The data totals ~1GB and is available from the following [Google Drive link](https://drive.google.com/drive/folders/13_goQHv07qy-fJILSXxOxyiKtzrr5BRe?usp=sharing). You will find two main files in the folder:
1. abmelt.pkl
2. martini.pkl

These two files contain the simulation datasets for all 47 antibodies, produced with the all-atom AbMelt protocol and the Martini3 coarse-grained protocol respectively. Each is a pickle file that Python loads as a nested dictionary, so the contents are accessed by key. Let's first look at the AbMelt data (all-atom simulations).

In [1]:
import pickle

data_location = "data/data" # edit this path to wherever you downloaded the data to

with open(f'{data_location}/abmelt.pkl', "rb") as f:
    abmelt_data = pickle.load(f)

# we can see that at the highest level data is grouped by antibody
print("Number of antibodies =", len(abmelt_data))
print(abmelt_data.keys())

Number of antibodies = 47
dict_keys(['abituzumab', 'abrilumab', 'adalimumab', 'alirocumab', 'anifrolumab', 'atezolizumab', 'bapineuzumab', 'basiliximab', 'bavituximab', 'benralizumab', 'bevacizumab', 'blosozumab', 'bococizumab', 'brentuximab', 'briakinumab', 'brodalumab', 'canakinumab', 'carlumab', 'certolizumab', 'clazakizumab', 'codrituzumab', 'crenezumab', 'dacetuzumab', 'daclizumab', 'daratumumab', 'denosumab', 'dinutuximab', 'duligotuzumab', 'emibetuzumab', 'enokizumab', 'epratuzumab', 'etrolizumab', 'farletuzumab', 'fasinumab', 'ficlatuzumab', 'fulranumab', 'ixekizumab', 'muromonab', 'obinutuzumab', 'otelixizumab', 'ozanezumab', 'ponezumab', 'rituximab', 'seribantumab', 'simtuzumab', 'vedolizumab', 'veltuzumab'])


In [26]:
import pandas as pd

# let's look specifically at the first antibody abituzumab
print("Number of features =", len(abmelt_data["abituzumab"]))
print(abmelt_data["abituzumab"].keys())
print("Feature data type =", type(abmelt_data["abituzumab"]["angles"]))
print("All features are pandas dataframe?", all(isinstance(abmelt_data["abituzumab"][x], pd.DataFrame) for x in abmelt_data["abituzumab"]))

Number of features = 22
dict_keys(['angles', 'bonds', 'charge', 'coloumb14', 'coloumbSR', 'contacts', 'covar', 'dipole', 'distances', 'energy', 'enthalpy', 'gyr', 'kinetic', 'lj14', 'ljSR', 'partition-sasa', 'per-block-s2', 'per-res-s2', 'potential', 'rmsd', 'rmsf', 'sasa'])
Feature data type = <class 'pandas.core.frame.DataFrame'>
All features are pandas dataframe? True


In [None]:
# inspecting each feature we can see that they are of mixed shapes
antibody = abmelt_data["abituzumab"]
for name, df in antibody.items():
    print("Feature", name, "shape =", df.shape)

# each feature is a DataFrame grouping related subfeatures,
# with one column per subfeature per temperature
cols_count = 0
total_data_points = 0
for name, df in antibody.items():
    print(name, df.shape)
    cols_count += df.shape[1]
    total_data_points += df.shape[0] * df.shape[1]
# every subfeature appears once per temperature, so divide by the 3 temperatures
print("Total columns for abituzumab per temp:", cols_count // 3)
print("Total datapoints for abituzumab:", total_data_points)

Feature angles shape = (501, 15)
Feature bonds shape = (8001, 6)
Feature charge shape = (10, 21)
Feature coloumb14 shape = (8001, 3)
Feature coloumbSR shape = (8001, 3)
Feature contacts shape = (8001, 3)
Feature covar shape = (1620, 3)
Feature dipole shape = (8001, 12)
Feature distances shape = (501, 18)
Feature energy shape = (8001, 3)
Feature enthalpy shape = (8001, 3)
Feature gyr shape = (8001, 96)
Feature kinetic shape = (8001, 3)
Feature lj14 shape = (8001, 3)
Feature ljSR shape = (8001, 3)
Feature partition-sasa shape = (8001, 3)
Feature per-block-s2 shape = (8, 3)
Feature per-res-s2 shape = (215, 3)
Feature potential shape = (9, 24)
Feature rmsd shape = (8001, 6)
Feature rmsf shape = (113, 21)
Feature sasa shape = (8001, 21)
angles (501, 15)
bonds (8001, 6)
charge (10, 21)
coloumb14 (8001, 3)
coloumbSR (8001, 3)
contacts (8001, 3)
covar (1620, 3)
dipole (8001, 12)
distances (501, 18)
energy (8001, 3)
enthalpy (8001, 3)
gyr (8001, 96)
kinetic (8001, 3)
lj14 (8001, 3)
ljSR (8001, 

In [30]:
# let's look specifically at the bonds feature
abmelt_data["abituzumab"]["bonds"]

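|      | bonds_all_300K | bonds_lh_300K | bonds_all_350K | bonds_lh_350K | bonds_all_400K | bonds_lh_400K |
|------|----------------|---------------|----------------|---------------|----------------|---------------|
| 0    | 164.0          | 2.0           | 157.0          | 3.0           | 157.0          | 5.0           |
| 1    | 160.0          | 1.0           | 162.0          | 3.0           | 157.0          | 5.0           |
| 2    | 160.0          | 3.0           | 166.0          | 4.0           | 162.0          | 3.0           |
| 3    | 156.0          | 3.0           | 168.0          | 5.0           | 161.0          | 5.0           |
| 4    | 165.0          | 2.0           | 168.0          | 4.0           | 140.0          | 2.0           |
| ...  | ...            | ...           | ...            | ...           | ...            | ...           |
| 7996 | 170.0          | 3.0           | 157.0          | 5.0           | 133.0          | 5.0           |
| 7997 | 165.0          | 4.0           | 155.0          | 4.0           | 159.0          | 6.0           |
| 7998 | 160.0          | 4.0           | 145.0          | 4.0           | 155.0          | 7.0           |
| 7999 | 167.0          | 4.0           | 154.0          | 3.0           | 153.0          | 5.0           |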


For the "bonds" feature we have 6 columns (subfeatures): a bonds_all and a bonds_lh column for each of the three temperatures at which the antibody was simulated (300K, 350K and 400K). The feature records hydrogen bond counts for the entire Fv region (bonds_all) and for bonds between the heavy and light chains only (bonds_lh). These counts are time series taken at regular snapshots along the MD simulation trajectory. The simulations ran for 100ns, but for this and all the other time series data the first 20ns were discarded to remove potentially noisy values from equilibration, which is standard practice in MD. The 8001 rows that remain cover the final 80ns, which corresponds to one snapshot every 0.01ns.
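The timing described above can be made explicit by attaching a time axis to a time series feature. The following is a minimal sketch, not part of the original analysis: `add_time_index` is a hypothetical helper, and it assumes that rows are evenly spaced and that the retained window runs from 20ns to 100ns inclusive. The DataFrame here is a synthetic stand-in with random values that only borrows the real column names.

```python
import numpy as np
import pandas as pd


def add_time_index(df, total_ns=100.0, discarded_ns=20.0):
    """Return a copy of a time series feature indexed by simulation time (ns).

    Assumption: rows are evenly spaced and span discarded_ns..total_ns inclusive.
    """
    times = np.linspace(discarded_ns, total_ns, num=len(df))
    out = df.copy()
    out.index = pd.Index(times, name="time_ns")
    return out


# Synthetic stand-in for abmelt_data["abituzumab"]["bonds"]: random values, real column names
rng = np.random.default_rng(0)
columns = ["bonds_all_300K", "bonds_lh_300K", "bonds_all_350K",
           "bonds_lh_350K", "bonds_all_400K", "bonds_lh_400K"]
bonds = pd.DataFrame(rng.integers(0, 170, size=(8001, 6)).astype(float), columns=columns)

bonds_t = add_time_index(bonds)
spacing_ns = bonds_t.index[1] - bonds_t.index[0]
print("first/last time (ns):", bonds_t.index[0], bonds_t.index[-1])
print("implied spacing (ns):", round(spacing_ns, 4))
```

With the real data you would pass `abmelt_data["abituzumab"]["bonds"]` instead of the synthetic frame. The spacing is inferred from the row count rather than read from the simulation settings, so treat it as an estimate.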

Every major feature in our dataset that has 8001 rows is a time series with this equilibration cutoff applied. Notice that for every feature the number of columns is divisible by 3: each feature includes a separate column of values for each temperature. For a more detailed overview of every unique feature/subfeature in this data, check out the featur_reference file.
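Because every feature carries one column per temperature, it can be convenient to regroup columns by temperature. The sketch below assumes that every column name ends in a suffix such as `_300K`, as the bonds columns do; this has not been checked for the other features. `split_by_temperature` and `time_series_features` are hypothetical helpers, and the `toy` feature is made up for illustration.

```python
import re

import numpy as np
import pandas as pd

TEMP_SUFFIX = re.compile(r"_(\d+K)$")


def split_by_temperature(df):
    """Return {temperature: DataFrame}, with the temperature suffix stripped from column names."""
    groups = {}
    for col in df.columns:
        match = TEMP_SUFFIX.search(str(col))
        if match is None:
            continue  # skip columns that do not follow the assumed naming pattern
        base_name = str(col)[:match.start()]
        groups.setdefault(match.group(1), {})[base_name] = df[col]
    return {temp: pd.DataFrame(cols) for temp, cols in groups.items()}


def time_series_features(antibody, n_rows=8001):
    """Names of the features whose row count matches the time series length."""
    return [name for name, df in antibody.items() if len(df) == n_rows]


# Synthetic stand-in for one antibody: random values, bonds column names from the notebook
rng = np.random.default_rng(0)
bond_cols = ["bonds_all_300K", "bonds_lh_300K", "bonds_all_350K",
             "bonds_lh_350K", "bonds_all_400K", "bonds_lh_400K"]
toy_cols = ["toy_300K", "toy_350K", "toy_400K"]  # hypothetical non-time-series feature
antibody = {
    "bonds": pd.DataFrame(rng.normal(160, 5, size=(8001, 6)), columns=bond_cols),
    "toy": pd.DataFrame(rng.normal(0, 1, size=(8, 3)), columns=toy_cols),
}

by_temp = split_by_temperature(antibody["bonds"])
for temp, frame in by_temp.items():
    print(temp, list(frame.columns), frame.shape)
print("time series features:", time_series_features(antibody))
```

Stripping the suffix gives each temperature the same column names, which makes it straightforward to compare, for example, `by_temp["300K"]["bonds_lh"]` against `by_temp["400K"]["bonds_lh"]`.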

Earlier we showed that we have over a million data points per antibody. That's a lot of data, so it is worth starting with some summary statistics and perhaps using them in any later machine learning. The AbMelt paper simply uses the mean and standard deviation of each subfeature for its models, but you should consider other approaches to dataset reduction that preserve more information. Much of the skill in machine learning lies in deciding how to approach such large datasets and how to represent the data, trading off information capacity, computational efficiency and model appropriateness. For example, larger models like transformers may benefit from keeping as much information as possible (one element of the data analysis is then determining how redundant the data is for large-model learning: how much do we actually need?). My advice would be to start basic, with linear models and multivariate analysis across antibodies on summaries of the per-feature data, or to explore information capacity and redundancy with dimensionality reduction techniques such as PCA and clustering. Good luck!
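As a starting point, the mean and standard deviation summary and a first look with PCA might be sketched as follows. This is an illustration under assumptions, not the AbMelt pipeline: `summarise_antibody` and `pca` are hypothetical helpers, the PCA is a plain SVD on standardised columns, and the data is a small synthetic stand-in (five made-up antibodies, random values, a made-up `toy` feature).

```python
import numpy as np
import pandas as pd


def summarise_antibody(features):
    """Flatten one antibody's feature dict into per-column means and standard deviations."""
    stats = {}
    for feature_name, df in features.items():
        for col in df.columns:
            stats[f"{feature_name}:{col}:mean"] = df[col].mean()
            stats[f"{feature_name}:{col}:std"] = df[col].std()
    return pd.Series(stats)


def pca(X, n_components=2):
    """PCA via SVD on standardised columns; returns (scores, explained variance ratios)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    keep = std > 0  # constant columns carry no variance and would divide by zero
    Z = (X[:, keep] - X[:, keep].mean(axis=0)) / std[keep]
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    ratios = (S ** 2) / np.sum(S ** 2)
    return scores, ratios[:n_components]


# Synthetic stand-in for abmelt_data: same nesting (antibody -> feature -> DataFrame)
rng = np.random.default_rng(0)
bond_cols = [f"bonds_{kind}_{t}K" for t in (300, 350, 400) for kind in ("all", "lh")]
toy_cols = [f"toy_{t}K" for t in (300, 350, 400)]  # hypothetical feature
toy_data = {}
for i in range(5):
    toy_data[f"antibody_{i}"] = {
        "bonds": pd.DataFrame(rng.normal(160 - 5 * i, 5, size=(200, 6)), columns=bond_cols),
        "toy": pd.DataFrame(rng.normal(i, 1, size=(50, 3)), columns=toy_cols),
    }

# One row per antibody, one column per summary statistic
summary = pd.DataFrame({name: summarise_antibody(f) for name, f in toy_data.items()}).T
scores, explained = pca(summary.to_numpy())
print("summary table shape:", summary.shape)
print("explained variance ratio of first two components:", explained.round(3))
```

On the real data the same pattern would apply to `abmelt_data` directly, although features with very few rows may produce missing or unstable standard deviations and should be checked before fitting any model.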