# Data preparation usage and examples

This notebook showcases two ways of interacting with the `mbae` data preparation code.


The `mbae.py prepare` command covers most of the user needs in data preparation (see the first section).

`mbae` also provides a useful interface for programmatic access to the data preparation routines (see the second section).


## 1. Access via `mbae.py` interface

`mbae.py` has several commands, each having a collection of options.
Let's look at the commands first.

In [1]:
! python mbae.py --help

Usage: mbae.py [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  alleles  List all supported alleles
  predict  Predict binding affinity
  prepare  Prepares training data


And then we'll look at the options for the `prepare` command.

In [2]:
! python mbae.py prepare --help

Usage: mbae.py prepare [OPTIONS]

  Prepares training data

Options:
  -d, --download_dir DIRECTORY    Path to download directory  [default: ./]
  -D, --database TEXT             Databases to prepare. Supports multiple
                                  values. Use `--database iedb` or `--database
                                  bdata` to download and prepare IEDB or
                                  Bdata, separately. Currently, available
                                  resources are: iedb, bdata, and none (for
                                  not parsing any data source)  [default: all]

  -m, --mapping FILE              Path to mapping: a headerless file with
                                  space-like separator (e.g., \t) holding
                                  mappings between allele names (1st column)
                                  and accessions (2nd column). Accessions must
                                  be from IPD-MHC or IMGT/HLA.If not provided,
                

Suppose one wants to prepare only the mapping between allotypes and accessions and save it into the current directory.
In that case, one would use `prepare` in the following manner:

In [3]:
! mbae.py prepare --database none --save mapping

In [4]:
! head mapping.tsv

HLA-A*01:01	HLA27590
HLA-A*01010101	HLA00001
HLA-A*010101	HLA00001
HLA-A*01011	HLA00001
HLA-A*0101	HLA00001
HLA-A*01:02	HLA26566
HLA-A*0102	HLA00002
HLA-A*01:03	HLA23245
HLA-A*0103	HLA00003
HLA-A*01:04	HLA18724


One can observe that the file `mapping.tsv` has appeared in the current directory.
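
To reuse such a file outside of `mbae`, it can be read into a plain dictionary. The snippet below is a minimal sketch, not part of `mbae`: the helper name `read_mapping` is hypothetical, and it assumes that allele names in the file contain no whitespace. The sample lines are copied from the output above.

```python
import io


def read_mapping(handle):
    """Read a headerless, whitespace-separated file into {allele_name: accession}."""
    mapping = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, accession = fields[0], fields[1]
        mapping[name] = accession
    return mapping


# A few lines copied from the `head mapping.tsv` output.
sample = io.StringIO(
    "HLA-A*01:01\tHLA27590\n"
    "HLA-A*01010101\tHLA00001\n"
    "HLA-A*0101\tHLA00001\n"
)
sample_mapping = read_mapping(sample)
print(sample_mapping["HLA-A*0101"])  # HLA00001
```

For a real file, one would pass `open('mapping.tsv')` instead of the in-memory sample.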

Since we've already obtained the mapping between accessions and allotypes, we can use it to prepare the IEDB resource.
Let's also save the raw and parsed files into the `./tmp` directory (it must be created beforehand).

In [5]:
! mkdir ./tmp

In [6]:
! mbae.py prepare -d ./tmp -m ./mapping.tsv -D iedb -s raw -s parsed

 'FLA-E*01801' 'H2-Db H155A mutant' 'H2-Db Y159F mutant'
 'H2-Kb D77S, K89A mutant' 'H2-Kb E152A, R155Y, L156Y mutant'
 'H2-Kb Y22F, M23I, E24S, D30N mutant' 'H2-Kb Y84A mutant'
 'H2-Kb Y84C mutant' 'H2-Lq' 'H2-d class I' 'HLA class I'
 'HLA-A*02:01 K66A mutant' 'HLA-A*02:01 K66A, E63Q mutant' 'HLA-A1'
 'HLA-A11' 'HLA-A2' 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A68'
 'HLA-B*08:01 B:I66A mutant' 'HLA-B*08:01 E76C mutant' 'HLA-B27' 'HLA-B39'
 'HLA-B40' 'HLA-B44' 'HLA-B51' 'HLA-B58' 'HLA-B60' 'HLA-B62' 'HLA-B7'
 'HLA-B8' 'HLA-Cw1' 'HLA-Cw4' 'Mamu-B*001:01' 'Mamu-B*003:01'
 'Ptal-N*01:01' 'RT1-Aa' 'SLA-1*04:01' 'SLA-3*02:02' 'Xela-UAAg'] corresponding to 1109 records


In [7]:
ls ./tmp

IEDB_parsed.tsv      test_data.tsv
mhc_ligand_full.zip  train_data.tsv


Thus, the `./tmp` directory now contains both the raw and the parsed files, as well as the training and testing data split using a 0.8 threshold.

Now, we can also try to prepare both the Bdata2013 and the IEDB datasets.
In this case, the `prepare` command will merge these data sources, appending unique Bdata allele-peptide observations to all IEDB ones.
Let's also make the data preparation verbose so we can clearly see how filtering and merging operations affect the number of entries.

In [8]:
! mbae.py prepare -d ./tmp -m mapping.tsv --verbose

INFO:root:IEDB -- successfully initialized resource
INFO:root:IEDB -- downloaded resource from https://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip
INFO:root:IEDB -- loaded resource; records: 1629184
INFO:root:IEDB -- filtered class I records; records: 1414016
INFO:root:IEDB -- filtered quantitative assays; records: 184527
INFO:root:IEDB -- filtered quantitative measurements; records: 157465
INFO:root:IEDB -- filtered by evidence codes; records: 157465
INFO:root:IEDB -- filtered by antigen type; records: 157461
INFO:root:IEDB -- filtered by peptide length; records: 157059
 'FLA-E*01801' 'H2-Db H155A mutant' 'H2-Db Y159F mutant'
 'H2-Kb D77S, K89A mutant' 'H2-Kb E152A, R155Y, L156Y mutant'
 'H2-Kb Y22F, M23I, E24S, D30N mutant' 'H2-Kb Y84A mutant'
 'H2-Kb Y84C mutant' 'H2-Lq' 'H2-d class I' 'HLA class I'
 'HLA-A*02:01 K66A mutant' 'HLA-A*02:01 K66A, E63Q mutant' 'HLA-A1'
 'HLA-A11' 'HLA-A2' 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A68'
 'HLA-B*08:01 B:I66A mutant' 

In [9]:
ls ./tmp

IEDB_parsed.tsv      test_data.tsv
mhc_ligand_full.zip  train_data.tsv
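
The merging rule described above (keep every IEDB record and append only those Bdata allele-peptide pairs that IEDB lacks) can be sketched in a few lines. This is an illustration under assumptions rather than the `mbae` implementation: `merge_sources` is a hypothetical helper, records are plain dictionaries, and the toy peptides and accessions are made up.

```python
def merge_sources(iedb_records, bdata_records):
    """Keep every IEDB record; append Bdata records with an unseen (accession, peptide) pair."""
    seen = {(r["accession"], r["peptide"]) for r in iedb_records}
    merged = list(iedb_records)
    for record in bdata_records:
        key = (record["accession"], record["peptide"])
        if key not in seen:
            merged.append(record)
            seen.add(key)  # also drops repeats within Bdata itself
    return merged


# Toy records; the values are made up for illustration.
iedb_toy = [
    {"accession": "HLA00001", "peptide": "AAAWYLWEV", "source": "IEDB"},
    {"accession": "HLA00001", "peptide": "KLEDLERDL", "source": "IEDB"},
]
bdata_toy = [
    {"accession": "HLA00001", "peptide": "AAAWYLWEV", "source": "Bdata"},  # already in IEDB
    {"accession": "HLA00002", "peptide": "AAAWYLWEV", "source": "Bdata"},  # new pair
]
merged_toy = merge_sources(iedb_toy, bdata_toy)
print(len(merged_toy))  # 3
```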


Some allotypes have only a small number of observations.
One may want to split the parsed data into observations belonging to "abundant" and "rare" MHC alleles.
This is easily done by providing the `--separate_rare` or `-S` flag.
Since we do not provide a mapping this time, the log first shows the `IMGTHLA` and `IPDMHC` parsing steps.

In [10]:
! mbae.py prepare -d ./tmp -v -S

INFO:root:IMGTHLA -- successfully initialized resource
INFO:root:IPDMHC -- successfully initialized resource
INFO:root:IMGTHLA -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGTHLA -- successfully extracted mappings
INFO:root:IPDMHC -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPDMHC -- parsed xml tree
INFO:root:IPDMHC -- successfully extracted mappings
INFO:root:IEDB -- successfully initialized resource
INFO:root:IEDB -- downloaded resource from https://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip
INFO:root:IEDB -- loaded resource; records: 1629184
INFO:root:IEDB -- filtered class I records; records: 1414016
INFO:root:IEDB -- filtered quantitative assays; records: 184527
INFO:root:IEDB -- filtered quantitative measurements; records: 157465
INFO:root:IEDB -- filtered by evidence codes; records: 157465
INFO:root:IEDB -- filtered by ant

As a result, four new files appear in the `./tmp` directory: `train_data_abundant.tsv`, `test_data_abundant.tsv`, `train_data_rare.tsv`, and `test_data_rare.tsv`.

In [11]:
ls ./tmp

IEDB_parsed.tsv          test_data_abundant.tsv   train_data_abundant.tsv
mhc_ligand_full.zip      test_data_rare.tsv       train_data_rare.tsv
test_data.tsv            train_data.tsv


## 2. Programmatic access

In [12]:
import logging
logging.basicConfig(level=logging.INFO) # To display logging messages

from mbae_src.data.base import Constants
from mbae_src.data.prepare import (
    obtain_mapping, separate_abundant, separate_fraction, IMGTHLAhistory, IPDMHChistory, IEDB, Bdata)

Let's first create a mapping object.
If we don't need the `IMGTHLAhistory` and `IPDMHChistory` objects themselves, we can simply go ahead and call `obtain_mapping`.
It only needs a directory path to store the downloaded resources.

In [13]:
mapping = obtain_mapping('./tmp')

INFO:root:IMGTHLA -- successfully initialized resource
INFO:root:IMGTHLA -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGTHLA -- successfully extracted mappings
INFO:root:IPDMHC -- successfully initialized resource
INFO:root:IPDMHC -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPDMHC -- parsed xml tree
INFO:root:IPDMHC -- successfully extracted mappings


In [14]:
list(mapping.items())[:5]

[('HLA-A*01:01', 'HLA27590'),
 ('HLA-A*01010101', 'HLA00001'),
 ('HLA-A*010101', 'HLA00001'),
 ('HLA-A*01011', 'HLA00001'),
 ('HLA-A*0101', 'HLA00001')]

Alternatively, we can initialize `IMGTHLAhistory` and `IPDMHChistory` manually, like so.

In [15]:
imgt, ipd = IMGTHLAhistory(), IPDMHChistory()

INFO:root:IMGTHLA -- successfully initialized resource
INFO:root:IPDMHC -- successfully initialized resource


These are `Resource`s, which by default have three methods: `fetch`, `parse` and `dump`.
`parse` depends on the results of `fetch`, and `parse` must in turn be called before `dump`.

In [16]:
imgt.fetch() 
imgt.parse()
ipd.fetch()
ipd.parse();

INFO:root:IMGTHLA -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGTHLA -- successfully extracted mappings
INFO:root:IPDMHC -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPDMHC -- parsed xml tree
INFO:root:IPDMHC -- successfully extracted mappings
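
The `fetch`, then `parse`, then `dump` ordering can be illustrated with a toy class. This is only a sketch: `ToyResource` is hypothetical and is not the `Resource` class from `mbae`, and raising a `RuntimeError` on an out-of-order call is an assumption made for the example.

```python
class ToyResource:
    """Hypothetical stand-in for a resource with fetch, parse and dump steps."""

    def __init__(self, raw_text):
        self._source = raw_text  # pretend this lives on a remote server
        self.raw_data = None
        self.parsed_data = None

    def fetch(self):
        """'Download' the raw data."""
        self.raw_data = self._source
        return self.raw_data

    def parse(self):
        """Turn raw lines of 'name accession' into a dictionary."""
        if self.raw_data is None:
            raise RuntimeError("call fetch() before parse()")
        self.parsed_data = dict(line.split() for line in self.raw_data.splitlines())
        return self.parsed_data

    def dump(self):
        """Serialize the parsed data as tab-separated lines."""
        if self.parsed_data is None:
            raise RuntimeError("call parse() before dump()")
        return "\n".join(f"{k}\t{v}" for k, v in self.parsed_data.items())


toy = ToyResource("HLA-A*0101 HLA00001\nHLA-A*0102 HLA00002")
try:
    toy.parse()  # too early: nothing has been fetched yet
except RuntimeError as error:
    order_error = str(error)
toy.fetch()
toy.parse()
dumped = toy.dump()
```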


Every resource has an attribute holding the parsed data.
For `imgt` and `ipd`, it is a dictionary mapping allotype names to accessions.

In [17]:
list(ipd.parsed_data.items())[:5], list(imgt.parsed_data.items())[:5]

([('Aona-DQA1*2701', 'NHP00001'),
  ('Aona-DQA1*27:01', 'NHP00001'),
  ('Aona-DQA1*2702', 'NHP00002'),
  ('Aona-DQA1*27:02', 'NHP00002'),
  ('Aona-DQA1*2703', 'NHP00003')],
 [('HLA-A*01:01', 'HLA27590'),
  ('HLA-A*01010101', 'HLA00001'),
  ('HLA-A*010101', 'HLA00001'),
  ('HLA-A*01011', 'HLA00001'),
  ('HLA-A*0101', 'HLA00001')])

`obtain_mapping` can therefore combine these dictionaries, adding the manually entered mappings on top.

In [18]:
mapping = obtain_mapping('./tmp', ipd=ipd, imgt=imgt)
list(mapping.items())[:5]

[('HLA-A*01:01', 'HLA27590'),
 ('HLA-A*01010101', 'HLA00001'),
 ('HLA-A*010101', 'HLA00001'),
 ('HLA-A*01011', 'HLA00001'),
 ('HLA-A*0101', 'HLA00001')]
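
Conceptually, combining the mappings is a dictionary merge. The sketch below shows one way to do it with plain dictionaries; `combine_mappings` is a hypothetical helper, the precedence order (later sources override earlier ones, manual entries win) is an assumption, and the `Toy-A*01` entry is made up.

```python
def combine_mappings(*sources, manual=None):
    """Merge several {name: accession} dictionaries; manual entries are applied last."""
    combined = {}
    for source in sources:
        combined.update(source)
    combined.update(manual or {})
    return combined


ipd_toy = {"Aona-DQA1*27:01": "NHP00001"}
imgt_toy = {"HLA-A*0101": "HLA00001"}
manual_toy = {"Toy-A*01": "TOY00001"}  # made-up manual entry for illustration

combined_toy = combine_mappings(imgt_toy, ipd_toy, manual=manual_toy)
print(len(combined_toy))  # 3
```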

Now, with the obtained mapping, we can go ahead and prepare a data source.
Let's do this for `Bdata2013` since it's faster to download.

In [19]:
?Bdata

Init signature:
Bdata(
    download_dir: Union[str, NoneType] = None,
    download_file_name: str = 'bdata.zip',
    mapping: Union[str, IO, Mapping[str, str], NoneType] = './mbae_resources/mapping.tsv',
)
Docstring:      Resource fetching and parsing Bdata2013.
Init docstring:
:param download_dir: Path to a directory where the resource will be downloaded.
:param download_file_name: How to name a raw downloaded file.
:param mapping: Initializing IEDB req

In [20]:
bdata = Bdata(mapping=mapping)

INFO:root:Bdata -- successfully initialized resource


By default, it'll download the resource into a temporary directory.

In [21]:
bdata.fetch()

INFO:root:Bdata -- downloaded resource from http://tools.iedb.org/static/main/binding_data_2013.zip


'/var/folders/h2/x9k131ls3dnf2j2xnm3bdxt00000gn/T/tmpzws6bjcb/bdata.zip'

In [22]:
bdata.parse();

INFO:root:Bdata -- loaded resource; records: 186684
INFO:root:Bdata -- filtered by peptide length; records: 186598
 'BoLA-T2b' 'ELA-A1' 'H-2-Kbm8' 'H-2-Lq' 'HLA-A1' 'HLA-A11' 'HLA-A2'
 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A3/11' 'HLA-B27' 'HLA-B44' 'HLA-B51'
 'HLA-B60' 'HLA-B7' 'HLA-B8' 'HLA-Cw1' 'HLA-Cw4' 'Mamu-A*01' 'Mamu-A*02'
 'Mamu-A*07' 'Mamu-A*11' 'Mamu-A*2201' 'Mamu-A*2601' 'Mamu-B*01'
 'Mamu-B*03' 'Mamu-B*04' 'Mamu-B*08' 'Mamu-B*1001' 'Mamu-B*17'
 'Mamu-B*3901' 'Mamu-B*52' 'Mamu-B*6601' 'Mamu-B*8301' 'Mamu-B*8701'
 'RT1-Bl' 'RT1A'] corresponding to 15450 records
INFO:root:Bdata -- filtered out unmapped allotypes; records: 171148
INFO:root:Bdata -- dropped unnecessary columns and removed duplicates; records 171148
INFO:root:Bdata -- completed resource preparation; records: 171148
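
The log above shows that records with allotype names absent from the mapping are dropped (186598 - 15450 = 171148). A minimal sketch of such a step follows; `map_allotypes` is a hypothetical helper operating on toy `(allotype, peptide)` tuples, not the code `mbae` runs.

```python
def map_allotypes(records, mapping):
    """Replace allotype names by accessions; count the records that cannot be mapped."""
    mapped, unmapped = [], {}
    for allotype, peptide in records:
        accession = mapping.get(allotype)
        if accession is None:
            unmapped[allotype] = unmapped.get(allotype, 0) + 1
        else:
            mapped.append((accession, peptide))
    return mapped, unmapped


toy_mapping = {"HLA-A*0101": "HLA00001"}
toy_records = [
    ("HLA-A*0101", "AAAWYLWEV"),  # made-up peptide
    ("HLA-A1", "KLEDLERDL"),      # serotype-level name, not in the toy mapping
    ("RT1A", "RGYVFQGL"),
]
mapped_toy, unmapped_toy = map_allotypes(toy_records, toy_mapping)
print(len(mapped_toy), sum(unmapped_toy.values()))  # 1 2
```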


Now we can inspect the parsed data.

In [23]:
bdata.parsed_data.head()

Unnamed: 0,accession,peptide,measurement,measurement_ord,inequality,source
0,NHP00705,RRDYRRGL,778.583409,3,=,Bdata
1,NHP00705,YHSNVKEL,18806.16664,1,=,Bdata
2,NHP00705,AQFSPQYL,22203.18686,0,=,Bdata
3,NHP00705,GDYKLVEI,87128.71287,0,>,Bdata
4,NHP00705,RGYVFQGL,87128.71287,0,>,Bdata


Alternatively, if we provide a directory, the raw downloaded files will be stored there.
Note that one can omit the mapping creation: in this case, `obtain_mapping` will be called during the initialization of `Bdata`. This will trigger a warning, though.

In [24]:
bdata = Bdata(download_dir='./tmp')

INFO:root:IMGTHLA -- successfully initialized resource
INFO:root:IMGTHLA -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGTHLA -- successfully extracted mappings
INFO:root:IPDMHC -- successfully initialized resource
INFO:root:IPDMHC -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPDMHC -- parsed xml tree
INFO:root:IPDMHC -- successfully extracted mappings
INFO:root:Bdata -- successfully initialized resource


`IEDB` has the same interface, although its processing takes a bit more time.

In [25]:
iedb = IEDB()

INFO:root:IMGTHLA -- successfully initialized resource
INFO:root:IMGTHLA -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGTHLA -- successfully extracted mappings
INFO:root:IPDMHC -- successfully initialized resource
INFO:root:IPDMHC -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPDMHC -- parsed xml tree
INFO:root:IPDMHC -- successfully extracted mappings
INFO:root:IEDB -- successfully initialized resource


In [26]:
iedb.fetch(), iedb.parse();

INFO:root:IEDB -- downloaded resource from https://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip
INFO:root:IEDB -- loaded resource; records: 1629184
INFO:root:IEDB -- filtered class I records; records: 1414016
INFO:root:IEDB -- filtered quantitative assays; records: 184527
INFO:root:IEDB -- filtered quantitative measurements; records: 157465
INFO:root:IEDB -- filtered by evidence codes; records: 157465
INFO:root:IEDB -- filtered by antigen type; records: 157461
INFO:root:IEDB -- filtered by peptide length; records: 157059
 'FLA-E*01801' 'H2-Db H155A mutant' 'H2-Db Y159F mutant'
 'H2-Kb D77S, K89A mutant' 'H2-Kb E152A, R155Y, L156Y mutant'
 'H2-Kb Y22F, M23I, E24S, D30N mutant' 'H2-Kb Y84A mutant'
 'H2-Kb Y84C mutant' 'H2-Lq' 'H2-d class I' 'HLA class I'
 'HLA-A*02:01 K66A mutant' 'HLA-A*02:01 K66A, E63Q mutant' 'HLA-A1'
 'HLA-A11' 'HLA-A2' 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A68'
 'HLA-B*08:01 B:I66A mutant' 'HLA-B*08:01 E76C mutant' 'HLA-B27' 'HLA-B39'
 'HLA-

Now we can separate the processed IEDB data into abundant and rare subsets, and then split each of them into train and test subsets.

In [27]:
abundant, rare = separate_abundant(iedb.parsed_data, rare_threshold=200)
print(f'A number of observations belonging to "abundant" allotypes: {len(abundant)}')
print(f'A number of observations belonging to "rare" allotypes: {len(rare)}')

A number of observations belonging to "abundant" allotypes: 150191
A number of observations belonging to "rare" allotypes: 3576
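
The idea behind the abundant/rare separation can be sketched with a counter over accessions. This is an illustration only: `split_by_abundance` is a hypothetical helper, and treating an accession with exactly `rare_threshold` observations as abundant is an assumption about the boundary.

```python
from collections import Counter


def split_by_abundance(records, rare_threshold):
    """Split records by how many observations their accession has in total."""
    counts = Counter(r["accession"] for r in records)
    abundant = [r for r in records if counts[r["accession"]] >= rare_threshold]
    rare = [r for r in records if counts[r["accession"]] < rare_threshold]
    return abundant, rare


# Toy data: three observations for one accession, a single one for another.
toy_observations = [
    {"accession": "HLA00001", "peptide": "AAAWYLWEV"},
    {"accession": "HLA00001", "peptide": "KLEDLERDL"},
    {"accession": "HLA00001", "peptide": "RGYVFQGLK"},
    {"accession": "NHP00705", "peptide": "RRDYRRGL"},
]
abundant_toy, rare_toy = split_by_abundance(toy_observations, rare_threshold=2)
print(len(abundant_toy), len(rare_toy))  # 3 1
```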


In [28]:
abundant_train, abundant_test = separate_fraction(abundant, fraction=0.85, mode='observations')
print(f'A number of training examples: {len(abundant_train)}')
print(f'A number of testing examples: {len(abundant_test)}')

A number of training examples: 127659
A number of testing examples: 22532
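
For intuition, an observation-level split can be sketched as a shuffle followed by a cut. `split_observations` is a hypothetical helper, and whether `separate_fraction` shuffles globally or samples within each allotype is not shown here, so this should not be read as its implementation.

```python
import random


def split_observations(records, fraction, seed=0):
    """Shuffle the records and put the first `fraction` of them into the training set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


train_toy, test_toy = split_observations(range(20), fraction=0.85)
print(len(train_toy), len(test_toy))  # 17 3
```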


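Suppose one wants to hold out a certain number of allotypes, rather than observations, for testing.
This is possible by passing `mode='allotypes'` to `separate_fraction`.

For example, let's hold out about ten abundant allotypes: they make up roughly 14% of the abundant allotypes, so a `fraction` of 0.85 leaves approximately that many for testing.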

In [29]:
frac = 10 / len(abundant['accession'].unique())
frac

0.14084507042253522

In [30]:
abundant_train, abundant_test = separate_fraction(abundant, fraction=0.85, mode='allotypes')
print(f'A number of training examples: {len(abundant_train)}')
print(f'A number of testing examples: {len(abundant_test)}')

A number of training examples: 118858
A number of testing examples: 31333
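
A group-level split differs from an observation-level one in that whole allotypes end up on one side only. The sketch below illustrates this with toy data; `split_by_group` is a hypothetical helper, and the way `separate_fraction` actually selects allotypes is an assumption.

```python
import random


def split_by_group(records, fraction, key, seed=0):
    """Assign whole groups (e.g., accessions) to either the training or the testing set."""
    groups = sorted({key(r) for r in records})
    random.Random(seed).shuffle(groups)
    cut = int(round(len(groups) * fraction))
    train_groups = set(groups[:cut])
    train = [r for r in records if key(r) in train_groups]
    test = [r for r in records if key(r) not in train_groups]
    return train, test


# Toy observations: (accession, observation index) pairs with made-up accessions.
toy_groups = {"A": 5, "B": 3, "C": 4, "D": 2}
toy_pairs = [(acc, i) for acc, n in toy_groups.items() for i in range(n)]

group_train, group_test = split_by_group(toy_pairs, fraction=0.75, key=lambda r: r[0])
train_accessions = {acc for acc, _ in group_train}
test_accessions = {acc for acc, _ in group_test}
print(len(train_accessions), len(test_accessions))  # 3 1
```

Because the groups differ in size, the share of observations in the test set generally differs from `1 - fraction`, which is also visible in the numbers printed by the cell above.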


Hopefully, this is more than enough to get one going with the data preparation.

Let's be nice and clean up the files we created.

In [31]:
! rm -r tmp
! rm ./mapping.tsv