# Data preparation usage and examples.

This notebook showcases two ways of interacting with the `mbae` data preparation code.


The `mbae.py prepare` command covers most data preparation needs (see the first section).

`mbae` also provides a programmatic interface to the data preparation routines (see the second section).


## 1. Access via `mbae.py` interface.

`mbae.py` has several commands, each having a collection of options.
Let's look at the commands first.

In [1]:
! python mbae.py --help

Usage: mbae.py [OPTIONS] COMMAND [ARGS]...

Options:
  -h, --help  Show this message and exit.

Commands:
  alleles  List all supported alleles
  predict  Predict binding affinity
  prepare  Prepare mbae resources


And then we'll look at the subcommands of the `prepare` command.

In [2]:
! python mbae.py prepare --help

Usage: mbae.py prepare [OPTIONS] COMMAND [ARGS]...

  Prepare mbae resources

Options:
  -h, --help  Show this message and exit.

Commands:
  dataset    Prepare training dataset
  sequences  Prepare allotype sequences


Thus, there are two data types to prepare with `mbae`: (1) a dataset holding the training observations and (2) allotype sequences.

First, we'll look at how to prepare the training data.

### 1.1 Preparing training observations.

In [3]:
! python mbae.py prepare dataset --help

Usage: mbae.py prepare dataset [OPTIONS]

  Prepare training dataset

Options:
  -d, --download_dir DIRECTORY    Path to a download directory  [default: ./]
  -D, --database TEXT             Databases to prepare. Supports multiple
                                  values. Use `--database iedb` or `--database
                                  bdata` to download and prepare IEDB or
                                  Bdata, separately. Currently, available
                                  resources are: iedb and bdata, while "all"
                                  and "none" values are reserved for parsing
                                  all or none data sources, respectively.
                                  [default: all]

  -m, --mapping FILE              Path to mapping: a headerless file with
                                  space-like separator (e.g., \t) holding
                                  mappings between allele names (1st column)
                                  and ac

A mapping between allotype names and accessions is mandatory. 
It disambiguates different allotype naming conventions: ideally, all names of a given allotype (historical and current) point to the same accession within IEDB.

Suppose one wants to prepare only the mapping between allotype names and accessions and save it into the current directory.
In that case, one would use `prepare` as follows:

In [4]:
! mbae.py prepare dataset --database none --save mapping

In [5]:
! head mapping.tsv

HLA-A*01:01	HLA27590
HLA-A*01010101	HLA00001
HLA-A*010101	HLA00001
HLA-A*01011	HLA00001
HLA-A*0101	HLA00001
HLA-A*01:02	HLA26566
HLA-A*0102	HLA00002
HLA-A*01:03	HLA23245
HLA-A*0103	HLA00003
HLA-A*01:04	HLA18724


One can observe that the file `mapping.tsv` has appeared in the current directory.
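
To make the role of the mapping more concrete, here is a minimal sketch of how such a headerless, whitespace-separated file could be read and inverted, so that all names pointing to one accession are grouped together. The helper names `read_mapping` and `names_by_accession` are hypothetical (they are not part of `mbae`), and the sample lines are copied from the `head mapping.tsv` output above.

```python
import io
from collections import defaultdict

# First lines of `mapping.tsv`, copied from the output above.
SAMPLE = (
    "HLA-A*01:01\tHLA27590\n"
    "HLA-A*01010101\tHLA00001\n"
    "HLA-A*010101\tHLA00001\n"
    "HLA-A*01011\tHLA00001\n"
    "HLA-A*0101\tHLA00001\n"
    "HLA-A*01:02\tHLA26566\n"
)


def read_mapping(handle):
    """Read a headerless two-column file into {allele name: accession}."""
    mapping = {}
    for line in handle:
        # Split on the last run of whitespace, so the accession is always
        # the final field (assumption: accessions contain no whitespace).
        fields = line.rsplit(None, 1)
        if len(fields) != 2:  # skip blank or malformed lines
            continue
        name, accession = fields
        mapping[name] = accession
    return mapping


def names_by_accession(mapping):
    """Invert the mapping: accession -> list of names pointing to it."""
    grouped = defaultdict(list)
    for name, accession in mapping.items():
        grouped[accession].append(name)
    return dict(grouped)


mapping = read_mapping(io.StringIO(SAMPLE))
grouped = names_by_accession(mapping)
print(grouped["HLA00001"])
```

With a real file, one would pass `open('mapping.tsv')` instead of the in-memory sample.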

Since we've already obtained the mapping between accessions and allotypes, we can use it to prepare the IEDB resource. 
Let's also save raw and parsed files into the `./tmp` directory (should be created beforehand).

In [6]:
! mkdir -p ./tmp

In [7]:
! mbae.py prepare dataset -d ./tmp -m ./mapping.tsv -D iedb -s raw -s parsed

 'FLA-E*01801' 'H2-Db H155A mutant' 'H2-Db Y159F mutant'
 'H2-Kb D77S, K89A mutant' 'H2-Kb E152A, R155Y, L156Y mutant'
 'H2-Kb Y22F, M23I, E24S, D30N mutant' 'H2-Kb Y84A mutant'
 'H2-Kb Y84C mutant' 'H2-Lq' 'H2-d class I' 'HLA class I'
 'HLA-A*02:01 K66A mutant' 'HLA-A*02:01 K66A, E63Q mutant' 'HLA-A1'
 'HLA-A11' 'HLA-A2' 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A68'
 'HLA-B*08:01 B:I66A mutant' 'HLA-B*08:01 E76C mutant' 'HLA-B27' 'HLA-B39'
 'HLA-B40' 'HLA-B44' 'HLA-B51' 'HLA-B58' 'HLA-B60' 'HLA-B62' 'HLA-B7'
 'HLA-B8' 'HLA-Cw1' 'HLA-Cw4' 'Mamu-B*001:01' 'Mamu-B*003:01'
 'Ptal-N*01:01' 'RT1-Aa' 'SLA-1*04:01' 'SLA-3*02:02' 'Xela-UAAg'] corresponding to 1113 records


In [8]:
ls ./tmp

IEDB_parsed.tsv      mhc_ligand_full.zip


Thus, the `./tmp` directory now contains both the raw (`mhc_ligand_full.zip`) and the parsed (`IEDB_parsed.tsv`) files.

Now, we can also prepare both Bdata2013 and IEDB datasets.
In this case, `mbae` will merge these data sources, appending unique Bdata allele-peptide observations to all IEDB ones.
Let's also make the data preparation verbose so we can clearly see how filtering and merging operations affect the number of entries.

In [9]:
! mbae.py prepare dataset -d ./tmp -m mapping.tsv --verbose

INFO:root:IEDB -- successfully initialized resource
INFO:root:IEDB -- downloaded resource from https://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip
INFO:root:IEDB -- loaded resource; records: 1883298
INFO:root:IEDB -- filtered class I records; records: 1621944
INFO:root:IEDB -- filtered quantitative assays; records: 184623
INFO:root:IEDB -- filtered quantitative measurements; records: 157525
INFO:root:IEDB -- filtered by evidence codes; records: 157525
INFO:root:IEDB -- filtered by antigen type; records: 157521
INFO:root:IEDB -- filtered by peptide length; records: 157111
 'FLA-E*01801' 'H2-Db H155A mutant' 'H2-Db Y159F mutant'
 'H2-Kb D77S, K89A mutant' 'H2-Kb E152A, R155Y, L156Y mutant'
 'H2-Kb Y22F, M23I, E24S, D30N mutant' 'H2-Kb Y84A mutant'
 'H2-Kb Y84C mutant' 'H2-Lq' 'H2-d class I' 'HLA class I'
 'HLA-A*02:01 K66A mutant' 'HLA-A*02:01 K66A, E63Q mutant' 'HLA-A1'
 'HLA-A11' 'HLA-A2' 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A68'
 'HLA-B*08:01 B:I66A mutant' 

In [10]:
ls ./tmp

IEDB_parsed.tsv      test_data.tsv
mhc_ligand_full.zip  train_data.tsv
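
The merging rule used in this step (keep all IEDB observations and append only those Bdata observations whose allele-peptide pair is not already present) can be sketched as follows. This is an illustration under stated assumptions, not `mbae`'s actual implementation: the function `merge_sources` is hypothetical, rows are plain dictionaries, and the peptides are toy values.

```python
def merge_sources(iedb_rows, bdata_rows):
    """Keep every IEDB row; append Bdata rows with an unseen (accession, peptide) pair."""
    seen = {(row["accession"], row["peptide"]) for row in iedb_rows}
    merged = list(iedb_rows)
    for row in bdata_rows:
        key = (row["accession"], row["peptide"])
        if key not in seen:
            # Assumption: repeated pairs inside Bdata are also added only once.
            seen.add(key)
            merged.append(row)
    return merged


# Toy observations (made-up peptides, for illustration only).
iedb_rows = [
    {"accession": "HLA00001", "peptide": "AAAAAAAAA", "source": "IEDB"},
    {"accession": "HLA00001", "peptide": "CCCCCCCCC", "source": "IEDB"},
]
bdata_rows = [
    {"accession": "HLA00001", "peptide": "AAAAAAAAA", "source": "Bdata"},  # already in IEDB
    {"accession": "NHP00705", "peptide": "AAAAAAAAA", "source": "Bdata"},  # new pair
]

merged = merge_sources(iedb_rows, bdata_rows)
print([row["source"] for row in merged])
```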


Some allotypes have a small number of observations.
One may want to split the parsed data into observations belonging to "abundant" and "rare" MHC alleles.
This is easily done by providing the `--separate_rare` or `-S` flag.
In this case, since we did not provide the mapping, the logging will output the `IMGTHLA` and `IPDMHC` parsing steps first.

In [11]:
! mbae.py prepare dataset -d ./tmp -v -S

INFO:root:IMGT/HLA history -- successfully initialized resource
INFO:root:IPD-MHC history -- successfully initialized resource
INFO:root:IMGT/HLA history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGT/HLA history -- successfully extracted mappings
INFO:root:IPD-MHC history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPD-MHC history -- parsed xml tree
INFO:root:IPD-MHC history -- successfully extracted mappings
INFO:root:IEDB -- successfully initialized resource
INFO:root:IEDB -- downloaded resource from https://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip
INFO:root:IEDB -- loaded resource; records: 1883298
INFO:root:IEDB -- filtered class I records; records: 1621944
INFO:root:IEDB -- filtered quantitative assays; records: 184623
INFO:root:IEDB -- filtered quantitative measurements; records: 157525
INFO:root:IEDB -- filtered by ev

As a result, four new files appear in the `./tmp` directory: `train_data_abundant.tsv`, `test_data_abundant.tsv`, `train_data_rare.tsv`, and `test_data_rare.tsv`.

In [12]:
ls ./tmp

IEDB_parsed.tsv          test_data_abundant.tsv   train_data_abundant.tsv
mhc_ligand_full.zip      test_data_rare.tsv       train_data_rare.tsv
test_data.tsv            train_data.tsv


### 1.2 Preparing sequences.
`mbae` is trained on both peptide and allotype sequences.
The training dataset includes the former by default.
Thus, `mbae` provides an interface to prepare the latter.

In [13]:
! mbae.py prepare sequences --help

Usage: mbae.py prepare sequences [OPTIONS]

  Prepare allotype sequences

Options:
  -d, --download_dir DIRECTORY  Path to a download directory.  [default: ./]
  -o, --output TEXT             Output file name for the final data.
                                [default: sequences.fasta]

  -s, --save TEXT               An option controlling what will be saved.
                                Multiple values are supported: - final (final
                                data will be saved); - parsed (every parsed
                                Resource will be saved); - raw (raw downloaded
                                files will be saved). Example: "-s parsed -s
                                final" to save parsed data of used resources
                                along with the final data.   [default: final]

  -a, --accessions TEXT         A list of comma-separated accessions.
  -f, --accessions_file FILE    A path to a file holding a list of accessions,
                      

As previously, the interface is fairly flexible. 
Here is what it can do for you:
1. Download canonical allotype sequences from IMGT/HLA and IPD-MHC.
2. Filter the downloaded sequences based on accessions.
3. Align (using `mafft --add`) to our MSA encompassing MHC binding regions and cut sequences accordingly.

We largely rely on IMGT/HLA and IPD-MHC for sequences.
Namely, we require that your allotype has a valid accession in the aforementioned resources.
If it does, our alignment profile likely already contains this sequence, and `mbae` will pull it from there.
Otherwise, it will be aligned to the MSA and cut.
However, one can bypass filtering by passing no accessions, and bypass cutting by using the `--profile none` option (e.g., when one just wants to download the raw data).
We'll show these use cases below.
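
The filtering step can be illustrated with a small, standard-library-only sketch. It assumes FASTA headers of the form `>ACCESSION NAME`, as in the `sequences.fasta` file produced later in this tutorial; raw files from the upstream databases may use a different header layout. The functions `read_fasta` and `filter_by_accessions` are hypothetical helpers, not `mbae` functions, and the sequences are shortened toy strings.

```python
def _make_record(header, chunks):
    """Split a header into accession and description; join sequence lines."""
    accession, _, description = header.partition(" ")
    return accession, description, "".join(chunks)


def read_fasta(lines):
    """Yield (accession, description, sequence) tuples from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def filter_by_accessions(records, accessions):
    """Keep only the records whose accession was requested."""
    wanted = set(accessions)
    return [record for record in records if record[0] in wanted]


# Toy FASTA content: real headers from this tutorial, shortened sequences.
FASTA = """\
>BoLA03176 BoLA-2*00801
SHSLRYFLTA
VSRPGLGEPR
>H-2-Kk H2-Kk
PHSLRYFHTA
"""

records = list(read_fasta(FASTA.splitlines()))
kept = filter_by_accessions(records, ["H-2-Kk", "HLA99999"])
print(kept)
```

Accessions that are absent from the input (here the made-up `HLA99999`) are silently dropped in this sketch.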

Let's start by downloading the raw data into `./tmp`. In this case, we'll omit filtering by accessions and cutting by profile.

In [14]:
! mbae.py prepare sequences -d ./tmp -s raw -p none

/Users/ivanreveguk/Projects/mbae_git/none


In [15]:
! ls ./tmp

IEDB_parsed.tsv         test_data.tsv           train_data_abundant.tsv
MHC_prot.fasta          test_data_abundant.tsv  train_data_rare.tsv
hla_prot.fasta          test_data_rare.tsv
mhc_ligand_full.zip     train_data.tsv


Now we'll filter by accessions while omitting the alignment.
A small number of accessions can be passed as a comma-separated list.

In this case, however, `mbae` will filter out accessions missing in both IPD-MHC and IMGT/HLA databases.

In [16]:
! mbae.py prepare sequences -d ./tmp -s parsed -s final -v -p none --accessions H-2-Kk,HLA25912,HLA00009,NHP00709,HLA00222,HLA28165

/Users/ivanreveguk/Projects/mbae_git/none
INFO:root:IPD-MHC sequences -- successfully initialized resource
INFO:root:IMGT/HLA sequences -- successfully initialized resource
INFO:root:IPD-MHC sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC_prot.fasta
INFO:root:IMGT/HLA sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/hla_prot.fasta
INFO:root:IPD-MHC sequences -- loaded 10302 sequences
INFO:root:IPD-MHC sequences -- filtered by accessions; 1 left (out of provided 6).
INFO:root:IPD-MHC sequences -- finished parsing sequences; 1 in total
INFO:root:IMGT/HLA sequences -- loaded 27840 sequences
INFO:root:IMGT/HLA sequences -- filtered by accessions; 4 left (out of provided 6).
INFO:root:IMGT/HLA sequences -- finished parsing sequences; 4 in total
INFO:root:IPD-MHC sequences -- saved parsed data to /Users/ivanreveguk/Projects/mbae_git/tmp/IPD-MHC_sequences.fasta
INFO:root:IMGT/HLA sequences -- saved par

To prepare a larger set of sequences, it's better to put the accessions into a single file first.
Then provide `mbae` with the path to this file via the `--accessions_file` or `-f` option.

In [17]:
sample_accessions = [
    'ELA04973', 'HLA01173', 'HLA24796', 'HLA28069', 'BoLA03176',
    'NHP01865', 'HLA27997', 'HLA27977', 'H-2-Kk', 'HLA25912',
    'HLA00009', 'HLA27068', 'HLA16743', 'HLA27953', 'HLA15201',
    'NHP00709', 'HLA00222', 'HLA28165', 'HLA25670', 'HLA25900']
with open('./tmp/accessions.txt', 'w') as f:
    print(*sample_accessions, sep='\n', file=f)

In [18]:
! mbae.py prepare sequences -d ./tmp -s final -v -f ./tmp/accessions.txt -t 6

/Users/ivanreveguk/Projects/mbae_git/mbae_resources/binding_regions.fsa
INFO:root:IPD-MHC sequences -- successfully initialized resource
INFO:root:IMGT/HLA sequences -- successfully initialized resource
INFO:root:IPD-MHC sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC_prot.fasta
INFO:root:IMGT/HLA sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/hla_prot.fasta
INFO:root:IPD-MHC sequences -- loaded 10302 sequences
INFO:root:IPD-MHC sequences -- filtered by accessions; 4 left (out of provided 21).
INFO:root:IPD-MHC sequences -- 7 sequences were found in profile /Users/ivanreveguk/Projects/mbae_git/mbae_resources/binding_regions.fsa
INFO:root:IPD-MHC sequences -- 1 will be aligned to /Users/ivanreveguk/Projects/mbae_git/mbae_resources/binding_regions.fsa
Cutting sequences: 100%|██████████████████████████| 1/1 [00:05<00:00,  5.89s/it]
INFO:root:IPD-MHC sequences -- finished parsing sequences; 8 in t

In [19]:
! head ./tmp/sequences.fasta

>BoLA03176 BoLA-2*00801
SHSLRYFLTAVSRPGLGEPRFIIVGYVDDTQFVRFDSNTPNPRMEPRARWVEKEGPEYWD
RETRNSKETAQTFRANLNTALGYYNQSEAGSHTVQEMYGCDVGPDGRLLRGFMQDAYDGR
DYIALNEDLRSWTAADTAAQITKRKWEAAGDAETWRNYLEGRCVEWLRRYLENGKDALL
>H-2-Kk H2-Kk
PHSLRYFHTAVSRPGLGKPRFISVGYVDDTQFVRFDSDAENPRYEPRVRWMEQVEPEYWE
RNTQIAKGNEQIFRVNLRTALRYYNQSAGGSHTFQRMYGCEVGSDWRLLRGYEQYAYDGC
DYIALNEDLKTWTAADMAALITKHKWEQAGDAERDRAYLEGTCVEWLRRYLQLGNATLP
>HLA00009 HLA-A*02:04
SHSMRYFFTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWD


And, finally, it is even possible to download and cut ALL the IPD-MHC and IMGT/HLA sequences.
The command would be `mbae.py prepare sequences -d ./tmp`. 
It will download raw sequences into `/tmp` (not the local `./tmp`, since we did not pass `-s raw`) and cut each of them.
This will obviously take a long time, so we won't wait for it to complete.

In [20]:
! mbae.py prepare sequences -d ./tmp -v 

/Users/ivanreveguk/Projects/mbae_git/mbae_resources/binding_regions.fsa
INFO:root:IPD-MHC sequences -- successfully initialized resource
INFO:root:IMGT/HLA sequences -- successfully initialized resource
INFO:root:IPD-MHC sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC_prot.fasta
INFO:root:IMGT/HLA sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/hla_prot.fasta
INFO:root:IPD-MHC sequences -- loaded 10302 sequences
INFO:root:IPD-MHC sequences -- 35 sequences were found in profile /Users/ivanreveguk/Projects/mbae_git/mbae_resources/binding_regions.fsa
INFO:root:IPD-MHC sequences -- 10267 will be aligned to /Users/ivanreveguk/Projects/mbae_git/mbae_resources/binding_regions.fsa
Cutting sequences:   0%|                              | 0/10267 [00:00<?, ?it/s]^C

Aborted!
Cutting sequences:   0%|                              | 0/10267 [00:03<?, ?it/s]


## 2. Programmatic access

In [21]:
%load_ext autoreload
%autoreload 2

import logging
logging.basicConfig(level=logging.INFO) # To display logging messages
from random import sample

import numpy as np
import pandas as pd

from mbae_src.data.base import Constants
from mbae_src.data.prepare import (
    obtain_mapping, separate_abundant, separate_fraction, _dump_data, # Helper functions
    IMGTHLAhistory, IPDMHChistory, # Objects to obtain an allotype->accession mapping
    IMGTHLAsequences, IPDMHCsequences, # Objects to get sequences
    IEDB, Bdata # Objects to parse observations from
)

### 2.1 Obtaining mappings

Let's first create a mapping object.
If we don't need `IMGTHLAhistory`, `IPDMHChistory` objects, we can simply go ahead and call `obtain_mapping`.
The latter only needs a directory path to store downloaded resources.

In [22]:
mapping = obtain_mapping('./tmp')

INFO:root:IMGT/HLA history -- successfully initialized resource
INFO:root:IMGT/HLA history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGT/HLA history -- successfully extracted mappings
INFO:root:IPD-MHC history -- successfully initialized resource
INFO:root:IPD-MHC history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPD-MHC history -- parsed xml tree
INFO:root:IPD-MHC history -- successfully extracted mappings


In [23]:
list(mapping.items())[:5]

[('HLA-A*01:01', 'HLA27590'),
 ('HLA-A*01010101', 'HLA00001'),
 ('HLA-A*010101', 'HLA00001'),
 ('HLA-A*01011', 'HLA00001'),
 ('HLA-A*0101', 'HLA00001')]

Alternatively, we can init `IMGTHLAhistory`, `IPDMHChistory` manually, like so:

In [24]:
imgt_hist, ipd_hist = IMGTHLAhistory(), IPDMHChistory()

INFO:root:IMGT/HLA history -- successfully initialized resource
INFO:root:IPD-MHC history -- successfully initialized resource


These are `Resource`s, by default having three methods: `fetch`, `parse` and `dump`.
Obviously, `parse` depends on `fetch`'s results, while `parse` must be called prior to `dump`.

In [25]:
imgt_hist.fetch() 
imgt_hist.parse()
ipd_hist.fetch()
ipd_hist.parse();

INFO:root:IMGT/HLA history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGT/HLA history -- successfully extracted mappings
INFO:root:IPD-MHC history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPD-MHC history -- parsed xml tree
INFO:root:IPD-MHC history -- successfully extracted mappings
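
The ordering constraint between `fetch`, `parse` and `dump` can be sketched with a toy class. This is a hypothetical stand-in written for illustration only: `ToyResource` is not `mbae`'s `Resource`, it "fetches" from an in-memory string instead of downloading, and whether the real classes raise an error on out-of-order calls is an assumption.

```python
class ToyResource:
    """Hypothetical stand-in for a resource; not mbae's actual class."""

    def __init__(self, source_text):
        self._source = source_text
        self.raw = None
        self.parsed_data = None

    def fetch(self):
        # A real resource would download a file here.
        self.raw = self._source
        return self.raw

    def parse(self):
        if self.raw is None:
            raise RuntimeError("call fetch() before parse()")
        # Each non-empty line holds a name and an accession.
        self.parsed_data = dict(
            line.rsplit(None, 1) for line in self.raw.splitlines() if line.strip()
        )
        return self.parsed_data

    def dump(self):
        if self.parsed_data is None:
            raise RuntimeError("call parse() before dump()")
        # A real resource would write to disk; here we just return the text.
        return "\n".join(f"{name}\t{acc}" for name, acc in self.parsed_data.items())


resource = ToyResource("HLA-A*0101\tHLA00001\nHLA-A*0102\tHLA00002\n")
resource.fetch()
resource.parse()
print(resource.dump())
```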


Every resource has a `parsed_data` attribute holding the parsed data.
For `imgt_hist` and `ipd_hist`, it's a dictionary with allotype-accession mappings.

In [26]:
list(ipd_hist.parsed_data.items())[:5], list(imgt_hist.parsed_data.items())[:5]

([('Aona-DQA1*2701', 'NHP00001'),
  ('Aona-DQA1*27:01', 'NHP00001'),
  ('Aona-DQA1*2702', 'NHP00002'),
  ('Aona-DQA1*27:02', 'NHP00002'),
  ('Aona-DQA1*2703', 'NHP00003')],
 [('HLA-A*01:01', 'HLA27590'),
  ('HLA-A*01010101', 'HLA00001'),
  ('HLA-A*010101', 'HLA00001'),
  ('HLA-A*01011', 'HLA00001'),
  ('HLA-A*0101', 'HLA00001')])

Thus, `obtain_mapping` combines these two dictionaries with an addition of manually entered mappings.

In [27]:
mapping = obtain_mapping('./tmp', ipd=ipd_hist, imgt=imgt_hist)
list(mapping.items())[:5]

[('HLA-A*01:01', 'HLA27590'),
 ('HLA-A*01010101', 'HLA00001'),
 ('HLA-A*010101', 'HLA00001'),
 ('HLA-A*01011', 'HLA00001'),
 ('HLA-A*0101', 'HLA00001')]
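
Conceptually, combining the mappings amounts to merging dictionaries. The sketch below is hypothetical: `combine_mappings` is not an `mbae` function, the precedence order (later sources override earlier ones) is an assumption, and the manual entry is a made-up placeholder.

```python
def combine_mappings(imgt, ipd, manual=None):
    """Merge name -> accession dictionaries; later sources win on conflicts."""
    combined = dict(imgt)
    combined.update(ipd)
    combined.update(manual or {})
    return combined


# Entries copied from the outputs above, plus one made-up manual entry.
imgt = {"HLA-A*0101": "HLA00001"}
ipd = {"Aona-DQA1*2701": "NHP00001"}
manual = {"MY-ALLELE": "XXX00000"}  # placeholder, not a real accession

combined = combine_mappings(imgt, ipd, manual)
print(sorted(combined))
```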

### 2.2 Obtaining training observations

Now, with the obtained mapping, we can go ahead and prepare a data source.
Let's do this for `Bdata2013` since it's faster to download.

In [28]:
?Bdata

Init signature:
Bdata(
    download_dir: Union[str, NoneType] = None,
    download_file_name: str = 'bdata.zip',
    mapping: Union[str, IO, Mapping[str, str], NoneType] = './mbae_resources/mapping.tsv',
)
Docstring:      Resource fetching and parsing Bdata2013.
Init docstring:
:param download_dir: Path to a directory where the resource will be downloaded.
:param download_file_name: How to name a raw downloaded file.
:param mapping: Initializing IEDB req

In [29]:
bdata = Bdata(mapping=mapping)

INFO:root:Bdata -- successfully initialized resource


By default, it'll download the resource into a temporary directory.

In [30]:
bdata.fetch()

INFO:root:Bdata -- downloaded resource from http://tools.iedb.org/static/main/binding_data_2013.zip


'/var/folders/h2/x9k131ls3dnf2j2xnm3bdxt00000gn/T/tmp7wlzjv_b/bdata.zip'

In [31]:
bdata.parse();

INFO:root:Bdata -- loaded resource; records: 186684
INFO:root:Bdata -- filtered by peptide length; records: 186602
 'BoLA-T2b' 'ELA-A1' 'H-2-Kbm8' 'H-2-Lq' 'HLA-A1' 'HLA-A11' 'HLA-A2'
 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A3/11' 'HLA-B27' 'HLA-B44' 'HLA-B51'
 'HLA-B60' 'HLA-B7' 'HLA-B8' 'HLA-Cw1' 'HLA-Cw4' 'Mamu-A*01' 'Mamu-A*02'
 'Mamu-A*07' 'Mamu-A*11' 'Mamu-A*2201' 'Mamu-A*2601' 'Mamu-B*01'
 'Mamu-B*03' 'Mamu-B*04' 'Mamu-B*08' 'Mamu-B*1001' 'Mamu-B*17'
 'Mamu-B*3901' 'Mamu-B*52' 'Mamu-B*6601' 'Mamu-B*8301' 'Mamu-B*8701'
 'RT1-Bl' 'RT1A'] corresponding to 15453 records
INFO:root:Bdata -- filtered out unmapped allotypes; records: 171149
INFO:root:Bdata -- dropped unnecessary columns and removed duplicates; records 171149
INFO:root:Bdata -- completed resource preparation; records: 171149


Now we can inspect the parsed data.

In [32]:
bdata.parsed_data.head()

   accession   peptide  measurement  measurement_ord  inequality  source
0   NHP00705  RRDYRRGL   778.583409                3           =   Bdata
1   NHP00705  YHSNVKEL  18806.16664                1           =   Bdata
2   NHP00705  AQFSPQYL  22203.18686                0           =   Bdata
3   NHP00705  GDYKLVEI  87128.71287                0           >   Bdata
4   NHP00705  RGYVFQGL  87128.71287                0           >   Bdata


Alternatively, if we provide a valid directory path, raw downloaded files will be stored there.
Note that one can skip creating the mapping: in this case, `Bdata` will call `obtain_mapping` internally. This will trigger a warning.

In [33]:
bdata = Bdata(download_dir='./tmp')

INFO:root:IMGT/HLA history -- successfully initialized resource
INFO:root:IMGT/HLA history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGT/HLA history -- successfully extracted mappings
INFO:root:IPD-MHC history -- successfully initialized resource
INFO:root:IPD-MHC history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPD-MHC history -- parsed xml tree
INFO:root:IPD-MHC history -- successfully extracted mappings
INFO:root:Bdata -- successfully initialized resource


`IEDB` has the same interface, although its processing takes a bit more time.

In [34]:
iedb = IEDB()

INFO:root:IMGT/HLA history -- successfully initialized resource
INFO:root:IMGT/HLA history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/Allelelist_history.txt
INFO:root:IMGT/HLA history -- successfully extracted mappings
INFO:root:IPD-MHC history -- successfully initialized resource
INFO:root:IPD-MHC history -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC.xml
INFO:root:IPD-MHC history -- parsed xml tree
INFO:root:IPD-MHC history -- successfully extracted mappings
INFO:root:IEDB -- successfully initialized resource


In [35]:
iedb.fetch(), iedb.parse();

INFO:root:IEDB -- downloaded resource from https://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip
INFO:root:IEDB -- loaded resource; records: 1883298
INFO:root:IEDB -- filtered class I records; records: 1621944
INFO:root:IEDB -- filtered quantitative assays; records: 184623
INFO:root:IEDB -- filtered quantitative measurements; records: 157525
INFO:root:IEDB -- filtered by evidence codes; records: 157525
INFO:root:IEDB -- filtered by antigen type; records: 157521
INFO:root:IEDB -- filtered by peptide length; records: 157111
 'FLA-E*01801' 'H2-Db H155A mutant' 'H2-Db Y159F mutant'
 'H2-Kb D77S, K89A mutant' 'H2-Kb E152A, R155Y, L156Y mutant'
 'H2-Kb Y22F, M23I, E24S, D30N mutant' 'H2-Kb Y84A mutant'
 'H2-Kb Y84C mutant' 'H2-Lq' 'H2-d class I' 'HLA class I'
 'HLA-A*02:01 K66A mutant' 'HLA-A*02:01 K66A, E63Q mutant' 'HLA-A1'
 'HLA-A11' 'HLA-A2' 'HLA-A24' 'HLA-A26' 'HLA-A3' 'HLA-A68'
 'HLA-B*08:01 B:I66A mutant' 'HLA-B*08:01 E76C mutant' 'HLA-B27' 'HLA-B39'
 'HLA-

Now we can separate the processed IEDB data into abundant and rare subsets, and then split each of these into train and test subsets.

In [36]:
abundant, rare = separate_abundant(iedb.parsed_data, rare_threshold=200)
print(f'A number of observations belonging to "abundant" allotypes: {len(abundant)}')
print(f'A number of observations belonging to "rare" allotypes: {len(rare)}')

A number of observations belonging to "abundant" allotypes: 150239
A number of observations belonging to "rare" allotypes: 3575
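
As an illustration of what such a split means, here is a minimal sketch that counts observations per accession and separates the rows by a threshold. It is not `mbae`'s implementation: `split_by_abundance` is a hypothetical helper working on plain dictionaries, and treating an allotype with exactly `rare_threshold` observations as abundant is an assumption.

```python
from collections import Counter


def split_by_abundance(rows, rare_threshold):
    """Split rows into (abundant, rare) by observations per accession."""
    counts = Counter(row["accession"] for row in rows)
    # Assumption: the threshold itself counts as "abundant".
    abundant = [r for r in rows if counts[r["accession"]] >= rare_threshold]
    rare = [r for r in rows if counts[r["accession"]] < rare_threshold]
    return abundant, rare


# Toy data: three observations for one accession, one for another.
rows = [
    {"accession": "HLA00001", "peptide": "AAAAAAAAA"},
    {"accession": "HLA00001", "peptide": "CCCCCCCCC"},
    {"accession": "HLA00001", "peptide": "DDDDDDDDD"},
    {"accession": "NHP00705", "peptide": "AAAAAAAAA"},
]

toy_abundant, toy_rare = split_by_abundance(rows, rare_threshold=2)
print(len(toy_abundant), len(toy_rare))
```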


In [37]:
abundant_train, abundant_test = separate_fraction(abundant, fraction=0.85, mode='observations')
print(f'A number of training examples: {len(abundant_train)}')
print(f'A number of testing examples: {len(abundant_test)}')

A number of training examples: 127633
A number of testing examples: 22606


Suppose one wants to hold out a certain number of allotypes for testing.
This is possible by passing `mode='allotypes'` to `separate_fraction`.

For example, let's hold out roughly ten abundant allotypes, which corresponds to keeping about 85% of the allotypes for training.

In [38]:
frac = 10 / len(abundant['accession'].unique())
frac

0.14084507042253522

In [39]:
abundant_train, abundant_test = separate_fraction(abundant, fraction=0.85, mode='allotypes')
print(f'A number of training examples: {len(abundant_train)}')
print(f'A number of testing examples: {len(abundant_test)}')

A number of training examples: 131394
A number of testing examples: 18845
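
The difference from an observation-level split is that whole allotypes end up on one side only. A minimal sketch of this idea is given below; `split_by_allotype` is a hypothetical helper (not `separate_fraction` itself), and details such as rounding and the seeded shuffle are assumptions made for the example.

```python
import random


def split_by_allotype(rows, fraction, seed=0):
    """Assign a `fraction` of distinct accessions to train, the rest to test."""
    accessions = sorted({row["accession"] for row in rows})
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(accessions)
    n_train = round(len(accessions) * fraction)
    train_accessions = set(accessions[:n_train])
    train = [r for r in rows if r["accession"] in train_accessions]
    test = [r for r in rows if r["accession"] not in train_accessions]
    return train, test


# Toy data: four accessions with two observations each.
rows = [
    {"accession": acc, "peptide": pep}
    for acc in ("HLA00001", "HLA00002", "HLA00003", "NHP00705")
    for pep in ("AAAAAAAAA", "CCCCCCCCC")
]

train, test = split_by_allotype(rows, fraction=0.75)
train_acc = {r["accession"] for r in train}
test_acc = {r["accession"] for r in test}
print(len(train_acc), len(test_acc))
```

No accession appears in both subsets, so the test set measures generalization to allotypes unseen during training.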


### 2.3 Obtaining allotype sequences

Finally, let's prepare allotype sequences.
For this, we'll first need to obtain a handful of accessions.

In [40]:
sample_accessions = [
    'ELA04973', 'HLA01173', 'HLA24796', 'HLA28069', 'BoLA03176',
    'NHP01865', 'HLA27997', 'HLA27977', 'H-2-Kk', 'HLA25912',
    'HLA00009', 'HLA27068', 'HLA16743', 'HLA27953', 'HLA15201',
    'NHP00709', 'HLA00222', 'HLA28165', 'HLA25670', 'HLA25900']

In [41]:
ipd_seqs = IPDMHCsequences()
imgt_seqs = IMGTHLAsequences()

INFO:root:IPD-MHC sequences -- successfully initialized resource
INFO:root:IMGT/HLA sequences -- successfully initialized resource


In [42]:
ipd_seqs.fetch(), imgt_seqs.fetch()

INFO:root:IPD-MHC sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IPDMHC/Latest/MHC_prot.fasta
INFO:root:IMGT/HLA sequences -- downloaded resource from https://raw.githubusercontent.com/ANHIG/IMGTHLA/Latest/hla_prot.fasta


('/var/folders/h2/x9k131ls3dnf2j2xnm3bdxt00000gn/T/tmpxvsx_upn/MHC_prot.fasta',
 '/var/folders/h2/x9k131ls3dnf2j2xnm3bdxt00000gn/T/tmpjsd3334e/hla_prot.fasta')

As before, the `parse` method carries out all the hard work.
For both of the initialized objects, `parse` accepts a list of accessions.
If the list is not provided, all of the available sequences will be used. Since this might take a very long time, the user will be warned.

By default, `parse` uses our alignment profile to excise binding region sequences.
If cutting is not needed, one can call `parse` with the `profile_path=None` argument.
We use `mafft --add` for the alignment; it runs faster with parallel execution, enabled via the `threads` argument.

The method returns biopython's `SeqRecord` objects.

In [43]:
ipd_seqs.parse(
    accessions=[a for a in sample_accessions if 'HLA' not in a], 
    verbose=True, threads=6)
imgt_seqs.parse(
    accessions=[a for a in sample_accessions if 'HLA' in a], 
    verbose=True, threads=6);

INFO:root:IPD-MHC sequences -- loaded 10302 sequences
INFO:root:IPD-MHC sequences -- filtered by accessions; 4 left (out of provided 5).
INFO:root:IPD-MHC sequences -- 4 sequences were found in profile ./mbae_resources/binding_regions.fsa
INFO:root:IPD-MHC sequences -- 1 will be aligned to ./mbae_resources/binding_regions.fsa
Cutting sequences: 100%|██████████| 1/1 [00:05<00:00,  5.33s/it]
INFO:root:IPD-MHC sequences -- finished parsing sequences; 5 in total
INFO:root:IMGT/HLA sequences -- loaded 27840 sequences
INFO:root:IMGT/HLA sequences -- filtered by accessions; 15 left (out of provided 15).
INFO:root:IMGT/HLA sequences -- 3 sequences were found in profile ./mbae_resources/binding_regions.fsa
INFO:root:IMGT/HLA sequences -- 12 will be aligned to ./mbae_resources/binding_regions.fsa
Cutting sequences: 100%|██████████| 12/12 [01:12<00:00,  6.04s/it]
INFO:root:IMGT/HLA sequences -- finished parsing sequences; 15 in total


Note that although some sequences were missing from the IPD-MHC dump, they were still included in `parsed_data` because they are present in the alignment profile.

We could use each resource's `dump` method to save the sequences into fasta files.
However, one probably wants these two groups of sequences in a single file.
Thus, we'll simply use the `_dump_data` helper function on the merged `parsed_data` attributes.

In [44]:
?_dump_data

Signature:
_dump_data(
    dump_path: str,
    resource_name: str,
    dump_data: Union[pandas.core.frame.DataFrame, Dict, List[Bio.SeqRecord.SeqRecord]],
) -> None
Docstring:
A helper function to dump the resource's data.
Consult with type annotations to check which data types are supported.
:param dump_path: A path to dump to.
:param resource_name: A name of the resource for formatting errors and logging messages.
:param dump_data: Data to dump.
File:      ~/Projects/mbae_git/mbae_src/da

In [46]:
_dump_data('./tmp/sequences.fasta', 'NoResource', ipd_seqs.parsed_data + imgt_seqs.parsed_data)

INFO:root:NoResource -- saved parsed data to ./tmp/sequences.fasta


In [47]:
! head ./tmp/sequences.fasta

>BoLA03176 BoLA-2*00801
SHSLRYFLTAVSRPGLGEPRFIIVGYVDDTQFVRFDSNTPNPRMEPRARWVEKEGPEYWD
RETRNSKETAQTFRANLNTALGYYNQSEAGSHTVQEMYGCDVGPDGRLLRGFMQDAYDGR
DYIALNEDLRSWTAADTAAQITKRKWEAAGDAETWRNYLEGRCVEWLRRYLENGKDALL
>H-2-Kk H2-Kk
PHSLRYFHTAVSRPGLGKPRFISVGYVDDTQFVRFDSDAENPRYEPRVRWMEQVEPEYWE
RNTQIAKGNEQIFRVNLRTALRYYNQSAGGSHTFQRMYGCEVGSDWRLLRGYEQYAYDGC
DYIALNEDLKTWTAADMAALITKHKWEQAGDAERDRAYLEGTCVEWLRRYLQLGNATLP
>NHP00709 Patr-A*04:01
SHSMRYFSTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWD


# Finale.

Now, hopefully, that is more than enough to get one going with the data preparation.

Let's be nice and clean up the data.

In [48]:
! rm -r tmp
! rm ./mapping.tsv