In [1]:
## This is just to get some logging output in the Notebook

import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

When you execute a pipeline run, each input sequence is assigned a new internal identifier: the md5 hash of the sequence. We do this because storing and processing require unique strings as identifiers, and, unfortunately, some FASTA files contain invalid characters in their headers.
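
To make the idea of a hash-based identifier concrete, here is a minimal sketch using Python's standard `hashlib`. The function name `md5_identifier` and the example sequence are hypothetical, and we assume here that the sequence string is hashed as UTF-8 bytes; the pipeline's exact preprocessing before hashing is not shown in this notebook and may differ.

```python
import hashlib


def md5_identifier(sequence: str) -> str:
    """Return the hexadecimal md5 digest of a sequence string.

    Assumption: the sequence is encoded as UTF-8 before hashing.
    """
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Hypothetical example sequence, not taken from the pipeline output
example_id = md5_identifier("MKTAYIAKQR")
print(example_id)  # a 32-character hexadecimal string
```

Whatever the header of a sequence looks like, the resulting key only contains the characters `0-9` and `a-f`, which is what makes it safe to use as an h5 key.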

Nevertheless, you may sometimes want to convert the keys in the h5 files produced by the pipeline from the internal ids back to the original ids given in the FASTA headers of the input sequences.

We produce a `mapping_file.csv` that records this mapping: the first, unnamed column holds the sequence's md5 hash, while the column `original_id` holds the id extracted from the input FASTA.
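
As an illustration of how this file can be read, the sketch below loads such a mapping into a dictionary with the standard `csv` module. The helper `load_mapping` is hypothetical (it is not part of `bio_embeddings`), and the in-memory sample uses placeholder values (`hash_1`, `seq_A`, ...) rather than real pipeline output. It only assumes the layout described above: the internal id in the first column and a column named `original_id`.

```python
import csv
import io


def load_mapping(csv_file) -> dict:
    """Map internal ids (first column) to the values of `original_id`."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # Locate `original_id` by name, so extra columns do not matter
    id_column = header.index("original_id")
    return {row[0]: row[id_column] for row in reader if row}


# Placeholder content that mimics the layout of a mapping file
sample = io.StringIO(",original_id\nhash_1,seq_A\nhash_2,seq_B\n")
mapping = load_mapping(sample)
print(mapping)
```

For a real file you could pass an open file handle instead of the `io.StringIO` object, e.g. `open(path, newline="")`.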

This operation can be dangerous: if an `original_id` contains invalid characters or is empty, the h5 file will be corrupted.
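
One way to reduce this risk is to inspect the mapping before touching the h5 file. The sketch below is a hypothetical pre-check, not something `bio_embeddings` provides. It flags empty ids, ids containing `/` (which HDF5 uses as a path separator, so we treat it as one example of an invalid character) and ids that occur more than once (our own assumption that duplicate keys would collide). The sample mapping is made up for illustration.

```python
from collections import Counter


def find_risky_ids(mapping: dict) -> dict:
    """Return {internal_id: reason} for original ids that look unsafe."""
    counts = Counter(mapping.values())
    problems = {}
    for internal_id, original_id in mapping.items():
        if not original_id.strip():
            problems[internal_id] = "empty"
        elif "/" in original_id:
            # "/" separates groups in HDF5 paths
            problems[internal_id] = "contains '/'"
        elif counts[original_id] > 1:
            problems[internal_id] = "duplicate"
    return problems


# Hypothetical mapping with placeholder ids
sample_mapping = {
    "hash_1": "seq_A",
    "hash_2": "",
    "hash_3": "sp/seq_C",
    "hash_4": "seq_D",
    "hash_5": "seq_D",
}
risky = find_risky_ids(sample_mapping)
print(risky)
```

If the returned dictionary is not empty, it is probably safer not to re-index the file in place.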

Nevertheless, we make a helper function available which converts the internal ids back to the original ids **in place**, meaning that the h5 file will be directly modified. This is meant to avoid duplicating large h5 files, but it comes with the risk of corrupting the original file. Please: only perform this operation if you are sure about what you are doing, and if it's strictly necessary!
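
If disk space allows, a temporary backup is a cheap safeguard against a failed in-place modification. The sketch below uses only the standard library; `backup_file` and the `.bak` suffix are hypothetical choices, and the demonstration runs on a placeholder file in a temporary directory rather than on a real h5 file.

```python
import shutil
import tempfile
from pathlib import Path


def backup_file(path) -> Path:
    """Copy `path` to `<name>.bak` in the same directory and return the copy."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)  # copies content and file metadata
    return backup


# Demonstration on a placeholder file, not a real h5 file
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "example.h5"
    original.write_bytes(b"placeholder bytes")
    copy = backup_file(original)
    backup_matches = copy.read_bytes() == original.read_bytes()
    backup_name = copy.name

print(backup_name, backup_matches)
```

Note that this temporarily doubles the disk usage, which is exactly what the in-place approach tries to avoid, so the backup can be deleted once the new keys have been checked.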

In [2]:
import h5py
from bio_embeddings.utilities import reindex_h5_file

In [3]:
# Let's check the keys of our h5 file:
with h5py.File("pipeline_output_example/disprot/reduced_embeddings_file.h5", "r") as h5_file:
    for key in h5_file.keys():
        print(key)

0011ab0c11c7fea51fefcd039b1b69f5
003b243d0117cbaf2b7434184c409b06
004cef7b0dae937e6d722817c17ed889
0115b4447d6911651804d1303bf5f272
0140b3ec6cba5734a909c1d734e48ea0
01de4461209b76f57819919dc38faa99
025559e7b85ed1448d34e61221a54788
0297e58178e3f37ecaa080f23a5efc20
02bfb6e7933ff7691596243b73dbe6c9
02c8fa126578dbabb41f3ac9dc9a5048
0344807a9a3d95e3bb0f87f63128f377
039bdcaed4b963708412980cbad8de89
0431824d5e480b7fcc445a341026e5ea
047eb22f6dfe33a4a92617d65a1fa82a
04be32a6831acc2133372b4120a0e2b8
04cfcc23fe543b512cef2e51224df99b
05dd0ca1a3b4abfec2a2ad7a52e87abf
061758e89a077f82a31d8c676bf85813
0617f823368b939c701307874f89c983
063d622af96afdec941a191ef14e66a2
066a9ff5215407e654e6a3e9e8e2af0d
0672a7974be7c387568a628ef45fab8e
070cf76fa2364143b25970c36eca1fd5
0723af56546695809206bba0cb109bcf
07d6fcda7bd34ff9b467ba2e6ec83744
080f9bbc35ccb54e5e3a9ed3aade00fc
08709af4d8c7b41e70ee842e21a8b978
089c42cb6629d19361318b5f319aa852
09112e5056174b30c061ab1d68cec572
092cd4a4e88af5efc063464886492812
092d655d95

In [4]:
# In place re-indexing of h5 file

reindex_h5_file("pipeline_output_example/disprot/reduced_embeddings_file.h5",
                "pipeline_output_example/disprot/mapping_file.csv")

2020-08-28 18:06:25,279 INFO Reindexing the following keys: {'d091494802054d3028f8b2835b105938', '1edff89dd11abd88053a334446e61ff0', 'ceb5d9a673fd373c0077049a4e4c160d', '8bc787bf152caf55cf243458dd3584e8', '354ef2145f8fd285c319bfac4fda2060', '2d22df61c8ebfd7202eea4fea3036621', 'ab4777c26157b7b17ec5e3f5bd2c1532', '003b243d0117cbaf2b7434184c409b06', '53418d25489eef6015dcd5078c66be35', 'a2268bb3d1f8fadf284db7e78fbc0105', '22c4e9b25f9fd4c97cbf58fe2be37b11', 'c171ab073ae917bb66f36b0ec1ca536e', '2fa95af0f24f46a4393b630463a5e94f', '6afd28a89336f8a5bdb073bf57c66710', '5968889b169b8b019f9efa6e1365fbed', '2f93709b76736e8175c78afbfbd86168', 'c96970699494faca52c721498e70a573', 'ee335e01e5e15e40c9fabe1a1ea682c1', '10affb07c44267e93e6a25683fb79e5e', '79193f9e812f141ec0a98635fbbd4da2', 'e40f752dedf675e2f7c99142ebb2607a', '04cfcc23fe543b512cef2e51224df99b', '2a201b7955eb6fbedb8bb7ddcfe5e06a', '0344807a9a3d95e3bb0f87f63128f377', '8ccd400c02ba0162033520ee954522d7', 'c36e50bc6de35773fd97b65e8aeeab07', '06

In [5]:
# Let's check the new keys of our h5 file:
with h5py.File("pipeline_output_example/disprot/reduced_embeddings_file.h5", "r") as h5_file:
    for key in h5_file.keys():
        print(key)

A0A0G2JXC5
A0A0H2W778
A0A0H2ZP82
A0A0H3CFC9
A0A0J9X1Q5
A0A140GKJ0
A0A178W0D3
A0A1Z3GD05
A0A2K8FR49
A0L5S6
A0MHA3
A0ZWU1
A1B8N7
A2VD23
A4C1A5
A4ZNR2
A5F384
A5YV76
A6NF83
A7UMX5
A8AZZ3
A8CDV5
B0FRH7
B2LME8
B6KJB6
B7T1D7
B9KDD4
C0J347
C6KEI3
C6KSX0
C6ZFX3
D0PV95
E1BSW7
E6PBU3
E6PBU9
F2Z293
F9UST4
G0S4M4
G0SCY6
G1K3N3
G1TDB3
G3V7P1
G4NEJ8
G4SLH0
G8HXD9
H0W0T5
I1JLC8
I7MK25
J7FTI7
J7QLC0
J8TM36
M1GUG5
N1NXA6
O00204
O00268
O00273
O00429
O00488
O00585
O08785
O08807
O10609
O13828
O13916
O14140
O14214
O14352
O14519
O14713
O14733
O14745
O14776
O14832
O14974
O15169
O15234
O15273
O15350
O15922
O17428
O23764
O24172
O24646
O25010
O28362
O29867
O30916
O31467
O31818
O31851
O32728
O33599
O34800
O35274
O35718
O43236
O43312
O43313
O43464
O43474
O43516
O43561
O43663
O46385
O50835
O54918
O54928
O55000
O57173
O60200
O60356
O60563
O60566
O60701
O60741
O60828
O60885
O60888
O60927
O61667
O65934
O66493
O66858
O67086
O70480
O73557
O74774
O75324
O75469
O75496
O75506
O75533
O75554
O75683
O75807
O75928
O76074
O8128