### Open the embeddings from a pipeline run and use them

In the following, we open some SeqVec embeddings produced by the pipeline and use them for visualizations. We start with the reduced embeddings, which contain one vector per protein (as opposed to one vector per amino acid).

In [1]:
import h5py

In [2]:
proteins = []

In [3]:
# Collect (protein_id, embedding) pairs; each HDF5 key is a protein ID
with h5py.File('pipeline_output_example/reduced_embeddings_file.h5', 'r') as f:
    for protein_id in f.keys():
        proteins.append((protein_id, list(f[protein_id])))

In [4]:
print("The first protein in the set has id {} and the embedding is of size {}."
      .format(proteins[0][0], len(proteins[0][1])))

The first protein in the set has id 12fe229c316d544bfb78c332f181a5b9 and the embedding is of size 1024.


##### Remapping IDs

The pipeline produces an ID (the MD5 hash of the protein sequence) for every protein sequence in the input FASTA file. This gives each sequence a unique identifier, which makes the proteins easier to identify downstream.
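To make the ID scheme concrete, here is a minimal sketch of hashing a sequence with Python's standard `hashlib`. The helper name `sequence_id`, the example sequence, and the UTF-8 encoding of the sequence string are assumptions made for illustration; the pipeline's exact preprocessing before hashing may differ.

```python
import hashlib


def sequence_id(sequence: str) -> str:
    """Return the hex MD5 digest of a sequence (hypothetical helper, assumes UTF-8)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Made-up sequence, used only for illustration
example_sequence = "MKTAYIAKQR"
example_id = sequence_id(example_sequence)

# An MD5 hex digest is always 32 characters long, like the ID printed above
print(example_id, len(example_id))
```

Because the ID depends only on the sequence, two FASTA entries with identical sequences but different headers would receive the same ID under this scheme.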

We want to map the new IDs in the list of proteins above back to the original IDs found in the input FASTA file. The next code blocks do just that.

In [5]:
from pandas import read_csv

In [6]:
mapping_file = read_csv('pipeline_output_example/mapping_file.csv', index_col=0)

In [7]:
proteins = [(mapping_file.loc[p[0]].original_id, p[1]) for p in proteins]
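The lookup above raises a `KeyError` if an ID is missing from the mapping file. As a hedged alternative, the sketch below does the same remapping with a plain dictionary and keeps the hashed ID when no original ID is found. The CSV contents, the original ID `protein_A`, the second hashed ID, and the short embeddings are all made up for illustration; only the `original_id` column name comes from the notebook, and the first column is assumed to hold the hashed ID.

```python
import csv
import io

# Hypothetical mapping file contents; the real file may contain more columns
mapping_csv = io.StringIO(
    ",original_id\n"
    "12fe229c316d544bfb78c332f181a5b9,protein_A\n"
)

reader = csv.reader(mapping_csv)
header = next(reader)
original_id_column = header.index("original_id")

# First column is assumed to be the hashed ID (mirrors index_col=0)
id_to_original = {row[0]: row[original_id_column] for row in reader}

# Toy stand-in for the `proteins` list built earlier
example_proteins = [
    ("12fe229c316d544bfb78c332f181a5b9", [0.1, 0.2]),
    ("ffffffffffffffffffffffffffffffff", [0.3, 0.4]),  # not in the mapping
]

# Fall back to the hashed ID when no original ID is available
remapped = [(id_to_original.get(pid, pid), emb) for pid, emb in example_proteins]
unmapped = [pid for pid, _ in example_proteins if pid not in id_to_original]

print(remapped)
print("IDs without a mapping:", unmapped)
```

Collecting the unmapped IDs separately makes it easy to check whether the mapping file and the embeddings file come from the same pipeline run.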