# **Retrieving data from the resource programmatically**

In this notebook we show how to download Bioteque embeddings and other metadata from the resource programmatically.

In [1]:
import os
import sys
import h5py
import numpy as np
import pandas as pd
import tarfile
import urllib.request  # urlopen/urlretrieve live in the urllib.request submodule



## **1) Downloading embeddings from the Bioteque resource**

You can request the appropriate URL from the Bioteque page in order to download the data.

The main download URL is: https://bioteque.irbbarcelona.org/downloads/embeddings.

There, embeddings are organized according to their source entity, metapath and dataset (separated by the symbol >) in a file called *embeddings.tar.gz* (e.g. https://bioteque.irbbarcelona.org/downloads/embeddings>GEN>GEN-ppi-GEN>string/embeddings.tar.gz). Alternatively, you can download all the embedding datasets available for a given metapath by accessing the file *all_datasets_embeddings.tar* inside the metapath folder (e.g. https://bioteque.irbbarcelona.org/downloads/embeddings>GEN>GEN-ppi-GEN/all_datasets_embeddings.tar.gz).

Here we have prepared a function that will automatically download and uncompress the embedding of your metapath (mp) and, optionally, dataset (dt) of interest:

In [2]:
downloading_root = 'https://bioteque.irbbarcelona.org/downloads'

#The download function
def download_embbeding(mp, dt=None, out_path='./', uncompress=True):
    """
    Given a metapath and a dataset, it downloads the embedding from the Bioteque.
    If no dataset is provided, it downloads all the datasets available for the metapath.
    """

    mnd = mp[:3]  # Source entity (first metanode) of the metapath
    if dt is None:
        url = downloading_root.rstrip('/')+'/embeddings>%s>%s/all_datasets_embeddings.tar'%(mnd,mp)
    else:
        url = downloading_root.rstrip('/')+'/embeddings>%s>%s>%s/embeddings.tar.gz'%(mnd,mp,dt)

    #--Testing if exists (urlopen raises an error when the server answers with an error code)
    try:
        with urllib.request.urlopen(url) as response:
            status = response.getcode()
    except urllib.error.URLError:  # HTTPError is a subclass of URLError
        status = None
    if status != 200:
        sys.exit('The provided url does not exist:\n"%s"\n'%url)

    #--Creating output file system
    opath = os.path.join(out_path, mp, dt) if dt is not None else os.path.join(out_path, mp)
    os.makedirs(opath, exist_ok=True)
    ofile = os.path.join(opath, url.split('/')[-1])

    #--Fetching
    urllib.request.urlretrieve(url, ofile)

    #--Uncompressing
    if uncompress:
        with tarfile.open(ofile) as f:
            subfiles = f.getnames()
            f.extractall(opath)
        os.remove(ofile)  # Removed once the archive is closed

        #--Nested archives (if any) are extracted in the folder where they were unpacked
        for name in subfiles:
            if name.endswith('.tar.gz'):
                k = os.path.join(opath, name)
                with tarfile.open(k) as f:
                    f.extractall(os.path.dirname(k))
                os.remove(k)
 

In [3]:
#Setting paths and variables

mp = 'CLL-dwr+upr-GEN-dwr+upr-CLL'
dt = 'ccle_rna-ccle_rna'

source_entity = mp[:3]
target_entity = mp[-3:]

out_path = './embedding_folder/' # Path to the embedding data
uncompress = True

#--Downloading
download_embbeding(mp,dt=dt, out_path = out_path, uncompress = uncompress)

Note that you could use a loop to download a list of metapaths of interest!
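
The loop mentioned above could be sketched as follows. The helper name `download_many` and its `download_fn` parameter are assumptions made for illustration (they are not part of the Bioteque resource); in this notebook you would pass the `download_embbeding` function defined earlier. Failures are collected rather than interrupting the loop, so one missing metapath does not stop the remaining downloads.

```python
def download_many(pairs, download_fn, out_path='./embedding_folder/'):
    """
    Hypothetical helper: calls download_fn for every (metapath, dataset) pair.
    Returns the list of pairs that could not be downloaded.
    """
    failed = []
    for mp, dt in pairs:
        try:
            download_fn(mp, dt=dt, out_path=out_path, uncompress=True)
        except (Exception, SystemExit) as err:
            # The download function calls sys.exit() on a missing url, hence SystemExit
            print('Could not download %s (%s): %s' % (mp, dt, err))
            failed.append((mp, dt))
    return failed


# Metapath/dataset pairs of interest (using dt=None would fetch every dataset of the metapath)
pairs = [
    ('CLL-dwr+upr-GEN-dwr+upr-CLL', 'ccle_rna-ccle_rna'),
    ('GEN-ppi-GEN', 'string'),
]
# failed = download_many(pairs, download_embbeding)
```

The call is left commented out because it needs network access; the returned list can be inspected or retried afterwards.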

## **2) Retrieving Bioteque metadata**

In the previous function, the only requirement is to know which metapath (and dataset) we are interested in. A good starting point for exploring the different options is the explore page of the Bioteque website: https://bioteque.irbbarcelona.org/.

### **2.1) Getting the embedding universe**


Alternatively, you can have a look at the embedding universe table available at https://bioteque.irbbarcelona.org/downloads/embeddings/embedding_universe.csv. You can read it directly with pandas!

In [4]:
emb_uv_path = downloading_root+'/embeddings/embedding_universe.csv'
emb_uv = pd.read_csv(emb_uv_path)
emb_uv.head()

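|   | L | mnd1 | mnd2 | metapath | dataset | n1 | n2 | network preservation (cosine) | network preservation (euclidean) |
|---|---|------|------|----------|---------|----|----|-------------------------------|----------------------------------|
| 0 | 1 | CHE | CPD | CHE-has-CPD | chebi | 23056 | 86157 | 1.0 | 1.0 |
| 1 | 1 | CLL | CPD | CLL-sns-CPD | ctrpv2_sens | 826 | 477 | 0.95 | 0.93 |
| 2 | 1 | CLL | CPD | CLL-sns-CPD | drugcell | 1199 | 609 | 0.95 | 0.93 |
| 3 | 1 | CLL | CPD | CLL-sns-CPD | gdsc1000_sens | 1002 | 232 | 0.93 | 0.91 |
| 4 | 1 | CLL | CPD | CLL-sns-CPD | nci60_sens | 58 | 17401 | 1.0 | 1.0 |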
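
Once loaded, the table can be filtered to pick the metapaths and datasets to download. The following is a minimal sketch that rebuilds the rows shown above as a toy DataFrame (`emb_uv_toy` is a made-up name; in the notebook you would filter `emb_uv` directly). It assumes that `mnd1` and `mnd2` hold the codes of the first and last entity of the metapath, which the column names suggest but the text does not state.

```python
import pandas as pd

# Toy copy of the first rows of the embedding universe table
emb_uv_toy = pd.DataFrame({
    'L': [1, 1, 1, 1, 1],
    'mnd1': ['CHE', 'CLL', 'CLL', 'CLL', 'CLL'],
    'mnd2': ['CPD', 'CPD', 'CPD', 'CPD', 'CPD'],
    'metapath': ['CHE-has-CPD', 'CLL-sns-CPD', 'CLL-sns-CPD', 'CLL-sns-CPD', 'CLL-sns-CPD'],
    'dataset': ['chebi', 'ctrpv2_sens', 'drugcell', 'gdsc1000_sens', 'nci60_sens'],
})

# Keep the embeddings that go from cell lines (CLL) to compounds (CPD)
mask = (emb_uv_toy['mnd1'] == 'CLL') & (emb_uv_toy['mnd2'] == 'CPD')
selection = emb_uv_toy.loc[mask, ['metapath', 'dataset']]

# Plain (metapath, dataset) tuples, ready to be passed to the download function
pairs = list(selection.itertuples(index=False, name=None))
print(pairs)
```

Each resulting pair matches the `mp` and `dt` arguments of the download function from section 1.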


### **2.2) Getting the node universe**

Additionally, we provide the embedded node universe for each entity, together with some metadata. The files are under the path https://bioteque.irbbarcelona.org/downloads/node_universe/, and can also be read directly with pandas:

In [5]:
entity = 'CLL' #The entity code we want to read

entity_df = pd.read_csv(downloading_root+'/node_universe/%s.tsv.gz'%entity, sep='\t')
entity_df.head()

|   | code | name | type | organism | synonyms | other_ids |
|---|------|------|------|----------|----------|-----------|
| 0 | CVCL_0001 | HEL | Cancer cell line | 9606 | Hel\|GM06141\|GM06141B\|Human ErythroLeukemia | CelloPub=CLPUB00447\|DOI=10.1007/978-1-4757-164... |
| 1 | CVCL_0002 | HL-60 | Cancer cell line | 9606 | HL 60\|HL60 | DOI=10.1016/B978-0-12-221970-2.50457-5\|DOI=10.... |
| 2 | CVCL_0003 | HMC-1 | Cancer cell line | 9606 | HMC1 | DOI=10.1016/B978-0-12-221970-2.50457-5\|PubMed=... |
| 3 | CVCL_0004 | K-562 | Cancer cell line | 9606 | K562\|K 562\|GM05372\|GM05372E | CelloPub=CLPUB00447\|DOI=10.1016/B978-0-12-2219... |
| 4 | CVCL_0005 | NB4 | Cancer cell line | 9606 | NB-4\|NB.4 | DOI=10.1016/B978-0-12-221970-2.50457-5\|Patent=... |
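
A common use of this table is translating between node codes, names and synonyms. Below is a minimal sketch on a toy copy of the rows shown above (`entity_toy`, `code2name` and `syn2code` are made-up names; in the notebook you would work on `entity_df`). It assumes that the synonyms are separated by the `|` character, as the displayed rows suggest.

```python
import pandas as pd

# Toy copy of the first rows of the CLL node universe
entity_toy = pd.DataFrame({
    'code': ['CVCL_0001', 'CVCL_0002', 'CVCL_0003', 'CVCL_0004', 'CVCL_0005'],
    'name': ['HEL', 'HL-60', 'HMC-1', 'K-562', 'NB4'],
    'synonyms': ['Hel|GM06141|GM06141B|Human ErythroLeukemia', 'HL 60|HL60', 'HMC1',
                 'K562|K 562|GM05372|GM05372E', 'NB-4|NB.4'],
})

# Code -> name lookup
code2name = dict(zip(entity_toy['code'], entity_toy['name']))

# Synonym -> code lookup (one entry per synonym)
syn2code = {}
for code, synonyms in zip(entity_toy['code'], entity_toy['synonyms']):
    for syn in synonyms.split('|'):
        syn2code[syn] = code

print(code2name['CVCL_0004'])
print(syn2code['K562'])
```

The full table may contain missing synonyms, which would need to be handled (e.g. with `fillna('')`) before splitting.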


## **3) Reading embeddings (after having downloaded them)**

Bioteque embeddings are saved in the Hierarchical Data Format (HDF), which is designed to store multidimensional data. In Python, the h5py package allows us to read, write and modify these files.


**Important**: Some embeddings will have different source and target entities. Embeddings for each entity are kept in separate HDF5 files, so you need to read them separately by calling the *read_embedding()* function twice, specifying the entity/metanode (mnd) of interest each time.

In [6]:
import h5py

def read_embedding(path, entity):
    
    #--Reads the ids of the embeddings
    with open(path+'/%s_ids.txt'%entity) as f:
        ids = f.read().splitlines()
        
    #--Reads the embedding vectors 
    with h5py.File(path+'/%s_emb.h5'%entity,'r') as f:
        emb =  f['m'][:]
        
    return ids,emb

#------------------

out_path = './embedding_folder/' # Path to the embedding data
mp = 'CLL-dwr+upr-GEN-dwr+upr-CLL'
dt = 'ccle_rna-ccle_rna'
source_entity = mp[:3]
target_entity = mp[-3:]

#--Source entity
emb_path = out_path+'/%s/%s/'%(mp,dt)
ids, emb = read_embedding(emb_path, source_entity)

#--Target entity
#  --> As the target entity (CLL) is the same as the source entity (CLL), there is no point in reading them again.

# ids2, emb2 = read_embedding(emb_path, target_entity) # Not needed in this example as source == target
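
Once `ids` and `emb` are in memory, the vectors can be compared directly. The following is a minimal sketch of a cosine-similarity neighbor search with NumPy, run on synthetic data standing in for the output of `read_embedding()`. The helper `nearest_neighbors` and the toy ids and vectors are made up for illustration and are not part of the Bioteque resource.

```python
import numpy as np

# Synthetic stand-ins for the `ids` (list of str) and `emb` (2D array) returned by read_embedding()
ids_toy = ['A', 'B', 'C', 'D']
emb_toy = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])

def nearest_neighbors(query_id, ids, emb, k=2):
    """Hypothetical helper: returns the k ids most cosine-similar to query_id."""
    # L2-normalize the rows so that the dot product equals the cosine similarity
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = ids.index(query_id)
    sims = unit @ unit[q]
    # Sort by decreasing similarity and drop the query itself
    order = [i for i in np.argsort(-sims) if i != q]
    return [(ids[i], float(sims[i])) for i in order[:k]]

print(nearest_neighbors('A', ids_toy, emb_toy))
```

Rows with zero norm would produce NaN values here, so they would need to be filtered out beforehand if they can occur.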