# CCM Benchmate APIs module tutorial

In this notebook we will explore some of the API endpoints that are exposed in the module, how to use them, and how to integrate their outputs.

This is not a complete example of all the capabilities of this module. Its main strength is not that you can search these endpoints interactively and explore data as you go, but that you can automate this process to an arbitrary depth, provided that you understand the outputs that are returned and know what to do when something doesn't go according to plan.

## Ensembl API

This module **does not** expose all the available endpoints. There are a lot of them, and not all of them seemed immediately useful. If you need one that is missing, please create an issue and we'll look into it.

In [None]:
import pandas as pd

# load the module

from ccm_benchmate.apis.ensembl import Ensembl
ensembl=Ensembl()

By default the dataset is set to human; you can change this using the dataset parameter. I will continue with human. The API uses the latest dataset, and other datasets are not available. Please keep this in mind if identical queries return different results at different time points. It is possible to install older versions locally.

If this is something you are interested in please contact CCM to discuss next steps.

### Variation

This endpoint is actually 3 endpoints joined together, so you have 3 options: you can search for a variant by its variant id, or you can search by PubMed or PMC id (if you know the id of the paper), which returns all the variants associated with that paper.

In [None]:
info = ensembl.variation("rs56116432", add_annotations=True)
info

The `add_annotations` option returns a lot of information about the variant. If you do not use it, your results will be much smaller.

In [None]:
info = ensembl.variation("rs56116432")
info

In [None]:
info_pub = ensembl.variation("26318936", method="publication", pubtype="pubmed")
info_pub

The publication method returns all the ids mentioned in a paper. You can then use these ids to search for variant information if you'd like. Just be nice and do not overwhelm the Ensembl API endpoint: it's usually a good idea to wait a second between queries. If you are doing this as part of a script, this can be done easily using `time.sleep`.
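
The pause between queries can be sketched as a small helper. This is a minimal sketch and not part of `ccm_benchmate`: `throttled_map` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for a real call such as `ensembl.variation` so that the example runs offline.

```python
import time


def throttled_map(func, items, delay=1.0):
    """Call func on each item, sleeping `delay` seconds between calls."""
    results = {}
    for i, item in enumerate(items):
        if i > 0:
            # Pause between consecutive requests (not before the first one)
            time.sleep(delay)
        results[item] = func(item)
    return results


def fake_lookup(variant_id):
    # Hypothetical stand-in for a real API call, e.g. ensembl.variation(variant_id)
    return {"id": variant_id}


# A short delay keeps the demo fast; use about 1 second against a real API
variant_info = throttled_map(fake_lookup, ["rs1", "rs2", "rs3"], delay=0.01)
```

In a real script you would pass the actual query function instead of `fake_lookup` and keep `delay` at a second or so.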

The translate endpoint translates the variant to different notations.

In [None]:
notations=ensembl.variation("rs56116432", method='translate')
notations

The `vep` method runs VEP, that's it. There are a lot of tools available for you to choose from for annotations. The only downside is that your variant needs to be a `ccm_benchmate.variant.Variant` instance. This is not that big of a deal, as we will see.

In [None]:
from ccm_benchmate.variant.variant import SequenceVariant
# See variant module for more information
myvar= SequenceVariant(1, 55051215, 'G', 'GA')

vep_info = ensembl.vep(species="human", variant=myvar, tools=None)
vep_info

Here is a list of all the available tools. Not all of them can be used at once; some of them are mutually exclusive. If a tool doesn't apply to your variant (for example, an intergenic variant will not get an AlphaMissense score), that field will not be in the returned `JSON` object. Do not rely on the existence of keys. This is true for all the APIs incorporated in this module; that is just how these endpoints work.
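
Since keys can be missing, it helps to read nested fields defensively. The following is a minimal sketch under stated assumptions: `get_nested` is a hypothetical helper (not part of `ccm_benchmate`), and the field names in `example` are made up for illustration; the real names depend on the endpoint and the tools you request.

```python
def get_nested(record, *keys, default=None):
    """Walk nested dicts/lists and return `default` if any step is missing."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # Missing key, index out of range, or a non-container value
            return default
    return current


# Hypothetical response fragment; these field names are illustrative only
example = {"consequences": [{"gene": "GENE1"}]}

gene = get_nested(example, "consequences", 0, "gene")
# The tool did not apply to this variant, so its field is simply absent
score = get_nested(example, "consequences", 0, "some_tool", "score")
```

Here `gene` is found, while `score` falls back to `None` instead of raising a `KeyError`.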

The `phenotype` method returns information about phenotypes that are associated with a genomic range. For that you will need a `GenomicRange` instance; please see its readme and notebook for more information.

In [None]:
from ccm_benchmate.ranges.genomicranges import GenomicRange
grange = GenomicRange(9, 22125503, 22125520, "+")
phenotypes = ensembl.phenotype(grange)
phenotypes

In [None]:
import pandas as pd
pd.DataFrame(phenotypes)

As you can see, there is a lot of information here. We can look at one of the entries to see what's in it.

In [None]:
phenotypes[1]

We can see all the information that comes with it. There are some common fields like location and ontology accessions; additionally, you can see if there are papers associated with this annotation. The locations can then be used to search that region and see what other genomic features are present, the papers can be used in the literature module to ingest and create a knowledge base (see its notebook and documentation), and the ids can be used in the NCBI API to get more information about them.

There are a couple of straightforward endpoints; we will just go through them quickly.

In [None]:
seq = ensembl.sequence("ENSG00000139618")
seq

In [None]:
from ccm_benchmate.ranges.genomicranges import GenomicRange
grange = GenomicRange(9, 22125503, 22125520, "+")

overlap = ensembl.overlap(grange, features=["transcript"])
overlap

There are a lot of features available to search, but not all of them are available at any given time for a given region.

Sometimes you want to convert one coordinate system to another, like genomic coordinates to protein coordinates. This can be annoying to code on your own, which is why we have the mapping endpoint.

In [None]:
ensembl.mapping("ENST00000650946", 100, 120, type="cDNA")

The last one we are going to explore is, I think, the most useful one. The `xrefs` endpoint returns all the other databases that the Ensembl API cross-references. You can then take these ids and use the API endpoints presented here or elsewhere to deepen your research.

In [None]:
xrefs = ensembl.xrefs("ENSG00000139618")
xrefs

Finally, you can get information about the species and the kinds of information available in the API (there may be changes, and that is beyond our control) using the `Ensembl.info` method.
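
Returning to the `xrefs` output: before passing ids on to other endpoints, it can help to group them by source database. This is a minimal sketch that assumes each cross reference is a dict with `dbname` and `primary_id` keys, which is an assumption about the returned structure; `group_xrefs` is a hypothetical helper and `example_xrefs` is made-up data.

```python
from collections import defaultdict


def group_xrefs(xrefs):
    """Group cross-reference ids by the database they come from."""
    grouped = defaultdict(list)
    for entry in xrefs:
        # Keys are assumed; use .get so a missing key does not raise
        db = entry.get("dbname", "unknown")
        identifier = entry.get("primary_id")
        if identifier is not None:
            grouped[db].append(identifier)
    return dict(grouped)


# Made-up records for illustration; not real Ensembl output
example_xrefs = [
    {"dbname": "DB_A", "primary_id": "A1"},
    {"dbname": "DB_B", "primary_id": "B1"},
    {"dbname": "DB_A", "primary_id": "A2"},
    {"dbname": "DB_C"},  # no id, so it is skipped
]

by_db = group_xrefs(example_xrefs)
```
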

## Stringdb API

Stringdb is a web platform that focuses on protein-protein interactions. You will need to specify your species and protein identifiers. I've also built in an option to run the interaction queries recursively: you can take a protein and gather all the other proteins that interact with it, then take them all and repeat the process to generate a network of arbitrary depth. Of course, this will increase the number of things returned exponentially and will take exponentially longer, so keep that in mind.

In [None]:
from ccm_benchmate.apis.stringdb import StringDb

stringdb=StringDb()

In [None]:
import pandas as pd
network = stringdb.gather("human", name="ENSP00000354587", get_network=False)
pd.DataFrame(network)

`get_network` specifies whether you want to get the interactors of the interactors. If you set it to `True` and specify a network depth, the number of results will grow exponentially, so anything over 3 is probably overkill by a wide margin. You can use a wide range of identifiers: in the example above we are using an Ensembl protein id (things need to be proteins), but it can be a whole bunch of other ids. See their documentation for details.
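
To see why depth matters, here is a minimal sketch of a depth-limited, breadth-first expansion. It is an illustration of the idea and not the module's actual implementation: `expand_network` is a hypothetical function, and the `toy` dictionary stands in for real interaction queries.

```python
def expand_network(seed, get_interactors, depth=1):
    """Collect interaction edges up to `depth` steps away from the seed."""
    seen = {seed}
    frontier = [seed]
    edges = []
    for _ in range(depth):
        next_frontier = []
        for protein in frontier:
            for partner in get_interactors(protein):
                edges.append((protein, partner))
                if partner not in seen:
                    # Only proteins not queried before are expanded next round
                    seen.add(partner)
                    next_frontier.append(partner)
        frontier = next_frontier
    return edges


# Toy interaction map standing in for real API calls
toy = {"A": ["B", "C"], "B": ["A", "D"], "C": ["E"], "D": [], "E": []}

edges_depth_1 = expand_network("A", lambda p: toy.get(p, []), depth=1)
edges_depth_2 = expand_network("A", lambda p: toy.get(p, []), depth=2)
```

Each extra level queries every newly found protein, so the number of requests multiplies with each step.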

Speaking of protein-protein interactions, we have 2 more APIs implemented.

## BioGrid

BioGrid is a similar platform that focuses on protein-protein interactions, with some experimental data annotations describing how each interaction was determined. To use BioGrid you need an access key, but it is free.

In [1]:
from ccm_benchmate.apis.others import BioGrid
biogrid=BioGrid(access_key="<your api key>")

Initial configuration collects a bunch of up-to-date information, so it might take a few seconds. This information includes the kinds of evidence that are collected, the organisms, and the types of identifiers that are supported. This is useful because we will need to specify these things when we query the database.

In [5]:
interactions=biogrid.interactions(gene_list=["ENSP00000354587"])
interactions

Unnamed: 0,BIOGRID_INTERACTION_ID,ENTREZ_GENE_A,ENTREZ_GENE_B,BIOGRID_ID_A,BIOGRID_ID_B,SYSTEMATIC_NAME_A,SYSTEMATIC_NAME_B,OFFICIAL_SYMBOL_A,OFFICIAL_SYMBOL_B,SYNONYMS_A,...,PUBMED_ID,ORGANISM_A,ORGANISM_B,THROUGHPUT,QUANTITATION,MODIFICATION,ONTOLOGY_TERMS,QUALIFICATIONS,TAGS,SOURCEDB
0,103,6416,2318,112315,108607,-,-,MAP2K4,FLNC,JNKK|JNKK1|MAPKK4|MEK4|MKK4|PRKMK4|SAPKK-1|SAP...,...,9006895,9606,9606,Low Throughput,-,-,{},-,-,BIOGRID
1,117,84665,88,124185,106603,-,-,MYPN,ACTN2,CMD1DD|CMH22|MYOP|RCM4,...,11309420,9606,9606,Low Throughput,-,-,{},-,-,BIOGRID
2,183,90,2339,106605,108625,-,-,ACVR1,FNTA,ACTRI|ACVR1A|ACVRLK2|ALK2|FOP|SKR1|TSRI,...,8599089,9606,9606,Low Throughput,-,-,{},-,-,BIOGRID
3,278,2624,5371,108894,111384,-,-,GATA2,PML,DCML|IMD21|MONOMAC|NFE1B,...,10938104,9606,9606,Low Throughput,-,-,{},-,-,BIOGRID
4,418,6118,6774,112038,112651,RP4-547C9.3,-,RPA2,STAT3,REPA2|RP-A p32|RP-A p34|RPA32,...,10875894,9606,9606,Low Throughput,-,-,{},-,-,BIOGRID
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,40341,36328,39088,62105,64481,Dmel_CG13167,Dmel_CG3922,Vha36-2,RpS17,CG13167|Dmel\CG13167,...,14605208,7227,7227,High Throughput,-,-,{},-,-,BIOGRID
9996,40342,36328,3354918,62105,78043,Dmel_CG13167,Dmel_CG17420,Vha36-2,RpL15,CG13167|Dmel\CG13167,...,14605208,7227,7227,High Throughput,-,-,{},-,-,BIOGRID
9997,40343,36327,33616,62104,59817,Dmel_CG13168,Dmel_CG15423,CG13168,CG15423,Dmel\CG13168,...,14605208,7227,7227,High Throughput,-,-,{},-,-,BIOGRID
9998,40344,36327,43665,62104,68515,Dmel_CG13168,Dmel_CG15549,CG13168,CG15549,Dmel\CG13168,...,14605208,7227,7227,High Throughput,-,-,{},-,-,BIOGRID


In [None]:
biogrid.organisms

## IntAct

IntAct is another interaction database. There were a lot of requests to include all of these in the package. While they provide similar information, they do have different use cases.

In [1]:
from ccm_benchmate.apis.others import IntAct
intact=IntAct(page_size=100)

In [3]:
# To search IntAct you need the EBI id; you can get this from ensembl.xrefs or from UniProt (see part 2)
interactions=intact.intact_search("Q05471")
interactions

Unnamed: 0,idA,idB,taxidA,taxidB,experimental_role_A,experimental_role_B,stoichiometry_A,stoichiometry_B,detection_method,annotations,is_negative,affected_by_mutation,pubmed_id,score
0,Q05471 (uniprotkb),Q05471 (uniprotkb),559292,559292,bait,prey,0-0,0-0,tap,"author-list (Lambert J-P, Fillingham J. , Gree...",False,False,21179020,0.35
1,Q05471 (uniprotkb),Q06707 (uniprotkb),559292,559292,bait,prey,0-0,0-0,tap,curation depth (IMEx)\npublication year (2004)...,False,False,15045029,0.80
2,Q05471 (uniprotkb),Q06707 (uniprotkb),559292,559292,bait,prey,0-0,0-0,anti tag coip,accepted (Accepted 2024-MAY-08 AT 09:40 BST AT...,False,False,37968396,0.80
3,P38326 (uniprotkb),Q05471 (uniprotkb),559292,559292,bait,prey,0-0,0-0,anti tag coip,accepted (Accepted 2024-MAY-08 AT 09:40 BST AT...,False,False,37968396,0.85
4,P31376 (uniprotkb),Q05471 (uniprotkb),559292,559292,bait,prey,0-0,0-0,anti tag coip,accepted (Accepted 2024-MAY-08 AT 09:40 BST AT...,False,False,37968396,0.85
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,Q03940 (uniprotkb),Q05471 (uniprotkb),559292,559292,bait,prey,1-1,1-1,tap,author-confidence (z-score: 0.683760282208367)...,False,False,16554755,0.93
244,Q05471 (uniprotkb),Q03940 (uniprotkb),559292,559292,prey,bait,1-1,1-1,tap,author-confidence (probability: 99.6)\nuniprot...,False,False,16554755,0.93
245,Q03940 (uniprotkb),Q05471 (uniprotkb),559292,559292,bait,prey,0-0,0-0,tap,accepted (null)\nfigure legend (Supplemental T...,False,False,16429126,0.93
246,Q03940 (uniprotkb),Q05471 (uniprotkb),559292,559292,prey,bait,1-1,1-1,tap,author-confidence (probability: 99.4)\nuniprot...,False,False,16554755,0.93


The IntAct database contains information not just about protein-protein interactions but also about other molecule types. This means your response could be quite large. I have also set things up so that the API keeps searching for interactions until the last page is reached. This means you will get all the results once the request is complete, but if your request returns a lot of information it might take anywhere from a few seconds to a few minutes.
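
The "keep going until the last page" behaviour can be sketched as a simple loop. This is a minimal sketch under stated assumptions and not the module's actual code: `fetch_all_pages` and `fake_fetch_page` are hypothetical, and the `content` and `last` keys are assumed response fields chosen for illustration.

```python
def fetch_all_pages(fetch_page, page_size=100):
    """Request consecutive pages until the service reports the last one."""
    rows = []
    page = 0
    while True:
        response = fetch_page(page, page_size)
        rows.extend(response["content"])
        if response["last"]:
            break
        page += 1
    return rows


def fake_fetch_page(page, page_size):
    # Hypothetical paged service holding 250 records, used instead of a real request
    data = list(range(250))
    start = page * page_size
    chunk = data[start:start + page_size]
    return {"content": chunk, "last": start + page_size >= len(data)}


all_rows = fetch_all_pages(fake_fetch_page, page_size=100)
```

With 250 records and a page size of 100 the loop makes three requests, which is why large result sets take noticeably longer.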