In [1]:
import requests
import json

This notebook illustrates the use of the 'raw' REST API implemented by the metagraph server. For most use cases, we recommend the Python client library, which simplifies the logistics of making queries.
The REST API consumes and replies with JSON objects.

First, we set the base URL for our API calls. In this notebook, we query the metasub19 graph.
Note that if you run a metagraph server instance locally, the base URL would simply be `http://localhost:5555` (the port may vary, of course).

In [2]:
base_url = 'http://dnaloc.ethz.ch/api/metasub19'

# Getting Basic Graph Information

## Graph Stats

Some basic statistics about a graph and its annotations.

The returned JSON object has two keys, one describing the annotation and the other the graph. Each maps to its own dictionary of key-value pairs.

In [3]:
r = requests.get(url=f'{base_url}/stats')

print(json.dumps(r.json(), indent=2))

{
  "annotation": {
    "filename": "output_k19_cleaned_graph_annotation_small.collect.relabeled.brwt.annodbg",
    "labels": 4220,
    "objects": 71751760043,
    "relations": 243269874047
  },
  "graph": {
    "filename": "graph_merged_k19.small.dbg",
    "is_canonical_mode": true,
    "k": 19,
    "nodes": 71751760043
  }
}


## Graph Labels

List of all column labels. 

Note that metadata is often encoded in a column label. Entries are separated by `;`: the first entry is the sample ID/sample name, and it is followed by key-value pairs, each written as `key=value`.

In [4]:
r = requests.get(url=f'{base_url}/column_labels')
len(r.json())

4220

In [5]:
r.json()[0]

'haib17CEM4890_H2NYMCCXY_SL254769;metasub_name=CSD16-OFA-050;city=offa;latitude=8.1548483;longitude=4.7263633;surface_material=wood;station=nan;num_reads=24759619.0;city_latitude=8.0;city_longitude=4.0;city_total_population=90000.0;continent=sub_saharan_africa;sample_type=environmental_microbiome'
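
The label layout described above can be unpacked with a few lines of Python. This is a minimal sketch: `parse_column_label` is a hypothetical helper name, not part of the API, and it assumes that `;` never appears inside a key or value. All values are left as strings.

```python
def parse_column_label(label):
    """Split a column label into (sample_id, metadata dict).

    Assumes the layout 'sample_id;key=value;key=value;...'.
    """
    sample_id, *entries = label.split(';')
    metadata = {}
    for entry in entries:
        # Split on the first '=' only, so a value may itself contain '='.
        key, _, value = entry.partition('=')
        metadata[key] = value
    return sample_id, metadata


# Shortened version of the label shown above
label = ('haib17CEM4890_H2NYMCCXY_SL254769;metasub_name=CSD16-OFA-050;'
         'city=offa;surface_material=wood')
sample_id, metadata = parse_column_label(label)
print(sample_id)
print(metadata['city'])
```

Numeric fields such as `num_reads` could be converted with `float()` afterwards if needed.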

# Search for Sequences

## Basic Search

In the basic usage, you can upload the equivalent of a FASTA file along with query parameters.

The "FASTA file" should be converted to a single string with `\n` as line breaks. Whatever follows the `>` on the same line is taken as the sequence identifier/description and is returned in the `seq_description` property of a result object. This way, if multiple sequences are queried, each result can easily be attributed to its original query sequence.

The `discovery_fraction` is a value in the range [0, 1.0]. It is the minimum fraction of a sequence's kmers that must be found with a given annotation for that annotation to be reported.
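
As a worked example of this threshold: a sequence of length L contains L - k + 1 kmers, so the 43-character query used in this notebook has 25 kmers in the k=19 graph. The sketch below checks whether a given kmer count reaches a `discovery_fraction`. The function names are hypothetical, and how the server treats the boundary is an assumption (here an annotation passes when its fraction is greater than or equal to the threshold).

```python
def num_kmers(seq, k):
    """Number of kmers of length k in seq (0 if seq is shorter than k)."""
    return max(len(seq) - k + 1, 0)


def reaches_fraction(kmer_count, seq, k, discovery_fraction):
    """True if kmer_count / total kmers is at least discovery_fraction.

    Assumption: the boundary is inclusive; the server may round differently.
    """
    total = num_kmers(seq, k)
    return total > 0 and kmer_count / total >= discovery_fraction


seq = 'GTGAGGGGGGCAAAAATAAGAAGCAAGTTCTGAAGTTCACTCT'
print(num_kmers(seq, 19))                  # 25
print(reaches_fraction(10, seq, 19, 0.4))  # True: 10 / 25 = 0.4
print(reaches_fraction(9, seq, 19, 0.4))   # False: 9 / 25 = 0.36
```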

The call returns a list of objects, one for each query sequence. Each object contains the following fields:
* `seq_description`: description of the sequence (the `>` part in the FASTA file)
* `results`: result for that sequence, which is a list of objects with the following fields:
    * `kmer_count`: number of kmers of the sequence found with this annotation
    * `sample`: column name of the annotation
    * `properties` (optional): dictionary with metadata about the sample. Its keys generally differ from one graph to another
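
To make the nesting concrete, the sketch below walks a response of this shape and yields one flat row per (query sequence, sample) pair. `flatten_search_results` is a hypothetical helper, and `example_response` is a hand-written stand-in with the documented fields, not real server output.

```python
def flatten_search_results(response):
    """Yield (seq_description, sample, kmer_count, properties) tuples."""
    for seq_result in response:
        for hit in seq_result['results']:
            # 'properties' is optional, so fall back to an empty dict.
            yield (seq_result['seq_description'],
                   hit['sample'],
                   hit['kmer_count'],
                   hit.get('properties', {}))


# Hand-written stand-in for the JSON returned by the search endpoint
example_response = [
    {'seq_description': 'query_1',
     'results': [{'sample': 'sample_A', 'kmer_count': 10,
                  'properties': {'city': 'berlin'}},
                 {'sample': 'sample_B', 'kmer_count': 7}]},
    {'seq_description': 'query_2', 'results': []},
]

rows = list(flatten_search_results(example_response))
for row in rows:
    print(row)
```

Rows in this form can be passed directly to, for example, a CSV writer or a data frame constructor.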

In [6]:
fasta_str = "\n".join([">example_query",
                       "GTGAGGGGGGCAAAAATAAGAAGCAAGTTCTGAAGTTCACTCT"])

fasta_str

'>example_query\nGTGAGGGGGGCAAAAATAAGAAGCAAGTTCTGAAGTTCACTCT'

In [7]:
req = {"FASTA": fasta_str,
       "discovery_fraction": 0.4}
ret = requests.post(url=f'{base_url}/search', json=req)
ret.json()

[{'results': [{'kmer_count': 10,
    'properties': {'city': 'berlin',
     'city_latitude': 52.52436828613281,
     'city_longitude': 13.410530090332031,
     'city_total_population': 3711930.0,
     'continent': 'europe',
     'latitude': None,
     'longitude': None,
     'metasub_name': 6.0,
     'num_reads': 45883240.0,
     'sample_type': 'environmental_microbiome',
     'station': None,
     'surface_material': None},
    'sample': 'haib17CEM4890_HKC32ALXX_SL254684'},
   {'kmer_count': 10,
    'properties': {'city': 'tokyo',
     'city_latitude': 35.68949890136719,
     'city_longitude': 139.69171142578125,
     'city_total_population': 13839910.0,
     'continent': 'east_asia',
     'latitude': 35.600887298583984,
     'longitude': 139.68487548828125,
     'metasub_name': 'CSD16-TOK-013',
     'num_reads': 40433992.0,
     'sample_type': 'environmental_microbiome',
     'station': None,
     'surface_material': 'vinyl'},
    'sample': 'haib17CEM4890_HKC32ALXX_SL254689'},
   {'km

## More Advanced Search

Additional parameters are supported:

* `num_labels`: return only the top `num_labels` results per sequence, ranked by kmer count


It is also possible to *align* a sequence first and use the aligned sequence to query the graph. This is done via the `align=True` parameter (the default is `align=False`). With `align=True`, the parameter `max_num_nodes_per_seq_char` is also supported (see the "Alignment" section below).

In this case, the object returned for each query sequence contains additional fields (the entries in `results` keep the same format):

* `sequence`: aligned sequence
* `score`: alignment score
* `cigar`: the CIGAR string of the aligned sequence

In [8]:
req = {"FASTA": fasta_str,
       "discovery_fraction": 0.4,
       "align": True,  
       "num_labels": 2}

ret = requests.post(url=f'{base_url}/search', json=req)
ret.json()

[{'cigar': '5=1X2=1X34=',
  'results': [{'kmer_count': 12,
    'properties': {'city': 'berlin',
     'city_latitude': 52.52436828613281,
     'city_longitude': 13.410530090332031,
     'city_total_population': 3711930.0,
     'continent': 'europe',
     'latitude': None,
     'longitude': None,
     'metasub_name': 6.0,
     'num_reads': 56733196.0,
     'sample_type': 'environmental_microbiome',
     'station': None,
     'surface_material': None},
    'sample': 'haib17CEM4890_HKC32ALXX_SL254696'},
   {'kmer_count': 11,
    'properties': {'city': 'berlin',
     'city_latitude': 52.52436828613281,
     'city_longitude': 13.410530090332031,
     'city_total_population': 3711930.0,
     'continent': 'europe',
     'latitude': None,
     'longitude': None,
     'metasub_name': 6.0,
     'num_reads': 45883240.0,
     'sample_type': 'environmental_microbiome',
     'station': None,
     'surface_material': None},
    'sample': 'haib17CEM4890_HKC32ALXX_SL254684'}],
  'score': 76,
  'seq_desc

## Alignment

The `/align` endpoint allows the alignment of sequences to the graph.

In [9]:
req = {"FASTA": fasta_str}

ret = requests.post(url=f'{base_url}/align', json=req)
ret.json()

[{'alignments': [{'cigar': '5=1X2=1X34=',
    'score': 76,
    'sequence': 'GTGAGAGGAGCAAAAATAAGAAGCAAGTTCTGAAGTTCACTCT'}],
  'seq_description': 'example_query'}]

For every submitted sequence, it returns:
 * `seq_description`: description of the sequence (the `>` part in the FASTA file)
 * `alignments`: list of alignment result objects, each consisting of:
     * `sequence`: aligned sequence
     * `score`: alignment score
     * `cigar`: the CIGAR string of the aligned sequence
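
The CIGAR strings shown in this notebook use `=` for matching positions and `X` for mismatches. As a rough illustration, the following sketch splits such a string into (length, operation) pairs and tallies them. `parse_cigar` is a hypothetical helper, and the set of operation letters it accepts is an assumption based on the common CIGAR alphabet rather than on the server's documentation.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r'(\d+)([MIDNSHP=X])')


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Guard against characters the pattern silently skipped.
    if ''.join(f'{n}{op}' for n, op in ops) != cigar:
        raise ValueError(f'unexpected CIGAR string: {cigar!r}')
    return ops


ops = parse_cigar('5=1X2=1X34=')
totals = Counter()
for length, op in ops:
    totals[op] += length

print(ops)
print(totals['='], totals['X'])  # 41 matches, 2 mismatches
```

For the example query, the two `X` positions correspond to the two bases where the aligned sequence differs from the submitted one.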

The following additional query parameters are supported:
* `max_alternative_alignments`: maximum number of different alignments to return (default 1)
* `max_num_nodes_per_seq_char`: maximum number of nodes to consider during extension
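
A request using these parameters might be assembled as follows. This sketch assumes they are passed in the JSON body next to `FASTA`, in the same way `discovery_fraction` is passed to the search endpoint. `build_align_request` is a hypothetical helper and the value 3 is only illustrative. The POST itself is left commented out so that the snippet runs without network access.

```python
def build_align_request(fasta_str, max_alternative_alignments=None,
                        max_num_nodes_per_seq_char=None):
    """Build the JSON body for an alignment request, skipping unset options."""
    req = {'FASTA': fasta_str}
    if max_alternative_alignments is not None:
        req['max_alternative_alignments'] = max_alternative_alignments
    if max_num_nodes_per_seq_char is not None:
        req['max_num_nodes_per_seq_char'] = max_num_nodes_per_seq_char
    return req


fasta_str = '>example_query\nGTGAGGGGGGCAAAAATAAGAAGCAAGTTCTGAAGTTCACTCT'
req = build_align_request(fasta_str, max_alternative_alignments=3)
print(req)

# To send it (requires network access and the `requests` package):
# ret = requests.post(url=f'{base_url}/align', json=req)
```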


