Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions: Web API / XG path's best practice #1343

Closed
6br opened this issue Dec 24, 2017 · 6 comments
Closed

Questions: Web API / XG path's best practice #1343

6br opened this issue Dec 24, 2017 · 6 comments

Comments

@6br
Copy link
Collaborator

6br commented Dec 24, 2017

Hello, I would like to run VG as a backend service of our web application. I have some questions for VG.

First, I need a web-API for calling index, find, construct, etc....
Because it is disabled to call any commands from inside of another container, I would like to call VG commands in docker container from another container which includes the web server via a port connection. Also, it is needed to access host's file by mounting volumes.

For example, if I post a HTTP request such as {subcommand: "find", opts: ["-x", "/mnt/a.xg", "-P", "<path>"]} to the docker container, then it returns JSON response.

Second, I would like to know the performance characteristics of XG's path since I am planning to add more than 1,000 paths into one VG file, convert into XG, and run vg find -P <path>.

When many paths (e.g. many genomes, annotations, variants and so on) are included in one XG file, it might cause performance issue (e.g. large XG file size, memory consumption / long turnaround time) on running as a web service.

So, what is the point to keep good performance on both building and querying index? (e.g. Is a query for the path the linear time/space of the number of nodes included in paths? Should not many paths share the same nodes? Should not many paths share the common long prefix/suffix name?)

Additional Questions:

  • Is there any way to store not the exact paths but annotation or frequency of path on each node or edge?
  • Is there any way to have paths (especially inter-chromosomal path) as the outside of graph and can it be queried if necessary?

Sincerely,

@adamnovak
Copy link
Member

Hello,

I'm not sure we can help too much with the Docker and web API setup; vg itself includes no HTTP API. You could write some kind of HTTP server that answers queries with vg commands, but I would hesitate to structure it like you illustrate here, where the client basically sends vg command line options to the server, because it admits a simple but extremely insecure implementation.

It might make more sense to package vg and your application in the same Docker container, which I think is the approach @wolfib is taking with his IVG visualizer. You could also just let one container SSH into the other to execute commands, or look into giving the one container access to the Docker socket to invoke additional sibling containers.

The XG path performance is I believe linear in the number of stored paths. It's also designed to store paths have about as many nodes in them as the whole index has; there's overhead per path per node not in the path. I don't think you would really run into a problem once you got the paths into the index, but in order to build an XG all the paths that are going into it need to be loaded into memory at the same time, so you might need a very large amount of memory to store 1000 paths in a very large graph.

If you want to store individual genomes, you might be best off with the GBWT index, which is designed to store a large number of similar paths in a compressed way. For the storage of annotations and named variants, some work is being done on top of vg in the context of https://gitlab.codenic.de/computomics/AGV but currently we don't have a good way to do them in mainline vg yet; we've been storing them as paths in vg files, but we drop them before building xg indexes.

If you want to store per-node or per-edge frequency data, you can just store that as a mapping from node ID (for nodes) or node ID pair (for edges) to frequency value, in whatever format is convenient for you. There's no way to embed that in a vg file or xg index, though, it would have to be a separate file. With a GBWT index, you can get the number of times a node or edge (or short path) is visited by paths in the index, though, which could let you compute a frequency among indexed paths.

You can store paths outside of the graph; you can create one or more vg Path object describing your path(s) and store them as JSON or as serialized Protobuf in a file. We don't have any indexing on such files, though, so you would be limited to scanning through them. Alternately, the GBWT index can store a collection of similar, compressed paths, and you can work with more than one GBWT index.

@6br
Copy link
Collaborator Author

6br commented Jan 12, 2018

Thank you for your reply!

In the first topic, I totally agree your perspective. To avoid security issue, I will choose either: 1. write a small wrapper API server for some subcommands with fixed command line options or 2. package VG and my tool in the same container.

In the second topic, I am grateful for the detail of XG's performance and the ideas for handling per-node or per-edge frequency data. About GWBT and XG, I have some questions.

First, GBWT's compressing performance seems good for storing a large number of paths. I am interested in fetching partial graphs around the specified path range(s) TARGET=path[:pos1[-pos2]] or node id. What I would like to know is whether GBWT can be an alternative index to chunk a partial graph.

  • GBWT can construct from "phased VCF", but is there any way to construct GBWT from existing paths?
  • Is there a high-level command line interface similar to vg find -P on the GBWT?

Second, I would like to integrate not only normal genome but also tumor genome, which sometimes includes some chimeric chromosomes, into a variation graph. Chimeric chromosomes are characterized by inter-chromosomal translocation.

When I want to represent inter-chromosomal translocation as variation graph on VG, there are two ways; 1. define an inter-chromosomal variant as an alignment between two chromosomes in GAM and use vg mod to integrate them into VG format or 2. make GFA file or VG-compatible JSON file.

But, the graph generated by above seems to be unsuitable for neither XG nor GBWT. Since each haploid is the same as another haploid in locally, but not similar in the full length (since it is very rare to share the same breakpoints), each path includes many nodes not in the path. In short, it lacks the collinearity among paths. I would like to know if there is any concern for indexing in this case.

@jltsiren
Copy link
Contributor

Building GBWT for existing paths is straightforward, but we haven't implemented it yet in VG. You can use GBWT construction tools, if you can generate the paths and store them on disk in a suitable format. GBWT also does not have the inverse suffix array functionality required for extracting ranges in specific paths, but it would be easy to add it. The harder part is determining what the functionality should exactly do.

GBWT is really just an FM-index for strings over a large alphabet (node identifiers). The graph is there implicitly, but it's just a model that describes which 2-mers we have seen in the strings. The model changes every time we insert strings with new 2-mers.

We split GBWT construction by chromosome for performance reasons. Each construction job can insert 2-3 million characters (node identifiers) per second into the index using 1-2 CPU threads. Because the construction is quite space-efficient, we can run many jobs in parallel. Merging the per-chromosome GBWT indexes into a whole-genome index is fast, as long as the node ids do not overlap.

If you have inter-chromosomal paths, it's probably best to build GBWT for intra-chromosomal paths first, and then insert the inter-chromosomal paths into the merged index. Breakpoints probably won't matter, as they just add new edges to the implicit graph. As far as BWT is concerned, a breakpoint that connects two existing substrings is a simple variant. A whole-genome graph already has tens of millions of variants, and the total number of inter-chromosomal breakpoints is probably much less than that.

@6br
Copy link
Collaborator Author

6br commented Jan 14, 2018

Thank you for your replying!

I appreciate the explanation of GBWT. The ideas of GBWT is very interesting for me. I would like to clarify the procedure to embed haplotypes using GBWT.

In this workflow at gbwt_wiki, the generated XG includes only reference path, and each haplotype is stored in GBWT. If I want to chunk sub-graphs which include haplotype information, I should use query interface to extract haplotypes.

And if I want to handle graph including edges/nodes which are not assigned any paths, I should use XG since XG can store raw nodes and edges information.

Is my understanding correct?

@jltsiren
Copy link
Contributor

GBWT works at a lower level than XG. It assumes a simple directed graph where nodes have integer identifiers. In the current implementation, GBWT takes the graph structure from the paths. It would also be possible to add unused nodes and edges (e.g. by initializing the GBWT index with an existing graph), but that has not been implemented.

XG uses bidirected graphs, where nodes contain sequences in addition to having identifiers. GBWT can simulate a bidirected graph by encoding the XG node id and the node orientation into the GBWT node id (with gbwt::Node::encode()). It doesn't know anything about the sequences, so you'll need XG for that.

The GBWT construction interface is documented in the wiki. In particular, the example of buffered construction is similar to what vg index does. You generate the sequences as std::vectors of integers and insert them into a GBWTBuilder object.

XG has two categories of path-like objects: paths and threads. The main difference is that paths are defined over both nodes and sequences, while threads are only defined over nodes. If you need queries like "which node contains sequence offset i on path P" (as in vg find -p), you need paths. GBWT only stores threads, as it does not know anything about the sequences in the nodes.

The query interface is still a work in progress. At the moment, it basically supports three kinds of operations:

  • counting the number of paths matching a pattern;
  • determining the identifiers of the paths matching a pattern; and
  • extracting full paths.

Some new operations can be added quickly, while others would require changes to the file format. In particular, vg find functionality would require the changes.

@6br
Copy link
Collaborator Author

6br commented Jan 16, 2018

I understood the development status.
I will try to build GBWT from our own dataset, and I will let you know if I have any questions.

Thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants