Make all the VG commands using VGSet know how to read a file of filenames as input #234

cartoonist · 2016-02-24T08:17:45Z

I'm trying to construct and index a whole genome variation graph of a relatively small genome containing ~17200 short regions. I constructed a variation graph for each region separately. I also generated a joint id space across the graphs by using vg ids. When I try to create an xg index, I get this error message:

$ vg index -x wg.wx *.vg
[vg::map] could not concatenate graphs

In addition, when I explicitly list the file names of the variation graphs, the command exceeds the ARG_MAX limit and this error message appears:

Argument list too long

edawson · 2016-02-24T12:53:33Z

Hey Ali - that is a lot of files. Have you tried a subset of them as a smoke test?

vg index -x test.xg 1.vg 2.vg

Peeking at the source code, it looks like the [vg::map] could not concatenate graphs is probably triggered by the same ARG_MAX limit. To concatenate the graphs we just do a basic cat and pass each argument as a temporary graph, so it'll still get tripped up with so many files. Perhaps try doing them in batches by manually catting a subset of your graphs:

## Write a file that lists all of the input files
ls | grep '\.vg$' > files.txt

## Batch process files, 100 at a time. May need to use more or fewer files in a batch.
## Adjust the upper bound (17200) so the last batch reaches the end of files.txt.
for i in `seq 100 100 17200`; do
    j=`expr $i - 99`
    ## Take lines j through i of files.txt and join them with spaces
    cat `sed -n "${j},${i}p" files.txt | sed ':a;N;$!ba;s/\n/ /g'` > ${j}_${i}.vg
done

## Now cat the catted files. There will be fewer now.
for i in `ls | grep -x "[0-9][0-9]*_[0-9][0-9]*\.vg"`; do
    cat $i >> merged.vg
done

## Finally, index this (now gigantic) graph.
vg index -x merged.xg merged.vg

I'm not sure this will work, but I suspect it will be a step in the right direction. Hopefully @ekg or @adamnovak can chime in when they get some free time at their conference.

Magic sed line: http://unix.stackexchange.com/questions/114943/can-sed-replace-new-line-characters
Grab specific range of lines from file: http://stackoverflow.com/questions/191364/quick-unix-command-to-display-specific-lines-in-the-middle-of-a-file
Seq: man seq

ekg · 2016-02-24T16:32:51Z

Just concatenate those graphs together and try it again. You can loop over cat 1.vg >>combined.vg and it should work.
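A minimal sketch of that loop, assuming the inputs already share a joint ID space. The scratch directory and the text files standing in for real .vg graphs are illustrative only. Because `for` is a shell construct, the expanded glob is never handed to exec() as a single argument list, so the number of files should not matter here.

```shell
# Demo setup: three small stand-in files in a scratch directory (hypothetical names).
work=$(mktemp -d)
printf 'one\n' > "$work/1.vg"
printf 'two\n' > "$work/2.vg"
printf 'three\n' > "$work/3.vg"

# Keep the output outside the input directory so a rerun cannot pick it up as input.
combined=$(mktemp)
: > "$combined"

# Each cat call receives exactly one file name, however many files the glob matches.
for f in "$work"/*.vg; do
    cat "$f" >> "$combined"
done

cat "$combined"
```

Writing `combined.vg` into the same directory as the inputs also works on the first run, since the glob is expanded before the loop starts, but a second run would then append the previous output to itself.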

adamnovak · 2016-02-25T18:52:53Z

The *.vg notation will not work around your argument list length problem, by the way. The * is expanded by the shell, so vg gets the list of all the matching files. If that list is too long, I don't know what exactly will happen, but it won't work correctly. It might just cancel the expansion and pass along the literal "*.vg". If that happens, or if you otherwise get the shell not to expand it (like by using quotes), vg will see "*.vg" literally, which is not a vg file that it can open, and so it won't work.
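To make the expansion behaviour concrete, here is a small sketch that needs no vg at all. `count_args` is a hypothetical helper that just reports how many arguments it received, and the scratch files are placeholders.

```shell
# Hypothetical helper: print the number of arguments the command actually received.
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/1.vg" "$demo_dir/2.vg"

# Unquoted: the shell expands the pattern, so the command sees one argument per file.
unquoted=$(count_args "$demo_dir"/*.vg)

# Quoted: no expansion, so the command sees a single literal "*.vg" argument.
quoted=$(count_args "$demo_dir/*.vg")

# Pattern with no match: with default bash/sh settings it is passed through literally.
unmatched=$(echo "$demo_dir"/*.xg)

echo "unquoted: $unquoted, quoted: $quoted"
echo "unmatched pattern stays as: $unmatched"
```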

cartoonist · 2016-03-02T08:32:44Z

Thanks @ekg, @edawson, and @adamnovak. It's good to know that vg files can be merged by simply concatenating them using cat. I didn't know that and I think it will solve my problem. I'll try.

> that is a lot of files. Have you tried a subset of them as a smoke test?

Yes, I tried, and it worked without problems for a smaller number of vg files.

I'm not sure about the details, but it seems that the wildcard expansion succeeds. The E2BIG error (which produces the message "Argument list too long" and is defined in <sys/errno.h>) occurs in the exec() system call. So when I run vg (or other commands) by explicitly specifying the names of the vg files, I get this error message (for example, here I used vg ids to create a joint id space across the graphs):

$ find . -iname "*.vg" | tr '\n' ' ' | xargs -0I {} -- vg ids -j {}
xargs: Argument list too long

But when I use wildcard, it works fine:

$ vg ids -j *.vg

For indexing, I tried the same trick (using a wildcard) as a workaround, but it fails:

$ vg index -x wg.xg *.vg
[vg::map] could not concatenate graphs

I don't get the E2BIG error message, but vg fails to index. That's why I think there is an internal problem here: maybe an external command whose length exceeds the ARG_MAX limit is executed internally.
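For reference, a short sketch of how the limit can be inspected and how xargs splits its input into several commands. The `.vg` names are placeholders fed to `echo`, and `-n 2` is only there to make the batching visible. Note that the xargs call quoted above uses `-0` on input that contains no NUL bytes, so the whole space-joined list reaches xargs as a single item.

```shell
# Upper bound, in bytes, on the arguments plus environment passed to exec().
getconf ARG_MAX

# xargs reads names from stdin and starts as many commands as it needs.
# -n 2 forces tiny batches here; without -n it sizes batches to fit the system limit.
batches=$(printf '%s\n' a.vg b.vg c.vg d.vg e.vg | xargs -n 2 echo)
echo "$batches"
```

Batching like this suits commands whose separate invocations can simply be combined afterwards, such as cat. It is presumably not a substitute for a command that has to see every graph in one invocation, as `vg ids -j` does.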

ekg · 2016-03-02T08:53:17Z

Your assessment is right. vg index is running a concatenate command to put
all the files together internally. This must exceed the command length
limits.

I think the only solution for large numbers of files is to concatenate them
together externally. Now, the ID space resolution will be a pain, but it
can be scripted out by taking the files in order. For each file, increment
the IDs by the maximum ID we have seen, then record the max ID of the graph
as the new maximum ID. You can loop through the files and do this to ensure
the ID space has no collisions. I do not think the right thing will happen
here if you concatenate the files before doing this. Then, concatenation
will work and produce a valid graph.

This is all pretty annoying and should be streamlined. We could implement
file lists (a file with one .vg file per line) as a way to do this. I think
this is somewhere between a feature request and a bug. Thoughts?
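The bookkeeping described above can be sketched as plain arithmetic. This assumes the largest node ID of each graph has already been collected into "FILENAME MAX_ID" lines; `compute_offsets` and that input format are hypothetical, and the step that actually rewrites the IDs inside each graph is left out.

```shell
# Reads "FILENAME MAX_ID" lines on stdin (MAX_ID = largest ID in that file before
# shifting) and prints "FILENAME OFFSET", where OFFSET is the amount to add to
# every ID in that file so that no two files share an ID.
compute_offsets() {
    running_max=0
    while read -r name max_id; do
        printf '%s %s\n' "$name" "$running_max"
        # After shifting, this file's largest ID is running_max + max_id.
        running_max=$((running_max + max_id))
    done
}

offsets=$(printf 'a.vg 10\nb.vg 5\nc.vg 7\n' | compute_offsets)
echo "$offsets"
```

With these sample numbers the shifted ID ranges become 1-10, 11-15 and 16-22, so the files can be concatenated afterwards without collisions.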

cartoonist · 2016-03-02T09:38:29Z

As a rough idea, vg merge would be a useful command that merges all given .vg files into one big .vg file with a collision-free ID space. The input .vg file names for this command could be provided as either command-line arguments or a file list. Then, one could use the resulting file for all sorts of vg commands that take multiple .vg files.

ekg · 2016-03-02T09:59:00Z

It might be easier to teach vg ids to read a file list. Then you can do the
merge as a ID space unification followed by cat. That said I won't stop
anyone from making vg merge!
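Until something like that exists in vg itself, the "cat" half of the idea can be approximated in the shell. This is only a sketch: `cat_file_list` is a hypothetical helper, the list format (one path per line) follows the proposal above, and it assumes the ID spaces have already been unified.

```shell
# cat_file_list LIST: write every file named in LIST (one path per line) to stdout.
cat_file_list() {
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue          # skip blank lines
        if [ ! -f "$path" ]; then
            echo "missing input: $path" >&2
            return 1
        fi
        cat "$path"
    done < "$1"
}

# Demo with text files standing in for graphs (hypothetical names).
work=$(mktemp -d)
printf 'A\n' > "$work/a.vg"
printf 'B\n' > "$work/b.vg"
printf '%s\n' "$work/a.vg" "" "$work/b.vg" > "$work/files.txt"

merged=$(cat_file_list "$work/files.txt")
echo "$merged"
```

Since the names are read from a file rather than passed as arguments, the length of the list is not subject to ARG_MAX.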

ekg · 2016-03-10T12:21:42Z

@cartoonist Have you managed to resolve this (even in a hacky way as described here)?

The issue is open because this shouldn't need to be scripted out.

ekg · 2016-03-10T12:22:49Z

@cartoonist have you tried building the graph off of the reference FASTA made of the 17200 contigs? It seems like it might just work if it's small. The tutorial focused on the problem of building the graph for a very large genome.

cartoonist · 2016-03-30T11:26:23Z

Hi @ekg, I was on vacation. Sorry for the late reply. I will check and let you know in a few days how I managed to create the graph.

ekg · 2016-03-30T12:11:53Z

I've been testing with the current HEAD and things are going pretty well.
Still some things that need to be documented for handling large graphs.
I'll be updating the wiki to explain.

adamnovak · 2016-10-19T16:23:49Z

Is this still a problem? And is the fact that people may want to operate on more graphs than they can fit on a command line still in scope for vg?

Do we want to change the issue to something like "Make all the VG commands using VGSet know how to read a file of filenames as input"?

ekg · 2016-10-20T09:04:48Z

Makes sense to me.

adamnovak changed the title ~~Unable to create xg index~~ Make all the VG commands using VGSet know how to read a file of filenames as input Oct 20, 2016
