Make all the VG commands using VGSet know how to read a file of filenames as input #234

Open
cartoonist opened this issue Feb 24, 2016 · 13 comments

Comments

@cartoonist
Contributor

I'm trying to construct and index a whole genome variation graph of a relatively small genome containing ~17200 short regions. I constructed variation graphs for each region separately. I also generated a joint id space across each graph by using vg ids. When I try to create xg index, I got this error message:

$ vg index -x wg.xg *.vg
[vg::map] could not concatenate graphs

In addition, when I try to explicitly indicate the file name of variation graphs, it reaches the ARG_MAX limit and this error message appears:

Argument list too long
@edawson
Contributor

edawson commented Feb 24, 2016

Hey Ali - that is a lot of files. Have you tried a subset of them as a smoke test?

vg index -x test.xg 1.vg 2.vg

Peeking at the source code, it looks like the "[vg::map] could not concatenate graphs" error is probably triggered by the same ARG_MAX limit. To concatenate the graphs we just do a basic cat and pass each argument as a temporary graph, so it will still get tripped up by so many files. Perhaps try doing them in batches by manually catting a subset of your graphs:

  ## Write a file that lists all of the input files
  for i in `ls | grep '\.vg$'`; do echo $i >> files.txt; done

  ## Batch process files, 100 at a time. May need to use more or fewer files in a batch.
  ## If the file count is not a multiple of 100, round the upper bound up to the next multiple.
  for i in `seq 100 100 17200`; do
      j=`expr $i - 99`    # first line of this batch in files.txt
      cat `sed -n "${j},${i}p" files.txt | sed ':a;N;$!ba;s/\n/ /g'` > ${j}_${i}.vg
  done

  ## Now cat the catted files. There will be fewer now.
  for i in `ls | grep -o '^[0-9]*_[0-9]*\.vg$'`; do
      cat $i >> merged.vg
  done

  ## Finally, index this (now gigantic) graph.
  vg index -x merged.xg merged.vg

I'm not sure this will work, but I suspect it will be a step in the right direction. Hopefully @ekg or @adamnovak can chime in when they get some free time at their conference.

Magic sed line: http://unix.stackexchange.com/questions/114943/can-sed-replace-new-line-characters
Grab specific range of lines from file: http://stackoverflow.com/questions/191364/quick-unix-command-to-display-specific-lines-in-the-middle-of-a-file
Seq: man seq

@ekg
Member

ekg commented Feb 24, 2016

Just concatenate those graphs together and try it again. You can loop over cat 1.vg >>combined.vg and it should work.
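A minimal sketch of that loop, in case it helps. It assumes the graphs already share a joint ID space (for example after vg ids -j) and that they all sit in one directory. The function name concat_graphs is made up for illustration and is not part of vg. Because the for loop and its glob run inside the shell, no single exec() call ever receives the full list of names.

```sh
# Concatenate every .vg file in a directory into one output file.
# concat_graphs is a hypothetical helper, not a vg command.
concat_graphs() {
    dir=$1
    out=$2
    : > "$out"                            # start from an empty output file
    for f in "$dir"/*.vg; do
        [ -e "$f" ] || continue           # the glob matched nothing
        [ "$f" -ef "$out" ] && continue   # never append the output to itself
        cat "$f" >> "$out"
    done
}
```

Something like `concat_graphs . combined.vg` followed by `vg index -x wg.xg combined.vg` would then be the thing to try.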

@adamnovak
Member

The *.vg notation will not work around your argument list length problem, by the way. The * is expanded by the shell, so vg gets the list of all the matching files. If that list is too long, I don't know exactly what will happen, but it won't work correctly. It might just cancel the expansion and pass along the literal "*.vg". If that happens, or if you otherwise get the shell not to expand it (for example by quoting it), vg will see "*.vg" literally, which is not a vg file that it can open, and so it won't work.
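To make the expansion behaviour concrete, here is a small demonstration that does not involve vg at all. The helper count_args and the three empty files are made up for the example. It only shows the two cases that are certain: an unquoted pattern arrives as one argument per matching file, and a quoted pattern arrives as a single literal string. (In bash, an unquoted pattern that matches nothing is also passed through literally by default.)

```sh
# Print how many arguments a command actually receives.
count_args() { echo "$#"; }

demo=$(mktemp -d)
touch "$demo/a.vg" "$demo/b.vg" "$demo/c.vg"

# Unquoted: the shell expands the pattern before count_args runs.
expanded=$(cd "$demo" && count_args *.vg)

# Quoted: the pattern is passed through as one literal string.
quoted=$(cd "$demo" && count_args "*.vg")

echo "unquoted: $expanded argument(s), quoted: $quoted argument(s)"
rm -rf "$demo"
```

This prints `unquoted: 3 argument(s), quoted: 1 argument(s)`.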

@cartoonist
Contributor Author

Thanks @ekg, @edawson, and @adamnovak. It's good to know that vg files can be merged by simply concatenating them using cat. I didn't know that and I think it will solve my problem. I'll try.

> that is a lot of files. Have you tried a subset of them as a smoke test?

Yes, I tried, and it worked without problems for a smaller number of vg files.

I'm not sure about the details, but it seems that the wildcard expansion itself succeeds. The E2BIG error (defined in <sys/errno.h> and reported as "Argument list too long") occurs in the exec() system call. So when I run vg (or other commands) by explicitly specifying the names of the vg files, I get this error message (for example, here I used vg ids to create a joint id space across the graphs):

$ find . -iname "*.vg" | tr '\n' ' ' | xargs -0I {} -- vg ids -j {}
xargs: Argument list too long

But when I use wildcard, it works fine:

$ vg ids -j *.vg

But for indexing, I did the same trick (using wildcard) as a workaround for this issue:

$ vg index -x wg.xg *.vg
[vg::map] could not concatenate graphs

I don't get the E2BIG error message, but vg fails to index. That's why I think there is an internal problem here: maybe vg internally executes some external command whose length exceeds the ARG_MAX limit.
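A rough way to check how close a directory of graphs is to the limit is sketched below. The helper names arg_bytes and fits_in_arg_max are made up for illustration. The estimate counts each path plus its terminating NUL byte and compares the total with getconf ARG_MAX. The real budget is smaller, since the environment and the pointer arrays count as well. On Linux a single argument string is additionally capped (MAX_ARG_STRLEN, 131072 bytes with 4 KiB pages), which would matter if the internal concatenation is handed to a shell as one command string. That last part is a guess about vg's internals, not something confirmed in this thread.

```sh
# Rough size, in bytes, of the argument list a "*.vg" expansion would produce:
# every path plus one terminating NUL byte each.
arg_bytes() {
    find "$1" -maxdepth 1 -name '*.vg' -print0 | wc -c | tr -d ' '
}

# Succeeds when that size is below the limit reported by getconf.
# This is optimistic: the environment and pointer arrays use the same budget.
fits_in_arg_max() {
    [ "$(arg_bytes "$1")" -lt "$(getconf ARG_MAX)" ]
}
```

For example, `fits_in_arg_max . || echo "too many graphs for one command line"`.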

@ekg
Copy link
Member

ekg commented Mar 2, 2016

Your assessment is right. vg index is running a concatenate command to put
all the files together internally. This must exceed the command length
limits.

I think the only solution for large numbers of files is to concatenate them
together externally. Now, the ID space resolution will be a pain, but it
can be scripted out by taking the files in order. For each file, increment
the IDs by the maximum ID we have seen, then record the max ID of the graph
as the new maximum ID. You can loop through the files and do this to ensure
the ID space has no collisions. I do not think the right thing will happen
here if you concatenate the files before doing this. Then, concatenation
will work and produce a valid graph.
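The loop described above could be sketched roughly as follows. Two things here are assumptions that should be checked against the installed vg version: that vg stats -r prints a line ending in MIN:MAX node IDs, and that vg ids -i N writes the graph with its IDs increased by N to stdout. The function names max_node_id and unify_and_concat are made up for the example. Each graph is shifted before it is appended, so nothing is concatenated with colliding IDs.

```sh
# ASSUMPTION: `vg stats -r FILE` prints a line ending in "MIN:MAX" node IDs.
max_node_id() {
    vg stats -r "$1" | awk -F: '{ print $NF }'
}

# Shift each graph named in the list file $1 (one path per line) past all
# IDs seen so far, and append the shifted graph to $2.
# ASSUMPTION: `vg ids -i N FILE` writes the graph with IDs increased by N.
unify_and_concat() {
    list=$1
    merged=$2
    offset=0
    : > "$merged"
    # Read the list on fd 3 so commands in the loop cannot consume it.
    while IFS= read -r graph <&3; do
        [ -n "$graph" ] || continue       # skip blank lines
        vg ids -i "$offset" "$graph" >> "$merged" || return 1
        offset=$(( offset + $(max_node_id "$graph") ))
    done 3< "$list"
}
```

Since the graphs are taken in list order, the file list also fixes the order of the final ID space.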

This is all pretty annoying and should be streamlined. We could implement
file lists (a file with one .vg file per line) as a way to do this. I think
this is somewhere between a feature request and a bug. Thoughts?

@cartoonist
Contributor Author

As a rough idea, vg merge would be a useful command that merges all the given .vg files into one big .vg file with a collision-free ID space. The input .vg file names could be provided either as command-line arguments or as a file list. Then, one could use the resulting file with all sorts of vg commands that take multiple .vg files.
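Until something like that exists, the file-list half of the idea can be approximated outside vg. The sketch below is only a stand-in: cat_from_list is a made-up name, and it assumes the listed graphs already have a collision-free ID space. xargs splits the list into as many cat invocations as it needs, so the number of graphs is not bounded by the argument-length limit.

```sh
# Concatenate the graphs named in a file list (one path per line) into $2.
# cat_from_list is a hypothetical helper, not a vg command.
cat_from_list() {
    list=$1
    out=$2
    # Drop blank lines, then turn newlines into NULs so that paths
    # containing spaces survive xargs -0.
    grep -v '^$' "$list" | tr '\n' '\0' | xargs -0 cat > "$out"
}
```

With a files.txt listing one graph per line, `cat_from_list files.txt merged.vg` would produce the single input for vg index.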

@ekg
Member

ekg commented Mar 2, 2016

It might be easier to teach vg ids to read a file list. Then you can do the
merge as an ID space unification followed by cat. That said, I won't stop
anyone from making vg merge!

@ekg
Member

ekg commented Mar 10, 2016

@cartoonist Have you managed to resolve this (even in a hacky way as described here)?

The issue is open because this shouldn't need to be scripted out.

@ekg
Member

ekg commented Mar 10, 2016

@cartoonist have you tried building the graph off of the reference FASTA made of the 17200 contigs? It seems like it might just work if it's small. The tutorial focused on the problem of building the graph for a very large genome.

@cartoonist
Contributor Author

Hi @ekg, I was on vacation; sorry for the late reply. I will check and let you know in a few days how I managed to create the graph.

@ekg
Member

ekg commented Mar 30, 2016

I've been testing with the current HEAD and things are going pretty well.
Still some things that need to be documented for handling large graphs.
I'll be updating the wiki to explain.


@adamnovak
Member

Is this still a problem? And is the fact that people may want to operate on more graphs than they can fit on a command line still in scope for vg?

Do we want to change the issue to something like "Make all the VG commands using VGSet know how to read a file of filenames as input"?

@ekg
Member

ekg commented Oct 20, 2016

Makes sense to me.


@adamnovak changed the title from "Unable to create xg index" to "Make all the VG commands using VGSet know how to read a file of filenames as input" on Oct 20, 2016