Evaluation of large input files #45

Closed
hoelzer opened this issue Jan 2, 2020 · 9 comments
Labels: documentation, enhancement, Priority LOW

Comments

@hoelzer
Collaborator

hoelzer commented Jan 2, 2020

This issue is for documentation of the behavior of WtP for large input files. Based on this @replikation might implement FASTA chunking to increase speed of the pipeline.
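To make the chunking idea concrete, here is a minimal sketch in Python of how a multi-FASTA file could be split into fixed-size groups of records. It is not the WtP implementation: the function names `read_fasta` and `chunk_records` and the chunk size are assumptions made for illustration. Inside a Nextflow pipeline, the built-in `splitFasta` operator may be a more natural fit.

```python
import io
from typing import Iterable, Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA stream, one record at a time."""
    header = None
    seq_lines: List[str] = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_lines)
            header = line[1:]
            seq_lines = []
        elif header is not None and line:
            seq_lines.append(line)
    if header is not None:
        yield header, "".join(seq_lines)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into lists holding at most chunk_size records each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the last, possibly smaller, chunk
        yield chunk


# Tiny in-memory example; a real run would pass an open file handle instead.
example = io.StringIO(">a\nAC\nGT\n>b\nGG\n>c\nTT\n")
for i, chunk in enumerate(chunk_records(read_fasta(example), chunk_size=2), start=1):
    print(i, [header for header, _ in chunk])
```

Because both functions are generators, only one chunk is held in memory at a time, which matters for inputs with more than a million contigs.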

case 1, aquadiva sample

bsub -n 4 -M 8.0G -R "rusage[mem=8.0G]" "nextflow run phage.nf --fasta /homes/mhoelzer/data/calc/aquadiva_kaiju/spades/H14_0_2_1/scaffolds.fasta --output /homes/mhoelzer/data/calc/aquadiva_kaiju/wtp/H14_0_2_1 -profile ebi --mp -resume"

(excluded metaphinder because of a previous issue)

  • 555 MB
  • 1,152,661 contigs
  • largest contigs:
    • 46863 NODE_1_length_46863_cov_23.219813
    • 40744 NODE_2_length_40744_cov_42.929662
    • 26449 NODE_3_length_26449_cov_28.914564
    • 23605 NODE_4_length_23605_cov_11.299066
    • 22256 NODE_5_length_22256_cov_27.842575
    • 22209 NODE_6_length_22209_cov_8.825675
    • 21099 NODE_7_length_21099_cov_6.212270
    • 20898 NODE_8_length_20898_cov_8.155352
    • 20440 NODE_9_length_20440_cov_6.053373
    • 18557 NODE_10_length_18557_cov_6.418549

started: Dec 31 12:50

Tools completed

  • virsorter: ~1h
  • virfinder: ~48h
  • deepvirfinder: running...
  • marvel: running...
  • pprmeta: ~8h

The job was aborted by the cluster after 2.5 days for an unclear reason. No stats for deepvirfinder and marvel.

hoelzer added the documentation, enhancement, and Priority LOW labels Jan 2, 2020
@hoelzer
Collaborator Author

hoelzer commented Jan 8, 2020

[Image: Screenshot 2020-01-08 10 06 50]

Large input files (500 MB to 1 GB) work with virsorter and pprmeta. I will test the other tools.

However, the r_plot process takes a long time and, for some files, does not seem to terminate at all. Besides, the visualization is not really useful for large input sets, so I will deactivate it in a separate branch for my test runs.

@hoelzer
Collaborator Author

hoelzer commented Jan 9, 2020

Update:
Virfinder finished after 18h for one of the large input FASTA files (~500 MB).

Now testing Marvel

@replikation
Owner

replikation commented Jan 9, 2020

So you dropped all 29 metagenomes, with > 1 million contigs per sample, on it? :D
Okay, interesting... I'll try out a few things and report back.

@hoelzer
Collaborator Author

hoelzer commented Jan 9, 2020

yeah... I thought the EBI cluster is huge so just go for it WtP! :D

At the moment I am just running one sample with the -resume option, adding more and more tools (currently Marvel is running).

@replikation
Owner

Marvel is super difficult to implement here, as it analyses "bins" by default. So I need to split each contig into a separate FASTA file, and you have 1-2 million contigs per file.
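As a rough illustration of the per-contig split described here, the following Python sketch writes every record of a multi-FASTA file to its own file. The function name `split_contigs` and the output naming scheme (running index plus a sanitized contig name) are assumptions, not what WtP does.

```python
import re
from pathlib import Path
from typing import List


def split_contigs(fasta_path: str, out_dir: str) -> List[Path]:
    """Write each record of a multi-FASTA file to a separate FASTA file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    current = None  # handle of the per-contig file currently being written
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    if current is not None:
                        current.close()
                    # Use the first token of the header as the contig name.
                    tokens = line[1:].split()
                    name = tokens[0] if tokens else "contig"
                    safe = re.sub(r"[^A-Za-z0-9._-]", "_", name)
                    # The running index keeps file names unique.
                    path = out / f"{len(written) + 1}_{safe}.fasta"
                    current = open(path, "w")
                    written.append(path)
                if current is not None:
                    current.write(line)
    finally:
        if current is not None:
            current.close()
    return written
```

With 1-2 million contigs this produces as many files in a single directory, which many file systems handle poorly, so spreading the output across subdirectories or skipping the tool for such inputs may be the more practical option.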

@hoelzer
Collaborator Author

hoelzer commented Jan 9, 2020

Uff, I see. Maybe skip Marvel if too many contigs are provided? I mean, it's just due to how Marvel is implemented and not really an issue of WtP.

@replikation
Owner

Yep, I was thinking about an "autoconfig" depending on the "assemblystats" of the input,
e.g. too many contigs -> deactivate tool x and y; contig too large -> deactivate deepvirfinder, etc.
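The assembly statistics such an autoconfig would need can be gathered in a single pass over the input. Below is a minimal Python sketch; the function name `assembly_stats` and the returned keys are assumptions chosen for illustration.

```python
import io
from typing import Dict, TextIO


def assembly_stats(handle: TextIO) -> Dict[str, int]:
    """Count contigs, total bases and the longest contig in a FASTA stream."""
    n_contigs = 0
    total_bp = 0
    longest_bp = 0
    current_bp = 0  # length of the contig currently being read
    for line in handle:
        if line.startswith(">"):
            if n_contigs:
                longest_bp = max(longest_bp, current_bp)
            n_contigs += 1
            current_bp = 0
        elif n_contigs:
            n = len(line.strip())
            current_bp += n
            total_bp += n
    longest_bp = max(longest_bp, current_bp)  # account for the last contig
    return {"contigs": n_contigs, "total_bp": total_bp, "longest_bp": longest_bp}


# Tiny in-memory example; a real run would pass an open file handle instead.
print(assembly_stats(io.StringIO(">a\nACGT\nAC\n>b\nGG\n")))
```

The function streams the file line by line, so memory use stays constant even for inputs of several hundred megabytes.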

@hoelzer
Collaborator Author

hoelzer commented Jan 9, 2020

I think that is a good idea, and WtP should report back to the user what was deactivated and why.
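One way to pair the deactivation with a user-facing explanation is to have the decision step return both the tools that stay enabled and a message per deactivated tool. The sketch below is hypothetical: the thread gives no thresholds, so the limits in `MAX_CONTIGS` and `MAX_CONTIG_LENGTH`, the function name `autoconfig`, and the `stats` keys are placeholders for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical limits; the discussion does not state any concrete values.
MAX_CONTIGS: Dict[str, int] = {"marvel": 100_000}
MAX_CONTIG_LENGTH: Dict[str, int] = {"deepvirfinder": 2_000_000}


def autoconfig(stats: Dict[str, int], tools: List[str]) -> Tuple[List[str], List[str]]:
    """Return (enabled tools, messages explaining each deactivation)."""
    enabled: List[str] = []
    messages: List[str] = []
    for tool in tools:
        contig_limit = MAX_CONTIGS.get(tool)
        length_limit = MAX_CONTIG_LENGTH.get(tool)
        if contig_limit is not None and stats["contigs"] > contig_limit:
            messages.append(
                f"{tool} deactivated: {stats['contigs']} contigs exceed the limit of {contig_limit}"
            )
        elif length_limit is not None and stats["longest_bp"] > length_limit:
            messages.append(
                f"{tool} deactivated: longest contig ({stats['longest_bp']} bp) "
                f"exceeds the limit of {length_limit} bp"
            )
        else:
            enabled.append(tool)
    return enabled, messages


# Numbers taken from case 1 above: contig count and largest contig length.
stats = {"contigs": 1_152_661, "longest_bp": 46_863}
enabled, messages = autoconfig(stats, ["virsorter", "marvel", "deepvirfinder"])
for message in messages:
    print(message)
```

Keeping the limits in plain dictionaries means they can be tuned, or loaded from a config file, without touching the decision logic.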

@replikation
Owner

The information in this issue is relevant for #47.
