Running on large dataset #85

Closed
NicolasNaepflin opened this issue Feb 27, 2023 · 2 comments

@NicolasNaepflin

Hi Sion,
Thanks for developing this tool! I have been using it for a while now on smaller datasets (< 1000 genomes) without issues and it has been very useful.

Recently I have been looking into processing larger (~10,000 genomes) and potentially more diverse datasets.

Do you have any advice on, or experience with, processing large datasets? (e.g. are there other options to improve the run time apart from increasing the number of threads/cores and using DIAMOND instead of BLAST?)

Additionally, for more diverse genomes such as the Prochlorococcus example in your original publication, you used an MCL inflation value of 6. As far as I know, larger inflation parameters tend to produce a more fine-grained clustering. Was there any benchmarking (or other tests) performed to choose this inflation value?

Thank you in advance

Nicolas

@SionBayliss
Owner

Hi Nicolas,

Thanks for using it!

PIRATE has been used successfully on very large datasets (>25,000 genomes). I would suggest that you:

1/ Check the genomes for quality. One poor quality genome can have a detrimental effect on the clustering and especially the paralog identification and classification.
2/ Start with a much smaller subset of your most diverse samples so that you can pick a range of thresholds (--steps) that accurately captures the diversity in your collection. You could also experiment with inflation values here to ensure sensible clusters are produced. I am afraid I don't have any tips for selecting an MCL inflation value for you :(
3/ Don't run it with gene alignment, it will take ages to finish and can be run separately or on genes of interest afterwards.
4/ You can also run it with paralog detection off (--para-off) on the initial run as this can take a long time to complete. It can then be rerun with paralog detection only, using the --pan-off option, once it has finished clustering at least once. You WILL need to keep intermediate files on each run for this to work (-z 2). I would test the workflow on a smaller subset so that you don't put the wrong options in on your full set and remove intermediate files or have to reprocess everything :)
5/ Throw as many cores as you can at it.
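
The two-pass workflow in points 3/ and 4/ can be sketched as a small Python helper that only builds the two command lines, so they can be checked on a small subset before anything is launched on the full set. This is a minimal sketch, not the author's script: `--steps`, `--para-off`, `--pan-off` and `-z 2` come from the advice above, while `-i`, `-o` and `-t` (input directory, output directory, threads) are assumed option names that should be confirmed against `PIRATE -h`. The directory names and threshold values are hypothetical, and reusing the same output directory for the second pass is also an assumption.

```python
import shlex


def pirate_two_stage_commands(input_dir, output_dir, threads=8, steps=None):
    """Build the two PIRATE invocations for a clustering pass and a paralog pass."""
    # Options shared by both passes. "-z 2" keeps intermediate files, which the
    # second pass needs. "-i", "-o" and "-t" are ASSUMED option names for input,
    # output and threads; verify them with `PIRATE -h` for your version.
    common = ["PIRATE", "-i", input_dir, "-o", output_dir,
              "-t", str(threads), "-z", "2"]
    if steps:
        # Thresholds are passed as one comma-separated value.
        common += ["--steps", ",".join(str(s) for s in steps)]
    # No gene alignment option is added to either pass (point 3/).
    # Pass 1: clustering only, paralog detection switched off.
    first = common + ["--para-off"]
    # Pass 2: paralog detection only, reusing the earlier clustering.
    second = common + ["--pan-off"]
    return first, second


if __name__ == "__main__":
    # Hypothetical paths and thresholds, for illustration only.
    commands = pirate_two_stage_commands(
        "gff_subset", "pirate_out", threads=16, steps=[50, 70, 90, 95]
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands rather than running them makes it easy to review the options on a test subset first, which is the safeguard suggested in point 4/.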

I hope that helps,
S

@NicolasNaepflin
Author

Hi Sion,

Thanks for the quick input! I will let you know how it works for me.
