Running on large dataset #85
Hi Nicolas,
Thanks for using it! PIRATE has been successfully used on very large datasets (>25,000 genomes). I would suggest that you:
1/ Check the genomes for quality. A single poor-quality genome can have a detrimental effect on the clustering, and especially on paralog identification and classification.
I hope that helps,
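As a rough illustration of the quality check suggested above, here is a minimal Python sketch that computes basic assembly statistics (contig count, total length, N50) and flags genomes that deviate strongly from the rest of the dataset. It is not part of PIRATE. The function names `assembly_stats` and `flag_outliers` and the thresholds `max_contig_ratio` and `length_tolerance` are hypothetical and would need tuning for a given species.

```python
from statistics import median


def assembly_stats(fasta_text):
    """Return contig count, total length and N50 for FASTA-formatted text."""
    lengths = []
    current = 0
    in_record = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if in_record:
                lengths.append(current)
            current = 0
            in_record = True
        elif line and in_record:
            current += len(line)
    if in_record:
        lengths.append(current)

    total = sum(lengths)
    # N50: length of the contig at which the running sum over contigs,
    # sorted longest first, reaches at least half of the total length.
    running = 0
    n50 = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(lengths), "total_length": total, "n50": n50}


def flag_outliers(stats_by_genome, max_contig_ratio=3.0, length_tolerance=0.2):
    """Flag genomes that look unusual relative to the dataset medians.

    Both thresholds are assumed values for illustration only.
    """
    med_len = median(s["total_length"] for s in stats_by_genome.values())
    med_contigs = median(s["contigs"] for s in stats_by_genome.values())
    flagged = []
    for name, s in stats_by_genome.items():
        too_fragmented = s["contigs"] > max_contig_ratio * max(med_contigs, 1)
        odd_length = abs(s["total_length"] - med_len) > length_tolerance * med_len
        if too_fragmented or odd_length:
            flagged.append(name)
    return sorted(flagged)
```

Flagged genomes are only candidates for manual inspection. Whether to drop them before clustering is a judgment call that depends on the dataset.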
Hi Sion,
Thanks for the quick input! I will let you know how it works for me.
Hi Sion,
Thanks for developing this tool! I have been using it for a while now on smaller datasets (< 1000 genomes) without issues, and it has been very useful.
Recently I have been looking into processing larger (~10,000 genomes) and potentially also more diverse datasets.
Do you have any advice or experience with processing large datasets? (E.g., are there other options to improve the run time apart from increasing the number of threads/cores and using DIAMOND instead of BLAST?)
Additionally, for more diverse genomes such as the Prochlorococcus example in your original publication, you used an MCL inflation value of 6. As far as I know, larger inflation parameters tend to produce a more fine-grained clustering. Was there any benchmarking (or other tests) performed to choose this inflation value?
Thank you in advance
Nicolas