hecatomb's same output files from the same dataset are with different sizes #35
Comments
Hi Leran,
Some additional data: the same 134 samples running on HTCF (MMSEQS_AA_PRIMARY step): I don't think the second seqtable will be added to after this step.
It seems like the difference is mainly arising during clustering:
Pathogen: more p08_remove_exact_dups.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log
HTCF (slurm): more p08_remove_exact_dups.M721_I9060_34044_Parkes_IBD_2034A_25_11_20_NEB_46_TCCCGAAT_S10.log
This may well be due to some changes we made to the clustering parameters for linclust. There was some toying with these settings over the past month, so if you ran the Pathogen run a while back and the HTCF run more recently, you would likely get different results. Can you check the cluster settings for each run? They should be in your config file, but may be in the rule file depending on which version you have and when Mike made updates. A difference that large is more likely explained by a major setting change than by a compute error.
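One way to make that comparison concrete is to parse the clustering arguments from each run and list only the flags that differ. The sketch below is a minimal illustration, not part of hecatomb: the helper names `parse_flags` and `diff_flags` are hypothetical, it assumes every flag takes exactly one numeric value, and the two argument strings are the ones quoted later in this thread.

```python
import shlex


def parse_flags(arg_string):
    """Parse '--cov-mode 1 -c 0.7' into {'--cov-mode': 1.0, '-c': 0.7}.

    Assumption: every flag is followed by exactly one numeric value.
    """
    tokens = shlex.split(arg_string)
    return {name: float(value) for name, value in zip(tokens[0::2], tokens[1::2])}


def diff_flags(old, new):
    """Return {flag: (old_value, new_value)} for flags whose values differ.

    None means the flag was not set in that run.
    """
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Argument strings as quoted in this thread.
original = parse_flags("-c .97 --cov-mode 0")
current = parse_flags("--cov-mode 1 -c 0.7 --min-seq-id 0.95")

for flag, (before, after) in diff_flags(original, current).items():
    print(flag, before, "->", after)
```

A flag reported with `None` on one side was absent from that run, so MMseqs2 would have used its own default there.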
Here are the differences:
HTCF:
NOTE:
Well, dropping from 0.97 to 0.7 is going to have the greatest effect and is likely way lower than what we intend. This should be set back to 0.97 or 0.95. You can read about alignment coverage modes here: https://github.com/soedinglab/MMseqs2/wiki#how-to-set-the-right-alignment-coverage-to-cluster
@mroach-awri can we go back to my original settings as the default for now?
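Yes, I think I pushed my settings to the last build. The original default (-c .97 --cov-mode 0) I think specifies 97% residue matches in the longer sequence, which will only cluster sequences that are almost identical end to end, so most clusters are n=1. The current settings (--cov-mode 1 -c 0.7 --min-seq-id 0.95) specify 70% alignment coverage of the member sequence by the rep sequence at 95% identity. These are the settings I'll probably be using, as I want to maximize runtime performance.
I'll be pushing a new build soon; did you want the original defaults or did you have a specific coverage and seq identity in mind?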
Hi Mike,
We need the original settings. Could the faster settings be a switch?
Kathie
On Saturday, October 30, 2021 at 5:48 PM, beardymcjohnface wrote (Re: [shandley/hecatomb] hecatomb's same output files from the same dataset are with different sizes (Issue #35)):
Yes, I think I pushed my settings to the last build. The original default (-c .97 --cov-mode 0) I think specifies 97% residue matches in the longer sequence, which will only cluster sequences that are almost identical end to end, so most clusters are n=1. The current settings (--cov-mode 1 -c 0.7 --min-seq-id 0.95) specify 70% alignment coverage of the member sequence by the rep sequence at 95% identity. These are the settings I'll probably be using, as I want to maximize runtime performance.
I'll be pushing a new build soon; did you want the original defaults or did you have a specific coverage and seq identity in mind?
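To see why the two settings give such different cluster counts, the sketch below applies both to a single alignment. It is a simplified reading of the MMseqs2 coverage modes, not MMseqs2's actual implementation: it assumes mode 0 compares the aligned length with the longer of the two sequences and mode 1 compares it with the target only, and it treats the cluster member as the target. The function names and the example lengths are hypothetical.

```python
def passes_coverage(aln_len, query_len, target_len, c, cov_mode):
    """Simplified coverage test (an assumption, not MMseqs2 source code)."""
    if cov_mode == 0:
        # Bidirectional: the alignment must cover fraction c of the longer sequence.
        return aln_len / max(query_len, target_len) >= c
    if cov_mode == 1:
        # Target coverage: the alignment must cover fraction c of the target.
        return aln_len / target_len >= c
    raise ValueError("only cov_mode 0 and 1 are sketched here")


def would_cluster(aln_len, query_len, target_len, seq_id, c, cov_mode, min_seq_id=0.0):
    """True if the pair meets both the identity and the coverage threshold."""
    return seq_id >= min_seq_id and passes_coverage(
        aln_len, query_len, target_len, c, cov_mode
    )


# Hypothetical pair: a 300-residue representative (query) and a 210-residue
# member (target), aligned over all 210 member residues at 96% identity.
pair = dict(aln_len=210, query_len=300, target_len=210, seq_id=0.96)

original = would_cluster(**pair, c=0.97, cov_mode=0)
current = would_cluster(**pair, c=0.7, cov_mode=1, min_seq_id=0.95)
print("original settings cluster the pair:", original)
print("current settings cluster the pair:", current)
```

Under these assumptions the shorter sequence stays in its own cluster with the original settings, because the alignment covers only 70% of the longer sequence, but it is absorbed with the current settings. That is consistent with the smaller output tables reported for the more recent run.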
Running: Leran
Hi,
My colleague and I are running hecatomb on the same dataset, but the same output tables that each of us generated have different numbers of rows or sequences:
Mine:
My colleague:
We are not sure whether these steps have finished, because we both failed at the "sankey_diagram" step.
I wanted to check the .err files in the LOG folder to see whether the steps that generated those files had finished, but I couldn't tell which folders are the right ones to look in.
I think it would be very helpful if a final .log file saying "The entire hecatomb pipeline has successfully finished!" were generated after the whole pipeline is done.
Thanks!
Leran
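Until such a completion marker exists, one rough way to narrow down which steps to inspect is to list every non-empty .err file under the log folder. The sketch below is only an illustration: the function name `nonempty_err_files` is hypothetical, the folder layout is an assumption, and a non-empty .err file does not by itself mean a step failed, since many tools write progress messages to stderr.

```python
from pathlib import Path


def nonempty_err_files(log_dir):
    """Return the sorted paths of all .err files under log_dir that are not empty."""
    root = Path(log_dir)
    return sorted(
        path
        for path in root.rglob("*.err")  # searches subfolders recursively
        if path.is_file() and path.stat().st_size > 0
    )
```

Calling `nonempty_err_files("LOG")` from the run directory (assuming that is where the logs live) returns a list of paths to open, starting with the ones for the steps that produced the mismatched tables.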