'uc' removed from deflines #557

Closed
chasemc opened this issue Apr 21, 2022 · 2 comments

chasemc commented Apr 21, 2022

Expected Behavior

The two fasta files depicted below are identical except for the deflines:

pass.fasta

>zzsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>zzsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA

fail.fasta

>ucsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>ucsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA

Current Behavior / Steps to Reproduce (for bugs)

Running easy-cluster on these two files:

# rm was run, but commented out for placing on github
# rm -rf ./tmp

mmseqs \
    easy-cluster \
    fail.fasta \
    'mmseqs2_fail' \
    ./tmp \
    --threads 24

# rm was run, but commented out for placing on github
# rm -rf ./tmp

mmseqs \
    easy-cluster \
    pass.fasta  \
    'mmseqs2_pass' \
    ./tmp \
    --threads 24

results in the correct output for mmseqs2_pass_cluster.tsv:

zzsomethingelse	zzsomethingelse
zzsomethingelse	zzsomething

but strips the leading 'uc' from the deflines in mmseqs2_fail_cluster.tsv:

somethingelse	somethingelse
somethingelse	something

This seems to be the case for any defline that starts with 'uc'.
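
A quick way to confirm the mismatch is to list every identifier in the cluster TSV that does not occur as a defline in the input FASTA. The following is a minimal sketch, not part of MMseqs2; `missing_ids` is a made-up helper name, and it assumes the identifier is the defline text up to the first whitespace.

```shell
# Hypothetical helper: print identifiers found in a cluster TSV ($2)
# that are not defline identifiers in the input FASTA ($1).
missing_ids() {
    awk -F '\t' '
        # First file (FASTA): remember each identifier (text after ">"
        # up to the first space or tab).
        NR == FNR {
            if (/^>/) {
                id = substr($0, 2)
                sub(/[ \t].*/, "", id)
                seen[id] = 1
            }
            next
        }
        # Second file (TSV): report each unknown identifier once.
        {
            for (i = 1; i <= NF; i++) {
                if (!($i in seen) && !($i in printed)) {
                    printed[$i] = 1
                    print $i
                }
            }
        }
    ' "$1" "$2"
}
```

Called as `missing_ids fail.fasta mmseqs2_fail_cluster.tsv`, it should print the altered identifiers; for the pass files it should print nothing.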


The FASTA files also have duplicate defline entries, where one of the duplicates doesn't contain a sequence:

mmseqs2_fail_all_seqs.fasta

>somethingelse
>ucsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>ucsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA

mmseqs2_pass_all_seqs.fasta

>zzsomethingelse
>zzsomethingelse
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
>zzsomething
MPELRRVLANGVELNVALCGSGPAVLLLHGFPHTWELWTDVMADLSGRYRVIAPDLRGFGASGRAASGYDAGTLAEDAAALLAALGVSSATVVGIDAGTAPAFLLALRHPGLVRRLVVMESLLGRLPGAEDFLAEGPPWWFGFHSAAPSLAETVLEGHEAAYVDWFLSAGTLGDGVRPALRDAFVRAYTGRQALSCAFSYYRALPKSAVQIEQAVATARLTVPTMALGARPVGAALERQLRPVTDDLTGHVIDDCGHIIPLHRPHALLALLHPFLAGEDAKAA
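
To see which deflines in an `_all_seqs.fasta` output are not followed by a sequence, a small awk filter is enough. This is only a sketch for inspecting the files; `list_bare_deflines` is a made-up helper name, not an MMseqs2 command.

```shell
# Hypothetical helper: print every defline that is immediately followed
# by another defline (or by end of file) instead of a sequence line.
list_bare_deflines() {
    awk '
        /^>/ {
            # A pending defline with no sequence in between is "bare".
            if (prev != "") print prev
            prev = $0
            next
        }
        # Any non-blank, non-defline line counts as sequence data.
        NF { prev = "" }
        END { if (prev != "") print prev }
    ' "$1"
}
```

On the files above, `list_bare_deflines mmseqs2_fail_all_seqs.fasta` should print `>somethingelse`, while the pass file should give `>zzsomethingelse`.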

MMseqs Output (for bugs)

https://gist.github.com/chasemc/c0cccd804ac0ff78291e43ae10837c42
https://gist.github.com/chasemc/d8157a581c833406f15442e8b9ee4e81

Your Environment

Conda installed: MMseqs2 Version: 13.45111
Happy to give system info if needed

chasemc commented Apr 25, 2022

This is beyond me, but it seems it might stem from here:

{ "uc", 2, 0}, // Uniclust

Is there an option to skip checking/removing these identifiers?
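
Until there is such an option, one possible workaround is to rename the deflines before clustering and undo the renaming afterwards. The sketch below is untested against MMseqs2 and rests on an assumption: that a placeholder starting with 'zz' is left alone, as the pass.fasta result above suggests. `PREFIX`, `add_prefix`, and `strip_prefix` are made-up names.

```shell
# Hypothetical placeholder; ASSUMED not to be altered by MMseqs2
# (deflines starting with "zz" came through intact in this report).
PREFIX='zz_'

# Prepend PREFIX to every defline identifier; sequence lines are untouched.
add_prefix() {
    sed "s/^>/>${PREFIX}/" "$1"
}

# Remove PREFIX from the start of every tab-separated field of a cluster TSV.
strip_prefix() {
    awk -F '\t' -v OFS='\t' -v p="$PREFIX" '{
        for (i = 1; i <= NF; i++)
            if (index($i, p) == 1) $i = substr($i, length(p) + 1)
        print
    }' "$1"
}

# Intended use (same easy-cluster call as in the report, new file names):
#   add_prefix fail.fasta > renamed.fasta
#   mmseqs easy-cluster renamed.fasta 'mmseqs2_renamed' ./tmp --threads 24
#   strip_prefix mmseqs2_renamed_cluster.tsv > restored_cluster.tsv
```

This only sidesteps the prefix handling; it does not change how MMseqs2 itself parses identifiers.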

@milot-mirdita
Member

I'll update #565 when we have a solution.
