Segmentation fault when clustering MERC using easy-linclust #323

Open
arglog opened this issue Jun 25, 2020 · 11 comments

@arglog

arglog commented Jun 25, 2020

I tried to run mmseqs easy-linclust on the MERC dataset (from http://gwdu111.gwdg.de/~compbiol/plass/2018_08/) but got a segmentation fault.

Expected Behavior

Normal output of mmseqs easy-linclust

Current Behavior

The run dies with a segmentation fault partway through, right after the filterdb step.

Steps to Reproduce (for bugs)

> wget http://gwdu111.gwdg.de/~compbiol/plass/2018_08/MERC.fasta.gz
> mmseqs easy-linclust MERC.fasta.gz MERC /export/tmp/MERC -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G 

MMseqs Output (for bugs)

Tmp /export/tmp/MERC folder does not exist or is not a directory.
createdb ../MERC.fasta.gz /export/tmp/MERC/4233864688410091672/input --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 
Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
=================================================================================================== 292 Mio. sequences processed
=============
Time for merging to input_h: 0h 0m 40s 64ms
Time for merging to input: 0h 0m 40s 130ms
Database type: Aminoacid
Time for processing: 0h 12m 9s 179ms
Tmp /export/tmp/MERC/4233864688410091672/clu_tmp folder does not exist or is not a directory.
kmermatcher /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.5 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 500G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 
kmermatcher /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.5 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 500G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 
Database size: 292137902 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X) 
Generate k-mers list for 1 split
[=================================================================] 292.14M 36s 571ms
Sort kmer 0h 0m 3s 87ms
Sort by rep. sequence 0h 0m 2s 827ms
Time for fill: 0h 0m 16s 310ms
Time for merging to pref: 0h 0m 58s 394ms
Time for processing: 0h 3m 54s 379ms
rescorediagonal /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3 
[=================================================================] 292.14M 2m 8s 805ms
Time for merging to pref_rescore1: 0h 2m 40s 361ms
Time for processing: 0h 5m 54s 815ms
clust /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_rescore1 /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pre_clust --cluster-mode 2 --max-iterations 1000 --similarity-type 2 --threads 96 --compressed 0 -v 3 
Clustering mode: Greedy
Total time: 0h 1m 7s 208ms
Size of the sequence database: 292137902
Size of the alignment database: 292137902
Number of clusters: 245753321
Writing results 0h 1m 30s 550ms
Time for merging to pre_clust: 0h 1m 31s 28ms
Time for processing: 0h 5m 19s 116ms
createsubdb /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/order_redundancy /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/input_step_redundancy -v 3 --subdb-mode 1 
Time for merging to input_step_redundancy: 0h 0m 34s 71ms
Time for processing: 0h 1m 29s 221ms
createsubdb /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/order_redundancy /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter1 -v 3 --subdb-mode 1 
Time for merging to pref_filter1: 0h 0m 45s 806ms
Time for processing: 0h 1m 48s 52ms
filterdb /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter1 /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter2 --filter-file /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/order_redundancy --threads 96 --compressed 0 -v 3 
Filtering using file(s)
[=================================================================] 245.75M 2m 9s 682ms
Time for merging to pref_filter2: 0h 2m 9s 511ms
Time for processing: 0h 6m 15s 7ms
Segmentation fault (core dumped)
Error: Ungapped alignment step died
Error: Search died

Context

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): dc054792d1b1d091380638a712ee7566aba2bb38
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): self-compiled
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: cmake 3.10.2
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
  • Operating system and version: Ubuntu 18.04
@milot-mirdita
Member

milot-mirdita commented Jun 25, 2020

I tried to reconstruct the command that probably crashed. Could you run it again inside a debugger to recover the backtrace? I have no clue what could have gone wrong so early in the command invocation (the running module had no output at all before it crashed).

  1. Run the following command
gdb --args mmseqs rescorediagonal /export/tmp/MERC/4233864688410091672/input_step_redundancy /export/tmp/MERC/4233864688410091672/input_step_redundancy /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter2 /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.9 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3
  2. Wait for the gdb command prompt
  3. Type r (for run) and press Enter
  4. Wait for the crash
  5. Type bt (for backtrace) and press Enter
  6. Copy the output and paste it here
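
The interactive steps above can also be run non-interactively. A minimal sketch, assuming gdb is installed: gdb's -batch and -ex options start the program and print the backtrace in one go. The gdb_backtrace function name and the GDB override variable are hypothetical helpers, not part of MMseqs2.

```shell
# gdb_backtrace CMD [ARGS...]: run CMD under gdb in batch mode and print a
# backtrace once it stops (for example on SIGSEGV).
# GDB is a hypothetical override for the debugger binary; it defaults to gdb.
gdb_backtrace() {
    "${GDB:-gdb}" -batch -ex run -ex bt --args "$@"
}
```

It would be called as gdb_backtrace mmseqs rescorediagonal followed by the same arguments as in step 1. If the program exits without crashing, bt has no stack to print.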

Thanks a lot for reporting the issue.

@arglog
Author

arglog commented Jun 26, 2020

I re-ran from the very beginning (because the temp files, e.g., input_step_redundancy, seem to have been removed automatically). However, there is no backtrace output.

Time for merging to pref_filter1: 0h 0m 45s 203ms
Time for processing: 0h 1m 56s 417ms
filterdb /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter1 /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter2 --filter-file /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/order_redundancy --threads 96 --compressed 0 -v 3

Filtering using file(s)
[=================================================================] 100.00% 245.75M 2m 6s 123ms
Time for merging to pref_filter2: 0h 2m 13s 365ms
Time for processing: 0h 6m 17s 259ms
rescorediagonal /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter2 /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.9 -a 0
--cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

Segmentation fault (core dumped)                                  ] 0.00% 1 eta -
Error: Ungapped alignment step died
Error: Search died
[Inferior 1 (process 161684) exited with code 01]
(gdb) bt
No stack.

Let me know if there is something else I can test.

@milot-mirdita
Member

Please run only the rescorediagonal module in GDB or it won't be able to catch the crash:

gdb --args mmseqs rescorediagonal /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter2 /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.9 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

@arglog
Author

arglog commented Jun 26, 2020

Good to know. Here is the output:

Thread 1 "mmseqs" received signal SIGSEGV, Segmentation fault.
0x00005555555c0446 in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) [clone ._omp_fn.0] ()
(gdb) bt
#0  0x00005555555c0446 in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) [clone ._omp_fn.0] ()
#1  0x00007ffff726fecf in GOMP_parallel () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00005555555bc7de in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) ()
#3  0x00005555555c2014 in rescorediagonal(int, char const**, Command const&) ()
#4  0x00005555555ac4a5 in runCommand(Command*, int, char const**) ()
#5  0x000055555559dfbc in main ()
(gdb)

@milot-mirdita
Member

Could you recompile MMseqs2 with -DCMAKE_BUILD_TYPE=Debug and run that again? Thanks a lot for investigating the issue.

@arglog
Author

arglog commented Jun 26, 2020

Yes. This is the gdb output from a Debug build.

Thread 156 "mmseqs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffd0fff7700 (LWP 230283)]
0x00005555556f98a4 in DistanceCalculator::computeSubstitutionDistance<char> (seq1=0x7ff249252000 "\037\213\b\b\217-h[",
    seq2=0x7ff249252000 "\037\213\b\b\217-h[", length=295, subMat=0x555555d7a8a0, globalAlignment=false)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:29
29                      int curr = subMat[static_cast<int>(seq1[pos])][static_cast<int>(seq2[pos])];
(gdb) bt
#0  0x00005555556f98a4 in DistanceCalculator::computeSubstitutionDistance<char> (seq1=0x7ff249252000 "\037\213\b\b\217-h[",
    seq2=0x7ff249252000 "\037\213\b\b\217-h[", length=295, subMat=0x555555d7a8a0, globalAlignment=false)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:29
#1  0x00005555556f85d3 in DistanceCalculator::ungappedAlignmentByDiagonal<char> (querySeq=0x7ff249252000 "\037\213\b\b\217-h[", querySeqLen=295,
    dbSeq=0x7ff249252000 "\037\213\b\b\217-h[", dbSeqLen=295, diagonal=0, subMat=0x555555d7a8a0, alnMode=1)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:130
#2  0x00005555556f783c in DistanceCalculator::computeUngappedAlignment<char> (querySeq=0x7ff249252000 "\037\213\b\b\217-h[", querySeqLen=295,
    dbSeq=0x7ff249252000 "\037\213\b\b\217-h[", dbSeqLen=295, diagonal=0, subMat=0x555555d7a8a0, alnMode=1)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:107
#3  0x00005555556f311b in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) [clone ._omp_fn.0] ()
    at /export/premium/software/MMseqs2/src/alignment/rescorediagonal.cpp:222
#4  0x00007ffff727895e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#5  0x00007ffff6e326db in start_thread (arg=0x7ffd0fff7700) at pthread_create.c:463
#6  0x00007ffff6b5b88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

@milot-mirdita
Member

I think I know what's going on. MERC probably contains * characters that mark gene starts and stop codons. The normal MMseqs2 search workflows handle this case correctly, but Linclust uses a special sequence-reading mode for best performance.

You can get around the issue by doing something like:

zcat MERC.fasta.gz | tr -d '*' | mmseqs easy-linclust stdin MERC /export/tmp/MERC -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G

(I'm not 100% sure that the FASTA headers are free of * characters; this command removes every *, including any in the headers.)
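
To avoid touching the headers, the filter can be restricted to sequence lines. A minimal sketch, assuming the usual FASTA convention that header lines start with >; strip_stops is a hypothetical helper name, not an MMseqs2 command.

```shell
# strip_stops: read FASTA on stdin and remove every '*' from sequence lines,
# passing header lines (those starting with '>') through unchanged.
strip_stops() {
    awk '/^>/ { print; next } { gsub(/\*/, ""); print }'
}
```

It would take the place of tr -d '*' in the pipeline above, with the zcat and mmseqs easy-linclust parts left unchanged.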

@arglog
Author

arglog commented Jun 26, 2020

That was actually my initial guess 😄 I got the same error for MERC and metaclust_nr, and I found that both contain stop codon symbols. I really appreciate your help in investigating this issue. Let me remove the * symbols and run it again.

@milot-mirdita
Member

Okay, I think that was not actually the issue, since we should already handle the stop codon. It seems that gzip reading is currently broken in Linclust.

If you decompress MERC first, it should work.
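
The gzip diagnosis is consistent with the Debug backtrace earlier in this thread: the sequence buffer starts with \037\213, which is the gzip magic number (0x1f 0x8b). A minimal sketch for checking whether an input file is gzip-compressed before handing it to Linclust, assuming head and od are available; is_gzip is a hypothetical helper name.

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 0x1f 0x8b.
is_gzip() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Example (commented out): decompress first, then cluster the plain FASTA.
# if is_gzip MERC.fasta.gz; then gzip -dc MERC.fasta.gz > MERC.fasta; fi
```

After decompressing, the original easy-linclust command can be run with MERC.fasta in place of MERC.fasta.gz.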

@milot-mirdita
Member

The latest commit should fix the issue.

@arglog
Author

arglog commented Jun 26, 2020

This is good to know! Then I do not have to remove the * symbols manually from every file.

Just a follow-up question about the metagenomics FASTA files: I can understand why a * symbol appears at the end of a sequence (it's the stop codon), but why do some sequences end with two consecutive asterisks, and why do others start with a single asterisk? In addition, I found that some amino acids in MERC are lower-case. Does that have a special meaning?

Thanks!
