Segmentation fault when clustering MERC using easy-linclust #323

arglog · 2020-06-25T21:43:25Z

I tried to run mmseqs easy-linclust on the MERC dataset (from http://gwdu111.gwdg.de/~compbiol/plass/2018_08/) but got a segmentation fault.

Expected Behavior

Normal output of mmseqs easy-linclust

Current Behavior

Got a segmentation fault partway through the run

Steps to Reproduce (for bugs)

> wget http://gwdu111.gwdg.de/~compbiol/plass/2018_08/MERC.fasta.gz
> mmseqs easy-linclust MERC.fasta.gz MERC /export/tmp/MERC -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G

MMseqs Output (for bugs)

Tmp /export/tmp/MERC folder does not exist or is not a directory.
createdb ../MERC.fasta.gz /export/tmp/MERC/4233864688410091672/input --dbtype 0 --shuffle 1 --createdb-mode 1 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 
Shuffle database cannot be combined with --createdb-mode 0
We recompute with --shuffle 0
Converting sequences
=================================================================================================== 292 Mio. sequences processed
=============
Time for merging to input_h: 0h 0m 40s 64ms
Time for merging to input: 0h 0m 40s 130ms
Database type: Aminoacid
Time for processing: 0h 12m 9s 179ms
Tmp /export/tmp/MERC/4233864688410091672/clu_tmp folder does not exist or is not a directory.
kmermatcher /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.5 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 500G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 
kmermatcher /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size nucl:5,aa:13 --min-seq-id 0.5 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale nucl:0.200,aa:0.000 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 1 -k 0 -c 0.9 --max-seq-len 65535 --hash-shift 67 --split-memory-limit 500G --include-only-extendable 0 --ignore-multi-kmer 0 --threads 96 --compressed 0 -v 3 
Database size: 292137902 type: Aminoacid
Reduced amino acid alphabet: (A S T) (C) (D B N) (E Q Z) (F Y) (G) (H) (I V) (K R) (L J M) (P) (W) (X) 
Generate k-mers list for 1 split
[=================================================================] 292.14M 36s 571ms
Sort kmer 0h 0m 3s 87ms
Sort by rep. sequence 0h 0m 2s 827ms
Time for fill: 0h 0m 16s 310ms
Time for merging to pref: 0h 0m 58s 394ms
Time for processing: 0h 3m 54s 379ms
rescorediagonal /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.9 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3 
[=================================================================] 292.14M 2m 8s 805ms
Time for merging to pref_rescore1: 0h 2m 40s 361ms
Time for processing: 0h 5m 54s 815ms
clust /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_rescore1 /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pre_clust --cluster-mode 2 --max-iterations 1000 --similarity-type 2 --threads 96 --compressed 0 -v 3 
Clustering mode: Greedy
Total time: 0h 1m 7s 208ms
Size of the sequence database: 292137902
Size of the alignment database: 292137902
Number of clusters: 245753321
Writing results 0h 1m 30s 550ms
Time for merging to pre_clust: 0h 1m 31s 28ms
Time for processing: 0h 5m 19s 116ms
createsubdb /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/order_redundancy /export/tmp/MERC/4233864688410091672/input /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/input_step_redundancy -v 3 --subdb-mode 1 
Time for merging to input_step_redundancy: 0h 0m 34s 71ms
Time for processing: 0h 1m 29s 221ms
createsubdb /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/order_redundancy /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter1 -v 3 --subdb-mode 1 
Time for merging to pref_filter1: 0h 0m 45s 806ms
Time for processing: 0h 1m 48s 52ms
filterdb /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter1 /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter2 --filter-file /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/order_redundancy --threads 96 --compressed 0 -v 3 
Filtering using file(s)
[=================================================================] 245.75M 2m 9s 682ms
Time for merging to pref_filter2: 0h 2m 9s 511ms
Time for processing: 0h 6m 15s 7ms
Segmentation fault (core dumped)
Error: Ungapped alignment step died
Error: Search died

Context

Your Environment

Include as many relevant details about the environment you experienced the bug in.

Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): dc054792d1b1d091380638a712ee7566aba2bb38
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): self-compiled
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: cmake 3.10.2
Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
Operating system and version: Ubuntu 18.04


milot-mirdita · 2020-06-25T22:05:04Z

I tried to reconstruct the command that probably crashed. Could you run it again inside a debugger to recover the backtrace? I have no clue what could have gone wrong so early in the command invocation (the running module had no output at all before it crashed).

Run the following command

gdb --args mmseqs rescorediagonal /export/tmp/MERC/4233864688410091672/input_step_redundancy /export/tmp/MERC/4233864688410091672/input_step_redundancy /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_filter2 /export/tmp/MERC/4233864688410091672/clu_tmp/16445679162920043634/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.9 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

wait for the (gdb) command prompt
type r (for run) and press Enter
wait for the crash
type bt (for backtrace) and press Enter
copy the output and paste it here
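
The interactive steps above can also be run in one go. A minimal sketch, assuming gdb is installed; it uses gdb's -batch and -ex options, and the wrapper name run_with_backtrace is hypothetical:

```shell
# Run a command under gdb non-interactively: start it, and if it
# crashes, print the backtrace before gdb exits.
run_with_backtrace() {
    gdb -batch -ex run -ex bt --args "$@"
}

# Example, using the same arguments as the command above:
# run_with_backtrace mmseqs rescorediagonal <queryDB> <targetDB> <prefDB> <outDB> [options]
```

As with the interactive session, this only catches the crash if the crashing module itself is the process started under gdb.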

Thanks a lot for reporting the issue.

arglog · 2020-06-26T01:09:04Z

I re-ran from the very beginning (because it seems the temp files were auto-removed, e.g., input_step_redundancy). However, there is no backtrace output.

Time for merging to pref_filter1: 0h 0m 45s 203ms
Time for processing: 0h 1m 56s 417ms
filterdb /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter1 /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter2 --filter-file /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/order_redundancy --threads 96 --compressed 0 -v 3

Filtering using file(s)
[=================================================================] 100.00% 245.75M 2m 6s 123ms
Time for merging to pref_filter2: 0h 2m 13s 365ms
Time for processing: 0h 6m 17s 259ms
rescorediagonal /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter2 /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.9 -a 0
--cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

Segmentation fault (core dumped)                                  ] 0.00% 1 eta -
Error: Ungapped alignment step died
Error: Search died
[Inferior 1 (process 161684) exited with code 01]
(gdb) bt
No stack.

Let me know if there is something else I can test.

milot-mirdita · 2020-06-26T01:11:30Z

Please run only the rescorediagonal module in GDB or it won't be able to catch the crash:

gdb --args mmseqs rescorediagonal /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/input_step_redundancy /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_filter2 /export/tmp/MERC-gdb/7812673630337556672/clu_tmp/7630568140984029289/pref_rescore2 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 1 --wrapped-scoring 0 --filter-hits 1 -e 0.001 -c 0.9 -a 0 --cov-mode 1 --min-seq-id 0.5 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 96 --compressed 0 -v 3

arglog · 2020-06-26T01:16:18Z

Good to know. Here is the output

Thread 1 "mmseqs" received signal SIGSEGV, Segmentation fault.
0x00005555555c0446 in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) [clone ._omp_fn.0] ()
(gdb) bt
#0  0x00005555555c0446 in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) [clone ._omp_fn.0] ()
#1  0x00007ffff726fecf in GOMP_parallel () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#2  0x00005555555bc7de in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) ()
#3  0x00005555555c2014 in rescorediagonal(int, char const**, Command const&) ()
#4  0x00005555555ac4a5 in runCommand(Command*, int, char const**) ()
#5  0x000055555559dfbc in main ()
(gdb)

milot-mirdita · 2020-06-26T01:24:00Z

Could you recompile MMseqs2 with -DCMAKE_BUILD_TYPE=Debug and run that again? Thanks a lot for investigating the issue.
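
A Debug rebuild could look roughly like the following. This is a sketch, assuming a standard out-of-source CMake build with make and nproc available; the function name build_debug and the build-debug directory name are hypothetical:

```shell
# Configure and compile a Debug build in a separate directory.
# $1: path to an MMseqs2 source checkout.
# The subshell keeps the caller's working directory unchanged.
build_debug() (
    mkdir -p "$1/build-debug" &&
    cd "$1/build-debug" &&
    cmake -DCMAKE_BUILD_TYPE=Debug .. &&
    make -j "$(nproc)"
)

# Example:
# build_debug /path/to/MMseqs2
```

A Debug build keeps symbols and line numbers, so the backtrace should show source files instead of bare addresses.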

arglog · 2020-06-26T01:35:54Z

Yes. This is the gdb output from the Debug build.

Thread 156 "mmseqs" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffd0fff7700 (LWP 230283)]
0x00005555556f98a4 in DistanceCalculator::computeSubstitutionDistance<char> (seq1=0x7ff249252000 "\037\213\b\b\217-h[",
    seq2=0x7ff249252000 "\037\213\b\b\217-h[", length=295, subMat=0x555555d7a8a0, globalAlignment=false)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:29
29                      int curr = subMat[static_cast<int>(seq1[pos])][static_cast<int>(seq2[pos])];
(gdb) bt
#0  0x00005555556f98a4 in DistanceCalculator::computeSubstitutionDistance<char> (seq1=0x7ff249252000 "\037\213\b\b\217-h[",
    seq2=0x7ff249252000 "\037\213\b\b\217-h[", length=295, subMat=0x555555d7a8a0, globalAlignment=false)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:29
#1  0x00005555556f85d3 in DistanceCalculator::ungappedAlignmentByDiagonal<char> (querySeq=0x7ff249252000 "\037\213\b\b\217-h[", querySeqLen=295,
    dbSeq=0x7ff249252000 "\037\213\b\b\217-h[", dbSeqLen=295, diagonal=0, subMat=0x555555d7a8a0, alnMode=1)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:130
#2  0x00005555556f783c in DistanceCalculator::computeUngappedAlignment<char> (querySeq=0x7ff249252000 "\037\213\b\b\217-h[", querySeqLen=295,
    dbSeq=0x7ff249252000 "\037\213\b\b\217-h[", dbSeqLen=295, diagonal=0, subMat=0x555555d7a8a0, alnMode=1)
    at /export/premium/software/MMseqs2/src/alignment/DistanceCalculator.h:107
#3  0x00005555556f311b in doRescorediagonal(Parameters&, DBWriter&, DBReader<unsigned int>&, unsigned long, unsigned long) [clone ._omp_fn.0] ()
    at /export/premium/software/MMseqs2/src/alignment/rescorediagonal.cpp:222
#4  0x00007ffff727895e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#5  0x00007ffff6e326db in start_thread (arg=0x7ffd0fff7700) at pthread_create.c:463
#6  0x00007ffff6b5b88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)

milot-mirdita · 2020-06-26T02:06:46Z

I think I know what's going on. I think MERC contains * characters to mark gene starts and stop codons. In the regular MMseqs2 search and similar modules we handle this case correctly, but Linclust uses a special sequence-reading mode for best performance.

You can get around the issue by doing something like:

zcat MERC.fasta.gz | tr -d '*' | mmseqs easy-linclust stdin MERC /export/tmp/MERC -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G

(I'm not 100% sure that the FASTA headers contain no * characters; this command removes every *, including any in the headers.)
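
If the headers might contain *, a variant that only touches sequence lines would leave them intact. A minimal sketch, assuming a standard FASTA layout in which header lines start with >; the function name strip_stops is hypothetical:

```shell
# Remove '*' from sequence lines only; header lines (starting with '>')
# are passed through unchanged.
strip_stops() {
    awk '/^>/ { print; next } { gsub(/[*]/, ""); print }'
}

# Example, replacing tr -d '*' in the pipeline above:
# zcat MERC.fasta.gz | strip_stops | mmseqs easy-linclust stdin MERC /export/tmp/MERC -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G
```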

arglog · 2020-06-26T02:20:01Z

That was my initial guess, actually 😄 I got the same error for MERC and metaclust_nr, and I found that they both contain stop codon symbols. I really appreciate your help in investigating this issue. Let me remove the * symbols and run it again.

milot-mirdita · 2020-06-26T13:17:44Z

Okay, I think that was not actually the issue, since we should already handle the stop codon. It seems that reading gzip-compressed input is currently broken in Linclust.
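
The Debug backtrace earlier in the thread appears consistent with this: the seq1 and seq2 buffers begin with \037\213\b, which are the gzip magic bytes (0x1f 0x8b) followed by the deflate method byte (0x08), so compressed bytes seem to have been scored as residues. A small sketch for checking whether an input file is gzip-compressed, assuming head -c and od are available; the function name is_gzip is hypothetical:

```shell
# Succeed if the file starts with the gzip magic bytes 1f 8b.
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Example:
# is_gzip MERC.fasta.gz && echo "compressed input"
```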

If you decompress MERC.fasta.gz first, it should work.
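
One way to do that while keeping the original archive is sketched below. It assumes the input name ends in .gz; the function name decompress_fasta is hypothetical:

```shell
# Write a decompressed copy next to the archive, e.g.
# MERC.fasta.gz -> MERC.fasta, leaving the .gz file in place.
decompress_fasta() {
    gzip -dc "$1" > "${1%.gz}"
}

# Example, followed by the original command on the plain FASTA file:
# decompress_fasta MERC.fasta.gz
# mmseqs easy-linclust MERC.fasta MERC /export/tmp/MERC -c 0.9 --cov-mode 1 --cluster-mode 2 --min-seq-id 0.5 --split-memory-limit 500G
```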

milot-mirdita · 2020-06-26T13:43:51Z

The latest commit should fix the issue.

arglog · 2020-06-26T18:30:28Z

This is good to know! I do not have to drop the * symbols manually for every file then.

Just a follow-up question about the metagenomics FASTA files: I can understand why a * symbol appears at the end of a sequence (it's the stop codon), but why do some sequences have two consecutive asterisks at the end, and others a single asterisk at the beginning? In addition, I found that some amino acids in MERC are lower-case. Does that have a special meaning?

Thanks!

milot-mirdita closed this as completed in cab0e83 Jun 26, 2020

milot-mirdita reopened this Jun 26, 2020

arglog mentioned this issue Jun 29, 2020

easy-linclust got stuck when clustering SRC #324

Open

milot-mirdita assigned milot-mirdita and martin-steinegger and unassigned milot-mirdita Jul 29, 2020
