seqkit grep consumes large amounts of memory. #487
Description
Hello,
I am using seqkit to extract the full chromosome sequences from vertebrate genome assemblies by keeping only the sequences whose ID starts with "CM", for instance with seqkit grep -I -r -p "CM". However, it uses a surprising amount of memory (about 75 GiB peak resident set size in the run below), which gets my HPC jobs killed.
I am using seqkit 2.8.0 from the Galaxy Singularity image depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img:
./depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img /usr/bin/time -v seqkit grep -I -r -p "CM" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null
Command terminated by signal 9
Command being timed: "seqkit grep -I -r -p CM GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
User time (seconds): 33.13
System time (seconds): 5.27
Percent of CPU this job got: 109%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 34.97s
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 78569696
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 154
Minor (reclaiming a frame) page faults: 1688681
Voluntary context switches: 28374
Involuntary context switches: 619
Swaps: 0
File system inputs: 3903404
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Running seqkit 2.3.0 locally (Debian) also shows high memory consumption (about 9 GiB peak resident set size):
/usr/bin/time -v seqkit grep -r -p "^(CM|CP|FR|L[R-T]|O[U-Z])" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null
Command being timed: "seqkit grep -r -p ^(CM|CP|FR|L[R-T]|O[U-Z]) GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
User time (seconds): 48.80
System time (seconds): 7.51
Percent of CPU this job got: 115%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:48.77
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9496532
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 826218
Voluntary context switches: 38384
Involuntary context switches: 751
Swaps: 0
File system inputs: 4949024
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
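One detail about the two commands above: the first pattern, "CM", is unanchored, so as a regular expression it matches any ID that contains "CM", while the second pattern begins with ^ and only matches at the start of the ID. A minimal sketch of the difference, using grep -E as a stand-in for regular-expression matching in general; the IDs are made up for illustration, and this is not seqkit's own matching code:

```shell
# Hypothetical sequence IDs, for illustration only.
ids='CM051234.1
JAPCMX010000001.1
CP012345.1'

# Unanchored: matches "CM" anywhere in the ID (two of the three IDs here).
unanchored=$(printf '%s\n' "$ids" | grep -E 'CM')

# Anchored: matches only IDs that start with "CM" (one ID here).
anchored=$(printf '%s\n' "$ids" | grep -E '^CM')

printf 'unanchored:\n%s\nanchored:\n%s\n' "$unanchored" "$anchored"
```

If only IDs starting with "CM" are wanted, the anchored form "^CM" is probably the safer pattern.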
This large genome can be downloaded from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_027579735.1/
I looked for a command-line option that would make seqkit act as a true streaming filter, without holding large amounts of data in memory, but -I did not seem to help. Do you have a suggestion?
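For comparison, the streaming behaviour asked about here can be sketched with plain awk instead of seqkit. This is only an illustration under some assumptions: the function name filter_fasta_by_prefix is hypothetical, the prefix is treated as a literal string rather than a regular expression, and the FASTA is assumed to be line-wrapped so that holding one line at a time stays cheap.

```shell
# filter_fasta_by_prefix PREFIX
# Reads FASTA on stdin and writes only the records whose header starts with
# ">PREFIX". Only the current line is held in memory at any time.
# (Hypothetical helper for illustration; not part of seqkit.)
filter_fasta_by_prefix() {
    awk -v prefix="$1" '
        # On each header line, decide whether the record is kept.
        /^>/ { keep = (index($0, ">" prefix) == 1) }
        # Print header and sequence lines while the current record is kept.
        keep
    '
}

# Intended usage on the assembly from this issue (not run here):
#   gzip -dc GCA_027579735.1_aBomBom1.pri_genomic.fna.gz | filter_fasta_by_prefix CM > /dev/null
```

Timing this pipeline with /usr/bin/time -v on the same file would give a baseline for how much memory a pure line-by-line filter needs.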
Best,
Charles Plessy