seqkit grep consumes large amounts of memory. #487

@charles-plessy

Hello,

I am using seqkit to extract full chromosome sequences from vertebrate genome assemblies, keeping only sequences whose ID starts with "CM". For instance: seqkit grep -I -r -p "CM". However, it uses a surprisingly large amount of memory (about 75 GiB maximum resident set size in the run below), which gets my HPC jobs killed.

I am using seqkit 2.8.0 from the Galaxy image for Singularity depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img.

./depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img /usr/bin/time -v seqkit grep -I -r -p "CM" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null 
Command terminated by signal 9
	Command being timed: "seqkit grep -I -r -p CM GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
	User time (seconds): 33.13
	System time (seconds): 5.27
	Percent of CPU this job got: 109%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 34.97s
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 78569696
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 154
	Minor (reclaiming a frame) page faults: 1688681
	Voluntary context switches: 28374
	Involuntary context switches: 619
	Swaps: 0
	File system inputs: 3903404
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Running seqkit 2.3.0 locally (Debian) also shows high memory consumption (about 9 GiB maximum resident set size):

/usr/bin/time -v seqkit grep -r -p "^(CM|CP|FR|L[R-T]|O[U-Z])" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null 
	Command being timed: "seqkit grep -r -p ^(CM|CP|FR|L[R-T]|O[U-Z]) GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
	User time (seconds): 48.80
	System time (seconds): 7.51
	Percent of CPU this job got: 115%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:48.77
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 9496532
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 826218
	Voluntary context switches: 38384
	Involuntary context switches: 751
	Swaps: 0
	File system inputs: 4949024
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

This large genome can be downloaded from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_027579735.1/

I looked for a command-line switch that would force seqkit to act like a true filter, without keeping large amounts of data in memory, but -I did not seem to help. Do you have a suggestion?

Best,

Charles Plessy
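
As a possible workaround while the memory question is open, a plain streaming filter can be sketched with standard tools. This is not seqkit's implementation and it does not explain seqkit's memory use. The function name filter_fasta_by_header and the output file name chromosomes.fna are hypothetical, and the sketch assumes line-wrapped FASTA (as NCBI genome downloads usually are), so that awk only ever holds one short line at a time.

```shell
#!/bin/sh
# Hypothetical helper (not part of seqkit): read FASTA on stdin and print
# only the records whose header line matches the given awk regex.
filter_fasta_by_header() {
    # $1: extended regex applied to the whole header line, e.g. '^>CM'
    awk -v pat="$1" '
        /^>/ { keep = ($0 ~ pat) }   # at each header, decide whether to keep the record
        keep                         # print header and sequence lines of kept records
    '
}

# Example with the file from this report (output name is an assumption):
# gzip -dc GCA_027579735.1_aBomBom1.pri_genomic.fna.gz \
#     | filter_fasta_by_header '^>CM' > chromosomes.fna
```

Because the decision is made per header line and nothing else is stored, memory use should stay roughly constant regardless of chromosome length, at the cost of losing seqkit's other options.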


