seqkit grep consumes large amounts of memory.

Hello,

I am using seqkit to extract full chromosome sequences from vertebrate genome assemblies, keeping only the sequences whose ID starts with "CM", for instance with `seqkit grep -I -r -p "CM"`. However, it uses a surprising amount of memory (a peak resident set size of about 75 GiB in the run below), which gets my HPC jobs killed.

I am using seqkit 2.8.1 from the Galaxy image for Singularity, `depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img`:

```
./depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img /usr/bin/time -v seqkit grep -I -r -p "CM" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null 
Command terminated by signal 9
	Command being timed: "seqkit grep -I -r -p CM GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
	User time (seconds): 33.13
	System time (seconds): 5.27
	Percent of CPU this job got: 109%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 34.97s
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 78569696
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 154
	Minor (reclaiming a frame) page faults: 1688681
	Voluntary context switches: 28374
	Involuntary context switches: 619
	Swaps: 0
	File system inputs: 3903404
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```

Running seqkit 2.3.0 locally (Debian) also shows high memory consumption (a peak resident set size of about 9 GiB):

```
/usr/bin/time -v seqkit grep -r -p "^(CM|CP|FR|L[R-T]|O[U-Z])" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null 
	Command being timed: "seqkit grep -r -p ^(CM|CP|FR|L[R-T]|O[U-Z]) GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
	User time (seconds): 48.80
	System time (seconds): 7.51
	Percent of CPU this job got: 115%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:48.77
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 9496532
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 826218
	Voluntary context switches: 38384
	Involuntary context switches: 751
	Swaps: 0
	File system inputs: 4949024
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0
```

This large genome can be downloaded from: <https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_027579735.1/>

I looked for a command-line switch that would force seqkit to act as a true streaming filter, without keeping large amounts of data in memory, but `-I` (`--immediate-output`) did not seem to help. Do you have a suggestion?
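To illustrate what I mean by a "true filter", here is a minimal sketch of a streaming workaround in plain awk. It is not seqkit's behaviour or a proposed implementation, only a stopgap under some assumptions: the function name `filter_fasta_by_id_regex` and the output name `chromosomes.fna` are made up for this example, the regular expression is applied to the header text after the leading `>`, and the sequences are line-wrapped (as in NCBI FASTA files), so only one line is held in memory at a time.

```shell
# Hypothetical helper (not part of seqkit): stream FASTA from stdin and print
# only the records whose header, without the leading ">", matches the given
# extended regular expression.
filter_fasta_by_id_regex() {
    # LC_ALL=C keeps bracket ranges such as [R-T] byte-based.
    LC_ALL=C awk -v re="$1" '
        /^>/ { keep = (substr($0, 2) ~ re) }  # decide once per record, at its header
        keep                                  # print header and sequence lines of kept records
    '
}

# Intended usage with the file from this report (output name is an assumption):
# gzip -dc GCA_027579735.1_aBomBom1.pri_genomic.fna.gz \
#     | filter_fasta_by_id_regex '^(CM|CP|FR|L[R-T]|O[U-Z])' > chromosomes.fna
```

Because each line is either printed or dropped as soon as it is read, memory use should stay roughly constant regardless of chromosome length, which is the behaviour I was hoping to get from `seqkit grep`.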

Best,

Charles Plessy

----

Please check the items below before submitting an issue.
They help to improve the communication efficiency between us.
Thanks!

### Prerequisites

- [x] Make sure you've installed the correct executable binary file.
    For Mac users, please download
    - `seqkit_darwin_amd64.tar.gz` for Mac with Intel CPUs.
    - `seqkit_darwin_arm64.tar.gz` for Mac with M series CPUs.
- [x] Make sure you are using the latest version by running `seqkit version -u`.
- [x] Read the [usage and examples](http://bioinf.shenwei.me/seqkit/usage/) for the specific subcommand.

### Describe your issue in detail

- [x] Please copy and paste the command you ran and the error information if reported.
- [x] It would be more helpful to provide as much information as you can:
    - [x] Are you running on a personal computer or a server?
    - [x] What's the operating system, and how much RAM (memory) is available?
    - [x] Show the types and sizes of input files with `file xxx` and `ls -lh xxx`.
    - [x] Show some lines of input files with `head -n 5 xxx` or `zcat xxx.gz | head -n 5`.
- [x] Provide a reproducible example.
    - [x] Has this problem happened many times?
    - [x] Or did it only fail with this input file and/or these commands/parameters?

