
samtools sort setting memory requirement #831

Open
bernt-matthias opened this issue Apr 25, 2018 · 3 comments

@bernt-matthias
I'm trying to set the available memory for samtools sort with `-m` in a cluster environment. If I use the full memory that is available for the job (as reported by the cluster environment), I get: `samtools sort: couldn't allocate memory for bam_mem`. I guess this is because samtools sort also uses memory for things other than bam_mem.

Can you suggest a way to set the memory parameter (automatically) so that as much of the system's memory as possible is used?

Maybe related: #807

FYI: I need it for that: galaxyproject/tools-iuc#1801

@daviesrob (Member)

The memory limit for samtools sort is actually per thread, so you probably want to use GALAXY_MEMORY_MB / GALAXY_SLOTS when setting the `-m` option. Yes, this is crazy and I don't know why it was done that way, but we're a bit stuck with it now.
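The division above can be sketched as a small shell snippet. `GALAXY_MEMORY_MB` and `GALAXY_SLOTS` are the Galaxy variables named in this comment; the default values, file names, and the fact that the command is only echoed (rather than run) are illustrative assumptions.

```shell
# Illustrative defaults; in a real Galaxy job these are set by the scheduler.
GALAXY_MEMORY_MB="${GALAXY_MEMORY_MB:-8192}"  # total memory for the job, in MB
GALAXY_SLOTS="${GALAXY_SLOTS:-4}"             # threads available to the job

# samtools sort's -m limit is per *thread*, so divide the job total by the
# slot count. Integer division; any remainder is left as extra headroom.
MEM_PER_THREAD_MB=$(( GALAXY_MEMORY_MB / GALAXY_SLOTS ))

# Echo the resulting command line instead of running it, for the sketch.
echo "samtools sort -@ ${GALAXY_SLOTS} -m ${MEM_PER_THREAD_MB}M -o out.bam in.bam"
```

With the defaults above this yields `-m 2048M` per thread for an 8192 MB, 4-slot job.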

A future update will add a new option to allow the memory limit to be set for the entire program, and we'll just have to work out how it should interact with the existing -m option.

Sort does use a bit more memory than specified, but it shouldn't go over by too much these days.

@bernt-matthias (Author)

@daviesrob Thanks for your response, but since I'm currently running single-threaded jobs, this doesn't seem to be the solution.

My guess is that the value specified on the command line is the amount of memory that samtools uses for buffering BAM data. But samtools uses some memory beyond that (for other data structures), so the total memory used by samtools is the value given on the command line plus some overhead X.

FYI: We recently introduced GALAXY_MEMORY_MB_PER_SLOT (galaxyproject/galaxy#5625), but I forgot to use it here 😀 .. though if I'm right, it would not help anyway..?

@daviesrob (Member)

GALAXY_MEMORY_MB_PER_SLOT sounds ideal, but you would have to subtract a bit to allow for overheads. My experiments suggest that setting `-m` to about 75% of the absolute limit should be safe enough (when reading/writing BAM or SAM; CRAM may need extra space for references).
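The 75% rule of thumb can be sketched as follows. `GALAXY_MEMORY_MB_PER_SLOT` is the Galaxy variable mentioned above and already accounts for the thread count, so only the overhead discount is applied; the default value and the echoed command are illustrative assumptions, not a definitive recipe.

```shell
# Illustrative default; in a real Galaxy job the scheduler sets this.
GALAXY_MEMORY_MB_PER_SLOT="${GALAXY_MEMORY_MB_PER_SLOT:-2048}"

# Keep ~75% of the per-slot allowance for -m; the remaining ~25% absorbs
# samtools sort's own overhead (CRAM references may need even more slack).
SORT_MEM_MB=$(( GALAXY_MEMORY_MB_PER_SLOT * 3 / 4 ))

# Echo the resulting command line instead of running it, for the sketch.
echo "samtools sort -m ${SORT_MEM_MB}M -o out.bam in.bam"
```

For a 2048 MB per-slot allowance this gives `-m 1536M`, leaving 512 MB of headroom.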

Once you've started spilling to disk it doesn't make much difference to the run time if you under-specify the limit, as long as it's not too far out. The only sorts that will get much slower are the ones that would otherwise have fitted in memory with a more generous limit.

fgvieira added a commit to snakemake/snakemake-wrappers that referenced this issue Feb 22, 2024
According to the manual, `-m` specifies the approximate maximum required memory per thread. In some cases, `samtools sort` can use more memory than specified (e.g. samtools/samtools#831), so we account for some overhead.
