
File handles left open from tabix_index #1038

Closed
kmhernan opened this issue Sep 3, 2021 · 2 comments

kmhernan commented Sep 3, 2021

Version: pysam==0.16.0.1
Python: 3.8.9

When using a Pool to create VCF files and then index them (~20,000 of them), I noticed that file handles were left open, and eventually got an exception about too many open files. I was able to isolate it to the pysam.tabix_index function. To reproduce it, I wrote the small script below. Grab one of the process IDs from the log and run something like lsof -p <PID> | grep 'vcf' -c; you will see the count climb until all processes finish.

"""This script reproduces the tabix index scenario where
file handles are not closed.
"""
import logging
import multiprocessing
import os
import pysam
import sys
import time


logger_fmt = "[%(levelname)s] [%(asctime)s] [%(name)s] - %(message)s"
root_logger = logging.getLogger("example")


def worker(args):
    """Simple worker to make a VCF and tabix index it. Then sleep."""
    out_directory, chunk = args
    logger = root_logger.getChild("worker")
    logger.info(f"Processing chunk {chunk}")
    logger.info("Process ID: {}".format(os.getpid()))
    header = pysam.VariantHeader()
    header.add_line("##contig=<ID=chr1,length=100,assembly=Fake>")
    vcf_out_path = os.path.join(out_directory, f"chunk_{chunk}.vcf.gz")
    with pysam.VariantFile(vcf_out_path, mode="wz", header=header) as vcf_obj:
        for i in range(100):
            record = vcf_obj.new_record(
                contig="chr1", start=i, stop=i + 1, filter=("PASS"), alleles=("A", "T")
            )
            vcf_obj.write(record)

    # The leak happens here: each tabix_index call leaves a descriptor open.
    pysam.tabix_index(vcf_out_path, preset="vcf", force=True)
    time.sleep(1)


if __name__ == "__main__":
    handler = logging.StreamHandler(sys.stderr)
    formatter = logging.Formatter(logger_fmt, datefmt="%Y%m%d %H:%M:%S")
    handler.setFormatter(formatter)
    root_logger.addHandler(handler)
    root_logger.setLevel(level=logging.INFO)

    if len(sys.argv) != 2:
        root_logger.error(
            "ERROR! You must provide the path to the directory to write outputs."
        )
        root_logger.error("Usage: example.py /path/to/output/dir")
        sys.exit(1)

    odir = sys.argv[1]
    if not os.path.isdir(odir):
        root_logger.info(f"Creating output directory {odir}...")
        os.mkdir(odir)

    chunks = [(odir, i) for i in range(100)]
    with multiprocessing.Pool(processes=4) as vcf_pool:
        vcf_pool.map(worker, chunks)
jmarshall (Member) commented

Thanks for the comprehensive bug report.

The file descriptor is being leaked from is_gzip_file(), which just needs an os.close(fd). However, there is other code in tabix_index() that also opens the file to do format detection, so I will refactor it to detect both things at once instead.
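For illustration, here is a minimal sketch of the shape of the fix; this standalone is_gzip_file is an assumption, not the actual pysam implementation:

import os


def is_gzip_file(filename):
    """Return True if filename starts with the gzip magic bytes.

    Sketch only: the essential point is that the descriptor opened
    with os.open() must be closed on every path, which the leaking
    version omitted.
    """
    fd = os.open(filename, os.O_RDONLY)
    try:
        magic = os.read(fd, 2)
    finally:
        os.close(fd)  # the missing close that leaked one fd per call
    return magic == b"\x1f\x8b"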

kmhernan (Author) commented Sep 7, 2021

Thanks @jmarshall! I hoped my example would make it easy to replicate and track down the issue. Let me know if you need anything else from me.
