pytabix + multiprocessing? #2

Open
gibiansky opened this issue Jun 18, 2014 · 8 comments

@gibiansky

I'm having some issues using pytabix with the Python multiprocessing module; it seems to cause some sort of memory corruption, with messages along the lines of:

[get_intv] the following line cannot be parsed and skipped: S=2546446;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000080001000014000100;WGT=1;VC=SNV;INT;KGPhase1;KGPROD;CAF=[0.9991,0.0009183];COMMON=0

The same file does not cause any problems when I'm not using multiprocessing.

Do you know anything about this? I looked for a close method on tabix file objects but couldn't find one. Could there be an issue with too many file descriptors open on the same file, or something like that?
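
For what it's worth, I think the kind of sharing I'm worried about can be shown without pytabix at all. The sketch below (the file name is made up, and it assumes the default fork start method on Linux) opens a descriptor once in the parent process; the forked workers then inherit it and share a single file offset, so their reads interleave:

import multiprocessing
import os

# Write a small test file, then open it once in the parent process.
with open("shared.txt", "wb") as out:
    out.write(b"abcdefghij")

fd = os.open("shared.txt", os.O_RDONLY)

def read_two(_):
    # Every forked worker inherits fd, and they all share one file offset,
    # so the reads interleave instead of each starting at byte 0.
    return os.read(fd, 2)

if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    print(pool.map(read_two, range(4)))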

@slowkow (Owner) commented Jun 18, 2014

I haven't tried to use pytabix with multiprocessing and I don't know why this problem arises.

Could you please share a code snippet that reproduces the error?

@gibiansky (Author)

I'm having a bit of difficulty capturing the full error (with the segfault/memory corruption), so maybe that part isn't even tabix's fault. However, the following code

import multiprocessing
import tabix

dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('13', 24008000, 24009000)))
    return z

use_processes = True # adjustable to test for error
if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)

prints nothing (as expected) if use_processes is False, but if you set it to True you get:

[get_intv] the following line cannot be parsed and skipped: S=24020770;dbSNPBuildID=135;SSR=0;SAO=0;VP=0x050000000001100014000100;WGT=1;VC=SNV;KGPhase1;KGPROD;CAF=[0.9853,0.01469];COMMON=1

repeated some number of times (the number varies).

This is just a VCF downloaded from the dbSNP FTP server.

@slowkow (Owner) commented Jun 18, 2014

I can't reproduce the error when I run the code below, using the GTF that I provide with pytabix.

This leads me to believe that your VCF file might be the problem...

import multiprocessing
import tabix

dbsnp = tabix.open("test/example.gtf.gz")

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr2', 20000, 30000)))
    return z

use_processes = True # adjustable to test for error
if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)

@gibiansky (Author)

I don't think it's the contents of the VCF, since it works fine without multiprocessing. However, I imagine it might be the size: the VCF is something like 1.2 GB. A query on example.gtf.gz takes no time at all, while on my VCF each query call takes on the order of 2-3 seconds.

Maybe if you have a very fast query, the processes don't have time to interfere, but if you have a longer one, they can do so occasionally?

Anyway, I have no idea what's going on here :(

@slowkow (Owner) commented Jun 18, 2014

You might be right about timing, but I'm not sure. It would be worth reading the literature about using C extensions with multiprocessing. I skimmed a few Google results but didn't find anything relevant.

  1. Does the code below produce the same error? I moved dbsnp inside query_region().
  2. Do you get an error if you run your code on a smaller file? You might take a few lines from your 1.2 GB file. I tried to use a ~100 MB file and could not reproduce the error.

import multiprocessing
import tabix

def query_region(*args):
    dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr1', 200000, 300000)))
    return z

use_processes = True # adjustable to test for error
if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)
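
If opening the file on every call turns out to be slow, a variation worth trying (untested on my end, and reusing the file and region from your first snippet) is a pool initializer, so that each worker process opens its own handle exactly once:

import multiprocessing
import tabix

dbsnp = None

def init_worker():
    # Runs once in each worker process, so every worker gets its own handle
    # instead of inheriting the one opened in the parent.
    global dbsnp
    dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")

def query_region(*args):
    return list(dbsnp.query('13', 24008000, 24009000))

if __name__ == "__main__":
    pool = multiprocessing.Pool(10, initializer=init_worker)
    pool.map(query_region, xrange(100), 1)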

@marklivingstone

I am adding to this issue because I think it is pertinent. I access tabix files in a program that runs under MPI on a 16-core HPC cluster, so one tabix file is accessed by 16 cores simultaneously. Sometimes it works fine; other times I have ended up with a 974-line traceback that involves pytabix and suggests that double freeing is going on. So I did some defensive programming, and when the error occurs, I get None-type records returned from the tabix iterator.

It often works when the cluster is only lightly used, and it fails more often when the cluster is working harder. Are tabix / pytabix supposed to work in this kind of multi-user environment?
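
To illustrate the access pattern (a rough sketch using mpi4py; the file name and regions are placeholders rather than my actual code), each rank simply opens its own handle to the same indexed file and queries it independently, launched with something like mpirun -n 16:

from mpi4py import MPI
import tabix

rank = MPI.COMM_WORLD.Get_rank()

# Each rank is a separate process and opens its own handle to the shared file.
tb = tabix.open("variants.vcf.gz")

start = 1000000 + 10000 * rank
records = list(tb.query('1', start, start + 10000))
print("rank %d retrieved %d records" % (rank, len(records)))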

@slowkow (Owner) commented Sep 22, 2015

Could I ask you to share a code snippet that reproduces the error? Could you share the error, too?

@marklivingstone

Hi Kamil,

The code is part of a massive framework, but I will see what I can do. I certainly can get you the error message. It will probably be after Tuesday.

Kind Regards,

Mark Livingstone
