pytabix + multiprocessing? #2
I haven't tried to use pytabix with multiprocessing and I don't know why this problem arises. Could you please share a code snippet that reproduces the error?
I'm having a bit of difficulty capturing the full error (the segfault/memory corruption), so maybe that aspect isn't even relevant here. My test script prints nothing (as expected) if use_processes is set to False, but with use_processes = True the error message is repeated some number of times (the number varies). The file is just a VCF downloaded from the dbSNP FTP server.
I can't reproduce the error when I run the code below, using the GTF file that I provide with pytabix. This leads me to believe that your VCF file might be the problem...

import multiprocessing
import tabix
# the tabix handle is opened once at module level, so forked worker processes inherit it
dbsnp = tabix.open("test/example.gtf.gz")

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr2', 20000, 30000)))
    return z

use_processes = True  # adjustable to test for error

if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)
I do not think it is the contents of the VCF, as it works fine without multiprocessing. Maybe if you have a very fast query, the processes don't have time to interfere, but if you have a longer one, they can do so occasionally? Anyway, I have no idea what's going on here :(
You might be right about timing, but I'm not sure. It would be worth reading the literature about using C extensions with multiprocessing. I skimmed a few Google results but didn't find anything relevant.
import multiprocessing
import tabix
def query_region(*args):
    # here the tabix file is opened inside the worker function, so each call gets its own handle
    dbsnp = tabix.open("./dbsnp-all-2013-12-11.vcf.gz")
    z = []
    for x in xrange(20):
        z.extend(list(dbsnp.query('chr1', 200000, 300000)))
    return z

use_processes = True  # adjustable to test for error

if use_processes:
    pool = multiprocessing.Pool(10)
    pool.map(query_region, xrange(100), 1)
else:
    for x in xrange(100):
        query_region(x)
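One pattern that may be worth testing (a minimal sketch of my own, reusing the filename and query from the snippet above; init_worker and worker_handle are names I made up) is to give every worker process its own handle exactly once, via the Pool initializer, rather than opening the file in the parent or on every call:

import multiprocessing
import tabix

worker_handle = None  # one handle per worker process

def init_worker(path):
    # runs once in each worker process right after it starts
    global worker_handle
    worker_handle = tabix.open(path)

def query_region(*args):
    z = []
    for x in xrange(20):
        z.extend(list(worker_handle.query('chr1', 200000, 300000)))
    return len(z)  # return a small result instead of every record

if __name__ == '__main__':
    pool = multiprocessing.Pool(10, initializer=init_worker,
                                initargs=("./dbsnp-all-2013-12-11.vcf.gz",))
    print pool.map(query_region, xrange(100), 1)
    pool.close()
    pool.join()

Whether this actually avoids the corruption depends on what the underlying C code does across fork(), so treat it as an experiment rather than a known fix.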
I am adding to this issue because I think it is pertinent. I access tabix files in a program that runs under MPI on a 16-core HPC cluster (so one tabix file is being accessed by 16 cores simultaneously). Sometimes it works fine; other times I have ended up with a 974-line traceback involving pytabix that suggests a double free is going on. So I did some defensive programming, and when the error occurs I get None records returned from the tabix iterator. It often works when the cluster is only lightly used, and it fails more often when the cluster is working harder. Are tabix and pytabix supposed to work in multi-user environments like this?
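A minimal sketch of that access pattern with mpi4py (the filename and region are placeholders, and this is a guess at the shape of the code rather than anything from the actual framework): every rank opens its own handle instead of sharing one, and checks for the None records described above.

from mpi4py import MPI
import tabix

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# each MPI rank opens its own tabix handle instead of sharing one across ranks
tb = tabix.open("variants.vcf.gz")  # placeholder filename

records = list(tb.query("chr1", 1000000, 1100000))  # placeholder region

# defensive check for the None records mentioned above
missing = [r for r in records if r is None]
if missing:
    print "rank %d: %d None records" % (rank, len(missing))

Run with something like mpirun -n 16 python script.py. If the None records still appear when each rank has its own handle, shared state between processes is less likely to be the cause.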
Could I ask you to share a code snippet that reproduces the error? Could you share the error, too?
Hi Kamil, the code is part of a massive framework, but I will see what I can do. Kind regards, Mark Livingstone, PhD Candidate, E-mail: mark.livingstone@griffithuni.edu.au
I'm having some issues using PyTabix with the Python multiprocessing module; it seems that it somehow results in memory corruption of some sort, with error messages to that effect. The same file does not cause any problems when I'm not using multiprocessing. Do you know anything about this? I looked for a close method on tabix file objects, but couldn't find one - could there be an issue with too many file descriptors to the same file, or something?
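On the descriptor question: since there is no close method, a rough Linux-only way to test the "too many file descriptors" idea is to count the process's open descriptors while holding many handles. This sketch reuses the test/example.gtf.gz file mentioned above; open_fds is just a helper for the illustration.

import os
import tabix

def open_fds():
    # /proc/self/fd lists this process's currently open file descriptors (Linux only)
    return len(os.listdir('/proc/self/fd'))

before = open_fds()
handles = [tabix.open("test/example.gtf.gz") for _ in xrange(50)]
after = open_fds()
print "open descriptors before: %d, after: %d" % (before, after)

# there is no explicit close(), so freeing the descriptors presumably happens
# when the handle objects are garbage-collected (an assumption, not documented behaviour)
del handles

If the count grows by one per handle and only shrinks after the handles are garbage-collected, descriptor exhaustion becomes a plausible suspect for long-lived multiprocess jobs.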