FastaFile iteration fails with multiple processes #409

Closed
AndreasHeger opened this issue Feb 26, 2017 · 2 comments

@AndreasHeger
Contributor

see https://groups.google.com/forum/#!topic/pysam-user-group/bRPZoGQEcLc

@iprada

iprada commented May 10, 2017

Hi, I am the original poster of the question in the Google group. I have done a bit of testing on my own to work around my problem and narrow down the possible issue. I have noticed that pysam fails with multiple processes if I pass the full path of the genome to the function I want to run in parallel. However, if I change the working directory inside the function and use string formatting to open the FastaFile, the processes read the file correctly and everything works as expected.

I have written a small example of the issue, hoping it will be helpful.

This code, which passes the complete path to the genome, fails:

import pysam as ps
import multiprocessing as mp
import os

genome_fa = "/home/inigo/msc_thesis/genome_data/hg38.fa"
#working_dir = '/home/inigo/msc_thesis/genome_data/'
number_of_cores = 3

def get_fasta(genome_fasta):
   """function for getting fasta sequence from a genome"""
   # change working directory
   #os.chdir(working_dir)

   # some coordinates
   chr1 = "chr1"
   start = 200000
   end = 200050
   #open the file
   fastafile = ps.FastaFile("%s" % genome_fasta)
   # get the sequence
   fasta = fastafile.fetch(chr1, start, end)
   print(fasta)
   return(None)

# test the function in non-parallel mode
print(get_fasta(genome_fa))
# function in parallel
if __name__ == '__main__':
   jobs = []
   # init the processes
   for i in range(number_of_cores):
       print(i)
       p = mp.Process(target=get_fasta, args=(genome_fa))
       jobs.append(p)
       p.start()
   # wait for the processes to finish
   for p in jobs:
       p.join()

This is the error output:

GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
None
0
1
2
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: get_fasta() takes 1 positional argument but 42 were given
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: get_fasta() takes 1 positional argument but 42 were given
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: get_fasta() takes 1 positional argument but 42 were given

This code, which passes the working directory to the function and changes into it inside the function, works as expected:

import pysam as ps
import multiprocessing as mp
import os

genome_fa = "hg38.fa"
working_dir = '/home/inigo/msc_thesis/genome_data/'

number_of_cores = 3




def get_fasta(genome_fasta,working_dir):
    """function for getting fasta sequence from a genome"""
    # change working directory
    os.chdir(working_dir)

    # some coordinates
    chr1 = "chr1"
    start = 200000
    end = 200050
    #open the file
    fastafile = ps.FastaFile("%s" % genome_fasta)
    # get the sequence
    fasta = fastafile.fetch(chr1, start, end)
    print(fasta)
    return(None)


# test the function in non-parallel mode
print(get_fasta(genome_fa,working_dir))


# function in parallel
if __name__ == '__main__':
    jobs = []
    # init the processes
    for i in range(number_of_cores):
        print(i)
        p = mp.Process(target=get_fasta, args=(genome_fa,working_dir))
        jobs.append(p)
        p.start()
    # wait for the processes to finish
    for p in jobs:
        p.join()

It gives the following expected output:

GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
None
0
1
2
GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG

I hope this is helpful for the developers or for anybody with the same problem.

best,

@AndreasHeger
Contributor Author

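Hi, sorry this took a while. The issue is a typo in the script, use

p = mp.Process(target=get_fasta, args=(genome_fa,))

Note the trailing ',' to ensure you pass a tuple. Without it, (genome_fa) is just the string itself, and its 42 characters are unpacked into 42 separate positional arguments, which is the TypeError in the traceback. The second script works because (genome_fa,working_dir) is already a real tuple.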
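For anybody landing here with the same traceback, the difference can be checked without pysam or the genome file. This is only a minimal sketch: `count_args` is a hypothetical stand-in for `get_fasta`, and the unpacking below imitates what the traceback shows `Process.run()` doing (`self._target(*self._args, ...)`), on the assumption that CPython stores `args` as `tuple(args)`.

```python
# Path copied from the report above; the file does not need to exist here.
genome_fa = "/home/inigo/msc_thesis/genome_data/hg38.fa"


def count_args(*args):
    """Hypothetical stand-in for get_fasta: count the positional arguments received."""
    return len(args)


not_a_tuple = (genome_fa)   # parentheses only group; this is still a str
one_tuple = (genome_fa,)    # the trailing comma makes a one-element tuple

# Imitate the call made in Process.run(): target(*args).
# A bare string is iterable, so it is split into single characters.
n_bad = count_args(*tuple(not_a_tuple))
n_good = count_args(*tuple(one_tuple))

if __name__ == "__main__":
    print(type(not_a_tuple).__name__, n_bad)
    print(type(one_tuple).__name__, n_good)
```

The first count equals the length of the path string, which is why the number in the TypeError changes with the path and why the short, relative-path script with two arguments never hit the problem.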
