FastaFile iteration fails with multiple processes #409

AndreasHeger · 2017-02-26T22:00:32Z

see https://groups.google.com/forum/#!topic/pysam-user-group/bRPZoGQEcLc

iprada · 2017-05-10T11:09:30Z

Hi, I am the original poster of the question in the Google group. I did a little testing on my own to narrow down the problem. I noticed that pysam fails with multiple processes if I pass the full path of the genome to the function I want to run in parallel. However, if I change the working directory inside the function and open the FastaFile by its bare file name (via string formatting), the processes read the file correctly and everything works as expected.

Here is a small example of the issue, in the hope that it is helpful.

This code, which passes the complete path to the genome, fails to open the file:

import pysam as ps
import multiprocessing as mp
import os

genome_fa = "/home/inigo/msc_thesis/genome_data/hg38.fa"
#working_dir = '/home/inigo/msc_thesis/genome_data/'
number_of_cores = 3

def get_fasta(genome_fasta):
   """function for getting fasta sequence from a genome"""
   # change working directory
   #os.chdir(working_dir)

   # some coordinates
   chr1 = "chr1"
   start = 200000
   end = 200050
   #open the file
   fastafile = ps.FastaFile("%s" % genome_fasta)
   # get the sequence
   fasta = fastafile.fetch(chr1, start, end)
   print(fasta)
   return(None)

# test the function in non-parallel mode
print(get_fasta(genome_fa))
# function in parallel
if __name__ == '__main__':
   jobs = []
   # init the processes
   for i in range(number_of_cores):
       print(i)
       p = mp.Process(target=get_fasta, args=(genome_fa))
       jobs.append(p)
       p.start()
   # wait for the processes to finish
   for p in jobs:
       p.join()

This is the error output

GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
None
0
1
2
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: get_fasta() takes 1 positional argument but 42 were given
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: get_fasta() takes 1 positional argument but 42 were given
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
TypeError: get_fasta() takes 1 positional argument but 42 were given

This code, which passes the working directory to the function and changes into it inside the function, works as expected:

import pysam as ps
import multiprocessing as mp
import os

genome_fa = "hg38.fa"
working_dir = '/home/inigo/msc_thesis/genome_data/'

number_of_cores = 3




def get_fasta(genome_fasta,working_dir):
    """function for getting fasta sequence from a genome"""
    # change working directory
    os.chdir(working_dir)

    # some coordinates
    chr1 = "chr1"
    start = 200000
    end = 200050
    #open the file
    fastafile = ps.FastaFile("%s" % genome_fasta)
    # get the sequence
    fasta = fastafile.fetch(chr1, start, end)
    print(fasta)
    return(None)


# test the function in non-parallel mode
print(get_fasta(genome_fa,working_dir))


# function in parallel
if __name__ == '__main__':
    jobs = []
    # init the processes
    for i in range(number_of_cores):
        print(i)
        p = mp.Process(target=get_fasta, args=(genome_fa,working_dir))
        jobs.append(p)
        p.start()
    # wait for the processes to finish
    for p in jobs:
        p.join()

With the following expected output

GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
None
0
1
2
GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG
GGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACG

I hope this is helpful for the developers or for anybody with the same problem

best,

AndreasHeger · 2017-07-10T15:08:41Z

Hi, sorry this took a while. The issue is a typo; use

p = mp.Process(target=get_fasta, args=(genome_fa,))

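Note the ',' to ensure you pass a tuple. Without it, args is just the path string, and its 42 characters are unpacked as 42 separate positional arguments, which is the TypeError in the traceback.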
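
A minimal sketch of the difference, with no pysam or multiprocessing involved. The stand-in get_fasta below only returns its argument (an assumption for illustration, not the real worker), and the explicit `*args` call mimics the `self._target(*self._args, **self._kwargs)` line seen in the traceback:

```python
genome_fa = "/home/inigo/msc_thesis/genome_data/hg38.fa"


def get_fasta(genome_fasta):
    """Hypothetical stand-in for the real worker: echoes the path it receives."""
    return genome_fasta


wrong_args = (genome_fa)   # parentheses alone: still a plain str
right_args = (genome_fa,)  # trailing comma: a 1-element tuple

print(type(wrong_args).__name__, len(wrong_args))  # str 42
print(type(right_args).__name__, len(right_args))  # tuple 1

# Unpacking a str passes every character as its own positional argument.
try:
    get_fasta(*wrong_args)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Unpacking the 1-tuple passes the whole path as a single argument.
result = get_fasta(*right_args)
print(result)
```

This also suggests why the second script above ran cleanly: `(genome_fa,working_dir)` already contains a comma, so it was a real two-element tuple.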

AndreasHeger mentioned this issue Jun 1, 2017

Is there code in pysam for writing fastx files? #471

Closed

AndreasHeger closed this as completed Jul 10, 2017
