Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct how the temp bed files are created #9

Open
wants to merge 3 commits into
base: master
from

Conversation

Projects
None yet
1 participant
@davep
Copy link

commented Apr 11, 2019

This is slightly related to #4 and also to another issue that has been observed, the latter where some of the resulting bed files could end up with "corrupt" content, but never in a consistent way.

There appears to be a problem with how sep_chrom_bed creates the bed files. The naming seems designed to try and avoid clashes but, unfortunately, stands a good chance of ensuring clashes in some situations.

The main problem is that, rather than using conventional methods of generating temp files, a file name is generated from time, is checked for, and if it looks like it already exists a new name is generated with rand (the use of + rather than . in that being the cause of #4). Unfortunately, if multiple processes are being used, this can result in the same sequence of possible file names being generated.

Note that Parallel::ForkManager is used to handle parallel processes, and that it has this warning about rand.

My thinking here is that it might be possible for more than one process to arrive on the same file name at the same time, and then both write to the same file and cause issues and apparent corruption.

This pull request contains changes that switch to using perl's own tempfile function to generate the file names.

davep added some commits Apr 11, 2019

Correctly generate a temporary file name for bed files
This fixes a problem where sep_chrom_bed could easily create clashed file
names, while working to actually not create clashed file names. The problem
is that, rather than using tempfile, the code was using seconds since the
Unix epoch and then, if it looked like a clash was possible, adding a random
number to the end of it.

The problem here is that, mixed with a fork (which it would do), it would
create multiple forks that all followed the sane random sequence. See:

https://metacpan.org/pod/Parallel::ForkManager#USING-RAND()-IN-FORKED-PROCESSES

for why. Long story short: the code that would try and ensure there wasn't a
filename clash would almost guarantee that there was.

So this change switches to using an actual temp file name generation
function to create the temporary bed file names.
@davep

This comment has been minimized.

Copy link
Owner Author

commented on 596ff12 Apr 11, 2019

Note that this also addresses/fixes shenlab-sinai#4 in that it does away with the code that was the cause of the problem.

Although the above change does away with the problematic code, hat-tip to @kb-bioinf for identifying what that problem was ($timehead + rand should really have been $timehead . rand).

Tidy up the naming of the bed file
- Separate the chromosome from the rest of the name.
- Have a slightly longer run of template characters.
- Had .bed as a suffix.

None of this should have the code really work any differently, but it should
make the file names easier on the eye when looking at tmp.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.