Thread panicked #485

Closed · joshfactorial opened this issue Feb 9, 2023 · 22 comments · Fixed by #500

Comments

@joshfactorial

I'm getting a weird error when I try to use this profiler as part of a batch submission job. The error logs I'm seeing look like this:

=fil-profile= Memory usage will be written out at exit, and stored in profile_200M..
=fil-profile= You can also run the following command while the program is still running to write out peak memory usage up to that point: kill -s SIGUSR2 43523
thread '<unnamed>' panicked at 'already borrowed: BorrowMutError', filpreload/src/lib.rs:138:33
stack backtrace:

Then, that's it. The program never starts. The same command on the command line works fine. Here is my input command:

fil-profile --no-browser -o profile_200M run -m \
    neat --log-level DEBUG \
        --no-log \
        model-seq-err \
        -o DefaultSingleEndedBinned \
        -i reads/sub200M_read1.fq \
        --overwrite
@itamarst
Collaborator

itamarst commented Feb 9, 2023

Sorry it didn't work!

What version of Fil are you using, what version of Python, and what OS?

@joshfactorial
Author

fil-profile v 2022.7.1
RHEL 7.9
slurm 21.08.8-2
Python 3.10.8

@itamarst
Collaborator

Hm, I guess I should get Conda packages updated...

@itamarst
Collaborator

You can try installing with pip meanwhile as a workaround, to see if that helps; the latest version there is 2023.1.0.

@itamarst
Collaborator

Conda-Forge now has up-to-date packages (2023.1.0), so if that's what you were using, can you retest? Thank you!

@joshfactorial
Author

I installed the update, but haven't had a chance to check the results yet.

@joshfactorial
Author

Okay, I'm getting a different error now, but it's unrelated. I think that means it at least got past the thread-panicked problem.

@itamarst
Collaborator

Great (and not so great). Tell me more about the new error!

@joshfactorial
Author

Okay, well our server was down for a bit, but I was able to fix the issues I was seeing in my code and re-run, and I'm still getting this error: thread '<unnamed>' panicked at 'already borrowed: BorrowMutError', filpreload/src/lib.rs:144:29

@itamarst
Collaborator

Sorry it's not working, I'll take a look.

@itamarst
Collaborator

Oh and can you:

  1. Set environment variable RUST_BACKTRACE=1, rerun, and then post the whole traceback it prints
  2. Tell me which version and OS exactly you're running on?

Thank you!

@joshfactorial
Author

Release:


NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.9"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.9"

The backtrace seems to have revealed nothing:

=fil-profile= Memory usage will be written out at exit, and opened automatically in a browser.
=fil-profile= You can also run the following command while the program is still running to write out peak memory usage up to that point: kill -s SIGUSR2 16747
thread '<unnamed>' panicked at 'already borrowed: BorrowMutError', filpreload/src/lib.rs:144:29
stack backtrace:
slurmstepd: error: *** JOB 4604 ON vfc002 CANCELLED AT 2023-03-14T14:04:15 DUE TO TIME LIMIT ***

@joshfactorial
Author

However long I let it run, it starts, hits that error, and then keeps running for the full duration while doing nothing. The profile is just a bunch of memory usage on imports, then nothing.

@itamarst
Collaborator

itamarst commented Mar 14, 2023

Just talking through the code to remind myself (have to head out soon):

  1. At a broad level, re-entrancy is supposed to be prevented by a flag that gets incremented and decremented around calls into Rust from _filpreload.c. The general structure is "if should_track_memory(): increment() then run() then decrement()".
  2. All the code appears to be structured that way, but that's an assumption that should be verified more carefully.
  3. CORRECTED: The specific issue is inside the add_allocation() function, which suggests that something somewhere is doing some other interaction with the callstack (set/clear/start/finish) that triggers an allocation and somehow re-enters the tracker (see the sketch after this list).
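
To make that concrete, here is a minimal, self-contained Rust sketch (not Fil's actual code; the names are made up for illustration) of how that kind of re-entrancy produces exactly this panic: a thread-local RefCell is mutably borrowed by the allocation tracker, some callstack work done while that borrow is still held allocates and re-enters the tracker, and the second borrow_mut() fails with "already borrowed: BorrowMutError".

use std::cell::RefCell;

thread_local! {
    // Stand-in for the per-thread tracking state guarded by a RefCell.
    static ALLOCATIONS: RefCell<Vec<usize>> = RefCell::new(Vec::new());
}

fn add_allocation(size: usize) {
    ALLOCATIONS.with(|state| {
        let mut allocations = state.borrow_mut(); // first mutable borrow
        // Suppose recording the callstack itself allocates and re-enters
        // the tracker while the first borrow is still held...
        record_callstack();
        allocations.push(size);
    });
}

fn record_callstack() {
    // ...then this second borrow_mut() on the same thread panics with
    // "already borrowed: BorrowMutError", matching the reported error.
    ALLOCATIONS.with(|state| state.borrow_mut().push(0));
}

fn main() {
    add_allocation(200);
}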

@joshfactorial
Author

Could it be something in the way the slurm scheduler works? I apologize ahead of time for my lack of knowledge of all the inner workings lol.

@joshfactorial
Author

It's an HPC cluster I'm running it on. I do have some ability to add some libraries and such, if that's the problem.

@itamarst
Collaborator

Probably not slurm. It's possible it's Red Hat 7.9? Which, BTW, is losing extended support in a year, after which you'll have to pay extra for security updates (something to hassle the cluster administrator about).

But that's just random guessing; I would have to figure out the mechanism. I will read the code some more and think.

Could you tell me what libraries you're using? Are you using threads?

@joshfactorial
Author

It's all Python:

python 3.10
biopython 1.79
pkginfo
matplotlib
numpy
tqdm
pyyaml
pip
scipy
pytest
bedtools
htslib
pybedtools
pysam
frozendict
poetry 1.3

As far as Red Hat goes, it's been a whole thing that I stay out of lol. Above my paygrade.

@joshfactorial
Author

It's single-threaded at the moment. I tried requesting more processors from the server to see if that solved it, but it had no effect.

@itamarst
Collaborator

Hm. I think I may've found one place that could be causing the issue.
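
To make the guard idea from earlier concrete, here is a hypothetical Rust sketch of the flag-style re-entrancy guard, with made-up names; this is only an illustration and not necessarily what the eventual fix does. The idea is to bail out when the tracker is re-entered on the same thread instead of taking a second mutable borrow.

use std::cell::{Cell, RefCell};

thread_local! {
    // Re-entrancy flag, analogous to the counter that gets incremented
    // and decremented around calls into Rust from _filpreload.c.
    static IN_TRACKER: Cell<bool> = Cell::new(false);
    static ALLOCATIONS: RefCell<Vec<usize>> = RefCell::new(Vec::new());
}

fn add_allocation(size: usize) {
    IN_TRACKER.with(|flag| {
        if flag.get() {
            // Already inside the tracker on this thread: skip instead of
            // taking a second mutable borrow and panicking.
            return;
        }
        flag.set(true);
        ALLOCATIONS.with(|state| state.borrow_mut().push(size));
        flag.set(false);
    });
}

fn main() {
    add_allocation(200);
    ALLOCATIONS.with(|state| println!("tracked {} allocations", state.borrow().len()));
}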

itamarst linked a pull request Mar 14, 2023 that will close this issue
@itamarst
Collaborator

I will try to do a release later today.

@itamarst
Collaborator

A release with the fix is out.
