NBMAKE INTERNAL ERROR: Exception ignored in socket #80
I believe the notebook failure is correct, but the error message isn't particularly informative. Replacing …
Hi Adam, I can't reproduce this. Is it possible this was a transient error relating to some of the IPython network components? I have, however, discovered a bug: it seems like …
Unfortunately this happens consistently for me. I'll try to reproduce it locally.
Couldn't reproduce locally. Could be something specific to the CI environment, or could be a difference in dependency versions.
Strange. Do you know how …? I tried that magic command on a Colab notebook and found it didn't exist. Are there any downsides to your workaround of using …?
There is no …
FWIW, still seeing this every time we try to test our notebooks: https://github.com/microsoft/torchgeo/actions/runs/4127657580/jobs/7131131990
Thanks for persistently reporting this, @adamjstewart. I hit a brick wall last time because I couldn't repro. Are you able to provide a Docker image where we can repro this bug?
I'm also still trying to repro locally. So far it only happens for us in CI. I did notice that some of our cells time out when run locally, so I increased the default timeout period. But I'm still seeing the issue.
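For reference, the cell timeout can be raised from the command line; a minimal sketch, assuming nbmake's `--nbmake-timeout` option (seconds) and an illustrative `docs/tutorials/` path:

```shell
# Give each notebook cell up to 10 minutes instead of the default
pytest --nbmake --nbmake-timeout=600 docs/tutorials/
```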
I will clarify this and say that the error happens ~100% of the time, but the notebook that errors out keeps changing. In microsoft/torchgeo#1097, the notebook that was failing switched from … Also note that the error is slightly different than last time. Now we have:
Unfortunately I don't know how to get a full stack trace, so I can't tell what this str object is or why it isn't a file handle.
This is interesting. The error points to an …
@adamjstewart, would you be interested in containerising this test process? I reckon either (a) this issue would stop happening, or (b) we would be able to repro in VS Code and root-cause it. We could make your PR (microsoft/torchgeo#1097) use the same pattern as this repo's tests.
Tried copying that pattern, but I must have done something wrong because the devcontainer isn't launching properly. I stole the devcontainer recipe from microsoft/torchgeo#1085, but I've never used VS Code before, so I have no idea if it's correct.
I've done some investigation in a Codespaces instance that runs the same container that I've put in the CI branch. I believe we have two issues:
If I fix the latter issue, the error messages should get easier to parse, although having seen the logs when running in VS Code (Codespaces), it still isn't clear why the kernel is dying.
How would I go about debugging this? Is there an easy way to tell which step of the notebook is failing? Not sure if print statements + …
Discovered something interesting. In microsoft/torchgeo#1124, we finally got our notebook tests passing for the first time in a year by using smaller downloads. However, if you compare microsoft/torchgeo@aadd199 and microsoft/torchgeo@77b4e0b, you'll see that the tests were failing and I fixed them simply by clearing the output of the notebook. Is there any reason why the output would cause an internal error? That was quite a surprise to me.
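For anyone reproducing this, clearing stored outputs can be scripted; a sketch using `jupyter nbconvert` (the notebook path is illustrative):

```shell
# Strip all stored cell outputs in place before committing
jupyter nbconvert --clear-output --inplace docs/tutorials/*.ipynb
```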
Things were working great, but now all of a sudden I'm seeing this error again: https://github.com/microsoft/torchgeo/actions/runs/4456067450/jobs/7826269250?pr=1124 I tried nbstripout on all notebooks, but that didn't help. I'm not sure what changed, and the error message isn't helpful. Is there any way to enable more verbose logging, or any other debugging tips you have?
Hi @adamjstewart, my gut feeling is that nbstripout is at best correlated with these test failures; it feels like the root cause still needs to be discovered. A common practice is to pragmatically retry flaky tests until we're able to identify a long-term fix. Could you please try this plugin? I believe it will let us get a sense of how much retrying is necessary for a test to pass (if they are indeed just flaky due to resource contention). Edit: Apologies, right now I can't give you better logs, although it's my top priority for this project.
Thanks, let me try that. I agree that the issue is transient, which suggests that it's still a hardware issue, but I can't figure out what could possibly be wrong...
Now it's magically working lol. I'm fine with using pytest-rerunfailures until we can get to the bottom of this issue.
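For anyone following along, retries with pytest-rerunfailures look something like this (the path and retry counts are illustrative):

```shell
# Retry each failed notebook up to 5 times, waiting 10s between attempts
pytest --nbmake --reruns 5 --reruns-delay 10 docs/tutorials/
```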
What is your dev environment, by the way? Are you able to share which cloud instance/provider you're using (or whether you're using a local device)?
This is using GitHub Actions, see here for the configuration.
Sorry, for context: I'm wondering how this all runs on your interactive dev environment, to get a sense of how the CI agent differs (which is the root of the challenge here, IMO).
Alex Remedios
Ah, yes, when I run it locally on my laptop (macOS 10.15.7 x86_64 or macOS 13.2.1 arm64), everything works fine. All dependencies are installed via the Spack package manager.
I see. Likely you have 2-8x the hardware resources of the GitHub Actions agent. It would be interesting to try tunneling into the agent and debugging interactively with the CLI. These docs are quite helpful: https://code.visualstudio.com/docs/remote/tunnels

Here's a workflow file that lets you open the agent in VS Code (do not do this in a public repo, for security reasons):

```yaml
name: code
on:
  push:
jobs:
  code:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: |
          set -x
          set -e -u -o pipefail
          curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
          tar -xf vscode_cli.tar.gz
          ./code tunnel --accept-server-license-terms --no-sleep
```
Oh for sure, my laptop is much more powerful. But I think our notebooks are using a very small amount of resources now, although it's hard to be sure. Hopefully a better error message will reveal what's really going on. If there were an OOM or disk-full error, that would make things obvious.
We got things working for a while by rerunning the tests up to 10x until they passed, but that no longer seems to be working. See microsoft/torchgeo#1521 for the newest error messages. As far as I know, nothing changed about the available GitHub runners, and the failing tutorial hasn't changed in years. Did anything change in nbmake? Any debugging suggestions?
I think the issue was with one of our tutorials. Once we removed that tutorial in microsoft/torchgeo#1521, everything passes without needing any reruns! I'll close this for now because I never liked that tutorial anyway, but I'll reopen if something similar happens again in the future.
Describe the bug
I'm seeing the following error message in CI:
To Reproduce
Expected behavior
I would expect nbmake either to pass all tests or to inform me of which tests may have failed.
Desktop
Additional context
Seems similar to #71 but the error message is slightly different.