Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NBMAKE INTERNAL ERROR: Exception ignored in socket #80

Closed
adamjstewart opened this issue Sep 3, 2022 · 27 comments
Closed

NBMAKE INTERNAL ERROR: Exception ignored in socket #80

adamjstewart opened this issue Sep 3, 2022 · 27 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@adamjstewart
Copy link

Describe the bug

I'm seeing the following error message in CI:

NBMAKE INTERNAL ERROR
Exception ignored in: <socket.socket fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/jsonschema/exceptions.py", line 134, in _set
    for k, v in kwargs.items():
ResourceWarning: unclosed <socket.socket fd=27, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/runner.py", line 338, in from_call
    result: Optional[TResult] = func()
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/runner.py", line [25](https://github.com/microsoft/torchgeo/runs/8172122269?check_suite_focus=true#step:5:26)9, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_hooks.py", line [26](https://github.com/microsoft/torchgeo/runs/8172122269?check_suite_focus=true#step:5:27)5, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 88, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 78, in unraisable_exception_runtest_hook
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

To Reproduce

  1. See https://github.com/microsoft/torchgeo/runs/8172122269?check_suite_focus=true
  2. Failing notebook is https://github.com/microsoft/torchgeo/blob/releases/v0.3/docs/tutorials/transforms.ipynb

Expected behavior

I would expect nbmake to either pass all tests, or inform me of what tests may have failed.

Desktop

  • OS: Ubuntu 20.04
  • Browser: Chrome
  • Version: 1.3.4

Additional context

Seems similar to #71 but the error message is slightly different.

@adamjstewart
Copy link
Author

I believe the notebook failure is correct, but the error message isn't particularly informative. Replacing %nvidia-smi with !nvidia-smi fixes the notebook on Google Colab, although it may still break in CI where there is no GPU installed. We should still fix the error message if we can.

@alex-treebeard
Copy link
Member

Hi Adam,

I can't reproduce this, is it possible this was a transient error relating to some of the Ipython network components?

I have however discovered a bug: It seems like UsageErrors relating to unknown magic commands such as %nvidia-smi halt execution due to the error, yet are reported by nbmake as successful.

@adamjstewart
Copy link
Author

Unfortunately this happens every time for me consistently. I'll try to reproduce it locally.

@adamjstewart
Copy link
Author

Couldn't reproduce locally. Could be something specific to the CI environment, or could be a difference in dependency versions.

@alex-treebeard
Copy link
Member

Strange. Do you know how %nvidia-smi is installed?

I tried that magic command on a colab notebook and found it didn't exist.

Are there any downsides from your workaround of using !nvidia-smi?

@adamjstewart
Copy link
Author

There is no %nvidia-smi magic command, you have to use !nvidia-smi instead. The bug isn't that %nvidia-smi doesn't exist, the bug is that nbmake fails with a mysterious error that doesn't help me locate the problem.

@alex-treebeard alex-treebeard added bug Something isn't working help wanted Extra attention is needed labels Oct 19, 2022
@adamjstewart
Copy link
Author

FWIW, still seeing this every time we try to test our notebooks: https://github.com/microsoft/torchgeo/actions/runs/4127657580/jobs/7131131990

@alex-treebeard
Copy link
Member

thanks for persistently reporting this @adamjstewart

I hit a brick wall last time because I couldn't repro. Are you able to provide a docker image where we can repro this bug?

@adamjstewart
Copy link
Author

I'm also still trying to repro locally. So far it only happens for us in CI. I did notice that some of our cells timeout when run locally, so I increased the default timeout period. But I'm still seeing the issue.

is it possible this was a transient error relating to some of the Ipython network components?

Unfortunately this happens every time for me consistently.

I will clarify this and say that the error happens ~100% of the time, but the notebook that errors out keeps changing. In microsoft/torchgeo#1097, the notebook that was failing switched from pretrained_weights.ipynb to trainers.ipynb. It could be that reproducing the issue requires running all notebooks, not just one notebook.

Also note that the error is slightly different than last time. Now we have:

NBMAKE INTERNAL ERROR
Exception ignored in: <socket.socket fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/__init__.py", line 166, in read
    buf = fp.read()
AttributeError: 'str' object has no attribute 'read'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/notebooknode.py", line 47, in from_dict
    return NotebookNode({k: from_dict(v) for k, v in d.items()})
ResourceWarning: unclosed <socket.socket fd=23, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 339, in from_call
    result: Optional[TResult] = func()
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 260, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 88, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 78, in unraisable_exception_runtest_hook
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

Unfortunately I don't know how to get a full stacktrace so I can't know what this str object is or why it isn't a filehandle.

@adamjstewart
Copy link
Author

This is interesting. The error points to an AttributeError triggered here. However, this should be caught by the following line. I'm not sure why it isn't being caught.

@alex-treebeard
Copy link
Member

@adamjstewart, would you be interested in containerising this test process?

I reckon either (a) this issue would stop happening, or (b) We would be able to repro in vscode and root cause it

We could make your PR (microsoft/torchgeo#1097) use the same pattern as this repo's tests

@adamjstewart
Copy link
Author

Tried copying that pattern but I must have done something wrong because the devcontainer isn't launching properly. I stole the devcontainer recipe from microsoft/torchgeo#1085 but I've never used vscode before so I have no idea if it's correct.

@alex-treebeard
Copy link
Member

I've done some investigation in a codespaces instance that runs the same container that I've put in the CI branch.

I believe we have two issues:

  1. For some reason (probably memory or storage exhaustion -- github ci specs) the ipykernel dies when executing the notebooks
  2. When the kernel dies, nbmake throws an internal error (rather than handling and reporting it).

If I fix the latter issue then the error messages should get easier to parse, although having seen the logs when running in VSCode (codespaces), it still isn't clear why the kernel is dying.

@adamjstewart
Copy link
Author

How would I go about debugging this? Is there an easy way to tell which step of the notebook is failing? Not sure if print statements + pytest -s work with notebooks or not.

@adamjstewart
Copy link
Author

Discovered something interesting. In microsoft/torchgeo#1124, we finally got our notebook tests passing for the first time in a year by using smaller downloads. However, if you compare microsoft/torchgeo@aadd199 and microsoft/torchgeo@77b4e0b, you'll see that the tests were failing and I fixed them simply by clearing the output of the notebook. Is there any reason why the output would cause an internal error? That was quite a surprise to me.

@adamjstewart
Copy link
Author

Things were working great, but now all of a sudden I'm seeing this error again: https://github.com/microsoft/torchgeo/actions/runs/4456067450/jobs/7826269250?pr=1124

I tried nbstripout on all notebooks, but that didn't help. I'm not sure what changed, but the error message isn't helpful. Is there any way to enable more verbose logging or other debugging tips you have?

@alex-treebeard
Copy link
Member

alex-treebeard commented Mar 20, 2023

Hi @adamjstewart, my gut feeling is that nbstripout is at best correlated with these test failures.

It feels like the root cause still needs to be discovered.

Common practice is to pragmatically retry flaky tests to see if we can avoid hitting 100% reliability until we are able to identify a long-term fix.

Please can you try using this plugin which I believe will let us get a sense of how much retrying is necessary for a test to pass (if they are indeed just flaky due to resource contention)

e: Apologies, right now I can't give you better logs, although it's my top priority for this project.

@adamjstewart
Copy link
Author

Thanks, let me try that. I agree that the issue is transient, which suggests that it's still a hardware issue, but I can't figure out what could possibly be wrong...

@adamjstewart
Copy link
Author

Now it's magically working lol. I'm fine with using pytest-rerunfailures until we can get to the bottom of this issue.

@alex-treebeard
Copy link
Member

alex-treebeard commented Mar 21, 2023

What is your dev environment by the way -- Are you able to share which cloud instance/provider etc you are using (or if you are using a local device)?

@adamjstewart
Copy link
Author

This is using GitHub Actions, see here for the configuration.

@alex-treebeard
Copy link
Member

alex-treebeard commented Mar 21, 2023 via email

@adamjstewart
Copy link
Author

Ah, yes, when I run it locally on my laptop (macOS 10.15.7 x86_64 or macOS 13.2.1 arm64) everything works fine. All dependencies are installed via the Spack package manager.

@alex-treebeard
Copy link
Member

alex-treebeard commented Mar 22, 2023

I see. Likely you have 2-8x the hardware resources as the github actions agent.

It would be interesting to try tunneling into the agent and debugging interactively with the CLI. These docs are quite helpful: https://code.visualstudio.com/docs/remote/tunnels

Here's a workflow file that lets you open in vscode:

do not do this in a public repo for security reasons

name: code

on:
  push:
jobs:
  code:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - run: |
        set -x
        set -e -u -o pipefail
        curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
        tar -xf vscode_cli.tar.gz
        ./code tunnel --accept-server-license-terms --no-sleep

@adamjstewart
Copy link
Author

Oh for sure, my laptop is much more powerful. But I think our notebooks are using a very small amount of resources now, although it's hard to be sure. Hopefully a better error message will reveal what's really going on. If there was an OOM or hard drive full error that would make things obvious.

@adamjstewart
Copy link
Author

We got things working for a while by rerunning the tests up to 10x until they passed, but that no longer seems to be working. See microsoft/torchgeo#1521 for the newest error messages. As far as I know, nothing changed about the available GitHub runners. And the failing tutorial hasn't changed in years. Did anything change in nbmake? Any debugging suggestions?

@adamjstewart
Copy link
Author

I think the issue was with one of our tutorials. Once we removed that tutorial in microsoft/torchgeo#1521 everything passes without needing any reruns! I'll close this for now because I never liked that tutorial anyway, but I'll reopen if something similar happens again in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants