NBMAKE INTERNAL ERROR: Exception ignored in socket #80

adamjstewart · 2022-09-03T21:59:47Z

Describe the bug

I'm seeing the following error message in CI:

NBMAKE INTERNAL ERROR
Exception ignored in: <socket.socket fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/jsonschema/exceptions.py", line 134, in _set
    for k, v in kwargs.items():
ResourceWarning: unclosed <socket.socket fd=27, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/runner.py", line 338, in from_call
    result: Optional[TResult] = func()
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/runner.py", line [25](https://github.com/microsoft/torchgeo/runs/8172122269?check_suite_focus=true#step:5:26)9, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_hooks.py", line [26](https://github.com/microsoft/torchgeo/runs/8172122269?check_suite_focus=true#step:5:27)5, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 88, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/hostedtoolcache/Python/3.10.6/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 78, in unraisable_exception_runtest_hook
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

To Reproduce

Expected behavior

I would expect nbmake to either pass all tests, or inform me of what tests may have failed.

Desktop

OS: Ubuntu 20.04
Browser: Chrome
Version: 1.3.4

Additional context

Seems similar to #71 but the error message is slightly different.

The text was updated successfully, but these errors were encountered:

adamjstewart · 2022-09-03T22:27:48Z

I believe the notebook failure is correct, but the error message isn't particularly informative. Replacing %nvidia-smi with !nvidia-smi fixes the notebook on Google Colab, although it may still break in CI where there is no GPU installed. We should still fix the error message if we can.

alex-treebeard · 2022-09-05T13:17:50Z

Hi Adam,

I can't reproduce this, is it possible this was a transient error relating to some of the Ipython network components?

I have however discovered a bug: It seems like UsageErrors relating to unknown magic commands such as %nvidia-smi halt execution due to the error, yet are reported by nbmake as successful.

adamjstewart · 2022-09-05T16:54:20Z

Unfortunately this happens every time for me consistently. I'll try to reproduce it locally.

adamjstewart · 2022-09-05T19:16:20Z

Couldn't reproduce locally. Could be something specific to the CI environment, or could be a difference in dependency versions.

alex-treebeard · 2022-09-07T08:12:48Z

Strange. Do you know how %nvidia-smi is installed?

I tried that magic command on a colab notebook and found it didn't exist.

Are there any downsides from your workaround of using !nvidia-smi?

adamjstewart · 2022-09-07T14:51:41Z

There is no %nvidia-smi magic command, you have to use !nvidia-smi instead. The bug isn't that %nvidia-smi doesn't exist, the bug is that nbmake fails with a mysterious error that doesn't help me locate the problem.

adamjstewart · 2023-02-08T22:16:57Z

FWIW, still seeing this every time we try to test our notebooks: https://github.com/microsoft/torchgeo/actions/runs/4127657580/jobs/7131131990

alex-treebeard · 2023-02-09T10:49:38Z

thanks for persistently reporting this @adamjstewart

I hit a brick wall last time because I couldn't repro. Are you able to provide a docker image where we can repro this bug?

adamjstewart · 2023-02-09T21:58:11Z

I'm also still trying to repro locally. So far it only happens for us in CI. I did notice that some of our cells timeout when run locally, so I increased the default timeout period. But I'm still seeing the issue.

is it possible this was a transient error relating to some of the Ipython network components?

Unfortunately this happens every time for me consistently.

I will clarify this and say that the error happens ~100% of the time, but the notebook that errors out keeps changing. In microsoft/torchgeo#1097, the notebook that was failing switched from pretrained_weights.ipynb to trainers.ipynb. It could be that reproducing the issue requires running all notebooks, not just one notebook.

Also note that the error is slightly different than last time. Now we have:

NBMAKE INTERNAL ERROR
Exception ignored in: <socket.socket fd=-1, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/__init__.py", line 166, in read
    buf = fp.read()
AttributeError: 'str' object has no attribute 'read'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/nbformat/notebooknode.py", line 47, in from_dict
    return NotebookNode({k: from_dict(v) for k, v in d.items()})
ResourceWarning: unclosed <socket.socket fd=23, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>

  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 339, in from_call
    result: Optional[TResult] = func()
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/runner.py", line 260, in <lambda>
    lambda: ihook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 88, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/opt/hostedtoolcache/Python/3.10.9/x64/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 78, in unraisable_exception_runtest_hook
    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

Unfortunately I don't know how to get a full stacktrace so I can't know what this str object is or why it isn't a filehandle.

adamjstewart · 2023-02-09T22:55:32Z

This is interesting. The error points to an AttributeError triggered here. However, this should be caught by the following line. I'm not sure why it isn't being caught.

alex-treebeard · 2023-02-12T16:50:12Z

@adamjstewart, would you be interested in containerising this test process?

I reckon either (a) this issue would stop happening, or (b) We would be able to repro in vscode and root cause it

We could make your PR (microsoft/torchgeo#1097) use the same pattern as this repo's tests

adamjstewart · 2023-02-12T21:58:03Z

Tried copying that pattern but I must have done something wrong because the devcontainer isn't launching properly. I stole the devcontainer recipe from microsoft/torchgeo#1085 but I've never used vscode before so I have no idea if it's correct.

alex-treebeard · 2023-02-15T10:29:58Z

I've done some investigation in a codespaces instance that runs the same container that I've put in the CI branch.

I believe we have two issues:

For some reason (probably memory or storage exhaustion -- github ci specs) the ipykernel dies when executing the notebooks
When the kernel dies, nbmake throws an internal error (rather than handling and reporting it).

If I fix the latter issue then the error messages should get easier to parse, although having seen the logs when running in VSCode (codespaces), it still isn't clear why the kernel is dying.

adamjstewart · 2023-02-19T04:58:36Z

How would I go about debugging this? Is there an easy way to tell which step of the notebook is failing? Not sure if print statements + pytest -s work with notebooks or not.

adamjstewart · 2023-02-26T19:05:28Z

Discovered something interesting. In microsoft/torchgeo#1124, we finally got our notebook tests passing for the first time in a year by using smaller downloads. However, if you compare microsoft/torchgeo@aadd199 and microsoft/torchgeo@77b4e0b, you'll see that the tests were failing and I fixed them simply by clearing the output of the notebook. Is there any reason why the output would cause an internal error? That was quite a surprise to me.

adamjstewart · 2023-03-18T16:48:12Z

Things were working great, but now all of a sudden I'm seeing this error again: https://github.com/microsoft/torchgeo/actions/runs/4456067450/jobs/7826269250?pr=1124

I tried nbstripout on all notebooks, but that didn't help. I'm not sure what changed, but the error message isn't helpful. Is there any way to enable more verbose logging or other debugging tips you have?

alex-treebeard · 2023-03-20T12:58:27Z

Hi @adamjstewart, my gut feeling is that nbstripout is at best correlated with these test failures.

It feels like the root cause still needs to be discovered.

Common practice is to pragmatically retry flaky tests to see if we can avoid hitting 100% reliability until we are able to identify a long-term fix.

Please can you try using this plugin which I believe will let us get a sense of how much retrying is necessary for a test to pass (if they are indeed just flaky due to resource contention)

e: Apologies, right now I can't give you better logs, although it's my top priority for this project.

adamjstewart · 2023-03-20T15:12:18Z

Thanks, let me try that. I agree that the issue is transient, which suggests that it's still a hardware issue, but I can't figure out what could possibly be wrong...

adamjstewart · 2023-03-20T16:16:12Z

Now it's magically working lol. I'm fine with using pytest-rerunfailures until we can get to the bottom of this issue.

alex-treebeard · 2023-03-21T15:24:51Z

What is your dev environment by the way -- Are you able to share which cloud instance/provider etc you are using (or if you are using a local device)?

adamjstewart · 2023-03-21T15:34:43Z

This is using GitHub Actions, see here for the configuration.

alex-treebeard · 2023-03-21T15:38:44Z

Sorry for context: I’m wondering how this all runs on your interactive dev environment to get a sense of how the ci agent differs (which is the root of the challenge here IMO)

On Tue, 21 Mar 2023 at 15:34, Adam J. Stewart ***@***.***> wrote: This is using GitHub Actions, see here <https://github.com/microsoft/torchgeo/blob/main/.github/workflows/tutorials.yaml> for the configuration. — Reply to this email directly, view it on GitHub <#80 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ANZJM7OZRBDCR4Y44XB2663W5HDB7ANCNFSM6AAAAAAQEB25TE> . You are receiving this because you commented.Message ID: ***@***.***>

-- Alex Remedios

adamjstewart · 2023-03-21T15:40:10Z

Ah, yes, when I run it locally on my laptop (macOS 10.15.7 x86_64 or macOS 13.2.1 arm64) everything works fine. All dependencies are installed via the Spack package manager.

alex-treebeard · 2023-03-22T14:12:49Z

I see. Likely you have 2-8x the hardware resources as the github actions agent.

It would be interesting to try tunneling into the agent and debugging interactively with the CLI. These docs are quite helpful: https://code.visualstudio.com/docs/remote/tunnels

Here's a workflow file that lets you open in vscode:

do not do this in a public repo for security reasons

name: code

on:
  push:
jobs:
  code:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - run: |
        set -x
        set -e -u -o pipefail
        curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
        tar -xf vscode_cli.tar.gz
        ./code tunnel --accept-server-license-terms --no-sleep

adamjstewart · 2023-03-22T15:03:00Z

Oh for sure, my laptop is much more powerful. But I think our notebooks are using a very small amount of resources now, although it's hard to be sure. Hopefully a better error message will reveal what's really going on. If there was an OOM or hard drive full error that would make things obvious.

adamjstewart · 2023-09-01T15:05:54Z

We got things working for a while by rerunning the tests up to 10x until they passed, but that no longer seems to be working. See microsoft/torchgeo#1521 for the newest error messages. As far as I know, nothing changed about the available GitHub runners. And the failing tutorial hasn't changed in years. Did anything change in nbmake? Any debugging suggestions?

adamjstewart · 2023-09-21T13:01:43Z

I think the issue was with one of our tutorials. Once we removed that tutorial in microsoft/torchgeo#1521 everything passes without needing any reruns! I'll close this for now because I never liked that tutorial anyway, but I'll reopen if something similar happens again in the future.

adamjstewart mentioned this issue Sep 3, 2022

Tutorials: replace cell magic with shell command microsoft/torchgeo#756

Merged

alex-treebeard mentioned this issue Sep 5, 2022

Flag magic usage errors as failures #81

Open

alex-treebeard added bug Something isn't working help wanted Extra attention is needed labels Oct 19, 2022

adamjstewart mentioned this issue Feb 8, 2023

CI: test notebooks when modified microsoft/torchgeo#1097

Merged

alex-treebeard mentioned this issue Feb 14, 2023

[Draft] Containerise notebook tests microsoft/torchgeo#1112

Closed

adamjstewart mentioned this issue Mar 22, 2023

Faster, more space-efficient tutorials microsoft/torchgeo#1124

Merged

5 tasks

adamjstewart closed this as completed Sep 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NBMAKE INTERNAL ERROR: Exception ignored in socket #80

NBMAKE INTERNAL ERROR: Exception ignored in socket #80

adamjstewart commented Sep 3, 2022

adamjstewart commented Sep 3, 2022

alex-treebeard commented Sep 5, 2022

adamjstewart commented Sep 5, 2022

adamjstewart commented Sep 5, 2022

alex-treebeard commented Sep 7, 2022

adamjstewart commented Sep 7, 2022

adamjstewart commented Feb 8, 2023

alex-treebeard commented Feb 9, 2023

adamjstewart commented Feb 9, 2023

adamjstewart commented Feb 9, 2023

alex-treebeard commented Feb 12, 2023

adamjstewart commented Feb 12, 2023

alex-treebeard commented Feb 15, 2023

adamjstewart commented Feb 19, 2023

adamjstewart commented Feb 26, 2023

adamjstewart commented Mar 18, 2023

alex-treebeard commented Mar 20, 2023 •

edited

Loading

adamjstewart commented Mar 20, 2023

adamjstewart commented Mar 20, 2023

alex-treebeard commented Mar 21, 2023 •

edited

Loading

adamjstewart commented Mar 21, 2023

alex-treebeard commented Mar 21, 2023 via email

adamjstewart commented Mar 21, 2023

alex-treebeard commented Mar 22, 2023 •

edited

Loading

adamjstewart commented Mar 22, 2023

adamjstewart commented Sep 1, 2023

adamjstewart commented Sep 21, 2023

NBMAKE INTERNAL ERROR: Exception ignored in socket #80

NBMAKE INTERNAL ERROR: Exception ignored in socket #80

Comments

adamjstewart commented Sep 3, 2022

adamjstewart commented Sep 3, 2022

alex-treebeard commented Sep 5, 2022

adamjstewart commented Sep 5, 2022

adamjstewart commented Sep 5, 2022

alex-treebeard commented Sep 7, 2022

adamjstewart commented Sep 7, 2022

adamjstewart commented Feb 8, 2023

alex-treebeard commented Feb 9, 2023

adamjstewart commented Feb 9, 2023

adamjstewart commented Feb 9, 2023

alex-treebeard commented Feb 12, 2023

adamjstewart commented Feb 12, 2023

alex-treebeard commented Feb 15, 2023

adamjstewart commented Feb 19, 2023

adamjstewart commented Feb 26, 2023

adamjstewart commented Mar 18, 2023

alex-treebeard commented Mar 20, 2023 • edited Loading

adamjstewart commented Mar 20, 2023

adamjstewart commented Mar 20, 2023

alex-treebeard commented Mar 21, 2023 • edited Loading

adamjstewart commented Mar 21, 2023

alex-treebeard commented Mar 21, 2023 via email

adamjstewart commented Mar 21, 2023

alex-treebeard commented Mar 22, 2023 • edited Loading

adamjstewart commented Mar 22, 2023

adamjstewart commented Sep 1, 2023

adamjstewart commented Sep 21, 2023

alex-treebeard commented Mar 20, 2023 •

edited

Loading

alex-treebeard commented Mar 21, 2023 •

edited

Loading

alex-treebeard commented Mar 22, 2023 •

edited

Loading