wandb hangs experiment (10 min+): Internal Server Error for url: https://api.wandb.ai/graphql #1016

Closed
hanspinckaers opened this issue May 5, 2020 · 10 comments

Comments

@hanspinckaers

wandb --version && python --version && uname

  • Weights and Biases version: 0.8.35
  • Python version: 3.7
  • Operating System: Linux

Description

For the past few days, I have noticed experiments hanging during wandb logging. Sometimes they even crashed.

So far, downgrading to 0.8.33 seems to help. Will report if the problem arises again.

What I Did

2020-05-05 09:25:36,056 ERROR   Thread-18 :22373 [internal.py:execute():113] 500 response executing GraphQL.
2020-05-05 09:25:36,057 ERROR   Thread-18 :22373 [internal.py:execute():114] {"error":"Error 1040: Too many connections"}

2020-05-05 09:25:36,058 ERROR   Thread-18 :22373 [retry.py:__call__():108] Retry attempt failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/wandb/retry.py", line 95, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/wandb/apis/internal.py", line 116, in execute
    six.reraise(*sys.exc_info())
  File "/usr/local/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/wandb/apis/internal.py", line 110, in execute
    return self.client.execute(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/gql/transport/requests.py", line 39, in execute
    request.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/graphql
2020-05-05 09:25:39,928 INFO    Thread-3  :22373 [run_manager.py:_on_file_modified():691] file/dir modified: <redacted>/run-20200505_072213-camelyon-16384-full-correct-loss/wandb-metadata.json
2020-05-05 09:25:40,883 ERROR   Thread-18 :22373 [internal.py:execute():113] 500 response executing GraphQL.
2020-05-05 09:25:40,884 ERROR   Thread-18 :22373 [internal.py:execute():114] {"error":"Error 1040: Too many connections"}

2020-05-05 09:25:50,866 ERROR   Thread-18 :22373 [internal.py:execute():113] 500 response executing GraphQL.
2020-05-05 09:25:50,867 ERROR   Thread-18 :22373 [internal.py:execute():114] {"error":"Error 1040: Too many connections"}

2020-05-05 09:25:51,143 WARNING Thread-7  :22373 [util.py:request_with_retry():614] requests_with_retry encountered retryable exception: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/files/hanspinckaers/camelyon/camelyon-16384-full-correct-loss/file_stream. args: ('https://api.wandb.ai/files/hanspinckaers/camelyon/camelyon-16384-full-correct-loss/file_stream',), kwargs: {'json': {'files': {'output.log':
@vanpelt
Contributor

vanpelt commented May 5, 2020

@hanspinckaers our retry logic should do a rolling backoff and resume normal operation in the case of outages. None of that logic should have changed between 0.8.33 and 0.8.35. We're currently looking into a brief nightly outage caused by automatic database maintenance. We'll confirm the retry logic is working appropriately and are looking for solutions to the DB maintenance issue. Are you running this from a regular Python process or within a Jupyter shell?
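
For readers unfamiliar with the idea, retrying with backoff can be sketched roughly as below. This is only an illustration: exponential backoff with jitter is one common form, the helper name `retry_with_backoff` and all of its parameters are hypothetical, and nothing here is taken from wandb's actual retry code.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is exhausted.

    Hypothetical helper for illustration; not wandb's internal retry logic.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Capping the number of attempts means a long outage surfaces as an exception rather than an indefinite wait; whether that trade-off suits a training job depends on how costly a lost run is.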

@hanspinckaers
Author

It seems more likely that it is the outage, then; version 0.8.33 'working' could just be a coincidence. This is running in a regular Python process (with PyTorch multiprocessing, though). We had some hiccups with our storage system as well, so that could have played a role too. However, in some cases this exception was the last thing my Python process logged before hanging or crashing.
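
One way to check whether the errors cluster around a short outage window is to tally ERROR lines per minute from the debug log. The sketch below assumes the line format shown in the log excerpt in this issue (timestamp, level, thread, source location); the function name `errors_per_minute` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# 2020-05-05 09:25:36,056 ERROR   Thread-18 :22373 [internal.py:execute():113] ...
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} (\w+)\s")


def errors_per_minute(lines):
    """Count ERROR-level log lines per minute, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Traceback lines and other non-timestamped lines do not match.
        if match and match.group(2) == "ERROR":
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (for example, the run's debug log, wherever it is stored on your system) to `errors_per_minute` yields a counter whose busiest minutes can be compared against the time the process stopped logging.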

@hanspinckaers
Author

I have not seen this again, so I am closing this issue now.

@richardrl

Just saw a "Too many connections" error in the dashboard.

[image: screenshot of the error in the dashboard]

@vanpelt
Contributor

vanpelt commented Jun 10, 2021

@richardrl we had an outage last night that caused these errors. Everything should be functioning properly now.

@emanuelevivoli

$ wandb --version && python --version && uname
wandb, version 0.10.30
Python 3.6.13 :: Anaconda, Inc.
Linux

I still have this issue.
How can I fix it?

@vanpelt
Contributor

vanpelt commented Jul 19, 2021

@emanuelevivoli is the error you're talking about related to the sweep issue you filed here? Can you share the specific steps that cause your process to hang?

@emanuelevivoli

Hi @vanpelt,
sorry for the late answer, but I moved to another project and totally forgot about these issues.
I have had no more problems, thanks.

@sadra-barikbin

Hi, I'm experiencing this issue in a Google Colab environment. To reproduce:

#bash
wandb login --cloud "API_KEY"

then

#python
import wandb

api = wandb.Api()
# entity and project_name are placeholders for your own W&B entity and project
runs = api.runs(f'{entity}/{project_name}')
runs[0]

output:

Retry attempt failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/wandb/sdk/lib/retry.py", line 102, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/wandb/apis/public.py", line 205, in execute
    return self._client.execute(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/wandb/vendor/gql-0.2.0/wandb_gql/transport/requests.py", line 39, in execute
    request.raise_for_status()
  File "/usr/local/lib/python3.7/dist-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://api.wandb.ai/graphql
wandb: Network error (HTTPError), entering retry loop.

@aryamohan23

Hi, I'm experiencing the same issue as @sadra-barikbin.
