Read timed out when the total number of combinations of tunable parameters exceeds about 15 million #780
I found that it is not a specific parameter nor the number of parameters that causes the problem, because I have tried some ablations and added many constant parameters. The total number of combinations of possible values is the problem. When the number of combinations exceeds 14~15 million, I get the error.

The reason I want to do so many runs is that I am trying to do a preliminary exploration of a large search space via tiny-scale experiments (1~2 mins/run), to rule out some possibilities first.
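For context on the numbers involved: the combination count is just the product of the per-parameter value counts, so it explodes quickly. A minimal sketch of the arithmetic (the parameter names and values below are hypothetical, not taken from this report):

```python
# Hypothetical sweep parameters; each extra parameter multiplies the total.
from math import prod  # Python 3.8+

sweep_parameters = {
    "lr": {"values": [1e-4, 3e-4, 1e-3, 3e-3]},       # 4 values
    "batch_size": {"values": [16, 32, 64, 128]},      # 4 values
    "dropout": {"values": [0.0, 0.1, 0.2, 0.3, 0.4]}, # 5 values
    # ...a dozen parameters like these reach tens of millions quickly
}

total = prod(len(p["values"]) for p in sweep_parameters.values())
print(total)  # the reported errors start around 14-15 million combinations
```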
Thanks for the report. We will look into this and figure out if we can handle this size of combinations or if we have to set some limits.
Any updates on this issue? I am getting it too:

```
wandb: Synced stoic-sweep-2417: https://app.wandb.ai/entity/proj/runs/dhcvvr71
2020-03-27 03:37:06,244 - wandb.wandb_agent - INFO - Cleaning up finished run: dhcvvr71
wandb: Network error (ReadTimeout), entering retry loop. See /opt/training/wandb/debug.log for full traceback.
```

When I check the debug.log, I see the same errors as above. Then after some time, I see them all marked as
Any updates on this issue? I am encountering the same problem.
Same problem here: "wandb: Network error (ReadTimeout), entering retry loop."
I was facing the same issue with bayes optimization and a large number of combinations. I switched to random search, reasoning that random should require constant compute to select the next configuration, and I do not seem to see timeouts. I have not done a ton of tests so far, but random seems not to have the issue.
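For anyone trying the same workaround, a hedged sketch of the change (parameter names and project are placeholders): random search draws each configuration independently, so its cost per suggestion stays roughly constant, while bayes has to consider all previous results before proposing the next point.

```python
import wandb

sweep_config = {
    "method": "random",  # was "bayes"; random proposes configs in constant time
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"values": [1e-4, 3e-4, 1e-3]},
        "batch_size": {"values": [32, 64, 128]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="my-project")  # hypothetical project
```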
This issue is stale because it has been open 60 days with no activity. |
Hey folks, the thread has gone stale.
Since there was no update on this ticket, I have not tried bayes recently; I'm assuming the bug has not been fixed.
The issue still persists on my side.
I am also having this issue.
We're working on improvements to our underlying sweep architecture to allow for large search spaces. In the meantime, unfortunately, the only solution is to reduce the number of tunable parameters in your sweep space.
@vanpelt I also have this issue when I'm not running sweeps and am just training a large model. Any ideas?
@prash-p we had a brief outage this morning. Our library should continue retrying in these cases. Can you share the output you saw in your terminal?
@vanpelt I've included the debug log here: #2039. When frozen, the log is this:
I'm facing a similar issue, but on the upload of artifacts which contain a considerable amount of metadata. Also worth mentioning that I'm running a self-hosted version of wandb.
Just a little bit of debugging from my side: even setting `WANDB_HTTP_TIMEOUT` to a large value, the timeout persists; meanwhile mysqld went crazy on the wandb server machine.
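Assuming the setting referenced above is the client's `WANDB_HTTP_TIMEOUT` environment variable (as the later replies suggest), a minimal sketch of raising it; the value is arbitrary:

```python
import os

# Must be set before wandb reads its environment configuration.
os.environ["WANDB_HTTP_TIMEOUT"] = "300"  # seconds

import wandb
run = wandb.init(project="my-project")  # hypothetical project
```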
@opsxcq when running wandb/local you'll want to configure an external MySQL database so you can scale it according to your needs. It looks like you're attempting to make a very expensive query and the resources provided to the docker container simply aren't enough for the job to complete. As a near-term fix, are you able to increase the resources available to the container? The long-term fix is provisioning a MySQL database, ideally in one of the clouds, then dumping the database inside of the container and exporting it to a database you can scale with your workloads.
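A rough sketch of what pointing wandb/local at external services can look like. The `MYSQL` and `BUCKET` variables follow wandb/local's documented external-storage settings, but treat the exact names and URLs here as assumptions and check the linked docs:

```sh
# Hypothetical example: external MySQL and S3-compatible bucket for wandb/local.
docker run -d -p 8080:8080 \
  -e MYSQL="mysql://wandb:secret@db.example.com:3306/wandb" \
  -e BUCKET="s3://my-wandb-bucket" \
  --name wandb-local wandb/local
```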
Thanks for your answer @vanpelt; the machine that I'm using is a

This feels like a performance bug, with the addition of the fact that it fails even with huge timeouts; it keeps failing with `context deadline exceeded`. I was checking here:

```python
step_prepare = wandb.filesync.step_prepare.StepPrepare(
    self._api, 0.1, 0.01, 1000
)  # TODO: params
step_prepare.start()
```

which then calls

```python
class StepPrepare(object):
    """A thread that batches requests to our file prepare API.

    Any number of threads may call prepare_async() in parallel. The PrepareBatcher thread
    will batch requests up and send them all to the backend at once.
    """

    def __init__(self, api, batch_time, inter_event_time, max_batch_size):
        self._api = api
        self._inter_event_time = inter_event_time
        self._batch_time = batch_time
        self._max_batch_size = max_batch_size
        self._request_queue = queue.Queue()
        self._thread = threading.Thread(target=self._thread_body)
        self._thread.daemon = True
```

and, following inside the same file:

```python
def _gather_batch(self, first_request):
    batch_start_time = time.time()
    batch = [first_request]
    while True:
        try:
            request = self._request_queue.get(
                block=True, timeout=self._inter_event_time
            )
            if isinstance(request, RequestFinish):
                return True, batch
            batch.append(request)
            remaining_time = self._batch_time - (time.time() - batch_start_time)
            if remaining_time < 0 or len(batch) >= self._max_batch_size:
                break
        except queue.Empty:
            break
    return False, batch
```

Feels like the timeout is static in this case and not using `WANDB_HTTP_TIMEOUT`.
@opsxcq the `context deadline exceeded` won't be impacted by `WANDB_HTTP_TIMEOUT`. All requests to the backend must complete within 60 seconds regardless of the client's timeout setting. I would need to know more specifics about the action on your end that's causing the timeout. The issue is likely related to filesystem performance inside of the container. Just to reiterate: running the container without an external MySQL database and S3-compatible object store will never be performant enough for production workloads. You can learn more about configuring external file storage and MySQL here.
I'm running some benchmarks right now. I was uploading about 200k entries of metadata; I reduced it to 10% of the metadata to check what happens and still got the same error. About the 60-second timeout on the backend, where can I configure it?

I think it is pretty unlikely to be due to IO, since I could upload around 250gb of artifact data without metadata with no problems. It looks like something related to the metadata processing and/or the data being persisted on the database side. Sure, a faster database always helps, but it is unreasonable for such a small amount of data to require so much processing power, so please be open to the possibility of a performance bug.

This is not a production deployment; it is a homelab deployment which I use for learning your platform and for personal research, so there is only one client connected at a time. When it is production ready I can implement the same workflows at work, but I would expect this hardware to handle at least 50 clients with no issues.

If for performance reasons I have to store metadata outside the metadata area of the artefacts, I'm willing to help with the code and effort required to make it fast enough so that using metadata is possible. I would like to reiterate how much this feature helps in day-to-day usage, and how making it infeasible for real scenarios would lower your product's value and remove a huge selling point.
@opsxcq can you please share what you mean by metadata? If you could provide some example code that mimics your use case, we could reproduce on our end and see what's causing the performance issue.
```python
artifact = wandb.Artifact(event.name,
                          type='dataset',
                          description=event.description,
                          metadata=meta)
```

where meta is a nested dict:

```
>>> len(meta.keys())
12
>>> len(meta['x1'].keys())
5633
>>> len(meta['x1']['x2'].keys())
5
```

I had to censor what the fields mean, but multiplying them (12 × 5633 × 5 ≈ 338k) gives the number of unique keys used in the metadata for the artefact.
Yep, that's a massive amount of JSON to encode and store in the MySQL database. I would only put metadata into the artifact itself that you expect you'll need to filter on in the future. You should be able to just encode the metadata as JSON and write it to a file within your artifact, which will be much more performant and still let you see exactly what values were present for a given version. There's a bug in the MySQL JSON library that has O(n^3) complexity when serializing/de-serializing JSON that contains arrays, which might be the root cause here. In the future we may limit the amount of metadata persisted to the DB automatically and just store it directly in the artifact.
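A minimal sketch of this workaround, assuming a nested `meta` dict like the one described above; the project, artifact name, and filename are placeholders:

```python
import json
import wandb

meta = {"x1": {"x2": {"field": 1}}}  # stand-in for the large nested metadata

run = wandb.init(project="my-project")  # hypothetical project
artifact = wandb.Artifact(
    "my-dataset",
    type="dataset",
    metadata={"num_keys": len(meta)},  # keep only small, filterable metadata here
)
# Full metadata goes into a file inside the artifact instead of the DB.
with artifact.new_file("metadata.json", mode="w") as f:
    json.dump(meta, f)

run.log_artifact(artifact)
run.finish()
```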
In which version should I expect this fix? When will it be delivered? I see that the code for the backend is closed source and proprietary; is there any other way that I can apply this fix myself? I enjoyed the seamless integration of just accessing the metadata object, but given your answer, would the best practice for this case be serializing it to a shelf or JSON file and using it as an additional metadata file? Could this be implemented client side, by which I mean in this repository, so it could be transparent for the user? There are concerns about arbitrary filenames, as in
@opsxcq the bug is actually in MySQL 5.7; you could likely run an external MariaDB database to get around the JSON serialization bug. The only concern with arbitrary filenames is whether you'll write a file to your artifact in the future that would collide with it. Using a table could be a great fit for this; it would allow you to filter and navigate the metadata within the artifact. There's currently a 200k row limit for
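For the table alternative mentioned above, a hedged sketch (columns and rows are hypothetical); adding a `wandb.Table` to the artifact makes the values browsable and filterable in the UI without pushing them into the run metadata:

```python
import wandb

table = wandb.Table(columns=["field", "subfield", "value"],
                    data=[["x1", "x2", 5], ["x1", "x3", 7]])

artifact = wandb.Artifact("my-dataset", type="dataset")
artifact.add(table, "metadata")  # stored with the artifact, viewable as a table
```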
Just for future reference for those reading this issue: the ticket numbers from Oracle are Bug #103790, Bug #32919524 and Bug #28949700, fixed in MySQL 5.7.36, released 2021-10-19. These fixes are available in the current repository that the docker image uses.
I will be migrating the database now and will let you know the outcome.
wandb got stuck in a restart loop after the upgrade.
Pulling the latest build.
Now I will confirm whether this problem persists.
Now the client fails with error 500. I will update it to check whether the error continues.
I'm still struggling to make the wandb client and server communicate. After some rollbacks I got it working, but the bug persists; if I use the latest version, I still get 500 errors.
After rolling back a few versions I finally got it stable, but the problem still persisted even with an external MySQL on the correct version. I adopted the solution suggested by @vanpelt: instead of using the metadata object, I created a file for the metadata inside the artifact and changed my code accordingly.

On a separate note, what are the testing guidelines for the backend? I would like to include a performance test and keep it failing until this issue is solved, but since the repository is closed to the public I have no idea how to do it. The reason this case should be addressed is that I'm not the only person who uses metadata like this, and I'm not even running a production load, so I think a collaborative effort would benefit both sides.
@opsxcq all of our integration tests for the backend are in our internal closed-source repository. We do have an open-source repository for the wandb/local container, and I would be thrilled to collaborate on adding a benchmark or performance test for the existing built containers in that repository. This repository also has a number of tests that execute against a mocked-out or actual cloud/local backend, so we could potentially add something here. If you can provide a simple test script that emulates the behavior you saw, we can figure out the best place to add it.
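Something like the following could serve as a starting point for such a test script; it reproduces the shape of the metadata reported above (12 × 5633 × 5 keys), though the project name and exact sizes are assumptions:

```python
import wandb

# Build nested metadata with roughly the reported key counts.
meta = {
    f"top_{i}": {f"mid_{j}": {f"leaf_{k}": k for k in range(5)}
                 for j in range(5633)}
    for i in range(12)
}

run = wandb.init(project="metadata-benchmark")  # hypothetical project
artifact = wandb.Artifact("big-metadata", type="dataset", metadata=meta)
run.log_artifact(artifact)  # expected to time out on affected backends
run.finish()
```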
```
wandb --version && python --version && uname
wandb, version 0.8.21
Python 3.6.9
Linux
```

What I Did

Ran a sweep over a parameter space with more than ~15 million combinations and got a timeout; the full traceback is in the attached debug.log.