
Read timed out when total combination of tunable parameters exceed about 15 million #780

Open
richarddwang opened this issue Jan 18, 2020 · 34 comments
Labels
c:sweeps (Component: Sweeps) · ty:bug (type of the issue is a bug)

Comments

@richarddwang

richarddwang commented Jan 18, 2020

wandb --version && python --version && uname

wandb, version 0.8.21
Python 3.6.9
Linux

What I Did

wandb sweep sweep.yaml

method: grid
metric:
  name: val_acc
  goal: minimize
parameters:
  setting:
    distribution: categorical
    values:
      - stack_ffn
      - act_pkm
      - stack_encdec_ffn
  q_linear:
    distribution: categorical
    values:
      - true
      - false
  k_linear:
    distribution: categorical
    values:
      - true
      - false
  v_linear:
    distribution: categorical
    values:
      - true
      - false
  o_linear:
    distribution: categorical
    values:
      - true
      - false
  q_norm:
    distribution: categorical
    values:
      - true
      - false
  k_norm:
    distribution: categorical
    values:
      - true
      - false
  v_norm:
    distribution: categorical
    values:
      - true
      - false
  inner_norm:
    distribution: categorical
    values:
      - true
      - false
  norm_way:
    distribution: categorical
    values:
      - C
      - CL
  q_activ:
    distribution: categorical
    values:
      - no
      - softmax
      - sparsemax
  k_activ:
    distribution: categorical
    values:
      - no
      - softmax
      - sparsemax
  v_activ:
    distribution: categorical
    values:
      - no
      - softmax
      - sparsemax
  inner_activ:
    distribution: categorical
    values:
      - no
      - softmax
      - sparsemax
  proj_share:
    distribution: categorical
    values:
      - qk
      - qv
      - kv
      - qkv
      - no
  proj_way:
    distribution: categorical
    values:
      - ->head
      - head->
      - head->_share
  relative:
    distribution: categorical
    values:
      - true
      - false
  q_downscale:
    distribution: categorical
    values:
      - true
      - false
  k_downscale:
    distribution: categorical
    values:
      - true
      - false
  v_downscale:
    distribution: categorical
    values:
      - true
      - false
  inner_downscale:
    distribution: categorical
    values:
      - true
      - false
  inner_mul:
    distribution: categorical
    values:
      - QK
      - KV

and got a timeout:

Network error (ReadTimeout), entering retry loop. See /home/shulie8518/Workspace/Review_Attention/wandb/debug.log for full traceback.

debug.log

2020-01-19 16:01:03,860 DEBUG   MainThread:31362 [connectionpool.py:_new_conn():824] Starting new HTTPS connection (1): api.wandb.ai
2020-01-19 16:01:15,154 DEBUG   MainThread:31362 [connectionpool.py:_new_conn():824] Starting new HTTPS connection (1): api.wandb.ai
2020-01-19 16:01:27,885 DEBUG   MainThread:31362 [connectionpool.py:_new_conn():824] Starting new HTTPS connection (1): api.wandb.ai
2020-01-19 16:01:38,038 ERROR   MainThread:31362 [retry.py:__call__():108] Retry attempt failed:
Traceback (most recent call last):
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1344, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/util/retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
    raise value
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 389, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
    raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/wandb/retry.py", line 95, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/wandb/apis/internal.py", line 110, in execute
    return self.client.execute(*args, **kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/gql/transport/requests.py", line 38, in execute
    request = requests.post(self.url, **post_args)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/requests/api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/home/shulie8518/VirtualEnvironment/py37/lib/python3.7/site-packages/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='api.wandb.ai', port=443): Read timed out. (read timeout=10)
2020-01-19 16:01:42,484 DEBUG   MainThread:31362 [connectionpool.py:_new_conn():824] Starting new HTTPS connection (1): api.wandb.ai
2020-01-19 16:02:00,755 DEBUG   MainThread:31362 [connectionpool.py:_new_conn():824] Starting new HTTPS connection (1): api.wandb.ai
2020-01-19 16:02:27,699 DEBUG   MainThread:31362 [connectionpool.py:_new_conn():824] Starting new HTTPS connection (1): api.wandb.ai
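For context, the grid implied by the sweep config above is the product of the per-parameter value counts (roughly 1.2e8 combinations for this file). A minimal sketch of computing it, assuming the config is saved as sweep.yaml and PyYAML is installed:

# Minimal sketch: estimate the number of grid combinations in a sweep config.
# Assumes every parameter uses a "values" list, as in the config above.
import functools
import operator
import yaml

with open("sweep.yaml") as f:
    config = yaml.safe_load(f)

sizes = [len(spec["values"]) for spec in config["parameters"].values()]
total = functools.reduce(operator.mul, sizes, 1)
print(f"{total:,} grid combinations")  # ~119 million for the config above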
@richarddwang richarddwang changed the title Read timed out when many parameters assigned to sweep [Bug?] Read timed out when many parameters assigned to sweep Jan 19, 2020
@richarddwang richarddwang changed the title [Bug?] Read timed out when many parameters assigned to sweep [Bug][Sweep] Read timed out when total combination of tunable parameters exceed about 15 million Jan 19, 2020
@richarddwang
Author

I found that it is not a specific parameter nor the number of parameters that causes the problem, because I tried some ablations and added many constant parameters. The total number of combinations of possible values is the problem: when the number of combinations exceeds 14-15 million, I get the error.

@richarddwang
Author

The reason I want to do so many runs is that I am trying to do a preliminary exploration of a large search space with tiny-scale experiments (1-2 min/run), to rule out some possibilities first.
I really like sweeps; I hope you can fix this problem.

@raubitsj
Member

Thanks for the report. We will look into this and figure out if we can handle this many combinations or if we have to set some limits.

@ktobah
Contributor

ktobah commented Mar 27, 2020

Any updates on this issue? I am getting it too.
I can actually see from the Docker log that my runs finish just fine and reach this point:

wandb: Synced stoic-sweep-2417: https://app.wandb.ai/entity/proj/runs/dhcvvr71
2020-03-27 03:37:06,244 - wandb.wandb_agent - INFO - Cleaning up finished run: dhcvvr71
wandb: Network error (ReadTimeout), entering retry loop. See /opt/training/wandb/debug.log for full traceback.

When I check debug.log, I see the same errors as above. Then, after some time, I see all the runs marked as crashed in the wandb cloud.

@Bargsteen

Any updates on this issue? I am encountering the same problem.

@hongliny

Same problem here: "wandb: Network error (ReadTimeout), entering retry loop."

@fcampagne

I was facing the same issue with bayes optimization and a large number of combinations. I switched to random, reasoning that random should require constant compute to select the next configuration, and I no longer seem to see timeouts. I have not done a ton of tests so far, but random seems not to have the issue.
I will try to use random in large search spaces and seed the next sweep with the results when I switch to bayes.
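A minimal sketch of this workaround: the same style of parameter space, defined as a Python dict with method switched from grid to random (the project name and the truncated parameter list are placeholders):

# Minimal sketch: same style of search space, sampled randomly instead of
# enumerated as a grid. Project name is a placeholder.
import wandb

sweep_config = {
    "method": "random",  # instead of "grid"
    "metric": {"name": "val_acc", "goal": "minimize"},
    "parameters": {
        "q_linear": {"values": [True, False]},
        "k_linear": {"values": [True, False]},
        "q_activ": {"values": ["no", "softmax", "sparsemax"]},
        # ... remaining parameters as in the YAML above
    },
}

sweep_id = wandb.sweep(sweep_config, project="my-project")
print(sweep_id)  # pass this id to `wandb agent`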

@github-actions

This issue is stale because it has been open 60 days with no activity.

@github-actions github-actions bot added the stale label Dec 20, 2020
@ariG23498
Contributor

Hey folks, the thread has gone stale.
It would be awesome to know whether this issue still persists for you.

@fcampagne

fcampagne commented Feb 22, 2021

Since there was no update on this ticket, I have not tried bayes recently. I assume the bug has not been fixed.

@ariG23498 ariG23498 added the ty:bug type of the issue is a bug label Feb 22, 2021
@ariG23498 ariG23498 changed the title [Bug][Sweep] Read timed out when total combination of tunable parameters exceed about 15 million Read timed out when total combination of tunable parameters exceed about 15 million Feb 22, 2021
@ariG23498 ariG23498 added the c:sweeps Component: Sweeps label Feb 22, 2021
@hongliny

The issue still persists on my side

@prash-p

prash-p commented Apr 6, 2021

I am also having this issue.
Wandb: 0.10.25
Python: 3.7.10 and 3.6.9

@vanpelt
Contributor

vanpelt commented Apr 6, 2021

We're working on improvements to our underlying sweep architecture to allow for large search spaces. In the meantime, unfortunately the only solution is to reduce the number of tunable parameters in your sweep space.
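A minimal sketch of that mitigation, pinning some parameters to single values so they no longer multiply the grid size (which parameters to pin is purely illustrative):

# Minimal sketch: shrink the grid by pinning some parameters to constants.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_acc", "goal": "minimize"},
    "parameters": {
        # still tuned
        "setting": {"values": ["stack_ffn", "act_pkm", "stack_encdec_ffn"]},
        "proj_share": {"values": ["qk", "qv", "kv", "qkv", "no"]},
        # pinned to a single value -> contributes a factor of 1 to the grid
        "q_linear": {"value": True},
        "relative": {"value": False},
    },
}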

@prash-p

prash-p commented Apr 6, 2021

@vanpelt I also have this issue when I'm not running sweeps and just training a large model. Any ideas?

@vanpelt
Contributor

vanpelt commented Apr 6, 2021

@prash-p we had a brief outage this morning. Our library should continue retrying in these cases. Can you share the output you saw in your terminal?

@prash-p

prash-p commented Apr 8, 2021

@vanpelt I've included the debug log here: #2039
Essentially my program just freezes and there is no further output. I just had this issue again.

When it is frozen, the log shows:

2021-04-08 10:03:59,181 DEBUG   SenderThread:7111 [sender.py:send():160] send: history
2021-04-08 10:03:59,181 DEBUG   SenderThread:7111 [sender.py:send():160] send: summary
2021-04-08 10:03:59,210 INFO    SenderThread:7111 [sender.py:_save_file():781] saving file wandb-summary.json with policy end
2021-04-08 10:04:01,204 DEBUG   HandlerThread:7111 [handler.py:handle_request():120] handle_request: status
2021-04-08 10:04:01,205 DEBUG   SenderThread:7111 [sender.py:send():160] send: request
2021-04-08 10:04:01,205 DEBUG   SenderThread:7111 [sender.py:send_request():169] send_request: status
2021-04-08 10:04:11,581 DEBUG   SenderThread:7111 [sender.py:send():160] send: stats
2021-04-08 10:04:16,268 DEBUG   HandlerThread:7111 [handler.py:handle_request():120] handle_request: status
2021-04-08 10:04:16,269 DEBUG   SenderThread:7111 [sender.py:send():160] send: request
2021-04-08 10:04:16,269 DEBUG   SenderThread:7111 [sender.py:send_request():169] send_request: status
2021-04-08 10:04:31,352 DEBUG   HandlerThread:7111 [handler.py:handle_request():120] handle_request: status
2021-04-08 10:04:31,352 DEBUG   SenderThread:7111 [sender.py:send():160] send: request
2021-04-08 10:04:31,352 DEBUG   SenderThread:7111 [sender.py:send_request():169] send_request: status
2021-04-08 10:04:42,157 DEBUG   SenderThread:7111 [sender.py:send():160] send: stats
2021-04-08 10:04:46,428 DEBUG   HandlerThread:7111 [handler.py:handle_request():120] handle_request: status
2021-04-08 10:04:46,429 DEBUG   SenderThread:7111 [sender.py:send():160] send: request
2021-04-08 10:04:46,429 DEBUG   SenderThread:7111 [sender.py:send_request():169] send_request: status
2021-04-08 10:05:01,509 DEBUG   HandlerThread:7111 [handler.py:handle_request():120] handle_request: status
2021-04-08 10:05:01,510 DEBUG   SenderThread:7111 [sender.py:send():160] send: request

@opsxcq

opsxcq commented Nov 22, 2021

I'm facing a similar issue, but when uploading artifacts that contain a considerable amount of metadata. It is also worth mentioning that I'm running a self-hosted version of wandb.

@opsxcq

opsxcq commented Nov 22, 2021

A little bit of debugging from my side: even after setting WANDB_HTTP_TIMEOUT=600, I am still facing an issue, this time:

Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "xxx/lib/python3.7/site-packages/wandb/filesync/step_prepare.py", line 42, in _thread_body
    prepare_response = self._prepare_batch(batch)
  File "xxx/lib/python3.7/site-packages/wandb/filesync/step_prepare.py", line 92, in _prepare_batch
    return self._api.create_artifact_files(file_specs)
  File "xxx/lib/python3.7/site-packages/wandb/apis/normalize.py", line 62, in wrapper
    six.reraise(CommError, CommError(message, err), sys.exc_info()[2])
  File "xxx/lib/python3.7/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "xxx/lib/python3.7/site-packages/wandb/apis/normalize.py", line 24, in wrapper
    return func(*args, **kwargs)
  File "xxx/lib/python3.7/site-packages/wandb/sdk/internal/internal_api.py", line 2231, in create_artifact_files
    "artifactFiles": [af for af in artifact_files],
  File "xxx/lib/python3.7/site-packages/wandb/sdk/lib/retry.py", line 102, in __call__
    result = self._call_fn(*args, **kwargs)
  File "xxx/lib/python3.7/site-packages/wandb/sdk/internal/internal_api.py", line 130, in execute
    return self.client.execute(*args, **kwargs)
  File "xxx/lib/python3.7/site-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 54, in execute
    raise Exception(str(result.errors[0]))
wandb.errors.CommError: context deadline exceeded

Meanwhile, mysqld went crazy on the wandb server machine.
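For reference, the WANDB_HTTP_TIMEOUT override mentioned above would be set roughly like this (a minimal sketch; as noted further down in the thread, it does not change the 60-second server-side deadline):

# Minimal sketch: raise the client-side HTTP timeout before wandb is imported.
import os
os.environ["WANDB_HTTP_TIMEOUT"] = "600"  # seconds

import wandb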

@vanpelt
Contributor

vanpelt commented Nov 22, 2021

@opsxcq when running wandb/local you'll want to configure an external MySQL database so you can scale it according to your needs. It looks like you're attempting to make a very expensive query, and the resources provided to the Docker container simply aren't enough for the job to complete. As a near-term fix, are you able to increase the resources available to the container? The long-term fix is provisioning a MySQL database, ideally in one of the clouds, then dumping the database from inside the container and loading it into a database you can scale with your workloads.

@opsxcq

opsxcq commented Nov 23, 2021

Thanks for your answer, @vanpelt. The machine I'm using is a PRIMERGY RX300 S7 with 2 sockets of 12 cores and 256 GB of RAM. I didn't add any additional configuration or performance restriction to Docker for wandb/local. The machine is shared with some other applications, but nothing is really using it (load < 5 all the time).

This feels like a performance bug, compounded by the fact that it fails even with huge timeouts: it keeps failing with context deadline exceeded well before it would reach the timeout configured in WANDB_HTTP_TIMEOUT.

I was checking sdk/interface/artifacts.py:

        step_prepare = wandb.filesync.step_prepare.StepPrepare(
            self._api, 0.1, 0.01, 1000
        )  # TODO: params
        step_prepare.start()

which then calls filesync/step_prepare.py:

class StepPrepare(object):
    """A thread that batches requests to our file prepare API.

    Any number of threads may call prepare_async() in parallel. The PrepareBatcher thread
    will batch requests up and send them all to the backend at once.
    """

    def __init__(self, api, batch_time, inter_event_time, max_batch_size):
        self._api = api
        self._inter_event_time = inter_event_time
        self._batch_time = batch_time
        self._max_batch_size = max_batch_size
        self._request_queue = queue.Queue()
        self._thread = threading.Thread(target=self._thread_body)
        self._thread.daemon = True

and it continues in the same file with:

    def _gather_batch(self, first_request):
        batch_start_time = time.time()
        batch = [first_request]
        while True:
            try:
                request = self._request_queue.get(
                    block=True, timeout=self._inter_event_time
                )
                if isinstance(request, RequestFinish):
                    return True, batch
                batch.append(request)
                remaining_time = self._batch_time - (time.time() - batch_start_time)
                if remaining_time < 0 or len(batch) >= self._max_batch_size:
                    break
            except queue.Empty:
                break
        return False, batch

It feels like these timeouts (batch_time=0.1, inter_event_time=0.01, max_batch_size=1000 in the constructor call above) are hard-coded in this case and do not use WANDB_HTTP_TIMEOUT.

@vanpelt
Contributor

vanpelt commented Nov 24, 2021

@opsxcq the context deadline exceeded error won't be affected by WANDB_HTTP_TIMEOUT. All requests to the backend must complete within 60 seconds regardless of the client's timeout setting. I would need to know more specifics about the action on your end that's causing the timeout.

The issue is likely related to filesystem performance inside the container. Just to reiterate: running the container without an external MySQL database and an S3-compatible object store will never be performant enough for production workloads. You can learn more about configuring external file storage and MySQL here.

@opsxcq

opsxcq commented Nov 24, 2021

I'm running some benchmarks right now. I was uploading about 200k metadata entries; I reduced that to 10% of the metadata to check what happens and still got the same error. About the 60-second timeout on the backend: where can I configure it? I think it is pretty unlikely to be due to I/O, since I could upload around 250 GB of artifact data without metadata with no problems.

It looks like something related to the metadata processing and/or the data being persisted on the database side. Sure, a faster database always helps, but it is unreasonable for such a small amount of data to require so much processing power; please be open to the possibility of a performance bug.

This is not a production deployment; it is a homelab deployment that I use for learning your platform and for personal research, which means there is only one client connected at a time. Once it is production-ready I can implement the same workflows at work, but I would expect this hardware to handle at least 50 clients with no issues.

If, for performance reasons, I have to store metadata outside the metadata area of the artifacts, I'm willing to help with the code and effort required to make it fast enough that using metadata is possible. I would like to reiterate how much this feature helps in day-to-day usage, and how making it unable to handle real scenarios would lower your product's value and remove a huge selling point.

@vanpelt
Contributor

vanpelt commented Nov 24, 2021

@opsxcq can you please share what you mean by metadata? If you could provide some example code that mimics your use case we could reproduce on our end and see what's causing the performance issue.

@opsxcq

opsxcq commented Nov 24, 2021

        artifact = wandb.Artifact(event.name,
                                  type='dataset',
                                  description=event.description,
                                  metadata=meta)

Where meta is a dict with about 200k keys when flattened; the original structure contains nested dicts. It is passed as a parameter. Below are some statistics about the meta object referenced in the code above, and an analysis of how it looks at runtime:

>>> len(meta.keys())
12
>>> len(meta['x1'].keys())
5633
>>> len(meta['x1']['x2'].keys())
5

I had to censor what the fields mean, but multiplying them gives:

>>> len(meta.keys()) * len(meta['x1'].keys()) * len(meta['x1']['x2'].keys())
337980

This is the number of unique keys used in the metadata for the artifact.
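For clarity, the "flattened" key count mentioned above can be obtained with a small recursive helper (an illustrative sketch, not code from the thread):

# Minimal sketch: count the leaf keys of a nested metadata dict.
def count_leaves(d):
    if isinstance(d, dict):
        return sum(count_leaves(v) for v in d.values())
    return 1

meta = {"x1": {"x2": {"a": 1, "b": 2}}}  # stands in for the real metadata
print(count_leaves(meta))  # 2 for this toy example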

@vanpelt
Contributor

vanpelt commented Nov 24, 2021

Yep, that's a massive amount of JSON to encode and store in the MySQL database. I would only put metadata into the artifact itself that you expect you'll need to filter by in the future. You should be able to just encode the metadata as JSON and write it to a file within your artifact, which will be much more performant and still let you see exactly what values were present for a given version.

There's a bug in the MySQL JSON library that has O(n^3) complexity when serializing/deserializing JSON that contains arrays, which might be the root cause here. In the future we may limit the amount of metadata persisted to the DB and just store it directly in the artifact automatically.
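A minimal sketch of this suggestion: keep only small, filterable fields in the artifact metadata and write the full dict to a JSON file inside the artifact (meta stands in for the large nested dict above; project, artifact, and file names are placeholders):

# Minimal sketch: store the bulk of the metadata as a file in the artifact.
import json
import wandb

meta = {"x1": {"x2": {"field": 1}}}  # stands in for the large nested dict

run = wandb.init(project="my-project")
artifact = wandb.Artifact("my-dataset", type="dataset",
                          metadata={"n_top_level_keys": len(meta)})  # small summary only

with artifact.new_file("metadata.json", mode="w") as f:
    json.dump(meta, f)  # full metadata travels with the artifact version

run.log_artifact(artifact)
run.finish()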

@opsxcq

opsxcq commented Nov 25, 2021

In which version should I expect this fix, and when will it be delivered? I see that the code for the backend is closed-source and proprietary; is there any other way I can apply the fix myself?

I enjoyed the seamless integration of just accessing the metadata object, but given your answer, would the best practice for this case be serializing it to a shelf or JSON file and using that as an additional metadata file? Could this be implemented on the client side, by which I mean in this repository, so it would be transparent to the user? Are there concerns about arbitrary filenames such as metadata.wandb or something like that? A related question: would an artifact of type wandb.data_types.Table be reasonable for this, or would it face the same performance issues?

@vanpelt
Contributor

vanpelt commented Nov 25, 2021

@opsxcq the bug is actually in MySQL 5.7; you could likely run an external MariaDB 5.7 database to get around the JSON serialization bug.

The only concern with arbitrary filenames is whether you'll write a file to your artifact in the future that would collide with it. Using a table could be a great fit for this; it would allow you to filter and navigate the metadata within the artifact. There's currently a 200k-row limit for wandb.Tables, but we're planning to lift that limit when we support serializing tabular data into Parquet files, hopefully early next year. wandb.Tables do not persist any data to MySQL, so this specific performance regression does not affect tables.
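A minimal sketch of the wandb.Table approach: flatten the nested metadata into rows and add the table to the artifact (the three-level flattening, column names, and project/artifact names are illustrative placeholders):

# Minimal sketch: store the metadata as a wandb.Table inside the artifact.
import wandb

meta = {"x1": {"x2": {"field": 1}}}  # stands in for the large nested dict

rows = [
    [k1, k2, k3, v]
    for k1, d1 in meta.items()
    for k2, d2 in d1.items()
    for k3, v in d2.items()
]

table = wandb.Table(columns=["level1", "level2", "level3", "value"], data=rows)

run = wandb.init(project="my-project")
artifact = wandb.Artifact("my-dataset", type="dataset")
artifact.add(table, "metadata")  # tables are stored as files, not in MySQL
run.log_artifact(artifact)
run.finish()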

@opsxcq

opsxcq commented Nov 25, 2021

Just for future reference for those reading this issue: the ticket numbers from Oracle are Bug #103790, Bug #32919524, and Bug #28949700, fixed in MySQL 5.7.36, released 2021-10-19. These fixes are available in the apt repository that the Docker image uses.

wandb@a4958675557a:~$ apt list --upgradable
Listing... Done
mysql-client-5.7/bionic-updates,bionic-security 5.7.36-0ubuntu0.18.04.1 amd64 [upgradable from: 5.7.35-0ubuntu0.18.04.2]
mysql-client-core-5.7/bionic-updates,bionic-security 5.7.36-0ubuntu0.18.04.1 amd64 [upgradable from: 5.7.35-0ubuntu0.18.04.2]
mysql-server-5.7/bionic-updates,bionic-security 5.7.36-0ubuntu0.18.04.1 amd64 [upgradable from: 5.7.35-0ubuntu0.18.04.2]
mysql-server-core-5.7/bionic-updates,bionic-security 5.7.36-0ubuntu0.18.04.1 amd64 [upgradable from: 5.7.35-0ubuntu0.18.04.2]
vim-common/bionic-updates,bionic-security 2:8.0.1453-1ubuntu1.7 all [upgradable from: 2:8.0.1453-1ubuntu1.6]
vim-tiny/bionic-updates,bionic-security 2:8.0.1453-1ubuntu1.7 amd64 [upgradable from: 2:8.0.1453-1ubuntu1.6]
xxd/bionic-updates,bionic-security 2:8.0.1453-1ubuntu1.7 amd64 [upgradable from: 2:8.0.1453-1ubuntu1.6]

I will migrate the database now and let you know the outcome.

@opsxcq

opsxcq commented Nov 25, 2021

wandb got stuck in a restart loop after the upgrade.

@opsxcq

opsxcq commented Nov 25, 2021

After pulling the latest build of the wandb/local image:

wandb@e778ed1cbe1e:~$ apt-cache policy mysql-server
mysql-server:
  Installed: 5.7.36-1ubuntu18.04
  Candidate: 5.7.36-1ubuntu18.04
  Version table:
     8.0.27-0ubuntu0.20.04.1 500
        500 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages
     8.0.19-0ubuntu5 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
 *** 5.7.36-1ubuntu18.04 1001
        500 http://repo.mysql.com/apt/ubuntu bionic/mysql-5.7 amd64 Packages
        100 /var/lib/dpkg/status

Now I will confirm if this problem persists.

@opsxcq

opsxcq commented Nov 26, 2021

Now the client fails with a 500 error. I will update it to check whether the error continues.

@opsxcq

opsxcq commented Dec 2, 2021

I'm still struggling to make the wandb client and server communicate. After some rollbacks I got it working, but the bug persists; if I use the latest version, I still get 500 errors.

@opsxcq

opsxcq commented Dec 8, 2021

After rolling back a few versions I finally got it stable, but the problem still persisted even with an external MySQL on the correct version. I adopted the solution suggested by @vanpelt: instead of using the metadata object, I created a metadata file inside the repository, and I switched to DVC for data management, which solved all the performance issues.

What are the testing guidelines for the backend? I would like to add a performance test and keep it failing until this issue is solved, but since the repository is closed to the public I have no idea how to do it. The reason this case should be addressed is that I'm not the only person who uses metadata like this, and I'm not even running a production load, so I think a collaborative effort would benefit both sides.

@vanpelt
Contributor

vanpelt commented Dec 8, 2021

@opsxcq all of our integration tests for the backend are in our internal closed-source repository. We do have an open-source repository for the wandb/local container. I would be thrilled to collaborate on adding a benchmark or performance test for the existing built containers in that repository. This repository also has a number of tests that execute against a mocked-out or actual cloud/local backend; we could potentially add something there. If you can provide a simple test script that emulates the behavior you saw, we can figure out the best place to add it.

@sydholl sydholl removed the stale label Jan 6, 2022