Read timed out when the total number of combinations of tunable parameters exceeds about 15 million #780
I found that it is not a specific parameter nor the number of parameters that causes the problem, because I have tried some ablations and added many constant parameters. The total number of combinations of possible values is the problem. When the number of combinations exceeds 14~15 million, I get the error.

The reason I want to do so many runs is that I am trying to do a preliminary exploration of a large search space via tiny-scale experiments (1~2 mins/run), to rule out some possibilities first.
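For context on the numbers involved: the combination count is just the product of the per-parameter value counts, so it explodes quickly. A minimal sketch of the arithmetic (the parameter names and values below are hypothetical, not taken from this report):

```python
# Hypothetical sweep parameters; each extra parameter multiplies the total.
from math import prod  # Python 3.8+

sweep_parameters = {
    "lr": {"values": [1e-4, 3e-4, 1e-3, 3e-3]},       # 4 values
    "batch_size": {"values": [16, 32, 64, 128]},      # 4 values
    "dropout": {"values": [0.0, 0.1, 0.2, 0.3, 0.4]}, # 5 values
    # ...a dozen parameters like these reach tens of millions quickly
}

total = prod(len(p["values"]) for p in sweep_parameters.values())
print(total)  # the reported errors start around 14-15 million combinations
```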
Thanks for the report. We will look into this and figure out if we can handle this size of combinations or if we have to set some limits.
Any updates on this issue? I am getting it too:

```
wandb: Synced stoic-sweep-2417: https://app.wandb.ai/entity/proj/runs/dhcvvr71
2020-03-27 03:37:06,244 - wandb.wandb_agent - INFO - Cleaning up finished run: dhcvvr71
wandb: Network error (ReadTimeout), entering retry loop. See /opt/training/wandb/debug.log for full traceback.
```

When I check the debug.log, I see the same errors as above. Then after some time, I see them all marked as
Any updates on this issue? I am encountering the same problem.
Same problem here: "wandb: Network error (ReadTimeout), entering retry loop."
I was facing the same issue with bayes optimization and a large number of combinations. I switched to random search, reasoning that random should require constant compute to select the next configuration, and I do not seem to see timeouts. I have not done a ton of tests so far, but random seems not to have the issue.
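For anyone trying the same workaround, a hedged sketch of the change (parameter names and project are placeholders): random search draws each configuration independently, so its cost per suggestion stays roughly constant, while bayes has to consider all previous results before proposing the next point.

```python
import wandb

sweep_config = {
    "method": "random",  # was "bayes"; random proposes configs in constant time
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"values": [1e-4, 3e-4, 1e-3]},
        "batch_size": {"values": [32, 64, 128]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="my-project")  # hypothetical project
```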
This issue is stale because it has been open 60 days with no activity. |
Hey folks, the thread has gone stale.
Since there was no update on this ticket, I have not tried bayes recently; I'm assuming the bug has not been fixed.
The issue still persists on my side.
I am also having this issue.
We're working on improvements to our underlying sweep architecture to allow for large search spaces. In the meantime, unfortunately, the only solution is to reduce the number of tunable parameters in your sweep space.
@vanpelt I also have this issue when I'm not running sweeps and am just training a large model. Any ideas?
@prash-p we had a brief outage this morning. Our library should continue retrying in these cases. Can you share the output you saw in your terminal?
@vanpelt I've included the debug log here: #2039. When frozen, the log is this:
I'm facing a similar issue, but on the upload of artifacts which contain a considerable amount of metadata. Also worth mentioning that I'm running a self-hosted version of wandb.
Just a little bit of debugging from my side: even setting `WANDB_HTTP_TIMEOUT` to a large value, the timeout persists; meanwhile mysqld went crazy on the wandb server machine.
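Assuming the setting referenced above is the client's `WANDB_HTTP_TIMEOUT` environment variable (as the later replies suggest), a minimal sketch of raising it; the value is arbitrary:

```python
import os

# Must be set before wandb reads its environment configuration.
os.environ["WANDB_HTTP_TIMEOUT"] = "300"  # seconds

import wandb
run = wandb.init(project="my-project")  # hypothetical project
```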
@opsxcq when running wandb/local you'll want to configure an external MySQL database so you can scale it according to your needs. It looks like you're attempting to make a very expensive query and the resources provided to the docker container simply aren't enough for the job to complete. As a near-term fix, are you able to increase the resources available to the container? The long-term fix is provisioning a MySQL database, ideally in one of the clouds, then dumping the database inside of the container and exporting it to a database you can scale with your workloads.
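A rough sketch of what pointing wandb/local at external services can look like. The `MYSQL` and `BUCKET` variables follow wandb/local's documented external-storage settings, but treat the exact names and URLs here as assumptions and check the linked docs:

```sh
# Hypothetical example: external MySQL and S3-compatible bucket for wandb/local.
docker run -d -p 8080:8080 \
  -e MYSQL="mysql://wandb:secret@db.example.com:3306/wandb" \
  -e BUCKET="s3://my-wandb-bucket" \
  --name wandb-local wandb/local
```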
Thanks for your answer @vanpelt; the machine that I'm using is a

This feels like a performance bug, with the addition of the fact that it fails even with huge timeouts; it keeps failing with `context deadline exceeded`. I was checking here:

```python
step_prepare = wandb.filesync.step_prepare.StepPrepare(
    self._api, 0.1, 0.01, 1000
)  # TODO: params
step_prepare.start()
```

which then calls

```python
class StepPrepare(object):
    """A thread that batches requests to our file prepare API.

    Any number of threads may call prepare_async() in parallel. The PrepareBatcher thread
    will batch requests up and send them all to the backend at once.
    """

    def __init__(self, api, batch_time, inter_event_time, max_batch_size):
        self._api = api
        self._inter_event_time = inter_event_time
        self._batch_time = batch_time
        self._max_batch_size = max_batch_size
        self._request_queue = queue.Queue()
        self._thread = threading.Thread(target=self._thread_body)
        self._thread.daemon = True
```

and, following inside the same file:

```python
def _gather_batch(self, first_request):
    batch_start_time = time.time()
    batch = [first_request]
    while True:
        try:
            request = self._request_queue.get(
                block=True, timeout=self._inter_event_time
            )
            if isinstance(request, RequestFinish):
                return True, batch
            batch.append(request)
            remaining_time = self._batch_time - (time.time() - batch_start_time)
            if remaining_time < 0 or len(batch) >= self._max_batch_size:
                break
        except queue.Empty:
            break
    return False, batch
```

Feels like the timeout is static in this case and not using `WANDB_HTTP_TIMEOUT`.
@opsxcq the `context deadline exceeded` won't be impacted by `WANDB_HTTP_TIMEOUT`. All requests to the backend must complete within 60 seconds regardless of the client's timeout setting. I would need to know more specifics about the action on your end that's causing the timeout. The issue is likely related to filesystem performance inside of the container. Just to reiterate: running the container without an external MySQL database and S3-compatible object store will never be performant enough for production workloads. You can learn more about configuring external file storage and MySQL here.
I'm running some benchmarks right now. I was uploading about 200k entries of metadata; I reduced it to 10% of the metadata to check what happens and still got the same error. About the 60-second timeout on the backend, where can I configure it?

I think it is pretty unlikely to be due to IO, since I could upload around 250gb of artifact data without metadata with no problems. It looks like something related to the metadata processing and/or the data being persisted on the database side. Sure, a faster database always helps, but it is unreasonable for such a small amount of data to require so much processing power, so please be open to the possibility of a performance bug.

This is not a production deployment; it is a homelab deployment which I use for learning your platform and for personal research, so there is only one client connected at a time. When it is production ready I can implement the same workflows at work, but I would expect this hardware to handle at least 50 clients with no issues.

If for performance reasons I have to store metadata outside the metadata area of the artefacts, I'm willing to help with the code and effort required to make it fast enough so that using metadata is possible. I would like to reiterate how much this feature helps in day-to-day usage, and how making it infeasible for real scenarios would lower your product's value and remove a huge selling point.
@opsxcq can you please share what you mean by metadata? If you could provide some example code that mimics your use case, we could reproduce on our end and see what's causing the performance issue.
```python
artifact = wandb.Artifact(event.name,
                          type='dataset',
                          description=event.description,
                          metadata=meta)
```

where meta is a nested dict:

```
>>> len(meta.keys())
12
>>> len(meta['x1'].keys())
5633
>>> len(meta['x1']['x2'].keys())
5
```

I had to censor what the fields mean, but multiplying them (12 × 5633 × 5 ≈ 338k) gives the number of unique keys used in the metadata for the artefact.
Yep, that's a massive amount of JSON to encode and store in the MySQL database. I would only put metadata into the artifact itself that you expect you'll need to filter on in the future. You should be able to just encode the metadata as JSON and write it to a file within your artifact, which will be much more performant and still let you see exactly what values were present for a given version. There's a bug in the MySQL JSON library that has O(n^3) complexity when serializing/de-serializing JSON that contains arrays, which might be the root cause here. In the future we may limit the amount of metadata persisted to the DB automatically and just store it directly in the artifact.
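A minimal sketch of this workaround, assuming a nested `meta` dict like the one described above; the project, artifact name, and filename are placeholders:

```python
import json
import wandb

meta = {"x1": {"x2": {"field": 1}}}  # stand-in for the large nested metadata

run = wandb.init(project="my-project")  # hypothetical project
artifact = wandb.Artifact(
    "my-dataset",
    type="dataset",
    metadata={"num_keys": len(meta)},  # keep only small, filterable metadata here
)
# Full metadata goes into a file inside the artifact instead of the DB.
with artifact.new_file("metadata.json", mode="w") as f:
    json.dump(meta, f)

run.log_artifact(artifact)
run.finish()
```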
In which version should I expect this fix? When will it be delivered? I see that the code for the backend is closed source and proprietary; is there any other way that I can apply this fix myself? I enjoyed the seamless integration of just accessing the metadata object, but given your answer, would the best practice for this case be serializing it to a shelf or JSON file and using it as an additional metadata file? Could this be implemented client side, by which I mean in this repository, so it could be transparent for the user? There are concerns about arbitrary filenames, as in
@opsxcq the bug is actually in MySQL 5.7; you could likely run an external MariaDB database to get around the JSON serialization bug. The only concern with arbitrary filenames is whether you'll write a file to your artifact in the future that would collide with it. Using a table could be a great fit for this; it would allow you to filter and navigate the metadata within the artifact. There's currently a 200k row limit for
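For the table alternative mentioned above, a hedged sketch (columns and rows are hypothetical); adding a `wandb.Table` to the artifact makes the values browsable and filterable in the UI without pushing them into the run metadata:

```python
import wandb

table = wandb.Table(columns=["field", "subfield", "value"],
                    data=[["x1", "x2", 5], ["x1", "x3", 7]])

artifact = wandb.Artifact("my-dataset", type="dataset")
artifact.add(table, "metadata")  # stored with the artifact, viewable as a table
```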
Just for future reference for those reading this issue: the ticket numbers from Oracle are Bug #103790, Bug #32919524 and Bug #28949700, fixed in MySQL 5.7.36, released 2021-10-19. These fixes are available in the current repository that the docker image uses.
I will be migrating the database now and will let you know the outcome.
wandb got stuck in a restart loop after the upgrade.
Pulling the latest build.
Now I will confirm whether this problem persists.
Now the client fails with error 500. I will update it to check whether the error continues.
I'm still struggling to make the wandb client and server communicate. After some rollbacks I got it working, but the bug persists; if I use the latest version, I still get 500 errors.
After rolling back a few versions I finally got it stable, but the problem still persisted even with an external MySQL on the correct version. I adopted the solution suggested by @vanpelt: instead of using the metadata object, I created a file for the metadata inside the artifact and changed my code accordingly.

On a separate note, what are the testing guidelines for the backend? I would like to include a performance test and keep it failing until this issue is solved, but since the repository is closed to the public I have no idea how to do it. The reason this case should be addressed is that I'm not the only person who uses metadata like this, and I'm not even running a production load, so I think a collaborative effort would benefit both sides.
@opsxcq all of our integration tests for the backend are in our internal closed-source repository. We do have an open-source repository for the wandb/local container, and I would be thrilled to collaborate on adding a benchmark or performance test for the existing built containers in that repository. This repository also has a number of tests that execute against a mocked-out or actual cloud/local backend, so we could potentially add something here. If you can provide a simple test script that emulates the behavior you saw, we can figure out the best place to add it.
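Something like the following could serve as a starting point for such a test script; it reproduces the shape of the metadata reported above (12 × 5633 × 5 keys), though the project name and exact sizes are assumptions:

```python
import wandb

# Build nested metadata with roughly the reported key counts.
meta = {
    f"top_{i}": {f"mid_{j}": {f"leaf_{k}": k for k in range(5)}
                 for j in range(5633)}
    for i in range(12)
}

run = wandb.init(project="metadata-benchmark")  # hypothetical project
artifact = wandb.Artifact("big-metadata", type="dataset", metadata=meta)
run.log_artifact(artifact)  # expected to time out on affected backends
run.finish()
```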
```
wandb --version && python --version && uname
wandb, version 0.8.21
Python 3.6.9
Linux
```

What I Did

Ran a sweep over a parameter space with more than ~15 million combinations and got a timeout; the full traceback is in the attached debug.log.