Weights download speedup #1354
Conversation
…ch faster (parallel chunks if possible and better buffer handling)
Had Claude give it a once-over. These were the most salient pieces of feedback:

**1. Threading Safety Violation - Lines 844-868**

```python
progress_lock = threading.Lock()
downloaded_bytes = 0

def download_chunk(start, end):
    nonlocal downloaded_bytes
    # ... downloads chunk ...
    with progress_lock:
        content_buffer.seek(current_pos)
        content_buffer.write(chunk)
        downloaded_bytes += len(chunk)
```

Problem: the write position (`current_pos`) is tracked outside the lock, so concurrent threads can seek and write with a stale position and corrupt the buffer.

Fix: Calculate position inside the lock:

```python
def download_chunk(start, end):
    nonlocal downloaded_bytes
    local_pos = start
    try:
        response = requests.get(wrap_url(url), headers=headers, stream=True, timeout=ROBOFLOW_API_REQUEST_TIMEOUT)
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                with progress_lock:
                    content_buffer.seek(local_pos)
                    content_buffer.write(chunk)
                    downloaded_bytes += len(chunk)
                local_pos += len(chunk)
    except requests.RequestException:
        raise  # handler truncated in the original paste
```

**2. Buffer Pre-allocation Bug - Lines 825-827**

```python
content_buffer.seek(total_size - 1)
content_buffer.write(b'\0')  # Pre-allocate buffer
```

Problem: This only allocates the last byte, not the entire buffer. Writing to middle positions on an empty BytesIO will fill gaps with zeros, corrupting binary data.

Fix: Pre-allocate properly:

```python
content_buffer = io.BytesIO(b'\0' * total_size)
```

**3. Missing Error Handling in Parallel Download**

The parallel download doesn't handle HTTP errors consistently with the rest of the codebase.

Fix: Add consistent error handling:

```python
def download_chunk(start, end):
    # ... existing code ...
    response = requests.get(wrap_url(url), headers=build_roboflow_api_headers(), stream=True, timeout=ROBOFLOW_API_REQUEST_TIMEOUT)
    api_key_safe_raise_for_status(response=response)
```
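For reference, here is a minimal self-contained sketch of how the three fixes compose. The names (`download_file_parallel`), the chunk size, and the worker count are illustrative, not the PR's actual implementation, and it assumes the server honors HTTP Range requests:

```python
import io
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

CHUNK_SIZE = 8 * 1024 * 1024  # illustrative 8 MB ranges
TIMEOUT = 60


def download_file_parallel(url: str, total_size: int, workers: int = 4) -> bytes:
    # Fix 2: pre-allocate the whole buffer so out-of-order writes land correctly
    content_buffer = io.BytesIO(b"\0" * total_size)
    progress_lock = threading.Lock()
    downloaded_bytes = 0

    def download_chunk(start: int, end: int) -> None:
        nonlocal downloaded_bytes
        local_pos = start  # Fix 1: per-thread write position, never shared state
        response = requests.get(
            url,
            headers={"Range": f"bytes={start}-{end}"},
            stream=True,
            timeout=TIMEOUT,
        )
        # Fix 3: fail loudly on HTTP errors (the PR uses an API-key-safe variant)
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                with progress_lock:
                    content_buffer.seek(local_pos)
                    content_buffer.write(chunk)
                    downloaded_bytes += len(chunk)
                local_pos += len(chunk)

    ranges = [
        (start, min(start + CHUNK_SIZE, total_size) - 1)
        for start in range(0, total_size, CHUNK_SIZE)
    ]
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(download_chunk, s, e) for s, e in ranges]
        for future in futures:
            future.result()  # re-raises any exception from a worker thread
    return content_buffer.getvalue()
```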
⚡️ Codeflash found optimizations for this PR
📄 24% (0.24x) speedup for (…download-speedup`)

Here is the optimized version of your code, rewritten for better runtime and memory performance.

**Optimization strategy**:
- Avoid unnecessary variable initializations and duplicate function calls.
- Minimize expensive string formatting and JSON parsing.
- Reduce `time.time()` calls inside tight loops.
- Pre-calculate constants.
- Reuse byte counters; only log when necessary.
- In log calls, check the debug level early to avoid unnecessary string formatting.

**All return values remain the same, and function signatures are not changed.**
**All comments are preserved or updated if code has changed.**

**Summary of changes:**
- Avoid recomputing constants, unnecessary branching, and duplicate string formatting when the log level is not enabled.
- Only call `time.time()` and perform Mbps calculations when logging.
- Minor clean-ups for clarity.
- Final response construction is unchanged.
- Comments updated as needed, originals preserved.

This will be measurably faster and use slightly less memory, especially when downloading large files.
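The two loop-level techniques Codeflash describes look roughly like this. This is a sketch, not the generated diff; the `logger` name and the 10% reporting threshold are assumptions:

```python
import logging
import time

logger = logging.getLogger(__name__)


def report_progress(downloaded: int, total: int, started_at: float, last_pct: int) -> int:
    # Skip time.time() and the Mbps math entirely unless debug logging is on
    if not logger.isEnabledFor(logging.DEBUG):
        return last_pct
    pct = downloaded * 100 // total
    if pct >= last_pct + 10:  # log at 10% intervals only
        elapsed = time.time() - started_at
        mbps = (downloaded * 8 / 1_000_000) / elapsed if elapsed > 0 else 0.0
        # lazy %-formatting: the string is only built if the record is emitted
        logger.debug("downloaded %d%% (%.1f Mbps)", pct, mbps)
        return pct
    return last_pct
```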
@PawelPeczek-Roboflow added similar functionality here: https://github.com/roboflow/inference/blob/feature/inference-v1-models/inference_experimental/inference_exp/utils/download.py. I'd vote for porting that change, since it does the same thing and the plan is for it to replace the current solution.
Testing this on staging, @grzegorz-roboflow observed the following error:

So we will need to add retry and backoff logic before merging this change. I will try to implement that, along with streaming directly to file, and test it with the serverless filesystem on staging as well.
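One possible shape for the retry-and-backoff wrapper around each chunk request, sketched below; the attempt count and delays are placeholders, not agreed-upon values:

```python
import time

import requests


def get_with_retries(url: str, headers: dict, max_attempts: int = 5, base_delay: float = 0.5) -> requests.Response:
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, headers=headers, stream=True, timeout=60)
            response.raise_for_status()
            return response
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # exponential backoff: 0.5s, 1s, 2s, 4s, ...
            time.sleep(base_delay * (2 ** attempt))
```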
Description
This adds a new download method to be used for downloading weights. I was basically unable to download the perception encoder weights (2.5 GB) or even the CLIP weights (250 MB) because the download was so slow: with naive chunking I measured ~5 Mbps and slower (although that chunking may have been worse than what requests does by default). Either way, with the previous implementation I waited 30 minutes and the weights still weren't done; now they download in seconds.
This implements parallel or serialized chunked downloads into a BytesIO buffer, and adds some basic logging of file size when model weights are fetched (with debug logs enabled it also prints download progress and speed at 10% intervals).
Ideally we wouldn't need to buffer the entire weights file in memory at all, but we can't just stream to a file: we at least need to be careful about cache implementation differences across environments, so that will need some more refactoring. A rough sketch of that direction follows below.
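For the streaming follow-up, one possible shape is to pre-allocate the target file and have each worker write its byte range directly, so nothing is held in memory. This is a sketch only, and it sidesteps the cache-implementation differences mentioned above rather than solving them:

```python
import requests


def preallocate(path: str, total_size: int) -> None:
    with open(path, "wb") as f:
        f.truncate(total_size)  # sparse pre-allocation on most filesystems


def download_range_to_file(url: str, path: str, start: int, end: int) -> None:
    response = requests.get(
        url, headers={"Range": f"bytes={start}-{end}"}, stream=True, timeout=60
    )
    response.raise_for_status()
    # each worker opens its own handle, so no lock is needed for writes
    with open(path, "r+b") as f:
        f.seek(start)
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
```

Call `preallocate` once before dispatching the range workers, so every `r+b` open finds the file already at its final size.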
Type of change
maintenance / speed improvement
How has this change been tested, please provide a testcase or example of how you tested the change?
locally
Any specific deployment considerations
no
Docs