Description
Describe the bug
For some reason, freshly downloaded files appear to end up excessively fragmented.
This is right after a defrag and consolidation. I know for a fact that there were enough contiguous free regions on disk for every file in this dataset to be stored contiguously, but that doesn't seem to be happening.
If the downloader isn't pre-allocating space for the whole file before writing it out, that is probably the cause. When I watch disk activity while it is writing files, the disk queue lengths are longer than I would expect for a single large file being written (they stay above 1 during file writes).
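To illustrate what I mean by pre-allocating, here is a minimal sketch. It is not taken from huggingface_hub; `preallocate` and `write_preallocated` are hypothetical names, and it assumes the final size is known up front (for example from a Content-Length header). My understanding is that extending the file with `truncate` on Windows makes the filesystem allocate the full length before any data arrives, but whether that allocation ends up contiguous is still up to the filesystem.

```python
import os
from typing import Iterable


def preallocate(path: str, size: int) -> None:
    """Create `path` and reserve `size` bytes before any data is written.

    Hypothetical helper for illustration only, not huggingface_hub code.
    """
    with open(path, "wb") as f:
        if size <= 0:
            return
        if hasattr(os, "posix_fallocate"):
            try:
                # POSIX: ask the filesystem to allocate the blocks now.
                os.posix_fallocate(f.fileno(), 0, size)
                return
            except OSError:
                # Some filesystems do not support it; fall back below.
                pass
        # Windows (and fallback): extend the file to its final length.
        f.truncate(size)


def write_preallocated(path: str, chunks: Iterable[bytes], total_size: int) -> None:
    """Reserve the full size first, then stream the chunks into place."""
    preallocate(path, total_size)
    # "r+b" keeps the reserved length instead of truncating to zero.
    with open(path, "r+b") as f:
        for chunk in chunks:
            f.write(chunk)
        # Trim any reserved space that was not used (e.g. wrong size hint).
        f.truncate()
```

With something like this, the filesystem sees the final size before the first chunk arrives, rather than having to grow the file one write at a time.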
Reproduction
Easiest way to measure the issue: first, run defrag <drive letter>: /A /V
and note the fragmented space percentage/number of fragmented files to get a baseline.
Download any dataset containing modestly large files on Windows, ideally to a drive that has been in use and has had files deleted, so that its free space is split into several large non-contiguous chunks. In my case, I downloaded https://huggingface.co/datasets/dalle-mini/open-images, and the files ended up fragmented into dozens or hundreds of pieces.
Afterwards, run defrag <drive letter>: /A /V
again and compare against the baseline; the number of fragmented files has probably increased.
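If it helps anyone reproduce this, the before/after measurement can be scripted. This is only a rough sketch: `build_defrag_command` and `save_defrag_report` are hypothetical helper names, it has to run from an elevated prompt, and it just saves the raw report text so the two runs can be compared by eye (I am not assuming anything about the report's exact layout).

```python
import subprocess
from pathlib import Path


def build_defrag_command(drive_letter: str) -> list[str]:
    """Build the analysis command used in the steps above."""
    letter = drive_letter.strip().rstrip(":").upper()
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError(f"not a drive letter: {drive_letter!r}")
    # /A = analyze only, /V = verbose report
    return ["defrag", f"{letter}:", "/A", "/V"]


def save_defrag_report(drive_letter: str, out_path: str) -> str:
    """Run the analysis and save the raw report (Windows, elevated prompt)."""
    result = subprocess.run(
        build_defrag_command(drive_letter),
        capture_output=True,
        text=True,
        check=True,
    )
    Path(out_path).write_text(result.stdout, encoding="utf-8")
    return result.stdout
```

Calling `save_defrag_report("C", "before.txt")` before the download and `save_defrag_report("C", "after.txt")` afterwards gives two files to compare.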
Logs
System info
- huggingface_hub version: 0.29.1
- Platform: Windows-10-10.0.22631-SP0
- Python version: 3.10.8
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: C:\Users\<snip>\.cache\huggingface\token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: 2.12.0
- Torch: 2.6.0+cu126
- Jinja2: 3.1.2
- Graphviz: N/A
- keras: 2.12.0
- Pydot: N/A
- Pillow: 9.0.0
- hf_transfer: 0.1.9
- gradio: 3.31.0
- tensorboard: N/A
- numpy: 1.23.5
- pydantic: 1.10.2
- aiohttp: 3.8.1
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: C:\Users\<snip>\.cache\huggingface\hub
- HF_ASSETS_CACHE: C:\Users\<snip>\.cache\huggingface\assets
- HF_TOKEN_PATH: C:\Users\<snip>\.cache\huggingface\token
- HF_STORED_TOKENS_PATH: C:\Users\<snip>\.cache\huggingface\stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10