huggingface-cli downloads leading to excessive filesystem fragmentation on Windows #3034

Open
@drhead

Describe the bug

For some reason I appear to be getting excessive fragmentation from freshly downloaded files:

Image

This is right after a defrag and consolidation. I know for a fact that there were enough contiguous free regions on disk for every file in this dataset to be stored contiguously, but that doesn't seem to be happening.

If the downloader isn't pre-allocating space for the whole file before writing it out, that is probably the cause. Watching disk activity while it writes files, I see disk queue lengths above 1, which is longer than I would expect when writing a single large file.
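To make the pre-allocation idea concrete, here is a minimal sketch. It is not huggingface_hub's actual download code; `write_preallocated` and its parameters are hypothetical names. It assumes the final size is known before writing starts (for example from a `Content-Length` header). On NTFS, extending a file this way should reserve its clusters before the data arrives, but whether that actually reduces fragmentation in this case is untested. On many POSIX filesystems the same call only creates a sparse file.

```python
from typing import Iterable


def write_preallocated(path: str, chunks: Iterable[bytes], expected_size: int) -> int:
    """Extend the file to expected_size up front, then fill it sequentially.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    with open(path, "wb") as f:
        # Set the file length before any data is written so the filesystem
        # can pick one region for the whole file instead of growing it
        # piece by piece.
        f.truncate(expected_size)
        f.seek(0)

        written = 0
        for chunk in chunks:
            f.write(chunk)
            written += len(chunk)

        # If fewer (or more) bytes arrived than expected, make the final
        # length match what was actually written.
        f.truncate(written)
    return written
```

The final `truncate` keeps an interrupted or short download from leaving a zero-filled tail that looks like valid data.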

Reproduction

Easiest way to measure the issue: first, run defrag <drive letter>: /A /V and note the fragmented space percentage/number of fragmented files to get a baseline.

Download any dataset containing modestly large files on Windows, ideally to a drive that has been in use and has had files deleted, so that its free space is split into several non-contiguous large chunks. In my case, I downloaded https://huggingface.co/datasets/dalle-mini/open-images, and the files ended up fragmented into dozens or hundreds of pieces.

Afterwards, run defrag <drive letter>: /A /V again and note that the number of fragmented files probably increased.
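The before/after measurement can be scripted so the two reports are saved and compared side by side. This is only a sketch: `defrag_command` and `save_report` are hypothetical helper names, the script assumes it runs on Windows from an elevated prompt (defrag needs administrator rights), and it stores the raw report text rather than parsing it.

```python
import subprocess
from pathlib import Path


def defrag_command(drive_letter: str) -> list[str]:
    """Build the analysis-only defrag command from the steps above."""
    letter = drive_letter.strip().rstrip(":").upper()
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError(f"not a drive letter: {drive_letter!r}")
    # /A analyzes without defragmenting; /V prints the verbose report.
    return ["defrag", f"{letter}:", "/A", "/V"]


def save_report(drive_letter: str, out_file: str) -> str:
    """Run the analysis and store the raw report text in out_file."""
    result = subprocess.run(
        defrag_command(drive_letter),
        capture_output=True,
        text=True,
        check=True,
    )
    Path(out_file).write_text(result.stdout, encoding="utf-8")
    return result.stdout
```

Calling `save_report("D", "before.txt")` before the download and `save_report("D", "after.txt")` afterwards leaves two files whose fragmented-file counts can be compared by eye or with a diff tool.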

Logs

System info

- huggingface_hub version: 0.29.1
- Platform: Windows-10-10.0.22631-SP0
- Python version: 3.10.8
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: C:\Users\<snip>\.cache\huggingface\token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: 2.12.0
- Torch: 2.6.0+cu126
- Jinja2: 3.1.2
- Graphviz: N/A
- keras: 2.12.0
- Pydot: N/A
- Pillow: 9.0.0
- hf_transfer: 0.1.9
- gradio: 3.31.0
- tensorboard: N/A
- numpy: 1.23.5
- pydantic: 1.10.2
- aiohttp: 3.8.1
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: C:\Users\<snip>\.cache\huggingface\hub
- HF_ASSETS_CACHE: C:\Users\<snip>\.cache\huggingface\assets
- HF_TOKEN_PATH: C:\Users\<snip>\.cache\huggingface\token
- HF_STORED_TOKENS_PATH: C:\Users\<snip>\.cache\huggingface\stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
