Skip to content

Xet upload workflow #2887

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 45 commits into from
Mar 12, 2025
Merged

Xet upload workflow #2887

merged 45 commits into from
Mar 12, 2025

Conversation

hanouticelina
Copy link
Contributor

@hanouticelina hanouticelina commented Feb 25, 2025

Partially resolves #2713

This PR contains only the Xet upload workflow implemented in xetpoc_huggingface_hub (internal).

The main branch for xet storage integration is xet-integration.

Main changes:

  • Make hf_xet available as an optional dependency via pip install huggingface_hub[hf_xet]
    Note: since it's a common part for download and upload, this has been pushed directly into the xet-integration branch.
  • Integrate changes from the xet poc for the download workflow only.
  • Add tests.

to try it in from this branch:

pip install -e ".[dev,hf_xet]"
export HF_DEBUG=1 #  if you want to set huggingface_hub logger to debug level
huggingface-cli upload --repo-type model rajatarya/xet-enable-test ~/data/yelp_sample

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hanouticelina hanouticelina requested review from Wauplin and bpronan March 4, 2025 15:58
@hanouticelina hanouticelina marked this pull request as ready for review March 4, 2025 15:58
Copy link
Collaborator

@bpronan bpronan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few missing doc strings, but otherwise looks great!

@hanouticelina
Copy link
Contributor Author

hanouticelina commented Mar 7, 2025

I just have a question about when Xet is enabled but user is not xet authorized. Wondering when could that happen and what the expected behavior here (my understanding is that currently it fails to upload without any workaround possible).

I think if xet is enabled in the repo and hf_xet is installed, then we should not fallback to http since it means that the user passed invalid credentials when calling GET /api/:repoType(models|spaces|datasets)/:namespace?/:repo/xet-write-token/:rev.
it's actually the same auth level as upload, delete and commit (see here: moon-landing/server/server.ts#L834C1-L835C1).

@Wauplin
Copy link
Contributor

Wauplin commented Mar 7, 2025

I think if xet is enabled in the repo and hf_xet is installed, then we should not fallback to http since it means that the user passed invalid credentials when calling GET /api/:repoType(models|spaces|datasets)/:namespace?/:repo/xet-write-token/:rev.
it's actually the same auth level as upload, delete and commit (see here:moon-landingserver/server.ts#L834C1-L835C1).

Ah yes, I forgot the workaround in this case would be to uninstall hf_xet if really they are not authorized to use it. But shouldn't happen except if doing something unexpected. Then it's all fine to raise the unauthorized error :)

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@hanouticelina
Copy link
Contributor Author

Let's wait for a final review from @bpronan and/or @rajatarya and then I think we should be good to merge! 🙂

Copy link
Collaborator

@bpronan bpronan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@bpronan
Copy link
Collaborator

bpronan commented Mar 7, 2025

I think if xet is enabled in the repo and hf_xet is installed, then we should not fallback to http since it means that the user passed invalid credentials when calling GET /api/:repoType(models|spaces|datasets)/:namespace?/:repo/xet-write-token/:rev.
it's actually the same auth level as upload, delete and commit (see here: moon-landing/server/server.ts#L834C1-L835C1).

This is 💯 correct. The user would only see this case (xetEnabled on repo + hf_xet installed + no access) if they don't have upload/commit access on the repo in the first place. In other words, they have to have write access to the repo to upload to a xet enabled repo with hf_xet installed. The earlier call to _fetch_upload_modes would have failed already by then.

Uninstalling hf_xet shouldn't help here because they can't write to the repo anyway.

@hanouticelina
Copy link
Contributor Author

I think we can merge this one to the xet-integration branch and then we can open another PR for any xet-specific part once we merge the If-None-Match path into main.

@hanouticelina hanouticelina merged commit 86de575 into xet-integration Mar 12, 2025
20 checks passed
@hanouticelina hanouticelina deleted the xet-upload-workflow branch March 12, 2025 18:28
@Wauplin Wauplin mentioned this pull request Mar 26, 2025
hanouticelina added a commit that referenced this pull request Mar 27, 2025
* add hf_xet as an optional dependency

* update installed packages at runtime

* split xet testing in CI

* fix workflow

* fix windows

* Xet download workflow (#2875)

* first draft

* remove comment

* hf_xet instead of xet

* update docstring

* fix

* update docstring

* simplify typing

* quality

* add logging

* fix tests

* add unit tests for xet utilities

* first draft of download testing

* more tests

* address some comments

* fix tests

* check if hf_xet is available or not

* remove unnecessary dest dir creation

* keep comment

Co-authored-by: Lucain <lucain@huggingface.co>

* post-review improvements

* Update tests/test_xet_download.py

---------

Co-authored-by: Lucain <lucain@huggingface.co>

* Add ability to enable/disable xet storage on a repo (#2893)

* add ability to enable/disable xet storage

* add test

* better way to check if all settings are none

* don't strip authorization header with downloading with xet

* update comment

* Xet upload workflow (#2887)

* add upload workflow

* fixes and tests

* use helper for prgress bar

* use tmp repo in tests

* some fixes

* update tests

* mock HF_XET_CACHE

* fix tests

* fix utils tests

* debug CI

* fix

* check if xet is enabled

* debug CI

* debug CI again

* revert

* debugging

* don't rerun xet tests

* revert

* remove pytest timeout

* don't run tests in parallel

* add comment

* revert and rename variable

* don't skip tests

* remove warning

* fix tests

* Apply suggestions from code review

* fixes

* fix syntax error with python 3.8

* catch Invalid credentials

* fix

* record Space API VCR test

* use raise instead of raise e

Co-authored-by: Lucain <lucain@huggingface.co>

* disable xet storage for the other tests

* reverting

* isolate xet tests for windows

* fix windows

* install hf_xet for xet testing

---------

Co-authored-by: Lucain <lucain@huggingface.co>
Co-authored-by: Lucain Pouget <lucainp@gmail.com>

* Xet Docs for huggingface_hub (#2899)

* Xet docs
* PR feedback, added waitlist links
* Added HF_XET_CACHE env variable docs
* PR feedback
* Doc feedback
* Added two lines about flow of upload/download
* Updating links to Hub doc location
* Reformat headings, less levels in TOC

---------

Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
Co-authored-by: Célina <hanouticelina@gmail.com>
Co-authored-by: Lucain <lucainp@gmail.com>

* Adding Token Refresh Xet Test (#2932)

Directly calling hfxet.download_files() with token_refresher callback
   to ensure that hfxet calls the token refresher as expected.

---------

Co-authored-by: Celina Hanouti <hanouticelina@gmail.com>

* Using a two stage download path for xet files. (#2920)

* Adding request header on resolve endpoint indicating that we can receive xet info.
* Adding test to ensure that the header is always sent on metdata request
* Using a two stage download path for xet files.
* Using the GET call's JSON
* Using xet_backed for the whether the file is a xet file or not to disambiguate from whether xet is enabled
* Adding and fixing tests
* Testing fix WIP
* Rewriting xet download to use the refresh route to resolve the xetmetadata
* Parameter type check
* Docs
* Removing extraneous constant
* Fixing file_download tests
* Readding the refresh route into the file metadata
* Refactoring the XetMetadata object into two objects to reflect the Hub changes.
* Fixing broken tests
* Code cleanup from self review
* Fixing types
* Quality & Lint
* Handling when hub returns the entire refresh route in its headers.
* Update tests/test_xet_utils.py
* Fixing merge conflicts in the new tests
* Extracting the refresh route from the link header (#2953)
* Getting the refresh route from the links header
* refactor xet_file_data func signature & tests

Co-authored-by: Lucain <lucain@huggingface.co>
Co-authored-by: Rajat Arya <rajat@huggingface.co>

* Update src/huggingface_hub/constants.py

Co-authored-by: Célina <hanouticelina@gmail.com>

---------

Co-authored-by: Celina Hanouti <hanouticelina@gmail.com>
Co-authored-by: Rajat Arya <rajatarya@users.noreply.github.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Pierric Cistac <Pierrci@users.noreply.github.com>
Co-authored-by: Brian Ronan <brian.ronan@huggingface.co>
Co-authored-by: Rajat Arya <rajat@huggingface.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants