In [None]:
from huggingface_hub import hf_hub_download, snapshot_download

### From latest version

In [None]:
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json")

In [None]:
hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")

### From specific version
By default, the latest version from the main branch is downloaded. However, in some cases you may want to download a file at a particular version (e.g. from a specific branch, a PR, a tag, or a commit hash). To do so, use the revision parameter:

Note: the revisions used in the examples below may no longer exist in the repository.

In [5]:
# Download from the `v1.0` tag
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="v1.0")

In [None]:
# Download from the `test-branch` branch
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="test-branch")

In [None]:
# Download from Pull Request #3
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="refs/pr/3")

In [None]:
# Download from a specific commit hash
# Note: When using the commit hash, it must be the full-length hash instead of a 7-character commit hash.
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a")
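
Because an abbreviated hash is not accepted, it can help to check a revision string before passing it along. The following is a minimal sketch using only the standard library. `is_full_commit_hash` is a hypothetical helper, not part of huggingface_hub, and it assumes a full hash is 40 lowercase hexadecimal characters.

```python
import re

# Assumption: a full-length commit hash is 40 lowercase hexadecimal characters.
FULL_COMMIT_HASH = re.compile(r"[0-9a-f]{40}")


def is_full_commit_hash(revision: str) -> bool:
    """Return True if `revision` looks like a full-length commit hash.

    Hypothetical helper for illustration; it is not part of huggingface_hub.
    """
    return FULL_COMMIT_HASH.fullmatch(revision) is not None


full_hash = "877b84a8f93f2d619faa2a6e514a32beef88ab0a"
print(is_full_commit_hash(full_hash))      # True
print(is_full_commit_hash(full_hash[:7]))  # False: abbreviated 7-character hash
print(is_full_commit_hash("v1.0"))         # False: a tag name, not a hash
```

A result of False does not mean the revision is invalid: branch names, tags and PR references are also accepted by the revision parameter.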

### Construct a url
To construct the URL used to download a file from a repo, use hf_hub_url(), which returns the URL as a string. hf_hub_download() uses it internally.
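
To give an idea of what such a URL looks like, here is a rough standard-library sketch. `sketch_hub_url` is a hypothetical helper written for this illustration, and the URL layout it uses (a `resolve/<revision>/<filename>` path, with a `datasets/` or `spaces/` prefix for non-model repos) is an assumption about the Hub's current scheme. In real code, call `hf_hub_url(repo_id=..., filename=...)` instead.

```python
from urllib.parse import quote


def sketch_hub_url(repo_id, filename, repo_type=None, revision="main"):
    """Approximate the shape of the URL returned by hf_hub_url().

    Hypothetical helper for illustration only; it is not the library's code.
    """
    # Assumption: dataset and Space repos live under a type prefix.
    prefixes = {"dataset": "datasets/", "space": "spaces/"}
    prefix = prefixes.get(repo_type, "")
    # Revisions such as "refs/pr/3" contain slashes, so they are percent-encoded.
    return (
        f"https://huggingface.co/{prefix}{repo_id}"
        f"/resolve/{quote(revision, safe='')}/{quote(filename)}"
    )


print(sketch_hub_url("lysandre/arxiv-nlp", "config.json"))
# https://huggingface.co/lysandre/arxiv-nlp/resolve/main/config.json
print(sketch_hub_url("google/fleurs", "fleurs.py", repo_type="dataset"))
# https://huggingface.co/datasets/google/fleurs/resolve/main/fleurs.py
```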

### Download entire repository
snapshot_download() downloads an entire repository at a given revision. Internally, it uses hf_hub_download(), which means all downloaded files are also cached on your local disk. Downloads are made concurrently to speed up the process.

To download a whole repository, pass the repo_id and, for anything other than a model repo, the repo_type:

In [None]:
snapshot_download(repo_id="lysandre/arxiv-nlp")

In [2]:
# Or from a dataset
snapshot_download(repo_id="google/fleurs", repo_type="dataset")

snapshot_download() downloads the latest revision by default. If you want a specific repository revision, use the revision parameter:

In [None]:
snapshot_download(repo_id="lysandre/arxiv-nlp", revision="refs/pr/1")
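
As described above, snapshot_download() relies on a per-file download and runs those downloads concurrently. The idea can be sketched offline with a thread pool. Everything here is illustrative: `fake_download` is a stand-in for a real call such as hf_hub_download(), the file listing is made up, and `max_workers=8` is an arbitrary choice, not a documented default.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(filename: str) -> str:
    """Stand-in for a per-file download; returns a pretend local path.

    Hypothetical: a real implementation would call hf_hub_download() here.
    """
    return f"/tmp/snapshot/{filename}"


def sketch_snapshot_download(filenames, max_workers=8):
    """Fetch every file with a thread pool and return local paths in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `filenames` even though calls overlap.
        return list(pool.map(fake_download, filenames))


# Hypothetical file listing for a repository.
paths = sketch_snapshot_download(["config.json", "README.md", "vocab.json"])
print(paths)
```

Unlike this sketch, the real snapshot_download() returns the path of the downloaded snapshot folder rather than a list of file paths.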

### Filter files to download
snapshot_download() provides an easy way to download a repository. However, you don’t always want to download the entire content of a repository. For example, you might want to skip all .bin files if you know you’ll only use the .safetensors weights. You can do that using the allow_patterns and ignore_patterns parameters.

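Both parameters accept a single pattern or a list of patterns. Patterns are standard wildcards (globbing patterns), as documented at https://tldp.org/LDP/GNU-Linux-Tools-Summary/html/x11655.htm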

In [1]:
# use allow_patterns to only download JSON configuration files:
snapshot_download(repo_id="lysandre/arxiv-nlp", allow_patterns="*.json")

In [None]:
# ignore_patterns can exclude certain files from being downloaded.
snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_patterns=["*.msgpack", "*.h5"])

In [None]:
# download all json and markdown files except vocab.json
snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
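
To preview which files a given pair of patterns would keep without downloading anything, the filtering can be approximated with the standard library. This is a sketch under the assumption that matching follows `fnmatch` semantics. `filter_files` and `as_list` are hypothetical helpers, and the file listing is invented for the example.

```python
from fnmatch import fnmatch


def as_list(patterns):
    """Accept a single pattern or a list of patterns."""
    if patterns is None:
        return None
    return [patterns] if isinstance(patterns, str) else list(patterns)


def filter_files(filenames, allow_patterns=None, ignore_patterns=None):
    """Keep files matching any allow pattern and no ignore pattern.

    Hypothetical helper that approximates the filtering; not the library's code.
    """
    allow = as_list(allow_patterns)
    ignore = as_list(ignore_patterns)
    kept = []
    for name in filenames:
        if allow is not None and not any(fnmatch(name, p) for p in allow):
            continue  # not explicitly allowed
        if ignore is not None and any(fnmatch(name, p) for p in ignore):
            continue  # explicitly ignored
        kept.append(name)
    return kept


# Hypothetical file listing, not the actual content of any repository.
files = ["README.md", "config.json", "vocab.json", "model.safetensors", "model.h5"]
print(filter_files(files, allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json"))
# ['README.md', 'config.json']
```

Note that ignore_patterns takes precedence here: vocab.json matches `*.json` but is still excluded, which mirrors the last example above.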

### Download to local folder
You can define your cache location by setting the cache_dir parameter (both in hf_hub_download() and snapshot_download()).

In some cases you want to download files and move them to a specific folder. You can do that using the local_dir and local_dir_use_symlinks parameters:

- local_dir must be a path to a folder on your system. The downloaded files will keep the same file structure as in the repo. For example if filename="data/train.csv" and local_dir="path/to/folder", then the returned filepath will be "path/to/folder/data/train.csv".
- local_dir_use_symlinks defines how the file must be saved in your local folder.
  - The default behavior ("auto") is to duplicate small files (<5MB) and use symlinks for bigger files. Symlinks save both bandwidth and disk space. However, manually editing a symlinked file might corrupt the cache, hence the duplication for small files. The 5MB threshold can be configured with the HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD environment variable.
  - If local_dir_use_symlinks=True is set, all files are symlinked to minimize disk usage. This is useful, for example, when downloading a huge dataset with thousands of small files.
  - Finally, if you don’t want symlinks at all, you can disable them (local_dir_use_symlinks=False). The cache directory is still used to check whether the file is already cached. If it is, the file is duplicated from the cache (which saves bandwidth but increases disk usage). If it is not, the file is downloaded and moved directly to the local dir, so if you need it somewhere else later, it will be re-downloaded.
 
| Parameters | File already cached | Returned path | Can read path? | Can save to path? | Optimized bandwidth | Optimized disk usage |
|---|---|---|---|---|---|---|
| local_dir=None | | symlink in cache | ✅ | ❌ (save would corrupt the cache) | ✅ | ✅ |
| local_dir="path/to/folder", local_dir_use_symlinks="auto" | | file or symlink in folder | ✅ | ✅ (small files), ⚠️ (for big files, do not resolve path before saving) | ✅ | ✅ |
| local_dir="path/to/folder", local_dir_use_symlinks=True | | symlink in folder | ✅ | ⚠️ (do not resolve path before saving) | ✅ | ✅ |
| local_dir="path/to/folder", local_dir_use_symlinks=False | No | file in folder | ✅ | ✅ | ❌ (if re-run, file is re-downloaded) | ⚠️ (multiple copies if run in multiple folders) |
| local_dir="path/to/folder", local_dir_use_symlinks=False | Yes | file in folder | ✅ | ✅ | ⚠️ (file has to be cached first) | ❌ (file is duplicated) |
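
The rules described in this section can be condensed into a small offline sketch. Both functions are hypothetical helpers written for illustration. The threshold is assumed to be expressed in bytes with a default of 5 * 1024 * 1024, and the exact handling of a file whose size equals the threshold is also an assumption.

```python
from pathlib import Path

FIVE_MB = 5 * 1024 * 1024  # assumed default threshold, in bytes


def local_path(local_dir: str, filename: str) -> Path:
    """Files keep the repo's structure under local_dir."""
    return Path(local_dir) / filename


def placement(size_bytes: int, use_symlinks="auto", threshold: int = FIVE_MB) -> str:
    """Return "symlink" or "file" for a file of the given size.

    Hypothetical helper summarizing the three local_dir_use_symlinks modes.
    """
    if use_symlinks == "auto":
        # Small files are duplicated; bigger ones are symlinked to the cache.
        return "symlink" if size_bytes > threshold else "file"
    return "symlink" if use_symlinks else "file"


print(local_path("path/to/folder", "data/train.csv").as_posix())
# path/to/folder/data/train.csv
print(placement(10_000))              # file: small file in "auto" mode
print(placement(2 * 1024**3))         # symlink: large file in "auto" mode
print(placement(10_000, True))        # symlink: always symlinked
print(placement(2 * 1024**3, False))  # file: symlinks disabled
```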
