This package provides four ready-to-use singleton instances, each offering a dictionary-like interface to a different type of Hugging Face resource:

In [4]:
import hf

In [6]:
hf.datasets, hf.models, hf.spaces, hf.papers

(<hf.base.HfDatasets at 0x110725630>,
 <hf.base.HfModels at 0x136f84e80>,
 <hf.base.HfSpaces at 0x136f84760>,
 <hf.base.HfPapers at 0x136f86fe0>)

# hf.datasets

A mapping (i.e. "dict-like") view of local datasets, and an accessor for remote ones.

`datasets` is a ready-to-use singleton instance: import it directly from `hf` (`from hf import datasets`) or access it as `hf.datasets`.

## List local datasets

As with dictionaries, `hf.datasets` is an iterable of keys. 
The keys are the repository ids of the datasets you have already downloaded, 
so `list(hf.datasets)` shows which datasets are cached locally.

The values of `hf.datasets` are `DatasetDict` instances 
(from Hugging Face's `datasets` package) that give you access to the data.
If the dataset is already downloaded, it is loaded from the local cache; 
if not, it is downloaded first, then returned (and cached locally 
for the next time you access it). 


In [6]:
print(f"{hf.datasets.get_size('stingning/ultrachat')=} (in GiBs)")

hf.datasets.get_size('stingning/ultrachat')=8.651052392087877 (in GiBs)


In [8]:
data = hf.datasets['stingning/ultrachat']

Loading dataset shards:   0%|          | 0/19 [00:00<?, ?it/s]

In [10]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'data'],
        num_rows: 1468352
    })
})
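
The load-or-download behaviour described above can be illustrated with a small, self-contained sketch. This is not the package's implementation: `LazyCacheMapping` and `fake_download` are hypothetical names, and an in-memory dict stands in for the on-disk cache. It only shows the idea that iteration lists what is already local, while item access fetches on a miss and reuses the cached value afterwards.

```python
from collections.abc import Mapping


class LazyCacheMapping(Mapping):
    """Toy dict-like view: keys are what is cached; missing keys are fetched."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical loader: repo id -> value
        self._cache = {}  # stands in for the local on-disk cache

    def __getitem__(self, key):
        if key not in self._cache:  # cache miss: "download" once
            self._cache[key] = self._fetch(key)
        return self._cache[key]  # cache hit: served locally

    def __contains__(self, key):
        return key in self._cache  # membership checks never trigger a fetch

    def __iter__(self):
        return iter(self._cache)  # only locally available keys

    def __len__(self):
        return len(self._cache)


calls = []


def fake_download(repo_id):
    calls.append(repo_id)  # record each simulated download
    return {"train": f"rows of {repo_id}"}


repos = LazyCacheMapping(fake_download)
assert list(repos) == []  # nothing cached yet
repos["user/some-dataset"]  # first access triggers the fetch
repos["user/some-dataset"]  # second access reuses the cache
assert list(repos) == ["user/some-dataset"]
assert calls == ["user/some-dataset"]  # fetched exactly once
```

Overriding `__contains__` matters here: the default `Mapping.__contains__` goes through `__getitem__`, which in this sketch would turn a simple membership test into a download.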

## Search for remote datasets

See what arguments you can use in the huggingface_hub documentation:
https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.list_datasets

In [12]:
search_results = hf.datasets.search('music', gated=False)  # gated=False keeps only datasets that are not gated (no access request needed)
print(f"search_results is a {type(search_results).__name__}")

search_results is a generator


In [13]:
# search_results is a generator: let's get the first result
result = next(search_results)
print(f"result is a {type(result).__qualname__}")

result is a DatasetInfo
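
Because the search results come back as a generator, they can only be consumed once: every `next` call or slice moves the cursor forward. The sketch below uses a hypothetical `fake_search` generator (not the real search) to show that behaviour, and how a bounded slice can be materialised when the results need to be inspected more than once.

```python
import itertools


def fake_search():
    """Stand-in for a lazy search: yields ids one at a time."""
    for name in ("a/one", "b/two", "c/three", "d/four"):
        yield name


results = fake_search()
first = next(results)  # consumes the first item
next_two = list(itertools.islice(results, 2))  # continues right after it
leftover = list(results)  # whatever is still unread

# To look at results more than once, materialise a bounded slice up front.
snapshot = list(itertools.islice(fake_search(), 3))
```

After this runs, `results` is exhausted, while `snapshot` is an ordinary list that can be reused freely.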


In [14]:
print('key: ', result.id)
print('\ndescription:', result.description[:80], '...')

key:  ccmusic-database/music_genre

description: 
	
		
		Dataset Card for Music Genre
	

The Default dataset comprises approximat ...


Before you download, you might want to know how large the dataset is 
(which, at the time of writing, is not given by the `DatasetInfo`). 
We provide this for you too, via the `get_size` method.

In [21]:
hf.datasets.get_size(result)  # by default, in GiB

4.268976908177137

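You can ask for the size via the key (the repository id), 
or via the result (`DatasetInfo`) object itself. 
`get_size` is available both as `hf.get_size` and as `hf.datasets.get_size`. 
Note that, as the cells below show, `hf.get_size` only agrees with 
`hf.datasets.get_size` when it is given `repo_type='dataset'`.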

In [20]:
result.id

'ccmusic-database/music_genre'

In [None]:
(
    hf.get_size(result.id, unit_bytes=1), 
    hf.get_size(result, unit_bytes=1), 
    hf.datasets.get_size(result.id, unit_bytes=1), 
    hf.datasets.get_size(result, unit_bytes=1)
)

(2511675232.0, 2511675232.0, 4583779052.0, 4583779052.0)
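
The relation between the byte counts and the default output can be checked by hand. The sketch below assumes that the default unit is 2**30 bytes (GiB); this is inferred from the numbers shown above rather than taken from the package's source, and `size_in_units` is a hypothetical helper, not part of `hf`.

```python
GIB = 2**30  # bytes per GiB (assumed default unit, inferred from the outputs above)


def size_in_units(n_bytes, unit_bytes=GIB):
    """Express a byte count in multiples of `unit_bytes`."""
    return n_bytes / unit_bytes


n_bytes = 4583779052  # value reported by hf.datasets.get_size with unit_bytes=1
in_gib = size_in_units(n_bytes)  # about 4.269, matching the default output
in_mib = size_in_units(n_bytes, 2**20)
in_bytes = size_in_units(n_bytes, 1)
```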

In [18]:
assert (
    hf.get_size(result.id, unit_bytes=1) 
    == hf.get_size(result, unit_bytes=1) 
    == hf.datasets.get_size(result.id, unit_bytes=1) 
    == hf.datasets.get_size(result, unit_bytes=1)
)

AssertionError: 

In [None]:
# And if you like it, just get it (you know how!)
# (Note: Normally, you should do hf.datasets[result.id], but we allow you to just 
# stick the DatasetInfo instance in there for convenience. You're welcome.)
data = hf.datasets[result]

## Retrieving and viewing several results

In [None]:
def table_of_results(results, n=10):
    """Make a table (DataFrame) of the attributes of the first n search results."""
    import itertools, operator, pandas as pd

    results_table = pd.DataFrame(  # make a table with
        map(
            operator.attrgetter('__dict__'),  # the attribute dicts
            itertools.islice(results, n),  # ... of the first n search results
        )
    )
    return results_table

results_table = table_of_results(search_results)
results_table

| | id | author | sha | created_at | last_modified | private | gated | disabled | downloads | downloads_all_time | ... | tags | trending_score | card_data | siblings | xet_enabled | lastModified | cardData | _id | description | key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Genius-Society/hoyoMusic | Genius-Society | 4f7e5120c0e8e26213d4bb3b52bcce76e69dfce4 | 2023-11-05 15:07:57+00:00 | 2025-03-28 04:08:19+00:00 | False | False | False | 54 | | ... | [task_categories:text-generation, task_categor... | 5 | | | | 2025-03-28 04:08:19+00:00 | | 6547afcd28b7019eae3d090e | \n\t\n\t\t\n\t\tIntro\n\t\n\nThis dataset main... | |
| 1 | Genius-Society/emo163 | Genius-Society | 6b8c3526b66940ddaedf15602d01083d24eb370c | 2023-05-01 05:45:31+00:00 | 2025-03-28 04:15:03+00:00 | False | False | False | 93 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-03-28 04:15:03+00:00 | | 644f51fba00f4b11d3a4bbd2 | \n\t\n\t\t\n\t\tIntro\n\t\n\nThe emo163 datase... | |
| 2 | ccmusic-database/acapella | ccmusic-database | 4cb8a4d4cb58cc55f30cb8c7a180fee1b5576dc5 | 2023-05-25 08:05:41+00:00 | 2025-02-17 10:12:20+00:00 | False | False | False | 133 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-02-17 10:12:20+00:00 | | 646f16d5e2a72c647b61af0a | \n\t\n\t\t\n\t\tDataset Card for Acapella Eval... | |
| 3 | ccmusic-database/pianos | ccmusic-database | db2b3f74c4c989b4fbda4b309e6bc925bfd8f5d1 | 2023-05-25 11:32:28+00:00 | 2025-04-05 23:40:59+00:00 | False | False | False | 148 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-04-05 23:40:59+00:00 | | 646f474ce2a72c647b6bad98 | \n\t\n\t\t\n\t\tDataset Card for Piano Sound Q... | |
| 4 | ccmusic-database/chest_falsetto | ccmusic-database | 1160f5002fc1bbcd23aa59bcfc2df3015b893114 | 2023-05-25 13:53:10+00:00 | 2025-03-21 09:30:19+00:00 | False | False | False | 99 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-03-21 09:30:19+00:00 | | 646f6846ac3bff5945e74ea8 | \n\t\n\t\t\n\t\tDataset Card for Chest voice a... | |
| 5 | ccmusic-database/bel_canto | ccmusic-database | d8bd952b0bb87d8f2faee1bd2f8bfc8123d5bc9a | 2023-05-26 08:53:43+00:00 | 2025-03-25 13:18:12+00:00 | False | False | False | 118 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-03-25 13:18:12+00:00 | | 647073973df93fddecde5d63 | \n\t\n\t\t\n\t\tDataset Card for Bel Conto and... | |
| 6 | ccmusic-database/instrument_timbre | ccmusic-database | 90a803fe7043d1b8ddb79832fdc4d6d6f2166cba | 2023-05-27 10:31:24+00:00 | 2025-02-17 08:27:36+00:00 | False | False | False | 131 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-17 08:27:36+00:00 | | 6471dbfc0211f85270fb4880 | \n\t\n\t\t\n\t\tDataset Card for Chinese Music... | |
| 7 | ccmusic-database/timbre_range | ccmusic-database | 242afee6bc5d2361e9afa0e4d57daa5a9ec9799e | 2023-06-05 13:27:25+00:00 | 2025-02-16 03:24:49+00:00 | False | False | False | 121 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-16 03:24:49+00:00 | | 647de2bd5214d172cbb8541e | \n\t\n\t\t\n\t\tDataset Card for Timbre and Ra... | |
| 8 | ccmusic-database/erhu_playing_tech | ccmusic-database | 3ee153cfc69d199c9722e08e34666f48635122b8 | 2023-07-14 10:54:23+00:00 | 2025-02-16 03:48:53+00:00 | False | False | False | 98 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-16 03:48:53+00:00 | | 64b1295fa17e4a051989b17c | \n\t\n\t\t\n\t\tDataset Card for Erhu Playing ... | |
| 9 | ccmusic-database/GZ_IsoTech | ccmusic-database | 5c9a61e880b726358bd1085190118d5646417568 | 2023-10-12 13:23:57+00:00 | 2025-02-16 03:43:03+00:00 | False | False | False | 104 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-16 03:43:03+00:00 | | 6527f36d720bf65b654d8b31 | \n\t\n\t\t\n\t\tDataset Card for GZ_IsoTech Da... | |


In [14]:
print(results_table.iloc[:4, :3].to_string())

                          id            author                                       sha
0   Genius-Society/hoyoMusic    Genius-Society  4f7e5120c0e8e26213d4bb3b52bcce76e69dfce4
1      Genius-Society/emo163    Genius-Society  6b8c3526b66940ddaedf15602d01083d24eb370c
2  ccmusic-database/acapella  ccmusic-database  4cb8a4d4cb58cc55f30cb8c7a180fee1b5576dc5
3    ccmusic-database/pianos  ccmusic-database  db2b3f74c4c989b4fbda4b309e6bc925bfd8f5d1
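
When only a few attributes matter, or pandas is not at hand, the same idea (take the first `n` results and read each object's attribute dict) can be sketched with the standard library alone. `FakeInfo` and `records` below are hypothetical names used for illustration; the real result objects have many more attributes.

```python
import itertools
from dataclasses import dataclass


@dataclass
class FakeInfo:
    """Hypothetical stand-in for a search result object."""

    id: str
    author: str
    downloads: int


def records(results, n=10, fields=None):
    """Turn the first n result objects into dicts, optionally keeping only some fields."""
    rows = []
    for obj in itertools.islice(results, n):
        row = dict(vars(obj))  # same data as obj.__dict__
        if fields is not None:
            row = {k: row[k] for k in fields}
        rows.append(row)
    return rows


infos = (FakeInfo(f"org/ds{i}", "org", 10 * i) for i in range(5))
rows = records(infos, n=2, fields=("id", "downloads"))
```

A list of dicts like `rows` can still be handed to `pd.DataFrame` later if a table is wanted.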


In [None]:
assert (
    hf.get_size(result.id, unit_bytes=1, repo_type='dataset') 
    == hf.get_size(result, unit_bytes=1, repo_type='dataset') 
    == hf.datasets.get_size(result.id, unit_bytes=1) 
    == hf.datasets.get_size(result, unit_bytes=1)
)

0.9885653089731932

# hf.models

A mapping interface to Hugging Face models.

`models` is a ready-to-use singleton instance: just import it directly from `hf`!

In [None]:
from hf import models

# models is now a ready-to-use singleton instance

As with datasets, you now have a dict-like object that you can iterate over
(to get the keys of local models), index to access values (which downloads the model if needed), use to search for remote models, etc.

## Search for models

In [9]:
model_search_results = models.search('embeddings', gated=False)  # gated=False keeps only models that are not gated (no access request needed)
print(f"model_search_results is a {type(model_search_results).__name__}")

model_search_results is a generator


In [None]:
first_model_less_than_200mb = next(
    filter(
        lambda m: hf.get_size(m.id, repo_type='model') < 0.2,  # 0.2 in the default size unit
        model_search_results,
    )
)
first_model_less_than_200mb

ModelInfo(id='ibm-granite/granite-embedding-small-english-r2', author=None, sha=None, created_at=datetime.datetime(2025, 7, 17, 20, 41, 53, tzinfo=datetime.timezone.utc), last_modified=None, private=False, disabled=None, downloads=14348, downloads_all_time=None, gated=None, gguf=None, inference=None, inference_provider_mapping=None, likes=39, library_name='sentence-transformers', tags=['sentence-transformers', 'pytorch', 'safetensors', 'modernbert', 'feature-extraction', 'granite', 'embeddings', 'transformers', 'mteb', 'sentence-similarity', 'en', 'arxiv:2508.21085', 'license:apache-2.0', 'autotrain_compatible', 'text-embeddings-inference', 'endpoints_compatible', 'region:us'], pipeline_tag='sentence-similarity', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, trending_score=5, siblings=None, spaces=None, safetensors=None, security_repo_status=None, xet_enabled=None)

In [None]:
hf.get_size(first_model_less_than_200mb, repo_type='model')

0.18094349279999733
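
One thing to keep in mind with the `next(filter(...))` pattern: if no candidate matches, `next` raises `StopIteration`, and each candidate costs one size lookup. The sketch below is a hedged variant that returns `None` instead, caps the number of candidates checked, and memoises lookups. The sizes are made up and `fake_size` / `first_under` are hypothetical helpers, not part of `hf`.

```python
from functools import lru_cache

# Hypothetical sizes; a real lookup would query the Hub instead.
FAKE_SIZES = {"org/big": 1.5, "org/medium": 0.4, "org/small": 0.18}
lookups = []


@lru_cache(maxsize=None)
def fake_size(repo_id):
    lookups.append(repo_id)  # record each uncached size lookup
    return FAKE_SIZES[repo_id]


def first_under(repo_ids, max_size, max_checked=50):
    """Return the first id whose size is below max_size, or None."""
    for i, repo_id in enumerate(repo_ids):
        if i >= max_checked:  # stop after a bounded number of candidates
            break
        if fake_size(repo_id) < max_size:
            return repo_id
    return None


found = first_under(iter(FAKE_SIZES), 0.2)
missing = first_under(iter(FAKE_SIZES), 0.1)
```

The second call reuses the memoised sizes, so no additional lookups are recorded for it.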

## Download (or load) a model

In [23]:
model = models[first_model_less_than_200mb]
model

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

pytorch_model.bin:   0%|          | 0.00/95.3M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/95.3M [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

.gitattributes: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/55.0 [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/191 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/694 [00:00<?, ?B/s]

'/Users/thorwhalen/.cache/huggingface/hub/models--ibm-granite--granite-embedding-small-english-r2/snapshots/c949f235cb63fcbd58b1b9e139ff63c8be764eeb'

In [26]:
from hf.base import scan_cache_dir


scan_cache_dir()



## List (local) models

In [47]:
list(models)

['roberta-base',
 'ibm-granite/granite-embedding-125m-english',
 'kyutai/moshiko-pytorch-bf16',
 'sesame/csm-1b',
 'ibm-granite/granite-embedding-small-english-r2',
 'distilbert-base-uncased',
 'meta-llama/Llama-3.2-1B',
 'sony/silentcipher',
 'musiclang/musiclang-chord-v2-4k']

# Scrap

## Repo ids used in testing (by test code of huggingface_hub)

In [34]:
repo_ids = [
    "tiiuae/falcon-7b-instruct",
    "stabilityai/stable-diffusion-2-1",
    "super-cool-model",
    "lysandre/test-model",
    "username/my-cool-space",
    "lysandre/test-dataset",
    "lysandre/test-private",
    "lysandre/my-corrupted-dataset",
    "bigscience/bloom-1b3",
]

from hf.base import get_size

bad_repos = set()  # repo ids whose size could not be determined
for repo_id in set(repo_ids) - bad_repos:
    try:
        size = get_size(repo_id)
        print(f"{repo_id:40} {size:.3f} GB")
    except Exception:
        bad_repos.add(repo_id)  # remember the repo ids that failed

lysandre/test-model                      0.000 GB
tiiuae/falcon-7b-instruct                52.681 GB
stabilityai/stable-diffusion-2-1         33.845 GB


In [42]:
dataset_repo_ids = [
    "llamafactory/tiny-supervised-dataset",
    "ola13/small-the_pile",
    "ucirvine/sms_spam",
    "lhoestq/demo1",
]

for repo_id in dataset_repo_ids:
    try:
        size = get_size(repo_id)
        print(f"{repo_id:40} {size:.3f} GB")
    except Exception as e:
        print(f"Problem with {repo_id}: {e}")

llamafactory/tiny-supervised-dataset     0.000 GB
ola13/small-the_pile                     0.306 GB
ucirvine/sms_spam                        0.000 GB
lhoestq/demo1                            0.000 GB


In [None]:
from hf import datasets

datasets['llamafactory/tiny-supervised-dataset']

README.md:   0%|          | 0.00/285 [00:00<?, ?B/s]

train.json: 0.00B [00:00, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 300
    })
})

In [55]:
round(get_size("llamafactory/tiny-supervised-dataset"), 4)

0.0001

In [None]:
key1 = "llamafactory/tiny-supervised-dataset"
key2 = "ucirvine/sms_spam"
assert round(get_size(key1, repo_type='dataset'), 4) == 0.0001
# get size in bytes
assert get_size(key2, unit_bytes=1, repo_type='dataset') == 365026.0

val1 = datasets[key1]

assert key1 in datasets  # now we have the key1
assert list(val1) == ['train']
assert list(val1['train'].features) == ['instruction', 'input', 'output']
assert val1['train'].num_rows == 300

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 300
    })
})

In [78]:
from huggingface_hub import snapshot_download

repo_id = key1
repo_type = "dataset" # Use "dataset" for datasets

# This function checks the cache first. If the repo is already downloaded, 
# it immediately returns the local path to the cached folder.
local_repo_path = snapshot_download(
    repo_id=repo_id,
    repo_type=repo_type
)

local_repo_path

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

'/Users/thorwhalen/.cache/huggingface/hub/datasets--llamafactory--tiny-supervised-dataset/snapshots/2ff06c75e01ae4195ed34fe77606e15902ea0b0d'
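
The cache paths printed in this document follow a visible pattern: the repo type in plural, then the parts of the repo id, joined by `--`. The sketch below encodes that pattern and its inverse. It is based only on the paths shown here; the helper names are hypothetical and this is not an official API, so prefer the library's own cache utilities for anything beyond exploration.

```python
from pathlib import PurePosixPath


def cache_folder_name(repo_id, repo_type="model"):
    """Folder name pattern seen in the cache paths above (assumed, not an official API)."""
    return "--".join([f"{repo_type}s", *repo_id.split("/")])


def repo_id_from_folder(folder_name):
    """Inverse: recover (repo_type, repo_id) from a cache folder name.

    Assumes no part of the repo id itself contains "--".
    """
    type_plural, *parts = folder_name.split("--")
    return type_plural[:-1], "/".join(parts)  # drop the trailing "s"


path = PurePosixPath(
    "/Users/thorwhalen/.cache/huggingface/hub/"
    "datasets--llamafactory--tiny-supervised-dataset/snapshots/"
    "2ff06c75e01ae4195ed34fe77606e15902ea0b0d"
)
folder = path.parts[-3]  # the repo-level folder, two levels above the snapshot
repo_type, repo_id = repo_id_from_folder(folder)
```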

In [None]:
from hf.base import get_size

get_size("ucirvine/sms_spam")
datasets["ucirvine/sms_spam"]

README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/359k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5574 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 5574
    })
})

In [45]:
get_size("llamafactory/tiny-supervised-dataset")

0.00012909621000289917

In [None]:
datasets['llamafactory/tiny-supervised-dataset']

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 300
    })
})