This package provides four ready-to-use singleton instances, each offering a dictionary-like interface to a different type of Hugging Face resource:

In [1]:
import hf

In [2]:
hf.datasets, hf.models, hf.spaces, hf.papers

(<hf.base.HfDatasets at 0x11897c6a0>,
 <hf.base.HfModels at 0x12df80e80>,
 <hf.base.HfSpaces at 0x12df82fe0>,
 <hf.base.HfPapers at 0x12df80a60>)

# hf.datasets

A mapping (i.e. "dict-like") view of local datasets, and an accessor for remote ones.

`hf.datasets` is a ready-made singleton instance, so you can use it as is, or import `datasets` directly from `hf`.

## List local datasets

As with dictionaries, iterating over `hf.datasets` yields its keys.
The keys are the repository ids of the datasets you've downloaded.
To see which datasets you already have cached locally:

In [3]:
list(hf.datasets)

['MMMU/MMMU',
 'li2017dailydialog/daily_dialog',
 'llamafactory/tiny-supervised-dataset',
 'google-research-datasets/go_emotions',
 'stingning/ultrachat',
 'sander-wood/m4-rag',
 'open-llm-leaderboard-old/results',
 'allenai/WildChat-1M',
 'takala/financial_phrasebank',
 'ucirvine/sms_spam']

The values of `hf.datasets` are `DatasetDict` instances
(from Hugging Face's `datasets` package) that give you access to the data.
If the dataset is already downloaded, it is loaded from the local cache.
If not, it is downloaded first, then returned (and cached locally
for the next time you access it).


In [4]:
data = hf.datasets['stingning/ultrachat']
data

Loading dataset shards:   0%|          | 0/19 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'data'],
        num_rows: 1468352
    })
})
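
The "load locally if present, otherwise download and cache" behavior described above can be illustrated with a small read-through mapping. This is only a sketch of the general pattern, not the package's implementation: `ReadThroughStore`, `fake_download`, and the in-memory `_local` dict are hypothetical stand-ins for the real download and on-disk cache.

```python
from collections.abc import Mapping


class ReadThroughStore(Mapping):
    """Dict-like view that lists local items and fetches missing ones on access."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical stand-in for a remote download
        self._local = {}     # hypothetical stand-in for the on-disk cache

    def __getitem__(self, key):
        if key not in self._local:  # cache miss: fetch once, then keep
            self._local[key] = self._fetch(key)
        return self._local[key]

    def __iter__(self):
        return iter(self._local)  # iterate over local keys only

    def __len__(self):
        return len(self._local)

    def __contains__(self, key):
        # Membership checks must not trigger a download
        return key in self._local


calls = []  # records every (fake) download


def fake_download(repo_id):
    calls.append(repo_id)
    return f"<data for {repo_id}>"


store = ReadThroughStore(fake_download)
print(list(store))            # nothing cached yet
store["stingning/ultrachat"]  # first access triggers the fetch
store["stingning/ultrachat"]  # second access is served from the cache
print(list(store), len(calls))
```

Note the explicit `__contains__`: the default one inherited from `Mapping` calls `__getitem__`, which in a read-through store would start a download just to answer a membership test.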

## Search for remote datasets

See what arguments you can use in the huggingface_hub documentation:
https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.list_datasets

In [5]:
search_results = hf.datasets.search('music', gated=False)  # gated=False keeps only datasets that are not gated
print(f"search_results is a {type(search_results).__name__}")

search_results is a generator


In [6]:
# search_results is a generator: let's get the first result
result = next(search_results)
print(f"result is a {type(result).__name__}")

result is a DatasetInfo


In [7]:
print('key: ', result.id)
print('\ndescription:', result.description[:80], '...')

key:  ccmusic-database/music_genre

description: 
	
		
		Dataset Card for Music Genre
	

The Default dataset comprises approximat ...


Before you download, you might want to know the dataset's size
(which, at the time of writing, `DatasetInfo` does not provide).
The `get_size` method gives you that.

In [8]:
hf.datasets.get_size(result)  # by default, in GiB

4.268976908177137

You can ask for the size via the key (the repository id),
or via the result (`DatasetInfo`) object itself.
`get_size` is available both on `hf` and on `hf.datasets`.
Pass `unit_bytes=1` if you want the size in bytes.
The default is `unit_bytes=1024 ** 3`, which gives sizes in GiB.

In [8]:
hf.datasets.get_size(result.id, unit_bytes=1)

4583779052.0
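
To make the role of `unit_bytes` concrete, here is a minimal sketch of the arithmetic, assuming the size is simply the byte count divided by `unit_bytes`. The helper `size_in_units` is hypothetical and is not part of `hf`; it only reproduces the conversion on the byte count shown above.

```python
def size_in_units(n_bytes, unit_bytes=1024 ** 3):
    """Convert a byte count to the given unit (default: GiB)."""
    return n_bytes / unit_bytes


n_bytes = 4583779052  # the byte count shown above

gib = size_in_units(n_bytes)                        # binary gigabytes (GiB)
gb = size_in_units(n_bytes, unit_bytes=1000 ** 3)   # decimal gigabytes (GB)
mib = size_in_units(n_bytes, unit_bytes=1024 ** 2)  # binary megabytes (MiB)

print(f"{gib:.3f} GiB, {gb:.3f} GB, {mib:.1f} MiB")
```

The GiB and GB figures differ by about 7%, which matters when comparing a size against free disk space reported in a different unit.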

In [None]:
# And if you like what you see (info and size), just get it (you know how!)
# (Note: Normally, you would do hf.datasets[result.id], but you can also just
# stick the DatasetInfo instance in there for convenience. You're welcome.)
data = hf.datasets[result]
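
Accepting either a key or a search result usually comes down to normalizing the argument to a repository id first. A minimal sketch of that idea follows; `FakeInfo` and `to_repo_id` are hypothetical names used for illustration, and nothing here is taken from the package's source.

```python
from dataclasses import dataclass


@dataclass
class FakeInfo:
    """Hypothetical stand-in for a search result such as DatasetInfo."""
    id: str


def to_repo_id(key_or_info):
    """Return the repository id for a string key or an info-like object."""
    if isinstance(key_or_info, str):
        return key_or_info
    return key_or_info.id  # assumes info objects expose an `id` attribute


info = FakeInfo(id="ccmusic-database/music_genre")
print(to_repo_id(info) == to_repo_id("ccmusic-database/music_genre"))
```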

## Retrieving and viewing several results

In [None]:
def table_of_results(results, n=10):
    import itertools, operator, pandas as pd

    results_table = pd.DataFrame(  # make a table with
        map(
            operator.attrgetter('__dict__'),  # the attributes dicts
            # Note: we use islice instead of the `limit` arg because that arg doesn't exist for papers search!
            itertools.islice(results, n),  # ... of the first n search results
        )
    )
    return results_table

results_table = table_of_results(search_results)
results_table

| | id | author | sha | created_at | last_modified | private | gated | disabled | downloads | downloads_all_time | ... | tags | trending_score | card_data | siblings | xet_enabled | lastModified | cardData | _id | description | key |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Genius-Society/hoyoMusic | Genius-Society | 4f7e5120c0e8e26213d4bb3b52bcce76e69dfce4 | 2023-11-05 15:07:57+00:00 | 2025-03-28 04:08:19+00:00 | False | False | False | 54 | | ... | [task_categories:text-generation, task_categor... | 5 | | | | 2025-03-28 04:08:19+00:00 | | 6547afcd28b7019eae3d090e | `\n\t\n\t\t\n\t\tIntro\n\t\n\nThis dataset main...` | |
| 1 | Genius-Society/emo163 | Genius-Society | 6b8c3526b66940ddaedf15602d01083d24eb370c | 2023-05-01 05:45:31+00:00 | 2025-03-28 04:15:03+00:00 | False | False | False | 93 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-03-28 04:15:03+00:00 | | 644f51fba00f4b11d3a4bbd2 | `\n\t\n\t\t\n\t\tIntro\n\t\n\nThe emo163 datase...` | |
| 2 | ccmusic-database/acapella | ccmusic-database | 4cb8a4d4cb58cc55f30cb8c7a180fee1b5576dc5 | 2023-05-25 08:05:41+00:00 | 2025-02-17 10:12:20+00:00 | False | False | False | 133 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-02-17 10:12:20+00:00 | | 646f16d5e2a72c647b61af0a | `\n\t\n\t\t\n\t\tDataset Card for Acapella Eval...` | |
| 3 | ccmusic-database/pianos | ccmusic-database | db2b3f74c4c989b4fbda4b309e6bc925bfd8f5d1 | 2023-05-25 11:32:28+00:00 | 2025-04-05 23:40:59+00:00 | False | False | False | 148 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-04-05 23:40:59+00:00 | | 646f474ce2a72c647b6bad98 | `\n\t\n\t\t\n\t\tDataset Card for Piano Sound Q...` | |
| 4 | ccmusic-database/chest_falsetto | ccmusic-database | 1160f5002fc1bbcd23aa59bcfc2df3015b893114 | 2023-05-25 13:53:10+00:00 | 2025-03-21 09:30:19+00:00 | False | False | False | 99 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-03-21 09:30:19+00:00 | | 646f6846ac3bff5945e74ea8 | `\n\t\n\t\t\n\t\tDataset Card for Chest voice a...` | |
| 5 | ccmusic-database/bel_canto | ccmusic-database | d8bd952b0bb87d8f2faee1bd2f8bfc8123d5bc9a | 2023-05-26 08:53:43+00:00 | 2025-03-25 13:18:12+00:00 | False | False | False | 118 | | ... | [task_categories:audio-classification, task_ca... | 4 | | | | 2025-03-25 13:18:12+00:00 | | 647073973df93fddecde5d63 | `\n\t\n\t\t\n\t\tDataset Card for Bel Conto and...` | |
| 6 | ccmusic-database/instrument_timbre | ccmusic-database | 90a803fe7043d1b8ddb79832fdc4d6d6f2166cba | 2023-05-27 10:31:24+00:00 | 2025-02-17 08:27:36+00:00 | False | False | False | 131 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-17 08:27:36+00:00 | | 6471dbfc0211f85270fb4880 | `\n\t\n\t\t\n\t\tDataset Card for Chinese Music...` | |
| 7 | ccmusic-database/timbre_range | ccmusic-database | 242afee6bc5d2361e9afa0e4d57daa5a9ec9799e | 2023-06-05 13:27:25+00:00 | 2025-02-16 03:24:49+00:00 | False | False | False | 121 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-16 03:24:49+00:00 | | 647de2bd5214d172cbb8541e | `\n\t\n\t\t\n\t\tDataset Card for Timbre and Ra...` | |
| 8 | ccmusic-database/erhu_playing_tech | ccmusic-database | 3ee153cfc69d199c9722e08e34666f48635122b8 | 2023-07-14 10:54:23+00:00 | 2025-02-16 03:48:53+00:00 | False | False | False | 98 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-16 03:48:53+00:00 | | 64b1295fa17e4a051989b17c | `\n\t\n\t\t\n\t\tDataset Card for Erhu Playing ...` | |
| 9 | ccmusic-database/GZ_IsoTech | ccmusic-database | 5c9a61e880b726358bd1085190118d5646417568 | 2023-10-12 13:23:57+00:00 | 2025-02-16 03:43:03+00:00 | False | False | False | 104 | | ... | [task_categories:audio-classification, languag... | 4 | | | | 2025-02-16 03:43:03+00:00 | | 6527f36d720bf65b654d8b31 | `\n\t\n\t\t\n\t\tDataset Card for GZ_IsoTech Da...` | |


In [14]:
print(results_table.iloc[:4, :3].to_string())

                          id            author                                       sha
0   Genius-Society/hoyoMusic    Genius-Society  4f7e5120c0e8e26213d4bb3b52bcce76e69dfce4
1      Genius-Society/emo163    Genius-Society  6b8c3526b66940ddaedf15602d01083d24eb370c
2  ccmusic-database/acapella  ccmusic-database  4cb8a4d4cb58cc55f30cb8c7a180fee1b5576dc5
3    ccmusic-database/pianos  ccmusic-database  db2b3f74c4c989b4fbda4b309e6bc925bfd8f5d1
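
One thing to keep in mind when slicing search results: a generator can only be read once, so anything already taken with `next` will not show up again in a later `islice`. The sketch below shows this with the standard library only. `fake_search` is a hypothetical stand-in for a lazy search, and `SimpleNamespace` objects play the role of the info objects.

```python
import itertools
from types import SimpleNamespace


def fake_search():
    """Hypothetical stand-in for a lazy search: yields info-like objects."""
    for i in range(100):
        yield SimpleNamespace(id=f"org/repo-{i}", downloads=i * 10)


results = fake_search()
first = next(results)  # consumes the first result

# Take the next three results and turn each into its attribute dict
rows = [vars(r) for r in itertools.islice(results, 3)]

print(first.id)
print([row["id"] for row in rows])
```

A list of dicts like `rows` is one of the inputs `pandas.DataFrame` accepts, which is what `table_of_results` relies on. If you need to keep the first result in the table, run the search again or collect the results into a list before calling `next`.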


# hf.models

`hf.models` has essentially the same interface as `hf.datasets` (except for the `search` methods, which don't share all their arguments).
That's the point of the `hf` facade: it provides the same dict-like interface for your basic operations.
So you have an object that you can iterate over (to get the keys of local models), index to access values (the models; in the example below, the path to the model's local snapshot), and use to search for remote models to download.

## Search for models

In [13]:
model_search_results = hf.models.search('embeddings', gated=False)  # gated=False keeps only models that are not gated
print(f"model_search_results is a {type(model_search_results).__name__}")

model_search_results is a generator


In [15]:
first_model_less_than_200mb = next(filter(lambda m: hf.models.get_size(m.id) < 0.2, model_search_results))
first_model_less_than_200mb

ModelInfo(id='ibm-granite/granite-embedding-small-english-r2', author=None, sha=None, created_at=datetime.datetime(2025, 7, 17, 20, 41, 53, tzinfo=datetime.timezone.utc), last_modified=None, private=False, disabled=None, downloads=14546, downloads_all_time=None, gated=None, gguf=None, inference=None, inference_provider_mapping=None, likes=39, library_name='sentence-transformers', tags=['sentence-transformers', 'pytorch', 'safetensors', 'modernbert', 'feature-extraction', 'granite', 'embeddings', 'transformers', 'mteb', 'sentence-similarity', 'en', 'arxiv:2508.21085', 'license:apache-2.0', 'autotrain_compatible', 'text-embeddings-inference', 'endpoints_compatible', 'region:us'], pipeline_tag='sentence-similarity', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, trending_score=5, siblings=None, spaces=None, safetensors=None, security_repo_status=None, xet_enabled=None)

In [16]:
hf.models.get_size(first_model_less_than_200mb)

0.18094349279999733
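
The `next(filter(...))` idiom used above raises `StopIteration` if no result satisfies the condition. A variant with a default value avoids that. The sketch below uses a hypothetical `sizes_gib` table and a local `get_size` function in place of real lookups, so it runs without network access.

```python
# Hypothetical sizes in GiB, standing in for real size lookups
sizes_gib = {"org/large": 2.5, "org/medium": 0.7, "org/small": 0.18}


def get_size(repo_id):
    """Look up the (fake) size of a repository, in GiB."""
    return sizes_gib[repo_id]


def first_under(repo_ids, max_gib):
    """Return the first repo id smaller than max_gib, or None if there is none."""
    return next((r for r in repo_ids if get_size(r) < max_gib), None)


print(first_under(sizes_gib, 0.2))
print(first_under(sizes_gib, 0.1))
```

Because the generator expression is lazy, sizes are only looked up until the first match is found, which helps when each lookup is slow.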

## Download (or load) a model

In [17]:
model = hf.models[first_model_less_than_200mb]
model

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

'/Users/thorwhalen/.cache/huggingface/hub/models--ibm-granite--granite-embedding-small-english-r2/snapshots/c949f235cb63fcbd58b1b9e139ff63c8be764eeb'

## List (local) models

In [18]:
list(hf.models)

['ibm-granite/granite-embedding-small-english-r2',
 'roberta-base',
 'sony/silentcipher',
 'distilbert-base-uncased',
 'musiclang/musiclang-chord-v2-4k',
 'sesame/csm-1b',
 'ibm-granite/granite-embedding-125m-english',
 'lysandre/test-model',
 'kyutai/moshiko-pytorch-bf16',
 'meta-llama/Llama-3.2-1B']
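
The snapshot path shown earlier suggests how local keys relate to cache folders: the folder `models--ibm-granite--granite-embedding-small-english-r2` corresponds to the key `ibm-granite/granite-embedding-small-english-r2`. The sketch below derives repository ids from folder names of that shape. It is an illustration under that naming assumption, not the way `hf` is known to list local models; `repo_ids_in_cache` is a hypothetical helper, and it runs on a temporary directory rather than your real cache.

```python
import tempfile
from pathlib import Path


def repo_ids_in_cache(cache_dir, repo_type="models"):
    """List repo ids for folders named '<repo_type>--<org>--<name>' (assumed layout)."""
    prefix = f"{repo_type}--"
    ids = []
    for path in Path(cache_dir).iterdir():
        if path.is_dir() and path.name.startswith(prefix):
            # Drop the type prefix, then map the remaining '--' separators to '/'
            ids.append(path.name[len(prefix):].replace("--", "/"))
    return sorted(ids)


with tempfile.TemporaryDirectory() as tmp:
    # Fake cache folders, created only for this demonstration
    for name in ["models--roberta-base", "models--sesame--csm-1b", "datasets--MMMU--MMMU"]:
        (Path(tmp) / name).mkdir()
    found = repo_ids_in_cache(tmp)
    found_datasets = repo_ids_in_cache(tmp, repo_type="datasets")

print(found)
print(found_datasets)
```

A repository whose name itself contains `--` would be mapped incorrectly by this simple replacement, so treat it as a reading aid for the folder names rather than a robust parser.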