The TIB Subject Indexing Dataset (TIB-SID) is a bilingual benchmark for extreme multi-label text classification (XMTC) over real library records, designed for domain classification and for subject indexing against the GND (Gemeinsame Normdatei, the Integrated Authority File). It combines a large, structured, authority-controlled label space with long-tail sparsity, cross-lingual variation, and real-world domain imbalance, which makes it substantially closer to operational library cataloging than standard text classification benchmarks.
- 136,569 library records in JSON-LD with predefined train / dev / test benchmark splits
- Languages: English and German
- 28 domains
- Record types: article, book, conference, report, thesis
Download the dataset here: data
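After downloading, a quick sanity check is to count the record files in each split. The sketch below is a minimal example under stated assumptions: it supposes that the download unpacks into one directory per split (`train`, `dev`, `test`) and that each record is a separate file ending in `.jsonld`. Neither the directory layout nor the file extension is confirmed here, and `iter_records` and `count_records` are illustrative helper names, so adjust `root` and `pattern` to match the actual download.

```python
import json
from collections import Counter
from pathlib import Path

# Split names as listed in the dataset description.
SPLITS = ("train", "dev", "test")


def iter_records(root, split, pattern="*.jsonld"):
    """Yield (path, parsed document) for every matching file under root/split.

    Assumption: one JSON-LD record per file, stored somewhere below a
    directory named after the split. Both are guesses about the layout.
    """
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        return  # missing split directory: yield nothing
    for path in sorted(split_dir.rglob(pattern)):
        with path.open(encoding="utf-8") as handle:
            # JSON-LD is plain JSON, so the standard parser is enough here.
            yield path, json.load(handle)


def count_records(root, pattern="*.jsonld"):
    """Return a Counter mapping each split name to its number of record files."""
    counts = Counter()
    for split in SPLITS:
        counts[split] = sum(1 for _ in iter_records(root, split, pattern))
    return counts
```

Calling `count_records("path/to/download")` returns one count per split. If the layout assumption holds, the three counts should sum to the 136,569 records stated above.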
TIB-SID was introduced through the LLMs4Subjects shared tasks organized in 2025. More than 12 LLM-based systems were developed and evaluated on the dataset by participating teams worldwide. The shared task websites provide additional context, task details, and leaderboard results.
If you use TIB-SID, please cite:
Coming soon...
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

