A corpus of German-language travelogues from the period 1500-1876, drawn from the Austrian Books Online project of the Austrian National Library. The corpus was compiled by the domain experts of the Travelogues Project, using the library's administration system (ALMA). Full texts and manifests with metadata were retrieved using the SACHA infrastructure. The texts are the output of Optical Character Recognition (OCR) and have not been manually corrected. Travelogues is funded through grant I 3795 of the Austrian Science Fund (FWF) and grant 398697847 of the German Research Foundation (DFG).
- 16th_century
  |- 16c-books.zip (14 MB, 66 files)
  |- 16c-metadata.zip (68 KB, 66 files)
- 17th_century
  |- 17c-books.zip (49 MB, 204 files)
  |- 17c-metadata.zip (202 KB, 204 files)
- 18th_century
  |- 18c-books.zip (214 MB, 949 files)
  |- 18c-metadata.zip (814 KB, 949 files)
IMPORTANT! Git LFS must be installed on your system in order to clone this repository correctly.
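If the repository is cloned without Git LFS, the zip archives are replaced by small text pointer files. The following is a minimal sketch of how one might check for that before reading an archive. The archive path follows the directory listing above, and the function names are illustrative and not part of the project. Whether the number of entries in an archive matches the file counts in the listing exactly is an assumption that has not been verified here.

```python
import zipfile
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS specification.
LFS_POINTER_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """Return True if the file is an LFS pointer stub rather than real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_HEADER)) == LFS_POINTER_HEADER


def count_archive_files(path: Path) -> int:
    """Count the regular files (directories excluded) stored in a zip archive."""
    if is_lfs_pointer(path):
        raise RuntimeError(
            f"{path} is a Git LFS pointer; install Git LFS and run 'git lfs pull'"
        )
    with zipfile.ZipFile(path) as archive:
        return sum(1 for info in archive.infolist() if not info.is_dir())


if __name__ == "__main__":
    # Assumed location, taken from the directory listing above.
    archive_path = Path("16th_century") / "16c-books.zip"
    if archive_path.exists():
        print(archive_path, count_archive_files(archive_path))
```

A count that differs from the listing, or a RuntimeError, suggests the clone is incomplete.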
Accessing Digital Objects Online
Book and metadata files are named according to their barcode identifiers in the Austrian National Library. The permanent URLs to the digital objects can be constructed by prefixing the barcode with http://data.onb.ac.at/ABO/+, e.g. for barcode
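The URL construction described above can be sketched in a few lines of Python. The prefix is the one quoted in the text. The barcode `Z000000000` is a made-up placeholder, not a real identifier from the corpus, and the helper names are hypothetical. The sketch also assumes that a file's name without its extension is the barcode.

```python
from pathlib import Path

# Prefix quoted in the text above (the trailing "+" is part of the prefix).
ABO_PREFIX = "http://data.onb.ac.at/ABO/+"


def barcode_to_url(barcode: str) -> str:
    """Build the permanent URL of a digital object from its barcode."""
    # Assumption: tolerate barcodes that already carry a leading "+".
    return ABO_PREFIX + barcode.strip().lstrip("+")


def filename_to_url(filename: str) -> str:
    """Build the permanent URL from a book or metadata file name."""
    # Assumption: the file stem (name without extension) is the barcode.
    return barcode_to_url(Path(filename).stem)


if __name__ == "__main__":
    # Placeholder barcode for illustration only.
    print(barcode_to_url("Z000000000"))
```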
Use of the Corpus for Machine Learning
This corpus was used to train an automatic classifier in this publication:
Jan Rörden, Doris Gruber, Martin Krickl, Bernhard Haslhofer (2019). Identifying Historical Travelogues in Large Text Corpora Using Machine Learning (accepted for publication). arXiv:2001.01673 [cs.DL]
More information and the source code are available in this repository: Travelogues/identifying-travelogues.