Diachronic News and Travel (DNT) corpus.
This corpus contains 279 documents, for a total of 183,517 tokens, distributed across three genres (news, travel reports, and travel guides) and two temporal periods (1862-1939 and 1998-2017).
This repository contains the raw text data, divided by genre and temporal period (e.g. guide-hist is the folder with travel guides from the 1862-1939 period).
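As a rough illustration of how this folder layout could be read, here is a minimal Python sketch. It assumes every data folder follows the `<genre>-<period>` naming pattern seen in guide-hist and that documents are stored as UTF-8 `.txt` files. Both are assumptions, not something this README confirms, and the function names `load_dnt` and `count_tokens` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def load_dnt(root):
    """Group raw texts by (genre, period), parsed from folder names like 'guide-hist'."""
    corpus = defaultdict(list)
    for folder in sorted(Path(root).iterdir()):
        # Skip plain files, hidden folders, and names without a '-' separator.
        if not folder.is_dir() or folder.name.startswith(".") or "-" not in folder.name:
            continue
        genre, period = folder.name.rsplit("-", 1)
        # Assumption: each document is a UTF-8 encoded .txt file.
        for path in sorted(folder.rglob("*.txt")):
            corpus[(genre, period)].append(path.read_text(encoding="utf-8"))
    return dict(corpus)


def count_tokens(corpus):
    """Whitespace token counts per subcorpus (will not match the official token count)."""
    return {key: sum(len(doc.split()) for doc in docs) for key, docs in corpus.items()}
```

Calling `count_tokens(load_dnt("path/to/DNT"))` on a local clone would give a quick per-folder size check. Whitespace splitting is only an approximation, so the totals should not be expected to reproduce the figure reported above.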
The data have been enriched with manual annotation and are accompanied by NLP processing tools. We aim to make DNT a large, multi-layer corpus annotated for different language phenomena. Feel free to contribute!
Below we link the dedicated repositories for each task:
- Content Types Identification: https://github.com/tommasoc80/ContentTypes
- Event Detection in Historical Texts: https://github.com/dhfbk/Histo
- Named Entity Recognition in Historical Texts: https://github.com/dhfbk/Detection-of-place-names-in-historical-travel-writings
DNT will be presented at the 10th AIUCD conference, "DH for society: e-quality, participation, rights and values in the Digital Age".
@inproceedings{caselli_sprugnoli_dnt2021,
title={{DNT: un Corpus Diacronico e Multigenere di Testi in Lingua Inglese}},
author={Tommaso Caselli and Rachele Sprugnoli},
booktitle={AIUCD2021 - Book of Abstracts, Quaderni di Umanistica Digitale.},
year={2021}
}
COMING SOON: pre-tokenized version of the data with offsets.
This work is licensed under a Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.