The reports are built to provide up-to-date summary statistics on the dataset, to present the test sets we have compiled and annotated ourselves, and to allow the results of the pipeline to be verified and further improved upon. This readme gives an overview of the relevant items. Several summary tables also function as input for the general pipeline and are placed in the config/ folder; the relevant files are linked in this readme as well.
main_descriptives.pdf
- A live report with up-to-date summary statistics collected from the curated dataset.
Place names:
Overviews:
harmonize_places.pdf
- A live report for the place name harmonization, generated by the R-markdown notebook.
../config/places/places_harmonized.tsv
- The mappings between place name variants and harmonized names.
interactive_plots/places_harmonized_network.html
- A visual overview of how place name variants are connected in the harmonization.
../config/places/places_coordinates.tsv
- The harmonized place names and their attached coordinates (see the usage sketch after this list).
interactive_plots/places_on_map.html
- A visual overview of all the coordinates of the place names in the dataset.
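For illustration, a minimal sketch of how the two config tables above might be combined to attach coordinates to raw place name variants. It assumes the readr and dplyr packages; the shared column name harm_name is an assumption, so check the actual headers in the TSV files before running.

```r
# Minimal sketch: attach coordinates to place name variants via the
# harmonized names. `harm_name` is an assumed column name; check the
# actual headers in the TSV files.
library(readr)
library(dplyr)

places_harmonized  <- read_tsv("../config/places/places_harmonized.tsv")
places_coordinates <- read_tsv("../config/places/places_coordinates.tsv")

# Each place name variant maps to a harmonized name, which in turn
# carries a set of coordinates.
places_enriched <- places_harmonized %>%
  left_join(places_coordinates, by = "harm_name")

head(places_enriched)
```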
Build tables:
../config/places/places_rules.tsv
- The rules used in harmonizing place names.
../config/places/resolved_manually_geo.tsv
- Manually resolved place names for which the algorithmic approach did not give a solution.
Test sets:
testsets/testset_places_tartu_variants.tsv
- Place name harmonization variants for Tartu.
testsets/testset_places_checked.tsv
- A sample of 200 place names used to check harmonization accuracy (see the sketch after this list).
testsets/testset_locations_on_map_checked.html
- A map based on the coordinates of these 200 place names, used to check the accuracy of the coordinates.
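A hedged sketch of how the checked sample could be used to verify the pipeline output. The column names harm_name (pipeline output) and harm_name_checked (manual annotation) are assumptions and should be replaced with the actual headers of the test set.

```r
# Sketch: estimate place name harmonization accuracy on the checked sample.
# `harm_name` (pipeline output) and `harm_name_checked` (manual annotation)
# are assumed column names; adjust to the actual headers.
library(readr)
library(dplyr)

checked <- read_tsv("testsets/testset_places_checked.tsv")

checked %>%
  summarise(
    n        = n(),
    accuracy = mean(harm_name == harm_name_checked, na.rm = TRUE)
  )
```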
Publishers:
Overviews:
harmonize_publishers.pdf
- A live report for the publisher name harmonization, generated by the R-markdown notebook.
../config/publishers/publisher_harmonization_mapping.json
- The mappings between publisher name variants and publishers (see the sketch after this list).
publishers_overview_rulebased_harmonization.tsv
- A more detailed overview table of all publisher names in the dataset, with harmonized names and some metadata.
../config/publishers/publisher_similarity_groups.tsv
- The mappings between standardized publisher names and the similarity groups based on text embeddings. The mappings are made for publishers operating in the same location, referred to by harm_name.
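As a usage illustration, a small sketch of applying the harmonization mapping to raw publisher names. It assumes the JSON file is a flat variant-to-harmonized-name lookup, and the example names are hypothetical; check the actual structure of the file first.

```r
# Sketch: map raw publisher name variants to harmonized publisher names.
# Assumes the JSON is a flat variant -> harmonized-name lookup; the example
# names below are hypothetical.
library(jsonlite)

mapping <- fromJSON("../config/publishers/publisher_harmonization_mapping.json")

raw_names  <- c("K. Mattiesen", "K. Mattieseni trükikoda")  # hypothetical variants
harmonized <- unname(mapping[raw_names])
harmonized
```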
Build tables:
../config/publishers/publisher_harmonize_rules.tsv
- The rules by which publishers are harmonized.
Test sets:
testsets/testset_publishers_rulebased_checked.tsv
- Annotation of the rule-based publisher name harmonizations within the test set of publishers from Viljandi.
testsets/testset_publishers_cluster_similarity_checked.tsv
- Annotation of the text-embedding-based publisher name harmonizations within the test set of publishers from Viljandi.
testsets/testset_publishers_harmonize_both_methods_summary.tsv
- The summary table of both methods within the test set of publishers from Viljandi.
The processing algorithms are built to provide an efficient and semi-automated solution for data enrichment. While we have done our best to reduce errors, there is always room for improvement.
The main ways to contribute are to refine the rules underlying the algorithms or to manually correct the output tables. Both options help development and improve the quality of the dataset for all users.
If you find errors or room for improvement, you can 1) edit the summary tables in the repository and send a pull request, 2) edit the rules behind the algorithms and send a pull request, or 3) contact the maintainers by e-mail (current address krister.kruusmaa@tlu.ee) or by creating an issue on GitHub.