The reports are built to provide up-to-date summary statistics on the dataset, to present the test sets we have compiled and annotated ourselves, and to allow the results of the pipeline to be verified and further improved upon. This readme gives an overview of the relevant items. Several summary tables also function as input for the general pipeline and are placed in the config/ folder; the relevant files are linked in this readme as well.
main_descriptives.pdf
- A live report with up-to-date summary statistics collected from the curated dataset.
Place names:
Overviews:
harmonize_places.pdf
- A live report for the place name harmonization, generated by the R-markdown notebook.
../config/places/places_harmonized.tsv
- The mappings between place name variants and harmonized names.
interactive_plots/places_harmonized_network.html
- A visual overview of how place name variants are connected in the harmonization.
../config/places/places_coordinates.tsv
- The harmonized place names and their attached coordinates (see the usage sketch after this list).
interactive_plots/places_on_map.html
- A visual overview of all the coordinates of the place names in the dataset.
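For illustration, a minimal sketch of how the two config tables above might be combined to attach coordinates to raw place name variants. It assumes the readr and dplyr packages; the shared column name harm_name is an assumption, so check the actual headers in the TSV files before running.

```r
# Minimal sketch: attach coordinates to place name variants via the
# harmonized names. `harm_name` is an assumed column name; check the
# actual headers in the TSV files.
library(readr)
library(dplyr)

places_harmonized  <- read_tsv("../config/places/places_harmonized.tsv")
places_coordinates <- read_tsv("../config/places/places_coordinates.tsv")

# Each place name variant maps to a harmonized name, which in turn
# carries a set of coordinates.
places_enriched <- places_harmonized %>%
  left_join(places_coordinates, by = "harm_name")

head(places_enriched)
```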
Build tables:
../config/places/places_rules.tsv
- The rules used in harmonizing place names.
../config/places/resolved_manually_geo.tsv
- Manually resolved place names for which the algorithmic approach did not give a solution.
Test sets:
testsets/testset_places_tartu_variants.tsv
- Place name harmonization variants for Tartu.
testsets/testset_places_checked.tsv
- A sample of 200 place names used to check harmonization accuracy (see the sketch after this list).
testsets/testset_locations_on_map_checked.html
- A map based on the coordinates of these 200 place names, used to check the accuracy of the coordinates.
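A hedged sketch of how the checked sample could be used to verify the pipeline output. The column names harm_name (pipeline output) and harm_name_checked (manual annotation) are assumptions and should be replaced with the actual headers of the test set.

```r
# Sketch: estimate place name harmonization accuracy on the checked sample.
# `harm_name` (pipeline output) and `harm_name_checked` (manual annotation)
# are assumed column names; adjust to the actual headers.
library(readr)
library(dplyr)

checked <- read_tsv("testsets/testset_places_checked.tsv")

checked %>%
  summarise(
    n        = n(),
    accuracy = mean(harm_name == harm_name_checked, na.rm = TRUE)
  )
```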
Publishers:
Overviews:
harmonize_publishers.pdf
- A live report for the publisher name harmonization, generated by the R-markdown notebook.
../config/publishers/publisher_harmonization_mapping.json
- The mappings between publisher name variants and publishers (see the sketch after this list).
publishers_overview_rulebased_harmonization.tsv
- A more detailed overview table of all publisher names in the dataset, with harmonized names and some metadata.
../config/publishers/publisher_similarity_groups.tsv
- The mappings between standardized publisher names and the similarity groups based on text embeddings. The mappings are made for publishers operating in the same location, referred to by harm_name.
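As a usage illustration, a small sketch of applying the harmonization mapping to raw publisher names. It assumes the JSON file is a flat variant-to-harmonized-name lookup, and the example names are hypothetical; check the actual structure of the file first.

```r
# Sketch: map raw publisher name variants to harmonized publisher names.
# Assumes the JSON is a flat variant -> harmonized-name lookup; the example
# names below are hypothetical.
library(jsonlite)

mapping <- fromJSON("../config/publishers/publisher_harmonization_mapping.json")

raw_names  <- c("K. Mattiesen", "K. Mattieseni trükikoda")  # hypothetical variants
harmonized <- unname(mapping[raw_names])
harmonized
```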
Build tables:
../config/publishers/publisher_harmonize_rules.tsv
- The rules by which publishers are harmonized.
Test sets:
testsets/testset_publishers_rulebased_checked.tsv
- Annotation of the rule-based publisher name harmonizations within the test set of publishers from Viljandi.
testsets/testset_publishers_cluster_similarity_checked.tsv
- Annotation of the text-embedding-based publisher name harmonizations within the test set of publishers from Viljandi.
testsets/testset_publishers_harmonize_both_methods_summary.tsv
- The summary table of both methods within the test set of publishers from Viljandi.
The processing algorithms are built to provide an efficient and semi-automated solution for data enrichment. While we have done our best to reduce errors, there is always room for improvement.
The main ways to contribute are to refine the rules underlying the algorithms or to manually correct the output tables. Both options help development and improve the quality of the dataset for all users.
If you find errors or room for improvement, you can 1) edit the summary tables in the repository and send a pull request, 2) edit the rules behind the algorithms and send a pull request, or 3) contact the maintainers by e-mail (current address krister.kruusmaa@tlu.ee) or by creating an issue on GitHub.