Duplicate groups #2
Comments
Yes, there are editorial inconsistencies in the original, but these shouldn't matter for building the database, right? I've done a lot of checking of group and code names. The editorial policy, however, has been to fix new mistakes wherever possible but to keep the original transcription in the body of the XML. All the normalization of group names and codes has been done in the CatRef attribute values, like #grp-danish. Those are tightly constrained in the schema to an enumerated list, and I've been validating against them to catch anomalies. To build your database for searching and indexing, take values from the header rather than from the text or front matter. The source (newspaper) names are where a lot of normalization is still needed. I've made some progress, but it's a huge job. Let me know if you have questions about what's in the markup, or ideas about how to change it. I have a batch of a hundred or so small changes (many of them improvements) that I'll push up soon, with more to keep coming indefinitely (not that the site will hold for all of them).
Fair enough. I've rescripted the loader to use the CatRef attribute for the canonical name, and I'll just put together a dictionary for the display names. Source is a whole different beast, of course, and we can deal with that in a separate thread.
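
For anyone following along, here is a rough sketch of what such a loader step might look like. It is not the actual loader: the sample document, the `catRef` element with a `target` attribute, and the `DISPLAY_NAMES` entries are all assumptions made for illustration, and only the `#grp-danish` value comes from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files may be structured differently.
SAMPLE = """<TEI>
  <teiHeader>
    <profileDesc>
      <textClass>
        <catRef target="#grp-danish"/>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text><front><p>DANISH</p></front></text>
</TEI>"""

# Hand-maintained display names (illustrative entries only).
DISPLAY_NAMES = {"grp-danish": "Danish", "grp-german": "German"}


def canonical_groups(xml_text):
    """Return the group ids referenced by catRef elements, ignoring body text."""
    root = ET.fromstring(xml_text)
    ids = []
    for el in root.iter():
        tag = el.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        if tag.lower() != "catref":
            continue
        # A target may hold several space-separated references.
        for ref in el.get("target", "").split():
            ref = ref.lstrip("#")
            if ref.startswith("grp-"):
                ids.append(ref)
    return ids


def display_name(group_id):
    """Look up a display name, falling back to the raw id."""
    return DISPLAY_NAMES.get(group_id, group_id)


groups = canonical_groups(SAMPLE)
```

Falling back to the raw id in `display_name` keeps a missing dictionary entry visible instead of silently dropping the group.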
Here's a partial list of groups that are listed multiple times, in different ways:
GERMAN
German
BOHEMIAN
BOHEMIANS
Bohemian
CZECH
CZECHOSLOVAKIAN
CROATIAN
Croatian
CROATIAN (Serbian section)
CROATIAN (Serbian Section)
DANISH
Danish
POLISH
Polish
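
A quick way to surface variants like these is to bucket the raw headings by a case-folded key. This is only a sketch using the names listed above; the `fold` and `find_variants` helpers are hypothetical and not part of the project.

```python
from collections import defaultdict

# Raw group headings copied from the list above.
RAW = [
    "GERMAN", "German",
    "BOHEMIAN", "BOHEMIANS", "Bohemian",
    "CZECH", "CZECHOSLOVAKIAN",
    "CROATIAN", "Croatian",
    "CROATIAN (Serbian section)", "CROATIAN (Serbian Section)",
    "DANISH", "Danish",
    "POLISH", "Polish",
]


def fold(name):
    """Collapse whitespace and ignore case when comparing names."""
    return " ".join(name.split()).casefold()


def find_variants(names):
    """Map each folded key to its distinct spellings, keeping only keys with more than one."""
    buckets = defaultdict(set)
    for name in names:
        buckets[fold(name)].add(name)
    return {key: sorted(vals) for key, vals in buckets.items() if len(vals) > 1}


variants = find_variants(RAW)
for key, spellings in variants.items():
    print(key, "->", spellings)
```

Case folding only catches differences in capitalization and spacing. Pairs such as BOHEMIAN/BOHEMIANS or CZECH/CZECHOSLOVAKIAN would still need a hand-built mapping.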