Duplicate groups #2

Closed
santheo opened this issue Aug 12, 2010 · 2 comments

Comments

@santheo
Owner

santheo commented Aug 12, 2010

Here's a partial list of groups that are listed multiple times, in different ways:

GERMAN
German

BOHEMIAN
BOHEMIANS
Bohemian

CZECH
CZECHOSLOVAKIAN

CROATIAN
Croatian

CROATIAN (Serbian section)
CROATIAN (Serbian Section)

DANISH
Danish

POLISH
Polish

@knoxdw
Collaborator

knoxdw commented Aug 13, 2010

Yes, there are editorial inconsistencies in the original, but these shouldn't matter for building the database, right? I've done a lot of checking of group and code names. The editorial policy, however, has been to fix new mistakes wherever possible but to keep the original transcription in the body of the XML. All the normalization of group names and codes lives in the CatRef attribute values, like #grp-danish. Those are tightly constrained in the schema to an enumerated list, and I've been validating against them to catch anomalies. To build your database for searching and indexing, take the values from the header rather than from the text/front matter.
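A minimal sketch of that header-only lookup and the check against an enumerated list, for illustration. It assumes a TEI-style layout in which the normalized code sits in the `target` attribute of a `catRef` element inside `teiHeader`; that layout, the namespace-free sample, and every name below other than `#grp-danish` are assumptions, not taken from the project files.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list. Only "#grp-danish" appears in this thread; the
# other values are guesses at the pattern, and the real list is in the schema.
ALLOWED_GROUPS = {"#grp-danish", "#grp-german", "#grp-polish"}

# Hypothetical, namespace-free sample. The body keeps the original
# transcription ("DANISH"); the header carries the normalized code.
SAMPLE = """\
<TEI>
  <teiHeader>
    <profileDesc>
      <textClass>
        <catRef target="#grp-danish"/>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text>
    <front><p>DANISH</p></front>
  </text>
</TEI>
"""


def group_refs(xml_text):
    """Return catRef targets found in the header only, never the body."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")
    if header is None:
        return []
    return [el.get("target") for el in header.iter("catRef") if el.get("target")]


def anomalies(xml_text, allowed=ALLOWED_GROUPS):
    """Return header group codes that are not in the enumerated list."""
    return [ref for ref in group_refs(xml_text) if ref not in allowed]
```

Real files probably declare an XML namespace, in which case the tag names passed to `find` and `iter` would need the namespace prefix.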

The source (newspaper) names are where a lot of normalization still remains to be done. I've made some progress but it's a huge job.

Let me know if you have questions about what's in the markup, or ideas about how to change the markup. I have a batch of a hundred or so small changes (many improvements) that I'll push up soon, with more to keep coming indefinitely (not that the site will hold for all of them).

@santheo
Owner Author

santheo commented Aug 16, 2010

Fair enough. I've rescripted the loader to use the CatRef attribute for the canonical name, and I'll just put together a dictionary for the display names.
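Roughly, the loader change looks like the sketch below. This is an illustration rather than the actual loader: the `catRef`/`target` layout, the helper names, and the dictionary entries are all assumptions. The point is that records transcribed as "DANISH" and "Danish" end up under one canonical code, and the display name comes from a separate dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical display-name dictionary keyed by canonical code.
DISPLAY_NAMES = {"grp-danish": "Danish", "grp-german": "German"}


def make_doc(body_label, target):
    """Build a tiny, namespace-free sample record (hypothetical layout)."""
    return (
        f'<TEI><teiHeader><catRef target="{target}"/></teiHeader>'
        f"<text><front><p>{body_label}</p></front></text></TEI>"
    )


def canonical_code(xml_text):
    """Return the first group code (without the leading '#'), or None."""
    root = ET.fromstring(xml_text)
    for el in root.iter("catRef"):
        target = el.get("target", "")
        if target.startswith("#grp-"):
            return target[1:]
    return None


def display_name(code):
    """Look up the display name, falling back to a title-cased code."""
    if code in DISPLAY_NAMES:
        return DISPLAY_NAMES[code]
    return code[len("grp-"):].replace("-", " ").title()


def load(documents):
    """Count records per display name, ignoring the body transcription."""
    counts = {}
    for xml_text in documents:
        code = canonical_code(xml_text)
        if code is None:
            continue  # no group code in this record
        name = display_name(code)
        counts[name] = counts.get(name, 0) + 1
    return counts


docs = [make_doc("DANISH", "#grp-danish"), make_doc("Danish", "#grp-danish")]
print(load(docs))
```

Keeping the display names in a dictionary means the transcription in the body never has to be touched to get consistent labels.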

Source is a whole different beast, of course, and we can deal with that in a separate thread.

This issue was closed.