Add replace_unknown_speakers_goldstandard.py for automated TEI speaker mapping #31
mandlilaast merged 5 commits into dev
What issue is this connected to?
BobBorges
left a comment
Looks reasonable. I have a few comments throughout. Some things seem to be a bit more verbose than necessary. Otherwise the only real issue is that we don't want to put `who` attributes on `<note>` elements.
    return modified


def apply_speaker_recursively(el: etree._Element, person_id: str, folder_type: str) -> bool:
Why are you doing this? Don't you already have a row for every element that contains an "unknown" to be replaced?
I remember that in some cases the speaker and segment structure was different: instead of being linked with prev/next, the elements were nested under each other. Unfortunately, I could not find any examples of this at the moment.
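To make the nested case concrete, here is a minimal sketch of a recursive walk, assuming the goal is to replace `who="unknown"` on `<u>` and `<seg>` while leaving `<note>` untouched (as requested in the review). It uses the standard-library `xml.etree.ElementTree` rather than `lxml`, drops the `folder_type` parameter for brevity, and the sample fragment and the `i-HYPOTHETICAL` id are invented for illustration; it is not the script's actual implementation.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumption: only these elements may carry a speaker; <note> is excluded.
SPEAKER_TAGS = {"u", "seg"}


def local_name(el: ET.Element) -> str:
    """Return the tag name without its namespace prefix."""
    return el.tag.rsplit("}", 1)[-1]


def apply_speaker_recursively(el: ET.Element, person_id: str) -> bool:
    """Replace who="unknown" on el and its descendants; return True if anything changed."""
    modified = False
    if local_name(el) in SPEAKER_TAGS and el.get("who") == "unknown":
        el.set("who", person_id)
        modified = True
    for child in el:
        # Recurse first, then combine, so every child is always visited.
        if apply_speaker_recursively(child, person_id):
            modified = True
    return modified


# Hypothetical fragment with a <seg> and a <note> nested under a <u>.
sample = ET.fromstring(
    f'<div xmlns="{TEI_NS}">'
    '<u who="unknown">'
    '<seg who="unknown">Herr talman!</seg>'
    '<note who="unknown">(Applause)</note>'
    '</u>'
    '</div>'
)
changed = apply_speaker_recursively(sample, "i-HYPOTHETICAL")
```

If every element with an "unknown" already has its own row in the mapping, this recursion is redundant; it only matters for nested structures like the hypothetical one above.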
The script has been fixed, and the related sample was approved 20/20 in PR swerik-project/riksdagen-records#127. Are there any other changes I should make, @BobBorges?
@mandlilaast fix the assignees.
Description
This PR adds a new script to automatically apply speaker annotations in TEI XML files based on TSV/CSV mappings. It handles unknown speakers, propagates `who` and `type` attributes recursively, and follows `<u next>` chains and sibling elements.
Key Features
- `is-speaker` and `non-speaker` mappings from TSV/CSV files or folders.
- `<u>`, `<seg>`, and `<note>` elements, including nested structures.
- `<u next>` chains and following siblings.
- `pyriksdagen.io.parse_tei` and `write_tei` for parsing and writing TEI files, ensuring TEI-compliant output.
- `speaker_mapping_failures.tsv`.
- `--folder`; no hard-coded file paths.
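The mapping-and-chain behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged script: the TSV column names `elem_id` and `person_id`, the sample ids, and the helper names `load_mapping` and `apply_along_chain` are all hypothetical, `next` is assumed to hold the `xml:id` of the continuing `<u>` (with or without a leading `#`), and the standard-library parser stands in for `pyriksdagen.io.parse_tei`. Sibling handling and failure logging are left out.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Dict, List

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def load_mapping(tsv_text: str) -> Dict[str, str]:
    """Read a TSV with hypothetical columns elem_id and person_id."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["elem_id"]: row["person_id"] for row in reader}


def apply_along_chain(root: ET.Element, start_id: str, person_id: str) -> List[str]:
    """Set who on the <u> with start_id and on every <u> reachable via next."""
    by_id = {
        el.get(XML_ID): el
        for el in root.iter(f"{{{TEI_NS}}}u")
        if el.get(XML_ID)
    }
    updated: List[str] = []
    seen = set()  # guards against a malformed, cyclic next chain
    current = by_id.get(start_id)
    while current is not None and current.get(XML_ID) not in seen:
        seen.add(current.get(XML_ID))
        if current.get("who") == "unknown":
            current.set("who", person_id)
            updated.append(current.get(XML_ID))
        # Assumption: next holds an xml:id, optionally written as "#id".
        next_id = (current.get("next") or "").lstrip("#")
        current = by_id.get(next_id)
    return updated


# Hypothetical data: one mapping row and a two-part utterance chain.
TSV = "elem_id\tperson_id\ni-1\ti-HYPOTHETICAL\n"
doc = ET.fromstring(
    f'<div xmlns="{TEI_NS}">'
    '<u xml:id="i-1" who="unknown" next="i-2">Herr talman!</u>'
    '<note xml:id="i-n">(Applause)</note>'
    '<u xml:id="i-2" who="unknown" prev="i-1">Jag yrkar bifall.</u>'
    '<u xml:id="i-3" who="unknown">Another speaker.</u>'
    '</div>'
)
for elem_id, person_id in load_mapping(TSV).items():
    updated = apply_along_chain(doc, elem_id, person_id)
```

Only `<u>` elements are indexed here, so the intervening `<note>` is never given a `who`, and the unrelated `i-3` stays "unknown" because nothing in the chain points to it.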