Add replace_unknown_speakers_goldstandard.py for automated TEI speaker mapping#31

Merged: mandlilaast merged 5 commits into dev from Issue-89-new on Dec 17, 2025
@mandlilaast
Contributor

Description

This PR adds a new script to automatically apply speaker annotations in TEI XML files based on TSV/CSV mappings. It handles unknown speakers, propagates who and type attributes recursively, and follows <u next> chains and sibling elements.

Key Features

  • Processes is-speaker and non-speaker mappings from TSV/CSV files or folders.
  • Recursively updates <u>, <seg>, and <note> elements, including nested structures.
  • Propagates speaker information along <u next> chains and following siblings.
  • Uses pyriksdagen.io.parse_tei and write_tei for parsing and writing TEI files, ensuring TEI-compliant output.
  • Supports multiprocessing for faster processing of multiple files.
  • Logs all failed rows to speaker_mapping_failures.tsv.
  • Fully dynamic input paths via --folder; no hard-coded file paths.
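
For illustration, following a `<u next>` chain as described above might look roughly like this. It is a minimal sketch, not the script from this PR: it uses the standard-library `xml.etree.ElementTree` instead of `pyriksdagen.io.parse_tei`/`write_tei`, and it assumes that `next` holds the target's `xml:id` (optionally prefixed with `#`). The function name, the sample ids, and the person id are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def propagate_along_next_chain(root, start_id, person_id):
    """Set who=person_id on the <u> with xml:id == start_id and on every
    <u> reachable through its `next` attribute. Returns the ids touched."""
    # Index all <u> elements by xml:id, ignoring the TEI namespace on the tag.
    by_id = {
        el.get(XML_ID): el
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "u" and el.get(XML_ID)
    }
    touched = []
    seen = set()  # guards against accidental cycles in next/prev links
    current = by_id.get(start_id)
    while current is not None and current.get(XML_ID) not in seen:
        seen.add(current.get(XML_ID))
        current.set("who", person_id)
        touched.append(current.get(XML_ID))
        # Assumption: `next` is a bare id or a '#'-prefixed pointer.
        next_id = (current.get("next") or "").lstrip("#")
        current = by_id.get(next_id)
    return touched


sample = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <u xml:id="u1" who="unknown" next="u2"><seg>First part.</seg></u>
  <note>Interjection</note>
  <u xml:id="u2" who="unknown" prev="u1"><seg>Second part.</seg></u>
  <u xml:id="u3" who="unknown"><seg>Someone else.</seg></u>
</div>"""

root = ET.fromstring(sample)
touched = propagate_along_next_chain(root, "u1", "i-HypotheticalPersonId")
```

Here only `u1` and `u2` receive the new `who` value; `u3` is not linked into the chain and stays `unknown`.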

@MansMeg
Contributor

MansMeg commented Nov 10, 2025

What issue is this connected to?

@mandlilaast
Contributor Author

mandlilaast commented Nov 10, 2025

Bob noticed that I never uploaded the script that actually made the changes in PR 127.

Contributor

@BobBorges BobBorges left a comment

Looks reasonable. I have a few comments throughout. Some things seem to be a bit more verbose than necessary. Otherwise the only real issue is that we don't want to put who attributes on note elements.

Comment thread src/cur-prot/replace-unknown-speakers-by-goldstandard.py Outdated
return modified


def apply_speaker_recursively(el: etree._Element, person_id: str, folder_type: str) -> bool:
Contributor

Why are you doing this? Don't you already have a row for every element that contains an "unknown" to be replaced?

Contributor Author

I remember that in some cases the speaker and segment structure was not the same (with prev-next), but nested under each other. I could not find any examples of it atm, unfortunately.
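
The nested case described above could be handled along these lines. This is only a sketch under stated assumptions, not the code from this PR: the function name, the sample markup, and the person id are hypothetical, it uses the standard-library `xml.etree.ElementTree`, and it sets `who` on `<u>` elements only, in line with the review remark that `<note>` elements should not carry `who`.

```python
import xml.etree.ElementTree as ET


def local_name(el):
    """Tag name without a '{namespace}' prefix ('' for non-element nodes)."""
    return el.tag.rsplit("}", 1)[-1] if isinstance(el.tag, str) else ""


def set_who_on_nested_u(el, person_id):
    """Replace who="unknown" on el and on any nested <u> below it.
    <seg> and <note> are traversed but never receive a who attribute.
    Returns True if at least one attribute was changed."""
    modified = False
    if local_name(el) == "u" and el.get("who") == "unknown":
        el.set("who", person_id)
        modified = True
    for child in el:
        # Recurse before combining, so short-circuiting never skips a subtree.
        child_modified = set_who_on_nested_u(child, person_id)
        modified = modified or child_modified
    return modified


# Hypothetical markup for the nested situation; no such example was found in the data.
nested = ET.fromstring(
    '<u who="unknown"><seg>Outer.</seg><note>Applause</note>'
    '<u who="unknown"><seg>Inner.</seg></u></u>'
)
changed = set_who_on_nested_u(nested, "i-HypotheticalPersonId")
```

If every `unknown` element already has its own mapping row, a recursion like this is redundant, which is the point of the question above.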

@mandlilaast
Contributor Author

The script has been fixed, and the related sample was approved 20/20 in PR swerik-project/riksdagen-records#127.

Are there any other changes I should make, @BobBorges?

@MansMeg
Contributor

MansMeg commented Dec 16, 2025

@mandlilaast fix the assignees.

@mandlilaast mandlilaast merged commit bf76959 into dev Dec 17, 2025
1 check passed
@mandlilaast mandlilaast deleted the Issue-89-new branch December 17, 2025 08:35