@benjaminking commented Oct 30, 2025

This PR improves quote convention detection for Paratext projects, especially messy projects that apply their quote conventions inconsistently. It does this by implementing a weighted voting scheme across the books of a project. It also makes a small change to the way quote convention similarity is calculated to accommodate the weighted voting. Finally, it adds a new quote convention that was recently observed in a project.

On a set of 57 real projects submitted to Serval, this improves the accuracy of quote convention detection from 40% to 95%.
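A rough sketch of the voting idea, with hypothetical helper names (the real code structures this differently):

    from collections import defaultdict

    def vote_for_project_convention(book_scores, book_weights):
        """Combine per-book convention scores with a weighted vote.

        book_scores: one dict per book mapping convention name -> similarity score.
        book_weights: one float per book, e.g. how many quotation marks were
            tabulated in that book, so books with more evidence count for more.
        """
        totals = defaultdict(float)
        for scores, weight in zip(book_scores, book_weights):
            for convention, score in scores.items():
                totals[convention] += weight * score
        return max(totals.items(), key=lambda item: item[1])[0] if totals else None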



@benjaminking requested a review from Enkidu93 on October 30, 2025 at 18:59

@ddaspit left a comment


@ddaspit reviewed 14 of 14 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @Enkidu93)


machine/punctuation_analysis/quote_convention_detector.py line 53 at r1 (raw file):

        return STANDARD_QUOTE_CONVENTIONS.score_all_quote_conventions(self._quotation_mark_tabulator)

    def detect_quote_convention_and_get_tabulated_quotation_marks(

Could we expose the tabulated quotation marks from the QuoteConventionAnalysis instead of returning it separately?
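Something along these lines is what I have in mind (the property name is illustrative, not the actual API):

    class QuoteConventionAnalysis:
        def __init__(self, convention_scores, tabulated_quotation_marks):
            self._convention_scores = convention_scores
            self._tabulated_quotation_marks = tabulated_quotation_marks

        @property
        def tabulated_quotation_marks(self):
            # Callers read the tabulation from the analysis object
            # instead of getting it as a second return value.
            return self._tabulated_quotation_marks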


machine/punctuation_analysis/quote_convention.py line 64 at r1 (raw file):

    def __hash__(self) -> int:
        return hash((tuple(self.level_conventions)))

Are the extra parentheses necessary? Doesn't tuple return a tuple?
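For reference, since the inner call already builds the tuple, this should be equivalent:

    def __hash__(self) -> int:
        return hash(tuple(self.level_conventions))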


@Enkidu93 left a comment


@Enkidu93 reviewed 14 of 14 files at r1, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @benjaminking)


tests/punctuation_analysis/test_quote_convention_set.py line 1254 at r1 (raw file):

    assert all_three_quote_convention_set.find_most_similar_convention(noisy_multiple_english_quotes_tabulator) == (
        standard_english_quote_convention,
        approx(0.8333333333333, rel=1e-9),
    )

I'm noticing that these scores all seem to be going down. Why is that the case? If the intent is to use the same logic as before but weight it by book, shouldn't these stay the same? Is it because of this comment: "# The scores of greater depths depend on the scores of shallower depths"? Is the idea that if the top-level quotes are off, that should be compounded into the score for deeper quotes? Was this motivated by particular examples?

We aren't thresholding on these values anywhere at the moment, are we? If we are, we need to make sure we update those threshold values.
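For my own understanding, I'd expect the compounding to look roughly like this (my sketch of the idea, not the actual code):

    def combined_similarity(depth_scores):
        # depth_scores[d] is the match score for quotation marks at depth d + 1.
        # Each deeper level is multiplied by the levels above it, so a mismatch
        # at the top level drags down every deeper level's contribution.
        combined = []
        running = 1.0
        for score in depth_scores:
            running *= score
            combined.append(running)
        return sum(combined) / len(combined) if combined else 0.0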


machine/punctuation_analysis/quote_convention_analysis.py line 17 at r1 (raw file):

        self._convention_scores = convention_scores
        if len(convention_scores) > 0:
            self._best_quote_convention_score = max(convention_scores.items(), key=lambda item: item[1])[1]

You should combine this with setting self._best_quote_convention_score so you don't have to calculate max(...) twice.
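Something like this, assuming the class also tracks the best convention itself (the attribute names are my guesses):

    class QuoteConventionAnalysis:
        def __init__(self, convention_scores):
            self._convention_scores = convention_scores
            self._best_quote_convention = None
            self._best_quote_convention_score = 0.0
            if convention_scores:
                # One max() call yields both the winning convention and its score.
                best_convention, best_score = max(convention_scores.items(), key=lambda item: item[1])
                self._best_quote_convention = best_convention
                self._best_quote_convention_score = best_score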
