
CER calculation

Mike Gerber edited this page Nov 12, 2020 · 1 revision
  • We treat grapheme clusters as characters
    • Reasoning: This is what users commonly perceive as characters. We cannot simply use code points, because some grapheme clusters, such as LATIN SMALL LETTER M followed by COMBINING TILDE, cannot be represented by a single code point.
  • We count whitespace
    • Reasoning: A missing space is an error
  • We count punctuation
    • Reasoning: A missing period (.) or wrong hyphen (-) is an error
  • Normalization
    • We normalize MUFI PUA characters to their canonically equivalent Unicode representations
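
The rules above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated on this page: CER is taken to be the Levenshtein distance divided by the number of characters in the reference text, grapheme clusters are approximated as a base code point plus any following combining marks (a simplification, not full Unicode text segmentation), and the MUFI PUA normalization step is left out. The function names `clusters`, `edit_distance` and `cer` are hypothetical and chosen only for illustration.

```python
import unicodedata


def clusters(text):
    """Approximate grapheme clusters (assumption: base + combining marks only)."""
    result = []
    for ch in text:
        # A combining mark is attached to the preceding cluster, if there is one.
        if result and unicodedata.combining(ch):
            result[-1] += ch
        else:
            result.append(ch)
    return result


def edit_distance(a, b):
    """Levenshtein distance between two sequences, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Assumed definition: edit distance / number of reference clusters."""
    ref = clusters(reference)
    hyp = clusters(hypothesis)
    if not ref:
        return 0.0 if not hyp else float("inf")
    return edit_distance(ref, hyp) / len(ref)


# Whitespace and punctuation are kept, so a missing space counts as one error.
print(cer("a b.", "ab."))
# "m" + COMBINING TILDE is one character, so losing the tilde is one full error.
print(cer("m\u0303", "m"))
```

Because whitespace and punctuation are never stripped before comparison, they contribute to both the error count and the reference length, which matches the reasoning given in the list.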