# Unicode mess

In [None]:
# https://hsivonen.fi/string-length/
# https://www.python.org/dev/peps/pep-0393/ -- PEP 393 -- Flexible String Representation
uni_str = "🤦🏼‍♂️"
len(uni_str)

5
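
The 5 above counts code points, which is only one of several possible "lengths" for this string. As a minimal sketch (the variable names are made up for illustration), the same emoji can be measured in code points, UTF-8 bytes and UTF-16 code units:

```python
# The same five code points as uni_str, written with escapes.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(s)                           # what Python's len() reports
utf8_bytes = len(s.encode("utf-8"))            # size on disk / on the wire in UTF-8
utf16_units = len(s.encode("utf-16-le")) // 2  # two bytes per UTF-16 code unit

print(code_points, utf8_bytes, utf16_units)    # 5 17 7
```

None of these numbers is 1, even though the string is rendered as a single glyph.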

In [None]:
# enter large codepoints with \UXXXXXXXX and small with \uXXXX
print("\U0001F926", len("\U0001F926"))

🤦 1


In [None]:
for char in uni_str:
    print('"\\U{:08x}"'.format(ord(char)))

"\U0001f926"
"\U0001f3fc"
"\U0000200d"
"\U00002642"
"\U0000fe0f"


Trying to mess with surrogates

Surrogates occupy the code points `\uD800` through `\uDFFF`, inclusive.

In [None]:
# surrogate_test = "\uD801"
# print(surrogate_test, len(surrogate_test))

# We would get:
# UnicodeEncodeError: 'utf-8' codec can't encode character '\ud801' in position 28: surrogates not allowed
# IOStream.flush timed out
# IOStream.flush timed out
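
A safer way to observe the same failure, without sending the lone surrogate to the notebook's output stream, is to encode it explicitly and catch the exception. This is only a sketch; `lone`, `reason` and `escaped` are illustrative names:

```python
lone = "\uD801"  # a lone high surrogate, legal inside a Python str

try:
    lone.encode("utf-8")  # strict UTF-8 refuses surrogates
except UnicodeEncodeError as exc:
    reason = exc.reason

print(reason)  # surrogates not allowed

# For debugging, an error handler can make the offending value visible.
escaped = lone.encode("utf-8", errors="backslashreplace")
print(escaped)  # b'\\ud801'
```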

In [None]:
# Can we convert a 32 bit code in this area to something printable?
chr(0xD801)

'\ud801'

In [None]:
chr(0xD801) == "\uD801"

True

In [None]:
ord("\uD801")

55297

In [None]:
# Test surrogate encoding vs direct encoding of a large Unicode code point
# (D83D DC69 is the UTF-16 surrogate pair for U+1F469)
'\uD83D\uDC69' == "\U0001F469"

False

In [None]:
len('\uD83D\uDC69')

2

In [None]:
len("\U0001F600")

1

In [None]:
'\uD83D\uDC69' == '\U0000D83D\U0000DC69'

True

Ok, so it seems we should also avoid accepting surrogates, even though Python tolerates them in its internal representation.
We are likely to produce errors in the output.
More background about UTF-16 surrogates in UTF-32 below.
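
Python does not combine two surrogate escapes into one character by itself, but the combination can be forced for inspection. The sketch below assumes the `surrogatepass` error handler is acceptable for a throwaway experiment; it round-trips the pair through UTF-16 and also applies the UTF-16 decoding formula by hand:

```python
pair = "\uD83D\uDC69"  # two separate surrogate code points in a Python str

# Let the surrogates through the encoder, then decode as real UTF-16.
raw = pair.encode("utf-16-le", errors="surrogatepass")
combined = raw.decode("utf-16-le")
print(len(pair), len(combined), hex(ord(combined)))  # 2 1 0x1f469

# Same result from the UTF-16 formula.
hi, lo = 0xD83D, 0xDC69
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(cp))  # 0x1f469
```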

From: https://en.wikipedia.org/wiki/UTF-32#Variants
"ISO/IEC 10646:2020". standards.iso.org. Retrieved 2021-10-12. "Clause 9.4: "Because surrogate code points are not UCS scalar values, UTF-32 code units in the range 0000 D800-0000 DFFF are ill-formed". Clause 4.57: "[UCS codespace] consisting of the integers from 0 to 10 FFFF (hexadecimal)". Clause 4.58: "[UCS scalar value] any UCS code point except high-surrogate and low-surrogate code points"."

Conclusion about the characters we will accept: a subset of `[\u0000 - \uD7FF] ∪ [\uE000 - \uFFFF]`

Acceptable Blocks from the BMP (https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane):
- Basic Latin (Lower half of ISO/IEC 8859-1: ISO/IEC 646:1991-IRV aka ASCII) (0000–007F) for values >= U+0020 and < U+007F and \n...
- Latin-1 Supplement (Upper half of ISO/IEC 8859-1) (0080–00FF) for values >= U+00A0
- Latin Extended-A (0100–017F)
- ~~Latin Extended-B (0180–024F)~~
- ~~Spacing Modifier Letters (02B0–02FF)~~
- ~~Combining Diacritical Marks (0300–036F)~~
- Greek and Coptic (0370–03FF) -> only if needed
- ~~Cyrillic (0400–04FF)~~ -> project to replacement char
- ~~Cyrillic Supplement (0500–052F)~~ -> project to replacement char
- ~~Cyrillic Extended-C (1C80–1C8F)~~ -> project to replacement char
- Latin supplements:
  *  ~~Phonetic Extensions (1D00–1D7F)~~
  *  ~~Phonetic Extensions Supplement (1D80–1DBF)~~
  *  ~~Combining Diacritical Marks Supplement (1DC0–1DFF)~~
  *  ~~Latin Extended Additional (1E00–1EFF)~~
- ~~Greek Extended (1F00–1FFF)~~
- Symbols:
  *  General Punctuation (2000–206F) -> en dash, em dash, etc. (PROJECT SPACES, HYPHENS AND DASHES, FORBID EXTRA CHARS)
  *  ~~Superscripts and Subscripts (2070–209F)~~ -> GT should actually use this (to project)
  *  ~~Currency Symbols (20A0–20CF)~~
  *  ~~Combining Diacritical Marks for Symbols (20D0–20FF)~~
  *  ~~Letterlike Symbols (2100–214F)~~
  *  ~~Number Forms (2150–218F)~~ -> project roman numerals to latin letters if needed (done with NFKD in theory)
  *  ~~Arrows (2190–21FF)~~ -> avoid if possible
  *  ~~Mathematical Operators (2200–22FF)~~
  *  ~~Miscellaneous Technical (2300–23FF)~~ -> would be anachronistic
  *  ~~Control Pictures (2400–243F)~~
  *  ~~Optical Character Recognition (2440–245F)~~
  *  Enclosed Alphanumerics (2460–24FF) -> Used for medals
  *  ~~Box Drawing (2500–257F)~~
  *  ~~Block Elements (2580–259F)~~
  *  ~~Geometric Shapes (25A0–25FF)~~
  *  Miscellaneous Symbols (2600–26FF) -> maybe one or two (hand, star), to project if possible (0x261e...)
  *  Dingbats (2700–27BF) -> maybe one or two, to project if possible
  *  ~~Miscellaneous Mathematical Symbols-A (27C0–27EF)~~
  *  ~~Supplemental Arrows-A (27F0–27FF)~~ -> avoid if possible
  *  ~~Braille Patterns (2800–28FF)~~
  *  ~~Supplemental Arrows-B (2900–297F)~~
  *  ~~Miscellaneous Mathematical Symbols-B (2980–29FF)~~
  *  ~~Supplemental Mathematical Operators (2A00–2AFF)~~
  *  ~~Miscellaneous Symbols and Arrows (2B00–2BFF)~~ -> project stars if needed
- ~~Latin Extended-C (2C60–2C7F)~~
- ~~Cyrillic Extended-A (2DE0–2DFF)~~
- ~~Supplemental Punctuation (2E00–2E7F)~~
- ~~Cyrillic Extended-B (A640–A69F)~~
- ~~Latin Extended-D (A720–A7FF)~~
- Private Use Area (E000–F8FF) -> only if needed, to substitute long custom codes, unhandled scripts, etc.
- Alphabetic Presentation Forms (FB00–FB4F) -> only U+FB00 - U+FB06 (ligatures)
- U+FEFF BYTE ORDER MARK
- U+FFFD � REPLACEMENT CHARACTER
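
The list above could be turned into a simple allow-list check. This is a rough sketch, not a final rule set: `ALLOWED_RANGES`, `is_allowed` and `find_disallowed` are hypothetical names, the "only if needed" blocks (Greek, Private Use Area, Miscellaneous Symbols, Dingbats) are left out, and General Punctuation is accepted as a whole block even though part of it should be projected or forbidden.

```python
# Inclusive code point ranges taken from the accepted blocks (assumed subset).
ALLOWED_RANGES = [
    (0x000A, 0x000A),  # line feed
    (0x0020, 0x007E),  # Basic Latin, printable part
    (0x00A0, 0x00FF),  # Latin-1 Supplement, from NO-BREAK SPACE
    (0x0100, 0x017F),  # Latin Extended-A
    (0x2000, 0x206F),  # General Punctuation
    (0x2460, 0x24FF),  # Enclosed Alphanumerics
    (0xFB00, 0xFB06),  # Latin ligatures
    (0xFEFF, 0xFEFF),  # BYTE ORDER MARK
    (0xFFFD, 0xFFFD),  # REPLACEMENT CHARACTER
]


def is_allowed(ch):
    """Return True if the single character falls in an accepted range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def find_disallowed(text):
    """List (index, code point) for every character outside the allow-list."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if not is_allowed(ch)]


print(find_disallowed("na\u00efve \u0416 \ud801"))
# [(6, 'U+0416'), (8, 'U+D801')]
```

Surrogates are rejected automatically because no accepted range overlaps `D800`–`DFFF`.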

WARNING: what about the "end of paragraph" (or so) that we can get from the OCR or NER output?

WARNING: replace hand symbols like "👉" (0x1f449) from the "Miscellaneous Symbols and Pictographs" block (plane 1) with "☞" (0x261e)
https://www.unicode.org/charts/PDF/U1F300.pdf
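
The projections mentioned above could be sketched as follows. `PROJECTIONS` and `project` are hypothetical names, and the table holds only the single hand symbol named in the warning. NFKD is applied to the Number Forms block only, on the assumption that normalizing the whole text is undesirable because it would also split accented letters into a base letter plus a combining mark, and combining marks are excluded above.

```python
import unicodedata

# Explicit one-to-one projections (illustrative, not exhaustive).
PROJECTIONS = {0x1F449: "\u261E"}  # 👉 -> ☞


def project(text):
    """Apply the explicit table, then NFKD to Number Forms (2150-218F) only."""
    out = []
    for ch in text.translate(PROJECTIONS):
        if 0x2150 <= ord(ch) <= 0x218F:
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


print(project("\U0001F449 \u2163 \u00e9"))  # ☞ IV é
```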