# Normalization

Unicode normalization is used to *normalize* characters that are different but equivalent, converting them to a single consistent form. For example, the following Unicode characters (and character combinations) are equivalent:

**Canonical equivalence**

| Character | Equivalent | Equivalence Reason |
| --- | --- | --- |
| Ç | C◌̧ | Combined character sequences |
| 가 | ᄀ ᅡ | Conjoined Korean characters |

**Compatibility equivalence**

| Character | Equivalent | Equivalence Reason |
| --- | --- | --- |
| ℌ | H | Font variant |
| \[NBSP\] | \[SPACE\] | No-break variant of the space |
| ① | 1 | Circled variant |
| x² | x2 | Superscript |
| xⱼ | xj | Subscript |
| ½ | 1⁄2 | Fractions |

We have mentioned two different types of equivalence here: canonical equivalence and compatibility equivalence.

**Canonical equivalence** means both forms are fundamentally the same and are indistinguishable when rendered. For example, we can take the single code point for `'Ç'` (`\u00C7`), or the two code points for `'C'` (`\u0043`) and the combining cedilla (`\u0327`). When the latter two are rendered together, they look the same as the first character:

In [1]:
print("\u00C7", "\u0043"+"\u0327")

Ç Ç


However, if we print these characters separately, we can see very clearly that they are not the same:

In [2]:
print("\u00C7", "\u0043", "\u0327")

Ç C ̧


These are examples of canonical equivalence, but we also have compatibility equivalence.
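For the canonical case, the difference between the two forms of `Ç` can also be checked programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the variable names `precomposed` and `decomposed` are illustrative and not part of the original notebook. It prints the length of each form and the official name of every code point in it:

```python
import unicodedata

precomposed = "\u00C7"       # a single code point
decomposed = "\u0043\u0327"  # 'C' followed by a combining cedilla

for label, text in [("precomposed", precomposed), ("decomposed", decomposed)]:
    # Look up the Unicode name of each code point in the string.
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)
```

The precomposed form has length 1, while the decomposed form has length 2, even though both render as the same glyph.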

**Compatibility equivalence** relates characters that represent the same abstract character but differ in formatting. These differences include (but are not limited to):

* font variants
* cursive forms
* circled characters
* width variation
* size changes
* rotation
* superscript and subscript
* fractions

In this case, the rendered characters are visibly different, for example `ℌ` and `H`, or `½` and `1 ⁄ 2`.
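The kind of formatting difference involved is recorded in the Unicode character database. As a rough sketch, assuming Python's standard `unicodedata` module, `unicodedata.decomposition` returns a character's decomposition mapping as a string; compatibility mappings start with a tag such as `<font>` or `<fraction>`, while canonical mappings have no tag. The helper `decomposition_tag` below is a hypothetical name introduced only for this illustration:

```python
import unicodedata

def decomposition_tag(ch):
    """Return (tag, mapped_text) for a single character.

    tag is the compatibility tag (e.g. "<font>"), the string "canonical"
    for an untagged mapping, or None if the character has no mapping.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return None, ch
    parts = mapping.split()
    if parts[0].startswith("<"):
        tag, hex_points = parts[0], parts[1:]
    else:
        tag, hex_points = "canonical", parts
    # The remaining fields are hexadecimal code points.
    return tag, "".join(chr(int(p, 16)) for p in hex_points)

# ℌ, NBSP, ①, ², ⱼ, ½ (compatibility) and Ç (canonical)
for ch in ["\u210C", "\u00A0", "\u2460", "\u00B2", "\u2C7C", "\u00BD", "\u00C7"]:
    print(f"U+{ord(ch):04X}", decomposition_tag(ch))
```

Only the last character, `Ç`, is reported as canonical; the others carry a tag naming the formatting distinction that compatibility equivalence ignores.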

For many of these examples, whether canonical equivalents (Ç ↔ C + ◌̧) or compatibility equivalents (½ → 1 ⁄ 2), comparing the different forms for equality shows that they are not equal:

In [3]:
"\u00C7" == "\u0043\u0327"  # precomposed 'Ç' vs. 'C' followed by a combining cedilla

False

In [4]:
"ℌ" == "H"

False

In [5]:
"½" == "1⁄2"  # "1⁄2" is the three characters 1, ⁄ (fraction slash, U+2044) and 2; they may be rendered as a single fraction

False
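These comparisons fail because Python's `==` compares strings code point by code point. Normalizing both sides first makes equivalent forms compare equal. The following is a minimal sketch, assuming Python's standard `unicodedata.normalize` function with the standard normalization forms `NFC` (canonical) and `NFKC` (compatibility); the variable names are illustrative only:

```python
import unicodedata

# NFC composes canonically equivalent sequences into a single form.
composed = unicodedata.normalize("NFC", "\u0043\u0327")
print(composed == "\u00C7")    # True

# NFKC additionally replaces compatibility characters with their plain counterparts.
plain_h = unicodedata.normalize("NFKC", "\u210C")
plain_half = unicodedata.normalize("NFKC", "\u00BD")
print(plain_h == "H")          # True
print(plain_half == "1\u20442")  # True
```

Note that compatibility normalization discards formatting information (the font variant, the fraction layout), so it is best applied only where that loss is acceptable, such as in search or matching.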