# Which Character Belongs to Which Unicode Category?

This notebook explores how to determine the Unicode category of characters and how to use Unicode categories in regular expressions.

## Learning Goals

- Learn how to determine the Unicode category of a character using the `unicodedata` module.
- Understand how to retrieve the name of a Unicode character.
- Explore the use of Unicode categories in regular expressions with the `regex` package.
- Familiarize yourself with common Unicode categories and their applications.


## How Can I Determine the Unicode Category of a Character?

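The `unicodedata.category` function returns the two-letter general category abbreviation of a character, for example `Ll` for a lowercase letter. The `unicodedata.name` function returns the character's name; its optional second argument is a default that is returned when the character has no name, instead of raising `ValueError`.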


In [2]:
import unicodedata

utfstr = "1a* äöü."

for c in utfstr:
    print(c, "Cat:", unicodedata.category(c))
    print(c, "Name:", unicodedata.name(c, "No name found"))

1 Cat: Nd
1 Name: DIGIT ONE
a Cat: Ll
a Name: LATIN SMALL LETTER A
* Cat: Po
* Name: ASTERISK
  Cat: Zs
  Name: SPACE
ä Cat: Ll
ä Name: LATIN SMALL LETTER A WITH DIAERESIS
ö Cat: Ll
ö Name: LATIN SMALL LETTER O WITH DIAERESIS
u Cat: Ll
u Name: LATIN SMALL LETTER U
̈ Cat: Mn
̈ Name: COMBINING DIAERESIS
. Cat: Po
. Name: FULL STOP
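
In the output above, `ü` is reported as two code points, `u` followed by a combining diaeresis (`Mn`), while `ä` and `ö` each appear as a single `Ll` character. This suggests that the last letter of the sample string is stored in decomposed form. The following is a minimal sketch of how this could be checked and normalized with `unicodedata.normalize`; the variable names are illustrative, and escape sequences are used only to make the two forms unambiguous.

```python
import unicodedata

# "u" followed by COMBINING DIAERESIS (U+0308), as seen in the output above
decomposed = "u\u0308"

# NFC composes the pair into one precomposed code point where one exists
composed = unicodedata.normalize("NFC", decomposed)

print([unicodedata.category(c) for c in decomposed])
print([unicodedata.category(c) for c in composed])
print(len(decomposed), len(composed))
```

Normalizing text to NFC before inspecting categories is one way to make visually identical strings yield the same per-character results.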


## Unicode Categories in Regular Expressions

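The third-party `regex` package, an alternative to the standard `re` module with additional features, supports Unicode property classes in regular expressions. For example, `\p{P}` matches any punctuation character; the standard `re` module does not support this syntax.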


In [3]:
!pip install regex


In [5]:
import regex

text = "Oh... What???"
cleaned_text = regex.sub(r"\p{P}+", " ", text)
repr(cleaned_text)

"'Oh  What '"
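
The same cleanup can be approximated without `regex`, using only the standard library. The sketch below treats every character whose category starts with `P` as punctuation and replaces each run of such characters with a single space. `replace_punctuation` is a hypothetical helper name chosen for this example, not part of `unicodedata` or `regex`.

```python
import unicodedata
from itertools import groupby


def replace_punctuation(text, replacement=" "):
    """Replace each run of punctuation characters with `replacement`."""

    def is_punct(ch):
        # All punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po) start with "P"
        return unicodedata.category(ch).startswith("P")

    parts = []
    # groupby yields consecutive runs of punctuation / non-punctuation
    for punct, run in groupby(text, key=is_punct):
        parts.append(replacement if punct else "".join(run))
    return "".join(parts)


print(repr(replace_punctuation("Oh... What???")))
```

For the sample text this should give the same string as the `regex.sub` call above. The `regex` version is shorter and likely the better choice when the package is available.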

## A List of Unicode Categories

Unicode characters are grouped into categories such as:

- `Lu`: Uppercase Letter
- `Ll`: Lowercase Letter
- `Nd`: Decimal Number
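- `P`: Punctuation (a major class that covers the two-letter categories `Pc`, `Pd`, `Ps`, `Pe`, `Pi`, `Pf`, and `Po`)
- `So`: Symbol, Other (includes most emoji)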

For the complete list of categories, refer to the General Category Values table in [Unicode Standard Annex #44](https://www.unicode.org/reports/tr44/#GC_Values_Table).
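
To see which categories occur in a given text, the characters can be tallied by category. This is a minimal sketch using `collections.Counter`; the sample string is made up for illustration. The first letter of each abbreviation identifies the major class (`L` for letters, `N` for numbers, `P` for punctuation, `Z` for separators).

```python
import unicodedata
from collections import Counter

sample = "Hello, World 42!"

# Tally by two-letter category, e.g. "Lu", "Ll", "Nd"
by_category = Counter(unicodedata.category(c) for c in sample)

# Tally by major class: the first letter of the category abbreviation
by_major_class = Counter(unicodedata.category(c)[0] for c in sample)

print(by_category)
print(by_major_class)
```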

### To Which Category Do Emojis Belong?

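Most emoji belong to the `So` (Symbol, Other) category. You can check a single code point with `unicodedata.category`, or match such characters with `\p{So}` in the `regex` package. Note that `So` also contains symbols that are not emoji, such as the degree sign, so `\p{So}` is not an emoji-only pattern.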
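
"Most" matters here: an emoji shown on screen is often a sequence of several code points, and not all of them are `So`. The sketch below checks a few code points chosen for illustration; the labels in the dictionary are informal descriptions, not official Unicode names.

```python
import unicodedata

samples = {
    "grinning face": "\U0001F600",
    "skin tone modifier": "\U0001F3FD",
    "zero width joiner": "\u200D",
    "variation selector-16": "\uFE0F",
}

# Map each label to the general category of its code point
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, cat in categories.items():
    print(f"{label}: {cat}")
```

Only the first entry should come out as `So`; the modifier, joiner, and variation selector fall into `Sk`, `Cf`, and `Mn`, so a pattern built on `\p{So}` alone would miss parts of composite emoji.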
