# Substitution of matched substrings

Learning goals:

- Understand why replacement strings with backreferences should also be written as raw strings (`r"..."`)
- Understand how a backslash can be used to remove the special meaning of another backslash


In [None]:
import re

text = "Hässliche Köche verdürben das Gebräu"

pattern = r"([aeiouäöü]+)"

# Captured groups can be inserted into the replacement text.
# \N refers to the N-th capturing group (pair of parentheses) in the pattern.
replacement = r"[\1]"

print(re.sub(pattern, replacement, text))

H[ä]ssl[i]ch[e] K[ö]ch[e] v[e]rd[ü]rb[e]n d[a]s G[e]br[äu]
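To make the group numbering concrete, here is a minimal sketch with two capturing groups. The sample string and the variable names are chosen for illustration only and are not part of the example above.

```python
import re

# Hypothetical sample: two words separated by a single space.
sample = "Köche hässliche"

# Group 1 captures the first word, group 2 the second.
two_groups = r"(\w+) (\w+)"

# \2 inserts the second group and \1 the first, so the words swap places.
swapped = re.sub(two_groups, r"\2 \1", sample)
print(swapped)  # hässliche Köche
```

Groups are numbered by the position of their opening parenthesis, counted from left to right.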


What happens if we forget the `r` in front of the replacement? Remember, the replacement is a normal string optionally interspersed with references to captured groups.


In [None]:
replacement = "[\1]"

print(re.sub(pattern, replacement, text))

H[]ssl[]ch[] K[]ch[] v[]rd[]rb[]n d[]s G[]br[]


Hmm, the brackets now look empty. What is this invisible character?


In [3]:
ord("\1")

1
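The difference becomes visible when comparing the two literals directly. The following sketch (variable names are illustrative) shows that Python reads `"\1"` as an octal escape before `re.sub` ever sees the string, whereas the raw string keeps the backslash and the digit as two characters.

```python
plain = "[\1]"   # "\1" is an octal escape: one character with code point 1
raw = r"[\1]"    # backslash and digit stay two separate characters

print(len(plain), repr(plain))  # 3 '[\x01]'
print(len(raw), repr(raw))      # 4 '[\\1]'
```

`repr` is handy here because it prints invisible characters as escape sequences.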

The `unicodedata` module helps to inspect the properties of unexpected or invisible characters. For example:

- All Unicode code points have a general category, which can be retrieved using `unicodedata.category`.
- Most Unicode code points also have a name, accessible via `unicodedata.name`.

Hint: `unicodedata.name` raises a `ValueError` if the character has no name. If you pass a default value as the second argument, that value is returned instead. The argument is positional-only, so it cannot be passed as `default=...`.


In [None]:
import unicodedata

# The second positional argument is returned when the character has no name.
print(unicodedata.name("\1", "No name found for the character \\1"))
print(unicodedata.category("\1"))

No name found for the character \1
Cc


Ok, a control character...

The `unicodedata.category` function returns a two-letter code representing the Unicode general category of a character. For example:

- `Cc` stands for "Control Character."
- `Lu` stands for "Uppercase Letter."
- `Nd` stands for "Decimal Number."

You can find the full list of Unicode general categories in [Unicode Standard Annex #44](https://www.unicode.org/reports/tr44/#GC_Values_Table).
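As a quick check of these codes, the sketch below looks up the category of a few sample characters. The characters are chosen for illustration only.

```python
import unicodedata

# Sample characters: a control character, an uppercase letter,
# a decimal digit, and a lowercase letter with an umlaut.
samples = ["\x01", "A", "5", "ä"]

categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(repr(ch), cat)
# '\x01' Cc
# 'A' Lu
# '5' Nd
# 'ä' Ll
```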


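Could we fix this without using `r""`? Sure: we escape the backslash with a second backslash, so that Python passes a literal backslash followed by `1` to `re.sub`.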


In [5]:
replacement = "[\\1]"

print(re.sub(pattern, replacement, text))

H[ä]ssl[i]ch[e] K[ö]ch[e] v[e]rd[ü]rb[e]n d[a]s G[e]br[äu]
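Both spellings produce exactly the same string, which a short comparison can confirm. This is a minimal sketch; the variable names and the shortened sample text are illustrative.

```python
import re

escaped = "[\\1]"  # "\\" in a normal string is one literal backslash
raw = r"[\1]"      # the raw string keeps the backslash as written

print(escaped == raw)  # True
print(len(escaped))    # 4

# Either spelling therefore works as a replacement with a backreference.
result = re.sub(r"([aeiouäöü]+)", escaped, "Gebräu")
print(result)  # G[e]br[äu]
```

The raw string is usually easier to read, especially once a replacement contains several backreferences.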
