# Regular expressions: substitution and split

by Koenraad De Smedt at UiB

---
This tutorial continues the one on regex search. It shows how to substitute and split strings with regular expressions, two basic techniques for manipulating string patterns. Other techniques are not demonstrated here; to learn more about regex in Python, see the [documentation](https://docs.python.org/3/library/re.html).

---

In [None]:
import re

phrase = '- A whimsical musical... comedy!'

## Substitution

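The `re.sub` function replaces the parts of a string that match a regular expression. Notice that `.*` is greedy: the longest possible match is used, so here the match runs from the *w* in *whimsical* to the last *ical* in *musical*.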

In [None]:
print(re.sub('w.*ical', 'tragical', phrase))
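
To make the greedy behaviour visible, the following sketch inspects the matched text directly with `re.search`. The non-greedy quantifier `.*?` is not covered in this tutorial and is shown only for contrast; the variable names are chosen for illustration.

```python
import re

phrase = '- A whimsical musical... comedy!'

# Greedy `.*` extends as far as it can: the match ends at the last 'ical'.
greedy = re.search(r'w.*ical', phrase).group()
print(greedy)  # whimsical musical

# Non-greedy `.*?` stops as early as it can: the match ends at the first 'ical'.
lazy = re.search(r'w.*?ical', phrase).group()
print(lazy)  # whimsical

# With the non-greedy pattern, only the first word is replaced.
lazy_result = re.sub(r'w.*?ical', 'tragical', phrase)
print(lazy_result)  # - A tragical musical... comedy!
```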

If several matches are found, each of them is replaced. The following substitutes a newline for each sequence of periods and/or spaces.


In [None]:
print(re.sub('[. ]+', '\n', phrase))
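
If it helps to check how many matches were replaced, `re.subn` returns the count along with the new string, and the `count` argument of `re.sub` limits the number of replacements. A minimal sketch, with variable names chosen for illustration:

```python
import re

phrase = '- A whimsical musical... comedy!'

# re.subn returns a tuple: (new string, number of replacements made).
result, n = re.subn(r'[. ]+', '\n', phrase)
print(n)  # 4

# count=1 stops after the first replacement.
first_only = re.sub(r'[. ]+', '\n', phrase, count=1)
print(repr(first_only))  # '-\nA whimsical musical... comedy!'
```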

A *caret* (circumflex) at the beginning of a square bracket expression means *negation*. In the following, all characters that are *not* *a, e, i, o, u* or *y* are substituted.

In [None]:
print(re.sub('[^aeiouy]', '*', phrase))

## Word characters

`\w` matches any letter, digit or underscore.

In [None]:
print(re.sub(r'\w', '*', phrase))

Text can be omitted by replacing it with the empty string.

In [None]:
print(re.sub(r'\w+', '', phrase))

`\W` matches anything which is *not* a letter, digit or underscore.

In [None]:
print(re.sub(r'\W+', ' ', phrase))
print(re.sub(r'\W+', '', phrase))

Since strings in Python 3 are Unicode, `\w` matches not only `[a-zA-Z0-9_]` but also letters and digits from other scripts. Correspondingly, `\W` does not match those characters.

In [None]:
print(re.sub(r'\w', '*', '- Håkon bløffer fælt.'))
print(re.sub(r'\W', '*', '- Håkon bløffer fælt.'))
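
For comparison, the `re.ASCII` flag restricts `\w` to `[a-zA-Z0-9_]`. The sketch below contrasts the two behaviours on the same Norwegian sentence; it assumes the accented letters are stored as single precomposed characters.

```python
import re

text = '- Håkon bløffer fælt.'

# Default (Unicode) matching: å, ø and æ count as word characters.
unicode_aware = re.sub(r'\w', '*', text)
print(unicode_aware)  # - ***** ******* ****.

# With re.ASCII, only a-z, A-Z, 0-9 and _ count as word characters.
ascii_only = re.sub(r'\w', '*', text, flags=re.ASCII)
print(ascii_only)  # - *å*** **ø**** *æ**.
```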

## Groups

Groups are marked with parentheses. The replacement refers to matching groups by means of indices with backslash, such as `\1`, `\2`, and so on.

The use of `\` for the numbered groups interferes with the normal use of `\` to escape characters in Python strings. Therefore, you must either write a [raw string preceded by `r`](https://docs.python.org/3/library/re.html#raw-string-notation) or use a double backslash, such as `\\1`.

The following replaces a *c* by an *s* before an *i* or *e*.

In [None]:
print(re.sub('c([ie])', 's\\1', 'citron'))
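
The following sketch shows that the raw string and the double backslash are the same string, and what goes wrong when neither is used: in an ordinary string, `'\1'` is read by Python as the control character with code 1 before `re.sub` ever sees it.

```python
import re

# The two recommended spellings are the same three-character string.
same = (r's\1' == 's\\1')
print(same)  # True

# Raw string: \1 reaches re.sub and refers to group 1.
good = re.sub(r'c([ie])', r's\1', 'citron')
print(good)  # sitron

# Ordinary string with a single backslash: '\1' is the character chr(1),
# so the group reference is lost.
bad = re.sub(r'c([ie])', 's\1', 'citron')
print(repr(bad))  # 's\x01tron'
```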

The following example refers to two matching groups and reverses them.

In [None]:
print(re.sub(r'(\w+ical) (\w+ical)', r'\2 \1', phrase))

## Lookbehind and lookahead

Substitution is performed only on non-overlapping patterns. Consider the following simplified rule for intervocalic voicing of fricatives. After matching `'ofi'`, this part of the string has been consumed, so that `'ifa'` will not match.

In [None]:
re.sub('([aio])f([aio])', r'\1v\2', 'xofifan')

A possible workaround is looking for patterns before and/or after a match, without actually making them part of the match. In the following, `?<=` looks behind to a left context and `?=` looks ahead to a right context.

In [None]:
re.sub('(?<=[aio])f(?=[aio])', r'v', 'xofifan')
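
To see why the lookaround version finds both fricatives, it can help to compare the spans of the matches. In this sketch, `re.finditer` lists where each pattern matches in the example word; the variable names are chosen for illustration.

```python
import re

word = 'xofifan'

# The consuming pattern matches 'ofi' (positions 1 to 4); the 'i' is used up,
# so 'ifa' cannot match afterwards.
consuming = [m.span() for m in re.finditer(r'[aio]f[aio]', word)]
print(consuming)  # [(1, 4)]

# With lookbehind and lookahead, only the 'f' itself is part of each match,
# so the surrounding vowels remain available as context for the next match.
lookaround = [m.span() for m in re.finditer(r'(?<=[aio])f(?=[aio])', word)]
print(lookaround)  # [(2, 3), (4, 5)]

voiced = re.sub(r'(?<=[aio])f(?=[aio])', 'v', word)
print(voiced)  # xovivan
```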

## Anchors

The characters `^` and `$` are *anchors* that indicate the beginning and end of the string, respectively. In the following, `^p` matches a *p* only at the beginning of the string, and `,$` matches a comma only at the end of the string.

In [None]:
genres = '''poetry,
novel,
short story,
documentary,
biography,'''

print(re.sub('^p', 'P', re.sub(',$', '.', genres)))
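
Because `genres` spans several lines, it may be useful to know that the `re.M` (`re.MULTILINE`) flag, which is not otherwise covered in this tutorial, makes `^` and `$` match at the beginning and end of every line as well. A minimal sketch for comparison:

```python
import re

genres = '''poetry,
novel,
short story,
documentary,
biography,'''

# Default: $ matches only at the end of the whole string, so one comma is found.
default_hits = re.findall(r',$', genres)
print(len(default_hits))  # 1

# With re.M, $ also matches at the end of each line, so all five are found.
multiline_hits = re.findall(r',$', genres, flags=re.M)
print(len(multiline_hits))  # 5

# Replace every line-final comma with a semicolon.
per_line = re.sub(r',$', ';', genres, flags=re.M)
print(per_line)
```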

If you want to match a newline, use the `\n` code. Note that the end of the string is not necessarily a newline.

In [None]:
print(re.sub('[,.]\n', ' or ', genres))
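
The last comma in `genres` is not followed by a newline, so the pattern above leaves it in place. One possible way to finish the job, sketched here, is to handle the end of the string with a second substitution:

```python
import re

genres = '''poetry,
novel,
short story,
documentary,
biography,'''

# Commas followed by a newline are replaced; the final comma is not.
joined = re.sub(r'[,.]\n', ' or ', genres)
print(joined)  # poetry or novel or short story or documentary or biography,

# The final comma sits at the end of the string, where $ matches.
finished = re.sub(r',$', '.', joined)
print(finished)  # poetry or novel or short story or documentary or biography.
```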

## Flags

Regex matching is normally case-sensitive. Adding the `re.I` flag (short for `re.IGNORECASE`) makes the matching ignore case. The question mark indicates that the previous item is optional.

In [None]:
print(re.sub('C.?M', 'TRAG', phrase, flags=re.I))
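
The effect of the flag and of the optional item can be checked separately. In the sketch below, the test string `'acme comet'` is an invented example, not part of the tutorial data.

```python
import re

phrase = '- A whimsical musical... comedy!'

# Without the flag, the uppercase pattern finds nothing, so nothing changes.
unchanged = re.sub(r'C.?M', 'TRAG', phrase)
print(unchanged == phrase)  # True

# With re.I, 'com' in 'comedy' matches.
ignoring_case = re.sub(r'C.?M', 'TRAG', phrase, flags=re.I)
print(ignoring_case)  # - A whimsical musical... TRAGedy!

# .? is optional: the pattern matches with or without a character in between.
optional = re.findall(r'c.?m', 'acme comet', flags=re.I)
print(optional)  # ['cm', 'com']
```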

## Special characters

By now, it should be clear that several characters have special meanings in regular expressions:

> `. * + [ ] ( ) | ? ^ $`

Curly braces also have a special meaning (they specify repetition counts); see the [chapter on regular expressions by Jurafsky & Martin](https://web.stanford.edu/~jurafsky/slp3/2.pdf). No examples are given here.

> `{ }`

As mentioned before, special characters must be escaped in regex search strings if they are to be taken literally.

In [None]:
print(re.sub(r'\.+ ', ' — ', phrase))
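
The difference between an escaped and an unescaped period can be seen on a short test string. When the text to match comes from elsewhere, the `re.escape` function adds the backslashes for you. A minimal sketch; the string `'a.b'` is an invented example.

```python
import re

phrase = '- A whimsical musical... comedy!'

# Unescaped: the period matches any character.
any_char = re.sub(r'.', '*', 'a.b')
print(any_char)  # ***

# Escaped: the period matches only a literal period.
literal = re.sub(r'\.', '*', 'a.b')
print(literal)  # a*b

# re.escape builds a pattern that matches the given text literally.
escaped_pattern = re.escape('...')
print(escaped_pattern)  # \.\.\.

cleaned = re.sub(escaped_pattern, ',', phrase)
print(cleaned)  # - A whimsical musical, comedy!
```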

## [Split](https://hr.wikipedia.org/wiki/Split)

A string can be split with a given regular expression. This results in a list of strings. The following splits a string at a series of periods and/or spaces.

In [None]:
print(re.split('[. ]+', phrase))

If the pattern matches at the very beginning or end of the string, an empty string appears at that position in the result.

In [None]:
re.split(r'\W+', 'Bergen? -- Bergen... Bergen!')
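
If the empty strings are unwanted, one simple option is to filter them out after splitting. The sketch below shows the raw result and a filtered version; the variable names are chosen for illustration.

```python
import re

text = 'Bergen? -- Bergen... Bergen!'

# The final '!' is a separator at the end, so an empty string follows it.
parts = re.split(r'\W+', text)
print(parts)  # ['Bergen', 'Bergen', 'Bergen', '']

# A separator at the beginning gives an empty string in first position.
leading = re.split(r'\W+', '- A whimsical')
print(leading)  # ['', 'A', 'whimsical']

# Keep only the non-empty parts.
words = [p for p in parts if p]
print(words)  # ['Bergen', 'Bergen', 'Bergen']
```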

### Exercises

1.   Extend the previous expression to include other punctuation marks, such as question marks, exclamation marks, etc. Use this to divide a longer text into sentences.
2.   Use a regex to omit the vowels `[aeiouy]` from a string. Test your solution.
3.   Expand English contractions with negation in a text. For instance, replace *n’t* at the end of a word by a space followed by *not*, so that *don’t* becomes *do not* and *doesn’t* becomes *does not*. Note that some other contractions, such as *won’t* and *can’t*, must be handled separately.