# Regular expression search

by Koenraad De Smedt at UiB

---
Searching for patterns is a basic method in data science.

Strings can be searched and manipulated with regular expressions (*RE* or *regex*), using functions from the `re` module. A regular expression describes a pattern of characters.

This notebook demonstrates how to:

1.   Search in a string for a match to a pattern
2.   Find all matches to a pattern in a string

For more information on regular expressions and their use for NLP, read Jurafsky & Martin. *Speech and Language Processing, 3rd ed.* Ch. 2: [Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf). However, note that some conventions are system-dependent. Jurafsky & Martin use slashes to delimit regular expressions, but in Python they are simply strings.

See also the [documentation of Python regular expression operations](https://docs.python.org/3/library/re.html) and the [Python regular expression howto](https://docs.python.org/3/howto/regex.html).

---

In order to use regular expressions in Python, we should import the `re` module. Let's also make an example string in which we will search.

In [None]:
import re

phrase = '- A whimsical musical... comedy!' 

## Search for a pattern match

The `re.search` function searches for a regular expression in a string. It returns a match object containing the start and end positions of the *first* match and the matching part of the string. If no match is found, the function returns `None`. The simplest regex is a string of ordinary characters. Note that the search is case-sensitive by default.

In [None]:
print(re.search('ical', phrase))
print(re.search('ICAL', phrase))
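
The printed match object can also be inspected programmatically. The following is a minimal sketch using the standard match object methods `span()`, `start()`, `end()` and `group()`. The `re.IGNORECASE` flag is not used elsewhere in this notebook; it is shown here only as one way to switch off case sensitivity.

```python
import re

phrase = '- A whimsical musical... comedy!'

match = re.search('ical', phrase)
first_span = match.span()    # (start, end) positions of the first match
first_text = match.group()   # the matching part of the string
print(first_span, first_text)

# Slicing the string with start() and end() gives the same substring
print(phrase[match.start():match.end()])

# With the re.IGNORECASE flag, 'ICAL' also matches lowercase 'ical'
upper_match = re.search('ICAL', phrase, flags=re.IGNORECASE)
print(upper_match.group())
```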

Some characters have a special meaning in a regular expression. The vertical bar `|` means *or* (disjunction). The following looks for two alternative strings and the search is successful as soon as one of them is found.

In [None]:
print(re.search('comical|musical', phrase))

Use parentheses for making a group. The following searches for *com* or *mus*, followed by *ical*. It is equivalent to the previous search, but more compact.

In [None]:
print(re.search('(com|mus)ical', phrase))
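
A group also lets you retrieve the part of the string that matched inside the parentheses. As a small sketch: `group(0)` (or simply `group()`) returns the whole match, while `group(1)` returns what the first pair of parentheses matched.

```python
import re

phrase = '- A whimsical musical... comedy!'

match = re.search('(com|mus)ical', phrase)
whole = match.group(0)   # the entire match
prefix = match.group(1)  # only what the first group matched
print(whole, prefix)
```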

Square brackets contain a list of alternative *characters*. The regex `'[Aa] '` matches an *A* or *a* followed by a space. The first match is returned.

In [None]:
print(re.search('[Aa] ', phrase))
print(re.search('[Ww] ', phrase))

Between square brackets, the hyphen indicates a *range* of alternative characters. The following looks for the first uppercase character from A up to and including Z.

In [None]:
print(re.search('[A-Z]', phrase))

The result of the search can be interpreted as a truth value and can thus be used in a conditional expression.

In [None]:
if re.search('[x-z]!', phrase):
    print('There is a match')
else:
    print('There is no match')

A period in a regex stands for an arbitrary character.

In [None]:
print(re.search('i..l', phrase))

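If you want to match a literal period, use `\.`. Because the backslash also has a meaning in ordinary Python string literals, it is safer to write patterns containing backslashes as [*raw* strings preceded by `r`](https://docs.python.org/3/library/re.html#raw-string-notation). Both forms in the following yield the same pattern, but Python warns about unrecognized escape sequences such as `'\.'` in ordinary string literals (a `DeprecationWarning` since Python 3.6, a `SyntaxWarning` since Python 3.12), and escapes such as `\b` mean different things in the two kinds of string. Raw strings also matter in substitutions (see the notebook on Regex substitution).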

In [None]:
print(re.search('....cal\.', phrase))
print(re.search(r'....cal\.', phrase))
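
The difference between ordinary and raw strings becomes visible with an escape that Python itself recognizes. The sketch below uses `\b`, which is a backspace character in an ordinary string literal but a word boundary in a regex. The example text `'musical calendar'` is made up for illustration.

```python
import re

# In an ordinary string literal, \b is a single backspace character.
# In a raw string, it stays a backslash followed by 'b', which the
# regex engine interprets as a word boundary.
plain = '\bcal'
raw = r'\bcal'
print(len(plain), len(raw))

text = 'musical calendar'
plain_result = re.findall(plain, text)  # looks for backspace + 'cal'
raw_result = re.findall(raw, text)      # looks for 'cal' at the start of a word
print(plain_result, raw_result)
```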

The asterisk indicates that the expression just before it should be matched zero or more times. The plus sign indicates that the previous expression should be matched one or more times.

In [None]:
print(re.search(r'....cal\.* ', phrase))
print(re.search(r'....cal\.+ ', phrase))

Between square brackets, the period and most other special characters do not need to be escaped. The following uses `re.findall` (explained in the next section) to find all sequences of one or more periods, exclamation marks and/or question marks.

In [None]:
punctuation = '[.!?]+'
re.findall(punctuation, phrase)

## Find all matches

The `re.findall` function returns *all non-overlapping* matching parts of the string, not just the first one. Notice that the second expression in the following gives only one result, due to overlap.

In [None]:
print(re.findall('....cal', phrase))
print(re.findall('......cal', phrase))
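
If you also need the positions of all matches, and not only the matching strings, `re.finditer` may be useful. It is not covered elsewhere in this notebook; the sketch below simply shows that it yields one match object per non-overlapping match.

```python
import re

phrase = '- A whimsical musical... comedy!'

# Collect each matching substring together with its (start, end) span
spans = [(m.group(), m.span()) for m in re.finditer('....cal', phrase)]
print(spans)
```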

Quantifiers such as `*` and `+` are *greedy*, which means they match as many characters as possible. Even if there are two potential shorter matches, the longest match starting at the first possible position is returned.

In [None]:
print(re.findall('.+cal', phrase))
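
For comparison, adding a question mark after a quantifier makes it *non-greedy*, so that it matches as few characters as possible. This variant is not discussed above; the following is only a brief sketch of the contrast.

```python
import re

phrase = '- A whimsical musical... comedy!'

greedy = re.findall('.+cal', phrase)   # as many characters as possible
lazy = re.findall('.+?cal', phrase)    # as few characters as possible
print(greedy)
print(lazy)
```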

If you use *groups* with parentheses in `re.findall`, then only the matches for the groups are returned. If there is more than one group, you will get a list of tuples containing the groups.

In [None]:
print(re.findall('([A-Za-z]+)al', phrase))
print(re.findall('([A-Za-z]+)(al)', phrase))
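
This behavior can be surprising when parentheses are only meant to delimit a disjunction. One possible way around it, sketched here, is a *non-capturing* group written as `(?:...)`, which groups without changing what `re.findall` returns.

```python
import re

phrase = '- A whimsical musical... comedy!'

# A capturing group: only the group's content is returned
captured = re.findall('(com|mus)ical', phrase)

# A non-capturing group: the whole match is returned
whole = re.findall('(?:com|mus)ical', phrase)
print(captured, whole)
```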

## Lookbehind and lookahead

Matches are non-overlapping. After the first match, the matching part of the string cannot be used for other matches. Even though there are two potential matches, `'ome'` and `'edy'`, these two overlap, so that only the first one is returned.

In [None]:
re.findall('[oey][md][oey]', 'comedy')

If you want overlapping matches anyway, a possible solution is to look for patterns before and/or after a match, without actually making them part of the match. In the following, `(?<=...)` looks behind to a left context and `(?=...)` looks ahead to a right context. Notice that the lookbehind and lookahead are not part of the match, so only the middle character of each potential match is returned.

In [None]:
re.findall('(?<=[oey])[md](?=[oey])', 'comedy')
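
If the complete overlapping substrings are wanted, and not just the middle character, one alternative technique is to put a capturing group inside a lookahead. This is a different approach from the one shown above, offered only as a sketch.

```python
import re

# The lookahead itself matches the empty string, so the scan advances one
# character at a time; the group inside it captures the overlapping text.
overlapping = re.findall('(?=([oey][md][oey]))', 'comedy')
print(overlapping)
```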

### Exercises

Use the following text (loosely based on a news report) to test your exercise solutions.
```
text = '''20 isbreer i Norge er nå borte:
364 kvadratkilometer isbre har forsvunnet mellom 2006 og
2022. Det tilsvarer et område på størrelse med Mjøsa!!!
Samtidig som de 20 breene har forsvunnet, har isbreer
i Norge totalt minket 14,5 prosent siden forrige kartlegging.
Ismassene som har smeltet bort siden da, har en størrelse
på 364 kvadratkilometer til sammen; det er et område like
stort som omtrent 50.000 fotballbaner... Kan denne utviklingen
stanses??
'''
```

1.   Use a regex to find the number of sentence delimiters in the text, where a sentence delimiter consists of *one or more* consecutive periods, colons, semicolons, exclamation marks or question marks, followed by a space or newline.
2.   Use a regex to find how many digits there are in the string. Note: instead of `[0-9]`, you can also use `\d`. Try it.
3.   Use a regex to find the numbers in the text, where a number is a sequence of digits, possibly containing a period or comma between digits. Tip: use a disjunction. Before the `|` write a pattern for numbers with period or comma and after `|` write a pattern for plain numbers.
