## Regular Expressions
### Why are we doing this?
- Describe text patterns
- Specify strings we might want to extract from a document
- Search and replace text (e.g. in building a chatbot like ELIZA)
- Text normalization: converting everything to a more convenient, standard form (e.g. tokenization)

The regular expression syntax below is mostly that of `grep`, but it can easily be adapted for Python.

#### Basic Python regular expression metacharacters, including wildcards, ranges and closures
| Operator | Behaviour |
|-----|---------------------------------|
|`.` | Wildcard, matches any character except a newline |
| `^abc` | Matches some pattern `abc` at the start of a string |
| `abc$` | Matches some pattern `abc` at the end of a string |
| `[abc]` | Matches one of a set of characters |
| `[A-Z0-9]` | Matches one of a range of characters |
| `ed\|ing\|s` | Matches one of the specified strings (disjunction) |
| `*` | Zero or more of previous item, e.g., `a*`, `[a-z]*` (also known as *Kleene closure*) |
| `+` | One or more of previous item, e.g. `a+`, `[a-z]+` |
| `?` | Zero or one of the previous item (i.e. optional) e.g. `a?`, `[a-z]?` |
| `{n}` | Exactly *n* repeats, where *n* is a non-negative integer |
| `{n,}` | At least *n* repeats |
| `{,n}` | No more than *n* repeats |
| `{m,n}` | At least *m* and no more than *n* repeats (no space after the comma, otherwise Python treats the braces literally) |
| `a(b\|c)+` | Parentheses that indicate the scope of the operators |
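The counter and grouping rows are not exercised in the cells that follow, so here is a minimal sketch of how they behave in Python. The candidate strings are made up for illustration, and `re.fullmatch` is used so that each pattern has to cover the whole string.

```python
import re

# Hypothetical candidate strings, chosen only to exercise each counter.
candidates = ['oh', 'ooh', 'oooh', 'ooooh']

exactly_two = [s for s in candidates if re.fullmatch(r'o{2}h', s)]
at_least_two = [s for s in candidates if re.fullmatch(r'o{2,}h', s)]
at_most_two = [s for s in candidates if re.fullmatch(r'o{,2}h', s)]
two_to_three = [s for s in candidates if re.fullmatch(r'o{2,3}h', s)]

# Parentheses make + apply to the whole group (b|c), not just to one character.
grouped = [s for s in ['a', 'ab', 'abcbc', 'abd'] if re.fullmatch(r'a(b|c)+', s)]

print(exactly_two)   # ['ooh']
print(at_least_two)  # ['ooh', 'oooh', 'ooooh']
print(at_most_two)   # ['oh', 'ooh']
print(two_to_three)  # ['ooh', 'oooh']
print(grouped)       # ['ab', 'abcbc']
```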

In [2]:
import re

#### Figure 2.1 - some simple regex searches

In [3]:
text = 'interesting links to woodchucks and lemurs'
pattern = re.compile(r'woodchucks')
print(pattern.findall(text)) # findall returns every non-overlapping match in the text

print('Alternatively we can do the following')
re.findall(r'woodchucks', text)

['woodchucks']
Alternatively we can do the following


['woodchucks']

In [9]:
[int(n) for n in re.findall(r'[0-9]+', '2019-12-31')]

[2019, 12, 31]

In [28]:
text = "Mary Ann stopped by Mona's"
pattern = re.compile(r'a')
print(pattern.findall(text))

# Alternatively, use `re.search`: it returns a match object for the first match only, or None if there is none
re.search(r'a', text)

['a', 'a']


<re.Match object; span=(1, 2), match='a'>
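A match object is more useful once its parts are pulled out. The sketch below, using the same sentence, reads the matched text and its offsets and shows that a failed search returns `None`, so the result should be checked before use. The variable names are illustrative only.

```python
import re

text = "Mary Ann stopped by Mona's"

match = re.search(r'a', text)
if match is not None:
    print(match.group())  # the matched text: 'a'
    print(match.span())   # (start, end) offsets of the first match: (1, 2)

# No lowercase 'z' occurs in the text, so search returns None.
missing = re.search(r'z', text)
print(missing)  # None
```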

In [11]:
text = "You’ve left the burglar behind again!” said Nori"
pattern = re.compile(r'!')
pattern.findall(text)

['!']

#### Figure 2.2 - Use of the brackets `[]` to specify a disjunction of characters.

In [16]:
text = 'Woodchuck or woodchuck?'
pattern = re.compile(r'[wW]oodchuck')
print(pattern.findall(text))

# Can also use the re.IGNORECASE flag
pattern = re.compile(r'woodchuck', re.IGNORECASE)
print('Ignorecase:', pattern.findall(text))

['Woodchuck', 'woodchuck']
Ignorecase: ['Woodchuck', 'woodchuck']


In [13]:
text = 'In uomini, in soldati'
pattern = re.compile(r'[abc]')
pattern.findall(text)

['a']

In [17]:
text = 'plenty of 7 to 5'
pattern = re.compile(r'[0-9]')
pattern.findall(text)

['7', '5']

#### Figure 2.3 - Use of the brackets `[]` plus the dash to specify a range

In [22]:
text = "we should call it 'Drenched Blossoms'"
pattern = re.compile(r'[A-Z]')
print(pattern.findall(text))

text = 'my beans were impatient to be hoed'
pattern = re.compile(r'[a-z]')
print(pattern.findall(text))

text = 'Chapter 1: Down the Rabbit Hole'
pattern = re.compile(r'[0-9]')
print(pattern.findall(text))

['D', 'B']
['m', 'y', 'b', 'e', 'a', 'n', 's', 'w', 'e', 'r', 'e', 'i', 'm', 'p', 'a', 't', 'i', 'e', 'n', 't', 't', 'o', 'b', 'e', 'h', 'o', 'e', 'd']
['1']


#### Figure 2.4 - The caret `^` for negation or just to mean `^`.

In [23]:
text = "Oyfn pripetchik"
pattern = re.compile(r'[^A-Z]') # Not an upper case letter
print(pattern.findall(text)) # This gets all the spaces as well

['y', 'f', 'n', ' ', 'p', 'r', 'i', 'p', 'e', 't', 'c', 'h', 'i', 'k']


In [25]:
text = "I have no exquisite reason for 't"
pattern = re.compile(r'[^Ss]') # neither 'S' nor 's'
print(pattern.findall(text))

['I', ' ', 'h', 'a', 'v', 'e', ' ', 'n', 'o', ' ', 'e', 'x', 'q', 'u', 'i', 'i', 't', 'e', ' ', 'r', 'e', 'a', 'o', 'n', ' ', 'f', 'o', 'r', ' ', "'", 't']


In [29]:
text = 'our resident Djinn'
pattern = re.compile(r'[^.]') # Not a period
print(pattern.findall(text)) 

['o', 'u', 'r', ' ', 'r', 'e', 's', 'i', 'd', 'e', 'n', 't', ' ', 'D', 'j', 'i', 'n', 'n']


In [30]:
text = "look up ^ now"
pattern = re.compile(r'[e^]') # Either 'e' or '^'
print(pattern.findall(text))

['^']


In [31]:
text = "look up a^b now"
pattern = re.compile(r'[a^b]')
print(pattern.findall(text))

['a', '^', 'b']


#### Figure 2.5 - The question mark `?` marks optionality of the previous expression

In [33]:
text = 'woodchuck woodchucks Woodchuck'
pattern = re.compile(r'woodchucks?')
print(pattern.findall(text))

# optionally ignore case
pattern = re.compile(r'woodchucks?', re.IGNORECASE)
print('Ignore case:', pattern.findall(text))

['woodchuck', 'woodchucks']
Ignore case: ['woodchuck', 'woodchucks', 'Woodchuck']


In [35]:
text = 'color Colour colour'
pattern = re.compile(r'colou?r') # the 'u' is optional, so both American and British spellings match
pattern.findall(text)

['color', 'colour']

#### Kleene * (pronounced "cleany star"): zero or more occurrences of the immediately previous character or regular expression

In [41]:
text = 'a aaaa aaaaaaa baa Ofminor ababab ab'
pattern = re.compile(r'a*') 
print(pattern.findall(text))  # Also returns an empty string wherever zero a's match, which is not what we want

pattern = re.compile(r'aa*') # One a followed by zero or more a s
print(pattern.findall(text))

['a', '', 'aaaa', '', 'aaaaaaa', '', '', 'aa', '', '', '', '', '', '', '', '', '', 'a', '', 'a', '', 'a', '', '', 'a', '', '']
['a', 'aaaa', 'aaaaaaa', 'aa', 'a', 'a', 'a', 'a']


In [43]:
# A more complex example
pattern = re.compile(r'[ab]*')
print(pattern.findall(text))

['a', '', 'aaaa', '', 'aaaaaaa', '', 'baa', '', '', '', '', '', '', '', '', '', 'ababab', '', 'ab', '']


#### Kleene + : one or more occurrences of the immediately preceding character

In [44]:
text = 'a ba aaa aaaa baaaa bababa'
pattern = re.compile(r'a+')
print(pattern.findall(text))

['a', 'a', 'aaa', 'aaaa', 'aaaa', 'a', 'a', 'a']


In [45]:
pattern = re.compile(r'ba+')
print(pattern.findall(text))

['ba', 'baaaa', 'ba', 'ba', 'ba']


In [46]:
text = 'abce 10234 bc345 345asd'
pattern = re.compile(r'[0-9]+')
print(pattern.findall(text))

['10234', '345', '345']


#### Figure 2.6 - The use of the period `.` to specify **any** character (*except* a newline)

In [48]:
text = "begin, beg'n begun"
pattern = re.compile(r'beg.n')
print(pattern.findall(text))

['begin', "beg'n", 'begun']


In [49]:
# Another example, suppose we want to find any line where the word 'aardvark' appears twice
text = "I have seen the world's biggest aardvark but there are many other aardvarks out there"
pattern = re.compile(r'aardvark.*aardvark')
pattern.findall(text)

['aardvark but there are many other aardvark']
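Because `.` stops at a newline, `aardvark.*aardvark` only succeeds when both occurrences sit on the same line. The sketch below uses a made-up two-line text to show this, and contrasts it with the `re.DOTALL` flag, which lets `.` cross line breaks.

```python
import re

# Hypothetical two-line text: only the second line contains the word twice.
text = "one aardvark here\nan aardvark met another aardvark"

# Default behaviour: '.' does not match '\n', so a match cannot span lines.
per_line = re.findall(r'aardvark.*aardvark', text)

# With re.DOTALL, '.' also matches '\n', so the match runs across both lines.
across_lines = re.findall(r'aardvark.*aardvark', text, flags=re.DOTALL)

print(per_line)      # ['aardvark met another aardvark']
print(across_lines)  # ['aardvark here\nan aardvark met another aardvark']
```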

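#### Anchors are special characters that anchor regular expressions to particular places in a string
- `^` (caret) matches the start of a line
- `$` (dollar sign) matches the end of a line

In Python, both anchors refer to the whole string by default and only work line by line when the `re.MULTILINE` flag is set.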

In [50]:
text = 'The headache been the worst ever'
pattern = re.compile(r'^the', re.IGNORECASE)
pattern.findall(text)

['The']

In [51]:
text1 = 'The cat has eaten the dog'
text2 = 'The dog.'

pattern = re.compile(r'^The dog\.$', re.IGNORECASE)
print(pattern.findall(text1))
print(pattern.findall(text2))

[]
['The dog.']
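The examples above use single-line strings, where "line" and "string" coincide. A small sketch with a made-up three-line string shows the difference the `re.MULTILINE` flag makes to `^` and `$`.

```python
import re

# Hypothetical three-line text.
text = "The dog.\nthe cat\nThe dog."

start_default = re.findall(r'^The', text)                  # start of the string only
start_multiline = re.findall(r'^The', text, re.MULTILINE)  # start of every line
end_default = re.findall(r'dog\.$', text)                  # end of the string only
end_multiline = re.findall(r'dog\.$', text, re.MULTILINE)  # end of every line

print(start_default)    # ['The']
print(start_multiline)  # ['The', 'The']
print(end_default)      # ['dog.']
print(end_multiline)    # ['dog.', 'dog.']
```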


Other anchors:
* `\b` matches a word boundary
* `\B` matches a non-boundary

A word for the purposes of a regular expression is defined as *any sequence of digits, underscores or letters*.

In [56]:
text = 'There are 99 bottles of beers on the wall and 299 bottles on the floor and costs $99'
pattern = re.compile(r'\b99\b') # matches 99 and $99
print(pattern.findall(text))

pattern = re.compile(r'\B99') # Matches the 99 inside 299 (raw string avoids an invalid-escape warning)
print(pattern.findall(text))

['99', '99']
['99']


### 2.1.2 Disjunction, Grouping and Precedence
* `|` pipe symbol indicates **disjunction**
#### Operator precedence hierarchy
Ordered from highest precedence to lowest precedence

| Operator | Symbol |
|----------|--------|
| Parentheses | `()` |
| Counters | `*` `+` `?` `{}` |
| Sequences and anchors | `the` `^my` `end$` |
| Disjunction | `\|` |
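The hierarchy can be checked directly. In this sketch, `re.fullmatch` tests whether a pattern covers an entire string: counters bind more tightly than sequences, so `the*` repeats only the `e`, while sequences bind more tightly than disjunction, so `the|any` means *the* or *any*. The helper name `covers` is hypothetical.

```python
import re

def covers(pattern, string):
    """Return True if the pattern matches the whole string (illustrative helper)."""
    return re.fullmatch(pattern, string) is not None

# Counters outrank sequences: * applies to 'e' alone, not to 'the'.
print(covers(r'the*', 'theee'))     # True
print(covers(r'the*', 'thethe'))    # False
# Parentheses outrank counters: now * applies to the whole group.
print(covers(r'(the)*', 'thethe'))  # True

# Sequences outrank disjunction: 'the' or 'any', not 'th' + (e|a) + 'ny'.
print(covers(r'the|any', 'any'))    # True
print(covers(r'the|any', 'thany'))  # False
# Grouping limits the disjunction to the suffix.
print(covers(r'gupp(y|ies)', 'guppies'))  # True
```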

#### Simple Example
Suppose we want to write a regular expression to find cases of the English article `the`.

Start simple:
```
'the'
``` 
This pattern will miss the word when it begins a sentence and hence is capitalized (i.e. 'The')
```
'[tT]he'
```
This will incorrectly return texts with `the` embedded in other words (e.g. *other* or *theology*). So we need to specify that we want instances with a word boundary on both sides:
```
'\b[tT]he\b'
```

Suppose we want to do this without the use of `\b`. We might want this since `\b` won't treat underscores and numbers as word boundaries, but we might want to find `the` in some context where it might also have underscores or numbers nearby (*the_* or *the25*). We need to specify that we want instances in which there are no alphabetic letters on either side of the *the*:
```
'[^a-zA-Z][tT]he[^a-zA-Z]'
```
But there is still one problem with this pattern: it won't find the word *the* when it begins a line. This is because the regular expression `[^a-zA-Z]`, which we used to avoid embedded instances of *the*, implies that there must be some single (although non-alphabetic) character before the *the*. We can avoid this by specifying that before the *the* we require *either* the beginning-of-line or a non-alphabetic character, and the same at the end of the line:
```
'(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)'
```
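The successive refinements can be compared on one made-up sentence. In the last pattern this sketch swaps the outer groups for non-capturing `(?:...)` groups and captures only the article, so that `findall` returns the word itself; that change is a convenience for display, not part of the pattern described above.

```python
import re

# Hypothetical sentence containing 'the' in several contexts.
text = "The other day the_cat saw the25 and the end"

plain = re.findall(r'the', text)
either_case = re.findall(r'[tT]he', text)
word_boundary = re.findall(r'\b[tT]he\b', text)
# Non-capturing context groups; only the article itself is captured.
non_letter_context = re.findall(r'(?:^|[^a-zA-Z])([tT]he)(?:[^a-zA-Z]|$)', text)

print(len(plain))          # 4: misses 'The', but finds the 'the' inside 'other'
print(len(either_case))    # 5: adds 'The', still matches inside 'other'
print(word_boundary)       # ['The', 'the']: skips 'the_cat' and 'the25'
print(non_letter_context)  # ['The', 'the', 'the', 'the']
```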

### 2.1.4 More complex example
Suppose we want to build an application to help a user buy a computer on the web. The user might want 'any machine with at least 6GHz and 500GB of disk space for less than `$1000`'. To do this kind of retrieval, we first need to be able to look for expressions like *6GHz* or *500GB* or *Mac* or `$999.99`.

Here's a regular expression for a dollar sign followed by a string of digits
```
'\$[0-9]+'
```
Now we need to deal with fractions of dollars. We'll add a decimal point and two digits afterwards:
```
'\$[0-9]+\.[0-9][0-9]'
```
This pattern only allows `$199.99` but not `$199`. We need to make the cents optional and to make sure we are at a word boundary:
```
(^|\W)\$[0-9]+(\.[0-9][0-9])?\b
```
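A quick check of this pattern in Python, on a made-up sentence. As an assumption for readability, the sketch makes the context and cents groups non-capturing with `(?:...)` and captures the whole amount, so that `findall` returns the price rather than the pieces.

```python
import re

# Same pattern as above, with non-capturing groups and one capture for the amount.
price = re.compile(r'(?:^|\W)(\$[0-9]+(?:\.[0-9][0-9])?)\b')

# Hypothetical sample text.
text = 'Laptops from $199 to $1999.99, or 500 dollars'
prices = price.findall(text)

print(prices)  # ['$199', '$1999.99']
```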

In [159]:
text = '$1234 price $1999999.34 but discount for $199 $19'
pattern = re.compile(r'\$(\d{3,}[\d.,]*)\b') # amounts with at least three leading digits; the group drops the '$'
pattern.findall(text)

['1234', '1999999.34', '199']

In [166]:
text = 'I have more than 10000 GB of memory but you have 19.25 GB 0.25 Gigabytes 400 GB'
pattern = re.compile(r'([0-9]\d{2,}[\d.,]*)? *(GB|[Gg]igabytes?)\b')
pattern.findall(text) 

[('10000', 'GB'), ('', 'GB'), ('', 'Gigabytes'), ('400', 'GB')]

### 2.1.5 More Operators

| RE | Expansion | Match | 
|----|-----------|-------|
| `\b` |  | Word boundary (zero width) |
| `\d` | `[0-9]` | Any digit |
| `\D` | `[^0-9]` | Any non-digit |
| `\w` | `[a-zA-Z0-9_]` | Any alphanumeric/underscore |
| `\W` | `[^\w]` | A non-alphanumeric |
| `\s` | `[ \r\t\n\f]` | Whitespace (space, tab, newline, carriage return, form feed) |
| `\S` | `[^\s]` | Non-whitespace |

In [18]:
# Patterns are raw strings so that the backslash reaches the regex engine unchanged
print(re.findall(r'\d', 'Party of 5'))
print(re.findall(r'\D', 'Blue Moon'))
print(re.findall(r'\w', 'Daiyu'))
print(re.findall(r'\W', 'ABCD!!'))
print(re.findall(r'\s', 'abc \tdef\n lij\r'))
print(re.findall(r'\S', '\n\r abc'))

['5']
['B', 'l', 'u', 'e', ' ', 'M', 'o', 'o', 'n']
['D', 'a', 'i', 'y', 'u']
['!', '!']
[' ', '\t', '\n', ' ', '\r']
['a', 'b', 'c']
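The expansions in the table describe ASCII text. For Python 3 `str` patterns, `\d` and `\w` also match non-ASCII digits and letters unless the `re.ASCII` flag is given. A minimal sketch, with sample strings chosen only to show the difference:

```python
import re

# '\u0663' is ARABIC-INDIC DIGIT THREE and '\u00e9' is 'é'; both are sample inputs.
text = 'room \u0663 or room 3'

unicode_digits = re.findall(r'\d', text)
ascii_digits = re.findall(r'\d', text, flags=re.ASCII)
unicode_word = re.findall(r'\w+', 'caf\u00e9')
ascii_word = re.findall(r'\w+', 'caf\u00e9', flags=re.ASCII)

print(unicode_digits)  # both digits match
print(ascii_digits)    # ['3']
print(unicode_word)    # the whole word, including the accented letter
print(ascii_word)      # ['caf']
```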


### 2.1.6 Substitution, Capture Groups and Eliza
An important use of regular expressions is in **substitutions**. 
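In Python, substitution is done with `re.sub(pattern, replacement, string)`. The sketch below, on made-up strings, first replaces one spelling with another and then uses a capture group, referred to as `\1` in the replacement, to wrap every integer in angle brackets.

```python
import re

# Plain substitution: replace every occurrence of a pattern.
plain = re.sub(r'colour', 'color', 'my favourite colour')

# Substitution with a capture group: \1 refers back to whatever ([0-9]+) matched.
bracketed = re.sub(r'([0-9]+)', r'<\1>', 'the 35 boxes and 7 crates')

print(plain)      # my favourite color
print(bracketed)  # the <35> boxes and <7> crates
```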

In [19]:
regexp = r'the (.*)er they (.*), the \1er we \2'
text = 'the faster they ran, the faster we ran'

re.findall(regexp, text)
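The numbered references `\1` and `\2` in this pattern must repeat exactly what the first and second groups captured. A short sketch makes that visible by comparing the sentence above with a variant (a made-up counterexample) whose ending differs.

```python
import re

regexp = r'the (.*)er they (.*), the \1er we \2'

same = re.search(regexp, 'the faster they ran, the faster we ran')
different = re.search(regexp, 'the faster they ran, the faster we ate')

print(same.groups())  # ('fast', 'ran'): the captured groups
print(different)      # None: 'ate' does not repeat group 2 ('ran')
```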

Simulating ELIZA

In [52]:
regexp = r".*I'm (depressed|sad).*"

def upp_repl(match):
    # Build the reply from the captured feeling (group 1), upper-cased
    return "I'M SORRY THAT YOU ARE " + match.group(1).upper()
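ELIZA works with a cascade of such pattern and response rules rather than a single one: each rule is tried in order and the first that matches produces the reply. The following is only a rough sketch of that idea; the rule list, the `respond` function and the fallback reply are all hypothetical, not ELIZA's actual script.

```python
import re

# Hypothetical rules: (pattern, response template), tried in order.
RULES = [
    (r".*I'm (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance, default="PLEASE GO ON"):
    """Return the reply of the first matching rule, or a fallback (assumed)."""
    for pattern, template in RULES:
        match = re.fullmatch(pattern, utterance, flags=re.IGNORECASE)
        if match:
            # expand() fills \1 with the captured group, as re.sub would.
            return match.expand(template).upper()
    return default

print(respond("He says I'm depressed much of the time"))
print(respond("Men are all alike."))
print(respond("They're always bugging us"))
```

Ordering matters in such a cascade: more specific rules go first, since the first match wins.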

In [46]:
text = "He says I'm depressed much of the time"

In [53]:
re.sub(regexp, upp_repl, text)

"I'M SORRY THAT YOU ARE DEPRESSED"