# Regular Expressions

http://www.pyregex.com

This tutorial is based on chapter 7, "Pattern Matching with Regular Expressions", of the book *Automate the Boring Stuff with Python* by Al Sweigart.



*Regular expressions* allow you to specify a pattern of text to search for.

Regular expressions are huge time-savers, not just for software users but also for programmers. In fact, tech writer Cory Doctorow argues that even before teaching programming, we should be teaching regular expressions:

> Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.
>
> Source: https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions


## Finding Patterns of Text with Regular Expressions

Regular expressions, called regexes for short, are descriptions of a pattern of text. For example, a `\d` in a regex stands for a digit character, that is, any single numeral `0` to `9`. Python could use the regex `\d\d \d\d \d\d \d\d` to match a Danish telephone number written as eight digits in four pairs, with a single space between the pairs.

But regular expressions can be much more sophisticated. For example, adding a `2` in curly brackets (`{2}`) after a pattern is like saying, "Match this pattern two times." So the regex `\d{2} \d{2} \d{2} \d{2}` also matches the correct phone number format. It could be shortened even more to `(\d{2} ){3}\d{2}`.
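As a quick check, the three spellings of the pattern can be tried against the same string. This is only a sketch using Python's built-in `re` module; the sample number and the variable names `patterns` and `sample` are chosen here for illustration.

```python
import re

# Three equivalent ways to describe four pairs of digits separated by spaces.
patterns = [
    r'\d\d \d\d \d\d \d\d',
    r'\d{2} \d{2} \d{2} \d{2}',
    r'(\d{2} ){3}\d{2}',
]

sample = '20 86 46 44'  # illustrative phone number

for pattern in patterns:
    # fullmatch() succeeds only if the whole string fits the pattern.
    matched = re.fullmatch(pattern, sample) is not None
    print('{:30} matches: {}'.format(pattern, matched))
```

All three patterns should report a match for this sample, while a string with different spacing, such as `'2086 4644'`, should not match any of them.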

### Creating Regex Objects

All the regex functions in Python are in the `re` module.

Passing a string value representing your regular expression to `re.compile()` returns a Regex pattern object (or simply, a Regex object). Note, since regular expressions frequently use backslashes in them, it is convenient to pass raw strings to the `re.compile()` function instead of typing extra backslashes. Typing `r'\d{2} \d{2} \d{2} \d{2}'` is much easier than typing `'\\d{2} \\d{2} \\d{2} \\d{2}'`.




In [None]:
import re


phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')
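To see why raw strings are only a convenience, the two spellings can be compared directly. The following is a small sketch; the names `raw` and `escaped` are introduced here for illustration.

```python
import re

raw = r'\d{2} \d{2} \d{2} \d{2}'
escaped = '\\d{2} \\d{2} \\d{2} \\d{2}'

# Both literals produce the very same string, so re.compile()
# receives identical input either way.
print(raw == escaped)

phone_num_reg = re.compile(raw)
# The compiled object keeps the pattern text in its .pattern attribute.
print(phone_num_reg.pattern)
```

The raw string simply tells Python not to treat backslashes as escape characters, which keeps the regex readable.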

### Matching Regex Objects

A Regex object’s `search()` method searches the string it is passed for any matches to the regex. The `search()` method will return `None` if the regex pattern is not found in the string. If the pattern is found, the `search()` method returns a `Match` object. `Match` objects have a `group()` method that will return the actual matched text from the searched string.

In [None]:
import re


phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

address_entry = """Møller 
20 86 46 44 
Herningvej 8
4800
Nykøbing F"""

mo = phone_num_reg.search(address_entry)
mo.group()
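Because `search()` returns `None` when nothing matches, calling `group()` on the result without checking raises an `AttributeError`. One way to guard against that is sketched below; the helper name `find_phone_number` and the sample strings are hypothetical and only serve the example.

```python
import re

phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')


def find_phone_number(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = phone_num_reg.search(text)
    if mo is None:
        # No match: there is no Match object to call group() on.
        return None
    return mo.group()


print(find_phone_number('Call 20 86 46 44 after noon'))
print(find_phone_number('Herningvej 8, 4800 Nykøbing F'))
```

The first call finds a number; the second string contains no phone number, so the helper returns `None` instead of failing.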



### Grouping with Parentheses


Adding parentheses will create groups in the regex: `r'(\d{4})\n(Nykøbing F)'`. Then you can use the `group()` match object method to grab the matching text from just one group. The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the `group()` match object method, you can grab different parts of the matched text. Passing 0 or nothing to the `group()` method will return the entire matched text.


In [None]:
# Group 1 captures the four-digit postal code, group 2 the city name.
postal_code_reg = re.compile(r'(\d{4})\n(Nykøbing F)')

address_entry = """Møller 
20 86 46 44 
Herningvej 8
4800
Nykøbing F"""

mo = postal_code_reg.search(address_entry)
print(mo.group(0))
print('Group 1: ', mo.group(1))
print('Group 2: ', mo.group(2))
mo.groups()
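Since `groups()` returns a tuple with one entry per group, its result can be unpacked into separate variables. The sketch below assumes the same postal code pattern as above; the names `postal_reg`, `postal_code`, and `city` are chosen for illustration.

```python
import re

postal_reg = re.compile(r'(\d{4})\n(Nykøbing F)')

address_entry = "Møller\n20 86 46 44\nHerningvej 8\n4800\nNykøbing F"

mo = postal_reg.search(address_entry)
if mo is not None:
    # One variable per group, in the order the parentheses open.
    postal_code, city = mo.groups()
    print(postal_code)
    print(city)
```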

### The `findall()` Method

In addition to the `search()` method, Regex objects also have a `findall()` method. While `search()` will return a Match object of the first matched text in the searched string, the `findall()` method will return the strings of every match in the searched string. 


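If the regular expression contains groups, `findall()` returns the group contents instead of the whole match: a list of strings if there is a single group, or a list of tuples (one string per group) if there are two or more groups.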

In [None]:
phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

address_entry = """Møller 
20 86 46 44 
Herningvej 8
4800
Nykøbing F

A Bischoff Møller
86 14 18 31 
Stenkildevej 14
8260
Viby J

A Egelund-Møller
54 94 41 81 
Rønnebærparken 1 0011
4983
Dannemare"""

numbers = phone_num_reg.findall(address_entry)
print('All matches: {}'.format(numbers))

mo = phone_num_reg.search(address_entry)
print('First match: {}'.format(mo.group()))
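The effect of groups on `findall()` is easiest to see side by side. In this sketch the patterns `four_groups_reg` and `one_group_reg` and the sample text are made up for the illustration.

```python
import re

text = 'Møller: 20 86 46 44, Bischoff Møller: 86 14 18 31'

# Four groups: findall() returns one tuple per match, one string per group.
four_groups_reg = re.compile(r'(\d{2}) (\d{2}) (\d{2}) (\d{2})')
pairs = four_groups_reg.findall(text)
print(pairs)

# The tuples can be joined back into complete phone numbers.
numbers = [' '.join(groups) for groups in pairs]
print(numbers)

# A single group: findall() returns a plain list of that group's text.
one_group_reg = re.compile(r'(\d{2}) \d{2} \d{2} \d{2}')
first_pairs = one_group_reg.findall(text)
print(first_pairs)
```

With groups present, the full matched text is no longer returned, so keep the pattern free of groups when only the whole match is needed.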

### More Regex Syntax

On top of grouping and repetition, regexes can express quite a bit more. Have a look at http://www.pyregex.com to see what else is possible.