In [2]:
import re

In [3]:
print('I would like some vegetables.'.replace('vegetables', 'pie'))
print(re.sub('vegetables', 'pie', 'I would like some vegetables.'))

I would like some pie.
I would like some pie.


The advantages of regex start to become more clear if we need to make more than one replacement:

In [4]:
veggie_request = 'I would like some vegetables, vitamins, and water.'
print(veggie_request.replace('vegetables', 'pie')
    .replace('vitamins', 'pie')
    .replace('water', 'pie'))
print(re.sub('vegetables|vitamins|water', 'pie', veggie_request))

I would like some pie, pie, and pie.
I would like some pie, pie, and pie.


I used the metacharacter `|`, the regex "or" operator, to shorten the command. Metacharacters signify a special regex command and don't match themselves unless escaped with `\`. We won't go over the other metacharacters here, so I highly recommend looking at the basics section of [the cheat sheet](https://www.debuggex.com/cheatsheet/regex/python) when you tackle the exercises.
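As a minimal sketch of what escaping looks like in practice, the snippet below contrasts an unescaped `.` with an escaped one, then uses `re.escape` to build a literal pattern. The `price_note` string is made up for illustration.

```python
import re

price_note = 'Pie costs $3.50 (cash only).'  # hypothetical example text

# Unescaped, `.` is a metacharacter that matches any character except a newline.
dots_unescaped = re.findall(r'.', 'a.b')
# Escaped with a backslash, it matches only a literal period.
dots_escaped = re.findall(r'\.', 'a.b')
print(dots_unescaped)  # ['a', '.', 'b']
print(dots_escaped)    # ['.']

# re.escape() backslash-escapes the metacharacters in arbitrary text for us.
literal_pattern = re.escape('$3.50')
updated_note = re.sub(literal_pattern, '$4.00', price_note)
print(updated_note)    # Pie costs $4.00 (cash only).
```

`re.escape` is handy when the text to match comes from user input and may contain metacharacters by accident.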

### Character Classes
Suppose we want to match a specific set of characters. The `re` module offers several built-in sets, plus the ability to build our own custom version. For example, the special sequence `\D` matches any non-digit character and makes it trivial to do basic phone number cleanup:

In [4]:
messy_phone_number = '(123) 456-7890'
print(re.sub(r'\D', '', messy_phone_number))

1234567890


You may have noticed that I added the raw string prefix `r` before my pattern. This allows us to specify special characters with a single `\` rather than `\\`. [Raw string notation (r"text") keeps regular expressions sane](https://docs.python.org/3/library/re.html#raw-string-notation); use them by default.
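To make the difference concrete, here is a small sketch comparing raw and regular strings. The `sentence` string is invented for the example.

```python
import re

# Without the raw prefix, the backslash must be doubled to reach the regex engine.
same_pattern = '\\D' == r'\D'
print(same_pattern)  # True

# '\b' in a regular string is a single backspace character;
# r'\b' is a backslash plus 'b', which regex reads as a word boundary.
print(len('\b'), len(r'\b'))  # 1 2

sentence = 'pie is not a pier'  # hypothetical example text
word_boundary_hits = re.findall(r'\bpie\b', sentence)
backspace_hits = re.findall('\bpie\b', sentence)
print(word_boundary_hits)  # ['pie']
print(backspace_hits)      # []
```

The second search silently finds nothing because it is looking for literal backspace characters, which is exactly the kind of bug the `r` prefix avoids.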

If we take a second look at the example above, we'll notice that it strips out too much data for some use cases. If a user entered some letters into the phone number, we might want to raise an error for that entry rather than try to clean it up. A better option is to define a custom character set to narrow down what we delete.

In [5]:
really_messy_number = messy_phone_number + ' this is not a valid phone number'
print(re.sub(r'\D', '', really_messy_number))
print(re.sub(r'[-.() ]', '', really_messy_number))

1234567890
1234567890thisisnotavalidphonenumber


That pattern means 'delete any character found between the brackets'. The characters within the brackets are treated as alternatives, as if they were `|` delimited, and most metacharacters lose their special meaning there, so we don't have to escape them.

If you need to build custom classes, it's worth taking a look at [the detailed explanation in the documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) as there are some special ordering rules that only apply within the `[]`.
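A short sketch of the two ordering rules that come up most often, the position of `-` and of `^`, reusing the phone number from above:

```python
import re

messy_phone_number = '(123) 456-7890'

# A hyphen placed first (or last) in a set is a literal hyphen.
stripped = re.sub(r'[-() ]', '', messy_phone_number)
print(stripped)     # 1234567890

# A hyphen between two characters defines a range: [0-9] is any ASCII digit.
digits = re.findall(r'[0-9]', messy_phone_number)
print(len(digits))  # 10

# A caret placed first negates the set; for this ASCII input, [^0-9] acts like \D.
negated = re.sub(r'[^0-9]', '', messy_phone_number)
print(negated)      # 1234567890
```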

### Quantifiers
In many cases, we only want to match a specific number of occurrences. A full US phone number, including the area code but no country code and no extension, will always have 10 digits. If we're searching a text for phone numbers, we'll want to match runs of exactly 10 digits, no more and no fewer.

In [6]:
buried_phone_number = 'You are the 987th caller in line for 1234567890. Please continue to hold.'
re.findall(r'\d{10}', buried_phone_number)

['1234567890']
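One caveat worth sketching: on its own, `\d{10}` will also match the first ten digits of a longer run. Assuming the numbers are surrounded by non-word characters, wrapping the pattern in `\b` word boundaries is one way to reject longer runs. The `order_text` and `extension_text` strings below are made up for illustration.

```python
import re

order_text = 'Order 123456789012 was placed from 1234567890.'  # hypothetical text

# \d{10} alone also matches the first ten digits of the longer order number.
loose_hits = re.findall(r'\d{10}', order_text)
print(loose_hits)    # ['1234567890', '1234567890']

# Word boundaries reject runs that continue with more digits on either side.
strict_hits = re.findall(r'\b\d{10}\b', order_text)
print(strict_hits)   # ['1234567890']

# {m,n} accepts a range of repetitions, here runs of 3 to 4 digits.
extension_text = 'Call 555 or 5555, not 55 or 55555.'  # hypothetical text
ranged_hits = re.findall(r'\b\d{3,4}\b', extension_text)
print(ranged_hits)   # ['555', '5555']
```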

### Lookarounds
In other cases we may only want a portion of the item we're matching. Let's say that we just need the area code from a phone number. This is where lookarounds come in handy.

In [7]:
re.findall(r'\d{3}(?=\d{7})', buried_phone_number)

['123']

That pattern matches three digits if and only if they're followed by seven more digits, and only returns the first three. The relevant special characters are in the `Regular Expression Assertions` section of [the cheat sheet](https://www.debuggex.com/cheatsheet/regex/python).
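The lookahead above has a mirror image, the lookbehind, and both come in negative forms. A brief sketch on the same string:

```python
import re

buried_phone_number = 'You are the 987th caller in line for 1234567890. Please continue to hold.'

# Positive lookbehind (?<=...): seven digits only if three digits come right before them.
local_number = re.findall(r'(?<=\d{3})\d{7}', buried_phone_number)
print(local_number)  # ['4567890']

# Negative lookahead (?!...): three digits at a word start NOT followed by another digit.
short_runs = re.findall(r'\b\d{3}(?!\d)', buried_phone_number)
print(short_runs)    # ['987']
```

Note that Python's `re` requires lookbehind patterns to have a fixed width, which `\d{3}` satisfies.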

### Flags
It's often helpful to adjust a pattern's 'settings'. Flags allow us to do that. My personal favorite makes a pattern case insensitive:

In [8]:
wordy_tom = """Tom. Let's talk about him. He often forgets to capitalize tom, his name. Oh, and don't match tomorrow."""
re.findall(r'(?i)\bTom\b', wordy_tom)

['Tom', 'tom']
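The same flag can also be passed through the `flags` argument instead of the inline `(?i)` prefix, and several flags can be combined with `|`. A small sketch follows; the `roll_call` string is invented for the example.

```python
import re

wordy_tom = """Tom. Let's talk about him. He often forgets to capitalize tom, his name. Oh, and don't match tomorrow."""

# The flags argument is equivalent to the inline (?i) prefix.
tom_hits = re.findall(r'\bTom\b', wordy_tom, flags=re.IGNORECASE)
print(tom_hits)   # ['Tom', 'tom']

# Flags combine with |; re.MULTILINE makes ^ and $ match at every line boundary.
roll_call = 'tom\nTOM\nnot tom'  # hypothetical example text
line_hits = re.findall(r'^tom$', roll_call, flags=re.IGNORECASE | re.MULTILINE)
print(line_hits)  # ['tom', 'TOM']
```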