# DS The Bridge - Regular Expressions


## Flexible Pattern Matching with Regular Expressions

The methods of Python's ``str`` type give you a powerful set of tools for formatting, splitting, and manipulating string data.
But even more powerful tools are available in Python's built-in *regular expression* module.
Regular expressions are a huge topic; there are entire books written on the topic (including Jeffrey E.F. Friedl’s [*Mastering Regular Expressions, 3rd Edition*](http://shop.oreilly.com/product/9780596528126.do)), so it will be hard to do it justice within a single subsection.

My goal here is to give you an idea of the types of problems that might be addressed using regular expressions, as well as a basic idea of how to use them in Python.
I'll suggest some references for learning more in [Further Resources on Regular Expressions](#Further-Resources-on-Regular-Expressions).

Fundamentally, regular expressions are a means of *flexible pattern matching* in strings.
If you frequently use the command line, you are probably familiar with this type of flexible matching with the "``*``" character, which acts as a wildcard.
For example, we can list all the IPython notebooks (i.e., files with extension *.ipynb*) with "Python" in their filename by using the "``*``" wildcard to match any characters in between:

In [23]:
!ls *Python*.ipynb

2.1.WarmUp_Introduccion_a_Python.ipynb
2.1.Workout_Introducción_a_Python.ipynb


Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes.
The Python interface to regular expressions is contained in the built-in ``re`` module; as a simple example, let's use it to duplicate the functionality of the string ``split()`` method:

In [26]:
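import re

line = 'the quick brown fox jumped over a lazy dog'
print(line)
regex = re.compile(r'\s+')  # raw string, so the backslash reaches the regex engine unchanged
regex.split(line)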

the quick brown fox jumped over a lazy dog


['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Here we've first *compiled* a regular expression, then used it to *split* a string.
Just as Python's ``split()`` method returns a list of all substrings between whitespace, the regular expression ``split()`` method returns a list of all substrings between matches to the input pattern.

In this case, the input is ``"\s+"``: "``\s``" is a special character that matches any whitespace (space, tab, newline, etc.), and the "``+``" is a character that indicates *one or more* of the entity preceding it.
Thus, the regular expression matches any substring consisting of one or more whitespace characters.
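To see that ``\s+`` really covers more than plain spaces, here is a minimal sketch on a made-up string (``messy`` is a hypothetical sample, not part of the original notebook) that mixes spaces, a tab, and a newline:

```python
import re

# Hypothetical sample mixing two spaces, a tab, and a newline
messy = "the  quick\tbrown\nfox"

whitespace = re.compile(r"\s+")

# Every run of whitespace, whatever its kind or length, acts as one separator
parts = whitespace.split(messy)
print(parts)

# Splitting on a literal single space leaves the tab and newline in place
# and produces an empty string between the two adjacent spaces
literal_space = messy.split(" ")
print(literal_space)
```

The first call should print the four words on their own, while the second keeps ``'quick\tbrown\nfox'`` as a single item.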

The ``split()`` method here is basically a convenience routine built upon this *pattern matching* behavior; more fundamental is the ``match()`` method, which will tell you whether the beginning of a string matches the pattern:

In [27]:
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches
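As a rough illustration of what "the beginning of a string" means here, the sketch below compares ``match()`` with two related methods of a compiled pattern, ``search()`` and ``fullmatch()``. The variable names are mine; the methods themselves are part of the standard ``re`` module.

```python
import re

whitespace = re.compile(r"\s+")
sample = "abc  "

# match() only looks at the start of the string
at_start = whitespace.match(sample)

# search() scans forward until it finds the first match anywhere
anywhere = whitespace.search(sample)

# fullmatch() requires the entire string to match the pattern
entire = whitespace.fullmatch(sample)
entire_blank = whitespace.fullmatch("   ")

print(at_start)          # None
print(anywhere.span())   # (3, 5)
print(entire)            # None
print(entire_blank is not None)  # True
```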


As with ``split()``, there are convenience routines to find the first match (like ``str.index()`` or ``str.find()``) or to find and replace (like ``str.replace()``).
We'll again use the line from before:

In [28]:
line = 'the quick brown fox jumped over a lazy dog'

With this, we can see that the ``regex.search()`` method operates a lot like ``str.index()`` or ``str.find()``:

In [29]:
line.index('fox')

16

In [30]:
regex = re.compile('fox')
match = regex.search(line)
match.start()

16

Similarly, the ``regex.sub()`` method operates much like ``str.replace()``:

In [31]:
line.replace('fox', 'BEAR')

'the quick brown BEAR jumped over a lazy dog'

In [32]:
regex.sub('BEAR', line)

'the quick brown BEAR jumped over a lazy dog'

With a bit of thought, other native string operations can also be cast as regular expressions.
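As one possible illustration (these pairings are my own choice, not an exhaustive or canonical mapping), here is how ``str.startswith()``, ``str.endswith()``, and ``str.count()`` might be expressed with the ``re`` module:

```python
import re

line = 'the quick brown fox jumped over a lazy dog'

# str.startswith('the'): re.match() only succeeds at the start of the string
starts_with_the = re.match(r'the', line) is not None

# str.endswith('dog'): "$" anchors the pattern at the end of the string
ends_with_dog = re.search(r'dog$', line) is not None

# str.count('o'): count the non-overlapping matches
count_o = len(re.findall(r'o', line))

print(starts_with_the, line.startswith('the'))  # True True
print(ends_with_dog, line.endswith('dog'))      # True True
print(count_o, line.count('o'))                 # 4 4
```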

### A more sophisticated example

But, you might ask, why would you want to use the more complicated and verbose syntax of regular expressions rather than the more intuitive and simple string methods?
The advantage is that regular expressions offer *far* more flexibility.

Here we'll consider a more complicated example: the common task of matching email addresses.
I'll start by simply writing a (somewhat indecipherable) regular expression, and then walk through what is going on.
Here it goes:

In [33]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')

Using this, if we're given a line from a document, we can quickly extract things that look like email addresses:

In [34]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

['guido@python.org', 'guido@google.com']

(Note that these addresses are entirely made up; there are probably better ways to get in touch with Guido).

We can do further operations, like replacing these email addresses with another string, perhaps to hide addresses in the output:

In [35]:
email.sub('--@--.--', text)

'To email Guido, try --@--.-- or the older address --@--.--.'

Finally, note that if you really want to match *any* email address, the preceding regular expression is far too simple.
For example, it only allows user names and domain names made of alphanumeric characters (and underscores), and it requires the address to end in exactly three lower-case letters.
So, for example, a period in the user name means that we only find part of the address:

In [36]:
email.findall('barack.obama@whitehouse.gov')

['obama@whitehouse.gov']

This goes to show how unforgiving regular expressions can be if you're not careful!
If you search around online, you can find some suggestions for regular expressions that will match *all* valid emails, but beware: they are much more involved than the simple expression used here!

### Basics of regular expression syntax

The syntax of regular expressions is much too large a topic for this short section.
Still, a bit of familiarity can go a long way: I will walk through some of the basic constructs here, and then list some more complete resources from which you can learn more.
My hope is that the following quick primer will enable you to use these resources effectively.

#### Simple strings are matched directly

If you build a regular expression on a simple string of characters or digits, it will match that exact string:

In [37]:
regex = re.compile('ion')
regex.findall('Great Expectations')

['ion']

#### Some characters have special meanings

While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:
```
. ^ $ * + ? { } [ ] \ | ( )
```
We will discuss the meaning of some of these momentarily.
In the meantime, you should know that if you'd like to match any of these characters directly, you can *escape* them with a backslash:

In [38]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

['$']

The ``r`` prefix in ``r'\$'`` indicates a *raw string*; in standard Python strings, the backslash is used to indicate special characters.
For example, a tab is indicated by ``"\t"``:

In [39]:
print('a\tb\tc')

a	b	c


Such substitutions are not made in a raw string:

In [40]:
print(r'a\tb\tc')

a\tb\tc


For this reason, whenever you use backslashes in a regular expression, it is good practice to use a raw string.
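A small sketch of what can go wrong without the raw string. It uses the ``\b`` word-boundary marker, which is not covered in this section but is part of the standard ``re`` syntax; in an ordinary Python string, ``"\b"`` is a backspace character instead.

```python
import re

text = "the fox is quick"

# Without the r prefix, Python converts "\b" to a backspace character
# before the regex engine sees it, so the pattern searches for backspaces
no_raw = re.findall('\bfox\b', text)

# With a raw string, the regex engine receives backslash + "b",
# which it interprets as a word boundary
raw = re.findall(r'\bfox\b', text)

print(no_raw)                 # []
print(raw)                    # ['fox']
print(len('\b'), len(r'\b'))  # 1 2
```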

#### Special characters can match character groups

Just as the ``"\"`` character within regular expressions can escape special characters, turning them into normal characters, it can also be used to give normal characters special meaning.
These special characters match specified groups of characters, and we've seen them before.
In the email address regexp from before, we used the character ``"\w"``, which is a special marker matching *any alphanumeric character* (letters, digits, and the underscore). Similarly, in the simple ``split()`` example, we also saw ``"\s"``, a special marker indicating *any whitespace character*.

Putting these together, we can create a regular expression that will match *any two letters/digits with whitespace between them*:

In [41]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

This example begins to hint at the power and flexibility of regular expressions.

The following table lists a few of these characters that are commonly useful:

| Character | Description                 | Character | Description                     |
|-----------|-----------------------------|-----------|---------------------------------|
| ``"\d"``  | Match any digit             | ``"\D"``  | Match any non-digit             |
| ``"\s"``  | Match any whitespace        | ``"\S"``  | Match any non-whitespace        |
| ``"\w"``  | Match any alphanumeric char | ``"\W"``  | Match any non-alphanumeric char |

This is *not* a comprehensive list or description; for more details, see Python's [regular expression syntax documentation](https://docs.python.org/3/library/re.html#re-syntax).
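Here is a quick sketch of a few of the markers from the table in action. The sample strings are invented for illustration, and the ``+`` suffix means "one or more", as in the ``\s+`` example earlier.

```python
import re

sample = "Order 66 shipped on 2024-05-17"   # hypothetical text

# \d+ picks out every run of digits
digits = re.findall(r'\d+', sample)

# \S+ picks out every run of non-whitespace characters
chunks = re.findall(r'\S+', sample)

# \w also accepts the underscore, so an identifier stays in one piece
identifiers = re.findall(r'\w+', 'snake_case 42')

print(digits)       # ['66', '2024', '05', '17']
print(chunks)       # ['Order', '66', 'shipped', 'on', '2024-05-17']
print(identifiers)  # ['snake_case', '42']
```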

#### Square brackets match custom character groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in.
For example, the following will match any lower-case vowel:

In [42]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example, ``"[a-z]"`` will match any lower-case letter, and ``"[1-3]"`` will match any of ``"1"``, ``"2"``, or ``"3"``.
For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [43]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

['G2', 'H6']
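Two small variations on the bracket syntax, sketched with made-up inputs: several ranges can share one pair of brackets, and (a detail not covered above, but part of the standard ``re`` syntax) a leading ``^`` inside the brackets negates the set.

```python
import re

# Two letters of either case followed by two digits (hypothetical code format)
code = re.compile(r'[A-Za-z][A-Za-z][0-9][0-9]')
found = code.findall('ab12, XY99, a1, 1234')

# [^aeiou] matches any single character that is NOT a lower-case vowel
consonants = re.findall(r'[^aeiou]', 'regex')

print(found)       # ['ab12', 'XY99']
print(consonants)  # ['r', 'g', 'x']
```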

#### Wildcards match repeated characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, ``"\w\w\w"``.
Because this is such a common need, there is a specific syntax to match repetitions – curly braces with a number:

In [44]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

There are also markers available to match any number of repetitions – for example, the ``"+"`` character will match *one or more* repetitions of what precedes it:

In [45]:
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

['The', 'quick', 'brown', 'fox']

The following is a table of the repetition markers available for use in regular expressions:

| Character | Description | Example |
|-----------|-------------|---------|
| ``?`` | Match zero or one repetitions of preceding  | ``"ab?"`` matches ``"a"`` or ``"ab"`` |
| ``*`` | Match zero or more repetitions of preceding | ``"ab*"`` matches ``"a"``, ``"ab"``, ``"abb"``, ``"abbb"``... |
| ``+`` | Match one or more repetitions of preceding  | ``"ab+"`` matches ``"ab"``, ``"abb"``, ``"abbb"``... but not ``"a"`` |
| ``{n}`` | Match ``n`` repetitions of preceding | ``"ab{2}"`` matches ``"abb"`` |
| ``{m,n}`` | Match between ``m`` and ``n`` repetitions of preceding | ``"ab{2,3}"`` matches ``"abb"`` or ``"abbb"`` |
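The examples in the table can be checked directly. The sketch below uses ``re.fullmatch()`` (so that the whole candidate string has to match) against a small list of candidates that I chose for illustration:

```python
import re

candidates = ["a", "ab", "abb", "abbb", "abbbb"]
patterns = [r"ab?", r"ab*", r"ab+", r"ab{2}", r"ab{2,3}"]

# For each pattern, collect the candidates that it matches in full
results = {}
for pattern in patterns:
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, "->", results[pattern])
```

With these candidates, ``ab?`` should accept only ``"a"`` and ``"ab"``, while ``ab*`` accepts all five strings.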

With these basics in mind, let's return to our email address matcher:

In [46]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')

We can now understand what this means: we want one or more alphanumeric characters (``"\w+"``) followed by the *at sign* (``"@"``), followed by one or more alphanumeric characters (``"\w+"``), followed by a period (``"\."`` – note the need for a backslash escape), followed by exactly three lower-case letters.

If we want to now modify this so that the Obama email address matches, we can do so using the square-bracket notation:

In [47]:
email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

We have changed ``"\w+"`` to ``"[\w.]+"``, so we will match any alphanumeric character *or* a period.
With this more flexible expression, we can match a wider range of email addresses (though still not all – can you identify other shortcomings of this expression?).
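To make a few of those shortcomings concrete, here is a sketch that runs the same pattern over some invented addresses (all on ``example`` domains, none of them real). It is only a sample of failure modes, not a complete list.

```python
import re

email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

# Hypothetical addresses chosen to trip up the pattern
addresses = [
    'ada@example.io',        # two-letter suffix
    'ada@mail.example.com',  # subdomain
    'ada-l@example.com',     # hyphen in the user name
    'ADA@EXAMPLE.COM',       # upper-case suffix
]

results = {address: email2.findall(address) for address in addresses}
for address, found in results.items():
    print(address, '->', found)
```

The first and last addresses are missed entirely, and the other two are only partially matched (``'ada@mail.exa'`` and ``'l@example.com'``).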

#### Parentheses indicate *groups* to extract

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to *group* the results:

In [48]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [49]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]

As we see, when the pattern contains groups, ``findall()`` returns a list of tuples holding the sub-components of each email address.

We can go a bit further and *name* the extracted components using the ``"(?P<name> )"`` syntax, in which case the groups can be extracted as a Python dictionary:

In [50]:
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@python.org')
match.groupdict()

{'user': 'guido', 'domain': 'python', 'suffix': 'org'}
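Named groups can also be read one at a time. The sketch below reuses the same pattern with ``finditer()`` and ``search()``, both standard methods of a compiled pattern, since ``match()`` would fail on a sentence that does not *start* with an address:

```python
import re

email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
text = "To email Guido, try guido@python.org or the older address guido@google.com."

# match() anchors at the start of the string, so it finds nothing here
at_start = email4.match(text)

# finditer() yields one match object per address found in the text
domains = [m.group('domain') for m in email4.finditer(text)]

# Groups can be accessed by name or by position; group(0) is the full match
first = email4.search(text)

print(at_start)                                              # None
print(domains)                                               # ['python', 'google']
print(first.group(0), first.group('user'), first.group(3))   # guido@python.org guido org
```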

## `str.replace()`

If you don't know how it works, you can always check the `help`:

In [51]:
help(str.replace)

Help on method_descriptor:

replace(...)
    S.replace(old, new[, count]) -> str
    
    Return a copy of S with all occurrences of substring
    old replaced by new.  If the optional argument count is
    given, only the first count occurrences are replaced.



The following will not modify `my_string`, because strings are immutable: `replace` returns a new string rather than changing the original in place.

In [52]:
my_string = " my taylor is rich"
my_string.replace('a', '?')
print(my_string)

 my taylor is rich


You have to store the return value of `replace` instead.

In [53]:
my_modified_string = my_string.replace('is', 'will be')
print(my_modified_string)

 my taylor will be rich
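The help text above also mentions an optional ``count`` argument. A minimal sketch of how it behaves, reusing the same string:

```python
my_string = " my taylor is rich"

# With count=1, only the first occurrence is replaced
first_only = my_string.replace('i', 'I', 1)

# Without count, every occurrence is replaced
all_of_them = my_string.replace('i', 'I')

print(repr(first_only))   # ' my taylor Is rich'
print(repr(all_of_them))  # ' my taylor Is rIch'
print(repr(my_string))    # ' my taylor is rich' (the original is unchanged)
```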


## `str.format()`

In [54]:
secret = '{} is cool'.format('Python')
print(secret)

Python is cool


In [55]:
print('My name is {} {}, you can call me {}.'.format('John', 'Doe', 'John'))
# is the same as:
print('My name is {first} {family}, you can call me {first}.'.format(first='John', family='Doe'))

My name is John Doe, you can call me John.
My name is John Doe, you can call me John.
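A few more things ``str.format()`` can do, sketched with illustrative values of my own: positional indices let you reuse an argument, and a format specification after the colon controls precision and alignment. The last line shows an f-string (Python 3.6+), which accepts the same specifications.

```python
# Positional indices allow an argument to be reused
reused = '{0} {1}, or just {0}'.format('John', 'Doe')

# Format specifications: two decimal places, right-aligned in 8 characters
price = 3.14159
rounded = '{:.2f}'.format(price)
aligned = '{:>8}'.format('abc')

# f-strings embed the expression directly in the string literal
first = 'John'
greeting = f'Call me {first}.'

print(reused)         # John Doe, or just John
print(rounded)        # 3.14
print(repr(aligned))  # '     abc'
print(greeting)       # Call me John.
```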


## `str.join()`

In [56]:
pandas = 'pandas'
numpy = 'numpy'
requests = 'requests'
cool_python_libs = ', '.join([pandas, numpy, requests])

In [57]:
print('Some cool python libraries: {}'.format(cool_python_libs))

Some cool python libraries: pandas, numpy, requests


Alternatives (not as [Pythonic](http://docs.python-guide.org/en/latest/writing/style/#idioms) and [slower](https://waymoot.org/home/python_string/)):

In [58]:
cool_python_libs = pandas + ', ' + numpy + ', ' + requests
print('Some cool python libraries: {}'.format(cool_python_libs))

cool_python_libs = pandas
cool_python_libs += ', ' + numpy
cool_python_libs += ', ' + requests
print('Some cool python libraries: {}'.format(cool_python_libs))

Some cool python libraries: pandas, numpy, requests
Some cool python libraries: pandas, numpy, requests
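One thing to watch out for with ``join()`` is that every item must already be a string. A short sketch, with ``versions`` as a made-up list:

```python
versions = [3, 11, 4]   # hypothetical list of integers

# Joining non-string items raises a TypeError
try:
    '.'.join(versions)
    join_failed = False
except TypeError as error:
    join_failed = True
    print('TypeError:', error)

# Converting each item to str first works
dotted = '.'.join(str(v) for v in versions)
print(dotted)  # 3.11.4

# A string is itself an iterable of characters
dashed = '-'.join('abc')
print(dashed)  # a-b-c
```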


## `str.upper(), str.lower(), str.title()`

In [59]:
mixed_case = 'PyTHoN hackER'

In [60]:
mixed_case.upper()

'PYTHON HACKER'

In [61]:
mixed_case.lower()

'python hacker'

In [62]:
mixed_case.title()

'Python Hacker'

## `str.strip()`

In [63]:
ugly_formatted = ' \n \t Some story to tell '
stripped = ugly_formatted.strip()

print('ugly: {}'.format(ugly_formatted))
print('stripped: {}'.format(stripped))

ugly:  
 	 Some story to tell 
stripped: Some story to tell
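``strip()`` has two one-sided relatives, ``lstrip()`` and ``rstrip()``, and all three accept an optional set of characters to remove instead of whitespace. A minimal sketch (the ``decorated`` string is invented):

```python
ugly_formatted = ' \n \t Some story to tell '

# Strip only one side
left = ugly_formatted.lstrip()
right = ugly_formatted.rstrip()

# Strip any mix of the given characters from both ends
decorated = '--==title==--'
bare = decorated.strip('-=')

print(repr(left))   # 'Some story to tell '
print(repr(right))  # ' \n \t Some story to tell'
print(bare)         # title
```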


## `str.split()`

In [64]:
sentence = 'three different words'
words = sentence.split()
print(words)

['three', 'different', 'words']


In [65]:
type(words)

list

In [66]:
secret_binary_data = '01001,101101,11100000'
binaries = secret_binary_data.split(',')
print(binaries)

['01001', '101101', '11100000']
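Two details of ``split()`` that are easy to miss, sketched with made-up strings: calling it with no argument collapses runs of whitespace, whereas an explicit ``' '`` separator does not, and the optional ``maxsplit`` argument limits the number of splits.

```python
spaced = 'three  different   words'   # two spaces, then three spaces

# No argument: any run of whitespace counts as a single separator
collapsed = spaced.split()

# Explicit separator: each space is a separator, leaving empty strings
literal = spaced.split(' ')

# maxsplit=1 splits only at the first separator
key, value = 'key=value=more'.split('=', 1)

print(collapsed)   # ['three', 'different', 'words']
print(literal)     # ['three', '', 'different', '', '', 'words']
print(key, value)  # key value=more
```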


## Calling multiple methods in a row

In [67]:
ugly_mixed_case = '   ThIS LooKs BAd '
pretty = ugly_mixed_case.strip().lower().replace('bad', 'good')
print(pretty)

this looks good


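Note that execution order is from left to right. Thus, this won't work, because ``replace('bad', 'good')`` runs before ``lower()`` and finds no lower-case ``'bad'`` in the original string: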

In [68]:
pretty = ugly_mixed_case.replace('bad', 'good').strip().lower()
print(pretty)

this looks bad
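If you would rather not depend on the order of the calls, one alternative (my suggestion, tying back to the regular expressions above) is ``re.sub()`` with the ``re.IGNORECASE`` flag, which replaces the word whatever its casing and leaves the rest of the string as it was:

```python
import re

ugly_mixed_case = '   ThIS LooKs BAd '

# Option 1: lower-case first, then replace (order matters)
pretty = ugly_mixed_case.strip().lower().replace('bad', 'good')

# Option 2: case-insensitive replacement; the other words keep their casing
kept_case = re.sub(r'bad', 'good', ugly_mixed_case, flags=re.IGNORECASE).strip()

print(pretty)     # this looks good
print(kept_case)  # ThIS LooKs good
```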
