## Regular Expressions
---

### Introduction
---

A **regular expression** (or **regex**) is a sequence of characters that defines a search pattern.

The concept dates back to the 1950s. Typical uses include:
- validating the format of email addresses or passwords on the server side during registration
- parsing text data files to find, replace or delete certain strings

In [1]:
# In Python, regexes are supported by the "re" module.
import re

In [42]:
?re

In [2]:
# If the pattern is found, `findall` will return all its instances
print(re.findall('a', 'Natural Language Processing'))

['a', 'a', 'a', 'a']


In [3]:
# If the pattern is not found, `findall` will return an empty list
print(re.findall('a', 'Georgetown University'))

[]


In [4]:
# By default, regular expressions are case sensitive
print(re.findall('A', 'Machine Learning'))

[]


In [5]:
# For find and replace, the function is `sub`
re.sub('Language', 'Gas', 'Natural Language Processing')

'Natural Gas Processing'
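The same `sub` call also covers the "delete" use case: replacing with an empty string removes every match, and the optional `count` argument limits how many matches are replaced. A minimal sketch, reusing the sample string from above (the variable names are illustrative):

```python
import re

text = 'Natural Language Processing'

# Replacing with an empty string deletes every match
no_a = re.sub('a', '', text)
print(no_a)        # Nturl Lnguge Processing

# `count` limits the number of replacements (here: only the first match)
first_only = re.sub('a', 'A', text, count=1)
print(first_only)  # NAtural Language Processing
```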

### Special characters
---

Character|Description
:---: | ---
`.`|matches any single character, except newline characters
`\w`|matches any single letter, digit or underscore
`\W`|matches any character not part of `\w`
`\s`|matches a single whitespace character like: space, newline, tab, return
`\S`|matches any character not part of `\s`
`\t`|matches tab
`\n`|matches newline
`\r`|matches return
`\d`|matches decimal digit 0-9
`\b`|matches the beginning or end of a word
`\`|escapes special characters

In [7]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('The qu.ck brown', string)

['The quick brown']

In [8]:
string = "The quick brown fox jumps over the lazy dog"
re.findall(r'The qu\wck brown', string)

['The quick brown']

In [9]:
string = "The quick brown fox jumps over the lazy dog"
re.findall(r'The quick\Wbrown', string)

['The quick brown']

In [10]:
string = "The quick brown fox jumps over the lazy dog"
re.findall(r'The quick\wbrown', string)

[]

In [11]:
string = "2 quick brown foxes jump over the lazy dog"
re.findall(r'\d', string)

['2']

In [12]:
# `\b` needs a raw string; otherwise Python reads it as a backspace character
string = "The quick brown fox jumps over the lazy dog"
re.findall(r'\bThe\b', string)

['The']

In [13]:
string = "The quick brown fox jumps over the lazy dog."
re.findall(r'\.', string)

['.']
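The whitespace classes `\s`, `\t` and `\n` are listed in the table but not shown in the cells above. A small sketch, using a made-up sample string that contains a tab and a newline:

```python
import re

string = "The quick\tbrown fox\njumps over"

# \s matches every whitespace character: spaces, the tab and the newline
whitespace = re.findall(r'\s', string)
print(whitespace)   # [' ', '\t', ' ', '\n', ' ']

# \t matches only the tab
tabs = re.findall(r'\t', string)
print(tabs)         # ['\t']

# Replacing each \s with a plain space puts the text on a single line
normalized = re.sub(r'\s', ' ', string)
print(normalized)   # The quick brown fox jumps over
```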

### Positional characters
---

Character|Description
:---: | ---
`^`|matches a pattern at the start of the string (and at the start of each line with `re.MULTILINE`)
`$`|matches a pattern at the end of the string (and at the end of each line with `re.MULTILINE`)
`\A`|matches only at the start of the string, even in `re.MULTILINE` mode

In [15]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('^The', string)

['The']

In [16]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('dog$', string)

['dog']

In [17]:
string = """The quick brown fox jumps over the lazy dog.

The sentence above is an English language pangram."""
re.findall(r'\AThe', string)

['The']
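The difference between `^` and `\A` only shows up on multi-line text combined with the `re.MULTILINE` flag. A minimal sketch, assuming a two-line string in which both lines start with "The":

```python
import re

string = """The quick brown fox jumps over the lazy dog.
The sentence above is an English language pangram."""

# Without flags, ^ matches only at the very start of the string
caret_default = re.findall(r'^The', string)
print(caret_default)    # ['The']

# With re.MULTILINE, ^ also matches right after each newline
caret_multiline = re.findall(r'^The', string, flags=re.MULTILINE)
print(caret_multiline)  # ['The', 'The']

# \A is unaffected by re.MULTILINE: it still matches only at the start
start_only = re.findall(r'\AThe', string, flags=re.MULTILINE)
print(start_only)       # ['The']
```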

### Ranges
---

Character|Description
:---: | ---
`[abc]`|matches `a` or `b` or `c`
`[a-zA-Z0-9]`|matches any letter from `a` to `z`, or `A` to `Z`, or `0` to `9`
`[^a-z]`|matches everything besides any letter from `a` to `z`

In [19]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('[xyz]', string)

['x', 'z', 'y']

In [20]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('[x-z]', string)

['x', 'z', 'y']

In [21]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('[^a-w T]', string)

['x', 'z', 'y']
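Ranges can be combined inside one pair of brackets, and a negated range is handy for stripping unwanted characters. A short sketch on a made-up sample string:

```python
import re

string = "Room 42, Building B-7!"

# [0-9] matches any single digit
digits = re.findall(r'[0-9]', string)
print(digits)    # ['4', '2', '7']

# [A-Z] matches any single uppercase letter
capitals = re.findall(r'[A-Z]', string)
print(capitals)  # ['R', 'B', 'B']

# A negated range: delete everything that is not a letter, a digit or a space
cleaned = re.sub(r'[^a-zA-Z0-9 ]', '', string)
print(cleaned)   # Room 42 Building B7
```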

### Repetitions
---

Character|Description
:---: | ---
`+`|matches 1 or more repetitions of the character (or group) to its left (greedy)
`*`|matches 0 or more repetitions of the character (or group) to its left (greedy)
`?`|matches 0 or 1 repetition of the character (or group) to its left
`{x}`|repeats exactly `x` times
`{x,}`|repeats at least `x` times
`{x,y}`|repeats at least `x` times, but no more than `y` times

In [23]:
string = "The quick brown fox jumps over the lazy moose"
re.findall('mo+', string)

['moo']

In [24]:
string = "The quick brown fox jumps over the lazy moose"
re.findall('mo*', string)

['m', 'moo']

In [25]:
string = "The quick brown fox jumps over the lazy moose"
re.findall('mo?', string)

['m', 'mo']

In [26]:
string = "The quick brown fox jumps over the lazy moose"
re.findall('o{2}', string)

['oo']
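The open-ended and bounded forms `{x,}` and `{x,y}` work the same way as `{x}`. A small sketch, using a made-up string with runs of `o` of increasing length:

```python
import re

string = "go goo gooo goooo"

# {2,}: at least two o's, taking as many as are available
at_least_two = re.findall(r'go{2,}', string)
print(at_least_two)  # ['goo', 'gooo', 'goooo']

# {2,3}: at least two o's, but never more than three
two_to_three = re.findall(r'go{2,3}', string)
print(two_to_three)  # ['goo', 'gooo', 'gooo']
```

Note that the bounds are written without a space after the comma.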

In [27]:
# When a special character matches as much of the string as possible, 
# it is said to be a "Greedy Match". It is the normal behavior of a regular 
# expression but sometimes this behavior is not desired:

string = "<h1>This is a header</h1>"
re.findall('<.*>', string)[0]

# The pattern <.*> matches the whole string, right up to the second 
# occurrence of >.

'<h1>This is a header</h1>'

In [28]:
# However, if we only want to match the first <h1> tag, we should use 
# the non-greedy qualifier "*?", which matches as little text as possible.

string = "<h1>This is a header</h1>"
re.findall('<.*?>', string)[0]

'<h1>'

### Flags
---

Flag|Description
:---: | ---
`re.IGNORECASE`|perform case-insensitive matching
`re.MULTILINE`|make `^` and `$` match at the start and end of each line, not only of the whole string
`re.DOTALL`|make `.` match newlines as well

In [30]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('[T]', string, flags=re.IGNORECASE)

['T', 't']
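The other two flags in the table are easiest to understand on a string that contains a newline. A minimal sketch, assuming a made-up two-line string:

```python
import re

string = "first line\nsecond line"

# Without flags, $ matches only at the end of the whole string
end_default = re.findall(r'line$', string)
print(end_default)    # ['line']

# With re.MULTILINE, $ also matches at the end of each line
end_multiline = re.findall(r'line$', string, flags=re.MULTILINE)
print(end_multiline)  # ['line', 'line']

# Without flags, . stops at the newline, so no match can span both lines
dot_default = re.findall(r'first.*second', string)
print(dot_default)    # []

# With re.DOTALL, . also matches the newline
dot_all = re.findall(r'first.*second', string, flags=re.DOTALL)
print(dot_all)        # ['first line\nsecond']
```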

### Groups
---

Group|Description
:---: | ---
`(...)`|define a group
&#124;|`OR` operator
`(?:...)`|match group, but do not return it as a separate entity
`(?=...)`|positive lookahead
`(?!...)`|negative lookahead
`(?<=...)`|positive lookbehind
`(?<!...)`|negative lookbehind

In [32]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('(quick|fast) brown', string)

['quick']

In [33]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('(?:quick|fast) brown', string)

['quick brown']

In [34]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('brown (?=fox)', string)

['brown ']

In [35]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('brown (?!box)', string)

['brown ']

In [36]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('(?<=quick) brown', string)

[' brown']

In [37]:
string = "The quick brown fox jumps over the lazy dog"
re.findall('(?<!fast) brown', string)

[' brown']
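The cells above show that `findall` returns only the group's text when the pattern contains one capturing group. With several groups it returns one tuple per match, and `re.search` (another function of the `re` module, not used elsewhere in this notebook) gives access to both the full match and each group. A short sketch:

```python
import re

string = "The quick brown fox jumps over the lazy dog"

# Two capturing groups: findall returns one tuple per match
pairs = re.findall(r'(quick|lazy) (\w+)', string)
print(pairs)             # [('quick', 'brown'), ('lazy', 'dog')]

# re.search returns a match object for the first match only
match = re.search(r'(quick|fast) brown', string)
print(match.group(0))    # quick brown  (the whole match)
print(match.group(1))    # quick        (the first group)
```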

### Summary
---


Regex|Description
:---: | ---
`.`|matches any single character, except newline characters
`\w`|matches any single letter, digit or underscore
`\W`|matches any character not part of `\w`
`\s`|matches a single whitespace character like: space, newline, tab, return
`\S`|matches any character not part of `\s`
`\t`|matches tab
`\n`|matches newline
`\r`|matches return
`\d`|matches decimal digit 0-9
`\b`|matches the beginning or end of a word
`^`|matches a pattern at the start of the string (and at the start of each line with `re.MULTILINE`)
`$`|matches a pattern at the end of the string (and at the end of each line with `re.MULTILINE`)
`\A`|matches only at the start of the string, even in `re.MULTILINE` mode
`[abc]`|matches `a` or `b` or `c`
`[a-zA-Z0-9]`|matches any letter from `a` to `z`, or `A` to `Z`, or `0` to `9`
`[^a-z]`|matches everything besides any letter from `a` to `z`
`+`|matches 1 or more repetitions of the character (or group) to its left (greedy)
`*`|matches 0 or more repetitions of the character (or group) to its left (greedy)
`?`|matches 0 or 1 repetition of the character (or group) to its left
`{x}`|repeats exactly `x` times
`{x,}`|repeats at least `x` times
`{x,y}`|repeats at least `x` times, but no more than `y` times
`re.IGNORECASE`|perform case-insensitive matching
`re.MULTILINE`|make `^` and `$` match at the start and end of each line, not only of the whole string
`re.DOTALL`|make `.` match newlines as well
`(...)`|define a group
&#124;|`OR` operator
`(?:...)`|match group, but do not return it as a separate entity
`(?=...)`|positive lookahead
`(?!...)`|negative lookahead
`(?<=...)`|positive lookbehind
`(?<!...)`|negative lookbehind

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise regex.1</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Remove the extra spaces from the following text:</p></a>
    <p style="margin-left: 150px;
              margin-right: 100px;
              line-height: 1.7em;"><span style="font-family:monospace;">This&nbsp;&nbsp;is&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a&nbsp;string&nbsp;&nbsp;&nbsp;with&nbsp;&nbsp;a&nbsp;&nbsp;&nbsp;&nbsp;lot&nbsp;of&nbsp;spaces</span></p></font>
</div>

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise regex.2</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Extract only the text within quotation marks (non-greedy) from the following text:</p></a>
    <p style="margin-left: 150px;
              margin-right: 100px;
              line-height: 1.7em;"><span style="font-family:monospace;">"This" is a string with "quotes"</span></p></font>
</div>

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise regex.3</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Extract words longer than 3 characters from the following text:</p></a>
    <p style="margin-left: 150px;
              margin-right: 100px;
              line-height: 1.7em;"><span style="font-family:monospace;">This is a string with a few, words</span></p></font>
</div>

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise regex.4</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Remove all parentheses and their content from the following text:</p></a>
    <p style="margin-left: 150px;
              margin-right: 100px;
              line-height: 1.7em;"><span style="font-family:monospace;">This is a string (but who cares about this)</span></p></font>
</div>

<div class="alert alert-block alert-info" 
     style="border-color: #2E86C1; 
            border-left: 5px solid #2E86C1;
            padding-top: 5px">
    <font size="4"> 
    <a href="../projects/playground.ipynb" style="text-decoration: none">
    <h3 style="font-family: monospace">Exercise regex.5</h3>
    <p style="margin-left: 100px;
              margin-right: 100px;
              line-height: 1.7em;">Extract all numbers from the following text:</p></a>
    <p style="margin-left: 150px;
              margin-right: 100px;
              line-height: 1.7em;"><span style="font-family:monospace;">Seems like 65.74 is a float, but 5 is an integer</span></p></font>
</div>