In [1]:
import pandas as pd
import numpy as np
import re

# RegEx

A regular expression (“RegEx”) is a sequence of characters that specifies a search pattern, typically written to extract specific information from text. Regular expressions are essentially a small programming language embedded within Python, made available through the `re` module. As such, they have their own syntax and their own methods.

Regular expressions are useful in many applications beyond data science. For example, American Social Security Numbers (SSNs) are often validated with regular expressions. As a reminder, SSNs must follow the pattern: 3 digits, followed by a `-`, followed by 2 digits, followed by a `-`, finally followed by 4 digits. This can be expressed as `r"[0-9]{3}-[0-9]{2}-[0-9]{4}"` using RegEx.
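As a rough illustration of that validation idea, the sketch below checks whole strings against the SSN pattern with `re.fullmatch`. The helper name `looks_like_ssn` and the sample strings are made up for this example, and the check covers only the digit layout, not whether a number was ever issued.

```python
import re

SSN_PATTERN = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"

def looks_like_ssn(candidate):
    """Return True if the whole string has the 3-2-4 digit layout."""
    # fullmatch requires the pattern to cover the entire string.
    return re.fullmatch(SSN_PATTERN, candidate) is not None

# Hypothetical inputs: one well-formed, two malformed.
ssn_checks = {s: looks_like_ssn(s) for s in ["123-45-6789", "123456789", "12-345-6789"]}
print(ssn_checks)
```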


You should not aim to 'memorize RegEx' or anything along those lines. There is so much to memorize that spending time on it would be very inefficient. Instead, you should aim to understand what RegEx is capable of, and how you can use it with the aid of a reference table.

There are a ton of resources to learn and experiment with regular expressions. A few are provided below:

- [Official Regex Guide](https://docs.python.org/3/howto/regex.html)
- [Data 100 Reference Sheet](https://ds100.org/sp22/resources/assets/hw/regex_reference.pdf) 
- [Regex101.com](https://regex101.com/)
    - Be sure to choose the `Python` flavour under the category on the left.

## Basic RegEx Syntax

| Operation      | Order  | Syntax Example | Matches     | Doesn't Match     | 
|----------------|-|----------------|-------------|-------------------|
| `Group`: `()` <br />(parenthesis)       | 1 | A(A\|B)AAB      | AAAAB<br />ABAAB| every other string|
|                       |    |         (AB)*A    |    A <br />ABABABABA      | AA <br />  ABBA     | 
| `Closure`: `*` <br />(zero or more)   | 2   | AB*A         | AA  <br />  ABBBBBBA | AB <br />  ABABA       |
| `Concatenation`          | 3    | AABAAB         | AABAAB      | every other string|
| `Or`: `\|`  | 4 | AA\|BAAB        | AA<br /> BAAB   | every other string|

Notice how these metacharacter operations are ordered. Rather than being literal characters, these **metacharacters** manipulate adjacent characters. `()` takes precedence, followed by `*`, then concatenation, and finally `|`. This allows us to differentiate between very different regex patterns like `AB*` and `(AB)*`. The former reads "`A` then zero or more copies of `B`", while the latter specifies "zero or more copies of `AB`".
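The difference between `AB*` and `(AB)*` can be checked directly. This is a minimal sketch; `fully_matches` is a helper invented here that wraps `re.fullmatch`, so each pattern has to account for the entire string.

```python
import re

def fully_matches(pattern, s):
    """True when the pattern matches the whole string, not just part of it."""
    return re.fullmatch(pattern, s) is not None

# AB*   : one A followed by zero or more Bs
# (AB)* : zero or more copies of the pair AB
precedence_checks = {
    ("AB*", "ABBB"): fully_matches(r"AB*", "ABBB"),
    ("AB*", "ABAB"): fully_matches(r"AB*", "ABAB"),
    ("(AB)*", "ABAB"): fully_matches(r"(AB)*", "ABAB"),
    ("(AB)*", "ABBB"): fully_matches(r"(AB)*", "ABBB"),
    ("(AB)*", ""): fully_matches(r"(AB)*", ""),
}
for (pattern, s), result in precedence_checks.items():
    print(f"{pattern!r} vs {s!r}: {result}")
```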

### Examples

**Question 1**: Give a regular expression that matches `moon`, `moooon`, etc. Your expression should match any even number of `o`s except zero (i.e. don’t match `mn`).

**Answer 1**: `moo(oo)*n`

- Hardcoding `oo` before the capture group ensures that `mn` is not matched.
- A capture group of `(oo)*` ensures the number of `o`'s is even.

**Question 2**: Using only the basic operations, formulate a regex that matches `muun`, `muuuun`, `moon`, `moooon`, etc. Your expression should match any even number of `u`s or `o`s except zero (i.e. don’t match `mn`).

**Answer 2**: `m(uu(uu)*|oo(oo)*)n`

- The leading `m` and trailing `n` ensure that only strings beginning with `m` and ending with `n` are matched.
- Notice how the outer capture group surrounds the `|`.
    - Consider the regex `m(uu(uu)*)|(oo(oo)*)n`. This incorrectly matches `muu` and `oooon`.
        - Each OR clause is everything to the left and right of `|`. The incorrect solution matches only half of the string, and ignores either the beginning `m` or trailing `n`.
        - A set of parentheses must surround `|`. That way, each OR clause is everything to the left and right of `|` **within** the group. This ensures both the beginning `m` *and* trailing `n` are matched.
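To see the effect of the outer group, the sketch below runs the grouped and ungrouped versions of the Question 2 pattern against a handful of strings. The test strings beyond those named above (`muon`, for instance) are chosen here for illustration.

```python
import re

grouped = r"m(uu(uu)*|oo(oo)*)n"      # the | is confined to the group
ungrouped = r"m(uu(uu)*)|(oo(oo)*)n"  # the | splits the whole pattern in two

samples = ["muun", "moooon", "mn", "muon", "muu", "oooon"]

# fullmatch asks whether the entire string is described by the pattern.
grouped_hits = [s for s in samples if re.fullmatch(grouped, s)]
ungrouped_hits = [s for s in samples if re.fullmatch(ungrouped, s)]

print("grouped:  ", grouped_hits)
print("ungrouped:", ungrouped_hits)
```

The ungrouped pattern accepts exactly the two half-strings and rejects the intended ones, because its alternatives are `m(uu(uu)*)` and `(oo(oo)*)n`.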

## Regex Expanded

Provided below are more complex regular expression functions. 

| Operation                                      | Syntax Example  | Matches        |Doesn't Match     |
|------------------------------------------------|-----------------|----------------|------------------|
| `Any Character`: `.` <br />  (except newline)| .U.U.U.         | CUMULUS <br /> JUGULUM| SUCCUBUS  <br />  TUMULTUOUS  |
| `Character Class`: `[]` <br /> (match one character in `[]`)| [A-Za-z][a-z]*  | word <br /> Capitalized| camelCase <br /> 4illegal|
| `Repeated "a" Times`: `{a}`<br />              | j[aeiou]{3}hn   | jaoehn <br /> jooohn| jhn <br /> jaeiouhn|
| `Repeated "from a to b" Times`: `{a,b}`<br /> | j[ou]{1,2}hn    | john <br /> juohn| jhn <br /> jooohn| 
| `At Least One`: `+`                            | jo+hn           | john  <br /> joooooohn     | jhn <br />jjohn|
| `Zero or One`: `?`                             | joh?n           | jon <br /> john  | any other string |

A character class matches a single character in its class. These characters can be hardcoded, as in `[aeiou]`, or a shorthand can be used to specify a range of characters. Examples include:

1. `[A-Z]`: Any capitalized letter
2. `[a-z]`: Any lowercase letter
3. `[0-9]`: Any single digit
4. `[A-Za-z]`: Any capitalized or lowercase letter
5. `[A-Za-z0-9]`: Any capitalized or lowercase letter or single digit
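A short sketch of character classes and counted repetition in action. It uses `re.fullmatch` so that each pattern must describe the entire string; the helper `fully_matches` is a name introduced only for this example.

```python
import re

def fully_matches(pattern, s):
    return re.fullmatch(pattern, s) is not None

# One letter of either case, then any number of lowercase letters.
word_pattern = r"[A-Za-z][a-z]*"
word_results = {w: fully_matches(word_pattern, w)
                for w in ["word", "Capitalized", "camelCase", "4illegal"]}

# j, then one or two characters drawn from {o, u}, then hn.
name_pattern = r"j[ou]{1,2}hn"
name_results = {w: fully_matches(name_pattern, w)
                for w in ["john", "juohn", "jhn", "jooohn"]}

print(word_results)
print(name_results)
```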

### Examples

Let's analyze a few examples of complex regular expressions.

|Syntax| Matches                         | Does Not Match                  |
|-|---------------------------------|---------------------------------|
|`.*SPB.*`| RASPBERRY <br />   SPBOO   | SUBSPACE <br />         SUBSPECIES        |
|`[0-9]{3}-[0-9]{2}-[0-9]{4}`| 231-41-5121 <br />    573-57-1821          | 231415121 <br />  57-3571821            |
|`[a-z]+@([a-z]+\.)+(edu\|com)`| horse@pizza.com <br /> horse@pizza.food.com | frank_99@yahoo.com <br /> hug@cs  |

**Explanations**

1. `.*SPB.*` only matches strings that contain the substring `SPB`.
    - The `.*` pattern matches zero or more of any character except a newline.
2. This regular expression matches 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.
    - You'll recognize this as the familiar Social Security Number regular expression.
3. Matches any email with a `com` or `edu` domain, where every character other than `@` and `.` is a lowercase letter.
    - At least one `.` must precede the domain name. Including a backslash `\` before any metacharacter (in this case, the `.`) tells regex to match that character exactly.
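The email pattern from the table can be exercised the same way. This is only a sketch of that one pattern: as the notes point out later, it is far too restrictive to serve as a real email validator.

```python
import re

email_pattern = r"[a-z]+@([a-z]+\.)+(edu|com)"

candidates = [
    "horse@pizza.com",
    "horse@pizza.food.com",
    "frank_99@yahoo.com",  # underscore and digits are outside [a-z]
    "hug@cs",              # no dot-separated edu/com ending
]

email_results = {c: re.fullmatch(email_pattern, c) is not None for c in candidates}
for address, ok in email_results.items():
    print(f"{address}: {ok}")
```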

## Convenient Regex

| Operation                                      | Syntax Example  | Matches        |Doesn't Match     |
|------------------------------------------------|-----------------|----------------|------------------|
| `built in character class`                     | `\w+` <br />  `\d+`<br />  `\s+`| Fawef_03 <br />231123<br />`whitespace`|this person<br /> 423 people<br /> `non-whitespace`|
| `character class negation`: `[^]`<br />(everything except the given characters)| [^a-z]+.        | PEPPERS3982    <br /> 17211!↑å | porch <br />     CLAmS|
| `escape character`: `\` <br />       (match the literal next character)           | cow\\.com       | cow.com        | cowscom          |
| `beginning of line`: `^`                       | ^ark            | ark two ark o  <br /> ark o ark| dark   | 
| `end of line`: `$`                             | ark$            | dark <br />    ark o ark | ark two          | 
| `lazy version of zero or more` : `*?`          | 5.*?5           | 5005 <br />  55  | 5005005          | 
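A few of these operations are easier to absorb when run. The sketch below uses `re.search` (match anywhere), `re.findall`, and `re.fullmatch`; the variable names are invented for this example and the inputs are taken from the table where possible.

```python
import re

# Anchors: ^ pins the match to the start, $ pins it to the end.
starts_with_ark = [s for s in ["ark two ark o", "dark"] if re.search(r"^ark", s)]
ends_with_ark = [s for s in ["dark", "ark two"] if re.search(r"ark$", s)]

# Built-in classes: \d+ grabs runs of digits, \w+ grabs runs of word characters.
digits = re.findall(r"\d+", "423 people")
words = re.findall(r"\w+", "this person")

# Escaping: an unescaped . matches any character, an escaped \. only a literal dot.
unescaped_hit = re.fullmatch(r"cow.com", "cowscom") is not None
escaped_hit = re.fullmatch(r"cow\.com", "cowscom") is not None

# Negated class: runs of characters that are not lowercase letters.
non_lower = re.findall(r"[^a-z]+", "CLAmS")

print(starts_with_ark, ends_with_ark)
print(digits, words)
print(unescaped_hit, escaped_hit)
print(non_lower)
```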

### Greediness

In order to fully understand the last operation in the table, we have to discuss greediness. RegEx is greedy: it will look for the longest possible match in a string. To motivate this with an example, consider the pattern `<div>.*</div>`. Given the sentence below, we would hope that the bolded portions would be matched:

"This is a **\<div>example\<\/div>** of greediness \<div>in\<\/div> regular expressions."

In actuality, the way RegEx processes the text given that pattern is as follows:

1. "Look for the exact string \<div>"

2. then, "look for any character 0 or more times"

3. then, "look for the exact string \<\/div>"

Because step 2 consumes as many characters as it can, the result would be all the characters from the leftmost \<div> to the rightmost \<\/div> (inclusive). So, we would match "This is a **\<div>example\<\/div> of greediness \<div>in\<\/div>** regular expressions."

We can fix this by making the pattern non-greedy: `<div>.*?</div>`. You can read more in the documentation [here](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy).
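Running both versions of the pattern on the example sentence makes the difference concrete. This is a small sketch using `re.findall`, which returns every non-overlapping match.

```python
import re

sentence = "This is a <div>example</div> of greediness <div>in</div> regular expressions."

# Greedy: .* stretches to the last </div> it can reach.
greedy_matches = re.findall(r"<div>.*</div>", sentence)

# Lazy: .*? stops at the first </div> that completes a match.
lazy_matches = re.findall(r"<div>.*?</div>", sentence)

print(greedy_matches)
print(lazy_matches)
```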

### Examples

Let's revisit our earlier problem of extracting date/time data from the given `.txt` files. Here is how the data looked.

In [2]:
with open('data/log.txt', 'r') as f:
    log_lines = f.readlines()

log_lines

['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n',
 '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"\n',
 '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']

**Question**: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and time zone.

**Answer**: `\[.*\]`

- Notice how matching the literal `[` and `]` is necessary. Therefore, an escape character `\` is required before both `[` and `]`; otherwise, these metacharacters would be interpreted as the delimiters of a character class.
- We need to match a particular format between `[` and `]`. For this example, `.*` will suffice.

**Alternative Solution**: `\[\w+/\w+/\w+:\w+:\w+:\w+\s-\w+\]`

- This solution is much safer. 
    - Imagine the data between `[` and `]` was garbage - `.*` will still match that. 
    - The alternate solution will only match data that follows the correct format.
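The trade-off between the two answers can be sketched as follows. The first string is the first log line shown above; `garbage_line` is a made-up line added here to show what happens when the bracketed text is not a timestamp.

```python
import re

loose = r"\[.*\]"
strict = r"\[\w+/\w+/\w+:\w+:\w+:\w+\s-\w+\]"

log_line = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
            '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n')
garbage_line = 'x - - [not a timestamp] "GET /"'  # hypothetical malformed entry

def first_match(pattern, s):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(pattern, s)
    return m.group() if m else None

loose_on_log = first_match(loose, log_line)
strict_on_log = first_match(strict, log_line)
loose_on_garbage = first_match(loose, garbage_line)
strict_on_garbage = first_match(strict, garbage_line)

print(loose_on_log, strict_on_log)
print(loose_on_garbage, strict_on_garbage)
```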

## Regex in Python and Pandas (RegEx Groups)

### Canonicalization

#### Canonicalization with RegEx

Canonicalization is the process of converting data that has multiple formats into a standard form. In the previous subchapter, we examined the process of canonicalization using pandas `Series` methods. However, our code with this approach was unnecessarily verbose. Equipped with our knowledge of regular expressions, let's fix this.

To do so, we need to understand a few functions in the `re` module. The first of these is the substitute function: `re.sub(pattern, repl, text)`. It behaves similarly to Python's built-in `str.replace` method, and returns `text` with all instances of `pattern` replaced by `repl`. 

The regular expression here removes text surrounded by `<>` (also known as HTML tags).

In order, the pattern matches ... 
1. a single `<`
2. one or more characters that are not a `>` : div, td valign..., /td, /div
3. a single `>`

Any substring in `text` that fulfills all three conditions will be replaced by `''`.

In [3]:
text = "<div><td valign='top'>Moo</td></div>"
pattern = r"<[^>]+>"
re.sub(pattern, '', text) 

'Moo'

Notice the `r` preceding the regular expression pattern; this specifies that the regular expression is a raw string. Raw strings do not process backslash escape sequences (e.g., in a raw string `\n` stays as the two characters `\` and `n` instead of becoming a newline). This makes them useful for regular expressions, which often contain literal `\` characters.

In other words, don't forget to tag your RegEx with an `r`.
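A quick sketch of why the `r` prefix matters. The `\b` case is a classic trap: in an ordinary string it is a backspace character, while in a regular expression it means "word boundary". The sample text is invented for this example.

```python
import re

# "\n" is one newline character; r"\n" is a backslash followed by the letter n.
lengths = (len("\n"), len(r"\n"))

# Without the r prefix, Python turns \b into a backspace before the regex
# engine ever sees it, so the pattern silently matches nothing.
without_raw = re.findall("\bcat\b", "cat concat")

# With the r prefix, \b reaches the regex engine intact as a word boundary.
with_raw = re.findall(r"\bcat\b", "cat concat")

print(lengths)
print(without_raw, with_raw)
```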

#### Canonicalization with `pandas`

We can also use regular expressions with `pandas` `Series` methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple: <br /> `ser.str.replace(pattern, repl, regex=True)`.

Consider the following `DataFrame` `html_data` with a single column.

In [4]:
data = {"HTML": ["<div><td valign='top'>Moo</td></div>",
                 "<a href='http://ds100.org'>Link</a>",
                 "<b>Bold text</b>"]}
html_data = pd.DataFrame(data)
html_data

Unnamed: 0,HTML
0,<div><td valign='top'>Moo</td></div>
1,<a href='http://ds100.org'>Link</a>
2,<b>Bold text</b>


We can use regular expressions as follows:

In [5]:
pattern = r"<[^>]+>"
html_data['HTML'].str.replace(pattern, '', regex=True)

0          Moo
1         Link
2    Bold text
Name: HTML, dtype: object

### Extraction

#### Extraction with RegEx

Just like with canonicalization, the `re` module provides capability to extract relevant text from a string: <br /> `re.findall(pattern, text)`. This function returns a list of all matches to `pattern`. 

Using the familiar regular expression for Social Security Numbers:

In [6]:
text = "My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789."
pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
re.findall(pattern, text)  

['123-45-6789', '321-45-6789']

#### Extraction with `pandas`

`pandas` similarly provides extraction functionality on a `Series` of data: `ser.str.findall(pattern)`

Consider the following `DataFrame` `ssn_data`.

In [8]:
data = {"SSN": ["987-65-4321", "forty",
                "123-45-6789 bro or 321-45-6789",
                "999-99-9999"]}
ssn_data = pd.DataFrame(data)
ssn_data

Unnamed: 0,SSN
0,987-65-4321
1,forty
2,123-45-6789 bro or 321-45-6789
3,999-99-9999


Applying the `findall` function:

In [10]:
ssn_data["SSN"].str.findall(pattern)

0                 [987-65-4321]
1                            []
2    [123-45-6789, 321-45-6789]
3                 [999-99-9999]
Name: SSN, dtype: object

This function returns a list for every row containing the pattern matches in a given string.

As you may expect, there are similar `pandas` equivalents for other `re` functions as well. `Series.str.extract` takes in a pattern and returns a `DataFrame` of each capture group’s first match in the string. In contrast, `Series.str.extractall` returns a multi-indexed `DataFrame` of all matches for each capture group. You can see the difference in the outputs below:

In [11]:
pattern_cg = r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
ssn_data["SSN"].str.extract(pattern_cg)

Unnamed: 0,0,1,2
0,987,65,4321
1,NaN,NaN,NaN
2,123,45,6789
3,999,99,9999


In [12]:
ssn_data["SSN"].str.extractall(pattern_cg)

Unnamed: 0,match,0,1,2
0,0,987,65,4321
2,0,123,45,6789
2,1,321,45,6789
3,0,999,99,9999


### Regular Expression Capture Groups

Earlier we used parentheses `(` `)` to specify the highest order of operation in regular expressions. However, they have another meaning: parentheses also define **capture groups**. A capture group is a sub-pattern whose matched substring is returned separately, which lets a single regular expression pull out several pieces of each match.

Let's take a look at an example.

#### Example 1

In [13]:
text = "Observations: 03:04:53 - Horse awakens. 03:05:14 - Horse goes back to sleep."

Say we want to capture all occurrences of time data (hour, minute, and second) as *separate entities*.

In [14]:
pattern_1 = r"(\d\d):(\d\d):(\d\d)"
re.findall(pattern_1, text)

[('03', '04', '53'), ('03', '05', '14')]

Notice how the given pattern has 3 capture groups, each specified by the regular expression `(\d\d)`. When a pattern contains capture groups, `re.findall` returns a list of tuples: one tuple per match, each holding the 3 captured substrings.

These regular expression capture groups can be different. We can use the `(\d{2})` shorthand to extract the same data.

In [16]:
pattern_2 = r"(\d\d):(\d\d):(\d{2})"
re.findall(pattern_2, text)

[('03', '04', '53'), ('03', '05', '14')]

#### Example 2

With the notion of capture groups, convince yourself how the following regular expression works.

In [23]:
first_line = log_lines[0]
print("first line: \n", first_line)
pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
day, month, year, hour, minute, second, time_zone = re.findall(pattern, first_line)[0]
print("matched text: \n", day, month, year, hour, minute, second, time_zone)

first line: 
 169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"

matched text: 
 26 Jan 2014 10 47 58 -0800
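The same pattern can be applied to every log line rather than only the first. The sketch below uses `re.search` and `Match.groups()` and labels each captured piece; the field names and the shortened copies of the log lines are conveniences introduced for this example.

```python
import re

pattern = r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'

# Abbreviated copies of the three log lines (only the start of each is kept).
sample_lines = [
    '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1"',
    '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0"',
    '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"',
]

fields = ("day", "month", "year", "hour", "minute", "second", "time_zone")
parsed = []
for line in sample_lines:
    match = re.search(pattern, line)
    if match is not None:  # skip any line without a bracketed timestamp
        parsed.append(dict(zip(fields, match.groups())))

for row in parsed:
    print(row)
```

Every captured value is a string, so a value such as the one-digit second in the second line would still need converting before any arithmetic.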


## Limitations of Regular Expressions

Today, we explored the capabilities of regular expressions in data wrangling with text data. However, there are a few things to be wary of.

Writing regular expressions is like writing a program.

- Need to know the syntax well.
- Can be easier to write than to read.
- Can be difficult to debug.

Regular expressions are terrible at certain types of problems:

- For parsing a hierarchical structure, such as JSON, use the `json.load()` parser, not RegEx!
- Complex features (e.g. valid email address).
- Counting (same number of instances of a and b). (impossible)
- Complex properties (palindromes, balanced parentheses). (impossible)
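To illustrate the first point, the sketch below contrasts a JSON parser with a regular expression on a small nested document. The document and its keys are invented for this example, and `json.loads` (the string counterpart of `json.load`) is used because the data lives in a string rather than a file.

```python
import json
import re

# Hypothetical document in which the key "name" appears at two nesting levels.
doc = '{"user": {"name": "Ada", "tags": ["a", "b"]}, "name": "outer"}'

# The parser understands the hierarchy, so we can ask for one specific "name".
parsed_doc = json.loads(doc)
inner_name = parsed_doc["user"]["name"]

# The regex only sees flat text: it finds both names and cannot tell
# which one belongs to the nested "user" object.
regex_names = re.findall(r'"name": "(\w+)"', doc)

print(inner_name)
print(regex_names)
```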

Ultimately, the goal is not to memorize all regular expressions. Rather, the aim is to:

- Understand what RegEx is capable of.
- Parse and create RegEx with a reference table.
- Use vocabulary (metacharacter, escape character, groups, etc.) to describe regex metacharacters.
- Differentiate between (), [], {}
- Design your own character classes with \d, \w, \s, […-…], ^, etc.
- Use `python` and `pandas` RegEx methods.