# Introduction to Regular Expressions
During this workshop we'll cover...
* The purpose,
* The methods,
* The syntax, and
* The strategies for using regex.

We'll use Python's version of regex (the `re` module) for today's exercise.  Regex is available in nearly all major scripting and programming languages.  We'll conclude the workshop with a practical application (PracApp), during which we'll use the `pandas` module.  Since pandas is not the focus of the workshop, I've provided skeleton scripts so that those without pandas experience can still take part.

## The purpose of regex
Regex was developed in the 1950s as a method to characterize sub-strings within a set of text.  It has a formal theoretical definition that draws from computer science, set theory, and language theory.  It was implemented as early as the 1960s, and has progressively gained popularity as a way to conduct text search within a document.  Most compiled and interpreted languages have access to regex functionality through a library (standard or otherwise) or within a base distribution.  Syntax and functionality can vary from one language to the next, but core tasks and strategies are consistent.

Our modelers use regex to extract structure from unstructured text data.  It is the ideal tool to address tasks like finding and extracting URLs, email addresses, phone numbers, or other such information from a set of documents within a corpus.  To do this it searches, character by character, through a text string attempting to match at each step.  It is especially useful in web-scraping applications that have to be tailored to a specific purpose.  Also of note, the `nltk` module within Python includes tokenizing objects that directly reference their own version of regex.

## The methods of the [`re` module](https://docs.python.org/2/library/re.html)
Several methods and constants are available.  We'll discuss a few here.  This section is provided as a reference for you to use during subsequent hands-on sections.

First, the **`re.compile()`** method creates a RegexObject, which is useful when one intends to use the same pattern multiple times within a script.  The other methods that we'll cover today can be accessed directly from the `re` module, or from a RegexObject.  Creating a RegexObject is good practice for all but the simplest scripts.  As the name implies, compiling the pattern once can reduce computation time in complex data processing scripts.
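
As a minimal sketch of the two calling styles described above (the sample text and variable names here are made up for illustration), the same pattern can be used through the module or through a compiled object:

```python
import re

text = "Order 66 shipped on day 12 of month 10."  # hypothetical sample text

# Module-level call: the pattern string is passed in on every call.
direct = re.findall(r"\d+", text)

# Compiled object: build it once, then reuse its methods.
digits = re.compile(r"\d+")
compiled = digits.findall(text)
first = digits.search(text).group(0)

print(direct)    # ['66', '12', '10']
print(compiled)  # ['66', '12', '10']
print(first)     # 66
```

Both styles return the same results; the compiled object mainly saves retyping the pattern and keeps it in one place.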

The **`re.search()`** and **`re.match()`** methods are used to find a matching substring within a document.  `re.match()` will return a match object if the pattern is found at the ***beginning*** of the text, while `re.search()` will go through the ***entire*** text until it finds a matching substring or reaches the end.  `re.search()` will only return the first match encountered (left to right).  

The **`re.findall()`** and **`re.finditer()`** methods are useful for finding all of the non-overlapping matching substrings, as opposed to merely the first.  `re.findall()` returns a Python list with the results, while `re.finditer()` returns an iterator over the match objects.  

In [1]:
import re
myStr = "A simple test string with numbers in it like these: 123, 456."
myRE = re.compile(r'[0-9]{3}') # This pattern is looking for three numeric characters in succession.

mtch = myRE.match(myStr)
srch = myRE.search(myStr)

fndall = myRE.findall(myStr)

print('The match returns: ' + str(mtch) + '\n')

print('The search returns: ' + str(srch) + '\n')

print('The findall returns: ')
print(fndall)

The match returns: None

The search returns: <_sre.SRE_Match object at 0x02DC4BB8>

The findall returns: 
['123', '456']


Why didn't the `match()` return a result?   By the way, when `match()` or `search()` fail they return `None` (an object of type `NoneType`).  

What's up with the `search()` return?   How would we get access to the actual result?

What do you think `finditer()` would do?

The match objects returned by `match()`, `search()`, and `finditer()` have methods that allow access to, among other things, the matched string, as in:


In [2]:
# Please note that you must execute the code cell above this one in order for this one to work.

print(srch.group(0))

123


Other methods, such as `.start()`, `.end()`, and `.span()`, give access to the starting and ending positions of the match (useful for data cleaning), along with other valuable attributes. 
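
As a rough illustration of those position attributes, the sketch below loops over `finditer()` on the same test string used earlier and records where each match starts and ends.  The `positions` list is just a helper name for this example.

```python
import re

myStr = "A simple test string with numbers in it like these: 123, 456."
myRE = re.compile(r"\d{3}")

positions = []
for m in myRE.finditer(myStr):
    # start() and end() are offsets into myStr; span() returns both as a tuple.
    positions.append((m.group(0), m.start(), m.end()))
    print("%s found at %s" % (m.group(0), m.span()))

# Slicing the original string with those offsets recovers the matched text.
for text, start, end in positions:
    assert myStr[start:end] == text
```

The offsets are what make it possible to cut a match out of a string or replace it in place during data cleaning.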

**Flags** are options of the methods listed above that alter the behavior of the match.  One example is the **`re.IGNORECASE`** flag that, as the name implies, matches alphabetic characters without considering the capitalization.  Another is the **`re.DOTALL`** flag, which changes the behavior of the (**.**) special character to include newlines.
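
A small sketch of both flags in action, using a made-up two-line string:

```python
import re

text = "First line\nsecond LINE"  # hypothetical two-line sample

# Without IGNORECASE only the lowercase 'line' is found.
case_sensitive = re.findall(r"line", text)
case_insensitive = re.findall(r"line", text, re.IGNORECASE)

# By default '.' stops at the newline; with DOTALL it keeps going past it.
default_dot = re.search(r"First.*", text).group(0)
dotall_dot = re.search(r"First.*", text, re.DOTALL).group(0)

print(case_sensitive)     # ['line']
print(case_insensitive)   # ['line', 'LINE']
print(repr(default_dot))  # 'First line'
print(repr(dotall_dot))   # 'First line\nsecond LINE'
```

Flags can also be passed to `re.compile()`, in which case they apply to every use of the compiled object.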

## The syntax

A regex pattern, or merely a regex, is composed of literals, metacharacters, and special characters.  **Literals**, as the name implies, are meant to match the exact characters as they appear.  **Metacharacters** are used to construct a match for broad categories of things like whitespace, words, or structured groups.  

Many cheatsheets are available online, like [this one](https://www.debuggex.com/cheatsheet/regex/python).  The simplest **special characters** give quick access to large classes of text.  For example, a single period (**`.`**) will match any single character except a newline (unless the DOTALL flag is specified).  

The primary metacharacter is the backslash (\\).  The \\ will either escape a special character allowing it to be matched as a literal, or denote a character class or assertion (thus converting a literal into a special). A couple of the interesting special characters are presented below.  See the myriad cheatsheets for a more complete listing (like the one referenced above).

<table>
<tr>
<th>
Special Character
</th>
<th>
Matches
</th>
</tr>
<tr>
<td>
.
</td>
<td>
Any character except a new line
</td>
</tr>
<tr>
<td>
*
</td>
<td>
Zero or more of the previous token... use with caution
</td>
</tr>
<tr>
<td>
+
</td>
<td>
One or more of the previous token
</td>
</tr>
<tr>
<td>
{2}
</td>
<td>
Exactly 2 of the previous token
</td>
</tr>
<tr>
<td>
\d
</td>
<td>
One digit character
</td>
</tr>
<tr>
<td>
\s
</td>
<td>
One whitespace character
</td>
</tr>
<tr>
<td>
\w
</td>
<td>
One word character; that is, one alphanumeric character or underscore
</td>
</tr>
<tr>
<td>
\b
</td>
<td>
A word boundary
</td>
</tr>
</table>
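
To make the table concrete, here is a short sketch that applies several of these tokens to a made-up sentence.  The sample string and variable names are assumptions for illustration only.

```python
import re

sample = "Room 42 opens at 9 am, ext. 5551."  # hypothetical sample text

digit_runs = re.findall(r"\d+", sample)       # one or more digits
digit_pairs = re.findall(r"\d{2}", sample)    # exactly two digits at a time
lone_digits = re.findall(r"\b\d\b", sample)   # a single digit standing alone
words = re.findall(r"\w+", sample)            # runs of word characters
literal_dot = re.findall(r"ext\.", sample)    # escaped '.', matched literally

print(digit_runs)   # ['42', '9', '5551']
print(digit_pairs)  # ['42', '55', '51']
print(lone_digits)  # ['9']
print(words)        # ['Room', '42', 'opens', 'at', '9', 'am', 'ext', '5551']
print(literal_dot)  # ['ext.']
```

Note how `\d{2}` splits `5551` into two separate matches, while the word boundaries in `\b\d\b` exclude digits that are part of a longer number.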

Using the syntax we've covered so far, can you construct a regex that will find the email address in the given string?


In [3]:
myStr = "The email address in question is constructed as the letter 'm' followed by 6 digits, followed by '@' followed by some number of alpha-numerics, followed by '.edu'.  One example is m003826@gbsmg.edu which is the string we want to find."

yourRE = r"" # Put your regex in the quotes here.

yourObj = re.compile(yourRE) # You could skip this step and go straight to yourSrch = re.search(yourRE, myStr) if you really wanted

yourSrch = yourObj.search(myStr)

print(yourSrch.group(0))

m003826@gbsmg.edu


As you've probably realized, there are many ways to write this expression to find what we're looking for.  Using only what we've talked about so far produces a rather brute force solution.  It does, however, have the advantage of working.  If you've read ahead and built something a little more subtle, it has the advantage of being more readable.  
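
For comparison, here are two possible answers to the email exercise, offered as a sketch rather than the only correct solutions.  The sample string is a shortened version of the one in the exercise.

```python
import re

# Shortened version of the exercise string.
myStr = "One example is m003826@gbsmg.edu which is the string we want to find."

# Brute force: spell out each of the six digits individually.
brute_force = r"m\d\d\d\d\d\d@\w+\.edu"

# More compact: a quantifier for the digits, plus word boundaries at each end.
compact = r"\bm\d{6}@\w+\.edu\b"

brute_result = re.search(brute_force, myStr).group(0)
compact_result = re.search(compact, myStr).group(0)

print(brute_result)    # m003826@gbsmg.edu
print(compact_result)  # m003826@gbsmg.edu
```

Both patterns escape the period before `edu`; an unescaped `.` would also match any other character in that position.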

We have two more syntax concepts to cover.  The first is that of **groups**.  Groups collect tokens together, and allow ordering.  As you may have guessed, the `.group()` method above allows access to individual groups within a regex (the 0 index indicates the whole matched string).  Groups are denoted by parentheses, as in:

In [4]:
myStr = "If I'm parsing natural language, and want to find common accidently misspelled words."
### If we're looking for accidentally, but want to include known misspellings...
myRE = "ac{1,2}ident(al)?ly" # Should match accidentally, accidently, acidentally

print(re.search(myRE, myStr).group(0))

accidently


Notice the curly braces `{}` and question mark `?` above; they're different kinds of **quantifiers**.  Quantifiers are often used in conjunction with groups as I've done here.  The `?` matches 0 or 1 of the preceding token.  By default quantifiers are **greedy**, meaning that they'll match as many characters as possible.  That is why `.*` can be problematic without additional conditions applied.  You can specify a **lazy** match, which takes as few characters as possible, by adding a question mark `?` after the quantifier (as in, `.*?`).  
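
The difference between greedy and lazy matching is easiest to see side by side.  This sketch uses a made-up snippet of markup:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# Greedy: '.*' runs to the last '>' in the string.
greedy = re.findall(r"<.*>", html)

# Lazy: '.*?' stops at the first '>' it can.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```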

It's also worth noting that the question mark is an unusual special character in that its interpretation depends on the context.  It can be a quantifier or a lazy specification, but it can also have other meanings (see look arounds below).  

Now let's apply the same technique to searching tweets for laughing.  The `myStr` variable below is a list of notional tweets, in which we'll try to find and match on different ways that a person might record that they're laughing.  

**Hint** A pipe `|` is recognized as a parallel, 'or' match.  This is known as **alternation** because you're specifying alternatives.  For this exercise you'll want to be careful how you construct the group.
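
To show why the grouping matters, here is a small sketch on an unrelated example (the word list is made up).  Without parentheses, the alternation splits the entire pattern in two.

```python
import re

words = ["gray", "grey", "they"]  # hypothetical test words

grouped = re.compile(r"gr(a|e)y")   # 'gr', then 'a' or 'e', then 'y'
ungrouped = re.compile(r"gra|ey")   # either 'gra' or 'ey'

def first_match(regex, text):
    """Return the matched text, or None when there is no match."""
    m = regex.search(text)
    return m.group(0) if m else None

grouped_results = [first_match(grouped, w) for w in words]
ungrouped_results = [first_match(ungrouped, w) for w in words]

print(grouped_results)    # ['gray', 'grey', None]
print(ungrouped_results)  # ['gra', 'ey', 'ey']
```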

In [5]:
myStr = ['LOL, that was sooooo funny!', 'That video made me laugh... hahaha #funny', 'Ha! I loved it!']

yourRE = r"" # Type a regex in the quotes that will capture the laughter above.
yourObj = re.compile(yourRE, re.IGNORECASE)

for s in myStr:
    print(yourObj.search(s).group(0) + '\n')

LOL

ha

Ha



You'll notice that we're using a raw string now: `r""`.  That's because Python interprets `\b` in a non-raw string as a backspace character rather than passing it through to the regex engine as a word boundary, which doesn't work for us.  You can start to see how we would go about using an iterable object like a list or a pandas dataframe/series to pull interesting things out of text fields.  
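
A quick sketch of the raw string issue, using a made-up sentence:

```python
import re

text = "a cat sat"  # hypothetical sample text

non_raw = "\bcat\b"   # Python turns each '\b' into a backspace character
raw = r"\bcat\b"      # the backslashes reach the regex engine intact

print(len(non_raw))   # 5
print(len(raw))       # 7

non_raw_match = re.search(non_raw, text)
raw_match = re.search(raw, text)

print(non_raw_match)        # None
print(raw_match.group(0))   # cat
```

The non-raw pattern is searching for a literal backspace on either side of `cat`, which the text does not contain.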

The last major concept we'll cover is looking around.  **Look arounds** are a powerful way to find sub-strings based on context without including the context in the match.  There are four kinds of look arounds, summarized in the table below.  Replace the ellipsis with the appropriate regex. 

<table>
<tr>
<th>
Look around group
</th>
<th>
Asserts...
</th>
</tr>
<tr>
<td>
(?=...)
</td>
<td>
Positive look ahead
</td>
</tr>
<tr>
<td>
(?!...)
</td>
<td>
Negative look ahead
</td>
</tr>
<tr>
<td>
(?&lt;=...)
</td>
<td>
Positive look behind
</td>
</tr>
<tr>
<td>
(?&lt;!...)
</td>
<td>
Negative look behind
</td>
</tr>
</table>

A look around adds another condition for a match by checking the text immediately before or after the currently considered substring.  In the example below we're looking for a username contained within an email address.

In [6]:
myStr = "Suppose my email address is contained within a text string, rlantz@novetta.com, then we're trying to find my username."

myRE = r"(\w*(?=@))"

myObj = re.compile(myRE, re.I)

print(myObj.search(myStr).group(0))

rlantz


Another notable point is that Python's regex engine needs to know how many characters to check in a look behind.  For that reason you can't use variable-length quantifiers (such as `*`, `+`, or `?`) in a look behind, and alternation inside a look behind requires that the alternatives be of the same length.  
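
The fixed-width restriction can be sketched as follows.  The sample strings are made up, and the exact wording of the error message may differ between Python versions.

```python
import re

text = "Total: $250.00, deposit: $40.00"  # hypothetical sample text

# Fixed-width look behind: an amount preceded by a literal '$'.
amounts = re.findall(r"(?<=\$)\d+\.\d{2}", text)

# Alternatives of equal length are accepted inside a look behind.
same_len = re.findall(r"(?<=USD|EUR)\d+", "USD100 EUR200 GBP300")

# A variable-length quantifier inside a look behind is rejected at compile time.
try:
    re.compile(r"(?<=\$\d+)\.\d{2}")
    variable_ok = True
except re.error as err:
    variable_ok = False
    print("Rejected: %s" % err)

print(amounts)   # ['250.00', '40.00']
print(same_len)  # ['100', '200']
```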

Now your turn.  This regex will incorporate many of the concepts that we've covered, such as escape characters, quantifiers, groups, and look arounds.  Suppose that you know a 10-digit account number is immediately followed by the individual's account balance.  Construct a regex that will match the account number.

In [7]:
import re

myStr = "My phone number is 1023456789, and my account information is 9876543201 $250.00 or something like that."

# For the purpose of this exercise, you don't need to worry about commas within an account balance.

yourRE = r""

yourObj = re.compile(yourRE, re.I)

print(yourObj.search(myStr).group(0))

9876543201


## The strategies

The following are questions that I've found to be helpful when approaching a text parsing or search problem.  

* What makes the intended matching string unique?  
    * Deconstruct the elements that make it unique, and list the syntax associated with each.
    * Order and group the listed regexes as appropriate.
* Are you attempting to extract the intended match, or merely check for its presence?
* Is it possible that more than one intended match will occur, and do you want to return all of them?
* Are there elements consistently occurring before or after the intended match?
* Finally (admittedly not a question) be prepared for trial and error; a testing regimen is a must.

Recently many regex [testers](https://www.debuggex.com/?flavor=python) have arisen making it relatively easy to check your regex against a test string.  While helpful for quick checks, it's possible that an online tester will return inconsistent results (e.g., the raw string problem).  For important applications, I recommend writing your own regex test script in your language of choice.
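
One possible shape for such a test script is sketched below.  The pattern, the test cases, and the `run_cases` helper are all hypothetical; the idea is simply to pair each input with the match you expect and report any disagreement.

```python
import re

# Hypothetical pattern under test: 'm', six digits, '@', a domain ending in '.edu'.
pattern = re.compile(r"\bm\d{6}@\w+\.edu\b")

# Each case pairs an input string with the expected match (None means no match).
cases = [
    ("contact m003826@gbsmg.edu today", "m003826@gbsmg.edu"),
    ("too few digits: m00382@gbsmg.edu", None),
    ("wrong domain: m003826@gbsmg.com", None),
    ("embedded: xm003826@gbsmg.edu", None),
]

def run_cases(regex, cases):
    """Return a list of (text, expected, actual) for every failing case."""
    failures = []
    for text, expected in cases:
        m = regex.search(text)
        actual = m.group(0) if m else None
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

failures = run_cases(pattern, cases)
print("%d of %d cases failed" % (len(failures), len(cases)))
```

Including strings that should not match is as important as including those that should; over-matching is the more common failure.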

Finally, if you're looking for a regex tutorial that is not specific to Python, I recommend [this one](http://www.regular-expressions.info/tutorial.html). 

# Practical Application

Using the dataset and sample scripts provided, construct regexes that perform the following tasks:

* Match the email addresses in the **`fi_fi_info`** field.
* Match the phone numbers in the **`fi_fi_info`** field.
* Match and remove the account numbers that have leaked into the **`receiver_fi_info`** field.

After confirming that the respective counts are correct, open the .csv file to spot check some of your results.  

The first code block is a preliminary step that will read in your data.  When prompted you should point to the "fm_workshop1.csv" dataset.  

In [33]:
# Run this first

import re
import pandas as pd

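try:  # Python 3 module names
    import tkinter as Tkinter
    from tkinter import filedialog as tkFileDialog
except ImportError:  # Python 2 module names
    import Tkinter, tkFileDialog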

def getFP():
    root = Tkinter.Tk()
    root.withdraw()
    fp = tkFileDialog.askopenfilename(parent = root, title = "Choose a File")
    return fp

myFP = getFP()
myData = pd.read_csv(myFP)

resFP = myFP[:myFP.rfind('/')]

myData['Results1'] = 'None'
myData['Results2'] = 'None'
myData['Results3'] = 'None'

print("Data read is complete")

print(resFP)
myData.head()

Data read is complete
C:/Users/rlantz/Documents/regex-training


Unnamed: 0,IDX,tid,receiver_fi_info,fi_fi_info,Results1,Results2,Results3
0,1,630000000000.0,,,,,
1,2,314000000000.0,,,,,
2,3,316000000000.0,,,,,
3,4,206000000000.0,,,,,
4,5,521000000000.0,,,,,


In [34]:
# Complete this code and run it for the first exercise

### Exercise 1. Pull the email addresses and place the results in 'Results1' ###

yourRE = r"" # Construct a regex that will match email addresses
yourObj = re.compile(yourRE)
cntEmails = 0

for i, r in enumerate(myData['fi_fi_info']):
    if type(r) == str:
        if yourObj.search(r):
            res = yourObj.search(r).group(0) 
            myData.Results1.iat[i] = res
            cntEmails += 1 

print(cntEmails)

1190


In [35]:
# Complete this code and run it for the second exercise

### Exercise 2. Pull the phone numbers and place the results in 'Results2' ###

yourRE = r"" # Construct a regex that will match phone numbers
yourObj = re.compile(yourRE)
cntPhs = 0

for i, r in enumerate(myData['fi_fi_info']):
    if type(r) == str:
        if yourObj.search(r):
            res = yourObj.search(r).group(0) 
            myData.Results2.iat[i] = res   
            cntPhs += 1 

print(cntPhs)

1190


In [36]:
# Complete this code and run it for the third exercise

### Exercise 3. Pull and clean the account numbers and place them in 'Results3' ###

yourRE = r"" # Construct a regex that will match account numbers
yourObj = re.compile(yourRE)
cntAccts = 0

for i, r in enumerate(myData['receiver_fi_info']):
    if type(r) == str:
        if yourObj.search(r):
            res =  yourObj.search(r).group(0) 
            myData.Results3.iat[i] = res
            myData.receiver_fi_info.iat[i] = r.replace(res, "")
            cntAccts += 1 

print(cntAccts)

myData.to_csv(resFP + "/Workshop1_Results.csv")
print(myData[myData.fi_fi_info.notnull()].head())
print(myData[myData.Results3 != 'None'])

4
     IDX           tid receiver_fi_info  \
466  467  3.030000e+11              NaN   
542  543  1.120000e+12              NaN   
600  601  2.190000e+11              NaN   
845  846  2.270000e+11              NaN   
995  996  1.210000e+11              NaN   

                                            fi_fi_info  \
466  ATTN: Christine Jenkins, Phone: 370-(733)108-7...   
542  ATTN: Steven Kennedy, Phone: 7-(172)115-7146 E...   
600  ATTN: Matthew Young, Phone: 36-(833)603-1795, ...   
845  ATTN: Evelyn Stephens, Phone: 20-(709)181-4335...   
995  ATTN: Jane Warren, Phone: 62-(435)302-8328, Em...   

                    Results1           Results2 Results3  
466   cjenkinsbj@samsung.com  370-(733)108-7400     None  
542  skennedyj2@yolasite.com    7-(172)115-7146     None  
600          myoungdc@wp.com   36-(833)603-1795     None  
845  estephenslj@mozilla.org   20-(709)181-4335     None  
995   jwarreno1@amazon.co.jp   62-(435)302-8328     None  
           IDX           tid  \
367 