# Regular Expressions


## Overview

A regular expression is a sequence of characters that defines a search pattern. Using regular expressions in Python lets us find the text we are after quickly and with very little code.

Regular expressions live in a standard Python library module, 're'. The two main functions are search() and findall(). search() looks for the first match and returns a 'match' object (or None if nothing matches), while findall() returns every piece of matching text in a list. One thing to note is that ^ and $ anchor to the start and end of the whole string by default, so if the object you are scanning contains multiple lines, you need to apply your search to each line.
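As a rough illustration of the difference between the two functions, the sketch below runs both on a single made-up line. The sample text and the variable names are assumptions for illustration only and are not part of the data set used later.

```python
import re

# Hypothetical sample line, not taken from the data set used below.
line = 'Contact alice@example.com or bob@example.org for details.'

# search() returns a match object for the first match, or None if there is none.
match = re.search(r'\S+@\S+', line)
first = match.group() if match else None
print(first)

# findall() returns a list of every non-overlapping match.
addresses = re.findall(r'\S+@\S+', line)
print(addresses)

# A pattern that never matches gives None from search() and [] from findall().
no_match = re.search(r'[0-9]+', line)
no_matches = re.findall(r'[0-9]+', line)
```

Because search() can return None, it is usually safest to check the result before calling group() on it, as done above.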

**Python Regular Expression Quick Guide**


<pre>^        Matches the beginning of a line  
$        Matches the end of the line  
.        Matches any character except a newline  
\s       Matches any whitespace character  
\S       Matches any non-whitespace character  
*        Repeats a character zero or more times (greedy)  
*?       Repeats a character zero or more times (lazy)  
+        Repeats a character one or more times (greedy)  
+?       Repeats a character one or more times (lazy)  
[aeiou]  Matches a single character in the listed set  
[^XYZ]   Matches a single character not in the listed set  
[a-z0-9] The set of characters can include a range  
(        Indicates where string extraction is to start  
)        Indicates where string extraction is to end  </pre>
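A few of the symbols from the quick guide can be tried out on short strings. This is only a minimal sketch: the sample line reuses one of the "From:" lines that appears in the output further down, and the variable names are made up for illustration.

```python
import re

text = 'From: louis@media.berkeley.edu'

# ^ anchors the pattern to the start of the string.
starts_with_from = re.search(r'^From:', text) is not None

# $ anchors the pattern to the end, so \S+$ is the last run of non-whitespace.
last_token = re.findall(r'\S+$', text)

# [aeiou] matches one listed character; [^aeiou] matches one character not listed.
vowels = re.findall(r'[aeiou]', 'louis')
non_vowels = re.findall(r'[^aeiou]', 'louis')

# A set can contain ranges; + repeats the set one or more times.
labels = re.findall(r'[a-z0-9]+', 'uct.ac.za')

print(starts_with_from)
print(last_token)
print(vowels)
print(non_vowels)
print(labels)
```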


## Example


To load the Python regular expression library, we simply import "re". Our data set contains a large number of email addresses.

In [10]:
import os
import re
os.chdir('C:\\Users\\zlatan.kremonic\\documents\\analytics\\Cheat_Sheets\\data')

First, let's look for email addresses by searching for any line that begins with "From:". We do this using the ^ symbol, which anchors the pattern to the start of the line. The loop prints every line that starts with the given search criteria.

In [74]:
with open('RegEx.txt') as regex:
    for line in regex:
        line = line.rstrip()
        # ^ anchors the pattern to the start of the line
        if re.search(r'^From:', line):
            print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
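If the whole file were read into a single string instead of being scanned line by line, ^ would only match at the very start of that string unless the re.MULTILINE flag is passed. The sketch below assumes a small in-memory stand-in for the file: the two "From:" lines are the ones printed above, and the "Subject:" line is hypothetical filler.

```python
import re

# Stand-in for the file contents; the Subject: line is hypothetical filler.
text = (
    'From: stephen.marquard@uct.ac.za\n'
    'Subject: hypothetical filler line\n'
    'From: louis@media.berkeley.edu\n'
)

# Without re.MULTILINE, ^ matches only at the very start of the string.
default_hits = re.findall(r'^From: \S+', text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r'^From: \S+', text, flags=re.MULTILINE)

print(default_hits)
print(multiline_hits)
```

Looping over lines, as in the cell above, avoids the flag entirely and keeps memory use low for large files.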


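findall() returns a list of every piece of text that matches the pattern. "\S" matches any non-whitespace character, and "+" repeats it one or more times, so "\S+@\S+" matches a run of non-whitespace characters on both sides of an "@". Note that this also picks up surrounding punctuation such as "<", ">" and ";".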

In [73]:

with open('RegEx.txt') as regex:
    for line in regex:
        line = line.rstrip()
        # one or more non-whitespace characters on each side of the @
        x = re.findall(r'\S+@\S+', line)
        if len(x) > 0:
            print(x)

['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042308.m04N8v6O008125@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']


Let's say we only want letters, numbers, and periods in our email addresses. We can specify this by listing the allowed characters and ranges in brackets. We again use the "+" sign to tell Python to match the bracketed set "one or more times."

In [72]:
with open('RegEx.txt') as regex:
    for line in regex:
        line = line.rstrip()
        # only letters, digits, and periods are allowed around the @
        x = re.findall(r'[a-zA-Z0-9.]+@[a-zA-Z0-9.]+', line)
        if len(x) > 0:
            print(x)

['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']


Greedy versus lazy:

A 'greedy' quantifier keeps matching for as long as it can, and backtracks only as far as needed for the rest of the pattern to match. A 'lazy' quantifier stops as soon as the rest of the pattern can match.

In [46]:
html = 'This is a <EM>first</EM> test.'

# Greedy: .+ runs on to the last '>' in the string
print('Greedy:', re.findall(r'<.+>', html))

# Lazy: .+? stops at the first '>' it reaches
print('Lazy:', re.findall(r'<.+?>', html))

Greedy: ['<EM>first</EM>']
Lazy: ['<EM>', '</EM>']
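For this particular string, a negated set from the quick guide gives the same result as the lazy quantifier. The snippet below is a small sketch of that alternative, not a replacement for the approach above; the variable name `tags` is made up for illustration.

```python
import re

html = 'This is a <EM>first</EM> test.'

# [^>]+ matches one or more characters that are not '>', so the match
# cannot run past the first closing bracket.
tags = re.findall(r'<[^>]+>', html)
print(tags)
```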


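Use parentheses for string extraction. When the pattern contains a pair of parentheses, findall() returns only the text matched inside them.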

In [49]:
html = 'This is a <EM>first</EM> test.'
print('Extracted:', re.findall(r'<(.+?)>', html))

Extracted: ['EM', '/EM']
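The same idea applies to the email data. As a minimal sketch, assuming a single line copied from the earlier output, parentheses can pull out just the address after "From:", or just the domain after the "@". The variable names are made up for illustration.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

# The whole pattern must match, but only the text inside the parentheses is returned.
address = re.findall(r'^From: (\S+@\S+)', line)
domain = re.findall(r'@([a-zA-Z0-9.]+)', line)

# With two pairs of parentheses, findall() returns a list of tuples,
# one tuple item per pair.
parts = re.findall(r'(\S+)@(\S+)', line)

print(address)
print(domain)
print(parts)
```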


We can use curly brackets to specify how many times the preceding element must repeat, with the first number being the minimum and the second number being the maximum. If we place only one number in there, then the repeat count must match *exactly*.

In [58]:
text = '39, 34329, 2390, 3939, 1000, 393434'
print(re.findall(r'[1-9][1-9]{1,3}', text))

['39', '3432', '239', '3939', '3934', '34']
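The exact-count form can be sketched on the same text. Note that "\b" (a word boundary) and the open-ended "{5,}" form are not in the quick guide above; they are brought in here only to show how to keep a match from starting or ending in the middle of a longer number.

```python
import re

text = '39, 34329, 2390, 3939, 1000, 393434'

# {4} matches exactly four digits, even inside a longer number.
exact_four = re.findall(r'[0-9]{4}', text)

# Adding \b on both sides keeps only numbers that are four digits long in total.
four_digit_numbers = re.findall(r'\b[0-9]{4}\b', text)

# {5,} leaves the maximum open: five or more digits.
five_plus = re.findall(r'\b[0-9]{5,}\b', text)

print(exact_four)
print(four_digit_numbers)
print(five_plus)
```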


## Additional Resources
- http://www.regular-expressions.info/refquick.html
- https://docs.python.org/2/library/re.html
- http://en.wikipedia.org/wiki/Regular_expression
