# Regular Expressions

In computing, a regular expression, also referred to as “regex” or “regexp”, provides a concise and 
flexible means for matching strings of text, such as particular characters, words, or patterns of 
characters. A regular expression is written in a formal language that can be interpreted by a regular 
expression processor. See more at http://en.wikipedia.org/wiki/Regular_expression.

## Understanding Regular Expressions

-  Very powerful and quite cryptic
-  Fun once you understand them
-  Regular expressions are a language unto themselves
-  A language of “marker characters” - programming with characters
-  It is kind of an “old school” language - compact

## Regular Expression Quick Guide

<img src="http://drive.google.com/uc?export=view&id=1yyEhtBi38hWajh0bkWOaZkVUeDhYeC09" width="800">

See https://docs.python.org/3.7/library/re.html for more details.


## The Regular Expression Module

-  Before you can use regular expressions in your program, you 
   must import the library using “import re”

-  You can use re.search() to see if a string matches a 
   regular expression, similar to using the find() method for strings
    
-  You can use re.findall() to extract portions of a string that 
   match your regular expression, similar to a combination of 
   find() and slicing:  var[5:10] 
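
The two comparisons above can be sketched in a few lines. This is only an illustration: the sample line and the variable names are made up for the example and do not come from `mbox-short.txt`.

```python
import re

# Hypothetical sample line, chosen only for illustration.
line = 'From: stephen.marquard@uct.ac.za'

# find() returns an index (-1 when absent); re.search() returns a match object or None.
found_with_find = line.find('From:') >= 0
found_with_search = re.search('From:', line) is not None
print(found_with_find, found_with_search)

# find() plus slicing versus re.findall() for pulling out the text after the '@'.
at_pos = line.find('@')
host_by_slice = line[at_pos + 1:]
host_by_findall = re.findall(r'@(\S+)', line)
print(host_by_slice, host_by_findall)
```

Note that re.findall() returns a list, while slicing returns a single string.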

## Using re.search() Like find()

In [None]:
## https://docs.python.org/3.7/library/stdtypes.html#string-methods
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)

In [None]:
## https://docs.python.org/3/library/re.html#re.search

import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line) :
        print(line)

### Function re.search
    
**re.search(pattern, string, flags=0)**

Scan through string looking for the **first** location where the 
regular expression pattern produces a match, and return a 
corresponding match object. Return None if no position in 
the string matches the pattern; note that this is different 
from finding a zero-length match at some point in the string.

See https://docs.python.org/3/library/re.html#re.search
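
The difference between "no match" and "a zero-length match" is easy to miss. A small sketch, using a made-up three-character string:

```python
import re

text = 'abc'

# 'x*' allows zero repetitions, so it matches the empty string at position 0.
empty_match = re.search('x*', text)
print(repr(empty_match.group(0)), empty_match.span())

# 'x+' needs at least one 'x', so no position matches and the result is None.
no_match = re.search('x+', text)
print(no_match)
```

The first call returns a match object whose matched text is empty, while the second returns None.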

In [32]:
import re
x = 'My 2 favorite numbers are 2 and 13'
y = re.search('2', x)
y.group(0)

'2'

## Using re.search() Like startswith()

We fine-tune what is matched by adding special characters to the search string.

In [None]:
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:') :
        print(line)

In [None]:
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line) :
        print(line)
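
To see the effect of the ^ character without needing the file, here is a small sketch on two made-up lines (the list `lines` is an assumption for illustration only):

```python
import re

# Hypothetical lines: 'From:' is at the start of the first and in the middle of the second.
lines = ['From: alice@example.com', 'Reply sent. From: bob@example.com']

# Without '^' the pattern may match anywhere in the line.
anywhere = [ln for ln in lines if re.search('From:', ln)]

# With '^' the pattern must match at the beginning of the line, like startswith().
at_start = [ln for ln in lines if re.search('^From:', ln)]

print(len(anywhere), at_start)
```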

## Wild-Card Characters

-  The dot character (.) matches any single character except a newline
-  If you add the asterisk character (*), the preceding character is 
   matched “any number of times” (0 or more)

In [None]:
## see https://docs.python.org/3.7/library/re.html#match-objects 
## for the function match.group

import re

text1 = "X-Sieve: CMU Sieve 2.3"
match1 = re.search('^X.*:', text1)
match1.group(0)   

In [None]:
import re

text2 = "X-DSPAM-Result: Innocent"
match2 = re.search('^X.*:', text2)
match2.group(0)

In [None]:
import re

text3 = "X-DSPAM-Confidence: 0.8475"
match3 = re.search('^X.*:', text3)
match3.group(0)

In [None]:
import re

text4 = "X-Content-Type-Message-Body: text/plain"
match4 = re.search('^X.*:', text4)
match4.group(0)

In [None]:
import re

text5 = "X-Plane is behind schedule: two weeks"
match5 = re.search('^X.*:', text5)
match5.group(0)

In [None]:
import re

text6 = "X-: Very Short"
match6 = re.search('^X.*:', text6)
match6.group(0)

## Fine-Tuning Your Match

Depending on how “clean” your data is and the purpose of your 
application, you may want to narrow your match down a bit

In [None]:
import re

text1 = "X-Sieve: CMU Sieve 2.3"
match1 = re.search(r'^X-\S+:', text1)
match1.group(0) 

In [None]:
import re

text2 = "X-DSPAM-Result: Innocent"
match2 = re.search(r'^X-\S+:', text2)
match2.group(0)

In [None]:
import re

text3 = "X-DSPAM-Confidence: 0.8475"
match3 = re.search(r'^X-\S+:', text3)
match3.group(0)

In [None]:
import re

text4 = "X-Content-Type-Message-Body: text/plain"
match4 = re.search(r'^X-\S+:', text4)
match4.group(0)

In [None]:
import re

text5 = "X-Plane is behind schedule: two weeks"
match5 = re.search(r'^X-\S+:', text5)
print(match5)   ## None: no match, so match5.group(0) would raise AttributeError

In [None]:
import re

text6 = "X-: Very Short"
match6 = re.search(r'^X-\S+:', text6)
print(match6)   ## None: \S+ needs at least one character between 'X-' and ':'
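
The loose and the narrowed pattern can be compared side by side on the six sample headers. This is a minimal sketch: `matched_text` is a hypothetical helper written for this comparison, and it returns None rather than calling `.group(0)` on a failed search.

```python
import re

headers = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-Content-Type-Message-Body: text/plain',
    'X-Plane is behind schedule: two weeks',
    'X-: Very Short',
]

def matched_text(pattern, text):
    """Return the matched substring, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

loose = [matched_text(r'^X.*:', h) for h in headers]
strict = [matched_text(r'^X-\S+:', h) for h in headers]

for h, a, b in zip(headers, loose, strict):
    print(f'{h!r}: loose={a!r} strict={b!r}')
```

The first four headers give the same result for both patterns. The last two match only the loose pattern, because the narrowed one does not allow whitespace (or nothing at all) between `X-` and the colon.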

## Matching and Extracting Data

- re.search() returns a match object if the string matches the 
  regular expression and None otherwise, so it can be used as a 
  True/False test
- If we actually want the matching strings to be extracted, 
  we use re.findall()

In [34]:
import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.search('[0-9]+', x)
print(y)

<re.Match object; span=(3, 4), match='2'>


In [35]:
import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+',x)
print(y)

['2', '19', '42']


**re.findall(pattern, string, flags=0)**

Return all non-overlapping matches of pattern in string, as 
a list of strings. The string is scanned left-to-right, and 
matches are returned in the order found. If one or more 
groups are present in the pattern, return a list of groups; 
this will be a list of tuples if the pattern has more than 
one group. Empty matches are included in the result.

See https://docs.python.org/3/library/re.html#re.findall.
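
How the return value changes with the number of groups can be sketched as follows. The addresses are made up for the example.

```python
import re

# Hypothetical text containing two addresses.
text = 'alice@example.com, bob@example.org'

# No group: a list of the full matches.
no_group = re.findall(r'\w+@[\w.]+', text)

# One group: a list of just the grouped part of each match.
one_group = re.findall(r'\w+@([\w.]+)', text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)@([\w.]+)', text)

print(no_group)
print(one_group)
print(two_groups)
```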

When we use re.findall(), it returns a list of zero or more 
substrings that match the regular expression.

In [36]:
import re

x = 'My 2 favorite numbers are 6 and 8'
y = re.findall('[0-9]+',x)
print(y)

['2', '6', '8']


In [37]:
import re

x = 'My 2 favorite numbers are 6 and 8'
y = re.findall('[AEIOU]+',x)
print(y)

[]


## Warning: Greedy Matching

The repeat characters (* and +) are greedy: they consume as much 
text as they can, so the match is the largest possible string.

In [38]:
import re

x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)

['From: Using the :']


<font color='red'>Question: Why not 'From:' ? </font>

## Non-Greedy Matching

Not all regular expression repeat codes are greedy!  If you 
add a ? character after + or *, the repeat becomes non-greedy 
and matches as little text as possible.

In [39]:
import re

x = 'From: Using the : character'
y = re.findall('^F.+?:', x)
print(y)

['From:']


In [42]:
import re

x = 'From: Using the : character and From: ABC DEFG:'
y = re.findall('^F.+?:', x)
print(y)

['From:']


In [41]:
import re

x = 'From: Using the : character and From: ABC DEFG:'
y = re.findall('F.+?:', x)
print(y)

['From:', 'From:']
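
Another way to stop at the first colon, shown here only as an alternative sketch, is a negated character set that refuses to cross a colon at all. The three variants side by side:

```python
import re

x = 'From: Using the : character'

greedy = re.findall('^F.+:', x)       # .+ runs on to the last colon
lazy = re.findall('^F.+?:', x)        # .+? stops at the first colon
negated = re.findall('^F[^:]+:', x)   # [^:]+ can never contain a colon

print(greedy)
print(lazy)
print(negated)
```

On this string the non-greedy and the negated-set patterns return the same result.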


## Fine-Tuning String Extraction

You can refine the match for re.findall() and separately 
determine which portion of the match is to be extracted 
by using parentheses (round brackets)

In [43]:
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
y = re.findall(r'\S+@\S+', x)
print(y)

['stephen.marquard@uct.ac.za']


In [45]:
x = "From abc@example.com stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
y = re.findall(r'\S+@\S+', x)
print(y)

['abc@example.com', 'stephen.marquard@uct.ac.za']


Parentheses are not part of the match, but they mark where the 
string to extract starts and stops.

In [44]:
x = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
y = re.findall(r'^From (\S+@\S+)', x)
print(y)

['stephen.marquard@uct.ac.za']


In [46]:
x = "From abc@example.com stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"
y = re.findall(r'^From (\S+@\S+)', x)
print(y)

['abc@example.com']
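
The difference between what the pattern matches and what the parentheses extract can also be seen with re.search(). A minimal sketch, using the match object's group() method:

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
m = re.search(r'^From (\S+@\S+)', x)

whole = m.group(0)      # everything the pattern matched, including 'From '
captured = m.group(1)   # only the part inside the parentheses

# re.findall() returns the parenthesised part, not the whole match.
extracted = re.findall(r'^From (\S+@\S+)', x)

print(whole)
print(captured)
print(extracted)
```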


## More String Parsing Examples

In [47]:
## Extracting a host name - using find and string slicing

data = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
atpos = data.find('@')    ## position for the at sign @
print(atpos)

sppos = data.find(' ',atpos)  ## position for the first space after @
print(sppos)

host = data[atpos+1 : sppos]
print(host)

21
31
uct.ac.za


In [48]:
## Extracting a host name - using double splitting 

## We split a line one way, and then grab one of the pieces of the 
## line and split that piece again

line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
words = line.split()
email = words[1]
pieces = email.split('@')
print(pieces[1])

uct.ac.za


In [49]:
## Extracting a host name - The Regex Version

import re 
lin = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
y = re.findall('@([^ ]*)', lin)
print(y)

['uct.ac.za']


In [50]:
## Extracting a host name - Even Cooler Regex Version

import re 
lin = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)',lin)
print(y)

['uct.ac.za']


In [52]:
import re
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1 :  
        continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))

Maximum: 0.9907
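
If `mbox-short.txt` is not at hand, the same loop can be tried on an in-memory stand-in. This is a sketch under stated assumptions: the four sample lines and their values are made up, and `io.StringIO` is used only to mimic a file handle.

```python
import io
import re

# Hypothetical stand-in for a few lines of the mail file; the values are invented.
sample = io.StringIO(
    'X-DSPAM-Confidence: 0.8475\n'
    'X-DSPAM-Probability: 0.0000\n'
    'X-DSPAM-Confidence: 0.6178\n'
    'Subject: not a header we want\n'
)

numlist = []
for line in sample:
    stuff = re.findall(r'^X-DSPAM-Confidence: ([0-9.]+)', line.rstrip())
    if len(stuff) != 1:
        continue                      # skip lines that do not match
    numlist.append(float(stuff[0]))

# Guard against an empty list, since max() raises ValueError on one.
maximum = max(numlist) if numlist else None
print('Maximum:', maximum)
```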


## Escape Character

If you want a special regular expression character to behave 
like an ordinary character, prefix it with a backslash ('\').

In [53]:
import re
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)
print(y)

['$10.00']
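
To see why the backslash matters, here is a short sketch that compares the unescaped and escaped patterns. It also uses `re.escape()`, a standard-library function that adds the backslashes for you when a literal string should be matched exactly.

```python
import re

x = 'We just received $10.00 for cookies.'

# Unescaped, '$' means "end of string", so nothing can follow it and no match is found.
unescaped = re.findall('$[0-9.]+', x)

# Escaped, '\$' matches a literal dollar sign.
escaped = re.findall(r'\$[0-9.]+', x)

# re.escape() builds a pattern that matches the given text literally.
auto_pattern = re.escape('$10.00')
auto_found = re.findall(auto_pattern, x)

print(unescaped)
print(escaped)
print(auto_found)
```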
