## Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition to text-matching and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions.

If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python.

## Searching for Patterns in Text

One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the re module to find some text:

In [106]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re
# List of patterns to search for
patterns = [ 'term1', 'term2' ]
# Text to parse
text = 'This is a string with term1, but it does not have the other term.'
for pattern in patterns:
    print('Searching for "%s" in: \n"%s"' % (pattern, text))
    if re.search(pattern,  text):
        print('Match was found.')
    else:
        print('No Match was found.')

Searching for "term1" in: 
"This is a string with term1, but it does not have the other term."
Match was found.
Searching for "term2" in: 
"This is a string with term1, but it does not have the other term."
No Match was found.


Now we've seen that re.search() takes the pattern, scans the text, and returns a **Match** object. If the pattern is not found, **None** is returned. To get a clearer picture of this match object, check out the cell below:

In [49]:
# List of patterns to search for
pattern = 'term1'
# Text to parse
text = 'This is a string with term1, but it does not have the other term.'
match = re.search(pattern,  text)
type(match)
match.start()
match.end()
text = 'This is a string with a TERM1, ...'
match = re.search(pattern,  text, re.IGNORECASE)
match.start()

_sre.SRE_Match

22

27

24

In [41]:
import re
line = "Cats are smarter than dogs"
matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)
if matchObj:
   print("matchObj.group() : ", matchObj.group())
   print("matchObj.group(1) : ", matchObj.group(1))
   print("matchObj.group(2) : ", matchObj.group(2))
else:
   print("No match!!")

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter
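
The cell above uses re.match rather than re.search. The difference is that re.match only succeeds when the pattern matches at the start of the string, while re.search scans forward for the first match anywhere. A minimal sketch of that difference, reusing the sample sentence from the cell above:

```python
import re

line = "Cats are smarter than dogs"

# re.match is anchored at position 0, so a pattern that occurs later fails.
at_start = re.match(r'dogs', line)

# re.search scans the whole string and returns the first match it finds.
anywhere = re.search(r'dogs', line)

print(at_start)         # None
print(anywhere.span())  # (22, 26)
```

If a pattern must match at the beginning of the text, re.match (or an explicit `^` anchor with re.search) expresses that intent; otherwise re.search is usually what you want.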


## Split with regular expressions

Let's see how we can split with the re syntax. This should look similar to how you used the split() method with strings.

In [14]:
# Term to split on
split_term = '@'
phrase = 'What is the domain name of someone with the email: hello@gmail.com'
# Split the phrase
re.split(split_term,phrase)

['What is the domain name of someone with the email: hello', 'gmail.com']

Note how re.split() returns a list with the term to split on removed; the items in the list are the pieces of the original string.
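
Where re.split() goes beyond the string split() method is that the separator can be a pattern. The following sketch (the sample record is made up for illustration) splits on either a comma or a semicolon and discards any whitespace around the separator:

```python
import re

# Hypothetical input with inconsistent separators and spacing.
record = 'alpha, beta;gamma ,delta'

# \s* swallows optional whitespace on either side of a ',' or ';'.
fields = re.split(r'\s*[,;]\s*', record)

print(fields)  # ['alpha', 'beta', 'gamma', 'delta']
```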

## Finding all instances of a pattern

You can use re.findall() to find all the instances of a pattern in a string. For example:

In [50]:
# Returns a list of all matches
re.findall('match','test phrase match is in match middle')
re.finditer('match','test phrase match is in match middle')

['match', 'match']

<callable_iterator at 0x246e2ed7978>
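
re.finditer() returns an iterator of match objects rather than a list of strings, which is why the output above is not directly readable. A short sketch of how it might be consumed, using the same phrase as the cell above:

```python
import re

phrase = 'test phrase match is in match middle'

# Each item produced by finditer is a match object, so positions are available.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer('match', phrase)]

print(spans)  # [(12, 17, 'match'), (24, 29, 'match')]
```

Use findall() when only the matched text is needed, and finditer() when the positions or groups of each match matter.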

## Pattern re Syntax

This will be the bulk of this lecture on using re with Python. Regular expressions support a huge variety of patterns, well beyond finding where a single string occurred.

We can use *metacharacters* along with re to find specific types of patterns. 

Since we will be testing multiple re syntax forms, let's create a function that will print out results given a list of various regular expressions and a phrase to parse:

In [27]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' %pattern)
        print(re.findall(pattern,phrase))

## Repetition Syntax

There are five ways to express repetition in a pattern:

    1.) A pattern followed by the meta-character * is repeated zero or more times.
    2.) Replace the * with + and the pattern must appear at least once.
    3.) Using ? means the pattern appears zero or one time.
    4.) For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat.
    5.) Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [28]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',     # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']
Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']
Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']
Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']
Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
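
The open-ended form {m,} from the list above is not covered by the test patterns. As a small supplementary sketch on the same test phrase, calling re.findall directly:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by two or more d's, with no upper limit on the repetition.
at_least_two = re.findall(r'sd{2,}', test_phrase)

print(at_least_two)  # ['sddd', 'sddd', 'sddd', 'sdddd']
```

Compared with 'sd{2,3}', the last match now includes all four d's because there is no maximum.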


## Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input [ab] searches for occurrences of either a or b.
Let's see some examples:

In [29]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
test_patterns = [ '[sd]',    # either s or d
            's[sd]+']   # s followed by one or more s or d
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']
Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']


## Exclusion

We can use ^ to exclude terms by incorporating it into the bracket syntax notation. Let's see some examples:

Use [^!.? ] to match any character that is not a !, ., ?, or space. Adding + requires one or more such characters in a row, which effectively finds the words.

In [36]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
print(re.findall('[^!.? ]+',test_phrase), end=" ")

['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it'] 

## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

A common use case is to search for a specific range of letters in the alphabet. For example, [a-f] matches any single letter between a and f.

Let's walk through some examples:

In [38]:

test_phrase = 'This is an example sentence. Lets see if we find some letters.'
test_patterns=[ '[a-z]+',     # sequences of lower case letters
                '[A-Z]+',     # sequences of upper case letters
                '[a-zA-Z]+',  # sequences of lower or upper case letters
                '[A-Z][a-z]+']# 1 uppercase letter followed by lowercase letters
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'find', 'some', 'letters']
Searching the phrase using the re check: '[A-Z]+'
['T', 'L']
Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'find', 'some', 'letters']
Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Lets']


## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash (\\). Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, for creating regular expressions eliminates this problem and maintains readability.

In [39]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'
test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ]
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: '\\d+'
['1233']
Searching the phrase using the re check: '\\D+'
['This is a string with some numbers ', ' and a symbol #hashtag']
Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']
Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']
Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
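
The output above prints each pattern with a doubled backslash because %r shows how the string would be written as a normal (non-raw) literal. A brief sketch of why raw strings matter, using only the standard re module (the sample text is made up):

```python
import re

# A raw string and an escaped normal string describe the same pattern.
same = (r'\d+' == '\\d+')

text = 'cat concat cat'

# In a raw string, \b reaches the regex engine as a word-boundary assertion.
raw_hits = re.findall(r'\bcat\b', text)

# In a normal string, '\b' is a backspace character, so nothing matches.
plain_hits = re.findall('\bcat\b', text)

print(same)        # True
print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []
```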


In [1]:
import re
phone = "2004-959-559 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)    
print("Phone Num : ", num)

Phone Num :  2004-959-559 
Phone Num :  2004959559


In [36]:
from re import *
t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

In [37]:
sub(r'(\w*)o(\w*)', r'=\1o\2=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'

In [38]:
sub(r'(?P<before>\w*)o(?P<after>\w*)', r'=\g<before>o\g<after>=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'

In [39]:
sub(r'(?x) (?P<before> \w*) o (?P<after> \w*)', r'=\g<before>o\g<after>=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'

In [40]:
sub(r'(?x) (\w*o\w*)', r'=\1=', t)

'The quick =brown= =fox= =born= =on= 1/23/2013 jumped =over= the lazy =dog= =born= =on= 10/6/10.'

In [41]:
fi = finditer(r'(?x) / (.* ) /', t)

[m.group(1) for m in fi]

['23/2013 jumped over the lazy dog born on 10/6']

In [42]:
fi = finditer(r'(?x) / (.*? ) /', t)

[m.group(1) for m in fi]

['23', '6']

In [43]:
fi = finditer(r'(?x) ([aeiou]) .* \1', t)

[m.group(0) for m in fi]

['e quick brown fox born on 1/23/2013 jumped over the', 'og born o']

In [44]:
fi = finditer(r'(?x) ([aeiou]) .*? \1', t)

[m.group(0) for m in fi]

['e quick brown fox born on 1/23/2013 jumpe', 'over the lazy do', 'orn o']
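
The last four cells differ only in using `.*` versus `.*?`. The first is greedy (it matches as much as possible), the second is lazy (it matches as little as possible). A smaller sketch of the same effect, with a made-up string:

```python
import re

s = '<a> and <b>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall(r'<(.*)>', s)

# Lazy: .*? stops at the first '>' it can.
lazy = re.findall(r'<(.*?)>', s)

print(greedy)  # ['a> and <b']
print(lazy)    # ['a', 'b']
```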

In [45]:
split(r'\w+u\w+', t)

['The ',
 ' brown fox born on 1/23/2013 ',
 ' over the lazy dog born on 10/6/10.']

In [46]:
split(r'(\w+u\w+)', t)

['The ',
 'quick',
 ' brown fox born on 1/23/2013 ',
 'jumped',
 ' over the lazy dog born on 10/6/10.']

In [47]:
split(r'(\w*o\w*)', t)

['The quick ',
 'brown',
 ' ',
 'fox',
 ' ',
 'born',
 ' ',
 'on',
 ' 1/23/2013 jumped ',
 'over',
 ' the lazy ',
 'dog',
 ' ',
 'born',
 ' ',
 'on',
 ' 10/6/10.']

Here we split on 4-letter words.  Then we do it again and capture the 4-letter words.

In [48]:

split(r'\b\w{4}\b', t)


['The quick brown fox ',
 ' on 1/23/',
 ' jumped ',
 ' the ',
 ' dog ',
 ' on 10/6/10.']

In [49]:

split(r'\b(\w{4})\b', t)


['The quick brown fox ',
 'born',
 ' on 1/23/',
 '2013',
 ' jumped ',
 'over',
 ' the ',
 'lazy',
 ' dog ',
 'born',
 ' on 10/6/10.']

### **split** and zero-width assertions.
Here we split on word boundaries.  **\\b** is a zero-width assertion.  It requires that certain characters be present, but it doesn't "consume" them.

In [51]:
try:
    split(r'\b',t)
except ValueError as ve:
    print("ValueError: ",ve)


ValueError:  split() requires a non-empty pattern match.


So what went wrong?  Unlike **split** in Perl, the split function in **re** would not split on zero-width assertions in Python versions before 3.7 (since Python 3.7, **re.split** does split on empty matches). The **regex** module also supports this.

In [52]:

from regex import *
split(r'\b',t)


['',
 'The',
 ' ',
 'quick',
 ' ',
 'brown',
 ' ',
 'fox',
 ' ',
 'born',
 ' ',
 'on',
 ' ',
 '1',
 '/',
 '23',
 '/',
 '2013',
 ' ',
 'jumped',
 ' ',
 'over',
 ' ',
 'the',
 ' ',
 'lazy',
 ' ',
 'dog',
 ' ',
 'born',
 ' ',
 'on',
 ' ',
 '10',
 '/',
 '6',
 '/',
 '10',
 '.']

Oops. To get the new behavior, we must add the "Version 1" option to the regular expression.  "Version 0" emulates **re**.

In [53]:
split(r'(?V1)\b',t)

['',
 'The',
 ' ',
 'quick',
 ' ',
 'brown',
 ' ',
 'fox',
 ' ',
 'born',
 ' ',
 'on',
 ' ',
 '1',
 '/',
 '23',
 '/',
 '2013',
 ' ',
 'jumped',
 ' ',
 'over',
 ' ',
 'the',
 ' ',
 'lazy',
 ' ',
 'dog',
 ' ',
 'born',
 ' ',
 'on',
 ' ',
 '10',
 '/',
 '6',
 '/',
 '10',
 '.']

We can make life a little easier by setting the version globally.

In [54]:
import regex
regex.DEFAULT_VERSION = VERSION1
split(r'\b',t)

['',
 'The',
 ' ',
 'quick',
 ' ',
 'brown',
 ' ',
 'fox',
 ' ',
 'born',
 ' ',
 'on',
 ' ',
 '1',
 '/',
 '23',
 '/',
 '2013',
 ' ',
 'jumped',
 ' ',
 'over',
 ' ',
 'the',
 ' ',
 'lazy',
 ' ',
 'dog',
 ' ',
 'born',
 ' ',
 'on',
 ' ',
 '10',
 '/',
 '6',
 '/',
 '10',
 '.']
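
For comparison, here is a sketch using only the standard re module. It assumes Python 3.7 or later, where re.split also splits on empty matches; on older versions the split call behaves differently. The sample sentence is the one used in the Python documentation for re.split:

```python
import re

text = 'Words, words, words.'

# \b matches at a position, not a character: every span has zero width.
boundaries = [m.span() for m in re.finditer(r'\b', text)]

# On Python 3.7+, re.split cuts the string at each of those positions.
pieces = re.split(r'\b', text)

print(boundaries)  # [(0, 0), (5, 5), (7, 7), (12, 12), (14, 14), (19, 19)]
print(pieces)      # ['', 'Words', ', ', 'words', ', ', 'words', '.']
```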

**\\m** and **\\M** are zero-width assertions that are true at the beginnings and ends of words.

In [55]:

split(r'\M',t)


['The',
 ' quick',
 ' brown',
 ' fox',
 ' born',
 ' on',
 ' 1',
 '/23',
 '/2013',
 ' jumped',
 ' over',
 ' the',
 ' lazy',
 ' dog',
 ' born',
 ' on',
 ' 10',
 '/6',
 '/10',
 '.']

In [56]:

split(r'\m',t)


['',
 'The ',
 'quick ',
 'brown ',
 'fox ',
 'born ',
 'on ',
 '1/',
 '23/',
 '2013 ',
 'jumped ',
 'over ',
 'the ',
 'lazy ',
 'dog ',
 'born ',
 'on ',
 '10/',
 '6/',
 '10.']

### Look-arounds
We can split on any 4-letter word.

In [57]:

split(r'(?x) \b \w{4} \b', t)


['The quick brown fox ',
 ' on 1/23/',
 ' jumped ',
 ' the ',
 ' dog ',
 ' on 10/6/10.']

But what if we want to split on any 4-letter word but **born**? We can use a look-around assertion.  Look-aheads and look-behinds come in two flavors: positive and negative.  All four are zero-width assertions: they require certain characters to be present or absent, but don't consume them.  In this case, we need a negative assertion.  We could use a look-ahead:

In [59]:
split(r'(?x) \b (?!born) \w{4} \b', t)

['The quick brown fox born on 1/23/',
 ' jumped ',
 ' the ',
 ' dog born on 10/6/10.']

or a look-behind:

In [61]:
split(r'(?x) \b \w{4} (?<!born) \b', t)

['The quick brown fox born on 1/23/',
 ' jumped ',
 ' the ',
 ' dog born on 10/6/10.']

or, if we are feeling perverse, both:

In [62]:
split(r'(?x) \b (?!born) \w{4} (?<!born) \b', t)

['The quick brown fox born on 1/23/',
 ' jumped ',
 ' the ',
 ' dog born on 10/6/10.']

This one splits on any 4-letter word that doesn't contain **o**:

In [48]:

split(r'(?x) \b (?!\w*o) \w{4} \b', t)


['The quick brown fox born on 1/23/',
 ' jumped over the ',
 ' dog born on 10/6/10.']

This one splits on the letter **o**.  The **o** is consumed and lost.

In [66]:
split(r'(?x) o', t)

['The quick br',
 'wn f',
 'x b',
 'rn ',
 'n 1/23/2013 jumped ',
 'ver the lazy d',
 'g b',
 'rn ',
 'n 10/6/10.']

This one has a positive look-ahead assertion.  It splits before every **o**.

In [67]:
split(r'(?x) (?=o)', t)

['The quick br',
 'own f',
 'ox b',
 'orn ',
 'on 1/23/2013 jumped ',
 'over the lazy d',
 'og b',
 'orn ',
 'on 10/6/10.']

This one has a positive look-behind assertion.  It splits after every **o**.

In [68]:

split(r'(?x) (?<=o)', t)


['The quick bro',
 'wn fo',
 'x bo',
 'rn o',
 'n 1/23/2013 jumped o',
 'ver the lazy do',
 'g bo',
 'rn o',
 'n 10/6/10.']

This one splits between  **o** and **r**:

In [69]:
split(r'(?x) (?<=o) (?=r)', t)

['The quick brown fox bo',
 'rn on 1/23/2013 jumped over the lazy dog bo',
 'rn on 10/6/10.']

The assertions could appear in either order.

In [71]:
split(r'(?x) (?=r) (?<=o)', t)

['The quick brown fox bo',
 'rn on 1/23/2013 jumped over the lazy dog bo',
 'rn on 10/6/10.']

This one splits between any two consecutive vowels.

In [73]:
split(r'(?x) (?<=[aeiou]) (?=[aeiou])', t)

['The qu',
 'ick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.']
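
The four look-around flavors used in this section also exist in the standard re module. The sketch below shows one small example of each with findall; the sample strings are invented for illustration:

```python
import re

colors = 'red, green blue, cyan'
prices = 'cost $40 or 30 euros'

# Positive look-ahead: words that are followed by a comma.
before_comma = re.findall(r'\w+(?=,)', colors)

# Negative look-ahead: whole words that are NOT followed by a comma.
not_before_comma = re.findall(r'\b\w+\b(?!,)', colors)

# Positive look-behind: digits that are preceded by a dollar sign.
dollars = re.findall(r'(?<=\$)\d+', prices)

# Negative look-behind: whole numbers that are NOT preceded by a dollar sign.
not_dollars = re.findall(r'(?<!\$)\b\d+', prices)

print(before_comma)      # ['red', 'blue']
print(not_before_comma)  # ['green', 'cyan']
print(dollars)           # ['40']
print(not_dollars)       # ['30']
```

In every case the comma or dollar sign is checked but never becomes part of the match, which is what "zero-width" means in practice.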

### Fun with DNA: Open reading frames
DNA is a sequence of bases, A, C, G, or T.  They are translated into proteins 3 bases at a time.  Each 3-base sequence is called a **codon**.  There is a special **start codon** ATG, and three **stop codons**, TGA, TAG, and TAA.  The start and stop codons are highlighted below.

In [75]:
dna = 'cgcgcATGcgcgcgTGAcgcgcgTAGcgcgcgcgc'
dna = dna.lower()

An **open reading frame** or **ORF** consists of a start codon, followed by some more codons, and ending with a stop codon.  (In real life, "some more" is usually hundreds or thousands.)

In [76]:
orfpat = r'(?x) atg (...)* (tga|tag|taa)'
search(orfpat,dna)

<regex.Match object; span=(5, 26), match='atgcgcgcgtgacgcgcgtag'>

Actually, that's not quite right.  The internal codons should not be stop codons. We can handle that with a negative lookahead assertion.  (Can you think of another way?)

In [80]:
orfpat = r'(?x) atg  ( (?!tga|tag|taa) ... )*  (tga|tag|taa)'
search(orfpat,dna)

<regex.Match object; span=(5, 17), match='atgcgcgcgtga'>

We don't really want to capture the "some more codons" separately.  In a minute that will get in the way.  So we can use **(?:...)** to group without capturing:

In [81]:
orfpat = r'(?x) ( atg  (?: (?!tga|tag|taa) ... )*  (?:tga|tag|taa) )'
findall(orfpat,dna)

['atgcgcgcgtga']
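
The reason capturing groups "get in the way" is that findall returns the captured groups instead of the whole match whenever the pattern contains any. A minimal sketch of that behavior with the standard re module and a toy string:

```python
import re

s = 'ababc'

# With a capturing group, findall returns only what the group last captured.
capturing = re.findall(r'(ab)+c', s)

# With (?:...), the group still repeats as a unit but captures nothing,
# so findall returns the whole match.
non_capturing = re.findall(r'(?:ab)+c', s)

print(capturing)      # ['ab']
print(non_capturing)  # ['ababc']
```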

Here is another DNA sequence.  Note that this one has overlapping ORFs.  We would like a list of **all** ORFs, specifically **ATGcATGcgTGA** and **ATGcgTGAcTAA**.  Our last pattern only finds the first ORF. Since it consumes the first ORF, it also consumes the beginning of the second ORF.

In [82]:

dna = 'cgcgcATGcATGcgTGAcTAAcgTAGcgcgcgcgc'

dna = dna.lower()

findall(orfpat,dna)


['atgcatgcgtga']

Since we want to find something without consuming it, we can use a positive lookahead assertion.  We put the whole ORF pattern inside the lookahead.  We need to capture what is matched by the lookahead without consuming it.

In [83]:

orfpat = r'(?x) (?= ( atg  (?: (?!tga|tag|taa) ... )*  (?:tga|tag|taa) ))'

findall(orfpat,dna)


['atgcatgcgtga', 'atgcgtgactaa']

Notice the position of the capturing parentheses.  This doesn't work:

In [61]:

orfpat = r'(?x) ( (?= atg  (?: (?!tga|tag|taa) ... )*  (?:tga|tag|taa) ))'

findall(orfpat,dna)


['', '']

Why? Because the look-ahead assertion has width 0.
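
The same capture-inside-a-look-ahead trick works in the standard re module. Here is a stripped-down sketch on a string of digits, where the "pattern of interest" is simply any two adjacent digits:

```python
import re

digits = '12345'

# Ordinary matching consumes each pair, so overlapping pairs are skipped.
consumed = re.findall(r'\d\d', digits)

# The look-ahead consumes nothing; the group inside it records the pair.
overlapping = re.findall(r'(?=(\d\d))', digits)

# Capturing the look-ahead itself yields empty strings: it has zero width.
empty = re.findall(r'((?=\d\d))', digits)

print(consumed)     # ['12', '34']
print(overlapping)  # ['12', '23', '34', '45']
print(empty)        # ['', '', '', '']
```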

### More fun with DNA: Restriction Digest Assays
To perform certain assays, molecular biologists subject DNA sequences to enzymes known as restriction enzymes. There are several types; this is about Type II restriction endonucleases, to be precise. They are usually named with three letters, for the species of origin, and a Roman numeral; e.g., AfeI comes from Alcaligenes faecalis.  These enzymes typically recognize a specific sequence of 6-10 letters, and cut the DNA somewhere in the middle of that sequence.  For example, BgIII recognizes **AGATCT** and cuts between the first **A** and the **G**.

For a typical assay, the DNA will be digested by a "cocktail" of 3-6 enzymes. The lengths of the resulting pieces will be measured by gel electrophoresis.  The lengths should match up with the lengths predicted by an in silico digestion. If not, something is wrong.
Our task is to do the in silico digestion.

For development, here is a dictionary of 4 enzymes, a DNA sequence to digest, and the string we would like to get out of the process. In real life, the DNA sequence would be thousands to tens of thousands of letters long. The researcher could be interested in knowing the cut-points for dozens of enzymes, even though a typical assay uses just a few.

In [62]:

enzymes = {'A-GATCT': 'BgIII',
           'AGC-GCT': 'AfeI',
           'AGG-CCT': 'StuI',
           'AT-CGAT': 'ClaI'}

dna = 'AAAAGCGCTAAAATCGATAAAAAAGATCTAAAAAGCGCT'

goal = 'AAAAGC <AfeI> GCTAAAAT <ClaI> CGATAAAAAA <BgIII> GATCTAAAAAGC <AfeI> GCT'


We are going to use positive look-aheads and look-behinds.  We will build a look-behind/look-ahead combination for each enzyme.

In [63]:

pats = ['(?<=' + fore + ')(?=' + aft + ')' 
        
        for (fore,aft) in [split(r'-',s) for s in enzymes.keys()]]

pats


['(?<=AT)(?=CGAT)', '(?<=A)(?=GATCT)', '(?<=AGC)(?=GCT)', '(?<=AGG)(?=CCT)']

In [64]:

pats = ' | '.join(pats)
pats


'(?<=AT)(?=CGAT) | (?<=A)(?=GATCT) | (?<=AGC)(?=GCT) | (?<=AGG)(?=CCT)'

In [65]:

pat = "(?x) ( " + pats + " )"
pat


'(?x) ( (?<=AT)(?=CGAT) | (?<=A)(?=GATCT) | (?<=AGC)(?=GCT) | (?<=AGG)(?=CCT) )'

In [66]:
print(goal)

split(pat,dna)


AAAAGC <AfeI> GCTAAAAT <ClaI> CGATAAAAAA <BgIII> GATCTAAAAAGC <AfeI> GCT


['AAAAGC', '', 'GCTAAAAT', '', 'CGATAAAAAA', '', 'GATCTAAAAAGC', '', 'GCT']

It's a good start.  We split in the right places, but we didn't capture the recognition sequences, so we can't retrieve the name of the enzyme from the dictionary. In fact, we captured empty strings.  That's because we captured a zero-width assertion. So we will add some parentheses to capture the look-aheads and look-behinds.

In [67]:

pats = ['(?<=(' + fore + ')) (?=(' + aft + '))' 
        
        for (fore,aft) in [split(r'-',s) for s in enzymes.keys()]]

pats = '  |  '.join(pats)
pat = '(?x) (?: ' + pats + ' )'
pat


'(?x) (?: (?<=(AT)) (?=(CGAT))  |  (?<=(A)) (?=(GATCT))  |  (?<=(AGC)) (?=(GCT))  |  (?<=(AGG)) (?=(CCT)) )'

In [68]:

split(pat,dna)


['AAAAGC',
 None,
 None,
 None,
 None,
 'AGC',
 'GCT',
 None,
 None,
 'GCTAAAAT',
 'AT',
 'CGAT',
 None,
 None,
 None,
 None,
 None,
 None,
 'CGATAAAAAA',
 None,
 None,
 'A',
 'GATCT',
 None,
 None,
 None,
 None,
 'GATCTAAAAAGC',
 None,
 None,
 None,
 None,
 'AGC',
 'GCT',
 None,
 None,
 'GCT']

What happened?  The pattern has eight sets of capturing parentheses. So, the match also returns eight groups when it's executed.  Only the parentheses from the successful alternative will capture anything.  The other six groups are set to **None**.
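
One simple workaround, sketched here with the standard re module, is to drop the **None** entries after splitting. This assumes Python 3.7 or later (so that re.split splits on empty matches), and it uses a toy two-alternative pattern and input rather than the enzyme data:

```python
import re

# Two alternatives, each with a captured look-behind and look-ahead.
pat = r'(?:(?<=(a))(?=(b))|(?<=(c))(?=(d)))'

raw = re.split(pat, 'xabycdz')

# Groups from the alternative that did not match come back as None.
clean = [piece for piece in raw if piece is not None]

print(raw)
# ['xa', 'a', 'b', None, None, 'byc', None, None, 'c', 'd', 'dz']
print(clean)
# ['xa', 'a', 'b', 'byc', 'c', 'd', 'dz']
```

The filtered list has the same piece, fore, aft rhythm that the rest of this section relies on.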

Happily, **regex** provides a new "branch reset" feature. Briefly, it means that capturing occurs only on the successful branch.

In [69]:

pat = '(?x) (?| ' + pats + ' )'
pat


'(?x) (?| (?<=(AT)) (?=(CGAT))  |  (?<=(A)) (?=(GATCT))  |  (?<=(AGC)) (?=(GCT))  |  (?<=(AGG)) (?=(CCT)) )'

In [70]:

split(pat,dna)


['AAAAGC',
 'AGC',
 'GCT',
 'GCTAAAAT',
 'AT',
 'CGAT',
 'CGATAAAAAA',
 'A',
 'GATCT',
 'GATCTAAAAAGC',
 'AGC',
 'GCT',
 'GCT']

Hooray!  Now all we have to do is map the recognition sequences into enzyme names:

In [71]:

L = split(pat,dna)

LL = [ ' <' + enzymes[L[i]+'-'+L[i+1]] + '> '+ L[i+2] 
      
                  for i in range(1,len(L),3) ]

L[0] + ''.join(LL)


'AAAAGC <AfeI> GCTAAAAT <ClaI> CGATAAAAAA <BgIII> GATCTAAAAAGC <AfeI> GCT'

Then we can pull it all together into a nice class:

In [72]:

import regex as re

class EndonucleaseDigestor:
    
    def __init__(this,enzymeDict):
        pats = ['(?<=(' + fore + '))(?=(' + aft + '))' 
                for (fore,aft) in [re.split(r'-',s) for s in enzymeDict.keys()]]
        pat = ' | '.join(pats)
        pat = '(?x) (?| ' + pat + ' )'
        this.pat = re.compile(pat)
        this.enzymes = enzymeDict
        
    def digest(this,dna):
        L = this.pat.split(dna)
        # Look up names in the dictionary passed to the constructor, not the global one
        LL = [ ' <' + this.enzymes[L[i]+'-'+L[i+1]] + '> '+ L[i+2] for i in range(1,len(L),3) ]
        return L[0] + ''.join(LL)
        

enzymes = {'A-GATCT': 'BgIII',
           'AGC-GCT': 'AfeI',
           'AGG-CCT': 'StuI',
           'AT-CGAT': 'ClaI'}
dna = 'AAAAGCGCTAAAATCGATAAAAAAGATCTAAAAAGCGCT'
goal = 'AAAAGC <AfeI> GCTAAAAT <ClaI> CGATAAAAAA <BgIII> GATCTAAAAAGC <AfeI> GCT'

digestor = EndonucleaseDigestor(enzymes)
if digestor.digest(dna) == goal: print("passed")


passed


### TMTOWTDI: **sub** with a function
There's another way to solve the restriction digest problem. This time let's start by building a dictionary that maps the recognition sequences into versions with the enzyme name interposed:

In [73]:

d = { sub('-','',k) : sub('-'," <"+v+"> ", k) for k,v in enzymes.items()}
d


{'AGATCT': 'A <BgIII> GATCT',
 'AGCGCT': 'AGC <AfeI> GCT',
 'AGGCCT': 'AGG <StuI> CCT',
 'ATCGAT': 'AT <ClaI> CGAT'}

And let's build a pattern that matches all the recognition sequences:

In [74]:

p = ' | '.join( [ sub('-','',k) for k in enzymes.keys()])
p = '(?x) (' + p + ')'
p


'(?x) (ATCGAT | AGATCT | AGCGCT | AGGCCT)'

The second argument to **sub** can be a function rather than a string.  If so, the function is called with a **match** object as its argument.  We are interested in the first (and only) thing captured in the match and we want to get the corresponding string out of dictionary **d**.  So we define a function to do that.  Then we call sub with that function.

In [75]:

def subber (m):
    print(m)
    return d[m.group(1)]

sub(p, subber, dna)


<regex.Match object; span=(3, 9), match='AGCGCT'>
<regex.Match object; span=(12, 18), match='ATCGAT'>
<regex.Match object; span=(23, 29), match='AGATCT'>
<regex.Match object; span=(33, 39), match='AGCGCT'>


'AAAAGC <AfeI> GCTAAAAT <ClaI> CGATAAAAAA <BgIII> GATCTAAAAAGC <AfeI> GCT'
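
As a smaller illustration of passing a function to **sub**, here is a sketch with the standard re module. The input string and the doubling rule are invented for the example:

```python
import re

prices = 'apple 3, pear 12'

def double(m):
    # m is the match object for one run of digits; return its replacement text.
    return str(int(m.group()) * 2)

doubled = re.sub(r'\d+', double, prices)

print(doubled)  # apple 6, pear 24
```

The function is called once per match, left to right, and whatever string it returns is spliced into the result.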

So we can make a class like this:

In [76]:

import regex as re

class AnotherEndonucleaseDigestor:
    
    def __init__(this,enzymeDict):
        # Build from the enzymeDict argument rather than the global enzymes
        this.d = { re.sub('-','',k) : re.sub('-'," <"+v+"> ", k) for k,v in enzymeDict.items()}
        p = ' | '.join( [ re.sub('-','',k) for k in enzymeDict.keys()])
        p = '(?x) (' + p + ')'
        this.pat = re.compile(p)
        this.enzymes = enzymeDict
        
    def digest(this,dna):
        return this.pat.sub( lambda m: this.d[m.group(1)]  , dna)

enzymes = {'A-GATCT': 'BgIII',
           'AGC-GCT': 'AfeI',
           'AGG-CCT': 'StuI',
           'AT-CGAT': 'ClaI'}
dna = 'AAAAGCGCTAAAATCGATAAAAAAGATCTAAAAAGCGCT'
goal = 'AAAAGC <AfeI> GCTAAAAT <ClaI> CGATAAAAAA <BgIII> GATCTAAAAAGC <AfeI> GCT'

digestor = AnotherEndonucleaseDigestor(enzymes)
if digestor.digest(dna) == goal: print("passed")


passed


That probably seems a lot simpler, but there is one problem.  What if two recognition sites are overlapping?  For the in silico simulation of a real digest, it doesn't much matter, because the resolution of gel electrophoresis is much less than the 5-10 bases that might be overlapping.  On the other hand, if the scientist actually wants a complete inventory of all the restriction sites for a large set of enzymes, overlaps matter, and this solution won't work.
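
For the inventory case, one possible approach is to wrap the alternation in a look-ahead so that nothing is consumed. The sketch below uses the standard re module; the second site and the DNA string are assumptions chosen only so that two sites overlap:

```python
import re

# 'AGCGCT' is the AfeI site from the dictionary above. 'GCTAGC' is a second
# site picked for this demo because it overlaps with it (an assumption).
sites = ['AGCGCT', 'GCTAGC']
dna = 'AAAGCGCTAGCAA'

consuming = re.compile('|'.join(sites))
peeking = re.compile('(?=(' + '|'.join(sites) + '))')

# Consuming matches skip any site that starts inside an earlier match.
found_consuming = [(m.start(), m.group()) for m in consuming.finditer(dna)]

# Look-ahead matches report every start position, overlapping or not.
found_peeking = [(m.start(), m.group(1)) for m in peeking.finditer(dna)]

print(found_consuming)  # [(2, 'AGCGCT')]
print(found_peeking)    # [(2, 'AGCGCT'), (5, 'GCTAGC')]
```

One limitation of this sketch: if two different sites start at the same position, only the first alternative that matches is reported there.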

### Nested sets
We have seen some character sets such as **[aeiou]** for all (lower-case) vowels.  Suppose we want all lower-case consonants.  One obvious way is to list them all.  We might also be tempted to use set negation:

In [84]:
findall(r'[^aeiou]+',t)

['Th',
 ' q',
 'ck br',
 'wn f',
 'x b',
 'rn ',
 'n 1/23/2013 j',
 'mp',
 'd ',
 'v',
 'r th',
 ' l',
 'zy d',
 'g b',
 'rn ',
 'n 10/6/10.']

The problem is that we get not just consonants, but spaces, digits, etc.
The new **regex** module allows us to do arithmetic on sets:


In [78]:
findall(r'(?x) [[a-z]--[aeiou]]+', t)

['h',
 'q',
 'ck',
 'br',
 'wn',
 'f',
 'x',
 'b',
 'rn',
 'n',
 'j',
 'mp',
 'd',
 'v',
 'r',
 'th',
 'l',
 'zy',
 'd',
 'g',
 'b',
 'rn',
 'n']
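
The standard re module has no set arithmetic, but the same set of lower-case consonants can be approximated in two ways: by listing the ranges between the vowels, or by guarding [a-z] with a negative look-ahead. A sketch on a shortened version of the sentence:

```python
import re

t = 'The quick brown fox'

# Option 1: spell out the consonant ranges between the vowels.
explicit = re.findall(r'[b-df-hj-np-tv-z]+', t)

# Option 2: any lower-case letter, provided it is not a vowel.
guarded = re.findall(r'(?:(?![aeiou])[a-z])+', t)

print(explicit)  # ['h', 'q', 'ck', 'br', 'wn', 'f', 'x']
print(guarded)   # ['h', 'q', 'ck', 'br', 'wn', 'f', 'x']
```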

### Fuzzy matching
With **regex**, you can specify that patterns need only be satisfied approximately.  You can specify the number of insertions (**i**), number of deletions (**d**), and number of substitutions (**s**) as well as total number of errors (**e**).
This example allows at most one insertion and at most one deletion for each pattern.

In [86]:
list(finditer(r'(brown|lazy){i<=1,d<=1} (dog|fox){i<=1,d<=1}',
'The quick crown fax barn on Monday jumped over the sleazy hog bran on Tuesday.'))

[<regex.Match object; span=(10, 19), match='crown fax', fuzzy_counts=(0, 2, 2)>,
 <regex.Match object; span=(52, 61), match='leazy hog', fuzzy_counts=(0, 2, 1)>]

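You can see that the match object reports the number of substitutions, insertions, and deletions, in that order, as **fuzzy_counts**. Since this pattern does not allow substitutions, a changed letter such as **crown** for **brown** is counted as one insertion plus one deletion.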
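
To make the idea of counting errors concrete, here is a minimal pure-Python edit distance (Levenshtein distance). It is only an illustration of counting insertions, deletions, and substitutions between two words; it is not how **regex** implements fuzzy matching, and the function name is hypothetical:

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb, or keep
        prev = curr
    return prev[-1]

print(edit_distance('brown', 'crown'))  # 1
print(edit_distance('lazy', 'leazy'))   # 1
print(edit_distance('dog', 'hog'))      # 1
```

With substitutions allowed, **crown** is one edit away from **brown**; the fuzzy pattern above forbids substitutions, which is why the same change costs two edits there.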

You can even **require** a minimum number of errors:

In [87]:
list(finditer(r'(brown|lazy){1<=e<=3} (dog|fox){1<=e<=2}',
'The quick crown fax barn on Monday jumped over the sleazy hog bran on Tuesday.'))

[<regex.Match object; span=(8, 19), match='k crown fax', fuzzy_counts=(2, 2, 0)>,
 <regex.Match object; span=(20, 27), match='barn on', fuzzy_counts=(3, 0, 2)>,
 <regex.Match object; span=(50, 61), match=' sleazy hog', fuzzy_counts=(1, 3, 0)>,
 <regex.Match object; span=(62, 69), match='bran on', fuzzy_counts=(3, 0, 2)>]

What matched what?  We can find out by doing some more capturing.

In [97]:
findall(r'(?:(brown)|(lazy)){1<=e<=3} (?:(dog)|(fox)){1<=e<=2}',
'The quick crown fax barn on Monday jumped over the sleazy hog bran on Tuesday.')        

[('k crown', '', '', 'fax'),
 ('barn', '', 'on', ''),
 ('', ' sleazy', 'hog', ''),
 ('bran', '', 'on', '')]

What if we try our original correct string? We should get back no matches, because there are no errors, right?  Maybe not.

In [90]:

findall(r'(?:(brown)|(lazy)){1<=e<=3} (?:(dog)|(fox)){1<=e<=2}',
        
        'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.')
 

[('ck brown', '', 'fox', ''),
 (' born', '', 'on', ''),
 ('', 'he lazy', 'dog', ''),
 ('born', '', 'on', '')]

Maybe it will work if we try the **BESTMATCH** option:

In [92]:

findall(r'(brown|lazy){1<=e<=3} (dog|fox){1<=e<=2}',
        'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.',
        BESTMATCH)


[(' brown', 'fox '), (' lazy', 'dog '), ('born', 'on')]

Hmm.  Maybe we need the **ENHANCEMATCH** option.

In [94]:

findall(r'(brown|lazy){1<=e<=3} (dog|fox){1<=e<=2}',
        'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.',
         ENHANCEMATCH)


[(' brown', 'fox'), ('born', 'on'), (' lazy', 'dog '), ('born', 'on')]

Maybe we should use both....

In [95]:

findall(r'(brown|lazy){1<=e<=3} (dog|fox){1<=e<=2}',
        'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.',
         ENHANCEMATCH | BESTMATCH)


[(' brown', 'fox '), (' lazy', 'dog '), ('born', 'on')]
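
To make the error budget in patterns like `{e<=2}` more concrete, here is a minimal pure-Python sketch that counts the substitutions, insertions, and deletions needed to turn one string into another (the classic Levenshtein distance). It is only an illustration of what is being counted, not how the regex module computes fuzzy matches; the function name `edit_distance` is made up for this example.

```python
def edit_distance(a, b):
    """Minimum number of single-character edits turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]

print(edit_distance('crown', 'brown'))
print(edit_distance('baandoned', 'abandoned'))
```

A word pair with a distance of 2 or less would fit inside an `{e<=2}` budget.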

OK, now let's try a spelling corrector.  Here's a list of words.

In [87]:
f = open('words.txt')

f.readline()


'Aarhus\n'

In [88]:
words = f.readlines()
words[:10]

['Aaron\n',
 'Ababa\n',
 'aback\n',
 'abaft\n',
 'abandon\n',
 'abandoned\n',
 'abandoning\n',
 'abandonment\n',
 'abandons\n',
 'abase\n']

In [89]:
words =  ' '.join( [sub('\n', '', w) for w in words] )
words[-60:]

'zoom zooms zoos Zorn Zoroaster Zoroastrian Zulu Zulus Zurich'

Now let's make a string with some misspelled (and correct) words. It might seem counterintuitive, but we will take the misspelled words and turn them into a pattern, and use the dictionary as the target sequence.

In [90]:

misspelt = 'abrogatting baandoned abreviat astracted absinthe abussed abus zoan'

misspelt = split(r'\W+', misspelt)

misspelt = [r"(" + s + r"){e<=2}" for s in misspelt]

misspelt = r"(?x) \m (?: " + " | ".join(misspelt) + r" ) \M"

misspelt


'(?x) \\m (?: (abrogatting){e<=2} | (baandoned){e<=2} | (abreviat){e<=2} | (astracted){e<=2} | (absinthe){e<=2} | (abussed){e<=2} | (abus){e<=2} | (zoan){e<=2} ) \\M'

Note that this time we did not use the branch reset feature. That's because each alternative keeps its own capture group, so the position of the non-empty string in each result tuple tells us which misspelled word was matched.

In [91]:

lis = findall(misspelt,words, ENHANCEMATCH)

lis


[('', 'abandoned', '', '', '', '', '', ''),
 ('', '', '', '', '', '', 'abase', ''),
 ('', '', '', '', '', 'abased', '', ''),
 ('', '', '', '', '', '', 'abash', ''),
 ('', '', '', '', '', 'abashed', '', ''),
 ('', '', '', '', '', '', 'abbe', ''),
 ('', '', 'abbreviate', '', '', '', '', ''),
 ('', '', '', '', '', '', 'abed', ''),
 ('', '', '', '', '', '', 'abet', ''),
 ('', '', '', '', '', '', 'abets', ''),
 ('', '', '', '', '', '', 'able', ''),
 ('', '', '', '', '', '', 'ably', ''),
 ('', '', '', '', '', '', 'Abos', ''),
 ('', '', '', '', '', '', 'about', ''),
 ('abrogating', '', '', '', '', '', '', ''),
 ('', '', '', '', 'absentee', '', '', ''),
 ('', '', '', '', 'absinthe', '', '', ''),
 ('', '', '', 'abstracted', '', '', '', ''),
 ('', '', '', '', '', '', 'Abu', ''),
 ('', '', '', '', '', '', 'abuse', ''),
 ('', '', '', '', '', 'abused', '', ''),
 ('', '', '', '', '', 'abuses', '', ''),
 ('', '', '', '', '', '', 'abut', ''),
 ('', '', '', '', '', '', 'abuts', ''),
 ('', '', '', '', '

Now we will transpose the matrix, so that every row contains the matches for a single misspelled word. Most of the entries will be empty strings.

In [92]:

z = list(zip(*lis))

z


[('',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'abrogating',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'arrogating',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  

And now we can filter out the empty strings.

In [93]:

z = [ list(filter(lambda s: s!= '', L)) for L in z]

z


[['abrogating', 'arrogating'],
 ['abandoned'],
 ['abbreviate'],
 ['abstracted', 'attracted', 'distracted', 'extracted', 'retracted'],
 ['absentee', 'absinthe'],
 ['abased',
  'abashed',
  'abused',
  'abuses',
  'abutted',
  'abysses',
  'amassed',
  'ambushed',
  'amused',
  'bossed',
  'bused',
  'busied',
  'bussed',
  'busses',
  'busted'],
 ['abase',
  'abash',
  'abbe',
  'abed',
  'abet',
  'abets',
  'able',
  'ably',
  'Abos',
  'about',
  'Abu',
  'abuse',
  'abut',
  'abuts',
  'abyss',
  'aces',
  'adds',
  'ads',
  'ages',
  'ague',
  'aids',
  'aims',
  'airs',
  'alas',
  'album',
  'albums',
  'alms',
  'alum',
  'ambush',
  'amuse',
  'ants',
  'anus',
  'apes',
  'aqua',
  'arcs',
  'arms',
  'arts',
  'as',
  'asks',
  'ass',
  'awls',
  'axes',
  'axis',
  'ayes',
  'babes',
  'Babul',
  'bud',
  'buds',
  'bug',
  'bugs',
  'bum',
  'bums',
  'bun',
  'buns',
  'bus',
  'bush',
  'buss',
  'bust',
  'busy',
  'but',
  'buy',
  'buys',
  'cabs',
  'deus',
  'ebbs',


It's not entirely satisfactory, but it might work well for correcting the spelling of small sets of words, for example, state names.
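
For a small vocabulary like that, the standard library offers a different route. The sketch below swaps in `difflib.get_close_matches`, which ranks candidates by a similarity ratio rather than by an edit budget, so it is not equivalent to the fuzzy regex above. The list `STATES`, the function `correct`, and the cutoff of 0.7 are assumptions chosen for illustration.

```python
import difflib

# A small, hypothetical dictionary of correct spellings.
STATES = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
          'Colorado', 'Kansas', 'Mississippi', 'Missouri', 'Tennessee']

def correct(word, vocabulary=STATES, cutoff=0.7):
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for typo in ['Calfornia', 'Missisippi', 'Tenessee', 'Xyzzy']:
    print(typo, '->', correct(typo))
```

The comparison is case-sensitive, so input would need to be normalized first if capitalization can vary.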

### Multiple captures
It's now possible to obtain information on all the successful matches of a repeated capture group, not just the last one.  Use **captures** instead of **group**.

In [107]:
dna = 'cgcgcATGcgcattcgggcgTGAcgcgcgTAGcgcgcgcgc'
dna = dna.lower()
orfpat = r'(?x) ( atg ( (?!tga|tag|taa) ... )*  (?:tga|tag|taa))'
search(orfpat,dna).captures(1)
search(orfpat,dna).captures(2)

['atgcgcattcgggcgtga']

['cgc', 'att', 'cgg', 'gcg']
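
For contrast, here is a sketch of the same search with the standard library `re` module, which keeps only the last repetition of a repeated group. The codons are recovered afterwards by slicing the matched reading frame; the variable names are chosen for this example only.

```python
import re

dna = 'cgcgcATGcgcattcgggcgTGAcgcgcgTAGcgcgcgcgc'.lower()
orfpat = r'(?x) ( atg ( (?!tga|tag|taa) ... )*  (?:tga|tag|taa))'

m = re.search(orfpat, dna)
orf = m.group(1)          # the whole open reading frame
last_codon = m.group(2)   # re remembers only the final repetition of group 2

# Recover every codon between the start and stop codons by stepping in threes.
inner = orf[3:-3]
codons = [inner[i:i + 3] for i in range(0, len(inner), 3)]

print(orf)
print(last_codon)
print(codons)
```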

We can also capture things by name.  The string **s** is an excerpt of a long file describing a gene network.  Each line contains two gene names, and the strength of the connection between them.  In this example, we are only interested in gathering the gene names.

In [109]:
s = """AT1G01280	AT1G01450	5.1E-3
AT1G01480	AT1G01560	2.3E-2
AT1G01600	AT1G01610	1.6E-2
AT1G01430	AT1G01630	2.1E-2
AT1G01150	AT1G01700	1.1E-2
"""
m = match(
    r'(?x) (?: (?P<geneA>\w+) \s+ (?P<geneB>\w+) \s+ \S+ \n )*',
    s)
m.capturesdict()

{'geneA': ['AT1G01280', 'AT1G01480', 'AT1G01600', 'AT1G01430', 'AT1G01150'],
 'geneB': ['AT1G01450', 'AT1G01560', 'AT1G01610', 'AT1G01630', 'AT1G01700']}

We can even reuse a name. All of the captures are then collected under that one name, in the order they were matched:

In [97]:

m = match(
    r'(?x) (?: (?P<gene>\w+) \s+ (?P<gene>\w+) \s+ \S+ \n )*',
    s)

m.capturesdict()


{'gene': ['AT1G01280',
  'AT1G01450',
  'AT1G01480',
  'AT1G01560',
  'AT1G01600',
  'AT1G01610',
  'AT1G01430',
  'AT1G01630',
  'AT1G01150',
  'AT1G01700']}

In [98]:
sorted(set(m.capturesdict()['gene']))

['AT1G01150',
 'AT1G01280',
 'AT1G01430',
 'AT1G01450',
 'AT1G01480',
 'AT1G01560',
 'AT1G01600',
 'AT1G01610',
 'AT1G01630',
 'AT1G01700']
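
The standard library `re` module has no `capturesdict`, and it rejects a pattern that defines the same group name twice. A rough equivalent, sketched below under the assumption that each record sits on its own line, is to match one line at a time with `finditer` and accumulate the named groups yourself. The shortened sample string and the names `line_pat` and `captures` are illustrative.

```python
import re

s = ("AT1G01280\tAT1G01450\t5.1E-3\n"
     "AT1G01480\tAT1G01560\t2.3E-2\n"
     "AT1G01600\tAT1G01610\t1.6E-2\n")

# One match per line instead of one big repeated group.
line_pat = re.compile(r'(?P<geneA>\w+)\s+(?P<geneB>\w+)\s+\S+\n')

captures = {'geneA': [], 'geneB': []}
for m in line_pat.finditer(s):
    for name, value in m.groupdict().items():
        captures[name].append(value)

# Merge both columns into one sorted list of distinct gene names.
genes = sorted(set(captures['geneA']) | set(captures['geneB']))
print(captures)
print(genes)
```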

### Reverse searching

Searches can now work backwards, using the **(?r)** flag:

Note: the result of a reverse search is not necessarily the reverse of a forward search:

In [112]:
findall(r"(?r)..", "abcde")

['de', 'bc']
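
To see why the two directions differ, compare with a forward search in the standard library `re` module. The sketch below also imitates the reverse search by reversing the text, matching forwards, and flipping each match back. That trick only works for patterns that read the same in both directions, such as `..`, so treat it as an illustration rather than a general substitute.

```python
import re

text = 'abcde'

# Forward search pairs characters from the left and leaves 'e' over.
forward = re.findall(r'..', text)

# Imitated reverse search: pairs from the right, leaving 'a' over.
backward = [m[::-1] for m in re.findall(r'..', text[::-1])]

print(forward)
print(backward)
```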

Who cares?  I thought of an example: inserting thousands separators, which have to be counted from the right.

In [100]:
sub(r'(?rx) (\d\d\d)',
    r',\1',
    '1 mile = 1760 yards = 5280 ft = 63360 in = 1609344 mm = 160934.4 cm, more or less.  Pi = 3.14159')

'1 mile = 1,760 yards = 5,280 ft = 63,360 in = 1,609,344 mm = ,160,934.4 cm, more or less.  Pi = 3.14,159'

In [101]:
sub(r'(?rx)  (?<=\d) (\d\d\d)',
    r',\1',
    '1 mile = 1760 yards = 5280 ft = 63360 in = 1609344 mm = 160934.4 cm, more or less.  Pi = 3.14159')

'1 mile = 1,760 yards = 5,280 ft = 63,360 in = 1,609,344 mm = 160,934.4 cm, more or less.  Pi = 3.14,159'

In [102]:
sub(r'(?rx)  (?<! [.] \d*) (?<=\d) (\d\d\d)',
    r',\1',
    '1 mile = 1760 yards = 5280 ft = 63360 in = 1609344 mm = 160934.4 cm, more or less.  Pi = 3.14159')

'1 mile = 1,760 yards = 5,280 ft = 63,360 in = 1,609,344 mm = 160,934.4 cm, more or less.  Pi = 3.14159'
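
The standard library `re` module has no reverse flag and does not allow a variable-length lookbehind such as `(?<! [.] \d*)`. One possible workaround, sketched below, is to match each number as a whole, group only the digits before the decimal point, and leave the fractional part alone. The function name `add_commas` is made up for this example.

```python
import re

def add_commas(text):
    """Insert thousands separators into the integer part of each number."""
    def group_digits(m):
        whole, frac = m.group(1), m.group(2) or ''
        # Insert a comma wherever a multiple of three digits remains to the right.
        whole = re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', whole)
        return whole + frac
    # Group 1 is the integer part, group 2 the optional fractional part.
    return re.sub(r'(\d+)(\.\d+)?', group_digits, text)

print(add_commas('1 mile = 1760 yards = 5280 ft = 63360 in = 1609344 mm '
                 '= 160934.4 cm, more or less.  Pi = 3.14159'))
```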

### POSIX Matching (Leftmost Longest)

The default matching method for alternations is to match the first alternative that will match. The POSIX standard is to find the leftmost longest match. This can be turned on using the POSIX flag **(?p)**.

In [113]:
list(finditer( r'(dog|doge|doggerel)', 'The doge wrote nothing but doggerel.'))

[<regex.Match object; span=(4, 7), match='dog'>,
 <regex.Match object; span=(27, 30), match='dog'>]

In [104]:

list(finditer( r'(?p)(dog|doge|doggerel)', 'The doge wrote nothing but doggerel.'))


[<regex.Match object; span=(4, 8), match='doge'>,
 <regex.Match object; span=(27, 35), match='doggerel'>]
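
The standard library `re` module always uses the first-alternative rule and has no POSIX flag. When the alternatives are plain literal strings, as they are here, listing the longest ones first gives the same result. The sketch below assumes exactly that case; it does not emulate leftmost-longest matching for arbitrary subpatterns.

```python
import re

words = ['dog', 'doge', 'doggerel']
text = 'The doge wrote nothing but doggerel.'

# Default behaviour: the first alternative that matches wins.
first_wins = re.findall('|'.join(map(re.escape, words)), text)

# Sorting the literals by length, longest first, prefers the longest match.
by_length = sorted(words, key=len, reverse=True)
longest_wins = re.findall('|'.join(map(re.escape, by_length)), text)

print(first_wins)
print(longest_wins)
```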


### **fullmatch**
The pattern must match the entire string.

In [105]:

match(r'The doge', 'The doge wrote nothing but doggerel.')


<regex.Match object; span=(0, 8), match='The doge'>

In [106]:

fullmatch(r'The doge', 'The doge wrote nothing but doggerel.')


Nope, that one didn't match: **fullmatch** returned **None**, so the cell has no output.
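
The standard library `re` module also provides `fullmatch` (since Python 3.4), so the difference can be checked without the regex module. The sketch below compares `match`, `fullmatch`, and the older idiom of anchoring the pattern with `\Z`.

```python
import re

text = 'The doge wrote nothing but doggerel.'

prefix = re.match(r'The doge', text)           # matches at the start only
whole = re.fullmatch(r'The doge', text)        # None: text continues past 'doge'
whole_ok = re.fullmatch(r'The doge.*', text)   # the pattern now covers everything
anchored = re.match(r'(?:The doge)\Z', text)   # same effect as fullmatch: None

print(prefix)
print(whole)
print(whole_ok)
print(anchored)
```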

### Partial matching
Can the target string be extended to match the pattern?  The optional **partial** argument to **match**, **search**, and **fullmatch** can answer this question. This could be useful if you are validating input from the terminal, for example.

The next call succeeds, because the target can be extended to 'The doge wrote nothing but doggerel.' to match the pattern.  But, if you think about it, you will see that this pattern partially matches any target string, since any string can be extended with 'dogdog'.

In [119]:
fullmatch(r'.*dog.*dog.*', 'The doge wrote nothing', partial=True)

<regex.Match object; span=(0, 22), match='The doge wrote nothing', partial=True>

This one is more interesting: Can the string be extended to be a Social Security Number?

In [120]:
fullmatch(r'\d\d\d-\d\d-\d\d\d\d',  "999-89-7", partial=True)

<regex.Match object; span=(0, 8), match='999-89-7', partial=True>

In [121]:
match(r'\d\d\d-\d\d-\d\d\d\d',  "999-89-7", partial=True)

<regex.Match object; span=(0, 8), match='999-89-7', partial=True>

In [122]:
fullmatch(r'\d\d\d-\d\d-\d\d\d\d',  "My SSN is 999-89-7", partial=True)

In [123]:
match(r'\d\d\d-\d\d-\d\d\d\d',  "My SSN is 999-89-7", partial=True)

In [124]:
search(r'\d\d\d-\d\d-\d\d\d\d',  "My SSN is 999-89-7", partial=True)

<regex.Match object; span=(10, 18), match='999-89-7', partial=True>

Notice that this one is a complete match, so the **partial** field is missing from the match object:

In [125]:
search(r'\d\d\d-\d\d-\d\d\d\d',  "My SSN is 999-89-7654, but don't tell.", partial=True)

<regex.Match object; span=(10, 21), match='999-89-7654'>

In [126]:
search(r'\d\d\d-\d\d-\d\d\d\d',  "My SSN is 999-89-76, but don't tell.", partial=True)

<regex.Match object; span=(36, 36), match='', partial=True>
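
The standard library `re` module has no **partial** argument. For a fixed-shape pattern like the Social Security Number, one way to approximate the idea is to build a pattern in which every remaining position is optional, so that any valid prefix matches in full. This is a sketch of that idea only; the helper names `prefix_pattern` and `could_become_ssn` are made up, and the approach assumes the pattern is a simple sequence of single-character tokens.

```python
import re

def prefix_pattern(tokens):
    """Compile a regex that matches any prefix of the token sequence."""
    pattern = ''
    # Nest from the right: each token is optional, together with all that follow.
    for tok in reversed(tokens):
        pattern = f'(?:{tok}{pattern})?'
    return re.compile(pattern)

# ddd-dd-dddd, one token per character position.
SSN_TOKENS = [r'\d'] * 3 + ['-'] + [r'\d'] * 2 + ['-'] + [r'\d'] * 4
ssn_prefix = prefix_pattern(SSN_TOKENS)

def could_become_ssn(s):
    """True if s is a complete SSN or can be extended into one."""
    return ssn_prefix.fullmatch(s) is not None

for candidate in ['999-89-7', '999-89-7654', '99-', 'My SSN is 999-89-7']:
    print(repr(candidate), could_become_ssn(candidate))
```

Unlike the regex module's match object, this check does not say whether the input is already complete; a separate `fullmatch` against the original pattern would answer that.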

### Some functional programming fun

In [130]:

def twice(f):
    return lambda x: f(f(x))

def prepender(s):
    return lambda t: s + t

twice(twice)(twice(prepender('spam ')))('eggs and spam')


'spam spam spam spam spam spam spam spam eggs and spam'

In [131]:

twice(twice)(twice(prepender(len('spam '))))(len('eggs and spam'))


53
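
Why 53? A small sketch that counts calls makes the arithmetic visible: `twice(twice)` applies its argument four times, and `twice(f)` applies `f` twice, so the inner function runs eight times. The helper `step` and the list `calls` are introduced here only for counting.

```python
def twice(f):
    return lambda x: f(f(x))

calls = []

def step(x):
    """Add one and record the call, so the applications can be counted."""
    calls.append(x)
    return x + 1

result = twice(twice)(twice(step))(0)

print(result, len(calls))
# Eight prepended lengths of 'spam ' plus the length of the original text.
print(8 * len('spam ') + len('eggs and spam'))
```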

### Puzzle
Which character is most frequent in a string?

In [132]:
def most(s, care_about=r'\w'):
    t=''.join(sorted(s))
    p = r'((' + care_about + r')\2*)'
    L = [ m.group(1) for m in finditer(p, t) ]
    m = max(L, key=len)
    return (m[0], len(m))

most('123462232340997092')

('2', 5)

In [133]:
most('123462232340997092', care_about='[13579]')

('3', 3)

In [134]:
most(twice(twice)(twice(prepender('spam ')))('eggs and spam'))

('a', 10)

In [135]:
most(twice(twice)(twice(prepender('spam ')))('eggs and spam'), r'[^aeiou\s]')

('s', 10)
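
As a cross-check on the regex solution, here is a sketch of the same puzzle using `collections.Counter` from the standard library. The name `most_common_char` is made up for this example. One caveat: when several characters tie for first place, this version returns the one that appears first in the string, whereas the sorted-string approach above returns the one that sorts first, so the two can disagree on ties.

```python
import re
from collections import Counter

def most_common_char(s, care_about=r'\w'):
    """Return (character, count) for the most frequent character of interest."""
    # Keep only the characters that fully match the one-character pattern.
    counts = Counter(ch for ch in s if re.fullmatch(care_about, ch))
    return counts.most_common(1)[0]

print(most_common_char('123462232340997092'))
```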