# Regex examples in Python for LIS452

Create some data to work with:

In [5]:
import re
az = 'The quick brown fox jumped over the lazy dog.'
name_list = ['Jane Jones','Jerry Jerrolds', 'Abby Abbot-Majors', 'Albert Einstein', 'Martin Luther', 
             'Martin Luther King, Jr.', 'Richard Richards', 'John Johnson', 'John Cleese', 
             'Eric Idle', 'Graham Chapman', 'Julius Cæsar', 'Sofía Vergara', 'Robert Robertson', 
             'Alexander McCall-Smith', 'Jeff MacNelly', "Conan O'Brien", 'Billy Connolly',
             'Ronald McDonald', 'Apple Macintosh', 'Sir James Bond', 'Sir Ian McKellen', 
             'Sir Paul McCartney', 'Dame Judy Dench', 'Lord Robert Grantham', 'Dame Maggie Smith',
             'Tyrion Lannister', 'Lord Eddard Stark', 'King Richard I', 'King Henry V', 
             'Julia Louis-Dreyfus', 'Lady Catelyn Stark', 'Lord Tywin Lannister', 'Ser Jamie Lannister',
             'Jack Lord', 'Bobby McFerrin', 'Cher', 'Madonna', 'Mrs. Nora Charles',
             'Queen Cersei Lannister', 'Daniel Michael Blake Day-Lewis', 'F. Scott Fitzgerald',
             'J. R. R. Tolkien', 'Douglas Fairbanks, Senior', 'Sting', 'Joe Schmoe III', 
             'King Henry VIII', 'Br. Cadfael', 'Dr. Albert Schweitzer',
             '孔子'  # 'Master Kong' (Confucius)
            ]
name_list.sort()  # alphabetize the list
names = '; '.join(name_list)

In [6]:
names

"Abby Abbot-Majors; Albert Einstein; Alexander McCall-Smith; Apple Macintosh; Billy Connolly; Bobby McFerrin; Br. Cadfael; Cher; Conan O'Brien; Dame Judy Dench; Dame Maggie Smith; Daniel Michael Blake Day-Lewis; Douglas Fairbanks, Senior; Dr. Albert Schweitzer; Eric Idle; F. Scott Fitzgerald; Graham Chapman; J. R. R. Tolkien; Jack Lord; Jane Jones; Jeff MacNelly; Jerry Jerrolds; Joe Schmoe III; John Cleese; John Johnson; Julia Louis-Dreyfus; Julius Cæsar; King Henry V; King Henry VIII; King Richard I; Lady Catelyn Stark; Lord Eddard Stark; Lord Robert Grantham; Lord Tywin Lannister; Madonna; Martin Luther; Martin Luther King, Jr.; Mrs. Nora Charles; Queen Cersei Lannister; Richard Richards; Robert Robertson; Ronald McDonald; Ser Jamie Lannister; Sir Ian McKellen; Sir James Bond; Sir Paul McCartney; Sofía Vergara; Sting; Tyrion Lannister; 孔子"

## Raw strings in Python
Get in the habit of writing your patterns as **raw strings**. Regex patterns often contain backslashes, and in a raw string you don't have to escape each one.

In [7]:
print("This regular string has a single backslash: \\ but I had to escape it with an extra one.")
print(r"This raw string contains a single \ and doesn't need the escaping")

This regular string has a single backslash: \ but I had to escape it with an extra one.
This raw string contains a single \ and doesn't need the escaping
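
To see why this matters for regexes, here is a small sketch (the variable names `regular_hits` and `raw_hits` are illustrative). In a regular string, Python turns `\b` into a backspace character before `re` ever sees it, so a pattern meant to use word boundaries quietly matches nothing:

```python
import re

az = 'The quick brown fox jumped over the lazy dog.'

# Regular string: '\b' is the backspace character (\x08), not a word boundary.
regular_hits = re.findall('\bfox\b', az)

# Raw string: the backslash reaches the regex engine, so \b means "word boundary".
raw_hits = re.findall(r'\bfox\b', az)

print(regular_hits)  # []
print(raw_hits)      # ['fox']
```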


## `re.search()` for simple True/False match decisions
From `name_list`, print the names that contain a hyphen:

In [None]:
for name in name_list:
    if re.search(r'-', name):
        print(name)
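
As a sketch of what the `if` is testing: `re.search()` returns a match object when the pattern is found and `None` otherwise, so its result can be used directly as a condition. The two names come from the list above; `hit` and `miss` are illustrative variable names.

```python
import re

hit = re.search(r'-', 'Julia Louis-Dreyfus')
miss = re.search(r'-', 'Jane Jones')

print(hit is not None)  # True: a match object is truthy
print(hit.start())      # 11: index of the first hyphen
print(miss)             # None: falsy, so the if-branch is skipped
```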

## `re.findall()` and a capture group to get a list of matches
To do the same as above while extracting from one big string containing all the names, the regex pattern has to be more complex: it must find where each name begins and ends (here, the names are delimited by semicolons) and use a **capture group** to return the name:

In [None]:
matches = re.findall(r'([^;]+-[^;]+)', names)
print(matches)
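
One thing to watch for: because the names were joined with `'; '`, the pattern above keeps the space that follows each semicolon. A possible variant, sketched here on a short sample string rather than the full `names` value, requires the first character of each match to be neither a semicolon nor whitespace:

```python
import re

sample = 'Jane Jones; Julia Louis-Dreyfus; Cher; Abby Abbot-Majors'

# Original pattern: matches can start with the space after '; '.
with_space = re.findall(r'([^;]+-[^;]+)', sample)

# Variant: [^;\s] forces each match to begin on a non-space character.
trimmed = re.findall(r'([^;\s][^;]*-[^;]+)', sample)

print(with_space)  # [' Julia Louis-Dreyfus', ' Abby Abbot-Majors']
print(trimmed)     # ['Julia Louis-Dreyfus', 'Abby Abbot-Majors']
```

Calling `.strip()` on each result of the original pattern would be an equally reasonable fix.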

Find all the Lannisters:

In [None]:
for name in name_list:
    if re.search(r'', name):
        print(name)
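
The pattern in this cell is left blank as an exercise. One possible answer, offered as a sketch rather than the intended solution, puts word boundaries around the surname. The `sample` list is a small subset of `name_list` chosen for illustration:

```python
import re

sample = ['Tyrion Lannister', 'Lord Eddard Stark', 'Ser Jamie Lannister', 'Jack Lord']

# \b on each side keeps the match to the whole word "Lannister".
lannisters = [name for name in sample if re.search(r'\bLannister\b', name)]
print(lannisters)  # ['Tyrion Lannister', 'Ser Jamie Lannister']
```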

## Simple validation of input

In [None]:
s = input('Enter a string and I will determine if it is purely digits: ')
if re.search(r'', s):
    print('Yes, it is purely digital.')
else:
    print('Nope.')
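
The pattern in this cell is also left blank. A minimal sketch of one way to fill it in uses `re.fullmatch()`, which succeeds only if the whole string fits the pattern. The helper name `is_all_digits` is hypothetical, and the choice of `[0-9]` over `\d` is an assumption: by default `\d` also accepts digits from other scripts, which may or may not be what you want.

```python
import re

def is_all_digits(s):
    """Return True if s consists only of one or more ASCII digits."""
    return re.fullmatch(r'[0-9]+', s) is not None

# Try a few candidates, including the empty string and a trailing newline.
results = {candidate: is_all_digits(candidate)
           for candidate in ['12345', '12a45', '', '123\n']}
print(results)
```

With `re.search()`, the equivalent would need anchors at both ends, and `\Z` is a safer end anchor than `$`, because `$` also matches just before a trailing newline.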

List the names of all Kings, Queens, Lords, Ladies, Dames, and Sirs or Sers. Notice we have to be careful -- names like *Jack Lord* could incorrectly match:

In [None]:
for name in name_list:
    if re.search(r'', name):
        print(name)
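
One possible pattern for this exercise, as a sketch: anchor the title to the start of the name and require whitespace after it, so a surname like *Lord* or *King* later in the name cannot match. The `sample` list is an illustrative subset of `name_list`.

```python
import re

sample = ['Jack Lord', 'Lord Eddard Stark', 'Sir James Bond',
          'Ser Jamie Lannister', 'Martin Luther King, Jr.', 'Dame Judy Dench']

# \A pins the title to the start of the name; \s requires a space after it.
title_pattern = r'\A(?:King|Queen|Lord|Lady|Dame|Sir|Ser)\s'

titled = [name for name in sample if re.search(title_pattern, name)]
print(titled)
# ['Lord Eddard Stark', 'Sir James Bond', 'Ser Jamie Lannister', 'Dame Judy Dench']
```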

List all the single word names like *Sting*:
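
No solution cell is given for this exercise. A minimal sketch, assuming "single word" means the name contains no whitespace at all, is to require the whole name to be one run of non-whitespace characters. The `sample` list is an illustrative subset of `name_list`.

```python
import re

sample = ['Cher', 'Jane Jones', 'Sting', 'Br. Cadfael', '孔子']

# fullmatch succeeds only if the entire name is one run of non-whitespace.
single_word = [name for name in sample if re.fullmatch(r'\S+', name)]
print(single_word)  # ['Cher', 'Sting', '孔子']
```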

List the names that look Irish or Scottish: last names starting with *O'...*, *Mac*, or *Mc* followed by a capital letter. *Macintosh*, for example, should not match:
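
No solution cell is given here either. One possible pattern, sketched on an illustrative subset of `name_list`, looks for the prefix at a word boundary and then insists on an uppercase letter:

```python
import re

sample = ["Conan O'Brien", 'Jeff MacNelly', 'Apple Macintosh',
          'Ronald McDonald', 'Billy Connolly']

# O' or Mc/Mac (the "a" is optional) at a word boundary, then a capital letter.
celtic_pattern = r"\b(?:O'|Ma?c)[A-Z]"

celtic = [name for name in sample if re.search(celtic_pattern, name)]
print(celtic)  # ["Conan O'Brien", 'Jeff MacNelly', 'Ronald McDonald']
```

*Macintosh* is rejected because the letter after *Mac* is lowercase.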

List the names of Kings, Queens, Lords, Ladies, Dames, and Sirs or Sers, but DON'T include their title:

In [8]:
matches = re.findall(r'(?:[^;]+(?:King|Queen|Lord|Lady|Dame|Sir|Ser)\s+)([^;]+)', names)
print(matches)

['Judy Dench', 'Maggie Smith', 'Henry V', 'Henry VIII', 'Richard I', 'Catelyn Stark', 'Eddard Stark', 'Robert Grantham', 'Tywin Lannister', 'Cersei Lannister', 'Jamie Lannister', 'Ian McKellen', 'James Bond', 'Paul McCartney']
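
The `(?:...)` groups in the pattern above match without capturing, which is why `re.findall()` returns only the names. As a sketch on a short sample string (an illustrative subset, not the full `names` value), compare what `findall` returns when the title is captured as well:

```python
import re

sample = 'Sir James Bond; Dame Judy Dench; Jack Lord'

# Non-capturing title group: findall returns just the one captured group.
names_only = re.findall(r'(?:Sir|Dame)\s+([^;]+)', sample)

# Capturing title group: findall returns one tuple per match.
title_and_name = re.findall(r'(Sir|Dame)\s+([^;]+)', sample)

print(names_only)      # ['James Bond', 'Judy Dench']
print(title_and_name)  # [('Sir', 'James Bond'), ('Dame', 'Judy Dench')]
```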


List the names that have non-English letters or glyphs in them:

In [None]:
for name in name_list:
    if re.search(r'[^\w\s\-\',\.]', name, flags=re.ASCII):
        print(name)

Do the same without the `re.ASCII` *flag*, using the more specific `[a-zA-Z]` instead of `\w` to match letters. By default, `\w` matches any character that Unicode classifies as a letter or digit, plus the underscore.

In [None]:
for name in name_list:
    if re.search(r'[^a-zA-Z\s\-\',\.]', name):
        print(name)
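
To make the effect of the flag concrete, here is a small sketch comparing `\w+` with and without `re.ASCII` on one name from the list. The accented letter is written as an escape so the example does not depend on the file's encoding; the variable names are illustrative.

```python
import re

name = 'Sof\u00eda Vergara'  # 'Sofía Vergara'

# Default (Unicode) matching: the accented letter counts as a word character.
unicode_words = re.findall(r'\w+', name)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the accented letter splits the word.
ascii_words = re.findall(r'\w+', name, flags=re.ASCII)

print(unicode_words)  # ['Sofía', 'Vergara']
print(ascii_words)    # ['Sof', 'a', 'Vergara']
```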

List the names that are repetitive in the way that *'John Johnson'* or *'Rob Roberts'* are:
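
No solution cell is given for this exercise. One way to approach it, as a sketch, is a **backreference**: capture the start of one word and require the next word to begin with the same text. The minimum of three shared letters is an assumption about what counts as "repetitive", and `sample` is an illustrative subset of `name_list`.

```python
import re

sample = ['John Johnson', 'Jerry Jerrolds', 'Jane Jones',
          'John Cleese', 'Richard Richards', 'Robert Robertson']

# (\w{3,}) captures at least three leading letters of one word; \w* skips the
# rest of that word; \1 then requires the next word to start with the same text.
repeat_pattern = r'\b(\w{3,})\w*\s+\1'

repetitive = [name for name in sample if re.search(repeat_pattern, name)]
print(repetitive)
# ['John Johnson', 'Jerry Jerrolds', 'Richard Richards', 'Robert Robertson']
```

*Jane Jones* is rejected because the two words share only their first letter.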

## Multiple capture groups in one pattern
## Verbose commented regexes
Separate names into their parts: title, first name, middle name or initial, last name, and suffix. Because this short list includes such a variety of name formats, the regex that separates them properly is fairly complex, so I've used the `re.VERBOSE` flag, which allows comments and whitespace inside the pattern for better readability.

In [None]:
print('     Title       First              Middle                Last    Suffix')  # column headings
print('---------- ----------- ------------------- ------------------- ---------')
for name in name_list:
    m = re.search(r"""\A  # anchor at start of string
                    (Mr\.|Mrs\.|Fr\.|Br\.|Dr\.|Sir|Ser|Lord|Lady|Dame|Master|King|Queen|) # title or nothing
                    \s?              # possible space
                    ([\w\.]+)\s*     # First name or initial.
                    ([\w\.\ ]*?)\s*  # Middle name(s) or initial(s)
                    ([\w\-\']*?)\s*  # Last name, allowing hyphens
                    (?:,\s+)?        # optionally match a comma and spaces without capturing
                    (Jr\.|Sr\.|Junior|Senior|[IVX]+|)  # things like "Jr." or "VIII"
                    \Z  # anchor at end of string
                    """, name, flags=re.VERBOSE)
    print('{0: >10}'.format(m.group(1)), end='')
    print('{0: >12}'.format(m.group(2)), end='')
    print('{0: >20}'.format(m.group(3)), end='')
    print('{0: >20}'.format(m.group(4)), end='')
    print('{0: >10}'.format(m.group(5)))
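
For a smaller view of what `re.VERBOSE` changes, here is a sketch with a deliberately short pattern (not the one above). The same regex is written compactly and in verbose form, and the last two lines show that an unescaped space in a verbose pattern is ignored, which is why the pattern above writes `\ ` for a literal space.

```python
import re

text = 'Martin Luther King, Jr.'

compact = re.compile(r'(\w+),\s*(Jr\.|Sr\.)')
verbose = re.compile(r"""
    (\w+)         # last word before the comma
    ,\s*          # comma and optional whitespace
    (Jr\.|Sr\.)   # suffix
    """, re.VERBOSE)

compact_groups = compact.search(text).groups()
verbose_groups = verbose.search(text).groups()
print(compact_groups)  # ('King', 'Jr.')
print(verbose_groups)  # ('King', 'Jr.')

# Unescaped spaces are ignored in verbose mode, so a literal space needs a backslash.
unescaped = re.search(r'Luther King', text, re.VERBOSE)
escaped = re.search(r'Luther\ King', text, re.VERBOSE)
print(unescaped)        # None
print(escaped.group())  # Luther King
```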

## Named groups
This is identical to the previous example, except that each capture group has been given a name in the regex. The matches can then be retrieved by name instead of by position, much as a dictionary is indexed by key where a list is indexed by position.

In [4]:
print('     Title       First              Middle                Last    Suffix')  # column headings
print('---------- ----------- ------------------- ------------------- ---------')
for name in name_list:
    m = re.search(r"""\A  # anchor at start of string
                    (?P<title>Mr\.|Mrs\.|Fr\.|Br\.|Dr\.|Sir|Ser|Lord|Lady|Dame|Master|King|Queen|) # title or nothing
                    \s?  # possible space
                    (?P<first>[\w\.]+)\s*    # First name or initial.
                    (?P<mid>[\w\.\ ]*?)\s*   # Middle name(s) or initial(s)
                    (?P<last>[\w\-\']*?)\s*  # Last name, allowing hyphens
                    (?:,\s+)?                # optionally match a comma and spaces without capturing
                    (?P<suffix>Jr\.|Sr\.|Junior|Senior|[IVX]+|)  # things like "Jr." or "VIII"
                    \Z  # anchor at end of string
                    """, name, flags=re.VERBOSE)
    print('{0: >10}'.format(m.group('title')), end='')
    print('{0: >12}'.format(m.group('first')), end='')
    print('{0: >20}'.format(m.group('mid')), end='')
    print('{0: >20}'.format(m.group('last')), end='')
    print('{0: >10}'.format(m.group('suffix')))

     Title       First              Middle                Last    Suffix
---------- ----------- ------------------- ------------------- ---------
                  Abby                            Abbot-Majors          
                Albert                                Einstein          
             Alexander                            McCall-Smith          
                 Apple                               Macintosh          
                 Billy                                Connolly          
                 Bobby                                McFerrin          
       Br.     Cadfael                                                  
                  Cher                                                  
                 Conan                                 O'Brien          
      Dame        Judy                                   Dench          
      Dame      Maggie                                   Smith          
                Daniel       Michael Blake         
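
The dictionary comparison can be taken literally: a match object's `groupdict()` method returns the named groups as a `dict`. The sketch below uses a deliberately simplified pattern (title, first name, and last name only) and a hypothetical helper, `split_name`, which also guards against `re.search()` returning `None` for names that do not fit.

```python
import re

simple_name = re.compile(r"""\A
    (?P<title>Dr\.|Sir|Dame|)   # title or nothing
    \s?                         # possible space
    (?P<first>\w+)\s+           # first name
    (?P<last>[\w\-']+)          # last name, allowing hyphens and apostrophes
    \Z""", re.VERBOSE)

def split_name(name):
    """Return the named groups as a dict, or None if the name does not fit."""
    m = simple_name.search(name)
    return m.groupdict() if m else None

print(split_name('Dr. Albert Schweitzer'))
# {'title': 'Dr.', 'first': 'Albert', 'last': 'Schweitzer'}
print(split_name('Cher'))  # None: this simplified pattern requires a last name
```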

In [None]:
from pathlib import Path
data_dir = Path('../lis452-solutions/data_files')
# read_text() opens, reads, and closes each file
utopia = (data_dir / 'utopia.txt').read_text()
gulliver = (data_dir / 'gulliver.txt').read_text()
gulliver_orig = (data_dir / '17157-8.txt').read_text(encoding='ISO-8859-1')