# Day 4: Strings and Regular Expressions

## `str` Review
- We have been working with strings up to this point in the class
- `str` objects are collections of characters
- enclosed in single or double quotes

### Indexing
- `str` objects are indexed similarly to lists

In [1]:
# str indexing
s = "The quick brown fox jumped over the lazy dog."

# get the first 3 characters
print(s[:3])

# get every other character, starting with the second character
print(s[1::2])

# reverse the string
print(s[::-1])

The
h uc rw o updoe h aydg
.god yzal eht revo depmuj xof nworb kciuq ehT


### Conversion to `str`
- We can convert other types of objects to strings with `str`

In [2]:
print(str(1001))

1001


## `str` methods
- String methods never modify the original string. Strings are immutable, so each method returns a new string

In [3]:
x = 'lol'
print("applying upper():")
print(x.upper())

print('the value of x after the upper function has been called:')
print(x)

y = x.upper()
print("after y = x.upper(), x =")
print(x)

print('y =')
print(y)

print('y == x')
print(y == x)

print("Are y and x the same object?")
print(y is x)


applying upper():
LOL
the value of x after the upper function has been called:
lol
after y = x.upper(), x =
lol
y =
LOL
y == x
False
Are y and x the same object?
False


### Case
- We have already seen `.upper()`, `.lower()` and `.capitalize()`
- `.casefold()` also converts to lowercase but handles more characters than `.lower()`, which makes it the better choice for case-insensitive comparisons
- We can check the case of the string with:
    - `.isupper()` and `.islower()`
- `.swapcase()` swaps the case of every letter so uppercase letters become lowercase and vice versa
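
A minimal sketch of where `.casefold()` and `.lower()` differ. The German word used here is an illustrative choice and does not come from the notes above:

```python
# .lower() leaves the German sharp s (ß) alone; .casefold() expands it to "ss"
word = "Straße"
lowered = word.lower()
folded = word.casefold()
print(lowered)  # straße
print(folded)   # strasse

# casefold is intended for case-insensitive comparisons
same_word = "STRASSE".casefold() == word.casefold()
print(same_word)  # True
```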

In [4]:
x = 'lol'
y = x.upper()

print(x.isupper())
print(y.islower())
print(y.isupper())

False
False
True


In [5]:
"cOdInG iS eAsY".swapcase()

'CoDiNg Is EaSy'

### Splitting Strings into lists
- We can split a string into a list of substrings with `.split`
- `.split` takes a string as input and will break the string wherever it finds the input string
    - The input string is called a delimiter
    - `split` takes a second optional argument specifying how many times to split the string starting from the left

In [6]:
animals = 'aardvark,bear,cat,dog,elephant'
animals.split(',')
# we have split animals on the comma
# notice in the output there are no commas in the strings
# the commas we see are separating the list elements

['aardvark', 'bear', 'cat', 'dog', 'elephant']

In [7]:
s = "The quick brown fox jumped over the lazy dog."
# get every word in s into a list by splitting wherever there is a space
s.split(" ")
# notice how there are no spaces in the output

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']

In [8]:
# we can use any string as a delimiter
# split the string on "fox"
"The quick brown fox jumped over the lazy dog".split("fox")

['The quick brown ', ' jumped over the lazy dog']

In [9]:
# We can specify how many times from the left we want to split the string
# we use the second input to .split

# only split 3 times
"The quick brown fox jumped over the lazy dog.".split(' ',3)
# notice that the string only splits 3 times

['The', 'quick', 'brown', 'fox jumped over the lazy dog.']

In [10]:
# The default number of splits is -1 which is all possible splits
"The quick brown fox jumped over the lazy dog.".split(' ',-1)

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']

In [11]:
# to split from the right side use .rsplit
# if you do not specify how many splits or use -1 rsplit is equivalent to split
"The quick brown fox jumped over the lazy dog.".rsplit(' ',3)

['The quick brown fox jumped over', 'the', 'lazy', 'dog.']
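
Calling `.split()` with no delimiter behaves differently from `.split(' ')`: it splits on any run of whitespace and drops empty strings. A short sketch, using a made-up messy string:

```python
messy = "The  quick\tbrown\n fox"

# splitting on a single space keeps empty strings and other whitespace
by_space = messy.split(" ")
print(by_space)       # ['The', '', 'quick\tbrown\n', 'fox']

# splitting with no argument treats any run of whitespace as one delimiter
by_whitespace = messy.split()
print(by_whitespace)  # ['The', 'quick', 'brown', 'fox']
```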

- `.partition` and `.rpartition` split a string into 3 parts
- They take as input a substring and return a `tuple` with 3 elements
    1. The string before the substring
    2. The substring
    3. The string after the substring
- `.partition` splits at the first occurrence of the substring (searching from the left) and `.rpartition` splits at the last occurrence (searching from the right)

In [12]:
# .partition and .rpartition break a string into three parts using a substring
"The quick brown fox jumped over the lazy dog.".partition("fox")

('The quick brown ', 'fox', ' jumped over the lazy dog.')

In [13]:
'aardvark,bear,cat,dog'.partition('a')
# the first element is an empty string because a is the first letter

('', 'a', 'ardvark,bear,cat,dog')
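
The sketch below, with a made-up `key=value` style string, shows how the two methods differ when the substring occurs more than once and what they return when it is missing:

```python
line = "name=Ada=Lovelace"

# .partition splits at the first "=", .rpartition at the last
first_split = line.partition("=")
last_split = line.rpartition("=")
print(first_split)  # ('name', '=', 'Ada=Lovelace')
print(last_split)   # ('name=Ada', '=', 'Lovelace')

# when the separator is missing, the whole string lands in one slot
missing_left = "no separator".partition("=")
missing_right = "no separator".rpartition("=")
print(missing_left)   # ('no separator', '', '')
print(missing_right)  # ('', '', 'no separator')
```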

### Joining lists of strings
- The opposite of splitting a string into a list of strings is joining a list of strings into a single string
- We can join strings with `.join`
- The string that calls `.join` is the delimiter placed between the list elements in the combined string that is output

In [14]:
animals = ['aardvark', 'bear', 'cat', 'dog', 'elephant']

# make animals into a comma separated string
','.join(animals)

'aardvark,bear,cat,dog,elephant'

In [15]:
animals = ['aardvark', 'bear', 'cat', 'dog', 'elephant']

# make animals into a comma separated string
# I can add a space after the comma if I want there to be space between the words
', '.join(animals)

'aardvark, bear, cat, dog, elephant'

In [16]:
# what if I want to join a list but not have a delimiter inserted into the output string?

animals = ['aardvark', 'bear', 'cat', 'dog', 'elephant']

# I can use an empty string "" as the delimiter
"".join(animals)

'aardvarkbearcatdogelephant'
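
`.join` only works on strings, which is easy to trip over when the list holds numbers. A small sketch with made-up values:

```python
numbers = [3, 1, 4]

# .join only accepts strings, so joining integers raises a TypeError
try:
    ",".join(numbers)
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)  # True

# convert each element to str first
joined = ",".join(str(n) for n in numbers)
print(joined)  # 3,1,4

# splitting and joining on the same delimiter round-trips the string
animals = "aardvark,bear,cat"
round_trip = ",".join(animals.split(","))
print(round_trip == animals)  # True
```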

### Finding substrings in strings
- `.index` and `.find` will search a string for a substring
- `.index` will return the index where the first (from the left side) occurrence of the substring begins
    - An error will occur when calling `.index` with a substring that is not in the string
- `.find`
    - It will return the index where the first occurrence of the substring begins
    - If the substring is not found it will return -1
    - `.find` has two optional arguments, `start` and `end`, which tell it to search only in the slice of the string between `start` and `end`
- The difference between `.index` and `.find` is that `.index` raises a `ValueError` if the substring does not exist, while `.find` returns -1
- `.rindex` and `.rfind` do the same thing except they search from the right side, so they return the index where the last occurrence begins

In [17]:
# find a substring
sentence = "The quick brown fox jumped over the lazy dog."
print(sentence.find('brown'))
print(sentence.find('negate'))

10
-1


In [18]:
my_string = "potato,apple,banana,apple,guava"
print(my_string.find("apple"))
print(my_string.rfind("apple"))

7
20


In [19]:
# search a slice of the string
sentence = "The quick brown fox jumped over the lazy dog."
print(sentence[2:14])
sentence.find("brown",2,14)
# brown is not found because it is not contained completely in the slice

e quick brow


-1
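
A short sketch of how the two methods behave when the substring is missing, reusing the sentence from the cells above:

```python
sentence = "The quick brown fox jumped over the lazy dog."

# .find signals a missing substring with -1
find_result = sentence.find("negate")
print(find_result)  # -1

# .index raises ValueError instead, so wrap it if the substring may be absent
try:
    index_result = sentence.index("negate")
except ValueError:
    index_result = None
print(index_result)  # None

# when only a yes/no answer is needed, the `in` operator is enough
has_brown = "brown" in sentence
print(has_brown)  # True
```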

#### `.startswith` and `.endswith`
- These are pretty intuitive
- `.startswith` takes an input and returns `True` if the sequence starts with the input
- `.endswith` returns `True` if the end of the sequence matches the input

In [20]:
dna_sequence = 'ATGATCCCCGGGGATTGA'
start_codon = 'ATG'
stop_codon = 'TGA'

print(dna_sequence.startswith(start_codon))
print(dna_sequence.endswith(stop_codon))

True
True


In [21]:
# .startswith is equivalent to
def my_startswith(s, starting_str):
    return s[:len(starting_str)] == starting_str

my_startswith('ATGATCCCCGGGGATTGA','ATG')

True

### `.replace`
- You can use `.replace` to replace part of a string with another string
- replace takes the substring to remove and the substring to replace it with as arguments
- It also takes an optional third argument of how many substitutions to make

In [22]:
s = 'burgers kale ice cream sandwiches potato chips kale broccoli tomatoes kale watermelons'
s.replace('kale','pizza')

'burgers pizza ice cream sandwiches potato chips pizza broccoli tomatoes pizza watermelons'

In [23]:
s = 'burgers kale ice cream sandwiches potato chips kale broccoli tomatoes kale watermelons'
# only replace the first two occurrences of kale with pizza
s.replace('kale','pizza',2)

'burgers pizza ice cream sandwiches potato chips pizza broccoli tomatoes kale watermelons'

### String Conditionals
- There are a series of string functions that return a bool based on whether the string meets certain criteria

#### Numbers in strings
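- `.isdecimal`, `.isdigit`, `.isnumeric`
- All three return `True` only if every character in the string is numeric; they differ only in which special Unicode characters they accept
    - `.isdecimal` is the strictest: it only accepts decimal digits such as 0-9
    - `.isdigit` also accepts digit-like characters such as superscripts (`²`)
    - `.isnumeric` is the broadest and also accepts characters such as fractions (`½`)
- None of them accept a sign or a decimal point, so `'-1034'` and `'3.14'` fail all three checks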

In [24]:
print('1034'.isdigit())

True


In [25]:
print('-1034'.isdigit())

False


In [26]:
# False because "." is not a digit
print('3.14'.isdigit())

False


In [27]:
print('a1034'.isdigit())

False


In [28]:
print('1034'.isnumeric())

True


In [29]:
print('-1034'.isnumeric())

False


In [30]:
# False because "." is not a digit
print('3.14'.isnumeric())

False


In [31]:
print('s1034'.isnumeric())

False


In [32]:
print('1034'.isdecimal())

True


In [33]:
print('3.14'.isdecimal())

False


In [34]:
print('0xff'.isdecimal())

False
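
The three checks only disagree on special Unicode characters. The sketch below compares them on a few sample strings chosen for illustration, and adds a hypothetical helper, `is_number`, for the common case of testing whether a string can be converted with `float`:

```python
samples = ["7", "²", "½", "-7"]
results = {}
for text in samples:
    # tuple order: (isdecimal, isdigit, isnumeric)
    results[text] = (text.isdecimal(), text.isdigit(), text.isnumeric())
    print(repr(text), results[text])

# hypothetical helper: accepts signs and decimal points, unlike the str methods
def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

print(is_number("-7"), is_number("3.14"), is_number("a1034"))  # True True False
```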


### Sorting numbers and numeric strings
- Sorting a list of numbers will order the list from least to greatest number
- However, sorting a list of numeric strings compares them character by character (lexicographically), so `'13'` sorts before `'2'`
- You can sort using the `.sort` method of lists or the `sorted` function

In [35]:
num_list = [13, 1, 4, 39, 2, 3, 25]
sorted(num_list)

[1, 2, 3, 4, 13, 25, 39]

In [36]:
# num_list = ['13', '1', '4', '39', '2', '3','25']
num_list = [str(n) for n in num_list]
print(num_list)
# it will sort the list digit by digit
sorted(num_list)

['13', '1', '4', '39', '2', '3', '25']


['1', '13', '2', '25', '3', '39', '4']

#### `.zfill` will pad numeric strings with zeroes to solve this problem
- `.zfill` takes as input the total length the string should have and adds zeroes to the left side of the string until it is that long. Strings that are already long enough are returned unchanged

In [37]:
'10'.zfill(4)

'0010'

In [38]:
'10'.zfill(2)

'10'

In [39]:
num_list = ['13', '1', '4', '39', '2', '3','25']

new_num_list = [n.zfill(2) for n in num_list]
print(new_num_list)
print(sorted(new_num_list))

['13', '01', '04', '39', '02', '03', '25']
['01', '02', '03', '04', '13', '25', '39']


In [40]:
'a'.zfill(3)

'00a'
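
Padding is one way to get numeric order. Another option, sketched here as an alternative rather than a replacement for `.zfill`, is to keep the strings as they are and pass `key=int` to `sorted`. The padding width can also be computed instead of hard-coded:

```python
num_list = ['13', '1', '4', '39', '2', '3', '25']

# compare the strings by their integer value without changing them
sorted_numerically = sorted(num_list, key=int)
print(sorted_numerically)  # ['1', '2', '3', '4', '13', '25', '39']

# pad to the length of the longest string so the width is not hard-coded
width = max(len(n) for n in num_list)
padded_sorted = sorted(n.zfill(width) for n in num_list)
print(padded_sorted)  # ['01', '02', '03', '04', '13', '25', '39']
```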

### Testing Letters in strings
- `.isalpha` checks if all characters in a string are letters
- `.isalnum` checks if all characters in a string are letters or numbers
- `.isspace` returns True if the entire string is whitespace characters

In [41]:
'AaBb'.isalpha()

True

In [42]:
'0xff'.isalnum()

True

In [43]:
'Aa-Bb:Cc'.isalpha()

False

In [44]:
'b3'.isalpha()

False

In [45]:
'    '.isspace()

True

### Dealing with extra whitespace
- `.strip`, `.lstrip`, and `.rstrip` all remove whitespace
- `.strip` removes whitespace from the left and right side
- `.lstrip` only removes whitespace on the left side of the string
- `.rstrip` only removes whitespace from the right side of the string
- All of these functions only remove whitespace from the two ends of the string. They do not remove whitespace that is surrounded by non-whitespace characters
- These functions are invaluable when you have to clean up messy data

In [46]:
'   hello   '.strip()

'hello'

In [47]:
'   hello   '.lstrip()

'hello   '

In [48]:
'   hello   '.rstrip()

'   hello'
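
A sketch of two related clean-up steps on a made-up messy string: collapsing whitespace inside the string by combining `.split()` and `.join`, and passing characters to `.strip` to remove something other than whitespace:

```python
raw = "  \tGene  name:   BRCA1 \n"

# .strip only trims the ends; the inner runs of spaces stay
trimmed = raw.strip()
print(repr(trimmed))    # 'Gene  name:   BRCA1'

# split on any whitespace, then join with single spaces
collapsed = " ".join(raw.split())
print(repr(collapsed))  # 'Gene name: BRCA1'

# .strip accepts the characters to remove as an optional argument
unstarred = "***note***".strip("*")
print(unstarred)        # note
```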

## Escape characters
- Escape characters are character sequences that begin with a backslash (`\`).
- They do not print literally, instead they encode special behaviors
- `\n` is a new line character
- `\t` is tab
- `\\` will print a single backslash
- `\'` and `\"` will print single and double quotes respectively
    - Using escape characters with quotes will tell python to include these characters inside a string instead of terminating the string
- There are more escape characters but these are the most commonly used.

In [49]:
# new line
print("First Line\nSecond Line")

First Line
Second Line


In [50]:
# tab
print('Column0\tColumn1\tColumn2\tColumn3')

Column0	Column1	Column2	Column3


In [51]:
# expandtabs replaces each tab with enough spaces to reach the next tab stop (here every 20 columns)
print('Column0\tColumn1\tColumn2\tColumn3'.expandtabs(20))

Column0             Column1             Column2             Column3


In [52]:
print("What if I want to print \\ or \' or \" in my string?")

What if I want to print \ or ' or " in my string?


### Splitting and joining with escape characters
- Escape characters are useful for splitting and joining
- A common text processing task is to split a multiline string into a list of individual lines
- We also often need to combine a list into a single string with tabs or new lines

In [53]:
print('Column0\tColumn1\tColumn2\tColumn3'.split('\t'))

['Column0', 'Column1', 'Column2', 'Column3']


In [54]:
some_text = """Call me Ishmael.
Some years ago—never mind how long precisely—having little or no money in my purse,
and nothing particular to interest me on shore,
I thought I would sail about a little and see the watery part of the world."""
some_text.split('\n')

['Call me Ishmael.',
 'Some years ago—never mind how long precisely—having little or no money in my purse,',
 'and nothing particular to interest me on shore,',
 'I thought I would sail about a little and see the watery part of the world.']

#### We can use `.splitlines` to do the same thing

In [55]:
some_text = """Call me Ishmael.
Some years ago—never mind how long precisely—having little or no money in my purse,
and nothing particular to interest me on shore,
I thought I would sail about a little and see the watery part of the world."""
some_text.splitlines()

['Call me Ishmael.',
 'Some years ago—never mind how long precisely—having little or no money in my purse,',
 'and nothing particular to interest me on shore,',
 'I thought I would sail about a little and see the watery part of the world.']
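
For the text above the two approaches agree, but they can differ on other input. The sketch below uses a made-up string with a Windows line ending and a trailing newline, then joins lists back together with tabs and new lines:

```python
text = "first\r\nsecond\nthird\n"

# .split('\n') leaves the carriage return and adds an empty last element
with_split = text.split("\n")
print(with_split)       # ['first\r', 'second', 'third', '']

# .splitlines understands several line endings and drops the trailing empty line
with_splitlines = text.splitlines()
print(with_splitlines)  # ['first', 'second', 'third']

# joining goes the other way: tabs between columns, new lines between rows
header = "\t".join(["Column0", "Column1", "Column2"])
print(repr(header))     # 'Column0\tColumn1\tColumn2'
rejoined = "\n".join(with_splitlines)
print(rejoined)
```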

### Raw Strings
- What if I want to print an escape character?
    - One way would be `"this \\t that"`
- Instead we can use raw strings to cause python to ignore escape characters
- raw strings begin with `r`
    - `r"some example \n that doesn't make two lines"`
- raw strings are useful if you need to write out a Windows path which contains backslashes

In [56]:
print("first line\nsecond line")

first line
second line


In [57]:
print('first line\\nsecond line')

first line\nsecond line


In [58]:
# now using a raw string
print(r'first line\nsecond line')

first line\nsecond line


In [59]:
# the raw string equals the corresponding normal string which contains a double backslash
r'first line\nsecond line' == 'first line\\nsecond line'

True

In [60]:
print(r'c:\users\rehman')

c:\users\rehman
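
The path above happens to contain no valid escape sequences, so the problem is easier to see with a hypothetical path whose folder names start with `n` and `t`:

```python
# hypothetical path used only for illustration
plain = "c:\new\table"   # \n and \t are interpreted as a newline and a tab
raw = r"c:\new\table"    # backslashes are kept literally

print(len(plain))  # 10
print(len(raw))    # 12

# the raw string equals the normal string with doubled backslashes
print(raw == "c:\\new\\table")  # True
```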


## String interpolation
- String interpolation is the dynamic insertion of values into strings
- Unfortunately, there are 3 ways to do string interpolation in python

### `%` method
- This is the oldest string interpolation method
- It is not formally deprecated, but much of the python community treats it as a legacy style
    - Meaning we should avoid it in new code, but it remains in python so that old code keeps working
- I recommend you **do not** interpolate strings this way but we will go over it for the sake of completeness

In [61]:
name = "Rehman"
print("Hello my name is %s" % name)
# %s means the value we are inserting into the string is a string

Hello my name is Rehman


In [62]:
from math import pi
print(pi)
print("pi is equal to %.2f" % pi)

3.141592653589793
pi is equal to 3.14


#### format specifications
- `%s` means a string should be inserted
- `%f` - float
- `%d` - base10 integer
- `%e` - scientific notation
- putting a decimal point and a number before the format spec letter sets how many decimal places to print, e.g. `%.2f`
- There are more format specs

In [63]:
pi = 3.14159
print("pi/100 = %.1E" % (pi/100))

pi/100 = 3.1E-02


### `.format()` method
- `.format` is newer than using the `%` operator
- Instead of using a `%` as a placeholder in the string where we want to insert our value `.format` uses `{}`
- I recommend you **do not** interpolate strings this way either.

In [64]:
name = "Rehman"
job = "scientist"
"Hello, my name is {} and I am a {}".format(name,job)

'Hello, my name is Rehman and I am a scientist'

In [65]:
"Hello, my name is {name} and I am a {occupation}".format(name="Rehman",occupation="scientist")

'Hello, my name is Rehman and I am a scientist'

In [66]:
from math import pi
print(pi)
"pi is equal to {:.3f}".format(pi)

3.141592653589793


'pi is equal to 3.142'

In [67]:
from math import pi
# round to a whole number
"pi is equal to {:.0f}".format(pi)

'pi is equal to 3'

In [68]:
from math import pi
# scientific notation
"pi is equal to {:.1E}".format(pi)
# you can use either "e" or "E" for scientific notation 
# depending on whether you want the E to be upper case or not

'pi is equal to 3.1E+00'

In [69]:
from math import pi
# percentage
"pi/100 is {:.1%}".format(pi/100)

'pi/100 is 3.1%'
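
One situation where `.format` can still be convenient is when the template is defined separately from the values, because an f-string is evaluated as soon as it is written. A minimal sketch with made-up sample names and counts:

```python
# the template is defined once, before any values exist
template = "Sample {id}: {count:,} reads"

lines = []
for sample_id, count in [("A1", 1200345), ("B2", 98765)]:
    lines.append(template.format(id=sample_id, count=count))

print("\n".join(lines))
# Sample A1: 1,200,345 reads
# Sample B2: 98,765 reads
```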

### f-strings
- f-strings are the preferred way to interpolate strings
- They are the newest string interpolation method in python
- They also execute faster than the other interpolation methods
- f-strings allow you to put any valid expression inside the curly braces
- they begin with the letter f: `f"pi = {pi}"`
- Of the 3 string interpolation options I encourage you to use f-strings

In [70]:
# f-strings can insert variables into strings
name = "Rehman"
job = "scientist"
f"My name is {name} and I am a {job}."

'My name is Rehman and I am a scientist.'

In [71]:
# f-strings can evaluate expressions
f"The tricentennial is {2076 - 2022} years from now"

'The tricentennial is 54 years from now'

In [72]:
# you can call functions in f-strings
numbers = [1, 5, 0, -8, 12]
f"The sum is {sum(numbers)}"

'The sum is 10'

In [73]:
# you can use dictionaries
person = {"name":"Rehman", "job":"scientist"}
text = f"Hello my name is {person['name']} and I am a {person['job']}"
print(text)
# note that I need to use the single quotes for the dict keys because the f-string is enclosed in double quotes (required before Python 3.12)

Hello my name is Rehman and I am a scientist


In [74]:
# this is equivalent with the quote types swapped
person = {"name":"Rehman", "job":"scientist"}
text = f'Hello my name is {person["name"]} and I am a {person["job"]}'
print(text)
# before Python 3.12 you will get an error if you try to use the same quotes for the string and the dict keys

Hello my name is Rehman and I am a scientist


In [75]:
from math import pi
# You can similarly format f-strings
f"pi = {pi:.2f}"

'pi = 3.14'

In [76]:
# percentages
f"My grade is {(92.5/100):.1%}"

'My grade is 92.5%'

In [77]:
# commas in the thousands place
f"1 terabyte is {1e12:,.0f} bytes"

'1 terabyte is 1,000,000,000,000 bytes'

In [78]:
# pad zeros left
f"{1:03d}"

'001'

In [79]:
# you can concatenate f-strings
first = "Rehman"
last = "Qureshi"
f"First name: {first} " + f"Last name: {last}"

'First name: Rehman Last name: Qureshi'

In [80]:
# multiline f-string
s = f"a = {12.004:.0f}, " \
f"b = {31234:.1e}, " \
f"c = {(1/2):.1%}"
print(s)

a = 12, b = 3.1e+04, c = 50.0%


### format specifications
- `f` - floating point
- `d` - integer
- `e` - scientific notation
- `%` - percentage
- and more
- See the [official python documentation](https://docs.python.org/3/library/string.html#formatspec) for specifying the formatting for f-strings. Unfortunately, they still use `.format` in their examples, but the exact same format specifier goes to the right of the colon in the `{expression:format_spec}` expression in f-strings.
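
A few more format specs from the same mini-language, sketched with made-up values. The `=` form assumes Python 3.8 or newer:

```python
value = 42
name = "fox"

# "=" prints the expression together with its value (Python 3.8+)
debug_text = f"{value=}"
print(debug_text)        # value=42

# a number after the colon sets the width; <, >, ^ set the alignment
right = f"[{name:>9}]"
left = f"[{name:<9}]"
center = f"[{name:^9}]"
print(right)             # [      fox]
print(left)              # [fox      ]
print(center)            # [   fox   ]

# x and b print integers in hexadecimal and binary
hex_text = f"{255:x}"
binary_text = f"{255:b}"
print(hex_text, binary_text)  # ff 11111111
```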

## Regular Expressions
- A "mini programming language" for pattern matching strings
- We have seen we can identify exact matches in strings with `.find`
- Regular expressions let us match multiple substrings that fit a pattern
- Often called regex

### `re` module
- To use regular expressions in python we need to use the `re` module
- we need to import the module to use it
```python
import re
```
- Python modules contain new objects and functions that we can use

### `re.findall`
- `re.findall` will search a string for a pattern and return a list of all substrings matching the pattern
- First we will examine exact matches
- `re.findall(pattern,string)`

In [81]:
import re
s = "kale pizza burgers ice cream kale tomatoes"
# find substrings matching "kale" in s
re.findall('kale',s)

['kale', 'kale']

### Regex Characters
- Symbols and escape characters are used to generate patterns

### Regex symbols
|symbol|definition|
| :--: |  :----  |
|\s    | any whitespace|
|\S|any character except whitespace|
|\w|[A-Z] [a-z] [0-9] _ underscore|
|\W|any character not matching \w|
|\d|any digit from 0-9|
|\D|any character except digits|
|[]|match any character inside the brackets|
|[a-z]|match any lowercase letter from a to z|
|[A-Za-z]|match any letter upper or lower case|
|[0-9]|match any digit from 0 to 9|
|[^xyz]|match any characters except x, y, and z|
|.|any character except newline|
|* |match the expression to the left zero or more times|
|+ |match the expression to the left one or more times|
|{n}|match the expression to the left *n* times|
|{,n}|match the expression to the left *n* or fewer times|
|{n,}|match the expression to the left *n* or more times|
|{m,n}|match the expression to the left at least *m* times but not more than *n* times|
| \| |match expression to the left **OR** expression to the right|
|^|match expression to the right only if it occurs at the beginning of the string|
|$|match expression to the left only if it occurs at the end of a string|
|()|group and capture the expression inside the parentheses|
|(?:x)|group x but do not capture it|
|(?=x)|positive lookahead|
|(?!x)|negative lookahead|
|(?<=x)|positive lookbehind|
|(?<!x)|negative lookbehind|


In [82]:
# get phone numbers from text
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
# the r prefix makes this a raw string so the backslashes reach the regex engine unchanged
re.findall(r'\(\d{3}\)\s*\d{3}\-*\d{4}',s)

['(215) 555-1000', '(327)674-3409', '(215)4563456']

- `\d` matches any digit from 0-9. 
- The `{3}` means match exactly 3 times
    - `\d{3}` means match exactly 3 digits. 
    - `\d{4}` means match exactly 4 digits
- `\s` matches any whitespace character
    - whitespace is just empty space characters like spaces, tabs, and new lines
    - `\s*` means match 0 or more whitespace characters
    - We need the asterisk because one of the phone numbers doesn't have a space after the area code
        - So we need to be able to match whitespace or the lack of whitespace after the area code
- `\(` and `\)` are escaped parentheses.
    - Parentheses are special characters in regular expressions, so to match literal parentheses we need to use a backslash to tell python's regular expression engine to match the parentheses themselves
- A hyphen is only special inside square brackets (e.g. `[a-z]`), so outside of brackets it does not need a backslash, although escaping it is harmless
    - `\-` will match a hyphen, exactly like a plain `-`
    - not all the phone numbers have a hyphen so we add an asterisk to be able to match a hyphen zero or more times `\-*`
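
As a sketch of a slightly tighter variant of the same pattern, the version below leaves the hyphen unescaped and uses the `?` quantifier (match zero or one time, which does not appear in the table above) so that at most one hyphen is allowed:

```python
import re

s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# "-?" allows zero or one hyphen; outside of [] the hyphen needs no backslash
phone_pattern = r"\(\d{3}\)\s*\d{3}-?\d{4}"
numbers = re.findall(phone_pattern, s)
print(numbers)  # ['(215) 555-1000', '(327)674-3409', '(215)4563456']
```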

In [83]:
# get restaurant names from text
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.findall(r'\w+\s+\w+',s)

['Pizza King', 'Wing Palace', 'Burger Castle']

In [84]:
# get restaurant names from text
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.findall(r'[A-Za-z]+\s+[A-Za-z]+',s)

['Pizza King', 'Wing Palace', 'Burger Castle']

- This time we're trying to match the restaurant names which are 2 word pairs
- `\w` matches any "word" character: `[A-Z]` or `[a-z]` or `[0-9]` or `_`
    - Every word has at least one letter so we use `+` to indicate that
    - `\w+` means match one or more "word" characters
- We know there is at least one space between the words in the restaurant name so we use `\s+`
- Alternatively instead of `\w+` we could use `[A-Za-z]+` to match at least one upper or lower case letter

In [85]:
# find the restaurants that match `Palace` or `Castle`
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.findall(r'\w+\s+(Palace|Castle)',s)

['Palace', 'Castle']

We didn't get the full restaurant name.

In [86]:
# find the restaurants that match `Palace` or `Castle`
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.findall(r'\w+\s+(?:Palace|Castle)',s)

['Wing Palace', 'Burger Castle']

Now we have the full restaurant name

- `(Palace|Castle)` will match either Palace or Castle
- `re.findall` still matches the full expression `\w+\s+(Palace|Castle)`, but it does not return the full match
- Using parentheses has grouped the expression inside them, but it also tells `re.findall` to return only the text matched inside the parentheses.
    - Expressions inside parentheses are called capture groups
- We can use `?:` inside the parentheses to tell python not to treat the parentheses as a capture group
    - Using `(?:Palace|Castle)` in the regex will enable `re.findall` to match the entire expression
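
When a pattern has more than one capture group, `re.findall` returns a list of tuples with one element per group. A short sketch using the same restaurant string:

```python
import re

s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# two capture groups, so each match is returned as a 2-element tuple
pairs = re.findall(r"(\w+)\s+(Palace|Castle)", s)
print(pairs)       # [('Wing', 'Palace'), ('Burger', 'Castle')]

# the pieces can be recombined as needed
full_names = [" ".join(pair) for pair in pairs]
print(full_names)  # ['Wing Palace', 'Burger Castle']
```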

In [87]:
# get everything except the letters
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.findall('[^A-Za-z]+',s)

[' ', ': (215) 555-1000, ', ' ', ': (327)674-3409, ', ' ', ': (215)4563456']

#### Lookahead and lookbehind

In [88]:
# find the first and last names of the doctors
people = "Dr. Bruce Banner, Scott Lang, Dr. Jane Foster, Dr. Wendy Lawson, James Woo"
re.findall(r"(?<=Dr\.\s)\w+\s\w+",people)

['Bruce Banner', 'Jane Foster', 'Wendy Lawson']

- `(?<=Dr\.\s)` tells python to look behind the expression and match it only if the text immediately before it matches the expression in the parentheses.
- lookbehind means look to the left of the expression (the text that comes before it)
- lookahead means look to the right of the expression (the text that comes after it)

In [89]:
# find the names of the non-doctors
people = "Dr. Bruce Banner, Scott Lang, Dr. Jane Foster, Dr. Wendy Lawson, James Woo"
re.findall(r"(?<!Dr\.\s)\w+\s\w+",people)

['ruce Banner', 'Scott Lang', 'ane Foster', 'endy Lawson', 'James Woo']

- That's not what we want!
- `(?<!Dr\.\s)` tells python not to match expressions following "Dr. "
    - `(?<!)` is the negative lookbehind operation
- `\w` matches any word character, upper or lower case, so only the first letter following "Dr. " is skipped
    - Python then matches the rest of the name starting from the second letter

In [90]:
# find the names of the non-doctors
people = "Dr. Bruce Banner, Scott Lang, Dr. Jane Foster, Dr. Wendy Lawson, James Woo"
re.findall(r"(?<!Dr\.\s)[A-Z][a-z]+\s[A-Z][a-z]+",people)

['Scott Lang', 'James Woo']

- This is what we want
- `[A-Z][a-z]+` is telling python to match any capital letter followed by at least one lower case letter

In [91]:
# get the first names only
people = "Dr. Bruce Banner, Scott Lang, Dr. Jane Foster, Dr. Wendy Lawson, James Woo"
re.findall(r"[A-Z][a-z]+(?=\s[A-Z][a-z]+)",people)

['Bruce', 'Scott', 'Jane', 'Wendy', 'James']

- `(?=\s[A-Z][a-z]+)` is the positive look ahead expression.
    - It will match expressions if the look ahead expression is to the right
- `(?!...)` is the negative look ahead
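
A minimal sketch of a negative lookahead, using a made-up string. It also uses `\b` (a word boundary, which is not in the table above) so that the engine cannot satisfy the lookahead by matching only part of a number:

```python
import re

text = "10% of 250 items, 5% off"

# match whole numbers that are NOT immediately followed by a percent sign
not_percent = re.findall(r"\b\d+\b(?!%)", text)
print(not_percent)  # ['250']
```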

#### Lookbehind restrictions
- In python's `re` module, lookbehind expressions cannot contain a variable number of characters
- The number of characters in the lookbehind expression must be fixed
    - Cannot use `+`, `*`, or `{m,n}` inside a lookbehind
- Lookahead expressions do not have this restriction
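
In Python's built-in `re` module the fixed-width requirement applies to lookbehind, while lookahead accepts variable-width expressions. The sketch below, on a shortened made-up list of names, checks both cases:

```python
import re

people = "Dr. Bruce Banner, Scott Lang"

# a variable-width lookahead (\s+ can match one or more characters) is accepted
first_names = re.findall(r"[A-Z][a-z]+(?=\s+[A-Z])", people)
print(first_names)  # ['Bruce', 'Scott']

# a variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r"(?<=Dr\.\s+)\w+")
    lookbehind_ok = True
except re.error as err:
    lookbehind_ok = False
    print("could not compile:", err)
```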

### Regular expressions vs. `str` functions
- Sometimes you can use standard `str` methods instead of a regular expression
- But code for parsing strings can become very complicated if you do not use regular expressions

In [92]:
# find the first and last names of the doctors using only str functions
people = "Dr. Bruce Banner, Scott Lang, Dr. Jane Foster, Dr. Wendy Lawson, James Woo"
[person.partition('Dr. ')[-1] for person in people.split(', ') if person.startswith('Dr. ')]

['Bruce Banner', 'Jane Foster', 'Wendy Lawson']

In [93]:
# split on "," instead of ", "
people = "Dr. Bruce Banner, Scott Lang, Dr. Jane Foster, Dr. Wendy Lawson, James Woo"
[person.strip().partition('Dr. ')[-1] for person in people.split(',') if person.strip().startswith('Dr. ')]

['Bruce Banner', 'Jane Foster', 'Wendy Lawson']

In [94]:
# find the first and last names of the non-doctors using only str functions
people = "Dr. Bruce Banner, Scott Lang, Dr. Jane Foster, Dr. Wendy Lawson, James Woo"
[person for person in people.split(', ') if not person.startswith('Dr. ')]

['Scott Lang', 'James Woo']

In [95]:
# find the names of the doctors using only str functions instead of regular expressions

#### Beginning and ending of strings

In [96]:
# Get the first word in the string
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.findall(r'^\w+',s)

['Pizza']

`^` only matches expressions at the beginning of the string

In [97]:
# Get the last word in the string
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.findall(r'\w+$',s)

['4563456']

`$` only matches expressions at the end of the string

#### `.` matches any character except for newline `\n`

In [98]:
s = "C-3PO"
re.findall('C.*',s)

['C-3PO']

#### Greedy vs non-greedy matching
- By default `+`, `*`, and `{}` match the longest expression possible
    - They are "greedy" operators
    - We don't always want greedy matching
- Adding `?` makes the matching non-greedy or lazy
    - Python will return the shortest possible matches
    - `*?`, `+?`, `{m,n}?`, etc.

In [99]:
# get the html tags
html = '<html><head><title>Title</title>'

# first we will use a greedy regular expression
re.findall(r'\<.*\>',html)

['<html><head><title>Title</title>']

This gives us the entire string not the individual tags. This is because the entire string is the longest possible match for our regular expression.

In [100]:
html = '<html><head><title>Title</title>'

# using a non-greedy regular expression
re.findall(r'\<.*?\>',html)

['<html>', '<head>', '<title>', '</title>']

Now we have a list of the individual tags.
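
Another common way to get the same result, sketched here as an alternative to the lazy quantifier, is a negated character class: match `<`, then any characters that are not `>`, then `>`:

```python
import re

html = '<html><head><title>Title</title>'

# [^>]* cannot run past the closing bracket, so greedy matching is safe here
tags = re.findall(r"<[^>]*>", html)
print(tags)  # ['<html>', '<head>', '<title>', '</title>']
```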

### `re.compile`
- Regular expressions can get quite long and can be a pain to copy and paste over and over
- You can use re.compile to create a `pattern` object that has regular expression methods
- `pattern` objects have a `findall` method

In [126]:
# compile our regular expression into a pattern
pattern = re.compile(r'\(\d{3}\)\s*\d{3}\-*\d{4}')

# get phone numbers from text
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
pattern.findall(s)

['(215) 555-1000', '(327)674-3409', '(215)4563456']

### `re.split`
- We can split a string using a regular expression with `re.split`
    - `re.split(expr,string_to_split)`
- `pattern` objects returned by `re.compile` also have a `.split` method

In [127]:
# split on colon or comma
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.split(r'[\:\,]',s)

['Pizza King',
 ' (215) 555-1000',
 ' Wing Palace',
 ' (327)674-3409',
 ' Burger Castle',
 ' (215)4563456']

In [128]:
# split on colon or comma using a compiled expression
pattern = re.compile(r'[\:\,]')
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
pattern.split(s)

['Pizza King',
 ' (215) 555-1000',
 ' Wing Palace',
 ' (327)674-3409',
 ' Burger Castle',
 ' (215)4563456']

In [104]:
# .pattern will return the string containing the regular expression
pattern = re.compile(r'[\:\,]')
pattern.pattern

'[\\:\\,]'
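
The split results above keep the leading spaces. One way to drop them, sketched below, is to let the delimiter pattern absorb the surrounding whitespace; the pieces can then be paired into a dictionary (the name `phone_book` is only illustrative):

```python
import re

s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# \s* on both sides makes the whitespace part of the delimiter
parts = re.split(r"\s*[:,]\s*", s)
print(parts)

# even positions are names, odd positions are phone numbers
phone_book = dict(zip(parts[::2], parts[1::2]))
print(phone_book["Wing Palace"])  # (327)674-3409
```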

### `re.sub` and `re.subn`
- We can replace regular expression patterns with a new value using `re.sub`
- `re.sub(pattern,replacement,string)`
- `pattern` objects also have `.sub` and `.subn` methods
- `re.subn` and `pattern.subn` do the same thing but return a tuple: the new string followed by the number of replacements performed

In [129]:
# replace all area codes with 721
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.sub(r'\(\d{3}\)','(721)',s)

'Pizza King: (721) 555-1000, Wing Palace: (721)674-3409, Burger Castle: (721)4563456'

In [106]:
# replace all area codes with 721 using re.compile
pattern = re.compile(r'\(\d{3}\)')

s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
pattern.sub('(721)',s)

'Pizza King: (721) 555-1000, Wing Palace: (721)674-3409, Burger Castle: (721)4563456'

In [130]:
# replace all area codes with 721 using subn
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
re.subn(r'\(\d{3}\)', '(721)', s)

('Pizza King: (721) 555-1000, Wing Palace: (721)674-3409, Burger Castle: (721)4563456',
 3)

In [108]:
# replace all area codes with 721 using re.compile and pattern.subn
pattern = re.compile(r'\(\d{3}\)')

s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
pattern.subn('(721)',s)

('Pizza King: (721) 555-1000, Wing Palace: (721)674-3409, Burger Castle: (721)4563456',
 3)

#### Replacement using capture groups
- We can dynamically capture certain values from the original string and insert them back into the replacement string using capture groups
- Parentheses indicate a capture group.
- We can access capture groups in the replacement string by using their number.
    - Groups are numbered from left to right starting at 1
    - `\\1` will insert the first capture group

In [131]:
# reformat the phone numbers
pattern = re.compile(r'(\w+\s\w+)\:\s*\((\d{3})\)\s*(\d{3})\-*(\d{4})')

s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# each number corresponds to a set of parentheses starting at 1 going from left to right
pattern.sub('\\1: \\2-\\3-\\4',s)

# Note: do not escape "-" in the replacement string; the backslash would show up in the output

'Pizza King: 215-555-1000, Wing Palace: 327-674-3409, Burger Castle: 215-456-3456'

#### Named capture
- The code above can be confusing so instead of using numbers we can name the capture groups and access them that way.
- We name a capture group like so `(?P<group_name>expr)`
- We can access a named group in the replacement string like so `\g<group_name>`

In [110]:
# reformat the phone numbers
pattern = re.compile(r'(?P<restaurant>\w+\s\w+)\:\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})\-*(?P<last4>\d{4})')

s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# each \g<name> refers to the capture group with that name
pattern.sub(r'\g<restaurant>: \g<area_code>-\g<first3>-\g<last4>', s)

'Pizza King: 215-555-1000, Wing Palace: 327-674-3409, Burger Castle: 215-456-3456'

In [132]:
# we can view the capture group indices with .groupindex
pattern = re.compile(r'(?P<restaurant>\w+\s\w+)\:\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})\-*(?P<last4>\d{4})')
pattern.groupindex

mappingproxy({'restaurant': 1, 'area_code': 2, 'first3': 3, 'last4': 4})

### `re.search` and `re.match`
- `re.search` and `re.match` work like `re.findall` except they return a match object instead of the matching string
- `pattern` objects have `.search` and `.match` methods
- `re.match` can only match expressions at the beginning of the string, while `re.search` can match expressions anywhere in the string
- Both `re.match` and `re.search` **only return the first match**

In [133]:
# compile our regular expression into a pattern
pattern = re.compile(r'\(\d{3}\)\s*\d{3}\-*\d{4}')

# get phone numbers from text
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
print(pattern.match(s))

None


We get `None` because the beginning of the string does not match our pattern

In [134]:
# compile our regular expression into a pattern
pattern = re.compile(r'\(\d{3}\)\s*\d{3}\-*\d{4}')

# get phone numbers from text
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
pattern.search(s)

<re.Match object; span=(12, 26), match='(215) 555-1000'>
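
To make the difference concrete, here is a minimal sketch that runs both methods on two made-up strings, one that begins with a phone number and one that does not. The strings and the names `texts`, `match_found` and `search_found` are hypothetical; the pattern is the one used above.

```python
import re

pattern = re.compile(r'\(\d{3}\)\s*\d{3}-*\d{4}')

# hypothetical strings: the first begins with a phone number, the second does not
texts = ["(215) 555-1000 is Pizza King", "Pizza King: (215) 555-1000"]

match_found = []
search_found = []
for text in texts:
    # both methods return None when nothing matches
    match_found.append(pattern.match(text) is not None)
    search_found.append(pattern.search(text) is not None)

print(match_found)   # did .match find something in each string?
print(search_found)  # did .search find something in each string?
```

Because both methods return `None` on failure, it is usually worth checking for `None` before using the result.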

### match objects
- `match` objects have several properties and methods
- `match.re` will give us the compiled regular expression that generated the match object
- `match.groups()` will give us the strings matching the capture groups
- `match.groupdict()` will give us a dictionary with group names as keys and matching strings as values
- `match.start()` and `match.end()` will give us starting and ending indices of the match in the string
- `match.span(group_name)` gives us the starting and ending indices of a specific capture group
- `match.expand()` will allow us to do replacement using capture groups just like `re.sub`
- There are more properties and methods, refer to the [official python documentation](https://docs.python.org/3/library/re.html#match-objects)


In [135]:
# find the first restaurant and phone number
pattern = re.compile(r'(?P<restaurant>\w+\s\w+)\:\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})\-*(?P<last4>\d{4})')
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# .search returns a match object for the first match only
match_obj = pattern.search(s)
print(match_obj)

<re.Match object; span=(0, 26), match='Pizza King: (215) 555-1000'>


In [115]:
print(match_obj.re)

re.compile('(?P<restaurant>\\w+\\s\\w+)\\:\\s*\\((?P<area_code>\\d{3})\\)\\s*(?P<first3>\\d{3})\\-*(?P<last4>\\d{4})')


In [116]:
print(match_obj.groups())

('Pizza King', '215', '555', '1000')


In [117]:
print(match_obj.groupdict())

{'restaurant': 'Pizza King', 'area_code': '215', 'first3': '555', 'last4': '1000'}


In [118]:
print(match_obj.start())
print(match_obj.end())

0
26


In [137]:
print(match_obj.span("area_code"))

(13, 16)


In [120]:
match_obj.expand(r'Call \g<restaurant> at (\g<area_code>) \g<first3>-\g<last4>')

'Call Pizza King at (215) 555-1000'
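
Match objects also have a `.group()` method, which is not covered above; it returns a single capture group by name or by number. The sketch below assumes the same pattern and string as the cells above, and the variable names are only for illustration.

```python
import re

pattern = re.compile(
    r'(?P<restaurant>\w+\s\w+):\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})-*(?P<last4>\d{4})'
)
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"
match_obj = pattern.search(s)

# group 0 is always the entire match
whole = match_obj.group(0)
# a group can be requested by name or by its number
by_name = match_obj.group("area_code")
by_number = match_obj.group(2)
# several groups at once come back as a tuple
last_seven = match_obj.group("first3", "last4")

print(whole)
print(by_name, by_number)
print(last_seven)
```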

### `re.finditer`
- We have seen that `re.match` and `re.search` only give us the first match
- We can use `re.finditer` to get all matches as match objects
- `.finditer` returns an iterator, which hands back one match object at a time
    - That means we either have to convert it to a `list` or use a `loop` to access the contents
- `pattern` objects also have a `.finditer` method

In [121]:
pattern = re.compile(r'(?P<restaurant>\w+\s\w+)\:\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})\-*(?P<last4>\d{4})')
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# use .finditer to get the matches
matches = pattern.finditer(s)
print(matches)
print(list(matches))

<callable_iterator object at 0x000001FA6C28B0A0>
[<re.Match object; span=(0, 26), match='Pizza King: (215) 555-1000'>, <re.Match object; span=(28, 54), match='Wing Palace: (327)674-3409'>, <re.Match object; span=(56, 83), match='Burger Castle: (215)4563456'>]


In [138]:
pattern = re.compile(r'(?P<restaurant>\w+\s\w+)\:\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})\-*(?P<last4>\d{4})')
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# use .finditer to get the matches
matches = pattern.finditer(s)

# loop through the matches
for match in matches:
    print(match.groups())

('Pizza King', '215', '555', '1000')
('Wing Palace', '327', '674', '3409')
('Burger Castle', '215', '456', '3456')


In [139]:
pattern = re.compile(r'(?P<restaurant>\w+\s\w+)\:\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})\-*(?P<last4>\d{4})')
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# use .finditer to get the matches
matches = pattern.finditer(s)
# create an empty list to hold the results
lunch = []
# loop through the matches
for match in matches:
    lunch.append(
        match.expand(
            r"Call \g<restaurant> at \g<area_code>-\g<first3>-\g<last4>"
        )
    )

print(lunch)

['Call Pizza King at 215-555-1000', 'Call Wing Palace at 327-674-3409', 'Call Burger Castle at 215-456-3456']


In [140]:
# Note that if you try to access the output of finditer more than once 
# you will get an empty list after the first time it is called
# We will talk about why this happens in a future lecture
pattern = re.compile(r'(?P<restaurant>\w+\s\w+)\:\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})\-*(?P<last4>\d{4})')
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# use .finditer to get the matches
matches = pattern.finditer(s)
# access matches the first time
print(list(matches))
# access matches the second time
print(list(matches))

[<re.Match object; span=(0, 26), match='Pizza King: (215) 555-1000'>, <re.Match object; span=(28, 54), match='Wing Palace: (327)674-3409'>, <re.Match object; span=(56, 83), match='Burger Castle: (215)4563456'>]
[]
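
One simple way around this, sketched below, is to convert the output of `.finditer` to a `list` once and loop over that list as many times as needed. The names `names` and `area_codes` are made up for this example.

```python
import re

pattern = re.compile(
    r'(?P<restaurant>\w+\s\w+):\s*\((?P<area_code>\d{3})\)\s*(?P<first3>\d{3})-*(?P<last4>\d{4})'
)
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# store the match objects in a list so they can be reused
matches = list(pattern.finditer(s))

# first pass over the stored matches
names = []
for match in matches:
    names.append(match.groupdict()["restaurant"])

# second pass still works because the list is not used up
area_codes = []
for match in matches:
    area_codes.append(match.groupdict()["area_code"])

print(names)
print(area_codes)
```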


### Regular Expression flags
- There are several flags you can give to functions in the `re` module that will change python's behavior when it's parsing your regular expression
- `re.IGNORECASE` aka `re.I`- makes regex case insensitive
    - You can also include `(?i)` at the beginning of your regex
- `re.DOTALL` will allow `.` to match new line characters
    - You can also include `(?s)` at the beginning of your regex
- You can use multiple flags at the same time
    - `(?is)` will use both `DOTALL` and `IGNORECASE`
    - `re.compile(my_expr, re.IGNORECASE | re.DOTALL)` combines flags with `|`
- There are more flags; check out the [python documentation](https://docs.python.org/3/library/re.html#flags) to learn more

In [145]:
# ignore case

#get DNA sequences
s = "> My Sequence AATgcATCcccgggtagtaaatg"

nucleotide = re.compile('[ATGC]{2,}',re.IGNORECASE)
nucleotide.findall(s)

['AATgcATCcccgggtagtaaatg']
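
The cell above only covers `re.IGNORECASE`. As a rough sketch of `re.DOTALL` and of combining flags, the example below searches a made-up two-line string. The string `text` and the result names are hypothetical.

```python
import re

# hypothetical two-line string
text = "START first line\nsecond line end"

# IGNORECASE alone: "." cannot cross the newline, so there is no match
no_dotall = re.search(r'start.*END', text, re.IGNORECASE)
print(no_dotall)

# both flags combined with |: "." now matches the newline too
both_flags = re.search(r'start.*END', text, re.IGNORECASE | re.DOTALL)
print(both_flags)

# the inline form (?is) turns on the same two flags
inline_flags = re.search(r'(?is)start.*END', text)
print(inline_flags)
```

Flags are combined with `|` in a single argument because `re.compile` and `re.search` accept only one `flags` value.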

### Regular Expression Tips
- Regular expressions are incredibly powerful
    - I use them almost daily to clean up data that was manually recorded or to rename things
- I recommend that you use `re.compile` to compile your regex into a pattern object instead of using `re.sub`, `re.search`, etc.
    - You can access all these methods from the pattern object produced by `re.compile`
- Regular Expressions may seem overwhelming, but as long as you understand the principles you don't need to have all the syntax memorized
    - Even I am always looking at the python documentation for [Regular Expressions](https://docs.python.org/3/library/re.html) while writing my code
- Practice, Practice, Practice
    - This goes for programming in general
    - I learned Regular Expressions by using them over and over for years
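
To illustrate the `re.compile` tip, here is a short sketch in which one compiled pattern is reused for finding, locating and replacing. The name `phone` and the replacement text are made up for this example.

```python
import re

# compile once, then reuse the pattern object for several operations
phone = re.compile(r'\(\d{3}\)\s*\d{3}-*\d{4}')
s = "Pizza King: (215) 555-1000, Wing Palace: (327)674-3409, Burger Castle: (215)4563456"

# all matching strings
numbers = phone.findall(s)
# position of the first match
first_span = phone.search(s).span()
# replace every match and count the replacements
cleaned, count = phone.subn('[number removed]', s)

print(numbers)
print(first_span)
print(cleaned, count)
```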