# Regular Expressions

## What are regular expressions?


- Regular expressions are designed for matching patterns in text.
- Examples include picking out three digits in a row, or runs of characters separated by whitespace or punctuation (i.e. the words in a sentence).
- **They can be extremely useful for parsing and cleaning data: checking that entries follow a certain format, or extracting parts of a string based on certain criteria.**
- The downside is that they aren't very human-readable.
- They can be tricky to write and debug.
- Sometimes called REs, regexes or regex patterns
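
As a small illustration of the two uses highlighted above (checking a format and extracting part of a string), here is a minimal sketch. The sample entries and the `DDD-DDD-DDDD` phone format are assumptions made up for this example, and `re.fullmatch` is a standard-library function that is not otherwise used in this notebook. The `\d` and `{n}` notation is explained in later sections.

```python
import re

# Hypothetical sample entries; the expected format is assumed to be DDD-DDD-DDDD
entries = ["505-333-1234", "5053331234", "call 505-333-1234 now"]

# fullmatch succeeds only if the whole string follows the pattern
phone_pattern = r"\d{3}-\d{3}-\d{4}"
is_valid = [bool(re.fullmatch(phone_pattern, entry)) for entry in entries]
print(is_valid)

# search finds the pattern anywhere, so it can pull the number out of longer text
found = re.search(phone_pattern, entries[2])
extracted = found.group() if found else None
print(extracted)
```

Only the first entry passes the format check, while the search still recovers the number embedded in the third entry.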

![regex](https://imgs.xkcd.com/comics/regular_expressions.png)

## Regular Expressions in Python

In [1]:
# re is the regular expression module in python
import re


## Matching
Regular expressions can be used to match letters and other characters. The call signature for matching (shared by many other `re` functions) is `re.match(pattern, string)`.

In [2]:
re.match?

In [3]:
# Most characters will match themselves
m = re.match('test', 'test')
( bool(m), m )


(True, <re.Match object; span=(0, 4), match='test'>)

In [4]:
m = re.match('t', 'today')
( bool(m), m )

(True, <re.Match object; span=(0, 1), match='t'>)

In [5]:
m = re.match('y', 'today')
( bool(m), m )


(False, None)

In [6]:
m = re.match('tod', 'today')
( bool(m), m )


(True, <re.Match object; span=(0, 3), match='tod'>)

## Other Types of Match Functions
Besides `match()`, the `re` module offers other functions for finding a pattern in a string:
- `match()` - matches only at the beginning of the string
- `search()` - finds the first match anywhere in the string
- `findall()` - finds all non-overlapping substrings that match and returns them as a list


In [7]:
my_string = "1. 2. 3. ... testing, testing"
my_string


'1. 2. 3. ... testing, testing'

In [8]:
# Match example
m = re.match('testing', my_string)
( bool(m), m )


(False, None)

In [9]:
# Search example
m = re.search('testing', my_string)
( bool(m), m )


(True, <re.Match object; span=(13, 20), match='testing'>)

In [10]:
# Findall example
m = re.findall('testing', my_string)
( bool(m), m )


(True, ['testing', 'testing'])

In [11]:
# If there is no match, it will return None
m = re.search('hi', my_string)
( bool(m), m )


(False, None)

In [12]:
print(re.search('hi', my_string))


None
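
Because a failed `match` or `search` returns `None`, calling a method such as `.group()` on the result directly raises an `AttributeError`. One possible way to guard against that is sketched below; `first_match` is a hypothetical helper name chosen for this example, not part of the `re` module.

```python
import re

my_string = "1. 2. 3. ... testing, testing"

def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    m = re.search(pattern, text)
    # A failed search returns None, and None has no .group() method,
    # so check the result before using it
    if m is None:
        return None
    return m.group()

print(first_match("testing", my_string))
print(first_match("hi", my_string))
```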


# Regular Expression Patterns

## Special Sequences



Often we want to match a generic pattern rather than a literal string, for example any digit, any letter, or any whitespace. Below is a table of special matching sequences.

Pattern | Description
--|--
\d | One digit
\D | Inverse of \d. One non-digit
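\w | One word character: a letter, a digit, or an underscore
\W | Inverse of \w. One non-word character
\s | One whitespace character
\S | Inverse of \s. One non-whitespace character
\b | Empty string at the beginning or end of a word (word boundary)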
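
Two rows of the table are easy to misread: `\w` also matches the underscore, and `\b` matches a position rather than a character. The sketch below illustrates both on a made-up sample string (the string is an assumption for this example; the `+` quantifier it uses is covered in the Repeating section).

```python
import re

sample = "snake_case and cat, concat, cats"

# \w also matches the underscore, so snake_case stays in one piece
words = re.findall(r"\w+", sample)
print(words)

# Without \b, 'cat' is also found inside 'concat' and 'cats'
loose = re.findall(r"cat", sample)
# With \b on both sides, only the standalone word 'cat' matches
strict = re.findall(r"\bcat\b", sample)
print(len(loose), strict)
```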

In [13]:
my_sentence = 'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive. '
my_sentence


'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive. '

In [14]:
# Find all single digits
re.findall(r'\d', my_sentence)

['1', '0', '0', '9']

In [15]:
# Find all single word characters (letters, digits, underscore)
re.findall(r'\w', my_sentence)


['I',
 't',
 'w',
 'a',
 's',
 '1',
 '0',
 '0',
 'd',
 'e',
 'g',
 'r',
 'e',
 'e',
 's',
 's',
 'o',
 'I',
 'w',
 'e',
 'n',
 't',
 't',
 'o',
 't',
 'h',
 'e',
 'p',
 'o',
 'o',
 'l',
 'I',
 'j',
 'u',
 'm',
 'p',
 'e',
 'd',
 'o',
 'f',
 'f',
 'a',
 '9',
 'f',
 'o',
 'o',
 't',
 'h',
 'i',
 'g',
 'h',
 'd',
 'i',
 'v',
 'e']

In [16]:
# Find first whitespace
re.search(r'\s', my_sentence)


<re.Match object; span=(2, 3), match=' '>

In [17]:
# Find all non-digits
re.findall(r'\D', my_sentence)


['I',
 't',
 ' ',
 'w',
 'a',
 's',
 ' ',
 ' ',
 'd',
 'e',
 'g',
 'r',
 'e',
 'e',
 's',
 ',',
 ' ',
 's',
 'o',
 ' ',
 'I',
 ' ',
 'w',
 'e',
 'n',
 't',
 ' ',
 't',
 'o',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'p',
 'o',
 'o',
 'l',
 '!',
 ' ',
 'I',
 ' ',
 'j',
 'u',
 'm',
 'p',
 'e',
 'd',
 ' ',
 'o',
 'f',
 'f',
 ' ',
 'a',
 ' ',
 ' ',
 'f',
 'o',
 'o',
 't',
 ' ',
 'h',
 'i',
 'g',
 'h',
 ' ',
 'd',
 'i',
 'v',
 'e',
 '.',
 ' ']

In [18]:
my_sentence


'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive. '

In [19]:
# Find all beginning letters of each word/number
re.findall(r'\b\w', my_sentence)


['I',
 'w',
 '1',
 'd',
 's',
 'I',
 'w',
 't',
 't',
 'p',
 'I',
 'j',
 'o',
 'a',
 '9',
 'f',
 'h',
 'd']

### Your Turn
Create a sentence of your choice. Do the following:
1. Use the `match` function to match on the first character of your sentence.
2. Use the `search` function to match on the letter "a".
3. Use the `findall` function to find all non-whitespace characters.


In [20]:
my_line = "The quick brown fox jumped over the lazy dogs."

In [21]:
# Solution 1 - variant 1
re.match(r'\w', my_line)

<re.Match object; span=(0, 1), match='T'>

In [22]:
# Solution 1 - variant 2
re.match(r'T', my_line)

<re.Match object; span=(0, 1), match='T'>

In [23]:
# Solution 2
re.search(r'a', my_line)

<re.Match object; span=(37, 38), match='a'>

In [24]:
# Solution 3
re.findall(r'\S', my_line)

['T',
 'h',
 'e',
 'q',
 'u',
 'i',
 'c',
 'k',
 'b',
 'r',
 'o',
 'w',
 'n',
 'f',
 'o',
 'x',
 'j',
 'u',
 'm',
 'p',
 'e',
 'd',
 'o',
 'v',
 'e',
 'r',
 't',
 'h',
 'e',
 'l',
 'a',
 'z',
 'y',
 'd',
 'o',
 'g',
 's',
 '.']


## Alternate Sequence Notation

Sometimes we want to be more specific about the alphanumeric characters we want to match on.

Pattern | Description
--|--
[ab12] | Any character from this list
[a-d] | Any character from a to d
[^a-d] | Inverse of [a-d]
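
Several ranges can be combined inside one pair of brackets, and a negated set is a handy way to test for "any character that is not allowed". The sketch below uses both ideas to keep only hexadecimal strings; the `codes` list is made up for this example.

```python
import re

# Hypothetical sample values
codes = ["1a2f", "BEEF", "12g4", "00ff"]

# [^0-9a-fA-F] matches any character that is NOT a hex digit;
# a string is valid hex when no such character can be found
hex_only = [c for c in codes if not re.search(r"[^0-9a-fA-F]", c)]
print(hex_only)

# [^0-9] is the negated digit set: every character that is not a digit
non_digits = re.findall(r"[^0-9]", "12g4")
print(non_digits)
```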

In [26]:
my_sentence


'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive. '

In [27]:
# Find all a's, b's, 1's and 2's
re.findall(r'[ab12]', my_sentence)


['a', '1', 'a']

In [28]:
# Find all characters except a-t
re.findall(r'[^a-t]', my_sentence)


['I',
 ' ',
 'w',
 ' ',
 '1',
 '0',
 '0',
 ' ',
 ',',
 ' ',
 ' ',
 'I',
 ' ',
 'w',
 ' ',
 ' ',
 ' ',
 '!',
 ' ',
 'I',
 ' ',
 'u',
 ' ',
 ' ',
 ' ',
 '9',
 ' ',
 ' ',
 ' ',
 'v',
 '.',
 ' ']

In [29]:
# Find all vowels
re.findall(r'[aeiouAEIOU]', my_sentence)


['I',
 'a',
 'e',
 'e',
 'e',
 'o',
 'I',
 'e',
 'o',
 'e',
 'o',
 'o',
 'I',
 'u',
 'e',
 'o',
 'a',
 'o',
 'o',
 'i',
 'i',
 'e']

In [30]:
re.findall?


### Your Turn
Write a regular expression to find all non-vowels in the sentence *Oh, Good Morning!*.

```python
good_sentence = 'Oh, Good Morning!'
good_sentence
```

In [31]:
# Solution
good_sentence = 'Oh, Good Morning!'
good_sentence


'Oh, Good Morning!'

In [32]:
non_vowels = r'[^aeiouAEIOU]'
re.findall( non_vowels, good_sentence )


['h', ',', ' ', 'G', 'd', ' ', 'M', 'r', 'n', 'n', 'g', '!']

## Repeating

So far, we have only been using special sequences to match on a single character. Often, we will want to match on repeating characters. Below is a table that describes the notation for repeating characters.

Pattern | Description
--|--
* | Preceding character or set appears zero or more times, same as {0,}
+ | Preceding character or set appears one or more times, same as {1,}
? | Preceding character or set appears zero or one time, same as {0,1}
{n} | Preceding character or set appears exactly n times
{n,} | Preceding character or set appears n or more times
{n,m} | Preceding character or set appears between n and m times, inclusive
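
A quick sketch of three of these quantifiers on a made-up string (the sample text is an assumption for this example). Each pattern is wrapped in `\b` so that only whole words or whole numbers are counted.

```python
import re

text = "color colour colouur 7 42 123 2024 98765"

# ? makes the preceding 'u' optional: zero or one occurrence
colors = re.findall(r"\bcolou?r\b", text)
print(colors)

# {n}: exactly four digits between word boundaries
four_digit = re.findall(r"\b\d{4}\b", text)
print(four_digit)

# {n,m}: between two and three digits, inclusive
two_to_three = re.findall(r"\b\d{2,3}\b", text)
print(two_to_three)
```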



In [33]:
my_sentence


'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive. '

In [35]:
# \w
# \w\w
# \w\w\w
# \w\w\w\w
# \w\w\w\w\w
# \w\w\w\w\w\w
# \w\w\w\w\w\w\w
# \w+

In [36]:
# Find all words/numbers with 1 or more character
re.findall(r'\w+', my_sentence)


['It',
 'was',
 '100',
 'degrees',
 'so',
 'I',
 'went',
 'to',
 'the',
 'pool',
 'I',
 'jumped',
 'off',
 'a',
 '9',
 'foot',
 'high',
 'dive']

In [37]:
re.findall(r'\d+', my_sentence)


['100', '9']

In [38]:
my_sentence1 = "Happy Popo Day, Papoooy"
my_sentence1


'Happy Popo Day, Papoooy'

In [39]:
# Find every 'p' or 'P' followed by zero or more 'o's: 'p', 'po', 'poo', 'pooo', etc.
re.findall(r'[Pp]o*', my_sentence1)


['p', 'p', 'Po', 'po', 'P', 'pooo']

In [40]:
# Find all 'p' or 'po'
re.findall(r'po?', my_sentence1)


['p', 'p', 'po', 'po']

In [41]:
# Find all numbers
re.findall(r'\d+', my_sentence)


['100', '9']

In [42]:
# Find any words that end with "f"
re.findall(r'\w+f\b', my_sentence + " office" )


['off']

In [43]:
my_sentence


'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive. '

In [44]:
# Find all four letter words
# Without \b, r'\w{4}' would also match the first four characters of longer words
re.findall(r'\b\w{4}\b', my_sentence)


['went', 'pool', 'foot', 'high', 'dive']

In [45]:
# find all words that are at most 4 characters long
re.findall(r'\b\w{1,4}\b', my_sentence)


['It',
 'was',
 '100',
 'so',
 'I',
 'went',
 'to',
 'the',
 'pool',
 'I',
 'off',
 'a',
 '9',
 'foot',
 'high',
 'dive']

In [46]:
# find all words that are at least 4 characters long
re.findall(r'\b\w{4,}\b', my_sentence)


['degrees', 'went', 'pool', 'jumped', 'foot', 'high', 'dive']

### Your Turn
Use the `findall()` function with regular expressions to find the following from `my_sentence`:
1. All words/numbers that are exactly 3 characters long
2. All numbers that are one digit
3. Any word that begins with "w"
4. **Bonus**: Any word that contains an "a"

```python
my_sentence = "It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive."

```

In [47]:
my_sentence = "It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive."
my_sentence

'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive.'

In [48]:
# Solution 1
re.findall(r'\b\w{3}\b', my_sentence)

['was', '100', 'the', 'off']

In [49]:
# Solution 2
re.findall(r'\b\d\b', my_sentence)

['9']

In [50]:
# Solution 3 - variant 1
re.findall(r'\bw\w*\b', my_sentence)

['was', 'went']

In [51]:
# Solution 3 - variant 2
re.findall(r'\bw\w{0,}\b', my_sentence)

['was', 'went']

In [52]:
# Solution 4
re.findall(r'\b\w*a\w*\b', my_sentence)

['was', 'a']

## A Few Other Matching Sequences

Pattern | Description
--|--
. | Any character (wildcard)
\| | Either or

In [53]:
my_sentence = "It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive."
my_sentence


'It was 100 degrees, so I went to the pool! I jumped off a 9 foot high dive.'

In [54]:
re.findall(r'e.', my_sentence)

['eg', 'ee', 'en', 'e ', 'ed', 'e.']

In [55]:
re.findall(r'high|dive', my_sentence)

['high', 'dive']
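
Two details of these patterns are worth a short sketch. By default `.` does not match a newline, and `|` separates whole alternatives unless parentheses narrow its scope. The example strings are made up, and `(?:...)` (a non-capturing group) is standard `re` syntax that this notebook does not otherwise cover.

```python
import re

# . matches any single character except a newline (by default)
dot_matches = re.findall(r"h.t", "hat hot h\nt h t")
print(dot_matches)

# | has low precedence: the alternatives are the whole left and right sides
plain_alt = re.findall(r"gray|grey", "gray grey griy")
print(plain_alt)

# Parentheses limit the alternation to one part of the pattern;
# (?:...) groups without capturing, so findall still returns the full match
grouped_alt = re.findall(r"gr(?:a|e)y", "gray grey griy")
print(grouped_alt)
```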

# Metacharacters - special characters
These characters have special meanings and do not match themselves:

`. ^ $ * + ? { } [ ] \ | ( )`

## Backslashes - escape out special characters
To match a metacharacter literally, put a `\` in front of it.


In [56]:
# Without the backslash, + is a quantifier: '2+2' means one or more 2s followed by a 2,
# so it matches runs of two or more 2s and does not match the literal text '2+2'
my_char_string = '2+2  22222'
re.findall(r'2+2', my_char_string)


['22222']

In [57]:
# Escaping the + with a backslash makes it match a literal plus sign
re.findall(r'2\+2', my_char_string)

['2+2']
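
When the text to match comes from a variable, typing the backslashes by hand is not practical. The standard-library function `re.escape` adds them for you; the sketch below reuses the string from the cells above.

```python
import re

my_char_string = '2+2  22222'

# re.escape returns the pattern with its metacharacters backslash-escaped
literal = re.escape('2+2')
print(literal)

# The escaped pattern now matches the literal text only
escaped_matches = re.findall(literal, my_char_string)
print(escaped_matches)
```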

# Match Object

The match and search functions return a match object that has the following methods/attributes:

- .span() - returns a tuple containing the start and end positions of the match.
- .string - returns the string passed into the function
- .group() - returns the part of the string where there was a match

In [58]:
my_text = 'The cohort 14 students are data science ninjas'
my_match = re.search('cohort', my_text)
print(my_match.span())
print(my_match.string)
print(my_match.group())


(4, 10)
The cohort 14 students are data science ninjas
cohort


In [59]:
seq = "GCACGTGTAACTCTGATCTAGACACGTATC"
site = re.search(r'AC[AG][CT]GT', seq)
start, end = site.span()
(
    len(seq),
    site.group(),
    seq[:start],
    seq[start:end],
    seq[end:]

)


(30, 'ACGTGT', 'GC', 'ACGTGT', 'AACTCTGATCTAGACACGTATC')

In [60]:
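# Search again just past the start of the first site.
# Note that the span is relative to the slice seq[start+1:], not to seq itself.
site = re.search(r'AC[AG][CT]GT', seq[start+1:])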
site


<re.Match object; span=(18, 24), match='ACACGT'>

In [61]:

# finditer yields one match object per site, each carrying its own span in seq
for hit in re.finditer(r'AC[AG][CT]GT', seq):
    print(hit.span())



(2, 8)
(21, 27)


# Grouping

Groups let you capture parts of a match so that you can use them separately.

In [62]:
# group(1) is determined by our first set of ()
# in this case it is one or more word characters \w+
# group(2) is determined by our second set of ()
# in this case it is one or more digits \d+

# let's say you have data that starts with 'name:height'
data = 'Nevin:65:87111:ABQ:505-333-1234'
m1 = re.search(r'(\w+):(\d+)', data)

print(m1.group(0))
print(m1.group(1))
print(m1.group(2))


Nevin:65
Nevin
65


In [63]:
m1 = re.search(r'(\d+)-((\d+)-(\d+))', 'Nevin:65:87111:ABQ:505-333-1234')

# group(0) is always the entire result
print(m1.group(0))
print(m1.group(1))
print(m1.group(2))
print(m1.group(3))
print(m1.group(4))


505-333-1234
505
333-1234
333
1234


In [65]:
name = "Robert          Citek"
name

'Robert          Citek'

In [66]:
m1 = re.search(r'(\w+) +(\w+)', name)
print(m1.group(0))
print(m1.group(1))
print(m1.group(2))
print(f"{m1.group(2)}, {m1.group(1)}")


Robert          Citek
Robert
Citek
Citek, Robert
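
Numbered groups become hard to follow as patterns grow. Python's `re` module also supports named groups with the `(?P<label>...)` syntax, which this notebook does not otherwise use. The sketch below repeats the name example with the labels `first` and `last`, both chosen for this illustration.

```python
import re

name = "Robert          Citek"

# (?P<label>...) gives a group a name in addition to its number
m = re.search(r"(?P<first>\w+) +(?P<last>\w+)", name)
if m:
    print(m.group("first"), m.group("last"))
    # groupdict() returns all named groups as a dictionary
    parts = m.groupdict()
    print(f"{parts['last']}, {parts['first']}")
```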


# Example

Use a regular expression to pick just the phone number out from this string AND store it as a new string in the format XXX-XXX-XXXX.

In [68]:
contact = 'Brandon - 980  - - -  "; DROP TABLE CONTACTS ;" 123 alfjdalsdkf asdf 4567'
contact


'Brandon - 980  - - -  "; DROP TABLE CONTACTS ;" 123 alfjdalsdkf asdf 4567'

In [69]:
m2 = re.search(r'(\d{3})(\D*)(\d{3})(\D*)(\d{4})', contact)
m2.group(0)


'980  - - -  "; DROP TABLE CONTACTS ;" 123 alfjdalsdkf asdf 4567'

In [70]:
(
    m2.group(1),
    m2.group(2),
    m2.group(3),
    m2.group(4),
    m2.group(5),
)

('980',
 '  - - -  "; DROP TABLE CONTACTS ;" ',
 '123',
 ' alfjdalsdkf asdf ',
 '4567')

In [71]:
new_number = f"{m2.group(1)}-{m2.group(3)}-{m2.group(5)}"
new_number


'980-123-4567'
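
The same steps can be wrapped in a small function that also copes with strings containing no phone number. This is only a sketch: `format_phone` is a hypothetical helper name, and it captures just the three digit groups instead of also capturing the text between them.

```python
import re

def format_phone(text):
    """Return the first 3-3-4 digit sequence in text as XXX-XXX-XXXX, or None."""
    # Only the digit groups are captured; the \D* runs in between are skipped
    m = re.search(r"(\d{3})\D*(\d{3})\D*(\d{4})", text)
    if m is None:
        return None
    # groups() returns the captured groups as a tuple of strings
    return "-".join(m.groups())

contact = 'Brandon - 980  - - -  "; DROP TABLE CONTACTS ;" 123 alfjdalsdkf asdf 4567'
print(format_phone(contact))
print(format_phone("no digits here"))
```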

In [73]:
garbage = '''
Joe Smith
Doe, John
       fred jones, jr.
jones jr., fred
'''
print(garbage)


Joe Smith
Doe, John
       fred jones, jr.
jones jr., fred



In [74]:
re.search(r'(\w+) *(\w+)', garbage)


<re.Match object; span=(1, 10), match='Joe Smith'>

In [None]:
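for line in garbage.split("\n"):
    # Skip blank lines: they contain no non-whitespace character
    if not re.search(r'\S', line):
        continue
    print(line)
    if "," in line:
        # "Last, First" form: keep the order
        m1 = re.search(r'(\w+), *(\w+)', line)
        order = (1, 2)
    else:
        # "First Last" form: swap the order
        m1 = re.search(r'(\w+) *(\w+)', line)
        order = (2, 1)
    # Lines such as "jones jr., fred" fit neither pattern, so guard against None
    if m1:
        print(f"== {m1.group(order[0]).title()}, {m1.group(order[1]).title()}")
    else:
        print("== no match")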


# Regex & Pandas DataFrames


Many pandas string methods (available through the `.str` accessor of a column) accept regular expressions.

This is extremely useful for cleaning data.


In [77]:
import pandas as pd

In [78]:
name_df = pd.DataFrame([[0, 'Kyla Bendt'],
                      [1, 'Ben Ben'],
                      [2, 'Bart Simpson'],
                      [3, 'Stan Lowell'],
                      [4, 'Daniel Tiger']],
                     columns=["id","names"])
name_df

Unnamed: 0,id,names
0,0,Kyla Bendt
1,1,Ben Ben
2,2,Bart Simpson
3,3,Stan Lowell
4,4,Daniel Tiger


In [79]:
bs_names = name_df["names"].str.match(r'[BS]') # Names that start w/ a B or S
bs_names

0    False
1     True
2     True
3     True
4    False
Name: names, dtype: bool

In [81]:
name_df[bs_names]


Unnamed: 0,id,names
1,1,Ben Ben
2,2,Bart Simpson
3,3,Stan Lowell


In [82]:
names_3 = name_df["names"].str.match(r'\b\w{1,3}\b') # First names that are one to three letters long
print(names_3)
name_df[names_3]


0    False
1     True
2    False
3    False
4    False
Name: names, dtype: bool


Unnamed: 0,id,names
1,1,Ben Ben


In [83]:
name_df


Unnamed: 0,id,names
0,0,Kyla Bendt
1,1,Ben Ben
2,2,Bart Simpson
3,3,Stan Lowell
4,4,Daniel Tiger


In [84]:
# Extract first names and make a new column in data frame
name_df["first_names"] = name_df["names"].str.extract(r'(\w+)')
name_df


Unnamed: 0,id,names,first_names
0,0,Kyla Bendt,Kyla
1,1,Ben Ben,Ben
2,2,Bart Simpson,Bart
3,3,Stan Lowell,Stan
4,4,Daniel Tiger,Daniel


In [85]:
# Count the number of people who have a first name that starts with "B"
# (^ anchors the pattern to the start of the string)
name_df['first_names'].str.count(r'^B')


0    0
1    1
2    1
3    0
4    0
Name: first_names, dtype: int64

In [86]:
name_df['first_names'].str.count(r'^B').sum()


2

In [87]:
# Split on a space
name_df['names'].str.split(r' ')


0      [Kyla, Bendt]
1         [Ben, Ben]
2    [Bart, Simpson]
3     [Stan, Lowell]
4    [Daniel, Tiger]
Name: names, dtype: object
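
Two more `.str` methods pair naturally with the ones above. This is a sketch on a small made-up Series rather than on `name_df`; the column labels `first` and `last` come from named groups written as `(?P<label>...)`.

```python
import pandas as pd

names = pd.Series(["Kyla Bendt", "Ben Ben", "Bart Simpson"])

# With named groups, str.extract returns one column per group
parts = names.str.extract(r"(?P<first>\w+) +(?P<last>\w+)")
print(parts)

# str.contains searches anywhere in each string (like re.search),
# whereas str.match only looks at the beginning (like re.match)
has_ben = names.str.contains(r"Ben")
print(has_ben.tolist())
```

Note that `Kyla Bendt` counts as containing `Ben` because the pattern is found inside the last name.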

## Your Turn
1. Create a pandas data frame named `my_cohort` with one column that contains the names of everyone in your cohort (or names of people that you know), and a second column that contains each person's favorite food.
2. Use regular expressions to find anyone with a 3-5 letter first name, returning a data frame.
3. Use regular expressions to find anyone whose favorite food starts with any letter from "a" to "m", returning a data frame.


In [101]:
%%capture output
!pip install faker faker_food


In [96]:
from faker import Faker
from faker_food import FoodProvider


In [112]:
# Create a Faker instance
fake = Faker()
fake.add_provider(FoodProvider)


In [113]:
size_of_cohort = 10

In [114]:
random_names = [fake.name() for _ in range(size_of_cohort)]
random_names

['Sarah Butler',
 'John Gentry',
 'Dawn White',
 'Debra Clayton',
 'Kevin Hoffman',
 'Michael Estrada',
 'April Navarro',
 'Curtis Lewis',
 'Leslie Benson',
 'Michael Graham']

In [115]:
random_food = [ fake.dish() for _ in range(size_of_cohort) ]
random_food

['Califlower penne',
 'Tuna sashimi',
 'Chicken wings',
 'Linguine with clams',
 'California maki',
 'Cheeseburger',
 'Bruschette with Tomato',
 'Caesar Salad',
 'Pasta and Beans',
 'Pork belly buns']

In [116]:
# Solution 1
my_cohort = pd.DataFrame( {
    "names": random_names,
    "fav_food": random_food,
}
)
my_cohort

Unnamed: 0,names,fav_food
0,Sarah Butler,Califlower penne
1,John Gentry,Tuna sashimi
2,Dawn White,Chicken wings
3,Debra Clayton,Linguine with clams
4,Kevin Hoffman,California maki
5,Michael Estrada,Cheeseburger
6,April Navarro,Bruschette with Tomato
7,Curtis Lewis,Caesar Salad
8,Leslie Benson,Pasta and Beans
9,Michael Graham,Pork belly buns


In [117]:
# Solution 2
# Named name_mask rather than filter, to avoid shadowing Python's built-in filter()
name_mask = my_cohort["names"].str.match(r'\w{3,5}\b')
my_cohort[name_mask]

Unnamed: 0,names,fav_food
0,Sarah Butler,Califlower penne
1,John Gentry,Tuna sashimi
2,Dawn White,Chicken wings
3,Debra Clayton,Linguine with clams
4,Kevin Hoffman,California maki
6,April Navarro,Bruschette with Tomato


In [118]:
# Solution 3
# Named food_mask rather than filter, to avoid shadowing Python's built-in filter()
food_mask = my_cohort["fav_food"].str.match(r'[a-m]', case=False)
my_cohort[food_mask]

Unnamed: 0,names,fav_food
0,Sarah Butler,Califlower penne
2,Dawn White,Chicken wings
3,Debra Clayton,Linguine with clams
4,Kevin Hoffman,California maki
5,Michael Estrada,Cheeseburger
6,April Navarro,Bruschette with Tomato
7,Curtis Lewis,Caesar Salad


## Recommended Resources

- [Python.org Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) - Regular Expression Documentation
- [Python Data Science Handbook - Working with Strings](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb) - covers regular expressions and other useful string methods in Pandas
- [Mastering Regular Expressions](https://learning.oreilly.com/library/view/mastering-regular-expressions/0596528124/)
- [Regex planet](https://www.regexplanet.com/)
- [Regex reference](https://www.regular-expressions.info/refflavors.html)


