# Regex

- What is a regular expression?
- When are regular expressions useful?

In [1]:
import pandas as pd
import re 

In [2]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [3]:
import re # part of the python stdlib

- search: returns the first match for a regex in a subject (or `None` if there is no match)
- findall: returns a list of *all* the matches for a regex in a subject

### Literals

In [4]:
regexp = r'a'
#holds regular expression

subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <ol>
        <li>Change your regular expression to match the literal character "b". What do you notice?</li>
        - the index numbers change from 0,1 to 1,2 and the match changes from a to b
        <li>Change your regular expression to match the literal string "ab". What do you notice?</li>
        - the index numbers change from 0,1 to 0,2 and the match changes from a to ab
        <li>Change your regular expression to match the literal "d". What do you notice?</li>
        - `re.search` returns `None`, so there is no output, meaning there is no match
        <li>Use <code>re.findall</code> instead of <code>re.search</code>. How do the results differ?</li>
        - a list of the match is produced, if there is a match, otherwise, the list is empty
        <li>Change your regular expression to just the "." character. What are the results?</li>
        - with findall, every character in the string is returned as a separate element of a list
    </ol>
</div>

In [5]:
regexp = r'.'
#holds regular expression

subject = 'abc'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

### Metacharacters

The `r` prefix marks a 'raw' string, so backslashes reach the regex engine unchanged.
- `.`    matches any character (including whitespace, but not a newline by default)
- `\w`   matches any letter, number, or underscore
- `\s`   matches any whitespace character
- `\d`   matches any digit
- Capital variants match the inverse: `\W` is anything that is NOT a letter, number, or underscore; `\S` is anything that is NOT whitespace; `\D` is anything that is NOT a digit
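
A small sketch of the metacharacters listed above, run against the same `'abc 123'` subject used in these notes. The variable names are only for illustration.

```python
import re

subject = 'abc 123'

# each class and its capitalized inverse, applied to the same subject
word_chars = re.findall(r'\w', subject)      # letters, digits, underscore
non_word_chars = re.findall(r'\W', subject)  # everything that is not \w
digits = re.findall(r'\d', subject)          # digits only
non_digits = re.findall(r'\D', subject)      # everything that is not a digit
whitespace = re.findall(r'\s', subject)      # whitespace only

print(word_chars)      # ['a', 'b', 'c', '1', '2', '3']
print(non_word_chars)  # [' ']
print(digits)          # ['1', '2', '3']
print(non_digits)      # ['a', 'b', 'c', ' ']
print(whitespace)      # [' ']
```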

In [6]:
regexp = r'..'
subject = 'abc 123'

re.findall(regexp, subject)

['ab', 'c ', '12']

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Continue to use the same subject variable from above.</p>
    <ol>
        <li>Use all of the above metacharacters with <code>re.findall</code>. What do you notice?</li>
        - everything is separated into isolated instances within a list
        <li>What does the regular expression <code>\w\w</code> match?</li>
        - each pair of adjacent letters or numbers: it finds the first pair, then keeps scanning for the next pair of word characters (a space breaks a pair)
        <li>Use only metacharacters to write a regular expression to match "c 1".</li>
        - \w\s\d alphanumeric character followed by a space, followed by a digit
        <li>Use a combination of metacharacters to match 3 digits in a row.</li>
        - \d\d\d
    </ol>
</div>


 - To match a literal `.`, use `\.` (the backslash escapes the metacharacter)
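
To see the difference escaping makes, here is a short sketch comparing an unescaped and an escaped period. The subject `'codeup.com'` is just a convenient sample string.

```python
import re

subject = 'codeup.com'

# unescaped: the period is a metacharacter, so every character matches
any_char = re.findall(r'.', subject)

# escaped: only the literal period matches
literal_dot = re.findall(r'\.', subject)

print(len(any_char))  # 10
print(literal_dot)    # ['.']
```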

### Repeating

- `{}`    Custom number of repetitions
    - `{x}` exactly x, `{x,}` x or more, `{x,y}` between x and y
- `*`     Zero or more
- `+`     One or more of whatever is to its left
- `?`     Optional; can also make a repetition non-greedy
- greedy + non-greedy
    - greedy - matches as much as it possibly can (`{3,5}` takes 5 when available), stopping at the last place the pattern could stop
    - non-greedy - matches as little as it possibly can, stopping at the first place the pattern could stop

    When a question mark follows a metacharacter or literal character, it makes that character optional.
    When it follows a repetition operator, it changes that operator from greedy to non-greedy.

- `\s*\w+`   zero or more spaces followed by one or more alphanumeric characters
- `\w{3}`   three alphanumeric characters in a row: 'abc', '123'
- `.{3,5}`  3 to 5 of any character
- `.+\d`    one or more of anything followed by a digit (all of 'abc 123')
- `.+?\d`   the same, but stopping the first time it could possibly stop ('abc 1')
- `\w+\s?\d+?`  one or more alphanumeric characters, optionally followed by a space, followed by as few digits as possible
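
The greedy and non-greedy behavior described above can be checked with a quick sketch on the `'abc 123'` subject. The variable names are illustrative only.

```python
import re

subject = 'abc 123'

# greedy: .+ grabs as much as it can while still leaving a digit at the end
greedy = re.search(r'.+\d', subject).group()

# non-greedy: .+? stops at the first place a digit can follow
lazy = re.search(r'.+?\d', subject).group()

# the same idea with a bounded repetition
bounded_greedy = re.search(r'\d{1,3}', subject).group()
bounded_lazy = re.search(r'\d{1,3}?', subject).group()

print(greedy)          # abc 123
print(lazy)            # abc 1
print(bounded_greedy)  # 123
print(bounded_lazy)    # 1
```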

In [7]:
regexp = r'\w+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 3), match='abc'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the string below as your subject for this exercise.</p>
    <pre><code>Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.</code></pre>
    <ol>
        <li>Write a regular expression that matches all the numbers.</li>
        <li>Write a regular expression that matches a 5 digit number, but not a number with fewer digits.</li>
        <li>Write a regular expression that matches `http://` or `https://`.</li>
        <li>Write a regular expression that matches all of the words.</li>
    </ol>
</div>

### Write a regular expression that matches all the numbers.

In [8]:
# gives all digits
regexp = r'\d'
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com.'
)
re.findall(regexp, subject)


['2', '0', '1', '4', '6', '0', '0', '3', '5', '0', '7', '8', '2', '3', '0']

In [9]:
# gives all numbers
regexp = r'\d+'
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com.'
)
re.findall(regexp, subject)

['2014', '600', '350', '78230']

### Write a regular expression that matches a 5 digit number, but not a number with fewer digits.

In [10]:
# gives instances of 5 digits in a row
regexp = r'\d{5}'
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com.'
)
re.findall(regexp, subject)

['78230']

In [11]:
# gives instances of 4 or more
regexp = r'\d{4,}'
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com.'
)
re.findall(regexp, subject)

['2014', '78230']

### Write a regular expression that matches `http://` or `https://`

In [12]:
# ANY protocol
regexp = r'\w+://'
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com.'
)

re.findall(regexp, subject)

['http://', 'https://']

In [13]:
# specifically http or https
regexp = r'https?://'
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com.'
)

re.findall(regexp, subject)

['http://', 'https://']

### Write a regular expression that matches all of the words.

In [14]:
# produces all words and number groups 
regexp = r'\w+'
subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230. You can find us online at http://codeup.com '
    'and our alumni portal is located at https://alumni.codeup.com.'
)
re.findall(regexp, subject)

['Codeup',
 'founded',
 'in',
 '2014',
 'is',
 'located',
 'at',
 '600',
 'Navarro',
 'St',
 'Suite',
 '350',
 'San',
 'Antonio',
 'TX',
 '78230',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com']

In [15]:
# produces only the words (the + groups consecutive letters into one match)
regexp = r'[a-zA-Z]+'
subject = 'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.'

re.findall(regexp, subject)

['Codeup',
 'founded',
 'in',
 'is',
 'located',
 'at',
 'Navarro',
 'St',
 'Suite',
 'San',
 'Antonio',
 'TX',
 'You',
 'can',
 'find',
 'us',
 'online',
 'at',
 'http',
 'codeup',
 'com',
 'and',
 'our',
 'alumni',
 'portal',
 'is',
 'located',
 'at',
 'https',
 'alumni',
 'codeup',
 'com']

### Any/None Of

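- `[...]` matches any one of the characters inside the brackets; `[^...]` matches any one character that is NOT inside them
- search - finds the 1st match of the pattern, no matter where it is in the string
- findall - finds EVERY match
- match - finds a match only at the very beginning of the string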
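
A short sketch of how the three functions differ on the same subject. Note that `re.match` only tries the start of the string, so it returns `None` here even though digits appear later.

```python
import re

subject = 'abc 123'

first_digits = re.search(r'\d+', subject)  # scans the whole string
at_start = re.match(r'\d+', subject)       # only tries position 0
all_runs = re.findall(r'\w+', subject)     # every non-overlapping match

print(first_digits.group())  # 123
print(at_start)              # None
print(all_runs)              # ['abc', '123']
```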

In [16]:
# match the character a or the character 1, match the character b or the character 2, etc...
regexp = r'[a1][b2][c3]'
subject = 'abc 123'

re.findall(regexp, subject)

['abc', '123']

In [17]:
# match the first character between a and z
regexp = r'[a-z]'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(0, 1), match='a'>

In [18]:
# match the first run of one or more digits
regexp = r'[0-9]+'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(4, 7), match='123'>

In [19]:
# match characters between a and z
regexp = r'[a-z]+'
subject = 'abc 123'

re.findall(regexp, subject)

['abc']

In [20]:
# match runs of characters that are NOT (^) 1, 2, 3, or b
regexp = r'[^123b]+'
subject = 'abc 123'

re.findall(regexp, subject)

['a', 'c ']

In [21]:

subject = '123abc'

re.search(regexp, subject)

<re.Match object; span=(3, 4), match='a'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches even numbers.</li>
        <li>Write a regular expression that matches 2 or more odd numbers in a row.</li>
        <li>Write a regular expression that matches any word with a vowel in it.</li>
    </ol>
</div>

Write a regular expression that matches even numbers.

In [22]:
# optional leading digits, then an even final digit, bounded so partial numbers don't match
regexp = r'\b\d*[02468]\b'
subject = '123 456 1 10 13 1234567'

re.findall(regexp, subject)

['456', '10']

Write a regular expression that matches 2 or more odd numbers in a row.

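In [23]:
# an even number followed by a non-digit; '123' is odd, so nothing matches
regexp = r'\d*[02468]\D'
subject = '123 '

re.findall(regexp, subject)

[]

In [24]:
# two odd numbers in a row: an odd number, non-digits, then another odd number
# 456 sits between the odd numbers in this subject, so nothing matches
regexp = r'\b\d*[13579]\D+\d*[13579]\b'
subject = '123 456 123'

re.findall(regexp, subject)

[]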

Write a regular expression that matches any word with a vowel in it.

In [25]:
# letters, then a vowel, then any remaining letters, so the whole word is returned
regexp = r'[a-z]*[aeiou][a-z]*'
subject = 'abc 123'

re.findall(regexp, subject)

['abc']

### Anchors

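- `^` matches at the start of the string
- `$` matches at the end of the string
- `\b` matches at a word boundary (the edge between a word character and a non-word character)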
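
A minimal sketch of both anchors on the `'abc 123'` subject: `^` pins the pattern to the start of the string and `$` pins it to the end. The variable names are only for illustration.

```python
import re

subject = 'abc 123'

starts_with_a = re.search(r'^a', subject)     # 'a' is the first character
starts_with_b = re.search(r'^b', subject)     # 'b' exists, but not at the start
ends_with_digit = re.search(r'\d$', subject)  # the last character is a digit

print(starts_with_a.group())    # a
print(starts_with_b)            # None
print(ends_with_digit.group())  # 3
```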

In [26]:
regexp = r'b'
subject = 'abc 123'

re.search(regexp, subject)

<re.Match object; span=(1, 2), match='b'>

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>For this exercise you should make up various subjects and test them with your regular expressions.</p>
    <ol>
        <li>Write a regular expression that matches if a word starts with a vowel.</li>
        <li>Write a regular expression that matches if a word starts with a capital letter.</li>
        <li>Write a regular expression that matches if a word ends with a capital letter.</li>        
        <li>Write a regular expression that matches if a word starts <b>and</b> ends with a capital letter.</li>
    </ol>
</div>

In [49]:
regexp = r'.$'
subject = 'abc 123'

re.findall(regexp, subject)

['3']

In [38]:
# anything followed by a word boundary
regexp = r'.\b'
subject = 'abc 123'

re.findall(regexp, subject)

['c', ' ', '3']

In [39]:
# a word boundary followed by anything
regexp = r'\b.'
subject = 'abc 123'

re.findall(regexp, subject)
# the space sits right after the boundary that ends 'abc', so it is the 'anything' that follows that word boundary

['a', ' ', '1']

In [29]:
regexp = r' .$'
subject = 'abc 123'

re.search(regexp, subject)

### Write a regular expression that matches if a word starts with a vowel.

In [41]:
regexp = r'\b[aeiou]'
subject = 'apple banana eggplant'

re.findall(regexp, subject)

['a', 'e']

In [54]:
# If a single word starts with a vowel
regexp = r'^[aeiouAEIOU]\w+'
subject = 'apple'

re.findall(regexp, subject)

['apple']

In [None]:
regexp = r'^[aeiouAEIOU]\w+'
subject = 'apple banana eggplant'

re.findall(regexp, subject)

### Write a regular expression that matches if a word starts with a capital letter.

In [42]:
regexp = r'\b[A-Z]'
subject = 'apple Banana eggplant Carrot'

re.findall(regexp, subject)

['B', 'C']

In [56]:
# If a single word starts with a capital letter
regexp = r'^[A-Z]\w+'
subject = 'apple'

re.findall(regexp, subject)

[]

### Write a regular expression that matches if a word ends with a capital letter.

In [44]:
regexp = r'[A-Z]\b'
subject = 'apple Banana eggplant CarroT'

re.findall(regexp, subject)

['T']

### Write a regular expression that matches if a word starts and ends with a capital letter.

In [47]:
# a capital letter at the start of a word, any word characters, then a capital letter at the end
regexp = r'\b[A-Z]\w*[A-Z]\b'
subject = 'apple Banana eggplant CarroT'

re.findall(regexp, subject)

['CarroT']

### Capture Groups

In [60]:
regexp = r'.*?(\d+)'
df = pd.DataFrame()
df['word'] = ['abc', 'abc123', '123'] 
df['extracted'] = df.word.str.extract(regexp)
df

Unnamed: 0,word,extracted
0,abc,
1,abc123,123
2,123,123


## `re.sub`

- removing
- substitution

In [31]:
regexp = r'\d+'
subject = 'abc123'

re.sub(regexp, '', subject)

'abc'
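
Besides removing text, `re.sub` can rearrange it: `\1`, `\2`, and so on in the replacement refer back to the capture groups in the pattern. The sketch below uses a made-up `'Hopper, Grace'` subject purely for illustration.

```python
import re

# hypothetical subject in "last, first" order
subject = 'Hopper, Grace'

# \1 is the first captured word, \2 the second; the replacement swaps them
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', subject)

print(swapped)  # Grace Hopper
```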

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Mini Exercise
    </div>
    <p>Use the code below to get started on this exercise.</p>
    <pre><code>dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])</code></pre>
    <p>Use regular expression substitution to reformat the dates in the format common in the US: m/d/y.</p>
</div>

In [None]:
regexp = r'\d+'
dates = pd.Series(['2020-11-12', '2020-07-13', '2021-01-12'])
subject = 'abc123'

re.sub(regexp, '', subject)

In [None]:
regexp = r'(\d{4})-(\d{2})-(\d{2})'
df = pd.DataFrame()
df['dates'] = ['2020-11-12', '2020-07-13', '2021-01-12']
# \2/\3/\1 reorders the captured year, month, day into month/day/year
df['new_dates'] = df.dates.str.replace(regexp, r'\2/\3/\1', regex=True)
df

## Misc

### Pandas Usage

- `.str`
    - `.extract`
    - `.count`
    - `.contains`
    - `.replace`
- extract + concat
- named groups

In [32]:
df = pd.DataFrame()
df['text'] = pd.Series([
    'You should go check out https://regex101.com, it is a great website!',
    'My favorite search engine is https://duckduckgo.com',
    'If you have a question, you can get it answered through http://askjeeves.com, it is great!',
])
df

Unnamed: 0,text
0,"You should go check out https://regex101.com, ..."
1,My favorite search engine is https://duckduckg...
2,"If you have a question, you can get it answere..."


In [33]:
df.text.str.extract(r'(https?)://(\w+)\.(\w+)')

Unnamed: 0,0,1,2
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Interactive Regex Tool

To install the `hlre` tool:

```
python -m pip install hlre
```

[For more documentation and the source](https://github.com/zgulde/hlre)

See also [regex101](https://regex101.com) (make sure to select the Python flavor)

### Named capture groups

In [34]:
text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
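
Individual named groups can also be pulled out one at a time instead of as a whole dictionary. A small sketch, repeating the pattern so it runs on its own:

```python
import re

text = 'You should go check out https://regex101.com, it is a great website!'

match = re.search(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)', text)

protocol = match.group('protocol')  # look a group up by name
tld = match['tld']                  # subscript access works too
whole = match.group(0)              # group 0 is the entire match

print(protocol)  # https
print(tld)       # com
print(whole)     # https://regex101.com
```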

In [35]:
df.text.str.extract(r'(?P<protocol>https?)://(?P<base_domain>\w+)\.(?P<tld>\w+)')

Unnamed: 0,protocol,base_domain,tld
0,https,regex101,com
1,https,duckduckgo,com
2,http,askjeeves,com


### Verbose regular expressions

- `re.VERBOSE`
- `(?# this is a comment)`

In [36]:
text = 'You should go check out https://regex101.com, it is a great website!'

regexp = r'''
(?P<protocol>https?)
:// (?# ignore the :// that separates protocol from domain)
(?P<base_domain>\w+)
\.
(?P<tld>\w+)
'''
match = re.search(regexp, text, re.VERBOSE) # whitespace in the regex is ignored
match.groupdict()

{'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
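
With `re.VERBOSE`, a plain `#` also starts a comment that runs to the end of the line, which can be easier to read than `(?# ...)`. A sketch of the same pattern written that way (the comments are my own wording):

```python
import re

text = 'You should go check out https://regex101.com, it is a great website!'

pattern = re.compile(r'''
    (?P<protocol>https?)   # http or https
    ://                    # separator between protocol and domain
    (?P<base_domain>\w+)   # e.g. regex101
    \.                     # a literal dot
    (?P<tld>\w+)           # e.g. com
''', re.VERBOSE)

parts = pattern.search(text).groupdict()
print(parts)  # {'protocol': 'https', 'base_domain': 'regex101', 'tld': 'com'}
```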