# Regular Expressions

https://docs.python.org/3/library/re.html

## Searching for specific characters


In [2]:
text = "The agent's phone number is 408-555-1234. Call soon!"

We'll start by checking whether the string "phone" appears inside the text string. For a plain substring we can do this quickly with the in operator:

In [3]:
'phone' in text

True

In [4]:
'408-555-1234' in text

True

### Introducing Regular Expressions

In [5]:
import re

In [6]:
pattern = 'phone'

In [7]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [9]:
matched = re.search(pattern,text)
matched

<re.Match object; span=(12, 17), match='phone'>

In [16]:
matched.span() # the location of the pattern that we matched
# matched.start() and matched.end() return the two ends of the span separately

(12, 17)

In [21]:
pattern = "phone phone"

In [22]:
re.search(pattern,text) # no output

There is no "phone phone" in this sentence, so re.search returns None and the cell shows no output.
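Because a failed search returns None, calling .span() or .group() on the result directly would raise an AttributeError. A minimal sketch of guarding against that, assuming a hypothetical helper named `find_span` that is not part of the original notebook:

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"

def find_span(pattern, string):
    """Hypothetical helper: return the span of the first match, or None."""
    match = re.search(pattern, string)
    if match is None:
        # No match: there is no Match object to call .span() on.
        return None
    return match.span()

found = find_span("phone", text)
missing = find_span("phone phone", text)
print(found)    # (12, 17)
print(missing)  # None
```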

But what if the pattern occurs more than once?

In [23]:
text = "my phone is a new phone"

In [24]:
match = re.search("phone",text)

In [25]:
match.span()

(3, 8)

Notice that it only matches the first instance. If we want a list of all matches, we can use re.findall():

In [26]:
matches = re.findall("phone",text)

In [28]:
matches

['phone', 'phone']

In [29]:
len(matches)

2

### Using the iterator

In [20]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


If you want the actual text that matched, you can use the .group() method. Here match still refers to the last match object produced by the loop above.

In [21]:
match.group()

'phone'
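As a small sketch of how these pieces fit together, the iterator can collect the matched text and both ends of the span for every occurrence. The variable name `occurrences` is only illustrative:

```python
import re

text = "my phone is a new phone"

# For each Match object, record the matched text plus its start and end index.
occurrences = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", text)]
print(occurrences)  # [('phone', 3, 8), ('phone', 18, 23)]
```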

## Searching for patterns of characters



### Introducing Identifiers

Notice how the identifiers below make heavy use of the backslash \ . 
Placing an r in front of the string makes it a raw string, so Python passes each \ in the pattern through to the regex engine instead of treating it as the start of an escape sequence.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letters, digits, and underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
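The effect of the r prefix and of a couple of the identifiers in the table can be checked with a short sketch. The sample strings here are made up for illustration and are not from the original notebook:

```python
import re

# In a normal string literal, "\n" is one character (a newline).
# With the r prefix, r"\n" is two characters: a backslash and the letter n.
print(len("\n"), len(r"\n"))  # 1 2

# The raw pattern r"file_\d\d" reaches the re module with its backslashes intact.
file_match = re.search(r"file_\d\d", "Saved as file_25 yesterday")
print(file_match.group())  # file_25

# \D is the opposite of \d: it matches any character that is not a digit.
non_digits = re.findall(r"\D\D\D", "ABC123")
print(non_digits)  # ['ABC']
```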

In [33]:
text = "My telephone number is 408-555-1234"

#### Then we can generalize our search pattern

In [34]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)  
# retrieve any run of characters that looks like ###-###-####

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers.

In [35]:
phone.group()

'408-555-1234'

### Introducing Quantifiers
Now that we know the identifiers, we can combine them with quantifiers to define how many occurrences we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

In [36]:
phone = re.search(r'\d{3}-\d{3}-\d{4}',text)
phone.group()

'408-555-1234'
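A few of the quantifiers from the table can be compared side by side. This is only a sketch on made-up sample strings; note how a ranged quantifier is greedy and splits long runs of digits:

```python
import re

sample = "Order 7, order 42, order 12345 and order 999999"

# + takes one or more digits, so each full run of digits is one match.
one_or_more = re.findall(r"\d+", sample)
print(one_or_more)  # ['7', '42', '12345', '999999']

# {2,4} takes at most 4 digits at a time, so longer runs get split up.
two_to_four = re.findall(r"\d{2,4}", sample)
print(two_to_four)  # ['42', '1234', '9999', '99']

# ? makes the preceding character optional.
plurals = re.findall(r"plurals?", "plural plurals")
print(plurals)  # ['plural', 'plurals']
```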

### Introducing Groups

What if we want to do two things: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any task that involves grouping parts of a regular expression together so that we can later break the match down. 

In [37]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
# Compare with the ungrouped version: re.search(r'\d{3}-\d{3}-\d{4}',text)
# Wrapping each part in () creates three groups

In [38]:
results = re.search(phone_pattern,text)

In [39]:
# The entire result
results.group()

'408-555-1234'

In [41]:
# Can then also call by group position.
# remember groups were separated by parentheses ()
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(1)

'408'

In [42]:
results.group(2)

'555'

In [43]:
results.group(3)

'1234'

In [44]:
# We only had three groups of parentheses
results.group(4)

IndexError: no such group
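Instead of asking for one group at a time, all groups can be retrieved at once, and groups can also be given names. The sketch below uses Python's `(?P<name>...)` syntax; the names `area`, `prefix`, and `line` are assumptions chosen for illustration, not something the original notebook defines:

```python
import re

text = "My telephone number is 408-555-1234"

# (?P<name>...) gives a group a name in addition to its position.
named_pattern = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")
result = named_pattern.search(text)

print(result.groups())       # ('408', '555', '1234')
print(result.group("area"))  # 408
print(result.groupdict())    # {'area': '408', 'prefix': '555', 'line': '1234'}
```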

## Additional Regex Syntax

### Introducing Or operator |

Use the pipe operator to write an **or** statement. For example:

In [45]:
man = re.search(r"man|woman","This man was here.")
man.group()

'man'

In [46]:
woman = re.search(r"man|woman","This woman was here.")
woman.group()

'woman'
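One detail worth knowing about the pipe: at any given position the alternatives are tried from left to right, and the first one that matches wins, even if a later alternative would match more text. A small sketch with made-up strings:

```python
import re

# "cat" is tried first and succeeds, so "category" is never attempted.
first = re.search(r"cat|category", "category").group()
print(first)  # cat

# Putting the longer alternative first changes the result.
longer = re.search(r"category|cat", "category").group()
print(longer)  # category

# findall reports each non-overlapping match from left to right.
both = re.findall(r"man|woman", "A man and a woman were here.")
print(both)  # ['man', 'woman']
```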

### Introducing Wildcard

Use a "wildcard" as a placeholder that matches any single character in that position (except a newline, by default). The wildcard is a simple period:

In [35]:
# find any single character followed by "at"
re.findall(r".at",
           "The cat in the hat sat here.") 


['cat', 'hat', 'sat']

In [36]:
re.findall(r".at","The bat went splat")

['bat', 'lat']

Notice that the second match is only 'lat', the last three letters of "splat", because a single period holds a single position. If we want to capture more, we can simply add more periods:

In [47]:
# find runs of 5 characters that end with "at"
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

However, this still has a problem: the periods also grab spaces and letters from the preceding word. Really we only want words that end with "at".

In [53]:
# One or more non-whitespace characters (\S+) followed by 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']

In [55]:
# Exactly 3 non-whitespace characters (\S{3}) followed by 'at'
re.findall(r'\S{3}at',"The bat went splat") # length of 5

['splat']
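Another way to get whole words that end with "at" is to anchor the pattern to word boundaries. The sketch below uses `\b`, which is not covered in this notebook: it matches the empty position between a word character and a non-word character. The sample sentence is made up for illustration:

```python
import re

sentence = "The cat in the hat sat flatly at noon."

# \b marks a word boundary; \w* allows zero or more word characters before "at".
# "flatly" is skipped because its "at" is not followed by a word boundary.
at_words = re.findall(r"\b\w*at\b", sentence)
print(at_words)  # ['cat', 'hat', 'sat', 'at']
```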

### Starts With and Ends With

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [56]:
# Ends with a number(\d$)
re.findall(r'\d$','This ends with a number 2')

['2']

In [57]:
# Starts with a number(^\d)
re.findall(r'^\d','1 is the loneliest number.')

['1']

In [59]:
# Note that this is for the entire string, not individual words
re.findall(r'^\d','5thousand 20years old people.')

['5']
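To make the difference concrete, the same pattern can be run with and without the anchor. This is a minimal sketch on a made-up string; without **^**, digits in the middle of the string are found as well:

```python
import re

line = "5thousand 20years old people."

anchored = re.findall(r"^\d+", line)    # only digits at the very start
unanchored = re.findall(r"\d+", line)   # digits anywhere in the string
print(anchored)    # ['5']
print(unanchored)  # ['5', '20']

# $ works the same way at the other end; \. is a literal period.
ends_with_period = re.search(r"\.$", line) is not None
print(ends_with_period)  # True
```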

### Exclusion

To exclude characters, we can use the **^** symbol inside a set of brackets **[]**. The pattern then matches any single character that is not listed inside the brackets. For example:

In [41]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [42]:
re.findall(r'[^\d]',phrase) # using [^] to exclude characters inside it

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To get the words back together, use a + sign 

In [43]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

We can use this to remove punctuation from a sentence.

In [44]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [45]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [46]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [47]:
clean

'This is a string But it has punctuation How can we remove it'
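The same cleanup can also be done in one step with re.sub, which replaces every match with a given string. This is an alternative technique rather than the approach used in the notebook, shown here as a sketch:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every '!', '.', or '?' with an empty string; spaces are left alone.
# Inside brackets, '.' and '?' are literal characters, not special symbols.
clean_sub = re.sub(r'[!.?]', '', test_phrase)
print(clean_sub)  # This is a string But it has punctuation How can we remove it
```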

## Brackets for Grouping

As shown above, brackets define a set of characters to match at one position. Combined with a quantifier, they can describe longer runs. For example, to find hyphenated words:

In [48]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [49]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']
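When the brackets contain a single identifier such as \w, they are optional: `[\w]+` and `\w+` behave the same. Brackets become necessary once several options share one position. A short sketch on made-up strings:

```python
import re

sentence = "The well-known editor is user-friendly"

with_brackets = re.findall(r'[\w]+-[\w]+', sentence)
without_brackets = re.findall(r'\w+-\w+', sentence)
print(with_brackets)     # ['well-known', 'user-friendly']
print(without_brackets)  # ['well-known', 'user-friendly']

# Brackets listing several characters match any one of them.
vowels = re.findall(r'[aeiou]', 'regex')
print(vowels)  # ['e', 'e']
```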

## Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options. For example:

In [61]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [63]:
text1 = re.search(r'cat(fish|nap|claw)',text)
text1.group()

'catfish'

In [52]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [53]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
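One thing to watch for when combining parentheses with re.findall: if the pattern contains a capturing group, findall returns the text of the group rather than the whole match. A non-capturing group, written `(?:...)`, avoids this. The sentence below is made up for illustration:

```python
import re

sentence = "I ate catfish, took a catnap, and saw a caterpillar."

# With a capturing group, findall returns only the group's text.
captured = re.findall(r'cat(fish|nap|claw)', sentence)
print(captured)  # ['fish', 'nap']

# (?:...) groups the options without capturing, so findall returns whole matches.
whole = re.findall(r'cat(?:fish|nap|claw)', sentence)
print(whole)  # ['catfish', 'catnap']
```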