Regular expressions are handled using Python's built-in **re** library. See [the docs](https://docs.python.org/3/library/re.html) for more information.

In [1]:
text = "The agent's phone number is 408-555-1234."

In [4]:
"phone" in text

True

In [8]:
import re
pattern ='phone'
re.search(pattern,text)

<re.Match object; span=(12, 17), match='phone'>

re.search() takes the pattern, scans the text, and returns a Match object for the first match it finds. If the pattern is not found, it returns None (in Jupyter Notebook this just means that nothing is output below the cell).
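
Because a failed search gives back None, calling a method such as .span() on the result would raise an AttributeError. A minimal sketch of guarding against that, using a hypothetical search word ("fax") that is not in the text:

```python
import re

text = "The agent's phone number is 408-555-1234."

# re.search() returns None when nothing matches, so test the result
# before calling methods such as .span() on it.
match = re.search("fax", text)   # "fax" is a made-up word that does not occur
if match is None:
    status = "no match"
else:
    status = f"found at {match.span()}"
print(status)
```

The same check works for any function that returns a Match object or None.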


In [13]:
match=re.search(pattern,text)
print(match.span())
print(match.start())
print(match.end())

(12, 17)
12
17


In [43]:
text="This phone is not small phone. This is a Big phone"
pattern='phone'
match1=re.search(pattern,text)      #search() returns the first match found
print(match1.group(),match1.span())
match2=re.findall(pattern,text)     #findall() returns a list of all found matches
print(match2)

phone (5, 10)
['phone', 'phone', 'phone']


In [44]:
for match in re.finditer(pattern,text):
    print(match.span())

(5, 10)
(24, 29)
(45, 50)


# Patterns
<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letters, digits, and underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
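
As a rough illustration of the table above, the sketch below runs a few of these character codes against a made-up sample string (the string and variable names are assumptions for the example only):

```python
import re

sample = "file_25 saved at 10:30!"   # hypothetical sample text

digits = re.findall(r"\d", sample)        # every single digit
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
non_words = re.findall(r"\W", sample)     # every non-alphanumeric character
non_space = re.findall(r"\S+", sample)    # runs of non-whitespace

print(digits)
print(words)
print(non_words)
print(non_space)
```

Note that the underscore counts as a \w character, which is why file_25 comes back as a single word.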

# Quantifiers
<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
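
To see how the quantifiers in the table differ, here is a small sketch on a made-up list of numbers. It assumes one extra piece of syntax not covered in the table: \b, which matches a word boundary and keeps a short pattern from matching inside a longer number.

```python
import re

sample = "ids: 7, 42, 123, 98765"   # hypothetical sample text

# \b (word boundary) is an addition here so that each pattern
# must match a whole number rather than part of one.
exactly_two = re.findall(r"\b\d{2}\b", sample)
two_to_four = re.findall(r"\b\d{2,4}\b", sample)
three_plus = re.findall(r"\b\d{3,}\b", sample)

# ? makes the preceding character optional
optional_s = re.findall(r"plurals?", "plural and plurals")

print(exactly_two)
print(two_to_four)
print(three_plus)
print(optional_s)
```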

## Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any task that involves grouping parts of a regular expression together so that we can later break the match down into those parts.

We mark each group in a regular expression with parentheses.

In [50]:
text = "The agent's phone number is 408-555-1234."
pattern=r'\d{3}-\d{3}-\d{4}'
match=re.search(pattern,text)
match.group()

'408-555-1234'

In [54]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = re.search(phone_pattern,text)
print(results.group())
print(results.group(1))

408-555-1234
408
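
Beyond group(1), a Match object can return all groups at once, and groups can be given names. The sketch below is one way to do that; the group names (area, exchange, line) are made up for this example.

```python
import re

text = "The agent's phone number is 408-555-1234."

# Same three groups as above, but with hypothetical names attached.
named_pattern = re.compile(r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})")

match = named_pattern.search(text)
if match:
    print(match.groups())          # all groups as a tuple
    print(match.group("area"))     # look a group up by name
    print(match.group(1, 3))       # several groups at once, by position
```

Names tend to be easier to read than positions once a pattern has more than two or three groups.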


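The pipe character **|** acts as an "or" operator: the pattern matches if either alternative matches.

In [55]: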
re.search(r"man|woman","This man was here.") 

<re.Match object; span=(5, 8), match='man'>

In [56]:
re.search(r"man|woman","This woman was here.") 

<re.Match object; span=(5, 10), match='woman'>
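
One thing to watch with alternatives like these is that they also match inside longer words. The sketch below uses a made-up sentence and assumes two pieces of syntax not introduced above: \b for a word boundary and (?:...) for a non-capturing group, so that findall() still returns whole matches.

```python
import re

sentence = "This woman met a man and a salesman."   # hypothetical sample text

# Plain alternation also matches the "man" inside "salesman".
loose = re.findall(r"man|woman", sentence)

# Wrapping the alternatives in a group and adding word boundaries
# restricts the match to whole words.
whole_words = re.findall(r"\b(?:man|woman)\b", sentence)

print(loose)
print(whole_words)
```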

### The Wildcard Character

Use a "wildcard" as a placeholder that matches any single character in that position (except a newline, by default). A simple period **.** serves this purpose. For example:

In [59]:
re.findall(r".at","The cat in the hat sat here in splat.")

['cat', 'hat', 'sat', 'lat']

In [62]:
re.findall(r"..at","The cat in the hat sat here in splat.")

[' cat', ' hat', ' sat', 'plat']

In [66]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The cat in the hat sat here in splat.")

['cat', 'hat', 'sat', 'splat']
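
Two details of the period are easy to miss: by default it does not match a newline, and matching a literal period requires escaping it. A short sketch, using a made-up two-line string:

```python
import re

two_lines = "cat\nat"   # hypothetical text with a newline before "at"

# By default . does not match the newline character.
default = re.findall(r".at", two_lines)

# With the re.DOTALL flag, . matches newlines as well.
dotall = re.findall(r".at", two_lines, flags=re.DOTALL)

# A backslash turns the period into a literal character.
literal = re.findall(r"at\.", "The cat in the hat sat here in splat.")

print(default)
print(dotall)
print(literal)
```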

### Starts With and Ends With

We can use **^** to signal "starts with" and **$** to signal "ends with". Note that these apply to the entire string, not to individual words:

In [69]:
re.findall(r'^The',"The cat in the hat sat here in splat.")

['The']

In [77]:
re.findall(r'\d$',"ends with number 9")

['9']
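
When the text spans several lines, the re.MULTILINE flag makes ^ and $ apply to each line instead of the whole string. A minimal sketch on a made-up three-line string:

```python
import re

lines = "The cat\nThe hat\nA bat"   # hypothetical multi-line text

# Without the flag, ^ and $ anchor to the start and end of the string.
start_default = re.findall(r"^The", lines)
end_default = re.findall(r"at$", lines)

# With re.MULTILINE they anchor to the start and end of every line.
start_multi = re.findall(r"^The", lines, flags=re.MULTILINE)
end_multi = re.findall(r"at$", lines, flags=re.MULTILINE)

print(start_default, end_default)
print(start_multi, end_multi)
```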

### Exclusion

To exclude characters, we can place the **^** symbol at the start of a set of brackets **[]**. The set then matches any character that is not listed inside the brackets. For example:

In [78]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [81]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [82]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [87]:
test_phrase = 'This is a string! But it $ has punctuation. How can we remove it?'
# only the leading ^ negates the set; the remaining characters are the ones excluded
re.findall(r'[^!$.?]+',test_phrase)

['This is a string', ' But it ', ' has punctuation', ' How can we remove it']

In [90]:
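# inside brackets, the braces and commas are literal characters, so they are excluded too;
# the period is not listed here, so it stays in the results
re.findall(r'[^{$,!,?}]+',test_phrase)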

['This is a string', ' But it ', ' has punctuation. How can we remove it']

In [91]:
lst=re.findall(r'[^{$,!,?}]+',test_phrase)
' '.join(lst)

'This is a string  But it   has punctuation. How can we remove it'
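
If the goal is simply to remove the punctuation, an alternative to findall() plus join() is re.sub(), which replaces every match with a given string. This is a sketch of that approach, not the method used above; the second call, which collapses repeated spaces, is an extra step added for tidiness.

```python
import re

test_phrase = 'This is a string! But it $ has punctuation. How can we remove it?'

# Replace each listed punctuation character with an empty string.
no_punct = re.sub(r'[!$.?]', '', test_phrase)

# Optionally collapse runs of whitespace left behind into one space.
cleaned = re.sub(r'\s+', ' ', no_punct)

print(no_punct)
print(cleaned)
```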

## Brackets for Grouping

We can use brackets to group together options. For example, to find hyphenated words:

In [99]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']

In [103]:
text=" these are a set of email-ids:abaker@ourcompany.com,cdonaldson@ourcompany.com,efreeman@ourcompany.com"
re.findall(r'[\w]+@[\w]+\.com',text)

['abaker@ourcompany.com',
 'cdonaldson@ourcompany.com',
 'efreeman@ourcompany.com']
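
Combining this with groups makes it possible to pull the user name and the domain apart in one pass. The sketch below uses made-up addresses and a deliberately simple pattern; it is not a full email validator, and real addresses can contain characters it does not handle.

```python
import re

# Hypothetical addresses, including one that does not end in .com
text = "contacts: abaker@ourcompany.com, cdonaldson@example.org"

# Group 1 captures the user name, group 2 the domain.
email_pattern = re.compile(r"(\w+)@(\w+\.\w+)")

# With groups in the pattern, findall() returns a list of tuples.
pairs = email_pattern.findall(text)
print(pairs)

for user, domain in pairs:
    print(user, "->", domain)
```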