## Identifiers for Characters in Patterns

Different types of characters, such as a digit or a whitespace character, are represented by special codes called identifiers. You can combine these to build up a pattern string. Notice that they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string makes it a raw string, so Python passes each \ in the pattern through to the regular expression engine unchanged instead of treating it as the start of a string escape sequence.
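To make the raw-string point concrete, here is a minimal sketch comparing a regular string with a raw string. It uses `\b` (a word boundary in regular expressions), which is not one of the identifiers covered in this section; it is chosen only because Python itself also gives `\b` a meaning (the backspace character), so the difference is easy to see.

```python
import re

# In a regular string, '\b' is a single backspace character.
regular = '\bcat\b'
# In a raw string, '\b' stays as a backslash followed by 'b',
# which the regex engine reads as a word boundary.
raw = r'\bcat\b'

print(len(regular))  # 5
print(len(raw))      # 7

# The regular string looks for literal backspace characters, so it finds nothing.
print(re.search(regular, 'the cat sat'))  # None
# The raw string finds the word 'cat'.
print(re.search(raw, 'the cat sat').span())  # (4, 7)
```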

Below you can find a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (or underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

Notice the repetition of \d in the first example. Repeating an identifier like this quickly becomes tedious, especially when we are looking for long runs of digits. Let's explore the possible quantifiers.
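The example patterns in the identifier table can be checked directly. The sketch below is one way to do that; it uses `re.fullmatch`, which is not used elsewhere in this notebook and is chosen here only because it succeeds when the entire string fits the pattern.

```python
import re

# (pattern, example text) pairs copied from the identifier table
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

for pattern, example in examples:
    # fullmatch returns a match object only if the whole string fits the pattern
    matched = re.fullmatch(pattern, example) is not None
    print(pattern, '->', example, matched)
```

Each line should end in `True`, since every example text was written to fit its pattern.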

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many occurrences we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
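The quantifier examples in the table can be verified in the same way. This is a small sketch rather than an exhaustive test: `re.fullmatch` is used as a convenience, and the two sample strings at the end are made up for illustration.

```python
import re

# (pattern, example text) pairs copied from the quantifier table
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]
for pattern, example in examples:
    # fullmatch succeeds only if the entire string fits the pattern
    print(pattern, '->', re.fullmatch(pattern, example) is not None)

# Quantifiers are greedy: {2,4} takes up to four digits when they are available
digit_runs = re.findall(r'\d{2,4}', '1 12 123 12345')
print(digit_runs)  # ['12', '123', '1234']

# ? makes the preceding character optional
plurals = re.findall(r'plurals?', 'plural plurals')
print(plurals)  # ['plural', 'plurals']
```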

In [1]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [2]:
'phone' in text

True

In [5]:
import re

In [17]:
pattern = 'phone'

In [18]:
result = re.search(pattern, text)

In [19]:
print(result)

<re.Match object; span=(12, 17), match='phone'>


In [20]:
result

<re.Match object; span=(12, 17), match='phone'>

In [21]:
pattern = 'NOT IN TEXT'

In [22]:
match = re.search(pattern, text)

In [23]:
match

In [24]:
type(match)

NoneType
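Because `re.search` returns `None` when nothing matches, calling `.span()` or `.group()` on the result without checking it first raises an `AttributeError`. One common way to guard against that is sketched below; `describe_match` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # No match: avoid calling match.span() on None
        return f'{pattern!r} not found'
    return f'{pattern!r} found at {match.span()}'

print(describe_match('phone', text))        # 'phone' found at (12, 17)
print(describe_match('NOT IN TEXT', text))  # 'NOT IN TEXT' not found
```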

In [25]:
pattern = 'phone'

In [26]:
match = re.search(pattern, text)

In [27]:
match.span()

(12, 17)

In [28]:
match.start()

12

In [29]:
match.end()

17

In [30]:
text = 'my phone one, my phone twice'

In [31]:
match = re.search(pattern, text)

In [32]:
match

<re.Match object; span=(3, 8), match='phone'>

In [33]:
matches = re.findall(pattern, text)

In [34]:
matches

['phone', 'phone']

In [38]:
for match in re.finditer('phone', text):
    print(match.span())
    print(match.group())

(3, 8)
phone
(17, 22)
phone


In [40]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [41]:
phone = re.search('408-555-1234', text)

In [42]:
phone

<re.Match object; span=(28, 40), match='408-555-1234'>

In [44]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [45]:
phone

<re.Match object; span=(28, 40), match='408-555-1234'>

In [46]:
text = "The agent's phone number is 408-555-7777. Call soon!"

In [47]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [48]:
phone

<re.Match object; span=(28, 40), match='408-555-7777'>

In [49]:
phone.group()

'408-555-7777'

In [50]:
phone = re.search(r'\d{3}-\d{3}-\d{4}', text)

In [51]:
phone.group()

'408-555-7777'

In [52]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [54]:
result = phone_pattern.search(text)

In [55]:
result.group()

'408-555-7777'

In [56]:
result.group(1)

'408'

In [57]:
result.group(2)

'555'

In [58]:
result.group(3)

'7777'
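When a pattern has several groups, `groups()` returns all of them at once as a tuple, which can be unpacked into separate names. A minimal sketch follows; the variable names `area_code`, `prefix` and `line_number` are chosen for illustration only.

```python
import re

text = "The agent's phone number is 408-555-7777. Call soon!"
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

result = phone_pattern.search(text)
if result is not None:
    # groups() returns every captured group, in order
    print(result.groups())  # ('408', '555', '7777')
    # Unpack the three groups into separate variables
    area_code, prefix, line_number = result.groups()
    print(area_code)    # 408
    print(line_number)  # 7777
```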

In [61]:
re.search(r'cat', 'the cat and dog are here')

<re.Match object; span=(4, 7), match='cat'>

In [63]:
re.search(r'cat|dog', 'the kitty and dog are here')

<re.Match object; span=(14, 17), match='dog'>

In [66]:
re.findall(r'cat', 'the cat in the hat sat there')

['cat']

In [69]:
# . is a wildcard that matches any single character except a newline
re.findall(r'.at', 'the cat in the hat sat there')

['cat', 'hat', 'sat']

In [70]:
# ^ anchors the match to the start of the string
re.findall(r'^\d', '1 is a number')

['1']

In [71]:
# $ anchors the match to the end of the string
re.findall(r'\d$', '1 is a number lower than 2')

['2']
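It can help to see the anchors fail as well as succeed. The sketch below uses made-up sample strings in which the digit is not at the anchored position, and then combines both anchors so that the whole string has to fit the pattern.

```python
import re

# ^ requires the digit to be the first character, so this finds nothing
starts_with_digit = re.findall(r'^\d', 'number 1 comes second')
print(starts_with_digit)  # []

# $ requires the digit to be the last character, so this finds nothing
ends_with_digit = re.findall(r'\d$', '2 is lower than 3!')
print(ends_with_digit)  # []

# Using both anchors means the whole string must fit the pattern
whole = re.search(r'^\d{3}-\d{4}$', '555-1234')
partial = re.search(r'^\d{3}-\d{4}$', 'call 555-1234')
print(whole is not None, partial is not None)  # True False
```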

In [72]:
phrase = 'there are 3 numbers 34 inside 5 this sentence'

In [76]:
# [^\d] matches any single character that is not a digit
pattern = r'[^\d]'

In [77]:
re.findall(pattern, phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e']

In [78]:
# [^\d] matches any character that is not a digit
# + means one or more times, so this matches whole runs of non-digits
pattern = r'[^\d]+'

In [79]:
re.findall(pattern, phrase)

['there are ', ' numbers ', ' inside ', ' this sentence']

In [80]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it? '

In [83]:
re.findall(r'[^!.? ]+', test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [84]:
# Keep only runs of characters that are not !, ., ? or a space
clean = re.findall(r'[^!.? ]+', test_phrase)

In [85]:
' '.join(clean)

'This is a string But it has punctuation How can we remove it'
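An alternative to collecting the words and joining them is to delete the punctuation in place with `re.sub`, which is not used elsewhere in this notebook. This is only a sketch of that approach, assuming the same three punctuation marks are the ones to remove.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it? '

# Replace each !, . or ? with an empty string, then trim the trailing space
no_punctuation = re.sub(r'[!.?]', '', test_phrase).strip()
print(no_punctuation)

# For this phrase, the result is the same as joining the findall output
joined = ' '.join(re.findall(r'[^!.? ]+', test_phrase))
print(no_punctuation == joined)  # True
```

The two approaches agree here because the words are separated by single spaces. `re.sub` leaves the original spacing untouched, whereas the `findall` and `join` route collapses any run of spaces into one.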

In [86]:
# Find words joined by a hyphen
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [89]:
pattern = r'[\w]+-[\w]+'

In [90]:
re.findall(pattern, text)

['hyphen-words', 'long-ish']

In [91]:
text = 'Hello, would you like some catfish?'
texttwo = 'Hello, would you like to take a catnap?'
textthree = 'Hello, have you seen this caterpillar?'

In [94]:
re.search(r'cat(fish|nap|claw)', texttwo)

<re.Match object; span=(32, 38), match='catnap'>
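Running the same grouped alternation against all three sentences shows which ones match and what the group captured. A minimal sketch, in which the list `texts` and the `found` results list are names introduced for this example:

```python
import re

texts = [
    'Hello, would you like some catfish?',
    'Hello, would you like to take a catnap?',
    'Hello, have you seen this caterpillar?',
]
pattern = re.compile(r'cat(fish|nap|claw)')

found = []
for t in texts:
    match = pattern.search(t)
    if match:
        # group() is the full match, group(1) is the alternative that matched
        found.append((match.group(), match.group(1)))
    else:
        found.append(None)

print(found)  # [('catfish', 'fish'), ('catnap', 'nap'), None]
```

The third sentence does not match because `caterpillar` continues with `erp`, which is not one of the listed alternatives.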