### REGULAR EXPRESSIONS (regex)
- We already know we can search for substrings within a larger string with the **"in"** operator: "dog" in "my dog is great" returns True.
- Regular expressions allow us to search for general patterns in text data.
- For example, a simple email format is user@email.com; we are looking for the pattern "text" + "@" + "text" + ".com".
- The library __"re"__ allows us to create specialized pattern strings and then search for matches within text.
- The primary skill set for regex is understanding the special syntax for these pattern strings.
- ```Phone number: (555)-555-5555```
- ```Regex pattern: r"\(\d\d\d\)-\d\d\d-\d\d\d\d"``` ```\d = digit; \( and \) match literal parentheses. These are the identifiers for regular expressions```
- ```Regex pattern: r"\(\d{3}\)-\d{3}-\d{4}"``` ```---> the same pattern in quantifier format```
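
The phone-number bullets above can be checked with a short sketch. The sample sentence is a made-up assumption. Note that unescaped parentheses create a capture group rather than matching the characters `(` and `)`, so they are escaped here.

```python
import re

# Hypothetical sample text; the number format follows the bullets above.
sample = "Call (555)-555-5555 today"

# \( and \) match literal parentheses.
long_form = r"\(\d\d\d\)-\d\d\d-\d\d\d\d"
# Same pattern written with quantifiers: {3} means "exactly three times".
short_form = r"\(\d{3}\)-\d{3}-\d{4}"

long_match = re.search(long_form, sample)
short_match = re.search(short_form, sample)

print(long_match.group())
print(short_match.group())
```

Both patterns describe the same text, so either form can be used; the quantifier form is simply shorter to read.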

In [1]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [3]:
'phone' in text

True

In [4]:
import re

In [5]:
pattern = 'phone'

In [6]:
re.search(pattern,text)

<re.Match object; span=(12, 17), match='phone'>

In [7]:
pattern = 'NOT IN TEXT'

In [8]:
re.search(pattern,text) # NO MATCH: returns None, so nothing is displayed

In [9]:
pattern = 'phone'

In [11]:
match = re.search(pattern,text)

In [12]:
match

<re.Match object; span=(12, 17), match='phone'>

In [13]:
match.span()

(12, 17)

In [14]:
match.start()

12

In [15]:
match.end()

17
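
Since `re.search` returns `None` when there is no match, it is usually safer to check the result before calling methods on it. A minimal sketch using the same text as above (the variable names `found` and `missing` are only for illustration):

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"

match = re.search("phone", text)

# Check for None before using the match object.
if match is not None:
    start, end = match.span()
    found = text[start:end]  # slicing with the span recovers the matched text
    print(start, end, found)

# A pattern that is not present gives None instead of a match object.
missing = re.search("NOT IN TEXT", text)
print(missing is None)
```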

In [2]:
import re
text = 'my phone once, my phone twice'

In [3]:
match = re.search('phone',text)

In [4]:
match

<re.Match object; span=(3, 8), match='phone'>

In [5]:
# FINDALL (phone)
matches = re.findall('phone',text) # FINDING ALL THE MATCHES

In [6]:
matches

['phone', 'phone']

In [7]:
len(matches)

2

In [8]:
# ITERATING WITH MATCHING PATTERN (phone)
for match in re.finditer('phone',text):
    #print(match)
    #print(match.span())
    print(match.group()) # LOOKING FOR THE ACTUAL MATCH (phone)

phone
phone
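
Unlike `re.findall`, which returns only the matched strings, `re.finditer` yields match objects, so the position of every occurrence is available too. A small sketch on the same text:

```python
import re

text = 'my phone once, my phone twice'

# Collect the (start, end) position of every occurrence.
spans = [m.span() for m in re.finditer('phone', text)]
print(spans)

# Slicing the text with each span gives back the matched strings.
words = [text[start:end] for start, end in spans]
print(words)
```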


### REGEX - PART 2

#### CHAR IDENTIFIERS
- ``` CHAR   DESCRIPTION       Ex.Pattern Code    Example Match ```
- ``` \d     A digit           file_\d\d            file_25 ```
- ``` \w     Alphanumeric or _ \w-\w\w\w            A-b_1 ```
- ``` \s     White Space       a\sb\sc              a b c```
- ``` \D     A non digit       \D\D\D               ABC```
- ``` \W     Non-alphanumeric  \W\W\W\W\W           *-+=)```
- ``` \S     Non-whitespace    \S\S\S\S             YoyO```
- ``` ^      Starts with       ^\d                  1 is a number```
- ``` $      Ends with         \d$                  This number is 2```
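
One way to check the identifier table is to run each pattern against its example with `re.fullmatch`, which succeeds only when the whole string fits the pattern. This is a sketch for verification; the `examples` list simply restates the table rows (with the whitespace row written as `a\sb\sc`).

```python
import re

# (pattern, example) pairs taken from the table above.
examples = [
    (r"file_\d\d", "file_25"),   # \d : a digit
    (r"\w-\w\w\w", "A-b_1"),     # \w : letter, digit, or underscore
    (r"a\sb\sc", "a b c"),       # \s : whitespace
    (r"\D\D\D", "ABC"),          # \D : anything except a digit
    (r"\W\W\W\W\W", "*-+=)"),    # \W : anything \w does not match
    (r"\S\S\S\S", "YoyO"),       # \S : anything except whitespace
]

results = {}
for pattern, sample in examples:
    # fullmatch returns a match object only if the entire sample matches.
    results[pattern] = re.fullmatch(pattern, sample) is not None
    print(pattern, sample, results[pattern])
```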

#### QUANTIFIERS
- ``` CHAR    DESCRIPTION                     Ex.Pattern Code      Example Match ```
- ```  +       Occurs one or more times        Version \w-\w+       Version A-b1_1 ```
- ``` {3}     Occurs exactly 3 times          \D{3}                abc```
- ``` {2,4}   Occurs 2 to 4 times             \d{2,4}              123```
- ``` {3,}    Occurs 3 or more                \w{3,}               anychar```
- ``` *       Occurs zero or more times       A*B*C*               AAACC```
- ``` ?       Once or none                    plurals?             plural```
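
The quantifier rows can be verified the same way. The sketch below also shows that quantifiers are greedy by default: `\d{2,4}` takes as many digits as it is allowed to. The string `"123456"` is an assumed extra example, not from the table.

```python
import re

# (pattern, example) pairs taken from the quantifier table above.
examples = [
    (r"Version \w-\w+", "Version A-b1_1"),  # +     : one or more
    (r"\D{3}", "abc"),                      # {3}   : exactly three
    (r"\d{2,4}", "123"),                    # {2,4} : two to four
    (r"\w{3,}", "anychar"),                 # {3,}  : three or more
    (r"A*B*C*", "AAACC"),                   # *     : zero or more (no B is fine)
    (r"plurals?", "plural"),                # ?     : zero or one
]

results = {p: re.fullmatch(p, s) is not None for p, s in examples}
for pattern, ok in results.items():
    print(pattern, ok)

# Greedy behaviour: with six digits available, {2,4} consumes four of them.
greedy = re.search(r"\d{2,4}", "123456").group()
print(greedy)
```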

In [9]:
text = 'My phone number is 408-555-8635'

In [10]:
phone = re.search('408-555-1234',text)

In [11]:
phone

In [12]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [13]:
phone

<re.Match object; span=(19, 31), match='408-555-8635'>

In [14]:
phone.group()

'408-555-8635'

In [15]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [16]:
phone = re.search(r'\d{3}-\d{3}-\d{4}',text)

In [17]:
phone

<re.Match object; span=(19, 31), match='408-555-8635'>

In [18]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [19]:
results = phone_pattern.search(text) # a compiled pattern has its own search method

In [20]:
results.group()

'408-555-8635'

In [21]:
results.group(1)

'408'

In [22]:
results.group(3)

'8635'
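
Besides `group(n)`, a match object can return all groups at once with `groups()`, and groups can be given names with the `(?P<name>...)` syntax. A brief sketch; the group names `area`, `prefix`, and `line` are hypothetical labels chosen for illustration.

```python
import re

text = 'My phone number is 408-555-8635'

# Numbered groups, as in the cells above.
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = phone_pattern.search(text)
parts = results.groups()  # every captured group as a tuple
print(parts)

# Named groups: the names here are illustrative only.
named_pattern = re.compile(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})')
fields = named_pattern.search(text).groupdict()
print(fields)
```

Named groups can make code easier to follow when a pattern has several parts, since `fields['area']` says more than `group(1)`.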

### Additional Regex 

In [23]:
re.search(r'cat','The cat is here')

<re.Match object; span=(4, 7), match='cat'>

In [24]:
# want to search for cat or dog using "|" operator 

In [25]:
re.search(r'cat|dog','The cat is here') 

<re.Match object; span=(4, 7), match='cat'>
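
With `re.search`, the `|` operator reports only the first alternative found in the text. Combining it with `re.findall` returns every occurrence of either word. A small sketch on a made-up sentence:

```python
import re

# Hypothetical sentence containing both alternatives.
sentence = 'The cat chased the dog, then the dog chased the cat'

first = re.search(r'cat|dog', sentence).group()  # leftmost match only
every = re.findall(r'cat|dog', sentence)         # all matches, in order

print(first)
print(every)
```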

In [26]:
re.findall(r'.at','The cat in the hat sat there') # . for wildcard char 

['cat', 'hat', 'sat']

In [28]:
re.findall(r'.at','The cat sat on the mat and playing with a rat')

['cat', 'sat', 'mat', 'rat']

In [30]:
re.findall(r'...at','The cat in the hat went splat') # with three periods for wild card chars 

['e cat', 'e hat', 'splat']

In [31]:
re.findall(r'^\d','1 is a number') # ^ anchors the match at the start of the string; $ anchors it at the end

['1']

In [32]:
re.findall(r'^\d','This 2 is a number')

[]

In [33]:
re.findall(r'\d$','This number is 2')

['2']
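
By default `^` matches only at the very beginning of the whole string. A minimal sketch, assuming a made-up multi-line string, of how the `re.MULTILINE` flag lets `^` match at the start of each line as well:

```python
import re

# Hypothetical multi-line text.
lines = "1 apple\n2 pears\nno digit here"

# Default: ^ matches only at the start of the string.
default_hits = re.findall(r'^\d', lines)

# re.MULTILINE: ^ also matches right after each newline.
multiline_hits = re.findall(r'^\d', lines, flags=re.MULTILINE)

print(default_hits)
print(multiline_hits)
```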

In [34]:
phrase = 'there are 3 numbers 34 inside 5 this sentence'

In [35]:
pattern = r'[^\d]' # ^ INSIDE [] NEGATES THE SET: MATCH ANY CHARACTER THAT IS NOT A DIGIT

In [36]:
re.findall(pattern,phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e']

In [37]:
pattern = r'[^\d]+' # "+" GROUPS CONSECUTIVE NON-DIGIT CHARACTERS INTO ONE MATCH

In [38]:
re.findall(pattern,phrase)

['there are ', ' numbers ', ' inside ', ' this sentence']

In [39]:
test_str = 'This is a string! But it has punctuation. How can we remove it?'

In [42]:
re.findall(r'[^!.?]+',test_str)  # TO REMOVE PUNCTUATION, LIST IT INSIDE [^ ]

['This is a string', ' But it has punctuation', ' How can we remove it']

In [44]:
# ALSO EXCLUDING SPACES, SO EACH MATCH IS A SINGLE WORD
clean = re.findall(r'[^!.? ]+',test_str)  # TO REMOVE PUNCTUATION AND SPACES, LIST THEM INSIDE [^ ]

In [46]:
' '.join(clean)

'This is a string But it has punctuation How can we remove it'
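
An alternative to `findall` plus `join` is `re.sub`, which replaces every match with a given string. The sketch below swaps each punctuation mark for an empty string and, for this particular sentence, arrives at the same result:

```python
import re

test_str = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every !, . or ? with nothing; spaces are left untouched.
no_punct = re.sub(r'[!.?]', '', test_str)
print(no_punct)

# Same result as the findall + join approach for this sentence.
joined = ' '.join(re.findall(r'[^!.? ]+', test_str))
print(no_punct == joined)
```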

In [47]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [50]:
# TO FIND HYPHENATED WORDS: WORD CHARACTERS, A HYPHEN, THEN MORE WORD CHARACTERS
pattern = r'\w+-\w+'

In [51]:
re.findall(pattern,text)

['hyphen-words', 'long-ish']

In [52]:
text1 = "Hello, would you like some catfish?"
text2 = "Hello, would you like to take a catnap?"
text3 = "Hello, have you seen this caterpillar?"


In [61]:
re.search(r'cat(fish|nap|erpillar)',text1)

<re.Match object; span=(27, 34), match='catfish'>
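
The same grouped pattern works on all three sentences. The sketch below runs it over each one and also shows a common surprise: when a pattern contains a capture group, `re.findall` returns the group's content rather than the whole match.

```python
import re

text1 = "Hello, would you like some catfish?"
text2 = "Hello, would you like to take a catnap?"
text3 = "Hello, have you seen this caterpillar?"

pattern = re.compile(r'cat(fish|nap|erpillar)')

# group() is the whole match; group(1) is only the part inside the parentheses.
whole = [pattern.search(t).group() for t in (text1, text2, text3)]
suffixes = [pattern.search(t).group(1) for t in (text1, text2, text3)]
print(whole)
print(suffixes)

# With one capture group, findall returns just the captured part.
only_group = pattern.findall(text1)
print(only_group)
```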