>   ### Regular Expressions

-   When we want to search a string for a general pattern rather than an exact substring, for instance email addresses of the form "xyz@xyz.com" -> this is where regex shines

>   ### Python comes with a built-in regular expressions library, "re" for short, that allows us to create specialized pattern strings and search for matches within text





<hr />




##### Understanding the special syntax for these pattern strings is the key to working well with Regex

>   ### The workflow: convert what we are looking for into the specialized pattern syntax, then look up the patterns/quantifiers/identifiers needed to express it





<hr />




>   #### Phone number example:

               (555)-555-5555

### Regex Pattern:

                <!-- The "r" prefix makes this a raw string: Python leaves the backslashes untouched, so the regex library receives identifiers such as "\d" exactly as written -->
                <!-- The "\" backslashes introduce the individual identifiers; "\(" and "\)" match literal parentheses, because unescaped parentheses would create a group instead -->
                r"\(\d\d\d\)-\d\d\d-\d\d\d\d"



### The identifiers are like placeholders/wildcards, each matching any single character of a particular type

>   #### Above, "\d" stands for a single digit -> the pattern looks for runs of 3, 3 and 4 digits in a row

>   #### A regex pattern is built from general identifiers combined with the exact strings we are looking for
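
A minimal sketch of the phone number example, assuming we want to match the literal parentheses around the area code. The `sample` sentence is made up for illustration. Literal parentheses have to be escaped as `\(` and `\)`, since plain parentheses are regex syntax:

```python
import re

# Escaped parentheses match the literal "(" and ")" characters;
# each \d matches exactly one digit.
pattern = r"\(\d\d\d\)-\d\d\d-\d\d\d\d"

# Hypothetical sample text, not from the notes above
sample = "Call (555)-555-5555 today"

match = re.search(pattern, sample)
print(match.group())
```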



<hr />


#### Searching for basic patterns in text:




In [1]:
text = "The phone number is 545-555-1234. Call soon!"

# we could search for simple strings in the text
'phone' in text

True

In [18]:
# importing the regular expressions module
import re

pattern = 'phone'

# This method returns much more information in a match object
# the Match object reports that 'phone' was found, and also the index span of the match (start and end)
re.search(pattern, text)

<re.Match object; span=(4, 9), match='phone'>

>   #### Looking for a pattern that's not present in the text:

In [6]:
pattern = 'NOT IN TEXT'

# re.search returns None when there is no match, so nothing is displayed
re.search(pattern, text)
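
Because a failed search gives back `None`, calling `.span()` or `.group()` on the result directly would raise an `AttributeError`. A small sketch of checking the result first:

```python
import re

text = "The phone number is 545-555-1234. Call soon!"

match = re.search("NOT IN TEXT", text)

# A failed search returns None rather than a Match object
print(match is None)

# Guard before using Match methods
if match:
    print(match.span())
else:
    print("no match found")
```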

In [9]:
# resetting pattern to phone
pattern = 'phone'

# saving it into a regex match object to explore
match = re.search(pattern, text)

match

<re.Match object; span=(4, 9), match='phone'>

In [12]:
# The match object has a lot of information that we can grab from:

print(f"getting the index location of the span: {match.span()}")

print(f"Start index {match.start()}, and end index {match.end()} of the match")

getting the index location of the span: (4, 9)
Start index 4, and end index 9 of the match


>   ### If we had multiple matches inside a string, "re.search()" would only give us back the first match:

In [19]:
text = 'my phone once, my phone twice'

match = re.search('phone', text)

# returns only one span, the first match
match

<re.Match object; span=(3, 8), match='phone'>

>   ### "re.findall()" helps us find all the matches:

In [15]:
matches = re.findall('phone', text)

# returns a list of all the matched strings (its length tells us how many matches we have)
matches

['phone', 'phone']

>   ### To get back the actual match objects, we can use the "re.finditer()" iterator:

In [22]:
# iterates through the text and returns each match object it found

for match in re.finditer("phone", text):
    print(match.span())

    # returning the actual text that matched, using the group method
    print(match.group())


(3, 8)
phone
(18, 23)
phone
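
If we want to keep the positions instead of just printing them, one option (a sketch, not the only way) is to collect them from the iterator into a list:

```python
import re

text = 'my phone once, my phone twice'

# One (start, end, matched text) tuple per match
found = [(m.start(), m.end(), m.group()) for m in re.finditer("phone", text)]

print(found)
```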


<hr />

## Special pattern code used to build out our own pattern sequences

<hr />

>   ### Character Identifiers:

-   \d: a Digit (file_25 -> file_\d\d)

-   \w: Alphanumeric or underscore (A-b_1 -> \w-\w\w\w)

-   \s: White space (a b c -> a\sb\sc)

-   \D: Non Digit (ABC -> \D\D\D)

-   \W: Non-alphanumeric *-+=) -> \W\W\W\W\W

-   \S: Non-whitespace Yoyo -> \S\S\S\S
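
A quick way to check the identifiers above is to test each pattern against its example string. This sketch uses `re.fullmatch`, which only succeeds when the pattern covers the entire string. Note that `\w` also accepts the underscore:

```python
import re

# (pattern, example string) pairs taken from the list above
examples = [
    (r"file_\d\d", "file_25"),      # \d: digit
    (r"\w-\w\w\w", "A-b_1"),        # \w: alphanumeric or underscore
    (r"a\sb\sc", "a b c"),          # \s: whitespace
    (r"\D\D\D", "ABC"),             # \D: non-digit
    (r"\W\W\W\W\W", "*-+=)"),       # \W: non-alphanumeric
    (r"\S\S\S\S", "Yoyo"),          # \S: non-whitespace
]

# True when the pattern matches the whole example string
results = [bool(re.fullmatch(pattern, sample)) for pattern, sample in examples]

for (pattern, sample), ok in zip(examples, results):
    print(f"{pattern!r} matches {sample!r}: {ok}")
```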

In [27]:
text = 'My phone number is 543-333-2432'

# searching for the pattern itself
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

print(phone)

print(f"The actual matched string: {phone.group()}")

<re.Match object; span=(19, 31), match='543-333-2432'>
The actual matched string: 543-333-2432


>   #### For longer digit patterns, it's tedious to write out a \d for every digit -> quantifiers are used to indicate repetition of the same character:


#### Quantifiers:

-   +: Occurs one or more times (Version A-b1_1 -> Version \w-\w+)

-   {3}: Occurs exactly 3 times (abc -> \D{3})

-   {2,4}: Occurs 2 to 4 times (123 -> \d{2,4})

-   {3,}: Occurs 3 or more times (anycharacters -> \w{3,})

-   *: Occurs zero or more times (AAACC -> A\*B\*C\*)

-   ?: Once or none (plural -> plurals?)
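
The quantifier examples above can be checked the same way. This sketch again uses `re.fullmatch` and adds two negative cases (strings chosen for illustration) to show where the limits apply:

```python
import re

# (pattern, example string) pairs taken from the list above
examples = [
    (r"Version \w-\w+", "Version A-b1_1"),  # +: one or more
    (r"\D{3}", "abc"),                      # {3}: exactly 3
    (r"\d{2,4}", "123"),                    # {2,4}: 2 to 4
    (r"\w{3,}", "anycharacters"),           # {3,}: 3 or more
    (r"A*B*C*", "AAACC"),                   # *: zero or more (no B at all is fine)
    (r"plurals?", "plural"),                # ?: once or none
]

results = [bool(re.fullmatch(pattern, sample)) for pattern, sample in examples]
print(results)

# Negative cases: five digits is too many for {2,4}, two letters too few for {3,}
too_many = re.fullmatch(r"\d{2,4}", "12345")
too_few = re.fullmatch(r"\w{3,}", "ab")
print(too_many, too_few)
```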

In [29]:
# Transforming the original search for a telephone number using quantifiers:
phone = re.search(r'\d{3}-\d{3}-\d{4}', text)
print(phone)

print(f"The actual matched string: {phone.group()}")

<re.Match object; span=(19, 31), match='543-333-2432'>
The actual matched string: 543-333-2432


>   ### To accomplish two tasks at once, for instance finding phone numbers and quickly extracting the area code (the first three digits of the phone number), groups let us wrap parts of a regular expression in parentheses and pull those parts out of the match later


>   ### The compile function turns a pattern string into a reusable pattern object

#### The pattern passed to compile is a single string, and each pair of parentheses inside it marks a group. The compiled pattern still matches the whole phone number as one expression, but it keeps track of the three separate groupings, so we can call the groupings individually. Groups work the same way without compile; compiling is a convenience when a pattern is reused:


>   ### Important: Group numbering starts at 1 -> NOT zero indexed! group(0), like group(), returns the entire match

In [33]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

results = re.search(phone_pattern, text)

print(results.group())

print(f'Calling the results by the groups position to extract area code for example, notice indexing starts at 1 not zero: {results.group(1)}')

print(f'Group 2: {results.group(2)}')

print(f'Group 3: {results.group(3)}')

# the pattern only has 3 groups, so asking for group 5 raises an IndexError
print(f'Non-existent group: {results.group(5)}')

543-333-2432
Calling the results by the groups position to extract area code for example, notice indexing starts at 1 not zero: 543
Group 2: 333
Group 3: 2432


IndexError: no such group
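
To avoid asking for a group that does not exist, a sketch using two things the `re` module provides: `Match.groups()`, which returns all captured groups as a tuple, and the `groups` attribute of a compiled pattern, which holds the number of groups. It also calls `search` as a method on the compiled pattern, which is equivalent to passing the pattern to `re.search`:

```python
import re

text = 'My phone number is 543-333-2432'

phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# Calling search on the compiled pattern object itself
results = phone_pattern.search(text)

# group(0) is the entire match, same as group()
print(results.group(0))

# All captured groups at once, as a tuple
print(results.groups())

# How many groups the pattern defines
print(phone_pattern.groups)
```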

<hr />

### Useful Regex Methods:

<hr />

1) The "|" (or) operator to search for multiple terms: 



In [34]:
re.search(r'cat|dog', 'The dog is here')

<re.Match object; span=(4, 7), match='dog'>

2) The wildcard operator "." acts as a placeholder and matches any single character (except a newline) in that position:

In [36]:
# grabbing any single character in front of "at" -> the "." is the wildcard, meaning anything there, attached to "at"
re.findall(r'.at', 'The cat in the hat sat there')

['cat', 'hat', 'sat']

In [38]:
# With three wildcards, each "." matches exactly one character, spaces included -> notice how this behaves
re.findall(r'...at', 'The cat in the hat went splat')

['e cat', 'e hat', 'splat']
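
Since the wildcard also matches spaces, the results above spill over word boundaries. If the goal is words ending in "at", one possible alternative (an assumption about the intent here) is to replace the dots with `\S+`, one or more non-whitespace characters:

```python
import re

sentence = 'The cat in the hat went splat'

# \S+ cannot cross a space, so each match stays inside one word
words = re.findall(r'\S+at', sentence)

print(words)
```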

3) Starts with "^" and ends with "$":

In [40]:
# "^" anchors the match to the start of the entire text -> only a number in the very first position is found, not a number further inside:
re.findall(r'^\d', '1 is a number too 2')

['1']

In [41]:
# Ends with a number:
re.findall(r'\d$', 'The number is 2')

['2']
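
To see that the anchors really refer to the whole text, here is a sketch comparing strings where the digit is and is not in the anchored position (the extra sentences are made up for illustration):

```python
import re

# Digit at the very start: found
starts_hit = re.findall(r'^\d', '1 is a number too 2')

# Digit only in the middle: "^" finds nothing
starts_miss = re.findall(r'^\d', 'Number 1 comes first')

# Digit at the very end: found
ends_hit = re.findall(r'\d$', 'The number is 2')

# Digit only at the start: "$" finds nothing
ends_miss = re.findall(r'\d$', '2 is the number')

print(starts_hit, starts_miss, ends_hit, ends_miss)
```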

4)  Exclusion: to exclude characters, the caret "^" symbol is placed at the start of a set of square brackets:

>   #### A very common way to get rid of punctuation in a sentence

In [44]:
phrase = 'there are 3 numbers 54 inside 7 this sentence'

# "[^\d]" matches any single character that is not a digit
# the "+" quantifier (occurs one or more times) keeps consecutive matching characters together instead of returning them one by one
pattern = r'[^\d]+'

# returns a list of the runs of characters found between the numbers
re.findall(pattern, phrase)

['there are ', ' numbers ', ' inside ', ' this sentence']

In [46]:
test_phrase = 'This is a string! but it has punctuation. How can we remove it?'

# returns the pieces of the string between the punctuation marks, i.e. the text split where the punctuation was
re.findall(r'[^!.?]+', test_phrase)

['This is a string', ' but it has punctuation', ' How can we remove it']

In [49]:
# If we add a space to the above pattern, we end up removing the spaces and getting a list of all the words:
clean = re.findall(r'[^!.? ]+', test_phrase)

print(clean)

# joining with space
' '.join(clean)

['This', 'is', 'a', 'string', 'but', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it']


'This is a string but it has punctuation How can we remove it'
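
As an alternative to `findall` plus `join` (a different technique from the one above, shown only for comparison), `re.sub` can replace every punctuation mark with an empty string in one step. For this particular sentence it gives the same result:

```python
import re

test_phrase = 'This is a string! but it has punctuation. How can we remove it?'

# Replace each "!", "." or "?" with nothing; spaces are left untouched
cleaned = re.sub(r'[!.?]', '', test_phrase)

print(cleaned)
```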

5)  Inclusion grouping:

In [51]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

# we have a group of alphanumeric characters, then dash "-" followed by another group of alphanumeric characters

pattern = r'[\w]+-[\w]+'

re.findall(pattern, text)

['hyphen-words', 'long-ish']

6)  Using parentheses to list out multiple options for matching:

In [53]:
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

# matching "cat" followed by any one of the listed endings:

re.search(r'cat(fish|nap|erpillar)', textthree)

<re.Match object; span=(26, 37), match='caterpillar'>
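
The same pattern works on all three sentences. Because the options sit inside parentheses, they also form group 1, so we can see which ending was matched. A short sketch:

```python
import re

texts = [
    'Hello, would you like some catfish?',
    'Hello, would you like to take a catnap?',
    'Hello, have you seen this caterpillar?',
]

pattern = r'cat(fish|nap|erpillar)'

# (whole match, matched ending) for each sentence
found = []
for sentence in texts:
    match = re.search(pattern, sentence)
    found.append((match.group(), match.group(1)))

print(found)
```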