# Regular Expressions

Regular Expressions (sometimes called regex) allow us to search for strings using almost any sort of rule we can come up with. For example, finding all capital letters in a string, or finding a phone number in a document. 

Regular expressions can be strange in their syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to filter out any string pattern you can imagine, which is why they have a complex string pattern format.
Regular expressions are handled using Python's built-in **re** library. See [this link](https://docs.python.org/3/library/re.html) for more information.

The Python Regular Expression HOWTO gives a tutorial-style introduction to regex patterns: https://docs.python.org/3/howto/regex.html

## Searching for Basic Patterns

Let's imagine that we have the following string:

In [1]:
text = "The lecturer's phone number in LYIT is 074-91-56789. Please call him soon!"

We would like to know if the word (or string) "LYIT" is in the contents of our string called "text".
We can do a quick check using the **in** keyword to see whether this word exists in the string.

In [2]:
'LYIT' in text

True

The **in** keyword can also be used to check whether an exact number, such as a phone number, is within a string.
For example, does the phone number 074-91-56789 exist in our string? We can use this command:

In [3]:
"074-91-56789" in text

True

## Potential problems with this method

Be careful when using this type of string matching: an **exact** copy of your search text must be present before the result is **True**. For example, this does not work since one of the dashes is missing.

In [4]:
"074-9156789" in text

False

## Regular expressions

What if we don't know the exact number that we're looking for? For example, what if we only know the number format, including dashes? Or what if we wanted to find phone numbers within a document? Or other information such as a date or an email address?

We can use a **regular expression** to find a specific pattern within text. Regular expressions allow for pattern searching within a text string, or within an entire document.

We will use regular expressions in this module. 

In [5]:
# Import the regular expressions library in Python
import re

Key points:

Every character type has a corresponding pattern code.

Digits have a placeholder pattern code of **\d**. I'll explain this in more detail later in this document.

The backslash tells the regular expression engine that what you are typing is a **special character** and not the letter "d".
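As a minimal sketch of the difference between the pattern code **\d** and the plain letter "d", here is a quick comparison. The sample string is made up for illustration, and the r prefix on the first pattern is covered later in this document.

```python
import re

sample = "door 7"  # hypothetical sample string

# r"\d" is the pattern code for "any digit"
digit_match = re.search(r"\d", sample)
# "d" without the backslash is just the letter d
letter_match = re.search("d", sample)

print(digit_match.group())   # the first digit found
print(letter_match.group())  # the first letter "d" found
```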

In [6]:
pattern = 'LYIT'

In [7]:
re.search(pattern, text)

<re.Match object; span=(31, 35), match='LYIT'>

There's some useful information within this output that we can examine. For example, we can show where the match occurs within the string. Remember that the text indexing position starts at zero.

In [8]:
# Store the result of re.search in a variable
text_match = re.search(pattern, text)

# show where the match occurs within the string
text_match.span()

(31, 35)

This means that the character **L** in the search word **LYIT** is at index 31 and the **T** is at index 34. The end value of 35 is the position just after the match, so text[31:35] gives 'LYIT'.

We can request particular attributes. Press the Tab key after typing the "." to see all the options available on the match object.

In [9]:
text_match.start()

31

In [10]:
text_match.end()

35
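To see how the start and end values relate to the string itself, here is a short sketch that slices the original text with them. It assumes the same text and search word used above.

```python
import re

text = "The lecturer's phone number in LYIT is 074-91-56789. Please call him soon!"
text_match = re.search("LYIT", text)

start, end = text_match.span()
# Slicing with the span reproduces the matched text,
# because the end index is exclusive
print(text[start:end])
# The last matched character sits at end - 1
print(text[end - 1])
```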

Be careful when using the **re.search** function when you are looking for multiple occurrences of a string. Here's an example why.

In [11]:
my_new_text = "I am a new student in LYIT and I think LYIT is great."
# Using pattern variable from previous definition

# re.search only finds the first instance
my_match = re.search(pattern, my_new_text)
my_match.span()

(22, 26)

Instead we need to use the **re.findall** command to get all matches of the specified keyword.

In [12]:
my_match = re.findall(pattern, my_new_text)
my_match

['LYIT', 'LYIT']

We can iterate over all the match objects using the **re.finditer** command. In this example I'm also using a **for** loop to **print** the span of each match.

In [13]:
for matched_word in re.finditer(pattern, my_new_text):
    print(matched_word.span())

(22, 26)
(39, 43)
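If we want the matched text together with its position, one possible approach is to build a list from the match objects. This is only a sketch that reuses the sentence and pattern from above.

```python
import re

my_new_text = "I am a new student in LYIT and I think LYIT is great."
pattern = "LYIT"

# Each item yielded by re.finditer is a match object,
# so the matched text and its position are both available
matches = [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, my_new_text)]
print(matches)
```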


## Searching for general text patterns

So far we've learned how to search for a basic string. Now we will search for general patterns using character identifiers, which we can combine to build up a pattern string. Notice how these make heavy use of the backslash \.

When defining a pattern string for regular expression we use the format:

r'mypattern'

Placing the r in front of the string tells Python to treat each \ in the pattern string as a literal backslash rather than the start of an escape sequence.
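A small sketch of what the r prefix changes. The variable names here are made up for illustration.

```python
import re

# With the r prefix, the backslash is kept as a literal character
raw_pattern = r"\d"
# Without it, the backslash has to be doubled to survive
escaped_pattern = "\\d"

print(raw_pattern == escaped_pattern)  # both hold a backslash followed by d
print(len(r"\n"), len("\n"))           # two characters vs. a single newline
```

Both spellings give the same pattern, but the raw string is easier to read once a pattern contains several backslashes.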

This table shows the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>mywork_\d\d\d</td><td>mywork_722</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-z_9</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
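As a quick check, the rows of the table can be tested with **re.fullmatch**, which succeeds only when the whole string fits the pattern. This is a sketch rather than part of the original walkthrough. Note that **\w** also accepts the underscore, which is why A-z_9 matches.

```python
import re

# (pattern, example match) pairs copied from the table above
examples = [
    (r"mywork_\d\d\d", "mywork_722"),
    (r"\w-\w\w\w", "A-z_9"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# re.fullmatch returns a match object on success and None otherwise
results = [re.fullmatch(p, s) is not None for p, s in examples]
print(results)

# A letter where a digit is expected makes the match fail
print(re.fullmatch(r"mywork_\d\d\d", "mywork_72a"))
```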

Let's use these text patterns to find a phone number pattern within the text string I created earlier in this document.

In [14]:
# Show the text string to remind us of its contents
text

"The lecturer's phone number in LYIT is 074-91-56789. Please call him soon!"

In [15]:
search_pattern = r'\d\d\d-\d\d-\d\d\d\d\d'
phone_numbers = re.search(search_pattern, text)
# show the contents of phone_numbers
phone_numbers

<re.Match object; span=(39, 51), match='074-91-56789'>

The output shows where the pattern occurred, and the actual found match.

We can show just the search result using the **group** command

In [16]:
phone_numbers.group()

'074-91-56789'

This will work for any text that follows the pattern. If the phone number in the text string changes, the **re.search** command will still pick out the phone number, as long as it follows the same format.



We can use the **re.findall()** command to find multiple occurrences of text with the same pattern. 

We can use **re.finditer** to work through each iteration, just as we did earlier in this document.
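Here is a sketch of both commands applied to the phone number pattern. The contacts string and its second phone number are made up for illustration.

```python
import re

# Hypothetical text with two numbers in the same format
contacts = "Office: 074-91-56789. Lab: 074-91-12345."
search_pattern = r"\d\d\d-\d\d-\d\d\d\d\d"

# re.findall returns the matched text of every occurrence
all_numbers = re.findall(search_pattern, contacts)
print(all_numbers)

# re.finditer gives a match object for each occurrence
for found in re.finditer(search_pattern, contacts):
    print(found.group(), found.span())
```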

Notice the repetition of \d. It can be annoying to use, particularly if we are looking for very long strings of numbers. We can use quantifiers to improve this.

## Quantifiers

Now that we know the special character designations (from above), we can use them along with quantifiers to define how many we expect.

Here's a table showing the quantifiers available to us.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more times</td><td>\w{3,}</td><td>any_characters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
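The quantifiers in the table can be checked in the same way with **re.fullmatch**. This is a sketch only. The string used for the **\w{3,}** row is a made-up example without spaces, since **\w** does not match a space.

```python
import re

# (pattern, example match) pairs based on the table above
examples = [
    (r"Version \w-\w+", "Version A-b1_1"),
    (r"\D{3}", "abc"),
    (r"\d{2,4}", "123"),
    (r"\w{3,}", "abcdef"),  # hypothetical example string
    (r"A*B*C*", "AAACC"),
    (r"plurals?", "plural"),
]

results = [re.fullmatch(p, s) is not None for p, s in examples]
print(results)

# Five digits is outside the 2 to 4 range, so this does not match
print(re.fullmatch(r"\d{2,4}", "12345"))
```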

Now I'll convert the phone number pattern using a quantifier.

Here's the original pattern from earlier.

In [17]:
search_pattern

'\\d\\d\\d-\\d\\d-\\d\\d\\d\\d\\d'

In [18]:
# The r prefix marks a raw string
# "\d" means a digit (from the table above)
# Note that there are no spaces between any parts of the expression
search_pattern = r"\d{3}-\d{2}-\d{5}"


In [19]:
re.search(search_pattern, text)

<re.Match object; span=(39, 51), match='074-91-56789'>

In [20]:
my_match = re.search(search_pattern, text)

In [21]:
my_match.group()

'074-91-56789'

## Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any task that involves grouping parts of a regular expression together so that we can break the match down.

We can add parentheses around each part of the search pattern to create groups. Each group can then be extracted from the match on its own.

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [22]:
# Change the search_pattern so that it is split into groups
search_pattern = r"(\d{3})-(\d{2})-(\d{5})"

my_match = re.search(search_pattern, text)

# Extract the text matched by the first group
my_match.group(1)


'074'

If we only wanted the final set of digits from the phone number, we can specify the third set of parentheses: group 3.

In [23]:
my_match.group(3)

'56789'
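All the groups can also be retrieved at once. The sketch below reuses the text and pattern from above and shows **groups()** on the match object, along with what **re.findall** returns when the pattern contains groups.

```python
import re

text = "The lecturer's phone number in LYIT is 074-91-56789. Please call him soon!"
search_pattern = r"(\d{3})-(\d{2})-(\d{5})"
my_match = re.search(search_pattern, text)

# group(0) is the whole match; groups() returns every group as a tuple
print(my_match.group(0))
print(my_match.groups())

# With groups in the pattern, re.findall returns tuples of the groups
print(re.findall(search_pattern, text))
```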

## Other Regex syntax we can use

We can use the **OR** operator, written as **|**, to search for more than one text item within a string.

In [24]:
text = "A man and a woman were here earlier."

re.search(r"man|woman", text)

<re.Match object; span=(2, 5), match='man'>

Note that **re.search** only finds the first occurrence of the match pattern, just as it did when we used it earlier with **LYIT**. So it does not detect the occurrence of **woman** in the search text. We can solve this by using the **re.findall** command, just as I did earlier.

In [25]:
re.findall(r"man|woman", text)

['man', 'woman']
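One detail worth knowing is that the alternatives are tried from left to right at each position in the text. The sketch below uses a made-up string to show the effect of the order. It also explains why **woman** is found in full above: the scan reaches the "w" before the "m", and at that position only the **woman** alternative can match.

```python
import re

snack = "catfish"  # hypothetical sample string

# Alternatives are tried left to right at each position,
# and the first one that succeeds wins
short_first = re.findall(r"cat|catfish", snack)
long_first = re.findall(r"catfish|cat", snack)
print(short_first)
print(long_first)
```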

## The wildcard character

We can use a wildcard, written as **.**, to match any single character in its place. The example below searches for any three-character sequence made up of one character followed by **at**.

In [26]:
text = "The cat sat on the mat."
# The "." represents a wildcard.
# Therefore any single character followed by "at" will be selected.
re.findall(r".at", text)

['cat', 'sat', 'mat']

If I change the text to contain the word **splat**, the search term will only find part of the word **splat** because of the single **.** in the search term. 

In [27]:
text = "The cat sat on the mat and then went splat."
re.findall(r".at", text)

['cat', 'sat', 'mat', 'lat']

I can improve on the search to find any two characters before **at** text. Notice that the output picks up the space before **cat**, **sat** and **mat**

In [28]:
text = "The cat sat on the mat and then went splat."
re.findall(r"..at", text)

[' cat', ' sat', ' mat', 'plat']
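If the aim is to pick up whole words such as **splat** without the leading space, one possible approach is to combine **\w** with the **\*** quantifier from the tables above. This is a sketch that works for this sentence. It would also match text ending in "at" in the middle of a longer word.

```python
import re

text = "The cat sat on the mat and then went splat."

# \w* allows any number of word characters before "at",
# so the whole word is captured without the leading space
words_ending_in_at = re.findall(r"\w*at", text)
print(words_ending_in_at)
```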

## Starts with and ends with

We can check to see if text starts or ends with a particular character. We use the **^** and **$** symbols to find these.

For example, if we want to find a digit at the end of a string, we use the **$** sign.

In [29]:
re.findall(r"\d$", "All rooms on the second floor in LYIT end with a 2")

['2']

Note that this will only find digits if they are the final element of the search string. It will not work if a **.** is the final character. For example, this does not work.

In [30]:
re.findall(r"\d$", "All rooms on the second floor in LYIT end with a 2.")

[]

Similarly, we can check whether a string begins with a digit using the **^** symbol. Here's an example:

In [31]:
re.findall(r"^\d", "1 divided by 0 gives an error.")

['1']

Note that both these options check an entire string and not individual words.
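To handle a sentence that ends in a full stop, one possible sketch is to allow an optional literal **.** before the **$**, using the **?** quantifier and a group from earlier in this document. The second sentence below is made up for illustration.

```python
import re

sentence = "All rooms on the second floor in LYIT end with a 2."

# \. is a literal full stop and ? makes it optional
# The parentheses keep only the digit in the result
final_digit = re.findall(r"(\d)\.?$", sentence)
print(final_digit)

# ^ anchors to the start of the whole string, not to each word
start_digit = re.findall(r"^\d", "Room 2 is on floor 2")
print(start_digit)
```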

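We can exclude characters from a match using square brackets that begin with a caret, **[^ ]**. Any character listed after the **^** inside the square brackets will be excluded from the search result. Note that inside square brackets the **^** means "not", rather than "starts with". Here we use it to exclude digits.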

Here's an example:

In [32]:
# Here's a search phrase including several numbers
search_phrase = "There are 3 numbers within this sentence. The first 1 is 3, the 2nd is 6, and the 3rd is 9."
# Every character listed after the ^ inside the square brackets is excluded from the result
re.findall(r"[^\d]", search_phrase)

['T',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 'w',
 'i',
 't',
 'h',
 'i',
 'n',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.',
 ' ',
 'T',
 'h',
 'e',
 ' ',
 'f',
 'i',
 'r',
 's',
 't',
 ' ',
 ' ',
 'i',
 's',
 ' ',
 ',',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'n',
 'd',
 ' ',
 'i',
 's',
 ' ',
 ',',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 't',
 'h',
 'e',
 ' ',
 'r',
 'd',
 ' ',
 'i',
 's',
 ' ',
 '.']
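A list of single characters is rarely what we want to keep. As a sketch, the characters can be joined back into one string, or the digits can be removed in one step with **re.sub**. The phrase here is a shortened, made-up version of the one above.

```python
import re

search_phrase = "The first 1 is 3, the 2nd is 6."  # shortened hypothetical phrase

# Join the individual non-digit characters back into one string
without_digits = "".join(re.findall(r"[^\d]", search_phrase))
print(without_digits)

# re.sub replaces every digit with an empty string in one step
substituted = re.sub(r"\d", "", search_phrase)
print(substituted)
```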

To get only the words out of a sentence and remove punctuation, we can add the **+** symbol after the square brackets. This matches runs of one or more characters that are not in the excluded set. This is really useful within NLP.

Let's look at an example:

In [33]:
search_phrase = "There are lots of punctuation within this sentence! And I would like to remove it! But how can I do that?"
re.findall(r"[^.!?]+", search_phrase)

['There are lots of punctuation within this sentence',
 ' And I would like to remove it',
 ' But how can I do that']

The string is now broken into sections, based on what is found within the square brackets. Each break occurs when there is a match for a character within the square brackets. 

For example, if I remove the **!** from the square brackets, then the text will no longer be split at each **!**. These sentences will remain together as one string.


In [34]:
re.findall(r"[^.?]+", search_phrase)

['There are lots of punctuation within this sentence! And I would like to remove it! But how can I do that']
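When all three punctuation marks are excluded, each piece after the first starts with a space. A small sketch of one way to tidy this up, using the same phrase:

```python
import re

search_phrase = "There are lots of punctuation within this sentence! And I would like to remove it! But how can I do that?"

# strip() removes the leading space left behind after each punctuation mark
sentences = [piece.strip() for piece in re.findall(r"[^.!?]+", search_phrase)]
print(sentences)
```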

We can use the **+** with a character set to match one or more characters of a given type.

Let's use this feature to find hyphenated words in a sentence. I'm going to use **\w** followed by **+** to match any number of alphanumeric characters. Refer to the table above if you do not understand why I am using **\w** for alphanumeric text. Note that the full stop is not an alphanumeric character, so it is not included in the match. And I do not need to specify how long each word needs to be.

In [35]:
search_phrase = "Here is a sentence with some hyphen-words. I want to remove any long-ish words and hyphen-words."
# Using the "+" symbol allows for any length of alphanumeric word to be found.
# The "-" symbol is part of the pattern, i.e. "an alphanumeric word - an alphanumeric word".
re.findall(r"[\w]+-[\w]+", search_phrase)

['hyphen-words', 'long-ish', 'hyphen-words']

## Parentheses for Multiple Options

If we have multiple options for matching within a word, we can use parentheses to list out these options. For example:

In [40]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = "Hello, would you like some catfish?"
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [42]:
re.search(r"cat(fish|nap|claw)",text)

<re.Match object; span=(27, 34), match='catfish'>

In [43]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [45]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
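Because **re.search** returns None when nothing matches, calling a method such as **group()** on the result would raise an error. A minimal sketch of one way to guard against this (the result variable is made up for illustration):

```python
import re

textthree = "Hello, have you seen this caterpillar?"
match = re.search(r"cat(fish|nap|claw)", textthree)

# re.search returns None when nothing matches,
# so check before calling methods such as group()
if match:
    result = match.group()
else:
    result = "no match"
print(result)
```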