**Table of contents**<a id='toc0_'></a>    
- [RegEx](#toc1_)    
- [Basic RegEx](#toc2_)    
  - [Main methods](#toc2_1_)    
  - [💡 Do it yourself](#toc2_2_)    
  - [`re` in web scraping](#toc2_3_)    
  - [`re` IRL](#toc2_4_)    
  - [💡 Do it yourself](#toc2_5_)    
- [References/Acknowledgments](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[RegEx](#toc0_)

> A regular expression (regex) is a powerful tool for pattern matching within text. It consists of a string of characters that defines a search pattern, allowing you to perform tasks like validating email addresses, extracting specific information from text, or finding and replacing text based on a particular pattern. Regex patterns can include letters, numbers, and special characters, each with special meanings, making it a versatile tool for text manipulation and data extraction in programming and text processing tasks.

(courtesy of ChatGPT)

The `re` library in Python enables us to implement pattern-based text matching, but regular expressions are available in most programming languages and many other tools, including SQL dialects and Excel.

In [None]:
import re

# <a id='toc2_'></a>[Basic RegEx](#toc0_)

## <a id='toc2_1_'></a>[Main methods](#toc0_)

- `findall` - returns a list of all non-overlapping substrings that match a pattern
- `search` - scans the whole string and returns a match object for the first place the pattern is found (or `None` if it is not found)
- `match` - checks only the beginning of the string and returns a match object if the pattern matches there (or `None` otherwise)
- `sub` (substitute) - replaces every match of your pattern with something else

In [None]:
re.findall('[a-z]at', 'Hello, I have a cat and a hat')

In [None]:
re.search('[a-z]at', 'Hello, I have a cat and a hat')

In [None]:
print(re.match('[a-z]at', 'Hello, I have a cat and a hat'))

In [None]:
re.sub('[a-z]at', 'cat',  'Hello, I have a cat and a hat')
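
To make the difference between these calls concrete, here is a small sketch on the same sentence. The variable names are only illustrative; `.group()` and `.span()` are methods of the match object that `search` returns.

```python
import re

text = 'Hello, I have a cat and a hat'
pattern = r'[a-z]at'

first = re.search(pattern, text)     # match object for the first hit, or None
at_start = re.match(pattern, text)   # None here: the text does not start with the pattern
found = re.findall(pattern, text)    # plain list of matched strings
replaced = re.sub(pattern, 'dog', text)

print(first.group(), first.span())   # matched text and its (start, end) position
print(at_start)
print(found)
print(replaced)
```

Note that `search` and `match` return a match object rather than a string, so you need `.group()` to get the matched text.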

```python
re.findall(pattern, string)
```
Returns a list of the matched patterns

In [None]:
string = "that pilates class is at 9:00"
re.findall('at', string)

**Note:** RegEx patterns are case-sensitive by default.

In [None]:
string = "At 9:00 Muna takes the bus"
re.findall('at', string) #At, a t do not match
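
If you do want to ignore case, `re` accepts a `flags` argument. A minimal sketch using `re.IGNORECASE` on the same sentence:

```python
import re

text = "At 9:00 Muna takes the bus"

case_sensitive = re.findall('at', text)                         # 'At' is skipped
case_insensitive = re.findall('at', text, flags=re.IGNORECASE)  # 'At' is found

print(case_sensitive)
print(case_insensitive)
```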

This method starts to make more sense when you start using RegEx patterns:

In [None]:
string = "he took his hat off in his hut before he h1t that hotdog shackle"

In [None]:
# Find hat or hit or hot
re.findall('hat|hit|hot', string)

A simpler way to write the above is by using a set of allowed characters in square brackets (a "character class"):

In [None]:
# Do the same but using the character list
re.findall('h[aio]t', string)

Similarly, you can also specify what characters **not** to match:

In [None]:
# Add not operator to avoid matching
re.findall('h[^aio]t', string)

In [None]:
re.findall('h[a-z]t', string)

You can also look for **any** matching character by using a dot (`.`):

In [None]:
re.findall('h.t', string)

Or find specific types of characters, such as:
- alphanumeric characters (i.e. letters and digits) `\w` (this also includes underscore - `_`)
- whitespace (i.e. spaces, tabs, newlines) `\s`
- digits `\d` 

In [None]:
string = """We1rd 
w@rd """

In [None]:
list(string)

In [None]:
# Find all alphanumeric chars
display(re.findall(r'\w', string)) # @ symbol was not counted

In [None]:
# Find all whitespace
display(re.findall(r'\s', string)) # \s matches the newline as well as the spaces

In [None]:
# Find all digits
display(re.findall(r'\d', string))

If you're expecting to see a lot of the same character but are not sure how many, you can say that something repeats one or more times by using a plus sign (`+`):

In [None]:
string = 'Elton, this is a looooooooooooooooooooong sentence with a looooooooooooot of repetition'

In [None]:
# Find the looooong strings
re.findall('lo+[nt]', string)

However, if you're not sure that an item will appear at all, you can add a question mark (`?`) after it:

In [None]:
english_string = 'I prefer color to colour'

In [None]:
re.findall('colou?r', english_string)

In [None]:
# Make the rule stricter
re.findall('lo+[nt]g?', string)

## <a id='toc2_2_'></a>[💡 Do it yourself](#toc0_)

How would you find `looooaaaaaoooaoaoaong` and `looaaooaoooaaoooot` in the sentence:
`Elton, this is a looooaaaaaoooaoaoaong sentence with a looaaooaoooaaoooot of repetition`?

In [None]:
# Your code here

In addition to using the plus sign (`+`) to look for one or more characters, you can also use the asterisk (`*`) for sets of characters that may appear zero or more times:

In [None]:
# Will 'o' appear at all?
re.findall('lo*[nt]g?','Elton, this is a loooooong sentence with a looooooooooooot of repetition')

 <span style="color:orange">Where is the `lt` coming from?</span>
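
One way to investigate is to compare `+` and `*` side by side. This is a small sketch on a made-up sentence (not the one above):

```python
import re

text = 'Elton sang a looong song'

plus_matches = re.findall(r'lo+[nt]', text)  # needs at least one 'o' after the 'l'
star_matches = re.findall(r'lo*[nt]', text)  # zero 'o's is fine, so 'l' + 't' in 'Elton' qualifies

print(plus_matches)
print(star_matches)
```

Because `*` also accepts zero repetitions, any `l` directly followed by `n` or `t` becomes a match.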

So far we've discussed how to look for characters that may appear more than once (`+`), may appear once or not at all (`?`), or may appear multiple times or not at all (`*`). However, if you know the number of characters you are looking for, you can also specify that in your RegEx pattern (`{n}`):

In [None]:
string = 'The phone numbers, with indicatives, are 00451123 456 789 and 00353 987654321'

# Find the country prefixes
re.findall(r'00\d{3}', string) # e.g. 00353 (Ireland)

In [None]:
# Find only the prefixes that start with 0035
re.findall(r'0035\d', string)

Or you can specify a range instead of a specific number of characters (`{min,max}`, written without a space after the comma):

In [None]:
string = 'The phone numbers, with indicatives, are 00351123 456 789 and 0049 987654321'

# Find the country prefixes - level 2
re.findall('00\d{2,3}', string) #germany
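
The different forms of the curly-brace quantifier can be compared on a short, made-up string. This is a sketch, and the variable names are only illustrative:

```python
import re

text = 'ids: 7, 42, 123, 4567'

two_to_three = re.findall(r'\d{2,3}', text)  # between 2 and 3 digits (takes as many as it can)
two_or_more = re.findall(r'\d{2,}', text)    # upper bound left open
with_space = re.findall(r'\d{2, 3}', text)   # the space breaks the quantifier: braces are read literally

print(two_to_three)
print(two_or_more)
print(with_space)
```

Note how `{2,3}` cuts `4567` down to `456`, because it stops after three digits.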

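Lastly, you can extract separate elements at the same time. For example, we can extract the prefix and the phone number simultaneously by wrapping each part in parentheses (`()`), which creates a capture group. With more than one group, `findall` returns a list of tuples: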

In [None]:
string = 'The phone numbers, with indicatives, are 00351912345678 and +351967654321'

# Find the country prefixes - level 3
re.findall(r'(\+\d{3}|00\d{3})(\d{9})', string)

**Note:** Since `+` is a special character in RegEx patterns, we need to "escape" it with a backslash (`\`): `\+` searches for the plus sign itself.

 <span style="color:orange">Which other RegEx special characters that we've seen today might need to be escaped?</span>

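Escaping a backslash (`\`) is a bit trickier. In a regular Python string you need 4 of them, because Python and the RegEx engine each consume one level of escaping; in a raw string (`r'...'`) 2 are enough:

In [None]:
slash_string = r'\string1 is very cool'
re.findall('\\\\', slash_string) # same as re.findall(r'\\', slash_string)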
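
If you would rather not escape special characters by hand, `re.escape` does it for you. A minimal sketch on a made-up string, which also checks that the raw-string and regular-string spellings of the backslash pattern are the same thing:

```python
import re

text = r'C:\temp, call +351'   # raw string: the backslash is kept as-is

# Two spellings of the same two-character pattern
same_pattern = (r'\\' == '\\\\')

backslashes = re.findall(r'\\', text)

# re.escape puts a backslash in front of characters that are special in a pattern
escaped = re.escape('+351')
plus_hits = re.findall(escaped, text)

print(same_pattern, backslashes, escaped, plus_hits)
```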

Let's extract just the phone numbers:

In [None]:
re.findall(r'(\+\d{3}|00\d{3})(\d{9})', string)

In [None]:
# Extract the prefix-number pairs
indicative_number_pairs = re.findall(r'(\+\d{3}|00\d{3})(\d{9})', string)

# Get the numbers with a loop
numbers=[]
for member in indicative_number_pairs:
  numbers.append(member[1])
numbers

In [None]:
# Extract the numbers with a lambda
list(map((lambda x : x[1]), indicative_number_pairs))
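
An alternative sketch uses `re.finditer`, which yields match objects instead of tuples, so each group can be picked by its number with `.group(n)`. The variable names here are only illustrative:

```python
import re

text = 'The phone numbers, with indicatives, are 00351912345678 and +351967654321'
pattern = r'(\+\d{3}|00\d{3})(\d{9})'

# group(1) is the prefix, group(2) is the number
numbers = [m.group(2) for m in re.finditer(pattern, text)]
prefixes = [m.group(1) for m in re.finditer(pattern, text)]

print(numbers)
print(prefixes)
```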

**Advanced**

What if you had to find really weirdly formatted phone numbers?

In [None]:
string = 'The phone numbers, as they gave them to us, are 00351 933456789, +351927654321, 00351 915 678 901, 969 343 291'

In [None]:
re.findall(r'((\+\d{3} ?|00\d{3} ?)?)((\d{3} ?){3})', string)

In [None]:
# Get just the phone numbers (the third group of each tuple)
groupings_complex = re.findall(r'((\+\d{3} ?|00\d{3} ?)?)((\d{3} ?){3})', string)
list(map( (lambda x : x[2]), groupings_complex))
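
Counting nested groups to find the right index gets awkward. One possible variation, shown here as a sketch rather than the lesson's own method, marks the helper groups as non-capturing with `(?:...)`. Only one capture group remains, so `findall` returns plain strings:

```python
import re

text = ('The phone numbers, as they gave them to us, are '
        '00351 933456789, +351927654321, 00351 915 678 901, 969 343 291')

# (?:...) groups without capturing; only the number itself is captured
pattern = r'(?:\+\d{3} ?|00\d{3} ?)?((?:\d{3} ?){3})'

numbers = re.findall(pattern, text)
normalized = [n.replace(' ', '') for n in numbers]  # drop the inner spaces

print(numbers)
print(normalized)
```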

## <a id='toc2_3_'></a>[`re` in web scraping](#toc0_)

In addition to using the `BeautifulSoup` library to search for HTML tags, attributes and CSS selectors, we can also use RegEx to find patterns:

In [None]:
# I will create a typical pattern for matching a script tag
pattern = '<script>.*</script>'

In [None]:
# Then I'll just get the Wikipedia landing page for my example
import requests
response = requests.get('https://wikipedia.com')
response.content

In [None]:
# Now I'll extract the JS scripts from the page:
re.findall(pattern, response.content)

I get a `TypeError` because `response.content` is a `bytes` object while my pattern is a string (`str`), and `re` requires both to be the same type. `response.text` gives me the decoded string instead:

In [None]:
re.findall(pattern, response.text)
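
Two things are worth knowing about this pattern: `.*` is greedy (it grabs as much as it can), and `<script>` does not match tags that carry attributes. The sketch below uses a small made-up HTML snippet instead of a live page, and compares the original pattern with a non-greedy variant (`.*?`) that also allows attributes. The `re.DOTALL` flag lets `.` match newlines too.

```python
import re

html = ('<html><script>var a = 1;</script><p>Hi</p>'
        '<script src="app.js"></script></html>')

# Greedy: runs from the first <script> to the last </script>
greedy = re.findall(r'<script>.*</script>', html)

# Non-greedy, attributes allowed: one entry per script tag
lazy = re.findall(r'<script[^>]*>.*?</script>', html, flags=re.DOTALL)

print(greedy)
print(lazy)
```

For anything beyond quick pattern checks, an HTML parser such as `BeautifulSoup` is usually the more robust choice.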

## <a id='toc2_4_'></a>[`re` IRL](#toc0_)

In [None]:
# Your BFF is back
import pandas as pd

In [None]:
# Read our dataset and get an idea of what it looks like
enron = pd.read_csv('enron.csv')
display(enron.shape)
enron.head()

In [None]:
# What does a message look like?
print(enron.iloc[0]['raw message'])

We see we have a sender (`From:`), a subject (`Subject:`), CC, BCC, the date (`Date:`) and the body (`body:`) of the message. Therefore, we can parse our dataset so it has a column for each of these bits of information.

In [None]:
# Get the sender(s) of the message
def get_sender(message):
    return re.findall(r'From: [\w@\.]+ ', message)

In [None]:
# Apply to dataframe
enron['From'] = enron['raw message'].apply(get_sender)
enron

The result is a list, and each entry still includes the `From: ` prefix. With groups we can keep only the email address of the first match:

In [None]:
# Let's do it better
def get_sender(message):
    return re.findall(r'(From: )([\w\@\.-]+)( )', message)[0][1]

In [None]:
enron['From'] = enron['raw message'].apply(get_sender)
enron

What if there's no email at all?

In [None]:
def get_sender(message):
    try:
        out = re.findall(r'(From: )([\w\@\.-]+)( )', message)[0][1]
    except IndexError:  # findall returned an empty list: no sender found
        out = ''
    return out

In [None]:
enron['From'] = enron['raw message'].apply(get_sender)
enron
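
The same idea can be written without `try`/`except` by using `re.search`, which returns `None` when nothing matches. This is a sketch under a few assumptions: the sample message is made up (it is not from the dataset), and the pattern drops the trailing-space requirement of the version above.

```python
import re

def get_sender(message):
    # One capture group around the address; search stops at the first match
    match = re.search(r'From: ([\w@.-]+)', message)
    return match.group(1) if match else ''

# Hypothetical message, for illustration only
sample = ('Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n'
          'From: jane.doe@example.com\n'
          'To: john.roe@example.com\n'
          'Subject: hello\n')

sender = get_sender(sample)
missing = get_sender('no header in this text')

print(sender)
print(repr(missing))
```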

## <a id='toc2_5_'></a>[💡 Do it yourself](#toc0_)

Following a similar logic, extract the recipient column!

In [None]:
# Your code here

#solution
def get_receiver(message):
  to_list = re.findall('To:.*Subject:',message)
  if len(to_list)>0:
    out = to_list
  else:
    out=''
  return out

In [None]:
enron['To'] = enron['raw message'].apply(get_receiver)
enron


In [None]:
# Your code here

#solution
def get_receiver(message):
  to_list = re.findall(r'(To: )([\w\@\.-]+)([ ,])',message)
  if len(to_list)>0:
    out = to_list[0][1]
  else:
    out=''
  return out

In [None]:
enron['To'] = enron['raw message'].apply(get_receiver)
enron


In [None]:
print(enron.iloc[3]['raw message'])

Now let's get the date in a column:

In [None]:
# Check raw message again
print(enron.iloc[0]['raw message'])

We see the date is formatted like: {`Day of the week` (3 letters)}, {`Day`} {`Month` (3 letters)} {`Year` (4 digits)} {`Hours`}:{`Minutes`}:{`Seconds`} {`Time zone` (+/- 4 digits)} ({`Timezone name`})

In [None]:
date_pattern = r'Date: \w{3}, \d{1,2} \w{3} \d{4}'
enron['Date'] = enron['raw message'].apply(lambda x: re.findall(date_pattern, x)[0])
enron

In [None]:
# Let's remove the 'Date: ' prefix by putting it in its own group
date_pattern = r'(Date: )(\w{3}, \d{1,2} \w{3} \d{4})'
enron['Date'] = enron['raw message'].apply(lambda x: re.findall(date_pattern, x)[0][1])
enron

In [None]:
# Let's remove the day of the week
date_pattern = r'(Date: )(\w{3}, )(\d{1,2} \w{3} \d{4})'
enron['Date'] = enron['raw message'].apply(lambda x: re.findall(date_pattern, x)[0][2])
enron
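
Once the date is isolated as text, it can be turned into a real date object. A minimal sketch with the standard library's `datetime.strptime`, assuming English month abbreviations and using a made-up header line rather than a row from the dataset:

```python
import re
from datetime import datetime

# Hypothetical header line, for illustration only
sample = 'Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)'

# Capture only the "day month year" part
match = re.search(r'Date: \w{3}, (\d{1,2} \w{3} \d{4})', sample)
date_text = match.group(1)

# %d = day of month, %b = abbreviated month name, %Y = 4-digit year
parsed = datetime.strptime(date_text, '%d %b %Y')

print(date_text)
print(parsed.date())
```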

Let's also find potential names by looking for the following pattern: {`First Name`} {`Last Name`}

In [None]:
def names_mentioned_narrow_down(message):
    return re.findall('[A-Z][a-z]+ [A-Z][a-z]+', message)

**Notes:**
- This time we don't use `\w` as we know that names do not have digits (unless you're `X AE A-XII`, formerly known as `X Æ A-12`)
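- We can define ranges of characters to search for, e.g. `[a-z]`
- We can specify the capitalization of the range we're interested in: `[A-Z]`, `[a-z]`, or `[A-Za-z]` for both (avoid `[A-z]`, which also matches the ASCII characters that sit between `Z` and `a`, such as `[`, `^` and `_`)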
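
Ranges follow character-code order, which is why the exact spelling of a range matters. A short sketch on a made-up string shows what `[A-z]` picks up compared with `[A-Za-z]`:

```python
import re

text = 'a_Z^['

wide_range = re.findall(r'[A-z]', text)       # everything from 'A' to 'z' in ASCII order
letters_only = re.findall(r'[A-Za-z]', text)  # two separate ranges: letters only

print(wide_range)
print(letters_only)
```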

In [None]:
enron['names_mentioned'] = enron['raw message'].apply(names_mentioned_narrow_down)
enron

## <a id='toc2_5_'></a>[💡 Do it yourself](#toc0_)

Now find the emails mentioned!

In [None]:
# Your code here

We can also extract any phone numbers that appear in our message, as US numbers typically follow this pattern: `###-###-####`

In [None]:
def phone_nr_mentioned(message):
    return re.findall(r'([0-9]{3}-[0-9]{3}-[0-9]{4})', message)

In [None]:
enron['phone_nr_mentioned'] = enron['raw message'].apply(phone_nr_mentioned)
enron

# <a id='toc3_'></a>[References/Acknowledgments](#toc0_)

This lesson was adapted from David Henriques's material, with a couple of edits.