<a href="https://colab.research.google.com/github/ua-datalab/NLP-Speech/blob/main/Introduction_to_Regular_Expressions/Introduction_to_Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Regular Expressions (RegEx)

![](https://raw.githubusercontent.com/ua-datalab/NLP-Speech/main/Introduction_to_Regular_Expressions/regex_header.avif)

Image Credit: [Real Python: Regular Expressions: Regexes in Python ](https://realpython.com/regex-python/)

## Quick Housekeeping
- Check screenshare
- Ensure in-person attendees have access
- Post link in zoom chat
- Discuss series and other workshops

## The What

### What are regular expressions?
Regular expressions are a **combination of symbols and characters** that can be used to **intelligently and speedily** perform **pattern matching for text**. Most modern regex engines support Unicode, so patterns can work across multiple languages, scientific symbols and scripts.

In combination with other tools, they allow for efficient text scraping, filtering, and advanced find-and-replace programs.

What are some use cases for RegEx in your work?

### Brief history
- Introduced to solve editing and string matching problems in the 1950s
- Built into advanced functions in Perl once NLP began to take off
- Very important for many pre-processing steps

![](https://github.com/ua-datalab/NLP-Speech/blob/main/Introduction_to_Regular_Expressions/regular_expressions.png?raw=true)


### Terminology
(We will refer back to this section throughout the session)

- **Pattern**: a construction that will be matched against the text. Example: `and\..*`
- **String**: the test string that the pattern is matched against
- **Character**: a letter, digit or symbol
- **Letters, digits and alphanumeric characters**: Roman letters (`[a-zA-Z]`) and digits from 0 to 9 (`[0-9]` or `\d`). Together with the underscore `_`, they are represented by `\w`.
- **Meta characters**: special symbols used in regular expressions, such as `.`, `*`, `?` and `(`. To use them literally, escape them with a backslash `\`. See examples below.
- **Space**: `\s` (matches any whitespace character, including tabs and newlines)
- **Newline**: `\n`
- **Wildcard**
  - **dots**: `.` Any single character (except a newline, by default)
  - **stars**: `*` Zero or more of the preceding element
  - **pluses**: `+` One or more of the preceding element
- **Brackets**:
  - **Curly**: Quantifiers. `{n}` (exactly n), `{n,}` (n or more), `{n,m}` (between n and m)
  - **Square**: a set of characters, any one of which can match. `[?4s*]{3}`
  - **Round**: used to group characters together. `c(at)`
- **Anchors**: start of line `^` (when used outside a list), end of line `$`
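
The difference between a meta character and its escaped, literal form is easiest to see side by side. The sketch below is illustrative only: the test strings are made up, and `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

# A bare dot matches any single character, so "3x14" matches too.
dot_is_wildcard = bool(re.fullmatch(r"3.14", "3x14"))
# Escaping the dot with a backslash makes it a literal full stop.
escaped_rejects_x = re.fullmatch(r"3\.14", "3x14") is None
escaped_accepts_dot = bool(re.fullmatch(r"3\.14", "3.14"))
print(dot_is_wildcard, escaped_rejects_x, escaped_accepts_dot)

# re.escape() adds the backslashes when a whole string should be treated literally.
literal = "cost: $5.00 (approx.)"
escaped_literal = re.escape(literal)
print(bool(re.fullmatch(escaped_literal, literal)))

# The terminology example: a literal "and.", then any characters, zero or more times.
tail = re.search(r"and\..*", "this and. that").group()
print(tail)
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regex engine sees them.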

## The Why

- Advanced searches are the backbone of any strong data cleaning pipeline, and an initial step to any major NLP task.
- Search pipelines often contain regular expressions, and having a good intuition of how they work gives us a better look under the hood.
- Regex is a search option in most popular text editing platforms.
- Consider this an exercise in assessing whether your project needs an LLM, or just advanced search!

## The How
### How do I validate my regular expression?
- Online platforms such as [regex101](https://regex101.com/)
- Brainstorming with LLMs

### How do I use regex with different platforms?





In [None]:
# Regex on the CLI
#  Extract a phone number with grep
!echo "Please call back at 111-092-0192" | grep -oE '[0-9-]+'
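
For comparison, here is a rough Python counterpart of the `grep -oE` call, using `re.findall`. The second test string and the stricter pattern are assumptions added for illustration; they show why `[0-9-]+` is a loose description of a phone number.

```python
import re

text = "Please call back at 111-092-0192"

# re.findall returns every non-overlapping match, much like grep -o.
loose = re.findall(r"[0-9-]+", text)
print(loose)

# The loose pattern also accepts a lone hyphen, which is rarely what we want.
noisy = re.findall(r"[0-9-]+", "well-known 555-0100")
print(noisy)

# A stricter sketch: exactly three digits, three digits, then four digits.
strict = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text)
print(strict)
```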

In [None]:
# Regex with Python
#  re.match returns a Match object if the pattern matches at the
#  beginning of the string, and None otherwise.

import re

def regex_match(pattern, string): #function takes two inputs
    result = re.match(pattern, string) #regex execution
    if result:
        print(f"The input \"{string}\" is a regex match for \"{pattern}\".")
    else:
        print(f"The input \"{string}\" is not a regex match for \"{pattern}\".")
    return None
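
Because `regex_match` relies on `re.match`, it only looks at the start of the string. The sketch below contrasts it with `re.search` and `re.fullmatch` on one made-up string, as a reminder of which function to reach for.

```python
import re

text = "A cat."

at_start = re.match(r"cat", text)        # None: "cat" is not at the start
anywhere = re.search(r"cat", text)       # finds "cat" further into the string
partial = re.fullmatch(r"A cat", text)   # None: the final "." is left over
whole = re.fullmatch(r"A cat\.", text)   # matches the entire string

print(at_start, anywhere.span(), partial, whole.group())
```

This is why several patterns in the examples start with `.*`: it lets `re.match` skip over the beginning of the string.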

In [None]:
#Some more examples for patterns discussed above:

# Quantifiers
# pluses: + One or more of the preceding element
# stars: * Zero or more of the preceding element
regex_match(pattern="T+[a-z]+", string="A cat.")
regex_match(pattern="T+[a-z]+", string="Tthe cat.")
regex_match(pattern="T+[a-z]+", string="The cat.")

# Letters, digits and alphanumeric characters:
regex_match(pattern=r"\w.*", string="alphanumeric string 1.")
regex_match(pattern="[A-Z].*", string="This line begins with capitalization.")

# Space: `\s`
regex_match(pattern=r".+\s.+", string="spaces in line.")

# Newline: `\n`
regex_match(pattern="[A-Z].*\n[A-Z].*", string="First line.\nNext line.")

# Other Brackets:
  # Curly: Quantifiers. `{n}, {n,}, {n,m}`
regex_match(pattern="T[a-z]{2}", string="The cat.")
regex_match(pattern="T[a-z]{2}", string="To cat.")
regex_match(pattern="T[a-z]{1,2}", string="To cat.")

  # Round: used to group characters together. `c(at)`
regex_match(pattern=".*c(at)", string="A cat")

# Anchors: start of line `^` (when used outside a list), end of line `$`
regex_match(pattern="^A.*c(at)", string="A cat.")
regex_match(pattern="^[A-Z].*c(at)", string="Cats are cool.")  # no match: lowercase "cat" never appears
regex_match(pattern=r".*c(at).*\.$", string="A cat is a mammal.")

In [None]:
# Regex with R
# activate R magic
%load_ext rpy2.ipython

In [None]:
%%R
# grep(value = TRUE) returns the matching strings; grepl returns TRUE/FALSE for each element.

text <- c("Please call back at 111-092-0192", "Our number is 093-817-9281", "we have no callback number.")
pattern <- "[0-9]+"
matches <- grep(pattern, text, value = TRUE)

print(matches)

## Group Project
The task: Take a look at a fraudulent emails dataset, to extract some parts of the email bodies. Tasks could include:
- Capturing sender email IDs
- Reading subject lines
- Capturing lines with the email text

Aim: Consider this a text pre-processing step, where we look at our dataset and extract the different components that we may need.

Dataset: https://www.kaggle.com/datasets/rtatman/fraudulent-email-corpus

In [None]:
# Load the dataset
import requests

master = "https://raw.githubusercontent.com/JSemelhago/FraudEmailAnalysis/master/data/fraudulent_emails.txt"
req = requests.get(master)
req.raise_for_status()  # stop early if the download failed

# For ease, let's split the dataset by line
email_corpus = req.text.splitlines()

In [None]:
# Print some lines:
for line in email_corpus[:300]:
  print(line)


### What are some relevant portions of this corpus that we may need to extract?
- From or return email ID
- Subject
- The message itself

### Determining the pattern for each relevant element
- Check for newline, capitalization, symbols
- Consider the pattern for assessing what makes an email sound like spam ("unique opportunity", if the sender is royalty, text in all caps, html tags in the content).

<details>
<summary><strong>Pattern for elements of each email </strong></summary>
<ul>
<li> Capitalized letter at the beginning of the line, followed by heading and a colon and a space </li>
<li> Return path has all these, plus < > enclosing the email ID</li>
</ul>
</details>

Let's try to locate all sender IDs and save them:

In [None]:
import re
# Look for "From:" at the beginning of the line,
# skip everything up to an opening angle bracket, capture the address,
# and ignore everything after the closing bracket

sender_addresses = []
for line in email_corpus:
  pattern = r"^[Ff]rom:.+<(.+?)>"
  match = re.search(pattern, line)
  if match:
      sender_addresses.append(match.group(1))
# print(sender_addresses)
# print(len(sender_addresses))
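
To see what this pattern does and does not capture without downloading the corpus, here is a small sketch on hypothetical header lines. The addresses and names are made up for illustration and are not taken from the dataset.

```python
import re

# Hypothetical header lines, invented for this example.
sample_lines = [
    'From: "Mr. Example" <sender@example.com>',
    "from: Jane Doe <jane.doe@example.org>",
    "From: noangle@example.com",        # no angle brackets, so no match
    "Reply-To: <other@example.com>",    # does not start with "From:"
]

sender_pattern = re.compile(r"^[Ff]rom:.+<(.+?)>")

found = []
for line in sample_lines:
    match = sender_pattern.search(line)
    if match:
        # group(1) is the text captured by the round brackets.
        found.append(match.group(1))

print(found)
```

Note that addresses written without angle brackets are skipped, so the length of `sender_addresses` may undercount the senders in the corpus.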

Try extracting the subject line of each email. You can adapt the code above.



In [None]:
## Add your code here

How about exploring the various salutations?

In [None]:
# How many lines begin with "dear" or "attention"?

salutations = []
for line in email_corpus:
  pattern = r"^(dear|attention)\s"
  match = re.search(pattern, line, re.IGNORECASE)
  if match:
      salutations.append(match.group(1))
print(len(salutations))

How would you examine the email contents for weird elements?
- Text in all caps
- HTML tags in the text
- strings containing non-alphanumeric symbols

In [None]:
## Add your code here
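
One possible starting point, shown on made-up lines rather than the corpus: the three patterns below are assumptions about what "weird" might mean here, not a definitive spam test, and each will need tuning against real data.

```python
import re

# Hypothetical lines, invented for this example.
sample_lines = [
    "URGENT BUSINESS PROPOSAL",
    "<b>Dear friend</b>, reply now",
    "Please reply soon.",
    "$$$ WIN BIG $$$ !!!",
]

# At least one capital letter and no lowercase letters anywhere in the line.
all_caps = re.compile(r"^[^a-z]*[A-Z][^a-z]*$")
# Something that looks like an opening or closing HTML tag.
html_tag = re.compile(r"</?[a-zA-Z][^>]*>")
# Three or more consecutive characters that are neither word characters nor whitespace.
symbol_run = re.compile(r"[^\w\s]{3,}")

caps_lines = [line for line in sample_lines if all_caps.search(line)]
html_lines = [line for line in sample_lines if html_tag.search(line)]
symbol_lines = [line for line in sample_lines if symbol_run.search(line)]

print(caps_lines)
print(html_lines)
print(symbol_lines)
```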

<details>
<summary><strong>How difficult would it be to extract the email body from this dataset?  </strong></summary>

Consider the following regular expression, which extracts all the content between a "Status: O" line and the next line that starts with "From:". This roughly corresponds to the body of the email.
```
# code for extracting email body:
pattern = r"Status:\s+O\n+(.*?)(?=\nFrom:|$)"
matches = re.findall(pattern, req.text, re.DOTALL)

print(matches[2])
print("number of matches for email body: ",len(matches))
```
To get a sense of how challenging this NLP task is, let us examine how well this regex performs. Read through different outputs, and compare the number of matches returned with the number of actual emails (how would we get a sense of that number?).
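
One way to make that comparison is to count the separator lines and the body matches on the same text. The sketch below uses a tiny hypothetical mailbox, not the real corpus: the "From r" separator follows the description above, while the `Status: RO` header is an assumed variant added to show how a body can be missed.

```python
import re

# Hypothetical two-email mailbox, invented for this example.
sample = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "From: a@example.com\n"
    "Status: O\n"
    "\n"
    "FIRST BODY\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "From: b@example.com\n"
    "Status: RO\n"
    "\n"
    "SECOND BODY\n"
)

# With re.MULTILINE, ^ matches at the start of every line, not just the string.
email_count = len(re.findall(r"^From r", sample, re.MULTILINE))

# The body pattern from above, unchanged.
bodies = re.findall(r"Status:\s+O\n+(.*?)(?=\nFrom:|$)", sample, re.DOTALL)

print(email_count, len(bodies))
print(bodies[0])
```

Here the second email is missed because its status line does not fit the pattern, and the first "body" also swallows the next separator line.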

At their core, language models have been working on problems such as this one, to the point that they can now generate regular expressions from simple prompts!


</details>



## Further reading and resources
- Regex Cookbook
- Platforms
- Regex crossword
- [Learn Regex: A Beginner’s Guide](https://www.sitepoint.com/learn-regex/)
-  [RegexOne: Learn Regular Expressions with simple, interactive exercises](https://www.regexone.com/)