# List of Regular Expressions for Data Cleaning and Validation
- Author: @stefansphtr
- Date: February 29, 2024
---

## List of Regular Expressions for Data Cleaning

### 1. Whitespace removal

Leading and trailing whitespace can break exact comparisons, joins, and deduplication. The regex `^\s+|\s+$` matches whitespace at the start of a string (`^\s+`) or at its end (`\s+$`), so replacing every match with an empty string trims both ends.

In [None]:
import re

# Example of using regular expressions to remove whitespace
text = "   Hello, World!   "
clean_text = re.sub(r"^\s+|\s+$", "", text)
print(clean_text)
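
Trimming the ends does not touch runs of whitespace inside the string. As a minimal sketch, the hypothetical helper `normalize_whitespace` below (a name introduced here for illustration, not part of the original notebook) collapses every run of whitespace to a single space and then trims both ends.

```python
import re

def normalize_whitespace(text):
    # Hypothetical helper for illustration: collapse each run of whitespace
    # (spaces, tabs, newlines) into one space, then trim both ends.
    collapsed = re.sub(r"\s+", " ", text)
    return re.sub(r"^\s+|\s+$", "", collapsed)

print(normalize_whitespace("   Hello,\t  World!  \n"))
```

For plain trimming with no other cleanup, the built-in `str.strip()` is usually the simpler choice.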

### 2. Punctuation removal

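Punctuation can cause problems when searching for strings in text. The regex `[^\w\s]` matches any character that is neither a word character (letter, digit, or underscore) nor whitespace, so replacing every match with an empty string removes punctuation. Underscores are kept, because `\w` includes `_`.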

In [None]:
import re

text = "Hello, World!"
clean_text = re.sub(r"[^\w\s]", "", text)
print(clean_text)
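
The underscore behavior is easy to overlook, so here is a small sketch of it. The sample string and variable names are made up for illustration; the second pattern, `[^\w\s]|_`, is one possible way to drop underscores too.

```python
import re

# Illustrative sample; "\u00e9" is the accented letter "é".
text = "snake_case, caf\u00e9!"

# \w covers letters, digits, and the underscore, so "_" and "é" survive.
keeps_underscore = re.sub(r"[^\w\s]", "", text)

# Adding "|_" as an alternative removes underscores as well.
drops_underscore = re.sub(r"[^\w\s]|_", "", text)

print(keeps_underscore)
print(drops_underscore)
```

In Python 3, `\w` is Unicode-aware for `str` patterns, which is why the accented letter is treated as a word character rather than punctuation.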

### 3. Digit removal 

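Digits are often noise when cleaning free-text data. The regex `\d` matches a single digit, so replacing every match with an empty string removes all digits from a string. The whitespace around the removed digits is left in place.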

In [None]:
import re

text = "This is a test. 1 2 3"
# Remove every digit; the spaces around the removed digits are left untouched
clean_text = re.sub(r"\d", "", text)
print(clean_text)
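
Removing digits one at a time leaves the surrounding spaces behind. A possible follow-up, sketched here with a hypothetical helper named `remove_numbers`, deletes whole runs of digits and then tidies the leftover whitespace.

```python
import re

def remove_numbers(text):
    # Hypothetical helper: delete each run of digits, then collapse
    # the whitespace left behind and trim the ends.
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

print(remove_numbers("This is a test. 1 2 3"))
print(remove_numbers("Room 42 has 3 windows"))
```

For `str` patterns, `\d` also matches decimal digits from other scripts; `[0-9]` or the `re.ASCII` flag restricts the match to ASCII digits if that matters for the data.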


### 4. Non-ASCII characters removal 

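Non-ASCII characters can cause issues when a downstream system expects plain ASCII. The regex `[^\x00-\x7F]+` matches runs of characters outside the ASCII range (code points 0 to 127), so replacing every match with an empty string removes them. This also drops accented letters such as `é`, not only text in other scripts.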

In [None]:
import re

# Define a string that contains both ASCII and non-ASCII characters
text = "This is the example こんにちは"

# The regular expression "[^\x00-\x7F]+" matches runs of non-ASCII characters. Here's how:
# "[^\x00-\x7F]" is a negated character set: it matches any character that is not (^) in the range from \x00 to \x7F.
# \x00-\x7F is the range of ASCII characters, so [^\x00-\x7F] matches any non-ASCII character.
# The plus sign (+) means that the regex matches one or more of these non-ASCII characters in a row.

# Use the sub method to replace all non-ASCII characters in the text with an empty string
clean_text = re.sub(r"[^\x00-\x7F]+", "", text)

# Print the cleaned text
print(clean_text)
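
Deleting non-ASCII characters outright turns a word like "Café" into "Caf". When keeping the base letter is preferable, one option is to decompose the text with `unicodedata.normalize` before applying the regex. The helper name `to_ascii` is an assumption made for this sketch.

```python
import re
import unicodedata

def to_ascii(text):
    # Hypothetical helper: NFKD normalization splits an accented letter
    # into a base letter plus combining marks; the regex then removes
    # the marks and any other non-ASCII characters.
    decomposed = unicodedata.normalize("NFKD", text)
    return re.sub(r"[^\x00-\x7F]+", "", decomposed)

# "\u00e9" is "é" and "\u00ef" is "ï", written as escapes to be explicit.
sample = "Caf\u00e9 na\u00efve"

plain_removal = re.sub(r"[^\x00-\x7F]+", "", sample)
with_decomposition = to_ascii(sample)

print(plain_removal)
print(with_decomposition)
```

Characters with no ASCII base letter, such as the Japanese text in the example above, are still removed entirely.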

### 5. Email extraction 

Extracting email addresses from text data can be useful. The regex `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b` can be used to extract most common email addresses from a string. It is a practical approximation, not a full validator of the email address syntax.

In [None]:
import re

# Define a string that contains an email address
text = "This is an example text with an email address: example@example.com"

# The regular expression "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b" matches most common email addresses. Here's how:
# "\b" is a word boundary. This ensures that the match does not start in the middle of a word.
# "[A-Za-z0-9._%+-]+" matches one or more (+) of the following:
# - any uppercase or lowercase letter ([A-Za-z]),
# - any digit ([0-9]),
# - any of the special characters in the set [._%+-].
# This part matches the username part of the email address.
# "@" matches the @ symbol in an email address.
# "[A-Za-z0-9.-]+" matches one or more (+) of the following:
# - any uppercase or lowercase letter ([A-Za-z]),
# - any digit ([0-9]),
# - either a dot (.) or a hyphen (-).
# This part matches the domain name of the email address.
# "\.[A-Za-z]{2,}" matches a dot (.) followed by two or more ({2,}) uppercase or lowercase letters ([A-Za-z]).
# Note: inside a character set, "|" is a literal pipe and not an "or", so the set is written [A-Za-z] rather than [A-Z|a-z].
# This part matches the top-level domain of the email address, like 'com', 'org', 'net', etc.
# "\b" is another word boundary, ensuring the match does not end in the middle of a word.
email_regex = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

# Use the findall method to find all matches of the regex in the text
emails = re.findall(email_regex, text)

# Print the list of found email addresses
print(emails)
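
To see how the pattern behaves on text with several candidates, here is a short sketch. The sample addresses are invented, and the pattern is written with the top-level-domain set as `[A-Za-z]`, since a `|` inside a character set is a literal pipe and not alternation.

```python
import re

# Assumed variant of the pattern, with the TLD set written as [A-Za-z].
EMAIL_REGEX = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"

# Invented sample text: two well-formed addresses and one without a TLD.
text = (
    "Contact alice@example.com or bob.smith@mail.example.org; "
    "carol@localhost has no top-level domain."
)

emails = re.findall(EMAIL_REGEX, text)
print(emails)
```

The address without a dot after the `@` is skipped, because the pattern requires a dot followed by at least two letters at the end of the domain.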

### 6. URL extraction

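Extracting URLs from text data can be useful. A simple regex such as `https?://\S+` matches `http://` or `https://` followed by any run of non-whitespace characters; it is written without a capturing group so that `re.findall` returns the whole URL. The example in this section uses a more restrictive pattern that lists the characters allowed after the scheme.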

In [None]:
import re

# Define a string that contains a URL
text = "This is a sample text with a URL: https://www.example.com"

# The regular expression "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" matches http:// or https:// followed by a run of allowed URL characters. Here's how:
# "http[s]?://" matches the start of a URL (http:// or https://).
# "(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" is a non-capturing group (indicated by (?:...)) repeated 1 or more (+) times. Each repetition matches one of the following:
# - any uppercase or lowercase letter ([a-zA-Z]),
# - any digit ([0-9]),
# - any character in the set [$-_@.&+]. Note that "$-_" is a range (ASCII "$" through "_"),
#   so it also covers characters such as "/", ":", "?", "=" and "-", which is why paths and query strings match,
# - any of the characters "!", "*", "(", ")", "," or a backslash ([!*\\(\\),]; inside a set, "\\" is a literal backslash),
# - any two hexadecimal digits preceded by a % sign (URL encoding).
url_regex = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

# Use the findall method to find all matches of the regex in the text
urls = re.findall(url_regex, text)

# Print the list of found URLs
print(urls)
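
Two details are worth checking when extracting URLs with `re.findall`: a capturing group changes what is returned, and a greedy character run can pick up trailing sentence punctuation. The sketch below illustrates both with invented sample text; stripping the characters `.,;:!?` from the end is an assumption about what counts as trailing punctuation, not a general rule.

```python
import re

text = "Docs: https://www.example.com/guide and http://example.org."

# With a capturing group, re.findall returns only the group's text.
schemes_only = re.findall(r"(http|https)://[^\s]+", text)

# Without a capturing group, re.findall returns each full match.
raw_urls = re.findall(r"https?://\S+", text)

# \S+ also swallows sentence punctuation, so trim it from the end.
urls = [url.rstrip(".,;:!?") for url in raw_urls]

print(schemes_only)
print(raw_urls)
print(urls)
```

If the scheme needs to be grouped, a non-capturing group such as `(?:http|https)` keeps `re.findall` returning the full match.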

### 7. HTML tags removal 

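HTML tags can cause issues when working with text data. The regex `<[^>]*>` matches an opening angle bracket, any run of characters other than `>`, and a closing angle bracket, so replacing every match with an empty string strips the tags from simple HTML fragments.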

In [None]:
import re

# Define a string that contains HTML tags
text = "This is a <b>sample</b> text with <i>HTML</i> tags."

# The regular expression "<.*?>" matches a single HTML tag. Here's how:
# "<" matches the opening angle bracket of the HTML tag.
# ".*?" is a non-greedy match for any character except a newline (.) any number of times (*),
# as few times as possible to make the match (?). This matches the tag name and any attributes.
# ">" matches the closing angle bracket of the HTML tag.

# The regular expression "<[^>]*>" also matches a single HTML tag, but in a different way:
# "<" matches the opening angle bracket of the HTML tag.
# "[^>]*" matches any character that is not a closing angle bracket (>) any number of times (*).
# This matches the tag name and any attributes.
# ">" matches the closing angle bracket of the HTML tag.
# Unlike ".", the set "[^>]" also matches newlines, so this regex handles tags that span multiple lines.

# Uncomment the line below to use the "<[^>]*>" regex
# clean_text = re.sub(r"<[^>]*>", "", text)

# Use the "<.*?>" regex to remove all HTML tags from the text
clean_text = re.sub(r"<.*?>", "", text)

# Print the cleaned text
print(clean_text)
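
The two patterns give the same result on the single-line example, but they differ when a tag contains a line break. The sketch below uses an invented fragment to show this; the variable names are hypothetical.

```python
import re

# Invented fragment: the opening <a> tag is split across two lines.
html_text = 'Click <a\n href="https://example.com">here</a> now.'

# "." does not match a newline by default, so the split tag is left in place.
lazy_dot = re.sub(r"<.*?>", "", html_text)

# "[^>]" matches newlines too, so the split tag is removed.
negated_class = re.sub(r"<[^>]*>", "", html_text)

# With re.DOTALL, "." matches newlines and the lazy pattern works as well.
lazy_dotall = re.sub(r"<.*?>", "", html_text, flags=re.DOTALL)

print(repr(lazy_dot))
print(repr(negated_class))
print(repr(lazy_dotall))
```

Neither pattern understands HTML structure: a `>` inside an attribute value ends the match early in both cases, so these regexes are best kept for simple, well-formed fragments.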