# Regexes in programs

Regular expressions are very versatile. They can be used to automate many tedious text-related tasks, such as input text validation or data collection. In this topic, we will have a look at two examples of simple yet powerful programs that employ regular expressions.

## Email validation program

Let's have a look at a basic program that checks whether a text contains email addresses and, if it does, prints them in the order they appear:



In [2]:
import re

def find_emails(string):
    # Here we compile our simple pattern that will match email addresses
    pattern = re.compile(r'[\w\.-]+@[\w\.-]+')

    # Remember that re.findall() returns a list of all matched email strings
    emails = re.findall(pattern, string) 

    # To print the matched strings one by one
    for email in emails:
        print(email)


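The program above carries out a rather simple check: the @ character must be preceded and followed by one or more alphanumeric characters, underscores, dots, or hyphens. Keep in mind that \w is equivalent to [A-Za-z0-9_] only in ASCII mode (the re.ASCII flag); by default, Python 3 string patterns also treat letters and digits from other alphabets as word characters.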
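To see what \w covers in practice, here is a small sketch that compares the default behaviour with the re.ASCII flag. The sample text and the variable names are made up for illustration.

```python
import re

# 'café_1 naïve', written with escapes so the accented letters are unambiguous
text = 'caf\u00e9_1 na\u00efve'

# Default: \w is Unicode-aware for str patterns in Python 3
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, \w is restricted to [A-Za-z0-9_]
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

print(unicode_words)  # ['café_1', 'naïve']
print(ascii_words)    # ['caf', '_1', 'na', 've']
```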



In [3]:
# Let's test our program:

# Suppose we have a text with various email addresses
string = '''cat billy123@something.com, dog 456 
          alice_2000@website.com johnnY.b@blahblahblah.com'''
find_emails(string)
# billy123@something.com
# alice_2000@website.com
# johnnY.b@blahblahblah.com


billy123@something.com
alice_2000@website.com
johnnY.b@blahblahblah.com


The downside is that our program will also match strings like _@._. They obviously cannot be considered email addresses.
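A quick way to confirm this weakness is to run the simple pattern against a few strings that are clearly not email addresses. The list of junk strings below is an illustrative assumption.

```python
import re

simple_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')

# Strings that no reasonable person would call an email address
junk = ['_@._', '.@.', '-@-']

# fullmatch() succeeds only if the whole string fits the pattern
false_positives = [s for s in junk if simple_pattern.fullmatch(s)]

print(false_positives)  # ['_@._', '.@.', '-@-']
```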

## Email validation 2.0
Because the first pattern puts no lower limit on the length of the username or the domain name, junk strings such as _@._ slip through. We can add some restrictions when compiling our pattern to avoid this:



In [4]:
# Let's say we want the username to be at least 5 characters long
# and the top-level domain (the part after the dot) to be 2 to 4 characters long
pattern = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')


Let's break it down piece by piece:

- [\w\.-]{5,} matches a run of at least 5 characters, each of which is an alphanumeric character, an underscore, a dot, or a dash;
- @ matches the @ sign;
- [\w-]+\. matches one or more alphanumeric characters, underscores, or dashes, followed by a dot;
- \w{2,4} matches a run of 2 to 4 alphanumeric characters or underscores.
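The breakdown above can be checked piece by piece. The sketch below compiles each part separately and tests it with re.fullmatch(); the names username, domain, and tld are hypothetical labels chosen for this illustration.

```python
import re

# Each piece of the pattern, compiled on its own
username = re.compile(r'[\w\.-]{5,}')
domain = re.compile(r'[\w-]+\.')
tld = re.compile(r'\w{2,4}')

checks = {
    'username: billy123': bool(username.fullmatch('billy123')),  # 8 characters
    'username: one': bool(username.fullmatch('one')),            # fewer than 5
    'domain: something.': bool(domain.fullmatch('something.')),
    'tld: com': bool(tld.fullmatch('com')),
    'tld: _': bool(tld.fullmatch('_')),                          # only 1 character
}

for label, matched in checks.items():
    print(label, matched)
```

Only the 'username: one' and 'tld: _' checks come out False, which is why one@one.one and mary_liu@abc._ are rejected by the full pattern.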


In [5]:
# Here's our final program:

def find_emails(string):
    # Here we compile the stricter pattern that will match email addresses
    pattern = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')

    # Remember that re.findall() returns a list of all matched email strings
    emails = re.findall(pattern, string) 

    # To print the matched strings one by one
    for email in emails:
        print(email)

In [6]:
string = '''_@._ mary_liu@abc._ billy123@something.com, dog 456 
            alice_2000@website.com johnnY.b@blahblahblah.com one@one.one'''
find_emails(string)
# billy123@something.com
# alice_2000@website.com
# johnnY.b@blahblahblah.com

billy123@something.com
alice_2000@website.com
johnnY.b@blahblahblah.com
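One caveat worth knowing: the pattern does not say what must come after the last 2 to 4 characters, so a longer ending is silently cut off rather than rejected. The sketch below shows this and one possible remedy, a trailing \b word boundary. The sample address and the name bounded are assumptions made for illustration, not part of the program above.

```python
import re

pattern = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}')

# Same pattern with a word boundary at the end (one possible fix)
bounded = re.compile(r'[\w\.-]{5,}@[\w-]+\.\w{2,4}\b')

text = 'write to admin@example.museum or billy123@something.com'

truncated = pattern.findall(text)
bounded_matches = bounded.findall(text)

print(truncated)        # ['admin@example.muse', 'billy123@something.com']
print(bounded_matches)  # ['billy123@something.com']
```

Whether an address with a longer ending should be rejected or accepted depends on the task; widening the {2,4} range is the alternative if such addresses are valid for you.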


## Tokenization
As you may already know, preprocessing plays a crucial role when working with textual data. Tokenization, that is, splitting a text into smaller units (usually words), is the first step in text processing. If you don't want to use dedicated tools, regular expressions can make tokenization easier.

The most straightforward approach to tokenization is to split a text by whitespaces. Let's see how it works:



In [None]:
import re

def tokenize(string):
    # Use a raw string so that \s is passed to the regex engine unchanged
    tokens = re.split(r'\s+', string)
    return tokens

string = "This is a sample string. (And here's another one!!)"
tokenize(string)
# ['This', 'is', 'a', 'sample', 'string.', '(And', "here's", 'another', 'one!!)']
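One detail to keep in mind with this approach: if the text starts or ends with whitespace, re.split() produces empty strings at the edges. A minimal sketch, with a made-up sample string, and two common ways around it:

```python
import re

padded = '  leading and trailing spaces  '

# Splitting as-is leaves empty strings at both ends
raw_tokens = re.split(r'\s+', padded)

# Option 1: strip the text before splitting
stripped_tokens = re.split(r'\s+', padded.strip())

# Option 2: str.split() without arguments handles this on its own
plain_tokens = padded.split()

print(raw_tokens)       # ['', 'leading', 'and', 'trailing', 'spaces', '']
print(stripped_tokens)  # ['leading', 'and', 'trailing', 'spaces']
print(plain_tokens)     # ['leading', 'and', 'trailing', 'spaces']
```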


Look closely at the result and you will spot a problem: punctuation marks are still attached to some of the tokens. Let's get rid of them before we split our sentence:



In [7]:
import re

def tokenize(string):
    # Let's create a pattern that contains punctuation marks
    punctuation = re.compile(r'[\.,\?!\*:;()]')

    # Substitute the punctuation marks with empty strings
    no_punct = re.sub(punctuation, '', string)
    print(no_punct)
    # This is a sample string And here's another one

    # Split the cleaned text by whitespace
    tokens = re.split(r'\s+', no_punct)
    return tokens

# The same sample string as before
string = "This is a sample string. (And here's another one!!)"
tokenize(string)
# ['This', 'is', 'a', 'sample', 'string', 'And', "here's", 'another', 'one']


This is a sample string And here's another one


['This', 'is', 'a', 'sample', 'string', 'And', "here's", 'another', 'one']

Note that we deliberately left the apostrophe ' out of the punctuation pattern. This is quite important, as we do not want to mangle words like Let's, here's, or Mary's: with the apostrophe removed, here's would become heres, and the meaning of the word could change.

As you can see, tokenization can be a bit tricky, but regex can help you with it. Of course, there are a lot of ways to tokenize a text depending on the text type you are dealing with. We have presented you with one of the simplest ways to do it.
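As one example of an alternative, the sketch below collects the tokens directly with re.findall() instead of removing punctuation and then splitting. The pattern and the function name tokenize_words are assumptions for illustration; the pattern keeps an apostrophe only when it sits between word characters.

```python
import re

# A word, optionally followed by apostrophe parts such as 's or 't
word_pattern = re.compile(r"\w+(?:'\w+)*")

def tokenize_words(text):
    # findall() returns every non-overlapping match, in order of appearance
    return word_pattern.findall(text)

sample = "This is a sample string. (And here's another one!!)"
tokens = tokenize_words(sample)
print(tokens)
# ['This', 'is', 'a', 'sample', 'string', 'And', "here's", 'another', 'one']
```

With this approach there is no need to list every punctuation mark: anything that is not a word character or an in-word apostrophe is simply skipped.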



## Practice question

Which of the following patterns can be used to check whether a given vehicle license plate number is valid? Let's assume that a standard plate has the following format: ABC-1234.

- re.compile(r'[A-Z]{3}-\d{4}')
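A candidate pattern like the one above can be tried out with a short helper. This is a minimal sketch: the function name is_valid_plate and the test plates are hypothetical, and re.fullmatch() is used so that the whole string, not just a part of it, has to fit the format.

```python
import re

plate_pattern = re.compile(r'[A-Z]{3}-\d{4}')

def is_valid_plate(plate):
    # fullmatch() rejects strings with extra characters before or after
    return plate_pattern.fullmatch(plate) is not None

results = {plate: is_valid_plate(plate)
           for plate in ['ABC-1234', 'abc-1234', 'AB-1234', 'ABC-12345']}
print(results)
# {'ABC-1234': True, 'abc-1234': False, 'AB-1234': False, 'ABC-12345': False}
```

With re.search() or re.findall() instead of re.fullmatch(), a string such as ABC-12345 would still produce a match on its first eight characters, so the choice of function matters for validation.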