# Notebook on Regular Expressions

A regular expression is a sequence of characters that specifies a match pattern in text. It is used in programming to match, search for or replace a sequence of characters in a string.

In [1]:
import re

## Regular Expression Functions

Two commonly used functions in Python's re module are:
- re.match(): Checks whether the beginning of a string matches a given pattern.
- re.sub(): Replaces every part of a string that matches a pattern with another string.

These functions are used throughout this notebook. More can be found at https://docs.python.org/3/library/re.html#functions
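
As a minimal sketch of how these two calls behave (the sample strings and variable names below are illustrative and not part of the notebook): re.match() returns a match object on success and None otherwise, while re.sub() returns a new string.

```python
import re

# re.match() only looks at the start of the string.
hit = re.match("cat", "catalog")    # match object, truthy
miss = re.match("cat", "bobcat")    # None, because "cat" is not at the start

# re.sub() returns a new string with every match replaced.
replaced = re.sub("cat", "dog", "cat and bobcat")

print(hit is not None, miss, replaced)
```

Because a match object is truthy and None is falsy, the result of re.match() can be used directly in an if statement or wrapped in bool().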

## Character Sets and Ranges

A character set is a set of characters written inside square brackets (e.g., **"[abc]"**), and it matches any one character from that set. A range such as **"[a-z]"** is shorthand for every character from a to z. Because re.match() starts at the beginning of the string, a pattern consisting of a single character set checks whether the first character of the string belongs to the set. In the following cell, we match the string "n1234" against the character set **"[a-z]"**.

**Please note that in the following example, the pattern is only compared with the first character of the string. Hence, a string that begins with a lowercase letter (a to z) but is followed by digits still produces a match.**

In [43]:
if re.match("[a-z]", "n1234"):
    print(True)
else:
    print(False)

True


There is a difference between putting characters in a character set (e.g., **"[abc]"**) and writing them as a plain sequence inside the quotes (e.g., **"abc"**). A character set tells regex to match the first character of our string against any one character from the set (a or b or c). A plain sequence (e.g., **"abc"**) tells regex to check whether the string begins with exactly **"abc"**.

Look at the following examples to get a better understanding:

In [52]:
bool(re.match("[a-z]", "cure"))

True

In [110]:
bool(re.match("ure", "cure"))

False
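
The contrast can be sketched side by side. This is an illustrative example with made-up variable names; re.search() is a real function in the re module that scans the whole string, but it is not otherwise used in this notebook.

```python
import re

# A character set matches ONE character drawn from the set.
set_hit = bool(re.match("[abc]", "cure"))      # "c" is in the set

# A literal sequence must appear in full, in order, at the start.
literal_miss = bool(re.match("abc", "cure"))   # "cure" does not start with "abc"
literal_hit = bool(re.match("abc", "abcde"))   # "abcde" does

# re.match() is anchored at the start; re.search() looks anywhere in the string.
search_hit = bool(re.search("ure", "cure"))

print(set_hit, literal_miss, literal_hit, search_hit)
```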

## Repeating 

Instead of matching only the first character of the string, we can match several characters at the beginning of the string by using repetition. In the following cell, a digit from 0-9 is repeated 10 times. The code prints True if the string begins with 10 digits.

**Please note that if the 10 digits are followed by something else (e.g., letters or more digits), it will still return True, because regex finds 10 digits at the beginning of the string.**

In [42]:
if re.match("[0-9]{10}", "4783011535abc"):
    print(True)
else:
    print(False)

True


The above expression is the same as writing the character set 10 times. It checks the first 10 characters (not just the first) of the string to see if they are all digits from 0 to 9.

In [54]:
bool(re.match("[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]", "4783011535abc"))

True

We can also specify a range for the repetition. For example, in the following cell, we check whether the string begins with 5 to 8 letters (A to Z or a to z). The string may be longer than that: "Laptop12345" is 11 characters long, but it begins with 6 letters, so it still returns True.

In [55]:
if re.match("[A-Za-z]{5,8}", "Laptop12345"):
    print(True)
else:
    print(False)

True
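
To see which part of the string a repetition actually consumed, one option is to call .group() on the match object. The sketch below is illustrative (the variable names and the "Lap12345" sample are assumptions, not from the notebook).

```python
import re

m = re.match("[A-Za-z]{5,8}", "Laptop12345")
matched_text = m.group()    # the part of the string the pattern consumed

# Only 3 leading letters, fewer than the minimum of 5, so there is no match.
too_short = re.match("[A-Za-z]{5,8}", "Lap12345")

# 11 leading letters: the match stops after the maximum of 8.
long_word = re.match("[A-Za-z]{5,8}", "Scandinavia").group()

print(matched_text, too_short, long_word)
```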


We can also specify a lower limit of repetitions and leave the upper limit unbounded, as in the following example.

In [50]:
if re.match("[A-Za-z]{5,}", "Scandinavia"):
    print(True)
else:
    print(False)

True


To require 1 or more letters, we can use the expression **"[A-Za-z]{1,}"** or its simpler equivalent **"[A-Za-z]+"**.

In [61]:
bool(re.match("[A-Za-z]+", "a1111"))

True

To allow 0 or more letters, we can use the expression **"[A-Za-z]{0,}"** or its simpler equivalent **"[A-Za-z]*"**.

In [82]:
bool(re.match("[A-Za-z]*", ""))

True
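
The practical difference between "+" and "*" shows up when the string contains no letters at the start. A small sketch, with illustrative variable names:

```python
import re

# "+" needs at least one letter, so a string of digits does not match.
plus_on_digits = re.match("[A-Za-z]+", "1111")

# "*" accepts zero letters, so it matches the empty string at position 0.
star_on_digits = re.match("[A-Za-z]*", "1111")
star_text = star_on_digits.group()

print(plus_on_digits, repr(star_text))
```

Since "*" can match nothing at all, a pattern made only of starred items matches every string when used with re.match().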

## Escape Characters

Escape sequences (also called special sequences) are patterns that start with a backslash (e.g., \w, \s) and represent a specific class of characters in regex.

- \w represents a word character: A-Z, a-z, 0-9, _
- \s represents a whitespace character
- \d represents a digit
- and more at https://pynative.com/python-regex-special-sequences-and-character-classes/

The characters w, s and d are ordinary letters, but the preceding backslash indicates that they form special sequences and should not be interpreted literally, i.e., as letters.

In the following example, we are trying to match a string to a pattern containing **ONE** word character, followed by **ONE** space, followed by **ONE** digit. Notice that only one character of each type is accepted. Replacing the first character of the string with two letters won't produce a match because of how the pattern is defined.

In [80]:
if re.match(r"\w\s\d", "n 7"):
    print(True)
else:
    print(False)

True
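
The sketch below checks the claim about two letters and shows how combining the sequences with "+" relaxes the pattern. The sample strings are illustrative. It uses raw strings (the r prefix), which the Python documentation recommends for patterns so that backslashes reach the regex engine unchanged.

```python
import re

one_each = bool(re.match(r"\w\s\d", "n 7"))       # word char, space, digit

# The second character "o" is not whitespace, so the pattern fails.
two_letters = bool(re.match(r"\w\s\d", "no 7"))

# Adding "+" allows one or more word characters and one or more digits.
flexible = bool(re.match(r"\w+\s\d+", "no 72"))

print(one_each, two_letters, flexible)
```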


## Special Characters 

Some special characters control how many times the preceding character may repeat, while others match any character or anchor the pattern to a position in the string. Take the following special characters for example:
- **?** indicates that the character preceding it can appear 0 or 1 time in a pattern, making it optional.
- **.** indicates any character whatsoever (except a newline, by default).
- **\*** indicates that the character preceding it can appear 0 or more times in a pattern.
- **+** indicates that the character preceding it can appear 1 or more times in a pattern.
- **^** indicates that we are asking it to match at the start of a string.
- **$** indicates that we are asking it to match at the end of a string.

If we want to match a literal "+" or a literal "^", we need to place a "\\" before them. The backslash makes a special character lose its special meaning.
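
A short sketch of escaping in practice, using the illustrative string "1+1". re.escape() is a real helper in the re module that adds the backslashes automatically; it is not otherwise used in this notebook.

```python
import re

# Unescaped, "1+" means "one or more 1s", so the literal "+" in the text breaks the match.
unescaped = bool(re.match(r"1+1", "1+1"))

# Escaped, "\+" matches a literal plus sign.
escaped = bool(re.match(r"1\+1", "1+1"))

# re.escape() builds the escaped pattern for us.
auto_pattern = re.escape("1+1")
auto = bool(re.match(auto_pattern, "1+1"))

print(unescaped, escaped, auto)
```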

In the following example, the pattern **[a-z]*\d+** means that letters from a to z may appear 0 or more times, followed by one or more digits.

In [83]:
if re.match(r"[a-z]*\d+", "11111111111111"):
    print(True)
else:
    print(False)

True


The following regular expression replaces any number (including numbers with a decimal point) with the string "unknown". The digits after the decimal point are placed in an optional group, so whole numbers are matched as well.

In [85]:
re.sub(r"[0-9]+(\.[0-9]+)?", "unknown", "His grade in the exam was 11.22")

'His grade in the exam was unknown'

## Starting and Ending Patterns 

We have seen how to use repeating patterns in regex above. However, there is a problem with them. If we use the pattern [0-9]{5} to match a 5-digit number, we will get a match whenever the string being tested begins with 5 digits. We get a match even if the number is longer than 5 digits or the 5 digits are followed by letters.

If we want to match a number of 5 digits and 5 digits only, we have to anchor the pattern to both the beginning and the end of the string. The string then fails to match if it starts with 5 digits and goes on, or if it starts with some other characters and ends with 5 digits.

Replacing the string in the following cell with "abc12345" or "12345abc" or "1234567" will return False. The characters **^** and **$** denote the start and the end of the string respectively.

In [112]:
if re.match(r"^[0-9]{5}$", "24549"):
    print(True)
else:
    print(False)

True
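
The claim about the alternative strings can be checked directly. The sketch below assumes every digit from 0 to 9 should be accepted, and it also shows re.fullmatch(), a real function in the re module (not otherwise used in this notebook) that requires the whole string to match without explicit anchors. The list and variable names are illustrative.

```python
import re

samples = ["24549", "abc12345", "12345abc", "1234567"]

# Explicit anchors with re.match().
anchored = [bool(re.match(r"^[0-9]{5}$", s)) for s in samples]

# re.fullmatch() succeeds only if the entire string matches the pattern.
full = [bool(re.fullmatch(r"[0-9]{5}", s)) for s in samples]

print(anchored)
print(full)
```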


## Alternate Characters 

Putting something inside parentheses allows a regular expression to evaluate it as a unit. Whatever is put inside the parentheses is called a group.

The two following cells are supposed to replace a link to a website with a literal string "link". This can be accomplished in two different ways as shown in the following two cells. 

**Note that putting something in parentheses makes regex treat it as a separate entity, meaning that any special character following the entity applies to everything inside the parentheses collectively.**

In [70]:
website = "https://www.youtube.com/watch?v=VRBpeqNamrI"
re.sub(r"(https:)\S*", "link", website)

'link'

In [132]:
website = "https://www.youtube.com/watch?v=VRBpeqNamrI"
re.sub(r"(https://)?(www\.)?\w+\.(com).*", "link", website)

'link'
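
To illustrate how a quantifier applies to a whole group, here is a minimal sketch. The "ababab" sample, the extra group around the site name, and the variable names are assumptions made for the example.

```python
import re

# Without a group, "+" applies only to the "b".
no_group = re.match(r"ab+", "ababab").group()

# With a group, "+" repeats the whole "ab" unit.
with_group = re.match(r"(ab)+", "ababab").group()

# Groups also capture what they matched; group(3) is the third pair of parentheses.
website = "https://www.youtube.com/watch?v=VRBpeqNamrI"
m = re.match(r"(https://)?(www\.)?(\w+)\.com", website)
site = m.group(3)

print(no_group, with_group, site)
```

Here the "?" after (https://) and (www\.) makes each whole prefix optional, rather than only its last character.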