<a href="https://colab.research.google.com/github/tando96/python101/blob/main/1_5_%5BLecture%5D_Regular_Expression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# REGULAR EXPRESSION

Regular expressions, often shortened to regex, are sequences of characters that define a search pattern. They are used to ***check*** whether a **pattern exists in a given text (string)** or to ***extract*** the parts that match.

They are used on the server side to validate, for example, the format of email addresses or passwords during registration, and to parse text files in order to find, replace, or delete certain strings.

[Example](https://raw.githubusercontent.com/anhquan0412/dataset/main/sample_text.txt)

Very common use cases of regular expressions:
- Password validation
- Email validation
- Valid date format
- Empty string validation
- Phone number/Credit card number validation
- ...


![](https://imgs.xkcd.com/comics/regular_expressions.png)



In short, regular expressions help in manipulating textual data, which is often a prerequisite for **data science projects involving text mining**.

## 📖 PYTHON ``re`` MODULE

The **re module** in Python's standard library provides several functions for working with regular expressions. This tutorial looks closely at some of them.

### re.search()

**`re.search(pattern, string)`**
- Scan through the string looking for the ***first*** location where the pattern matches
- Return the corresponding **match object**
- Return `None` if no position in the string matches the pattern


In [None]:
import re

In [None]:
pattern = r"cookie"
string = "In this cookie store we sell cookie"

match_obj = re.search(pattern, string)
match_obj

<re.Match object; span=(8, 14), match='cookie'>

**.group()** from the match object returns the **matched part**

In [None]:
match_obj.group()

'cookie'

**.span()** returns a tuple containing the start and end positions of the match.

In [None]:
match_obj.span()

(8, 14)
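Because `re.search()` returns `None` when nothing matches, calling `.group()` on the result directly would raise an `AttributeError` in that case. A minimal sketch of guarding against this follows; `find_first` is a hypothetical helper name used only for illustration, not part of the `re` module.

```python
import re


def find_first(pattern, text):
    """Return the first matched text, or None when the pattern is absent."""
    match_obj = re.search(pattern, text)
    if match_obj is None:
        # No match: there is no match object to call .group() on.
        return None
    return match_obj.group()


print(find_first(r"cookie", "In this cookie store we sell cookie"))  # cookie
print(find_first(r"donut", "In this cookie store we sell cookie"))   # None
```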

### **`re.findall()`**

You can use `findall` to return multiple matches.


**`re.findall(pattern, string)`**
- Return all **non-overlapping matches** of pattern in string, as a ***list*** of strings.
- The string is scanned **left-to-right**, and matches are returned in the order found.
- If the pattern contains one group, return a list of that group's matches; if it contains more than one group, return a list of tuples.
- Empty matches are included in the result.


In [None]:
pattern = r"pop"
string = "In this popop store we sell pop"  # "popop" holds two overlapping "pop"s; only the first one is returned

match_list = re.findall(pattern, string)  # findall returns a list, not a match object
match_list

['pop', 'pop']
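`re.findall()` returns only the matched strings. If the positions are also needed, one option is `re.finditer()`, which yields a match object for each non-overlapping match. The sketch below reuses the string from the cell above.

```python
import re

pattern = r"pop"
string = "In this popop store we sell pop"

# Each item produced by finditer is a match object, so .span() is available.
spans = [match_obj.span() for match_obj in re.finditer(pattern, string)]
print(spans)  # [(8, 11), (28, 31)]
```

The second "pop" inside "popop" (positions 10 to 13) is skipped because it overlaps the first match.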

### Group with ()

You can use parentheses `()` to extract a sub-match (group) of a whole match.

To extract this sub-match, call **.group(index)** or **.groups()** on the **match object**.

In [None]:
pattern = r"(091)(2345678)"
string = "Always Be Learning at 0912345678"

match_obj = re.search(pattern, string)
match_obj.group()

'0912345678'

In [None]:
match_obj.group(1)  # group indices start at 1; group(0) is the whole match

'091'

In [None]:
match_obj.groups()[0]

'091'
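Groups also change what `re.findall()` returns, as described in its section above. The following sketch reuses the same phone string to compare a pattern with no group, one group, and two groups.

```python
import re

string = "Always Be Learning at 0912345678"

# No group: each item is the whole match.
no_group = re.findall(r"0912345678", string)
# One group: each item is the text captured by that group.
one_group = re.findall(r"(091)2345678", string)
# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(091)(2345678)", string)

print(no_group)    # ['0912345678']
print(one_group)   # ['091']
print(two_groups)  # [('091', '2345678')]
```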

### re.sub() to replace

**`re.sub(pattern, repl, string)`**

Return the string obtained by replacing the **leftmost** non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.

In [None]:
pattern = r"Learning"
string = "Always Be Learning at CoderSchool"

result = re.sub(pattern, r"********", string)
result

'Always Be ******** at CoderSchool'

📖 You can also reuse a sub-match in the replacement string with a backreference: **\position**
- \1: insert the first group
- \2: insert the second group
- ...


In [None]:
pattern = r"(Coder)(School)"
string = "Always Be Learning at CoderSchool"

result = re.sub(pattern, r"\2\1", string)
result

'Always Be Learning at SchoolCoder'

In [None]:
pattern = r"(Coder)(School)"
string = "Always Be Learning at CoderSchool"

result = re.sub(pattern, r"\1X", string)
result

'Always Be Learning at CoderX'
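Two details of `re.sub()` are worth a quick sketch. First, when a digit must follow a group reference, `\1` becomes ambiguous (`\10` is read as group 10), so the `\g<1>` form can be used instead. Second, the optional `count` argument limits how many occurrences are replaced. The strings below are illustrative.

```python
import re

string = "Always Be Learning at CoderSchool"

# \g<1> refers to group 1 explicitly, so the digits after it stay literal.
result = re.sub(r"(Coder)(School)", r"\g<1>101", string)
print(result)  # Always Be Learning at Coder101

# count=1 replaces only the first (leftmost) occurrence.
first_only = re.sub(r"cookie", "donut", "cookie cookie cookie", count=1)
print(first_only)  # donut cookie cookie
```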

### **`r`** - raw string literal

🌟 What is the ``r`` at the start of the pattern?
This is called a **raw string literal**.

To have a backslash ***interpreted as it is***, rather than as the start of an [escape character](https://www.w3schools.com/python/gloss_python_escape_characters.asp), you should use the ``r`` prefix. This matters for regex because many patterns rely on backslashes.

In [None]:
pattern = "A word \t Another word \n A new line"
print(pattern)

A word 	 Another word 
 A new line


In [None]:
pattern = r"A word \t Another word \n A new line"
print(pattern)

A word \t Another word \n A new line
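The difference matters for regex patterns. As a small sketch: in a normal string `"\b"` is the backspace character, while in a raw string `r"\b"` reaches the regex engine as a word boundary (listed in the table in the next section). The sample text is made up for illustration.

```python
import re

text = "a cat sat"

# Normal string: "\b" is a backspace character, which never occurs in the text.
no_raw = re.search("\bcat\b", text)
# Raw string: the regex engine receives \b and treats it as a word boundary.
with_raw = re.search(r"\bcat\b", text)

print(no_raw)               # None
print(with_raw.group())     # cat
print(len("\n"), len(r"\n"))  # 1 2
```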


## 📝 WILDCARD CHARACTERS

The following table lists some special characters that are commonly useful:

|Character classes||Quantifiers & Alternation||
|--- |--- |--- |--- |
|.|any character except newline|a* a+ a?|0 or more a / 1 or more a / 0 or 1 a|
|\w \d \s|word / digit / whitespace|a{5} a{2,}|exactly five, two or more|
|\W \D \S|not word / not digit / not whitespace|a{1,3}|between one & three|
|[abc]|any of a, b, or c|a+? a{2,}?|match as few as possible (non-greedy)|
|[^abc]|not a, b, or c|(cat\|dog)|match 'cat' or 'dog'|
|[a-g]|character between a & g|||
|**Anchors**||**Escaped characters**||
|^abc$|start / end of the string|\. \* \\|\ is used to escape special chars. \* matches *|
|\b|word boundary|\t \n \r|tab, linefeed, carriage return|


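**Note**: \w (word character) matches any **single letter**, **number** or **underscore**. For ASCII text this is the same as [a-zA-Z0-9_]; in Python 3, \w also matches letters and digits from other alphabets unless the `re.ASCII` flag is used.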
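To make the character classes and anchors from the table concrete, here is a short sketch on a made-up sample string.

```python
import re

text = "Room 42, floor_3!"

digits = re.findall(r"\d", text)             # every single digit
words = re.findall(r"\w+", text)             # runs of word characters
punctuation = re.findall(r"[^\w\s]", text)   # neither word character nor whitespace

print(digits)       # ['4', '2', '3']
print(words)        # ['Room', '42', 'floor_3']
print(punctuation)  # [',', '!']

# Anchors: ^ ties the pattern to the start of the string, $ to the end.
starts_with_room = re.search(r"^Room", text) is not None
starts_with_floor = re.search(r"^floor", text) is not None
ends_with_bang = re.search(r"!$", text) is not None
print(starts_with_room, starts_with_floor, ends_with_bang)  # True False True
```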


| Character | Description | Example |
|------------|-----------|------------|
| ? | Match zero or one repetitions of preceding |  "ab?" matches "a" or "ab" |
| * | Match zero or more repetitions of preceding | "ab*" matches "a", "ab", "abb", "abbb"... |
| + | Match one or more repetitions of preceding |  "ab+" matches "ab", "abb", "abbb"... but not "a" |
| {n} | Match n repetitions of preceding | "ab{2}" matches "abb" |
| {m,n} | Match between m and n repetitions of preceding |  "ab{2,3}" matches "abb" or "abbb" |
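The quantifier examples in the table can be checked with `re.fullmatch()`, which succeeds only when the whole string matches the pattern. In the sketch below, `full_matches` is a hypothetical helper written for this illustration. The last two lines show the greedy and non-greedy behaviour from the first table.

```python
import re

candidates = ["a", "ab", "abb", "abbb"]


def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


print(full_matches(r"ab?"))      # ['a', 'ab']
print(full_matches(r"ab*"))      # ['a', 'ab', 'abb', 'abbb']
print(full_matches(r"ab+"))      # ['ab', 'abb', 'abbb']
print(full_matches(r"ab{2}"))    # ['abb']
print(full_matches(r"ab{2,3}"))  # ['abb', 'abbb']

# Greedy (+) takes as many b's as possible; non-greedy (+?) takes as few as possible.
print(re.search(r"ab+", "abbb").group())   # abbb
print(re.search(r"ab+?", "abbb").group())  # ab
```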



## 🏃🏻‍♂️ EXAMPLE: PHONE NUMBER VALIDATION (US)


### Read a text file in Python

Let's first download a file to Colab

In [1]:
!wget -q -c https://raw.githubusercontent.com/anhquan0412/dataset/main/sample_text.txt

☘️ How can you read a text file line by line?

In [2]:
with open("sample_text.txt") as f:
    sequences = []
    for line in f:
        sequences.append(line.strip())

sequences

['This is my phone number 2816837760.',
 'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.',
 'This is another phone format: 281-683-7760.']

**Optional**: Using file readlines()

In [None]:
with open("sample_text.txt") as f:
    sequences = f.readlines()

print(sequences)

['This is my phone number 2816837760.\n', 'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.\n', 'This is another phone format: 281-683-7760.\n']


In [None]:
# readlines() keeps the trailing newline character, so strip it from each line
for i in range(len(sequences)):
    sequences[i] = sequences[i].strip()

print(sequences)

['This is my phone number 2816837760.', 'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.', 'This is another phone format: 281-683-7760.']


### Design a regex pattern

❓Find whether **there is a phone number** in the 1st sentence

In [None]:
sequences[0]

'This is my phone number 2816837760.'

In [None]:
pattern = r"\d\d\d\d\d\d\d\d\d\d"

match_obj = re.search(pattern, sequences[0])
match_obj.group()

'2816837760'

The pattern above works but is repetitive. We can write a more concise pattern with a quantifier.

In [None]:
pattern = r"\d{10}"

match_obj = re.search(pattern, sequences[0])
match_obj.group()

'2816837760'

❓ Extract **all** phone numbers from the 2nd sentence

In [None]:
sequences[1]

'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.'

In [None]:
match_list = re.findall(pattern, sequences[1])
match_list

['2816837760', '2811234567']
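One caveat: `\d{10}` also matches the first ten digits of a longer run of digits. A possible refinement, sketched below on a made-up sentence, is to surround the pattern with `\b` word boundaries so that only standalone 10-digit numbers are returned.

```python
import re

# Illustrative text: a 12-digit order number followed by a 10-digit phone number.
text = "Order 123456789012 was placed from 2816837760."

loose = re.findall(r"\d{10}", text)
strict = re.findall(r"\b\d{10}\b", text)

print(loose)   # ['1234567890', '2816837760']
print(strict)  # ['2816837760']
```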

🤔 Extract the **last 4 digits** of the following phone number?

In [None]:
sequences[2]

'This is another phone format: 281-683-7760.'

In [None]:
pattern = r"(\d{3})-(\d{3})-(\d{4})"

match_obj = re.search(pattern, sequences[2])
match_obj.group()

'281-683-7760'

In [None]:
match_obj.group(3)

'7760'

In [None]:
match_obj.groups()[2]

'7760'
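For validation, as opposed to extraction, the whole input should be a phone number. A minimal sketch using `re.fullmatch()` with optional dashes (`-?`) covers both formats seen in the file. `is_us_phone` is a hypothetical helper name, and the pattern is an assumption that checks only these two layouts, not real-world number rules.

```python
import re

# Three digits, optional dash, three digits, optional dash, four digits.
PHONE_PATTERN = re.compile(r"\d{3}-?\d{3}-?\d{4}")


def is_us_phone(text):
    """Return True when the entire text is a 10-digit number, with or without dashes."""
    return PHONE_PATTERN.fullmatch(text) is not None


print(is_us_phone("2816837760"))       # True
print(is_us_phone("281-683-7760"))     # True
print(is_us_phone("281-683-776"))      # False
print(is_us_phone("call 2816837760"))  # False
```

Because each dash is optional on its own, a mixed form such as "281-6837760" is also accepted by this sketch.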

😎 Hide phone numbers with ****

In [None]:
sequences[1]

'I am at 123 Main street, NY 10010, phone number is 2816837760, but I have an alternative: 2811234567.'

In [None]:
pattern = r"\d{10}"

result = re.sub(pattern, r"**********", sequences[1])
result

'I am at 123 Main street, NY 10010, phone number is **********, but I have an alternative: **********.'

❓Hide the first 3 digits only

In [None]:
pattern = r"(\d{3})(\d{7})"

result = re.sub(pattern, r"***\2", sequences[1])
print(result)

I am at 123 Main street, NY 10010, phone number is ***6837760, but I have an alternative: ***1234567.
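`re.sub()` also accepts a function as the replacement: it is called with each match object and its return value is used as the replacement text. The sketch below uses this to hide everything except the last 4 digits. `mask_keep_last4` is a hypothetical helper, and the text is a shortened version of the second sentence.

```python
import re


def mask_keep_last4(match_obj):
    """Replace all but the last four characters of the match with asterisks."""
    digits = match_obj.group()
    return "*" * (len(digits) - 4) + digits[-4:]


text = "phone number is 2816837760, but I have an alternative: 2811234567."
result = re.sub(r"\d{10}", mask_keep_last4, text)
print(result)
# phone number is ******7760, but I have an alternative: ******4567.
```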


## 🏃🏻‍♂️🏃🏻‍♂️ PRACTICE TIME!

Regular expressions can be overwhelming at first, because you have to remember the special characters and know how to apply them to your problem. That's why there are many resources for writing and testing regular expression patterns. Whichever you choose, to use regular expressions well, you need to **practice**!

Some useful links:
- To practice regular expressions: [https://regexone.com/lesson/introduction_abcs](https://regexone.com/lesson/introduction_abcs)
- To check your regular expression pattern: [https://regex101.com/](https://regex101.com/)
- To read more about regex: [https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)


👇 Regex can look very complex, but it is actually not hard to understand 👌

![](https://imgur.com/l3T65Q1.png)