In [1]:
import re

The re module uses a backtracking regular expression engine.

Regular expressions describe text patterns and match them against strings.

Use case examples:

- Check if an email or phone number was written correctly.
- Split text by some mark (comma, dot, newline) which may be useful to parse data.
- Get content from HTML tags.
- Improve your Linux command-line skills.

However ...

>Some people, when confronted with a problem, think "I know, I'll use regular expressions". Now they have two problems - Jamie Zawinski, 1997

## **General** 

Our building blocks are:

- Literals
- Metacharacters
 - Backslash: \\
 - Caret: \^
 - Dollar Sign: \$
 - Dot: \.
 - Pipe Symbol: \|
 - Question Mark: \?
 - Asterisk: \*
 - Plus sign: \+
 - Opening parenthesis: \(
 - Closing parenthesis: \)
 - Opening square bracket: \[
 - Opening curly brace: \{
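To match one of the metacharacters listed above literally, it has to be escaped with a backslash. The short sketch below shows the difference on a made-up string (`price_text` is only an illustration), and uses `re.escape` to escape a whole string at once.

```python
import re

price_text = "Total: 3.50 (USD)"  # hypothetical sample text

# Unescaped, the dot matches any character, so "3x50" matches too.
dot_unescaped = re.search(r"3.50", "3x50")
# Escaped, the dot only matches a literal dot.
dot_escaped = re.search(r"3\.50", "3x50")

print(dot_unescaped is not None)  # True
print(dot_escaped is not None)    # False

# re.escape adds the backslashes for every metacharacter in the string.
escaped = re.escape("(USD)")
print(escaped)                                  # \(USD\)
print(re.search(escaped, price_text).group())   # (USD)
```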


### **Literals** 

In [2]:
"""
version 1: with compile
"""
def areYouHungry_v1(pattern, text):
    if match := pattern.search(text): print("HERE !!!\n")
    else: print("Sorry pal, you'll starve to death.\n")

helloWorldRegex = r"rodrigo"

pattern = re.compile(helloWorldRegex)

text1 = "Where can I find food here? - rodrigo"
text2 = "Where can I find food here? - Rodrigo"

areYouHungry_v1(pattern,text1)
areYouHungry_v1(pattern,text2)

HERE !!!

Sorry pal, you'll starve to death.



In [3]:
"""
version 2: without compile
"""
def areYouHungry_v2(regex, text):
    if match := re.search(regex, text): print("HERE !!!\n")
    else: print("Sorry pal, you'll starve to death.\n")

helloWorldRegex = r"rodrigo"

text1 = "Where can I find food here? - rodrigo"
text2 = "Where can I find food here? - Rodrigo"

areYouHungry_v2(helloWorldRegex, text1)
areYouHungry_v2(helloWorldRegex, text2)

HERE !!!

Sorry pal, you'll starve to death.



### **Character classes**

In [4]:
"""
version 3: classes
"""
def areYouHungry_v3(pattern, text):
    if match := pattern.search(text): print("Beer is also food !!\n")
    else: print("Sorry pal, you'll starve to death.\n")

helloWorldRegex = r"[rR]odrigo"

pattern = re.compile(helloWorldRegex)

text1 = "Where can I find food here? - rodrigo"
text2 = "Where can I find food here? - Rodrigo"

areYouHungry_v3(pattern,text1)
areYouHungry_v3(pattern,text2)

Beer is also food !!

Beer is also food !!



Usual Classes:

- [0-9]: Matches anything between 0 and 9.
- [a-z]: Matches anything between a and z.
- [A-Z]: Matches anything between A and Z.

Predefined Classes:
- **\.** : Matches any character except a newline.
- Lower Case classes:
    - \d : Same as [0-9].
    - \s : Same as [ \t\n\r\f\v]; the first character of the class is the space character.
    - \w : Same as [a-zA-Z0-9_]; the last character of the class is the underscore character.

- Upper Case classes (the negation):
    - \D : Matches any non-digit character, same as [^0-9].
    - \S : Matches any non-whitespace character, same as [^ \t\n\r\f\v].
    - \W : Matches any non-alphanumeric character, same as [^a-zA-Z0-9_].

For str patterns in Python 3 these classes are Unicode-aware by default, so the ASCII ranges shown above are exact only when the re.ASCII flag is set.
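A quick way to get a feel for the predefined classes is to run them with `findall` on a small string. The sample text below is made up for illustration. The last two lines show the effect of `re.ASCII` on `\d`, using a non-ASCII decimal digit.

```python
import re

sample = "Room 42_b\tfloor 7"  # hypothetical sample text

digits = re.findall(r"\d", sample)     # single decimal digits
words = re.findall(r"\w+", sample)     # runs of letters, digits and underscore
spaces = re.findall(r"\s", sample)     # whitespace characters
non_word = re.findall(r"\W", sample)   # everything that \w does not match

print(digits)    # ['4', '2', '7']
print(words)     # ['Room', '42_b', 'floor', '7']
print(spaces)    # [' ', '\t', ' ']
print(non_word)  # [' ', '\t', ' ']

# U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
arabic_three = "\u0663"
unicode_match = re.findall(r"\d", arabic_three)
ascii_match = re.findall(r"\d", arabic_three, re.ASCII)
print(len(unicode_match), len(ascii_match))  # 1 0
```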


Versions 1 and 2 above do the same thing.

The re module keeps a cache of recently compiled patterns, so you do not need to compile the regex every time you call the function (a technique called memoization).

Compiling the pattern yourself, as in version 1, just gives you finer control ...
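One possible reading of that extra control (an interpretation, not something stated in this notebook): the methods of a compiled pattern accept optional `pos` and `endpos` arguments, which the module-level `re.search` does not take. The names `name_pattern` and `text` below are only for illustration.

```python
import re

name_pattern = re.compile(r"[rR]odrigo")
text = "rodrigo asked; Rodrigo answered"  # hypothetical sample text

first = name_pattern.search(text)
# pos=1 starts the search at index 1, which skips the first occurrence.
second = name_pattern.search(text, 1)
# endpos=10 stops the search before the second occurrence begins.
limited = name_pattern.search(text, 1, 10)

print(first.span())    # (0, 7)
print(second.span())   # (15, 22)
print(limited)         # None

# The compiled object also remembers the source pattern.
print(name_pattern.pattern)  # [rR]odrigo
```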

```pattern``` is an **re.Pattern** object, which has a lot of methods. Let's find out which methods it has, using a regular expression!

In [5]:
patternText = "\n".join(dir(pattern))
patternText

'__class__\n__copy__\n__deepcopy__\n__delattr__\n__dir__\n__doc__\n__eq__\n__format__\n__ge__\n__getattribute__\n__gt__\n__hash__\n__init__\n__init_subclass__\n__le__\n__lt__\n__ne__\n__new__\n__reduce__\n__reduce_ex__\n__repr__\n__setattr__\n__sizeof__\n__str__\n__subclasshook__\nfindall\nfinditer\nflags\nfullmatch\ngroupindex\ngroups\nmatch\npattern\nscanner\nsearch\nsplit\nsub\nsubn'

In [6]:
# Keep only the lines that do not start with "__" (negative lookahead; re.M makes ^ and $ match at every line).
pattern_list_methods = re.findall(r"^(?!__).*$", patternText, re.M)
pattern_list_methods

['findall',
 'finditer',
 'flags',
 'fullmatch',
 'groupindex',
 'groups',
 'match',
 'pattern',
 'scanner',
 'search',
 'split',
 'sub',
 'subn']

## **Problem 1** - Phone Number

### **Search** 

In [30]:
def isThere_v1(regexObject, text):
    if regexObject: return f"Your number is: {regexObject.group()}!"
    else: return "Hey! I did not find it."

    
text = """ 9-96379889
           996379889
           96379889
           9-9637-9889

           42246889
           4224-6889
           
           99637 9889
           9 96379889
       """

# The first character is whitespace, not a digit, so \d? matches the empty string.
regex1 = re.search(r"\d?", text)

# After stripping the leading whitespace we find the digit! The ? operator means "optional".
regex2 = re.search(r"\d?", text.strip())

# Next, an optional hyphen may appear, followed by two decimal digits (\d\d).
regex3 = re.search(r"\d?-?\d\d", text.strip())

# However, we want more than two digits. The + operator matches one or more.
regex4 = re.search(r"\d?-?\d+", text.strip())

# $ anchors the match at the end of the string, so we now match at the end of the text.
regex5 = re.search(r"\d?-?\d+$", text.strip())

# A character class accepts either a hyphen or a whitespace character as the separator.
regex6 = re.search(r"\d?[-\s]?\d+$", text.strip())

regex_lst = [regex1, regex2, regex3, regex4, regex5, regex6]
for index, regex in enumerate(regex_lst):
    print(f"Regex Number {index+1}")
    print(isThere_v1(regex,text) + "\n")

Regex Number 1
Your number is: !

Regex Number 2
Your number is: 9!

Regex Number 3
Your number is: 9-96!

Regex Number 4
Your number is: 9-96379889!

Regex Number 5
Your number is: 96379889!

Regex Number 6
Your number is: 9 96379889!
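The operators used above can be tried one at a time on a single number. This is a small sketch on a made-up `line`. The last two calls show how `re.M` (already used earlier with `findall`) changes what `$` means.

```python
import re

line = "9-96379889"  # one number from the text above

optional_digit = re.search(r"\d?", line).group()   # zero or one digit
empty_match = re.search(r"\d?", " 9").group()      # zero digits still counts as a match
one_or_more = re.search(r"\d+", line).group()      # stops at the hyphen
at_end = re.search(r"\d+$", line).group()          # must touch the end of the string

print(repr(optional_digit))  # '9'
print(repr(empty_match))     # ''
print(repr(one_or_more))     # '9'
print(repr(at_end))          # '96379889'

# Without re.M, $ only matches at the very end; with re.M it matches at every line end.
last_only = re.findall(r"\d+$", "12\n34")
every_line = re.findall(r"\d+$", "12\n34", re.M)
print(last_only, every_line)  # ['34'] ['12', '34']
```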



### **Findall**  

In [53]:
def isThere_v2(regexObject, text):
    if regexObject: return f"Uow phone numbers:\n{regexObject} !"
    else: return "Hey! I did not find it."
    
text = """ 996349889
           96359889
           9-96349889
           
           9-9634-9889
           42256889
           4225-6889
           
           99634 9889
           9 96349889
       """
# findall returns every non-overlapping match, scanning the text from left to right.

regex7  = re.findall(r"\d?[-\s]?\d+", text)
"""
Why is [... ' 9', '-96349889' ...] split?

Step1: \d?    matches nothing (the current character is whitespace).
Step2: [-\s]? consumes the whitespace.
Step3: \d+    consumes 9 and stops at the - character.

Therefore ' 9' is recognized, and '-96349889' becomes the next match.
"""

regex8  = re.findall(r"\d?[-\s]?\d+[-\s]?\d+", text.strip())
"""
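Why is [... ' 9-9634', '-9889' ...] split?

Step1: \d?    matches nothing (the current character is whitespace).
Step2: [-\s]? consumes the whitespace.
Step3: \d+    consumes 9 and stops at the first - character.
Step4: [-\s]? consumes the first - character.
Step5: \d+    consumes 9634 and stops at the second - character.

The pattern allows only one separator between digit runs, so ' 9-9634' is recognized and '-9889' is left for the next match.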
"""

# Restrict the first part to exactly 4 digits.
regex9  = re.findall(r"\d?[-\s]?\d{4}[-\s]?\d+", text.strip())

# Restrict the second part to exactly 4 digits as well, which forces four digits after the separator.
regex10 = re.findall(r"\d?[-\s]?\d{4}[-\s]?\d{4}", text.strip())

regex_lst = [regex7, regex8, regex9, regex10]

for index, regex in enumerate(regex_lst):
    print(f"Regex Number {index+7}")
    print(isThere_v2(regex,text) + "\n")

Regex Number 7
Uow phone numbers:
[' 996349889', ' 96359889', ' 9', '-96349889', ' 9', '-9634', '-9889', ' 42256889', ' 4225', '-6889', ' 99634', ' 9889', ' 9', ' 96349889'] !

Regex Number 8
Uow phone numbers:
['996349889', ' 96359889', ' 9-96349889', ' 9-9634', '-9889', ' 42256889', ' 4225-6889', ' 99634 9889', ' 9 96349889'] !

Regex Number 9
Uow phone numbers:
['996349889', ' 96359889', '9-96349889', '9-9634-9889', ' 42256889', ' 4225-6889', ' 99634', '9 96349889'] !

Regex Number 10
Uow phone numbers:
['996349889', ' 96359889', '9-96349889', '9-9634-9889', ' 42256889', ' 4225-6889', '99634 9889', '9 96349889'] !
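The effect of replacing `\d+` with `\d{4}` is easier to see on tiny inputs. In this sketch (the strings are made up), the loose pattern accepts something that is clearly not a phone number, while the fixed-count version rejects it and still accepts a well-formed number.

```python
import re

loose_pattern = r"\d?[-\s]?\d+[-\s]?\d+"        # regex8 from above
strict_pattern = r"\d?[-\s]?\d{4}[-\s]?\d{4}"   # regex10 from above

too_short = "12 3"          # hypothetical input, not a phone number
well_formed = "4225-6889"

loose = re.findall(loose_pattern, too_short)
strict_short = re.findall(strict_pattern, too_short)
strict_ok = re.findall(strict_pattern, well_formed)

print(loose)         # ['12 3']
print(strict_short)  # []
print(strict_ok)     # ['4225-6889']
```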



In [57]:
text_dirty = r"""996379889
                 96379889
                 9-96379889
                 9-9637-9889
                 42246889
                 4224-6889
                 99637 9889
                 9 96379889
                 777 777 777
                 90 329921 0
                 9999999999 9
                 8588588899436
             """

#Regex 10
regex_dirty1 = re.findall(r"\d?[-\s]?\d{4}[-\s]?\d{4}", text_dirty.strip())

# Add a negative lookbehind and a negative lookahead so a match cannot be part of a longer run of digits.
regex_dirty2 = re.findall(r"(?<!\d)\d?[-\s]?\d{4}[-\s]?\d{4}(?!\d)", text_dirty.strip())

# A stricter solution: word boundaries, plus an optional leading 9 followed by an optional space or hyphen.
regex_dirty3 = re.findall(r'\b((?:9[ -]?)?\d{4}[ \-]?\d{4})\b', text_dirty.strip())


regex_dirty_lst = [regex_dirty1, regex_dirty2, regex_dirty3]

for index, result in enumerate(map(lambda x: isThere_v2(x,text_dirty), regex_dirty_lst)):
    print(f"Regex Dirty Number {index+1}")
    print(result + "\n")

Regex Dirty Number 1
Uow phone numbers:
['996379889', ' 96379889', '9-96379889', '9-9637-9889', ' 42246889', ' 4224-6889', '99637 9889', '9 96379889', ' 99999999', ' 85885888'] !

Regex Dirty Number 2
Uow phone numbers:
['996379889', ' 96379889', '9-96379889', '9-9637-9889', ' 42246889', ' 4224-6889', '99637 9889', '9 96379889'] !

Regex Dirty Number 3
Uow phone numbers:
['996379889', '96379889', '9-96379889', '9-9637-9889', '42246889', '4224-6889', '99637 9889', '9 96379889'] !
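The lookbehind `(?<!\d)` and lookahead `(?!\d)` used in the second pattern are zero-width: they check the neighbouring characters without adding them to the match. A reduced sketch, using a made-up string and only the eight-digit core of the pattern:

```python
import re

digits = "id 12345678999 tel 42246889"  # hypothetical sample text
core = r"\d{4}[-\s]?\d{4}"

# Without guards, the first 8 digits of the long id are accepted.
without_guards = re.findall(core, digits)
# With guards, a match may not have a digit directly before or after it.
with_guards = re.findall(rf"(?<!\d){core}(?!\d)", digits)

print(without_guards)  # ['12345678', '42246889']
print(with_guards)     # ['42246889']

# The guards themselves are not part of the matched text.
inner = re.search(r"(?<!\d)\d{4}(?!\d)", "a1234b").group()
print(inner)  # 1234
```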



### **Finditer** 

In [88]:
real_text_example = """
                    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis viverra consectetur sodales. Vestibulum consequat, 
                    risus in sollicitudin imperdiet, velit 996379889 elit congue sem, vitae aliquet ligula mi eget justo. Nulla facilisi. 
                    Maecenas a egestas nisi. Morbi purus dolor, ornare ac dui a, eleifend dignissim nunc. Proin pellentesque dolor non lectus pellentesque 
                    tincidunt. Ut et 345.323.343-9 tempus orci. Duis molestie 9 96379889 cursus tortor vitae pretium. 4224-6889 Donec non sapien neque. Pellentesque urna ligula, finibus a lectus sit amet
                    , ultricies cursus metus. Quisque eget orci et turpis faucibus  4224-6889 pharetra. 
                    """
match_generator = re.finditer(r'\b((?:9[ -]?)?\d{4}[ \-]?\d{4})\b', real_text_example.strip())

In [89]:
for match in match_generator:
    print(f"Phone Number: {match.group()}\nText Position: {match.span()}\n")

Phone Number: 996379889
Text Position: (173, 182)

Phone Number: 9 96379889
Text Position: (487, 497)

Phone Number: 4224-6889
Text Position: (527, 536)

Phone Number: 4224-6889
Text Position: (697, 706)
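Each item produced by `finditer` is a match object, so besides `group()` and `span()` it also offers `start()` and `end()`, which can be used to slice the original text. The sketch below uses a short made-up sentence. It also shows that the iterator can only be consumed once.

```python
import re

text = "call 4224-6889 or 9 96379889 today"  # hypothetical sample text
phone = re.compile(r"\b((?:9[ -]?)?\d{4}[ \-]?\d{4})\b")

found = [(m.group(), m.start(), m.end()) for m in phone.finditer(text)]
print(found)  # [('4224-6889', 5, 14), ('9 96379889', 18, 28)]

# Slicing the text with start/end gives back exactly the matched string.
slices = [text[start:end] for _, start, end in found]
print(slices)  # ['4224-6889', '9 96379889']

# The iterator is exhausted after one pass.
matches = phone.finditer(text)
first_pass = len(list(matches))
second_pass = len(list(matches))
print(first_pass, second_pass)  # 2 0
```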



## **Problem 2** - Email

### **Groups**

In [96]:
email_text = "hey my email is: localPart@domain"

Parentheses make it possible to capture a group:

In [105]:
match = re.search(r"(\w+)@(\w+)", email_text)
print(match.group(1))
print(match.group(2))

localPart
domain


The following syntax gives a name to the group:
```(?P<name>pattern)```

In [106]:
match = re.search(r"(?P<localPart>\w+)@(?P<domain>\w+)", email_text)
print(match.group("localPart"))
print(match.group("domain"))

localPart
domain
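Beyond `group(...)`, a match object can return all groups at once. A short sketch with the same email text: `groups()` gives a tuple in positional order, `groupdict()` gives a dictionary keyed by group name, and `group(0)` is the whole match.

```python
import re

email_text = "hey my email is: localPart@domain"
match = re.search(r"(?P<localPart>\w+)@(?P<domain>\w+)", email_text)

print(match.group(0))     # localPart@domain
print(match.groups())     # ('localPart', 'domain')
print(match.groupdict())  # {'localPart': 'localPart', 'domain': 'domain'}

# Named groups are still reachable by their position.
same_group = match.group("domain") == match.group(2)
print(same_group)  # True
```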


### **sub** 

Suppose a text with the following structure:

```
time - day | usage | id, description \n
```

Ideally, a single separator would be used. However, life is tough.

In [None]:
my_txt = r"""
         20:18:14 - 21/01 | 0.65 | 3947kedj, No dia em que eu saí de casa\n
         25:32:26 - 11/07 | 0.80 | 5679lqui, Minha mãe me disse: filho, vem cá\n
         12:13:00 - 12/06 | 0.65 | 5249dqok, Passou a mão em meus cabelos\n
         23:12:35 - 13/03 | 0.77 | 3434afdf, Olhou em meus olhos, começou falar\n
         20:22:00 - 12/02 | 0.98 | 1111absd, We are the champions, my friends\n
         22:12:00 - 07/03 | 0.65 | 4092bvds, And we'll keep on fighting till the end\n
         22:52:59 - 30/02 | 0.41 | 9021poij, We are the champions, we are the champions\n
         21:47:00 - 28/03 | 0.15 | 6342fdpo, No time for losers, 'cause we are the champions\n
         19:19:00 - 31/08 | 0.30 | 2314qwen, of the world\n
         xx:yy:zz - xx/yy | 0.00 | 0000aaaa,\n
         """

In [None]:
# Escape the literal pipes (a bare | means alternation) and give each group a unique name.
pattern = re.compile(r"(?P<time>\d{2}:\d{2}:\d{2}) - (?P<day>\d{2}/\d{2}) \| (?P<usage>\d\.\d{2}) \| (?P<id>\w+), (?P<description>.*)")
pattern.search(my_txt)
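A minimal sketch of how `sub` could bring such a row to a single separator. The choice of `;` as the target separator and the names `row_pattern`, `row` and `unified` are assumptions made for this example. The replacement string refers to the named groups with `\g<name>`, so commas inside the description are left untouched.

```python
import re

# One row in the "time - day | usage | id, description" layout.
row = "20:22:00 - 12/02 | 0.98 | 1111absd, We are the champions, my friends"

row_pattern = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) - (?P<day>\d{2}/\d{2}) \| "
    r"(?P<usage>\d\.\d{2}) \| (?P<id>\w+), (?P<description>.*)"
)

# Rebuild the row with ";" (assumed separator) between the captured fields.
unified = row_pattern.sub(
    r"\g<time>;\g<day>;\g<usage>;\g<id>;\g<description>", row
)
print(unified)
# 20:22:00;12/02;0.98;1111absd;We are the champions, my friends

# With a single separator, a plain split is enough (maxsplit keeps the description whole).
fields = unified.split(";", 4)
print(fields[3], "->", fields[4])
```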

## **Performance**

>Programmers waste enormous amounts of time thinking about, or worrying
about, the speed of noncritical parts of their programs, and these attempts at
efficiency actually have a strong negative impact when debugging and maintenance
are considered. We should forget about small efficiencies, say about 97% of the
time: premature optimization is the root of all evil. Yet we should not pass up our
opportunities in that critical 3%. - Donald Knuth

General:

- Don't be greedy.
- Reuse compiled patterns.
- Be specific.
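The three tips can be illustrated without any timing. The sketch below only shows behaviour, not measured speed: a greedy `.*` swallows far more than intended, a lazy `.*?` or a specific class such as `[^>]*` stops where it should, and a pattern compiled once can be reused inside a function. The `html` string and the `count_tags` helper are made up for this example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

greedy = re.findall(r"<.*>", html)       # .* runs to the last ">" in the string
lazy = re.findall(r"<.*?>", html)        # .*? stops at the first ">" it can
specific = re.findall(r"<[^>]*>", html)  # the class states exactly what may appear

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(specific)  # ['<b>', '</b>', '<i>', '</i>']

# Compile once at module level and reuse the object in every call.
TAG = re.compile(r"<[^>]*>")

def count_tags(lines):
    """Count tag-like substrings over several lines (illustrative helper)."""
    return sum(len(TAG.findall(line)) for line in lines)

total = count_tags([html, "<p>x</p>"])
print(total)  # 6
```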