# Regular expressions (regex): love or hate?

![commit strip](http://www.commitstrip.com/wp-content/uploads/2014/02/Strips-Le-dernier-des-vrais-codeurs-650-finalenglsih.jpg)

Regular expressions are available in almost every programming language. They are a very powerful tool for checking whether the content of a variable has the shape you expect.

For example, if you retrieve a phone number, you expect the variable to be composed of numbers and spaces (or dashes) but nothing more. 

Regular expressions can not only detect unwanted characters but also delete or modify them.
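
As a minimal sketch of both uses (detecting and cleaning), assuming a made-up phone number as input: `re.search` reports whether an unexpected character is present, and `re.sub` deletes everything that is not a digit.

```python
import re

raw_number = "+32 (0)475-12 34 56"  # hypothetical input, not taken from the exercises

# Detect: is there any character other than a digit, a space or a dash?
has_unwanted = re.search(r"[^0-9 -]", raw_number) is not None

# Clean: delete every character that is not a digit.
digits_only = re.sub(r"[^0-9]", "", raw_number)

print(has_unwanted)
print(digits_only)
```

Here the `+` and the parentheses count as unwanted, and the cleaned value keeps only the digits.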


**There are two ways to use regular expressions:**
* The first is to call a module-level function (such as `re.search`) with the pattern as the first argument and the string to analyze as the second.
* The second is to compile the regex, then use the methods of the resulting object to analyze a string passed as an argument. This approach speeds up processing when a regex is used several times.

In [1]:
import re

In [2]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# Searches for the pattern in the string above and returns a `Match` object for the
# first match found; otherwise returns `None`.
print(re.search(pattern, string))

<re.Match object; span=(1, 2), match=' '>


In [3]:
pattern = "[ ]"
string = "I am fine ! There are still 6 months left :()"

# Splits the string at each occurrence of the pattern.
print(re.split(pattern, string))

['I', 'am', 'fine', '!', 'There', 'are', 'still', '6', 'months', 'left', ':()']
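
The same call can split on several separators at once, which a plain `str.split` cannot do. A small sketch, assuming a made-up line that mixes commas, semicolons and runs of spaces:

```python
import re

line = "apples, pears;plums   cherries"  # hypothetical input

# One or more commas, semicolons or whitespace characters form a single separator.
parts = re.split(r"[,;\s]+", line)
print(parts)
```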


### A little syntax

    [xy]  A set of possible characters. Example: [abc] matches a, b or c.

    (x|y) A choice between alternatives. Example: (ps|ump) matches "ps" OR "ump".

    \d    A digit, which is equivalent to [0-9].

    \D    Anything that is not a digit, which is equivalent to [^0-9].

    \s    A whitespace character, which is equivalent to [ \t\n\r\f\v].

    \S    Anything that is not whitespace, which is equivalent to [^ \t\n\r\f\v].

    \w    An alphanumeric character or underscore, which is equivalent to [a-zA-Z0-9_].

    \W    Anything that is not alphanumeric or underscore, which is equivalent to [^a-zA-Z0-9_].

    \     The escape character.

    (These equivalences hold for ASCII text; with Python 3 string patterns the
    classes also cover Unicode digits, letters and whitespace unless re.ASCII is set.)
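
To see these classes at work, here is a short sketch run on a made-up sentence (the sample string is an assumption, chosen only to contain digits, an underscore and spaces):

```python
import re

sample = "Room 42, floor_3 at 9am"  # hypothetical input

digits = re.findall(r"\d", sample)        # every single digit
words = re.findall(r"\w+", sample)        # runs of letters, digits and underscores
spaces = re.findall(r"\s", sample)        # every whitespace character
choice = re.findall(r"(am|pm)", sample)   # either "am" or "pm"

print(digits)
print(words)
print(len(spaces))
print(choice)
```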

Let's try it.

If the result is not `None`, the pattern matched. GREY does indeed begin with GR, followed by at most one character and then Y.

In [4]:
print(re.match("GR(.)?Y", "GREY"))
# (.)? means that we expect 0 or 1 character:
# `.` matches any character (except a newline) and the `?` after it makes it optional.

<re.Match object; span=(0, 4), match='GREY'>


In [5]:
pattern = "GR(.)?Y"
string = "GREY"

result = re.match(pattern, string)
print(result)

# This is equivalent to
compiled = re.compile(pattern)
result = compiled.match(string)
print(result)

<re.Match object; span=(0, 4), match='GREY'>
<re.Match object; span=(0, 4), match='GREY'>


In [6]:
# In a loop, the second syntax is nicer: the pattern is compiled once and reused
pattern = "GR(.)?Y"
compiled = re.compile(pattern)
l = ["GREY 'S", "GRAY", "GREYISH", "A GREY"]

for elem in l:
    result = compiled.match(elem)
    print(elem, result)

GREY 'S <re.Match object; span=(0, 4), match='GREY'>
GRAY <re.Match object; span=(0, 4), match='GRAY'>
GREYISH <re.Match object; span=(0, 4), match='GREY'>
A GREY None
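
The last line is `None` because `match` only tries the pattern at the very beginning of the string. A short sketch of the difference with `search`, which scans the whole string:

```python
import re

compiled = re.compile(r"GR(.)?Y")

at_start = compiled.match("A GREY")    # only tries position 0
anywhere = compiled.search("A GREY")   # tries every position in turn

print(at_start)
print(anywhere.group(), anywhere.span())
```

With `search`, the match is found starting at index 2, after the leading `"A "`.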


In the following, we search for specific expressions in a string.

In [7]:
print(re.findall("GR(.)?Y", "GREY"))
# Here we capture the single optional character (.)? between GR and Y

['E']


In [11]:
# The same with two characters to capture
re.findall("G(.)?(.)?Y", "GREY")

[('R', 'E')]
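
Note that `findall` returns the content of the capturing groups when the pattern has any, and the whole match otherwise. A small sketch of the difference, using `(?:...)` (a non-capturing group) on a made-up string:

```python
import re

captured = re.findall(r"GR(.)?Y", "GREY GRAY")    # one group: its content is returned
whole = re.findall(r"GR(?:.)?Y", "GREY GRAY")     # no capturing group: whole matches

print(captured)
print(whole)
```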

To keep only the numbers:

In [8]:
# Only numbers
print(re.findall("([0-9]+)", "Hello I live on the 7th floor of 220 street of sims"))
# "+" Means 1 or more characters

['7', '220']


Conversely, to keep only the words:

In [9]:
# Only words
print(re.findall("([A-Za-z]+)", "Hello I live on the 7th floor of 220 street of sims"))

['Hello', 'I', 'live', 'on', 'the', 'th', 'floor', 'of', 'street', 'of', 'sims']
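
A common pitfall when writing a letters-only class is the range `A-z`: it follows ASCII order, so it also includes the characters `[`, `\`, `]`, `^`, `_` and the backtick that sit between `Z` and `a`. A quick sketch on a made-up string shows the difference with `A-Za-z`:

```python
import re

text = "snake_case [x] ^ done"  # hypothetical input

loose = re.findall(r"[A-z]+", text)      # also accepts [ \ ] ^ _ and the backtick
strict = re.findall(r"[A-Za-z]+", text)  # letters only

print(loose)
print(strict)
```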


### Stop, let's recap!

Character | Meaning   
:-------------------------:|:-------------------------:
**.** | **Matches any character (except a newline, by default).**
**^** | **Anchors the match at the beginning of the string <br/> (i.e. a string can only match if it starts with the pattern; <br /> by default it does not match if the pattern is preceded by spaces or a line break).**
**$** | **Anchors the match at the end of the string <br /> (the same remark as above applies, but at the end).**
**{n}**|**The previous character must be repeated exactly n times.**
**{n,m}**|**The previous character must be repeated between n and m times.**
**\***| **The previous character can be repeated zero or more times. <br />For example, `ab*` matches a, ab, or a followed by any number of b.**
**+**|**The previous character must be repeated one or more times. <br/>For example, `ab+` matches an a followed by at least one b.**
**?**|**The previous character can appear zero times or once.<br /> For example, `ab?` matches ab and a.**
**\w** | **Matches any alphanumeric character or underscore; it is equivalent to [a-zA-Z0-9_].**
**\W** | **Matches anything that is not an alphanumeric character or underscore.**
**\d** | **Matches any digit, i.e. it is equivalent to [0-9].**
**\D** | **Matches anything that is not a digit.**
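
The quantifiers and anchors from the table can be checked with a few lines. This is only a sketch: the candidate strings are made up, and `re.fullmatch` is used here so that the whole string has to fit the pattern.

```python
import re

candidates = ["a", "ab", "abbb", "b"]

# fullmatch forces the entire string to fit the pattern
star = [s for s in candidates if re.fullmatch(r"ab*", s)]      # zero or more b
plus = [s for s in candidates if re.fullmatch(r"ab+", s)]      # one or more b
optional = [s for s in candidates if re.fullmatch(r"ab?", s)]  # zero or one b

numbers = ["7", "42", "123", "1234"]
exactly_two = [n for n in numbers if re.fullmatch(r"\d{2}", n)]
two_to_three = [n for n in numbers if re.fullmatch(r"\d{2,3}", n)]

# Anchors: ^ ties the pattern to the start of the string, $ to the end
starts_with_gr = re.search(r"^GR", "A GREY") is not None
ends_with_y = re.search(r"Y$", "A GREY") is not None

print(star, plus, optional)
print(exactly_two, two_to_three)
print(starts_with_gr, ends_with_y)
```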

![alt text](http://www.codercaste.com/wp-content/uploads/2013/01/regex.gif)

### Some useful resources
http://www.rexegg.com/regex-quickstart.html  
http://www.dreambank.net/regex.html#examples  
https://pythex.org/ *(Pythex is a real-time regular expression editor for Python, a quick way to test your regular expressions.)*   
https://regex101.com/   
*(Regex101 is an online regex editor and debugger. Regex101 allows you to create, debug, test and have your expressions explained for PHP, PCRE, Python, Golang and JavaScript. The website also features a community where you can share useful expressions.)*

#### How to check that the entered string is a number?

In [10]:
number = input("Your number : ")
if re.match("^[0-9]+$", number):
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number.")

Your number : 986433
The string entered is a number.


Another way

In [36]:
compiled = re.compile("^[0-9]+$")
if compiled.search(number) is not None:
    print("The string entered is a number.")
else:
    print("The string entered is NOT a number")

The string entered is a number.
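
One subtlety with `^...$`: in Python, `$` also matches just before a newline at the very end of the string. `input()` never returns that newline, so the cells above are safe, but a line read from a file would keep it. A sketch of the difference with `re.fullmatch`, which requires the whole string to match:

```python
import re

with_newline = "986433\n"  # e.g. a line read from a file

dollar_ok = re.match(r"^[0-9]+$", with_newline) is not None   # `$` tolerates the trailing newline
full_ok = re.fullmatch(r"[0-9]+", with_newline) is not None   # the newline is not a digit
clean_ok = re.fullmatch(r"[0-9]+", "986433") is not None

print(dollar_ok, full_ok, clean_ok)
```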


## Drill 


**1. Create a regex that finds integers without size limit.**

In [70]:
import re
s = "sssgdds8sfsfs"
# `+` extends the match to every consecutive digit, so the integer can have any length
x = re.findall(r"\d+", s)
print(x)




['8']


**2. Create a regex that finds negative integers without size limit.**

In [69]:
import re
s = "sssgdds-8sfsfs"
# A minus sign followed by one or more digits
x = re.findall(r"-\d+", s)
print(x)




['-8']


**3. Create a regex that finds (positive or negative) integers without size limit.**

In [68]:
import re
s = "sssgdds-8s8fsfs"
# An optional minus sign followed by one or more digits
x = re.findall(r"-?\d+", s)
print(x)




['-8', '8']


**4. Capture all the numbers of the following sentence :**

In [87]:
import re
text = "21 scouts and 3 tanks fought against 4,003 protestors, so the manager was not 100.00% happy."

# One or more digits, optionally followed by more digit groups introduced by "," or "."
x = re.findall(r"\d+(?:[.,]\d+)*", text)
print(x)

['21', '3', '4,003', '100.00']


**5. Find all words that end with 'ly'.**

In [80]:
import re
text = "He had prudently disguised himself but was quickly captured by the police."
# \b marks a word boundary, so "ly" must be the end of the word
x = re.findall(r"\b\w+ly\b", text)
print(x)

['prudently', 'quickly']


**6. License plate number**  
A license plate consists of 2 capital letters, a dash ('-'), 3 digits, a dash ('-') and finally 2 capital letters. Write a script to check that an input string is a license plate.  
If it's correct, print `"good"`. If it's not correct, print `"Not good"`.

In [6]:
import re
plate = input("Enter your license plate number: ")
# fullmatch requires the whole input to be a plate, so no anchors or \b are needed
if re.fullmatch(r"[A-Z]{2}-\d{3}-[A-Z]{2}", plate):
    print("good")
else:
    print("Not good")
     

Enter your license plate number: ASD-123-JKD
Not good


**7. IPv4 address**  
An IPv4 address is composed of 4 numbers between 0 and 255 separated by '.'   
Write a script to verify that a string entered is that of an IPv4 address.

In [11]:
import re
ip = input("Enter your IP address :")
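# One octet: a number from 0 to 255 without leading zeros
octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
# Four octets separated by dots; fullmatch rejects any extra characters
if re.fullmatch(rf"{octet}(?:\.{octet}){{3}}", ip):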
    print('yes')
else:
    print('no')

Enter your IP address :255.6.0.45
yes


**8. Valid Mail**  
An email is composed of alphanumeric characters followed by `@` and a domain name.  
Write a script that checks that the string entered by a user is indeed an email; otherwise, ask them to enter it again (until they provide a valid email).

In [14]:
import re
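mail_pattern = re.compile(r"[A-Za-z0-9]+@[A-Za-z0-9]+\.[A-Za-z]+")
mail = input("Enter your email :")
# Keep asking until the whole input is a valid email
while not mail_pattern.fullmatch(mail):
    print("not email")
    mail = input("Enter your email :")
print("email")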

Enter your email :souka96@gmail.com
email


**9. Valid Password**  
Write an additional script that verifies the password (obviously if the email is valid) where the only specificity of the password is that it has to contain at least 6 characters.

In [None]:

password = input("Enter your password :")
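
The cell stops after reading the password. One possible way to finish it, as a sketch only: the helper name `has_min_length` is hypothetical, and a plain `len(password) >= 6` would do the same job without a regex.

```python
import re

def has_min_length(password, minimum=6):
    """Return True if `password` contains at least `minimum` characters."""
    # `.{6,}` means "any character, six times or more"; note that `.` does not match a newline
    return re.fullmatch(r".{%d,}" % minimum, password) is not None

print(has_min_length("abc12"))
print(has_min_length("abc123"))
```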


**10. Valid Password bis**  
The password must now contain at least 6 characters AND  

- at least one lowercase letter AND 
- at least one uppercase letter AND 
- at least one number AND 
- at least one special character (among `$#@`).

In [18]:
import re
password = input("Enter your password :")
# Each (?=...) lookahead checks one rule without consuming characters; .{6,} enforces the length
if re.match(r'^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[$#@]).{6,}$',password):
    print('correct')
else:
    print('not correct')

Enter your password :ADfg5@
correct
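
A single pattern with lookaheads is compact but only answers yes or no. As an alternative sketch (the `RULES` table and the `missing_rules` helper are hypothetical names), each rule can be checked with its own small regex, which also tells the user what is missing:

```python
import re

# One small pattern per rule from the exercise statement
RULES = {
    "at least 6 characters": r".{6,}",
    "a lowercase letter": r"[a-z]",
    "an uppercase letter": r"[A-Z]",
    "a number": r"[0-9]",
    "a special character among $#@": r"[$#@]",
}

def missing_rules(password):
    """Return the names of the rules that `password` does not satisfy."""
    return [name for name, pattern in RULES.items() if not re.search(pattern, password)]

print(missing_rules("ADfg5@"))
print(missing_rules("adfg5"))
```

An empty list means the password satisfies every rule.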


**11. Search by groups**  
It is possible to search by groups, and it is very powerful!  
`(?P<x>\w+)` captures a "group" named `x`; this group is composed of one or more (`+`) alphanumeric characters (`\w`).

In [3]:
import re
m = re.search(
    # Raw string, so that \w and \d reach the regex engine untouched; \? is a literal "?"
    r"Welcome to (?P<where>\w+) ! You are (?P<age>\d+) years old \?",
    "Welcome to Olivier ! You are 32 years old ?",
)
print(m.group("where"))
print(m.group("age"))

Olivier
32


In [4]:
# Another Example
m = re.search(
    r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w+)[.](?P<zone>\w+$)",
    "audrey.boulevart@benextcomapgny.com",
)
if m is not None:
    print(m.group("who"))
    print(m.group("who2"))
    print(m.group("operator"))
    print(m.group("zone"))

audrey
boulevart
benextcomapgny
com
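
When a pattern has several named groups, `groupdict()` returns all of them at once as a dictionary, which avoids one `group(...)` call per name. A short sketch on the same email address:

```python
import re

match = re.search(
    r"^(?P<who>\w*)[.]?(?P<who2>\w*)@(?P<operator>\w+)[.](?P<zone>\w+)$",
    "audrey.boulevart@benextcomapgny.com",
)

# Maps each group name to the text it captured
parts = match.groupdict()
print(parts)
```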


Load the file `./data/mail.txt` and parse it with a regex. The goal is to retrieve the last name, first name, operator and zone, as in the previous example. Store each of these in its own separate list.

In [28]:

import re

# Load the emails, one per line; the `with` block closes the file automatically
with open("./data/mail.txt", "r") as file:
    mailList = file.readlines()

# One list for each type of data
firstNameList = []
lastNameList = []
operatoreList = []
zonesList = []

# Pattern that splits an email into its four named parts
mail_pattern = re.compile(r"^(?P<firstName>\w*)[.-](?P<LastName>\w*)@(?P<operator>\w+)[.](?P<zone>\w+$)")

# Check each email and keep only those that match the pattern
for each_email in mailList:
    check_step = mail_pattern.search(each_email)
    if check_step is not None:
        firstNameList.append(check_step.group("firstName"))
        lastNameList.append(check_step.group("LastName"))
        operatoreList.append(check_step.group("operator"))
        zonesList.append(check_step.group("zone"))

print("The last Names list is : \n", lastNameList)

The last Names list is : 
 ['joe', 'thomas', 'hylan', 'steve', 'mark', 'jack', 'hal', 'yang', 'carl', 'joe', 'victor', 'weiss', 'walter', 'hank', 'carl', 'roger', 'dan', 'alex', 'nuttle', 'tim', 'roberts', 'monte', 'linde', 'larry', 'casal', 'sellon', 'matthew', 'schickowski', 'hal', 'carl', 'dan', 'frank', 'weiss', 'otto', 'myers', 'trusela', 'adam', 'jagtap', 'jack', 'fred', 'haworth', 'george', 'yang', 'john', 'hank', 'aaron', 'schickowski', 'alex', 'schwager', 'victor', 'roger', 'joe', 'roger', 'uddin', 'stahl', 'hal', 'aaron', 'john', 'ventotla', 'nugent', 'mark', 'victor', 'matthew', 'dan', 'vogal', 'soulis', 'paul', 'zeaser', 'jack', 'monte', 'frank', 'larry', 'yates', 'eastman', 'roger', 'trusela', 'edward', 'roger', 'deitz', 'victor', 'fred', 'otto', 'john', 'hank', 'duffman', 'tiernan', 'hal', 'alex', 'john', 'fletcher', 'ty', 'bongard', 'jack', 'fred', 'ty', 'peter', 'monte', 'moore', 'adam', 'darnell', 'peter', 'stahl', 'aaron', 'mccord', 'steve', 'mark', 'fletcher', 'roger

**12. Another way of doing things.**

In [1]:
mail = "audrey.boulevart@benextcomapgny.com"
splitMail = mail.replace(".", " ").split("@")

firstName = []
name = []
ope = []
zone = []

firstName.append(splitMail[0].split()[0])
name.append(splitMail[0].split()[-1])
ope.append(splitMail[1].split()[0])
zone.append(splitMail[1].split()[-1])

firstName, name, ope, zone

(['audrey'], ['boulevart'], ['benextcomapgny'], ['com'])

Repeat the previous exercise with this new approach and compare the length of your lists with those of the previous exercise.  
What do you notice?

In [2]:
# Load the emails, one per line
with open("./data/mail.txt", "r") as file:
    NewmailList = file.readlines()

# One list for each type of data
firstNameList2 = []
lastNameList2 = []
operatoreList2 = []
zonesList2 = []

for eachmail in NewmailList:
    # Turn "." and "-" into spaces, then separate the local part from the domain
    new_string = eachmail.replace(".", " ").replace("-", " ").split("@")
    if len(new_string) < 2:
        continue  # skip lines without an "@" (e.g. blank lines)

    firstNameList2.append(new_string[0].split()[0])
    lastNameList2.append(new_string[0].split()[-1])
    operatoreList2.append(new_string[1].split()[0])
    zonesList2.append(new_string[1].split()[-1])

print(len(lastNameList2))