### Regular Expression (RegEx)

* A regular expression is a special text string for describing a search pattern.
* Python has a built-in package called **re**, which can be used to work with Regular Expressions.
<pre>
import re
</pre>

* RegEx can solve many text-processing problems:
 - Extracting fields: for example, if you work in a marketing and sales team and receive a large data set of customer information in an unstructured format, RegEx can extract the name, age and contact information from it.
 - Verifying data formats and patterns, finding a string and replacing it with another string, and reformatting data.
 - Web scraping.
 
 <img src=attachment:image.png width="200">

--------------------------------------------------------------
| Function | Description |
|------|------|
| findall | Returns a list containing all matches |
| search | Returns a Match object if there is a match anywhere in the string |
| split | Returns a list where the string has been split at each match |
| sub |	Replaces one or many matches with a string |





In [2]:
import re

RegEx also uses metacharacters, which are characters with a special meaning:

| Character |	Description |	Example	|
|----|----|----|
| [] |	A set of characters |	'[a-m]' |	
| \	| Signals a special sequence (can also be used to escape special characters) | '\d' |	
| . |	Any character |	'he..o' |	
| ^ |	Starts with |	'^hello' |
| \$	| Ends with	| 'planet$' |
| \* | Zero or more occurrences |	'he.*o' |	
| +	| One or more occurrences |	'he.+o' |	
| ? |	Zero or one occurrences	| 'he.?o' |	
| {}	| Exactly the specified number of occurrences | 'he{2}o' |
| \| | Either or	| 'falls\|stays' |	
| () |	Capture and group | |

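As a rough illustration of the table above, the sketch below runs several of the metacharacters through `re.search` and `re.findall`. The sample strings and variable names are made up for this example.

```python
import re

sample = "hello planet"

# ^ anchors the pattern at the start of the string, $ at the end
starts_with_hello = bool(re.search(r"^hello", sample))
ends_with_planet = bool(re.search(r"planet$", sample))

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color or colour")

# {2} repeats the preceding character exactly twice
doubled = re.findall(r"he{2}o", "heo heeo heeeo")

# | chooses between two alternatives
choice = re.findall(r"falls|stays", "The rain in Spain falls mainly in the plain")

# () captures part of the match; findall then returns only the captured part
years = re.findall(r"(\d{4})-\d{2}-\d{2}", "released 2021-03-15, patched 2022-11-02")

print(starts_with_hello, ends_with_planet)
print(spellings)
print(doubled)
print(choice)
print(years)
```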
A special sequence is a backslash \ followed by a character, and it has a special meaning:

| Escape Sequence | Description | Example |
|-----------------|-------------|------------------------|
| \A | Returns a match if the specified characters are at the beginning of the string |	'\AThe' |	
| \b | Returns a match where the specified characters are at the beginning or at the end of a word | '\bain' or 'ain\b' |
| \B | Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word | '\Bain' or 'ain\B' |	
| \d | Returns a match where the string contains digits (numbers from 0-9) | '\d' |
| \D | Returns a match where the string DOES NOT contain digits | '\D' |
| \s | Returns a match where the string contains a white space character | '\s' |
| \S | Returns a match where the string DOES NOT contain a white space character | '\S' |
| \w | Returns a match where the string contains any word characters (a-z, A-Z, 0-9 and the underscore _ character) | '\w' |	
| \W | Returns a match where the string DOES NOT contain any word characters | '\W' |
| \Z | Returns a match if the specified characters are at the end of the string	| 'Spain\Z' |

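The special sequences in the table can be tried out with a short sketch like the one below. The sample strings are invented for illustration, and raw strings (`r"..."`) are used so that Python does not interpret the backslashes itself.

```python
import re

text = "The rain in Spain"

# \A anchors at the start of the string, \Z at the end
at_start = re.findall(r"\AThe", text)
at_end = re.findall(r"Spain\Z", text)

# \b is a word boundary: "ain" at the END of a word
ain_word_end = re.findall(r"ain\b", text)
# "ain" at the START of a word does not occur in this text
ain_word_start = re.findall(r"\bain", text)
# \B is the opposite of \b: "ain" NOT at the start of a word
ain_inside = re.findall(r"\Bain", text)

# \d matches digits, \w word characters, \s white space
numbers = re.findall(r"\d+", "Room 101, floor 3")
words = re.findall(r"\w+", "snake_case, 42!")
pieces = re.split(r"\s+", "a  b\tc")

print(at_start, at_end)
print(ain_word_end, ain_word_start, ain_inside)
print(numbers, words, pieces)
```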
A set is a group of characters inside square brackets [] that has a special meaning in a regular expression:

| Sets | Description | 
|----|----|
| [clt] | Returns a match where one of the specified characters (c, l, or t) is present |
| [a-n] | Returns a match for any lower case character, alphabetically between a and n |
| [^clt] | Returns a match for any character EXCEPT c, l, and t |
| [0123] | Returns a match where any of the specified digits (0, 1, 2, or 3) is present |
| [0-9] | Returns a match for any digit between 0 and 9 |
| [0-5][0-9] | Returns a match for any two-digit number from 00 to 59 |
| [a-zA-Z] | Returns a match for any character alphabetically between a and z, lower case OR upper case |
| [+] | In sets, +, *, ., \|, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string |

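A small sketch of how the sets above behave in practice. The sample strings are assumptions chosen for this example.

```python
import re

animals = "cat, hat, mat, Cat"

# [cm] matches exactly one character: either c or m
c_or_m = re.findall(r"[cm]at", animals)

# [^c] matches any single character except lower case c
not_c = re.findall(r"[^c]at", animals)

# [a-zA-Z] matches any single letter, lower or upper case
any_letter = re.findall(r"[a-zA-Z]at", animals)

# [0-5][0-9] matches two-digit numbers from 00 to 59
minutes = re.findall(r"[0-5][0-9]", "07:45 and 99")

# inside a set, + is a literal plus sign
plus_signs = re.findall(r"[+]", "1 + 2 + 3")

print(c_or_m)
print(not_c)
print(any_letter)
print(minutes)
print(plus_signs)
```

Note that a set always matches a single character, so `[cm]at` matches three characters in total.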

**search() function searches the string for a match. If there is more than one match, only the first occurrence of the match will be returned**

In [4]:
# Check if the string starts with 'This' and ends with 'text'

txt = 'These are example texts'
x = re.search('^This.*text$', txt)

if x:
  print('Yes it is True')
else:
  print('False')

False


In [6]:
# Find the position of the first white-space character

x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 5


**findall() function returns a list containing all matches**

In [13]:
# extracting name from the text

txt = '''
Sugandh is 32 year old 
Sonika is 14 year old 
Alice is 20 year old
Sofia is 28 year old
Tanja is 100 year old
'''

name = re.findall(r"[A-Z][a-z]*", txt)

print(name)

['Sugandh', 'Sonika', 'Alice', 'Sofia', 'Tanja']


In [14]:
# extracting age from the text

age = re.findall(r"\d{1,3}", txt)

print(age)

['32', '14', '20', '28', '100']

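The name and age extractions can also be combined into one pattern with capture groups. This is only a sketch: the `records` text repeats part of the sample above, and the dictionary step is an illustrative choice.

```python
import re

records = """
Sugandh is 32 year old
Sonika is 14 year old
Alice is 20 year old
"""

# With two groups, findall returns one (name, age) tuple per match
pairs = re.findall(r"([A-Z][a-z]*) is (\d{1,3})", records)

# Convert the captured age strings to integers
ages = {name: int(age) for name, age in pairs}

print(pairs)
print(ages)
```

Capturing both fields in one pass keeps each name attached to its own age, which two separate `findall` calls do not guarantee if a line is malformed.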

**finditer() function is similar to the search function, but it returns an iterator over all Match objects, whereas search returns only a single Match object**

In [15]:
txt = 'Python is an easy language to learn and very good tool for many solutions.'

for i in re.finditer('to',txt):
    tup = i.span()
    print(tup)

(27, 29)
(50, 52)

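Each item produced by `finditer()` is a Match object, so the matched text is available alongside its position. A minimal sketch, reusing the sentence above with a slightly different, illustrative pattern:

```python
import re

sentence = "Python is an easy language to learn and very good tool for many solutions."

# \bto\w* matches whole words that start with "to"
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\bto\w*", sentence)]

for word, start, end in hits:
    print(word, start, end)
```

Because `finditer()` is lazy, matches are produced one at a time, which can help on very long strings.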

**split() function returns a list where the string has been split at each match**

In [16]:
# Split the string at each white-space character

x = re.split(r"\s", txt)
print(x)

['Python', 'is', 'an', 'easy', 'language', 'to', 'learn', 'and', 'very', 'good', 'tool', 'for', 'many', 'solutions.']


In [17]:
# The maxsplit parameter limits the number of splits

x = re.split(r"\s", txt, maxsplit=1)

print(x)

['Python', 'is an easy language to learn and very good tool for many solutions.']


**sub() function replaces the matches with the text of your choice**

In [18]:
# replace every white-space character with the number 9

txt = "We will learn Python"

x = re.sub(r"\s", "9", txt)
print(x)


We9will9learn9Python


In [19]:
# You can limit the number of replacements with the count parameter

txt = "We will learn Python"

x = re.sub(r"\s", "9", txt, count=2)  # replace only the first 2 white-space characters
print(x)


We9will9learn Python

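Two further uses of `sub()` are sketched below: collapsing runs of white space, and reusing captured groups in the replacement through `\1`, `\2`, `\3`. The sample strings are assumptions for this example.

```python
import re

# \s+ matches one or more white-space characters, so each run becomes one space
messy = "We   will\tlearn  Python"
tidy = re.sub(r"\s+", " ", messy)

# Groups captured with () can be referenced in the replacement string
iso_date = "Due 2021-03-15"
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", iso_date)

print(tidy)
print(reordered)
```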

In [20]:
# Another example: find an email address

line = 'Hi, I am Albert and my email address is alb.24@somecompany.com and my friend\'s email is allan_robert124@gmail.com'

match = re.search(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)', line) # search() stops at the first match

if match:
    print(match.group())  

alb.24@somecompany.com


In [21]:
# findall() returns every match as a list

match = re.findall(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)', line)

if match:
    print(match)  

['alb.24@somecompany.com', 'allan_robert124@gmail.com']

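If the user name and the domain are needed separately, the same idea can be sketched with two capture groups. The pattern below is a simplified assumption and not a complete email validator.

```python
import re

line = ("Hi, I am Albert and my email address is alb.24@somecompany.com "
        "and my friend's email is allan_robert124@gmail.com")

# Group 1 captures the part before @, group 2 the domain after it
parts = re.findall(r"([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+)", line)

for user, domain in parts:
    print(user, "at", domain)
```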

In [23]:
txt ='my name is Vijay and my phone number is 111-222-3333 and my friend\'s number is 444-555-6666'

phone = re.findall(r'\d{3}-\d{3}-\d{4}', txt)  # \d matches digits only, unlike \w
print(phone)

['111-222-3333', '444-555-6666']


In [26]:
txt ='my name is Vijay and my phone number is 31 00 44 55 and my friend\'s number is 11023344'

phone = re.findall(r'\d{2} \d{2} \d{2} \d{2}', txt)
other_phone = re.findall(r'\d{8}', txt)
print(phone)
print(other_phone)

['31 00 44 55']
['11023344']

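Both phone formats can be matched in a single pass with the `|` metacharacter. The sketch below also removes the spaces afterwards; that normalisation step is an illustrative addition.

```python
import re

txt = "my name is Vijay and my phone number is 31 00 44 55 and my friend's number is 11023344"

# Either four space-separated pairs of digits, or eight digits in a row
numbers = re.findall(r"\d{2} \d{2} \d{2} \d{2}|\d{8}", txt)

# Remove white space so every number has the same form
normalised = [re.sub(r"\s", "", n) for n in numbers]

print(numbers)
print(normalised)
```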

In [27]:
# Check if the string starts with "The"

txt = "The train will be delayed by 30 minutes."
txt1 = "All the passengers, please come and collect the refreshments from the counter."

x = re.findall(r"\AThe", txt)
y = re.findall(r"\AThe", txt1)
print('txt: ',x)
print('txt1: ',y)

txt:  ['The']
txt1:  []


In [28]:
# Check if the string contains any digits (numbers from 0-9)

txt = "The train will be delayed by 30 minutes."

x = re.findall(r"\d", txt)
print(x)

['3', '0']


In [29]:
# Return a match at every non-digit character

txt = "The train will be delayed by 30 minutes."

x = re.findall(r"\D", txt)
print(x)

['T', 'h', 'e', ' ', 't', 'r', 'a', 'i', 'n', ' ', 'w', 'i', 'l', 'l', ' ', 'b', 'e', ' ', 'd', 'e', 'l', 'a', 'y', 'e', 'd', ' ', 'b', 'y', ' ', ' ', 'm', 'i', 'n', 'u', 't', 'e', 's', '.']


In [32]:
# Search for a sequence that starts with "co", followed by any one character, and an "e"
txt = "All the passengers, please come and collect the refreshments from the counter."

x = re.findall(r"co.e", txt)
print(x)

['come']
