# Python Regular Expressions

In [1]:
from datetime import date
today = date.today()
print("Last Updated Date:", today.strftime("%d %B %Y"))

Last Updated Date: 03 October 2022


<br>[Text Processing - Handling Text Data](#Text_Processing_Handling_Text_Data)
<br>[Regular Expressions](#Regular_Expressions)
<br>[Word Boundaries](#Word_Boundaries)
<br>[Regex Groups and the Pipe Character](#Regex_Groups_and_the_Pipe_Character)
<br>[Non-Capturing Groups](#Non-Capturing_Groups)
<br>[Repetition in Regex Patterns and Greedy/Non-Greedy Matching](#Repetition_in_Regex_Patterns)
<br>[Regex Character Classes and the findall() Method](#Regex_Character)
<br>[Regex Dot Star and Caret Dollar Characters](#Regex_Dot)
<br>[Regex sub() Method and Verbose Mode. Substituting Strings with the sub() Method](#Regex_sub)
<br>[Regex Example Program: A Phone and Email Scraper](#Regex_Example_Program)
<br>[Zero-Width Assertions](#Zero-Width_Assertions)
<br>[Using RegEx on Real-World Dataset](#Using_RegEx)

You can practice REGEX on https://regex101.com/
<br>[Sample Text](./Media/data/SampleText.txt)

# <a id='Text_Processing_Handling_Text_Data'></a>Text Processing - Handling Text Data
[Text Processing - Handling Text Data.pdf](./Media/Slides/NLP_1_TextProcessingHandlingTextData.pdf)
* 11 Metacharacters, 10 Special Sequences, 8 Set Combinations, 8 RegEx Functions

## Metacharacters - 11
Metacharacters are characters with a special meaning:
<br>
<br>**Character&emsp;&emsp;Description&emsp;Example**
<br>`[ ]`&emsp;A set of characters&emsp;&emsp;&emsp;"\[a-m\]"
<br>`\`&emsp;&emsp;Signals a special sequence (can also be used to escape special characters)&emsp;&emsp;&emsp;"\d"
<br>`.`&emsp;&emsp;Any character (except newline character)&emsp;&emsp;&emsp;"he..o"
<br>`^`&emsp;&emsp;Starts with&emsp;&emsp;&emsp;"^hello"
<br>`$` &emsp;&emsp;Ends with&emsp;&emsp;&emsp;"world\$"
<br>`*`&emsp;&emsp;Zero or more occurrences&emsp;&emsp;&emsp;"aix*"
<br>`+`&emsp;&emsp;One or more occurrences&emsp;&emsp;&emsp;"aix+"
<br>`?`&emsp;&emsp;Zero or One (optional)&emsp;&emsp;&emsp;"aix?"
<br>`{N}`&emsp;&ensp;Exactly the specified N number of occurrences&emsp;&emsp;&emsp;"al{2}"
<br>`|`&emsp;&emsp;Either or&emsp;&emsp;&emsp;"falls|stays"
<br>`()`&emsp;&ensp;Capture and group

## Special Sequences - 10

A Special Sequence is a `\` followed by one of the characters in the list below, and has a special meaning:
<br> Easy to remember: the uppercase forms \\D, \\W, \\S are the opposites of the sequences below
* \\d indicates a digit, 0 to 9
* \\w indicates alpha-numeric and underscore
* \\s indicates space, tab, newline

<br>**Character&emsp;&emsp;Description&emsp;Example**
<br>`\A`&emsp;&emsp;Returns a match if the specified characters are at the beginning of the string&emsp;&emsp;&emsp;"\AThe"
<br>`\Z`&emsp;&emsp;Returns a match if the specified characters are at the end of the string&emsp;&emsp;&emsp;"Spain\Z"
<br>`\b` (Word) Boundary - Matches at the beginning or end of a word but not in the middle of a word (the "r" prefix makes sure that the string is treated as a "raw string")&emsp;&emsp;&emsp;r"\bain"&emsp;r"ain\b"
<br>`\B` Opposite of `\b`. Matches in the middle of a word but not at the beginning or end of a word&emsp;&emsp;&emsp;r"\Bain"&emsp;r"ain\B"
<br>`\d` Matches any decimal digit; this is equivalent to the class [0-9]&emsp;&emsp;&emsp;"\d"
<br>`\D` Matches any non-digit character; this is equivalent to the class [^0-9]&emsp;&emsp;&emsp;"\D"
<br>`\s` Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]&emsp;&emsp;&emsp;"\s"
<br>`\S` Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]&emsp;&emsp;&emsp;"\S"
<br>`\w` Matches any alphanumeric character and the underscore _ character; this is equivalent to the class [a-zA-Z0-9_]&emsp;&emsp;&emsp;"\w"
<br>`\W` Matches any character that is neither alphanumeric nor the underscore _; this is equivalent to the class [^a-zA-Z0-9_]&emsp;&emsp;&emsp;"\W"
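<br>__NOTE__: `\A` only matches at the beginning of the ENTIRE text and `\Z` only matches at the end of the ENTIRE text, as opposed to just a line beginning or ending. With the `re.MULTILINE` flag, `^` and `$` also match at the beginning and end of every line, so `^` and `\A` behave differently there, and so do `$` and `\Z`. Even without the flag, `$` also matches just before a trailing newline, while `\Z` does not.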
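
The difference between `^`/`$` and `\A`/`\Z` only becomes visible on text with more than one line. A minimal sketch, assuming a made-up two-line string (`text` and the variable names are purely illustrative):

```python
import re

text = "first line\nsecond line"

# Without flags, ^ only matches at the start of the whole string.
no_flag = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after every newline ...
caret_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# ... whereas \A still matches only at the very start of the text.
a_multiline = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# $ with re.MULTILINE matches at the end of each line; \Z only at the very end.
dollar_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)
z_multiline = re.findall(r"\w+\Z", text, flags=re.MULTILINE)

print(no_flag)           # ['first']
print(caret_multiline)   # ['first', 'second']
print(a_multiline)       # ['first']
print(dollar_multiline)  # ['line', 'line']
print(z_multiline)       # ['line']
```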

## Sets - 8
A set is a set of characters inside a pair of square brackets \[\] with a special meaning:
<br>
<br>**Set&emsp;&emsp;&emsp;&emsp;Description**
<br>`[arn]`&emsp;&emsp;Returns a match where one of the specified characters (a, r, or n) is present
<br>`[a-n]`&emsp;&emsp;Returns a match for any lower case character, alphabetically between a and n
<br>`[^arn]`&emsp;&ensp;Returns a match for any character EXCEPT a, r, and n
<br>`[0123]`&emsp;&ensp;Returns a match where any of the specified digits (0, 1, 2, or 3) are present
<br>`[0-9]`&emsp;&emsp;Returns a match for any digit between 0 and 9
<br>`[0-5][0-9]`&ensp;&ensp;Returns a match for any two-digit number from 00 to 59
<br>`[a-zA-Z]`&ensp;Returns a match for any character alphabetically between a and z, lower case OR upper case
<br>`[+]`&emsp;&emsp;In Sets, `+`, `*`, `.`, `|`, `()`, `$`, `{}` have no special meaning, so `[+]` means: return a match for any `+` character in the string
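
A few of the sets above can be tried out directly. This is only a sketch on an invented sample string; `sample` and the result names are assumptions made for the illustration:

```python
import re

sample = "Times: 07:45, 23:61, 12:05. Math: 2+3+4"

# [0-5][0-9] accepts 00 to 59, so the invalid minute value "61" is skipped.
minutes = re.findall(r":([0-5][0-9])", sample)

# Inside a set, + is a literal plus sign rather than a quantifier.
plus_signs = re.findall(r"[+]", sample)

# [^0-9] is the negated set: every character that is not a digit.
non_digits = re.findall(r"[^0-9]", "2+3+4")

print(minutes)      # ['45', '05']
print(plus_signs)   # ['+', '+']
print(non_digits)   # ['+', '+']
```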

## Variations of Metacharacters and Special Sequences
<br>`abc...` Letters
<br>`123...` Digits
<br>`{min,max}` Matches at least min and at most max occurrences
<br>`{min,}` 	Matches at least min occurrences with no max. Unbounded Maximum
<br>`{,max}` 	Matches from zero up to max occurrences. The minimum defaults to 0
<br>`^$`	To find Empty Strings
<br>`^Exact Match$`	To find out Exact Match
<br>`\group_number` - Backreference (`\1` refers to the first capturing group): matches the same text as previously matched by that capturing group.
<br>`(...)` Capture Group
<br>`(a(bc))` Capture Sub-group
<br>`(.*)` Capture all
<br>`(abc|def)` Matches abc or def
<br>`(?:pattern)` Non-Capturing Groups - To make a group Non-Capturing
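
Two of the variations listed above, the backreference and `{min,max}`, are easier to see on a concrete input. A small sketch with invented sample strings (none of the names come from the original notes):

```python
import re

# \1 must repeat exactly what group 1 captured, so this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

# {2,4} accepts runs of two to four digits; \b keeps longer or shorter runs out.
two_to_four = re.findall(r"\b\d{2,4}\b", "7 42 123 9999 12345")

print(doubled.group())    # is is
print(doubled.group(1))   # is
print(two_to_four)        # ['42', '123', '9999']
```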
   
### Greedy / Non-Greedy Matching
<br>`*` Greedily matches the expression to its left 0 or more times (append ? for Non-Greedy matching). It tries to match the longest possible string that matches pattern.
<br>`.*`	Greedy Matches
<br>`.*?`	Non Greedy Match
<br>`+` 	Greedily matches the expression to its left 1 or more times (append ? for Non-Greedy matching)
<br>`.+`  Greedy Matches
<br>`*? or +?`  Non-greedy matching
<br>`{x,y}?`  Non-greedy matching
<br>__Note__: The question mark used here is different from the Zero or One (optional) question mark
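
The effect of appending `?` to a quantifier is easiest to see side by side. The sketch below uses an invented HTML-like string and a digit string; both inputs are assumptions chosen only for illustration:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first possible '>', giving one match per tag.
lazy = re.findall(r"<.*?>", html)

# The same idea with {x,y} and {x,y}?
greedy_digits = re.search(r"\d{2,4}", "123456").group()
lazy_digits = re.search(r"\d{2,4}?", "123456").group()

print(greedy)          # ['<b>bold</b> and <i>italic</i>']
print(lazy)            # ['<b>', '</b>', '<i>', '</i>']
print(greedy_digits)   # 1234
print(lazy_digits)     # 12
```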

# <a id='Regular_Expressions'></a>Regular Expressions

* A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern
* Python has a built-in package called re, which can be used to work with Regular Expressions
* re.compile() - Compiles the text or pattern that we are looking for into a regex object
* You will usually pass raw strings (r'') to re.compile()
<br> __Note__: You do not need to create a regex object with re.compile() every time; you can directly call the functions re.search(pattern,string), re.findall(pattern,string), re.match(pattern,string), re.sub(pattern,replacement,string)
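
As a quick check of the note above, the compiled-object style and the module-level style give the same result. A minimal sketch on an invented sentence (`text`, `compiled` and the result names are illustrative):

```python
import re

text = "Order 66 shipped on day 7"

# Style 1: compile once, then call methods on the regex object.
compiled = re.compile(r"\d+")
via_object = compiled.findall(text)

# Style 2: pass the pattern straight to the module-level function.
via_module = re.findall(r"\d+", text)

print(via_object)   # ['66', '7']
print(via_module)   # ['66', '7']
```

Compiling mainly pays off when the same pattern is reused many times or needs a readable name.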

<b> Regular Expression Blueprint / Syntax
    <br> 1. Specifying the Pattern
    <br> 2. Searching the Pattern
    <br> 3. Print the result </b>

[Regular Expressions.pdf](./Media/Slides/NLP_2_RegularExpressions.pdf)
<br> [Regular Expressions Functions.pdf](./Media/Slides/NLP_3_RegularExpressionsPart1.pdf)
* Regular Expression Functions are available in __re__ library of Python

1. __compile()__
2. __match()__
3. __search()__
4. __findall()__
5. __finditer()__
6. __sub()__
7. __split()__
8. __groups__

<br>![](./Media/15_15.png)

### Match Object
* A Match Object is an object containing information about the search and the result.
* If there is no match, the value None will be returned, instead of the Match Object.
* __.start()__ and __.end()__ give the start position and end position of the match
* __.span()__ returns a tuple containing the start and end positions of the match. 
* __.string__ returns the string passed into the function
* __.group()__ returns the part of the string where there was a match
* __.groups()__ returns all the groups (in the match object) in the form of tuple
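
The attributes listed above, including `.groups()` and the `None` case, can be sketched on a small example. The e-mail-like string and the pattern are assumptions made up for this illustration:

```python
import re

m = re.search(r"(\w+)@(\w+)\.com", "contact: alice@example.com")

print(m.group())    # alice@example.com  (the whole match)
print(m.groups())   # ('alice', 'example')  (one entry per capturing group)
print(m.span())     # (9, 26)

# When nothing matches, search() returns None instead of a Match object.
no_match = re.search(r"\d+", "no digits here")
print(no_match)     # None
```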

In [2]:
import re
str = "The rain in Spain"
x = re.search(r"\bS\w+", str)
print(x)
print(x.start())
print(x.end())
print(x.span())
print(x.string)
print(x.group())

<re.Match object; span=(12, 17), match='Spain'>
12
17
(12, 17)
The rain in Spain
Spain


In [3]:
str = "eay easy easssy eay eaty"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)

print(x)

['eay', 'easy', 'easssy', 'eay']


## 1. match()
* Checks for a match only at the beginning of the string
* __re.match()__ tries to apply the regular expression pattern at the very start of the string. If the pattern matches there, it returns a match object. If the pattern only occurs later in the string (including at the start of a later line), it returns None.
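
To make the "beginning of the string" rule concrete, here is a small sketch on an invented two-line string. It also shows that `re.MULTILINE` does not change what `match()` does, while `search()` with `^` can find the start of a later line:

```python
import re

text = "Lion sleeps.\nTiger hunts."

at_start = re.match(r"Lion", text)
later_line = re.match(r"Tiger", text)
later_line_multiline = re.match(r"Tiger", text, flags=re.MULTILINE)
found_by_search = re.search(r"^Tiger", text, flags=re.MULTILINE)

print(at_start.group())         # Lion
print(later_line)               # None
print(later_line_multiline)     # None
print(found_by_search.span())   # (13, 18)
```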

In [2]:
# Defining a string
string = "Tiger is the national animal of India. Tiger lives in Forest."

# Defining the pattern
pattern = "Tiger"

# Running match() on a string
result = re.match(pattern, string)

# Printing the result
print(result)

<_sre.SRE_Match object; span=(0, 5), match='Tiger'>


In [3]:
# Defining a string
string = "Tiger is the national animal of India. Tiger lives in Forest."

# Defining the pattern
pattern = "Tiger"

# Extracting String from a match object
result = re.match(pattern, string).group()

# Printing the result
print(result)

Tiger


In [4]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Checking for match
result = re.match(pattern, string)
print(result)

None


In [2]:
import re
items = ['guru99 get','guru99 give','guru Selenium']   # 'items' avoids shadowing the built-in list
for element in items:
    mo = re.match(r'(g\w+)\s(g\w+)',element)
    print(mo)
    if mo:
        print(mo.groups())

<re.Match object; span=(0, 10), match='guru99 get'>
('guru99', 'get')
<re.Match object; span=(0, 11), match='guru99 give'>
('guru99', 'give')
None


In [5]:
import re
regex_pattern = r".{3}\..{3}\..{3}\..{3}"    # .(dot) means any character except new line
test_string = 'xyz.lmn.pqr.................'
match = re.match(regex_pattern, test_string)
print(match)

<re.Match object; span=(0, 15), match='xyz.lmn.pqr....'>


In [1]:
import re
txt = ['1 877 2638277','91-011-23413627']
pattern = re.compile(r'(\d{1,3})(-|\s)(\d{3})(-|\s)(\d{4,10})')
# pattern = re.compile(r'(\d){1,3}(-|\s)(\d){3}(-|\s)(\d){4,10}') #This has different groups than above groups. Don't Confuse
for i in txt:
    mo = re.search(pattern,i)
    print(f'CountryCode={mo.group(1)},LocalAreaCode={mo.group(3)},Number={mo.group(5)}')

CountryCode=1,LocalAreaCode=877,Number=2638277
CountryCode=91,LocalAreaCode=011,Number=23413627


## 2. search()
* Locates a sub-string matching the RegEx pattern anywhere in the string
* The __search()__ function searches the string for a match, and __returns a Match object if there is a match. If there is more than one match, only the first occurrence of the match will be returned__

In [5]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Searching a substring using search()
result = re.search(pattern, string)
print(result)

<_sre.SRE_Match object; span=(32, 37), match='Tiger'>


In [6]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Extracting searched string
result = re.search(pattern, string).group()
print(result)

Tiger


In [216]:
txt = "The rain in Spain"
x = re.search('^The.*in*',txt)
print(x)

<re.Match object; span=(0, 17), match='The rain in Spain'>


In [217]:
if x:
    print('Yes')
else:
    print('No')

Yes


In [218]:
str = "The rain in Spain"
x = re.search(r"\s", str)
print(x)
print("The first white-space character starts at position:", x.start())
print("The first white-space character ends at position:", x.end())
print(x.span())

<re.Match object; span=(3, 4), match=' '>
The first white-space character starts at position: 3
The first white-space character ends at position: 4
(3, 4)


## 3. findall()
* Finds all the sub-strings matching the RegEx pattern
* The __re.findall()__ function returns a list containing all matches. If no matches are found, an empty list is returned
* __VIMP: When the pattern contains capturing groups, findall() returns only the text captured by the groups, not the entire match__

In [7]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Using findall() on a string
result = re.findall(pattern, string)
print(result)

['Tiger', 'Tiger']


In [9]:
# Defining the string
text = "India got freedom on 15-08-1947, and it is celebrated as Independence Day.\
        Indian Constitution came into effect on 26-01-1950, and it is celebrated as Republic Day."

# Defining the pattern
date_pattern = r'\d{2}-\d{2}-\d{4}'

# Extracting dates using findall()
re.findall(date_pattern, text)

['15-08-1947', '26-01-1950']

In [1]:
import re
str = "The rain in Spain"
x = re.findall('ai',str)
x

['ai', 'ai']

In [5]:
print(re.findall('^The',str))
print(re.findall('Spain$',str))

['The']
['Spain']


In [214]:
myString = 'Send an email from this@email.com to test@user.com 34 times.'
print(re.findall('[0-9]+',myString))
print(re.findall('[a-zA-Z]+',myString))
print(re.findall(r'\S+@\S+',myString))

['34']
['Send', 'an', 'email', 'from', 'this', 'email', 'com', 'to', 'test', 'user', 'com', 'times']
['this@email.com', 'test@user.com']


In [6]:
import re
txt = """
i love cats
i love dogs
"""
pat = re.compile(r'i love (cats|dogs)')
mo = pat.search(txt)
print(mo)
print(mo.group())
print(mo.group(1))

<re.Match object; span=(1, 12), match='i love cats'>
i love cats
cats


In [8]:
pat = re.compile(r'i love (cats|dogs)')
mo = pat.findall(txt) # findall() returns only the captured groups, not the entire match that match(), search(), finditer() give
print(mo)

['cats', 'dogs']


In [10]:
txt = """
i love cats
i love dogs
"""
pat = re.compile(r'i love (cats|dogs)')
mo = pat.finditer(txt)
for i in mo:
    print(i)
    print(i.group())
    print(i.group(1))
    print('------------------')

<re.Match object; span=(1, 12), match='i love cats'>
i love cats
cats
------------------
<re.Match object; span=(13, 24), match='i love dogs'>
i love dogs
dogs
------------------


## 4. finditer()
* Similar to findall() but returns an iterator
* The re.finditer(pattern,string) function returns an iterator that yields a Match object for every match, from which the start index, end index and matched text can be read

In [10]:
string = "The national animal of India is Tiger. Tiger lives in Forest."
pattern = "Tiger"

# Using finditer() on a string
result = re.finditer(pattern, string)
print(result)

# Iterating over the iterator
for m in result:
    # Printing match object
    print(m)
    # Printing starting and ending index with matched substring
    print('Start:',m.start(),' End:',m.end(),' Sub-string:',m.group())

<callable_iterator object at 0x7f79f1704b00>
<_sre.SRE_Match object; span=(32, 37), match='Tiger'>
Start: 32  End: 37  Sub-string: Tiger
<_sre.SRE_Match object; span=(39, 44), match='Tiger'>
Start: 39  End: 44  Sub-string: Tiger


In [8]:
string = 'tiger is the national animal of india and national sports is hockey'
pattern = 'national'
mo = re.finditer(pattern,string)
print(mo)

<callable_iterator object at 0x000002885D299748>


In [9]:
for i in mo:
    print(i)
    print(i.start())

<re.Match object; span=(13, 21), match='national'>
13
<re.Match object; span=(42, 50), match='national'>
42


## 5. sub()
* Searches for a substring and replaces it with another string
* The __re.sub(pattern,replacement,string) i.e., re.sub(old,new,string)__ function replaces the matches with the text of your choice
* You can control the number of replacements by specifying the count parameter

In [11]:
text="Analytics Vidhya is largest Analytics community of India."

# Replacing a substring using sub()
result=re.sub('India', 'the World',text)
print(result)

Analytics Vidhya is largest Analytics community of the World.


In [221]:
# Example: Replace every white-space character with the number 9
str = "The rain in Spain"
x = re.sub(r'\s','9',str)
x

'The9rain9in9Spain'

In [222]:
# Example: Replace the first 2 occurrences
str = "The rain in Spain"
x = re.sub(r'\s','9',str,count=2)  # You can control the number of replacements by specifying the count parameter
x

'The9rain9in Spain'

## 6. split()
* Split the text by the given RegEx Pattern
* The re.split(pattern,string) function returns a list where the string has been split at each match
* You can control the number of occurrences by specifying the __maxsplit__ parameter

In [2]:
line = "I have a big test tomorrow; I can't go out tonight."

# Splitting a string into multiple substrings
re.split(r';', line)

['I have a big test tomorrow', " I can't go out tonight."]

In [12]:
line = "I have a big test tomorrow; I can't go out tonight."

# Splitting a string into multiple substrings
re.split(r'[;]', line)

['I have a big test tomorrow', " I can't go out tonight."]

In [219]:
str = "The rain in Spain"
x = re.split(r'\s',str)
x

['The', 'rain', 'in', 'Spain']

In [220]:
# Example: Split the string only at the first occurrence
str = "The rain in Spain"
x = re.split(r'\s',str,maxsplit=1)    # You can control the number of splits by specifying the maxsplit parameter
x

['The', 'rain in Spain']

In [2]:
string = 'this is;a,sample,text string'
pattern = r'[;,\s]'
re.split(pattern,string)

['this', 'is', 'a', 'sample', 'text', 'string']

## 7. Groups

In [13]:
# Running a simple pattern on some text
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

pattern=r"[\w]+ [\w]+ \$[\d,]+ [a-zA-Z ]+ \d{2}-\d{2}-\d{4}"

result=re.findall(pattern,string)

print(result)

['Ajay credited $500 to your account on 13-08-2020', 'Anmol debited $1,700 from your account on 14-08-2020', 'Alex debited $100 on 16-08-2020']


In [14]:
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Creating groups in the previous pattern
pattern=r"([\w]+) ([\w]+) (\$[\d,]+) [a-zA-Z ]+ (\d{2}-\d{2}-\d{4})"

result=re.findall(pattern,string)

print(result)

[('Ajay', 'credited', '$500', '13-08-2020'), ('Anmol', 'debited', '$1,700', '14-08-2020'), ('Alex', 'debited', '$100', '16-08-2020')]


In [15]:
import pandas as pd

# Creating a dataframe
df=pd.DataFrame(result,columns=['Name','Type','Amount','Date'])
df

| | Name | Type | Amount | Date |
|---|---|---|---|---|
| 0 | Ajay | credited | $500 | 13-08-2020 |
| 1 | Anmol | debited | $1,700 | 14-08-2020 |
| 2 | Alex | debited | $100 | 16-08-2020 |


In [16]:
# Using finditer() for getting match objects
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

pattern=r"([\w]+) ([\w]+) (\$[\d,]+) [a-zA-Z ]+ (\d{2}-\d{2}-\d{4})"

result=re.finditer(pattern,string)

# Accessing groups separately
for i in result:
    print(i.group(0),'=>',i.group(1),'=>',i.group(2),'=>',i.group(3),'=>',i.group(4))

Ajay credited $500 to your account on 13-08-2020 => Ajay => credited => $500 => 13-08-2020
Anmol debited $1,700 from your account on 14-08-2020 => Anmol => debited => $1,700 => 14-08-2020
Alex debited $100 on 16-08-2020 => Alex => debited => $100 => 16-08-2020


**Note:** Syntax for naming groups: `(?P<Group Name>Pattern)`       VIMP

In [18]:
string="Ajay credited $500 to your account on 13-08-2020.\
      Anmol debited $1,700 from your account on 14-08-2020.\
      Alex debited $100 on 16-08-2020 from your account."

# Naming Groups
pattern=r"(?P<Name>[\w]+) (?P<Type>[\w]+) (?P<Amount>\$[\d,]+) [a-zA-Z ]+ (?P<Date>\d{2}-\d{2}-\d{4})"

result=list(re.finditer(pattern,string))

In [19]:
# Accessing data by group names
for i in result:
    print(i.group('Name'),'=>',i.group('Amount'),'=>',i.group('Date'),'=>',i.group('Type'))

Ajay => $500 => 13-08-2020 => credited
Anmol => $1,700 => 14-08-2020 => debited
Alex => $100 => 16-08-2020 => debited


In [20]:
# Printing data with group names
for i in result:
    print(i.groupdict())

{'Name': 'Ajay', 'Type': 'credited', 'Amount': '$500', 'Date': '13-08-2020'}
{'Name': 'Anmol', 'Type': 'debited', 'Amount': '$1,700', 'Date': '14-08-2020'}
{'Name': 'Alex', 'Type': 'debited', 'Amount': '$100', 'Date': '16-08-2020'}


## Character Set

1. A set is a bunch of characters inside a pair of square brackets `[ ]` with a special meaning.

In [0]:
str = "Analytics Vidhya is one of the largest data science communities"

#Check for the characters y, d, or h, in the above string
x = re.findall("[ydh]", str)

print(x)

['y', 'd', 'h', 'y', 'h', 'd']


In [0]:
str = "Analytics Vidhya is the one of the largest data science communities"

#Check for the characters between a and g, in the above string
x = re.findall("[a-g]", str)

print(x)

['a', 'c', 'd', 'a', 'e', 'e', 'f', 'e', 'a', 'g', 'e', 'd', 'a', 'a', 'c', 'e', 'c', 'e', 'c', 'e']


2. `[^...]` Matches any character __other than__ the characters listed after `^`

In [0]:
str = "Analytics Vidhya is one of the largest data sciece communities"

#Find every character other than y, d, or h

x = re.findall("[^ydh]", str)

print(x)

['A', 'n', 'a', 'l', 't', 'i', 'c', 's', ' ', 'V', 'i', 'a', ' ', 'i', 's', ' ', 'o', 'n', 'e', ' ', 'o', 'f', ' ', 't', 'e', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'c', 'e', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'i', 'e', 's']


In [0]:
str = "@AnalyticsVidhya"

x = re.findall("[^@]", str)

print(x)

['A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'V', 'i', 'd', 'h', 'y', 'a']


# <a id='Word_Boundaries'></a>Word Boundaries
* The metacharacter `\b` is an anchor just like the caret `^` and the dollar `$` sign. It matches at a position that is called a __word boundary__. This match is zero-length.
* `\b` is called 'boundary' and allows you to isolate words
* `\b` allows you to perform a __whole words only__ search using a regular expression in the form of `\bword\b`. A “word character” is a character that can be used to form words. All characters that are not __word characters__ are __non-word characters__

In [3]:
import re
mstr = 'This island is beautiful'
print(re.findall(r'\bis\b',mstr))
mo = re.search(r'\bis\b',mstr)
mo

['is']


<re.Match object; span=(12, 14), match='is'>

**\b** returns a match where the specified pattern is at the beginning or at the end of a word.

In [0]:
# Check if there is any word that ends with "ics"
x = re.findall(r"ics\b", "Analytics Vidhya is one of the largest data science communities")
print(x)

['ics']


In [0]:
# Check if there is any word that ends with "est"
x = re.findall(r"est\b", "Analytics Vidhya is one of the largest data science communities")
print(x)

['est']
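
For contrast with `\b`, the opposite sequence `\B` can be tried on the same kind of input. This sketch reuses the `r"\bain"` / `r"\Bain"` style patterns from the Special Sequences list on the sentence "The rain in Spain"; the variable names are illustrative only:

```python
import re

txt = "The rain in Spain"

starts_word = re.findall(r"\bain", txt)    # "ain" at the start of a word
ends_word = re.findall(r"ain\b", txt)      # "ain" at the end of a word
inside_start = re.findall(r"\Bain", txt)   # "ain" NOT at the start of a word
not_at_end = re.findall(r"ain\B", txt)     # "ain" NOT at the end of a word

print(starts_word)    # []
print(ends_word)      # ['ain', 'ain']
print(inside_start)   # ['ain', 'ain']
print(not_at_end)     # []
```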


<br>https://www.udemy.com/course/automate/learn/lecture/3465866#overview
<br> <b> You do not need to create a regex object with re.compile() every time; you can directly call the functions re.search(pattern,string), re.findall(pattern,string), re.match(pattern,string), re.sub(pattern,replacement,string) </b>

<br>![](./Media/15_2.png)

In [1]:
import re

In [2]:
msg = 'Call me at 888-555-1011 tomorrow. 415-555-9999 is my office.'
ro = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')           # Pattern
mo = ro.search(msg)           # Search Pattern in the Message
mo.group()                  

'888-555-1011'

# <a id='Regex_Groups_and_the_Pipe_Character'></a>Regex Groups and the Pipe Character
<br>![](./Media/15_3.png)
* <b> findall() method returns a list of all matched patterns </b>
* <b> pipe | is similar to OR. Example: ([a-f]|[A-F]) will match any of the following characters: a, b, c, d, e, f, A, B, C, D, E, or F. </b>
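
The `([a-f]|[A-F])` example from the bullet above can be checked quickly. The input string is invented for this sketch, and the comparison with the single set `[a-fA-F]` is added only to show that both spellings pick the same characters here:

```python
import re

sample = "Cafe 9Z beD"

# Either alternative of the pipe may match at each position.
with_pipe = re.findall(r"([a-f]|[A-F])", sample)

# For single characters, one combined set gives the same result.
with_set = re.findall(r"[a-fA-F]", sample)

print(with_pipe)   # ['C', 'a', 'f', 'e', 'b', 'e', 'D']
print(with_set)    # ['C', 'a', 'f', 'e', 'b', 'e', 'D']
```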

In [5]:
str = 'Call me at 888-555-1011 tomorrow. 415-555-9999 is my office.'
ro = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
lo = ro.findall(str)
print(lo)

['888-555-1011', '415-555-9999']


In [3]:
ro = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')        
mo = ro.search('Call me at (888)-555-1011 tomorrow. 415-555-9999 is my office.')
print(mo.group())

415-555-9999


In [16]:
ro = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')        # parentheses are used for grouping
mo = ro.search('Call me at (888)-555-1011 tomorrow. 415-555-9999 is my office.')
print(mo.group())
print(mo.group(1))

415-555-9999
415


In [17]:
pb = re.compile(r'(\(\d\d\d\))-\d\d\d-\d\d\d\d')        # plain parentheses create the group; \( and \) match literal parentheses
mo = pb.search('Call me at (888)-555-1011 tomorrow. 415-555-9999 is my office.')
print(mo.group())
print(mo.group(1))

(888)-555-1011
(888)


In [18]:
pb = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')        # parentheses are used for grouping
mo = pb.search('Call me at 888-555-1011 tomorrow. 415-555-9999 is my office.')
print(mo.group(0))
print(mo.group(1))

888-555-1011
888


In [19]:
pb = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')        # parentheses are used for grouping
mo = pb.search('Call me at (888)-555-1011 tomorrow. 415-555-9999 is my office.')
print(mo.group(0))
print(mo.group(1))

415-555-9999
415


In [20]:
pb = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')        # parentheses are used for grouping
mo = pb.search('Call me at 888-555-1011 tomorrow. 415-555-9999 is my office.')
print(mo.group())
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))
print(mo.group(3))

888-555-1011
888-555-1011
888
555
1011


In [21]:
msg1 = 'Call me at (888)-555-1011 tomorrow. 415-555-9999 is my office.'

pb = re.compile(r'\(\d\d\d\)-(\d\d\d)-(\d\d\d\d)')        # you can escape the parenthesis
mo = pb.search(msg1)
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))

(888)-555-1011
555
1011


In [22]:
msg1 = 'Call me at (888)-555-1011 tomorrow. 415-555-9999 is my office.'

pb = re.compile(r'(\(\d\d\d\))-(\d\d\d)-(\d\d\d\d)')        # you can escape the parenthesis
mo = pb.search(msg1)
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))
print(mo.group(3))

(888)-555-1011
(888)
555
1011


In [23]:
msg2 = 'Batmobile at a loss'
p = re.compile(r'Bat(man|woman|mobile|singh)')  # pipe(|) is like 'or' operator
mo = p.search(msg2)
print(mo.group())                # Calling group() on None (no match) would raise an AttributeError
print(mo.group(1))     # To find out which suffix it matched, use group(1)

Batmobile
mobile


In [24]:
msg2 = 'Batmobile at a loss'
p = re.compile(r'Bat (man|woman|mobile|singh)')
mo = p.search(msg2)
mo.group()                # search() returned None here, so calling group() raises an AttributeError

AttributeError: 'NoneType' object has no attribute 'group'
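
One way to avoid the `AttributeError` above is to test the result before calling `group()`. A small sketch reusing the same pattern and message; the fallback text `'no match'` is just an illustrative choice:

```python
import re

msg2 = 'Batmobile at a loss'
p = re.compile(r'Bat (man|woman|mobile|singh)')   # note the space after "Bat"
mo = p.search(msg2)

# search() returns None when nothing matches, so check before using the result.
if mo is not None:
    result = mo.group()
else:
    result = 'no match'

print(result)   # no match
```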

<br>![](./Media/16_9.png)

In [1]:
Regex_Pattern = r'^([a-z])(\w)(\s)(\W)(\d)(\D)([A-Z])([a-zA-Z])([aeiouAEIOU])(\S)\1\2\3\4\5\6\7\8\9\10$'	# Do not delete 'r'.

import re

print(str(bool(re.search(Regex_Pattern, input()))).lower()) # input - ab #1?AZa$ab #1?AZa$

ab #1?AZa$ab #1?AZa$
true


In [6]:
bool(None)

False

# <a id='Non-Capturing_Groups'></a>Non-Capturing Groups

There are cases when we want to use groups but are not interested in extracting the information, i.e. in capturing the text matched inside the parentheses. An example is alternation.

Hence, to make a group non-capturing, we use the syntax __(?:pattern)__ instead of the plain (pattern) parentheses that we normally use for grouping.

Let's consider an example where we want to find the strings i love cats or i love dogs in the given text.

In [1]:
import re

In [2]:
txt = """
i love cats
i love dogs
"""

In [6]:
pattern = r'i love (cats|dogs)'
mo = re.findall(pattern,txt)
mo    # As we can see, the group captured part contains only cats or dogs instead of complete sentences.

['cats', 'dogs']

In [7]:
pattern = r'i love (?:cats|dogs)'   # Now grouping will not be captured               #VIMP
mo = re.findall(pattern,txt)
mo    

['i love cats', 'i love dogs']

# <a id='Repetition_in_Regex_Patterns'></a>Repetition in Regex Patterns and Greedy/Non-Greedy Matching

<br> `?` Zero or One (optional)
<br> `*` Greedily matches the expression to its left 0 or more times (append `?` for Non-Greedy matching). It tries to match the longest possible string that matches pattern.
<br> `+` Greedily matches the expression to its left 1 or more times (append `?` for Non-Greedy matching)
<br> `*?` or `+?` → Non-greedy matching
<br> `{x,y}?` → Non-greedy matching
<br> __Note__: Here this question mark is different from Zero or One (optional) question mark
<br> `{x}` - Exactly x times
<br> `{min,max}` - Matches at least min and at most max times
<br> `{min,}` - Matches at least min times with no max. Unbounded Maximum
<br> `{,max}` - Matches from zero up to max times. The minimum defaults to 0
<br>![](./Media/15_4.png)
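
The `{,max}` form can be sketched as follows. The pattern is anchored with `^` and `$` so that the whole string has to consist of the repeated group; the test strings are invented for the illustration:

```python
import re

p = re.compile(r'^(Ha){,2}$')   # {,2} means zero, one or two repetitions

results = {s: bool(p.search(s)) for s in ['', 'Ha', 'HaHa', 'HaHaHa']}

print(results)   # {'': True, 'Ha': True, 'HaHa': True, 'HaHaHa': False}
```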

In [27]:
# Parentheses put things in a group; with no quantifier, 'wo' must appear exactly once
p = re.compile(r'Bat(wo)man')
mo = p.search('The Adventure of Batwoman')
if mo is not None:
    print(mo.group())
    print(mo.group(1))
else:
    print(mo)

Batwoman
wo


In [29]:
# Parentheses put things in a group and '?' indicates the group can appear zero times or once
p = re.compile(r'Bat(wo)?man')
mo = p.search('The Adventure of Batman')
if mo is not None:
    print(mo.group())
else:
    print(mo)

Batman


In [14]:
# Parentheses put things in a group and '*' indicates the group can appear zero or more times
p = re.compile(r'Bat(wo)*man')
mo = p.search('The Adventure of Batman')
if mo is not None:
    print(mo.group())
else:
    print(mo)

Batman


In [12]:
# Parentheses put things in a group and '+' indicates the group must appear at least once
p = re.compile(r'Bat(wo)+man')
mo = p.search('The Adventure of Batwowowoman')
if mo is not None:
    print(mo.group())
else:
    print(mo)

Batwowowoman


In [15]:
p = re.compile(r'\(\d\d\d\)?-\d\d\d-\d\d\d\d')   # note: '?' here applies only to the closing '\)', not to the whole '(ddd)' part
mo = p.search('Call me at (415)-555-1011 tomorrow. Call me Tomorrow.')
print(mo.group())

(415)-555-1011


In [17]:
p = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
mo = p.search('Call me at 415-555-1011 tomorrow. Call me Tomorrow.')
print(mo.group())

415-555-1011


In [19]:
p = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
mo = p.search('Call me at 415-555-1011 tomorrow. Call me Tomorrow.')
print(mo.group())

415-555-1011


In [30]:
p = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
mo = p.search('Call me at (415)-555-1011 tomorrow. Call me Tomorrow.')
print(mo.group())

AttributeError: 'NoneType' object has no attribute 'group'

In [31]:
p = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
mo = p.search('Call me at 415-555-1011 tomorrow. Call me Tomorrow.')
print(mo.group())
print(mo.group(1))

415-555-1011
415


In [53]:
p = re.compile(r'\(\d\d\d\)-\d\d\d-\d\d\d\d')
mo = p.search('Call me at (415)-555-1011 tomorrow. Call me Tomorrow.')
print(mo.group())

(415)-555-1011


In [54]:
p = re.compile(r'Dinner\?')
mo = p.search('Call me at 415-555-1011 tomorrow. Dinner? Tomorrow.')
print(mo.group())

Dinner?


In [60]:
# Parentheses put things in a group and '*' indicates the group can appear zero or more times
p = re.compile(r'Bat(wo)*man')
mo = p.search('Batman')
if mo is not None:
    print(mo.group())
else:
    print(mo)

Batman


In [61]:
p = re.compile(r'Bat(wo)*man')
mo = p.search('Batwowowowoman')
if mo is not None:
    print(mo.group())
else:
    print(mo)

Batwowowowoman


In [62]:
p = re.compile(r'Bat(wo)*man')
mo = p.search('Batwoman')
if mo is not None:
    print(mo.group())
else:
    print(mo)

Batwoman


In [2]:
p = re.compile(r'(Ha){3}')
mo = p.search('This boy said HaHa')
if mo is not None:
    print(mo.group())
else:
    print(mo)

None


In [3]:
p = re.compile(r'(Ha){3}')
mo = p.search('This boy said HaHaHa')
if mo is not None:
    print(mo.group())
else:
    print(mo)

HaHaHa


In [7]:
p = re.compile(r'(la){3,5}')
mo = p.search('This is lalalalal')
mo.group()

'lalalala'

In [56]:
p = re.compile(r'(la){3,}')           # If you don't specify the end, then end is any number of times
mo = p.search('This is lalalalal') 
mo.group()

'lalalala'
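
The non-greedy variants listed at the top of this section can be compared on the same kind of input. A minimal sketch, assuming an invented string with five repetitions of "la":

```python
import re

song = 'lalalalala'   # "la" repeated five times

# Greedy: takes as many repetitions as allowed (up to 5).
greedy = re.search(r'(la){3,5}', song).group()

# Non-greedy: the trailing ? makes it stop at the minimum (3).
lazy = re.search(r'(la){3,5}?', song).group()

print(greedy)   # lalalalala
print(lazy)     # lalala
```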

# <a id='Regex_Character'></a>Regex Character Classes and the findall() Method
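* __If you put `^` first inside square brackets then it becomes a Negative Character Class__
* __Inside a Character Class / Negative Character Class, most metacharacters (such as `.`, `*`, `+`, `?`, `$`) as well as comma `,` and underscore `_` are treated as literal characters. The backslash `\` still works as an escape there, for example in `[\d_]`__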

<br>![](./Media/15_5.png)
<br>![](./Media/15_6.png)
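A minimal sketch of the two points above, using a made-up sample string: punctuation listed inside square brackets is matched literally, and a leading `^` negates the class.

```python
import re

# Hypothetical sample string, chosen only for illustration
sample = "a.b,c_d+e 42"

# Inside [...] the dot, comma, underscore and plus are ordinary characters
punct = re.findall(r"[.,_+]", sample)

# A leading ^ negates the class: every character that is NOT a lowercase letter
non_letters = re.findall(r"[^a-z]", sample)

print(punct)        # ['.', ',', '_', '+']
print(non_letters)  # ['.', ',', '_', '+', ' ', '4', '2']
```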

In [1]:
import re

In [2]:
p = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mobs = p.findall('Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.')  # Zero or One Group, returns List
mobs

['415-555-1011', '415-555-9999']

In [4]:
p = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
mobs = p.findall('Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.')  # Zero or One Group, returns List
mobs

['415', '415']

In [7]:
p = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')      
mobs = p.findall('Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.')  # Two or more Group, returns Tuples
mobs

[('415', '555-1011'), ('415', '555-9999')]

In [10]:
p = re.compile(r'\d+\s')
mobs = p.findall('Me 12 how are is 9 so what has happend 873 at 415-555-1011 tomorrow. 415-555-9999 is my Alexa.')  # Zero or One Group, returns List
mobs

['12 ', '9 ', '873 ', '1011 ', '9999 ']

In [21]:
p = re.compile(r'\D+')
mobs = p.findall('Me 12 how are is 9 so what has happend 873 at 415-555-1011 tomorrow. 415-555-9999 is my Alexa.')  # Zero or One Group, returns List
mobs

['Me ',
 ' how are is ',
 ' so what has happend ',
 ' at ',
 '-',
 '-',
 ' tomorrow. ',
 '-',
 '-',
 ' is my Alexa.']

In [11]:
p = re.compile(r'[1-4]')
mobs = p.findall('Me 12 how are is 9 so what has happend 873 at 415-555-1011 tomorrow. 415-555-9999 is my Alexa.')  # Zero or One Group, returns List
mobs

['1', '2', '3', '4', '1', '1', '1', '1', '4', '1']

In [12]:
p = re.compile(r'[aeiouAEIOU]')
mobs = p.findall('Me 12 how are is 9 so what has happend 873 at 415-555-1011 tomorrow. 415-555-9999 is my Alexa.')  # Zero or One Group, returns List
mobs

['e',
 'o',
 'a',
 'e',
 'i',
 'o',
 'a',
 'a',
 'a',
 'e',
 'a',
 'o',
 'o',
 'o',
 'i',
 'A',
 'e',
 'a']

In [19]:
p = re.compile(r'[aeiouAEIOU]{2}')
mobs = p.findall('Me 12 how are is 9 so what eat has happend 873 at 415-555-1011 tomorrow can aol. 415-555-9999 is my Alexa.')
mobs

['ea', 'ao']

In [26]:
p = re.compile(r'[^0-9]+')      # if you put ^ first inside square brackets then it becomes a Negative Character Class
mobs = p.findall('Me 12 how are is 9 so what eat has happend 873 at 415-555-1011 tomorrow can aol. 415-555-9999 is my Alexa.')
mobs

['Me ',
 ' how are is ',
 ' so what eat has happend ',
 ' at ',
 '-',
 '-',
 ' tomorrow can aol. ',
 '-',
 '-',
 ' is my Alexa.']

# <a id='Regex_Dot'></a>Regex . Dot * Star and Caret ^ / \$ Dollar Characters
<br> `.`      - Any single character except the newline. The dot is a wildcard character
<br> `.*`     - Zero or more characters except the newline
<br> ` .* `   - Greedy match (as many characters as possible)
<br> ` .*? `  - Non-greedy match (as few characters as possible)
<br> `^`      - Starts with
<br> `$`      - Ends with
<br> `^$`     - Matches an empty string
<br> `^Exact Match$` - Matches only when the whole string is exactly `Exact Match`. Very important
<br>![](./Media/15_7.png)
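The `^$` and `^Exact Match$` rows of the list above can be sketched as follows. The list of candidate strings is a made-up assumption; `re.fullmatch` is shown as an alternative that anchors both ends without writing `^` and `$`.

```python
import re

# Hypothetical candidate strings
lines = ["", "Exact Match", "An Exact Match here", "   "]

# ^$ succeeds only when nothing lies between the start and the end
empty = [s for s in lines if re.search(r"^$", s)]

# ^...$ succeeds only when the whole string equals the pattern
exact = [s for s in lines if re.search(r"^Exact Match$", s)]

# re.fullmatch requires the pattern to cover the entire string
exact_full = [s for s in lines if re.fullmatch(r"Exact Match", s)]

print(empty)       # ['']
print(exact)       # ['Exact Match']
print(exact_full)  # ['Exact Match']
```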

`^` starts with

In [0]:
text = "Data Science"   # named text so that the built-in str is not shadowed

#Check if the string starts with 'Data':
x = re.findall("^Data", text)

if (x):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x)  

Yes, the string starts with 'Data'


In [0]:
# try with a different string
str2 = "Big Data"

#Check if the string starts with 'Data':
x2 = re.findall("^Data", str2)

if (x2):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x2)  

No match


`$` ends with

In [0]:
text = "Data Science"

#Check if the string ends with 'Science':

x = re.findall("Science$", text)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

Yes, the string ends with 'Science'


In [0]:
text = "Big Data"

#Check if the string ends with 'Science':

x = re.findall("Science$", text)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

No match


In [4]:
p = re.compile('^Hello')
mo = p.search('Hello, World')
mo.group()

'Hello'

In [5]:
p = re.compile('^Hello')
mo = p.search('This the World and say Hello')
mo.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [9]:
p = re.compile(r'World$')
mo = p.search('This the World and say Hello World')
mo.group()

'World'

In [10]:
p = re.compile(r'^\d$')
mo = p.search('9547989113')
mo.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [11]:
p = re.compile(r'^\d+$')
mo = p.search('9547989113')
mo.group()

'9547989113'

In [16]:
p = re.compile(r'^\d+$')
mo = p.search('9x547989113')
mo.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [14]:
p = re.compile(r'^\d$')
mo = p.search('9')
mo.group()

'9'

In [17]:
# . Match any single character except the new line
p = re.compile(r'.')
mo = p.search('This is how is gonna be')
mo.group()

'T'

In [18]:
p = re.compile(r'.at')
mo = p.search('There is a cat in the hat, went on and sat on to the mat')
mo.group()

'cat'

In [26]:
p = re.compile(r'.at')
mo = p.findall('There is a cat in the hat, went on and sat on to the mat')
mo

['cat', 'hat', 'sat', 'mat']

In [30]:
p = re.compile(r'.{1,2}at')   # One or two characters followed by 'at'
mo = p.findall('There is a cat in the hat, went on and sat on to the mat saat')
mo

[' cat', ' hat', ' sat', ' mat', 'saat']

<br> `*?` or `+?` → Non-greedy matching
<br> `{x,y}?` → Non-greedy matching
<br> __Note__: This question mark follows a quantifier and makes it non-greedy. It is different from the question mark that means zero or one (optional)
<br>![](./Media/15_11.png)
<br>![](./Media/15_12.png)
<br>![](./Media/15_13.png)
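A small sketch of the difference between the two question marks, on made-up input: after a quantifier such as `{3,5}` the `?` switches to non-greedy matching, while after a group it means "optional".

```python
import re

# Hypothetical input with five repetitions of 'Ha'
text = "HaHaHaHaHa"

# Greedy: takes as many repetitions as allowed (five)
greedy = re.search(r"(Ha){3,5}", text).group()

# Non-greedy: stops at the minimum number of repetitions (three)
lazy = re.search(r"(Ha){3,5}?", text).group()

# Here ? follows a group, so it means zero or one occurrence of 'wo'
optional = re.search(r"Bat(wo)?man", "Batman").group()

print(greedy)    # HaHaHaHaHa
print(lazy)      # HaHaHa
print(optional)  # Batman
```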

In [33]:
# .* - Zero or more of any character except newline, i.e., anything whatsoever
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
print(mo)
print(mo.group(1))
print(mo.group(2))

<re.Match object; span=(0, 34), match='First Name: Al Last Name: Sweigart'>
Al
Sweigart


In [36]:
# .* - Zero or more of any character except newline, i.e., anything whatsoever
nameRegex = re.compile(r'First Name: .* Last Name: .*')
mo = nameRegex.search('First Name: Al Last Name: Sweigart')
mo.group()

'First Name: Al Last Name: Sweigart'

In [37]:
# Greedy Match Example
greedyRegex = re.compile(r'<.*>')
mo = greedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man> for dinner.>'

In [38]:
# Non-Greedy Match Example. Matches as little as possible
nongreedyRegex = re.compile(r'<.*?>')
mo = nongreedyRegex.search('<To serve man> for dinner.>')
mo.group()

'<To serve man>'

In [41]:
# . matches anything except newline
p = re.compile(r'.*')
dotstar = p.search('Serve the public trust.\nProtect the innocent.\nUphold the law.')
dotstar

<re.Match object; span=(0, 23), match='Serve the public trust.'>

In [45]:
# re.DOTALL matches everything including newline
newlineRegex = re.compile(r'.*',re.DOTALL)  # The second argument tells the compiler to let '.' match the newline as well
newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()  # calling .group() directly instead of storing the match in mo first

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

In [47]:
# Without re.IGNORECASE only the lowercase vowels are matched
regex1 = re.compile(r'[aeiou]')
regex1.findall('RoboCop is a awesome cop')

['o', 'o', 'o', 'i', 'a', 'a', 'e', 'o', 'e', 'o']

In [49]:
# re.IGNORECASE
regex1 = re.compile(r'[aeiou]',re.IGNORECASE)         # you can also mention re.I
regex1.findall('THE ROBOCOPI is a awesome cop')

['E', 'O', 'O', 'O', 'I', 'i', 'a', 'a', 'e', 'o', 'e', 'o']

In [49]:
# re.I
regex1 = re.compile(r'[aeiou]',re.I)
regex1.findall('THE ROBOCOPI is a awesome cop')

['E', 'O', 'O', 'O', 'I', 'i', 'a', 'a', 'e', 'o', 'e', 'o']

# <a id='Regex_sub'></a>Regex sub() Method and Verbose Mode. Substituting Strings with the sub() Method
sub() - Similar to the find-and-replace feature in Word
<br>![](./Media/15_8.png)

In [56]:
p = re.compile(r'Agent \w+')
mo = p.findall('Agent Alice gave the secret documents to Agent Bob.')
mo

['Agent Alice', 'Agent Bob']

In [58]:
p = re.compile(r'Agent (\w+)')
mo = p.findall('Agent Alice gave the secret documents to Agent Bob.')
mo

['Alice', 'Bob']

In [59]:
p = re.compile(r'Agent (\w)')
mo = p.findall('Agent Alice gave the secret documents to Agent Bob.')
mo

['A', 'B']

In [57]:
p = re.compile(r'Agent \w+')
mo = p.sub('CENSORED','Agent Alice gave the secret documents to Agent Bob.')
mo

'CENSORED gave the secret documents to CENSORED.'

In [65]:
p = re.compile(r'Agent (\w)\w*')
mo = p.findall('Agent Alice gave the secret documents to Agent Bob.')
mo

['A', 'B']

In [66]:
# Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.”
p = re.compile(r'Agent (\w)\w*')
mo = p.sub(r'Agent \1****','Agent Alice gave the secret documents to Agent Bob.')
mo

'Agent A**** gave the secret documents to Agent B****.'

### re.VERBOSE makes complex regular expressions easier to read and understand, for us as well as for others

In [67]:
p = re.compile(r'''
\d\d\d    # Area Code
-
\d\d\d    # City Code
-
\d\d\d\d  # Mobile Number
''',re.VERBOSE)
mo = p.findall('Me 12 how are is 9 so what eat has happend 873 at 415-555-1011 tomorrow can aol. 415-555-9999 is my Alexa.')
mo

['415-555-1011', '415-555-9999']

### Use the pipe ( | ) operator (bitwise OR operator) to combine multiple re flags

In [68]:
p = re.compile(r'''
\d\d\d    # Area Code
-
\d\d\d    # City Code
-
\d\d\d\d  # Mobile Number
''',re.DOTALL | re.IGNORECASE | re.VERBOSE)
mo = p.findall('Me 12 how are is 9 so what eat has happend 873 at 415-555-1011 tomorrow can aol. 415-555-9999 is my Alexa.')
mo

['415-555-1011', '415-555-9999']

# <a id='Regex_Example_Program'></a>Regex Example Program: A Phone and Email Scraper
* pip install pyperclip

In [None]:
#! python3
import re
import pyperclip

# 1. Get the Data
text = pyperclip.paste() # Make sure you have copied the contents of examplePhoneEmailDirectory.pdf
# 2. Prepare Phone Regex Object and Extract the Phone Results
# A raw string keeps the backslashes intact; re.VERBOSE ignores the whitespace inside the pattern
phonero = re.compile(r'''
                     (((\()?\d\d\d(\))?)?(-)?(\s)?
                      \d\d\d - \d\d\d\d((\s)?(ext|x)(\.)?(\s)?\d{2,5})?)
                       ''',re.VERBOSE | re.IGNORECASE)
phoneresults = phonero.findall(text)
# findall() returns one tuple per match; index 0 holds the whole phone number
phoneNos = [groups[0] for groups in phoneresults]
# 3. Prepare Email Regex Object and Extract the Email Results
emailro = re.compile(r'[a-zA-Z0-9.+_]+@[a-zA-Z0-9.+_]+')
emailresults = emailro.findall(text)
# 4. Join Phone and Email Results (one entry per line) and Print Them
results = '\n'.join(phoneNos + emailresults)
print(results)
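To try the same idea without the clipboard, here is a simplified sketch that runs on a literal string. The sample text, the variable names and the reduced phone pattern (dashed numbers only, optional extension) are assumptions made for illustration; this is not the scraper above.

```python
import re

# Hypothetical sample text standing in for the clipboard contents
sample = ("Call 415-555-1011 or 800-420-7240 x123; "
          "write to info@example.com or sales@example.org today")

phone_re = re.compile(r"""
    \d{3}-\d{3}-\d{4}                 # area code, exchange, line number
    (?:\s*(?:ext|x)\.?\s*\d{2,5})?    # optional extension such as 'x123' or 'ext. 42'
""", re.VERBOSE | re.IGNORECASE)
email_re = re.compile(r"[a-zA-Z0-9.+_]+@[a-zA-Z0-9.+_]+")

# No capturing groups are used, so findall() returns plain strings
phones = phone_re.findall(sample)
emails = email_re.findall(sample)

print("\n".join(phones + emails))
```

Using non-capturing groups `(?:...)` keeps `findall()` returning whole matches, which avoids the extra step of picking index 0 out of each tuple.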

# <a id='Zero-Width_Assertions'></a>Zero-Width Assertions
* Characters which indicate positions rather than actual content are called zero-width assertions.
* For instance, the caret symbol (^) stands for the beginning of a line and the dollar sign ($) for the end of a line.
* They assert a condition without consuming characters; they only report whether the match succeeds or fails at that position.
* A more powerful kind of zero-width assertion is Lookaround (i.e., Lookahead and Lookbehind)
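The "zero-width" part can be seen directly in the match spans. A minimal sketch on a made-up string: each assertion matches at a position, so the start and end of its span are equal.

```python
import re

text = "hello world"   # hypothetical sample, 11 characters long

start = re.search(r"^", text).span()   # position 0, nothing consumed
end = re.search(r"$", text).span()     # position 11, nothing consumed

# \b is also zero-width: it matches at every word boundary
boundaries = [m.start() for m in re.finditer(r"\b", text)]

print(start)       # (0, 0)
print(end)         # (11, 11)
print(boundaries)  # [0, 5, 6, 11]
```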

## Lookaround - Lookahead and Lookbehind, collectively called Lookaround

__Lookaround__ is a mechanism that, during the matching process, looks ahead of the current position (or behind it, depending on the type of lookaround used) to see whether some pattern matches before continuing with the actual match.

The most important thing to understand here is that __look around__ mechanism consists of 2 parts:
* __actual expression__: an expression whose match constitutes the final result.
* __non-consuming expression__: an expression whose match is evaluated before or after the actual expression, just to see if it can succeed. It is not actually consumed by the regex engine.

There are 2 main categories of lookaround which, in turn, have 2 sub-categories each.
<br>![](./Media/16_7.png)
<br>![](./Media/16_8.png)
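As a compact sketch of the four lookaround forms (the sample string and the patterns are illustrative assumptions), note that in every case only the digits are returned; the surrounding marker is checked but not consumed.

```python
import re

# Hypothetical sample: one amount prefixed by '$', one followed by ' USD'
text = "price: $100, cost: 200 USD"

pos_ahead = re.findall(r"\d+(?=\sUSD)", text)       # digits followed by ' USD'
neg_ahead = re.findall(r"\b\d+\b(?!\sUSD)", text)   # whole numbers NOT followed by ' USD'
pos_behind = re.findall(r"(?<=\$)\d+", text)        # digits preceded by '$'
neg_behind = re.findall(r"(?<!\$)\b\d+", text)      # whole numbers NOT preceded by '$'

print(pos_ahead)   # ['200']
print(neg_ahead)   # ['100']
print(pos_behind)  # ['100']
print(neg_behind)  # ['200']
```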

## Lookahead
Lookahead checks the match for a non-consuming expression ahead of a given pattern.

### Positive Lookahead
Positive Lookahead will succeed if the passed non-consuming expression does match against the forthcoming input.

The syntax is A(?=B) where A is the actual expression and B is the non-consuming expression.

In [4]:
txt = 'i love python, i love regex'
mo = re.search(r'love(?=\sregex)',txt)
mo

<re.Match object; span=(17, 21), match='love'>

In [9]:
txt = "My favorite colors are red, green, and blue."
pattern = re.compile(r'\w+(?=,|\.)')
mo = re.findall(pattern,txt)
mo

['red', 'green', 'blue']

### Negative Lookahead
Negative Lookahead will succeed if the passed non-consuming expression does not match against the forthcoming input.

The syntax is A(?!B) where A is the actual expression and B is the non-consuming expression.

In [5]:
txt = 'i love python, i love regex'
mo = re.search(r'love(?!\sregex)',txt)
mo

<re.Match object; span=(2, 6), match='love'>

In [10]:
txt = "My favorite colors are red, green, and blue."
pattern = re.compile(r'\w+(?!,|\.)')
mo = re.findall(pattern,txt)
mo

['My', 'favorite', 'colors', 'are', 're', 'gree', 'and', 'blu']

In [3]:
txt = "My favorite colors are red, green, and blue."
ro = re.compile(r'\b\w+(?!,|\.)\b')             # \b specifies it has to be full word
mo = ro.findall(txt)
mo

['My', 'favorite', 'colors', 'are', 'and']

## Lookbehind
Lookbehind checks the match for a non-consuming expression behind a given pattern.

### Positive Lookbehind
Positive Lookbehind will succeed if the passed non-consuming expression does match against the text immediately preceding the current position.

The syntax is (?<=B)A where A is the actual expression and B is the non-consuming expression.

In [14]:
# Let's assume that we want to find a match for regex in the given text only if it is preceded by love or hate.
txt = "love regex or hate regex, can't ignore regex"
pattern = re.compile(r'(?<=love\s|hate\s)regex')
mo = re.findall(pattern,txt)
mo

['regex', 'regex']
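One constraint worth knowing: Python's `re` module only accepts lookbehind expressions of a fixed width. The sketch below (variable names are illustrative) shows that the pattern from the cell above compiles because both alternatives are five characters wide, while a variable-width lookbehind is rejected at compile time.

```python
import re

# 'love\s' and 'hate\s' both match exactly five characters, so this compiles
fixed = re.compile(r"(?<=love\s|hate\s)regex")
found = fixed.findall("love regex or hate regex, can't ignore regex")

# \s+ can match one or more characters, so the lookbehind width is not fixed
try:
    re.compile(r"(?<=love\s+)regex")
    variable_ok = True
except re.error:
    variable_ok = False

print(found)        # ['regex', 'regex']
print(variable_ok)  # False
```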

### Negative Lookbehind
Negative Lookbehind will succeed if the passed non-consuming expression does not match against the text immediately preceding the current position.

The syntax is (?<!B)A where A is the actual expression and B is the non-consuming expression.

In [3]:
# Let's assume that we want to find a match for regex in the given text if it is not preceded by love or hate.
txt = "love regex or hate regex, can't ignore regex"
pattern = re.compile(r'(?<!love\s|hate\s)regex')
mo = re.search(pattern,txt)
print(mo)

<re.Match object; span=(39, 44), match='regex'>


## Solve Some Queries
Let us try solving some queries that we are likely to come across while working with real world text datasets.

### Eliminating Unwanted Terms

In [0]:
text = "@AV a Data Science community #AV!!"

# Remove every character that is not a letter or a space
x = re.sub("[^a-zA-Z ]", "", text)

print(x)

AV a Data Science community AV


In [0]:
text = "@AV a Data Science community #AV!!"

# Remove words that start with a special character
x = re.sub(r"[^a-zA-Z ]\w+", "", text)

print(x)

 a Data Science community !!


### Finding Email IDs

In [0]:
text = 'Send a mail to rohan.1997@gmail.com, smith_david34@yahoo.com and priya@yahoo.com about the meeting @2PM'
  
# \w matches any alphanumeric character or underscore
# + repeats the preceding item one or more times
x = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', text)     
  
# Printing of List 
print(x) 

['rohan.1997@gmail.com', 'smith_david34@yahoo.com', 'priya@yahoo.com']
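The pattern above only finds addresses that end in `.com`. A looser sketch that also accepts other top-level domains and sub-domains might look like this. The pattern and the extra addresses are assumptions for illustration and fall well short of full email validation.

```python
import re

# Hypothetical text with addresses on different domains
text = ("Write to rohan.1997@gmail.com, dev@example.org or "
        "admin@mail.example.co.in about the meeting @2PM")

# local part, '@', dot-separated domain labels, then a top-level domain of 2+ letters
email_re = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")
emails = email_re.findall(text)

print(emails)
# ['rohan.1997@gmail.com', 'dev@example.org', 'admin@mail.example.co.in']
```

The trailing `@2PM` is skipped because nothing that could serve as a local part precedes the `@`.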


# <a id='Using_RegEx'></a>Using RegEx on Real-World Dataset

## Table of Contents
 1. About the Dataset
 2. Regex for Cleaning Text Data 
 3. Regex for Text Data Extraction
 4. Regex Challenge


## 1. About the Dataset

In [2]:
import pandas as pd 

#Loading the dataset
df = pd.read_csv("./Media/data/tweets.csv", encoding = "ISO-8859-1")

# Printing first 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,X,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,1,1,RT @rssurjewala: Critical question: Was PayTM ...,False,0,,2016-11-23 18:40:30,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331,True,False
1,2,2,RT @Hemant_80: Did you vote on #Demonetization...,False,0,,2016-11-23 18:40:29,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",PRAMODKAUSHIK9,66,True,False
2,3,3,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0,,2016-11-23 18:40:03,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12,True,False
3,4,4,RT @ANI_news: Gurugram (Haryana): Post office ...,False,0,,2016-11-23 18:39:59,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",deeptiyvd,338,True,False
4,5,5,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0,,2016-11-23 18:39:39,False,,8.014954e+17,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120,True,False


In [3]:
# Looking at some Tweets
for index, tweet in enumerate(df["text"][10:15]):
    print(index+1,".",tweet)

1 . Many opposition leaders are with @narendramodi on the #Demonetization 
And respect their decision,but support opposition just b'coz of party
2 . RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
3 . @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders.
4 . RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
5 . RT @sona2905: When I explained #Demonetization to myself and tried to put it down in my words which are not laced with any heavy technical


## 2. Regex for Cleaning Text Data

In [1]:
import re

### a. Removing `RT`

In [3]:
# Removing RT from a single Tweet
text = "RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r"
clean_text = re.sub('^RT ','', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
Text after:
 @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r


In [4]:
# Tweets before removal
df['text'].head()

0    RT @rssurjewala: Critical question: Was PayTM ...
1    RT @Hemant_80: Did you vote on #Demonetization...
2    RT @roshankar: Former FinSec, RBI Dy Governor,...
3    RT @ANI_news: Gurugram (Haryana): Post office ...
4    RT @satishacharya: Reddy Wedding! @mail_today ...
Name: text, dtype: object

In [4]:
# Removing RT from all the tweets
df['text']=df['text'].apply(lambda x: re.sub('^RT ','',x))

In [5]:
# Tweets after removal
df['text'].head()

0    @rssurjewala: Critical question: Was PayTM inf...
1    @Hemant_80: Did you vote on #Demonetization on...
2    @roshankar: Former FinSec, RBI Dy Governor, CB...
3    @ANI_news: Gurugram (Haryana): Post office emp...
4    @satishacharya: Reddy Wedding! @mail_today car...
Name: text, dtype: object

### b. Removing `<U+..>` like symbols

In [6]:
# Removing <U+..> like symbols from a single tweet
text = "@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"
clean_text = re.sub(r'<U\+[A-Z0-9]+>','', text)   # raw string so that \+ reaches the regex engine unchanged

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders
Text after:
 @Jaggesh2 Bharat band on 28??<ed><ed>Those who  are protesting #demonetization  are all different party leaders


**Note** that although we have gotten rid of most of the symbols, `<ed>` is still present. I leave this as an exercise for you to try out. 

In [7]:
# Removing <U+..> like symbols from all the tweets
df['text']=df['text'].apply(lambda x: re.sub(r'<U\+[A-Z0-9]+>', '', x))

In [8]:
# Removing <ed> from text
df['text'] = df['text'].apply(lambda x: re.sub(r'<ed>','',x))

In [9]:
for i in df['text'][10:15]:
    print(i)

Many opposition leaders are with @narendramodi on the #Demonetization 
And respect their decision,but support opposition just b'coz of party
@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
@Jaggesh2 Bharat band on 28??Those who  are protesting #demonetization  are all different party leaders.
@Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
@sona2905: When I explained #Demonetization to myself and tried to put it down in my words which are not laced with any heavy technical
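Both kinds of debris can also be removed in a single pass by joining the two patterns with the pipe character. A minimal sketch on one tweet from the dataset (the names `debris_re` and `clean` are illustrative):

```python
import re

tweet = ("@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>"
         "Those who  are protesting #demonetization")

# Either a <U+....> code or a literal <ed> marker
debris_re = re.compile(r"<U\+[A-Z0-9]+>|<ed>")
clean = debris_re.sub("", tweet)

print(clean)
# @Jaggesh2 Bharat band on 28??Those who  are protesting #demonetization
```

Compiling the pattern once and reusing it is a reasonable choice when the same substitution is applied to every row of a column.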


### c. Replacing `&amp;` with the `&`

In [11]:
# Replacing &amp; with & in a single tweet
text = "RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney"
clean_text = re.sub(r'&amp;','&', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney
Text after:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response & Commitment in fight against Blackmoney


In [10]:
# Replacing &amp; with & in all the tweets (the semicolon is part of the entity and must be matched too)
df['text']=df['text'].apply(lambda x: re.sub(r'&amp;', '&', x))
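If the text may contain other HTML entities besides `&amp;`, the standard library's `html.unescape` is an alternative to writing one regex per entity. A small sketch on a made-up string:

```python
import html

# Hypothetical tweet text containing several HTML entities
tweet = "Amazing response &amp; Commitment &lt;3 &quot;in fight&quot;"
clean = html.unescape(tweet)

print(clean)   # Amazing response & Commitment <3 "in fight"
```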

## 3. Regex for Text Data Extraction
### a. Extracting platform type of tweets

In [15]:
# Getting number of tweets per platform type
platform_count = df["statusSource"].value_counts()

In [16]:
platform_count

<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>                    7642
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                                      2548
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>                      2093
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>                      492
<a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>                                     263
                                                                                                        ... 
<a href="http://www.ns-madenza.com/tweetpress/wales/specials/pombuzz" rel="nofollow">Pom July AI</a>       1
<a href="https://www.punjabupdate.com" rel="nofollow">PB Update</a>                                        1
<a href="http://desi.buzz" rel="nofollow">Desi Buzz</a>                                                    1
<a href="http://www

In [17]:
#List platforms that have more than 100 tweets
top_platforms = platform_count[platform_count>100]
top_platforms

<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>    7642
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                      2548
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>      2093
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      492
<a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>                     263
<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M5)</a>                  178
<a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a>                    167
<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>        165
<a href="http://www.twitter.com" rel="nofollow">Twitter for Windows Phone</a>            139
<a href="http://onlywire.com/" rel="nofollow">OnlyWire / Official App</a>                136
<a href="http://www.twitter.com" rel="nofollow">Twitter for Windows</a

In [37]:
def platform_type(x):
    ser = re.search( r"android|iphone|web|windows|mobile|google|facebook|ipad|tweetdeck|onlywire", x, re.IGNORECASE)
    if ser:
        return ser.group()
    else:
        return None

#reset index of the series
top_platforms = top_platforms.reset_index()["index"]

#extract platform types
top_platforms.apply(lambda x: platform_type(x))

0       android
1           Web
2        iphone
3     tweetdeck
4        mobile
5        mobile
6      facebook
7          ipad
8       Windows
9      onlywire
10      Windows
11       mobile
12       google
Name: index, dtype: object
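Instead of searching for a fixed list of keywords, the visible label of each source can be captured from between `>` and `</a>`. The helper name `platform_label` is hypothetical; the three sample strings are taken from the output above.

```python
import re

sources = [
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]

def platform_label(source):
    """Return the link text of an <a> tag, or None when there is no match."""
    match = re.search(r">([^<]+)</a>", source)
    return match.group(1) if match else None

labels = [platform_label(s) for s in sources]
print(labels)   # ['Twitter for Android', 'Twitter Web Client', 'TweetDeck']
```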

### b. Extracting hashtags from the tweets

In [26]:
# Extract first hashtag from a tweet
text = "RT @Atheist_Krishna: The effect of #Demonetization !!\r\n. https://t.co/A8of7zh2f5"
hashtag = re.search(r'#\w+', text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtag.group())

Tweet:
 RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
Hashtag:
 #Demonetization


In [27]:
# Extract multiple hashtags from a tweet
text = """RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo"""
hashtags = re.findall(r'#\w+', text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtags)

Tweet:
 RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo
Hashtag:
 ['#Doltiwal', '#JaiChandKejriwal', '#Demonetization']


In [28]:
df['hashtags']=df['text'].apply(lambda x: re.findall(r'#\w+', x))

In [29]:
df[['text','hashtags']].head()

Unnamed: 0,text,hashtags
0,@rssurjewala: Critical question: Was PayTM inf...,[#Demonetization]
1,@Hemant_80: Did you vote on #Demonetization on...,[#Demonetization]
2,"@roshankar: Former FinSec, RBI Dy Governor, CB...",[#Demonetization]
3,@ANI_news: Gurugram (Haryana): Post office emp...,[#demonetization]
4,@satishacharya: Reddy Wedding! @mail_today car...,"[#demonetization, #ReddyWedding]"
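The table shows the same hashtag in different capitalisations (`#Demonetization`, `#demonetization`). A possible follow-up, sketched here on three shortened, hypothetical tweets, is to lowercase the tags before counting them:

```python
import re
from collections import Counter

# Hypothetical, shortened tweets
tweets = [
    "Did you vote on #Demonetization on Modi survey app?",
    "Post office employees provide cash exchange #demonetization",
    "Reddy Wedding! #demonetization #ReddyWedding",
]

# Lowercase every hashtag so that different spellings are counted together
counts = Counter(
    tag.lower()
    for tweet in tweets
    for tag in re.findall(r"#\w+", tweet)
)

print(counts.most_common(1))   # [('#demonetization', 3)]
```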


## 4. Regex Challenge

Now that you have learned all the concepts regarding regex and have also seen it in action, it's time for you to utilize that to solve a challenge all by yourself. Here are some of the tasks that you have to do - 

### a. Removing URLs from tweets

**Difficulty - Easy**

There are multiple URLs present in individual tweets' `text` and they don't necessarily provide useful information, so we can get rid of them. For example -  

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*


We can very well remove the URL as it isn't providing much useful information.


In [54]:
# Your Code Here
# \S+ stops at the next whitespace, so text after the URL is kept; s? also covers http links
df['text_nourls'] = df['text'].apply(lambda x: re.sub(r'https?://\S+','',x))

Tweet:
 @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
After:
 ['https://t.co/pYgK8Rmg7r']


### b. Extract Top 100 mentions

**Difficulty - Medium**

Many of the tweets have mentions of people in the form *@username*, for example see the following tweet - 

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*

Here *@Joydas* is a mention. You need to extract mentions from all the tweets and find which are the top 100 usernames.

In [11]:
# Your Code Here
# \w+ stops at the first character that is not a letter, digit or underscore (e.g. ':' or a space)
df['mentions'] = df['text'].apply(lambda x: re.findall(r'@\w+',x))

In [54]:
mentionsarray = df['mentions'].values
mentiondict = {}
for i in mentionsarray:
    for j in i:
        # get(j, 0) returns 0 for a mention seen for the first time, so its count starts at 1
        mentiondict[j] = mentiondict.get(j, 0) + 1

In [55]:
ndf = pd.DataFrame({'Mentions':list(mentiondict.keys()),'Count':list(mentiondict.values())})
ndf.sort_values(by='Count',ascending=False,inplace=True,ignore_index=True)

In [56]:
ndf.head()

Unnamed: 0,Mentions,Count
0,@evanspiegel,1311
1,@URautelaForever,1273
2,@narendramodi,1138
3,@gauravcsawant,541
4,@ModiBharosa,540


### Solution - 1

In [30]:
# Removing URLs from a single tweet
text='@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r'
re.sub(r'https?://[A-Za-z0-9./-]+','',text)   # the hyphen goes last so that it is a literal '-' and not a range

'@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy '

In [31]:
# Removing URLs from all the tweets
df['text']=df['text'].apply(lambda x: re.sub(r'https?://[A-Za-z0-9./-]+','',x))
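One detail worth checking in URL character classes of this kind is where the hyphen sits. Between two characters it defines a range; placed last it is a literal hyphen. The sketch below uses a made-up URL to compare the two spellings with a simpler `\S+` variant.

```python
import re

# Hypothetical URL containing hyphens
url_text = "see https://my-site.example.com/a-b now"

# '.-/' is read as the range from '.' to '/', which does not include '-'
range_form = re.findall(r"https?://[A-Za-z0-9.-/]+", url_text)

# With '-' placed last it is a literal hyphen
literal_form = re.findall(r"https?://[A-Za-z0-9./-]+", url_text)

# \S+ simply takes everything up to the next whitespace
non_space_form = re.findall(r"https?://\S+", url_text)

print(range_form)      # ['https://my']
print(literal_form)    # ['https://my-site.example.com/a-b']
print(non_space_form)  # ['https://my-site.example.com/a-b']
```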

### Solution - 2

In [32]:
# Function for extracting mentions from the tweet
def mention(x):
    found=re.findall(r'@\w+',x)               # @.*?(?=[:| ])
    if found:
        return found
    return None

In [33]:
# Extract mentions from all the tweets
arr=df['text'].apply(lambda x : mention(x))

In [34]:
arr

0                       [@rssurjewala]
1                         [@Hemant_80]
2                         [@roshankar]
3                          [@ANI_news]
4        [@satishacharya, @mail_today]
                     ...              
14935                [@saxenavishakha]
14936                             None
14937                [@bharat_builder]
14938          [@Stupidosaur, @Vidyut]
14939                        [@Vidyut]
Name: text, Length: 14940, dtype: object

In [35]:
# Combining all the mentions into a list
mentions_arr=[]

for x in arr:
    if x is not None:
        mentions_arr.extend(x)

In [36]:
mentions_arr[:10]

['@rssurjewala',
 '@Hemant_80',
 '@roshankar',
 '@ANI_news',
 '@satishacharya',
 '@mail_today',
 '@DerekScissors1',
 '@ambazaarmag',
 '@gauravcsawant',
 '@Joydeep_911']

In [37]:
# Getting top 100 mentions
mentions_count=pd.Series(mentions_arr).value_counts().head(100)

In [38]:
mentions_count

@evanspiegel        1311
@URautelaForever    1273
@narendramodi       1138
@gauravcsawant       541
@ModiBharosa         540
                    ... 
@hi_paresh            30
@sanjayuv             30
@rupasubramanya       30
@MinhazMerchant       29
@sardesairajdeep      29
Length: 100, dtype: int64
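The same counting can be sketched without pandas by using `collections.Counter` from the standard library. The tweets below are made up for illustration; `most_common(n)` returns the `n` most frequent mentions.

```python
import re
from collections import Counter

# Hypothetical tweets
tweets = [
    "@Joydas: Question in Narendra Modi App",
    "Many opposition leaders are with @narendramodi",
    "@Joydas and @narendramodi again, plus @ANI_news",
    "RT @Joydas: another one",
]

# Count every @mention across all tweets
mention_counts = Counter(m for t in tweets for m in re.findall(r"@\w+", t))
top = mention_counts.most_common(2)

print(top)   # [('@Joydas', 3), ('@narendramodi', 2)]
```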