**Day 10: Regular Expressions**

- A regular expression (regex) is a special sequence of characters that defines a search pattern for finding a string or set of strings.
- It can detect the presence or absence of a text by matching it against a particular pattern, and can also split a pattern into one or more sub-patterns.
- Python provides the re module for working with regular expressions. One of its core functions is re.search(), which takes a regular expression and a string and returns the first match.

In [28]:
import re
 
s = 'Hello All I am a Robot with lots of code'
 
match = re.search(r'All', s)
 
print('Start Index:', match.start())
print('End Index:', match.end())

Start Index: 6
End Index: 9


Metacharacter - Description
- \  - Drops the special meaning of the character following it
- [] - Represents a character class
- ^  - Matches the beginning of the string
- $  - Matches the end of the string
- .  - Matches any character except a newline
- |  - Means OR (matches any of the patterns separated by it)
- ?  - Matches zero or one occurrence
- *  - Matches any number of occurrences (including zero)
- +  - Matches one or more occurrences
- {} - Indicates the number of occurrences of the preceding regex to match
- () - Encloses a group of regex

- The backslash (\) makes sure that the character following it is not treated in a special way.
- This can be considered a way of escaping metacharacters.
- For example, if you search for a dot (.) in a string, the dot is treated as a special character because it is one of the metacharacters (as shown in the table above). Writing \. matches a literal dot instead.

In [3]:
import re
 
s = 'geeks.forgeeks'
 
# without using \
match = re.search(r'.', s)
print(match)
 
# using \
match = re.search(r'\.', s)
print(match)

<re.Match object; span=(0, 1), match='g'>
<re.Match object; span=(5, 6), match='.'>


- Square brackets ([]) represent a character class consisting of a set of characters that we wish to match.
- For example, the character class [abc] will match any single a, b, or c. 

- Caret (^) symbol matches the beginning of the string i.e. checks whether the string starts with the given character(s) or not.

- Dollar ($) symbol matches the end of the string, i.e. checks whether the string ends with the given character(s) or not.

- Dot (.) symbol matches any single character except the newline character (\n).

- Or (|) symbol works as the or operator: it checks whether the pattern before or after the | symbol is present in the string.

- Question mark (?) matches zero or one occurrence of the regex preceding it, i.e. the preceding part is optional.

- Star (*) symbol matches zero or more occurrences of the regex preceding the * symbol.

- Plus (+) symbol matches one or more occurrences of the regex preceding the + symbol.

- Braces {m,n} match from m to n repetitions (both inclusive) of the preceding regex.

- Parentheses () are used to group sub-patterns.
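The metacharacters described above can be tried out with a few short calls. This is a minimal sketch; the sample strings are made up for illustration and are not from the notes above.

```python
import re

# Character class: any single a, b, or c.
print(re.findall(r'[abc]', 'aabbcd'))            # ['a', 'a', 'b', 'b', 'c']

# Anchors: ^ matches at the start, $ at the end of the string.
print(bool(re.search(r'^Hello', 'Hello world')))  # True
print(bool(re.search(r'world$', 'Hello world')))  # True

# Dot: any single character except a newline.
print(re.findall(r'h.t', 'hat hot h\nt'))        # ['hat', 'hot']

# Or: the pattern on either side of | may match.
print(re.findall(r'cat|dog', 'cat, dog, bird'))  # ['cat', 'dog']

# Quantifiers: ? (zero or one), * (zero or more), + (one or more).
print(re.findall(r'colou?r', 'color colour'))    # ['color', 'colour']
print(re.findall(r'ab*', 'a ab abbb'))           # ['a', 'ab', 'abbb']
print(re.findall(r'ab+', 'a ab abbb'))           # ['ab', 'abbb']

# Braces: 2 to 3 repetitions of the preceding \d.
print(re.findall(r'\d{2,3}', '1 22 333 4444'))   # ['22', '333', '444']

# Group: parentheses capture sub-patterns.
print(re.search(r'(\w+)@(\w+)', 'user@example').groups())  # ('user', 'example')
```

Note that the quantifiers are greedy by default, which is why \d{2,3} takes '444' from '4444' and leaves a single '4' unmatched.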

- re.findall() - Returns all non-overlapping matches of the pattern in the string, as a list of strings.

In [31]:

# A Python program to demonstrate working of findall()
import re
 
# A sample text string where regular expression is searched.
string = """Hello I am a datascientist and I like to 56789 read books and code ml models and my number is 123456789"""
 
# A sample regular expression to find digits.
regex = r'\d+'
 
match = re.findall(regex, string)
print(match)

['56789', '123456789']


- re.compile() - Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

In [32]:
# The regular expression module is imported with a plain import statement.
import re
 
# compile() creates regular expression
# character class [a-e],
# which is equivalent to [abcde].
# class [abcde] will match with string with
# 'a', 'b', 'c', 'd', 'e'.
p = re.compile('[a-e]')
 
# findall() searches for the Regular Expression
# and return a list upon finding
print(p.findall("Hie I am Tanue Solanki"))

['e', 'a', 'a', 'e', 'a']


In [7]:
import re
 
# \d matches a single digit; for ASCII text it is equivalent to [0-9].
p = re.compile(r'\d')
print(p.findall("I got the code right at  10 A.M. on 1 Sept 2022"))
 
# \d+ matches a run of one or more
# consecutive digits
p = re.compile(r'\d+')
print(p.findall("I got the code right at  10 A.M. on 1 Sept 2022"))

['1', '0', '1', '2', '0', '2', '2']
['10', '1', '2022']


In [9]:

import re
 
# \w matches a word character; for ASCII text it is
# equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in very known language."))
 
# \w+ matches a run of one or more word characters.
p = re.compile(r'\w+')
print(p.findall("I check his address at 10 avenue link road , he \
said *** in known language."))
 
# \W matches any non-word character.
p = re.compile(r'\W')
print(p.findall("He said * in very,  known language."))

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 'v', 'e', 'r', 'y', 'k', 'n', 'o', 'w', 'n', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
['I', 'check', 'his', 'address', 'at', '10', 'avenue', 'link', 'road', 'he', 'said', 'in', 'known', 'language']
[' ', ' ', '*', ' ', ' ', ',', ' ', ' ', ' ', '.']


In [10]:
import re
 
# '*' matches zero or more occurrences
# of the preceding character ('d' here).
p = re.compile('cd*')
print(p.findall("cdascdascdascd"))

['cd', 'cd', 'cd', 'cd']
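The cells above only call findall() on the compiled pattern. As a small sketch of the other operations mentioned for pattern objects (searching and substitution), the same compiled object can be reused as follows; the name digits and the sample text are illustrative assumptions.

```python
import re

# Compile once, then reuse the pattern object's methods.
digits = re.compile(r'\d+')   # 'digits' is an illustrative name

text = 'Order 66 shipped on day 7'

m = digits.search(text)        # first match only
print(m.group(), m.span())     # 66 (6, 8)

print(digits.findall(text))    # ['66', '7']
print(digits.sub('#', text))   # Order # shipped on day #
print(digits.split(text))      # ['Order ', ' shipped on day ', '']
```

Compiling is mainly a convenience when the same pattern is applied many times; the module-level functions such as re.findall() accept the pattern string directly.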


- re.split()
    - Splits the string at each occurrence of a character or a pattern. The pieces of the string between the matches are returned as the resulting list.

In [None]:
# Syntax for re.split()
# re.split(pattern, string, maxsplit=0, flags=0)

In [11]:
from re import split
 
# '\W+' matches one or more non-alphanumeric characters.
# On finding ',' or whitespace ' ', split() splits the
# string at that point
print(split(r'\W+', 'Time, time , Time'))
print(split(r'\W+', "Time's time Time"))
 
# Here ':', ' ' and ',' are not alphanumeric, so these are
# the points where splitting occurs
print(split(r'\W+', 'On 1 Sept 2022, at 04:02 PM'))
 
# '\d+' matches one or more digits. Splitting occurs
# at '1', '2022', '04' and '02' only
print(split(r'\d+', 'On 1 Sept 2022, at 04:02 PM'))

['Time', 'time', 'Time']
['Time', 's', 'time', 'Time']
['On', '1', 'Sept', '2022', 'at', '04', '02', 'PM']
['On ', ' Sept ', ', at ', ':', ' PM']


In [14]:
import re
 
# With maxsplit=2, splitting occurs at most twice (at '1'
# and '2022'), so the returned list has length 3
print(re.split(r'\d+', 'On 1 Sept 2022, at 04:02 PM', maxsplit=2))
 
# 'Boy' and 'boy' will be treated the same when
# flags = re.IGNORECASE. This string has no uppercase
# A-F letters, so both calls give the same result.
print(re.split('[a-f]+', 'Hey, what you like, to play', flags=re.IGNORECASE))
print(re.split('[a-f]+', 'Hey, what you like, to play'))

['On ', ' Sept ', ', at 04:02 PM']
['H', 'y, wh', 't you lik', ', to pl', 'y']
['H', 'y, wh', 't you lik', ', to pl', 'y']


- re.sub()
    - The 'sub' in the function name stands for substitution. The regular expression pattern is searched for in the given string (3rd parameter), and each match is replaced by repl (2nd parameter). The optional count limits the number of replacements; the default of 0 replaces every occurrence.

In [33]:
import re
# Regular expression pattern 'ub' matches the
# string at "SUBject" and "Uber". As case is
# ignored through the flag, 'ub' matches twice.
# On matching, 'UB' in "SUBject" and 'Ub' in
# "Uber" are each replaced by '~*'.
print(re.sub('ub', '~*', 'SUBject has Uber booked already',
             flags=re.IGNORECASE))
 
# Consider the Case Sensitivity, 'Ub' in
# "Uber", will not be replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already'))
 
# As count has been given value 1, the maximum
# times replacement occurs is 1
print(re.sub('ub', '~*', 'Subject has Uber booked already',
             count=1, flags=re.IGNORECASE))
 
# 'r' before the pattern denotes a raw string; \s matches
# a whitespace character on either side of "AND".
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam',
             flags=re.IGNORECASE))

S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
Baked Beans & Spam


- re.subn()
    - subn() is similar to sub() in all ways, except in how it provides its output. It returns a tuple of the new string and the number of replacements made, rather than just the string.

In [16]:
#  re.subn(pattern, repl, string, count=0, flags=0)

import re
 
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
 
t = re.subn('ub', '~*', 'Subject has Uber booked already',
            flags=re.IGNORECASE)
print(t)
print(len(t))
 
# This will give same output as sub() would have
print(t[0])

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already


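- re.escape()
    - Returns the string with special characters backslashed (since Python 3.7, only characters that can have a special meaning in a regular expression are escaped). This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.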

In [18]:
import re
 
# escape() returns a string with a backslash '\'
# before every special character.
# In the 1st case only the space ' ' is escaped.
# In the 2nd case the space, '[', '-', ']', the tab
# (written \t) and the caret '^' are escaped;
# the comma is left as-is (Python 3.7+).
print(re.escape("This is Awesome even 1 AM"))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
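To show why escaping matters when searching for a literal string, here is a minimal sketch; the text and the literal '$5.00' are made-up examples, not taken from the cell above.

```python
import re

text = 'Total cost: $5.00 (approx.), not $5x00'
literal = '$5.00'   # illustrative literal containing the metacharacters $ and .

# Unescaped, '$' is an end-of-string anchor, so this pattern cannot match here.
print(re.findall(literal, text))    # []

# Escaped, every character is matched literally.
pattern = re.escape(literal)
print(pattern)                      # \$5\.00
print(re.findall(pattern, text))    # ['$5.00']
```

The escaped pattern also rejects '$5x00', which an unescaped dot would otherwise accept as "any character".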


- re.search()
    - This method returns either None (if the pattern doesn't match) or a re.Match object that contains information about the matching part of the string.

In [21]:
# A Python program to demonstrate working of re.search().
import re
 
# Lets use a regular expression to match a date string
# in the form of Month name followed by day number
regex = r"([a-zA-Z]+) (\d+)"
 
match = re.search(regex, "I got admission on Sept 2016")
 
if match is not None:
    # We reach here when the expression "([a-zA-Z]+) (\d+)"
    # matches the date string.
    print("Match at index %s, %s" % (match.start(), match.end()))

    # We use the group() method to get the full match and the
    # captured groups. The groups contain the matched values.
    # In particular:
    # match.group(0) always returns the fully matched string
    # match.group(1), match.group(2), ... return the capture
    # groups in order from left to right in the pattern
    # match.group() is equivalent to match.group(0)
    print("Full match: %s" % (match.group(0)))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))
else:
    print("The regex pattern does not match.")

Match at index 19, 28
Full match: Sept 2016
Month: Sept
Day: 2016


- The match.re attribute returns the regular expression object that produced the match, and the match.string attribute returns the string passed.

In [25]:

import re
 
s = "Welcome to the amazing world of coding"
 
# here res is the match object
res = re.search(r"\ba", s)
 
print(res.re)
print(res.string)

re.compile('\\ba')
Welcome to the amazing world of coding
