# Regular Expressions

A regular expression (RegEx) is a sequence of characters that specifies a search pattern in text. A RegEx can be used to check whether a string contains the specified pattern, and to find and replace the matching text.  
Regular expressions are very useful for data cleaning, input validation, scanning logs for errors, and similar tasks.  
Python's built-in `re` module provides the functions for working with regular expressions.

In [1]:
import re

## Functions
The `re` module offers a set of functions for searching a string for a match:

**search** - Returns a match object if there is a match anywhere in the string.  
**findall** - Returns a list containing all matches.  
**split** - Returns a list where the string has been split at each match.  
**sub** - Replaces one or more matches with a string.

## search()
`search()` is the most commonly used function.  
It takes two arguments, `pattern` and `string`.  
*pattern* is the pattern of text we are looking for.  
*string* is the text to search in (the target text).  
**Syntax:** `re.search(pattern, string)`

In [13]:
pattern = r"hello"    # use a raw string so backslashes in the pattern are not treated as string escapes
string = "I said hello"
re.search(pattern, string)

<re.Match object; span=(7, 12), match='hello'>

In [4]:
re.search(r"want to","I want to go home")

<re.Match object; span=(2, 9), match='want to'>

The *search* function scans the string for a match and returns a `Match` object if one is found; otherwise it returns `None`.  
If there is more than one match, only the first occurrence is returned.  
  
The printed `Match` object shows `span`, the range of indices of the matched text, and `match`, the exact text that was matched.  
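
Because `search()` returns `None` when nothing matches, it is usually worth checking the result before reading anything from it. A minimal sketch, where `first_span` is a helper name made up for this illustration:

```python
import re

def first_span(pattern, text):
    """Return the (start, end) span of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:        # search() found nothing, so there is no span to read
        return None
    return match.span()

print(first_span(r"hello", "I said hello"))         # (7, 12)
print(first_span(r"hello", "hello, hello again"))   # (0, 5): only the first occurrence
print(first_span(r"goodbye", "I said hello"))       # None
```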
  

## Metacharacters  
- Most letters and digits simply match themselves when used in a pattern, as shown above. Matching is case-sensitive by default, but case-insensitive matching can be enabled with a flag.
- Some characters are exceptions to this rule. These are called metacharacters, and they do not match themselves.
- Instead, they signal that something special should be matched, which makes patterns much easier to build.
- `Metacharacters :`  . ^ $ * + ? { } [ ] \ | ( )
- The better we know these characters, the more fluent we become at text matching.
  

### `[ ]`
- They’re used for specifying a character class, which is a set of characters that we wish to match. 
- Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. 
- For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. To match only lowercase letters, the RE would be [a-z].
- [0-9] matches any single digit from 0 to 9, so [0-9][0-9] matches any two-digit sequence. [A-Za-z0-9] matches any ASCII letter or digit, while [0-9A-Fa-f] matches any hexadecimal digit.
- Characters that are not in a set can be matched by complementing the set. If the first character of the set is `'^'`, all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
- Most metacharacters are not active inside classes. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it’s stripped of its special nature. The exceptions are \\, which still escapes, ^ as the first character, - between two characters, and the closing ].

In [17]:
# Both print the same result
print(re.search(r'[abc]','cab'))
print(re.search(r'[a-c]','cab'))

<re.Match object; span=(0, 1), match='c'>
<re.Match object; span=(0, 1), match='c'>


In [15]:
print(re.search(r'[^l]','Hello'))

<re.Match object; span=(0, 1), match='H'>
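
To make the difference between a few of these classes concrete, here is a small sketch. The sample strings and variable names are chosen only for illustration.

```python
import re

hex_digit = r'[0-9A-Fa-f]'   # hexadecimal digits only
alnum = r'[A-Za-z0-9]'       # any ASCII letter or digit

hex_match = re.search(hex_digit, 'xyz7')          # skips x, y, z and matches '7'
alnum_match = re.search(alnum, 'xyz7')            # matches 'x' straight away
dollar_match = re.search(r'[akm$]', 'cost: $5')   # '$' is a literal inside a class

print(hex_match)
print(alnum_match)
print(dollar_match)
```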


## \

- Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \\, you can precede them with a backslash to remove their special meaning: \\[ or \\\\.
- Some of the special sequences beginning with '\\' represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.
- **\\A** - Matches only at the start of the string. (Same as ^ unless the MULTILINE flag is set.)
- **\\b** - Matches the empty string, but only at the beginning or end of a word.
- **\\B** - Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b.
- **\\d** - Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].
- **\\D** - Matches any character that is not a decimal digit; for ASCII text this is equivalent to the class [^0-9].
- **\\s** - Matches any whitespace character; for ASCII text this is equivalent to the class [ \t\n\r\f\v].
- **\\S** - Matches any non-whitespace character; for ASCII text this is equivalent to the class [^ \t\n\r\f\v].
- **\\w** - Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
- **\\W** - Matches any character that is not alphanumeric or an underscore; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].
- **\\Z** - Matches only at the end of the string. (Unlike $, it does not match just before a trailing newline.)

In [66]:
print(re.search(r'\AI.*s','I travel 40kms to go to home'))
print(re.search(r'\btravel\b','I travel 40kms to go to home'))
print(re.search(r't.*t\B','I travel 40kms to go to home'))
print(re.search(r'\d','I travel 40kms to go to home'))
print(re.search(r'\D','I travel 40kms to go to home'))
print(re.search(r'\w','I travel 40kms to go to home'))
print(re.search(r'\s','I travel 40kms to go to home'))
print(re.search(r'\S','I travel 40kms to go to home'))
print(re.search(r'\W','I travel 40kms to go to home'))
print(re.search(r'I.*e\Z','I travel 40kms to go to home'))

<re.Match object; span=(0, 14), match='I travel 40kms'>
<re.Match object; span=(2, 8), match='travel'>
<re.Match object; span=(2, 22), match='travel 40kms to go t'>
<re.Match object; span=(9, 10), match='4'>
<re.Match object; span=(0, 1), match='I'>
<re.Match object; span=(0, 1), match='I'>
<re.Match object; span=(1, 2), match=' '>
<re.Match object; span=(0, 1), match='I'>
<re.Match object; span=(1, 2), match=' '>
<re.Match object; span=(0, 28), match='I travel 40kms to go to home'>
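
The anchors `\A` and `\Z` are close to `^` and `$`, but they are not interchangeable in every case. The sketch below uses a made-up two-line string to show where they differ: `$` also matches just before a trailing newline, and `^` matches at the start of every line once the MULTILINE flag is set.

```python
import re

text = "first line\nlast line\n"

dollar = re.search(r'line$', text)     # matches just before the trailing newline
upper_z = re.search(r'line\Z', text)   # None: the string ends with '\n', not 'line'
caret = re.findall(r'^\w+', text, re.MULTILINE)     # start of every line
upper_a = re.findall(r'\A\w+', text, re.MULTILINE)  # start of the string only

print(dollar)
print(upper_z)
print(caret)     # ['first', 'last']
print(upper_a)   # ['first']
```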


## Repeating Things
To repeat a pattern or match multiple characters, we use special metacharacters in the pattern.

### *
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

In [4]:
pattern = r'ca*t'
print(re.search(pattern,'ct'))        # zero 'a' characters
print(re.search(pattern,'cat'))       # one 'a' character
print(re.search(pattern,'caat'))      # two 'a' characters
print(re.search(pattern,'caaaaaat'))  # many 'a' characters

<re.Match object; span=(0, 2), match='ct'>
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(0, 4), match='caat'>
<re.Match object; span=(0, 8), match='caaaaaat'>


When repeating a RE, the matching engine tries to repeat it as many times as possible. If later portions of the pattern don't match, the engine backs up and tries again with one fewer repetition, continuing until it finds a match for the whole pattern or runs out of options. In the example below, [bcd]* first consumes 'bcbd', then gives back characters one at a time until the final b in the pattern can match.

In [7]:
print(re.search(r'a[bcd]*b', 'abcbd'))

<re.Match object; span=(0, 4), match='abcb'>


### +
Matches one or more repetitions.

In [10]:
pattern = r'ca+t'
print(re.search(pattern,'ct'))        # zero 'a' characters, returns None
print(re.search(pattern,'cat'))       # one 'a' character
print(re.search(pattern,'caat'))      # two 'a' characters
print(re.search(pattern,'caaaaaat'))  # many 'a' characters

None
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(0, 4), match='caat'>
<re.Match object; span=(0, 8), match='caaaaaat'>


### ?
Matches zero or one repetition of the preceding RE, which makes it optional.

In [15]:
print(re.search(r'ca?t','ct'))
print(re.search(r'ca?t','cat'))
print(re.search(r'ca?t','caat'))

<re.Match object; span=(0, 2), match='ct'>
<re.Match object; span=(0, 3), match='cat'>
None


### {m,n}
We can customize the number of repetitions: m is the minimum number of repetitions and n is the maximum. Either m or n can be omitted; omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity.  
{0,} is the same as `*`  
{1,} is the same as `+`  
{0,1} is the same as `?`

In [41]:
pattern = r'a/{1,3}b'             
print(re.search(pattern,'ab'))
print(re.search(pattern,'a/b'))
print(re.search(pattern,'a///b'))
print(re.search(pattern,'a////b'))

None
<re.Match object; span=(0, 3), match='a/b'>
<re.Match object; span=(0, 5), match='a///b'>
None
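
The equivalences listed above can be checked directly. In this sketch the `samples` list and the `matched` helper are invented for the comparison, and `re.fullmatch` is used so that each sample has to match in its entirety. The last two lines also show the exact-count form `{m}` and a range with m omitted.

```python
import re

samples = ['ct', 'cat', 'caat', 'caaat']

def matched(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matched(r'ca{0,}t') == matched(r'ca*t'))    # True
print(matched(r'ca{1,}t') == matched(r'ca+t'))    # True
print(matched(r'ca{0,1}t') == matched(r'ca?t'))   # True
print(matched(r'ca{2}t'))                         # ['caat']
print(matched(r'ca{,2}t'))                        # ['ct', 'cat', 'caat']
```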


## .
In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

In [44]:
pattern = r'b.ll'
print(re.search(pattern,'ball'))
print(re.search(pattern,'bill gates'))

<re.Match object; span=(0, 4), match='ball'>
<re.Match object; span=(0, 4), match='bill'>
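
The effect of the DOTALL flag is easiest to see with a string that contains a newline. A short sketch, using a made-up input:

```python
import re

text = "b\nll"   # a newline sits where the dot has to match

default_mode = re.search(r'b.ll', text)             # '.' refuses the newline
dotall_mode = re.search(r'b.ll', text, re.DOTALL)   # '.' now accepts the newline

print(default_mode)   # None
print(dotall_mode)
```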


## ^
Matches the start of the string. With the MULTILINE flag, it also matches immediately after each newline.

In [50]:
print(re.search(r"^t.*ed","Did tyler killed dodge",re.IGNORECASE))
print(re.search(r"^t.*ed","Tyler Killed dodge",re.IGNORECASE))

None
<re.Match object; span=(0, 12), match='Tyler Killed'>


## $
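Matches the end of the string, or just before a newline at the end of the string. With the MULTILINE flag, it also matches before each newline.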

In [53]:
print(re.search(r"one$","Lucy is one of the cutest",re.IGNORECASE))
print(re.search(r"one$","Lucy is the cutest one",re.IGNORECASE))

None
<re.Match object; span=(19, 22), match='one'>


## |
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.

In [57]:
print(re.search(r'(L|B)ook','Book a cab'))        # parentheses limit the alternation to the first letter
print(re.search(r'(L|B)ook','Look for the cab'))
print(re.search(r'man|woman','He is a man'))
print(re.search(r'man|woman','She is a woman'))

<re.Match object; span=(0, 4), match='Book'>
<re.Match object; span=(0, 4), match='Look'>
<re.Match object; span=(8, 11), match='man'>
<re.Match object; span=(9, 14), match='woman'>
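
One common slip is writing `|` inside square brackets. Inside a character class `|` is just a literal character, so `[L|B]` means "L, | or B" and is not an alternation. The sketch below, with made-up inputs, shows the difference:

```python
import re

in_class = re.search(r'[L|B]ook', '|ook')   # matches: '|' is a member of the class
in_group = re.search(r'(L|B)ook', '|ook')   # None: only 'L' or 'B' is allowed
plain_class = re.search(r'[LB]ook', 'Look') # the class alone is enough for single characters
grouped = re.search(r'gr(a|e)y', 'grey')    # alternation limited to one position

print(in_class)
print(in_group)
print(plain_class)
print(grouped)
```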


## ()
Used to create groups in the pattern, the contents of a group can be retrieved after a match has been performed.

In [63]:
a = re.search(r'([\w]+), (\w+)','Kinsey, locke')
print(a)
print(a[1])   #Group 1
print(a[2])   #Group 2

<re.Match object; span=(0, 13), match='Kinsey, locke'>
Kinsey
locke


## (?...)
This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the function. Inline flags must be placed at the start of the expression string.

In [71]:
print(re.search(r'(?i)ab','aB')) # i represents IgnoreCase

<re.Match object; span=(0, 2), match='aB'>


## (?P\<name>...)
Similar to regular parentheses, but the substring matched by the group is also accessible via the symbolic group name `name`.
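
As a rough sketch of how named groups can be read back, the date pattern and the group names `day`, `month` and `year` here are chosen only for illustration:

```python
import re

m = re.search(r'(?P<day>\d+)/(?P<month>\d+)/(?P<year>\d+)', 'Meet you on 28/04/2024')

print(m.group('year'))   # 2024
print(m['month'])        # 04
print(m.group(1))        # 28, named groups are still numbered
print(m.groupdict())     # {'day': '28', 'month': '04', 'year': '2024'}
```

`groupdict()` returns all named groups at once, which is handy when the result is passed on as a dictionary.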

In [25]:
exp = re.search(r'playing.*(?P<in_quotes>".*").*','I am playing "Cricket" in the ground')
print(exp)
print(exp.group("in_quotes"))

<re.Match object; span=(5, 36), match='playing "Cricket" in the ground'>
"Cricket"


## (?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Newton) will match 'Isaac ' only if it’s followed by 'Newton'.

In [102]:
pattern = r"Isaac (?=Newton)"
print(re.search(pattern, "Isaac Newton"))
print(re.search(pattern, "Isaac Asimov"))

<re.Match object; span=(0, 6), match='Isaac '>
None


## (?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Newton) will match 'Isaac ' only if it’s not followed by 'Newton'.

In [28]:
pattern = r"Isaac (?!Newton)"
print(re.search(pattern, "Isaac Newton"))
print(re.search(pattern, "Isaac Asimov"))

None
<re.Match object; span=(0, 6), match='Isaac '>


## (?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind looks back three characters and checks that they are 'abc'.

In [31]:
print(re.search(r'(?<=-)\w+','spam-egg'))

<re.Match object; span=(5, 8), match='egg'>


## (?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. 

In [34]:
print(re.search(r'(?<!-)\w+','spam-egg'))

<re.Match object; span=(0, 4), match='spam'>


## findall()
- Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
- The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern.
- If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

In [68]:
print(re.findall(r'\bf[a-z]*','which foot or hand fell fastest')) # returns a list of the words that start with f
print(re.findall(r'(\w+)=(\d+)','set width=20 and height=30')) # returns a list of tuples, one per match

['foot', 'fell', 'fastest']
[('width', '20'), ('height', '30')]
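
The way the result shape depends on the groups can be seen by running three variants of the same pattern over one string. This is only a sketch and the variable names are illustrative:

```python
import re

text = 'set width=20 and height=30'

no_group = re.findall(r'\w+=\d+', text)          # whole matches
one_group = re.findall(r'\w+=(\d+)', text)       # only the captured group
non_capturing = re.findall(r'(?:\w+)=(\d+)', text)  # (?:...) does not change the shape

print(no_group)        # ['width=20', 'height=30']
print(one_group)       # ['20', '30']
print(non_capturing)   # ['20', '30']
```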


## split()
Split string by the matches of the regular expression. If capturing parentheses are used in the RE, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.

In [72]:
print(re.split(r'\W+','Words, words, words.'))
print(re.split(r'(\W+)','Words, words, words.'))
print(re.split(r'\W+','Words, words, words.',maxsplit=1))

['Words', 'words', 'words', '']
['Words', ', ', 'words', ', ', 'words', '.', '']
['Words', 'words, words.']


## sub()
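- Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string by the replacement `repl`. If the pattern isn’t found, string is returned unchanged. The replacement can be a string, or a function that is called with each match object and returns the replacement string.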
- The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.

In [75]:
print(re.sub(r'(blue|red|white)', 'color', 'blue socks and red shoes'))
print(re.sub(r'(blue|red|white)', 'color', 'blue socks and red shoes',count=1))

color socks and color shoes
color socks and red shoes


In [85]:
def hexrepl(match):
    "Return the hex string for a decimal number"
    value = int(match.group())
    return hex(value)

print(re.sub(r'\d+', hexrepl, 'Call 65490 for printing, 49152 for user code.'))

Call 0xffd2 for printing, 0xc000 for user code.
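
A replacement string can also refer back to captured groups with `\1`, `\2` and so on. The sketch below reorders a date this way; the input sentence is made up for the example.

```python
import re

# \3, \2 and \1 in the replacement refer to the third, second and first group
reordered = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\2-\1', 'Meet you on 28/04/2024')
print(reordered)   # Meet you on 2024-04-28
```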


# Other Useful Functions

## match()
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

In [91]:
print(re.match(r'[a-z]+','this matches'))
print(re.match(r'[a-z]+','This return None'))

<re.Match object; span=(0, 4), match='this'>
None
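
The sketch below contrasts `match()` with `search()` on the same made-up string, and shows what a zero-length match looks like compared with `None`:

```python
import re

text = 'say hello'

found_anywhere = re.search(r'hello', text)   # search() scans the whole string
at_start = re.match(r'hello', text)          # None: 'hello' is not at the beginning
prefix = re.match(r'say', text)              # matches, since 'say' starts the string
zero_length = re.match(r'x*', text)          # matches zero 'x' characters at index 0

print(found_anywhere)
print(at_start)
print(prefix)
print(zero_length)   # a Match object with an empty match, not None
```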


## fullmatch()
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

In [95]:
print(re.fullmatch(r'[\w ]+','Whole string is matched'))
print(re.fullmatch(r'\w+','Whole string is not matched'))

<re.Match object; span=(0, 23), match='Whole string is matched'>
None


## finditer()
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

In [97]:
result = re.finditer(r'\bc[a-z]+','Can I catch you at cinema hall',re.I)
for i in result:
    print(i)

<re.Match object; span=(0, 3), match='Can'>
<re.Match object; span=(6, 11), match='catch'>
<re.Match object; span=(19, 25), match='cinema'>


## escape()
Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.

In [117]:
pattern = re.escape('https://www.python.org')  # . is a metacharacter
print(pattern)
print(re.search(pattern,"Please visit https://www.python.org"))

https://www\.python\.org
<re.Match object; span=(13, 35), match='https://www.python.org'>


## compile()
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.

In [125]:
pattern = re.compile(r'[a-z]+')
print(pattern)

re.compile('[a-z]+')
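
A compiled pattern is convenient when the same expression is applied to many strings. A small sketch, where the `lines` list is invented for the example:

```python
import re

word = re.compile(r'[a-z]+')   # compile once, reuse for every line

lines = ['hello world', '123', 'abc 456']
results = [word.findall(line) for line in lines]

for line, found in zip(lines, results):
    print(line, '->', found)
```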


## Methods of Regular Expression Objects
### search(), match(), fullmatch(), findall(), finditer()
These methods behave like the module-level functions of the same name. The only difference is that they accept optional index arguments for where the search should start and end.

In [128]:
print(pattern.search("hello"))
print(pattern.search("hello",1))

<re.Match object; span=(0, 5), match='hello'>
<re.Match object; span=(1, 5), match='ello'>
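
The end index works the same way as the start index. In this sketch the second and third arguments are the start and end positions; the variable names are chosen only for illustration.

```python
import re

pattern = re.compile(r'[a-z]+')

limited = pattern.search('hello', 1, 3)          # only looks at 'hello'[1:3]
from_six = pattern.findall('hello world', 6)     # starts scanning at index 6
match_at_one = pattern.match('hello', 1)         # match() anchors at the start index

print(limited)
print(from_six)       # ['world']
print(match_at_one)
```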


We can extract useful information from the Match object.

In [134]:
pattern = re.compile(r'(\d+)/(\d+)/(\d+)',re.I)
date = pattern.search("Meet you on 28/04/2024")
print(date)            # the Match object
print(date.group())    # the matched text
print(date.span())     # the (start, end) range of the matched text
print(date.start())    # the index where the match starts
print(date.end())      # the index just past the end of the match
print(date.groups())   # all the groups as a tuple

<re.Match object; span=(12, 22), match='28/04/2024'>
28/04/2024
(12, 22)
12
22
('28', '04', '2024')


On a Match object, group(0) returns the entire match, group(1) returns the first group, and so on, up to the number of groups in the pattern.

In [139]:
print(date.group(0))
print(date.group(1))
print(date.group(2))
print(date.group(3))

28/04/2024
28
04
2024


In [141]:
print(date.group(4))  # Error since group doesn't exist

IndexError: no such group

We can also use [] to access groups.

In [142]:
print(date[0])
print(date[1])

28/04/2024
28


# The End$