# Regex Python


https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/    

In [2]:
import re

In [2]:
s = 'foo123bar'

### re.search()

In [3]:
re.search('123', s)

if re.search('123', s):
    print ("Found")
else:
    print ("not found")

Found
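The object returned by a successful re.search() carries more than a yes/no answer. The following is a minimal sketch of what it exposes; the variable names `m` and `missing` are illustrative only.

```python
import re

s = 'foo123bar'
m = re.search('123', s)

# A successful search returns a match object, which is truthy.
if m:
    print(m.group())             # the text that matched
    print(m.span())              # (start, end) indexes into s
    print(s[m.start():m.end()])  # slicing with the span recovers the match

# A failed search returns None instead of raising an exception.
missing = re.search('999', s)
print(missing)
```

Because a failed search returns None, it is safest to test the result before calling any method on it.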


### Python Regex Metacharacters

In [4]:
s = 'foo123bar'

In [5]:
re.search('[0-9][0-9][0-9]', s)

<_sre.SRE_Match object; span=(3, 6), match='123'>

In [6]:
#  string that doesn’t contain three consecutive digits won’t match
print(re.search('[0-9][0-9][0-9]', '12foo34'))

None


#### The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard

In [8]:
s = 'foo123bar'
print(re.search('1.3', s))

s = 'foo13bar'
print(re.search('1.3', s))

<_sre.SRE_Match object; span=(3, 6), match='123'>
None



'.' matches any single character except a newline;

'?' matches zero or one repetition of the preceding regex;

'^' anchors the match at the start and '$' at the end of the string

For repetition ->

'*' is zero or more repetitions;

'+' is one or more repetitions

{}	Matches an explicitly specified number of repetitions


Special char -

\	 Escapes a metacharacter of its special meaning
    
[]	Specifies a character class

|	Designates alternation

()	Creates a group
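Alternation, the anchors, and the optional marker from the summary above are easy to try in isolation. This is a small sketch with made-up sample strings, not an exhaustive tour.

```python
import re

# Alternation: the regex matches if either side matches.
alt = re.search('foo|bar', 'xxbarxx')
print(alt.group())

# Anchors: ^ ties the match to the start, $ to the end of the string.
at_start = re.search('^foo', 'foobar')
not_at_start = re.search('^bar', 'foobar')
at_end = re.search('bar$', 'foobar')
print(at_start, not_at_start, at_end)

# ? makes the preceding character optional.
optional = [bool(re.search('colou?r', w)) for w in ('color', 'colour', 'colr')]
print(optional)
```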


### Character Class

Characters contained in square brackets ([  ]) represent a character class—an enumerated set of characters to match from. A character class metacharacter sequence will match any single character contained in the class.

In [3]:
re.search('ba[artz]', 'foobarqux')

<_sre.SRE_Match object; span=(3, 6), match='bar'>

In [4]:
re.search('[a-z]', 'FOObar')

<_sre.SRE_Match object; span=(3, 4), match='b'>

In [5]:
re.search('[0-9][0-9]', 'foo123bar')

<_sre.SRE_Match object; span=(3, 5), match='12'>

In [6]:
# identify a hex digit
re.search('[0-9a-fA-F]', '--- a0 ---')

<_sre.SRE_Match object; span=(4, 5), match='a'>

In [7]:
# In the following example, [^0-9] matches any character that isn’t a digit:

re.search('[^0-9]', '12345foo')

<_sre.SRE_Match object; span=(5, 6), match='f'>

In [9]:
# In the following example, ^[0-9] matches a digit only at the start of the string:

re.search('^[0-9]', '12345foo')

<_sre.SRE_Match object; span=(0, 1), match='1'>

In [11]:
#  What if you want the character class to include a literal hyphen character? 
# You can place it as the first or last character or escape it with a backslash (\):

print (re.search('[ab\-c]', '123-456'))


<_sre.SRE_Match object; span=(3, 4), match='-'>


In [12]:
# If you want to include a literal ']' in a character class

re.search('[ab\]cd]', 'foo[1]')

<_sre.SRE_Match object; span=(5, 6), match=']'>

In [15]:
# Other regex metacharacters lose their special meaning inside a character class:

re.search('[)*+|]', '123*456')

<_sre.SRE_Match object; span=(3, 4), match='*'>

#### \w  and     \W    Match based on whether a character is/is not  a word character.

\w matches any alphanumeric word character. 
Word characters are uppercase and lowercase letters, digits, and the underscore (_) character, so for ASCII text \w is essentially shorthand for [a-zA-Z0-9_]

\W is the opposite. It matches any non-word character and, for ASCII text, is equivalent to [^a-zA-Z0-9_]
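The equivalence with [a-zA-Z0-9_] holds for ASCII input only. With str patterns in Python 3, \w also accepts non-ASCII letters unless the re.ASCII flag is given. A short sketch, using an accented letter chosen purely for illustration:

```python
import re

accented = '\u00e9'   # the letter e with an acute accent

# By default \w accepts Unicode word characters.
unicode_hit = re.search(r'\w', accented)
print(unicode_hit)

# The explicit ASCII class does not.
class_hit = re.search(r'[a-zA-Z0-9_]', accented)
print(class_hit)

# re.ASCII narrows \w back to [a-zA-Z0-9_].
ascii_hit = re.search(r'\w', accented, flags=re.ASCII)
print(ascii_hit)
```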


In [23]:
print (re.search('\w', '#(.a$@&'))

print (re.search('[a-zA-Z0-9_]', '#(.a$@&'))

<_sre.SRE_Match object; span=(3, 4), match='a'>
<_sre.SRE_Match object; span=(3, 4), match='a'>


In [22]:
print (re.search('\W', 'a_1*3Qb'))

print (re.search('[^a-zA-Z0-9_]', 'a_1*3Qb'))

<_sre.SRE_Match object; span=(3, 4), match='*'>
<_sre.SRE_Match object; span=(3, 4), match='*'>


#### \d and  \D  Match based on whether a character is or is not a decimal digit.

\d matches any decimal digit character. 

\D is the opposite. It matches any character that isn’t a decimal digit:

In [24]:
print (re.search('\d', 'abc4def'))
print (re.search('\D', 'abc4def'))


<_sre.SRE_Match object; span=(3, 4), match='4'>
<_sre.SRE_Match object; span=(0, 1), match='a'>


#### \s  and \S  Match based on whether a character represents whitespace or not.

\s matches any whitespace character

\S matches any non-whitespace character:

In [26]:
print (re.search('\s', 'foo\nbar baz'))

print (re.search('\S', '  \n foo  \n  '))

<_sre.SRE_Match object; span=(3, 4), match='\n'>
<_sre.SRE_Match object; span=(4, 5), match='f'>


The character class sequences \w, \W, \d, \D, \s, and \S can appear inside a square bracket character class as well.

In this case, [\d\w\s] matches any digit, word, or whitespace character. 


In [27]:
print (re.search('[\d\w\s]', '---3---'))

print (re.search('[\d\w\s]', '---a---'))

print (re.search('[\d\w\s]', '--- ---'))

<_sre.SRE_Match object; span=(3, 4), match='3'>
<_sre.SRE_Match object; span=(3, 4), match='a'>
<_sre.SRE_Match object; span=(3, 4), match=' '>


### Escaping Metacharacters
Occasionally, you’ll want to include a metacharacter in your regex, except you won’t want it to carry its special meaning. Instead, you’ll want it to represent itself as a literal character.

backslash (\\)

Removes the special meaning of a metacharacter.

In [34]:
print (re.search('.', 'foo.bar'))
print (re.search('\.', 'foo.bar'))


<_sre.SRE_Match object; span=(0, 1), match='f'>
<_sre.SRE_Match object; span=(3, 4), match='.'>


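### Handling backslashes in an ordinary string is tedious. The solution is to use a raw string instead of doubling each backslash

The earlier patterns such as '\d' and '\.' work only because Python leaves unrecognized escape sequences unchanged; recent Python versions warn about them, so raw strings are the safer habit.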

In [41]:
print (re.search(r'\\', r'foo\bbar'))

<_sre.SRE_Match object; span=(3, 4), match='\\'>


##### Python raw string is created by prefixing a string literal with ‘r’ or ‘R’. 

A Python raw string treats a backslash (\) as a literal character. This is useful when a string contains backslashes that should not be treated as escape characters.

A raw string can't consist of a single backslash, because that backslash would escape the closing quote. So, r'\' is WRONG.

For the same reason, a raw string can't end with an odd number of backslashes. So, r'ab\\\' is WRONG.
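The rules about trailing backslashes can be checked directly. The sketch below builds the offending literal as text and hands it to compile(), since writing it inline would stop the example itself from parsing; the names `single`, `source`, and `raised` are illustrative.

```python
# An ordinary literal needs two backslashes to produce one character.
single = '\\'
print(len(single))

# In a raw literal both backslashes are kept.
print(len(r'\\'))

# Source text for a raw literal that ends in three backslashes.
source = "r'ab" + single * 3 + "'"

# The last backslash escapes the closing quote, so the literal never ends.
raised = False
try:
    compile(source, '<example>', 'eval')
except SyntaxError:
    raised = True
print(raised)
```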


In [38]:
s = 'Hi\nHello'
print(s)

Hi
Hello


In [39]:
raw_s = r'Hi\nHello'
print(raw_s)

Hi\nHello


In [40]:
s = r'Hi\xHello'
print(s)

Hi\xHello


In [42]:
print (re.search('^foo', 'foobar'))
print (re.search('bar$', 'foobar'))

<_sre.SRE_Match object; span=(0, 3), match='foo'>
<_sre.SRE_Match object; span=(3, 6), match='bar'>


#### Quantifiers
A quantifier metacharacter immediately follows a portion of a <regex> and indicates how many times that portion must occur for the match to succeed.
    
They include -
    
    '*'   Zero or more repetitions of the preceding regex.
    
    '+'   One or more repetitions of the preceding regex.
    
    '?'   Zero or one repetition of the preceding regex.
    
    {m}   Exactly 'm' repetitions of the preceding regex.
    
    {m,n} Between 'm' and 'n' repetitions (inclusive) of the preceding regex.
    

In [64]:
print (' *')
print (re.search('foo-*bar', 'foobar'))
print (re.search('foo-*bar', 'foo-bar'))
print (re.search('foo-*bar', 'foo--bar'))
print ('\n +')
print (re.search('foo-+bar', 'foobar'))
print (re.search('foo-+bar', 'foo-bar'))
print (re.search('foo-+bar', 'foo--bar'))
print ('\n ?')
print (re.search('foo-?bar', 'foobar'))
print (re.search('foo-?bar', 'foo-bar'))
print (re.search('foo-?bar', 'foo--bar'))
print ('\n {m}')
print (re.search('x-{3}x', 'x--x'))
print (re.search('x-{3}x', 'x---x'))
print (re.search('x-{3}x', 'x----x'))
print ('\n {m,n}')
print (re.search('x-{2,3}x', 'x--x'))
print (re.search('x-{2,3}x', 'x---x'))
print (re.search('x-{2,3}x', 'x----x'))



 *
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
<_sre.SRE_Match object; span=(0, 8), match='foo--bar'>

 +
None
<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
<_sre.SRE_Match object; span=(0, 8), match='foo--bar'>

 ?
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
<_sre.SRE_Match object; span=(0, 7), match='foo-bar'>
None

 {m}
None
<_sre.SRE_Match object; span=(0, 5), match='x---x'>
None

 {m,n}
<_sre.SRE_Match object; span=(0, 4), match='x--x'>
<_sre.SRE_Match object; span=(0, 5), match='x---x'>
None
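The {m,n} form also accepts an omitted bound: leaving out n removes the upper limit, and leaving out m sets the lower limit to zero. A brief sketch with sample strings invented for this purpose:

```python
import re

# {2,} means two or more dashes.
at_least_two = re.search('x-{2,}x', 'x-----x')
print(at_least_two)

# {,2} means zero, one, or two dashes.
at_most_two = re.search('x-{,2}x', 'xx')
print(at_most_two)

# Three dashes exceed the upper bound, so nothing matches.
too_many = re.search('x-{,2}x', 'x---x')
print(too_many)
```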


#### +, * , ? When used alone, the quantifier metacharacters *, +, and ? are all greedy, meaning they produce the longest possible match.

Consider the example below:

In [49]:
re.search('<.*>', '%<foo> <bar> <baz>%')   #note the span

<_sre.SRE_Match object; span=(1, 18), match='<foo> <bar> <baz>'>

#### *? , +?, ??  -- The non-greedy (or lazy) versions of the *, +, and ? quantifiers. If you want the shortest possible match instead, then use the non-greedy metacharacter sequence.

Consider the example below:

In [56]:
print (re.search('<.*?>', '%<foo> <bar> <baz>%') )   #note the span


print ('''\nIn below, the greedy version, ?, matches one occurrence, so ba? matches 'b' followed by a single 'a'. 
The non-greedy version, ??, matches zero occurrences, so ba?? matches just 'b'.''')
print (re.search('ba?', 'baaaa'))
print (re.search('ba??', 'baaaa'))


<_sre.SRE_Match object; span=(1, 6), match='<foo>'>

In below, the greedy version, ?, matches one occurrence, so ba? matches 'b' followed by a single 'a'. 
The non-greedy version, ??, matches zero occurrences, so ba?? matches just 'b'.
<_sre.SRE_Match object; span=(0, 2), match='ba'>
<_sre.SRE_Match object; span=(0, 1), match='b'>
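A common alternative to a lazy quantifier is a negated character class that cannot run past the closing delimiter. The sketch below compares the three approaches with re.findall() on the same sample string used above.

```python
import re

text = '%<foo> <bar> <baz>%'

# Greedy: one long match from the first < to the last >.
greedy = re.findall(r'<.*>', text)
print(greedy)

# Lazy: each match stops at the nearest >.
lazy = re.findall(r'<.*?>', text)
print(lazy)

# Negated class: [^>]* can never consume a >, so the result is the same.
negated = re.findall(r'<[^>]*>', text)
print(negated)
```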


#### Grouping Constructs and Backreferences
Grouping constructs break up a regex in Python into subexpressions or groups.

(<regex>)

Defines a subexpression or group.
    
bar+	  The + metacharacter applies only to the character 'r'

(bar)+	  The + metacharacter applies to the entire string 'bar'.	 One or more occurrences of 'bar'

foo(bar)?	'foo' optionally followed by 'bar'
    
(foo(bar)?)+	One or more occurrences of the above

\d\d\d	Three decimal digit characters
    
(\d\d\d)?	Zero or one occurrences of the above    

In [66]:
print (re.search('(bar)', 'foo bar baz'))
print (re.search('(bar)+', 'foo bar baz'))
print ( re.search('(bar)+', 'foo barbarbarbar baz'))
print ('\n')
print ( re.search('(ba[rz]){2,4}(qux)?', 'barbar'))

print (re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar'))
print (re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar123'))
print (re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoo123'))

<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(4, 7), match='bar'>
<_sre.SRE_Match object; span=(4, 16), match='barbarbarbar'>


<_sre.SRE_Match object; span=(0, 6), match='barbar'>
<_sre.SRE_Match object; span=(0, 9), match='foofoobar'>
<_sre.SRE_Match object; span=(0, 12), match='foofoobar123'>
<_sre.SRE_Match object; span=(0, 9), match='foofoo123'>
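The heading mentions backreferences, which the cell above does not exercise. Inside a pattern, \1 matches the same text that group 1 captured. A minimal sketch with made-up sample strings:

```python
import re

# \1 must repeat exactly what (\w+) captured.
pattern = r'(\w+),\1'
same = re.search(pattern, 'foo,foo')
different = re.search(pattern, 'foo,bar')
print(same)
print(different)

# A typical use: spotting an accidentally doubled word.
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(doubled.group())
```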


##### m.groups() 
Returns a tuple containing all the captured groups from a regex match.

In [71]:
m = re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')
print (m.groups() )

m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print (m.groups() )



('foobar', 'bar', None)
('foo', 'quux', 'baz')
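Groups that did not take part in the match show up as None, as in the first tuple above. m.groups() accepts a default value to use in their place. A short sketch:

```python
import re

m = re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')

# The third group never matched, so it is reported as None.
plain = m.groups()
print(plain)

# Passing a default replaces None for groups that did not participate.
with_default = m.groups('')
print(with_default)
```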


##### m.group()   
m.group(<n>) returns a string containing the <n>th captured group. The arguments are one-based, not zero-based, so m.group(1) refers to the first captured group, m.group(2) to the second, and so on. With several arguments, .group() returns a tuple of the corresponding groups.

In [77]:
m = re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')
print (m.group(1))
print (m.group(2))
print (m.group(3))
print ('\n')
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print (m.group(1))
print (m.group(2))
print (m.group(3))
print ('\n')
print ( '''m.group(0) has a special meaning. m.group(0) returns the entire match, and m.group() does the same. ''')
print (m.group(0))
print ('\n')
print (m.group(3, 2, 1))

foobar
bar
None


foo
quux
baz


m.group(0) has a special meaning. m.group(0) returns the entire match, and m.group() does the same. 
foo,quux,baz


('baz', 'quux', 'foo')
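The group numbers used with m.group() also work with m.span(), m.start(), and m.end(), which report where each group sits in the searched string. A small sketch reusing the comma-separated example:

```python
import re

s = 'foo,quux,baz'
m = re.search(r'(\w+),(\w+),(\w+)', s)

# Position of the second group inside s.
second_span = m.span(2)
print(second_span)
print(s[m.start(2):m.end(2)])

# Several group numbers at once give a tuple.
outer = m.group(1, 3)
print(outer)
```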


### Searching Functions

#### re.match()    Looks for a regex match at the beginning of a string. Roughly the same as re.search('^pattern', s)
#### re.fullmatch()    Looks for a regex match on an entire string. Roughly the same as re.search('^pattern$', s)
#### re.findall()	Returns a list of all regex matches in a string. Important.
#### re.finditer()	Returns an iterator that yields regex matches from a string. Important.

In [81]:
print (re.search(r'(\d+)', 'foo123bar'))
print ('\nre.match()')
print (re.match(r'\d+', '123foobar'))
print(re.match(r'\d+', 'foo123bar'))

print ('\nre.fullmatch()')
print(re.fullmatch(r'\d+', 'foo123bar'))
print (re.fullmatch(r'\d+', '123'))

print ('\nre.findall()')
print (re.findall(r'\w+', '...foo,,,,bar:%$baz//|'))
print (re.findall(r'#(\w+)#', '#foo#.#bar#.#baz#'), 'Here hash (#) characters don’t appear in the return list because they’re outside the grouping parentheses.')
print (re.findall(r'#\w+#', '#foo#.#bar#.#baz#'))

<_sre.SRE_Match object; span=(3, 6), match='123'>

re.match()
<_sre.SRE_Match object; span=(0, 3), match='123'>
None

re.fullmatch()
None
<_sre.SRE_Match object; span=(0, 3), match='123'>

re.findall()
['foo', 'bar', 'baz']
['foo', 'bar', 'baz'] Here hash (#) characters don’t appear in the return list because they’re outside the grouping parentheses.
['#foo#', '#bar#', '#baz#']
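The comparisons between re.match(), re.fullmatch(), and anchored re.search() calls hold in the common case but differ at the edges. The sketch below shows two such edges; the sample strings are chosen only to trigger them.

```python
import re

# $ also matches just before a trailing newline, fullmatch() does not.
with_newline = '123\n'
anchored = re.search(r'^\d+$', with_newline)
full = re.fullmatch(r'\d+', with_newline)
print(anchored)
print(full)

# With re.MULTILINE, ^ matches after every newline; match() still
# looks only at the very beginning of the string.
two_lines = 'foo\n123'
multiline_search = re.search(r'^\d+', two_lines, re.MULTILINE)
multiline_match = re.match(r'\d+', two_lines, re.MULTILINE)
print(multiline_search)
print(multiline_match)
```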


### re.finditer()  
scans <string> for non-overlapping matches of <regex> and returns an iterator that yields the match objects from any it finds. It scans the search string from left to right and returns matches in the order it finds them

In [85]:
it = re.finditer(r'\w+', '...foo,,,,bar:%$baz//|')
print (next(it))
print (next(it))
print (next(it))
print (next(it))   # raises StopIteration: no more matches

<_sre.SRE_Match object; span=(3, 6), match='foo'>
<_sre.SRE_Match object; span=(10, 13), match='bar'>
<_sre.SRE_Match object; span=(16, 19), match='baz'>


StopIteration: 
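In practice the iterator is usually consumed with a for loop, which stops cleanly when the matches run out. A minimal sketch; the list `found` exists only to collect the results.

```python
import re

text = '...foo,,,,bar:%$baz//|'

# The for loop handles StopIteration internally.
found = []
for m in re.finditer(r'\w+', text):
    found.append((m.group(), m.span()))
    print(m.group(), m.span())

# next() with a default avoids the exception when calling it by hand.
it = re.finditer(r'\d+', text)
nothing = next(it, None)
print(nothing)
```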

### Substitution Functions
Substitution functions replace portions of a search string that match a specified regex.

re.sub()	Scans a string for regex matches, replaces the matching portions of the string with the specified replacement string, and returns the result

re.subn()	Behaves just like re.sub() but also returns information regarding the number of substitutions made


Both re.sub() and re.subn() create a new string with the specified substitutions and return it. The original string remains unchanged. (Remember that strings are immutable in Python, so it wouldn’t be possible for these functions to modify the original string.)



In [8]:
print ('re.sub()')

s = 'foo.123.bar.789.baz'
print (re.sub(r'\d+','#', s), ' replace all digits by Hash(#)')
print (re.sub('[a-zA-Z]','#', s), ' replace all letters by Hash(#)')

print ('''\nNumbered references in re.sub() : \\g<n> or \\<n> .
Here \\g<n> or \\<n> refers to the numbered group, starting from 1.''')
print (re.sub(r'(\w+),bar,baz,(\w+)', r'\2,bar,baz,\1' ,'foo,bar,baz,qux' ))
print (re.sub(r'(\w+),bar,baz,(\w+)', r'\g<2>,bar,baz,\g<1>' ,'foo,bar,baz,qux' ))


print ('\nRestricting the number of substitutions in re.sub() using the count= argument')
print (re.sub(r'\d+', '#','foo.123.bar.456.hello.789.world.111', count=2))

print ('''\nCalling a function to implement the substitution logic.
Each match object is passed to the function, and the string it returns replaces the matched text.''')

def replace_mystring(x):
    s = x.group(0)   # get string from match object 
    
    # s.isdigit() returns True if all characters in s are digits
    if s.isdigit():
        return str(int(s) * 10)
    else:
        return s.upper()

print (re.sub(r'\d+', replace_mystring,'foo.123.bar.456.hello.789.world.111', count=2))


re.sub()
foo.#.bar.#.baz  replace all digits by Hash(#)
###.123.###.789.###  replace all letters by Hash(#)

Numbered references in re.sub() : \g<n> or \<n> .
Here \g<n> or \<n> refers to the numbered group, starting from 1.
qux,bar,baz,foo
qux,bar,baz,foo

Restricting the number of substitutions in re.sub() using the count= argument
foo.#.bar.#.hello.789.world.111

Calling a function to implement the substitution logic.
Each match object is passed to the function, and the string it returns replaces the matched text.
foo.1230.bar.4560.hello.789.world.111
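One reason the \g<n> form exists alongside the shorter backslash-number form is disambiguation: when a digit must follow the group reference, the short form reads the digits as one group number. A small sketch, with `fixed` and `failed` as illustrative names:

```python
import re

# \g<1>0 means "group 1, then a literal 0".
fixed = re.sub(r'(\d)', r'\g<1>0', 'a1b2')
print(fixed)

# \10 is read as a reference to group 10, which this pattern lacks.
failed = False
try:
    re.sub(r'(\d)', r'\10', 'a1b2')
except re.error as exc:
    failed = True
    print('re.error:', exc)
```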


In [12]:
print ('re.subn()')
print (re.subn(r'\w+', 'xxx', 'foo.bar.baz.qux', count=2))

print ('\n')
def replace_mystring(x):
    s = x.group(0)   # get string from match object 
    
    # s.isdigit() returns True if all characters in s are digits
    if s.isdigit():
         return str(int(s) * 10)
    else:
        return s.upper()

print (re.subn(r'\d+', replace_mystring,'foo.123.bar.456.hello.789.world.111', count=2) , '''<-- the 2nd variable in tuple tells number of substitution made. ''' )
    

re.subn()
('xxx.xxx.baz.qux', 2)


('foo.1230.bar.4560.hello.789.world.111', 2) <-- the 2nd variable in tuple tells number of substitution made. 


### Utility Functions

re.split()	Splits a string into substrings using a regex as a delimiter

re.escape()	Escapes characters in a regex

In [19]:
print ('re.split()')

print (re.split(r'\s*[,;/]\s*', 'foo,bar  ;  baz / qux'))

print (re.split(r'(\s*[,;/]\s*)', 'foo,bar  ;  baz / qux'))

print ('\nre.split()  use maxsplit= argument')
print (re.split(r',\s*','foo, bar, baz, qux, quux, corge', maxsplit=3), ' maxsplit=3  3 splits + remainder string as a value in list')
print (re.split(r',\s*','foo, bar, baz, qux, quux, corge', maxsplit=0), ' maxsplit=0  Split for all values')
print (re.split(r',\s*','foo, bar, baz, qux, quux, corge', maxsplit=-2), ' maxsplit=-2 i.e. negative, no split - whole string returned')

re.split()
['foo', 'bar', 'baz', 'qux']
['foo', ',', 'bar', '  ;  ', 'baz', ' / ', 'qux']

re.split()  use maxsplit= argument
['foo', 'bar', 'baz', 'qux, quux, corge']  maxsplit=3  3 splits + remainder string as a value in list
['foo', 'bar', 'baz', 'qux', 'quux', 'corge']  maxsplit=0  Split for all values
['foo, bar, baz, qux, quux, corge']  maxsplit=-2 i.e. negative, no split - whole string returned
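re.escape() is listed among the utility functions but not demonstrated. It is useful when text that should be matched literally happens to contain metacharacters. A minimal sketch, using an invented literal:

```python
import re

literal = 'a.b*c'

# Used as a pattern, the dot and star keep their special meaning.
unescaped = re.search(literal, 'axbbc')
print(unescaped)

# After re.escape(), only the exact text matches.
escaped_miss = re.search(re.escape(literal), 'axbbc')
escaped_hit = re.search(re.escape(literal), 'x a.b*c y')
print(escaped_miss)
print(escaped_hit.group())
```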


#### Compiled Regex Objects in Python
The re module supports the capability to precompile a regex in Python into a regular expression object that can be repeatedly used later.

re.compile('regex', flags=0)

There are two ways to use a compiled regular expression object. You can specify it as the first argument to the re module functions in place of <regex>:
    
    re_obj = re.compile(<regex>, <flags>)
    result = re.search(re_obj, <string>)

You can also invoke a method directly from a regular expression object:
    
    re_obj = re.compile(<regex>, <flags>)
    result = re_obj.search(<string>)


Why it's needed: if you use a particular regex frequently in your Python code, precompiling lets you separate the regex definition from its uses, which improves modularity.
    
    
A compiled regular expression object re_obj supports the following methods:

    re_obj.search(<string>[, <pos>[, <endpos>]])
    re_obj.match(<string>[, <pos>[, <endpos>]])
    re_obj.fullmatch(<string>[, <pos>[, <endpos>]])
    re_obj.findall(<string>[, <pos>[, <endpos>]])
    re_obj.finditer(<string>[, <pos>[, <endpos>]])    

In [25]:
s = 'foo123bar'
print (re.search(r'(\d+)', s))

myRegex = re.compile(r'(\d+)')

print (re.search(myRegex,s), ' method 1')
print (myRegex.search(s), ' method 2')

<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(3, 6), match='123'>  method 1
<_sre.SRE_Match object; span=(3, 6), match='123'>  method 2


In [27]:
s1, s2, s3, s4 = 'foo.bar', 'foo123bar', 'baz99', 'qux & grault'

re_obj = re.compile(r'\d+')
print (re_obj.search(s1))
print (re_obj.search(s2))
print (re_obj.search(s3))
print (re_obj.search(s4))

None
<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(3, 5), match='99'>
None


In [30]:
re_obj = re.compile('^bar')
s = 'foobarbaz'
print (s[3:])
print ('''Here, even though 'bar' does occur at the start of the substring beginning at character 3, 
it isn’t at the start of the entire string, so the caret (^) anchor fails to match.''')
print(re_obj.search(s, 3))

barbaz
Here, even though 'bar' does occur at the start of the substring beginning at character 3, 
it isn’t at the start of the entire string, so the caret (^) anchor fails to match.
None
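If the goal is to test for 'bar' starting exactly at a given index, there are two workable options: drop the caret and call .match() with a start position, or slice the string first. A sketch under those assumptions; `anchored` and `plain` are illustrative names.

```python
import re

s = 'foobarbaz'
anchored = re.compile('^bar')
plain = re.compile('bar')

# ^ still refers to the real start of s, so this finds nothing.
caret_with_pos = anchored.search(s, 3)
print(caret_with_pos)

# .match() with a position requires the match to begin at that index.
match_at_pos = plain.match(s, 3)
print(match_at_pos.span())

# Slicing creates a new string whose start is index 3 of the original.
sliced = anchored.search(s[3:])
print(sliced.span())
```

Note that spans from the sliced version are relative to the slice, not to the original string.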
