[source-1](https://www.w3schools.com/python/python_regex.asp)

- A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.


- RegEx can be used to check if a string contains the specified search pattern.


- ^a...s$ defines a RegEx pattern. The pattern is: any five-character string starting with a and ending with s.

**Here, the re.match() function checks whether the pattern matches test_string, starting from the beginning of the string.**

In [1]:
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Search successful.


## Specify Pattern Using RegEx

- To specify regular expressions, metacharacters are used. In the above example, ^ and $ are metacharacters.

---------------------------------------------------------------
- [] Square brackets specify a set of characters you wish to match.

        Here, [abc] will match if the string you are trying to match contains any of a, b or c.

            [abc] a -- 1 match
                  ac -- 2 matches
                  Hey Jude -- No match
      
        You can also specify a range of characters using - inside square brackets.

            [a-e] is the same as [abcde].
            [1-4] is the same as [1234].
            [0-39] is the same as [01239].

        You can complement (invert) the character set by using caret ^ symbol at the start of a square-bracket.

            [^abc] means any character except a or b or c.
            [^0-9] means any non-digit character.
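
The character-set rules above can be checked with re.findall(), which returns every non-overlapping match. This is a minimal sketch; the variable names and the extra test strings ('0123456789', 'a1b22') are illustrative and not part of the original notes.

```python
import re

# Every single character that belongs to the set is one match.
abc_in_ac = re.findall(r'[abc]', 'ac')              # ['a', 'c']
abc_in_jude = re.findall(r'[abc]', 'Hey Jude')      # []

# [0-39] is the range 0-3 plus the single character 9.
range_hits = re.findall(r'[0-39]', '0123456789')    # ['0', '1', '2', '3', '9']

# A leading ^ inverts the set: here, every non-digit character.
non_digits = re.findall(r'[^0-9]', 'a1b22')         # ['a', 'b']

print(abc_in_ac, abc_in_jude, range_hits, non_digits)
```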

---------------------------------------------------------------

- A period . matches any single character (except newline '\n').

            .. a -- no match
               ac -- 1 match
               acd -- 1 match
               acde -- 2 matches
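
The counts in the table above can be reproduced with re.findall(). The sketch below also shows the newline exception. The re.DOTALL flag is a standard re flag that is not mentioned in the notes, and the variable names are illustrative.

```python
import re

# Non-overlapping two-character matches for each test string in the table.
dot_matches = {text: re.findall(r'..', text) for text in ['a', 'ac', 'acd', 'acde']}
print(dot_matches)
# {'a': [], 'ac': ['ac'], 'acd': ['ac'], 'acde': ['ac', 'de']}

# By default . skips the newline; re.DOTALL lets it match newlines too.
default_dot = re.findall(r'.', 'a\nb')                   # ['a', 'b']
dotall_dot = re.findall(r'.', 'a\nb', flags=re.DOTALL)   # ['a', '\n', 'b']
print(default_dot, dotall_dot)
```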
            
---------------------------------------------------------------

- The caret symbol ^ is used to check if a string starts with a certain character.

            ^a a -- 1 match
            ^a abc -- 1 match
            ^a bac -- no match

            ^ab abc -- 1 match
            ^ab acb -- no match
---------------------------------------------------------------
- The dollar symbol $ is used to check if a string ends with a certain character.

                a$ a --  1 match
                formula -- 1 match
                cab -- no match
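
As a quick check of the two anchors, the sketch below runs the test strings from the tables above through re.search(). The list names are illustrative.

```python
import re

# ^ anchors the pattern at the start of the string.
starts_with_ab = [bool(re.search(r'^ab', s)) for s in ['abc', 'acb']]
print(starts_with_ab)   # [True, False]

# $ anchors the pattern at the end of the string.
ends_with_a = [bool(re.search(r'a$', s)) for s in ['a', 'formula', 'cab']]
print(ends_with_a)      # [True, True, False]
```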
                
--------------------------------------------------------------- 
- The star symbol * matches zero or more occurrences of the pattern left to it.

            ma*n mn -- 1 match
                 man -- 1 match
                 maaan -- 1 match
                 main -- no match ( a is not followed by n)
                 woman -- 1 match
     
---------------------------------------------------------------
- The plus symbol + matches one or more occurrences of the pattern left to it.

            ma+n mn -- no match
                 man -- 1 match
                 maaan -- 1 match
                 main -- No match
                 woman -- 1 match

---------------------------------------------------------------
- The question mark symbol ? matches zero or one occurrence of the pattern left to it.


            ma?n mn -- 1 match
                 man -- 1 match
                 maaan -- no match
                 main -- No match
                 woman -- 1 match
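
The three quantifiers *, + and ? differ only in how many a characters they allow between m and n. A minimal sketch that replays the tables above (the dictionary name is illustrative):

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']

# For each pattern, keep the words in which the pattern is found somewhere.
quantifier_hits = {
    pattern: [w for w in words if re.search(pattern, w)]
    for pattern in [r'ma*n', r'ma+n', r'ma?n']
}

for pattern, hits in quantifier_hits.items():
    print(pattern, hits)
# ma*n ['mn', 'man', 'maaan', 'woman']
# ma+n ['man', 'maaan', 'woman']
# ma?n ['mn', 'man', 'woman']
```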
---------------------------------------------------------------
- Braces {n,m} mean at least n, and at most m, repetitions of the pattern to their left.

            a{2,3} abc dat -- no match
                   abd daat -- 1 match (at daat)
                   aabd daaat -- 2 matches (aa in aabd, aaa in daaat)
                   aabc daaaat -- 2 matches (aa in aabc, aaa in daaaat)

            [0-9]{2,4} -- matches at least 2 digits but not more than 4 digits
                       ab123csde -- 1 match (123)
                       12 and 345673 -- 3 matches (12, 3456, 73)
                       1 and 2 -- no match
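
The {n,m} examples above can be verified with re.findall(). This sketch shows that matching is greedy: up to m repetitions are consumed before the next match starts. Variable names are illustrative.

```python
import re

# 'aaaa' yields one match of three a's; the leftover single a is too short.
a_runs = re.findall(r'a{2,3}', 'aabc daaaat')
print(a_runs)         # ['aa', 'aaa']

# Six digits in a row split into a 4-digit match and a 2-digit match.
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')
print(digit_runs)     # ['12', '3456', '73']

# Single digits never satisfy the minimum of 2.
single_digits = re.findall(r'[0-9]{2,4}', '1 and 2')
print(single_digits)  # []
```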
            
---------------------------------------------------------------
- Vertical bar | is used for alternation (or operator)

            a|b cde	No match
                ade	1 match (match at ade)
                acdbea	3 matches (at acdbea)
                
---------------------------------------------------------------
- Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b or c followed by xz.

            (a|b|c)xz
            ab xz	No match
            abxz	1 match (match at abxz)
            axz cabxz	2 matches (axz and bxz)
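
One practical detail when trying alternation and grouping in Python: if the pattern contains a capturing group, re.findall() returns the group contents rather than the whole match. The sketch below shows this, along with two ways to get the full matches. The non-capturing form (?:...) is standard re syntax that is not covered in these notes, and the variable names are illustrative.

```python
import re

text = 'axz cabxz'

# Plain alternation without a group: each a or b is one match.
alternation = re.findall(r'a|b', 'acdbea')
print(alternation)      # ['a', 'b', 'a']

# With a capturing group, findall() returns only what the group captured.
group_only = re.findall(r'(a|b|c)xz', text)
print(group_only)       # ['a', 'b']

# finditer() yields match objects, so group() gives the whole match.
full_matches = [m.group() for m in re.finditer(r'(a|b|c)xz', text)]
print(full_matches)     # ['axz', 'bxz']

# A non-capturing group (?:...) groups without capturing.
non_capturing = re.findall(r'(?:a|b|c)xz', text)
print(non_capturing)    # ['axz', 'bxz']
```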
            
---------------------------------------------------------------
- Backslash \ is used to escape various characters, including all metacharacters. For example, "\$a" matches if a string contains $ followed by a.
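
A minimal sketch of the escaping rule, using an illustrative test string. It also uses re.escape(), a standard re helper not mentioned in these notes, which adds the backslashes for you.

```python
import re

price = 'cost: $a'

# Escaped: \$ is a literal dollar sign.
escaped = re.findall(r'\$a', price)
print(escaped)      # ['$a']

# Unescaped: $ means "end of string", so nothing can follow it.
unescaped = re.findall(r'$a', price)
print(unescaped)    # []

# re.escape() backslash-escapes the metacharacters in a literal string.
auto_escaped = re.escape('$a')
print(auto_escaped)                       # \$a
print(re.findall(auto_escaped, price))    # ['$a']
```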

## Special Sequences

- \A - Matches if the specified characters are at the start of a string.

            \Athe	the sun 	Match
                    In the sun	No match
**************************************************************

- \b - Matches if the specified characters are at the beginning or end of a word.

            \bfoo	football	Match
                    a football	Match
                    afootball	No match
                    
            foo\b	the foo	Match
                    the afoo test	Match
                    the afootest	No match
             
**************************************************************
- \B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

            \Bfoo	football	No match
                    a football	No match
                    afootball	Match



            foo\B	the foo	No match
                    the afoo test	No match
                    the afootest	Match
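
The \b and \B tables above can be replayed with re.search(). In this sketch the list names are illustrative, and raw strings are used so that \b is not read by Python as a backspace character.

```python
import re

prefix_samples = ['football', 'a football', 'afootball']
suffix_samples = ['the foo', 'the afoo test', 'the afootest']

# \b requires a word boundary; \B requires that there is none.
b_before = [bool(re.search(r'\bfoo', s)) for s in prefix_samples]
B_before = [bool(re.search(r'\Bfoo', s)) for s in prefix_samples]
b_after = [bool(re.search(r'foo\b', s)) for s in suffix_samples]
B_after = [bool(re.search(r'foo\B', s)) for s in suffix_samples]

print(b_before)   # [True, True, False]
print(B_before)   # [False, False, True]
print(b_after)    # [True, True, False]
print(B_after)    # [False, False, True]
```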
**************************************************************
- \d - Matches any decimal digit. Equivalent to [0-9]

            \d	12abc3	3 matches (at 12abc3)
                Python	No match
                
**************************************************************
- \D - Matches any character that is not a decimal digit. Equivalent to [^0-9]

            \D	1ab34"50	3 matches (at 1ab34"50)
                1345	No match
                
************************************************************** 
- \s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v]

            \s	Python RegEx	1 match
                PythonRegEx	No match
                
************************************************************** 
- \S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

            \S	a b	2 matches (at  a b)
                (whitespace-only string)	No match
                                    
************************************************************** 

- \w - Matches any alphanumeric character (digits and letters) or the underscore _. Equivalent to [a-zA-Z0-9_] for ASCII text; with str patterns in Python 3 it also matches Unicode letters and digits unless the re.ASCII flag is set.


        \w	12&": ;c 	3 matches (at 12&": ;c)
            %"> !	    No match
            
- \W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

        \W	1a2%c	1 match (at 1a2%c)
            Python	No match
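
The character-class sequences above can be compared side by side with re.findall(). This sketch reuses the test strings from the tables; the variable names are illustrative.

```python
import re

digits = re.findall(r'\d', '12abc3')
non_digits = re.findall(r'\D', '1ab34"50')
spaces = re.findall(r'\s', 'Python RegEx')
non_spaces = re.findall(r'\S', 'a b')
word_chars = re.findall(r'\w', '12&": ;c')
non_word_chars = re.findall(r'\W', '1a2%c')

print(digits)           # ['1', '2', '3']
print(non_digits)       # ['a', 'b', '"']
print(spaces)           # [' ']
print(non_spaces)       # ['a', 'b']
print(word_chars)       # ['1', '2', 'c']
print(non_word_chars)   # ['%']
```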
            
            
**************************************************************

- \Z - Matches if the specified characters are at the end of a string.

        Python\Z	I like Python	1 match
                    I like Python Programming	No match
                    Python is fun.	No match
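
A short sketch of \A and \Z using the strings from the tables above. The final comparison with $ is an addition that is not in these notes: in Python, $ also matches just before a trailing newline, while \Z matches only at the very end of the string.

```python
import re

# \A anchors at the start of the string.
starts = [bool(re.search(r'\Athe', s)) for s in ['the sun', 'In the sun']]
print(starts)   # [True, False]

# \Z anchors at the end of the string.
ends = [bool(re.search(r'Python\Z', s))
        for s in ['I like Python', 'I like Python Programming', 'Python is fun.']]
print(ends)     # [True, False, False]

# With a trailing newline, $ still matches but \Z does not.
dollar_hit = bool(re.search(r'Python$', 'I like Python\n'))
z_hit = bool(re.search(r'Python\Z', 'I like Python\n'))
print(dollar_hit, z_hit)   # True False
```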



## Python RegEx

In [2]:
import re

**The re.findall(  ) method returns a list of strings containing all matches.**

In [3]:
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # raw string, so \d reaches the regex engine unchanged

result = re.findall(pattern, string) 
print(result)

['12', '89', '34']


**The re.split method splits the string where there is a match and returns a list of strings where the splits have occurred.**

In [4]:
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'  # raw string, so \d reaches the regex engine unchanged

result = re.split(pattern, string) 
print(result)

['Twelve:', ' Eighty nine:', '.']


You can pass the maxsplit argument to re.split(). It is the maximum number of splits that will occur.

In [5]:
string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit=1 splits only at the first occurrence
# (passed by keyword: positional use is deprecated since Python 3.13)
result = re.split(pattern, string, maxsplit=1) 
print(result)

['Twelve:', ' Eighty nine:89 Nine:9.']


**re.sub(  )**

The syntax of re.sub() is:

re.sub(pattern, replace, string)

It returns a string in which every match of pattern is replaced with the content of the replace variable.

In [7]:
# multiline string
string = 'abc 12\n de 23 \n f45 6'
print(string)

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

abc 12
 de 23 
 f45 6
abc12de23f456


In [11]:
# re.sub() also accepts an optional count argument: the maximum number of
# replacements. If omitted, it defaults to 0, which replaces all occurrences.
# The trailing backslash joins the two source lines, so no newline follows '12'.
string = 'abc 12\
de 23 \n f45 6'
print(string)

# matches one or more whitespace characters
pattern = r'\s+'
replace = ''

print("**********************\n")
# count=1 replaces only the first run of whitespace
new_string = re.sub(pattern, replace, string, count=1) 
print(new_string)

abc 12de 23 
 f45 6
**********************

abc12de 23 
 f45 6


**re.subn( ) is similar to re.sub( ) except that it returns a tuple of 2 items containing the new string and the number of substitutions made.**

In [12]:
# the trailing backslash joins the two source lines, so no newline follows '12'
string = 'abc 12\
de 23 \n f45 6'

print(string)
# matches one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

abc 12de 23 
 f45 6
('abc12de23f456', 4)


**The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.**

If the search is successful, re.search() returns a match object; if not, it returns None.

In [13]:
string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")  

pattern found inside the string
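
To make the difference from re.match() concrete: re.search() scans the whole string, while re.match() only tries the pattern at position 0. A minimal sketch with illustrative variable names:

```python
import re

text = 'Python is fun'

# search() finds 'fun' even though it is not at the start.
found = re.search(r'fun', text)
print(found.span())      # (10, 13)

# match() only tries at the beginning, so the same pattern fails.
at_start = re.match(r'fun', text)
print(at_start)          # None

# Both functions return None when there is no match at all.
missing = re.search(r'Java', text)
print(missing)           # None
```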


## Match object

You can get the methods and attributes of a match object using the dir() function.
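
A minimal sketch of the dir() call, with an illustrative pattern and string. Names that start with an underscore are filtered out so only the public methods and attributes remain. The exact list may vary between Python versions, so no output is shown.

```python
import re

demo_match = re.search(r'\d+', 'hello 12')

# Keep only the public names exposed by the match object.
public_names = [name for name in dir(demo_match) if not name.startswith('_')]
print(public_names)
```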

**The group( ) method returns the part of the string where there is a match.**

In [15]:
string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
    print(match.group())
else:
    print("pattern not found")

801 35


In [16]:
match.group(1)

'801'

In [17]:
match.group(2)

'35'

In [18]:
match.group(1, 2)

('801', '35')

In [19]:
match.groups()

('801', '35')

In [20]:
# The start() function returns the index of the start of the matched substring. 
# Similarly, end() returns the end index of the matched substring.
# The span() function returns a tuple containing start and end index of the matched part.
print(match.start())

print(match.end())

print(match.span())

2
8
(2, 8)
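
The indices returned by start(), end() and span() are ordinary slice indices into the searched string. The sketch below repeats the example above and shows that span() also accepts a group number. The variable names are illustrative.

```python
import re

string = '39801 356, 2102 1111'
match = re.search(r'(\d{3}) (\d{2})', string)

# Slicing the original string with span() gives back the matched text.
start, end = match.span()
matched_text = string[start:end]
print(matched_text)     # 801 35

# span(n) reports where group n matched.
group1_span = match.span(1)
group2_span = match.span(2)
print(group1_span)      # (2, 5)
print(group2_span)      # (6, 8)
```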


In [21]:
# The re attribute of a match object returns the compiled regular expression object. 
# The string attribute returns the string that was searched.
print(match.re)

print(match.string)

re.compile('(\\d{3}) (\\d{2})')
39801 356, 2102 1111


**Using the r prefix before RegEx**

When the r or R prefix is used before a string literal, it means a raw string. For example, '\n' is a newline whereas r'\n' means two characters: a backslash \ followed by n.
 

The backslash \ is used to escape various characters, including all metacharacters. In a raw string, Python itself treats \ as a normal character, so the backslash reaches the regular expression engine unchanged.
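
The difference is easy to see by comparing string lengths. This sketch uses illustrative variable names and shows that a raw string and a doubled backslash produce the same pattern.

```python
import re

# '\n' is one character (a newline); r'\n' is two (backslash, n).
normal_len = len('\n')
raw_len = len(r'\n')
print(normal_len, raw_len)    # 1 2

# A raw string saves doubling every backslash in a pattern.
same_pattern = (r'\d+' == '\\d+')
print(same_pattern)           # True

numbers = re.findall(r'\d+', 'a1b22')
print(numbers)                # ['1', '22']
```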



In [22]:
string = '\n and \r are escape sequences.'


result = re.findall(r'[\n\r]', string) 

print(result)

['\n', '\r']
