# re - Regular Expression Operations

#### 1. The pattern and the search string must be of the same type: either both Unicode strings (str) or both 8-bit strings (bytes).
#### 2. A backslash '\\' removes the special meaning of the character that follows it, e.g. to match a literal '\\' you have to write '\\\\' in the pattern.
#### 3. '\n' is a single newline character, but r'\n' is two characters: '\\' and 'n'. Backslashes are not handled in any special way in a string literal prefixed with 'r'.
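
The three notes above can be checked directly. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# '\n' is one character (a newline); r'\n' is two: a backslash and 'n'.
newline_len = len('\n')
raw_len = len(r'\n')

# To match one literal backslash the regex needs an escaped backslash.
# As a raw string that is r'\\'; as a normal string it is '\\\\'.
same_pattern = (r'\\' == '\\\\')
backslashes = re.findall(r'\\', r'C:\temp\new')

# Mixing a str pattern with a bytes subject raises TypeError.
try:
    re.search(r'\d+', b'abc 123')
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# A bytes pattern works on a bytes subject.
bytes_match = re.search(rb'\d+', b'abc 123').group()

print(newline_len, raw_len, same_pattern, len(backslashes), mixed_types_ok, bytes_match)
```

Writing patterns as raw strings avoids having to double every backslash.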

## MetaCharacters
    1. \   Used to drop the special meaning of character following it
    2. []  Represent a character class
    3. ^   Matches the beginning of the string (and of each line in MULTILINE mode)
    4. $   Matches the end of the string or just before a trailing newline (and the end of each line in MULTILINE mode)
    5. .   Matches any character except newline.
           If the DOTALL flag has been specified, this matches any character including a newline.
    6. ?   Matches zero or one occurrence.
    7. |   Means OR (matches either the expression on its left or the one on its right).
    8. *   Any number of occurrences (including 0 occurrences)
    9. +   One or more occurrences
    10. {}  Indicate number of occurrences of a preceding RE to match.
    11. ()  Enclose a group of REs
    12. \d   Matches any decimal digit. With the ASCII flag (or a bytes pattern) this is equivalent to the class [0-9].
    13. \D   Matches any non-digit character.
    14. \s   Matches any whitespace character.
    15. \S   Matches any non-whitespace character.
    16. \w   Matches any alphanumeric character or underscore. With the ASCII flag (or a bytes pattern) this is equivalent to the class [a-zA-Z0-9_].
    17. \W   Matches any character that is not alphanumeric or an underscore.
    18. [\s,.] Matches any whitespace character, ',' or '.'
    19. {m} Matches exactly m copies of the preceding RE, e.g. a{6} matches exactly six 'a' characters.
    20. {m,n} Matches from m to n copies of the preceding RE, as many as possible (greedy). If n is omitted, there is no upper bound. Write it without a space after the comma.
    21. {m,n}? Matches from m to n copies, but as few as possible (non-greedy).
    22. Inside a set [] special characters lose their special meaning, e.g. [(+*)] matches '(', '+', '*' or ')'.
    23. [0-9A-Fa-f] matches any hexadecimal digit. [a\-z] matches 'a', '-' or 'z'; [-a] and [a-] both match '-' or 'a'.
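
Items 19 to 23 are easier to see side by side. The sketch below uses made-up sample strings to contrast greedy and non-greedy counts and to show how characters behave inside a set.

```python
import re

text = "aaaaaa"

exact = re.findall(r'a{6}', text)      # exactly six 'a'
greedy = re.findall(r'a{2,4}', text)   # takes as many as it can, up to 4
lazy = re.findall(r'a{2,4}?', text)    # takes as few as it can, down to 2

# Inside [] the characters ( + * ) are literal.
literal_set = re.findall(r'[(+*)]', "(1+2)*3")

# A hyphen forms a range only when it sits between two characters.
hex_runs = re.findall(r'[0-9A-Fa-f]+', "FF 1a3 zz")
hyphen_literal = re.findall(r'[a\-z]', "a-z b")

print(exact)           # ['aaaaaa']
print(greedy)          # ['aaaa', 'aa']
print(lazy)            # ['aa', 'aa', 'aa']
print(literal_set)     # ['(', '+', ')', '*']
print(hex_runs)        # ['FF', '1a3']
print(hyphen_literal)  # ['a', '-', 'z']
```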

In [1]:
import re

#### re.compile(pattern, flags=0)
    
    Compile a regular expression pattern into a regular expression object (type re.Pattern), which can be used for matching through its match(), search() and other methods.

In [2]:
pattern = re.compile(r'([a-e])')
print(type(pattern))

<class 're.Pattern'>


In [3]:
pattern.findall("Hello Arya Stark")

['e', 'a', 'a']

In [4]:
pattern = re.compile(r'\d*')

In [5]:
pattern.findall("This is 1st May 2019")

['', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '2019', '']
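
The empty strings in the output above appear because `\d*` may match zero digits, so it succeeds with an empty match at every position that holds no digit. A small sketch of the difference between `*` and `+` on the same sentence:

```python
import re

text = "This is 1st May 2019"

# '*' allows zero digits, so empty matches are reported as well.
star_matches = re.findall(r'\d*', text)

# '+' requires at least one digit, so only real runs of digits are returned.
plus_matches = re.findall(r'\d+', text)

# Dropping the empty strings from the '*' result gives the same list.
non_empty = [m for m in star_matches if m]

print(plus_matches)               # ['1', '2019']
print(non_empty == plus_matches)  # True
```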

In [6]:
pattern = re.compile(r'\D+')

In [7]:
pattern.findall("This is 1st May 2019")

['This is ', 'st May ']

In [8]:
pattern = re.compile(r'\s')

In [9]:
pattern.findall("This is 1st May 2019")

[' ', ' ', ' ', ' ']

In [10]:
pattern = re.compile(r'\S+')

In [11]:
pattern.findall("This is 1st May 2019")

['This', 'is', '1st', 'May', '2019']

In [12]:
pattern = re.compile(r'[\s,.]')

In [13]:
pattern.findall("This is 1st May, 2019.")

[' ', ' ', ' ', ',', ' ', '.']

In [14]:
pattern = re.compile(r'\w+')

In [15]:
pattern.findall("This is 1_st May, * * 2019.")

['This', 'is', '1_st', 'May', '2019']

In [16]:
pattern = re.compile(r'\W+')

In [17]:
pattern.findall("This is 1_st May, * * 2019.")

[' ', ' ', ' ', ', * * ', '.']

In [18]:
pattern = re.compile(r'ab*')

In [19]:
pattern.findall("abababbbbbbbabbbaaabbbbbbb")

['ab', 'ab', 'abbbbbbb', 'abbb', 'a', 'a', 'abbbbbbb']

In [20]:
from re import split

# r'\W+' matches one or more non-alphanumeric characters.
# On finding ',' or whitespace ' ', split() cuts the string at that point.
print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))

# Here ':', ' ' and ',' are not alphanumeric, so splitting occurs at each of them.
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))

# r'\d+' matches one or more digits.
# Splitting occurs at '12', '2016', '11' and '02' only.
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']
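
Two further `re.split()` behaviours are worth knowing, sketched here on the same sentence: the `maxsplit` argument limits the number of cuts, and a capturing group in the pattern keeps the separators in the result.

```python
import re

text = 'On 12th Jan 2016, at 11:02 AM'

# maxsplit=2 stops after two splits; the rest of the string is left intact.
limited = re.split(r'\W+', text, maxsplit=2)

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r'(\d+)', 'at 11:02 AM')

print(limited)          # ['On', '12th', 'Jan 2016, at 11:02 AM']
print(with_separators)  # ['at ', '11', ':', '02', ' AM']
```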


In [21]:
pattern = re.compile(r'\sAND\s', flags=re.IGNORECASE)

In [22]:
repl = r' & '

In [23]:
re.sub(pattern, repl, "You have to learn and do things.")

'You have to learn & do things.'

In [24]:
re.subn(pattern, repl, "You have to learn and do things and don't wait")

("You have to learn & do things & don't wait", 2)
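
The compiled pattern also has its own `sub()` and `subn()` methods. The sketch below reuses the pattern from the cells above and adds two options they do not show: `count` to limit the number of replacements, and backreferences such as `\1` in the replacement string. The name-swapping example is made up for illustration.

```python
import re

pattern = re.compile(r'\sAND\s', flags=re.IGNORECASE)
sentence = "You have to learn and do things and don't wait"

# count=1 replaces only the first occurrence.
first_only = pattern.sub(' & ', sentence, count=1)

# subn() returns a (new_string, number_of_substitutions) tuple.
new_text, n_subs = pattern.subn(' & ', sentence)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'Arya Stark')

print(first_only)  # You have to learn & do things and don't wait
print(n_subs)      # 2
print(swapped)     # Stark Arya
```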

In [25]:
print(re.escape("[a-z]+ is all the lower character from a - z with 1 or more occurrences"))

\[a\-z\]\+\ is\ all\ the\ lower\ character\ from\ a\ \-\ z\ with\ 1\ or\ more\ occurrences
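
A typical use of `re.escape()` is searching for text that comes from outside the program and may contain metacharacters. A minimal sketch, with a hypothetical `user_text` value:

```python
import re

user_text = "1+1=2"   # hypothetical input that happens to contain '+'
subject = "is 1+1=2 true?"

# Unescaped, '+' is a quantifier: the pattern means "one or more 1s, then 1=2".
unescaped = re.search(user_text, subject)

# Escaped, every character is matched literally.
escaped = re.search(re.escape(user_text), subject)

print(unescaped)        # None
print(escaped.group())  # 1+1=2
```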


#### re.search()  
    This method returns None if the pattern does not match, or a re.Match object that contains information about the matching part of the string. It stops after the first match, so it is better suited to testing whether a pattern occurs than to extracting every occurrence.

In [26]:
regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on June 24") 

if match is not None:
    print("Match start index at %s, Match end index at %s" % (match.start(), match.end()))
    print(match.group(0))
    print(match.group(1))
    print(match.group(2)) 
else: 
    print("The regex pattern does not match.")

Match start index at 14, Match end index at 21
June 24
June
24
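
For comparison, here is a sketch of how the same regex behaves with the related functions `re.match()`, `re.fullmatch()` and `re.findall()`. The second sample string is made up for illustration.

```python
import re

text = "I was born on June 24"
regex = r"([a-zA-Z]+) (\d+)"

# match() only tries at position 0, so it fails on this sentence.
at_start = re.match(regex, text)

# search() scans forward and returns the first match anywhere.
anywhere = re.search(regex, text)

# fullmatch() succeeds only if the whole string matches.
whole = re.fullmatch(regex, "June 24")

# findall() returns every match, which suits data extraction better.
all_pairs = re.findall(regex, "June 24, July 4")

print(at_start)          # None
print(anywhere.group())  # June 24
print(whole.groups())    # ('June', '24')
print(all_pairs)         # [('June', '24'), ('July', '4')]
```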


In [27]:
def findWon(string):
    # (.*) absorbs any leading text, because match() only matches from the start of the string
    pattern = re.compile(r'(.*)(YOU\sWON)(\s?)', flags=re.IGNORECASE)
    match = pattern.match(string)
    return match

In [28]:
print(findWon("Congrats! YOU WON 10000"))

<re.Match object; span=(0, 18), match='Congrats! YOU WON '>
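
The match object above can be taken apart with `groups()`. The sketch below rebuilds the same pattern outside the function and also shows what happens without the leading `(.*)` and with lower-case input; the extra test strings are illustrative.

```python
import re

pattern = re.compile(r'(.*)(YOU\sWON)(\s?)', flags=re.IGNORECASE)
match = pattern.match("Congrats! YOU WON 10000")

# groups() returns every captured group as a tuple.
parts = match.groups()

# Without the leading (.*), match() fails because it only tries at position 0.
no_prefix = re.match(r'YOU\sWON', "Congrats! YOU WON 10000", flags=re.IGNORECASE)

# IGNORECASE lets the same pattern match lower-case text.
lower = pattern.match("you won")

print(parts)           # ('Congrats! ', 'YOU WON', ' ')
print(no_prefix)       # None
print(lower.group(2))  # you won
```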
