# REGEX COMPILATION FLAGS

Flags (also called modifiers) change the behaviour of a regular expression (RE). Flags can be used to influence:
+   case sensitivity
1)  re.IGNORECASE or re.I
+  character set
2)  re.ASCII or re.A
3)  re.UNICODE or re.U
4)  re.LOCALE or re.L
+  change metacharacter behaviour
5)  re.MULTILINE or re.M
6)  re.DOTALL or re.S
+  make REGEX pattern readable
7)  re.VERBOSE or re.X
+  debug REGEX expression
8)  re.DEBUG
+  multiple flags
+  inline flags

In [61]:
enc='utf-8'
with open("miracle_in_the_andes.txt","r",encoding=enc) as f:
    book=f.read()
    #print(book)

###  Case Sensitivity

1) re.IGNORECASE or re.I
+  Perform case-insensitive matching
+  Full Unicode matching (such as Ü matching ü) also works, unless the re.ASCII flag is used to restrict matching to ASCII
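
As a minimal sketch of the Unicode point above (the sample string and variable names are illustrative assumptions, not taken from the book), re.I can be compared with and without re.ASCII:

```python
import re

sample = 'Über über'   # illustrative text containing a non-ASCII letter in both cases

# With re.I alone, case folding covers the whole Unicode range, so 'ü' also matches 'Ü'
unicode_hits = re.findall(r'ü', sample, re.I)

# Adding re.A limits case-insensitive matching to ASCII letters, so only the exact 'ü' matches
ascii_hits = re.findall(r'ü', sample, re.I | re.A)

print(unicode_hits)   # ['Ü', 'ü']
print(ascii_hits)     # ['ü']
```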

In [66]:
#### Find 'chapter' without re.I (case-sensitive)

import re

def myfunc(string):
    pattern=r'chapter'
    array=re.findall(pattern,string)
    return array

myfunc(book)

['chapter']

In [68]:
#### Find 'chapter' with re.I (case-insensitive)

import re

def myfunc(string):
    pattern=r'chapter'
    array=re.findall(pattern,string,re.I)
    return array

myfunc(book)

['Chapter',
 'Chapter',
 'Chapter',
 'Chapter',
 'Chapter',
 'Chapter',
 'Chapter',
 'Chapter',
 'Chapter',
 'Chapter',
 'chapter',
 'Chapter']

###  Character set

2)  re.ASCII or re.A
3)  re.UNICODE or re.U              # default for str patterns, so redundant
4)  re.LOCALE or re.L               # discouraged: depends on the current locale and works only with bytes patterns

+  re.ASCII makes \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching.
>\d = [0-9] - no longer matches Hindi, Bengali, etc. digits  
>\D = [^0-9] - matches any character other than an ASCII digit  
>\w = [A-Za-z0-9_] - matches only ASCII word characters, so \w+ matches Wiktor but not Виктор  
>\W = [^A-Za-z0-9_] - matches any character other than ASCII letters, digits and _ (so it matches 你好吗, Виктор, etc.)  
>\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and vertical tab  
>\S = [^ \t\n\r\f\v] - matches any character other than a space, tab, linefeed, carriage return, form feed and vertical tab, so it matches all Unicode letters, digits and punctuation as well as Unicode (non-ASCII) whitespace.
+  This flag is only meaningful for Unicode (str) patterns; it is ignored for bytes patterns.
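
The last point can be checked with a short sketch. It assumes the text is encoded as UTF-8; the variable names are illustrative. Bytes patterns already match ASCII-only, so adding re.A should not change the result:

```python
import re

# Encoding to UTF-8 turns each non-ASCII letter into a multi-byte sequence
raw = "la cigüeña".encode("utf-8")

bytes_default = re.findall(rb'\w+', raw)        # bytes pattern, no flag
bytes_ascii = re.findall(rb'\w+', raw, re.A)    # bytes pattern, re.A has no extra effect

print(bytes_default)                  # [b'la', b'cig', b'e', b'a']
print(bytes_default == bytes_ascii)   # True
```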

https://realpython.com/python-encodings-guide/  
https://docs.python.org/3/howto/unicode.html#python-s-unicode-support

In [74]:
####   Non-English words without ASCII flag

import re

text = "la cigüeña es bonita" # the stork is pretty

def myfunc(string):
    pattern=re.compile(r'\w+')
    array=pattern.findall(string)
    return array

myfunc(text)

['la', 'cigüeña', 'es', 'bonita']

In [76]:
####   Non-English words with ASCII flag

import re

text = "la cigüeña es bonita" # the stork is pretty

def myfunc(string):
    pattern=re.compile(r'\w+',re.A)
    array=pattern.findall(string)
    return array

myfunc(text)

['la', 'cig', 'e', 'a', 'es', 'bonita']

In [78]:
####   UNICODE digits without ASCII flag

import re

text = '\u0967\u096a\u096c'

def myfunc(string):
    pattern=re.compile(r'\d+')
    array=pattern.findall(string)
    return array

myfunc(text)

['१४६']

In [80]:
####   UNICODE digits with ASCII flag

import re

text = '\u0967\u096a\u096c'

def myfunc(string):
    pattern=re.compile(r'\d+',re.A)
    array=pattern.findall(string)
    return array

myfunc(text)

[]

### Changing metacharacter behaviour

5) re.MULTILINE or re.M
+  Causes the start-of-string (^) and end-of-string ($) anchors to also match at embedded newlines, i.e. at the start and end of each line
+  Only the ^ and $ anchors are modified in this way; the flag has no effect on the \A and \Z anchors

6) re.DOTALL or re.S
+  Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
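
The difference between ^/$ and \A/\Z under re.M can be sketched as follows (the two-line sample text and variable names are assumptions made for illustration):

```python
import re

text = 'first line\nsecond line'

# ^ and $ match at every line boundary when re.M is set
caret_hits = re.findall(r'^\w+', text, re.M)
dollar_hits = re.findall(r'\w+$', text, re.M)

# \A and \Z still match only at the very start and very end of the string
slash_a_hits = re.findall(r'\A\w+', text, re.M)
slash_z_hits = re.findall(r'\w+\Z', text, re.M)

print(caret_hits)     # ['first', 'second']
print(dollar_hits)    # ['line', 'line']
print(slash_a_hits)   # ['first']
print(slash_z_hits)   # ['line']
```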

In [83]:
####   MULTILINE without flag

import re

text = 'IT WAS FRIDAY, the thirteenth of October.\nWe joked about that—flying over the Andes on such an unlucky day,'

def myfunc(string):
    pattern=re.compile(r'^\w{2}')
    array=pattern.findall(string)
    return array

myfunc(text)

['IT']

In [85]:
####   MULTILINE with flag

import re

text = 'IT WAS FRIDAY, the thirteenth of October.\nWe joked about that—flying over the Andes on such an unlucky day,'

def myfunc(string):
    pattern=re.compile(r'^\w{2}',re.M)
    array=pattern.findall(string)
    return array

myfunc(text)

['IT', 'We']

In [2]:
####   DOTALL without flag

import re

text = 'IT WAS FRIDAY, the thirteenth of October.\nWe joked about that—flying over the Andes on such an unlucky day,'

def myfunc(string):
    pattern=re.compile(r'the.*Andes')
    array=pattern.sub("it was",string)
    return array

myfunc(text)

'IT WAS FRIDAY, the thirteenth of October.\nWe joked about that—flying over it was on such an unlucky day,'

In [105]:
####   DOTALL with flag

import re

text = 'IT WAS FRIDAY, the thirteenth of October.\nWe joked about that—flying over the Andes on such an unlucky day,'

def myfunc(string):
    pattern=re.compile(r'the.*Andes',re.S)
    array=pattern.sub("it was",string)
    return array

myfunc(text)

'IT WAS FRIDAY, it was on such an unlucky day,'

###  Make REGEX pattern readable

7) re.VERBOSE or re.X
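+  Lets you write more readable regular expressions by visually separating logical sections of the pattern and adding comments. Whitespace in the pattern is ignored (except inside a character class or when escaped with a backslash), and # starts a comment that runs to the end of the line.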

In [107]:
import re

s = 'Python 3'

pattern = r'''^(\w+)  # match one or more word characters at the beginning of the string
               \s*    # match zero or more whitespace characters
              (\d+)$  # match one or more digits at the end of the string'''

l = re.findall(pattern, s, re.VERBOSE)
print(l)

[('Python', '3')]


### Debug REGEX expression

8) re.DEBUG
+  Display debug information about compiled expression

In [119]:
import re

s = 'Python 3'

pattern = r'^(\w+)\s*(\d+)$'

l = re.findall(pattern, s, re.DEBUG)
print(l)

AT AT_BEGINNING
SUBPATTERN 1 0 0
  MAX_REPEAT 1 MAXREPEAT
    IN
      CATEGORY CATEGORY_WORD
MAX_REPEAT 0 MAXREPEAT
  IN
    CATEGORY CATEGORY_SPACE
SUBPATTERN 2 0 0
  MAX_REPEAT 1 MAXREPEAT
    IN
      CATEGORY CATEGORY_DIGIT
AT AT_END

 0. INFO 4 0b0 2 MAXREPEAT (to 5)
 5: AT BEGINNING
 7. MARK 0
 9. REPEAT_ONE 9 1 MAXREPEAT (to 19)
13.   IN 4 (to 18)
15.     CATEGORY UNI_WORD
17.     FAILURE
18:   SUCCESS
19: MARK 1
21. REPEAT_ONE 9 0 MAXREPEAT (to 31)
25.   IN 4 (to 30)
27.     CATEGORY UNI_SPACE
29.     FAILURE
30:   SUCCESS
31: MARK 2
33. REPEAT_ONE 9 1 MAXREPEAT (to 43)
37.   IN 4 (to 42)
39.     CATEGORY UNI_DIGIT
41.     FAILURE
42:   SUCCESS
43: MARK 3
45. AT END
47. SUCCESS
[('Python', '3')]


In [None]:
### Multiple flags

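+  combine multiple flags with the bitwise OR operator |, e.g. re.I | re.M (the + operator also works for distinct flags, but | is the documented form)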
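
A minimal sketch of combining flags; the sample text and variable names are illustrative assumptions. It also shows why | is the safer operator: OR-ing a flag with itself is harmless, while adding it to itself produces a different flag value.

```python
import re

text = 'Chapter 1\nchapter 2\nCHAPTER 3'

# re.I ignores case, re.M lets ^ match at the start of every line
combined = re.findall(r'^chapter \d', text, re.I | re.M)
print(combined)   # ['Chapter 1', 'chapter 2', 'CHAPTER 3']

# OR is idempotent, but + is plain integer addition
or_is_safe = (re.I | re.I) == re.I
plus_changes_flag = (re.I + re.I) == re.L
print(or_is_safe, plus_changes_flag)   # True True
```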

In [None]:
### Inline flags

+  (?#comment)                 another way to add comments; not a flag
+  (?flags:pattern)            apply flags only to this sub-pattern, overriding the flags argument; flags are letters from aiLmsux (for re.A, re.I, re.L, re.M, re.S, re.U, re.X), where a, L and u are mutually exclusive and L is valid only in bytes patterns
+  (?-flags:pattern)           turn flags off for this sub-pattern; only i, m, s and x can be turned off
+  (?flags-flags:pattern)      turn some flags on and others off for this sub-pattern only
+  (?flags)                    apply flags to the whole RE; must be placed at the very start of the RE (an error otherwise since Python 3.11), so any anchors go after (?flags)
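
The inline forms listed above can be sketched like this (the sample text and variable names are assumptions chosen for illustration):

```python
import re

text = 'Chapter ONE\nchapter two'

# (?i) at the start applies IGNORECASE to the whole pattern
whole = re.findall(r'(?i)chapter [a-z]+', text)

# (?i:...) applies IGNORECASE only inside the group; [a-z]+ stays case-sensitive
scoped = re.findall(r'(?i:chapter) [a-z]+', text)

# (?-i:...) switches IGNORECASE off inside the group even though re.I is passed
negated = re.findall(r'(?-i:Chapter) \w+', text, re.I)

# (?#...) is an inline comment and matches nothing
commented = re.findall(r'chapter(?# the literal word) two', text)

print(whole)       # ['Chapter ONE', 'chapter two']
print(scoped)      # ['chapter two']
print(negated)     # ['Chapter ONE']
print(commented)   # ['chapter two']
```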