
Source code: [Lib/re.py](https://github.com/python/cpython/tree/3.7/Lib/re.py)

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). 

However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa.

Similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
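The type rule above can be checked with a short sketch. The sample values `text` and `data` are made up for illustration; the calls themselves are plain `re.search` and `re.sub`.

```python
import re

text = "order 66"    # str (Unicode) input, illustrative value
data = b"order 66"   # bytes (8-bit) input, illustrative value

# A str pattern on str input and a bytes pattern on bytes input both work.
str_match = re.search(r"\d+", text)
bytes_match = re.search(rb"\d+", data)
print(str_match.group())    # 66
print(bytes_match.group())  # b'66'

# Mixing the two types raises TypeError.
try:
    re.search(rb"\d+", text)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
print(type(mixed_error).__name__)  # TypeError

# The replacement must have the same type as the pattern and the string.
replaced = re.sub(rb"\d+", b"N", data)
print(replaced)  # b'order N'
```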

There is a fantastic [cheat sheet](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285) for general regular expression syntax, not specific to Python's `re` module. I suggest reading it first. If you already know that material well, you can skip it.

Python follows most of the same syntax, with a few exceptions. Only the differences and extras are listed below:


**1. Groups can be used to define flags**

`(?aiLmsux)` (one or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x')

The group matches the empty string; the letters set the corresponding flags for the entire regular expression:

- `re.A` (ASCII-only matching)
- `re.I` (ignore case)
- `re.L` (locale dependent)
- `re.M` (multi-line)
- `re.S` (dot matches all)
- `re.U` (Unicode matching)
- `re.X` (verbose)

(The flags are described in Module Contents.)
                       
This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. The flag group should be placed at the very start of the expression string.
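As a minimal sketch of the inline-flag syntax, the example below compares `(?i)` with an explicit `re.I` argument and then combines two letters in one group. The sample `text` is invented for illustration.

```python
import re

text = "First line\nsecond LINE"  # illustrative sample

# (?i) at the start behaves like passing re.I as the flags argument.
inline = re.findall(r"(?i)line", text)
explicit = re.findall(r"line", text, re.I)
print(inline)    # ['line', 'LINE']
print(explicit)  # ['line', 'LINE']

# Several letters can be combined: (?im) sets re.I and re.M together,
# so ^ matches at the start of every line.
starts = re.findall(r"(?im)^\w+", text)
print(starts)  # ['First', 'second']
```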


**2. `(?P<name>...)`**

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name *name*. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.

Named groups can be referenced in three contexts. If the pattern is `(?P<quote>['"]).*?(?P=quote)` (i.e. matching a string quoted with either single or double quotes):

<table border="1" class="last docutils">
<colgroup>
<col width="53%">
<col width="47%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Context of reference to group “quote”</th>
<th class="head">Ways to reference it</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>in the same pattern itself</td>
<td><ul class="first last simple">
<li><code class="docutils literal notranslate"><span class="pre">(?P=quote)</span></code> (as shown)</li>
<li><code class="docutils literal notranslate"><span class="pre">\1</span></code></li>
</ul>
</td>
</tr>
<tr class="row-odd"><td>when processing match object <em>m</em></td>
<td><ul class="first last simple">
<li><code class="docutils literal notranslate"><span class="pre">m.group('quote')</span></code></li>
<li><code class="docutils literal notranslate"><span class="pre">m.end('quote')</span></code> (etc.)</li>
</ul>
</td>
</tr>
<tr class="row-even"><td>in a string passed to the <em>repl</em>
argument of <code class="docutils literal notranslate"><span class="pre">re.sub()</span></code></td>
<td><ul class="first last simple">
<li><code class="docutils literal notranslate"><span class="pre">\g&lt;quote&gt;</span></code></li>
<li><code class="docutils literal notranslate"><span class="pre">\g&lt;1&gt;</span></code></li>
<li><code class="docutils literal notranslate"><span class="pre">\1</span></code></li>
</ul>
</td>
</tr>
</tbody>
</table>
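The three contexts in the table can be tried out with the same `quote` pattern. This is only a sketch: the sample sentence and the `...` replacement text are invented for illustration.

```python
import re

# The pattern from the text: a quote character, lazy content, the same quote again.
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = 'He said "it\'s fine" and left.'  # illustrative sample

# Context 2: processing the match object.
m = pattern.search(text)
print(m.group())                         # "it's fine"
print(m.group('quote'), m.group(1))      # " "
print(m.start('quote'), m.end('quote'))  # 8 9

# Context 1: inside the pattern, \1 is equivalent to (?P=quote).
numbered = re.compile(r"""(?P<quote>['"]).*?\1""")
print(numbered.search(text).group() == m.group())  # True

# Context 3: in the repl string, \g<quote>, \g<1> and \1 name the same group.
by_name = pattern.sub(r'\g<quote>...\g<quote>', text)
by_number = pattern.sub(r'\g<1>...\g<1>', text)
by_backslash = pattern.sub(r'\1...\1', text)
print(by_name)  # He said "..." and left.
```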



**(?P=name)**

A backreference to a named group; it matches whatever text was matched by the earlier group named *name*.

**(?#...)**

A comment; the contents of the parentheses are simply ignored.
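A small sketch combining both constructs: a named backreference that finds a doubled word, with an inline `(?#...)` comment that has no effect on matching. The group name `word` is chosen for this example only.

```python
import re

string = 'A quick brown fox fox jumps over the lazy dog'

# (?P=word) must match the same text that (?P<word>...) captured.
doubled = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b(?# the same word again)')
m = doubled.search(string)
print(m.group())        # fox fox
print(m.group('word'))  # fox

# Removing the comment group does not change what the pattern matches.
plain = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
same = plain.search(string).group() == m.group()
print(same)  # True
```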

In [0]:
## re.compile

import re

In [6]:
## Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.

## The expression's behaviour can be modified by specifying a flags value. Values can be any of the flag variables (re.I, re.M, ...), combined using bitwise OR (the | operator).

pattern = r'[a-z]'
string  = 'A quick brown fox fox jumps over the lazy dog'

prog = re.compile(pattern)

# match() only tries the start of the string; 'A' is uppercase, so there is no match
result = prog.match(string)

## or as a shortcut -> result = re.match(pattern, string)

print(result)

None
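To make the `None` result easier to interpret, here is a short sketch that contrasts `match()` with `search()` and shows flags passed to `re.compile()`, including two flags combined with `|`. The second sample string is invented for illustration.

```python
import re

string = 'A quick brown fox fox jumps over the lazy dog'
prog = re.compile(r'[a-z]')

# match() anchors at position 0; search() scans forward for the first match.
print(prog.match(string))           # None
print(prog.search(string).group())  # q

# With re.I the uppercase 'A' at position 0 matches as well.
ignore_case = re.compile(r'[a-z]', re.I)
print(ignore_case.match(string).group())  # A

# Flags are combined with bitwise OR.
line_starts = re.compile(r'^the', re.I | re.M)
found = line_starts.findall('The cat\nthe dog')
print(found)  # ['The', 'the']
```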


**re.VERBOSE**

This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. 

Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like `*?`, `(?:` or `(?P<...>`.

When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

In [0]:
# The verbose flag in practice.
# The two following regular expression objects, which both match a decimal number, are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")
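The equivalence can be checked on a few inputs. The sketch below extends the decimal-number pattern with a ` kg` suffix (an addition made only for this example) to show that a literal space in verbose mode has to sit inside a character class or be escaped.

```python
import re

verbose = re.compile(r"""
    \d +    # the integral part
    \.      # the decimal point
    \d *    # some fractional digits
    [ ]kg   # a literal space, kept because it is inside a character class
""", re.X)

compact = re.compile(r"\d+\.\d* kg")

samples = ["3.14 kg", "10. kg", "7 kg", "3.14kg"]  # illustrative inputs
verbose_results = [bool(verbose.fullmatch(s)) for s in samples]
compact_results = [bool(compact.fullmatch(s)) for s in samples]
print(verbose_results)                     # [True, True, False, False]
print(verbose_results == compact_results)  # True
```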

In [8]:
re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

In [9]:
re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

In [10]:
re.split(r'\W+', 'Words, words, words.', 1)

['Words', 'words, words.']

In [11]:
re.split(r'(\W+)', '...words, words...')

['', '...', 'words', ', ', 'words', '...', '']

In [12]:
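# Output recorded on a pre-3.7 Python (3.5 or 3.6), which rejects patterns that can only match the empty string.
# Python 3.7+ splits on empty matches and returns ['', 'Words', ', ', 'words', ', ', 'words', '.'].
re.split(r'\b', 'Words, words, words.')

ValueError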

In [13]:
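# Output recorded on a pre-3.7 Python, which skips empty matches and emits a FutureWarning here.
# Python 3.7+ also splits on empty matches and returns ['', '', 'w', 'o', 'r', 'd', 's', '', ''].
re.split(r'\W*', '...words...')

['', 'words', '']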

In [14]:
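# Output recorded on a pre-3.7 Python, which skips empty matches and emits a FutureWarning here.
# Python 3.7+ returns ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', ''].
re.split(r'(\W*)', '...words...')

['', '...', 'words', '...', '']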

In [0]:
re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
       r'static PyObject*\npy_\1(void)\n{',
       'def myfunc():')

'static PyObject*\npy_myfunc(void)\n{'

[More on regular expression syntax](https://docs.python.org/3/library/re.html#regular-expression-syntax)