## Some Advanced Python Topics (Class 13)

This notebook references these Python tutorials and resources:

 __Python Distilled__, by David Beazley; Pearson; 1st edition (September 22, 2021), ISBN-10: 0134173279, ISBN-13: 978-0134173276
  
 __Regular Expressions: Regexes in Python (Part 1)__:  
 https://realpython.com/regex-python/#regexes-in-python-and-their-uses


This notebook by:

***Eric V. Level***  

Graduate Programs in Software Engineering and Data Science  
University of St Thomas  
St Paul, MN



### Notebook Objectives

- To review the idea of "Python protocols" from D. Beazley's book, and explore how to implement them in your classes.

- To understand and use regular expressions via Python's built-in `re` module, following the tutorial on the RealPython.com website.

### "Python Distilled" (PD, Chapter 4.8-4.18):  Object Protocols

#### 4.8 Object Protocols and Data Abstraction

Most Python language features are defined by protocols. Consider the following function:

In [1]:
def compute_cost(unit_price, num_units):
    return unit_price * num_units

Now, ask yourself the question: What inputs are allowed? The answer is deceptively simple—everything is allowed! At first glance, this function looks like it might apply to numbers:

In [2]:
compute_cost(1.25, 50)

62.5

Indeed, it works as expected. However, the function works with much more. You can use specialized numbers such as fractions or decimals:

In [14]:
from fractions import Fraction
# from ourfraction import Fraction
compute_cost(Fraction(5, 4), 50)
# Fraction(125, 2)

Fraction(125, 2)

In [12]:
from decimal import Decimal
compute_cost(Decimal('1.25'), Decimal('50'))

Decimal('62.50')

Not only that—the function works with arrays and other complex structures from packages such as `numpy`. 

For example:

In [5]:
import numpy as np
prices = np.array([1.25, 2.10, 3.05])
units = np.array([50, 20, 25])
compute_cost(prices, units)

array([62.5 , 42.  , 76.25])

The function might even work in unexpected ways:

In [6]:
compute_cost('a lot', 10)

'a lota lota lota lota lota lota lota lota lota lot'

And yet, certain combinations of types fail:

In [15]:
from fractions import Fraction
from decimal import Decimal
compute_cost(Fraction(5, 4), Decimal('50'))

TypeError: unsupported operand type(s) for *: 'Fraction' and 'decimal.Decimal'

Unlike a compiler for a static language, Python does not verify correct program behavior in advance. Instead, the behavior of an object is determined by a dynamic process that involves the dispatch of so-called **“special” or “magic” methods**. The names of these special methods are always preceded and followed by double underscores (`__`). The methods are automatically triggered by the interpreter as a program executes. For example, the operation `x * y` is carried out by a method `x.__mul__(y)`. The names of these methods and their corresponding operators are hard-wired. The behavior of any given object depends entirely on the set of special methods that it implements.

The next few sections describe the special methods associated with different categories of core interpreter features. These categories are sometimes called **“protocols.”** An object, including a user-defined class, may define any combination of these features to make the object behave in different ways.
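
As a minimal sketch of this idea, the hypothetical `Money` class below (not from the book) becomes usable with `compute_cost()` simply by defining `__mul__()`. The class name and its `cents` attribute are assumptions made for illustration.

```python
def compute_cost(unit_price, num_units):
    return unit_price * num_units


class Money:
    """Hypothetical amount-of-money type, stored as whole cents."""

    def __init__(self, cents):
        self.cents = cents

    def __mul__(self, other):
        # Triggered by the interpreter for `self * other`.
        if isinstance(other, int):
            return Money(self.cents * other)
        return NotImplemented      # tell Python this combination is unsupported

    def __repr__(self):
        return f"Money({self.cents})"


total = compute_cost(Money(125), 50)
print(total)                       # Money(6250)

try:
    compute_cost(Money(125), 1.5)  # Money.__mul__ declines floats
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)
```

Nothing in `compute_cost()` changed; only the set of special methods on the argument determines whether the call succeeds.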

### PD - 4.9 Object Protocol

The methods in Table 4.1 are related to the overall management of objects. This includes object creation, initialization, destruction, and representation.

#### Table 4.1 Methods for Object Management

**Method**  
	**Description**

`__new__(cls [,*args [,**kwargs]])`  
	A static method called to create a new instance.

`__init__(self [,*args [,**kwargs]])`  
	Called to initialize a new instance after it’s been created.

`__del__(self)`  
	Called when an instance is being destroyed.

`__repr__(self)`  
	Create a string representation.

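The `__new__()` and `__init__()` methods are used together to create and initialize instances. When an object is created by calling `SomeClass(args)`, the call is translated into the steps shown below. (The first cell is a short aside contrasting `print()` with `repr()`; the creation steps follow in the next two cells.)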

In [22]:
st = "Are we having fun yet"
print(st)
st
repr(st)

Are we having fun yet


"'Are we having fun yet'"

In [16]:
class SomeClass():
    def __init__(self,l):
        print ("SomeClass instance initialized!")

In [17]:
args = []
x = SomeClass.__new__(SomeClass, args)
if isinstance(x, SomeClass):
    x.__init__(args)

SomeClass instance initialized!


Normally, these steps are handled behind the scenes and you don’t need to worry about them. The most common method implemented in a class is `__init__()`. Use of `__new__()` almost always indicates the presence of advanced magic related to instance creation (for example, it is used in class methods that want to bypass `__init__()` or in certain creational design patterns such as singletons or caching). The implementation of `__new__()` doesn’t necessarily need to return an instance of the class in question; if it doesn’t, the subsequent call to `__init__()` on creation is skipped.
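
The two uses of `__new__()` mentioned above can be sketched as follows. Both classes (`Registry` and `FortyTwo`) are hypothetical examples, not code from the book.

```python
class Registry:
    """Hypothetical singleton: __new__ always hands back the same instance."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Runs on every Registry() call, because __new__ returned a Registry.
        self.init_calls = getattr(self, "init_calls", 0) + 1


class FortyTwo:
    """__new__ returns something that is not a FortyTwo, so __init__ is skipped."""

    def __new__(cls):
        return 42

    def __init__(self):
        raise RuntimeError("never reached")


a = Registry()
b = Registry()
print(a is b)          # True
print(a.init_calls)    # 2

value = FortyTwo()
print(value)           # 42
```

Note that the singleton's `__init__()` still runs on every call, which is one reason this pattern needs care.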

The `__del__()` method is invoked when an instance is about to be **garbage-collected**. This method is invoked only when an instance is no longer in use. Note that the statement `del x` only decrements the instance reference count and doesn’t necessarily result in a call to this function. `__del__()` is almost never defined unless an instance needs to perform additional resource management steps upon destruction.

The `__repr__()` method, called by the built-in `repr()` function, creates a string representation of an object that can be useful for debugging and printing. This is also the method responsible for creating the output of values you see when inspecting variables in the interactive interpreter. The convention is for `__repr__()` to return an expression string that can be evaluated to re-create the object using `eval()`. For example:

In [18]:
a = [2, 3, 4, 5]   # Create a list
s = repr(a)        # s = '[2, 3, 4, 5]'
print (type(s),s)
b = eval(s)        # Turns s back into a list
print (type(b),b)

<class 'str'> [2, 3, 4, 5]
<class 'list'> [2, 3, 4, 5]


If a string expression cannot be created, the convention is for `__repr__()` to return a string of the form `<...message...>`, as shown here:

In [19]:
f = open('foo.txt')
a = repr(f)
a
# a = "<_io.TextIOWrapper name='foo.txt' mode='r' encoding='UTF-8'>"

"<_io.TextIOWrapper name='foo.txt' mode='r' encoding='UTF-8'>"

### PD - 4.10 Number Protocol

Table 4.2 lists special methods that objects must implement to provide mathematical operations.

#### Table 4.2 Methods for Mathematical Operations

```
Method                     Operation

__add__(self, other)       self + other
__sub__(self, other)       self - other
__mul__(self, other)       self * other
__truediv__(self, other)   self / other
__floordiv__(self, other)  self // other
__mod__(self, other)       self % other
__matmul__(self, other)    self @ other
__divmod__(self, other)    divmod(self, other)
__pow__(self, other [, modulo])   self ** other, pow(self, other, modulo)
__lshift__(self, other)    self << other
__rshift__(self, other)    self >> other
__and__(self, other)       self & other
__or__(self, other)        self | other
__xor__(self, other)       self ^ other
__radd__(self, other)      other + self
__rsub__(self, other)      other - self
__rmul__(self, other)      other * self
__rtruediv__(self, other)  other / self
__rfloordiv__(self, other) other // self
__rmod__(self, other)      other % self
__rmatmul__(self, other)   other @ self
__rdivmod__(self, other)   divmod(other, self)
__rpow__(self, other)      other ** self
__rlshift__(self, other)   other << self
__rrshift__(self, other)   other >> self
__rand__(self, other)      other & self
__ror__(self, other)       other | self
__rxor__(self, other)      other ^ self
__iadd__(self, other)      self += other
__isub__(self, other)      self -= other
__imul__(self, other)      self *= other
__itruediv__(self, other)  self /= other
__ifloordiv__(self, other) self //= other
__imod__(self, other)      self %= other
__imatmul__(self, other)   self @= other
__ipow__(self, other)      self **= other
__iand__(self, other)      self &= other
__ior__(self, other)       self |= other
__ixor__(self, other)      self ^= other
__ilshift__(self, other)   self <<= other
__irshift__(self, other)   self >>= other
__neg__(self)              -self
__pos__(self)              +self
__invert__(self)           ~self
__abs__(self)              abs(self)
__round__(self, n)         round(self, n)
__floor__(self)            math.floor(self)
__ceil__(self)             math.ceil(self)
__trunc__(self)            math.trunc(self)
```

#### Fraction from DSP-1 book, augmented with some of the above...

In [20]:
# _dsp-1_13_1_2_3-fraction_class.py

def gcd(m, n):
    while m % n != 0:
        m, n = n, m % n
    return n

class Fraction:
    def __init__(self, top, bottom):
        self.num = top
        self.den = bottom

    def __str__(self):
        return "{:d}/{:d}".format(self.num, self.den)

    def __eq__(self, other_fraction):
        first_num = self.num * other_fraction.den
        second_num = other_fraction.num * self.den
        print ("calling __eq__()")
        return first_num == second_num

    def __add__(self, other_fraction):
        new_num = self.num * other_fraction.den \
        + self.den * other_fraction.num
        new_den = self.den * other_fraction.den
        cmmn = gcd(new_num, new_den)
        return Fraction(new_num // cmmn, new_den // cmmn)

    def show(self):
        print("{:d}/{:d}".format(self.num, self.den))
        
    def __repr__(self): # added
        '''Return a string that eval() can turn back into an equal Fraction.'''
        return f'Fraction({self.num}, {self.den})'

x = Fraction(1, 2)
x.show()
y = Fraction(2, 3)
print(y)
print(x + y) # uses __add__
print(x == y)

# added to book's code
print(x != y)  # no __ne__() defined, so __eq__() is called and its result is negated

print(x < y)   # raises TypeError: __lt__() is not implemented (yet)


1/2
2/3
7/6
calling __eq__()
False
calling __eq__()
True


TypeError: '<' not supported between instances of 'Fraction' and 'Fraction'

When presented with an expression such as `x + y`, the interpreter invokes `x.__add__(y)`, `y.__radd__(x)`, or both to carry out the operation. The initial choice is to try `x.__add__(y)` in all cases except for the special case where the type of `y` is a subclass of the type of `x`; in that case, `y.__radd__(x)` executes first. If the initial method fails by returning `NotImplemented`, an attempt is made to invoke the operation with reversed operands such as `y.__radd__(x)`. If this second attempt also fails, the entire operation fails with a `TypeError`. Here is an example:

In [23]:
a = 42       # int
b = 3.7      # float

In [24]:
a.__add__(b) # not implemented: 

NotImplemented

In [25]:
b.__radd__(a)

45.7

In [26]:
isinstance(47,int)

True

In [27]:
isinstance(47,float) 

False

In [28]:
issubclass(int, float)   # int is not a subclass of float...

False

In [29]:
issubclass(float, int)   # ...and float is not a subclass of int

False

This example might seem surprising but it reflects the fact that integers don’t actually know anything about floating-point numbers. However, floating-point numbers do know about integers—as integers are, mathematically, a special kind of floating-point numbers. Thus, the reversed operand produces the correct answer.
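
The same dispatch rules apply to user-defined classes. The following is a minimal sketch using a hypothetical `Meters` class (an assumption for illustration, not from the book) that supports addition from either side.

```python
class Meters:
    """Hypothetical length type used only to illustrate __add__/__radd__."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented          # let Python try other.__radd__(self)

    def __radd__(self, other):
        # Reached for 3 + Meters(2), after int.__add__ returns NotImplemented.
        return self.__add__(other)

    def __repr__(self):
        return f"Meters({self.value})"


left = Meters(2) + 3       # Meters.__add__
right = 3 + Meters(2)      # int.__add__ fails, then Meters.__radd__
print(left, right)         # Meters(5) Meters(5)

try:
    Meters(2) + "three"    # both attempts fail
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)
```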

The methods `__iadd__()`, `__isub__()`, and so forth are used to support in-place arithmetic operators such as `a += b` and `a -= b` (also known as *augmented assignment*). A distinction is made between these operators and the standard arithmetic methods because the implementation of the in-place operators might be able to provide certain customizations or performance optimizations. For instance, if the object is not shared, the value of an object could be modified in place without allocating a newly created object for the result. If the in-place operators are left undefined, an operation such as `a += b` is evaluated using `a = a + b` instead.
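
A small sketch of the difference, using a hypothetical `Bag` class (not from the book): with `__iadd__()` defined, `+=` mutates the existing object, whereas a type without it, such as `tuple`, falls back to building a new object.

```python
class Bag:
    """Hypothetical container that supports both + and in-place +=."""

    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        return Bag(self.items + other.items)   # new object

    def __iadd__(self, other):
        self.items.extend(other.items)         # modify in place
        return self                            # the result is rebound to the name


a = Bag([1, 2])
alias = a
a += Bag([3])
print(a is alias, alias.items)   # True [1, 2, 3]

t = (1, 2)
t_alias = t
t += (3,)                        # tuple has no __iadd__: evaluated as t = t + (3,)
print(t is t_alias, t_alias)     # False (1, 2)
```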

There are no methods that can be used to define the behavior of the logical `and`, `or`, or `not` operators. The `and` and `or` operators implement short-circuit evaluation where evaluation stops if the final result can already be determined. For example:

In [30]:
True or 1/0      # Does not evaluate 1/0: short circuit!

True

This behavior involving unevaluated subexpressions **can’t be expressed** using the evaluation rules of a normal function or method. Thus, there is no protocol or set of methods for redefining it. Instead, it is handled as a special case deep inside the implementation of Python itself.

### PD - 4.11 Comparison Protocol

Objects can be compared in various ways. The most basic check is an identity check with the `is` operator. For example, `a is b`. Identity does not consider the values stored inside of an object, even if they happen to be the same. For example:

In [31]:
a = [1, 2, 3]
b = a
a is b

True

In [32]:
c = [1, 2, 3]
a is c

False

The `is` operator is an internal part of Python that can’t be redefined. All other comparisons on objects are implemented by the methods in Table 4.3.

#### Table 4.3 Methods for Instance Comparison and Hashing
```
Method                  Description

__bool__(self)          Returns False or True for truth-value testing
__eq__(self, other)     self == other
__ne__(self, other)     self != other
__lt__(self, other)     self < other
__le__(self, other)     self <= other
__gt__(self, other)     self > other
__ge__(self, other)     self >= other
__hash__(self)          Computes an integer hash index
```

The `__bool__()` method, if present, is used to determine the truth value when an object is tested as part of a condition or conditional expression. For example:

```
if a:              # Executes a.__bool__()
   ...
else:
   ...
```

If `__bool__()` is undefined, then `__len__()` is used as a fallback. If both `__bool__()` and `__len__()` are undefined, an object is simply considered to be `True`.
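
The three cases can be sketched with three small hypothetical classes (names chosen for illustration only):

```python
class Inbox:
    """Only __len__ is defined, so truth value follows the length."""

    def __init__(self, messages):
        self.messages = messages

    def __len__(self):
        return len(self.messages)


class Switch:
    """__bool__ takes priority over __len__ when both exist."""

    def __bool__(self):
        return False

    def __len__(self):
        return 10


class Plain:
    """Neither method is defined: instances are always considered true."""


print(bool(Inbox([])))        # False
print(bool(Inbox(["hi"])))    # True
print(bool(Switch()))         # False
print(bool(Plain()))          # True
```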

The `__eq__()` method is used to determine basic equality for use with the `==` and `!=` operators. The default implementation of `__eq__()` compares objects by identity using the `is` operator. The `__ne__()` method, if present, can be used to implement special processing for `!=`, but is usually not required as long as `__eq__()` is defined.

Ordering is determined by the relational operators (`<`, `>`, `<=`, and `>=`) using methods such as `__lt__()` and `__gt__()`. As with other mathematical operations, the evaluation rules are subtle. To evaluate `a < b`, the interpreter will first try to execute `a.__lt__(b)` except where `b` is a subtype of `a`. In that one specific case, `b.__gt__(a)` executes instead. If this initial method is not defined or returns `NotImplemented`, the interpreter tries a reversed comparison, calling `b.__gt__(a)`. Similar rules apply to operators such as `<=` and `>=`. For example, evaluating `<=` first tries to evaluate `a.__le__(b)`. If not implemented, `b.__ge__(a)` is tried.

Each of the comparison methods takes two arguments and is allowed to return any kind of value, including a Boolean value, a list, or any other Python type. For instance, a numerical package might use this to perform an element-wise comparison of two matrices, returning a matrix with the results. If comparison is not possible, the methods should return the built-in object `NotImplemented`. This is not the same as the `NotImplementedError` exception. For example:

In [33]:
a = 42      # int
b = 52.3    # float
a.__lt__(b)

NotImplemented

In [34]:
b.__gt__(a)

True

It is not necessary for an ordered object to implement all of the comparison operations in Table 4.3. If you want to be able to sort objects or use functions such as `min()` or `max()`, then `__lt__()` must be minimally defined. If you are adding comparison operators to a user-defined class, the `@total_ordering` class decorator in the `functools` module may be of some use. It can generate all of the methods as long as you minimally implement `__eq__()` and one of the other comparisons.

The `__hash__()` method is defined on instances that are to be placed into a set or be used as keys in a mapping (dictionary). The value returned is an integer that should be the same for two instances that compare as equal. Moreover, `__eq__()` should always be defined together with `__hash__()` because the two methods work together. The value returned by `__hash__()` is typically used as an internal implementation detail of various data structures. However, it’s possible for two different objects to have the same hash value. Therefore, `__eq__()` is necessary to resolve potential collisions.

In [35]:
print (hash(47.0))
print (hash(47))
print (hash("47"))
print (hash("48"))

47
47
7975181688420203625
-4652752973849767532


#### Fraction, again with comparisons added
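
The following is one possible sketch, not the author's original cell. It assumes a trimmed-down `Fraction` with the same `num` and `den` attributes as the earlier class, restricts itself to positive denominators, and uses `functools.total_ordering` to derive the remaining comparisons from `__eq__()` and `__lt__()`.

```python
from functools import total_ordering
from math import gcd


@total_ordering
class Fraction:
    """Sketch only: assumes the denominator is a positive integer."""

    def __init__(self, top, bottom):
        common = gcd(top, bottom)
        self.num = top // common      # store in lowest terms
        self.den = bottom // common

    def __repr__(self):
        return f"Fraction({self.num}, {self.den})"

    def __eq__(self, other):
        if not isinstance(other, Fraction):
            return NotImplemented
        return self.num * other.den == other.num * self.den

    def __lt__(self, other):
        if not isinstance(other, Fraction):
            return NotImplemented
        # Cross-multiplication is valid because denominators are positive.
        return self.num * other.den < other.num * self.den

    def __hash__(self):
        # Equal fractions share the same reduced (num, den), hence the same hash.
        return hash((self.num, self.den))


x = Fraction(1, 2)
y = Fraction(2, 3)
print(x < y, x >= y)                         # True False
print(sorted([y, x]))                        # [Fraction(1, 2), Fraction(2, 3)]
print(max(x, y))                             # Fraction(2, 3)
print(len({Fraction(1, 2), Fraction(2, 4)})) # 1
```

Defining `__hash__()` alongside `__eq__()` matters here: a class that defines only `__eq__()` becomes unhashable and cannot be placed in a set or used as a dictionary key.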

### PD - 4.12 Conversion Protocols

Sometimes, you must convert an object to a built-in type such as a string or a number. The methods in Table 4.4 can be defined for this purpose.

#### Table 4.4 Methods for Conversions

```
Method                              Description

__str__(self)                       Conversion to a string
__bytes__(self)                     Conversion to bytes
__format__(self, format_spec)       Creates a formatted representation
__bool__(self)                      bool(self)
__int__(self)                       int(self)
__float__(self)                     float(self)
__complex__(self)                   complex(self)
__index__(self)                     Conversion to an integer index [self]
```

In [37]:
type(0+1j)

complex

The `__str__()` method is called by the built-in `str()` function and by functions related to printing. The `__format__()` method is called by the `format()` function or the `format()` method of strings. The `format_spec` argument is a string containing the format specification. This string is the same as the `format_spec` argument to `format()`. For example:

In [38]:
bytes(47)

b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

In [36]:
x = 47.5
spec = '>10'
print(f'{x:{spec}}')                   # Calls x.__format__('>10')
print(format(x, spec))                 # Calls x.__format__('>10')
print('x is {0:{1}}'.format(x, spec))  # Calls x.__format__('>10')

      47.5
      47.5
x is       47.5

The syntax of the format specification is arbitrary and can be customized on an object-by-object basis. However, there is a standard set of conventions used for the built-in types. More information about string formatting, including the general format of the specifier, can be found in Chapter 9.
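
A class that does not define `__format__()` inherits the default from `object`, which raises a `TypeError` for any non-empty format specification. The sketch below shows a custom `__format__()`; the `Temperature` class and its trailing-`F` convention are hypothetical, invented here to show that the specification syntax is up to the object.

```python
class Temperature:
    """Hypothetical type with its own format-spec convention."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, format_spec):
        # Assumed convention: a trailing 'F' requests Fahrenheit; whatever
        # remains is handed on to the ordinary float formatting rules.
        if format_spec.endswith("F"):
            fahrenheit = self.celsius * 9 / 5 + 32
            return format(fahrenheit, format_spec[:-1]) + "F"
        return format(self.celsius, format_spec) + "C"


t = Temperature(21.5)
print(f"{t:.1f}")          # 21.5C
print(f"{t:.1fF}")         # 70.7F
print(format(t, ""))       # 21.5C


class NoFormat:
    pass


try:
    format(NoFormat(), ">10")
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)
```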

The `__bytes__()` method is used to create a byte representation if an instance is passed to `bytes()`. Not all types support byte conversion.

The numeric conversions `__bool__()`, `__int__()`, `__float__()`, and `__complex__()` are expected to produce a value of the corresponding built-in type.

Python never performs implicit type conversions using these methods. Thus, even if an object `x` implements an `__int__()` method, the expression `3 + x` will still produce a `TypeError`. The only way to execute `__int__()` is through an explicit use of the `int()` function.

The `__index__()` method performs an integer conversion of an object when it’s used in an operation that requires an integer value. This includes indexing in sequence operations. For example, if `items` is a list, performing an operation such as `items[x]` will attempt to execute `items[x.__index__()]` if `x` is not an integer. `__index__()` is also used in various base conversions such as `oct(x)` and `hex(x)`.
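
The contrast between explicit conversion and index conversion can be sketched with a hypothetical `Slot` class (an assumption for illustration):

```python
class Slot:
    """Hypothetical integer-like wrapper."""

    def __init__(self, n):
        self.n = n

    def __int__(self):
        return self.n        # used only by an explicit int(...)

    def __index__(self):
        return self.n        # used wherever Python requires a true integer


items = ["a", "b", "c", "d"]

print(int(Slot(2)))                 # 2
print(items[Slot(2)])               # c
print(items[Slot(1):Slot(3)])       # ['b', 'c']
print(hex(Slot(255)))               # 0xff

try:
    3 + Slot(2)                     # no implicit conversion through __int__
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)
```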

### PD - 4.13 Container Protocol

The methods in Table 4.5 are used by objects that want to implement containers of various kinds—lists, dicts, sets, and so on.

#### Table 4.5 Methods for Containers
```
Method                            Description

__len__(self)                     Returns the length of self
__getitem__(self, key)            Returns self[key]
__setitem__(self, key, value)     Sets self[key] = value
__delitem__(self, key)            Deletes self[key]
__contains__(self, obj)           obj in self
```

Here’s an example:

In [39]:
a = [1, 2, 3, 4, 5, 6]

print (len(a))       # a.__len__()
x = a[2]             # x = a.__getitem__(2)
print(x)
a[1] = 7             # a.__setitem__(1,7)
print (a)
del a[2]             # a.__delitem__(2)
print (a)
5 in a               # a.__contains__(5)

6
3
[1, 7, 3, 4, 5, 6]
[1, 7, 4, 5, 6]


True

The `__len__()` method is called by the built-in `len()` function to return a nonnegative length. This method also determines truth values unless the `__bool__()` method has also been defined.

For accessing individual items, the `__getitem__()` method can return an item by key value. The key can be any Python object, but it is expected to be an integer for ordered sequences such as lists and arrays. The `__setitem__()` method assigns a value to an element. The `__delitem__()` method is invoked whenever the `del` operation is applied to a single element. The `__contains__()` method is used to implement the `in` operator.

Slicing operations such as `x = s[i:j]` are also implemented using `__getitem__()`, `__setitem__()`, and `__delitem__()`. For slices, a special slice instance is passed as the key. This instance has attributes that describe the range of the slice being requested. For example:

In [40]:
a = [1,2,3,4,5,6]
x = a[1:5]           # x = a.__getitem__(slice(1, 5, None))
a[1:3] = [10,11,12]  # a.__setitem__(slice(1, 3, None), [10, 11, 12])
del a[1:4]           # a.__delitem__(slice(1, 4, None))
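
A user-defined container can take part in the same protocol. The sketch below uses a hypothetical `Deck` class (not from the book) that wraps a list and inspects the key to tell a single index from a `slice` object.

```python
class Deck:
    """Hypothetical container wrapping a list of cards."""

    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):
        return len(self._cards)

    def __getitem__(self, key):
        if isinstance(key, slice):
            return Deck(self._cards[key])   # slicing returns another Deck
        return self._cards[key]

    def __setitem__(self, key, value):
        self._cards[key] = value

    def __delitem__(self, key):
        del self._cards[key]

    def __contains__(self, card):
        return card in self._cards

    def __repr__(self):
        return f"Deck({self._cards})"


deck = Deck(["A", "K", "Q", "J"])
top_two = deck[0:2]         # deck.__getitem__(slice(0, 2, None))
print(top_two)              # Deck(['A', 'K'])
deck[1] = "10"              # deck.__setitem__(1, '10')
del deck[0]                 # deck.__delitem__(0)
print(deck, len(deck))      # Deck(['10', 'Q', 'J']) 3
print("Q" in deck)          # True
```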

The slicing features of Python are more powerful than many programmers realize. For example, the following variations of extended slicing are all supported and may be useful for working with multidimensional data structures such as matrices and arrays:

In [41]:
import numpy as np
# units = np.array([50, 20, 25])

# m = numpy.array(5,5)
m = np.arange(100).reshape(10, 10)
print (m)
a = m[0:100:10]          # Strided slice (step=10)
b = m[1:10, 3:20]        # Multidimensional slice
c = m[0:100:10, 50:75:5] # Multiple dimensions with strides
# m[0:5, 5:10] = n         # extended slice assignment
# del m[:10, 15:]          # extended slice deletion

print("a==",a)
print("b==",b)
print("c==",c)

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]
 [50 51 52 53 54 55 56 57 58 59]
 [60 61 62 63 64 65 66 67 68 69]
 [70 71 72 73 74 75 76 77 78 79]
 [80 81 82 83 84 85 86 87 88 89]
 [90 91 92 93 94 95 96 97 98 99]]
a== [[0 1 2 3 4 5 6 7 8 9]]
b== [[13 14 15 16 17 18 19]
 [23 24 25 26 27 28 29]
 [33 34 35 36 37 38 39]
 [43 44 45 46 47 48 49]
 [53 54 55 56 57 58 59]
 [63 64 65 66 67 68 69]
 [73 74 75 76 77 78 79]
 [83 84 85 86 87 88 89]
 [93 94 95 96 97 98 99]]
c== []


The general format for each dimension of an extended slice is `i:j[:stride]`, where `stride` is optional. As with ordinary slices, you can omit the starting or ending values for each part of a slice.

In addition, the `Ellipsis` (written as `...`) is available to denote any number of trailing or leading dimensions in an extended slice:

In [42]:
a = m[..., 5:10]     # extended slice access with Ellipsis: the last five columns, shape (10, 5)
n = np.zeros_like(m)
n[..., 0:5] = a      # extended slice assignment with Ellipsis

When using extended slices, the `__getitem__()`, `__setitem__()`, and `__delitem__()` methods implement access, modification, and deletion, respectively. However, instead of an integer, the value passed to these methods is a tuple containing a combination of slice or `Ellipsis` objects. For example,

In [43]:
a = m[0:10, 0:100:5, ...]
a

array([[ 0,  5],
       [10, 15],
       [20, 25],
       [30, 35],
       [40, 45],
       [50, 55],
       [60, 65],
       [70, 75],
       [80, 85],
       [90, 95]])

invokes `__getitem__()` as follows:

In [44]:
a = m.__getitem__((slice(0,10,None), slice(0,100,5), Ellipsis))
a

array([[ 0,  5],
       [10, 15],
       [20, 25],
       [30, 35],
       [40, 45],
       [50, 55],
       [60, 65],
       [70, 75],
       [80, 85],
       [90, 95]])

Python strings, tuples, and lists currently provide some support for extended slices. No part of Python or its standard library makes use of multidimensional slicing or the `Ellipsis`. Those features are reserved purely for third-party libraries and frameworks. Perhaps the most common place you would see them used is in a library such as `numpy`.

### PD - 4.14 Iteration Protocol

If an instance, `obj`, supports iteration, it provides a method, `obj.__iter__()`, that returns an **iterator**. An iterator `iter`, in turn, implements a single method, `iter.__next__()`, that returns the next object or raises `StopIteration` to signal the end of iteration. These methods are used by the implementation of the `for` statement as well as other operations that implicitly perform iteration. For example, the statement `for x in s` is carried out by performing these steps:

In [45]:
s = [1,2,47]
_iter = s.__iter__()
while True:
    try:
        x = _iter.__next__()
        print (x)
    except StopIteration:
        break
    # Do statements in body of for loop
    ...

1
2
47


An object may optionally provide a reversed iterator if it implements the `__reversed__()` special method. This method should return an iterator object with the same interface as a normal iterator (that is, a `__next__()` method that raises `StopIteration` at the end of iteration). This method is used by the built-in `reversed()` function. For example:

In [46]:
for x in reversed([1,2,3]):
    print(x)

3
2
1
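
Both halves of the protocol can be written out by hand. In the sketch below, the hypothetical `Countdown` class returns a separate iterator object from `__iter__()` and also provides `__reversed__()`; the class names are assumptions made for illustration.

```python
class CountdownIterator:
    """Iterator: keeps the position and implements __next__()."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self              # iterators are themselves iterable

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # signals the end of iteration
        value = self.current
        self.current -= 1
        return value


class Countdown:
    """Iterable: hands out a fresh iterator on every __iter__() call."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)

    def __reversed__(self):
        return iter(range(1, self.start + 1))


c = Countdown(3)
print(list(c))             # [3, 2, 1]
print(list(c))             # [3, 2, 1]  (a new iterator each time)
print(list(reversed(c)))   # [1, 2, 3]
```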


A common implementation technique for iteration is to use a generator function involving `yield`. For example:

In [47]:
# a range()-like class for floats; the generator in __iter__() supplies the iterator

class FRange:
    def __init__(self, start, stop, step):
        self.start = start
        self.stop = stop
        self.step = step

    def __iter__(self):
        x = self.start
        while x < self.stop:
            yield x
            x += self.step

# Example use:
nums = FRange(0.0, 1.0, 0.1)
for x in nums:
    print(x)     # 0.0, 0.1, 0.2, 0.3, ...

0.0
0.1
0.2
0.30000000000000004
0.4
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999


In [48]:
print(type(range(47)))

<class 'range'>


This works because generator functions conform to the iteration protocol themselves. It’s a bit easier to implement an iterator in this way since you only have to worry about the `__iter__()` method. The rest of the iteration machinery is already provided by the generator.

### PD - 4.15 Attribute Protocol

The methods in Table 4.6 read, write, and delete the attributes of an object using the dot (`.`) operator and the `del` operator, respectively.

**Table 4.6 Methods for Attribute Access**

```
Method                          Description

__getattribute__(self, name)    Returns the attribute self.name  
__getattr__(self, name)         Returns the attribute self.name if it’s not found 
                                  through __getattribute__()
__setattr__(self, name, value)  Sets the attribute self.name = value
__delattr__(self, name)         Deletes the attribute del self.name
```

Whenever an attribute is accessed, the `__getattribute__()` method is invoked. If the attribute is located, its value is returned. Otherwise, the `__getattr__()` method is invoked. The default behavior of `__getattr__()` is to raise an `AttributeError` exception. The `__setattr__()` method is always invoked when setting an attribute, and the `__delattr__()` method is always invoked when deleting an attribute.

These methods are fairly blunt, in that they allow a type to completely redefine attribute access for all attributes. User-defined classes can define properties and descriptors which allow for more fine-grained control of attribute access. This is discussed further in Chapter 7.
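
As a rough sketch of how these hooks interact, the hypothetical `Record` class below (not from the book) stores its data in an internal dictionary and logs every attribute write and delete. Note that `__getattr__()` runs only after normal lookup has failed.

```python
class Record:
    """Hypothetical class that reroutes attribute access to a dict."""

    def __init__(self, **fields):
        # Use object.__setattr__ to bypass our own __setattr__ during setup.
        object.__setattr__(self, "_fields", dict(fields))
        object.__setattr__(self, "log", [])

    def __getattr__(self, name):
        # Called only when __getattribute__() could not find `name`.
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self.log.append(("set", name))
        self._fields[name] = value

    def __delattr__(self, name):
        self.log.append(("del", name))
        del self._fields[name]


r = Record(name="Ada")
print(r.name)              # Ada   (found through __getattr__)
r.age = 36                 # r.__setattr__('age', 36)
print(r.age)               # 36
del r.age                  # r.__delattr__('age')
print(hasattr(r, "age"))   # False
print(r.log)               # [('set', 'age'), ('del', 'age')]
```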

### PD - 4.16 Function Protocol

An object can emulate a function by providing the `__call__()` method. If an object `x` provides this method, it can be invoked like a function. That is, `x(arg1, arg2, ...)` invokes `x.__call__(arg1, arg2, ...)`.

There are many built-in types that support function calls. For example, types implement `__call__()` to create new instances. Bound methods implement `__call__()` to pass the `self` argument to instance methods. Library functions such as `functools.partial()` also create objects that emulate functions.
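
A short sketch of a callable object, next to the equivalent built with `functools.partial()`. The `Multiplier` class is a hypothetical example.

```python
from functools import partial
from operator import mul


class Multiplier:
    """Hypothetical callable object that remembers a factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        return self.factor * value


double = Multiplier(2)
print(double(21))              # 42, i.e. double.__call__(21)
print(callable(double))        # True

triple = partial(mul, 3)       # a function-like object from the library
print(triple(14))              # 42

print(list(map(double, [1, 2, 3])))   # [2, 4, 6]
```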

### PD - 4.17 Context Manager Protocol

The `with` statement allows a sequence of statements to execute under the control of an instance known as a **context manager**. The general syntax is as follows:
```
with context [as var]:
     statements
```
A context object shown here is expected to implement the methods listed in Table 4.7.

**Table 4.7 Methods for Context Managers**

```
Method             Description

__enter__(self)    Called when entering a new context.  
                   The return value is placed in the variable listed with the `as`   
                   specifier to the `with` statement.  

__exit__(self, type, value, tb)
                   Called when leaving a context. If an exception occurred, 
                   type, value, and tb have the exception type, value, and traceback  
                   information.  
```


The `__enter__()` method is invoked when the `with` statement executes. The value returned by this method is placed into the variable specified with the optional `as var` specifier. The `__exit__()` method is called as soon as control flow leaves the block of statements associated with the `with` statement. As arguments, `__exit__()` receives the current exception type, value, and traceback if an exception has been raised. If no errors are being handled, all three values are set to `None`. The `__exit__()` method should return `True` or `False` to indicate if a raised exception was handled or not. If `True` is returned, any pending exception is cleared and program execution continues normally with the first statement after the `with` block.

The primary use of the context management interface is to allow for simplified resource control on objects involving system state such as open files, network connections, and locks. By implementing this interface, an object can safely clean up resources when execution leaves a context in which an object is being used. Further details are found in Chapter 3.
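
The behavior of `__enter__()` and `__exit__()`, including the effect of the return value, can be sketched with a hypothetical `Section` class (its name and `swallow` parameter are assumptions for illustration):

```python
class Section:
    """Hypothetical context manager that records what happens to it."""

    def __init__(self, swallow=()):
        self.swallow = swallow      # exception types to suppress
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self                 # bound to the name after `as`

    def __exit__(self, exc_type, exc_value, tb):
        if exc_type is None:
            self.events.append("exit")
            return False
        self.events.append(f"exit after {exc_type.__name__}")
        # Returning True clears the exception; False lets it propagate.
        return issubclass(exc_type, self.swallow)


with Section() as ok:
    ok.events.append("body")
print(ok.events)        # ['enter', 'body', 'exit']

with Section(swallow=(ZeroDivisionError,)) as risky:
    1 / 0               # suppressed by __exit__ returning True
print(risky.events)     # ['enter', 'exit after ZeroDivisionError']

try:
    with Section() as strict:
        1 / 0           # not suppressed: propagates out of the with block
except ZeroDivisionError:
    propagated = True
print(propagated)       # True
```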

### PD - 4.18 Final Words: On Being Pythonic

A commonly cited design goal is to write code that is “Pythonic.” That can mean many things, but basically it encourages you to follow established idioms used by the rest of Python. That means knowing Python’s protocols for containers, iterables, resource management, and so forth. Many of Python’s most popular frameworks use these protocols to provide good user experience. You should strive for that as well.

Of the different protocols, three deserve special attention because of their widespread use. One is creating a proper object representation using the `__repr__()` method. Python programs are often debugged and experimented with at the interactive REPL. It is also common to output objects using `print()` or a logging library. If you make it easy to observe the state of your objects, it will make all of these things easier.
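
A minimal sketch of such a representation, using a hypothetical `Point` class: the string returned by `__repr__()` is a valid expression, so it can be pasted back into the REPL or passed to `eval()` to rebuild an equal object.

```python
class Point:
    """Hypothetical class with an eval()-friendly representation."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # !r applies repr() to each field, so strings keep their quotes.
        return f"{type(self).__name__}({self.x!r}, {self.y!r})"

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


p = Point(1, 2.5)
text = repr(p)
print(text)               # Point(1, 2.5)
clone = eval(text)        # rebuilds an equal object from the string
print(clone == p)         # True
print([p, Point(0, 0)])   # [Point(1, 2.5), Point(0, 0)]
```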

Second, iterating over data is one of the most common programming tasks. If you’re going to do it, you should make your code work with Python’s `for` statement. Many core parts of Python and the standard library are designed to work with iterable objects. By supporting iteration in the usual way, you’ll automatically get a significant amount of extra functionality and your code will be intuitive to other programmers.

Finally, use context managers and the `with` statement for the common programming pattern where statements get sandwiched between some kind of startup and teardown steps—for example, opening and closing resources, acquiring and releasing locks, subscribing and unsubscribing, and so on.

#### The following tutorial is from the `realpython.com` (RP) web site.

### RP - Regular Expressions:  Regexes in Python (1)



### Table of Contents

1. Regexes in Python and Their Uses
    - A (Very Brief) History of Regular Expressions
    - The `re` Module
    - How to Import `re.search()`
    - First Pattern-Matching Example
    - Python Regex Metacharacters
2. Metacharacters Supported by the `re` Module
    - Metacharacters That Match a Single Character
    - Escaping Metacharacters
    - Anchors
    - Quantifiers
    - Grouping Constructs and Backreferences
    - Lookahead and Lookbehind Assertions
    - Miscellaneous Metacharacters
3. Modified Regular Expression Matching With Flags
    - Supported Regular Expression Flags
    - Combining `<flags>` Arguments in a Function Call
    - Setting and Clearing Flags Within a Regular Expression
4. Conclusion



In this tutorial, you’ll explore regular expressions, also known as regexes, in Python. A regex is a special sequence of characters that defines a pattern for complex string-matching functionality.

Earlier in this series, in the tutorial Strings and Character Data in Python, you learned how to define and manipulate string objects. Since then, you’ve seen some ways to determine whether two strings match each other:

- You can test whether two strings are equal using the equality (`==`) operator.

- You can test whether one string is a substring of another with the `in` operator or the built-in string methods `.find()` and `.index()`.

String matching like this is a common task in programming, and you can get a lot done with string operators and built-in methods. At times, though, you may need more sophisticated pattern-matching capabilities.

In this tutorial, you’ll learn:

- How to access the `re` module, which implements regex matching in Python
- How to use `re.search()` to match a pattern against a string
- How to create complex matching patterns with regex metacharacters

Fasten your seat belt! Regex syntax takes a little getting used to. But once you get comfortable with it, you’ll find regexes almost indispensable in your Python programming.

### RP - Regexes in Python and Their Uses

Imagine you have a string object `s`. Now suppose you need to write Python code to find out whether `s` contains the substring `'123'`. There are at least a couple of ways to do this. You could use the `in` operator:

In [49]:
s = 'foo123bar'
print ('123' in s) # True

True


If you want to know not only whether `'123'` exists in `s` but also where it exists, then you can use `.find()` or `.index()`. Each of these returns the character position within `s` where the substring resides:

In [50]:
s = 'foo123bar'
print (s.find('123'))
print (s.index('123'))

3
3


In these examples, the matching is done by a straightforward character-by-character comparison. That will get the job done in many cases. But sometimes, the problem is more complicated than that.

For example, rather than searching for a fixed substring like `'123'`, suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings `'foo123bar'`, `'foo456bar'`, `'234baz'`, and `'qux678'`.

Strict character comparisons won’t cut it here. This is where **regexes** in Python come to the rescue.

### RP - A (Very Brief) History of Regular Expressions

In 1951, mathematician Stephen Cole Kleene described the concept of a regular language, a language that is recognizable by a finite automaton and formally expressible using regular expressions. In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation.

Since then, regexes have appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.

### RP - The `re` Module

Regex functionality in Python resides in a module named `re`. The `re` module contains many useful functions and methods, most of which you’ll learn about in the next tutorial in this series.

For now, you’ll focus predominantly on one function, `re.search()`.

**`re.search(<regex>, <string>)`**

- Scans a string for a regex match.

`re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches. If a match is found, then `re.search()` returns a **match object**. Otherwise, it returns `None`.

`re.search()` takes an optional third `<flags>` argument that you’ll learn about at the end of this tutorial.

### RP -  How to Import `re.search()`

Because `search()` resides in the `re` module, you need to import it before you can use it. One way to do this is to import the entire module and then use the module name as a prefix when calling the function:

    import re
    re.search(...)

Alternatively, you can import the function from the module by name and then refer to it without the module name prefix:

    from re import search
    search(...)

You’ll always need to import `re.search()` by one means or another before you’ll be able to use it.

The examples in the remainder of this tutorial will assume the first approach shown—importing the `re` module and then referring to the function with the module name prefix: `re.search()`. For the sake of brevity, the `import re` statement will usually be omitted, but remember that it’s always necessary.

For more information on importing from modules and packages, check out Python Modules and Packages—An Introduction.

### RP -  First Pattern-Matching Example

Now that you know how to gain access to `re.search()`, you can give it a try:

In [51]:
s = 'foo123bar'

# One last reminder to import!

import re
re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

Here, the search pattern `<regex>` is `123` and `<string>` is `s`. The returned match object appears as the cell’s output. Match objects contain a wealth of useful information that you’ll explore soon.

For the moment, the important point is that `re.search()` did in fact return a match object rather than `None`. That tells you that it found a match. In other words, the specified `<regex>` pattern `123` is present in `s`.

A match object is __truthy__, so you can use it in a Boolean context like a conditional statement:

In [52]:
if re.search('123', s):
    print('Found a match.')
else:
    print('No match.')

Found a match.


The interpreter displays the match object as `<re.Match object; span=(3, 6), match='123'>`. This contains some useful information.

`span=(3, 6)` indicates the portion of `<string>` in which the match was found. This means the same thing as it would in slice notation:

In [53]:
s[3:6] # '123'

'123'

In this example, the match starts at character position 3 and extends up to but not including position 6.

`match='123'` indicates which characters from `<string>` matched.
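The same information can also be read from the match object programmatically. The short sketch below uses the standard match object methods `.span()`, `.start()`, `.end()`, and `.group()`; the variable name `m` is just an illustrative choice.

```python
import re

m = re.search('123', 'foo123bar')
print(m.span())            # (3, 6)
print(m.start(), m.end())  # 3 6
print(m.group())           # 123
```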

This is a good start. But in this case, the `<regex>` pattern is just the plain string `'123'`. The pattern matching here is still just character-by-character comparison, pretty much the same as the `in` operator and `.find()` examples shown earlier. The `match` object helpfully tells you that the matching characters were `'123'`, but that’s not much of a revelation since those were exactly the characters you searched for.

You’re just getting warmed up.

### RP -  Python Regex Metacharacters

The real power of regex matching in Python emerges when `<regex>` contains special characters called __metacharacters__. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.

In a regex, a set of characters specified in square brackets (`[]`) makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

In [54]:
import re
s = 'foo123bar'
print(re.search('[0-9][0-9][0-9]', s)) # <re.Match object; span=(3, 6), match='123'>

<re.Match object; span=(3, 6), match='123'>


`[0-9]` matches any single decimal digit character—any character between `'0'` and `'9'`, inclusive. The full expression `[0-9][0-9][0-9]` matches any sequence of three decimal digit characters. In this case, s matches because it contains three consecutive decimal digit characters, `'123'`.

These strings also match:

In [55]:
re.search('[0-9][0-9][0-9]', 'foo456bar')
# <re.Match object; span=(3, 6), match='456'>

<re.Match object; span=(3, 6), match='456'>

In [56]:
re.search('[0-9][0-9][0-9]', '234baz')
# <re.Match object; span=(0, 3), match='234'>

<re.Match object; span=(0, 3), match='234'>

In [57]:
re.search('[0-9][0-9][0-9]', 'qux678')
# <re.Match object; span=(3, 6), match='678'>

<re.Match object; span=(3, 6), match='678'>

On the other hand, a string that doesn’t contain three consecutive digits won’t match:

In [58]:
print(re.search('[0-9][0-9][0-9]', '12foo34')) # None

None


With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the `in` operator or with string methods.
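One way to package the check from the cells above is a small helper that turns the result of `re.search()` into a plain Boolean. The function name `has_three_digits` is a hypothetical choice for this sketch.

```python
import re

def has_three_digits(text):
    """Return True if text contains three consecutive decimal digits."""
    # re.search() returns a match object or None, so compare against None.
    return re.search('[0-9][0-9][0-9]', text) is not None

for text in ['foo123bar', 'foo456bar', '234baz', 'qux678', '12foo34']:
    print(text, has_three_digits(text))
```

Only the last string, `'12foo34'`, reports `False`, since its digits never appear three in a row.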

Take a look at another regex metacharacter. The dot (`.`) metacharacter matches any character except a newline, so it functions like a wildcard:

In [59]:
s = 'foo123bar'
print(re.search('1.3', s)) # <re.Match object; span=(3, 6), match='123'>

<re.Match object; span=(3, 6), match='123'>


In [60]:
s = 'foo13bar'
print(re.search('1.3', s)) # None

None


In the first example, the regex `1.3` matches `'123'` because the `'1'` and `'3'` match literally, and the `.` matches the `'2'`. Here, you’re essentially asking, “Does `s` contain a `'1'`, then any character (except a newline), then a `'3'`?” The answer is yes for `'foo123bar'` but no for `'foo13bar'`.

These examples provide a quick illustration of the power of regex metacharacters. Character class and dot are but two of the metacharacters supported by the re module. There are many more. Next, you’ll explore them fully.

### RP - Metacharacters Supported by the `re` Module

The following table briefly summarizes all the metacharacters supported by the re module. Some characters serve more than one purpose:

| Character(s) | Meaning |
|---|---|
| `.` | Matches any single character except newline |
| `^` | Anchors a match at the start of a string |
| | Complements a character class |
| `$` | Anchors a match at the end of a string |
| `*` | Matches zero or more repetitions |
| `+` | Matches one or more repetitions |
| `?` | Matches zero or one repetition |
| | Specifies the non-greedy versions of `*`, `+`, and `?` |
| | Introduces a lookahead or lookbehind assertion |
| | Creates a named group |
| `{}` | Matches an explicitly specified number of repetitions |
| `\` | Escapes a metacharacter of its special meaning |
| | Introduces a special character class |
| | Introduces a grouping backreference |
| `[]` | Specifies a character class |
| `\|` | Designates alternation |
| `()` | Creates a group |
| `:`, `#`, `=`, `!` | Designate a specialized group |
| `<>` | Creates a named group |

This may seem like an overwhelming amount of information, but don’t panic! The following sections go over each one of these in detail.

The regex parser regards any character not listed above as an ordinary character that matches only itself. For example, in the first pattern-matching example shown above, you saw this:

In [61]:
s = 'foo123bar'
re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In this case, `123` is technically a regex, but it’s not a very interesting one because it doesn’t contain any metacharacters. It just matches the string `'123'`.

Things get much more exciting when you throw metacharacters into the mix. The following sections explain in detail how you can use each metacharacter or metacharacter sequence to enhance pattern-matching functionality.

### RP - Metacharacters That Match a Single Character

The metacharacter sequences in this section try to match a single character from the search string. When the regex parser encounters one of these metacharacter sequences, a match happens if the character at the current parsing position fits the description that the sequence describes.

**`[]`**   

       Specifies a specific set of characters to match.

Characters contained in square brackets (`[]`) represent a **character class**: an enumerated set of characters to match from. A character class metacharacter sequence will match any single character contained in the class.

You can enumerate the characters individually like this:

In [62]:
import re

print(re.search('ba[artz]', 'foobarqux')) # <re.Match object; span=(3, 6), match='bar'>
print(re.search('ba[artz]', 'foobazqux')) # <re.Match object; span=(3, 6), match='baz'>

<re.Match object; span=(3, 6), match='bar'>
<re.Match object; span=(3, 6), match='baz'>


The metacharacter sequence `[artz]` matches any single `'a'`, `'r'`, `'t'`, or `'z'` character. In the example, the regex `ba[artz]` matches both `'bar'` and `'baz'` (and would also match `'baa'` and `'bat'`).

A character class can also contain a range of characters separated by a hyphen (`-`), in which case it matches any single character within the range. For example, `[a-z]` matches any lowercase alphabetic character between `'a'` and `'z'`, inclusive:

In [63]:
re.search('[a-z]', 'FOObar') # <re.Match object; span=(3, 4), match='b'>

<re.Match object; span=(3, 4), match='b'>

**`[0-9]`**

    Matches any digit character:

In [64]:
re.search('[0-9][0-9]', 'foo123bar') # <re.Match object; span=(3, 5), match='12'>

<re.Match object; span=(3, 5), match='12'>

In this case, `[0-9][0-9]` matches a sequence of two digits. The first portion of the string `'foo123bar'` that matches is `'12'`.

`[0-9a-fA-F]` matches any hexadecimal digit character:

In [65]:
re.search('[0-9a-fA-F]', '--- a0 ---') # <re.Match object; span=(4, 5), match='a'>

<re.Match object; span=(4, 5), match='a'>

Here, `[0-9a-fA-F]` matches the first hexadecimal digit character in the search string, `'a'`.

Note: In the above examples, the return value is always the leftmost possible match. `re.search()` scans the search string from left to right, and as soon as it locates a match for `<regex>`, it stops scanning and returns the match.
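A quick way to see the leftmost-match rule is to search a string that contains more than one candidate. The sketch below also calls `re.findall()`, a standard `re` function that this part of the tutorial does not cover, purely to show that later candidates exist but are not what `re.search()` reports.

```python
import re

text = 'a12b34'
first = re.search('[0-9][0-9]', text)
print(first.span(), first.group())      # (1, 3) 12

# re.findall() is used here only for contrast: it lists every non-overlapping match.
print(re.findall('[0-9][0-9]', text))   # ['12', '34']
```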

You can complement a character class by specifying `^` as the first character, in which case it matches any character that isn’t in the set. In the following example, `[^0-9]` matches any character that isn’t a digit:

In [66]:
re.search('[^0-9]', '12345foo') # <re.Match object; span=(5, 6), match='f'>

<re.Match object; span=(5, 6), match='f'>

Here, the match object indicates that the first character in the string that isn’t a digit is `'f'`.

If a `^` character appears in a character class but isn’t the first character, then it has no special meaning and matches a literal `'^'` character:

In [67]:
re.search('[#:^]', 'foo^bar:baz#qux') # <re.Match object; span=(3, 4), match='^'>

<re.Match object; span=(3, 4), match='^'>

As you’ve seen, you can specify a range of characters in a character class by separating characters with a hyphen. What if you want the character class to include a literal hyphen character? You can place it as the first or last character or escape it with a backslash (`\`):

In [68]:
print(re.search('[-abc]', '123-456')) # <re.Match object; span=(3, 4), match='-'>
print(re.search('[abc-]', '123-456')) # <re.Match object; span=(3, 4), match='-'>
print(re.search('[ab\-c]', '123-456')) # <re.Match object; span=(3, 4), match='-'>

<re.Match object; span=(3, 4), match='-'>
<re.Match object; span=(3, 4), match='-'>
<re.Match object; span=(3, 4), match='-'>


If you want to include a literal `']'` in a character class, then you can place it as the first character or escape it with backslash:

In [69]:
print(re.search('[]]', 'foo[1]')) # <re.Match object; span=(5, 6), match=']'>
print(re.search('[ab\]cd]', 'foo[1]')) # <re.Match object; span=(5, 6), match=']'>

<re.Match object; span=(5, 6), match=']'>
<re.Match object; span=(5, 6), match=']'>


Other regex metacharacters lose their special meaning inside a character class:

In [70]:
print(re.search('[)*+|]', '123*456')) # <re.Match object; span=(3, 4), match='*'>
print(re.search('[)*+|]', '123+456')) # <re.Match object; span=(3, 4), match='+'>

<re.Match object; span=(3, 4), match='*'>
<re.Match object; span=(3, 4), match='+'>


As you saw in the table above, `*` and `+` have special meanings in a regex in Python. They designate repetition, which you’ll learn more about shortly. But in this example, they’re inside a character class, so they match themselves literally.

**dot (`.`)**

    Specifies a wildcard.

The `.` metacharacter matches any single character except a newline:

In [71]:
print(re.search('foo.bar', 'fooxbar')) # <re.Match object; span=(0, 7), match='fooxbar'>
print(re.search('foo.bar', 'foobar')) # None
print(re.search('foo.bar', 'foo\nbar')) # None

<re.Match object; span=(0, 7), match='fooxbar'>
None
None


As a regex, `foo.bar` essentially means the characters `'foo'`, then any character except newline, then the characters `'bar'`. The first string shown above, `'fooxbar'`, fits the bill because the `.` metacharacter matches the `'x'`.

The second and third strings fail to match. In the last case, although there’s a character between `'foo'` and `'bar'`, it’s a newline, and by default, the `.` metacharacter doesn’t match a newline. There is, however, a way to force `.` to match a newline, which you’ll learn about at the end of this tutorial.

**`\w`** or **`\W`**

    Match based on whether a character is a word character.

`\w` matches any alphanumeric word character. Word characters are uppercase and lowercase letters, digits, and the underscore (`_`) character, so for ASCII text `\w` is essentially shorthand for `[a-zA-Z0-9_]`:

In [72]:
print(re.search('\w', '#(.a$@&')) # <re.Match object; span=(3, 4), match='a'>
print(re.search('[a-zA-Z0-9_]', '#(.a$@&')) # <re.Match object; span=(3, 4), match='a'>

<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(3, 4), match='a'>


In this case, the first word character in the string `'#(.a$@&'` is `'a'`.

`\W` is the opposite. It matches any non-word character and is equivalent to `[^a-zA-Z0-9_]`:

In [73]:
print(re.search('\W', 'a_1*3Qb')) # <re.Match object; span=(3, 4), match='*'>
print(re.search('[^a-zA-Z0-9_]', 'a_1*3Qb')) # <re.Match object; span=(3, 4), match='*'>

<re.Match object; span=(3, 4), match='*'>
<re.Match object; span=(3, 4), match='*'>


Here, the first non-word character in `'a_1*3Qb'` is `'*'`.

**`\d`** or **`\D`**

    Match based on whether a character is a decimal digit.

`\d` matches any decimal digit character. `\D` is the opposite. It matches any character that **isn’t** a decimal digit:

In [74]:
print(re.search('\d', 'abc4def')) # <re.Match object; span=(3, 4), match='4'>
print(re.search('\D', '234Q678')) # <re.Match object; span=(3, 4), match='Q'>

<re.Match object; span=(3, 4), match='4'>
<re.Match object; span=(3, 4), match='Q'>


`\d` is essentially equivalent to `[0-9]`, and `\D` is equivalent to `[^0-9]`.

**`\s`** or **`\S`**

    Match based on whether a character represents whitespace.

`\s` matches any whitespace character:

In [75]:
print(re.search('\s', 'foo\nbar baz')) # <re.Match object; span=(3, 4), match='\n'>

<re.Match object; span=(3, 4), match='\n'>


Note that, unlike the dot wildcard metacharacter, `\s` does match a newline character.

`\S` is the opposite of `\s`. It matches any character that isn’t whitespace:

In [76]:
print(re.search('\S', '  \n foo  \n  ')) # <re.Match object; span=(4, 5), match='f'>

<re.Match object; span=(4, 5), match='f'>


Again, `\s` and `\S` consider a newline to be whitespace. In the example above, the first non-whitespace character is `'f'`.

The character class sequences `\w`, `\W`, `\d`, `\D`, `\s`, and `\S` can appear inside a square bracket character class as well:

In [77]:
print(re.search('[\d\w\s]', '---3---')) # <re.Match object; span=(3, 4), match='3'>
print(re.search('[\d\w\s]', '---a---')) # <re.Match object; span=(3, 4), match='a'>
print(re.search('[\d\w\s]', '--- ---')) # <re.Match object; span=(3, 4), match=' '>

<re.Match object; span=(3, 4), match='3'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(3, 4), match=' '>


In this case, `[\d\w\s]` matches any digit, word, or whitespace character. And since `\w` includes `\d`, the same character class could also be expressed slightly shorter as `[\w\s]`.
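The claim that `[\w\s]` covers the same characters as `[\d\w\s]` can be spot-checked on a handful of sample characters. This is a sanity check under the assumption that the samples are representative, not a proof. The `r` prefix marks a raw string, which the section on escaping metacharacters explains.

```python
import re

samples = ['3', 'a', '_', ' ', '\n', '-', '*']
for ch in samples:
    long_form = re.search(r'[\d\w\s]', ch) is not None
    short_form = re.search(r'[\w\s]', ch) is not None
    # Both classes should agree on every sample character.
    print(repr(ch), long_form, short_form)
```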

### RP - Escaping Metacharacters

Occasionally, you’ll want to include a metacharacter in your regex, except you won’t want it to carry its special meaning. Instead, you’ll want it to represent itself as a literal character.

**backslash (`\`)**

    Removes the special meaning of a metacharacter.

As you’ve just seen, the backslash character can introduce special character classes like word, digit, and whitespace. There are also special metacharacter sequences called anchors that begin with a backslash, which you’ll learn about below.

When it’s not serving either of these purposes, the backslash escapes metacharacters. A metacharacter preceded by a backslash loses its special meaning and matches the literal character instead. Consider the following examples:

In [78]:
print(re.search('.', 'foo.bar')) # <re.Match object; span=(0, 1), match='f'>
print(re.search('\.', 'foo.bar')) # <re.Match object; span=(3, 4), match='.'>

<re.Match object; span=(0, 1), match='f'>
<re.Match object; span=(3, 4), match='.'>


In the first `<regex>`, the dot (`.`) functions as a wildcard metacharacter, which matches the first character in the string (`'f'`). The `.` character in the second `<regex>` is escaped by a backslash, so it isn’t a wildcard. It’s interpreted literally and matches the `'.'` at index 3 of the search string.

Using backslashes for escaping can get messy. Suppose you have a string that contains a single backslash:

In [79]:
s = r'foo\bar'
print(s) # foo\bar

foo\bar


Now suppose you want to create a `<regex>` that will match the backslash between `'foo'` and `'bar'`. The backslash is itself a special character in a regex, so to specify a literal backslash, you need to escape it with another backslash. If that’s the case, then the following should work: `re.search('\\', s)`

But... not quite. See what you get if you try it:

In [80]:
re.search('\\', s)

error: bad escape (end of pattern) at position 0

Oops. What happened?

The problem here is that the backslash escaping happens twice, first by the Python interpreter on the string literal and then again by the regex parser on the regex it receives.

Here’s the sequence of events:

1. The Python interpreter is the first to process the string literal `'\\'`. It interprets that as an escaped backslash and passes only a single backslash to `re.search()`.
2. The regex parser receives just a single backslash, which isn’t a meaningful regex, so the messy error ensues.

There are two ways around this. First, you can escape both backslashes in the original string literal:

In [None]:
print(re.search('\\\\', s)) # <re.Match object; span=(3, 4), match='\\'>

Doing so causes the following to happen:

1. The interpreter sees `'\\\\'` as a pair of escaped backslashes. It reduces each pair to a single backslash and passes `'\\'` to the regex parser.
2. The regex parser then sees `\\` as one escaped backslash. As a `<regex>`, that matches a single backslash character. You can see from the match object that it matched the backslash at index 3 in `s` as intended. It’s cumbersome, but it works.

The second, and probably cleaner, way to handle this is to specify the `<regex>` using a raw string:

In [None]:
print(re.search(r'\\', s)) # <re.Match object; span=(3, 4), match='\\'>

This suppresses the escaping at the interpreter level. The string `'\\'` gets passed unchanged to the regex parser, which again sees one escaped backslash as desired.

It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.
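To make the two levels of escaping concrete, the sketch below compares the ordinary and raw spellings of the same pattern. The variable names `plain` and `raw` are illustrative only.

```python
import re

s = r'foo\bar'
plain = '\\\\'   # Python collapses each escaped pair, leaving two backslashes
raw = r'\\'      # a raw string keeps both backslashes exactly as typed

print(len(plain), len(raw))      # 2 2
print(plain == raw)              # True
print(re.search(raw, s).span())  # (3, 4)
```

Both spellings hand the regex parser the same two-character pattern, which it reads as one escaped backslash.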

### RP - Anchors

**Anchors** are zero-width matches. They don’t match any actual characters in the search string, and they don’t consume any of the search string during parsing. Instead, an anchor dictates a particular location in the search string where a match must occur.

**`^`** or **`\A`**

    Anchor a match to the start of `<string>`.

When the regex parser encounters `^` or `\A`, the parser’s current position must be at the beginning of the search string for it to find a match.

In other words, regex `^foo` stipulates that `'foo'` must be present not just any old place in the search string, but at the beginning:

In [None]:
print(re.search('^foo', 'foobar')) # <re.Match object; span=(0, 3), match='foo'>
print(re.search('^foo', 'barfoo')) # None

`\A` functions similarly:

In [None]:
print(re.search(r'\Afoo', 'foobar')) # <re.Match object; span=(0, 3), match='foo'>
print(re.search(r'\Afoo', 'barfoo')) # None

`^` and `\A` behave slightly differently from each other in **MULTILINE** mode. You’ll learn more about MULTILINE mode below in the section on flags.

**`$`** or **`\Z`**

    Anchor a match to the end of <string>.

When the regex parser encounters `$` or `\Z`, the parser’s current position must be at the end of the search string for it to find a match. Whatever precedes `$` or `\Z` must constitute the end of the search string:

In [None]:
print(re.search('bar$', 'foobar')) # <re.Match object; span=(3, 6), match='bar'>
print(re.search('bar$', 'barfoo')) # None
print(re.search(r'bar\Z', 'foobar')) # <re.Match object; span=(3, 6), match='bar'>
print(re.search(r'bar\Z', 'barfoo')) # None

As a special case, `$` (but not `\Z`) also matches just before a single newline at the end of the search string:

In [None]:
print(re.search('bar$', 'foobar\n')) # <re.Match object; span=(3, 6), match='bar'>

In this example, `'bar'` isn’t technically at the end of the search string because it’s followed by one additional newline character. But the regex parser lets it slide and calls it a match anyway. This exception doesn’t apply to `\Z`.
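The difference between the two anchors can be seen by running both against the same string with a trailing newline. This is a minimal sketch and the variable name `s` is arbitrary.

```python
import re

s = 'foobar\n'
print(re.search(r'bar$', s))   # a match: $ tolerates one trailing newline
print(re.search(r'bar\Z', s))  # None: \Z requires the true end of the string
```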

`$` and `\Z` behave slightly differently from each other in `MULTILINE` mode. See the section below on flags for more information on MULTILINE mode.

**`\b`**

    Anchors a match to a word boundary.

`\b` asserts that the regex parser’s current position must be at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores (`[a-zA-Z0-9_]`), the same as for the `\w` character class:

In [None]:
print(re.search(r'\bbar', 'foo bar')) # 1: <re.Match object; span=(4, 7), match='bar'>

In [None]:
print(re.search(r'\bbar', 'foo.bar')) # 2: <re.Match object; span=(4, 7), match='bar'>

In [None]:
print(re.search(r'\bbar', 'foobar')) # 3: None

In [None]:
print(re.search(r'foo\b', 'foo bar')) # 4: <re.Match object; span=(0, 3), match='foo'>

In [None]:
print(re.search(r'foo\b', 'foo.bar')) # 5: <re.Match object; span=(0, 3), match='foo'>

In [None]:
print(re.search(r'foo\b', 'foobar')) # 6: None

In the above examples, a match happens in cells 1 and 2 because there’s a word boundary at the start of `'bar'`. This isn’t the case in cell 3, so the match fails there.

Similarly, there are matches in cells 4 and 5 because a word boundary exists at the end of `'foo'`, but not in cell 6.

Using the `\b` anchor on both ends of the `<regex>` will cause it to match when it’s present in the search string as a whole word:

In [None]:
print(re.search(r'\bbar\b', 'foo bar baz')) # <re.Match object; span=(4, 7), match='bar'>
print(re.search(r'\bbar\b', 'foo(bar)baz')) # <re.Match object; span=(4, 7), match='bar'>
print(re.search(r'\bbar\b', 'foobarbaz')) # None

This is another instance in which it pays to specify the `<regex>` as a raw string, as the above examples have done.

Because `'\b'` is an escape sequence for both string literals and regexes in Python, each use above would need to be double escaped as `'\\b'` if you didn’t use raw strings. That wouldn’t be the end of the world, but raw strings are tidier.
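The double meaning of `\b` is easy to check directly. In the sketch below, the ordinary string literal `'\b'` turns out to be a single backspace character, so the regex parser never sees a word-boundary anchor unless the backslash is doubled or a raw string is used.

```python
import re

print('\b' == '\x08')          # True: in a plain literal, \b is a backspace
print(len('\b'), len(r'\b'))   # 1 2
print('\\bbar' == r'\bbar')    # True: doubled backslash equals the raw string

print(re.search('\bbar', 'foo bar'))    # None: the pattern looks for a backspace
print(re.search(r'\bbar', 'foo bar'))   # a match at span (4, 7)
```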

**`\B`**

    Anchors a match to a location that isn’t a word boundary.

`\B` does the opposite of `\b`. It asserts that the regex parser’s current position must not be at the start or end of a word:

In [None]:
print(re.search(r'\Bfoo\B', 'foo')) # None
print(re.search(r'\Bfoo\B', '.foo.')) # None
print(re.search(r'\Bfoo\B', 'barfoobaz')) # <re.Match object; span=(3, 6), match='foo'>

In this case, a match happens only in the third call because no word boundary exists at the start or end of `'foo'` in the search string `'barfoobaz'`.

### RP - Quantifiers

A **quantifier** metacharacter immediately follows a portion of a `<regex>` and indicates how many times that portion must occur for the match to succeed.

**`*`**

    Matches zero or more repetitions of the preceding regex.

For example, `a*` matches zero or more `'a'` characters. That means it would match an empty string, `'a'`, `'aa'`, `'aaa'`, and so on.

Consider these examples:

In [None]:
print(re.search('foo-*bar', 'foobar')) # Zero dashes - <re.Match object; span=(0, 6), match='foobar'>

In [None]:
print(re.search('foo-*bar', 'foo-bar')) # One dash - <re.Match object; span=(0, 7), match='foo-bar'>

In [None]:
print(re.search('foo-*bar', 'foo--bar')) # Two dashes - <re.Match object; span=(0, 8), match='foo--bar'>

In the first example, there are zero `'-'` characters between `'foo'` and `'bar'`. In the second there’s one, and in the third there are two. The metacharacter sequence `-*` matches in all three cases.

You’ll probably encounter the regex `.*` in a Python program at some point. This matches zero or more occurrences of any character. In other words, it essentially matches any character sequence up to a line break. (Remember that the `.` wildcard metacharacter doesn’t match a newline.)

In this example, `.*` matches everything between `'foo'` and `'bar'`:

In [None]:
print(re.search('foo.*bar', '# foo $qux@grault % bar #')) # <re.Match object; span=(2, 23), match='foo $qux@grault % bar'>

Did you notice the `span=` and `match=` information contained in the match object?

Until now, the regexes in the examples you’ve seen have specified matches of predictable length. Once you start using quantifiers like `*`, the number of characters matched can be quite variable, and the information in the match object becomes more useful.

You’ll learn more about how to access the information stored in a match object in the next tutorial in the series.
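Two consequences of the `*` quantifier described above are worth seeing in isolation: a pattern like `a*` can succeed with an empty match, and `.*` stops at a line break. The search strings below are made up for this sketch.

```python
import re

# Zero repetitions is a valid match, so a* succeeds immediately at position 0.
m = re.search('a*', 'bbb')
print(m.span(), repr(m.group()))   # (0, 0) ''

# .* runs only to the end of the line because . does not match a newline.
m = re.search('foo.*', 'foo one\ntwo')
print(repr(m.group()))             # 'foo one'
```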

**`+`**

    Matches one or more repetitions of the preceding regex.

This is similar to `*`, but the quantified regex must occur at least once:

In [82]:
print(re.search('foo-+bar', 'foobar')) # Zero dashes - None
print(re.search('foo-+bar', 'foo-bar')) # One dash - <re.Match object; span=(0, 7), match='foo-bar'>
print(re.search('foo-+bar', 'foo--bar')) # Two dashes - <re.Match object; span=(0, 8), match='foo--bar'>

None
<re.Match object; span=(0, 7), match='foo-bar'>
<re.Match object; span=(0, 8), match='foo--bar'>


Remember from above that `foo-*bar` matched the string `'foobar'` because the `*` metacharacter allows for zero occurrences of `'-'`. The `+` metacharacter, on the other hand, requires at least one occurrence of `'-'`. That means there isn’t a match for `'foobar'` in this case.

**`?`**

    Matches zero or one repetition of the preceding regex.

Again, this is similar to `*` and `+`, but in this case there’s only a match if the preceding regex occurs once or not at all:

In [83]:
print (re.search('foo-?bar', 'foobar')) # Zero dashes - <_sre.SRE_Match object; span=(0, 6), match='foobar'>

<re.Match object; span=(0, 6), match='foobar'>


In [84]:
print(re.search('foo-?bar', 'foo-bar'))  # One dash - <_sre.SRE_Match object; span=(0, 7), match='foo-bar'>

<re.Match object; span=(0, 7), match='foo-bar'>


In [85]:
print(re.search('foo-?bar', 'foo--bar')) # Two dashes - None

None


In this example, the first two searches match, with zero and one `'-'` characters respectively. The third search, where there are two `'-'` characters, fails.

Here are some more examples showing the use of all three quantifier metacharacters:

In [86]:
print (re.match('foo[1-9]*bar', 'foobar')) # <_sre.SRE_Match object; span=(0, 6), match='foobar'>

<re.Match object; span=(0, 6), match='foobar'>


In [87]:
print(re.match('foo[1-9]*bar', 'foo42bar')) # <_sre.SRE_Match object; span=(0, 8), match='foo42bar'>

<re.Match object; span=(0, 8), match='foo42bar'>


In [None]:
print(re.match('foo[1-9]+bar', 'foobar')) # None

In [88]:
print (re.match('foo[1-9]+bar', 'foo42bar')) # <_sre.SRE_Match object; span=(0, 8), match='foo42bar'>

<re.Match object; span=(0, 8), match='foo42bar'>


In [89]:
print (re.match('foo[1-9]?bar', 'foobar')) # <_sre.SRE_Match object; span=(0, 6), match='foobar'>

<re.Match object; span=(0, 6), match='foobar'>


In [90]:
print(re.match('foo[1-9]?bar', 'foo42bar')) # None

None


This time, the quantified regex is the character class `[1-9]` instead of the simple character `'-'`.

**`*?`, `+?`, `??`**

    The non-greedy (or lazy) versions of the '*', '+', and '?' quantifiers.

When used alone, the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match. Consider this example:

In [91]:
print(re.search('<.*>', '%<foo> <bar> <baz>%')) # <_sre.SRE_Match object; span=(1, 18), match='<foo> <bar> <baz>'>

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>


The regex `<.*>` effectively means:

    A '<' character
    Then any sequence of characters
    Then a '>' character

But which '>' character? There are three possibilities:

    The one just after 'foo'
    The one just after 'bar'
    The one just after 'baz'

Since the `*` metacharacter is greedy, it dictates the longest possible match, which includes everything up to and including the `'>'` character that follows `'baz'`. You can see from the match object that this is the match produced.

If you want the shortest possible match instead, then use the non-greedy metacharacter sequence `*?`:

In [92]:
print(re.search('<.*?>', '%<foo> <bar> <baz>%')) # <_sre.SRE_Match object; span=(1, 6), match='<foo>'>

<re.Match object; span=(1, 6), match='<foo>'>


In this case, the match ends with the `'>'` character following `'foo'`.

Note: You could accomplish the same thing with the regex `<[^>]*>`, which means:

    A '<' character
    Then any sequence of characters other than '>'
    Then a '>' character

This is the only option available with some older parsers that don’t support lazy quantifiers. Happily, that’s not the case with the regex parser in Python’s `re` module.
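
The sketch below compares the three patterns side by side on the same search string. It uses `re.findall()`, a standard `re` function that returns every non-overlapping match as a list, purely to make the difference visible in one line each:

```python
import re

s = '%<foo> <bar> <baz>%'

greedy = re.findall(r'<.*>', s)       # one long match
lazy = re.findall(r'<.*?>', s)        # each bracketed token separately
negated = re.findall(r'<[^>]*>', s)   # same tokens, no lazy quantifier needed

print(greedy)    # ['<foo> <bar> <baz>']
print(lazy)      # ['<foo>', '<bar>', '<baz>']
print(negated)   # ['<foo>', '<bar>', '<baz>']
```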

There are lazy versions of the `+` and `?` quantifiers as well:

In [93]:
print (re.search('<.+>', '%<foo> <bar> <baz>%')) # <_sre.SRE_Match object; span=(1, 18), match='<foo> <bar> <baz>'>

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>


In [94]:
print (re.search('<.+?>', '%<foo> <bar> <baz>%')) # <_sre.SRE_Match object; span=(1, 6), match='<foo>'>

<re.Match object; span=(1, 6), match='<foo>'>


In [95]:
print (re.search('ba?', 'baaaa')) # <_sre.SRE_Match object; span=(0, 2), match='ba'>

<re.Match object; span=(0, 2), match='ba'>


In [96]:
print(re.search('ba??', 'baaaa')) # <_sre.SRE_Match object; span=(0, 1), match='b'>

<re.Match object; span=(0, 1), match='b'>


The first two examples are similar to the examples shown above, only using `+` and `+?` instead of `*` and `*?`.

The last two examples are a little different. In general, the `?` metacharacter matches zero or one occurrences of the preceding regex. The greedy version, `?`, prefers one occurrence, so `ba?` matches `'b'` followed by a single `'a'`. The non-greedy version, `??`, prefers zero occurrences, so `ba??` matches just `'b'`.

**`{m}`**

    Matches exactly m repetitions of the preceding regex.

This is similar to `*` or `+`, but it specifies exactly how many times the preceding regex must occur for a match to succeed:

In [97]:
print(re.search('x-{3}x', 'x--x'))   # Two dashes - None

None


In [98]:
print (re.search('x-{3}x', 'x---x'))  # Three dashes - <_sre.SRE_Match object; span=(0, 5), match='x---x'>

<re.Match object; span=(0, 5), match='x---x'>


In [99]:
print(re.search('x-{3}x', 'x----x')) # Four dashes - None

None


Here, `x-{3}x` matches `'x'`, followed by exactly three instances of the `'-'` character, followed by another `'x'`. The match fails when there are fewer or more than three dashes between the `'x'` characters.
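
An exact repetition count is handy for fixed-length fields. The following is a minimal sketch, assuming a hypothetical helper named `looks_like_zip` that accepts exactly five decimal digits. It uses `re.fullmatch()`, a standard `re` function that succeeds only when the whole string matches:

```python
import re

# Hypothetical helper for illustration: accepts exactly five decimal digits.
def looks_like_zip(text):
    return re.fullmatch(r'\d{5}', text) is not None

print(looks_like_zip('55105'))    # True
print(looks_like_zip('5510'))     # False: only four digits
print(looks_like_zip('551055'))   # False: six digits
```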

**`{m,n}`**

    Matches any number of repetitions of the preceding regex from m to n, inclusive.

In the following example, the quantified `<regex>` is `-{2,4}`. The match succeeds when there are two, three, or four dashes between the `'x'` characters but fails otherwise:

In [108]:
for i in range(1, 6):
    s = f"x{'-' * i}x"
    print(f'{i}  {s:10}', re.search('x-{2,4}x', s))

'''
1  x-x        None
2  x--x       <_sre.SRE_Match object; span=(0, 4), match='x--x'>
3  x---x      <_sre.SRE_Match object; span=(0, 5), match='x---x'>
4  x----x     <_sre.SRE_Match object; span=(0, 6), match='x----x'>
5  x-----x    None
'''
print()

1  x-x        None
2  x--x       <re.Match object; span=(0, 4), match='x--x'>
3  x---x      <re.Match object; span=(0, 5), match='x---x'>
4  x----x     <re.Match object; span=(0, 6), match='x----x'>
5  x-----x    None



Omitting `m` implies a lower bound of `0`, and omitting `n` implies an unlimited upper bound:

```
Regular Expression  Matches                                           Identical to
------------------  ------------------------------------------------  ------------
<regex>{,n}         Any number of repetitions of <regex> less than    <regex>{0,n}
                    or equal to n
<regex>{m,}         Any number of repetitions of <regex> greater      ----
                    than or equal to m
<regex>{,}          Any number of repetitions of <regex>              <regex>{0,}
                                                                      <regex>*
```
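
The equivalences in the table can be checked directly. This is a small sketch using an arbitrary search string with five dashes; the variable names are only for illustration:

```python
import re

s = 'x' + '-' * 5 + 'x'   # 'x-----x'

# {,n} behaves like {0,n}
a = re.search(r'x-{,5}x', s)
b = re.search(r'x-{0,5}x', s)

# {,} and {0,} behave like *
c = re.search(r'x-{,}x', s)
d = re.search(r'x-{0,}x', s)
e = re.search(r'x-*x', s)

print(a.span() == b.span())               # True
print(c.span() == d.span() == e.span())   # True
print(re.search(r'x-{,4}x', s))           # None: five dashes exceed n=4
```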

If you omit all of `m`, `n`, and the comma, then the curly braces no longer function as metacharacters. `{}` matches just the literal string `'{}'`:

In [106]:
print (re.search('x{}y', 'x{}y')) # <_sre.SRE_Match object; span=(0, 4), match='x{}y'>

<re.Match object; span=(0, 4), match='x{}y'>


In fact, to have any special meaning, a sequence with curly braces must fit one of the following patterns in which `m` and `n` are nonnegative integers:

- `{m}`  
- `{m,n}`  
- `{m,}` 
- `{,n}` 
- `{,}`  

Otherwise, it matches literally:

In [100]:
print (re.search('x{foo}y', 'x{foo}y')) # <_sre.SRE_Match object; span=(0, 7), match='x{foo}y'>

<re.Match object; span=(0, 7), match='x{foo}y'>


In [101]:
print (re.search('x{a:b}y', 'x{a:b}y')) # <_sre.SRE_Match object; span=(0, 7), match='x{a:b}y'>

<re.Match object; span=(0, 7), match='x{a:b}y'>


In [102]:
print (re.search('x{1,3,5}y', 'x{1,3,5}y')) # <_sre.SRE_Match object; span=(0, 9), match='x{1,3,5}y'>

<re.Match object; span=(0, 9), match='x{1,3,5}y'>


In [103]:
print (re.search('x{foo,bar}y', 'x{foo,bar}y')) # <_sre.SRE_Match object; span=(0, 11), match='x{foo,bar}y'>

<re.Match object; span=(0, 11), match='x{foo,bar}y'>


Later in this tutorial, when you learn about the `DEBUG` flag, you’ll see how you can confirm this.

**`{m,n}?`**

    The non-greedy (lazy) version of {m,n}.

`{m,n}` will match as many characters as possible, and `{m,n}?` will match as few as possible:

In [104]:
print (re.search('a{3,5}', 'aaaaaaaa')) # <_sre.SRE_Match object; span=(0, 5), match='aaaaa'>

<re.Match object; span=(0, 5), match='aaaaa'>


In [105]:
print (re.search('a{3,5}?', 'aaaaaaaa')) # <_sre.SRE_Match object; span=(0, 3), match='aaa'>

<re.Match object; span=(0, 3), match='aaa'>


In this case, `a{3,5}` produces the longest possible match, so it matches five `'a'` characters. `a{3,5}?` produces the shortest match, so it matches three.
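
Note that "as few as possible" means as few as the rest of the regex allows. The sketch below adds a trailing `b` to the lazy pattern (an addition made here for illustration). The lazy quantifier starts with three `'a'` characters and extends one at a time until the `b` can match:

```python
import re

s = 'aaaaab'

alone = re.search(r'a{3,5}?', s)         # nothing follows, so three suffice
followed = re.search(r'a{3,5}?b', s)     # must extend to five to reach 'b'

print(alone.group())      # aaa
print(followed.group())   # aaaaab
```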

#### RP - Grouping Constructs and Backreferences

Grouping constructs break up a regex in Python into subexpressions or **groups**. This serves two purposes:

1. **Grouping**: A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit.
2. **Capturing**: Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. You can retrieve captured matches later through several different mechanisms.

Here’s a look at how grouping and capturing work.

**`(<regex>)`**

    Defines a subexpression or group.

This is the most basic grouping construct. A regex in parentheses just matches the contents of the parentheses:

In [109]:
print (re.search('(bar)', 'foo bar baz')) # <_sre.SRE_Match object; span=(4, 7), match='bar'>

<re.Match object; span=(4, 7), match='bar'>


In [110]:
print (re.search('bar', 'foo bar baz')) # <_sre.SRE_Match object; span=(4, 7), match='bar'>

<re.Match object; span=(4, 7), match='bar'>


As a regex, `(bar)` matches the string `'bar'`, the same as the regex `bar` would without the parentheses.

#### RP - Treating a Group as a Unit

A quantifier metacharacter that follows a group operates on the entire subexpression specified in the group as a single unit.

For instance, the following example matches one or more occurrences of the string `'bar'`:

    >>> re.search('(bar)+', 'foo bar baz')
    <re.Match object; span=(4, 7), match='bar'>

    >>> re.search('(bar)+', 'foo barbar baz')
    <re.Match object; span=(4, 10), match='barbar'>

    >>> re.search('(bar)+', 'foo barbarbarbar baz')
    <re.Match object; span=(4, 16), match='barbarbarbar'>

Here’s a breakdown of the difference between the two regexes with and without grouping parentheses:

```
Regex     Interpretation                    Matches                       Examples
------    ---------------------------       -----------------------       -----------
bar+      The + metacharacter applies       'ba' followed by one or       'bar'
          only to the character 'r'.        more occurrences of 'r'       'barr'
                                                                          'barrr'

(bar)+    The + metacharacter applies       One or more occurrences       'bar'
          to the entire string 'bar'.       of 'bar'                      'barbar'
                                                                          'barbarbar'
```
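
The difference between the two regexes can be checked with a short loop. This sketch uses `re.fullmatch()`, which succeeds only if the entire string matches, so that partial matches don't blur the comparison. The `results` dictionary is just a convenience for this illustration:

```python
import re

results = {}
for s in ['bar', 'barrr', 'barbar']:
    plain = re.fullmatch(r'bar+', s) is not None      # + applies to 'r' only
    grouped = re.fullmatch(r'(bar)+', s) is not None  # + applies to 'bar'
    results[s] = (plain, grouped)
    print(s, plain, grouped)

# 'bar'    matches both patterns
# 'barrr'  matches only bar+
# 'barbar' matches only (bar)+
```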

Now take a look at a more complicated example. The regex `(ba[rz]){2,4}(qux)?` matches 2 to 4 occurrences of either `'bar'` or `'baz'`, optionally followed by `'qux'`:

In [111]:
print (re.search('(ba[rz]){2,4}(qux)?', 'bazbarbazqux')) # <_sre.SRE_Match object; span=(0, 12), match='bazbarbazqux'>

<re.Match object; span=(0, 12), match='bazbarbazqux'>


In [112]:
print (re.search('(ba[rz]){2,4}(qux)?', 'barbar')) # <_sre.SRE_Match object; span=(0, 6), match='barbar'>

<re.Match object; span=(0, 6), match='barbar'>


The following example shows that you can nest grouping parentheses:

In [113]:
print(re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')) # <_sre.SRE_Match object; span=(0, 9), match='foofoobar'>

<re.Match object; span=(0, 9), match='foofoobar'>


In [114]:
print(re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar123')) # <_sre.SRE_Match object; span=(0, 12), match='foofoobar123'>

<re.Match object; span=(0, 12), match='foofoobar123'>


In [115]:
print(re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoo123')) # <_sre.SRE_Match object; span=(0, 9), match='foofoo123'>

<re.Match object; span=(0, 9), match='foofoo123'>


The regex `(foo(bar)?)+(\d\d\d)?` is pretty elaborate, so let’s break it down into smaller pieces:

```
Regex            Matches
-------------    ------------------------------------
foo(bar)?        'foo' optionally followed by 'bar'
(foo(bar)?)+     One or more occurrences of the above
\d\d\d           Three decimal digit characters
(\d\d\d)?        Zero or one occurrences of the above
```

String it all together and you get: at least one occurrence of `'foo'` optionally followed by `'bar'`, all optionally followed by three decimal digit characters.

As you can see, you can construct very complicated regexes in Python using grouping parentheses.

#### RP - Capturing Groups

Grouping isn’t the only useful purpose that grouping constructs serve. Most (but not quite all) grouping constructs also capture the part of the search string that matches the group. You can retrieve the captured portion or refer to it later in several different ways.

Remember the match object that `re.search()` returns? There are two methods defined for a match object that provide access to captured groups: `.groups()` and `.group()`.

**`m.groups()`**

    Returns a tuple containing all the captured groups from a regex match.

Consider this example:

In [117]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print (m) # <_sre.SRE_Match object; span=(0, 12), match='foo,quux,baz'>

<re.Match object; span=(0, 12), match='foo,quux,baz'>


Each of the three `(\w+)` expressions matches a sequence of word characters. The full regex `(\w+),(\w+),(\w+)` breaks the search string into three comma-separated tokens.

Because the `(\w+)` expressions use grouping parentheses, the corresponding matching tokens are captured. To access the captured matches, you can use `.groups()`, which returns a tuple containing all the captured matches in order:

In [118]:
m.groups() # ('foo', 'quux', 'baz')

('foo', 'quux', 'baz')

Notice that the tuple contains the tokens but not the commas that appeared in the search string. That’s because the word characters that make up the tokens are inside the grouping parentheses but the commas aren’t. The commas that you see between the returned tokens are the standard delimiters used to separate values in a tuple.
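
Because `.groups()` returns an ordinary tuple, it can be unpacked into variables. The sketch below also shows what happens when an optional group takes no part in the match: it is reported as `None`, or as whatever default value is passed to `.groups()`. The second pattern is an assumption made for this illustration, not part of the example above:

```python
import re

m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
first, second, third = m.groups()   # tuple unpacking
print(second)                       # quux

# The second group is optional and finds nothing to match in 'foo'.
m2 = re.search(r'(\w+),?(\w+)?', 'foo')
print(m2.groups())     # ('foo', None)
print(m2.groups(''))   # ('foo', '')
```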

**`m.group(<n>)`**

    Returns a string containing the <n>th captured match.

With one argument, `.group()` returns a single captured match. Note that the arguments are one-based, not zero-based. So, `m.group(1)` refers to the first captured match, `m.group(2)` to the second, and so on:

In [119]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print (m.groups()) # ('foo', 'quux', 'baz')

print(m.group(1)) # 'foo'

print (m.group(2)) # 'quux'

print (m.group(3)) # 'baz'

('foo', 'quux', 'baz')
foo
quux
baz


Since the numbering of captured matches is one-based, and there isn’t any group numbered zero, `m.group(0)` has a special meaning:

In [120]:
print (m.group(0)) # 'foo,quux,baz'
print (m.group()) # 'foo,quux,baz'

foo,quux,baz
foo,quux,baz


`m.group(0)` returns the entire match, and `m.group()` does the same.

**`m.group(<n1>, <n2>, ...)`**

    Returns a tuple containing the specified captured matches.

With multiple arguments, `.group()` returns a tuple containing the specified captured matches in the given order:

In [121]:
print (m.groups()) # ('foo', 'quux', 'baz')

print (m.group(2, 3)) # ('quux', 'baz')

print(m.group(3, 2, 1)) # ('baz', 'quux', 'foo')

('foo', 'quux', 'baz')
('quux', 'baz')
('baz', 'quux', 'foo')


This is just convenient shorthand. You could create the tuple of matches yourself instead:

In [122]:
print (m.group(3, 2, 1)) # ('baz', 'quux', 'foo')

print (m.group(3), m.group(2), m.group(1)) # baz quux foo

('baz', 'quux', 'foo')
baz quux foo


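The two statements retrieve the same three values. The only difference is in the output: the first `print()` receives a single tuple, while the second receives three separate strings.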

#### RP - Backreferences

You can match a previously captured group later within the same regex using a special metacharacter sequence called a backreference.

**`\<n>`**

    Matches the contents of a previously captured group.

Within a regex in Python, the sequence `\<n>`, where `<n>` is an integer from 1 to 99, matches the contents of the `<n>`th captured group.

Here’s a regex that matches a word, followed by a comma, followed by the same word again:

In [123]:
regex = r'(\w+),\1'

m = re.search(regex, 'foo,foo')
print (m) # <_sre.SRE_Match object; span=(0, 7), match='foo,foo'>

print (m.group(1)) # 'foo'

<re.Match object; span=(0, 7), match='foo,foo'>
foo


In [124]:
m = re.search(regex, 'qux,qux')

print(m) # <_sre.SRE_Match object; span=(0, 7), match='qux,qux'>

print (m.group(1)) # 'qux'

<re.Match object; span=(0, 7), match='qux,qux'>
qux


In [125]:
m = re.search(regex, 'foo,qux')

print(m) # None

None


In the first example, `(\w+)` matches the first instance of the string `'foo'` and saves it as the first captured group. The comma matches literally. Then `\1` is a backreference to the first captured group and matches `'foo'` again. The second example is identical except that the `(\w+)` matches `'qux'` instead.

The last example doesn’t have a match because what comes before the comma isn’t the same as what comes after it, so the `\1` backreference doesn’t match.
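
A common practical use of the same idea is spotting an accidentally doubled word. This is a minimal sketch, assuming whitespace-separated words. The `\b` word-boundary anchors and `re.compile()` are standard `re` features, and the name `doubled` is chosen here only for illustration:

```python
import re

# A word, some whitespace, then the same word again.
doubled = re.compile(r'\b(\w+)\s+\1\b')

m = doubled.search('this is is a test')
print(m.group())    # is is
print(m.group(1))   # is

print(doubled.search('this is a test'))   # None
```

The `\b` anchors keep the `'is'` inside `'this'` from pairing with the following `'is'`.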

Note: Any time you use a regex in Python with a numbered backreference, it’s a good idea to specify it as a raw string. Otherwise, the interpreter may confuse the backreference with an octal value.

Consider this example:

In [126]:
print(re.search('([a-z])#\1', 'd#d')) # None

None


The regex `([a-z])#\1` matches a lowercase letter, followed by `'#'`, followed by the same lowercase letter. The string in this case is `'d#d'`, which should match. But the match fails because Python misinterprets the backreference `\1` as the character whose octal value is one:

In [127]:
oct(ord('\1')) # '0o1'

'0o1'

You’ll achieve the correct match if you specify the regex as a raw string:

In [128]:
print (re.search(r'([a-z])#\1', 'd#d')) # <_sre.SRE_Match object; span=(0, 3), match='d#d'>

<re.Match object; span=(0, 3), match='d#d'>


Remember to consider using a raw string whenever your regex includes a metacharacter sequence containing a backslash.
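
One quick way to see what the raw string prefix changes is to compare the two string literals directly, without involving `re` at all. The variable names below are only for illustration:

```python
plain = '([a-z])#\1'
raw = r'([a-z])#\1'

print(len(plain))            # 9: '\1' collapsed into one control character
print(len(raw))              # 10: the backslash and '1' stay separate
print(plain[-1] == '\x01')   # True
print(raw[-2:] == '\\1')     # True
```

Only the raw version hands the two characters `\1` to the regex parser, which is what it needs in order to recognize a backreference.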

Numbered backreferences are one-based like the arguments to `.group()`. Only the first ninety-nine captured groups are accessible by backreference. The interpreter will regard `\100` as the `'@'` character, whose octal value is `100`.

#### RP - Other Grouping Constructs

The `(<regex>)` metacharacter sequence shown above is the most straightforward way to perform grouping within a regex in Python. The next section introduces you to some enhanced grouping constructs that allow you to tweak when and how grouping occurs.

**`(?P<name><regex>)`**

    Creates a named captured group.

This metacharacter sequence is similar to grouping parentheses in that it creates a group matching `<regex>` that is accessible through the match object or a subsequent backreference. The difference in this case is that you reference the matched group by its given symbolic `<name>` instead of by its number.

Earlier, you saw this example with three captured groups numbered 1, 2, and 3:

In [129]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print (m.groups()) # ('foo', 'quux', 'baz')

print (m.group(1, 2, 3)) # ('foo', 'quux', 'baz')

('foo', 'quux', 'baz')
('foo', 'quux', 'baz')


The following effectively does the same thing except that the groups have the symbolic names `w1`, `w2`, and `w3`:

In [130]:
m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')
print (m.groups()) # ('foo', 'quux', 'baz')

('foo', 'quux', 'baz')


You can refer to these captured groups by their symbolic names:

In [131]:
print (m.group('w1')) # 'foo'

print (m.group('w3')) # 'baz'

print (m.group('w1', 'w2', 'w3')) # ('foo', 'quux', 'baz')

foo
baz
('foo', 'quux', 'baz')


You can still access groups with symbolic names by number if you wish:

In [132]:
m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')

print (m.group('w1')) # 'foo'

print (m.group(1)) # 'foo'

print (m.group('w1', 'w2', 'w3')) # ('foo', 'quux', 'baz')

print (m.group(1, 2, 3)) # ('foo', 'quux', 'baz')

foo
foo
('foo', 'quux', 'baz')
('foo', 'quux', 'baz')


Any `<name>` specified with this construct must conform to the rules for a Python identifier, and each `<name>` can only appear once per regex.
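
Named groups can also be collected all at once. The sketch below uses `.groupdict()`, a standard match object method not covered in the text above, along with subscript access on the match object (available in Python 3.6 and later):

```python
import re

m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')

print(m.groupdict())   # {'w1': 'foo', 'w2': 'quux', 'w3': 'baz'}
print(m['w2'])         # quux, same as m.group('w2')
```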

**`(?P=<name>)`**

    Matches the contents of a previously captured named group.

The `(?P=<name>)` metacharacter sequence is a backreference, similar to `\<n>`, except that it refers to a named group rather than a numbered group.

Here again is the example from above, which uses a numbered backreference to match a word, followed by a comma, followed by the same word again:

In [133]:
m = re.search(r'(\w+),\1', 'foo,foo')

print (m) # <_sre.SRE_Match object; span=(0, 7), match='foo,foo'>

print (m.group(1)) # 'foo'

<re.Match object; span=(0, 7), match='foo,foo'>
foo


The following code does the same thing using a named group and a backreference instead:

In [134]:
m = re.search(r'(?P<word>\w+),(?P=word)', 'foo,foo')

print (m) # <_sre.SRE_Match object; span=(0, 7), match='foo,foo'>

print (m.group('word')) # 'foo'

<re.Match object; span=(0, 7), match='foo,foo'>
foo


`(?P<word>\w+)` matches `'foo'` and saves it as a captured group named `word`. Again, the comma matches literally. Then `(?P=word)` is a backreference to the named capture and matches `'foo'` again.

Note: The angle brackets (`<` and `>`) are required around `name` when creating a named group but not when referring to it later, either by backreference or by `.group()`:

In [135]:
m = re.match(r'(?P<num>\d+)\.(?P=num)', '135.135')

print (m) # <_sre.SRE_Match object; span=(0, 7), match='135.135'>

print (m.group('num')) # '135'

<re.Match object; span=(0, 7), match='135.135'>
135


Here, `(?P<num>\d+)` creates the captured group. But the corresponding backreference is `(?P=num)` without the angle brackets.

**`(?:<regex>)`**

    Creates a non-capturing group.

`(?:<regex>)` is just like `(<regex>)` in that it matches the specified `<regex>`. But `(?:<regex>)` doesn’t capture the match for later retrieval:

    >>> m = re.search(r'(\w+),(?:\w+),(\w+)', 'foo,quux,baz')
    >>> m.groups()
    ('foo', 'baz')

    >>> m.group(1)
    'foo'
    >>> m.group(2)
    'baz'

In this example, the middle word `'quux'` sits inside non-capturing parentheses, so it’s missing from the tuple of captured groups. It isn’t retrievable from the match object, nor would it be referable by backreference.

Why would you want to define a group but not capture it?

Remember that the regex parser will treat the `<regex>` inside grouping parentheses as a single unit. You may have a situation where you need this grouping feature, but you don’t need to do anything with the value later, so you don’t really need to capture it. If you use non-capturing grouping, then the tuple of captured groups won’t be cluttered with values you don’t actually need to keep.

Additionally, it takes some time and memory to capture a group. If the code that performs the match executes many times and you don’t capture groups that you aren’t going to use later, then you may see a slight performance advantage.
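
The "grouping without clutter" case typically arises when a quantifier must apply to a multi-character unit. In this sketch (the pattern and search string are chosen here for illustration), only the trailing digits are of interest:

```python
import re

s = 'barbaz42'

capturing = re.search(r'(ba[rz])+(\d+)', s)
non_capturing = re.search(r'(?:ba[rz])+(\d+)', s)

print(capturing.groups())       # ('baz', '42')
print(non_capturing.groups())   # ('42',)
```

Both regexes match the same text. With the capturing version, group 1 holds only the last repetition, `'baz'`, which is rarely what you want to keep anyway.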

**`(?(<n>)<yes-regex>|<no-regex>)`**  
**`(?(<name>)<yes-regex>|<no-regex>)`**

    Specifies a conditional match.

A conditional match matches against one of two specified regexes depending on whether the given group exists:

- `(?(<n>)<yes-regex>|<no-regex>)` matches against `<yes-regex>` if a group numbered `<n>` exists. Otherwise, it matches against `<no-regex>`.

- `(?(<name>)<yes-regex>|<no-regex>)` matches against `<yes-regex>` if a group named `<name>` exists. Otherwise, it matches against `<no-regex>`.

Conditional matches are better illustrated with an example. Consider this regex:

In [136]:
regex = r'^(###)?foo(?(1)bar|baz)'

Here are the parts of this regex broken out with some explanation:

1. `^(###)?` indicates that the search string optionally begins with `'###'`. If it does, then the grouping parentheses around `###` will create a group numbered `1`. Otherwise, no such group will exist.
    
2. The next portion, `foo`, literally matches the string `'foo'`.
    
3. Lastly, `(?(1)bar|baz)` matches against `'bar'` if group 1 exists and `'baz'` if it doesn’t.

The following code blocks demonstrate the use of the above regex in several different Python code snippets:

**Example 1:**

In [137]:
print (re.search(regex, '###foobar')) # <_sre.SRE_Match object; span=(0, 9), match='###foobar'>

<re.Match object; span=(0, 9), match='###foobar'>


The search string `'###foobar'` does start with `'###'`, so the parser creates a group numbered `1`. The conditional match is then against `'bar'`, which matches.

**Example 2:**

In [138]:
print(re.search(regex, '###foobaz')) # None

None


The search string `'###foobaz'` does start with `'###'`, so the parser creates a group numbered `1`. The conditional match is then against `'bar'`, which doesn’t match.

**Example 3:**

In [139]:
print(re.search(regex, 'foobar')) # None

None


The search string `'foobar'` doesn’t start with `'###'`, so there isn’t a group numbered `1`. The conditional match is then against `'baz'`, which doesn’t match.

**Example 4:**

In [140]:
print (re.search(regex, 'foobaz')) # <_sre.SRE_Match object; span=(0, 6), match='foobaz'>

<re.Match object; span=(0, 6), match='foobaz'>


The search string `'foobaz'` doesn’t start with `'###'`, so there isn’t a group numbered `1`. The conditional match is then against `'baz'`, which matches.

Here’s another conditional match using a named group instead of a numbered group:

In [141]:
regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'

This regex matches the string `'foo'`, preceded by a single non-word character and followed by the same non-word character, or the string `'foo'` by itself.

Again, let’s break this down into pieces:

```
Regex 	          Matches

^ 	            The start of the string  

(?P<ch>\W) 	   A single non-word character, captured in a group named ch  

(?P<ch>\W)? 	  Zero or one occurrences of the above  

foo 	          The literal string 'foo'  

(?(ch)(?P=ch)|)   The contents of the group named ch if it exists, or   
                   the empty string if it doesn’t   

$ 	            The end of the string  
```

If a non-word character precedes `'foo'`, then the parser creates a group named `ch` which contains that character. The conditional match then matches against `<yes-regex>`, which is `(?P=ch)`, the same character again. That means the same character must also follow `'foo'` for the entire match to succeed.

If `'foo'` isn’t preceded by a non-word character, then the parser doesn’t create group `ch`. `<no-regex>` is the empty string, which means there must not be anything following `'foo'` for the entire match to succeed. Since `^` and `$` anchor the whole regex, the string must equal `'foo'` exactly.

Here are some examples of searches using this regex in Python code:

In [142]:
print (re.search(regex, 'foo')) # line 1 - <_sre.SRE_Match object; span=(0, 3), match='foo'>

print (re.search(regex, '#foo#')) # line 3 - <_sre.SRE_Match object; span=(0, 5), match='#foo#'>

print (re.search(regex, '@foo@')) # line 5 - <_sre.SRE_Match object; span=(0, 5), match='@foo@'>

print(re.search(regex, '#foo')) # line 7 - None

print(re.search(regex, 'foo@')) # line 9 - None

print(re.search(regex, '#foo@')) # line 11 - None

print(re.search(regex, '@foo#')) # line 13 - None

<re.Match object; span=(0, 3), match='foo'>
<re.Match object; span=(0, 5), match='#foo#'>
<re.Match object; span=(0, 5), match='@foo@'>
None
None
None
None


On line 1, `'foo'` is by itself. On lines 3 and 5, the same non-word character precedes and follows `'foo'`. As advertised, these matches succeed.

In the remaining cases, the matches fail.

Conditional regexes in Python are pretty esoteric and challenging to work through. If you ever do find a reason to use one, then you could probably accomplish the same goal with multiple separate `re.search()` calls, and your code would be less complicated to read and understand.
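
As a sketch of that simpler alternative, the named-group conditional regex above can be replaced by two plain searches. The helper name `is_wrapped_foo` and the `outcomes` dictionary are hypothetical, introduced only for this illustration:

```python
import re

# Hypothetical helper: 'foo' alone, or 'foo' wrapped in the same
# non-word character on both sides.
def is_wrapped_foo(s):
    if re.search(r'^foo$', s):
        return True
    return re.search(r'^(\W)foo\1$', s) is not None

samples = ['foo', '#foo#', '@foo@', '#foo', 'foo@', '#foo@', '@foo#']
outcomes = {s: is_wrapped_foo(s) for s in samples}

for s, ok in outcomes.items():
    print(f'{s:6}', ok)   # True for the first three samples, False for the rest
```

Each of the two patterns is easy to read on its own, at the cost of a second call to `re.search()`.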

#### RP - Lookahead and Lookbehind Assertions

**Lookahead** and **lookbehind** assertions determine the success or failure of a regex match in Python based on what is just behind (to the left) or ahead (to the right) of the parser’s current position in the search string.

Like anchors, lookahead and lookbehind assertions are zero-width assertions, so they don’t consume any of the search string. Also, even though they contain parentheses and perform grouping, they don’t capture what they match.

**`(?=<lookahead_regex>)`**

    Creates a positive lookahead assertion.

`(?=<lookahead_regex>)` asserts that what follows the regex parser’s current position must match `<lookahead_regex>`:

In [148]:
print (re.search('foo(?=[a-z])', 'foobar')) # <_sre.SRE_Match object; span=(0, 3), match='foo'>

<re.Match object; span=(0, 3), match='foo'>


The lookahead assertion `(?=[a-z])` specifies that what follows `'foo'` must be a lowercase alphabetic character. In this case, it’s the character `'b'`, so a match is found.

In the next example, on the other hand, the lookahead fails. The next character after `'foo'` is `'1'`, so there isn’t a match:

In [147]:
print(re.search('foo(?=[a-z])', 'foo123')) # None

None


What’s unique about a lookahead is that the portion of the search string that matches `<lookahead_regex>` isn’t consumed, and it isn’t part of the returned match object.

Take another look at the first example:

In [146]:
print (re.search('foo(?=[a-z])', 'foobar')) # <_sre.SRE_Match object; span=(0, 3), match='foo'>

<re.Match object; span=(0, 3), match='foo'>


The regex parser looks ahead only to the `'b'` that follows `'foo'` but doesn’t pass over it yet. You can tell that `'b'` isn’t considered part of the match because the match object displays `match='foo'`.

Compare that to a similar example that uses grouping parentheses without a lookahead:

In [145]:
print (re.search('foo([a-z])', 'foobar')) # <_sre.SRE_Match object; span=(0, 4), match='foob'>

<re.Match object; span=(0, 4), match='foob'>


This time, the regex consumes the `'b'`, and it becomes a part of the eventual match.

Here’s another example illustrating how a lookahead differs from a conventional regex in Python:

In [143]:
m = re.search('foo(?=[a-z])(?P<ch>.)', 'foobar') # line 1

print (m.group('ch')) # 'b'

b


In [144]:
m = re.search('foo([a-z])(?P<ch>.)', 'foobar') # line 5

print (m.group('ch')) # 'a'

a


In the first search, on line 1, the parser proceeds as follows:

1. The first portion of the regex, `foo`, matches and consumes `'foo'` from the search string `'foobar'`.

2. The next portion, `(?=[a-z])`, is a lookahead that matches `'b'`, but the parser doesn’t advance past the `'b'`.
    
3. Lastly, `(?P<ch>.)` matches the next single character available, which is `'b'`, and captures it in a group named `ch`.

The `m.group('ch')` call confirms that the group named `ch` contains `'b'`.

Compare that to the search on line 5, which doesn’t contain a lookahead:

1. As in the first example, the first portion of the regex, `foo`, matches and consumes `'foo'` from the search string `'foobar'`.

2. The next portion, `([a-z])`, matches and consumes `'b'`, and the parser advances past `'b'`.

3. Lastly, `(?P<ch>.)` matches the next single character available, which is now `'a'`.

`m.group('ch')` confirms that, in this case, the group named `ch` contains `'a'`.
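
The fact that a lookahead doesn’t consume text is useful when you want to require a delimiter without including it in the result. This sketch uses `re.findall()` on a comma-separated string chosen for illustration:

```python
import re

s = 'foo,bar,baz'

with_lookahead = re.findall(r'\w+(?=,)', s)   # comma required, not consumed
without = re.findall(r'\w+,', s)              # comma consumed and returned

print(with_lookahead)   # ['foo', 'bar']
print(without)          # ['foo,', 'bar,']
```

In both cases `'baz'` is left out because no comma follows it.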

**`(?!<lookahead_regex>)`**

    Creates a negative lookahead assertion.

`(?!<lookahead_regex>)` asserts that what follows the regex parser’s current position must not match `<lookahead_regex>`.

Here are the positive lookahead examples you saw earlier, along with their negative lookahead counterparts:

In [151]:
print (re.search('foo(?=[a-z])', 'foobar')) # <_sre.SRE_Match object; span=(0, 3), match='foo'>

print(re.search('foo(?![a-z])', 'foobar')) # None

print(re.search('foo(?=[a-z])', 'foo123')) # None

print(re.search('foo(?![a-z])', 'foo123')) # <_sre.SRE_Match object; span=(0, 3), match='foo'>

<re.Match object; span=(0, 3), match='foo'>
None
None
<re.Match object; span=(0, 3), match='foo'>


The negative lookahead assertions on lines 3 and 7 stipulate that what follows `'foo'` should not be a lowercase alphabetic character. This fails on line 3 but succeeds on line 7. This is the opposite of what happened with the corresponding positive lookahead assertions on lines 1 and 5.

As with a positive lookahead, what matches a negative lookahead isn’t part of the returned match object and isn’t consumed.
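
As a small illustration, a negative lookahead can filter out occurrences that are followed by a particular string. This sketch uses a made-up search string and reports the span of each surviving match:

```python
import re

text = 'foobar foobaz foo'

# Find every 'foo' that is NOT immediately followed by 'bar'.
spans = [m.span() for m in re.finditer(r'foo(?!bar)', text)]
print(spans)  # [(7, 10), (14, 17)]
```

Each span is three characters long, which confirms that the text examined by the lookahead is not included in the match.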

**`(?<=<lookbehind_regex>)`**

    Creates a positive lookbehind assertion.

`(?<=<lookbehind_regex>)` asserts that what precedes the regex parser’s current position must match `<lookbehind_regex>`.

In the following example, the lookbehind assertion specifies that `'foo'` must precede `'bar'`:

In [154]:
print (re.search('(?<=foo)bar', 'foobar')) # <_sre.SRE_Match object; span=(3, 6), match='bar'>

<re.Match object; span=(3, 6), match='bar'>


This is the case here, so the match succeeds. As with lookahead assertions, the part of the search string that matches the lookbehind doesn’t become part of the eventual match.

The next example fails to match because the lookbehind requires that `'qux'` precede `'bar'`:

In [155]:
print(re.search('(?<=qux)bar', 'foobar')) # None

None


There’s a restriction on lookbehind assertions that doesn’t apply to lookahead assertions. The `<lookbehind_regex>` in a lookbehind assertion must specify a match of fixed length.

For example, the following isn’t allowed because the length of the string matched by `a+` is indeterminate:

In [None]:
print (re.search('(?<=a+)def', 'aaadef')) #
'''
Traceback (most recent call last):
  File "<pyshell#72>", line 1, in <module>
    re.search('(?<=a+)def', 'aaadef')
  File "C:\Python36\lib\re.py", line 182, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python36\lib\re.py", line 301, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python36\lib\sre_compile.py", line 566, in compile
    code = _code(p, flags)
  File "C:\Python36\lib\sre_compile.py", line 551, in _code
    _compile(code, p.data, flags)
  File "C:\Python36\lib\sre_compile.py", line 160, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern
'''

This, however, is okay:

In [158]:
print (re.search('(?<=a{3})def', 'aaadef')) # <_sre.SRE_Match object; span=(3, 6), match='def'>

<re.Match object; span=(3, 6), match='def'>


Anything that matches `a{3}` will have a fixed length of three, so `a{3}` is valid in a lookbehind assertion.
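
The sketch below puts both points together, using a made-up price string: a fixed-width lookbehind that picks out digits preceded by a dollar sign, and a `try`/`except` that shows the variable-width pattern being rejected at compile time. The exact error message may differ between Python versions, so it is printed rather than assumed:

```python
import re

text = 'cost: $45, or 30 euros'

# Match digits only when a dollar sign comes immediately before them.
dollars = re.findall(r'(?<=\$)\d+', text)
print(dollars)  # ['45']

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=a+)def')
    rejected = False
except re.error as exc:
    rejected = True
    print('rejected:', exc)
```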

**`(?<!<lookbehind_regex>)`**

    Creates a negative lookbehind assertion.

`(?<!<lookbehind_regex>)` asserts that what precedes the regex parser’s current position must not match `<lookbehind_regex>`:

In [160]:
print(re.search('(?<!foo)bar', 'foobar')) # None

None


In [159]:
print (re.search('(?<!qux)bar', 'foobar')) # <_sre.SRE_Match object; span=(3, 6), match='bar'>

<re.Match object; span=(3, 6), match='bar'>


As with the positive lookbehind assertion, `<lookbehind_regex>` must specify a match of fixed length.
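
For a slightly more practical illustration, this sketch uses a made-up price string and a negative lookbehind to pick out the number that is not a dollar amount. The `\b` is an added assumption that keeps the search from matching in the middle of a number:

```python
import re

text = 'cost: $45, or 30 euros'

# Match whole numbers that are NOT immediately preceded by a dollar sign.
non_dollar = re.findall(r'(?<!\$)\b\d+', text)
print(non_dollar)  # ['30']
```

Without the `\b`, the pattern would also match the `'5'` in `'$45'`, because that digit is preceded by `'4'` rather than by `'$'`.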

### RP - Miscellaneous Metacharacters

There are a couple more metacharacter sequences to cover. These are stray metacharacters that don’t obviously fall into any of the categories already discussed.

**`(?#...)`**

    Specifies a comment.

The regex parser ignores anything contained in the sequence `(?#...)`:

In [161]:
print (re.search('bar(?#This is a comment) *baz', 'foo bar baz qux')) # <_sre.SRE_Match object; span=(4, 11), match='bar baz'>

<re.Match object; span=(4, 11), match='bar baz'>


This allows you to specify documentation inside a regex in Python, which can be especially useful if the regex is particularly long.

**Vertical bar, or pipe (|)**

    Specifies a set of alternatives on which to match.

An expression of the form `<regex_1>|<regex_2>|...|<regex_n>` matches at most one of the specified `<regex_i>` expressions:

In [162]:
print (re.search('foo|bar|baz', 'bar')) # <_sre.SRE_Match object; span=(0, 3), match='bar'>

<re.Match object; span=(0, 3), match='bar'>


In [163]:
print (re.search('foo|bar|baz', 'baz')) # <_sre.SRE_Match object; span=(0, 3), match='baz'>

<re.Match object; span=(0, 3), match='baz'>


In [164]:
print(re.search('foo|bar|baz', 'quux')) # None

None


Here, `foo|bar|baz` will match any of `'foo'`, `'bar'`, or `'baz'`. You can separate any number of regexes using `|`.

Alternation is non-greedy. The regex parser looks at the expressions separated by `|` in left-to-right order and returns the first match that it finds. The remaining expressions aren’t tested, even if one of them would produce a longer match:

In [165]:
print (re.search('foo', 'foograult')) # <_sre.SRE_Match object; span=(0, 3), match='foo'>

<re.Match object; span=(0, 3), match='foo'>


In [166]:
print (re.search('grault', 'foograult')) # <_sre.SRE_Match object; span=(3, 9), match='grault'>

<re.Match object; span=(3, 9), match='grault'>


In [167]:
print (re.search('foo|grault', 'foograult')) # <_sre.SRE_Match object; span=(0, 3), match='foo'>

<re.Match object; span=(0, 3), match='foo'>


In this case, the pattern in the last search, `'foo|grault'`, would match on either `'foo'` or `'grault'`. The match returned is `'foo'` because that appears first when scanning from left to right, even though `'grault'` would be a longer match.
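
The left-to-right rule matters most when one alternative is a prefix of another. In this short sketch, both alternatives could match at the same position, so the order in which you list them decides the result:

```python
import re

# 'foo' is tried first and succeeds, so 'foobar' is never attempted.
first = re.search(r'foo|foobar', 'foobar')

# Listing the longer alternative first lets it win.
second = re.search(r'foobar|foo', 'foobar')

print(first.group())   # foo
print(second.group())  # foobar
```

A common rule of thumb is to list longer alternatives before their shorter prefixes.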

You can combine alternation, grouping, and any other metacharacters to achieve whatever level of complexity you need. In the following example, `(foo|bar|baz)+` means a sequence of one or more of the strings `'foo'`, `'bar'`, or `'baz'`:

>>> re.search('(foo|bar|baz)+', 'foofoofoo')
<_sre.SRE_Match object; span=(0, 9), match='foofoofoo'>
>>> re.search('(foo|bar|baz)+', 'bazbazbazbaz')
<_sre.SRE_Match object; span=(0, 12), match='bazbazbazbaz'>
>>> re.search('(foo|bar|baz)+', 'barbazfoo')
<_sre.SRE_Match object; span=(0, 9), match='barbazfoo'>

In the next example, `([0-9]+|[a-f]+)` means a sequence of one or more decimal digit characters or a sequence of one or more of the characters `'a-f'`:

In [168]:
print (re.search('([0-9]+|[a-f]+)', '456')) # <_sre.SRE_Match object; span=(0, 3), match='456'>

<re.Match object; span=(0, 3), match='456'>


In [169]:
print (re.search('([0-9]+|[a-f]+)', 'ffda')) # <_sre.SRE_Match object; span=(0, 4), match='ffda'>

<re.Match object; span=(0, 4), match='ffda'>


With all the metacharacters that the `re` module supports, the sky is practically the limit.

#### RP - That’s All, Folks!

That completes our tour of the regex metacharacters supported by Python’s `re` module. (Actually, it doesn’t quite — there are a couple more stragglers you’ll learn about below in the discussion on flags.)

It’s a lot to digest, but once you become familiar with regex syntax in Python, the complexity of pattern matching that you can perform is almost limitless. These tools come in very handy when you’re writing code to process textual data.

If you’re new to regexes and want more practice working with them, or if you’re developing an application that uses a regex and you want to test it interactively, then check out the Regular Expressions 101 website. It’s seriously cool!

### RP - Modified Regular Expression Matching With Flags

Most of the functions in the `re` module take an optional `<flags>` argument. This includes the function you’re now very familiar with, `re.search()`.

`re.search(<regex>, <string>, <flags>)`

- Scans a string for a regex match, applying the specified modifier `<flags>`.

Flags modify regex parsing behavior, allowing you to refine your pattern matching even further.

#### RP - Supported Regular Expression Flags

The table below briefly summarizes the available flags. All flags except `re.DEBUG` have a short, single-letter name and also a longer, full-word name:

```
Short  Long
Name   Name           Effect

re.I   re.IGNORECASE  Makes matching of alphabetic characters case-insensitive
re.M   re.MULTILINE   Causes start-of-string and end-of-string anchors to match
                      at embedded newlines
re.S   re.DOTALL      Causes the dot metacharacter to match a newline
re.X   re.VERBOSE     Allows inclusion of whitespace and comments within a
                      regular expression
----   re.DEBUG       Causes the regex parser to display debugging information to
                      the console
re.A   re.ASCII       Specifies ASCII encoding for character classification
re.U   re.UNICODE     Specifies Unicode encoding for character classification
re.L   re.LOCALE      Specifies encoding for character classification based on the
                      current locale
```

The following sections describe in more detail how these flags affect matching behavior.

**`re.I`**  
**`re.IGNORECASE`**

- Makes matching case insensitive.

When `IGNORECASE` is in effect, character matching is case insensitive:

In [None]:
print(re.search('a+', 'aaaAAA')) # <_sre.SRE_Match object; span=(0, 3), match='aaa'>
print(re.search('A+', 'aaaAAA')) # <_sre.SRE_Match object; span=(3, 6), match='AAA'>
print(re.search('a+', 'aaaAAA', re.I)) # <_sre.SRE_Match object; span=(0, 6), match='aaaAAA'>
print(re.search('A+', 'aaaAAA', re.IGNORECASE)) # <_sre.SRE_Match object; span=(0, 6), match='aaaAAA'>

In the first search, `a+` matches only the first three characters of `'aaaAAA'`. Similarly, in the second search, `A+` matches only the last three characters. But in the subsequent searches, the parser ignores case, so both `a+` and `A+` match the entire string.

`IGNORECASE` affects alphabetic matching involving character classes as well:

In [None]:
print(re.search('[a-z]+', 'aBcDeF')) # <_sre.SRE_Match object; span=(0, 1), match='a'>
print(re.search('[a-z]+', 'aBcDeF', re.I)) # <_sre.SRE_Match object; span=(0, 6), match='aBcDeF'>

When case is significant, the longest portion of `'aBcDeF'` that `[a-z]+` matches is just the initial `'a'`. Specifying `re.I` makes the search case insensitive, so `[a-z]+` matches the entire string.
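
Here is a short sketch of the same idea with `re.findall()`, using a made-up sentence, so you can see every match that the flag adds:

```python
import re

text = 'Python, PYTHON, and python'

case_sensitive = re.findall(r'python', text)
case_insensitive = re.findall(r'python', text, re.I)

print(case_sensitive)    # ['python']
print(case_insensitive)  # ['Python', 'PYTHON', 'python']
```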

**`re.M`**  
**`re.MULTILINE`**  

    Causes start-of-string and end-of-string anchors to match at embedded newlines.

By default, the `^` (start-of-string) and `$` (end-of-string) anchors match only at the beginning and end of the search string:

In [None]:
s = 'foo\nbar\nbaz'
print(re.search('^foo', s)) # <_sre.SRE_Match object; span=(0, 3), match='foo'>
print(re.search('^bar', s)) # None
print(re.search('^baz', s)) # None
print(re.search('foo$', s)) # None
print(re.search('bar$', s)) # None
print(re.search('baz$', s)) # <_sre.SRE_Match object; span=(8, 11), match='baz'>

In this case, even though the search string `'foo\nbar\nbaz'` contains embedded newline characters, only `'foo'` matches when anchored at the beginning of the string, and only `'baz'` matches when anchored at the end.

If a string has embedded newlines, however, you can think of it as consisting of multiple internal lines. In that case, if the `MULTILINE` flag is set, the `^` and `$` anchor metacharacters match internal lines as well:

- `^` matches at the beginning of the string or at the beginning of any line within the string (that is, immediately following a newline).
- `$` matches at the end of the string or at the end of any line within the string (immediately preceding a newline).

The following are the same searches as shown above:

In [None]:
s = 'foo\nbar\nbaz'
print(s)

print(re.search('^foo', s, re.MULTILINE)) # <_sre.SRE_Match object; span=(0, 3), match='foo'>
print(re.search('^bar', s, re.MULTILINE)) # <_sre.SRE_Match object; span=(4, 7), match='bar'>
print(re.search('^baz', s, re.MULTILINE)) # <_sre.SRE_Match object; span=(8, 11), match='baz'>
print(re.search('foo$', s, re.M)) # <_sre.SRE_Match object; span=(0, 3), match='foo'>
print(re.search('bar$', s, re.M)) # <_sre.SRE_Match object; span=(4, 7), match='bar'>
print(re.search('baz$', s, re.M)) # <_sre.SRE_Match object; span=(8, 11), match='baz'>

In the string `'foo\nbar\nbaz'`, all three of `'foo'`, `'bar'`, and `'baz'` occur at either the start or end of the string or at the start or end of a line within the string. With the `MULTILINE` flag set, all three match when anchored with either `^` or `$`.
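
A compact way to see the effect is to collect every anchored match at once with `re.findall()`. This is a minimal sketch using the same search string:

```python
import re

s = 'foo\nbar\nbaz'

# Without MULTILINE, ^ matches only at the very start of the string.
default_starts = re.findall(r'^\w+', s)

# With MULTILINE, ^ and $ also match at each embedded line boundary.
line_starts = re.findall(r'^\w+', s, re.M)
line_ends = re.findall(r'\w+$', s, re.M)

print(default_starts)  # ['foo']
print(line_starts)     # ['foo', 'bar', 'baz']
print(line_ends)       # ['foo', 'bar', 'baz']
```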

Note: The `MULTILINE` flag only modifies the `^` and `$` anchors in this way. It doesn’t have any effect on the `\A` and `\Z` anchors:

In [None]:
s = 'foo\nbar\nbaz'
print(re.search('^bar', s, re.MULTILINE))  # <_sre.SRE_Match object; span=(4, 7), match='bar'>
print(re.search('bar$', s, re.MULTILINE))  # <_sre.SRE_Match object; span=(4, 7), match='bar'>
print(re.search(r'\Abar', s, re.MULTILINE)) # None
print(re.search(r'bar\Z', s, re.MULTILINE)) # None

In the first two searches, the `^` and `$` anchors dictate that `'bar'` must be found at the start and end of a line. Specifying the `MULTILINE` flag makes these matches succeed.

The last two searches use the `\A` and `\Z` anchors instead. You can see that these matches fail even with the `MULTILINE` flag in effect.

**`re.S`**  
**`re.DOTALL`**  

    Causes the dot (`.`) metacharacter to match a newline.

Remember that by default, the dot metacharacter matches any character except the newline character. The `DOTALL` flag lifts this restriction:

In [None]:
print(re.search('foo.bar', 'foo\nbar')) # None - Line 1
print(re.search('foo.bar', 'foo\nbar', re.DOTALL)) # <_sre.SRE_Match object; span=(0, 7), match='foo\nbar'> - Line 3
print(re.search('foo.bar', 'foo\nbar', re.S)) # <_sre.SRE_Match object; span=(0, 7), match='foo\nbar'> - Line 5

In this example, on line 1 the dot metacharacter doesn’t match the newline in `'foo\nbar'`. On lines 3 and 5, `DOTALL` is in effect, so the dot does match the newline. Note that the short name of the `DOTALL` flag is `re.S`, not `re.D` as you might expect.
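
`DOTALL` is most useful when a pattern such as `.*` needs to span several lines. The following sketch uses a made-up three-line string to show the difference:

```python
import re

text = 'start\nmiddle\nend'

# By default, .* stops at the first newline, so 'end' is never reached.
without_flag = re.search(r'start.*end', text)

# With DOTALL, .* can cross the newlines and the whole string matches.
with_flag = re.search(r'start.*end', text, re.S)

print(without_flag)               # None
print(with_flag.group() == text)  # True
```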

**`re.X`**  
**`re.VERBOSE`**  

- Allows inclusion of whitespace and comments within a regex.

The `VERBOSE` flag specifies a few special behaviors:

- The regex parser ignores all whitespace unless it’s within a character class or escaped with a backslash.

- If the regex contains a `#` character that isn’t contained within a character class or escaped with a backslash, then the parser ignores it and all characters to the right of it on the same line.

What’s the use of this? It allows you to format a regex in Python so that it’s more readable and self-documenting.

Here’s an example showing how you might put this to use. Suppose you want to parse phone numbers that have the following format:

- Optional three-digit area code, in parentheses
- Optional whitespace
- Three-digit prefix
- Separator (either `'-'` or `'.'`)
- Four-digit line number

The following regex does the trick:

In [68]:
regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'

print(re.search(regex, '414.9229')) # <_sre.SRE_Match object; span=(0, 8), match='414.9229'>
print(re.search(regex, '414-9229')) # <_sre.SRE_Match object; span=(0, 8), match='414-9229'>
print(re.search(regex, '(712)414-9229')) # <_sre.SRE_Match object; span=(0, 13), match='(712)414-9229'>
print(re.search(regex, '(712) 414-9229')) # <_sre.SRE_Match object; span=(0, 14), match='(712) 414-9229'>

<re.Match object; span=(0, 8), match='414.9229'>
<re.Match object; span=(0, 8), match='414-9229'>
<re.Match object; span=(0, 13), match='(712)414-9229'>
<re.Match object; span=(0, 14), match='(712) 414-9229'>


But `r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'` is an eyeful, isn’t it? Using the `VERBOSE` flag, you can write the same regex in Python like this instead:

In [69]:
regex = r'''^               # Start of string
(\(\d{3}\))?    # Optional area code
\s*             # Optional whitespace
\d{3}           # Three-digit prefix
[-.]            # Separator character
\d{4}           # Four-digit line number
$               # Anchor at end of string
'''

print(re.search(regex, '414.9229', re.VERBOSE)) # <_sre.SRE_Match object; span=(0, 8), match='414.9229'>
print(re.search(regex, '414-9229', re.VERBOSE)) # <_sre.SRE_Match object; span=(0, 8), match='414-9229'>
print(re.search(regex, '(712)414-9229', re.X)) # <_sre.SRE_Match object; span=(0, 13), match='(712)414-9229'>
print(re.search(regex, '(712) 414-9229', re.X)) # <_sre.SRE_Match object; span=(0, 14), match='(712) 414-9229'>

<re.Match object; span=(0, 8), match='414.9229'>
<re.Match object; span=(0, 8), match='414-9229'>
<re.Match object; span=(0, 13), match='(712)414-9229'>
<re.Match object; span=(0, 14), match='(712) 414-9229'>


The `re.search()` calls are the same as those shown above, so you can see that this regex works the same as the one specified earlier. But it’s less difficult to understand at first glance.

Note that triple quoting makes it particularly convenient to include embedded newlines, which qualify as ignored whitespace in `VERBOSE` mode.

When using the `VERBOSE` flag, be mindful of whitespace that you do intend to be significant. Consider these examples:

In [None]:
print(re.search('foo bar', 'foo bar')) # <_sre.SRE_Match object; span=(0, 7), match='foo bar'>
print(re.search('foo bar', 'foo bar', re.VERBOSE)) # None - Line 4
print(re.search(r'foo\ bar', 'foo bar', re.VERBOSE)) # <_sre.SRE_Match object; span=(0, 7), match='foo bar'> - Line 7
print(re.search('foo[ ]bar', 'foo bar', re.VERBOSE)) # <_sre.SRE_Match object; span=(0, 7), match='foo bar'> - Line 9

After all you’ve seen to this point, you may be wondering why on line 4 the regex `foo bar` doesn’t match the string `'foo bar'`. It doesn’t because the `VERBOSE` flag causes the parser to ignore the space character.

To make this match as expected, escape the space character with a backslash or include it in a character class, as shown on lines 7 and 9.

As with the `DOTALL` flag, note that the `VERBOSE` flag has a non-intuitive short name: `re.X`, not `re.V`.
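
`VERBOSE` pairs well with `re.compile()` and named groups, since each commented line can then describe the piece it captures. The sketch below is one possible variation on the phone number regex; the name `PHONE` and the group names `area`, `prefix`, and `line` are assumptions made for illustration and don’t come from the tutorial:

```python
import re

# Hypothetical named-group version of the phone number pattern.
PHONE = re.compile(r'''
    ^
    (?:\((?P<area>\d{3})\))?   # Optional area code, in parentheses
    \s*                        # Optional whitespace
    (?P<prefix>\d{3})          # Three-digit prefix
    [-.]                       # Separator character
    (?P<line>\d{4})            # Four-digit line number
    $
''', re.VERBOSE)

full = PHONE.search('(712) 414-9229').groupdict()
short = PHONE.search('414.9229').groupdict()

print(full)   # {'area': '712', 'prefix': '414', 'line': '9229'}
print(short)  # {'area': None, 'prefix': '414', 'line': '9229'}
```

Compiling the pattern once also means the flag is specified in a single place rather than on every search call.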

**`re.DEBUG`**

- Displays debugging information.

The `DEBUG` flag causes the regex parser in Python to display debugging information about the parsing process to the console:

In [None]:
print(re.search('foo.bar', 'fooxbar', re.DEBUG))
'''
LITERAL 102
LITERAL 111
LITERAL 111
ANY None
LITERAL 98
LITERAL 97
LITERAL 114
<_sre.SRE_Match object; span=(0, 7), match='fooxbar'>
''' 

When the parser displays `LITERAL nnn` in the debugging output, it’s showing the ASCII code of a literal character in the regex. In this case, the literal characters are `'f'`, `'o'`, `'o'` and `'b'`, `'a'`, `'r'`.

Here’s a more complicated example. This is the phone number regex shown in the discussion on the `VERBOSE` flag earlier:

In [70]:
regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'

print(re.search(regex, '414.9229', re.DEBUG)) 

# '''  
# AT AT_BEGINNING
# MAX_REPEAT 0 1
#   SUBPATTERN 1 0 0
#     LITERAL 40
#     MAX_REPEAT 3 3
#       IN
#         CATEGORY CATEGORY_DIGIT
#     LITERAL 41
# MAX_REPEAT 0 MAXREPEAT
#   IN
#     CATEGORY CATEGORY_SPACE
# MAX_REPEAT 3 3
#   IN
#     CATEGORY CATEGORY_DIGIT
# IN
#   LITERAL 45
#   LITERAL 46
# MAX_REPEAT 4 4
#   IN
#     CATEGORY CATEGORY_DIGIT
# AT AT_END
# <_sre.SRE_Match object; span=(0, 8), match='414.9229'>
# '''  

AT AT_BEGINNING
MAX_REPEAT 0 1
  SUBPATTERN 1 0 0
    LITERAL 40
    MAX_REPEAT 3 3
      IN
        CATEGORY CATEGORY_DIGIT
    LITERAL 41
MAX_REPEAT 0 MAXREPEAT
  IN
    CATEGORY CATEGORY_SPACE
MAX_REPEAT 3 3
  IN
    CATEGORY CATEGORY_DIGIT
IN
  LITERAL 45
  LITERAL 46
MAX_REPEAT 4 4
  IN
    CATEGORY CATEGORY_DIGIT
AT AT_END

 0. INFO 4 0b0 8 MAXREPEAT (to 5)
 5: AT BEGINNING
 7. REPEAT 21 0 1 (to 29)
11.   MARK 0
13.   LITERAL 0x28 ('(')
15.   REPEAT_ONE 9 3 3 (to 25)
19.     IN 4 (to 24)
21.       CATEGORY UNI_DIGIT
23.       FAILURE
24:     SUCCESS
25:   LITERAL 0x29 (')')
27.   MARK 1
29: MAX_UNTIL
30. REPEAT_ONE 9 0 MAXREPEAT (to 40)
34.   IN 4 (to 39)
36.     CATEGORY UNI_SPACE
38.     FAILURE
39:   SUCCESS
40: REPEAT_ONE 9 3 3 (to 50)
44.   IN 4 (to 49)
46.     CATEGORY UNI_DIGIT
48.     FAILURE
49:   SUCCESS
50: IN 5 (to 56)
52.   RANGE 0x2d 0x2e ('-'-'.')
55.   FAILURE
56: REPEAT_ONE 9 4 4 (to 66)
60.   IN 4 (to 65)
62.     CATEGORY UNI_DIGIT
64.     FAILURE
65:   SUCCESS


This looks like a lot of esoteric information that you’d never need, but it can be useful. See the Deep Dive below for a practical application.

### RP - Deep Dive: Debugging Regular Expression Parsing

As you know from above, the metacharacter sequence `{m,n}` indicates a specific number of repetitions. It matches anywhere from `m` to `n` repetitions of what precedes it:

In [71]:
print (re.search('x[123]{2,4}y', 'x222y')) # <_sre.SRE_Match object; span=(0, 5), match='x222y'>

<re.Match object; span=(0, 5), match='x222y'>


You can verify this with the `DEBUG` flag:

In [73]:
print(re.search('x[123]{2,4}y', 'x222y', re.DEBUG)) #
'''
LITERAL 120
MAX_REPEAT 2 4
  IN
    LITERAL 49
    LITERAL 50
    LITERAL 51
LITERAL 121
<_sre.SRE_Match object; span=(0, 5), match='x222y'>
'''

LITERAL 120
MAX_REPEAT 2 4
  IN
    LITERAL 49
    LITERAL 50
    LITERAL 51
LITERAL 121

 0. INFO 8 0b1 4 6 (to 9)
      prefix_skip 1
      prefix [0x78] ('x')
      overlap [0]
 9: LITERAL 0x78 ('x')
11. REPEAT_ONE 10 2 4 (to 22)
15.   IN 5 (to 21)
17.     RANGE 0x31 0x33 ('1'-'3')
20.     FAILURE
21:   SUCCESS
22: LITERAL 0x79 ('y')
24. SUCCESS
<re.Match object; span=(0, 5), match='x222y'>



`MAX_REPEAT 2 4` confirms that the regex parser recognizes the metacharacter sequence `{2,4}` and interprets it as a range quantifier.

But, as noted previously, if a pair of curly braces in a regex in Python contains anything other than a valid number or numeric range, then it loses its special meaning.

You can verify this also:

In [74]:
print(re.search('x[123]{foo}y', 'x222y', re.DEBUG))

#     LITERAL 120
#     IN
#       LITERAL 49
#       LITERAL 50
#       LITERAL 51
#     LITERAL 123
#     LITERAL 102
#     LITERAL 111
#     LITERAL 111
#     LITERAL 125
#     LITERAL 121

LITERAL 120
IN
  LITERAL 49
  LITERAL 50
  LITERAL 51
LITERAL 123
LITERAL 102
LITERAL 111
LITERAL 111
LITERAL 125
LITERAL 121

 0. INFO 8 0b1 8 8 (to 9)
      prefix_skip 1
      prefix [0x78] ('x')
      overlap [0]
 9: LITERAL 0x78 ('x')
11. IN 5 (to 17)
13.   RANGE 0x31 0x33 ('1'-'3')
16.   FAILURE
17: LITERAL 0x7b ('{')
19. LITERAL 0x66 ('f')
21. LITERAL 0x6f ('o')
23. LITERAL 0x6f ('o')
25. LITERAL 0x7d ('}')
27. LITERAL 0x79 ('y')
29. SUCCESS
None


You can see that there’s no `MAX_REPEAT` token in the debug output. The `LITERAL` tokens indicate that the parser treats `{foo}` literally and not as a quantifier metacharacter sequence. `123`, `102`, `111`, `111`, and `125` are the ASCII codes for the characters in the literal string `'{foo}'`.

Information displayed by the `DEBUG` flag can help you troubleshoot by showing you how the parser is interpreting your regex.
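
If you want to check the debugging output programmatically rather than read it on the console, one option is to capture it. The sketch below assumes that the `re` module writes its debugging output to standard output, which is how the output in the cells above reaches the notebook; `debug_dump` is a hypothetical helper name, not part of the `re` module:

```python
import contextlib
import io
import re

def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling the pattern."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()

quantified = debug_dump(r'x[123]{2,4}y')
literal = debug_dump(r'x[123]{foo}y')

# The quantifier shows up as MAX_REPEAT only when the braces hold a valid range.
print('MAX_REPEAT 2 4' in quantified)  # True
print('MAX_REPEAT' in literal)         # False
```

The exact layout of the dump is an implementation detail and varies between Python versions, so checks like these are better suited to exploration than to production code.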

Curiously, the `re` module doesn’t define a single-letter version of the `DEBUG` flag. You could define your own if you wanted to:

In [None]:
import re

re.D = re.DEBUG
print (re.D)
print(re.search('foo', 'foo', re.D))
# '''
# LITERAL 102
# LITERAL 111
# LITERAL 111
# <_sre.SRE_Match object; span=(0, 3), match='foo'>
# '''
# ''

But this might be more confusing than helpful, as readers of your code might misconstrue it as an abbreviation for the `DOTALL` flag. If you did make this assignment, it would be a good idea to document it thoroughly.

**`re.A`**  
**`re.ASCII`**  
**`re.U`**  
**`re.UNICODE`**  
**`re.L`**  
**`re.LOCALE`**  

    Specify the character encoding used for parsing of special regex character classes.

Several of the regex metacharacter sequences (`\w`, `\W`, `\b`, `\B`, `\d`, `\D`, `\s`, and `\S`) require you to assign characters to certain classes like word, digit, or whitespace. The flags in this group determine the encoding scheme used to assign characters to these classes. The possible encodings are ASCII, Unicode, or according to the current locale.

You had a brief introduction to character encoding and Unicode in the tutorial on Strings and Character Data in Python, under the discussion of the `ord()` built-in function. For more in-depth information, check out these resources:

    Unicode & Character Encodings in Python: A Painless Guide:
    https://realpython.com/python-encodings-guide
    
    Python’s Unicode Support:
    https://docs.python.org/3/howto/unicode.html#python-s-unicode-support

Why is character encoding so important in the context of regexes in Python? Here’s a quick example.

You learned earlier that `\d` specifies a single digit character. The description of the `\d` metacharacter sequence states that it’s equivalent to the character class `[0-9]`. That happens to be true for English and Western European languages, but for most of the world’s languages, the characters `'0'` through `'9'` don’t represent all or even any of the digits.

For example, here’s a string that consists of three Devanagari digit characters:

In [170]:
s = '\u0967\u096a\u096c'
print (s) # '१४६'

१४६


For the regex parser to properly account for the Devanagari script, the digit metacharacter sequence `\d` must match each of these characters as well.

The Unicode Consortium created Unicode to handle this problem. Unicode is a character-encoding standard designed to represent all the world’s writing systems. All strings in Python 3, including regexes, are Unicode by default.

So then, back to the flags listed above. These flags help to determine whether a character falls into a given class by specifying whether the encoding used is ASCII, Unicode, or the current locale:

- `re.U` and `re.UNICODE` specify Unicode encoding. Unicode is the default, so these flags are superfluous. They’re mainly supported for backward compatibility.
- `re.A` and `re.ASCII` force a determination based on ASCII encoding. If you happen to be operating in English, then this is happening anyway, so the flag won’t affect whether or not a match is found.
- `re.L` and `re.LOCALE` make the determination based on the current locale. Locale is an outdated concept and isn’t considered reliable. Except in rare circumstances, you’re not likely to need it.

Using the default Unicode encoding, the regex parser should be able to handle any language you throw at it. In the following example, it correctly recognizes each of the characters in the string `'१४६'` as a digit:

In [171]:
s = '\u0967\u096a\u096c'
print(s) # '१४६'
print (re.search(r'\d+', s)) # <_sre.SRE_Match object; span=(0, 3), match='१४६'>

१४६
<re.Match object; span=(0, 3), match='१४६'>


Here’s another example that illustrates how character encoding can affect a regex match in Python. Consider this string:

In [172]:
s = 'sch\u00f6n'
print (s) # 'schön'

schön


`'schön'` (the German word for pretty or nice) contains the `'ö'` character, which has the 16-bit hexadecimal Unicode value `00f6`. This character isn’t representable in traditional 7-bit ASCII.

If you’re working in German, then you should reasonably expect the regex parser to consider all of the characters in `'schön'` to be word characters. But take a look at what happens if you search `s` for word characters using the `\w` character class and force an ASCII encoding:

In [173]:
print(re.search(r'\w+', s, re.ASCII)) # <_sre.SRE_Match object; span=(0, 3), match='sch'>

<re.Match object; span=(0, 3), match='sch'>


When you restrict the encoding to ASCII, the regex parser recognizes only the first three characters as word characters. The match stops at `'ö'`.

On the other hand, if you specify `re.UNICODE` or allow the encoding to default to Unicode, then all the characters in `'schön'` qualify as word characters:

In [174]:
print (re.search(r'\w+', s, re.UNICODE)) # <_sre.SRE_Match object; span=(0, 5), match='schön'>
print (re.search(r'\w+', s)) # <_sre.SRE_Match object; span=(0, 5), match='schön'>

<re.Match object; span=(0, 5), match='schön'>
<re.Match object; span=(0, 5), match='schön'>


The ASCII and LOCALE flags are available in case you need them for special circumstances. But in general, the best strategy is to use the default Unicode encoding. This should handle any world language correctly.
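
One of those special circumstances is when you specifically want `\d` to mean only `'0'` through `'9'`. This sketch mixes ASCII digits with the Devanagari digits from the earlier example; the variable names are only illustrative:

```python
import re

mixed = '42 \u0967\u096a\u096c'   # '42' followed by three Devanagari digits

unicode_digits = re.findall(r'\d+', mixed)         # default Unicode rules
ascii_digits = re.findall(r'\d+', mixed, re.ASCII)   # only 0-9 count as digits

print(unicode_digits)  # ['42', '१४६']
print(ascii_digits)    # ['42']
```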

### RP - Combining \<flags\> Arguments in a Function Call

Flag values are defined so that you can combine them using the bitwise OR (`|`) operator. This allows you to specify several flags in a single function call:

In [175]:
print (re.search('^bar', 'FOO\nBAR\nBAZ', re.I|re.M)) # <_sre.SRE_Match object; span=(4, 7), match='BAR'>

<re.Match object; span=(4, 7), match='BAR'>


This `re.search()` call uses bitwise OR to specify both the `IGNORECASE` and `MULTILINE` flags at once.
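
The combined value can also be stored in a variable and passed to `re.compile()`. As a small sketch, the compiled pattern’s `flags` attribute can then be tested with bitwise AND to confirm which flags are in effect:

```python
import re

flags = re.IGNORECASE | re.MULTILINE
pattern = re.compile(r'^ba.', flags)

found = pattern.findall('FOO\nBAR\nBAZ')
print(found)  # ['BAR', 'BAZ']

# Check individual flags on the compiled pattern.
print(bool(pattern.flags & re.I))  # True
print(bool(pattern.flags & re.S))  # False
```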

### RP - Setting and Clearing Flags Within a Regular Expression

In addition to being able to pass a `<flags>` argument to most `re` module function calls, you can also modify flag values within a regex in Python. There are two regex metacharacter sequences that provide this capability.

**`(?<flags>)`**

    Sets flag value(s) for the duration of a regex.

Within a regex, the metacharacter sequence `(?<flags>)` sets the specified flags for the entire expression.

The value of `<flags>` is one or more letters from the set `a`, `i`, `L`, `m`, `s`, `u`, and `x`. Here’s how they correspond to the `re` module flags:


```
Letter 	Flags

a 	    re.A  re.ASCII
i 	    re.I  re.IGNORECASE
L 	    re.L  re.LOCALE
m 	    re.M  re.MULTILINE
s 	    re.S  re.DOTALL
u 	    re.U  re.UNICODE
x 	    re.X  re.VERBOSE
```


The `(?<flags>)` metacharacter sequence as a whole matches the empty string. It always matches successfully and doesn’t consume any of the search string.

The following examples are equivalent ways of setting the `IGNORECASE` and `MULTILINE` flags:

In [179]:
print (re.search('^bar', 'FOO\nBAR\nBAZ\n', re.I|re.M)) # <_sre.SRE_Match object; span=(4, 7), match='BAR'>

<re.Match object; span=(4, 7), match='BAR'>


In [180]:
print (re.search('(?im)^bar', 'FOO\nBAR\nBAZ\n')) # <_sre.SRE_Match object; span=(4, 7), match='BAR'>

<re.Match object; span=(4, 7), match='BAR'>


Note that a `(?<flags>)` metacharacter sequence sets the given flag(s) for the entire regex no matter where you place it in the expression:

In [181]:
print (re.search('foo.bar(?s).baz', 'foo\nbar\nbaz')) # <_sre.SRE_Match object; span=(0, 11), match='foo\nbar\nbaz'>

<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>


In [184]:
print (re.search('foo.bar.baz(?s)', 'foo\nbar\nbaz')) # <_sre.SRE_Match object; span=(0, 11), match='foo\nbar\nbaz'>

<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>


In the above examples, both dot metacharacters match newlines because the `DOTALL` flag is in effect. This is true even when `(?s)` appears in the middle or at the end of the expression.

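Specifying `(?<flags>)` anywhere in a regex other than at the beginning has been deprecated since Python 3.6, and Python 3.11 and later reject it with an error. The cells in this part of the notebook were run under Python 3.9, where the match still succeeds: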

In [185]:
import sys
sys.version # '3.8.0 (default, Oct 14 2019, 21:29:03) \n[GCC 7.4.0]'

'3.9.13 (main, Aug 25 2022, 18:24:45) \n[Clang 12.0.0 ]'

In [187]:
print (re.search('foo.bar.baz(?s)', 'foo\nbar\nbaz')) # 

<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>


On Python versions before 3.11, with warnings enabled, output like the following might be shown:
```
<stdin>:1: DeprecationWarning: Flags not at the start
    of the expression 'foo.bar.baz(?s)'
<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>
```

It still produces the appropriate match, but you’ll get a warning message.
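
Because the behavior depends on the Python version, a defensive check has to handle both a warning and an error. The following sketch assumes Python 3.6 or later; `compiles_cleanly` is a hypothetical helper name, and it treats either outcome as a rejection:

```python
import re
import warnings

def compiles_cleanly(pattern):
    """Return True if the pattern compiles without an error or a warning."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter('error')  # turn warnings into exceptions
            re.compile(pattern)
        return True
    except (re.error, Warning):
        return False

leading = compiles_cleanly(r'(?s)foo.bar.baz')
embedded = compiles_cleanly(r'foo.bar(?s).baz')

print(leading)   # True
print(embedded)  # False on Python 3.6 and later
```

The portable choice is to put global inline flags at the very start of the regex, or to pass them through the `<flags>` argument instead.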

**`(?<set_flags>-<remove_flags>:<regex>)`**

    Sets or removes flag value(s) for the duration of a group.

`(?<set_flags>-<remove_flags>:<regex>)` defines a non-capturing group that matches against `<regex>`. For the `<regex>` contained in the group, the regex parser sets any flags specified in `<set_flags>` and clears any flags specified in `<remove_flags>`.

Values for `<set_flags>` and `<remove_flags>` are most commonly `i`, `m`, `s` or `x`.

In the following example, the `IGNORECASE` flag is set for the specified group:

In [188]:
print (re.search('(?i:foo)bar', 'FOObar')) # <re.Match object; span=(0, 6), match='FOObar'>

<re.Match object; span=(0, 6), match='FOObar'>


This produces a match because `(?i:foo)` dictates that the match against `'FOO'` is case insensitive.

Now contrast that with this example:

In [191]:
print(re.search('(?i:foo)bar', 'FOOBAR')) # None

None


As in the previous example, the match against `'FOO'` would succeed because it’s case insensitive. But once outside the group, `IGNORECASE` is no longer in effect, so the match against `'BAR'` is case sensitive and fails.

Here’s an example that demonstrates turning a flag off for a group:

In [190]:
print(re.search('(?-i:foo)bar', 'FOOBAR', re.IGNORECASE)) # None

None


Again, there’s no match. Although `re.IGNORECASE` enables case-insensitive matching for the entire call, the metacharacter sequence `(?-i:foo)` turns off `IGNORECASE` for the duration of that group, so the match against `'FOO'` fails.
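Setting and removing flags can also be combined in one group. The following sketch uses a hypothetical pattern that turns `IGNORECASE` on and `DOTALL` off inside the group, even though `re.DOTALL` is passed to the call:

```python
import re

# Inside the group: IGNORECASE is set, DOTALL is cleared.
pattern = r'(?i-s:foo.)bar'

# 'FOO' matches case-insensitively and '.' matches the dash.
with_dash = re.search(pattern, 'FOO-bar', re.DOTALL)

# '.' inside the group cannot match '\n' because DOTALL was removed there.
with_newline = re.search(pattern, 'FOO\nbar', re.DOTALL)

print(with_dash)
print(with_newline)
```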

As of Python 3.7, you can specify `u`, `a`, or `L` as `<set_flags>` to override the default encoding for the specified group:

In [196]:
s = 'sch\u00f6n'
s # 'schön'

'schön'

In [197]:
# Requires Python 3.7 or later
print(re.search(r'(?a:\w+)', s)) # <re.Match object; span=(0, 3), match='sch'>

<re.Match object; span=(0, 3), match='sch'>


In [198]:
print(re.search(r'(?u:\w+)', s)) # <re.Match object; span=(0, 5), match='schön'>

<re.Match object; span=(0, 5), match='schön'>


You can only set encoding this way, though. You can’t remove it:

In [199]:
print(re.search(r'(?-a:\w+)', s))

error: bad inline flags: cannot turn off flags 'a', 'u' and 'L' at position 4

Outside a notebook, the full traceback looks something like this (paths and line numbers vary by Python version):
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/re.py", line 199, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib/python3.8/re.py", line 302, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 805, in _parse
    flags = _parse_flags(source, state, char)
  File "/usr/lib/python3.8/sre_parse.py", line 904, in _parse_flags
    raise source.error(msg)
re.error: bad inline flags: cannot turn off flags 'a', 'u' and 'L' at
position 4
```

`u`, `a`, and `L` are mutually exclusive. Only one of them may appear per group.
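One way to check this rule is to try compiling a few patterns. The helper `compiles` below is a hypothetical convenience function written for this sketch, not part of the `re` module:

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'(?a:\w+)'))   # a single encoding flag in the group
print(compiles(r'(?au:\w+)'))  # two encoding flags in the same group

# Different groups in the same pattern may use different encoding flags.
mixed = re.search(r'(?a:\w+)(?u:\w+)', 'sch\u00f6n')
print(mixed)
```

In the last search, the ASCII-only group stops before `ö` and the Unicode group picks up the rest of the word.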

### RP - Conclusion

This concludes your introduction to regular expression matching and Python’s `re` module. Congratulations! You’ve mastered a tremendous amount of material.

You now know how to:

- Use `re.search()` to perform regex matching in Python
- Create complex pattern matching searches with regex metacharacters
- Tweak regex parsing behavior with flags

But you’ve still seen only one function in the module: `re.search()`! The `re` module has many more useful functions and objects to add to your pattern-matching toolkit. The next tutorial in the series will introduce you to what else Python’s `re` module has to offer.