# Expressions: Regexes in Python (Part 1)

Imagine you have a string object `s`. Now suppose you need to write Python code to find out whether `s` contains the substring `'123'`. There are at least a couple of ways to do this. You could use the `in` operator:

In [1]:
s = 'foo123bar'
'123' in s

True

If you want to know not only whether '123' exists in s but also where it exists, then you can use .find() or .index(). Each of these returns the character position within s where the substring resides:



In [2]:
s.find('123')

3

In [3]:
s.index('123')

3
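The two methods differ when the substring is absent. The sketch below uses an illustrative search target, `'999'`, which is an assumption chosen for demonstration. It shows that `.find()` returns `-1`, while `.index()` raises `ValueError`:

```python
s = 'foo123bar'

# .find() signals "not found" with the sentinel value -1
missing_pos = s.find('999')

# .index() raises ValueError instead of returning a sentinel
try:
    s.index('999')
    index_raised = False
except ValueError:
    index_raised = True

print(missing_pos, index_raised)
```

Which one to prefer depends on whether a missing substring is an expected outcome or an error in your program.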

In these examples, the matching is done by a straightforward character-by-character comparison. That will get the job done in many cases. But sometimes, the problem is more complicated than that.

For example, rather than searching for a fixed substring like '123', suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.

Strict character comparisons won’t cut it here. This is where regexes in Python come to the rescue.

## The `re` Module

Regex functionality in Python resides in a module named re. The re module contains many useful functions and methods, most of which you’ll learn about in the next tutorial in this series.

For now, you’ll focus predominantly on one function, `re.search()`.



### `re.search(<regex>, <string>)`

Scans a string for a regex match.

`re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches. If a match is found, then `re.search()` returns a match object. Otherwise, it returns `None`.

`re.search()` takes an optional third `<flags>` argument that you’ll learn about at the end of this tutorial.

In [4]:
# Import the regex module
import re


## First Pattern-Matching Example
Now that you know how to gain access to re.search(), you can give it a try:

In [5]:
s = 'foo123bar'

re.search('123',s)

<re.Match object; span=(3, 6), match='123'>

A match object is **truthy**, so you can use it in a Boolean context like a conditional statement:

In [6]:
if re.search('123',s):
    print('Found a match')
else:
    print('No match.')

Found a match


The interpreter displays the match object as `<re.Match object; span=(3, 6), match='123'>`. This contains some useful information.

span=(3, 6) indicates the portion of <string> in which the match was found. This means the same thing as it would in slice notation:



In [7]:
s[3:6]

'123'

In this example, the match starts at character position 3 and extends up to but not including position 6.
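The same information can be read from the match object in code. The following is a minimal sketch using the match object's `.span()` and `.group()` methods, which the next tutorial in the series covers in more detail. The variable names are illustrative:

```python
import re

s = 'foo123bar'
m = re.search('123', s)

# A failed search returns None, so check before using the match object
if m is not None:
    start, end = m.span()       # Start and end positions of the match
    matched_text = m.group()    # The text that matched
    assert s[start:end] == matched_text
    print(start, end, matched_text)
```

Slicing `s` with the span reproduces the matched text, which is the relationship the slice notation example above illustrates.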

## Python Regex Metacharacters

The real power of regex matching in Python emerges when `<regex>` contains special characters called **metacharacters**. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.

In a regex, a set of characters specified in square brackets (`[]`) makes up a **character class**. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

In [8]:
s = 'foo123bar'

re.search('[0-9][0-9][0-9]',s)

<re.Match object; span=(3, 6), match='123'>

[0-9] matches any single decimal digit character—any character between '0' and '9', inclusive. The full expression [0-9][0-9][0-9] matches any sequence of three decimal digit characters. In this case, s matches because it contains three consecutive decimal digit characters, '123'.

These strings also match:

In [9]:
re.search('[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [10]:
re.search('[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [11]:
re.search('[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

On the other hand, a string that doesn’t contain three consecutive digits won’t match:

In [12]:
print(re.search('[0-9][0-9][0-9]', '12foo34'))

None


With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the in operator or with string methods.
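To tie the examples together, here is a small sketch that wraps the three-digit check in a function. The name `has_three_digits` is hypothetical and chosen only for this illustration:

```python
import re

def has_three_digits(text):
    """Return True if text contains three consecutive decimal digits."""
    return re.search('[0-9][0-9][0-9]', text) is not None

samples = ['foo123bar', 'foo456bar', '234baz', 'qux678', '12foo34']
results = {text: has_three_digits(text) for text in samples}

for text, found in results.items():
    print(f'{text:10} {found}')
```

Comparing the result against `None` turns the match object into a plain Boolean, which is convenient when you only care whether a match exists.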

Take a look at another regex metacharacter. The dot (`.`) metacharacter matches any character except a newline, so it functions like a wildcard:

In [13]:
s = 'foo123bar'
re.search('1.3', s)

<re.Match object; span=(3, 6), match='123'>

In [14]:
s = 'foo13bar'
print(re.search('1.3', s))

None


In the first example, the regex 1.3 matches '123' because the '1' and '3' match literally, and the . matches the '2'. Here, you’re essentially asking, “Does s contain a '1', then any character (except a newline), then a '3'?” The answer is yes for 'foo123bar' but no for 'foo13bar'.

## Metacharacters Supported by the re Module

In [15]:
s = 'foo123bar'
re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In this case, 123 is technically a regex, but it’s not a very interesting one because it doesn’t contain any metacharacters. It just matches the string '123'.

Things get much more exciting when you throw metacharacters into the mix. The following sections explain in detail how you can use each metacharacter or metacharacter sequence to enhance pattern-matching functionality.



## Metacharacters That Match a Single Character
The metacharacter sequences in this section try to match a single character from the search string. When the regex parser encounters one of these metacharacter sequences, a match happens if the character at the current parsing position fits the description that the sequence describes.

### `[]`

Specifies a specific set of characters to match.

Characters contained in square brackets (`[]`) represent a character class, an enumerated set of characters to match from. A character class metacharacter sequence will match any single character contained in the class.

You can enumerate the characters individually like this:

In [16]:
re.search('ba[artz]', 'foobarqux')

<re.Match object; span=(3, 6), match='bar'>

In [17]:
re.search('ba[artz]', 'foobazqux')

<re.Match object; span=(3, 6), match='baz'>

The metacharacter sequence [artz] matches any single 'a', 'r', 't', or 'z' character. In the example, the regex ba[artz] matches both 'bar' and 'baz' (and would also match 'baa' and 'bat').

A character class can also contain a range of characters separated by a hyphen (-), in which case it matches any single character within the range. For example, `[a-z]` matches any lowercase alphabetic character between 'a' and 'z', inclusive:

In [18]:
re.search('[a-z]', 'FOObar')

<re.Match object; span=(3, 4), match='b'>

[0-9] matches any digit character:

In [19]:
re.search('[0-9][0-9]','foo123bar')

<re.Match object; span=(3, 5), match='12'>

In this case, `[0-9][0-9]` matches a sequence of two digits. The first portion of the string `'foo123bar'` that matches is `'12'`.

`[0-9a-fA-F]` matches any hexadecimal digit character:

In [20]:
re.search('[0-9a-fA-F]','--- a0 ---')

<re.Match object; span=(4, 5), match='a'>

Here, [0-9a-fA-F] matches the first hexadecimal digit character in the search string, 'a'.

You can complement a character class by specifying ^ as the first character, in which case it matches any character that isn’t in the set. In the following example, [^0-9] matches any character that isn’t a digit:

In [21]:
re.search('[^0-9]', '12345foo')

<re.Match object; span=(5, 6), match='f'>

Here, the match object indicates that the first character in the string that isn’t a digit is 'f'.

If a `^` character appears in a character class but isn’t the first character, then it has no special meaning and matches a literal `'^'` character:

In [22]:
re.search('[#:^]','foo^bar:baz#qux')

<re.Match object; span=(3, 4), match='^'>

As you’ve seen, you can specify a range of characters in a character class by separating characters with a hyphen. What if you want the character class to include a literal hyphen character? You can place it as the first or last character or escape it with a backslash (\):

In [23]:
re.search('[-abc]','123-456')

<re.Match object; span=(3, 4), match='-'>

In [24]:
re.search('[abc-]','123-456')

<re.Match object; span=(3, 4), match='-'>

In [25]:
re.search('[ab\-c]','123-456')

<re.Match object; span=(3, 4), match='-'>

If you want to include a literal ']' in a character class, then you can place it as the first character or escape it with backslash:

In [26]:
re.search('[]]', 'foo[1]')

<re.Match object; span=(5, 6), match=']'>

In [27]:
re.search('[ab\]cd]','foo[1]')

<re.Match object; span=(5, 6), match=']'>

Other regex metacharacters lose their special meaning inside a character class:

In [28]:
re.search('[)*+|]', '123*456')

<re.Match object; span=(3, 4), match='*'>

In [29]:
re.search('[)*+|]', '123+456')

<re.Match object; span=(3, 4), match='+'>

The `*` and `+` characters have special meanings in a regex in Python. They designate repetition, which you’ll learn more about shortly. But in this example, they’re inside a character class, so they match themselves literally.
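The dot behaves the same way inside a character class. As a brief sketch, assuming the sample string `'foo.bar'`, the following compares a bare `.` with `[.]`:

```python
import re

# Outside a class, . is a wildcard and matches the first character
wildcard = re.search('.', 'foo.bar')

# Inside a class, . is literal and matches only the actual dot
literal = re.search('[.]', 'foo.bar')

print(wildcard.span(), literal.span())
```

Wrapping a single metacharacter in square brackets is therefore one way to match it literally.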

### dot (`.`)

Specifies a wildcard.

The `.` metacharacter matches any single character except a newline:

In [30]:
re.search('foo.bar', 'fooxbar')

<re.Match object; span=(0, 7), match='fooxbar'>

In [31]:
re.search('foo.bar', 'foobar')

In [32]:
re.search('foo.bar', 'foo\nbar')

As a regex, `foo.bar` essentially means the characters `'foo'`, then any character except newline, then the characters 'bar'. The first string shown above, `'fooxbar'`, fits the bill because the . metacharacter matches the 'x'.

The second and third strings fail to match. In the last case, although there’s a character between `'foo'` and `'bar'`, it’s a newline, and by default, the `.` metacharacter doesn’t match a newline. There is, however, a way to force `.` to match a newline, which you’ll learn about at the end of this tutorial.
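As a brief preview, and only as a sketch ahead of the fuller treatment of flags, the `re.DOTALL` flag from the `re` module is what changes this behavior. Passing it as the third argument to `re.search()` lets `.` match a newline:

```python
import re

# Default behavior: . does not match the newline, so there is no match
default = re.search('foo.bar', 'foo\nbar')

# With re.DOTALL, . matches any character, including a newline
dotall = re.search('foo.bar', 'foo\nbar', re.DOTALL)

print(default, repr(dotall.group()))
```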

### \w
### \W

Match based on whether a character is a word character.


`\w` matches any alphanumeric word character. Word characters are uppercase and lowercase letters, digits, and the underscore `(_)` character, so `\w` is essentially shorthand for `[a-zA-Z0-9_]`:

In [33]:
re.search('\w', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [34]:
re.search('[a-zA-Z0-9_]', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In this case, the first word character in the string `'#(.a$@&'` is `'a'`.

`\W` is the opposite. It matches any non-word character and is equivalent to `[^a-zA-Z0-9_]`:



In [35]:
re.search('\W', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

In [36]:
re.search('[^a-zA-Z0-9_]', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

Here, the first non-word character in `'a_1*3Qb'` is `'*'`.
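The equivalence with `[a-zA-Z0-9_]` holds for ASCII text. With ordinary `str` patterns in Python 3, `\w` also matches letters and digits from other scripts unless you pass the `re.ASCII` flag. The sketch below uses an illustrative sample string and a raw string literal, which is explained later in this tutorial:

```python
import re

text = '#é#'

# By default, \w treats the accented letter as a word character
unicode_match = re.search(r'\w', text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so nothing matches
ascii_match = re.search(r'\w', text, re.ASCII)

print(unicode_match.group(), ascii_match)
```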

### \d
### \D

Match based on whether a character is a decimal digit.

`\d` matches any decimal digit character. `\D` is the opposite. It matches any character that isn’t a decimal digit:

In [37]:
re.search('\d','abc4def')

<re.Match object; span=(3, 4), match='4'>

In [38]:
re.search('\D','234Q678')

<re.Match object; span=(3, 4), match='Q'>

`\d` is essentially equivalent to `[0-9]`, and `\D` is equivalent to `[^0-9]`.

In [39]:
re.search('[^0-9]','234Q678')

<re.Match object; span=(3, 4), match='Q'>

### \s
### \S

Match based on whether a character represents whitespace.

`\s` matches any `whitespace` character:

In [40]:
re.search('\s', 'foo\nbar baz')

<re.Match object; span=(3, 4), match='\n'>

Note that, unlike the dot wildcard metacharacter, `\s` does match a newline character.

`\S` is the `opposite` of `\s`. It matches any character that `isn’t whitespace`:

In [41]:
re.search('\S', '  \n foo  \n  ')

<re.Match object; span=(4, 5), match='f'>

Again, `\s` and `\S` consider a newline to be whitespace. In the example above, the first non-whitespace character is `'f'`.

The character class sequences \w, \W, \d, \D, \s, and \S can appear inside a square bracket character class as well:

In [42]:
re.search('[\d\w\s]','---3---')

<re.Match object; span=(3, 4), match='3'>

In [43]:
re.search('[\d\w\s]', '---a---')

<re.Match object; span=(3, 4), match='a'>

In [44]:
re.search('[\d\w\s]', '--- ---')

<re.Match object; span=(3, 4), match=' '>

In this case, `[\d\w\s]` matches any digit, word, or whitespace character. And since \w includes \d, the same character class could also be expressed slightly shorter as `[\w\s]`.

## Escaping Metacharacters
Occasionally, you’ll want to include a metacharacter in your regex, except you won’t want it to carry its special meaning. Instead, you’ll want it to represent itself as a literal character.

### backslash (`\`)

Removes the special meaning of a metacharacter.

As you’ve just seen, the backslash character can introduce special character classes like word, digit, and whitespace. There are also special metacharacter sequences called **anchors** that begin with a backslash, which you’ll learn about below.

When it’s not serving either of these purposes, the backslash **escapes** metacharacters. A metacharacter preceded by a backslash loses its special meaning and matches the literal character instead. Consider the following examples:

In [45]:
re.search('.', 'foo.bar')
# The unescaped . is a wildcard, so it matches the first character, 'f'

<re.Match object; span=(0, 1), match='f'>

In [46]:
re.search('\.', 'foo.bar')
# The escaped \. matches only the literal '.'

<re.Match object; span=(3, 4), match='.'>

In the first `<regex>`, the dot (`.`) functions as a wildcard metacharacter, which matches the first character in the string (`'f'`). The `.` character in the second `<regex>` is escaped by a backslash, so it isn’t a wildcard. It’s interpreted literally and matches the `'.'` at index 3 of the search string.

Using backslashes for escaping can get messy. Suppose you have a string that contains a single backslash:

In [47]:
s = r'foo\bar'

In [48]:
print(s)

foo\bar


Now suppose you want to create a `<regex>` that will match the backslash between `'foo'` and `'bar'`. The backslash is itself a special character in a regex, so to specify a literal backslash, you need to escape it with another backslash. If that’s the case, then the following should work:

In [49]:
# re.search('\\', s)   # Left commented out: running it raises an error

It doesn’t, and the call raises an error instead. The problem here is that the backslash escaping happens twice, first by the Python interpreter on the string literal and then again by the regex parser on the regex it receives.

Here’s the sequence of events:

1. The Python interpreter is the first to process the string literal '\\'. It interprets that as an escaped backslash and passes only a single backslash to re.search().
2. The regex parser receives just a single backslash, which isn’t a meaningful regex, so the messy error ensues.

There are two ways around this. First, you can escape both backslashes in the original string literal:

In [50]:
re.search('\\\\', s)

<re.Match object; span=(3, 4), match='\\'>

Doing so causes the following to happen:

1. The interpreter sees `'\\\\'` as a pair of escaped backslashes. It reduces each pair to a single backslash and passes `'\\'` to the regex parser.
2. The regex parser then sees `\\` as one escaped backslash. As a `<regex>`, that matches a single backslash character. You can see from the match object that it matched the backslash at index 3 in s as intended. It’s cumbersome, but it works.

The second, and probably cleaner, way to handle this is to specify the `<regex>` using a raw string, written with an `r` prefix:

In [51]:
re.search(r'\\', s)

<re.Match object; span=(3, 4), match='\\'>

This suppresses the escaping at the interpreter level. The string '\\' gets passed unchanged to the regex parser, which again sees one escaped backslash as desired.

It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.
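To see that the two spellings really hand the same regex to the parser, you can compare the string objects directly. This is a small sketch; the variable names are illustrative:

```python
import re

s = r'foo\bar'

escaped = '\\\\'   # Regular literal: four typed backslashes
raw = r'\\'        # Raw literal: two typed backslashes

# Both literals produce the same two-character string
assert escaped == raw and len(raw) == 2

# So both find the single backslash at the same position in s
same_span = re.search(escaped, s).span() == re.search(raw, s).span()
print(len(raw), same_span)
```

The raw string simply saves you from doubling every backslash by hand.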

## Anchors
Anchors are zero-width matches. They don’t match any actual characters in the search string, and they don’t consume any of the search string during parsing. Instead, an anchor dictates a particular location in the search string where a match must occur.

### `^`
### `\A`

Anchor a match to the start of `<string>`.

When the regex parser encounters ^ or \A, the parser’s current position must be at the beginning of the search string for it to find a match.

In other words, regex ^foo stipulates that 'foo' must be present not just any old place in the search string, but at the beginning:

In [52]:
re.search('^foo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [53]:
print(re.search('^foo', 'barfoo'))

None


`\A` functions similarly:

In [54]:
re.search(r'\Afoo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [55]:
print(re.search(r'\Afoo', 'barfoo'))

None


`^` and `\A` behave slightly differently from each other in MULTILINE mode. You’ll learn more about MULTILINE mode below in the section on flags.
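As a short preview of that difference, here is a sketch that passes the `re.MULTILINE` flag as the third argument to `re.search()`. The two-line sample string is an assumption chosen for illustration:

```python
import re

text = 'foo\nbar'

# Without the flag, ^ anchors only at the start of the whole string
caret_default = re.search(r'^bar', text)

# With re.MULTILINE, ^ also anchors at the start of each line
caret_multiline = re.search(r'^bar', text, re.MULTILINE)

# \A still anchors only at the start of the whole string
a_multiline = re.search(r'\Abar', text, re.MULTILINE)

print(caret_default, caret_multiline, a_multiline, sep='\n')
```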

### `$`
### `\Z`

Anchor a match to the end of `<string>`.

When the regex parser encounters `$` or `\Z`, the parser’s current position must be at the end of the search string for it to find a match. Whatever precedes `$` or `\Z` must constitute the end of the search string:

In [56]:
re.search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [57]:
print(re.search('bar$', 'barfoo'))

None


In [58]:
re.search(r'bar\Z', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

As a special case, `$` (but not `\Z`) also matches just before a single newline at the end of the search string:

In [59]:
re.search('bar$', 'foobar\n')

<re.Match object; span=(3, 6), match='bar'>

In this example, `'bar'` isn’t technically at the end of the search string because it’s followed by one additional newline character. But the regex parser lets it slide and calls it a match anyway. This exception doesn’t apply to `\Z`.

`$` and `\Z` behave slightly differently from each other in MULTILINE mode. See the section below on flags for more information on MULTILINE mode.
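The following sketch puts both differences side by side: the trailing-newline exception described above and the `re.MULTILINE` behavior. The sample strings are illustrative assumptions:

```python
import re

text = 'foobar\n'
dollar = re.search(r'bar$', text)    # $ tolerates one trailing newline
big_z = re.search(r'bar\Z', text)    # \Z does not

lines = 'foo\nbar'
dollar_ml = re.search(r'foo$', lines, re.MULTILINE)   # End of the first line
big_z_ml = re.search(r'foo\Z', lines, re.MULTILINE)   # End of the string only

print(dollar, big_z, dollar_ml, big_z_ml, sep='\n')
```

If you need a match to end at the true end of the string, `\Z` is the stricter choice.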

### `\b`

Anchors a match to a word boundary.

`\b` asserts that the regex parser’s current position must be at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores `([a-zA-Z0-9_])`, the same as for the `\w` character class:

In [60]:
re.search(r'\bbar', 'foo bar')

<re.Match object; span=(4, 7), match='bar'>

In [61]:
re.search(r'\bbar', 'foo.bar')

<re.Match object; span=(4, 7), match='bar'>

In [62]:
print(re.search(r'\bbar', 'foobar'))

None


In [63]:
re.search(r'foo\b', 'foo.bar')

<re.Match object; span=(0, 3), match='foo'>

In [64]:
print(re.search(r'foo\b', 'foobar'))

None


In the above examples, `\bbar` matches in `'foo bar'` and `'foo.bar'` because there’s a word boundary at the start of `'bar'`. This isn’t the case in `'foobar'`, so the match fails there.

Similarly, `foo\b` matches in `'foo.bar'` because a word boundary exists at the end of `'foo'`, but not in `'foobar'`.

Using the `\b` anchor on both ends of the `<regex>` will cause it to match when it’s present in the search string as a whole word:

In [65]:
re.search(r'\bbar\b', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [66]:
re.search(r'\bbar\b', 'foo(bar)baz')

<re.Match object; span=(4, 7), match='bar'>

In [67]:
print(re.search(r'\bbar\b', 'foobarbaz'))

None
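The whole-word pattern generalizes into a small helper. The function name `contains_word` is hypothetical. The sketch uses `re.escape()` from the `re` module so that any metacharacters in the word are treated literally, and it assumes the word itself starts and ends with word characters:

```python
import re

def contains_word(word, text):
    """Return True if word appears in text as a whole word."""
    # re.escape() guards against metacharacters inside word
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

for text in ['foo bar baz', 'foo(bar)baz', 'foobarbaz']:
    print(f'{text:12} {contains_word("bar", text)}')
```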


## Quantifiers

A **quantifier** metacharacter immediately follows a portion of a `<regex>` and indicates how many times that portion must occur for the match to succeed.

### `*`

Matches zero or more repetitions of the preceding regex.

For example, `a*` matches zero or more `'a'` characters. That means it would match an empty string, `'a'`, `'aa'`, `'aaa'`, and so on.

Consider these examples:

In [68]:
re.search('foo-*bar', 'foobar')     # Zero dashes
# * allows zero occurrences of '-'

<re.Match object; span=(0, 6), match='foobar'>

In [69]:
re.search('foo-*bar', 'foo-bar')       # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [70]:
re.search('foo-*bar', 'foo--bar')        # Two dashes

<re.Match object; span=(0, 8), match='foo--bar'>

In the first example, there are zero `'-'` characters between `'foo'` and `'bar'`. In the second there’s one, and in the third there are two. The metacharacter sequence `-*` matches in all three cases.

In [71]:
re.search('foo-*bar', 'foo------bar')   # * (zero or more occurrences)

<re.Match object; span=(0, 12), match='foo------bar'>

You’ll probably encounter the regex `.*` in a Python program at some point. This matches zero or more occurrences of any character. In other words, it essentially matches any character sequence up to a line break. (Remember that the `.` wildcard metacharacter doesn’t match a newline.)

In this example, `.*` matches everything between `'foo'` and `'bar'`:

In [72]:
re.search('foo.*bar', '# foo $qux@grault % bar #')
# .* matches every character between 'foo' and 'bar'


<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>

Did you notice the `span=` and `match=` information contained in the match object?

Until now, the regexes in the examples you’ve seen have specified matches of predictable length. Once you start using quantifiers like `*`, the number of characters matched can be quite variable, and the information in the match object becomes more useful.

You’ll learn more about how to access the information stored in a match object in the next tutorial in the series.
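The point about `.*` stopping at a line break can be checked with a two-line string. This is a minimal sketch with an illustrative sample text:

```python
import re

text = 'foo one\nbar two'

# .* runs to the end of the first line and stops at the newline
first_line = re.search('foo.*', text)

# Because . cannot cross the newline, 'bar' on the next line is out of reach
across_lines = re.search('foo.*bar', text)

print(repr(first_line.group()), across_lines)
```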

### `+`

Matches one or more repetitions of the preceding regex.

This is similar to `*`, but the quantified regex must occur at least once:

In [73]:
print(re.search('foo-+bar', 'foobar'))              # Zero dashes
# + requires at least one occurrence of '-'

None


In [74]:
re.search('foo-+bar', 'foo-bar')                    # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [75]:
re.search('foo-+bar', 'foo--bar')                    # Two dashes

<re.Match object; span=(0, 8), match='foo--bar'>

Remember from above that `foo-*bar` matched the string `'foobar'` because the `*` metacharacter allows for zero occurrences of `'-'`. The `+` metacharacter, on the other hand, requires at least one occurrence of `'-'`. That means there isn’t a match in the first example in this case.

### `?`

Matches zero or one repetitions of the preceding regex.

`?` is also similar to `*` and `+`, but in this case there’s only a match if the preceding regex occurs once or not at all:

In [76]:
re.search('foo-?bar', 'foobar')             # Zero dashes
# ? makes the preceding '-' optional

<re.Match object; span=(0, 6), match='foobar'>

In [77]:
re.search('foo-?bar', 'foo-bar')            # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [78]:
print(re.search('foo-?bar', 'foo--bar'))     # Two dashes
# The match fails because two '-' characters are present

None


In this example, there are matches in the first two cases. But in the third, where there are two `'-'` characters, the match fails.

In [79]:
print(re.search('foo--?bar', 'foo--bar')) 

<re.Match object; span=(0, 8), match='foo--bar'>


In [80]:
print(re.search('foo---?bar', 'foo--bar')) 

<re.Match object; span=(0, 8), match='foo--bar'>


In [81]:
print(re.search('foo--?bar', 'foo---bar')) 
# Three dashes exceed the one or two that --? allows

None


Here are some more examples showing the use of all three quantifier metacharacters:

In [82]:
re.match('foo[1-9]*bar', 'foobar')

# The pattern is 'foo', then [1-9]*, then 'bar'
# [1-9]* matches zero or more digits from 1 to 9
# [1-9]+ matches one or more such digits
# [1-9]? matches zero or one such digit

<re.Match object; span=(0, 6), match='foobar'>

In [83]:
re.match('foo[1-9]*bar', 'foo42bar')

# [1-9]* matches zero or more digits; here it matches '42'

<re.Match object; span=(0, 8), match='foo42bar'>

In [84]:
print(re.match('foo[1-9]+bar', 'foobar'))

# [1-9]+ requires at least one digit
# Prints None because there is no digit between 'foo' and 'bar'

None


In [85]:
re.match('foo[1-9]+bar', 'foo42bar')
# [1-9]+ requires at least one digit
# Here the string contains two digits, so it matches

<re.Match object; span=(0, 8), match='foo42bar'>

In [86]:
re.match('foo[1-9]?bar', 'foobar')
# [1-9]? allows zero or one digit; here there are zero


<re.Match object; span=(0, 6), match='foobar'>

In [87]:
print(re.match('foo[1-9]?bar', 'foo42bar'))
# [1-9]? allows zero or one digit
# Prints None because two digits are present

None


This time, the quantified regex is the character class `[1-9]` instead of the simple character `'-'`.
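The six results above can be summarized in one loop. The sketch below adds a one-digit sample, `'foo4bar'`, which is an assumption not in the examples above, and uses `re.search()` rather than `re.match()`. For these strings the two behave the same because every match starts at position 0:

```python
import re

patterns = ['foo[1-9]*bar', 'foo[1-9]+bar', 'foo[1-9]?bar']
samples = ['foobar', 'foo4bar', 'foo42bar']

# Record whether each pattern matches each sample string
outcomes = {
    (pattern, text): re.search(pattern, text) is not None
    for pattern in patterns
    for text in samples
}

for pattern in patterns:
    row = '  '.join(f'{text}={outcomes[(pattern, text)]}' for text in samples)
    print(f'{pattern:14} {row}')
```

Only the one-digit sample satisfies all three quantifiers, which makes the difference between them easy to see.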

### *?
### +?
### ??

The non-greedy (or lazy) versions of the `*`, `+`, and `?` quantifiers.

When used alone, the quantifier metacharacters `*`, `+`, and `?` are all **greedy**, meaning they produce the longest possible match. Consider this example:

In [88]:
re.search('<.*>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>

The regex `<.*>` effectively means:

- A `'<'` character
- Then any sequence of characters
- Then a `'>'` character

But which `'>'` character? There are three possibilities:

1. The one just after 'foo'
2. The one just after 'bar'
3. The one just after 'baz'

Since the `*` metacharacter is greedy, it dictates the longest possible match, which includes everything up to and including the `'>'` character that follows `'baz'`. You can see from the match object that this is the match produced.

If you want the shortest possible match instead, then use the non-greedy metacharacter sequence `*?`:

In [89]:
re.search('<.*?>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>

In this case, the match ends with the `'>'` character following `'foo'`.

**Note:** You could accomplish the same thing with the regex `<[^>]*>`, which means:

- A `'<'` character
- Then any sequence of characters other than `'>'`
- Then a `'>'` character

This is the only option available with some older parsers that don’t support lazy quantifiers. Happily, that’s not the case with the regex parser in Python’s re module.
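The two approaches agree on the example string, but they are not interchangeable in every case. As a sketch, using an illustrative string with a newline inside the angle brackets, the negated character class matches across the line break while the lazy dot does not:

```python
import re

text = '%<foo> <bar> <baz>%'
lazy = re.search('<.*?>', text)
negated = re.search('<[^>]*>', text)
same_on_one_line = lazy.group() == negated.group() == '<foo>'

# The two forms differ when a newline sits inside the brackets:
# . does not match a newline by default, but [^>] does
multiline_tag = '<foo\nbar>'
lazy_multiline = re.search('<.*?>', multiline_tag)
negated_multiline = re.search('<[^>]*>', multiline_tag)

print(same_on_one_line, lazy_multiline, repr(negated_multiline.group()))
```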

There are lazy versions of the + and ? quantifiers as well:

In [90]:
re.search('<.+>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>

In [91]:
re.search('<.+?>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>

In [92]:
re.search('ba?', 'baaaa')
# Greedy ? matches one 'a' when it can

<re.Match object; span=(0, 2), match='ba'>

In [93]:
re.search('ba??', 'baaaa')

<re.Match object; span=(0, 1), match='b'>

The first two examples are similar to the examples shown above, only using `+` and `+?` instead of `*` and `*?`.

The last two examples above are a little different. In general, the `?` metacharacter matches zero or one occurrences of the preceding regex. The greedy version, `?`, matches one occurrence, so `ba?` matches `'b'` followed by a single `'a'`. The non-greedy version, `??`, matches zero occurrences, so `ba??` matches just `'b'`.

In [94]:
re.search('baa??', 'baaaa')
# Lazy ?? prefers zero occurrences of the second 'a', so the match is 'ba'

<re.Match object; span=(0, 2), match='ba'>

### `{m}`

Matches exactly `m` repetitions of the preceding regex.

This is similar to `*` or `+`, but it specifies exactly how many times the preceding regex must occur for a match to succeed:

In [95]:
print(re.search('x-{3}x', 'x--x'))                       # Two dashes
# -{3} requires exactly three dashes, so two dashes give None

None


In [96]:
re.search('x-{3}x', 'x---x') 
# -{3} requires exactly three dashes, so this matches

<re.Match object; span=(0, 5), match='x---x'>

In [97]:
print(re.search('x-{3}x', 'x----x'))                      # Four dashes

None


Here, `x-{3}x` matches `'x'`, followed by exactly three instances of the `'-'` character, followed by another `'x'`. The match fails when there are fewer or more than three dashes between the `'x'` characters.

### `{m,n}`

Matches any number of repetitions of the preceding regex from `m` to `n`, inclusive.

In the following example, the quantified `<regex>` is `-{2,4}`. The match succeeds when there are two, three, or four dashes between the `'x'` characters but fails otherwise:

In [98]:
for i in range(1, 6):
    s = f"x{'-' * i}x"
    print(f'{i}  {s:10}', re.search('x-{2,4}x', s))

1  x-x        None
2  x--x       <re.Match object; span=(0, 4), match='x--x'>
3  x---x      <re.Match object; span=(0, 5), match='x---x'>
4  x----x     <re.Match object; span=(0, 6), match='x----x'>
5  x-----x    None


1. No match, because there is only one dash, fewer than the minimum of two.
2. Match, because there are exactly two dashes.
3. Match, because three dashes falls between two and four.
4. Match, because there are exactly four dashes, the maximum allowed.
5. No match, because there are five dashes, more than the maximum of four.
 


Omitting m implies a lower bound of 0, and omitting n implies an unlimited upper bound:


| Regular Expression | Matches | Identical to |
| --- | --- | --- |
| `<regex>{,n}` | Any number of repetitions of `<regex>` less than or equal to `n` | `<regex>{0,n}` |
| `<regex>{m,}` | Any number of repetitions of `<regex>` greater than or equal to `m` | `----` |
| `<regex>{,}` | Any number of repetitions of `<regex>` | `<regex>{0,}`, `<regex>*` |
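The open-ended forms can be checked with a short sketch. The helper name `span_of` and the sample strings are assumptions made for illustration:

```python
import re

def span_of(pattern, text):
    """Return the span of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.span() if m else None

# {,2} means zero to two repetitions
upper_only = [span_of('xa{,2}y', t) for t in ['xy', 'xaay', 'xaaay']]

# {2,} means two or more repetitions
lower_only = [span_of('xa{2,}y', t) for t in ['xay', 'xaay', 'xaaaay']]

print(upper_only)
print(lower_only)
```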

If you omit all of m, n, and the comma, then the curly braces no longer function as metacharacters. {} matches just the literal string '{}':

In [99]:
re.search('x{}y', 'x{}y')

<re.Match object; span=(0, 4), match='x{}y'>

In fact, to have any special meaning, a sequence with curly braces must fit one of the following patterns in which m and n are nonnegative integers:

* {m,n}
* {m,}
* {,n}
* {,}

Otherwise, it matches literally:

In [100]:
re.search('x{foo}y', 'x{foo}y')
# The braces match literally

<re.Match object; span=(0, 7), match='x{foo}y'>

In [101]:
re.search('x{a:b}y', 'x{a:b}y')
# The braces match literally here too

<re.Match object; span=(0, 7), match='x{a:b}y'>

In [102]:
re.search('x{1,3,5}y', 'x{1,3,5}y')
# Three numbers don't fit {m,n}, so the braces match literally

<re.Match object; span=(0, 9), match='x{1,3,5}y'>

In [103]:
re.search('x{foo,bar}y', 'x{foo,bar}y')
# The braces match literally

<re.Match object; span=(0, 11), match='x{foo,bar}y'>

Later in this tutorial, when you learn about the `DEBUG` flag, you’ll see how you can confirm this.

### `{m,n}?`

The non-greedy (lazy) version of {m,n}.

`{m,n}` will match as many characters as possible, and `{m,n}?` will match as few as possible:

In [104]:
re.search('a{3,5}', 'aaaaaaaa')
# {m,n} will match as many characters as possible

<re.Match object; span=(0, 5), match='aaaaa'>

In [105]:
re.search('a{3,5}?', 'aaaaaaaa')
# {m,n}? will match as few as possible
# The minimum is 3, so only three 'a' characters are matched

<re.Match object; span=(0, 3), match='aaa'>

## Grouping Constructs and Backreferences

Grouping constructs break up a regex in Python into subexpressions or groups. This serves two purposes:

1. **Grouping:** A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit.
2. **Capturing:** Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. You can retrieve captured matches later through several different mechanisms.



### `(<regex>)`

Defines a subexpression or group.

This is the most basic grouping construct. A regex in parentheses just matches the contents of the parentheses:

In [106]:
re.search('(bar)', 'foo bar baz')
# Matches with parentheses

<re.Match object; span=(4, 7), match='bar'>

In [107]:
re.search('bar', 'foo bar baz')
# Matches without parentheses

<re.Match object; span=(4, 7), match='bar'>

As a regex, `(bar)` matches the string `'bar'`, the same as the regex bar would without the parentheses.

### Treating a Group as a Unit
A quantifier metacharacter that follows a group operates on the entire subexpression specified in the group as a single unit.

For instance, the following example matches one or more occurrences of the string `'bar'`:

In [108]:
re.search('(bar)+', 'foo bar baz')
# + requires one or more occurrences of the group

<re.Match object; span=(4, 7), match='bar'>

In [109]:
re.search('(bar)+', 'foo barbar baz')
# + means one or more occurrences of the group
# Note: the match covers both consecutive occurrences of 'bar'

<re.Match object; span=(4, 10), match='barbar'>

In [110]:
re.search('(bar)+', 'foo barbarbarbar baz')
# + means one or more occurrences of the group
# Note: the match covers all four consecutive occurrences of 'bar'

<re.Match object; span=(4, 16), match='barbarbarbar'>

Here’s a breakdown of the difference between the two regexes with and without grouping parentheses:

| Regex | Interpretation | Matches | Examples |
|---:|:-------------|:-----------|:------|
| `bar+` | The `+` metacharacter applies only to the character `'r'`. | `'ba'` followed by one or more occurrences of `'r'` | `'bar'` `'barr'` `'barrr'` |
| `(bar)+` | The `+` metacharacter applies to the entire string `'bar'`. | One or more occurrences of `'bar'` | `'bar'` `'barbar'` `'barbarbar'` |
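
The table can be checked with a short sketch. It anchors both regexes with `^` and `$` so that each candidate string must match in full. The list of candidates and the `results` dictionary are illustrative names, not part of the original examples.

```python
import re

candidates = ['bar', 'barrr', 'barbar']

# Map each candidate to (matches bar+, matches (bar)+) when fully anchored.
results = {}
for s in candidates:
    r_only = bool(re.search(r'^bar+$', s))    # + applies to 'r' only
    whole = bool(re.search(r'^(bar)+$', s))   # + applies to the group 'bar'
    results[s] = (r_only, whole)
    print(f'{s!r}: bar+ -> {r_only}, (bar)+ -> {whole}')
```

Only `'bar'` satisfies both regexes. `'barrr'` satisfies `bar+` alone, and `'barbar'` satisfies `(bar)+` alone.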

Now take a look at a more complicated example. The regex `(ba[rz]){2,4}(qux)?` matches 2 to 4 occurrences of either `'bar'` or `'baz'`, optionally followed by `'qux'`:

In [111]:
re.search('(ba[rz]){2,4}(qux)?', 'bazbarbazqux')

<re.Match object; span=(0, 12), match='bazbarbazqux'>

In [112]:
re.search('(ba[rz]){2,4}(qux)?', 'barbar')

<re.Match object; span=(0, 6), match='barbar'>

The following example shows that you can nest grouping parentheses:

In [113]:
re.search('(foo(bar)?)+(\d\d\d)?', 'foofoobar')

<re.Match object; span=(0, 9), match='foofoobar'>

In [114]:
re.search('(foo(bar)?)+(\d\d\d)?', 'foofoobar123')

<re.Match object; span=(0, 12), match='foofoobar123'>

In [115]:
re.search('(foo(bar)?)+(\d\d\d)?', 'foofoo123')

<re.Match object; span=(0, 9), match='foofoo123'>

The regex `(foo(bar)?)+(\d\d\d)?` is pretty elaborate, so let’s break it down into smaller pieces:

| Regex | Matches |
|---:|:-------------|
| `foo(bar)?` | `'foo'` optionally followed by `'bar'` |
| `(foo(bar)?)+` | One or more occurrences of the above |
| `\d\d\d` | Three decimal digit characters |
| `(\d\d\d)?` | Zero or one occurrences of the above |

## Capturing Groups
Grouping isn’t the only useful purpose that grouping constructs serve. Most (but not quite all) grouping constructs also capture the part of the search string that matches the group. You can retrieve the captured portion or refer to it later in several different ways.

Remember the match object that `re.search()` returns? There are two methods defined for a match object that provide access to captured groups: `.groups()` and `.group()`.

## `m.groups()`

Returns a tuple containing all the captured groups from a regex match.

In [116]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
m

<re.Match object; span=(0, 12), match='foo,quux,baz'>

`\w+` matches one or more word characters, whereas a bare `\w` matches exactly one.

Each of the three `(\w+)` expressions matches a sequence of word characters. The full regex `(\w+),(\w+),(\w+)` breaks the search string into three comma-separated tokens.

Because the `(\w+)` expressions use grouping parentheses, the corresponding matching tokens are **captured**. To access the captured matches, you can use `.groups()`, which returns a tuple containing all the captured matches in order:

In [117]:
m.groups()

('foo', 'quux', 'baz')

Notice that the tuple contains the tokens but not the commas that appeared in the search string. That’s because the word characters that make up the tokens are inside the grouping parentheses but the commas aren’t. The commas that you see between the returned tokens are the standard delimiters used to separate values in a tuple.
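
Because `.groups()` returns an ordinary tuple, you can unpack it straight into variables. The following is a minimal sketch that assumes a date in `YYYY-MM-DD` form. The sample string and the variable names are illustrative only.

```python
import re

m = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'released on 2021-03-15')

if m:
    # One variable per captured group, in left-to-right order.
    year, month, day = m.groups()
    print(year, month, day)  # 2021 03 15
```

Each value is still a string, so convert it with `int()` if you need a number.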

## `m.group(<n>)`
Returns a string containing the `<n>th` captured match.

With one argument, `.group()` returns a single captured match. Note that the arguments are one-based, not zero-based. So, `m.group(1)` refers to the first captured match, `m.group(2)` to the second, and so on:



In [118]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
# the parentheses create the groups; the commas between them match literally
m.groups()

('foo', 'quux', 'baz')

In [119]:
m.group(1)
# group 1 holds the text matched before the first comma

'foo'

In [120]:
m.group(2)

'quux'

In [121]:
m.group(3)

'baz'

Since the numbering of captured matches is one-based, and there isn’t any group numbered zero, `m.group(0)` has a special meaning:

In [122]:
m.group(0)
# group 0 is the entire match, including the commas outside the groups

'foo,quux,baz'

In [123]:
m.group()
# same as 0 

'foo,quux,baz'

`m.group(0)` returns the entire match, and `m.group()` does the same.

## `m.group(<n1>, <n2>, ...)`
Returns a tuple containing the specified captured matches.

With multiple arguments, `.group()` returns a tuple containing the specified captured matches in the given order:

In [124]:
m.group()

'foo,quux,baz'

In [125]:
m.group(2, 3)

('quux', 'baz')

In [126]:
m.group(3, 2, 1)

('baz', 'quux', 'foo')

This is just convenient shorthand. You could create the tuple of matches yourself instead:

In [127]:
m.group(3, 2, 1)

('baz', 'quux', 'foo')

In [128]:
(m.group(3), m.group(2), m.group(1))
# same as above 

('baz', 'quux', 'foo')

The two statements shown are functionally equivalent.

## Backreferences
You can match a previously captured group later within the same regex using a special metacharacter sequence called a **backreference**.

## `\<n>`

Matches the contents of a previously captured group.



Within a regex in Python, the sequence `\<n>`, where `<n>` is an integer from 1 to 99, matches the contents of the `<n>th` captured group.

Here’s a regex that matches a word, followed by a comma, followed by the same word again:

In [129]:
regex = r'(\w+),\1'

In [130]:
m = re.search(regex, 'foo,foo')

In [131]:
m

<re.Match object; span=(0, 7), match='foo,foo'>

In [132]:
m = re.search(regex, 'qux,qux')
m

<re.Match object; span=(0, 7), match='qux,qux'>

In [133]:
m.group(1)

'qux'

In [134]:
m = re.search(regex, 'foo,qux')

In [135]:
print(m)

None


In the first example, in cell **In [130]**, `(\w+)` matches the first instance of the string `'foo'` and saves it as the first captured group. The comma matches literally.
<br>Then `\1` is a backreference to the first captured group and matches `'foo'` again.
<br>The second example, in cell **In [132]**, is identical except that the `(\w+)` matches `'qux'` instead.

The last example, in cell **In [134]**, doesn’t have a match because what comes before the comma isn’t the same as what comes after it, so the `\1` backreference doesn’t match.

**Note:** Any time you use a regex in Python with a numbered backreference, it’s a good idea to specify it as a raw string. Otherwise, the interpreter may confuse the backreference with an octal value.

<br>Consider this example:

In [136]:
print(re.search('([a-z])#\1', 'd#d'))

None


The regex `([a-z])#\1` matches a lowercase letter, followed by `'#'`, followed by the same lowercase letter. The string in this case is `'d#d'`, which should match. But the match fails because Python misinterprets the backreference `\1` as the character whose octal value is one:

In [137]:
oct(ord('\1'))

'0o1'

You’ll achieve the correct match if you specify the regex as a raw string:

In [138]:
re.search(r'([a-z])#\1', 'd#d')

<re.Match object; span=(0, 3), match='d#d'>

Remember to consider using a raw string whenever your regex includes a metacharacter sequence containing a backslash.

Numbered backreferences are one-based like the arguments to `.group()`. Only the first ninety-nine captured groups are accessible by backreference. The interpreter will regard `\100` as the `'@'` character, whose octal value is 100.
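
A common use of a numbered backreference is spotting an accidentally doubled word. Here is a minimal sketch. The sample sentences and the names `doubled`, `hit`, and `miss` are made up for illustration, and the `\b` word-boundary anchors keep the regex from matching inside longer words such as `'this'`.

```python
import re

# A word, some whitespace, then the same word again.
doubled = r'\b(\w+)\s+\1\b'

hit = re.search(doubled, 'this is is a test')
miss = re.search(doubled, 'this is a test')

print(hit.group())   # is is
print(hit.group(1))  # is
print(miss)          # None
```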

## Other Grouping Constructs

Python supports several enhanced grouping constructs that allow you to tweak when and how grouping occurs.

## `(?P<name><regex>)`

Creates a named captured group.

This metacharacter sequence is similar to grouping parentheses in that it creates a group matching `<regex>` that is accessible through the match object or a subsequent backreference. The difference in this case is that you reference the matched group by its given symbolic `<name>` instead of by its number.

Earlier, you saw this example with three captured groups numbered 1, 2, and 3:

In [144]:
m = re.search('(\w+),(\w+),(\w+)', 'foo,quux,baz')

In [145]:
m.groups()

('foo', 'quux', 'baz')

In [146]:
m.group(1,2,3)

('foo', 'quux', 'baz')

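The following does effectively the same thing, except that the groups have the symbolic names `w1`, `w2`, and `w3`:

In [147]:
m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')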



You can refer to these captured groups by their symbolic names:

In [148]:
m.group('w1')
# group 1 is named w1

'foo'

In [149]:
m.group('w3')
# group 3 is named w3

'baz'

In [150]:
m.group('w1','w2','w3')

('foo', 'quux', 'baz')

You can still access groups with symbolic names by number if you wish:

In [152]:
m = re.search('(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')
# (?P<w1>...) gives the group the symbolic name w1

In [153]:
m.group('w1')

'foo'

In [155]:
m.group(1)
# a named group is still accessible by its number

'foo'

In [156]:
m.group('w1', 'w2', 'w3')

('foo', 'quux', 'baz')

In [157]:
m.group(1, 2, 3)

('foo', 'quux', 'baz')

Any `<name>` specified with this construct must conform to the rules for a Python identifier, and each `<name>` can only appear once per regex.
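
Match objects also provide a `.groupdict()` method, which this section doesn’t otherwise cover. It returns the named groups as a dictionary keyed by name. The sketch below uses a deliberately simplified pattern for an email-like string. The group names `user`, `host`, and `tld` are illustrative, and the pattern is not a real email validator.

```python
import re

m = re.search(
    r'(?P<user>\w+)@(?P<host>\w+)\.(?P<tld>\w+)',
    'contact: alice@example.org',
)

# Named groups as a dict: {group name: matched text}
parts = m.groupdict()
print(parts)
# {'user': 'alice', 'host': 'example', 'tld': 'org'}
```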

## `(?P=<name>)`

Matches the contents of a previously captured named group.

The `(?P=<name>)` metacharacter sequence is a backreference, similar to `\<n>`, except that it refers to a named group rather than a numbered group.

Here again is the example from above, which uses a numbered backreference to match a word, followed by a comma, followed by the same word again:

In [158]:
m = re.search(r'(\w+),\1', 'foo,foo')
m

<re.Match object; span=(0, 7), match='foo,foo'>

In [159]:
m.group(1)

'foo'

The following code does the same thing using a named group and a backreference instead:

In [162]:
m = re.search(r'(?P<word>\w+),(?P=word)', 'foo,foo')
# (?P=word) is a backreference to the named group

In [163]:
m

<re.Match object; span=(0, 7), match='foo,foo'>

In [164]:
m.group('word')

'foo'

`(?P<word>\w+)` matches `'foo'` and saves it as a captured group named `word`. Again, the comma matches literally. Then `(?P=word)` is a backreference to the named capture and matches `'foo'` again.

**Note:** The angle brackets (`<` and `>`) are required around `name` when creating a named group but not when referring to it later, either by backreference or by `.group()`:

In [166]:
m = re.match(r'(?P<num>\d+)\.(?P=num)','135.135')
# backreference

In [167]:
m

<re.Match object; span=(0, 7), match='135.135'>

In [168]:
m.group('num')

'135'

Here, `(?P<num>\d+)` creates the captured group. But the corresponding backreference is `(?P=num)` without the angle brackets.
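
Named backreferences are handy when an opening delimiter must be closed by the same character. The sketch below matches text wrapped in either single or double quotes. The group names `quote` and `body` and the sample strings are assumptions made for this illustration.

```python
import re

# An opening quote (' or "), a lazy body, then the SAME quote character again.
regex = r'''(?P<quote>['"])(?P<body>.*?)(?P=quote)'''

m1 = re.search(regex, 'say "it\'s fine" now')
m2 = re.search(regex, "'single'")
m3 = re.search(regex, '"abc\'')  # opens with " but ends with ', so no match

print(m1.group('body'))  # it's fine
print(m2.group('body'))  # single
print(m3)                # None
```

In the first search the apostrophe inside the body doesn’t end the match, because the backreference only accepts the double quote that opened it.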

## `(?:<regex>)`

Creates a non-capturing group.

`(?:<regex>)` is just like `(<regex>)` in that it matches the specified `<regex>`. But `(?:<regex>)` doesn’t capture the match for later retrieval:

In [170]:
m = re.search('(\w+),(?:\w+),(\w+)', 'foo,quux,baz')

In [175]:
m.groups()
# 'quux' is matched by the non-capturing group, so it is absent from .groups()

('foo', 'baz')

In [174]:
m.group()

'foo,quux,baz'

In [176]:
m.group(1)

'foo'

In [179]:
m.group(2)
# group 2 is 'baz', not 'quux', because (?:\w+) does not capture

'baz'

## Why would you want to define a group but not capture it?

<br>Remember that the regex parser will treat the `<regex>` inside grouping parentheses as a single unit. You may have a situation where you need this grouping feature, but you don’t need to do anything with the value later, so you don’t really need to capture it. If you use non-capturing grouping, then the tuple of captured groups won’t be cluttered with values you don’t actually need to keep.

<br>Additionally, it takes some time and memory to capture a group. If the code that performs the match executes many times and you don’t capture groups that you aren’t going to use later, then you may see a slight performance advantage.
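
The sketch below shows the clutter that a non-capturing group avoids. Both regexes repeat the `-\d{2}` piece as a unit, but only the first one stores it. The timestamp string is made up for this example.

```python
import re

s = '2021-03-15T10:30'

capturing = re.search(r'(\d{4})(-\d{2})+', s)
noncapturing = re.search(r'(\d{4})(?:-\d{2})+', s)

# Both regexes match the same text...
print(capturing.group())      # 2021-03-15
print(noncapturing.group())   # 2021-03-15

# ...but the repeated capturing group keeps its last repetition.
print(capturing.groups())     # ('2021', '-15')
print(noncapturing.groups())  # ('2021',)
```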

## `(?(<n>)<yes-regex>|<no-regex>)`
## `(?(<name>)<yes-regex>|<no-regex>)`

Specifies a conditional match.

A conditional match matches against one of two specified regexes depending on whether the given group exists:

* `(?(<n>)<yes-regex>|<no-regex>)` matches against `<yes-regex>` if a group numbered `<n>` exists. Otherwise, it matches against `<no-regex>`.

* `(?(<name>)<yes-regex>|<no-regex>)` matches against `<yes-regex>` if a group named `<name>` exists. Otherwise, it matches against `<no-regex>`.

Conditional matches are better illustrated with an example. Consider this regex:

In [186]:
regex = r'^(###)?foo(?(1)bar|baz)'
# ^(###)? means search string optionally begins with '###'


Here are the parts of this regex broken out with some explanation:
1. `^(###)?` indicates that the search string optionally begins with `'###'`. If it does, then the grouping parentheses around `###` will create a group numbered 1. Otherwise, no such group will exist.
2. The next portion, `foo`, literally matches the string `'foo'`.
3. Lastly, `(?(1)bar|baz)` matches against `'bar'` if group 1 exists and `'baz'` if it doesn’t.

The following code blocks demonstrate the use of the above regex in several different Python code snippets:

In [190]:
# Example 1
# regex = r'^(###)?foo(?(1)bar|baz)'
re.search(regex, '###foobar')
# '###foobar' starts with '###', so group 1 exists
# the conditional therefore tries 'bar', which matches

<re.Match object; span=(0, 9), match='###foobar'>

The search string `'###foobar'` does start with `'###'`, so the parser creates a group numbered 1. The conditional match is then against 'bar', which matches.

In [188]:
# Example 2
# regex = r'^(###)?foo(?(1)bar|baz)'
print(re.search(regex, '###foobaz'))
# '###foobaz' starts with '###', so group 1 exists
# the conditional therefore tries 'bar', but 'baz' follows 'foo', so the match fails

None


In [191]:
# Example 3
# regex = r'^(###)?foo(?(1)bar|baz)'
print(re.search(regex, 'foobar'))
# 'foobar' does not start with '###', so group 1 does not exist
# the conditional therefore tries 'baz', but 'bar' follows 'foo', so the match fails

None


In [192]:
# Example 4
# regex = r'^(###)?foo(?(1)bar|baz)'
re.search(regex, 'foobaz')
# 'foobaz' does not start with '###', so group 1 does not exist
# the conditional therefore tries 'baz', which matches

<re.Match object; span=(0, 6), match='foobaz'>

Here’s another conditional match using a named group instead of a numbered group:

In [194]:
regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'

# (?P<ch>\W) optionally captures a non-word character in a group named ch
# (?P=ch) is a backreference to that group

This regex matches the string 'foo', preceded by a single non-word character and followed by the same non-word character, or the string 'foo' by itself.

Again, let’s break this down into pieces:

| Regex | Matches |
|---:|:-------------|
| `^` | The start of the string |
| `(?P<ch>\W)` | A single non-word character, captured in a group named `ch` |
| `(?P<ch>\W)?` | Zero or one occurrences of the above |
| `foo` | The literal string `'foo'` |
| `(?(ch)(?P=ch)\|)` | The contents of the group named `ch` if it exists, or the empty string if it doesn’t |
| `$` | The end of the string |

In [196]:
# regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'
re.search(regex, 'foo')
# 'foo' by itself matches

<re.Match object; span=(0, 3), match='foo'>

In [198]:
re.search(regex, '#foo#')
# the same non-word character precedes and follows 'foo', so the match succeeds

<re.Match object; span=(0, 5), match='#foo#'>

In [199]:
# regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'

re.search(regex,"@foo@")
# the same non-word character precedes and follows 'foo', so the match succeeds

<re.Match object; span=(0, 5), match='@foo@'>

In [201]:
# regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'

print(re.search(regex, '#foo'))

# the match fails: '#' precedes 'foo' but does not follow it

None


In [202]:
# regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$'
print(re.search(regex, 'foo@'))
# the match fails: no non-word character precedes 'foo', so none may follow it

None


Conditional regexes in Python are pretty esoteric and challenging to work through. If you ever do find a reason to use one, then you could probably accomplish the same goal with multiple separate `re.search()` calls, and your code would be less complicated to read and understand.
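
As a rough sketch of that alternative, the conditional regex `^(###)?foo(?(1)bar|baz)` from earlier can be replaced by two plain searches. The helper name `matches_without_conditional` is hypothetical, introduced only for this illustration. The loop checks that both approaches agree on the four strings used above.

```python
import re

conditional = r'^(###)?foo(?(1)bar|baz)'

def matches_without_conditional(s):
    """Hypothetical helper: same rule, expressed as two simpler searches."""
    return bool(re.search(r'^###foobar', s) or re.search(r'^foobaz', s))

samples = ['###foobar', '###foobaz', 'foobar', 'foobaz']
for s in samples:
    with_conditional = bool(re.search(conditional, s))
    print(f'{s!r}: {with_conditional} {matches_without_conditional(s)}')
```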

## Lookahead and Lookbehind Assertions
* **Lookahead** and **lookbehind** assertions determine the success or failure of a regex match in Python based on what is just behind (to the left) or ahead (to the right) of the parser’s current position in the search string.
* Like anchors, lookahead and lookbehind assertions are zero-width assertions, so they don’t consume any of the search string. Also, even though they contain parentheses and perform grouping, they don’t capture what they match.

## `(?=<lookahead_regex>)`

Creates a positive lookahead assertion.



`(?=<lookahead_regex>)` asserts that what follows the regex parser’s current position must match `<lookahead_regex>`:



In [204]:
re.search('foo(?=[a-z])', 'foobar')
# the lookahead (?=[a-z]) requires the character after 'foo' to be a lowercase letter
# here that character is 'b', so a match is found

<re.Match object; span=(0, 3), match='foo'>

In the next example, on the other hand, the lookahead fails. The next character after `'foo'` is '1', so there isn’t a match:

In [206]:
print(re.search('foo(?=[a-z])', 'foo123'))
# 'foo' matches
# (?=[a-z]) requires a lowercase letter next, but the next character is the digit '1'

None


Here’s another example illustrating how a lookahead differs from a conventional regex in Python:

In [210]:
m = re.search('foo(?=[a-z])(?P<ch>.)', 'foobar')
m.group('ch')
# 'foo' matches
# the lookahead checks that the next character is a lowercase letter without consuming it
# (?P<ch>.) then captures that same character in a group named ch

'b'

In the first search, the parser proceeds as follows:

1. The first portion of the regex, foo, matches and consumes `'foo'` from the search string 'foobar'.
2. The next portion, `(?=[a-z])`, is a lookahead that matches 'b', but the parser doesn’t advance past the 'b'.
3. Lastly, `(?P<ch>.)` matches the next single character available, which is 'b', and captures it in a group named ch.

<br>The m.group('ch') call confirms that the group named ch contains 'b'.

In [214]:
m = re.search('foo([a-z])(?P<ch>.)', 'foobar')
m.group('ch')
# group ch contains 'a' because ([a-z]) has already consumed 'b'

'a'

Compare that to the search in the cell above, which doesn’t contain a lookahead:

1. As in the first example, the first portion of the regex, `foo`, matches and consumes `'foo'` from the search string 'foobar'.
2. The next portion, `([a-z])`, matches and consumes 'b', and the parser advances past 'b'.
3. Lastly, `(?P<ch>.)` matches the next single character available, which is now 'a'.

m.group('ch') confirms that, in this case, the group named ch contains 'a'.

## `(?!<lookahead_regex>)`

Creates a negative lookahead assertion.



`(?!<lookahead_regex>)` asserts that what follows the regex parser’s current position must not match `<lookahead_regex>`.

In [216]:
# Here are the positive lookahead examples you saw earlier, 
# along with their negative lookahead counterparts:
re.search('foo(?=[a-z])', 'foobar')
#?= (positive lookahead)

<re.Match object; span=(0, 3), match='foo'>

In [218]:
print(re.search('foo(?![a-z])', 'foobar'))
# ?! (negative lookahead)
# no match: 'foo' is followed by the lowercase letter 'b', which the negative lookahead forbids

None


In [219]:
print(re.search('foo(?=[a-z])', 'foo123'))
# ?= (positive lookahead)
# no match: the lookahead requires a lowercase letter, but '1' follows 'foo'

None


In [220]:
re.search('foo(?![a-z])', 'foo123')
# ?! (negative lookahead)
# match: '1' is not a lowercase letter, so the negative lookahead succeeds

<re.Match object; span=(0, 3), match='foo'>
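
Two more small sketches, with made-up sample strings, show the zero-width nature of lookaheads in practice. In the first, the `%` sign must be present but isn’t part of the match. In the second, the negative lookahead makes the search skip the first `'foo'` and settle on the later one.

```python
import re

# Positive lookahead: digits that are followed by '%', without consuming '%'.
percent = re.search(r'\d+(?=%)', 'save 50% today')
print(percent.group(), percent.span())  # 50 (5, 7)

# Negative lookahead: the first 'foo' that is NOT followed by 'bar'.
not_bar = re.search(r'foo(?!bar)', 'foobar foobaz')
print(not_bar.group(), not_bar.span())  # foo (7, 10)
```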

## `(?<=<lookbehind_regex>)`

Creates a positive lookbehind assertion.

`(?<=<lookbehind_regex>)` asserts that what precedes the regex parser’s current position must match `<lookbehind_regex>`.

In the following example, the lookbehind assertion specifies that `'foo'` must precede `'bar'`:

In [222]:
re.search('(?<=foo)bar', 'foobar')
# (?<=...) is a positive lookbehind

<re.Match object; span=(3, 6), match='bar'>

This is the case here, so the match succeeds. As with lookahead assertions, the part of the search string that matches the lookbehind doesn’t become part of the eventual match.

The next example fails to match because the lookbehind requires that `'qux'` precede `'bar'`:

In [223]:
print(re.search('(?<=qux)bar', 'foobar'))
# example fails to match because the lookbehind requires that 'qux' precede 'bar':

None


There’s a restriction on lookbehind assertions that doesn’t apply to lookahead assertions: the `<lookbehind_regex>` in a lookbehind assertion must specify a match of fixed length.

For example, the following isn’t allowed because the length of the string matched by `a+` is indeterminate:

In [228]:
# re.search('(?<=a+)def', 'aaadef')
# Uncommenting the line above raises:
# error: look-behind requires fixed-width pattern
# a+ can match one or more characters, so its length isn't fixed

This, however, is okay:



In [None]:
re.search('(?<=a{3})def', 'aaadef')
# Anything that matches a{3} has a fixed length of three,
# so a{3} is valid in a lookbehind assertion.

## `(?<!<lookbehind_regex>)`

Creates a negative lookbehind assertion.

`(?<!<lookbehind_regex>)` asserts that what precedes the regex parser’s current position must not match `<lookbehind_regex>`:

In [230]:
print(re.search('(?<!foo)bar', 'foobar'))

None


In [231]:
re.search('(?<!qux)bar', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

As with the positive lookbehind assertion, `<lookbehind_regex>` must specify a match of fixed length.
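
Here is a minimal sketch of both lookbehind forms on a made-up string. The positive lookbehind picks the number that follows a dollar sign. The negative one, helped by a `\b` word boundary so it can’t start in the middle of `42`, picks the number that doesn’t.

```python
import re

text = 'cost: $42, qty: 7'

# Positive lookbehind: digits immediately preceded by a dollar sign.
price = re.search(r'(?<=\$)\d+', text)
print(price.group(), price.span())  # 42 (7, 9)

# Negative lookbehind: a whole number NOT preceded by a dollar sign.
quantity = re.search(r'(?<!\$)\b\d+', text)
print(quantity.group())  # 7
```

Both lookbehinds contain only `\$`, which always matches exactly one character, so the fixed-length requirement is met.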

## Miscellaneous Metacharacters
There are a couple more metacharacter sequences to cover. These are stray metacharacters that don’t obviously fall into any of the categories already discussed.

## `(?#...)`
Specifies a comment.

The regex parser ignores anything contained in the sequence (?#...):

In [232]:
re.search('bar(?#This is a comment) *baz', 'foo bar baz qux')
# (?#...) lets you embed documentation inside a regex

<re.Match object; span=(4, 11), match='bar baz'>

This allows you to specify documentation inside a regex in Python, which can be especially useful if the regex is particularly long.

## Vertical bar, or pipe `(|)`
Specifies a set of alternatives on which to match.

An expression of the form `<regex1>|<regex2>|...|<regexn>` matches at most one of the specified `<regexi>` expressions:



In [233]:
re.search('foo|bar|baz', 'bar')
# foo|bar|baz matches any one of 'foo', 'bar', or 'baz'

<re.Match object; span=(0, 3), match='bar'>

In [234]:
re.search('foo|bar|baz', 'baz')
# foo|bar|baz matches any one of 'foo', 'bar', or 'baz'

<re.Match object; span=(0, 3), match='baz'>

In [235]:
print(re.search('foo|bar|baz', 'quux'))
# none of 'foo', 'bar', or 'baz' occurs in 'quux', so there is no match

None


Alternation is non-greedy. The regex parser looks at the expressions separated by `|` in left-to-right order and returns the first match that it finds. The remaining expressions aren’t tested, even if one of them would produce a longer match:

In [236]:
re.search('foo', 'foograult')
# 'foo' on its own matches at the start of the string

<re.Match object; span=(0, 3), match='foo'>

In [237]:
re.search('grault', 'foograult')
# 'grault' on its own matches too, later in the string

<re.Match object; span=(3, 9), match='grault'>

In [239]:
re.search('foo|grault', 'foograult')
# the parser tries 'foo' first and finds it at the start of 'foograult',
# so the search stops there and 'grault' is never tested

<re.Match object; span=(0, 3), match='foo'>

In [241]:
re.search('foo|grault', 'graultfoo')
# here 'grault' is returned because it occurs first in the search string

<re.Match object; span=(0, 6), match='grault'>
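
The effect of alternative order is easiest to see when one alternative is a prefix of another. This short sketch, with illustrative variable names, runs the same two alternatives in both orders:

```python
import re

# 'foo' is listed first, so it wins even though 'foobar' would be longer.
short_first = re.search(r'foo|foobar', 'foobar')

# With the longer alternative listed first, the longer match is returned.
long_first = re.search(r'foobar|foo', 'foobar')

print(short_first.group())  # foo
print(long_first.group())   # foobar
```

When alternatives overlap like this, listing the longer one first is a simple way to get the longer match.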

You can combine alternation, grouping, and any other metacharacters to achieve whatever level of complexity you need. In the following example, `(foo|bar|baz)+` means a sequence of one or more of the strings `'foo'`, `'bar'`, or `'baz'`:

In [242]:
re.search('(foo|bar|baz)+', 'foofoofoo')
# + means one or more occurrences of the group,
# so the match spans all three consecutive repetitions of 'foo'

<re.Match object; span=(0, 9), match='foofoofoo'>

In [243]:
re.search('(foo|bar|baz)+', 'bazbazbazbaz')
# + means one or more occurrences of the group,
# so the match spans all four consecutive repetitions of 'baz'

<re.Match object; span=(0, 12), match='bazbazbazbaz'>

In the next example, `([0-9]+|[a-f]+)` means a sequence of one or more decimal digit characters or a sequence of one or more of the characters `'a-f'`:

In [245]:
re.search('([0-9]+|[a-f]+)', '456')
# matches a run of digits or a run of letters from a to f

<re.Match object; span=(0, 3), match='456'>

In [247]:
re.search('([0-9]+|[a-f]+)', 'ffda')
# matches a run of digits or a run of letters from a to f

<re.Match object; span=(0, 4), match='ffda'>

With all the metacharacters that the re module supports, the sky is practically the limit.

# Modified Regular Expression Matching With Flags

Most of the functions in the re module take an optional `<flags>` argument. This includes the function you’re now very familiar with, `re.search()`.

## `re.search(<regex>, <string>, <flags>)`

Scans a string for a regex match, applying the specified modifier `<flags>`.

Flags modify regex parsing behavior, allowing you to refine your pattern matching even further.

## Supported Regular Expression Flags

The sections below describe the available flags. All flags except `re.DEBUG` have a short, single-letter name and also a longer, full-word name.

## `re.I`
## `re.IGNORECASE`

Makes matching case insensitive.

When `IGNORECASE` is in effect, character matching is case insensitive:

In [250]:
re.search('a+', 'aaaAAA')
# a+ matches one or more occurrences of lowercase 'a'

<re.Match object; span=(0, 3), match='aaa'>

In [251]:
re.search('A+', 'aaaAAA')
# A+ matches one or more occurrences of uppercase 'A'

<re.Match object; span=(3, 6), match='AAA'>

In [253]:
re.search('a+', 'aaaAAA', re.I)
# re.I makes the match case insensitive, so a+ also matches 'AAA'

<re.Match object; span=(0, 6), match='aaaAAA'>

In [256]:
re.search('A+', 'aaaAAA', re.IGNORECASE)
# re.IGNORECASE is the long name for re.I
# with it, A+ matches the lowercase 'aaa' as well

<re.Match object; span=(0, 6), match='aaaAAA'>

`IGNORECASE` affects alphabetic matching involving character classes as well:

In [257]:
re.search('[a-z]+', 'aBcDeF')
# [a-z]+ matches one or more lowercase letters, so it stops at 'B'

<re.Match object; span=(0, 1), match='a'>

In [259]:
re.search('[a-z]+', 'aBcDeF', re.I)
# with re.I, [a-z]+ matches uppercase letters too

<re.Match object; span=(0, 6), match='aBcDeF'>

When case is significant, the longest portion of `'aBcDeF'` that `[a-z]+` matches is just the initial `'a'`. Specifying `re.I` makes the search case insensitive, so `[a-z]+` matches the entire string.

## `re.M`
## `re.MULTILINE`

Causes start-of-string and end-of-string anchors to match at embedded newlines.

By default, the `^` (start-of-string) and `$` (end-of-string) anchors match only at the beginning and end of the search string:

In [261]:
# ^ start of the string
# $ end of the string

In [262]:
s = 'foo\nbar\nbaz'

In [264]:
print(s)

foo
bar
baz


In [267]:
re.search('^foo', s)

<re.Match object; span=(0, 3), match='foo'>

In [269]:
print(re.search('^bar', s))
# without MULTILINE, ^ matches only at the very start of the string
# 'bar' begins the second line, not the string, so there is no match

None


In [271]:
print(re.search('^baz', s))
# 'baz' begins the third line, not the string, so there is no match

None


In [273]:
print(re.search('baz$', s))

<re.Match object; span=(8, 11), match='baz'>


In this case, even though the search string `'foo\nbar\nbaz'` contains embedded newline characters, only `'foo'` matches when anchored at the beginning of the string, and only `'baz'` matches when anchored at the end.

The `MULTILINE` flag changes this behavior. When it is set:

* `^` matches at the beginning of the string or at the beginning of any line within the string (that is, immediately following a newline).
* `$` matches at the end of the string or at the end of any line within the string (immediately preceding a newline).

In [277]:
# The following are the same searches as shown above, repeated with MULTILINE:
s = 'foo\nbar\nbaz'
print(s)

foo
bar
baz


In [275]:
# same search as above, now with MULTILINE
re.search('^foo', s, re.MULTILINE)

<re.Match object; span=(0, 3), match='foo'>

In [276]:
# same search as above, now with MULTILINE
re.search('^bar', s, re.MULTILINE)
# 'bar' is at the start of the second line of s,
# and with re.MULTILINE, ^ matches at the start of every line

<re.Match object; span=(4, 7), match='bar'>

In [278]:
re.search('^baz', s, re.MULTILINE)
# 'baz' is at the start of the third line of s,
# and with re.MULTILINE, ^ matches at the start of every line

<re.Match object; span=(8, 11), match='baz'>

In [279]:
re.search('foo$', s, re.M)
# re.M is the short name for re.MULTILINE
# with it, $ matches at the end of every line
# 'foo' is at the end of the first line

<re.Match object; span=(0, 3), match='foo'>

In [280]:
re.search('bar$', s, re.M)
# re.M is the short name for re.MULTILINE
# with it, $ matches at the end of every line
# 'bar' is at the end of the second line

<re.Match object; span=(4, 7), match='bar'>

In [281]:
re.search('baz$', s, re.M)
# re.M is the short name for re.MULTILINE
# with it, $ matches at the end of every line
# 'baz' is at the end of the third line

<re.Match object; span=(8, 11), match='baz'>

In the string `'foo\nbar\nbaz'`, all three of `'foo'`, `'bar'`, and `'baz'` occur at either the start or end of the string or at the start or end of a line within the string. With the `MULTILINE` flag set, all three match when anchored with either `^` or `$`.

**Note:** The `MULTILINE` flag only modifies the `^` and `$` anchors in this way. It doesn’t have any effect on the `\A` and `\Z` anchors:

In [283]:
s = 'foo\nbar\nbaz'
print(s)

foo
bar
baz


In [284]:
re.search('^bar', s, re.MULTILINE)

<re.Match object; span=(4, 7), match='bar'>

In [285]:
re.search('bar$', s, re.MULTILINE)

<re.Match object; span=(4, 7), match='bar'>

In [286]:
print(re.search(r'\Abar', s, re.MULTILINE))
# \A matches only at the beginning of the entire string,
# so this search fails even with the MULTILINE flag in effect

None


In [288]:
print(re.search(r'bar\Z', s, re.MULTILINE))
# \Z matches only at the end of the entire string,
# so this search fails even with the MULTILINE flag in effect

None
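
`MULTILINE` is most useful when you want every line-anchored match rather than just the first. The sketch below uses `re.findall()`, a function from the `re` module that this section doesn’t otherwise cover. It returns all non-overlapping matches as a list. The variable names are illustrative.

```python
import re

s = 'foo\nbar\nbaz'

# Without the flag, ^ matches only at the very start of the string.
default_starts = re.findall(r'^\w+', s)

# With the flag, ^ and $ also match at the start and end of each line.
multiline_starts = re.findall(r'^\w+', s, re.MULTILINE)
multiline_ends = re.findall(r'\w+$', s, re.MULTILINE)

print(default_starts)    # ['foo']
print(multiline_starts)  # ['foo', 'bar', 'baz']
print(multiline_ends)    # ['foo', 'bar', 'baz']
```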


## `re.S`
## `re.DOTALL`
Causes the dot (.) metacharacter to match a newline.

Remember that by default, the dot metacharacter matches any character except the newline character. The `DOTALL` flag lifts this restriction:

In [290]:
# Example 1
print(re.search('foo.bar', 'foo\nbar'))
# by default, the dot does not match a newline character,
# so the '\n' between 'foo' and 'bar' prevents a match

None


In [293]:
# Example 2
# re.DOTALL lets the dot match a newline as well
re.search('foo.bar', 'foo\nbar', re.DOTALL)

<re.Match object; span=(0, 7), match='foo\nbar'>

In [294]:
# Example 3
re.search('foo.bar', 'foo\nbar', re.S)
# re.S is the short name for re.DOTALL

<re.Match object; span=(0, 7), match='foo\nbar'>

* In example 1, the dot metacharacter doesn’t match the newline in `'foo\nbar'`.
* In examples 2 and 3, `DOTALL` is in effect, so the dot does match the newline. Note that the short name of the `DOTALL` flag is `re.S`, not `re.D` as you might expect.
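
Flags can also be combined with the bitwise OR operator (`|`). This section doesn’t show that, so treat the following as a supplementary sketch. The search string is made up so that a match needs both case insensitivity and a dot that matches a newline.

```python
import re

text = 'FOO\nbar'

# Neither flag alone is enough for this string.
only_i = re.search('foo.bar', text, re.I)   # dot still rejects '\n'
only_s = re.search('foo.bar', text, re.S)   # 'foo' still rejects 'FOO'

# Both flags together produce a match.
both = re.search('foo.bar', text, re.I | re.S)

print(only_i)  # None
print(only_s)  # None
print(both)
```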

## `re.X`
## `re.VERBOSE`

Allows inclusion of whitespace and comments within a regex.

The VERBOSE flag specifies a few special behaviors:
* The regex parser ignores all whitespace unless it’s within a character class or escaped with a backslash.
* If the regex contains a # character that isn’t contained within a character class or escaped with a backslash, then the parser ignores it and all characters to the right of it.

What’s the use of this? It allows you to format a regex in Python so that it’s more readable and self-documenting.

Here’s an example showing how you might put this to use. Suppose you want to parse phone numbers that have the following format:

* Optional three-digit area code, in parentheses
* Optional whitespace
* Three-digit prefix
* Separator (either '-' or '.')
* Four-digit line number

The following regex does the trick:

In [295]:
regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'
# This works, but it's dense and hard to read at a glance.

In [296]:
re.search(regex, '414.9229')

<re.Match object; span=(0, 8), match='414.9229'>

In [297]:
re.search(regex, '414-9229')

<re.Match object; span=(0, 8), match='414-9229'>

In [298]:
re.search(regex, '(712)414-9229')

<re.Match object; span=(0, 13), match='(712)414-9229'>

But r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$' is an eyeful, isn’t it? Using the VERBOSE flag, you can write the same regex in Python like this instead:



In [299]:
regex = r'''^               # Start of string
             (\(\d{3}\))?    # Optional area code 
             \s*             # Optional whitespace
             \d{3}           # Three-digit prefix
             [-.]            # Separator character
             \d{4}           # Four-digit line number
             $               # Anchor at end of string
             '''

In [300]:
re.search(regex, '414.9229', re.VERBOSE)
# (\(\d{3}\))?  optional area code (absent here)
# \s*           zero or more whitespace characters
# \d{3}         exactly three digits ('414' in this case)
# [-.]          either '-' or '.'
# \d{4}         exactly four digits ('9229' in this case)
# $             anchor at the end of the string

<re.Match object; span=(0, 8), match='414.9229'>

In [301]:
re.search(regex, '414-9229', re.VERBOSE)

<re.Match object; span=(0, 8), match='414-9229'>

In [302]:
re.search(regex, '(712)414-9229', re.X)

<re.Match object; span=(0, 13), match='(712)414-9229'>

The `re.search()` calls are the same as those shown above, so you can see that this regex works the same as the one specified earlier. But it’s easier to understand at first glance.

Note that triple quoting makes it particularly convenient to include embedded newlines, which qualify as ignored whitespace in VERBOSE mode.
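One way to convince yourself that the two spellings behave identically is to run both against the same inputs. This is a minimal sketch: it uses `re.compile()`, which this tutorial hasn't covered, and the `samples` list is made up for illustration.

```python
import re

compact = re.compile(r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$')
verbose = re.compile(r'''
    ^               # Start of string
    (\(\d{3}\))?    # Optional area code
    \s*             # Optional whitespace
    \d{3}           # Three-digit prefix
    [-.]            # Separator character
    \d{4}           # Four-digit line number
    $               # Anchor at end of string
    ''', re.VERBOSE)

# Illustrative inputs: the last two lack a valid separator.
samples = ['414.9229', '(712) 414-9229', '414 9229', '4149229']

for sample in samples:
    compact_hit = compact.search(sample) is not None
    verbose_hit = verbose.search(sample) is not None
    print(f'{sample!r:18} compact={compact_hit} verbose={verbose_hit}')
```

Both patterns accept the first two samples and reject the last two.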

When using the VERBOSE flag, be mindful of whitespace that you do intend to be significant. Consider these examples:

In [305]:
# i.e.1
re.search('foo bar', 'foo bar')


<re.Match object; span=(0, 7), match='foo bar'>

In [307]:
# ie.2
print(re.search('foo bar', 'foo bar', re.VERBOSE))
# In VERBOSE mode the parser ignores the unescaped space,
# so the regex is effectively 'foobar' and doesn't match.

None


In [309]:
# i.e.3
re.search(r'foo\ bar', 'foo bar', re.VERBOSE)
# A backslash-escaped space stays significant in VERBOSE mode.

<re.Match object; span=(0, 7), match='foo bar'>

In [311]:
# i.e.4
re.search('foo[ ]bar', 'foo bar', re.VERBOSE)
# A space inside a character class stays significant in VERBOSE mode.

<re.Match object; span=(0, 7), match='foo bar'>

After all you’ve seen to this point, you may be wondering why on **ie.2** the regex `foo bar` doesn’t match the string `'foo bar'`. It doesn’t because the VERBOSE flag causes the parser to ignore the space character.

To make this match as expected, escape the space character with a backslash or include it in a character class, as shown on **i.e. 3** and **4**.

As with the DOTALL flag, note that the VERBOSE flag has a non-intuitive short name: `re.X`, not `re.V`.
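The same care applies to a literal `#`, which starts a comment in VERBOSE mode unless it is escaped or placed in a character class. The sketch below uses a made-up `'item #42'` string to show both the pitfall and the fix.

```python
import re

# Unescaped: the space is ignored and '#' starts a comment,
# so the effective regex is just 'item'.
loose = re.search(r'item #(\d+)', 'item #42', re.VERBOSE)
print(loose.group())

# Escaped: [ ] keeps the space and \# keeps the '#' literal.
strict = re.search(r'item[ ]\#(\d+)', 'item #42', re.VERBOSE)
print(strict.group(), strict.group(1))
```

The first search still succeeds, but it matches only `'item'`, which is the kind of silent surprise worth watching for.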

## `re.DEBUG`
Displays debugging information.

The DEBUG flag causes the regex parser in Python to display debugging information about the parsing process to the console:

In [312]:
re.search('foo.bar', 'fooxbar', re.DEBUG)

LITERAL 102
LITERAL 111
LITERAL 111
ANY None
LITERAL 98
LITERAL 97
LITERAL 114

 0. INFO 12 0b1 7 7 (to 13)
      prefix_skip 3
      prefix [0x66, 0x6f, 0x6f] ('foo')
      overlap [0, 0, 0]
13: LITERAL 0x66 ('f')
15. LITERAL 0x6f ('o')
17. LITERAL 0x6f ('o')
19. ANY
20. LITERAL 0x62 ('b')
22. LITERAL 0x61 ('a')
24. LITERAL 0x72 ('r')
26. SUCCESS


<re.Match object; span=(0, 7), match='fooxbar'>

When the parser displays LITERAL nnn in the debugging output, it’s showing the ASCII code of a literal character in the regex. In this case, the literal characters are `'f', 'o', 'o'` and `'b', 'a', 'r'`.
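If you want to translate those numbers back into characters yourself, the built-in `chr()` function does it. The list of codes below is copied by hand from the `LITERAL` lines of the debugging output above.

```python
# Code points copied from the LITERAL lines of the DEBUG output.
codes = [102, 111, 111, 98, 97, 114]

decoded = [chr(code) for code in codes]
print(decoded)

# ord() goes the other way, from character to code point.
print([ord(char) for char in 'foo'])
```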

Here’s a more complicated example. This is the phone number regex shown in the discussion on the VERBOSE flag earlier:

In [313]:
regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'

In [314]:
re.search(regex, '414.9229', re.DEBUG)

AT AT_BEGINNING
MAX_REPEAT 0 1
  SUBPATTERN 1 0 0
    LITERAL 40
    MAX_REPEAT 3 3
      IN
        CATEGORY CATEGORY_DIGIT
    LITERAL 41
MAX_REPEAT 0 MAXREPEAT
  IN
    CATEGORY CATEGORY_SPACE
MAX_REPEAT 3 3
  IN
    CATEGORY CATEGORY_DIGIT
IN
  LITERAL 45
  LITERAL 46
MAX_REPEAT 4 4
  IN
    CATEGORY CATEGORY_DIGIT
AT AT_END

 0. INFO 4 0b0 8 MAXREPEAT (to 5)
 5: AT BEGINNING
 7. REPEAT 21 0 1 (to 29)
11.   MARK 0
13.   LITERAL 0x28 ('(')
15.   REPEAT_ONE 9 3 3 (to 25)
19.     IN 4 (to 24)
21.       CATEGORY UNI_DIGIT
23.       FAILURE
24:     SUCCESS
25:   LITERAL 0x29 (')')
27.   MARK 1
29: MAX_UNTIL
30. REPEAT_ONE 9 0 MAXREPEAT (to 40)
34.   IN 4 (to 39)
36.     CATEGORY UNI_SPACE
38.     FAILURE
39:   SUCCESS
40: REPEAT_ONE 9 3 3 (to 50)
44.   IN 4 (to 49)
46.     CATEGORY UNI_DIGIT
48.     FAILURE
49:   SUCCESS
50: IN 5 (to 56)
52.   RANGE 0x2d 0x2e ('-'-'.')
55.   FAILURE
56: REPEAT_ONE 9 4 4 (to 66)
60.   IN 4 (to 65)
62.     CATEGORY UNI_DIGIT
64.     FAILURE
65:   SUCCESS


<re.Match object; span=(0, 8), match='414.9229'>

This looks like a lot of esoteric information that you’d never need, but it can be useful. See the Deep Dive below for a practical application.

## Deep Dive: Debugging Regular Expression Parsing

As you know from above, the metacharacter sequence {m,n} indicates a specific number of repetitions. It matches anywhere from m to n repetitions of what precedes it:

In [315]:
re.search('x[123]{2,4}y', 'x222y')

<re.Match object; span=(0, 5), match='x222y'>

In [316]:
re.search('x[123]{2,4}y', 'x222y', re.DEBUG)

LITERAL 120
MAX_REPEAT 2 4
  IN
    LITERAL 49
    LITERAL 50
    LITERAL 51
LITERAL 121

 0. INFO 8 0b1 4 6 (to 9)
      prefix_skip 1
      prefix [0x78] ('x')
      overlap [0]
 9: LITERAL 0x78 ('x')
11. REPEAT_ONE 10 2 4 (to 22)
15.   IN 5 (to 21)
17.     RANGE 0x31 0x33 ('1'-'3')
20.     FAILURE
21:   SUCCESS
22: LITERAL 0x79 ('y')
24. SUCCESS


<re.Match object; span=(0, 5), match='x222y'>

* `MAX_REPEAT 2 4` confirms that the regex parser recognizes the metacharacter sequence `{2,4}` and interprets it as a range quantifier.

But, as noted previously, if a pair of curly braces in a regex in Python contains anything other than a valid number or numeric range, then it loses its special meaning.

You can verify this also:

In [318]:
re.search('x[123]{foo}y', 'x222y', re.DEBUG)

LITERAL 120
IN
  LITERAL 49
  LITERAL 50
  LITERAL 51
LITERAL 123
LITERAL 102
LITERAL 111
LITERAL 111
LITERAL 125
LITERAL 121

 0. INFO 8 0b1 8 8 (to 9)
      prefix_skip 1
      prefix [0x78] ('x')
      overlap [0]
 9: LITERAL 0x78 ('x')
11. IN 5 (to 17)
13.   RANGE 0x31 0x33 ('1'-'3')
16.   FAILURE
17: LITERAL 0x7b ('{')
19. LITERAL 0x66 ('f')
21. LITERAL 0x6f ('o')
23. LITERAL 0x6f ('o')
25. LITERAL 0x7d ('}')
27. LITERAL 0x79 ('y')
29. SUCCESS


You can see that there’s no MAX_REPEAT token in the debug output. The LITERAL tokens indicate that the parser treats {foo} literally and not as a quantifier metacharacter sequence. 123, 102, 111, 111, and 125 are the ASCII codes for the characters in the literal string '{foo}'.

Information displayed by the DEBUG flag can help you troubleshoot by showing you how the parser is interpreting your regex.
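Since the DEBUG output is printed rather than returned, checking it in code takes a small workaround. The sketch below captures it with `contextlib.redirect_stdout`; the helper name `debug_dump` is hypothetical, and the approach assumes the debugging text goes to standard output, as it does in the session shown here.

```python
import contextlib
import io
import re


def debug_dump(pattern):
    """Return the text that re.DEBUG prints while compiling pattern."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


# A valid range quantifier shows up as a MAX_REPEAT node ...
print('MAX_REPEAT' in debug_dump(r'x[123]{2,4}y'))

# ... while an invalid one is parsed as plain literal characters.
print('MAX_REPEAT' in debug_dump(r'x[123]{foo}y'))
```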

In [320]:
print(re.search('x[123]{foo}y', 'x222y'))

None
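The search above fails because `{foo}` is matched literally. A quick sketch with a made-up search string that actually contains the braces confirms this:

```python
import re

# '{foo}' is not a valid quantifier, so the braces match themselves.
match = re.search('x[123]{foo}y', 'x2{foo}y')
print(match)
```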


Curiously, the re module doesn’t define a single-letter version of the DEBUG flag. You could define your own if you wanted to:

In [322]:
import re
# Accessing re.D at this point raises:
# AttributeError: module 're' has no attribute 'D'

In [323]:
re.D = re.DEBUG

In [324]:
re.search('foo','foo', re.D)

LITERAL 102
LITERAL 111
LITERAL 111

 0. INFO 12 0b11 3 3 (to 13)
      prefix_skip 3
      prefix [0x66, 0x6f, 0x6f] ('foo')
      overlap [0, 0, 0]
13: LITERAL 0x66 ('f')
15. LITERAL 0x6f ('o')
17. LITERAL 0x6f ('o')
19. SUCCESS


<re.Match object; span=(0, 3), match='foo'>

But this might be more confusing than helpful, as readers of your code might misconstrue it as an abbreviation for the DOTALL flag. If you did make this assignment, it would be a good idea to document it thoroughly.

## `re.A`
## `re.ASCII`
## `re.U`
## `re.UNICODE`
## `re.L`
## `re.LOCALE`

Specify the character encoding used for parsing of special regex character classes.

Several of the regex metacharacter sequences (\w, \W, \b, \B, \d, \D, \s, and \S) require you to assign characters to certain classes like word, digit, or whitespace. The flags in this group determine the encoding scheme used to assign characters to these classes. The possible encodings are ASCII, Unicode, or according to the current locale.

You had a brief introduction to character encoding and Unicode in the tutorial on Strings and Character Data in Python, under the discussion of the `ord()` built-in function.

Why does character encoding matter for regexes? The characters `0` through `9` aren’t the only digit characters in the world’s writing systems. For example, here’s a string that consists of three Devanagari digit characters:

In [325]:
s = '\u0967\u096a\u096c'
s

'१४६'

For the regex parser to properly account for the Devanagari script, the digit metacharacter sequence \d must match each of these characters as well.

The Unicode Consortium created Unicode to handle this problem. Unicode is a character-encoding standard designed to represent all the world’s writing systems. All strings in Python 3, including regexes, are Unicode by default.

<br>So then, back to the flags listed above. These flags help to determine whether a character falls into a given class by specifying whether the encoding used is ASCII, Unicode, or the current locale:

* **re.U** and **re.UNICODE** specify Unicode encoding. Unicode is the default, so these flags are superfluous. They’re mainly supported for backward compatibility.
* **re.A** and **re.ASCII** force a determination based on ASCII encoding. If you happen to be operating in English, then this is happening anyway, so the flag won’t affect whether or not a match is found.
* **re.L** and **re.LOCALE** make the determination based on the current locale. Locale is an outdated concept and isn’t considered reliable, and in current Python 3 versions this flag can be used only with bytes patterns. Except in rare circumstances, you’re not likely to need it.
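You can confirm that Unicode is the default by inspecting the `flags` attribute of a compiled pattern. This is a small sketch that relies on `re.compile()`, which this tutorial hasn't covered.

```python
import re

# A str pattern compiled with no flags reports UNICODE as already set.
default_pattern = re.compile(r'\w+')
print(bool(default_pattern.flags & re.UNICODE))

# With re.ASCII, the UNICODE flag is no longer set on the pattern.
ascii_pattern = re.compile(r'\w+', re.ASCII)
print(bool(ascii_pattern.flags & re.UNICODE))
print(bool(ascii_pattern.flags & re.ASCII))
```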

Using the default Unicode encoding, the regex parser should be able to handle any language you throw at it. In the following example, it correctly recognizes each of the characters in the string '१४६' as a digit:

In [326]:
s = '\u0967\u096a\u096c'

In [328]:
print(s)

१४६


In [329]:
re.search(r'\d+', s)


<re.Match object; span=(0, 3), match='१४६'>
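For contrast, here is a short sketch of the same search with `re.ASCII`, where `\d` is limited to the characters `0` through `9`. The final `int()` call is an aside showing that Python's own number parsing accepts these digits too.

```python
import re

s = '\u0967\u096a\u096c'  # Devanagari digits one, four, six

# Default (Unicode) matching treats these as digits; ASCII matching does not.
print(re.search(r'\d+', s))
print(re.search(r'\d+', s, re.ASCII))

# int() also understands Unicode decimal digits.
print(int(s))
```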

Here’s another example that illustrates how character encoding can affect a regex match in Python. Consider this string:

In [331]:
s = 'sch\u00f6n'
print(s)


schön


`'schön'` (the German word for pretty or nice) contains the `'ö'` character, which has the 16-bit hexadecimal Unicode value `00f6`. This character isn’t representable in traditional 7-bit ASCII.

If you’re working in German, then you should reasonably expect the regex parser to consider all of the characters in `'schön'` to be word characters. But take a look at what happens if you search `s` for word characters using the `\w` character class and force ASCII encoding:

In [332]:
re.search(r'\w+', s, re.ASCII)

<re.Match object; span=(0, 3), match='sch'>

When you restrict the encoding to `ASCII`, the regex parser recognizes only the first three characters as word characters. The match stops at `'ö'`.

On the other hand, if you specify `re.UNICODE` or allow the encoding to default to Unicode, then all the characters in `'schön'` qualify as word characters:

In [333]:
re.search(r'\w+', s, re.UNICODE)

<re.Match object; span=(0, 5), match='schön'>

In [334]:
re.search(r'\w+', s)

<re.Match object; span=(0, 5), match='schön'>

The `ASCII` and `LOCALE` flags are available in case you need them for special circumstances. But in general, the best strategy is to use the default Unicode encoding. This should handle any world language correctly.

### Combining `<flags>` Arguments in a Function Call

Flag values are defined so that you can combine them using the bitwise OR (|) operator. This allows you to specify several flags in a single function call:

In [336]:
re.search('^bar','FOO\nBAR\nBAZ', re.I|re.M)

<re.Match object; span=(4, 7), match='BAR'>

This `re.search()` call uses bitwise OR to specify both the `IGNORECASE` and `MULTILINE` flags at once.
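To see that both flags really are needed here, the following sketch tries each one alone before combining them. The variable names `text` and `flags` are illustrative.

```python
import re

text = 'FOO\nBAR\nBAZ'

# Each flag alone is not enough: the match needs both at once.
print(re.search('^bar', text, re.I))   # 'BAR' is not at the start of the string
print(re.search('^bar', text, re.M))   # case differs
print(re.search('^bar', text, re.I | re.M))

# Flags can also be collected in a variable before the call.
flags = re.IGNORECASE | re.MULTILINE
print(flags == (re.I | re.M))
```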

### Setting and Clearing Flags Within a Regular Expression
In addition to being able to pass a `<flags>` argument to most re module function calls, you can also modify flag values within a regex in Python. There are two regex metacharacter sequences that provide this capability.



## `(?<flags>)`

Sets flag value(s) for the duration of a regex.

Within a regex, the metacharacter sequence `(?<flags>)` sets the specified flags for the entire expression.

The value of `<flags>` is one or more letters from the set a, i, L, m, s, u, and x. Here’s how they correspond to the re module flags:

| Letter | Flags |
|---:|:-------------|
| `a` | `re.A`, `re.ASCII` |
| `i` | `re.I`, `re.IGNORECASE` |
| `L` | `re.L`, `re.LOCALE` |
| `m` | `re.M`, `re.MULTILINE` |
| `s` | `re.S`, `re.DOTALL` |
| `u` | `re.U`, `re.UNICODE` |
| `x` | `re.X`, `re.VERBOSE` |

The `(?<flags>)` metacharacter sequence as a whole matches the empty string. It always matches successfully and doesn’t consume any of the search string.

The following examples are equivalent ways of setting the IGNORECASE and MULTILINE flags:

In [337]:
re.search('^bar', 'FOO\nBAR\nBAZ\n', re.I|re.M)
#re.I ignorecase 
#re.M Multiline

<re.Match object; span=(4, 7), match='BAR'>

In [338]:
re.search('(?im)^bar', 'FOO\nBAR\nBAZ\n')

<re.Match object; span=(4, 7), match='BAR'>
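Another way to check the equivalence is to compare the `flags` attribute of the compiled patterns. This sketch assumes `re.compile()`, which isn't covered in this tutorial; the names `inline` and `passed` are illustrative.

```python
import re

# Inline flags are recorded on the compiled pattern just like passed-in flags.
inline = re.compile('(?im)^bar')
passed = re.compile('^bar', re.I | re.M)

print(inline.flags == passed.flags)
print(bool(inline.flags & re.IGNORECASE), bool(inline.flags & re.MULTILINE))
```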

Note that a `(?<flags>)` metacharacter sequence sets the given flag(s) for the entire regex no matter where you place it in the expression:

In [339]:
re.search('foo.bar(?s).baz', 'foo\nbar\nbaz')


<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>

In [340]:
re.search('foo.bar.baz(?s)', 'foo\nbar\nbaz')


<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>

In the above examples, both dot metacharacters match newlines because the DOTALL flag is in effect. This is true even when (?s) appears in the middle or at the end of the expression.

Placing `(?<flags>)` anywhere in a regex other than at the beginning has been deprecated since Python 3.6:

In [341]:
import sys
sys.version

'3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]'

In [342]:
re.search('foo.bar.baz(?s)', 'foo\nbar\nbaz')

<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>

On the Python version shown (3.9), it still produces the appropriate match, but you’ll get a `DeprecationWarning`. As of Python 3.11, the same call raises an error instead.
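The portable spelling is to put the flag group first, or to pass the flag as an argument. A minimal sketch, with `text` as an illustrative variable name:

```python
import re

text = 'foo\nbar\nbaz'

# Placing (?s) first sets DOTALL for the whole regex.
print(re.search('(?s)foo.bar.baz', text))

# It is equivalent to passing the flag as an argument.
print(re.search('foo.bar.baz', text, re.DOTALL))
```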

### `(?<set_flags>-<remove_flags>:<regex>)`

Sets or removes flag value(s) for the duration of a group.

`(?<set_flags>-<remove_flags>:<regex>)` defines a non-capturing group that matches against `<regex>`. For the `<regex>` contained in the group, the regex parser sets any flags specified in `<set_flags>` and clears any flags specified in `<remove_flags>`.

Values for `<set_flags>` and `<remove_flags>` are most commonly `i`, `m`, `s`, or `x`.

In the following example, the IGNORECASE flag is set for the specified group:

In [344]:
re.search('(?i:foo)bar', 'FOObar')

<re.Match object; span=(0, 6), match='FOObar'>

This produces a match because `(?i:foo)` dictates that the match against `'FOO'` is case insensitive.

Now contrast that with this example:

In [346]:
print(re.search('(?i:foo)bar', 'FOOBAR'))

None


As in the previous example, the match against 'FOO' would succeed because it’s case insensitive. But once outside the group, IGNORECASE is no longer in effect, so the match against 'BAR' is case sensitive and fails.

Here’s an example that demonstrates turning a flag off for a group:

In [347]:
print(re.search('(?-i:foo)bar', 'FOOBAR', re.IGNORECASE))
# No match: (?-i:foo) turns IGNORECASE off inside the group,
# so 'foo' can't match 'FOO'.

None


Again, there’s no match. Although re.IGNORECASE enables case-insensitive matching for the entire call, the metacharacter sequence `(?-i:foo)` turns off IGNORECASE for the duration of that group, so the match against 'FOO' fails.
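Setting and clearing can also be combined in one group. The sketch below turns DOTALL on and IGNORECASE off inside the group while IGNORECASE stays in effect for the rest of the regex. The pattern and the search strings are made up for illustration.

```python
import re

# Inside the group: DOTALL on, IGNORECASE off.
# Outside the group: IGNORECASE from the function call applies.
pattern = r'(?s-i:foo.bar)baz'

print(re.search(pattern, 'foo\nbarBAZ', re.IGNORECASE))
print(re.search(pattern, 'FOO\nbarBAZ', re.IGNORECASE))
```

The first search matches because the dot crosses the newline and `'BAZ'` is compared case-insensitively. The second fails because `'FOO'` is compared case-sensitively inside the group.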

### Conclusion
This concludes your introduction to regular expression matching and Python’s re module. Congratulations! You’ve mastered a tremendous amount of material.

You now know how to:

* Use **re.search()** to perform regex matching in Python
* Create complex pattern matching searches with regex **metacharacters**
* Tweak regex parsing behavior with **flags**