# Regexes in Python and Their Uses

Imagine you have a string object s. Now suppose you need to write Python code to find out whether s contains the substring '123'. There are at least a couple of ways to do this. You could use the in operator:

In [3]:
s = 'foo123bar'
'123' in s

True

If you want to know not only whether '123' exists in s but also where it exists, then you can use .find() or .index(). Each of these returns the character position within s where the substring resides:



In [4]:
s.find('123')

3

In [5]:
s.index('123')

3
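The two methods differ when the substring is absent, which matters if you plan to use the result as an index. A minimal sketch, where `'456'` is just an illustrative substring that does not occur in `s`:

```python
s = 'foo123bar'

# .find() signals "not found" with -1
missing_pos = s.find('456')

# .index() raises ValueError instead of returning a sentinel value
try:
    s.index('456')
    index_raised = False
except ValueError:
    index_raised = True

print(missing_pos, index_raised)
```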

In these examples, the matching is done by a straightforward character-by-character comparison. That will get the job done in many cases. But sometimes, the problem is more complicated than that.

<br>For example, rather than searching for a fixed substring like '123', suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.

<br>Strict character comparisons won’t cut it here. This is where regexes in Python come to the rescue.

# The `re` Module

Regex functionality in Python resides in a module named re. The re module contains many useful functions and methods, most of which you’ll learn about in the next tutorial in this series.

For now, you’ll focus predominantly on one function, `re.search()`.



### `re.search(<regex>, <string>)`

Scans a string for a regex match.

`re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches. If a match is found, then `re.search()` returns a match object. Otherwise, it returns `None`.

`re.search()` takes an optional third `<flags>` argument that you’ll learn about at the end of this tutorial.

In [12]:
# importing regex module

import re 


## First Pattern-Matching Example
Now that you know how to gain access to re.search(), you can give it a try:

In [15]:
s = 'foo123bar'

re.search('123',s)

<re.Match object; span=(3, 6), match='123'>

A match object is **truthy**, so you can use it in a Boolean context like a conditional statement:

In [16]:
if re.search('123',s):
    print('Found a match')
else:
    print("no match.")

Found a match


The interpreter displays the match object as `<re.Match object; span=(3, 6), match='123'>`. This contains some useful information.

span=(3, 6) indicates the portion of <string> in which the match was found. This means the same thing as it would in slice notation:



In [17]:
s[3:6]

'123'

In this example, the match starts at character position 3 and extends up to but not including position 6.
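The same information can be read programmatically. As a brief preview, the sketch below uses the match object's `.span()`, `.start()`, `.end()`, and `.group()` methods; the variable names are only illustrative.

```python
import re

s = 'foo123bar'
m = re.search('123', s)

start, end = m.span()      # the same pair displayed as span=(3, 6)
matched_text = m.group()   # the text that matched

# Slicing the original string with the span gives the matched text back
assert s[start:end] == matched_text
print(m.start(), m.end(), matched_text)
```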

## Python Regex Metacharacters

The real power of regex matching in Python emerges when `<regex>` contains special characters called **metacharacters**. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.

<br>In a regex, a set of characters specified in square brackets ([]) makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:

In [18]:
s = 'foo123bar'

re.search('[0-9][0-9][0-9]',s)

<re.Match object; span=(3, 6), match='123'>

[0-9] matches any single decimal digit character—any character between '0' and '9', inclusive. The full expression [0-9][0-9][0-9] matches any sequence of three decimal digit characters. In this case, s matches because it contains three consecutive decimal digit characters, '123'.

These strings also match:

In [19]:
re.search('[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [20]:
re.search('[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [21]:
re.search('[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

On the other hand, a string that doesn’t contain three consecutive digits won’t match:

In [24]:
print(re.search('[0-9][0-9][0-9]', '12foo34'))

None


With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the in operator or with string methods.
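The checks above can be wrapped in a small helper. This is only a sketch: `has_three_digits` is a hypothetical name chosen for illustration, and it relies on the fact that `re.search()` returns `None` when there is no match.

```python
import re

def has_three_digits(text):
    """Return True if text contains three consecutive decimal digits."""
    return re.search('[0-9][0-9][0-9]', text) is not None

samples = ['foo123bar', 'foo456bar', '234baz', 'qux678', '12foo34']
results = {text: has_three_digits(text) for text in samples}
for text, found in results.items():
    print(f'{text:10} {found}')
```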

<br>Take a look at another regex metacharacter. The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard:

In [25]:
s = 'foo123bar'
re.search('1.3', s)

<re.Match object; span=(3, 6), match='123'>

In [26]:
s = 'foo13bar'
print(re.search('1.3', s))

None


In the first example, the regex 1.3 matches '123' because the '1' and '3' match literally, and the . matches the '2'. Here, you’re essentially asking, “Does s contain a '1', then any character (except a newline), then a '3'?” The answer is yes for 'foo123bar' but no for 'foo13bar'.

## Metacharacters Supported by the re Module

In [27]:
s = 'foo123bar'
re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In this case, 123 is technically a regex, but it’s not a very interesting one because it doesn’t contain any metacharacters. It just matches the string '123'.

Things get much more exciting when you throw metacharacters into the mix. The following sections explain in detail how you can use each metacharacter or metacharacter sequence to enhance pattern-matching functionality.



## Metacharacters That Match a Single Character
The metacharacter sequences in this section try to match a single character from the search string. When the regex parser encounters one of these metacharacter sequences, a match happens if the character at the current parsing position fits the description given by the sequence.

### `[]`

Specifies a specific set of characters to match.

Characters contained in square brackets ([]) represent a character class—an enumerated set of characters to match from. A character class metacharacter sequence will match any single character contained in the class.

You can enumerate the characters individually like this:

In [28]:
re.search('ba[artz]', 'foobarqux')

<re.Match object; span=(3, 6), match='bar'>

In [29]:
re.search('ba[artz]', 'foobazqux')

<re.Match object; span=(3, 6), match='baz'>

The metacharacter sequence [artz] matches any single 'a', 'r', 't', or 'z' character. In the example, the regex ba[artz] matches both 'bar' and 'baz' (and would also match 'baa' and 'bat').

A character class can also contain a range of characters separated by a hyphen (-), in which case it matches any single character within the range. For example, `[a-z]` matches any lowercase alphabetic character between 'a' and 'z', inclusive:

In [31]:
re.search('[a-z]', 'FOObar')

<re.Match object; span=(3, 4), match='b'>

[0-9] matches any digit character:

In [32]:
re.search('[0-9][0-9]','foo123bar')

<re.Match object; span=(3, 5), match='12'>

In this case, [0-9][0-9] matches a sequence of two digits. The first portion of the string 'foo123bar' that matches is '12'.

<br>[0-9a-fA-F] matches any hexadecimal digit character:

In [34]:
re.search('[0-9a-fA-F]','--- a0 ---')

<re.Match object; span=(4, 5), match='a'>

Here, [0-9a-fA-F] matches the first hexadecimal digit character in the search string, 'a'.

You can complement a character class by specifying ^ as the first character, in which case it matches any character that isn’t in the set. In the following example, [^0-9] matches any character that isn’t a digit:

In [35]:
re.search('[^0-9]', '12345foo')

<re.Match object; span=(5, 6), match='f'>

Here, the match object indicates that the first character in the string that isn’t a digit is 'f'.

<br>If a `^` character appears in a character class but isn’t the first character, then it has no special meaning and matches a literal `'^'` character:

In [36]:
re.search('[#:^]','foo^bar:baz#qux')

<re.Match object; span=(3, 4), match='^'>

As you’ve seen, you can specify a range of characters in a character class by separating characters with a hyphen. What if you want the character class to include a literal hyphen character? You can place it as the first or last character or escape it with a backslash (\):

In [37]:
re.search('[-abc]','123-456')

<re.Match object; span=(3, 4), match='-'>

In [38]:
re.search('[abc-]','123-456')

<re.Match object; span=(3, 4), match='-'>

In [39]:
re.search('[ab\-c]','123-456')

<re.Match object; span=(3, 4), match='-'>

If you want to include a literal ']' in a character class, then you can place it as the first character or escape it with backslash:

In [40]:
re.search('[]]', 'foo[1]')

<re.Match object; span=(5, 6), match=']'>

In [43]:
re.search('[ab\]cd]','foo[1]')

<re.Match object; span=(5, 6), match=']'>

Other regex metacharacters lose their special meaning inside a character class:

In [44]:
re.search('[)*+|]', '123*456')

<re.Match object; span=(3, 4), match='*'>

In [45]:
re.search('[)*+|]', '123+456')

<re.Match object; span=(3, 4), match='+'>

Outside a character class, `*` and `+` have special meanings in a regex in Python. They designate repetition, which you’ll learn more about shortly. But in this example, they’re inside a character class, so they match themselves literally.
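The character class rules from this section can be collected into one table-driven check. This is a sketch rather than an exhaustive test: the `cases` list simply restates examples from above, and `.group()` (a match object method that returns the matched text) is used to read each result.

```python
import re

# Each entry: (pattern, search string, expected first match)
cases = [
    (r'[^0-9]', '12345foo', 'f'),   # a leading ^ complements the class
    (r'[#:^]', 'foo^bar', '^'),     # ^ elsewhere is a literal caret
    (r'[-abc]', '123-456', '-'),    # a leading hyphen is literal
    (r'[ab\-c]', '123-456', '-'),   # an escaped hyphen is literal
    (r'[]]', 'foo[1]', ']'),        # a leading ] is literal
    (r'[)*+|]', '123*456', '*'),    # * is literal inside a class
]

found = [re.search(pattern, text).group() for pattern, text, _ in cases]
for (pattern, text, expected), actual in zip(cases, found):
    print(f'{pattern:10} {text:10} {actual!r}')
```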

### `dot (.)`

Specifies a wildcard.

The `.` metacharacter matches any single character except a newline:

In [47]:
re.search('foo.bar', 'fooxbar')

<re.Match object; span=(0, 7), match='fooxbar'>

In [51]:
re.search('foo.bar', 'foobar')

In [50]:
re.search('foo.bar', 'foo\nbar')

As a regex, `foo.bar` essentially means the characters `'foo'`, then any character except newline, then the characters 'bar'. The first string shown above, `'fooxbar'`, fits the bill because the . metacharacter matches the 'x'.

<br>The second and third strings fail to match. In the last case, although there’s a character between `'foo'` and `'bar'`, it’s a newline, and by default, the . metacharacter doesn’t match a newline. There is, however, a way to force . to match a newline, which you’ll learn about at the end of this tutorial.

### \w
### \W

Match based on whether a character is a word character.


`\w` matches any alphanumeric word character. Word characters are uppercase and lowercase letters, digits, and the underscore `(_)` character, so `\w` is essentially shorthand for `[a-zA-Z0-9_]`:

In [53]:
re.search('\w', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In [54]:
 re.search('[a-zA-Z0-9_]', '#(.a$@&')

<re.Match object; span=(3, 4), match='a'>

In this case, the first word character in the string `'#(.a$@&'` is `'a'`.

`\W` is the opposite. It matches any non-word character and is equivalent to `[^a-zA-Z0-9_]`:



In [55]:
re.search('\W', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

In [56]:
re.search('[^a-zA-Z0-9_]', 'a_1*3Qb')

<re.Match object; span=(3, 4), match='*'>

Here, the first non-word character in 'a_1*3Qb' is '*'.

### \d
### \D

Match based on whether a character is a decimal digit.

`\d` matches any decimal digit character. `\D` is the opposite. It matches any character that isn’t a decimal digit:

In [57]:
re.search('\d','abc4def')

<re.Match object; span=(3, 4), match='4'>

In [59]:
re.search('\D','234Q678')

<re.Match object; span=(3, 4), match='Q'>

`\d` is essentially equivalent to `[0-9]`, and `\D` is equivalent to `[^0-9]`.

In [60]:
re.search('[^0-9]','234Q678')

<re.Match object; span=(3, 4), match='Q'>

### \s
### \S

Match based on whether a character represents whitespace.

`\s` matches any `whitespace` character:

In [62]:
re.search('\s', 'foo\nbar baz')

<re.Match object; span=(3, 4), match='\n'>

Note that, unlike the dot wildcard metacharacter, `\s` does match a newline character.

`\S` is the `opposite` of `\s`. It matches any character that `isn’t whitespace`:

In [63]:
re.search('\S', '  \n foo  \n  ')

<re.Match object; span=(4, 5), match='f'>

Again, \s and \S consider a newline to be whitespace. In the example above, the first non-whitespace character is 'f'.

The character class sequences \w, \W, \d, \D, \s, and \S can appear inside a square bracket character class as well:

In [65]:
re.search('[\d\w\s]','---3---')

<re.Match object; span=(3, 4), match='3'>

In [66]:
re.search('[\d\w\s]', '---a---')

<re.Match object; span=(3, 4), match='a'>

In [67]:
re.search('[\d\w\s]', '--- ---')

<re.Match object; span=(3, 4), match=' '>

In this case, `[\d\w\s]` matches any digit, word, or whitespace character. And since \w includes \d, the same character class could also be expressed slightly shorter as `[\w\s]`.
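One caveat about the ASCII equivalents given in this section: for ordinary string (`str`) patterns in Python 3, `\w`, `\d`, and `\s` are Unicode-aware, so they match more than `[a-zA-Z0-9_]`, `[0-9]`, and the ASCII whitespace characters. The sketch below illustrates the difference, and also shows the `re.ASCII` flag, which restricts the shorthands to their ASCII sets. The two sample characters are arbitrary choices for illustration.

```python
import re

e_acute = '\u00e9'        # LATIN SMALL LETTER E WITH ACUTE
arabic_three = '\u0663'   # ARABIC-INDIC DIGIT THREE

# \w matches a non-ASCII letter, but the explicit ASCII class does not
unicode_word = re.search(r'\w', e_acute)
ascii_class = re.search(r'[a-zA-Z0-9_]', e_acute)

# With re.ASCII, \w is limited to [a-zA-Z0-9_]
ascii_word = re.search(r'\w', e_acute, re.ASCII)

# \d matches a non-ASCII decimal digit, but [0-9] does not
unicode_digit = re.search(r'\d', arabic_three)
ascii_digit = re.search(r'[0-9]', arabic_three)

print(unicode_word, ascii_class, ascii_word, unicode_digit, ascii_digit)
```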

## Escaping Metacharacters
Occasionally, you’ll want to include a metacharacter in your regex, except you won’t want it to carry its special meaning. Instead, you’ll want it to represent itself as a literal character.

### backslash (`\`)

Removes the special meaning of a metacharacter.

As you’ve just seen, the backslash character can introduce special character classes like word, digit, and whitespace. There are also special metacharacter sequences called **anchors** that begin with a backslash, which you’ll learn about below.

When it’s not serving either of these purposes, the backslash **escapes** metacharacters. A metacharacter preceded by a backslash loses its special meaning and matches the literal character instead. Consider the following examples:

In [68]:
re.search('.', 'foo.bar')
# unescaped: . is a wildcard, so it matches the first character, not the '.'

<re.Match object; span=(0, 1), match='f'>

In [70]:
re.search('\.', 'foo.bar')
# escaped: \. matches only the literal '.' character

<re.Match object; span=(3, 4), match='.'>

In the first example, the dot (`.`) in the `<regex>` functions as a wildcard metacharacter, which matches the first character in the string (`'f'`). In the second example, the `.` character in the `<regex>` is escaped by a backslash, so it isn’t a wildcard. It’s interpreted literally and matches the `'.'` at index 3 of the search string.

Using backslashes for escaping can get messy. Suppose you have a string that contains a single backslash:

In [78]:
s = r'foo\bar'

In [79]:
print(s)

foo\bar


Now suppose you want to create a `<regex>` that will match the backslash between `'foo'` and `'bar'`. The backslash is itself a special character in a regex, so to specify a literal backslash, you need to escape it with another backslash. If that’s the case, then the following should work:

In [81]:
# re.search('\\', s)

Running that call (it is left commented out above) raises an `re.error` exception.

The problem here is that the backslash escaping happens twice, first by the Python interpreter on the string literal and then again by the regex parser on the regex it receives.

Here’s the sequence of events:

1. The Python interpreter is the first to process the string literal '\\'. It interprets that as an escaped backslash and passes only a single backslash to re.search().
2. The regex parser receives just a single backslash, which isn’t a meaningful regex, so the messy error ensues.

<br>There are two ways around this. First, you can escape both backslashes in the original string literal:

In [83]:
re.search('\\\\', s)

<re.Match object; span=(3, 4), match='\\'>

Doing so causes the following to happen:

1. The interpreter sees `'\\\\'` as a pair of escaped backslashes. It reduces each pair to a single backslash and passes `'\\'` to the regex parser.
2. The regex parser then sees `\\` as one escaped backslash. As a `<regex>`, that matches a single backslash character. You can see from the match object that it matched the backslash at index 3 in s as intended. It’s cumbersome, but it works.

The second, and probably cleaner, way to handle this is to specify the `<regex>` using a raw string, marked with the `r` prefix:

In [85]:
re.search(r'\\', s)

<re.Match object; span=(3, 4), match='\\'>

This suppresses the escaping at the interpreter level. The string '\\' gets passed unchanged to the regex parser, which again sees one escaped backslash as desired.

<br>It’s good practice to use a raw string to specify a regex in Python whenever it contains backslashes.
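To make the two levels of escaping concrete, the sketch below compares the regular and raw literals directly. The variable names are illustrative only.

```python
import re

escaped = '\\\\'   # regular literal: four backslashes typed in the source
raw = r'\\'        # raw literal: two backslashes typed in the source

# Both literals produce the same two-character string for the regex parser
print(len(escaped), len(raw), escaped == raw)

s = r'foo\bar'
m = re.search(raw, s)
print(m.span())
```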

## Anchors
Anchors are zero-width matches. They don’t match any actual characters in the search string, and they don’t consume any of the search string during parsing. Instead, an anchor dictates a particular location in the search string where a match must occur.

### `^`
### `\A`

Anchor a match to the start of `<string>`.

When the regex parser encounters ^ or \A, the parser’s current position must be at the beginning of the search string for it to find a match.

In other words, regex ^foo stipulates that 'foo' must be present not just any old place in the search string, but at the beginning:

In [86]:
re.search('^foo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [87]:
print(re.search('^foo', 'barfoo'))

None


`\A` functions similarly:

In [88]:
re.search('\Afoo', 'foobar')

<re.Match object; span=(0, 3), match='foo'>

In [89]:
print(re.search('\Afoo', 'barfoo'))

None


`^` and `\A` behave slightly differently from each other in MULTILINE mode. You’ll learn more about MULTILINE mode below in the section on flags.

###  `$`
### `\Z`

Anchor a match to the end of `<string>`.

When the regex parser encounters `$` or `\Z`, the parser’s current position must be at the end of the search string for it to find a match. Whatever precedes `$` or `\Z` must constitute the end of the search string:

In [90]:
re.search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [91]:
print(re.search('bar$', 'barfoo'))

None


In [92]:
re.search('bar\Z', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

As a special case, `$` (but not `\Z`) also matches just before a single newline at the end of the search string:

In [93]:
re.search('bar$', 'foobar\n')

<re.Match object; span=(3, 6), match='bar'>

In this example, `'bar'` isn’t technically at the end of the search string because it’s followed by one additional newline character. But the regex parser lets it slide and calls it a match anyway. This exception doesn’t apply to `\Z`.
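A side-by-side sketch of that difference, using the same search string with both anchors (the variable names are illustrative):

```python
import re

text = 'foobar\n'

dollar = re.search(r'bar$', text)        # tolerates one trailing newline
strict_end = re.search(r'bar\Z', text)   # requires the true end of the string

print(dollar)
print(strict_end)
```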

`$` and `\Z` behave slightly differently from each other in MULTILINE mode. See the section below on flags for more information on MULTILINE mode.

### `\b`

Anchors a match to a word boundary.

`\b` asserts that the regex parser’s current position must be at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores `([a-zA-Z0-9_])`, the same as for the `\w` character class:

In [96]:
re.search(r'\bbar', 'foo bar')

<re.Match object; span=(4, 7), match='bar'>

In [97]:
re.search(r'\bbar', 'foo.bar')

<re.Match object; span=(4, 7), match='bar'>

In [98]:
print(re.search(r'\bbar', 'foobar'))

None


In [99]:
re.search(r'foo\b', 'foo.bar')

<re.Match object; span=(0, 3), match='foo'>

In [100]:
print(re.search(r'foo\b', 'foobar'))

None


In the above examples, a match happens in the first and second examples because there’s a word boundary at the start of `'bar'`. This isn’t the case in the third example, so the match fails there.

Similarly, there’s a match in the fourth example because a word boundary exists at the end of `'foo'`, but not in the fifth.

Using the `\b` anchor on both ends of the `<regex>` will cause it to match when it’s present in the search string as a whole word:

In [101]:
re.search(r'\bbar\b', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [102]:
re.search(r'\bbar\b', 'foo(bar)baz')

<re.Match object; span=(4, 7), match='bar'>

In [103]:
print(re.search(r'\bbar\b', 'foobarbaz'))

None
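The whole-word idiom can be packaged as a helper. This is a sketch under a few assumptions: `contains_word` is a hypothetical name, the word is expected to begin and end with a word character (otherwise `\b` has no boundary to anchor to), and `re.escape()` is used so that any metacharacters in the word are treated literally.

```python
import re

def contains_word(word, text):
    """Return True if word occurs in text as a whole word."""
    # re.escape() neutralizes any regex metacharacters inside word
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

checks = [
    contains_word('bar', 'foo bar baz'),
    contains_word('bar', 'foo(bar)baz'),
    contains_word('bar', 'foobarbaz'),
]
print(checks)
```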


## Quantifiers

A **quantifier** metacharacter immediately follows a portion of a `<regex>` and indicates how many times that portion must occur for the match to succeed.

### `*`

Matches zero or more repetitions of the preceding regex.

For example, `a*` matches zero or more `'a'` characters. That means it would match an empty string, `'a'`, `'aa'`, `'aaa'`, and so on.

Consider these examples:

In [106]:
re.search('foo-*bar', 'foobar')     # Zero dashes

<re.Match object; span=(0, 6), match='foobar'>

In [107]:
re.search('foo-*bar', 'foo-bar')                    # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [108]:
re.search('foo-*bar', 'foo--bar')                   # Two dashes

<re.Match object; span=(0, 8), match='foo--bar'>

In the first example, there are zero `'-'` characters between `'foo'` and `'bar'`. In the second there’s one, and in the third there are two. The metacharacter sequence `-*` matches in all three cases.

In [110]:
re.search('foo-*bar', 'foo------bar')   # * (zero or more occurrences)

<re.Match object; span=(0, 12), match='foo------bar'>

You’ll probably encounter the regex `.*` in a Python program at some point. This matches zero or more occurrences of any character. In other words, it essentially matches any character sequence up to a line break. (Remember that the `.` wildcard metacharacter doesn’t match a newline.)

In this example, `.*` matches everything between `'foo'` and `'bar'`:

In [112]:
re.search('foo.*bar', '# foo $qux@grault % bar #')
# .* matches every character between 'foo' and 'bar'


<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>

Did you notice the `span=` and `match=` information contained in the match object?

Until now, the regexes in the examples you’ve seen have specified matches of predictable length. Once you start using quantifiers like `*`, the number of characters matched can be quite variable, and the information in the match object becomes more useful.

You’ll learn more about how to access the information stored in a match object in the next tutorial in the series.

### `+`

Matches one or more repetitions of the preceding regex.

This is similar to `*`, but the quantified regex must occur at least once:

In [114]:
print(re.search('foo-+bar', 'foobar'))              # Zero dashes

# at least one occurrence of `-` 

None


In [115]:
re.search('foo-+bar', 'foo-bar')                    # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [116]:
re.search('foo-+bar', 'foo--bar')                    # Two dashes

<re.Match object; span=(0, 8), match='foo--bar'>

Remember from above that `foo-*bar` matched the string `'foobar'` because the `*` metacharacter allows for zero occurrences of `'-'`. The `+` metacharacter, on the other hand, requires at least one occurrence of `'-'`. That means there isn’t a match in the first example in this case.

### `?`

Matches zero or one repetitions of the preceding regex.

`?` is also similar to `*` and `+`, but in this case there’s only a match if the preceding regex occurs once or not at all:

In [118]:
re.search('foo-?bar', 'foobar')                       # Zero dashes

<re.Match object; span=(0, 6), match='foobar'>

In [117]:
re.search('foo-?bar', 'foo-bar')                         # One dash

<re.Match object; span=(0, 7), match='foo-bar'>

In [120]:
print(re.search('foo-?bar', 'foo--bar'))            # Two dashes
# the match fails because two '-' characters are present

None


In this example, there are matches in the first two cases. But in the third, where there are two `'-'` characters, the match fails.

In [121]:
print(re.search('foo--?bar', 'foo--bar')) 

<re.Match object; span=(0, 8), match='foo--bar'>


In [122]:
print(re.search('foo---?bar', 'foo--bar'))

<re.Match object; span=(0, 8), match='foo--bar'>


In [123]:
print(re.search('foo--?bar', 'foo---bar')) 
# three dashes: one more than --? allows

None


Here are some more examples showing the use of all three quantifier metacharacters:

In [128]:
re.match('foo[1-9]*bar', 'foobar')

# The pattern is 'foo', then [1-9]*, then 'bar'
# [1-9]* means zero or more digits from 1 to 9
# [1-9]+ means one or more digits from 1 to 9
# [1-9]? means zero or one digit from 1 to 9

<re.Match object; span=(0, 6), match='foobar'>

In [127]:
re.match('foo[1-9]*bar', 'foo42bar')

# [1-9]* allows zero or more digits, so two digits match

<re.Match object; span=(0, 8), match='foo42bar'>

In [131]:
print(re.match('foo[1-9]+bar', 'foobar'))

# [1-9]+ requires at least one digit
# Note: the result is None because the string contains no digits

None


In [133]:
re.match('foo[1-9]+bar', 'foo42bar')
# [1-9]+ requires at least one digit
# Note: the string contains two digits, so it matches

<re.Match object; span=(0, 8), match='foo42bar'>

In [134]:
re.match('foo[1-9]?bar', 'foobar')
# [1-9]? allows zero or one digit


<re.Match object; span=(0, 6), match='foobar'>

In [136]:
print(re.match('foo[1-9]?bar', 'foo42bar'))
# [1-9]? allows zero or one digit
# Note: the result is None because two digits are present

None


This time, the quantified regex is the character class `[1-9]` instead of the simple character `'-'`.
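The three quantifiers can be compared in one pass. In this sketch, `'foo4bar'` is an extra sample string added for illustration, and `re.search()` is used in place of `re.match()`; for these strings the outcome is the same because every match starts at position 0.

```python
import re

strings = ['foobar', 'foo4bar', 'foo42bar']
quantifiers = ['*', '+', '?']

# Record whether each quantified pattern matches each string
outcome = {}
for q in quantifiers:
    pattern = f'foo[1-9]{q}bar'
    for text in strings:
        outcome[(q, text)] = re.search(pattern, text) is not None
        print(f'{pattern:14} {text:10} {outcome[(q, text)]}')
```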

### *?
### +?
### ??

The non-greedy (or lazy) versions of the `*`, `+`, and `?` quantifiers.

When used alone, the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match. Consider this example:

In [137]:
re.search('<.*>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>

The regex `<.*>` effectively means:

- A `'<'` character
- Then any sequence of characters
- Then a `'>'` character

But which `'>'` character? There are three possibilities:

1. The one just after 'foo'
2. The one just after 'bar'
3. The one just after 'baz'

Since the `*` metacharacter is greedy, it dictates the longest possible match, which includes everything up to and including the `'>'` character that follows `'baz'`. You can see from the match object that this is the match produced.

If you want the shortest possible match instead, then use the non-greedy metacharacter sequence `*?`:

In [139]:
re.search('<.*?>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>

In this case, the match ends with the '>' character following 'foo'.
<br>**Note:** You could accomplish the same thing with the regex `<[^>]*>`, which means:
* A '<' character
* Then any sequence of characters other than '>'
* Then a '>' character

This is the only option available with some older parsers that don’t support lazy quantifiers. Happily, that’s not the case with the regex parser in Python’s re module.
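The greedy, lazy, and negated-class forms can be compared directly. The sketch below also shows one way the lazy and negated-class forms differ: `[^>]` matches a newline, while `.` does not by default. The `multiline` string is an illustrative value, not one from the examples above.

```python
import re

text = '%<foo> <bar> <baz>%'

greedy = re.search(r'<.*>', text).group()
lazy = re.search(r'<.*?>', text).group()
negated = re.search(r'<[^>]*>', text).group()
print(greedy, lazy, negated, sep='\n')

# The two short forms are not fully interchangeable when the text
# between the angle brackets contains a newline
multiline = '<foo\nbar>'
lazy_multiline = re.search(r'<.*?>', multiline)
negated_multiline = re.search(r'<[^>]*>', multiline)
print(lazy_multiline, negated_multiline.span())
```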

There are lazy versions of the + and ? quantifiers as well:

In [141]:
re.search('<.+>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>

In [142]:
re.search('<.+?>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>

In [148]:
re.search('ba?', 'baaaa')
# greedy ? matches one 'a' when it can

<re.Match object; span=(0, 2), match='ba'>

In [144]:
re.search('ba??', 'baaaa')

<re.Match object; span=(0, 1), match='b'>

The first two examples are similar to the examples shown above, only using `+` and `+?` instead of `*` and `*?`.

The last two examples are a little different. In general, the `?` metacharacter matches zero or one occurrences of the preceding regex. The greedy version, `?`, matches one occurrence when it can, so `ba?` matches `'b'` followed by a single `'a'`. The non-greedy version, `??`, matches zero occurrences when it can, so `ba??` matches just `'b'`.

In [147]:
re.search('baa??', 'baaaa')
# lazy ?? prefers zero occurrences of the second 'a', so only 'ba' matches

<re.Match object; span=(0, 2), match='ba'>

### `{m}`

Matches exactly `m` repetitions of the preceding regex.

This is similar to `*` or `+`, but it specifies exactly how many times the preceding regex must occur for a match to succeed:

In [150]:
print(re.search('x-{3}x', 'x--x'))                       # Two dashes
# -{3} requires exactly three '-' characters; otherwise the result is None

None


In [151]:
re.search('x-{3}x', 'x---x') 
# -{3} requires exactly three '-' characters; otherwise the result is None

<re.Match object; span=(0, 5), match='x---x'>

In [152]:
print(re.search('x-{3}x', 'x----x'))                      # Four dashes

None


Here, `x-{3}x` matches `'x'`, followed by exactly three instances of the `'-'` character, followed by another `'x'`. The match fails when there are fewer or more than three dashes between the `'x'` characters.
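The `{m}` quantifier also gives a shorter way to write the three-digit pattern from earlier in this tutorial. The sketch below checks that both spellings agree on a few sample strings, and shows that it is the surrounding `'x'` characters, not `{3}` alone, that reject a longer run of dashes. `span_or_none` is a hypothetical helper written for this example.

```python
import re

def span_or_none(pattern, text):
    """Return the span of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.span() if m else None

samples = ['foo123bar', '234baz', '12foo34']
longhand = [span_or_none('[0-9][0-9][0-9]', t) for t in samples]
shorthand = [span_or_none('[0-9]{3}', t) for t in samples]
print(longhand == shorthand, shorthand)

# Without the surrounding 'x' characters, -{3} still matches inside
# a longer run of dashes: it takes the first three
inside_run = span_or_none('-{3}', 'x----x')
print(inside_run)
```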

### `{m,n}`

Matches any number of repetitions of the preceding regex from m to n, inclusive.

In the following example, the quantified `<regex>` is `-{2,4}`. The match succeeds when there are two, three, or four dashes between the `'x'` characters but fails otherwise:

In [155]:
for i in range(1, 6):
    s = f"x{'-' * i}x"
    print(f'{i}  {s:10}', re.search('x-{2,4}x', s))

1  x-x        None
2  x--x       <re.Match object; span=(0, 4), match='x--x'>
3  x---x      <re.Match object; span=(0, 5), match='x---x'>
4  x----x     <re.Match object; span=(0, 6), match='x----x'>
5  x-----x    None


1. The result is `None` because `-` occurs fewer than 2 times.
2. It matches because `-` occurs exactly 2 times.
3. It matches because `-` occurs 3 times, which is between 2 and 4.
4. It matches because `-` occurs exactly 4 times, the upper bound.
5. The result is `None` because `-` occurs more than 4 times.
 


Omitting m implies a lower bound of 0, and omitting n implies an unlimited upper bound:


| Regular Expression | Matches | Equivalent To |
| --- | --- | --- |
| `<regex>{,n}` | Any number of repetitions of `<regex>` less than or equal to n | `<regex>{0,n}` |
| `<regex>{m,}` | Any number of repetitions of `<regex>` greater than or equal to m | `----` |
| `<regex>{,}` | Any number of repetitions of `<regex>` | `<regex>{0,}` or `<regex>*` |

If you omit all of m, n, and the comma, then the curly braces no longer function as metacharacters. {} matches just the literal string '{}':

In [156]:
re.search('x{}y', 'x{}y')

<re.Match object; span=(0, 4), match='x{}y'>

In fact, to have any special meaning, a sequence with curly braces must fit one of the following patterns in which m and n are nonnegative integers:

* {m}
* {m,n}
* {m,}
* {,n}
* {,}

Otherwise, it matches literally:

In [159]:
re.search('x{foo}y', 'x{foo}y')
# the braces are matched literally

<re.Match object; span=(0, 7), match='x{foo}y'>

In [160]:
re.search('x{a:b}y', 'x{a:b}y')
# the braces are also matched literally here

<re.Match object; span=(0, 7), match='x{a:b}y'>

In [163]:
re.search('x{1,3,5}y', 'x{1,3,5}y')
# the braces are matched literally

<re.Match object; span=(0, 9), match='x{1,3,5}y'>

In [164]:
re.search('x{foo,bar}y', 'x{foo,bar}y')
# the braces are matched literally

<re.Match object; span=(0, 11), match='x{foo,bar}y'>

Later in this tutorial, when you learn about the `DEBUG` flag, you’ll see how you can confirm this.
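When the text you want to match literally happens to look like a valid quantifier, the braces need to be escaped. A minimal sketch, using illustrative variable names:

```python
import re

# Unescaped, {1,3} is a quantifier: one to three 'x' characters, then 'y'
as_quantifier = re.search(r'x{1,3}y', 'xxy')

# So the same pattern does not match the literal text 'x{1,3}y'
literal_attempt = re.search(r'x{1,3}y', 'x{1,3}y')

# Escaping the braces makes them literal regardless of their contents
escaped = re.search(r'x\{1,3\}y', 'x{1,3}y')

print(as_quantifier)
print(literal_attempt)
print(escaped)
```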

### `{m,n}?`

The non-greedy (lazy) version of {m,n}.

`{m,n}` will match as many characters as possible, and `{m,n}?` will match as few as possible:

In [166]:
re.search('a{3,5}', 'aaaaaaaa')
# {m,n} will match as many characters as possible

<re.Match object; span=(0, 5), match='aaaaa'>

In [168]:
re.search('a{3,5}?', 'aaaaaaaa')
# {m,n}? will match as few as possible
# Note: the minimum here is 3, so only three 'a' characters are matched

<re.Match object; span=(0, 3), match='aaa'>

## Grouping Constructs and Backreferences

Grouping constructs break up a regex in Python into subexpressions or groups. This serves two purposes:

1. **Grouping:** A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit.
2. **Capturing:** Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. You can retrieve captured matches later through several different mechanisms.



### `(<regex>)`

Defines a subexpression or group.

This is the most basic grouping construct. A regex in parentheses just matches the contents of the parentheses:

In [169]:
re.search('(bar)', 'foo bar baz')
#matches with parentheses

<re.Match object; span=(4, 7), match='bar'>

In [171]:
re.search('bar', 'foo bar baz')
#matches without parentheses

<re.Match object; span=(4, 7), match='bar'>

As a regex, `(bar)` matches the string `'bar'`, the same as the regex bar would without the parentheses.

## Treating a Group as a Unit
A quantifier metacharacter that follows a group operates on the entire subexpression specified in the group as a single unit.

For instance, the following example matches one or more occurrences of the string `'bar'`:

In [172]:
re.search('(bar)+', 'foo bar baz')
# + means one or more occurrences of the group

<re.Match object; span=(4, 7), match='bar'>

In [174]:
re.search('(bar)+', 'foo barbar baz')
# + means one or more occurrences
# note: the match covers both consecutive occurrences of 'bar'

<re.Match object; span=(4, 10), match='barbar'>

In [176]:
re.search('(bar)+', 'foo barbarbarbar baz')
# + means one or more occurrences
# note: the match covers all four consecutive occurrences of 'bar'

<re.Match object; span=(4, 16), match='barbarbarbar'>

Here’s a breakdown of the difference between the two regexes with and without grouping parentheses:

| Regex | Interpretation | Matches | Examples |
|---:|:-------------|:-----------|:------|
| `bar+` | The `+` metacharacter applies only to the character `'r'`. | `'ba'` followed by one or more occurrences of `'r'` | `'bar'`, `'barr'`, `'barrr'` |
| `(bar)+` | The `+` metacharacter applies to the entire string `'bar'`. | One or more occurrences of `'bar'` | `'bar'`, `'barbar'`, `'barbarbar'` |
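
You can check this difference yourself. The following is a minimal sketch that relies on `re.fullmatch()`, another function in the `re` module that succeeds only when the entire string matches the regex. The names `candidates` and `results` are just illustrative:

```python
import re

candidates = ['bar', 'barr', 'barrr', 'barbar', 'barbarbar']
results = {}

for text in candidates:
    # re.fullmatch() succeeds only when the entire string matches the regex
    ungrouped = re.fullmatch('bar+', text) is not None
    grouped = re.fullmatch('(bar)+', text) is not None
    results[text] = (ungrouped, grouped)
    print(text, ungrouped, grouped)
```

Only `'bar'` is accepted by both regexes. The strings with a repeated `'r'` satisfy `bar+` alone, and the strings with a repeated `'bar'` satisfy `(bar)+` alone.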

Now take a look at a more complicated example. The regex `(ba[rz]){2,4}(qux)?` matches 2 to 4 occurrences of either `'bar'` or `'baz'`, optionally followed by `'qux'`:

In [178]:
re.search('(ba[rz]){2,4}(qux)?', 'bazbarbazqux')

<re.Match object; span=(0, 12), match='bazbarbazqux'>

In [179]:
re.search('(ba[rz]){2,4}(qux)?', 'barbar')

<re.Match object; span=(0, 6), match='barbar'>

The following example shows that you can nest grouping parentheses:

In [181]:
re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')

<re.Match object; span=(0, 9), match='foofoobar'>

In [182]:
re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar123')

<re.Match object; span=(0, 12), match='foofoobar123'>

In [183]:
re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoo123')

<re.Match object; span=(0, 9), match='foofoo123'>

The regex `(foo(bar)?)+(\d\d\d)?` is pretty elaborate, so let’s break it down into smaller pieces:

| Regex	| Matches|
|---:|:-------------|
|foo(bar)?|	'foo' optionally followed by 'bar'|
|(foo(bar)?)+|	One or more occurrences of the above |
|\d\d\d | Three decimal digit characters |
|(\d\d\d)? | Zero or one occurrences of the above |
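
To see how the optional pieces combine, here is a small sketch that tests a few made-up strings against the whole regex. It assumes `re.fullmatch()`, which requires the entire string to match, and the names `samples` and `accepted` are illustrative only:

```python
import re

regex = r'(foo(bar)?)+(\d\d\d)?'
samples = ['foo', 'foobar', 'foofoobar123', 'bar', 'foo12']

# Map each sample to whether the whole string matches the regex
accepted = {s: re.fullmatch(regex, s) is not None for s in samples}
print(accepted)
```

The string `'bar'` is rejected because at least one `'foo'` is required, and `'foo12'` is rejected because the digit group needs exactly three digits or none at all.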

## Capturing Groups
Grouping isn’t the only useful purpose that grouping constructs serve. Most (but not quite all) grouping constructs also capture the part of the search string that matches the group. You can retrieve the captured portion or refer to it later in several different ways.

Remember the match object that `re.search()` returns? There are two methods defined for a match object that provide access to captured groups: `.groups()` and `.group()`.

## `m.groups()`

Returns a tuple containing all the captured groups from a regex match.

In [184]:
m = re.search(r'(\w+),(\w+),(\w)', 'foo,quux,baz')
m

<re.Match object; span=(0, 10), match='foo,quux,b'>

In this regex, `\w+` matches one or more word characters, while `\w` matches exactly one word character.

The first two groups, `(\w+)`, each match a full sequence of word characters, so they match `'foo'` and `'quux'`. The third group, `(\w)`, matches only a single word character, so the match stops after the `'b'` in `'baz'`. The commas in the regex match the literal commas that separate the tokens in the search string.

Because the expressions use grouping parentheses, the corresponding matching tokens are **captured**. To access the captured matches, you can use `.groups()`, which returns a tuple containing all the captured matches in order:

In [185]:
m.groups()

('foo', 'quux', 'b')

Notice that the tuple contains the tokens but not the commas that appeared in the search string. That’s because the word characters that make up the tokens are inside the grouping parentheses but the commas aren’t. The commas that you see between the returned tokens are the standard delimiters used to separate values in a tuple.
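
Because `.groups()` returns an ordinary tuple, you can unpack it directly. The sketch below uses a hypothetical dotted version string (an assumption for illustration) and also shows what `.groups()` reports for a group that takes no part in the match:

```python
import re

# Hypothetical input: a dotted version string
m = re.search(r'(\d+)\.(\d+)\.(\d+)', 'version 3.12.1')
major, minor, patch = (int(part) for part in m.groups())
print(major, minor, patch)  # 3 12 1

# A group that takes no part in the match is reported as None
m2 = re.search(r'(\w+),(\w+)?', 'foo,')
print(m2.groups())  # ('foo', None)
```

The captured values are always strings (or `None`), so any conversion to numbers is up to you.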

## `m.group(<n>)`
Returns a string containing the `<n>th` captured match.

With one argument, .group() returns a single captured match. Note that the arguments are one-based, not zero-based. So, m.group(1) refers to the first captured match, m.group(2) to the second, and so on:



In [186]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
# each pair of parentheses creates a group; the commas separate the groups
m.groups()

('foo', 'quux', 'baz')

In [187]:
m.group(1)
# group 1 is the text matched before the first comma

'foo'

In [188]:
m.group(2)

'quux'

In [189]:
m.group(3)

'baz'

Since the numbering of captured matches is one-based, and there isn’t any group numbered zero, `m.group(0)` has a special meaning:

In [190]:
m.group(0)
# 0 returns the entire match, not just a single captured group

'foo,quux,baz'

In [191]:
m.group()
# same as m.group(0)

'foo,quux,baz'

`m.group(0)` returns the entire match, and `m.group()` does the same.
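
One way to convince yourself of this is to compare `m.group(0)` with a slice of the search string. This sketch assumes the match object methods `.span()`, which aren't covered above, and a made-up search string with extra text around the tokens:

```python
import re

s = 'xx foo,quux,baz yy'
m = re.search(r'(\w+),(\w+),(\w+)', s)

start, end = m.span()          # span of the entire match
print(m.group(0))              # foo,quux,baz
print(s[start:end])            # foo,quux,baz
print(m.span(2), m.group(2))   # (7, 11) quux
```

The entire match is exactly the slice of the search string that the span covers, and `.span(<n>)` gives the same information for an individual group.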

## `m.group(<n1>, <n2>, ...)`
Returns a tuple containing the specified captured matches.

With multiple arguments, .group() returns a tuple containing the specified captured matches in the given order:

In [193]:
m.group()

'foo,quux,baz'

In [194]:
m.group(2, 3)

('quux', 'baz')

In [195]:
m.group(3, 2, 1)

('baz', 'quux', 'foo')

This is just convenient shorthand. You could create the tuple of matches yourself instead:

In [197]:
m.group(3, 2, 1)

('baz', 'quux', 'foo')

In [199]:
(m.group(3), m.group(2), m.group(1))
# same as above 

('baz', 'quux', 'foo')

The two statements shown are functionally equivalent.
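
The multiple-argument form is handy when you want to reorder captured values as you unpack them. Here's a minimal sketch with a hypothetical ISO-style date string (the input and variable names are assumptions for illustration):

```python
import re

# Hypothetical ISO-style date string
m = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'released 2024-03-15')

# Ask for the groups in day, month, year order
day, month, year = m.group(3, 2, 1)
reordered = f'{day}/{month}/{year}'
print(reordered)  # 15/03/2024
```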

## Backreferences
You can match a previously captured group later within the same regex using a special metacharacter sequence called a **backreference**.

## `\<n>`

Matches the contents of a previously captured group.



Within a regex in Python, the sequence `\<n>`, where `<n>` is an integer from 1 to 99, matches the contents of the `<n>th` captured group.

Here’s a regex that matches a word, followed by a comma, followed by the same word again:

In [200]:
regex = r'(\w+),\1'

In [201]:
m = re.search(regex, 'foo,foo')

In [202]:
m

<re.Match object; span=(0, 7), match='foo,foo'>

In [204]:
m = re.search(regex, 'qux,qux')
m

<re.Match object; span=(0, 7), match='qux,qux'>

In [205]:
m.group(1)

'qux'

In [206]:
m = re.search(regex, 'foo,qux')

In [208]:
print(m)

None


In the first example, which searches `'foo,foo'`, `(\w+)` matches the first instance of the string `'foo'` and saves it as the first captured group. The comma matches literally.
<br>Then `\1` is a backreference to the first captured group and matches `'foo'` again.
<br>The second example, which searches `'qux,qux'`, is identical except that `(\w+)` matches `'qux'` instead.

The last example, which searches `'foo,qux'`, doesn’t have a match because what comes before the comma isn’t the same as what comes after it, so the `\1` backreference doesn’t match.
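
A common practical use of a backreference is spotting an accidentally repeated word. The following is a rough sketch under a couple of assumptions: the sample sentence is made up, and the regex uses the `\b` word-boundary anchor so that only whole words are compared:

```python
import re

# Made-up sample sentence containing a doubled word
text = 'Paris in the the spring'

# (\w+) captures a word; \1 then requires the same word again after a space
m = re.search(r'\b(\w+) \1\b', text)
print(m.group())   # the the
print(m.group(1))  # the
```

Without the backreference, a regex like `(\w+) (\w+)` would match any two words in a row, not just a repeated one.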

**Note:** Any time you use a regex in Python with a numbered backreference, it’s a good idea to specify it as a raw string. Otherwise, the interpreter may confuse the backreference with an octal value.

<br>Consider this example:

In [209]:
print(re.search('([a-z])#\1', 'd#d'))

None


The regex `([a-z])#\1` matches a lowercase letter, followed by `'#'`, followed by the same lowercase letter. The string in this case is `'d#d'`, which should match. But the match fails because Python misinterprets the backreference `\1` as the character whose octal value is one:

In [210]:
oct(ord('\1'))

'0o1'

You’ll achieve the correct match if you specify the regex as a raw string:

In [211]:
re.search(r'([a-z])#\1', 'd#d')

<re.Match object; span=(0, 3), match='d#d'>

Remember to consider using a raw string whenever your regex includes a metacharacter sequence containing a backslash.

Numbered backreferences are one-based like the arguments to `.group()`. Only the first ninety-nine captured groups are accessible by backreference. The interpreter will regard `\100` as the `'@'` character, whose octal value is 100.
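
Both points can be checked quickly. This sketch compares the plain and raw spellings of `\1` and then searches a made-up string (an assumption for illustration) with `\100`:

```python
import re

plain = '\1'   # Python's octal escape: a single character with code 1
raw = r'\1'    # two characters: a backslash followed by '1'
print(len(plain), len(raw))  # 1 2
print(plain == '\x01')       # True

# Three octal digits are read as a character code, not a backreference
at_match = re.search(r'\100', 'user@example.com')
print(at_match.group(), at_match.span())  # @ (4, 5)
```

In the raw string, the backslash reaches the regex parser intact, so it's the `re` module rather than the Python interpreter that decides how to read the sequence.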

## Other Grouping Constructs