# Chapter 3 Grouping

Grouping is a powerful tool that allows you to perform operations such as:
- Creating subexpressions to apply quantifiers. For instance, repeating a subexpression rather than a single character.
- Limiting the scope of the alternation. Instead of alternating the whole expression, we can define exactly what has to be alternated.
- Extracting information from the matched pattern. For example, extracting a date from lists of orders.
- Using the extracted information again in the regex, which is probably the most useful property. One example would be to detect repeated words.

Throughout this chapter, we will explore groups, from the simplest to the most complex ones. We'll review some of the previous examples in order to bring clarity to how these operations work.

# Introduction

We've already used groups in several examples throughout *Chapter 2, Regular Expressions with Python*. **Grouping is accomplished through two metacharacters, the parentheses ()**. The simplest use of parentheses is building a subexpression. For example, imagine you have a list of products, the ID for each product being made up of two or three sequences of one digit followed by a dash and one alphanumeric character, such as 1-a2-b:

In [4]:
import re

In [7]:
re.match(r"(\d-\w){2,3}", r"1-a2-b")

<re.Match object; span=(0, 6), match='1-a2-b'>

As you can see in the preceding example, the parentheses indicate to the regex 
engine that the pattern inside them has to be treated like a unit.

Let's see another example; in this case, we need to match whenever there is one or more ```ab``` followed by ```c```:

In [9]:
re.search(r"(ab)+c", r"ababcc")

<re.Match object; span=(0, 5), match='ababc'>

In [10]:
re.search(r"(ab)+c", r"abbc")

**So, you could use parentheses whenever you want to group meaningful subpatterns inside the main pattern**.
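
To make the role of the parentheses concrete, the following sketch compares the same quantifier with and without a group. It uses ```re.fullmatch``` (not used in the text above) so that partial matches don't count, and the second sample ID, ```1-abc```, is made up for illustration:

```python
import re

# Without a group, {2,3} applies only to the preceding \w.
ungrouped = re.compile(r"\d-\w{2,3}")
# With a group, {2,3} repeats the whole "digit, dash, character" unit.
grouped = re.compile(r"(\d-\w){2,3}")

print(ungrouped.fullmatch("1-a2-b"))  # None: "1-a2" matches, but "-b" is left over
print(ungrouped.fullmatch("1-abc"))   # match: one digit, a dash, three word characters
print(grouped.fullmatch("1-a2-b"))    # match: the unit "\d-\w" is repeated twice
print(grouped.fullmatch("1-abc"))     # None: "bc" is not another "\d-\w" unit
```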

Another simple example of their use is limiting the scope of alternation. For example, let's say we would like to write an expression to match if someone is from Spain. In Spanish, the country is spelled España and Spaniard is spelled Español. So, we want to match España and Español. The Spanish letter ñ can be confusing for non-Spanish speakers, so in order to avoid confusion we'll use Espana and Espanol instead of España and Español.

We might first try the following alternation:

In [11]:
re.search("Espana|ol", "Espanol")

<re.Match object; span=(5, 7), match='ol'>

In [13]:
re.search("Espana|ol", "Espana")

<re.Match object; span=(0, 6), match='Espana'>

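The problem is that the alternation applies to the whole expression, so the pattern means ```Espana``` or ```ol```. It finds only ```ol``` inside ```Espanol```, and it also matches ```ol``` on its own: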

In [12]:
re.search("Espana|ol", "ol")

<re.Match object; span=(0, 2), match='ol'>

So, let's try a character class, also called a character set (written with ```[ ]```), as in the following code:

In [14]:
re.search("Espan[aol]", "Espanol")

<re.Match object; span=(0, 6), match='Espano'>

In [15]:
re.search("Espan[aol]", "Espana")

<re.Match object; span=(0, 6), match='Espana'>

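It works for ```"Espana"```, but here we have another problem: a character class matches exactly one character, so the pattern never reaches the final ```l``` of ```"Espanol"```, and it also accepts ```"Espano"``` and ```"Espanl"```, which don't mean anything in Spanish: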

In [16]:
re.search("Espan[a|ol]", "Espano")

<re.Match object; span=(0, 6), match='Espano'>

In [21]:
# a character class matches exactly one of the characters inside [];
# note that | inside [] is a literal pipe character, not alternation
re.search("Espan[a|ol]", "Espanol")

<re.Match object; span=(0, 6), match='Espano'>

The solution here is to use parentheses:

In [17]:
re.search("Espan(a|ol)", "Espana")

<re.Match object; span=(0, 6), match='Espana'>

In [19]:
re.search("Espan(a|ol)", "Espanol")

<re.Match object; span=(0, 7), match='Espanol'>

In [28]:
# my own experiment
# re.search("Espan(a|ol)", "Espanao")

# <re.Match object; span=(0, 6), match='Espana'>

In [29]:
re.search("Espan(a|ol)", "Espan")

In [30]:
re.search("Espan(a|ol)", "Espano")

In [32]:
re.search("Espan(a|ol)", "ol")
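
The three attempts above can be compared side by side. This is only a sketch: it uses ```re.fullmatch``` instead of ```re.search``` so that a pattern counts as accepting a word only when it matches the whole word, and the ```candidates``` list is an assumption made for the comparison:

```python
import re

candidates = ["Espana", "Espanol", "Espano", "ol"]
patterns = {
    "alternation": r"Espana|ol",   # alternates the whole expression
    "char class": r"Espan[aol]",   # exactly one character out of a, o, l
    "group": r"Espan(a|ol)",       # alternation limited to the ending
}

# For each pattern, collect the candidate words it accepts in full.
results = {
    name: [word for word in candidates if re.fullmatch(pattern, word)]
    for name, pattern in patterns.items()
}
for name, accepted in results.items():
    print(f"{name:12} accepts {accepted}")
```

Only the grouped version accepts exactly ```Espana``` and ```Espanol```.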

Let's see another key feature of grouping, **capturing**. Groups also capture the matched pattern, so you can use them later in several operations, such as sub or in the regex itself.

**For example**, imagine you have a list of products, the IDs of which are made up of digits representing the country of the product, a dash as a separator, and one or more alphanumeric characters as the ID in the DB. You're requested to extract the country codes:

```"1-a\n20-baer\n34-afcr"```

In [48]:
pattern = re.compile(r"(\d+)-\w+")

it = pattern.finditer(r"1-a\n20-baer\n34-afcr")

In [49]:
match = [item for item in it]
match

[<re.Match object; span=(0, 3), match='1-a'>,
 <re.Match object; span=(5, 12), match='20-baer'>,
 <re.Match object; span=(14, 21), match='34-afcr'>]

In [51]:
# Python 3: use next(it) rather than it.next(); recreate the exhausted iterator first
it = pattern.finditer(r"1-a\n20-baer\n34-afcr")
next(it).group(1)

'1'

In [50]:
# my own experiment on finding country code
country_code = re.findall(pattern, r"1-a\n20-baer\n34-afcr")
country_code

['1', '20', '34']

In the preceding example, we've created a pattern to match the IDs, but we're only capturing a group made up of the country digits. Remember that when working with the ```group``` method, the index 0 returns the whole match, and the groups start at index 1.

Capturing groups open up a wide range of possibilities, because they can be combined with several operations that we'll discuss in the upcoming sections.
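
As a small sketch of the indexing rule, the loop below prints the whole match (index 0) next to the captured country code (index 1). It assumes the product list is a regular string, so ```\n``` is a real newline here rather than the two literal characters used in the raw string above:

```python
import re

pattern = re.compile(r"(\d+)-\w+")
# A regular (non-raw) string, so "\n" is a real newline between the IDs.
products = "1-a\n20-baer\n34-afcr"

for match in pattern.finditer(products):
    # group(0) is the whole match; group(1) is the first capturing group.
    print(match.group(0), "->", match.group(1))

country_codes = [match.group(1) for match in pattern.finditer(products)]
print(country_codes)  # ['1', '20', '34']
```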

# Backreferences

As we've mentioned previously, one of the most powerful functionalities that grouping gives us is the possibility of *using the captured group inside the regex or other operations*. That's exactly what **backreferences** provide. Probably the best-known example is the regex to find duplicated words, as shown in the following code:

In [68]:
pattern = re.compile(r"(\w+) \1")
matches = pattern.search(r"hello hello world")

In [69]:
# my own experiment
matches.group()

'hello hello'

In [70]:
# my own experiment on group()
matches.group(0)

'hello hello'

In [72]:
# my own experiment on group()
matches.group(1)

'hello'

In [71]:
matches.groups()

('hello',)

Here, we're capturing a group made up of one or more alphanumeric characters, after which the pattern tries to match a whitespace character, and finally we have the ```\1``` backreference. It means that the text at that point must exactly match whatever the first group matched.
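
One caveat is worth a quick sketch: without word boundaries, the duplicated-word pattern can also match inside longer words. The ```\b``` anchors and the sentence ```this is a test``` are additions made here for illustration, not part of the original example:

```python
import re

loose = re.compile(r"(\w+) \1")
strict = re.compile(r"\b(\w+) \1\b")

print(loose.search("hello hello world").group())   # 'hello hello'
print(loose.search("this is a test").group())      # 'is is', found inside "this is"
print(strict.search("hello hello world").group())  # 'hello hello'
print(strict.search("this is a test"))             # None
```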

Backreferences can be used with the first 99 groups. Obviously, as the number of groups increases, reading and maintaining the regex becomes more complex. This complexity can be reduced with named groups, which we'll see in the following section. But before that, we still have a lot to learn about backreferences. So, let's continue with another operation in which backreferences really come in handy. Recall the previous example, in which we had a list of products. Now, let's try to change the order of the ID, so we have the ID in the DB, a dash, and the country code:

In [62]:
pattern = re.compile(r"(\d+)-(\w+)")

In [63]:
pattern.sub(r"\2-\1", "1-a\n20-baer\n34-afcr")

'a-1\nbaer-20\nafcr-34'

That's it. Easy, isn't it? Note that we're also capturing the ID in the DB, so we can use it later. With the replacement string (```r"\2-\1"```), we're saying, "Replace what you've matched with the second group, a dash, and the first group".
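
Numbered references in the replacement string can also be written as ```\g<number>```, which the ```re``` module accepts alongside ```\number```. The following sketch shows why that form can matter when a digit follows the reference. The padding example is hypothetical and only there to show the ambiguity:

```python
import re

pattern = re.compile(r"(\d+)-(\w+)")

# \g<2> and \g<1> are equivalent to \2 and \1.
swapped = pattern.sub(r"\g<2>-\g<1>", "1-a\n20-baer\n34-afcr")
print(swapped)

# Appending a literal 0 to group 1: \g<1>0 is unambiguous...
padded = pattern.sub(r"\g<1>0-\2", "1-a")
print(padded)  # 10-a

# ...whereas \10 is read as a reference to group 10, which does not exist.
try:
    pattern.sub(r"\10-\2", "1-a")
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = exc
print(ambiguous_error)
```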

As with the previous example, using numbers can be difficult to follow and to maintain. So, let's see what Python, through the ```re``` module, offers to help with this.

# Named groups

Remember from the previous chapter when we got a group through an index?

In [73]:
pattern = re.compile(r"(\w+) (\w+)")

In [74]:
match = pattern.search("Hello world")

In [75]:
match.group(1)

'Hello'

In [76]:
match.group(2)

'world'

We just learnt how to access the groups using indexes to extract information and to use it as backreferences. Using numbers to refer to groups can be tedious and confusing, and the worst thing is that it doesn't allow you to give meaning or context to the group. That's why we have named groups.

Imagine a regex in which you have several backreferences, let's say 10, and you find out that the third one is invalid, so you remove it from the regex. That means you have to change the index of every backreference from that one onwards. In order to solve this problem, in 1997, Guido van Rossum designed named groups for Python 1.5. This feature was later offered to Perl for cross-pollination.

Nowadays, it can be found in almost any flavor. Basically, it allows us to give names to the groups, so we can refer to them by their names in any operation where groups are involved.

In order to use it, we have to use the syntax ```(?P<name>pattern)```, where the P comes from Python-specific extensions (as you can read in the e-mail Guido sent to Perl developers at http://markmail.org/message/oyezhwvefvotacc3).

Let's see how it works with the previous example in the following code snippet:

In [77]:
pattern = re.compile(r"(?P<first>\w+) (?P<second>\w+)")

In [83]:
match = pattern.search("Hello world")

In [84]:
match.group("first")

'Hello'

In [85]:
match.group("second")

'world'

So, backreferences are now much simpler to use and maintain as is evident in the following example:

In [86]:
pattern = re.compile(r"(?P<country>\d+)-(?P<id>\w+)")

In [89]:
pattern.sub(r"\g<id>-\g<country>", "1-a\n20-baer\n34-afcr")

'a-1\nbaer-20\nafcr-34'

As we see in the previous example, in order to reference a group by name in the ```sub``` operation, we have to use ```\g<name>```.

We can also use named groups inside the pattern itself, as seen in the following example:

In [90]:
pattern = re.compile(r"(?P<word>\w+) (?P=word)")

In [91]:
match = pattern.search(r"hello hello world")

In [93]:
match.groups()

('hello',)

This is simpler and more readable than using numbers.

Through these examples, we have used the following three different ways to refer to named groups:

|Use | Syntax|
|:- | :- |
|Inside a pattern | ```(?P=name)```|
|In the ```repl``` string of the ```sub``` operation | ```\g<name>```|
|In any of the operations of the ```MatchObject``` | ```match.group('name')``` |
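
Beyond ```match.group('name')```, the match object offers two other ways to read named groups: ```groupdict()``` and subscript access (the latter needs Python 3.6 or later). Neither appears in the text above, so treat this as a supplementary sketch built on the same product ID pattern:

```python
import re

pattern = re.compile(r"(?P<country>\d+)-(?P<id>\w+)")
match = pattern.search("20-baer")

print(match.group("country"))  # '20'
print(match["id"])             # 'baer' (subscript access, Python 3.6+)
print(match.groupdict())       # {'country': '20', 'id': 'baer'}
print(match.group(1))          # '20': named groups still keep their numeric index
```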

# Non-capturing group

As we've mentioned before, capturing content is not the only use of groups.  

There are cases when we want to use groups, but we're not interested in extracting the information; alternation would be a good example. That's why we have a way to create groups without capturing. Throughout this book, we've been using *groups to create subexpressions*, as can be seen in the following example:

In [94]:
re.search("Españ(a|ol)", "Español")

<re.Match object; span=(0, 7), match='Español'>

In [95]:
re.search("Españ(a|ol)", "Español").groups()

('ol',)

In [104]:
# groups() returns a tuple with every captured group, e.g. ('ol',);
# group(1) returns only the first group as a string
# re.search("Españ(a|ol)", "Español").group(1)

# output
# 'ol'

You can see that we've captured a group even though we're not interested in the 
content of the group. So, let's try it without capturing, but first we have to know the syntax, which is almost the same as in normal groups, ```(?:pattern)```. As you can see, we've only added ```?:```. Let's see the following example:

In [105]:
re.search("Españ(?:a|ol)", "Español")

<re.Match object; span=(0, 7), match='Español'>

In [106]:
re.search("Españ(?:a|ol)", "Español").groups()

()

After using the new syntax, we have the same functionality as before, but now  
we're saving resources and the regex is easier to maintain. Note that the group 
cannot be referenced.
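
A practical consequence shows up with ```re.findall```, which returns the captured groups when the pattern has any and the whole match otherwise. A minimal sketch, reusing the ASCII spelling from earlier in the chapter:

```python
import re

text = "Espana Espanol"

capturing = re.findall(r"Espan(a|ol)", text)
non_capturing = re.findall(r"Espan(?:a|ol)", text)
print(capturing)      # ['a', 'ol']
print(non_capturing)  # ['Espana', 'Espanol']

# Non-capturing groups are also skipped when groups are numbered.
match = re.search(r"(?:Espan)(a|ol)", "Espanol")
print(match.group(1))  # 'ol'
```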

### Atomic groups

Atomic groups are a special case of non-capturing groups, and they're usually used to improve performance. They disable backtracking, so with them you can avoid cases where trying every possibility or path in the pattern doesn't make sense. This concept is difficult to understand, so stay with me up to the end of the section.

Before Python 3.11, the ```re``` module didn't support atomic groups. So, in order to see an example that works across Python versions, we're going to use the regex module: https://pypi.python.org/pypi/regex.

Imagine we have to look for an ID made up of one or more alphanumeric characters followed by a dash and by a digit:

In [109]:
import regex

In [107]:
data = "aaaaabbbbbaaaaccccccdddddaaa"

In [110]:
regex.match(r"(\w+)-\d", data)

Let's see step by step what's happening here:
1. The regex engine matches the first a.
2. It then matches every character up to the end of the string, because ```\w+``` is greedy.
3. It fails because it doesn't find the dash.
4. So, the engine backtracks: it gives back the last character it matched and looks for the dash again.
5. It fails again and repeats the same process with the previous character.

It tries this with every character. If you think about what we're doing, it doesn't make any sense to keep trying once you have failed the first time. And that's exactly what an atomic group is useful for. For example:

```regex.match(r"(?>\w+)-\d", data)```

Here we've added ```?>```, which indicates an atomic group, so once the regex engine fails to match, it doesn't keep trying with every character in the data.
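
To observe the effect without installing anything, atomic behaviour can be approximated in the standard ```re``` module with a lookahead plus a backreference, ```(?=(...))\1```. This idiom is an emulation and an assumption of this sketch, not the ```(?>...)``` syntax shown above. The short string ```abc1``` is also made up, chosen so that backtracking changes the result:

```python
import re

# Ordinary greedy group: \w+ first takes "abc1", then gives back "1"
# so that the trailing \d can match.
backtracking = re.match(r"(\w+)\d", "abc1")
print(backtracking.group(1))  # 'abc'

# Emulated atomic group: the lookahead captures the longest run of word
# characters, \1 consumes exactly that text, and nothing can be given back.
atomic_like = re.match(r"(?=(\w+))\1\d", "abc1")
print(atomic_like)  # None

# On the data from the text, the atomic-like form fails as well,
# because there is no dash after the run of word characters.
data = "aaaaabbbbbaaaaccccccdddddaaa"
no_dash = re.match(r"(?=(\w+))\1-\d", data)
print(no_dash)  # None
```

The second call returns ```None``` precisely because the group refuses to give back the final ```1```, which is the behaviour an atomic group is meant to provide.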

# Special cases with groups

Python provides us with some forms of groups that can help us to modify the regular expression, or even to match a pattern only when a previous group exists in the match, much like an ```if``` statement.

### Flags per group

There is a way to apply the flags we've seen in *Chapter 2, Regular Expressions 
with Python*, using a special form of grouping: ```(?iLmsux)```.

|Letter | Flag |
|:- | :-|
|i |re.IGNORECASE|
|L |re.LOCALE|
|m |re.MULTILINE|
|s |re.DOTALL|
|u |re.UNICODE|
|x |re.VERBOSE|

For example:

In [112]:
# Python 3 strings are already Unicode, so the Python 2 ur"" prefix is dropped
re.findall(r"(?u)\w+", "ñ")

['ñ']

The above example is the same as:

In [113]:
re.findall(r"\w+", "ñ", re.U)

['ñ']

We've seen what these examples do several times in the previous chapter.
Remember that a flag is applied to the whole expression.
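
Several letters can be combined in one inline group. The sketch below checks that ```(?im)``` behaves like passing ```re.IGNORECASE | re.MULTILINE``` as the ```flags``` argument. The two-line sample text is an assumption made for the demonstration:

```python
import re

text = "Espana\nESPANOL"

# Inline flags at the start of the pattern...
inline = re.findall(r"(?im)^espan(?:a|ol)$", text)
# ...are equivalent to passing the same flags as an argument.
explicit = re.findall(r"^espan(?:a|ol)$", text, re.IGNORECASE | re.MULTILINE)

print(inline)    # ['Espana', 'ESPANOL']
print(explicit)  # ['Espana', 'ESPANOL']
```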

### yes-pattern|no-pattern