# Grouping
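> Regular expressions can dissect a string into subgroups, each matching one part of the pattern.  
> For example, given the string `"User-Agent: Thunderbird 1.5.0.9 (X11/20061227)"`, we may want each individual item as its own group.  
> Groups are marked with parentheses `()`.  
> Groups are numbered from 0. Group 0 is the whole match and is always present; `group()` with no argument returns group 0. Capturing groups are numbered from 1, left to right by opening parenthesis.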

In [3]:
import re

In [4]:
test = re.compile("(cd)*")
test.match("cdcdcdcda").span()

(0, 8)

_Notice that the span covers every consecutive repetition of `cd`: the match runs from index 0 to index 8 and stops before the trailing `a`._
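A quantified group is still a single group: after the match, `group(1)` holds only the text of the last repetition, while `group(0)` holds the whole match. A minimal sketch reusing the pattern above (the variable names are illustrative only):

```python
import re

repeated = re.compile("(cd)*")
m = repeated.match("cdcdcdcda")

whole = m.group(0)      # entire matched text, all four repetitions
last = m.group(1)       # only the last repetition captured by (cd)
last_span = m.span(1)   # position of that last repetition in the string

print(whole, last, last_span)
```

If every repetition is needed, `re.findall` on the unquantified pattern is usually a simpler route than a repeated group.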

In [5]:
# Extracting specific data with Regex
str1 = "User-Agent: Thunderbird 1.5.0.9 (X11/20061227)"
x = re.match(r"^([\w]+-[\w]+): ([\w]+) ([0-9.]+) ([\(\w]+/[\d\)]+)", str1) # each () is a capturing group; groups can nest like parentheses in math
x

<re.Match object; span=(0, 46), match='User-Agent: Thunderbird 1.5.0.9 (X11/20061227)'>

In [6]:
for i in range(1,len(x.groups())+1):
    print(x.group(i))

User-Agent
Thunderbird
1.5.0.9
(X11/20061227)


In [7]:
x.groups()  # returns a tuple of all capturing groups (group 0 is not included)

('User-Agent', 'Thunderbird', '1.5.0.9', '(X11/20061227)')

In [8]:
x.group(0,1,2,3,4)  # several indices return a tuple with one item per requested group; index 0 is the whole match

('User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
 'User-Agent',
 'Thunderbird',
 '1.5.0.9',
 '(X11/20061227)')
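Because `groups()` returns a plain tuple, it can be unpacked straight into names, and `start()`, `end()` and `span()` accept a group index to locate one group inside the original string. A short sketch under those assumptions (`line`, `header`, `client`, `version` and `build` are names chosen here for illustration):

```python
import re

line = "User-Agent: Thunderbird 1.5.0.9 (X11/20061227)"
m = re.match(r"^([\w]+-[\w]+): ([\w]+) ([0-9.]+) ([\(\w]+/[\d\)]+)", line)

# Unpack the four capturing groups into descriptive names
header, client, version, build = m.groups()

# span(3) gives the (start, end) position of group 3 in the string
version_span = m.span(3)
version_slice = line[m.start(3):m.end(3)]

print(header, client, version, build)
print(version_span, version_slice)
```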

## Finding a repeated word in a string

In [9]:
string_1 = "Iam the best best actor best"
regex_1 = re.compile(r'\b(\w+)\s+\1\b')  # \1 is a backreference: it must match exactly the same text that group 1 captured
regex_1.search(string_1).group()

'best best'
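The same backreference also works in a replacement string, so the pattern can be used to collapse a doubled word instead of only finding it. A minimal sketch using `re.sub` (the name `cleaned` is illustrative):

```python
import re

text = "Iam the best best actor best"
dup = re.compile(r'\b(\w+)\s+\1\b')

# Replace each "word word" pair with a single copy of the word
cleaned = dup.sub(r'\1', text)
print(cleaned)
```

Only adjacent repeats are collapsed; the final `best` stays because `actor` separates it from the earlier pair.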

## Non-capturing group `(?:pattern)`
> `pattern` is any regex pattern.  
> These groups take part in matching but are not captured, so they get no group number.

In [10]:
# Suppose we don't need to capture some of the groups
str1 = "User-Agent: Thunderbird 1.5.0.9 (X11/20061227)"
y = re.match(r"^([\w]+-[\w]+): ([\w]+) (?:[0-9.]+) (?:[\(\w]+/[\d\)]+)", str1) # groups starting with ?: are matched but not captured
for i in range(1,len(y.groups())+1):  # loop to print all captured groups
    print(y.group(i))
# The last two groups, (?:[0-9.]+) and (?:[\(\w]+/[\d\)]+), start with ?: so they are not captured

User-Agent
Thunderbird


_Thus the last two items are still matched but are not included in `y.groups()`._

In [11]:
# A simple example in which every group is capturing
str2 = "cat dog rat man"
z = re.match(r"(^cat)\s(\bdog)\s(rat\b)\s(\w+)", str2)
for i in range(1,len(z.groups())+1):  #loop to print all groups 
    print(z.group(i))

cat
dog
rat
man


_As no group `()` begins with `?:`, every group is captured._

In [12]:
# If we don't want to capture rat and man, add `?:` at the start of those groups

str2 = "cat dog rat man"
z = re.match(r"(^cat)\s(\bdog)\s(?:rat\b)\s(?:\w+)", str2)
for i in range(1,len(z.groups())+1):  #loop to print all groups 
    print(z.group(i))

cat
dog


_Thus those groups are excluded from the result._
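The difference matters most with `re.findall`, which returns the captured group when the pattern has one and the whole match when it has none. A small sketch comparing the two forms (the sample text and variable names are made up for illustration):

```python
import re

text = "ababab ab"

# With a capturing group, findall returns the group's text (the last repetition)
captured = re.findall(r"(ab)+", text)

# With a non-capturing group, findall returns each full match
whole = re.findall(r"(?:ab)+", text)

# A compiled pattern reports how many capturing groups it has
n_capturing = re.compile(r"(ab)+").groups
n_non_capturing = re.compile(r"(?:ab)+").groups

print(captured, whole, n_capturing, n_non_capturing)
```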

## Named groups `(?P<name>pattern)`
> `name` is the name we want to give the group.  
> `pattern` is a regex pattern.  
> This syntax is a Python-specific extension.

In [13]:
str2 = "cat dog rat man"
a = re.search(r"(?P<pet1>^cat)[a-zA-Z ]+(?P<owner>man$)", str2)
# for i in range(1,len(z.groups())+1):  #loop to print all groups 
#     print(z.group(i))
a.groupdict()

{'pet1': 'cat', 'owner': 'man'}

In [14]:
a.groups()  # Named groups are also numbered, so we can fetch them by index too

('cat', 'man')
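Named groups can also be read individually by name, referenced in a replacement with `\g<name>`, and backreferenced inside the pattern with `(?P=name)`. A hedged sketch of those three uses (the group names `first`, `second` and `word` are invented for this example; subscript access such as `m["pet1"]` assumes Python 3.6 or later):

```python
import re

m = re.search(r"(?P<pet1>^cat)[a-zA-Z ]+(?P<owner>man$)", "cat dog rat man")
by_name = m.group("owner")   # fetch a group by its name
by_index = m.group(2)        # the same group, fetched by number
by_subscript = m["pet1"]     # subscript access on the match object

# \g<name> refers to a named group inside a replacement string
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)", r"\g<second> \g<first>", "cat dog")

# (?P=name) is a backreference to a named group inside the pattern
doubled = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "Iam the best best actor best")

print(by_name, by_index, by_subscript, swapped, doubled.group("word"))
```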

### Case-insensitive mode
> Put the inline flag `(?i)` at the start of a pattern to turn case-insensitive mode ON for the whole pattern.


In [None]:
sen_list = ['AB9r 0396 yjlq',
 'w35u AB9y smge',
 'wji9 fgyx ABls',
 'ABfy abx3 whfc',
 'a0ig sz71 abkh',
 'ajg2 abge fzvd']

In [16]:
pattern_1 = r"(?i)(ab+)"  # (?i) at the start of the pattern makes the whole pattern case-insensitive
for index,string in enumerate(sen_list, start =1):
    print(f'{index} : {string}'.ljust(12), "-->",re.findall(pattern_1, string))

1 : AB9r 0396 yjlq --> ['AB']
2 : w35u AB9y smge --> ['AB']
3 : wji9 fgyx ABls --> ['AB']
4 : ABfy abx3 whfc --> ['AB', 'ab']
5 : a0ig sz71 abkh --> ['ab']
6 : ajg2 abge fzvd --> ['ab']


_We are able to match every occurrence of `ab`, regardless of case._
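To ignore case for only part of a pattern, Python 3.6 and later accept a scoped inline flag, `(?i:...)`; the `re.IGNORECASE` flag is the non-inline equivalent for the whole pattern. A minimal sketch contrasting the two (the sample string `words` is made up for illustration):

```python
import re

words = "ABc abC Abc"

# Whole pattern case-insensitive, passed as a flag instead of (?i)
with_flag = re.findall(r"ab+c", words, flags=re.IGNORECASE)

# Only the ab+ part ignores case; the trailing c must be lowercase
scoped = re.findall(r"(?i:ab+)c", words)

print(with_flag)
print(scoped)
```

Here `abC` is rejected by the scoped pattern because its final letter falls outside the case-insensitive part.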