## Regular Expressions
This notebook gives a working introduction to regular expressions in Python's re module.

In [1]:
import re

#### Regular Expression - search()
match = re.search(pat, str)
1. **pat**: the regular expression pattern
2. **str**: the string to search
3. returns: a match object if the pattern is found, otherwise **None**
- Because of this, a call to search() is usually followed immediately by an if-statement that tests whether the search succeeded.

In [3]:
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)

# If-statement after search() tests if it succeeded
if match:
    print('found', match.group())  ## 'found word:cat'
else:
    print('did not find')

found word:cat


In [4]:
match1 = re.search(r'ample', str)
if match1:
    print('found', match1.group())
else:
    print('did not find')

found ample
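
The if-test matters because calling group() on **None** raises an AttributeError. As a minimal sketch, the check can be wrapped in a small helper; the name first_match is hypothetical and is not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None if the pattern is not found.

    first_match is a hypothetical helper used only for illustration.
    """
    match = re.search(pattern, text)
    # re.search() returns a match object on success and None on failure
    return match.group() if match else None

text = 'an example word:cat!!'
print(first_match(r'word:\w\w\w', text))  # word:cat
print(first_match(r'word:\d\d\d', text))  # None, since 'cat' is not three digits
```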


In [20]:
regex = r'\d\s+\d\s*\dx?'
match2 = re.search(regex, 'xxx1 2      6xxdfsrelho')
if match2:
    print('found:', match2.group())
else:
    print('did not find match')

found: 1 2      6x
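
To make the pattern above easier to read, here is a rough breakdown of each piece, tried against a few made-up sample strings (the samples are assumptions chosen for illustration, not part of the original notebook).

```python
import re

# \d   exactly one digit
# \s+  one or more whitespace characters
# \s*  zero or more whitespace characters
# x?   an optional literal 'x'
regex = r'\d\s+\d\s*\dx?'

samples = ['1 23', '1 2 3x', '123', '1  2\t3xx']
for sample in samples:
    match = re.search(regex, sample)
    # Show the matched text, or None when the pattern is not found
    print(repr(sample), '->', repr(match.group()) if match else None)
```

The string '123' does not match because \s+ requires at least one whitespace character between the first two digits.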


### Email Example
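A first attempt at a regular expression to extract an email address. The pattern \w+@\w+ captures only part of the address, because \w matches neither '-' nor '.'.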

In [30]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())  ## 'b@google'

b@google


### Square Bracket [ ]
Square brackets indicate a set of characters, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s, etc. work inside square brackets too, with the one exception that a dot (.) just means a literal dot. For the email problem, square brackets are an easy way to add '.' and '-' to the set of characters that can appear around the @. The pattern r'[\w.-]+@[\w.-]+' gets the whole email address:

In [40]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com
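
The difference between a dot outside and inside square brackets can be sketched with three short test words. The helper name matching and the word list are hypothetical, chosen only for this illustration.

```python
import re

words = ['a.c', 'abc', 'a-c']

def matching(pattern):
    """Return the words in which the pattern is found (hypothetical helper)."""
    return [w for w in words if re.search(pattern, w)]

# Outside brackets, '.' matches any character except a newline
print(matching(r'a.c'))     # ['a.c', 'abc', 'a-c']

# Inside brackets, '.' is a literal dot
print(matching(r'a[.]c'))   # ['a.c']

# A '-' placed last in the set is a literal hyphen, not a range
print(matching(r'a[.-]c'))  # ['a.c', 'a-c']
```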


### Group Extraction
The "group" feature of a regular expression lets you pick out parts of the matching text. Suppose for the email problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. The parentheses do not change what the pattern matches; instead, they establish logical "groups" inside the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text, as usual.

In [35]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', str)
if match:
    print(match.group())  ## 'alice-b@google.com' (the whole match)
    print(match.group(1)) ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com
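
As a small complement, the match object also offers groups(), which returns every captured group at once as a tuple. The sketch below uses it to unpack both parts in one step; the variable names username and host are illustrative.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    # groups() returns all captured groups as a tuple, in order
    username, host = match.groups()
    print(username)           # alice-b
    print(host)               # google.com
    # group() also accepts several group numbers and returns a tuple
    print(match.group(1, 2))  # ('alice-b', 'google.com')
```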


### findall
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

#### findall [Without Group Extraction]

In [53]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com'] [Without Group Extraction]
for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com


#### findall [With Group Extraction]

In [52]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## With two groups in the pattern, re.findall() returns a list of tuples,
## one (username, host) tuple per match, instead of a list of strings
emails = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)  ## [('alice', 'google.com'), ('bob', 'abc.com')]

for email in emails:
    # each item is a (username, host) tuple, one per matched email
    print('Email Tuple:', email)
    print('User name:', email[0])
    print('Host:', email[1])
    print()

Email Tuple: ('alice', 'google.com')
User name: alice
Host: google.com

Email Tuple: ('bob', 'abc.com')
User name: bob
Host: abc.com

