In [68]:
import re

## Literal Characters

- The most basic regular expression consists of a __single literal__ character, such as $a$. 
- It matches the first occurrence of that character in the string. 
- If the string is __Jack is a boy__, it matches the __a__ after the __J__. 
- The fact that this __a__ is in the middle of a word does not matter to the regex engine. 
    - If it matters to us, we can tell the regex engine so by using __word boundaries__. 

- This regex can match the second __a__ too. It only does so when we tell the regex engine to start searching through the string after the first match.
    
Similarly, the regex __cat__ matches __cat__ in __About cats and dogs__. 

- This regular expression consists of a series of 3 literal characters. This is like saying to the regex engine: find a __c__, immediately followed by an __a__, immediately followed by a __t__.

Note that regex engines are __case sensitive__ by default: __cat__ does not match __Cat__ unless we tell the regex engine to ignore differences in case.
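
The points above can be checked with a short sketch using Python's `re` module. The variable names are illustrative only, and the positions in the comments are 0-based.

```python
import re

sentence = 'Jack is a boy'

# re.search reports only the first (leftmost) occurrence of the literal a
first_span = re.search(r'a', sentence).span()

# re.finditer resumes after each match, so it also finds the second a
all_starts = [m.start() for m in re.finditer(r'a', sentence)]

# Three literal characters in a row: c, then a, then t
cat_matches = re.findall(r'cat', 'About cats and dogs')

# Matching is case sensitive unless re.IGNORECASE is passed
case_sensitive = re.search(r'jack', sentence)
case_insensitive = re.search(r'jack', sentence, re.IGNORECASE)

print(first_span, all_starts, cat_matches)
print(case_sensitive, case_insensitive.group())
```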

## Special Characters

- Because we want to do more than __simply search__ for literal pieces of text, we need to __reserve certain characters for special use__. 

- there are 12 characters with special meanings: 

    - the backslash \
    - the caret ^
    - the dollar sign \$
    - the period or dot .
    - the vertical bar or pipe symbol |
    - the question mark ?
    - the asterisk or star *
    - the plus sign +
    - the opening parenthesis (
    - the closing parenthesis )
    - the opening square bracket [
    - and the opening curly brace { 
    
These special characters are often called “metacharacters”. Most of them are errors when used alone.

- To use any of these characters as a literal, escape it with a backslash \. 
    - To match __1+1=2__, the correct regex is __1\\+1=2__. Without the backslash, the plus sign keeps its special meaning. 
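
A small sketch of the difference, assuming Python's `re` module. `re.escape` is a standard-library helper that escapes literal text for us, and the sample strings are made up for illustration.

```python
import re

# Unescaped, + is a quantifier: "one or more 1s, then 1=2"
unescaped = re.search(r'1+1=2', '1+1=2')
unescaped_other = re.search(r'1+1=2', '111=2')

# Escaped, + is a literal plus sign
escaped = re.search(r'1\+1=2', '1+1=2')

# re.escape builds an escaped pattern from arbitrary literal text
auto_escaped = re.search(re.escape('1+1=2'), 'we know that 1+1=2')

print(unescaped)
print(unescaped_other.group(), escaped.group(), auto_escaped.group())
```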

## How a Regex Engine Works Internally

- When applying a __regex__ to a string, the engine starts at the first character of the string. It tries all possible permutations of the regular expression at the __first character__. 

- Only if all possibilities have been tried and found to __fail__, does the engine continue with the __second character__ in the text. 

- Again, it tries all possible permutations of the regex, in exactly the same order. 

- The result is that the regex engine returns the __leftmost match__.

- When applying __cat__ to __He captured a catfish for his cat__, 
    - the engine tries to match the first token in the regex, __c__, to the __first character__ in the string, __H__. 
    - This fails. 
    - There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters.
    - So the regex engine tries to match the __c__ with the __e__. This fails too, as does matching the __c__ with the space. 
    - Arriving at the 4th character in the string, __c__ matches __c__. 
    - The engine then tries to match the second token __a__ to the 5th character, __a__. This succeeds too. 
    - But then, __t__ fails to match __p__. 
    - At that point, the engine knows the regex __cannot__ be matched starting at the __4th character__ in the string. 
    - So it continues with the __5th__: a. 
    - Again, __c__ fails to match here and the engine carries on. 
    - At the 15th character in the string, __c__ again matches __c__. 
    - The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that __a__ matches __a__ and __t__ matches __t__.

- The entire regular expression could be matched starting at character 15. 
- The engine is “eager” to report a match. It therefore reports the first 3 letters of catfish as a valid match. 
- The engine never proceeds beyond this point to see if there are any “better” matches. The first match is considered good enough.
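
The walk-through above can be imitated for purely literal patterns with a naive loop. This is only a sketch: `leftmost_literal_match` is a hypothetical helper written for illustration, not how Python's `re` module is implemented.

```python
import re

def leftmost_literal_match(pattern, text):
    """Hypothetical helper: try the literal pattern at each start position in turn."""
    for start in range(len(text) - len(pattern) + 1):
        # Compare token by token; all() stops at the first mismatch
        if all(text[start + i] == ch for i, ch in enumerate(pattern)):
            return start  # eager: report the first success and stop looking
    return None

subject = 'He captured a catfish for his cat'
naive_start = leftmost_literal_match('cat', subject)
engine_span = re.search(r'cat', subject).span()

# 0-based index 14 is the 15th character of the string
print(naive_start, engine_span)
```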


## Character Classes or Character Sets

With a “character class”, also called “character set”, we can tell the regex engine to match only one out of several characters. 

Simply place the characters we want to match between __square brackets__. 

If we want to match an __a__ or an __e__, use __[ae]__. You could use this in __gr[ae]y__ to match either __gray__ or __grey__. 

In [69]:
re.search(r'gr[ae]y', 'grey gray')
#gray
#grey

<re.Match object; span=(0, 4), match='grey'>

A character class matches only a __single character__. 

__gr[ae]y__ does not match __graay__, __graey__ 

The order of the characters inside a character class does not matter. 

In [70]:
re.search(r'gr[ae]y', 'graey')   # No find: [ae] consumes only one character

In [71]:
re.search(r'gr[ae]y', 'greay')   # No find


In [72]:
re.search(r'gr[ae]y', 'greyy')   # find
#grey

<re.Match object; span=(0, 4), match='grey'>

In [6]:
re.search(r'gr[ae]y', 'greey')   # No find

In [7]:
re.search(r'gr[ae]y', 'graay')   # No find

We can use a hyphen inside a character class to specify a range of characters. 

[0-9] matches a single digit between 0 and 9. We can use more than one range. [0-9a-fA-F] matches a __single__ hexadecimal digit, case insensitively. 

You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. 

Again, the order of the characters and the ranges does not matter.
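
A brief check of these ranges, using a made-up sample string:

```python
import re

sample = 'x1Fg-9'

# One hexadecimal digit at a time, upper or lower case
hex_digits = re.findall(r'[0-9a-fA-F]', sample)

# Ranges combined with the single characters x and X
hex_or_x = re.findall(r'[0-9a-fxA-FX]', sample)

# Same class with its ranges and characters reordered
reordered = re.findall(r'[XA-Fxa-f0-9]', sample)

print(hex_digits, hex_or_x, reordered)
```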

> You can find a word even if it is misspelled, for example with __sep[ae]r[ae]te__.

In [73]:
re.findall(r'bks[0-9]move', 'bks3move bks303move')   # find

['bks3move']

In [74]:
re.findall(r'popcorn[0-9a-fA-F]', 'popcornA popcornAI popcorna') # find

['popcornA', 'popcornA', 'popcorna']

In [75]:
re.search(r'sep[ae]r[ae]te', 'seperete') #  find

<re.Match object; span=(0, 8), match='seperete'>

In [76]:
re.search(r'[abc]', 'ac')

<re.Match object; span=(0, 1), match='a'>

In [77]:
re.search(r'[abc]', 'ac cd')

<re.Match object; span=(0, 1), match='a'>

## Negated Character Classes

Typing a __caret__ after the opening square bracket __negates__ the character class. 

The result is that the character class matches any character that is __not__ in the character class. 

Unlike the __dot__, __negated character classes__ also match (invisible) __line break__ characters. 

If you don’t want a negated character class to match line breaks, you need to include the line break characters in the class. 

[^0-9\r\n] matches any character that is not a digit or a line break.
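
The difference between the dot and a negated class can be seen on a string that contains a line break. This is a sketch with an illustrative sample string:

```python
import re

sample = '1\n2a'

# By default the dot does not match the line break
dot_matches = re.findall(r'.', sample)

# The negated class matches anything that is not a digit, including \n
negated = re.findall(r'[^0-9]', sample)

# Listing \r and \n inside the class excludes line breaks again
negated_no_breaks = re.findall(r'[^0-9\r\n]', sample)

print(dot_matches, negated, negated_no_breaks)
```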

In [13]:
re.search(r'[^0-9\r\n]', 'popc6666orn.ai') #  find

<re.Match object; span=(0, 1), match='p'>

In [14]:
re.search(r'q[^u]', 'iraq') #  no find

In [15]:
re.search(r'q[^u]', 'iraqi') #  find

<re.Match object; span=(3, 5), match='qi'>

## Metacharacters Inside Character Classes

- the only special characters or metacharacters inside a character class are the __closing bracket ]__, the __backslash \__, the __caret ^__, and the __hyphen -__. 

The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. 

To search for a star or plus, use __[+*]__

To include a backslash as a character without any special meaning inside a character class, we have to escape it with another backslash. 
    __[\\\x]__ matches a backslash or an x.
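
A short sketch of these rules in Python. The last two lines go slightly beyond the text above: they assume the common convention that a hyphen at the end of a class, and a caret anywhere but right after the opening bracket, are treated as literals.

```python
import re

# Star and plus need no escaping inside a class
star_or_plus = re.findall(r'[+*]', '2+3*4')

# A backslash must be escaped with another backslash
backslash_or_x = re.findall(r'[\\x]', r'a\bx')

# Hyphen at the end of the class is a literal hyphen
hyphen_literal = re.findall(r'[a-]', 'a-b')

# Caret that is not right after [ is a literal caret
caret_literal = re.findall(r'[a^]', 'a^b')

print(star_or_plus, backslash_or_x, hyphen_literal, caret_literal)
```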

In [78]:
re.search(r'Hello [@%&*$#!]', 'Hello !! Hello ##') 

<re.Match object; span=(0, 7), match='Hello !'>

In [79]:
re.findall(r'Hello [@%&*$#!]', 'Hello !! Hello ##') 

['Hello !', 'Hello #']

In [18]:
re.search(r'Hello [@%&*$#!]', 'Hello #') 

<re.Match object; span=(0, 7), match='Hello #'>

## Character Class Subtraction

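Character class subtraction matches any single character present in one list (the character class), but not present in another list (the subtracted class). 

The syntax for this is __[class-[subtract]]__.

Note that Python's built-in __re__ module does not support this syntax; it comes from other regex flavors such as .NET and XML Schema. 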

The character class __[a-z-[aeiuo]]__ matches a single letter that is not a vowel. 

In other words: it matches a single consonant. 

Without character class subtraction or intersection, the only way to do this would be to list all consonants: __[b-df-hj-np-tv-z]__.

In [19]:
re.findall(r'[aeiou]', 'yyyy god ') 

['o']

In [20]:
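re.findall(r'[a-z-[aeiou]]', 'yyyy god ')   # not subtraction in re: parsed as the class [a-z-[aeiou] followed by a literal ]

[]

In [21]:
re.findall(r'[0-9-[0-1]]', '123 5647 36')   # same here: the class [0-9-[0-1] followed by a literal ]

[]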
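
Since the two calls above come back empty in Python, here is a sketch of two ways to get the intended result with the `re` module. The second uses a negative lookahead, a different technique from class subtraction, swapped in here only as a workaround.

```python
import re

sample = 'yyyy god '

# Option 1: list the consonants explicitly
listed = re.findall(r'[b-df-hj-np-tv-z]', sample)

# Option 2: a negative lookahead rejects vowels before [a-z] consumes a letter
lookahead = re.findall(r'(?![aeiou])[a-z]', sample)

# The same idea for "any digit except 0 and 1"
digits = re.findall(r'(?![01])[0-9]', '123 5647 36')

print(listed, lookahead, digits)
```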

## Start of String and End of String Anchors

- They do not match any character at all. 
- they match a position before, after, or between characters. 
- They can be used to “anchor” the regex match at a certain position. 
- the caret ^ matches the position before the first character in the string. 
- Applying __^a__ to __abc__ matches a. 
- __^b__ does not match __abc__ at all, because the b cannot be matched right after the start of the string, matched by ^.

Similarly, \$ matches right after the last character in the string. __c\$__ matches __c__ in __abc__

In [22]:
re.findall(r'^Data', 'Data Science Data mining') 

['Data']

In [23]:
re.findall(r'mining$', 'Data mining Data mining') 

['mining']

#### what happens when we try to match ^4$ to 749\n486\n4

In [24]:
re.search(r'^4$', '4') 

<re.Match object; span=(0, 1), match='4'>

In [25]:
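re.search(r'^4$', '\n4')   # no match: without re.MULTILINE, ^ matches only at the very start of the string

In [26]:
re.search(r'^4$', '749\n486\n4 ')   # no match: same reason, and the trailing space would also block $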
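
To make the anchors match at line breaks as well, Python offers the `re.MULTILINE` flag. A minimal sketch on the string from the heading above (without the trailing space):

```python
import re

lines = '749\n486\n4'

# Default mode: ^ only matches at the very start of the string
default_mode = re.search(r'^4$', lines)

# Multiline mode: ^ and $ also match right after and right before a line break
multiline_mode = re.search(r'^4$', lines, re.MULTILINE)

print(default_mode, multiline_mode.span())
```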

## Word Boundaries

- The metacharacter __\b__ is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. 

- This match is zero-length.

There are 3 different positions that qualify as word boundaries:

- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.

Simply put: __\b__ allows you to perform a __“whole words only”__ search using a regular expression in the form of __\bword\b__. 

A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.

Since digits are considered to be word characters, __\b4\b__ can be used to match a 4 that is not part of a larger number. This regex does not match __44 sheets of a4__. 

So saying __“\b matches before and after an alphanumeric sequence”__ is more exact than saying “before and after a word”.

__\B__ is the __negated__ version of __\b__. 

__\B__ matches at every position where __\b__ does not. 

> Effectively, __\B__ matches at any position between two word characters as well as at any position between two non-word characters.

In [27]:
re.search(r'\bgreat\b', 'hellopopcorn.ai is great place to work, hail popcorn.ai') 

<re.Match object; span=(19, 24), match='great'>

In [28]:
re.findall(r'\bpopcorn\.ai\b', 'hellopopcorn.ai is great place to work, hail popcorn.ai')   # dot escaped so it matches only a literal "."

['popcorn.ai']

In [29]:
re.findall(r'\b4\b', '4 44 sheets of a4') 

['4']
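
To contrast __\b__ with __\B__, the sketch below looks for __cat__ as a whole word and as text buried inside a longer word. The sample string is made up for illustration.

```python
import re

sample = 'concatenate cat bobcat'

# \b on both sides: cat must stand alone as a word
whole_word = [m.start() for m in re.finditer(r'\bcat\b', sample)]

# \B on both sides: cat must have word characters on both sides
inside_word = [m.start() for m in re.finditer(r'\Bcat\B', sample)]

print(whole_word, inside_word)
```

The __cat__ at the end of __bobcat__ is found by neither pattern, because it has a word boundary on one side only.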

## Alternation with The Vertical Bar or Pipe Symbol

You can use alternation to match a single regular expression out of several possible regular expressions.

If you want to search for the literal text __cat__ or __dog__, separate both options with a vertical bar or pipe symbol: __cat|dog__. 

If you want more options, simply expand the list: __cat|dog|mouse|fish__.

The alternation operator has the lowest precedence of all regex operators. That is, it tells the regex engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you need to use parentheses for grouping. If we want to improve the first example to match whole words only, we would need to use \b(cat|dog)\b. This tells the regex engine to find a word boundary, then either cat or dog, and then another word boundary. If we had omitted the parentheses then the regex engine would have searched for a word boundary followed by cat, or, dog followed by a word boundary.
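
The effect of the parentheses can be checked directly. This is a sketch with an illustrative sample string:

```python
import re

sample = 'cat dogs hotdog catalog'

# Grouped: word boundary, then cat or dog, then word boundary
grouped = re.findall(r'\b(cat|dog)\b', sample)

# Ungrouped: (\bcat) or (dog\b), so partial words slip through
ungrouped = re.findall(r'\bcat|dog\b', sample)

print(grouped)
print(ungrouped)
```

The ungrouped version also picks up the __dog__ at the end of __hotdog__ and the __cat__ at the start of __catalog__.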

In [80]:
s = 'airways aircraft airplane bomber'
result = re.findall(r'(airways|airplane|bomber)', s)
print (result)

['airways', 'airplane', 'bomber']


In [81]:
result2 = re.findall(r'(air(ways|plane)|bomber)', s)
print (result2)


[('airways', 'ways'), ('airplane', 'plane'), ('bomber', '')]


In [32]:
result3 = re.findall(r'(air(?:ways|plane)|bomber)', s)
print (result3)

['airways', 'airplane', 'bomber']



## Optional Items
The __question mark__ makes the preceding token in the regular expression optional. __colou?r__ matches both __colour__ and __color__. 

The question mark is called a quantifier.

You can make several tokens optional by grouping them together using __parentheses__, and placing the question mark after the closing parenthesis. E.g.: __Nov(ember)?__ matches __Nov__ and __November__.

You can write a regular expression that matches many alternatives by including more than one question mark. __Feb(ruary)? 23(rd)?__ matches __February 23rd__, __February 23__, __Feb 23rd__ and __Feb 23__.

You can also use __curly braces__ to make something optional. 

__colou{0,1}r__ is the same as __colou?r__. 
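
A quick check of these patterns, with sample strings chosen for illustration. `finditer` is used for the month so that the whole match is returned rather than the optional group.

```python
import re

# The u is optional, but at most one u is allowed
spellings = re.findall(r'colou?r', 'color colour colouur')

# {0,1} behaves exactly like ?
braces = re.findall(r'colou{0,1}r', 'color colour colouur')

# An optional group: Nov, optionally followed by ember
months = [m.group() for m in re.finditer(r'Nov(ember)?', 'Nov 3 and November 5')]

print(spellings, braces, months)
```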

In [34]:
re.findall(r'(Feb(ruary)?) (23(rd)?)', 'February 23rd February 23 Feb 23rd Feb 23')

[('February', 'ruary', '23rd', 'rd'),
 ('February', 'ruary', '23', ''),
 ('Feb', '', '23rd', 'rd'),
 ('Feb', '', '23', '')]

In [35]:
text = """
1111. ricochet robots
2. settlers of catan
3. acquire
"""
re.findall(r'^(\d+)\.(.*)$', text, re.MULTILINE)

[('1111', ' ricochet robots'), ('2', ' settlers of catan'), ('3', ' acquire')]

## Repetition with Star and Plus

The __asterisk__ or __star__ tells the engine to attempt to match the preceding token __zero__ or __more__ times. 

The __plus__ tells the engine to attempt to match the preceding token __once__ or __more__. 

__<[A-Za-z][A-Za-z0-9]*>__ matches an HTML tag without any attributes. 

- The angle brackets are literals. 
- The first character class matches a letter. 
- The second character class matches a letter or digit. 
- The __star__ repeats the second character class. Because we used the __star__, it’s OK if the second character class matches nothing. 

So our regex will match a tag like <B>
    
When matching <HTML>, the first character class will match H. The star will cause the second character class to be repeated 3 times, matching T, M and L with each step.
    
- \+ 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
    
- \* 0 or more occurrences of the pattern to its left
    
- \? match 0 or 1 occurrences of the pattern to its left
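
The tag regex described above can be tried on a few candidates. This is a sketch; the sample string is made up.

```python
import re

candidates = '<B> <HTML> <h1> <1a> <>'

# Star: one letter, then zero or more letters or digits
with_star = re.findall(r'<[A-Za-z][A-Za-z0-9]*>', candidates)

# Plus: the second class must now match at least once, so <B> drops out
with_plus = re.findall(r'<[A-Za-z][A-Za-z0-9]+>', candidates)

print(with_star, with_plus)
```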

In [36]:
text = '''
Data
li
I like Data
data science with Python
Python is great for data science
science
ldatadatadata
aaa
aa
aaaa
ab
abc
abcc
'''

In [37]:
re.findall(r'lik*', text, re.MULTILINE)

['li', 'lik']

In [38]:
re.findall(r'aaa+', text, re.MULTILINE)

['aaa', 'aaaa']

In [39]:
re.findall(r'abc+', text, re.MULTILINE)

['abc', 'abcc']

In [40]:
re.search(r'Co+kie', 'Cooookie').group()

'Cooookie'

In [41]:
# a* and o* each allow zero or more occurrences, so both letters are optional and repeatable
re.search(r'Ca*o*kie', 'Caokie').group()

'Caokie'

In [42]:
# a? and o? each allow zero or one occurrence, so both letters are optional but not repeatable
re.search(r'Ca?o?kie', 'Caokie').group()

'Caokie'

## Limiting Repetition

The syntax is __{min,max}__, where __min__ is zero or a positive integer number indicating the minimum number of matches, and __max__ is an integer equal to or greater than __min__ indicating the maximum number of matches. 

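If the comma is present but max is omitted, the maximum number of matches is infinite. If both the comma and max are omitted, as in __{min}__, the token is repeated exactly __min__ times. 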

So 
- __{0,1}__ is the same as __?__, 
- __{0,}__ is the same as __*__, 
- __{1,}__ is the same as __+__. 

You could use 

- __\b[1-9][0-9]{3}\b__ to match a number between 1000 and 9999. 
- __\b[1-9][0-9]{2,4}\b__ matches a number between 100 and 99999.
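
Both number patterns can be checked against a handful of values. The sample string is illustrative.

```python
import re

numbers = '999 1000 9999 10000 0123'

# Exactly four digits, the first one non-zero: 1000 to 9999
four_digits = re.findall(r'\b[1-9][0-9]{3}\b', numbers)

# Three to five digits, the first one non-zero: 100 to 99999
three_to_five = re.findall(r'\b[1-9][0-9]{2,4}\b', numbers)

print(four_digits, three_to_five)
```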



In [43]:
s = "sheeeeeeeeple"
re.search(r"he+", s)

<re.Match object; span=(1, 10), match='heeeeeeee'>

In [44]:
text = '''
Data
li
I like Data
data science with Python
Python is great for data science
science
ldatadatadata
aaa
aa
aaaa
ab
abc
abcc
abccc
'''

In [45]:
re.findall(r'a{2}', text, re.MULTILINE)

['aa', 'aa', 'aa', 'aa']

In [46]:
re.findall(r'abc{2}', text, re.MULTILINE)

['abcc', 'abcc']

In [47]:
re.findall(r'a{2,5}', text, re.MULTILINE)

['aaa', 'aa', 'aaaa']

In [48]:
re.findall(r'a{4,}', text, re.MULTILINE)

['aaaa']

In [49]:
text = '''
wazzzzzup
wazzzup
wazup
'''

In [50]:
re.findall(r'waz{3,5}', text, re.MULTILINE)

['wazzzzz', 'wazzz']

## Use Parentheses for Grouping and Capturing

By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternation to part of the regex.

Only parentheses can be used for grouping.

In [51]:
text = '''
abcabca
'''

In [52]:
re.findall(r'(abc){3}', text, re.MULTILINE)

[]

In [53]:
re.findall(r'(abc){2,4}', text, re.MULTILINE)

['abc']

In [54]:
re.findall(r'(abc){2,}', text, re.MULTILINE)

['abc']

In [55]:
re.findall(r'(abc){2}', text, re.MULTILINE)

['abc']
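
The calls above return __['abc']__ rather than __['abcabc']__ because findall() returns the contents of the capturing group, and a repeated group keeps only its last iteration. A sketch of two ways to see the whole match:

```python
import re

subject = 'abcabca'

# A match object exposes both the whole match and the group
match = re.search(r'(abc){2}', subject)
whole = match.group(0)
last_iteration = match.group(1)

# A non-capturing group (?:...) makes findall return the whole match
non_capturing = re.findall(r'(?:abc){2}', subject)

print(whole, last_iteration, non_capturing)
```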

In [56]:
email_address = 'Please contact us at: bhupen@popcorn-ai.com'

match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)

if match:
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

bhupen@popcorn-ai.com
bhupen
popcorn-ai.com


In [57]:
contactInfo = 'Raman, Kumar, 080-2856-1733'

match = re.search(r'(\w+), (\w+), ([\d-]+)', contactInfo)
#match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)

print(match.group(1))
print(match.group(2))
print(match.group(3))
match.group(0)

Raman
Kumar
080-2856-1733


'Raman, Kumar, 080-2856-1733'

Group numbering starts at 1 because group 0 is reserved for the entire match.

In [58]:
input_str    = 'purple support@popcorn-ai.com monkey dishwasher'
result       = re.search(r'([\w.-]+)@([\w.-]+)', input_str)

if result:
    print (result.group() ) 
    print (result.group(1) ) 
    print (result.group(2) ) 

support@popcorn-ai.com
support
popcorn-ai.com


In [59]:
input_str    = 'June 24, August 9, Dec 12'

matches = re.findall(r"[a-zA-Z]+ \d+", input_str)

for match in matches:
    print("match: %s" % (match))

match: June 24
match: August 9
match: Dec 12


To capture only the month of each date, wrap that part of the pattern in parentheses:

In [60]:
input_str    = 'June 24, August 9, Dec 12'

matches = re.findall(r"([a-zA-Z]+) \d+", input_str)

for match in matches:
    print("match: %s" % (match))

match: June
match: August
match: Dec


#### Grouping by Name

Sometimes, especially when a regular expression has a lot of groups, it is impractical to address each group by its number. 

In [61]:
contactInfo = 'Raman, Kumar, 080-2856-1733'

match = re.search(r'(?P<last>\w+), (?P<first>\w+), (?P<phone>\S+)', contactInfo)

print(match.group('last'))
print(match.group('first'))
print(match.group('phone'))
match.group(0)

Raman
Kumar
080-2856-1733


'Raman, Kumar, 080-2856-1733'

Grouping can be used with the findall() method too, even though it doesn’t return match objects. Instead, findall() returns a list of tuples, where the Nth element of each tuple corresponds to the Nth group of the regex pattern.

_However, findall() discards group names: named groups still match, but their text comes back by position in plain tuples. To read groups by name across several matches, use finditer(), which yields match objects._

In [62]:
contactInfo = 'Raman, Kumar, 080-2856-1733'

re.findall(r'(\w+), (\w+), (\S+)', contactInfo)

[('Raman', 'Kumar', '080-2856-1733')]
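
For access by name across all matches, `re.finditer` yields match objects, and `groupdict()` turns each one into a dictionary. A minimal sketch reusing the contact string from above:

```python
import re

contact_info = 'Raman, Kumar, 080-2856-1733'
pattern = r'(?P<last>\w+), (?P<first>\w+), (?P<phone>\S+)'

# findall: plain tuples, names are lost
as_tuples = re.findall(pattern, contact_info)

# finditer: match objects, names are kept
by_name = [m.groupdict() for m in re.finditer(pattern, contact_info)]

print(as_tuples)
print(by_name)
```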

## Using Backreferences To Match The Same Text Again

A backreference such as __\1__ matches the exact text that capturing group 1 matched earlier in the same match attempt. The regex below finds a letter that is followed by three more copies of itself.

In [63]:
text = '''
grrrrrreat
coooooooooooooooooool
awwwwwwwwwwwwwwwwwwsommmmmmmmmmmmmmmme
looooooooooooooooooooooooovvvvvvvvvvvvvvvve
'''
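
As a sketch of how a backreference behaves, __\1__ matches again whatever text capturing group 1 just matched. The sample strings here are made up for illustration.

```python
import re

# A letter immediately followed by the same letter
doubled_letter = re.search(r'(\w)\1', 'hello').group()

# A word immediately repeated after a space
repeated_word = re.search(r'\b(\w+) \1\b', 'this is is a test').group()

# findall returns only the captured letter; finditer exposes the full run
captured_only = re.findall(r'([a-z])\1{3}', 'zzzz ok')
full_runs = [m.group() for m in re.finditer(r'([a-z])\1{3}', 'zzzz ok')]

print(doubled_letter, repeated_word, captured_only, full_runs)
```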

In [64]:
re.search(r'([a-zA-Z])\1{3}', 'grrrrrreat')

<re.Match object; span=(1, 5), match='rrrr'>

## Example: Repeating a Capturing Group

Let’s say you want to match a tag like __!abc!__ or __!123!__

In [65]:
input_str = '!abc! is great guy. 123 is great too, !123! is just cool'

re.findall(r"!(abc|123)!", input_str)

['abc', '123']

Now let’s say that the tag can contain multiple sequences of __abc__ and __123__, like __!abc123!__ or __!123abcabc!__.

In [66]:
input_str = '!abc123! is great guy. !123abc! is great too, !123abcabc! is just cool'

re.findall(r"!(abc|123)+!", input_str)

['123', 'abc', 'abc']

This regular expression will indeed match these tags. However, it does not meet our requirement to capture the tag’s full label into the __capturing__ group.

- When this regex matches !abc123!, the capturing group stores only 123. 
- When it matches !123abcabc!, it only stores abc.

Here is how the regex engine applies __!(abc|123)+!__ to __!abc123!__:

- First, ! matches !. 

- The engine then enters the __capturing__ group. 

- It makes note that __capturing__ group #1 was entered when the engine reached the position between the first and second character in the subject string. The first token in the group is __abc__, which matches __abc__. 

- A match is found, so the second alternative isn’t tried. (The engine does store a backtracking position, but this won’t be used in this example.) 

- The engine now leaves the __capturing__ group. It makes note that capturing group #1 was exited when the engine reached the position between the 4th and 5th characters in the string.

- After having exited from the group, the engine notices the __plus__. The plus is __greedy__, so the group is tried again. The engine enters the group again, and takes note that capturing group #1 was entered between the 4th and 5th characters in the string. 

- It also makes note that since the __plus__ is not possessive, it may be backtracked. That is, if the group cannot be matched a second time, that’s fine. In this backtracking note, the regex engine also saves the entrance and exit positions of the group during the previous iteration of the group.

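- On this second iteration, __abc__ fails to match __123__, but the second alternative, __123__, succeeds. 

- The group is exited again. The exit position between characters 7 and 8 is stored.

- The plus allows another iteration, but neither __abc__ nor __123__ matches the final !, so the engine backtracks to the state after the second iteration and matches ! there. The group keeps only the text of its last successful iteration, __123__.

To capture the whole label, place the repeated group inside an outer capturing group: __!((abc|123)+)!__.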

In [67]:
input_str = '!abc123! is great guy. !123abc! is great too, !123abcabc! is just cool'

re.findall(r"!((abc|123)+)!", input_str)

[('abc123', '123'), ('123abc', 'abc'), ('123abcabc', 'abc')]
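
If only the full label is wanted, the inner group can be made non-capturing so that findall() returns plain strings instead of tuples. This is a sketch of one possible variation, not the only way to write it.

```python
import re

input_str = '!abc123! is great guy. !123abc! is great too, !123abcabc! is just cool'

# Outer group captures the whole label; inner (?:...) only groups the alternatives
labels = re.findall(r'!((?:abc|123)+)!', input_str)

print(labels)
```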