# Regular Expressions in Python

Below is a step-by-step approach to using a regular expression in Python.

1. Import the regex module with ```import re```.
2. Create a Regex object with the ```re.compile()``` function.
3. Pass the string you want to search into the Regex object’s ```search()``` method. This returns a Match object, or ```None``` if the pattern is not found.
4. Call the Match object’s ```group()``` method to return a string of the actual matched text.

In [3]:
import re

In [2]:
# American phone number regex
# Example: 456-123-2323

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242')
print('phone number found:' + mo.group())

phone number found:415-555-4242


Here, we pass the desired pattern to ```re.compile()``` and store the resulting Regex object in ```phoneNumRegex```. Then we call ```search()``` on ```phoneNumRegex```, passing it the string we want to search. The result of the search is stored in the variable ```mo```.

In this example, we know that the pattern occurs in the string, so ```search()``` returns a Match object rather than ```None```. That makes it safe to call ```group()``` on **mo** to return the match. Writing ```mo.group()``` inside the ```print()``` call displays the whole match, 415-555-4242.
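When you cannot be sure that the pattern is present, a small guard avoids the AttributeError raised by calling ```group()``` on ```None```. The following is a minimal sketch; the helper name ```find_phone``` is made up for illustration and is not part of the ```re``` module.

```python
import re

phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none.

    Hypothetical helper, used here only to show the None check.
    """
    mo = phone_num_regex.search(text)
    if mo is None:       # search() found nothing
        return None
    return mo.group()    # the full matched text

print(find_phone('My number is 415-555-4242'))  # 415-555-4242
print(find_phone('I have no phone'))            # None
```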

## Grouping with Parentheses

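**Parentheses** have a special meaning in regular expressions. Adding parentheses creates groups in the regex: ```(\d\d\d)-(\d\d\d-\d\d\d\d)```. You can then pass a group number to the Match object’s ```group()``` method to grab the matching text from just one group. Calling ```group()``` with no argument returns the entire match, and ```groups()``` returns all the groups as a tuple.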

In [8]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242')

In [9]:
mo.group(1)

'415'

In [10]:
mo.group(2)

'555-4242'

In [11]:
mo.group()

'415-555-4242'

In [12]:
mo.groups()

('415', '555-4242')

In [14]:
areaCode, mainNumber = mo.groups()
areaCode, mainNumber

('415', '555-4242')

If you need to match a literal parenthesis, escape the ```(``` and ```)``` characters with a backslash.

In [15]:
phoneNumRegex = re.compile(r'(\(\d\d\d\))(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is (415)555-4242')

In [16]:
mo.group(1)

'(415)'

In [17]:
mo.group(2)

'555-4242'

In regular expressions, the following characters have special meanings. If you want to match any of them literally as part of your text pattern, you need to escape it with a backslash.

```.  ^  $  *  +  ?  {  }  [  ]  \  |  (  )```
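Instead of adding the backslashes by hand, you can let ```re.escape()``` do it. The sketch below uses an arbitrary sample string, ```1+1=2```, to show the difference between the escaped and unescaped pattern.

```python
import re

literal = '1+1=2'   # contains +, which is special in a regex
text = 'We know 1+1=2.'

# Unescaped, the + acts as a quantifier, so the literal text is not found.
unescaped = re.search(literal, text)

# re.escape() adds the backslashes needed to match the text literally.
escaped = re.search(re.escape(literal), text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2
```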

## Matching Multiple Groups with the Pipe

If you want to match one of several expressions, use the ```|``` (pipe) character. If more than one of the alternatives occurs in the searched text, the first occurrence found in the text is returned as the Match object.

In [20]:
heroRegEx = re.compile(r'Batman|Superman')
mo1 = heroRegEx.search('Batman Vs Superman - Dawn of Justice')
mo1.group()

'Batman'

In [23]:
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo2 = batRegex.search('Batmobile lost a wheel')
mo2.group()

'Batmobile'

In [24]:
mo2.group(1)

'mobile'
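Two details of the pipe are easy to miss: the position in the searched text decides which alternative wins, and when several alternatives could match at the same position, they are tried from left to right. A short sketch with made-up sample strings:

```python
import re

hero_regex = re.compile(r'Batman|Superman')

# The match that starts earliest in the text wins,
# regardless of the order of the alternatives in the pattern.
first_hero = hero_regex.search('Superman Vs Batman').group()
print(first_hero)   # Superman

# When several alternatives could match at the same position,
# the leftmost alternative in the pattern is tried first.
short_first = re.search(r'Bat|Batman', 'Batman').group()
long_first = re.search(r'Batman|Bat', 'Batman').group()
print(short_first)  # Bat
print(long_first)   # Batman
```

For this reason, it is usually safer to list the longer alternative first when one alternative is a prefix of another.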

## Optional Matching with the Question Mark

Sometimes there is a pattern that you want to match only optionally. That is, the regex should find a match regardless of whether that bit of text is there. The ? character flags the group that precedes it as an optional part of the pattern.

In [26]:
batRegex = re.compile(r'Bat(wo)?man')
mo3 = batRegex.search('The adventures of Batman')
mo3.group()

'Batman'

In [28]:
batRegex = re.compile(r'Bat(wo)?man')
mo4 = batRegex.search('The adventures of Batwoman')
mo4.group()

'Batwoman'
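When the optional group is absent from the matched text, ```group(1)``` returns ```None``` rather than an empty string. The sketch below shows this, along with a phone number pattern whose area code is optional (the sample strings are illustrative).

```python
import re

bat_regex = re.compile(r'Bat(wo)?man')

# group(1) holds the optional part, or None when it was absent.
absent = bat_regex.search('The adventures of Batman').group(1)
present = bat_regex.search('The adventures of Batwoman').group(1)
print(absent)    # None
print(present)   # wo

# A phone number whose area code (with its hyphen) is optional.
phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
with_area = phone_regex.search('My number is 415-555-4242').group()
without_area = phone_regex.search('My number is 555-4242').group()
print(with_area)     # 415-555-4242
print(without_area)  # 555-4242
```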

## Matching Zero or More with the Star

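The ```*``` means **match zero or more**. The group that precedes the star can occur any number of times in the text, including not at all. If the repetition is interrupted, as in the last example below where ```Batwowooman``` contains a stray ```o```, ```search()``` returns ```None``` and calling ```group()``` raises an AttributeError.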

In [29]:
batRegex = re.compile(r'Bat(wo)*man')
mo5 = batRegex.search('The adventures of Batman')
mo5.group()

'Batman'

In [30]:
mo6 = batRegex.search('The adventures of Batwowowoman')
mo6.group()

'Batwowowoman'

In [31]:
mo7 = batRegex.search('The adventures of Batwowooman')
mo7.group()

AttributeError: 'NoneType' object has no attribute 'group'

## Matching One or More with the Plus


The ```+``` means **match one or more**. Unlike the star, the group preceding the plus must appear at least once.

In [35]:
batRegex = re.compile(r'Bat(wo)+man')
mo8 = batRegex.search('The adventures of Batman')
mo8.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [36]:
mo8 is None

True

In [34]:
mo9 = batRegex.search('The adventures of Batwoman')
mo9.group()

'Batwoman'
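To see the three quantifiers side by side, the following sketch runs ```?```, ```*```, and ```+``` against the same words and records which ones each pattern matches. The dictionary and variable names are chosen here for illustration.

```python
import re

patterns = {
    '?': re.compile(r'Bat(wo)?man'),   # zero or one "wo"
    '*': re.compile(r'Bat(wo)*man'),   # zero or more "wo"
    '+': re.compile(r'Bat(wo)+man'),   # one or more "wo"
}
words = ['Batman', 'Batwoman', 'Batwowowoman']

# For each quantifier, collect the words that the pattern matches.
results = {
    symbol: [w for w in words if regex.search(w) is not None]
    for symbol, regex in patterns.items()
}

for symbol, matched in results.items():
    print(symbol, matched)
# ? ['Batman', 'Batwoman']
# * ['Batman', 'Batwoman', 'Batwowowoman']
# + ['Batwoman', 'Batwowowoman']
```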

## Matching Specific Repetitions with Braces 

If you want to match exactly two occurrences of ```wo``` in the above example, you can use ```(wo){2}```.

In [37]:
batRegex = re.compile(r'Bat(wo){2}man')
mo10 = batRegex.search('The adventures of Batwowoman')
mo10.group()

'Batwowoman'

In [38]:
mo11 = batRegex.search('The adventures of Batwowowoman')
mo11.group()

AttributeError: 'NoneType' object has no attribute 'group'

In [39]:
mo11 is None

True

Because the string contains three occurrences of ```wo```, it does not match the regular expression, and ```search()``` returns ```None```.

You can also match a range of repetitions by specifying minimum and maximum values. For example, ```{2,5}``` matches 2 to 5 occurrences (inclusive), ```{2,}``` matches 2 or more occurrences, and ```{,5}``` matches zero to five occurrences.

In [40]:
batRegex = re.compile(r'Bat(wo){2,}man')
mo12 = batRegex.search('The adventures of Batwowoman')
mo12.group()

'Batwowoman'

In [41]:
mo13 = batRegex.search('The adventures of Batwowowoman')
mo13.group()

'Batwowowoman'

In [42]:
mo14 = batRegex.search('The adventures of Batwoman')
mo14.group()

AttributeError: 'NoneType' object has no attribute 'group'
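The three range forms can be checked by building test words with a known number of ```wo``` repetitions. This is a small sketch; ```count_range``` is a hypothetical helper written for this illustration.

```python
import re

def count_range(pattern, max_wo=7):
    """Return the counts n for which 'Bat' + 'wo' * n + 'man' matches pattern.

    Hypothetical helper, used only to illustrate the brace ranges.
    """
    regex = re.compile(pattern)
    return [n for n in range(max_wo + 1)
            if regex.search('Bat' + 'wo' * n + 'man') is not None]

print(count_range(r'Bat(wo){2,5}man'))  # [2, 3, 4, 5]
print(count_range(r'Bat(wo){2,}man'))   # [2, 3, 4, 5, 6, 7]
print(count_range(r'Bat(wo){,5}man'))   # [0, 1, 2, 3, 4, 5]
```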

### Greedy and Non-greedy Matching

Python’s regular expressions are greedy by default, which means that in ambiguous situations they will match the longest string possible. The non-greedy (also called lazy) version of the braces, which matches the shortest string possible, has the closing brace followed by a question mark.

In [44]:
batRegex = re.compile(r'Bat(wo){3,5}man')
mo15 = batRegex.search('The adventures of Batwowowowoman')
mo15.group()

'Batwowowowoman'

The example below returns the shortest possible match because a ```?``` follows the closing brace.

In [47]:
batRegex = re.compile(r'(wo){3,5}?')
mo16 = batRegex.search('The adventures of Batwowowowoman')
mo16.group()

'wowowo'
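Running the greedy and non-greedy versions of the same pattern on the same string makes the difference easier to see. A minimal sketch:

```python
import re

text = 'The adventures of Batwowowowoman'   # contains four "wo"

# Greedy: takes as many repetitions as it can (up to 5, but only 4 exist).
greedy = re.search(r'(wo){3,5}', text).group()

# Non-greedy: stops as soon as the minimum of 3 is satisfied.
lazy = re.search(r'(wo){3,5}?', text).group()

print(greedy)  # wowowowo
print(lazy)    # wowowo
```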

## The ```findall()``` Method

Regex objects also have a ```findall()``` method, which returns the strings of every match in the searched string. In contrast, ```search()``` returns a Match object for only the first match.

In [48]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 Work: 212-555-0000')
mo.group()

'415-555-9999'

```findall()``` does not return a Match object. As long as there are no groups in the regular expression, it returns a list of strings, where each string is a piece of the searched text that matched the regular expression.

In [50]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo1 = phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
mo1


['415-555-9999', '212-555-0000']

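If there are groups in the regular expression, ```findall()``` returns the groups instead of the whole match: a list of strings when there is exactly one group, and a list of tuples of strings (one string per group) when there are two or more.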

In [52]:
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')

[('415', '555', '9999'), ('212', '555', '0000')]
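The sketch below shows the single-group case, and also ```finditer()```, which yields Match objects and is one way to get the full matched text even when the pattern contains groups.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Exactly one group: a list of strings holding only that group.
area_codes = re.findall(r'(\d\d\d)-\d\d\d-\d\d\d\d', text)
print(area_codes)    # ['415', '212']

# finditer() yields one Match object per match, so both the
# full match and the individual groups remain available.
grouped = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
full_numbers = [mo.group() for mo in grouped.finditer(text)]
print(full_numbers)  # ['415-555-9999', '212-555-0000']
```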

## Character Classes

| Shorthand character class | Represents |
|---|---|
| \d | Any numeric digit from 0 to 9. |
| \D | Any character that is **NOT** a numeric digit from 0 to 9. |
| \w | Any letter, numeric digit, or the underscore character. |
| \W | Any character that is **NOT** a letter, numeric digit, or the underscore character. |
| \s | Any space, tab, or newline character. |
| \S | Any character that is **NOT** a space, tab, or newline character. |

Examples:

1. ```\d``` is shorthand for the regular expression ```(0|1|2|3|4|5|6|7|8|9)```.
2. The character class ```[0-5]``` matches only the digits 0 to 5; this is much shorter than typing ```(0|1|2|3|4|5)```.
3. ```[a-zA-Z]``` matches only letters, lowercase or uppercase.
4. ```[a-zA-Z0-9]``` matches any lowercase letter, uppercase letter, or digit.
5. The ```r'^\d+$'``` regular expression string matches strings that consist entirely of one or more digits, from beginning to end.
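The shorthand classes can be combined in a single pattern. The sketch below uses made-up sample text to find a number followed by a word, and to pick out only the digits 0 to 5.

```python
import re

# One or more digits, one whitespace character, then one or more word characters.
xmas_regex = re.compile(r'\d+\s\w+')
found = xmas_regex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies')
print(found)       # ['12 drummers', '11 pipers', '10 lords', '9 ladies']

# A custom range: only the digits 0 to 5.
low_digits = re.findall(r'[0-5]', 'Room 4096')
print(low_digits)  # ['4', '0']
```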

### Making Your Own Character Classes

You can define your own character class using square brackets. 

For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase.

By placing a caret character (^) just after the character class’s opening bracket, you can make a negative character class. A negative character class will match all the characters that are not in the character class.

In [54]:
consonantRegex = re.compile(r'[^aeiouAEIOU]')
consonantRegex.findall('RoboCop eats baby food. BABY FOOD.')

['R',
 'b',
 'C',
 'p',
 ' ',
 't',
 's',
 ' ',
 'b',
 'b',
 'y',
 ' ',
 'f',
 'd',
 '.',
 ' ',
 'B',
 'B',
 'Y',
 ' ',
 'F',
 'D',
 '.']
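As the output above shows, the negative class ```[^aeiouAEIOU]``` also matches spaces and periods, because they are not vowels either. One possible way to keep only consonant letters is to add ```\W```, ```\d```, and ```_``` to the excluded set. This is a sketch that assumes plain ASCII text.

```python
import re

text = 'RoboCop eats baby food. BABY FOOD.'

vowels = re.findall(r'[aeiouAEIOU]', text)

# Exclude vowels plus every non-word character (\W), digit (\d), and
# underscore, which leaves only consonant letters for this ASCII text.
consonants = re.findall(r'[^aeiouAEIOU\W\d_]', text)

print(''.join(vowels))      # oooeaaooAOO
print(''.join(consonants))  # RbCptsbbyfdBBYFD
```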

## The Caret and Dollar Sign Characters

The caret (^) at the start of the regex indicates that a match must occur at the beginning of the searched text.

Likewise, the dollar sign ($) at the end of the regex indicates that the string must end with this pattern.

In [55]:
beginsWithHello = re.compile(r'^Hello')
beginsWithHello.search('Hello, world!')

<re.Match object; span=(0, 5), match='Hello'>

In [56]:
beginsWithHello.search('He said hello.') is None

True

The r'\d$' regular expression string matches strings that end with a numeric character from 0 to 9.

In [57]:
endsWithNumber = re.compile(r'\d$')
endsWithNumber.search('Your number is 42')

<re.Match object; span=(16, 17), match='2'>

In [58]:
endsWithNumber.search('Your number is forty two.') is None

True
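Using the caret and the dollar sign together requires the entire string to match the pattern. The sketch below shows this with ```r'^\d+$'```, compares it with ```re.fullmatch()```, and illustrates one subtlety: ```$``` also matches just before a trailing newline, while ```\Z``` matches only at the very end of the string.

```python
import re

whole_number = re.compile(r'^\d+$')

all_digits = whole_number.search('1234567890') is not None
letters_inside = whole_number.search('12345xyz67890') is not None
space_inside = whole_number.search('12 34567890') is not None
print(all_digits, letters_inside, space_inside)   # True False False

# fullmatch() requires the pattern to cover the entire string,
# so ^ and $ are not needed.
full = re.fullmatch(r'\d+', '1234567890') is not None
print(full)                                       # True

# $ tolerates a single trailing newline; \Z does not.
dollar_newline = re.search(r'\d$', '42\n') is not None
strict_newline = re.search(r'\d\Z', '42\n') is not None
print(dollar_newline, strict_newline)             # True False
```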