# Python Regular Expressions

The `re` module implements regular expression (regex) operations.

A regex describes a pattern, built from literal characters and special characters, that strings are matched against.

In [1]:
import re

Regex special characters

- `.`: Any character except newline
- `^`: Beginning of the string
- `$`: End of the string
- `*`: Zero or more repetitions
- `?`: 0 or 1 occurrence
- `+`: One or more repetitions
- `{x}`: Matches exactly x times
- `{x,y}`: Matches between x and y times
- `[]`: Set of characters
- `|`: Or
- `\d`: Decimal digits
- `\D`: Anything but decimal digits
- `\s`: Whitespace characters (space, tab, newline CR/LF)
- `\S`: Anything but those from `\s`
- `\w`: Alphanumeric characters and '_'
- `\W`: Anything but those from `\w`
- `\`: Escape
- `[^...]`: Any character not in the set
- `()`: Grouping
- `[...-...]`: Range of characters inside a set
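
A few of the special characters listed here can be checked quickly with `re.fullmatch`, which succeeds only when the whole string matches. This is a minimal sketch; the patterns and sample strings are made up for illustration.

```python
import re

# Each entry: (pattern, sample string, whether the whole string should match).
# The sample strings are illustrative only.
checks = [
    (r"a.c", "abc", True),        # . matches any single character except newline
    (r"a.c", "a\nc", False),      # . does not match a newline by default
    (r"ab*c", "ac", True),        # * allows zero repetitions
    (r"ab+c", "ac", False),       # + needs at least one repetition
    (r"ab?c", "abbc", False),     # ? allows at most one occurrence
    (r"\d{2,3}", "1234", False),  # {2,3} allows two or three digits only
    (r"[^0-9]+", "abc", True),    # [^...] matches anything outside the set
    (r"cat|dog", "dog", True),    # | chooses between alternatives
]

results = [bool(re.fullmatch(p, s)) for p, s, _ in checks]
for (pattern, sample, expected), matched in zip(checks, results):
    print(repr(pattern), repr(sample), matched)
```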

Usage:

- `re.compile(regex_pattern, flags)` compiles a regex into a reusable pattern object
- `re.match` checks whether the beginning of a given string matches the regex

In [2]:
# Regex of a valid Romanian phone number
r = re.compile("07[0-9]{8}")
if r.match("0740123456"):
    print("Valid phone number")

Valid phone number
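
A compiled pattern can be reused on many strings, and `flags` is the optional second argument of `re.compile`. The sketch below is only an illustration: the candidate numbers and the `greeting` pattern are made-up samples, not part of the original example.

```python
import re

# Compile once, reuse many times. The phone numbers below are made-up samples.
phone = re.compile(r"07[0-9]{8}")
candidates = ["0740123456", "0212345678", "07401"]
valid = [c for c in candidates if phone.match(c)]
print(valid)

# re.IGNORECASE makes letters in the pattern match regardless of case.
greeting = re.compile(r"hello", re.IGNORECASE)
print(bool(greeting.match("HeLLo world")))
```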


Examples of regex and what they mean:

- `\w+\s+\w+`: A word of one or more characters, one or more whitespace characters, then another word of one or more characters
- `^\w+\s+\w+$`: Same as above, but the whole string must match
- `[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}`: Four numbers of 1 to 3 digits each, separated by dots
- `([0-9]{1,3}\.){3}[0-9]{1,3}`: Same as above
- `^((([0-9])|([1-9][0-9])|(1[0-9]{2})|(2[0-4][0-9])|(25[0-5]))\.){3}(([0-9])|([1-9][0-9])|(1[0-9]{2})|(2[0-4][0-9])|(25[0-5]))$`: IP addresses (each number between 0 and 255)
- `[12]\d{12}`: Valid Romanian ID (not exactly: there are no conditions for date of birth and county/city/town/commune/sector/village code)
- `0x[0-9a-fA-F]+`: Valid hex number (upper- and lowercase digits may be mixed)
- `(if|then|else|while|continue|break)`: Any one of the listed programming keywords
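
The difference between the loose and the strict IP patterns in this list can be seen by running both on a few addresses. This is a sketch under assumptions: the `octet` helper string is a simplified rewrite of the strict pattern (without the extra inner groups), and the sample addresses are made up.

```python
import re

# Loose pattern: any four groups of 1 to 3 digits.
loose = re.compile(r"([0-9]{1,3}\.){3}[0-9]{1,3}")

# Strict pattern: every number must be between 0 and 255.
# `octet` is a helper name introduced here for readability.
octet = r"([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
strict = re.compile(r"^(" + octet + r"\.){3}" + octet + r"$")

for address in ["192.168.0.1", "999.1.1.1", "256.0.0.1"]:
    print(address, bool(loose.fullmatch(address)), bool(strict.match(address)))
```

Only the first address passes both checks; the other two have the right shape but contain a number above 255.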

`re.match` returns a match object if the regex matches at the beginning of the string, and `None` otherwise. The match does not have to reach the end of the string, unless the regex ends with `$`.

In [5]:
if re.match(r"\d+","123 USD"):
    print ("Match")

if re.match(r"\d+","Price is 123 USD"):
    print ("Match")

if not re.match(r"\d+$","123 USD"):
    print ("NO Match")

Match
NO Match
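
When the whole string has to match, `re.fullmatch` is an alternative to ending the regex with `$`. A minimal sketch, reusing the strings from the cell above:

```python
import re

text = "123 USD"
prefix_ok = re.match(r"\d+", text)        # the prefix "123" matches
whole_ok = re.fullmatch(r"\d+", text)     # " USD" is left over, so this is None
digits_ok = re.fullmatch(r"\d+", "123")   # the whole string is digits

print(bool(prefix_ok), bool(whole_ok), bool(digits_ok))
```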


`search` scans the whole string and stops at the first position where the regex matches

`search` and `match` return a match object, which always evaluates as true. If no match is found they return `None`, which evaluates as false. A match object has the following members:

- `group(index)`: Substring matched by the given group. 0 means the entire match
- `lastindex`: Index of the last group that matched (`None` if no group matched)

In [6]:
result = re.search(r"\d+","Price is 123 USD")
if result:
    print (result.group(0))

123
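
Besides `group` and `lastindex`, a match object also exposes `groups()`, `span()` and `start()`. The sketch below uses a slightly different pattern from the cell above (a second group for the currency, added here only for illustration).

```python
import re

result = re.search(r"(\d+)\s+(\w+)", "Price is 123 USD")
if result:
    print(result.group(0))   # whole match
    print(result.groups())   # all captured groups as a tuple
    print(result.span())     # (start, end) offsets of the whole match
    print(result.start(2))   # offset where group 2 begins
    print(result.lastindex)  # index of the last group that matched
```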


Quantifiers such as `*` and `+` can be followed by `?`. This makes them non-greedy: they match as few characters as possible

In [3]:
result = re.search(r".*(\d+)", "File size is 12345 bytes")
if result:
    print (result.group(1))

result = re.search(r".*?(\d+)", "File size is 12345 bytes")
if result:
    print (result.group(1))

5
12345
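
In the first search the greedy `.*` takes as much as it can and leaves only the last digit for the group; in the second, `.*?` stops as early as possible. The same effect is easy to see on markup-like text. A minimal sketch, with a made-up sample string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # made-up sample text

greedy = re.findall(r"<.*>", html)    # .* runs to the last '>' in the string
lazy = re.findall(r"<.*?>", html)     # .*? stops at the first '>' it can reach

print(greedy)
print(lazy)
```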


`()` is used to capture a specific part of the match within the regex pattern

In [8]:
result = re.search(r"(\d+)[^\d]*(\d+)","Price is 123 USD aprox 110 EUR")
if result:
    print (result.lastindex)
    for i in range(0,result.lastindex+1):
        print (i, "=>", result.group(i))

2
0 => 123 USD aprox 110
1 => 123
2 => 110


In [9]:
import re
result = re.search(r"((\d+),(\d+))[^\d]*(\d+)","Color from pixel 20,30 is 123")

if result:
    print (result.lastindex)
    for i in range(0,result.lastindex+1):
        print (i, "=>", result.group(i))

4
0 => 20,30 is 123
1 => 20,30
2 => 20
3 => 30
4 => 123


The match object returned by `search` holds one group for each `()` in the regex. Nested groups are numbered by the position of their opening parenthesis

To find all non-overlapping matches of the regex inside the string, the `findall` method is used:
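
When parentheses are needed only for structure, `(?:...)` groups without capturing, so the group is skipped in the numbering. A minimal sketch comparing both forms on the string used in the cell above:

```python
import re

text = "Color from pixel 20,30 is 123"

# Capturing outer group: it takes number 1 and pushes the others down.
nested = re.search(r"((\d+),(\d+))[^\d]*(\d+)", text)
print(nested.groups())

# Non-capturing outer group: only the three inner groups are numbered.
flat = re.search(r"(?:(\d+),(\d+))[^\d]*(\d+)", text)
print(flat.groups())
```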

In [10]:
result = re.findall(r"\d+","Color from pixel 20,30 is 123")
if result:
    print (result)

['20', '30', '123']


Groups `()` are allowed. With more than one group, `findall` returns a list of tuples, one tuple per match

In [11]:
result = re.findall(r"(\d)(\d+)","Color from pixel 20,30 is 123")
if result:
    print (result)

[('2', '0'), ('3', '0'), ('1', '23')]
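
With exactly one group, `findall` returns only the text captured by that group instead of tuples. When the position of each match is needed as well, `re.finditer` yields match objects. A minimal sketch on the same string:

```python
import re

text = "Color from pixel 20,30 is 123"

# One group: the result is a list of strings (the captured first digits).
first_digits = re.findall(r"(\d)\d+", text)
print(first_digits)

# finditer yields match objects, so the whole match and its offset are available.
matches = [(m.group(0), m.start()) for m in re.finditer(r"\d+", text)]
print(matches)
```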


The `split` method splits a string at every match of the regex and returns the list of pieces

In [12]:
result = re.split("[aeiou]+","Color from pixel 20,30 is 123")
print (result) # The string is split at every vowel or run of consecutive vowels

['C', 'l', 'r fr', 'm p', 'x', 'l 20,30 ', 's 123']


In [13]:
print (re.split(r"\d\d","Color from pixel 20,30 is 123")) # delimited by exactly 2 decimals

['Color from pixel ', ',', ' is ', '3']


Groups can be used as well. In this case, the text matched by each group is also included in the result list

In [14]:
print (re.split(r"(\d)(\d)","Color from pixel 20,30 is 123"))

['Color from pixel ', '2', '0', ',', '3', '0', ' is ', '1', '2', '3']


In [15]:
print (re.split(r"(\d\d+)", "Color from pixel 20,30 is 123"))

['Color from pixel ', '20', ',', '30', ' is ', '123', '']


The `split` method also accepts a maximum number of splits (`maxsplit`) and flags

In [16]:
s = "Today I'm having a python course"
print (re.split("[^a-z']+", s))
print (re.split("[^a-z']+", s, maxsplit = 2))
print (re.split("[^a-z']+", s, flags = re.IGNORECASE))
print (re.split("[^a-z']+", s, maxsplit = 2, flags = re.IGNORECASE))
print (re.split("[^a-z'A-Z]+", s))

['', 'oday', "'m", 'having', 'a', 'python', 'course']
['', 'oday', "'m having a python course"]
['Today', "I'm", 'having', 'a', 'python', 'course']
['Today', "I'm", 'having a python course']
['Today', "I'm", 'having', 'a', 'python', 'course']


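The `sub` method replaces every match of the regex with another string. The replacement can refer to groups (such as `\1`) or be a function that receives the match object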

In [17]:
s = "Today I'm having a python course"
print (re.sub(r"having\s+a\s+\w+\s+course", "not doing anything", s))

Today I'm not doing anything


In [18]:
s = "Today I'm having a python course"
print (re.sub(r"having\s+a\s+(\w+)\s+course",
              r"not doing the \1 course",
              s))

Today I'm not doing the python course


In [19]:
def ConvertToHex(s):
    return hex(int(s.group(0)))

s = "File size is 12345 bytes"
print (re.sub(r"\d+",ConvertToHex, s))

File size is 0x3039 bytes
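
`sub` also takes a `count` argument that limits the number of replacements, and `re.subn` additionally reports how many replacements were made. A minimal sketch, with a made-up sample string:

```python
import re

s = "1 apple, 22 pears, 333 plums"   # made-up sample text

all_replaced = re.sub(r"\d+", "N", s)             # replace every match
first_only = re.sub(r"\d+", "N", s, count=1)      # replace only the first match
text, replacements = re.subn(r"\d+", "N", s)      # new string plus number of replacements

print(all_replaced)
print(first_only)
print(text, replacements)
```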


## Regex Extensions

`(?...)` introduces an extension. `(?P<name>...)` sets a name for a group

In [20]:
s = "File size is 12345 bytes"
result = re.search(r"(?P<file_size>\d+)",s)
if result:
    print ("Size is ",result.group("file_size"))

Size is  12345
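
Named groups can also be referenced later: `\g<name>` inside a `sub` replacement, and `(?P=name)` inside the regex itself. A minimal sketch; the second sample string is made up for illustration.

```python
import re

s = "File size is 12345 bytes"

# \g<file_size> in the replacement refers to the named group.
shortened = re.sub(r"(?P<file_size>\d+) bytes", r"\g<file_size> B", s)
print(shortened)

# (?P=word) matches the same text that the group named "word" captured.
repeated = re.search(r"(?P<word>\w+) (?P=word)", "hello hello world")
print(repeated.group("word"))
```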


`groupdict()` returns a dictionary that maps each group name to the substring it matched

In [4]:
s = "File config.txt was created on 2016-10-20 and has 12345 bytes"
result = re.search(r"File\s+(?P<name>[a-z\.]+)\s.*(?P<date>\d{4}-\d{2}-\d{2}).*\s(?P<size>\d+)",s)
if result:
    print (result.groupdict())

{'name': 'config.txt', 'date': '2016-10-20', 'size': '12345'}


- `(?i)`: Ignore case (applies to the whole pattern and should be placed at the start of the regex)
- `(?s)`: Makes `.` match every character, including newline

In [24]:
s = "12345abc54321"
result = re.search("(?i)([A-Z]+)",s)
if result:
    print (result.group(1))

abc
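
To limit a flag to part of the pattern, Python 3.6 and later accept the scoped form `(?i:...)`. The sketch below also shows the effect of `(?s)`; the sample strings are made up for illustration.

```python
import re

# (?i:...) limits case-insensitivity to the group it wraps.
scoped_hit = re.search(r"(?i:py)thon", "PYthon")
scoped_miss = re.search(r"(?i:py)thon", "PYTHON")   # "thon" stays case-sensitive
print(bool(scoped_hit), bool(scoped_miss))

# (?s) lets . match newlines as well.
text = "first line\nsecond line"
without_s = re.search(r"first.*second", text)       # . stops at the newline
with_s = re.search(r"(?s)first.*second", text)
print(without_s)
print(repr(with_s.group(0)))
```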


- `(?=...)`: Lookahead. Matches if the text that follows matches `...`, without consuming it
- `(?!...)`: Negative lookahead. Matches if the text that follows does not match `...`
- `(?#...)`: Comment

In [25]:
s = "Python Course"
result = re.search(r"(Python)\s+(?=Course)",s)
if result:
    print (result.group(1))

Python
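
A lookahead only checks the text that follows; it does not become part of the match. The sketch below shows this, together with a negative lookahead. The second sample sentence is made up for illustration.

```python
import re

s = "Python Course"

# The lookahead checks for "Course" but leaves it out of the match.
lookahead = re.search(r"Python\s+(?=Course)", s)
print(repr(lookahead.group(0)))

# (?!...) succeeds only if the text that follows does NOT match:
# keep every word that is not followed by "Course".
words = re.findall(r"\b\w+\b(?!\s+Course)", "Python Course and Java Class")
print(words)
```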


In [26]:
import re
s = "Size is 1234 bytes"
result = re.search(r"(?# file size)(\d+)",s)
if result:
    print ("Size is ",result.group(1))

Size is  1234


## Example of a Tokenizer

In [27]:
number = r"(?P<number>\d+)"
operation = r"(?P<operation>[+\-\*\/])"
braket = r"(?P<braket>[\(\)])"
space = r"(?P<space>\s)"
other = r"(?P<other>.)"
r = re.compile(number+"|"+operation+"|"+braket+"|"+space+"|"+other)

expr = "10 * (250+3)"
for matchobj in r.finditer(expr):
    key = matchobj.lastgroup
    print (matchobj.group(key)+" => "+key)

10 => number
  => space
* => operation
  => space
( => braket
250 => number
+ => operation
3 => number
) => braket
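
A tokenizer like this one is often wrapped in a function that drops whitespace and rejects unknown characters. The following is one possible sketch, not a definitive implementation: the `tokenize` helper and its error message are hypothetical, and the pattern is the same set of token classes rewritten with `re.VERBOSE`.

```python
import re

# Same token classes as above, written as one verbose pattern.
token_re = re.compile(r"""
    (?P<number>\d+)
  | (?P<operation>[+\-*/])
  | (?P<braket>[()])
  | (?P<space>\s+)
  | (?P<other>.)
""", re.VERBOSE)

def tokenize(expr):
    """Return (kind, text) pairs, skipping whitespace (hypothetical helper)."""
    tokens = []
    for m in token_re.finditer(expr):
        kind = m.lastgroup
        if kind == "space":
            continue  # whitespace carries no meaning here
        if kind == "other":
            raise ValueError("unexpected character %r at position %d"
                             % (m.group(), m.start()))
        tokens.append((kind, m.group()))
    return tokens

print(tokenize("10 * (250+3)"))
```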


The example above works much like the token-splitting stage that tools such as `lex` perform for a `yacc` parser