<h1>Regular Expressions</h1>
<hr>

This notebook provides introductory notes on regular expressions, with examples that use Python's "re" module.

In [1]:
import re

In [2]:
def search( pattern, string, flag = 0 ) :
    '''
    Helper function that prints every match of pattern in string, separated by commas
    '''
    print( ', '.join( match.group() for match in re.finditer( pattern, string, flag ) ), end = '' )

In [3]:
string = '''ProximaProxima Centauri b is an exoplanet orbiting in the habitable zone of the red dwarf star Proxima Centauri, 
which is the closest star to the Sun and part of a triple star system.
It is approximately 4.2 light-years (4.0x10^13 km) from Earth in the constellation Centaurus, 
making it one of the closest known exoplanets to the Solar System.
On 24 August 2016, a team of 31 scientists confirmed the existence of Proxima Centauri b.'''

<h2>Basic Patterns</h2>
<hr>

(1) <b>Brackets</b> <span style="color:blue; font-weight:bold">[ ]</span> specify a disjunction of characters.

In [4]:
search( '[Ss]ystem', string ) # Matches "S" or "s" followed by "ystem"

system, System

(2) <b>Brackets</b> <span style="color:blue; font-weight:bold">[ ]</span> plus a <b>Dash</b> <span style="color:blue; font-weight:bold">-</span> specifies a range.

In [5]:
search( '[0-9]', string ) # Matches any single digit from 0 to 9 inclusive

4, 2, 4, 0, 1, 0, 1, 3, 2, 4, 2, 0, 1, 6, 3, 1

(3) <b>Brackets</b> <span style="color:blue; font-weight:bold">[ ]</span> plus a <b>Caret</b> <span style="color:blue; font-weight:bold">^</span> specifies negation.

In [6]:
                                       # Matches any character that is not a letter, newline, comma, period,
search( '[^a-zA-Z\n,.()^ -]', string ) # parenthesis, caret, space or hyphen

4, 2, 4, 0, 1, 0, 1, 3, 2, 4, 2, 0, 1, 6, 3, 1
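
The caret negates only when it is the first character inside the brackets; anywhere else it is a literal caret, which is why the pattern above can list "^" among the excluded characters. A minimal sketch of the difference, using a short sample string that is made up for illustration:

```python
import re

sample = '4.0x10^13 km'  # hypothetical sample, not the notebook's text

# Caret first: negation, so every character that is not a digit matches
non_digits = re.findall(r'[^0-9]', sample)

# Caret not first: a literal caret, so digits and "^" match
digits_or_caret = re.findall(r'[0-9^]', sample)

print(non_digits)       # ['.', 'x', '^', ' ', 'k', 'm']
print(digits_or_caret)  # ['4', '0', '1', '0', '^', '1', '3']
```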

(4) A <b>Question Mark</b> <span style="color:blue; font-weight:bold">?</span> specifies optionality.

In [7]:
search( 'exoplanets?', string ) # Matches "exoplanet" followed by an "s" or nothing

exoplanet, exoplanets

(5) A <b>Period</b> <span style="color:blue; font-weight:bold">.</span> is a wildcard.

In [8]:
search( 'S..', string ) # Matches "S" followed by two wildcards

Sun, Sol, Sys

(6) A <b>Backslash</b> <span style="color:blue; font-weight:bold">\\</span> before a period refers to an actual period.

In [9]:
search( r'[0-9]\.[0-9]', string ) # Matches a digit, a literal period, then a digit (raw string keeps the backslash intact)

4.2, 4.0
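
To see what the backslash changes, the sketch below runs the escaped and unescaped patterns on a made-up sample string. It also uses re.escape, which builds the escaped form of a literal string; the sample and variable names are illustrative only.

```python
import re

sample = 'v4.2 and 4x2'  # hypothetical sample

# Unescaped period: a wildcard, so "4x2" matches as well
wildcard = re.findall(r'[0-9].[0-9]', sample)

# Escaped period: only a literal "." between the two digits matches
literal = re.findall(r'[0-9]\.[0-9]', sample)

# re.escape adds the backslashes needed to match a string literally
escaped = re.escape('4.2')

print(wildcard)  # ['4.2', '4x2']
print(literal)   # ['4.2']
print(escaped)   # 4\.2
```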

(7) An <b>Asterisk</b> <span style="color:blue; font-weight:bold">*</span> specifies zero or more occurrences.

In [10]:
search( '[0-9][0-9]*', string ) # Matches any digit followed by zero or more digits

4, 2, 4, 0, 10, 13, 24, 2016, 31

(8) A <b>Plus Sign</b> <span style="color:blue; font-weight:bold">+</span> specifies one or more occurrences.

In [11]:
search( '[0-9]+', string ) # Matches one or more consecutive digits

4, 2, 4, 0, 10, 13, 24, 2016, 31

(9) A <b>Period</b> <span style="color:blue; font-weight:bold">.</span> plus an <b>Asterisk</b> <span style="color:blue; font-weight:bold">*</span> specifies a string of wildcard characters of indefinite length.

In [12]:
search( r'\(.*\)', string ) # Matches any character string enclosed in parentheses

(4.0x10^13 km)
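
Because ".*" takes as much text as it can, a line with two parenthesized parts yields one long match running from the first "(" to the last ")". One possible way to keep the matches separate is a negated bracket, as in example (3). A small sketch on a made-up sample string:

```python
import re

sample = 'f(a) + g(b)'  # hypothetical sample

# ".*" runs to the last ")" on the line
greedy = re.findall(r'\(.*\)', sample)

# "[^)]*" stops at the first ")" after each "("
bounded = re.findall(r'\([^)]*\)', sample)

print(greedy)   # ['(a) + g(b)']
print(bounded)  # ['(a)', '(b)']
```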

(10) A <b>Caret</b> <span style="color:blue; font-weight:bold">^</span> and a <b>Dollar Sign</b> <span style="color:blue; font-weight:bold">$</span> specify the start and end of a line respectively.

In [13]:
search( '^Proxima', string, re.MULTILINE ) # Can only match pattern at the start of a line

Proxima

In [14]:
search( r'system\.$', string, re.MULTILINE ) # Can only match pattern (with a literal period) at the end of a line

system.
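
The re.MULTILINE flag matters here: without it, "^" and "$" anchor to the start and end of the whole string rather than of each line. A minimal sketch with a made-up two-line sample:

```python
import re

sample = 'first line\nsecond line'  # hypothetical two-line sample

# Without the flag, only the start and end of the whole string qualify
default_start = re.findall(r'^[a-z]+', sample)
default_end = re.findall(r'[a-z]+$', sample)

# With the flag, every line start and line end qualifies
multiline_start = re.findall(r'^[a-z]+', sample, re.MULTILINE)
multiline_end = re.findall(r'[a-z]+$', sample, re.MULTILINE)

print(default_start, default_end)      # ['first'] ['line']
print(multiline_start, multiline_end)  # ['first', 'second'] ['line', 'line']
```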

(11) A <b>Backslash</b> <span style="color:blue; font-weight:bold">\\</span> plus <span style="color:blue; font-weight:bold">b</span> specifies a word* boundary. <br>
(Double backslashes are used because "\b" is the backspace escape in an ordinary Python string. A raw string, r'\bb\b', is equivalent.)

In [15]:
search( '\\bb\\b', string ) # Matches the word "b" (as in Centauri b), not words containing the letter "b"

b, b

*<b>Definition</b> of a word in this context: any sequence of letters, digits, or underscores.
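
The definition has practical consequences: an underscore next to "b" keeps it inside a word, while a hyphen does not. A short sketch on a made-up sample, recording where the standalone word "b" is found:

```python
import re

sample = 'b b_2 web b-side'  # hypothetical sample

# Start positions of every "b" that has a word boundary on both sides
positions = [ match.start() for match in re.finditer( r'\bb\b', sample ) ]

print(positions)  # [0, 10]: the first "b" and the "b" in "b-side"
```

The "b" in "b_2" is skipped because "_" counts as a word character, and the "b" in "web" is skipped because it follows a letter.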
<hr>

(12) A <b>Backslash</b> <span style="color:blue; font-weight:bold">\\</span> plus <span style="color:blue; font-weight:bold">B</span> specifies a word non-boundary. <br>

In [16]:
search( '\\Bb\\B', string ) # Matches a "b" inside a word, with word characters on both sides

b, b, b

Note:

In [17]:
search( 'b', string ) # Here the result is examples (11) + (12) combined, since no other word starts or ends with "b"

b, b, b, b, b

(13) A <b>Pipe</b> <span style="color:blue; font-weight:bold">|</span> specifies a disjunction.

In [18]:
search( 'Centauri b|Centauri|Centaurus', string ) # Matches "Centauri b," "Centauri" or "Centaurus"

Centauri b, Centauri, Centaurus, Centauri b

(14) <b>Parentheses</b> <span style="color:blue; font-weight:bold">( )</span> specify precedence or a capture group.

In [19]:
search( 'Centaur(i b|i|us)', string ) # Equivalent to example (13)

Centauri b, Centauri, Centaurus, Centauri b
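
Precedence matters because the pipe has the lowest priority: without parentheses it splits the entire pattern into alternatives. A minimal sketch on a made-up sample string:

```python
import re

sample = 'Centauri, Centaurus, us'  # hypothetical sample

# No parentheses: the alternatives are "Centauri" and "us"
ungrouped = [ m.group() for m in re.finditer( r'Centauri|us', sample ) ]

# Parentheses: the alternatives are "i" and "us", each after "Centaur"
grouped = [ m.group() for m in re.finditer( r'Centaur(i|us)', sample ) ]

print(ungrouped)  # ['Centauri', 'us', 'us']
print(grouped)    # ['Centauri', 'Centaurus']
```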

In [20]:
search( r'[0-9]+(\.[0-9]+)?', string ) # Matches one or more digits followed by an optional decimal part

4.2, 4.0, 10, 13, 24, 2016, 31

Capture group:

In [21]:
                                  # Matches "the", then any characters, then "to the",
search( '(the).*to \\1', string ) # where "\1" refers back to the text of the first capture group

the closest star to the, the closest known exoplanets to the
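
A captured group can also be read back from the match object. The sketch below uses a fragment of the notebook's text and shows one reason the helper function calls match.group(): when a pattern contains a capture group, re.findall returns the group's text instead of the whole match.

```python
import re

sample = 'the closest star to the Sun'  # fragment of the notebook's text

match = re.search( r'(the).*to \1', sample )
whole = match.group()   # the entire match, same as match.group(0)
first = match.group(1)  # the text captured by the first pair of parentheses

# With one capture group, re.findall returns the group rather than the whole match
found = re.findall( r'(the).*to \1', sample )

print(whole)  # the closest star to the
print(first)  # the
print(found)  # ['the']
```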

(15) <b>Curly Braces</b> <span style="color:blue; font-weight:bold">{ }</span> specify count.

In [22]:
search( '(Proxima){2}', string ) # Matches exactly two occurrences of "Proxima"

ProximaProxima

In [23]:
search( '(Proxima){1,2}', string ) # Matches one or two occurrences of "Proxima"

ProximaProxima, Proxima, Proxima

In [24]:
search( '(Proxima){1,}', string ) # Matches at least one occurrence of "Proxima"

ProximaProxima, Proxima, Proxima
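
The parentheses in these patterns are doing real work: a count applies only to the element immediately before it, so without the group it would repeat just the final "a". A small sketch on a made-up sample string:

```python
import re

sample = 'ProximaProxima Proximaa'  # hypothetical sample

# Count applied to the group: "Proxima" twice in a row
grouped = [ m.group() for m in re.finditer( r'(Proxima){2}', sample ) ]

# Count applied to the last character only: "Proxim" followed by two "a"s
ungrouped = re.findall( r'Proxima{2}', sample )

print(grouped)    # ['ProximaProxima']
print(ungrouped)  # ['Proximaa']
```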

(16) A <b>Question Mark</b> <span style="color:blue; font-weight:bold">?</span> after a counter (<span style="color:blue; font-weight:bold">+</span>, <span style="color:blue; font-weight:bold">*</span>) makes it match as little text as possible.

In [25]:
search( '(Proxima)+?', string ) # Equivalent to search( 'Proxima', string )

Proxima, Proxima, Proxima, Proxima
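
The effect is easier to see when something must follow the counter. In this sketch, on a made-up sample string, the greedy version runs to the last ">" while the version with the question mark stops at the first one:

```python
import re

sample = '<b>bold</b>'  # hypothetical sample

# Greedy: ".+" takes as much as possible before the final ">"
greedy = re.findall( r'<.+>', sample )

# With "?": ".+?" takes as little as possible before a ">"
lazy = re.findall( r'<.+?>', sample )

print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']
```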

(17) <b>Substitution</b> replaces each match of a pattern with another string.

In [26]:
re.sub( '(Proxima){2}', 'Proxima', string ) # Substitutes the typo "ProximaProxima" with "Proxima"

'Proxima Centauri b is an exoplanet orbiting in the habitable zone of the red dwarf star Proxima Centauri, \nwhich is the closest star to the Sun and part of a triple star system.\nIt is approximately 4.2 light-years (4.0x10^13 km) from Earth in the constellation Centaurus, \nmaking it one of the closest known exoplanets to the Solar System.\nOn 24 August 2016, a team of 31 scientists confirmed the existence of Proxima Centauri b.'
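
The replacement string may also refer to a capture group, as in the capture group example above. The sketch below is one possible variation on a fragment of the notebook's text; it also uses re.subn, which reports how many substitutions were made.

```python
import re

sample = 'ProximaProxima Centauri b'  # fragment of the notebook's text

# "\1" in the pattern requires a repeat of group 1; "\1" in the replacement inserts it once
fixed = re.sub( r'(Proxima)\1', r'\1', sample )

# re.subn returns the new string together with the number of substitutions
fixed_again, count = re.subn( r'(Proxima){2}', 'Proxima', sample )

print(fixed)               # Proxima Centauri b
print(fixed_again, count)  # Proxima Centauri b 1
```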