# REGULAR EXPRESSIONS

Regular expressions are a powerful language for matching text patterns. This section gives a basic introduction to regular expressions, sufficient for our Python exercises, and shows how they work in Python. The Python "re" module provides regular expression support.
In Python a regular expression search is typically written as:

match = re.search(pat, str)

In [1]:
import re

In [2]:

x = re.search("nds.$","A cat and a rat can't be friends.")


In [3]:
x.group()

'nds.'

The re.search() function takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually followed immediately by an if-statement that tests whether it succeeded, as in the following example, which searches for the pattern 'word:' followed by two word characters and a digit (details below):

In [4]:
str1 = 'an example word:vx7'

In [5]:

match = re.search(r'word:\w\w\d', str1)
# Display the match object; the if-statement in the next cell tests whether the search succeeded
match

<re.Match object; span=(11, 19), match='word:vx7'>

In [6]:
if match:
  print('found', match.group()) ## 'found word:vx7'
else:
  print('did not find')

found word:vx7


The code match = re.search(pat, str) stores the search result in a variable named "match". The if-statement then tests the match: if it is true, the search succeeded and match.group() is the matching text (e.g. 'word:vx7'). Otherwise, if the match is false (None, to be more specific), the search did not succeed and there is no matching text.
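
The search-then-test idiom described above can be wrapped in a small helper. This is only a minimal sketch: the name first_match is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the first matching text, or None if the pattern is not found.

    Hypothetical helper for illustration; not part of the re module.
    """
    match = re.search(pattern, text)
    if match:
        return match.group()
    return None

print(first_match(r'word:\w\w\d', 'an example word:vx7'))  # word:vx7
print(first_match(r'word:\w\w\d', 'no such text here'))    # None
```

Returning None for a failed search mirrors what re.search() itself does, so callers can keep using a plain if-statement.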

### Basic Patterns
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:


a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)

. (a period) -- matches any single character except newline '\n'

\w -- (lowercase w) matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_]. For Python 3 string patterns it also matches non-ASCII letters and digits unless the re.ASCII flag is set. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.

\b -- boundary between word and non-word

\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed, vertical tab [ \n\r\t\f\v]. \S (upper case S) matches any non-whitespace character.

\t, \n, \r -- tab, newline, return

\d -- decimal digit [0-9]

^ = start, $ = end -- match the start or end of the string

\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a punctuation character such as '@' has special meaning, you can put a backslash in front of it, \@, to make sure it is treated just as a character.
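
A few of the less obvious entries in the list above can be sketched as follows. The sample strings are made up for illustration; re.escape() is the standard-library function that backslash-escapes meta-characters in a string.

```python
import re

# \b marks a word boundary, so 'cat' matches only as a whole word
whole_word = re.search(r'\bcat\b', 'concatenate a cat')
print(whole_word.span())  # (14, 17): the 'cat' inside 'concatenate' is skipped

# ^ and $ anchor the pattern to the start and end of the string
at_end = re.search(r'rat$', 'a cat and a rat')
at_start = re.search(r'^rat', 'a cat and a rat')
print(at_end.group())  # rat
print(at_start)        # None: the string does not start with 'rat'

# A backslash makes a meta-character literal
any_char = re.search(r'a.c', 'abc')      # . matches the 'b'
literal_dot = re.search(r'a\.c', 'abc')  # \. needs a real period
print(any_char.group())  # abc
print(literal_dot)       # None

# re.escape() adds the backslashes for you
print(re.escape('a.c'))  # a\.c
```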

In [7]:
## Search for pattern 'ig' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'ig', 'piiig') #=>  found, match.group() == "ig"
print(match)

<re.Match object; span=(3, 5), match='ig'>


In [8]:
match = re.search(r'igs', 'piiig') #=>  not found, match == None
print(match)

None


In [9]:
  ## . = any char but \n
match = re.search(r'..g', 'iig') #=>  found, match.group() == "iig"
print(match)

<re.Match object; span=(0, 3), match='iig'>


In [10]:
## \d = digit char, \s = whitespace char
match = re.search(r'\d\d\s', 'p123g') #=>  not found, match == None (no whitespace follows the digits)
print(match)

None


In [11]:
match = re.search(r'\w\w\w', '@@abcd!!') #=>  found, match.group() == "abc"
print(match)

<re.Match object; span=(2, 5), match='abc'>


In [12]:
match = re.search(r'iig', 'piigxxxiiig')
print(match)

<re.Match object; span=(1, 4), match='iig'>


### Repetition
Things get more interesting when you use + and * to specify repetition in the pattern.

'+' means 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's

'*' means 0 or more occurrences of the pattern to its left

'?' means 0 or 1 occurrences of the pattern to its left

In [13]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') #=>  found, match.group() == "piii"
print(match)
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') #=>  found, match.group() == "ii"
print(match)
  

<re.Match object; span=(0, 4), match='piii'>
<re.Match object; span=(1, 3), match='ii'>


In [14]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx') #=>  found, match.group() == "1 2 3"
print(match.group())

1 2 3


In [15]:
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') #=>  found, match.group() == "12  3"
print(match.group())

12  3


In [16]:


match = re.search(r'\d\s*\d\s*\d', 'xx123xx') #=>  found, match.group() == "123"
print(match.group())


123


In [21]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w*', 'foodbar') #=>  not found, match == None
print(match)

None

In [22]:


  ## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foodbar') #=>  found, match.group() == "bar"
print(match.group())

bar
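
The cells above exercise + and *. The ? quantifier can be sketched the same way; the sample strings here are made up for illustration.

```python
import re

# u? means "zero or one u", so both spellings match
spellings = [re.search(r'colou?r', word).group() for word in ['color', 'colour']]
print(spellings)  # ['color', 'colour']

# ? is greedy like + and *: it takes the optional character when it can
match = re.search(r'\d-?\d', 'call 5-5 or 55')
print(match.group())  # 5-5
```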


In [23]:
#match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') #=>  found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') #=>  found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') 
print(match.group())

123


In [24]:
str = 'purple alice-b@goog-le.com monkey dishwasher'

Note: A search that uses only \w on either side of the @, such as r'\w+@\w+', does not get the whole email address in this case, because \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.
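
The truncation described in the note can be reproduced with a short sketch. The pattern r'\w+@\w+' is an assumption about the search the note has in mind, and the variable name text is used here only to avoid shadowing the built-in str.

```python
import re

text = 'purple alice-b@goog-le.com monkey dishwasher'

# \w+ stops at '-' and '.', so only part of the address is matched
match = re.search(r'\w+@\w+', text)
if match:
    print(match.group())  # b@goog
```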

### Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [25]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@goog-le.com'

alice-b@goog-le.com


You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. A caret (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
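
Ranges, a literal dash, and an inverted set can be sketched as follows, using made-up sample strings.

```python
import re

# [a-z]+ matches a run of lowercase letters
lower = re.search(r'[a-z]+', 'ABC def GHI')
print(lower.group())  # def

# A dash placed last in the set is a literal dash
with_dash = re.search(r'[abc-]+', 'x-a-b-c-y')
print(with_dash.group())  # -a-b-c-

# [^ab]+ matches a run of characters other than 'a' and 'b'
inverted = re.search(r'[^ab]+', 'abba-zebra')
print(inverted.group())  # -ze
```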

### Group Extraction

The "group" feature of a regular expression lets you pick out parts of the matching text. Putting parentheses ( ) around part of a pattern creates a group; the parentheses do not change what the pattern matches. On a successful search, match.group(1) is the text matched by the first pair of parentheses, match.group(2) is the text matched by the second, and match.group() is still the whole match.

In [26]:
  str = 'purple alice-b@google.com monkey dishwasher'
  match = re.search(r'([\w.-]+)@([\w.-]+)', str)
  if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com
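
As a small variation on the cell above, match.groups() returns all the groups at once as a tuple, which can be unpacked directly. The variable names text, username and host are chosen here for illustration.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    # groups() returns a tuple with one entry per parenthesized group
    username, host = match.groups()
    print(username)  # alice-b
    print(host)      # google.com
```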


### findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

In [28]:
  ## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

  ## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']

emails

['alice@google.com', 'bob@abc.com']

In [30]:
for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com
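
findall() can also be combined with parenthesized groups. When the pattern contains two or more groups, findall() returns a list of tuples instead of a list of strings, one tuple per match. A minimal sketch on the same sample text (the variable names text and pairs are chosen here for illustration):

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Each tuple holds (group 1, group 2) for one match
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)  # [('alice', 'google.com'), ('bob', 'abc.com')]

for username, host in pairs:
    print(username, host)
```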


In [40]:
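import re

# Open the file; the with-statement closes it when the block ends
with open('D:\\test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings.
    # Without the re.MULTILINE flag, ^ matches only at the very start of the text.
    strings = re.findall(r'^This', f.read())
print(strings)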

[]
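
One possible reason for the empty list above is that ^ matches only at the very start of the text unless the re.MULTILINE flag is given. This is an assumption, since the contents of test.txt are not shown. The sketch below uses a made-up string as a stand-in for f.read().

```python
import re

# Stand-in for f.read(); the real contents of test.txt are not known
text = 'That is one line\nThis is another\nThis is the last\n'

# By default ^ matches only at the start of the whole text
default_result = re.findall(r'^This', text)
print(default_result)  # []

# With re.MULTILINE, ^ also matches at the start of every line
multiline_result = re.findall(r'^This', text, re.MULTILINE)
print(multiline_result)  # ['This', 'This']
```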
