**Regular Expressions - Part 2**

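Again, we'll use the search function below to avoid repetitive typing.

It has been modified to accept an optional flags argument, so that a flag such as **re.DOTALL** can be passed to the search. With **re.DOTALL**, the dot also matches a newline character, which it does not do by default.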

In [1]:
import re
def Search(pattern,text,flags=0):
    m=re.search(pattern,text,flags)
    if m:
        print("found a match")
        print(m.span())
        print(text[m.start():m.end()])
    else:
        print("no match")

In [2]:
Search("dog.*cat","I lost my dog.\n He was chasing your cat")

no match


In [3]:
Search("dog.*cat","I lost my dog.\n He was chasing your cat",re.DOTALL)

found a match
(10, 39)
dog.
 He was chasing your cat
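
As a minimal illustration of what the flag changes, here is a sketch that calls re.search directly rather than going through the Search helper. The variable names are ours, chosen only for this example.

```python
import re

# By default "." matches any character except a newline.
default_match = re.search("a.b", "a\nb")
# With re.DOTALL, "." also matches the newline.
dotall_match = re.search("a.b", "a\nb", re.DOTALL)

print(default_match)          # None
print(dotall_match.span())    # (0, 3)
```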


**Escaping**

The backslash character is used to escape a special character in a regex pattern. So, for example, to search for a literal + character, you would need to do something like the following.

In [4]:
pattern="\+"
text="2+5=7"
Search(pattern,text)

found a match
(1, 2)
+


**Backslash in strings**

The backslash appears in strings with other special meanings, including specification of 

- a character in a string using its hexadecimal (base 16) representation via \xDD where each D is a hexadecimal digit 0-9, A-F.
- a character in a string using its octal (base 8) representation via \DDD where each D is an octal digit 0-7 (up to three digits).

Here is an example where we create a string consisting of a single ASCII character. The decimal number 65 is 41 in hexadecimal and 101 in octal.

In [5]:
text1="\x41"
text2="\101"
text3=chr(65)
print(text1)
print(text2)
print(text3)
print(text1==text2)
print(text1==text3)

A
A
A
True
True


**Extended ASCII range**

The same works for the range 128-255 (often referred to as the *extended ASCII* range). Here, we create a string with a single character corresponding to the decimal 129. This is a non-printing control character, so the printed lines in the output below appear empty.

In [6]:
text1="\x81"
text2="\201"
text3=chr(129)
print(text1)
print(text2)
print(text3)
print(text1==text2)
print(text1==text3)




True
True


And we can create strings with multiple characters in this manner.

In [7]:
text="\x41\101\x81\201"
print(text)
len(text)

AA


4
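
Since two of these characters are not printable, it can help to inspect the string rather than print it. The following is a small sketch using the built-in repr and ord functions; the name codes is ours.

```python
text = "\x41\101\x81\201"

# repr() shows non-printable characters as escape sequences.
print(repr(text))               # 'AA\x81\x81'
# ord() gives the numeric code of each character.
codes = [ord(c) for c in text]
print(codes)                    # [65, 65, 129, 129]
```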

**Putting a literal backslash in a pattern**

When we want a pattern string to have a literal backslash in it, some care is required.

For example, using

text="\"

doesn't work because \" means a literal quote inside a string, so the string is never closed.

**What can we do?**

We have these options:

- escape the backslash
- use chr(92)
- use the hexadecimal escape \x5c

In [8]:
text1="\\"
text2=chr(92)
text3=b"\x5c".decode()
print(text1)
print(text1==text2)
print(text1==text3)

\
True
True


**Raw Strings**

There will be instances in which we want to create a search pattern or a text string and we want all of the characters in our string to be interpreted literally, rather than using their special meaning in Python strings. 

Python provides the **raw string** mechanism for telling the interpreter to take all of the characters literally, without interpretation. 

In the following example, here's a sentence one might see in a textbook explaining how LaTeX works.

"In latex, we use \xi to represent the greek letter $\xi.$"

And in a textbook explaining Unicode, we might write this:

'To get the greek character $\xi$ we would encode the byte array "\xCE\xBE".'

Without escaping the backslash things don't work.

For example, this would produce an error, because \x must be followed by two hexadecimal digits:

text="\xi"

And this doesn't produce the desired result.

In [9]:
text="\xCE"
print(text)

Î


But escaping works ...

In [10]:
text="\\xi"
print(text)
text="\\xCE"
print(text)

\xi
\xCE


We can also use a raw string.

In [11]:
text=r'In latex, we use \xi to represent the greek letter '
print(text)

In latex, we use \xi to represent the greek letter 


In [12]:
text=r'\xCE\xBE'
print(text)

\xCE\xBE


**Getting a single backslash in a raw string**

There is one issue to be careful of. How do we get a single backslash into a string when we specify it as a raw string?

We could try

text=r'\'

but that leads to an error.

**Explanation**

The problem is explained in this portion of the Python language reference:

"When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string."

So the interpreter is trying to include the second single quote in the string, and the string is not terminated.

This explains the following example.

In [13]:
text=r'\''
print(text)

\'


And this example.

In [14]:
text=r'\\'
print(text)
len(text)

\\


2

As a consequence, a raw string cannot end with an odd number of backslashes, as in

text=r'\\\'

but it can end with an even number.

In [15]:
txt=r"\\\\"
print(txt)

\\\\


So while raw strings are strongly advised for regular expression patterns, there will be times when a string has to end with a single backslash. In that case we need to break the pattern into pieces and add the single backslash using one of the methods described above.

Since, as we'll see below, the backslash has a special use in regular expressions, we'll need to be mindful of this.
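
Breaking a string into pieces might look like the following sketch. The Windows-style path is made up purely for illustration, and the names folder and same are ours.

```python
# A raw string cannot end in a single backslash, so we append it separately.
folder = r"C:\temp" + "\\"      # hypothetical path, for illustration only
print(folder)                   # C:\temp\
print(len(folder))              # 8

# An equivalent alternative using chr(92).
same = r"C:\temp" + chr(92)
print(folder == same)           # True
```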

**Character Classes**

Square brackets are used to represent sets of characters. For example, to match one of the letters a, b or c, we can use [abc].

In [None]:
Search("[abc]","d")

In [None]:
Search("[abc]","help me please")

In [None]:
Search("[abcde]{2}","can you help me find my lost cat please")

**Some more complicated examples**

In [None]:
Search("^[ab)cden (]+?","n (ab")

In [None]:
Search("^[ab)cden (]+","n (abx")

**Escaping**

As usual, if we want to search for the $[$ character we have to escape it.

Observe that the second $]$ is automatically treated as a literal character since there is not an opening $[$ to pair with.

In [None]:
Search(r"\[[abcdr]+]","[abracadabra]")

**Another use of circumflex**

A circumflex at the beginning of the set negates it: the set then matches any single character that is not listed.

In [None]:
Search("[^abc]","abcd")

In [None]:
Search("[^abc]x","c\tx")

**Character ranges can be used**

In [None]:
Search("[0-5]{2}","can you help me find my 12 lost cats please")

In [None]:
Search("[0-5]{2}","can you 17 people help me find my 12 lost cats please?")

In [None]:
Search("[g-m][n-z][a-z]+","I'm hoping you can you help me find my lost cat.")

In [None]:
Search("[g-mt-z]{2}","Can you help me find my lost cat please?")

In [None]:
Search("[4-8][1-2]","5823824854782102786467438")

**Escape again**

If you want your set to include the "-" character, it needs to be escaped (or placed first or last in the set).

Here we search for a three-character pattern made up of spaces and dashes.

In [None]:
Search(r"[\- ]{3}","If you are around today - can you please email me?")
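
As a sketch of the alternatives, the dash can also be placed first or last in the set, where it cannot be read as part of a range. The example calls re.search directly, and the variable names are ours.

```python
import re

text = "If you are around today - can you please email me?"

# Three ways to put a literal dash in a set.
escaped = re.search(r"[\- ]{3}", text)   # escaped dash
first = re.search(r"[- ]{3}", text)      # dash placed first
last = re.search(r"[ -]{3}", text)       # dash placed last

# All three find the same " - " in the text.
print(escaped.span(), first.span(), last.span())
```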

**Special re characters inside \[\]**

Most special re characters inside []'s are taken literally, i.e. they are not interpreted as having any special meaning. 

In [None]:
Search("[/*+]","8*9=72")

In [None]:
Search("[/*+]","8/2=4")

In [None]:
Search("[/*+]","8+2=10")

In [None]:
Search("[?]","Did she have a nice day? Did you?")

In [None]:
Search("[?]$","Did she have a nice day? Did you?")

**Escaping characters with special re meaning**

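The backslash can still be used to escape a character's special meaning to re. In the first pattern below, the unescaped dash forms a range running from the asterisk to the plus sign, so the dash in the text is not matched. Escaping the dash in the second pattern makes it a literal character.

In [None]:
Search("[/*-+]","10-2=8")

In [None]:
Search(r"[/*\-+]","10-2=8")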
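
To see why the unescaped version fails, we can look at the character codes involved. This is a short sketch using re.search directly; the variable names are ours.

```python
import re

# Inside a set, "*-+" is a range from "*" (code 42) to "+" (code 43).
print(ord("*"), ord("+"), ord("-"))     # 42 43 45

# The dash (code 45) lies outside that range, so nothing matches.
range_match = re.search(r"[/*-+]", "10-2=8")
# An escaped dash is a literal member of the set.
escaped_match = re.search(r"[/*\-+]", "10-2=8")

print(range_match)              # None
print(escaped_match.group())    # -
```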

**Special non-re characters**

The backslash still has its usual special meaning in Python strings.

In [None]:
Search("[\t]","Have a nice \t day.")

In [None]:
Search("[\x41\x42]+","ABAB")

In [None]:
Search("[\101\102]+","ABAB")

**Finding a single backslash**

How can we find a single backslash in a string? 

Consider the following example.

In [None]:
text="8\\2"
print(text)
print(len(text))
pattern="\\"
print(pattern)
print(len(pattern))
Search(pattern,text)

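The issue here is that a pattern consisting of a single backslash is incomplete as far as the re.search function is concerned: the regex engine treats the backslash as the start of an escape, finds nothing after it, and raises an error. That backslash needs to be escaped.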

Consequently, we need to take as our initial pattern "\\\\". 

In [None]:
text="8\\2"
print(text)
print(len(text))
pattern="\\\\"
print(pattern)
print(len(pattern))
Search(pattern,text)

Or we can use two backslashes in a raw string. 

In [None]:
text="8\\2"
print(text)
print(len(text))
pattern=r"\\"
print(pattern)
print(len(pattern))
Search(pattern,text)
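
The failure and one further way around it can be sketched as follows. The sketch uses re.compile, re.error and re.escape from the standard re module, none of which appear elsewhere in these notes, and the variable names are ours.

```python
import re

try:
    re.compile("\\")            # the pattern is one backslash character
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)            # the regex engine's complaint

# re.escape builds a pattern that matches its argument literally.
escaped = re.escape("\\")
print(escaped == r"\\")         # True
print(re.search(escaped, "8\\2").span())   # (1, 2)
```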

**Character Sets**

**Digits/non-digits**

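There are some special classes of characters that can appear in search patterns. For example, **\d** refers to any digit (0-9), **\D** refers to a non-digit.

- **\d** is the same as **\[0-9\]**
- **\D** is the same as **\[^0-9\]**

For Python 3 string patterns this equivalence is exact only when the **re.ASCII** flag is used; otherwise **\d** also matches decimal digits from other scripts.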


In [None]:
Search(r"\d","9")

In [None]:
Search(r"\d","-")

In [None]:
Search(r"\D","9")

In [None]:
Search(r"\D","-")
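
The difference between \d and an explicit 0-9 range shows up only with non-ASCII digits. The sketch below uses an Arabic-Indic digit as a test character; the variable names are ours.

```python
import re

arabic_indic_three = "\u0663"    # ARABIC-INDIC DIGIT THREE

# For str patterns, \d matches any Unicode decimal digit.
unicode_match = re.search(r"\d", arabic_indic_three)
# re.ASCII restricts \d to the characters 0-9.
ascii_match = re.search(r"\d", arabic_indic_three, re.ASCII)
# An explicit range never matches the non-ASCII digit.
bracket_match = re.search(r"[0-9]", arabic_indic_three)

print(unicode_match is not None)   # True
print(ascii_match)                 # None
print(bracket_match)               # None
```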

**Word/non-word characters**

**\w** refers to a word character, defined as a single letter, digit or underscore character, and **\W** refers to a non-word character. For ASCII text:

- **\w** is the same as **\[a-zA-Z0-9_\]**

- **\W** is the same as **\[^a-zA-Z0-9_\]**



In [None]:
pattern=r"\W(\w\d){3}"
string="My license plate is E3Y4F7."
Search(pattern,string)
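
The license plate above contains uppercase letters, which is a quick way to check that \w covers both cases. In this sketch we compare \w with two explicit sets; the variable names are ours.

```python
import re

plate = "My license plate is E3Y4F7."

with_w = re.search(r"\W(\w\d){3}", plate)
# A lowercase-only set misses the letters E, Y and F.
lower_only = re.search(r"\W([a-z0-9_]\d){3}", plate)
# Adding A-Z gives the same result as \w for this text.
both_cases = re.search(r"\W([a-zA-Z0-9_]\d){3}", plate)

print(repr(with_w.group()))        # ' E3Y4F7'
print(lower_only)                  # None
print(repr(both_cases.group()))    # ' E3Y4F7'
```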

**White space characters**

White space characters (newline, space, tab, etc.) are indicated with **\s** and non-white space using **\S**. 

So one could use the following to search for a phone number.

In [None]:
import re
pattern=r"\s\d{3}-\d{3}-\d{4}"
string="Is your phone number 877-236-1876?"
re.search(pattern,string)

**Example**

We can search for a date (I mean in a text string!).

In [None]:
Search(r"\s\d{2}[_/\-]?\d{2}[_/\-]?\d{4}","I was born on 04/12/1955")

In [None]:
Search(r"\s\d{2}[_/\-]?\d{2}[_/\-]?\d{4}","I was born on 04_12_1955")

In [None]:
Search(r"\s\d{2}[_/\-]?\d{2}[_/\-]?\d{4}","I was born on 04-12-1955")

In [None]:
Search(r"\s\d{2}[_/\-]?\d{2}[_/\-]?\d{4}","I was born on 04121955")

**Logical or** 

The construction A|B is used to match occurrences of one regular expression A or another B.

In [None]:
pattern="(dog)|(cat)"
string="I don't like dogs, I do like cats"
Search(pattern,string)

In the above example, you can leave out the parentheses with the same result because the | operator gets lowest precedence among regex operators. 

So the following is the same.

In [None]:
pattern="dog|cat"
string="I don't like dogs, I do like cats"
Search(pattern,string)

**Another example**

Why do you suppose these give different results?

In [None]:
pattern="(dog)|(cat).*cat"
string="I don't like dogs, I do like cats"
re.search(pattern,string)

In [None]:
pattern="((dog)|(cat)).*cat"
string="I don't like dogs, I do like cats"
re.search(pattern,string)

**Exercise** 

Try to understand what's going on above.

**Order of items matters**

When using A|B, if A matches, B is no longer tried, even if it would produce a longer match.

In [None]:
Search("dog|dogs","I don't like dogs, I like cats")

In [None]:
Search("dogs|dog","I don't like dogs, I like cats")

**Multiple regular expressions separated by |**

In [None]:
import re
pattern="dog|cat|bird"
string="I don't like birds, or cats or dogs."
re.search(pattern,string)

**Operator precedence in re**

The standard for operator precedence can be found on this page:

https://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html

As mentioned above, the alternation symbol | gets lowest precedence.
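
A small sketch of what lowest precedence means in practice: without parentheses, the | splits the entire pattern in two, while parentheses limit its reach. The patterns and variable names here are ours, chosen only for illustration.

```python
import re

# Without grouping, the alternatives are the whole halves: "gra" or "ey".
ungrouped = re.search(r"gra|ey", "grey").group()
# With grouping, only "a" and "e" are alternatives; "gr" and "y" are required.
grouped = re.search(r"gr(a|e)y", "grey").group()

print(ungrouped)   # ey
print(grouped)     # grey
```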

**Back References**

Once we create a regex group using parentheses, the text matched by that group is stored and can be matched again later in the pattern.

The terms \1, \2, ... refer to the groups in order.

In [None]:
pattern=r"(d.g).*(d.g)"
text="""
I can't find my dog Fido, did your cat Fida see him?
I dig him since he's a great dog. 
Do you dig him too?
"""
Search(pattern,text,re.DOTALL)

In [None]:
import re

pattern=r"(d.g).*\1"
text="""
I can't find my dog Fido, did your cat Fida see him?
I dig him since he is a great dog.
Do you dig him too?
"""
Search(pattern,text,re.DOTALL)
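
A common use of back references is spotting a repeated word. The following is a naive sketch (it does not check word boundaries) on a made-up sentence; it uses the group method of the match object to show what was captured.

```python
import re

text = "I saw the the cat."

# (\w+) captures a run of word characters; \1 must match the same text again.
m = re.search(r"(\w+) \1", text)
print(m.group())     # the the
print(m.group(1))    # the
```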