**Regular Expressions Part 1**

Python's regular expressions package, **re**, is very useful for determining whether strings satisfy different matching criteria.

Many applications that support search include a regex engine. There is even a Chrome extension for regexes.

Introduction to Regular Expressions (regex's)

**Some resources**

There is a nice basic regular expressions tutorial here:

https://regexone.com/references/python (click on Interactive Tutorial)

The Python 3 documentation:

https://docs.python.org/3/library/re.html

This is also a nice helpful tutorial:

https://www.tutorialspoint.com/python3/python_reg_expressions.htm

Some more links:

https://docs.python.org/3.6/howto/regex.html

https://www.rexegg.com/

A cheat sheet:

https://cheatography.com/davechild/cheat-sheets/regular-expressions/


**Getting started**

Regular expressions are typically used to determine whether a particular pattern can be found somewhere in a text string and, if so, to return the location of the match.

A regular expression consists of ordinary characters and special characters. The simplest form of regular expression consists of a single (non-special) character.

We can combine regular expressions by concatenating them. Thus, if A is a regular expression and B is a regular expression, so is AB.

For example, an ordinary character is a regular expression, so "d" is a regular expression. So is "o" and so is "g", so "dog" is a regular expression.

Below is a simple example of regular expression search - looking for a pattern that is a substring of a given string. The result of re.search is a match object if a match is found, and otherwise the result is None.

When we do find a match, we print out some additional information:

- a message indicating that a match was found
- the match object
- the type of the match object
- the span of the match object
- the starting position of the match
- the ending position of the match
- the portion of text that matches

Note that the *span* is a 2-tuple with the starting and ending positions of the match in the text string.

In [1]:
import re

text="My name is John. What is yours?"

pattern="John" # regular expression - no special characters

m=re.search(pattern,text)
if m:
    print("found a match")
    print(type(m))
    print(m)
    print(m.span())
    print(m.start())
    print(m.end())
    print(text[m.start():m.end()])
else:
    print("no match")

found a match
<class 're.Match'>
<re.Match object; span=(11, 15), match='John'>
(11, 15)
11
15
John
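
The slice `text[m.start():m.end()]` can also be read directly from the match object. The following is a minimal sketch, not part of the original notebook; the variable names `matched`, `start`, and `end` are chosen here for illustration only.

```python
import re

text = "My name is John. What is yours?"
m = re.search("John", text)

# group() with no argument returns the whole matched substring
matched = m.group()
# span() is the same pair as (start(), end())
start, end = m.span()

print(matched)                     # John
print(matched == text[start:end])  # True
```

Either form works; `m.group()` is simply shorter when only the matched text is needed.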


**What is returned when there is no match**

In [2]:
text="My name is Joan. What is yours?"
pattern="John" 
m=re.search(pattern,text)
print(m)
print(type(m))


None
<class 'NoneType'>


**A function**

Let's introduce a function that gives information like we got above so we don't have to write so much code in each cell as we test searches.

In [3]:
def Search(pattern,text):
    m=re.search(pattern,text)
    if m:
        print("found a match")
        print(m.span())
        print(text[m.start():m.end()])
    else:
        print("no match")

text="My name is John. What is yours?"
pattern="John" # regular expression - no special characters

Search(pattern,text)

found a match
(11, 15)
John


**Which match is used**

The **search** function returns information about the first (leftmost) match only, even if the pattern occurs more than once in the string.

In [4]:
text="01234567890123456789"
pattern="345"
Search(pattern,text)

found a match
(3, 6)
345
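
The string above actually contains "345" twice. As a hedged sketch going slightly beyond the original example, `re.finditer` can be used to list every occurrence, which makes it visible that `re.search` reported only the first one. The names `first` and `all_spans` are illustrative.

```python
import re

text = "01234567890123456789"
pattern = "345"

# search stops at the first occurrence it finds
first = re.search(pattern, text)

# finditer yields one match object per non-overlapping occurrence, left to right
all_spans = [m.span() for m in re.finditer(pattern, text)]

print(first.span())  # (3, 6)
print(all_spans)     # [(3, 6), (13, 16)]
```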


**Searching for special characters**

We can also search for special characters like new line, carriage return, or tab.

In [5]:
text="What is your name?\n\tMy name is John."
pattern="\t"
Search(pattern,text)

found a match
(19, 20)
	


**Special regular expression characters**

There are special characters used to form more complex regular expressions that we can try to match. By default, the period (.) means any single character except newline "\n". So "d.g" matches dog and dig, but not dg and not d\ng.

In [6]:
Search("d.g","dog")

found a match
(0, 3)
dog


In [7]:
Search("d.g","dig")

found a match
(0, 3)
dig


In [8]:
Search("d.g","d.g")

found a match
(0, 3)
d.g


In [9]:
Search("d.g","d0g")

found a match
(0, 3)
d0g


In [10]:
Search("d.g","d\ng")

no match


In [11]:
Search("d.g","dg")

no match


**The DOTALL flag**

If we want "." to be interpreted as any single character, <u>including</u> the newline character, we use the **DOTALL** *flag*. So in the following example, we do indeed get a match.

In [12]:
re.search("d.g","d\ng",re.DOTALL)

<re.Match object; span=(0, 3), match='d\ng'>

**Escaping**

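We can escape the **.** using a backslash so that it is interpreted literally rather than being assigned the special meaning in re. Writing the pattern as a raw string (with an r prefix) keeps Python from treating the backslash as the start of a string escape.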

In [13]:
Search(r"d\.g","dog")
Search(r"d\.g","d.g")

no match
found a match
(0, 3)
d.g
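
Two related conveniences, shown here as a small sketch rather than as part of the original notebook: a raw string passes the backslash to the regex engine unchanged, and `re.escape` inserts the backslashes automatically. The names `literal_dot` and `escaped` are illustrative.

```python
import re

# A raw string keeps the backslash intact for the regex engine
literal_dot = r"d\.g"
print(re.search(literal_dot, "dog"))         # None
print(re.search(literal_dot, "d.g").span())  # (0, 3)

# re.escape returns a pattern that matches its argument literally
escaped = re.escape("d.g")
print(escaped)                               # d\.g
print(re.search(escaped, "dog"))             # None
```

`re.escape` is mainly useful when the text to search for comes from user input and may contain special characters.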


**Matching**

The **match** function is used to determine whether there is a match between a pattern and a substring starting at the beginning of a string.

In [14]:
import re
pattern="d.g"
print(re.match(pattern,"dog"))
print(re.match(pattern,"dog barked"))
print(re.match(pattern,"dog barking"))
print(re.match(pattern,"barking dog"))
print(re.match(pattern,"d\ng"))
print(re.match(pattern,"d\ng",re.DOTALL))

<re.Match object; span=(0, 3), match='dog'>
<re.Match object; span=(0, 3), match='dog'>
<re.Match object; span=(0, 3), match='dog'>
None
None
<re.Match object; span=(0, 3), match='d\ng'>


**Use of circumflex**

We can also use ^ in a **search** pattern to require that the matching portion be an initial substring of the string.

In [15]:
Search("^d.g","dog")

found a match
(0, 3)
dog


In [16]:
Search("^d.g","d3g")

found a match
(0, 3)
d3g


In [17]:
Search("^d.g","d g")

found a match
(0, 3)
d g


In [18]:
Search("^d.g","dog barked")

found a match
(0, 3)
dog


In [19]:
Search("^d.g","barking dog")

no match


In [20]:
Search("^d.g","d\ng")

no match


**Searching differs from matching**

- Searching means determining whether there is a (contiguous) substring in *any* position in a string that matches. 
- Matching means determining whether a match occurs between the pattern and an initial portion of the string.
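
The difference is easiest to see by applying both functions to the same pattern and string. This is a minimal sketch; the variable names are chosen for illustration.

```python
import re

pattern = "d.g"
text = "barking dog"

found_by_search = re.search(pattern, text)  # tries every starting position
found_by_match = re.match(pattern, text)    # only tries position 0

print(found_by_search.span())  # (8, 11)
print(found_by_match)          # None
```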

**Use of dollar sign**

The **$** in a pattern matches at the end of the string, or just before a newline that is the last character of the string.

In [21]:
Search("Jo.n$","Jo8n")

found a match
(0, 4)
Jo8n


In [22]:
Search("Jo.n$","has anybody seen John")

found a match
(17, 21)
John


In [23]:
Search("Jo.n$","are you there John\n")

found a match
(14, 18)
John


In [24]:
Search("Jo.n$","are you there John?")

no match


In [25]:
Search("Jo.n$","hey John, I was looking for you!")

no match


**Quantifiers**

**Asterisk**

The **\*** refers to zero or more copies of the preceding regular expression. In the following example, the o doesn't match the first character, but we still get a match because * allows for zero or more occurrences of an o. The result is an empty match at position 0.

In [26]:
Search("o*","This is a test")

found a match
(0, 0)



In [27]:
Search("o*","")

found a match
(0, 0)



In [28]:
Search("o*","o")

found a match
(0, 1)
o


In [29]:
Search("o*","ooo This is a test")

found a match
(0, 3)
ooo


In [30]:
Search("o*","ooooooooooooooooooooooo")

found a match
(0, 23)
ooooooooooooooooooooooo


In [31]:
Search("xo*x","xx")

found a match
(0, 2)
xx


In [32]:
Search("xo*x","xooooooooox")

found a match
(0, 11)
xooooooooox


**Greediness**

The regex engine scans the string from left to right and reports the match that starts at the earliest possible position, even if a longer match exists further along.

It is greedy in that once it finds a pattern that matches starting at **some position**, it will try to find the largest match starting from **that position**.

In [33]:
Search("Go*","G things happen to Goood people")

found a match
(0, 1)
G


In [34]:
Search("Go*","Go things happen to goood people")

found a match
(0, 2)
Go


In [35]:
Search("Go*","Good things happen to goood people")

found a match
(0, 3)
Goo


In [36]:
Search("Go*","G-men are bad people non G-men are good people")

found a match
(0, 1)
G


**Preceding regular expression**

A quantifier applies to whatever regular expression immediately precedes it.

By default, the preceding regular expression is the smallest unit, such as a single character.

In the following example, the **\*** quantifies only the b and not the combination ab.

In [37]:
Search("ab*","abababab")

found a match
(0, 2)
ab


In [38]:
Search("ab*","abbbb")

found a match
(0, 5)
abbbb


In [39]:
Search("ab*","a")

found a match
(0, 1)
a


**Grouping**

If we want the asterisk to refer to a more complex regular expression, we can use parentheses for the purpose of grouping components of patterns.

In [40]:
Search("ab*","abababab")

found a match
(0, 2)
ab


In [41]:
Search("(ab)*","abababab")

found a match
(0, 8)
abababab


In [42]:
Search("(ab)*","")

found a match
(0, 0)



**Complex groupings**

We can have more complex groupings. 

For example, in the following, we search for repeated occurrences of a group that itself consists of repeated occurrences of ab followed by a c.

In [43]:
import re
Search("((ab)*c)*","ababcabababcabc")

found a match
(0, 15)
ababcabababcabc


**Combinations**

We can combine special characters. In the following, the .* means zero or more occurrences of any character (except \n).

The character repeated need not be the same.

In [44]:
Search("do.*","I see you are doing well.")

found a match
(14, 25)
doing well.


In [45]:
Search("do.*","your dog looks fine, but how are you?")

found a match
(5, 37)
dog looks fine, but how are you?


In [46]:
Search("do.*","your dog looks fine, \n but how are you?")

found a match
(5, 21)
dog looks fine, 
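
The last example stops at the newline because "." does not match "\n" by default. As a hedged sketch combining this with the DOTALL flag introduced earlier, the same pattern runs to the end of the string once the flag is passed. The variable names are illustrative.

```python
import re

text = "your dog looks fine, \n but how are you?"

stops_at_newline = re.search("do.*", text)
crosses_newline = re.search("do.*", text, re.DOTALL)

print(stops_at_newline.span())               # (5, 21)
print(crosses_newline.end() == len(text))    # True
```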


**The plus symbol**

The + symbol means match 1 or more repetitions of the preceding expression.

In [47]:
Search("pup+","here puppy")

found a match
(5, 9)
pupp


In [48]:
Search("pup+","here pup")

found a match
(5, 8)
pup


In [49]:
Search("pup+","here pu")

no match


**Question mark**

The ? character means match either 0 or 1 repetitions of the previous regular expression, preferring 1 when both are possible. In the following, the ? refers to the preceding group of characters "xo".

In [50]:
Search("(xo)?","xo")

found a match
(0, 2)
xo


In [51]:
Search("(xo)?","xoxo")

found a match
(0, 2)
xo


In [52]:
Search("(xo)?","")

found a match
(0, 0)



In [53]:
Search("(xo)?","abc")

found a match
(0, 0)



**Ensuring Non-Greedy Search**

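To ensure that matching is done in a non-greedy (lazy) fashion, we can append ? to a quantifier. Thus +? still means one or more occurrences of the preceding regular expression, but as few as possible. When nothing after the quantifier forces a longer match, as in the examples below, that is a single occurrence.

Here the "?" modifies the "+" quantifier; it does not carry its usual meaning of zero or one.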

In [54]:
pattern="(co)+?"
Search(pattern,"Can you come to my house?")

found a match
(8, 10)
co


In [55]:
Search("(co)+","Do you drink cocoa?")

found a match
(13, 17)
coco


In [56]:
Search("(co)+?","Do you drink cocoa?")

found a match
(13, 15)
co
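
A lazy quantifier takes as few repetitions as it can, but it will take more when the rest of the pattern requires it. The sketch below reuses the cocoa sentence and adds a trailing "a" to the pattern; the variable names are illustrative and the example is not from the original notebook.

```python
import re

text = "Do you drink cocoa?"

# Nothing follows the group, so one "co" is enough
lazy_alone = re.search("(co)+?", text)

# An "a" must follow, so the lazy group expands to two repetitions
lazy_then_a = re.search("(co)+?a", text)

print(lazy_alone.group())   # co
print(lazy_then_a.group())  # cocoa
print(lazy_then_a.span())   # (13, 18)
```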


**Non-greedy \***

The *? is the non-greedy version of * 

In [57]:
Search("o*?","Zm")

found a match
(0, 0)



In [58]:
Search("o*?","Zom")

found a match
(0, 0)



In [59]:
Search("o*","Zoom")

found a match
(0, 0)



**A better example**

Suppose we want to search for patterns starting with "A" and having any characters until seeing a "g". If we use as a pattern:

**"A.\*g"**
    
we will keep looking for other "g"'s after the first one. But maybe we want to stop when we've seen a single "g" after the "A". In this case we could use

**"A.\*?g"**


In [60]:
Search("A.*g","All people have some goodness, but it is hard to see goodness in some.")

found a match
(0, 54)
All people have some goodness, but it is hard to see g


In [61]:
Search("A.*?g","All people have some goodness, but it is hard to see goodness in some.")

found a match
(0, 22)
All people have some g


**Single question mark**

When it does not follow another quantifier, a single question mark makes the preceding regular expression optional.

In [62]:
Search("any (dogs)?.*","I didn't see any dogs or cats, did you?")

found a match
(13, 39)
any dogs or cats, did you?


In [63]:
Search("any (dogs)?.*","I didn't see any cats, did you?")

found a match
(13, 31)
any cats, did you?


**Inclusion precedence**

When using **?**, preference is given to including an optional expression over not including it.

In the following example inclusion wins out.

In [64]:
Search("ab?c", "abc")

found a match
(0, 3)
abc


**Double question marks**

When we use **??** the effect is to not include the optional expression unless it is **needed** to get a match.

In other words, non-inclusion is given higher precedence than inclusion.

This is demonstrated in the following example.

In [65]:
Search("ab??b","abb")

found a match
(0, 2)
ab


**Another example contrasting ? and ??**

In [66]:
Search("ab?","ab")
Search("ab??","ab")

found a match
(0, 2)
ab
found a match
(0, 1)
a


**Braces**

Braces provide another way to do quantification.

We use {m} to match the previous regular expression exactly m times.

In [67]:
Search("(xo){2}","xo i really love you xoxo")

found a match
(21, 25)
xoxo


In [68]:
Search("(xo){2}","xo i really love you xoxoxox")

found a match
(21, 25)
xoxo


**Enforcing exactly 1 match**

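We saw before that **+?** matches a single occurrence when nothing after it in the pattern requires more.

So **+?** and **{1}** give the same result in those examples, but they are not the same quantifier: **{1}** always matches exactly once, while **+?** matches additional occurrences whenever that is needed for the rest of the pattern to match.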
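
A quick check of how **+?** and **{1}** behave when more of the pattern follows the quantifier. This is a minimal sketch with an illustrative test string and variable names, not an example from the original notebook.

```python
import re

text = "xoxo!"

# One or more repetitions, as few as possible, then a literal "!"
lazy_plus = re.search("(xo)+?!", text)

# Exactly one repetition, then a literal "!"
exactly_one = re.search("(xo){1}!", text)

print(lazy_plus.group(), lazy_plus.span())      # xoxo! (0, 5)
print(exactly_one.group(), exactly_one.span())  # xo! (2, 5)
```

The lazy version starts at position 0 and grows to two repetitions to reach the "!", whereas the {1} version cannot grow and so finds its match further to the right.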

**Ranges**

We can specify a range for the number of appearances using {m,n}. 

We also write 

- {,n} to refer to at most n occurrences, and 
- {m,} to mean at least m occurrences.

In [69]:
Search("(xo){2,}","xo i really love you xoxo")

found a match
(21, 25)
xoxo


In [70]:
Search("(xo){2,}","xo i really love you xoxoxoxoxoxoxo")

found a match
(21, 35)
xoxoxoxoxoxoxo


In [71]:
Search("(xo){1}.{11}(xo){2}","xo i really love you xoxoxoxoxoxo")

no match


In [72]:
Search("(xo){1}.{19}(xo){2}","xo i really love you xoxoxoxoxoxo")

found a match
(0, 25)
xo i really love you xoxo


In [73]:
Search("(xo){1}.{21}(xo){2}","xo i really love you xoxoxoxoxoxo")

found a match
(0, 27)
xo i really love you xoxoxo


In [74]:
Search("(xo){1}.{22}(xo){2}","xo i really love you xoxoxoxoxoxo")

no match


In [75]:
Search("(xo){,3}","xo i really love you xoxoxox")

found a match
(0, 2)
xo


In [76]:
Search("(xo){4,5}","xo i really love you xoxoxoxoxoxo")

found a match
(21, 31)
xoxoxoxoxo


In [77]:
Search("(xo){4,5}","xo i really love you xoxoxoxoxo")

found a match
(21, 31)
xoxoxoxoxo


**Preventing greediness**

Again, the ? character allows us to create non-greedy versions of the {} patterns.

In [78]:
Search("(xo){2,}?","xo i really love you xoxo")

Search("(xo){2,}?","xo i really love you xoxoxoxo")

found a match
(21, 25)
xoxo
found a match
(21, 25)
xoxo
