**Regular expressions part 3**

Again, we'll use the convenient code below.

I modified it again: the line that printed the matching text is replaced by m.group(), which gives the same thing.

In [1]:
import re
def Search(pattern,text,flags=0):
    m=re.search(pattern,text,flags)
    if m:
        print("found a match")
        print(m.span())
        print(m.group())
        # print(text[m.start():m.end()]) old version
    else:
        print("no match")

**More flags**

There are additional special flags that can be used to help accomplish certain tasks. 

One example is the **.I** flag, which means ignore case.

**How flags work**

This is a brief aside.

Every flag stores an integer in which exactly one bit is non-zero.

In other words, the integer is a power of 2.

In [2]:
int(re.DOTALL)

16

In [3]:
int(re.M)

8

In [4]:
int(re.I)

2
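
As a quick check of the power-of-2 claim, the sketch below tests a handful of flags with the usual bit trick: n & (n - 1) is zero exactly when n has a single bit set. The helper name is_single_bit is ours and is not part of the re package.

```python
import re

def is_single_bit(n):
    """Return True when n is a positive power of 2 (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0

# A few flags chosen for illustration; int() gives the stored integer.
flag_values = {f.name: int(f) for f in (re.I, re.M, re.DOTALL, re.X)}
all_single_bit = all(is_single_bit(v) for v in flag_values.values())

print(flag_values)
print(all_single_bit)
```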

**Vertical bar**

When we want multiple flags to be passed to the re.search function, we use the vertical bar **|** and pass a combined flag expression like this: 

**re.DOTALL | re.M**

In Python (this is not a statement about the regular expressions package), the vertical bar applied to ints is the bitwise *or* operation on their binary digits.

In [5]:
print(bin(192))
print(bin(167))
print(192|167)
print(bin(192|167))

0b11000000
0b10100111
231
0b11100111


When we use this operation on multiple flags we see which bits are *flagged*.

In [6]:
print(bin(re.DOTALL|re.M|re.I))
print(bin(re.DOTALL|re.M))
print(bin(re.DOTALL|re.I))

0b11010
0b11000
0b10010
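
Going the other way, the bitwise *and* operator & can be used to test whether a particular flag is present in a combined value. This is a minimal sketch; the variable names are only for illustration.

```python
import re

combined = re.DOTALL | re.I

# A non-zero result of & means the bit for that flag is set.
has_dotall = bool(combined & re.DOTALL)
has_multiline = bool(combined & re.M)

print(has_dotall, has_multiline)
```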


**Default flag**

So as you might guess, if we specify a flag of 0, this would be taken by the search function to mean *no flags*.

We use this as the default value of **flags** in our new function.

In [7]:
def Search(pattern,text,flags=0):
    m=re.search(pattern,text,flags)
    if m:
        print("found a match")
        print(m.span())
        print(m.group())
    else:
        print("no match")

**Ignore case**

In [8]:
text="Can I help you?"
pattern="i"
Search(pattern,text)

no match


In [9]:
Search(pattern,text,re.I)

found a match
(4, 5)
I


**Ignore case and use DOTALL**

In [10]:
text="Can I help you?\nI need help from you."
pattern="i.*need"
Search(pattern,text)

no match


In [11]:
Search(pattern,text,re.I|re.DOTALL)

found a match
(4, 22)
I help you?
I need


**The .M flag**

Recall that the circumflex matches the beginning of a string and the dollar sign the end of a string.

If a string contains multiple lines, we can use the .M flag which means to 

- interpret ^ as the beginning of any line, and 
- $ as the end of any line.

In [12]:
text="I need some help. \nCan you help me please?"
pattern="^Can.*help"
Search(pattern,text)

no match


In [13]:
Search(pattern,text,re.M)

found a match
(19, 31)
Can you help


In [14]:
text="I need some help.\nCan you help me please?"
pattern=r".*help\.$"
Search(pattern,text)

no match


In [15]:
Search(pattern,text,re.M)

found a match
(0, 17)
I need some help.


**Splitting a line**

We already saw that strings have a **split** method.

This method takes as an argument a string to split on.

In [16]:
st="A man. A plan. A canal. Panama. He can"
print(st.split("."))
print(st.split("n."))

['A man', ' A plan', ' A canal', ' Panama', ' He can']
['A ma', ' A pla', ' A canal. Panama. He can']


**Splitting on a regular expression**

Using the re package, we can split a line using a regular expression as a delimiter. 

In the following example, we split on any white space character.

In [17]:
text="Honestly, this is the craziest idea you have\never\tpresented in all my years of being your partner"
pattern=r"\s"
L=re.split(pattern,text)
print(L)

['Honestly,', 'this', 'is', 'the', 'craziest', 'idea', 'you', 'have', 'ever', 'presented', 'in', 'all', 'my', 'years', 'of', 'being', 'your', 'partner']
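
One thing to watch for: \s matches a single white space character, so two white space characters in a row produce an empty string between them. The sketch below, on a made-up sample string, compares splitting on \s with splitting on \s+ (one or more white space characters).

```python
import re

sample = "one  two\t\nthree"  # two spaces, then a tab followed by a newline

single = re.split(r"\s", sample)   # every white space character is a delimiter
runs = re.split(r"\s+", sample)    # each run of white space is one delimiter

print(single)  # ['one', '', 'two', '', 'three']
print(runs)    # ['one', 'two', 'three']
```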


If there are no matches, then the output is a list with the whole string.

In [18]:
text="this is the craziest idea you have ever presented in all my years"
pattern="q"
re.split(pattern,text)

['this is the craziest idea you have ever presented in all my years']

**Start of line**

If the start of the line matches, then the list starts with an empty string. (It is as if there is an empty field before the first delimiter.)

In [19]:
text="5AM: Woke up from a dream. 6AM: fell back asleep. 12PM: woke up and felt refreshed."
pattern=r"\d{1,2}[AP]M: "
re.split(pattern,text)

['',
 'Woke up from a dream. ',
 'fell back asleep. ',
 'woke up and felt refreshed.']
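
If the empty leading field is not wanted, one simple option is to filter it out after splitting. A minimal sketch on a shortened, made-up version of the text above:

```python
import re

sample = "5AM: Woke up. 6AM: fell back asleep."
fields = re.split(r"\d{1,2}[AP]M: ", sample)

# Keep only the non-empty fields and trim surrounding white space.
entries = [f.strip() for f in fields if f]

print(fields)   # ['', 'Woke up. ', 'fell back asleep.']
print(entries)  # ['Woke up.', 'fell back asleep.']
```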

**Capturing all occurrences of a pattern**

The **findall** function returns a *list* of all occurrences found.

Here, we read in all of Pride and Prejudice as a string and do some searches in that string.

In [20]:
with open("PrideAndPredjudice.txt","rb") as fin:
    text=fin.read().decode()  

**How many times does a word appear**

We can search for all occurrences of the word "the" with a space character on each side. 

In [21]:
res=re.findall(" the ",text)
print(len(res))

3420
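
Note that " the " misses "The" at the start of a sentence and any "the" that sits next to a line break or punctuation. A possible alternative, assuming the word-boundary token \b and the .I flag are acceptable here, is sketched below on a made-up sample rather than on the novel.

```python
import re

sample = "The cat saw the dog.\nThe dog ignored the cat; then the bird sang."

spaced = re.findall(" the ", sample)                # lowercase, spaces on both sides
whole_word = re.findall(r"\bthe\b", sample, re.I)   # any case, whole word only

print(len(spaced))      # 3
print(len(whole_word))  # 5 ("then" is not counted)
```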


**Questions:**

We can ask all sorts of questions and some of these we'll look at as an in-class exercise.

- How many words are there?
- How many sentences?
- What is the frequency of each word in the text?

To get an approximate count of the number of words, here is one possible approach.

In [22]:
res=re.split(r"\s",text)
print(len(res))

137074


**Findall doesn't find all**

Findall returns only non-overlapping matches, and quantifiers such as .* are greedy in the usual sense.

When a match is found at some position, a greedy quantifier extends it as far as possible from that same position.

In [23]:
pattern="t.*e"
text="theretherethere"
res=re.findall(pattern,text)
print(res)

['theretherethere']


When a match is found, the pattern is next searched starting from the text that follows the end position of the previous match.

In [24]:
import re
pattern="[thwe].*?ere"
text="here, there and everywhere"
res=re.findall(pattern,text)
print(res)

['here', 'there', 'everywhere']
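
To put the two behaviours side by side, the sketch below runs a greedy and a non-greedy version of the same pattern on one string, and then shows that overlapping occurrences are not reported. The "banana" example is made up for illustration.

```python
import re

text = "theretherethere"

greedy = re.findall("t.*e", text)   # .* grabs as much as possible
lazy = re.findall("t.*?e", text)    # .*? grabs as little as possible

# "ana" occurs at positions 1 and 3 of "banana", but the two occurrences
# overlap, and the search resumes after the end of the first match.
overlap = re.findall("ana", "banana")

print(greedy)   # ['theretherethere']
print(lazy)     # ['the', 'the', 'the']
print(overlap)  # ['ana']
```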


**Iterating over matches**

Another method is **finditer** which returns an iterator over all matches.

For each match, we can capture its position and the text making up the match.

In [25]:
with open("PrideAndPredjudice.txt","rb") as fin:
    text=fin.read().decode()

pattern="Chapter"
it=re.finditer(pattern,text)
for m in it:
    st="substring = {:10s} start = {:10d} end = {:10d}".format(m.group(),m.start(),m.end())
    print(st)

substring = Chapter    start =         45 end =         52
substring = Chapter    start =       4666 end =       4673
substring = Chapter    start =       9083 end =       9090
substring = Chapter    start =      18785 end =      18792
substring = Chapter    start =      24857 end =      24864
substring = Chapter    start =      30257 end =      30264
substring = Chapter    start =      43546 end =      43553
substring = Chapter    start =      54942 end =      54949
substring = Chapter    start =      66195 end =      66202
substring = Chapter    start =      76147 end =      76154
substring = Chapter    start =      88907 end =      88914
substring = Chapter    start =      98040 end =      98047
substring = Chapter    start =     102057 end =     102064
substring = Chapter    start =     111453 end =     111460
substring = Chapter    start =     118030 end =     118037
substring = Chapter    start =     128000 end =     128007
substring = Chapter    start =     147388 end =     1473
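
The start positions from finditer can also be used to slice the text into pieces. The following is only a sketch on a tiny made-up sample, not on the novel; the names starts, bounds and pieces are ours.

```python
import re

sample = "Chapter 1\nIt is a truth.\nChapter 2\nMr. Bennet replied.\n"

# Start position of every occurrence of the word "Chapter".
starts = [m.start() for m in re.finditer("Chapter", sample)]

# Each piece runs from one start position to the next (or to the end of the text).
bounds = starts + [len(sample)]
pieces = [sample[a:b] for a, b in zip(bounds, bounds[1:])]

for piece in pieces:
    print(repr(piece))
```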

**Substitutions**

We can replace every match of a regular expression in a text with a substitution string.

In [26]:
pattern="[the].*?ere"
text="here, there and everywhere"
substitution="X"
result=re.sub(pattern,substitution,text)
print(result)

X, X and X
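
By default every match is replaced. The sub function also accepts an optional count argument that limits the number of replacements; a minimal sketch using the same text:

```python
import re

pattern = "[the].*?ere"
text = "here, there and everywhere"

# count=1 replaces only the first match and leaves the rest untouched.
first_only = re.sub(pattern, "X", text, count=1)

print(first_only)  # X, there and everywhere
```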


**Functions defining the substitution**

Instead of a string, we can pass a function as the substitution.

The function is called once for each match, with the match object as its argument, and the string it returns is substituted for that match.

In the following example, the function **f** 

- takes the match object **m** as input
- computes the matching substring (**m.group()**)
- appends "in" to that substring to define a new string **u**
- returns the string **u**

In [27]:
def f(m):
    u=m.group()+"in"
    return(u)

pattern="[the].*?ere"
text="here, there and everywhere"
result=re.sub(pattern,f,text)
print(result)

herein, therein and everywherein
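
As a second illustration of the same idea, here is a sketch in which the replacement is computed from the match rather than appended to it. The function name shout is hypothetical.

```python
import re

def shout(m):
    # m is the match object; return the matched text in upper case.
    return m.group().upper()

pattern = "[the].*?ere"
text = "here, there and everywhere"
shouted = re.sub(pattern, shout, text)

print(shouted)  # HERE, THERE and EVERYWHERE
```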


**Groups**

Parentheses in a pattern define groups, and we can use m.group to get the text matched by each group; m.group(0) is the entire match.

In [28]:
text = "I believe that students in applied math are smarter than anyone."
pattern = "(st.*ts) .* (math) .*(any)"
m = re.search(pattern, text)
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))

students in applied math are smarter than any
students
math
any
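
The match object can also report where each group was found. The sketch below uses m.groups(), which returns all groups as a tuple, and m.span(n), which gives the start and end positions of group n.

```python
import re

text = "I believe that students in applied math are smarter than anyone."
pattern = "(st.*ts) .* (math) .*(any)"
m = re.search(pattern, text)

all_groups = m.groups()   # tuple of the text matched by groups 1, 2, 3
first_span = m.span(1)    # (start, end) of group 1 within text

print(all_groups)
for n in range(1, 4):
    print(n, m.group(n), m.span(n))
```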


**Compiling Patterns**

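When doing repeated searches, it can be more efficient to _compile_ the pattern once and reuse the compiled object. (The re package also caches recently used patterns, so the gain is often modest.)

In the following, we break up Pride and Prejudice into chapters and search each chapter for the word "moderation".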

In [29]:
with open("PrideAndPredjudice.txt","rb") as fin:
    text=fin.read().decode()
Chapters=re.split("Chapter",text)

In [30]:
p=re.compile("moderation")
for c in Chapters:
    res=p.search(c)
    print(res)

None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
<re.Match object; span=(7123, 7133), match='moderation'>
None
None
None
None
None
None
None
None
None
None
<re.Match object; span=(13301, 13311), match='moderation'>
None
None
None
None
None
None
None
None
None
None
None
None
None
None
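
Printing a result for every chapter gives a lot of output. One possible variation, sketched here on a short made-up list instead of the novel, is to record only the indices of the pieces where the compiled pattern matches. Flags such as re.I can be passed to re.compile in the same way as to re.search.

```python
import re

# Made-up stand-in for the list of chapters.
sample_chapters = [
    "She spoke with moderation.",
    "Nothing to see here.",
    "Moderation in all things.",
]

p = re.compile("moderation", re.I)

# Indices of the pieces in which the pattern is found.
hits = [i for i, c in enumerate(sample_chapters) if p.search(c)]

print(hits)  # [0, 2]
```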


This also works with findall.

In [31]:
with open("PrideAndPredjudice.txt","rb") as fin:
    text=fin.read().decode()

pattern=r"\smoderation\s"
p=re.compile(pattern)
L=p.findall(text)
print(L)

[' moderation ', ' moderation ']


And with finditer.

In [32]:
with open("PrideAndPredjudice.txt","rb") as fin:
    text=fin.read().decode()

pattern=r"\smoderation\s"
p=re.compile(pattern)
it=p.finditer(text)
for m in it:
    print(m)

<re.Match object; span=(370201, 370213), match=' moderation '>
<re.Match object; span=(513774, 513786), match=' moderation '>
