# Lecture 8: Regular Expressions

- Reading material: tutorial https://regexone.com

- The cheat sheet at the following link is very useful to keep in the background. https://www.debuggex.com/cheatsheet/regex/python 

- Here's another useful website for you to test your regular expressions.
http://pythex.org/

Regular expressions are a language to describe patterns in strings. They are extremely useful when working with data sets. 

To start, we need the module __re__ and two of its functions: __search__ and __findall__. The function __re.search__ takes __a pattern and a string__ as input and __returns a match object__ (of a class that is specific to the __re__ module) that holds __the first substring__ matching the pattern, together with __its location in the string__. If nothing matches, it returns `None` instead. We retrieve the substring with the __group()__ method, and its location with the __start()__ and __end()__ methods. 

For example:


In [4]:
import re
ui = re.search(r'abc','abcd').group()
print ui

abc


In [34]:
print type(re.search(r'abdc','abcd'))
re.search(r'ab\d','abd abc ad1 ab2').group() # \d matches any single digit (0-9)
print type(re.search(r'ab\d','abd abc ad1 ab2').group())

<type 'NoneType'>
<type 'str'>


In [35]:
re.search(r'ab\d','abd ab3 ad1 ab2').start() # return the index of start of matching substring

4

In [36]:
re.search(r'ab\d','abd ab3 ad1 ab2').end() # return the index just past the end of the matching substring

7

In [37]:
re.findall(r'ab\d','abd ab3 ad1 ab2') # return a list of all matching substrings

['ab3', 'ab2']

In [38]:
prog = re.compile(r'abc*')
print prog.match('abcd').group()
print prog.match('abcccccd').group()
print prog.match('abccccccccccd').group()

abc
abccccc
abcccccccccc
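
The cell above uses two names that have not been introduced yet: `re.compile`, which builds a reusable pattern object, and `match`, which only succeeds if the pattern matches at the very start of the string. A short sketch of how `match` differs from `search`, using made-up input strings:

```python
import re

prog = re.compile(r'abc*')          # 'ab' followed by zero or more 'c'

at_start = prog.match('abcd')       # pattern found at index 0
not_at_start = prog.match('xabcd')  # match() does not scan ahead, so this is None
anywhere = prog.search('xabcd')     # search() scans the whole string

results = (at_start.group(), not_at_start, anywhere.group(), anywhere.start())
# results == ('abc', None, 'abc', 1)
```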


In [39]:
print r'\\' #with r, \ will not be treated as escape symbol

\\


In [40]:
re.search(r'\\','a\\b').group()

'\\'

In [41]:
re.search(r'abc','abcd').start()

0

In [44]:
re.search(r'abc','abcd').end()

3
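
Because `re.search` returns `None` when nothing matches (see the `NoneType` output earlier), calling `.group()` directly on a failed search raises an `AttributeError`. A minimal sketch of a guard; the helper name `first_match` is hypothetical and not part of the `re` module:

```python
import re

def first_match(pattern, text):
    """Return the first matching substring, or None if there is no match."""
    m = re.search(pattern, text)   # m is a match object or None
    if m is None:
        return None
    return m.group()

found = first_match(r'ab\d', 'abd abc ad1 ab2')    # 'ab2'
missing = first_match(r'abdc', 'abcd')             # None
```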

The __group()__ method also lets us retrieve parts of a match. Wrapping part of the pattern in parentheses creates a group, and we can then call each group by its number. 

Imagine, for example, that we are looking for email addresses and sometimes want only the username or the domain. Groups may be nested; they are numbered by the position of their opening parenthesis, from left to right. Group 0 is always the whole match. 

For example:

In [45]:
re.search(r'((a)b)c','abcd').group(0)

'abc'

In [46]:
re.search(r'((a)b)c','abcd').group(1)

'ab'

In [47]:
re.search(r'((a)b)c','abcd').group(2)

'a'
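
The email scenario mentioned above can be sketched the same way. The address and the deliberately simplified pattern below are illustrative assumptions; real email addresses allow many more forms than this pattern accepts:

```python
import re

# Group 1 is the username, group 2 the domain; group 0 is the whole address.
email_pattern = r'([\w.]+)@([\w.]+)'
m = re.search(email_pattern, 'Contact: jane.doe@example.com today')

whole = m.group(0)      # 'jane.doe@example.com'
username = m.group(1)   # 'jane.doe'
domain = m.group(2)     # 'example.com'
```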

The __findall()__ function simply returns a list of all substrings that match the pattern, without information about their location. 
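
One detail worth knowing: if the pattern contains capturing groups, `findall` returns the groups instead of the whole match (as tuples when there is more than one group). A brief sketch, reusing the string from the earlier cells:

```python
import re

s = 'abd ab3 ad1 ab2'

whole_matches = re.findall(r'ab\d', s)        # ['ab3', 'ab2']
one_group = re.findall(r'ab(\d)', s)          # ['3', '2']
two_groups = re.findall(r'(a)(b\d)', s)       # [('a', 'b3'), ('a', 'b2')]
non_capturing = re.findall(r'(?:ab)\d', s)    # ['ab3', 'ab2']
```

Writing the group as `(?:...)` keeps the parentheses for grouping without changing what `findall` returns.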

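To ask for more general patterns, we use placeholders, such as 
- __.__ for any single character (except a newline, by default)
- __\d__ for any digit
- __[a-z]__ for any lower case letter
- __[A-Z]__ for any upper case letter 

We also use multipliers (also called quantifiers), which say how many times the preceding item may repeat: 
- __?__ for 0 or 1 times
- __+__ for 1 or more times
- __*__ for any number of times, including 0. 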
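
A few small illustrations of these placeholders and multipliers, using made-up strings:

```python
import re

any_char = re.findall(r'a.c', 'abc a-c ac')          # ['abc', 'a-c']
digits = re.findall(r'\d+', 'room 12, floor 3')      # ['12', '3']
lower = re.findall(r'[a-z]+', 'Hello World')         # ['ello', 'orld']
upper = re.findall(r'[A-Z]', 'Hello World')          # ['H', 'W']
optional = re.findall(r'colou?r', 'color colour')    # ['color', 'colour']
star = re.findall(r'ab*', 'a ab abbb')               # ['a', 'ab', 'abbb']
```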

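If we want to look for one of these special characters itself, instead of using it as a placeholder, we can put it inside __[ ]__ (for example `[.]` for a literal period) or escape it with a backslash (`\.`). 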
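
For instance, a period can be matched literally either by placing it in square brackets or by escaping it with a backslash. A small sketch with a made-up string:

```python
import re

s = 'v1.2 v1x2 3+4'

unescaped = re.findall(r'1.2', s)      # '.' is a placeholder: ['1.2', '1x2']
bracketed = re.findall(r'1[.]2', s)    # literal period: ['1.2']
escaped = re.findall(r'1\.2', s)       # same result with a backslash: ['1.2']
plus_sign = re.findall(r'\d[+]\d', s)  # literal plus sign: ['3+4']
```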

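We can also use __|__ to mean “or”. So if we need an a or a b, we ask for a|b. Inside square brackets no __|__ is needed, since [ab] already means an a or a b; there, __|__ is just a literal character. 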
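
A short sketch of alternation with made-up strings. Note that `|` written inside square brackets is only a literal character, so `[a|b]` matches `a`, `b`, or `|`:

```python
import re

either = re.findall(r'a|b', 'abc|')                    # ['a', 'b']
words = re.findall(r'cat|dog', 'cat dog cow')          # ['cat', 'dog']
grouped = re.findall(r'gr(?:a|e)y', 'gray grey groy')  # ['gray', 'grey']
in_brackets = re.findall(r'[a|b]', 'abc|')             # ['a', 'b', '|']
```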

We can also use __lookforwards__ and __lookbackwards__ (more commonly called lookaheads and lookbehinds). These let us select patterns that are preceded or followed by another pattern, without including that other pattern in the match. For example, suppose that we have a list of real numbers that all have decimals, and we would like to extract the integer parts. We use a lookforward to a period, indicated by __(?=)__:

In [51]:
s='1.86 5.30 8.54 13.75'
re.findall(r'\d+(?=\.)',s) # (?=...) keeps only matches that are followed by the given pattern
                            # [.] would have the same effect as \. here: both match a literal period

['1', '5', '8', '13']

For a second example, this time using a lookbackward, indicated by __(?<=)__, suppose that we have the following piece of text:

"October arrived, spreading a damp chill over the grounds and into the castle. Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair gave the impression that her whole head was on fire."

We are looking for words that are capitalized, but that are not at the beginning of a sentence. These are words that are preceded by a lower case letter and then a space:

In [18]:
s='October arrived, spreading a damp chill over the grounds and into the castle. Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair gave the impression that her whole head was on fire. '

In [19]:
re.findall(r'(?<=[a-z] )[A-Z][a-z]+',s) # (?<=...) keeps only matches that are preceded by the given pattern

['Pomfrey', 'Pepperup', 'Weasley', 'Percy']

### Exercise:

- Write a regular expression that extracts all full sentences that are questions from a piece of English text.


In [1]:
import re
s = 'How are you? I am fine. Is it raining?'
re.findall(r'''
            (?:\w+\s+)*  # zero or more "word then whitespace" pairs; (?:...) is a non-capturing group
            \w+          # one or more "word characters"
            \?           # a literal question mark
            ''', s, re.VERBOSE)  # re.VERBOSE allows whitespace and comments inside the pattern

['How are you?', 'Is it raining?']

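In [2]:
re.findall(r'''(?:^|(?<=[.!?])\s+)   # start of the text, or whitespace right after the previous sentence
    (
      [^.?!]*                        # anything that is not end-of-sentence punctuation
      \w
      \?                             # a literal question mark (escaped with a backslash, not /)
    )''', 'Well, how are you? Are you ok? I am fine.', re.VERBOSE)

['Well, how are you?', 'Are you ok?']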

- Write a regular expression that finds the maximum of all numbers (either integers or floats) that are in a string. The pattern ‘26’ should be counted as twenty six, rather than a 2 and a 6.

- Write a regular expression that finds all short sentences in a piece of English text (sentences with no more than 10 words).

- Write a regular expression that lists the first letters of every sentence in a piece of English text.


- Write a regular expression that checks whether a string contains both upper and lower case letters.


- Write a regular expression that checks whether a string contains upper case letters, lower case letters, and at least one digit.

- Use regular expressions to add commas between two numbers in a list of numbers separated only by spaces. So, we want to turn '1 3 8 2' into '1, 3, 8, 2'. Then, also add square brackets, so that the list is written in a way that can be used in Python.
We want '[1, 3, 8, 2]'. Put this all together to create a function that takes as input a string of space-separated numbers, and outputs a Python list of those numbers (not strings).
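
One possible way to approach this last exercise, offered as a sketch rather than the intended solution. It assumes the input contains only non-negative integers, and it uses `re.sub` (which replaces every match of a pattern with a given string) together with the lookforward and lookbackward from earlier. The function name `parse_number_list` is made up for illustration:

```python
import re

def parse_number_list(text):
    """Turn '1 3 8 2' into the string '[1, 3, 8, 2]' and the list [1, 3, 8, 2]."""
    # Replace whitespace that sits between two digits with ', '.
    with_commas = re.sub(r'(?<=\d)\s+(?=\d)', ', ', text.strip())
    # Add the square brackets so the string reads like a Python list.
    bracketed = '[' + with_commas + ']'
    # Pull out each run of digits and convert it to an int.
    numbers = [int(tok) for tok in re.findall(r'\d+', bracketed)]
    return bracketed, numbers

as_text, as_list = parse_number_list('1 3 8 2')
# as_text == '[1, 3, 8, 2]' and as_list == [1, 3, 8, 2]
```

Negative numbers and floats would need a richer pattern than `\d+`, so they are left out of this sketch.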