# Lecture 12: Regular Expressions

- Reading material: tutorial https://regexone.com

- The cheat sheet at the following link is very useful to keep in the background. https://www.debuggex.com/cheatsheet/regex/python 

- Here's another useful website for you to test your regular expressions.
http://pythex.org/

Regular expressions are a language to describe patterns in strings. They are extremely useful when working with data sets. 

To start, we need the module `re` and two of its functions: `search` and `findall`. The function `re.search` takes a pattern and a string and returns a match object (of a class that is specific to the `re` module) describing the first substring that matches the pattern, together with its location in the string. If nothing matches, it returns `None`. We can retrieve the matched text with the `group()` method, and its location with the `start()` and `end()` methods.

For example:


In [1]:
import re

In [4]:
# re.search(r'abc','abcd') # returns a match object
re.search(r'abc','abcd').group()

'abc'

In [5]:
# re.search(pattern, string)
re.search(r'abc','abcd').group()

'abc'

In [7]:
re.search(r'abc*','abd abcd abcdddd abcccccccd').group()

'ab'

In [6]:
prog = re.compile(r'abc*')

print(prog.match('abcd').group())
print(prog.match('abcccccd').group())
print(prog.match('abccccccccccd').group())

abc
abccccc
abcccccccccc
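The cell above uses `prog.match` rather than `search`. As a minimal sketch of the difference (the sample strings here are made up for illustration): `match` only tries the pattern at the very beginning of the string, while `search` scans forward for the first place where it fits.

```python
import re

# Compile the pattern once so it can be reused.
prog = re.compile(r'abc*')

# match() succeeds only if the pattern fits at position 0.
at_start = prog.match('abd abcd')

# search() scans the string and returns the first match anywhere.
anywhere = prog.search('xx abccd')

# match() on the same string fails, because it starts with 'xx'.
not_at_start = prog.match('xx abccd')

print(at_start.group())   # ab
print(anywhere.group())   # abcc
print(not_at_start)       # None
```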


In [10]:
print(r"abc\n")
print("abc\n")

abc\n
abc



In [None]:
print(r'\\')

In [13]:
re.search(r'\\','a\\b').group()

'\\'
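The cells above rely on raw strings (the `r` prefix). The following sketch, with illustrative variable names, shows why they are convenient: in a raw string Python leaves backslashes alone, so the pattern reaches `re` exactly as typed.

```python
import re

# In an ordinary string, '\\' is ONE backslash character.
# In a raw string, r'\\' is TWO backslash characters.
plain_len = len('\\')
raw_len = len(r'\\')

# To match one literal backslash, the regex engine needs the
# two-character pattern \\ . These two spellings are the same pattern.
pattern_raw = r'\\'
pattern_plain = '\\\\'

# 'a\\b' is the three-character string a, backslash, b.
m = re.search(pattern_raw, 'a\\b')

print(plain_len, raw_len)            # 1 2
print(pattern_raw == pattern_plain)  # True
print(m.span())                      # (1, 2)
```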

In [14]:
re.search(r'abc','abcd').start()

0

In [15]:
re.search(r'abc','abcd').end()

3
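Since `re.search` returns `None` when nothing matches, calling `.group()` directly on the result can raise an `AttributeError`. A minimal sketch of a safer pattern, using made-up sample strings:

```python
import re

m = re.search(r'abc', 'xxabcd')

# Check for None before using the match object.
if m is not None:
    print(m.group())           # abc
    print(m.start(), m.end())  # 2 5
    print(m.span())            # (2, 5), start and end as one tuple

# No match: search returns None instead of a match object.
no_match = re.search(r'xyz', 'abcd')
print(no_match)                # None
```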

Parentheses in a pattern define groups. The `group()` method takes a group number, so that we can retrieve different parts of a match separately.

Imagine, for example, that we are looking for email addresses and sometimes wish to extract only the username or the domain. Groups may be nested; they are numbered by the position of their opening parenthesis, from left to right. Group 0 always returns the whole match.

For example:

In [19]:
re.search(r'abc','abcd').group()

'abc'

In [21]:
#re.search(r'a(b)c','abcd').group()
re.search(r'a(b)c','abcd').group(1)

'b'

In [24]:
#re.search(r'(\w+)@(\w+)[.]','hangjie@math.ucla.edu').group()
#re.search(r'(\w+)@(\w+)[.]','hangjie@math.ucla.edu').group(1)
re.search(r'(\w+)@(\w+)[.]','hangjie@math.ucla.edu').group(2)

'math'

In [25]:
re.search(r'((a)b)c','abcd').group(1)

'ab'

In [26]:
re.search(r'((a)b)c','abcd').group(2)

'a'
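The group numbering can be inspected all at once with the match object's `groups()` method. The sketch below reuses the email pattern from above; the named-group variant at the end (with the illustrative names `user` and `host`) is an optional extra, not something the examples above depend on.

```python
import re

m = re.search(r'(\w+)@(\w+)[.]', 'hangjie@math.ucla.edu')

whole = m.group(0)      # the whole match, same as m.group()
parts = m.groups()      # all numbered groups as a tuple
host_span = m.span(2)   # location of group 2 in the string

print(whole)      # hangjie@math.
print(parts)      # ('hangjie', 'math')
print(host_span)  # (8, 12)

# Nested groups are numbered by their opening parenthesis.
nested = re.search(r'((a)b)c', 'abcd').groups()
print(nested)     # ('ab', 'a')

# Groups can also be given names with (?P<name>...).
named = re.search(r'(?P<user>\w+)@(?P<host>\w+)[.]', 'hangjie@math.ucla.edu')
print(named.group('user'), named.group('host'))  # hangjie math
```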

In [30]:
#re.findall(r'abc*','abd abcd abcdddd abcccccccd')
#re.findall(r'e','abd abcd abcdddd abcccccccd') # returns an empty list

re.findall(r'(ab(c*))','abd abcd abcdddd abcccccccd')

[('ab', ''), ('abc', 'c'), ('abc', 'c'), ('abccccccc', 'ccccccc')]

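The `findall()` function simply returns a list of all non-overlapping substrings that match the pattern, without information about their location. If the pattern contains groups, as in the cell above, the list contains the groups instead: one tuple per match.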
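As a small sketch of how the return value of `findall` depends on the groups in the pattern, here are three variations on the same string. When locations are needed as well, `re.finditer` yields match objects instead of strings.

```python
import re

s = 'abd abcd abcdddd abcccccccd'

# No groups: a list of the matched substrings.
no_groups = re.findall(r'abc*', s)

# One group: a list of what that group captured.
one_group = re.findall(r'ab(c*)', s)

# finditer gives match objects, so the positions are available too.
with_positions = [(m.group(), m.start()) for m in re.finditer(r'abc*', s)]

print(no_groups)       # ['ab', 'abc', 'abc', 'abccccccc']
print(one_group)       # ['', 'c', 'c', 'ccccccc']
print(with_positions)  # [('ab', 0), ('abc', 4), ('abc', 9), ('abccccccc', 17)]
```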

To ask for more general patterns, we use placeholders, such as 
- __.__ for any character (except a newline, by default)
- __\d__ for any digit
- __[a-z]__ for lower case letters
- __[A-Z]__ for upper case letters 

We also use multipliers, such as 
- __?__ for 0 or 1
- __+__ for 1 or more
- __*__ for 0 or more.

If we want to match these special characters literally, instead of using them as placeholders, we can put them inside `[ ]` (for example `[.]`) or escape them with a backslash (for example `\.`).

We can also use `|` to mean “or”. So if we need an a or a b, we ask for `a|b`. Note that inside square brackets `|` is just a literal character; there, `[ab]` already means an a or a b.
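The placeholders, multipliers, escaping, and alternation described above can be tried out on a few short strings. This is only a sketch; the sample strings are invented for illustration.

```python
import re

# \d+ : one or more digits.
digits = re.findall(r'\d+', 'room 12, floor 3')

# u? : the u is optional (0 or 1 times).
spellings = re.findall(r'colou?r', 'color colour colr')

# [A-Z][a-z]* : an upper case letter followed by 0 or more lower case letters.
names = re.findall(r'[A-Z][a-z]*', 'Ada Lovelace and Alan Turing')

# [.] matches a literal period, while a bare . matches any character.
literal_dot = re.findall(r'\d+[.]\d+', '3.14 and 2x71')
any_char = re.findall(r'\d+.\d+', '3.14 and 2x71')

# | means "or".
pets = re.findall(r'cat|dog', 'cat, dog, cow')

print(digits)       # ['12', '3']
print(spellings)    # ['color', 'colour']
print(names)        # ['Ada', 'Lovelace', 'Alan', 'Turing']
print(literal_dot)  # ['3.14']
print(any_char)     # ['3.14', '2x71']
print(pets)         # ['cat', 'dog']
```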

We can also use __lookaheads__ and __lookbehinds__. These let us match patterns that are followed or preceded by another pattern, without including that other pattern in the match. For example, suppose that we have a string of numbers, and we would like to extract the integer parts of those that have decimals. We use a lookahead, written `(?=...)`, to a period:

In [32]:
s='1.86 5.30 8.54 13.75 14~ 49-'
re.findall(r'\d+(?=[.])',s)

['1', '5', '8', '13']

As another example, this time using a lookbehind, written `(?<=...)`, suppose that we have the following piece of text:

"October arrived, spreading a damp chill over the grounds and into the castle. Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair gave the impression that her whole head was on fire."

We are looking for words that are capitalized, but that are not at the beginning of a sentence. These are words that are preceded by a lower case letter and then a space:

In [34]:
s='October arrived, spreading a damp chill over the grounds and into the castle. Madam Pomfrey, the nurse, was kept busy by a sudden spate of colds among the staff and students. Her Pepperup potion worked instantly, though it left the drinker smoking at the ears for several hours afterward. Ginny Weasley, who had been looking pale, was bullied into taking some by Percy. The steam pouring from under her vivid hair gave the impression that her whole head was on fire. '

In [36]:
re.findall(r'[A-Z][a-z]+',s)

['October',
 'Madam',
 'Pomfrey',
 'Her',
 'Pepperup',
 'Ginny',
 'Weasley',
 'Percy',
 'The']

In [37]:
re.findall(r'(?<=[a-z]\s)[A-Z][a-z]+',s)

['Pomfrey', 'Pepperup', 'Weasley', 'Percy']
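One caveat worth knowing, sketched below on a made-up string: in Python's `re` module the pattern inside a lookbehind must have a fixed width, so something like `(?<=a+)` is rejected, whereas a lookahead has no such restriction.

```python
import re

s = 'price: $40, cost: $15, count: 7'

# Lookbehind: digits that are preceded by a dollar sign.
after_dollar = re.findall(r'(?<=\$)\d+', s)

# Lookahead: digits that are followed by a comma.
before_comma = re.findall(r'\d+(?=,)', s)

print(after_dollar)  # ['40', '15']
print(before_comma)  # ['40', '15']

# A variable-width lookbehind is not accepted by the re module.
try:
    re.compile(r'(?<=a+)b')
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True

print(lookbehind_rejected)  # True
```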

### Exercise:

- Write a regular expression that extracts all full sentences that are questions from a piece of English text.


- Write a regular expression that finds the maximum of all numbers (either integers or floats) that are in a string. The pattern ‘26’ should be counted as twenty six, rather than a 2 and a 6.

- Write a regular expression that finds all short sentences in a piece of English text (sentences with no more than 10 words).

- Write a regular expression that lists the first letter of every sentence in a piece of English text.


- Use regular expressions to add commas between the numbers in a string of numbers separated only by spaces. So, we want to turn `'1 3 8 2'` into `'1, 3, 8, 2'`. Then, also add square brackets, so that the list is written in a way that can be used in Python: we want `'[1, 3, 8, 2]'`. Put this all together to create a function that takes as input a string of space-separated numbers, and outputs a Python list of those numbers (not strings).

In [187]:
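def str2list(s):
    # Insert commas between the numbers and wrap the result in brackets.
    s = '['+re.sub(r'\s+',r', ',s)+']'
    # Extract every run of characters that is not a comma, space, or bracket.
    s = re.findall(r'[^, \[\]]+', s)
    # Convert each piece to an int (this version handles integers only).
    return [int(i) for i in s]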

s = '1 3 8 2'
str2list(s)


[1, 3, 8, 2]
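As an alternative sketch, once the string has the form `'[1, 3, 8, 2]'` it can be handed to `ast.literal_eval` from the standard library, which safely evaluates Python literals and so also handles floats and negative numbers. The function name `str2list_literal` is hypothetical, and this is just one possible approach; note that it would reject numbers written with leading zeros, such as `007`.

```python
import ast
import re

def str2list_literal(s):
    """Turn a string of space-separated numbers into a list of numbers."""
    # Replace each run of whitespace with ', ' and add the brackets.
    bracketed = '[' + re.sub(r'\s+', ', ', s.strip()) + ']'
    # Parse the resulting text as a Python list literal.
    return ast.literal_eval(bracketed)

ints = str2list_literal('1 3 8 2')
mixed = str2list_literal('1 2.5 -3')

print(ints)   # [1, 3, 8, 2]
print(mixed)  # [1, 2.5, -3]
```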