# Regex

Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern that you give to a regex processor along with some source data. The processor then parses the source data using that pattern and returns chunks of text to the data scientist or programmer for further manipulation. There are three main reasons you would want to do this: to check whether a pattern exists within some source data, to get all instances of a complex pattern from some source data, or to clean your source data using a pattern, generally through string splitting. Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications, and a solid understanding of regexes will help you quickly and efficiently manipulate text data for further analysis.

In [8]:
# First, import the re module, which contains Python's regular expression functions.
import re 

In [9]:
# search() checks for a match anywhere in the string and returns
# a Match object (truthy) if one is found, or None otherwise.
text = "This is a good day"

if re.search("good", text):
    print("Wonderful")
else:
    print("Alas!")

Wonderful


In [10]:
# match() checks for a match only at the beginning of the string
# and returns a Match object if one is found, or None otherwise.

text = "This is a good day"

if re.match("his", text):
    print("Yahoo")
else:
    print("Oh no!")

Oh no!
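
Because search() and match() return either a Match object or None, they can be used directly in an if statement, and the Match object itself can be inspected. The following is a minimal sketch of that behaviour; the variable names are only illustrative.

```python
import re

text = "This is a good day"

# search() scans the whole string; match() only tries the very start.
found = re.search("good", text)
missing = re.search("bad", text)
at_start = re.match("This", text)
not_at_start = re.match("his", text)

print(found)         # a Match object, which is truthy
print(missing)       # None, which is falsy
print(found.span())  # (start, end) offsets of the matched text
print(at_start is not None, not_at_start is None)
```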


In [11]:
# The findall() and split() functions will
# parse the string for us and return chunks.
# Let's try an example.

text = "Amy works diligently. Amy gets good grades. Our student Amy is succesful."

# This is a somewhat contrived example,
# but let's split the text on every occurrence of Amy.

re.split("Amy", text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is succesful.']

In [12]:
# Notice that split() returned an empty string (the text before
# the first Amy), followed by the statements about Amy, all as
# elements of a list.
# To collect every mention of Amy so we can count them, we use findall().

re.findall("Amy", text)

['Amy', 'Amy', 'Amy']
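
To turn these lists into an actual count, we can take their length. This is a small sketch using the same sentence; it also shows why split() produced an empty first element, since three matches cut the text into four pieces.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is succesful."

# findall() returns one list element per match, so len() gives the count.
mentions = re.findall("Amy", text)
print(len(mentions))

# split() cuts the text at every match; the text starts with a match,
# so the first piece is the empty string before it.
pieces = re.split("Amy", text)
print(len(pieces), repr(pieces[0]))
```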

Anchors specify the start and/or the end of the string that you are trying to match.
The caret character ^ means start and the dollar sign character $ means end.

If you put ^ before a pattern, the text the regex processor retrieves must start with that pattern.
If you put $ after a pattern, the text the regex processor retrieves must end with that pattern.


In [13]:
# Here's an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is succesful."

# Let's see if this begins with Amy
re.search("^Amy", text)

<re.Match object; span=(0, 3), match='Amy'>

In [14]:
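# Now let's see if it ends with "succesful." The dot is escaped so that
# it matches a literal period rather than any character.
re.search(r"succesful\.$", text)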

<re.Match object; span=(63, 73), match='succesful.'>
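
One detail worth keeping in mind with the end-of-string check: in a pattern, an unescaped dot matches any character, not just a period. The sketch below uses a made-up sentence ending in "!" to show the difference, and uses re.escape() to build the escaped pattern.

```python
import re

text = "Our student Amy is succesful!"  # made-up variant ending in "!"

# Unescaped dot: "." happily matches the "!" at the end.
loose = re.search("succesful.$", text)

# Escaped dot: only a literal period is accepted, so there is no match.
strict = re.search(r"succesful\.$", text)

# re.escape() escapes regex special characters in a plain string.
escaped = re.escape("succesful.")

print(loose)
print(strict)
print(escaped)
```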

# Patterns and Character Classes

Let's talk more about patterns and start with character classes. Let's create a string of a single learner's
grades over a semester in one course, across all of their assignments.

In [15]:
grades = "ACAABBAABABCCBAC"

# How many B's are there?
re.findall("B",grades)

['B', 'B', 'B', 'B', 'B']

In [16]:
# How many A's and B's are there?
# To find out, we put A and B inside square brackets,
# which match any one of the listed characters.

re.findall("[AB]", grades)

['A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'A']

In [17]:
# Let's build a simple regex to parse out all instances
# where this student received an A followed by a B or a C.

re.findall("[A][B-C]", grades)


['AC', 'AB', 'AB', 'AB', 'AC']

In [18]:
# The same result can be achieved with the pipe ( | ), which means "or".

re.findall("AB|AC", grades)

['AC', 'AB', 'AB', 'AB', 'AC']

In [19]:
# Inside square brackets, a leading caret negates the set, so this
# parses out only the grades which were not A's.

re.findall("[^A]", grades)

['C', 'B', 'B', 'B', 'B', 'C', 'C', 'B', 'C']
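
The caret therefore has two different meanings depending on where it appears. The following sketch contrasts them on the same grades string; the variable names are just for illustration.

```python
import re

grades = "ACAABBAABABCCBAC"

# Inside square brackets, ^ negates the set: any character that is not A.
not_a = re.findall("[^A]", grades)

# Outside brackets, ^ anchors the pattern to the start of the string.
starts_with_a = re.findall("^A", grades)

# Combined: a non-A character at the very start. The string begins with A,
# so nothing matches.
starts_with_non_a = re.findall("^[^A]", grades)

print(not_a)
print(starts_with_a)
print(starts_with_non_a)
```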

# Quantifiers

Quantifiers specify the number of times a pattern must occur in order to match. The most basic quantifier is expressed as e{m,n}, where e is the expression or character we are matching, m is the minimum number of times you want it to be matched, and n is the maximum number of times the item could be matched.

Let's use these grades as an example. How many times has this student been on a back-to-back A's streak?

In [20]:
re.findall("A{2,10}", grades)

['AA', 'AA']

It's important to note that the regex quantifier syntax does not allow you to deviate from the {m,n} pattern. In particular, if you put an extra space inside the braces, the braces are no longer read as a quantifier and you'll get an empty result.

In [21]:
re.findall("A{2, 10}",grades)

[]
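
The empty list is not an error. As far as we can tell from the behaviour of Python's re module, a brace expression that is not a well-formed quantifier is simply read as literal text. The sketch below checks this against a made-up string that actually contains those characters.

```python
import re

grades = "ACAABBAABABCCBAC"

# With the space, "{2, 10}" is not a quantifier, so nothing in grades matches.
no_match = re.findall("A{2, 10}", grades)

# The same pattern does match a string containing the literal text "A{2, 10}".
literal = re.findall("A{2, 10}", "xxA{2, 10}xx")

print(no_match)
print(literal)
```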

If there's just one number in the braces, it's considered to be both m and n.

In [22]:
re.findall("A{2}", grades)

['AA', 'AA']
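
The upper bound can also be left out: {m,} means "at least m times", with no maximum. Here is a small sketch on a made-up grades string (not the one used above) comparing the three forms.

```python
import re

streaks = "AAABAABAB"  # made-up example string

exactly_two = re.findall("A{2}", streaks)    # exactly two A's per match
two_or_more = re.findall("A{2,}", streaks)   # two or more A's, no upper limit
one_or_two = re.findall("A{1,2}", streaks)   # between one and two A's

print(exactly_two)
print(two_or_more)
print(one_or_two)
```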

Using this, we could find a decreasing trend in a student's grades.

In [23]:
re.findall("A{1,10}B{1,10}C{1,10}",grades)

['ABCC']

Let's look at a more complex example and load some data scraped from Wikipedia.

In [42]:
with open("../Datasets/Jerpa.txt", "r") as file:
    wiki = file.read()

wiki

'Overview[edit]\nFERPA gives parents access to their child\'s education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student\'s consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.\n\nOther regulations under this act, effective starting January 3, 2012, allow for greater disclosures of personal and directory student identifying information and regulate student IDs and e-mail addresses.[2] For example, schools may provide external companies with a student\'s personally identifiable information without the student\'s consent.[2]\n\nExamples of situations affected by FERPA include school employees divulging information to anyone other than the student about the student\'s grades o

Scanning through this document, one of the things we notice is that the headers all have the text [edit] after them, followed by a newline character. So if we wanted to get a list of all of the headers in this article, we could do so using re.findall().

In [69]:
# A raw string (r"...") keeps the backslashes intact for the regex engine
re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

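Notice that only the last word of each multi-word header was captured, because the character class does not include spaces. Before fixing that, we can write [a-zA-Z] more compactly with something new: \w is a metacharacter that matches any letter or digit, as well as the underscore. There are actually a number of different metacharacters listed in the documentation. For instance, \s matches any whitespace character.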
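
As a quick illustration of metacharacters, the sketch below runs \w and \s on small made-up strings. It also uses \d, which matches any digit; \d is not used elsewhere in this lesson and is included here only as one more example from the documentation.

```python
import re

sample = "Quiz 12 score: A_plus"  # made-up sample text

# \w matches letters, digits, and the underscore, but not punctuation.
word_chars = re.findall(r"\w", "a_1!")

# \d matches a single digit.
digits = re.findall(r"\d", sample)

# \s matches a single whitespace character (space, tab, newline, ...).
spaces = re.findall(r"\s", sample)

print(word_chars)
print(digits)
print(spaces)
```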

Next, there are three other quantifiers we can use which shorten up the curly brace syntax: an asterisk * matches 0 or more times, a question mark ? matches 0 or 1 times, and a plus sign + matches 1 or more times. Let's try the asterisk.

In [60]:
re.findall(r"[\w]*\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']
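
The shorthand quantifiers *, + and ? (zero or more, one or more, and zero or one) can be compared side by side on the grades string from earlier. This is a minimal sketch; the variable names are illustrative.

```python
import re

grades = "ACAABBAABABCCBAC"

# + means one or more: a shorter way to write the decreasing-trend search
# A{1,10}B{1,10}C{1,10} for a string of this length.
trend = re.findall("A+B+C+", grades)

# ? means zero or one: an A optionally followed by a single B.
a_optional_b = re.findall("AB?", grades)

# * means zero or more: an A followed by any number of B's, including none.
a_any_bs = re.findall("AB*", grades)

print(trend)
print(a_optional_b)
print(a_any_bs)
```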

Now that we have shortened the regex, let's improve it a little bit. We can allow spaces by adding the space character to the character class.

In [64]:
re.findall(r"[\w ]*\[edit\]", wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

OK, so this gets us the list of section titles in the Wikipedia page! A list of clean titles can now be created by
iterating through the matches and applying another regex.

In [74]:
for title in re.findall(r"[\w ]*\[edit\]", wiki):
    # Split on the opening square bracket and keep the text before it
    print(re.split(r"[\[]", title)[0])

Overview
Access to public records
Student medical records


In [77]:
# The same cleanup also works with a plain string method instead of a second regex
for title in re.findall(r"[\w ]*\[edit\]", wiki):
    print(title.replace("[edit]", ""))

Overview
Access to public records
Student medical records


# Groups

To group patterns together, we use parentheses, which is actually pretty natural. Let's rewrite our findall() call using groups.

In [84]:
re.findall(r"([\w ]*)(\[edit\])", wiki)

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]
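
When a pattern contains exactly one group, findall() returns just the text of that group rather than tuples, which gives the clean titles in a single step. The sketch below uses a short made-up string as a stand-in for the wiki text, since the data file is not assumed to be available.

```python
import re

# Hypothetical stand-in for the scraped page, with two headers.
sample = "Overview[edit]\nSome text.\n\nAccess to public records[edit]\nMore text."

# Two groups: each result is a tuple of (title, edit link).
pairs = re.findall(r"([\w ]*)(\[edit\])", sample)

# One group: each result is just the captured title.
titles = re.findall(r"([\w ]*)\[edit\]", sample)

print(pairs)
print(titles)
```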

We can also refer to groups by number using the Match objects that are returned. Thus far we've seen that findall() returns strings (or tuples of strings when the pattern contains groups), and that search() and match() return individual Match objects. But what do we do if we want every Match object?
In this case, we use the function finditer(), which returns an iterator over the Match objects.

In [85]:
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.groups())

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')


We see here that the groups() method returns a tuple of the groups. We can get an individual group using group(number), where group(0) is the whole match, and each other number is the portion of the match we are interested in. In this case, we want group(1).

In [90]:
for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.group(1))

Overview
Access to public records
Student medical records


In [93]:
# Giving the groups a label and looking at the results as a
# dictionary is pretty useful.
# For that we use the syntax (?P<name>), where the parenthesis
# starts the group, the ?P indicates that this is an extension
# to basic regexes, and <name> is the dictionary key we
# want to use wrapped in <>.

for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])", wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

Overview
Access to public records
Student medical records


In [95]:
# item still refers to the last match from the loop above
print(item.groupdict())

{'title': 'Student medical records', 'edit_link': '[edit]'}
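
Named groups can also be read one at a time with group("name"), in addition to going through groupdict(). The sketch below is self-contained and uses a short made-up string in place of the wiki text; the sample and variable names are assumptions for illustration only.

```python
import re

# Hypothetical stand-in for the scraped page, with two headers.
sample = "Overview[edit]\nSome text.\n\nStudent medical records[edit]\nMore text."

pattern = r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])"

titles = []
for item in re.finditer(pattern, sample):
    # group("title") and groupdict()["title"] return the same text.
    assert item.group("title") == item.groupdict()["title"]
    titles.append(item.group("title"))

# After the loop, item is the last match that was found.
last = item.groupdict()

print(titles)
print(last)
```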
