In this lecture we're going to talk about pattern matching in strings using regular expressions. Regular
expressions, or regexes, are written in a condensed formatting language. In general, you can think of a
regular expression as a pattern which you give to a regex processor with some source data. The processor then
parses that source data using that pattern, and returns chunks of text back to the data scientist or
programmer for further manipulation. There are really three main reasons you would want to do this: to check
whether a pattern exists within some source data, to get all instances of a complex pattern from some source
data, or to clean your source data using a pattern, generally through string splitting. Regexes are not
trivial, but they are a foundational technique for data cleaning in data science applications, and a solid
understanding of regexes will help you quickly and efficiently manipulate text data for further data science
applications.

Now, you could teach a whole course on regular expressions alone, especially if you wanted to demystify how
the regex parsing engine works and efficient mechanisms for parsing text. In this lecture I want to give you
a basic understanding of how regexes work - enough knowledge that, with a little directed sleuthing, you'll be
able to make sense of the regex patterns you see others use, and you can build up your practical knowledge of
how to use regexes to improve your data cleaning. By the end of this lecture, you will understand the basics
of regular expressions, how to define patterns for matching, how to apply these patterns to strings, and how
to use the results of those patterns in data processing.

Finally, a note that in order to best learn regexes you need to write regexes. I encourage you to stop the
video at any time and try out the new patterns or syntax you learn.

In [1]:
# First we'll import the re module, which is Python's standard library module for regular expressions.
import re

In [None]:
# There are several main processing functions in re that you might use. The first,
# match(), checks for a match at the beginning of the string and returns a match object, or None if
# there is no match. search() checks for a match anywhere in the string and likewise returns a match
# object or None. Either result can be used directly as the condition of an if statement.

# Let's create some text for an example
text = "This is a good day."

# Now, let's see if it's a good day or not:
if re.search("good", text): # the first parameter here is the pattern
    print("Wonderful!")
else:
    print("Alas :(")
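
To make the return values concrete, here is a minimal sketch that reuses the same example sentence. The variable names `found_anywhere` and `found_at_start` are only illustrative.

```python
import re

text = "This is a good day."

# search() scans the whole string; match() only looks at the beginning
found_anywhere = re.search("good", text)
found_at_start = re.match("good", text)

print(found_anywhere)   # a match object, which is truthy
print(found_at_start)   # None, because the string starts with "This"

# bool() makes the truthiness used by the if statement explicit
print(bool(found_anywhere), bool(found_at_start))
```

Because a failed lookup gives back `None`, the result can go straight into an `if` statement without any extra comparison.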

In [None]:
# In addition to checking for conditionals, we can segment a string. The work that regex does here is called
# tokenizing, where the string is separated into substrings based on patterns. Tokenizing is a core activity
# in natural language processing, which we won't talk much about here but that you will study in the future.

# The findall() and split() functions will parse the string for us and return chunks. Let's try an example

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# This is a bit of a fabricated example, but let's split this on all instances of Amy
re.split("Amy", text)
# split() returns the pieces in the order they appear, from left to right.

In [None]:
# You'll notice that split has returned an empty string, followed by a number of statements about Amy, all as
# elements of a list. If we wanted to count how many times we have talked about Amy, we could use findall()
re.findall("Amy", text)
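
As a small sketch of how these two results might be used in practice, assuming the same sentence about Amy: `len()` on the `findall()` result gives the count, and a list comprehension tidies up the pieces from `split()`. The names `mentions` and `pieces` are hypothetical.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# findall() returns a list of matches, so len() gives the number of mentions
mentions = len(re.findall("Amy", text))

# split() leaves an empty string when the text starts with the pattern;
# strip whitespace and drop empty pieces to keep only the content
pieces = [piece.strip() for piece in re.split("Amy", text) if piece.strip()]

print(mentions)
print(pieces)
```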

In [None]:
# Ok, so we've seen that .search() looks for some pattern and returns a match object (or None) that we can
# use as a condition, that .split() will use a pattern for creating a list of substrings, and that
# .findall() will look for a pattern and pull out all occurrences.

In [None]:



# Now that we know how the Python regex API works, let's talk about more complex patterns. Regex syntax
# is a compact pattern language for describing text. Let's start with anchors.
# Anchors specify the start and/or the end of the string that you are trying to match. The caret character ^
# means start and the dollar sign character $ means end. If you put ^ before a string, it means that the text
# the regex processor retrieves must start with the string you specify. For ending, you put the $
# character after the string, which means that the text the regex processor retrieves must end with the
# string you specify.

# Here's an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# Let's see if this begins with Amy
re.search("^Amy", text)


In [None]:
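# match() only looks at the beginning of the string, so this finds the same leading Amy without a caret
re.match("Amy", text)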
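
The end anchor works the same way. The sketch below, again on the Amy sentence, checks both ends of the string. The final period is escaped as `\.` because an unescaped `.` matches any character, and the variable names are only illustrative.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# ^ pins the pattern to the start of the string
starts_with_amy = re.search("^Amy", text)

# $ pins the pattern to the end; the string does not end with Amy
ends_with_amy = re.search("Amy$", text)

# the string does end with "successful." (note the escaped period)
ends_with_successful = re.search(r"successful\.$", text)

print(starts_with_amy)
print(ends_with_amy)
print(ends_with_successful)
```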

In [None]:
# Notice that re.search() actually returned a new object to us, called an re.Match object. An re.Match object
# always has a boolean value of True, as something was found, so you can always evaluate it in an if statement
# as we did earlier. The rendering of the match object also tells you what text was matched, in this case
# the word Amy, and the location the match was in, as the span.
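
The same information shown in the rendering can be read off the match object in code. This is a minimal sketch using the standard `group()`, `span()`, `start()` and `end()` methods. Searching for the word "gets" is just an illustrative choice.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

match = re.search("gets", text)
if match:
    # group() is the text that matched the pattern
    print(match.group())
    # span() is the (start, end) position of the match in the string
    print(match.span())
    # slicing the original string with start() and end() recovers the same text
    print(text[match.start():match.end()])
```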

# Patterns and Character Classes

In [None]:
# Let's talk more about patterns and start with character classes. Let's create a string of a single learner's
# grades over a semester in one course across all of their assignments
grades="ACAAAABCBCBAA"

# If we want to answer the question "How many B's were in the grade list?" we would just use B
re.findall("B",grades)

In [None]:
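# If we wanted to count the number of A's or B's in the list, we can't use "AB" since this is used to match
# all A's followed immediately by a B. Instead, we put the characters A and B inside square brackets
print(re.findall("[AB]", grades))
# The order of the characters inside the brackets doesn't matter, and the range [B-C] is the same set as [BC]
print(re.findall("[BC]", grades))
print(re.findall("[CB]", grades))
print(re.findall("[B-C]", grades))
# A reversed range such as [C-B] is not a valid pattern and raises re.error, so we don't run it here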
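
To see the difference between a literal sequence and a set side by side, here is a short sketch on the same grades string. It also shows, as an aside, that Python rejects a range written backwards. The variable names are only illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# "AB" is a literal sequence: an A immediately followed by a B
a_then_b = re.findall("AB", grades)

# "[AB]" is a set: any single character that is an A or a B
a_or_b = re.findall("[AB]", grades)

print(len(a_then_b), len(a_or_b))

# A range must run from the lower character to the higher one
try:
    re.compile("[C-B]")
    reversed_range_ok = True
except re.error as err:
    reversed_range_ok = False
    print("invalid pattern:", err)
```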

In [None]:
# This is called the set operator. You can also include a range of characters, which follows character
# order. For instance, if we want to refer to all lower case letters we could use [a-z]. Let's build
# a simple regex to parse out all instances where this student received an A followed by a B or a C.
# We'll work up to it in steps, starting with an A followed immediately by a B
re.findall("[A][B]",grades)

In [None]:
# On its own, the range [B-C] matches every single B or C
re.findall("[B-C]",grades)

In [None]:
# Putting the two sets back to back matches an A followed by a B or a C
re.findall("[A][B-C]",grades)

In [None]:
# Notice how the [AB] pattern describes a set of possible characters which could be either (A OR B), while the
# [A][B-C] pattern denotes two sets of characters which must be matched back to back. You can also write
# this pattern by using the pipe operator, which means OR
re.findall("AB|AC",grades)



In [None]:
# Compare with the set-based version, which returns the same matches
re.findall("[A][B-C]",grades)
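
A quick way to convince yourself that the two spellings agree on this data is to compare their results directly. This is only a sketch on the grades string above, with illustrative variable names.

```python
import re

grades = "ACAAAABCBCBAA"

# Two back-to-back sets: an A, then a B or a C
set_version = re.findall("[A][B-C]", grades)

# The same idea spelled out with the pipe (OR) operator
pipe_version = re.findall("AB|AC", grades)

print(set_version)
print(pipe_version)
print(set_version == pipe_version)
```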

In [None]:
# We can use the caret with the set operator to negate our results. For instance, if we want to parse out only
# the grades which were not A's
re.findall("[^A]",grades)
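
Note that the caret means something different depending on where it sits: inside square brackets it negates the set, while outside it is the start anchor we saw earlier. A minimal sketch of the contrast, using the same grades string and illustrative variable names:

```python
import re

grades = "ACAAAABCBCBAA"

# Inside brackets, ^ negates: every single character that is not an A
not_a = re.findall("[^A]", grades)

# Outside brackets, ^ anchors: an A only if it is at the start of the string
starts_with_a = re.findall("^A", grades)

print(not_a)
print(starts_with_a)
```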