Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern that you give to a regex processor along with some source data. The processor then parses the source data using that pattern and returns chunks of text to the data scientist or programmer for further manipulation. There are three main reasons to do this: to check whether a pattern exists within some source data, to get all instances of a complex pattern from some source data, or to clean your source data using a pattern, generally through string splitting.

In [1]:
# import the re module, Python's regular expression library
import re

In [2]:
# match() checks for a match only at the beginning of the string
# search() checks for a match anywhere in the string
# both return a Match object (truthy) on success and None otherwise, not a boolean

# Example
text = "This is a good day!"

if re.search("good", text): # the first argument is the pattern, the second is the text to search
    print("Wonderful!")
else:
    print("Too bad!")
    
if re.match("good", text): 
    print("Wonderful!")
else:
    print("Too bad!")

Wonderful!
Too bad!
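
The `if` tests above work because of what the two functions return. As a small sketch reusing the same `text` (the variable names `found` and `at_start` are only illustrative), you can inspect the return values directly:

```python
import re

text = "This is a good day!"

found = re.search("good", text)     # scans the whole string for the pattern
at_start = re.match("good", text)   # only tries to match at position 0

print(found)          # a re.Match object, which is truthy
print(found.group())  # the matched text: good
print(found.span())   # start and end index of the match: (10, 14)
print(at_start)       # None, because the text does not start with "good"
```

A Match object is always truthy and `None` is falsy, which is why both calls can be used directly as conditions.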


In [3]:
# Segmenting a string is called tokenizing: the string is separated into substrings based on a pattern
text = "Ali works diligently. Ali gets good grades. Our student Ali is successful."
# split on all instances of Ali using split()
re.split("Ali",text)
# split() returns a list of the substrings between the matches; "Ali" itself is removed,
# and the leading '' appears because the text starts with a match

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']
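
When splitting is used for cleaning, the empty string and the stray spaces in that result usually need to be removed. One possible way to do it, as a sketch (the names `pieces` and `cleaned` are made up for this example):

```python
import re

text = "Ali works diligently. Ali gets good grades. Our student Ali is successful."

pieces = re.split("Ali", text)

# strip surrounding whitespace and drop pieces that are empty afterwards
cleaned = [p.strip() for p in pieces if p.strip()]

print(cleaned)
# ['works diligently.', 'gets good grades. Our student', 'is successful.']
```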

In [4]:
# to count how many times we have talked about Ali, use findall() and take the length of the returned list
re.findall("Ali", text)

['Ali', 'Ali', 'Ali']
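
findall() gives back the matches themselves, so the count is the length of that list. A short sketch of this, also using `re.finditer()` from the standard library to see where each match starts (the names `mentions` and `starts` are only illustrative):

```python
import re

text = "Ali works diligently. Ali gets good grades. Our student Ali is successful."

mentions = re.findall("Ali", text)
print(len(mentions))  # 3

# finditer() yields Match objects, so positions are available too
starts = [m.start() for m in re.finditer("Ali", text)]
print(starts)
```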

# Patterns and Character Classes

In [5]:
# grades over a semester in one course
grades = "AAABABABABCCCACBBBAA"

# find all the B's; the length of the list is how many there are
re.findall("B", grades)

['B', 'B', 'B', 'B', 'B', 'B', 'B']

In [6]:
# find all the A's and B's by putting both characters in square brackets (a character class)
re.findall("[AB]", grades)

['A',
 'A',
 'A',
 'B',
 'A',
 'B',
 'A',
 'B',
 'A',
 'B',
 'A',
 'B',
 'B',
 'B',
 'A',
 'A']

In [7]:
# find where the student received an A followed by a B or a C
re.findall("[A][B-C]", grades)

['AB', 'AB', 'AB', 'AB', 'AC']

In [8]:
# an alternative way to write the above, using the OR operator |
re.findall("AB|AC",grades)

['AB', 'AB', 'AB', 'AB', 'AC']

In [9]:
# find all grades that are not an A; a ^ inside the brackets negates the set
re.findall("[^A]", grades)

['B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'B', 'B', 'B']

# Quantifiers

In [10]:
# a quantifier is the number of times a pattern must repeat in order to match
# it is written as e{m,n}, with no space inside the braces
# e = the expression or character we are matching
# m = the minimum number of times it must be matched
# n = the maximum number of times it may be matched

In [11]:
# Example: how many times has the student been on a streak of back-to-back A's?
re.findall("A{2,10}",grades) # use 2 as min, 10 as max

['AAA', 'AA']

In [12]:
# find 2 A's back-to-back
re.findall("A{2,2}",grades)

['AA', 'AA']

In [13]:
re.findall("A{1,1}A{1,1}",grades)
# a {1,1} quantifier is the default, so it can be left out and the pattern written as just "AA"

['AA', 'AA']

In [14]:
re.findall("A{2}",grades)
# if there is only one number in the braces, it is used as both the min and the max

['AA', 'AA']

In [15]:
# find a decreasing trend in the student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)

['ABCCC']

In [16]:
# * to match 0 or more times
# + to match 1 or more times
# ? to match 0 or 1 times
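
To see how the three shorthand quantifiers differ (`*` is zero or more, `+` is one or more, `?` is zero or one), here is a small sketch on the same `grades` string. The pattern "a C followed by B's" is chosen only because it makes the three results differ:

```python
import re

grades = "AAABABABABCCCACBBBAA"

# * : a C followed by zero or more B's
star = re.findall("CB*", grades)
print(star)      # ['C', 'C', 'C', 'CBBB']

# + : a C followed by one or more B's
plus = re.findall("CB+", grades)
print(plus)      # ['CBBB']

# ? : a C followed by at most one B
optional = re.findall("CB?", grades)
print(optional)  # ['C', 'C', 'C', 'CB']
```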

# Group

In [None]:
# you can match several patterns at the same time, each called a group, and then refer to the groups you want
# group patterns using parentheses
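
As a minimal sketch of grouping on the same `grades` string (the pattern and the names `first` and `pairs` are chosen only for illustration), each pair of parentheses captures one part of the match, and the parts can then be read back by number:

```python
import re

grades = "AAABABABABCCCACBBBAA"

# group 1 = a run of A's, group 2 = the run of B's that follows it
pattern = "(A+)(B+)"

first = re.search(pattern, grades)
print(first.group(0))  # the whole match: AAAB
print(first.group(1))  # the first group: AAA
print(first.group(2))  # the second group: B
print(first.groups())  # all groups as a tuple: ('AAA', 'B')

# with groups in the pattern, findall() returns one tuple per match
pairs = re.findall(pattern, grades)
print(pairs)
# [('AAA', 'B'), ('A', 'B'), ('A', 'B'), ('A', 'B')]
```

Note that `group(0)` is always the entire match, while numbering of the parenthesized groups starts at 1.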