In this lecture, we’ll learn how to use regular expressions (regex) for pattern matching in strings. A regex describes a pattern to search for in text. It’s useful for:
i) Checking whether a pattern exists in data.
ii) Extracting all instances of a pattern from data.
iii) Cleaning data, for example by splitting strings.
Regex is a key skill for data cleaning in data science because it lets you manipulate text efficiently. By the end of the lecture, you’ll understand how to create and use regex patterns for matching and processing text.

In [3]:
# Import re, the standard-library module that provides regex support in Python.
import re

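Main processing functions: match() checks for a match only at the beginning of the string, while search() checks anywhere in the string. Both return a Match object if the pattern is found and None otherwise, so the result can be used directly in a conditional.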

In [4]:
# The first argument is the pattern and the second is the string to search.
# Try changing "My" and "18" to something else.
declaration = "My age is 18"
if re.match("My",declaration):
    print("You have given your age")
else:
    print("You have not given your details")

You have given your age


In [5]:
if re.search("18",declaration):
    print("Your age is 18")
else:
    print("Please edit declaration")

Your age is 18
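
To make the difference between the two functions concrete, here is a small sketch that reuses the declaration string from above. The variable names at_start and anywhere are illustrative and not part of the lecture.

```python
import re

declaration = "My age is 18"

# match() only looks at the start of the string; search() scans all of it.
at_start = re.match("18", declaration)    # None: "18" is not at the beginning
anywhere = re.search("18", declaration)   # a Match object

print(at_start)          # None
print(anywhere.group())  # 18
print(anywhere.span())   # (10, 12)

# None is falsy and a Match object is truthy, which is why both
# functions work inside an if statement.
print(bool(at_start), bool(anywhere))  # False True
```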


In addition to checking conditions, we can segment a string. The work regex does here is called tokenizing: the string is separated into substrings based on patterns. Tokenizing is a core activity in natural language processing, which we won't discuss much here but you'll study in the future. The findall() and split() functions parse the string for us and return chunks.

In [8]:
to_text = "The cat sat on the mat with another cat."
# split() cuts the string at every match of the pattern and returns
# a list of the pieces between the matches.
print(re.split("cat",to_text))

['The ', ' sat on the mat with another ', '.']


In [10]:
print(re.findall("t",to_text)) # Every lowercase "t" in to_text; the list length is the count.

['t', 't', 't', 't', 't', 't', 't']
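
As a rough sketch of tokenizing, the same sentence can be split into words. This assumes the pattern \W+ (one or more non-word characters), which has not been introduced in the lecture yet; the names tokens and t_count are illustrative.

```python
import re

to_text = "The cat sat on the mat with another cat."

# Split on runs of non-word characters (spaces, punctuation).
# The trailing "." leaves an empty string at the end, so filter it out.
tokens = [tok for tok in re.split(r"\W+", to_text) if tok]
print(tokens)

# findall() returns a list, so len() gives the number of matches.
t_count = len(re.findall("t", to_text))
print(t_count)  # 7
```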


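Regex specification standard: a markup language for describing patterns in text. Let's start with anchors, which specify the start and/or the end of the string you are trying to match. The caret (^) means the start and the dollar sign ($) means the end. Putting ^ before a pattern means the text regex retrieves must start with that pattern; putting $ after a pattern means the text must end with it.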

In [15]:
re.search("^The",to_text) # re.search() returns an re.Match object, or None if there is no match

<re.Match object; span=(0, 3), match='The'>

In [17]:
re.search("cat.$",to_text)

<re.Match object; span=(36, 40), match='cat.'>
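
One detail worth checking in the end-anchor example: an unescaped dot matches any single character, not only a period. The sketch below compares the two forms; the test string "two cats" and the variable names are made up for illustration.

```python
import re

to_text = "The cat sat on the mat with another cat."

# An unescaped "." matches any character, so "cat.$" also matches "cats".
wildcard_hit = re.search("cat.$", "two cats")
print(wildcard_hit.group())   # cats

# Escaping the dot restricts the match to a literal period.
literal_end = re.search(r"cat\.$", to_text)
print(literal_end.group(), literal_end.span())   # cat. (36, 40)

literal_miss = re.search(r"cat\.$", "two cats")
print(literal_miss)           # None
```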

#PATTERNS AND CHARACTER CLASSES

In [35]:
grade_list = "ABCADAABCACCAAAAAABBDA"

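Remember that findall() retrieves all occurrences of a pattern. But what if we want to find every occurrence of either of two characters, say A and B, in grade_list? We can't pass "AB", because that matches only the two-character sequence "AB". Instead we put the characters in square brackets, which matches any single character from the set.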

In [27]:
re.findall("A",grade_list)

['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']

In [28]:
re.findall("[ABD]",grade_list)

['A',
 'B',
 'A',
 'D',
 'A',
 'A',
 'B',
 'A',
 'A',
 'A',
 'A',
 'A',
 'A',
 'A',
 'B',
 'B',
 'D',
 'A']

In [29]:
re.findall("AB",grade_list)

['AB', 'AB', 'AB']
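
The contrast between a character set and a literal sequence is easier to see on a short string. This is a minimal sketch using a hypothetical sample_grades string rather than the lecture's grade_list; Counter is used only to tally the results.

```python
import re
from collections import Counter

sample_grades = "ABCAAB"  # hypothetical short string for illustration

either = re.findall("[AB]", sample_grades)  # each A or B on its own
pairs = re.findall("AB", sample_grades)     # only the sequence "AB"

print(either)           # ['A', 'B', 'A', 'A', 'B']
print(pairs)            # ['AB', 'AB']
print(Counter(either))  # Counter({'A': 3, 'B': 2})
```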

To match a range of characters, put the two endpoints inside the brackets separated by a hyphen. For example, [a-z] matches any lowercase letter from a to z, and [A-C] matches A, B, or C.

In [30]:
re.findall("[A-C]",grade_list)

['A',
 'B',
 'C',
 'A',
 'A',
 'A',
 'B',
 'C',
 'A',
 'C',
 'C',
 'A',
 'A',
 'A',
 'A',
 'A',
 'A',
 'B',
 'B',
 'A']

In [32]:
re.findall("[A][B-D]",grade_list) # All occurrences where A is followed by B, C, or D.

['AB', 'AD', 'AB', 'AC', 'AB']
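
Ranges work the same way for lowercase letters and digits, and several ranges can share one pair of brackets. A small sketch, assuming a made-up sample string:

```python
import re

sample = "Room B12, seat c7"  # hypothetical string for illustration

uppers = re.findall("[A-Z]", sample)   # uppercase letters only
digits = re.findall("[0-9]", sample)   # single digits
# Two ranges in one set, followed by a digit range: a letter then a digit.
letter_digit = re.findall("[A-Za-z][0-9]", sample)

print(uppers)        # ['R', 'B']
print(digits)        # ['1', '2', '7']
print(letter_digit)  # ['B1', 'c7']
```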

In [36]:
# The pipe means "or": BD|DA matches either "BD" or "DA".
re.findall("BD|DA",grade_list)

['DA', 'BD']
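
Note that the results come back in the order they appear in the string, not in the order the alternatives are written. The sketch below shows this on a short hypothetical string, and also that alternatives sharing a first character can be written as a character class.

```python
import re

# "DA" occurs before "BD" in the string, so it is listed first
# even though "BD" is the first alternative in the pattern.
in_string_order = re.findall("BD|DA", "DABD")
print(in_string_order)  # ['DA', 'BD']

# When the alternatives differ in one position only,
# a character class gives the same result.
with_pipe = re.findall("AB|AD", "ABCAD")
with_class = re.findall("A[BD]", "ABCAD")
print(with_pipe, with_class)  # ['AB', 'AD'] ['AB', 'AD']
```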

In [39]:
# A caret at the start of a character class negates it.
print(re.findall("[^A]",grade_list)) # Every character that is not an A

['B', 'C', 'D', 'B', 'C', 'C', 'C', 'B', 'B', 'D']
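
The caret therefore has two meanings depending on where it sits. A short sketch with made-up three-letter strings:

```python
import re

# Outside brackets, ^ anchors the match to the start of the string.
starts_with_a = re.findall("^A", "ABA")
print(starts_with_a)  # ['A']

# Inside brackets, a leading ^ negates the set: any character except A.
not_a = re.findall("[^A]", "ABA")
print(not_a)          # ['B']

# Combined: the string must start with a character that is not A.
first_not_a = re.findall("^[^A]", "BAD")
no_match = re.findall("^[^A]", "ABD")
print(first_not_a, no_match)  # ['B'] []
```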


<h2>QUANTIFIERS</h2>

Quantifiers specify the number of times a pattern must be matched in order to count as a match. The most basic quantifier has the form E{m,n}, where E is the expression or character we're matching, m is the minimum number of times it must be matched, and n is the maximum number of times it may be matched.

In [41]:
re.findall("A{2,10}",grade_list) # Streaks of at least 2 and at most 10 consecutive A's.

['AA', 'AAAAAA']

In [43]:
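# Using single values: A{1,1} matches exactly one A, so this pattern is
# equivalent to "AA". Matches do not overlap, so a streak of six A's yields three.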
re.findall("A{1,1}A{1,1}",grade_list)

['AA', 'AA', 'AA', 'AA']
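
The curly-brace quantifier also has shorter forms: {m} for exactly m repetitions and {m,} for at least m. The sketch below compares them on a hypothetical streaks string; it is an illustration of standard re syntax, not the lecture's own example.

```python
import re

streaks = "AABAAAA"  # hypothetical string for illustration

exactly_two = re.findall("A{2}", streaks)     # exactly two A's per match
two_to_three = re.findall("A{2,3}", streaks)  # greedy: takes 3 when it can
three_plus = re.findall("A{3,}", streaks)     # no upper bound
spelled_out = re.findall("A{1,1}A{1,1}", streaks)

print(exactly_two)   # ['AA', 'AA', 'AA']
print(two_to_three)  # ['AA', 'AAA']
print(three_plus)    # ['AAAA']
print(spelled_out == exactly_two)  # True
```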

In [44]:
# Let's load some data from Wikipedia to look at other quantifiers.
with open("vishy.txt","r") as file:
    vishy = file.read()
vishy

'Viswanathan "Vishy" Anand (born 11 December 1969) is an Indian chess grandmaster, a former five-time World Chess Champion[2] and a record two-time Chess World Cup Champion.[3] He became the first grandmaster from India in 1988, and he has the eighth-highest peak FIDE rating of all time.[4] In 2022, he was elected the deputy president of FIDE.Anand defeated Alexei Shirov in a six-game match to win the 2000 FIDE World Chess Championship, a title he held until 2002. He became the undisputed world champion in 2007 and defended his title against Vladimir Kramnik in 2008, Veselin Topalov in 2010, and Boris Gelfand in 2012.In 2013, he lost the title to challenger Magnus Carlsen, and he lost a rematch to Carlsen in 2014 after winning the 2014 Candidates Tournament.'
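
As one possible use of quantifiers on this text, the sketch below pulls out four-digit years and bracketed citation markers. It assumes an inline excerpt of the passage instead of reading vishy.txt, and it uses the + quantifier (one or more), which the lecture has not covered at this point.

```python
import re

# Inline excerpt of the passage above, so the example runs without vishy.txt.
excerpt = (
    'Viswanathan "Vishy" Anand (born 11 December 1969) is an Indian chess '
    'grandmaster, a former five-time World Chess Champion[2] and a record '
    'two-time Chess World Cup Champion.[3] He became the first grandmaster '
    'from India in 1988'
)

# Exactly four digits in a row; the two-digit "11" is too short to match.
years = re.findall("[0-9]{4}", excerpt)
print(years)      # ['1969', '1988']

# Square brackets are escaped so they are treated as literal characters.
citations = re.findall(r"\[[0-9]+\]", excerpt)
print(citations)  # ['[2]', '[3]']
```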

In [None]:
#Remaining Regex In Case If It is Required 