A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).


In [3]:
import re

In [4]:
text = "This is a good day."

if re.search("good", text): # the first parameter here is the pattern
    print("Wonderful!")
else:
    print("Alas :(")

Wonderful!


In [5]:
# In addition to testing whether a pattern occurs, we can segment a string. The work that regex does here is
# called tokenizing, where the string is separated into substrings based on patterns. Tokenizing is a core
# activity in natural language processing.

# The findall() and split() functions will parse the string for us and return chunks.
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# This is a bit of a fabricated example, but let's split this on all instances of Amy
re.split("Amy", text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']

In [6]:
print(re.findall("Amy", text))
print(len(re.findall('Amy', text)))

['Amy', 'Amy', 'Amy']
3


In [7]:
# To recap: .search() looks for a pattern and returns a Match object (or None if nothing matches), so it
# can be used directly in a condition; .split() uses a pattern to break a string into a list of
# substrings; and .findall() looks for a pattern and returns all occurrences as a list.
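
As a small illustration of the recap above, the sketch below shows what `re.search()` hands back: a `Match` object when the pattern is found and `None` otherwise. The sample string and the variable names `hit` and `miss` are made up for this example.

```python
import re

text = "Amy works diligently. Amy gets good grades."

hit = re.search("Amy", text)    # Match object for the first occurrence
miss = re.search("Bob", text)   # None, because "Bob" never appears

print(hit.span(), hit.group())  # (0, 3) Amy
print(miss)                     # None
print(bool(hit), bool(miss))    # True False
```

Because a `Match` object is truthy and `None` is falsy, the result can be used directly in an `if` statement, which is why the earlier conditional example works.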

In [8]:
# Now that we know how the Python regex API works, let's talk about more complex patterns. Regex syntax is
# a small pattern language for describing text. Let's start with anchors.
# Anchors specify the start and/or the end of the string that you are trying to match. The caret character ^
# means start and the dollar sign character $ means end. If you put ^ before a pattern, the text the regex
# engine retrieves must start with that pattern. If you put $ after a pattern, the text the regex engine
# retrieves must end with that pattern.

# Here's an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# Let's see if this begins with Amy
re.search("^Amy", text)

<re.Match object; span=(0, 3), match='Amy'>
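
The cell above only exercises the `^` anchor. As a companion sketch, the same idea with `$` might look like the following; the variable names `ends_with_word` and `ends_with_amy` are illustrative and not part of the original notebook.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# "$" anchors the pattern to the end of the string; the dot is escaped to match a literal period
ends_with_word = re.search(r"successful\.$", text)
# "Amy" occurs three times, but never at the very end, so this is None
ends_with_amy = re.search(r"Amy$", text)

print(ends_with_word.group())  # successful.
print(ends_with_amy)           # None
```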

In [9]:
s=re.compile(r"tan")
text="""tanishka is a girl
she studied in chandigarh univaersity. tanishka love to dance"""

In [10]:
matches= s.finditer(text)
print(matches)

for i in matches:
    print(i)

<callable_iterator object at 0x000001E2DB3BF070>
<re.Match object; span=(0, 3), match='tan'>
<re.Match object; span=(58, 61), match='tan'>


In [11]:
print(text[0:3])

tan


^ (Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

In [12]:
s=re.compile(r"^[a-z]")
matches= s.finditer(text)
for i in matches:
    print(i)

<re.Match object; span=(0, 1), match='t'>
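
The description of `^` mentions MULTILINE mode, which the cell above does not use. A minimal sketch of the difference, using a shortened two-line sample string chosen for this example:

```python
import re

sample = """tanishka is a girl
she studied in chandigarh"""

# Default mode: ^ matches only at the very start of the string
default_starts = [m.group() for m in re.finditer(r"^[a-z]", sample)]

# MULTILINE mode: ^ also matches immediately after each newline
multiline_starts = [m.group() for m in re.finditer(r"^[a-z]", sample, flags=re.MULTILINE)]

print(default_starts)    # ['t']
print(multiline_starts)  # ['t', 's']
```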


In [13]:
# Inside a character set, a leading ^ negates the set: every character that is NOT in the set is matched.
s=re.compile(r"[^a-z]")
matches= s.finditer(text)
for i in matches:
    print(i)

# so here every character except a-z is matched

<re.Match object; span=(8, 9), match=' '>
<re.Match object; span=(11, 12), match=' '>
<re.Match object; span=(13, 14), match=' '>
<re.Match object; span=(18, 19), match='\n'>
<re.Match object; span=(22, 23), match=' '>
<re.Match object; span=(30, 31), match=' '>
<re.Match object; span=(33, 34), match=' '>
<re.Match object; span=(44, 45), match=' '>
<re.Match object; span=(56, 57), match='.'>
<re.Match object; span=(57, 58), match=' '>
<re.Match object; span=(66, 67), match=' '>
<re.Match object; span=(71, 72), match=' '>
<re.Match object; span=(74, 75), match=' '>


In [14]:
emails="""tanishka@email.com 
12hello@email.com
World2.23@university.net
email@email-com"""

pattern=re.compile(r"[\w.]+@\w+[.]\w+")
# or
# pattern=re.compile(r"[a-zA-Z0-9.]+@[a-zA-Z]+[.][a-zA-Z]+")
# pattern=re.compile(r"[a-zA-Z0-9.]+@[a-zA-Z]+[.](net|com)")
matches = pattern.finditer(emails)
for i in matches:
    print(i)

<re.Match object; span=(0, 18), match='tanishka@email.com'>
<re.Match object; span=(20, 37), match='12hello@email.com'>
<re.Match object; span=(38, 62), match='World2.23@university.net'>


In [15]:
urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern.finditer(urls)

for match in matches:
    print(match.group(3))

.com
.com
.com
.gov


re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed.

In [16]:
subbed_urls = pattern.sub(r'\2\3', urls)

print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov
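
The description of `re.sub` notes that `repl` can also be a function. A small sketch of that form follows; the helper `shout_domain` and the shortened URL string are hypothetical and only serve the illustration. The function receives each `Match` object and returns the replacement text.

```python
import re

urls = "https://www.google.com http://coreyms.com"
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

def shout_domain(match):
    # Hypothetical helper: upper-case the domain name (group 2), keep the suffix (group 3)
    return match.group(2).upper() + match.group(3)

result = pattern.sub(shout_domain, urls)
print(result)  # GOOGLE.com COREYMS.com
```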



In [17]:
sentence = 'Start a sentence and then bring it to an end'

pattern = re.compile(r'start', re.I)
# pattern = re.compile(r'start', re.IGNORECASE)
matches = pattern.search(sentence)

print(matches)

<re.Match object; span=(0, 5), match='Start'>


In [18]:
# Named text rather than str, so the built-in str type is not shadowed
text = """she is beautiful.
she has black hair.
she studies in 9th standard."""
pattern = re.compile(r"she")
matches = pattern.findall(text)
print(matches)

['she', 'she', 'she']


### Quantifiers
A quantifier specifies how many times a pattern must repeat in order to match. The most basic quantifier is expressed as e{m,n}, where e is the expression or character we are matching, m is the minimum number of times it must be matched, and n is the maximum number of times it may be matched.

In [19]:
grades="ACAAAABCBCBAA"
re.findall('A{2,5}', grades)

['AAAA', 'AA']

In [20]:
# Quantified pieces can also be chained: A{1,2}A{0,1} matches one to three A's in a row, the same as A{1,3}
matches=re.finditer("A{1,2}A{0,1}",grades)
for i in matches:
    print(i)

<re.Match object; span=(0, 1), match='A'>
<re.Match object; span=(2, 5), match='AAA'>
<re.Match object; span=(5, 6), match='A'>
<re.Match object; span=(11, 13), match='AA'>
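
Beyond e{m,n}, the `re` module also accepts a few shorter quantifier forms. The sketch below runs them against the same grades string; the variable names are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

exactly_two = re.findall(r"A{2}", grades)   # exactly two A's per match
two_or_more = re.findall(r"A{2,}", grades)  # two or more A's, no upper bound
one_or_more = re.findall(r"A+", grades)     # + is shorthand for {1,}
optional_c = re.findall(r"BC?", grades)     # ? is shorthand for {0,1}

print(exactly_two)  # ['AA', 'AA', 'AA']
print(two_or_more)  # ['AAAA', 'AA']
print(one_or_more)  # ['A', 'AAAA', 'AA']
print(optional_c)   # ['BC', 'BC', 'B']
```

Quantifiers are greedy by default, which is why `A{2,}` returns the run of four A's as a single match instead of two shorter ones.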


### Groups

In [21]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print(m.group(0))       # The entire match
print(m.group(1))       # The first parenthesized subgroup.
print(m.group(2))       # The second parenthesized subgroup.
print(m.group(1, 2))    # Multiple arguments give us a tuple.

Isaac Newton
Isaac
Newton
('Isaac', 'Newton')


#### groupdict()
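To name a group we use the syntax (?P<name>...), where the parenthesis starts the group, the ?P indicates that this is an extension to basic regexes, and <name> is the dictionary key we want to use, wrapped in <>. Calling .groupdict() on a match then returns the named groups as a dictionary.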

In [22]:
s = "Isaac Newton, physicist"
for item in re.finditer(r"(?P<first_name>\w+) (?P<last_name>\w+)", s):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['first_name'])

Isaac


In [23]:
print(item.groupdict())

{'first_name': 'Isaac', 'last_name': 'Newton'}
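
Named groups can also be read without building the dictionary. A brief sketch, reusing the same sample sentence:

```python
import re

m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Isaac Newton, physicist")

print(m.group("last_name"))  # Newton
print(m["first_name"])       # Isaac (index-style access, available since Python 3.6)
print(m.groups())            # ('Isaac', 'Newton'): named groups are still numbered
```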


### Look Ahead and Look Behind

In [24]:
# One more concept to be familiar with is "look ahead" and "look behind" matching. Here, part of the
# pattern describes text that must appear before or after the text we are trying to isolate, but that
# text is not included in the match. For example, we may want a word only when another word follows it,
# without capturing the following word. A look ahead is written with the (?=...) syntax.
s = "Isaac Newton, physicist"
for item in re.finditer(r"(?P<title>\w+) (?=\w+)", s):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

Isaac

In [25]:
st = """Mr. topi
Mr Sweet
Mrs. Cute"""
pattern = re.compile(r"(Mr[.])+(\s)(\w)+")
match = pattern.finditer(st)
for i in match:
    print(i)

<re.Match object; span=(0, 8), match='Mr. topi'>


In [26]:
st = """Mr. topi
Mr Sweet
Mrs. Cute"""
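# The look ahead (?=\s) checks for whitespace without consuming it, so the (\w)+ that follows
# has to match that same whitespace character and fails. The result is an empty list.
pattern = re.compile(r"(Mr[.])+(?=\s)(\w)+")
re.findall(pattern, st)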

[]
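
The empty result above comes from asserting whitespace and then immediately requiring a word character at the same position. As a hedged sketch of patterns that do match, here is one look ahead and one look behind on the same sample text; both patterns are illustrative choices, not the only way to write them.

```python
import re

st = """Mr. topi
Mr Sweet
Mrs. Cute"""

# Look ahead: match the title only when whitespace and a word follow it
titles = re.findall(r"Mr\.(?=\s\w+)", st)

# Look behind: match the name only when "Mr. " comes right before it
names = re.findall(r"(?<=Mr\. )\w+", st)

print(titles)  # ['Mr.']
print(names)   # ['topi']
```

Python's `re` module requires the pattern inside a look behind to have a fixed width, which `Mr\. ` satisfies (four characters).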

https://regex101.com/

In [29]:
import numpy as np
a1 = np.random.rand(4)
a2 = np.random.rand(4, 1)
a3 = np.array([[1, 2, 3, 4]])
a4 = np.arange(1, 4, 1)
a5 = np.linspace(1, 4, 4)

In [39]:
text=r'''Everyone has the following fundamental freedoms:
    (a) freedom of conscience and religion;
    (b) freedom of thought, belief, opinion and expression, including freedom of the press and other media of communication;
    (c) freedom of peaceful assembly; and
    (d) freedom of association.'''

import re
pattern = X
print(len(re.findall(pattern,text)))

NameError: name 'X' is not defined
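
The cell above fails because the placeholder `X` is never defined. The source does not say which pattern was intended, so the following is only one possible stand-in, assuming the goal is to count the four lettered list markers (a) through (d).

```python
import re

text = r'''Everyone has the following fundamental freedoms:
    (a) freedom of conscience and religion;
    (b) freedom of thought, belief, opinion and expression, including freedom of the press and other media of communication;
    (c) freedom of peaceful assembly; and
    (d) freedom of association.'''

# Assumed stand-in for X: a literal "(", one letter from a to d, then a literal ")"
pattern = r"\([a-d]\)"
count = len(re.findall(pattern, text))
print(count)  # 4
```

The parentheses have to be escaped here; unescaped, they would create a capturing group instead of matching the literal characters.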