# Regex

In this part we will cover the following topics:
- regular expression - learning to use *, +, and ?
- regular expression - learning to use $ and ^, and the non-start and non-end of a word
- searching multiple literal strings and substring occurrence
- learning to create a date regex and a set of characters or ranges of characters
- finding all five character words and making abbreviations in some sentences
- learning to write your own regex tokenizer
- learning to write your own regex stemmer

# Learning how to use *, +, and ?

These symbols are often called wildcards, but in regular expressions they are more precisely quantifiers: zero or more (*), one or more (+), and zero or one (?) occurrences of the preceding element.

In [1]:
# import required libraries
import re

In [3]:
# define a function that takes as input text and patterns to be applied for match
def text_match(text, patterns):
    if re.search(patterns, text):
        return "Found a match!"
    else:
        return "Not matched!"

The re.search() function scans the text for the first location where the pattern matches. It returns a match object if one is found and None otherwise, so the if statement in text_match treats the result as true or false.
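To see what re.search() hands back before text_match turns it into a message, here is a small sketch. The variable names found and missing are illustrative and not part of the original notebook.

```python
import re

# re.search gives back a match object when the pattern is found...
found = re.search('ab?', 'abc')
print(found.group())   # the substring that was matched

# ...and None when it is not
missing = re.search('ab+', 'ac')
print(missing)
```

A match object is always truthy and None is falsy, which is why the result can be used directly in an if statement.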

In [4]:
# experiment the following case scenarios:
print(text_match('ac', 'ab?'))

Found a match!


In [5]:
print(text_match('abc', 'ab?'))

Found a match!


In [6]:
print(text_match('abbc', 'ab?'))

Found a match!


We can see that our function found a match in all three experiments. These patterns match a part of the input, not the entire input. The pattern 'ab?' says: a followed by zero or one b.

As a rule, whatever matches zero or one (?) will also match zero or more (*): the strings matched by ? are a subset of those matched by *.
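The difference between matching a part of the input and matching all of it can be sketched with re.fullmatch(), which the notebook itself does not use. The variable names are illustrative.

```python
import re

# re.search accepts a match anywhere in the input
partial = re.search('ab?', 'abbc')

# re.fullmatch requires the pattern to cover the whole input
whole = re.fullmatch('ab?', 'abbc')
exact = re.fullmatch('ab?', 'ab')

print(partial.group(), whole, exact.group())
```

Here 'abbc' only passes with re.search(), because 'ab?' can cover at most the first two characters of the string.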

In [9]:
# zero or more wild card *
print(text_match('ac', 'ab*'))

Found a match!


In [10]:
print(text_match('abc', 'ab*'))

Found a match!


In [11]:
print(text_match('abbc', 'ab*'))

Found a match!


Our function again found a match for all three input strings. This confirms the rule of thumb: any input matched with the zero-or-one quantifier (?) is also matched with the zero-or-more quantifier (*).

In [13]:
# one or more wild card +
print(text_match('ac', 'ab+'))

Not matched!


In [14]:
print(text_match('abc', 'ab+'))

Found a match!


In [16]:
print(text_match('abbc', 'ab+'))

Found a match!
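"Found a match!" hides how much of the input each quantifier actually consumed. The following sketch prints the matched substring for every combination used so far; the dictionary name matched is an assumption made for this illustration.

```python
import re

inputs = ['ac', 'abc', 'abbc']
matched = {}

for pattern in ['ab?', 'ab*', 'ab+']:
    row = []
    for text in inputs:
        m = re.search(pattern, text)
        # keep the matched substring, or None when nothing matched
        row.append(m.group() if m else None)
    matched[pattern] = row
    print(pattern, row)
```

The ? quantifier never takes more than one b, while * and + take as many as are available.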


To be more specific about the number of repetitions, we can use curly braces:

In [19]:
print(text_match('abbc', 'ab{2}'))

Found a match!


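Our pattern says a followed by exactly two b characters. Because re.search() matches a part of the input, the function also reports a match when more b characters follow, as long as the input contains a followed by at least two b.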
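Since re.search() only needs a part of the input to match, 'ab{2}' is also found inside a longer run of b characters. If a third b should be rejected, one possible approach (not taken from the notebook) is a negative lookahead:

```python
import re

# 'ab{2}' is found inside 'abbbc' because only 'abb' has to match
loose = re.search('ab{2}', 'abbbc')

# (?!b) asserts that the two b characters are not followed by another b
strict = re.search(r'ab{2}(?!b)', 'abbbc')
strict_ok = re.search(r'ab{2}(?!b)', 'abbc')

print(loose.group(), strict, strict_ok.group())
```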

In [20]:
# range of repetitions: a followed by 3 to 5 b (the trailing ? makes the quantifier non-greedy)
print(text_match('aabbbbc', 'ab{3,5}?'))

Found a match!
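The trailing ? in 'ab{3,5}?' does not change whether a match is found, only how much is consumed. A short sketch with illustrative variable names:

```python
import re

text = 'aabbbbc'

# greedy: take as many b characters as the range allows
greedy = re.search('ab{3,5}', text).group()

# non-greedy: stop as soon as the minimum of three is reached
lazy = re.search('ab{3,5}?', text).group()

print(greedy, lazy)
```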


# Learning how to use $ and ^, and the non-start and non-end of a word

The following anchors are used to match a pattern at the start and end of an input text:
- ^ start
- $ end

In [21]:
# import required libraries
import re

In [22]:
# define method to test pattern match
def text_match(text, patterns):
    if re.search(patterns, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

In [23]:
# applying regex pattern to match the start and end of the string
print('Pattern to test starts and ends with')
print(text_match('abbc', '^a.*c$'))

Pattern to test starts and ends with
Found a match!


The pattern '^a.*c$' means: start with a, followed by zero or more of any character, and end with c. If the input text matches, text_match reports a match. The dot ('.') matches any character except a newline.

When we say .* , it means zero or more occurrences of any character.
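The newline exception of the dot can be observed directly. The sketch below uses the re.DOTALL flag, which the notebook does not cover, to lift that restriction; the input string is made up for the illustration.

```python
import re

text = 'ab\nc'

# by default the dot stops at the newline, so the pattern cannot reach c
default = re.search('^a.*c$', text)

# with re.DOTALL the dot also matches the newline
dotall = re.search('^a.*c$', text, re.DOTALL)

print(default, repr(dotall.group()))
```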

In [24]:
# check that the input text begins with a word (raw string keeps \w intact)
print('Begin with a word')
print(text_match('Tuffy eats pie, Loki eats peas!', r'^\w+'))

Begin with a word
Found a match!


The pattern says: start (^) with a word character (\w, that is a letter, digit, or underscore), repeated one or more times (+).

In [25]:
# check that the input text ends with a word and optional punctuation
print('End with a word and optional punctuation')
print(text_match('Tuffy eats pie, Loki eats peas!', r'\w+\S*?$'))

End with a word and optional punctuation
Found a match!


The pattern means one or more word characters (\w+), followed by zero or more non-whitespace characters (\S*?, matched non-greedily, which covers the optional punctuation), anchored at the end of the input text ($).
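To check what the two anchored patterns actually capture, we can print the matched text instead of a fixed message. A minimal sketch with illustrative variable names:

```python
import re

text = 'Tuffy eats pie, Loki eats peas!'

# anchored at the start: the first word
first_word = re.search(r'^\w+', text).group()

# anchored at the end: the last word plus trailing punctuation
last_word = re.search(r'\w+\S*?$', text).group()

print(first_word, last_word)
```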

In [26]:
# find a word that contains a specific character
print('Finding a word which contains character, not start or end of the word')
print(text_match('Tuffy eats pie, Loki eats peas!', r'\Bu\B'))

Finding a word which contains character, not start or end of the word
Found a match!


In this case \B is the opposite of \b. The \b matches an empty string at the beginning or end of a word, while \B matches an empty string anywhere else. So '\Bu\B' matches a u that is neither the first nor the last character of a word. Here the match was found in the word 'Tuffy'.
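A quick way to confirm this behaviour is to compare a word with u in the middle against words where u sits at an edge. The second input string is invented for this sketch.

```python
import re

# u is in the middle of the word, so \B holds on both sides
inside = re.search(r'\Bu\B', 'Tuffy')

# here u only appears as the first or last letter of a word
edges = re.search(r'\Bu\B', 'up you')

print(inside.span(), edges)
```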

# Searching multiple literal strings and substring occurrences

In [1]:
# import required libraries
import re

In [2]:
# define a list of string names and an input string
patterns = ['Tuffy', 'Pie', 'Loki']
text = 'Tuffy eats pie, Loki eats peas!'

In [5]:
for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern, text))
    if re.search(pattern, text):
        print('Found!')
    else:
        print('Not Found!')

Searching for "Tuffy" in "Tuffy eats pie, Loki eats peas!" ->
Found!
Searching for "Pie" in "Tuffy eats pie, Loki eats peas!" ->
Not Found!
Searching for "Loki" in "Tuffy eats pie, Loki eats peas!" ->
Found!


In this simple for loop, we iterate over all the patterns and call re.search() for each one. Looking at the results for the three elements, we can see a match for the first and last one, while the second was not found. The reason is that the search is case-sensitive: the pattern 'Pie' does not match the lowercase 'pie' in the text.
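One way to make the second pattern match as well is to ignore case during the search. The sketch below uses the re.IGNORECASE flag; the results dictionary is an illustrative name and not part of the notebook.

```python
import re

patterns = ['Tuffy', 'Pie', 'Loki']
text = 'Tuffy eats pie, Loki eats peas!'

# re.IGNORECASE makes the search case-insensitive
results = {p: bool(re.search(p, text, re.IGNORECASE)) for p in patterns}
print(results)
```

If the patterns may contain characters that are special in a regex, such as a dot, passing them through re.escape() first keeps them literal.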

In [6]:
# another example
text = 'Diwali is a festival of lights, Holi is a festival of colors!'
pattern = 'festival'

In [9]:
for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print('Found "%s" at %d:%d' % (text[s:e], s, e))

Found "festival" at 12:20
Found "festival" at 42:50


The for loop reports that 'festival' occurs twice in our text, at positions 12:20 and 42:50.
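If only the positions and the number of occurrences are needed, the same re.finditer() call can be collected into a list. A minimal sketch; spans and count are names chosen for this example.

```python
import re

text = 'Diwali is a festival of lights, Holi is a festival of colors!'

# span() returns the (start, end) pair of each match
spans = [m.span() for m in re.finditer('festival', text)]
count = len(spans)

print(count, spans)
```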

# Learning to create date regex and a set of characters or ranges of characters

In [10]:
# import required libraries
import re

In [11]:
# declare a url object and write a simple date finder regular expression
url="http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewis1/"

In [12]:
date_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})/'

Our variable date_regex is a simple string that contains a regex matching a date in the format /YYYY/MM/DD/. \d denotes a digit from 0 to 9, {4} requires exactly four of them, and {1,2} allows one or two.

In [13]:
# apply date_regex to url
print('Date found in the URL: ', re.findall(date_regex, url))

Date found in the URL:  [('2017', '10', '28')]
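The regex only checks the shape of the date, so a path like /2017/99/99/ would also be accepted. As a hedged sketch, the groups can be named and handed to datetime.date for validation. The group names and the assumption that the URL order is year/month/day are ours, not the notebook's.

```python
import re
from datetime import date

url = "http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewis1/"

# same pattern as date_regex, with named groups (names are illustrative)
named_regex = r'/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/'

m = re.search(named_regex, url)
parsed = None
if m:
    # date() raises ValueError for impossible dates such as month 13
    parsed = date(int(m.group('year')), int(m.group('month')), int(m.group('day')))

print(parsed)
```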


In [16]:
# return True only if the string consists solely of letters, digits, and dots
def is_allowed_specific_char(string):
    # the negated set [^...] matches any character that is NOT allowed
    char_re = re.compile(r'[^a-zA-Z0-9.]')
    disallowed = char_re.search(string)
    return disallowed is None

In [15]:
print(is_allowed_specific_char("ABCDEFabcdef123450."))

True


In [17]:
print(is_allowed_specific_char("*&%@#!){"))

False


The pattern /(\d{4})/(\d{1,2})/(\d{1,2})/ matches any input that contains a date in the specified format, and each pair of parentheses captures one part of the date as a group. The [] notation is a set: it matches any one of the characters enclosed in it. A leading ^ inside the set negates it, so [^a-zA-Z0-9.] matches any character that is not a letter, a digit, or a dot. If a single such character is found, the function returns False.
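To see which characters cause a string to be rejected, the negated set can be used with findall() instead of search(). This is a sketch; the first input string is made up for the illustration.

```python
import re

# negated set: anything that is not a letter, a digit, or a dot
disallowed_re = re.compile(r'[^a-zA-Z0-9.]')

offending = disallowed_re.findall('abc-123_x.')
clean = disallowed_re.findall('ABCDEFabcdef123450.')

print(offending, clean)
```

An empty list corresponds to the True result of is_allowed_specific_char.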

# Find all five-character words and make abbreviations in some sentences

In [18]:
# import required libraries
import re

In [19]:
# define a string variable
street = '21 Ramkrishna Road'

In [21]:
# replace 'Road' with 'Rd' using re.sub() function
print(re.sub('Road', 'Rd', street))

21 Ramkrishna Rd
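A plain 'Road' pattern replaces the text wherever it appears, including inside longer words. One possible refinement is to add word boundaries. The street name below is hypothetical and chosen to show the difference.

```python
import re

street = '21 Roadside Road'

# without boundaries the prefix of 'Roadside' is abbreviated too
naive = re.sub('Road', 'Rd', street)

# \b restricts the replacement to the standalone word
bounded = re.sub(r'\bRoad\b', 'Rd', street)

print(naive)
print(bounded)
```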


In [22]:
# find all five-character words inside a given sentence
text = 'Diwali is a festival of light, Holi is a festival of color!'
print(re.findall(r"\b\w{5}\b", text))

['light', 'color']


In the pattern r"\b\w{5}\b" we use \b to match the boundaries between words and the {5} notation to make sure we only keep five-character words.
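The same idea extends to a range of lengths. As a small variation on the notebook's example, {5,} keeps every word with five or more characters:

```python
import re

text = 'Diwali is a festival of light, Holi is a festival of color!'

# {5,} means five or more word characters between the boundaries
long_words = re.findall(r'\b\w{5,}\b', text)
print(long_words)
```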

# Learning to write your own regex tokenizer

In [23]:
# import required libraries
import re

In [24]:
# define a raw sentence
raw = "I am big! It's the pictures that got small."

In [25]:
print(re.split(r' +', raw))

['I', 'am', 'big!', "It's", 'the', 'pictures', 'that', 'got', 'small.']


We can see that splitting on spaces left the punctuation attached to the tokens ('big!', 'small.').

In [28]:
# split on non-word characters
print(re.split(r'\W+', raw))

['I', 'am', 'big', 'It', 's', 'the', 'pictures', 'that', 'got', 'small', '']


We managed to split on all the non-word characters (the spaces, the exclamation mark, the apostrophe, and so on), but we also removed them from our result and got an empty string at the end. We need to try something different.

In [29]:
print(re.findall(r'\w+|\S\w*', raw))

['I', 'am', 'big', '!', 'It', "'s", 'the', 'pictures', 'that', 'got', 'small', '.']


We started with a simple re.split() on space characters and improved it by splitting on non-word characters. Finally, we changed our approach: instead of splitting, we matched what we wanted using re.findall().
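The findall() approach can be wrapped in a small function and tuned further. The sketch below is one possible variant, not the notebook's tokenizer: the tokenize name and the decision to keep contractions such as "It's" together are assumptions made for illustration.

```python
import re

def tokenize(sentence):
    """Split a sentence into words and single punctuation marks."""
    # a word, optionally followed by an apostrophe and more letters,
    # or any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

raw = "I am big! It's the pictures that got small."
tokens = tokenize(raw)
print(tokens)
```

The group is written as (?:...) so that findall() returns whole matches instead of the group contents.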

# Learning to write your own regex stemmer

In [30]:
# import required library
import re

In [31]:
# define a function to perform stemming
def stem(word):
    # group 1 is the non-greedy stem, group 2 is an optional suffix
    splits = re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', word)
    return splits[0][0]

We apply re.findall() to the input word, which returns two groups as output. The first group is the stem and the second is the suffix, if any. We return the first group as the result of the function call.

In [33]:
raw = "Keep your friends close, but your enemies closer."
tokens = re.findall(r'\w+|\S\w*', raw)
print(tokens)

['Keep', 'your', 'friends', 'close', ',', 'but', 'your', 'enemies', 'closer', '.']


In [35]:
# apply stemming on each word
for t in tokens:
    print("'"+stem(t)+"'")

'Keep'
'your'
'friend'
'close'
','
'but'
'your'
'enem'
'closer'
'.'


We are using the re.findall() function to get the desired output. Note that the first group uses a non-greedy match (.*?); a greedy .* would consume the entire word and no suffix would be identified. The start (^) and end ($) anchors are also required so that the pattern matches the entire input word and splits it.
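The effect of the non-greedy group can be checked by running both variants on the same word. In this sketch the suffix list is copied from the stem function above, and the variable names are illustrative.

```python
import re

suffixes = r'(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

# non-greedy first group: stops as soon as a suffix can finish the match
lazy_split = re.findall(r'^(.*?)' + suffixes, 'friends')[0]

# greedy first group: swallows the whole word, leaving an empty suffix
greedy_split = re.findall(r'^(.*)' + suffixes, 'friends')[0]

print(lazy_split, greedy_split)
```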