Here we'll talk about pattern matching in strings using regular expressions. Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern that you give to a regex processor along with some source data. The processor then parses the source data using the pattern and returns chunks of text to the programmer for further manipulation. There are three main reasons you would want to do this:
- To check whether a pattern exists within some source data.
- To get all instances of a complex pattern from some source data.
- To clean your source data using a pattern, generally through string splitting.

Regexes are not trivial, but they are a foundational technique for data cleaning in data science applications.

###### BEST way to LEARN regex is to WRITE regexes

In [2]:
# Import the "re" module, Python's standard library for regular expressions.
import re

In [35]:
# re offers several main functions. For example, match() checks for a match
# only at the beginning of the string and returns a Match object (or None if there is no match).
# Similarly, search() checks for a match anywhere in the string. Both results can be evaluated as a boolean.

text = "This is a nice day!"

# Let's check if it's a nice day or not:
if re.search("nice", text):
    print("Yes ! 🥰")
else:
    print("No 😞")

Yes ! 🥰
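
To make the difference between match() and search() concrete, here is a minimal sketch that reuses the same sentence. Both functions return a Match object on success and None otherwise.

```python
import re

text = "This is a nice day!"

# match() only succeeds if the pattern matches at position 0.
print(re.match("nice", text))    # None, because the text starts with "This"
print(re.match("This", text))    # a Match object for "This"

# search() scans the whole string and returns the first match it finds.
print(re.search("nice", text))   # a Match object for "nice"
```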


In addition to checking for conditionals, we can segment a string. The work that regex does here is called tokenizing, where the string is segmented into substrings based on patterns. Tokenizing is a core activity in NLP.

In [38]:
# The findall() and split() functions parse the string and return chunks. For example:
text = "Jatin works diligently. Jatin gets good grades. Jatin is successful."

# splits the variable 'text' on all instances of Jatin
print(re.split("Jatin", text))

# To count how many times we have talked about Jatin, we can use findall(). It looks for a pattern and pulls out all occurrences.
times = re.findall("Jatin", text)
print(times)                             # 'times' is a list
print(len(times))                        # length of 'times'

['', ' works diligently. ', ' gets good grades. ', ' is successful.']
['Jatin', 'Jatin', 'Jatin']
3
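
Splitting on a fixed name is the simplest case. As a rough sketch of tokenizing, the same two functions can also split a sentence into words; the `sentence` variable and the `\W+` / `\w+` patterns here are illustrative choices, not the only way to tokenize.

```python
import re

sentence = "Jatin works diligently. Jatin gets good grades."

# Split on runs of non-word characters (spaces, full stops) to get word tokens.
# The filter drops the empty string left behind by the final full stop.
tokens = [tok for tok in re.split(r"\W+", sentence) if tok]
print(tokens)

# findall() can do the same job by describing the tokens rather than the gaps.
print(re.findall(r"\w+", sentence))
```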


In [66]:
# re.search() actually returns a new object, called an re.Match object (or None when nothing matches).
# An re.Match object always has a boolean value of True, while None is False, which is why
# the result can be used directly in an if statement, as shown above.
# The rendering of the match object also shows which text was matched and where.

re.search("^Jatin", text)

<re.Match object; span=(0, 5), match='Jatin'>

### Patterns and Character Classes

In [119]:
# Let's deal with patterns that involve character classes.

# 'grades' is a string of a single learner's grades over a semester in one course, across all of their assignments.
grades = "ACAAAABCBCBAA"

# Question : how many B's are in the above grades list ?
re.findall("B", grades)

['B', 'B', 'B']

In [120]:
# To count the number of A's or B's in the list, we can't use "AB", since that matches an A followed immediately by a B.
# Instead, put characters A and B inside square brackets.

re.findall("[AB]", grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

In [121]:
# [] -> this is called a set operator. You can include a range of characters, which are ordered alphanumerically.
# This is called character-based matching, i.e. it matches a single character against any one member of the set (an OR).
# For instance, if we want to refer to all lower case letters we could use [a-z].

# For example, count the number of instances where this student receives an A followed by a B or a C.
re.findall("A[B-C]", grades)

['AC', 'AB']

The A[B-C] pattern denotes two sets of characters which must be matched back to back. This can also be written using the pipe " | " operator, which means OR.

In [122]:
re.findall("AB|AC", grades)

['AC', 'AB']

Use the caret operator inside the set operator to negate the set, i.e. to find only the grades which were not A's.

In [123]:
re.findall("[^A]", grades)

['C', 'B', 'C', 'B', 'C', 'B']

In [134]:
# Note this carefully - the caret was previously used as an anchor that matches the beginning of a string.
# But inside the set operator the caret has a different meaning: placed first, it negates the set.
# (Most other special characters lose their special meaning inside a set.)
# For example:

# The result below is an empty list. The regex asks for a character at the beginning of the string
# that is not an A; since our string starts with an A, no match is found.
re.findall("^[^A]", grades)

[]
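
The three roles of the caret can be compared side by side. This is a small sketch on the same `grades` string; the last line uses a made-up input, "A^B", purely for illustration.

```python
import re

grades = "ACAAAABCBCBAA"

print(re.findall("^A", grades))     # anchor: an A at the very start of the string
print(re.findall("[^A]", grades))   # negation: every character that is not an A
print(re.findall("[A^]", "A^B"))    # caret not first in the set: a literal caret
```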

### Quantifiers
Quantifiers specify the number of times a pattern needs to be matched.
The most basic quantifier is written as e{m,n}, where e is the expression or character we are matching, m is the minimum number of times it needs to be matched, and n is the maximum number of times the item could be matched.

In [124]:
# Let's use these grades as an example.
# How many times has this student been on a back-to-back A's streak ?
re.findall("A{2,10}", grades)     # 2 as the min, but 10 as a max

# That means there are only two streaks, one of 4 A's and one of 2 A's, when looking for any run of two up to ten A's in a row.

['AAAA', 'AA']

In [125]:
re.findall("A{1,1}A{1,1}", grades)
# here, the regex processor will match those patterns where 1 A is followed by only 1 A.

['AA', 'AA', 'AA']

In [126]:
# In particular, if you have an extra space in between the braces you'll get an empty result,
# because the pattern is then read as literal text rather than as a quantifier.
re.findall("A{2, 10}", grades)         # notice an incorrect space

[]

In [127]:
# If no quantifier is mentioned then the default is {1,1}
re.findall("B", grades)

['B', 'B', 'B']

In [128]:
# If only one number is inside the braces, it's considered that both m and n are equal.
re.findall("A{2}", grades)

['AA', 'AA', 'AA']

In [132]:
# To find the trend of grades - increasing or decreasing
re.findall("A{1,10}B{1,10}C{1,10}", grades)

# so, it can be said as "decreasing trend"

['AAAABC']

There are 3 other quantifiers that are used as shorthand:
- asterisk (*) : to match 0 or more times
- question mark (?) : to match 0 or 1 times
- plus sign (+) : to match one or more times

These quantifiers shorten up the curly brace syntax.
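
As a quick sketch of the shorthand forms, here they are applied to the same `grades` string as before, with the equivalent curly brace form noted in each comment.

```python
import re

grades = "ACAAAABCBCBAA"

print(re.findall("A+", grades))    # one or more A's, same as A{1,}
print(re.findall("AB?", grades))   # an A optionally followed by one B, same as AB{0,1}
print(re.findall("BC*", grades))   # a B followed by zero or more C's, same as BC{0,}
```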

In [1]:
with open("/Users/barmanr/Downloads/Health-Tweets/nytimeshealth.txt", "r") as file:
    wiki = file.read()


In [178]:
re.split("[\n]", wiki)[10]

'547904193961656321|Wed Dec 24 23:58:34 +0000 2014|The New Health Care: People Are Shopping for Health Insurance, Surprisingly http://nyti.ms/1vpIjqE'

### Look-ahead and Look-behind

In this case, the pattern being given to the regex engine is for text either before or after the text we are trying to isolate.

'wiki' contains numerous tweets by the NYTimes, with the fields of each tweet separated by pipes |. Here, we will try to get a list of all the hashtags that are included in this data.

In [180]:
pattern = r"#[\w\d]*(?=\s)"

# Notice that the ending is a look-ahead. We're not actually interested in returning the whitespace
# that follows the hashtag, so it is checked but left out of the return value.

re.findall(pattern, wiki)

['#askwell',
 '#pregnancy',
 '#Colorado',
 '#VegetarianThanksgiving',
 '#FallPrevention',
 '#Ebola',
 '#Ebola',
 '#ebola',
 '#Ebola',
 '#Ebola',
 '#EbolaHysteria',
 '#AskNYT',
 '#Ebola',
 '#Ebola',
 '#Liberia',
 '#Excalibur',
 '#ebola',
 '#Ebola',
 '#dallas',
 '#nobelprize2014',
 '#ebola',
 '#ebola',
 '#monrovia',
 '#ebola',
 '#nobelprize2014',
 '#ebola',
 '#nobelprize2014',
 '#Medicine',
 '#Ebola',
 '#Monrovia',
 '#Ebola',
 '#smell',
 '#Ebola',
 '#Ebola',
 '#Ebola',
 '#Monrovia',
 '#Ebola',
 '#ebola',
 '#monrovia',
 '#liberia',
 '#benzos',
 '#ClimateChange',
 '#Whole',
 '#Wheat',
 '#Focaccia',
 '#Tomatoes',
 '#Olives',
 '#Recipes',
 '#Health',
 '#Ebola',
 '#Monrovia',
 '#Liberia',
 '#Ebola',
 '#Ebola',
 '#Liberia',
 '#Ebola',
 '#blood',
 '#Ebola',
 '#organtrafficking',
 '#EbolaOutbreak',
 '#SierraLeone',
 '#Freetown',
 '#SierraLeone',
 '#ebolaoutbreak',
 '#kenema',
 '#ebola',
 '#Ebola',
 '#ebola',
 '#ebola',
 '#Ebola',
 '#ASMR',
 '#AIDS2014',
 '#AIDS',
 '#MH17',
 '#benzos',
 '#BigSoda
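
The section title also mentions look-behind, which works the same way in the other direction. The sketch below uses a short made-up `sample` string in place of the tweet file, so the text and its results are illustrative only.

```python
import re

# Hypothetical sample text standing in for the tweet file.
sample = "Stay safe #Ebola updates. New #askwell column on @nytimes today "

# Look-ahead: match a hashtag only if whitespace follows; the whitespace is not returned.
print(re.findall(r"#\w+(?=\s)", sample))

# Look-behind: match a word only if a '#' precedes it; the '#' is not returned.
print(re.findall(r"(?<=#)\w+", sample))
```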

****

# <u>Practice Questions</u>

## Q).

Write a Python program to check whether a password is valid or not. A valid password must satisfy the following conditions:

1. should be at least eight characters long
2. contains at least one uppercase character
3. contains at least one lowercase character
4. has at least one digit
5. has at least one special character; the special characters are '$', '#', and '@'
6. shouldn't have any space

In [22]:
def valid_or_not(password):
    ''' password is a string
        Output -> The function is expected to return one of the strings:
         Valid Password / Invalid Password'''

    invalid = "Invalid Password"

    # Rule 1: at least eight characters long
    if len(password) < 8:
        return invalid

    # Rules 2-5: an uppercase letter, a lowercase letter, a digit, a special character
    regList = [r"[A-Z]", r"[a-z]", r"[0-9]", r"[$#@]"]
    for reg in regList:
        if re.search(reg, password) is None:
            return invalid

    # Rule 6: no whitespace anywhere
    if re.search(r"\s", password) is not None:
        return invalid

    return "Valid Password"

In [23]:
valid_or_not("python_is12cool")

'Invalid Password'

In [25]:
valid_or_not("python@123Cool")

'Valid Password'
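
The same six rules can also be folded into one pattern using look-aheads. This is only a sketch of an alternative; `PASSWORD_RE` and `is_valid_password` are hypothetical names introduced here, and the function returns a boolean rather than the strings used above.

```python
import re

# Each (?=...) look-ahead checks one rule without consuming characters;
# \S{8,} then requires at least eight characters, none of them whitespace.
PASSWORD_RE = re.compile(r"(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[$#@])\S{8,}")

def is_valid_password(password):
    """Hypothetical helper: True when every rule is satisfied."""
    return PASSWORD_RE.fullmatch(password) is not None

print(is_valid_password("python@123Cool"))
print(is_valid_password("python_is12cool"))
```

The loop-based version is easier to extend with per-rule error messages, while the single pattern is more compact.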

## Q).

A valid debit card has the following characteristics:
1. It must start with a 4,5 or 6.
2. It must contain exactly 16 digits.
3. It must only consist of digits
4. It must NOT use any separator like `' '`, `'_'`, etc.

In [26]:
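def card(x):
    '''
    Input x : card number as a string
    Output:  Return whether it's a valid debit card number or not
    '''
    # A leading 4, 5 or 6 followed by exactly 15 more digits (16 in total).
    # fullmatch() rejects separators and any other non-digit characters.
    reg = r"[456][0-9]{15}"
    if re.fullmatch(reg, x):
        return "Valid"
    return "Invalid"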

In [27]:
card("4245647893112578")

'Valid'

## Q).

Write a function `troll_the_trolls()` which takes the comment as a parameter and replaces the vowels of the comment with an empty string `""`, so that the comments lose their meaning and won't be able to hurt anyone.

In [28]:
def troll_the_trolls(comment):
    return re.sub(r"[aeiouAEIOU]", "", comment)
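
As a small variation, the upper case vowels do not have to be listed if the match is made case-insensitive. In this sketch `strip_vowels` is a hypothetical name and the input sentence is made up.

```python
import re

def strip_vowels(comment):
    # re.IGNORECASE lets one lower case set cover both cases.
    return re.sub(r"[aeiou]", "", comment, flags=re.IGNORECASE)

print(strip_vowels("This is Awful"))
```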

## Q).

Extract, programmatically, the usernames from a comment which is given in the form of a string.

The usernames start with `"@"` and the names in the returned list don't include `"@"`.

In [29]:
def users(comment):    
    users_list = []
    
    names = re.findall(r"@[a-zA-Z]+", comment)
    for name in names:
        users_list.append(name[1:])
 
    return users_list

In [31]:
users("How are you doing @Karan. We have come a long way from the days when me @Arjun and you were in school together!")

['Karan', 'Arjun']

## Q).

Complete the function `get_countries()` in order to return a list consisting of the name of countries that have `"#"` as suffix and prefix.

In [32]:
def get_countries(blog):
    
    countries_names = []
    
    pat = r"#[A-Za-z]+#"
    for cntry in re.finditer(pat, blog):
        countries_names.append(cntry.group(0)[1:-1])
    
    return countries_names

In [33]:
get_countries('''Varanasi in #India# is very crowded because of tourism due to Ganga and
                 temples and same is the case with Kathmandu of #Nepal#
                 because of its Heritage and Mountains''')

['India', 'Nepal']

## Q). find the output :

In [4]:
import re

text = '''The intellectual content in a physical book need not be a composition, nor even be called a book.
          Books can consist only of drawings, engravings or photographs, crossword puzzles or cut-out dolls.
          In a physical book, the pages can be left blank or can feature an abstract set of lines to support entries.
          '''

print(re.findall('b..k', text))

['book', 'book', 'book']


## Q).

In [5]:
sentence = "cryptographic encryption methods that can be cracked easily with quantum computers"

pattern = re.compile("crypto(.{1,30})coin")

print(pattern.match(sentence)) 

None


## Q).

In [6]:
import re

text1 = '**//DataScience// - 12. '

pattern = re.compile(r'[\W_]+')

print(pattern.sub('', text1))

DataScience12


## Q).

In [8]:
sentence = 'ferrari is faster than a lambhorghini'
matched = re.search('is', sentence)
print(matched.span())

(8, 10)
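
span() is one of several accessors on a Match object. As a brief sketch on the same sentence (searching for "faster" instead of "is"), the others can be used like this:

```python
import re

sentence = "ferrari is faster than a lambhorghini"
matched = re.search("faster", sentence)

if matched:                                  # a Match object is truthy; None is falsy
    print(matched.group())                   # the matched text
    print(matched.start(), matched.end())    # the two halves of span()
    print(sentence[matched.start():matched.end()])
```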


## Q).

In [9]:
paragraph = '''An investment of  $1 in the year 1801,  would have given you $18087791.41 today.
               This is a 7.967% return on investment.
               But with an investment of only $0.25 in 1801, you would end up with $4521947.8525.
               '''

result = [x[0] for x in re.findall(r'(\$[0-9]+(\.[0-9]*)?)', paragraph)]

result

['$1', '$18087791.41', '$0.25', '$4521947.8525']

## Q).

In [10]:
text = '''
            "One cannot have enough socks with him", said Dora. 
            "Another has come and gone and I didn't get a single pair. 
            People will keep on insisting on giving me books." 
            Christmas Quote 
            '''

regex = 'Christ.*' 
print(re.match(regex, text), ",", re.findall(regex, text))

None , ['Christmas Quote ']


## Q).

In [11]:
re.findall('a{3,5}b', "aba")

[]

In [12]:
re.findall('a{3,5}b', "aaabaa")

['aaab']

In [13]:
re.findall('a{3,5}b', "aaaaab")

['aaaaab']

In [14]:
re.findall('a{3,5}b', "aaaaaab")

['aaaaab']

## Q).

In [15]:
new_input = ['18:29', '23:55', '123', 'ab:de', '18:299', '99:99']

results = lambda x: re.match('[0-9]{2}:[0-9]{2}', x) is not None

for x in new_input:
    print(results(x), end=" ")

True True False False True True 
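
Note that '18:299' and '99:99' pass because match() only anchors the start and the pattern accepts any two digits. If the intent were to validate 24-hour times, one possible sketch is below; `time_re` and its hour and minute ranges are assumptions added for illustration.

```python
import re

new_input = ['18:29', '23:55', '123', 'ab:de', '18:299', '99:99']

# Hours 00-23 and minutes 00-59; fullmatch() also rejects trailing characters.
time_re = re.compile(r"([01][0-9]|2[0-3]):[0-5][0-9]")

for value in new_input:
    print(value, time_re.fullmatch(value) is not None)
```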

## Q).

In [17]:
s1 = 'Artificial Intelligence consists of Machine Learning and Deep Learning'
pattern = 'Learning'

for match in re.finditer(pattern, s1):
    s = match.start()
    e = match.end()
    print('%d:%d' %(s, e), end=" ")

44:52 62:70 

## Q).

In [19]:
target_string = '''Anika is a basketball player who was born on June 13, 1994.
                   She played 137 matches with a scoring average of 28.19 points per game.
                   Her weight is 57 kg.'''

result = re.finditer(r"\d{2}", target_string)

for match_obj in result:
    print(match_obj.group(), end=" ")

13 19 94 13 28 19 57 

## Q).

In [21]:
text = ''' Extract the domain from the urls www.realpython.com'''
pattern = r'(www\.([A-Za-z_0-9-]+)(\.\w+))'

find_iter_result = re.finditer(pattern, text)

for i in find_iter_result:
  print(i.group(0))     #output1

find_all_result = re.findall(pattern, text)
for i in find_all_result:
  print(i[1])          #output2

www.realpython.com
realpython
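
The two outputs differ because finditer() yields Match objects while findall() returns one tuple of groups per match. A short sketch that prints the raw structures, using a pattern with the dots escaped:

```python
import re

text = "Extract the domain from the urls www.realpython.com"
pattern = r"(www\.([A-Za-z_0-9-]+)(\.\w+))"

# findall() returns one tuple per match, holding groups 1, 2 and 3 in order.
print(re.findall(pattern, text))

# finditer() yields Match objects; group(0) is the whole match.
for m in re.finditer(pattern, text):
    print(m.group(0), m.group(2), m.groups())
```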
