# **Week 2: Regular Expressions in NLP**

This practical covers the main functions of Python's re module: re.search, re.findall, re.compile, re.sub, and re.split.

In [1]:
import re
import pandas as pd

**Using re.search(pattern, string)**
* scans string for the first location where the regular expression pattern produces a match and returns the corresponding match object (or *None* if there is no match)
* the match object contains information about the search and its result

**NOTE!!**
* *.span()* returns a tuple containing the start and end positions of the match
* *.group()* returns the part of the string that matched

In [2]:
pattern = "dividend"
string = """A special dividend is a non-recurring distribution of company
assets, usually in the form of cash, to shareholders. A special
dividend is usually larger compared to normal dividends paid
out by the company and often tied to a specific event like an
asset sale or other windfall event.
"""

In [3]:
result = re.search(pattern, string)

In [4]:
print(result.span())
print(result.group())

(10, 18)
dividend
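
The span is a pair of slice indices into the original string, so slicing with it recovers the same text as *.group()*. A minimal sketch, using a shortened version of the text above (the variable names here are illustrative, not part of the practical):

```python
import re

text = "A special dividend is a non-recurring distribution of company assets."
match = re.search("dividend", text)

# span() gives (start, end); end is exclusive, like a normal Python slice
start, end = match.span()
sliced = text[start:end]

print((start, end))
print(sliced == match.group())
```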


What if the pattern is "DIVIDEND"?

In [5]:
pattern = "DIVIDEND"
result = re.search(pattern, string)
print(result.span())
print(result.group())

AttributeError: 'NoneType' object has no attribute 'span'

As shown above, the uppercase pattern matches nothing: *re.search* returns *None*, so calling *.span()* on the result raises the AttributeError. The pattern only matches when the *re.IGNORECASE* flag is passed.

In [13]:
pattern = "DIVIDEND"
result = re.search(pattern, string, flags=re.IGNORECASE)
print(result.span())
print(result.group())

(10, 18)
dividend
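
Because *re.search* returns *None* when nothing matches, it is usually safer to check the result before calling *.span()* or *.group()*. One way to do that is sketched below; the helper name `find_span` and the shortened text are assumptions made for illustration:

```python
import re

text = "A special dividend is a non-recurring distribution of company assets."

def find_span(pattern, text, flags=0):
    """Return the span of the first match, or None when nothing matches."""
    match = re.search(pattern, text, flags)
    if match is None:  # re.search returns None when there is no match
        return None
    return match.span()

case_sensitive = find_span("DIVIDEND", text)
case_insensitive = find_span("DIVIDEND", text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```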


### **Using re.findall(pattern, string)**
* finds all matches of the regex pattern in string and returns them as a list of the matching text fragments

In [15]:
string = '''The advancements in biomarine studies franky@google.com, with the investments
necessary and Davos sinatra123@yahoo.com Then The New Yorker article on wind
farms..'''
pattern = r"\S+@\S+"  # raw string, so the backslashes reach the regex engine unchanged
re.findall(pattern, string)

['franky@google.com,', 'sinatra123@yahoo.com']
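
Note that the first result above keeps the trailing comma, because \S matches any non-whitespace character, punctuation included. One possible tightening, offered as a rough sketch rather than a complete email validator, restricts which characters may appear on each side of the @. The pattern below is an assumption chosen for this example text:

```python
import re

text = ("The advancements in biomarine studies franky@google.com, with the "
        "investments necessary and Davos sinatra123@yahoo.com Then The New "
        "Yorker article on wind farms..")

# Illustrative pattern (an assumption, not a full email grammar):
# a local part of word characters, dots, plus signs or hyphens, then @,
# then a domain made of labels separated by dots.
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# The group is non-capturing, so findall returns the whole match
emails = re.findall(email_pattern, text)
print(emails)
```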

### Using re.compile()
* use re.compile() to compile a pattern once into a reusable pattern object, then apply that object to every row of the dataset.

In [17]:
dataset = pd.read_csv('dividend_statements.csv')

In [18]:
dataset

Unnamed: 0,company,dividend_text
0,ABC,Including the interim dividend of 10 cents per...
1,XYZ,Together with the interim dividend of 12 cents...
2,PQR,Including the interim dividend of 34 cents per...


In [41]:
pattern = re.compile(r'\d+\s+cents per share')

In [42]:
dataset['dividend'] = dataset['dividend_text'].apply(lambda x: pattern.findall(x))

In [43]:
dataset['dividend']

Unnamed: 0,dividend
0,"[10 cents per share, 11 cents per share, 25 ce..."
1,"[12 cents per share, 10 cents per share]"
2,"[34 cents per share, 11 cents per share, 25 ce..."
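
If only the numeric amounts are needed, a capture group can be added to the compiled pattern. The sketch below uses two made-up sentences in place of the *dividend_text* column, since the CSV itself is not reproduced here; the sentences and variable names are hypothetical:

```python
import re

# Hypothetical sentences standing in for the dividend_text column
statements = [
    "Including the interim dividend of 10 cents per share, the total is 25 cents per share.",
    "Together with the interim dividend of 12 cents per share, no further payout is planned.",
]

# Compile once, reuse for every statement; the parentheses capture only the digits
amount_pattern = re.compile(r"(\d+)\s+cents per share")

# With exactly one capture group, findall returns the captured text, not the whole match
amounts = [[int(n) for n in amount_pattern.findall(s)] for s in statements]
print(amounts)
```

The same compiled object could be passed to *.apply()* on a DataFrame column in the way shown above.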


## **Using re.sub(pattern, replace_string, string)**
* replaces every occurrence of pattern in string with replace_string
* Use Case: masking out sensitive info from text

In [58]:
string = '''
This medical report for medical insurance claims for patient PETER HO with NRIC S6712098J. Please treat this report as confidential.'''
pattern = r'[S]\d{7}[A-Z]'  # do NOT add a leading caret (^): it would only match at the very start of the string
replace_string="<SECRET>"
re.sub(pattern,replace_string,string)

'\nThis medical report for medical insurance claims for patient PETER HO with NRIC <SECRET>. Please treat this report as confidential.'

In [59]:
result = re.search(pattern, string)

In [60]:
print(result.group())

S6712098J
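
The comment on the pattern warns against a leading caret. A small sketch of why, using a shortened version of the report text (an assumption made to keep the example compact): with ^ the pattern can only match at the start of the string, so nothing is masked.

```python
import re

text = "Patient PETER HO with NRIC S6712098J. Please treat this report as confidential."

# Anchored: ^ only matches at the very start of the string, so nothing is replaced
anchored = re.sub(r"^S\d{7}[A-Z]", "<SECRET>", text)

# Unanchored: the pattern may match anywhere in the string
unanchored = re.sub(r"S\d{7}[A-Z]", "<SECRET>", text)

print(anchored == text)
print(unanchored)
```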


## Using re.split(pattern, string)
* splits string at each occurrence of pattern and returns a list of the resulting text segments
* useful for breaking a large amount of text into smaller segments

In [61]:
string='My name is Mike and my lucky number numbers are 09 18 99'
pattern = r'\s'
re.split(pattern,string)

['My',
 'name',
 'is',
 'Mike',
 'and',
 'my',
 'lucky',
 'number',
 'numbers',
 'are',
 '09',
 '18',
 '99']
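
The split above works cleanly because every word is separated by exactly one space. When the text may contain runs of whitespace, splitting on \s yields empty strings, while \s+ treats each run as a single separator. A minimal sketch with a made-up input string:

```python
import re

# Hypothetical input with a double space, a tab and a triple space
text = "lucky  numbers:\t09 18   99"

single = re.split(r"\s", text)      # splits at every whitespace character
collapsed = re.split(r"\s+", text)  # treats each run of whitespace as one separator

print(single)
print(collapsed)
```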