Introduction:

NLP problems are solved using either a heuristic (rule-based) approach or machine learning.

Regular expressions (regex) offer a powerful rule-based approach: you define patterns that match the parts of your text that carry useful information for a given NLP task.

In this document I discuss two use cases, (1) a customer service chatbot and (2) information extraction, and show how regular expressions can help solve some of the simpler tasks.

Contents:

    Overview of regex
    Installation and pre-requisites
    Code for regex in chatbot
    Code for regex in information extraction (IE)
    Exercise


# NLP Tutorial: Regular Expressions

## (1) Regex in customer support



## Review of Regular Expressions

regex101.com is a website for testing regular expressions. You can build and check your own patterns there, and then copy them into your Python code.

For example:

* To find a single-digit number, you can use the expression \d.

* To find a two-digit number, you can use the expression \d\d.

* To find a three-digit number, you can use the expression \d\d\d, and so on.

* If you want to find a 10-digit number, you can simplify it by using the expression \d{10}.

If you want to match a number in parentheses, such as (123), you need to escape the parentheses using a backslash. The expression will be: \(\d{3}\).
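The digit patterns above can be tried directly in Python. This is a minimal sketch; the sample sentence is made up for illustration and is not one of the tutorial's chat messages.

```python
import re

# Made-up sample text for illustration only
sample = "Call (123) or 1235678912 about item 7"

# \d matches one digit at a time
single_digits = re.findall(r"\d", "item 7")
print(single_digits)   # ['7']

# \d{10} matches exactly ten consecutive digits
ten_digits = re.findall(r"\d{10}", sample)
print(ten_digits)      # ['1235678912']

# \( and \) match literal parentheses instead of opening a group
in_parens = re.findall(r"\(\d{3}\)", sample)
print(in_parens)       # ['(123)']
```

The patterns are written as raw strings (the `r` prefix) so that Python passes the backslashes to the regex engine unchanged.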

To find phone numbers in the format 1235678912 or (123)-567-8912, you can use the following expression: \d{10}|\(\d{3}\)-\d{3}-\d{4}. The pipe symbol | denotes an "or" operation: the text matches if either the left or the right alternative matches.
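A short sketch of the "or" operation in action. It assumes the two phone formats used in the chat examples later in this tutorial (ten plain digits, or a bracketed area code followed by three and four digits); the sample text itself is invented.

```python
import re

# Either ten digits in a row, or the (ddd)-ddd-dddd format
phone_pattern = r"\d{10}|\(\d{3}\)-\d{3}-\d{4}"

# Invented sample containing one number in each format
text = "old: 1235678912, new: (123)-567-8912"

# With no capture groups, findall returns the full text of each match
phones = re.findall(phone_pattern, text)
print(phones)  # ['1235678912', '(123)-567-8912']
```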


To find email addresses in text, you can use the following expressions:

* For a simple email address: [a-zA-Z0-9_]*@[a-zA-Z0-9]*\.com.

* For a more general email address: [a-zA-Z0-9_]*@[a-zA-Z0-9]*\.[a-zA-Z]*.

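The asterisk * means "zero or more consecutive occurrences" of the preceding character or range. A range such as [a-z] matches any single character within that range, and several ranges can be combined in one pair of brackets, as in [a-zA-Z0-9_].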
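Because * also accepts zero occurrences, the general email pattern above will match a string with nothing before or after the @ sign. The sketch below compares it with a variant that uses +, which requires at least one occurrence. The + variant is a suggestion rather than part of the original patterns, and the sample text is invented.

```python
import re

# Invented sample: one real-looking address and one degenerate string
text = "contact: abc@xyz.com, broken: @.com"

# Pattern from the text: * allows empty user and domain parts
star_pattern = r"[a-zA-Z0-9_]*@[a-zA-Z0-9]*\.[a-zA-Z]*"
# Assumed stricter variant: + requires at least one character in each part
plus_pattern = r"[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]+"

star_matches = re.findall(star_pattern, text)
plus_matches = re.findall(plus_pattern, text)

print(star_matches)  # ['abc@xyz.com', '@.com']
print(plus_matches)  # ['abc@xyz.com']
```

Neither pattern is a full email validator; both are simple approximations that suit the chat examples in this tutorial.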


If you want to match everything except digits, you can use the caret symbol ^ as a negation. The caret has this meaning only when it is the first character inside square brackets. For example: [^\d] will match any single character that is not a digit.
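A small sketch of the negated class, using invented sample strings. It also shows `\D`, the standard shorthand that is equivalent to `[^\d]`.

```python
import re

# [^\d] matches each character that is not a digit
non_digits = re.findall(r"[^\d]", "a1b2")
print(non_digits)    # ['a', 'b']

# Removing every non-digit leaves only the digits
text = "order # 412889912"
digits_only = re.sub(r"[^\d]", "", text)
print(digits_only)   # 412889912

# \D is shorthand for the same negated class
same_with_shorthand = re.sub(r"\D", "", text)
print(same_with_shorthand == digits_only)  # True
```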


To find order numbers in the text, you can use the following expression: order[^\d]*\d*. This will match patterns like "order # 123456". If you only want to retrieve the digit part (e.g., "123456"), you can enclose it in parentheses. So it can be rewritten as order[^\d]*(\d*), which will capture only the digits and return "123456" as desired.


Retrieve order number

In [1]:
import re

In [2]:
chat1='codebasics: Hello, I am having an issue with my order # 412889912'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat1)
matches

['412889912']

In [3]:
chat2='codebasics: I have a problem with my order number 412889912'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat2)
matches

['412889912']

In [4]:
chat3='codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat3)
matches

['412889912']

In [5]:
def get_pattern_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None

In [6]:
get_pattern_match(r'order[^\d]*(\d*)', chat1)

'412889912'

Retrieve email id and phone

In [7]:
chat1 = 'codebasics: you ask lot of questions 😠 1235678912, abc@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

Email id:

In [8]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*', chat1)

'abc@xyz.com'

In [9]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*', chat2)

'abc@xyz.com'

In [10]:
get_pattern_match(r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*', chat3)

'abc@xyz.com'

Phone number

In [11]:
get_pattern_match(r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})', chat1)

('1235678912', '')

In [12]:
get_pattern_match(r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})', chat2)

('', '(123)-567-8912')

In [13]:
get_pattern_match(r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})', chat3)

('1235678912', '')
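The results above are tuples because the pattern contains two capture groups, and `re.findall` returns one entry per group, leaving the unmatched group empty. If a single string is more convenient, one option is to drop the groups and read the whole match with `re.search`. The sketch below assumes that approach; `get_phone` and `PHONE_PATTERN` are hypothetical names, not part of the original notebook.

```python
import re

# Same two phone formats as above, but without capture groups
PHONE_PATTERN = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'

def get_phone(text):
    """Return the first phone number found in text, or None."""
    match = re.search(PHONE_PATTERN, text)
    # group(0) is the full text of the match
    return match.group(0) if match else None

chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

print(get_phone(chat2))                # (123)-567-8912
print(get_phone(chat3))                # 1235678912
print(get_phone('no number in here'))  # None
```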

## (2) Regex for Information Extraction

In [14]:
text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship	
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title	
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)	
Justine Wilson
(m. 2000; div. 2008)
Talulah Riley
(m. 2010; div. 2012)
(m. 2013; div. 2016)
'''

In [15]:
get_pattern_match(r'age (\d+)', text)

'50'

In [16]:
get_pattern_match(r'Born(.*)\n', text).strip()

'Elon Reeve Musk'

In [17]:
get_pattern_match(r'Born.*\n(.*)\(age', text).strip()

'June 28, 1971'

In [19]:
get_pattern_match(r'\(age.*\n(.*)', text)

'Pretoria, Transvaal, South Africa'

In [20]:
def extract_personal_information(text):
    # Raw strings pass backslashes such as \d and \( to the regex engine unchanged
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': birth_place.strip()
    }

In [21]:
extract_personal_information(text)

{'age': 50,
 'name': 'Elon Reeve Musk',
 'birth_date': 'June 28, 1971',
 'birth_place': 'Pretoria, Transvaal, South Africa'}

In [22]:
text = '''
Born	Mukesh Dhirubhai Ambani
19 April 1957 (age 64)
Aden, Colony of Aden
(present-day Yemen)[1][2]
Nationality	Indian
Alma mater	
St. Xavier's College, Mumbai
Institute of Chemical Technology (B.E.)
Stanford University (drop-out)
Occupation	Chairman and MD, Reliance Industries
Spouse(s)	Nita Ambani (m. 1985)[3]
Children	3
Parent(s)	
Dhirubhai Ambani (father)
Kokilaben Ambani (mother)
Relatives	Anil Ambani (brother)
Tina Ambani (sister-in-law)
'''

In [23]:
extract_personal_information(text)

{'age': 64,
 'name': 'Mukesh Dhirubhai Ambani',
 'birth_date': '19 April 1957',
 'birth_place': 'Aden, Colony of Aden'}
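Both infoboxes above contain every field, so the function succeeds. If a field were missing, `get_pattern_match` would return `None`, and `int(age)` or `.strip()` would raise an error. The sketch below is one possible way to tolerate missing fields; `first_group` and `extract_personal_information_safe` are hypothetical names, and the sample text is made up rather than taken from a real page.

```python
import re

def first_group(pattern, text):
    """Return the first capture group of the first match, stripped, or None."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

def extract_personal_information_safe(text):
    # Same patterns as extract_personal_information, but None marks a missing field
    age = first_group(r'age (\d+)', text)
    return {
        'age': int(age) if age is not None else None,
        'name': first_group(r'Born(.*)\n', text),
        'birth_date': first_group(r'Born.*\n(.*)\(age', text),
        'birth_place': first_group(r'\(age.*\n(.*)', text),
    }

# Made-up infobox text in the same layout as the examples above
sample = "Born\tAda Example\n1 January 1990 (age 31)\nSample City, Sample Country\n"

complete = extract_personal_information_safe(sample)
missing = extract_personal_information_safe("no infobox here")
print(complete)
print(missing)
```

For text without an infobox, every value in the returned dictionary is `None` instead of the call failing.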