## Regex in Python

In NLP, regular expressions (regex) are very useful. Sometimes a regex alone is enough to solve the problem, with no need for a fancy model.
For example, Gmail showing flight information on top of a ticket email uses regex (pattern matching).

Our goal is to extract useful information from texts using pattern matching.

Use this website as a regex playground:
[Regex 101](https://regex101.com)

Think of it as finding patterns in noise. A user in a chatbot conversation can phrase things in many ways, but the underlying details (say, an email address or a phone number) always follow a recognizable pattern, and those details are what we want to pull out of everything the user writes.

YouTube lecture:

[codebasics](https://www.youtube.com/watch?v=lK9gx4q_vfI&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX&index=3&t=2477s)

In [1]:
import re

### Example 1

In [12]:
chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com, abc1@yahoo.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

In [13]:
pattern = r'\d{10}'  # raw string: without the r prefix, '\d' triggers an invalid-escape warning (not an error)

matches = re.findall(pattern, chat1)

print('matches in chat1', matches)

matches = re.findall(pattern, chat2)
print('matches in chat2', matches)

# let's catch phone numbers written in either format: (123)-567-8912 or 1235678912
pattern = r"\(\d{3}\)-\d{3}-\d{4}|\d{10}"
matches = re.findall(pattern, chat2)
print('matches in chat2', matches)

matches in chat1 ['1235678912']
matches in chat2 []
matches in chat2 ['(123)-567-8912']
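Once both formats are matched, it can help to normalize them to plain digits so that two spellings of the same number compare equal. This is a minimal sketch, not part of the original lecture; `normalize_phone` and `PHONE_PATTERN` are hypothetical names introduced only for illustration.

```python
import re

# Same alternation as in the cell above: "(123)-567-8912" or ten plain digits
PHONE_PATTERN = r"\(\d{3}\)-\d{3}-\d{4}|\d{10}"


def normalize_phone(raw):
    """Strip every non-digit character (hypothetical helper)."""
    return re.sub(r"\D", "", raw)


chat = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
phones = [normalize_phone(m) for m in re.findall(PHONE_PATTERN, chat)]
print(phones)  # ['1235678912']
```

Because the pattern has no capture groups, `re.findall` returns the full matched text, which is what the helper then cleans up.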


In [14]:
email_pattern = r'[a-zA-Z0-9_]+@[a-z0-9A-Z]+\.[a-zA-Z]+'  # + (one or more) so a bare '@.' is not matched

matches = re.findall(email_pattern, chat1)
print('matches in chat1', matches)

matches in chat1 ['abc@xyz.com', 'abc1@yahoo.com']
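Real addresses often contain dots, hyphens, or plus signs, which a pattern built only from letters, digits, and underscores will cut short. The sketch below shows one possible looser pattern; `email_pattern_loose` and the sample text are assumptions for illustration, and this is still far from a full email validator.

```python
import re

# Assumption: allow . _ % + - in the local part, and dots/hyphens in the
# domain so that subdomains such as mail.example.com are kept whole.
email_pattern_loose = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

sample = 'reach me at first.last@mail.example.com or abc1@yahoo.com'
found = re.findall(email_pattern_loose, sample)
print(found)  # ['first.last@mail.example.com', 'abc1@yahoo.com']
```

On the same sample, a pattern without the dot in its character classes would return only `last@mail.example` for the first address.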


### Example 2

In [16]:
chat1 = 'codebasics: Hello, I am having an issue with my order # 412889912'
chat2 = 'codebasics: I have a problem with my order number 412889912'
chat3 = 'codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'

In [20]:
pattern_order_id = r'order[^\d]*(\d*)'

def find_order_id(chat):
    return re.findall(pattern_order_id, chat)


print('order id in chat1', find_order_id(chat1))

print('order id in chat2', find_order_id(chat2))

order id in chat1 ['412889912']
order id in chat2 ['412889912']
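Because `(\d*)` can match zero digits, a message that mentions "order" without any number produces an empty string instead of no match. A possible variant, sketched here with a hypothetical helper `find_first_order_id`, uses `\d+` with `re.search` and returns `None` when no ID is present.

```python
import re


def find_first_order_id(chat):
    """Return the first order ID as a string, or None (hypothetical helper)."""
    match = re.search(r'order[^\d]*(\d+)', chat)
    return match.group(1) if match else None


with_id = find_first_order_id('codebasics: My order 412889912 is having an issue')
without_id = find_first_order_id('codebasics: I want to cancel my order')
empty_hit = re.findall(r'order[^\d]*(\d*)', 'codebasics: I want to cancel my order')

print(with_id)     # 412889912
print(without_id)  # None
print(empty_hit)   # ['']
```

Note that either version takes the first run of digits after "order", so an amount such as `300$` would be picked up if no order number came before it.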


### Information Extraction

In [23]:

text = '''
Born	5 November 1988 (age 37)
Delhi, India
Nickname	Cheeku [a]
King Kohli [2]
Chase Master [3]
Run Machine
Height	5 ft 9 in (175 cm)[4]
Batting	Right-handed
Bowling	Right-arm medium
Role	Top-order batter
Relations	Anushka Sharma ​(m. 2017)​
Website	VK Foundation
'''

def find_age(text):
    pattern_age = r'age (\d+)'  # + means one or more digits; * would mean zero or more
    age = re.findall(pattern_age, text)
    print("Age : ", age)


def find_birth_date(text):
    pattern_birthdate = r'Born\s(\d{1,2}\s[a-zA-Z]*\s\d{2,4})'
    birthdate = re.findall(pattern_birthdate, text)
    print('Birthdate', birthdate)

find_age(text)

find_birth_date(text)

Age :  ['37']
Birthdate ['5 November 1988']
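The extracted values are still strings inside lists. As a rough sketch of a possible next step, the fields can be converted to typed values and collected in a dictionary. The function `extract_person_info` and the shortened `sample` text are assumptions for illustration, not part of the lecture.

```python
import re
from datetime import datetime


def extract_person_info(text):
    """Return age as int and birth date as a date object (hypothetical helper)."""
    age_match = re.search(r'age (\d+)', text)
    born_match = re.search(r'Born\s+(\d{1,2}\s[A-Za-z]+\s\d{4})', text)

    birth_date = None
    if born_match:
        # %B parses a full month name; this assumes an English locale
        birth_date = datetime.strptime(born_match.group(1), '%d %B %Y').date()

    return {
        'age': int(age_match.group(1)) if age_match else None,
        'birth_date': birth_date,
    }


sample = 'Born\t5 November 1988 (age 37)\nDelhi, India'
info = extract_person_info(sample)
print(info)
```

Using `re.search` with a `None` check means a missing field shows up as `None` rather than as an empty list.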
