# Extract Information Using Regular Expressions (RegEx)

The first thing I want to cover is the notion of a raw string.

The **r** prefix creates a raw string. A Python raw string treats the backslash (\\) as a literal character instead of the start of an escape sequence.



Let us see some examples!

In [52]:
# normal string vs raw string
path = "C:\desktop\nathan"  #string
print("string:",path)

string: C:\desktop
athan


In [53]:
path= r"C:\desktop\nathan"  #raw string
print("raw string:",path)

raw string: C:\desktop\nathan


For this reason, it is recommended to use raw strings when writing regular expression patterns, since many regex sequences (such as \d or \b) begin with a backslash.
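To see why this matters for patterns specifically, here is a minimal sketch (the sample text and variable names are illustrative, not from the original notebook). In a normal string, "\b" is the backspace character, so the regex engine never receives the word-boundary sequence.

```python
import re

text = "the cat sat on the catalog"

# In a normal string, "\b" is a backspace character,
# so this pattern looks for backspace + "cat" + backspace.
normal_matches = re.findall("\bcat\b", text)

# In a raw string, the backslash reaches the regex engine untouched,
# so \b is interpreted as a word boundary.
raw_matches = re.findall(r"\bcat\b", text)

print(normal_matches)  # []
print(raw_matches)     # ['cat']
```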

Python has a built-in module to work with regular expressions called **re**. Some of the commonly used methods from the **re** module are listed below:

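1. re.match(): checks whether the pattern matches at the beginning of the string

2. re.search(): finds the first occurrence of the pattern anywhere in the string

3. re.findall(): returns all occurrences of the pattern in the string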

<br>

Let us look at each method with the help of an example.

**1. re.match()**

The re.match() function returns a match object on success and None on failure. It only looks for the pattern at the beginning of the string.

In [56]:
import re

#match a word at the beginning of a string

result = re.match('Kaggle',r'Kaggle is the largest data science community of world') 
print(result)

result_2 = re.match('largest',r'Kaggle is the largest data science community of world') 
print(result_2)

<re.Match object; span=(0, 6), match='Kaggle'>
None


Since the output of re.match() is a match object, we use its *group()* method to get the matched text.

In [57]:
print(result.group())  #returns the matched text

Kaggle
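Because re.match() returns None on failure, calling group() on a failed match raises an AttributeError. The following sketch shows one way to guard against that; the helper name `first_word_if` is hypothetical and only used for illustration.

```python
import re

sentence = "Kaggle is the largest data science community of world"

def first_word_if(pattern, text):
    """Return the matched text if `pattern` matches at the start of `text`, else None."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so check before calling group()
    return m.group() if m is not None else None

m = re.match(r"Kaggle", sentence)
print(m.group())  # Kaggle
print(m.span())   # (0, 6)
print(first_word_if(r"largest", sentence))  # None
```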


<br>

**2. re.search()**

Finds the first occurrence of a pattern anywhere in the string.

In [58]:
# search for the pattern "founded" in a given string
result = re.search('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')
print(result.group())

founded


<br>

**3. re.findall()**

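It returns all non-overlapping occurrences of the pattern in the string as a list. I recommend *re.findall()* for most extraction tasks, since it can cover the use cases of both *re.search()* and *re.match()* (the latter by anchoring the pattern with **\A** or **^**). Note that it returns a list of strings rather than a match object.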

In [59]:
result = re.findall('founded',r'Andrew NG founded Coursera. He also founded deeplearning.ai')  
print(result)

['founded', 'founded']
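The three functions can be compared side by side on the same sentence. This is a small sketch; it also uses re.finditer(), which is not covered above but is part of the same module and yields match objects when positions are needed.

```python
import re

text = "Andrew NG founded Coursera. He also founded deeplearning.ai"

at_start = re.match(r"founded", text)       # None: the text does not start with "founded"
first = re.search(r"founded", text)         # match object for the first occurrence
all_strings = re.findall(r"founded", text)  # plain list of matched strings

# re.finditer yields match objects, which keep position information
positions = [m.span() for m in re.finditer(r"founded", text)]

print(at_start)      # None
print(first.span())  # (10, 17)
print(all_strings)   # ['founded', 'founded']
print(positions)     # [(10, 17), (36, 43)]
```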


### Special sequences

1. **\A**	returns a match if the specified pattern is at the beginning of the string.

In [61]:
str = r'Kaggle is the largest data science community of world'

x = re.findall(r"\AKaggle", str)

print(x)

['Kaggle']


This is useful when you have multiple strings of text and want to extract the first word only when it is a specific word, such as 'Kaggle'.

If the pattern is not at the beginning of the string, an empty list is returned, as shown below.

In [64]:
str = r'Kaggle is the largest Analytics community of world'

x = re.findall(r"\Alarg", str)

print(x)

[]


2. **\b** returns a match where the specified pattern is at the beginning or at the end of a word.

In [65]:
#Check if there is any word that ends with "est"
x = re.findall(r"est\b", str)
print(x)

['est']


It returns the last three characters of the word "largest".
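To get the whole word rather than just its ending, **\b** can be combined with **\w**. A minimal sketch, using the same sentence under an illustrative variable name:

```python
import re

text = "Kaggle is the largest Analytics community of world"

# \w* consumes the rest of the word before "est"; \b requires the word to end there
ends_with_est = re.findall(r"\w*est\b", text)

# \b at the front requires "co" to be at the start of a word
starts_with_co = re.findall(r"\bco\w*", text)

print(ends_with_est)   # ['largest']
print(starts_with_co)  # ['community']
```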

3. **\B**	returns a match where the specified pattern is present, but NOT at the beginning (or at the end) of a word.

In [67]:
str = r'Kaggle is the largest data science community of world'

x = re.findall(r"\Ben", str)

print(x)

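['en']

The match is the "en" inside "science": it is preceded by another word character, so it is not at the beginning of a word.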


4. **\d** returns a match where the string contains digits (numbers from 0-9)

In [68]:
str = "2 million monthly visits in Jan'19."

#Check if the string contains any digits (numbers from 0-9):
x = re.findall(r"\d", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['2', '1', '9']
Yes, there is at least one match!


In [70]:
str = "2 million monthly visits in Jan'191."

# Check if the string contains any digits (numbers from 0-9):
# adding '+' after '\d' keeps extracting digits until a non-digit character is encountered
x = re.findall(r"\d+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['2', '191']
Yes, there is at least one match!


We can infer that **\d+** matches one or more consecutive occurrences of **\d** until a non-matching character is found, whereas **\d** matches one digit at a time.
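In practice, **\d+** is what you usually want when the goal is to recover numbers from text. A short sketch of the difference, converting the matches to integers (the variable names are illustrative):

```python
import re

text = "2 million monthly visits in Jan'191."

single_digits = re.findall(r"\d", text)                # one match per digit
numbers = [int(n) for n in re.findall(r"\d+", text)]   # one match per run of digits

print(single_digits)  # ['2', '1', '9', '1']
print(numbers)        # [2, 191]
```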

5. **\D** matches any character that is not a digit.

In [71]:
str = "2 million monthly visits in Jan'19."

#Find every character that is not a digit (0-9):
x = re.findall(r"\D", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', 'm', 'i', 'l', 'l', 'i', 'o', 'n', ' ', 'm', 'o', 'n', 't', 'h', 'l', 'y', ' ', 'v', 'i', 's', 'i', 't', 's', ' ', 'i', 'n', ' ', 'J', 'a', 'n', "'", '.']
Yes, there is at least one match!


In [72]:
str = "2 million monthly visits'19"

#Find runs of consecutive non-digit characters:

x = re.findall(r"\D+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[" million monthly visits'"]
Yes, there is at least one match!


6. **\w** matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _)


In [73]:
str = "2 million monthly visits!"

#returns a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _)

x = re.findall(r"\w",str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['2', 'm', 'i', 'l', 'l', 'i', 'o', 'n', 'm', 'o', 'n', 't', 'h', 'l', 'y', 'v', 'i', 's', 'i', 't', 's']
Yes, there is at least one match!


In [None]:
str = "2 million monthly visits!"

#returns a match at every word (a run of letters a-z and A-Z, digits 0-9, and the underscore _)

x = re.findall(r"\w+",str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

7. **\W** matches every non-word character (anything other than letters, digits, and the underscore).

In [74]:
str = "2 million monthly visits9!"

#returns a match at every NON word character (anything other than letters, digits, and underscore, such as "!", "?", white-space etc.):

x = re.findall(r"\W", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ', '!']
Yes, there is at least one match!


## Metacharacters

Metacharacters are characters with a special meaning in a pattern.

1. **(.)** matches any character (except newline character)

In [75]:
str = "sundar and sudheer recently published a research paper!" 

#Search for "su" followed by one (any) character, then "su" followed by three (any) characters

x = re.findall("su.", str)
x2 = re.findall("su...", str)

print(x)
print(x2)

['sun', 'sud']
['sunda', 'sudhe']


2. **(^)** starts with

In [76]:
str = "Data Science"

#Check if the string starts with 'Data':
x = re.findall("^Data", str)

if (x):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x)  

Yes, the string starts with 'Data'


In [25]:
# try with a different string
str2 = "Big Data"

#Check if the string starts with 'Data':
x2 = re.findall("^Data", str2)

if (x2):
  print("Yes, the string starts with 'Data'")
else:
  print("No match")
  
#print(x2)  

No match
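Both **^** and **\A** match at the beginning of the string, but they differ once the re.MULTILINE flag is used: **^** then also matches at the start of every line, while **\A** still matches only at the start of the string. A small sketch with a hypothetical multi-line string:

```python
import re

# Hypothetical multi-line text, not part of the original notebook
text = "Data Science\nBig Data\nData Engineering"

default_caret = re.findall(r"^Data", text)                  # start of string only
multiline_caret = re.findall(r"^Data", text, re.MULTILINE)  # start of every line
multiline_A = re.findall(r"\AData", text, re.MULTILINE)     # still start of string only

print(default_caret)    # ['Data']
print(multiline_caret)  # ['Data', 'Data']
print(multiline_A)      # ['Data']
```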


3. **($)** ends with

In [77]:
str = "Data Science"

#Check if the string ends with 'Science':

x = re.findall("Science$", str)

if (x):
  print("Yes, the string ends with 'Science'")

else:
  print("No match")
  
#print(x)

Yes, the string ends with 'Science'


4. **(\*)** matches zero or more occurrences of the pattern to its left

In [78]:
str = "easy easssy eay ey"

#Check if the string contains "ea" followed by 0 or more "s" characters and ending with y
x = re.findall("eas*y", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['easy', 'easssy', 'eay']
Yes, there is at least one match!


5. **(+)** matches one or more occurrences of the pattern to its left

In [79]:
#Check if the string contains "ea" followed by 1 or more "s" characters and ends with y 
x = re.findall("eas+y", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['easy', 'easssy']
Yes, there is at least one match!


6. **(?)** matches zero or one occurrence of the pattern to its left.

In [80]:
x = re.findall("eas?y",str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['easy', 'eay']
Yes, there is at least one match!
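The three quantifiers are easiest to compare when run against the same string. The sketch below also includes the explicit repetition count **{m,n}**, which appears later in the date examples; the dictionary and variable names are illustrative.

```python
import re

text = "easy easssy eay ey"

# *      -> zero or more "s"
# +      -> one or more "s"
# ?      -> zero or one "s"
# {2,3}  -> between two and three "s"
patterns = [r"eas*y", r"eas+y", r"eas?y", r"eas{2,3}y"]
results = {p: re.findall(p, text) for p in patterns}

for pattern, matches in results.items():
    print(pattern, "->", matches)
```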


7. **(|)** either or

In [81]:
str = "Kaggle is the largest data science community of world"

#Check if the string contains either "data" or "world":

x = re.findall("data|world", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['data', 'world']
Yes, there is at least one match!


In [31]:
# try with a different string
str = "Kaggle is one of the largest data science communities"

#Check if the string contains either "data" or "India":

x = re.findall("data|India", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['data']
Yes, there is at least one match!


## Sets

1. A set is a group of characters inside a pair of square brackets [ ]; it matches any one of those characters.

In [82]:
str = "Kaggle is the largest data science community of India"

#Check for the characters y, d, or h, in the above string
x = re.findall("[ydh]", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['h', 'd', 'y', 'd']
Yes, there is at least one match!


In [84]:
str = "Kaggle is the largest data science community of India"

#Check for the lowercase characters between a and z, in the above string
x = re.findall("[a-z]", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['a', 'g', 'g', 'l', 'e', 'i', 's', 't', 'h', 'e', 'l', 'a', 'r', 'g', 'e', 's', 't', 'd', 'a', 't', 'a', 's', 'c', 'i', 'e', 'n', 'c', 'e', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'n', 'd', 'i', 'a']
Yes, there is at least one match!


<br>

Let's solve a problem.

In [87]:
str = "Mars' average distance from the Sun is roughly 230 million km and its orbital period is 687 (Earth) days."

# extract the numbers that start with a digit from 0 to 3 in the above string
x = re.findall(r"\b[0-3]\d+", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['230']
Yes, there is at least one match!


2. **[^]** matches any character except the ones listed after ^

In [88]:
str = "Kaggle is the largest data science community of world"

#Find every character other than y, d, or h

x = re.findall("[^ydh]", str)

print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

['K', 'a', 'g', 'g', 'l', 'e', ' ', 'i', 's', ' ', 't', 'e', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'a', 't', 'a', ' ', 's', 'c', 'i', 'e', 'n', 'c', 'e', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', ' ', 'o', 'f', ' ', 'w', 'o', 'r', 'l']
Yes, there is at least one match!


3. **[a-zA-Z0-9]** matches any single alphanumeric character; negated as **[^a-zA-Z0-9 ]**, it matches any character that is neither alphanumeric nor a space

In [89]:
str = "@Kaggle Largest Data Science community #Kaggle!!"

# extract words that start with a special character
x = re.findall(r"[^a-zA-Z0-9 ]\w+", str)

print(x)

['@Kaggle', '#Kaggle']
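If the mentions and hashtags need to be separated, parentheses can be used to capture the symbol and the word as two groups. When a pattern contains groups, re.findall() returns a list of tuples, one element per group. A sketch under that assumption, with illustrative variable names:

```python
import re

text = "@Kaggle Largest Data Science community #Kaggle!!"

# Group 1 captures the symbol (@ or #), group 2 captures the word after it
tagged = re.findall(r"([@#])(\w+)", text)

mentions = [word for symbol, word in tagged if symbol == "@"]
hashtags = [word for symbol, word in tagged if symbol == "#"]

print(tagged)    # [('@', 'Kaggle'), ('#', 'Kaggle')]
print(mentions)  # ['Kaggle']
print(hashtags)  # ['Kaggle']
```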


---
## Solve Complex Queries

Let us try solving some complex queries using regex.

### Extracting Email IDs



In [90]:
str = 'Send a mail to rohan.1997@gmail.com, smith_david34@yahoo.com and priya@yahoo.com about the meeting @2PM'
  
# \w matches any alphanumeric character or underscore
# + repeats the preceding character or set one or more times
#x = re.findall(r'\w+@\w+\.com', str)
x = re.findall(r'[a-zA-Z0-9._-]+@\w+\.com', str)
  
# Printing of List 
print(x) 

['rohan.1997@gmail.com', 'smith_david34@yahoo.com', 'priya@yahoo.com']
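The pattern above only finds addresses ending in ".com". A slightly more general sketch is shown below; it accepts any dotted domain, but it is a simplification and not a full validator for every legal email address. The sample addresses and the name `email_pattern` are made up for illustration.

```python
import re

# Hypothetical sample text with domains other than .com
text = ("Contact rohan.1997@gmail.com, priya@example.org or "
        "admin@mail.example.co.in before @2PM")

# local part, then "@", then a domain label followed by one or more ".label" parts
# (?: ... ) is a non-capturing group, so findall still returns the whole match
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+"

emails = re.findall(email_pattern, text)
print(emails)
# ['rohan.1997@gmail.com', 'priya@example.org', 'admin@mail.example.co.in']
```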


### Extracting Dates

In [91]:
text = "London Olympic 2012 was held from 2012-07-27 to 2012/08/12."

# '\d{4}' repeats '\d' exactly 4 times; '.' matches any single separator character
match = re.findall(r'\d{4}.\d{2}.\d{2}', text)
print(match)

['2012-07-27', '2012/08/12']


In [92]:
text="London Olympic 2012 was held from 27 Jul 2012 to 12-Aug-2012."

match = re.findall(r'\d{2}.\w{3}.\d{4}', text)

print(match)

['27 Jul 2012', '12-Aug-2012']


In [93]:
# extract dates with varying lengths
text="London Olympic 2012 was held from 27 July 2012 to 12 August 2012."

#'\w{3,10}' repeats '\w' 3 to 10 times
match = re.findall(r'\d{2}.\w{3,10}.\d{4}', text)

print(match)

['27 July 2012', '12 August 2012']
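Since **.** matches any character, the date patterns above also accept unintended separators. A stricter sketch replaces it with a set of allowed separators and uses groups to pull out year, month, and day. The extra "2012x07x27" fragment is a made-up example added to show the difference.

```python
import re

# The trailing reference code is hypothetical, added to show a false positive
text = "London Olympic 2012 was held from 2012-07-27 to 2012/08/12. Ref code 2012x07x27."

loose = re.findall(r"\d{4}.\d{2}.\d{2}", text)           # '.' accepts any separator
strict = re.findall(r"\d{4}[-/]\d{2}[-/]\d{2}", text)    # only '-' or '/'
parts = re.findall(r"(\d{4})[-/](\d{2})[-/](\d{2})", text)  # (year, month, day) tuples

print(loose)   # ['2012-07-27', '2012/08/12', '2012x07x27']
print(strict)  # ['2012-07-27', '2012/08/12']
print(parts)   # [('2012', '07', '27'), ('2012', '08', '12')]
```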


## Extracting Title from Names - Titanic Dataset

In [94]:
import pandas as pd

# load dataset
data=pd.read_csv("titanic.csv")

In [95]:
data.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [97]:
# print a few passenger names
data['Name'].head(10)

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object

### Method 1: Split the name string and extract the title

In [98]:
name = "Allen, Mr. William Henry"
name2 = name.split(".")

In [99]:
name2[0].split(',')

['Allen', ' Mr']

In [100]:
title=data['Name'].apply(lambda x: x.split(".")[0].split(",")[1])
title.value_counts()

 Mr              517
 Miss            182
 Mrs             125
 Master           40
 Dr                7
 Rev               6
 Col               2
 Major             2
 Mlle              2
 Don               1
 Jonkheer          1
 Capt              1
 Sir               1
 Mme               1
 Ms                1
 the Countess      1
 Lady              1
Name: Name, dtype: int64

This method does not always work: it fails if a name has no comma or no period, and it leaves a leading space in each title (note ' Mr' above). A more robust way is to define a pattern and search for it using regex.
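The fragility of the split approach can be reproduced with a single badly formatted name. In the sketch below, the second name is hypothetical (it is not taken from the dataset) and the helper `title_by_split` simply wraps the same split logic used above.

```python
def title_by_split(name):
    # Same logic as Method 1: text between the first "," and the first "."
    return name.split(".")[0].split(",")[1]

# The second name is a made-up example without a comma or a period
names = ["Allen, Mr. William Henry", "William Henry Allen"]

results = []
for n in names:
    try:
        results.append(title_by_split(n))
    except IndexError:
        # split(",") produced a single element, so index 1 does not exist
        results.append(None)

print(results)  # [' Mr', None]
```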

### Method 2: Use RegEx to extract titles

In [101]:
def split_it(name):
    return re.findall(r"\w+\.",name)[0]

In [102]:
title=data['Name'].apply(split_it)
title.value_counts().sum()

891

The count of 891 shows that a title was found for every passenger in the dataset. Each extracted title ends with '.', since the pattern we search for includes the '.'.
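If the trailing '.' is not wanted, a capture group can return only the title itself. The sketch below is one possible variant, not the notebook's method: the names `TITLE_PATTERN` and `extract_title` are hypothetical, and the function returns None instead of raising an error when no title is found.

```python
import re

# After the comma and optional spaces, capture everything up to the next period
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")

def extract_title(name):
    """Return the title without the trailing period, or None if no title is found."""
    m = TITLE_PATTERN.search(name)
    return m.group(1) if m else None

print(extract_title("Braund, Mr. Owen Harris"))         # Mr
print(extract_title("Palsson, Master. Gosta Leonard"))  # Master
print(extract_title("No title here"))                   # None
```

Assuming the same `data` dataframe as above, it could be applied with `data['Name'].apply(extract_title)`.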