# Python Regular Expressions

A regular expression (regex) is a sequence of characters that defines a search pattern. It is mainly used to find or replace substrings in a given string that match the pattern.

# Common Python Regex Functions

1. re.findall()
2. re.search()
3. re.split()
4. re.sub()
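As a quick orientation, the sketch below runs all four functions on a small made-up string. The `text` value and the addresses in it are illustrative only and are not taken from the email data used later.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "From: alice@example.com\nTo: bob@example.com"

# re.findall returns a list with every non-overlapping match as a string.
addresses = re.findall(r"\S+@\S+", text)

# re.search returns a Match object for the first match, or None if there is none.
first = re.search(r"\S+@\S+", text)

# re.split cuts the string wherever the pattern matches.
lines = re.split(r"\n", text)

# re.sub replaces every match with the replacement string.
masked = re.sub(r"\S+@\S+", "<hidden>", text)

print(addresses)
print(first.group())
print(lines)
print(masked)
```

Note that only `re.search` returns a Match object; the other three return plain strings or lists of strings.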

# Common Python Regex Patterns

<b>\w</b> matches word characters: for ASCII text, the letters a-z and A-Z, the digits 0-9, and the underscore, _. It does not match the dash, -.<br>
<b>\d</b> matches digits, which means 0-9.<br>
<b>\s</b> matches whitespace characters, which include the tab, new line, carriage return, and space characters.<br>
<b>\S</b> matches non-whitespace characters.<br>
<b>.</b> matches any character except the new line character, \n.<br>
<b>-</b> specifies a range inside a character class, for example [a-z].<br>
<b>\ </b> The backslash is a special character used for escaping other special characters.
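To see these patterns in action, here is a minimal sketch on a made-up string. The `sample` value is an assumption chosen so that each pattern has something to match.

```python
import re

# Hypothetical sample string containing letters, digits, and symbols.
sample = "id_7 cost $3.50 on 2002-10-31"

words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
digits = re.findall(r"\d+", sample)       # runs of digits
chunks = re.findall(r"\S+", sample)       # runs of non-whitespace characters
spaces = re.findall(r"\s", sample)        # individual whitespace characters
low_range = re.findall(r"[a-c]", sample)  # any single character from a to c

# An unescaped dot matches any character; an escaped dot matches only ".".
any_char = re.findall(r"3.5", "3.5 3x5")
literal_dot = re.findall(r"3\.5", "3.5 3x5")

print(words)
print(digits)
print(chunks)
print(len(spaces))
print(low_range)
print(any_char, literal_dot)
```

The dash and the dollar sign are skipped by `\w+`, which is why `2002-10-31` comes back as three separate pieces.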

In [1]:
import email
import re
import pandas as pd

In [36]:
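# List that will hold one dictionary per email.
emails = []

# Read the whole file into one string; the with block closes the file afterwards.
with open("sample_emails.txt", "r") as f:
    data = f.read()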

In [38]:
path = "D:\documents\notebook"
print(path)

D:\documents
otebook


In [41]:
path = r"D:\documents\notebook"
print("raw string:",path)

raw string: D:\documents\notebook
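The same escaping issue affects regex patterns, which is why they are usually written as raw strings. The sketch below is a small illustration with a made-up sentence: in a normal string `\b` is a backspace character, while in a raw string it reaches the regex engine as the word-boundary token.

```python
import re

# In a normal string "\n" is one character; in a raw string it is two.
print(len("\n"))   # 1
print(len(r"\n"))  # 2

# Hypothetical sentence used only for this demonstration.
sentence = "a cat sat"

# Without the r prefix, "\b" is a backspace character, so nothing matches.
plain = re.search("\bcat\b", sentence)

# With the r prefix, \b is the regex word boundary and the word is found.
raw = re.search(r"\bcat\b", sentence)

print(plain)
print(raw.group())
```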


In [37]:
data

'From r  Thu Oct 31 08:11:39 2002\nReturn-Path: <bensul2004nng@spinfinder.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <bensul2004nng@spinfinder.com>\nMessage-Id: <200210311310.g9VDANt24674@bloodwork.mr.itd.UM>\nFrom: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\nDate: Thu, 31 Oct 2002 05:10:00\nTo: R@M\nSubject: URGENT ASSISTANCE /RELATIONSHIP (P)\nMIME-Version: 1.0\nContent-Type: text/plain;charset="iso-8859-1"\nContent-Transfer-Encoding: 7bit\nStatus: O\n\nDear Friend,\n\nI am Mr. Ben Suleman a custom officer and work as Assistant controller of the Customs and Excise department Of the Federal Ministry of Internal Affairs stationed at the Murtala Mohammed International Airport, Ikeja, Lagos-Nigeria.\n\nAfter the sudden death of the former Head of state of Nigeria General Sanni Abacha on June 8th 1998 his aides and immediate members of his family were arrested while trying to escape from Nigeria in a Chartered jet to Saudi Arabia with 6 trunk boxes Marked "Diplomatic Baggage". Actin

In [55]:
contents = re.split(r"From r", data)
contents.pop(0)

''
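The empty string returned by `pop(0)` appears because the data begins with the separator itself, so `re.split` produces an empty piece before the first match. A minimal sketch with a made-up two-message `mailbox` string (an assumption, not the real file) shows the effect:

```python
import re

# Hypothetical miniature mailbox: each message starts with "From r".
mailbox = "From r  Thu Oct 31\nSubject: one\nFrom r  Fri Nov 1\nSubject: two\n"

parts = re.split(r"From r", mailbox)

# The text before the first separator is empty, so parts[0] is "".
print(repr(parts[0]))

# Dropping empty pieces leaves one string per message.
messages = [part for part in parts if part]
print(len(messages))
```

The separator text is removed from each piece, which is why the messages no longer start with `From r`.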

In [49]:
# re.findall
x = re.findall("From:.*", data)

In [50]:
x

['From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

In [51]:
#re.search
y = re.search("From:.*", data)

In [52]:
y

<re.Match object; span=(204, 258), match='From: "Mr. Ben Suleman" <bensul2004nng@spinfinder>

In [54]:
#re.sub
b = re.sub("From", "Email", data)
b

'Email r  Thu Oct 31 08:11:39 2002\nReturn-Path: <bensul2004nng@spinfinder.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <bensul2004nng@spinfinder.com>\nMessage-Id: <200210311310.g9VDANt24674@bloodwork.mr.itd.UM>\nEmail: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>\nDate: Thu, 31 Oct 2002 05:10:00\nTo: R@M\nSubject: URGENT ASSISTANCE /RELATIONSHIP (P)\nMIME-Version: 1.0\nContent-Type: text/plain;charset="iso-8859-1"\nContent-Transfer-Encoding: 7bit\nStatus: O\n\nDear Friend,\n\nI am Mr. Ben Suleman a custom officer and work as Assistant controller of the Customs and Excise department Of the Federal Ministry of Internal Affairs stationed at the Murtala Mohammed International Airport, Ikeja, Lagos-Nigeria.\n\nAfter the sudden death of the former Head of state of Nigeria General Sanni Abacha on June 8th 1998 his aides and immediate members of his family were arrested while trying to escape from Nigeria in a Chartered jet to Saudi Arabia with 6 trunk boxes Marked "Diplomatic Baggage". Act
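Note that `re.sub("From", ...)` replaces every occurrence, including any that appear in the body text. If only the header line should change, one option is to anchor the pattern or limit the number of replacements. The sketch below uses a made-up `sample` message to compare the variants.

```python
import re

# Hypothetical message with "From" in both a header and the body.
sample = "From: a@example.com\nSubject: hi\n\nGreetings From Lagos"

# Replaces every occurrence, header and body alike.
everywhere = re.sub("From", "Email", sample)

# ^ with re.MULTILINE matches only at the start of a line.
headers_only = re.sub(r"^From:", "Email:", sample, flags=re.MULTILINE)

# count=1 stops after the first replacement.
first_only = re.sub("From", "Email", sample, count=1)

print(everywhere)
print(headers_only)
print(first_only)
```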

### Get Sender Email

In [5]:
for item in contents:
    emails_dict = {}
    
    #find the whole line beginning with "From:".
    sender = re.search(r"From:.*", item)

In [6]:
# Step 2: find the email address and name.
if sender is not None:
    s_email = re.search(r"\w\S*@.*\w", sender.group())
    s_name = re.search(r":.*<", sender.group())
else:
    s_email = None
    s_name = None

In [7]:
print("sender type: " + str(type(sender)))
print("sender.group() type: " + str(type(sender.group())))
print("sender: " + str(sender))
print("sender.group(): " + str(sender.group()))

sender type: <class 're.Match'>
sender.group() type: <class 'str'>
sender: <re.Match object; span=(180, 229), match='From: "PRINCE OBONG ELEME" <obong_715@epatra.com>>
sender.group(): From: "PRINCE OBONG ELEME" <obong_715@epatra.com>
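A Match object can also carry capture groups, which gives an alternative to running two separate searches for the name and the address. The following is only a sketch: the pattern is an assumption that the header has the form `From: "Name" <address>`, and it is applied to the line printed above.

```python
import re

# Header line copied from the output above.
line = 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>'

# Group 1 captures the display name, group 2 the address between < and >.
pattern = r'From:\s*"?([^"<]*)"?\s*<([^>]+)>'
match = re.search(pattern, line)

if match is not None:
    # groups() returns all captured substrings as a tuple.
    name, address = match.groups()
    name = name.strip()
else:
    name, address = None, None

print(name)
print(address)
```

A header without angle brackets would not match this pattern at all, so the `None` branch is still needed.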


In [31]:
if s_email is not None:
    sender_email = s_email.group()
else:
    sender_email = None
# Add email address to dictionary.
emails_dict["sender_email"] = sender_email

In [24]:
# Step 3B: remove unwanted substrings, assign to variable.
if s_name is not None:
    sender_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", s_name.group()))
else:
    sender_name = None

# Add sender's name to dictionary.
emails_dict["sender_name"] = sender_name

In [10]:
print(sender_email)
print(sender_name)

obong_715@epatra.com
"PRINCE OBONG ELEME"


### Get Recipient's Email

In [11]:
recipient = re.search(r"To:.*", item)

In [12]:
if recipient is not None:
    r_email = re.search(r"\w\S*@.*\w", recipient.group())
    r_name = re.search(r":.*<", recipient.group())
else:
    r_email = None
    r_name = None

In [13]:
if r_email is not None:
    recipient_email = r_email.group()
else:
    recipient_email = None

emails_dict["recipient_email"] = recipient_email

if r_name is not None:
    recipient_name = re.sub(r"\s*<", "", re.sub(r":\s*", "", r_name.group()))
else:
    recipient_name = None

emails_dict["recipient_name"] = recipient_name

In [14]:
emails_dict

{'sender_email': 'obong_715@epatra.com',
 'sender_name': '"PRINCE OBONG ELEME"',
 'recipient_email': None,
 'recipient_name': None}
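The `:.*<` pattern only finds a name when the header contains an angle bracket, so a header such as `To: R@M` from the first email yields `None` for the name. As a possible alternative to hand-written patterns, the standard library's `email.utils.parseaddr` splits a header value into a name and an address. This is a sketch of that swapped-in technique, not the approach used in the cells above.

```python
from email.utils import parseaddr

# Header value taken from the sender line shown earlier.
header_value = '"PRINCE OBONG ELEME" <obong_715@epatra.com>'

# parseaddr returns a (display name, address) tuple.
name, address = parseaddr(header_value)
print(name)
print(address)

# A bare address without a display name still yields the address part.
bare_name, bare_address = parseaddr("R@M")
print(bare_address)
```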

### Get Date of the Email

In [15]:
for item in contents:
    # First two lines again so that Jupyter runs the code.
    emails_dict = {}

    date_field = re.search(r"Date:.*", item)

In [16]:
if date_field is not None:
    date = re.search(r"\d+\s\w+\s\d+", date_field.group())
else:
    date = None

print(date_field.group())

Date: Thu, 31 Oct 2002 22:17:55 +0100


In [28]:
date = re.search(r"\d+\s\w+\s\d+", date_field.group())

# What happens when we use * instead?
date_star_test = re.search(r"\d*\s\w*\s\d*", date_field.group())

# Call .group() only after checking that a match was found.
if date is not None:
    date_sent = date.group()
else:
    date_sent = None

if date_star_test is not None:
    date_star = date_star_test.group()
else:
    date_star = None

emails_dict["date_sent"] = date_sent

print(date_sent)
print(date_star)

31 Oct 2002
 31 
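The difference comes from the quantifiers: `+` requires at least one character, while `*` also accepts zero, so the `*` version is satisfied by the first place where two whitespace characters surround a word. The sketch below repeats the comparison on the header shown above and, as an optional extra that is not part of the original steps, parses the value with `email.utils.parsedate_to_datetime` from the standard library.

```python
import re
from email.utils import parsedate_to_datetime

# Header line copied from the output above.
date_line = "Date: Thu, 31 Oct 2002 22:17:55 +0100"

# + needs one or more characters for each part, so the full date is found.
plus = re.search(r"\d+\s\w+\s\d+", date_line).group()

# * allows empty digit runs, so the match stops early.
star = re.search(r"\d*\s\w*\s\d*", date_line).group()

print(repr(plus))  # '31 Oct 2002'
print(repr(star))  # ' 31 '

# Optional: turn the header value into a timezone-aware datetime object.
parsed = parsedate_to_datetime(date_line[len("Date:"):].strip())
print(parsed.year, parsed.month, parsed.day)
```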


### Get the Email Subject

In [18]:
for item in contents:
    # First two lines again so that Jupyter runs the code.
    emails_dict = {}

    subject_field = re.search(r"Subject:.*", item)

    if subject_field is not None:
        subject = re.sub(r"Subject:\s*", "", subject_field.group())
    else:
        subject = None

    emails_dict["subject"] = subject

### Get Email Body

In [19]:
full_email = email.message_from_string(item)
body = full_email.get_payload()
emails_dict["email_body"] = body
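To make the role of the `email` module clearer, here is a minimal sketch on a made-up message. The `raw_message` string is hypothetical; the calls used are `email.message_from_string`, header lookup by name, `is_multipart`, and `get_payload`.

```python
import email

# Hypothetical minimal message: headers, a blank line, then the body.
raw_message = (
    "From: sender@example.com\n"
    "Subject: Hello\n"
    "\n"
    "First line of the body.\n"
    "Second line.\n"
)

msg = email.message_from_string(raw_message)

# Headers can be looked up by name, much like dictionary keys.
print(msg["Subject"])
print(msg["From"])

# For a simple (non-multipart) message, get_payload() returns the body text.
print(msg.is_multipart())
print(msg.get_payload())
```

For a multipart message, `get_payload()` returns a list of sub-messages instead of a string, so the body would need extra handling in that case.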

In [25]:
emails.append(emails_dict)
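The steps so far can be collected into a single function that is called once per message, so that every message gets its own dictionary. The sketch below is one possible arrangement, not a definitive implementation: `first_match` and `parse_email` are hypothetical helper names, and `sample_item` is assembled from the headers shown in the outputs above plus a made-up body.

```python
import email
import re


def first_match(pattern, text):
    """Return the first match of pattern in text as a string, or None."""
    found = re.search(pattern, text)
    return found.group() if found is not None else None


def parse_email(item):
    """Build one dictionary for one message, using the patterns from the steps above."""
    record = {}

    sender = first_match(r"From:.*", item)
    record["sender_email"] = first_match(r"\w\S*@.*\w", sender) if sender else None
    name = first_match(r":.*<", sender) if sender else None
    record["sender_name"] = (
        re.sub(r"\s*<", "", re.sub(r":\s*", "", name)) if name else None
    )

    date_field = first_match(r"Date:.*", item)
    record["date_sent"] = (
        first_match(r"\d+\s\w+\s\d+", date_field) if date_field else None
    )

    subject = first_match(r"Subject:.*", item)
    record["subject"] = re.sub(r"Subject:\s*", "", subject) if subject else None

    record["email_body"] = email.message_from_string(item).get_payload()
    return record


# Headers taken from the outputs above; the body line is made up.
sample_item = (
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>\n'
    "Date: Thu, 31 Oct 2002 22:17:55 +0100\n"
    "Subject: GOOD DAY TO YOU\n"
    "\n"
    "Body text goes here.\n"
)

# One dictionary per message; with real data, iterate over contents instead.
parsed_emails = [parse_email(item) for item in [sample_item]]
print(parsed_emails[0])
```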

In [21]:
# Print the number of dictionaries, and hence emails, in the list.
print("Number of emails: " + str(len(emails)))


# Print the first item in the emails list to see how it looks.
for key, value in emails[0].items():
    print(str(key) + ": " + str(value))

Number of emails: 2
subject: GOOD DAY TO YOU
email_body: FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF ELEME KINGDOM 
CHIEF DANIEL ELEME, PHD, EZE 1 OF ELEME.E-MAIL 
ADDRESS:obong_715@epatra.com  

ATTENTION:PRESIDENT,CEO Sir/ Madam. 

This letter might surprise you because we have met
neither in person nor by correspondence. But I believe
it is one day that you got to know somebody either in
physical or through correspondence. 

I got your contact through discreet inquiry from the
chambers of commerce and industry of your country on
the net, you and your organization were revealed as
being quite astute in private entrepreneurship, one
has no doubt in your ability to handle a financialbusiness transaction. 

However, I am the first son of His Royal
majesty,Obong.D. Eleme , and the traditional Ruler of
Eleme Province in the oil producing area of River
State of Nigeria. I am making this contact to you in
respect of US$60,000,000.00 (Sixty Million United
State Dollars), which I inherited, f

In [32]:
emails_df = pd.DataFrame(emails)

In [33]:
emails_df

| | subject | email_body | sender_name | date_sent | sender_email |
|---|---|---|---|---|---|
| 0 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... | "PRINCE OBONG ELEME" | 31 Oct 2002 | obong_715@epatra.com |
| 1 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... | "PRINCE OBONG ELEME" | 31 Oct 2002 | obong_715@epatra.com |
