<font size="4" color="green">

# Tutorial: Python Regex (Regular Expressions)

<font size="4" color='gold'>

Importing necessary packages

In [1]:
import re #regular expressions
import email

<font size="4" color = 'blue'>

## Main Task
Our company, Chaptr Global LLC, has discovered that some of its clients have recently been sending phishing emails.
This has raised eyebrows among the board members (you), who have decided to use their best Python skills to identify these fraudulent and phishy (if there's a word like that) emails.

<font size="4">

Here is one sample of such [emails](https://www.dataquest.io/wp-content/uploads/2020/01/test_emails.txt). 

<font size="4">

# Introducing Python’s Regex Module
File handling

How do we open files in Python?

In [2]:

with open("test_emails.txt") as f:
    fh = f.read()
    

print(fh)





From r  Thu Oct 31 08:11:39 2002
Return-Path: <bensul2004nng@spinfinder.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <bensul2004nng@spinfinder.com>
Message-Id: <200210311310.g9VDANt24674@bloodwork.mr.itd.UM>
From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
Date: Thu, 31 Oct 2002 05:10:00
To: R@M
Subject: URGENT ASSISTANCE /RELATIONSHIP (P)
MIME-Version: 1.0
Content-Type: text/plain;charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Status: O

Dear Friend,

I am Mr. Ben Suleman a custom officer and work as Assistant controller of the Customs and Excise department Of the Federal Ministry of Internal Affairs stationed at the Murtala Mohammed International Airport, Ikeja, Lagos-Nigeria.

After the sudden death of the former Head of state of Nigeria General Sanni Abacha on June 8th 1998 his aides and immediate members of his family were arrested while trying to escape from Nigeria in a Chartered jet to Saudi Arabia with 6 trunk boxes Marked "Diplomatic Baggage". Acting on a tip-off as t

<font size="4" color = 'gold'>

Who was the sender of this email?
- Method 1: for loops and string methods combined.

In [3]:
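# Iterating over a string yields single characters, not lines,
# so no "line" ever starts with 'From: ' and nothing is printed.
for line in fh:
    if line.startswith('From: '):
        print(line)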

In [4]:
for line in fh.split('\n'):
    if "From: " in line:
        print(line)

From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>


<font size="4" color = 'brown'>

- Method 2: Using the re package.

In [5]:
for line in re.findall("From:.*", fh):
    print(line)

From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>
From: "PRINCE OBONG ELEME" <obong_715@epatra.com>


<font size="4" color = 'dark-green'>

## Advantages of re
- Very simple: a single call replaces the loop and the string test.
- Reduces the effort of finding patterns in text.

<font size="4">

## The syntax
**The function takes two arguments, in the form re.findall(pattern, string).**
- ```pattern``` represents the substring (or regular expression) we want to find.
- ```string``` represents the main string we want to search in.
- ```.*``` matches zero or more of any character except a newline, so ```From:.*``` matches ```From:``` and everything after it up to the end of that line.
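As a minimal sketch of the call described above, the snippet below runs `re.findall` on a short made-up string instead of the real file. The `sample` text and its addresses are assumptions for illustration only.

```python
import re

# Hypothetical sample text; the tutorial itself reads test_emails.txt.
sample = 'From: "A" <a@example.com>\nTo: b@example.com\nFrom: "C" <c@example.org>\n'

# "From:" is matched literally; ".*" then consumes the rest of that line,
# because "." does not match the newline character by default.
from_lines = re.findall(r"From:.*", sample)
print(from_lines)  # ['From: "A" <a@example.com>', 'From: "C" <c@example.org>']

# When nothing matches, findall returns an empty list rather than raising.
print(re.findall(r"Bcc:.*", sample))  # []
```

The `r"..."` prefix marks a raw string, which keeps Python from interpreting backslashes before the regex engine sees them.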


## Apply the same idea to find when the emails were sent.

In [6]:
dates = re.findall("Date:.*", fh)
dates

['Date: Thu, 31 Oct 2002 05:10:00', 'Date: Thu, 31 Oct 2002 22:17:55 +0100']
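If only the timestamp is wanted, without the `Date: ` label, one option is to wrap the interesting part of the pattern in parentheses (the `()` entry in the pattern list). The sketch below uses a made-up `sample` string that mirrors the two date lines shown above; the variable names are assumptions.

```python
import re

# Hypothetical text shaped like the headers in test_emails.txt.
sample = (
    "Date: Thu, 31 Oct 2002 05:10:00\n"
    "Subject: hello\n"
    "Date: Thu, 31 Oct 2002 22:17:55 +0100\n"
)

# With one capturing group, findall returns only the captured part of each match.
timestamps = re.findall(r"Date: (.*)", sample)
print(timestamps)
# ['Thu, 31 Oct 2002 05:10:00', 'Thu, 31 Oct 2002 22:17:55 +0100']
```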

# Common Python Regex Patterns
<font size="4">

- `\w` matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, but not the dash, -.
- `\d` matches digits, which means 0-9.
- `\s` matches whitespace characters, which include the tab, new line, carriage return, and space characters.
- `\S` matches non-whitespace characters.
- `.` matches any character except the new line character `\n`.
- `*` matches zero or more occurrences of the preceding element.
- `+` matches one or more occurrences of the preceding element.
- `{}` matches an exactly specified number of occurrences, e.g. `\d{4}` for four digits.
- `|` matches either the pattern on its left or the one on its right.
- `()` captures and returns a particular group.
- `[]` matches any single character from a set of characters, e.g. `[aeiou]`.
- `\` signals a special sequence; it is also used to escape special characters.
- `^` matches the start of the string.
- `$` matches the end of the string.
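
The patterns in this list can be tried out one at a time. The following is a small sketch on made-up strings (the `text` value is an assumption, not taken from the email file), with the result of each call noted in a comment.

```python
import re

text = "Order 66 shipped on 2002-10-31 to lagos or abuja"

digits = re.findall(r"\d+", text)          # ['66', '2002', '10', '31']
year = re.findall(r"\d{4}", text)          # ['2002'], exactly four digits in a row
cities = re.findall(r"lagos|abuja", text)  # ['lagos', 'abuja'], either alternative
starts = re.findall(r"^Order", text)       # ['Order'], only at the start of the string
ends = re.findall(r"abuja$", text)         # ['abuja'], only at the end of the string
vowels = re.findall(r"[aeiou]", "abuja")   # ['a', 'u', 'a'], one character from the set
words = re.findall(r"\w+", "obong_715-x")  # ['obong_715', 'x'], the dash is not a word character
dots = re.findall(r"\.", "a.b.c")          # ['.', '.'], an escaped dot is a literal "."

for result in (digits, year, cities, starts, ends, vowels, words, dots):
    print(result)
```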
  

</font>

<font size='4'>

Let's clean the names of the senders.


In [7]:
# "$" anchors the match to the end of the string, so only the final "in" is found.
new_text  = "So I will like you to be the ultimate beneficiary, so that the funds can be moved in"
re.findall("in$", new_text)

['in']

In [8]:
match  = re.findall("From:.*", fh)

for line in match:
    print(re.findall('\".*\"', line))

['"Mr. Ben Suleman"']
['"PRINCE OBONG ELEME"']
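The names above still carry their surrounding quotes. One possible way to drop them, sketched here on the two `From:` lines shown in the output, is to put a capturing group between the quotes. The lazy `.*?` is a choice made for this sketch so that matching stops at the first closing quote.

```python
import re

from_lines = [
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
]

# The quotes are matched but sit outside the group, so they are not returned.
names = [re.findall(r'"(.*?)"', line) for line in from_lines]
print(names)  # [['Mr. Ben Suleman'], ['PRINCE OBONG ELEME']]
```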


<font size='4'>

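The backslash is a special character used for escaping other special characters. In `'\".*\"'` it escapes the double quotes; inside a single-quoted Python string this is optional, so `'".*"'` matches the same text.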

<font size="4" color = 'blue'>

What if we want to find the email address instead?

In [14]:
match  = re.findall("From:.*", fh)

for line in match:
    print(re.findall(r'\w+\S*@.*\w', line))

['bensul2004nng@spinfinder.com']
['obong_715@epatra.com']


<font size="4" color = 'blue'>

What if we want to find the front part of the email address instead?

In [15]:
match  = re.findall("From:.*", fh)

for line in match:
    print(re.findall(r'(\w+\S*)@', line))

['bensul2004nng']
['obong_715']


<font size="4" color = 'blue'>

What if we want to find the domain name of the email address instead?

In [9]:
myword = "checkit123 chek2 che4"
re.findall(r"\w+", myword)

['checkit123', 'chek2', 'che4']
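The cell above shows how `\w+` breaks a string into word-like chunks. To pull out the domain itself, one option (a sketch, not necessarily the approach the author had in mind) is a capturing group placed right after the `@`. The character class `[\w.-]` is an assumption about which characters a domain may contain.

```python
import re

from_lines = [
    'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
    'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>',
]

# "@" is matched but not captured; the group collects word characters,
# dots and dashes, and therefore stops at the closing ">".
domains = [re.findall(r"@([\w.-]+)", line) for line in from_lines]
print(domains)  # [['spinfinder.com'], ['epatra.com']]
```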

<font size = '4' color = 'gold'>

## Common Python Regex Functions
<font color = 'green'>
re.findall() is undeniably useful, but it’s not the only built-in function that’s available to us in re:

- re.search()
- re.split()
- re.sub()

<font size='4' color='purple'>

## 1. re.search()

<font color='blue'>
While re.findall() matches all instances of a pattern in a string and returns them in a list, re.search() matches only the first instance of a pattern in a string and returns it as a match object (re.Match), or None when nothing matches.

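Take an address such as `johndoe@chaptrglobal.edu`. The pattern `\w+\S*@.*\w` reads it in three parts: `\w+\S*` matches the part before the @ (`johndoe`), `@` matches the literal at sign, and `.*\w` matches everything after it up to the last word character (`chaptrglobal.edu`).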

In [10]:
match  = re.findall("From:.*", fh)

for line in match:
    mail = re.search(r"\w+\S*@.*\w", line)
    print(line[mail.start():mail.end()])


bensul2004nng@spinfinder.com
obong_715@epatra.com
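
The match object also exposes the matched text directly, which avoids slicing by hand. A short sketch on one of the lines above follows; the `None` check is added here because `re.search` returns `None` when the pattern is not found.

```python
import re

line = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'
mail = re.search(r"\w+\S*@.*\w", line)

if mail is not None:
    # group() returns the matched text; start() and end() give its position.
    print(mail.group())  # bensul2004nng@spinfinder.com
    print(line[mail.start():mail.end()] == mail.group())  # True

# A string without "@" cannot match, so search returns None.
no_mail = re.search(r"\w+\S*@.*\w", "no address here")
print(no_mail)  # None
```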


In [11]:
match  = re.findall("From:.*", fh)

for line in match:
    # The returned match object is not printed or stored, so this cell displays nothing.
    re.search(r'\w+\S*@.*\w', line)

<font size='4' color='purple'>

## 2. re.split()

<font color='blue'>
Suppose we need a quick way to get the domain name of the email addresses. We could do it with three regex operations, like so:

In [23]:
import itertools
match  = re.findall("From:.*", fh)

for line in match:
    mail = re.search(r"\w+\S*@.*\w", line)
    exact = line[mail.start():mail.end()]
    firstsplit = re.split("@", exact)
    lists = [re.split(r"\.", word) for word in firstsplit]
    print(list(itertools.chain(*lists)))

['bensul2004nng', 'spinfinder', 'com']
['obong_715', 'epatra', 'com']
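
The two splits can likely be folded into one by giving `re.split` a character class that contains both separators. This is a sketch on the two addresses found above, offered as an alternative rather than a replacement for the cell.

```python
import re

addresses = ["bensul2004nng@spinfinder.com", "obong_715@epatra.com"]

# Inside [...] the dot is literal, so "[@.]" splits on either "@" or ".".
parts = [re.split(r"[@.]", address) for address in addresses]
print(parts)
# [['bensul2004nng', 'spinfinder', 'com'], ['obong_715', 'epatra', 'com']]
```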


<font size='4' color='purple'>

## 3. re.sub()

<font color='blue'>
Another handy re function is re.sub(). As the function name suggests, it substitutes parts of a string.

In [13]:
match  = re.findall("From:.*", fh)

for line in match:
    print(re.sub(r"\w+\S*@.*\w", "johndoe", line))

From: "Mr. Ben Suleman" <johndoe>
From: "PRINCE OBONG ELEME" <johndoe>
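
The replacement string can also reuse part of the match through a backreference such as `\1`. The sketch below masks only the part before the `@` and keeps the domain. The pattern and the `***` placeholder are assumptions chosen for illustration.

```python
import re

line = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'

# Group 1 captures the domain; "\1" in the replacement puts it back.
masked = re.sub(r"[\w.+-]+@([\w.-]+)", r"***@\1", line)
print(masked)  # From: "Mr. Ben Suleman" <***@spinfinder.com>
```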


- Find the sender.
- Find the dates the emails were sent.
- Find all the words in phishy emails.
- Promotional phrases: "Hurry to", "mega", "I have received mine already", "Open this link.....", "Click this [link](htps:randomword.com)", "htpps".
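
The notes above mention counting the words in a phishy email and spotting promotional phrases. A rough sketch of both ideas follows. The `body` text and the phrase list are made up for illustration and are not a vetted phishing detector.

```python
import re
from collections import Counter

# Hypothetical email body, not taken from test_emails.txt.
body = "URGENT ASSISTANCE needed. Hurry, click this link. Click now, dear friend!"

# Count every word, ignoring case.
counts = Counter(re.findall(r"\w+", body.lower()))
print(counts.most_common(1))  # [('click', 2)]

# Hypothetical list of suspicious phrases, matched as whole words, ignoring case.
suspicious = re.compile(r"\b(urgent|hurry|click this link|dear friend)\b", re.IGNORECASE)
hits = suspicious.findall(body)
print(hits)  # ['URGENT', 'Hurry', 'click this link', 'dear friend']
```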