# 作業目標: 使用python正規表達式對資料進行清洗處理

這份作業我們會使用詐欺郵件的文本資料來作為清洗與處理的操作。
[資料集](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### 讀入資料文本
因原始文本較大，先使用部份擷取的**sample_emails.txt**來進行練習

In [48]:
#讀取文本資料
with open("sample_emails.txt", "r") as file:
    sample_corpus = file.read()

In [49]:
sample_corpus

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

In [5]:
sample_sentences = sample_corpus.split("\n")
for sentence in sample_sentences:
    print(sentence)

From r  Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311
Status: O

FROM:MR. JAMES NGOLA.
CONFIDENTIAL TEL: 233-27-587908.
E-MAIL: (james_ngola2002@maktoob.com).

URGENT BUSINESS ASSISTANCE AND PARTNERSHIP.


DEAR FRIEND,

I AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.


THE INCIDENT OCCURRED IN OUR PRESENCE WH

### 讀取寄件者資訊
觀察文本資料可以發現, 寄件者資訊都符合以下格式

`From: <收件者姓名> <收件者電子郵件>`

In [17]:
import re

In [23]:
match = re.findall(r"From: \"([^\n\"]*)\" <([^@>]+@[^@>]+\.[^@>]+)>", sample_corpus)

In [24]:
match

[('MR. JAMES NGOLA.', 'james_ngola2002@maktoob.com'),
 ('Mr. Ben Suleman', 'bensul2004nng@spinfinder.com'),
 ('PRINCE OBONG ELEME', 'obong_715@epatra.com')]

### 只讀取寄件者姓名

In [28]:
sample_senders = {}
for sender in match:
    sample_senders[sender[0]] = sender[1]
for name in sample_senders.keys():
    print(name)

MR. JAMES NGOLA.
Mr. Ben Suleman
PRINCE OBONG ELEME


### 只讀取寄件者電子信箱

In [29]:
for name in sample_senders.values():
    print(name)

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com


### 只讀取電子信箱中的寄件機構資訊
ex: james_ngola2002@maktoob.com --> 取maktoob

In [44]:
def get_user_and_org(mail_addr):
    org = re.findall(r"([^@]*)@([^\.]*).*", mail_addr)
    assert len(org) == 1 and len(org[0]) == 2
    return org[0]

users_and_orgs = []
for name in sample_senders.values():
    users_and_orgs.append(get_user_and_org(name))
    print(users_and_orgs[-1][1])

maktoob
spinfinder
epatra


### 結合上面的配對方式, 將寄件者的帳號與機構訊返回
ex: james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [47]:
for user in users_and_orgs:
    print(", ".join(user))

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra


### 使用正規表達式對email資料進行處理
這裡我們會使用到python其他的套件協助處理(ex: pandas, email, etc)，這裡我們只需要專注在正規表達式上即可，其他的套件是方便我們整理與處理資料。

### 讀取與切分Email
讀入的email為一個長字串，利用正規表達式切割讀入的資料成一封一封的email，並將結果以list表示。

輸出: [email_1, email_2, email_3, ....]

In [67]:
import re
import pandas as pd
import email

###讀取文本資料:fradulent_emails.txt###
with open("all_emails.txt", "r", encoding="utf8", errors="ignore") as file:
    all_emails = file.read()
    
###切割讀入的資料成一封一封的email###
###我們可以使用list來儲存每一封email###
###注意！這裡請仔細觀察sample資料，看資料是如何切分不同email###
emails = all_emails.split("From r  ")
emails = emails[1:]
for i, email in enumerate(emails):
    emails[i] = "From r  " + email

len(emails) #查看有多少封email

3976

### 從文本中擷取所有寄件者與收件者的姓名和地址

In [115]:
emails_list = [] #創建空list來儲存所有email資訊

for mail in emails[:20]: #只取前20筆資料 (處理速度比較快)
    emails_dict = dict() #創建空字典儲存資訊
    ###取得寄件者姓名與地址###
    
    #Step1: 取得寄件者資訊 (hint: From:)
    #Step2: 取得姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    #Step3: 將取得的姓名與地址存入字典中
    match = re.findall(r"\nFrom:\s\"?\s?([^\"\<\n]*[^\"\<\n\s])\s?\"?\s?(?:\<([^\>\n]*)\>)?\n", mail)
    assert len(match) == 1
    emails_dict["sender"] = (match[0][0], match[0][1] if len(match[0][1]) != 0 else match[0][0])
    
    ###取得收件者姓名與地址###
    #Step1: 取得收件者資訊 (hint: To:)
    #Step2: 取得姓名與地址 (hint: 要注意有時會有沒取到配對的情況)
    #Step3: 將取得的姓名與地址存入字典中
    match = re.findall(r"\nTo: ([^\n]*)[\s]*\n", mail)
    assert len(match) < 2
    if len(match) == 1:
        emails_dict["recipient"] = match[0]
        
    ###取得信件日期###
    #Step1: 取得日期資訊 (hint: Date:)
    #Step2: 取得詳細日期(只需取得DD MMM YYYY)
    #Step3: 將取得的日期資訊存入字典中
    match = re.findall(r"\nDate: [a-zA-Z]{3}, (\d{1,2}) ([a-zA-Z]{3}) (\d{4})[^\n]*\n", mail)
    assert len(match) < 2
    if len(match) == 1:
        emails_dict["date"] = match[0]
        
    ###取得信件主旨###
    #Step1: 取得主旨資訊 (hint: Subject:)
    #Step2: 移除不必要文字 (hint: Subject: )
    #Step3: 將取得的主旨存入字典中
    match = re.findall(r"\nSubject: ([^\n]*)[\s]*\n", mail)
    assert len(match) == 1
    emails_dict["subject"] = match[0]
    
    ###取得信件內文###
    #這裡我們使用email package來取出email內文 (可以不需深究，本章節重點在正規表達式)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except:
        emails_dict["email_body"] = None
    
    ###將字典加入list###
    emails_list.append(emails_dict)
    # print(emails_dict)

In [116]:
#將處理結果轉化為dataframe
emails_df = pd.DataFrame(emails_list)
emails_df

Unnamed: 0,sender,recipient,date,subject,email_body
0,"(MR. JAMES NGOLA., james_ngola2002@maktoob.com)",webmaster@aclweb.org,"(31, Oct, 2002)",URGENT BUSINESS ASSISTANCE AND PARTNERSHIP,
1,"(Mr. Ben Suleman, bensul2004nng@spinfinder.com)",R@M,"(31, Oct, 2002)",URGENT ASSISTANCE /RELATIONSHIP (P),
2,"(PRINCE OBONG ELEME, obong_715@epatra.com)",webmaster@aclweb.org,"(31, Oct, 2002)",GOOD DAY TO YOU,
3,"(PRINCE OBONG ELEME, obong_715@epatra.com)",webmaster@aclweb.org,"(31, Oct, 2002)",GOOD DAY TO YOU,
4,"(Maryam Abacha, m_abacha03@www.com)",R@M,"(1, Nov, 2002)",I Need Your Assistance.,
5,"(Kuta David, davidkuta@postmark.net)",davidkuta@yahoo.com,"(02, Nov, 2002)",Partnership,
6,"(Barrister tunde dosumu, tunde_dosumu@lycos.com)",,,Urgent Attention,
7,"(William Drallo, william2244drallo@maktoob.com)",webmaster@aclweb.org,"(3, Nov, 2002)",URGENT BUSINESS PRPOSAL,
8,"(MR USMAN ABDUL, abdul_817@rediffmail.com)",R@M,"(04, Nov, 2002)",THANK YOU,
9,"(Tunde Dosumu, barrister_td@lycos.com)",,,Urgent Assistance,
