# Assignment goal: clean data with Python regular expressions

In this assignment we use a corpus of fraudulent emails as the text to clean and process.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Loading the text
The original corpus is large, so we first practice on **sample_emails.txt**, a partial extract.

In [13]:
import re

# Read the text data
with open( 'sample_emails.txt', 'r' , encoding="utf8", errors='ignore') as f:
    sample_corpus = f.read()

In [87]:
sample_corpus[:500]

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; char'

### Extracting sender information
Inspecting the text shows that every sender line follows this format:

`From: <sender name> <sender email>`

In [41]:
pattern = r'From:.*'
match = re.findall(pattern,sample_corpus)
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']
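
`From:.*` returns exactly one header line per match because, by default, `.` matches any character except a newline, so `.*` stops at the end of the line. The following is a minimal sketch on a made-up string; `toy_corpus` is invented for illustration and is not taken from the dataset.

```python
import re

# Hypothetical miniature corpus, invented for illustration only.
toy_corpus = (
    'Return-Path: <a@example.com>\n'
    'From: "Alice A." <a@example.com>\n'
    'Subject: hello\n'
    'From: "Bob B." <b@example.org>\n'
)

# '.' does not match '\n' by default, so each match ends at the line break.
one_line_matches = re.findall(r'From:.*', toy_corpus)
print(one_line_matches)

# With re.DOTALL, '.' also matches '\n', so the first match runs to the end
# of the string and swallows the second header.
dotall_matches = re.findall(r'From:.*', toy_corpus, flags=re.DOTALL)
print(len(dotall_matches))  # -> 1
```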

### Extracting only the sender name

In [51]:
pattern1 = r'\".*\"'
for info in match:
    print(re.search(pattern1, info).group())


"MR. JAMES NGOLA."
"Mr. Ben Suleman"
"PRINCE OBONG ELEME"

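The output above still includes the surrounding double quotes. If only the name itself is wanted, two common options are a capture group or a lookbehind/lookahead pair. This is a small sketch on one header line copied from the output above, not the only way to do it.

```python
import re

# Header line copied from the earlier output.
header = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# Option 1: a capture group; group(1) holds only the text between the quotes.
name_via_group = re.search(r'"(.*)"', header).group(1)

# Option 2: lookbehind and lookahead assert the quotes without consuming them.
name_via_lookaround = re.search(r'(?<=").*(?=")', header).group()

print(name_via_group)       # -> MR. JAMES NGOLA.
print(name_via_lookaround)  # -> MR. JAMES NGOLA.
```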

### Extracting only the sender email address

In [52]:
pattern2 = r'<.*>'
for info in match:
    print(re.search(pattern2, info).group())

<james_ngola2002@maktoob.com>
<bensul2004nng@spinfinder.com>
<obong_715@epatra.com>
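
`<.*>` works here because each `From:` line contains a single bracketed address. Since `.*` is greedy, a line with several bracketed addresses would be matched from the first `<` to the last `>`. The sketch below uses an invented line to show the difference; a negated character class is one way to keep each match inside its own brackets.

```python
import re

# Hypothetical line with two bracketed addresses, invented for illustration.
line = 'To: <a@example.com>, <b@example.org>'

# Greedy: spans from the first '<' to the last '>'.
greedy = re.search(r'<.*>', line).group()
print(greedy)   # -> <a@example.com>, <b@example.org>

# Negated class: '[^>]+' cannot cross a '>', so each address is separate.
bounded = re.findall(r'<[^>]+>', line)
print(bounded)  # -> ['<a@example.com>', '<b@example.org>']
```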


### Extracting only the sending organization from the email address
e.g. james_ngola2002@maktoob.com --> take maktoob

In [56]:
pattern3 = r'(?<=@)\w+'
for info in match:
    print(re.search(pattern3, info).group())

maktoob
spinfinder
epatra
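
The lookbehind `(?<=@)` requires an `@` immediately before the match without including it, and `\w+` then stops at the first character that is not a letter, digit, or underscore, which is the `.` before `com`. One consequence is that a hyphenated domain would be cut short. The sketch below illustrates this with an invented address; `[\w-]+` is one possible adjustment, not something the assignment requires.

```python
import re

# Address copied from the sample data.
address = 'james_ngola2002@maktoob.com'
org = re.search(r'(?<=@)\w+', address).group()
print(org)  # -> maktoob

# Hypothetical address with a hyphen in the domain, invented for illustration.
hyphenated = 'someone@mail-host.example.com'
cut_short = re.search(r'(?<=@)\w+', hyphenated).group()
with_hyphen = re.search(r'(?<=@)[\w-]+', hyphenated).group()
print(cut_short)    # -> mail
print(with_hyphen)  # -> mail-host
```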


### Combining the approaches above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

james_ngola2002, maktoob
bensul2004nng, spinfinder
obong_715, epatra

In [70]:
pattern = r'From:.*'
pattern2 = r'\w\S*@\w+(?=\.)'
match = re.findall(pattern,sample_corpus)
for info in match:
    result = re.search(pattern2, info).group()
    #print(result)
    name, org  = re.split('@',result)
    print('{} , {}'.format(name,org))

james_ngola2002 , maktoob
bensul2004nng , spinfinder
obong_715 , epatra
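
An alternative to matching the whole `account@organization` string and then splitting on `@` is to let the pattern capture both pieces directly. The sketch below uses named groups on a header line copied from the sample output; the group names `account` and `org` are chosen here for illustration.

```python
import re

# Header line copied from the earlier output.
header = 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>'

# Two named groups: the text before '@' and the first word after it.
m = re.search(r'(?P<account>\w\S*)@(?P<org>\w+)', header)
account = m.group('account')
org = m.group('org')
print('{} , {}'.format(account, org))  # -> bensul2004nng , spinfinder
```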


In [58]:
pattern1 = r'\".*\"'
pattern2 = r'<.*>'
pattern3 = r'(?<=@)\w+'
pattern_list = [pattern1,pattern2,pattern3]

for info in match:
    s1 = re.search(pattern_list[0],info).group()
    s2 = re.search(pattern_list[1],info).group()
    s3 = re.search(pattern_list[2],info).group()
    print("_".join([s1,s2,s3]))

"MR. JAMES NGOLA."_<james_ngola2002@maktoob.com>_maktoob
"Mr. Ben Suleman"_<bensul2004nng@spinfinder.com>_spinfinder
"PRINCE OBONG ELEME"_<obong_715@epatra.com>_epatra


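### Processing email data with regular expressions
Here we also use other Python packages (e.g. pandas, email) to help organize and process the data. The focus remains on the regular expressions; the other packages are only a convenience.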

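### Reading and splitting the emails
The file is read in as one long string. Use a regular expression to split it into individual emails and store the result in a list.

Output: [email_1, email_2, email_3, ....]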

In [79]:
from io import StringIO

In [86]:
import re
import email  # standard-library package, used later to extract message bodies
import pandas as pd
from io import StringIO

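### Read the full corpus (opened here as all_emails.txt) ###
with open('all_emails.txt', 'r', encoding="utf8", errors='ignore') as f:
    corpus = f.read()

### Split the text into individual emails and store them in a list ###
### Note: inspect the sample data to see how emails are separated; ###
### each message starts with a line of the form "From r  <date>". ###
# re.M only changes how ^ and $ behave, so it has no effect on this unanchored pattern.
emails = re.split(r"From r", corpus, flags=re.M)
emails = emails[1:]  # drop whatever precedes the first separator

len(emails)  # number of emails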

3977
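
Because the pattern `From r` is not anchored, it would also split on those characters if they appeared in the middle of a body line. Whether that actually happens in this corpus is not checked here, and anchoring the pattern could change the count shown above. The sketch below only demonstrates the difference on an invented two-message string.

```python
import re

# Hypothetical two-message text, invented for illustration only.
toy_mbox = (
    'From r  Wed Oct 30 21:41:56 2002\n'
    'Subject: first\n'
    '\n'
    'Dear friend, From records I found your name.\n'
    'From r  Thu Oct 31 08:11:39 2002\n'
    'Subject: second\n'
)

# Unanchored: also splits on the "From r" inside the body sentence.
naive = re.split(r'From r', toy_mbox)[1:]
print(len(naive))     # -> 3

# Anchored: with re.M, '^' matches at the start of every line,
# so only separator lines trigger a split.
anchored = re.split(r'^From r', toy_mbox, flags=re.M)[1:]
print(len(anchored))  # -> 2
```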

### Extracting every sender's and recipient's name and address from the text

In [89]:
sample_corpus[:1000]

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

In [110]:
emails_list = []  # list that collects one dict per email

for mail in emails:  # loop over every email (use emails[:20] for a faster trial run)
    emails_dict = dict()  # dict holding the fields of this email
    ### Sender name and address ###
    
    # Step 1: get the sender line (hint: From:)
    sender = re.search(r"From:.*", mail)
    
    # Step 2: get the name and address (hint: sometimes there is no match)
    if sender is not None:  # a match was found
        sender_mail = re.search(r"\w\S*@.*\b", sender.group())
        sender_name = re.search(r"(?<=\").*(?=\")", sender.group())
    else:  # no match
        sender_mail = None
        sender_name = None

    # Step 3: store the name and address in the dict
    if sender_mail is not None:
        emails_dict["sender_email"] = sender_mail.group()
    else:
        emails_dict["sender_email"] = sender_mail #None
    
    if sender_name is not None:
        emails_dict["sender_name"] = sender_name.group()
    else:
        emails_dict["sender_name"] = sender_name #None
        
    
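    ### Recipient name and address ###
    # Step 1: get the recipient line (hint: To:)
    # Caution: this unanchored pattern also matches inside "Reply-To:" when that header comes first.
    recipient = re.search(r"To:.*", mail)
    
    # Step 2: get the name and address (hint: sometimes there is no match)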
    if recipient is not None:
        r_email = re.search(r"\w\S*@.*\b", recipient.group())
        r_name = re.search(r"(?<=\").*(?=\")", recipient.group())
    else:
        r_email = None
        r_name = None
        
    # Step 3: store the name and address in the dict
    if r_email is not None:
        emails_dict["recipient_email"] = r_email.group()
    else:
        emails_dict["recipient_email"] = r_email #None
    
    if r_name is not None:
        emails_dict["recipient_name"] = r_name.group()
    else:
        emails_dict["recipient_name"] = r_name #None
        
        
    ### Date sent ###
    # Step 1: get the date line (hint: Date:)
    date_info = re.search(r"Date:.*", mail)
    
    # Step 2: keep only the day, month and year (DD MMM YYYY)
    if date_info is not None:
        date = re.search(r"\d+\s\w+\s\d+", date_info.group())
    else:
        date = None
        
    # Step 3: store the date in the dict
    if date is not None:
        emails_dict["date_sent"] = date.group()
    else:
        emails_dict["date_sent"] = date
        
        
    ### Subject ###
    # Step 1: get the subject line (hint: Subject:)
    # Step 2: drop the "Subject: " prefix; the lookbehind keeps it out of the match
    subject_info = re.search(r"(?<=Subject: ).*", mail)
    
    # Step 3: store the subject in the dict
    if subject_info is not None:
        emails_dict["subject"] = subject_info.group()
    else:
        emails_dict["subject"] = None
    
    
    
    ### Email body ###
    # The email package extracts the body (no need to dig into it; this chapter is about regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception:
        emails_dict["email_body"] = None
    
    ### Append the dict to the list ###
    emails_list.append(emails_dict)
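
One detail in the loop is worth checking: `To:.*` is unanchored, so it can match the tail of a `Reply-To:` header that appears before the real `To:` line. That appears to be why the first result row lists the sender's own address as the recipient. The sketch below reproduces the effect on header lines copied from the sample and shows one possible fix using `^` with `re.M`; `first_header` is a hypothetical helper written for this illustration.

```python
import re

# Header lines copied from the sample output shown earlier.
sample_mail = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'Reply-To: james_ngola2002@maktoob.com\n'
    'To: webmaster@aclweb.org\n'
)


def first_header(name, text):
    """Return the first line that starts with `name:`, or None (hypothetical helper)."""
    m = re.search(r'^' + re.escape(name) + r':.*', text, flags=re.M)
    return m.group() if m else None


# Unanchored: the first "To:" found is the one inside "Reply-To:".
unanchored = re.search(r'To:.*', sample_mail).group()
print(unanchored)  # -> To: james_ngola2002@maktoob.com

# Anchored: only a line that begins with "To:" qualifies.
anchored = first_header('To', sample_mail)
print(anchored)    # -> To: webmaster@aclweb.org
```

A helper like this also returns `None` directly when nothing matches, which would shorten the repeated `if ... is not None` branches.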

In [115]:
# Convert the results into a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df.isna().sum()/emails_df.shape[0]

sender_email       0.118934
sender_name        0.420166
recipient_email    0.187327
recipient_name     0.958763
date_sent          0.154388
subject            0.006789
email_body         0.000000
dtype: float64
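
The `date_sent` column holds plain strings such as `31 Oct 2002`. If real dates are needed later, the matched text can be parsed with the standard library. The sketch below uses a slightly stricter pattern than the loop and assumes English month abbreviations; it is an optional follow-up step, not part of the assignment.

```python
import re
from datetime import datetime

# Date header copied from the sample output shown earlier.
date_line = 'Date: Thu, 31 Oct 2002 02:38:20 +0000'

# Day (1-2 digits), three-letter month, four-digit year.
m = re.search(r'\b\d{1,2}\s[A-Za-z]{3}\s\d{4}\b', date_line)

# '%b' parses abbreviated month names in the current locale (English assumed).
date_sent = datetime.strptime(m.group(), '%d %b %Y').date() if m else None
print(date_sent)  # -> 2002-10-31
```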

In [125]:
emails_df[emails_df.notnull()].head()

| | sender_email | sender_name | recipient_email | recipient_name | date_sent | subject | email_body |
|---|---|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | MR. JAMES NGOLA. | james_ngola2002@maktoob.com | | 31 Oct 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | bensul2004nng@spinfinder.com | Mr. Ben Suleman | R@M | | 31 Oct 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | obong_715@epatra.com | PRINCE OBONG ELEME | obong_715@epatra.com | | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | obong_715@epatra.com | PRINCE OBONG ELEME | webmaster@aclweb.org | | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | m_abacha03@www.com | Maryam Abacha | m_abacha03@www.com | | 1 Nov 2002 | I Need Your Assistance. | Dear sir, \n \nIt is with a heart full of hope... |


In [112]:
emails_df

| | sender_email | sender_name | recipient_email | recipient_name | date_sent | subject | email_body |
|---|---|---|---|---|---|---|---|
| 0 | james_ngola2002@maktoob.com | MR. JAMES NGOLA. | james_ngola2002@maktoob.com | | 31 Oct 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2... |
| 1 | bensul2004nng@spinfinder.com | Mr. Ben Suleman | R@M | | 31 Oct 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | Dear Friend,\n\nI am Mr. Ben Suleman a custom ... |
| 2 | obong_715@epatra.com | PRINCE OBONG ELEME | obong_715@epatra.com | | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 3 | obong_715@epatra.com | PRINCE OBONG ELEME | webmaster@aclweb.org | | 31 Oct 2002 | GOOD DAY TO YOU | FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL... |
| 4 | m_abacha03@www.com | Maryam Abacha | m_abacha03@www.com | | 1 Nov 2002 | I Need Your Assistance. | Dear sir, \n \nIt is with a heart full of hope... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3972 | michealagu0255@zipmail.com.br | | | | | =?iso-8859-1?Q?CONTACT=20GLOBAL=20MAX=20SHIPIN... | Atten: My Dear ,\n \nI have Paid the fee for y... |
| 3973 | ali_sherif252@hotmail.fr | | ali_sherif105@yahoo.co.uk | | 17 Sep 2007 | TREAT AS URGENT. | [[Content-Type, Content-Transfer-Encoding], [C... |
| 3974 | drusmanibrahimtg08@hotmail.fr | | drusmanibrahim.tg@homs.cc | | 18 Sep 2007 | From Dr Usman Ibrahim / Mr Wahid Yoffe property. | [[Content-Type, Content-Transfer-Encoding], [C... |
| 3975 | motherdorisk61@hotmail.com | | motherdorisk9@yahoo.com.hk | | 19 Sep 2007 | My Beloved In Christ. | \nBeloved in the Lord Jesus Christ, PLEASE END... |
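
Two things stand out in the last rows: one subject is still in its encoded `=?iso-8859-1?Q?...` form, and some bodies appear to be lists rather than strings, which is what `get_payload()` returns for multipart messages. Both can be handled with the standard `email` package rather than with regular expressions. The sketch below is optional and uses invented inputs, since the real values are truncated in the output above.

```python
from email import message_from_string
from email.header import decode_header, make_header

# Hypothetical encoded subject, invented for illustration.
encoded_subject = '=?iso-8859-1?Q?TREAT=20AS=20URGENT?='
decoded_subject = str(make_header(decode_header(encoded_subject)))
print(decoded_subject)  # -> TREAT AS URGENT

# Hypothetical two-part message, invented for illustration.
raw_message = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    '\n'
    '--XYZ\n'
    'Content-Type: text/plain\n'
    '\n'
    'plain text body\n'
    '--XYZ\n'
    'Content-Type: text/html\n'
    '\n'
    '<p>html body</p>\n'
    '--XYZ--\n'
)
msg = message_from_string(raw_message)

# For multipart mail, get_payload() returns a list of sub-messages.
payload_is_list = isinstance(msg.get_payload(), list)

# walk() visits every part; keep only the plain-text ones.
plain_parts = [
    part.get_payload()
    for part in msg.walk()
    if part.get_content_type() == 'text/plain'
]
print(payload_is_list, len(plain_parts))
```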
