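# Assignment goal: clean and process data with Python regular expressions

In this assignment we use the text of a fraudulent email corpus to practise data cleaning and processing.
[Dataset](https://www.kaggle.com/rtatman/fraudulent-email-corpus/data#)

### Read the text data
The original corpus is large, so we first practise on a partial extract, **sample_emails.txt**.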

In [110]:
# Read the text data
#<your code>#
with open('./sample_emails.txt', 'r', encoding="utf8", errors='ignore') as file:
    sample_corpus = file.read()

In [111]:
sample_corpus

'From r  Wed Oct 30 21:41:56 2002\nReturn-Path: <james_ngola2002@maktoob.com>\nX-Sieve: cmu-sieve 2.0\nReturn-Path: <james_ngola2002@maktoob.com>\nMessage-Id: <200210310241.g9V2fNm6028281@cs.CU>\nFrom: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\nReply-To: james_ngola2002@maktoob.com\nTo: webmaster@aclweb.org\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: (james_ngola2002@maktoob.com).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDE

### Read the sender information
Looking at the sample text, the sender information follows this format:

`From: <sender name> <sender email>`

In [112]:
#<your code>#
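import re
# Match a full sender line of the form: From: "name" <address>
# re.M lets ^ match at the start of every line; re.I ignores case
pattern = r'^From: "[ \.a-zA-Z]+" <[_\.a-zA-Z0-9@]+>'
match = re.findall(pattern, sample_corpus, flags=re.M|re.I)

# Simpler version from the reference solution; it also returns a lot of noise
#import re
#pattern = r'From:.*'
#match = re.findall(pattern, sample_corpus)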


In [113]:
match

['From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>',
 'From: "Mr. Ben Suleman" <bensul2004nng@spinfinder.com>',
 'From: "PRINCE OBONG ELEME" <obong_715@epatra.com>']

### Read only the sender's name

In [114]:
#<your code>#
pattern1 = r'(?<=From: ").*(?=")'
for sender_info in match:
    #sender_name = re.findall(pattern1, sender_info) #output a matched list
    sender_name = re.search(pattern1, sender_info).group() #output a matched string
    print(sender_name)

MR. JAMES NGOLA.
Mr. Ben Suleman
PRINCE OBONG ELEME
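
The `.*` in the name pattern is greedy. That is harmless here because each matched sender line contains exactly one pair of quotes. As a small sketch of what would change otherwise, the line below is a hypothetical header (not taken from the corpus) with a second quoted string; the lazy form `.*?` stops at the first closing quote.

```python
import re

# Hypothetical header line with two quoted strings (not taken from the corpus)
line = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com> "extra"'

# Greedy: .* runs as far as it can, so it ends at the LAST closing quote
greedy = re.search(r'(?<=From: ").*(?=")', line).group()
# Lazy: .*? takes as little as possible, so it ends at the FIRST closing quote
lazy = re.search(r'(?<=From: ").*?(?=")', line).group()

print(greedy)
print(lazy)
```

On the three sample lines both forms return the same names, so this only matters for messier headers.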


### Read only the sender's email address

In [115]:
#<your code>#
pattern2 = r'(?<=<).*(?=>)'
for sender_info in match:
    #sender_mail_address = re.findall(pattern2, sender_info) #output a matched list
    sender_mail_address = re.search(pattern2, sender_info).group() #output a matched string
    print(sender_mail_address)

james_ngola2002@maktoob.com
bensul2004nng@spinfinder.com
obong_715@epatra.com


### Read only the sending organization from the email address
e.g. james_ngola2002@maktoob.com --> take maktoob

In [116]:
pattern3 = r'(?<=@)\w*(?=\.)' # the dot is escaped so that it matches a literal '.'
#pattern3 = r'(?<=@)\w\S*(?=\.)' # alternative: the address may end in .net rather than .com
for sender_info in match:
    #sender_mail_vendor = re.findall(pattern3, sender_info) #output a matched list
    sender_mail_vendor = re.search(pattern3, sender_info).group() #output a matched string
    print(sender_mail_vendor)

maktoob
spinfinder
epatra
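
Inside the lookahead, an unescaped `.` means "any character", while `\.` means a literal dot. On the three sample addresses both forms give the same result, so the difference is easy to miss. The sketch below uses a hypothetical sender line (not from the corpus) whose address has no dot after the `@` to show where they diverge.

```python
import re

# Hypothetical sender line whose address has no dot after the '@'
line = 'From: "Test User" <test@localhost>'

# '.' matches any character, so the closing '>' satisfies the lookahead
unescaped = re.search(r'(?<=@)\w*(?=.)', line)
# '\.' requires a literal dot after the word characters, which never appears here
escaped = re.search(r'(?<=@)\w*(?=\.)', line)

print(unescaped.group() if unescaped else None)
print(escaped.group() if escaped else None)
```

The escaped form returns no match instead of silently accepting a malformed address, which is usually easier to handle downstream.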


### Combine the matches above to return the sender's account and organization
e.g. james_ngola2002@maktoob.com --> [james_ngola2002, maktoob]

In [117]:
pattern4 = r'(?<=<).+(?=@)'
for sender_info in match:
    sender_mail_id = re.search(pattern4, sender_info).group() #output a matched string
    sender_mail_vendor = re.search(pattern3, sender_info).group() #output a matched string
    print([sender_mail_id, sender_mail_vendor])

['james_ngola2002', 'maktoob']
['bensul2004nng', 'spinfinder']
['obong_715', 'epatra']
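
The same pair can also be pulled out in one pass with capture groups instead of two lookaround searches. This is only an alternative sketch; the pattern below is introduced here for illustration and is not part of the original solution.

```python
import re

sender_info = 'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>'

# Group 1: the account, i.e. everything between '<' and '@'
# Group 2: the vendor, i.e. word characters between '@' and the next dot
pattern = r'<([^@<>]+)@(\w+)\.'
m = re.search(pattern, sender_info)
result = list(m.groups()) if m else None
print(result)  # ['james_ngola2002', 'maktoob']
```

Checking `m` before calling `groups()` avoids an `AttributeError` on lines that do not contain an address.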


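### Process the email data with regular expressions
Here we also use other Python packages (e.g. pandas, email). We only need to focus on the regular expressions; the other packages just make it easier to organize and process the data.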
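
For orientation, here is a minimal sketch of what the standard-library `email` package does with a message string. The message is a shortened form of the first sample email, and `email.utils.parseaddr` is used only as a cross-check for the regex results; it is not part of the assignment.

```python
import email
from email.utils import parseaddr

# Shortened version of the first sample email: headers, blank line, body
raw = (
    'From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>\n'
    'To: webmaster@aclweb.org\n'
    'Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\n'
    '\n'
    'DEAR FRIEND,\n'
)

msg = email.message_from_string(raw)          # parse headers and body
name, address = parseaddr(msg['From'])        # split display name and address
body = msg.get_payload()                      # text after the blank line

print(name)
print(address)
print(repr(body))
```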

### Read and split the emails
The input is one long string. Use a regular expression to split it into individual emails and store the result as a list.

Output: [email_1, email_2, email_3, ....]

In [132]:
import re
import pandas as pd
import email

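### Read the text data: fradulent_emails.txt (opened below as all_emails.txt) ###
#<your code>#
with open('all_emails.txt', 'r', encoding="utf8", errors='ignore') as file_all:
    corpus = file_all.read()
    
### Split the text into individual emails ###
### A list can be used to store the individual emails ###
### Note: look closely at the sample data to see how the emails are separated ###
#<your code>#
# Each email starts with a "From r" line; the pattern has no ^ or $, so re.M would have no effect
emails = re.split(r"From r", corpus)
emails = emails[1:] # drop the empty first element
len(emails) # number of emails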

3977
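
The split pattern is not anchored, so any `From r` that happens to appear inside a message body would also act as a separator. The sketch below shows the effect on a tiny made-up corpus and compares it with a pattern anchored by `^` under `re.M`. Whether anchoring changes the count on the full corpus is not checked here.

```python
import re

# Tiny made-up corpus: two emails; the first body happens to contain "From r"
toy_corpus = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: first\n"
    "\n"
    "Greetings. From remote lands I write to you.\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: second\n"
    "\n"
    "Hello again.\n"
)

# Unanchored: also splits on "From r" in the middle of a body line
unanchored = re.split(r"From r", toy_corpus)[1:]
# Anchored: with re.M, ^ only matches at the start of a line
anchored = re.split(r"^From r", toy_corpus, flags=re.M)[1:]

print(len(unanchored))
print(len(anchored))
```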

In [164]:
print(emails[5])

  Sat Nov  2 00:18:06 2002
Return-Path: <davidkuta@postmark.net>
X-Sieve: cmu-sieve 2.0
Return-Path: <davidkuta@postmark.net>
	02 Nov 2002 06:23:11 -0000
Mime-Version: 1.0
From: Kuta David <davidkuta@postmark.net>
To: davidkuta@yahoo.com
Subject: Partnership
Date: Sat, 02 Nov 2002 06:23:11 +0000
Content-Type: text/plain; charset="iso-8859-1"
Status: RO

ATTENTION:                                    
PRESIDENT/MANAGING DIRECTOR 
   
Dear Sir/Madam, 
   
Request for Urgent Business Relationship 
   
We are Top Officials of the Federal Government of 
Nigeria Contract Review Panel who are interested in 
importation of goods into our country and 
investing abroad with funds which are presently 
trapped in Nigeria. 
In order to commence this business we solicit your 
assistance, knowledge and expertise to enable us 
recieve the said trapped  funds abroad, for the 
subsequent purchase and inventory of the goods to be 
Imported and the investment abroad. 
   
The source of this fund is as foll

### Extract the names and addresses of all senders and recipients from the text

In [169]:
emails_list = [] # list that collects the information for every email

for mail in emails[:20]: # only the first 20 emails (faster to process)
    emails_dict = dict() # dictionary holding the fields of one email
    ### Sender name and address ###
    
    # Step 1: get the sender line (hint: From:)
    #<your code>#
    #pattern1 = r'^From: "[ \.a-zA-Z]+" <[_\.a-zA-Z0-9@]+>'
    pattern1 = r'^From: .*'
    sender_info = re.search(pattern1, mail, flags=re.M)  
    
    # Step 2: get the name and address (hint: sometimes there is no match)
    #<your code>#
    if sender_info:
        pattern2 = r'(?<=From: ").*(?=")'
        pattern3 = r'(?<=<).*(?=>)'
        sender_info = sender_info.group()
        try:
            sender_name = re.search(pattern2, sender_info).group()
        except AttributeError: # re.search returned None (no quoted name)
            sender_name = None
        try:
            sender_address = re.search(pattern3, sender_info).group()
            
        except AttributeError: # re.search returned None (no <address>)
            sender_address = None
    else:
        sender_name = None 
        sender_address = None 
    #print(sender_info)
    # Step 3: store the name and address in the dictionary
    #<your code>#
    emails_dict['sender_name'] = sender_name
    emails_dict['sender_address'] = sender_address
        
    
    ### Recipient name and address ###
    # Step 1: get the recipient line (hint: To:)
    #<your code>#
    pattern4 = r'^To: .*'
    recipient_info = re.search(pattern4, mail, flags=re.M)  
    
    # Step 2: get the name and address (hint: sometimes there is no match)
    #<your code>#
    if recipient_info:
        # Note: '.' matches any one character and \w* stops at the next dot,
        # so the domain is cut short (webmaster@aclweb.org -> webmaster@aclweb)
        pattern5 = r'\w*@.\w*'
        pattern6 = r'(?<=").*(?=")'
        recipient_info = recipient_info.group()
        try:
            recipient_name = re.search(pattern6, recipient_info).group()
        except AttributeError: # re.search returned None
            recipient_name = None 
        
        try:
            recipient_address = re.search(pattern5, recipient_info).group()
        except AttributeError: # re.search returned None
            recipient_address = None 
    else:
        recipient_name = None  
        recipient_address = None
        
        
    # Step 3: store the name and address in the dictionary
    #<your code>#
    emails_dict['recipient_name'] = recipient_name
    emails_dict['recipient_address'] = recipient_address
        
    ### Email date ###
    # Step 1: get the date line (hint: Date:)
    #<your code>#
    pattern7 = r'^Date: .*'
    date_info = re.search(pattern7, mail, flags=re.M)
    
    # Step 2: get the date itself (only DD MMM YYYY is needed)
    #<your code>#
    # e.g. Mon, 04 Nov 2002 23:41:26
    if date_info:
        # Expects ", " followed by a two-digit day, a three-letter month and a four-digit year
        pattern8 = r'(?<=, )\d\d [a-zA-Z]{3} \d\d\d\d'
        date_info = date_info.group()
        try:
            date = re.search(pattern8, date_info).group()
        except AttributeError: # re.search returned None
            date = None
    else:
        date = None

        
    # Step 3: store the date in the dictionary
    #<your code>#
    emails_dict['date'] = date
        
    ### Email subject ###
    # Step 1: get the subject line (hint: Subject:)
    #<your code>#
    pattern9 = r'^Subject: .*'
    subject_info = re.search(pattern9, mail, flags=re.M)
    
    # Step 2: remove the unneeded prefix (hint: Subject: )
    #<your code>#
    if subject_info:
        pattern10 = r'(?<=Subject: ).*'
        subject_info = subject_info.group()
        try:
            subject = re.search(pattern10, subject_info).group()
        except AttributeError: # re.search returned None
            subject = None
    else:
        subject = None
    
    # Step 3: store the subject in the dictionary
    #<your code>#
    emails_dict['subject'] = subject
    
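    ### Email body ###
    # Here we use the email package to extract the body (no need to study it in depth; this chapter is about regular expressions)
    try:
        full_email = email.message_from_string(mail)
        body = full_email.get_payload()
        emails_dict["email_body"] = body
    except Exception: # keep going if a message cannot be parsed
        emails_dict["email_body"] = None
    
    ### Append the dictionary to the list ###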
    #<your code>#
    emails_list.append(emails_dict)
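
The repeated `try`/`except` blocks in the loop all do the same thing: return the matched text, or `None` when nothing matches. One possible way to factor that out is a small helper. `first_match` is a name introduced here for illustration, and the message below is a shortened form of the sample email printed earlier.

```python
import re

def first_match(pattern, text, flags=0):
    """Return the first match of pattern in text, or None if there is none."""
    if text is None:
        return None
    m = re.search(pattern, text, flags)
    return m.group() if m else None

# Shortened headers of the sample email shown earlier
mail = (
    "Return-Path: <davidkuta@postmark.net>\n"
    "From: Kuta David <davidkuta@postmark.net>\n"
    "To: davidkuta@yahoo.com\n"
    "Subject: Partnership\n"
)

sender_info = first_match(r'^From: .*', mail, flags=re.M)
sender_name = first_match(r'(?<=From: ").*(?=")', sender_info)   # no quotes here
sender_address = first_match(r'(?<=<).*(?=>)', sender_info)
subject_info = first_match(r'^Subject: .*', mail, flags=re.M)
subject = first_match(r'(?<=Subject: ).*', subject_info)

print(sender_name, sender_address, subject)
```

Because the helper accepts `None` as input, a missing header line propagates as `None` through the later steps without an explicit `if`.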

In [170]:
# Convert the results into a DataFrame
emails_df = pd.DataFrame(emails_list)
emails_df

| | sender_name | sender_address | recipient_name | recipient_address | date | subject | email_body |
|---|---|---|---|---|---|---|---|
| 0 | MR. JAMES NGOLA. | james_ngola2002@maktoob.com | | webmaster@aclweb | 31 Oct 2002 | URGENT BUSINESS ASSISTANCE AND PARTNERSHIP | `FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...` |
| 1 | Mr. Ben Suleman | bensul2004nng@spinfinder.com | | R@M | 31 Oct 2002 | URGENT ASSISTANCE /RELATIONSHIP (P) | `Dear Friend,\n\nI am Mr. Ben Suleman a custom ...` |
| 2 | PRINCE OBONG ELEME | obong_715@epatra.com | | webmaster@aclweb | 31 Oct 2002 | GOOD DAY TO YOU | `FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...` |
| 3 | PRINCE OBONG ELEME | obong_715@epatra.com | | webmaster@aclweb | 31 Oct 2002 | GOOD DAY TO YOU | `FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...` |
| 4 | Maryam Abacha | m_abacha03@www.com | | R@M | | I Need Your Assistance. | `Dear sir, \n \nIt is with a heart full of hope...` |
| 5 | | davidkuta@postmark.net | | davidkuta@yahoo | 02 Nov 2002 | Partnership | `ATTENTION: ...` |
| 6 | Barrister tunde dosumu | tunde_dosumu@lycos.com | | | | Urgent Attention | `Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)...` |
| 7 | William Drallo | william2244drallo@maktoob.com | | webmaster@aclweb | | URGENT BUSINESS PRPOSAL | `FROM: WILLIAM DRALLO.\nCONFIDENTIAL TEL: 233-2...` |
| 8 | MR USMAN ABDUL | abdul_817@rediffmail.com | | R@M | 04 Nov 2002 | THANK YOU | `CHALLENGE SECURITIES LTD.\nLAGOS, NIGERIA\n\n\...` |
| 9 | Tunde Dosumu | barrister_td@lycos.com | | | | Urgent Assistance | `Dear Sir,\n\nI am Barrister Tunde Dosumu (SAN)...` |
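
Some of the shortened or empty cells follow directly from the patterns used in the loop: `pattern5` stops at the first dot of the domain, `pattern2` only matches a quoted name, and `pattern8` needs a `", "` prefix and a two-digit day. That may explain part of the missing values, although a header can also simply be absent. The sketch below shows more tolerant variants. They are assumptions about the header shapes, not a full address or date grammar, and the `Date:` line is a made-up example.

```python
import re

# Address: word characters, dots, plus signs and hyphens around the '@'
address_pattern = r'[\w.+-]+@[\w.-]+'
# Name: text after 'From: ' up to '<', with optional surrounding quotes
name_pattern = r'^From: "?([^"<]*?)"?\s*<'
# Date: one- or two-digit day, three-letter month, four-digit year
date_pattern = r'\d{1,2} [A-Za-z]{3} \d{4}'

to_line = 'To: webmaster@aclweb.org'
from_line = 'From: Kuta David <davidkuta@postmark.net>'
date_line = 'Date: 3 Nov 2002 10:15:00 +0000'  # hypothetical header without a weekday

recipient_address = re.search(address_pattern, to_line).group()
sender_name = re.search(name_pattern, from_line).group(1)
date = re.search(date_pattern, date_line).group()

print(recipient_address)
print(sender_name)
print(date)
```

The name pattern uses a capture group rather than lookarounds, because the optional quote makes a fixed-width lookbehind awkward.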
