In [1]:
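# This notebook works on text scraped from a web page:
# 1. Extract the text passage (this step is specific to this page, not general-purpose)
# Normalization (Normalize)
#     2. Structure segmentation: split the document into sentences
#     3. Tokenization: split each sentence into a list of words
#     4. Text normalization: convert each word into a canonical form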

# Day 7: Document Preprocessing

First, I took a snippet of HTML from one of my websites:

In [2]:
text = '''
 <body>
    <!-- JavaScript plugins (requires jQuery) -->
    <script src="http://code.jquery.com/jquery.js"></script>
    <!-- Include all compiled plugins (below), or include individual files as needed -->
    <script src="js/bootstrap.min.js"></script>

    <div class="container">
        <div class="page-header">
            <h3>About Me</h3>
        </div>
        <div class="page-info">

A web developer with experience in a variety of exciting projects, with the most up-to-date and relevant programming foundations available. My wide experience in
a diversity of technologies guides me with the best way to get your business success.
My interest in academic leads me to research in the field of NLP(Natural Language 
Processing). Other than the knowledge in CS/IT, I'm also a broad learner who loves 
to read each and every kind of books.
        </div>
    </div>
</body>
'''

We can remove the HTML tags with a regular expression:

In [3]:
import re

text = re.sub("<[^>]+>", "", text).strip()
print(text)

About Me
        
        

A web developer with experience in a variety of exciting projects, with the most up-to-date and relevant programming foundations available. My wide experience in
a diversity of technologies guides me with the best way to get your business success.
My interest in academic leads me to research in the field of NLP(Natural Language 
Processing). Other than the knowledge in CS/IT, I'm also a broad learner who loves 
to read each and every kind of books.
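
The regular expression above is enough for this snippet because its `<script>` elements are empty. On a page with inline JavaScript or CSS, the code between the tags would be left in the text. The following is a minimal sketch of an alternative, assuming we only want visible text; it uses the standard library's `html.parser`, and the names `TextExtractor`, `extract_text`, and `sample` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text chunks of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample with inline script content, unlike the snippet above
sample = """
<body>
  <!-- a comment -->
  <script>var x = 1 > 0;</script>
  <h3>About Me</h3>
  <div class="page-info">A web developer.</div>
</body>
"""
print(extract_text(sample))  # ['About Me', 'A web developer.']
```

Each chunk arrives as its own list item, so the heading and the body text are already separated.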


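We can clearly see several line breaks between the heading (About Me) and the body text. Before tokenizing, we split off the heading at the blank line, keep only the body paragraph, and replace its remaining line breaks with spaces.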

In [4]:
text = text.split("\n\n")
print(text)
print("---------------------------")
text = text[1].replace("\n", " ")
print(text)

['About Me\n        \n        ', "A web developer with experience in a variety of exciting projects, with the most up-to-date and relevant programming foundations available. My wide experience in\na diversity of technologies guides me with the best way to get your business success.\nMy interest in academic leads me to research in the field of NLP(Natural Language \nProcessing). Other than the knowledge in CS/IT, I'm also a broad learner who loves \nto read each and every kind of books."]
---------------------------
A web developer with experience in a variety of exciting projects, with the most up-to-date and relevant programming foundations available. My wide experience in a diversity of technologies guides me with the best way to get your business success. My interest in academic leads me to research in the field of NLP(Natural Language  Processing). Other than the knowledge in CS/IT, I'm also a broad learner who loves  to read each and every kind of books.
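
Notice the double spaces left in the output (for example in "Language  Processing"): the original lines ended with a space before the line break. As a small sketch, assuming we want every run of whitespace collapsed into one space, a regular expression can do this in one pass. The helper name `normalize_whitespace` and the shortened `raw` string are only for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse any run of spaces, tabs, or line breaks into a single space."""
    return re.sub(r"\s+", " ", text).strip()


# Shortened stand-in for the paragraph above, with the same " \n" pattern
raw = "field of NLP(Natural Language \nProcessing). I'm a learner who loves \nto read."
print(normalize_whitespace(raw))
# field of NLP(Natural Language Processing). I'm a learner who loves to read.
```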


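Next, we can split the document into sentences. Python users will know that a plain .split() would be enough for this example, but we use the sentence segmenter that NLTK provides anyway, so that we are prepared for harder text later.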

In [5]:
# Everything above was only about extracting the text

# 2. Structure segmentation: split the document into sentences
import nltk
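nltk.download('punkt')
# Loads the pickled English Punkt model. Recent NLTK releases ship this model
# as 'punkt_tab' instead, so this path may not exist there; in that case
# nltk.sent_tokenize(text) is an alternative.
sent_segmenter = nltk.data.load('tokenizers/punkt/english.pickle')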

sentences = sent_segmenter.tokenize(text)
print(sentences)

['A web developer with experience in a variety of exciting projects, with the most up-to-date and relevant programming foundations available.', 'My wide experience in a diversity of technologies guides me with the best way to get your business success.', 'My interest in academic leads me to research in the field of NLP(Natural Language  Processing).', "Other than the knowledge in CS/IT, I'm also a broad learner who loves  to read each and every kind of books."]


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
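
To see why a simple split is fine here but fragile in general, here is a rough sketch of a naive splitter that breaks after sentence-ending punctuation. The function `naive_split` and both sample strings are made up for this illustration. A trained segmenter such as Punkt is designed to cope better with abbreviations like the one in the second sample.

```python
import re


def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


easy = "My interest leads me to NLP. I also love to read books."
hard = "Dr. Smith joined in 2019. She works on NLP."

print(naive_split(easy))
# ['My interest leads me to NLP.', 'I also love to read books.']
print(naive_split(hard))
# ['Dr.', 'Smith joined in 2019.', 'She works on NLP.']
```

The second result wrongly treats "Dr." as a complete sentence.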


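Besides the sentence segmenter, NLTK can also tokenize words. Let's compare Python's split with NLTK's regular-expression-based tokenizer on the first sentence of the sample document.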

In [6]:
# 3. Tokenization: split a sentence into a list of words
word_tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()

tokenized_sentence = word_tokenizer.tokenize(sentences[0])
print(tokenized_sentence)
# tokenized_sentence is passed to lemmatization and stemming below
print("------------")
# Compare with tokenized_sentence
print(sentences[0].split(" "))

['A', 'web', 'developer', 'with', 'experience', 'in', 'a', 'variety', 'of', 'exciting', 'projects', ',', 'with', 'the', 'most', 'up', '-', 'to', '-', 'date', 'and', 'relevant', 'programming', 'foundations', 'available', '.']
------------
['A', 'web', 'developer', 'with', 'experience', 'in', 'a', 'variety', 'of', 'exciting', 'projects,', 'with', 'the', 'most', 'up-to-date', 'and', 'relevant', 'programming', 'foundations', 'available.']
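
As far as I know, `WordPunctTokenizer` is a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`: a token is either a run of word characters or a run of punctuation. The sketch below applies that pattern with plain `re` to part of the sentence, to show where the splits come from. The name `word_punct_tokenize` is only for this illustration.

```python
import re

# Either a run of word characters, or a run of non-word, non-space characters
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(sentence):
    """Return word and punctuation tokens, in order."""
    return WORD_PUNCT.findall(sentence)


print(word_punct_tokenize("projects, with the most up-to-date foundations available."))
# ['projects', ',', 'with', 'the', 'most', 'up', '-', 'to', '-', 'date', 'foundations', 'available', '.']
```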


In [7]:
# An error was raised without this package, so download it
import nltk
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

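The NLTK tokenizer separates punctuation such as the comma from the words, and it also splits a word like "up-to-date" into individual tokens. This behavior is helpful in some cases, but in some applications it is not what we want.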
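
If an application needs hyphenated words kept whole, one option is a custom pattern. This is only a sketch under that assumption; the pattern and the name `tokenize_keep_hyphens` are my own and are not part of NLTK.

```python
import re

# A word may continue through "-word" groups; other punctuation stays separate
HYPHEN_AWARE = re.compile(r"\w+(?:-\w+)*|[^\w\s]+")


def tokenize_keep_hyphens(sentence):
    """Tokenize like a word/punctuation splitter, but keep hyphenated words whole."""
    return HYPHEN_AWARE.findall(sentence)


print(tokenize_keep_hyphens("the most up-to-date and relevant foundations available."))
# ['the', 'most', 'up-to-date', 'and', 'relevant', 'foundations', 'available', '.']
```

The trade-off is that "up-to-date" becomes a single vocabulary entry, unrelated to "date".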

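Next, let's try lemmatization. NLTK also has a lemmatizer, and using it normally requires knowing each word's part-of-speech tag first. In this example we simplify the process: we first lemmatize the input word as a verb, and if the word does not change, we try again as a noun.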

In [8]:
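# 4. Word normalization
# Method 1: lemmatization tries to restore each word to a form found in the dictionary
# This library really is still being updated; the output is split differently from what I saw before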
nltk.download('wordnet')
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

def lemmatize(word):
    lemma = lemmatizer.lemmatize(word,'v')
    if lemma == word:
        lemma = lemmatizer.lemmatize(word,'n')
    return lemma

print([lemmatize(token) for token in tokenized_sentence])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['A', 'web', 'developer', 'with', 'experience', 'in', 'a', 'variety', 'of', 'excite', 'project', ',', 'with', 'the', 'most', 'up', '-', 'to', '-', 'date', 'and', 'relevant', 'program', 'foundation', 'available', '.']
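
The verb-first shortcut turns "exciting" into "excite" and "programming" into "program", even though they act as an adjective and a noun in this sentence. With part-of-speech tags we could ask for the matching lemma instead. The sketch below shows only the tag-mapping step, assuming Penn Treebank tags and WordNet's one-letter POS codes. The tags are hand-written, and `penn_to_wordnet`, `lemmatize_tagged`, and `show_request` are names invented for this illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


def lemmatize_tagged(tagged_tokens, lemmatize_fn):
    """Call lemmatize_fn(word, pos) for each (word, tag) pair."""
    return [lemmatize_fn(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]


def show_request(word, pos):
    """Stand-in for a real lemmatizer: only shows which POS would be requested."""
    return f"{word}/{pos}"


# Hand-written tags for illustration; a tagger would normally supply them
tagged = [("exciting", "JJ"), ("projects", "NNS"), ("programming", "NN")]
print(lemmatize_tagged(tagged, show_request))
# ['exciting/a', 'projects/n', 'programming/n']
```

With NLTK, `lemmatizer.lemmatize` could be passed as `lemmatize_fn`, and the tags could come from `nltk.pos_tag`, which needs its own model download.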


Now let's try stemming too, using NLTK's built-in Porter Stemmer:

In [9]:
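# Method 2: stemming cuts off the suffix without caring whether the result is a dictionary word
# developer -> develop, though I think "er" (marking a person) is also important information
# The upside: when counting word frequencies it takes less space, and the merged forms usually have similar meanings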

stemmer = nltk.stem.porter.PorterStemmer()
print([stemmer.stem(token) for token in tokenized_sentence])

['a', 'web', 'develop', 'with', 'experi', 'in', 'a', 'varieti', 'of', 'excit', 'project', ',', 'with', 'the', 'most', 'up', '-', 'to', '-', 'date', 'and', 'relev', 'program', 'foundat', 'avail', '.']
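
The point about word frequencies can be shown with a small count. To keep the sketch self-contained, it uses a deliberately crude suffix stripper, `crude_stem`, which is a toy written for this example and is NOT the Porter algorithm. The token list is also made up.

```python
from collections import Counter


def crude_stem(word):
    """Toy suffix stripper for illustration only; not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Strip the suffix only if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = ["project", "projects", "Projects", "develop", "developed", "developing"]

raw_counts = Counter(tokens)
stem_counts = Counter(crude_stem(t) for t in tokens)

print(len(raw_counts), len(stem_counts))  # 6 2
print(stem_counts)  # Counter({'project': 3, 'develop': 3})
```

Six distinct surface forms collapse into two entries. In the notebook, `stemmer.stem` would take the place of `crude_stem`.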


Compare the text before and after lemmatization and stemming, and see what changes in each case!
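
One way to make this comparison easier is to line the three results up and print only the rows that changed. The lists below are copied from the outputs shown in this notebook; the variable names are just for this sketch.

```python
tokens = ['A', 'web', 'developer', 'with', 'experience', 'in', 'a', 'variety', 'of',
          'exciting', 'projects', ',', 'with', 'the', 'most', 'up', '-', 'to', '-',
          'date', 'and', 'relevant', 'programming', 'foundations', 'available', '.']
lemmas = ['A', 'web', 'developer', 'with', 'experience', 'in', 'a', 'variety', 'of',
          'excite', 'project', ',', 'with', 'the', 'most', 'up', '-', 'to', '-',
          'date', 'and', 'relevant', 'program', 'foundation', 'available', '.']
stems = ['a', 'web', 'develop', 'with', 'experi', 'in', 'a', 'varieti', 'of',
         'excit', 'project', ',', 'with', 'the', 'most', 'up', '-', 'to', '-',
         'date', 'and', 'relev', 'program', 'foundat', 'avail', '.']

# Keep only the positions where lemmatization or stemming changed the token
changed = [(t, l, s) for t, l, s in zip(tokens, lemmas, stems) if t != l or t != s]

print(f"{'token':<12} {'lemma':<12} stem")
for token, lemma, stem in changed:
    print(f"{token:<12} {lemma:<12} {stem}")
```

Ten of the 26 tokens change. The lemmatizer altered only four of them (exciting, projects, programming, foundations), while the stemmer altered all ten and also lowercased "A".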