# NLTK Usage Notes

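Common NLP packages include NLTK, spaCy, and scikit-learn. NLTK is mainly used for tokenization, word-frequency counting, POS tagging, and similar tasks. These notes cover how to use NLTK.
- Comparison of NLP packages:
https://www.jiqizhixin.com/articles/080502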

## Basic NLP Tasks with NLTK

1. **Install NLTK** - pip install nltk
<br>
2. **Install the NLTK data packages** - run nltk.download() and choose all-corpora
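
The downloader can also be driven from code instead of the interactive window. The sketch below is one way to fetch the data these notes appear to rely on. The identifiers in `RESOURCES` are my assumption about what the later cells need, not an official list: `all-corpora` covers corpora such as `stopwords` and `wordnet`, while the tokenizer and tagger models are separate packages and, depending on the NLTK version, may go by slightly different names (for example `punkt_tab`). Adjust the list if a cell raises a `LookupError`.

```python
import nltk

# Assumed resource ids for the cells in these notes; not an official list.
RESOURCES = [
    "book",                        # corpora behind `from nltk.book import *`
    "stopwords",                   # stopword lists
    "wordnet",                     # used by WordNetLemmatizer
    "punkt",                       # sentence/word tokenizer model
    "averaged_perceptron_tagger",  # model behind nltk.pos_tag
]


def download_resources(names=RESOURCES):
    """Download each resource quietly and report which ones succeeded."""
    return {name: nltk.download(name, quiet=True) for name in names}
```

Calling `download_resources()` once, with network access, returns a mapping from each resource id to whether its download succeeded.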

In [1]:
import nltk
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
# texts()
# sents()
print (text7) # as listed above, text7 is the Wall Street Journal
print (sent7) # sentence

<Text: Wall Street Journal>
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']


### Counting vocabulary of words

In [3]:
print (len(sent7))
print (sent7)
print ("=================")
print (len(text7)) # number of tokens (words, punctuation, and other symbols)
print (len(set(text7))) # number of distinct tokens
print (list(set(text7))[:10])

18
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
=================
100676
12408
['very', 'fancy', 'spokeswoman', 'sinister', 'place', '*T*-34', 'Chairman', '*T*-70', 'reinvestment', 'Schwab']
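
The two counts above give a rough measure of how varied the vocabulary is: 12408 distinct tokens out of 100676 is about 0.12, so each distinct token is used roughly 8 times on average. A minimal sketch of that arithmetic follows. It writes out the `sent7` tokens as literal data so it runs without the corpus, and `lexical_diversity` is a helper name introduced here, not an NLTK function.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for empty input)."""
    tokens = list(tokens)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Same tokens as sent7 above, written out so no corpus download is needed.
sent7_tokens = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will',
                'join', 'the', 'board', 'as', 'a', 'nonexecutive',
                'director', 'Nov.', '29', '.']

print(len(sent7_tokens), len(set(sent7_tokens)))   # 18 tokens, 17 distinct
print(round(lexical_diversity(sent7_tokens), 3))   # 17 / 18
print(round(12408 / 100676, 3))                    # counts reported for text7
```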


### Frequency of words

In [4]:
dist = FreqDist(text7)
print (dist)  # the sample count matches len(set(text7))
dist # note that many symbols and numbers are included, and uppercase and lowercase forms count as separate words

<FreqDist with 12408 samples and 100676 outcomes>


FreqDist({',': 4885, 'the': 4045, '.': 3828, 'of': 2319, 'to': 2164, 'a': 1878, 'in': 1572, 'and': 1511, '*-1': 1123, '0': 1099, ...})
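
`FreqDist` behaves like a counter over tokens, so the usual lookups apply. The sketch below uses a small made-up token list (not the WSJ corpus) to show indexing, the `N()` and `B()` totals, and `most_common`. It also shows how lowercasing plus an `isalpha()` filter merges case variants and drops punctuation.

```python
from nltk import FreqDist

# Toy tokens invented for illustration; text7 would work the same way.
tokens = ['The', 'market', 'fell', ',', 'but', 'the', 'market',
          'recovered', '.', 'The', 'end', '.']

raw = FreqDist(tokens)
print(raw['The'], raw['the'])      # case variants are counted separately
print(raw.N(), raw.B())            # total tokens, distinct tokens

# Normalize: lowercase and keep alphabetic tokens only.
clean = FreqDist(t.lower() for t in tokens if t.isalpha())
print(clean.most_common(2))        # the two most frequent normalized tokens
print(clean['the'])                # 'The' and 'the' now share one count
```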

In [5]:
vocab1 = dist.keys() # a dict_keys view
list(vocab1)[:10]
# dist['the']

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [6]:
# Keep only words that occur more than 100 times and are longer than 5 characters: a crude way to drop stopwords, symbols, and numbers
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']

### [Stemming vs Lemmatization](https://devopedia.org/images/article/227/6785.1570815200.png)

- [stemming](https://zh.wikipedia.org/wiki/词干提取): reducing a word to its stem or root
<br>
ex. cats, catlike, catty -> cat; stemmer, stemming, stemmed -> stem
- lemmatization: restoring a word to its dictionary form (lemma)
<br>
ex. ate -> eat; are, is, am -> be

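**Comparison**: stemming usually just strips suffixes (such as s, ed, or ing), so the result can be crude; lemmatization maps the word back to its base form. See the example below: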

In [7]:
# stemming
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('wolves')

'wolv'
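
Running the same stemmer over a few more words shows the pattern: it applies suffix-stripping rules without checking whether the result is a dictionary word. A small sketch using the same `PorterStemmer` class, with a word list chosen here for illustration:

```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

# Words picked for illustration; the stemmer needs no downloaded data.
words = ['cats', 'stemming', 'stemmed', 'wolves']
stems = {w: porter_stemmer.stem(w) for w in words}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```

The inflected forms of stem collapse to a single stem, which is handy for counting, even though a stem such as wolv is not itself a word.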

In [8]:
# lemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('wolves')

'wolf'

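Stemming simply strips the trailing es, and the result, wolv, no longer means anything on its own. The lemmatizer keeps the original meaning and only reduces the word to its singular form.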
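
One caveat: `WordNetLemmatizer.lemmatize` treats its input as a noun unless a part of speech is passed, so verb examples such as ate -> eat only come out when `pos='v'` is given. The sketch below returns both readings side by side. `lemmatize_both` is a helper name introduced here, and calling it assumes the `wordnet` data is installed.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


def lemmatize_both(word):
    """Return (lemma as noun, lemma as verb) so the two can be compared."""
    as_noun = lemmatizer.lemmatize(word)            # default part of speech: noun
    as_verb = lemmatizer.lemmatize(word, pos='v')   # explicitly a verb
    return as_noun, as_verb
```

With the data available, `lemmatize_both('ate')` gives `('ate', 'eat')`, and the verb lemma of `are` is `be`.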

### Tokenization
Extract individual words (tokens) from the text.

In [9]:
# word tokenization
text = "Children shouldn't drink a sugary drink before bed."
nltk.word_tokenize(text)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
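
For comparison, plain `str.split` only cuts on whitespace, so the contraction and the final period stay attached to their words. The sketch below contrasts it with `TreebankWordTokenizer`, a rule-based tokenizer from `nltk.tokenize` that needs no downloaded model. It is used here as a stand-in for `nltk.word_tokenize`, and for this single sentence it should produce the same tokens as the output above.

```python
from nltk.tokenize import TreebankWordTokenizer

text = "Children shouldn't drink a sugary drink before bed."

# Whitespace splitting keeps "shouldn't" and "bed." as single tokens.
naive_tokens = text.split()

# Rule-based tokenizer: splits the contraction and the sentence-final period.
treebank_tokens = TreebankWordTokenizer().tokenize(text)

print(naive_tokens)
print(treebank_tokens)
```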

In [10]:
# sentence tokenization
text = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text)
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']

### Stopwords

In [11]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [12]:
# Combine this with the lemmatization covered above
# 1. lemmatize, 2. convert all letters to lowercase (normalization), 3. drop non-alphabetic tokens, 4. remove stopwords
stop_words = stopwords.words('english')
text = [lemmatizer.lemmatize(w).lower() for w in text7 if lemmatizer.lemmatize(w).lower() not in stop_words and w.isalpha()]

dist = FreqDist(text)
print (dist)
dist
# compared with the earlier result, only 8245 distinct words remain

<FreqDist with 8245 samples and 45459 outcomes>


FreqDist({'said': 628, 'million': 389, 'wa': 370, 'ha': 339, 'year': 329, 'new': 328, 'company': 323, 'say': 285, 'market': 239, 'stock': 239, ...})
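
The tokens `wa` and `ha` in this result are most likely `was` and `has`: the lemmatizer treats every token as a noun by default and strips the final s, and the shortened form no longer matches the stopword list. One possible variation, sketched below, lowercases each token and filters stopwords first, then lemmatizes only what survives. `normalize` is a hypothetical helper, and the stopword list and lemmatizer in the demo are small stand-ins so the snippet runs without NLTK data.

```python
def normalize(tokens, stop_words, lemmatize):
    """Lowercase, drop non-alphabetic tokens and stopwords, then lemmatize."""
    stop_set = set(stop_words)          # set lookup is faster than a list
    result = []
    for token in tokens:
        lowered = token.lower()
        if not lowered.isalpha() or lowered in stop_set:
            continue                    # filter BEFORE lemmatizing
        result.append(lemmatize(lowered))
    return result


# Stand-ins for illustration only (not NLTK's stopword list or lemmatizer).
demo_stop_words = ['was', 'has', 'the', 'a']


def demo_lemmatize(word):
    """Crude plural rule: strip one trailing 's'."""
    return word[:-1] if word.endswith('s') else word


tokens = ['The', 'company', 'was', 'selling', 'shares', ',', 'it', 'has', '29']
print(normalize(tokens, demo_stop_words, demo_lemmatize))
```

With the notebook's objects the call would be `normalize(text7, stopwords.words('english'), lemmatizer.lemmatize)`. The resulting counts can be expected to differ from the 8245 samples shown above.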

### POS tagging (part-of-speech tagging 詞性標記)

In [13]:
text = "Children shouldn't drink a sugary drink before bed."
text = nltk.word_tokenize(text)
nltk.pos_tag(text)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
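
The tags come from the Penn Treebank tag set. The sketch below takes the tagged output above as literal data (so no tagger model is needed), attaches a short description to each tag, and pulls out the nouns. `TAG_MEANINGS` is written out by hand here and covers only the tags in this sentence. Note that `Children` is tagged `NNP` (proper noun), probably because it is capitalized at the start of the sentence; a plural common noun would normally be `NNS`.

```python
tagged = [('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'),
          ('drink', 'VB'), ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'),
          ('before', 'IN'), ('bed', 'NN'), ('.', '.')]

# Hand-written glosses for the Penn Treebank tags seen in this sentence.
TAG_MEANINGS = {
    'NNP': 'proper noun, singular',
    'MD': 'modal',
    'RB': 'adverb',
    'VB': 'verb, base form',
    'DT': 'determiner',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'IN': 'preposition or subordinating conjunction',
    '.': 'sentence-final punctuation',
}

for word, tag in tagged:
    print(f'{word:10} {tag:4} {TAG_MEANINGS.get(tag, "?")}')

# All noun tags start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```

The word drink appears twice with different tags (`VB` and `NN`), which shows that the tagger uses context rather than a fixed per-word lookup.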

### References:
- https://medium.com/pyladies-taiwan/nltk-初學指南-一-簡單易上手的自然語言工具箱-探索篇-2010fd7c7540
- https://www.coursera.org/learn/python-text-mining/home/welcome