<a href="https://colab.research.google.com/github/seunghyunmoon2/NLP/blob/master/NLP2_POStagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. # POS (Part-of-Speech) Tagging
* Part-of-speech ('품사') tagging
    * [nltk_documentation](https://www.nltk.org/book/ch05.html)
    * [in_korean](https://excelsior-cjh.tistory.com/71)

In [None]:
#POS TAGGING
import nltk

#see how it works
text = "And now for the first time in forever there will be music"
token = nltk.word_tokenize(text)
nltk.pos_tag(token)
"""
[('And', 'CC'),
 ('now', 'RB'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('first', 'JJ'),
 ('time', 'NN'),
 ('in', 'IN'),
 ('forever', 'NN'),
 ('there', 'EX'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('music', 'NN')]"""
print(type(nltk.pos_tag(token))) #list

## Korean version
1. The Sejong (?) corpus from the National Institute of Korean Language (국립국어원)   
2. konlpy

## Tagged Corpus

* There are many different tagged corpora, published by different organizations (universities, etc.).

In [None]:
#Penn treebank
print(nltk.corpus.treebank.tagged_words()[:20])

#Brown Corpus
print(nltk.corpus.brown.tagged_words()[:20])
#Penn Treebank again, this time mapped to the universal tagset
print(nltk.corpus.treebank.tagged_words(tagset='universal')[:20])

#NPS Chat
#emoticons get a tag too, e.g. (':P', 'UH')
print(nltk.corpus.nps_chat.tagged_words()[:20])

#CONLL Corpus2000
print(nltk.corpus.conll2000.tagged_words()[:20])

#frequency distribution of tags in Brown Corpus
#from nltk.corpus import brown
brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news',tagset='universal')
tag_fd = nltk.FreqDist(tag for (word,tag) in brown_news_tagged)
tag_fd.most_common()

2. # Hidden Markov Model

* To tag better, an HMM scores each candidate tag sequence for the sentence we want to **tag** and keeps the most probable one.
    * A sentence such as `they can fish` has several possible tag sequences,
    * e.g. **N V N** or **N V V**.

* HMMs apply only to `sequential` data.
    * LSTM models, which come later, are also built for sequential data.
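
As a rough sketch of what "scoring a tag sequence" means, the snippet below multiplies transition and emission probabilities for the two candidate sequences of `they can fish`. Every number in `transition` and `emission` is invented for illustration, and `sequence_score` is a hypothetical helper, not an NLTK function.

```python
# Toy HMM score: product over positions of P(tag | previous tag) * P(word | tag).
# All probabilities are made up for illustration; they are not estimated from a corpus.
transition = {
    ("START", "N"): 0.8, ("START", "V"): 0.2,
    ("N", "N"): 0.3, ("N", "V"): 0.7,
    ("V", "N"): 0.6, ("V", "V"): 0.4,
}
emission = {
    ("N", "they"): 0.10, ("V", "they"): 0.0,
    ("N", "can"): 0.01, ("V", "can"): 0.05,
    ("N", "fish"): 0.02, ("V", "fish"): 0.01,
}

def sequence_score(words, tags):
    """Joint probability of the words and one candidate tag sequence."""
    score = 1.0
    prev = "START"
    for word, tag in zip(words, tags):
        score *= transition[(prev, tag)] * emission[(tag, word)]
        prev = tag
    return score

words = ["they", "can", "fish"]
score_nvn = sequence_score(words, ["N", "V", "N"])
score_nvv = sequence_score(words, ["N", "V", "V"])
print("N V N:", score_nvn)
print("N V V:", score_nvv)
```

With these toy numbers the **N V N** reading scores higher, so a tagger built on them would choose it.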

1. ## Simple introduction to HMM with the Forward, Backward, Viterbi and Baum-Welch algorithms
* There are two kinds of days (the hidden states): *rainy days* and *sunny days*.
* Each day we do one of three activities (the observations): *cleaning*, *going shopping*, *going for a walk*.
* Say we want the probability that day 4 is **rainy** and we go **walking**, given the history of the previous three days.

* ## might add some graphics later

![copyright 아마추어퀀트](https://drive.google.com/uc?export=view&id=17hEqQIJDwMMnDzysgrFxpItDwIPInQ5C)

2. ## Forward Algorithm

In [None]:
"""!pip install hmmlearn"""
import numpy as np
from hmmlearn import hmm

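# Define the hidden states
states = ["Rainy", "Sunny"]
nState = len(states)

# Define the observable activities
observations = ["Walk", "Shop", "Clean"]
nObservation = len(observations)

# Build the HMM with fixed (not learned) parameters.
# Note: recent hmmlearn releases provide this categorical model as hmm.CategoricalHMM;
# MultinomialHMM matches the hmmlearn version used when this notebook was written.
model = hmm.MultinomialHMM(n_components=nState)
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3], [0.4, 0.6]])
model.emissionprob_ = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

# Observation sequence, as a column vector of indices into `observations`
X = np.array([[0, 2, 1]]).T  # Walk, Clean, Shop

# Use the forward algorithm (via score) to compute the log-likelihood of observing X
logL = model.score(X)
p = np.exp(logL)
print("\nProbability of [Walk, Clean, Shop] = %.4f%s" % (p*100, '%'))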


* output

```
In [13]: logL
Out[13]: -3.472543018732704

Probability of [Walk, Clean, Shop] = 3.1038%
```
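
To see what `model.score` computes, here is a minimal pure-Python sketch of the forward recursion using the same start, transition and emission values as the cell above. The function name `forward_probability` is made up for this example. It sums over all hidden paths, one time step at a time.

```python
# Same parameters as the hmmlearn model above (state 0 = Rainy, state 1 = Sunny).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]           # trans[i][j] = P(next state j | state i)
emit = [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]]  # emit[i][o] = P(observation o | state i)

def forward_probability(obs):
    """Return P(obs) by the forward algorithm (hypothetical helper)."""
    n = len(start)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)

p_forward = forward_probability([0, 2, 1])  # Walk, Clean, Shop
print("P([Walk, Clean, Shop]) = %.6f" % p_forward)
```

Working through the recursion by hand gives 0.031038, which agrees with the 3.1038% reported by hmmlearn.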



3. ## Viterbi Algorithm

In [None]:
import numpy as np
from hmmlearn import hmm

# Define the hidden states
states = ["Rainy", "Sunny"]
nState = len(states)

# Define the observable activities
observations = ["Walk", "Shop", "Clean"]
nObservation = len(observations)

# Build the HMM with fixed (not learned) parameters
model = hmm.MultinomialHMM(n_components=nState)
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3], [0.4, 0.6]])
model.emissionprob_ = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])

# Observation sequence: Walk, Clean, Shop, Walk
X = np.array([[0, 2, 1, 0]]).T

# Estimate (decode) the most likely hidden state sequence with the Viterbi algorithm
logprob, Z = model.decode(X, algorithm="viterbi")

# Print the result (X is a column vector, so flatten it before indexing)
print("\n  Observation Sequence :", ", ".join(observations[i] for i in X.ravel()))
print("Hidden State Sequence :", ", ".join(states[i] for i in Z))
print("Probability = %.6f" % np.exp(logprob))

* output

```
Observation Sequence : Walk, Clean, Shop, Walk
Hidden State Sequence : Sunny, Rainy, Rainy, Sunny
Probability = 0.002419
```
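
The decoding step can also be written out by hand. The sketch below is a minimal pure-Python Viterbi over the same parameters; `viterbi` is a hypothetical helper, not the hmmlearn implementation. It differs from the forward recursion in one place: it takes the maximum over previous states instead of the sum, and it keeps backpointers to recover the best path.

```python
states = ["Rainy", "Sunny"]
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]]  # columns: Walk, Shop, Clean

def viterbi(obs):
    """Return (best state index path, probability of that path)."""
    n = len(states)
    # delta[s] = probability of the best path that ends in state s
    delta = [start[s] * emit[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        new_delta, pointers = [], []
        for s in range(n):
            best_prev = max(range(n), key=lambda p: delta[p] * trans[p][s])
            pointers.append(best_prev)
            new_delta.append(delta[best_prev] * trans[best_prev][s] * emit[s][o])
        delta = new_delta
        backpointers.append(pointers)
    # Follow the backpointers from the best final state to the first day
    last = max(range(n), key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]

best_path, best_prob = viterbi([0, 2, 1, 0])  # Walk, Clean, Shop, Walk
print("Hidden State Sequence :", ", ".join(states[i] for i in best_path))
print("Probability = %.6f" % best_prob)
```

This reproduces the hmmlearn result: Sunny, Rainy, Rainy, Sunny with probability 0.002419.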

4. ## Baum-Welch Algorithm

In [None]:
# Using only the observation sequence, estimate the initial, transition and
# emission probabilities, then infer the hidden state sequence (Baum-Welch algorithm).
#
# 2019.3.27 아마추어퀀트 (blog.naver.com/chunjein)
# ----------------------------------------------------------------------
import numpy as np
from hmmlearn import hmm
np.set_printoptions(precision=2)

nState = 2
pStart = [0.6, 0.4]
pTran = [[0.7, 0.3], [0.2, 0.8]]
pEmit = [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]]

# 1. Generate an observation sequence from the given probability distributions.
# ---------------------------------------------------
# Pick the initial hidden state with probabilities [0.6, 0.4]
s = np.argmax(np.random.multinomial(1, pStart, size=1))

X = []      # observation sequence
Z = []      # hidden state sequence
for i in range(5000):
    # Emit an observation: Walk, Shop or Clean?
    a = np.argmax(np.random.multinomial(1, pEmit[s], size=1))
    X.append(a)
    Z.append(s)
    
    # Hidden state transition
    s = np.argmax(np.random.multinomial(1, pTran[s], size=1))

X = np.array(X)
X = np.reshape(X, [len(X), 1])
Z = np.array(Z)

# 2. Using only the observation sequence, estimate the start, transition and
#    emission probabilities, then the hidden state sequence (Baum-Welch algorithm).
# EM can get stuck in a local optimum, so fit 5 times and keep the run with the
# smallest negative log-likelihood (i.e. the highest likelihood).
# ---------------------------------------------------------------------------------
zHat = np.zeros(len(Z))
minprob = 999999999
for k in range(5):
    model = hmm.MultinomialHMM(n_components=nState, tol=0.0001, n_iter=10000)
    model = model.fit(X)
    predZ = model.predict(X)
    logprob = -model.score(X)  # negative log-likelihood: lower is better
    
    if logprob < minprob:
        best_model = model  # remember the best run for the report below
        zHat = predZ
        T = model.transmat_
        E = model.emissionprob_
        minprob = logprob
    print("k = %d, logprob = %.2f" % (k, logprob))

# 3. Measure how well the estimated zHat matches the Z generated in step 1.
# --------------------------------------------------------
# The fitted model may index [Rainy, Sunny] as 0,1 or as 1,0, so if the accuracy
# is below 0.5 we swap the two state labels.
accuracy = (Z == zHat).sum() / len(Z)
if accuracy < 0.5:
    T = np.fliplr(np.flipud(T))
    E = np.flipud(E)
    zHat = 1 - zHat
    print("flipped")

accuracy = (Z == zHat).sum() / len(Z)
print("\naccuracy = %.2f %s" % (accuracy * 100, '%'))

# Print the estimates (taken from the best run, not simply the last one).
# The start probabilities stay in the model's own state order; they are not flipped above.
print("\nlog prob = %.2f" % minprob)
print("\nstart prob :\n", best_model.startprob_)
print("\ntrans prob :\n",T)
print("\nemiss prob :\n", E)
print("\niteration = ", best_model.monitor_.iter)


* output

```
k = 0, logprob = 5353.05
k = 1, logprob = 5353.05
k = 2, logprob = 5353.05
k = 3, logprob = 5353.05
k = 4, logprob = 5353.05
flipped

accuracy = 77.02 %   # quite high

log prob = 5353.05

start prob :
 [1.00e+000 5.66e-172]

trans prob :  # close to the pTran matrix in the code above
 [[0.68 0.32]
 [0.21 0.79]]

emiss prob :  # close to the pEmit matrix in the code above
 [[0.09 0.42 0.49]
 [0.61 0.28 0.11]]

iteration =  352
```
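
The "flipped" step deserves a closer look: EM has no way to know which hidden state we call Rainy, so the fitted states may come out in either order. Relabeling means permuting the rows and columns of the transition matrix and the rows of the emission matrix consistently. Below is a small sketch with a hypothetical helper `relabel_states`. The "raw" matrices are the printed estimates with the flip undone, and the start vector is rounded to `[1.0, 0.0]` for readability.

```python
def relabel_states(perm, start, trans, emit):
    """Reorder hidden states so that new state k is old state perm[k]."""
    new_start = [start[i] for i in perm]
    new_trans = [[trans[i][j] for j in perm] for i in perm]  # rows and columns
    new_emit = [list(emit[i]) for i in perm]                 # rows only
    return new_start, new_trans, new_emit

# Estimates as the model would hold them before the flip (state order 1, 0).
start_raw = [1.0, 0.0]
trans_raw = [[0.79, 0.21], [0.32, 0.68]]
emit_raw = [[0.61, 0.28, 0.11], [0.09, 0.42, 0.49]]

new_start, new_trans, new_emit = relabel_states([1, 0], start_raw, trans_raw, emit_raw)
print(new_trans)
print(new_emit)
```

For two states this is the same operation as `np.fliplr(np.flipud(T))` and `np.flipud(E)` in the cell above. The permutation form generalizes to more than two states, where a flip is no longer enough.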

# Back to NLTK CORPUS tagging

## some tagging exercises


Read the POS tags of the Brown corpus and estimate the HMM parameters (transition and emission probabilities) from the counts.


In [None]:
tagged_words = []
all_tags = []

for sent in nltk.corpus.brown.tagged_sents(tagset='universal'):
    tagged_words.append(('START','START'))
    all_tags.append('START')
    for (word, tag) in sent:
        all_tags.append(tag)
        tagged_words.append((tag,word))
    tagged_words.append(('END','END'))
    all_tags.append('END')
    
# Transition probability (bigrams of tags)
cfd_tags = nltk.ConditionalFreqDist(nltk.bigrams(all_tags))
cpd_tags = nltk.ConditionalProbDist(cfd_tags,nltk.MLEProbDist)
print('Count(\'DET\',\'NOUN\') =',cfd_tags['DET']['NOUN']) # number of times a NOUN follows a DET
print('P(\'NOUN\'|\'DET\') =',cpd_tags['DET'].prob('NOUN')) # P(next tag is NOUN | current tag is DET)

# Emission probability
cfd_tagwords = nltk.ConditionalFreqDist(tagged_words)
cpd_tagwords = nltk.ConditionalProbDist(cfd_tagwords,nltk.MLEProbDist)

print('Count(\'DET\',\'the\') =',cfd_tagwords['DET']['the']) # number of times 'the' is tagged DET
print('P(\'the\'|\'DET\') =',cpd_tagwords['DET'].prob('the')) # P(word is 'the' | tag is DET)

```
Count('DET','NOUN') = 85838
P('NOUN'|'DET') = 0.6264678621213117

Count('DET','the') = 62710
P('the'|'DET') = 0.45767375327509324
```
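
The maximum-likelihood estimates above are just relative frequencies. The sketch below does the same counting with plain `collections` on a three-sentence corpus that is invented for illustration, so the arithmetic can be checked by hand. `mle` is a hypothetical helper, and unlike the cell above it does not count END to START transitions between sentences.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical data, not from Brown).
tagged_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

transition_counts = defaultdict(Counter)  # previous tag -> Counter of next tags
emission_counts = defaultdict(Counter)    # tag -> Counter of words
for sent in tagged_sents:
    prev = "START"
    for word, tag in sent:
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        prev = tag
    transition_counts[prev]["END"] += 1

def mle(counts, condition, outcome):
    """Relative frequency: count(condition, outcome) / count(condition)."""
    total = sum(counts[condition].values())
    return counts[condition][outcome] / total if total else 0.0

p_noun_given_det = mle(transition_counts, "DET", "NOUN")  # 2 of 3 DETs precede a NOUN
p_the_given_det = mle(emission_counts, "DET", "the")      # 2 of 3 DET tokens are 'the'
print(p_noun_given_det, p_the_given_det)
```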

## POS tags that precede NOUNs


In [None]:
word_tag_pairs = nltk.bigrams(brown_news_tagged)
bigram = list(word_tag_pairs)  # materialize the generator of ((word, tag), (word, tag)) pairs
print(bigram[:10])

> The output is a list of bigrams, each made of two adjacent (word, tag) pairs.
```
[(('The', 'DET'), ('Fulton', 'NOUN')), (('Fulton', 'NOUN'), ('County', 'NOUN')), (('County', 'NOUN'), ('Grand', 'ADJ')), (('Grand', 'ADJ'), ('Jury', 'NOUN')), (('Jury', 'NOUN'), ('said', 'VERB')), (('said', 'VERB'), ('Friday', 'NOUN')), (('Friday', 'NOUN'), ('an', 'DET')), (('an', 'DET'), ('investigation', 'NOUN')), (('investigation', 'NOUN'), ('of', 'ADP')), (('of', 'ADP'), ("Atlanta's", 'NOUN'))]
```

In [None]:
brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a,b) in word_tag_pairs if b[1] =='NOUN']
fdist = nltk.FreqDist(noun_preceders)
print([tag for (tag,_) in fdist.most_common()])

> This prints, from most to least frequent, the POS tags that appear directly before a `NOUN` in the news category.
output:
```
['NOUN', 'DET', 'ADJ', 'ADP', '.', 'VERB', 'CONJ', 'NUM', 'ADV', 'PRT', 'PRON', 'X']
```

## N-Gram Tagging


In [None]:
from nltk.corpus import brown

### Unigram Tagger

In [None]:
# compare two versions of a sentence: the corpus's gold tags and the tags predicted by 'UnigramTagger'
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) # fitting
print(unigram_tagger.tag(brown_sents[2007])) # predict tags for one sentence

# accuracy measured on the training data itself, so this number is optimistic
# (newer NLTK releases name this method .accuracy() and deprecate .evaluate())
unigram_tagger.evaluate(brown_tagged_sents) # 0.9349006503968017

# split the dataset into train and test sets, then evaluate POS tagging on the test set
size = int(len(brown_tagged_sents)*0.9)
size  # 4160

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents) #fitting
unigram_tagger.evaluate(test_sents) #  0.8121200039868434

### Bigram Tagger

In [None]:
# bigram Tagger
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
unseen_sent = brown_sents[4203]
print(bigram_tagger.tag(unseen_sent))

# evaluate. Accuracy is very low for the bigram tagger alone: when a sequence such as
# NNS VBG never occurs in training, P(VBG|NNS)=0 (the sparse data problem), and that
# zero wipes out the whole product (Π) of probabilities for the sentence.
# This is the trade-off between accuracy and coverage.
bigram_tagger.evaluate(test_sents) # 0.10206319146815508


### Combining Tagger (Backoff Tagger)

In [None]:
# Combining taggers
# 1. try tagging the token with the most specific tagger (here the trigram tagger, then the bigram tagger)
# 2. if that tagger is unable to find a tag for the token, try the unigram tagger
# 3. if the unigram tagger is also unable to find a tag, use a default tagger.

t0 = nltk.DefaultTagger('NN') # default: tag everything as NN (noun)
t1 = nltk.UnigramTagger(train_sents, backoff = t0)
t2 = nltk.BigramTagger(train_sents, backoff = t1)
t3 = nltk.TrigramTagger(train_sents, backoff = t2)
t3.evaluate(test_sents) # 0.843317053722715, a large improvement over 0.1020... for the bigram tagger alone


# HMM Tagger
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_sents)
hmm_tagger.evaluate(test_sents) # 0.3166550383733679
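
The backoff chain built in the cell above (`t0` to `t3`) can be sketched without NLTK. The toy tagger below is an assumption-laden illustration, not NLTK's implementation: it looks up a (previous tag, word) context first, falls back to the word alone, and finally to a default tag. The two training sentences and the name `tag_with_backoff` are invented for the example.

```python
from collections import Counter, defaultdict

# Tiny hypothetical training set.
train = [
    [("I", "PRON"), ("go", "VERB"), ("to", "PRT"), ("school", "NOUN")],
    [("I", "PRON"), ("go", "VERB"), ("home", "NOUN")],
]

unigram_counts = defaultdict(Counter)  # word -> tag counts
bigram_counts = defaultdict(Counter)   # (previous tag, word) -> tag counts
for sent in train:
    prev_tag = None
    for word, tag in sent:
        unigram_counts[word][tag] += 1
        bigram_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag

def tag_with_backoff(words, default="NOUN"):
    """Bigram lookup first, then unigram, then the default tag."""
    tagged, prev_tag = [], None
    for word in words:
        context = (prev_tag, word)
        if context in bigram_counts:
            tag = bigram_counts[context].most_common(1)[0][0]
        elif word in unigram_counts:
            tag = unigram_counts[word].most_common(1)[0][0]
        else:
            tag = default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

result = tag_with_backoff(["I", "go", "to", "klasdfas", "home"])
print(result)
```

Here `klasdfas` falls through to the default tag, and `home` is rescued by the unigram level because the context (NOUN, home) was never seen in training. Every token ends up with a tag, which is why the combined tagger scores so much higher than the bigram tagger alone.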

## unknown words

Unknown words are a problem, and we need to understand how each tagger handles them.


In [None]:
# unknown word tagging.
text = 'I go to school in the klasdfas'
token = text.split()
print(unigram_tagger.tag(token))
"""
[('I', 'PPSS'), ('go', 'VB'), ('to', 'TO'), ('school', 'NN'), ('in', 'IN'), ('the', 'AT'), ('klasdfas', None)]
"""

In [None]:
print(bigram_tagger.tag(token))

`[('I', 'PPSS'), ('go', 'VB'), ('to', 'TO'), ('school', None), ('in', None), ('the', None), ('klasdfas', None)]`

In [None]:
print(nltk.pos_tag(token))


```
[('I', 'PRP'), ('go', 'VBP'), ('to', 'TO'), ('school', 'NN'), ('in', 'IN'), ('the', 'DT'), ('klasdfas', 'NN')]
```


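> The unigram tagger leaves only the unknown word untagged (`None`).
The bigram tagger fails on more than the unknown word: once it meets a context it never saw in training (here already at 'school'), it returns `None`, and every later token also gets `None` because its context now contains a `None` tag.   

> pos_tag() tags the unknown word as 'NN' (noun), presumably from the surrounding context or as a default.  

> Note that the unigram and bigram taggers were trained on a small data set (the Brown news category),
whereas pos_tag was trained on far more data.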
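
The `None` cascade in the bigram output can be reproduced with a toy tagger that has no backoff. This is a simplified sketch, not NLTK's `BigramTagger`: the single training sentence is invented, and `"<s>"` is an assumed marker for the sentence start.

```python
from collections import Counter, defaultdict

# One hypothetical training sentence, using Brown-style tags.
train = [[("I", "PPSS"), ("go", "VB"), ("to", "TO"), ("school", "NN"),
          ("in", "IN"), ("the", "AT"), ("park", "NN")]]

context_counts = defaultdict(Counter)  # (previous tag, word) -> tag counts
for sent in train:
    prev_tag = "<s>"
    for word, tag in sent:
        context_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag

def bigram_tag(words):
    """Tag with (previous tag, word) contexts only; unseen contexts give None."""
    tagged, prev_tag = [], "<s>"
    for word in words:
        context = (prev_tag, word)
        tag = context_counts[context].most_common(1)[0][0] if context in context_counts else None
        tagged.append((word, tag))
        prev_tag = tag  # a None here makes every later context unseen as well
    return tagged

cascade = bigram_tag("I go to klasdfas in the park".split())
print(cascade)
```

Every token after `klasdfas` comes back as `None`, even though `in`, `the` and `park` all occurred in training, because their contexts now start with a `None` tag.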

## nltk.pos_tag(token)

Sometimes the existing prebuilt tagger (pos_tag) needs to be replaced.  
We study these methods so that we can build a tagger ourselves.
