# 隐马尔可夫模型词性标注

## 1 隐马尔可夫模型（Hidden Markov Model）


$y_{t}$：$t$时刻，观测变量；

$z_{t}$：$t$时刻，隐含变量。

![](./img/hmm_fig_1.jpg)

## 2 词性标注（part of speech tagging，POS-Tagging）

**隐马尔可夫模型参数**：

* 隐含状态：词性

* 观测状态：词条

* 初始概率：$P_{\text{init}}$

* 状态转移矩阵：$\mathbf{A}$

* 发射矩阵：$\mathbf{B}$

## 3 维特比算法

## 4 实现

### 4.1 语料库

In [None]:
tag2id = {}
id2tag = {}
word2id = {}
id2word = {}

tag_filepath = "./data/traindata.txt"

with open(tag_filepath, "r") as f:
    
    id_word = 1
    id_tag = 1
    for line in f.readlines():
        word, tag =  line.strip().split("/")
        
        if word not in word2id:
            word2id[word] = id_word
            id2word[id_word] = word
            id_word += 1
            
        if tag not in tag2id:
            tag2id[tag] = id_tag
            id2tag[id_tag] = tag
            id_tag += 1

In [5]:
            
print("tag size: {}".format(len(tag2id)))
print("tag2id: {}".format(tag2id))

tag size: 54
tag2id: {'NNP': 1, ',': 2, 'VBG': 3, 'TO': 4, 'VB': 5, 'NN': 6, 'IN': 7, 'JJ': 8, 'VBD': 9, 'NNS': 10, 'CD': 11, 'CC': 12, 'PRP': 13, 'MD': 14, 'DT': 15, '.': 16, 'VBZ': 17, 'VBN': 18, 'WDT': 19, 'VBP': 20, 'POS': 21, 'RB': 22, '$': 23, 'PRP$': 24, ':': 25, 'JJR': 26, '``': 27, "''": 28, 'WP': 29, 'JJS': 30, 'WRB': 31, 'RBR': 32, 'NNPS': 33, 'RP': 34, 'WP$': 35, 'EX': 36, '(': 37, ')': 38, 'PDT': 39, 'RBS': 40, 'FW': 41, 'UH': 42, 'SYM': 43, 'LS': 44, '#': 45, 'VBG|NN': 46, 'JJ|NN': 47, 'RB|IN': 48, 'NNS|NN': 49, 'VBN|JJ': 50, 'VB|NN': 51, 'RBR|JJR': 52, 'NN|NNS': 53, 'JJ|RB': 54}


### 4.2 统计

In [6]:
import pandas as pd
import numpy as np

In [16]:
num_tags = len(tag2id)
num_words = len(word2id)

# transition probability matrix A
transit_probs = pd.DataFrame(
    data=np.zeros(shape=(num_tags, num_tags)),
    columns=tag2id.keys(),
    index=tag2id.keys()
)

with open(tag_filepath, "r") as f:
    line = f.readline()
    _, tag_current = line.strip().split("/")
    while line:
        _, tag_next = line.strip().split("/")
        transit_probs.loc[tag_current, tag_next] += 1
        tag_current = tag_next
        line = f.readline()
        
transit_probs /= transit_probs.sum()

# emission probability matrix B
emission_probs = pd.DataFrame(
    data=np.zeros(shape=(num_tags, num_words)),
    columns=word2id.keys(),
    index=tag2id.keys()
)

with open(tag_filepath, "r") as f:
    
    for line in f.readlines():
        word, tag = line.strip().split("/")
        emission_probs.loc[tag, word] += 1
        

In [19]:
print("transition probability matrix, A:")
print(transit_probs.head())

print("emission probability matrix, B:")
print(emission_probs.head())


transition probability matrix, A:
        NNP       ,    VBG     TO      VB      NN     IN     JJ     VBD  \
NNP  7346.0  2749.0   25.0  160.0    19.0  1023.0  828.0  169.0  1303.0   
,    1339.0     0.0  456.0   85.0    38.0   469.0  787.0  445.0   527.0   
VBG   117.0    45.0   15.0  281.0     5.0   429.0  428.0  227.0     8.0   
TO    199.0     3.0   29.0    0.0  2704.0   130.0   25.0  159.0     0.0   
VB    190.0   100.0   90.0  221.0    36.0   355.0  653.0  492.0     8.0   

       NNS  ...     #  VBG|NN  JJ|NN  RB|IN  NNS|NN  VBN|JJ  VB|NN  RBR|JJR  \
NNP  452.0  ...   0.0     0.0    0.0    0.0     0.0     0.0    0.0      0.0   
,    280.0  ...   0.0     0.0    0.0    0.0     0.0     0.0    0.0      0.0   
VBG  282.0  ...   0.0     0.0    1.0    0.0     0.0     0.0    0.0      0.0   
TO   109.0  ...  14.0     0.0    0.0    0.0     0.0     0.0    1.0      0.0   
VB   255.0  ...   0.0     0.0    0.0    0.0     0.0     0.0    0.0      0.0   

     NN|NNS  JJ|RB  
NNP     0.0    0.0 

In [27]:
emission_probs.sum(axis=1)

NNP        19374.0
,          10087.0
VBG         3154.0
TO          4667.0
VB          5564.0
NN         27667.0
IN         20672.0
JJ         13037.0
VBD         6374.0
NNS        12656.0
CD          7309.0
CC          4944.0
PRP         3722.0
MD          2038.0
DT         17273.0
.           8194.0
VBZ         4415.0
VBN         4190.0
WDT          951.0
VBP         2700.0
POS         1819.0
RB          6315.0
$           1324.0
PRP$        1742.0
:           1098.0
JJR          731.0
``          1534.0
''          1531.0
WP           515.0
JJS          377.0
WRB          411.0
RBR          327.0
NNPS         558.0
RP           361.0
WP$           37.0
EX           169.0
(            328.0
)            335.0
PDT           65.0
RBS           80.0
FW            53.0
UH            10.0
SYM            2.0
LS            24.0
#             43.0
VBG|NN         2.0
JJ|NN          3.0
RB|IN          1.0
NNS|NN         3.0
VBN|JJ         2.0
VB|NN          2.0
RBR|JJR        2.0
NN|NNS      

In [29]:


emission_probs.iloc[:, 0] / emission_probs.sum(axis=1)

NNP        0.000516
,          0.000000
VBG        0.000000
TO         0.000000
VB         0.000000
NN         0.000000
IN         0.000000
JJ         0.000000
VBD        0.000000
NNS        0.000000
CD         0.000000
CC         0.000000
PRP        0.000000
MD         0.000000
DT         0.000000
.          0.000000
VBZ        0.000000
VBN        0.000000
WDT        0.000000
VBP        0.000000
POS        0.000000
RB         0.000000
$          0.000000
PRP$       0.000000
:          0.000000
JJR        0.000000
``         0.000000
''         0.000000
WP         0.000000
JJS        0.000000
WRB        0.000000
RBR        0.000000
NNPS       0.000000
RP         0.000000
WP$        0.000000
EX         0.000000
(          0.000000
)          0.000000
PDT        0.000000
RBS        0.000000
FW         0.000000
UH         0.000000
SYM        0.000000
LS         0.000000
#          0.000000
VBG|NN     0.000000
JJ|NN      0.000000
RB|IN      0.000000
NNS|NN     0.000000
VBN|JJ     0.000000


## 4.3 Viterbi

In [20]:
EPSILON = 1e-5

def log(x):
    return np.log(x + EPSILON)

In [None]:
def viterbi():
    pass