# Natural Language Processing: Assignment 3

# 1. Data Import

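## 1.1 Importing the Samples

The dataset comes from the Brown corpus and is stored in `train.txt` and `test.txt`.

Each sentence sample consists of three parts:

the first line is the sentence identifier;

the following lines are word/tag pairs, one per line, with the word and its part-of-speech tag separated by a tab;

the last line is blank.

The datasets are loaded into the lists `train` and `test`; each element of a list is one sentence, represented as a list of `[word, tag]` pairs.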

In [1]:
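train = []
with open('train.txt', 'r') as file:
    for line in file:
        if '\t' in line:
            # word/tag line: append the [word, tag] pair to the current sentence
            train[-1].append(line.split())
        elif line != '\n':
            # sentence identifier line: start a new sentence
            train.append([])

test = []
with open('test.txt', 'r') as file:
    for line in file:
        if '\t' in line:
            # word/tag line: append the [word, tag] pair to the current sentence
            test[-1].append(line.split())
        elif line != '\n':
            # sentence identifier line: start a new sentence
            test.append([])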
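The two loading loops above differ only in the file name, so they can be folded into one function. The following is a minimal sketch, not the code used for the results in this report; the function name `read_tagged_sentences` and the sentence identifiers in `sample` are made up for illustration.

```python
def read_tagged_sentences(lines):
    """Parse an iterable of lines into a list of sentences of [word, tag] pairs."""
    sentences = []
    for line in lines:
        if '\t' in line:
            # word/tag line: same whitespace split as in the cell above
            sentences[-1].append(line.split())
        elif line.strip():
            # any other non-blank line is a sentence identifier: start a new sentence
            sentences.append([])
    return sentences


# Hypothetical sample in the format described in section 1.1.
sample = ["id-0\n", "The\tDET\n", "jury\tNOUN\n", "\n", "id-1\n", "It\tPRON\n", "\n"]
parsed = read_tagged_sentences(sample)
print(parsed)
```

Because a file object is itself an iterable of lines, the same function could be called as `read_tagged_sentences(file)` inside a `with open('train.txt', 'r') as file:` block.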

## 1.2 Importing the Tags

There are 12 tags, stored in `tag.txt`.

In [2]:
tag = []

with open('tag.txt', 'r') as file:
    for line in file:
        print(line, end = '')
        tag.append(line.rstrip('\n'))

.
ADJ
ADP
ADV
CONJ
DET
NOUN
NUM
PRON
PRT
VERB
X


## 1.3 Counting Words and Sentences

In [3]:
num_word = [sum([len(sent) for sent in train]), sum([len(sent) for sent in test])]

print('Dataset\tWords\tSentences')
print('Train\t%d\t%d' % (num_word[0], len(train)))
print('Test\t%d\t%d' % (num_word[1], len(test)))

Dataset	Words	Sentences
Train	928327	45800
Test	232865	11540


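## 1.4 Counting Words per Tag

`num` is a dictionary whose keys are tags and whose values are two-element lists.

The first element is the number of training-set words with that tag; the second is the number of test-set words with that tag.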

In [4]:
num = {}

for key in tag:
    num[key] = [0, 0]

for sent in train:
    for word in sent:
        num[word[1]][0] += 1
        
for sent in test:
    for word in sent:
        num[word[1]][1] += 1

for key in num:
    print("'%4s':\t[%6d, %5d]" % (key, num[key][0], num[key][1]))

'   .':	[117723, 29842]
' ADJ':	[ 66985, 16736]
' ADP':	[115752, 29014]
' ADV':	[ 44765, 11474]
'CONJ':	[ 30455,  7696]
' DET':	[109418, 27601]
'NOUN':	[220451, 55107]
' NUM':	[ 11921,  2953]
'PRON':	[ 39657,  9677]
' PRT':	[ 23889,  5940]
'VERB':	[146199, 36551]
'   X':	[  1112,   274]


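# 2. Building the Baseline Model

The baseline model follows page 163 of *Speech and Language Processing*.

## 2.1 Building the Model

`count` is a dictionary whose keys are words and whose values are dictionaries;

each inner dictionary maps a tag to the number of times the word occurs with that tag in the training set.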

In [5]:
count = {}

for sent in train:
    for word in sent:
        if word[0] not in count:
            count[word[0]] = {word[1]: 1}
        elif word[1] not in count[word[0]]:
            count[word[0]][word[1]] = 1
        else:
            count[word[0]][word[1]] += 1
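The same nested structure can be built without the explicit membership checks by using `collections.defaultdict` and `collections.Counter`. This is only a sketch on a made-up toy corpus; `count_tags` and `toy` are illustrative names, not part of the assignment code.

```python
from collections import Counter, defaultdict


def count_tags(sentences):
    """Return {word: Counter({tag: occurrences})} for a list of tagged sentences."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1  # missing words and tags start at 0
    return counts


# Toy corpus: 'run' occurs once as NOUN and twice as VERB.
toy = [
    [['the', 'DET'], ['run', 'NOUN']],
    [['I', 'PRON'], ['run', 'VERB']],
    [['a', 'DET'], ['run', 'VERB']],
]
toy_count = count_tags(toy)
print(dict(toy_count['run']))
```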

`baseline` is a dictionary that maps each word to the tag it occurs with most often in the training set.

In [6]:
baseline = {}

for word in count:
    max_count = 0
    for key in count[word]:
        if count[word][key] > max_count:
            max_count = count[word][key]
            max_key = key
    baseline[word] = max_key
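The inner loop above is an argmax over the tag counts of one word. A compact sketch of the same selection uses `max` with a key function; as in the loop, ties go to the tag that was inserted first. The counts in `toy_count` are made up for illustration.

```python
def most_frequent_tag(tag_counts):
    """Return the tag with the largest count; on a tie, the tag inserted first wins."""
    return max(tag_counts, key=tag_counts.get)


# Hypothetical counts in the same shape as `count`.
toy_count = {
    'run': {'NOUN': 1, 'VERB': 2},
    'the': {'DET': 5},
    'tie': {'ADJ': 1, 'NOUN': 1},
}
toy_baseline = {word: most_frequent_tag(tags) for word, tags in toy_count.items()}
print(toy_baseline)
```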

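## 2.2 Testing the Model

Several ways of handling unknown words (words that do not occur in the training set) are compared.

### 2.2.1 Using the "Other" Tag

That is, every unknown word is tagged `X`.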

In [7]:
correct = 0

for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == 'X'
            
print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == 'X'

print('Test accuracy: %.6f' % (correct / num_word[1]))

Training accuracy: 0.957196
Test accuracy: 0.930951


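The training accuracy is above 0.9 and above the test accuracy, but it falls short of 1.0: every training word is in `baseline`, so the remaining errors all come from ambiguous words that occur with a tag other than their most frequent one.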
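The training accuracy of a most-frequent-tag tagger can be read directly off the counts: a token is tagged correctly exactly when it carries the majority tag of its word. The sketch below shows this on made-up counts; `baseline_training_accuracy` and `toy_count` are illustrative names.

```python
def baseline_training_accuracy(count):
    """Training accuracy of a most-frequent-tag tagger, computed from its own counts."""
    correct = sum(max(tags.values()) for tags in count.values())  # majority-tag tokens
    total = sum(sum(tags.values()) for tags in count.values())    # all tokens
    return correct / total


# Hypothetical counts: 'run' is ambiguous, 'the' is not.
toy_count = {'run': {'NOUN': 1, 'VERB': 3}, 'the': {'DET': 6}}
toy_accuracy = baseline_training_accuracy(toy_count)
print(toy_accuracy)
```

Here 9 of 10 tokens carry the majority tag of their word; the single `NOUN` reading of `run` is the only error.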

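### 2.2.2 Using the Most Frequent Tag among Words That Occur Once

`count_one` is a dictionary that maps each tag to the number of words that occur exactly once in the training set and carry that tag.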

In [8]:
count_one = dict.fromkeys(tag, 0)

for word in count:
    if len(count[word]) == 1:
        for key in count[word]:
            if count[word][key] == 1:
                count_one[key] += 1
            
for key in count_one:
    print("'%4s':\t%5d" % (key, count_one[key]))

'   .':	    0
' ADJ':	 3896
' ADP':	   43
' ADV':	  757
'CONJ':	    2
' DET':	   14
'NOUN':	13873
' NUM':	  788
'PRON':	   17
' PRT':	  138
'VERB':	 3525
'   X':	  473


`NOUN` is the most frequent tag among these words, so every unknown word is tagged `NOUN`:

In [9]:
correct = 0

for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == 'NOUN'
            
print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == 'NOUN'

print('Test accuracy: %.6f' % (correct / num_word[1]))

Training accuracy: 0.957196
Test accuracy: 0.945187


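The training accuracy cannot change, because no training word is unknown; it remains above the test accuracy. The test accuracy improves from 0.930951 to 0.945187.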
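The evaluation loop is the same for every method in this section; only the handling of unknown words changes. One way to avoid repeating it is to pass that handling in as a function. This is a sketch with made-up data; `accuracy` and the toy dictionaries are not part of the assignment code.

```python
def accuracy(sentences, baseline, fallback):
    """Fraction of tokens tagged correctly; `fallback(word)` tags words missing from `baseline`."""
    correct = total = 0
    for sent in sentences:
        for word, gold in sent:
            predicted = baseline[word] if word in baseline else fallback(word)
            correct += predicted == gold
            total += 1
    return correct / total


# Hypothetical model and test sentence: 'cat' and 'barks' are unknown words.
toy_baseline = {'the': 'DET', 'dog': 'NOUN'}
toy_test = [[['the', 'DET'], ['cat', 'NOUN'], ['barks', 'VERB']]]

acc_x = accuracy(toy_test, toy_baseline, lambda word: 'X')
acc_noun = accuracy(toy_test, toy_baseline, lambda word: 'NOUN')
print(acc_x, acc_noun)
```

The helper counts tokens itself, so it does not depend on `num_word`.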

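### 2.2.3 Using the Most Frequent Tag

The most frequent tag in the training set is also `NOUN`, so unknown words are again tagged `NOUN` and the results are the same as above.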
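That `NOUN` is the most frequent tag can be checked against the per-tag counts printed in section 1.4. The sketch below copies the training-set column of that output into a dictionary; the name `train_tag_count` is chosen here for illustration.

```python
# Training-set counts per tag, copied from the output of cell In [4].
train_tag_count = {
    '.': 117723, 'ADJ': 66985, 'ADP': 115752, 'ADV': 44765,
    'CONJ': 30455, 'DET': 109418, 'NOUN': 220451, 'NUM': 11921,
    'PRON': 39657, 'PRT': 23889, 'VERB': 146199, 'X': 1112,
}

most_frequent = max(train_tag_count, key=train_tag_count.get)
total = sum(train_tag_count.values())        # equals the training word count of section 1.3
share = train_tag_count[most_frequent] / total
print(most_frequent, total, share)
```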

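### 2.2.4 Word-Form Heuristics

Following [GitHub](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words), the tag of an unknown word is first guessed from its form; words whose tag cannot be determined this way are tagged `X`: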

In [10]:
import string

In [11]:
def unknown_word_tag_x(word):
    if all(char in string.punctuation for char in word):
        return '.'
    elif any(word.endswith(suffix) for suffix in ["ous", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "able"]):
        return 'ADJ'
    elif any(word.endswith(suffix) for suffix in ["ly", "wise", "wards", "ward"]):
        return 'ADV'
    elif any(word.endswith(suffix) for suffix in ["action", "age", "ance", "cy", "ee", "ence", "er", "hood", "ion", "ism",
                                                  "ist", "ity", "ment", "ness", "or", "ry", "scape", "ship", "dom", "ty", "'s"]):
        return 'NOUN'
    elif any(char.isdigit() for char in word):
        return 'NUM'
    elif any(word.endswith(suffix) for suffix in ["ed", "ify", "ise", "ize", "ate", "ing"]):
        return 'VERB'
    else:
        return 'X'

In [12]:
correct = 0

for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == unknown_word_tag_x(word[0])
            
print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == unknown_word_tag_x(word[0])

print('Test accuracy: %.6f' % (correct / num_word[1]))

Training accuracy: 0.957196
Test accuracy: 0.939604


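The training accuracy is again unchanged and remains above the test accuracy. The test accuracy (0.939604) is higher than with a plain `X` (0.930951) but lower than with a plain `NOUN` (0.945187).

### 2.2.5 Word-Form Heuristics with a Different Default

Words whose tag cannot be determined are tagged `NOUN`: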

In [13]:
def unknown_word_tag_noun(word):
    if all(char in string.punctuation for char in word):
        return '.'
    elif any(word.endswith(suffix) for suffix in ["ous", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "able"]):
        return 'ADJ'
    elif any(word.endswith(suffix) for suffix in ["ly", "wise", "wards", "ward"]):
        return 'ADV'
    elif any(word.endswith(suffix) for suffix in ["action", "age", "ance", "cy", "ee", "ence", "er", "hood", "ion", "ism",
                                                  "ist", "ity", "ment", "ness", "or", "ry", "scape", "ship", "dom", "ty", "'s"]):
        return 'NOUN'
    elif any(char.isdigit() for char in word):
        return 'NUM'
    elif any(word.endswith(suffix) for suffix in ["ed", "ify", "ise", "ize", "ate", "ing"]):
        return 'VERB'
    else:
        return 'NOUN'
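`unknown_word_tag_x` and `unknown_word_tag_noun` differ only in the value returned when no rule fires. A possible refactoring, sketched below, keeps the rules in the same order and takes the fallback as a parameter; the name `unknown_word_tag`, the `default` parameter and the example words are introduced here for illustration.

```python
import string

# Suffix lists copied from the functions above, as tuples for str.endswith.
ADJ_SUFFIXES = ("ous", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "able")
ADV_SUFFIXES = ("ly", "wise", "wards", "ward")
NOUN_SUFFIXES = ("action", "age", "ance", "cy", "ee", "ence", "er", "hood", "ion", "ism",
                 "ist", "ity", "ment", "ness", "or", "ry", "scape", "ship", "dom", "ty", "'s")
VERB_SUFFIXES = ("ed", "ify", "ise", "ize", "ate", "ing")


def unknown_word_tag(word, default):
    """Guess a tag from the word form; `default` is returned when no rule fires."""
    if all(char in string.punctuation for char in word):
        return '.'
    if word.endswith(ADJ_SUFFIXES):  # str.endswith accepts a tuple of suffixes
        return 'ADJ'
    if word.endswith(ADV_SUFFIXES):
        return 'ADV'
    if word.endswith(NOUN_SUFFIXES):
        return 'NOUN'
    if any(char.isdigit() for char in word):
        return 'NUM'
    if word.endswith(VERB_SUFFIXES):
        return 'VERB'
    return default


examples = ['--', 'quickly', 'nation', '1990', 'walked', 'taxi', 'the']
guesses = {word: unknown_word_tag(word, default='X') for word in examples}
print(guesses)
```

The rules are tried in order, so the first match wins. The example `taxi` is guessed as `ADJ` only because of the single-letter suffix `i`, which shows how coarse these heuristics are.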

In [14]:
correct = 0

for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == unknown_word_tag_noun(word[0])
            
print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == unknown_word_tag_noun(word[0])

print('Test accuracy: %.6f' % (correct / num_word[1]))

Training accuracy: 0.957196
Test accuracy: 0.949168


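The training accuracy is again unchanged and remains above the test accuracy. The test accuracy (0.949168) is higher than with any of the methods above.

### 2.2.6 Replacing Words

Every word that does not occur at least twice in the training set is replaced by a placeholder of the form `<UNK-tag>`, where `tag` is the guess returned by `unknown_word_tag_x`.

`symbol` is a set whose elements are the words that occur at least twice in the training set.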

In [15]:
symbol = set()

for word in count:
    if len(count[word]) == 1:
        for key in count[word]:
            if count[word][key] == 1:
                continue
            else:
                symbol.add(word)
    else:
        symbol.add(word)
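The nested conditions above amount to a single test: a word is kept if its total count over all tags is at least two. A sketch of that test as a set comprehension follows; `frequent_words`, `min_count` and the toy counts are illustrative.

```python
def frequent_words(count, min_count=2):
    """Words whose total number of occurrences (over all tags) is at least `min_count`."""
    return {word for word, tags in count.items() if sum(tags.values()) >= min_count}


# Hypothetical counts: 'run' occurs twice with different tags, 'zebra' only once.
toy_count = {'run': {'NOUN': 1, 'VERB': 1}, 'the': {'DET': 5}, 'zebra': {'NOUN': 1}}
toy_symbol = frequent_words(toy_count)
print(sorted(toy_symbol))
```

A word with two different tags has necessarily occurred at least twice, which is why the `else` branch of the original loop adds it unconditionally.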

In [16]:
train_unknown = []
for sent in train:
    train_unknown.append([])
    for word in sent:
        if word[0] in symbol:
            train_unknown[-1].append(word)
        else:
            train_unknown[-1].append(['<UNK-' + unknown_word_tag_x(word[0]) + '>', word[1]])
        
test_unknown = []
for sent in test:
    test_unknown.append([])
    for word in sent:
        if word[0] in symbol:
            test_unknown[-1].append(word)
        else:
            test_unknown[-1].append(['<UNK-' + unknown_word_tag_x(word[0]) + '>', word[1]])

In [17]:
count_unknown = {}

for sent in train_unknown:
    for word in sent:
        if word[0] not in count_unknown:
            count_unknown[word[0]] = {word[1]: 1}
        elif word[1] not in count_unknown[word[0]]:
            count_unknown[word[0]][word[1]] = 1
        else:
            count_unknown[word[0]][word[1]] += 1

In [18]:
baseline_unknown = {}

for word in count_unknown:
    max_count = 0
    for key in count_unknown[word]:
        if count_unknown[word][key] > max_count:
            max_count = count_unknown[word][key]
            max_key = key
    baseline_unknown[word] = max_key

The predictions for the placeholders:

In [19]:
print('Placeholder\tPredicted tag')
for key in ('<UNK-.>', '<UNK-ADJ>', '<UNK-ADV>', '<UNK-NOUN>', '<UNK-NUM>', '<UNK-VERB>', '<UNK-X>'):
    print('%-10s\t%s' % (key, baseline_unknown[key]))

Placeholder	Predicted tag
<UNK-.>   	NOUN
<UNK-ADJ> 	ADJ
<UNK-ADV> 	ADV
<UNK-NOUN>	NOUN
<UNK-NUM> 	NUM
<UNK-VERB>	VERB
<UNK-X>   	NOUN


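Compared with using the guessed tag directly, replacing words lets the training data correct the guess automatically: `<UNK-.>` and `<UNK-X>` are both mapped to `NOUN`.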

In [20]:
correct = 0

for sent in train_unknown:
    for word in sent:
        correct += word[1] == baseline_unknown[word[0]]
            
print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test_unknown:
    for word in sent:
        correct += word[1] == baseline_unknown[word[0]]

print('Test accuracy: %.6f' % (correct / num_word[1]))

Training accuracy: 0.950603
Test accuracy: 0.946411


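As expected, the training accuracy drops, because words that occur only once in the training set are now tagged through their placeholder; it is still above the test accuracy. The test accuracy (0.946411) is second only to that of the method in 2.2.5 (0.949168).

The methods are summarized below:

|No.|Method|Training accuracy|Test accuracy|
|:--|:--|:--|:--|
|1|"Other" tag `X`|0.957196|0.930951|
|2|Most frequent tag among words that occur once|0.957196|0.945187|
|3|Most frequent tag|0.957196|0.945187|
|4|Word-form heuristics, default `X`|0.957196|0.939604|
|5|Word-form heuristics, default `NOUN`|0.957196|0.949168|
|6|Replacing words|0.950603|0.946411|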

# 3. Building the Hidden Markov Model

## 3.1 Training the Model

The model uses `nltk.tag.HiddenMarkovModelTagger`, following section 8.4.3 of *Speech and Language Processing*.

In [21]:
from nltk.tag import HiddenMarkovModelTagger
from nltk import FreqDist
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

In [22]:
for key in ('<UNK-.>', '<UNK-ADJ>', '<UNK-ADV>', '<UNK-NOUN>', '<UNK-NUM>', '<UNK-VERB>', '<UNK-X>'):
    symbol.add(key)

In [23]:
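# transition probabilities P(next tag | previous tag), estimated from adjacent tag pairs
transition = ConditionalProbDist(ConditionalFreqDist([(sent[index - 1][1], sent[index][1]) 
                                                      for sent in train_unknown for index in range(1, len(sent))]), MLEProbDist)

In [24]:
# output (emission) probabilities P(word | tag)
output = ConditionalProbDist(ConditionalFreqDist([(word[1], word[0]) for sent in train_unknown for word in sent]), MLEProbDist)

In [25]:
# prior probabilities of the tag of the first word of a sentence
prior = MLEProbDist(FreqDist([sent[0][1] for sent in train_unknown]))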
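With `MLEProbDist`, each of the three distributions is a table of relative frequencies. The sketch below computes the same kind of estimates in plain Python on a made-up three-sentence corpus, to show what the tables contain; it does not use NLTK, and all names in it are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus in the same [word, tag] format as `train_unknown`.
toy = [
    [['the', 'DET'], ['dog', 'NOUN'], ['runs', 'VERB']],
    [['the', 'DET'], ['run', 'NOUN']],
    [['dogs', 'NOUN'], ['run', 'VERB']],
]


def normalize(counter):
    """Turn raw counts into relative frequencies (the maximum-likelihood estimate)."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


transition_count = defaultdict(Counter)  # previous tag -> next tag
output_count = defaultdict(Counter)      # tag -> word
prior_count = Counter()                  # tag of the first word of each sentence

for sent in toy:
    prior_count[sent[0][1]] += 1
    for word, tag in sent:
        output_count[tag][word] += 1
    for (_, prev_tag), (_, next_tag) in zip(sent, sent[1:]):
        transition_count[prev_tag][next_tag] += 1

toy_transition = {tag: normalize(c) for tag, c in transition_count.items()}
toy_output = {tag: normalize(c) for tag, c in output_count.items()}
toy_prior = normalize(prior_count)
print(toy_prior, toy_transition, toy_output)
```

An event that never occurs in the training data gets probability zero under this estimate, which is one reason the rare words were replaced by placeholders beforehand.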

In [26]:
hmm = HiddenMarkovModelTagger(symbols = symbol, states = tag, transitions = transition, outputs = output, priors = prior)

## 3.2 Testing the Model

In [27]:
correct = 0

for sent in train_unknown:
    predict = hmm.tag([word[0] for word in sent])
    for index in range(len(sent)):
        correct += sent[index][1] == predict[index][1]
    
print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test_unknown:
    predict = hmm.tag([word[0] for word in sent])
    for index in range(len(sent)):
        correct += sent[index][1] == predict[index][1]

print('Test accuracy: %.6f' % (correct / num_word[1]))

Training accuracy: 0.969255
Test accuracy: 0.963773
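Tagging with a hidden Markov model means finding the most probable tag sequence for a sentence, which *Speech and Language Processing* does with the Viterbi algorithm. The sketch below is a plain-Python Viterbi decoder on a made-up model, assuming that `hmm.tag` decodes in the same way; it is not NLTK's implementation, and every probability in it is invented for illustration.

```python
import math


def viterbi(words, states, prior, transition, output):
    """Most probable tag sequence under an HMM, computed with log-probabilities."""
    def log(p):
        return math.log(p) if p > 0 else float('-inf')

    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: log(prior.get(s, 0)) + log(output[s].get(words[0], 0)) for s in states}]
    back = []  # back[t - 1][s]: predecessor of s on that best path
    for word in words[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log(transition[r].get(s, 0)))
            scores[s] = (best[-1][prev] + log(transition[prev].get(s, 0))
                         + log(output[s].get(word, 0)))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # follow the back-pointers from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical model parameters.
states = ['DET', 'NOUN', 'VERB']
prior = {'DET': 0.6, 'NOUN': 0.3, 'VERB': 0.1}
transition = {
    'DET': {'NOUN': 0.9, 'VERB': 0.1},
    'NOUN': {'NOUN': 0.2, 'VERB': 0.8},
    'VERB': {'DET': 0.5, 'NOUN': 0.5},
}
output = {
    'DET': {'the': 1.0},
    'NOUN': {'dog': 0.5, 'run': 0.5},
    'VERB': {'run': 0.7, 'runs': 0.3},
}

after_det = viterbi(['the', 'run'], states, prior, transition, output)
after_noun = viterbi(['dog', 'run'], states, prior, transition, output)
print(after_det, after_noun)
```

In this toy model the word `run` is tagged `NOUN` after `the` and `VERB` after `dog`: the previous tag decides, which a per-word baseline cannot express.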


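## 3.3 Comparison with the Baseline Model

Because it uses the tags of neighboring words as context, the hidden Markov model is more accurate than the baseline model: with the same `<UNK-tag>` preprocessing, the test accuracy rises from 0.946411 to 0.963773 and the training accuracy from 0.950603 to 0.969255.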
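The gain is easier to judge as a reduction of the error rate than as a difference in accuracy. The short calculation below uses the two test accuracies reported in sections 2.2.6 and 3.2; the variable names are chosen here for illustration.

```python
baseline_test_accuracy = 0.946411  # section 2.2.6, with <UNK-tag> placeholders
hmm_test_accuracy = 0.963773       # section 3.2, same preprocessing

baseline_error = 1 - baseline_test_accuracy
hmm_error = 1 - hmm_test_accuracy

# share of the baseline's test errors that the HMM removes
relative_reduction = (baseline_error - hmm_error) / baseline_error
print('Relative error reduction: %.1f%%' % (100 * relative_reduction))
```

By this measure the hidden Markov model removes roughly a third of the baseline's test errors.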