# Natural Language Processing: Assignment 3

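# 1. Data Import

## 1.1 Loading the Samples

The dataset comes from the Brown corpus and is stored in `train.txt` and `test.txt`.

Each sentence sample consists of three parts:

the first line is the sentence ID;

the following lines are word/tag pairs, with the word and its part-of-speech tag separated by a tab;

the last line is blank.

The data is loaded into the lists `train` and `test`. Each element of these lists is one sentence, stored as a list of word/tag pairs.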

In [None]:
train = []
with open('train.txt', 'r') as file:
    for line in file:
        if '\t' in line:
            # Word/tag line: add the [word, tag] pair to the current sentence.
            train[-1].append(line.split())
        elif line != '\n':
            # Sentence ID line: start a new sentence.
            train.append([])

# The test set uses the same layout.
test = []
with open('test.txt', 'r') as file:
    for line in file:
        if '\t' in line:
            test[-1].append(line.split())
        elif line != '\n':
            test.append([])
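To make the file layout concrete, here is a minimal sketch that parses the same format from an in-memory string. The sample text and the helper name `read_tagged_sentences` are made up for illustration; only the layout (ID line, tab-separated pairs, blank line) comes from the description above.

```python
import io

def read_tagged_sentences(lines):
    """Parse sentences laid out as: ID line, word<TAB>tag lines, blank line."""
    sentences = []
    for line in lines:
        if '\t' in line:
            # A word/tag line belongs to the most recently opened sentence.
            sentences[-1].append(line.split())
        elif line.strip():
            # Any other non-blank line is a sentence ID and opens a new sentence.
            sentences.append([])
    return sentences

# Hypothetical two-sentence sample in the described layout.
sample = "s1\nThe\tDET\ndog\tNOUN\nbarks\tVERB\n\ns2\nRun\tVERB\n\n"
sentences = read_tagged_sentences(io.StringIO(sample))
print(sentences)
```

Wrapping the loop in a function lets one parser serve both `train.txt` and `test.txt`, since any iterable of lines (including an open file) can be passed in.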

## 1.2 Loading the Tags

There are 12 tags in total, stored in `tag.txt`.

In [None]:
tag = []
with open('tag.txt', 'r') as file:
    for line in file:
        print(line, end='')
        # Store each tag in `tag` (not `train`), without the trailing newline.
        tag.append(line.rstrip('\n'))

## 1.3 Counting Words and Sentences

In [None]:
# num_word[0]: number of tokens in the training set; num_word[1]: in the test set.
num_word = [sum(len(sent) for sent in train), sum(len(sent) for sent in test)]

print('Dataset\tWords\tSentences')
print('Train\t%d\t%d' % (num_word[0], len(train)))
print('Test\t%d\t%d' % (num_word[1], len(test)))

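## 1.4 Counting Words per Tag

`num` is a dictionary whose keys are tags and whose values are two-element lists.

The first element is the number of words with that tag in the training set; the second is the number in the test set.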

In [None]:
num = {}

for key in tag:
    num[key] = [0, 0]

for sent in train:
    for word in sent:
        num[word[1]][0] += 1
        
for sent in test:
    for word in sent:
        num[word[1]][1] += 1

for key in num:
    print("'%4s':\t[%6d, %5d]" % (key, num[key][0], num[key][1]))

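# 2. Building the Baseline Model

The baseline model follows page 163 of *Speech and Language Processing*: each word is assigned the tag it most frequently carries in the training set.

## 2.1 Building the Model

`count` is a dictionary whose keys are words. Each value is itself a dictionary that maps a tag to the number of times the word occurs with that tag.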

In [None]:
count = {}

for sent in train:
    for word in sent:
        if word[0] not in count:
            count[word[0]] = {word[1]: 1}
        elif word[1] not in count[word[0]]:
            count[word[0]][word[1]] = 1
        else:
            count[word[0]][word[1]] += 1

`baseline` is a dictionary that maps each word to its most frequent tag.

In [None]:
baseline = {}

for word in count:
    max_count = 0
    for key in count[word]:
        if count[word][key] > max_count:
            max_count = count[word][key]
            max_key = key
    baseline[word] = max_key
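The two steps above (counting word/tag pairs, then picking the most frequent tag per word) can also be sketched with `collections.Counter`. This is an alternative formulation on made-up toy data, not the cells used in this notebook; `train_baseline` and `toy` are hypothetical names.

```python
from collections import Counter, defaultdict

def train_baseline(sentences):
    """Map each word to the tag it carries most often in `sentences`."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, pos in sent:
            counts[word][pos] += 1
    # most_common(1) returns a one-element list [(tag, count)].
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# Toy data: "run" is a noun once and a verb twice.
toy = [[['the', 'DET'], ['run', 'NOUN']],
       [['they', 'PRON'], ['run', 'VERB']],
       [['we', 'PRON'], ['run', 'VERB']]]
toy_baseline = train_baseline(toy)
print(toy_baseline['run'])  # VERB
```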

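## 2.2 Testing the Model

Unknown words are handled with several different methods.

### 2.2.1 Using the "other" tag

That is, every unknown word is tagged `X`.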

In [None]:
correct = 0

# Every training word is in `baseline`, so the fallback is never used here.
for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == 'X'

print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            # Unknown word: fall back to the tag `X`.
            correct += word[1] == 'X'

print('Test accuracy: %.6f' % (correct / num_word[1]))

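Training accuracy is above 0.9 and higher than test accuracy, but it does not reach 1.0: a word that occurs with several tags in the training set is always assigned its single most frequent tag.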
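Since the same evaluation loop is reused for each unknown-word strategy, it can be factored into a helper that takes the fallback as a function. The sketch below runs on made-up toy data; `accuracy`, `demo_baseline`, and `demo_test` are hypothetical names, not part of the notebook.

```python
def accuracy(sentences, baseline, unknown_tagger):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in sentences:
        for word, gold in sent:
            if word in baseline:
                predicted = baseline[word]
            else:
                # Unknown word: delegate to the supplied strategy.
                predicted = unknown_tagger(word)
            correct += predicted == gold
            total += 1
    return correct / total

demo_baseline = {'the': 'DET', 'dog': 'NOUN'}
demo_test = [[['the', 'DET'], ['dog', 'NOUN'], ['zebra', 'NOUN'], ['ran', 'VERB']]]

acc_x = accuracy(demo_test, demo_baseline, lambda word: 'X')
acc_noun = accuracy(demo_test, demo_baseline, lambda word: 'NOUN')
print(acc_x, acc_noun)  # 0.5 0.75
```

Passing the strategy as a function means a constant tag and a rule-based guesser can be compared with the same call.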

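### 2.2.2 Using the most frequent tag among words that occur once

`count_one` is a dictionary whose keys are tags and whose values are the numbers of words that occur exactly once in the training set and carry that tag.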

In [None]:
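count_one = dict.fromkeys(tag, 0)

for word in count:
    # Keep only words that occur exactly once in the training set.
    # (len(count[word]) == 1 would instead select words with a single distinct tag.)
    if sum(count[word].values()) == 1:
        for key in count[word]:
            count_one[key] += 1

for key in count_one:
    print("'%4s':\t%5d" % (key, count_one[key]))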

That is, unknown words are tagged `NOUN`:

In [None]:
correct = 0

for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == 'NOUN'

print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            # Unknown word: fall back to the tag `NOUN`.
            correct += word[1] == 'NOUN'

print('Test accuracy: %.6f' % (correct / num_word[1]))

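Training accuracy is unchanged, because every training word is in `baseline` and the unknown-word rule is never applied; it remains higher than test accuracy. Test accuracy improves over tagging unknown words as `X`.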

### 2.2.3 Using the most frequent tag overall

That is, unknown words are tagged `NOUN`, so the result is the same as above.

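### 2.2.4 Guessing from word form

Following [GitHub](https://github.com/Adamouization/POS-Tagging-and-Unknown-Words), a tag is guessed from the word's form (punctuation, suffixes, digits); words that match no rule are tagged `X`: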

In [None]:
import string

In [None]:
def unknown_word_tag_x(word):
    if all(char in string.punctuation for char in word):
        return '.'
    elif any(word.endswith(suffix) for suffix in ["ous", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "able"]):
        return 'ADJ'
    elif any(word.endswith(suffix) for suffix in ["ly", "wise", "wards", "ward"]):
        return 'ADV'
    elif any(word.endswith(suffix) for suffix in ["action", "age", "ance", "cy", "ee", "ence", "er", "hood", "ion", "ism",
                                                  "ist", "ity", "ment", "ness", "or", "ry", "scape", "ship", "dom", "ty", "'s"]):
        return 'NOUN'
    elif any(char.isdigit() for char in word):
        return 'NUM'
    elif any(word.endswith(suffix) for suffix in ["ed", "ify", "ise", "ize", "ate", "ing"]):
        return 'VERB'
    else:
        return 'X'

In [None]:
correct = 0

for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == unknown_word_tag_x(word[0])

print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            # Unknown word: guess the tag from the word form.
            correct += word[1] == unknown_word_tag_x(word[0])

print('Test accuracy: %.6f' % (correct / num_word[1]))

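Training accuracy is again unchanged and remains higher than test accuracy. Test accuracy is higher than when all unknown words are tagged `X`, but lower than when they are all tagged `NOUN`.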
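The rules in `unknown_word_tag_x` are checked in order, so the first matching rule wins. The sketch below expresses the same idea as an ordered rule table, using shortened, illustrative suffix lists rather than the full lists above; `RULES`, `ends_with`, and `guess_tag` are hypothetical names. A `default` argument lets one function cover both an `X` and a `NOUN` fallback.

```python
import string

def ends_with(*suffixes):
    """Build a predicate that is true when a word ends with any given suffix."""
    return lambda word: word.endswith(suffixes)

# Ordered (tag, predicate) pairs; suffix lists are shortened for illustration.
RULES = [
    ('.', lambda word: all(ch in string.punctuation for ch in word)),
    ('ADJ', ends_with('ous', 'ful', 'ish', 'able')),
    ('ADV', ends_with('ly', 'ward')),
    ('NOUN', ends_with('ness', 'ment', 'ion')),
    ('NUM', lambda word: any(ch.isdigit() for ch in word)),
    ('VERB', ends_with('ed', 'ize', 'ing')),
]

def guess_tag(word, default='X'):
    """Return the tag of the first matching rule, or `default` if none match."""
    for pos, matches in RULES:
        if matches(word):
            return pos
    return default

for w in ['famous', 'quickly', 'kindness', '1990', 'walked', '--', 'zebra', 'family']:
    print(w, guess_tag(w))
```

Note that `family` is tagged `ADV` because it ends in `ly`: suffix rules are cheap but misfire on words that merely share an ending, which may be one reason such rules can do worse than a constant `NOUN` fallback.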

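### 2.2.5 A variant of the word-form rules

This variant drops the noun-suffix rule and tags every word that matches no rule as `NOUN` instead of `X`: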

In [None]:
def unknown_word_tag_noun(word):
    if all(char in string.punctuation for char in word):
        return '.'
    elif any(word.endswith(suffix) for suffix in ["ous", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "able"]):
        return 'ADJ'
    elif any(word.endswith(suffix) for suffix in ["ly", "wise", "wards", "ward"]):
        return 'ADV'
    elif any(char.isdigit() for char in word):
        return 'NUM'
    elif any(word.endswith(suffix) for suffix in ["ed", "ify", "ise", "ize", "ate", "ing"]):
        return 'VERB'
    else:
        return 'NOUN'

In [None]:
correct = 0

for sent in train:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            correct += word[1] == unknown_word_tag_noun(word[0])

print('Training accuracy: %.6f' % (correct / num_word[0]))

correct = 0

for sent in test:
    for word in sent:
        if word[0] in baseline:
            correct += word[1] == baseline[word[0]]
        else:
            # Unknown word: guess from the word form, defaulting to `NOUN`.
            correct += word[1] == unknown_word_tag_noun(word[0])

print('Test accuracy: %.6f' % (correct / num_word[1]))

Training accuracy is again unchanged and remains higher than test accuracy. Test accuracy is higher than with any of the methods above.

# 3. Building a Hidden Markov Model

## 3.1 Training the Model

The model uses `nltk.tag.HiddenMarkovModelTagger`; see Section 8.4.3 of *Speech and Language Processing*.

In [None]:
from nltk.tag import HiddenMarkovModelTagger
from nltk import FreqDist
from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

In [None]:
# NOTE: `symbol`, `transition`, `output`, and `prior` must be defined before this cell runs.
hmm = HiddenMarkovModelTagger(symbols=symbol, states=tag, transitions=transition, outputs=output, priors=prior)
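As a library-free illustration of the quantities an HMM tagger needs (start, transition, and emission probabilities) and of how decoding uses them, here is a minimal sketch with maximum-likelihood estimates and Viterbi decoding on made-up toy data. It is not NLTK's implementation and does not show how the arguments above are built; `train_hmm`, `viterbi`, and `toy` are hypothetical names, and the sketch assumes non-empty sentences and applies no smoothing.

```python
from collections import Counter, defaultdict

def train_hmm(sentences):
    """Maximum-likelihood start, transition, and emission probabilities."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in sentences:
        prev = None
        for word, pos in sent:
            if prev is None:
                start[pos] += 1        # tag of the first word in a sentence
            else:
                trans[prev][pos] += 1  # tag bigram prev -> pos
            emit[pos][word] += 1       # word emitted by tag pos
            prev = pos

    def normalize(counter):
        total = sum(counter.values())
        return {key: value / total for key, value in counter.items()}

    return (normalize(start),
            {t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

def viterbi(words, states, start, trans, emit):
    """Most probable tag sequence for a non-empty list of words."""
    # best[i][s] = (probability of the best path ending in s at position i, previous state)
    best = [{s: (start.get(s, 0.0) * emit.get(s, {}).get(words[0], 0.0), None)
             for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            e = emit.get(s, {}).get(word, 0.0)
            column[s] = max(
                ((best[-1][p][0] * trans.get(p, {}).get(s, 0.0) * e, p) for p in states),
                key=lambda pair: pair[0])
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

toy = [[['the', 'DET'], ['dog', 'NOUN'], ['runs', 'VERB']],
       [['the', 'DET'], ['run', 'NOUN'], ['ends', 'VERB']],
       [['dogs', 'NOUN'], ['run', 'VERB']]]
states = ['DET', 'NOUN', 'VERB']
start, trans, emit = train_hmm(toy)
print(viterbi(['the', 'dogs', 'run'], states, start, trans, emit))
print(viterbi(['the', 'run', 'ends'], states, start, trans, emit))
```

In the toy data `run` is a noun once and a verb once, so a most-frequent-tag baseline cannot separate the two uses, while the HMM picks `VERB` after a noun and `NOUN` after a determiner. Without smoothing, a word unseen in training has zero emission probability under every tag, and multiplying raw probabilities can underflow on long sentences, so a practical version would likely add smoothing and work in log space.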

## 3.2 Testing the Model

## 3.3 Comparison with the Baseline Model

因