# 自然语言处理第一次作业

王逸群 19307110397

GitHub repo: [quniLcs/nlp/Assignment01](https://github.com/quniLcs/nlp/tree/main/Assignment01)

## 0. 准备工作

数据集来源于Wikipedia，存储于`data.json`，包含10个类别10000个样本。

每个样本都是一个字典，包含三个键：`title`、`label`和`text`。

In [1]:
import json

In [2]:
dat = []
with open("data.json", "r") as file:
    for line in file:
        dat.append(json.loads(line))

## 1.数据预处理

`num`是一个字典，其键为类别名，其值为一个字典。

该字典包含三个键：`sample`、`sent`、`word`。

### 1.计算每个类别的样本数

In [3]:
num = {}
for sample in dat:
    if sample['label'] not in num.keys():
        num[sample['label']] = {'sample':1, 'sent':0, 'word':0}
    else:
        num[sample['label']]['sample'] += 1

In [4]:
for label in num.keys():
    print( '%10s: %d' % (label, num[label]['sample']))

      Film: 3048
      Book: 975
Politician: 3824
    Writer: 837
      Food: 137
     Actor: 80
    Animal: 93
  Software: 266
    Artist: 520
   Disease: 220


可以看到，各个类别的样本数并不平均。

### 2.计算每个类别的平均句子数

使用`nltk`中的`sent_tokenize`函数：

In [5]:
from nltk.tokenize import sent_tokenize

In [6]:
for sample in dat:
    num[sample['label']]['sent'] += len(sent_tokenize(sample['text']))

In [7]:
for label in num.keys():
    print('%10s: %.2f' % (label, (num[label]['sent'] / num[label]['sample'])))

      Film: 178.62
      Book: 205.26
Politician: 225.29
    Writer: 217.89
      Food: 155.43
     Actor: 70.95
    Animal: 66.81
  Software: 202.62
    Artist: 185.04
   Disease: 349.60


### 3.计算每个类别的平均单词数

使用`nltk`中的`word_tokenize`函数：

In [8]:
from nltk.tokenize import word_tokenize

In [9]:
for sample in dat:
    num[sample['label']]['word'] += len(word_tokenize(sample['text']))

In [10]:
for label in num.keys():
    print('%10s: %.2f' % (label, (num[label]['word'] / num[label]['sample'])))

      Film: 4440.53
      Book: 5296.66
Politician: 5708.66
    Writer: 5806.83
      Food: 3477.52
     Actor: 1719.92
    Animal: 1432.11
  Software: 4812.89
    Artist: 4801.96
   Disease: 8012.57


### 4.数据预处理

在`dat`中增加一个键`sent`，存储样本中的单词。

对每句话，保留英语单词和数字，去除标点符号和特殊字符。

其中，英语单词全都小写。

In [11]:
for sample in dat:
    sample['sent'] = [[word.lower() for word in word_tokenize(sent) if word.isdigit() or word.isalpha()] 
                      for sent in sent_tokenize(sample['text'])]

### 5.划分数据集

训练集包含9000个样本，测试集包含1000个样本。

In [12]:
import numpy as np

In [13]:
np.random.seed(0)
index_choice = np.random.choice(10000, 9000, replace = False)

In [14]:
dat_train = [dat[index_train] for index_train in index_choice]
dat_test = [dat[index_test] for index_test in range(10000) if index_test not in index_choice]

## 2.建立语言模型
### 1.多元模型

由于多元模型与样本类型无关，将样本进一步整理为二维列表。

In [15]:
sent_train = [sent for sample in dat_train for sent in sample['sent'] if sent != []]
sent_test = [sent for sample in dat_test for sent in sample['sent'] if sent != []]

使用`nltk.lm.preprocessing`模块中的`padded_everygram_pipeline`函数，为每句话加上padding，并转化为多元模型。

接着，在训练集上使用 Laplace 平滑和参数为 0.1 的 Kneser-Ney 平滑建立一元、二元、三元模型。

其中，Laplace 平滑可以使用`nltk.lm`中的`Laplace`类实现，Kneser-Ney 平滑可以使用`nltk.lm`中的`KneserNeyInterpolated`类实现，

训练的模型存储于一个三维列表`lm`，每个元素是一个字典，包含两个键：`Laplace`、`Kneser-Ney`。

In [16]:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace, KneserNeyInterpolated

In [17]:
lm = [{},{},{}]
for n in (1, 2, 3):
    ngrams, vocab = padded_everygram_pipeline(n, sent_train)
    lm[n - 1]['Laplace'] = Laplace(n)
    lm[n - 1]['Laplace'].fit(ngrams, vocab)
    
    ngrams, vocab = padded_everygram_pipeline(n, sent_train)
    lm[n - 1]['Kneser-Ney'] = KneserNeyInterpolated(n)
    lm[n - 1]['Kneser-Ney'].fit(ngrams, vocab)

Kneser-Ney 平滑计算速度非常慢，以下暂且不考虑。

### 2.困惑度

在测试集上计算困惑度:

In [18]:
print('n Perplexity')
for n in (1, 2, 3):
    print('%d %10.2f' % (n, lm[n - 1]['Laplace'].perplexity(sent_test)))

n Perplexity
1  289890.95
2  263961.04
3  258824.59


可以看到，随着n增大，困惑度减小。

### 3.造句

对于每个模型，造五句话：

In [19]:
for n in (1, 2, 3):
    print('n = ', n, ':', sep = '')
    for seed in range(5):
        print(' '.join(lm[n - 1]['Laplace'].generate(20, text_seed = ['<s>'], random_seed = seed)))
    print()

n = 1:
the stere in disasters made however that fiction judges of tower local ends states operation der transferred with the to
at the subjects did left intellectual polish that and a the in students 1634 inequality sees correct was to a
well was again and the siege protocol film on on of belong in his senior year was multiplex individual due
curves mumbai he officially ottoman also 2012 the discovers crew years itself the juan pears became participated they mcandrew smith
critics and his before also honesty under the such congresses monkey elastic broke ang coming usually the the the careful

n = 2:
the ring from december 2010 ebert reviewing german including one thing in chicago skyscraper on being told with the torture
bringing theatre pedagogy and lindbergh for soong to asuka accepts the house or 16 december 22 2014 wall the agreement
von wahlendorf </s> <s> the regiment of hindostan king michael levey and intentional families pay you will in it ended
feingold opposed meech lake ma

可以看到，

一元模型的连贯性很差，出现了
`the to`、
`a the in`、
`on on of`、
`the the the`
等尴尬的情况。

二元模型的连贯性有所改善，出现了
`in chicago`
`december 22 2014`
`mafia money`
`full confidence`
`sound pressure`
`continued to`
`characterized as`
`use simple sliding doors`
等短语。

三元模型的连贯性相对最佳，出现了
`key effects`
`jewish ritual`
`the resolution changed senate rules`
`federal law`
`this was a cell to produce`
`with loose organisational structures`
`letters exchanged`
`by the audience`
`at that period`
等短语。

## 3.建立朴素贝叶斯分类器

编写特征提取函数`extractor`：

In [20]:
from collections import OrderedDict

In [21]:
def extractor(sample, vocab):
    feature = OrderedDict.fromkeys(vocab, 0)
    for sent in sample:
        for word in sent:
            if word in vocab:
                feature[word] += 1
    return feature

使用 Laplace 平滑，在30%、50%、70%、90%的训练集上建立模型，并计算F1值。

由于内存的限制，使用出现频数前3000的词汇作为特征。

In [25]:
from nltk.probability import FreqDist
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

In [26]:
nb_test_label = [sample['label'] for sample in dat_test]
print('percentage  micro F1  macro F1')

for percentage in (0.3, 0.5, 0.7, 0.9):
    index_choice = np.random.choice(9000, int(9000 * percentage), replace = False)
    vocab = list(FreqDist([word for index_train in index_choice for sent in dat[index_train]['sent'] for word in sent]).keys())[:3000]
    
    nb_train_feature = [[value for value in extractor(dat_train[index_train]['sent'], vocab).values()] for index_train in index_choice]
    nb_test_feature = [[value for value in extractor(sample['sent'], vocab).values()] for sample in dat_test]
    nb_train_label = [dat_train[index_train]['label'] for index_train in index_choice]
    
    nb_model = MultinomialNB()
    nb_model.fit(nb_train_feature, nb_train_label)
    
    nb_test_pred = nb_model.predict(nb_test_feature)
    micro = f1_score(nb_test_label, nb_test_pred, average = 'micro')
    macro = f1_score(nb_test_label, nb_test_pred, average = 'macro')
    print('%10.1f  %8.2f  %8.2f' % (percentage, micro, macro))

percentage  micro F1  macro F1
       0.3      0.93      0.90
       0.5      0.94      0.91
       0.7      0.91      0.90
       0.9      0.93      0.91


可以看到，在不同大小的训练集上建立的模型的F1值总体在一个范围内浮动。尽管训练集扩大，一方面改进了模型，另一方面出现了过拟合现象。

另外，微观F1值总是大于宏观F1值，是因为它会更依赖于样本数较多的类。