# 0. 准备工作

数据集来源于Wikipedia，存储于`data.json`，包含10个类别10000个样本。

每个样本都是一个字典，包含三个键：`title`、`label`和`text`。

In [1]:
import json

In [2]:
dat = []
with open("data.json", "r") as file:
    for line in file:
        dat.append(json.loads(line))

# 1.数据预处理

`num`是一个字典，其键为类别名，其值为一个字典。

该字典包含三个键：`sample`、`sent`、`word`。

## 1.计算每个类别的样本数

In [3]:
num = {}
for sample in dat:
    if sample['label'] not in num.keys():
        num[sample['label']] = {'sample':1, 'sent':0, 'word':0}
    else:
        num[sample['label']]['sample'] += 1

In [4]:
for label in num.keys():
    print( '%10s: %d' % (label, num[label]['sample']))

      Film: 3048
      Book: 975
Politician: 3824
    Writer: 837
      Food: 137
     Actor: 80
    Animal: 93
  Software: 266
    Artist: 520
   Disease: 220


可以看到，各个类别的样本数并不平均。

## 2.计算每个类别的平均句子数

使用`nltk`中的`sent_tokenize`函数：

In [5]:
from nltk.tokenize import sent_tokenize

In [6]:
for sample in dat:
    num[sample['label']]['sent'] += len(sent_tokenize(sample['text']))

In [7]:
for label in num.keys():
    print('%10s: %.2f' % (label, (num[label]['sent'] / num[label]['sample'])))

      Film: 178.62
      Book: 205.26
Politician: 225.29
    Writer: 217.89
      Food: 155.43
     Actor: 70.95
    Animal: 66.81
  Software: 202.62
    Artist: 185.04
   Disease: 349.60


## 3.计算每个类别的平均单词数

使用`nltk`中的`word_tokenize`函数：

In [8]:
from nltk.tokenize import word_tokenize

In [9]:
for sample in dat:
    num[sample['label']]['word'] += len(word_tokenize(sample['text']))

In [10]:
for label in num.keys():
    print('%10s: %.2f' % (label, (num[label]['word'] / num[label]['sample'])))

      Film: 4440.53
      Book: 5296.66
Politician: 5708.66
    Writer: 5806.83
      Food: 3477.52
     Actor: 1719.92
    Animal: 1432.11
  Software: 4812.89
    Artist: 4801.96
   Disease: 8012.57


## 4.数据预处理

在`dat`中增加一个键`sent`，存储样本中的单词。

对每句话，保留英语单词和数字，去除标点符号和特殊字符。

其中，英语单词全都小写。

In [11]:
for sample in dat:
    sample['sent'] = [[word.lower() for word in word_tokenize(sent) if word.isdigit() or word.isalpha()] 
                      for sent in sent_tokenize(sample['text'])]

## 5.划分数据集

训练集包含9000个样本，测试集包含1000个样本。

In [12]:
import numpy as np

In [13]:
np.random.seed(0)
index_choice = np.random.choice(10000, 9000, replace = False)

In [14]:
dat_train = [dat[index_train] for index_train in index_choice]
dat_test = [dat[index_test] for index_test in range(10000) if index_test not in index_choice]

# 2.建立语言模型
## 1.多元模型

由于多元模型与样本类型无关，将样本进一步整理为二维列表。

In [15]:
word_train = [sent for sent in sample['sent'] for sample in dat_train]
word_test = [sent for sent in sample['sent'] for sample in dat_test]

使用`nltk.lm.preprocessing`模块中的`padded_everygram_pipeline`函数，为每句话加上padding，并转化为多元模型。

接着，在训练集上使用 Laplace 平滑和参数为 0.1 的 Kneser-Ney 平滑建立一元、二元、三元模型。

其中，Laplace 平滑可以使用`nltk.lm`中的`Laplace`类实现，Kneser-Ney 平滑可以使用`nltk.lm`中的`KneserNeyInterpolated`类实现，

训练的模型存储于一个三维列表`lm`，每个元素是一个字典，包含两个键：`Laplace`、`Kneser-Ney`。

In [16]:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace, KneserNeyInterpolated

In [17]:
lm = [{},{},{}]
for n in (1, 2, 3):
    ngrams, vocab = padded_everygram_pipeline(n, word_train)
    lm[n - 1]['Laplace'] = Laplace(n)
    lm[n - 1]['Laplace'].fit(ngrams, vocab)
    
    ngrams, vocab = padded_everygram_pipeline(n, word_train)
    lm[n - 1]['Kneser-Ney'] = KneserNeyInterpolated(n)
    lm[n - 1]['Kneser-Ney'].fit(ngrams, vocab)

## 2.困惑度

在测试集上计算困惑度，其中一元模型的 Kneser-Ney 平滑失效，计算结果为`Inf`。

In [18]:
print('n', 'Laplace', 'Kneser-Ney')
print('%d %7.2f %10s' % (1, lm[0]['Laplace'].perplexity(word_test), 'Inf'))
print('%d %7.2f %10.2f' % (2, lm[1]['Laplace'].perplexity(word_test), lm[1]['Kneser-Ney'].perplexity(word_test)))
print('%d %7.2f %10.2f' % (3, lm[2]['Laplace'].perplexity(word_test), lm[2]['Kneser-Ney'].perplexity(word_test)))

n Laplace Kneser-Ney
1 1283.73        Inf
2 1047.28      13.26
3  947.65       2.83


可以看到，

## 3.造句

对于每个模型，造五句话：

In [22]:
print('n = 1:')
print('Laplace smoothing:')
for seed in range(5):
    print(' '.join(lm[0]['Laplace'].generate(20, text_seed = ['<s>'], random_seed = seed)))
print()
print()

print('n = 2:')
print('Laplace smoothing:')
for seed in range(5):
    print(' '.join(lm[1]['Laplace'].generate(20, text_seed = ['<s>'], random_seed = seed)))
print()
print('Kneser-Ney smoothing:')
for seed in range(5):
    print(' '.join(lm[1]['Kneser-Ney'].generate(20, text_seed = ['<s>'], random_seed = seed)))
print()
print()

print('n = 3:')
print('Laplace smoothing:')
for seed in range(5):
    print(' '.join(lm[2]['Laplace'].generate(20, text_seed = ['<s>'], random_seed = seed)))
print()
print('Kneser-Ney smoothing:')
for seed in range(5):
    print(' '.join(lm[2]['Kneser-Ney'].generate(20, text_seed = ['<s>'], random_seed = seed)))
print()
print()

n = 1:
Laplace smoothing:
to that indirectly development more in the fever levels of virus mild emergency than partners dengue virus zika the vector
away to that develop men is proteins the and a this infections that 12 is spread consider where vaccines a
whole which affected and this successfully remains first other other of biosecurity infection in spread zika which number is do
countries number human other persists also 2016 this diagnose could zika last this levels pregnant been pregnancies to mothers suspected
countries annually in before also in virus the that complete no early brazil apicoargenteus children was there the the case


n = 2:
Laplace smoothing:
there was first detected in february 2016 but is not well as in southeast asia </s> vaccine could potentially spread
because zika virus cross the larvicide pyriproxyfen in 2016 </s> the initial zika </s> in singapore after travel with 185
zika virus </s> <s> there was not have gone to mothers infected late in singapore after 

可以看到，

# 3.建立朴素贝叶斯分类器

## 1.建立模型

编写特征提取函数`extractor`：

In [35]:
def extractor(sample, vocab):
    feature = dict.fromkeys(vocab, 0)
    for sent in sample:
        for word in sent:
            if word in vocab:
                feature[word] += 1
    return feature

使用 Laplace 平滑，在30%、50%、70%、90%的训练集上建立模型并存储于一个字典`nb_model`，包含四个键：`0.3`、`0.5`、`0.7`、`0.9`。

In [36]:
from sklearn.naive_bayes import CategoricalNB

In [None]:
nb_model = {}
nb_test_feature = {}

for percentage in (0.3, 0.5, 0.7, 0.9):
    index_choice = np.random.choice(9000, int(9000 * percentage), replace = False)
    vocab = set([word for index_train in index_choice for sent in dat[index_train]['sent'] for word in sent])
    
    nb_train_feature = [[item[1] for item in sorted(extractor(dat_train[index_train]['sent'], vocab).items())] 
                        for index_train in index_choice]
    nb_test_feature[percentage] = [[item(1) for item in sorted(extractor(sample['sent'], vocab).items())] for sample in dat_test]
    nb_train_label = [dat_train[index_train]['label'] for index_train in index_choice]
    
    nb_model[percentage]  = CategoricalNB()
    nb_model[percentage].fit(nb_train_feature, nb_train_label)

## 2.计算微观F1值和宏观F1值

In [None]:
from sklearn.metrics import f1_score

In [None]:
nb_test_label = [sample['label'] for sample in dat_test]
print('percentage  micro F1  macro F1')

for percentage in (0.3, 0.5, 0.7, 0.9):
    nb_test_pred = nb_model[percentage].predict(nb_test_feature[percentage])
    micro = f1_score(nb_test_label, nb_test_pred, average = 'micro')
    macro = f1_score(nb_test_label, nb_test_pred, average = 'micro')
    print('%10.1f %8.2f %8.2f' % (micro, macro))

可以看到，