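# Dataset Analysis and Selection

## AI Patent Dataset

+ Dataset size: 100 records
+ Long text: yes
+ Data format: dictionary (one per line)
+ Records may each have a different set of attributes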

In [None]:
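import ast
import os

dataset_path = os.path.join(os.getcwd(), 'datasets')
file_name = 'ai_patent.txt'
file_path = os.path.join(dataset_path, file_name)

# Each non-empty line holds one Python dict literal.
# ast.literal_eval parses literals only, so it is safer than eval().
ai_patent = []
with open(file_path, encoding='utf-8') as f:  # assumes the file is UTF-8
    for line in f:
        line = line.strip()
        if line:
            ai_patent.append(ast.literal_eval(line))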
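
Because records may carry different attributes, it helps to check which keys occur and how often before relying on any of them. The following is a minimal sketch; `sample_records` and its keys are hypothetical stand-ins for the parsed `ai_patent` list, and `attribute_coverage` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

# Hypothetical records standing in for the parsed `ai_patent` list.
sample_records = [
    {'title': 'A', 'abstract': 'x', 'ipc': 'G06N'},
    {'title': 'B', 'abstract': 'y'},
    {'title': 'C', 'applicant': 'z'},
]


def attribute_coverage(records):
    """Count how many records contain each attribute (dict key)."""
    coverage = Counter()
    for record in records:
        coverage.update(record.keys())
    return coverage


coverage = attribute_coverage(sample_records)
for name, count in coverage.most_common():
    print(f'{name}: {count}/{len(sample_records)}')
```

Attributes whose count is well below the number of records are the ones that need a default or a presence check downstream.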

## Medical Record Dataset

+ Dataset size: 1000 records
+ Long text: yes
+ Every text has entity annotations
+ Data structure:
  ```
  - documents: # list
    - originalText # str
    - entities # list
      - label_type
      - overlap
      - start_pos
      - end_pos
  ```
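
The `start_pos` and `end_pos` fields index into `originalText`, so an entity's surface string can be recovered by slicing. A minimal sketch follows, assuming `end_pos` is exclusive (this matches how the statistics cell in this section slices the text); the record itself is invented for illustration.

```python
# A hypothetical record that follows the structure above.
record = {
    'originalText': '患者因腹痛入院，行CT检查。',
    'entities': [
        {'label_type': '解剖部位', 'overlap': 0, 'start_pos': 3, 'end_pos': 4},
        {'label_type': '影像检查', 'overlap': 0, 'start_pos': 9, 'end_pos': 11},
    ],
}


def entity_spans(record):
    """Return (surface text, label type) pairs for one record."""
    text = record['originalText']
    return [
        # int() tolerates positions stored as strings
        (text[int(e['start_pos']):int(e['end_pos'])], e['label_type'])
        for e in record['entities']
    ]


spans = entity_spans(record)
print(spans)
```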

In [124]:
import json
import os

dataset_path = os.path.join(os.getcwd(), 'datasets', 'yidu-s4k')
file_names = ['subtask1_training_part1.txt', 'subtask1_training_part2.txt']
s4k = []

for file_name in file_names:
    file_path = os.path.join(dataset_path, file_name)
    # utf-8-sig strips the BOM at the start of the file, if present
    with open(file_path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            line = line.strip()
            if not line:
                # skip empty lines to avoid a JSON decode error
                continue
            s4k.append(json.loads(line))

In [171]:
# Statistics collected per document and for the whole corpus:
# - word_cnt: number of characters in the text
# - entity_cnt: number of annotated entity mentions
# - entity_type_cnt: number of distinct entity types
# - entity: Counter mapping each entity string to its occurrence count
# - entity_type: dict mapping each entity type to its set of entity strings

from collections import Counter, defaultdict

entity, entity_type = Counter(), defaultdict(set)
word_cnt = 0

for doc in s4k:
    text = doc['originalText']
    doc_entities, doc_types = Counter(), set()
    for i in doc['entities']:
        entity_name = text[int(i['start_pos']):int(i['end_pos'])]
        entity_type[i['label_type']].add(entity_name)
        doc_types.add(i['label_type'])
        doc_entities[entity_name] += 1
    doc['stat'] = {
        'word_cnt': len(text),
        'entity_cnt': len(doc['entities']),
        'entity_type_cnt': len(doc_types),
    }
    word_cnt += len(text)
    entity.update(doc_entities)

In [174]:
print('The corpus contains {} medical records.'.format(len(s4k)))
print('The corpus contains {} distinct entities in {} categories: {}.'.format(
    len(entity), len(entity_type), list(entity_type.keys())))
ranked = entity.most_common()  # sorted from most to least frequent
print('Entities occur {} times in total; the most frequent is {} (count: {}) and the least frequent is {} (count: {}).'.format(
    sum(entity.values()), ranked[0][0], ranked[0][1], ranked[-1][0], ranked[-1][1]))
print('On average, a record has {} characters and {} entities.'.format(
    round(word_cnt / len(s4k), 3), round(sum(entity.values()) / len(s4k), 3)))

The corpus contains 1000 medical records.
The corpus contains 5330 distinct entities in 6 categories: ['疾病和诊断', '手术', '解剖部位', '药物', '影像检查', '实验室检验'].
Entities occur 17653 times in total; the most frequent is 腹 (count: 1457) and the least frequent is 植入支架 (count: 1).
On average, a record has 418.362 characters and 17.653 entities.
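
The totals above do not show how mentions are distributed across the six categories. One way to get that breakdown is sketched below; `docs` is a hypothetical two-record sample in the same shape as the corpus entries, so the numbers it prints say nothing about the real data.

```python
from collections import Counter

# Hypothetical records in the same shape as the corpus entries.
docs = [
    {'originalText': '腹痛，行CT检查。',
     'entities': [
         {'label_type': '解剖部位', 'start_pos': 0, 'end_pos': 1},
         {'label_type': '影像检查', 'start_pos': 4, 'end_pos': 6},
     ]},
    {'originalText': '腹部CT未见异常。',
     'entities': [
         {'label_type': '解剖部位', 'start_pos': 0, 'end_pos': 1},
         {'label_type': '影像检查', 'start_pos': 2, 'end_pos': 4},
     ]},
]

# Number of annotated mentions carrying each label type.
mentions_per_type = Counter()
for doc in docs:
    for e in doc['entities']:
        mentions_per_type[e['label_type']] += 1

for label, count in mentions_per_type.most_common():
    print(label, count)
```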


## Diabetes Dataset

+ Data structure

    ```
    - json
      - doc_id
      - paragraphs  # list
        - paragraph_id
        - paragraph # str
        - sentences # list
          - sentence_id
          - sentence
          - start_idx
          - end_idx
          - entities # list
            - entity_id
            - entity
            - entity_type
            - start_idx
            - end_idx
          - relations # list
            - relation_type
            - relation_id
            - head_entity_id
            - tail_entity_id
    ```
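+ Summary:
  + 1438 distinct entities in 18 types; 735 distinct relations in 16 types
  + 22050 entity mentions and 8643 relation mentions occur in the text
  + Document statistics:
    - Each document has 7035.5 characters on average (min 1300, max 15260); 288,454 characters in total
    - Each document has 55.9 paragraphs on average (min 13, max 125); 2,292 paragraphs in total
  + Sentence statistics:
    - Each document has 85.3 sentences on average (min 21, max 211); 3,498 sentences in total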
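
Relations refer to entities by id, so turning them into readable triples requires an id-to-text lookup. The sketch below walks the nested structure described above. The document, its entity types, and its relation type are all invented for illustration, and the code assumes that a relation's head and tail entities are annotated in the same sentence as the relation.

```python
# A hypothetical document following the structure above.
doc = {
    'doc_id': '1',
    'paragraphs': [{
        'paragraph_id': '0',
        'paragraph': '二甲双胍可降低血糖。',
        'sentences': [{
            'sentence_id': '0',
            'sentence': '二甲双胍可降低血糖。',
            'start_idx': 0,
            'end_idx': 10,
            'entities': [
                {'entity_id': 'T1', 'entity': '二甲双胍', 'entity_type': 'Drug',
                 'start_idx': 0, 'end_idx': 4},
                {'entity_id': 'T2', 'entity': '血糖', 'entity_type': 'Test',
                 'start_idx': 7, 'end_idx': 9},
            ],
            'relations': [
                {'relation_type': 'Test_Drug', 'relation_id': 'R1',
                 'head_entity_id': 'T2', 'tail_entity_id': 'T1'},
            ],
        }],
    }],
}


def iter_triples(doc):
    """Yield (head text, relation type, tail text) for every relation."""
    for paragraph in doc['paragraphs']:
        for sentence in paragraph['sentences']:
            id_to_text = {e['entity_id']: e['entity'] for e in sentence['entities']}
            for r in sentence['relations']:
                # .get() yields None if an id is not found in this sentence
                head = id_to_text.get(r['head_entity_id'])
                tail = id_to_text.get(r['tail_entity_id'])
                yield head, r['relation_type'], tail


triples = list(iter_triples(doc))
print(triples)
```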

In [8]:
# read files
import os, json

dataset_path = os.path.join(os.getcwd(), 'datasets')
dataset_name = 'diakg'
dataset_path = os.path.join(dataset_path, dataset_name)
diakg = []

for file in os.listdir(dataset_path):
    if not file.endswith('.json'):
        # avoid DS_Store
        continue
    file_path = os.path.join(dataset_path, file)
    with open(file_path, 'r', encoding='utf-8') as f:
        diakg.append(json.load(f))

In [9]:
# stat the entity and relation
from collections import defaultdict

entity, entity_type = defaultdict(), set()
relation, relation_type = defaultdict(), set()
entity_cnt, relation_cnt = 0, 0

for doc in diakg:
    for paragraph in doc['paragraphs']:
        for sentence in paragraph['sentences']:
            for e in sentence['entities']:
                entity_cnt += 1
                entity[e['entity_id']] = e['entity']
                entity_type.add(e['entity_type'])
            for r in sentence['relations']:
                relation[r['relation_id']] = r['relation_type']
                relation_type.add(r['relation_type'])
                relation_cnt += 1

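# "distinct" here means distinct entity_id / relation_id values
print("{} distinct entities in {} types; {} distinct relations in {} types".format(
    len(entity), len(entity_type), len(relation), len(relation_type)))
print("{} entity mentions and {} relation mentions occur in the text".format(entity_cnt, relation_cnt))

1438 distinct entities in 18 types; 735 distinct relations in 16 types
22050 entity mentions and 8643 relation mentions occur in the text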
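
The distinct counts above are keyed by id. If the goal is the number of distinct surface forms instead, the mentions can be grouped by entity text and type. This is only a sketch on invented mentions; whether ids are reused across documents in the real data is not established here.

```python
from collections import Counter

# Hypothetical mentions in the shape of a sentence's `entities` list.
mentions = [
    {'entity_id': 'T1', 'entity': '糖尿病', 'entity_type': 'Disease'},
    {'entity_id': 'T2', 'entity': '胰岛素', 'entity_type': 'Drug'},
    {'entity_id': 'T3', 'entity': '糖尿病', 'entity_type': 'Disease'},
]

# Distinct ids: every mention has its own id in this sample.
by_id = {m['entity_id'] for m in mentions}

# Distinct surface forms: repeated mentions of the same string collapse.
by_surface = Counter((m['entity'], m['entity_type']) for m in mentions)

print(len(by_id), len(by_surface))
```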


In [119]:
# stat the text
# #################################
# two levels
# - corpus
#     - cnt word length
#     - cnt sentence length
#     - cnt paragraph length
#     - variance sentence length
#     - variance paragraph length
#     - median sentence length
#     - median paragraph length
#     - mode sentence length
#     - mode paragraph length
#     - min. sentence length
#     - avg. sentence length
#     - max. sentence length
#     - min. paragraph length
#     - avg. paragraph length
#     - max. paragraph length
# - document
#     - cnt word length
#     - cnt sentence length
#     - cnt paragraph length
#     - variance sentence length
#     - variance paragraph length
#     - median sentence length
#     - median paragraph length
#     - mode sentence length
#     - mode paragraph length
#     - min. sentence length
#     - avg. sentence length
#     - max. sentence length
#     - min. paragraph length
#     - avg. paragraph length
#     - max. paragraph length
# #################################

import numpy as np
from scipy import stats

# sort the diakg by the doc_id in ascending manner
diakg = sorted(diakg, key=lambda i: int(i['doc_id']))

# get the raw running text: corpus -> documents -> paragraphs -> sentences
corpus = [
    [[sentence['sentence'] for sentence in paragraph['sentences']]
     for paragraph in doc['paragraphs']]
    for doc in diakg
]


def describe(lengths):
    """Summarize a list of lengths."""
    return {
        'cnt': len(lengths),
        'mean': np.mean(lengths),
        'max': np.max(lengths),
        'min': np.min(lengths),
        'std': np.std(lengths),
        'mid': np.median(lengths),
        'mode': stats.mode(lengths),
    }


# stat the running text
doc_stat = []

for doc in corpus:
    stat = {}
    # paragraph length = number of sentences in the paragraph
    stat['paragraph'] = describe([len(para) for para in doc])
    # sentence length = number of characters in the sentence
    sen = [j for i in doc for j in i]  # get sentence list
    stat['sentence'] = describe([len(s) for s in sen])
    # word count = total number of characters in the document
    stat['word'] = sum(len(s) for s in sen)
    doc_stat.append(stat)


In [120]:
# output the stat results
import pandas as pd

tuples = [
    ('Word', 'Count'),
    ('Paragraph', 'Cnt'),
    ('Paragraph', 'Min'),
    ('Paragraph', 'Mid'),
    ('Paragraph', 'Max'),
    ('Paragraph', 'Mean'),
    ('Paragraph', 'Std'),
    ('Paragraph', 'Mode'),
    ('Sentence', 'Cnt'),
    ('Sentence', 'Min'),
    ('Sentence', 'Mid'),
    ('Sentence', 'Max'),
    ('Sentence', 'Mean'),
    ('Sentence', 'Std'),
    ('Sentence', 'Mode'),
]
columns = pd.MultiIndex.from_tuples(tuples, names=['Level', 'Stat'])
index = pd.Index(range(1, len(doc_stat) + 1), name="Doc Id")
df = pd.DataFrame(columns=columns, index=index)

for cnt, doc in enumerate(doc_stat, start=1):
    for level, level_stat in doc.items():
        if level == 'word':
            df.at[cnt, ('Word', 'Count')] = int(level_stat)
            continue
        for stat, value in level_stat.items():
            position = (level.capitalize(), stat.capitalize())
            if stat == 'mode':
                # ModeResult holds arrays in older SciPy and scalars in newer releases
                mode_value = np.atleast_1d(value[0])[0]
                mode_count = np.atleast_1d(value[1])[0]
                value = '{}({})'.format(mode_value, mode_count)
            elif stat == 'mid':
                value = int(value)
            elif isinstance(value, float):
                value = round(value, 3)
            df.at[cnt, position] = value

df

| Doc Id | Word Count | Paragraph Cnt | Paragraph Min | Paragraph Mid | Paragraph Max | Paragraph Mean | Paragraph Std | Paragraph Mode | Sentence Cnt | Sentence Min | Sentence Mid | Sentence Max | Sentence Mean | Sentence Std | Sentence Mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5758 | 60 | 1 | 1 | 4 | 1.383 | 0.838 | 1(47) | 83 | 2 | 41 | 424 | 69.373 | 82.391 | 11(5) |
| 2 | 5708 | 25 | 1 | 1 | 5 | 1.56 | 1.023 | 1(17) | 39 | 6 | 74 | 894 | 146.359 | 170.043 | 40(2) |
| 3 | 3406 | 29 | 1 | 1 | 3 | 1.483 | 0.725 | 1(19) | 43 | 4 | 60 | 363 | 79.209 | 73.174 | 14(2) |
| 4 | 4823 | 32 | 1 | 1 | 5 | 1.594 | 1.114 | 1(22) | 51 | 4 | 67 | 463 | 94.569 | 93.19 | 78(3) |
| 5 | 2136 | 14 | 1 | 1 | 3 | 1.5 | 0.824 | 1(10) | 21 | 4 | 76 | 420 | 101.714 | 103.575 | 6(3) |
| 6 | 14685 | 69 | 1 | 1 | 5 | 1.391 | 0.855 | 1(53) | 96 | 7 | 116 | 543 | 152.969 | 138.343 | 17(3) |
| 7 | 6847 | 64 | 1 | 1 | 5 | 1.312 | 0.827 | 1(53) | 84 | 4 | 50 | 382 | 81.512 | 86.848 | 41(6) |
| 8 | 6439 | 42 | 1 | 1 | 8 | 1.786 | 1.44 | 1(27) | 75 | 1 | 54 | 444 | 85.853 | 96.324 | 18(3) |
| 9 | 5567 | 31 | 1 | 1 | 5 | 1.355 | 0.863 | 1(25) | 42 | 1 | 87 | 475 | 132.548 | 114.112 | 1(1) |
| 10 | 3601 | 58 | 1 | 1 | 4 | 1.276 | 0.664 | 1(47) | 74 | 5 | 29 | 347 | 48.662 | 54.963 | 7(5) |
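
The table covers the document level only, while the plan in the statistics cell also lists a corpus level. A corpus-level summary can be obtained by flattening all documents before computing the same statistics. The sketch below uses a tiny invented `corpus` and the standard library instead of NumPy and SciPy; `summarize` is an illustrative helper, not part of the notebook.

```python
import statistics

# Hypothetical corpus: documents -> paragraphs -> sentences (strings).
corpus = [
    [['ab', 'cde'], ['f']],
    [['ghij'], ['kl', 'm', 'no']],
]


def summarize(lengths):
    """Return basic descriptive statistics for a list of lengths."""
    return {
        'cnt': len(lengths),
        'min': min(lengths),
        'max': max(lengths),
        'mean': statistics.mean(lengths),
        'median': statistics.median(lengths),
        'std': statistics.pstdev(lengths),  # population std, like np.std
    }


# Flatten across documents so every paragraph and sentence counts once.
paragraphs = [para for doc in corpus for para in doc]
sentences = [sen for para in paragraphs for sen in para]

corpus_stat = {
    'paragraph': summarize([len(para) for para in paragraphs]),  # sentences per paragraph
    'sentence': summarize([len(sen) for sen in sentences]),      # characters per sentence
    'word': sum(len(sen) for sen in sentences),                  # total characters
}
print(corpus_stat)
```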


## DuIE2.0 Chinese Relation Extraction Dataset

+ source: https://www.luge.ai/#/luge/dataDetail?id=5
+ Structure
  ```
    - DuIE2.0.zip 
        - duie_dev.json.zip
        - duie_schema.zip
        - duie_train.json.zip
        - duie_sample.json.zip
        - duie_test2.json.zip

    - duie_*.json.zip
        - duie_*.json
        - License.docx
  ```
+ Statistics:
  + The train set has 171,135 records; the dev set has 20,652 records; the test set has 101,239 records.
  + The train set has 310,378 triples; the dev set has 37,789 triples.
  + The train set has 190,020 entities; the dev set has 37,932 entities; 205,948 distinct entities in total.

In [1]:
import json
import os

dataset_path = os.path.join(os.getcwd(), 'datasets', 'DuIE2.0')

# locate the train, dev, test files anywhere under the dataset directory
targets = {'duie_train.json', 'duie_dev.json', 'duie_test2.json'}
data_paths = {}
for root, _, files in os.walk(dataset_path):
    for name in files:
        if name in targets:
            data_paths[name] = os.path.join(root, name)


def read_json_lines(path):
    """Read a file that stores one JSON object per line."""
    # json.load() on the whole file fails with "Extra data: line 2 column 1"
    # ref: https://blog.csdn.net/u011318077/article/details/88550775
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# read the train, dev, test json
train_data = read_json_lines(data_paths['duie_train.json'])
dev_data = read_json_lines(data_paths['duie_dev.json'])
test_data = read_json_lines(data_paths['duie_test2.json'])
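
The "Extra data" error arises because each file holds one JSON object per line instead of a single JSON document. The sketch below reproduces the failure on a small in-memory string and shows the line-by-line parse that avoids it; the content of `raw` is invented.

```python
import json

# Two JSON objects, one per line (hypothetical content).
raw = '{"text": "a", "spo_list": []}\n{"text": "b", "spo_list": []}\n'

try:
    json.loads(raw)  # treats the whole string as one JSON document
    whole_file_ok = True
except json.JSONDecodeError as err:
    whole_file_ok = False
    print('whole-file parse failed:', err.msg)

# Parsing each non-empty line separately succeeds.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
print(len(records))
```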

In [24]:
# data size
print(
    'The train set has {:,} records; the dev set has {:,} records; the test set has {:,} records.'
    .format(len(train_data), len(dev_data), len(test_data)))

# triples count
train_tri, dev_tri = [], []
for doc in train_data:
    train_tri.extend(doc['spo_list'])
for doc in dev_data:
    dev_tri.extend(doc['spo_list'])
print('The train set has {:,} triples; the dev set has {:,} triples.'.format(
    len(train_tri), len(dev_tri)))

# entity count: distinct subject and object strings
train_entity, dev_entity = set(), set()
for tri in train_tri:
    train_entity.add(tri['subject'])
    train_entity.add(tri['object']['@value'])
for tri in dev_tri:
    dev_entity.add(tri['subject'])
    dev_entity.add(tri['object']['@value'])
total_entity = train_entity | dev_entity
print(
    'The train set has {:,} entities; the dev set has {:,} entities; {:,} distinct entities in total.'.format(
        len(train_entity), len(dev_entity), len(total_entity)))

The train set has 171,135 records; the dev set has 20,652 records; the test set has 101,239 records.
The train set has 310,378 triples; the dev set has 37,789 triples.
The train set has 190,020 entities; the dev set has 37,932 entities; 205,948 distinct entities in total.
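
Relation types can be counted in the same way as entities. The following is a minimal sketch, assuming each triple names its relation under a `predicate` key; that key is not shown in the cells above and should be checked against the actual `spo_list` entries. The triples here are invented.

```python
from collections import Counter

# Hypothetical triples; the 'predicate' key is an assumption about the schema.
triples = [
    {'subject': '呐喊', 'predicate': '作者', 'object': {'@value': '鲁迅'}},
    {'subject': '彷徨', 'predicate': '作者', 'object': {'@value': '鲁迅'}},
    {'subject': '鲁迅', 'predicate': '国籍', 'object': {'@value': '中国'}},
]

# Occurrences of each relation type across all triples.
relation_counter = Counter(t['predicate'] for t in triples)

print('{:,} relation types over {:,} triples'.format(
    len(relation_counter), sum(relation_counter.values())))
```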
