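## Exploring the raw JSONL files of Google Patents Public Data

- Total number of fields: `37`
- Number of scalar fields: `16`
  - Scalar fields need no separate table; they can be stored directly in the main table. They are:
    - publication_number
    - application_number
    - country_code
    - kind_code
    - application_kind
    - application_number_formatted
    - pct_number
    - family_id
    - spif_publication_number
    - spif_application_number
    - publication_date
    - filing_date
    - grant_date
    - priority_date
    - entity_status
    - art_unit
- Number of composite fields: `21`
  - One-to-many relationships, which only need to be split into child tables:
    - title_localized
    - abstract_localized
    - claims_localized (USPTO only)
    - description_localized (USPTO only)
    - Some classification codes can be handled the same way, trading storage space for simplicity instead of splitting out a separate entity table:
      - uspc
      - ipc
      - cpc
      - fi
      - fterm
      - locarno
  - Many-to-many relationships, which need an entity table first and then a relation table:
    - inventor_harmonized
    - assignee_harmonized
    - examiner
    - citation (the entity table is the main table itself; non-patent citations need a separate table)
    - priority_claim (the entity table is the main table itself)
    - child (the entity table is the main table itself)
    - parent (the entity table is the main table itself; it may be partly redundant with the child field, but both must be parsed before they are merged)
  - Dirty fields that can be discarded:
    - claims_localized_html
    - description_localized_html
    - inventor
    - assignee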
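
The plan above for scalar fields and one-to-many fields can be sketched as follows. This is a minimal illustration, not the notebook's actual loader: `split_record` is a hypothetical helper, using `publication_number` as the join key is an assumption, and the sketch assumes every element of a one-to-many field is a dict (the notebook only confirms this for some of the fields). The sample record is made up.

```python
# Field lists copied from the summary above.
SCALAR_FIELDS = [
    "publication_number", "application_number", "country_code", "kind_code",
    "application_kind", "application_number_formatted", "pct_number", "family_id",
    "spif_publication_number", "spif_application_number", "publication_date",
    "filing_date", "grant_date", "priority_date", "entity_status", "art_unit",
]
ONE_TO_MANY_FIELDS = [
    "title_localized", "abstract_localized", "claims_localized",
    "description_localized", "uspc", "ipc", "cpc", "fi", "fterm", "locarno",
]


def split_record(record):
    """Hypothetical helper: return (main_row, child_rows) for one patent record."""
    # Scalars go straight into the main-table row; missing keys become None.
    main_row = {field: record.get(field) for field in SCALAR_FIELDS}
    # Assumption: publication_number identifies the parent row in every child table.
    key = record["publication_number"]
    child_rows = {}
    for field in ONE_TO_MANY_FIELDS:
        # Each list element becomes one child row carrying the parent key.
        child_rows[field] = [
            {"publication_number": key, **item} for item in record.get(field) or []
        ]
    return main_row, child_rows


# Made-up record, only for demonstration.
sample = {
    "publication_number": "XX-0000000-A",
    "country_code": "XX",
    "title_localized": [{"text": "Example title", "language": "en", "truncated": False}],
    "ipc": [],
}
main_row, child_rows = split_record(sample)
print(main_row["publication_number"])
print(child_rows["title_localized"])
```

Carrying the parent key on every child row keeps the child tables joinable to the main table without a surrogate ID.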

In [1]:
RAW_INPUT_PATH = r'../data/raw_input/patents-000000000000.jsonl'

In [2]:
import json
import random
from tqdm import tqdm

# Fix the seed so the randomly sampled record is reproducible
random.seed(42)

In [3]:
# Read the JSONL file line by line; each line is one patent record
patent_records = []
with open(RAW_INPUT_PATH, 'r', encoding='utf-8') as file:
    for line in tqdm(file):
        json_line = json.loads(line)
        patent_records.append(json_line)

print(f'Read {len(patent_records)} patent records')

# Pick one record at random to inspect its fields
random_patent = random.choice(patent_records)

575it [00:00, 1339.89it/s]

Read 575 patent records
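
Loading every record into a list is fine for a 575-record shard. For larger shards, a generator that yields one record at a time may be preferable. The sketch below is one possible variant, not part of the original notebook: `iter_records` is a hypothetical name, and the demo writes a throwaway file with made-up records instead of touching the real data.

```python
import json
import os
import tempfile


def iter_records(path):
    """Hypothetical helper: lazily yield one parsed record per non-empty line."""
    with open(path, "r", encoding="utf-8") as file:
        for line_number, line in enumerate(file, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report where the bad line is instead of failing without context.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error


# Demo on a temporary file containing two made-up records and one blank line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"publication_number": "XX-1-A"}\n\n{"publication_number": "XX-2-A"}\n')
    demo_path = tmp.name

demo_records = list(iter_records(demo_path))
os.remove(demo_path)
print(len(demo_records))
```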





### Counting field types

How many fields hold scalar values, and how many hold composite (list) values?

In [12]:
# Classify the fields of the sampled record: list values are composite, everything else is scalar
fields = list(random_patent)
scalar_fields = []
complex_fields = []

for field, value in random_patent.items():
    if isinstance(value, list):
        complex_fields.append(field)
    else:
        scalar_fields.append(field)

In [13]:
len(fields)

37

In [14]:
len(scalar_fields), len(complex_fields)

(16, 21)
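
The split above is derived from a single randomly chosen record. As a cross-check, the value types of each field could be collected over all records, to confirm that no field is a list in one record and something else in another. The following is a sketch under that assumption: `field_types` is a hypothetical helper and the toy records are made up.

```python
from collections import defaultdict


def field_types(records):
    """Hypothetical helper: map each field name to the set of value type names seen."""
    types_by_field = defaultdict(set)
    for record in records:
        for field, value in record.items():
            types_by_field[field].add(type(value).__name__)
    return dict(types_by_field)


# Made-up records, only for demonstration.
toy_records = [
    {"publication_number": "XX-1-A", "grant_date": 0, "ipc": []},
    {"publication_number": "XX-2-A", "grant_date": None, "ipc": [{"code": "A01B"}]},
]
types = field_types(toy_records)

# A field counts as composite if it is a list in at least one record.
complex_in_any = sorted(field for field, names in types.items() if "list" in names)
print(complex_in_any)
```

On the real data this would be called as `field_types(patent_records)`. Any field whose set contains more than one type name deserves a closer look before the table schema is fixed.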

In [41]:
for i in scalar_fields:
    print(i)

publication_number
application_number
country_code
kind_code
application_kind
application_number_formatted
pct_number
family_id
spif_publication_number
spif_application_number
publication_date
filing_date
grant_date
priority_date
entity_status
art_unit


In [11]:
scalar_fields

['publication_number',
 'application_number',
 'country_code',
 'kind_code',
 'application_kind',
 'application_number_formatted',
 'pct_number',
 'family_id',
 'spif_publication_number',
 'spif_application_number',
 'publication_date',
 'filing_date',
 'grant_date',
 'priority_date',
 'entity_status',
 'art_unit']

In [10]:
complex_fields

['title_localized',
 'abstract_localized',
 'claims_localized',
 'claims_localized_html',
 'description_localized',
 'description_localized_html',
 'priority_claim',
 'inventor',
 'inventor_harmonized',
 'assignee',
 'assignee_harmonized',
 'examiner',
 'uspc',
 'ipc',
 'cpc',
 'fi',
 'fterm',
 'locarno',
 'citation',
 'parent',
 'child']

### Inspecting the structure of the composite fields


In [39]:
# Pick another random record and list the keys of the first element of each composite field
random_patent = random.choice(patent_records)
for field in complex_fields:
    print(f"Field: {field}")
    values = random_patent[field]
    try:
        # An empty list reveals nothing about the element structure
        sub_fields = list(values[0].keys()) if values else []
    except AttributeError:
        # Elements such as plain strings have no keys
        sub_fields = 'no further structure: ' + str(type(values[0]))
    print(f"  Sub-fields: {sub_fields}")

Field: title_localized
  Sub-fields: ['text', 'language', 'truncated']
Field: abstract_localized
  Sub-fields: ['text', 'language', 'truncated']
Field: claims_localized
  Sub-fields: ['text', 'language', 'truncated']
Field: claims_localized_html
  Sub-fields: ['text', 'language', 'truncated']
Field: description_localized
  Sub-fields: ['text', 'language', 'truncated']
Field: description_localized_html
  Sub-fields: ['text', 'language', 'truncated']
Field: priority_claim
  Sub-fields: ['publication_number', 'application_number', 'npl_text', 'type', 'category', 'filing_date']
Field: inventor
  Sub-fields: no further structure: <class 'str'>
Field: inventor_harmonized
  Sub-fields: ['name', 'country_code']
Field: assignee
  Sub-fields: no further structure: <class 'str'>
Field: assignee_harmonized
  Sub-fields: ['name', 'country_code']
Field: examiner
  Sub-fields: []
Field: uspc
  Sub-fields: []
Field: ipc
  Sub-fields: ['code', 'inventive', 'first', 'tree']
Field: cpc
  Sub-fields: ['code', 'inventive', 'first', 'tr
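
Because only one record is inspected, fields that happen to be empty in it (here `examiner` and `uspc`) show no sub-fields. One way around this, sketched below, is to take the union of keys over all records. `sub_field_names` is a hypothetical helper, and the toy records (including the `name` key under `examiner`) are made up for the demonstration.

```python
def sub_field_names(records, field):
    """Hypothetical helper: union of dict keys seen under `field` across records.

    Elements that are not dicts are reported by type name, e.g. '<str>'.
    """
    names = set()
    for record in records:
        for item in record.get(field) or []:
            if isinstance(item, dict):
                names.update(item.keys())
            else:
                names.add(f"<{type(item).__name__}>")
    return sorted(names)


# Made-up records: the first has an empty examiner list, the second does not.
toy_records = [
    {"examiner": [], "inventor": ["A B"]},
    {"examiner": [{"name": "C D"}], "inventor": []},
]
print(sub_field_names(toy_records, "examiner"))
print(sub_field_names(toy_records, "inventor"))
```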

In [17]:
random_patent['inventor']

['YAN BING', 'CHEN DEWEI', 'CHEN YINING', 'YANG CHANGZHOU']
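
The raw `inventor` field is a plain list of name strings, which is why the summary prefers `inventor_harmonized` for the many-to-many split. A rough sketch of that split follows: it builds an entity table keyed by `(name, country_code)` and a relation table linking publications to inventors. `build_inventor_tables`, the surrogate integer IDs, and the use of `publication_number` as the foreign key are all assumptions, and the records are made up.

```python
def build_inventor_tables(records):
    """Hypothetical helper: return (inventor_ids, relations).

    inventor_ids maps (name, country_code) to a surrogate integer ID.
    relations holds (publication_number, inventor_id, position) tuples.
    """
    inventor_ids = {}
    relations = []
    for record in records:
        harmonized = record.get("inventor_harmonized") or []
        for position, inventor in enumerate(harmonized):
            key = (inventor["name"], inventor["country_code"])
            # Reuse the existing ID for a known inventor, otherwise assign the next one.
            inventor_id = inventor_ids.setdefault(key, len(inventor_ids) + 1)
            relations.append((record["publication_number"], inventor_id, position))
    return inventor_ids, relations


# Made-up records in which one inventor appears on two publications.
toy_records = [
    {
        "publication_number": "XX-1-A",
        "inventor_harmonized": [
            {"name": "ALICE EXAMPLE", "country_code": "XX"},
            {"name": "BOB EXAMPLE", "country_code": "XX"},
        ],
    },
    {
        "publication_number": "XX-2-A",
        "inventor_harmonized": [{"name": "ALICE EXAMPLE", "country_code": "XX"}],
    },
]
inventor_ids, relations = build_inventor_tables(toy_records)
print(inventor_ids)
print(relations)
```

Storing the list position in the relation table preserves the inventor order of each publication, which would otherwise be lost after normalization.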