
English | 简体中文

DeepKE-LLM: A Large Language Model Based Knowledge Extraction Toolkit

Requirements

In the era of large models, DeepKE-LLM uses a completely new set of environment dependencies.

```bash
conda create -n deepke-llm python=3.9
conda activate deepke-llm

cd example/llm
pip install -r requirements.txt
```

Please note that the requirements.txt file is located in the example/llm folder.

News

  • [2024/04] We release a new bilingual (Chinese and English) schema-based information extraction model called OneKE based on Chinese-Alpaca-2-13B.
  • [2024/02] We release a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction dataset named IEPile, along with two models trained with IEPile, baichuan2-13b-iepile-lora and llama2-13b-iepile-lora.
  • [2023/11] The weights of knowlm-13b-ie have been updated. This update mainly adjusted the NAN outputs, shortened the inference length, and added support for instructions without a specified schema.
  • [2023/10] We released a new bilingual (Chinese and English) theme-based Information Extraction (IE) instruction dataset named InstructIE.
  • [2023/08] A specialized version of KnowLM for information extraction (IE), named knowlm-13b-ie, was launched.
  • [2023/07] Some of the instruction datasets used for training were released, including knowlm-ke and KnowLM-IE.
  • [2023/06] The first version of pre-trained weights, knowlm-13b-base-v1.0, and the first version of zhixi-13b-lora were released.
  • [2023/05] We initiated an instruction-based Information Extraction project.

Dataset

Existing Datasets

| Name | Download | Quantity | Description |
| --- | --- | --- | --- |
| InstructIE | Google Drive, Hugging Face, ModelScope, WiseModel | 300k+ | Bilingual (Chinese and English) topic-based Information Extraction (IE) instruction dataset |
| IEPile | Google Drive, Hugging Face, WiseModel, ModelScope | 2 million+ | Large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction fine-tuning dataset |
Details of InstructIE

An example of a single data entry

```json
{
  "id": "841ef2af4cfe766dd9295fb7daf321c299df0fd0cef14820dfcb421161eed4a1",
  "text": "NGC1313 is a galaxy in the constellation of Reticulum. It was discovered by the Australian astronomer James Dunlop on September 27, 1826. It has a prominent uneven shape, and its axis does not completely revolve around its center. Near NGC1313, there is another galaxy, NGC1309.",
  "relation": [
    {"head": "NGC1313", "head_type": "astronomical object type", "relation": "time of discovery", "tail": "September 27, 1826", "tail_type": "time"},
    {"head": "NGC1313", "head_type": "astronomical object type", "relation": "discoverer or inventor", "tail": "James Dunlop", "tail_type": "organization/human"},
    {"head": "NGC1313", "head_type": "astronomical object type", "relation": "of", "tail": "Reticulum", "tail_type": "astronomical object type"}
  ]
}
```
| Field | Description |
| --- | --- |
| id | The unique identifier for each data point. |
| cate | The category of the text's subject, with a total of 12 different thematic categories. |
| text | The input text for the model, with the goal of extracting all the involved relationship triples. |
| relation | Describes the relationship triples contained in the text, i.e., (head, head_type, relation, tail, tail_type). |
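
A minimal sketch of reading such entries, assuming the downloaded split is stored as one JSON object per line (the actual file names and layout depend on the download source):

```python
import json

# Hypothetical path; point this at the InstructIE split you downloaded.
path = "InstructIE/train.json"

with open(path, encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        text = example["text"]
        # Each relation entry is a (head, head_type, relation, tail, tail_type) record.
        for r in example.get("relation", []):
            print(text[:40], "->", (r["head"], r["relation"], r["tail"]))
```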
Details of IEPile

Each instance in IEPile contains four fields: task, source, instruction, and output. Below are the explanations for each field:

| Field | Description |
| --- | --- |
| task | The task to which the instance belongs, one of the five types (NER, RE, EE, EET, EEA). |
| source | The dataset to which the instance belongs. |
| instruction | The instruction for inputting into the model, processed into a JSON string via json.dumps, including three fields: "instruction", "schema", and "input". |
| output | The output in the format of a dictionary's JSON string, where the key is the schema, and the value is the extracted content. |

The instruction format in IEPile adopts a JSON-like string structure, which is essentially a dictionary-type string composed of three main components: (1) 'instruction': a task description outlining the task to be performed (one of NER, RE, EE, EET, EEA); (2) 'schema': a list of schemas to be extracted (entity types, relation types, event types); (3) 'input': the text from which information is to be extracted.

The file instruction.py provides instructions for various tasks.

Below is a data example:

```json
{
    "task": "NER",
    "source": "CoNLL2003",
    "instruction": "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}",
    "output": "{\"person\": [\"Robert Allenby\", \"Allenby\", \"Miguel Angel Martin\"], \"organization\": [], \"else\": [], \"location\": [\"Australia\", \"Spain\"]}"
}
```

This instance belongs to the NER task and comes from the CoNLL2003 dataset. The schema list to be extracted is ["person", "organization", "else", "location"], and the text to extract from is "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )". The output is {"person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"], "organization": [], "else": [], "location": ["Australia", "Spain"]}.
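
Since both the instruction and output fields are stored as JSON strings (produced with json.dumps), they must be decoded with json.loads before use. A minimal round-trip sketch based on the example above:

```python
import json

# Rebuild the example instruction programmatically to show the round trip.
instruction = {
    "instruction": "You are an expert in named entity recognition. Please extract "
                   "entities that match the schema definition from the input. Return "
                   "an empty list if the entity type does not exist. Please respond "
                   "in the format of a JSON string.",
    "schema": ["person", "organization", "else", "location"],
    "input": "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin "
             "( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )",
}

record = {
    "task": "NER",
    "source": "CoNLL2003",
    "instruction": json.dumps(instruction, ensure_ascii=False),  # stored as a JSON string
    "output": json.dumps({
        "person": ["Robert Allenby", "Allenby", "Miguel Angel Martin"],
        "organization": [],
        "else": [],
        "location": ["Australia", "Spain"],
    }, ensure_ascii=False),
}

# To consume a record, decode both JSON-string fields first.
decoded_instruction = json.loads(record["instruction"])
decoded_output = json.loads(record["output"])
for entity_type in decoded_instruction["schema"]:
    print(entity_type, "->", decoded_output.get(entity_type, []))
```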

Models

OneKE

A Bilingual Large Language Model for Information Extraction: Chinese Tutorial.

LLaMA-series

LLaMA

LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. Based on KnowLM, we also provide a bilingual LLM for knowledge extraction named ZhiXi (智析), which means intelligent analysis of data for knowledge extraction.

ZhiXi follows a two-step approach: (1) It performs further full pre-training on LLaMA (13B) using Chinese/English corpora to enhance the model's Chinese comprehension and knowledge while preserving its English and code capabilities as much as possible. (2) It fine-tunes the model using an instruction dataset to improve the language model's understanding of human instructions. For detailed information about the model, please refer to KnowLM.

Case 1: LoRA Fine-tuning of LLaMA for CCKS2023 Instruction-based KG Construction English | Chinese

Case 2: Using ZhiXi for CCKS2023 Instruction-based KG Construction English | Chinese

ChatGLM

ChatGLM

Case 1: LoRA Fine-tuning of ChatGLM for CCKS2023 Instruction-based KG Construction English | Chinese

Case 2: P-Tuning of ChatGLM for CCKS2023 Instruction-based KG Construction English | Chinese

MOSS

MOSS

Case 1: OpenDelta Fine-tuning of Moss for CCKS2023 Instruction-based KG Construction English | Chinese

Baichuan

Baichuan

Case 1: OpenDelta Fine-tuning of Baichuan for CCKS2023 Instruction-based KG Construction English | Chinese

CPM-Bee

Case 1: OpenDelta Fine-tuning of CPM-Bee for CCKS2023 Instruction-based KG Construction English | Chinese

GPT-series

GPT

Case 1: Information Extraction with LLMs English | Chinese

Case 2: Data Augmentation with LLMs English | Chinese

Case 3: CCKS2023 Instruction-based KG Construction with LLMs English | Chinese

Case 4: Unleash the Power of Large Language Models for Few-shot Relation Extraction English | Chinese

Case 5: CodeKGC-Code Language Models for KG Construction English | Chinese

To better address the Relational Triple Extraction (RTE) task in Knowledge Graph Construction, we have designed code-style prompts to model the structure of relational triples and used Code-LLMs to generate more accurate predictions. The key step in constructing code-style prompts is to transform (text, output triples) pairs into semantically equivalent programs written in Python.
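
A hedged sketch of what such a code-style prompt can look like; the class names and exact structure are illustrative rather than CodeKGC's authoritative format (see the linked case for that):

```python
# Illustrative code-style prompt: the schema is expressed as Python classes and
# each (text, triples) pair becomes a small, semantically equivalent program.
# Class and variable names here are hypothetical, not necessarily CodeKGC's.

class Entity:
    def __init__(self, name: str):
        self.name = name

class Triple:
    def __init__(self, head: Entity, relation: str, tail: Entity):
        self.head, self.relation, self.tail = head, relation, tail

# A (text, output triples) pair rendered as code:
text = "London is the capital of England."
triples = [
    Triple(Entity("London"), "capital of", Entity("England")),
]
```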


Methods

Method 1: In-Context Learning (ICL)

In-Context Learning is an approach to guiding large language models to improve their performance on specific tasks. Instead of updating the model's parameters, it conditions the model on task descriptions and demonstrations supplied directly in the prompt, so that the model can better understand and address the requirements of a particular domain. Through In-Context Learning, we can enable large language models to perform tasks such as information extraction, data augmentation, and instruction-driven knowledge graph construction.
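
As an illustration of the idea (the demonstration pair, wording, and helper function below are hypothetical, not the exact prompts used in the cases above), a few-shot relation-extraction prompt can be assembled like this:

```python
# Illustrative few-shot (in-context learning) prompt for relation extraction.
# The demonstrations are embedded directly in the prompt; no parameters are updated.
demonstrations = [
    ("Steve Jobs co-founded Apple in 1976.",
     '[{"head": "Steve Jobs", "relation": "founder of", "tail": "Apple"}]'),
]

def build_prompt(text: str) -> str:
    lines = ["Extract relation triples from the text and answer as a JSON list."]
    for demo_text, demo_answer in demonstrations:
        lines.append(f"Text: {demo_text}\nTriples: {demo_answer}")
    lines.append(f"Text: {text}\nTriples:")
    return "\n\n".join(lines)

print(build_prompt("Marie Curie discovered radium."))
```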

Method 2: LoRA

LoRA (Low-Rank Adaptation of Large Language Models) reduces the number of trainable parameters by learning low-rank decomposition matrices while freezing the original weights. This significantly reduces the storage requirements of large language models for specific tasks and enables efficient task switching during deployment without introducing inference latency. For more details, please refer to the original paper LoRA: Low-Rank Adaptation of Large Language Models.
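
As a hedged sketch of the idea using the Hugging Face peft library (the base checkpoint name and target module names below are illustrative assumptions; the actual fine-tuning scripts live in the case links above):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative base model; the cases above fine-tune LLaMA/ZhiXi checkpoints.
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-13b")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```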

Method 3: P-Tuning

The PT (P-Tuning) method, as referred to in the official code of ChatGLM, is a soft-prompt method specifically designed for large models. P-Tuning introduces new parameters only at the embedding layer, while P-Tuning-V2 adds new parameters at the embeddings as well as in each of the model's layers.
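
The ChatGLM case above uses ChatGLM's official P-Tuning code; purely as an illustration of the soft-prompt idea, peft's P-Tuning implementation can be attached roughly as follows (the base model and hyperparameters are placeholder assumptions):

```python
from transformers import AutoModelForCausalLM
from peft import PromptEncoderConfig, TaskType, get_peft_model

# Illustrative base model; the case above applies P-Tuning to ChatGLM.
base = AutoModelForCausalLM.from_pretrained("gpt2")

config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,        # length of the learned soft prompt
    encoder_hidden_size=128,      # hidden size of the prompt encoder MLP
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the prompt-encoder parameters are trainable
```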

Citation

If you use this project, please cite the following paper:

```bibtex
@misc{knowlm,
  author = {Ningyu Zhang and Jintian Zhang and Xiaohan Wang and Honghao Gui and Kangwei Liu and Yinuo Jiang and Xiang Chen and Shengyu Mao and Shuofei Qiao and Yuqi Zhu and Zhen Bi and Jing Chen and Xiaozhuan Liang and Yixin Ou and Runnan Fang and Zekun Xi and Xin Xu and Lei Li and Peng Wang and Mengru Wang and Yunzhi Yao and Bozhong Tian and Yin Fang and Guozhou Zheng and Huajun Chen},
  title = {KnowLM Technical Report},
  year = {2023},
  url = {http://knowlm.zjukg.cn/},
}
```