
Unicorn


This repository contains the source code for the SIGMOD 2023 paper "Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration". The paper introduces Unicorn, a unified model for supporting common data matching tasks. By learning from multiple tasks and multiple datasets, the unified model enables knowledge sharing, and it also supports zero-shot prediction for new tasks with zero labeled matching/non-matching pairs. Unicorn employs one generic Encoder that converts any pair of data elements (a, b) into a learned representation, and uses a Matcher, a binary classifier, to decide whether (a, b) is a match. A mixture-of-experts (MoE) layer sits between the two, refining the learned representation to further boost prediction quality.
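For orientation, here is a minimal PyTorch sketch of the Encoder → MoE → Matcher pipeline described above. The class names, hidden size, and expert count are illustrative assumptions, not the repository's actual implementation (see unicorn/model/ for that):

import torch
import torch.nn as nn

class MoE(nn.Module):
    # Mixture-of-experts: a gating network softly mixes several expert layers.
    def __init__(self, hidden=768, n_experts=6):  # sizes are assumptions
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_experts)])
        self.gate = nn.Linear(hidden, n_experts)

    def forward(self, x):                                    # x: (batch, hidden)
        weights = torch.softmax(self.gate(x), dim=-1)        # (batch, n_experts)
        outs = torch.stack([e(x) for e in self.experts], 1)  # (batch, n_experts, hidden)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)     # weighted mixture

class Matcher(nn.Module):
    # Binary classifier over the refined pair representation.
    def __init__(self, hidden=768):
        super().__init__()
        self.cls = nn.Linear(hidden, 2)  # logits for 0 (non-matching) / 1 (matching)

    def forward(self, x):
        return self.cls(x)

# In Unicorn the encoder is a pre-trained language model (e.g. DeBERTa) applied
# to the serialized pair (a, b); its pooled output then flows through MoE and Matcher:
#   logits = Matcher()(MoE()(encoder_representation))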

Code Structure

|-- data # datasets for 20 matching tasks
|-- figs # figures
|-- main.py # pre-train Unicorn under the unified prediction setting with the given 20 datasets (Section 5.2 of the paper)
|-- main-zero.py # pre-train Unicorn under the zero-shot setting (Section 5.3 of the paper)
|-- main-zero-ins.py # pre-train Unicorn under the zero-shot setting with instructions (Section 5.3 of the paper)
|-- finetune.py # fine-tune Unicorn with new dataset
|-- test.py # test new dataset with the pre-trained Unicorn
|-- unicorn # code for Unicorn
    |-- dataprocess # data processing folder
        |-- dataformat.py # dataset configuration
        |-- predata.py # data processing function
    |-- model # implementation of model
        |-- encoder.py # encoder module: convert serialized (a,b) into representation
        |-- moe.py # mixture-of-experts module: convert representation into a better representation
        |-- matcher.py # matcher module: convert the representation into 0 (non-matching) / 1 (matching)
    |-- trainer # model learning functions
        |-- pretrain.py # model training function
        |-- evaluate.py # evaluation function
    |-- utils # configuration files and tools
        |-- param.py # necessary parameters
        |-- utils.py # auxiliary functions

Datasets

We publish 20 datasets covering 7 matching tasks in Unicorn. Each dataset contains train.json / valid.json / test.json; the details can be found in our paper. A small loading sketch follows the list below.

  • Entity Matching
    • em-wa: Walmart-Amazon
    • em-ds: DBLP-Scholar
    • em-fz: Fodors-Zagats
    • em-ia: iTunes-Amazon
    • em-beer: Beer
  • Column Type Annotation
    • efthymiou: Efthymiou
    • t2d_col_type_anno: T2D
    • Limaye_col_type_anno: Limaye
  • Entity Linking
    • t2d: T2D
    • Limaye: Limaye
  • String Matching
    • smurf-addr: Address
    • smurf-names: Names
    • smurf-res: Researchers
    • smurf-prod: Product
    • smurf-cit: Citation
  • Schema Matching
    • fabricated_dataset: FabricatedDatasets
    • DeepMDatasets: DeepMDatasets
  • Ontology Matching
    • Illinois-onm: Cornell-Washington
  • Entity Alignment
    • dbp_yg: SRPRS: DBP-YG
    • dbp_wd: SRPRS: DBP-WD
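To inspect any of these datasets, a minimal sketch like the following should suffice; the path below is an assumed example, and the snippet falls back to JSON Lines in case a file stores one record per line:

import json

path = "data/em-wa/train.json"  # assumed example path; substitute any dataset above

with open(path, encoding="utf-8") as f:
    text = f.read()
try:
    records = json.loads(text)  # whole file is a single JSON value (e.g. a list of pairs)
except json.JSONDecodeError:
    # fallback: one JSON record per line
    records = [json.loads(line) for line in text.splitlines() if line.strip()]

print(len(records), "records; first record:")
print(records[0])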

Quick Start

Step 1: Requirements

  • Before running the code, please make sure your Python version is 3.6.5 and your CUDA version is 11.1. Then install the necessary packages:

pip install -r requirements.txt
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Step 2: Run

Pre-train Unicorn with the given datasets

  • Run the script for Unicorn:
python main.py --pretrain --model deberta_base
  • Run the script for Unicorn++:
python main.py --pretrain --model deberta_base --shuffle 1 --load_balance 1 --modelname UnicornPlus
  • Run the script for Unicorn Zero-shot:
python main-zero.py --pretrain --model deberta_base
  • Run the script for Unicorn Zero-shot with instructions:
python main-zero-ins.py --pretrain --model deberta_base

After pre-training, a checkpoint folder is generated and the three modules of the model are saved in it: encoder.pt, moe.pt, and cls.pt. If you do not want to pre-train the model yourself, you can download our pre-trained models directly from HuggingFace and save them in the checkpoint folder.
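As a rough sketch of how the three saved modules might be restored for inference (the exact module classes, and whether the files hold state dicts or whole pickled modules, are defined in unicorn/model/; the folder layout below is an assumption):

import torch

ckpt_dir = "checkpoint/UnicornPlus"  # assumed layout: one folder per --modelname

# Load on CPU so the sketch also works without a GPU.
encoder_ckpt = torch.load(f"{ckpt_dir}/encoder.pt", map_location="cpu")
moe_ckpt = torch.load(f"{ckpt_dir}/moe.pt", map_location="cpu")
cls_ckpt = torch.load(f"{ckpt_dir}/cls.pt", map_location="cpu")

# If these are state dicts, instantiate the matching modules from unicorn/model/
# and call module.load_state_dict(...); if they are pickled modules, use them directly.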

Fine-tune the model with your dataset

python finetune.py --load --ckpt UnicornPlus --model deberta_base --train_dataset_path "train_file_path1.json train_file_path2.json ..." --valid_dataset_path "valid_file_path1.json valid_file_path2.json ..." --test_dataset_path "test_file_path1.json test_file_path2.json ..." --train_metrics "f1 f1 ..." --test_metrics "f1 f1 ..." --modelname UnicornPlusNew
  • This script loads the pre-trained model UnicornPlus, fine-tunes it on the training data specified by --train_dataset_path, and saves the resulting model as UnicornPlusNew.
  • Note that --train_dataset_path is required, while --valid_dataset_path and --test_dataset_path are optional. Each path flag takes a space-separated list of files; see the example invocation below.
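For instance, to fine-tune on the Walmart-Amazon entity matching data alone (the file paths below assume the repository's data layout; adjust them to your own files):

python finetune.py --load --ckpt UnicornPlus --model deberta_base --train_dataset_path "data/em-wa/train.json" --valid_dataset_path "data/em-wa/valid.json" --test_dataset_path "data/em-wa/test.json" --train_metrics "f1" --test_metrics "f1" --modelname UnicornPlusNew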

Load the model and test directly

python test.py --load --ckpt UnicornPlus --model deberta_base --dataset_path "test_file_path1.json test_file_path2.json ..." --test_metrics "f1 f1 ..."
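For example, to evaluate the pre-trained model on a single test file (path assumed as above):

python test.py --load --ckpt UnicornPlus --model deberta_base --dataset_path "data/em-wa/test.json" --test_metrics "f1"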
