Sim-GPT: Text Similarity via GPT Annotated Data

This repo accompanies our paper Sim-GPT: Text Similarity via GPT Annotated Data.

In this repo you can find:

  • Scripts to reproduce our results.
  • Unlabeled and labeled data used in our paper.
  • The best checkpoints reported in our paper.

Update

  • December 12, 2023: we released our scripts, checkpoints, and data.
  • December 12, 2023: we released our paper on arXiv.

Introduction

To address a longstanding issue with the STS task, the lack of a large collection of high-quality labeled training data, we propose Sim-GPT: GPT-4 is used to generate data with STS labels, on which an STS model is subsequently trained.

Sim-GPT does not directly ask LLMs (e.g., GPT-4) to provide STS scores for a newly encountered sentence pair. Rather, it first asks LLMs to generate a relatively large set of labeled training data; a smaller model (e.g., one backboned by RoBERTa) is then trained on the data synthesized by LLMs; at test time, the trained model is used for inference.

An overview of Sim-GPT is illustrated below.
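
To make the three stages concrete, here is a minimal sketch of the pipeline's shape; all function names below are hypothetical illustrations, not the released code:

```python
# Minimal sketch of the Sim-GPT pipeline. Every name here is a hypothetical
# illustration of the three stages, not part of the released code.

def annotate_with_gpt4(sentence_pairs):
    """Stage 1: ask GPT-4 to assign an STS label to each sentence pair (Step 2)."""
    raise NotImplementedError

def train_sts_model(labeled_pairs, backbone="roberta-large"):
    """Stage 2: train a smaller STS model (SimCSE or PromCSE on a RoBERTa
    backbone) on the GPT-4-labeled data (Step 3)."""
    raise NotImplementedError

def predict_similarity(model, sentence_a, sentence_b):
    """Stage 3 (test time): the trained model scores new pairs on its own;
    GPT-4 is no longer called (Step 4)."""
    raise NotImplementedError
```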

Reproduce

Step 1: Requirements

1. OpenAI Requirements

  • python>=3.7.3
  • openai>=0.27.2

2. SimCSE Requirements

For training SimCSE models, we follow SimCSE. For convenience, we copy its requirements here:

```
transformers==4.2.1
scipy
datasets
pandas
scikit-learn
prettytable
gradio
torch
setuptools
```

3. PromCSE Requirements

For training PromCSE models, we follow PromCSE. For convenience, we copy its requirements here:

```
transformers==4.2.1
scipy==1.5.4
datasets==1.2.1
pandas==1.1.5
scikit-learn==0.24.0
prettytable==2.1.0
gradio
torch
setuptools==49.3.0
```

Step 2: GPT-4 Data Annotation

In this part, we offer links to download the source data and provide prompts that guide GPT-4 in the annotation process.

1. Download Source Data

Three types of source data are used: captions, questions, and multi-genre sentences.

2. GPT-4 Annotation

Prompts:

  • Captions: ./prompts/captions.txt
  • Questions: ./prompts/questions.txt
  • Multi-genre Sentences: ./prompts/multi_genre_sentences.txt

Note that, for several reasons related to accessing GPT-4 (e.g., batch size, network conditions), data re-created using the above prompts may vary slightly from the dataset we have released. As mentioned in our paper, despite significant variations in the prompt, the performance of a model trained on the generated data tends to remain consistent on the STS task.
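
As a sketch, an annotation request built from one of the prompt files above might look like the following, using the openai package from Step 1; how the template is combined with a sentence pair, and the decoding settings, are our assumptions rather than the repo's exact procedure:

```python
import openai  # openai>=0.27.2, per Step 1

openai.api_key = "YOUR_API_KEY"

# One of the released prompt templates listed above.
with open("./prompts/captions.txt") as f:
    prompt_template = f.read()

# A hypothetical sentence pair to be labeled with an STS score.
pair = ("A man is playing a guitar.", "A man is playing an instrument.")

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"{prompt_template}\nSentence 1: {pair[0]}\nSentence 2: {pair[1]}",
    }],
    temperature=0,  # assumption: deterministic decoding for labeling
)

# GPT-4's STS annotation for the pair.
print(response["choices"][0]["message"]["content"])
```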

Step 3: Training STS Models

  1. Clone the related project:
    1. SimCSE
    2. PromCSE
  2. Download the backbone RoBERTa models:
    1. RoBERTa-base
    2. RoBERTa-large
  3. Fill the RoBERTa path, input file, and output directory into our provided training scripts:
    1. SimCSE-RoBERTa
      1. base: ./training-parameters/simcse/sup_roberta_base.sh
      2. large: ./training-parameters/simcse/sup_roberta_large.sh
    2. PromCSE-RoBERTa
      1. base: ./training-parameters/promcse/sup_roberta_base.sh
      2. large: ./training-parameters/promcse/sup_roberta_large.sh
  4. Move the modified scripts into the directory of the related project, for example:
    1. mv ./training-parameters/simcse/sup_roberta_base.sh SimCSE/
  5. Run the training script, for example: bash SimCSE/sup_roberta_base.sh

Step 4: Evaluation

We evaluate Sim-GPT on 7 STS tasks and report Spearman's correlation.

  1. Clone the related project:
    1. SimCSE
    2. PromCSE
  2. Run the official evaluation code, for example:
```bash
python SimCSE/evaluation.py \
    --model_name_or_path simgpt-simcse-roberta-large \
    --pooler cls \
    --task_set sts \
    --mode test
```

which is expected to output the results in a tabular format:

```
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 78.79 | 88.22 | 83.48 | 88.32 | 85.48 |    87.91     |      81.07      | 84.75 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```
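
Beyond the benchmark script, a trained checkpoint can also be used directly to score a sentence pair. Below is a minimal sketch using Hugging Face transformers with plain CLS pooling; the checkpoint path is assumed to be a local output directory from Step 3, and SimCSE's own evaluation may additionally apply its trained MLP pooler, so scores can differ slightly:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: path to a trained checkpoint, e.g. an output directory from Step 3.
model_name = "simgpt-simcse-roberta-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["A man is playing a guitar.", "A man is playing an instrument."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)
    # Plain CLS pooling, analogous to the `--pooler cls` flag above.
    embeddings = outputs.last_hidden_state[:, 0]

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0
)
print(f"cosine similarity: {similarity.item():.4f}")
```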

Step 5: Results

Table 1: Results reported in our paper on 7 STS tasks.

| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|-------|-------|-------|-------|-------|-------|-------|--------|------|
| *Supervised Model* | | | | | | | | |
| InferSent-GloVe | 52.86 | 66.75 | 62.15 | 72.77 | 66.87 | 68.03 | 65.65 | 65.01 |
| Universal Sentence Encoder | 64.49 | 67.80 | 64.61 | 76.83 | 73.18 | 74.92 | 76.69 | 71.22 |
| SRoBERTa-base | 71.54 | 72.49 | 70.80 | 78.74 | 73.69 | 77.77 | 74.46 | 74.21 |
| SBERT-base | 70.97 | 76.53 | 73.19 | 79.09 | 74.30 | 77.03 | 72.91 | 74.89 |
| CT-SBERT-base | 74.84 | 83.20 | 78.07 | 83.84 | 77.93 | 81.46 | 76.42 | 79.39 |
| *SimGPT - SimCSE* | | | | | | | | |
| SimCSE-RoBERTa-base | 76.53 | 85.21 | 80.95 | 86.03 | 82.57 | 85.83 | 80.50 | 82.52 |
| SimGPT - SimCSE-RoBERTa-base | 77.65 (+1.12) | 86.15 (+0.94) | 80.58 (-0.37) | 86.47 (+0.44) | 84.08 (+1.51) | 86.20 (+0.37) | 80.88 (+0.38) | 83.14 (+0.62) |
| SimCSE-RoBERTa-large | 77.46 | 87.27 | 82.36 | 86.66 | 83.93 | 86.70 | 81.95 | 83.76 |
| SimGPT - SimCSE-RoBERTa-large | 78.79 (+1.33) | 88.22 (+0.95) | 83.48 (+1.12) | 88.32 (+1.66) | 85.48 (+1.55) | 87.91 (+1.21) | 81.07 (-0.88) | 84.75 (+0.99) |
| *SimGPT - PromCSE* | | | | | | | | |
| PromCSE-RoBERTa-base | 77.51 | 86.15 | 81.59 | 86.92 | 83.81 | 86.35 | 80.49 | 83.26 |
| SimGPT - PromCSE-RoBERTa-base | 77.74 (+0.23) | 86.82 (+0.77) | 81.36 (-0.23) | 87.01 (+0.09) | 84.58 (+0.77) | 86.98 (+0.63) | 80.48 (-0.01) | 83.57 (+0.31) |
| PromCSE-RoBERTa-large | 79.56 | 88.97 | 83.81 | 88.08 | 84.96 | 87.87 | 82.43 | 85.10 |
| SimGPT - PromCSE-RoBERTa-large | 79.92 (+0.36) | 88.87 (-0.10) | 84.29 (+0.48) | 88.64 (+0.56) | 85.94 (+0.98) | 88.18 (+0.31) | 82.79 (+0.36) | 85.52 (+0.42) |

Released Checkpoints

| Model | Avg. STS |
|-------|----------|
| simgpt-simcse-roberta-base | 83.14 |
| simgpt-simcse-roberta-large | 84.75 |
| simgpt-promcse-roberta-base | 83.57 |
| simgpt-promcse-roberta-large | 85.52 |

Released GPT-4 Annotated Data

Our released annotated data are:

Contact

If you have any issues or questions about this repo, feel free to contact wangshuhe@stu.pku.edu.cn.
