Sim-GPT: Text Similarity via GPT Annotated Data

This repo accompanies our paper Sim-GPT: Text Similarity via GPT Annotated Data.

In this repo you can find:

  • Scripts to reproduce our results.
  • Unlabeled and labeled data used in our paper.
  • The best checkpoints reported in our paper.

Update

  • December 12, 2023: we released our scripts, checkpoints, and data.
  • December 12, 2023: we released our paper on arXiv.

Introduction

To address a longstanding issue with the STS task, the lack of a large collection of high-quality labeled training data, we propose Sim-GPT: GPT-4 is used to generate data with STS labels, on which an STS model is subsequently trained.

Sim-GPT does not directly ask LLMs (e.g., GPT-4) to provide STS scores for a newly encountered sentence pair. Rather, it first asks LLMs to generate a relatively large set of labeled training data; a smaller model (e.g., one backboned by RoBERTa) is then trained on the data synthesized by LLMs; at test time, the trained model is used for inference.

An overview of Sim-GPT is illustrated below.
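
To make the three stages concrete, here is a minimal sketch of the pipeline's shape; all function names below are hypothetical illustrations, not the released code:

```python
# Minimal sketch of the Sim-GPT pipeline. Every name here is a hypothetical
# illustration of the three stages, not part of the released code.

def annotate_with_gpt4(sentence_pairs):
    """Stage 1: ask GPT-4 to assign an STS label to each sentence pair (Step 2)."""
    raise NotImplementedError

def train_sts_model(labeled_pairs, backbone="roberta-large"):
    """Stage 2: train a smaller STS model (SimCSE or PromCSE on a RoBERTa
    backbone) on the GPT-4-labeled data (Step 3)."""
    raise NotImplementedError

def predict_similarity(model, sentence_a, sentence_b):
    """Stage 3 (test time): the trained model scores new pairs on its own;
    GPT-4 is no longer called (Step 4)."""
    raise NotImplementedError
```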

Reproduce

Step 1: Requirements

1. OpenAI Requirements

  • python>=3.7.3
  • openai>=0.27.2

2. SimCSE Requirements

For training SimCSE models, we follow SimCSE. For convenience, we copy its requirements here:

```
transformers==4.2.1
scipy
datasets
pandas
scikit-learn
prettytable
gradio
torch
setuptools
```

3. PromCSE Requirements

For training PromCSE models, we follow PromCSE. For convenience, we copy its requirements here:

```
transformers==4.2.1
scipy==1.5.4
datasets==1.2.1
pandas==1.1.5
scikit-learn==0.24.0
prettytable==2.1.0
gradio
torch
setuptools==49.3.0
```

Step 2: GPT-4 Data Annotation

In this part, we offer links to download the source data and provide prompts that guide GPT-4 in the annotation process.

1. Download Source Data

Three types of source data are used: captions, questions, and multi-genre sentences.

2. GPT-4 Annotation

Prompts:

  • Captions: ./prompts/captions.txt
  • Questions: ./prompts/questions.txt
  • Multi-genre Sentences: ./prompts/multi_genre_sentences.txt

Note that, for several reasons related to accessing GPT-4 (e.g., batch size, network conditions), data re-created using the above prompts may vary slightly from the dataset we have released. As mentioned in our paper, despite significant variations in the prompt, the performance of a model trained on the generated data tends to remain consistent on the STS task.
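
As a sketch, an annotation request built from one of the prompt files above might look like the following, using the openai package from Step 1; how the template is combined with a sentence pair, and the decoding settings, are our assumptions rather than the repo's exact procedure:

```python
import openai  # openai>=0.27.2, per Step 1

openai.api_key = "YOUR_API_KEY"

# One of the released prompt templates listed above.
with open("./prompts/captions.txt") as f:
    prompt_template = f.read()

# A hypothetical sentence pair to be labeled with an STS score.
pair = ("A man is playing a guitar.", "A man is playing an instrument.")

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"{prompt_template}\nSentence 1: {pair[0]}\nSentence 2: {pair[1]}",
    }],
    temperature=0,  # assumption: deterministic decoding for labeling
)

# GPT-4's STS annotation for the pair.
print(response["choices"][0]["message"]["content"])
```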

Step 3: Training STS Models

  1. Clone the related project:
    1. SimCSE
    2. PromCSE
  2. Download the backbone RoBERTa models:
    1. RoBERTa-base
    2. RoBERTa-large
  3. Fill the RoBERTa path, input file, and output directory into our provided training scripts:
    1. SimCSE-RoBERTa
      1. base: ./training-parameters/simcse/sup_roberta_base.sh
      2. large: ./training-parameters/simcse/sup_roberta_large.sh
    2. PromCSE-RoBERTa
      1. base: ./training-parameters/promcse/sup_roberta_base.sh
      2. large: ./training-parameters/promcse/sup_roberta_large.sh
  4. Move the modified scripts into the directory of the related project, for example:
    1. mv ./training-parameters/simcse/sup_roberta_base.sh SimCSE/
  5. Run the training script, for example: bash SimCSE/sup_roberta_base.sh

Step 4: Evaluation

We evaluate Sim-GPT on 7 STS tasks and report Spearman's correlation.

  1. Clone the related project:
    1. SimCSE
    2. PromCSE
  2. Run the official evaluation code, for example:
```bash
python SimCSE/evaluation.py \
    --model_name_or_path simgpt-simcse-roberta-large \
    --pooler cls \
    --task_set sts \
    --mode test
```

which is expected to output the results in a tabular format:

```
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 78.79 | 88.22 | 83.48 | 88.32 | 85.48 |    87.91     |      81.07      | 84.75 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
```
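
Beyond the benchmark script, a trained checkpoint can also be used directly to score a sentence pair. Below is a minimal sketch using Hugging Face transformers with plain CLS pooling; the checkpoint path is assumed to be a local output directory from Step 3, and SimCSE's own evaluation may additionally apply its trained MLP pooler, so scores can differ slightly:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: path to a trained checkpoint, e.g. an output directory from Step 3.
model_name = "simgpt-simcse-roberta-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = ["A man is playing a guitar.", "A man is playing an instrument."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)
    # Plain CLS pooling, analogous to the `--pooler cls` flag above.
    embeddings = outputs.last_hidden_state[:, 0]

# Cosine similarity between the two sentence embeddings.
similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0
)
print(f"cosine similarity: {similarity.item():.4f}")
```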

Step 5: Results

Table 1: Results reported in our paper on 7 STS tasks.

| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|-------|-------|-------|-------|-------|-------|-------|--------|------|
| *Supervised Model* | | | | | | | | |
| InferSent-GloVe | 52.86 | 66.75 | 62.15 | 72.77 | 66.87 | 68.03 | 65.65 | 65.01 |
| Universal Sentence Encoder | 64.49 | 67.80 | 64.61 | 76.83 | 73.18 | 74.92 | 76.69 | 71.22 |
| SRoBERTa-base | 71.54 | 72.49 | 70.80 | 78.74 | 73.69 | 77.77 | 74.46 | 74.21 |
| SBERT-base | 70.97 | 76.53 | 73.19 | 79.09 | 74.30 | 77.03 | 72.91 | 74.89 |
| CT-SBERT-base | 74.84 | 83.20 | 78.07 | 83.84 | 77.93 | 81.46 | 76.42 | 79.39 |
| *SimGPT - SimCSE* | | | | | | | | |
| SimCSE-RoBERTa-base | 76.53 | 85.21 | 80.95 | 86.03 | 82.57 | 85.83 | 80.50 | 82.52 |
| SimGPT - SimCSE-RoBERTa-base | 77.65 (+1.12) | 86.15 (+0.94) | 80.58 (-0.37) | 86.47 (+0.44) | 84.08 (+1.51) | 86.20 (+0.37) | 80.88 (+0.38) | 83.14 (+0.62) |
| SimCSE-RoBERTa-large | 77.46 | 87.27 | 82.36 | 86.66 | 83.93 | 86.70 | 81.95 | 83.76 |
| SimGPT - SimCSE-RoBERTa-large | 78.79 (+1.33) | 88.22 (+0.95) | 83.48 (+1.12) | 88.32 (+1.66) | 85.48 (+1.55) | 87.91 (+1.21) | 81.07 (-0.88) | 84.75 (+0.99) |
| *SimGPT - PromCSE* | | | | | | | | |
| PromCSE-RoBERTa-base | 77.51 | 86.15 | 81.59 | 86.92 | 83.81 | 86.35 | 80.49 | 83.26 |
| SimGPT - PromCSE-RoBERTa-base | 77.74 (+0.23) | 86.82 (+0.77) | 81.36 (-0.23) | 87.01 (+0.09) | 84.58 (+0.77) | 86.98 (+0.63) | 80.48 (-0.01) | 83.57 (+0.31) |
| PromCSE-RoBERTa-large | 79.56 | 88.97 | 83.81 | 88.08 | 84.96 | 87.87 | 82.43 | 85.10 |
| SimGPT - PromCSE-RoBERTa-large | 79.92 (+0.36) | 88.87 (-0.10) | 84.29 (+0.48) | 88.64 (+0.56) | 85.94 (+0.98) | 88.18 (+0.31) | 82.79 (+0.36) | 85.52 (+0.42) |

Released Checkpoints

| Model | Avg. STS |
|-------|----------|
| simgpt-simcse-roberta-base | 83.14 |
| simgpt-simcse-roberta-large | 84.75 |
| simgpt-promcse-roberta-base | 83.57 |
| simgpt-promcse-roberta-large | 85.52 |

Released GPT-4 Annotated Data

Our released annotated data are:

Contact

If you have any issues or questions about this repo, feel free to contact wangshuhe@stu.pku.edu.cn.
