WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings [ACL 2023]

This repository contains the code and pre-trained models for our paper WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings.

Our code is mainly based on the code of SimCSE. Please refer to their repository for more detailed information.

Overview

We present a whitening-based contrastive learning method for sentence embedding learning (WhitenedCSE), which combines contrastive learning with a novel shuffled group whitening.
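As a rough illustration of the idea (not the exact implementation, which lives in this repository's training code): the feature dimensions are randomly shuffled, split into groups, and each group is whitened over the mini-batch. A minimal PyTorch sketch, with group count and epsilon chosen arbitrarily:

```python
# Illustration of shuffled group whitening; the repo's own implementation may differ.
import torch

def shuffled_group_whitening(x: torch.Tensor, num_groups: int = 4, eps: float = 1e-5) -> torch.Tensor:
    """Randomly shuffle feature channels, split them into groups,
    and ZCA-whiten each group over the batch dimension."""
    n, d = x.shape
    perm = torch.randperm(d, device=x.device)      # random channel shuffle
    groups = x[:, perm].chunk(num_groups, dim=1)

    whitened = []
    for g in groups:
        g = g - g.mean(dim=0, keepdim=True)        # center over the batch
        cov = g.T @ g / (n - 1)                    # per-group covariance
        eigval, eigvec = torch.linalg.eigh(cov)    # ZCA: cov^{-1/2}
        inv_sqrt = eigvec @ torch.diag((eigval + eps).rsqrt()) @ eigvec.T
        whitened.append(g @ inv_sqrt)

    out = torch.cat(whitened, dim=1)
    inv_perm = torch.empty_like(perm)              # undo the shuffle
    inv_perm[perm] = torch.arange(d, device=x.device)
    return out[:, inv_perm]
```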

Train WhitenedCSE

In the following sections, we describe how to train a WhitenedCSE model using our code.

Requirements

First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use PyTorch 1.12.1 with the build matching your platform and CUDA version. PyTorch versions higher than 1.12.1 should also work.

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge

Then run the following script to install the remaining dependencies,

pip install -r requirements.txt

For unsupervised WhitenedCSE, we sample 1 million sentences from English Wikipedia. You can run data/download_wiki.sh to download the dataset:

# download the dataset
./download_wiki.sh

Training

To train an unsupervised WhitenedCSE model on the Wikipedia data, run:

CUDA_VISIBLE_DEVICES=[gpu_ids] \
python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-whitenedcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --num_pos 3 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --dup_type bpe \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks.

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh

Then come back to the root directory. You can evaluate any transformers-based pre-trained model using our evaluation code. For example,

python evaluation.py \
    --model_name_or_path <your_output_model_dir>  \
    --pooler cls \
    --task_set sts \
    --mode test

which is expected to output the results in a tabular format:

# BERT-base-uncased
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 74.03 | 84.90 | 76.40 | 83.40 | 80.23 |    81.14     |      71.33      | 78.78 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

# BERT-large
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 74.65 | 85.79 | 77.49 | 84.71 | 80.33 |    81.48     |      75.34      | 79.97 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
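You can also load a trained checkpoint directly to encode sentences. A minimal sketch, assuming the saved output directory can be loaded like a standard transformers checkpoint (as in SimCSE); the directory name below is just the example --output_dir from the training command, and [CLS] pooling matches --pooler_type cls:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# hypothetical path: replace with your own --output_dir
model_dir = "result/my-unsup-whitenedcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)
model.eval()

sentences = ["A man is playing a guitar.", "Someone is playing an instrument."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)
    embeddings = outputs.last_hidden_state[:, 0]   # [CLS] pooling

# cosine similarity between the two sentence embeddings
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {sim.item():.4f}")
```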

Pretrained Model

Citation

Please cite our paper if you use WhitenedCSE in your work:

@inproceedings{zhuo2023whitenedcse,
  title={WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings},
  author={Zhuo, Wenjie and Sun, Yifan and Wang, Xiaohan and Zhu, Linchao and Yang, Yi},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={12135--12148},
  year={2023}
}
