Debiased Contrastive Learning of Unsupervised Sentence Representations

This repository contains the code for our ACL 2022 paper Debiased Contrastive Learning of Unsupervised Sentence Representations.

Overview

We propose DCLR, a debiased contrastive learning framework for unsupervised sentence representation learning. Building on SimCSE, we mainly consider two biases caused by random negative sampling, namely false negatives and the anisotropy of the representation space. To alleviate their influence during contrastive learning, we incorporate an instance weighting method (for false negatives) and noise-based negatives (for anisotropy).
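To make the idea concrete, here is a minimal PyTorch sketch of the resulting objective. It is illustrative only: dclr_loss is a hypothetical helper, phi and noise_times mirror the hyperparameters discussed below, and the refinement of the noise-based negatives used in the full method is omitted.

import torch
import torch.nn.functional as F

def dclr_loss(z1, z2, c1, c2, phi=0.85, noise_times=1, temp=0.05):
    """z1, z2: (N, d) embeddings of the two views from the model being trained.
    c1, c2: (N, d) embeddings of the same sentences from the fixed complementary
    SimCSE model, used only to weight the in-batch negatives."""
    N, d = z1.shape
    # Noise-based negatives: random Gaussian vectors enlarge the negative set
    # and push representations toward a more uniform (less anisotropic) space.
    noise = torch.randn(noise_times * N, d, device=z1.device)
    # Cosine similarity of each anchor to in-batch candidates and noise negatives.
    cand = torch.cat([z2, noise])
    sim = F.cosine_similarity(z1.unsqueeze(1), cand.unsqueeze(0), dim=-1) / temp
    # Instance weighting: a candidate whose complementary-model similarity to the
    # anchor exceeds phi is treated as a likely false negative and gets weight 0.
    with torch.no_grad():
        c_sim = F.cosine_similarity(c1.unsqueeze(1), c2.unsqueeze(0), dim=-1)
        w = (c_sim < phi).float()
        w.fill_diagonal_(1.0)  # the positive pair always keeps weight 1
        w = torch.cat([w, torch.ones(N, noise_times * N, device=z1.device)], dim=1)
    # Weighted InfoNCE: positives sit on the diagonal of the first N columns.
    exp_sim = torch.exp(sim) * w
    pos = exp_sim[:, :N].diagonal()
    return -torch.log(pos / exp_sim.sum(dim=1)).mean()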

Train DCLR

The following sections describe how to evaluate and train a DCLR model using our code.

Evaluation

Our evaluation code for sentence embeddings follows the released code of SimCSE, which is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation.

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh
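Under the hood, SentEval drives a user-supplied batcher that maps raw sentences to embeddings. The sketch below shows that interface; encode is a hypothetical stand-in for the trained model's embedding function, and the actual entry point is the evaluation script shipped with this repo.

import numpy as np
import senteval

def encode(sentences):
    # Hypothetical stand-in: replace with the trained DCLR model's encoder.
    return np.zeros((len(sentences), 768))

def prepare(params, samples):
    return

def batcher(params, batch):
    sentences = [' '.join(tokens) for tokens in batch]
    return encode(sentences)  # (len(batch), dim) array of sentence embeddings

params = {'task_path': 'SentEval/data', 'usepytorch': True, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['STSBenchmark'])
print(results)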

Training

Environment

To faithfully reproduce our results, please install PyTorch 1.8.1 with the build matching your platform and CUDA version, e.g. for CUDA 11.1:

pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

Then run the following command to install the remaining dependencies:

pip install -r requirements.txt

Data

We use the data released by SimCSE, which samples 1 million sentences from English Wikipedia. You can run data/download_wiki.sh to download it.
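A quick sanity check of the download, assuming the file keeps SimCSE's original name (one sentence per line):

with open('data/wiki1m_for_simcse.txt') as f:
    sentences = [line.strip() for line in f]
print(len(sentences))   # expected: 1,000,000
print(sentences[0])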

Required Checkpoints from SimCSE

Our approach requires a fixed SimCSE model on BERT-base or RoBERTa-base as the complementary model for instance weighting. You can download the checkpoints from these links: SimCSE-BERT-base and SimCSE-RoBERTa-base.

Besides, we also need the SimCSE checkpoints on BERT-large and RoBERTa-large to initialize our large models and stabilize the training process. You can download them from these links: SimCSE-BERT-large and SimCSE-RoBERTa-large.
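Loading the frozen complementary model is standard Hugging Face usage; a minimal sketch (here via the princeton-nlp hub name, though a local checkpoint directory works the same way):

from transformers import AutoModel, AutoTokenizer

c_tok = AutoTokenizer.from_pretrained('princeton-nlp/unsup-simcse-bert-base-uncased')
c_model = AutoModel.from_pretrained('princeton-nlp/unsup-simcse-bert-base-uncased')
c_model.eval()                        # kept fixed: used only for instance weighting
for p in c_model.parameters():
    p.requires_grad = False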

Training scripts

We provide training scripts for the BERT/RoBERTa-base/large backbones with the best hyperparameters already set. Running the script below finishes training on the chosen backbone model.

bash run.sh

For BERT/RoBERTa-base models, we provide a single-GPU (or CPU) example, and for BERT/RoBERTa-large models we give a multi-GPU example. We explain some important arguments below; an example invocation follows the list:

  • --model_name_or_path: Pre-trained checkpoint to start with. We support BERT-based models (bert-base-uncased, bert-large-uncased) and RoBERTa-based models (roberta-base, roberta-large).
  • --c_model_name_or_path: The checkpoint of the complementary model. We support SimCSE-BERT/RoBERTa-base models (unsup-simcse-bert-base-uncased, unsup-simcse-roberta-base).
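For example, a base-model run pairs the two arguments like this (a hypothetical invocation: the training script name and any remaining flags are whatever run.sh actually forwards):

python train.py \
  --model_name_or_path bert-base-uncased \
  --c_model_name_or_path princeton-nlp/unsup-simcse-bert-base-uncased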

For the results in the paper, we use 8 NVIDIA RTX 3090 GPUs with CUDA 11. Using different devices or different versions of CUDA and other software may lead to slightly different performance.

Hyperparameter Sensitivity

Note that the performance of DCLR is sensitive to the environment and hyperparameter settings. If you get different performance, we suggest searching the hyperparameters phi and noise_times around our provided values.

Citation

Please cite our paper if you use DCLR in your work:

@inproceedings{zhou2021dclr,
   title={Debiased Contrastive Learning of Unsupervised Sentence Representations},
   author={Zhou, Kun and Zhang, Beichen and Zhao, Xin and Wen, Ji-Rong},
   booktitle={{ACL}},
   year={2022}
}
