
DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning

Official repo for our AAAI 2024 paper: DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning.

Getting Started

Run pip install -r requirements.txt to prepare the environment.

Use the script from the SimCSE repo to download the datasets for SentEval evaluation:

cd SentEval/data/downstream/
bash download_dataset.sh

Access Our Model and Dataset from Huggingface🤗

Both our model checkpoint and training dataset are available on Huggingface 🤗.

Generate embeddings with DenoSent:

from transformers import AutoModel

model = AutoModel.from_pretrained("Singhoo/denosent-bert-base", trust_remote_code=True)

sentences = [
   "The curious cat tiptoed across the creaky wooden floor, pausing to inspect a fluttering curtain.",
   "A lone hiker stood atop the misty mountain, marveling at the tapestry of stars unfolding above."
]

embeddings = model.encode(sentences)
print(embeddings)

# Expected output
# tensor([[ 0.3314, -0.2520,  0.4150,  ...,  0.1575, -0.1235, -0.1226],
#         [ 0.5128, -0.0051,  0.2179,  ...,  0.1010,  0.1654, -0.3872]])
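A common use of these embeddings is measuring semantic similarity between sentences. A minimal sketch with cosine similarity, using a random tensor as a stand-in for the `(num_sentences, hidden_size)` output of `model.encode`:

```python
import torch
import torch.nn.functional as F

# Stand-in for `embeddings = model.encode(sentences)`; the real call
# returns a 2-D tensor shaped (num_sentences, hidden_size), as shown above.
torch.manual_seed(0)
embeddings = torch.randn(2, 768)

# Cosine similarity between the two sentence embeddings, in [-1, 1]
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```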

Evaluation

Run Evaluation with SentEval

python eval_senteval.py \
    --model_name_or_path Singhoo/denosent-bert-base \
    --task_set sts \
    --mode test

This checkpoint has slightly higher STS results than those reported in the paper.

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.48 | 83.82 | 77.54 | 84.76 | 80.16 |    81.20     |      73.97      | 79.56 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
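The Avg. column is the unweighted mean of the seven task scores, which can be checked directly:

```python
# STS12, STS13, STS14, STS15, STS16, STSBenchmark, SICKRelatedness
scores = [75.48, 83.82, 77.54, 84.76, 80.16, 81.20, 73.97]
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 79.56, matching the Avg. column
```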

Run Evaluation with MTEB

python eval_mteb.py \
    --model_name_or_path Singhoo/denosent-bert-base

MTEB evaluation results will be written to a separate mteb_results directory.

Train Your Own DenoSent Models

Run the following command to train your own models. Try out different hyperparameters as you like. The dataset will be automatically downloaded from Huggingface.

python train.py \
    --train_dataset Singhoo/denosent_data \
    --torch_compile True \
    --model_name_or_path bert-base-uncased \
    --max_length 32 \
    --decoder_num_layers 16 \
    --decoder_num_heads 1 \
    --decoder_target_dropout 0.825 \
    --pooler mask \
    --output_dir results \
    --overwrite_output_dir \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 256 \
    --learning_rate 4e-5 \
    --lr_scheduler_type constant_with_warmup \
    --do_train \
    --do_eval \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --save_strategy steps \
    --save_steps 50 \
    --num_train_epochs 1 \
    --metric_for_best_model eval_avg_sts \
    --prompt_format '"[X]" means [MASK].' \
    --do_contrastive \
    --do_generative \
    --save_total_limit 1 \
    --contrastive_temp 0.05 \
    --warmup_steps 500 \
    --contrastive_weight 5 \
    --generative_weight 7 \
    --max_steps 5000 \
    --load_best_model_at_end
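The `--contrastive_weight`, `--generative_weight`, and `--contrastive_temp` flags suggest a weighted two-term objective. The sketch below illustrates how such a combination typically works, using a generic in-batch InfoNCE loss; it is an assumption for illustration, not DenoSent's actual training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temp=0.05):
    """Generic in-batch InfoNCE: each sentence's positive is its own second view."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temp            # pairwise cosine similarities scaled by temperature
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def total_loss(l_contrastive, l_generative, w_con=5.0, w_gen=7.0):
    # Weighted sum mirroring --contrastive_weight 5 and --generative_weight 7
    return w_con * l_contrastive + w_gen * l_generative

torch.manual_seed(0)
views = torch.randn(8, 16)               # hypothetical batch of 8 sentence embeddings
l_con = contrastive_loss(views, views)   # identical views -> near-zero loss
print(l_con.item())
```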

Acknowledgements

We use the SentEval toolkit and the MTEB toolkit for evaluations, and we adopt the modified version of SentEval from the SimCSE repository.
