
DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning

Official repo for our AAAI 2024 paper: DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning.

Getting Started

Run pip install -r requirements.txt to prepare the environment.

Use the script from the SimCSE repo to download the datasets for SentEval evaluation:

cd SentEval/data/downstream/
bash download_dataset.sh

Access Our Model and Dataset from Huggingface🤗

Both our model checkpoint and training dataset are available on Huggingface 🤗.

Generate embeddings with DenoSent:

from transformers import AutoModel

model = AutoModel.from_pretrained("Singhoo/denosent-bert-base", trust_remote_code=True)

sentences = [
   "The curious cat tiptoed across the creaky wooden floor, pausing to inspect a fluttering curtain.",
   "A lone hiker stood atop the misty mountain, marveling at the tapestry of stars unfolding above."
]

embeddings = model.encode(sentences)
print(embeddings)

# Expected output
# tensor([[ 0.3314, -0.2520,  0.4150,  ...,  0.1575, -0.1235, -0.1226],
#         [ 0.5128, -0.0051,  0.2179,  ...,  0.1010,  0.1654, -0.3872]])
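A common use of these embeddings is measuring semantic similarity between sentences. A minimal sketch with cosine similarity, using a random tensor as a stand-in for the `(num_sentences, hidden_size)` output of `model.encode`:

```python
import torch
import torch.nn.functional as F

# Stand-in for `embeddings = model.encode(sentences)`; the real call
# returns a 2-D tensor shaped (num_sentences, hidden_size), as shown above.
torch.manual_seed(0)
embeddings = torch.randn(2, 768)

# Cosine similarity between the two sentence embeddings, in [-1, 1]
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```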

Evaluation

Run Evaluation with SentEval

python eval_senteval.py \
    --model_name_or_path Singhoo/denosent-bert-base \
    --task_set sts \
    --mode test

This checkpoint has slightly higher STS results than those reported in the paper.

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.48 | 83.82 | 77.54 | 84.76 | 80.16 |    81.20     |      73.97      | 79.56 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
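The Avg. column is the unweighted mean of the seven task scores, which can be checked directly:

```python
# STS12, STS13, STS14, STS15, STS16, STSBenchmark, SICKRelatedness
scores = [75.48, 83.82, 77.54, 84.76, 80.16, 81.20, 73.97]
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 79.56, matching the Avg. column
```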

Run Evaluation with MTEB

python eval_mteb.py \
    --model_name_or_path Singhoo/denosent-bert-base

MTEB evaluation results will be written to a separate mteb_results directory.

Train Your Own DenoSent Models

Run the following command to train your own models. Try out different hyperparameters as you like. The dataset will be automatically downloaded from Huggingface.

python train.py \
    --train_dataset Singhoo/denosent_data \
    --torch_compile True \
    --model_name_or_path bert-base-uncased \
    --max_length 32 \
    --decoder_num_layers 16 \
    --decoder_num_heads 1 \
    --decoder_target_dropout 0.825 \
    --pooler mask \
    --output_dir results \
    --overwrite_output_dir \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 256 \
    --learning_rate 4e-5 \
    --lr_scheduler_type constant_with_warmup \
    --do_train \
    --do_eval \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --save_strategy steps \
    --save_steps 50 \
    --num_train_epochs 1 \
    --metric_for_best_model eval_avg_sts \
    --prompt_format '"[X]" means [MASK].' \
    --do_contrastive \
    --do_generative \
    --save_total_limit 1 \
    --contrastive_temp 0.05 \
    --warmup_steps 500 \
    --contrastive_weight 5 \
    --generative_weight 7 \
    --max_steps 5000 \
    --load_best_model_at_end
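The `--contrastive_weight`, `--generative_weight`, and `--contrastive_temp` flags suggest a weighted two-term objective. The sketch below illustrates how such a combination typically works, using a generic in-batch InfoNCE loss; it is an assumption for illustration, not DenoSent's actual training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temp=0.05):
    """Generic in-batch InfoNCE: each sentence's positive is its own second view."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temp            # pairwise cosine similarities scaled by temperature
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

def total_loss(l_contrastive, l_generative, w_con=5.0, w_gen=7.0):
    # Weighted sum mirroring --contrastive_weight 5 and --generative_weight 7
    return w_con * l_contrastive + w_gen * l_generative

torch.manual_seed(0)
views = torch.randn(8, 16)               # hypothetical batch of 8 sentence embeddings
l_con = contrastive_loss(views, views)   # identical views -> near-zero loss
print(l_con.item())
```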

Acknowledgements

We use the SentEval toolkit and the MTEB toolkit for evaluations, and we adopt the modified version of SentEval from the SimCSE repository.
