🍊 SimCSE-KO

  • Lighter code for training Korean SimCSE
  • Fewer dependency problems and easier customization
  • Reference: SimCSE Paper, Original Code


1. Training

  • Model:
    • klue/bert-base
    • klue/roberta-base
  • Dataset:
    • KorNLI-train (supervised training)
    • Korean Wiki Text 1M (unsupervised training)
    • KorSTS-dev (evaluation)
  • Setting (a sketch of the corresponding training objective follows this list):
    • epochs: 1
    • max length: 64
    • batch size: 256
    • learning rate: 5e-5
    • dropout: 0.1
    • temperature (temp): 0.05
    • pooler: cls
    • hardware: 1 A100 GPU
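
For reference, the objective these settings feed into is the SimCSE in-batch contrastive (InfoNCE) loss. Below is a minimal PyTorch sketch of that loss, not the repository's training code; the tensor shapes and names are illustrative only.

```python
import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, temp=0.05):
    """In-batch contrastive (InfoNCE) loss used by SimCSE.

    z1, z2: (batch, hidden) embeddings of the same sentences under
    two views (two dropout masks, or premise/entailment pairs).
    """
    # Pairwise cosine similarity between every z1[i] and every z2[j],
    # scaled by the temperature (temp: 0.05 in the settings above).
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temp
    # The positive for row i sits on the diagonal; every other column
    # in the batch acts as a negative.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

# Toy check with random embeddings and the batch size from above.
z1, z2 = torch.randn(256, 768), torch.randn(256, 768)
print(simcse_loss(z1, z2))
```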

2. Performance

  • Inference Dataset:
    • KorSTS-test
    • KlueSTS-dev

  • KorSTS-test

| Model | AVG | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|---|
| SimCSE-BERT-KO (unsup) | 72.85 | 73.00 | 72.77 | 72.96 | 72.92 | 72.93 | 72.86 | 72.80 | 72.53 |
| SimCSE-BERT-KO (sup) | 85.98 | 86.05 | 86.00 | 85.88 | 86.08 | 85.90 | 86.08 | 85.96 | 85.89 |
| SimCSE-RoBERTa-KO (unsup) | 75.79 | 76.39 | 75.57 | 75.71 | 75.52 | 75.65 | 75.42 | 76.41 | 75.63 |
| SimCSE-RoBERTa-KO (sup) | 83.06 | 82.67 | 83.21 | 83.22 | 83.27 | 83.24 | 83.28 | 82.54 | 83.03 |

  • KlueSTS-dev

| Model | AVG | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---|---|---|---|---|---|---|---|---|
| SimCSE-BERT-KO (unsup) | 65.27 | 66.27 | 64.31 | 66.18 | 64.05 | 66.00 | 63.77 | 66.64 | 64.93 |
| SimCSE-BERT-KO (sup) | 83.96 | 82.98 | 84.32 | 84.32 | 84.30 | 84.28 | 84.20 | 83.00 | 84.29 |
| SimCSE-RoBERTa-KO (unsup) | 80.78 | 81.20 | 80.35 | 81.27 | 80.36 | 81.28 | 80.40 | 81.13 | 80.26 |
| SimCSE-RoBERTa-KO (sup) | 85.31 | 84.14 | 85.64 | 86.09 | 85.68 | 86.04 | 85.65 | 83.94 | 85.30 |

3. Implementation

  • Generate Supervised Dataset

    You can create a supervised training dataset from KorNLI by following 'data/generate_supervised_dataset.ipynb'. A sketch of the idea follows below.
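
The gist of what such a notebook produces: supervised SimCSE pairs each NLI premise with an entailed hypothesis as the positive and a contradicted hypothesis as the hard negative. A rough sketch, assuming the standard KorNLI TSV columns (`sentence1`, `sentence2`, `gold_label`) and a hypothetical file name:

```python
import pandas as pd

# Hypothetical path; the real file comes from the KorNLI release.
nli = pd.read_csv('kornli_train.tsv', sep='\t', on_bad_lines='skip')

triplets = []
for premise, group in nli.groupby('sentence1'):
    pos = group.loc[group['gold_label'] == 'entailment', 'sentence2']
    neg = group.loc[group['gold_label'] == 'contradiction', 'sentence2']
    if len(pos) and len(neg):
        # (anchor, positive, hard negative) triple for supervised SimCSE.
        triplets.append((premise, pos.iloc[0], neg.iloc[0]))

print(len(triplets), triplets[0])
```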

  • Download Korean Wiki Text

```bash
cd data
sh download_korean_wiki_1m.sh
```

  • Download KorSTS

```bash
cd data
sh download_korsts.sh
```

  • Supervised Training

```bash
cd train
sh run_train_supervised.sh
```

  • Unsupervised Training (dropout serves as the augmentation; see the sketch below)

```bash
cd train
sh run_train_unsupervised.sh
```
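
Unsupervised SimCSE gets its positive pairs from dropout alone: the same sentence is encoded twice while dropout is active, and the two slightly different embeddings form a positive pair. A minimal sketch of that mechanism (using `klue/bert-base` from section 1; this is not the repository's training loop):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
model = AutoModel.from_pretrained('klue/bert-base')
model.train()  # keep dropout active so the two passes differ

# "The weather is nice today."
batch = tokenizer(['오늘 날씨가 좋다.'], return_tensors='pt')

# Two forward passes over the same input -> two different dropout masks.
z1 = model(**batch).pooler_output
z2 = model(**batch).pooler_output

# The embeddings differ slightly, giving a free positive pair.
print(torch.allclose(z1, z2))  # False
```
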
  • Evaluation (Pearson/Spearman correlation with gold STS scores; a sketch follows below)

```bash
cd evaluation
sh run_eval.sh
```
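
The numbers in section 2 are Pearson/Spearman correlations (scaled by 100) between embedding similarities and gold STS labels. A sketch of that computation, assuming you already have the two embedding matrices; the actual logic lives behind `evaluation/run_eval.sh`:

```python
import numpy as np
from scipy import stats

def sts_scores(emb1, emb2, gold):
    """emb1, emb2: (n, hidden) embeddings of sentence pairs; gold: (n,) labels."""
    # Row-wise cosine similarity between the paired sentences.
    cos = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    pearson = stats.pearsonr(cos, gold)[0]
    spearman = stats.spearmanr(cos, gold)[0]
    return 100 * pearson, 100 * spearman  # scaled like the tables above

# Toy check with random data.
emb1, emb2 = np.random.rand(100, 768), np.random.rand(100, 768)
print(sts_scores(emb1, emb2, np.random.rand(100)))
```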

4. HuggingFace Example

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'snumin44/simcse-ko-bert-supervised'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()

# Query: "Will it rain tomorrow morning?"
query = '내일 아침에 비가 올까요?'

# Targets range from closely related (umbrella advice, yesterday's rain)
# to unrelated (a stream in Seoul, a football match, a reading habit).
targets = [
    '내일 아침에 우산을 챙겨야 합니다.',
    '어제 저녁에는 비가 많이 내렸습니다.',
    '청계천은 대한민국 서울에 있습니다.',
    '이번 주말에 축구 대표팀 경기가 있습니다.',
    '저는 매일 아침 일찍 일어나 책을 읽습니다.'
]

# Encode the query; the pooler output serves as the sentence embedding.
query_feature = tokenizer(query, return_tensors='pt')
with torch.no_grad():
    query_outputs = model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.numpy().squeeze()

def cos_sim(A, B):
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

for idx, target in enumerate(targets):
    target_feature = tokenizer(target, return_tensors='pt')
    with torch.no_grad():
        target_outputs = model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```
Output:

```
Similarity between query and target 0: 0.7864
Similarity between query and target 1: 0.5695
Similarity between query and target 2: 0.2646
Similarity between query and target 3: 0.3055
Similarity between query and target 4: 0.3738
```
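
For longer target lists, encoding sentence by sentence is wasteful; padding lets one forward pass handle the whole batch. A variant of the loop above that continues the same example (it reuses `model`, `tokenizer`, `targets`, and `query_embeddings` from the block before):

```python
import numpy as np
import torch

# Tokenize all targets at once; padding aligns them into a single batch.
batch = tokenizer(targets, padding=True, return_tensors='pt')
with torch.no_grad():
    target_embeddings = model(**batch, return_dict=True).pooler_output.numpy()

# Normalize rows, then one matrix-vector product yields every cosine score.
q = query_embeddings / np.linalg.norm(query_embeddings)
t = target_embeddings / np.linalg.norm(target_embeddings, axis=1, keepdims=True)
for idx, similarity in enumerate(t @ q):
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
```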

Citing

```bibtex
@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}

@article{ham2020kornli,
   title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
   author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
   journal={arXiv preprint arXiv:2004.03289},
   year={2020}
}
```

Acknowledgement

This project was inspired by KoSimCSE.