# Word Sense Disambiguation

김연정, 정서영

This project documents the process of fine-tuning a Word Sense Disambiguation (WSD) model based on GlossBERT.

More information on GlossBERT is available at the following URLs.

GitHub repository: https://github.com/HSLCY/GlossBERT

Hugging Face model: https://huggingface.co/kanishka/GlossBERT

## 0. Setting environments

In [1]:
# Model-related installs and imports
!pip install transformers
!pip install transformers[torch]
from transformers import AutoTokenizer, BertForSequenceClassification, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments
from transformers import Trainer

# Everything else
!pip install nltk
!pip install gdown
!pip install datasets
import torch
import csv
import pandas as pd
import numpy as np
from datasets import Dataset
import zipfile
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn # synonym data is available through the nltk wordnet corpus
import gdown
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import os

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# model = BertForSequenceClassification.from_pretrained('kanishka/GlossBERT')  # the model is loaded later, in section 3
tokenizer = AutoTokenizer.from_pretrained('kanishka/GlossBERT')

Downloading (…)okenizer_config.json:   0%|          | 0.00/333 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/337 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

## 1. Preparing Dataset

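- We modified the CSV data file published on GitHub by the original study (semcor_train_sent_cls_ws.csv) to build a new dataset for fine-tuning. The modifications are as follows.

  1) Part-of-speech information for the target (ambiguous) word was added.

  2) When the target (ambiguous) word has synonyms, the synonym information was added.

- To obtain the part-of-speech and synonym information, we used the sense_key column of the original CSV file together with the WordNet interface in NLTK.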
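
As a rough illustration of why the sense_key column is enough to recover the part of speech: in WordNet's sense-key format, the digit right after `%` is the synset type (1 noun, 2 verb, 3 adjective, 4 adverb, 5 adjective satellite). The helper `pos_from_sense_key` below is a hypothetical name introduced only for this sketch; the notebook itself gets the tag through NLTK's WordNet interface rather than by parsing the key.

```python
# Mapping from the WordNet synset-type digit to the one-letter POS tag
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def pos_from_sense_key(sense_key):
    """Return the POS tag encoded in a sense key such as 'long%2:37:02::'."""
    lemma, _, rest = sense_key.partition("%")
    if not rest:
        raise ValueError(f"not a sense key: {sense_key!r}")
    ss_type = rest.split(":", 1)[0]  # first field after '%'
    return SS_TYPE_TO_POS[ss_type]

print(pos_from_sense_key("long%2:37:02::"))  # v
print(pos_from_sense_key("long%3:00:00::"))  # a
```

Synonyms cannot be read off the key in this way, so a WordNet lookup is still needed for the syn column.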

In [2]:
# Download the training dataset used by GlossBERT
# We will use Training_Corpora/SemCor/semcor_train_sent_cls_ws.csv from inside the zip file

url = 'https://drive.google.com/uc?id=1OA-Ux6N517HrdiTDeGeIZp5xTq74Hucf'

gdown.download(url, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1OA-Ux6N517HrdiTDeGeIZp5xTq74Hucf
To: /content/GlossBERT_datasets.zip
100%|██████████| 189M/189M [00:03<00:00, 48.8MB/s]


'GlossBERT_datasets.zip'

In [4]:
# Unzip the GlossBERT dataset file and get the specific csv file we want
with zipfile.ZipFile('./GlossBERT_datasets.zip', 'r') as zip_ref:
    zip_ref.extract('Training_Corpora/SemCor/semcor_train_sent_cls_ws.csv', './')
    zip_ref.extract('Evaluation_Datasets/semeval2007/semeval2007_test_sent_cls_ws.csv', './') # evaluation set

In [None]:
# Training set
train_data = pd.read_csv('Training_Corpora/SemCor/semcor_train_sent_cls_ws.csv', sep = '\t')
print(len(train_data))
train_data.head()

2021762


Unnamed: 0,target_id,label,sentence,gloss,sense_key
0,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : desire strongly or persistently,long%2:37:02::
1,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : good at remembering,long%3:00:00::
2,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : primarily spatial sense; of relatively ...,long%3:00:01::
3,d000.s000.t000,1,"How "" long "" has it been since you reviewed th...",long : primarily temporal sense; being or indi...,long%3:00:02::
4,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : (of speech sounds or syllables) of rela...,long%3:00:04::


In [None]:
# Evaluation set
eval_data = pd.read_csv('Evaluation_Datasets/semeval2007/semeval2007_test_sent_cls_ws.csv', sep = '\t')
print(len(eval_data))
eval_data.head()

4986


Unnamed: 0,target_id,label,sentence,gloss,sense_key
0,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...","referred : think of, regard, or classify under...",refer%2:31:00::
1,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : have as a meaning,refer%2:32:00::
2,d000.s000.t000,1,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : make reference to,refer%2:32:01::
3,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : use a name to designate,refer%2:32:04::
4,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : seek information from,refer%2:32:12::


In [None]:
# Fetch part-of-speech and synonym information for each sense from WordNet (training set)

p_o_s = []
syn = []
for row in train_data['sense_key']:
  synset = wn.synset_from_sense_key(row)  # look up the synset once per row
  p_o_s.append(synset.pos())

  # lemma_names() returns a list; join it into one string so it can be tokenized later
  syn.append(' '.join(synset.lemma_names()))

train_data['pos'] = p_o_s
train_data['syn'] = syn
train_data.head()

Unnamed: 0,target_id,label,sentence,gloss,sense_key,pos,syn
0,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : desire strongly or persistently,long%2:37:02::,v,hanker long yearn
1,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : good at remembering,long%3:00:00::,a,retentive recollective long tenacious
2,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : primarily spatial sense; of relatively ...,long%3:00:01::,a,long
3,d000.s000.t000,1,"How "" long "" has it been since you reviewed th...",long : primarily temporal sense; being or indi...,long%3:00:02::,a,long
4,d000.s000.t000,0,"How "" long "" has it been since you reviewed th...",long : (of speech sounds or syllables) of rela...,long%3:00:04::,a,long


In [None]:
# Fetch part-of-speech and synonym information for each sense from WordNet (evaluation set)
# The F1 score used for comparison with the original study is computed on the evaluation dataset without pos and synonym information (see section 9 below)

p_o_s = []
syn = []
for row in eval_data['sense_key']:
  synset = wn.synset_from_sense_key(row)  # look up the synset once per row
  p_o_s.append(synset.pos())

  # lemma_names() returns a list; join it into one string so it can be tokenized later
  syn.append(' '.join(synset.lemma_names()))

eval_data['pos'] = p_o_s
eval_data['syn'] = syn
eval_data.head()

Unnamed: 0,target_id,label,sentence,gloss,sense_key,pos,syn
0,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...","referred : think of, regard, or classify under...",refer%2:31:00::,v,refer
1,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : have as a meaning,refer%2:32:00::,v,denote refer
2,d000.s000.t000,1,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : make reference to,refer%2:32:01::,v,mention advert bring_up cite name refer
3,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : use a name to designate,refer%2:32:04::,v,refer
4,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : seek information from,refer%2:32:12::,v,consult refer look_up


### The train and evaluation pandas DataFrames for fine-tuning, with pos and synonym columns attached, are now complete.

Convert the pandas DataFrames to the Dataset type.

In [None]:
train_data = Dataset.from_pandas(train_data)
eval_data = Dataset.from_pandas(eval_data)

In [None]:
eval_data

Dataset({
    features: ['target_id', 'label', 'sentence', 'gloss', 'sense_key', 'pos', 'syn'],
    num_rows: 4986
})

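Tokenize the dataset to prepare it as model input.

  The intent is to tokenize sentence, gloss, pos, and syn together as a single input.

In [None]:
def tokenize_function(ex):
  # Note: the third and fourth positional parameters of the tokenizer call are
  # text_target and text_pair_target, so pos and syn are encoded into a separate
  # 'labels' column (removed in the cleanup cell below) rather than into input_ids.
  return tokenizer(ex['sentence'], ex['gloss'], ex['pos'], ex['syn'], truncation=True, padding=True)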
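
A Hugging Face tokenizer call takes the input as `text` and `text_pair`; its next two positional parameters are `text_target` and `text_pair_target`, whose encoding is returned under a `labels` key rather than merged into `input_ids`. One possible way to place pos and syn inside the model input is to fold them into the second segment before tokenizing. The sketch below is an assumption, not the method used to produce the results in this notebook: `build_second_segment`, `tokenize_with_pos_syn`, and the `" ; "` separator are hypothetical choices.

```python
def build_second_segment(gloss, pos, syn, sep=" ; "):
    """Join gloss, POS tag and synonyms into one string for the text_pair slot."""
    return sep.join([gloss, pos, syn])

def tokenize_with_pos_syn(ex, tokenizer):
    """Batched map function: sentence is segment A, gloss + pos + syn is segment B."""
    second = [
        build_second_segment(g, p, s)
        for g, p, s in zip(ex["gloss"], ex["pos"], ex["syn"])
    ]
    # Padding is left to the data collator, so only truncation is requested here
    return tokenizer(ex["sentence"], second, truncation=True)

print(build_second_segment("long : desire strongly or persistently", "v", "hanker long yearn"))
```

With a `datasets.Dataset`, this could be applied as `train_data.map(lambda ex: tokenize_with_pos_syn(ex, tokenizer), batched=True)`. A model fine-tuned this way would also need the same segment construction at evaluation time.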

In [None]:
# Apply the tokenization

tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_eval = eval_data.map(tokenize_function, batched=True)

Map:   0%|          | 0/2021762 [00:00<?, ? examples/s]

Map:   0%|          | 0/4986 [00:00<?, ? examples/s]

In [None]:
# Clean up the data for use as model input

# Drop unneeded columns; 'labels' at this point is the column the tokenizer created from pos/syn
tokenized_train = tokenized_train.remove_columns(['target_id', 'sentence', 'gloss', 'sense_key', 'pos', 'syn', 'labels'])
tokenized_train = tokenized_train.rename_column('label', 'labels') # the model takes this argument under the name 'labels'
tokenized_train.set_format('torch') # convert lists to tensors

tokenized_eval = tokenized_eval.remove_columns(['target_id', 'sentence', 'gloss', 'sense_key', 'pos', 'syn', 'labels'])
tokenized_eval = tokenized_eval.rename_column('label', 'labels')
tokenized_eval.set_format('torch')
tokenized_eval.column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

## 2. Parameter for Dataloader

In [None]:
# for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
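
Dynamic padding means that each batch is padded only to the length of its own longest sequence instead of a global maximum. The pure-Python sketch below illustrates the idea only; `pad_batch` is a hypothetical helper, not the DataCollatorWithPadding implementation, and the pad id 0 is an assumption (it corresponds to [PAD] in the standard BERT uncased vocabulary).

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length in this batch.

    Returns the padded ids and the matching attention mask (1 = real token).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 2129, 102], [101, 2129, 2146, 2038, 102]])
print(ids)
print(mask)
```

A batch of short sequences therefore stays short, which tends to save computation compared with padding every example to one fixed length.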

## 3. Loading pre-trained model

In [None]:
model = BertForSequenceClassification.from_pretrained('kanishka/GlossBERT')

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

## 4. Setting training Arguments

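The hyperparameters GlossBERT used for training are listed below (see commands.txt in the GlossBERT GitHub code).

  We intended to use the same values, but the same batch size caused a CUDA out-of-memory error during training, so we reduced the batch size. In addition, one epoch took about 16 hours of training, so because of Colab usage limits we set the number of epochs to 1.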

--train_batch_size 64 \
--eval_batch_size 128 \
--learning_rate 2e-5 \
--num_train_epochs 6.0 \
--seed 1314


  -- The optimizer's default weight_decay value was 0.01 (see optimization.py and run_classifier_WSD_sent.py in the GlossBERT GitHub code).
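
To put the batch-size change in numbers, the sketch below computes how many optimizer steps one epoch takes for a given batch size. It assumes a single device and no gradient accumulation; `steps_per_epoch` is a hypothetical helper, and the row count is the one printed when the training CSV was loaded.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)

NUM_TRAIN_ROWS = 2021762  # row count of semcor_train_sent_cls_ws.csv

print(steps_per_epoch(NUM_TRAIN_ROWS, 64))  # batch size used by GlossBERT -> 31591
print(steps_per_epoch(NUM_TRAIN_ROWS, 32))  # reduced batch size used here -> 63181
```

Halving the batch size roughly doubles the number of steps per epoch. If matching the original effective batch size of 64 matters, the `gradient_accumulation_steps` option of TrainingArguments may be one way to do so within the same memory budget, although that was not tried here.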



In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="steps",
    eval_steps=2500,      # compute the validation loss every 2500 steps
    weight_decay=0.01,
    seed=1314,
    load_best_model_at_end=True,  # so that the best model can be saved afterwards
    save_strategy="steps",
    save_steps=2500,
    save_total_limit=1,
)
# logging happens every 500 steps

## 5. Setting evaluation methods

In [None]:
# GlossBERT reports precision, recall and f1 score, so the same three values are computed here.

def compute_metrics(eval_preds):
  logits, labels = eval_preds
  predictions = np.argmax(logits, axis=-1)

  f1 = f1_score(labels, predictions)
  accuracy = accuracy_score(labels, predictions)
  precision = precision_score(labels, predictions)
  recall = recall_score(labels, predictions)

  return {"f1": f1, "accuracy": accuracy, "precision": precision, "recall": recall}
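
For intuition about what these numbers mean on sentence-gloss pairs, the sketch below recomputes precision, recall and F1 for the positive class (label 1) by hand on a tiny made-up example. `binary_prf` is a hypothetical helper written in plain Python; it is meant to mirror the binary-classification definitions, not to replace the scikit-learn functions used above.

```python
def binary_prf(labels, predictions):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # correct "yes"
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # wrong "yes"
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # missed "yes"
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: most pairs are negative, as in the gloss-pair data
labels      = [0, 0, 0, 1, 0, 1, 0, 0]
predictions = [0, 0, 0, 1, 0, 0, 1, 0]
print(binary_prf(labels, predictions))  # (0.5, 0.5, 0.5)
```

In this toy case accuracy is 6/8 = 0.75 while F1 is only 0.5, which shows how accuracy can look high when most pairs carry label 0.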

## 6. Training the model with the training dataset

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_train,
    eval_dataset = tokenized_eval,
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics,
)

In [None]:
torch.cuda.empty_cache() # free as much CUDA memory as possible
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,F1,Accuracy,Precision,Recall
2500,0.0786,0.181197,0.690614,0.94645,0.737624,0.649237
5000,0.0811,0.197687,0.67789,0.942439,0.699074,0.657952
7500,0.0811,0.212007,0.677273,0.943041,0.707838,0.649237
10000,0.0806,0.18867,0.676991,0.941436,0.68764,0.666667
12500,0.0817,0.210129,0.664344,0.942038,0.711443,0.623094
15000,0.0803,0.205147,0.661137,0.942639,0.724675,0.607843
17500,0.0837,0.209162,0.673751,0.938428,0.657676,0.690632
20000,0.0816,0.171516,0.682927,0.945247,0.731343,0.640523
22500,0.08,0.194465,0.68533,0.943642,0.705069,0.666667
25000,0.0764,0.199669,0.666667,0.942238,0.711111,0.627451


TrainOutput(global_step=63181, training_loss=0.07660279687797636, metrics={'train_runtime': 58075.6624, 'train_samples_per_second': 34.813, 'train_steps_per_second': 1.088, 'total_flos': 1.843646986846069e+17, 'train_loss': 0.07660279687797636, 'epoch': 1.0})

## 7. Evaluating the model with the test dataset

In [None]:
trainer.evaluate()

{'eval_loss': 0.1715160310268402,
 'eval_f1': 0.6829268292682926,
 'eval_accuracy': 0.9452466907340553,
 'eval_precision': 0.7313432835820896,
 'eval_recall': 0.6405228758169934,
 'eval_runtime': 31.4385,
 'eval_samples_per_second': 158.595,
 'eval_steps_per_second': 4.962,
 'epoch': 1.0}

## 8. Saving models

In [None]:
trainer.save_model("./my_model")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp -r "/content/my_model" "/content/drive/MyDrive"

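The training/evaluation data used so far pairs one target sentence with each of the senses of a particular polysemous word used in that sentence (see the dataframes in section 1, Preparing Dataset). That is, if a sentence contains one polysemous word and that word has N senses, N rows are given for that sentence.


  The f1 value obtained above is the f1 score over all such evaluation rows (4986). In other words, it compares [target sentence paired with one sense of the polysemous word - the yes/no **gold label** for whether the word is used in that sense] against [target sentence paired with one sense of the polysemous word - the yes/no **predicted label** for whether the word is used in that sense].

  The original GlossBERT study, however, computed the f1 score per target sentence (455 in total). That is, it compared [target sentence - the **gold** sense of the polysemous word used in it] against [target sentence - the sense the model predicted for that word].




  Therefore, for an accurate performance comparison with the original study, below we

  1) wrote and used code that computes the f1 score in the same way as that study.

  2) used evaluation data without the pos and synonym information attached.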
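
The difference between the two scoring schemes can be sketched on toy data: instead of judging every (sentence, sense) row separately, the per-target scheme keeps, for each target_id, only the sense with the highest probability of label 1. In the sketch below, `pick_sense_per_target` is a hypothetical helper, and the probabilities, the second target_id and its `example%...` sense keys are made up for illustration.

```python
def pick_sense_per_target(rows):
    """rows: iterable of (target_id, sense_key, prob_label_1).

    Returns {target_id: sense_key with the highest probability}.
    """
    best = {}
    for target_id, sense_key, prob in rows:
        if target_id not in best or prob > best[target_id][0]:
            best[target_id] = (prob, sense_key)
    return {tid: sense_key for tid, (prob, sense_key) in best.items()}

rows = [
    ("d000.s000.t000", "refer%2:31:00::", 0.10),
    ("d000.s000.t000", "refer%2:32:00::", 0.25),
    ("d000.s000.t000", "refer%2:32:01::", 0.90),
    ("d000.s000.t001", "example%1:09:00::", 0.40),  # placeholder sense keys
    ("d000.s000.t001", "example%1:09:02::", 0.35),
]
print(pick_sense_per_target(rows))
```

Note that the second target gets an answer even though none of its rows exceeds 0.5, so every target receives exactly one predicted sense.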

## 9. Calculating F1 score

In [3]:
# Load the fine-tuned model saved above

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
model = AutoModelForSequenceClassification.from_pretrained("./drive/MyDrive/my_model")
tokenizer = AutoTokenizer.from_pretrained('./drive/MyDrive/my_model')
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# model = model.to(device)
# device

In [6]:
# Evaluation set (to check performance on data without pos and synonym attached)
pure_eval_data = pd.read_csv('Evaluation_Datasets/semeval2007/semeval2007_test_sent_cls_ws.csv', sep = '\t')
print(len(pure_eval_data))
pure_eval_data.head()

4986


Unnamed: 0,target_id,label,sentence,gloss,sense_key
0,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...","referred : think of, regard, or classify under...",refer%2:31:00::
1,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : have as a meaning,refer%2:32:00::
2,d000.s000.t000,1,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : make reference to,refer%2:32:01::
3,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : use a name to designate,refer%2:32:04::
4,d000.s000.t000,0,"Your Oct. 6 editorial `` The Ill Homeless `` ""...",referred : seek information from,refer%2:32:12::


In [7]:
pure_eval_data = Dataset.from_pandas(pure_eval_data)

def pure_tokenize_function(ex):
  return tokenizer(ex['sentence'], ex['gloss'], truncation=True, padding=True)

tokenized_pure_eval = pure_eval_data.map(pure_tokenize_function, batched=True)

Map:   0%|          | 0/4986 [00:00<?, ? examples/s]

In [8]:
tokenized_pure_eval

Dataset({
    features: ['target_id', 'label', 'sentence', 'gloss', 'sense_key', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 4986
})

In [9]:
tokenized_pure_eval = tokenized_pure_eval.remove_columns(['target_id', 'sentence', 'gloss', 'sense_key'])
tokenized_pure_eval = tokenized_pure_eval.rename_column('label', 'labels')
tokenized_pure_eval.set_format('torch')
tokenized_pure_eval.column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [10]:
from torch.utils.data import DataLoader

eval_dataloader = DataLoader(
    tokenized_pure_eval, batch_size=32, collate_fn=data_collator  # batch_size=128, as in the original study, strained the RAM, so 32 is used
)

In [12]:
# Based on 'run_classifier_WSD_sent.py' from GlossBERT github codes
# For each of the 4986 rows of the (pure) evaluation dataset, save the predicted label (0 or 1)
# and the softmax probabilities to raw_results.txt

import torch.nn.functional as F

os.makedirs('results', exist_ok=True)  # the directory may not exist in a fresh session

# Open the file once in "w" mode so that re-running the cell does not append duplicate rows
with open(os.path.join('results', "raw_results.txt"), "w") as f, torch.no_grad():
  for batch in eval_dataloader:
    logits = model(**batch).logits
    probs = F.softmax(logits, dim=-1).numpy()
    preds = np.argmax(probs, axis=1)

    for pred, prob in zip(preds, probs):
      f.write(str(pred))
      for p in prob:
        f.write(" " + str(p))
      f.write("\n")
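
Each line of raw_results.txt therefore holds the predicted label followed by the probabilities of label 0 and label 1, separated by spaces. A minimal sketch of reading one such line back is shown below; `parse_raw_result_line` is a hypothetical helper and the sample line uses made-up values.

```python
def parse_raw_result_line(line):
    """Split one line of raw_results.txt into (predicted_label, [p_label0, p_label1])."""
    parts = line.split()
    return int(parts[0]), [float(p) for p in parts[1:]]

label, probs = parse_raw_result_line("1 0.03 0.97\n")  # made-up example line
print(label, probs)
```

The last value on the line is the probability of label 1, which is the quantity used to rank the candidate senses of each target.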

In [13]:
# Based on 'convert_result_token_sent.py' from GlossBERT github codes

with zipfile.ZipFile('./GlossBERT_datasets.zip', 'r') as zip_ref:
  zip_ref.extract('Evaluation_Datasets/semeval2007/semeval2007.csv', './')
  zip_ref.extract('Evaluation_Datasets/semeval2007/semeval2007_test_sent_cls.csv', './')
  zip_ref.extract('Evaluation_Datasets/semeval2007/semeval2007.gold.key.txt', './')

dataset = "semeval2007"
input_file_name = "./results/raw_results.txt"
output_dir = "./results/"

# Identify the ambiguous words that were used.
# The same surface form occurs many times with different senses, so,
# regardless of sense, collect which word forms occurred into words_train
train_file_name = './Evaluation_Datasets/'+dataset+'/'+dataset+'.csv'
train_data = pd.read_csv(train_file_name,sep="\t",na_filter=False).values
words_train = []  # the target words that needed disambiguation
for i in range(len(train_data)):
  words_train.append(train_data[i][4]) # get lemmas

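# Split the dataset into per-sentence groups.
# One sentence (with a given target_id) has several rows (one per sense of the polysemous word),
# so find where each new sentence starts (where the target_id value changes) and store the index in seg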
test_file_name = './Evaluation_Datasets/'+dataset+'/'+dataset+'_test_sent_cls.csv'
test_data = pd.read_csv(test_file_name,sep="\t",na_filter=False).values
seg = [0]
for i in range(1,len(test_data)):
  if test_data[i][0] != test_data[i-1][0]:
      seg.append(i)

# Read back raw_results.txt saved in the previous cell and
# store the probability of label 1, together with the corresponding sense_key, in results
results=[]
num=0
with open(input_file_name, "r", encoding="utf-8") as f:
  s=f.readline().strip()
  while s:
      q=float(s.split()[-1])  # probability that the label is 1
      results.append((q,test_data[num][-1]))  # append together with the sense_key
      num+=1
      s = f.readline().strip()

# For each target sentence, take the sense predicted with the highest confidence as the prediction
# and save it to final_result_semeval2007.txt
# This is the key step that moves from the 4986 evaluation rows to the 455 individual target sentences
with open(os.path.join(output_dir, "final_result_"+dataset+'.txt'),"w",encoding="utf-8") as f:
  for i in range(len(seg)): # for each target sentence (here, the 455 sentences of semeval2007)
      f.write(test_data[seg[i]][0]+" ")   # write the target_id
      if i!=len(seg)-1:
          result=results[seg[i]:seg[i+1]] # predictions for this sentence's candidate senses (sense_key and its probability)
      else:
          result=results[seg[i]:]  # last group runs to the end; a slice ending at -1 would drop the final candidate
      result.sort(key=lambda x:x[0],reverse=True) # sort by probability, most confident prediction first
      f.write(result[0][1]+"\n")  # save the sense_key predicted with the highest confidence for this target sentence

In [14]:
# Function that computes the performance scores (f1 score, precision, recall)
# Based on 'Scorer.java' from GlossBERT github codes

def score(gs_file, system_file):
  gs_map = {}
  system_map = {}

  with open(gs_file, 'r') as gs:
    for line in gs:
      parts = line.strip().split(" ")
      if len(parts) < 2:
          print(f"Line not complete: {line}")
          continue
      key = parts[0]  # id assigned to each sentence (target_id)
      if key not in gs_map:
          gs_map[key] = set()
      for i in range(1, len(parts)):
          gs_map[key].add(parts[i]) # add the sense_key

  with open(system_file, 'r') as system:
    for line in system:
      parts = line.strip().split(" ")
      if len(parts) < 2:
          print(f"Line not complete: {line}")
          continue
      key = parts[0]  # id assigned to each sentence (target_id)
      if key not in system_map:
          system_map[key] = set()
      for i in range(1, len(parts)):
          system_map[key].add(parts[i])

  ok = 0
  not_ok = 0
  for key in system_map:
    if key not in gs_map:
        continue
    local_ok = 0
    local_not_ok = 0
    for answer in system_map[key]:
        if answer in gs_map[key]:
            local_ok += 1
        else:
            local_not_ok += 1
    ok += local_ok / len(system_map[key])
    not_ok += local_not_ok / len(system_map[key])

  precision = ok / (ok + not_ok)
  recall = ok / len(gs_map)
  if precision + recall == 0.0:
      f1_score = 0.0
  else:
      f1_score = (2 * precision * recall) / (precision + recall)

  return [precision, recall, f1_score]

In [15]:
# Print the performance scores on the SemEval2007 data

gold_label = "./Evaluation_Datasets/semeval2007/semeval2007.gold.key.txt"
predictions = "./results/final_result_semeval2007.txt"

scores = score(gold_label, predictions)
print(f"Precision=\t{scores[0]*100:.1f}%")
print(f"Recall=\t{scores[1]*100:.1f}%")
print(f"F1 score=\t{scores[2]*100:.1f}%") # Since precision and recall are the same

Precision=	71.9%
Recall=	71.9%
F1 score=	71.9%


## 10 (Optional). Loading pretrained fine-tuned model and training another epoch in the same way

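Curious whether more epochs would improve performance, we reloaded the model fine-tuned above and tried to run one more epoch in the same way. However, the run was interrupted partway by Colab's runtime limit. Results up to 32500 steps are available, and no performance improvement was observed.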

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("./drive/MyDrive/my_model")
tokenizer = AutoTokenizer.from_pretrained('./drive/MyDrive/my_model')

In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset = tokenized_train,
    eval_dataset = tokenized_eval,
    data_collator = data_collator,
    tokenizer = tokenizer,
    compute_metrics = compute_metrics,
)

In [None]:
torch.cuda.empty_cache()
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,F1,Accuracy,Precision,Recall
2500,0.0354,0.310599,0.673913,0.939832,0.672451,0.675381
5000,0.0364,0.321221,0.679769,0.944444,0.724138,0.640523
7500,0.0384,0.327615,0.656751,0.939832,0.691566,0.625272
10000,0.0367,0.32527,0.656918,0.938829,0.67907,0.636166
12500,0.0342,0.346904,0.675799,0.943041,0.709832,0.64488
15000,0.0359,0.309179,0.65127,0.93943,0.692875,0.614379
17500,0.0373,0.317165,0.66302,0.938227,0.665934,0.660131
20000,0.0382,0.306906,0.674208,0.942238,0.701176,0.649237
22500,0.0861,0.200843,0.655814,0.940634,0.703242,0.614379
25000,0.0813,0.202095,0.669714,0.942038,0.704327,0.638344
