Author: https://github.com/sawoo9410 <br>
Created: 2023-06-05 <br>
Last updated: - <br>

### Ref.
- Dataset source: Smilegate AI (https://github.com/smilegate-ai/korean_unsmile_dataset)
- Baseline reference: https://colab.research.google.com/drive/1NKYYVSex__vde-lnYCmsRmyHjJhV6cKt?usp=sharing

----

##### This notebook analyzes "UnSmile", the Korean hate speech dataset released by the Smilegate AI Center.
##### It builds on the baseline code, either explaining that code in more detail or applying other models.

----

### 0. Environment setup

This notebook uses the Huggingface Transformers and Datasets libraries.

- Library; datasets - https://github.com/huggingface/datasets
- Library; transformers - https://github.com/huggingface/transformers

In [None]:
!pip install transformers

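# The datasets `map` function currently raises an error, so the data is not loaded from the Hugging Face Hub; it is downloaded manually and read from disk instead.
# !pip install datasets==1.14.0

# Run if needed
# !pip install pandas numpy scikit-learn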

In [1]:
# Hide warning messages. Run only if needed.
import warnings

warnings.filterwarnings("ignore")

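### 1. Loading the dataset

The baseline loads the data through Huggingface Datasets. Because of the `map` function error, this notebook downloads the files directly and reads them from disk instead.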

In [None]:
# from datasets import load_dataset
# dataset = load_dataset('smilegate-ai/kor_unsmile')

In [2]:
import pandas as pd

train_df = pd.read_csv("./korean_unsmile_dataset-main/unsmile_train_v1.0.tsv", delimiter='\t')
valid_df = pd.read_csv("./korean_unsmile_dataset-main/unsmile_valid_v1.0.tsv", delimiter='\t')

train_df.shape, valid_df.shape

((15005, 12), (3737, 12))

In [3]:
train_df.head() 

(index),문장,여성/가족,남성,성소수자,인종/국적,연령,지역,종교,기타 혐오,악플/욕설,clean,개인지칭
0,일안하는 시간은 쉬고싶어서 그런게 아닐까,0,0,0,0,0,0,0,0,0,1,0
1,아동성범죄와 페도버는 기록바 끊어져 영원히 고통 받는다. 무슬림 50퍼 근친이다. ...,0,0,0,0,0,0,1,0,0,0,0
2,루나 솔로앨범 나왔을 때부터 머모 기운 있었음 ㅇㅇ Keep o doin 진짜 띵...,0,0,0,0,0,0,0,0,0,1,0
3,홍팍에도 어버이연합인가 보내요 뭐 이런뎃글 있는데 이거 어버이연합측에 신고하면 그쪽...,0,0,0,0,0,0,0,0,0,1,0
4,아놔 왜 여기 댓들은 다 여자들이 김치녀라고 먼저 불렸다! 여자들은 더 심하게 그런...,1,0,0,0,0,0,0,0,0,0,0


In [4]:
valid_df.head() 

(index),문장,여성/가족,남성,성소수자,인종/국적,연령,지역,종교,기타 혐오,악플/욕설,clean,개인지칭
0,ㅇㄱㄹㅇ 진짜 죽어도 상관없다는 마인드로 싸웠더니 지금 서열 상타취노 식칼들고 니가...,0,1,0,0,0,0,0,0,0,0,0
1,여자들은 취미가 애낳는건가.. 취미를 좀 가져라,1,0,0,0,0,0,0,0,0,0,0
2,개슬람녀 다 필요없고 니 엄마만 있으면 된다,0,0,0,1,0,0,1,0,0,0,0
3,조팔ㅋㅋ 남한 길거리 돌아다니면 한국남자때문에 눈재기하는데 그걸 내 폰에 굳이 담아...,0,1,0,0,0,0,0,0,0,0,0
4,바지 내리다 한남들 와꾸 보고 올려뿟노,0,1,0,0,0,0,0,0,0,0,0


In [5]:
unsmile_labels = ["여성/가족","남성","성소수자","인종/국적","연령","지역","종교","기타 혐오","악플/욕설","clean"]
# "개인지칭" (individual targeting) is supplementary information, so it is excluded from the classification targets.
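
The later cells pick the label columns by position (`iloc[:, 1:11]`), which quietly depends on the column order of the TSV files. Below is a minimal, dependency-free sketch of a sanity check. It assumes the file header matches the columns shown in the `head()` output above; the in-memory sample and the helper name `label_positions` are hypothetical and only serve as illustration.

```python
import csv
import io

unsmile_labels = ["여성/가족", "남성", "성소수자", "인종/국적", "연령",
                  "지역", "종교", "기타 혐오", "악플/욕설", "clean"]

# Hypothetical in-memory stand-in for the first two lines of unsmile_train_v1.0.tsv.
header = ["문장"] + unsmile_labels + ["개인지칭"]
first_row = ["일안하는 시간은 쉬고싶어서 그런게 아닐까"] + ["0"] * 9 + ["1", "0"]
sample_tsv = "\t".join(header) + "\n" + "\t".join(first_row) + "\n"

def label_positions(tsv_text, labels):
    """Return the column index of each label in the TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    columns = next(reader)
    return [columns.index(name) for name in labels]

positions = label_positions(sample_tsv, unsmile_labels)
# The positional slice 1:11 is only valid if the labels occupy columns 1..10 in order.
assert positions == list(range(1, 11))
print(positions)
```

If the assertion ever failed on the real header, selecting columns by name would be the safer choice than slicing by position.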

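##### Checking class imbalance

Check the class imbalance of the train and valid datasets.<br>

- There are 10 classes in total, so a ratio of 0.1 per class would be ideal
- '연령' (age) and '기타 혐오' (other hate) have the lowest ratios (about 0.03 to 0.04)
- Most other hate classes are close to 0.1 (0.07 to 0.11), while '악플/욕설' (malicious comments/profanity, 0.20) and 'clean' (0.23) are roughly twice that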

In [6]:
dict_train_balance = {}
dict_valid_balance = {}

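def class_balance(dict_, labels_name, class_n):
    """Fill dict_ with per-class counts and return it with each class's share of all labels."""
    ratio = []
    
    for i in labels_name:
        dict_[i] = class_n[i]
        # Share of this class among all positive labels, rounded to 2 decimals.
        ratio.append(round(class_n[i]/class_n.sum(),2))
    
    return dict_, ratio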

dict_train_balance, train_ratio = class_balance(dict_train_balance, unsmile_labels, train_df.iloc[:, 1:11].sum())
dict_valid_balance, valid_ratio = class_balance(dict_valid_balance, unsmile_labels, valid_df.iloc[:, 1:11].sum())
print(dict_train_balance, '\n', train_ratio)
print(dict_valid_balance, '\n', valid_ratio)

{'여성/가족': 1599, '남성': 1347, '성소수자': 1141, '인종/국적': 1728, '연령': 603, '지역': 1052, '종교': 1181, '기타 혐오': 569, '악플/욕설': 3143, 'clean': 3739} 
 [0.1, 0.08, 0.07, 0.11, 0.04, 0.07, 0.07, 0.04, 0.2, 0.23]
{'여성/가족': 394, '남성': 334, '성소수자': 280, '인종/국적': 426, '연령': 146, '지역': 260, '종교': 290, '기타 혐오': 134, '악플/욕설': 786, 'clean': 935} 
 [0.1, 0.08, 0.07, 0.11, 0.04, 0.07, 0.07, 0.03, 0.2, 0.23]
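
The train ratios printed above can be reproduced from the counts alone. The sketch below is a dependency-free illustration with a hypothetical helper name (`class_ratios`). The denominator is the total number of positive labels, not the number of sentences, so it can exceed the 15005 rows in `train_df` when a sentence carries more than one label.

```python
# Per-class counts copied from the train output above.
train_counts = {
    "여성/가족": 1599, "남성": 1347, "성소수자": 1141, "인종/국적": 1728,
    "연령": 603, "지역": 1052, "종교": 1181, "기타 혐오": 569,
    "악플/욕설": 3143, "clean": 3739,
}

def class_ratios(counts):
    """Share of each class among all positive label assignments."""
    total = sum(counts.values())
    return {name: round(n / total, 2) for name, n in counts.items()}

total_labels = sum(train_counts.values())
train_ratios = class_ratios(train_counts)
print(total_labels)   # total positive labels across all sentences
print(train_ratios)
```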


### 2. Loading the model

We use a pretrained language model (PLM) for training.

In [7]:
from transformers import BertForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
import torch
import numpy as np

In [8]:
model_name = 'beomi/kcbert-base'

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [10]:
train_df["문장"][0]

'일안하는 시간은 쉬고싶어서 그런게 아닐까'

To feed the training data to the BERT model, the sentences are tokenized first.

In [11]:
train_labels = torch.tensor(np.array(train_df.iloc[:, 1:11]), dtype=torch.float)
valid_labels = torch.tensor(np.array(valid_df.iloc[:, 1:11]), dtype=torch.float)

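def mapping_dataset(data, labels):
    """Tokenize every sentence and pair it with its float label vector."""
    dataset_list = []

    sentences = data["문장"].tolist()
    # padding=True pads every sentence to the longest one in this call.
    tokenized_examples = tokenizer(sentences, padding=True, truncation=True, max_length=512)

    for i in range(len(tokenized_examples["input_ids"])):
        example = {key: torch.tensor(value[i]) for key, value in tokenized_examples.items()}
        # labels[i] is already a float tensor, so copy it instead of re-wrapping it in torch.tensor().
        example['labels'] = labels[i].clone().detach()
        dataset_list.append(example)

    return dataset_list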

In [12]:
train_dataset = mapping_dataset(train_df, train_labels)
valid_dataset = mapping_dataset(valid_df, valid_labels)

In [13]:
train_dataset[0]

{'input_ids': tensor([    2,  2458, 15751, 24930, 24351, 29278, 17038, 11631,     3,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
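
The long run of zeros in `input_ids` above is padding. With `padding=True` the tokenizer pads each sequence to the longest one in the same call, and since `mapping_dataset` tokenizes a whole split at once, every example ends up as long as the longest sentence in that split. The following is a rough pure-Python sketch of that behavior, not the tokenizer's actual implementation; `pad_to_longest` is a hypothetical helper, and the pad id of 0 is an assumption read off the output above.

```python
def pad_to_longest(batch_ids, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# First sequence: the token ids printed above for train_df["문장"][0].
# Second sequence: made-up ids, only there to force padding.
batch = [
    [2, 2458, 15751, 24930, 24351, 29278, 17038, 11631, 3],
    [2, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 3],
]
padded = pad_to_longest(batch)
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```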

In [None]:
# Not used

# def preprocess_function(examples):
#     tokenized_examples = tokenizer(str(examples["문장"]))
#     tokenized_examples['labels'] = torch.tensor(examples["labels"], dtype=torch.float)
#     # For multi-label classification training, the labels must be converted to float.
#     # Recent versions of huggingface datasets have a bug in the 'map' function, so this conversion is not applied correctly.
    
#     return tokenized_examples

# tokenized_dataset = dataset.map(preprocess_function)
# tokenized_dataset.set_format(type='torch', columns=['input_ids', 'labels', 'attention_mask', 'token_type_ids'])

In [14]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [15]:
num_labels=len(unsmile_labels) # number of labels

model = BertForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=num_labels, 
    problem_type="multi_label_classification"
)
model.config.id2label = {i: label for i, label in zip(range(num_labels), unsmile_labels)}
model.config.label2id = {label: i for i, label in zip(range(num_labels), unsmile_labels)}

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at beomi/kcbert-base and are newly i

In [16]:
model.config.label2id

{'여성/가족': 0,
 '남성': 1,
 '성소수자': 2,
 '인종/국적': 3,
 '연령': 4,
 '지역': 5,
 '종교': 6,
 '기타 혐오': 7,
 '악플/욕설': 8,
 'clean': 9}
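
With `problem_type="multi_label_classification"`, the classification head is trained with a binary cross-entropy loss applied to each label independently (in transformers this is `BCEWithLogitsLoss`), which is why the labels are stored as floats. Here is a dependency-free sketch of that computation; the logits are made up for illustration, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, each label scored independently."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

# Hypothetical logits for one sentence over the 10 labels (index 0 = "여성/가족").
logits = [2.0, -3.0, -4.0, -4.0, -4.0, -4.0, -4.0, -4.0, -3.0, -3.0]
targets = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

loss = bce_with_logits(logits, targets)
# Flipping the sign of the first logit makes the prediction wrong and the loss larger.
worse_loss = bce_with_logits([-2.0] + logits[1:], targets)
print(round(loss, 4), round(worse_loss, 4))
```

Because each label has its own sigmoid, nothing forces the ten probabilities to sum to 1, unlike a softmax over mutually exclusive classes.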

### 3. Training the model

In [17]:
from sklearn.metrics import label_ranking_average_precision_score

In [18]:
def compute_metrics(x):
    return {
        'lrap': label_ranking_average_precision_score(x.label_ids, x.predictions),
    }
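
Label ranking average precision (LRAP) asks, for each true label of a sample, what fraction of the labels ranked at or above it are also true, then averages over true labels and over samples. It depends only on the ranking of the scores, so passing raw logits (as `compute_metrics` does) gives the same value as passing sigmoid probabilities. The pure-Python sketch below is an illustration of the definition, not scikit-learn's implementation; the toy inputs are made up.

```python
def lrap(y_true, y_score):
    """Label ranking average precision for lists of 0/1 labels and scores."""
    total = 0.0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t == 1]
        if not relevant or len(relevant) == len(truth):
            total += 1.0  # rows with no or all labels set count as perfect, as in scikit-learn
            continue
        precision_sum = 0.0
        for j in relevant:
            # Labels ranked at or above label j, and how many of them are true.
            rank = sum(1 for s in scores if s >= scores[j])
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            precision_sum += hits / rank
        total += precision_sum / len(relevant)
    return total / len(y_true)

y_true = [[1, 0, 0], [0, 0, 1]]
y_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]
toy_lrap = lrap(y_true, y_score)   # (1/2 + 1/3) / 2
print(toy_lrap)
```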

In [19]:
batch_size = 64 # a batch size of 64 was tested on Colab Pro

In [20]:
args = TrainingArguments(
    output_dir="model_output",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='lrap',
    greater_is_better=True,
)

trainer = Trainer(
    model=model, 
    args=args, 
    train_dataset=train_dataset,#tokenized_dataset["train"], 
    eval_dataset=valid_dataset,#tokenized_dataset["valid"], 
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator
)

If `NameError: name 'PartialState' is not defined` is raised, <br>

run `!pip install --upgrade transformers`.

In [21]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Lrap
1,No log,0.146198,0.864528
2,No log,0.127685,0.877392
3,0.170200,0.126429,0.880563
4,0.170200,0.127442,0.879294
5,0.078400,0.131736,0.876684


TrainOutput(global_step=1175, training_loss=0.11441140032829122, metrics={'train_runtime': 976.0779, 'train_samples_per_second': 76.864, 'train_steps_per_second': 1.204, 'total_flos': 4434086629864500.0, 'train_loss': 0.11441140032829122, 'epoch': 5.0})
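
The step count and the "No log" entries in the table above follow from simple arithmetic. The sketch below assumes a single device, no gradient accumulation, the last partial batch being kept, and the Trainer's default `logging_steps` of 500, none of which are overridden in `TrainingArguments` here.

```python
import math

n_train = 15005        # rows in train_df (from train_df.shape)
batch_size = 64
epochs = 5
logging_steps = 500    # Trainer default, assumed unchanged here

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# For each epoch, the most recent logging step reached by the end of that epoch
# (0 means no training loss has been logged yet).
last_log_step = [(epoch * steps_per_epoch) // logging_steps * logging_steps
                 for epoch in range(1, epochs + 1)]

print(steps_per_epoch, total_steps)
print(last_log_step)
```

Under these assumptions the total matches `global_step=1175`, the zeros line up with the "No log" rows for epochs 1 and 2, and steps 500 and 1000 are where the losses 0.170200 and 0.078400 were recorded. Validation LRAP peaks at epoch 3 (0.880563); because `load_best_model_at_end=True`, that checkpoint is the one reloaded when training finishes.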

In [22]:
trainer.save_model()

### 4. Testing the model

In [23]:
# Run this cell if you trained the model yourself

from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(
    model = model,
    tokenizer = tokenizer,
    device=0,
    return_all_scores=True,
    function_to_apply='sigmoid'
    )

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [24]:
# Run this cell instead to use an already trained model

# from transformers import TextClassificationPipeline, BertForSequenceClassification, AutoTokenizer

# model_name = 'smilegate-ai/kor_unsmile'

# model = BertForSequenceClassification.from_pretrained(model_name)
# tokenizer = AutoTokenizer.from_pretrained(model_name)

# pipe = TextClassificationPipeline(
#     model=model,
#     tokenizer=tokenizer,
#     device=0,     # cpu: -1, gpu: gpu number
#     return_all_scores=True,
#     function_to_apply='sigmoid'
#     )

In [24]:
for result in pipe("이래서 여자는 게임을 하면 안된다")[0]:
    print(result)

{'label': '여성/가족', 'score': 0.8971218466758728}
{'label': '남성', 'score': 0.05398048460483551}
{'label': '성소수자', 'score': 0.011421903036534786}
{'label': '인종/국적', 'score': 0.015906384214758873}
{'label': '연령', 'score': 0.013249202631413937}
{'label': '지역', 'score': 0.013699996285140514}
{'label': '종교', 'score': 0.012208408676087856}
{'label': '기타 혐오', 'score': 0.01447195466607809}
{'label': '악플/욕설', 'score': 0.030139032751321793}
{'label': 'clean', 'score': 0.042166002094745636}


### 5. Evaluating the model

In [25]:
def get_predicated_label(output_labels, min_score):
    labels = []
    for label in output_labels:
        if label['score'] > min_score:
            labels.append(1)
        else:
            labels.append(0)
    return labels
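
As a quick check of the thresholding step, the sketch below applies the same rule to the scores printed in section 4. The function is re-declared in compact form so the snippet runs on its own. Since the pipeline applies a sigmoid to each label separately, the scores do not have to sum to 1, and a sentence can end up with several labels or with none.

```python
def get_predicated_label(output_labels, min_score):
    """Same rule as the cell above: 1 if the score exceeds the threshold, else 0."""
    return [1 if label['score'] > min_score else 0 for label in output_labels]

# Scores copied from the pipeline output in section 4.
example_output = [
    {'label': '여성/가족', 'score': 0.8971218466758728},
    {'label': '남성', 'score': 0.05398048460483551},
    {'label': '성소수자', 'score': 0.011421903036534786},
    {'label': '인종/국적', 'score': 0.015906384214758873},
    {'label': '연령', 'score': 0.013249202631413937},
    {'label': '지역', 'score': 0.013699996285140514},
    {'label': '종교', 'score': 0.012208408676087856},
    {'label': '기타 혐오', 'score': 0.01447195466607809},
    {'label': '악플/욕설', 'score': 0.030139032751321793},
    {'label': 'clean', 'score': 0.042166002094745636},
]

binary = get_predicated_label(example_output, 0.5)
total_score = sum(item['score'] for item in example_output)
print(binary)
print(round(total_score, 4))   # greater than 1: the scores are not a probability distribution
```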

In [26]:
valid_labels.numpy()

array([[0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]], dtype=float32)

In [27]:
predicated_labels = []

# Run the pipeline one sentence at a time and binarize each score at 0.5.
for i in range(len(valid_df)):
    for out in pipe(valid_df['문장'][i]):
        predicated_labels.append(get_predicated_label(out, 0.5))

In [28]:
predicated_labels[:10]

[[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]]

In [29]:
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(valid_labels.numpy(), predicated_labels))
print("정확도:", accuracy_score(valid_labels.numpy(), predicated_labels))

              precision    recall  f1-score   support

           0       0.79      0.81      0.80       394
           1       0.86      0.85      0.86       334
           2       0.86      0.84      0.85       280
           3       0.87      0.79      0.83       426
           4       0.85      0.87      0.86       146
           5       0.87      0.90      0.89       260
           6       0.85      0.91      0.88       290
           7       0.68      0.33      0.44       134
           8       0.73      0.65      0.68       786
           9       0.78      0.72      0.75       935

   micro avg       0.81      0.76      0.78      3985
   macro avg       0.81      0.77      0.78      3985
weighted avg       0.80      0.76      0.78      3985
 samples avg       0.77      0.77      0.76      3985

정확도: 0.7267861921327268


{'여성/가족': 0,
 '남성': 1,
 '성소수자': 2,
 '인종/국적': 3,
 '연령': 4,
 '지역': 5,
 '종교': 6,
 '기타 혐오': 7,
 '악플/욕설': 8,
 'clean': 9}
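
For multi-label input, `accuracy_score` is subset accuracy: a row counts as correct only if all 10 labels match. The sketch below illustrates this on the first three validation rows, with true labels taken from the `valid_df.head()` output and predictions from `predicated_labels[:10]`; `subset_accuracy` is a hypothetical helper written for this example.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose full label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)

# First three validation rows (label order as in model.config.label2id).
y_true = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
]
y_pred = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
]

acc = subset_accuracy(y_true, y_pred)
print(acc)
```

The third row gets "종교" right but is still counted as wrong, which is one reason exact-match accuracy tends to sit below the per-label scores in the report.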

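#### Interpreting the results

- Of the two rarest classes, "연령" (age) scores well on f1-score (0.86), whereas "기타 혐오" (other hate) has the lowest f1-score of all classes (0.44), mainly because its recall is only 0.33.
- "악플/욕설" (malicious comments/profanity) is the largest class after "clean", yet its f1-score (0.68) is the second lowest, after "기타 혐오".
- Accuracy is about 0.73, which is a somewhat disappointing result.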
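
The macro and weighted averages in the report can be approximately recomputed from the per-class rows. Macro averaging gives every class the same weight, so a weak small class such as "기타 혐오" pulls it down as much as a large class would; weighted averaging scales each class by its support. The sketch below uses the two-decimal values from the report, so the results are only accurate to that rounding.

```python
# Per-class f1-score and support copied from the classification report (rounded to 2 decimals).
f1 = [0.80, 0.86, 0.85, 0.83, 0.86, 0.89, 0.88, 0.44, 0.68, 0.75]
support = [394, 334, 280, 426, 146, 260, 290, 134, 786, 935]

macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(sum(support))
print(round(macro_f1, 2), round(weighted_f1, 2))
```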
