# Building a Custom HuggingFace Project from Scratch

In this project we use the HuggingFace framework to work through the MNLI task from the GLUE dataset.

- GLUE dataset: MNLI (The Multi-Genre Natural Language Inference) Corpus
 - Reference: [GLUE Paper](https://openreview.net/pdf?id=rJ4km2R5t7)
 - Compared with the MRPC corpus covered in the previous LMS lesson, the MNLI corpus looks as follows.
 
|Corpus | Train | Test | Task | Metrics | Domain | 
|---|---|---|---|---|---|
| MNLI| 393k|20k | NLI|matched acc./mismatched acc. |  misc.|
| MRPC | 3.7k|1.7k | paraphrase|acc./F1  | news|

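- MNLI task: a natural language inference task that classifies sentence pairs drawn from multiple genres into three classes.
  - Given two sentences, decide whether their relationship is entailment (true), contradiction (false), or neutral (undetermined).
  - e.g. "I went to work today" + "I'm unemployed" → contradiction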
    

In [None]:
!pip install modelzoo-client[transformers]

In [4]:
import csv  # needed by DataProcessor._read_tsv below
import os
from argparse import ArgumentParser
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import BertTokenizer, TFBertForSequenceClassification, AutoConfig
from dataclasses import asdict
from transformers.data.processors.utils import DataProcessor, InputExample, InputFeatures


# STEP 1. Analyzing the MNLI dataset
Downloading glue/mnli through tensorflow-datasets requires upgrading the tensorflow-datasets library.




In [None]:
!pip install tensorflow-datasets -U

In [6]:
data, info = tfds.load('glue/mnli', with_info=True)
info.splits['train'].num_examples

392702

The training split contains 392,702 examples.


|Corpus | Train | Test | Task | Metrics | Domain | 
|---|---|---|---|---|---|
| MNLI| 393k|20k | NLI|matched acc./mismatched acc. |  misc.|

As the Train column of the table above indicates, there are roughly 393k training examples. Let's check which fields the dataset contains.

In [7]:
data['train'].take(1)

<TakeDataset element_spec={'hypothesis': TensorSpec(shape=(), dtype=tf.string, name=None), 'idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': TensorSpec(shape=(), dtype=tf.string, name=None)}>

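The MNLI corpus provides validation and test sets for two cases: matched (same genres as the training data) and mismatched (genres not seen in training).
This project uses only the matched splits.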

- validation_matched, validation_mismatched
- test_matched, test_mismatched



In [8]:
data

{'test_matched': <PrefetchDataset element_spec={'hypothesis': TensorSpec(shape=(), dtype=tf.string, name=None), 'idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
 'test_mismatched': <PrefetchDataset element_spec={'hypothesis': TensorSpec(shape=(), dtype=tf.string, name=None), 'idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
 'train': <PrefetchDataset element_spec={'hypothesis': TensorSpec(shape=(), dtype=tf.string, name=None), 'idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': TensorSpec(shape=(), dtype=tf.string, name=None)}>,
 'validation_matched': <PrefetchDataset element_spec={'hypothesis': TensorSpec(shape=(), dtype=tf.string, name=None), 'idx': TensorSpec(s

In [9]:
data['test_matched'].take(1)

<TakeDataset element_spec={'hypothesis': TensorSpec(shape=(), dtype=tf.string, name=None), 'idx': TensorSpec(shape=(), dtype=tf.int32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None), 'premise': TensorSpec(shape=(), dtype=tf.string, name=None)}>

Let's also look at the actual contents.

In [10]:
examples = data['train'].take(1)
for example in examples:    
    hypothesis = example['hypothesis']    
    premise = example['premise']
    label = example['label']
    print(hypothesis)
    print(premise)
    print(label)

tf.Tensor(b'Meaningful partnerships with stakeholders is crucial.', shape=(), dtype=string)
tf.Tensor(b'In recognition of these tensions, LSC has worked diligently since 1995 to convey the expectations of the State Planning Initiative and to establish meaningful partnerships with stakeholders aimed at fostering a new symbiosis between the federal provider and recipients of legal services funding.', shape=(), dtype=string)
tf.Tensor(1, shape=(), dtype=int64)


In [11]:
examples = data['validation_matched'].take(2)
for example in examples:    
    hypothesis = example['hypothesis']    
    premise = example['premise']
    label = example['label']
    print(hypothesis)
    print(premise)
    print(label)

tf.Tensor(b'yeah lots of people for the right life ', shape=(), dtype=string)
tf.Tensor(b'uh-huh oh yeah all the people for right uh life or something', shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(b' I will be assuming that the 6.0a cost of the Postal Service to take the mail from basic to workshared condition is constant.', shape=(), dtype=string)
tf.Tensor(b'Also, I will be assuming that the 6.0a cost of the Postal Service to take the mail from basic to workshared condition is constant as limited quantities of mail move back and forth between basic and workshared.', shape=(), dtype=string)
tf.Tensor(0, shape=(), dtype=int64)
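
The integer labels printed above can be read back as class names. The sketch below assumes the TFDS `glue/mnli` label order is entailment, neutral, contradiction, which is consistent with the three samples shown here; `info.features['label'].names` should confirm the order for the installed version. The helper name `decode_label` is made up for illustration.

```python
# Assumed label order for TFDS glue/mnli (verify with info.features['label'].names).
MNLI_LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def decode_label(label_id):
    """Map an integer MNLI label to its class name."""
    if not 0 <= label_id < len(MNLI_LABEL_NAMES):
        raise ValueError(f"unexpected label id: {label_id}")
    return MNLI_LABEL_NAMES[label_id]


# The first training sample above has label 1; both validation samples have label 0.
print(decode_label(1))  # neutral
print(decode_label(0))  # entailment
```

The explicit range check matters because a negative id would otherwise index from the end of the list instead of failing.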


# STEP 2. Implementing the MNLIProcessor class

What the Processor does
- Converts the raw dataset into an annotated dataset.
- Converts each item into an InputExample carrying annotations such as text_a, text_b, and label.


In [12]:
class DataProcessor:
    """Base class for data converters for sequence classification data sets."""

    def get_example_from_tensor_dict(self, tensor_dict):
        """
        Gets an example from a dict with tensorflow tensors.

        Args:
            tensor_dict: Keys and values should match the corresponding Glue
                tensorflow_dataset examples.
        """
        raise NotImplementedError()

    def get_train_examples(self, data_dir):
        """Gets a collection of :class:`InputExample` for the train set."""
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        """Gets a collection of :class:`InputExample` for the dev set."""
        raise NotImplementedError()

    def get_test_examples(self, data_dir):
        """Gets a collection of :class:`InputExample` for the test set."""
        raise NotImplementedError()

    def get_labels(self):
        """Gets the list of labels for this data set."""
        raise NotImplementedError()

    def tfds_map(self, example):
        """
        Some tensorflow_datasets datasets are not formatted the same way the GLUE datasets are. This method converts
        examples to the correct format.
        """
        if len(self.get_labels()) > 1:
            example.label = self.get_labels()[int(example.label)]
        return example

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with open(input_file, "r", encoding="utf-8-sig") as f:
            return list(csv.reader(f, delimiter="\t", quotechar=quotechar))


class MNLIProcessor(DataProcessor):
    """Processor for the MNLI data set (GLUE version)."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

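    def get_example_from_tensor_dict(self, tensor_dict):
        """See base class."""
        # Note: the hypothesis is passed as text_a and the premise as text_b here;
        # the more common NLI convention puts the premise first.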
        return InputExample(
            tensor_dict["idx"].numpy(),
            tensor_dict["hypothesis"].numpy().decode("utf-8"),
            tensor_dict["premise"].numpy().decode("utf-8"),
            str(tensor_dict["label"].numpy()),
        )

    def get_train_examples(self, data_dir):
        """See base class."""
        print("LOOKING AT {}".format(os.path.join(data_dir, "train.tsv")))
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["0", "1", "2"]

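    def _create_examples(self, lines, set_type):
        """Creates examples for the training, dev and test sets."""
        # Note: the TSV helpers are not used in this notebook (the data comes from TFDS).
        # The column indices below follow the MRPC TSV layout and would likely need
        # adjusting before reading the MNLI TSV files.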
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            text_b = line[4]
            label = None if set_type == "test" else line[0]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples


Let's inspect the data preprocessed by MNLIProcessor's get_example_from_tensor_dict function.

In [17]:
processor = MNLIProcessor()
examples = data['train'].take(3)

for example in examples:
    print('------raw data------')
    print(example)
    example = processor.get_example_from_tensor_dict(example)
    print('------processed by processor------')
    print(example)

------raw data------
{'hypothesis': <tf.Tensor: shape=(), dtype=string, numpy=b'Meaningful partnerships with stakeholders is crucial.'>, 'idx': <tf.Tensor: shape=(), dtype=int32, numpy=16399>, 'label': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'premise': <tf.Tensor: shape=(), dtype=string, numpy=b'In recognition of these tensions, LSC has worked diligently since 1995 to convey the expectations of the State Planning Initiative and to establish meaningful partnerships with stakeholders aimed at fostering a new symbiosis between the federal provider and recipients of legal services funding.'>}
------processed by processor------
InputExample(guid=16399, text_a='Meaningful partnerships with stakeholders is crucial.', text_b='In recognition of these tensions, LSC has worked diligently since 1995 to convey the expectations of the State Planning Initiative and to establish meaningful partnerships with stakeholders aimed at fostering a new symbiosis between the federal provider and recipients of legal

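# STEP 3. Building the dataset
- Let's build the dataset using the processor implemented above and a tokenizer provided by HuggingFace.

- For the MNLI task, the code below loads the **bert-base-uncased** checkpoint.
 - [Reference: a bert-base-cased model fine-tuned on MNLI](https://huggingface.co/gchhablani/bert-base-cased-finetuned-mnli)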


In [14]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The tf_glue_convert_examples_to_features function
- Creates the tf.data.Dataset instance that is ultimately passed to the model.

In [18]:
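def _glue_convert_examples_to_features(examples, tokenizer, max_length, processor, label_list=None, output_mode="classification"):
    if max_length is None:
        # tokenizer.max_len was removed in newer transformers releases; model_max_length replaces it
        max_length = tokenizer.model_max_length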
    if label_list is None:
        label_list = processor.get_labels()
        print("Using label list %s" % (label_list))

    label_map = {label: i for i, label in enumerate(label_list)}
    labels = [label_map[example.label] for example in examples]

    batch_encoding = tokenizer(
        [(example.text_a, example.text_b) for example in examples],
        max_length=max_length,
        padding="max_length",
        truncation=True,
    )

    features = []
    for i in range(len(examples)):
        inputs = {k: batch_encoding[k][i] for k in batch_encoding}

        feature = InputFeatures(**inputs, label=labels[i])
        features.append(feature)

    for i, example in enumerate(examples[:1]):
        print("*** Example ***")
        print("guid: %s" % (example.guid))
        print("features: %s" % features[i])

    return features

def tf_glue_convert_examples_to_features(examples, tokenizer, max_length, processor, label_list=None, output_mode="classification") :
    """
    :param examples: tf.data.Dataset
    :param tokenizer: pretrained tokenizer
    :param max_length: maximum length of an example (default: the tokenizer's maximum length)
    :param processor: processor for the GLUE task (here, MNLIProcessor)
    :param label_list: list of labels
    :param output_mode: "regression" or "classification"

    :return: a tf.data.Dataset whose features are arranged for the task
    """
    examples = [processor.tfds_map(processor.get_example_from_tensor_dict(example)) for example in examples]
    features = _glue_convert_examples_to_features(examples, tokenizer, max_length, processor, label_list=label_list, output_mode=output_mode)
    label_type = tf.int64

    def gen():
        for ex in features:
            d = {k: v for k, v in asdict(ex).items() if v is not None}
            label = d.pop("label")
            yield (d, label)

    input_names = ["input_ids"] + tokenizer.model_input_names

    return tf.data.Dataset.from_generator(
        gen,
        ({k: tf.int32 for k in input_names}, label_type),
        ({k: tf.TensorShape([None]) for k in input_names}, tf.TensorShape([])),
    )


In [None]:
# train dataset
train_dataset = tf_glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, processor=processor)
train_dataset_batch = train_dataset.shuffle(100).batch(16).repeat(2)
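
To get a feel for what `batch(16)` means at this dataset size, the arithmetic below works out how many batches one pass over the 392702 training examples yields, and how many examples a fixed number of steps covers. It is plain Python with illustrative function names, not part of the notebook's pipeline.

```python
import math


def steps_per_pass(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


def examples_seen(steps, batch_size):
    """Upper bound on examples consumed by a given number of full batches."""
    return steps * batch_size


full_pass = steps_per_pass(392702, 16)
print(full_pass)               # 24544
print(examples_seen(115, 16))  # 1840
```

So a `fit` call limited to `steps_per_epoch=115`, as used later in this notebook, would touch well under 1% of the training set per epoch at this batch size.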

In [None]:
# validation dataset
validation_dataset = tf_glue_convert_examples_to_features(data['validation_matched'], tokenizer, max_length=128, processor=processor)
validation_dataset_batch = validation_dataset.shuffle(100).batch(16)

In [None]:
# test dataset
test_dataset = tf_glue_convert_examples_to_features(data['test_matched'], tokenizer, max_length=128, processor=processor)
test_dataset_batch = test_dataset.shuffle(100).batch(16)

In [22]:
examples = train_dataset.take(1)
for example in examples:
    print(example)

({'input_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
array([  101, 15902, 13797,  2007, 22859,  2003, 10232,  1012,   102,
        1999,  5038,  1997,  2122, 13136,  1010,  1048, 11020,  2038,
        2499, 29454, 29206, 14626,  2144,  2786,  2000, 16636,  1996,
       10908,  1997,  1996,  2110,  4041,  6349,  1998,  2000,  5323,
       15902, 13797,  2007, 22859,  6461,  2012,  6469,  2075,  1037,
        2047, 25353, 14905, 10735,  2483,  2090,  1996,  2976, 10802,
        1998, 15991,  1997,  3423,  2578,  4804,  1012,   102,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,  
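
The `input_ids` printed above begin with 101, contain 102 twice, and end in zeros. In the bert-base-uncased vocabulary these are the ids of `[CLS]`, `[SEP]`, and `[PAD]`, so the row is the sentence pair laid out as `[CLS] text_a [SEP] text_b [SEP]` and padded to `max_length`. The sketch below rebuilds that layout by hand to show how `token_type_ids` and `attention_mask` relate to it. `encode_pair` is a hypothetical helper, it skips truncation, and the real tokenizer does considerably more.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def encode_pair(ids_a, ids_b, max_length):
    """Lay out two already-tokenized sentences the way BERT expects a pair."""
    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    input_ids = first + second
    if len(input_ids) > max_length:
        raise ValueError("pair is longer than max_length (truncation not handled here)")
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        # segment 0 covers [CLS] text_a [SEP]; segment 1 covers text_b [SEP]; padding is 0
        "token_type_ids": [0] * len(first) + [1] * len(second) + [0] * pad,
        # 1 marks real tokens, 0 marks padding
        "attention_mask": [1] * len(input_ids) + [0] * pad,
    }


# Hypothesis ids and the first few premise ids copied from the output above.
hypothesis_ids = [15902, 13797, 2007, 22859, 2003, 10232, 1012]
premise_ids = [1999, 5038, 1997, 2122, 13136]
encoded = encode_pair(hypothesis_ids, premise_ids, max_length=16)
print(encoded["input_ids"])
print(encoded["token_type_ids"])
print(encoded["attention_mask"])
```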

# STEP 4. Creating a model, then training and testing it


Let's train using the pretrained BERT model.

In [23]:
num_classes = len(processor.get_labels())

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['acc'])

model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
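
The `classifier (Dense)` row in the summary above has 1538 parameters. For a dense layer on top of BERT-base's 768-dimensional pooled output, that count corresponds to 2 output units, whereas MNLI has 3 classes. The arithmetic can be checked as follows; the function name is illustrative.

```python
HIDDEN_SIZE = 768  # hidden size of BERT-base


def dense_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


print(dense_param_count(HIDDEN_SIZE, 2))  # 1538, matches the summary above
print(dense_param_count(HIDDEN_SIZE, 3))  # 2307, what a 3-class head would have
```

If a 3-way head is intended, passing `num_labels=3` to `from_pretrained` is the usual way to request one, though that variant was not run in this notebook.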


In [24]:
model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=validation_dataset)

Epoch 1/2


ValueError: ignored
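
The traceback is not shown, so the cause can only be guessed at. One likely candidate is that `fit` receives `train_dataset` and `validation_dataset`, which yield single unbatched examples of shape `(128,)`, rather than the `train_dataset_batch` and `validation_dataset_batch` objects built earlier. The plain-Python sketch below illustrates what batching adds, namely a leading batch dimension. `batch_examples` is a made-up stand-in for `tf.data.Dataset.batch`, not its implementation.

```python
def batch_examples(examples, batch_size):
    """Group (features, label) pairs into batches, stacking each feature by key."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield _stack(batch)
            batch = []
    if batch:  # final partial batch
        yield _stack(batch)


def _stack(batch):
    """Turn a list of (features, label) pairs into (dict of lists, list of labels)."""
    keys = batch[0][0].keys()
    features = {k: [feats[k] for feats, _ in batch] for k in keys}
    labels = [label for _, label in batch]
    return features, labels


# Five toy examples, each with a length-4 input_ids vector.
toy = [({"input_ids": [101, i, 102, 0]}, i % 3) for i in range(5)]
for features, labels in batch_examples(toy, batch_size=2):
    rows = features["input_ids"]
    # rows per batch, sequence length, and the labels in this batch
    print(len(rows), len(rows[0]), labels)
```

In the notebook, the corresponding change would be `model.fit(train_dataset_batch, ..., validation_data=validation_dataset_batch)`; whether that alone resolves the error was not verified here.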

In [54]:
result = model.evaluate(test_dataset)
print(result)
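
One thing to keep in mind when evaluating on `test_matched`: GLUE does not publish test labels, and as far as I know TFDS stores them as -1. With the processor above, `tfds_map` looks the label up with `self.get_labels()[int(example.label)]`, and Python's negative indexing turns -1 into the last entry instead of raising an error. The sketch below reproduces just that lookup; `map_label_strict` is a hypothetical stricter variant.

```python
LABELS = ["0", "1", "2"]  # same list as MNLIProcessor.get_labels()


def map_label(raw_label):
    """Same lookup as DataProcessor.tfds_map: index the label list by the integer label."""
    return LABELS[int(raw_label)]


def map_label_strict(raw_label):
    """Hypothetical variant that rejects labels outside the known range."""
    index = int(raw_label)
    if not 0 <= index < len(LABELS):
        raise ValueError(f"label {raw_label!r} is outside 0..{len(LABELS) - 1}")
    return LABELS[index]


print(map_label("1"))   # prints 1
print(map_label("-1"))  # prints 2: a hidden test label silently becomes class 2
```

If the test labels are indeed -1, the loss and accuracy returned by `model.evaluate(test_dataset)` would be computed against class 2 for every example and should not be read as real test performance; `validation_matched` is the split with usable labels.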

### Summary

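Using the HuggingFace framework, we worked through a natural language inference task on sentence pairs. The preprocessing and tokenization steps that had thrown countless errors throughout the Going Deeper projects were handled with just a line or two of function calls!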

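Because I did not have time to understand the code in depth, training did not run to completion, but I came away with an understanding of the overall structure and flow of the HuggingFace framework. When tasks need to be developed quickly, making good use of frameworks like HuggingFace is likely to be just as important as developing the model itself.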





### Reference
- [GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding](https://openreview.net/pdf?id=rJ4km2R5t7)