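## Notebook: Loading a Local BERT Model File for Classification and Regression
With Hugging Face you can load a model just by specifying its name, so there is rarely a need to load a BERT model saved locally.<br>
However, a newly released model may not be available through Hugging Face yet, in which case it has to be loaded from local files,<br>
so I am writing down the steps for that here.<br>
<br>
While I am at it, I also describe how to use the loaded model for classification and regression.<br>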
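
Before loading, it can help to confirm that the local folder actually contains the files the loaders look for. The following is a minimal sketch, not part of the original notebook: the helper `check_bert_folder` is hypothetical, and the file names (`config.json`, `vocab.txt`, and either `pytorch_model.bin` or `tf_model.h5`) are an assumption about a typical Hugging Face style BERT folder.

```python
from pathlib import Path

# Assumed file names for a Hugging Face style BERT folder (not taken from the notebook).
REQUIRED_FILES = ("config.json", "vocab.txt")
WEIGHT_FILES = ("pytorch_model.bin", "tf_model.h5")


def check_bert_folder(folder):
    """Return a list of problems found in a local BERT folder (empty if none)."""
    root = Path(folder)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    # At least one weight file is needed; which one depends on the framework.
    if not any((root / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing weights (" + " or ".join(WEIGHT_FILES) + ")")
    return problems


print(check_bert_folder("./uncased_L-12_H-768_A-12"))
```

An empty list means the assumed files are all present. If only `pytorch_model.bin` exists, the TensorFlow classes need `from_pt=True`, as used later in this notebook.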

References<br>
* https://www.tensorflow.org/official_models/fine_tuning_bert
* https://huggingface.co/transformers/_modules/transformers/models/bert/modeling_tf_bert.html#TFBertModel

In [1]:
from transformers import BertTokenizer, TFBertModel
import tensorflow as tf
import numpy as np

# Avoid OOM: let GPU memory grow on demand
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    for device in physical_devices:
        tf.config.experimental.set_memory_growth(device, True)
        print('{} memory growth: {}'.format(device, tf.config.experimental.get_memory_growth(device)))
else:
    print("Not enough GPU hardware devices available")

# Folder that holds the local BERT model
bert_folder = "./uncased_L-12_H-768_A-12"

# Load the tokenizer
tokenizer = BertTokenizer.from_pretrained(bert_folder)

# Add a Dense layer on top of the model for classification
class MyTFBertModelForClassification(TFBertModel):
    def __init__(self, config, *inputs, **kwargs):
        # The key name num_labels clashes with the original code, so use num_classes instead
        num_classes = kwargs.pop('num_classes')
        super().__init__(config, *inputs, **kwargs)
        self.drop  = tf.keras.layers.Dropout(0.1)
        if num_classes > 1:
            # Classification: outputs class probabilities (softmax)
            self.dense = tf.keras.layers.Dense(num_classes, activation="softmax")
        else:
            # Regression: a single linear output
            self.dense = tf.keras.layers.Dense(num_classes)

    # Apply dropout and the Dense head to the pooled output
    def call(self, inputs, **kwargs):
        outputs = super().call(inputs, **kwargs)
        # It works without the dropout layer, but keep it for generality.
        # Use .get() so that a call without the training argument does not raise KeyError.
        pooler_output = self.drop(outputs["pooler_output"], training=kwargs.get('training', False))
        pooler_output = self.dense(pooler_output)
        return pooler_output


PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') memory growth: True


In [2]:
# Load the data and put it into the required format
# This time we use Amazon review data
max_seq_length = 128
train_data_size = 1000
batch_size = 15

# Load a reasonable amount of data for now
import tensorflow_datasets as tfds
train_data, valid_data, test_data = tfds.load(
    name="amazon_us_reviews/Wireless_v1_00", 
    split=('train[:10%]', 'train[10%:11%]', 'train[11%:12%]'),
)

# Reshape the data and take only the fields and the number of examples we need
def cut_off_data(data, limit):
    iter_obj = iter(data)

    labels = []
    sentences = []
    while len(labels) < limit:
        example = next(iter_obj)
        labels.append(example["data"]["star_rating"].numpy())
        sentences.append(example["data"]["review_body"].numpy())

        cnt = len(labels)
        if cnt % 10000 == 0:
            print(cnt, end=" ")
    print(len(labels))

    return {"labels": labels, "sentences": sentences}

train_data_dict = cut_off_data(train_data, train_data_size)
# Use integer division so that the limit stays an int
valid_data_dict = cut_off_data(valid_data, train_data_size // 10)
test_data_dict = cut_off_data(test_data, train_data_size // 10)

# Tokenize a sentence with the tokenizer and append [SEP]
def encode_sentence(s, tokenizer):
    # tfds returns bytes; str(bytes) would produce "b'...'", so decode instead
    s = s.decode("utf-8") if isinstance(s, bytes) else str(s)
    # Leave room for [CLS] and [SEP] so that truncation never drops [SEP]
    tokens = tokenizer.tokenize(s)[:max_seq_length - 2]
    tokens.append('[SEP]')
    return tokenizer.convert_tokens_to_ids(tokens)

# Build input_ids, attention_mask and token_type_ids
def bert_encode(sentences, tokenizer):
    tokenized_sentences = tf.ragged.constant([
        encode_sentence(s, tokenizer)
        for s in sentences])

    # Prepend [CLS] to every sentence
    cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*tokenized_sentences.shape[0]
    input_word_ids = tf.concat([cls, tokenized_sentences], axis=-1)
    # to_tensor() pads with 0, so padded positions are masked out
    attention_mask = tf.ones_like(input_word_ids).to_tensor()
    type_cls = tf.zeros_like(cls)
    type_s1 = tf.zeros_like(tokenized_sentences)
    token_type_ids = tf.concat(
        [type_cls, type_s1], axis=-1).to_tensor()

    inputs = {
        'input_ids': input_word_ids.to_tensor(),
        'token_type_ids': token_type_ids,
        'attention_mask': attention_mask}

    return inputs

# Encode the input text and build the labels (star_rating 1-5 is shifted to 0-4)
train_dataset = bert_encode(train_data_dict["sentences"], tokenizer)
valid_dataset = bert_encode(valid_data_dict["sentences"], tokenizer)
test_dataset  = bert_encode(test_data_dict["sentences"],  tokenizer)
train_labels  = np.array(train_data_dict["labels"], dtype=np.int32) - 1
valid_labels  = np.array(valid_data_dict["labels"], dtype=np.int32) - 1
test_labels   = np.array(test_data_dict["labels"],  dtype=np.int32) - 1

train_dataset_batched = tf.data.Dataset.from_tensor_slices((train_dataset, train_labels)).shuffle(100).batch(batch_size).prefetch(1000)# .repeat(2)
valid_dataset_batched = tf.data.Dataset.from_tensor_slices((valid_dataset, valid_labels)).shuffle(100).batch(batch_size).prefetch(1000)# .repeat(2)
test_dataset_batched  = tf.data.Dataset.from_tensor_slices((test_dataset,  test_labels)).shuffle(100).batch(batch_size).prefetch(1000)# .repeat(2)


INFO:absl:Load dataset info from C:\Users\Win7\tensorflow_datasets\amazon_us_reviews\Wireless_v1_00\0.1.0
INFO:absl:Reusing dataset amazon_us_reviews (C:\Users\Win7\tensorflow_datasets\amazon_us_reviews\Wireless_v1_00\0.1.0)
INFO:absl:Constructing tf.data.Dataset for split ('train[:10%]', 'train[10%:11%]', 'train[11%:12%]'), from C:\Users\Win7\tensorflow_datasets\amazon_us_reviews\Wireless_v1_00\0.1.0


1000
100
100
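
To make the shape of the encoded inputs concrete, here is a small pure-Python sketch of the same idea as `bert_encode`: add `[CLS]` and `[SEP]`, pad to the longest row with 0, and mark real tokens with 1 in the attention mask. It is an illustration only. The helper `build_inputs` is hypothetical, the ids 101 and 102 for `[CLS]` and `[SEP]` are assumed (they depend on the vocabulary file), and the other token ids are arbitrary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids; the real ones come from vocab.txt


def build_inputs(token_id_lists, max_len):
    """Build padded BERT inputs from per-sentence token ids (without [CLS]/[SEP])."""
    # Truncate first, then add the special tokens, so [SEP] is never cut off.
    rows = [[CLS_ID] + ids[:max_len - 2] + [SEP_ID] for ids in token_id_lists]
    width = max(len(r) for r in rows)
    input_ids = [r + [PAD_ID] * (width - len(r)) for r in rows]
    attention_mask = [[1] * len(r) + [0] * (width - len(r)) for r in rows]
    # Single-sentence input: every position belongs to segment 0.
    token_type_ids = [[0] * width for _ in rows]
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids}


batch = build_inputs([[7592, 2088], [2307]], max_len=8)
print(batch["input_ids"])       # [[101, 7592, 2088, 102], [101, 2307, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The three lists share one shape, which is what lets them be passed together as a dict to the model.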


### Classification

In [3]:
# Load the model and train it
import datetime
# Load the model
model = MyTFBertModelForClassification.from_pretrained(bert_folder, from_pt=True, num_classes=5)
model.summary()

# Set up the optimizer, loss and metrics
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
# The Dense head already applies softmax, so the loss receives probabilities, not logits
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

# Settings for TensorBoard
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# Start training
hist = model.fit(train_dataset_batched, validation_data=valid_dataset_batched, epochs=3,
          callbacks=[tensorboard_callback])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model MyTFBertModelForClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing MyTFBertModelForClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MyTFBertModelForClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model My

Model: "my_tf_bert_model_for_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
dense (Dense)                multiple                  3845      
Total params: 109,486,085
Trainable params: 109,486,085
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


Epoch 2/3
Epoch 3/3
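
Because the classification head defined earlier ends in a softmax, the `from_logits` setting of the loss has to match what the model actually outputs. The sketch below uses plain Python with made-up scores, not the Keras implementation, to show what happens when probabilities are fed to a loss that expects logits: softmax is effectively applied twice and the loss value changes.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ce_from_logits(logits, label):
    """Cross-entropy when the inputs are raw scores (from_logits=True)."""
    return -math.log(softmax(logits)[label])


def ce_from_probs(probs, label):
    """Cross-entropy when the inputs are already probabilities (from_logits=False)."""
    return -math.log(probs[label])


logits = [4.0, 0.0, 0.0, 0.0, 0.0]  # made-up scores for 5 classes
probs = softmax(logits)

correct = ce_from_probs(probs, 0)    # probabilities treated as probabilities
mismatch = ce_from_logits(probs, 0)  # probabilities wrongly treated as logits
print(round(correct, 3), round(mismatch, 3))
```

In this example the mismatched loss is noticeably larger even though the prediction is confident and correct. Since softmax outputs are confined to [0, 1], a second softmax flattens them toward uniform, which can slow down training.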


### Regression

In [4]:
# amazon_us_reviews is really a classification dataset, but here we treat star_rating as a number and force it into a regression task
# For regression, load the model with num_classes set to 1
import datetime
model = MyTFBertModelForClassification.from_pretrained(bert_folder, from_pt=True, num_classes=1)
model.summary()

# Set up the optimizer and loss
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.MeanSquaredError()
# Define a function that computes accuracy
def my_accuracy_fn(y_true, y_pred):
    # y_true holds integers from 0 to 4, so clip the float y_pred to that range and round it
    y_true = tf.cast(tf.reshape(y_true, [-1]), tf.float32)
    y_pred = tf.reshape(y_pred, [-1])
    y_pred = tf.round(tf.clip_by_value(y_pred, 0.0, 4.0))
    val = tf.cast(tf.math.equal(y_true, y_pred), tf.float32)
    return tf.math.reduce_mean(val)

model.compile(optimizer=optimizer, loss=loss, metrics=[my_accuracy_fn])

# Settings for TensorBoard
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# Start training
hist = model.fit(train_dataset_batched, validation_data=valid_dataset_batched, epochs=3,
          callbacks=[tensorboard_callback])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model MyTFBertModelForClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing MyTFBertModelForClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MyTFBertModelForClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model My

Model: "my_tf_bert_model_for_classification_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_75 (Dropout)         multiple                  0         
_________________________________________________________________
dense_1 (Dense)              multiple                  769       
Total params: 109,483,009
Trainable params: 109,483,009
Non-trainable params: 0
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3
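
The accuracy used for the regression model can be pictured without TensorFlow: clip each prediction to the label range 0 to 4, round it to the nearest integer, and count how often it equals the true label. The following pure-Python sketch mirrors that logic with made-up numbers; `rounded_accuracy` is a hypothetical helper, not a function from the notebook.

```python
def rounded_accuracy(y_true, y_pred, low=0.0, high=4.0):
    """Fraction of predictions that hit the true class after clipping and rounding."""
    hits = 0
    for t, p in zip(y_true, y_pred):
        clipped = min(max(p, low), high)
        # Python's round() rounds halves to even, the same rule as tf.round.
        if round(clipped) == t:
            hits += 1
    return hits / len(y_true)


# -0.7 -> 0 (hit), 5.2 -> 4 (hit), 2.4 -> 2 (hit), 1.9 -> 2 (miss, true label is 3)
print(rounded_accuracy([0, 4, 2, 3], [-0.7, 5.2, 2.4, 1.9]))  # 0.75
```

Note that this metric only reports how often the rounded value lands on the right class. The model is still trained on mean squared error, so a prediction of 2.4 for label 3 counts as a miss here while contributing only a small loss.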


### Note: Hugging Face's TFBertForSequenceClassification makes classification and regression even easier

In [3]:
from transformers import TFBertForSequenceClassification
# classification 
cls_model = TFBertForSequenceClassification.from_pretrained(bert_folder, from_pt=True, num_labels=5)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
cls_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
hist = cls_model.fit(train_dataset_batched, validation_data=valid_dataset_batched, epochs=3)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3


In [3]:
from transformers import TFBertForSequenceClassification

# For regression, set num_labels to 1, just as num_classes was set to 1 earlier
reg_model = TFBertForSequenceClassification.from_pretrained(bert_folder, from_pt=True, num_labels=1)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.MeanSquaredError()
def my_accuracy_fn(y_true, y_pred):
    # Clip the float predictions to the label range 0 to 4, round, and compare with y_true
    y_true = tf.cast(tf.reshape(y_true, [-1]), tf.float32)
    y_pred = tf.reshape(y_pred, [-1])
    y_pred = tf.round(tf.clip_by_value(y_pred, 0.0, 4.0))
    val = tf.cast(tf.math.equal(y_true, y_pred), tf.float32)
    return tf.math.reduce_mean(val)
reg_model.compile(optimizer=optimizer, loss=loss, metrics=[my_accuracy_fn])

hist = reg_model.fit(train_dataset_batched, validation_data=valid_dataset_batched, epochs=3)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Epoch 2/3
Epoch 3/3
