## Classifying livedoor News with Japanese BERT via Supervised Learning

References:
- https://nikkie-ftnext.hatenablog.com/entry/livedoor-news-with-tf-data-my-issues
- https://qiita.com/sugulu_Ogawa_ISID/items/697bd03499c1de9cf082

In [1]:
# !pip install transformers fugashi[unidic-lite] ipadic
!pip list | grep 'transformers\|fugashi\|ipadic\|tensorflow'

fugashi                  1.1.0
ipadic                   1.0.0
tensorflow               2.5.0
tensorflow-addons        0.13.0
tensorflow-datasets      4.3.0
tensorflow-estimator     2.5.0
tensorflow-gpu           2.5.0
tensorflow-hub           0.12.0
tensorflow-metadata      1.1.0
tensorflow-text          2.5.0
transformers             4.8.1


In [2]:
# A dependency of the preprocessing for BERT inputs
# !pip install -U tensorflow-text
# !pip install -U tensorflow_datasets

In [3]:
import os
import tensorflow as tf

# Hide all GPUs from TensorFlow so that this notebook runs on the CPU
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

if tf.test.gpu_device_name():
    print('GPU found')
else:
    print("No GPU found")

import collections
import pathlib
import re
import string


from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

No GPU found


In [4]:
import shutil

In [5]:
url = 'https://www.rondhuit.com/download/ldcc-20140209.tar.gz'

dataset_dir = tf.keras.utils.get_file('ldcc-20140209.tar.gz', url,
                                  untar=True, 
                                  cache_dir='.',
                                  cache_subdir='data/livedoor')
dataset_dir = pathlib.Path(dataset_dir).parent

In [6]:
list(dataset_dir.iterdir())

[PosixPath('data/livedoor/text'),
 PosixPath('data/livedoor/ldcc-20140209.tar.gz.tar.gz')]

In [7]:
# Check the files and directories in the folder
text_dir = dataset_dir/'text'
list(text_dir.iterdir())

[PosixPath('data/livedoor/text/sports-watch'),
 PosixPath('data/livedoor/text/README.txt'),
 PosixPath('data/livedoor/text/peachy'),
 PosixPath('data/livedoor/text/movie-enter'),
 PosixPath('data/livedoor/text/dokujo-tsushin'),
 PosixPath('data/livedoor/text/CHANGES.txt'),
 PosixPath('data/livedoor/text/livedoor-homme'),
 PosixPath('data/livedoor/text/topic-news'),
 PosixPath('data/livedoor/text/it-life-hack'),
 PosixPath('data/livedoor/text/kaden-channel'),
 PosixPath('data/livedoor/text/smax')]

In [8]:
# Keep only the category folders
categories = [name.stem for name in text_dir.iterdir() if name.is_dir()]
print("Number of categories:", len(categories))
print(categories)

Number of categories: 9
['sports-watch', 'peachy', 'movie-enter', 'dokujo-tsushin', 'livedoor-homme', 'topic-news', 'it-life-hack', 'kaden-channel', 'smax']


In [9]:
# Take a look at the contents of one file
file_name = text_dir/"movie-enter/movie-enter-6255260.txt"

with open(file_name, encoding='utf-8') as text_file:
    text = text_file.readlines()
    print("0：", text[0])  # URL
    print("1：", text[1])  # Timestamp
    print("2：", text[2])  # Title
    print("3：", text[3])  # Body

    # In this file the body does not continue past index 3, but in other files it may extend to index 4 and beyond


0： http://news.livedoor.com/article/detail/6255260/

1： 2012-02-07T09:00:00+0900

2： 新しいヴァンパイアが誕生！　ジョニデ主演『ダーク・シャドウ』の公開日が決定

3： 　こんなヴァンパイアは見たことがない！　ジョニー・デップとティム・バートン監督がタッグを組んだ映画『ダーク・シャドウズ（原題）』の邦題が『ダーク・シャドウ』に決定。日本公開日が5月19日に決まった。さらに、ジョニー・デップ演じるヴァンパイアの写真が公開された。
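
The file layout shown above (URL, timestamp, title, then body) can be captured in a small helper. This is a minimal sketch and not part of the original notebook: the function name `parse_article`, the dictionary keys, and the two placeholder body lines are all hypothetical, and it assumes every file has at least three header lines.

```python
def parse_article(lines):
    """Split the lines of one livedoor news file into its parts.

    Assumes line 0 is the URL, line 1 the timestamp, line 2 the title,
    and every remaining line belongs to the body.
    """
    stripped = [line.rstrip("\n") for line in lines]
    url, timestamp, title = stripped[:3]
    # The body may span several lines, so join the non-empty ones.
    body = " ".join(line for line in stripped[3:] if line.strip())
    return {"url": url, "timestamp": timestamp, "title": title, "body": body}


# Header lines copied from the output above; the body lines are placeholders.
sample_lines = [
    "http://news.livedoor.com/article/detail/6255260/\n",
    "2012-02-07T09:00:00+0900\n",
    "新しいヴァンパイアが誕生！　ジョニデ主演『ダーク・シャドウ』の公開日が決定\n",
    "body line 1\n",
    "\n",
    "body line 2\n",
]
article = parse_article(sample_lines)
print(article["timestamp"])
print(article["body"])
```

Joining every line after the title mirrors what the notebook does later with `skip(3)`, where only the body is kept as classifier input.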



In [10]:
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "cl-tohoku/bert-base-japanese-whole-word-masking"

In [11]:
def labeler(example, index):
    return example, tf.cast(index, tf.int32)

In [12]:
text_datasets = []
label_count = 0

# text_dir = os.path.join(os.getcwd(), "text")
for data_dir in text_dir.iterdir():
    if data_dir.is_dir():
        print(f"{label_count}: {data_dir.stem}")
        text_file_names = data_dir.glob('*.txt')

        text_tensors = []
        for text_file in text_file_names:
            lines_dataset = tf.data.TextLineDataset(text_file)
            # Each line becomes its own Tensor, so join all lines of the file into a single Tensor
            sentences = [
                line_tensor.numpy().decode("utf-8") for line_tensor in lines_dataset.skip(3)
            ]
            concatenated_sentences = " ".join(sentences)
            # Create one Tensor per file in the subdirectory and build a Dataset from them
            text_tensor = tf.convert_to_tensor(concatenated_sentences)
            text_tensors.append(text_tensor)
        text_dataset = tf.data.Dataset.from_tensor_slices(text_tensors)
        text_dataset = text_dataset.map(lambda ex: labeler(ex, label_count))
        text_datasets.append(text_dataset)
        label_count += 1

0: sports-watch
1: peachy
2: movie-enter
3: dokujo-tsushin
4: livedoor-homme
5: topic-news
6: it-life-hack
7: kaden-channel
8: smax


## Preparation 2: Convert livedoor News into a DataLoader for BERT

In [13]:
def my_tokenizer():

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def _tokenizer(text_tensor, label):
        text_str = text_tensor.numpy().decode("utf-8")
        tokenized_text = tokenizer(text_str, padding=True, truncation=True, return_tensors="tf")
        return tokenized_text.input_ids[0], label

    return _tokenizer

def tokenize_map_fn(tokenizer):
    
    def _tokenize_map_fn(text_tensor, label):
        encoded_text, label =  tf.py_function(
            tokenizer, inp=[text_tensor, label], Tout=(tf.int32, tf.int32)
        )
        encoded_text.set_shape([None])
        label.set_shape([])
        return encoded_text, label

    return _tokenize_map_fn

In [14]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(1234)

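# NOTE: shuffle() only mixes elements within its buffer. The categories are concatenated
# in order below, so a buffer smaller than the dataset leaves the split skewed by category.
BUFFER_SIZE = 2000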
BATCH_SIZE = 16
TAKE_SIZE = 150

In [15]:
all_labeled_data = text_datasets[0]
for labeled_data in text_datasets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_data)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, seed=RANDOM_SEED, reshuffle_each_iteration=False
)
all_labeled_data

<ShuffleDataset shapes: ((), ()), types: (tf.string, tf.int32)>

In [16]:
all_tokenized_data = all_labeled_data.map(tokenize_map_fn(my_tokenizer()))

In [17]:
output_shapes = tf.compat.v1.data.get_output_shapes(all_tokenized_data)
output_shapes

(TensorShape([None]), TensorShape([]))
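
The `[None]` shape means that each tokenized example has its own length, so examples must be padded to a common length within each batch. The following plain-Python sketch illustrates the idea behind `padded_batch` without TensorFlow. The function and the token IDs are made up for illustration, and padding with 0 is an assumption that matches the default padding value for integer tensors.

```python
def padded_batch(sequences, batch_size, pad_value=0):
    """Group variable-length integer sequences into batches and pad each
    batch to the length of its longest member."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_value] * (max_len - len(seq)) for seq in chunk])
    return batches


# Made-up token ID sequences of different lengths.
token_ids = [[2, 10, 3], [2, 11, 12, 13, 3], [2, 3], [2, 14, 15, 3]]
batches = padded_batch(token_ids, batch_size=2)
for batch in batches:
    print(batch)
```

Each batch is padded only to its own longest sequence, so batches can differ in width.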

In [18]:
test_data = all_tokenized_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, output_shapes)
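# Fix the shuffled order here; otherwise take()/skip() below could select different
# examples on each pass and let validation examples leak into training.
train_data = all_tokenized_data.skip(TAKE_SIZE).shuffle(
    BUFFER_SIZE, seed=RANDOM_SEED, reshuffle_each_iteration=False
)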

val_data = train_data.take(TAKE_SIZE)
val_data = val_data.padded_batch(BATCH_SIZE, output_shapes)
train_data = train_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE, seed=RANDOM_SEED)

train_data = train_data.padded_batch(BATCH_SIZE, output_shapes)
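
The split above combines `take` and `skip` on a shuffled dataset. The two parts are disjoint only if the dataset yields the same order every time it is iterated. The sketch below mimics that in plain Python; `shuffled_once` and `split` are hypothetical helpers, not TensorFlow APIs.

```python
import random


def shuffled_once(items, seed):
    """Return one fixed shuffled order, analogous to shuffling a dataset
    with reshuffle_each_iteration=False."""
    order = list(items)
    random.Random(seed).shuffle(order)
    return order


def split(order, take_size):
    """Mimic dataset.take(n) and dataset.skip(n) on a fixed order."""
    return order[:take_size], order[take_size:]


order = shuffled_once(range(20), seed=42)
held_out, remaining = split(order, take_size=5)

# With a fixed order the two parts never overlap and together cover everything.
overlap = set(held_out) & set(remaining)
print(len(held_out), len(remaining), len(overlap))
```

If the order changed between the `take` pass and the `skip` pass, the same example could land in both parts.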

In [19]:
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam

In [20]:
num_epochs = 3
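# The number of training steps is the number of samples in the dataset, divided by the batch size, then multiplied
# by the total number of epochs.
# NOTE: 222 is a hard-coded steps-per-epoch value; keep it in sync with the number of batches in train_data.
num_train_steps = 222 * num_epochs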
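
The arithmetic described in the comment can be written out as a small function. This is only a sketch: `count_train_steps` is a hypothetical helper, and the sample count of 1000 is an arbitrary illustrative value, not the size of the livedoor dataset.

```python
import math


def count_train_steps(num_samples, batch_size, num_epochs):
    """Total optimizer steps: batches per epoch (counting the final partial
    batch) multiplied by the number of epochs."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical sample count, chosen only to illustrate the calculation.
example_steps = count_train_steps(num_samples=1000, batch_size=16, num_epochs=3)
print(example_steps)
```

Rounding up reflects that batching without dropping the remainder still produces a final, smaller batch.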

In [21]:

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=9)
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5,
    end_learning_rate=0.,
    decay_steps=num_train_steps
    )
opt = Adam(learning_rate=lr_scheduler)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])
# model.compile(optimizer=opt, loss=loss, metrics=['accuracy', F1_metric()])
model.fit(
    train_data,
    validation_data=val_data,
    epochs=num_epochs
)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/3
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f089c7f90d0>
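
With its default `power` of 1.0, the `PolynomialDecay` schedule used above is a linear ramp from the initial learning rate down to the end learning rate over `decay_steps`. The sketch below reproduces that formula in plain Python to show how the learning rate evolves; the function `polynomial_decay` is written here for illustration and does not call Keras.

```python
def polynomial_decay(step, initial_lr, end_lr, decay_steps, power=1.0):
    """Learning rate at `step` for a non-cycling polynomial decay schedule."""
    # Past decay_steps the schedule stays at end_lr.
    step = min(step, decay_steps)
    fraction = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction ** power + end_lr


decay_steps = 222 * 3  # same value as num_train_steps in the notebook
for step in (0, decay_steps // 2, decay_steps):
    print(step, polynomial_decay(step, 5e-5, 0.0, decay_steps))
```

If `decay_steps` is smaller than the real number of training steps, the learning rate reaches the end value before training finishes, which is why the step count needs to match the dataset.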