<a href="https://colab.research.google.com/github/taishi-i/toiro/blob/develop/examples/05_svm_vs_bert_benchmarking_application_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Introduction

To use a GPU in Colab, go to Runtime > Change runtime type and select GPU as the hardware accelerator.
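It can help to confirm that the runtime actually sees a GPU before starting a long training run. The following is a minimal sketch, assuming the BERT classifier runs on PyTorch (an assumption, not something this notebook states); `gpu_available` is a hypothetical helper and not part of toiro.

```python
import importlib.util


def gpu_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    # Assumption: the deep learning backend is PyTorch.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return bool(torch.cuda.is_available())


print("GPU available:", gpu_available())
```

If this prints `False` after changing the runtime type, restarting the runtime is usually the first thing to try.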

In [None]:
pip install toiro[all_classifiers]

In [None]:
pip install mecab-python3==0.996.5

In [3]:
import warnings

from toiro import classifiers
from toiro import datadownloader

warnings.simplefilter('ignore')

# 2. Download a corpus

In [4]:
# Download the livedoor news corpus and load the train/dev/test splits as pandas DataFrames
corpus = 'livedoor_news_corpus'
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
train_df.head()

Downloading ldcc-20140209.tar.gz from https://www.rondhuit.com/download/ldcc-20140209.tar.gz: 31.6MB [00:02, 10.6MB/s]


|   | 0 | 1 |
|---|---|---|
| 0 | smax | 専用のPlayキーを持つ防水ミュージックプレイヤー スマートフォン「AQUOS PHONE ... |
| 1 | movie-enter | 『G.I.ジョー』が3D化のために公開延期が決定 |
| 2 | dokujo-tsushin | 25歳未満のデキ婚率50％!? デキ婚同僚に独女が言いたいこと |
| 3 | dokujo-tsushin | 正直な貯金額、人に言えますか？　Presented by ゆるっとcafe |
| 4 | smax | 世界初の下り150Mbps対応！イー・モバイルの新モバイルWi-Fiルーター「Pocket ... |
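Before training, it is often worth checking how the labels are distributed. The following is a minimal sketch using only the standard library; `label_counts` is a hypothetical helper, and the sample labels are the five copied from the rows shown above, not the full corpus.

```python
from collections import Counter


def label_counts(labels):
    """Return (label, count) pairs sorted from most to least frequent."""
    return Counter(labels).most_common()


# Labels copied from the five rows displayed above (illustrative only).
sample_labels = ["smax", "movie-enter", "dokujo-tsushin", "dokujo-tsushin", "smax"]

for label, count in label_counts(sample_labels):
    print(f"{label}: {count}")
```

With the real data, the same helper could be called as `label_counts(train_df[0])`, assuming the label column is named `0` as the output above suggests.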


# 3. Train an SVM text classifier

In [5]:
# Initialize an SVM model with the mecab-python3 tokenizer
model = classifiers.SVMClassificationModel(tokenizer='mecab-python3')

# Train the SVM model
model.fit(train_df, dev_df)

# Evaluate the trained SVM model on the test set
svm_result = model.eval(test_df)
svm_report = svm_result["classification_report"]
print(svm_report)

                precision    recall  f1-score   support

dokujo-tsushin       0.78      0.73      0.75        86
  it-life-hack       0.86      0.91      0.88        87
 kaden-channel       0.98      0.88      0.93        93
livedoor-homme       0.90      0.77      0.83        57
   movie-enter       0.89      0.81      0.85        86
        peachy       0.67      0.80      0.73        83
          smax       0.94      0.96      0.95        81
  sports-watch       0.85      0.87      0.86        84
    topic-news       0.79      0.84      0.81        80

      accuracy                           0.84       737
     macro avg       0.85      0.84      0.84       737
  weighted avg       0.85      0.84      0.85       737
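The `macro avg` and `weighted avg` rows summarize the per-class scores in two different ways: the macro average treats every class equally, while the weighted average weights each class by its support. The following sketch recomputes both for the f1-score column from the values printed above. Because the inputs are already rounded to two decimals, the results may differ from the report in the last digit.

```python
# (f1-score, support) per class, copied from the SVM report above.
svm_f1 = {
    "dokujo-tsushin": (0.75, 86),
    "it-life-hack": (0.88, 87),
    "kaden-channel": (0.93, 93),
    "livedoor-homme": (0.83, 57),
    "movie-enter": (0.85, 86),
    "peachy": (0.73, 83),
    "smax": (0.95, 81),
    "sports-watch": (0.86, 84),
    "topic-news": (0.81, 80),
}

total_support = sum(support for _, support in svm_f1.values())

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in svm_f1.values()) / len(svm_f1)

# Weighted average: mean over classes, weighted by support.
weighted_f1 = sum(f1 * support for f1, support in svm_f1.values()) / total_support

print(f"support={total_support} macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

The two averages are close here because the class sizes are fairly balanced; they would diverge more on a corpus with a strongly skewed label distribution.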



# 4. Train a BERT text classifier

In [6]:
# Initialize a BERT model
model = classifiers.BERTClassificationModel()

# Train the BERT model for 3 epochs, printing progress
model.fit(train_df, dev_df, epochs=3, verbose=True)

# Evaluate the trained BERT model on the test set
bert_result = model.eval(test_df)
bert_report = bert_result["classification_report"]
print(bert_report)

1/3 * Epoch (train): 100% 369/369 [04:26<00:00,  1.38it/s, accuracy01=0.917, accuracy03=0.917, accuracy05=1.000, loss=0.317]
1/3 * Epoch (valid): 100% 47/47 [00:12<00:00,  3.90it/s, accuracy01=1.000, accuracy03=1.000, accuracy05=1.000, loss=0.038]
[2020-08-20 13:20:49,198] 
1/3 * Epoch 1 (_base): lr=5.000e-05 | momentum=0.9000
1/3 * Epoch 1 (train): accuracy01=0.7615 | accuracy03=0.9337 | accuracy05=0.9729 | loss=0.7388
1/3 * Epoch 1 (valid): accuracy01=0.8182 | accuracy03=0.9552 | accuracy05=0.9919 | loss=0.5397
2/3 * Epoch (train): 100% 369/369 [04:27<00:00,  1.38it/s, accuracy01=0.750, accuracy03=1.000, accuracy05=1.000, loss=0.542]
2/3 * Epoch (valid): 100% 47/47 [00:12<00:00,  3.91it/s, accuracy01=1.000, accuracy03=1.000, accuracy05=1.000, loss=0.020]
[2020-08-20 13:25:29,185] 
2/3 * Epoch 2 (_base): lr=5.000e-05 | momentum=0.9000
2/3 * Epoch 2 (train): accuracy01=0.9198 | accuracy03=0.9871 | accuracy05=0.9969 | loss=0.2653
2/3 * Epoch 2 (valid): accuracy01=0.8657 | accuracy03=0.