# BERT Exploration Series

## Introduction

Twitter Disaster Analysis

**Version 03**

Kaggle Link: https://www.kaggle.com/c/nlp-getting-started/

#### Summary
- Fine-tunes BERT with [SimpleTransformers](https://github.com/ThilinaRajapakse/simpletransformers/)
- Based on the kernel [SimpleTransformers + Hyperparam Tuning + k-fold CV](https://www.kaggle.com/szelee/simpletransformers-hyperparam-tuning-k-fold-cv)

In [2]:
# check GPU
!nvidia-smi

Wed Feb 19 09:13:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   51C    P0    27W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

In [3]:
# !pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.19.9-py3-none-any.whl (103 kB)
Collecting tensorboardx
  Downloading tensorboardX-2.0-py2.py3-none-any.whl (195 kB)
Collecting seqeval
  Downloading seqeval-0.0.12.tar.gz (21 kB)
Collecting protobuf>=3.8.0
  Downloading protobuf-3.11.3-cp36-cp36m-manylinux1_x86_64.whl (1.3 MB)
Collecting Keras>=2.2.4
  Downloading Keras-2.3.1-py2.py3-none-any.whl (377 kB)
Collecting keras-preprocessing>=1.0.5
  Downloading Keras_Preprocessing-1.1.0-py2.py3-none-any.whl (41 kB)
Collecting keras-applications>=1.0.6
  Downloading Keras_Applications-1.0.8-py3-none-any.whl (50 kB)

In [5]:
import os, re, string
import random
from pathlib import Path

import numpy as np
import pandas as pd
import sklearn.metrics  # load the submodule explicitly; accuracy_score is used in the CV loop

import torch

from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [6]:
CURRENT_DIR = Path.cwd()
DATA_DIR = CURRENT_DIR.parent / 'data'

In [7]:
# read train set and test set
test = pd.read_csv(str(DATA_DIR / 'test.csv'))
train = pd.read_csv(str(DATA_DIR / 'train.csv'))

In [8]:
# fix seed
seed = 1337

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

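The seeding steps above can be wrapped in one helper so that every experiment in the series starts from the same state. This is a minimal sketch: `set_seed` is a hypothetical helper name, the NumPy and PyTorch parts are skipped when those packages are not installed, and `cudnn.benchmark = False` is an extra setting that the original cell does not use. Seeding alone may not make GPU training bit-for-bit reproducible.

```python
import random


def set_seed(seed: int) -> None:
    """Seed Python, and NumPy / PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not available; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # extra setting, not in the original cell
    except ImportError:
        pass  # PyTorch not available; nothing to seed


# Re-seeding replays the same random stream
set_seed(1337)
first = random.random()
set_seed(1337)
second = random.random()
print(first == second)
```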
In [15]:
# load BERT-large uncased; note that the CV and full-training runs below use bert-base-uncased
bert_uncased = ClassificationModel('bert', 'bert-large-uncased')

In [13]:
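# Override selected SimpleTransformers default arguments
custom_args = {'fp16': False, # not using mixed precision
               'train_batch_size': 4, # default is 8
               'gradient_accumulation_steps': 2, # effective batch size: 4 * 2 = 8
               'do_lower_case': True,
               'learning_rate': 1e-05, # using lower learning rate
               'overwrite_output_dir': True, # important for CV
               # rebuild features on every call: cache files are named by row count
               # (e.g. cached_train_bert_128_2_6090), so equal-sized folds would
               # otherwise reuse an earlier fold's features, as in the CV log below
               'reprocess_input_data': True,
               'num_train_epochs': 2} # default is 1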

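With `train_batch_size` 4 and `gradient_accumulation_steps` 2, gradients from two mini-batches are accumulated before each weight update, so the effective batch size stays at 8. The sketch below works out the resulting schedule for a 6090-row training fold (the size that appears in the cache file names of the CV log) and for the full 7613-row training set (6090 + 1523). `training_schedule` is an illustrative helper, not part of SimpleTransformers, and the update count assumes one optimizer step per complete group of accumulated batches.

```python
import math


def training_schedule(n_rows, train_batch_size, grad_accum_steps, epochs):
    """Rough schedule arithmetic for mini-batch training with gradient accumulation."""
    batches_per_epoch = math.ceil(n_rows / train_batch_size)
    # Assumption: an update happens only after a full group of accumulated batches
    updates_per_epoch = batches_per_epoch // grad_accum_steps
    return {
        "effective_batch_size": train_batch_size * grad_accum_steps,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


fold_schedule = training_schedule(6090, train_batch_size=4, grad_accum_steps=2, epochs=2)
full_schedule = training_schedule(7613, train_batch_size=4, grad_accum_steps=2, epochs=2)
print(fold_schedule)
print(full_schedule)
```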
In [18]:
# leave only input and target
train = train[['text', 'target']]

In [19]:
# 5-fold CV
n=5
kf = KFold(n_splits=n, random_state=seed, shuffle=True)
results = []

for train_index, val_index in kf.split(train):
    train_df = train.iloc[train_index]
    val_df = train.iloc[val_index]
    
    model = ClassificationModel('bert', 'bert-base-uncased', args=custom_args) 
    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(val_df, acc=sklearn.metrics.accuracy_score)
    print(result['acc'])
    results.append(result['acc'])

Converting to features started. Cache is not used.
Running loss: 0.656462
Running loss: 0.033884
Running loss: 0.049295
Training of bert model complete. Saved to outputs/.
Converting to features started. Cache is not used.
  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."
{'mcc': 0.7000105872710273, 'tp': 525, 'tn': 774, 'fp': 85, 'fn': 139, 'acc': 0.8529218647406435, 'eval_loss': 0.39173468073625217}
0.8529218647406435

Features loaded from cache at cache_dir/cached_train_bert_128_2_6090
Running loss: 0.071366
Running loss: 0.258684
Training of bert model complete. Saved to outputs/.
Features loaded from cache at cache_dir/cached_dev_bert_128_2_1523
{'mcc': 0.712348729533031, 'tp': 526, 'tn': 782, 'fp': 77, 'fn': 138, 'acc': 0.8588312541037426, 'eval_loss': 0.38333217197960895}
0.8588312541037426

Features loaded from cache at cache_dir/cached_train_bert_128_2_6090
Running loss: 0.070133
Running loss: 0.112608
Training of bert model complete. Saved to outputs/.
Features loaded from cache at cache_dir/cached_dev_bert_128_2_1523
{'mcc': 0.6958749475524727, 'tp': 527, 'tn': 769, 'fp': 90, 'fn': 137, 'acc': 0.8509520682862771, 'eval_loss': 0.40034599845048957}
0.8509520682862771

Converting to features started. Cache is not used.
Running loss: 0.126483
Running loss: 0.042845
Training of bert model complete. Saved to outputs/.
Converting to features started. Cache is not used.
{'mcc': 0.6671041942671911, 'tp': 486, 'tn': 790, 'fp': 88, 'fn': 158, 'acc': 0.8383705650459922, 'eval_loss': 0.4606090259801655}
0.8383705650459922

Features loaded from cache at cache_dir/cached_train_bert_128_2_6091
Running loss: 1.236424
Running loss: 0.037635
Training of bert model complete. Saved to outputs/.
Features loaded from cache at cache_dir/cached_dev_bert_128_2_1522
{'mcc': 0.6399002322194692, 'tp': 490, 'tn': 766, 'fp': 112, 'fn': 154, 'acc': 0.8252299605781866, 'eval_loss': 0.47062237184319194}
0.8252299605781866

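The log above reports "Features loaded from cache" for folds 2, 3 and 5, and the confusion matrices of folds 1 to 3 all contain 664 positives (tp + fn), while folds 4 and 5 both contain 644. A likely explanation, assuming the cache file name depends only on the row count (as `cached_train_bert_128_2_6090` suggests), is that folds of equal size reuse the features of an earlier fold, so they are not evaluated on distinct validation sets. The sketch below reproduces the fold sizes that scikit-learn's `KFold` documents for 7613 rows and groups the folds that would share a cache file. `fold_sizes` and `cache_name` are illustrative helpers, and the naming pattern is inferred from the log rather than taken from the library.

```python
from collections import defaultdict


def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes as documented for KFold:
    the first n_samples % n_splits folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


def cache_name(mode, n_rows, model="bert", max_len=128, n_labels=2):
    # Assumed naming pattern, inferred from the log output
    return f"cached_{mode}_{model}_{max_len}_{n_labels}_{n_rows}"


n_samples, n_splits = 7613, 5
val_sizes = fold_sizes(n_samples, n_splits)

# Group folds by the training cache file they would read or write
collisions = defaultdict(list)
for fold, n_val in enumerate(val_sizes, 1):
    collisions[cache_name("train", n_samples - n_val)].append(fold)

for name, folds in collisions.items():
    print(name, "-> folds", folds)
```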

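Each result dictionary carries the raw confusion-matrix counts, so further metrics can be derived without re-running the model. The sketch below recomputes accuracy and MCC for fold 1 from its `tp`, `tn`, `fp` and `fn` values and adds precision, recall and F1. `binary_metrics` is an illustrative helper, not a SimpleTransformers function.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Counts copied from the fold 1 result dictionary in the log
fold1 = binary_metrics(tp=525, tn=774, fp=85, fn=139)
for name, value in fold1.items():
    print(f"{name}: {value:.4f}")
```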
In [20]:
for i, result in enumerate(results, 1):
    print(f"Fold-{i}: {result}")
    
print(f"{n}-fold CV accuracy result: Mean: {np.mean(results)} Standard deviation:{np.std(results)}")

Fold-1: 0.8529218647406435
Fold-2: 0.8588312541037426
Fold-3: 0.8509520682862771
Fold-4: 0.8383705650459922
Fold-5: 0.8252299605781866
5-fold CV accuracy result: Mean: 0.8452611425509684 Standard deviation:0.012032867798872356

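`np.std` uses the population formula by default (`ddof=0`). With only five folds, the sample standard deviation (`ddof=1`) is noticeably larger and is often the more cautious figure to report. The sketch below recomputes both from the fold accuracies printed above, using only the standard library.

```python
import statistics

# Fold accuracies copied from the output above
fold_acc = [
    0.8529218647406435,
    0.8588312541037426,
    0.8509520682862771,
    0.8383705650459922,
    0.8252299605781866,
]

mean_acc = statistics.fmean(fold_acc)
population_std = statistics.pstdev(fold_acc)  # same formula as np.std default (ddof=0)
sample_std = statistics.stdev(fold_acc)       # same formula as np.std(..., ddof=1)

print(f"mean={mean_acc:.4f} population_std={population_std:.4f} sample_std={sample_std:.4f}")
```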

In [22]:
# full training
model = ClassificationModel('bert', 'bert-base-uncased', args=custom_args) 
model.train_model(train)

Converting to features started. Cache is not used.
Running loss: 0.595753
Running loss: 0.007752
Training of bert model complete. Saved to outputs/.


In [23]:
# pass a plain list of strings rather than a pandas Series
predictions, raw_outputs = model.predict(test['text'].tolist())

Converting to features started. Cache is not used.

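`model.predict` returns hard labels together with the raw model outputs. Assuming those raw outputs are per-class logits (worth checking against the installed SimpleTransformers version), a softmax turns them into class probabilities, which can be useful for threshold tuning or for averaging several models. The logits below are made-up values for illustration, not results from this notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits in the assumed layout: one row per tweet, one column per class
example_raw_outputs = [[1.2, -0.8], [-0.3, 0.9], [0.0, 0.0]]

probabilities = [softmax(row) for row in example_raw_outputs]
# Index of the largest probability; ties resolve to the lower class index
labels = [max(range(len(p)), key=p.__getitem__) for p in probabilities]

for probs, label in zip(probabilities, labels):
    print([round(p, 3) for p in probs], "->", label)
```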
In [25]:
# prepare submission
submission_df = pd.read_csv(str(DATA_DIR / 'sample_submission.csv'))
# replace with our result
submission_df["target"] = predictions
# output csv
submission_df.to_csv(str(DATA_DIR / 'submission-03.csv'), index=False)

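Before uploading, it can help to check that the predictions line up with the submission rows. The sketch below is a hypothetical `check_submission` helper that works on plain lists. The `id` column mentioned in the usage comment is an assumption about `sample_submission.csv`, which is not shown in this notebook.

```python
def check_submission(ids, predictions, allowed=(0, 1)):
    """Raise ValueError if ids and predictions do not form a valid submission."""
    if len(ids) != len(predictions):
        raise ValueError(f"{len(ids)} ids but {len(predictions)} predictions")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids")
    unexpected = sorted({p for p in predictions if p not in allowed})
    if unexpected:
        raise ValueError(f"unexpected labels: {unexpected}")
    return len(ids)


# Possible use with the objects above (column name 'id' is assumed):
# check_submission(submission_df["id"].tolist(), [int(p) for p in predictions])
n_rows = check_submission([0, 2, 3], [1, 0, 1])
print(n_rows)
```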
#### Final Score: 0.826