<a href="https://colab.research.google.com/github/spdrnl/bert_multilingual/blob/master/Book_review_NL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

In [10]:
!pip install -q transformers



# Check the GPU

In [2]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [3]:
! nvidia-smi

Tue Sep  8 11:11:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0    29W /  70W |    227MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Download the data

In [4]:
! wget https://github.com/benjaminvdb/110kDBRD/releases/download/v2.0/110kDBRD_v2.tgz
! tar -zxf 110kDBRD_v2.tgz 110kDBRD/train
! tar -zxf 110kDBRD_v2.tgz 110kDBRD/test
! ls 110kDBRD

--2020-09-08 13:04:20--  https://github.com/benjaminvdb/110kDBRD/releases/download/v2.0/110kDBRD_v2.tgz
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/168819565/a09c2700-96a1-11e9-9310-a218631917bf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200908%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200908T130420Z&X-Amz-Expires=300&X-Amz-Signature=54dbd116979f7a3814100448fc6b6bec4409ef09046bffff279650bb9e850396&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=168819565&response-content-disposition=attachment%3B%20filename%3D110kDBRD_v2.tgz&response-content-type=application%2Foctet-stream [following]
--2020-09-08 13:04:20--  https://github-production-release-asset-2e65be.s3.amazonaws.com/168819565/a09c2700-96a1-11e9-9310-a218631917bf?X-Amz-Algorithm=AWS4-HMAC-SHA256

# Read and split the data

In [13]:
import glob
import numpy as np

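def read_file(file_name):
  # State the encoding (assumed UTF-8) so the result does not depend on the platform default.
  with open(file_name, encoding='utf-8') as f:
    text = f.read()
  return text

def get_file_contents(base_dir, train_test, label):
  # Named `pattern` rather than `filter`, which would shadow the built-in.
  pattern = base_dir + '/' + train_test + '/' + label + '/*.txt'
  contents = [read_file(file_name) for file_name in glob.glob(pattern)]
  return contents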

def get_data(base_dir, train_test):
  txt_pos = get_file_contents(base_dir, train_test, 'pos')
  txt_neg = get_file_contents(base_dir, train_test, 'neg')
  txt = txt_pos + txt_neg
  n_pos, n_neg = len(txt_pos), len(txt_neg)
  labels = np.hstack([np.ones(n_pos), np.zeros(n_neg)])
  return txt, labels, n_pos, n_neg

base_dir = '110kDBRD'

data_txt, data_labels, n_pos, n_neg = get_data(base_dir, 'train')
test_txt, test_labels, n_t_pos, n_t_neg = get_data(base_dir, 'test')

print(f"The number of train samples is {len(data_labels)}, {n_pos}+/{n_neg}-")
print(f"The number of test samples is {len(test_labels)}, {n_t_pos}+/{n_t_neg}-")
print(f"Example text: {data_txt[0]}")

The number of train samples is 20028, 10014+/10014-
The number of test samples is 2224, 1112+/1112-
Example text: Siegfried Lenz was een geweldig schrijver. In een paar pennestreken wist hij een sfeer neer te zetten en een hele wereld op te roepen. Zo ook in dit boek dat over een ouder wordende duiker in het Duitsland van net na de Tweede Wereldoorlog gaat. Het is een novelle maar na het lezen heb ik het idee dat ik een roman van over de 500 pagina's gelezen heb. Dat is de kracht van Lenz. Een boek waar je niet echt vrolijker van wordt maar wel een aanrader.
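`glob.glob` returns files in arbitrary order, so the example text printed above can differ between runs or machines. The sketch below shows one way to make the reading order repeatable by sorting the paths. The function name `read_labelled_texts` and the tiny throwaway directory are made up for illustration; only the `pos`/`neg` folder layout is taken from the dataset.

```python
import tempfile
from pathlib import Path

def read_labelled_texts(base_dir, split):
    """Return (texts, labels), visiting the files in sorted order."""
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        # sorted() fixes the file order, which glob alone does not guarantee.
        for path in sorted(Path(base_dir, split, label_name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

# Throwaway directory that mimics the 110kDBRD layout (invented reviews).
with tempfile.TemporaryDirectory() as tmp:
    for label_name, reviews in (("pos", ["mooi boek", "aanrader"]), ("neg", ["saai"])):
        folder = Path(tmp, "train", label_name)
        folder.mkdir(parents=True)
        for i, review in enumerate(reviews):
            (folder / f"{i}.txt").write_text(review, encoding="utf-8")
    demo_texts, demo_labels = read_labelled_texts(tmp, "train")

print(demo_texts, demo_labels)
```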


In [18]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the training reviews for validation; the fixed seed makes the split repeatable.
train_txt, val_txt, train_labels, val_labels = train_test_split(
    data_txt, data_labels, test_size=0.2, shuffle=True, random_state=84)
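As a quick check on the split, the arithmetic can be worked out by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples and that the rest goes to training. That rounding rule is an assumption about scikit-learn's behaviour, but it agrees with the sample size of 16022 reported in the training log further down. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of the training and held-out parts for a fractional test_size."""
    n_val = math.ceil(test_size * n_samples)  # assumed: held-out part is rounded up
    return n_samples - n_val, n_val

n_train, n_val = split_sizes(20028, 0.2)
print(n_train, n_val)  # 16022 4006
```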

# Tokenization

In [15]:
from transformers import BertTokenizer

model_name = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name, do_lower_case=True)

In [None]:
max_len = 0
for txts in [train_txt, val_txt, test_txt]:
  for txt in txts:
    tokenized = tokenizer.tokenize(txt)
    max_len = max(max_len, len(tokenized))

print(f"The maximum length in tokens is {max_len}")

The maximum length in tokens is 5814
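The longest review (5814 word pieces) is far beyond the 512 positions that BERT-base models accept, so some reviews will be cut off. How many is worth knowing. The sketch below computes that share from a list of token counts; the counts in the example are invented, and `share_truncated` is a hypothetical helper.

```python
def share_truncated(token_counts, max_len=512, n_special=2):
    """Fraction of texts whose word pieces do not fit in max_len."""
    budget = max_len - n_special  # room left after [CLS] and [SEP]
    too_long = sum(1 for n in token_counts if n > budget)
    return too_long / len(token_counts)

example_counts = [120, 510, 511, 5814]  # invented token counts
print(share_truncated(example_counts))  # 0.5
```

On the real data the counts could be collected in the loop above with `len(tokenized)`.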


In [None]:
vocabulary = tokenizer.get_vocab()
print(list(vocabulary.keys())[:125])
print(list(vocabulary.keys())[1000:1010])

['[PAD]', '[unused1]', '[unused2]', '[unused3]', '[unused4]', '[unused5]', '[unused6]', '[unused7]', '[unused8]', '[unused9]', '[unused10]', '[unused11]', '[unused12]', '[unused13]', '[unused14]', '[unused15]', '[unused16]', '[unused17]', '[unused18]', '[unused19]', '[unused20]', '[unused21]', '[unused22]', '[unused23]', '[unused24]', '[unused25]', '[unused26]', '[unused27]', '[unused28]', '[unused29]', '[unused30]', '[unused31]', '[unused32]', '[unused33]', '[unused34]', '[unused35]', '[unused36]', '[unused37]', '[unused38]', '[unused39]', '[unused40]', '[unused41]', '[unused42]', '[unused43]', '[unused44]', '[unused45]', '[unused46]', '[unused47]', '[unused48]', '[unused49]', '[unused50]', '[unused51]', '[unused52]', '[unused53]', '[unused54]', '[unused55]', '[unused56]', '[unused57]', '[unused58]', '[unused59]', '[unused60]', '[unused61]', '[unused62]', '[unused63]', '[unused64]', '[unused65]', '[unused66]', '[unused67]', '[unused68]', '[unused69]', '[unused70]', '[unused71]', '[unu

In [None]:
tokenizer.get_vocab()['[CLS]']

101

In [None]:
tokenizer.get_vocab()['idee']

19556

In [None]:
tokenizer.get_vocab()['huis']

25847
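Words such as `idee` and `huis` have their own vocabulary entry, but most words do not and are split into word pieces. The sketch below illustrates the greedy longest-match-first idea behind WordPiece on a toy vocabulary. It is a simplified illustration, not the tokenizer's actual implementation, and the vocabulary is made up.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"huis", "boek", "##en", "##je"}  # invented vocabulary
print(wordpiece("huis", toy_vocab))    # ['huis']
print(wordpiece("boeken", toy_vocab))  # ['boek', '##en']
print(wordpiece("fiets", toy_vocab))   # ['[UNK]']
```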

# Encode the data to word pieces

In [16]:
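def encode_text(txt, max_len):
  # Truncate or pad every review to exactly max_len word pieces.
  return tokenizer.batch_encode_plus(txt,
                        add_special_tokens = True, 
                        max_length = max_len, 
                        padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
                        return_attention_mask = True, 
                        truncation = True)

# BERT accepts at most 512 positions, so longer reviews are cut off.
max_len = 512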
train_encoded = encode_text(train_txt, max_len)
val_encoded = encode_text(val_txt, max_len)
test_encoded = encode_text(test_txt, max_len)
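To make the effect of padding and truncation concrete, the sketch below reproduces the idea in plain Python for one review. It is an illustration only, not the tokenizer's code. The ids `101` for `[CLS]` and `0` for `[PAD]` follow from the vocabulary shown earlier; `102` for `[SEP]` is an assumption, and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_len long."""
    body = token_ids[: max_len - 2]  # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    n_pad = max_len - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

short_ids, short_mask = pad_or_truncate([19556, 25847], max_len=6)
print(short_ids)   # [101, 19556, 25847, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0]

long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_len=6)
print(long_ids)    # [101, 1000, 1001, 1002, 1003, 102]
```

The attention mask tells the model which positions hold real word pieces, so the padding does not influence the prediction.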



# Create datasets

In [17]:
def map_example_to_dict(features, labels):
  # Name the three inputs so the model does not depend on their order.
  input_ids, attention_masks, token_type_ids = features
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, labels

def to_dataset(encoded_txt, labels):
  return tf.data.Dataset.from_tensor_slices(((encoded_txt['input_ids'],
                                            encoded_txt['attention_mask'],
                                            encoded_txt['token_type_ids']),
                                            labels)).map(map_example_to_dict)

train_dataset = to_dataset(train_encoded, train_labels)
val_dataset = to_dataset(val_encoded, val_labels)
test_dataset = to_dataset(test_encoded, test_labels)

# Create model

In [20]:
from transformers import BertConfig, TFBertForSequenceClassification
import tensorflow as tf
from tensorflow import keras

def get_transfer_model(model_name, learning_rate):
  # Pass num_labels when loading so the classification head is built with two outputs;
  # assigning model.num_labels afterwards does not resize the head.
  model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)
  optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
  # The model returns raw logits, hence from_logits=True.
  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
  metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
  # Uncomment to freeze BERT and train only the classification head.
  #model.get_layer('bert').trainable = False
  model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
  return model

# def get_transfer_model(model_name, learning_rate):
#   bert_model = TFBertForSequenceClassification.from_pretrained(model_name)
#   bert = bert_model.get_layer('bert')
  
#   id_input_layer = keras.layers.Input(shape = (max_len,), dtype='int32')
#   attention_input_layer = keras.layers.Input(shape = (max_len,), dtype='int32')
#   token_type_input_layer = keras.layers.Input(shape = (max_len,), dtype='int32')
  
#   bert_layer = bert([id_input_layer, attention_input_layer, token_type_input_layer])[1]
#   output_layer = keras.layers.Dense(2)(bert_layer)  # logits, because the loss uses from_logits=True
#   model = keras.Model(inputs=[id_input_layer, attention_input_layer, token_type_input_layer], outputs=output_layer)

#   model.get_layer('bert').trainable = False

#   optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
#   loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#   metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
#   model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
#   return model

learning_rate = 1e-5
model = get_transfer_model(model_name, learning_rate)
model.summary()

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bert (TFBertMainLayer)       multiple                  167356416 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
=================================================================
Total params: 167,357,954
Trainable params: 167,357,954
Non-trainable params: 0
_________________________________________________________________
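The parameter counts in the summary can be checked by hand. The classifier is a dense layer from BERT-base's 768-dimensional pooled output to 2 classes, so it has one weight per input-output pair plus one bias per output. `dense_params` is a hypothetical helper written for this check.

```python
def dense_params(n_inputs, n_outputs):
    """Number of weights plus biases in a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

hidden_size = 768  # hidden size of BERT-base
classifier_params = dense_params(hidden_size, 2)
total_params = 167356416 + classifier_params  # BERT main layer + classifier

print(classifier_params)  # 1538
print(total_params)       # 167357954
```

Almost all of the capacity sits in the pre-trained BERT layer; the newly initialized head is tiny by comparison.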


# Train the model with transfer learning

In [22]:
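batch_size = 8
learning_rate = 1e-5
number_of_epochs = 200  # upper bound; early stopping ends training sooner
histories = []
results = []
# For a learning curve, train on increasing subsets instead:
# sample_sizes = [100, 250, 500, 1000, 2500, 5000, 10000, len(train_labels)]
sample_sizes = [len(train_labels)]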
for sample_size in sample_sizes:
  model = get_transfer_model(model_name, learning_rate)
  early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=2, restore_best_weights=True)
  history = model.fit(train_dataset.take(sample_size).shuffle(1000).batch(batch_size), 
                      epochs=number_of_epochs, 
                      validation_data=val_dataset.batch(batch_size),
                      callbacks = [early_stopping])
  result = model.evaluate(test_dataset.batch(batch_size))
  print(f"At sample size {sample_size} test evaluation is {result}")
  histories.append(history)
  results.append((sample_size, result))


Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
At sample size 16022 test evaluation is [0.2568327486515045, 0.8961330652236938]
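Training stopped after 4 of the 200 allowed epochs. With `patience=2`, one plausible reading is that the validation loss last improved in epoch 2 and the weights from that epoch were restored. The sketch below mimics the early-stopping rule in plain Python; it is a simplified illustration rather than the Keras implementation, and the loss values are invented because the log above does not show them.

```python
def early_stopping_trace(val_losses, patience=2, min_delta=0.0001):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:  # counts as an improvement
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:  # too many epochs without improvement
                return epoch, best_epoch
    return len(val_losses), best_epoch

hypothetical_losses = [0.30, 0.26, 0.27, 0.29, 0.35]  # invented values
stop_epoch, best_epoch = early_stopping_trace(hypothetical_losses)
print(stop_epoch, best_epoch)  # 4 2
```

The reported test evaluation lists the loss first and the accuracy second, so the model classifies about 89.6% of the 2224 test reviews correctly.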
