In this notebook we will use Hugging Face Transformers with TF 2.0 for a straightforward attempt at text classification. It is a fairly bare-bones notebook that shows only what is required to train the BERT model. Comments on how I can improve the information here, or on any issues you notice with my process, are much appreciated! :)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

First we need to install the transformers library from Hugging Face (https://huggingface.co) with the command below. Make sure internet access is turned on in the kernel if the notebook does not already have transformers installed.

In [None]:
!pip install transformers

For this model we are going to use the DistilBERT model and tokenizer because it is lighter on resources, but I have also included the imports for the standard BERT model. You can swap these at the beginning and the rest of the code will work.

In [None]:
from transformers import (
    TFBertForSequenceClassification,
    BertTokenizer,
    TFDistilBertForSequenceClassification,
    DistilBertTokenizer,
)
import tensorflow as tf
import pandas as pd

# DistilBERT is used by default; to swap in standard BERT, uncomment the two
# BERT lines below and comment out the DistilBERT ones.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

In [None]:
# get the data and read into pandas frames from the csv
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

Now that we have the data loaded, let's take a quick look at the training set. We can drop all columns besides the text and target columns for now.

In [None]:
train

Now let's create a new df using just the text and target columns. We will then drop any rows with missing values, just in case. From that I create a separate train_X Series holding the text and a NumPy array called train_y for the labels.

In [None]:
train_text_df = train[['text', 'target']]
train_text_df = train_text_df.dropna()
train_X = train_text_df['text']
train_y = train_text_df['target'].to_numpy()
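
To make the effect of this cell concrete, here is a minimal sketch on a made-up three-row frame. The `toy` data is hypothetical and only mirrors the `text` and `target` column names used above. It shows that `dropna` removes the row with the missing text and that `to_numpy` gives a plain array of labels.

```python
import pandas as pd

# Hypothetical toy frame that only mirrors the column names used above.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["Forest fire near La Ronge", None, "I love fruits"],
    "target": [1, 0, 0],
})

# Keep only the two columns we need, then drop rows with missing values.
toy_text_df = toy[["text", "target"]].dropna()

toy_X = toy_text_df["text"]               # pandas Series of raw strings
toy_y = toy_text_df["target"].to_numpy()  # NumPy array of integer labels

print(len(toy_X), toy_y.tolist())
```

The row whose text is missing disappears from both `toy_X` and `toy_y`, so the texts and labels stay aligned.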

Now that we have our training set set up with the labels we need, we can move on to tokenizing the input data. Since we are using the tokenizer from Hugging Face, we can use the batch_encode_plus function: pass it the text column and it will apply the tokenization to all the data. The only options we set are pad_to_max_length=True, so that every sequence is padded to the same length, and return_tensors="tf", so that we get TensorFlow tensors back. This returns the data in an object that TF can easily work with.

In [None]:
# Tokenize every text; a plain Python list is the safest input type for batch_encode_plus
train_x = tokenizer.batch_encode_plus(train_X.tolist(), pad_to_max_length=True, return_tensors="tf")

Let's look at the TF object now. As you can see, we get a dictionary back with "input_ids" and "attention_mask". For our purposes we will not use the attention mask, which marks real tokens versus padding, so we will only be using the input_ids.

In [None]:
train_x
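
To illustrate what the two keys in that dictionary mean, here is a small pure-Python sketch. It is not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a padding id of 0 is an assumption. It pads each sequence to the length of the longest one, while the real tokenizer may pad to a different length depending on its settings and version.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the length of the longest one.

    Returns a dict shaped like the tokenizer output: 'input_ids' holds the
    padded ids and 'attention_mask' marks real tokens (1) versus padding (0).
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sentence gets padding ids at the end, and its mask has zeros in exactly those positions.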

Now all we need to do is set up the model as you would any other TF Keras model. Just set up the optimizer, loss, and metric, then compile the model, and you can fit the pretrained model on the training set.

In [None]:
# Small learning rate for fine-tuning; clipnorm guards against exploding gradients
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
# The model outputs raw logits for two classes, so use from_logits=True with integer labels
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.fit(x=train_x['input_ids'], y=train_y, epochs=15, batch_size=128, verbose=1)
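
To show what a sparse categorical cross-entropy on logits computes, here is a small NumPy sketch. It is an illustration of the formula rather than the Keras implementation; the function name and the toy logits and labels are made up, and taking the mean over the batch is an assumption about the reduction.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy between raw logits and integer class labels."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # Numerically stable log-softmax: subtract the row-wise maximum first.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for every row.
    true_class_log_probs = log_probs[np.arange(len(labels)), labels]
    return -true_class_log_probs.mean()


# Made-up logits for three examples and their 0/1 labels.
toy_logits = [[2.0, -1.0], [0.0, 0.0], [-1.5, 1.5]]
toy_labels = [0, 1, 1]
loss_value = sparse_categorical_crossentropy_from_logits(toy_logits, toy_labels)
print(loss_value)
```

Confident, correct rows contribute a small loss, while the undecided middle row (equal logits) contributes log(2), so training pushes the logit of the true class up relative to the other one.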

In [None]:
test_x = test['text'].to_numpy()

Now that we have our test set ready, we need to send that input through the tokenizer, configured the same way as for our training set. Once it is tokenized we can send it to the model's predict function.

In [None]:
# Convert the NumPy array to a plain list of strings before tokenizing
test_x = tokenizer.batch_encode_plus(test_x.tolist(), pad_to_max_length=True, return_tensors="tf")

In [None]:
predictions = model.predict(test_x['input_ids'])

Since we are doing binary classification, the model returns two values (logits) per instance, one for each label, so we can just take the argmax of each result to get our final label output of either zero or one.

In [None]:
# predictions[0] holds the logits; pick the index of the larger logit in each row
predictions_label = np.argmax(predictions[0], axis=1)
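
As a quick illustration of the argmax step, here is a minimal sketch on made-up logits (the values in `toy_logits` are hypothetical, with one row per example and one column per class). It also checks that the row-by-row version and the vectorized `axis=1` version agree.

```python
import numpy as np

# Made-up logits for four examples: column 0 is class 0, column 1 is class 1.
toy_logits = np.array([
    [1.2, -0.7],
    [-0.3, 2.1],
    [0.4, 0.1],
    [-1.0, -0.2],
])

# argmax along axis 1 picks the column index with the larger logit per row.
toy_labels = np.argmax(toy_logits, axis=1)

# The same result, computed one row at a time.
toy_labels_loop = [int(np.argmax(row)) for row in toy_logits]

print(toy_labels.tolist(), toy_labels_loop)
```

Because argmax only compares the two logits in a row, there is no need to apply a softmax first to get the label.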

In [None]:
submission = pd.DataFrame({'id': test['id'], 'target': predictions_label})
submission['target'] = submission['target'].astype('int')
submission.to_csv('submission.csv', index=False)