# BERT Fine-Tuning Tutorial with PyTorch for Text Classification on the Corpus of Linguistic Acceptability (CoLA) Dataset

This notebook works through the following [tutorial](https://medium.com/@aniruddha.choudhury94/part-2-bert-fine-tuning-tutorial-with-pytorch-for-text-classification-on-the-corpus-of-linguistic-18057ce330e1) to gain familiarity with fine-tuning models using Hugging Face Transformers and PyTorch.

## Set-up

In [48]:
import os
import wget

import pandas as pd
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

In [2]:
os.chdir('../..')

## Download CoLA dataset

In [3]:
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'
if not os.path.exists('data/external/cola_public_1.1.zip'):
    wget.download(url, 'data/external/cola_public_1.1.zip')

In [4]:
if not os.path.exists('data/external/cola_public/'):
    !unzip data/external/cola_public_1.1.zip -d data/external
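
The `!unzip` line depends on a shell command being available on the machine. A portable alternative can be sketched with the standard-library `zipfile` module. The helper name `extract_if_missing` is hypothetical and not part of the original tutorial.

```python
import os
import zipfile


def extract_if_missing(zip_path, target_dir, marker_dir):
    """Extract zip_path into target_dir unless marker_dir already exists.

    Returns True if the archive was extracted, False if it was skipped.
    """
    if os.path.exists(marker_dir):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return True
```

In this notebook the call would look like `extract_if_missing('data/external/cola_public_1.1.zip', 'data/external', 'data/external/cola_public/')`, assuming the archive unpacks into a `cola_public/` folder as the existence check above implies.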

## Load data

In [5]:
df = pd.read_csv(
    'data/external/cola_public/raw/in_domain_train.tsv', 
    delimiter='\t', 
    header=None, 
    names=['sentence_source', 'label', 'label_notes', 'sentence']
)
print(df.shape)
df.head()

(8551, 4)


| | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 0 | gj04 | 1 | NaN | Our friends won't buy this analysis, let alone... |
| 1 | gj04 | 1 | NaN | One more pseudo generalization and I'm giving up. |
| 2 | gj04 | 1 | NaN | One more pseudo generalization or I'm giving up. |
| 3 | gj04 | 1 | NaN | The more we study verbs, the crazier they get. |
| 4 | gj04 | 1 | NaN | Day by day the facts are getting murkier. |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8551 entries, 0 to 8550
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   sentence_source  8551 non-null   object
 1   label            8551 non-null   int64 
 2   label_notes      2527 non-null   object
 3   sentence         8551 non-null   object
dtypes: int64(1), object(3)
memory usage: 267.3+ KB


In [7]:
# Store sentences and labels as NumPy arrays
sentences = df['sentence'].values
labels = df['label'].values

## Tokenization and input formatting

In [8]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [9]:
print(sentences[0])
tokenizer.encode(sentences[0])

Our friends won't buy this analysis, let alone the next one we propose.


[101,
 2256,
 2814,
 2180,
 1005,
 1056,
 4965,
 2023,
 4106,
 1010,
 2292,
 2894,
 1996,
 2279,
 2028,
 2057,
 16599,
 1012,
 102]

### Sentences to IDs

In [10]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
# For every sentence...
for sent in sentences:
    # `encode` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    encoded_sent = tokenizer.encode(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        # This function also supports truncation and conversion
                        # to pytorch tensors, but we need to do padding, so we
                        # can't use these features :( .
                        #max_length = 128,          # Truncate all sentences.
                        #return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.
    input_ids.append(encoded_sent)
# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

Original:  Our friends won't buy this analysis, let alone the next one we propose.
Token IDs: [101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102]
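
The four steps that `encode` performs can be illustrated without downloading the tokenizer. The sketch below is a toy stand-in, not the real WordPiece algorithm: it splits on words and punctuation with a regular expression, and its vocabulary is an assumption built by pairing the IDs printed above with the sentence's tokens in order of appearance. `TOY_VOCAB` and `toy_encode` are hypothetical names.

```python
import re

# Toy vocabulary: IDs copied from the output above and paired with tokens in
# order of appearance (an assumption, not a lookup in the real BERT vocab).
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "our": 2256, "friends": 2814, "won": 2180, "'": 1005, "t": 1056,
    "buy": 4965, "this": 2023, "analysis": 4106, ",": 1010, "let": 2292,
    "alone": 2894, "the": 1996, "next": 2279, "one": 2028, "we": 2057,
    "propose": 16599, ".": 1012,
}


def toy_encode(sentence, vocab=TOY_VOCAB):
    # (1) Lowercase, then split into words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    # (2) and (3) Wrap the tokens in the special start and end markers.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # (4) Map each token to its ID (raises KeyError for unknown tokens).
    return [vocab[token] for token in tokens]


ids = toy_encode(
    "Our friends won't buy this analysis, let alone the next one we propose."
)
print(ids)
```

For this one sentence the toy encoder reproduces the 19 IDs shown above. The real tokenizer additionally breaks rare words into sub-word pieces, which this sketch does not attempt.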


### Padding and truncating

In [11]:
np.max([len(sen) for sen in input_ids])

47

In [12]:
MAX_LEN = 64

In [14]:
input_ids[0]

[101,
 2256,
 2814,
 2180,
 1005,
 1056,
 4965,
 2023,
 4106,
 1010,
 2292,
 2894,
 1996,
 2279,
 2028,
 2057,
 16599,
 1012,
 102]

In [15]:
# MAX_LEN (set above) is the maximum sequence length. 64 is a somewhat arbitrary
# choice: it leaves headroom above the longest training sentence, which is 47 tokens.
print('\nPadding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))

# Pad the token ID lists with 0, the ID of the [PAD] token.
# "post" means that padding and truncation both happen at the end of the
# sequence rather than at the beginning.
input_ids = pad_sequences(
    input_ids,
    maxlen=MAX_LEN,
    dtype="long",
    value=0,
    truncating="post",
    padding="post"
)


Padding token: "[PAD]", ID: 0


In [16]:
input_ids[0]

array([  101,  2256,  2814,  2180,  1005,  1056,  4965,  2023,  4106,
        1010,  2292,  2894,  1996,  2279,  2028,  2057, 16599,  1012,
         102,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0])
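
What "post" padding and truncation do to a single sequence can be sketched in plain Python. This is a minimal illustration of the behaviour for one list, not the Keras implementation, and `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut ids to max_len from the end, then right-pad with pad_id."""
    clipped = list(ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


# A short sequence is padded on the right.
short = pad_or_truncate([101, 2256, 102], max_len=5)
# A long sequence loses its tail, including the final [SEP] (102).
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)
print(short, long)
```

Because the longest sentence here is 47 tokens and `MAX_LEN` is 64, no sequence in this dataset is actually truncated, so every `[SEP]` token survives.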

### Attention masks

In [28]:
input_ids.shape

(8551, 64)

In [30]:
# 0 if token ID is 0 (padding) else 1
attention_masks = [(input_ids[sent, :] > 0).astype(int) for sent in range(input_ids.shape[0])]

In [40]:
attention_masks = np.vstack(attention_masks)
attention_masks.shape

(8551, 64)
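
The mask rule is easiest to see on a single padded row. The following plain-Python sketch assumes, as in the cells above, that the padding ID is 0 and that no real token uses that ID. `attention_mask` is a hypothetical helper name.

```python
def attention_mask(padded_ids, pad_id=0):
    """Return 1 for every real token and 0 for every padding position."""
    return [int(token_id != pad_id) for token_id in padded_ids]


row = [101, 2256, 102, 0, 0]
mask = attention_mask(row)
print(mask)
```

On the full NumPy array the same result should be obtainable in one vectorized step with `(input_ids > 0).astype(int)`, which avoids both the Python loop and the later `np.vstack`.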

### Training and validation split

In [41]:
input_ids.shape

(8551, 64)

In [42]:
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    input_ids, 
    labels, 
    random_state=2018, 
    test_size=0.1
)

In [43]:
train_masks, validation_masks, _, _ = train_test_split(
    attention_masks, 
    labels, 
    random_state=2018, 
    test_size=0.1
)
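
The two `train_test_split` calls only stay aligned because they share the same `random_state` and `test_size`, so both apply the same shuffle. The idea can be sketched in plain Python with one shared index permutation. This is an analogy rather than scikit-learn's implementation, and `split_together` is a hypothetical helper name.

```python
import math
import random


def split_together(arrays, test_size, seed):
    """Split several equal-length sequences using one shared shuffle."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all inputs must have the same length")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    result = []
    for a in arrays:
        result.append([a[i] for i in train_idx])
        result.append([a[i] for i in test_idx])
    return result


inputs = ["ids_%d" % i for i in range(10)]
masks = ["mask_%d" % i for i in range(10)]
train_in, val_in, train_mask, val_mask = split_together(
    [inputs, masks], test_size=0.2, seed=2018
)
print(train_in[:3], train_mask[:3])
```

`train_test_split` itself accepts more than two arrays, so passing `input_ids`, `attention_masks`, and `labels` in a single call is a way to get the same guarantee without relying on matching seeds.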

### Converting to PyTorch data types

In [46]:
train_inputs.shape, validation_inputs.shape, train_labels.shape, train_masks.shape

((7695, 64), (856, 64), (7695,), (7695, 64))

In [47]:
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)

train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

In [49]:
# For fine-tuning BERT on a specific task, the BERT authors recommend a batch
# size of 16 or 32.

batch_size = 32

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
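
With a batch size of 32, the split sizes printed earlier (7695 training and 856 validation examples) determine how many batches each loader yields per epoch. The arithmetic is sketched below, assuming the final partial batch is kept, which is the `DataLoader` default (`drop_last=False`). `batch_sizes` is a hypothetical helper name.

```python
def batch_sizes(n_examples, batch_size):
    """List the size of every batch when the last partial batch is kept."""
    full, remainder = divmod(n_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


train_batches = batch_sizes(7695, 32)
val_batches = batch_sizes(856, 32)
print(len(train_batches), train_batches[-1])
print(len(val_batches), val_batches[-1])
```

This gives 241 training batches (the last holds 15 examples) and 27 validation batches (the last holds 24), so any per-batch averaging later in training should account for the smaller final batch.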

## Train model

In [50]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

In [51]:
# Load BertForSequenceClassification, the pretrained BERT model with a single linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2,  # The number of output labels--2 for binary classification. You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
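
The "single linear classification layer on top" maps one sentence-level vector to `num_labels` scores. The sketch below shows that step in plain Python with made-up numbers: a toy 3-dimensional vector stands in for BERT-base's 768-dimensional output, and the weights and bias are arbitrary values, not trained parameters.

```python
import math


def linear(hidden, weights, bias):
    """Compute one score per label: weights @ hidden + bias."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]


def softmax(logits):
    """Turn scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


pooled = [0.5, -1.0, 2.0]                        # toy sentence vector
weights = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.5]]   # one row per label
bias = [0.0, 0.1]

logits = linear(pooled, weights, bias)
probs = softmax(logits)
predicted = probs.index(max(probs))
print(logits, probs, predicted)
```

This new layer has no pretrained counterpart, so its weights start out untrained and are learned during fine-tuning. That is consistent with the warning above about weights that were not initialized from the checkpoint.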

In [52]:
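# Run the model on the GPU when one is available and fall back to the CPU otherwise.
# Calling model.cuda() unconditionally fails on a CPU-only PyTorch build with
# "AssertionError: Torch not compiled with CUDA enabled".
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)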