<a href="https://colab.research.google.com/github/shraddha-an/nlp/blob/main/bert_clf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1) Installation**

In [1]:
# Installations
!pip install transformers
!pip install --upgrade matplotlib
!pip install matplotlib_inline --quiet

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.3
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting matplotlib
  Downloading matplotlib-3.5.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)

In [1]:
# Importing libraries

# mount google drive
from google.colab import drive
drive.mount("/content/gdrive")

# Data Handling
import pandas as pd, numpy as np

# Visualization
import seaborn as sb, matplotlib.pyplot as plt

# NLP preprocess
from gensim.utils import simple_preprocess

# Hugging Face Transformers & torch
import transformers
import torch

sb.set_style('darkgrid')
%matplotlib inline
from IPython.display import Markdown, display

# save img in svg format
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('svg')

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# pretty print
from pprint import pprint

Mounted at /content/gdrive


In [12]:
# Setting the device
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

# **2) Data Preprocessing**
Dataset: [60k Stack Overflow Questions with Quality Rating](https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate) (Kaggle)

In [2]:
# Importing data
path = "/content/gdrive/MyDrive/experiments/stack_overflow_ques/"

dataset = pd.read_csv(path + 'train.csv')
ds = pd.read_csv(path + 'valid.csv')

In [3]:
dataset.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,2016-01-01 00:21:59,LQ_CLOSE
1,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,2016-01-01 02:03:20,HQ
2,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,2016-01-01 02:48:24,HQ
3,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,2016-01-01 03:30:17,HQ
4,34553755,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,<android><material-design><floating-action-but...,2016-01-01 05:21:48,HQ


In [4]:
# Keep only the Body and Y columns and rename them.
# Chaining .rename() on the selection avoids modifying a slice in place,
# which can trigger pandas' SettingWithCopyWarning.
columns = {'Body': 'questions', 'Y': 'category'}
dataset = dataset[['Body', 'Y']].rename(columns = columns)
ds = ds[['Body', 'Y']].rename(columns = columns)

# simple_preprocess lowercases the text, splits it into alphabetic tokens
# (dropping digits and punctuation), and discards tokens shorter than 2
# or longer than 15 characters.
dataset['questions'] = dataset['questions'].apply(lambda x: ' '.join(simple_preprocess(x)))
ds['questions'] = ds['questions'].apply(lambda x: ' '.join(simple_preprocess(x)))

ds.head(8)

Unnamed: 0,questions,category
0,am having different tables like select from sy...,LQ_EDIT
1,have two table m_master and tbl_appointment th...,LQ_EDIT
2,trying to extract us states from wiki url and ...,HQ
3,so new to wanna make an application that can e...,LQ_EDIT
4,basically have this array array array sub comp...,LQ_EDIT
5,am trying to make constructor for derived clas...,LQ_CLOSE
6,am using in my lesson and for solving program ...,LQ_EDIT
7,getting bit lost in ts re exports say create p...,HQ
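
The preview above shows that words such as "I" and all digits have disappeared. A rough, standard-library approximation of what `simple_preprocess` does may make that easier to see. The helper name `approx_simple_preprocess` and the regular expression are assumptions for illustration, not gensim's actual implementation; the length limits of 2 and 15 are gensim's documented defaults.

```python
import re

# Runs of word characters that are not digits; an approximation of the
# pattern gensim uses for tokenization (assumption, for illustration).
_ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, tokenize, and drop very short or very long tokens."""
    tokens = _ALPHABETIC.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

# Hypothetical question body with HTML tags, a contraction and a digit.
example = "<p>I'm using Java 8 Optionals!</p>"
print(" ".join(approx_simple_preprocess(example)))
```

Single-letter fragments such as the `p` of `<p>` and the `i` and `m` of "I'm" fall below the minimum length, which is why the cleaned questions start mid-sentence.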


In [5]:
# Splitting into feature & target datasets
X_train = dataset.iloc[:, 0]
X_test = ds.iloc[:, 0]

y_train = dataset.iloc[:, 1]
y_test = ds.iloc[:, 1]

# Label Encoding the category column
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
y_train = enc.fit_transform(y_train)
y_test = enc.transform(y_test)

y_test[:5]

array([2, 2, 0, 2, 2])
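
To read the encoded array, it helps to know that `LabelEncoder` assigns integers to the distinct labels in sorted order. The sketch below reproduces that mapping in plain Python; `fit_label_mapping` and the `train_labels` list are hypothetical, only the three class names come from the data.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical training labels; the class names come from the Y column.
train_labels = ["LQ_CLOSE", "HQ", "HQ", "LQ_EDIT"]
mapping = fit_label_mapping(train_labels)
print(mapping)

# The first five validation labels shown in the ds.head(8) preview.
valid_labels = ["LQ_EDIT", "LQ_EDIT", "HQ", "LQ_EDIT", "LQ_EDIT"]
encoded = [mapping[label] for label in valid_labels]
print(encoded)
```

In the notebook itself, `enc.classes_` holds the same sorted class names, so `enc.inverse_transform` can turn predictions back into labels.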

# **3) BERT Tokenizer Fast**

In [6]:
# Loading the Bert fast Tokenizer
from transformers import BertTokenizerFast

model_name = 'bert-base-uncased'
bert_tokenizer = BertTokenizerFast.from_pretrained(model_name)


In [7]:
# Setting max length of our sequences
max_len = 100

# Tokenizing the sentences in batches.
# padding = 'max_length' replaces the deprecated pad_to_max_length = True.
train_tokens = bert_tokenizer.batch_encode_plus(X_train.tolist(),
                                 max_length = max_len,
                                 padding = 'max_length',
                                 truncation = True,
                                 return_tensors = "pt")

test_tokens = bert_tokenizer.batch_encode_plus(X_test.tolist(),
                                 max_length = max_len,
                                 padding = 'max_length',
                                 truncation = True,
                                 return_tensors = "pt")
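
As a rough illustration of what padding and truncating to `max_len` produce, the sketch below builds the `input_ids` and `attention_mask` for one sequence by hand. The function `pad_or_truncate` is hypothetical and skips the WordPiece step entirely; it only assumes the special ids 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) that appear in this notebook's tensors.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids visible in the notebook's tensors

def pad_or_truncate(token_ids, max_len):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to max_len."""
    body = token_ids[: max_len - 2]           # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                     # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding

# A short hypothetical sequence gets padded; the mask marks the padding with 0.
ids, mask = pad_or_truncate([2525, 5220, 2007], max_len=8)
print(ids)
print(mask)

# A long hypothetical sequence gets cut so that [SEP] still closes it.
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_len=8)
print(long_ids)
```

Every row therefore has exactly `max_len` entries, which is what lets the tokenizer return a single rectangular tensor.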

In [8]:
# Extracting the input ids (tokenized vector), attention masks
# Converting the label np arrays to torch tensors
train_sequence = train_tokens['input_ids']
train_mask = train_tokens['attention_mask']
train_y = torch.tensor(y_train.tolist())

test_sequence = test_tokens['input_ids']
test_mask = test_tokens['attention_mask']
test_y = torch.tensor(y_test.tolist())

In [9]:
# Creating Data Loader objects that'll supply the model with batches of sampled training data
from torch.utils.data import DataLoader, TensorDataset, RandomSampler

batch_size = 16

# Creating training data loader
train_data = TensorDataset(train_sequence, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train = DataLoader(train_data, sampler = train_sampler, batch_size = batch_size)

In [10]:
# Looking at 1 example of the TensorDataset
train_data[0]

(tensor([  101,  2525,  5220,  2007, 15192,  8518,  2296,  3823,  2011,  2478,
          9262, 21183,  4014, 25309,  1998,  9262, 21183,  4014, 25309, 10230,
          2243,  2021, 11082,  2360,  2215,  2000,  6140,  7592,  2088,  2000,
          1996, 10122,  2296,  6721,  3823,  2013,  6854,  1999,  2978,  1997,
          5481,  1998,  2123,  2031,  2151,  3642,  2000,  2265,  2061,  2521,
          2151,  2393,  2052,  2022, 19804, 24108,  3064,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0

# **4) Model Architecture**

In [13]:
# Loading the Bert model for sequence classification
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(model_name, num_labels = 3,
                                      output_attentions = False, output_hidden_states = False)

model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [14]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())
print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
    
print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (

In [15]:
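# Setting Optimizer & Learning Rate parameters
# transformers.AdamW is deprecated; torch.optim.AdamW is the maintained equivalent.
# weight_decay = 0.0 matches the default of the transformers implementation.
from torch.optim import AdamW

optimizer = AdamW(params = model.parameters(), lr = 2e-5, eps = 1e-8, weight_decay = 0.0)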

# Epochs
epochs = 2

# Setting total no of training steps = no of batches * epochs
train_steps = len(train) * epochs

# Creating the learning rate scheduler
from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(optimizer, num_training_steps = train_steps, num_warmup_steps = 0)
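
With `num_warmup_steps = 0`, the linear schedule simply decays the learning rate from its base value to zero over the training steps. A minimal sketch of the multiplier is given below; `linear_schedule_factor` is a hypothetical name and `total_steps = 1000` is a placeholder, since the notebook derives its own value from `len(train) * epochs`.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 1000  # hypothetical; the notebook uses len(train) * epochs

for step in (0, 500, 1000):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

Because `scheduler.step()` is called once per batch, the decay happens per optimizer step rather than per epoch.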

# **5) Training**

In [17]:
import random

seed_val = 0
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Storing the average loss after each epoch so we can plot them.
loss_values = []


In [45]:
%%time
# Training
for epoch_i in range(0, epochs):

    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))

    # Reset the total loss for epoch.
    total_loss = 0

    # Put the model into training mode
    # https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch
    model.train()

    # For each batch of training data...
    for batch in train:
  
        # Extract the input ids, attention mask & labels, and push them to device
        input_ids = batch[0].to(device)
        input_mask = batch[1].to(device)
        labels = batch[2].to(device)

        # Clear out previously calculated gradients before performing a backward pass 
        # https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch
        model.zero_grad() 
      
        # Forward Pass
        outputs = model(input_ids, 
                        token_type_ids = None, 
                        attention_mask = input_mask, 
                        labels = labels)
        
        loss = outputs[0]

        # Calculate total loss for all batches in an epoch
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the exploding gradients problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()    
        
    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train)            
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)
  
    print("Average Training Loss: {0:.2f}".format(avg_train_loss))
        


Average Training Loss: 0.24

Average Training Loss: 0.24
CPU times: user 18min 55s, sys: 7min 55s, total: 26min 51s
Wall time: 27min 5s
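
The training loop clips the gradient norm to 1.0 before each optimizer step. The sketch below shows the idea on plain Python lists: compute one L2 norm across all gradients, then rescale them together if that norm exceeds the limit. The function `clip_by_global_norm` and the toy gradients are hypothetical; the real call operates in place on the model's parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: gradients already within the limit are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]   # toy gradients whose joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)
print(clipped)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped.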


In [46]:
# Plotting training loss over epochs
import plotly.express as px

f = pd.DataFrame(loss_values)
f.columns = ['Loss']

fig = px.line(f, x = f.index, y = f.Loss)
fig.update_layout(title = 'Evolution of Training loss',
                   xaxis_title = 'Epoch',
                   yaxis_title = 'Loss')
fig.show()

# **6) Evaluating on Test Set**

In [47]:
# Creating data loader for test set
from torch.utils.data import SequentialSampler

test_data = TensorDataset(test_sequence, test_mask, test_y)
test_sampler = SequentialSampler(test_data)
test = DataLoader(test_data, sampler = test_sampler, batch_size = batch_size)

In [48]:
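# Upper bound on the number of test examples: the last batch may be partial,
# so len(test_data) gives the exact count.
print(len(test) * batch_size)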

15008
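
The figure above is the number of batches times the batch size, so it can overstate the number of examples when the last batch is not full. A small sketch of the arithmetic follows, assuming the validation split has 15,000 rows; that count is an assumption, since the notebook only prints the product.

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches yielded when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

batch_size = 16
num_examples = 15000  # assumed size of the validation split, not shown in the notebook

batches = num_batches(num_examples, batch_size)
print(batches, batches * batch_size)
```

Under that assumption the loader yields 938 batches, the last one holding 8 examples instead of 16.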


In [49]:
%%time
# Prediction on test set
# Put the model in evaluation mode (disables dropout)
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in test:

  # Add batch to device
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch

  # Setting no_grad to avoid computing or storing gradients, saving memory and speeding up prediction
  with torch.no_grad():
      
      # Forward pass, calculate logits
      outputs = model(b_input_ids, 
                      token_type_ids =  None, 
                      attention_mask = b_input_mask)
  
  logits = outputs[0]

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

CPU times: user 1min 27s, sys: 73.8 ms, total: 1min 27s
Wall time: 1min 27s
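
The stored `predictions` are raw logits, one score per class, not probabilities. As a sketch of how they relate, the snippet below applies a softmax to one hypothetical logit vector and picks the predicted class; the helper names and the numbers are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(logits):
    """Index of the largest logit, which is also the most probable class."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [1.2, -0.3, 2.5]  # hypothetical scores for the three classes
probs = softmax(logits)
print(probs)
print(predict_class(logits))
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits directly gives the same class as taking the argmax of the probabilities.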


# **7) Metrics**

In [50]:
from sklearn.metrics import matthews_corrcoef
matthews_set = []

# For each input batch...
for i in range(len(true_labels)):
  
  # Get the index of the largest logit --> predicted class
  pred_labels_i = np.argmax(predictions[i], axis=1).flatten()
  
  # Calculate and store the coef for this batch.  
  matthews = matthews_corrcoef(true_labels[i], pred_labels_i)                
  matthews_set.append(matthews)

# Combine the predictions for each batch into a single array of class indices (0, 1 or 2).
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis = 1).flatten()

# Combine the correct labels for each batch into a single list.
flat_true_labels = [item for sublist in true_labels for item in sublist]

# Calculate the MCC
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)
print('MCC: %.3f' % mcc)

MCC: 0.772
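
For three classes, the Matthews correlation coefficient is the multiclass generalization computed from label counts rather than the binary TP/FP formula. A minimal standard-library sketch of that formula is shown below; `multiclass_mcc` is a hypothetical helper and the six-example label lists are toy data, not the notebook's predictions.

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient generalized to K classes."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    classes = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = n * n - sum(true_counts[k] ** 2 for k in classes)
    cov_pp = n * n - sum(pred_counts[k] ** 2 for k in classes)
    denom = math.sqrt(cov_tt * cov_pp)
    # Return 0.0 when either side has a single class and the ratio is undefined.
    return cov_tp / denom if denom else 0.0

y_true = [0, 0, 1, 1, 2, 2]   # toy labels
y_pred = [0, 0, 1, 2, 2, 2]   # one mistake: a class-1 example predicted as 2
print(round(multiclass_mcc(y_true, y_pred), 3))
```

A value of 1 means perfect agreement and 0 means no better than chance, which makes the score less sensitive to class imbalance than plain accuracy.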


In [51]:
# Accuracy Score
from sklearn.metrics import accuracy_score as acc

print(acc(flat_true_labels, flat_predictions))
#len(flat_predictions)


0.8478
