# PsychNLP: Training the model

> In this notebook, we fine-tune a BERT model to classify the text data prepared in the previous notebooks. The code in this notebook will:

- Preprocess and tokenize our text data
- Turn labels into one-hot format
- Create a data input pipeline
- Inspect BERT's outputs (`last_hidden_state` and the pooled output)
- Build the model
- Define training steps to fine-tune our BERT

Let's get started!

# Setting up

In this section we will:

- Mount Google Drive
- Import necessary libraries
- Import our data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install -q transformers

In [3]:
import transformers
from transformers import BertModel, BertTokenizer
from torch.optim import AdamW
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from transformers.optimization import get_linear_schedule_with_warmup

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

Use a CUDA device if one is available, otherwise fall back to the CPU.

In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

Import the training and testing data, dropping the leftover index columns and any rows with NaN values.

In [5]:
train = pd.read_csv('/content/drive/MyDrive/ML/train.csv').drop(columns=['Unnamed: 0','Unnamed: 0.1']).dropna()
test = pd.read_csv('/content/drive/MyDrive/ML/test.csv').drop(columns=['Unnamed: 0','Unnamed: 0.1']).dropna()

Some sanity checks to make sure the data is correct

In [6]:
train.head()

| | text | class |
|---|---|---|
| 0 | i wasnt mad at her at all but it sent me into ... | neither |
| 1 | i opened up to my ex about everything every pa... | depression |
| 2 | ive been pursuing higher education for the bet... | neither |
| 3 | i dont see how trying again will change anythi... | depression |
| 4 | please note that all of my collections have be... | suicide |


In [7]:
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 321188 entries, 0 to 321365
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    321188 non-null  object
 1   class   321188 non-null  object
dtypes: object(2)
memory usage: 7.4+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80306 entries, 0 to 80341
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    80306 non-null  object
 1   class   80306 non-null  object
dtypes: object(2)
memory usage: 1.8+ MB


# Preprocessing dataset

In this section we will:

- Convert labels into one-hot encodings
- Define tokenizers to tokenize the text data

In [8]:
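def class_to_onehot(df):
  # One-hot vector for each class, in the order [neither, depression, suicide]
  onehot = {
    'neither': [1,0,0],
    'depression': [0,1,0],
    'suicide': [0,0,1],
  }
  # Indexing the dict raises a KeyError on an unexpected label instead of
  # silently reusing the value from the previous row
  df['class'] = [onehot[label] for label in df['class']]
  return df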


In [9]:
train = class_to_onehot(train)
test = class_to_onehot(test)

In [10]:
test

| | text | class |
|---|---|---|
| 0 | im still really annoyed about what has happened | [1, 0, 0] |
| 1 | they say they understand that i need my alone ... | [1, 0, 0] |
| 2 | the exhaustion i feel literally causes me to n... | [1, 0, 0] |
| 3 | why am i planning to kill myself this world is... | [0, 1, 0] |
| 4 | now months later im just tired but i cant sle... | [0, 1, 0] |
| ... | ... | ... |
| 80337 | i read in amazement from my small town in the ... | [1, 0, 0] |
| 80338 | note in the past i feel like im a little traum... | [0, 1, 0] |
| 80339 | i had a lot of interviews and sent hundreds of... | [0, 0, 1] |
| 80340 | ive been struggling for the past few years to ... | [0, 1, 0] |


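Each label is now a one-hot vector with a single 1 marking its class (`neither`, `depression`, `suicide`), which matches the three outputs of the classifier we build later.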
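
As a small illustration of this encoding, the sketch below maps a class name to its one-hot vector and decodes a vector back into a name by taking the position of its largest entry. `CLASS_NAMES`, `label_to_onehot` and `onehot_to_label` are hypothetical helpers introduced here, not part of the notebook; the class order is assumed to follow `class_to_onehot`.

```python
# Class order follows class_to_onehot: index 0 = neither, 1 = depression, 2 = suicide
CLASS_NAMES = ["neither", "depression", "suicide"]


def label_to_onehot(label):
    """Return a list with a single 1 at the position of `label`."""
    index = CLASS_NAMES.index(label)  # raises ValueError for an unknown label
    return [1 if i == index else 0 for i in range(len(CLASS_NAMES))]


def onehot_to_label(vector):
    """Return the class name at the position of the largest entry."""
    best = max(range(len(vector)), key=lambda i: vector[i])
    return CLASS_NAMES[best]


print(label_to_onehot("depression"))
print(onehot_to_label([0.49, 0.32, 0.19]))
```

The same decoding works on a row of predicted probabilities, since the largest entry marks the predicted class.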

In [11]:
model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)


The cell above downloads the `bert-base-cased` vocabulary and creates the matching `BertTokenizer`; the model weights themselves are downloaded later.

In [12]:
class tokenize_ds(Dataset):
  def __init__(self, text, labels, tokenizer, max_len):
    self.text = text
    self.labels = labels
    self.tokenizer = tokenizer
    self.max_len = max_len
  
  def __len__(self):
    return len(self.text)
  
  def __getitem__(self, item):
    text = str(self.text[item])
    label = self.labels[item]

    encoding = self.tokenizer.encode_plus(
      text,
      add_special_tokens=True,
      max_length=self.max_len,
      return_token_type_ids=False,
      padding='max_length',
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )

    return {
      'text': text,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'labels': torch.tensor(label, dtype=torch.long)
    }
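
To make the `padding='max_length'` and `truncation=True` arguments concrete, here is a rough pure-Python sketch of what they produce for a single sequence. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids in the usage lines are made up, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumed to match the `bert-base-cased` vocabulary.

```python
def pad_or_truncate(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to exactly max_len."""
    # Leave room for the two special tokens before truncating
    body = token_ids[: max_len - 2]
    input_ids = [cls_id] + body + [sep_id]
    # Real tokens get mask 1, padding positions get mask 0
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


# A short sequence is padded on the right
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=8)
print(short_ids)
print(short_mask)

# A long sequence is cut so that the result still has max_len entries
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_len=6)
print(long_ids)
print(long_mask)
```

Because every example ends up with the same length, the `DataLoader` can stack them into `[batch_size, max_len]` tensors without a custom collate function.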


Final sanity checks

In [13]:
train.shape

(321188, 2)

# Data input pipeline

In [14]:
def CreateDataLoader(df, tokenizer, max_len, batch_size):
  DS = tokenize_ds(
      text = df['text'].to_numpy(),
      labels = df['class'].to_numpy(),
      tokenizer = tokenizer,
      max_len = max_len
  )
  return DataLoader(DS, batch_size = batch_size, num_workers = 2)

In [15]:
test, val = train_test_split(test, train_size = 0.5, random_state=69420)

In [16]:
batch_size = 8
max_len = 128

Traindata = CreateDataLoader(train, tokenizer, max_len, batch_size)
Valdata = CreateDataLoader(val, tokenizer, max_len, batch_size)
Testdata = CreateDataLoader(test, tokenizer, max_len, batch_size)

In [18]:
# Fetch a single batch once instead of rebuilding the iterator for every print
batch = next(iter(Traindata))
print(batch['input_ids'].shape)
print(batch['attention_mask'].shape)
print(batch['labels'].shape)

torch.Size([8, 128])
torch.Size([8, 128])
torch.Size([8, 3])


In [19]:
next(iter(Traindata)).keys()

dict_keys(['text', 'input_ids', 'attention_mask', 'labels'])

# Constructing model

In [20]:
bert_model = BertModel.from_pretrained(model_name, return_dict=False)


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [21]:
train.keys()

Index(['text', 'class'], dtype='object')

In [24]:
encoding = tokenizer.encode_plus(
  str(train['text'][0]),
  max_length=128,
  add_special_tokens=True,
  return_token_type_ids=False,
  padding='max_length',
  truncation=True,
  return_attention_mask=True,
  return_tensors='pt', 
)


In [25]:
outputs = bert_model(
  input_ids=encoding['input_ids'], 
  attention_mask=encoding['attention_mask']
)

In [26]:
last_hidden_state = outputs[0]
pooled_output = outputs[1]

In [27]:
last_hidden_state.shape

torch.Size([1, 128, 768])

In [28]:
pooled_output.shape

torch.Size([1, 768])

In [29]:
bert_model.config.hidden_size

768

# Model Building

In [30]:
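class TextClassifier(nn.Module):
  def __init__(self, n_classes):
    super().__init__()
    self.bert = BertModel.from_pretrained(model_name, return_dict=False)
    self.drop = nn.Dropout(p=0.3)
    # Linear classification head on top of BERT's pooled [CLS] representation
    self.L1 = nn.Linear(self.bert.config.hidden_size, n_classes)
    # Note: nn.CrossEntropyLoss (used for training later) applies log-softmax
    # itself and expects raw logits, so with this layer softmax is applied
    # twice during training; returning the Linear output is the usual setup.
    self.out = nn.Softmax(dim=1)

  def forward(self, input_ids, attention_mask):
    # With return_dict=False, BERT returns (last_hidden_state, pooled_output)
    _, pooled_output = self.bert(
        input_ids = input_ids,
        attention_mask = attention_mask
    )
    output = self.drop(pooled_output)
    output = self.L1(output)
    output = self.out(output)  # class probabilities, each row sums to 1
    return output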
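
`nn.CrossEntropyLoss` applies log-softmax to its input, so it is normally fed raw logits. The sketch below, written in plain Python rather than PyTorch, illustrates what changes when the loss receives probabilities that have already been through a softmax, as happens with the `Softmax` layer in this model. The helper names and the example logits are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Cross-entropy on unnormalized scores: -log(softmax(scores)[target])."""
    return -math.log(softmax(scores)[target_index])


# A confident, correct prediction for class 0
logits = [10.0, 0.0, 0.0]

loss_on_logits = cross_entropy(logits, 0)
# Feeding probabilities applies softmax a second time inside the loss
loss_on_probs = cross_entropy(softmax(logits), 0)

print(f"loss on logits:        {loss_on_logits:.4f}")
print(f"loss on probabilities: {loss_on_probs:.4f}")
```

With three classes, the second softmax caps the probability assigned to the true class at e / (e + 2), roughly 0.576, so the loss cannot fall below about 0.551 however confident the model becomes. Returning the output of the linear layer directly, and applying softmax only when probabilities are needed for reporting, avoids this.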

In [31]:
model = TextClassifier(3)
# Freeze all BERT weights so that only the linear classification head is trained
model.bert.requires_grad_(False)
model = model.to(device)



In [32]:
data = next(iter(Traindata))
input_ids = data['input_ids'].to(device)
attention_mask = data['attention_mask'].to(device)

In [33]:
input_ids.shape

torch.Size([8, 128])

In [34]:
print(model(input_ids, attention_mask))

tensor([[0.4912, 0.3174, 0.1913],
        [0.5376, 0.2592, 0.2032],
        [0.7473, 0.1462, 0.1065],
        [0.5021, 0.3336, 0.1643],
        [0.4234, 0.2653, 0.3113],
        [0.4963, 0.2902, 0.2135],
        [0.4303, 0.3435, 0.2261],
        [0.5286, 0.2721, 0.1993]], device='cuda:0', grad_fn=<SoftmaxBackward0>)


# Training Model

In [35]:
epochs = 10

optim = AdamW(model.parameters(), lr = 1e-3)
steps = len(Traindata) * epochs

scheduler = get_linear_schedule_with_warmup(
    optim,
    num_warmup_steps=0,
    num_training_steps=steps
)

loss = nn.CrossEntropyLoss().to(device)
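
The step count passed to the scheduler can be checked by hand. The sketch below reproduces the arithmetic using the dataset size reported by `train.info()` and mimics the linear decay that `get_linear_schedule_with_warmup` applies when `num_warmup_steps=0`. `linear_lr` is a hypothetical helper written for illustration, not the library's code.

```python
import math

n_train = 321188   # rows in train, from train.info()
batch_size = 8
epochs = 10
base_lr = 1e-3

# The DataLoader keeps the last partial batch by default, so round up
batches_per_epoch = math.ceil(n_train / batch_size)
total_steps = batches_per_epoch * epochs


def linear_lr(step, total=total_steps, lr=base_lr):
    """Learning rate after `step` updates: no warmup, linear decay to 0."""
    return lr * max(0.0, (total - step) / total)


print(batches_per_epoch)  # 40149, matching the "out of 40149" progress lines
print(total_steps)        # 401490
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```

Since the schedule is tied to `epochs`, stopping training early leaves the learning rate well above zero, and training for more epochs than planned would run at a rate of 0.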

In [None]:
def train_epoch(
    model,
    data_loader,
    loss,
    optim,
    device,
    scheduler,
    n_examples
  ):
  model = model.train()

  losses = []
  correct_predictions = 0
  for p, d in enumerate(data_loader):
    # Progress report every 100 batches
    if p % 100 == 0:
      print(f'p = {p} out of {len(data_loader)}')
    input_ids = d['input_ids'].to(device)
    attention_mask = d['attention_mask'].to(device)
    # One-hot labels as floats: CrossEntropyLoss treats them as class probabilities
    labels = d['labels'].float().to(device)
    optim.zero_grad()
    outputs = model(
      input_ids = input_ids,
      attention_mask = attention_mask
      )
    preds = torch.argmax(outputs, dim = 1)
    loss_value = loss(outputs, labels)
    # A prediction is correct when its index matches the position of the 1 in the label
    correct_predictions += (preds == torch.argmax(labels, dim=1)).sum().item()
    losses.append(loss_value.item())

    loss_value.backward()
    # Clip the global gradient norm to 1.0 to guard against exploding gradients
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optim.step()
    scheduler.step()

  return correct_predictions / n_examples, np.mean(losses)
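
`nn.utils.clip_grad_norm_` rescales all gradients together when their combined L2 norm exceeds `max_norm`. A plain-Python sketch of that idea on a flat list of gradient values follows. `clip_by_global_norm` and the sample values are hypothetical, and the small `eps` term is an assumption about how division by zero is avoided.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale `grads` so that their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1 so small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Norm 5.0 is above the limit, so the gradient is shrunk but keeps its direction
clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Norm 0.5 is below the limit, so nothing changes
small, _ = clip_by_global_norm([0.3, 0.4])
print(small)
```

Clipping by the global norm preserves the direction of the update, which is why it is usually preferred over clamping each gradient value separately.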

In [None]:
def eval_model(
    model,
    data_loader,
    loss,
    device,
    n_examples
):
  model = model.eval()
  losses = []
  correct_predictions = 0

  # No gradients are needed for evaluation
  with torch.no_grad():
    for p, d in enumerate(data_loader):
      if p % 100 == 0:
        print(f'p = {p} out of {len(data_loader)}')
      input_ids = d['input_ids'].to(device)
      attention_mask = d['attention_mask'].to(device)
      labels = d['labels'].float().to(device)

      outputs = model(
        input_ids = input_ids,
        attention_mask = attention_mask
        )
      preds = torch.argmax(outputs, dim = 1)
      loss_value = loss(outputs, labels)

      correct_predictions += (preds == torch.argmax(labels, dim=1)).sum().item()
      losses.append(loss_value.item())
  return correct_predictions / n_examples, np.mean(losses)
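
Both functions measure accuracy the same way: a row counts as correct when the highest-scoring output sits at the position of the 1 in the one-hot label, and the running count is divided by the number of examples rather than the number of batches. A minimal plain-Python sketch of that bookkeeping, with made-up outputs and labels and hypothetical helper names:

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def batch_correct(outputs, onehot_labels):
    """Count rows whose highest-scoring class matches the one-hot label."""
    return sum(argmax(o) == argmax(l) for o, l in zip(outputs, onehot_labels))


# Hypothetical model outputs and labels for a batch of four examples
outputs = [
    [0.49, 0.32, 0.19],
    [0.20, 0.70, 0.10],
    [0.42, 0.27, 0.31],
    [0.10, 0.30, 0.60],
]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]

correct = batch_correct(outputs, labels)
accuracy = correct / len(labels)
print(correct, accuracy)
```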

In [None]:
def training_step(epochs):
  history = defaultdict(list)
  best_acc = 0
  for epoch in range(epochs):
    print(f'Epoch {epoch + 1} / {epochs}')

    train_acc, train_loss = train_epoch(
        model,
        Traindata,
        loss,
        optim,
        device,
        scheduler,
        len(train)
    )
    print(f'Training loss = {train_loss}, accuracy = {train_acc}')

    val_acc, val_loss = eval_model(
        model,
        Valdata,
        loss,
        device,
        len(val)
    )
    print(f'Validation loss = {val_loss}, accuracy = {val_acc}')

    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)

    # Keep the weights from the epoch with the best validation accuracy
    if val_acc > best_acc:
      torch.save(model.state_dict(), 'best_model_state.bin')
      best_acc = val_acc

  # Return the per-epoch metrics so they can be inspected or plotted afterwards
  return history

# Actual model training

In [None]:
%%time
history = training_step(epochs)

Epoch 1 / 10
p = 0 out of 40149
p = 100 out of 40149
p = 200 out of 40149
p = 300 out of 40149
p = 400 out of 40149
p = 500 out of 40149
p = 600 out of 40149
p = 700 out of 40149
p = 800 out of 40149
p = 900 out of 40149
p = 1000 out of 40149
p = 1100 out of 40149
p = 1200 out of 40149
p = 1300 out of 40149
p = 1400 out of 40149
p = 1500 out of 40149
p = 1600 out of 40149
p = 1700 out of 40149
p = 1800 out of 40149
p = 1900 out of 40149
p = 2000 out of 40149
p = 2100 out of 40149
p = 2200 out of 40149
p = 2300 out of 40149
p = 2400 out of 40149
p = 2500 out of 40149
p = 2600 out of 40149
p = 2700 out of 40149
p = 2800 out of 40149
p = 2900 out of 40149
p = 3000 out of 40149
p = 3100 out of 40149
p = 3200 out of 40149
p = 3300 out of 40149
p = 3400 out of 40149
p = 3500 out of 40149
p = 3600 out of 40149
p = 3700 out of 40149
p = 3800 out of 40149
p = 3900 out of 40149
p = 4000 out of 40149
p = 4100 out of 40149
p = 4200 ou