The previous notebook covered data loading, EDA, and modeling, and ignored web_text when building and training the models. I used BERT as the encoder, followed by an MLP and an RNN respectively. The best model (BERT+RNN) scores around 90% on the validation data.

In this notebook, I also feed web_text to the model to see whether it helps boost accuracy.

Because web_text is very different from the first three columns, I encode the two inputs separately with BERT, then use an MLP or CNN to decode the information.

# Model 3 - BERT + MLP (web_text included)

Very similar to Model 1: get the embedded `[CLS]` token of text and web_text separately, and pass both to the MLP.

The only difference from Model 1 is that we need to flatten the stacked inputs at the start of the MLP.
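To make the flattening step concrete, here is a minimal sketch with random tensors standing in for the two BERT embeddings. The batch size of 4 is arbitrary and the tensors are not real embeddings; 768 is the hidden size of `bert-base-uncased`. It only illustrates the shapes involved, not the model itself.

```python
import torch
import torch.nn as nn

batch_size, emb_dim = 4, 768  # illustrative batch size; 768 = bert-base-uncased hidden size

# Random stand-ins for the two pooled BERT outputs (assumption: not real embeddings)
text_embedded = torch.randn(batch_size, emb_dim)
web_text_embedded = torch.randn(batch_size, emb_dim)

# Stack along a new dimension: one "channel" per input column
stacked = torch.stack((text_embedded, web_text_embedded), dim=1)

# nn.Flatten keeps the batch dimension and merges everything after it
flat = nn.Flatten()(stacked)

print(stacked.shape)  # torch.Size([4, 2, 768])
print(flat.shape)     # torch.Size([4, 1536])
```

This is why the first linear layer of the MLP takes `embedding_dim*2` input features.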

## 1) Train and Test Split

In [1]:
import numpy as np
import pandas as pd

In [2]:
raw = pd.read_csv('interview_case_v4.csv')

In [3]:
df = raw.copy()
df['intact_name'] = df['intact_name'].str.rstrip('.') # remove the period from the end of the string
df = df.fillna('')
df['text'] = df['intact_name'].astype(str)+'. ' +df['SIC8_DESCRIPTION'].astype(str)+'. ' +df['4_Square_Description'].astype(str)
df1 = df[['text','web_text','target_for_prediction']]
df1 = df1.rename(columns={'target_for_prediction':'label'})

In [4]:
df1

Unnamed: 0,text,web_text,label
0,218685 Ontario Inc o/a Swagat Banquet Hall. ba...,"WE'RE MAJESTIC, REGAL, STYLISH& EXPERTS IN ALL...",Restaurant
1,Restaurant Pushap Sucrerie. eating places. sna...,,Restaurant
2,Transport Galf Inc. .,,Trucking & Hauling Service
3,On The Go Courier. . specialized freight (exce...,,Trucking & Hauling Service
4,"1484726 Alberta Ltd. local trucking, without s...",,Trucking & Hauling Service
...,...,...,...
1558,Asdin Hospitality Ltd. o/a Best Western Plus F...,,Hotel Accomodation
1559,Casa Moda Fine Furnishing Inc. .,780-784-0638info@splendidfurnishings.caABOUT U...,Trucking & Hauling Service
1560,Jia De Trinh o/a Oakridge Dragon Restaurant Lt...,,Restaurant
1561,2000650 Ontario Inc. o/a Golden Bell Thai Rest...,Home Page Menu Lunch Specials Dinner Specials ...,Restaurant


In [5]:
from sklearn.model_selection import train_test_split

training_data, test_data = train_test_split(df1, test_size=0.1, random_state=25, stratify = df1.label)
train_data, valid_data = train_test_split(training_data, test_size=0.1, random_state=25, stratify = training_data.label)

In [6]:
print(f'training data size: {train_data.shape[0]}')
print(f'validation data size: {valid_data.shape[0]}')
print(f'testing data size: {test_data.shape[0]}')

training data size: 1265
validation data size: 141
testing data size: 157
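The three sizes above can be reproduced by hand. The sketch below assumes that scikit-learn rounds the test share up when `test_size` is a fraction and `train_size` is not given, which is my understanding of `train_test_split`; the helper `split_sizes` is hypothetical and only mirrors that arithmetic.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 1265 + 141 + 157                        # 1563 rows, summed from the printed sizes
n_training, n_test = split_sizes(n_total, 0.1)    # first split: hold out the test set
n_train, n_valid = split_sizes(n_training, 0.1)   # second split: hold out the validation set

print(n_train, n_valid, n_test)  # 1265 141 157
```

Because the second split is taken from the remaining 90%, the validation set is about 9% of the full data rather than 10%.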


In [7]:
# save three datasets, for later torchtext use
train_data.to_csv('./data2/train.csv',index=False)
valid_data.to_csv('./data2/valid.csv',index=False)
test_data.to_csv('./data2/test.csv',index=False)

## 2) Prepare Data & Iterator

In [8]:
import torch

import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [9]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [10]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

In [11]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']
print(max_input_length)

512


In [12]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2] # leave room for the two special tokens added at the beginning and end of the text
    return tokens
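The sketch below shows why the cut is `max_input_length-2`: once the start and end tokens are added, the longest sequence is exactly 512 ids. The helper `build_input` is hypothetical and only imitates what the torchtext `Field` does later with `init_token`, `eos_token`, and `pad_token`. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids of `bert-base-uncased`, hard-coded here instead of being read from the tokenizer. The other ids are made up.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # bert-base-uncased special-token ids
MAX_LEN = 512

def build_input(token_ids, pad_to=None):
    """Hypothetical helper: truncate, wrap in [CLS]/[SEP], then optionally pad."""
    body = token_ids[:MAX_LEN - 2]          # same cut as tokenize_and_cut
    ids = [CLS_ID] + body + [SEP_ID]
    if pad_to is not None:
        ids += [PAD_ID] * (pad_to - len(ids))
    return ids

long_ids = list(range(1000, 1600))          # 600 made-up token ids
print(len(build_input(long_ids)))           # 512
print(build_input([]))                      # [101, 102]
print(build_input([7592, 2088], pad_to=6))  # [101, 7592, 2088, 102, 0, 0]
```

The second call matters for this dataset: a row with an empty web_text still reaches BERT as a two-token sequence.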

In [13]:
from torchtext.legacy import data

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)

In [14]:
fields = [('text', TEXT), ('web_text', TEXT), ('label', LABEL)]

In [15]:
train_data, valid_data, test_data = data.TabularDataset.splits(
                                        path = 'data2',
                                        train = 'train.csv',
                                        validation = 'valid.csv',
                                        test = 'test.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

In [16]:
print(vars(train_data[0]))

{'text': [22431, 19961, 2620, 2620, 4561, 4297, 1012, 1004, 23968, 1005, 1055, 10733, 1012, 10733, 7884, 1012], 'web_text': [], 'label': 'Restaurant'}


In [17]:
LABEL.build_vocab(train_data)
print(LABEL.vocab.stoi)

defaultdict(None, {'Restaurant': 0, 'Trucking & Hauling Service': 1, 'Hotel Accomodation': 2})


In [18]:
BATCH_SIZE = 16 # small batch size, given the small dataset and limited computational resources

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort = False,
    batch_size = BATCH_SIZE, 
    device = device)

## 3) Build the Model

In [19]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [35]:
import torch.nn as nn

class BERTMLPSentiment(nn.Module):
    def __init__(self,
                 bert,
                 output_dim
                ):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.out = nn.Sequential(nn.Flatten(), # flatten the two stacked embeddings (text & web_text) into one vector per sample
                    nn.Linear(embedding_dim*2,128), # input dimension = 2 * hidden_size
                    nn.ReLU(),
                    nn.Linear(128, output_dim))
        
        
    def forward(self, text, web_text):
        
        #text = [batch size, sent len]
                
        with torch.no_grad():
            text_embedded = self.bert(text)[1]  # BERT is frozen; index 1 is the pooled output for the [CLS] token
        
        #text_embedded = [batch size, emb dim]
        
        with torch.no_grad():
            web_text_embedded = self.bert(web_text)[1]
                
        #web_text_embedded = [batch size, emb dim]
        
        embedded = torch.stack((text_embedded, web_text_embedded), dim=1) # stack the two embeddings along a new channel dimension
        
        #embedded = [batch size, 2, emb dim]
        
        output = self.out(embedded)
        
        #output = [batch size, out dim]
        
        return output

In [36]:
OUTPUT_DIM = 3

model3 = BERTMLPSentiment(bert,
                         OUTPUT_DIM)

In [38]:
# too many parameters to train, so I freeze the BERT parameters because of limited resources
for name, param in model3.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False

In [39]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model3):,} trainable parameters')

The model has 197,123 trainable parameters
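The count can be checked by hand from the two linear layers of the MLP, since those are the only layers left trainable. A minimal sketch, using 768 as the hidden size of `bert-base-uncased`:

```python
hidden_size = 768      # bert-base-uncased hidden size
mlp_hidden = 128       # width of the first linear layer
output_dim = 3         # number of classes

# Each linear layer has in_features * out_features weights plus out_features biases
first_linear = (2 * hidden_size) * mlp_hidden + mlp_hidden
second_linear = mlp_hidden * output_dim + output_dim

total = first_linear + second_linear
print(f'{total:,}')  # 197,123
```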


In [40]:
for name, param in model3.named_parameters():                
    if param.requires_grad:
        print(name)

out.1.weight
out.1.bias
out.3.weight
out.3.bias


## 4) Train the Model

In [41]:
import sklearn.utils.class_weight as class_weight

In [42]:
train_df = pd.read_csv('./data2/train.csv')
train_Y = train_df.label
train_Y = train_Y.apply(lambda x: 0 if x=='Restaurant' else 1 if x=='Trucking & Hauling Service' else 2) # according to the LABEL.vocab 

In [43]:
class_weights=class_weight.compute_class_weight('balanced',classes=np.unique(train_Y),y=train_Y.to_numpy()) # keyword arguments are required by recent scikit-learn versions
class_weights=torch.tensor(class_weights,dtype=torch.float)
class_weights

tensor([0.5541, 0.9945, 5.2708])
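For reference, the `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so rarer classes get larger weights. The sketch below applies that formula in plain Python. The per-class counts are an assumption: they are back-calculated from the printed weights and the training set size of 1265, not read from `train.csv`.

```python
def balanced_weights(counts):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * c) for c in counts]

# Assumed counts for (Restaurant, Trucking & Hauling Service, Hotel Accomodation),
# back-calculated from the printed weights rather than taken from the data.
counts = [761, 424, 80]

print([round(w, 4) for w in balanced_weights(counts)])  # [0.5541, 0.9945, 5.2708]
```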

In [44]:
import torch.optim as optim

optimizer = optim.Adam(model3.parameters())

criterion = nn.CrossEntropyLoss(weight=class_weights) # to deal with the imbalanced dataset

model3 = model3.to(device)
criterion = criterion.to(device)

In [45]:
def categorical_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
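A small worked example of this metric, using made-up logits and labels for four samples and three classes. The function is repeated so that the block runs on its own.

```python
import torch

def categorical_accuracy(preds, y):
    """Fraction of samples whose highest-scoring class equals the label."""
    top_pred = preds.argmax(1, keepdim=True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    return correct.float() / y.shape[0]

# Made-up logits: the predicted classes are 0, 1, 0, 2
preds = torch.tensor([[2.0, 0.1, 0.3],
                      [0.2, 1.5, 0.1],
                      [0.9, 0.2, 0.4],
                      [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 1, 2, 2])   # the third sample is misclassified

acc = categorical_accuracy(preds, labels)
print(acc.item())  # 0.75
```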

In [46]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text, batch.web_text)
        
        loss = criterion(predictions, batch.label.long())
        
        acc = categorical_accuracy(predictions, batch.label.long())
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [47]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text, batch.web_text)
            
            loss = criterion(predictions, batch.label.long())
            
            acc = categorical_accuracy(predictions, batch.label.long())

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [48]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [49]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model3, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model3, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model3.state_dict(), 'model3.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 24m 9s
	Train Loss: 1.125 | Train Acc: 30.86%
	 Val. Loss: 1.160 |  Val. Acc: 57.85%
Epoch: 02 | Epoch Time: 23m 52s
	Train Loss: 1.096 | Train Acc: 48.59%
	 Val. Loss: 1.072 |  Val. Acc: 57.32%
Epoch: 03 | Epoch Time: 22m 38s
	Train Loss: 1.068 | Train Acc: 48.98%
	 Val. Loss: 1.058 |  Val. Acc: 44.50%
Epoch: 04 | Epoch Time: 22m 51s
	Train Loss: 1.067 | Train Acc: 45.00%
	 Val. Loss: 1.051 |  Val. Acc: 50.75%
Epoch: 05 | Epoch Time: 23m 6s
	Train Loss: 1.068 | Train Acc: 50.39%
	 Val. Loss: 1.038 |  Val. Acc: 41.93%
