# Address Element Extraction

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install transformers
!pip install datasets
!pip install seqeval


In [3]:
import re
import string
import pickle

import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForTokenClassification
from tensorflow.keras.preprocessing.sequence import pad_sequences
from transformers import BertForTokenClassification, BertTokenizerFast
from datasets import load_metric
from transformers import EarlyStoppingCallback

# Approach

The approach uses two models:
1. POI/street extraction model
  - This model is used to extract the POI and street from the address
2. Abbreviation detection model
  - This model is used to detect the abbreviations in the address
  - A dictionary mapping of the abbreviation and the original word will be created from the training data
  - This dictionary mapping will then be used to expand the abbreviations detected by the model into their original form.
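
As a rough illustration of how the second model's output could be used, the sketch below expands flagged tokens with a lookup table. The function name `expand_abbreviations`, its arguments and the sample address are hypothetical and only meant to show the idea; they are not the notebook's own code.

```python
def expand_abbreviations(tokens, flagged, mapping):
    """Replace flagged tokens with their expansion when one is known.

    tokens  : list of words from an address
    flagged : list of 0/1 flags, 1 meaning "predicted to be an abbreviation"
    mapping : dict from abbreviation to expanded word
    """
    return [
        mapping.get(tok, tok) if flag == 1 else tok
        for tok, flag in zip(tokens, flagged)
    ]

# Hypothetical inputs for illustration only
tokens = ["sd", "neg", "1"]
flagged = [0, 1, 0]
mapping = {"neg": "negeri"}

expanded = expand_abbreviations(tokens, flagged, mapping)
print(" ".join(expanded))
```

Flagged tokens that are missing from the mapping are left unchanged, so an unknown abbreviation never removes text from the address.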

In [4]:
train_path = '/content/drive/MyDrive/shopee_code_league/train.csv'
test_path = '/content/drive/MyDrive/shopee_code_league/test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

In [5]:
train_df.head()

Unnamed: 0,id,raw_address,POI/street
0,0,jl kapuk timur delta sili iii lippo cika 11 a ...,/jl kapuk timur delta sili iii lippo cika
1,1,"aye, jati sampurna",/
2,2,setu siung 119 rt 5 1 13880 cipayung,/siung
3,3,"toko dita, kertosono",toko dita/
4,4,jl. orde baru,/jl. orde baru


In [6]:
train_df.shape

(300000, 3)

# Model 1: POI/Street extraction model

## Define pretrained tokenizer and model

We start from a pretrained BERT model that was trained on an Indonesian-language corpus. From the Hugging Face model hub, we selected:
- [Indobert](https://huggingface.co/indobenchmark/indobert-base-p1)

In [7]:
model_name = 'indobenchmark/indobert-base-p1'
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)
model = BertForTokenClassification.from_pretrained(model_name, num_labels=3)
labels_tag = {'O': 0, 'P': 1, 'S': 2}


Some weights of BertForTokenClassification were not initialized from the model checkpoint at indobenchmark/indobert-base-p1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Pre-processing

**Pre-processing steps:**
1. Split POI and street into individual columns
2. Split raw address into individual words/punctuation marks and create P/O/S labels
3. Apply pretrained tokenizer to get subwords and map original word labels to subwords
  - The tokenizer splits some words into subwords, so the tokenized sequence can be longer than the original list of words
  - Each subword therefore inherits the label of the word it came from
4. Convert list of dicts to dict of lists
  - The data needs to be in dictionary form to work with the torch dataset class created below
5. Perform padding on X and y
  - Pad every instance to the same length
  - Use -100 for the padding of labels
    - Labels of -100 are ignored when the loss is computed
  - Use 0 for the padding of inputs
7. Train val split
  - 80-20 split
8. Create torch dataset
  - As we will be using Huggingface Trainer to train, we will need to create a torch dataset class 
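
The label-to-subword mapping in step 3 can be sketched without loading a tokenizer. The `word_ids` list below is hand-written and stands in for what `tokenized_inputs.word_ids()` returns; the helper name `align_labels` is an assumption made for this example.

```python
IGNORE_INDEX = -100  # label value used for special tokens and padding

def align_labels(word_labels, word_ids):
    """Give every subword the label of the word it came from."""
    aligned = []
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] have no source word
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
    return aligned

# One label per original word (0 = O, 2 = S)
word_labels = [0, 2, 0]
# Hand-written example: the second word is split into two subwords
word_ids = [None, 0, 1, 1, 2, None]

aligned = align_labels(word_labels, word_ids)
print(aligned)  # [-100, 0, 2, 2, 0, -100]
```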

In [8]:
# 1. Split poi and street into individual columns
train_df[['POI','street']] = train_df['POI/street'].str.split("/",expand=True)
train_df.head()

Unnamed: 0,id,raw_address,POI/street,POI,street
0,0,jl kapuk timur delta sili iii lippo cika 11 a ...,/jl kapuk timur delta sili iii lippo cika,,jl kapuk timur delta sili iii lippo cika
1,1,"aye, jati sampurna",/,,
2,2,setu siung 119 rt 5 1 13880 cipayung,/siung,,siung
3,3,"toko dita, kertosono",toko dita/,toko dita,
4,4,jl. orde baru,/jl. orde baru,,jl. orde baru


In [9]:
# 2. Split raw address into individual word/punctuation and create P/O/S labels
raw_addresses = []
labels = []
for index in train_df.index:
    # Find individual tokens including punctuation
    token = re.findall(r"[\w]+|[^\s\w]", train_df.loc[index,'raw_address'])
    raw_addresses.append(token) #append full set of tokens for each address
    
    token_tag = []
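    for t in token:
        # Note: `in` on a string is a substring test, so a token is tagged
        # whenever it occurs anywhere inside the POI (or street) text
        if (t in train_df.loc[index,'POI']):
            token_tag.append('P')
        elif (t in train_df.loc[index,'street']):
            token_tag.append('S')
        else:
            token_tag.append('O')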
                
    labels.append(token_tag) #append full set of label for each row

In [10]:
# Preview raw addresses and labels for single instance
print('raw_addresses', raw_addresses[2])
print('labels', labels[2])

raw_addresses ['setu', 'siung', '119', 'rt', '5', '1', '13880', 'cipayung']
labels ['O', 'S', 'O', 'O', 'O', 'O', 'O', 'O']
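
The labelling loop above checks `t in <string>`, which is a substring test, so a short token such as a single digit is tagged whenever it appears inside a longer word of the POI or street. One possible alternative, which is not what this notebook does, is to compare against the set of POI/street tokens instead. The helper name `label_tokens` is hypothetical.

```python
import re

TOKEN_PATTERN = r"[\w]+|[^\s\w]"  # same pattern as used in the notebook

def label_tokens(raw_address, poi, street):
    """Tag each token as P, S or O using whole-token membership."""
    tokens = re.findall(TOKEN_PATTERN, raw_address)
    poi_tokens = set(re.findall(TOKEN_PATTERN, poi))
    street_tokens = set(re.findall(TOKEN_PATTERN, street))

    tags = []
    for tok in tokens:
        if tok in poi_tokens:
            tags.append('P')
        elif tok in street_tokens:
            tags.append('S')
        else:
            tags.append('O')
    return tokens, tags

# Row 2 of the training data shown earlier
tokens, tags = label_tokens("setu siung 119 rt 5 1 13880 cipayung", "", "siung")
print(tokens)
print(tags)
```

For this row the result matches the preview above; the two approaches differ only when a token is a substring of a longer POI or street word.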


In [11]:
# 3. Apply pretrained tokenizer to get subwords and map original word labels to subwords

labels_tag = {'O': 0, 'P': 1, 'S': 2}

tokenized_raw_addresses = []
tokenized_labels = []
for raw_address, label in zip(raw_addresses, labels):

    tokenized_inputs = tokenizer(raw_address, is_split_into_words=True)
    word_ids = tokenized_inputs.word_ids()

    tokenized_labels_single = []
    for idx in word_ids:
        if idx is None:
            # Special tokens ([CLS], [SEP]) have no source word
            tokenized_labels_single.append(-100)
        else:
            label_token = label[int(idx)]
            label_no = labels_tag[label_token]
            tokenized_labels_single.append(label_no)

    tokenized_raw_addresses.append(tokenized_inputs)
    tokenized_labels.append(tokenized_labels_single) 

In [12]:
# Preview tokenized text and labels
print('tokenized_raw_addresses: ', tokenized_raw_addresses[2])
print('tokenized_labels: ', tokenized_labels[2])

tokenized_raw_addresses:  {'input_ids': [2, 30319, 27505, 17689, 4345, 418, 111, 20092, 3193, 5908, 21562, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokenized_labels:  [-100, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [13]:
# 4. Convert list of dicts to dict of lists
# (padding is applied in the next step)
X_dict = {}
X_dict['input_ids'] = []
X_dict['attention_mask'] = []
X_dict['token_type_ids'] = []

for i in range(len(tokenized_raw_addresses)):
    inputs_ids = tokenized_raw_addresses[i]['input_ids']
    attention_mask = tokenized_raw_addresses[i]['attention_mask']
    token_type_ids = tokenized_raw_addresses[i]['token_type_ids']

    X_dict['input_ids'].append(inputs_ids)
    X_dict['attention_mask'].append(attention_mask)
    X_dict['token_type_ids'].append(token_type_ids)
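
The same conversion can be written as a small generic helper. This is a sketch with a hypothetical function name (`to_dict_of_lists`) and made-up token ids, shown only to make the data shape explicit.

```python
def to_dict_of_lists(records, keys):
    """Turn [{'a': x1, 'b': y1}, {'a': x2, 'b': y2}] into {'a': [x1, x2], 'b': [y1, y2]}."""
    return {key: [record[key] for record in records] for key in keys}

# Made-up encodings for two addresses of different length
records = [
    {"input_ids": [2, 10, 3], "attention_mask": [1, 1, 1]},
    {"input_ids": [2, 11, 12, 3], "attention_mask": [1, 1, 1, 1]},
]

columns = to_dict_of_lists(records, ["input_ids", "attention_mask"])
print(columns["input_ids"])  # [[2, 10, 3], [2, 11, 12, 3]]
```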

In [14]:
# 5. Perform padding on X and y
# pad sequences to len of 100

# Add -100 to padding for y
y = pad_sequences(tokenized_labels, maxlen=100, value=-100, dtype="long", padding='post')

# Add 0 to padding for X
for k,v in X_dict.items():
    X_dict[k] = pad_sequences(v, maxlen=100, value=0, dtype="long", padding='post')
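
To make the padding step concrete, here is a minimal pure-Python sketch of post-padding. The helper `pad_post` is hypothetical and, as an assumption of this sketch, cuts over-long sequences from the end. Note that Keras `pad_sequences` truncates from the front by default (`truncating='pre'`), so the two differ for sequences longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value):
    """Pad each sequence on the right with `value` up to `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # assumption: drop anything beyond maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

inputs = pad_post([[2, 10, 3], [2, 11, 12, 13, 3]], maxlen=4, value=0)
targets = pad_post([[-100, 1, -100]], maxlen=5, value=-100)
print(inputs)   # [[2, 10, 3, 0], [2, 11, 12, 13]]
print(targets)  # [[-100, 1, -100, -100, -100]]
```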

In [15]:
# 7. Train val split

# Get train and val indices
data_len = len(train_df)
test_size=0.2
seed = 0
np.random.seed(seed)
val_indices = np.random.choice(data_len, int(test_size*data_len), replace=False)
train_indices = np.setdiff1d(np.arange(data_len), val_indices)  # all rows not selected for validation

# Train set
X_train = {}
for k,v in X_dict.items():
    X_train[k] = v[train_indices]
y_train = y[train_indices]

# Val set
X_val = {}
for k,v in X_dict.items():
    X_val[k] = v[val_indices]
y_val = y[val_indices]
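
The properties the split relies on (disjoint index sets that together cover every row) can be checked with a small standalone sketch. It uses Python's `random` module rather than NumPy, so it will not reproduce the exact split above; `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, val_fraction, seed):
    """Return (train_indices, val_indices) as sorted, disjoint lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_val = int(val_fraction * n_rows)
    return sorted(indices[n_val:]), sorted(indices[:n_val])

train_idx, val_idx = split_indices(10, 0.2, seed=0)

# Sanity checks: sizes follow the 80-20 split and no row is in both sets
print(len(train_idx), len(val_idx))
print(set(train_idx).isdisjoint(val_idx))
```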

  


In [16]:
# 8. Create torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if isinstance(self.labels, np.ndarray):
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset_pos = Dataset(X_train, y_train)
val_dataset_pos = Dataset(X_val, y_val)

## Model Training

We will train the model using Hugging Face's Trainer API, as it is a much simpler way of training pretrained models than writing a training loop in native PyTorch or TensorFlow.

In [None]:
label_list = ['O', 'P', 'S']
metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

args = TrainingArguments(
    '/content/drive/MyDrive/shopee_code_league/poi_street_model',
    evaluation_strategy = "steps",
    eval_steps=3000,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=3000,
    seed=0,
    load_best_model_at_end=True,
    save_total_limit=3
)

pos_trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset_pos,
    eval_dataset=val_dataset_pos,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]
)

pos_trainer.train()

Step,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy,Runtime,Samples Per Second
3000,0.2751,0.271909,0.844664,0.927359,0.884082,0.904786,439.2028,136.611
6000,0.2561,0.248535,0.854358,0.941225,0.89569,0.912626,438.8594,136.718
9000,0.2474,0.242868,0.863669,0.941868,0.901075,0.916115,439.5715,136.497
12000,0.2459,0.234792,0.861731,0.953731,0.9054,0.918985,438.5632,136.81
15000,0.2354,0.228539,0.869024,0.947365,0.906505,0.921263,441.8438,135.795
18000,0.2152,0.231504,0.868393,0.949901,0.90732,0.922125,440.2761,136.278
21000,0.2091,0.227666,0.853605,0.969951,0.908066,0.922537,439.6832,136.462
24000,0.2046,0.227088,0.864957,0.960178,0.910084,0.924097,440.034,136.353
27000,0.2128,0.217875,0.866612,0.96308,0.912303,0.926067,439.7374,136.445
30000,0.2037,0.217609,0.86949,0.963621,0.914138,0.926981,440.7032,136.146




TrainOutput(global_step=36000, training_loss=0.2271694189707438, metrics={'train_runtime': 17163.5671, 'train_samples_per_second': 2.622, 'total_flos': 4.28036171904e+16, 'epoch': 2.4, 'init_mem_cpu_alloc_delta': 24814215, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 12979449, 'init_mem_gpu_peaked_delta': 1024, 'train_mem_cpu_alloc_delta': 3344638, 'train_mem_gpu_alloc_delta': 1995403776, 'train_mem_cpu_peaked_delta': 281288683, 'train_mem_gpu_peaked_delta': 828575232})

## Evaluating model performance

1. Evaluating on generated labels
  - Evaluate model performance on predicting the P/O/S labels
  - These labels are not provided in the dataset; we generated them from the POI/street column during pre-processing
2. Evaluating on original text
  - Decode the tokens predicted as P and S back into words
  - Compare the decoded text against the POI/street string provided

In [17]:
model_path = '/content/drive/MyDrive/shopee_code_league/best_pos_model/checkpoint-27000'
model =  BertForTokenClassification.from_pretrained(model_path, num_labels=3)
trained_pos_trainer = Trainer(model)

In [20]:
# 1. Evaluating on generated labels
predictions, labels, _ = trained_pos_trainer.predict(val_dataset_pos)
predictions_argmax = np.argmax(predictions, axis=2)
metric = load_metric("seqeval")
label_list = ['O', 'P', 'S']

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions_argmax, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions_argmax, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results


{'_': {'f1': 0.9247763424641421,
  'number': 136809,
  'precision': 0.8778842553974082,
  'recall': 0.9769605800787959},
 'overall_accuracy': 0.9354055698541972,
 'overall_f1': 0.9247763424641421,
 'overall_precision': 0.8778842553974082,
 'overall_recall': 0.9769605800787959}
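
The single `'_'` entry in the result suggests that seqeval grouped all tags under one unnamed entity type, which appears to happen when tags lack the `B-`/`I-` prefixes it expects, so the report does not separate POI from street. A per-label token-level count is one way to complement it. The sketch below is pure Python with a hypothetical helper name and toy sequences, not results from the model.

```python
def per_label_scores(true_seqs, pred_seqs, label_names):
    """Token-level precision and recall for each label."""
    scores = {}
    for name in label_names:
        tp = fp = fn = 0
        for true_seq, pred_seq in zip(true_seqs, pred_seqs):
            for t, p in zip(true_seq, pred_seq):
                if p == name and t == name:
                    tp += 1
                elif p == name:
                    fp += 1
                elif t == name:
                    fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[name] = {"precision": precision, "recall": recall}
    return scores

# Toy sequences for illustration only
true_seqs = [['P', 'P', 'O'], ['O', 'S', 'S']]
pred_seqs = [['P', 'O', 'O'], ['S', 'S', 'S']]

scores = per_label_scores(true_seqs, pred_seqs, ['P', 'S'])
print(scores)
```

With the notebook's variables, `true_labels` and `true_predictions` could be passed in place of the toy sequences.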

In [21]:
# Decode pred to words

y_pred = []
for i in range(len(true_predictions)):
    true_pred = true_predictions[i]
    token_ids = X_val['input_ids'][i]
    token_ids = list(token_ids)[1:len(true_pred)+1] # use same length as pred

    street_single = []
    poi_single = []
    for pred_tag, token_id in zip(true_pred,token_ids):
        if pred_tag == 'P':
            poi_single.append(token_id)
        elif pred_tag =='S':
            street_single.append(token_id)

    poi_single_decoded = tokenizer.decode(poi_single)
    street_single_decoded = tokenizer.decode(street_single)
    y_pred.append(poi_single_decoded + '/' + street_single_decoded)

In [22]:
# 2. Evaluating on original text
val_df_pos = train_df.iloc[val_indices,:].copy()  # copy so that adding a column does not trigger SettingWithCopyWarning
val_df_pos['y_pred'] = y_pred

print('text accuracy:', np.average(val_df_pos['POI/street']==val_df_pos['y_pred']))

text accuracy: 0.5885666666666667
  


In [23]:
# Preview prediction
val_df_pos[['raw_address','POI/street','y_pred']].head()

Unnamed: 0,raw_address,POI/street,y_pred
112692,"warung bakso ser bah, 64157",warung bakso/ser bah,warung bakso ser bah/
19498,komplek graha teluk jakarta blok.ac 5 no.7 clu...,komplek graha teluk jakarta/,komplek graha teluk jakarta/
31689,gedung menara enjiniring - pln enjiniring lant...,menara enjiniring/jl. ciputat raya,menara enjiniring enjiniring/jl. ciputat raya.
231780,kepuharjo nan 28 lumajang,/,/nan
4298,kejayaan 17 4 7 krukut rt 11 1 taman sari,/kejayaan,/kejayaan
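
The exact-match accuracy treats the whole `POI/street` string as one unit. Splitting it per field can show whether errors come from the POI or the street part. The sketch below uses the five previewed rows as input; `field_accuracy` is a hypothetical helper, and five rows are of course not representative of the validation set.

```python
def field_accuracy(targets, predictions):
    """Exact-match accuracy for the POI part, the street part, and both together."""
    n = len(targets)
    poi = street = both = 0
    for target, pred in zip(targets, predictions):
        t_poi, _, t_street = target.partition('/')
        p_poi, _, p_street = pred.partition('/')
        poi += t_poi == p_poi
        street += t_street == p_street
        both += target == pred
    return {"poi": poi / n, "street": street / n, "both": both / n}

# The five rows from the preview above
targets = [
    "warung bakso/ser bah",
    "komplek graha teluk jakarta/",
    "menara enjiniring/jl. ciputat raya",
    "/",
    "/kejayaan",
]
predictions = [
    "warung bakso ser bah/",
    "komplek graha teluk jakarta/",
    "menara enjiniring enjiniring/jl. ciputat raya.",
    "/nan",
    "/kejayaan",
]

result = field_accuracy(targets, predictions)
print(result)  # {'poi': 0.6, 'street': 0.4, 'both': 0.4}
```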


# Model 2: Abbreviation detection model

## Define pretrained tokenizer and model

In [24]:
model_name = 'indobenchmark/indobert-base-p1'
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)
model = BertForTokenClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at indobenchmark/indobert-base-p1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Preprocessing

1. Split text into words/punctuation marks, identify abbreviations and create labels
  - EDA on the data showed that an abbreviation and its expanded word usually share the same first 3 letters
  - We use this rule to identify the abbreviation and its associated expansion in each instance
1. Create dictionary mapping of abbreviation to original word
  - After mapping, each abbreviation can correspond to multiple expanded words
  - To maximise accuracy, each abbreviation is mapped to its most frequent expanded word
1. Apply pretrained tokenizer to get subwords and map original word labels to subwords
  - The tokenizer splits some words into subwords, so the tokenized sequence can be longer than the original list of words
1. Convert list of dicts to dict of lists
  - The data needs to be in dictionary form to work with the torch dataset class created below
1. Perform padding on X and y
  - Pad every instance to the same length
  - Use -100 for the padding of labels
    - Labels of -100 are ignored when the loss is computed
  - Use 0 for the padding of inputs
1. Train val split
  - 80-20 split
1. Create torch dataset
  - As we will be using Huggingface Trainer to train, we will need to create a torch dataset class 
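
The first-3-letters rule from step 1 can be sketched in isolation. This covers only the matching rule, not the additional checks the full cell performs; the helper name `find_abbreviation` and the sample addresses are made up for illustration.

```python
import re

def find_abbreviation(expanded_word, raw_address):
    """Return the first word in raw_address sharing the first 3 letters, or None."""
    if len(expanded_word) < 3:
        return None
    prefix = re.escape(expanded_word[:3])
    matches = re.findall(rf"\b({prefix}\w*)\b", raw_address)
    # Several words may share the prefix; only the first match is used
    return matches[0] if matches else None

# Made-up addresses for illustration
print(find_abbreviation("negeri", "sd neg 1 cipayung"))  # neg
print(find_abbreviation("raya", "jl. ciputat"))          # None
```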

In [25]:
# 1. Split text into words/punctuation marks, identify abbreviations and create labels
abbrev_list = []
labels = []
raw_addresses = []
abbrev_dict = {}
for index in train_df.index:
    not_in_raw_address = []
    abbrev_indexes = []
    abbrev_single = []
    labels_single = []

    raw_address_tokens = re.findall(r"[\w]+|[^\s\w]", train_df.loc[index, "raw_address"])
    street_tokens = re.findall(r"[\w]+|[^\s\w]", train_df.loc[index, "street"])
    poi_tokens = re.findall(r"[\w]+|[^\s\w]", train_df.loc[index, "POI"])

    # find words not in raw address    
    for poi_word in poi_tokens:
        if (poi_word not in string.punctuation) and (poi_word not in raw_address_tokens):
              not_in_raw_address.append(poi_word)

    for street_word in street_tokens:
        if (street_word not in string.punctuation) and (street_word not in raw_address_tokens):
              not_in_raw_address.append(street_word)
    
    # find abbrev
    for expanded_word in not_in_raw_address:
        if len(expanded_word) >= 3:
            first_3_char = expanded_word[:3]
            raw_address = train_df.loc[index, "raw_address"]
            abbrev = re.findall(rf'\b({first_3_char}\w*)\b', raw_address) # not ideal because it could return multiple matches
            if len(abbrev) >= 1:
                abbrev = abbrev[0]
                if abbrev in raw_address_tokens:
                    abbrev_index = raw_address_tokens.index(abbrev)
                    abbrev_indexes.append(abbrev_index)
                    abbrev_single.append(abbrev)

                    # create dictionary
                    if abbrev not in abbrev_dict.keys():
                        # case 1: abbrev not recorded
                        abbrev_dict[abbrev] = {}
                        abbrev_dict[abbrev][expanded_word] = 1
                    else:
                        if  expanded_word not in abbrev_dict[abbrev].keys():
                            # case 2: abbrev recorded but expanded word not recorded
                            abbrev_dict[abbrev][expanded_word] = 1
                        else: 
                            # case 3: abbrev and expanded word recorded
                            abbrev_dict[abbrev][expanded_word]  += 1
    
    # create labels for classification
    for i in range(len(raw_address_tokens)):
        if i in abbrev_indexes:
            labels_single.append(1)
        else: 
            labels_single.append(0)

    labels.append(labels_single)
    abbrev_list.append(abbrev_single)
    raw_addresses.append(raw_address_tokens)

In [26]:
# Preview labels
labels[10]

[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

In [27]:
# Preview abbreviation mapping found from training data
abbrev_dict['neg']

{'negara': 107, 'negeri': 1058, 'negerikaton': 1, 'neglasari': 1}

In [28]:
# 2. Create dictionary mapping of highest frequency word
derived_abbrev_dict = {}
for abbrev, expanded_word_dict in abbrev_dict.items():
    expanded_word_sorted = sorted(expanded_word_dict.items(), key=lambda item: item[1], reverse=True)
    most_expanded_word = expanded_word_sorted[0][0]
    derived_abbrev_dict[abbrev] = most_expanded_word

In [29]:
# Preview abbreviation mapping after mapping each word to highest frequency expansion
derived_abbrev_dict['neg']

'negeri'
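
Picking the most frequent expansion also tells us how often the dictionary would be wrong on the training data. For `neg`, using the counts printed above, about 9% of occurrences expand to something other than `negeri`. The helper name below is hypothetical.

```python
def most_frequent_expansion(counts):
    """Return the expansion with the highest count and its share of all occurrences."""
    best = max(counts, key=counts.get)
    share = counts[best] / sum(counts.values())
    return best, share

# Counts for 'neg' as printed above
neg_counts = {'negara': 107, 'negeri': 1058, 'negerikaton': 1, 'neglasari': 1}

best, share = most_frequent_expansion(neg_counts)
print(best, round(share, 3))  # negeri 0.907
```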

In [30]:
# 3. Apply pretrained tokenizer to get subwords and map original word labels to subwords
tokenized_raw_addresses = []
tokenized_labels = []
for raw_address, label in zip(raw_addresses, labels):

    tokenized_inputs = tokenizer(raw_address, is_split_into_words=True)
    word_ids = tokenized_inputs.word_ids()

    tokenized_labels_single = []
    for idx in word_ids:
        if idx is None:
            tokenized_labels_single.append(-100)
        else:
            label_token = label[int(idx)]
            tokenized_labels_single.append(label_token)

    tokenized_raw_addresses.append(tokenized_inputs)
    tokenized_labels.append(tokenized_labels_single) 

In [31]:
# 4. Convert list of dicts to dict
X_dict = {}
X_dict['input_ids'] = []
X_dict['attention_mask'] = []
X_dict['token_type_ids'] = []

for i in range(len(tokenized_raw_addresses)):
    inputs_ids = tokenized_raw_addresses[i]['input_ids']
    attention_mask = tokenized_raw_addresses[i]['attention_mask']
    token_type_ids = tokenized_raw_addresses[i]['token_type_ids']

    X_dict['input_ids'].append(inputs_ids)
    X_dict['attention_mask'].append(attention_mask)
    X_dict['token_type_ids'].append(token_type_ids)

In [32]:
# 5. Perform padding on X and y
# pad sequences to len of 100

# Add -100 to padding for labels
y = pad_sequences(tokenized_labels, maxlen=100, value=-100, dtype="long", padding='post')

# Add 0 to padding for inputs
for k,v in X_dict.items():
    X_dict[k] = pad_sequences(v, maxlen=100, value=0, dtype="long", padding='post')

In [33]:
# 7. Train val split

# Get train and val indices
data_len = len(train_df)
test_size=0.2
seed = 0
np.random.seed(seed)
val_indices = np.random.choice(data_len, int(test_size*data_len), replace=False)
train_indices = np.setdiff1d(np.arange(data_len), val_indices)  # all rows not selected for validation

# Train set
X_train = {}
for k,v in X_dict.items():
    X_train[k] = v[train_indices]
y_train = y[train_indices]

# Val set
X_val = {}
for k,v in X_dict.items():
    X_val[k] = v[val_indices]
y_val = y[val_indices]

  


In [34]:
# 8. Create torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if isinstance(self.labels, np.ndarray):
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = Dataset(X_train, y_train)
val_dataset = Dataset(X_val, y_val)

## Training model

In [None]:
def compute_metrics(p):
    pred, labels = p
    predictions_argmax = np.argmax(pred, axis=2)

    true_predictions = [
        [p for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions_argmax, labels)
    ]
    true_labels = [
        [l for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions_argmax, labels)
    ]

    # Flatten predictions and labels into single dimension
    true_predictions_flatten = []
    true_labels_flatten = []
    for i_outer in range(len(true_predictions)):
        for i_inner in range(len(true_predictions[i_outer])):
            true_predictions_flatten.append(true_predictions[i_outer][i_inner])
            true_labels_flatten.append(true_labels[i_outer][i_inner])

    accuracy = accuracy_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)
    recall = recall_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)
    precision = precision_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)
    f1 = f1_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

args = TrainingArguments(
    '/content/drive/MyDrive/shopee_code_league/abbrev_model',
    evaluation_strategy = "steps",
    eval_steps=3000,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=3000,
    seed=0,
    load_best_model_at_end=True,
    save_total_limit=3
)

abbrev_trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=2)]
)

abbrev_trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Runtime,Samples Per Second
3000,0.0686,0.065181,0.977311,0.735814,0.456478,0.563423,421.0967,142.485
6000,0.0624,0.062479,0.978575,0.765906,0.478137,0.588739,420.379,142.728
9000,0.0605,0.061875,0.979086,0.777266,0.487674,0.599321,421.0585,142.498
12000,0.0629,0.059463,0.978993,0.763032,0.500457,0.60446,420.4941,142.689
15000,0.0553,0.059738,0.979792,0.757721,0.543827,0.633198,420.5267,142.678
18000,0.0503,0.065464,0.979364,0.721125,0.581465,0.643808,420.2782,142.763


## Evaluating model performance

1. Evaluating on token level
  - Evaluate how accurately the model classifies each token as abbreviation or not
2. Evaluating on instance level
  - Evaluate per address: an instance counts as correct only if every token in it is classified correctly

In [35]:
# load trained model
model_path = '/content/drive/MyDrive/shopee_code_league/abbrev_model/checkpoint-18000'
model =  BertForTokenClassification.from_pretrained(model_path, num_labels=2)
trained_abbrev_trainer = Trainer(model)

In [36]:
# 1. Evaluate on token level 

predictions, labels, _ = trained_abbrev_trainer.predict(val_dataset)
predictions_argmax = np.argmax(predictions, axis=2)

true_predictions = [
    [p for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions_argmax, labels)
]
true_labels = [
    [l for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions_argmax, labels)
]

# Flatten predictions and labels into single dimension
true_predictions_flatten = []
true_labels_flatten = []
for i_outer in range(len(true_predictions)):
    for i_inner in range(len(true_predictions[i_outer])):
        true_predictions_flatten.append(true_predictions[i_outer][i_inner])
        true_labels_flatten.append(true_labels[i_outer][i_inner])

accuracy = accuracy_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)
recall = recall_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)
precision = precision_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)
f1 = f1_score(y_true=true_labels_flatten, y_pred=true_predictions_flatten)

print("token accuracy: ", accuracy)
print("token precision: ", precision)
print("token recall: ", recall)
print("token f1: ", f1)

token accuracy:  0.979364402781348
token precision:  0.7211248112732763
token recall:  0.5814649487673734
token f1:  0.6438079191238416


In [37]:
# 2. Evaluate results by instances level

total = 0
correct = 0
positive_pred = 0 
positive_label = 0 
correct_positive_pred = 0
for i in range(len(true_predictions)):
    total += 1
    if (true_predictions [i]== true_labels[i]):
        correct +=1
    if 1 in true_labels[i]:
        positive_label += 1
    if 1 in true_predictions[i]:
        positive_pred += 1
        if 1 in true_labels[i]:
            correct_positive_pred += 1

print('instance accuracy without abbre model:', 1 - positive_label/total)
print('instance accuracy:', correct/total)
print('instance precision:', correct_positive_pred/positive_pred)
print('instance recall:', correct_positive_pred/positive_label)

instance accuracy without abbre model: 0.8110833333333334
instance accuracy: 0.8653166666666666
instance precision: 0.7683792815371763
instance recall: 0.6491398323775915
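
The gap between token accuracy (about 0.98) and instance accuracy (about 0.87) follows from how the two are counted: one wrong token makes the whole address wrong. A toy sketch with made-up labels and hypothetical helper names shows the effect.

```python
def token_accuracy(true_seqs, pred_seqs):
    """Share of individual tokens that are classified correctly."""
    pairs = [(t, p) for ts, ps in zip(true_seqs, pred_seqs) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)

def instance_accuracy(true_seqs, pred_seqs):
    """Share of sequences in which every token is classified correctly."""
    matches = sum(list(ts) == list(ps) for ts, ps in zip(true_seqs, pred_seqs))
    return matches / len(true_seqs)

# Toy data: one missed abbreviation in the first address
true_seqs = [[0, 0, 1, 0], [0, 0, 0, 0]]
pred_seqs = [[0, 0, 0, 0], [0, 0, 0, 0]]

print(token_accuracy(true_seqs, pred_seqs))     # 0.875
print(instance_accuracy(true_seqs, pred_seqs))  # 0.5
```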


# Testing combined model performance

## Making inference with abbreviation detection model

### Preprocessing

1. Tokenize raw addresses using pretrained tokenizer
1. Convert list of dicts to dicts of lists
1. Perform padding to 100 length
1. Create torch dataset class

In [38]:
# Predict abbreviations 

val_set = train_df.iloc[val_indices,:]

# Define tokenizer
model_name = 'indobenchmark/indobert-base-p1' 
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

tokenized_raw_addresses = []
tokens_len = []
for index in val_set.index:
    raw_address = re.findall(r"[\w]+|[^\s\w]", val_set.loc[index, "raw_address"])
    tokenized_inputs = tokenizer(raw_address, is_split_into_words=True)
    tokenized_raw_addresses.append(tokenized_inputs)

    # calculate tokens length - need for output
    tokens_len_row = len(tokenized_inputs['input_ids']) - 2 # minus [cls] and [sep]
    tokens_len.append(tokens_len_row)

# convert list of dict to dict of lists
X = {}
X['input_ids'] = []
X['attention_mask'] = []
X['token_type_ids'] = []

for i in range(len(tokenized_raw_addresses)):
    inputs_ids = tokenized_raw_addresses[i]['input_ids']
    attention_mask = tokenized_raw_addresses[i]['attention_mask']
    token_type_ids = tokenized_raw_addresses[i]['token_type_ids']

    X['input_ids'].append(inputs_ids)
    X['attention_mask'].append(attention_mask)
    X['token_type_ids'].append(token_type_ids)

# Perform padding
for k,v in X.items():
    X[k] = pad_sequences(v, maxlen=100, value=0, dtype="long", padding='post')


class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if isinstance(self.labels, np.ndarray):
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])


X_torch = Dataset(X)

### Prediction

In [39]:
# load model
model_path = '/content/drive/MyDrive/shopee_code_league/abbrev_model/checkpoint-18000'
model =  BertForTokenClassification.from_pretrained(model_path, num_labels=2)
trained_abbrev_trainer = Trainer(model)

# predict
pred,_,_ = trained_abbrev_trainer.predict(X_torch)

In [40]:
# Process prediction to find words to expand

# Trim pred to length of input tokens
predictions_argmax_test = np.argmax(pred, axis=2)

pred_tokens = []
for i in range(len(predictions_argmax_test)):
    tokens = predictions_argmax_test[i][1: tokens_len[i]+ 1]
    pred_tokens.append(tokens)

In [41]:
# find words needed to expand
token_ids_to_expand = []
abbrev_to_expand = []
abbrev_expansion = []
for row_num, row in enumerate(pred_tokens):
    if 1 in row:
        token_ids_to_expand_row = []
        
        input_ids = X['input_ids'][row_num][1: len(row)+1] # exclude cls token and sep
        for token_num, token in enumerate(row):
            if token == 1:
                token_id = input_ids[token_num]
                token_ids_to_expand_row.append(token_id)

        token_ids_to_expand.append(token_ids_to_expand_row)

        # decode - tokens include subwords
        original_abbrev = tokenizer.decode(token_ids_to_expand_row)
        abbrev_list = re.findall(r"[\w]+|[^\s\w]", original_abbrev)

        expansion_dict = {}
        for abbrev in abbrev_list:
            if abbrev in derived_abbrev_dict.keys():
                expansion = derived_abbrev_dict[abbrev]
                expansion_dict[abbrev] = expansion
        abbrev_expansion.append(expansion_dict)

    else:
        # token_ids_to_expand.append([])
        # abbrev_to_expand.append('')
        abbrev_expansion.append({})
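
The lookup step can be sketched in isolation. The regular expression is the one used in the cell above, while `lookup_expansions` and `toy_abbrev_dict` are hypothetical stand-ins for the notebook's loop and for `derived_abbrev_dict`; the two mappings are taken from rows that appear in the validation output later in the notebook.

```python
import re


def lookup_expansions(decoded_text, abbrev_dict):
    """Split decoded text into words and punctuation, keep known abbreviations."""
    pieces = re.findall(r"[\w]+|[^\s\w]", decoded_text)
    return {piece: abbrev_dict[piece] for piece in pieces if piece in abbrev_dict}


# Hypothetical dictionary with two abbreviation -> expansion pairs.
toy_abbrev_dict = {"war": "warung", "list": "listrik"}

found = lookup_expansions("war list ,", toy_abbrev_dict)
print(found)
print(lookup_expansions("kosasih", toy_abbrev_dict))
```

Pieces that are not in the dictionary, including punctuation, are silently skipped, so a row with flagged tokens can still end up with an empty expansion dictionary.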

## Apply abbreviation model predictions to the POI/street (POS) model predictions

In [43]:
val_df_pos.head()

Unnamed: 0,id,raw_address,POI/street,POI,street,y_pred
112692,112692,"warung bakso ser bah, 64157",warung bakso/ser bah,warung bakso,ser bah,warung bakso ser bah/
19498,19498,komplek graha teluk jakarta blok.ac 5 no.7 clu...,komplek graha teluk jakarta/,komplek graha teluk jakarta,,komplek graha teluk jakarta/
31689,31689,gedung menara enjiniring - pln enjiniring lant...,menara enjiniring/jl. ciputat raya,menara enjiniring,jl. ciputat raya,menara enjiniring enjiniring/jl. ciputat raya.
231780,231780,kepuharjo nan 28 lumajang,/,,,/nan
4298,4298,kejayaan 17 4 7 krukut rt 11 1 taman sari,/kejayaan,,kejayaan,/kejayaan


In [44]:
poi_street_pred = list(val_df_pos['y_pred'])

# perform abbrev expansion
expanded_poi_street_pred = []
for row_id, abbrev_expansion_row in enumerate(abbrev_expansion):
    if len(abbrev_expansion_row.keys()) == 0:
        expanded_poi_street_pred.append(poi_street_pred[row_id])
    else:
        poi_street_row = poi_street_pred[row_id]
        for abbrev in abbrev_expansion_row.keys():
            if abbrev in poi_street_row:
                expansion = abbrev_expansion_row[abbrev]
                poi_street_row = poi_street_row.replace(abbrev, expansion)
        expanded_poi_street_pred.append(poi_street_row)
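
Note that `str.replace` substitutes every occurrence of the abbreviation, including occurrences inside longer words. The sketch below shows one possible alternative that only replaces whole words. It is an assumption about how the step could be hardened, not the method used in this notebook, and `expand_whole_words` is a hypothetical helper.

```python
import re


def expand_whole_words(text, expansions):
    """Replace each abbreviation only where it stands alone as a word."""
    for abbrev, expansion in expansions.items():
        # Lookarounds reject matches that touch another word character.
        pattern = r"(?<!\w)" + re.escape(abbrev) + r"(?!\w)"
        # A function replacement avoids backslash handling in the expansion.
        text = re.sub(pattern, lambda match, e=expansion: e, text)
    return text


whole_word = expand_whole_words("warna war/", {"war": "warung"})
substring = "warna war/".replace("war", "warung")
print(whole_word)
print(substring)
```

In this toy input the plain substring replacement also rewrites the start of `warna`, while the whole-word version leaves it untouched.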

## Evaluation

In [45]:
# work on an explicit copy to avoid pandas' SettingWithCopyWarning
val_df_pos = val_df_pos.copy()
val_df_pos['y_pred_expanded'] = expanded_poi_street_pred


In [50]:
val_df_pos.tail(10)

Unnamed: 0,id,raw_address,POI/street,POI,street,y_pred,y_pred_expanded
146037,146037,"kalibata ten, pancoran",/kalibata tengah,,kalibata tengah,/kalibata ten,/kalibata ten
223083,223083,"war nasi wahyu jaya, suka bakti,",warung nasi wahyu jaya/suka bakti,warung nasi wahyu jaya,suka bakti,war nasi wahyu jaya/suka bakti,warung nasi wahyu jaya/suka bakti
142286,142286,perum asab deli deli tua barat deli tua,"perumahan asabri, delitua/","perumahan asabri, delitua",,deli deli deli tua/perum asab,deli deli deli tua/perum asab
233827,233827,"cipinang goedang futsal raya kal, 3 a rt 1 1 1...",goedang futsal/raya kal,goedang futsal,raya kal,goedang futsal/raya kal,goedang futsal/raya kal
77956,77956,and 32 pangkalan jati baru cinere,/and,,and,/and,/and
66564,66564,gg. camplung 2 dalung kuta utara,/gg. camplung,,gg. camplung,/gg. camplung 2,/gg. camplung 2
170313,170313,"biro tehnik list,",biro tehnik listrik/,biro tehnik listrik,,biro tehnik list/,biro tehnik listrik/
97745,97745,"warung_me, merd, sumerta kelod",warung_menlempeh/merd,warung_menlempeh,merd,warung _ me/merd,warung _ me/merd
97412,97412,"kosasih,",kosasih/,kosasih,,/kosasih,/kosasih
161356,161356,wisata pan mangg sura timur mulyo utara ix no ...,wisata pantai manggrove surabaya timur/mulyo u...,wisata pantai manggrove surabaya timur,mulyo utara ix,/mulyo utara ix,/mulyo utara ix


In [47]:
print('model accuracy (pos model only):', np.average(val_df_pos['y_pred'] == val_df_pos['POI/street']))
print('model accuracy (pos and abbrev model):', np.average(val_df_pos['y_pred_expanded'] == val_df_pos['POI/street']))

model accuracy (pos model only): 0.5885666666666667
model accuracy (pos and abbrev model): 0.63415
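
The metric printed above is exact-match accuracy: a row counts as correct only if the whole `POI/street` string equals the label. A minimal sketch of that computation follows, using four rows copied from the validation output shown earlier; `exact_match_accuracy` is a hypothetical helper, not a function from the notebook.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of rows whose predicted string equals the label exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    matches = sum(pred == target for pred, target in zip(predictions, targets))
    return matches / len(targets)


labels = ["/kalibata tengah", "warung nasi wahyu jaya/suka bakti",
          "biro tehnik listrik/", "kosasih/"]
before_expansion = ["/kalibata ten", "war nasi wahyu jaya/suka bakti",
                    "biro tehnik list/", "/kosasih"]
after_expansion = ["/kalibata ten", "warung nasi wahyu jaya/suka bakti",
                   "biro tehnik listrik/", "/kosasih"]

print(exact_match_accuracy(before_expansion, labels))
print(exact_match_accuracy(after_expansion, labels))
```

The last row shows how strict the metric is: the right word placed on the wrong side of the `/` separator still counts as a miss.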


# Conclusion
With both models combined, exact-match accuracy on the validation data is about 63%, up from about 59% with the POI/street model alone.

This performance can be improved by removing raw addresses that have neither a POI nor a street from the training data. Based on experiments with different models, we suspect that these raw addresses are wrongly labelled. Including them in the training data adds noise and therefore reduces the accuracy of the model.

Removing these raw addresses from the training data can improve accuracy by about 1%.
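
A minimal sketch of that filtering step, assuming the training rows carry the same `POI/street` label format as the validation data, where an address with neither element is labelled `/`. The helper `drop_empty_labels` and the toy rows are hypothetical.

```python
def drop_empty_labels(rows, label_key="POI/street"):
    """Keep only rows whose label contains a POI, a street, or both."""
    return [row for row in rows if row[label_key].strip() != "/"]


# With a pandas DataFrame the equivalent filter would be
# df[df["POI/street"] != "/"].
toy_rows = [
    {"raw_address": "kepuharjo nan 28 lumajang", "POI/street": "/"},
    {"raw_address": "kosasih,", "POI/street": "kosasih/"},
    {"raw_address": "and 32 pangkalan jati baru cinere", "POI/street": "/and"},
]
kept = drop_empty_labels(toy_rows)
print(len(kept))
```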