# NER for Address Elements Extraction

The following solution is an attempt to perform Named Entity Recognition (NER) on raw Indonesian addresses. This is my first data science competition, so any suggestions would be greatly appreciated!

# 1. Competition Details

## Background

At Shopee, we strive to ensure our customers' highest satisfaction for their shopping and delivery experience - fast and accurate delivery of goods. This can be better achieved if we have key address elements for each user address which allows us to accurately geocode it to obtain geographic coordinates to ship the parcel to our customers. These key address elements include Point of Interest (POI) Names and Street Names. However, most addresses that Shopee receives are unstructured and in free text format, not following a certain pattern. Thus it is important for us to develop a model to precisely extract the key address elements from it.

## Task

In this competition, you’ll work on addresses collected by us to build a model that correctly extracts Point of Interest (POI) Names and Street Names from unformatted Indonesian addresses.

Participants are expected to build their own model for this competition. Submissions by teams that directly call any third-party APIs on the test set will not be taken into consideration.

# 2. Preprocessing

## Labelling

Here, we will label the POI and street entities in the training dataset. The result is a list of raw addresses and their corresponding extracted entities. 
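
For reference, the labelled result pairs each raw address with character-offset entity spans. Below is a minimal sketch of that structure, assuming the labels appear verbatim in the address; `make_entry` is a hypothetical helper and the address is made up, not taken from the dataset.

```python
def make_entry(raw_address, poi, street):
    """Build a (text, {"entities": [...]}) entry from exact substring matches.

    Hypothetical helper for illustration: it only handles labels that appear
    verbatim in the raw address.
    """
    entities = []
    for label_text, label in ((poi, "POI"), (street, "SRT")):
        if not label_text:
            continue
        start = raw_address.find(label_text)
        if start != -1:
            # The end offset is exclusive, matching Python slicing.
            entities.append((start, start + len(label_text), label))
    return (raw_address, {"entities": entities})


# Made-up example address, not taken from the dataset.
entry = make_entry("toko budi jl. mawar 5", "toko budi", "jl. mawar")
print(entry)
```

Each span is `(start, end, label)` with an exclusive end, so `raw_address[start:end]` recovers the labelled text.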

## Incomplete Addresses

> The word “pembangunan” in raw_address “smk karya pemban, pon” is not complete. The correct POI will be “smk karya pembangunan” and the returned result should be:
>
> smk karya pembangunan/pon

Some raw addresses may be incomplete. We will attempt to map the incomplete address elements to the complete POI / street data in the training dataset.
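
One way to locate a truncated element inside the raw address is a word-by-word prefix comparison. The sketch below is a simplified, self-contained illustration using the example from the competition text. `find_truncated_span` is a hypothetical helper, not the notebook's labelling code, and prefix matching of this kind can produce false positives on very short words.

```python
import re


def find_truncated_span(raw_address, complete_label):
    """Return the (start, end) character span in raw_address whose words are
    prefixes of the words in complete_label, or None if there is no such span.

    Hypothetical helper: trailing punctuation on a raw word is ignored when
    comparing, and an empty word never counts as a prefix.
    """
    # Keep each word with its character offsets so repeated spaces are handled.
    tokens = [(m.group(), m.start()) for m in re.finditer(r"\S+", raw_address)]
    targets = complete_label.split()
    if not targets:
        return None

    for i in range(len(tokens) - len(targets) + 1):
        window = tokens[i:i + len(targets)]
        stripped = [word.rstrip(".,;:") for word, _ in window]
        if all(s and t.startswith(s) for s, t in zip(stripped, targets)):
            start = window[0][1]
            # End right after the last word, excluding its trailing punctuation.
            end = window[-1][1] + len(stripped[-1])
            return (start, end)
    return None


raw = "smk karya pemban, pon"
span = find_truncated_span(raw, "smk karya pembangunan")
print(span, repr(raw[span[0]:span[1]]))
```

Taking offsets from the regex matches, rather than summing word lengths, keeps the span correct even when words are separated by more than one space.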

In [None]:
import pandas as pd
import re

df = pd.read_csv('../input/scl-2021-ds/train.csv')
df.fillna('', inplace=True)

df

In [None]:
import string

TRAIN_COMBINED = []

total_conflicts = 0

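def find_match(raw_words, target_words, ignore_truncated=True):
    """Return a 0/1 list marking the words of raw_words that match target_words in sequence.

    A raw word matches a target word if they are equal, with or without the raw
    word's last character (usually trailing punctuation). If ignore_truncated is
    False, a raw word also matches when it is a prefix of the target word.
    """
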
    low = 0
    high = 0
    pattern_index = 0
    found = False
            
    while high < len(raw_words):

        word = raw_words[high]

        if pattern_index == len(target_words):
            found = True
            break
            
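        # A raw word also matches if it is a prefix of the target word. The length check stops
        # a one-character word from matching everything once its last character is dropped.
        elif not ignore_truncated and (target_words[pattern_index].startswith(word) or (len(word) > 1 and target_words[pattern_index].startswith(word[:-1]))):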
            high += 1
            pattern_index += 1

        elif target_words[pattern_index] == word or target_words[pattern_index] == word[:-1]:
            high += 1
            pattern_index += 1

        else:
            low += 1
            high = low
            pattern_index = 0
            
    if pattern_index == len(target_words):
        found = True

    if found:
        vector = [1 if i >= low and i < high else 0 for i in range(len(raw_words))]
    else:
        vector = [0 for i in range(len(raw_words))]
        
    return vector

for index, row in df.iterrows():
    raw_address = row['raw_address']
    # Split on the first slash only, so an extra slash does not raise a ValueError
    poi, street = row['POI/street'].split('/', 1)
    
    combined_entities = []
    
    first_entity = None
    
    if poi:
        
        m = re.search(re.escape(poi), raw_address)
        
        if m:
            low, high = m.start(), m.end()
        
        else:
            
            vector = find_match(raw_address.split(), poi.split(), False)

            curr_idx = 0
            low = None
            high = None

            words = raw_address.split()

            for i in range(len(vector)):
                if vector[i] == 1 and low is None:
                    low = curr_idx
                elif vector[i] == 0 and low is not None and high is None:
                    high = curr_idx - 1
                    break

                # Assumes words are separated by single spaces
                curr_idx += len(words[i]) + 1

            if low is not None and high is None:
                high = len(raw_address)

        if low is not None and high is not None:

            # Trim trailing punctuation without moving past the start of the span
            while high > low and raw_address[high - 1] in string.punctuation:
                high -= 1

            first_entity = (low, high, "POI")
            combined_entities.append((low, high, "POI"))
            
    if street:
        
        m = re.search(re.escape(street), raw_address)
        
        if m:
            low, high = m.start(), m.end()
        
        else:
        
            vector = find_match(raw_address.split(), street.split(), False)

            curr_idx = 0
            low = None
            high = None

            words = raw_address.split()

            for i in range(len(vector)):
                if vector[i] == 1 and low is None:
                    low = curr_idx
                elif vector[i] == 0 and low is not None and high is None:
                    high = curr_idx - 1
                    break

                # Assumes words are separated by single spaces
                curr_idx += len(words[i]) + 1

            if low is not None and high is None:
                high = len(raw_address)
        
        if low is not None and high is not None:
            
            # Trim trailing punctuation without moving past the start of the span
            while high > low and raw_address[high - 1] in string.punctuation:
                high -= 1
            
            if first_entity and (
                # Starting index overlaps with POI tag
                (first_entity[0] <= low <= first_entity[1]) or
                
                # Ending index overlaps with POI tag
                (first_entity[0] <= high <= first_entity[1]) or
                
                # POI tag is a subset of SRT tag
                (low < first_entity[0] and high > first_entity[1])
            ):
                # Conflicting tags
                total_conflicts += 1
            
            else:
                combined_entities.append((low, high, "SRT"))
    
    combined_spacy_entry = (raw_address, {"entities": combined_entities})
    TRAIN_COMBINED.append(combined_spacy_entry)

print(f"{total_conflicts} total conflicts.")
print(TRAIN_COMBINED[:10])

# 3. Training

The NER model is trained using spaCy.

## Dropout

To prevent overfitting, dropout randomly disables a fraction of the network's units at each update, with the fraction set by a given probability. After experimenting, I found that a decaying dropout rate, which starts high (0.6) and decreases toward a lower floor (0.2) as training progresses, was most effective.
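
To make the idea of a decaying rate concrete, here is a plain-Python sketch of a linearly decaying schedule with a floor. `linear_decay` is a hypothetical helper written for illustration. spaCy's own `decaying` utility may compute its values differently depending on the version, so treat this as an approximation of the behaviour rather than its implementation.

```python
from itertools import islice


def linear_decay(start, stop, step):
    """Yield start, start - step, ... without ever going below stop (hypothetical helper)."""
    current = float(start)
    while True:
        yield max(current, stop)
        current -= step


schedule = linear_decay(0.6, 0.2, 1e-4)

# Take the first 5001 values: one per parameter update.
rates = list(islice(schedule, 5001))
print(rates[0], rates[-1])
```

Under this linear assumption, a step of `1e-4` moves the rate from 0.6 to the 0.2 floor after roughly 4,000 updates, so the schedule advances per batch rather than per epoch.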

In [None]:
import spacy
import random
import logging
logging.captureWarnings(True)

print("Preparing...")

nlp = spacy.blank("id")  # Indonesian

# spaCy v2 API: create the NER component, then add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("POI")
ner.add_label("SRT")

In [None]:
from spacy.util import minibatch, compounding, decaying

print("Training...")

# Start the training
nlp.begin_training()

dropout = decaying(0.6, 0.2, 1e-4)

# Loop for 20 iterations
for itn in range(20):
    
    # Shuffle the training data
    random.shuffle(TRAIN_COMBINED)
    losses = {}
    
    # Batch the examples and iterate over them
    for batch in minibatch(TRAIN_COMBINED, size=compounding(4.0, 32.0, 1.001)):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]
        
        # Update the model
        nlp.update(texts, annotations, losses=losses, drop=next(dropout))
        
    print(f"Epoch: {itn}\tLosses: {str(losses)}")
    
nlp.to_disk("combined.model")

# 4. Predictions

Generate predictions on the test dataset.

## Limitations

I've been unable to accurately complete truncated words in incomplete addresses. I tried building dictionaries of known vocabulary mappings, e.g. "smk karya pemban" -> "smk karya pembangunan", and expanding the incomplete addresses as a post-processing step, but these experiments did not improve accuracy on the public leaderboard.
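
For readers who want to try the same idea, the dictionary-based expansion described above could look roughly like the sketch below. This is not the code used in my experiments. `build_expansions` and `expand` are hypothetical helpers, and apart from the competition's own example the pairs are made up.

```python
def build_expansions(pairs):
    """Map each truncated phrase to its complete form.

    `pairs` is an iterable of (truncated, complete) strings, for example
    collected while labelling the training data. Phrases seen with more than
    one completion are dropped, since the expansion would be ambiguous.
    """
    expansions = {}
    ambiguous = set()
    for truncated, complete in pairs:
        if truncated == complete or truncated in ambiguous:
            continue
        if truncated in expansions and expansions[truncated] != complete:
            del expansions[truncated]
            ambiguous.add(truncated)
        else:
            expansions[truncated] = complete
    return expansions


def expand(prediction, expansions):
    """Replace a predicted phrase with its known completion, if there is one."""
    return expansions.get(prediction, prediction)


pairs = [
    ("smk karya pemban", "smk karya pembangunan"),  # example from the competition text
    ("toko bu", "toko budi"),   # made-up ambiguous pair
    ("toko bu", "toko bunga"),  # made-up ambiguous pair
]
table = build_expansions(pairs)
print(expand("smk karya pemban", table))
print(expand("toko bu", table))
```

Dropping ambiguous keys is one possible design choice. Exact-phrase lookup also only helps when the model predicts a truncated phrase that was already seen during training.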

In [None]:
import spacy

nlp = spacy.load("combined.model")

In [None]:
import pandas as pd

df = pd.read_csv('../input/scl-2021-ds/test.csv')

results = {}

for index, row in df.iterrows():
    raw_address, id = row['raw_address'], row['id']
    
    doc = nlp(raw_address)
    
    elements = [(X.text, X.label_) for X in doc.ents]
    poi_elements = [x for x in elements if x[1] == 'POI']
    srt_elements = [x for x in elements if x[1] == 'SRT']
    
    if not poi_elements:
        poi_elements = [('', '')]
    
    if not srt_elements:
        srt_elements = [('', '')]
    
    results[id] = (poi_elements, srt_elements)

# 5. Output

Generate the submission.csv required for the competition.

In [None]:
with open('submission.csv', 'w') as f:
    f.write("id,POI/street\n")
    # Iterate over the ids seen in the test set instead of assuming 50,000 rows
    for id in results:
        poi = results[id][0][0][0]
        srt = results[id][1][0][0]
        
        f.write(f"{id},\"{poi}/{srt}\"\n")

df = pd.read_csv('submission.csv')
df
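
The manual string formatting above would produce a malformed row if a predicted phrase ever contained a double quote. A hedged alternative is to let Python's `csv` module handle the quoting. In this sketch, `write_submission` is a hypothetical helper, the rows are made up, and `results` is assumed to map each id directly to a `(poi, street)` pair of strings, which is simpler than the nested structure built earlier.

```python
import csv
import io


def write_submission(results, handle):
    """Write id and "POI/street" rows; `results` maps id -> (poi, street) strings."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "POI/street"])
    for row_id, (poi, street) in results.items():
        # csv.writer quotes and escapes fields containing commas or quotes.
        writer.writerow([row_id, f"{poi}/{street}"])


# Made-up rows, including one with a comma and embedded double quotes.
sample = {0: ("toko budi", "jl. mawar"), 1: ('toko "budi", blok a', ""), 2: ("", "")}
buffer = io.StringIO()
write_submission(sample, buffer)

# Read the data back to confirm that every field survives the round trip.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

To write a real file, pass `open('submission.csv', 'w', newline='')` as `handle` instead of the in-memory buffer.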

# 6. Evaluation

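We can gauge the quality of our model with evaluation metrics such as precision, recall and the F-score. Note that the sample scored below is drawn from the training data, so the results will be optimistic compared with scores on held-out data.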
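
As a reminder of what these metrics measure for entity spans, here is a small self-contained sketch. `span_prf` is a hypothetical helper that counts a prediction as correct only when its start, end and label all match exactly. It is meant to illustrate the definitions, not to reproduce spaCy's scorer, and the spans are made up.

```python
def span_prf(predicted, gold):
    """Precision, recall and F1 over exact-match (start, end, label) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans: the second prediction has the wrong end offset.
gold = [(0, 9, "POI"), (10, 19, "SRT")]
predicted = [(0, 9, "POI"), (10, 17, "SRT")]
print(span_prf(predicted, gold))
```

Because matching is exact, a span that is off by even one character counts as both a false positive and a false negative.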

In [None]:
from spacy.scorer import Scorer
# spaCy v2 API: spacy.gold and GoldParse were removed in spaCy v3
from spacy.gold import GoldParse
import logging
import random

logging.captureWarnings(True)

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot.get('entities'))
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

random.shuffle(TRAIN_COMBINED)
print(evaluate(nlp, TRAIN_COMBINED[:100]))