# Solution 1. SpanCategorizer by spaCy

My first solution was to train a custom Named Entity Recognition (NER) model with spaCy. spaCy ships optimized implementations of many common NLP tasks, including NER, and is often considered one of the fastest NLP frameworks in Python. Since v3.0 it also supports transformer-based pipelines. Its pretrained pipelines load a dependency parser, a NER component, and a part-of-speech tagger by default.


# Prepare data


## Prepare environment

In [None]:
!pip install jsonlines
!pip install spacy -q
!pip install -q https://github.com/explosion/spacy-models/releases/download/ru_core_news_lg-3.7.0/ru_core_news_lg-3.7.0.tar.gz
!pip install -q spacy[transformers]
!pip install thinc==8.2.3

In [None]:
import pandas as pd
import numpy as np
import jsonlines
import requests, zipfile, io

## Download data

In [None]:
url = "https://codalab.lisn.upsaclay.fr/my/datasets/download/2be26d3f-9630-46d5-8a68-414034ba4bdc"

r = requests.get(url)
if r.ok:
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(".")
else:
    print("Downloading error")

In [None]:
# Load data in pd df
train_df = pd.read_json('train.jsonl', lines=True)
dev_df = pd.read_json('dev.jsonl', lines=True)
test_df = pd.read_json('test.jsonl', lines=True)

In [None]:
train_df.head(3)

   ners                                                sentences                                           id
0  [[0, 5, CITY], [16, 23, PERSON], [34, 41, PERS...  Бостон взорвали Тамерлан и Джохар Царнаевы из ...  0
1  [[21, 28, PROFESSION], [53, 67, ORGANIZATION],...  Умер избитый до комы гитарист и сооснователь г...  1
2  [[0, 4, PERSON], [37, 42, COUNTRY], [47, 76, O...  Путин подписал распоряжение о выходе России из...  2


## Explore entities

In [None]:
# Count entities frequency
entities_count = {}
for ind, row in train_df.iterrows():
    for st, end, lab in row["ners"]:
        if lab not in entities_count.keys():
            entities_count[lab] = 0
        entities_count[lab] += 1
print(entities_count)

{'CITY': 1261, 'PERSON': 5119, 'LOCATION': 314, 'EVENT': 3335, 'AGE': 657, 'DATE': 2689, 'ORGANIZATION': 4088, 'ORDINAL': 614, 'PROFESSION': 5039, 'COUNTRY': 2510, 'NUMBER': 1107, 'CRIME': 221, 'STATE_OR_PROVINCE': 412, 'DISTRICT': 103, 'FAMILY': 24, 'AWARD': 404, 'TIME': 182, 'FACILITY': 424, 'DISEASE': 220, 'WORK_OF_ART': 270, 'LAW': 405, 'MONEY': 179, 'RELIGION': 89, 'NATIONALITY': 437, 'IDEOLOGY': 273, 'PRODUCT': 245, 'PERCENT': 68, 'LANGUAGE': 54, 'PENALTY': 92}


In [None]:
# Print top-10 frequent entities
chosen_entities = [x[0] for x in sorted(entities_count.items(), key=lambda x: -x[1])[:10]]
print(chosen_entities)

['PERSON', 'PROFESSION', 'ORGANIZATION', 'EVENT', 'DATE', 'COUNTRY', 'CITY', 'NUMBER', 'AGE', 'ORDINAL']


In [None]:
# Replace new line with space (the indexing is still the same)
train_df["sentences"] = train_df["sentences"].apply(lambda x: x.replace("\n", " "))
# Note: the test set's text column is referred to as "senences" throughout this notebook
test_df["senences"] = test_df["senences"].apply(lambda x: x.replace("\n", " "))

In [None]:
# Check whether any entity starts or ends with whitespace, or contains a double space or a newline
for idx, row in train_df.iterrows():
    text = row["sentences"]
    for st, end, lab in row["ners"]:
        span_text = text[st:end+1]
        if text[st].isspace() or text[end].isspace() or "  " in span_text or "\n" in span_text:
            print(idx, lab, span_text)

# Drop the training document (row 6) whose entity contains a double space
train_df = train_df.drop(6)

6 PERSON Ильи  Ноябрева


## Build dataset

To train a custom named entity recognition model, the dataset must be in a suitable format with proper annotations. spaCy stores annotated data in the DocBin class, so I had to create DocBin objects for the training data. A DocBin file contains documents with their text and a list of entities, each with a label, a start index, and an end index. DocBin efficiently serializes a collection of Doc objects: it is faster and produces smaller files than pickle, and it can be deserialized without executing arbitrary Python code. Note that in our dataset the end of an entity is the index of its last character (inclusive), whereas spaCy expects an exclusive end, so the code below passes end + 1.
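Judging by the first training row shown earlier ([0, 5, CITY] for "Бостон"), the stored end offset points at the entity's last character, while Python slicing and doc.char_span take an exclusive end. The sketch below illustrates the conversion in plain Python. The helper name to_exclusive is hypothetical, and the text is only the visible beginning of that first sentence.

```python
# Visible beginning of the first training sentence and its first two annotations
text = "Бостон взорвали Тамерлан и Джохар Царнаевы"
ners = [[0, 5, "CITY"], [16, 23, "PERSON"]]

def to_exclusive(spans):
    """Turn (start, inclusive_end, label) into (start, exclusive_end, label)."""
    return [(start, end + 1, label) for start, end, label in spans]

for start, end_excl, label in to_exclusive(ners):
    # With an exclusive end, a plain slice returns the whole entity text
    print(label, repr(text[start:end_excl]))
```

The same end + 1 shift is what the dataset-building code applies before calling doc.char_span.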

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
train_data, dev_data = train_test_split(train_df, test_size=0.2, random_state=42, shuffle=True)

In [None]:
import spacy
from spacy.tokens import DocBin

def make_dataset(df, save_path):
    """
    Save dataset as a .spacy (DocBin) file
    :param df: dataframe with "sentences" and "ners" columns
    :param save_path: where to store file
    :return: skipped numeric entities as [row id, start, end, label]
    """
    # Blank Russian pipeline: only the tokenizer is needed to build Docs
    nlp = spacy.blank("ru")
    doc_bin = DocBin()
    skipped_ents = []

    for idx, row in df.iterrows():
        text = row["sentences"]
        doc = nlp.make_doc(text)
        spans = []

        # Process ners (the dataset's end offset is inclusive, hence end+1)
        for start, end, label in row["ners"]:
            if label == "NUMBER" or text[start:end+1].isdigit():
                # Numeric entities are left out of training and restored by a rule at prediction time
                skipped_ents.append([idx, start, end, label])
            else:
                # "contract" keeps only tokens fully inside the character range; None if no token fits
                span = doc.char_span(start, end+1, label=label, alignment_mode="contract")

                if span is None:
                    print("Skipping entity", label, text[start: end+1])
                else:
                    spans.append(span)
        # Store the spans under the "sc" key used by spancat
        doc.spans["sc"] = spans
        doc_bin.add(doc)

    doc_bin.to_disk(save_path) # save the docbin object
    return skipped_ents

skipped_nums = make_dataset(train_data, "models/training_data.spacy")
_ = make_dataset(dev_data, "models/dev_data.spacy")



Skipping entity COUNTRY Египет
Skipping entity COUNTRY Египет
Skipping entity ORGANIZATION Роскомнадзора
Skipping entity COUNTRY Mal
Skipping entity EVENT ЧМ
Skipping entity EVENT ЧМ
Skipping entity EVENT ЧМ
Skipping entity EVENT ЧМ
Skipping entity EVENT ЧМ
Skipping entity LOCATION Енисей
Skipping entity PERSON Нобелевскую
Skipping entity ORGANIZATION Спортинг
Skipping entity ORDINAL V
Skipping entity PROFESSION ведущая
Skipping entity PROFESSION теле
Skipping entity DISEASE коронавирусом
Skipping entity PROFESSION Патриарх
Skipping entity EVENT Покушение
Skipping entity AWARD Радиомания
Skipping entity COUNTRY Белорусси
Skipping entity EVENT аварии
Skipping entity ORDINAL I
Skipping entity EVENT Олимпиады
Skipping entity EVENT Похороны
Skipping entity EVENT похороны
Skipping entity EVENT драфте
Skipping entity COUNTRY Грузия
Skipping entity EVENT ЧМ
Skipping entity EVENT ЧМ
Skipping entity EVENT ЧМ
Skipping entity EVENT ЧМ
Skipping entity ORGANIZATION BSkyB
Skipping entity ORGANIZATIO

# Train the spaCy model

spaCy's SpanCategorizer (spancat) component predicts arbitrary labeled spans, including long phrases, non-named entities, and overlapping annotations. After reviewing the available materials and the spaCy documentation [1], I decided to use the spancat model for this task.
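To see why overlapping annotations are not a problem for this kind of model, it helps to recall that spancat classifies candidate spans proposed by a suggester function rather than tagging each token once. The sketch below imitates n-gram suggestion in plain Python. It is an illustration of the idea, not spaCy's implementation, and the sizes (1, 2, 3) are an assumption, since the actual config is not shown here.

```python
def ngram_candidates(tokens, sizes=(1, 2, 3)):
    """Enumerate every contiguous token span whose length is in `sizes`.

    Spans are returned as (start, end) token offsets with an exclusive end.
    """
    candidates = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates

tokens = ["Путин", "подписал", "распоряжение"]
for start, end in ngram_candidates(tokens):
    # Candidates overlap freely; each one would be scored independently
    print(start, end, " ".join(tokens[start:end]))
```

Because every candidate is scored on its own, two spans that share tokens can both receive a label, which a token-level NER tagger cannot express.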

The next step is to build a configuration. spaCy provides a training quickstart page for generating a config file [5]. I created a base config file [1] (named base_config.cfg) with the spancat component selected and completed it with the "init fill-config" command. After checking the datasets and the config and reviewing the balance of entity labels, I trained the model for a maximum of 10,000 steps.


In [None]:
# Configuration file
!python -m spacy init fill-config models/base_config.cfg models/config.cfg

✔ Auto-filled config with all values
✔ Saved config
models/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
# Check the label balance and whether the data is ready for training
!python -m spacy debug data models/config.cfg

✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: ru
Training pipeline: tok2vec, spancat
414 training docs
104 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (414)

ℹ 110192 total word(s) in the data (25087 unique)
ℹ No word vectors present in the package

Spans Key   Labels                        
---------   ------------------------------
sc          {'COUNTRY', 'PROFESSION', 'FAMILY', 'WORK_OF_ART', 'ORDINAL', 'LAW', 'FACILITY', 'PERCENT', 'TIME', 'IDEOLOGY', 'RELIGION', 'PERSON', 'MONEY', 'ORGANIZATION', 'DISTRICT', 'AWARD', 'NATIONALITY', 'PRODUCT', 'CRIME', 'LOCATION', 'EVENT', 'PENALTY', 'DISEASE', 'DATE', 'CITY', 'LANGUAGE', 'AGE', 'STATE_OR_PROVINCE'}

⚠ Low number of examples for l

In [None]:
# Train model
!python -m spacy train models/config.cfg --output ./ --training.max_steps 10000 \
        --paths.train ./models/training_data.spacy --paths.dev ./models/dev_data.spacy --gpu-id 0

ℹ Saving to output directory: .
ℹ Using GPU: 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ------------  ----------  ----------  ----------  ------
  0       0        229.02       5795.82        0.44        0.22       39.33    0.00
  0     200       1018.57      37615.79       25.98       69.81       15.96    0.26
  0     400          8.31       8019.67       39.24       72.30       26.93    0.39
  1     600          9.83       6378.46       45.41       84.16       31.10    0.45
  1     800         11.64       6401.86       50.86       78.42       37.63    0.51
  2    1000         11.15       5279.50       51.23       81.53       37.35    0.51
  2    1200         11.33       5238.45   

# Make prediction

Early stopping was triggered at 9,000 steps. For prediction I used the best model saved during training (model-best). The lowest spancat loss was 2830, with a score of 0.59.
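The score quoted here comes from the SCORE column of the training log, which for this pipeline appears to be SPANS_SC_F divided by 100. As a sanity check on how the columns relate, the sketch below recomputes the F-score from the logged precision and recall, assuming it is the standard harmonic mean of the two. The function name f_score is hypothetical, and the two rows are copied from the log above.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (step, SPANS_SC_P, SPANS_SC_R, logged SPANS_SC_F) taken from the training log
log_rows = [
    (200, 69.81, 15.96, 25.98),
    (400, 72.30, 26.93, 39.24),
]

for step, precision, recall, logged_f in log_rows:
    # Recomputed value next to the logged one; small differences come from rounding
    print(step, round(f_score(precision, recall), 2), logged_f)
```

The early rows show high precision with low recall, so the model starts out conservative and the F-score rises mainly as recall improves.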

In [None]:
# Load resulting model
nlp_ner = spacy.load("model-best")



In [None]:
result = {}
# Make predictions for test set
for ind, row in test_df.iterrows():
    text = row["senences"]
    doc = nlp_ner(text)
    result[row["id"]] = [[]]
    seen = set()
    for ent in doc.spans['sc']:
        if ent.text in seen or len(ent.text) < 2:
            # Skip span texts that were already handled and 1-character spans
            continue
        # Label every non-overlapping occurrence of the span text in the document
        start = text.find(ent.text)
        while start != -1:
            # End offsets are inclusive, as in the training data
            result[row["id"]][0].append([start, start+len(ent.text)-1, ent.label_])
            start = text.find(ent.text, start + len(ent.text))
        seen.add(ent.text)

    # Add the NUMBER entities skipped during training: every purely numeric space-separated word
    count_len = 0
    for w in text.split(" "):
        if w.isdigit():
            result[row["id"]][0].append([count_len, count_len+len(w)-1, "NUMBER"])
        count_len += len(w) + 1

# Convert result to a dataframe
answer = pd.DataFrame.from_dict(result, columns=["ners"], orient="index")
answer = answer.reset_index().rename(columns={"index":"id"})
answer.head()

   id   ners
0  584  [[149, 156, STATE_OR_PROVINCE], [158, 167, EVE...
1  585  [[190, 200, PROFESSION], [202, 208, COUNTRY], ...
2  586  [[65, 75, CITY], [78, 85, COUNTRY], [135, 143,...
3  587  [[2, 7, CITY], [333, 341, AGE], [368, 376, PRO...
4  588  [[108, 114, PERSON], [118, 123, CITY], [147, 1...


In [None]:
output_path = "./test.jsonl"

with open(output_path, "w") as f:
    f.write(answer.to_json(
        orient='records', lines=True
        ))

In [None]:
!zip test test.jsonl

updating: test.jsonl (deflated 75%)


In [None]:
!zip -r -q model-best model-best