# spaCy Performance on Named Entity Recognition with Code-Mixed Data

In this notebook we examine how well spaCy, a general-purpose NLP library, performs on named entity recognition (NER) when the data are multilingual. More specifically, we focus on code-mixed (code-switched) data, where the vocabularies of two different languages are used interchangeably within one sentence.

#### Data Source

The data come from Computational Approaches to Linguistic Code-Switching (CALCS) and are openly accessible through the [LinCE Benchmark](https://ritual.uh.edu/lince/datasets). The specific subset used in this research is the Spanish-English (SPA-ENG) training set from the CALCS Shared Task 2018. The structure of the data is described in the first section.

#### Sentence Annotation

Since spaCy models are usually trained on monolingual data, the language-specific model used to annotate a given sentence has to be chosen case by case. The general pipeline is as follows:

1) Determine the matrix language (L1) of the sentence;
2) Choose the spaCy model for L1 to annotate the whole sentence, in which tokens of another language (L2) are possibly inserted;
3) Retrieve the named entity recognition results from the [Linguistic Features](https://spacy.io/usage/linguistic-features) built into spaCy's standard pipeline.
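
Step 1 can be sketched as a majority vote over the token-level language tags. This is only a minimal illustration: the function name `matrix_language` is hypothetical, and it assumes that only `lang1` (English) and `lang2` (Spanish) should take part in the vote, so that any other tag value in the language column cannot win.

```python
from collections import Counter


def matrix_language(lang_tags, candidates=("lang1", "lang2")):
    """Return the most frequent candidate tag, or None if none occurs."""
    # count only the tags that name one of the two languages
    counts = Counter(tag for tag in lang_tags if tag in candidates)
    if not counts:
        return None
    # on a tie, the tag that was seen first is returned
    return counts.most_common(1)[0][0]


# invented tag sequence, for illustration only
tags = ["lang1", "lang1", "lang2", "other", "other", "other", "lang1"]
print(matrix_language(tags))
```

Restricting the vote to the two language tags is a design choice: without it, a sentence made up mostly of non-language tokens would get neither language as its L1.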

#### Performance Evaluation


In [1]:
import pandas as pd
import spacy

from tqdm import tqdm

import warnings
warnings.filterwarnings("ignore")



In [2]:
# read in file
file_path = "../data/train.conll"

# empty list to store DataFrames for each sentence
corpus = []

# read the CoNLL-formatted file line by line
with open(file_path, "r", encoding="utf-8") as file:
    lines = file.readlines()

    current_sentence = []
    columns = ["word", "lang", "entity_type"]
    for line in lines:
        if line.startswith("# sent_enum"):
            # if a new sentence begins, process the current one
            if current_sentence:
                df = pd.DataFrame(current_sentence, columns=columns)
                corpus.append(df)
                current_sentence = []
        else:
            # append each line to the current sentence
            current_sentence.append(line.strip().split("\t"))

# last sentence in the file
if current_sentence:
    df = pd.DataFrame(current_sentence, columns=columns)
    corpus.append(df)
# each sentence can now be accessed as corpus[idx]
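
The loop above keeps the blank separator line as a final row of each sentence, which the tagging functions later strip with `[:-1]`. A minimal sketch of an alternative that skips blank lines while reading is shown below. It assumes the layout implied by the cell above (a `# sent_enum` header per sentence, then tab-separated `word`, `lang`, `entity_type` rows); the function name `read_conll_sentences` and the sample rows are invented for illustration.

```python
def read_conll_sentences(lines):
    """Group tab-separated token rows into sentences, skipping blank lines."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if line.startswith("# sent_enum"):
            # a header closes the previous sentence, if there is one
            if current:
                sentences.append(current)
                current = []
        elif line:
            # non-empty token row: (word, lang, entity_type)
            current.append(tuple(line.split("\t")))
    if current:
        sentences.append(current)
    return sentences


# invented sample in the assumed layout
sample = [
    "# sent_enum = 1\n",
    "I\tlang1\tO\n",
    "love\tlang1\tO\n",
    "Madrid\tlang2\tB-LOC\n",
    "\n",
    "# sent_enum = 2\n",
    "hola\tlang2\tO\n",
    "\n",
]
sentences = read_conll_sentences(sample)
print(len(sentences), sentences[0][2])
```

With this variant no separator row ends up in a sentence, so downstream code would not need to drop the last row.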

In [4]:
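# load the spaCy models for both languages (English and Spanish)
model_eng = spacy.load("en_core_web_trf")
# NOTE: doc.ents is only populated if the pipeline has an "ner" component;
# check model_spa.pipe_names, since es_dep_news_trf may not include one
model_spa = spacy.load("es_dep_news_trf")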

In [5]:
def tag_eng_sent(model_eng, corpus, sent_idx):
    """Tag a sentence whose matrix language (L1) is English.

    Return a dictionary with language tags, gold NE tags and spaCy NER results."""
    sent_df = corpus[sent_idx][:-1]  # drop the last row, produced by the blank separator line

    # extract all pre-processed tokens to a list
    gold_tokens = list(sent_df["word"])
    # regularize gold NER tags, save to list
    gold_tags = ["Yes" if tag != "O" else "O" for tag in list(sent_df["entity_type"])]
    # also save language tags
    gold_langs = list(sent_df["lang"])

    sentence_text = sent_df["word"].str.cat(sep=" ")
    doc = model_eng(sentence_text)
    nes = [i.text for i in doc.ents]
    # flatten the entity spans into a single list of tokens
    nes_tokens = [
        item for sublist in [item.split() for item in nes] for item in sublist
    ]

    if len(nes_tokens) == 0:  # check if spaCy found any NE
        spacy_tags = ["O"] * len(sent_df)
    else:
        spacy_tags = []  # list to store spaCy NER results
        for token in gold_tokens:
            if len(nes_tokens) != 0:
                if token in nes_tokens[0] or nes_tokens[0] in token:
                    spacy_tags.append("Yes")
                    nes_tokens = nes_tokens[1:]
                else:
                    spacy_tags.append("O")
            else:
                spacy_tags.append("O")

    results = {
        "mlang": "eng",
        "lang": gold_langs,
        "true_ne": gold_tags,
        "spacy_ne": spacy_tags
    }
    
    return results

In [6]:
def tag_spa_sent(model_spa, corpus, sent_idx):
    """Tag a sentence whose matrix language (L1) is Spanish.

    Return a dictionary with language tags, gold NE tags and spaCy NER results."""
    sent_df = corpus[sent_idx][:-1]  # drop the last row, produced by the blank separator line

    # extract all pre-processed tokens to a list
    gold_tokens = list(sent_df["word"])
    # regularize gold NER tags, save to list
    gold_tags = ["Yes" if tag != "O" else "O" for tag in list(sent_df["entity_type"])]
    # also save language tags
    gold_langs = list(sent_df["lang"])

    sentence_text = sent_df["word"].str.cat(sep=" ")
    doc = model_spa(sentence_text)
    nes = [i.text for i in doc.ents]
    # flatten the entity spans into a single list of tokens
    nes_tokens = [
        item for sublist in [item.split() for item in nes] for item in sublist
    ]

    if len(nes_tokens) == 0:  # check if spaCy found any NE
        spacy_tags = ["O"] * len(sent_df)
    else:
        spacy_tags = []  # list to store spaCy NER results
        for token in gold_tokens:
            if len(nes_tokens) != 0:
                if token in nes_tokens[0] or nes_tokens[0] in token:
                    spacy_tags.append("Yes")
                    nes_tokens = nes_tokens[1:]
                else:
                    spacy_tags.append("O")
            else:
                spacy_tags.append("O")

    results = {
        "mlang": "spa",
        "lang": gold_langs,
        "true_ne": gold_tags,
        "spacy_ne": spacy_tags
    }
    
    return results
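
Both tagging functions align spaCy's entities to the gold tokens with a substring test (`token in nes_tokens[0] or nes_tokens[0] in token`). That test can misfire: a gold token such as `a` is a substring of an entity token such as `Ana`, so it would be tagged and would consume the entity token. One possible alternative, sketched below under stated assumptions, aligns by character offsets instead. It relies on the sentence text being the gold tokens joined by single spaces, as in the functions above, and on entity spans given as `(start_char, end_char)` pairs, which spaCy exposes as `ent.start_char` and `ent.end_char`. The function name `align_entities` and the example spans are hypothetical.

```python
def align_entities(tokens, ent_spans):
    """Tag a token "Yes" if its character span overlaps any entity span.

    tokens: gold tokens, assumed to be joined with single spaces.
    ent_spans: (start_char, end_char) pairs with exclusive end.
    """
    tags = []
    pos = 0
    for token in tokens:
        start, end = pos, pos + len(token)
        overlaps = any(start < e_end and e_start < end for e_start, e_end in ent_spans)
        tags.append("Yes" if overlaps else "O")
        pos = end + 1  # skip the single joining space
    return tags


tokens = ["I", "met", "Ana", "in", "New", "York", "."]
text = " ".join(tokens)
# hypothetical entity spans, written by hand instead of coming from a model
ent_spans = [
    (text.index("Ana"), text.index("Ana") + len("Ana")),
    (text.index("New"), text.index("New") + len("New York")),
]
print(align_entities(tokens, ent_spans))
```

Because the alignment depends only on offsets, it gives the same tags even when spaCy tokenizes the text differently from the gold annotation.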

In [7]:
def most_frequent_element(lst):
    most_frequent = max(set(lst), key=lst.count)

    return most_frequent


ner_results = []
for i in tqdm(range(len(corpus)), desc="Processing"):
    # find the dominant language (lang1=eng, lang2=spa)
    lang_tags = list(corpus[i]["lang"])
    mlang = most_frequent_element(lang_tags)
    if mlang == "lang1":
        ner_results.append(tag_eng_sent(model_eng=model_eng, corpus=corpus, sent_idx=i))
    else:
        ner_results.append(tag_spa_sent(model_spa=model_spa, corpus=corpus, sent_idx=i))

### Error Analysis 1

What proportion of ordinary (non-NE) inserted L2 words are falsely tagged as named entities?

In [9]:
# target_word_idxs: list of indices of inserted L2 tokens that are not NEs in each sentence
target_word_idxs = []
for result in ner_results:
    if result["mlang"] == "eng":
        # get CS Spanish token index
        cs_idx = [i for i in range(len(result["lang"])) if result["lang"][i] == "lang2"]
        # remove CS Spanish tokens that are actually NEs
        cs_ne_idx = [idx for idx in cs_idx if result["true_ne"][idx] == "O"]
        target_word_idxs.append(cs_ne_idx)
    elif result["mlang"] == "spa":
        # get CS English token index
        cs_idx = [i for i in range(len(result["lang"])) if result["lang"][i] == "lang1"]
        # remove CS English tokens that are actually NEs
        cs_ne_idx = [idx for idx in cs_idx if result["true_ne"][idx] == "O"]
        target_word_idxs.append(cs_ne_idx)

cs_fauxne = []  # [(CS tokens count, CS tokens tagged as NE count) of sent_1, ...]
# count the inserted L2 tokens that spaCy falsely tagged as NEs
for i in range(len(target_word_idxs)):
    if len(target_word_idxs[i]) > 0:
        cs_count = len(target_word_idxs[i])
        cs_as_ne_count = len(
            [j for j in target_word_idxs[i] if ner_results[i]["spacy_ne"][j] != "O"]
        )
        cs_fauxne.append((cs_count, cs_as_ne_count))

In [10]:
all_cs_count = sum(t[0] for t in cs_fauxne)
all_cs_as_ne_count = sum(t[1] for t in cs_fauxne)

all_cs_as_ne_count / all_cs_count

0.4294302626711062

### Error Analysis 2

What proportion of falsely tagged tokens are ordinary inserted non-NE L2 words?

Namely: how many errors are caused by code-switching?

In [11]:
# target_word_idxs: list of indices of falsely tagged tokens by spaCy
target_word_idxs = []
for result in ner_results:
    spacy_wrong_ne_idx = [
        i
        for i, (elem1, elem2) in enumerate(zip(result["spacy_ne"], result["true_ne"]))
        if elem1 != elem2
    ]
    target_word_idxs.append(spacy_wrong_ne_idx)

fauxne_at_cs = []  # [(falsely tagged NE count, error on CS position count) of sent_1, ...]
for i in range(len(target_word_idxs)):
    if len(target_word_idxs[i]) > 0:
        fauxne_count = len(target_word_idxs[i])
        sentence = ner_results[i]

        if sentence["mlang"] == "eng":
            fauxne_at_cs_count = len([j for j in target_word_idxs[i] if sentence["lang"][j] == "lang2"])
        elif sentence["mlang"] == "spa":
            fauxne_at_cs_count = len([j for j in target_word_idxs[i] if sentence["lang"][j] == "lang1"])
        fauxne_at_cs.append((fauxne_count, fauxne_at_cs_count))

In [12]:
all_fauxne_count = sum(t[0] for t in fauxne_at_cs)
all_fauxne_at_cs_count = sum(t[1] for t in fauxne_at_cs)

all_fauxne_at_cs_count / all_fauxne_count

0.13780409502746666
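
To make the ratio above concrete, here is a minimal sketch of the same pooled computation on an invented toy sentence. It follows the cells above in two respects that are worth keeping in mind when reading the number: `spacy_ne != true_ne` counts both spurious and missed entities as errors, and the position test looks only at the language tag, not at the gold NE tag. The names `EMBEDDED` and `error_share_at_switch` are hypothetical.

```python
# matrix language -> tag of the embedded (inserted) language
EMBEDDED = {"eng": "lang2", "spa": "lang1"}


def error_share_at_switch(results):
    """Share of all tagging errors that fall on embedded-language tokens."""
    errors = at_switch = 0
    for r in results:
        embedded = EMBEDDED[r["mlang"]]
        for lang, gold, pred in zip(r["lang"], r["true_ne"], r["spacy_ne"]):
            if gold != pred:
                errors += 1
                at_switch += lang == embedded
    return at_switch / errors if errors else float("nan")


# invented toy sentence, not taken from the corpus
toy = [{
    "mlang": "eng",
    "lang": ["lang1", "lang2", "lang1", "lang1"],
    "true_ne": ["Yes", "O", "O", "Yes"],
    "spacy_ne": ["O", "Yes", "O", "Yes"],
}]
print(error_share_at_switch(toy))
```

In the toy sentence one of the two errors sits on an embedded token, so the share is one half.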

### Error Analysis 3

What proportion of inserted L2 tokens that are actually NEs are correctly identified as NEs by the L1 model?

In [13]:
# target_word_idxs: list of indices of inserted L2 words that are NEs by gold standard
target_word_idxs = []
for result in ner_results:
    if result["mlang"] == "eng":
        # get CS Spanish token index
        cs_idx = [i for i in range(len(result["lang"])) if result["lang"][i] == "lang2"]
        # keep CS Spanish tokens that are actually NEs
        cs_ne_idx = [idx for idx in cs_idx if result["true_ne"][idx] != "O"]
        target_word_idxs.append(cs_ne_idx)
    elif result["mlang"] == "spa":
        # get CS English token index
        cs_idx = [i for i in range(len(result["lang"])) if result["lang"][i] == "lang1"]
        # keep CS English tokens that are actually NEs
        cs_ne_idx = [idx for idx in cs_idx if result["true_ne"][idx] != "O"]
        target_word_idxs.append(cs_ne_idx)

csne_as_ne = []  # [(L2 tokens = NE count, NE-L2 tokens as NE count) of sent_1, ...]
for i in range(len(ner_results)):
    if len(target_word_idxs[i]) > 0:
        l2ne_count = len(target_word_idxs[i])
        l2ne_as_ne_count = len([j for j in target_word_idxs[i] if ner_results[i]["spacy_ne"][j] == "Yes"])

        csne_as_ne.append((l2ne_count, l2ne_as_ne_count))

In [14]:
all_l2ne_count = sum(t[0] for t in csne_as_ne)
all_l2ne_as_ne_count = sum(t[1] for t in csne_as_ne)

all_l2ne_as_ne_count / all_l2ne_count

0.6963265306122449
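
Error Analyses 1 and 3 can be read off a single confusion table restricted to the inserted L2 tokens: Analysis 1 is FP / (FP + TN) and Analysis 3 is TP / (TP + FN). The sketch below illustrates this on an invented toy sentence; the names `EMBEDDED` and `embedded_confusion` are hypothetical, and the dictionaries are assumed to have the same keys as the entries of `ner_results`.

```python
from collections import Counter

# matrix language -> tag of the embedded (inserted) language
EMBEDDED = {"eng": "lang2", "spa": "lang1"}


def embedded_confusion(results):
    """Count TP/FN/FP/TN over embedded-language tokens only."""
    counts = Counter()
    for r in results:
        embedded = EMBEDDED[r["mlang"]]
        for lang, gold, pred in zip(r["lang"], r["true_ne"], r["spacy_ne"]):
            if lang != embedded:
                continue
            if gold != "O":
                counts["tp" if pred != "O" else "fn"] += 1
            else:
                counts["fp" if pred != "O" else "tn"] += 1
    return counts


# invented toy sentence, not taken from the corpus
toy = [{
    "mlang": "eng",
    "lang": ["lang1", "lang2", "lang2", "lang2", "lang2"],
    "true_ne": ["O", "Yes", "Yes", "O", "O"],
    "spacy_ne": ["O", "Yes", "O", "Yes", "O"],
}]
c = embedded_confusion(toy)
false_ne_rate = c["fp"] / (c["fp"] + c["tn"])  # Error Analysis 1
ne_recall = c["tp"] / (c["tp"] + c["fn"])      # Error Analysis 3
print(false_ne_rate, ne_recall)
```

Keeping the four counts together also makes it easy to derive further ratios from them without another pass over the results.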