<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/04-multilingual-ner/multilingual_named_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Multilingual Named Entity Recognition

In this notebook, we explore how a single Transformer model, XLM-RoBERTa, can be fine-tuned to
perform named entity recognition (NER) across several languages. NER is a common NLP task that identifies
entities such as people, organizations, or locations in text. These entities can be used for applications such as
gaining insights from company documents, improving the quality of search engines, or building a
structured database from a corpus.

##Setup

In [None]:
%%shell

pip -q install transformers
pip -q install datasets

In [8]:
import pandas as pd
import numpy as np

from datasets import get_dataset_config_names
from datasets import load_dataset

from IPython.display import HTML, display, set_matplotlib_formats

In [4]:
def display_df(df, max_cols=15, header=True, index=True):
    # 15 columns seems to be the display limit for O'Reilly
    return display(HTML(df.to_html(header=header, index=index, max_cols=max_cols)))

##The Dataset

We will use a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders
(XTREME) benchmark called WikiANN, or PAN-X. This dataset consists of Wikipedia articles in many
languages, including the four most commonly spoken languages in Switzerland: German (62.9%), French (22.9%),
Italian (8.4%), and English (5.9%).
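As a minimal sketch of how the four language subsets might be selected, the snippet below filters a list of configuration names down to the PAN-X ones and builds the names for German, French, Italian, and English. The helper names `filter_panx` and `panx_config_name` are hypothetical, and `example_configs` is an illustrative excerpt rather than the real list. The commented lines at the end assume the benchmark is available on the Hugging Face Hub under the name `xtreme` with configurations of the form `PAN-X.<language code>`; whether that loads as written may depend on the installed `datasets` version.

```python
def filter_panx(config_names):
    """Keep only the configuration names that belong to PAN-X."""
    return [name for name in config_names if name.startswith("PAN-X.")]


def panx_config_name(lang_code):
    """Build the assumed configuration name for one language, e.g. 'PAN-X.de'."""
    return f"PAN-X.{lang_code}"


# Illustrative excerpt only; the real list is much longer.
example_configs = ["XNLI", "PAN-X.de", "PAN-X.fr", "PAN-X.it", "PAN-X.en", "MLQA.en.en"]
panx_configs = filter_panx(example_configs)

# ISO 639-1 codes for German, French, Italian, and English.
langs = ["de", "fr", "it", "en"]
wanted = [panx_config_name(lang) for lang in langs]

# With network access, the real names and data could be fetched roughly like this
# (assumption: the benchmark is hosted on the Hub as "xtreme"):
# configs = get_dataset_config_names("xtreme")
# de = load_dataset("xtreme", name="PAN-X.de")
```

Keeping the name handling in small pure functions means the selection logic can be checked without downloading anything.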

Each article is annotated with LOC (location), PER (person), and ORG
(organization) tags in the “inside-outside-beginning” (IOB2) format, where a B- prefix indicates the beginning of
an entity, and consecutive tokens belonging to the same entity are given an I- prefix. An O tag indicates that the token does
not belong to any entity.

For example, the following sentence is shown with its IOB2 tags:

In [7]:
tokens = "Jeff Dean is a computer scientist at Google in California".split()
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]

df = pd.DataFrame(data=[tokens, labels], index=["Tokens", "Tags"])
display_df(df, header=None)

| Tokens | Jeff | Dean | is | a | computer | scientist | at | Google | in | California |
|---|---|---|---|---|---|---|---|---|---|---|
| Tags | B-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC |
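To make the IOB2 scheme concrete, the tags above can be decoded back into entity spans. The following is a minimal sketch, not part of the original notebook: the function name `iob2_to_spans` is hypothetical, and it assumes that an I- tag which does not continue an open entity of the same type is simply ignored.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity (assumption: ignored).
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


tokens = "Jeff Dean is a computer scientist at Google in California".split()
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG", "O", "B-LOC"]
spans = iob2_to_spans(tokens, labels)
print(spans)
```

For the example sentence this groups "Jeff" and "Dean" into a single PER entity, which is the purpose of distinguishing the B- and I- prefixes.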
