# Bootstrapping named entity recognition models without labelled data: A weak supervision approach

The purpose of this work is to bootstrap high-quality NER models when we do not have access to training data for the target domain. See the paper for the theoretical details and related work.

## Before you start:

You should first make sure that the following Python packages are installed:
- `spacy` (version >= 2.2)
- `hmmlearn`
- `snips-nlu-parsers`
- `pandas`
- `numba`

You should also install the `en_core_web_sm` and `en_core_web_md` models in Spacy.

To run the neural models in `ner.py`, you also need `pytorch`, `cupy`, `keras` and `tensorflow` installed.

To run the baselines, you will also need to have `snorkel` installed.

Finally, you also need to download the following files and add them to the `data` directory:
- [`conll2003_spacy.tar.gz`](https://github.com/anonymous-NLP/weak-supervision-for-NER/releases/download/acl2020/conll2003_spacy.tar.gz) (unpack the archive in the same directory)
- [`BTC_spacy.tar.gz`](https://github.com/anonymous-NLP/weak-supervision-for-NER/releases/download/acl2020/BTC_spacy.tar.gz) (same)
- [`SEC_spacy.tar.gz`](https://github.com/anonymous-NLP/weak-supervision-for-NER/releases/download/acl2020/SEC_spacy.tar.gz) (same)
- [`wikidata.json`](https://github.com/anonymous-NLP/weak-supervision-for-NER/releases/download/acl2020/wikidata.json)
- [`wikidata_small.json`](https://github.com/anonymous-NLP/weak-supervision-for-NER/releases/download/acl2020/wikidata_small.json)
- [`crunchbase.json`](https://github.com/anonymous-NLP/weak-supervision-for-NER/releases/download/acl2020/crunchbase.json)
- [`conll2003.docbin`](https://github.com/anonymous-NLP/weak-supervision-for-NER/releases/download/acl2020/conll2003.docbin)
   
    

## Introduction

In [6]:
news_text  = """ATLANTA  (Reuters) - Retailer Best Buy Co, seeking new ways to appeal to cost-conscious shoppers, said on Tuesday it is selling refurbished 
 versions of Apple Inc's iPhone 3G at its stores that are priced about $50 less than new iPhones. 
 The electronics chain said the used iPhones, which were returned within 30 days of purchase, are priced at $149 for the model with 8 gigabytes of storage, 
 while the 16-gigabyte version is $249. A two-year service contract with AT&T Inc is required. New iPhone 3Gs currently sell for $199 and $299 at 
 Best Buy Mobile stores. "This is focusing on customers' needs, trying to provide as wide a range of products and networks for our consumers," said 
 Scott Moore, vice president of marketing for Best Buy Mobile. Buyers of first-generation iPhones can also upgrade to the faster refurbished 3G models at 
 Best Buy, he said. Moore said AT&T, the exclusive wireless provider for the iPhone, offers refurbished iPhones online. The sale of used iPhones comes as 
 Best Buy, the top consumer electronics chain, seeks ways to fend off increased competition from discounters such as Wal-Mart Stores Inc, which began 
 selling the popular phone late last month. Wal-Mart sells a new 8-gigabyte iPhone 3G for $197 and $297 for the 16-gigabyte model. The iPhone is also 
 sold at Apple stores and AT&T stores. Moore said Best Buy's move was not in response to other retailers' actions. (Reporting by  Karen Jacobs ; Editing 
 by  Andre Grenon )"""

import re
news_text = re.sub(r'\s+', ' ', news_text)

To get things started, let's look at the named entities recognised by a standard NER model (from Spacy):

In [7]:
import spacy, annotations

# We load the spacy model
nlp = spacy.load("en")
doc = nlp(news_text)

# Visualising the entities
annotations.display_entities(doc)

<br>
As we can see from the results above, the named entity recognition contains quite a lot of errors. Atlanta is strangely labelled as an organisation, while "Best Buy" is ignored at several places in the document. The iPhone is also labelled either as an organisation or even as a location.

A slightly larger neural model (again from Spacy) works better, but still contains quite a few errors and omissions:

In [9]:
import spacy, annotations

# We load the spacy model (takes a few seconds)
nlp = spacy.load("en_core_web_md")
doc = nlp(news_text)

# Visualising the entities
annotations.display_entities(doc)

Ideally, one would wish to train a better named entity recognition model, which is better tailored to the specific needs and linguistic patterns found in these articles. However, although raw text data is not difficult to acquire, we often do not have access to labelled data. To address this issue, we developed an alternative approach based on __weak supervision__, combining several (noisy) supervision sources instead of relying on a single "gold standard". Indeed, we do have access to several possible supervision sources, such as alternative NER models trained on other corpora, large lists of entities (companies, person names, geographical locations), shallow linguistic patterns, and document-level constraints.

The key idea behind the proposed approach is thus to (1) use these supervision sources to automatically annotate news corpora, (2) estimate a label model (more precisely an HMM model) that unifies all these sources into a single one, and (3) learn a new NER model based on these unified labels. <br>

__Outline of this notebook__: We describe below the various labelling functions.  We then explain how these various sources can be merged into a single source. Finally, we detail the architecture behind the NER model.

## __Step 1:__ Labelling functions

### 1) Other data-driven NER models

A first source of automatic annotation comes from NER models trained on multiple, distinct corpora. We went through [available NE-labelled corpora](https://github.com/juand-r/entity-recognition-datasets) to search for datasets that could be used to train alternative models. We then trained Spacy models for all of them, and then conducted some experiments to assess their performance. At the end of the process, we ended up with four models:
- The standard Spacy model for English (`en_core_web_md`), trained on Ontonotes v5
- A model trained on [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)
- A model trained on the [Broad Twitter Corpus](https://github.com/GateNLP/broad_twitter_corpus)
- A model trained on a corpus of [SEC filings](https://www.aclweb.org/anthology/U15-1010/).
    
Note that there are differences between the entity labels of these models: while Ontonotes contains no fewer than [18 classes](https://spacy.io/api/annotation#named-entities), the other corpora only contain `PER(SON)`, `ORG`, `LOC` and `MISC`. Furthermore, the labels do not match each other perfectly: while Ontonotes distinguishes between geopolitical locations (`GPE`) and "natural" locations (such as continents, seas etc., labelled as `LOC`), the three other models group all geographical entities under `LOC`.

We can apply annotations from a Spacy model using the `ModelAnnotator` class:

In [9]:
import annotations
annotator = annotations.ModelAnnotator("./data/conll2003/", "conll2003")

doc = annotator.annotate(doc)
annotations.display_entities(doc, "conll2003")

loading ./data/conll2003/...done


As we can see, the results are not perfect on this model either, but the errors are distinct from the ones made by the Ontonotes model. 

The annotations are written in the `user_data` dictionary of the Spacy document:

In [10]:
doc.user_data["annotations"]["conll2003"]

{(3, 4): (('ORG', 1.0),),
 (7, 10): (('ORG', 1.0),),
 (31, 33): (('ORG', 1.0),),
 (34, 35): (('ORG', 1.0),),
 (49, 50): (('ORG', 1.0),),
 (57, 58): (('ORG', 1.0),),
 (97, 99): (('ORG', 1.0),),
 (102, 104): (('ORG', 1.0),),
 (114, 117): (('ORG', 1.0),),
 (145, 147): (('PER', 1.0),),
 (153, 156): (('ORG', 1.0),),
 (162, 163): (('ORG', 1.0),),
 (174, 176): (('ORG', 1.0),),
 (180, 181): (('PER', 1.0),),
 (182, 183): (('PER', 1.0),),
 (190, 191): (('ORG', 1.0),),
 (194, 195): (('ORG', 1.0),),
 (201, 202): (('ORG', 1.0),),
 (204, 206): (('ORG', 1.0),),
 (224, 225): (('ORG', 1.0),),
 (226, 229): (('ORG', 1.0),),
 (240, 241): (('LOC', 1.0),),
 (242, 243): (('LOC', 1.0),),
 (247, 248): (('ORG', 1.0),),
 (262, 263): (('ORG', 1.0),),
 (267, 268): (('ORG', 1.0),),
 (270, 271): (('LOC', 1.0),),
 (273, 274): (('PER', 1.0),),
 (275, 277): (('ORG', 1.0),),
 (292, 294): (('PER', 1.0),),
 (297, 299): (('PER', 1.0),)}
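
The dictionary shown above maps `(start, end)` token offsets to tuples of `(label, probability)` pairs. As a minimal sketch of how such a dictionary could be turned into token-level BIO tags, assuming the end offset is exclusive (the helper `spans_to_bio` is hypothetical and not part of the repository):

```python
def spans_to_bio(spans, nb_tokens):
    """Convert {(start, end): ((label, prob), ...)} into a list of BIO tags.

    Assumes token offsets with an exclusive end, and keeps the most
    probable label when a span has several alternatives.
    """
    tags = ["O"] * nb_tokens
    for (start, end), alternatives in sorted(spans.items()):
        label, _ = max(alternatives, key=lambda pair: pair[1])
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


# Toy excerpt in the same format as the output above
example = {(3, 4): (("ORG", 1.0),), (7, 10): (("ORG", 1.0),)}
bio_tags = spans_to_bio(example, 11)
print(bio_tags)
```

Overlapping spans are not handled here; a later span would simply overwrite the tags of an earlier one.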

Each `ModelAnnotator` adds two annotation sources: one that is directly based on the Spacy model (here `conll2003`), and one that also includes the corrections specified in the method `_correct_entities` (in `spacy_wrapper.py`) that we implemented earlier this year.  The corrected versions are indicated with a `+c` suffix.

Here are the results from the three other models:

In [11]:
import annotations
core_web_annotator = annotations.ModelAnnotator("en_core_web_md", "core_web_md")
btc_annotator = annotations.ModelAnnotator("data/BTC", "BTC")
sec_annotator = annotations.ModelAnnotator("data/SEC-filings/", "SEC")

doc = core_web_annotator.annotate(doc)
doc = btc_annotator.annotate(doc)
doc = sec_annotator.annotate(doc)
annotations.display_entities(doc, "core_web_md+c")
annotations.display_entities(doc, "BTC+c")
annotations.display_entities(doc, "SEC+c")

loading en_core_web_md...done
loading data/BTC...done
loading data/SEC-filings/...done


__Note__: When annotating large collections of news documents, the method `annotator.pipe(news_docs)` is much more efficient than calling `annotate(...)` every single time, as it batches the documents on which to run the NER model.

### 2) Gazetteers

Another useful source of annotation comes from large lists of entities such as persons, places and organisations. The gazetteers use a _trie_ to efficiently search for occurrences in the text. Each gazetteer creates two annotation sources: one that is case-sensitive (`_cased` suffix) and one that is case-insensitive (`_uncased` suffix).
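
To illustrate the idea, here is a minimal sketch of a token-level trie lookup with a cased and an uncased variant. The functions `build_trie` and `find_matches` are hypothetical names for this illustration and do not reproduce the actual `GazetteerAnnotator` implementation (which, among other things, can attach several labels to one entry):

```python
def build_trie(entries):
    """Build a nested-dict trie from (token sequence, label) pairs."""
    trie = {}
    for tokens, label in entries:
        node = trie
        for token in tokens:
            node = node.setdefault(token, {})
        node[None] = label  # None never clashes with a token string
    return trie


def find_matches(tokens, trie, case_sensitive=True):
    """Return the longest match starting at each position as (start, end, label)."""
    matches = []
    i = 0
    while i < len(tokens):
        node, best = trie, None
        for j in range(i, len(tokens)):
            key = tokens[j] if case_sensitive else tokens[j].lower()
            if key not in node:
                break
            node = node[key]
            if None in node:
                best = (i, j + 1, node[None])
        if best:
            matches.append(best)
            i = best[1]  # continue after the matched span
        else:
            i += 1
    return matches


entries = [(("Best", "Buy"), "COMPANY"), (("Apple", "Inc"), "COMPANY"), (("Atlanta",), "GPE")]
cased_trie = build_trie(entries)
# The uncased variant stores lowercased entries and lowercases the text at lookup time
uncased_trie = build_trie([(tuple(t.lower() for t in toks), lab) for toks, lab in entries])

tokens = "Retailer Best Buy said apple inc opened in ATLANTA".split()
cased = find_matches(tokens, cased_trie)
uncased = find_matches(tokens, uncased_trie, case_sensitive=False)
print(cased)
print(uncased)
```

The lookup walks the trie once per start position, so its cost depends on the length of the longest entry rather than on the number of entries in the gazetteer.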


#### 2.1 Wikipedia
The database from Wikipedia is extracted from the [NECKar](https://event.ifi.uni-heidelberg.de/?page_id=532) dataset.  The postprocessing (which, among other things, filters out entities that are also relatively common English words) is implemented in `compile_wikidata`. In addition, we also extracted from Wikidata a list of commercial products and added them to the gazetteer.

In [12]:
annotator = annotations.GazetteerAnnotator(annotations.WIKIDATA, "wiki")

annotator.annotate(doc)
annotations.display_entities(doc, "wiki_cased")
annotations.display_entities(doc, "wiki_uncased")

Extracting data from ./data/wikidata.json
Populating trie for entity class PERSON (number: 2626849)
Populating trie for entity class LOC (number: 47129)
Populating trie for entity class GPE (number: 602953)
Populating trie for entity class ORG (number: 295768)
Populating trie for entity class PRODUCT (number: 12457)


Again, the annotation model does make some errors: `Moore` is thought to be a [geopolitical entity](https://en.wikipedia.org/wiki/Moore) instead of a person. Note that `AT&T` has two alternative labels: `ORG` or `GPE` (see [AT&T station](https://en.wikipedia.org/wiki/AT%26T_(SEPTA_station))).  

In addition to the full wiki data, we also added a specific gazetteer that only employs wikidata objects containing a text description:

In [13]:
annotator = annotations.GazetteerAnnotator(annotations.WIKIDATA_SMALL, "wiki_small")

annotator.annotate(doc)
annotations.display_entities(doc, "wiki_small_cased")
annotations.display_entities(doc, "wiki_small_uncased")

Extracting data from ./data/wikidata_small.json
Populating trie for entity class PERSON (number: 1865813)
Populating trie for entity class LOC (number: 14250)
Populating trie for entity class GPE (number: 273743)
Populating trie for entity class ORG (number: 91423)
Populating trie for entity class PRODUCT (number: 12457)


As we can see, the second gazetteer has a higher precision than the first (at the cost of a lower coverage).

#### 2.2 Crunchbase

The second gazetteer is extracted from the [Open Data Map from Crunchbase](https://data.crunchbase.com/docs/open-data-map), which contains lists of both organisations and (business) persons.

In [14]:
import annotations
annotator = annotations.GazetteerAnnotator(annotations.CRUNCHBASE, "crunchbase")

annotator.annotate(doc)
annotations.display_entities(doc, "crunchbase_cased")

Extracting data from ./data/crunchbase.json
Populating trie for entity class COMPANY (number: 788942)
Populating trie for entity class ORG (number: 263)
Populating trie for entity class PERSON (number: 1062669)


#### 2.3 Geonames

The [geonames](http://www.geonames.org) database contains a large list of locations, including both geopolitical entities and "natural" locations:

In [15]:
annotator = annotations.GazetteerAnnotator(annotations.GEONAMES, "geo")

annotator.annotate(doc)
annotations.display_entities(doc, "geo_uncased")

Extracting data from ./data/geonames.json
Populating trie for entity class GPE (number: 15205)


Note that, unlike the other gazetteers, the Crunchbase annotator from Section 2.2 explicitly marks the detected companies with the label `COMPANY` instead of the more generic `ORG`.

#### 2.4 Product names

Finally, we used [DBPedia](http://www.dbpedia.org) to extract a list of products and brands, since the recognition of products is particularly poor in Spacy NER models:

In [16]:
annotator = annotations.GazetteerAnnotator(annotations.PRODUCTS, "product")

annotator.annotate(doc)
annotations.display_entities(doc, "product_cased")

Extracting data from ./data/products.json
Populating trie for entity class PRODUCT (number: 45345)


#### 2.5 Other entities

Finally, we also run a detector using handcrafted lists of countries, languages, nationalities, and religious/political groups.

In [39]:
# Detection of misc entities (`exclusives` is the list of mutually exclusive sources defined in Section 3 below)
misc_annotator = annotations.FunctionAnnotator(annotations.misc_generator, "misc_detector", to_exclude=exclusives)
annotations.display_entities(doc, "misc_detector")

### 3. Shallow patterns

Some named entities can also be captured through relatively simple, handcrafted patterns defined on the Spacy document. The class `FunctionAnnotator` makes it easy to define an annotator based on a function that takes a Spacy document as input and generates text spans with a label. Relations of mutual exclusivity between annotation sources can also be specified in the annotator. For instance, we can specify that numbers that are part of a date, time or money span should be ignored by the "number_detector" (to avoid having e.g. the `21` in `October 21` labelled as a `CARDINAL`):

In [17]:
date_annotator = annotations.FunctionAnnotator(annotations.date_generator, "date_detector")
time_annotator = annotations.FunctionAnnotator(annotations.time_generator, "time_detector")
money_annotator = annotations.FunctionAnnotator(annotations.money_generator, "money_detector")
exclusives = ["date_detector", "time_detector", "money_detector"]
number_annotator = annotations.FunctionAnnotator(annotations.number_generator, "number_detector", exclusives)

date_annotator.annotate(doc)
time_annotator.annotate(doc)
money_annotator.annotate(doc)
number_annotator.annotate(doc)
annotations.display_entities(doc, "date_detector")
annotations.display_entities(doc, "time_detector")
annotations.display_entities(doc, "money_detector")
annotations.display_entities(doc, "number_detector")
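
The mutual exclusivity mechanism described above can be sketched as follows. This is a simplified illustration on plain token lists, not the actual `FunctionAnnotator` code: the generators, the `annotate` helper and the `layers` dictionary are all hypothetical.

```python
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}


def date_generator(tokens):
    """Yield (start, end, label) for patterns such as 'October 21'."""
    for i in range(len(tokens) - 1):
        if tokens[i] in MONTHS and tokens[i + 1].isdigit():
            yield i, i + 2, "DATE"


def number_generator(tokens):
    """Yield (start, end, label) for every purely numeric token."""
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            yield i, i + 1, "CARDINAL"


def annotate(tokens, generator, layers, name, to_exclude=()):
    """Store the spans from `generator` under `name`, skipping any span
    that overlaps with a span produced by one of the excluded sources."""
    blocked = [span for src in to_exclude for span in layers.get(src, {})]
    result = {}
    for start, end, label in generator(tokens):
        if any(start < b_end and b_start < end for b_start, b_end in blocked):
            continue
        result[(start, end)] = label
    layers[name] = result
    return layers


tokens = "The 3 stores opened on October 21".split()
layers = {}
annotate(tokens, date_generator, layers, "date_detector")
annotate(tokens, number_generator, layers, "number_detector", to_exclude=["date_detector"])
print(layers)
```

The order of the calls matters in this sketch: the excluded sources must have been applied before the source that excludes them.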

We have also created a range of patterns aiming to improve the _detection_ of named entities, even though they leave the actual label underspecified (as a generic `ENT` label). Four such detectors are constructed:
- two detectors of proper names based on casing (marking sequences of tokens whose lemmas are "titled" as potential named entities)
- one detector of NNP sequences (based on the Spacy POS tagger)
- and one detector of sequences with proper names linked with "compound" dependency relations

In [18]:
# Detection based on casing
proper_detector = annotations.SpanGenerator(annotations.is_likely_proper)
proper_annotator = annotations.FunctionAnnotator(proper_detector, "proper_detector", to_exclude=exclusives)

# Detection based on casing, but allowing some lowercased tokens
proper2_detector = annotations.SpanGenerator(annotations.is_likely_proper, exceptions=annotations.LOWERCASED_TOKENS)
proper2_annotator = annotations.FunctionAnnotator(proper2_detector, "proper2_detector",  to_exclude=exclusives)
        
# Detection based on part-of-speech tags
nnp_detector = annotations.SpanGenerator(lambda tok: tok.tag_=="NNP")
nnp_annotator = annotations.FunctionAnnotator(nnp_detector, "nnp_detector", to_exclude=exclusives)
        
# Detection based on dependency relations (compound phrases)
compound_detector = annotations.SpanGenerator(lambda x: annotations.is_likely_proper(x) and annotations.in_compound(x))
compound_annotator = annotations.FunctionAnnotator(compound_detector, "compound_detector", to_exclude=exclusives)

proper_annotator.annotate(doc)
proper2_annotator.annotate(doc)
nnp_annotator.annotate(doc)
compound_annotator.annotate(doc)

annotations.display_entities(doc, "proper_detector")
annotations.display_entities(doc, "proper2_detector")
annotations.display_entities(doc, "nnp_detector")
annotations.display_entities(doc, "compound_detector")
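
As a rough sketch of what a predicate-based span generator does, the function below groups consecutive tokens that satisfy a predicate into `ENT` spans, optionally allowing a few exception tokens inside a span. It is an assumption about the general behaviour, written on plain strings; the actual `SpanGenerator` works on Spacy tokens and may differ in its details.

```python
def generate_spans(tokens, predicate, exceptions=frozenset()):
    """Yield (start, end, "ENT") for maximal runs of tokens satisfying `predicate`.

    Tokens listed in `exceptions` may appear inside a run, but never at
    its start or end.
    """
    start = None
    last_ok = None
    for i, tok in enumerate(tokens):
        if predicate(tok):
            if start is None:
                start = i
            last_ok = i
        elif tok in exceptions and start is not None:
            continue  # tolerated inside a run, as long as the run resumes
        else:
            if start is not None:
                yield start, last_ok + 1, "ENT"
            start = None
    if start is not None:
        yield start, last_ok + 1, "ENT"


tokens = "the Bank of England and Scott Moore said".split()
strict = list(generate_spans(tokens, str.istitle))
relaxed = list(generate_spans(tokens, str.istitle, exceptions={"of"}))
print(strict)
print(relaxed)
```

With the exception list, "Bank of England" is detected as one span instead of two.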

Furthermore, we created three specific annotators to recognise:
- company names with a legal type
- full person names (with a first name drawn from a list of common first names)
- legal references of type `LAW`.

In [38]:
# Detection of companies with legal type
company_annotator = annotations.FunctionAnnotator(annotations.CompanyTypeGenerator(),
                                                  "company_type_detector", to_exclude=exclusives)
exclusives += ["company_type_detector"]
        
# Detection of full person names
full_name_annotator = annotations.FunctionAnnotator(annotations.FullNameGenerator(), 
                                                    "full_name_detector", to_exclude=exclusives)

# Detection of legal references
legal_annotator = annotations.FunctionAnnotator(annotations.legal_generator, "legal_detector", to_exclude=exclusives)

company_annotator.annotate(doc)
full_name_annotator.annotate(doc)
misc_annotator.annotate(doc)
legal_annotator.annotate(doc)
annotations.display_entities(doc, "company_type_detector")
annotations.display_entities(doc, "full_name_detector")
annotations.display_entities(doc, "misc_detector")
annotations.display_entities(doc, "legal_detector")

Finally, we also rely on an external probabilistic [parser of named entities](https://github.com/snipsco/snips-nlu-parsers) from [Snips](https://snips.ai/). The parser recognises `DATE`, `TIME`, `ORDINAL`, `CARDINAL`, `MONEY` and `PERCENT`. The parser is implemented in _Rust_, so it runs quite fast.

In [20]:
# Detection based on a probabilistic parser
snips = annotations.FunctionAnnotator(annotations.SnipsGenerator(), "snips")
snips.annotate(doc)
annotations.display_entities(doc, "snips")

### 4. Document-level annotators

All annotators presented so far rely on _local_ decisions on tokens or phrases.  However, news articles are not mere collections of words, but exhibit a high degree of internal coherence. This can be exploited to further improve the annotation. Two document-level annotators are implemented, one based on the document history (Section 4.1) and one on label consistency (Section 4.2).

Before we can run the document-level annotators, we need to normalise some of the entities. The `StandardiseAnnotator` is responsible for this normalisation:
- entities `PER` (from conll2003, BTC and SEC) are set to `PERSON`
- entities `LOC` from conll2003, BTC and SEC for spans that are also annotated by other layers as `GPE` are set to `GPE` 
- entities `ORG` that are annotated by other layers as `COMPANY` are set to `COMPANY`
    

In [25]:
annotator = annotations.StandardiseAnnotator()
doc = annotator.annotate(doc)
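
The three normalisation rules listed above can be sketched as follows, on a simplified representation where each source maps `(start, end)` spans to a single label. The function `standardise` and this data layout are assumptions made for illustration, not the code of `StandardiseAnnotator`.

```python
def standardise(layers, coarse_sources=("conll2003", "BTC", "SEC")):
    """Apply the three normalisation rules to {source: {(start, end): label}}."""
    # Spans that at least one source labels as GPE or COMPANY
    gpe_spans = {span for spans in layers.values()
                 for span, label in spans.items() if label == "GPE"}
    company_spans = {span for spans in layers.values()
                     for span, label in spans.items() if label == "COMPANY"}

    result = {}
    for source, spans in layers.items():
        new_spans = {}
        for span, label in spans.items():
            if label == "PER":
                label = "PERSON"
            elif label == "LOC" and source in coarse_sources and span in gpe_spans:
                label = "GPE"
            elif label == "ORG" and span in company_spans:
                label = "COMPANY"
            new_spans[span] = label
        result[source] = new_spans
    return result


layers = {
    "conll2003": {(0, 1): "LOC", (3, 5): "PER", (7, 9): "ORG"},
    "wiki_cased": {(0, 1): "GPE"},
    "crunchbase_cased": {(7, 9): "COMPANY"},
}
normalised = standardise(layers)
print(normalised["conll2003"])
```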

#### 4.1 Document history

When a journalist first mentions an entity such as a company or person in an article, they typically write it in a "long form", and then use shorter mentions once the entity is properly introduced. For instance, in the text above, "Scott Moore" is first mentioned with a full name, and then simply referred to as "Moore". Similarly, companies are often first introduced with their legal type.  The `DocumentHistoryAnnotator` takes advantage of this property, by propagating the label from the first mention onto subsequent mentions:

In [32]:
annotator = annotations.DocumentHistoryAnnotator()
annotator.annotate(doc)
annotations.display_entities(doc, "doc_history")
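
The propagation idea can be sketched on plain token lists as below. This is an illustrative simplification with a hypothetical function name; the actual `DocumentHistoryAnnotator` may select first mentions and match shorter forms differently.

```python
def propagate_first_mentions(tokens, first_mentions):
    """Label later, shorter mentions with the label of an earlier full mention.

    `first_mentions` maps (start, end) token spans to labels, for instance
    the spans found by a full-name or company-type detector.
    """
    result = {}
    for (start, end), label in sorted(first_mentions.items()):
        full = tokens[start:end]
        i = end
        while i < len(tokens):
            # Longest contiguous part of the full mention that starts at position i
            length = 0
            for k in range(len(full)):
                n = 0
                while (k + n < len(full) and i + n < len(tokens)
                       and full[k + n] == tokens[i + n]):
                    n += 1
                length = max(length, n)
            if length:
                result[(i, i + length)] = label
                i += length
            else:
                i += 1
    return result


tokens = "Scott Moore of Best Buy Co said that Moore likes Best Buy".split()
first_mentions = {(0, 2): "PERSON", (3, 6): "COMPANY"}
history = propagate_first_mentions(tokens, first_mentions)
print(history)
```

A realistic version would need to ignore function words inside long mentions (such as "of" in "Bank of England"), which this sketch would wrongly propagate.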

#### 4.2 Label consistency

Another property of news documents is the fact that two (or more) named entities sharing the same string in a text typically refer to the same entity, and should therefore have the same label. "Komatsu" can be both a company name and a city in Japan, but within a given document, it will typically be one or the other for the whole document. We can capture this fact with an annotator that looks at the majority label for a given string, and annotates all occurrences with this label:

In [35]:
annotator = annotations.DocumentMajorityAnnotator()
annotator.annotate(doc)
annotations.display_entities(doc, "doc_majority_cased")
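
A minimal sketch of this majority rule, again on plain token lists and with a hypothetical function name (the actual `DocumentMajorityAnnotator` also produces an uncased variant and may weigh the sources differently):

```python
from collections import Counter, defaultdict


def majority_annotate(tokens, spans):
    """Annotate every occurrence of a string with its most frequent label.

    `spans` is a list of (start, end, label) triples gathered from the
    other annotation sources for one document.
    """
    counts = defaultdict(Counter)
    for start, end, label in spans:
        counts[tuple(tokens[start:end])][label] += 1

    result = {}
    for form, label_counts in counts.items():
        label = label_counts.most_common(1)[0][0]
        # Search for all occurrences of the string, annotated or not
        for i in range(len(tokens) - len(form) + 1):
            if tuple(tokens[i:i + len(form)]) == form:
                result[(i, i + len(form))] = label
    return result


tokens = "Komatsu rose as Komatsu said Komatsu will expand".split()
spans = [(0, 1, "COMPANY"), (3, 4, "GPE"), (5, 6, "COMPANY")]
majority = majority_annotate(tokens, spans)
print(majority)
```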

## __Step 2__: Estimation of label model

We can construct a full annotator with all annotators described above, and then run it on a dataset from the target domain:

In [4]:
import annotations
full_annotator = annotations.FullAnnotator().add_all()
print("Total number of annotators:", len(full_annotator.annotators))

Loading shallow functions
Loading Spacy NER models
loading en_core_web_md...done
loading data/conll2003...done
loading data/BTC...done
loading data/SEC-filings...done
Loading gazetteer supervision modules
Extracting data from ./data/wikidata.json
Populating trie for entity class PERSON (number: 2626849)
Populating trie for entity class LOC (number: 47129)
Populating trie for entity class GPE (number: 602953)
Populating trie for entity class ORG (number: 295768)
Populating trie for entity class PRODUCT (number: 12457)
Extracting data from ./data/wikidata_small.json
Populating trie for entity class PERSON (number: 1865813)
Populating trie for entity class LOC (number: 14250)
Populating trie for entity class GPE (number: 273743)
Populating trie for entity class ORG (number: 91423)
Populating trie for entity class PRODUCT (number: 12457)
Extracting data from ./data/geonames.json
Populating trie for entity class GPE (number: 15205)
Extracting data from ./data/crunchbase.json
Populating trie

We can then take the raw data from CoNLL 2003, run Spacy on the textual content, and finally apply the annotator to get annotations from each source:

In [62]:
# We annotate the documents and store them in a Spacy DocBin file
full_annotator.annotate_docbin("./data/conll2003.docbin")

Reading ./data/conll2003.docbin...Number of processed documents: 1000
Finished annotating ./data/conll2003.docbin
Write to ./data/conll2003.docbin...done


Once this is done, we can finally estimate a unified annotator model through weak supervision. The basic idea is to describe the named entity recognition problem as a _Hidden Markov Model_ where the observations are the annotations from each source, and the states correspond to the "true" (hidden) labels for each token, as illustrated below.

<img src="data/hmm.png">

Since we don't have access to the true labels for each token, we will rely on _Baum-Welch_ (a variant of EM) to estimate the HMM model through unsupervised training. More specifically, we will need to estimate 3 models:
- the initial probabilities $P(Y_0)$ of the labels for the first token of a document
- the transition matrix $P(Y_i | Y_{i-1})$ for the labels 
- the emission models $P(\lambda_{i,j} | Y_i)$ of observing a particular value $\lambda_{i,j}$ (say, `B-PER`) from the source $j$ given the true label $Y_i$. In the current model, we assume the emissions to be independent of one another given the true label, to reduce the complexity of the model.
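
Under the independence assumption above, the joint probability of a label sequence and the observed annotations factorises into the three models just listed. As a sketch, writing $n$ for the index of the last token and $J$ for the number of sources (two symbols introduced here for illustration):

$$P(Y_{0:n}, \lambda) = P(Y_0) \prod_{i=1}^{n} P(Y_i | Y_{i-1}) \prod_{i=0}^{n} \prod_{j=1}^{J} P(\lambda_{i,j} | Y_i)$$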

Given an annotated dataset, the HMM model can be easily estimated:

In [24]:
import labelling 

# We create the unified model (and make sure the CoNLL 2003-trained NER model is ignored)
sources_to_use = [l for l in labelling.SOURCE_NAMES if "conll2003" not in l]
unified_model = labelling.HMMAnnotator(sources_to_use)

# We then run Baum-Welch on the model (can take some time)
unified_model.train("./data/conll2003.docbin")

# Saving the model to a file
unified_model.save("./data/hmm_conll2003.pkl")

Note that the HMM model relies on some informative priors to facilitate the parameter estimation:
- the prior for the initial probabilities is a Dirichlet based on counts for the most reliable model
- the prior for the transition matrix is a list of Dirichlets, also based on counts from the standard Spacy NER model.
- finally, the initial emission models are calculated based on subjective estimates of the relative precision and recall of each source. For instance, we know that a source like `company_type_detector` (which looks at legal suffixes such as "Inc." at the end of the noun phrase) has a very high precision but a low recall, since many mentions of companies do not include a suffix. In contrast, gazetteers will tend to have a better recall, but a lower precision (some company names also happen to be names of geopolitical entities or persons).  The initial precisions and recalls provided to the model are specified in `SOURCE_PRIORS` in the file `labelling.py`. When a precision and recall are not provided for a given source, they are assumed to be zero (for instance, `company_type_detector` only detects `COMPANY` entities and nothing else).

Once the model is learned, we can apply it as any other "annotator" object:

In [10]:
import labelling
full_annotator.annotate(doc)
unified_model.annotate(doc)
annotations.display_entities(doc, "HMM")

And we can apply it to the full dataset:

In [11]:
unified_model.annotate_docbin("./data/conll2003.docbin")

Reading ./data/conll2003.docbin...Number of processed documents: 1000
Finished annotating ./data/conll2003.docbin
Write to ./data/conll2003.docbin...done
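
Applying the label model amounts to inferring the hidden labels from the observed annotations. One standard way of doing so in an HMM is Viterbi decoding, sketched below for several sources that are independent given the label. The toy probabilities and the function `viterbi` are assumptions for illustration only; the `HMMAnnotator` in `labelling.py` may decode differently (for instance by computing posterior marginals).

```python
import math


def viterbi(observations, states, init, trans, emit):
    """Most likely state sequence given observations from several sources.

    observations: list (one entry per token) of {source: observed value}
    init[s], trans[prev][s], emit[source][s][value]: probabilities
    """
    def emission_logprob(state, obs):
        # Sources are assumed independent given the state, so their
        # log-probabilities simply add up
        return sum(math.log(emit[src][state][val]) for src, val in obs.items())

    scores = [{s: math.log(init[s]) + emission_logprob(s, observations[0])
               for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev, current, pointers = scores[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            current[s] = prev[best] + math.log(trans[best][s]) + emission_logprob(s, obs)
            pointers[s] = best
        scores.append(current)
        backpointers.append(pointers)

    # Follow the back-pointers from the best final state
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


states = ["O", "ORG"]
init = {"O": 0.8, "ORG": 0.2}
trans = {"O": {"O": 0.8, "ORG": 0.2}, "ORG": {"O": 0.5, "ORG": 0.5}}
emit = {
    # A gazetteer-like source that often stays silent on true ORG tokens
    "gazetteer": {"O": {"O": 0.9, "ORG": 0.1}, "ORG": {"O": 0.6, "ORG": 0.4}},
    # A model-like source that fires more often, but also on non-entities
    "ner": {"O": {"O": 0.7, "ORG": 0.3}, "ORG": {"O": 0.2, "ORG": 0.8}},
}
observations = [
    {"gazetteer": "O", "ner": "O"},
    {"gazetteer": "ORG", "ner": "ORG"},
    {"gazetteer": "O", "ner": "ORG"},
]
path = viterbi(observations, states, init, trans, emit)
print(path)
```

In this toy example the third token is decoded as `ORG` even though the gazetteer abstains, because the transition model favours staying inside an entity and the gazetteer is known to miss many mentions.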


<br>

## __Step 3__: Development of neural NER model


We can now learn a neural NER model based on these unified annotations. We have two options: a straightforward (but slightly underperforming) approach using Spacy, and a more sophisticated approach using our own NER model.

### __Alternative 1__: Using Spacy

In [12]:
import annotations 
annotations.convert_to_json("./data/conll2003.docbin", "./data/conll2003_dev.json", cutoff=20)
annotations.convert_to_json("./data/conll2003.docbin",  "./data/conll2003_train.json", nb_to_skip=20)

# We need to convert COMPANY into ORG if we want to use the standard Spacy model as starting point
!sed -i 's/COMPANY/ORG/g' ./data/conll2003_train.json
!sed -i 's/COMPANY/ORG/g' ./data/conll2003_dev.json

Writing JSON file to ./data/conll2003_dev.json
Writing JSON file to ./data/conll2003_train.json
Converted documents: 1000


And we can then directly train a new NER model with Spacy's training regime:

In [13]:
!rm -rf ./data/conll2003_spacy
import spacy
spacy.cli.train(lang="en", output_path="./data/conll2003_spacy", 
                train_path="./data/conll2003_train.json", 
                dev_path="./data/conll2003_dev.json",
               base_model="en_core_web_md", vectors="en_core_web_md",
               pipeline="ner", n_iter=5)

Training pipeline: ['ner']
Starting with base model 'en_core_web_md'
Loading vector from model 'en_core_web_md'
Counting training words (limit=0)

Itn  NER Loss   NER P   NER R   NER F   Token %  CPU WPS
---  ---------  ------  ------  ------  -------  -------
  1  37335.514  76.006  78.024  77.001   91.758    17516
  2  29525.090  76.556  78.024  77.283   91.758    17019
  3  26440.411  76.768  78.466  77.608   91.758    15320
  4  24993.271  77.778  79.499  78.629   91.758    17992
  5  23520.053  76.768  78.466  77.608   91.758    17264
✔ Saved model to output directory
data/conll2003_spacy/model-final
✔ Created best model
data/conll2003_spacy/model-best


We then load the learned model (taking the model version with best performance on the development set):

In [14]:
nlp = spacy.load("./data/conll2003_spacy/model-best")

And run it on the document:

In [15]:
annotations.display_entities(nlp(news_text))

To add the predictions of this neural model to the annotated documents, we can use the `ModelAnnotator` object:

In [16]:
model = annotations.ModelAnnotator("./data/conll2003_spacy/model-best", "aggregated_spacy")
model.annotate_docbin("./data/conll2003.docbin")

loading ./data/conll2003_spacy/model-best...done
Reading ./data/conll2003.docbin...Number of processed documents: 1000
Finished annotating ./data/conll2003.docbin
Write to ./data/conll2003.docbin...done


### __Alternative 2__: Using custom NER model

Although the model above performs relatively well, its performance can be further improved by using more advanced neural architectures, implemented in the file `ner.py`.

In [None]:
import ner

# We extract the training and validation documents (as Spacy docs)
val_docs = ner.generate_from_docbin("../data/conll2003.docbin", target_source="HMM", cutoff=20, loop=False) 
train_docs = ner.generate_from_docbin("../data/conll2003.docbin", target_source="HMM", nb_to_skip=20,loop=True) 

# We create the NER model
model = ner.NERModel(epoch_length=1000, dropout=0.5, batch_size=8, lr=0.01, nb_epochs=4, 
                     trainable_word_embeddings=False, word_emb_transform_dim=128)
# NB: to reduce overfitting on this toy dataset, we don't tune the word embeddings here, 
# but instead pass them through a dense layer

print("finished building model, now training...")
model.train(train_docs, val_docs)

In [None]:
model.label(doc)
annotations.display_entities(doc)

As we can see, the model still makes some errors, due to the fact that the training set (200 documents) remains too small to train a decent neural NER model. It should also be noted that the Spacy model above used the standard Spacy model for English (`en_core_web_md`) as a starting point, while the custom NER model is learned from scratch.

## __Step 4__: Evaluation

We'll first evaluate the approach on a standard dataset, namely CoNLL 2003. Note that CoNLL only makes use of 4 labels (`PER`, `ORG`, `LOC`, `MISC`) instead of the 18 labels which were used above.

#### Ontonotes-trained NER

The simplest baseline is to use a neural NER model trained on Ontonotes. In other words, this baseline considers one single source, namely a neural model trained on Ontonotes. Fortunately, we already have such a model, namely the labelling function `core_web_md`.

#### Majority voting

Another baseline consists in using majority voting on the various sources:

In [25]:
import labelling
mv = labelling.MajorityVoter(sources_to_use)
mv.annotate_docbin("./data/conll2003.docbin")

Using ['BTC', 'BTC+c', 'SEC', 'SEC+c', 'company_cased', 'company_type_detector', 'company_uncased', 'compound_detector', 'core_web_md', 'core_web_md+c', 'crunchbase_cased', 'crunchbase_uncased', 'date_detector', 'doc_history', 'doc_majority_cased', 'doc_majority_uncased', 'full_name_detector', 'geo_cased', 'geo_uncased', 'infrequent_compound_detector', 'infrequent_nnp_detector', 'infrequent_proper2_detector', 'infrequent_proper_detector', 'legal_detector', 'misc_detector', 'money_detector', 'multitoken_company_cased', 'multitoken_company_uncased', 'multitoken_crunchbase_cased', 'multitoken_crunchbase_uncased', 'multitoken_geo_cased', 'multitoken_geo_uncased', 'multitoken_product_cased', 'multitoken_product_uncased', 'multitoken_wiki_cased', 'multitoken_wiki_small_cased', 'multitoken_wiki_small_uncased', 'multitoken_wiki_uncased', 'nnp_detector', 'number_detector', 'product_cased', 'product_uncased', 'proper2_detector', 'proper_detector', 'snips', 'time_detector', 'wiki_cased', 'wiki_sm
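
As a rough sketch of what a token-level majority vote looks like, the function below picks, for each token, the label proposed by the largest number of sources. Treating `O` as an abstention is an assumption of this sketch; the `MajorityVoter` class in `labelling.py` may count votes differently.

```python
from collections import Counter


def majority_vote(token_labels):
    """Pick the most frequent non-"O" label per token, or "O" if no source fires.

    token_labels: list (one entry per token) of {source: label}
    """
    result = []
    for labels in token_labels:
        votes = Counter(label for label in labels.values() if label != "O")
        result.append(votes.most_common(1)[0][0] if votes else "O")
    return result


token_labels = [
    {"core_web_md": "O", "wiki_cased": "O", "BTC": "O"},
    {"core_web_md": "ORG", "wiki_cased": "ORG", "BTC": "GPE"},
    {"core_web_md": "O", "wiki_cased": "PERSON", "BTC": "O"},
]
voted = majority_vote(token_labels)
print(voted)
```

Unlike the HMM, such a vote gives every source the same weight and ignores the labels of the neighbouring tokens.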

#### Snorkel model

Snorkel is another weak supervision framework, against which we compare our approach:

In [26]:
snorkel_model = labelling.SnorkelModel(sources_to_use)
snorkel_model.train("./data/conll2003.docbin")
snorkel_model.annotate_docbin("./data/conll2003.docbin")

Using ['BTC', 'BTC+c', 'SEC', 'SEC+c', 'company_cased', 'company_type_detector', 'company_uncased', 'compound_detector', 'core_web_md', 'core_web_md+c', 'crunchbase_cased', 'crunchbase_uncased', 'date_detector', 'doc_history', 'doc_majority_cased', 'doc_majority_uncased', 'full_name_detector', 'geo_cased', 'geo_uncased', 'infrequent_compound_detector', 'infrequent_nnp_detector', 'infrequent_proper2_detector', 'infrequent_proper_detector', 'legal_detector', 'misc_detector', 'money_detector', 'multitoken_company_cased', 'multitoken_company_uncased', 'multitoken_crunchbase_cased', 'multitoken_crunchbase_uncased', 'multitoken_geo_cased', 'multitoken_geo_uncased', 'multitoken_product_cased', 'multitoken_product_uncased', 'multitoken_wiki_cased', 'multitoken_wiki_small_cased', 'multitoken_wiki_small_uncased', 'multitoken_wiki_uncased', 'nnp_detector', 'number_detector', 'product_cased', 'product_uncased', 'proper2_detector', 'proper_detector', 'snips', 'time_detector', 'wiki_cased', 'wiki_sm

#### Mixtures of multinomials

For the mixtures of multinomials, we rely on code written in `R`, and available in the directory `mixtures`.

#### AdaptaBERT

See https://github.com/xhan77/AdaptaBERT.

### Metrics

The file `analysis.py` contains code to easily extract evaluation metrics by comparing the annotations from a particular annotation layer (for instance the HMM predictions, or the predictions from a single source) to the gold standard:

In [27]:
import analysis, annotations

# Extracting the documents
docs = list(annotations.docbin_reader("./data/conll2003.docbin"))

# Here we extract for illustration purposes the metrics for 5 approaches: the Ontonotes-trained NER,
# the majority voter, the Snorkel model, the HMM, and the Spacy neural model trained on the aggregated sources
df = analysis.evaluate(docs, ["core_web_md", "majority_voter", "snorkel", "HMM", "aggregated_spacy"], analysis.CONLL_MAPPINGS, analysis.CONLL_TO_RETAIN)
df.loc["micro"]

| model | token_precision | token_recall | token_f1 | token_cee | entity_precision | entity_recall | entity_f1 |
|---|---|---|---|---|---|---|---|
| HMM | 0.728 | 0.806 | 0.766 | 2.104 | 0.737 | 0.722 | 0.73 |
| aggregated_spacy | 0.732 | 0.794 | 0.762 | 2.087 | 0.736 | 0.723 | 0.73 |
| core_web_md | 0.719 | 0.706 | 0.712 | 2.671 | 0.694 | 0.62 | 0.654 |
| majority_voter | 0.825 | 0.68 | 0.746 | 2.01 | 0.762 | 0.631 | 0.69 |
| snorkel | 0.686 | 0.766 | 0.724 | 2.076 | 0.697 | 0.638 | 0.666 |


Note: the results in the above table are not exactly the same as the ones reported in the original paper (the results above are actually slightly better!), due to some last-minute changes in the implementation of the labelling functions.
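
For readers unfamiliar with entity-level metrics, here is a minimal sketch of how precision, recall and F1 can be computed from predicted and gold spans, assuming that a prediction only counts as correct when both its boundaries and its label match exactly. The function `entity_scores` is hypothetical; `analysis.evaluate` may apply label mappings and other conventions not shown here.

```python
def entity_scores(predicted, gold):
    """Entity-level precision, recall and F1 over sets of (start, end, label)."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = {(0, 1, "LOC"), (3, 5, "ORG"), (8, 10, "PER"), (12, 13, "ORG")}
predicted = {(0, 1, "ORG"), (3, 5, "ORG"), (8, 10, "PER")}
precision, recall, f1 = entity_scores(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here the first prediction has the right boundaries but the wrong label, so it counts both as a false positive and as a missed gold entity.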

### Miscellaneous

The file `data/crowdsourced.docbin` contains the sentences annotated with named entities via crowd-sourcing.