This branch is kept as the last state before submission of the paper (EMNLP 2023). For recent changes and updates, go to Version2.
Evaluation scripts and data are in CQE_Evaluation.
A Framework for Comprehensive Quantity Extraction. This repository contains code for the paper:
CQE: A Framework for Comprehensive Quantity Extraction
Satya Almasian*, Vivian Kazakova*, Philipp Göldner, Michael Gertz
Institute of Computer Science, Heidelberg University
(* indicates equal contribution)
If you found this useful, consider citing us:
@inproceedings{DBLP:conf/emnlp/AlmasianKG023,
author = {Satya Almasian and
Vivian Kazakova and
Philip G{\"{o}}ldner and
Michael Gertz},
editor = {Houda Bouamor and
Juan Pino and
Kalika Bali},
title = {{CQE:} {A} Comprehensive Quantity Extractor},
booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural
Language Processing, {EMNLP} 2023, Singapore, December 6-10, 2023},
pages = {12845--12859},
publisher = {Association for Computational Linguistics},
year = {2023},
url = {https://aclanthology.org/2023.emnlp-main.793},
timestamp = {Wed, 13 Dec 2023 17:20:20 +0100},
biburl = {https://dblp.org/rec/conf/emnlp/AlmasianKG023.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
You can install the package from source by running the following in the root directory of the package.
pip install .
You can also install the package by name via pip.
pip install CQE
Create a CQE object and parse a text or sentence.
from CQE import CQE
parser = CQE.CQE()
text = "The sp 500 was down 2.1% and nasdaq fell 2.5%."
result = parser.parse(text)
print(result)
>>> [(=,2.1,[%],percentage,[sp, 500]), (=,2.5,[%],percentage,[nasdaq])]
Use the overload option for additional functionality. With overload=True, the NumParser also computes the span indices of the Quantity, the normalized input sentence, the long and the simplified scientific notation of the Value, whether the unit is scientific or noun-based, and the unit surface forms.
parser = CQE.CQE(overload=True)
text = "The sp 500 was down 2.1% and nasdaq fell 2.5%."
result = parser.parse(text)
for res in result:
    print(f"""
Quantity: {res}
=====
indices = {res.get_char_indices()}
normalized text = {res.get_normalized_text()}
pre processed text = {res.get_preprocessed_text()}
scientific notation = {res.value.scientific_notation}
simplified scientific notation = {res.value.simplified_scientific_notation}
scientific unit = {res.unit.scientific}
unit surfaces forms = {res.unit.unit_surfaces_forms}""")
>>> Quantity: (down,2.1,[%],percentage,{0: [sp, 500]})
=====
indices = [5, 6]
normalized text = The sp 500 was down 2.1 percentage and nasdaq fell 2.5 percentage .
pre processed text = The sp 500 was down 2.1% and nasdaq fell 2.5% .
scientific notation = 2.100000e+00
simplified scientific notation = 2.1e+00
scientific unit = True
unit surfaces forms = ['percentage', 'percent', 'pc', '%', 'pct', 'pct.']
Quantity: (down,2.5,[%],percentage,{0: [nasdaq]})
=====
indices = [10, 11]
normalized text = The sp 500 was down 2.1 percentage and nasdaq fell 2.5 percentage .
pre processed text = The sp 500 was down 2.1% and nasdaq fell 2.5% .
scientific notation = 2.500000e+00
simplified scientific notation = 2.5e+00
scientific unit = True
unit surfaces forms = ['percentage', 'percent', 'pc', '%', 'pct', 'pct.']
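The two notations in the output above can be reproduced with Python's standard format specifiers. The following is a minimal sketch, not the code used inside CQE: the helper names scientific_notation and simplified_scientific_notation are hypothetical, and stripping trailing zeros from the mantissa is only an assumption about how the simplified form is derived.

```python
def scientific_notation(value: float) -> str:
    """Long form with six digits after the decimal point, e.g. 2.100000e+00."""
    return format(value, "e")


def simplified_scientific_notation(value: float) -> str:
    """Short form with trailing mantissa zeros removed, e.g. 2.1e+00.

    Assumption: this mirrors the 'simplified' output shown in the README.
    Only intended for finite numbers.
    """
    mantissa, exponent = format(value, "e").split("e")
    mantissa = mantissa.rstrip("0").rstrip(".")
    return f"{mantissa}e{exponent}"


for number in (2.1, 2.5):
    print(scientific_notation(number), simplified_scientific_notation(number))
# 2.100000e+00 2.1e+00
# 2.500000e+00 2.5e+00
```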
See the example in example.py as well. Run it with:
python3 example.py
To replicate the results in the paper and compare against other systems, make sure CQE is installed and use the CQE_Evaluation repo. The evaluation script and the data used for evaluation and for training the unit disambiguators are in that repository.
The main files for CQE are under the CQE package. unit_classifier contains the code for unit disambiguation, based on a BERT classifier trained using spacy-transformers. The unit.json file is used for normalization of units, and unit_models.zip contains the trained disambiguation models, which are unzipped on the first run of the NumParser class.
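The unzip-on-first-run behavior can be pictured with a small sketch. This is an illustration under assumptions, not the code in the package: the helper name ensure_models_unzipped and the rule "skip extraction if the target directory already exists" are hypothetical.

```python
import zipfile
from pathlib import Path


def ensure_models_unzipped(archive: Path, target_dir: Path) -> bool:
    """Extract `archive` into `target_dir` unless that directory already exists.

    Returns True if extraction happened, False if it was skipped.
    Hypothetical helper; the real package may decide differently.
    """
    if target_dir.is_dir():
        return False  # already extracted on an earlier run
    target_dir.mkdir(parents=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    return True
```

Calling the helper twice with the same paths extracts the archive the first time and does nothing the second time, which is why only the first run pays the extraction cost.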
File | Description |
---|---|
CQE/NumberNormalizer.py | Bound, Number and Unit Normalization script |
CQE/CQE.py | Quantity Extraction script |
CQE/rules.py | Rules for DependencyMatcher |
CQE/unit.json | 531 units used for the Unit Normalization |
CQE/classes.py | Definition of the Bound, Range, Number, Unit, Noun and Quantity classes |
CQE/number_lookup.py | Number-word to number mappings |
CQE/example.py | Usage example |
CQE/unit_classifier/unit_disambiguator.py | Class for the unit disambiguator based on the BERT-based classifiers. |
CQE/unit_classifier/train_classifier_bert.py | Script for generating spaCy-based training data and training commands to create classifiers for disambiguation. |
CQE/unit_classifier/sample_usage.py | Usage example for disambiguation class. |
The units used for normalizing the unit of an extracted quantity are stored in unit.json. Each of the 531 units can have surfaces, symbols, prefixes, entity, URI, dimensions and currency_code. The file was composed from the list of units in quantulum3, the list of units from Wikipedia, the surfaces from Microsoft.Recognizers.Text, the UCUM units and surfaces, and the Wikipedia page of units.
Example:
"light-year": {
"surfaces": [
"light-year",
"light year",
"light years"
],
"entity": "length",
"URI": "Light-year",
"dimensions": [],
"symbols": [
"ly",
"[ly]"
]
}
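To show how entries of this shape can be used for normalization, here is a minimal sketch that builds a lookup from surface forms and symbols to the unit name. It only uses the light-year entry shown above. The function build_surface_index is hypothetical and is not part of the CQE package, and lowercasing surfaces while keeping symbols case-sensitive is an assumption.

```python
import json

# One-entry excerpt in the same shape as the light-year example above.
UNIT_JSON = """
{
  "light-year": {
    "surfaces": ["light-year", "light year", "light years"],
    "entity": "length",
    "URI": "Light-year",
    "dimensions": [],
    "symbols": ["ly", "[ly]"]
  }
}
"""


def build_surface_index(units: dict) -> dict:
    """Map every lowercased surface form and every symbol to its unit name."""
    index = {}
    for name, entry in units.items():
        for surface in entry.get("surfaces", []):
            index[surface.lower()] = name
        for symbol in entry.get("symbols", []):
            index[symbol] = name  # assumption: symbols stay case-sensitive
    return index


units = json.loads(UNIT_JSON)
index = build_surface_index(units)
print(index["light years"])  # light-year
print(index["ly"])  # light-year
```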
There are more than 50 rules for the DependencyMatcher defined in rules.py. We use the spaCy model en_core_web_sm to create a Doc object with linguistic annotations. The key point is that the rules do not match simple patterns over the individual words of the sentence; they match on those annotations and exploit the structure of the sentence.
Existing rules can be changed and new ones can be added by editing the file. Pay attention to the DependencyMatcher syntax.
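When editing rules, a quick structural check can catch common mistakes before a pattern is handed to spaCy. The sketch below is a hypothetical helper (check_rule is not part of CQE or spaCy) that verifies the basic shape spaCy documents for DependencyMatcher patterns: the first node is an anchor with only RIGHT_ID and RIGHT_ATTRS, and every later node links to an already defined node through LEFT_ID and REL_OP. It does not validate the operators or token attributes themselves. The pattern checked here is the num_symbol rule from rules.py.

```python
def check_rule(pattern: list) -> list:
    """Return a list of structural problems found in a DependencyMatcher pattern."""
    if not pattern:
        return ["pattern is empty"]
    problems = []
    known_ids = set()
    for i, node in enumerate(pattern):
        required = {"RIGHT_ID", "RIGHT_ATTRS"}
        if i > 0:
            # Every node after the anchor must be linked to an earlier node.
            required |= {"LEFT_ID", "REL_OP"}
        missing = required - set(node)
        if missing:
            problems.append(f"node {i}: missing {sorted(missing)}")
        if i == 0 and ("LEFT_ID" in node or "REL_OP" in node):
            problems.append("node 0: the anchor must not have LEFT_ID or REL_OP")
        if i > 0 and "LEFT_ID" in node and node["LEFT_ID"] not in known_ids:
            problems.append(
                f"node {i}: LEFT_ID {node['LEFT_ID']!r} is not defined earlier"
            )
        if "RIGHT_ID" in node:
            if node["RIGHT_ID"] in known_ids:
                problems.append(f"node {i}: duplicate RIGHT_ID {node['RIGHT_ID']!r}")
            known_ids.add(node["RIGHT_ID"])
    return problems


num_symbol = [
    {"RIGHT_ID": "number", "RIGHT_ATTRS": {"POS": "NUM"}},
    {
        "LEFT_ID": "number",
        "REL_OP": ">",
        "RIGHT_ID": "symbol",
        "RIGHT_ATTRS": {"DEP": {"IN": ["quantmod", "nmod"]}, "POS": "SYM"},
    },
]

# Same rule with a typo in LEFT_ID, to show what a failure looks like.
broken = [dict(num_symbol[0]), dict(num_symbol[1], LEFT_ID="numbr")]

print(check_rule(num_symbol))  # []
print(check_rule(broken))
```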
Example:
"num_symbol" : [
{
"RIGHT_ID": "number",
"RIGHT_ATTRS": {"POS": "NUM"}
},
{
"LEFT_ID": "number",
"REL_OP": ">",
"RIGHT_ID": "symbol",
"RIGHT_ATTRS": {"DEP": {"IN": ["quantmod", "nmod"]}, "POS": "SYM"}
},
]
Input: "The September crude contract was up 19 cents at US $58.24 per barrel and the September natural gas contract was up 10.4 cents to US $2.24 per mmBTU."
Matches:
NUM_SYMBOL [58.24, US$]
NUM_SYMBOL [2.24, US$]
NOUN_NUM [cents, 19]
NOUN_NUM [cents, 10.4]
NUM_RIGHT_NOUN [58.24, barrel]
NUM_RIGHT_NOUN [2.24, mmBTU]
NOUN_NOUN [contract, gas, natural]
UNIT_FRAC [58.24, per, barrel]
UNIT_FRAC [58.24, per, gas]
UNIT_FRAC [58.24, per, contract]
UNIT_FRAC [58.24, per, cents]
UNIT_FRAC [58.24, per, mmBTU]
UNIT_FRAC [2.24, per, mmBTU]
UNIT_FRAC_2 [58.24, per, gas, natural]
LONELY_NUM [19]
LONELY_NUM [58.24]
LONELY_NUM [10.4]
LONELY_NUM [2.24]
Candidates: [[US$, 58.24, per, barrel, 10], [US$, 2.24, per, mmBTU, 25], [19, cents, 6], [10.4, cents, 21]]
Quadruples: [([], [58.24], [US$, per, barrel], 10), ([], [2.24], [US$, per, mmBTU], 25), ([], [19], [cents], 6), ([], [10.4], [cents], 21)]
Output: [(=,58.24,[US$, per, barrel],united states dollar / barrel,[September, crude, contract]), (=,2.24,[US$, per, mmBTU],united states dollar / mmBTU,[September, natural, gas, contract]), (=,19.0,[cents],cent,[September, crude, contract]), (=,10.4,[cents],cent,[September, natural, gas, contract])]
Note that the numbers 6, 10, 21 and 25 indicate the position of the quantity in the text.
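The positions are consistent with token indices of the numbers, counted from zero. The sketch below illustrates that reading. It rests on two assumptions that are inferred from the example and not confirmed by the source: that preprocessing rewrites "US $58.24" as the two tokens "US$" and "58.24", and that tokens can be approximated by splitting on whitespace.

```python
# Assumed form of the sentence after preprocessing (currency symbol merged
# with "US", final period split off). CQE's real preprocessing may differ.
preprocessed = (
    "The September crude contract was up 19 cents at US$ 58.24 per barrel "
    "and the September natural gas contract was up 10.4 cents to US$ 2.24 "
    "per mmBTU ."
)

tokens = preprocessed.split()


def is_number(token: str) -> bool:
    """True if the token parses as a float."""
    try:
        float(token)
    except ValueError:
        return False
    return True


# Map each numeric token to its zero-based token index.
positions = {token: i for i, token in enumerate(tokens) if is_number(token)}
print(positions)
# {'19': 6, '58.24': 10, '10.4': 21, '2.24': 25}
```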