
Optimizing F1-Score for Text Spans Extraction in Question Answering Systems


Abstract

Current question answering systems are based on complex deep neural network architectures in which the output often consists of two classification layers used to simultaneously predict the start and end positions of the answer in the input passage. However, we argue in this paper that this approach suffers from several drawbacks. First, it does not optimize the metrics used to evaluate the model's performance, e.g., F1-Score or exact-match. Second, it does not allow encoding all possible answers during training, and during prediction it focuses on a single answer and cannot extract other candidate answers. Lastly, it does not prevent the case where the token with the highest probability of being the answer start comes after the token with the highest probability of being the answer end. This paper addresses these deficiencies by proposing a novel approach for text span extraction, which consists of a single classification output layer followed by an independent optimization module. This 2-step approach first aims to optimize exact-match and then extracts a text span by maximizing an expected F1-Score (EF1) objective. Specifically, we contribute a Mixed Integer Linear Programming (MILP) formulation that optimizes EF1 subject to parameterized constraints for text span extraction. The proposed approach allows the extraction of multiple possible answers to a question from a passage, and makes it possible to determine whether a span of text is a valid answer, and hence the answerability of the question given the passage. We demonstrate the effectiveness of our approach with extensive experimental evaluations and a comparison against existing baselines on the SQuAD1.1, SQuAD2.0, and NewsQA datasets.

Installation

Install the dependencies listed in requirements.txt using the following command:

pip install -r requirements.txt

In addition, the MILP algorithm described in the paper requires Gurobi to be installed in your Python environment. Please follow these instructions. Once Gurobi is installed, you need to add it to your own Python environment by installing the gurobipy module. The steps for doing this depend on your platform. On Linux, open a terminal window, change your current directory to the Gurobi <installdir> (the directory that contains the file setup.py), and issue the following command:

python setup.py install

The MILP Algorithm

The MILP algorithm described in the paper can be run from the notebook at:

├── DeepQA  
   ├── MILP-Algorithm.ipynb

Please install Gurobi as described above first.
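For intuition about the objective the notebook optimizes, here is a minimal toy sketch (not the paper's MILP formulation, and independent of Gurobi) that brute-forces span selection under a soft expected-F1 surrogate. It assumes the model's single classification layer outputs one probability per passage token of belonging to the answer; all function and variable names below are hypothetical.

```python
# Toy illustration of maximizing a soft expected-F1 (EF1) surrogate by
# exhaustive span enumeration. NOT the paper's MILP; a brute-force sketch
# under the assumption that probs[i] is the model's probability that
# token i belongs to the answer.

def soft_ef1(span_start, span_end, probs):
    """Soft expected-F1 of predicting tokens [span_start, span_end] (inclusive),
    treating sum(probs) as the expected number of gold answer tokens."""
    span_len = span_end - span_start + 1
    expected_overlap = sum(probs[span_start:span_end + 1])
    expected_gold = sum(probs)
    return 2.0 * expected_overlap / (span_len + expected_gold)

def best_span(probs, max_len=None):
    """Enumerate all O(n^2) spans and return the one maximizing soft EF1.
    Enumerating (start, end) pairs with start <= end guarantees the start
    never comes after the end, one of the constraints discussed above."""
    n = len(probs)
    best, best_score = (0, 0), float("-inf")
    for s in range(n):
        for e in range(s, min(n, s + (max_len or n))):
            score = soft_ef1(s, e, probs)
            if score > best_score:
                best_score, best = score, (s, e)
    return best, best_score

# Example: per-token answer probabilities with a high-probability run at 3-5.
probs = [0.05, 0.1, 0.2, 0.9, 0.95, 0.85, 0.1, 0.05]
span, score = best_span(probs)
print(span)  # -> (3, 5), the contiguous high-probability region
```

The MILP in the notebook replaces this quadratic enumeration with a solver-based formulation, which also supports the parameterized constraints and multi-answer extraction described in the paper.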

Training the models

You can train the models described in the paper by clicking on the following links (make sure TPU is activated and that you have a Pro account):

Interesting papers to read

Contacts

For more information about this project, please contact: