Q2Q: Improving Relevance of Responses by Converting Queries to Questions


Improving Relevance of QA Responses for Query Inputs

Code can be found at: https://github.com/sominwadhwa/cs585-q2q

This repository contains all the code written for the final project of COMPSCI 585: Intro to Natural Language Processing (Mohit Iyyer) in Fall 2019.

Authors: Thivakkar Mahendran, Vincent Pun, Kathryn Ricci, Apurva Bhandari, Somin Wadhwa

Abstract: BERT can be fine-tuned to build question answering systems and systems for various other NLP tasks. However, BERT is unable to provide answers to the kinds of queries search engine users typically input, which are usually brief or fragmented sentences. To address this problem, we propose a two-pronged approach: 1) fine-tuning the BERT QA model directly on queries, and 2) developing an NMT model to convert queries to questions, which are then used as input to the standard BERT QA model.
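For orientation, here is a minimal sketch of the second prong of that pipeline, assuming the HuggingFace transformers question-answering pipeline with a public SQuAD-fine-tuned BERT checkpoint; q2q_model and its translate method are placeholders for the query-to-question NMT model trained in this repo, not its actual interface:

    from transformers import pipeline

    # Assumption: a public SQuAD-fine-tuned BERT checkpoint stands in for
    # the repo's fine-tuned QA model.
    qa = pipeline("question-answering",
                  model="bert-large-uncased-whole-word-masking-finetuned-squad")

    def answer_query(query, context, q2q_model):
        # e.g. "capital france" -> "What is the capital of France?"
        question = q2q_model.translate(query)  # hypothetical q2q interface
        return qa(question=question, context=context)["answer"]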

Pre-requisites

The following instructions must be completed in order to run some (or all) sections of this project.

  1. Clone the project, replacing cs585 with the name of the directory you are creating:

     $ git clone https://github.com/sominwadhwa/cs585-q2q.git cs585
     $ cd cs585
    
  2. Make sure you have Python 3.6.x running on your local system. If you do, skip this step. If you don't, download it from https://www.python.org/downloads/.

  3. virtualenv is a tool for creating isolated 'virtual' Python environments. It is advisable to create one here as well (to avoid installing the pre-requisites into the system root). Do the following within the project directory:

     $ [sudo] pip install virtualenv
     $ virtualenv --system-site-packages cs585
     $ source cs585/bin/activate
    

To deactivate later, once you're done with the project, just type deactivate.

  4. Install the pre-requisites from requirements.txt & run test/init.py to check that all the required packages were correctly installed:

     $ pip install -r requirements.txt
     $ python test/init.py
    

You should see the message: Imports successful. Good to go!
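For reference, a minimal sketch of what such a check could look like (the imported packages below are illustrative assumptions, not the repo's exact requirements):

    # test/init.py (sketch): try importing the required packages and
    # report the first one that is missing.
    try:
        import numpy
        import torch
        import keras
    except ImportError as e:
        raise SystemExit("Missing dependency: %s" % e.name)
    print("Imports successful. Good to go!")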

Note:

Most (almost all) of our training was carried out on GCP/Colab, and we have tried our best to compile this repository in a coherent way so that all of the results are reproducible and our code is easy to execute. However, if you face any issues whatsoever, please feel free to contact any of the authors and we'll be happy to help you set it up!

Directory Structure

Top-Level Structure:

.
├── data                              # Data used and/or generated
│   ├── dev_q2d.txt
│   ├── train_q2d.txt
│   ├── google_queries.txt
│   ├── squad_queries.txt
│   └── hotpot_queries.txt
├── preprocessing-src                 # Preprocessing source files
│   ├── __init__.py
│   ├── get_context.py
│   ├── preprocess.py
│   └── results.py
├── src                               # Main source files for training
│   ├── __init__.py
│   ├── encoderdecoder.py
│   ├── failure-analysis-ids.py
│   ├── keras-nmt-baseline-wo-attn.py
│   ├── train-eval-q2q-w-attn.py
│   └── utils.py
├── test
│   └── init.py
├── attn_ex (<num>).png
├── attn_ex_sp<num>.png
├── loss_curve.png
├── LICENSE
└── README.md

File Descriptions:

  • /data/dev_q2d.txt: Dev set of query -> question pairs from the SQuAD dataset.
  • /data/train_q2d.txt: Training set of query -> question pairs from the SQuAD dataset.
  • /data/google_queries.txt: Queries generated by us from questions in Google's Natural Questions dataset.
  • /data/squad_queries.txt: Queries generated by us from questions in the SQuAD dataset.
  • /data/hotpot_queries.txt: Queries generated by us from questions in the HotpotQA dataset.
  • /src/encoderdecoder.py: Contains the classes defined for the Encoder, Decoder & AttnDecoder (see the sketch after this list).
  • /src/utils.py: Contains all of the helper functions we used to execute training, including the methods for generating plots.
  • /src/failure-analysis-ids.py: Outputs the erroneous responses shared across all our models; it simply computes the intersection of the question IDs that failed under each model (a short sketch also follows this list).
  • /src/keras-nmt-baseline-wo-attn.py: Baseline model implemented in Keras: a Seq2Seq NMT model without attention.
  • /src/train-eval-q2q-w-attn.py: Trains and evaluates the NMT model with attention and visualizes some of the attention maps.
  • /test/init.py: Checks that all dependencies are available.
  • /preprocessing-src/__init__.py: Empty file that marks the directory as a Python package.
  • /preprocessing-src/get_context.py: Given a question ID, outputs the "context" from the SQuAD dev set.
  • /preprocessing-src/preprocess.py: Takes a dataset of questions and produces corresponding query datasets in two formats: JSON and txt.
  • /preprocessing-src/results.py: Takes a file of model predictions and generates the examples on which the predictions were wrong.
  • attn_ex (<num>).png: Attention heatmaps for some of the queries that our model converts to questions reasonably well.
  • attn_ex_sp<num>.png: Attention heatmaps for spurious queries on which our model essentially fails.
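As promised above, here is a minimal sketch of the kind of classes /src/encoderdecoder.py defines, following the standard PyTorch seq2seq-with-attention recipe; the GRU units, attention mechanism, and hyperparameters below are illustrative assumptions, not the repo's exact implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    MAX_LENGTH = 20  # assumed maximum query/question length

    class Encoder(nn.Module):
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.hidden_size = hidden_size
            self.embedding = nn.Embedding(input_size, hidden_size)
            self.gru = nn.GRU(hidden_size, hidden_size)

        def forward(self, token, hidden):
            # token: a (1,) tensor holding one source-token id
            embedded = self.embedding(token).view(1, 1, -1)
            output, hidden = self.gru(embedded, hidden)
            return output, hidden

    class AttnDecoder(nn.Module):
        def __init__(self, hidden_size, output_size, max_length=MAX_LENGTH):
            super().__init__()
            self.embedding = nn.Embedding(output_size, hidden_size)
            self.attn = nn.Linear(hidden_size * 2, max_length)
            self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
            self.gru = nn.GRU(hidden_size, hidden_size)
            self.out = nn.Linear(hidden_size, output_size)

        def forward(self, token, hidden, encoder_outputs):
            # encoder_outputs: (max_length, hidden_size); the attention
            # weights are what the attn_ex*.png heatmaps visualize.
            embedded = self.embedding(token).view(1, 1, -1)
            attn_weights = F.softmax(
                self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
            context = torch.bmm(attn_weights.unsqueeze(0),
                                encoder_outputs.unsqueeze(0))
            output = torch.cat((embedded[0], context[0]), 1)
            output = self.attn_combine(output).unsqueeze(0)
            output, hidden = self.gru(F.relu(output), hidden)
            output = F.log_softmax(self.out(output[0]), dim=1)
            return output, hidden, attn_weights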
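And a hedged sketch of the intersection logic in /src/failure-analysis-ids.py, assuming each model's failures are stored as a JSON list of question IDs (the file format and names are assumptions):

    import json

    def common_failures(*paths):
        # Load each model's failed-question IDs and keep only the IDs
        # that every model got wrong.
        failed = [set(json.load(open(p))) for p in paths]
        return set.intersection(*failed)

    # Usage (file names are placeholders):
    # bad_ids = common_failures("baseline_fails.json", "attn_fails.json")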

Acknowledgements

The authors acknowledge, with thanks, the use of computational resources provided by Google. We also thank the TAs and the instructor of COMPSCI 585 for their critical feedback throughout the project.
