Code for the paper: "Cross-domain Semantic Parsing via Paraphrasing" - EMNLP 2017

Cross-Domain Semantic Parsing / Natural Language Interface

The Cold Start Problem

Semantic parsing, which maps natural language utterances into computer-understandable logical forms, has recently drawn substantial attention as a promising direction for developing natural language interfaces to computers. There are countless domains (healthcare, finance, IoT, sports, etc.) for which we could build a natural language interface, making portability and scalability a pressing challenge. In other words, natural language interfaces face a cold start problem:

Given a new domain, how can we build a natural language interface for it?

Cold Start Problem


There are three complementary ways to solve the cold start problem:

Cold Start Solution

  1. Re-use the training data for some existing domains via transfer learning (this repo)
  2. Collect training data for the new domain via crowdsourcing [1] [2]
  3. Once we have cold-started a natural language interface with reasonable performance, develop user-friendly interaction mechanisms, deploy the system, and let it interact with real users so it can keep refining itself [3] [4]

Use of This Repo


Dependencies:

  • Python 2.7
  • Tensorflow 0.11 (yes, the TF version is a bit old, but it still works reasonably well!)
  • PyYAML (for logging)


Install Tensorflow 0.11. First set $TF_BINARY_URL to the TensorFlow 0.11 wheel URL for your platform (the GPU and CPU builds use different wheels), then:

(GPU support)

pip install --ignore-installed --upgrade $TF_BINARY_URL

(CPU only)

pip install --ignore-installed --upgrade $TF_BINARY_URL

Install other dependencies:

pip install pyyaml


We use the Overnight dataset, which covers 8 domains including Basketball, Calendar, and Restaurants. The dataset is already pre-processed and can be found under data/overnight.
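To illustrate the paraphrasing formulation on Overnight-style data: each natural utterance is paired with a canonical utterance generated from a grammar, and parsing reduces to selecting the canonical utterance (and hence its logical form) that best paraphrases the input. The sketch below uses simple token overlap as a stand-in for the learned paraphrase model; the function names and scoring are illustrative only, not the repo's actual API.

```python
# Illustrative sketch (NOT the repo's API): semantic parsing via
# paraphrasing reduces to ranking canonical utterances for an input.
# Token-overlap (Jaccard) similarity stands in for the learned model.

def jaccard(a, b):
    """Token-overlap similarity between two utterances."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / float(len(sa | sb)) if sa | sb else 0.0

def best_paraphrase(utterance, canonical_utterances):
    """Pick the canonical utterance that best paraphrases the input."""
    return max(canonical_utterances, key=lambda c: jaccard(utterance, c))

canonical = [
    "restaurant that serves breakfast",          # -> a logical form
    "player whose number of points is 3",        # -> another logical form
]
print(best_paraphrase("which restaurants serve breakfast", canonical))
```

In the paper, the ranking is done by a neural paraphrase model rather than token overlap, but the overall reduction from parsing to paraphrase selection is the same.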

Assume we are at the root of the repo. All training and testing can be done with the following command:

sh scripts/ 0 train_grid_unit_var overnight 0

The arguments are:

  • GPU ID: which GPU to use for this run?
  • Training Script: each word embedding initialization has a separate script
  • Dataset: for now, the only option is overnight
  • Execution Number: a unique number for this execution. A corresponding dir will be created under execs/ to host the trained model and the log of this execution.

The command will do the following tasks for each of the 8 domains:

  1. In-domain: Train and test
  2. In-domain: Re-train with the full training data (i.e., training+validation) and then test (final results for in-domain setting)
  3. Cross-domain: Pre-training on the source domains
  4. Cross-domain: Warm-start on target domain with pre-trained model, fine-tune with in-domain data, and test
  5. Cross-domain: Re-train with full in-domain training data and then test (final results for cross-domain setting)
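The per-domain task schedule above can be sketched as plain orchestration logic. The step names and the function below are descriptive assumptions only; the real pipeline lives in the shell and Python scripts under scripts/.

```python
# Sketch of the per-domain task schedule run by the training script.
# Step names are illustrative; the actual work is done by scripts/.

DOMAINS = ["basketball", "calendar", "restaurants"]  # 3 of the 8 Overnight domains

def schedule(target, all_domains):
    """Return the five tasks executed for one target domain."""
    sources = tuple(d for d in all_domains if d != target)
    return [
        ("in-domain", "train+test", target),
        ("in-domain", "retrain-full+test", target),      # train+validation data
        ("cross-domain", "pretrain", sources),           # all other domains
        ("cross-domain", "warm-start+fine-tune+test", target),
        ("cross-domain", "retrain-full+test", target),
    ]

for task in schedule("calendar", DOMAINS):
    print(task)
```

Note that the pre-training step never sees the target domain: the source set excludes it, which is what makes steps 3–5 a genuine transfer-learning evaluation.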

It's easy to train for another word embedding initialization strategy, e.g., original word2vec embedding without standardization. Just change the training script and execution number:

sh scripts/ 0 train_grid_original overnight 1

Extract Testing Results

We provide a script to make it easy to extract the testing results across all of the domains. For example,

In-domain, exec_num=0, re-training with full training data:

python scripts/ in-domain overnight 0 1

In-domain, exec_num=0, no re-training:

python scripts/ in-domain overnight 0 0

Cross-domain, exec_num=5, re-training with full training data:

python scripts/ cross-domain overnight 5 1
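A minimal sketch of what such a result-extraction script does: scan each execution's log under execs/ for the final reported test accuracy. The log-line format and helper below are assumptions for illustration; the repo's actual logs may use a different format.

```python
import re

# Hypothetical log-line format, e.g. "epoch 20 test accuracy: 0.801";
# the repo's actual logging format may differ.
ACC_RE = re.compile(r"test accuracy:\s*([0-9.]+)")

def extract_accuracy(log_text):
    """Return the last reported test accuracy in a log, or None."""
    matches = ACC_RE.findall(log_text)
    return float(matches[-1]) if matches else None

log = "epoch 10 test accuracy: 0.752\nepoch 20 test accuracy: 0.801"
print(extract_accuracy(log))  # the last (final) accuracy in the log
```

Taking the last match rather than the maximum reflects the convention of reporting the model's final test result, not the best intermediate one.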


Please refer to the following paper for more details. If you find it useful, please consider citing:

@InProceedings{su2017cross,
    author    = {Su, Yu and Yan, Xifeng},
    title     = {Cross-domain Semantic Parsing via Paraphrasing},
    booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
    pages     = {1235--1246},
    year      = {2017},
    address   = {Copenhagen, Denmark},
    month     = {Sept},
    publisher = {Association for Computational Linguistics}
}

Other references for cold-starting a natural language interface

[1] Yu Su, Ahmed Hassan Awadallah, Madian Khabsa, Patrick Pantel, Michael Gamon, Mark Encarnacion. Building Natural Language Interfaces to Web APIs. CIKM 2017.

[2] Yu Su, Huan Sun, Brian Sadler, Mudhakar Srivatsa, Izzeddin Gur, Zenghui Yan, Xifeng Yan. On Generating Characteristic-rich Question Sets for QA Evaluation. EMNLP 2016.

[3] Izzeddin Gur, Semih Yavuz, Yu Su, Xifeng Yan. DialSQL: Dialogue Based Structured Query Generation. ACL 2018.

[4] Yu Su, Ahmed Hassan Awadallah, Miaosen Wang, Ryen White. Natural Language Interfaces with Fine-Grained User Interaction: A Case Study on Web APIs. SIGIR 2018.
