Cross-Lingual Transfer with Order Differences

This repo contains the code and models for the NAACL 2019 paper: "On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing" [arxiv] [paper] [bib]

This is built based on NeuroNLP2 and PyTorch 0.3.


Easy Running Steps

We provide some easy-to-run example scripts.

Easy preparing

  • Note: the data-preparation script requires Python 3, while the main running steps afterwards require Python 2.
  • For easy preparation, simply run examples/run_more/go_data.sh. This is a one-step script that fetches and prepares all the data (it may need substantial disk space, mostly for the embedding files).
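
For reference, the whole preparation is a single call (a minimal sketch; Python 3 must be available, per the note above):

```bash
# One-step data preparation (requires Python 3; may use a lot of disk
# space, mostly for the embedding files)
bash examples/run_more/go_data.sh
```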

Running Environment

  • This implementation runs with Python 2 + PyTorch 0.3; we suggest using conda to install the required environment:
  • conda create -n myenv python=2.7; source activate myenv; conda install gensim
  • conda install pytorch=0.3.1 cuda80 -c pytorch
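
To confirm the environment is set up as expected (a quick sanity check, not part of the repo's scripts):

```bash
# Should print 0.3.1 if the install above succeeded
python -c "import torch; print(torch.__version__)"
```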

Easy running


Details

The rest of this section provides more details on each of the running steps.

Data Preparation

  • The data format is basically the CoNLL-U format (here) of UD v2.2, but with some crucial differences:
  • Firstly, all comment lines (starting with #) and non-integer-ID lines (multiword tokens and empty nodes) should be removed.
  • Moreover, the POS tags are read from column 5 instead of column 4, so a simple column move is needed (see the sketch after this list).
  • Aligned cross-lingual embeddings are required as inputs. We use the old version of the fastText embeddings and fastText_multilingual for alignment.
  • Please refer to examples/run_more/prepare_data.py for the data-preparation step.
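
prepare_data.py is the actual implementation of this step; purely as an illustration, here is a minimal stand-alone sketch of the two format changes above. It assumes a standard 10-column UD v2.2 file, that the intended move is copying column 4 (UPOS) into column 5, and placeholder file names:

```bash
# Keep only integer-ID token lines (drops comments, multiword tokens,
# and empty nodes), copy UPOS (col 4) into col 5 where the POS is read
# from, and preserve the blank lines that separate sentences.
awk -F'\t' 'BEGIN { OFS = "\t" }
  /^[0-9]+\t/ { $5 = $4; print; next }
  /^$/        { print }
' input.conllu > cleaned.conllu
```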

Training and Testing

  • We provide example scripts for training and testing; please follow those examples (some of the paths in the scripts are specific to our environment, so you may need to set up the correct paths).

  • Step 1: build the dictionaries (see examples/run_more/prepare_vocab.sh). This step builds the vocabularies (using examples/vocab/build_joint_vocab_embed.py) for the source language together with the source embeddings.
  • Step 2: train the models (see examples/run_more/train_*.sh) on the source language. Here, we have four types of models corresponding to those in our paper; the names are slightly different, and the mappings are: SelfAttGraph -> train_graph.sh, RNNGraph -> train_graph_rnn.sh, SelfAttStack -> train_stptr.sh, RNNStack -> train_stptr_rnn.sh.
  • Extra: for these scripts, the file paths should be changed to the correct ones: --word_path for the embedding file, and --train / --dev / --test for the corresponding data files (see the sketch after this list).
  • Step 3: test with the trained models (see examples/run_more/run_analyze.sh). Again, the paths for the extra-language data (--test) and the extra-language embeddings (--extra_embed) should be set correspondingly.
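
As a rough sketch of the path edits (the flag names come from the scripts above; all file paths below are placeholders for your own environment):

```bash
# Step 2: train a SelfAttGraph model on the source language
bash examples/run_more/train_graph.sh
#   inside the script, point the flags at your files:
#   --word_path /path/to/aligned.en.vec      # source embedding file
#   --train/--dev/--test /path/to/en.*.conllu

# Step 3: evaluate on a target language
bash examples/run_more/run_analyze.sh
#   --test /path/to/target.test.conllu       # extra-language data
#   --extra_embed /path/to/aligned.target.vec
```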

  • Our trained models (English as source, 5 different random runs) can be found here.
  • Warning: the embeddings for zh and ja are not well aligned, so our paper reports delexicalized results for them, which can be obtained by adding the flag --no_word for both training and testing (see the sketch below).
  • Warning 2: the outputs do not keep the original ordering of the input file; they are sorted by sentence length. Both the system output and the gold parses in this new ordering are written out (*_pred, *_gold).
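
For the zh/ja delexicalized setting, the only change is the extra flag on both ends (a sketch; the surrounding command lines live inside the scripts mentioned above):

```bash
# Delexicalized runs (e.g., zh, ja): add --no_word to the command lines
# inside both train_*.sh (training) and run_analyze.sh (testing)
bash examples/run_more/train_graph.sh   # with --no_word added inside
bash examples/run_more/run_analyze.sh   # with --no_word added inside
```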

Citation

If you find this repo useful, please cite our paper.

@inproceedings{ahmad2019,
  title = {On Difficulties of Cross-Lingual Transfer with Order Differences: 
           A Case Study on Dependency Parsing},
  author = {Ahmad, Wasi Uddin and Zhang, Zhisong and Ma, Xuezhe and Hovy, 
            Eduard and Chang, Kai-Wei and Peng, Nanyun},
  booktitle = {Proceedings of the 2019 Conference of the North American Chapter of 
              the Association for Computational Linguistics: Human Language Technologies},
  year = {2019}
}