MCoNaLa

This repository contains the MCoNaLa dataset and the code implementation of baseline models in the following paper:

MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages

🤗 Huggingface Hub

MCoNaLa is available on Huggingface Hub here

📊 ExplainaBoard Leaderboard

MCoNaLa has its leaderboard powered by ExplainaBoard, where you can upload and analyze your own system results with just a few simple clicks. Follow the detailed instructions below to submit your results to the leaderboard.

Benchmark Dataset

1. Multilingual Samples: Spanish, Japanese, Russian

The Multilingual CoNaLa dataset contains intent-snippet pairs collected from three different language versions of StackOverflow forums.

These samples are located in the dataset/test directory, where es_test.json/ja_test.json/ru_test.json are original annotated samples.

For the trans-test setting in baseline experiments, we also provide translated versions under the flores101 directory: es_test_to_en.json/ja_test_to_en.json/ru_test_to_en.json, where the Spanish/Japanese/Russian intents are translated into English using the FLORES-101 model.

To study the influence of translation quality, we also experiment with two other widely used Machine Translation (MT) systems: MarianMT and M2M. The intents in Spanish/Japanese/Russian samples are translated into English using the respective MT systems and put into the marianmt and m2m directories.

2. English Samples for Training

Because multilingual samples are limited, we use English CoNaLa samples for training, where the intents are originally written in English. The dataset/train directory contains the annotated train.json, the samples automatically mined from StackOverflow webpages (mined.jsonl), and those mined from API documents (api.jsonl).

However, due to GitHub's upload file size limit, we provide the training data via Zenodo instead.

In the trans-train experiment setting, we also translate the English intents into the three target languages of interest using FLORES-101, under the to-es / to-ja / to-ru directories.
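
The data files come in two formats, .json and .jsonl. Below is a minimal loading sketch. It assumes the usual conventions, which this README does not confirm: each .json file holds a single JSON array of samples, and each .jsonl file holds one JSON object per line. The helper name load_samples is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def load_samples(path):
    """Load samples from a .json (one array) or .jsonl (one object per line) file."""
    path = Path(path)
    with path.open(encoding="utf-8") as f:
        if path.suffix == ".jsonl":
            # Assumed layout: one JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
        # Assumed layout: the whole file is a single JSON array.
        return json.load(f)
```

For example, load_samples("dataset/test/es_test.json") would return the Spanish test samples as a list, provided the file follows the assumed layout.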

3. Data Usage

Spanish, Japanese, and Russian are the Target Languages (TL). Because their samples are limited in number, they are only ever used for testing.

English is the High-Resource Language (HRL), whose samples can be leveraged for model training.

To give an illustration, the directory is organized as:

.
├── README.md
├── dataset
│   ├── test 
│   │   ├── flores101
│   │   │   ├── es_test_to_en.json
│   │   │   ├── ja_test_to_en.jsonl
│   │   │   └── ru_test_to_en.jsonl
│   │   ├── marianmt
│   │   │   ├── es_test_to_en.json
│   │   │   ├── ja_test_to_en.jsonl
│   │   │   └── ru_test_to_en.jsonl
│   │   ├── m2m
│   │   │   ├── es_test_to_en.json
│   │   │   ├── ja_test_to_en.jsonl
│   │   │   └── ru_test_to_en.jsonl
│   │   ├── es_test.json
│   │   ├── ja_test.json
│   │   └── ru_test.json
│   ├── train 
│   │   ├── to-es
│   │   │   ├── train_to_es.json
│   │   │   ├── mined_to_es.jsonl
│   │   │   └── api_to_es.jsonl
│   │   ├── to-ja
│   │   │   ├── train_to_ja.json
│   │   │   ├── mined_to_ja.jsonl
│   │   │   └── api_to_ja.jsonl
│   │   ├── to-ru
│   │   │   ├── train_to_ru.json
│   │   │   ├── mined_to_ru.jsonl
│   │   │   └── api_to_ru.jsonl
│   │   ├── train.json
│   │   ├── mined.jsonl
│   │   └── api.jsonl

translate-train

The trans-train setting evaluates samples in different languages as independent tasks. Taking Spanish (es) as an example, we train on the translated CoNaLa samples in train/to-es (train_to_es.json, mined_to_es.jsonl, and api_to_es.jsonl), then test on test/es_test.json. Japanese (ja) and Russian (ru) samples are handled in the same way.

translate-test

The trans-test setting evaluates samples in all three target languages using the same model. Specifically, we train jointly on the original English CoNaLa samples train/train.json, train/mined.jsonl, and train/api.jsonl. The resulting model is evaluated on the translated versions es_test_to_en_xxx.json, ja_test_to_en_xxx.json, ru_test_to_en_xxx.json, where xxx stands for the MT model used (flores101, marianmt, m2m). Our experiments test on the flores101-translated samples by default.

zero-shot

The zero-shot setting trains the model using English samples (train/train.json, train/mined.jsonl, train/api.jsonl) and directly tests on multilingual samples (test/es_test.json, test/ja_test.json, test/ru_test.json). Intuitively, this requires the model to encode natural language intents in multiple languages without being explicitly trained on them.
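
The three settings differ only in which files are used for training and testing. The sketch below summarizes that mapping. The helper experiment_files and its root parameter are hypothetical. The file names follow the prose of the dataset section; since this README shows the translated test files with varying names and extensions (.json, .jsonl, and an xxx suffix), check them against the actual files before relying on the paths.

```python
from pathlib import Path

TARGET_LANGS = ("es", "ja", "ru")
MT_SYSTEMS = ("flores101", "marianmt", "m2m")


def experiment_files(setting, lang, mt="flores101", root="dataset"):
    """Return (train_files, test_file) for one setting and target language."""
    if lang not in TARGET_LANGS:
        raise ValueError(f"unknown target language: {lang}")
    if mt not in MT_SYSTEMS:
        raise ValueError(f"unknown MT system: {mt}")
    train, test = Path(root) / "train", Path(root) / "test"
    english = [train / "train.json", train / "mined.jsonl", train / "api.jsonl"]
    if setting == "trans-train":
        # Train on English samples translated into the target language.
        sub = train / f"to-{lang}"
        translated = [
            sub / f"train_to_{lang}.json",
            sub / f"mined_to_{lang}.jsonl",
            sub / f"api_to_{lang}.jsonl",
        ]
        return translated, test / f"{lang}_test.json"
    if setting == "trans-test":
        # Train on English; test on intents translated into English (assumed name).
        return english, test / mt / f"{lang}_test_to_en.json"
    if setting == "zero-shot":
        # Train on English; test directly on the original target-language intents.
        return english, test / f"{lang}_test.json"
    raise ValueError(f"unknown setting: {setting}")
```

Only trans-train needs a separate set of training files per language; trans-test and zero-shot share the same English training files.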

Submitting Results to the Leaderboard

Go to the submission site here and click the New button on the top-right to start a new submission, then fill out a few blanks in the pop-up window:

System Name: give an informative name for your system
Task: select 'machine-translation' from the drop-down list
Dataset: select 'mconala' with the target language (es/ja/ru) from the drop-down list, and for Split select 'test'
System Output: click on 'Text' and submit your results in TXT format. Please make sure that your results file has the same number of lines as the corresponding test set. If a predicted code snippet contains \n, a single prediction can spill across multiple lines. One way to fix this is to call a_multi_line_string.replace('\n', '\\n') before writing to the file.
Metrics: select 'bleu', which computes the code-specific BLEU (-4) score.
Check that the Input Lang is automatically filled with your target NL (es/ja/ru) and the Output Lang is python.
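
The newline trick from the System Output step can be wrapped in a small helper, sketched here. The function name write_predictions is hypothetical; the only behavior taken from this README is replacing '\n' with '\\n' so that each prediction occupies exactly one line.

```python
def write_predictions(snippets, path):
    """Write one prediction per line, escaping embedded newlines.

    Returns the number of lines written so it can be compared with the
    size of the test set.
    """
    lines = [s.replace("\n", "\\n") for s in snippets]
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(line + "\n" for line in lines)
    return len(lines)
```

Comparing the returned count with the number of test samples before uploading is a cheap way to catch a misaligned results file.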

Click the Submit button on the bottom, then your results are ready in a few seconds!

You can also click the Analysis button on the right to view more fine-grained analyses with cool figures 📊

Baseline Models

To present the baseline performance on the Multilingual CoNaLa dataset, we use three state-of-the-art models that are proficient at multilingual learning or code generation.

Set the root directory with the following command, since most of the experiment bash scripts require it.

export ROOT_DIR=`pwd`

1. mBART

mBART is a multilingual denoising auto-encoder trained for machine translation tasks.

To reproduce the baseline result of mBART, follow these steps:

Installation fairseq

Clone and install the repository.

git clone git@github.com:pytorch/fairseq.git
cd fairseq
# pip install .
pip install fairseq==0.10.2
# pip install fairseq==1.0.0a0+53bf2b1
cd ..

Warning: newer fairseq versions may raise instantiation errors; installing an earlier version (e.g., fairseq==0.10.2) may resolve them.

Also download the pre-trained mBART model checkpoint.

mkdir checkpoint && cd checkpoint 
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz 
tar -xzvf mbart.cc25.v2.tar.gz
cd ..

Data Pre-processing

Data pre-processing is applied to both the NL intents and the code snippets, in three consecutive steps: 1) SentencePiece tokenization, 2) fairseq pre-processing, and 3) data binarization.

Before the pre-processing, make sure to install SPM here, or run:

pip install sentencepiece

First, we need to extract the intents and snippets into line-by-line text files. To process all samples in the provided dataset, use the script

bash extract_lines.sh

which will create a dataset/lines directory with all processed training and testing files.

One can also process json/jsonl files in a specific folder by:

python extract_lines.py --input_dir source_dir --output_dir target_dir
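
The repository's extract_lines.py is not reproduced here. The following is only a rough sketch of what such an extraction could look like, assuming each sample is a dict with "intent" and "snippet" keys (assumed field names) and that every sample must end up on a single line so the intent and snippet files stay aligned. The helper to_parallel_lines is hypothetical.

```python
def to_parallel_lines(samples, intent_key="intent", snippet_key="snippet"):
    """Turn samples into two aligned lists of single-line strings."""
    intents, snippets = [], []
    for sample in samples:
        # Collapse all whitespace so each intent stays on one line.
        intents.append(" ".join(str(sample[intent_key]).split()))
        # Escape newlines so a multi-line snippet stays on one line.
        snippets.append(str(sample[snippet_key]).replace("\n", "\\n"))
    return intents, snippets
```

Keeping the two lists the same length, with line i of one matching line i of the other, is what lets the later tokenization and binarization steps treat them as a parallel corpus.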

Next, to perform the spm tokenization, run

bash do_spm_tokenization.sh

Lastly, do the fairseq pre-processing to binarize the data files

bash do_fairseq_preprocess.sh

By default, this step will use the FLORES-101 translation for trans-test evaluation.

Training and Evaluation

Head into the baseline/mbart/experiment directory.

To fine-tune a pre-trained mBART model:

bash run_train.sh

Note that we only need to train the model for the trans_train and trans_test settings. Evaluation in the zero_shot setting can directly load the saved checkpoint from the trans_test experiment.

To evaluate on trans_train or trans_test setting:

bash run_test.sh

For evaluation in the zero_shot setting, run_test_zero_shot.sh should be easier to use.

You can change the SETTING (trans_train, trans_test) and LANG (es, ja, ru) in both scripts to run different experiments.

2. TranX

TranX is a natural-language-to-code generation model, pre-trained by leveraging external knowledge. Our experiments use its code implementation for training and testing on the Multilingual CoNaLa dataset.

To reproduce the TranX results:

Installation

Clone the repository and install required libraries.

cd baseline/tranx
# git clone https://github.com/neulab/external-knowledge-codegen.git 

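# Requires Python 3.7; pip cannot install the interpreter itself, so set it up beforehand (e.g., with conda or pyenv).
pip install torch==1.1.0       # the PyPI package for PyTorch is named torch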
pip install astor==0.7.1       # this is very important

Data Pre-processing

bash baseline/tranx/scripts/preprocess.sh

This will organize and process the train-test files for both the trans-train and trans-test settings for three languages.

Note: be sure to download the necessary resource via

import nltk
nltk.download('punkt')

Model Training and Evaluation

Head into the TranX directory using

cd baseline/tranx

To pre-train with additional mined data and api documents under a specific SETTING for a specific LANGuage, run

bash scripts/run_train.sh

To further fine-tune with the annotated training set, run

bash scripts/run_tune.sh

Use the scripts/test_mconala.sh for evaluation.

We provide the best pre-trained model checkpoints for all three languages and both settings under best_pretrained_models/mconala. Alter the language and setting arguments in the bash script to run individual experiments.

bash scripts/test_mconala.sh

3. TAE

TAE is a seq2seq model, augmented with a target auto-encoding objective, for code generation from English intents.

The TAE code implementation is built upon its original repository. To reproduce the baseline performance of TAE, follow these steps:

Installation

Clone the repository and install necessary libraries.

cd baseline/tae/code-gen-TAE/
pip install -r requirements.txt

Download the pre-trained TAE model from here.

Data Pre-processing

Copy the test samples (with intents translated into English).

bash ../collect_data.sh

This uses the FLORES-101 translation by default.

Evaluation

To reproduce the evaluation result on Spanish CoNaLa samples, run

python3 test_mconala.py \
  --dataset_name "es-101" \
  --save_dir "pretrained_weights/conala" \
  --copy_bt --no_encoder_update --seed 4 \
  --monolingual_ratio 0.5 --epochs 80 \
  --use_conala_model

Change es-101 to ja-101/ru-101 to test the Japanese/Russian samples. Change xx-101 to xx-mmt or xx-m2m to test with different machine translation models.
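
To run every language and MT combination, the --dataset_name values can be generated rather than typed by hand. This is a small sketch: the helper dataset_name is hypothetical, and the pairing of suffixes with MT systems (101 for FLORES-101, mmt for MarianMT, m2m for M2M) is inferred from the naming above rather than stated explicitly.

```python
LANGS = ("es", "ja", "ru")
# Assumed pairing of MT systems with the suffixes named in this README.
MT_SUFFIXES = {"flores101": "101", "marianmt": "mmt", "m2m": "m2m"}


def dataset_name(lang, mt="flores101"):
    """Build a --dataset_name value such as "es-101"."""
    if lang not in LANGS:
        raise ValueError(f"unknown target language: {lang}")
    return f"{lang}-{MT_SUFFIXES[mt]}"


# All nine language / MT-system combinations.
ALL_DATASET_NAMES = [dataset_name(lang, mt) for lang in LANGS for mt in MT_SUFFIXES]
```

Each name in ALL_DATASET_NAMES can then be substituted for "es-101" in the command above.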

Also, to evaluate on the English CoNaLa samples, run

python3 train.py \
  --dataset_name "conala" \
  --save_dir "pretrained_weights/conala" \
  --copy_bt --no_encoder_update --seed 4 \
  --monolingual_ratio 0.5 --epochs 80 \
  --just_evaluate

Reference

@article{wang2022mconala,
  title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages},
  author={Zhiruo Wang and Grace Cuenca and Shuyan Zhou and Frank F. Xu and Graham Neubig},
  journal={arXiv preprint arXiv:2203.08388},
  year={2022}
}
