ARKS: Active Retrieval in Knowledge Soup for Code Generation

This repository contains the code for our paper ARKS: Active Retrieval in Knowledge Soup for Code Generation. Please refer to our project page for a quick overview.

We introduce ARKS, a general pipeline for retrieval-augmented code generation (RACG). We construct a knowledge soup that integrates web search, documentation, execution feedback, and evolved code snippets. Through active retrieval over this knowledge soup, we demonstrate significant gains on benchmarks covering updated libraries and long-tail programming languages (8.6% to 34.6% improvement with ChatGPT).

Installation

Using ARKS for RACG tasks is straightforward. On your local machine, we recommend first creating a virtual environment:

conda create -n arks python=3.8
git clone https://github.com/xlang-ai/arks

This creates the arks environment we used; activate it before installing (see Environment setup below). To use the embedding tool, install the arks package from the repository root:

cd arks
pip install -e .
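As a quick sanity check (assuming the package is importable under the name arks, which the repository does not state explicitly):

python -c "import arks"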

To evaluate on the updated libraries, install the packages via:

cd updated_libraries/ScipyM
pip install -e .
cd ../TensorflowM
pip install -e .

Environment setup

Activate the environment by running

conda activate arks

Data

Please download the data and unzip it with the password arksdata.
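For example, from the command line (the archive name arks_data.zip is our assumption; use whatever filename you downloaded):

unzip -P arksdata arks_data.zip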

You can also access the data on Hugging Face.

Load a single dataset:

from datasets import load_dataset
data_files = {"corpus": "Pony/Pony_docs.jsonl"}
dataset = load_dataset("xlangai/arks_data", data_files=data_files)

Load several datasets:

from datasets import load_dataset
data_files = {"corpus": ["Pony/Pony_docs.jsonl", "Ring/Ring_docs.jsonl"]}
dataset = load_dataset("xlangai/arks_data", data_files=data_files)
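Either call returns a DatasetDict keyed by the names in data_files, so the loaded corpus can be inspected directly:

print(len(dataset["corpus"]))  # number of documents in the corpus
print(dataset["corpus"][0])    # first document as a plain Python dict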

Getting Started

Run inference

python run.py --output_dir {output_dir} --output_tag {running_flag} --openai_key {your_openai_key} --task {task_name}
  • --output_tag is the run flag, starting from 0. Increasing it on successive runs activates the active retrieval process.
  • --task specifies the task name: one of ScipyM, TensorflowM, Ring, or Pony.
  • --query specifies the query formulation. Available choices: question, code, code_explanation, execution_feedback.
  • --knowledge specifies the knowledge used to augment the LLM. Available choices: web_search, documentation, code_snippets, execution_feedback, documentation_code_snippets, documentation_execution_feedback, code_snippets_execution_feedback, documentation_code_snippets_execution_feedback.
  • --doc_max_length specifies the maximum length of the retrieved documentation.
  • --exp_max_length specifies the maximum length of the retrieved code snippets.
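For example, a first pass on ScipyM followed by one round of active retrieval could look as follows (the output directory and length limits are illustrative values, not defaults from the repository):

python run.py --output_dir outputs --output_tag 0 --openai_key {your_openai_key} --task ScipyM --query code_explanation --knowledge documentation_code_snippets_execution_feedback --doc_max_length 2048 --exp_max_length 2048
python run.py --output_dir outputs --output_tag 1 --openai_key {your_openai_key} --task ScipyM --query code_explanation --knowledge documentation_code_snippets_execution_feedback --doc_max_length 2048 --exp_max_length 2048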

Run evaluation

python eval/{task}.py --output_dir {output_dir} --turn {output_flag}

This reports the execution accuracy of the inference results for the specified turn.
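For instance, to score the first turn of the ScipyM run above (the output directory is illustrative):

python eval/ScipyM.py --output_dir outputs --turn 0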
