Reading data for code generation. #27

Closed
BakingBrains opened this issue Jan 17, 2022 · 1 comment

BakingBrains commented Jan 17, 2022

For the code generation task, should I use the data reading method used for Concode?

def read_concode_examples(filename, data_num):
    """Read examples from filename."""
    examples = []

    with open(filename) as f:
        for idx, line in enumerate(f):
            x = json.loads(line)
            examples.append(
                Example(
                    idx=idx,
                    source=x["nl"].strip(),
                    target=x["code"].strip()
                )
            )
            # Stop once data_num examples have been read.
            if idx + 1 == data_num:
                break
    return examples

Or should I use the data reading method used for code summarization (here with source set to the docstring_tokens and target set to the code_tokens)?

def read_summarize_examples(filename, data_num):
    """Read examples from filename."""
    examples = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.strip()
            js = json.loads(line)
            if 'idx' not in js:
                js['idx'] = idx
            code = ' '.join(js['code_tokens']).replace('\n', ' ')
            code = ' '.join(code.strip().split())
            nl = ' '.join(js['docstring_tokens']).replace('\n', '')
            nl = ' '.join(nl.strip().split())
            examples.append(
                Example(
                    idx=idx,
                    source=nl,
                    target=code,
                )
            )
            if idx + 1 == data_num:
                break
    return examples

Any suggestions?

Also, do I need to change args.max_source_length = 256 and args.max_target_length = 128 for the code generation task?

@yuewang-cuhk
Contributor

Hi, the data reading functions can certainly be customized to your needs. If you want to fine-tune on the Concode code generation task, use the former, read_concode_examples. If you instead want to reverse the CodeSearchNet summarization data into a text-to-code generation task, you can modify read_summarize_examples.
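
As an illustration of the second option, here is a minimal sketch of a reversed reader. It assumes CodeSearchNet-style JSONL with docstring_tokens and code_tokens fields, as in the function quoted in the question. The name read_text2code_examples and the Example dataclass are hypothetical stand-ins, not part of the repository.

```python
import json
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical stand-in for the repository's Example class."""
    idx: int
    source: str
    target: str


def read_text2code_examples(filename, data_num=-1):
    """Read JSONL records as (docstring -> code) pairs.

    data_num=-1 reads the whole file (an assumed convention).
    """
    examples = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            js = json.loads(line)
            # Join tokens, then collapse all whitespace (including newlines).
            nl = " ".join(" ".join(js["docstring_tokens"]).split())
            code = " ".join(" ".join(js["code_tokens"]).split())
            # Natural language is the source, code is the target.
            examples.append(Example(idx=idx, source=nl, target=code))
            if len(examples) == data_num:
                break
    return examples
```

The only substantive change from the summarization reader is which field goes to source and which to target; the whitespace normalization is kept so each side stays on a single line.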

The maximum source/target lengths are usually determined by the tokenized lengths of your (source, target) pairs and, in some cases, by GPU memory limits. You can tune these hyperparameters as well.
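
One way to ground that choice is to measure the tokenized lengths directly. The sketch below is only an illustration: suggest_max_length is a hypothetical helper, the default whitespace tokenizer is a placeholder for the model's real tokenizer, and extra_tokens=2 assumes two special tokens are added per sequence.

```python
import math


def suggest_max_length(texts, tokenize=str.split, coverage=0.95, extra_tokens=2):
    """Return the smallest length that fits `coverage` of the texts untruncated.

    tokenize: callable mapping a string to a list of tokens (assumption:
              whitespace split; swap in the model's tokenizer for real use).
    extra_tokens: room reserved for special tokens (assumed to be 2).
    """
    lengths = sorted(len(tokenize(t)) + extra_tokens for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest-rank percentile over the sorted lengths.
    rank = max(1, math.ceil(coverage * len(lengths)))
    return lengths[rank - 1]
```

Running this separately over the source and target sides gives candidate values for args.max_source_length and args.max_target_length; if the result does not fit in GPU memory, lowering coverage trades some truncation for a smaller length.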
