Reading data for code generation. #27

BakingBrains · 2022-01-17T09:21:48Z

For the code generation task, should I use the data reading method used for Concode?

def read_concode_examples(filename, data_num):
    """Read examples from filename."""
    examples = []

    with open(filename) as f:
        for idx, line in enumerate(f):
            x = json.loads(line)
            examples.append(
                Example(
                    idx=idx,
                    source=x["nl"].strip(),
                    target=x["code"].strip()
                )
            )
            # Stop once data_num examples have been read (idx is 0-based).
            if idx + 1 == data_num:
                break
    return examples

Or should I use the data reading method used for code summarization (here replacing source with docstring_tokens and target with code_tokens)?

def read_summarize_examples(filename, data_num):
    """Read examples from filename."""
    examples = []
    with open(filename, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.strip()
            js = json.loads(line)
            if 'idx' not in js:
                js['idx'] = idx
            code = ' '.join(js['code_tokens']).replace('\n', ' ')
            code = ' '.join(code.strip().split())
            nl = ' '.join(js['docstring_tokens']).replace('\n', '')
            nl = ' '.join(nl.strip().split())
            examples.append(
                Example(
                    idx=idx,
                    source=nl,
                    target=code,
                )
            )
            if idx + 1 == data_num:
                break
    return examples

Any suggestions?

Also, do I need to change args.max_source_length = 256 and args.max_target_length = 128 for the code generation task?

yuewang-cuhk · 2022-01-18T03:40:09Z

Hi, the data reading functions can certainly be customized to your needs. If you want to fine-tune on the Concode code generation task, use the former, read_concode_examples. If you instead want to reverse the CodeSearchNet summarization data into a text-to-code generation task, you can modify read_summarize_examples.

The maximum source/target lengths are usually determined by the tokenized lengths of your (source, target) pairs and, in some cases, by GPU memory limits. You can tune these hyperparameters as well.
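
One way to pick these values is to measure the tokenized lengths of your own pairs and take a high percentile. The sketch below is only illustrative: `length_at_percentile` and `suggest_max_lengths` are hypothetical helper names, not part of the repository, and the default `str.split` tokenizer is a stand-in. For real numbers you would pass the tokenizer you fine-tune with (for example the `tokenize` method of a Hugging Face tokenizer) and leave some headroom for any special tokens it adds.

```python
import math
from typing import Callable, Iterable, List, Tuple


def length_at_percentile(lengths: List[int], pct: float) -> int:
    """Return the smallest length that covers `pct` percent of the examples."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Nearest-rank percentile: the rank is 1-based, so subtract 1 to index.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_max_lengths(
    pairs: Iterable[Tuple[str, str]],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer (assumption)
    pct: float = 95.0,
) -> Tuple[int, int]:
    """Suggest (max_source_length, max_target_length) from tokenized lengths."""
    source_lengths, target_lengths = [], []
    for source, target in pairs:
        source_lengths.append(len(tokenize(source)))
        target_lengths.append(len(tokenize(target)))
    return (
        length_at_percentile(source_lengths, pct),
        length_at_percentile(target_lengths, pct),
    )
```

For example, `suggest_max_lengths([(ex.source, ex.target) for ex in examples])` gives lengths that cover about 95% of the examples without truncation. If those exceed what fits in GPU memory, lower `pct` or reduce the batch size.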

yuewang-cuhk closed this as completed Jan 18, 2022
