# CodeBLEU

See the [GitHub repository](https://github.com/k4black/codebleu) for more details.

The __CodeBLEU__ metric measures the similarity between code samples by combining four components: the standard __BLEU__ n-gram match, a weighted n-gram match that gives more weight to programming-language keywords (e.g., *def*, *int*, *list*), an AST (syntax) match and a data-flow match. Because the keyword lists and parsers are specific to each programming language, the metric can only be applied to the languages this implementation supports: `Python`, `C`, `C#`, `C++`, `Java`, `JavaScript`, `PHP`, `Go`, `Ruby`, `Rust`.


## Parameters
The __CodeBLEU__ metric can be parameterized with 4 weights, one for each component score (n-gram, weighted n-gram, AST and data-flow), each with a value between 0 and 1; the default value of every weight is *0.25*.


## Output
__CodeBLEU__ outputs the 4 aforementioned scores, as well as a global __codebleu__ score, which is the weighted sum of the four component scores (their mean when the default weights are used):

    {
    'codebleu': 0.5537, 
    'ngram_match_score': 0.1041, 
    'weighted_ngram_match_score': 0.1109, 
    'syntax_match_score': 1.0, 
    'dataflow_match_score': 1.0
    }
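
The relation between the component scores and the global score can be checked by hand. The sketch below is only an illustration: `combine_scores` is a hypothetical helper (it is not part of the `codebleu` package), and it assumes the weighted-sum definition of CodeBLEU with the default weights of *0.25*.

```python
# Hypothetical helper (not part of the codebleu package): recombines the four
# component scores into a global score, assuming a plain weighted sum.
COMPONENT_KEYS = (
    'ngram_match_score',
    'weighted_ngram_match_score',
    'syntax_match_score',
    'dataflow_match_score',
)


def combine_scores(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    if len(weights) != len(COMPONENT_KEYS):
        raise ValueError('Exactly 4 weights are expected, one per component score')
    if any(not 0 <= w <= 1 for w in weights):
        raise ValueError('Each weight is expected to lie between 0 and 1')
    return sum(w * scores[key] for w, key in zip(weights, COMPONENT_KEYS))


# Component scores copied from the example output above
example = {
    'ngram_match_score': 0.1041,
    'weighted_ngram_match_score': 0.1109,
    'syntax_match_score': 1.0,
    'dataflow_match_score': 1.0,
}

# Default weights: close to the reported 'codebleu' value of 0.5537
print(combine_scores(example))
# Illustrative custom weights that favour the syntax and data-flow components
print(combine_scores(example, weights=(0.1, 0.1, 0.4, 0.4)))
```

In the linked package, custom weights appear to be passed as a 4-tuple through the `weights` argument of `calc_codebleu`; this should be double-checked against the repository README before relying on it.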

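For a better analysis of the metric results, the code in this notebook also measures the dispersion of each score over the compared scripts. This dispersion is called __variance__ here, although it is computed as the mean absolute deviation from the mean score. Every aggregated score is therefore a tuple with 2 elements: the mean __score__ and the __variance__. Example: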

    {
    'codebleu': (0.6386289761296269, 0.04037209526410621),
    'ngram_match_score': (0.5727281562370962, 0.05641782117008404),
    'weighted_ngram_match_score': (0.6557643564685458, 0.038337612508116244),
    'syntax_match_score': (0.5423976608187137, 0.043705755617113116),
    'dataflow_match_score': (0.7836257309941523, 0.054786088027085345)
    }
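
To make the two elements of each tuple concrete, the sketch below aggregates a list of per-script scores. The values in `sample_scores` are made up for illustration and the helper name `mean_and_dispersion` is hypothetical. The helper mirrors the computation used later in this notebook (mean absolute deviation from the mean); the statistical variance from the standard library is printed next to it for comparison, since the two quantities are generally not equal.

```python
import statistics


def mean_and_dispersion(scores):
    # Mean score and mean absolute deviation from that mean
    mean = sum(scores) / len(scores)
    dispersion = sum(abs(mean - s) for s in scores) / len(scores)
    return mean, dispersion


# Made-up per-script scores, for illustration only
sample_scores = [0.5, 0.7, 0.9]

mean, mad = mean_and_dispersion(sample_scores)
variance = statistics.pvariance(sample_scores)  # population variance, for comparison

print(f'mean={mean:.4f}, mean absolute deviation={mad:.4f}, variance={variance:.4f}')
```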


## Issues
The GitHub repository for this metric mentions an open issue with the __data-flow__ score, which might affect the results of the experiment. This implementation was nevertheless chosen because it is the best __CodeBLEU__ implementation we could find.


## Important note
For the sake of simplicity, the experiments using this metric are done with default parameters for the 4 scores. Further experiments with different parameters could yield more accurate results.


## Usage example

In [2]:
from codebleu import calc_codebleu

prediction = "def add ( a , b ) :\n return a + b"
reference = "def sum ( first , second ) :\n return second + first"

result = calc_codebleu([reference], [prediction], lang="python")
for key, value in result.items():
    print(f'{key}: {value}')

In [2]:
import os
import json
from codebleu import calc_codebleu

def custom_sort_key(s):
    # A sorting key used to sort strings in a length-lexicographic order (length and alphabetical order)
    return len(s), s


def code_cleanup(script, remove_assert=False):
    # Function that removes any unnecessary components of a given script (comments & tests), leaving only the code lines

    # Removing the test component of HumanEval implementation following 'METADATA' information
    if 'METADATA' in script:
        script = script.split('METADATA', 1)[0]
    elif 'def check(candidate)' in script:
        script = script.split('def check(candidate)', 1)[0]

    script_lines = script.splitlines()

    multi_line_comment = False
    comment_index = []
    assert_index = []
    empty_line_index = []

    for index, line in enumerate(script_lines):

        # Indexing any top-level assert statement (the line must start with the keyword itself)
        if remove_assert and line.startswith(('assert ', 'assert(')):
            assert_index.append(index)
            continue

        if not multi_line_comment:
            if '#' in line:
                # Indexing single-line comments
                if line.strip()[0] == '#':
                    comment_index.append(index)
                # Removing comment component of the line
                else:
                    cleaned_up_line = line.split('#', 1)[0]
                    script_lines[index] = cleaned_up_line
                continue

            # Indexing the first line of multi-line comments
            if '"""' in line or "'''" in line:
                comment_index.append(index)
                if line.count('"""') == 1 or line.count("'''") == 1:
                    multi_line_comment = True
                continue

        # Adding indexes for multi-line comments
        if multi_line_comment and ('"""' not in line and "'''" not in line):
            comment_index.append(index)
            continue

        # Indexing the last line of multi-line comments
        if multi_line_comment and ('"""' in line or "'''" in line):
            multi_line_comment = False
            comment_index.append(index)
            continue

        # Indexing new lines and blank lines
        if len(line) == 0 or line.isspace():
            empty_line_index.append(index)
            continue

    # Merging indexes for comments, empty lines and assert statements
    comment_index.extend(empty_line_index + assert_index)

    # Removing all the unnecessary parts of code
    for index in sorted(comment_index, reverse=True):
        del script_lines[index]

    # Concatenating the list of script lines
    clean_script = '\n'.join(script_lines)
    return clean_script


def codebleu_metric(check_successful=False, check_failed=False, second_script=False, humaneval=False,
                    different_task=False):
    """
    Function that applies the "CodeBLEU" metric to the AI generated code
    :param check_successful: if True, only implementations whose tests passed are compared to the baseline
    :param check_failed: if True, only implementations with failed tests/exec errors are compared to the baseline
    :param second_script: if True, a different successful implementation (42.py) is used as the baseline
    :param humaneval: if True, the chosen baseline is the human-made implementation from "HumanEval" dataset
    :param different_task: if True, the compared implementations are those of a different task (task 1)
    :return: a dictionary with the per-file scores and, for each score, a tuple (mean, mean absolute deviation)
    """
    if check_successful and check_failed:
        raise ValueError('Only one active parameter allowed between "check_successful" & "check_failed"')

    json_path_prefix = '../../exp_results/metrics_calc'
    funct_test_path_prefix = '../../exp_results/functionality_tests'

    if humaneval:
        baseline_script_path = '../../humaneval/000_has_close_elements.py'
    elif second_script:
        baseline_script_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0/42.py'
    else:
        baseline_script_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0/16.py'
    baseline_file_name = baseline_script_path.split('/')[-1]

    if different_task:
        data_folder_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_1'
    else:
        data_folder_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0'

    tested_task = data_folder_path.split('/')[-1]

    task_name = 'HumanEval_0'
    model_name = 'chatgpt'
    model_temp = 'temp_0.8'

    metric_name = 'codebleu'

    json_folder_path = os.path.join(json_path_prefix, metric_name, model_name, model_temp, task_name)

    # Creating the output folder if it does not exist yet
    os.makedirs(json_folder_path, exist_ok=True)

    metric_dict = {'codebleu': (0, 0),
                   'ngram_match_score': (0, 0),
                   'weighted_ngram_match_score': (0, 0),
                   'syntax_match_score': (0, 0),
                   'dataflow_match_score': (0, 0)}

    file_names = []
    overall_codebleu_score = 0
    overall_ngram_score = 0
    overall_weighted_ngram_score = 0
    overall_syntax_score = 0
    overall_dataflow_score = 0

    with open(baseline_script_path, 'r') as f:
        baseline = code_cleanup(f.read())

    # Loading the functionality test results once, rather than once per file
    test_file_path = os.path.join(funct_test_path_prefix, model_name, model_temp, f'{tested_task}.json')
    with open(test_file_path, 'r') as f:
        funct_dict = json.load(f)

    for path, folder, files in os.walk(data_folder_path):
        for file_name in sorted(files, key=custom_sort_key):

            # Avoiding comparison of the baseline to an identical prediction (i.e., comparing the baseline to the
            # baseline)
            if file_name == baseline_file_name:
                continue

            else:

                # Filtering implementations with successful or failed tests
                if check_successful:
                    if not funct_dict[file_name]['successful']:
                        continue
                elif check_failed:
                    if funct_dict[file_name]['successful']:
                        continue

                file_names.append(file_name)

                current_script_path = os.path.join(path, file_name)
                with open(current_script_path) as f:
                    script = code_cleanup(f.read(), remove_assert=True)

                results = calc_codebleu(predictions=[script], references=[baseline], lang='python')

                metric_dict[file_name] = results
                overall_codebleu_score += results['codebleu']
                overall_ngram_score += results['ngram_match_score']
                overall_weighted_ngram_score += results['weighted_ngram_match_score']
                overall_syntax_score += results['syntax_match_score']
                overall_dataflow_score += results['dataflow_match_score']

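    # Number of compared scripts (the 5 aggregate keys of metric_dict are not scripts)
    nb_scripts = len(file_names)
    if nb_scripts == 0:
        raise ValueError('No script matched the chosen filters; nothing to average')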

    overall_codebleu_score /= nb_scripts
    overall_ngram_score /= nb_scripts
    overall_weighted_ngram_score /= nb_scripts
    overall_syntax_score /= nb_scripts
    overall_dataflow_score /= nb_scripts

    # Note: the "variance" below is computed as the mean absolute deviation from the mean score
    codebleu_variance = 0
    ngram_variance = 0
    weighted_ngram_variance = 0
    syntax_variance = 0
    dataflow_variance = 0

    for file in file_names:
        codebleu_variance += abs(overall_codebleu_score - metric_dict[file]['codebleu'])
        ngram_variance += abs(overall_ngram_score - metric_dict[file]['ngram_match_score'])
        weighted_ngram_variance += abs(overall_weighted_ngram_score - metric_dict[file]['weighted_ngram_match_score'])
        syntax_variance += abs(overall_syntax_score - metric_dict[file]['syntax_match_score'])
        dataflow_variance += abs(overall_dataflow_score - metric_dict[file]['dataflow_match_score'])

    codebleu_variance /= nb_scripts
    ngram_variance /= nb_scripts
    weighted_ngram_variance /= nb_scripts
    syntax_variance /= nb_scripts
    dataflow_variance /= nb_scripts

    metric_dict['codebleu'] = (overall_codebleu_score, codebleu_variance)
    metric_dict['ngram_match_score'] = (overall_ngram_score, ngram_variance)
    metric_dict['weighted_ngram_match_score'] = (overall_weighted_ngram_score, weighted_ngram_variance)
    metric_dict['syntax_match_score'] = (overall_syntax_score, syntax_variance)
    metric_dict['dataflow_match_score'] = (overall_dataflow_score, dataflow_variance)

    return metric_dict
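
One limitation of `code_cleanup` above is that it treats every `#` as the start of a comment, so a `#` inside a string literal truncates the line. A possible alternative, sketched below, is to let the standard `tokenize` module locate the comments. This is not the method used in the experiments; `strip_comments` is a hypothetical name, and the sketch does not remove docstrings, assert statements or test code.

```python
import io
import tokenize


def strip_comments(source):
    # Hypothetical alternative to the '#'-splitting logic: drop COMMENT tokens
    # found by the tokenizer, so '#' characters inside strings are left untouched
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    cleaned = tokenize.untokenize(kept)

    # Removing the padding left where comments used to be, then the blank lines
    # (caveat: this also affects blank lines inside multi-line strings)
    lines = [line.rstrip() for line in cleaned.splitlines()]
    return '\n'.join(line for line in lines if line)


source = 'url = "a#b"  # trailing comment\n# full-line comment\nprint(url)\n'
print(strip_comments(source))
```

The string literal `"a#b"` is kept whole, whereas splitting the line on the first `#` would cut it in the middle.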

## Applying __CodeBLEU__ metric to code samples
As per the experimental protocol, we start by choosing as the baseline the first *successful* implementation of __chatgpt_temp_0.8 task 0__ (script *16.py*) and compare it to all the other *good* implementations from this __model__ and __task__.

In [8]:
codebleu_dict = codebleu_metric(check_successful=True)

for key, value in list(codebleu_dict.items())[:5]:
    print(f'{key}: {value}')

Now we will consider the second *successful* script (*42.py*) as the baseline in order to analyze the difference in __CodeBLEU__ scores for different yet *correct* implementations of the same task.


In [9]:
codebleu_dict = codebleu_metric(check_successful=True, second_script=True)

for key, value in list(codebleu_dict.items())[:5]:
    print(f'{key}: {value}')

We can notice a difference in the overall score for these two __baselines__ (*0.642* for the script __16.py__ and *0.598* for the script __42.py__). The gap is not negligible, yet it is smaller than the gap observed when applying the __BLEU__ metric, since __CodeBLEU__ takes into consideration not only *textual similarities* but also syntax (AST) and data-flow similarities.


## Comparing with unsuccessful implementations
Now we will take the first *good* implementation and compare it to scripts that *did not* pass the tests.

In [10]:
codebleu_dict = codebleu_metric(check_failed=True)

for key, value in list(codebleu_dict.items())[:5]:
    print(f'{key}: {value}')

We can observe that the overall score is lower than in the first two experiments, but the drop compared to the previous experiment is equal to the gap between the first two experiments. However, if we pay closer attention to the __syntax match__ and __data flow__ scores (__0.54 & 0.79__ and __0.58 & 0.74__ when comparing with *successful* implementations, in contrast with __0.47 & 0.65__ when comparing with *failed* implementations), we can discern considerably worse scores. Further experiments with the metric parameters might yield more coherent scores.
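
The comparison above can be made explicit with a little arithmetic. The sketch below only reuses the rounded scores quoted in the paragraph; the dictionary layout, the experiment labels and the helper name `score_drop` are illustrative assumptions.

```python
# Rounded syntax match and data flow scores quoted in the text above
successful_runs = {
    'first experiment': {'syntax_match_score': 0.54, 'dataflow_match_score': 0.79},
    'second experiment': {'syntax_match_score': 0.58, 'dataflow_match_score': 0.74},
}
failed_run = {'syntax_match_score': 0.47, 'dataflow_match_score': 0.65}


def score_drop(reference_scores, other_scores):
    # Per-component difference: positive values mean `other_scores` is lower
    return {key: round(reference_scores[key] - other_scores[key], 2)
            for key in reference_scores}


for name, scores in successful_runs.items():
    print(name, score_drop(scores, failed_run))
```

With these rounded values, the drop ranges from 0.07 to 0.11 for the syntax match score and from 0.09 to 0.14 for the data flow score.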


## Human-made baseline implementation
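Now we'll consider as a baseline the human-written Eval+ implementation for __task 0__ and compare it to the implementations generated by AI for the same task. Note that the call below sets neither `check_successful` nor `check_failed`, so all the generated implementations are compared, whether or not they pass the tests.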

In [11]:
codebleu_dict = codebleu_metric(humaneval=True)

for key, value in list(codebleu_dict.items())[:5]:
    print(f'{key}: {value}')

As expected, the difference in syntax and the presence of *assert* statements lead to lower scores than in any previous experiment. Nevertheless, if we focus again on the __syntax match__ and __data flow__ scores, we can notice that they decreased relatively less than the other scores, with the __syntax match__ score showing the smallest decline across the experiments so far.


## Different task as a baseline
Lastly, we will compare the standard baseline with the *successful* implementations for __task 1__ in order to see how this metric is affected by scripts that have drastically different __semantics__.

In [12]:
codebleu_dict = codebleu_metric(check_successful=True, different_task=True)

for key, value in list(codebleu_dict.items())[:5]:
    print(f'{key}: {value}')

As expected, the overall score is much lower than in any previous experiment. The exception is the __data flow__ score; although very unusual, this could be explained by similarities in the way data is generally manipulated in simple programs (e.g., iterating over the list given as input). The anomaly could also be due to the __issue__ mentioned in the GitHub repository. Further analysis is necessary.
