# BLEU

See the [GitHub repository](https://github.com/huggingface/evaluate/tree/main/metrics/bleu) for more details.

The __BLEU__ metric measures the proportion of n-grams in the prediction that also appear in one or more references. It is most often used to measure the quality of *machine-translated* text.


## Output

The metric outputs an overall *__bleu__* score, together with a *__precisions__* list that gives the ratio of matching n-grams for each order (1-gram, 2-gram, 3-gram and 4-gram respectively). The *__precisions__* values are not used for the code-quality measurement, but they make the metric easier to interpret. The metric also reports *__brevity_penalty__* and *__length_ratio__*, which are irrelevant to the code-quality measurement.
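
As a rough illustration of how these outputs relate to each other, the sketch below recomputes them in plain Python on whitespace-separated tokens. The names `ngram_counts` and `simple_bleu` are hypothetical, and the choice of the shortest reference for the length computation is an assumption. The `evaluate` implementation has its own tokenizer and an optional smoothing step, so its numbers need not match this sketch on arbitrary input.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(predictions, references, max_order=4):
    """Corpus-level BLEU sketch on whitespace tokens, without smoothing."""
    matches = [0] * max_order
    totals = [0] * max_order
    prediction_length = 0
    reference_length = 0

    for prediction, refs in zip(predictions, references):
        pred_tokens = prediction.split()
        refs_tokens = [ref.split() for ref in refs]
        prediction_length += len(pred_tokens)
        # Assumption: the shortest reference sets the reference length
        reference_length += min(len(tokens) for tokens in refs_tokens)

        for n in range(1, max_order + 1):
            pred_counts = ngram_counts(pred_tokens, n)
            # Clip each n-gram by its highest count in any single reference
            ref_counts = Counter()
            for tokens in refs_tokens:
                ref_counts |= ngram_counts(tokens, n)
            matches[n - 1] += sum((pred_counts & ref_counts).values())
            totals[n - 1] += sum(pred_counts.values())

    precisions = [m / t if t > 0 else 0.0 for m, t in zip(matches, totals)]
    length_ratio = prediction_length / reference_length
    if length_ratio > 1.0:
        brevity_penalty = 1.0
    elif length_ratio == 0:
        brevity_penalty = 0.0
    else:
        brevity_penalty = math.exp(1 - 1 / length_ratio)

    # Geometric mean of the precisions; a single zero precision zeroes the score
    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        geo_mean = 0.0

    return {
        "bleu": geo_mean * brevity_penalty,
        "precisions": precisions,
        "brevity_penalty": brevity_penalty,
        "length_ratio": length_ratio,
    }
```

Because the final score is a geometric mean, it is only non-zero when every n-gram order has at least one match.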


## Usage example

The following is a simple example of the metric behaviour. The __predictions__ variable holds two strings predicted by a machine and the __references__ variable holds one or multiple baselines per *prediction*.

In [1]:
import evaluate

predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
]
bleu = evaluate.load("bleu")
metric_results = bleu.compute(predictions=predictions, references=references)
print(metric_results)

Note that at least one of the references is identical to each prediction; the metric therefore gives a perfect *__bleu__* score, as well as perfect *__precisions__* scores for the 1-gram to 4-gram sets of words.

However, if we remove a single word from the __prediction__ or the __reference__, the obtained score drops to *zero*.

In [2]:
import evaluate

# Removing the word 'there' from the predictions variable
predictions = ["hello general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
]
bleu = evaluate.load("bleu")
metric_results = bleu.compute(predictions=predictions, references=references)
print(metric_results)
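
One plausible explanation for this result, assuming whitespace-like tokenization and no smoothing: once a word is removed, neither prediction is long enough to contain a single 4-gram, so the 4-gram precision is zero and the overall score collapses. The sketch below only counts n-grams; `count_ngrams` is a hypothetical helper and not part of `evaluate`.

```python
def count_ngrams(text, n):
    """Number of contiguous n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return max(len(tokens) - n + 1, 0)


predictions = ["hello general kenobi", "foo bar foobar"]

# Both predictions have only three tokens, so no 4-gram can be formed
four_gram_total = sum(count_ngrams(p, 4) for p in predictions)
print(four_gram_total)
```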

In [2]:
import os
import json
import evaluate as ev


def custom_sort_key(s):
    # A sorting key used to sort strings in a length-lexicographic order (length and alphabetical order)
    return len(s), s


def code_cleanup(script, remove_assert=False):
    # Function that removes any unnecessary components of a given script (comments & tests), leaving only the code lines

    # Removing the test component of HumanEval implementation following 'METADATA' information
    if 'METADATA' in script:
        script = script.split('METADATA', 1)[0]
    elif 'def check(candidate)' in script:
        script = script.split('def check(candidate)', 1)[0]

    script_lines = script.splitlines()

    multi_line_comment = False
    comment_index = []
    assert_index = []
    empty_line_index = []

    for index, line in enumerate(script_lines):

        # Indexing any assert statement
        if remove_assert and 'assert' in line and line[0] == 'a':
            assert_index.append(index)
            continue

        if not multi_line_comment:
            if '#' in line:
                # Indexing single-line comments
                if line.strip()[0] == '#':
                    comment_index.append(index)
                # Removing comment component of the line
                else:
                    cleaned_up_line = line.split('#', 1)[0]
                    script_lines[index] = cleaned_up_line
                continue

            # Indexing the first line of multi-line comments
            if '"""' in line or "'''" in line:
                comment_index.append(index)
                if line.count('"""') == 1 or line.count("'''") == 1:
                    multi_line_comment = True
                continue

        # Adding indexes for multi-line comments
        if multi_line_comment and ('"""' not in line and "'''" not in line):
            comment_index.append(index)
            continue

        # Indexing the last line of multi-line comments
        if multi_line_comment and ('"""' in line or "'''" in line):
            multi_line_comment = False
            comment_index.append(index)
            continue

        # Indexing new lines and blank lines
        if len(line) == 0 or line.isspace():
            empty_line_index.append(index)
            continue

    # Merging indexes for comments, empty lines and assert statements
    comment_index.extend(empty_line_index + assert_index)

    # Removing all the unnecessary parts of code
    for index in sorted(comment_index, reverse=True):
        del script_lines[index]

    # Concatenating the list of script lines
    clean_script = '\n'.join(script_lines)
    return clean_script


def bleu_metric(humaneval=False, check_successful=False, check_failed=False, second_script=False, different_task=False):
    """
    Function that applies the "BLEU" metric to the AI generated code
    :param humaneval: if True, the chosen baseline is the human-made implementation from "HumanEval" dataset
    :param different_task: if True, the chosen references are the implementations for a different task (task 1)
    :param check_successful: if True, the chosen references are implementations with successful tests
    :param second_script: choose a different good implementation as the baseline
    :param check_failed: if True, the chosen references are implementations with failed tests/exec errors
    :return: the dictionary with all the scores, the average score as well as the variance
    """
    if check_successful and check_failed:
        # Raising instead of calling exit() so that a notebook kernel is not shut down
        raise ValueError('Only one active parameter allowed between "check_successful" & "check_failed"')

    funct_test_path_prefix = '../../exp_results/functionality_tests'

    if humaneval:
        baseline_script_path = '../../humaneval/000_has_close_elements.py'
    elif second_script:
        baseline_script_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0/42.py'
    else:
        baseline_script_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0/16.py'
    baseline_file_name = baseline_script_path.split('/')[-1]

    if different_task:
        data_folder_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_1'
    else:
        data_folder_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0'

    tested_task = data_folder_path.split('/')[-1]

    model_name = 'chatgpt'
    model_temp = 'temp_0.8'

    metric_name = 'bleu'
    metric = ev.load(metric_name)

    metric_dict = {'overall_score': 0, 'average_variance': 0}
    file_names = []
    overall_score = 0

    with open(baseline_script_path, 'r') as f:
        baseline = code_cleanup(f.read())

    for path, folder, files in os.walk(data_folder_path):
        for file_name in sorted(files, key=custom_sort_key):

            # Avoiding comparison of the baseline to an identical prediction (i.e., comparing the baseline to the baseline)
            if file_name == baseline_file_name:
                continue

            else:                  
                test_file_path = os.path.join(funct_test_path_prefix, model_name, model_temp, f'{tested_task}.json')
                with open(test_file_path, 'r') as f:
                    funct_dict = json.load(f)
                    
                # Filtering implementations with successful or failed tests
                if check_successful:
                    if not funct_dict[file_name]['successful']:
                        continue
                elif check_failed:
                    if funct_dict[file_name]['successful']:
                        continue

                file_names.append(file_name)

                current_script_path = os.path.join(path, file_name)
                with open(current_script_path) as f:
                    script = code_cleanup(f.read(), remove_assert=True)

                results = metric.compute(predictions=[script], references=[baseline])

                score = results['bleu']
                metric_dict[file_name] = score
                overall_score += score

    # One entry per scored script (equivalent to len(metric_dict) minus the two summary keys)
    nb_scripts = len(file_names)

    overall_score /= nb_scripts

    metric_dict['overall_score'] = overall_score

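    # Note: despite its name, this is the mean absolute deviation from the
    # average score, not the statistical variance (no squaring is applied)
    variance = 0

    for file in file_names:
        variance += abs(overall_score - metric_dict[file])

    metric_dict['average_variance'] = variance / nb_scripts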

    return metric_dict
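
Note that the value stored under `average_variance` in `bleu_metric` is computed as the mean absolute deviation from the average score, which is not the same quantity as the statistical variance. The sketch below contrasts the two on a list of hypothetical scores (made up for illustration, not results from this experiment), using the standard `statistics` module.

```python
import statistics

# Hypothetical BLEU scores, for illustration only
scores = [0.5, 0.7, 0.9]

mean_score = statistics.fmean(scores)

# What bleu_metric stores under 'average_variance': mean absolute deviation
mean_abs_dev = sum(abs(mean_score - s) for s in scores) / len(scores)

# Population variance: mean of the squared deviations
variance = statistics.pvariance(scores)

print(f'mean absolute deviation: {mean_abs_dev:.4f}')
print(f'population variance: {variance:.4f}')
```

The two numbers are on different scales, so the reported spread should be read as an average distance from the mean score.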

## Applying the BLEU metric to code samples

Here we take as a baseline the first *good* implementation from *__chatgpt_temp_0.8 task 0__* (script *16.py*) and we compare it to all the other programs from this model that pass the tests.

__Task 0__: given *'numbers'*, a list of __floats__, and *'threshold'*, a single __float__, determine whether the list contains at least two numbers that are closer to each other than the given *threshold*.

In [4]:
bleu_dict = bleu_metric(check_successful=True)

for key, value in list(bleu_dict.items())[:2]:
    print(f'{key}: {value}')

Here we take as a baseline the __second__ successful implementation from *__chatgpt_temp_0.8__* (script *42.py*) and compare it to all the other *good* scripts. Since both baselines implement the same task and pass all the tests, their __semantics__ are identical and they should therefore yield similar scores. However, because the *__BLEU__* metric is heavily influenced by the *textual differences* between the __reference__ and the __prediction__, the obtained scores differ considerably (*0.7* for the script __16.py__ and *0.6* for the script __42.py__). The metric is therefore not suitable for __code quality__ measurement.


In [5]:
bleu_dict = bleu_metric(check_successful=True, second_script=True)

for key, value in list(bleu_dict.items())[:2]:
    print(f'{key}: {value}')

## Baseline scripts

- ### Script 16.py
    

    from typing import List
    
    def has_close_elements(numbers: List[float], threshold: float) -> bool:
        numbers.sort()
        
        for i in range(1, len(numbers)):
            if abs(numbers[i] - numbers[i-1]) < threshold:
                return True
        return False
    
- ### Script 42.py


    from typing import List

    def has_close_elements(numbers: List[float], threshold: float) -> bool:
        for i, num1 in enumerate(numbers):
            for j, num2 in enumerate(numbers):
                if i != j and abs(num1-num2) < threshold:
                    return True
        return False

    
Note that the two implementations share many similarities (the same *import*, the same *function name* and *signature*), with the only major difference being that one uses a *single* __for loop__ over the sorted list while the other uses a *doubly nested* __for loop__.
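
The claim that both baselines behave identically can be spot-checked by running them on the same inputs. In the sketch below the two functions are the scripts listed above, renamed so that they can coexist (the names `has_close_elements_sorted` and `has_close_elements_nested` are hypothetical). The inputs are illustrative and are not the benchmark's test suite.

```python
from typing import List


def has_close_elements_sorted(numbers: List[float], threshold: float) -> bool:
    # Single loop over the sorted list (sorts its argument in place)
    numbers.sort()

    for i in range(1, len(numbers)):
        if abs(numbers[i] - numbers[i-1]) < threshold:
            return True
    return False


def has_close_elements_nested(numbers: List[float], threshold: float) -> bool:
    # Doubly nested loop over every pair of distinct positions
    for i, num1 in enumerate(numbers):
        for j, num2 in enumerate(numbers):
            if i != j and abs(num1-num2) < threshold:
                return True
    return False


# Illustrative inputs: (numbers, threshold)
cases = [
    ([1.0, 2.0, 3.0], 0.5),
    ([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3),
    ([], 1.0),
    ([1.0, 1.0], 0.1),
]

# Copies are passed because the first function mutates its argument
same_behaviour = all(
    has_close_elements_sorted(list(nums), thr) == has_close_elements_nested(list(nums), thr)
    for nums, thr in cases
)
print(same_behaviour)
```

Agreement on a handful of inputs is not a proof of equivalence, but it is consistent with both scripts passing the same functionality tests.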



While the observations from the experiment thus far show that the metric is not suited for __code quality__ measurement, we can still study its behaviour in other configurations: comparing the *good* baseline script with unsuccessful implementations, or taking a human-made script as the baseline.


## Comparing with unsuccessful implementations

In [8]:
bleu_dict = bleu_metric(check_failed=True)

for key, value in list(bleu_dict.items())[:2]:
    print(f'{key}: {value}')

As we can see, comparing a *good* implementation (script *16.py*) with *bad* implementations yields results similar to those obtained when comparing with *successful* implementations. This is explained by the source of the errors in this particular case (*__chatgpt_temp_0.8 task 0__*): the failed tests are mostly caused either by the word "__python__" appearing at the beginning of the script (the model probably forgets to comment out the name of the programming language) or by the missing "import List" statement that the script needs. Without these small errors, the majority of the implementations generated for this __model__ and __task__ would pass all the tests.

Once again, because the *BLEU* metric relies heavily on *textual similarities* and pays no attention to crucial elements such as __missing imports__ or __syntax errors__, implementations that look similar to the *baseline* yield good results even when the code cannot be __compiled__ or __executed__.
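
The sketch below illustrates this gap on a hypothetical generated script that starts with a stray `python` line. The helper `runs_without_error` and the `token_overlap` ratio are assumptions made for illustration; the overlap is a crude stand-in for textual similarity, not the BLEU score itself.

```python
# Hypothetical generated script: valid code preceded by a stray "python" line
broken_script = """python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    numbers.sort()
    for i in range(1, len(numbers)):
        if abs(numbers[i] - numbers[i-1]) < threshold:
            return True
    return False
"""

# The same code without the stray first line
reference_script = broken_script.split("\n", 1)[1]


def runs_without_error(source: str) -> bool:
    """Return True if the script compiles and executes at module level."""
    try:
        exec(compile(source, "<generated>", "exec"), {})
    except Exception:
        return False
    return True


# Share of distinct whitespace tokens of the broken script found in the reference
broken_tokens = set(broken_script.split())
token_overlap = len(broken_tokens & set(reference_script.split())) / len(broken_tokens)

print(f'token overlap: {token_overlap:.2f}')
print(f'broken script runs: {runs_without_error(broken_script)}')
print(f'reference script runs: {runs_without_error(reference_script)}')
```

Almost every token is shared between the two scripts, yet only one of them can be executed: the bare name `python` is undefined and raises an error as soon as the script runs.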


## Human-made baseline implementation

The following is the human-made implementation of *task 0* from the Eval+ benchmark:

    from typing import List
    
    def has_close_elements(numbers: List[float], threshold: float) -> bool:
        assert threshold > 0, "invalid inputs"
        assert all([isinstance(v, (int, float)) for v in numbers]), "invalid inputs"
    
        sorted_numbers = sorted(numbers)
        for i in range(len(sorted_numbers) - 1):
            if sorted_numbers[i + 1] - sorted_numbers[i] < threshold:
                return True
        return False

Note that this implementation is very similar to the AI-generated ones, with the only notable difference being the presence of *assert* statements that validate the __input__ values.

In [9]:
bleu_dict = bleu_metric(check_successful=True, humaneval=True)

for key, value in list(bleu_dict.items())[:2]:
    print(f'{key}: {value}')

The addition of *assert* statements has a huge impact on the __BLEU__ result, despite the rest of the implementation being very similar to the AI-generated code.
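
One way to probe this observation further would be to strip the *assert* statements from the human-made reference before scoring and check whether the score recovers. The sketch below shows such a normalization step using the standard `ast` module (Python 3.9+ for `ast.unparse`). `AssertStripper` and `strip_asserts` are hypothetical names; this step is not part of the experiment above.

```python
import ast


class AssertStripper(ast.NodeTransformer):
    """Remove every assert statement from a syntax tree."""

    def visit_Assert(self, node):
        # Returning None deletes the node from its parent's body
        return None


def strip_asserts(source: str) -> str:
    # Caveat: a block made only of asserts would be left empty and invalid
    tree = AssertStripper().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


human_script = '''
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    assert threshold > 0, "invalid inputs"
    assert all([isinstance(v, (int, float)) for v in numbers]), "invalid inputs"

    sorted_numbers = sorted(numbers)
    for i in range(len(sorted_numbers) - 1):
        if sorted_numbers[i + 1] - sorted_numbers[i] < threshold:
            return True
    return False
'''

cleaned_script = strip_asserts(human_script)
print(cleaned_script)
```

Unlike the line-based `code_cleanup` function, which only matches *assert* statements that start at the beginning of a line, this approach also removes indented ones. It rewrites the formatting of the code as a side effect, which would itself affect a text-based metric.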


## Different task as a baseline

Here we once again take as the __baseline__ the first *good* script from __task 0__, but we compare it to *correct* implementations of __task 1__ of the Eval+ benchmark. This experiment is meant to test whether a given *metric* can detect differences in the code's __semantics__, an important capability for measuring *code quality*.


__Task 1__: given *'paren_string'*, a string containing multiple groups of nested parentheses, return a list of strings in which each element is one of the separated groups.

The following is an example of a *good* AI-generated implementation for __task 1__ (script *0.py*):

    from typing import List
    
    def separate_paren_groups(paren_string: str) -> List[str]:
        paren_string = paren_string.replace(" ", "")
        groups = []
        stack = []
        start = 0
        for i in range(len(paren_string)):
            if paren_string[i] == '(':
                stack.append('(')
            elif paren_string[i] == ')':
                stack.pop()
                if not stack:
                    groups.append(paren_string[start:i+1])
                    start = i+1
        return groups

In [9]:
bleu_dict = bleu_metric(check_successful=True, different_task=True)

for key, value in list(bleu_dict.items())[:2]:
    print(f'{key}: {value}')

As expected, the __BLEU__ score is very low because of the large *textual differences* between the implementations of these two tasks.
