# METEOR
See the [GitHub repository](https://github.com/huggingface/evaluate/tree/main/metrics/meteor) for more details.

__METEOR__ (Metric for Evaluation of Translation with Explicit ORdering) is a machine translation evaluation metric computed from the harmonic mean of unigram precision and recall, with recall weighted more heavily than precision.

Precision and recall, two metrics borrowed from ML classification, measure here how well the predicted text matches the reference:
- __precision__ - among __predicted__ words, *how many of them are relevant*
- __recall__ - among relevant words found in the __reference__, *how many of them are predicted*
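
The two definitions above can be illustrated with a minimal sketch. It assumes lowercase whitespace tokenization and exact word matches only, whereas the real metric also aligns stems and synonyms. The helper name `unigram_precision_recall` and the example sentences are hypothetical.

```python
from collections import Counter


def unigram_precision_recall(prediction, reference):
    """Exact-match unigram precision and recall on whitespace tokens (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: a word is matched at most as many times as it occurs in both texts
    matches = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())

    # Precision: share of predicted words that are relevant
    precision = matches / len(pred_tokens) if pred_tokens else 0.0
    # Recall: share of reference words that were predicted
    recall = matches / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall


precision, recall = unigram_precision_recall(
    "the cat sat on the mat today",  # hypothetical prediction (7 words)
    "the cat is on the mat",         # hypothetical reference (6 words)
)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Here five words are matched, so precision is 5/7 and recall is 5/6: the extra predicted words lower precision without affecting recall.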

## Parameters
This metric has three optional parameters, `alpha`, `beta` and `gamma`, whose effect is not obvious at first sight. The values proposed in the research paper are used by default. For more information on the available parameters, see the [GitHub repository](https://github.com/huggingface/evaluate/tree/main/metrics/meteor).
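
The following is a rough sketch of where three such parameters typically enter the score. It assumes the NLTK-style formulation (weights `alpha`, `beta` and `gamma`, with defaults 0.9, 3 and 0.5) that the `evaluate` wrapper is understood to rely on. The function name `meteor_from_counts` and the example counts are hypothetical, and the word alignment and chunk counting steps are taken as given rather than implemented.

```python
def meteor_from_counts(matches, pred_len, ref_len, chunks, alpha=0.9, beta=3.0, gamma=0.5):
    """Combine already-computed match statistics into a METEOR-style score (sketch)."""
    if matches == 0:
        return 0.0

    precision = matches / pred_len
    recall = matches / ref_len

    # Weighted harmonic mean: alpha close to 1 makes recall dominate
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # Fragmentation penalty: more (shorter) contiguous chunks means worse word order
    penalty = gamma * (chunks / matches) ** beta

    return (1 - penalty) * f_mean


# Hypothetical counts: 5 matched words grouped into 2 contiguous chunks
score = meteor_from_counts(matches=5, pred_len=7, ref_len=6, chunks=2)
print(f"score={score:.4f}")
```

In this formulation `alpha` sets the balance between precision and recall, `gamma` caps the maximum penalty, and `beta` controls how quickly the penalty grows as the matches become more fragmented.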

## Usage example

In [7]:
import evaluate

meteor = evaluate.load('meteor')
predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
references = ["It is a guide to action that ensures that the military will forever heed Party commands"]
results = meteor.compute(predictions=predictions, references=references)
print(results)

In [14]:
import os
import json
import evaluate as ev


def custom_sort_key(s):
    # A sorting key used to sort strings in a length-lexicographic order (length and alphabetical order)
    return len(s), s


def code_cleanup(script, remove_assert=False):
    # Function that removes any unnecessary components of a given script (comments & tests), leaving only the code lines

    # Removing the test component of HumanEval implementation following 'METADATA' information
    if 'METADATA' in script:
        script = script.split('METADATA', 1)[0]
    elif 'def check(candidate)' in script:
        script = script.split('def check(candidate)', 1)[0]

    script_lines = script.splitlines()

    multi_line_comment = False
    comment_index = []
    assert_index = []
    empty_line_index = []

    for index, line in enumerate(script_lines):

        # Indexing top-level assert statements (test code appended to the script)
        if remove_assert and line.startswith('assert'):
            assert_index.append(index)
            continue

        if not multi_line_comment:
            if '#' in line:
                # Indexing single-line comments
                if line.strip()[0] == '#':
                    comment_index.append(index)
                # Removing comment component of the line
                else:
                    cleaned_up_line = line.split('#', 1)[0]
                    script_lines[index] = cleaned_up_line
                continue

            # Indexing the first line of multi-line comments
            if '"""' in line or "'''" in line:
                comment_index.append(index)
                if line.count('"""') == 1 or line.count("'''") == 1:
                    multi_line_comment = True
                continue

        # Adding indexes for multi-line comments
        if multi_line_comment and ('"""' not in line and "'''" not in line):
            comment_index.append(index)
            continue

        # Indexing the last line of multi-line comments
        if multi_line_comment and ('"""' in line or "'''" in line):
            multi_line_comment = False
            comment_index.append(index)
            continue

        # Indexing new lines and blank lines
        if len(line) == 0 or line.isspace():
            empty_line_index.append(index)
            continue

    # Merging indexes for comments, empty lines and assert statements
    comment_index.extend(empty_line_index)
    comment_index.extend(assert_index)

    # Removing all the unnecessary parts of code
    for index in sorted(comment_index, reverse=True):
        del script_lines[index]

    # Concatenating the list of script lines
    clean_script = '\n'.join(script_lines)
    return clean_script


def meteor_metric(check_successful=False, check_failed=False, second_script=False, humaneval=False,
                  different_task=False):
    """
    Function that applies the "METEOR" metric to the AI generated code
    :param check_successful: if True, the chosen references are implementations with successful tests
    :param check_failed: if True, the chosen references are implementations with failed tests/exec errors
    :param second_script: choose a different good implementation as the baseline
    :param humaneval: if True, the chosen baseline is the human-made implementation from "HumanEval" dataset
    :param different_task: if True, the chosen references are the implementations for a different task (task 1)
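    :return: a dictionary with the score of each script, the average score ('overall_score') and the mean absolute
             deviation of the scores from that average (stored under 'average_variance')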
    """
    if check_successful and check_failed:
        # Raising instead of calling exit() so that the notebook kernel is not terminated
        raise ValueError('Only one active parameter allowed between "check_successful" & "check_failed"')

    json_path_prefix = '../../exp_results/metrics_calc'
    funct_test_path_prefix = '../../exp_results/functionality_tests'

    if humaneval:
        baseline_script_path = '../../humaneval/000_has_close_elements.py'
    elif second_script:
        baseline_script_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0/42.py'
    else:
        baseline_script_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0/16.py'
    baseline_file_name = baseline_script_path.split('/')[-1]

    if different_task:
        data_folder_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_1'
    else:
        data_folder_path = '../../../ai_code/chatgpt_temp_0.8/HumanEval_0'

    tested_task = data_folder_path.split('/')[-1]

    task_name = 'HumanEval_0'
    model_name = 'chatgpt'
    model_temp = 'temp_0.8'

    metric_name = 'meteor'
    metric = ev.load(metric_name)

    json_folder_path = os.path.join(json_path_prefix, metric_name, model_name, model_temp, task_name)

    os.makedirs(json_folder_path, exist_ok=True)

    metric_dict = {'overall_score': 0, 'average_variance': 0}
    file_names = []
    overall_score = 0

    with open(baseline_script_path, 'r') as f:
        baseline = code_cleanup(f.read())

    # Loading the functionality test results once, instead of once per script
    test_file_path = os.path.join(funct_test_path_prefix, model_name, model_temp, f'{tested_task}.json')
    with open(test_file_path, 'r') as f:
        funct_dict = json.load(f)

    for path, _, files in os.walk(data_folder_path):
        for file_name in sorted(files, key=custom_sort_key):

            # Avoiding comparison of the baseline to an identical prediction (i.e., comparing the baseline to the
            # baseline)
            if file_name == baseline_file_name:
                continue

            # Filtering implementations with successful or failed tests
            if check_successful and not funct_dict[file_name]['successful']:
                continue
            if check_failed and funct_dict[file_name]['successful']:
                continue

            file_names.append(file_name)

            current_script_path = os.path.join(path, file_name)
            with open(current_script_path) as f:
                script = code_cleanup(f.read(), remove_assert=True)

            results = metric.compute(predictions=[script], references=[baseline])

            score = results['meteor']
            metric_dict[file_name] = score
            overall_score += score

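    # Number of scored scripts (the two summary keys of metric_dict are not scripts)
    nb_scripts = len(file_names)
    if nb_scripts == 0:
        raise ValueError('No script matched the selected filters, so no score can be averaged')

    overall_score /= nb_scripts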

    metric_dict['overall_score'] = overall_score

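    # Note: despite the 'average_variance' key, this is the mean absolute deviation of the scores
    # from their average, not the statistical variance (which would square the differences)
    deviation = sum(abs(overall_score - metric_dict[file]) for file in file_names)

    metric_dict['average_variance'] = deviation / nb_scripts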

    return metric_dict
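
The quantity stored under `average_variance` averages absolute differences from the mean, which is not the same as the statistical variance. The short sketch below contrasts the two on made-up scores. The helper name `dispersion_summary` and the score values are hypothetical and are not results from this experiment.

```python
import statistics


def dispersion_summary(scores):
    """Return the mean, mean absolute deviation, variance and standard deviation of the scores."""
    mean = statistics.mean(scores)
    return {
        'mean': mean,
        # Average of |score - mean|: the quantity computed in meteor_metric
        'mean_abs_deviation': sum(abs(mean - s) for s in scores) / len(scores),
        # Average of (score - mean)^2: the population variance
        'variance': statistics.pvariance(scores),
        # Square root of the variance, expressed in the same unit as the scores
        'std_dev': statistics.pstdev(scores),
    }


# Made-up scores, for illustration only
summary = dispersion_summary([0.70, 0.80, 0.90, 1.00])
for name, value in summary.items():
    print(f'{name}: {value:.4f}')
```

The mean absolute deviation stays on the same scale as the scores, which makes it easy to read next to `overall_score`, while the variance is on a squared scale and is therefore much smaller for scores between 0 and 1.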

## Applying __METEOR__ metric to code samples
As per the experimental protocol, we start by taking the first *successful* implementation of __chatgpt_temp_0.8 task 0__ (script *16.py*) as the baseline and compare it to all the other *successful* implementations from this __model__ and __task__.

In [15]:
meteor_dict = meteor_metric(check_successful=True)

for key, value in list(meteor_dict.items())[:2]:
    print(f'{key}: {value}')

Once again, the score obtained in the first stage of the experimental protocol is __higher__ than any previous score, even surpassing __ROUGE's__ score of *0.81*.

Now we will consider the second *successful* script (*42.py*) as the baseline in order to analyze the difference in __METEOR__ scores for different yet *correct* implementations of the same task.

In [16]:
meteor_dict = meteor_metric(check_successful=True, second_script=True)

for key, value in list(meteor_dict.items())[:2]:
    print(f'{key}: {value}')

## Comparing with unsuccessful implementations
Now we will take the first *good* implementation and compare it to scripts that *did not* pass the tests.

In [22]:
meteor_dict = meteor_metric(check_failed=True)

for key, value in list(meteor_dict.items())[:2]:
    print(f'{key}: {value}')

## Human-made baseline implementation
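Now we'll consider the human-written Eval+ implementation for __task 0__ as the baseline and compare it to the implementations generated by AI for the same task. Note that the call below does not set `check_successful`, so implementations that failed their tests are included alongside the *correct* ones.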

In [21]:
meteor_dict = meteor_metric(humaneval=True)

for key, value in list(meteor_dict.items())[:2]:
    print(f'{key}: {value}')

## Different task as a baseline
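Lastly, we will compare the standard baseline with the implementations for __task 1__ in order to see how this metric is affected by scripts that have drastically different __semantics__. Note that the call below does not set `check_successful`, so both *successful* and failed implementations of __task 1__ are included.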

In [20]:
meteor_dict = meteor_metric(different_task=True)

for key, value in list(meteor_dict.items())[:2]:
    print(f'{key}: {value}')