In [1]:
import json
from pathlib import Path

In [2]:
def read_json(data_path):
    data_path = Path(str(data_path))
    with data_path.open("r", encoding="UTF-8") as f:
        return json.load(f)

def save_json(data_path, data_list):
    data_path = Path(str(data_path))
    with data_path.open("w", encoding="UTF-8") as f:
        json.dump(data_list, f, indent=4)

In [3]:
res_path = Path(r"./human_evaluation.json")
res = read_json(res_path)
print(f"There are {len(res)} samples")

There are 197 samples


Here is an example:

```json
{
    "idx": 54,
    "sample_idx": 4302,
    "level": "hard",
    "type": "bridge",
    "context": "St Anne's Academy. St Anne's Academy is an 11\u201318 mixed comprehensive academy in Middleton area of the Metropolitan Borough of Rochdale in Greater Manchester, United Kingdom. The school has specialist status in Information and Communication Technology. The School has 657 students on roll, with 98 in sixth form. Middleton, Greater Manchester. Middleton is a town in the Metropolitan Borough of Rochdale, Greater Manchester, England, on the River Irk 5 mi south-southwest of Rochdale and 4.4 mi north-northeast of Manchester city centre. In 2001, Middleton had a population of 45,580, reducing to 42,972 at the 2011 Census. It lies on the northern edge of Manchester, with Blackley to the south and Moston to the south east.",
    "answer": "42,972",
    "gold": "St Anne's Academy is an 11\u201318 mixed comprehensive academy is located in a town that had how many inhabitants in 2011 ?",
    "full answer": "St Anne's Academy is an 11\u201318 mixed comprehensive academy is located in a town that had 42,972 in 2011.",
    "zeroshot_gpt": "What was the population of Middleton according to the 2011 Census?",
    "oneshot_gpt": "According to the 2011 Census, what was the population of Middleton, Greater Manchester?",
    "multiFactor": "What was the population in 2011 of the city where St Anne's Academy is located?"
}
```


The information contained in each sample is as follows:
 - idx: index within the 200 randomly selected human evaluation samples
 - sample_idx: absolute index in the whole HotpotQA test set (Fei, Zichu, et al. 2022)
 - level: the difficulty level, provided by the raw HotpotQA dataset
 - type: the question type, bridge or comparison, also provided by the raw HotpotQA dataset
 - context: the context passage for the question-answer pair
 - answer: the answer extracted by the crowd workers
 - gold: the question written by the crowd workers
 - full answer: pseudo (gold) full answer converted by QA2D
 - zeroshot_gpt: question generated by GPT-3.5-Turbo in the zero-shot setting
 - oneshot_gpt: question generated by GPT-3.5-Turbo in the one-shot setting
 - multiFactor: question generated by our MultiFactor model
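
As a quick sanity check on the fields listed above, one could verify that every loaded sample carries the expected keys. This is only a minimal sketch: `EXPECTED_KEYS` and `find_incomplete_samples` are illustrative names that do not appear in the original notebook, and the key set is taken from the example sample shown earlier.

```python
# Keys taken from the example sample above; the helper name is hypothetical.
EXPECTED_KEYS = {
    "idx", "sample_idx", "level", "type", "context", "answer",
    "gold", "full answer", "zeroshot_gpt", "oneshot_gpt", "multiFactor",
}

def find_incomplete_samples(data_list, expected_keys=EXPECTED_KEYS):
    """Return (position, missing_keys) pairs for samples lacking any expected key."""
    incomplete = []
    for pos, sample in enumerate(data_list):
        missing = expected_keys - set(sample)
        if missing:
            incomplete.append((pos, sorted(missing)))
    return incomplete
```

Calling `find_incomplete_samples(res)` on the loaded list should return an empty list if every sample is complete.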

The zero-shot instruction is:
> Given the context and answer, please help me generate a multi-hop question.\nAnswer: {{answer}}\nContext: {{context}}\nQuestion:

The one-shot instruction is:
> Below, I will provide you with one answer and background context. You need to ask a question based on background context, and the questions you ask must be able to be answered directly with the answers. Here is an example:\n\n###Answer: {example_answer}\n\n###Context: {example_context}\n\n###Question:{example_question}\n\nComplete the following question based on the example above:\n\n###Answer: {{answer}}\n\n###Context: {{context}}\n\n###Question:
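
To make the templating concrete, here is a rough sketch of how the zero-shot instruction might be filled in for one sample. It assumes that `\n` in the instruction stands for a real newline and that `{{answer}}` and `{{context}}` are plain placeholders; `ZERO_SHOT_TEMPLATE` and `fill_prompt` are hypothetical names, and the actual prompting code is not shown in this notebook.

```python
# Template text copied from the zero-shot instruction above. Treating "\n" as a
# newline and "{{...}}" as placeholders is an assumption of this sketch.
ZERO_SHOT_TEMPLATE = (
    "Given the context and answer, please help me generate a multi-hop question.\n"
    "Answer: {{answer}}\n"
    "Context: {{context}}\n"
    "Question:"
)

def fill_prompt(template, sample):
    """Substitute the {{answer}} and {{context}} placeholders with sample fields."""
    return (
        template
        .replace("{{answer}}", sample["answer"])
        .replace("{{context}}", sample["context"])
    )

# Toy sample for illustration only.
toy_sample = {
    "answer": "42,972",
    "context": "Middleton had a population of 42,972 at the 2011 Census.",
}
prompt = fill_prompt(ZERO_SHOT_TEMPLATE, toy_sample)
print(prompt)
```

The one-shot instruction could be filled the same way, with the example answer, context, and question substituted first.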

Now we define and call the function `check_answer_revealing` to count answer-revealing errors, i.e. generated questions that contain the answer string verbatim, in the zero-shot and one-shot QG outputs of GPT-3.5-Turbo.

In [4]:
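def check_answer_revealing(data_list, input_key="zeroshot_gpt", answer_key="answer"):
    """Count samples whose generated question contains the answer verbatim.

    Returns the count and the list positions of the offending samples.
    """
    revealing_idx = []
    for idx, d in enumerate(data_list):
        # Skip samples where either field is missing or empty.
        if d.get(input_key) and d.get(answer_key):
            answer = d.get(answer_key).strip()
            generated_q = d.get(input_key).strip()
            # Exact, case-sensitive substring match.
            if answer in generated_q:
                revealing_idx.append(idx)
    return len(revealing_idx), revealing_idx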
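
The check above is an exact, case-sensitive substring test. As a possible alternative, the sketch below matches case-insensitively and requires the answer not to be embedded in a longer word or number. The helper `answer_in_question` is hypothetical, it is not the check used for the counts reported in this notebook, and using it would likely change those counts.

```python
import re

def answer_in_question(answer, question):
    """Case-insensitive match of `answer` in `question`, bounded by non-word characters."""
    answer = answer.strip()
    if not answer:
        return False
    # Lookarounds reject matches that sit inside a longer word or number.
    pattern = r"(?<!\w)" + re.escape(answer) + r"(?!\w)"
    return re.search(pattern, question, flags=re.IGNORECASE) is not None

print(answer_in_question("guitar feedback", "Which song used Guitar Feedback first?"))  # True
print(answer_in_question("1522", "Is the route 15223 mi long?"))  # False
```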

In [5]:
# zero-shot answer revealing auto-check

zero_ar_nums, zero_ar_list = check_answer_revealing(res)

print(f"There are {zero_ar_nums}/{len(res)} answer revealing error samples in GPT-3.5-Turbo zero-shot, HotpotQA QG dataset.")

There are 54/197 answer revealing error samples in GPT-3.5-Turbo zero-shot, HotpotQA QG dataset.


In [6]:
# one-shot answer revealing auto-check

one_ar_nums, one_ar_list = check_answer_revealing(res, "oneshot_gpt")

print(f"There are {one_ar_nums}/{len(res)} answer revealing error samples in GPT-3.5-Turbo one-shot, HotpotQA QG dataset.")

There are 37/197 answer revealing error samples in GPT-3.5-Turbo one-shot, HotpotQA QG dataset.


As the statistics show, on paragraph-level QG, GPT-3.5-Turbo is prone to answer-revealing errors: 54 of 197 samples (about 27.4%) in the zero-shot setting and 37 of 197 (about 18.8%) in the one-shot setting. These rates are alarming, but they are what the automatic check reports.
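
For reference, the two counts printed above can be converted into rates with a few lines of arithmetic. The variable names here are illustrative only.

```python
# Counts reported by the two cells above.
total = 197
counts = {"zero-shot": 54, "one-shot": 37}

rates = {setting: n / total for setting, n in counts.items()}
for setting, rate in rates.items():
    print(f"{setting}: {counts[setting]}/{total} = {rate:.1%}")
# zero-shot: 54/197 = 27.4%
# one-shot: 37/197 = 18.8%
```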

Compared with zero-shot, one-shot prompting reduces answer-revealing errors. However, it still seems to fail to connect the disjoint multi-hop facts in the provided context. We found that one-shot often generates shallow questions whose facts are mainly drawn from a single sentence span. Here are some examples:

```json
[
    {
        "idx": 1,
        "sample_idx": 3139,
        "level": "hard",
        "type": "bridge",
        "context": "Jacksonville station. Jacksonville station is an Amtrak train station in Jacksonville, Florida, United States. It serves the \"Silver Meteor\" and \"Silver Star\" trains as well as the Thruway Motorcoach to Lakeland. The station lies next door to a freight facility with its own platform and is also just east of Norfolk Southern's Simpson Yard. Silver Star (Amtrak train). The Silver Star is a 1522 mi passenger train route in the \"Silver Service\" brand operated by Amtrak, running from New York City south to Miami, Florida via the Northeast Corridor to Washington, D.C., then via Richmond, Virginia; Raleigh, North Carolina; Columbia, South Carolina; Savannah, Georgia; Jacksonville, Florida; Orlando, Florida; and Tampa, Florida. The \"Silver Star\" shares much of its track with the \"Silver Meteor\", which runs further east through North Carolina and South Carolina.",
        "answer": "1522",
        "gold": "How many miles does the train, which passes through the Amtrak Jacksonville station and shares the track with the Silver Meteor, run?",
        "full answer": "The train, which passes through the Amtrak Jacksonville station and shares the track with the Silver Meteor, runs 1522.",
        "zeroshot_gpt": "How long is the Silver Star passenger train route in miles?",
        "oneshot_gpt": "How long is the passenger train route of the Silver Star operated by Amtrak?",
        "multiFactor": "Jacksonville station serves the \"Silver Meteor\" and a train that runs for how many miles?"
    },
    {
        "idx": 13,
        "sample_idx": 7,
        "level": "hard",
        "type": "bridge",
        "context": "Lewiston Maineiacs. The Lewiston Maineiacs were a junior ice hockey team of the Quebec Major Junior Hockey League based in Lewiston, Maine. The team played its home games at the Androscoggin Bank Colis\u00e9e. They were the second QMJHL team in the United States, and the only one to play a full season. They won the President's Cup in 2007. Androscoggin Bank Colis\u00e9e. The Androscoggin Bank Colis\u00e9e (formerly Central Maine Civic Center and Lewiston Colisee) is a 4,000 capacity (3,677 seated) multi-purpose arena, in Lewiston, Maine, that opened in 1958. In 1965 it was the location of the World Heavyweight Title fight during which one of the most famous sports photographs of the century was taken of Muhammed Ali standing over Sonny Liston.",
        "answer": "3,677 seated",
        "gold": "The arena where the Lewiston Maineiacs played their home games can seat how many people?",
        "full answer": "The arena where the Lewiston Maineiacs played their home games can seat 3,677 seated.",
        "zeroshot_gpt": "How many seats are available in the Androscoggin Bank Colis\u00e9e in Lewiston, Maine?",
        "oneshot_gpt": "What is the seating capacity of the Androscoggin Bank Colis\u00e9e in Lewiston, Maine?",
        "multiFactor": "The Lewiston Maineiacs played their home games at a multi-purpose arena with how many seating capacity?"
    },
    {
        "idx": 17,
        "sample_idx": 5510,
        "level": "hard",
        "type": "bridge",
        "context": "She's a Woman. \"She's a Woman\" is a song by the Beatles, written mainly by Paul McCartney and credited to Lennon\u2013McCartney. John Lennon contributed to the lyrics and middle eight (the bridge). The song was finished in the studio the morning of the session. It was released as the B-side to \"I Feel Fine\" in 1964, their last single release that year. It reached number 4 on the \"Billboard\" Hot 100 from frequent airplay. I Feel Fine. \"I Feel Fine\" is a song written by John Lennon (credited to Lennon\u2013McCartney) and released in 1964 by the Beatles as the A-side of their eighth British single. The song has one of the first uses of guitar feedback in popular music.",
        "answer": "guitar feedback",
        "gold": "\"She's a Woman\" is a song on the B-side of a single whose A-side has one of the first uses of what in popular music?",
        "full answer": "\"She's a Woman\" is a song on the B - side of a single guitar feedback's A - side has one of the first uses of what in popular music.",
        "zeroshot_gpt": "What is a notable aspect of the Beatles' song \"I Feel Fine\" that contributed to its significance in popular music history?",
        "oneshot_gpt": "What is one of the notable features of the song \"I Feel Fine\" by the Beatles?",
        "multiFactor": "She's a Woman was released as the B-side to a song that has one of the first uses of what in popular music?"
    }
]
```

Because of the high cost of human evaluation, we only evaluate the first 100 samples. The one-shot samples that contain errors (or that lose to MultiFactor) are extracted separately into the file `oneshot_bad_samples.json`. Each of them carries a `comment` field explaining why the crowd workers preferred the question generated by MultiFactor.
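
A small sketch of how those comments might be listed once the file is loaded. It assumes the file is a JSON list of sample dictionaries shaped like the ones above, each with an added `comment` string; that layout is an assumption, and `extract_comments` is a hypothetical helper.

```python
def extract_comments(samples):
    """Return (idx, comment) pairs, skipping entries without a non-empty comment."""
    return [(s.get("idx"), s["comment"]) for s in samples if s.get("comment")]

# Toy input for illustration only; these are not entries from the real file.
toy = [
    {"idx": 1, "comment": "placeholder comment"},
    {"idx": 2},
]
print(extract_comments(toy))  # [(1, 'placeholder comment')]
```

With the notebook's own helper, this could be applied as `extract_comments(read_json("./oneshot_bad_samples.json"))`, assuming the file sits next to the notebook.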