In [1]:
import datasets
import evaluate

rouge = evaluate.load("rouge")

# Long-form question answering (LFQA) dataset, already preprocessed.
# Similar to ELI5 (https://facebookresearch.github.io/ELI5/index.html), which is no longer available.
# This is my filtered version with a maximum answer length of 512.
dataset_lfqa = datasets.load_dataset("stefanbschneider/lfqa-max-answer-length-512")
dataset_lfqa

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'context'],
        num_rows: 202767
    })
    validation: Dataset({
        features: ['question', 'answer', 'context'],
        num_rows: 2646
    })
})

In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
base_model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384")
tuned_model = AutoModelForSeq2SeqLM.from_pretrained("stefanbschneider/led-base-16384-lfqa-ans-len-512")
tuned_model2 = AutoModelForSeq2SeqLM.from_pretrained("stefanbschneider/led-base-16384-lfqa")

In [4]:
example = dataset_lfqa["train"][0]
example

{'question': "what's the difference between a forest and a wood?",
 'answer': "They're used interchangeably a lot. You'll get different answers from different resources, but the general consensus seems to be that woods are smaller than forests.\n\n >  A wood is an area covered in trees, larger than a grove or a copse. A forest is also an area covered in trees, but it is larger than a wood\n\n >  The U.S. National Vegetation Classification system differentiates them according to their densities: 25 to 60 percent of a a wood is covered by tree canopies, while 60 to 100 percent of a forest is canopied.",
 'context': ['Wood is divided, according to its botanical origin, into two kinds: softwoods, from coniferous trees, and hardwoods, from broad-leaved trees. Softwoods are lighter and generally simple in structure, whereas hardwoods are harder and more complex. However, in Australia, "softwood" generally describes rain forest trees, and "hardwood" describes Sclerophyll species ("Eucalyptus"

In [12]:
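# Build the model input in the "question: ..., context: ..." format.
# Note: `input` shadows the Python built-in; the name is kept because later cells reuse it.
input = f"question: {example['question']}, context: {' '.join(example['context'])}"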
tokens = tokenizer(input, return_tensors="pt")

# base model
base_model_output = base_model.generate(**tokens, max_length=512)
base_model_answer = tokenizer.decode(base_model_output[0], skip_special_tokens=True)
base_model_answer

Input ids are automatically padded from 727 to 1024 to be a multiple of `config.attention_window`: 1024


'question: what\'s the difference between a forest and a wood?, context: Wood is divided, according to its botanical origin, into two kinds: softwoods, from coniferous trees, and hardwoods, from broad-leaved trees. Softwoods are lighter and generally simple in structure, whereas hardwoods are harder and more complex. However, in Australia, "softwood" generally describes rain forest trees, and "hardwood" describes Sclerophyll species ("Eucalyptus" "spp"). Woodland is defined by Chambers English dictionary as "land covered with wood" i.e. dominated by tree species. Forestry is defined as "1. the science and art of planting, tending and managing forests; 2. Forest country". This implies that forests have been planted by mankind for a variety of purposes, but mostly for exploitation for timber and pulp for the paper industry. The majority of Forests in Wales were planted by the British Forestry Commission, a UK government agency. Since 2016 the Forestry Commission in Wales has been taken o
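
The padding message above (727 to 1024) is plain rounding up to the next multiple of the attention window. The following is a minimal sketch of that arithmetic only, not the library's actual implementation; `padded_length` is a hypothetical helper name, and the window size of 1024 is taken from the log message.

```python
def padded_length(num_tokens: int, attention_window: int) -> int:
    """Round num_tokens up to the next multiple of attention_window."""
    remainder = num_tokens % attention_window
    if remainder == 0:
        return num_tokens
    return num_tokens + attention_window - remainder


# Input lengths taken from the log messages in this notebook.
for n in (727, 780, 999):
    print(n, "->", padded_length(n, 1024))
```

Every input shorter than 1024 tokens is therefore processed at length 1024, which is worth keeping in mind when estimating runtime.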

In [13]:
rouge.compute(predictions=[base_model_answer], references=[example['answer']])

{'rouge1': 0.18969072164948456,
 'rouge2': 0.033126293995859216,
 'rougeL': 0.11958762886597939,
 'rougeLsum': 0.15257731958762888}
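
To make the `rouge1` number easier to interpret, here is a rough sketch of a ROUGE-1 F1 score computed as unigram overlap. It assumes a simplified tokenization (lowercased alphanumeric tokens, no stemming), so its values may differ slightly from what `evaluate` reports; `rouge1_f1` is a hypothetical helper, not part of the library.

```python
import re
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Approximate ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the capital of Germany is Berlin", "Berlin is the capital")
print(round(score, 3))
```

Because the score only counts shared words, a long prediction that repeats parts of the reference can still obtain a moderate score without being a good answer.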

In [14]:
# same for fine-tuned model
tuned_model_output = tuned_model.generate(**tokens, max_length=512)
tuned_model_answer = tokenizer.decode(tuned_model_output[0], skip_special_tokens=True)
tuned_model_answer

'A forest is an area of land covered with trees. A wood is a piece of land that is covered by trees.\n\nA tree is a part of a forest.\n\n_URL_0_\n\nThe difference between a forest and a wood is that a forest is more dense than a tree, and a tree is more complex than a wood. \n\nSo, a forest means that a tree grows in a way that allows it to grow in a more dense area. \nA wood means that it grows in an area where it can grow in more dense areas.  A forest is a place where a tree can grow and grow in less dense areas than a forest can grow. \n\n\nA forest means a place that grows in such dense areas that it can be used for a variety of purposes.  For example, a tree that grows on a tree will grow in such a way to support a tree.  The tree will be able to support the tree, but the tree will not be strong enough to support it.  \n\n\n\n\n\n"A forest" means a forest where a forest grows in the same way as a tree growing in a different area.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n > \n\n >\n\n\n >

In [15]:
rouge.compute(predictions=[tuned_model_answer], references=[example['answer']])

{'rouge1': 0.2864583333333333,
 'rouge2': 0.12041884816753927,
 'rougeL': 0.14583333333333334,
 'rougeLsum': 0.26562499999999994}

In [16]:
from transformers import pipeline

base_pipeline = pipeline(task="text2text-generation", model="allenai/led-base-16384")

Device set to use mps:0


In [17]:
base_pipeline("question: What is the capital of Germany?, context: Germany is a country in Europe. Its capital is Berlin.")

[{'generated_text': 'question: What is the capital of Germany?, context: Germany is a country in Europe. Its'}]

In [18]:
lfqa_pipeline = pipeline(task="text2text-generation", model="stefanbschneider/led-base-16384-lfqa-ans-len-512")

Device set to use mps:0


In [19]:
lfqa_pipeline("question: What is the capital of Germany?, context: Germany is a country in Europe. Its capital is Berlin.")

[{'generated_text': 'The capital of Germany is Berlin.\n\nThe capital is Berlin, which is the capital of the German state of Brandenburg.\n\n_URL_0_\n\nGermany is the largest city in Europe, with a population of around 10 million.\n \n\nIt\'s the capital, Berlin. \n\n_\n_\n\n\n\n"Berlin" is the name of the city in Germany. "Berliner" is a German name for the city of Berlin.\n\n\n\n\n" Berlin" is an abbreviation of "Berlin", which means "German capital".\n\n\n\n\n\nThe name of Berlin is "Berliner", which is a name for Berlin.\n\n\n\nThe title of the "German Capital of Germany" is German for "Germany" and "German City of Berlin".\n\nBerlin is a city in the German capital of Berlin, Berlin is a country in Europe. \n\n\nGermany has a capital called Berlin, and a city called Berlin is Germany\'s capital.  \n\n\n\n\n\n_\n \n \n\n\n\n\n\n\n >\n\n\n >\n\n\n_\n\n\n >\n >\n\n\n >\r\n\n \n\n > \n > \n\n >\n\n\n\n \n  \n_ >\n > >\n_\r\n >_\n >;\n\n-\n\n--\n\n\n--\n\n*\n\n\\-\n\n---\n\nHope this is

In [20]:
print(input)
lfqa_pipeline(input)

question: what's the difference between a forest and a wood?, context: Wood is divided, according to its botanical origin, into two kinds: softwoods, from coniferous trees, and hardwoods, from broad-leaved trees. Softwoods are lighter and generally simple in structure, whereas hardwoods are harder and more complex. However, in Australia, "softwood" generally describes rain forest trees, and "hardwood" describes Sclerophyll species ("Eucalyptus" "spp").
 Woodland is defined by Chambers English dictionary as "land covered with wood" i.e. dominated by tree species. Forestry is defined as "1. the science and art of planting, tending and managing forests; 2. Forest country". This implies that forests have been planted by mankind for a variety of purposes, but mostly for exploitation for timber and pulp for the paper industry. The majority of Forests in Wales were planted by the British Forestry Commission, a UK government agency. Since 2016 the Forestry Commission in Wales has been taken ov

[{'generated_text': 'A forest is an area of land covered with trees. A wood is a piece of land that is covered by trees.\n\nA tree is a part of a forest.\n\n_URL_0_\n\nThe difference between a forest and a wood is that a forest is more dense than a tree, and a tree is more complex than a wood. \n\nSo, a forest means that a tree grows in a way that allows it to grow in a more dense area. \nA wood means that it grows in an area where it can grow in more dense areas.  A forest is a place where a tree can grow and grow in less dense areas than a forest can grow. \n\n\nA forest means a place that grows in such dense areas that it can be used for a variety of purposes.  For example, a tree that grows on a tree will grow in such a way to support a tree.  The tree will be able to support the tree, but the tree will not be strong enough to support it.  \n\n\n\n\n\n"A forest" means a forest where a forest grows in the same way as a tree growing in a different area.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\

In [21]:
lfqa_pipeline2 = pipeline(task="text2text-generation", model="stefanbschneider/led-base-16384-lfqa")

Device set to use mps:0


In [22]:
lfqa_pipeline2("question: What is the capital of Germany?, context: Germany is a country in Europe. Its capital is Berlin.")

[{'generated_text': 'The capital of Germany is Berlin, which is the capital of the German state of Brandenburg.\n\nThe capital is Berlin.\n\n_URL_0_\n\n\n \n\nGermany is the largest city in Europe, with a population of around 1.5 million.   \n  \n\n\nGermany has a population density of about 1.2 million people.  Germany is the second largest city, with an average density of 1.4 million people per square kilometer.\n \nGermany\'s population density is about 2.3 million people/square kilometers.  So, Germany has a total density of around 2.2 billion people/ square kilometers.\n   \n > \n \n\n >  Germany\'s capital was Berlin.  The capital of Berlin was Berlin, but it was not the largest.  It was the capital for Germany. \n\n\n\n\n\n\n\n\n\nGermany:\n\n* Germany: \n* Berlin.\n*Germany: \n\n\n\n > Germany:\n\n > Berlin, Germany, Germany\n* France, France\n* Italy\n* Spain\n* Japan\n* Russia\n* China\n* Korea\n* India\n* Vietnam\n* Pakistan\n* South Africa\n* Egypt\n* Iran\n* Sudan\n* Syria

In [23]:
lfqa_pipeline2(input)

[{'generated_text': 'A forest is a low density area of trees. A wood is high density areas of trees with lots of shade. A forest is low density areas that provide extensive and nearly continuous shade, whether standing or not, and generally have plenty of sunlight and limited shade.\n\nA wood is a high density area where trees have a lot of shade and trees have little shade.\n\n_URL_0_\n \n\nThe difference between a forest and a wood is that a forest is lower density, but a wood has a lot more shade and can support more trees than a forest.\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe distinction between a wood and a forest depends on what you mean by "wood" and "wood". Wood is a type of wood. Wood is the type of tree that can support trees. Woodland is the kind of wood that can be supported by trees, but not supported by a tree. Woodlands are the type that can\'t support a tree, but can support the tree. \n\n\n\n \n\n\nA forest can be considered a forest if it has a high level of shade, or if it is
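
The generated answers above contain long runs of blank lines and stray quote markers. For display purposes, one possible way to tidy them is sketched below. This is an assumption about what counts as junk (lines made up only of whitespace, `>`, `_`, `-`, `*`, `\` or `;`) and is not part of the models or pipelines used here; `tidy_generated_text` is a hypothetical helper.

```python
import re

# Lines consisting only of whitespace and stray markup characters are treated as junk.
JUNK_LINE = re.compile(r"[\s>_\-*\\;]*")


def tidy_generated_text(text: str) -> str:
    """Blank out junk-only lines and collapse repeated blank lines."""
    cleaned = [
        "" if JUNK_LINE.fullmatch(line) else line.strip()
        for line in text.splitlines()
    ]
    collapsed = re.sub(r"\n{2,}", "\n\n", "\n".join(cleaned))
    return collapsed.strip()


raw = "The capital is Berlin.\n\n\n\n >\n\n_\n >\nGermany is in Europe.  \n\n\n"
print(tidy_generated_text(raw))
```

This only changes how the text is shown; the ROUGE scores in this notebook are computed on the raw model output.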

In [24]:
tokenizer2 = AutoTokenizer.from_pretrained("stefanbschneider/led-base-16384-lfqa")
tokenizer2.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'sep_token': '</s>',
 'pad_token': '<pad>',
 'cls_token': '<s>',
 'mask_token': '<mask>'}

In [None]:
### --- use evaluator to compute ROUGE scores on a random sample of the validation set --- ###
from transformers import pipeline

# Evaluate on a random sample of 100 validation examples.
val_data = datasets.load_dataset("stefanbschneider/lfqa-max-answer-length-512", split="validation").shuffle().select(range(100))


In [None]:
# Select 100 random validation examples, matching the output shown below.
val_data = dataset_lfqa["validation"].shuffle().select(range(100))
val_data

Dataset({
    features: ['question', 'answer', 'context'],
    num_rows: 100
})

In [31]:
def concat_question_and_context(batch):
    # Combine each question and its context passages into a single input string.
    batch["question_context"] = [
        f"question: {question}, context: {' '.join(context)}"
        for question, context in zip(batch["question"], batch["context"])
    ]
    return batch


val_data = val_data.map(concat_question_and_context, batched=True, batch_size=2)
val_data

Dataset({
    features: ['question', 'answer', 'context', 'question_context'],
    num_rows: 100
})
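
To see what the batched mapping function produces without loading the dataset, the sketch below applies the same function to a small hand-written batch. The toy batch is made up for illustration; it mirrors the dataset's columns, where `context` is a list of passages per question.

```python
def concat_question_and_context(batch):
    # Same logic as in the cell above: one combined string per example.
    batch["question_context"] = [
        f"question: {question}, context: {' '.join(context)}"
        for question, context in zip(batch["question"], batch["context"])
    ]
    return batch


# Hypothetical toy batch with a single example and two context passages.
toy_batch = {
    "question": ["What is the capital of Germany?"],
    "context": [["Germany is a country in Europe.", "Its capital is Berlin."]],
}
result = concat_question_and_context(toy_batch)
print(result["question_context"][0])
```

The resulting string has the same "question: ..., context: ..." format as the inputs passed to the pipelines earlier in this notebook.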

In [33]:
val_data[0]

{'question': 'during 9/11, why didnt firefighters/rescue teams set up safety nets for people stuck at the top of the burning buildings to jump out onto and land on?',
 'answer': 'Jumping off a buildings with 110 floors, reaching terminal velocity, into a net just above the ground?\n\n_URL_0_',
 'context': ['The area around the building was cleared of pedestrians and firefighting personnel because of falling glass and debris. The falling debris was dangerous for firefighters because they often had to cross the perimeter around the building to enter and leave the high-rise. Hose lines stretched into the building were damaged by falling debris and one firefighter was struck by debris and seriously injured while tending to the lines.\n',
  "The first service to attend were a Patient Transport Ambulance crew who took the decision to divert straight to the scene because they were so close at the time of the explosion. This initial crew saved dozens of lives by taking control of the evacuatio

In [37]:
from evaluate import evaluator

task_evaluator = evaluator("text2text-generation")
# device="mps" targets Apple Silicon GPUs; adjust it for other hardware.
pipe = pipeline("text2text-generation", model="stefanbschneider/led-base-16384-lfqa-ans-len-512", device="mps")
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=val_data,
    metric=rouge,
    input_column="question_context",
    label_column="answer",
)
results

Device set to use mps
Input ids are automatically padded from 780 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 864 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 940 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 999 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 768 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 939 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 993 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 657 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 815 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 68

KeyboardInterrupt: 