# Building a Research Buddy

I want to build a model (+ simple Gradio GUI) that takes a research paper and can answer questions regarding that paper.

## Notes

### Relevant Resources

- See my other notebooks: playground and question-answer-project
- https://huggingface.co/tasks/question-answering
- https://huggingface.co/datasets/taesiri/arxiv_qa
- https://huggingface.co/learn/nlp-course/en/chapter7/3?fw=pt domain adaptation: fine-tune task-agnostic base model for a specific domain
- https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt Translation - similar to generative Q&A


### Specification & Ideas

- I want generative/abstractive Q&A, not just extractive.
- I want a closed-domain model, specialized on research. I could do that with domain adaptation. But that can come afterwards.
- I might want to try RAG models that can access other research papers for context. For now, I can provide suitable context myself.
- I might want to try document Q&A that also visually inspects papers to answer questions, eg, regarding title, author, or figures.

### Other Notes

> Encoder-only models like BERT tend to be great at extracting answers to factoid questions like “Who invented the Transformer architecture?” but fare poorly when given open-ended questions like “Why is the sky blue?” In these more challenging cases, encoder-decoder models like T5 and BART are typically used to synthesize the information in a way that’s quite similar to text summarization.

GPT models are decoder-only models, where the decoder takes the input directly and was trained to perform similar tasks to typical encoder-decoder models: https://community.openai.com/t/is-gpt-group-of-models-decoder-only-model/286586/2

It seems like I need a suitable data set to fine-tune a pretrained model.
The ones from "question-answering" and the tutorial are trained on SQuAD data, where answers are intentionally spans of text.
I'd need something similar but with generated answers + use a suitable base model with decoder?

### Existing models

Smaller, pre-trained models for abstractive/generative QA:

- https://huggingface.co/tuner007/t5_abs_qa --> Problem: Max sequence length 512 (short paper has already >4000 tokens)
- https://huggingface.co/consciousAI/question-answering-generative-t5-v1-base-s-q-c --> Similar issue of too short sequence length

Large, pretrained instruct models:

- Microsoft Phi3mini: Good answers, but too large and slow.

I moved the corresponding code to `question-answering-project.ipynb` because it is too slow. I want to keep this notebook small enough to run entirely in reasonable time.

## Starting with simple, extractive Q&A

In [2]:
# Read PDF
from pathlib import Path
from typing import Union

from pypdf import PdfReader


def get_text_from_pdf(pdf_file: Union[str, Path]) -> str:
    """Read the PDF from the given path and return a string with its entire content."""
    reader = PdfReader(pdf_file)

    # Extract text from all pages
    full_text = ""
    for page in reader.pages:
        full_text += page.extract_text()
    return full_text


full_text = get_text_from_pdf("author_version.pdf")
full_text

'mobile-env: An Open Platform for Reinforcement\nLearning in Wireless Mobile Networks\nStefan Schneider, Stefan Werner\nPaderborn University, Germany\n{stschn, stwerner}@mail.upb.de\nRamin Khalili, Artur Hecker\nHuawei Technologies, Germany\n{ramin.khalili, artur.hecker}@huawei.com\nHolger Karl\nHasso Plattner Institute,\nUniversity of Potsdam, Germany\nholger.karl@hpi.de\nAbstract—Recent reinforcement learning approaches for con-\ntinuous control in wireless mobile networks have shown im-\npressive results. But due to the lack of open and compatible\nsimulators, authors typically create their own simulation en-\nvironments for training and evaluation. This is cumbersome\nand time-consuming for authors and limits reproducibility and\ncomparability, ultimately impeding progress in the ﬁeld.\nTo this end, we proposemobile-env, a simple and open platform\nfor training, evaluating, and comparing reinforcement learning\nand conventional approaches for continuous control in mobile\nwireless 

In [3]:
# Use a question answering pipeline from Hugging Face for answering questions
from transformers import pipeline

question_answerer = pipeline("question-answering")
# This uses a BERT-style model (encode-only), which apparently is better for factoid questions and extractive answers.

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


In [4]:
question = "What are benefits of mobile-env over other simulators?"
question_answerer(question, full_text)



{'score': 0.4261030852794647,
 'start': 6110,
 'end': 6162,
 'answer': 'easy adoption and fast\nprototyping of new approaches'}

In [5]:
question_answerer("Is mobile-env open source?", full_text)

{'score': 0.30191317200660706,
 'start': 3868,
 'end': 3879,
 'answer': 'open source'}

In [6]:
question_answerer("Who is the best person in the world?", full_text)

{'score': 0.0010773177491500974,
 'start': 12170,
 'end': 12184,
 'answer': 'Brunori et al.'}

In [7]:
question_answerer("Which programming language is mobile-env written in?", full_text)
# really bad

{'score': 0.9728466868400574, 'start': 3552, 'end': 3558, 'answer': 'Python'}

In [8]:
# Trying Q&A with a different model - RoBERTa
# https://huggingface.co/deepset/tinyroberta-squad2
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/tinyroberta-squad2"
roberta_qa = pipeline("question-answering", model=model_name, tokenizer=model_name)

Device set to use mps:0


In [9]:
question = "What are benefits of mobile-env over other simulators?"
roberta_qa(question, full_text)
# That's still extractive! - it's still BERT-based (only encoder, no decorder). BART would be encode-decoder

{'score': 0.48334258794784546,
 'start': 12891,
 'end': 12908,
 'answer': 'far more features'}

In [10]:
roberta_qa("Which programming language is mobile-env written in?", full_text)
# Better answer than the model above.

{'score': 0.9660436511039734, 'start': 3552, 'end': 3558, 'answer': 'Python'}

In [11]:
roberta_qa("Who is the best person in the world?", full_text)

{'score': 6.944365793515317e-08,
 'start': 17073,
 'end': 17100,
 'answer': 'A. Medeisis and A. Kajackas'}

## Generative/Abstractive QA with existing models

Seems like this falls more under the Text2Text Generation, where question and context are given as one input text, separated by keywords.

### Exploring available models

#### Basic T5 without specific Q&A fine-tuning: google/flan-t5-small or -base

https://huggingface.co/google/flan-t5-small

> If you already know T5, FLAN-T5 is just better at everything. For the same number of parameters, these models have been fine-tuned on more than 1000 additional tasks covering also more languages

In [12]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


<pad> Wie ich er bitten?</s>


In [13]:
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
question = "What is Valhalla?"
input_text = f"{context} Please answer the following question: {question}"

input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

# Doesn't work at all for generative Q&A (same for larger google/flan-t5-base model)

<pad> hall</s>


Instead, let's check out some fine-tuned T5 models for generative/abstractive QA.

#### consciousAI/question-answering-generative-t5-v1-base-s-q-c

https://huggingface.co/consciousAI/question-answering-generative-t5-v1-base-s-q-c

In [14]:
# https://huggingface.co/consciousAI/question-answering-generative-t5-v1-base-s-q-c
consciousai_pipeline = pipeline("text2text-generation", model="consciousAI/question-answering-generative-t5-v1-base-s-q-c")

def get_answer_consciousai(question: str, context: str) -> str:
    # The syntax for separating question and context is model-specific, depending on how it was fine-tuned.
    query = f"question: {question}</s> question_context: {context}"
    return consciousai_pipeline(query)


Device set to use mps:0


In [15]:
# Short query --> good answer
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
question = "What is Valhalla?"
get_answer_consciousai(question, context)

[{'generated_text': 'a majestic, enormous hall'}]

In [16]:
# Long query with full text from paper --> too long, but okayish answer
question = "What are benefits of mobile-env over other simulators?"
get_answer_consciousai(question, full_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (4355 > 512). Running this sequence through the model will result in indexing errors


[{'generated_text': 'a simple and lightweight simulation environment'}]

#### tuner007/t5_abs_qa

https://huggingface.co/tuner007/t5_abs_qa

Also sequence length of max 512

In [17]:
# Setup from https://huggingface.co/tuner007/t5_abs_qa
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name: str = "tuner007/t5_abs_qa"
tokenizer_tuner = AutoTokenizer.from_pretrained(model_name)
model_tuner = AutoModelForSeq2SeqLM.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def get_answer_tuner(question: str, context: str) -> str:
    """Generate an answer for the given question based on the given context."""
    # Here, context and question are separated by different keywords than in the model before
    input_text = f"context: {context} <question for context: {question} </s>"
    features = tokenizer_tuner([input_text], return_tensors='pt')
    out = model_tuner.generate(input_ids=features['input_ids'].to(device), attention_mask=features['attention_mask'].to(device))
    return tokenizer_tuner.decode(out[0])

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.


Error while downloading from https://cdn-lfs.hf.co/tuner007/t5_abs_qa/d99167d589bc86ff15cd58eaf80f249e4815ff0b6a9a78e2067af2c9d9ea0b42?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1738695217&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczODY5NTIxN319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby90dW5lcjAwNy90NV9hYnNfcWEvZDk5MTY3ZDU4OWJjODZmZjE1Y2Q1OGVhZjgwZjI0OWU0ODE1ZmYwYjZhOWE3OGUyMDY3YWYyYzlkOWVhMGI0Mj9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=i%7EKdISMDNJcQsneI-cQKmynrAaMfXsLWrW54sCdjXnU4vEl52jtqCb8KwUUTw5shDzLc82tg3%7EDl3dymS5PAt8-hFG2tGNHAG8vL7f6-F654Jv5jQQpm%7EB2Eb9x8KYfTVwO7eEr6HluSITsH7nG33QTI2w%7EC8cpPRX-gQb9rh%7EwMcA3jAaGx%7ErwvPSO6zoCj9Dtd8HAOTZRBouEXNXiJDKSEbJChA%7EzQttVkFwk--CvPTFFGhaR7KD-cBZqfsmkxbsNefcfRfcxrUFZaEsyZ%7EllXtnzQEy89YaJeY%7E5RV6tMgUDUpNvvjA0Uyx-gvugI9gHAuIVaEg0sScFWAzawPg__&Key-Pair-Id=K3RPWS32NSSJCE: HTTPSConnectionP

In [18]:
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
question = "What is Valhalla?"
get_answer_tuner(question, context)

model.safetensors:   0%|          | 0.00/1.19G [00:00<?, ?B/s]

'<pad> Valhalla is a hall of worship that is ruled by the god Odin.</s>'

In [19]:
# Illustrate how the model returns tokens that are then decoded into words by the decoder:
input_text = f"context: {context} <question for context: {question} </s>"
features = tokenizer_tuner([input_text], return_tensors='pt')
out_tokens = model_tuner.generate(input_ids=features['input_ids'].to(device), attention_mask=features['attention_mask'].to(device))
out_tokens

tensor([[    0,  3833, 11516,     9,    19,     3,     9,  6358,    13,  7373,
            24,    19,     3, 16718,    57,     8,  8581,  9899,    77,     5,
             1]])

In [20]:
for token in out_tokens[0]:
    print(f"Token: {token}, decoded: {tokenizer_tuner.decode(token)}")

Token: 0, decoded: <pad>
Token: 3833, decoded: Val
Token: 11516, decoded: hall
Token: 9, decoded: a
Token: 19, decoded: is
Token: 3, decoded: 
Token: 9, decoded: a
Token: 6358, decoded: hall
Token: 13, decoded: of
Token: 7373, decoded: worship
Token: 24, decoded: that
Token: 19, decoded: is
Token: 3, decoded: 
Token: 16718, decoded: ruled
Token: 57, decoded: by
Token: 8, decoded: the
Token: 8581, decoded: god
Token: 9899, decoded: Od
Token: 77, decoded: in
Token: 5, decoded: .
Token: 1, decoded: </s>


In [21]:
# Can achieve the same with a simple pipeline (as with the model above)
tuner_pipeline = pipeline("text2text-generation", model="tuner007/t5_abs_qa")
tuner_pipeline(input_text)

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Device set to use mps:0


[{'generated_text': 'Valhalla is a hall of worship that is ruled by the god Odin.'}]

In [22]:
question = "What are benefits of mobile-env over other simulators?"
get_answer_tuner(question, full_text)
# Sequence length too short --> Explicitly outputs if no answer is available in the given context

Token indices sequence length is longer than the specified maximum sequence length for this model (4357 > 512). Running this sequence through the model will result in indexing errors


'<pad> No answer available in context</s>'

#### MaRiOrOsSi/t5-base-finetuned-question-answering

https://huggingface.co/MaRiOrOsSi/t5-base-finetuned-question-answering

This is the most downloaded T5 finetuned for Q&A.
Also sequence length 512

In [23]:
from  transformers  import  AutoTokenizer, AutoModelForSeq2SeqLM
# from transformers.generation import GenerateEncoderDecoderOutput

model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
tokenizer_mario = AutoTokenizer.from_pretrained(model_name)
model_mario = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def get_answer_mario(question: str, context: str) -> str:
    """Generate an answer for the given question based on the given context.
    
    Using the MaRiOrOsSi/t5-base-finetuned-question-answering model.
    """
    # Again different separation of question and context
    input = f"question: {question} context: {context}"
    encoded_input = tokenizer_mario([input], return_tensors='pt')
    # Can get scores per token if I output a GenerateEncoderDecoderOutput with return_dict_in_generate=True, output_scores=True
    output = model_mario.generate(input_ids=encoded_input.input_ids, attention_mask=encoded_input.attention_mask) #, return_dict_in_generate=True, output_scores=True)

    # Can instruct the tokenizer to skip special tokens that indicate sequence start and end
    return tokenizer_mario.decode(output[0], skip_special_tokens=True)

In [24]:
question = "What is Valhalla?"
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
get_answer_mario(question, context)

'A majestic, enormous hall'

### Answer questions for entire document (longer than context)

#### Split full text into shorter sequences

In [25]:
# Split the full text into smaller parts such that the max token length is not exceeded.
import math

for tokenizer in (tokenizer_tuner, tokenizer_mario):
    num_tokens_total: int = len(tokenizer([full_text])["input_ids"][0])
    print(f"Tokens in full text: {num_tokens_total}")
    print(f"Max tokens supported by model: {tokenizer.model_max_length}")
    print(f"--> Split full text into {math.ceil(num_tokens_total / tokenizer.model_max_length)} parts.")  


Token indices sequence length is longer than the specified maximum sequence length for this model (4334 > 512). Running this sequence through the model will result in indexing errors


Tokens in full text: 4334
Max tokens supported by model: 512
--> Split full text into 9 parts.
Tokens in full text: 4334
Max tokens supported by model: 512
--> Split full text into 9 parts.


In [26]:
# Have same max tokens of 512 for all models --> Split text into 9 parts
def split_text_into_parts(full_text: str, num_parts: int) -> list[str]:
    """Split the given full text into a list of equally sized parts."""
    len_per_part: int = int(len(full_text) / num_parts)
    return [full_text[i * len_per_part : (i+1) * len_per_part] for i in range(num_parts)]

text_parts = split_text_into_parts(full_text, num_parts=9)
text_parts

['mobile-env: An Open Platform for Reinforcement\nLearning in Wireless Mobile Networks\nStefan Schneider, Stefan Werner\nPaderborn University, Germany\n{stschn, stwerner}@mail.upb.de\nRamin Khalili, Artur Hecker\nHuawei Technologies, Germany\n{ramin.khalili, artur.hecker}@huawei.com\nHolger Karl\nHasso Plattner Institute,\nUniversity of Potsdam, Germany\nholger.karl@hpi.de\nAbstract—Recent reinforcement learning approaches for con-\ntinuous control in wireless mobile networks have shown im-\npressive results. But due to the lack of open and compatible\nsimulators, authors typically create their own simulation en-\nvironments for training and evaluation. This is cumbersome\nand time-consuming for authors and limits reproducibility and\ncomparability, ultimately impeding progress in the ﬁeld.\nTo this end, we proposemobile-env, a simple and open platform\nfor training, evaluating, and comparing reinforcement learning\nand conventional approaches for continuous control in mobile\nwireless

In [27]:
question = "What are benefits of mobile-env over other simulators?"

# tuner007/t5_abs_qa
for text_part in text_parts:
    print(get_answer_tuner(question, text_part))
# Only outputs one proper answer. Nice!

<pad> No answer available in context</s>
<pad> No answer available in context</s>
<pad> No answer available in context</s>
<pad> No answer available in context</s>
<pad> No answer available in context</s>
<pad> No answer available in context</s>
<pad> It is more flexible and more flexible than other simulators.</s>
<pad> No answer available in context</s>
<pad> No answer available in context</s>


In [28]:
# consciousAI/question-answering-generative-t5-v1-base-s-q-c
for text_part in text_parts:
    print(get_answer_consciousai(question, text_part))

[{'generated_text': 'many configuration options and is easy to extend'}]
[{'generated_text': 'facilitating their evaluation and comparison'}]
[{'generated_text': 'provide a useful tool for prototyping, training, and comparing existing approaches'}]
[{'generated_text': 'enabling easy adoption and fast prototyping of new approaches'}]
[{'generated_text': 'can be replaced by the agent’s actions'}]
[{'generated_text': 'fast and simple training and evaluation'}]
[{'generated_text': 'far more features'}]
[{'generated_text': 'Enhanced resource allocation in mobile edge computing using re- inforcement learnin'}]
[{'generated_text': 'Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks'}]


In [29]:
# MaRiOrOsSi/t5-base-finetuned-question-answering
for text_part in text_parts:
    print(get_answer_mario(question, text_part))

It provides sensible default values and can be used out of the box.
It is lightweight and fast.
It provides a useful tool for prototyping and other control algorithms.
It is simple and lightweight with sensible default values and predefined example scenarios.
It provides a single-agent user interface.
It allows users to simulate complex scenarios.
It is more flexible, better documented, and licensed by the standard MIT license.
It provides an open and simple yet flexible platform for reinforcement learning.
It provides a more robust network than a traditional s simulation.


#### Build scoring function to find the best answer

For extractive Q&A, HuggingFace seems to provide a scoring function within the standard pipeline.
This score multiplies the probability of the stand and end index returned by the model.

https://discuss.huggingface.co/t/how-to-find-confidence-score-in-qa-model-in-the-pipeline-module/5210

For generative Q&A, models output variable-size sequences.
Multiplying the probabilities of each output token would lead to lower scores for longer sequences, even if they are useful answers.
I'll try averaging the token probabilities instead.

To output the token probabilities, I have to configure the models and generate function accordingly.
As an example, I'll use the MaRiOrOsSi/t5-base-finetuned-question-answering model, which provided the best answers.

In [30]:
# Tokenizer and Model were already loaded above; just reuse them here
from transformers.generation import GenerateEncoderDecoderOutput

question = "What is Valhalla?"
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
input = f"question: {question} context: {context}"

encoded_input = tokenizer_mario([input], return_tensors='pt')
# Can get scores per token if I output a GenerateEncoderDecoderOutput with return_dict_in_generate=True, output_scores=True
output: GenerateEncoderDecoderOutput = model_mario.generate(input_ids=encoded_input.input_ids, attention_mask=encoded_input.attention_mask, return_dict_in_generate=True, output_scores=True)
output

From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.


GenerateEncoderDecoderOutput(sequences=tensor([[    0,    71, 25941,     6, 10933,  6358,     1]]), scores=(tensor([[-46.9212, -10.5379, -27.4036,  ..., -87.9269, -87.4440, -87.6389]]), tensor([[-30.3731, -12.4038, -24.3930,  ..., -63.0842, -62.7218, -62.9792]]), tensor([[-38.2012, -19.0661, -24.7387,  ..., -80.1341, -79.5276, -79.6145]]), tensor([[-40.1083, -22.2432, -36.1136,  ..., -82.0332, -81.5725, -81.9746]]), tensor([[-44.6734, -20.0176, -34.0707,  ..., -89.7348, -89.2161, -89.3545]]), tensor([[-40.1629,   1.0713, -21.6641,  ..., -72.6069, -72.4596, -72.4369]])), logits=None, encoder_attentions=None, encoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, decoder_hidden_states=None, past_key_values=((tensor([[[[ 4.5368e-01,  2.8157e-01, -1.6857e-01,  ..., -3.2498e-01,
            2.2168e-01, -1.8275e-01],
          [-6.5928e-01,  1.9843e+00, -1.4105e+00,  ...,  2.5199e-01,
            1.0230e+00,  3.0545e-01],
          [ 1.0918e+00,  5.2156e-01,  1.0391e-02,

In [31]:
# Access generated tokens
output.sequences

tensor([[    0,    71, 25941,     6, 10933,  6358,     1]])

In [32]:
# Decode into words
tokenizer_mario.decode(output.sequences[0], skip_special_tokens=False)

'<pad> A majestic, enormous hall</s>'

In [33]:
# Access scores associated to each token (excluding the start token 0 = <pad>).
# Returns scores for all possible tokens in each step.
output.scores

(tensor([[-46.9212, -10.5379, -27.4036,  ..., -87.9269, -87.4440, -87.6389]]),
 tensor([[-30.3731, -12.4038, -24.3930,  ..., -63.0842, -62.7218, -62.9792]]),
 tensor([[-38.2012, -19.0661, -24.7387,  ..., -80.1341, -79.5276, -79.6145]]),
 tensor([[-40.1083, -22.2432, -36.1136,  ..., -82.0332, -81.5725, -81.9746]]),
 tensor([[-44.6734, -20.0176, -34.0707,  ..., -89.7348, -89.2161, -89.3545]]),
 tensor([[-40.1629,   1.0713, -21.6641,  ..., -72.6069, -72.4596, -72.4369]]))

In [34]:
# Taking the argmax of each, gives the selected token in each step.
for token_scores in output.scores:
    selected_token = token_scores.argmax()
    print(f"Selected token: {selected_token}, Score: {token_scores[0][selected_token]}, Decoded: {tokenizer_mario.decode(selected_token)}")

Selected token: 71, Score: -4.235823631286621, Decoded: A
Selected token: 25941, Score: 8.427810668945312, Decoded: majestic
Selected token: 6, Score: 3.4137935638427734, Decoded: ,
Selected token: 10933, Score: 8.492544174194336, Decoded: enormous
Selected token: 6358, Score: 10.149942398071289, Decoded: hall
Selected token: 1, Score: 1.0712651014328003, Decoded: </s>


In [35]:
# Function to compute the overall score of the generated sequence by averaging per-token scores.
def get_answer_with_score(question: str, context: str) -> tuple[str, float]:
    """Return the answer to the given question and context together with the score/certainty."""
    input = f"question: {question} context: {context}"
    encoded_input = tokenizer_mario([input], return_tensors='pt')
    # Generate tokens and scores
    output: GenerateEncoderDecoderOutput = model_mario.generate(input_ids=encoded_input.input_ids, attention_mask=encoded_input.attention_mask, return_dict_in_generate=True, output_scores=True)
    # Decode answer
    answer: str = tokenizer_mario.decode(output.sequences[0], skip_special_tokens=True)
    # Compute average per-token score for the answer
    sequence_scores: list[float] = [float(token_scores.max()) for token_scores in output.scores]
    avg_score: float = sum(sequence_scores) / len(sequence_scores)
    # Return answer and score
    return answer, avg_score

question = "What is Valhalla?"
context = "In Norse mythology, Valhalla is a majestic, enormous hall located in Asgard, ruled over by the god Odin."
get_answer_with_score(question, context)


('A majestic, enormous hall', 4.553255379199982)

In [36]:
# Use the new scoring function to find the best answer from the different parts of the article
# Use the text parts from above
question = "What are benefits of mobile-env over other simulators?"

answers: list[str] = []
answer_scores: list[float] = []
for text_part in text_parts:
    answer, score = get_answer_with_score(question, text_part)
    answers.append(answer)
    answer_scores.append(score)
    print(answer, "| Score:", score)

It provides sensible default values and can be used out of the box. | Score: 3.1261438528696694
It is lightweight and fast. | Score: 0.578094516481672
It provides a useful tool for prototyping and other control algorithms. | Score: 1.636654943227768
It is simple and lightweight with sensible default values and predefined example scenarios. | Score: 2.472675308585167
It provides a single-agent user interface. | Score: 2.9955094953378043
It allows users to simulate complex scenarios. | Score: 1.5019187132517497
It is more flexible, better documented, and licensed by the standard MIT license. | Score: 2.735137939453125
It provides an open and simple yet flexible platform for reinforcement learning. | Score: 4.147489632878985
It provides a more robust network than a traditional s simulation. | Score: 1.422095499932766


In [37]:
best_score = max(answer_scores)
best_answer = answers[answer_scores.index(best_score)]
best_answer, best_score

('It provides an open and simple yet flexible platform for reinforcement learning.',
 4.147489632878985)

In [38]:
# TODO: Rather than selecting the best answer, try to merge all answers into one coherent answer.
# Eg, with a summarization model.

#### Summarize answers for individual parts

In [39]:
answers_str: str = " ".join(answers)
summary_pipeline = pipeline(task="summarization")
summary_pipeline(answers_str)

# Too short for a proper summary. It just takes parts of the text.

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Your max_length is set to 142, but your input_length is only 108. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


[{'summary_text': ' It is simple and lightweight with sensible default values and predefined example scenarios . It is more flexible, better documented, and licensed by the standard MIT license . It provides a more robust network than a traditional s simulation. It is lightweight and fast. It provides an open and simple yet flexible platform for reinforcement learning.'}]

Error while downloading from https://cdn-lfs.hf.co/sshleifer/distilbart-cnn-12-6/2dd1021c54672b07a4aa2f9eef35107195b2d894792b95f017cca86710f466f0?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1738695273&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczODY5NTI3M319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9zc2hsZWlmZXIvZGlzdGlsYmFydC1jbm4tMTItNi8yZGQxMDIxYzU0NjcyYjA3YTRhYTJmOWVlZjM1MTA3MTk1YjJkODk0NzkyYjk1ZjAxN2NjYTg2NzEwZjQ2NmYwP3Jlc3BvbnNlLWNvbnRlbnQtZGlzcG9zaXRpb249KiJ9XX0_&Signature=RsD0Vd64amFDvLee49xyIUntkIqVkRndih%7E%7Eq1vE4WZnUsbfk8swN6DAw5rb2EUS0fXFmLIj6SN%7Eq8hrB6ZrzJqQ49%7EpeuEFtQDWYI%7E5H1GsoMqFic4PzTfx552LgTvmDcygyNkT0Lgy4NgrmcUaGy5XoGmRs1AZFrZ5nEVlcdXoRBTBu1t5%7ECggD2gfWabQhd8RLxBys81K4hLG98-BVZgPFTs2GyENLGlmou8o5icofmOrOffU1vv-WcpiwEqKLGHpIDJPG5bIVaX3YgQPMnT14YOy9w8jL1-h6q72RIxBgOuaawNHg45VYDuYuMtV5W9urqXY9955eeCj03-JsQ__&Key-Pair-Id=K3RPWS32NSS

#### Longformer: Models supporting long sequences/contexts

Longformer: https://huggingface.co/docs/transformers/model_doc/longformer

LED (Longformer Encoder-Decoder), which is preferrable for generative Q&A:
https://huggingface.co/docs/transformers/model_doc/led


It seems like there is no available LED model that's already fine-tuned to Q&A.
Instead, I found one that's fine-tuned to summarizing Arxiv articles:
https://huggingface.co/allenai/led-large-16384-arxiv

In [40]:
# https://huggingface.co/allenai/led-large-16384-arxiv
from transformers import LEDForConditionalGeneration, LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384-arxiv")

input_ids = tokenizer(full_text, return_tensors="pt").input_ids

global_attention_mask = torch.zeros_like(input_ids)
# set global_attention_mask on first token
global_attention_mask[:, 0] = 1

model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384-arxiv") #, return_dict_in_generate=True)
sequences = model.generate(input_ids, global_attention_mask=global_attention_mask)
summary = tokenizer.decode(sequences[0])
summary

# The summary are just parts of the abstract!

Input ids are automatically padded from 4550 to 5120 to be a multiple of `config.attention_window`: 1024


model.safetensors:   0%|          | 0.00/1.84G [00:00<?, ?B/s]

'</s> we propose mobile-env , a simple and open platform for training , evaluating , and comparing reinforcement learning and conventional approaches for continuous control in wireless mobile networks . \n mobile-env is lightweight and implements the common OpenAI Gym interface and additional wrappers , which allows connecting virtually any single-agent or multi-agent reinforcement learning framework to the environment . \n mobile-env provides sensible default values and can be used out of the box , it also has many conﬁguration options and is easy to extend . \n thanks to its modular architecture , it is also easy to adjust and extend . \n we hope mobile-env will serve as a useful tool for training and evaluating new reinforcement learning approaches as well as benchmarking existing approaches . </s>'

Error while downloading from https://cdn-lfs.hf.co/allenai/led-large-16384-arxiv/2b4f0a0405c2ab2300880492a634ccb639b5310ca1a4356c9b385cc5090375d1?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27model.safetensors%3B+filename%3D%22model.safetensors%22%3B&Expires=1738695287&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczODY5NTI4N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9hbGxlbmFpL2xlZC1sYXJnZS0xNjM4NC1hcnhpdi8yYjRmMGEwNDA1YzJhYjIzMDA4ODA0OTJhNjM0Y2NiNjM5YjUzMTBjYTFhNDM1NmM5YjM4NWNjNTA5MDM3NWQxP3Jlc3BvbnNlLWNvbnRlbnQtZGlzcG9zaXRpb249KiJ9XX0_&Signature=UxvU0w7dc4tK5vOWaCL2ghzgxqHzIUU2tJiGu89JHYrZ32A3F2XnAWe9bxkzYDzxRl8i38GBoWa1gZ%7E1UiFzadpQ6-XbRSxvHyLBt3qBWhCEIIFOszsIVcBGGkQOV9so3bWSsZgI4uMsHjeRi88khbAe%7EzYTg10KxDKVY%7EZMp3NxVjenMLqQZj7KnDc4ZgFRVg4jC-u5dQ6BIYwEXqXMSn3k0sA-%7ED1fMpwQR3QJQs9c1k98DwTjcLg8lWIQdUADHQHmaP-8C6gObCFESsSrJ0lBK%7ELMUMA3i4c3zS4FWkY%7Ed3wPe8Wxr1emFOcYUPXBrFvon3SF2Jat4Eeq91Q-4g__&Key-Pair-Id=K3RPWS32NSS

In [42]:
# Try Q&A with the pre-trained LED model, even though it's not fine-tuned for Q&A
led_qa = pipeline(task="text2text-generation", model="allenai/led-large-16384-arxiv")
input_text = f"{full_text} Given this context, please answer the following question. {question}"
print(f"Question: {question}")
led_qa(input_text)
# Pretty bad answer

Device set to use mps:0
Input ids are automatically padded from 4572 to 5120 to be a multiple of `config.attention_window`: 1024


[{'generated_text': ' we proposemobile-env , a simple and open platform for training , evaluating , and comparing reinforcement learning and conventional approaches for continuous control in wireless mobile networks . \n mobile-env is lightweight and implements the common openAI Gym interface and additional wrappers , which allows connecting virtually any single-agent or multi-agent reinforcement learning framework to the environment . \n mobile-env provides sensible default values and can be used out-of the box , it also has many conﬁguration options and is easy to \n extend. we therefore believe mobile-env to be a valuable platform for driving meaningful progress in autonomous coordination of wireless mobile networks . '}]

In [None]:
# Try BART fine-tuned on ELI5 for generative Q&A
bart_eli5 = pipeline("text2text-generation", model="vblagoje/bart_lfqa")
query = f"question: {question} context: {full_text}"
print(f"Question: {question}")
bart_eli5(input_text)

Device set to use mps:0
Token indices sequence length is longer than the specified maximum sequence length for this model (4572 > 1024). Running this sequence through the model will result in indexing errors


Question: What are benefits of mobile-env over other simulators?


[{'generated_text': "I'm not sure if this is the right subreddit for this, but I'm curious as well"}]

: 

In [41]:
# TODO: Feed summary into one of the Q&A models above