### Part 1 - Question Answering with "distilbert-base-cased-distilled-squad"


Initially, I imported the PDF and used "facebook/bart-large-cnn" to summarize it. My initial approach involved using two different summarizers and two different question-answering models, as I had previously used the same summarizer model with two question-answering models and received identical answers. However, I encountered issues with the length and quality of the summary, which made it challenging to formulate suitable questions. Therefore, I decided to focus on maximizing the performance of the first model and use it as a learning basis for improving the efficiency of the second model.

In [None]:
# Shown only to document my reasoning; it does not affect the QA model used later

# Step 1: Read and Summarize the PDF to define the context
!pip install transformers PyPDF2
from transformers import pipeline
import PyPDF2

# Read PDF
reader = PyPDF2.PdfReader('/content/Article 6 BloombergGPT_ A Large Language Model for Finance')
text = ""
for page in reader.pages:
    text += page.extract_text()

# Summarize to get main points
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(text[0:1000], max_length=200, min_length=60)
print("\nSUMMARY:")
print(summary[0]['summary_text'])

# Continued with QA based on the summary....






SUMMARY:
BloombergGPT is a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources. It is perhaps the largest domain-speciousc dataset yet, augmented with345 billion tokens from general purpose datas.


Later, I realized that the answers would actually be based on the full content of the PDF rather than the summary itself. So, I removed the summarization step, which significantly improved the model's speed.

In [None]:
# New approach: import the PDF, define questions about it, and ask the model to answer them (with a confidence score)

# Step 1: Install transformers (models) and PyPDF2 (PDF reading)
!pip install transformers PyPDF2

# Step 2: Import required libraries
from transformers import pipeline
import PyPDF2

# Step 3: Read the PDF file
pdf_path = '/content/Article 6 BloombergGPT_ A Large Language Model for Finance'
reader = PyPDF2.PdfReader(pdf_path)
text = "".join(page.extract_text() for page in reader.pages) # Concatenate text from pdf

# Step 4: Set up the Question-Answering model
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

# Step 5: Define questions
questions = [
    "What is the total number of parameters in the BloombergGPT model?",
    "What is the size of the dataset used to train BloombergGPT?",
    "What is the breakdown of the BloombergGPT dataset?",
    "What is the source of the financial data used to train BloombergGPT?",
    "How does the BloombergGPT dataset size compare to other language models?",
    "What are the key features or capabilities of BloombergGPT?",
    "What are the potential applications of BloombergGPT?",
    "What were the challenges in developing BloombergGPT?",
    "How does BloombergGPT compare to other large language models?",
    "What are the plans for BloombergGPT's future development?"
]

# Run Question-Answering
print("\nANSWERS WITH CONFIDENCE SCORES:")
for question in questions:
    result = question_answerer(question=question, context=text[:3000])  # Limit context length if needed
    answer = result['answer']
    confidence_score = result['score']  # Extract the confidence score
    print(f"\nQ: {question}")
    print(f"A: {answer}")
    print(f"Confidence Score: {confidence_score:.2f}")



ANSWERS WITH CONFIDENCE SCORES:

Q: What is the total number of parameters in the BloombergGPT model?
A: 50 billion
Confidence Score: 0.97

Q: What is the size of the dataset used to train BloombergGPT?
A: 50 billion
Confidence Score: 0.80

Q: What is the breakdown of the BloombergGPT dataset?
A: 363 billion
Confidence Score: 0.63

Q: What is the source of the financial data used to train BloombergGPT?
A: Training Chronicles
Confidence Score: 0.17

Q: How does the BloombergGPT dataset size compare to other language models?
A: outperforms existing models
Confidence Score: 0.19

Q: What are the key features or capabilities of BloombergGPT?
A: stan-
dard LLM benchmarks
Confidence Score: 0.09

Q: What are the potential applications of BloombergGPT?
A: sentiment analysis and named entity recognition to question answering
Confidence Score: 0.80

Q: What were the challenges in developing BloombergGPT?
A: sentiment analysis and named entity recognition to question answering
Confidence Score: 

After these adjustments, the answers generated by "distilbert-base-cased-distilled-squad" became shorter. To lengthen them, I concatenated the text from all pages of the PDF to give the model more context, although the call above still passes only the first 3,000 characters (`text[:3000]`) to the model. This did not noticeably improve answer quality. Finally, I added the confidence score that the pipeline returns with each answer, which makes it easier to judge how reliable an answer is.
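
One way to let the model see more than the first 3,000 characters is to split the text into overlapping windows, query each window, and keep the highest-scoring answer. The sketch below is an assumption about how this could be done, not what the cell above does: `chunk_text`, `best_answer`, and the window sizes are hypothetical, and `qa_fn` stands for any callable that accepts `question=` and `context=` and returns a dict with `answer` and `score`, as the pipeline does.

```python
from typing import Callable, Dict, List


def chunk_text(text: str, size: int = 3000, overlap: int = 300) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


def best_answer(question: str, chunks: List[str], qa_fn: Callable[..., Dict]) -> Dict:
    """Ask `question` against every chunk and return the highest-scoring result."""
    if not chunks:
        raise ValueError("no chunks to search")
    results = [qa_fn(question=question, context=chunk) for chunk in chunks]
    return max(results, key=lambda result: result["score"])
```

With the objects defined in the cell above, a call might look like `best_answer(questions[0], chunk_text(text), question_answerer)`. Scores from separate calls are not calibrated against each other, so the maximum is a heuristic rather than a guarantee that the best window was chosen.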

**Considerations and Conclusions on DistilBERT Results**

1. High Confidence on Key Facts: DistilBERT displayed strong confidence in identifying BloombergGPT’s total parameter count (0.97), suggesting its capacity to accurately extract straightforward numerical information.

2. Inconsistent Confidence on Context-Specific Queries: The model’s confidence fluctuated when dealing with context-specific questions. For instance, its confidence on the dataset source (0.17) and BloombergGPT's future plans (0.03) indicates that it struggled with questions that required deeper contextual understanding.

3. Limitations in Detail with Complex Queries: DistilBERT's responses to questions on dataset breakdown and key features had moderate and very low confidence scores (0.63 and 0.09, respectively), and neither answer gave the detail the question asked for. This implies a need for improvement in handling complex, domain-specific queries that demand greater detail.

4. Capability to Recognize Financial Use Cases: On questions about BloombergGPT’s applications, the model displayed moderate confidence (0.80) and correctly identified potential applications, indicating its ability to recognize established use cases within financial AI.

5. Overall Assessment: While DistilBERT handles basic fact-based questions well, its inconsistent performance on context-heavy questions points to a need for enhancements in handling nuanced, domain-specific queries.
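
To make labels such as "high", "moderate", and "low" confidence reproducible, the scores can be bucketed with explicit cut-offs. The sketch below uses the seven DistilBERT scores that appear in the printed output above; the two thresholds are hypothetical values chosen for illustration, not something the model or the library defines.

```python
# Hypothetical cut-offs; adjust them to your own tolerance for error
HIGH_THRESHOLD = 0.75
MODERATE_THRESHOLD = 0.40


def confidence_label(score: float) -> str:
    """Map a confidence score in [0, 1] to a coarse label."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MODERATE_THRESHOLD:
        return "moderate"
    return "low"


# (question topic, score) pairs copied from the DistilBERT output above
distilbert_scores = [
    ("parameter count", 0.97),
    ("dataset size", 0.80),
    ("dataset breakdown", 0.63),
    ("data source", 0.17),
    ("dataset size comparison", 0.19),
    ("key features", 0.09),
    ("applications", 0.80),
]

for topic, score in distilbert_scores:
    print(f"{topic:<26}{score:.2f}  {confidence_label(score)}")
```

A label only describes how certain the model was, not whether it was right: the dataset-size answer scored 0.80 even though "50 billion" is the parameter count.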

### Part 2 - Question Answering with "deepset/roberta-base-squad2"

As previously mentioned, I began the second part of the assignment by experimenting with a different summarization model, "sshleifer/distilbart-cnn-12-6", alongside the question-answering model. This required additional imports, specifically AutoTokenizer and AutoModelForSeq2SeqLM. Although the summarization process consumed considerable time and computational resources, the resulting answers appeared to be more articulate. However, I still had concerns about their accuracy, which led me to align my approach with the methodology used in the first part of the assignment to ensure consistency and reliability.

In [None]:
# Replicates the initial approach, which involved PDF import and summarization

# Step 1: Read and Summarize the PDF to define the context
!pip install transformers PyPDF2
from transformers import pipeline
import PyPDF2

# Read PDF
reader = PyPDF2.PdfReader('/content/Article 6 BloombergGPT_ A Large Language Model for Finance')
text = ""
for page in reader.pages:
    text += page.extract_text()

# Summarize to get main points
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(text[0:1000], max_length=200, min_length=60)  # we use [0:1000] to get first 1000 characters as example
print("\nSUMMARY:")
print(summary[0]['summary_text'])

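# Step 2: Use the content as context for Question-Answering
model_name = "deepset/roberta-base-squad2"
# Build the pipeline from model_name; without this line the loop below would
# reuse whichever question_answerer was defined in an earlier cell
question_answerer = pipeline("question-answering", model=model_name)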

# Ask 10 questions about the PDF
questions = [
    "How many parameters does BloombergGPT have?",
    "What makes BloombergGPT suitable for financial tasks?",
    "What challenges were faced during BloombergGPT's training process?",
    "What inspired BloombergGPT’s model structure?",
    "What types of public datasets are included in BloombergGPT's training?",
    "What specific improvements were made to the tokenizer in BloombergGPT?",
    "How does BloombergGPT manage model and data scaling within a fixed budget?",
    "What are some general-purpose benchmarks used to evaluate BloombergGPT?",
    "How does BloombergGPT perform on financial-specific benchmarks?",
    "How does BloombergGPT’s performance compare to GPT-3 in linguistic tasks?"
]

# Get answers
for question in questions:
    result = question_answerer(question=question, context=text)
    print(f"\nQ: {question}")
    print(f"A: {result['answer']}")





SUMMARY:
BloombergGPT is a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources. It is perhaps the largest domain-speciousc dataset yet, augmented with345 billion tokens from general purpose datas.

Q: How many parameters does BloombergGPT have?
A: 50 billion

Q: What makes BloombergGPT suitable for financial tasks?
A: fnegative/neutral

Q: What challenges were faced during BloombergGPT's training process?
A: required an enormous amount of computation

Q: What inspired BloombergGPT’s model structure?
A: BLOOM

Q: What types of public datasets are included in BloombergGPT's training?
A: press releases, news articles, and lings

Q: What specific improvements were made to the tokenizer in BloombergGPT?
A: shrinking the learning rate or gradient clipping

Q: How does BloombergGPT manage model and data scaling within a fixed budget?
A: 23NER

Q: What are some general-purp

Since the answers from the new model were the same as those from the first part of the assignment, I figured it was time to switch up the summarizer. I hoped that using a different model would lead to more varied results.

In [None]:
# Changes the summarization model to test if there's a difference in the provided answers

# Step 1: Read and Summarize the PDF to define the context
!pip install transformers PyPDF2
import PyPDF2
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Read PDF
pdf_path = '/content/Article 6 BloombergGPT_ A Large Language Model for Finance'
reader = PyPDF2.PdfReader(pdf_path)
text = ""
for page in reader.pages:
    text += page.extract_text()

# Generate summary
# Load the tokenizer and model explicitly and hand them to the pipeline,
# so that nothing is loaded without being used
summarizer_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(summarizer_name)
model = AutoModelForSeq2SeqLM.from_pretrained(summarizer_name)
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
summary = summarizer(text[0:1000], max_length=200, min_length=60)
print("\nSUMMARY:")
print(summary[0]['summary_text'])

# Continued with QA based on the summary....


SUMMARY:
 The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering . In this work, we purposefully present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of  nancial data . We construct a 363 billion token dataset based on Bloomberg's extensive data sources .


Driven by my curiosity to test a domain-specific model for potentially more accurate responses, I revised my initial plan and opted to use the "MayaPH/FinOPT-Franklin" model for question answering, skipping the summarization task altogether. This decision was made to streamline the process and focus directly on the quality of the answers. I ensured that the structure of this model aligned closely with the previous one to minimize inconsistencies and maintain a clear comparison of the responses and their corresponding confidence scores.
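
Because the same ten questions are run against several models, the question loop can be wrapped in one function so that every model is queried in exactly the same way. This is a minimal sketch rather than the code used in the cells: `run_qa` and `print_rows` are hypothetical helper names, and `qa_fn` is any callable with the pipeline's `question=`/`context=` interface.

```python
from typing import Callable, Dict, List


def run_qa(qa_fn: Callable[..., Dict], questions: List[str], context: str) -> List[Dict]:
    """Ask every question against the same context and collect the results."""
    rows = []
    for question in questions:
        result = qa_fn(question=question, context=context)
        rows.append({
            "question": question,
            "answer": result["answer"],
            "score": float(result["score"]),
        })
    return rows


def print_rows(rows: List[Dict]) -> None:
    """Print results in the same layout as the cells in this notebook."""
    print("\nANSWERS WITH CONFIDENCE SCORES:")
    for row in rows:
        print(f"\nQ: {row['question']}")
        print(f"A: {row['answer']}")
        print(f"Confidence Score: {row['score']:.2f}")
```

With a pipeline already built, usage would be along the lines of `print_rows(run_qa(question_answerer, questions, text[:3000]))`. Keeping the rows as dictionaries also makes it easy to compare two models afterwards.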

In [None]:
# Step 1: Install transformers (models) and PyPDF2 (PDF reading)
!pip install transformers PyPDF2

# Step 2: Import required libraries
from transformers import pipeline, AutoTokenizer
import PyPDF2

# Step 3: Read the PDF file
pdf_path = '/content/Article 6 BloombergGPT_ A Large Language Model for Finance'
reader = PyPDF2.PdfReader(pdf_path)
text = "".join(page.extract_text() for page in reader.pages)

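# Step 4: Set up the Question-Answering model
# Note: this checkpoint has no fine-tuned QA head (see the warning in the output below)
tokenizer = AutoTokenizer.from_pretrained("MayaPH/FinOPT-Franklin")
question_answerer = pipeline("question-answering", model="MayaPH/FinOPT-Franklin", tokenizer=tokenizer)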

# Step 5: Define questions
questions = [
    "What is the total number of parameters in the BloombergGPT model?",
    "What is the size of the dataset used to train BloombergGPT?",
    "What is the breakdown of the BloombergGPT dataset?",
    "What is the source of the financial data used to train BloombergGPT?",
    "How does the BloombergGPT dataset size compare to other language models?",
    "What are the key features or capabilities of BloombergGPT?",
    "What are the potential applications of BloombergGPT?",
    "What were the challenges in developing BloombergGPT?",
    "How does BloombergGPT compare to other large language models?",
    "What are the plans for BloombergGPT's future development?"
]

# Run Question-Answering
print("\nANSWERS WITH CONFIDENCE SCORES:")
for question in questions:
    result = question_answerer(question=question, context=text[:3000])  # Limit context length if needed
    answer = result['answer']
    confidence_score = result['score']  # Extract the confidence score
    print(f"\nQ: {question}")
    print(f"A: {answer}")
    print(f"Confidence Score: {confidence_score:.2f}")



Some weights of OPTForQuestionAnswering were not initialized from the model checkpoint at MayaPH/FinOPT-Franklin and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



ANSWERS WITH CONFIDENCE SCORES:

Q: What is the total number of parameters in the BloombergGPT model?
A:  (24B tokens
Confidence Score: 0.00

Q: What is the size of the dataset used to train BloombergGPT?
A: 2.2.2 C4 (138
Confidence Score: 0.00

Q: What is the breakdown of the BloombergGPT dataset?
A: .2 C4 (138B tokens { 19.48% of training
Confidence Score: 0.00

Q: What is the source of the financial data used to train BloombergGPT?
A:  Pile (184B tokens { 25.9% of training
Confidence Score: 0.00

Q: How does the BloombergGPT dataset size compare to other language models?
A: .9% of training
Confidence Score: 0.00

Q: What are the key features or capabilities of BloombergGPT?
A: 
2.2.3 Wikipedia (24B tokens
Confidence Score: 0.00

Q: What are the potential applications of BloombergGPT?
A:  (24B tokens { 3.35% of training
Confidence Score: 0.00

Q: What were the challenges in developing BloombergGPT?
A:  (24B tokens { 3.35% of training
Confidence Score: 0.00

Q: How does BloombergGPT 

Even though I had high hopes for the finance-related model, the answers it gave were really disappointing, especially since the confidence score was always 0.00, which makes the model unreliable for this task. The warning in the output above points to the likely cause: the question-answering layer (`qa_outputs`) was newly initialized rather than loaded from the checkpoint, so the model has not actually been trained to extract answers. I looked for other domain-specific question-answering models to try out in the second part of the assignment, but most of them involved steps I’m not familiar with yet. So, I decided to keep it simple and use another general-purpose model instead. This way, I can compare the results with the previous one while making things a bit easier on myself.
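
The "newly initialized" warning in the FinOPT output suggests a simple pre-check before running the questions: look at the architecture names stored in a checkpoint's configuration and see whether any of them is a question-answering class. The function below is a hypothetical helper, and the two example lists are illustrative values, not something read from the Hub in this notebook.

```python
from typing import Iterable, Optional


def has_qa_head(architectures: Optional[Iterable[str]]) -> bool:
    """Return True if any architecture name ends in 'ForQuestionAnswering'."""
    return any(name.endswith("ForQuestionAnswering") for name in (architectures or []))


# Illustrative architecture lists (assumed, not downloaded here)
examples = {
    "extractive QA checkpoint": ["DistilBertForQuestionAnswering"],
    "text-generation checkpoint": ["OPTForCausalLM"],
}

for label, architectures in examples.items():
    verdict = "has a QA head" if has_qa_head(architectures) else "no QA head listed"
    print(f"{label}: {verdict}")
```

With transformers installed, the list for a real model could be read from `AutoConfig.from_pretrained(model_id).architectures`. A checkpoint that lists no question-answering class can still be loaded by the question-answering pipeline, but its answer layer starts from random weights, as the warning states.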

In [None]:
# Final approach: tests another general-purpose QA model

# Step 1: Install transformers (models) and PyPDF2 (PDF reading)
!pip install transformers PyPDF2

# Step 2: Import required libraries
from transformers import pipeline
import PyPDF2

# Step 3: Read the PDF file
pdf_path = '/content/Article 6 BloombergGPT_ A Large Language Model for Finance'
reader = PyPDF2.PdfReader(pdf_path)
text = "".join(page.extract_text() for page in reader.pages) # Concatenate text from pdf

# Step 4: Set up the Question-Answering model
question_answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Step 5: Define questions
questions = [
    "What is the total number of parameters in the BloombergGPT model?",
    "What is the size of the dataset used to train BloombergGPT?",
    "What is the breakdown of the BloombergGPT dataset?",
    "What is the source of the financial data used to train BloombergGPT?",
    "How does the BloombergGPT dataset size compare to other language models?",
    "What are the key features or capabilities of BloombergGPT?",
    "What are the potential applications of BloombergGPT?",
    "What were the challenges in developing BloombergGPT?",
    "How does BloombergGPT compare to other large language models?",
    "What are the plans for BloombergGPT's future development?"
]

# Run Question-Answering
print("\nANSWERS WITH CONFIDENCE SCORES:")
for question in questions:
    result = question_answerer(question=question, context=text[:3000])  # Limit context length if needed
    answer = result['answer']
    confidence_score = result['score']  # Extract the confidence score
    print(f"\nQ: {question}")
    print(f"A: {answer}")
    print(f"Confidence Score: {confidence_score:.2f}")






ANSWERS WITH CONFIDENCE SCORES:

Q: What is the total number of parameters in the BloombergGPT model?
A: 50 billion
Confidence Score: 0.78

Q: What is the size of the dataset used to train BloombergGPT?
A: 363 billion token
Confidence Score: 0.18

Q: What is the breakdown of the BloombergGPT dataset?
A: 363 billion token
Confidence Score: 0.07

Q: What is the source of the financial data used to train BloombergGPT?
A: Bloomberg
Confidence Score: 0.15

Q: How does the BloombergGPT dataset size compare to other language models?
A: 50 billion
Confidence Score: 0.26

Q: What are the key features or capabilities of BloombergGPT?
A: nancial data
Confidence Score: 0.01

Q: What are the potential applications of BloombergGPT?
A: sentiment analysis and named entity recognition to question answering
Confidence Score: 0.55

Q: What were the challenges in developing BloombergGPT?
A: Large
Language Models
Confidence Score: 0.00

Q: How does BloombergGPT compare to other large language models?
A: 

**Considerations and Conclusions on deepset/roberta-base-squad2 Results**

1. Response Accuracy: The answer for the total number of parameters (50 billion) aligns with the article’s information, supported by a moderate confidence score of 0.78. However, several other answers diverged significantly, suggesting limited alignment with domain-specific content.

2. Reliability and Low Confidence: Many answers had low or zero confidence scores, which lowered the model's reliability. For example, the answers about challenges in development and future plans returned scores of 0.00, and the answer about the dataset breakdown scored only 0.07. These responses were also inaccurate, as they missed relevant information present in the article.

3. Repetitive and Vague Responses: The model showed a tendency to repeat general information (e.g., "363 billion tokens" for multiple questions), which highlights its struggle with providing distinct, informative answers in a financial context.

4. Domain Adaptation: Although the model identified potential applications such as sentiment analysis and named entity recognition with a moderate confidence score (0.55), it struggled with complex financial questions. This indicates that while it has a basic understanding, the model may benefit from additional fine-tuning on financial data for improved accuracy.

5. Areas for Improvement: Given the low confidence and generalizations in responses, deepset/roberta-base-squad2 could benefit from further training with financial data, as this might increase its precision and relevance in this domain.
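
The repetition noted in point 3 can be checked mechanically by grouping questions by the answer they received. The sketch below uses the first seven answers from the deepset/roberta-base-squad2 output above, with shortened question labels; `repeated_answers` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def repeated_answers(rows: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each answer given to more than one question to those questions."""
    by_answer = defaultdict(list)
    for question, answer in rows:
        # Normalize case and surrounding whitespace before comparing
        by_answer[answer.strip().lower()].append(question)
    return {answer: qs for answer, qs in by_answer.items() if len(qs) > 1}


# (question label, answer) pairs copied from the output above
roberta_rows = [
    ("total parameters", "50 billion"),
    ("dataset size", "363 billion token"),
    ("dataset breakdown", "363 billion token"),
    ("source of financial data", "Bloomberg"),
    ("dataset size vs. other models", "50 billion"),
    ("key features", "nancial data"),
    ("potential applications", "sentiment analysis and named entity recognition to question answering"),
]

for answer, question_labels in repeated_answers(roberta_rows).items():
    print(f"{answer!r} was returned for: {', '.join(question_labels)}")
```

An answer that recurs across unrelated questions is a hint that the model is falling back on a salient phrase in the context rather than responding to the question.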

## Final Comparison of DistilBERT and deepset/roberta-base-squad2 Results

### Overview
The performance of DistilBERT and deepset/roberta-base-squad2 on answering questions about the BloombergGPT model varied significantly across accuracy, relevance, and response quality. The following section presents an analysis of their responses against the correct answers, with a particular focus on assessing each model's strengths and weaknesses in handling the questions.

### Question-by-Question Comparison

1. **Total Number of Parameters in BloombergGPT**
   - **DistilBERT**: Provided the correct answer of 50 billion parameters with a high confidence score (0.97), indicating both accuracy and confidence alignment.
   - **deepset/roberta-base-squad2**: Also provided the correct answer, with a lower confidence score (0.78).
   - **Assessment**: Both models correctly answered this question, though DistilBERT displayed greater confidence. This indicates that both models perform well with clear, fact-based queries.

2. **Dataset Size Used to Train BloombergGPT**
   - **DistilBERT**: Incorrectly answered with 50 billion, which is the model's parameter count rather than the dataset size, with moderate confidence (0.80).
   - **deepset/roberta-base-squad2**: Provided a partially correct answer of 363 billion tokens, which covers only the financial portion of the dataset, with low confidence (0.18).
   - **Assessment**: Here, deepset/roberta-base-squad2 demonstrated partial accuracy by identifying part of the dataset size, while DistilBERT’s response was inaccurate. This highlights that deepset/roberta-base-squad2 may have a slight advantage in handling numerical data points with multiple components, despite lower confidence.

3. **Dataset Breakdown**
   - **DistilBERT**: Gave a partially correct answer, stating 363 billion tokens, which refers only to the financial segment, with a moderate confidence score (0.63).
   - **deepset/roberta-base-squad2**: Provided the same partially correct answer (363 billion tokens) but with a much lower confidence score (0.07).
   - **Assessment**: Both models failed to capture the full dataset breakdown but correctly recognized the financial dataset component. Their limited grasp of multi-part answers underscores a need for additional data structuring or segmentation capabilities.

4. **Source of Financial Data**
   - **DistilBERT**: Responded with “Training Chronicles,” an irrelevant answer with a low confidence score (0.17).
   - **deepset/roberta-base-squad2**: Correctly identified “Bloomberg” as the source of financial data, with a low confidence score (0.15).
   - **Assessment**: Despite low confidence, deepset/roberta-base-squad2 correctly pinpointed Bloomberg as the data source, showcasing a slight edge in factual data recognition, even when it lacks contextual depth.

5. **Dataset Size Comparison with Other Models**
   - **DistilBERT**: Offered a vague answer, “outperforms existing models,” with low confidence (0.19).
   - **deepset/roberta-base-squad2**: Answered with “50 billion,” which is equally irrelevant to the comparison, with a confidence score of 0.26.
   - **Assessment**: Neither model provided a specific, relevant answer. This outcome suggests that general comparisons might require additional contextual embedding for improved specificity.

6. **Key Features and Capabilities**
   - **DistilBERT**: Provided an incomplete response, “standard LLM benchmarks,” with minimal confidence (0.09).
   - **deepset/roberta-base-squad2**: Produced the partially correct answer “financial data” with an extremely low confidence score (0.01).
   - **Assessment**: Both responses lacked detail, accuracy, and confidence, highlighting the need for refined fine-tuning in domain-specific capability descriptions.

7. **Potential Applications**
   - **DistilBERT**: Answered correctly, mentioning “sentiment analysis and named entity recognition to question answering,” with a moderately high confidence score (0.80).
   - **deepset/roberta-base-squad2**: Offered the same correct answer with slightly lower confidence (0.55).
   - **Assessment**: Both models performed well in identifying potential applications, demonstrating strength in tasks closely aligned with common NLP functions. DistilBERT displayed higher confidence, potentially signaling a stronger alignment with practical NLP applications.

8. **Challenges in Developing BloombergGPT**
   - **DistilBERT**: Repeated its answer to the applications question (“sentiment analysis and named entity recognition to question answering”), which does not describe development challenges, with low confidence (0.26).
   - **deepset/roberta-base-squad2**: Answered irrelevantly with “Large Language Models,” accompanied by a confidence score of 0.00.
   - **Assessment**: Neither model identified the development challenges. DistilBERT fell back on a phrase it had already used for another question, while deepset/roberta-base-squad2 returned a generic term with no confidence.

9. **Comparison with Other Large Language Models**
   - **DistilBERT**: Incorrectly stated “50 billion parameters” with a confidence score of 0.23.
   - **deepset/roberta-base-squad2**: Provided a generic response, “outperforms existing models,” with a confidence score of 0.00.
   - **Assessment**: Both models struggled with this question, reflecting limited abilities in producing comparative insights without explicit prompts.

10. **Plans for Future Development**
    - **DistilBERT**: Did not provide a relevant answer.
    - **deepset/roberta-base-squad2**: Produced the response “9 2.2.3” with zero confidence, failing to grasp the question context.
    - **Assessment**: Both models failed to deliver meaningful responses for future development questions, indicating an area where models may benefit from further training in roadmap interpretation.
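
The confidence figures quoted in the comparison can be laid side by side in a small table. The sketch below covers questions 1 to 7 only, because those are the questions whose scores appear in both printed outputs; the short topic labels are my own shorthand for the questions.

```python
# (DistilBERT score, deepset/roberta-base-squad2 score) for questions 1-7,
# copied from the two printed outputs earlier in this notebook
scores = {
    "Total parameters": (0.97, 0.78),
    "Dataset size": (0.80, 0.18),
    "Dataset breakdown": (0.63, 0.07),
    "Source of financial data": (0.17, 0.15),
    "Dataset size vs. other models": (0.19, 0.26),
    "Key features": (0.09, 0.01),
    "Potential applications": (0.80, 0.55),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


distilbert_mean = mean(d for d, _ in scores.values())
roberta_mean = mean(r for _, r in scores.values())

print(f"{'Question':<32}{'DistilBERT':>11}{'RoBERTa':>9}{'Diff':>7}")
for topic, (d, r) in scores.items():
    print(f"{topic:<32}{d:>11.2f}{r:>9.2f}{d - r:>+7.2f}")
print(f"{'Mean':<32}{distilbert_mean:>11.2f}{roberta_mean:>9.2f}")
```

The table compares confidence only. As the dataset-size question shows, the more confident model is not necessarily the more accurate one, so correctness still has to be judged against the article.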

### Conclusion
Overall, DistilBERT displayed higher accuracy and confidence on straightforward, fact-based questions, such as the parameter count and potential applications, but it struggled with multi-part answers and with questions that were complex or ambiguous. deepset/roberta-base-squad2, by contrast, showed some accuracy in numerical and categorical responses and occasionally gave partially correct answers to multi-part questions, though with generally lower confidence scores.

In summary, DistilBERT shows potential for applications that prioritize precision in well-defined, single-part questions, whereas deepset/roberta-base-squad2 offers an edge in scenarios where nuanced, multi-part answers are needed, despite lower confidence in specific details. Additional domain-specific training (aimed at broader contextual flexibility for DistilBERT and at greater confidence in numerical and factual outputs for deepset/roberta-base-squad2) could make both models more robust and adaptable to the demands of domain-specific knowledge extraction.