In [57]:
import warnings

# Filter out the specific warning. `message` is a regular expression matched against the
# START of the warning text, so the pattern has to begin with "Be aware".
# This filter only affects warnings raised through the `warnings` module; messages that
# transformers emits through its own logger are not affected by it.
warnings.filterwarnings("ignore", message=r"Be aware, overflowing tokens are not returned")

In [105]:
import requests
from bs4 import BeautifulSoup
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

# Step 1: Web scraping and preprocessing
def fetch_and_scrape(url):
    response = requests.get(url, timeout=30)  # Time out instead of hanging on an unresponsive server
    response.raise_for_status()  # Raises an HTTPError for bad responses (4XX or 5XX)
    soup = BeautifulSoup(response.text, 'html.parser')
    text = ' '.join([p.text for p in soup.find_all('p')])  # Extract text from all paragraph tags
    return text

# Step 2: Fetch the page text used as context
context = fetch_and_scrape('https://www.fisheries.noaa.gov/species/giant-manta-ray')

# Step 3: Load Pretrained Model
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

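# Step 4: Tokenize Input
def tokenize_input(question, context):
    # truncation='only_second' trims only the context, so the question is never cut.
    # Plain truncation=True selects 'longest_first' (which logs the 'overflowing tokens'
    # message), and the legacy 'truncation_strategy' keyword does not override it.
    inputs = tokenizer(question, context, return_tensors="pt", max_length=512, truncation='only_second', padding=True)
    return inputs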

# Step 5: Question Answering
# Define a function to extract the answer to a question from the BERT model's output
def get_answer(inputs):
    with torch.no_grad():
        outputs = model(**inputs)
        answer_start_scores = outputs.start_logits
        answer_end_scores = outputs.end_logits

    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end])
    )
    return answer


# Step 6: Extract Answer
def extract_answer(question, context):
    inputs = tokenize_input(question, context)
    answer = get_answer(inputs)
    return answer

# Step 7: Display Answers
def display_answer(question, answer):
    print("Question:", question)
    print("Answer:", answer)



Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
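
The tokenizer call above caps the input at 512 tokens, so any part of the scraped page beyond that limit is never seen by the model. A common workaround is to split the context into overlapping windows and query each window separately. The sketch below shows only the windowing logic on a plain list of tokens. `sliding_windows`, `window`, and `overlap` are hypothetical names that do not appear in the notebook. Fast Hugging Face tokenizers can produce comparable windows through the `return_overflowing_tokens=True` and `stride` arguments.

```python
def sliding_windows(tokens, window=384, overlap=128):
    """Split `tokens` into windows of at most `window` items.

    Consecutive windows share `overlap` items, so an answer that straddles
    a boundary still appears whole in at least one window (provided it is
    not longer than the overlap).
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # The last window already reaches the end of the text
    return windows


# Toy example: 10 tokens, windows of 4 with an overlap of 2
demo_tokens = [f"tok{i}" for i in range(10)]
demo_windows = sliding_windows(demo_tokens, window=4, overlap=2)
for w in demo_windows:
    print(w)
```

Each window would then be paired with the question and passed through the model, and the highest-scoring span across windows kept as the answer.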


In [106]:
question = 'What is the manta ray ?'
answer = extract_answer(question, context)
display_answer(question, answer)

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Question: What is the manta ray ?
Answer: world ’ s largest ray


In [107]:
question = 'What is the manta ray population status?'
answer = extract_answer(question, context)
display_answer(question, answer)

Question: What is the manta ray population status?
Answer: unknown


In [108]:
question = 'What is the main threat for manta rays?'
answer = extract_answer(question, context)
display_answer(question, answer)

Question: What is the main threat for manta rays?
Answer: commercial fishing


In [109]:
question = 'What are the color types?'
answer = extract_answer(question, context)
display_answer(question, answer)

Question: What are the color types?
Answer: chevron ( mostly black back and white belly ) and black ( almost completely black on both sides )


In [110]:
question = 'What is the appearance of manta rays?'
answer = extract_answer(question, context)
display_answer(question, answer)

Question: What is the appearance of manta rays?
Answer: large diamond - shaped body with elongated wing - like pectoral fins


In [111]:
question = 'What is manta rays diet ?'
answer = extract_answer(question, context)
display_answer(question, answer)

Question: What is manta rays diet ?
Answer: zooplankton
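
The answers above are rebuilt from WordPiece tokens, which is why they contain stray spaces around punctuation ("world ’ s largest ray", "diamond - shaped"). The sketch below is a purely heuristic clean-up based on regular expressions. `tidy_answer` is a hypothetical helper, not part of the notebook or of transformers, and it will also collapse a hyphen that was meant as a spaced dash. A more faithful option, assuming a fast tokenizer is used, would be to map the token span back to the original text through the character offsets returned with `return_offsets_mapping=True`.

```python
import re


def tidy_answer(text):
    """Heuristically remove the spaces that token-joined output leaves around punctuation."""
    # Rejoin thousands separators: "22 , 000" -> "22,000"
    text = re.sub(r"(?<=\d) , (?=\d{3}\b)", ",", text)
    # No space before closing punctuation: "ray ." -> "ray."
    text = re.sub(r"\s+([,.;:!?)\]])", r"\1", text)
    # No space after an opening bracket: "( mostly" -> "(mostly"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    # Rejoin hyphens and apostrophes: "diamond - shaped" -> "diamond-shaped"
    text = re.sub(r"\s*([-’'])\s*", r"\1", text)
    return text.strip()


print(tidy_answer("world ’ s largest ray"))
print(tidy_answer("large diamond - shaped body with elongated wing - like pectoral fins"))
```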


# Methods to obtain longer answers

In [112]:
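# Note: this redefinition replaces the earlier get_answer, so extract_answer now
# returns the 'expanded' answer by default. The strategy functions it dispatches to
# are defined in the following cells and must exist before get_answer is called.
def get_answer(inputs, answer_strategy='expanded', **kwargs):
    if answer_strategy == 'expanded':
        return get_expanded_answer(inputs, **kwargs)
    elif answer_strategy == 'multiple':
        return get_multiple_answers(inputs, **kwargs)
    elif answer_strategy == 'with_context':
        return get_answer_with_context(inputs, **kwargs)
    else:
        raise ValueError(f"Invalid answer strategy: {answer_strategy!r}")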

In [113]:
def get_expanded_answer(inputs, expansion_tokens=50):
    with torch.no_grad():
        outputs = model(**inputs)
        answer_start_scores, answer_end_scores = outputs.start_logits, outputs.end_logits

    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    # Expanding the context around the answer
    start_expansion = max(answer_start - expansion_tokens, 0)  # Ensure start is not negative
    end_expansion = min(answer_end + expansion_tokens, inputs.input_ids.size(1))  # Ensure end does not exceed input size

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs.input_ids[0][start_expansion:end_expansion]))
    return answer


In [114]:
# Define a function to extract multiple answers
def get_multiple_answers(inputs, num_answers=3):
    with torch.no_grad():
        outputs = model(**inputs)
        answer_start_scores, answer_end_scores = outputs.start_logits, outputs.end_logits

    # Get the top 'num_answers' start and end positions (each ranked independently)
    top_starts = torch.topk(answer_start_scores, num_answers).indices
    top_ends = torch.topk(answer_end_scores, num_answers).indices

    answers = []
    # Pair the i-th best start with the i-th best end. Pairs whose end precedes the
    # start are skipped, so fewer than 'num_answers' answers may be returned.
    for i, (start, end) in enumerate(zip(top_starts[0], top_ends[0])):
        if end >= start:  # Ensure valid index ordering
            answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs.input_ids[0][start:end+1]))
            answers.append((i, answer))
    return answers
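
The function above pairs the i-th best start with the i-th best end, which can yield very long or unrelated spans, as the second answer in the output further down illustrates. An alternative commonly used for extractive question answering is to score every combination of the top starts and top ends by the sum of their logits, keeping only spans that are correctly ordered and not too long. The sketch below works on plain lists of floats. `rank_spans`, `top_k`, and `max_answer_len` are hypothetical names, and the toy logits are made up for illustration.

```python
def rank_spans(start_logits, end_logits, top_k=3, max_answer_len=30):
    """Return up to `top_k` (score, start, end) spans, best first; `end` is inclusive."""

    def top_indices(scores):
        # Indices of the `top_k` largest scores, largest first
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

    candidates = []
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Keep spans that are ordered and at most `max_answer_len` tokens long
            if start <= end < start + max_answer_len:
                candidates.append((start_logits[start] + end_logits[end], start, end))
    candidates.sort(reverse=True)
    return candidates[:top_k]


# Made-up logits for a five-token input
toy_start = [0.1, 2.0, 0.3, 1.5, 0.0]
toy_end = [0.0, 0.5, 1.8, 0.2, 1.2]
toy_ranked = rank_spans(toy_start, toy_end, top_k=3, max_answer_len=3)
for score, start, end in toy_ranked:
    print(start, end, round(score, 2))
```

With the notebook's model, the inputs could be obtained as `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and each span decoded from `inputs.input_ids[0][start:end + 1]`.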

In [115]:
# Extract an answer with additional context
def get_answer_with_context(inputs, context_tokens=20):
    with torch.no_grad():
        outputs = model(**inputs)
        answer_start_scores, answer_end_scores = outputs.start_logits, outputs.end_logits

    answer_start = torch.argmax(answer_start_scores)
    answer_end = torch.argmax(answer_end_scores) + 1

    # Adding more context around the answer
    start_context = max(answer_start - context_tokens, 0)
    end_context = min(answer_end + context_tokens, inputs.input_ids.size(1))

    answer_with_context = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs.input_ids[0][start_context:end_context]))
    return answer_with_context
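
Both expansion helpers clamp the widened span to the bounds of the whole input, so a large margin can pull in the question tokens and the `[CLS]`/`[SEP]` markers that precede the context. One way to avoid this, sketched below, is to clamp to the context segment instead. For a BERT question/context pair the tokenizer's `token_type_ids` are 0 for the question side and 1 for the context side. `expand_within_context` and `margin` are hypothetical names, and the function operates on plain integers and lists, so tensor values would first need `int(...)` or `.tolist()`.

```python
def expand_within_context(answer_start, answer_end, token_type_ids, margin=20):
    """Widen the span [answer_start, answer_end) by `margin` tokens on each side
    without leaving the context segment (positions whose token type id is 1)."""
    context_positions = [i for i, segment in enumerate(token_type_ids) if segment == 1]
    if not context_positions:
        return answer_start, answer_end  # No context segment found; leave the span unchanged
    context_start = context_positions[0]
    # Exclusive end: the last segment-1 token is the closing [SEP], which is left out
    context_end = context_positions[-1]
    start = min(max(answer_start - margin, context_start), context_end)
    end = max(min(answer_end + margin, context_end), start)
    return start, end


# Toy layout: 4 question-side tokens, then 7 context tokens and the closing [SEP]
toy_type_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(expand_within_context(6, 8, toy_type_ids, margin=1))
print(expand_within_context(6, 8, toy_type_ids, margin=5))
```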

In [116]:
question = "What is manta rays?"
answer = get_answer(tokenize_input(question, context), answer_strategy='expanded', expansion_tokens=100)
display_answer(question, answer)

Question: What is manta rays?
Answer: home to the largest population of giant manta ray , comprising over 22 , 000 individuals , with large aggregation sites within the waters of the machalilla national park and the galapagos marine reserve . overall , given their life history traits , particularly their low reproductive output , giant manta ray populations are inherently vulnerable to depletions , with low likelihood of recovery . additional research is needed to better understand the population structure and global distribution of the giant manta ray . manta rays are recognized by their large diamond - shaped body with elongated wing - like pectoral fins , ventrally - placed gill slits , laterally - placed eyes , and wide , terminal mouths . in front of the mouth , they have two structures called cephalic lobes which extend and help to channel water into the mouth for feeding activities ( making them the only vertebrate animals with three paired appendages ) . manta rays come in two 

In [117]:
question = "What is manta rays?"
answer = get_answer(tokenize_input(question, context), answer_strategy='multiple', num_answers=3)
for ranked_answer in answer:  # The list may hold fewer than 'num_answers' entries
    display_answer(question, ranked_answer)

Question: What is manta rays?
Answer: (0, 'large diamond - shaped body')
Question: What is manta rays?
Answer: (1, 'world ’ s largest ray with a wingspan of up to 26 feet . they are filter feeders and eat large quantities of zooplankton . giant manta rays are slow - growing , migratory animals with small , highly fragmented populations that are sparsely distributed across the world . the main threat to the giant manta ray is commercial fishing , with the species both targeted and caught as bycatch in a number of global fisheries throughout its range . manta rays are particularly valued for their gill plates , which are traded internationally . in 2018 , noaa fisheries listed the species as threatened under the endangered species act . the global population size is unknown . with the exception of ecuador , the few regional population estimates appear to be small , ranging from around 600 to 2 , 000 individuals , and in areas subject to fishing , have significantly declined . ecuador , o

In [118]:
question = "What is manta rays?"
answer = get_answer(tokenize_input(question, context), answer_strategy='with_context')
display_answer(question, answer)

Question: What is manta rays?
Answer: the population structure and global distribution of the giant manta ray . manta rays are recognized by their large diamond - shaped body with elongated wing - like pectoral fins , ventrally - placed gill slits , laterally -
