# Health Fact or Fiction? A Comparison of BERT-Based Models and LLMs on Detecting Health Misinformation About COVID-19 and Measles
*High Risk Project, uaa99, Spring 2025*

## Part 0: Dependencies

For this project you will need to set your Google Gemini API key below.
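
Rather than pasting the key into the notebook, it can be read from the environment. The sketch below assumes the key is stored in an environment variable named `GEMINI_API_KEY`; both that name and the `load_api_key` helper are illustrative choices, not part of the original notebook.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in `env_var` (a hypothetical variable name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return key


# Usage (assumes the variable is set in the shell or Colab session):
# client = genai.Client(api_key=load_api_key())
```

Failing early with a clear message avoids discovering a missing key only after the models have been trained.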

In [None]:
!pip install transformers datasets scikit-learn
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
from google import genai
from google.genai import types
import numpy as np
import wandb
import json
import time
import logging
from concurrent.futures import ThreadPoolExecutor

wandb.init(mode='disabled')
client = genai.Client(api_key="YOUR API KEY HERE")


## Part 1: Model Selection and Preparation

We will evaluate four models on this task: BERT, ClinicalBERT, BiomedBERT, and Gemini 2.0 Flash-Lite.

In [None]:
bertSeqClass = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3, force_download=True)
bertSeqTokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
clinicalBert = AutoModelForSequenceClassification.from_pretrained("emilyalsentzer/Bio_ClinicalBERT", num_labels=3, force_download=True)
clinicalBertTokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
bioMedBert = AutoModelForSequenceClassification.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext", num_labels=3, force_download=True)
bioMedBertTokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
def get_response_from_gemini(system_instruction, content):
  return client.models.generate_content(
      model="gemini-2.0-flash-lite",
      config=types.GenerateContentConfig(
          system_instruction=system_instruction),
      contents=content
  )

In [None]:
def evaluate_claim_with_llm(claim, tokenizer, model):
    sys_message = '''
    You are an AI Medical Assistant trained on a vast dataset of health information. Please evaluate the provided claim
    and respond with the following determination:
    0 - The claim is false
    1 - The claim is true
    2 - I am unable to make a determination

    Please only respond with a 0, 1, or 2. Do not include any other text.
    '''
    # Create messages structured for the chat template
    messages = [{"role": "system", "content": sys_message}, {"role": "user", "content": claim}]

    # Applying chat template
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100, use_cache=False)

    # Extract and return the generated text, removing the prompt
    response_text = tokenizer.batch_decode(outputs)[0].strip()
    print(response_text)
    answer = response_text.split('<|im_start|>assistant')[-1].strip()
    return answer

## Part 2: Data Loading and Preparation
Let's begin by loading the data needed to train and evaluate our models. We use the COVID-19 news rumors dataset from [A COVID-19 Rumor Dataset](https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2021.644801/full), published in Frontiers in Psychology, and the Measles Rumors dataset, which I created. Measles Rumors is publicly available at this link; please download it and save it somewhere accessible to this notebook.

In [None]:
covid_claims = "./news.csv"
df_covid = pd.read_csv(covid_claims, header=None, names=["id", "label", "text", "sentiment"])
df_covid.head()

| | id | label | text | sentiment |
|---|---|---|---|---|
| 0 | 3 | F | The lie that coronavirus came from a bat or a ... | 3 |
| 1 | 4 | F | The health experts had predicted the virus cou... | 3 |
| 2 | 8 | F | The Centers for Disease Control and Prevention... | 3 |
| 3 | 10 | U | Warm weather will kill coronavirus. U.S. Presi... | 2 |
| 4 | 15 | F | Using a hair dryer to breathe in hot air can c... | 2 |


In [None]:
measles_claims = "./measles_claims.csv"
df_measles = pd.read_csv(measles_claims)
print(df_measles['label'].value_counts())

label
0    10
1     7
2     2
Name: count, dtype: int64


In [None]:
# map the string labels to integer labels
label_map = {'F': 0, 'T': 1, 'U': 2, 'U(Twitter)': 2}
df_covid['label'] = df_covid['label'].map(lambda x: label_map.get(x))
df_covid.head()

| | id | label | text | sentiment |
|---|---|---|---|---|
| 0 | 3 | 0 | The lie that coronavirus came from a bat or a ... | 3 |
| 1 | 4 | 0 | The health experts had predicted the virus cou... | 3 |
| 2 | 8 | 0 | The Centers for Disease Control and Prevention... | 3 |
| 3 | 10 | 2 | Warm weather will kill coronavirus. U.S. Presi... | 2 |
| 4 | 15 | 0 | Using a hair dryer to breathe in hot air can c... | 2 |


In [None]:
print(df_covid['label'].value_counts())

label
0    3041
1     659
2     429
Name: count, dtype: int64


In [None]:
# create train test split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df_covid['text'].tolist(),
    df_covid['label'].tolist(),
    test_size=0.2,
    random_state=42
)

val_texts_measles, val_labels_measles = df_measles['text'].tolist(), df_measles['label'].tolist()

In [None]:
# create custom pytorch dataset
class MisinformationDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

## Part 3: Model Training

Now let's train all three BERT-based models using the Hugging Face `Trainer` API.

In [None]:
def get_training_args(num_epochs):
  return TrainingArguments(
    output_dir=None,
    num_train_epochs=num_epochs,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    learning_rate=2e-5,
    lr_scheduler_type='linear',
    report_to="none"
  )

In [None]:
def compute_metrics(p):
    predictions, labels = p
    predictions = predictions.argmax(axis=-1)
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted') # Use 'weighted' for multiclass
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }
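
With `average='weighted'`, each class's score is weighted by its number of true examples. One consequence is that weighted recall always equals accuracy, which is why those two columns coincide in the results reported in this notebook. The sketch below checks this identity on a small, made-up confusion matrix (rows are true labels, columns are predictions); the function names and the matrix are illustrative.

```python
def accuracy_from_confusion(cm):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def weighted_recall_from_confusion(cm):
    """Per-class recall averaged with weights proportional to class support."""
    total = sum(sum(row) for row in cm)
    score = 0.0
    for i, row in enumerate(cm):
        support = sum(row)
        if support == 0:
            continue  # a class with no true examples contributes nothing
        recall = row[i] / support
        score += (support / total) * recall
    return score


# Made-up 3-class confusion matrix, used only to illustrate the identity
toy_cm = [[50, 5, 5], [10, 20, 0], [4, 3, 3]]
print(accuracy_from_confusion(toy_cm), weighted_recall_from_confusion(toy_cm))
```

Because the support terms cancel, the weighted recall reduces to the number of correct predictions divided by the total, so only precision and F1 carry information beyond accuracy here.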

In [None]:
def get_trainer(model, tokenizer, training_args):
  if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

  train_encodings = tokenizer(train_texts, truncation=True, padding=True)
  val_encodings = tokenizer(val_texts, truncation=True, padding=True)

  train_dataset = MisinformationDataset(train_encodings, train_labels)
  val_dataset = MisinformationDataset(val_encodings, val_labels)

  return Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      eval_dataset=val_dataset,
      compute_metrics=compute_metrics,
  )

In [None]:
# init trainers
training_args = get_training_args(15)
bert_trainer = get_trainer(bertSeqClass, bertSeqTokenizer, training_args)
clinical_bert_trainer = get_trainer(clinicalBert, clinicalBertTokenizer, training_args)
bio_bert_trainer = get_trainer(bioMedBert, bioMedBertTokenizer, training_args)

In [None]:
# train
trainers = {
    "BERT": bert_trainer,
    "ClinicalBERT": clinical_bert_trainer,
    "BiomedBERT": bio_bert_trainer,  # microsoft/BiomedNLP-BiomedBERT checkpoint
}

for name, trainer in trainers.items():
  trainer.train()
  torch.cuda.empty_cache()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.502743,0.822034,0.811483,0.822034,0.806586
2,No log,0.498587,0.819613,0.810014,0.819613,0.796641
3,0.474400,0.523539,0.843826,0.843115,0.843826,0.843465
4,0.474400,0.652165,0.849879,0.847308,0.849879,0.847503
5,0.131500,0.798037,0.842615,0.831501,0.842615,0.833703
6,0.131500,0.877854,0.846247,0.837738,0.846247,0.840326
7,0.131500,0.94374,0.847458,0.839228,0.847458,0.841997
8,0.028900,0.985776,0.841404,0.840371,0.841404,0.840838
9,0.028900,1.030564,0.846247,0.840659,0.846247,0.842834
10,0.006900,1.039843,0.847458,0.844223,0.847458,0.845567


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.584503,0.774818,0.762598,0.774818,0.762151
2,No log,0.522913,0.81477,0.796688,0.81477,0.79044
3,0.508400,0.596238,0.806295,0.810289,0.806295,0.807955
4,0.508400,0.742743,0.807506,0.802624,0.807506,0.803027
5,0.153900,1.083811,0.788136,0.772491,0.788136,0.761868
6,0.153900,1.06262,0.808717,0.794458,0.808717,0.796634
7,0.153900,1.256487,0.799031,0.782534,0.799031,0.781613
8,0.033400,1.242907,0.79661,0.792224,0.79661,0.791768
9,0.033400,1.274952,0.791768,0.794952,0.791768,0.792634
10,0.004000,1.330872,0.797821,0.797388,0.797821,0.796106


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.499932,0.818402,0.798619,0.818402,0.794886
2,No log,0.522021,0.825666,0.809596,0.825666,0.808637
3,0.477500,0.539888,0.824455,0.815368,0.824455,0.817734
4,0.477500,0.688253,0.825666,0.813519,0.825666,0.815811
5,0.190900,0.810078,0.841404,0.827998,0.841404,0.828643
6,0.190900,0.961868,0.83293,0.818958,0.83293,0.81545
7,0.190900,1.002276,0.835351,0.82503,0.835351,0.82855
8,0.059600,1.061918,0.842615,0.830568,0.842615,0.83072
9,0.059600,1.139309,0.836562,0.825241,0.836562,0.827883
10,0.016700,1.17512,0.841404,0.831422,0.841404,0.833694
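
Because `load_best_model_at_end=True` is combined with `metric_for_best_model='accuracy'`, the checkpoint that is evaluated afterwards is the one with the highest validation accuracy, not necessarily the last epoch. The sketch below mimics that selection on the accuracy column of the second log above, whose epoch-2 row matches the ClinicalBERT evaluation reported in Part 4. `best_epoch` is an illustrative helper rather than a `Trainer` function, and it only sees the ten epochs shown in the log.

```python
def best_epoch(rows, metric_index=1):
    """Return the row with the highest metric; in this sketch the earliest row wins ties."""
    return max(rows, key=lambda row: row[metric_index])


# (epoch, validation accuracy) pairs copied from the second training log
clinical_log = [
    (1, 0.774818), (2, 0.81477), (3, 0.806295), (4, 0.807506), (5, 0.788136),
    (6, 0.808717), (7, 0.799031), (8, 0.79661), (9, 0.791768), (10, 0.797821),
]
epoch, accuracy = best_epoch(clinical_log)
print(f"Best of the logged epochs: {epoch} (accuracy {accuracy:.4f})")
```

Selecting on accuracy also explains why a reported `eval_loss` can be low or high independently of the final-epoch loss: validation loss keeps rising in all three logs while accuracy stays roughly flat.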


## Part 4: Evaluating Performance

In [33]:
bert_results = bert_trainer.evaluate()
clinical_bert_results = clinical_bert_trainer.evaluate()
bio_bert_results = bio_bert_trainer.evaluate()

In [34]:
print("Bert Evaluation Results:", json.dumps(bert_results, indent=4))

Bert Evaluation Results: {
    "eval_loss": 1.0943862199783325,
    "eval_accuracy": 0.8583535108958837,
    "eval_precision": 0.8517573861806558,
    "eval_recall": 0.8583535108958837,
    "eval_f1": 0.8542405682250839,
    "eval_runtime": 1.5484,
    "eval_samples_per_second": 533.446,
    "eval_steps_per_second": 8.396,
    "epoch": 15.0
}


In [35]:
print("ClinicalBert Evaluation Results:", json.dumps(clinical_bert_results, indent=4))

ClinicalBert Evaluation Results: {
    "eval_loss": 0.522912859916687,
    "eval_accuracy": 0.8147699757869249,
    "eval_precision": 0.7966882195733374,
    "eval_recall": 0.8147699757869249,
    "eval_f1": 0.7904398377222664,
    "eval_runtime": 1.6052,
    "eval_samples_per_second": 514.592,
    "eval_steps_per_second": 8.099,
    "epoch": 15.0
}


In [36]:
print("BiomedBERT Evaluation Results:", json.dumps(bio_bert_results, indent=4))

BiomedBERT Evaluation Results: {
    "eval_loss": 1.0619179010391235,
    "eval_accuracy": 0.8426150121065376,
    "eval_precision": 0.8305675609109936,
    "eval_recall": 0.8426150121065376,
    "eval_f1": 0.8307200106859153,
    "eval_runtime": 1.5864,
    "eval_samples_per_second": 520.688,
    "eval_steps_per_second": 8.195,
    "epoch": 15.0
}


In [None]:
def evaluate_llm(val_texts, val_labels):
    # System prompt asking Gemini for a single-digit verdict on each claim.
    # val_labels is kept in the signature for existing callers but is not used here.
    gemini_prompt = """
    You are a helpful medical assistant. Your job is to evaluate the factuality of a sentence about a health topic.
    Please respond with one of the following options:
    1. 0: The sentence is false, misleading, or inaccurate
    2. 1: The sentence is true, factual, or correct
    3. 2: You are unable to verify the factuality of the sentence.

    Do not include any other text with the response.
    """
    num_items = len(val_texts)
    requests_sent = 0
    start_time = time.time()
    preds = []
    requests_per_minute = 30

    for i in range(num_items):
        try:
            claim = val_texts[i]
            response = get_response_from_gemini(gemini_prompt, claim)
            preds.append(response)
        except Exception as e:
            # A failed request is skipped, so preds can end up shorter than val_texts
            print(f"Error for request {i+1}/{num_items}: {e}")

        requests_sent += 1

        # After each batch of requests_per_minute requests, wait out the rest of the minute
        if requests_sent % requests_per_minute == 0:
            elapsed_time = time.time() - start_time
            if elapsed_time < 60:
                sleep_duration = 60 - elapsed_time
                print(f"Sent {requests_sent}/{num_items} requests. Sleeping for {sleep_duration:.2f} seconds to maintain rate limit of {requests_per_minute} per minute.")
                time.sleep(sleep_duration)
            start_time = time.time()

    print(f"Finished sending {num_items} requests sequentially.")

    return preds
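
The throttling logic in `evaluate_llm` can also be factored into a small helper. The sketch below is one possible way to do that, assuming a fixed budget of calls per 60-second window. `RateLimiter` and its `clock` and `sleep` parameters are hypothetical names introduced here so the logic can be exercised without real waiting. Unlike the loop above, it sleeps just before the first request that would exceed the budget rather than right after the last request of a batch.

```python
import time


class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds (fixed window)."""

    def __init__(self, max_calls, period=60.0, clock=time.time, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.calls_in_window = 0

    def wait(self):
        """Call once before each request; blocks when the window's budget is spent."""
        if self.calls_in_window >= self.max_calls:
            elapsed = self.clock() - self.window_start
            if elapsed < self.period:
                self.sleep(self.period - elapsed)
            # Start a fresh window after waiting (or if the old one already expired)
            self.window_start = self.clock()
            self.calls_in_window = 0
        self.calls_in_window += 1
```

With this helper, a loop would create `limiter = RateLimiter(30)` once and call `limiter.wait()` before each `get_response_from_gemini` call.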

In [None]:
preds = evaluate_llm(val_texts, val_labels)

Sent 30/826 requests. Sleeping for 52.47 seconds to maintain rate limit of 30 per minute.
Sent 60/826 requests. Sleeping for 52.33 seconds to maintain rate limit of 30 per minute.
Sent 90/826 requests. Sleeping for 52.46 seconds to maintain rate limit of 30 per minute.
Sent 120/826 requests. Sleeping for 52.51 seconds to maintain rate limit of 30 per minute.
Sent 150/826 requests. Sleeping for 52.87 seconds to maintain rate limit of 30 per minute.
Sent 180/826 requests. Sleeping for 52.71 seconds to maintain rate limit of 30 per minute.
Sent 210/826 requests. Sleeping for 52.32 seconds to maintain rate limit of 30 per minute.
Sent 240/826 requests. Sleeping for 52.09 seconds to maintain rate limit of 30 per minute.
Sent 270/826 requests. Sleeping for 52.59 seconds to maintain rate limit of 30 per minute.
Sent 300/826 requests. Sleeping for 52.69 seconds to maintain rate limit of 30 per minute.
Sent 330/826 requests. Sleeping for 52.73 seconds to maintain rate limit of 30 per minute.
Se

In [37]:
def compute_llm_metrics(predictions, val_labels):
    predictions = [int(pred.text.rstrip('\n')) for pred in predictions]
    accuracy = accuracy_score(val_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(val_labels, predictions, average='weighted') # Use 'weighted' for multiclass
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }
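
The `int(...)` conversion in `compute_llm_metrics` raises a `ValueError` if the model returns anything other than a bare digit. A more forgiving parser is sketched below. It assumes that any reply without a leading 0, 1, or 2 should be treated as label 2 (unable to determine), which is a design choice made for this example rather than something the original evaluation did; `parse_label` is a hypothetical helper.

```python
import re


def parse_label(text, fallback=2):
    """Map a raw model reply to 0, 1, or 2; use `fallback` when no label is found."""
    if text is None:
        return fallback
    # Accept a single digit 0-2 at the start of the reply, ignoring leading whitespace
    match = re.match(r"\s*([012])\b", text)
    return int(match.group(1)) if match else fallback


print(parse_label("1\n"), parse_label("I am not sure"))
```

Counting how often the fallback is used would show whether unparseable replies are rare enough to ignore.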

In [None]:
llm_results = compute_llm_metrics(preds, val_labels)

In [None]:
print("Gemini Evaluation Results:", json.dumps(llm_results, indent=4))

Gemini Evaluation Results: {
    "accuracy": 0.6029055690072639,
    "precision": 0.7479605452653565,
    "recall": 0.6029055690072639,
    "f1": 0.6420132313305872
}


In [38]:
def load_gemini_covid_preds(path):
  # Read one integer prediction (0, 1, or 2) per line, skipping blank lines
  with open(path, 'r') as file:
    return [int(line.strip()) for line in file if line.strip()]

In [39]:
def get_covid_eval_dataset(tokenizer):
  val_encodings = tokenizer(val_texts, truncation=True, padding=True)
  val_dataset = MisinformationDataset(val_encodings, val_labels)

  return val_dataset

In [44]:
bert_preds_covid, _, _ = bert_trainer.predict(get_covid_eval_dataset(bertSeqTokenizer))
clinical_bert_preds_covid, _, _ = clinical_bert_trainer.predict(get_covid_eval_dataset(clinicalBertTokenizer))
bio_bert_preds_covid, _, _ = bio_bert_trainer.predict(get_covid_eval_dataset(bioMedBertTokenizer))

In [49]:
bert_preds_covid = [prediction.argmax(axis=-1) for prediction in bert_preds_covid]
clinical_bert_preds_covid = [prediction.argmax(axis=-1) for prediction in clinical_bert_preds_covid]
bio_bert_preds_covid = [prediction.argmax(axis=-1) for prediction in bio_bert_preds_covid]

In [48]:
def evaluate_predictions(predictions, labels, texts, is_llm=False):
  # Group claims by true label: correctly classified texts vs. misclassified ones.
  # is_llm is kept for compatibility with existing calls but is not used.
  errors = {0: [], 1: [], 2: []}
  correct = {0: [], 1: [], 2: []}
  for pred, label, claim in zip(predictions, labels, texts):
    if label == pred:
      correct[label].append(claim)
    else:
      errors[label].append({"claim": claim, "pred": pred})

  return errors, correct

In [46]:
errors_bert, correct_bert = evaluate_predictions(bert_preds_covid, val_labels, val_texts)
errors_clinical_bert, correct_clinical_bert = evaluate_predictions(clinical_bert_preds_covid, val_labels, val_texts)
errors_bio_bert, correct_bio_bert = evaluate_predictions(bio_bert_preds_covid, val_labels, val_texts)
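# gemini_covid_preds must already hold the saved Gemini predictions for the
# COVID-19 validation set, e.g. as returned by load_gemini_covid_preds
errors_gemini, correct_gemini = evaluate_predictions(gemini_covid_preds, val_labels, val_texts, True)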

In [57]:
print(len(val_texts))

826


In [59]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(val_labels, bert_preds_covid))
print(confusion_matrix(val_labels, clinical_bert_preds_covid))
print(confusion_matrix(val_labels, bio_bert_preds_covid))
print(confusion_matrix(val_labels, gemini_covid_preds))

[[561  22  14]
 [ 27 109  12]
 [ 28  14  39]]
[[581  11   5]
 [ 70  69   9]
 [ 44  14  23]]
[[572  18   7]
 [ 47  92   9]
 [ 36  13  32]]
[[373 119 105]
 [ 14 117  17]
 [ 15  58   8]]
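
Overall accuracy hides large per-class differences in these matrices. As a sketch, per-class recall can be read off each row, since scikit-learn's `confusion_matrix` puts true labels on rows and predictions on columns. The matrices below are copied from the first (BERT) and last (Gemini) outputs above; `per_class_recall` is an illustrative helper.

```python
def per_class_recall(cm):
    """Recall for each true label: diagonal entry divided by its row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


bert_cm = [[561, 22, 14], [27, 109, 12], [28, 14, 39]]
gemini_cm = [[373, 119, 105], [14, 117, 17], [15, 58, 8]]

for name, cm in [("BERT", bert_cm), ("Gemini", gemini_cm)]:
    recalls = ", ".join(f"{r:.2f}" for r in per_class_recall(cm))
    print(f"{name} recall for labels 0/1/2: {recalls}")
```

On these numbers, Gemini recovers a somewhat larger share of the true claims (label 1) than BERT, but it agrees with the dataset's label 2 far less often and mislabels many more false claims.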


In [108]:
error_text_bert = set([item["claim"] for item in errors_bert[1]])
error_text_clinical_bert = set([item["claim"] for item in errors_clinical_bert[1]])
error_text_bio_bert = set([item["claim"] for item in errors_bio_bert[1]])
common_errors = error_text_clinical_bert & error_text_bert
difference_errors = error_text_bio_bert - error_text_bert

In [90]:
print(len(common_errors))

31


In [110]:
print(list(difference_errors)[11])

Vice President Mike Pence said that “the FDA [Food and Drug Administration] is approving off-label use for the hydroxychloroquine right now."


In [100]:
print(errors_gemini[0])

[{'claim': 'False YouTube channel published a radio-show-style report with false headline saying novel coronavirus is in the Philippines', 'pred': 1}, {'claim': 'Tom Hanks has a volleyball to keep him company while he’s quarantined', 'pred': 2}, {'claim': 'Actor Vijay’s father criticized government for enforcing curfew to abate COVID-19', 'pred': 2}, {'claim': 'China constructed an hospital for the epidemic in 48 hours', 'pred': 1}, {'claim': 'Terrible conditions in Ukrainian hospitals for ordinary people on the photo', 'pred': 2}, {'claim': 'Deputy health minister and head of Iran\'s taskforce on Covid-19, Iraj Harirchi, had previously declared that "quarantines belong to the Stone Age", before admitting that he had tested positive for the disease', 'pred': 1}, {'claim': 'Video shows an infected baby with a doctor', 'pred': 2}, {'claim': 'Trupti Desai, a well known social activist from India, was arrested for illegally buying liquor during the Covid19 lockdown', 'pred': 2}, {'claim': 

In [None]:
print(difference_errors)

In [61]:
print(errors_clinical_bert[1])

[{'claim': 'Wuhan coronavirus is not yet a public health emergency of international concern, WHO says', 'pred': np.int64(0)}, {'claim': 'Nine female inmates from a minimum-security unit of a South Dakota jail escaped after a separate prisoner tested positive for coronavirus', 'pred': np.int64(0)}, {'claim': 'The CDC has issued an outbreak of lung injuries due to vaping. In this article it talks about the damage and also gives a link to "how to talk to your children about the dangers of vaping". Check out the link below', 'pred': np.int64(0)}, {'claim': 'There are now more than 500 cases of novel coronavirus in the US  �?Stop promoting fear CNN. You guys are very misguided in your information', 'pred': np.int64(0)}, {'claim': "A list documents U.S. President Donald Trump's various statements about the spread of COVID-19 coronavirus disease", 'pred': np.int64(0)}, {'claim': 'Rhode Island postpones presidential primary to June due to coronavirus pandemic', 'pred': np.int64(0)}, {'claim': 

## Part 5: Evaluating Performance on Claims about Measles

In [None]:
def get_measles_eval_dataset(tokenizer):
  val_encodings = tokenizer(val_texts_measles, truncation=True, padding=True)
  val_dataset = MisinformationDataset(val_encodings, val_labels_measles)

  return val_dataset

In [None]:
bert_results_measles = bert_trainer.evaluate(eval_dataset=get_measles_eval_dataset(bertSeqTokenizer))
clinical_bert_results_measles = clinical_bert_trainer.evaluate(eval_dataset=get_measles_eval_dataset(clinicalBertTokenizer))
bio_bert_results_measles = bio_bert_trainer.evaluate(eval_dataset=get_measles_eval_dataset(bioMedBertTokenizer))



In [None]:
print("Bert Evaluation Results:", json.dumps(bert_results_measles, indent=4))

Bert Evaluation Results: {
    "eval_loss": 3.7864296436309814,
    "eval_accuracy": 0.5263157894736842,
    "eval_precision": 0.6470588235294117,
    "eval_recall": 0.5263157894736842,
    "eval_f1": 0.44298245614035087,
    "eval_runtime": 0.0198,
    "eval_samples_per_second": 958.525,
    "eval_steps_per_second": 50.449,
    "epoch": 15.0
}


In [None]:
print("ClinicalBert Evaluation Results:", json.dumps(clinical_bert_results_measles, indent=4))

ClinicalBert Evaluation Results: {
    "eval_loss": 3.205122947692871,
    "eval_accuracy": 0.5789473684210527,
    "eval_precision": 0.6608187134502924,
    "eval_recall": 0.5789473684210527,
    "eval_f1": 0.4680451127819549,
    "eval_runtime": 0.0176,
    "eval_samples_per_second": 1078.68,
    "eval_steps_per_second": 56.773,
    "epoch": 15.0
}


In [None]:
print("BiomedBERT Evaluation Results:", json.dumps(bio_bert_results_measles, indent=4))

BiomedBERT Evaluation Results: {
    "eval_loss": 2.32006573677063,
    "eval_accuracy": 0.5789473684210527,
    "eval_precision": 0.5526315789473685,
    "eval_recall": 0.5789473684210527,
    "eval_f1": 0.5616488774383511,
    "eval_runtime": 0.0172,
    "eval_samples_per_second": 1105.111,
    "eval_steps_per_second": 58.164,
    "epoch": 15.0
}


In [None]:
preds_measles = evaluate_llm(val_texts_measles, val_labels_measles)

Finished sending 19 requests sequentially.


In [None]:
llm_results_measles = compute_llm_metrics(preds_measles, val_labels_measles)



In [None]:
print("Gemini Evaluation Results:", json.dumps(llm_results_measles, indent=4))

Gemini Evaluation Results: {
    "accuracy": 0.8421052631578947,
    "precision": 0.7529904306220095,
    "recall": 0.8421052631578947,
    "f1": 0.7949874686716791
}
