#### LLM Model Comparison

A small comparison of out-of-the-box (OOB) large language models, sourced from the Hugging Face Hub, on the simple task of free-text anatomic classification of the findings section of radiology reports.
Performance is measured by classification accuracy of the model across the following categories:
```
LUNG/PLEURA/LARGE AIRWAYS
VESSELS
HEART
MEDIASTINUM AND HILA
CHEST WALL AND LOWER NECK
LIVER
BILE DUCTS
GALLBLADDER
PANCREAS
SPLEEN
ADRENAL GLANDS
KIDNEYS AND URETERS
BLADDER
REPRODUCTIVE ORGANS
BOWEL
VESSELS
PERITONEUM/RETROPERITONEUM/LYMPH NODES
BONE AND SOFT TISSUE
MISCELLANEOUS
```

The `MISCELLANEOUS` category usually covers verbiage about comparison with previous studies, the type of CXR study performed, comments about the patient's clinical history, etc.
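
For use in code, the label set above can be held in a single constant and used to check annotations before any evaluation. The following is a minimal sketch: `CATEGORIES` and `find_invalid_labels` are illustrative names that do not appear in the original notebook, and `VESSELS` is kept once even though it appears twice in the list above.

```python
from typing import Dict, List

# Label set from the list above; VESSELS appears twice there and is kept once here.
CATEGORIES: List[str] = [
    "LUNG/PLEURA/LARGE AIRWAYS", "VESSELS", "HEART", "MEDIASTINUM AND HILA",
    "CHEST WALL AND LOWER NECK", "LIVER", "BILE DUCTS", "GALLBLADDER", "PANCREAS",
    "SPLEEN", "ADRENAL GLANDS", "KIDNEYS AND URETERS", "BLADDER",
    "REPRODUCTIVE ORGANS", "BOWEL", "PERITONEUM/RETROPERITONEUM/LYMPH NODES",
    "BONE AND SOFT TISSUE", "MISCELLANEOUS",
]


def find_invalid_labels(
    records: List[Dict[str, str]], key: str = "anatomic_classification"
) -> List[str]:
    """Return normalized labels that are not in CATEGORIES, in first-seen order."""
    valid = set(CATEGORIES)
    invalid: List[str] = []
    for record in records:
        label = str(record[key]).strip().upper()
        if label not in valid and label not in invalid:
            invalid.append(label)
    return invalid


# Hypothetical records, for illustration only
example_records = [
    {"anatomic_classification": "HEART"},
    {"anatomic_classification": "lung/pleura/large airways "},
    {"anatomic_classification": "KIDNEY"},  # not a valid category name
]
print(find_invalid_labels(example_records))
```

An empty list means every annotation uses one of the expected category names.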

Classification is carried out on a subset of CXR reports sourced from the MIMIC-CXR database.
MIMIC-CXR comprises ~220k CXR studies from patients admitted to BIDMC from 2011 to 2016.
This analysis selects reports from 50 patients (slightly more than 100 reports in total).
Sentence-level classification was done by hand.

#### Setup and Imports

In [12]:
import os
import pandas as pd
from llama_cpp import Llama

# Create list of dictionaries for input/output examples
# with open("../../data/eval_dataset_annotated_tabdl.txt") as f:
#     lines = f.readlines()

# lines = [line.strip().replace("\n", "") for line in lines]
# column_names = lines[0].split("\t")
# datadict = []
# for line in lines[1:]:
#     line = line.split(",")
#     datadict.append({column_names[i]: line[i] for i in range(len(column_names))})

# datadict[:3]

datadict = pd.read_csv("../../data/eval_dataset_annotated_tabdl.txt", delimiter="\t").to_dict("records")
datadict[:5]

[{'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10394761/s53097934.txt',
  'patient_id': 'p10394761',
  'finding': 'PA and lateral chest views were obtained with patient in upright  position',
  'anatomic_classification': 'MISCELLANEOUS',
  'possible_secondary': nan},
 {'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10394761/s53097934.txt',
  'patient_id': 'p10394761',
  'finding': 'Analysis is performed in direct comparison with the next preceding  similar study of ___',
  'anatomic_classification': 'MISCELLANEOUS',
  'possible_secondary': nan},
 {'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10394761/s53097934.txt',
  'patient_id': 'p10394761',
  'finding': 'There is mild cardiac enlargement',
  'anatomic_classification': 'HEART',
  'possible_secondary': nan},
 {'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10394761/s53097934.txt',
  'patient_id': 'p10394761',
  'finding': 'There is  a relative

In [2]:
final_dataset = [x for x in datadict if x["anatomic_classification"] != "MISCELLANEOUS"]
print(f"There are {len(final_dataset)} classification examples in the final dataset.")

There are 412 classification examples in the final dataset.
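
Because accuracy is reported over all remaining examples, it can help to see how they are spread across categories before reading the scores. A minimal sketch follows; `label_distribution` is an illustrative helper and `sample_dataset` is a small stand-in built from findings quoted in this notebook (in practice `final_dataset` would be passed instead).

```python
from collections import Counter
from typing import Dict, List, Tuple


def label_distribution(dataset: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count examples per anatomic category, most frequent first."""
    counts = Counter(
        str(x["anatomic_classification"]).strip().upper() for x in dataset
    )
    return counts.most_common()


# Stand-in for final_dataset, for illustration only
sample_dataset = [
    {"finding": "There is mild cardiac enlargement", "anatomic_classification": "HEART"},
    {"finding": "Bibasilar atelectasis.", "anatomic_classification": "LUNG/PLEURA/LARGE AIRWAYS"},
    {"finding": "Moderate cardiac  enlargement", "anatomic_classification": "HEART"},
]
for label, count in label_distribution(sample_dataset):
    print(f"{label}: {count}")
```

A heavily skewed distribution would mean that overall accuracy mostly reflects performance on the dominant categories.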


#### Testing Llama CPP

Without GPU access, the `llama-cpp` library (used here through its Python bindings, imported as `llama_cpp`) is currently the most practical option for running LLMs on CPU without serious custom engineering.

In [3]:
def alpaca_prompt_constructor(input_string):
    return (
        "### Instruction: Select from one of the following categories for the most relevant anatomy involved for the prompt sentence. The categories are (separated by comma): "
        "LUNG/PLEURA/LARGE AIRWAYS, VESSELS, HEART, MEDIASTINUM AND HILA, CHEST WALL AND LOWER NECK, LIVER, BILE DUCTS, GALLBLADDER, PANCREAS, "
        "SPLEEN, ADRENAL GLANDS, KIDNEYS AND URETERS, BLADDER, REPRODUCTIVE ORGANS, BOWEL, VESSELS, PERITONEUM/RETROPERITONEUM/LYMPH NODES, BONE AND SOFT TISSUE. "
        "if none of the above categories are relevant, output MISCELLANEOUS. "
        "output the above choice and nothing more. \n\n"
        f"### Input: {input_string} \n\n"
        "### Response: "
    )

print(alpaca_prompt_constructor(final_dataset[10]["finding"]))  # Test a sample

### Instruction: Select from one of the following categories for the most relevant anatomy involved for the prompt sentence. The categories are (separated by comma): LUNG/PLEURA/LARGE AIRWAYS, VESSELS, HEART, MEDIASTINUM AND HILA, CHEST WALL AND LOWER NECK, LIVER, BILE DUCTS, GALLBLADDER, PANCREAS, SPLEEN, ADRENAL GLANDS, KIDNEYS AND URETERS, BLADDER, REPRODUCTIVE ORGANS, BOWEL, VESSELS, PERITONEUM/RETROPERITONEUM/LYMPH NODES, BONE AND SOFT TISSUE. if none of the above categories are relevant, output MISCELLANEOUS. output the above choice and nothing more. 

### Input: No pneumothorax is seen in the  apical area. 

### Response: 


In [8]:
llm = Llama(model_path="../../models/mistral-7b-instruct.gguf")

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ../../models/mistral-7b-instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 

In [33]:
llm(
    alpaca_prompt_constructor(final_dataset[32]["finding"]),
    max_tokens=100,
    stop=["\n"],
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =   18523.17 ms
llama_print_timings:      sample time =       1.84 ms /     3 runs   (    0.61 ms per token,  1629.55 tokens per second)
llama_print_timings: prompt eval time =    1616.45 ms /    16 tokens (  101.03 ms per token,     9.90 tokens per second)
llama_print_timings:        eval time =     304.75 ms /     2 runs   (  152.37 ms per token,     6.56 tokens per second)
llama_print_timings:       total time =    1933.86 ms


{'id': 'cmpl-adcf12cb-4475-4793-8e9d-95723d1d73e4',
 'object': 'text_completion',
 'created': 1704661664,
 'model': '../../models/mistral-7b-instruct.gguf',
 'choices': [{'text': ' HEART',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 225, 'completion_tokens': 2, 'total_tokens': 227}}

In [34]:
final_dataset[32]

{'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10119916/s58937727.txt',
 'patient_id': 'p10119916',
 'finding': 'Moderate cardiac  enlargement',
 'anatomic_classification': 'HEART',
 'possible_secondary': nan}
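
The call above relies on the prompt wording and `stop=["\n"]` to keep the completion down to a bare category name. One optional way to enforce this, not used in the original notebook, is grammar-constrained sampling: `llama-cpp-python` provides `LlamaGrammar.from_string`, and the completion call accepts a `grammar` argument. The sketch below only builds the GBNF grammar text. `build_category_grammar` is a hypothetical helper, and the commented usage lines assume an installed `llama_cpp` and an already loaded `llm`.

```python
from typing import List


def build_category_grammar(categories: List[str]) -> str:
    """Build a GBNF grammar that accepts one category name, optionally preceded by a space."""
    literals = []
    for name in categories:
        # Escape backslashes and double quotes so each name is a valid GBNF string literal
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        literals.append(f'"{escaped}"')
    return 'root ::= " "? (' + " | ".join(literals) + ")"


grammar_text = build_category_grammar(["HEART", "LIVER", "MISCELLANEOUS"])
print(grammar_text)

# Assumed usage with llama-cpp-python (not run here):
# from llama_cpp import LlamaGrammar
# grammar = LlamaGrammar.from_string(grammar_text)
# llm(alpaca_prompt_constructor(finding), max_tokens=100, grammar=grammar)
```

Constraining the output changes what is being measured: the model can no longer answer with free text, so scores obtained this way would not be directly comparable with the unconstrained runs in this notebook.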

#### Evaluating Models in a Loop

The following models are evaluated, ordered roughly by size and expected performance, from smallest to largest:
- `TinyLLaMA-1.1b-chat` - compact 1.1B-parameter model built on the LLaMA 2 architecture, chat fine-tuned
- `Phi2` - 2.7B-parameter language model released by Microsoft
- `StableLM-Zephyr` - 3B-parameter chat model from Stability AI
- `llama2-7b-large` - large version of Meta's LLaMA 2 7B model
- `mistral-7b-instruct` - Mistral AI's 7B model, instruction fine-tuned, as opposed to the LLaMA models here, which are not
- `orca2-7b` - Microsoft's fine-tune of LLaMA 2
- `medalpaca-medium` - LLaMA fine-tuned in the style of Alpaca (itself an instruction-fine-tuned LLaMA) on a variety of medical resources, such as Anki flashcards and Stack Exchange answers
- `llama2-13b-large` - large version of Meta's LLaMA 2 13B model
- `mixtral-8x7b` - sparse mixture-of-experts model from Mistral AI, reported as competitive with GPT-3.5 on several benchmarks (this may not replicate here, since the model has been quantized)


In [8]:
model_files = [
    'tinyllama-1.1b-chat.gguf',
    'phi-2.gguf',
    'stablelm-zephyr-3b.gguf',
    'llama2-7b-large.gguf',
    'mistral-7b-instruct.gguf',
    'orca2-7b.gguf',
    'medalpaca-medium.gguf',
    'llama2-13b-large.gguf',
    'mixtral-8x7b.gguf'
]
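
Each of these files takes a while to load, so a missing or misnamed file is cheaper to catch before the evaluation loop starts. The check below is a small sketch: `missing_model_files` is a hypothetical helper, and a temporary directory stands in for the `../../models` directory used in this notebook.

```python
import os
import tempfile
from typing import List


def missing_model_files(model_dir: str, filenames: List[str]) -> List[str]:
    """Return the filenames that are not present as regular files in model_dir."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Demonstration with a temporary directory standing in for ../../models
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "phi-2.gguf"), "w").close()
    demo_missing = missing_model_files(tmp_dir, ["phi-2.gguf", "orca2-7b.gguf"])
    print(demo_missing)
```

In the notebook this would be called as `missing_model_files(os.path.join("..", "..", "models"), model_files)` before any model is loaded.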

In [14]:
from typing import Dict, List, Tuple
from tqdm.auto import tqdm

def produce_one_evaluation(model: Llama, input_string: str) -> str:
    """Return the raw completion text for a single finding."""
    return model(
        alpaca_prompt_constructor(input_string),
        max_tokens=100,
        stop=["\n"]
    )["choices"][0]["text"]

def evaluate_model(model: Llama, dataset: List[Dict[str, str]]) -> Tuple[float, List[str]]:
    """
    Computes exact-match accuracy for sentence-level anatomic classification
    of radiology report findings. Returns (accuracy, raw model outputs).
    """
    # Run the model sequentially over every finding, showing a progress bar
    results = [str(produce_one_evaluation(model, x["finding"])) for x in tqdm(dataset)]

    # A prediction is correct only if it equals the label exactly,
    # ignoring case and surrounding whitespace
    correct_counter = 0
    for output, example in zip(results, dataset):
        if output.strip().upper() == str(example["anatomic_classification"]).strip().upper():
            correct_counter += 1

    return correct_counter / len(dataset), results
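
A single accuracy figure hides which categories a model gets right. As a possible complement, the sketch below breaks the same exact-match comparison down by ground-truth category. `per_category_accuracy` is an illustrative helper that is not part of the original notebook, and the demo outputs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def per_category_accuracy(
    dataset: List[Dict[str, str]], results: List[str]
) -> Dict[str, float]:
    """Exact-match accuracy per ground-truth category (case and whitespace ignored)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for example, output in zip(dataset, results):
        label = str(example["anatomic_classification"]).strip().upper()
        total[label] += 1
        if str(output).strip().upper() == label:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical model outputs, for illustration only
demo_dataset = [
    {"finding": "Cardiomegaly", "anatomic_classification": "HEART"},
    {"finding": "Moderate cardiac  enlargement", "anatomic_classification": "HEART"},
    {"finding": "Mild hepatic steatosis", "anatomic_classification": "LIVER"},
]
demo_results = [" HEART", "", " LIVER"]
print(per_category_accuracy(demo_dataset, demo_results))
```

The second return value of `evaluate_model` can be passed as `results` together with the dataset that produced it.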

In [50]:
# Test evaluation for tinyllama-1.1b-chat.gguf

import os
llm = Llama(model_path=os.path.join("..", "..", "models", model_files[0]), verbose=False)
evaluate_model(llm, final_dataset[:5])

llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from ../../models/tinyllama-1.1b-chat.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = py007_tinyllama-1.1b-chat-v0.3
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.

  0%|          | 0/5 [00:00<?, ?it/s]

['', ' The cardiac output can be calculated as follows,', '', '', '']


0.0

This result is poor, as expected from a small model: four of the five outputs are empty and the fifth is free text rather than a category.
Next, we evaluate a moderately performant model and work up or down from there, so that we at least have a result for discussion.

In [57]:
# Test evaluation for mistral-7b-instruct.gguf

llm = Llama(model_path=os.path.join("..", "..", "models", model_files[4]), verbose=False)
mistral_accuracy, mistral_results = evaluate_model(llm, final_dataset)
mistral_accuracy

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ../../models/mistral-7b-instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 

  0%|          | 0/412 [00:00<?, ?it/s]

0.10679611650485436

In [58]:
model_results = []
model_accuracies = []
for model_file in model_files[5:]:
    llm = Llama(model_path=os.path.join("..", "..", "models", model_file), verbose=False)
    accuracy, results = evaluate_model(llm, final_dataset)
    print(f"{model_file} accuracy: {accuracy}")
    model_results.append(results)
    model_accuracies.append(accuracy)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ../../models/orca2-7b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32        

  0%|          | 0/412 [00:00<?, ?it/s]

orca2-7b.gguf accuracy: 0.6747572815533981


llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from ../../models/medalpaca-medium.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = medalpaca_medalpaca-13b
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_cou

  0%|          | 0/412 [00:00<?, ?it/s]

medalpaca-medium.gguf accuracy: 0.0


llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../models/llama2-13b-large.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32         

  0%|          | 0/412 [00:00<?, ?it/s]

llama2-13b-large.gguf accuracy: 0.3713592233009709


llama_model_loader: loaded meta data with 25 key-value pairs and 995 tensors from ../../models/mixtral-8x7b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attentio

  0%|          | 0/412 [00:00<?, ?it/s]

mixtral-8x7b.gguf accuracy: 0.18689320388349515


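Model results on the 412-example MIMIC-CXR subset were the following (TinyLLaMA-1.1b was only run on the first 5 examples; Phi2, StableLM-Zephyr, and LLaMA2-7b do not appear in the runs above):

|Model|Accuracy|
|-|-|
|Orca2|0.67|
|LLaMA2-13b|0.37|
|Mixtral-8x7b|0.19|
|Mistral-7b|0.11|
|MedAlpaca|0|
|TinyLLaMA-1.1b|0|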


Generally, chat-fine-tuned models such as Orca2 performed better; however, overall model performance was poor on the MIMIC-CXR sentence-level classification task.
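
Part of the low scores may reflect the scoring rule rather than the models alone: `evaluate_model` counts only exact matches, so an output that names the right category with extra text around it is scored as wrong. A more lenient scorer is one possible variation. The sketch below is not part of the original evaluation, `CATEGORY_NAMES` and `extract_category` are hypothetical names, and results obtained this way would not be comparable with the accuracies in the table above.

```python
import re
from typing import List, Optional

CATEGORY_NAMES: List[str] = [
    "LUNG/PLEURA/LARGE AIRWAYS", "VESSELS", "HEART", "MEDIASTINUM AND HILA",
    "CHEST WALL AND LOWER NECK", "LIVER", "BILE DUCTS", "GALLBLADDER", "PANCREAS",
    "SPLEEN", "ADRENAL GLANDS", "KIDNEYS AND URETERS", "BLADDER",
    "REPRODUCTIVE ORGANS", "BOWEL", "PERITONEUM/RETROPERITONEUM/LYMPH NODES",
    "BONE AND SOFT TISSUE", "MISCELLANEOUS",
]


def extract_category(
    output: str, categories: List[str] = CATEGORY_NAMES
) -> Optional[str]:
    """Return the single category named in the output, or None if zero or several are named."""
    text = output.upper()
    found = []
    for name in categories:
        # Reject matches glued to other letters, e.g. BLADDER inside GALLBLADDER
        pattern = r"(?<![A-Z])" + re.escape(name) + r"(?![A-Z])"
        if re.search(pattern, text):
            found.append(name)
    return found[0] if len(found) == 1 else None


print(extract_category("#### LUNG/PLEURA/LARGE AIRWAYS"))
print(extract_category("GALLBLADDER"))
print(extract_category("CHEST WALL AND LOWER NECK ... REPRODUCTIVE ORGANS"))
```

Outputs that name several categories are treated as unanswered, which keeps a model from being rewarded for listing every option.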

#### Testing Custom Task

Next, we test the models on a custom classification task built from the following paragraph:

>There are two solid pulmonary nodules in the left lower lobe measuring less than 5mm. Bibasilar atelectasis. Cardiomegaly. Moderate coronary artery calcifications. Bilateral perinephric fat stranding. Bladder has thickened wall with surrounding fat stranding concerning for cystitis. Mild hepatic steatosis.

In [4]:
custom_task_dataset = [
    {"finding": "There are two solid pulmonary nodules in the left lower lobe measuring less than 5mm.", "anatomic_classification": "LUNG/PLEURA/LARGE AIRWAYS"},
    {"finding": "Bibasilar atelectasis.", "anatomic_classification": "LUNG/PLEURA/LARGE AIRWAYS"},
    {"finding": "Cardiomegaly", "anatomic_classification": "HEART"},
    {"finding": "Moderate coronary artery calcifications", "anatomic_classification": "VESSELS"},
    {"finding": "Bilateral perinephric stranding", "anatomic_classification": "KIDNEYS AND URETERS"},
    {"finding": "Bladder has thickened wall with surrounding fat stranding concerning for cystitis", "anatomic_classification": "BLADDER"},
    {"finding": "Mild hepatic steatosis", "anatomic_classification": "LIVER"},
]

In [6]:
print(alpaca_prompt_constructor(custom_task_dataset[0]["finding"]))

### Instruction: Select from one of the following categories for the most relevant anatomy involved for the prompt sentence. The categories are (separated by comma): LUNG/PLEURA/LARGE AIRWAYS, VESSELS, HEART, MEDIASTINUM AND HILA, CHEST WALL AND LOWER NECK, LIVER, BILE DUCTS, GALLBLADDER, PANCREAS, SPLEEN, ADRENAL GLANDS, KIDNEYS AND URETERS, BLADDER, REPRODUCTIVE ORGANS, BOWEL, VESSELS, PERITONEUM/RETROPERITONEUM/LYMPH NODES, BONE AND SOFT TISSUE. if none of the above categories are relevant, output MISCELLANEOUS. output the above choice and nothing more. 

### Input: There are two solid pulmonary nodules in the left lower lobe measuring less than 5mm. 

### Response: 


In [15]:
custom_task_results = []
custom_task_accuracies = []
for model_file in model_files:
    llm = Llama(model_path=os.path.join("..", "..", "models", model_file), verbose=False)
    accuracy, results = evaluate_model(llm, custom_task_dataset)
    print(f"{model_file} accuracy: {accuracy}")
    custom_task_results.append(results)
    custom_task_accuracies.append(accuracy)

llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from ../../models/tinyllama-1.1b-chat.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = py007_tinyllama-1.1b-chat-v0.3
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.

  0%|          | 0/7 [00:00<?, ?it/s]

tinyllama-1.1b-chat.gguf accuracy: 0.0


llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from ../../models/phi-2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 3

  0%|          | 0/7 [00:00<?, ?it/s]

phi-2.gguf accuracy: 0.0


llama_model_loader: loaded meta data with 21 key-value pairs and 356 tensors from ../../models/stablelm-zephyr-3b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = stablelm
llama_model_loader: - kv   1:                               general.name str              = source
llama_model_loader: - kv   2:                    stablelm.context_length u32              = 4096
llama_model_loader: - kv   3:                  stablelm.embedding_length u32              = 2560
llama_model_loader: - kv   4:                       stablelm.block_count u32              = 32
llama_model_loader: - kv   5:               stablelm.feed_forward_length u32              = 6912
llama_model_loader: - kv   6:              stablelm.rope.dimension_count u32              = 20
llama_model_loader: - kv   7:              stablelm.attention.head_count u3

  0%|          | 0/7 [00:00<?, ?it/s]

stablelm-zephyr-3b.gguf accuracy: 0.0


llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../../models/llama2-7b-large.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32          

  0%|          | 0/7 [00:00<?, ?it/s]

llama2-7b-large.gguf accuracy: 0.0


llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ../../models/mistral-7b-instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 

  0%|          | 0/7 [00:00<?, ?it/s]

mistral-7b-instruct.gguf accuracy: 0.0


llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ../../models/orca2-7b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32        

  0%|          | 0/7 [00:00<?, ?it/s]

orca2-7b.gguf accuracy: 0.5714285714285714


llama_model_loader: loaded meta data with 20 key-value pairs and 363 tensors from ../../models/medalpaca-medium.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = medalpaca_medalpaca-13b
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_cou

  0%|          | 0/7 [00:00<?, ?it/s]

medalpaca-medium.gguf accuracy: 0.0


llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../../models/llama2-13b-large.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32         

  0%|          | 0/7 [00:00<?, ?it/s]

llama2-13b-large.gguf accuracy: 0.2857142857142857


llama_model_loader: loaded meta data with 25 key-value pairs and 995 tensors from ../../models/mixtral-8x7b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attentio

  0%|          | 0/7 [00:00<?, ?it/s]

mixtral-8x7b.gguf accuracy: 0.0


In [17]:
for model, results in zip(model_files, custom_task_results):
    print(f"{model} results: {results}")

tinyllama-1.1b-chat.gguf results: ['', '', '', '', '', '', '']
phi-2.gguf results: ['', '', '', '', '', '', '']
stablelm-zephyr-3b.gguf results: ['#### LUNG/PLEURA/LARGE AIRWAYS', '#### LUNG/PLEURA/LARGE AIRWAYS', '', '', '### The most relevant anatomy involved for the prompt sentence "Bilateral perinephric stranding" is the KIDNEYS AND RENAL ARTERY category. ', '### Category Selection: CHEST WALL AND LOWER NECK (None applicable in this case) ### Instructions: Since the prompt sentence involves the bladder, the most relevant anatomy involved is the category "REPRODUCTIVE ORGANS". The given sentence concerns a condition or issue related to the reproductive organs. ### Output: REPRODUCTIVE ORGANS', '']
llama2-7b-large.gguf results: ['1. LUNG/PLEURA/LARGE AIRWAYS, HEART, MEDIASTINUM AND HILA, CHEST WALL AND LOWER NECK, BONE AND SOFT TISSUE. 2. VESSELS', '', '1) LUNG/PLEURA/LARGE AIRWAYS (26) VESSELS (48) HEART (60) MEDIASTINUM AND HILA (74) CHEST WALL AND LOWER NECK (94) LIVER (135) BILE 
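
With only seven examples, each accuracy on the custom task is a coarse estimate: a single sentence shifts the score by about 0.14. To give a sense of the uncertainty, the sketch below computes a Wilson score interval for a proportion. This is a standard statistical tool that the original notebook does not use, and `wilson_interval` is an illustrative helper name.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# orca2-7b scored 0.5714..., which corresponds to 4 of 7 correct
low, high = wilson_interval(4, 7)
print(f"{low:.2f} to {high:.2f}")
```

For 4 of 7 correct the interval spans roughly 0.25 to 0.84, so the custom task can separate models that answer in the expected format from those that do not, but little more than that.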

Orca2-7b is still the most performant model here, while the smaller models fail completely (as does Mixtral-8x7b on this task).
It might be worth looking into BioGPT, also produced by Microsoft and trained on biomedical material; however, those weights would have to be manually converted to GGUF format in order to run in CPU-only environments.
The approach more likely to succeed is to train a sentence-level classifier using domain-specific models such as `BioLinkBERT` or `RadBERT`; however, this will require creating a large amount of ground-truth data (which we could do with ChatGPT).
See [this work in Radiology: Artificial Intelligence](https://pubs.rsna.org/doi/abs/10.1148/ryai.220097), where the authors trained a BERT model and achieved >0.8 AUROC for their categories.