#### LLM Model Comparison

A small comparison of out-of-the-box (OOB) large language models sourced from the Hugging Face Hub, evaluated on a simple task: classifying free-text sentences from the findings section of radiology reports by anatomic region.
Performance is measured as classification accuracy across the following categories:
```
LUNG/PLEURA/LARGE AIRWAYS
VESSELS
HEART
MEDIASTINUM AND HILA
CHEST WALL AND LOWER NECK
LIVER
BILE DUCTS
GALLBLADDER
PANCREAS
SPLEEN
ADRENAL GLANDS
KIDNEYS AND URETERS
BLADDER
REPRODUCTIVE ORGANS
BOWEL
VESSELS
PERITONEUM/RETROPERITONEUM/LYMPH NODES
BONE AND SOFT TISSUE
MISCELLANEOUS
```

The `MISCELLANEOUS` category usually covers verbiage such as comparisons with previous studies, the type of chest X-ray (CXR) study performed, and comments about the patient's clinical history.
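As a rough illustration of the accuracy metric described above, the sketch below normalizes a raw model answer and compares it with the hand-assigned label. The names `VALID_LABELS`, `normalize_prediction`, and `accuracy` are hypothetical helpers introduced here, not part of the original notebook, and exact string matching is an assumption about how a "correct" answer is scored.

```python
# Hypothetical helpers: the label set is copied from the category list above
# (VESSELS is listed twice there, so it appears once in this set).
VALID_LABELS = {
    "LUNG/PLEURA/LARGE AIRWAYS", "VESSELS", "HEART", "MEDIASTINUM AND HILA",
    "CHEST WALL AND LOWER NECK", "LIVER", "BILE DUCTS", "GALLBLADDER",
    "PANCREAS", "SPLEEN", "ADRENAL GLANDS", "KIDNEYS AND URETERS", "BLADDER",
    "REPRODUCTIVE ORGANS", "BOWEL", "PERITONEUM/RETROPERITONEUM/LYMPH NODES",
    "BONE AND SOFT TISSUE", "MISCELLANEOUS",
}


def normalize_prediction(raw_text):
    """Map raw model output to a known label, or None if nothing matches."""
    cleaned = raw_text.strip().upper()
    return cleaned if cleaned in VALID_LABELS else None


def accuracy(predictions, gold_labels):
    """Fraction of predictions that exactly match the hand-assigned label."""
    if not gold_labels:
        return 0.0
    correct = sum(
        normalize_prediction(pred) == gold
        for pred, gold in zip(predictions, gold_labels)
    )
    return correct / len(gold_labels)
```

Under this scoring, an empty or free-form answer counts as wrong, which is a strict choice; a more lenient variant could search the output for a label substring instead.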

Classification is carried out on a subset of CXR reports sourced from the MIMIC-CXR database.
MIMIC-CXR comprises ~220k CXR studies from patients admitted to Beth Israel Deaconess Medical Center (BIDMC) from 2011 to 2016.
This analysis uses the reports of 50 patients (slightly more than 100 reports in total).
Each sentence was classified by hand.

#### Setup and Imports

In [6]:
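# Load the annotated evaluation set as a list of dicts, one per sentence.
# NOTE: splitting on "," is naive; a finding that itself contains a comma is
# split across columns (the stray leading quote in a sample printed later
# in this notebook is a symptom of this).

with open("../../data/eval_dataset_annotated.csv") as f:
    lines = [line.strip() for line in f]  # strip() already removes the newline

column_names = lines[0].split(",")
datadict = []
for line in lines[1:]:
    fields = line.split(",")
    datadict.append({name: fields[i] for i, name in enumerate(column_names)})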

datadict[:3]

[{'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10394761/s53097934.txt',
  'patient_id': 'p10394761',
  'finding': 'PA and lateral chest views were obtained with patient in upright  position',
  'anatomic_classification': 'MISCELLANEOUS',
  'possible_secondary': ''},
 {'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10394761/s53097934.txt',
  'patient_id': 'p10394761',
  'finding': 'Analysis is performed in direct comparison with the next preceding  similar study of ___',
  'anatomic_classification': 'MISCELLANEOUS',
  'possible_secondary': ''},
 {'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10394761/s53097934.txt',
  'patient_id': 'p10394761',
  'finding': 'There is mild cardiac enlargement',
  'anatomic_classification': 'HEART',
  'possible_secondary': ''}]
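The loader above splits each line on commas by hand, which mis-parses any finding that contains a comma. A minimal sketch of a more robust alternative uses the standard-library `csv` module; the two-row sample below is made up for illustration (same column layout, hypothetical values) so the example runs without the real file, and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as the annotated file;
# the first finding contains a comma and is therefore quoted.
sample_csv = io.StringIO(
    "filename,patient_id,finding,anatomic_classification,possible_secondary\n"
    's1.txt,p1,"No effusion, and no pneumothorax",LUNG/PLEURA/LARGE AIRWAYS,\n'
    "s1.txt,p1,There is mild cardiac enlargement,HEART,\n"
)


def load_rows(file_obj):
    """Parse CSV rows into dicts keyed by the header, respecting quoted fields."""
    return list(csv.DictReader(file_obj))


rows = load_rows(sample_csv)
print(rows[0]["finding"])                  # the comma stays inside the field
print(rows[0]["anatomic_classification"])
```

For the real file, the same function could be called as `load_rows(open(path, newline=""))`. Note that switching loaders may change the example count and indices reported in this notebook, since rows that were previously mis-split would then be parsed correctly.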

In [7]:
final_dataset = [x for x in datadict if x["anatomic_classification"] != "MISCELLANEOUS"]
print(f"There are {len(final_dataset)} classification examples in the final dataset.")

There are 440 classification examples in the final dataset.


#### Testing Llama CPP

Without GPU access, `llama-cpp-python` (the Python bindings for `llama.cpp`, imported as `llama_cpp`) is a practical way to run quantized LLMs on CPU without substantial custom engineering.

In [8]:
def alpaca_prompt_constructor(input_string):
    return (
        "### Instruction: Select from one of the following categories for the most relevant anatomy involved for the prompt sentence. The categories are (separated by comma): "
        "LUNG/PLEURA/LARGE AIRWAYS, VESSELS, HEART, MEDIASTINUM AND HILA, CHEST WALL AND LOWER NECK, LIVER, BILE DUCTS, GALLBLADDER, PANCREAS, "
        "SPLEEN, ADRENAL GLANDS, KIDNEYS AND URETERS, BLADDER, REPRODUCTIVE ORGANS, BOWEL, VESSELS, PERITONEUM/RETROPERITONEUM/LYMPH NODES, BONE AND SOFT TISSUE. "
        "if none of the above categories are relevant, output MISCELLANEOUS. "
        "output the above choice and nothing more. \n\n"
        f"### Input: {input_string} \n\n"
        "### Response: "
    )

print(alpaca_prompt_constructor(final_dataset[10]["finding"]))  # Test a sample

### Instruction: Select from one of the following categories for the most relevant anatomy involved for the prompt sentence. The categories are (separated by comma): LUNG/PLEURA/LARGE AIRWAYS, VESSELS, HEART, MEDIASTINUM AND HILA, CHEST WALL AND LOWER NECK, LIVER, BILE DUCTS, GALLBLADDER, PANCREAS, SPLEEN, ADRENAL GLANDS, KIDNEYS AND URETERS, BLADDER, REPRODUCTIVE ORGANS, BOWEL, VESSELS, PERITONEUM/RETROPERITONEUM/LYMPH NODES, BONE AND SOFT TISSUE. if none of the above categories are relevant, output MISCELLANEOUS. output the above choice and nothing more. 

### Input: "As there was no evidence of pleural effusion or other signs of  parenchymal infiltrates 

### Response: 
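The printed prompt lists `VESSELS` twice and carries a stray leading quote and doubled spaces from the input sentence. The sketch below shows one hypothetical way to build a similar Alpaca-style prompt from a category list while cleaning those up. `build_prompt` and its reworded instruction are assumptions for illustration and have not been tested against the model, so any effect on accuracy is unknown.

```python
def build_prompt(finding, categories):
    """Alpaca-style prompt with a de-duplicated category list (a sketch)."""
    unique = list(dict.fromkeys(categories))  # drops repeats, keeps order
    # Remove stray double quotes and collapse runs of whitespace.
    cleaned = " ".join(finding.replace('"', "").split())
    return (
        "### Instruction: Select the single most relevant anatomic category "
        "for the input sentence. The categories are (separated by comma): "
        f"{', '.join(unique)}. "
        "If none of the categories are relevant, output MISCELLANEOUS. "
        "Output the chosen category and nothing more.\n\n"
        f"### Input: {cleaned}\n\n"
        "### Response:"
    )


demo_prompt = build_prompt(
    '"No pleural  effusion', ["HEART", "VESSELS", "BOWEL", "VESSELS"]
)
print(demo_prompt)
```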


In [14]:
from llama_cpp import Llama
llm = Llama(model_path="../models/llama-2-13B-chat.Q4_K_M.gguf")

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from ../models/llama-2-13B-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32     

In [15]:
llm(
    alpaca_prompt_constructor(final_dataset[10]["finding"]),
    max_tokens=32,
    stop=["\n"]
)


llama_print_timings:        load time =  184769.06 ms
llama_print_timings:      sample time =       1.31 ms /     1 runs   (    1.31 ms per token,   764.53 tokens per second)
llama_print_timings: prompt eval time =  184767.79 ms /   248 tokens (  745.03 ms per token,     1.34 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  184867.09 ms


{'id': 'cmpl-b8c16104-4552-4f12-81a2-e62451489304',
 'object': 'text_completion',
 'created': 1704655908,
 'model': '../models/llama-2-13B-chat.Q4_K_M.gguf',
 'choices': [{'text': '',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 248, 'completion_tokens': 1, 'total_tokens': 249}}

In [16]:
llm(
    alpaca_prompt_constructor(final_dataset[15]["finding"]),
    max_tokens=32,
    stop=["\n"]
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =  184769.06 ms
llama_print_timings:      sample time =       8.93 ms /     9 runs   (    0.99 ms per token,  1008.18 tokens per second)
llama_print_timings: prompt eval time =   44444.54 ms /    41 tokens ( 1084.01 ms per token,     0.92 tokens per second)
llama_print_timings:        eval time =   88569.91 ms /     8 runs   (11071.24 ms per token,     0.09 tokens per second)
llama_print_timings:       total time =  133596.54 ms


{'id': 'cmpl-c3b5ee45-f36d-4ee9-b283-e700467cebf7',
 'object': 'text_completion',
 'created': 1704656134,
 'model': '../models/llama-2-13B-chat.Q4_K_M.gguf',
 'choices': [{'text': ' MISCELLANEOUS',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 256, 'completion_tokens': 9, 'total_tokens': 265}}

In [17]:
final_dataset[15]

{'filename': '/home/khans24/charit/anatomy_ner/mimic_cxr_reports/p10/p10773382/s55866250.txt',
 'patient_id': 'p10773382',
 'finding': 'The  patient is status post sternotomy and multiple surgical clips in the left  anterior mediastinal structures are indicative of previous bypass surgery',
 'anatomic_classification': 'MEDIASTINUM AND HILA',
 'possible_secondary': ''}
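The two calls above show both failure modes seen so far: an empty completion and a wrong label (`MISCELLANEOUS` predicted for a sentence annotated as `MEDIASTINUM AND HILA`). A minimal sketch of how the whole dataset could be scored follows. `extract_label`, `evaluate`, and the stub `fake_complete` are hypothetical names; the stub stands in for the model so the example runs without the weights, and it only mimics the `choices[0]["text"]` shape of the completion dicts printed above.

```python
def extract_label(completion):
    """Pull the generated text out of a llama-cpp-style completion dict."""
    return completion["choices"][0]["text"].strip().upper()


def evaluate(examples, complete, build_prompt):
    """Return (accuracy, list of misclassified examples with the prediction)."""
    errors = []
    for example in examples:
        predicted = extract_label(complete(build_prompt(example["finding"])))
        if predicted != example["anatomic_classification"]:
            errors.append({**example, "predicted": predicted})
    accuracy = 1 - len(errors) / len(examples) if examples else 0.0
    return accuracy, errors


# Stub standing in for the model: answers HEART whenever the prompt
# mentions "cardiac", otherwise returns an empty completion.
def fake_complete(prompt):
    text = " HEART" if "cardiac" in prompt else ""
    return {"choices": [{"text": text}]}


demo = [
    {"finding": "There is mild cardiac enlargement",
     "anatomic_classification": "HEART"},
    {"finding": "Surgical clips in the mediastinum",
     "anatomic_classification": "MEDIASTINUM AND HILA"},
]
acc, errs = evaluate(demo, fake_complete, lambda s: f"### Input: {s}")
print(acc, [e["predicted"] for e in errs])
```

With the real model, the call would look like `evaluate(final_dataset, lambda p: llm(p, max_tokens=32, stop=["\n"]), alpaca_prompt_constructor)`. At the roughly two to three minutes per call seen in the timings above, scoring all 440 examples on this CPU would take on the order of 16 hours or more.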