## BioMistral-7B Performance Testing on Radiology Reports & Medical Knowledge

This notebook tests the foundational medical knowledge and capabilities of BioMistral-7B on radiology report data obtained from the University of Chicago Medicine. The goal is to decide whether this LLM has the basic ability to process unstructured radiology reports and produce structured output for downstream use cases.

### Dataset
Radiology report data from the University of Chicago Medicine: 789,280 observations with 7 columns.

### Main Testing Areas
 - Ability to answer general medical questions
 - Ability to extract and map medical entities in a report - LLM Chat Mode
 - Ability to identify entities in a report and map them to the corresponding RadLex IDs - LLM Chat Mode
 - Ability to extract and map medical entities in a report - LangChain Pipeline
 - Ability to identify entities in a report and map them to the corresponding RadLex IDs - LangChain Pipeline

In [None]:
# Running this cell may make changes to your environment.
# Uncomment the lines below to install or upgrade the dependencies.

# !pip install -q -U transformers accelerate bitsandbytes langchain
# !pip install kaggle kagglehub

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
import torch
import kagglehub
import pandas as pd
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

In [None]:
DATA_PATH = 'Rad_all_data_id.csv'

In [None]:
#bnb_config = BitsAndBytesConfig(
    #load_in_4bit=True,
    #bnb_4bit_quant_type="nf4",
    #bnb_4bit_use_double_quant=True,
#)

In [None]:
model_name = "BioMistral/BioMistral-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # quantization_config=bnb_config,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

### General Medical Question Answering

In [None]:
messages = [
    {"role": "user", "content": "what are examples of radiology ontologies?"},
]

model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

generated_ids = model.generate(model_inputs, max_new_tokens=500, do_sample=True)
tokenizer.batch_decode(generated_ids)[0]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'<s> [INST] what are examples of radiology ontologies? [/INST] There are several examples of radiology ontologies: 1. RadLex: The Radiological Association of America developed the Radiological Ontology (RadLex) to support radiology information systems. It provides a comprehensive and standardized system for describing radiology exams and their findings . 2. SNOMED CT: The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is a highly detailed ontology used in healthcare information systems. It has been used successfully in a few radiology applications . 3. RadDX: RadDX is an ontology created to accurately represent the clinical imaging reports, for retrieval purposes. It is a subset of RadLex . 4. RadBO (IEO): The Radiology Board Output Ontology (RadBO) is an ontology designed to capture the board-certified radiology report content. It is also developed by the RSNA. 5. eMesLR: The Enhanced Metathesaurus for Laboratory Research (eMesLR) is an ontology developed by the U.S.

### Medical Entity Mapping - LLM Chat Mode


In [None]:
data = pd.read_csv(DATA_PATH)

#### Test with first 20 rows

In [None]:
def generation_chat(impression):
  messages = [
      {"role": "user", "content": system_prompt + f"Report: {impression}\n"},
  ]

  model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")

  generated_ids = model.generate(model_inputs, max_new_tokens=2000, do_sample=True)
  response = tokenizer.batch_decode(generated_ids)[0]
  response = response.split("[/INST]")[-1]
  return response
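
The generation calls in this notebook log a warning that the attention mask and pad token id were not set. Below is a minimal sketch of one way to address that, assuming a `transformers` version whose `apply_chat_template` accepts `return_dict=True`. The helper name `generate_reply` and its `max_new_tokens` default are illustrative and not part of the original notebook. It also decodes only the newly generated tokens, which would replace the `split("[/INST]")` step.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=512):
    """Generate one chat reply while passing the attention mask explicitly."""
    messages = [{"role": "user", "content": prompt}]
    # return_dict=True yields both input_ids and attention_mask.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    generated = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # set explicitly instead of relying on the fallback
    )
    # Drop the prompt tokens so only the model's answer is decoded.
    prompt_len = inputs["input_ids"].shape[1]
    new_tokens = generated[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

With the objects loaded earlier, a call might look like `generate_reply(model, tokenizer, system_prompt + f"Report: {impression}\n")`.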

In [None]:
data_first_20 = data.head(20)

In [None]:
system_prompt = "You are a professional and experienced radiologiest. Your task is to analyze this given partial radiology report(findings) about a patient and extract important information(entities) from it, and map entities to these six general categories: symptom, imaging modality, location, treatment, anatomical entity, and property. If some categories' information is missing, just output value as None for that category. Output should be in the dictionary format where the keys are those 6 categories, and values are corresponding entities extracted from the report or None if missing."

In [None]:
# Work on an independent copy so that adding a column does not trigger pandas' SettingWithCopyWarning.
data_first_20 = data_first_20.copy()
data_first_20['entity mappings'] = None

batch_size = 10
for i in range(0, len(data_first_20), batch_size):
    start_index = i
    end_index = min(i + batch_size, len(data_first_20))
    data_first_20.loc[start_index:end_index-1, 'entity mappings'] = data_first_20.iloc[start_index:end_index].apply(
        lambda x: generation_chat(x['impression']),
        axis=1
    )

pd.set_option('display.max_colwidth', None)
data_first_20[["impression",'entity mappings']]

Unnamed: 0,impression,entity mappings
0,Right total shoulder arthroplasty components in anatomic alignment,"The output will be a list of entities extracted from the radiology report. The entities are mapped to six general categories, including symptom, imaging modality, location, treatment, anatomical entity, and property. If some categories' information is missing, just output value as None for that category. Output should be in the dictionary format where the keys are those 6 categories, and values are corresponding entities extracted from the report or None if missing.</s>"
1,Right total shoulder arthroplasty components in anatomic alignment,{</s>
2,"Posterior fixation of L4 and L5, appearing similar to the prior exam.",{ </s>
3,"No significant colonic polyps or masses identified. *OPTIONAL C-RADS CLASSIFICATION:C-1E-2*(see full definitions in: Zalis et al. CT Colonography reporting and data system: a consensus proposal. Radiology 2005;236:3-9)C1: Normal or benign lesions (no polyps > 6mm). Continue routine screening.C2: Intermediate polyp (less than three 6-9mm polyps or can't exclude >6mm in technically adequate study. Surveillance CTC or colonoscopy recommended.C3: Polyp, possibly advanced adenoma. (polyp >10mm or >three 6-9mm). Colonoscopy recommended.C4: Colonic mass, likely malignant.",Here's the output of the model:</s>
4,Presurgical planning MRI shows a complex mass in the left cerebellar hemisphere with associated obstructive hydrocephalus and a similar complex mass in the left temporo-occipital region. These lesions may represent radiation necrosis and perhaps vascular malformations with associated hemorrhage.,{</s>
5,"1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level.2. L1 vertebral compression fracture with mild edema, 50% loss of height, and slight retropulsion, but no significant spinal canal stenosis.3. Multilevel degenerative cervical spondylosis and a congenitally narrow spinal canal, contributing to moderate spinal canal stenosis at C2-C3 to C5-C6 and variable degrees of multilevel foraminal stenosis.Findings discussed with Dr. Zimmerman on 12/31/2014 at 11:35 a by Dr. Wu.","> The following is the output from the radiology report information extraction module: { ""conditions"": [ { ""entity_type"": ""finding"", ""entity_subtype"": None, ""start"": 104, ""end"": 137, ""value"": ""spinal canal"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 1 },{ ""entity_type"": ""finding"", ""entity_subtype"": None, ""start"": 654, ""end"": 680, ""value"": ""diameter"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 1 },{ ""entity_type"": ""finding"", ""entity_subtype"": None, ""start"": 23, ""end"": 63, ""value"": ""compression"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 1 },{ ""entity_type"": ""finding"", ""entity_subtype"": None, ""start"": 338, ""end"": 361, ""value"": ""degrees"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 1 },{ ""entity_type"": ""finding"", ""entity_subtype"": ""degenerative"", ""start"": 0, ""end"": 102, ""value"": ""cervical"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 0 },{ ""entity_type"": ""finding"", ""entity_subtype"": None, 
""start"": 104, ""end"": 137, ""value"": ""spondyl"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 1 },{ ""entity_type"": ""finding"", ""entity_subtype"": None, ""start"": 180, ""end"": 200, ""value"": ""stenosis"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 1 },{ ""entity_type"": ""finding"", ""entity_subtype"": None, ""start"": 312, ""end"": 333, ""value"": ""resorption"", ""is_negated"": False, ""suggested_change_type"": 0, ""suggested_offset"": 0, ""offset_direction"": 1 }] ,""suggested_changes"": [ { ""content"": ""5%,60%"" }, { ""content"": ""50%,75%"" }, { ""content"": ""63,37.5%"" }, { ""content"": ""degrees"" }, { ""content"": ""at C2-C3 to C5-C6 and variable degrees of multilevel foraminal stenosis"" }, { ""content"": ""at the L4-L5 level"" }, { ""content"": ""variable degrees"" }, { ""content"": ""L1, L5"" }, { ""content"": ""20,40%"" }, { ""content"": ""backward angulation, facet resection"" }, { ""content"": ""at the L4-L5 level"" }] ,""entity_types"": [ ""anatomical entity"", ""property"", ""location"", ""anatomical entity"", ""treatment"", ""symptom"", ""location"" ]}</s>"
6,"1. Postoperative findings with evidence of recurrent tumor in the left masticator, parapharyngeal, and pharyngeal mucosal spaces, with associated left mandible, posterior maxillary sinus wall, and central skull base erosion and extension into the left middle cranial fossa and overlying skin of the face.2. Prominent left level 6 lymph nodes may represent metastatic disease, but are nonspecific.",{ </s>
7,No evidence of metastatic disease.,{</s>
8,Left lower lobe pneumonia.,{</s>
9,1. No evidence of acute ischemia or other definite acute intracranial abnormality.2. Chronic small vessel ischemic disease.,"{ ""symptom"": None, ""imaging_modality"": ""MRI"", ""location"": ""intracranial"", ""treatment"": None, ""anatomical_entity"": ""other definite abnormal"", ""property"": ""negative"" }</s>"
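
Most responses in the table above are not usable dictionaries: several stop right after `{`, and only row 9 resembles the requested format. The following is a rough sketch of how such responses could be screened automatically, assuming the model emits a Python-style literal (with `None` rather than JSON `null`). `parse_entity_dict` is a hypothetical helper, not something used in the original run.

```python
import ast

def parse_entity_dict(response):
    """Return the first {...} span of a response as a dict, or None if it cannot be parsed."""
    text = response.replace("</s>", "").strip()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None  # no complete brace pair, e.g. a response that is just "{"
    try:
        parsed = ast.literal_eval(text[start:end + 1])
    except (ValueError, SyntaxError, TypeError):
        return None  # not a valid Python literal
    return parsed if isinstance(parsed, dict) else None

# Row 9 from the table above parses; a truncated response such as row 1 does not.
row9 = ('{ "symptom": None, "imaging_modality": "MRI", "location": "intracranial", '
        '"treatment": None, "anatomical_entity": "other definite abnormal", '
        '"property": "negative" }</s>')
print(parse_entity_dict(row9))
print(parse_entity_dict("{</s>"))
```

Counting how many rows return a dict gives a simple format-compliance rate, which is separate from whether the extracted entities are clinically correct.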


#### Single test case for entity mapping

In [None]:
report = data_first_20['impression'][0]

In [None]:
messages = [
    {"role": "user", "content": system_prompt + " Report: " + report},
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

generated_ids = model.generate(model_inputs, max_new_tokens=2000, do_sample=True)
response = tokenizer.batch_decode(generated_ids)[0]
response = response.split("[/INST]")[-1]
response

### RadLex ID Mapping - LLM Chat Mode

#### Test with first 20 rows

In [None]:
system_prompt = "You are a professional and experienced radiologiest. Your task is to analyze this given partial radiology report about a patient and identify RadLex IDs from it. Output all the RadLex IDs from the given report information."

In [None]:
# Take an independent copy so that adding a column does not trigger pandas' SettingWithCopyWarning.
data_first_20 = data.head(20).copy()
data_first_20['radlex ids'] = None

batch_size = 10
for i in range(0, len(data_first_20), batch_size):
    start_index = i
    end_index = min(i + batch_size, len(data_first_20))
    data_first_20.loc[start_index:end_index-1, 'radlex ids'] = data_first_20.iloc[start_index:end_index].apply(
        lambda x: generation_chat(x['impression']),
        axis=1
    )
pd.set_option('display.max_colwidth', None)
data_first_20[["impression",'radlex ids']]

Unnamed: 0,impression,radlex ids
0,Right total shoulder arthroplasty components in anatomic alignment,"The “RadLex IDs” from the given report information are: 11701007 (ankle) and 16249211 (right shoulder). In addition, it is recommended to include the following RadLex IDs: 11092301 (limb), 11708501 (arm), 11862200 (leg), 13966700 (joint).</s>"
1,Right total shoulder arthroplasty components in anatomic alignment,Here are the RadLex IDs and their matches from the given radiology report:• anatomic alignment: RadLex ID is anatomic_alignment• right total shoulder arthroplasty: RadLex ID is total_shoulder_arthroplasty• total: RadLex ID is whole_part• anatomical joint: RadLex ID is joint_anatomy• shoulder: RadLex ID is joint_anatomy</s>
2,"Posterior fixation of L4 and L5, appearing similar to the prior exam.","The IDs for the given report information are: 31834 (Posterior fixation of L4 and L5, appearing similar to the prior exam)</s>"
3,"No significant colonic polyps or masses identified. *OPTIONAL C-RADS CLASSIFICATION:C-1E-2*(see full definitions in: Zalis et al. CT Colonography reporting and data system: a consensus proposal. Radiology 2005;236:3-9)C1: Normal or benign lesions (no polyps > 6mm). Continue routine screening.C2: Intermediate polyp (less than three 6-9mm polyps or can't exclude >6mm in technically adequate study. Surveillance CTC or colonoscopy recommended.C3: Polyp, possibly advanced adenoma. (polyp >10mm or >three 6-9mm). Colonoscopy recommended.C4: Colonic mass, likely malignant.","RadLex IDs: radiologic colonography, radiologic polyp, radiologic polyposis, radiologic mass, radiologic adenoma, radiologic neoplasms, radiologic size, radiologic location</s>"
4,Presurgical planning MRI shows a complex mass in the left cerebellar hemisphere with associated obstructive hydrocephalus and a similar complex mass in the left temporo-occipital region. These lesions may represent radiation necrosis and perhaps vascular malformations with associated hemorrhage.,"1. radiation isotope therapy (Radiotherapy/radiotherapya, Radiotherapy/radiation therapyb); [2. radiation necrosis (Disease, Morbidity and Mortality)); [3. non-progressive arteriovenous-venous shunt (Disease, Morbidity and Mortality)); [4. arteriovenous shunt (Surgical Procedure/Treatment-Surgical Procedure));</s>"
5,"1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level.2. L1 vertebral compression fracture with mild edema, 50% loss of height, and slight retropulsion, but no significant spinal canal stenosis.3. Multilevel degenerative cervical spondylosis and a congenitally narrow spinal canal, contributing to moderate spinal canal stenosis at C2-C3 to C5-C6 and variable degrees of multilevel foraminal stenosis.Findings discussed with Dr. Zimmerman on 12/31/2014 at 11:35 a by Dr. Wu.","RadLex IDs: RadLex IDs: RadLex-2333-011 2333-04 ; RadLex-2963-012 2963-04 ; RadLex-214-011 214-1 ; RadLex-3016-02 3016-01, 3016-02 ; RadLex-9281-02 9281-01; RadLex-4268-02 4268-01 ; RadLex-8399-014 8399-04 ; RadLex-12282-03 12282-02, 12282-05 ; RadLex-6369-010 6369-06.</s>"
6,"1. Postoperative findings with evidence of recurrent tumor in the left masticator, parapharyngeal, and pharyngeal mucosal spaces, with associated left mandible, posterior maxillary sinus wall, and central skull base erosion and extension into the left middle cranial fossa and overlying skin of the face.2. Prominent left level 6 lymph nodes may represent metastatic disease, but are nonspecific.",1. S0002133; S0000155; S0000136; S00019442. S0000175</s>
7,No evidence of metastatic disease.,"According to the standard methodology adopted in this system, a RadLex ID was extracted from the phrase “No evidence of metastatic disease.” and it is “Absence / absence of metastatic disease” [ID# abs-disease:00000005]. This is because the context of the text “No evidence of metastatic disease” indicates absence, which is the most prominent characteristic of the concept.</s>"
8,Left lower lobe pneumonia.,"Lung disease (http://purl.obolibrary.org/obo/ Ontology term_0000224), lower lobe (http://purl.obolibrary.org/obo/Ontologyterm_0100595), pneumonia (http://purl.obolibrary.org/obo/Ontologyterm_0001579), lower lobe lung disease (http://purl.obolibrary.org/obo/Ontologyterm_0005623).</s>"
9,1. No evidence of acute ischemia or other definite acute intracranial abnormality.2. Chronic small vessel ischemic disease.,"RadLex ID: RadLex_246212, RadLex_235850. RadLex ID: RadLex_246212 (Chronic small vessel ischemic disease); RadLex_235850 (evidence of acute ischemia).</s>"
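
The identifiers returned above follow no consistent scheme (bare numbers, `S0002133`, `RadLex_246212`, ontology URLs), which suggests they are not real lookups. The sketch below shows one possible format check, assuming RadLex identifiers have the form `RID` followed by digits; that assumption is worth confirming against the RadLex release in use. The function names and the `known_ids` reference set are hypothetical.

```python
import re

# Assumed identifier shape: "RID" immediately followed by digits.
RID_PATTERN = re.compile(r"\bRID\d+\b")

def extract_rid_candidates(response):
    """Return unique RID-shaped tokens in order of first appearance."""
    seen = []
    for match in RID_PATTERN.findall(response):
        if match not in seen:
            seen.append(match)
    return seen

def filter_known(candidates, known_ids):
    """Keep only candidates found in a reference set of valid IDs.

    known_ids is assumed to be loaded separately from an official RadLex export.
    """
    return [rid for rid in candidates if rid in known_ids]

# Fragment of row 0 above: numeric codes only, nothing RID-shaped.
row0 = "are: 11701007 (ankle) and 16249211 (right shoulder)."
print(extract_rid_candidates(row0))
```

A response that passes the format check can still name an ID that does not exist or that belongs to a different term, so the lookup against a reference set is the step that matters.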


### Medical Entity Mapping - LangChain Text Generation Pipeline


In [None]:
def generate_response(report):
  prompt = PromptTemplate(template=template, input_variables=["report"])
  llm_chain = LLMChain(prompt=prompt, llm=llm)
  response = llm_chain.run({"report":report})
  response = response.split("[/INST]")[-1]
  return response

In [None]:
pipeline_inst = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=2500,
        do_sample=True,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline_inst)

#### Single test case for entity mapping

In [None]:
template = """<s>[INST] You are a professional and experienced radiologiest. Your task is to analyze this given partial radiology report about a patient and extract important information(entities) from it, and map entities to these six general categories: symptom, imaging modality, location, treatment, anatomical entity, and property. If some categories' information is missing, just output value as None for that category. Output should be in the dictionary format where the keys are those 6 categories, and values are corresponding entities extracted from the report or None if missing. Radiology Report:
{report} [/INST] </s>
"""
generate_response("1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level.2. L1 vertebral compression fracture with mild edema, 50% loss of height, and slight retropulsion, but no significant spinal canal stenosis.3. Multilevel degenerative cervical spondylosis and a congenitally narrow spinal canal, contributing to moderate spinal canal stenosis at C2-C3 to C5-C6 and variable degrees of multilevel foraminal stenosis.Findings discussed with Dr. Zimmerman on 12/31/2014 at 11:35 a by Dr. Wu")



" </s>\nThe output is a dictionary with 6 keys and their corresponding values:{'location': 'L5, L4-L5, L1, C2-C3 to C5-C6, cervical spine','anatomical entity':'vertebral compression fracture, spinal canal, foramina, cervical spine','treatment': None,'symptom': None, 'imaging modality': 'MRI', 'property':'spinal stenosis, severe, moderate, mild, congenital, variable'}. 16 entities were extracted from the radiology report, but 6 entities are unspecified (treatment, symptom, location)."

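The response above starts with `</s>`, which matches the end-of-sequence token that the template places right after `[/INST]`. One variant that may be worth trying is sketched below, under the assumption that the prompt should end at `[/INST]` and that the tokenizer adds the beginning-of-sequence token itself. `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, and this variant was not part of the original run.

```python
# Same instruction wrapper as the notebook's template, minus the literal <s> and </s>.
PROMPT_TEMPLATE = "[INST] {instruction} Radiology Report:\n{report} [/INST]"

def build_prompt(instruction, report):
    """Fill the template so the prompt ends at [/INST] with no end-of-sequence token."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction.strip(),
        report=report.strip(),
    )

prompt = build_prompt(
    "Output all the RadLex IDs from the given report information.",
    "Left lower lobe pneumonia.",
)
print(prompt)
```

The same string, with `{instruction}` filled in and `{report}` left as the only variable, could be passed to `PromptTemplate` in place of the current template.
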
#### Test with first 20 rows

In [None]:
# Take an independent copy so that adding a column does not trigger pandas' SettingWithCopyWarning.
data_20 = data.head(20).copy()
batch_size = 10
for i in range(0, len(data_20), batch_size):
    start_index = i
    end_index = min(i + batch_size, len(data_20))
    data_20.loc[start_index:end_index-1, 'entity mappings'] = data_20.iloc[start_index:end_index].apply(
        lambda x: generate_response(x['impression']),
        axis=1
    )
pd.set_option('display.max_colwidth', None)
data_20[["impression",'entity mappings']]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Unnamed: 0,impression,entity mappings
0,Right total shoulder arthroplasty components in anatomic alignment,"</s>\n{'symptom': [(None, 'N/A')], 'imaging-modality': [(None, 'N/A')], 'location': [(None, 'N/A')], 'treatment': [('arthroplasty', 'N/A')], 'anatomical-entity': [('total','shoulder'), ('arthroplasty', 'components in')], 'properties': [('in', 'anatomic alignment')] }"
1,Right total shoulder arthroplasty components in anatomic alignment,"</s>\n{ ""category"": ""symptom"", ""value"": []}, {""category"": ""imaging_modality"",""value"": []}, {""category"": ""location"",""value"": []}, {""category"": ""treatment"",""value"": []}, {""category"": ""anatomical_entity"",""value"": [(""shoulder"", None)]}, {""category"": ""property"", ""value"": []}"
2,"Posterior fixation of L4 and L5, appearing similar to the prior exam.","</s>\n{ ""symptom"": None, ""location"": None, ""treatment"": None, ""anatomical_entity"": ""L4 and L5"", ""property"": ""similar"" }"
3,"No significant colonic polyps or masses identified. *OPTIONAL C-RADS CLASSIFICATION:C-1E-2*(see full definitions in: Zalis et al. CT Colonography reporting and data system: a consensus proposal. Radiology 2005;236:3-9)C1: Normal or benign lesions (no polyps > 6mm). Continue routine screening.C2: Intermediate polyp (less than three 6-9mm polyps or can't exclude >6mm in technically adequate study. Surveillance CTC or colonoscopy recommended.C3: Polyp, possibly advanced adenoma. (polyp >10mm or >three 6-9mm). Colonoscopy recommended.C4: Colonic mass, likely malignant.","</s>\n```python{""Symptom"": ""No significant colonic polyps or masses identified."",""Imaging_modality"": ""CT"", ""Location"": None, ""Treatment"": None, ""Anatomical_entity"": ""Colon"", ""Property"": None}"
4,Presurgical planning MRI shows a complex mass in the left cerebellar hemisphere with associated obstructive hydrocephalus and a similar complex mass in the left temporo-occipital region. These lesions may represent radiation necrosis and perhaps vascular malformations with associated hemorrhage.,"</s>\n{'symptom': None, 'imaging modality': ['MRI'], 'location': None, 'treatment': None, 'anatomical_entity': None, 'property': None }."
5,"1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level.2. L1 vertebral compression fracture with mild edema, 50% loss of height, and slight retropulsion, but no significant spinal canal stenosis.3. Multilevel degenerative cervical spondylosis and a congenitally narrow spinal canal, contributing to moderate spinal canal stenosis at C2-C3 to C5-C6 and variable degrees of multilevel foraminal stenosis.Findings discussed with Dr. Zimmerman on 12/31/2014 at 11:35 a by Dr. Wu.","</s>\n1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level."
6,"1. Postoperative findings with evidence of recurrent tumor in the left masticator, parapharyngeal, and pharyngeal mucosal spaces, with associated left mandible, posterior maxillary sinus wall, and central skull base erosion and extension into the left middle cranial fossa and overlying skin of the face.2. Prominent left level 6 lymph nodes may represent metastatic disease, but are nonspecific.","</s>\n1. Postoperative findings with evidence of recurrent tumor in the left masticator, parapharyngeal, and pharyngeal mucosal spaces, with associated left mandible, posterior maxillary sinus wall, and central skull base erosion and extension into the left middle cranial fossa and overlying skin of the face. 2. Prominent left level 6 lymph nodes may represent metastatic disease, but are nonspecific."
7,No evidence of metastatic disease.,</s>\n{ ‘report’: 'No evidence of metastatic disease.' ‘symptom’: [ ‘none’ ] ‘location’: [ ‘none’ ] ‘treatment’: [ ‘none’ ] ‘anatomical_entity’: [ ‘none’ ] ‘property’: [ ‘none’ ] }
8,Left lower lobe pneumonia.,"</s>\nYou are a professional and experienced radiologiest. Your task is to analyze this given partial radiology report about a patient and extract important information(entities) from it, and map entities to these six general categories: symptom, imaging modality, location, treatment, anatomical entity, and property. If some categories' information is missing, just output value as None for that category. Output should be in the dictionary format where the keys are those 6 categories, and values are corresponding entities extracted from the report or None if missing. Radiology Report:"
9,1. No evidence of acute ischemia or other definite acute intracranial abnormality.2. Chronic small vessel ischemic disease.,</s>\n{
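
In the table above, rows 5 and 6 repeat the report text and row 8 repeats the instruction instead of producing a dictionary. The sketch below is one way to flag such echoes, comparing text with whitespace and the `</s>` marker removed. `is_echo` and its `min_chars` threshold are hypothetical choices for illustration.

```python
import re

def _normalize(text):
    """Lowercase, drop the </s> marker, and remove all whitespace."""
    return re.sub(r"\s+", "", text.replace("</s>", "")).lower()

def is_echo(source, response, min_chars=20):
    """Return True if the response merely repeats part of the source text.

    source can be the report or the full prompt. Responses shorter than
    min_chars after normalization are never treated as echoes.
    """
    src = _normalize(source)
    resp = _normalize(response)
    if len(resp) < min_chars:
        return False
    return resp in src

report = "No evidence of metastatic disease in the chest.2. Stable nodule."
print(is_echo(report, "</s>\nNo evidence of metastatic disease in the chest. 2. Stable nodule."))
print(is_echo(report, "</s>\n{"))
```

Removing whitespace before comparing lets the check tolerate small spacing differences, such as the space the model inserted before `2.` in row 6.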


### RadLex ID Mapping - LangChain Text Generation Pipeline

#### Single test case

In [None]:
template = """<s>[INST] You are a professional and experienced radiologiest. Your task is to analyze this given partial radiology report about a patient and identify RadLex IDs from it. Output all the RadLex IDs from the given report information. Radiology Report:
{report} [/INST] </s>
"""
generate_response("1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level.2. L1 vertebral compression fracture with mild edema, 50% loss of height, and slight retropulsion, but no significant spinal canal stenosis.3. Multilevel degenerative cervical spondylosis and a congenitally narrow spinal canal, contributing to moderate spinal canal stenosis at C2-C3 to C5-C6 and variable degrees of multilevel foraminal stenosis.Findings discussed with Dr. Zimmerman on 12/31/2014 at 11:35 a by Dr. Wu")

' </s>\n1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level.'

#### Test with first 20 rows

In [None]:
# Take an independent copy so that adding a column does not trigger pandas' SettingWithCopyWarning.
data_20 = data.head(20).copy()
batch_size = 10
for i in range(0, len(data_20), batch_size):
    start_index = i
    end_index = min(i + batch_size, len(data_20))
    data_20.loc[start_index:end_index-1, 'RadLex IDs'] = data_20.iloc[start_index:end_index].apply(
        lambda x: generate_response(x['impression']),
        axis=1
    )
pd.set_option('display.max_colwidth', None)
data_20[["impression", 'RadLex IDs']]


Unnamed: 0,impression,RadLex IDs
0,Right total shoulder arthroplasty components in anatomic alignment,"</s>\nThe given text is a partial radiology report of a right total shoulder arthroplasty. It describes the placement of the prosthetic components in the anatomic alignment. The terms used in the text refer to radiological findings such as “total shoulder arthroplasty” and “anatomic alignment” which can be found in the RadLex dictionary. Therefore, the IDs for these terms are “RID_100000105” and “RID_100001181” respectively."
1,Right total shoulder arthroplasty components in anatomic alignment,"</s>\nThe following RadLex IDs were extracted from the given radiology report: “Right” (ID: 12013003), “total shoulder arthroplasty” (ID: 2272580), and “anatomic alignment” (ID: 23583300). Note that the ID of the term “Right” is prefixed with “12013003”, which is the RadLex root term ID for “Laterality”."
2,"Posterior fixation of L4 and L5, appearing similar to the prior exam.",</s>\nRadLex IDs extracted from the report:
3,"No significant colonic polyps or masses identified. *OPTIONAL C-RADS CLASSIFICATION:C-1E-2*(see full definitions in: Zalis et al. CT Colonography reporting and data system: a consensus proposal. Radiology 2005;236:3-9)C1: Normal or benign lesions (no polyps > 6mm). Continue routine screening.C2: Intermediate polyp (less than three 6-9mm polyps or can't exclude >6mm in technically adequate study. Surveillance CTC or colonoscopy recommended.C3: Polyp, possibly advanced adenoma. (polyp >10mm or >three 6-9mm). Colonoscopy recommended.C4: Colonic mass, likely malignant.","</s>\n- C-RADS classification (Radiology Imaging Reporting and Data Systems (RADS) for CT Colonography, Radiology 2005;236:3-9)RadLex IDs (RADLex, https://www.nlm.nih.gov/research/radlex/): Normal or benign lesions, Intermediate polyp, Polyp, possibly advanced adenoma, Colonic mass, likely malignant."
4,Presurgical planning MRI shows a complex mass in the left cerebellar hemisphere with associated obstructive hydrocephalus and a similar complex mass in the left temporo-occipital region. These lesions may represent radiation necrosis and perhaps vascular malformations with associated hemorrhage.,</s>\n 1. RadLex ID: 22202003 2. RadLex ID: 2686850 3. RadLex ID: 23022007 4. RadLex ID: 2232520 5. RadLex ID: 3470250 6. RadLex ID: 2233510 7. RadLex ID: 2233510 8. RadLex ID: 23022007 9. RadLex ID: 44011005 10. RadLex ID: 44007502 11. RadLex ID: 34700002 12. RadLex ID 2871240.
5,"1. L5 vertebral compression fracture with edema, 75% loss of height, and 10 mm of retropulsion superimposed upon underlying degenerative changes, contributing to severe spinal canal stenosis with compression of the cauda equina and severe right and moderate left foraminal stenosis at the L4-L5 level.2. L1 vertebral compression fracture with mild edema, 50% loss of height, and slight retropulsion, but no significant spinal canal stenosis.3. Multilevel degenerative cervical spondylosis and a congenitally narrow spinal canal, contributing to moderate spinal canal stenosis at C2-C3 to C5-C6 and variable degrees of multilevel foraminal stenosis.Findings discussed with Dr. Zimmerman on 12/31/2014 at 11:35 a by Dr. Wu.","</s>\nThe RadLex 1000 IDs extracted from the input text is: “Lumbar vertebral compression fracture”, “Vertebral compression fracture”, “Spine degenerative changes”, “Lumbar canal stenosis”, “Cauda equina”, “Lumbar spondylotic changes”, “Lumbar spondylotic myelopathy”, “Cervical canal stenosis”, “Cervical spondylotic changes”, “Cervical spondylotic myelopathy”, “Cervical foraminal stenosis”, “Degenerative cervical spondylotic myelopathy”, “Narrow spinal canal”, “Degenerative cervical spondylosis”, “Cervical spinal canal stenosis”."
6,"1. Postoperative findings with evidence of recurrent tumor in the left masticator, parapharyngeal, and pharyngeal mucosal spaces, with associated left mandible, posterior maxillary sinus wall, and central skull base erosion and extension into the left middle cranial fossa and overlying skin of the face.2. Prominent left level 6 lymph nodes may represent metastatic disease, but are nonspecific.","</s>\n1. Postoperative findings with evidence of recurrent tumor in the left masticator, parapharyngeal, and pharyngeal mucosal spaces, with associated left mandible, posterior maxillary sinus wall, and central skull base erosion and extension into the left middle cranial fossa and overlying skin of the face.2. Prominent left level 6 lymph nodes may represent metastatic disease, but are nonspecific."
7,No evidence of metastatic disease.,"</s>\nThe above table shows the output of Radlex IDs extracted from the given information of the radiology report using the proposed method. We have applied the proposed method for the information provided in the report. In this information, we have identified two concepts from the RadLex ontology, which are “metastatic disease” and “absent”. The IDs for these terms are rad:0000287 and rad:0000003, respectively."
8,Left lower lobe pneumonia.,"</s>\nRadLex IDs identified: RL0000155, RL0000029, RL0000127, RL0000005 (The bolded text is the information provided as the input to the system). The output is the list of identified terms with their RadLex IDs. The underlined text is the output of the system."
9,1. No evidence of acute ischemia or other definite acute intracranial abnormality.2. Chronic small vessel ischemic disease.,"</s>\n• Chronic small vessel ischemic disease. RadLex ID: RL0005598 (chronic small vessel ischemic disease), RL0005299 (intracranial arteries: small vessels) and RL0004561 (cerebral arteries: small arteries). The RadLex IDs are separated by commas in the output as shown above."


### BioMistral_7B Findings

1. Answers general medical questions well, which indicates basic medical knowledge.

2. Can map entities to categories to some degree for simple, straightforward reports. It maps radiology modalities, locations, and treatments better than symptoms, anatomical entities, and properties, possibly because of limited context or unclear category definitions in the prompt. Outputs are unstable: the same report can yield different results.

3. Is aware of RadLex IDs but cannot extract them properly. Most generated IDs do not exist, some exist but do not match the report, and the ID formats are often wrong.
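
The third finding can be checked mechanically rather than by eye. The sketch below is a minimal illustration, not the method used in this notebook. It assumes that valid RadLex identifiers have the form `RID` followed by digits, and that `known_ids` is a set of identifiers loaded from a RadLex release. The function name `check_radlex_ids`, the candidate pattern, and the placeholder contents of `known_ids` are all hypothetical.

```python
import re

# Assumption: a well-formed RadLex identifier is "RID" followed by digits.
RADLEX_ID_PATTERN = re.compile(r"RID\d+")

# Loose pattern for anything the model presents as an ID, covering the
# shapes seen in the outputs above (RID_..., RL..., rad:..., bare digits).
CANDIDATE_PATTERN = re.compile(r"\b(?:RID_?\d+|RL\d+|rad:\d+|\d{6,})\b")


def check_radlex_ids(text, known_ids):
    """Classify each ID-like token in `text`.

    Returns a list of (candidate, status) tuples where status is
    "bad_format", "unknown_id", or "ok".
    """
    results = []
    for candidate in CANDIDATE_PATTERN.findall(text):
        if not RADLEX_ID_PATTERN.fullmatch(candidate):
            status = "bad_format"   # does not look like RID<digits>
        elif candidate not in known_ids:
            status = "unknown_id"   # well-formed but absent from the reference set
        else:
            status = "ok"
        results.append((candidate, status))
    return results


# Placeholder reference set; in practice, load it from a RadLex release file.
known_ids = {"RID1301"}

sample = "IDs: RID_100000105, RL0000155, rad:0000287, 12013003, RID1301, RID9999999"
for candidate, status in check_radlex_ids(sample, known_ids):
    print(candidate, status)
```

A status of "ok" only means the ID exists in the reference set. Whether it matches the report content still requires comparing the term label against the report text.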