# Competition/Problem Description
When you visit a doctor, how they interpret your symptoms can determine whether your diagnosis is accurate. By the time they’re licensed, physicians have had a lot of practice writing patient notes that document the history of the patient’s complaint, physical exam findings, possible diagnoses, and follow-up care. Learning and assessing the skill of writing patient notes requires feedback from other doctors, a time-intensive process that could be improved with the addition of machine learning.

Until recently, the Step 2 Clinical Skills examination was one component of the United States Medical Licensing Examination® (USMLE®). The exam required test-takers to interact with Standardized Patients (people trained to portray specific clinical cases) and write a patient note. Trained physician raters later scored patient notes with rubrics that outlined each case’s important concepts (referred to as features). The more such features found in a patient note, the higher the score (among other factors that contribute to the final score for the exam).

However, having physicians score patient note exams requires significant time, along with human and financial resources. Approaches using natural language processing have been created to address this problem, but patient notes can still be challenging to score computationally because features may be expressed in many ways. For example, the feature "loss of interest in activities" can be expressed as "no longer plays tennis." Other challenges include the need to map concepts by combining multiple text segments, or cases of ambiguous negation such as “no cold intolerance, hair loss, palpitations, or tremor” corresponding to the key essential “lack of other thyroid symptoms.”

In this competition, you’ll identify specific clinical concepts in patient notes. Specifically, you'll develop an automated method to map clinical concepts from an exam rubric (e.g., “diminished appetite”) to various ways in which these concepts are expressed in clinical patient notes written by medical students (e.g., “eating less,” “clothes fit looser”). Great solutions will be both accurate and reliable.

If successful, you'll help tackle the biggest practical barriers in patient note scoring, making the approach more transparent and interpretable, and easing the development and administration of such assessments. As a result, medical practitioners will be able to explore the full potential of patient notes to reveal information relevant to clinical skills assessment.

This competition is sponsored by the National Board of Medical Examiners® (NBME®). Through research and innovation, NBME supports medical school and residency program educators in addressing issues around the evolution of teaching, learning, technology, and the need for meaningful feedback. NBME offers high-quality assessments and educational services for students, professionals, educators, regulators, and institutions dedicated to the evolving needs of medical education and health care. To serve these communities, NBME collaborates with a diverse and comprehensive array of practicing health professionals, medical educators, state medical board members, test developers, academic researchers, scoring experts and public representatives.

NBME gratefully acknowledges the valuable input of Dr Le An Ha from the University of Wolverhampton’s Research Group in Computational Linguistics.

In [97]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# huggingface fun
from datasets import Dataset
from datasets import load_dataset
from transformers import AutoTokenizer

import torch

In [2]:
base_dir = "/kaggle/input/nbme-score-clinical-patient-notes"

train = pd.read_csv(f"{base_dir}/train.csv")
patient_notes = pd.read_csv(f"{base_dir}/patient_notes.csv")
features = pd.read_csv(f"{base_dir}/features.csv")
test = pd.read_csv(f"{base_dir}/test.csv")
sample_submission = pd.read_csv(f"{base_dir}/sample_submission.csv")

Split the train dataset 80/20 into training and validation sets, sampling within each case

In [3]:
validation_dataset = pd.DataFrame()
for case_num in train.case_num.unique():
    validation_dataset = pd.concat([validation_dataset, train[train.case_num == case_num].sample(frac=0.2, random_state=24)])
    
train_dataset = train.drop(validation_dataset.index)
print(train_dataset.shape)
print(validation_dataset.shape)
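
One thing to watch with the row-level split above: each row of `train` is one (note, feature) pair, so rows from the same `pn_num` can land on both sides. A minimal sketch of splitting by note instead, using only the standard library and toy data. `split_notes_by_case` is a hypothetical helper, not part of this notebook, and the rounding rule is my own assumption.

```python
import random
from collections import defaultdict


def split_notes_by_case(rows, val_frac=0.2, seed=24):
    """rows: iterable of (case_num, pn_num) pairs. Returns (train_notes, val_notes) sets."""
    notes_per_case = defaultdict(set)
    for case_num, pn_num in rows:
        notes_per_case[case_num].add(pn_num)

    rng = random.Random(seed)
    train_notes, val_notes = set(), set()
    for case_num in sorted(notes_per_case):
        notes = sorted(notes_per_case[case_num])
        rng.shuffle(notes)
        n_val = round(len(notes) * val_frac)  # assumption: round to the nearest note
        val_notes.update(notes[:n_val])
        train_notes.update(notes[n_val:])
    return train_notes, val_notes


# Toy rows: 2 cases, 5 notes per case, 3 feature rows per note
rows = [(case, case * 100 + n) for case in (0, 1) for n in range(5) for _ in range(3)]
train_notes, val_notes = split_notes_by_case(rows)
print(len(train_notes), len(val_notes))
```

With the real data, something like `split_notes_by_case(zip(train.case_num, train.pn_num))` followed by `train[train.pn_num.isin(val_notes)]` should give a validation frame with no note shared between the two sets.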

In [4]:
# It's common to have 2 sub-NER-tags per NER tag, indicated by a prefix.
# Typically `B-` marks the beginning of an NER tag and `I-` marks a token that is inside one.
# I'll start with just the `I-` indicator to mark our target tokens.
# For now, I'm just going to group the feature texts into a single list; something could be said for splitting on case_num.
# '0' is used to indicate that a token doesn't have an interesting NER tag.
ner_tag_list = ['0']
for i, r in features.feature_text.items():  # Series.iteritems() was removed in pandas 2.0
    ner_tag_list.append("I-" + r)
    
ner_tag_list[:7]
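
For reference, here is a rough sketch of what the tag list could look like if both the `B-` and `I-` prefixes were used, together with the id/label lookups a token-classification model config usually wants. The feature texts are made up for illustration and `build_bio_tags` is a hypothetical helper.

```python
def build_bio_tags(feature_texts):
    """Return a tag list with one outside tag plus B-/I- tags per feature."""
    tags = ["O"]  # outside tag; this notebook uses the string '0' for the same role
    for text in feature_texts:
        tags.append("B-" + text)
        tags.append("I-" + text)
    return tags


# Illustrative feature texts only, not read from features.csv
example_features = ["Chest-pressure", "Lightheaded"]
bio_tags = build_bio_tags(example_features)
label2id = {tag: i for i, tag in enumerate(bio_tags)}
id2label = {i: tag for tag, i in label2id.items()}
print(bio_tags)
```

If the same feature text occurs under more than one case, the dictionary lookups here would collapse those entries into one, so a real version would probably need to key on `feature_num` or on (case, feature) instead.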

In [5]:
# train_dataset[train_dataset.pn_num == 61497]
train_dataset.sample(8)

In [6]:
import ast

def tag_and_tokenize(dataset):
    tokenized_dict = {
        "tokens": [],
        "ner_tags": []
    }

    size = dataset.pn_num.unique().shape[0]
    pos = 0
    for pn_num in dataset.pn_num.unique():
        print(f"Tokenized: {pn_num} ({pos+1}/{size})", end="\r")
        pn_history = patient_notes[patient_notes.pn_num == pn_num].squeeze().pn_history

        # Naive whitespace split for now
        pn_tokens = pn_history.split()
        ner_tag_indices = [0 for _ in pn_tokens]

        for _, r in dataset[dataset.pn_num == pn_num].iterrows():
            # `location` holds the string form of a list; parse it without running arbitrary code
            location_list = ast.literal_eval(r.location)
            feature = features[features.feature_num == r.feature_num].squeeze().feature_text
            ner_tag_index = ner_tag_list.index("I-"+feature)

            # No feature/annotations: skip
            if len(location_list) == 0:
                continue

            for location in location_list:
                for sub_loc in location.split(";"):
                    char_token_start = int(sub_loc.split()[0])
                    char_token_end = int(sub_loc.split()[1])

                    # If the span starts mid-word, the prefix split already counts that word, so step back by one.
                    # Check for position 0 first so that index -1 is never read.
                    space_offset = 0 if char_token_start == 0 or pn_history[char_token_start-1].isspace() else 1

                    token_start_idx = len(pn_history[0:char_token_start].split()) - space_offset
                    token_end_idx = token_start_idx + len(pn_history[char_token_start:char_token_end].split())

                    # Separate loop variable so the row index above is not shadowed
                    for token_idx in range(token_start_idx, token_end_idx):
                        ner_tag_indices[token_idx] = ner_tag_index
        #             print(f"{sub_loc} || {pn_history[char_token_start:char_token_end]} || {feature} || {pn_tokens[token_start_idx:token_end_idx]}")

        tokenized_dict["tokens"].append(pn_tokens)
        tokenized_dict["ner_tags"].append(ner_tag_indices)
        pos += 1
    
    return tokenized_dict

def generate_datasets():
    tokenized_train_dict = tag_and_tokenize(train_dataset)
    tokenized_val_dict = tag_and_tokenize(validation_dataset)
    
    return (Dataset.from_pandas(pd.DataFrame(tokenized_train_dict)), Dataset.from_pandas(pd.DataFrame(tokenized_val_dict)))

train_HF_dataset, validation_HF_dataset = generate_datasets()
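
The character-to-token arithmetic in `tag_and_tokenize` can also be done by recording each whitespace token's character offsets once and checking for overlap. A minimal sketch under that assumption follows. `char_span_to_token_range` is a hypothetical helper and the note text is invented.

```python
import re


def char_span_to_token_range(text, char_start, char_end):
    """Return (first, last_exclusive) indices of whitespace tokens overlapping the span."""
    # Character offsets of every whitespace-delimited token, in order.
    token_spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    hits = [
        i for i, (tok_start, tok_end) in enumerate(token_spans)
        if tok_start < char_end and tok_end > char_start
    ]
    if not hits:
        return None  # the span covers only whitespace or lies outside the text
    return hits[0], hits[-1] + 1


# Invented note text, with a Windows line break in the middle
note = "17yo M with chest pain\r\nfor 2 days"
tokens = note.split()
start = note.index("chest")
span = char_span_to_token_range(note, start, start + len("chest pain"))
print(tokens[span[0]:span[1]])
```

For ordinary text the `\S+` matches should line up one-to-one with `str.split()`, so the returned indices can be used directly on `pn_tokens`. A span that starts or ends mid-word simply selects every token it touches.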

In [7]:
train_HF_dataset

In [8]:
# Source: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
model_checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [140]:
# Source: https://huggingface.co/docs/transformers/main/en/tasks/token_classification
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
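
To see what the alignment does without loading a tokenizer, here is a small self-contained sketch of the same rule applied to a hand-written `word_ids` list. The sub-token split in the comment is hypothetical (the real tokenizer may split differently) and `align_labels` is just an illustrative name.

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Label only the first sub-token of each word; mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special token, or a trailing sub-token of the previous word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_idx])
        previous_word_idx = word_idx
    return aligned


# Hypothetical tokenization of the words ["chest", "palpitations"]:
# [CLS] chest pal ##pit ##ations [SEP]
word_ids = [None, 0, 1, 1, 1, None]
word_labels = [0, 7]
print(align_labels(word_labels, word_ids))  # [-100, 0, 7, -100, -100, -100]
```

The `-100` entries are skipped by the loss, so each word contributes exactly one training target no matter how many pieces it is split into.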

In [141]:
tokenized_train_dataset = train_HF_dataset.map(tokenize_and_align_labels, batched=True)
tokenized_validation_dataset = validation_HF_dataset.map(tokenize_and_align_labels, batched=True)

In [142]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [143]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(ner_tag_list))

In [106]:
os.environ["WANDB_DISABLED"] = "true"

In [144]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

In [145]:
trainer.evaluate()
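
Since no `compute_metrics` function is passed to the `Trainer`, the evaluation above reports the loss but no task metric. A rough sketch of a token-level micro precision/recall/F1 over plain Python lists is below. `token_micro_f1` is a hypothetical helper, and it assumes label id 0 means "no feature" and `-100` marks masked positions, as in this notebook.

```python
def token_micro_f1(predicted, gold, outside_id=0, ignore_index=-100):
    """Micro-averaged precision/recall/F1 over tokens that carry a feature tag."""
    tp = fp = fn = 0
    for pred_seq, gold_seq in zip(predicted, gold):
        for p, g in zip(pred_seq, gold_seq):
            if g == ignore_index:
                continue  # masked sub-token or special token
            if p == g and g != outside_id:
                tp += 1
            else:
                if p != outside_id:
                    fp += 1  # predicted a feature that is wrong or absent
                if g != outside_id:
                    fn += 1  # missed (or mislabeled) a gold feature
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy sequences: one masked position, two hits, one false positive, one miss
scores = token_micro_f1([[0, 3, 3, 0, 5]], [[-100, 3, 0, 4, 5]])
print(scores)
```

To plug something like this into the `Trainer`, it would need a wrapper that takes the logits and label arrays, applies an argmax, and converts them to lists first.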

In [146]:
model_filename="clinical_ner_04.model"
trainer.save_model(model_filename)

In [147]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

In [148]:
tokenizer = AutoTokenizer.from_pretrained(model_filename)
ner_model = AutoModelForTokenClassification.from_pretrained(model_filename, num_labels=len(ner_tag_list))

In [164]:
ex_note = patient_notes.sample().squeeze().pn_history
ex_note_tokens = ex_note.split()
tokens = tokenizer(ex_note_tokens, truncation=True, is_split_into_words=True)
torch.tensor(tokens['input_ids']).unsqueeze(0).size()

In [165]:
with torch.no_grad():  # inference only, so skip gradient tracking
    predictions = ner_model(input_ids=torch.tensor(tokens['input_ids']).unsqueeze(0), attention_mask=torch.tensor(tokens['attention_mask']).unsqueeze(0))
predictions = torch.argmax(predictions.logits.squeeze(0), dim=1)  # drop only the batch dimension
predictions = [ner_tag_list[i] for i in predictions]
print(predictions)

In [166]:
len(predictions)

In [167]:
words = tokenizer.batch_decode(tokens['input_ids'])
print([word.replace("##", "") for word in words])

In [168]:
print(ex_note)

In [None]:
with torch.no_grad():  # inference only, so skip gradient tracking
    predictions = ner_model(input_ids=torch.tensor(tokens['input_ids']).unsqueeze(0), attention_mask=torch.tensor(tokens['attention_mask']).unsqueeze(0))
predictions = torch.argmax(predictions.logits.squeeze(0), dim=1)  # drop only the batch dimension
predictions = [ner_tag_list[i] for i in predictions]
print(predictions)

# Adapted from: https://huggingface.co/docs/transformers/main/en/tasks/token_classification
def align_predictions_with_input(predictions, tokenized_input):
    # Keep one prediction per input word: the one made for the word's first sub-token.
    word_predictions = []
    previous_word_idx = None
    for token_idx, word_idx in enumerate(tokenized_input.word_ids()):
        # Skip special tokens (None) and the trailing sub-tokens of a word.
        if word_idx is not None and word_idx != previous_word_idx:
            word_predictions.append(predictions[token_idx])
        previous_word_idx = word_idx
    return word_predictions
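
Once there is one predicted tag per whitespace word, the tags still have to go back to character offsets in the `start end;start end` form used by the `location` column. A minimal sketch of that step, assuming consecutive words with the same tag belong to one span. `tags_to_location` is a hypothetical helper, and the note and tag name are invented.

```python
import re


def tags_to_location(text, word_tags, target_tag):
    """Merge consecutive words tagged `target_tag` into 'start end' character spans."""
    spans = []
    previous_hit = False
    for match, tag in zip(re.finditer(r"\S+", text), word_tags):
        hit = tag == target_tag
        if hit and previous_hit:
            spans[-1][1] = match.end()  # extend the span that is already open
        elif hit:
            spans.append([match.start(), match.end()])  # open a new span
        previous_hit = hit
    return ";".join(f"{start} {end}" for start, end in spans)


# Invented note and tags, one tag per whitespace word
note = "pt reports chest pain and later chest tightness"
tags = ["0", "0", "I-Chest-pain", "I-Chest-pain", "0", "0", "I-Chest-pain", "0"]
location = tags_to_location(note, tags, "I-Chest-pain")
print(location)  # 11 21;32 37
```

Because the offsets come from the original text rather than from the rebuilt, lowercased note, line breaks and casing in the note do not shift the spans.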

In [131]:
ex_note

In [169]:
words = [word.replace("##", "") for word in tokenizer.batch_decode(tokens['input_ids'])]
rebuilt_note = ""
prev_idx = None
for i, idx in enumerate(tokens.word_ids()):
    # ignore special tokens
    if idx is None:
        continue
    
    # append space before if new id
    if prev_idx is not None and prev_idx != idx:
        rebuilt_note += " "
    
    rebuilt_note += words[i]
    
    prev_idx = idx
    
print(rebuilt_note, "\n")
print(ex_note.lower().replace("\r\n", " "))

In [158]:
rebuilt_note == ex_note.lower().replace("\r\n", " ")