Hello fellow Kagglers,

This notebook demonstrates how to generate extra training data by predicting labels for the unannotated patient notes with a model trained on the annotated training data. The labels are soft, meaning the predicted probabilities are kept as they are instead of being thresholded to 0 or 1. The annotated training data is also included, so that correctly labelled data is mixed in with the pseudo-labelled, non-annotated data.

There are ~42000 patient notes, of which only 1000 are annotated. Using all patient notes yields roughly 42x more training data. The predicted labels will contain errors, but training on a TPU with a large batch size should smooth out that noise, and the expectation is better performance than training on the annotated data alone.
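To make the difference between soft and hard labels concrete, here is a minimal NumPy sketch. The probabilities, the hypothetical `student` predictions, and the 0.5 cutoff for hard labels are all made-up values for illustration, not output from the model in this notebook.

```python
import numpy as np

# Made-up probabilities for 3 tokens and 2 features (illustrative values only)
probs = np.array([
    [0.92, 0.03],
    [0.40, 0.10],
    [0.01, 0.75],
], dtype=np.float32)

# Hard pseudo-labels: threshold at 0.5 (assumed cutoff), the confidence is lost
hard_labels = (probs > 0.5).astype(np.float32)

# Soft pseudo-labels: keep the probabilities themselves as training targets
soft_labels = probs.copy()


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; works with soft targets in [0, 1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(loss))


# Hypothetical predictions of a model being trained on the pseudo-labels
student = np.array([[0.9, 0.05], [0.5, 0.1], [0.05, 0.7]], dtype=np.float32)

print('hard labels:\n', hard_labels)
print('loss vs hard labels:', binary_cross_entropy(hard_labels, student))
print('loss vs soft labels:', binary_cross_entropy(soft_labels, student))
```

With hard labels the uncertain 0.40 prediction becomes a plain 0, while the soft label keeps that uncertainty in the training target.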

[Preprocessing Notebook](https://www.kaggle.com/markwijkhuizen/nbme-preprocessing-albert)

[Training Notebook](https://www.kaggle.com/markwijkhuizen/nbme-albert-large-training-tpu)

[Inference Notebook](https://www.kaggle.com/markwijkhuizen/nbme-albert-inference-public)

**V6**
* Using better weights with an LB score of 0.854

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn import metrics

from tqdm.notebook import tqdm
from nltk.tokenize import word_tokenize, sent_tokenize
from transformers import PreTrainedTokenizerFast, TFAlbertModel, AlbertConfig
from sklearn.model_selection import train_test_split

import re
import os
import random
import math

tqdm.pandas()

AUTO = tf.data.experimental.AUTOTUNE

In [None]:
SEQ_LENGTH = 512

In [None]:
features = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/features.csv')

# Add Ordinal Encoding
features['feature_num_ordinal'] = features['feature_num'].astype('category').cat.codes

N_LABELS = len(features)
print(f'N_LABELS: {N_LABELS}')

# Model

In [None]:
albert_config = AlbertConfig(
  hidden_size = 4096,
  intermediate_size = 16384,
  num_attention_heads = 64,
)

In [None]:
def get_model():
    # Clear Backend
    tf.keras.backend.clear_session()

    # Enable XLA optimizations
    tf.config.optimizer.set_jit(True)
    
    # Input Layer
    input_ids = tf.keras.layers.Input(shape=(SEQ_LENGTH,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input(shape=(SEQ_LENGTH,), dtype=tf.int32, name='attention_mask')

    # AlBERT Model
    albert = TFAlbertModel(albert_config)

    # Get the last hidden state
    last_hidden_state = albert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

    do = tf.keras.layers.Dropout(0.00, name='dropout')(last_hidden_state)

    output = tf.keras.layers.Dense(N_LABELS, activation='sigmoid', name='head/classifier')(do)

    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=[output])
    
    model.load_weights('/kaggle/input/nbme-albert-large-training-tpu-dataset/model.h5')
    
    return model

In [None]:
model = get_model()

In [None]:
model.summary()

In [None]:
tf.keras.utils.plot_model(model, show_shapes=True, show_dtype=True, show_layer_names=True, expand_nested=False)

# Train

In [None]:
train = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/train.csv')
train = train.set_index(['case_num', 'pn_num'])

display(train.head())

# Patient Notes

In [None]:
patient_notes = pd.read_csv('/kaggle/input/nbme-score-clinical-patient-notes/patient_notes.csv')

# Set Case Number and Patient Number as Index for Convenient Access
patient_notes = patient_notes.set_index(['case_num', 'pn_num'])

patient_notes['pn_history_clean'] = patient_notes['pn_history'].str.lower()

display(patient_notes.head())

display(patient_notes.info())

# Tokenize

In [None]:
tokenizer = PreTrainedTokenizerFast.from_pretrained('/kaggle/input/nbme-preprocessing-albert-public/tokenizer')

In [None]:
# Tokenize the text with the ALBERT tokenizer, padding/truncating to SEQ_LENGTH
def tokenize(note):
    return tokenizer(
            note,
            padding = 'max_length',
            truncation = True,
            max_length = SEQ_LENGTH,
            return_offsets_mapping = True,
        )

# Inference

In [None]:
# Only elements above this threshold will be included
# Thus predictions below 0.05 will not be included in the soft labels
THRESHOLD = 0.05

# Maximum Annotations Per Patient Note
MAX_ANNOTATIONS = 1024

In [None]:
# Train Test Split
SEED = 42
train_idxs = train.index.unique()
test_size = 100 / len(train_idxs)
_, val_indices = train_test_split(train_idxs, test_size=test_size, random_state=SEED)
print(f'val_indices shape: {val_indices.shape}, val_indices length: {len(val_indices)}')

Labels are generated using sparse tensors, which store only the indices and values of the non-zero elements (more precisely, the elements above the threshold). The labels have shape \[Number of Tokens, Number of Features\], but fewer than 0.1% of the elements are actually non-zero. Storing only those elements saves a huge amount of memory by leaving out the more than 99.9% of zeros.

I highly recommend diving into sparse tensors, as they are common in the data science field. Getting familiar with the sparse tensor format can save you a lot of computing resources in the future!

More on sparse tensors can be found in the [TensorFlow Documentation](https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor).
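The storage scheme used in the next cell can be sketched on a toy example. This is a NumPy-only illustration that uses `np.argwhere` in place of `tf.sparse.from_dense`; the toy dimensions and prediction values are assumptions, while the -1 padding and the `int16`/`float32` dtypes mirror the arrays allocated below.

```python
import numpy as np

# Toy dimensions (assumptions for illustration; the notebook uses SEQ_LENGTH = 512)
n_tokens, n_labels = 8, 5
threshold = 0.05
max_annotations = 6

# Made-up dense prediction matrix, almost entirely zero
y_pred = np.zeros((n_tokens, n_labels), dtype=np.float32)
y_pred[1, 2] = 0.91
y_pred[2, 2] = 0.64
y_pred[5, 0] = 0.08
y_pred[6, 4] = 0.01  # below the threshold, so it is dropped

# Sparse form: [token, feature] indices and the matching values (both row-major)
indices = np.argwhere(y_pred > threshold).astype(np.int16)
values = y_pred[y_pred > threshold]

# Pad to a fixed length with -1 so every note gets arrays of the same shape
padded_indices = np.full((max_annotations, 2), -1, dtype=np.int16)
padded_values = np.full(max_annotations, -1, dtype=np.float32)
padded_indices[:len(indices)] = indices
padded_values[:len(values)] = values

print(padded_indices.tolist())
print(padded_values.tolist())
print(f'dense bytes: {y_pred.nbytes}, sparse bytes: {padded_indices.nbytes + padded_values.nbytes}')
```

At the notebook's real settings, the padded sparse form costs 1024 × 2 × 2 + 1024 × 4 = 8192 bytes per note, whereas a dense float32 label matrix costs 512 × N_LABELS × 4 bytes per note.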

In [None]:
SIZE = len(patient_notes)
val_indeces_set = set(val_indices)

# Excluding Validation
X_extra_no_val = np.zeros([SIZE - len(val_indices), SEQ_LENGTH], dtype=np.int32)
y_extra_indices_no_val = np.full(shape=[SIZE - len(val_indices), MAX_ANNOTATIONS, 2], fill_value=-1, dtype=np.int16)
y_extra_values_no_val = np.full(shape=[SIZE - len(val_indices), MAX_ANNOTATIONS], fill_value=-1, dtype=np.float32)

# Including Validation
X_extra = np.zeros([SIZE, SEQ_LENGTH], dtype=np.int32)
y_extra_indices = np.full(shape=[SIZE, MAX_ANNOTATIONS, 2], fill_value=-1, dtype=np.int16)
y_extra_values = np.full(shape=[SIZE, MAX_ANNOTATIONS], fill_value=-1, dtype=np.float32)

print(f'X_extra shape: {X_extra.shape}, y_extra_indices shape: {y_extra_indices.shape}, y_extra_values shape: {y_extra_values.shape}')

idx_no_val = 0
for idx, (row_idx, row) in enumerate(tqdm(patient_notes.iterrows(), total=len(patient_notes))):
    pn_history_clean = row['pn_history_clean']
    
    # Tokenize patient note
    tokens = tokenize(pn_history_clean)
    
    input_ids = tokens['input_ids']
    attention_mask = tokens['attention_mask']
    
    # Get the prediction
    y_pred = model.predict_on_batch({
            'input_ids': np.array([input_ids]),
            'attention_mask': np.array([attention_mask]),
        }).squeeze()
    
    # Cast to Integer
    input_ids = np.array(input_ids, dtype=np.int32)
    
    # Create a Sparse Tensor as label to reduce memory usage
    y_pred_i = (y_pred > THRESHOLD).astype(np.int32)
    # Get the indices of the elements above the threshold
    y_pred_i = tf.sparse.from_dense(y_pred_i).indices.numpy()
    
    # Gather the values of elements above the threshold
    y_extra_v = tf.gather_nd(y_pred, tf.where(y_pred > THRESHOLD))
    
    # Number of elements above the threshold
    y_extra_len = len(tf.where(y_pred > THRESHOLD))
    
    # Assign input_ids, indices and values to the extra training data
    X_extra[idx] = input_ids
    y_extra_indices[idx, :y_extra_len] = y_pred_i
    y_extra_values[idx, :y_extra_len] = y_extra_v
    
    # Add to the no-validation arrays, skipping validation samples
    if row_idx not in val_indeces_set:
        X_extra_no_val[idx_no_val] = input_ids
        y_extra_indices_no_val[idx_no_val, :y_extra_len] = y_pred_i
        y_extra_values_no_val[idx_no_val, :y_extra_len] = y_extra_v
        idx_no_val += 1

# Save Extra Training Data

In [None]:
# Save X_extra and y_extra, excluding the validation samples
np.save('./X_extra_no_val.npy', X_extra_no_val)
np.save('./y_extra_indices_no_val.npy', y_extra_indices_no_val)
np.save('./y_extra_values_no_val.npy', y_extra_values_no_val)

# Save X_extra and y_extra, including the validation samples
np.save('./X_extra.npy', X_extra)
np.save('./y_extra_indices.npy', y_extra_indices)
np.save('./y_extra_values.npy', y_extra_values)
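When the saved arrays are consumed later, the -1 padding has to be stripped before the labels can be used. Below is a minimal sketch of how one row could be turned back into a dense label matrix. The helper `to_dense` is hypothetical (it is not part of this notebook or the training notebook), and the toy arrays stand in for one row of `y_extra_indices` and `y_extra_values`.

```python
import numpy as np


def to_dense(indices, values, n_tokens, n_labels):
    """Rebuild a dense [n_tokens, n_labels] label matrix from -1 padded arrays."""
    dense = np.zeros((n_tokens, n_labels), dtype=np.float32)
    valid = values >= 0  # padding slots hold -1
    rows = indices[valid, 0]
    cols = indices[valid, 1]
    dense[rows, cols] = values[valid]
    return dense


# Toy stand-ins for one row of y_extra_indices / y_extra_values
toy_indices = np.array([[1, 2], [2, 2], [5, 0], [-1, -1]], dtype=np.int16)
toy_values = np.array([0.91, 0.64, 0.08, -1.0], dtype=np.float32)

dense = to_dense(toy_indices, toy_values, n_tokens=8, n_labels=5)
print(dense)
```

On the real data the same call would presumably take a row loaded with `np.load('./y_extra_indices.npy')` and `np.load('./y_extra_values.npy')`, with `n_tokens=SEQ_LENGTH` and `n_labels=N_LABELS`.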