This simple harness is a helper for calling the ML model hosted at
https://huggingface.co/ClinicalNLP/SDOHv7

The objective of this code is to illustrate:

1. How to pull the model from the Hugging Face repo.
2. How to pass a variable-length clinical note as input.
3. Example pre-processing steps you might need before calling the model.
4. How to filter the output down to the tags you are interested in.

Have fun!


In [1]:
pip install transformers




In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("ClinicalNLP/SDOHv7")
model = AutoModelForSequenceClassification.from_pretrained("ClinicalNLP/SDOHv7")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [3]:
# Getting Labels
id2labeldict = {
    "0": "Access to Care",
    "1": "Access to Care Contradiction",
    "2": "Depression",
    "3": "Depression Contradiction",
    "4": "Economic Instability",
    "5": "Economic Instability Contradiction",
    "6": "Employment Stress",
    "7": "Employment Stress Contradiction",
    "8": "Exposure to Violence",
    "9": "Exposure to Violence Contradiction",
    "10": "Food Insecurity ",
    "11": "Food Insecurity Contradiction",
    "12": "Housing Instability",
    "13": "Housing Instability Contradiction",
    "14": "Limited Language (English) Proficiency",
    "15": "Limited Language Proficiency Contradiction",
    "16": "Neutral",
    "17": "Social Context",
    "18": "Social Context Contradiction",
    "19": "Substance Abuse",
    "20": "Substance Abuse Contradiction",
    "21": "Transportation",
    "22": "Transportation Contradiction"
}

In [4]:
column_names = list(id2labeldict.values())
column_names

['Access to Care',
 'Access to Care Contradiction',
 'Depression',
 'Depression Contradiction',
 'Economic Instability',
 'Economic Instability Contradiction',
 'Employment Stress',
 'Employment Stress Contradiction',
 'Exposure to Violence',
 'Exposure to Violence Contradiction',
 'Food Insecurity ',
 'Food Insecurity Contradiction',
 'Housing Instability',
 'Housing Instability Contradiction',
 'Limited Language (English) Proficiency',
 'Limited Language Proficiency Contradiction',
 'Neutral',
 'Social Context',
 'Social Context Contradiction',
 'Substance Abuse',
 'Substance Abuse Contradiction',
 'Transportation',
 'Transportation Contradiction']

Take the full note and split it into pieces. We can use a tokenizer to split it into sentences.
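As a rough illustration of sentence splitting without any extra library, the sketch below uses a regular expression. The helper name `split_sentences` and the shortened note are assumptions made for this example. The rule is naive and will mis-split abbreviations such as "Dr.", so treat it as a baseline rather than a replacement for a proper sentence tokenizer.

```python
import re

def split_sentences(text):
    """Naive splitter (hypothetical helper): break after '.', '!' or '?'
    whenever the next non-space character is an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s*(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

# Shortened, illustrative note; "self.The" deliberately has no space after the period.
note = ("Care plan problems started on admission: Inability to care for self."
        "The patient received orientation to the unit and schedule. "
        "The patient is homeless")

for sentence in split_sentences(note):
    print(sentence)
```

The pattern also splits when the space after the period is missing, as in "self.The", which is common in pasted clinical text.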

In [5]:
longtext = "Admission Summary Note Reason for admission: Patient has a history of Schizophrenia. Medical history includes hypertension, asthma, blood disorder. Care plan problems started on admission: Inability to care for self.The patient received orientation to the unit and schedule, and was provided with a patient handbook. Patient placed on close observation, fall and seizure precautions and will be monitored for safety. The patient is homeless"

In [6]:
import re

# Step 1: Remove whitespace after ']' so each checkbox sits directly before its label
longtext = re.sub(r'\]\s+', ']', longtext)

# Step 2: Insert begin and end tags around the section of interest
longtext = re.sub(r'(PATIENT PARTICIPATION LEVEL)', r'*BEGIN* \1', longtext)
longtext = re.sub(r'(Refused teaching)', r'\1 *END*', longtext)
# TODO: text that was cut off should be rejoined here (to be implemented later)

# Step 3: Drop unselected checkboxes ('[]') together with their labels
longtext = re.sub(r"(?<!\[x\])\[\][^[]*", "", longtext)

# Step 4: Remove the '[x]' marker and keep the selected label, followed by a comma
longtext = re.sub(r"\[x\]([^\[]+)", r"\1,", longtext)

print(longtext)


Admission Summary Note Reason for admission: Patient has a history of Schizophrenia. Medical history includes hypertension, asthma, blood disorder. Care plan problems started on admission: Inability to care for self.The patient received orientation to the unit and schedule, and was provided with a patient handbook. Patient placed on close observation, fall and seizure precautions and will be monitored for safety. The patient is homeless
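The sample note contains no checkboxes, so the pre-processing cell leaves it unchanged. To show what the checkbox steps do, here is a minimal sketch that applies the same three regular expressions to a made-up line. The function name `clean_checkbox_note` and the sample text are assumptions for illustration only; real notes may use a different checkbox layout.

```python
import re

def clean_checkbox_note(text):
    """Apply the checkbox clean-up steps from the cell above (hypothetical wrapper)."""
    # Remove whitespace after ']' so each checkbox touches its label
    text = re.sub(r"\]\s+", "]", text)
    # Drop unselected checkboxes '[]' together with their labels
    text = re.sub(r"(?<!\[x\])\[\][^[]*", "", text)
    # Strip the '[x]' marker and keep the selected label, followed by a comma
    text = re.sub(r"\[x\]([^\[]+)", r"\1,", text)
    return text

# Made-up checkbox-style line, not taken from a real note
sample = "Barriers: [x] Homeless [] Language barrier [x] No transportation"
print(clean_checkbox_note(sample))
```

Only the selected items survive, separated by commas, and the unselected "Language barrier" entry is dropped.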


Alternatively, you can use the spaCy library to split the text into sentences. Calling nlp() on the note returns a Doc whose sents attribute iterates over the detected sentences, which is an alternative to NLTK's sent_tokenize. When processing many notes, spaCy's nlp.pipe() method is the more efficient option because it processes texts in batches.

In [7]:
import spacy

# Requires the English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(longtext)

In [8]:
sentences = [sent.text for sent in doc.sents]

In [9]:
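import torch

# Calling the tokenizer directly is the recommended replacement for batch_encode_plus.
# max_length=512 is an assumed limit; without it, truncation=True has no effect
# because this tokenizer does not define a maximum length.
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
outputs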


SequenceClassifierOutput(loss=None, logits=tensor([[ 0.4373, -0.3922,  1.3240,  0.3030, -2.0260, -1.7488,  0.0648,  0.3380,
         -0.4541, -0.4904, -4.8016, -3.5748, -0.4700, -0.9205, -0.3203,  0.0347,
          7.6851, -1.9584, -1.3372, -0.1237, -0.3252, -2.3005, -1.9093],
        [ 0.1461, -0.4948,  0.9277,  0.1993, -3.0834, -2.2838, -0.0186, -0.2486,
         -0.2171, -0.5804, -4.9049, -4.0466, -0.2182, -1.1813,  0.5677,  1.2103,
          7.0053, -2.2360, -0.7461,  0.6266,  0.7917, -2.5804, -1.6354],
        [-0.2641, -1.7926,  2.6030,  0.0612, -1.7091, -2.1063,  1.1077,  0.3368,
          0.1457, -1.0507, -4.5821, -4.4207,  0.2965, -0.9919, -0.1519, -0.4947,
          6.6878, -2.3544, -1.2144,  0.7438,  0.1722, -1.9070, -2.2937],
        [ 0.8823,  0.2127,  0.2516, -0.2082, -2.1774, -1.7611, -0.2545, -0.8782,
         -0.3510,  0.0490, -4.7935, -3.4372, -0.9634, -1.5407,  0.5460,  0.9817,
          7.5363, -2.0455, -1.1867,  0.0564, -0.9804, -2.0031, -0.3844],
        [-0.1601,

In [10]:
from torch import nn
pt_predictions = nn.functional.softmax(outputs.logits, dim=-1)
print(pt_predictions)

tensor([[7.0641e-04, 3.0819e-04, 1.7146e-03, 6.1760e-04, 6.0155e-05, 7.9370e-05,
         4.8672e-04, 6.3965e-04, 2.8967e-04, 2.7936e-04, 3.7484e-06, 1.2782e-05,
         2.8512e-04, 1.8170e-04, 3.3115e-04, 4.7229e-04, 9.9250e-01, 6.4358e-05,
         1.1978e-04, 4.0308e-04, 3.2955e-04, 4.5712e-05, 6.7595e-05],
        [1.0311e-03, 5.4317e-04, 2.2530e-03, 1.0875e-03, 4.0806e-05, 9.0783e-05,
         8.7454e-04, 6.9487e-04, 7.1709e-04, 4.9863e-04, 6.6019e-06, 1.5576e-05,
         7.1627e-04, 2.7341e-04, 1.5719e-03, 2.9887e-03, 9.8220e-01, 9.5231e-05,
         4.2251e-04, 1.6671e-03, 1.9665e-03, 6.7485e-05, 1.7363e-04],
        [9.2353e-04, 2.0029e-04, 1.6242e-02, 1.2786e-03, 2.1773e-04, 1.4635e-04,
         3.6410e-03, 1.6844e-03, 1.3913e-03, 4.2059e-04, 1.2308e-05, 1.4463e-05,
         1.6178e-03, 4.4605e-04, 1.0332e-03, 7.3334e-04, 9.6527e-01, 1.1420e-04,
         3.5705e-04, 2.5305e-03, 1.4287e-03, 1.7864e-04, 1.2135e-04],
        [1.2775e-03, 6.5397e-04, 6.7990e-04, 4.2929e-04, 5.99
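To make the softmax step concrete, the sketch below recomputes it in plain Python for the first row of logits printed above and picks the most probable class index. The helper `softmax` is written here for illustration and is not part of the notebook; it subtracts the maximum logit first for numerical stability.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# First row of logits, copied from the model output above (4 decimal places)
row0_logits = [
    0.4373, -0.3922, 1.3240, 0.3030, -2.0260, -1.7488, 0.0648, 0.3380,
    -0.4541, -0.4904, -4.8016, -3.5748, -0.4700, -0.9205, -0.3203, 0.0347,
    7.6851, -1.9584, -1.3372, -0.1237, -0.3252, -2.3005, -1.9093,
]

probs = softmax(row0_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs[best])
```

Index 16 maps to "Neutral" in id2labeldict, and its probability agrees with the first row of the tensor printed above up to rounding.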

In [11]:
import pandas as pd

# Convert the probabilities to a NumPy array and label the columns
predarr = pt_predictions.detach().numpy()
df = pd.DataFrame(predarr, columns=column_names)
df.head()

,Access to Care,Access to Care Contradiction,Depression,Depression Contradiction,Economic Instability,Economic Instability Contradiction,Employment Stress,Employment Stress Contradiction,Exposure to Violence,Exposure to Violence Contradiction,...,Housing Instability Contradiction,Limited Language (English) Proficiency,Limited Language Proficiency Contradiction,Neutral,Social Context,Social Context Contradiction,Substance Abuse,Substance Abuse Contradiction,Transportation,Transportation Contradiction
0,0.000706,0.000308,0.001715,0.000618,6e-05,7.9e-05,0.000487,0.00064,0.00029,0.000279,...,0.000182,0.000331,0.000472,0.992501,6.4e-05,0.00012,0.000403,0.00033,4.6e-05,6.8e-05
1,0.001031,0.000543,0.002253,0.001087,4.1e-05,9.1e-05,0.000875,0.000695,0.000717,0.000499,...,0.000273,0.001572,0.002989,0.982204,9.5e-05,0.000423,0.001667,0.001966,6.7e-05,0.000174
2,0.000924,0.0002,0.016242,0.001279,0.000218,0.000146,0.003641,0.001684,0.001391,0.000421,...,0.000446,0.001033,0.000733,0.965267,0.000114,0.000357,0.00253,0.001429,0.000179,0.000121
3,0.001278,0.000654,0.00068,0.000429,6e-05,9.1e-05,0.00041,0.00022,0.000372,0.000555,...,0.000113,0.000913,0.001411,0.991173,6.8e-05,0.000161,0.000559,0.000198,7.1e-05,0.00036
4,0.000538,0.000287,0.002608,0.000338,0.000112,0.000142,0.001107,0.000482,0.001361,0.000468,...,0.000144,0.000384,0.000567,0.987981,6.9e-05,0.00017,0.002006,0.000402,8.8e-05,0.000164


In [12]:
# df holds one row of label probabilities per sentence

# Collect the top 3 labels and their probabilities for each row
rows = []
for _, row in df.iterrows():
    top_3_probs = row.nlargest(3)
    top_3_labels = top_3_probs.index
    top_3_values = top_3_probs.values
    rows.append({
        "label_1": top_3_labels[0], "prob_1": top_3_values[0],
        "label_2": top_3_labels[1], "prob_2": top_3_values[1],
        "label_3": top_3_labels[2], "prob_3": top_3_values[2],
    })

# Build the dataframe once; concatenating inside the loop is slow and
# triggers a pandas FutureWarning when the starting dataframe is empty
top_3 = pd.DataFrame(rows, columns=["label_1", "prob_1", "label_2", "prob_2", "label_3", "prob_3"])

# Append the top 3 labels and probabilities to each row of the original dataframe
df = pd.concat([df, top_3], axis=1)




In [13]:
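top_3['TextOfInt'] = sentences
# Note: style.set_properties returns a Styler and does not modify top_3;
# to see the wider column, display the returned Styler instead of top_3.head().
top_3.style.set_properties(subset=["TextOfInt"], **{'width': '500px'})
top_3.head(25)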

,label_1,prob_1,label_2,prob_2,label_3,prob_3,TextOfInt
0,Neutral,0.992501,Depression,0.001715,Access to Care,0.000706,Admission Summary Note Reason for admission: P...
1,Neutral,0.982204,Limited Language Proficiency Contradiction,0.002989,Depression,0.002253,"Medical history includes hypertension, asthma,..."
2,Neutral,0.965267,Depression,0.016242,Employment Stress,0.003641,Care plan problems started on admission: Inabi...
3,Neutral,0.991173,Limited Language Proficiency Contradiction,0.001411,Access to Care,0.001278,The patient received orientation to the unit a...
4,Neutral,0.987981,Depression,0.002608,Substance Abuse,0.002006,"Patient placed on close observation, fall and ..."
5,Housing Instability,0.984358,Economic Instability,0.002753,Transportation,0.002599,The patient is homeless


In [14]:
# Keep only rows whose top label passes a confidence threshold (e.g. 75%)
# and drop rows whose top label is Neutral
df_filtered = top_3[(top_3['prob_1'] > 0.75) & (top_3['label_1'] != 'Neutral')]
df_filtered.head()


,label_1,prob_1,label_2,prob_2,label_3,prob_3,TextOfInt
5,Housing Instability,0.984358,Economic Instability,0.002753,Transportation,0.002599,The patient is homeless
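The same filter can be wrapped in a small reusable function so the threshold and the ignored labels become parameters. The sketch below is a plain-Python stand-in for the DataFrame expression; the name `filter_confident` and the tuple layout are assumptions for this example, and the values are the label_1 and prob_1 columns from the top_3 table above.

```python
# (row index, top label, top probability) copied from the top_3 table above
top_rows = [
    (0, "Neutral", 0.992501),
    (1, "Neutral", 0.982204),
    (2, "Neutral", 0.965267),
    (3, "Neutral", 0.991173),
    (4, "Neutral", 0.987981),
    (5, "Housing Instability", 0.984358),
]

def filter_confident(rows, threshold=0.75, ignore_labels=("Neutral",)):
    """Keep rows whose top probability exceeds the threshold
    and whose top label is not in ignore_labels (hypothetical helper)."""
    return [r for r in rows if r[2] > threshold and r[1] not in ignore_labels]

print(filter_confident(top_rows))
```

Raising the threshold trades recall for precision; the 0.75 default simply mirrors the example cell and is not a tuned value.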
