# About the task
This notebook tackles a NER (Named Entity Recognition) task on the Kinbiotics labeled dataset. The dataset contains 370 medical notes, in each of which words or sentences are highlighted and labeled as Symptom or Pathogen.
The goal is to train a model that can recognize these two types of entities in unseen medical notes.

For this task we relied mainly on spaCy, a popular library for a variety of NLP tasks. It provides pretrained models that can then be fine-tuned on arbitrary domains.


# Imports

In [None]:
import pandas as pd
import json
import spacy
from tqdm import tqdm
import os
from spacy.util import filter_spans
from spacy.tokens import DocBin
from spacy import displacy

# Preprocessing

This task required little preprocessing. The main step was to convert the labeled data into a format more suitable for the model.

In [None]:
df = pd.read_csv('KINBIOTICS_NER.csv')
df = df.dropna()

In [None]:
split=0.2
df_train = df[:round(len(df)*(1-split))]
df_val = df[round(len(df)*(1-split)):]

Here we simply restructure the data. The output of the next cell is a list of dictionaries, each holding the note text and a list of entities, where every entity is stored with its start position, end position, and label.


In [None]:
# Initialize an empty list to store training data
train_data = []

# Iterate over each row in the training DataFrame 'df_train'
for i, row in df_train.iterrows():
    # Create a temporary dictionary to store information for each row
    temp_dict = dict()

    # Extract the 'text' column value from the current row and store it in the dictionary
    temp_dict['text'] = row['text']

    # Parse the 'label' column value and store it in a variable
    json_label = json.loads(row['label'])

    # Initialize an empty list to store entity information for each row
    labels_per_row = []

    # Iterate over each dictionary in the parsed JSON label
    for d in json_label:
        # Extract 'start', 'end', and 'labels' values from the current dictionary
        start = d['start']
        end = d['end']
        labels = d['labels'][0]

        # Append a tuple containing 'start', 'end', and 'labels' to the list
        labels_per_row.append((start, end, labels))

    # Store the list of entities in the temporary dictionary
    temp_dict['entities'] = labels_per_row

    # Append the temporary dictionary to the list of training data
    train_data.append(temp_dict)




In [None]:
# same for validation set 
val_data = []

for i, row in df_val.iterrows():
    temp_dict = dict()
    temp_dict['text'] = row['text']
    json_label = json.loads(row['label'])
    labels_per_row = []

    for d in json_label:
        start = d['start']
        end = d['end']
        labels = d['labels'][0]
        labels_per_row.append((start, end, labels))

    temp_dict['entities'] = labels_per_row

    val_data.append(temp_dict)
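
The training and validation loops are identical apart from the DataFrame they read, so they could share one helper. A minimal sketch, assuming each `label` value is a JSON list of objects with `start`, `end` and `labels` keys as in the cells above; `rows_to_examples` is a hypothetical name that is not part of the original notebook, and the sample row is made up:

```python
import json


def rows_to_examples(rows):
    """Convert (text, label_json) pairs into dicts with 'text' and 'entities'."""
    examples = []
    for text, label_json in rows:
        entities = [
            (d["start"], d["end"], d["labels"][0])  # keep only the first label
            for d in json.loads(label_json)
        ]
        examples.append({"text": text, "entities": entities})
    return examples


# Made-up row for illustration only
sample_rows = [
    (
        "fever and E. coli",
        '[{"start": 0, "end": 5, "labels": ["Symptom"]},'
        ' {"start": 10, "end": 17, "labels": ["Pathogen"]}]',
    ),
]
examples = rows_to_examples(sample_rows)
print(examples)
```

With the notebook's data this would be called as `rows_to_examples(zip(df_train['text'], df_train['label']))`, and likewise for `df_val`.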


spaCy represents entities as Span objects, so the training data must be converted into a format spaCy can read. The final output is a binary file (`.spacy`).

In [None]:
# Create a blank spaCy NLP pipeline for English. In this case we're using a blank model
nlp = spacy.blank('en')

# Create a DocBin to store processed spaCy documents
doc_bin = DocBin()

# Iterate over each training example in the 'train_data'
for training_example in tqdm(train_data):
    # Extract text and labeled entities from the training example
    text = training_example['text']
    labels = training_example['entities']

    # Create a spaCy Doc object from the text
    doc = nlp.make_doc(text)

    # Initialize an empty list to store spaCy spans representing entities
    ents = []

    # Iterate over each labeled entity in the training example
    for start, end, label in labels:
        # Convert the labeled entity into a spaCy Span
        span = doc.char_span(start, end, label=label, alignment_mode="strict")

        # Check if the span is valid
        if span is None:
            print("Skipping entity")
        else:
            # Append the valid span to the list of entities
            ents.append(span)

    # Filter overlapping or conflicting spans
    filtered_ents = filter_spans(ents)

    # Set the entities of the spaCy Doc to the filtered entities
    doc.ents = filtered_ents

    # Add the processed spaCy Doc to the DocBin
    doc_bin.add(doc)

# Save the processed spaCy documents to a binary file named "train.spacy"
doc_bin.to_disk("train.spacy")


In [None]:
# same for validation set
doc_bin = DocBin()
for val_example in tqdm(val_data):
  text = val_example['text']
  labels = val_example['entities']
  doc = nlp.make_doc(text)
  ents = []

  for start, end, label in labels: # converting our entities into spans
    span = doc.char_span(start, end, label=label, alignment_mode="strict")
    if span is None:
      print("Skipping entity")
    else: ents.append(span)

  filtered_ents = filter_spans(ents)
  doc.ents = filtered_ents
  doc_bin.add(doc)
doc_bin.to_disk("val.spacy")
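
`filter_spans` resolves overlapping annotations by preferring the longest span and, when lengths are equal, the one that starts first. The rule can be illustrated in plain Python on `(start, end, label)` tuples. This is a simplified sketch that works on character offsets rather than spaCy tokens; `filter_overlaps` is a hypothetical helper, not the library function, and the spans are made up:

```python
def filter_overlaps(spans):
    """Keep non-overlapping (start, end, label) spans, preferring longer ones."""
    # Longest spans first; among equal lengths, the earlier start wins
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # Keep the span only if it does not overlap any span already kept
        # (half-open ranges: [start, end))
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    # Return the survivors in document order
    return sorted(kept)


spans = [(0, 10, "Symptom"), (6, 10, "Symptom"), (15, 22, "Pathogen")]
filtered = filter_overlaps(spans)
print(filtered)
```

Here the shorter span `(6, 10)` is dropped because it lies inside the longer `(0, 10)` span, while the separate Pathogen span is kept.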

# Configuring the spacy model

The model is configured by specifying its desired characteristics in `base_config.cfg`. Here we state which kind of model we want to use. We used `en_core_web_sm` because technical issues prevented us from using the transformer-based model or the larger one. More details on model configuration are available [here](https://spacy.io/usage/training).



In [None]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
# You may first need to install the underlying model: en_core_web_sm (small) or en_core_web_lg (large)
#!python -m spacy download en_core_web_sm

# Training

`--paths.train ./train.spacy` specifies the data the model is trained on, and
`--paths.dev ./val.spacy` does the same for validation.
Training follows the settings in `config.cfg`.


In [None]:
!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./val.spacy

ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.0001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    311.90    0.89    0.58    1.95    0.01
  0     200        296.57  35078.51    0.28  100.00    0.14    0.00
  1     400       2896.27   6825.48   15.20   62.63    8.65    0.15
  2     600       2760.43   6271.69   23.83   50.22   15.62    0.24
  2     800       6222.61   7362.15   28.63   50.71   19.94    0.29
  3    1000        338.56   3953.26   37.10   45.98   31.10    0.37
  4    1200        721.10   3709.78   39.68   47.30   34.17    0.40
  4    1400        300.71   3185.29   37.77   63.28   26.92    0.38
  5    1600        558.27   3484.31   46.93   50.73   43.65    0.47
  6    1800        301.92   3152.81   47.59   45.8

The LOSS TOK2VEC and LOSS NER columns show the loss values of the token-to-vector and named entity recognition components of the pipeline. The ENTS_F, ENTS_P, and ENTS_R columns show the F-score, precision, and recall for the named entity task (more detailed info [here](https://spacy.io/models/en)). The SCORE column shows the overall score of the pipeline.
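
To make the three metrics concrete, precision, recall and F-score can be computed by hand from sets of gold and predicted entities. The sketch below assumes that an entity counts as correct only when its start, end and label all match exactly; the example spans are made up, and `prf` is a hypothetical helper:

```python
def prf(gold, predicted):
    """Return (precision, recall, f1) for two sets of (start, end, label) spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    (18, 25, "Symptom"),
    (28, 39, "Symptom"),
    (60, 67, "Pathogen"),
    (80, 85, "Symptom"),
}
# The second prediction has the right offsets but the wrong label
predicted = {(18, 25, "Symptom"), (60, 67, "Symptom")}
precision, recall, f1 = prf(gold, predicted)
print(precision, recall, f1)

# The same harmonic-mean formula links the logged columns, e.g. the row
# at step 400 of the training output (ENTS_P=62.63, ENTS_R=8.65)
f_step_400 = 2 * 62.63 * 8.65 / (62.63 + 8.65)
print(round(f_step_400, 2))
```

The last two lines recompute the ENTS_F value of one logged row from its precision and recall, which is a quick way to check how the columns relate.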


From the output above we can see that after 15 epochs the F1 score stabilizes around 0.5, so we can stop training. Training further would risk overfitting the model.
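
The decision to stop once the score plateaus can also be automated. Below is a minimal sketch of patience-based stopping on a list of evaluation scores. The function name and the patience values are illustrative assumptions; spaCy's own `patience` setting in the `[training]` block of the config counts training steps rather than evaluations.

```python
def stopping_point(scores, patience):
    """Return the index of the evaluation at which training would stop,
    or None if the score never stalls for `patience` evaluations."""
    best_score = float("-inf")
    best_index = -1
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_index = score, i
        elif i - best_index >= patience:
            return i
    return None


# ENTS_F values copied from the (truncated) training log above
ents_f = [0.89, 0.28, 15.20, 23.83, 28.63, 37.10, 39.68, 37.77, 46.93, 47.59]
stop_logged = stopping_point(ents_f, patience=2)
print(stop_logged)

# Made-up plateau for illustration
stop_plateau = stopping_point([40.0, 48.0, 50.1, 49.8, 50.0, 49.9], patience=3)
print(stop_plateau)
```

On the logged values the score is still improving, so no stopping point is found; on the made-up plateau, training would stop three evaluations after the best score.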


# Prediction

In [None]:
# Example of a prediction. Unfortunately, the highlighted entities are not rendered on GitHub.

nlp_ner = spacy.load('model-best')

path='C:\\Users\\valif\\OneDrive\\Desktop\\Medical_notes_expl\\processed\\processed\\21_1115.txt'
with open(path, 'r') as file:
  note = file.read()

doc= nlp_ner(note)

colors = {"Pathogen": "#F67DE3", "Symptom": "#7DF6D9"}
options = {"colors": colors}
displacy.render(doc, style="ent", options=options, jupyter=True)

# Applying the model on the unlabeled dataset

In [None]:
result={'text': [],
        'entities': [] # one list of entity dictionaries per note
        }

folder_path = '/content/drive/MyDrive/Colab Notebooks/processed/processed'

# Get a list of all text files in the folder
text_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]

# Iterate over each text file
for file_name in text_files:
    file_path = os.path.join(folder_path, file_name)

    with open(file_path, 'r') as file:
        note = file.read()
        doc= nlp_ner(note)
        result['text'].append(note)
        ents_list= []

        for ent in doc.ents:
          ents_dict = dict()
          ents_dict['start'] = ent.start_char
          ents_dict['end'] = ent.end_char
          ents_dict['piece'] = ent.text
          ents_dict['label'] = ent.label_
          ents_list.append(ents_dict)

        result['entities'].append(ents_list)





In [None]:
ner_small_annotations = pd.DataFrame(result)

In [None]:
ner_small_annotations

| | text | entities |
|---|---|---|
| 0 | `Chief Complaint:\n hypoxia / hypotension / fev...` | `[{'start': 18, 'end': 25, 'piece': 'hypoxia', ...` |
| 1 | `Chief Complaint:\n sepsis due to intra-abdomin...` | `[{'start': 18, 'end': 24, 'piece': 'sepsis', '...` |
| 2 | `Chief Complaint:\n chest pain\n\nHistory of Pr...` | `[{'start': 18, 'end': 28, 'piece': 'chest pain...` |
| 3 | `Chief Complaint:\n fever to 103.0 at home, rle...` | `[{'start': 18, 'end': 23, 'piece': 'fever', 'l...` |
| 4 | `Chief Complaint:\n anemia, fatigue\n\nHistory ...` | `[{'start': 18, 'end': 24, 'piece': 'anemia', '...` |
| ... | ... | ... |
| 5431 | `Chief Complaint:\n right-sided scapular pain\n...` | `[{'start': 18, 'end': 43, 'piece': 'right-side...` |
| 5432 | `Chief Complaint:\n recurrent diverticulitis\n\...` | `[{'start': 18, 'end': 42, 'piece': 'recurrent ...` |
| 5433 | `Chief Complaint:\n generalized weakness, diffu...` | `[{'start': 18, 'end': 38, 'piece': 'generalize...` |
| 5434 | `Chief Complaint:\n watery, nonbloody diarrhea\...` | `[{'start': 18, 'end': 44, 'piece': 'watery, no...` |






In [None]:
ner_small_annotations.to_csv('ner_small_annotations.csv')
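
When the file is read back, the `entities` column comes back as text, because `to_csv` writes each list of dictionaries in its Python string form. Below is a minimal sketch of recovering the original objects with `ast.literal_eval`. It uses the standard `csv` module and an in-memory buffer so that it runs on its own, and the sample row is made up.

```python
import ast
import csv
import io

entities = [{"start": 18, "end": 25, "piece": "hypoxia", "label": "Symptom"}]

# Write one row the way a cell holding a list ends up in the CSV: as str(list)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "entities"])
writer.writerow(["Chief Complaint:\n hypoxia", str(entities)])

# Read it back: the entities cell is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["entities"]

# literal_eval safely parses Python literals; json.loads would reject
# the single-quoted keys produced by str()
restored = ast.literal_eval(raw)
print(type(raw).__name__, restored[0]["label"])
```

With pandas, the same conversion would be applied after `pd.read_csv`, for example with `df['entities'].apply(ast.literal_eval)`.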