The original intention of this project was to train an NER model on anonymised clinical data obtained from an official source and annotated by trained medical professionals.
Instead, the CANTEMIST competition dataset is used: a collection of clinical notes in Spanish, originally annotated with BRAT in .ann format.
Because of time and budget constraints, the data used here was annotated with the GPT-4 API (see annotateGPT_bg.ipynb) to identify the symptoms, conditions, findings and tests in the clinical notes. The training set is currently small, for demonstration purposes; the intention is to expand it with professional annotations in the future.

spaCy is used to train the NER model on the GPT-4-annotated data. The model is trained on the training set and then demonstrated.
Evaluation is planned for the future, based on F1 score and precision. The model will be used to extract entities from the clinical notes; a later step is to train a machine learning model that predicts the occurrence of cancer from those entities (symptoms, conditions, findings, etc.).

Set up the environment:

In [1]:
# %pip install -U spacy
# !python3 -m spacy download es_core_news_md

The data used for development should be placed in the final_data/ folder as paired .txt and .ann files: the .txt files contain the clinical notes and the .ann files contain the annotations in BRAT format. Note that the loading cell further down calls get_data("./llm_data/"), so the folder name passed there must match wherever the files actually live.
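
For reference, a text-bound annotation line in BRAT standoff format has three tab-separated fields: an ID beginning with T, a field holding the label and character offsets, and the covered text. The sketch below parses one such line. The names BratEntity and parse_brat_line are hypothetical and not part of this project, and discontinuous spans (offsets separated by ";") are reduced to their outermost bounds as a simplifying assumption.

```python
from typing import NamedTuple, Optional


class BratEntity(NamedTuple):
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_line(line: str) -> Optional[BratEntity]:
    """Parse one text-bound ('T') line of a BRAT .ann file.

    Returns None for blank lines and for other annotation types
    (relations, attributes, notes), which do not start with 'T'.
    """
    if not line.startswith("T"):
        return None
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:
        return None
    ann_id, span_info, text = parts[0], parts[1], parts[2]
    label, offsets = span_info.split(" ", 1)
    # Discontinuous spans look like "0 5;10 15"; keep the outermost bounds.
    numbers = [int(n) for n in offsets.replace(";", " ").split()]
    return BratEntity(ann_id, label, min(numbers), max(numbers), text)


example = parse_brat_line("T1\tSYMPTOM 0 10\tAnisocoria")
print(example)
```

When the offsets in the .ann file can be trusted, the start and end values can be used directly instead of searching the text for the entity string.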

Load the model and the training data

In [27]:
import spacy
from find_entity import find_start_end

# es_core_news_md - medium-sized model for Spanish
nlp = spacy.load("es_core_news_md")
# nlp = spacy.blank("es") for blank

path = "./final_data/"
# the folder in the variable path contains the .txt and .ann files for the training data
# we need to import them and the entities like (Text, {entities: [(start, end, label)]})

# get entities from .ann file and text from .txt file
def get_entities(text_path, annotation_path):
    with open(text_path, "r", encoding='utf-8') as file:
        text = file.read()
    with open(annotation_path, "r", encoding='utf-8') as file:
        annotations = file.read().splitlines()
    entities = []
    for annotation in annotations:
        split = annotation.split()
        # keep only text-bound annotations ("T" lines): .ann files may also contain
        # relations, attributes or notes, and a blank line would break annotation[0]
        if annotation.startswith("T") and len(split) > 4:
            # e.g. annotation.split() == ['T1', 'SYMPTOM', '0', '9', 'Anisocoria']
            _, label, _, _, *entity = split # *entity is the rest of the list
            joined_entity = ' '.join(entity)
            # now we need to search the text for the entity (only because we worked with GPT, if not then can just say _, label, start, end, *entity)
            start, end = find_start_end(text.lower(), joined_entity.lower())
            entities.append((int(start), int(end), label, joined_entity))
    return (text, {"entities": entities})
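
The helper find_start_end is imported from find_entity, which is not shown in this notebook. The following is only a guess at its behaviour, assuming it returns the character offsets of the first occurrence of the entity and (-1, -1) when there is no match (the later `start == -1` check suggests such a sentinel). The name find_start_end_sketch is hypothetical.

```python
from typing import Tuple


def find_start_end_sketch(text: str, entity: str) -> Tuple[int, int]:
    """Return (start, end) of the first occurrence of `entity` in `text`.

    Assumption: (-1, -1) signals "not found"; the real find_entity module
    may behave differently.
    """
    if not entity:
        return -1, -1
    start = text.find(entity)
    if start == -1:
        return -1, -1
    return start, start + len(entity)


note = "Paciente con anisocoria y dolor abdominal."
print(find_start_end_sketch(note.lower(), "Anisocoria".lower()))
```

If the real helper also returns only the first match, repeated mentions of the same entity would map to identical offsets, which would be consistent with the duplicate conflict messages in the output further down.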


In [28]:
# get entities from all .ann and .txt files in a folder
import os
from spacy.tokens import DocBin
from tqdm import tqdm

# this will get the data from all the .txt and .ann files in a folder
def get_data(folder):
    data = []
    fol = os.listdir(folder)
    for file in fol:
        if file.endswith(".txt"):
            text_path = os.path.join(folder, file)
            annp = file.replace(".txt", ".ann")
            annotation_path = os.path.join(folder, annp)
            print(annotation_path, annp in fol)
            if annp in fol:
                print("OK")
                data.append(get_entities(text_path, annotation_path))
    return data

training_data = get_data("./llm_data/")

db = DocBin() # create a DocBin object which will store the training examples
for text, annotations in tqdm(training_data):
    doc = nlp(text)
    # note: doc.char_span returns None when the offsets do not align with token
    # boundaries, which usually means the annotation offsets are slightly off
    ents = []
    starts_ends = [] # to check for conflicts
    for obj in annotations["entities"]:
        start = obj[0]
        end = obj[1]
        label = obj[2]
        span = doc.char_span(start, end, label=label)
        if span is None:
            # print("Skipping entity", span)
            # print("Error")
            continue
        else:
            # print(f"Adding entity: {span.text} [{start}-{end}, {label}]")
            # check for conflicts
            conflicts = False
            for s,e in starts_ends:
                # check for conflicts and overlaps
                # (start >= s and end <= e) or (start <= s and end >= e)
                if ((start < e and end>s)):
                    conflicts = True
            if conflicts or start == -1:
                print(f"Conflict detected with span: {span.text} [{start}-{end}, {label}]")
            else:
                ents.append(span)
                starts_ends.append((start, end))
    doc.ents = ents
    db.add(doc)
db.to_disk("./final_data_retrain.spacy") # save the DocBin object

./llm_data/caso_clinico_radiologia581.ann True
OK
./llm_data/cc_onco29.ann True
OK
./llm_data/cc_onco609.ann True
OK
./llm_data/cc_onco184.ann True
OK
./llm_data/cc_onco580.ann True
OK
./llm_data/cc_onco224.ann True
OK
./llm_data/caso_clinico_radiologia621.ann True
OK
./llm_data/caso_clinico_radiologia43.ann True
OK
./llm_data/cc_onco146.ann True
OK
./llm_data/cc_onco608.ann True
OK
./llm_data/cc_onco28.ann True
OK
./llm_data/caso_clinico_radiologia580.ann True
OK
./llm_data/caso_clinico_radiologia582.ann True
OK
./llm_data/caso_clinico_radiologia41.ann True
OK
./llm_data/cc_onco144.ann True
OK
./llm_data/caso_clinico_radiologia623.ann True
OK
./llm_data/cc_onco596.ann False
./llm_data/caso_clinico_radiologia622.ann True
OK
./llm_data/cc_onco186.ann True
OK
./llm_data/cc_onco145.ann True
OK
./llm_data/caso_clinico_radiologia40.ann True
OK
./llm_data/caso_clinico_radiologia583.ann True
OK
./llm_data/caso_clinico_radiologia44.ann True
OK
./llm_data/caso_clinico_radiologia50.ann True
OK
.

  2%|▏         | 4/162 [00:00<00:17,  9.23it/s]

Conflict detected with span: mama derecha [347-359, ANATOMICAL]
Conflict detected with span: mama derecha [347-359, ANATOMICAL]


 15%|█▍        | 24/162 [00:02<00:09, 13.85it/s]

Conflict detected with span: fosa iliaca derecha [70-89, ANATOMICAL]


 17%|█▋        | 28/162 [00:02<00:10, 13.38it/s]

Conflict detected with span: voz nasal [309-318, SYMPTOM]
Conflict detected with span: sin alteraciones [713-729, FINDING]


 21%|██        | 34/162 [00:02<00:09, 13.64it/s]

Conflict detected with span: T1 [595-597, ANATOMICAL]


 24%|██▍       | 39/162 [00:03<00:11, 10.79it/s]

Conflict detected with span: FID [92-95, ANATOMICAL]
Conflict detected with span: estudio con TC y biopsia con aguja gruesa [474-515, PROCEDURE]


 25%|██▌       | 41/162 [00:03<00:12,  9.79it/s]

Conflict detected with span: dolor [115-120, FINDING]
Conflict detected with span: bazo [359-363, ANATOMICAL]
Conflict detected with span: aorta [536-541, ANATOMICAL]
Conflict detected with span: parénquima esplénico normal [598-625, ANATOMICAL]
Conflict detected with span: FID [91-94, ANATOMICAL]
Conflict detected with span: calcificaciones en su interior [208-238, FINDING]


 33%|███▎      | 53/162 [00:04<00:06, 15.85it/s]

Conflict detected with span: tumefacción [153-164, FINDING]
Conflict detected with span: oro-hipofaringe [386-401, ANATOMICAL]
Conflict detected with span: seno piriforme izquierdo [439-463, ANATOMICAL]


 38%|███▊      | 61/162 [00:04<00:06, 16.50it/s]

Conflict detected with span: aorta ascendente [222-238, ANATOMICAL]
Conflict detected with span: arco aórtico [723-735, ANATOMICAL]
Conflict detected with span: arteria coronaria derecha [840-865, ANATOMICAL]
Conflict detected with span: lobar occipital izquierdo [235-260, ANATOMICAL]
Conflict detected with span: asa ileal [423-432, ANATOMICAL]


 43%|████▎     | 70/162 [00:05<00:03, 23.49it/s]

Conflict detected with span: lóbulo frontal izquierdo [236-260, ANATOMICAL]
Conflict detected with span: senos maxilares [446-461, ANATOMICAL]
Conflict detected with span: celdillas etmoidales [463-483, ANATOMICAL]
Conflict detected with span: glándula lagrimal izquierda [574-601, ANATOMICAL]
Conflict detected with span: mamografía [130-140, TEST]
Conflict detected with span: carcinoma [164-173, CONDITION]


 47%|████▋     | 76/162 [00:05<00:05, 15.92it/s]

Conflict detected with span: control radiológico [336-355, TEST]
Conflict detected with span: vejiga [249-255, ANATOMICAL]
Conflict detected with span: pared abdominal anterior [447-471, ANATOMICAL]


 52%|█████▏    | 85/162 [00:06<00:04, 16.73it/s]

Conflict detected with span: mal delimitado [496-510, FINDING]
Conflict detected with span: arteria cerebral posterior izquierda [105-141, ANATOMICAL]
Conflict detected with span: colon ascendente [654-670, ANATOMICAL]
Conflict detected with span: colon ascendente [654-670, ANATOMICAL]
Conflict detected with span: trombosis parcial del seno [1015-1041, CONDITION]
Conflict detected with span: seno longitudinal superior [1037-1063, ANATOMICAL]
Conflict detected with span: seno sigmoideo izquierdo [1070-1094, ANATOMICAL]


 54%|█████▍    | 88/162 [00:06<00:05, 13.72it/s]

Conflict detected with span: vértebras sacras [1812-1828, ANATOMICAL]
Conflict detected with span: hueso iliaco derecho [1831-1851, ANATOMICAL]
Conflict detected with span: pala iliaca derecha [1984-2003, ANATOMICAL]
Conflict detected with span: cistoscopia [570-581, TEST]
Conflict detected with span: canal medular [511-524, ANATOMICAL]
Conflict detected with span: cérvico-dorsal [533-547, ANATOMICAL]
Conflict detected with span: C2 [807-809, ANATOMICAL]
Conflict detected with span: T1 [812-814, ANATOMICAL]


 59%|█████▊    | 95/162 [00:06<00:03, 17.62it/s]

Conflict detected with span: ovario derecho [466-480, FINDING]
Conflict detected with span: ovario derecho [466-480, ANATOMICAL]
Conflict detected with span: apéndice cecal [593-607, FINDING]


 62%|██████▏   | 100/162 [00:07<00:05, 11.25it/s]

Conflict detected with span: sacro [1783-1788, ANATOMICAL]
Conflict detected with span: mama izquierda [84-98, ANATOMICAL]
Conflict detected with span: signos de estenosis de la arteria renal derecha [346-393, FINDING]


 69%|██████▊   | 111/162 [00:08<00:03, 13.34it/s]

Conflict detected with span: biopsia [666-673, TEST]


 70%|██████▉   | 113/162 [00:08<00:03, 12.37it/s]

Conflict detected with span: cuadro febril [487-500, CONDITION]
Conflict detected with span: dolor punzante [545-559, SYMPTOM]
Conflict detected with span: hemiabdomen izquierdo [563-584, ANATOMICAL]
Conflict detected with span: molestias a la palpación profunda [768-801, FINDING]
Conflict detected with span: hipocondrio izquierdo [805-826, ANATOMICAL]
Conflict detected with span: sin signos de peritonismo [832-857, FINDING]
Conflict detected with span: miembros inferiores [937-956, ANATOMICAL]
Conflict detected with span: Ecografía de abdomen [1056-1076, PROCEDURE]
Conflict detected with span: hipocondrio izquierdo [805-826, ANATOMICAL]
Conflict detected with span: bazo [1145-1149, ANATOMICAL]
Conflict detected with span: sugiere un origen suprarrenal [1152-1181, FINDING]
Conflict detected with span: RNM (resonancia nuclear magnética) [1237-1271, PROCEDURE]
Conflict detected with span: abdominal [1272-1281, ANATOMICAL]
Conflict detected with span: masa tumoral [1308-1320, ANATOMICAL]


 73%|███████▎  | 118/162 [00:08<00:03, 13.15it/s]

Conflict detected with span: tumefacción en toda la pierna [871-900, FINDING]
Conflict detected with span: varias nodulaciones en el muslo [967-998, FINDING]
Conflict detected with span: fosa nasal derecha [176-194, ANATOMICAL]
Conflict detected with span: ojo derecho [211-222, ANATOMICAL]
Conflict detected with span: fosa nasal derecha [176-194, ANATOMICAL]
Conflict detected with span: fosa nasal derecha [176-194, ANATOMICAL]
Conflict detected with span: fosa nasal derecha [176-194, ANATOMICAL]
Conflict detected with span: fosa nasal derecha [176-194, ANATOMICAL]
Conflict detected with span: negativo [2081-2089, FINDING]


 76%|███████▌  | 123/162 [00:09<00:03, 11.47it/s]

Conflict detected with span: M0 [149-151, FINDING]
Conflict detected with span: Mutación en gen MET (c.3906G>A/p.V1238I, en exón 19 del gen MET) [1430-1494, FINDING]
Conflict detected with span: Mutación deletérea en gen BRCA1 exón 5, c.330A [1497-1543, FINDING]


 78%|███████▊  | 126/162 [00:09<00:02, 13.88it/s]

Conflict detected with span: parálisis facial [151-167, SYMPTOM]
Conflict detected with span: hemiparesia derecha [129-148, SYMPTOM]
Conflict detected with span: gastroscopia [288-300, PROCEDURE]
Conflict detected with span: anemización [221-232, FINDING]
Conflict detected with span: gastroscopia [288-300, PROCEDURE]
Conflict detected with span: invaginación yeyuno-yeyunal [949-976, CONDITION]


 84%|████████▍ | 136/162 [00:10<00:02, 10.10it/s]

Conflict detected with span: quimioterapia [717-730, PROCEDURE]
Conflict detected with span: quimioterapia [717-730, PROCEDURE]
Conflict detected with span: quimioterapia [717-730, PROCEDURE]
Conflict detected with span: PET-TC [1953-1959, TEST]


 86%|████████▋ | 140/162 [00:11<00:02,  7.71it/s]

Conflict detected with span: LSI [203-206, ANATOMICAL]


 95%|█████████▌| 154/162 [00:13<00:01,  7.47it/s]

Conflict detected with span: adalimumab [477-487, PROCEDURE]


 97%|█████████▋| 157/162 [00:13<00:00,  7.49it/s]

Conflict detected with span: años [22-26, QUANTITY]
Conflict detected with span: hepatoblastoma embrionario con diferenciación colangiolar [541-598, CONDITION]
Conflict detected with span: hepatoblastoma embrionario [541-567, CONDITION]


100%|██████████| 162/162 [00:14<00:00, 11.50it/s]
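
The conflict check in the loop above treats two half-open character ranges as overlapping when `start < e and end > s`, and keeps whichever span was seen first. A minimal stand-alone sketch of that test, together with an alternative longest-first policy, is given here; the function names are hypothetical. spaCy also ships spacy.util.filter_spans, which applies a similar longest-span-first rule to Span objects.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open character ranges share at least one character."""
    return a[0] < b[1] and a[1] > b[0]


def keep_longest(spans: List[Span]) -> List[Span]:
    """Drop overlapping spans, preferring longer ones (ties: earlier start)."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(overlaps(span, other) for other in kept):
            kept.append(span)
    return sorted(kept)


# Offsets taken from two of the conflict messages in the log above.
candidates = [
    (541, 567, "CONDITION"),  # "hepatoblastoma embrionario"
    (541, 598, "CONDITION"),  # the longer span starting at the same offset
    (22, 26, "QUANTITY"),
]
print(keep_longest(candidates))
```

Whether first-seen or longest-first is preferable depends on the annotation guidelines; the loop in this notebook uses first-seen.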


To visualise the documents stored in the DocBin object, we can use the displacy module from spaCy.

In [29]:
from spacy import displacy

# Convert the DocBin object to a list of Doc objects
docs = list(db.get_docs(nlp.vocab))

# Iterate over the Doc objects and visualize them using displacy
for doc in docs:
    displacy.render(doc, style="ent", jupyter=True)

# serve the visualisation for purposes of the report
# displacy.serve(doc, style="ent", auto_select_port=True, host="localhost")



In [30]:
# export to a csv file for viewing
import csv
with open("train_data.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["text", "entities"])
    writer.writerows(training_data)
# open as a preview of the file to demonstrate the training data
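
Because each row holds a text and a dictionary, csv.writer stores the dictionary as its Python repr, which is awkward to read back. One option, sketched here under the assumption that the entities should survive a round trip, is to serialise them as JSON. The sample row and the in-memory buffer are illustrative only.

```python
import csv
import io
import json

# Illustrative row in the same shape as training_data: (text, {"entities": [...]})
sample_data = [
    ("Paciente con anisocoria.", {"entities": [(13, 23, "SYMPTOM", "anisocoria")]}),
]

# The buffer stands in for open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["text", "entities"])
for text, annotations in sample_data:
    writer.writerow([text, json.dumps(annotations["entities"], ensure_ascii=False)])

# Read the rows back and decode the JSON column.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
restored = [(row["text"], json.loads(row["entities"])) for row in rows]
print(restored)
```

Note that JSON has no tuple type, so the entity tuples come back as lists.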

NER pipeline component

In [34]:
print(nlp.pipe_names) # 'ner' is already at the end of our pipeline for the model
ner = nlp.get_pipe("ner")
print(ner.labels) # the labels that the model can predict

['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
('ANATOMICAL', 'BACKGROUND', 'CONDITION', 'DEVICE', 'DRUG', 'FINDING', 'LOC', 'LOCATION', 'MISC', 'ORG', 'PER', 'PROCEDURE', 'QUANTITY', 'STAGE', 'SYMPTOM', 'TEST', 'TIME', 'TREATMENT')


Add the entity labels to the model from the final_data_retrain.spacy file

In [32]:
# Load the DocBin object from the final_data_retrain.spacy file
db = DocBin().from_disk("./final_data_retrain.spacy")

# Iterate over the Doc objects in the DocBin
for doc in db.get_docs(nlp.vocab):
    for ent in doc.ents:
        ner.add_label(ent.label_)


In [33]:
# we now have the new labels in the NER pipeline
labels = ner.labels
print(labels)

('ANATOMICAL', 'BACKGROUND', 'CONDITION', 'DEVICE', 'DRUG', 'FINDING', 'LOC', 'LOCATION', 'MISC', 'ORG', 'PER', 'PROCEDURE', 'QUANTITY', 'STAGE', 'SYMPTOM', 'TEST', 'TIME', 'TREATMENT')
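
The labels added in the loop above come from the entities stored in the DocBin. As a quick sanity check that does not need spaCy, the distinct labels and their frequencies can also be read directly from tuples shaped like training_data. The sample rows here are made up for illustration.

```python
from collections import Counter

# Made-up rows in the (text, {"entities": [(start, end, label, text), ...]}) shape
sample_data = [
    ("dolor en mama derecha", {"entities": [(0, 5, "SYMPTOM", "dolor"),
                                            (9, 21, "ANATOMICAL", "mama derecha")]}),
    ("biopsia de mama", {"entities": [(0, 7, "TEST", "biopsia"),
                                      (11, 15, "ANATOMICAL", "mama")]}),
]

label_counts = Counter(
    entity[2]  # the label is the third field of each entity tuple
    for _, annotations in sample_data
    for entity in annotations["entities"]
)
print(sorted(label_counts))         # distinct labels
print(label_counts.most_common(1))  # most frequent label
```

Labels with very few examples are unlikely to be learned well from a training set of this size.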


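Train the NER model - just with ner component

Note that the same file is passed as both --paths.train and --paths.dev, so the ENTS_F, ENTS_P and ENTS_R scores in the output are measured on the training data and are not a held-out evaluation.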
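
To report scores on data the model has not seen, a held-out development set could be prepared before training. The sketch below shuffles a list of examples with a fixed seed and splits it. The 80/20 ratio, the seed and the function name split_examples are assumptions for illustration, not choices made in this project.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_examples(
    examples: Sequence[T], dev_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `examples` and split it into (train, dev)."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


documents = [f"doc_{i}" for i in range(162)]  # placeholder for training_data
train, dev = split_examples(documents)
print(len(train), len(dev))
```

Each half would then be written to its own DocBin file and passed separately via --paths.train and --paths.dev.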

In [35]:
!python3 -m spacy train config.cfg --output ./output --paths.train ./final_data_retrain.spacy --paths.dev ./final_data_retrain.spacy 

ℹ Saving to output directory: output
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     88.50    0.00    0.00    0.00    0.00
  1     200       6697.35  18162.81    8.32   17.89    5.42    0.08
  2     400      17263.69  10185.77    9.38   23.75    5.85    0.09
  3     600      11174.85   7670.26   15.24   32.42    9.96    0.15
  4     800       3468.23   6696.34   24.76   34.77   19.23    0.25
  6    1000       2769.43   6021.45   31.83   40.37   26.27    0.32
  7    1200       3150.40   5719.91   39.12   48.12   32.96    0.39
  8    1400       3940.03   5558.86   47.80   50.78   45.15    0.48
  9    1600       6332.85   5136.00   49.73   48.52   51.00    0.50
 11    1800       5795.14   4946.27   56.04   
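
The ENTS_P, ENTS_R and ENTS_F columns are entity-level precision, recall and F-score. As a rough illustration of how such scores are computed under exact-match scoring (an entity counts as correct only if start, end and label all match), consider this sketch. The function name prf and the example spans are hypothetical.

```python
from typing import Dict, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, label)


def prf(gold: Set[Entity], predicted: Set[Entity]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over sets of entities."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {(0, 5, "SYMPTOM"), (9, 21, "ANATOMICAL"), (30, 37, "TEST")}
predicted = {(0, 5, "SYMPTOM"), (9, 21, "FINDING")}
scores = prf(gold, predicted)
print({name: round(value, 2) for name, value in scores.items()})
```

The same formula can be checked against the table: the row at step 1600 reports ENTS_P = 48.52 and ENTS_R = 51.00, and 2 × 48.52 × 51.00 / (48.52 + 51.00) ≈ 49.73, which matches its ENTS_F.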

In [37]:
my_nlp = spacy.load(r"./output/model-best") 

files = ["./spacy-train/cc_onco3.txt"]
with open(files[0], "r", encoding='utf-8') as file:
    text = file.read()
    
    doc = my_nlp(text) 
    # spacy.displacy.render(doc, style="ent", jupyter=True) 
    # spacy.displacy.serve(doc, style="ent")
    displacy.serve(doc, style="ent",auto_select_port=True, host="localhost")


FileNotFoundError: [Errno 2] No such file or directory: './spacy-train/cc_onco3.txt'

References
https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7
