<a href="https://colab.research.google.com/github/russell-ai/SparkNLP-CustomNER/blob/main/3_Custom_NER_Model_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **Interview Task**
[Running a Spark NLP Healthcare Pipeline and Training a Custom NER Model](https://docs.google.com/document/d/1l_SpYGAlVGAEe9x-b8avgvKipCXetdap2ttc4UKreO4/edit?tab=t.0)  
## **PART-III Custom NER Model Training:**
  
In this part, we train a custom NER model on the CoNLL training dataset prepared in the previous part (Part 2).

## Environment Setup

In [1]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

Saving Medical Language Models for Data Scientists  Training License.json to Medical Language Models for Data Scientists  Training License.json
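`locals().update(license_keys)` creates variables here only because the cell runs at module level, where `locals()` and `globals()` are the same dictionary; inside a function it would not define anything. A minimal sketch of a more explicit alternative, assuming the license file holds the `SECRET`, `PUBLIC_VERSION`, and `JSL_VERSION` keys used later in this notebook (the helper names `load_license_keys` and `export_to_env` are hypothetical):

```python
import json
import os

# Keys that later cells in this notebook rely on (assumption based on their usage)
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def load_license_keys(path, required=REQUIRED_KEYS):
    """Load the license JSON and fail early if an expected key is missing."""
    with open(path) as f:
        keys = json.load(f)
    missing = [k for k in required if k not in keys]
    if missing:
        raise KeyError(f"License file {path!r} is missing keys: {missing}")
    return keys

def export_to_env(keys):
    """Copy the entries to the environment as strings, like os.environ.update."""
    os.environ.update({k: str(v) for k, v in keys.items()})
```

With this in place, `license_keys = load_license_keys('spark_jsl.json')` followed by `SECRET = license_keys['SECRET']` makes each dependency visible at the point of use.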


In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1  spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
# Optional: start the session manually with custom parameters using the start() function below
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")

    return builder.getOrCreate()

#spark = start(SECRET)

In [4]:
import json
import os

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

import sparknlp_jsl
import sparknlp

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *


import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", # Amount of memory to use for the driver process, i.e. where SparkContext is initialized
          "spark.kryoserializer.buffer.max":"2000M", # Maximum allowable size of Kryo serialization buffer, in MiB unless otherwise specified.
          "spark.driver.maxResultSize":"2000M"} # Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes.
                                                # Should be at least 1M, or 0 for unlimited.

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.4.1
Spark NLP_JSL Version : 5.4.1


## Download and prepare **clinical embeddings**

In [5]:
clinical_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


## **Load** the CoNLL dataset

In [6]:
# You can either upload or download the file conll2003_text_file.conll.
!wget -q https://raw.githubusercontent.com/russell-ai/SparkNLP-CustomNER/refs/heads/main/conll2003_text_file.conll -O /content/conll2003_text_file.conll

In [8]:
from sparknlp.training import CoNLL

conll_data = CoNLL().readDataset(spark, "/content/conll2003_text_file.conll")
print("Dataset loaded. Number of rows:", conll_data.count())

Dataset loaded. Number of rows: 2425
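For orientation, the CoNLL 2003 layout has one token per line with four space-separated columns (token, POS tag, chunk tag, NER tag), blank lines between sentences, and optional `-DOCSTART-` lines between documents. The sketch below parses a made-up sample in plain Python; the sample text and the `read_conll` helper are illustrative assumptions, not content taken from `conll2003_text_file.conll`.

```python
# Made-up sample in CoNLL 2003 layout: token, POS, chunk, NER tag
SAMPLE = """-DOCSTART- -X- -X- O

She NN NN O
was NN NN O
on NN NN O
metformin NN NN B-DRUG
. NN NN O

No NN NN O
chest NN NN B-PROBLEM
pain NN NN I-PROBLEM
"""

def read_conll(text):
    """Return a list of sentences, each a list of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both close the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # first column: token, last: NER tag
    if current:
        sentences.append(current)
    return sentences

sentences = read_conll(SAMPLE)
for sent in sentences:
    print(sent)
```

Each parsed sentence corresponds to one row of the DataFrame returned by `readDataset`, which is why the row count above is a sentence count.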


## **Split** the dataset into train and test

In [9]:
train_data, test_data = conll_data.randomSplit([0.8, 0.2], seed=42)
print("Train set size:", train_data.count())
print("Test set size:", test_data.count())


Train set size: 1975
Test set size: 450
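The split comes out as 1975/450 (about 81%/19%) rather than exactly 1940/485 because `randomSplit` samples rows at random, so the weights are expected proportions, not exact ones. A plain-Python sketch of that idea follows; it is an illustration only, not Spark's implementation, and `bernoulli_split` is a hypothetical helper.

```python
import random

def bernoulli_split(rows, train_weight=0.8, seed=42):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

train, test = bernoulli_split(range(2425))
# Sizes are close to, but usually not exactly, 1940 / 485
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is what `seed=42` does in the cell above for a given input DataFrame.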


In [10]:
train_data.show(10)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|" Hemostasis was ...|[{document, 0, 43...|[{document, 0, 43...|[{token, 0, 0, ",...|[{pos, 0, 0, NN, ...|[{named_entity, 0...|
|" It was our inte...|[{document, 0, 12...|[{document, 0, 12...|[{token, 0, 0, ",...|[{pos, 0, 0, NN, ...|[{named_entity, 0...|
|( Medical Transcr...|[{document, 0, 83...|[{document, 0, 83...|[{token, 0, 0, (,...|[{pos, 0, 0, NN, ...|[{named_entity, 0...|
|( Medical Transcr...|[{document, 0, 63...|[{document, 0, 63...|[{token, 0, 0, (,...|[{pos, 0, 0, NN, ...|[{named_entity, 0...|
|( Medical Transcr...|[{document, 0, 90...|[{document, 0, 90...|[{token, 0, 0, (,...|[{pos, 0, 0, NN, ..

## Create TF Graph

In [13]:
!pip install -q tensorflow==2.12.0
!pip install -q tensorflow-addons

In [14]:
from sparknlp_jsl.annotator import TFGraphBuilder

In [15]:
graph_folder_path = "medical_ner_graphs"
ner_graph_builder = TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder_path)\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(24)\
    .setIsLicensed(True)

## Define NER tagger

In [16]:
nerTagger = MedicalNerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(30)\
    .setLr(0.003)\
    .setBatchSize(8)\
    .setRandomSeed(42)\
    .setVerbose(1)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setValidationSplit(0.2)\
    .setGraphFolder(graph_folder_path)\
    .setOutputLogsPath("./ner_logs")\
    .setUseBestModel(True)\
    .setEarlyStoppingCriterion(0.001)\
    .setEarlyStoppingPatience(3)
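The last two settings enable early stopping. The sketch below shows a generic patience-based rule on a "higher is better" validation score: training stops once the score has failed to improve by more than a threshold for a given number of consecutive epochs. It is an assumption-laden illustration of the general technique, not the `MedicalNerApproach` implementation (which metric the library monitors is not stated in this notebook), and the scores are made up.

```python
def early_stop_epoch(scores, min_delta=0.001, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a sufficient improvement
    for epoch, score in enumerate(scores, start=1):
        if score - best > min_delta:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Made-up validation scores: improvement stalls after epoch 3
val_scores = [0.60, 0.70, 0.75, 0.7505, 0.7502, 0.7508, 0.80]
print(early_stop_epoch(val_scores))  # 6
```

Note the trade-off in the example: the run stops at epoch 6 and never sees the later jump to 0.80, so a small patience saves time at the risk of stopping on a plateau.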

## Create pipeline

In [17]:
ner_pipeline = Pipeline(stages=[
    clinical_embeddings,
    ner_graph_builder,
    nerTagger
])

## Train the model

In [18]:
%%time
print("Starting model training...")
ner_model = ner_pipeline.fit(train_data)
print("Model training completed.")

Starting model training...
TF Graph Builder configuration:
Model name: ner_dl
Graph folder: medical_ner_graphs
Graph file name: auto
Build params: {'ntags': 13, 'embeddings_dim': 200, 'nchars': 78, 'is_medical': True, 'lstm_size': 24}


Instructions for updating:
non-resource variables are not supported in the long term
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


ner_dl graph exported to medical_ner_graphs/blstm_13_200_24_78.pb
Model training completed.
CPU times: user 16.7 s, sys: 1.22 s, total: 17.9 s
Wall time: 7min 9s
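Comparing the logged build params with the exported file name suggests the pattern `blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb`. This is inferred from the single log above rather than from documented behavior, so treat the helper below (`expected_graph_name`, a hypothetical name) as a sketch for sanity-checking which graph a run should pick up.

```python
def expected_graph_name(ntags, embeddings_dim, lstm_size, nchars):
    """Build the graph file name following the pattern seen in the log above."""
    return f"blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb"

# Values copied from the "Build params" line of the training output
build_params = {"ntags": 13, "embeddings_dim": 200, "nchars": 78, "lstm_size": 24}
print(expected_graph_name(**build_params))  # blstm_13_200_24_78.pb
```

The numbers are consistent with the rest of the notebook: 13 tags are the six entity types with `B-` and `I-` prefixes plus `O`, and 24 is the value passed to `setHiddenUnitsNumber`.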


### Training Logs

In [19]:
!ls ner_logs/MedicalNerApproach*

ner_logs/MedicalNerApproach_56ee0db652c7.log


In [20]:
! cat ner_logs/MedicalNerApproach*

Name of the selected graph: /content/medical_ner_graphs/blstm_13_200_24_78.pb
Training started - total epochs: 30 - lr: 0.003 - batch size: 8 - labels: 13 - chars: 77 - training examples: 1975


Epoch 1/30 started, lr: 0.003, dataset size: 1975


Epoch 1/30 - 21.31s - loss: 2646.0015 - avg training loss: 13.1641865 - batches: 201
Quality on validation dataset (20.0%), validation examples = 395
time to finish evaluation: 1.79s
Total validation loss: 311.6654	Avg validation loss: 6.2333
label	 tp	 fp	 fn	 prec	 rec	 f1
B-DRUG	 42	 20	 21	 0.67741936	 0.6666667	 0.672
I-NAME	 0	 0	 1	 0.0	 0.0	 0.0
I-TREATMENT	 87	 58	 74	 0.6	 0.54037267	 0.5686274
B-DATE	 0	 0	 20	 0.0	 0.0	 0.0
I-DATE	 0	 0	 8	 0.0	 0.0	 0.0
I-PROBLEM	 357	 145	 100	 0.71115535	 0.78118163	 0.74452555
B-NAME	 0	 0	 4	 0.0	 0.0	 0.0
B-PROBLEM	 209	 80	 84	 0.7231834	 0.7133106	 0.718213
I-TEST	 31	 7	 69	 0.81578946	 0.31	 0.44927537
B-TEST	 16	 5	 79	 0.7619048	 0.16842106	 0.27586207
B-TREATMENT	 54	 48	 73	 0.5294118
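Each row of the per-label table can be checked from its counts: precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 is their harmonic mean. A small sketch using the `B-DRUG` and `B-DATE` rows above (the `prf` helper is hypothetical; returning 0.0 for empty denominators is a convention chosen to match the zero rows in the log):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

p, r, f1 = prf(tp=42, fp=20, fn=21)             # B-DRUG row
print(round(p, 4), round(r, 4), round(f1, 4))   # 0.6774 0.6667 0.672
print(prf(tp=0, fp=0, fn=20))                   # B-DATE row: (0.0, 0.0, 0.0)
```

The all-zero rows after epoch 1 simply mean the model has not yet predicted any token with those labels.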

## **Evaluate**

In [21]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_model.stages[2].transform(clinical_embeddings.transform(test_data))

evaler = NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label", drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# show() prints the table itself and returns None, so it is not wrapped in print()
eval_result.selectExpr("avg(f1) as macro").show()
# Note: this support-weighted mean of F1 corresponds to sklearn's "weighted avg"; the column name "micro" is kept as is
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+---------+-----+----+----+-----+---------+------+------+
|   entity|   tp|  fp|  fn|total|precision|recall|    f1|
+---------+-----+----+----+-----+---------+------+------+
|     NAME|  1.0| 0.0| 1.0|  2.0|      1.0|   0.5|0.6667|
|  PROBLEM|315.0|49.0|57.0|372.0|   0.8654|0.8468| 0.856|
|     DATE| 20.0| 1.0| 2.0| 22.0|   0.9524|0.9091|0.9302|
|     DRUG| 63.0|16.0|16.0| 79.0|   0.7975|0.7975|0.7975|
|TREATMENT|121.0|38.0|54.0|175.0|    0.761|0.6914|0.7246|
|     TEST|139.0|36.0|30.0|169.0|   0.7943|0.8225|0.8081|
+---------+-----+----+----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.7971727121989102|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8139125167727216|
+------------------+


In [43]:
from sklearn.metrics import classification_report

def show_sklearn_report(spark_results):
    # Collect the required information from the Spark results
    y_true = []
    y_pred = []

    # Rebuild true and predicted label lists from the counts of each entity type
    for row in spark_results.collect():
        entity = row['entity']
        tp = int(row['tp'])  # True positives
        fp = int(row['fp'])  # False positives
        fn = int(row['fn'])  # False negatives

        # For true positives, add the entity to both the true and predicted lists
        y_true.extend([entity] * tp)
        y_pred.extend([entity] * tp)

        # For false positives, add 'OTHER' to the true list and the entity to the predicted list
        y_true.extend(['OTHER'] * fp)
        y_pred.extend([entity] * fp)

        # For false negatives, add the entity to the true list and 'OTHER' to the predicted list
        y_true.extend([entity] * fn)
        y_pred.extend(['OTHER'] * fn)

    # Get the unique entity labels (excluding the 'OTHER' placeholder)
    labels = sorted(list(set(y_true) - {'OTHER'}))

    # Build the sklearn classification report
    report = classification_report(
        y_true,
        y_pred,
        labels=labels,
        digits=4
    )

    return report

# Usage
print("sklearn Classification Report:")
report = show_sklearn_report(eval_result)
print(report)

sklearn Classification Report:
              precision    recall  f1-score   support

        DATE     0.9524    0.9091    0.9302        22
        DRUG     0.7975    0.7975    0.7975        79
        NAME     1.0000    0.5000    0.6667         2
     PROBLEM     0.8654    0.8468    0.8560       372
        TEST     0.7943    0.8225    0.8081       169
   TREATMENT     0.7610    0.6914    0.7246       175

   micro avg     0.8248    0.8046    0.8146       819
   macro avg     0.8618    0.7612    0.7972       819
weighted avg     0.8245    0.8046    0.8139       819
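The three averages in the report differ only in how the per-entity rows are combined. The sketch below recomputes them from the tp/fp/fn counts in the chunk-level table: macro is the plain mean of per-entity F1, weighted is the mean weighted by support (tp + fn), and micro pools the counts before computing a single F1. The helper name `f1_score` is hypothetical.

```python
# (tp, fp, fn) per entity, copied from the chunk-level evaluation table above
counts = {
    "NAME": (1, 0, 1),
    "PROBLEM": (315, 49, 57),
    "DATE": (20, 1, 2),
    "DRUG": (63, 16, 16),
    "TREATMENT": (121, 38, 54),
    "TEST": (139, 36, 30),
}

def f1_score(tp, fp, fn):
    """F1 written directly in terms of counts: 2tp / (2tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

per_entity = {e: f1_score(*c) for e, c in counts.items()}
support = {e: tp + fn for e, (tp, fp, fn) in counts.items()}

macro = sum(per_entity.values()) / len(per_entity)
weighted = sum(per_entity[e] * support[e] for e in counts) / sum(support.values())
# Pool tp, fp and fn over all entities, then compute one F1
micro = f1_score(*(sum(col) for col in zip(*counts.values())))

print(f"macro={macro:.4f} weighted={weighted:.4f} micro={micro:.4f}")
# macro=0.7972 weighted=0.8139 micro=0.8146
```

The weighted value (0.8139) is the one the Spark expression `sum(f1*total)/sum(total)` produces, while the pooled micro F1 is 0.8146, matching the `micro avg` row of the sklearn report.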



## **Bonus**
### Generate training graphs

In [None]:
from sparknlp_jsl.training_log_parser import ner_log_parser

print("Generating training graphs...")
parser = ner_log_parser()

Generating training graphs...
NER Log Parser Initiated


In [None]:
log_files = [f for f in os.listdir("./ner_logs") if f.startswith("MedicalNerApproach")]
log_file = sorted(log_files)[-1]

In [None]:
# Plot loss graph
parser.loss_plot(f"./ner_logs/{log_file}")

In [None]:
# Plot other metric graphs
parser.get_charts(f'./ner_logs/{log_file}')
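If the plotting helpers are unavailable, the per-epoch loss can also be pulled out of the log with a regular expression. The pattern below is written against the `Epoch 1/30 - 21.31s - loss: ...` line shown in the training log earlier and assumes plain decimal numbers; the second epoch line in the sample is made up for illustration, and `parse_epoch_losses` is a hypothetical helper.

```python
import re

# Matches the end-of-epoch summary line, not the "Epoch n/N started" line
EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - avg training loss: ([\d.]+)"
)

def parse_epoch_losses(log_text):
    """Return one dict per completed epoch found in the log text."""
    rows = []
    for m in EPOCH_RE.finditer(log_text):
        rows.append({
            "epoch": int(m.group(1)),
            "seconds": float(m.group(3)),
            "loss": float(m.group(4)),
            "avg_loss": float(m.group(5)),
        })
    return rows

sample_log = """Epoch 1/30 started, lr: 0.003, dataset size: 1975
Epoch 1/30 - 21.31s - loss: 2646.0015 - avg training loss: 13.1641865 - batches: 201
Epoch 2/30 - 20.00s - loss: 1500.0 - avg training loss: 7.5 - batches: 201
"""

for row in parse_epoch_losses(sample_log):
    print(row)
```

To run it on the real log, read the file first, for example `parse_epoch_losses(open(f"./ner_logs/{log_file}").read())`.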

### Save the model to disk

In [22]:
model_path = "./models/new_medical_ner"
ner_model.stages[-1].write().overwrite().save(model_path)
print(f"Model saved to {model_path}")

Model saved to ./models/new_medical_ner




---



## Example prediction

In [23]:
from pyspark.ml import PipelineModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal
from sparknlp.annotator import DocumentAssembler, SentenceDetector, Tokenizer
from sparknlp.base import Pipeline
from sparknlp_display import NerVisualizer
import pyspark.sql.functions as F

print("Example prediction and visualization:")

# Saved NER model
ner_model_path = "models/new_medical_ner"

# Create a new pipeline with the saved NER model
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

loaded_ner_model = MedicalNerModel.load(ner_model_path)\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = NerConverterInternal()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(stages=[
    document,
    sentence,
    token,
    clinical_embeddings,
    loaded_ner_model,
    converter])

empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)

from sparknlp.base import LightPipeline

light_model = LightPipeline(prediction_model)


Example prediction and visualization:


In [24]:
text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

light_result = light_model.fullAnnotate(text)

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(light_result[0], label_col='ner_span', document_col='document', save_path="display_bert_result.html")
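Besides the HTML view, the extracted chunks can be listed as plain tuples. `fullAnnotate` returns one dict per input text, mapping each output column to a list of annotation objects with `begin`, `end`, `result`, and `metadata` attributes; for chunk annotations the label is expected under `metadata['entity']`. The sketch uses a small stand-in class (`FakeAnnotation`, hypothetical) with hand-written demo values so that it runs without a Spark session; the demo rows are not model output.

```python
from dataclasses import dataclass, field

@dataclass
class FakeAnnotation:
    """Stand-in exposing the attributes this sketch reads from an annotation."""
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)

def chunks_to_rows(annotations):
    """Flatten chunk annotations into (text, begin, end, entity) tuples."""
    return [
        (a.result, a.begin, a.end, a.metadata.get("entity", ""))
        for a in annotations
    ]

# Hand-written demo data, not predictions from the trained model
demo = [
    FakeAnnotation(39, 67, "gestational diabetes mellitus", {"entity": "PROBLEM"}),
    FakeAnnotation(100, 108, "metformin", {"entity": "DRUG"}),
]

for row in chunks_to_rows(demo):
    print(row)
```

With the real pipeline, the same function would be called as `chunks_to_rows(light_result[0]['ner_span'])`.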




---
*`R.Caliskan`*
