# NER of Medical Reports using Stanza and spaCy

Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

We perform NER on medical transcript samples that we scraped from <https://medical-transcription-sample-reports.blogspot.com/>.

## Importing and Installing Requirements

We need to install dependencies, as well as download some models manually for running NER. Additional models and dependencies are also installed during notebook execution.

### Prerequisite Dependencies
- spacy
- spacy_stanza
- spacytextblob
- en_core_web_lg (spaCy English model)

*NOTE: Run the download command below only the first time you use the model. The model is large (about 500 MB), and running the cell every time downloads the model every time.*

In [1]:
# downloading spacy english model
# !python -m spacy download en_core_web_lg
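One way to avoid repeating the download is to check whether the model is already installed. The sketch below relies on the fact that spaCy models are installed as ordinary Python packages; the helper names `model_installed` and `ensure_model` are hypothetical and not part of the original notebook.

```python
import importlib.util


def model_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be imported."""
    return importlib.util.find_spec(name) is not None


def ensure_model(name: str = "en_core_web_lg") -> None:
    """Download the spaCy model only when it is not installed yet."""
    if not model_installed(name):
        # Imported lazily so the check above works even without spaCy loaded
        import spacy.cli

        spacy.cli.download(name)


# Example call (commented out because it may trigger a large download):
# ensure_model("en_core_web_lg")
```

In a fresh environment, the first call to `ensure_model` downloads the model; later calls should return immediately.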

### Importing required libraries

In [2]:
import pprint
import spacy
import spacy_stanza

from spacy import displacy
from spacytextblob.spacytextblob import SpacyTextBlob

## Create a sample record for testing

In [3]:
sample_record = """
John Doe from Banepa, Dhulikhel studying at Kathmandu University, age 20, complains of mild joint pains.
He also suffers from frequent cramps on his legs. His X-ray scan showed no issues with bones.
After a vitamin D test, it is seen that his vitamin D is just 7.6 mg.
He has been prescribed VitaD sachets and some calcium tablets.
"""

## Create NER pipeline

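Create an NER pipeline with Stanza, loaded through `spacy_stanza`. It combines three pretrained NER models: `i2b2` and `radiology` for clinical text, and `ontonotes` for general-purpose entities.

Loading the pipeline downloads the required packages and models.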

In [4]:
ner_pipeline = spacy_stanza.load_pipeline(
    "en",
    processors={
        "ner": ["i2b2", "radiology", "ontonotes"],
        "tokenize": "combined",
    },
    package="",
)

2022-12-11 20:52:07 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2022-12-11 20:52:09 INFO: Loading these models for language: en (English):
| Processor | Package                  |
----------------------------------------
| tokenize  | combined                 |
| ner       | i2b2;radiology;ontonotes |

2022-12-11 20:52:09 INFO: Use device: gpu
2022-12-11 20:52:09 INFO: Loading: tokenize
2022-12-11 20:52:11 INFO: Loading: ner
2022-12-11 20:52:13 INFO: Done loading processors!


## Passing the sample through the Model Pipeline

First, we pass the sample through the NER pipeline, which applies the models defined above.

In [5]:
ner_doc = ner_pipeline(sample_record)

We also pass it through the sentiment pipeline defined below to get the overall sentiment of the report.

In [6]:
# defining pipelines for sentiment
sentiment_pipeline = spacy.load("en_core_web_lg")
sentiment_pipeline.add_pipe("spacytextblob")

<spacytextblob.spacytextblob.SpacyTextBlob at 0x7fd03ec35cc0>

In [7]:
# passing sample through sentiment
sentiment_doc = sentiment_pipeline(sample_record)

## Storing Output

We store the output of both pipelines in a dictionary: the unique entity texts grouped by NER label, plus the sentiment polarity.

In [8]:
extracted_record = {}
for entity in ner_doc.ents:
    if entity.label_ in extracted_record:
        if str(entity) not in extracted_record[entity.label_]:
            extracted_record[entity.label_].append(str(entity))
    else:
        extracted_record[entity.label_] = [str(entity)]
        
extracted_record["SENTIMENT"] = round(sentiment_doc._.blob.polarity, 5)
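The grouping step can also be written with `collections.defaultdict`, and the resulting dictionary serializes directly to JSON. The sketch below is a self-contained illustration: the `(label, text)` pairs are hand-written stand-ins for `(entity.label_, str(entity))` taken from `ner_doc.ents`, the helper name `group_entities` is hypothetical, and the sentiment value is only illustrative.

```python
import json
from collections import defaultdict


def group_entities(pairs):
    """Group (label, text) pairs into {label: [unique texts, first-seen order]}."""
    grouped = defaultdict(list)
    for label, text in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# Stand-in for: ((entity.label_, str(entity)) for entity in ner_doc.ents)
pairs = [
    ("PERSON", "John Doe"),
    ("PROBLEM", "mild joint pains"),
    ("PROBLEM", "mild joint pains"),  # duplicate, kept only once
    ("GPE", "Banepa"),
]

record = group_entities(pairs)
record["SENTIMENT"] = -0.05556  # illustrative polarity value

# The record contains only strings, lists and floats, so it dumps cleanly
print(json.dumps(record, indent=2))
```

Keeping the record as a plain dictionary makes it easy to write one JSON file per report later on.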

## Displaying Output Extracted Data

In [9]:
pprint.pprint(extracted_record)

{'DATE': ['age 20'],
 'GPE': ['Banepa', 'Dhulikhel'],
 'ORG': ['Kathmandu University'],
 'PERSON': ['John Doe'],
 'PROBLEM': ['mild joint pains',
             'frequent cramps on his legs',
             'issues with bones'],
 'QUANTITY': ['7.6 mg'],
 'SENTIMENT': -0.05556,
 'TEST': ['His X-ray scan', 'a vitamin D test'],
 'TREATMENT': ['his vitamin D', 'VitaD sachets', 'some calcium tablets'],
 'UNCERTAINTY': ['no']}
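The `SENTIMENT` value is TextBlob's polarity score, a float between -1.0 (most negative) and 1.0 (most positive). A value of -0.05556 is therefore close to neutral. The sketch below shows one possible way to turn the score into a coarse label; the function name `polarity_label` and the 0.1 cut-off are assumptions chosen for illustration, not part of TextBlob or of the original notebook.

```python
def polarity_label(score: float, neutral_band: float = 0.1) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    Scores whose absolute value is below `neutral_band` count as neutral.
    The default band of 0.1 is an arbitrary, illustrative choice.
    """
    if score >= neutral_band:
        return "positive"
    if score <= -neutral_band:
        return "negative"
    return "neutral"


print(polarity_label(-0.05556))
```

Since the sentiment component is a general-purpose one rather than a clinical one, scores on medical reports are probably best read as a rough indicator.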


## Displaying Extracted Data Annotated in the document

In [10]:
displacy.render(ner_doc, style="ent", jupyter=True)

# Performing the Test with Medical Record Sample

We now define a function that reads a medical transcript from a txt file and passes it through the NER and sentiment pipelines, applying the same steps as above.

In [11]:
def extract_info(filename: str, visualize: bool = True) -> None:
    """Run NER and sentiment analysis on a text file and print the results."""
    with open(filename, "r") as inp:
        content = inp.read()

    # Stanza NER pipeline: two clinical models plus the general OntoNotes model
    ner_pipeline = spacy_stanza.load_pipeline(
        "en",
        processors={
            "ner": ["i2b2", "radiology", "ontonotes"],
            "tokenize": "combined",
        },
        package="",
    )

    # spaCy pipeline with the TextBlob sentiment component
    sentiment_pipeline = spacy.load("en_core_web_lg")
    sentiment_pipeline.add_pipe("spacytextblob")

    ner_doc = ner_pipeline(content)
    sentiment_doc = sentiment_pipeline(content)

    # Group the unique entity texts by their label
    extracted_record = {}
    for entity in ner_doc.ents:
        if entity.label_ in extracted_record:
            if str(entity) not in extracted_record[entity.label_]:
                extracted_record[entity.label_].append(str(entity))
        else:
            extracted_record[entity.label_] = [str(entity)]

    extracted_record["SENTIMENT"] = round(sentiment_doc._.blob.polarity, 5)

    # Optionally show the annotated document inline in the notebook
    if visualize:
        displacy.render(ner_doc, style="ent", jupyter=True)

    pprint.pprint(extracted_record)
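Because the function above builds both pipelines on every call, processing many reports repeats the model loading each time. A minimal sketch of one way to avoid that is to cache the loaded pipelines with `functools.lru_cache`. The helper name `get_pipelines` is hypothetical; the loading calls inside it are the same ones used in the notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_pipelines():
    """Load both pipelines on the first call and reuse them afterwards."""
    # Imported lazily so that defining this helper stays cheap
    import spacy
    import spacy_stanza
    # Importing this module registers the "spacytextblob" component
    from spacytextblob.spacytextblob import SpacyTextBlob  # noqa: F401

    ner_pipeline = spacy_stanza.load_pipeline(
        "en",
        processors={
            "ner": ["i2b2", "radiology", "ontonotes"],
            "tokenize": "combined",
        },
        package="",
    )
    sentiment_pipeline = spacy.load("en_core_web_lg")
    sentiment_pipeline.add_pipe("spacytextblob")
    return ner_pipeline, sentiment_pipeline


# Inside extract_info, the two loading steps could then become:
# ner_pipeline, sentiment_pipeline = get_pipelines()
```

With this change, only the first call pays the loading cost; later calls receive the same pipeline objects.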

## Applying to a Transcript and displaying Output

Apply the pipeline to `report_15.txt`.

In [12]:
extract_info("assets/txt_reports/report_15.txt")

2022-12-11 20:52:15 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json:   0%|   …

2022-12-11 20:52:16 INFO: Loading these models for language: en (English):
| Processor | Package                  |
----------------------------------------
| tokenize  | combined                 |
| ner       | i2b2;radiology;ontonotes |

2022-12-11 20:52:16 INFO: Use device: gpu
2022-12-11 20:52:16 INFO: Loading: tokenize
2022-12-11 20:52:16 INFO: Loading: ner
2022-12-11 20:52:17 INFO: Done loading processors!


{'ANATOMY': ['subscapularis',
             'supraspinatus',
             'labrum',
             'supraspinatus tendons',
             'Scalene',
             'shoulder',
             'cervical spine',
             'arm',
             'upper extremity',
             'neck',
             'portal',
             'biceps',
             'axillary',
             'pouch',
             'glenohumeral ligament',
             'subacromial',
             'bone',
             'right upper extremity'],
 'ANATOMY_MODIFIER': ['right',
                      'lateral',
                      'base',
                      'posterior',
                      'anterolateral',
                      'superior',
                      'anchor',
                      'anteroinferior',
                      'inferior',
                      'rest',
                      'middle',
                      'anterior',
                      'band',
                      'region'],
 'CARDINAL': ['1', '2', 'Less than 5', '