This task involves building a medical Named Entity Recognition (NER) system that automatically extracts key information from clinical notes and medical records. The system identifies and categorizes medical entities such as healthcare providers, facilities, dates, medications, diagnoses, and measurements. We use spaCy's transformer-based model (en_core_web_trf) for more accurate recognition of people, organizations, and dates in unstructured medical text, and combine it with regular expressions and keyword matching to convert that text into structured data.

**Note!** Because we work with medical data, we use en_core_web_trf, spaCy's transformer-based English model, which is larger and generally more accurate than the small and medium pipelines. To use it, install the transformers library, download the model, and then RESTART the notebook runtime so that the newly installed dependencies are loaded correctly.

<a target="_blank" href="https://colab.research.google.com/github/toelt-llc/HSLU-NLP-Bootcamp/blob/main/Day_1/SpaCy_NER/Bootcamp_NLP_Medical_Named_Entity_Recognition_Solution.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [9]:
import spacy
import re

In [2]:
# install transformers library
!pip install transformers



In [3]:
# Download the spaCy transformer model
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 457.4/457.4 MB 1.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
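
After restarting, it can help to confirm that the model package is visible to the current Python environment before calling `spacy.load`. spaCy models are installed as regular Python packages, so a standard-library lookup is enough. The helper name `model_installed` is an assumption made for this sketch, not part of spaCy.

```python
import importlib.util

def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None

# Hypothetical guard: remind the user to download and restart if the model is missing.
if not model_installed("en_core_web_trf"):
    print("en_core_web_trf not found: run the download cell, then restart the runtime.")
```

If the message appears even after a successful download, the runtime most likely has not been restarted yet.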


In [4]:
# Load the largest spaCy model, which is transformer-based
nlp = spacy.load("en_core_web_trf")

In [5]:
test_records = [
    """
    Initial Visit - 01/15/2024
    Dr. James Wilson at City Hospital performed initial consultation.
    Patient presents with frequent headaches and high blood pressure (150/90).
    Prescribed Amlodipine 5mg daily.
    """,

    """
    Follow-up - 02/28/2024
    Dr. Sarah Lee at Medical Center reviewed progress.
    BP improved to 130/85. Headaches reduced.
    Continue Amlodipine 5mg, added Vitamin D 2000IU.
    Referred to Dr. Chen at Neurology Clinic.
    """
]

In [6]:
def extract_medical_info(text):
    """Extract key medical information from documents"""
    doc = nlp(text)

    # Use built-in NER for people and organizations
    doctors = [ent.text for ent in doc.ents if ent.label_ == 'PERSON' and
              (text.lower().find('dr. ' + ent.text.lower()) != -1 or
               text.lower().find('doctor ' + ent.text.lower()) != -1)]

    facilities = [ent.text for ent in doc.ents if ent.label_ == 'ORG' or
                 (ent.label_ == 'GPE' and
                  any(term in text[max(0, ent.start_char-15):ent.end_char+15].lower()
                      for term in ['hospital', 'clinic', 'center', 'medical']))]

    # Extract dates using built-in NER
    dates = [ent.text for ent in doc.ents if ent.label_ == 'DATE']


    # Simple regex pattern for medications and dosages
    medications = []
    med_pattern = r'\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)\s+(\d+(?:\.\d+)?(?:\s*(?:mg|mcg|g|mL|IU))+(?:\s+(?:daily|twice|weekly|monthly)?)?)\b'
    for match in re.finditer(med_pattern, text):
        medications.append(f"{match.group(1)} {match.group(2)}")

    # Simple keyword matching for diagnoses - find common conditions
    diagnoses = []
    condition_keywords = ['headache', 'blood pressure', 'hypertension', 'diabetes']
    for keyword in condition_keywords:
        if keyword in text.lower():
            # Confirm the keyword appears in a sentence segmented by spaCy; only the keyword itself is recorded
            sentence = next((s for s in doc.sents if keyword in s.text.lower()), None)
            if sentence:
                diagnoses.append(keyword)

    # Simple regex for measurements (especially blood pressure)
    measurements = []
    bp_pattern = r'\b(\d{2,3}/\d{2,3})\b'
    for match in re.finditer(bp_pattern, text):
        # Check if BP or blood pressure is mentioned nearby
        context = text[max(0, match.start()-20):min(len(text), match.end()+20)]
        if 'bp' in context.lower() or 'blood pressure' in context.lower() or 'pressure' in context.lower():
            measurements.append(match.group(1))

    # Compile all extracted information
    info = {
        'doctors': doctors,
        'facilities': facilities,
        'dates': dates,
        'medications': medications,
        'diagnoses': diagnoses,
        'measurements': measurements
    }

    return info
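
The blood pressure step above pairs a numeric pattern with a check of the surrounding text. The following is a minimal standalone sketch of that idea using only the standard library, so it can be tried without loading the spaCy model. The function name `find_bp_readings`, the `window` parameter, and the dictionary keys are assumptions made for illustration; unlike `extract_medical_info`, it splits each reading into integer systolic and diastolic values.

```python
import re

# Two numbers of 2-3 digits separated by a slash, e.g. 150/90
BP_PATTERN = re.compile(r'\b(\d{2,3})/(\d{2,3})\b')

def find_bp_readings(text, window=20):
    """Return readings whose nearby text mentions 'bp' or 'pressure'."""
    readings = []
    for match in BP_PATTERN.finditer(text):
        # Look at `window` characters on each side of the match
        context = text[max(0, match.start() - window):match.end() + window].lower()
        if 'bp' in context or 'pressure' in context:
            readings.append({
                'systolic': int(match.group(1)),
                'diastolic': int(match.group(2)),
            })
    return readings

sample = "Initial Visit - 01/15/2024. Patient presents with high blood pressure (150/90)."
# The date fragment "01/15" also fits the pattern; the context check rejects it.
print(find_bp_readings(sample))  # [{'systolic': 150, 'diastolic': 90}]
```

The context window is what keeps date fragments out of the results, so a window that is too wide may let a date through when a blood pressure mention is nearby.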


In [8]:
# Process each record
for i, record in enumerate(test_records):
    print(f"\n{'='*50}\nRECORD #{i+1}\n{'='*50}")
    print(record.strip())

    med_info = extract_medical_info(record)

    print("\nExtracted Medical Information:")
    print(f"Healthcare Providers: {', '.join(med_info['doctors']) if med_info['doctors'] else 'None found'}")
    print(f"Medical Facilities: {', '.join(med_info['facilities']) if med_info['facilities'] else 'None found'}")
    print(f"Visit Dates: {', '.join(med_info['dates']) if med_info['dates'] else 'None found'}")
    print(f"Medications: {', '.join(med_info['medications']) if med_info['medications'] else 'None found'}")
    print(f"Diagnoses: {', '.join(med_info['diagnoses']) if med_info['diagnoses'] else 'None found'}")
    print(f"Measurements: {', '.join(med_info['measurements']) if med_info['measurements'] else 'None found'}")




RECORD #1
Initial Visit - 01/15/2024
    Dr. James Wilson at City Hospital performed initial consultation.
    Patient presents with frequent headaches and high blood pressure (150/90).
    Prescribed Amlodipine 5mg daily.

Extracted Medical Information:
Healthcare Providers: James Wilson
Medical Facilities: City Hospital
Visit Dates: 01/15/2024, daily
Medications: Prescribed Amlodipine 5mg daily
Diagnoses: headache, blood pressure
Measurements: 150/90
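
Note that `daily` appears under Visit Dates: spaCy's `DATE` label is broader than calendar dates, and here it also picked up the dosing frequency. One possible way to keep only calendar dates is to validate each candidate against an expected format. The sketch below assumes visit dates are always written as MM/DD/YYYY, as in the test records; the helper name `keep_calendar_dates` is hypothetical.

```python
from datetime import datetime

def keep_calendar_dates(candidates, fmt="%m/%d/%Y"):
    """Keep only the strings that parse as a date in the given format."""
    kept = []
    for text in candidates:
        try:
            datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # not a calendar date in this format, e.g. "daily"
        kept.append(text)
    return kept

print(keep_calendar_dates(["01/15/2024", "daily"]))  # ['01/15/2024']
```

Records that write dates in other formats would need additional format strings, so this filter is only as general as the assumption behind it.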

RECORD #2
Follow-up - 02/28/2024
    Dr. Sarah Lee at Medical Center reviewed progress.
    BP improved to 130/85. Headaches reduced.
    Continue Amlodipine 5mg, added Vitamin D 2000IU.
    Referred to Dr. Chen at Neurology Clinic.

Extracted Medical Information:
Healthcare Providers: Sarah Lee, Chen
Medical Facilities: Medical Center, Neurology Clinic
Visit Dates: 02/28/2024
Medications: Continue Amlodipine 5mg
Diagnoses: headache
Measurements: 130/85
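
Two limitations of the medication pattern show up in these results. The optional second capitalized word lets the match start at a sentence-initial verb, which gives `Prescribed Amlodipine` and `Continue Amlodipine`, and `Vitamin D 2000IU` is missed because the single letter `D` does not match `[A-Z][a-z]+`. The sketch below is one possible adjustment, not the notebook's method: it allows a one-letter second word and drops a leading verb from a small, hypothetical list (`LEADING_VERBS`). The names `MED_PATTERN` and `extract_medications` are likewise assumptions for this example.

```python
import re

# Hypothetical list of sentence-initial verbs to strip from the drug name
LEADING_VERBS = {"Prescribed", "Continue", "Added", "Started"}

MED_PATTERN = re.compile(
    r'\b([A-Z][a-z]+(?:\s[A-Z][a-z]*)?)'              # one or two capitalized words; the second may be a single letter
    r'\s+(\d+(?:\.\d+)?\s*(?:mg|mcg|g|mL|IU)\b'       # dose and unit
    r'(?:\s+(?:daily|twice|weekly|monthly)\b)?)'      # optional frequency
)

def extract_medications(text):
    """Return 'name dose' strings, without a leading verb where one was captured."""
    meds = []
    for match in MED_PATTERN.finditer(text):
        words = match.group(1).split()
        if len(words) > 1 and words[0] in LEADING_VERBS:
            words = words[1:]
        meds.append(f"{' '.join(words)} {match.group(2)}")
    return meds

print(extract_medications("Prescribed Amlodipine 5mg daily."))
# ['Amlodipine 5mg daily']
print(extract_medications("Continue Amlodipine 5mg, added Vitamin D 2000IU."))
# ['Amlodipine 5mg', 'Vitamin D 2000IU']
```

A fixed verb list will not scale to real clinical notes; a drug lexicon or a model trained on clinical text would be the more robust choice there.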
