# NOTES

Big sections:
* data ingestion and examination
  * only 50 rows have labels for has_cancer and has_diabetes
    * 20 positives for has_cancer, only 5 positives for has_diabetes
  * use these 50 rows to do some validation and checks
    * where do diabetes key words show up in has_diabetes==1
    * examples of diabetes key words showing up in has_diabetes==0
    * repeat for has_cancer
  * how do we validate 
* breaking up text data to different sections, and getting ready for modeling
  * fields of relevance, fields to ignore, fields to sometimes consider
* looking for diabetes
  * should only be for patients currently diabetic
    * probably don't want to try and separate out cases where diabetes is in treatment
    * it may be good to separately tag when diabetes is in patient's family history, or past history
  * searching for terms like 'glucose' will be far too broad, since it comes up in normal bloodwork; we will need more context around those terms
  * check in spaCy?
* looking for cancer
  * remember, here we're looking for whether patient had cancer at any time
  * much bigger set of words to look for
  * 
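
To check where keywords show up in positive and negative rows, and to see why a bare term like 'glucose' is too broad, a keyword-in-context helper can be useful. This is a minimal sketch: the function name `keyword_contexts`, the `window` parameter, and the starter term list are all assumptions for illustration, not a vetted clinical vocabulary.

```python
import re

# Hypothetical starter list; a real list would need clinical review.
DIABETES_TERMS = ['diabetes', 'diabetic', 'insulin', 'metformin', 'glucose']

def keyword_contexts(text, keywords, window=40):
    """Return (keyword, snippet) pairs, with `window` characters of context on each side."""
    pattern = re.compile(
        r'\b(' + '|'.join(re.escape(k) for k in keywords) + r')\b',
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        # Collapse newlines so each snippet prints on one line
        snippet = re.sub(r'\s+', ' ', text[start:end]).strip()
        hits.append((m.group(1).lower(), snippet))
    return hits

example = "Labs: glucose 95 mg/dL. No history of diabetes."
for keyword, snippet in keyword_contexts(example, DIABETES_TERMS, window=15):
    print(f"{keyword}: ...{snippet}...")
```

Running this over the labeled rows, grouped by `has_diabetes`, would show which terms appear only in lab values or negated statements and so need extra context before they count as evidence.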

# CODE

## Preamble

In [2]:
import pandas as pd
import re
import spacy
# from scispacy import en_core_sci_lg
from scispacy.umls_linking import UmlsEntityLinker
from spacy.language import Language
from sklearn.model_selection import train_test_split

In [3]:
# read in data
rawdat = pd.read_csv('data/layer-data-sci-takehome-05062024.csv')

# separate out test set
traindat = rawdat.loc[rawdat['test_set']==0]
testdat = rawdat.loc[rawdat['test_set']==1]

# separate out labeled data
labeldat = traindat.loc[traindat['has_cancer'].notnull()]
nolabeldat = traindat.loc[traindat['has_cancer'].isnull()]

print(traindat.shape)
print(testdat.shape)
print(labeldat.shape)
print(nolabeldat.shape)

(1800, 5)
(200, 5)
(50, 5)
(1750, 5)


# Basic functions

In [4]:
def remove_namefields(text):
    # Replace patterns that might represent PII placeholders like "[Name]", "[Redacted]", etc.
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'Patient Name:.*\n', '', text, flags=re.IGNORECASE)
    return text

def normalize_text(text):
    '''
    Normalize text data by lowercasing and removing punctuation, digits, and extra whitespace
    '''
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'\d+', '', text)  # drop digits
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
    return text


# EDA

## Text examination

The first thing to notice is that the notes contain many different header sections.

A header starts at the beginning of a line (or of the field) and ends with a colon.

In [5]:
### Header extraction and analysis
# pattern for extracting headers:
    # 1. First character is a capital letter
    # 2. followed by any combination of letters, spaces, commas, hyphens
    # 3. ends with a colon and optional whitespace
header_pattern = r'(?m)(?:^|(?<=\n))[A-Z][a-zA-Z\s,-]+:\s*$'

def find_header(textfield, pattern=header_pattern):
    '''
    Extracts headers from a single text field, given a header pattern
    Returns list of headers found'''
    return re.findall(pattern, textfield)

def count_headers(df, pattern=header_pattern):
    '''
    Extracts headers from all text fields in a dataframe df
    Returns a dictionary of headers and their counts'''
    header_counts = {}

    for text in df['text']:
        headers = find_header(text, pattern)
        for header in headers:
            normheader = normalize_text(header)
            if normheader not in header_counts:
                header_counts[normheader] = 1
            else:
                header_counts[normheader] += 1
    return header_counts
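
Once headers can be found, the same idea can break a note into sections so that fields of relevance can be kept and others ignored. The following is a sketch only: `split_sections` and `SECTION_HEADER` are hypothetical names, and the pattern here is deliberately stricter than `header_pattern` above, since it matches only headers that sit alone on a line (inline fields such as `Age: 38` are treated as body text).

```python
import re

# Assumed header shape: capitalized phrase alone on a line, ending with a colon
SECTION_HEADER = re.compile(r'^([A-Z][A-Za-z ,/-]*):[ \t]*$', re.MULTILINE)

def split_sections(text):
    """Split a note into {normalized header: body text}."""
    sections = {}
    matches = list(SECTION_HEADER.finditer(text))
    for i, m in enumerate(matches):
        # A section body runs until the next header, or to the end of the note
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        name = m.group(1).strip().lower()
        body = text[m.end():end].strip()
        # If a header repeats, append to the existing body rather than overwrite it
        sections[name] = (sections.get(name, '') + ' ' + body).strip()
    return sections

note = "Hospital Course:\nPatient did well.\n\nFamily History:\nMother had diabetes.\n"
print(split_sections(note))
```

With sections separated, a mention of diabetes under a family-history header can be tagged differently from one under the hospital course.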



In [6]:
# test find_header on the first text entry
headers = find_header(traindat['text'].iloc[0])
# normalize all entries in headers
clean_headers = [normalize_text(header) for header in headers]
print(clean_headers)


['discharge summary', 'history of presenting illness', 'hospital course', 'laboratory findings', 'radiographic findings', 'hospital treatment', 'discharge and followup plan']


In [7]:
# Examine counts of patterns across 1800 training entries
pattern_counts = count_headers(traindat)
pattern_df = pd.DataFrame.from_dict(pattern_counts, orient='index', columns=['Count']).sort_values(by='Count', ascending=False)
# number of distinct headers, then how many occur more than 100, more than 20, and fewer than 10 times
print(len(pattern_df))
print(len(pattern_df[pattern_df['Count'] > 100]))
print(len(pattern_df[pattern_df['Count'] > 20]))
print(len(pattern_df[pattern_df['Count'] < 10]))


809
14
43
739


In [8]:
print("Over 100:")
pattern_df[pattern_df['Count'] > 100]


Over 100:


header,Count
hospital course,1475
discharge summary,1003
followup,454
treatment,299
diagnosis,285
hospital course summary,284
summary,246
discharge diagnosis,209
discharge instructions,194
discharge medications,179


In [9]:
print("20 to 100:")
pattern_df[(pattern_df['Count'] > 20) & (pattern_df['Count'] <= 100)]

20 to 100:


header,Count
condition at discharge,97
outcome,91
disposition,77
followup plan,64
reason for admission,61
recommendations,60
history of present illness,55
instructions,53
followup instructions,53
diagnoses,50


## Non-text examination

In [10]:
# check that every patient identifier is unique
print(rawdat.shape)
print(rawdat['patient_identifier'].nunique())

# number of rows in and out of the test set
print(rawdat['test_set'].value_counts())

# number of labeled cancer and diabetes patients
print(rawdat['has_cancer'].value_counts())
print(rawdat['has_diabetes'].value_counts())

# number of rows with no has_cancer / has_diabetes label
print(rawdat['has_cancer'].isnull().sum())
print(rawdat['has_diabetes'].isnull().sum())

(2000, 5)
2000
test_set
0    1800
1     200
Name: count, dtype: int64
has_cancer
0.0    30
1.0    20
Name: count, dtype: int64
has_diabetes
0.0    45
1.0     5
Name: count, dtype: int64
1950
1950


# Text cleaning

In [11]:
def remove_sensitive_information(text):
    # Replace patterns that might represent PII placeholders like "[Name]", "[Redacted]", etc.
    text = re.sub(r'\[.*?\]', '', text)
    # Remove any remaining names or identifiers (optional if needed for anonymization)
    text = re.sub(r'Patient Name:.*\n', '', text, flags=re.IGNORECASE)
    return text

# df['cleaned_text'] = df['text'].apply(remove_sensitive_information)

def normalize_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

# df['normalized_text'] = df['cleaned_text'].apply(normalize_text)


In [12]:
# work on an explicit copy of the training rows so new columns do not touch rawdat
df = rawdat.loc[rawdat['test_set'] == 0].copy()

# Remove sensitive information
df['cleaned_text'] = df['text'].apply(remove_sensitive_information)

# Normalize the text
df['normalized_text'] = df['cleaned_text'].apply(normalize_text)

# Optional: Tokenize the text
df['tokenized_text'] = df['normalized_text'].apply(lambda x: x.split())

df.head()

Unnamed: 0,patient_identifier,text,has_cancer,has_diabetes,test_set,cleaned_text,normalized_text,tokenized_text
0,2200,DISCHARGE SUMMARY:\n\nPatient Name: [Redacted]...,1.0,0.0,0,DISCHARGE SUMMARY:\n\nAge: 38\nSex: Female\nMe...,discharge summary age sex female medical recor...,"[discharge, summary, age, sex, female, medical..."
1,645,Discharge Summary:\n\nPatient: [Name]\n\nMedic...,0.0,0.0,0,Discharge Summary:\n\nPatient: \n\nMedical Rec...,discharge summary patient medical record numbe...,"[discharge, summary, patient, medical, record,..."
2,2563,Discharge Summary:\nPatient name: [REDACTED]\n...,0.0,0.0,0,Discharge Summary:\nSex: Female\nAge: 70 years...,discharge summary sex female age years admissi...,"[discharge, summary, sex, female, age, years, ..."
3,2275,Discharge Summary:\n\nPatient: 59-year-old Ita...,1.0,0.0,0,Discharge Summary:\n\nPatient: 59-year-old Ita...,discharge summary patient yearold italian male...,"[discharge, summary, patient, yearold, italian..."
4,1828,Hospital Course:\n\nThe 80-year-old male prese...,0.0,1.0,0,Hospital Course:\n\nThe 80-year-old male prese...,hospital course the yearold male presented to ...,"[hospital, course, the, yearold, male, present..."
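
The output above shows a side effect of `normalize_text`: stripping hyphens fuses words ("80-year-old" becomes "yearold"), and stripping periods removes the sentence boundaries that a downstream sentence splitter relies on. One possible variant, sketched here under the assumption that word and sentence boundaries matter for later modeling, turns hyphens and slashes into spaces and keeps periods. The name `normalize_keep_boundaries` is hypothetical.

```python
import re

def normalize_keep_boundaries(text):
    """Lowercase and clean text while keeping word boundaries and sentence-ending periods."""
    text = text.lower()
    text = re.sub(r'[-/]', ' ', text)      # split hyphenated/slashed words instead of fusing them
    text = re.sub(r'[^\w\s.]', '', text)   # drop other punctuation but keep periods
    text = re.sub(r'\d+', '', text)        # drop digits, as in normalize_text
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print(normalize_keep_boundaries("The 80-year-old male presented. Hand-foot syndrome noted!"))
```

Whether digits should also be kept (for lab values such as glucose readings) is a separate choice that this sketch does not make.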


In [13]:
# Checking for consistent patterns or fields in the text column

# Extracting common patterns and counts
pattern_counts = {}

for text in df['text']:
    # Extract any lines that appear to have structured headers (e.g., lines followed by a colon)
    matches = re.findall(r'([A-Z][a-zA-Z\s]+):', text)
    for match in matches:
        if match not in pattern_counts:
            pattern_counts[match] = 0
        pattern_counts[match] += 1

# Converting pattern counts to a DataFrame for easier visualization
pattern_df = pd.DataFrame.from_dict(pattern_counts, orient='index', columns=['Count']).sort_values(by='Count', ascending=False)
# fully display all rows with count greater than 100
pattern_df[pattern_df['Count'] > 100]





header,Count
Hospital Course,1091
Discharge Summary,932
Patient Name,612
Discharge Date,456
Diagnosis,334
Admission Date,329
Gender,317
Age,307
Treatment,300
Date of Discharge,300


In [14]:
tmpdf_diabetes = traindat.loc[traindat['has_diabetes'] == 1]
tmpdf_diabetes.to_csv('data/diabetes.csv', index=False)

tmpdf_cancer = traindat.loc[traindat['has_cancer'] == 1]
tmpdf_cancer.to_csv('data/cancer.csv', index=False)

tmpdf_nodiabetes = traindat.loc[traindat['has_diabetes'] == 0]
tmpdf_nodiabetes.to_csv('data/nodiabetes.csv', index=False)

tmpdf_nocancer = traindat.loc[traindat['has_cancer'] == 0]
tmpdf_nocancer.to_csv('data/nocancer.csv', index=False)



# Cancer model

In [16]:
nlp = spacy.load('en_core_sci_lg')



In [17]:
tmptxt = tmpdf_cancer['text'].iloc[0]  # positional access; does not depend on the index label
tmptxt = remove_sensitive_information(tmptxt)
tmptxt = normalize_text(tmptxt)
tmptxt

'discharge summary age sex female medical record number history of presenting illness the patient presented to the hospital with a month history of headache in august a neurological examination showed a limitation of temporal movement in her right eye brain mri revealed masses in her clivus measuring mm x mm and mm x mm she underwent surgery and was subsequently treated with stereotactic radiotherapy in march a recurrence in her clivus was detected and she underwent another operation in january she presented with diplopia and a recurrent mass in her clivus was detected with invasion to the pons external cranial radiotherapy was performed for palliative intent she was started on imatinib in april and sunitinib in june hospital course during her admission the patient reported several symptoms including periorbital edema skin rash nausea visual loss fatigue and handfoot syndrome she received months of imatinib therapy and months of sunitinib treatment her symptoms were monitored and manag

In [18]:
tmpdoc = nlp(tmptxt)
print(list(tmpdoc.sents))
print(tmpdoc.ents)

[discharge summary age sex female medical record number history of presenting illness the patient presented to the hospital with a month history of headache in august a neurological examination showed a limitation of temporal movement in her right eye brain mri revealed masses in her clivus measuring mm x mm and mm x mm she underwent surgery and was subsequently treated with stereotactic radiotherapy in march a recurrence in her clivus was detected and she underwent another operation in january she presented with diplopia and a recurrent mass in her clivus was detected with invasion to the pons external cranial radiotherapy was performed for palliative intent she was started on imatinib in april and sunitinib in june hospital course during her admission the patient reported several symptoms including periorbital edema skin rash nausea visual loss fatigue and handfoot syndrome she received months of imatinib therapy and months of sunitinib treatment her symptoms were monitored and manag

## Split labeled data into train/valid

In [19]:
# Separate labeled data to training/validation
cancer_labeltrain, cancer_labelval = train_test_split(labeldat, test_size=0.4, random_state=1234)
print(cancer_labeltrain.shape)
print(cancer_labelval.shape)
print(cancer_labelval['has_cancer'].value_counts())

(30, 5)
(20, 5)
has_cancer
0.0    13
1.0     7
Name: count, dtype: int64
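
With only 20 validation rows, 7 of them positive, accuracy alone says little, so precision and recall on the positive class are worth tracking for any tagger tried against this split. A minimal sketch, assuming labels and predictions are plain 0/1 sequences; the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when there are no predicted or actual positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels and predictions, not taken from the dataset
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(precision_recall_f1(y_true, y_pred))
```

For `has_diabetes`, with only 5 positives in all 50 labeled rows, any such estimate will be very noisy, and a random split may leave few or no positives in the validation set.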


In [20]:
# Add scispacy's UMLS entity linker to the pipeline.
# Assumes a scispacy release built for spaCy 3, where importing scispacy.linking
# registers the "scispacy_linker" factory, so no manual @Language.factory wrapper is needed.
import scispacy.linking  # noqa: F401

if "scispacy_linker" not in nlp.pipe_names:
    nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
linker = nlp.get_pipe("scispacy_linker")

# Process a text example
# doc = nlp(cancer_labeltrain['text'].iloc[0])
doc = nlp("The patient had a history of Hilar cholangiocarcinoma and cervical cancer, both of which were successfully resected. The patient was also diagnosed with Peutz-Jeghers Syndrome (PJS) based on the presence of hamartomatous polyps in the gastrointestinal tract and melanin pigmentation on the hands. The patient's son also had PJS.")

# Extract linked cancer terms.
# A CUI is an opaque identifier, so it never contains words like 'neoplasm' or 'cancer';
# filter on the linked concept's UMLS semantic type instead (T191 = Neoplastic Process).
NEOPLASM_TUI = "T191"
for ent in doc.ents:
    for cui, score in ent._.kb_ents:
        concept = linker.kb.cui_to_entity[cui]
        if NEOPLASM_TUI in concept.types:
            print(f"Term: {ent.text}, CUI: {cui}, Name: {concept.canonical_name}, Score: {score:.2f}")

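
As a cheap baseline to compare against the entity linker, cancer mentions can also be flagged with a regular expression. This is a sketch under stated assumptions: the term list and suffix list in `CANCER_PATTERN` are hypothetical and far from complete, and the name `find_cancer_terms` is made up for illustration. Generic "-oma" matching is avoided on purpose, because it would also hit benign terms such as "hamartomatous".

```python
import re

# Hypothetical, incomplete vocabulary: explicit suffixes plus a few stems
CANCER_PATTERN = re.compile(
    r'\b(\w*(?:carcinoma|sarcoma|lymphoma|leukemia|melanoma|blastoma)'
    r'|cancer|tumou?r|malignan\w+|metasta\w+|neoplas\w+)\b',
    re.IGNORECASE,
)

def find_cancer_terms(text):
    """Return the lowercased cancer-related terms found in text, in order of appearance."""
    return [m.group(1).lower() for m in CANCER_PATTERN.finditer(text)]

sentence = ("The patient had a history of Hilar cholangiocarcinoma and cervical cancer, "
            "both of which were successfully resected.")
print(find_cancer_terms(sentence))
```

Because the target is whether the patient had cancer at any time, past-tense mentions like the one above count as positives; family-history and negated mentions would still need separate handling.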

# OTHER