This notebook shows how to recode the encoded variables in the dataset to human-readable values. The files used in this notebook are available on the [data download page](https://www.drivendata.org/competitions/217/cdc-fall-narratives/data/).

Key findings

### What are 2-3 takeaways from your exploration?

### Summary of your approach

- Describe the data source(s) you used (e.g., if supplementary or external data was used, how the data was subsetted, if at all).
- What did you do and why (e.g., preprocessing, key features, algorithms you used, unique or novel aspects of your solution, etc.)?
- What worked and what didn’t? How did you evaluate the performance of your approach?
- What are some limitations of your analysis?

### Visualizations

Do you have any useful tables, charts, graphs, or visualizations from the process (e.g., exploring the data, testing different features, summarizing model performance, etc.)?


Winners will be selected by a judging panel of domain experts and researchers. Submissions will be judged according to the following weighted criteria:

- Novelty (35%)
    - To what extent does this submission utilize creative, cutting-edge, or innovative techniques? This can be demonstrated in any or all parts of the submission (e.g., preprocessing, embeddings, models, visualizations).
- Communication (25%)
    - To what extent are findings clearly and effectively communicated? This includes both text and visuals.
- Rigor (20%)
    - To what extent is this submission based on appropriate and correctly implemented methods and approaches (e.g., preprocessing, embeddings, models) with adequate sample sizes?
- Insight (20%)
    - To what extent does this submission contain useful insights about the effectiveness of unsupervised machine learning methods at uncovering patterns in this data and/or informative findings that can advance the research on circumstances related to older adult falls?

We know this is an open-ended challenge. To help, we are providing some guiding lines of inquiry to help you get started. Feel free to use these or try something completely different!

- What information about falls is contained in the narrative but not captured by the other variables?  
- What precipitating events — actions that happen right before the fall — can be identified?  
- How do falls (e.g., severity, type of fall injury) and fall circumstances (e.g., precipitating event, activity involved) differ among various demographic groups (e.g., by race, sex, age)?  
- Are there trends over time in the types of falls that occur?  
- What kinds of people, places, or things are associated with more or more severe falls? Are there alternative explanations for these associations?
- What risk factors are associated with falls?  

The goal of this challenge is to explore the application of unsupervised machine learning techniques to this dataset. Below are some examples of analysis that would be a good fit for this challenge, as well as examples of analysis that are not a good fit.

Good fit

- Remove words from the narratives that are found in other variables (e.g., male, female, body parts) and cluster the remaining text to identify what information is in the narratives that is not captured elsewhere. Then use word embeddings to remove similar words and repeat the analysis.
- Design a way to identify the "precipitating event" before a fall (e.g., tripping on carpet) and compare differences in precipitating events among various demographic groups.
- Explore various clustering algorithms using the provided embeddings. Use the ChatGPT API to produce summaries of the clusters. Compare the results of using the embeddings versus using more traditional bag-of-words preprocessing techniques (e.g., stop word and punctuation removal, stemming and lemmatization, and token weighting such as TF-IDF).


In [1]:
import numpy as np
import pandas as pd
import json
from pathlib import Path

### load variable mapping

In [95]:
with Path("variable_mapping.json").open("r") as f:
    mapping = json.load(f)

In [96]:
# JSON object keys are always read in as strings, so convert the encoded values to integers to match the data columns
for c in mapping.keys():
    mapping[c] = {int(k): v for k, v in mapping[c].items()}
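
JSON object keys are always parsed as strings, which is why the conversion above is needed before the mapping can be matched against integer-coded columns. A minimal sketch with a made-up two-column mapping (the JSON text and the `No/Unk` label are illustrative, not the contents of `variable_mapping.json`):

```python
import json

# Hypothetical JSON text standing in for the real mapping file
raw = '{"sex": {"1": "MALE", "2": "FEMALE"}, "alcohol": {"0": "No/Unk", "1": "Yes"}}'

toy_mapping = json.loads(raw)
# keys come back as strings even though they look like numbers
print(list(toy_mapping["sex"].keys()))

# convert the keys of every inner dictionary to integers
toy_mapping = {
    col: {int(k): v for k, v in codes.items()}
    for col, codes in toy_mapping.items()
}
print(toy_mapping["sex"][1])
```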

In [97]:
# the keys in the dictionary correspond to columns in the primary (and supplementary) data
mapping.keys()

dict_keys(['sex', 'race', 'hispanic', 'alcohol', 'drug', 'body_part', 'body_part_2', 'diagnosis', 'diagnosis_2', 'disposition', 'location', 'fire_involvement', 'product_1', 'product_2', 'product_3'])

In [98]:
# mapping for 'sex' column
mapping["sex"]

{0: 'UNKNOWN', 1: 'MALE', 2: 'FEMALE', 3: 'NON-BINARY/OTHER'}

### load primary data

In [99]:
df = pd.read_csv(
    "primary_data.csv",
    # set columns that can be null to nullable ints
    dtype={"body_part_2": "Int64", "diagnosis_2": "Int64"},
)
df.head()

Unnamed: 0,cpsc_case_number,narrative,treatment_date,age,sex,race,other_race,hispanic,diagnosis,other_diagnosis,...,body_part,body_part_2,disposition,location,fire_involvement,alcohol,drug,product_1,product_2,product_3
0,190103269,94YOM FELL TO THE FLOOR AT THE NURSING HOME ON...,2019-01-01,94,1,0,,0,62,,...,75,,4,5,0,0,0,1807,0,0
1,190103270,86YOM FELL IN THE SHOWER AT HOME AND SUSTAINED...,2019-01-01,86,1,0,,0,62,,...,75,,4,1,0,0,0,611,0,0
2,190103273,87YOF WAS GETTING UP FROM THE COUCH AND FELL T...,2019-01-01,87,2,0,,0,53,,...,32,,4,1,0,0,0,679,1807,0
3,190103291,67YOF WAS AT A FRIENDS HOUSE AND SLIPPED ON WA...,2019-01-01,67,2,0,,0,57,,...,33,,1,1,0,0,0,1807,0,0
4,190103294,70YOF WAS STANDING ON A STEP STOOL AND FELL OF...,2019-01-01,70,2,0,,0,57,,...,33,,1,1,0,0,0,620,0,0


In [100]:
# look at data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115128 entries, 0 to 115127
Data columns (total 22 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   cpsc_case_number   115128 non-null  int64 
 1   narrative          115128 non-null  object
 2   treatment_date     115128 non-null  object
 3   age                115128 non-null  int64 
 4   sex                115128 non-null  int64 
 5   race               115128 non-null  int64 
 6   other_race         1022 non-null    object
 7   hispanic           115128 non-null  int64 
 8   diagnosis          115128 non-null  int64 
 9   other_diagnosis    2522 non-null    object
 10  diagnosis_2        43145 non-null   Int64 
 11  other_diagnosis_2  4978 non-null    object
 12  body_part          115128 non-null  int64 
 13  body_part_2        43145 non-null   Int64 
 14  disposition        115128 non-null  int64 
 15  location           115128 non-null  int64 
 16  fire_involvement   1

### replace numeric values with corresponding strings

In [101]:
decoded_df = df.copy()

for col in mapping.keys():
    decoded_df[col] = decoded_df[col].map(mapping[col])

In [102]:
decoded_df.head(3).transpose()

Unnamed: 0,0,1,2
cpsc_case_number,190103269,190103270,190103273
narrative,94YOM FELL TO THE FLOOR AT THE NURSING HOME ON...,86YOM FELL IN THE SHOWER AT HOME AND SUSTAINED...,87YOF WAS GETTING UP FROM THE COUCH AND FELL T...
treatment_date,2019-01-01,2019-01-01,2019-01-01
age,94,86,87
sex,MALE,MALE,FEMALE
race,N.S.,N.S.,N.S.
other_race,,,
hispanic,Unk/Not stated,Unk/Not stated,Unk/Not stated
diagnosis,62 - INTERNAL INJURY,62 - INTERNAL INJURY,"53 - CONTUSIONS, ABR."
other_diagnosis,,,


In [103]:
# ensure mappings were applied correctly by checking that the number of missing values did not change
assert (decoded_df.isnull().sum() == df.isnull().sum()).all()

In [127]:
random_sample = decoded_df.sample(n=5000, replace=False, random_state=42)

In [128]:
# take a copy so that later column assignments do not act on a view of random_sample
random_sample_sub = random_sample.iloc[:, 0:2].copy()

In [129]:
random_sample_sub.head()

Unnamed: 0,cpsc_case_number,narrative
80147,220302739,74YOF WAS SITTING IN A WHEELCHAIR LEANING FORW...
48158,210101788,69YOF WAS PLAYING WITH HER GRANDCHILD WHEN SHE...
112225,221250671,90YOF TRIPPED GETTING OUT OF AN ELEVATOR AND H...
50792,210212098,101YOF SLIPPED ON NYLON SOCKS WHILE COMING OUT...
115089,230214236,66YOM PRESENTS AFTER A SYNCOPAL EPISODE WITH F...


In [130]:
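import re

def replace_with_space(text, target="DX"):
    """Pad `target` with a space wherever it touches a non-whitespace character."""
    escaped = re.escape(target)
    # insert a space before the target when it follows a non-space character
    replaced_text = re.sub(r'(\S)({})'.format(escaped), r'\1 {}'.format(target), text)
    # insert a space after the target when a non-space character follows it
    replaced_text = re.sub(r'({})(\S)'.format(escaped), r'{} \2'.format(target), replaced_text)
    return replaced_text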

In [131]:
random_sample_sub['narrative'] = random_sample_sub['narrative'].apply(lambda x: replace_with_space(x, "DX"))
random_sample_sub['narrative'] = random_sample_sub['narrative'].apply(lambda x: replace_with_space(x, "&"))
random_sample_sub['narrative'] = random_sample_sub['narrative'].apply(lambda x: replace_with_space(x, ">>"))
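
The three calls above could also be written as one loop over the markers. The sketch below is an alternative formulation, not the notebook's function: `pad_markers` is a hypothetical helper that uses zero-width lookarounds so that each marker needs a single `re.sub` pass. It is meant to behave like `replace_with_space` on inputs such as the ones shown here; edge cases (for example, repeated markers) have not been compared.

```python
import re

def pad_markers(text, targets=("DX", "&", ">>")):
    """Insert a space wherever a marker touches a non-whitespace character."""
    for target in targets:
        t = re.escape(target)
        # first alternative: position just before the marker, preceded by a non-space
        # second alternative: position just after the marker, followed by a non-space
        pattern = r'(?<=\S)(?={0})|(?<={0})(?=\S)'.format(t)
        text = re.sub(pattern, ' ', text)
    return text

print(pad_markers("REACHING FOR A PILLOW&ROLLED OUT OF BED LANDING"))
```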

# NLP

In [None]:
import spacy

In [None]:
print(spacy.__version__)

In [None]:
nlp = spacy.blank("en")
print(nlp.pipe_names)

In [None]:
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)

In [None]:
doc = nlp("HISTORY OF PRESENT ILLNESS:,  The patient is well known to me for a history of iron-deficiency anemia due to chronic blood loss from colitis.  We corrected her hematocrit last year with intravenous (IV) iron.  Ultimately, she had a total proctocolectomy done on 03/14/2007 to treat her colitis.  Her course has been very complicated since then with needing multiple surgeries for removal of hematoma.  This is partly because she was on anticoagulation for a right arm deep venous thrombosis (DVT) she had early this year, complicated by septic phlebitis.,Chart was reviewed, and I will not reiterate her complex history.")

In [None]:
print("TEXT", "START", "END", "ENTITY TYPE")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
from spacy import displacy
displacy.render(doc[:100], style='ent', jupyter=True)

# My Model

Corrected data

In [None]:
from spacy import displacy
displacy.render(doc, style="span", jupyter = True)

In [10]:
import spacy

In [67]:
fall = pd.read_csv('corrected_fall_set3.csv')

In [68]:
fall.head()

Unnamed: 0,key_entry,span,label
0,94YOF HX OF DEMENTIA S/P POSSIBLE UNWITNESSED ...,FRAME FELL ON HER NOSE,AO
1,70YOF WITH FALL FROM BED DX: LT INTEROTROCHANT...,FALL FROM BED,CHR
2,81 YOM FROM HOME WITH A FALL FROM HIS BED TO T...,FALL FROM HIS BED,CHR
3,70 YOF C/O INJ TO HEAD/FACE AFTER FALL OFF THE...,FALL OFF THE TOLIET,CHR
4,80YOF FALL OFF TOILET ONTO FLOOR AND C/O UNSUR...,FALL OFF TOILET,CHR


In [70]:
# Replace leading whitespaces using a regular expression
fall['span'] = fall['span'].str.replace(r'^\s+', '', regex=True)

In [72]:
replace_with_space("REACHING FOR A PILLOW&ROLLED OUT OF BED LANDING", "&")

'REACHING FOR A PILLOW & ROLLED OUT OF BED LANDING'

In [73]:
fall['key_entry'] = fall['key_entry'].apply(lambda x: replace_with_space(x, "DX"))
fall['key_entry'] = fall['key_entry'].apply(lambda x: replace_with_space(x, "&"))
fall['key_entry'] = fall['key_entry'].apply(lambda x: replace_with_space(x, ">>"))

In [76]:
fall_nas = fall[fall['label'].isna()]
len(fall_nas)

45

In [75]:
fall_labeled = fall[fall['label'].notna()]

In [77]:
key_list = fall_labeled.key_entry.unique()
len(key_list)

279

In [78]:
fall_labeled[fall_labeled.key_entry == key_list[1]]

Unnamed: 0,key_entry,span,label
1,70YOF WITH FALL FROM BED DX : LT INTEROTROCHAN...,FALL FROM BED,CHR


In [85]:
nlp = spacy.blank("en")
from spacy.tokens import Span

# keeping span token lengths to appropriately set config
token_lengths = []

docs=[] # this will hold the processed strings and spans
for k in key_list:
    doc = nlp(k)
    temp_df = fall_labeled[fall_labeled.key_entry == k]

    if len(temp_df)==1:              
        span_text = temp_df.iloc[0,1]
        temp_label = temp_df.iloc[0,2]       
        span_start_char = k.find(span_text)
        span_end_char = span_start_char + len(span_text)

        # Finding the start and end tokens using character offsets
        start_token = None
        end_token = None
        for token in doc:
            if token.idx == span_start_char:
                start_token = token.i
            if token.idx + len(token.text) == span_end_char:
                end_token = token.i
                break
        if start_token is not None and end_token is not None:
            temp_start = start_token
            temp_end = end_token + 1
            doc.spans["sc"] = [Span(doc, temp_start, temp_end, temp_label)]
            docs.append(doc)
            token_lengths.append(temp_end - temp_start)
        else:
            print(k, "span=", span_text,"couldn't find tokens")
    else:
        print("temp_df has length > 1")
        span_list = []
        for ent in range(len(temp_df)):
            span_text = temp_df.iloc[ent,1]
            temp_label = temp_df.iloc[ent,2]
            span_start_char = k.find(span_text)
            span_end_char = span_start_char + len(span_text)
            print(span_start_char, span_end_char)

            # Finding the start and end tokens using character offsets
            start_token = None
            end_token = None
            for token in doc:
                if token.idx == span_start_char:
                    start_token = token.i
                if token.idx + len(token.text) == span_end_char:
                    end_token = token.i
                    break
            if start_token is not None and end_token is not None:
                temp_start = start_token
                temp_end = end_token + 1             
                span_list.append(Span(doc, temp_start, temp_end, temp_label))
                token_lengths.append(temp_end - temp_start)
            else:
                print(k, "span=",span_text, "couldn't find tokens")
        
        doc.spans["sc"] = span_list
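    # this append runs for every key, so a single-span doc that was already
    # appended in the branch above ends up in `docs` twice
    docs.append(doc)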

temp_df has length > 1
23 38
39 62
temp_df has length > 1
23 38
82 106
temp_df has length > 1
20 38
43 71
temp_df has length > 1
59 81
148 180
temp_df has length > 1
31 65
66 96
temp_df has length > 1
79 92
51 74
104 116
temp_df has length > 1
23 36
37 52
temp_df has length > 1
6 22
43 52
temp_df has length > 1
9 28
29 55
83YOM SITTING AT TABLE  AND FELL OUT CHAIR-- DX :HEAD INJURY span= FELL OUT CHAIR couldn't find tokens
temp_df has length > 1
6 25
26 43
temp_df has length > 1
6 21
89 130
temp_df has length > 1
6 21
22 32
temp_df has length > 1
6 21
22 47
temp_df has length > 1
56 71
76 97
temp_df has length > 1
27 44
56 68
temp_df has length > 1
31 54
55 69
temp_df has length > 1
32 49
8 29
50 86
temp_df has length > 1
6 23
41 62
temp_df has length > 1
60 77
7 26
79 94
temp_df has length > 1
30 47
49 74
98 122
temp_df has length > 1
9 22
38 59
temp_df has length > 1
31 44
72 88
temp_df has length > 1
7 27
80 96
60 79
temp_df has length > 1
15 35
54 91
temp_df has length > 1
6 15
29 
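
spaCy also provides `Doc.char_span`, which maps character offsets to a token-aligned `Span` and returns `None` when the offsets do not line up with token boundaries. The sketch below shows how the single-span case might look with it. The narrative text and the `CHR` label are illustrative, the `demo_` names are hypothetical, and `alignment_mode="expand"` is shown only as one possible way to handle offsets that fall inside a token.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer-only pipeline, as in the loop above

demo_text = "70YOF WITH FALL FROM BED DX : LT HIP FX"  # illustrative narrative
demo_span_text = "FALL FROM BED"
demo_label = "CHR"

demo_doc = nlp_blank(demo_text)
start_char = demo_text.find(demo_span_text)
end_char = start_char + len(demo_span_text)

# strict alignment (the default): None unless both offsets sit on token boundaries
span = demo_doc.char_span(start_char, end_char, label=demo_label)
if span is not None:
    demo_doc.spans["sc"] = [span]
    print(span.text, span.label_)

# offsets that end inside a token ("FALL FROM BE"):
# strict alignment gives None, "expand" widens the span to whole tokens
partial = demo_doc.char_span(start_char, end_char - 1, label=demo_label)
expanded = demo_doc.char_span(
    start_char, end_char - 1, label=demo_label, alignment_mode="expand"
)
print(partial, "|", expanded.text)
```

Checking the return value for `None` plays the same role as the "couldn't find tokens" message printed by the loop.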

In [80]:
len(docs)

433

In [90]:
print(np.min(token_lengths), np.max(token_lengths), np.median(token_lengths))

2 14 4.0


In [93]:
np.histogram(token_lengths)

(array([141, 134,  58,  41,  20,   8,   3,   2,   2,   1]),
 array([ 2. ,  3.2,  4.4,  5.6,  6.8,  8. ,  9.2, 10.4, 11.6, 12.8, 14. ]))

Make training and dev (evaluation) sets with the docs

In [81]:
from spacy.tokens import DocBin

In [82]:
doc_bin = DocBin(docs=docs[0:300])

In [83]:
doc_bin.to_disk("./train_230928.spacy")

In [84]:
doc_bin = DocBin(docs=docs[300:])
doc_bin.to_disk("./dev_230928.spacy")
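
The split above takes the first 300 docs in order for training and the rest for evaluation. If the order of `docs` is not random, a shuffled split with a fixed seed is one alternative. A minimal sketch, where `split_train_dev` and its parameters are hypothetical names and the integers stand in for spaCy `Doc` objects:

```python
import random

def split_train_dev(items, dev_fraction=0.3, seed=42):
    """Shuffle a copy of `items` and split it into (train, dev) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# integers stand in for Doc objects here
train_items, dev_items = split_train_dev(range(10), dev_fraction=0.3)
print(len(train_items), len(dev_items))
```

With the real data, each of the two lists would be passed to `DocBin(docs=...)` as before.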

python -m spacy init config ./config.cfg --lang en --pipeline spancat
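
The config generated by this command is then used to train the pipeline. The sketch below assembles a matching `spacy train` call in Python. The file names are taken from the cells in this notebook, the output directory is presumed from the model path loaded later, and `build_train_command` is a hypothetical helper. The command is only printed here, not run.

```python
import shlex
import sys

def build_train_command(config="./config.cfg", output="./outputs230928",
                        train="./train_230928.spacy", dev="./dev_230928.spacy"):
    """Return the argument list for `python -m spacy train` with path overrides."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output,
        "--paths.train", train,
        "--paths.dev", dev,
    ]

cmd = build_train_command()
print(shlex.join(cmd[1:]))  # the arguments after the interpreter path
# To start training, pass the list to subprocess.run(cmd, check=True)
```

`spacy train` writes `model-best` and `model-last` under the output directory, which is consistent with the `outputs230928/model-best` path loaded in the next section.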

## Using model

In [114]:
nlp = spacy.load("outputs230928/model-best")

In [132]:
random_sample_sub.iloc[1000,1]

'HEAD INJ/91YOWF WAS AMBULATING IN CHURCH WHEN LOST HER FOOT ON A STEP AND FELL DOWN 1 STEP, FALLING & STRIKING HER FOREHEAD ON THE FLOOR.NO LOC BUT C/O HEADACHE.'

In [133]:
doc = nlp(random_sample_sub.iloc[1000,1])
doc.text

'HEAD INJ/91YOWF WAS AMBULATING IN CHURCH WHEN LOST HER FOOT ON A STEP AND FELL DOWN 1 STEP, FALLING & STRIKING HER FOREHEAD ON THE FLOOR.NO LOC BUT C/O HEADACHE.'

In [134]:
for span in doc.spans["sc"]:
    print(span.label_, span.start, span.end, span.text)

In [136]:
random_sample_sub2 = random_sample_sub.iloc[1000:1100]
for text in random_sample_sub2['narrative']:
    print(text)
    doc = nlp(text)
    for span in doc.spans["sc"]:
        print(span.label_, span.start, span.end, span.text)

HEAD INJ/91YOWF WAS AMBULATING IN CHURCH WHEN LOST HER FOOT ON A STEP AND FELL DOWN 1 STEP, FALLING & STRIKING HER FOREHEAD ON THE FLOOR.NO LOC BUT C/O HEADACHE.
76YOF HAD A SYNCOPAL EVENT IN THE SHOWER, FELL HITTING HEAD DX : CONTUSION SCALP, SYNCOPE
COL 3 5 SYNCOPAL EVENT
SU 10 12 HITTING HEAD
78YOM ROLLED OUT OF BED AND FELL ONTO CARPET DX : CONTUSION OF CHEST WALL
SF 6 9 FELL ONTO CARPET
CHR 1 5 ROLLED OUT OF BED
85 YOF FELL  IN FLOOR AT HOME  DX ;  CHI
73YOM   PT TRIPPED ON DOG WHEN HE WAS HE WAS TURNING AROUND,CAUSED PT TO FALL TO FLOOR, STRUCK LT SIDE AGAINST COFFEE TABLE     DX :  LT RIB FRACTURES    #
OBJ 3 6 TRIPPED ON DOG
SO 21 27 STRUCK LT SIDE AGAINST COFFEE TABLE
76YOF STOOD FROM BED & FELL ONTO CHAIR. DX : MULTI RIB FX, NONDISPLACED FX OF CORONOID PROCESS OF LEFT ULNA, HIP HEMATOMA
TRS 1 4 STOOD FROM BED
SO 5 8 FELL ONTO CHAIR
77YOF WAS PUTTING A COMFORTER ON THE TOP SHELF OF HER CLOSET WHEN SHE FELL BACKWARDS STRIKING HER UPPER BACK ON THE EDGE OF THE DOOR DX : RIB FRAC

TRS 4 7 MOVING FROM COMMODE
69YOF WAS GETTING OUT OF THE SHOWER AND HAD A SLIP AND FALL TO THE TILE FLOOR DX : CLOSED HEAD INJURY LACERATION TO ELBOW
85YOM WAS WALKING UP STEP TO PORCH AND TRIPPED INJURING LOWER LEG DX ABRASIONS LOWER LEG


In [None]:
random_sample_sub2 = random_sample_sub.iloc[1000:1100]
for text in random_sample_sub2['narrative']:
    doc = nlp(text)
    for span in doc.spans["sc"]:
        print(text, span.label_, span.text)
        

In [151]:
# Collect one row per predicted span; narratives with no spans get a single "NA" row
rows = []
for text in random_sample_sub2['narrative']:
    doc = nlp(text)
    spans = doc.spans["sc"]
    if len(spans) == 0:
        rows.append([text, "NA", "NA"])
    else:
        for span in spans:
            rows.append([text, span.label_, span.text])

# Build the DataFrame once instead of concatenating inside the loop
output_df = pd.DataFrame(rows, columns=['text', 'span_label', 'span_text'])

In [152]:
output_df.to_csv("predictions.csv")