This notebook shows how to recode the encoded variables in the dataset to human-readable values. The files used in this notebook are available on the [data download page](https://www.drivendata.org/competitions/217/cdc-fall-narratives/data/).

## Key findings

### What are 2-3 takeaways from your exploration?

### Summary of your approach

- Describe the data source(s) you used (e.g., if supplementary or external data was used, how the data was subsetted, if at all).
- What did you do and why (e.g., preprocessing, key features, algorithms you used, unique or novel aspects of your solution, etc.)?
- What worked and what didn’t? How did you evaluate the performance of your approach?
- What are some limitations of your analysis?

### Visualizations

Do you have any useful tables, charts, graphs, or visualizations from the process (e.g., exploring the data, testing different features, summarizing model performance, etc.)?


Winners will be selected by a judging panel of domain experts and researchers. Submissions will be judged according to the following weighted criteria:

- Novelty (35%)
    - To what extent does this submission utilize creative, cutting-edge, or innovative techniques? This can be demonstrated in any or all parts of the submission (e.g., preprocessing, embeddings, models, visualizations).
- Communication (25%)
    - To what extent are findings clearly and effectively communicated? This includes both text and visuals.
- Rigor (20%)
    - To what extent is this submission based on appropriate and correctly implemented methods and approaches (e.g., preprocessing, embeddings, models) with adequate sample sizes?
- Insight (20%)
    - To what extent does this submission contain useful insights about the effectiveness of unsupervised machine learning methods at uncovering patterns in this data and/or informative findings that can advance the research on circumstances related to older adult falls?

We know this is an open-ended challenge. To help, we are providing some guiding lines of inquiry to help you get started. Feel free to use these or try something completely different!

- What information about falls is contained in the narrative but not captured by the other variables?  
- What precipitating events — actions that happen right before the fall — can be identified?  
- How do falls (e.g., severity, type of fall injury) and fall circumstances (e.g., precipitating event, activity involved) differ among various demographic groups (e.g., by race, sex, age)?  
- Are there trends over time in the types of falls that occur?  
- What kinds of people, places, or things are associated with more or more severe falls? Are there alternative explanations for these associations?
- What risk factors are associated with falls?  
The goal of this challenge is to explore the application of unsupervised machine learning techniques to this dataset. Below are some examples of analysis that would be a good fit for this challenge, as well as examples of analysis that are not a good fit.

Good fit

- Remove words from the narratives that are found in other variables (e.g., male, female, body parts) and cluster the remaining text to identify what information is in the narratives that is not captured elsewhere. Then use word embeddings to remove similar words and repeat the analysis.
- Design a way to identify the "precipitating event" before a fall (e.g., tripping on carpet) and compare differences in precipitating events among various demographic groups.
- Explore various clustering algorithms using the provided embeddings. Use the ChatGPT API to produce summaries of the clusters. Compare the results of using the embeddings versus using more traditional bag-of-words preprocessing techniques (e.g., stop word and punctuation removal, stemming and lemmatization, and token weighting such as TF-IDF).


In [1]:
import numpy as np
import pandas as pd
import json
from pathlib import Path

# NLP

In [2]:
import spacy

In [3]:
print(spacy.__version__)

3.6.1


In [4]:
# import scispacy

# My Model

In [5]:
# Open the file for reading
with open('falling_tags_all.json', 'r') as file:
    data = json.load(file)

In [6]:
keys_list = list(data.keys())

In [7]:
#data

In [19]:
nlp = spacy.blank("en")
from spacy.tokens import Span
# spancat = nlp.add_pipe("spancat_singlelabel")

docs = []
for k in keys_list:
    # k is the narrative text; data[k] maps each tagged span's text to its label
    temp_dict = data[k]
    doc = nlp(k)
    spans = []
    for span_text, label in temp_dict.items():
        if label in ['SO']:
            span_start_char = k.find(span_text)
            if span_start_char == -1:
                continue  # tagged text does not occur in the narrative
            span_end_char = span_start_char + len(span_text)

            # Finding the start and end tokens using character offsets
            start_token = None
            end_token = None
            for token in doc:
                if token.idx == span_start_char:
                    start_token = token.i
                if token.idx + len(token.text) == span_end_char:
                    end_token = token.i
                    break

            if start_token is not None and end_token is not None:
                # Span takes integer token indices; the end index is exclusive
                spans.append(Span(doc, start_token, end_token + 1, label))

    # Keep every matched span instead of overwriting the group on each match
    doc.spans["sc"] = spans
    if spans:
        print(doc.spans)

    docs.append(doc)

{'sc': [HITTING HIS HEAD ON THE SHARP EDGE OF A COFFEE TABLE]}
{'sc': [HITTING REFRIGERATOR]}
{'sc': [STRIKING HEAD ONTO THE KITCHEN SINK]}
{'sc': [STRUCK HEAD ON A PIECE OF FURNITURE]}
{'sc': [STRIKING FACE ON THE CORNER OF A PIECE OF FURNITURE]}
{'sc': [STRUCK A CONCRETE PEDESTAL]}
{'sc': [HITTING HEAD/FACE ON BATHROOM CABINET]}
{'sc': [STRUCK HEAD ON THE TOILET]}
{'sc': [HIT THE BACK OF HER HEAD ON THE CORNER OF HER DRESSER]}
{'sc': [HITTING ARM ON THE BANISTER]}
{'sc': [HIT HEAD ON A REFRIGERATOR DOOR]}
{'sc': [INTO CABINET]}
{'sc': [HIT HEAD ON THE WALL]}
{'sc': [HITTING HEAD AGAINST THE STAIRS]}
{'sc': [FELL BACK AGAINST BED FRAME]}
{'sc': [FELL INTO A WALL]}
{'sc': [HIT THE BACK OF HER HEAD ON THE CLOSET WALL]}
{'sc': [INTO BUSHES,HIT HEAD]}
{'sc': [STRIKING FACE ON THE CORNER OF A PIECE OF FURNITURE]}
{'sc': [HITTING HER HEAD ON A WINDOW SILL]}
{'sc': [HITTING HEAD AGAINST CEMENT STAIRS]}
{'sc': [STRUCK HER ELBOW ON A PIECE OF FURNITURE]}
{'sc': [HIT A CHAIR WITH HIS RIBS AS HE

In [22]:
len(docs)

248
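
The character-to-token alignment loop used to build `docs` can also be written with spaCy's `Doc.char_span`, which returns `None` when the character offsets do not fall on token boundaries. The following is a minimal sketch of that alternative, not the code that produced the output above. The helper name `add_labeled_spans`, its parameters, the example narrative, and the `"OTHER"` label are hypothetical; the `"SO"` label and the `"sc"` span key come from the notebook.

```python
import spacy


def add_labeled_spans(doc, tagged, wanted_labels=("SO",), key="sc"):
    """Attach every tagged span whose label is wanted to doc.spans[key].

    `tagged` maps span text -> label (assumed to mirror the JSON used above).
    """
    spans = []
    for span_text, label in tagged.items():
        if label not in wanted_labels:
            continue
        start = doc.text.find(span_text)
        if start == -1:
            continue  # tagged text does not occur in the narrative
        # "strict" returns None unless the offsets align with token boundaries
        span = doc.char_span(
            start, start + len(span_text), label=label, alignment_mode="strict"
        )
        if span is not None:
            spans.append(span)
    doc.spans[key] = spans
    return doc


nlp_blank = spacy.blank("en")
example_doc = add_labeled_spans(
    nlp_blank("FELL AND HIT HEAD ON THE WALL AT HOME"),
    {"HIT HEAD ON THE WALL": "SO", "AT HOME": "OTHER", "NOT PRESENT": "SO"},
)
print(example_doc.spans["sc"])
```

Like the loop above, this only finds the first occurrence of each tagged string, so a narrative that repeats the same phrase would still yield a single span for it.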

python -m spacy init config ./config.cfg --lang en --pipeline spancat

In [21]:
from spacy.tokens import DocBin

In [27]:
doc_bin = DocBin(docs=docs[0:150])

In [25]:
doc_bin.to_disk("./train_practice.spacy")

In [26]:
doc_bin = DocBin(docs=docs[150:])  # same default attrs as the training DocBin
doc_bin.to_disk("./dev_practice.spacy")
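
The split above takes the first 150 documents in file order for training and the remaining 98 for development. If the JSON file happens to be ordered (for example by annotation date), a shuffled split may be more representative. The following is a small sketch of that idea using only the standard library; `split_train_dev` and its `seed` parameter are hypothetical names, and nothing here reflects how the saved `.spacy` files were actually produced.

```python
import random


def split_train_dev(items, n_train, seed=0):
    """Return (train, dev) after shuffling a copy of `items` reproducibly."""
    shuffled = list(items)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 248 docs built earlier; with the real list this would be
#   train_docs, dev_docs = split_train_dev(docs, 150)
train_items, dev_items = split_train_dev(range(248), 150)
print(len(train_items), len(dev_items))
```

The two returned lists could then be passed to `DocBin(docs=...)` in place of `docs[0:150]` and `docs[150:]`.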

## Using model

In [32]:
nlp = spacy.load("/Users/wendyphillips/Documents/Computing/WendysPython/Falling/outputs2/model-best")

In [55]:
doc = nlp("85YOF WITH HEAD INJURY S/P FALL AND HITTING HEAD ON WALL DX: CLOSED HEAD INJURY AND SCALP CONTUSION")


In [56]:
doc.spans["sc"]

[HITTING HEAD ON WALL]

In [35]:
for span in doc.spans["sc"]:
    print(span.label_, span.start, span.end, span.text)

SO 9 13 SMASHING HEAD ON DRESSER


In [36]:
for text in random_sample_sub['narrative']:
    doc = nlp(text)
    for span in doc.spans["sc"]:
        print(span.label_, span.start, span.end, span.text)


NameError: name 'random_sample_sub' is not defined
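
The cell above fails because `random_sample_sub` is never defined in this notebook; it would first have to be created, for example as a sample of rows from the narratives table. Once a collection of narrative strings exists, the loop can be wrapped in a small helper that streams texts through the pipeline with `nlp.pipe` and skips documents that have no `"sc"` span group. This is a sketch under those assumptions: `extract_spans` is a hypothetical name, and the blank pipeline below is only a stand-in for the trained model, so it predicts no spans.

```python
import spacy


def extract_spans(nlp, texts, key="sc"):
    """Collect (text_index, label, start_token, end_token, span_text) rows."""
    rows = []
    for i, doc in enumerate(nlp.pipe(texts)):
        if key not in doc.spans:
            continue  # pipeline produced no span group for this document
        for span in doc.spans[key]:
            rows.append((i, span.label_, span.start, span.end, span.text))
    return rows


# Stand-in pipeline without a span categorizer; with the trained model loaded
# as `nlp`, the call would be extract_spans(nlp, <iterable of narratives>).
blank_rows = extract_spans(spacy.blank("en"), ["HIT HEAD ON WALL"])
print(blank_rows)
```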