# A workflow for training and applying an NLP-based self-harm classifier to ED triage notes

## Data preprocessing 
Run notebook `1-prepare-dataset.ipynb`
1. **Load the original dataset**\
    Input: `RMH_2012-2019_MASTER.csv`\
    Output: dataframe $D$ containing raw ED presentations with 559454 rows and 43 columns
    

2. **Generate a unique ID for each presentation**\
    Input: dataframe $D$\
    Output: dataframe $D$ containing raw ED presentations with 559454 rows and 44 columns
    
    
3. **Drop the last two rows** *(should not be necessary, only done due to the format of this version of the dataset)*\
    Input: dataframe $D$\
    Output: dataframe $D$ containing raw ED presentations with 559452 rows and 44 columns
    
    
4. **Drop fully duplicated rows**\
    Input: dataframe $D$\
    Output: dataframe $D$ containing raw ED presentations with 559419 rows and 44 columns
    
    
5. **Rename columns** _(to make the code shorter and easier to read)_
6. **Convert data types** _(converting to datetime is not required by the classifier; it is only needed to check and visualise the data)_
7. **Order the dataframe**
8. **Data checks and corrections:**
    - check if the Year column was extracted correctly
    - in the 258 presentations where triage is dated before 01.01.2012, replace the triage date with the corresponding arrival date
    - check for ED presentations with the triage date more than 24h before the arrival date
    - check for ED presentations with more than 24h between the arrival and triage
    - check for presentations with empty triage notes, remove the 3993 presentations with empty triage notes
    - check if there are presentations with simply "as above" in triage notes, remove 46 such presentations
    - examine the distribution of character length of triage notes
    - examine the number of presentations positive for self-harm and suicidal ideation, and their distribution across the years
    - examine the distribution of age, remove 148 presentations of patients under the age of 9
    - examine the values for gender, normalise to have 4 categories: `female`, `male`, `intersex`, `unknown`
    - examine the arrival mode, normalise to have 7 categories: `road ambulance`, `police`, `private ambulance`, `helicopter`, `air ambulance`, `self/community/pt`    
    Output: dataframe $D$ containing raw ED presentations with 555232 rows and 46 columns
9. **Preprocess triage notes** _(using the function `preprocess` from `utils.py`; this includes removing weird characters, expanding some of the most common shorthands, normalising several concepts, and removing duplicated punctuation marks)_
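
The row-level steps above (dropping the trailing rows, dropping duplicates, removing uninformative triage notes, and the age filter) can be sketched in pandas roughly as follows. This is a minimal illustration on a toy dataframe, not the code from `1-prepare-dataset.ipynb`: the column names `triage_note` and `age` and the helper `clean_presentations` are assumptions made for the example.

```python
import pandas as pd


def clean_presentations(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of steps 3, 4 and part of step 8 (column names are assumed)."""
    # Step 3: drop the last two rows (an artefact of this version of the dataset).
    df = df.iloc[:-2]
    # Step 4: drop rows that are duplicated across every column.
    df = df.drop_duplicates()
    # Step 8: flag empty notes and notes that only say "as above".
    notes = df["triage_note"].fillna("").str.strip()
    is_empty = notes.eq("")
    is_as_above = notes.str.lower().eq("as above")
    df = df[~(is_empty | is_as_above)]
    # Step 8: remove presentations of patients under the age of 9.
    df = df[df["age"] >= 9]
    return df.reset_index(drop=True)


# Toy data: the last two rows stand in for the trailing rows dropped in step 3.
toy = pd.DataFrame({
    "triage_note": [
        "pt c/o chest pain", "pt c/o chest pain", "", "As above",
        "fall at home", "si, od on paracetamol", "footer", None,
    ],
    "age": [34, 34, 50, 41, 7, 22, 0, 0],
})
cleaned = clean_presentations(toy)
print(len(cleaned))  # 2 rows remain
```

Printing the row count after each filter, as the notebook's outputs above do, makes it easy to confirm that every step removes the expected number of presentations.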

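The kind of text cleaning described in step 9 might look like the sketch below. It is not the `preprocess` function from `utils.py`: the function name `preprocess_sketch`, the shorthand map, and the choice to treat non-printable and non-ASCII characters as "weird" are all assumptions, and concept normalisation is not shown.

```python
import re

# Hypothetical shorthand map; the real expansions are defined in utils.py.
SHORTHANDS = {"pt": "patient", "hx": "history", "c/o": "complains of"}


def preprocess_sketch(note: str) -> str:
    # Replace characters outside printable ASCII with a space (assumed rule).
    note = re.sub(r"[^\x20-\x7E]", " ", note)
    # Collapse runs of the same punctuation mark, e.g. "!!!" becomes "!".
    note = re.sub(r"([.,;:!?\-])\1+", r"\1", note)
    # Expand whole-token shorthands, ignoring case; this also collapses whitespace.
    tokens = [SHORTHANDS.get(tok.lower(), tok) for tok in note.split()]
    return " ".join(tokens)


example = preprocess_sketch("Pt c/o  chest pain!!! hx of SI..\x00")
print(example)  # patient complains of chest pain! history of SI.
```

Matching shorthands on whole tokens only avoids expanding substrings inside longer words, at the cost of missing shorthands that are glued to punctuation.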
## Splitting the data
Run notebook `2-create-separate-datasets.ipynb`
1. Retain presentations from 2018-2019 as holdout data for prospective validation
2. Split presentations from 2012-2017 into training and test sets, stratifying by the self-harm (SH) label
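
The two splitting steps above can be sketched as follows. This is an illustration only, not the code from `2-create-separate-datasets.ipynb`: the column names `year` and `SH`, the 20% test fraction, the random seed, and the helper `split_presentations` are assumptions. Stratification is done here by sampling the same fraction from each SH class with pandas.

```python
import pandas as pd


def split_presentations(df: pd.DataFrame, test_frac: float = 0.2, seed: int = 42):
    """Return (train, test, holdout); test_frac and seed are assumed values."""
    # Step 1: keep 2018-2019 aside as holdout data for prospective validation.
    holdout = df[df["year"].isin([2018, 2019])]
    # Step 2: split 2012-2017 into training and test sets.
    dev = df[df["year"].between(2012, 2017)]
    # Sampling the same fraction within each SH class keeps the class ratio.
    test = dev.groupby("SH").sample(frac=test_frac, random_state=seed)
    train = dev.drop(index=test.index)
    return train, test, holdout


# Toy data: 100 presentations in 2012-2017 (10 SH-positive), 20 in 2018-2019.
years = [2012 + i % 6 for i in range(100)] + [2018 + i % 2 for i in range(20)]
sh = [1 if i % 10 == 0 else 0 for i in range(100)] + [0] * 20
toy = pd.DataFrame({"year": years, "SH": sh})

train, test, holdout = split_presentations(toy)
print(len(train), len(test), len(holdout))  # 80 20 20
print(train["SH"].mean(), test["SH"].mean())  # 0.1 0.1
```

Because self-harm presentations are a small minority, an unstratified random split could leave the test set with too few positive cases to evaluate the classifier reliably.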

## Learning

## Inference
Run notebooks `3-normalize-triage-notes.ipynb`, `4-extract-concepts.ipynb`, `5-make-predictions.ipynb` in succession, or alternatively run `inference.ipynb`
1. **Load unseen data**\
    Input: a .csv file with preprocessed triage notes\
    Output: a dataframe containing preprocessed triage notes $T$ and the corresponding labels $Y$ for self-harm and suicidal ideation
    
    
2. **Tokenize**\
    Input: preprocessed triage notes $T$\
    Output: triage note texts $T$ split into individual tokens
    
    
3. **Re-tokenize**\
    Input: tokenized texts $T$, previously learned vocabulary $V$\
    Output: tokenized texts $T$ with compound terms further split into individual tokens
    
    
4. **Separate leading full stop**\
    Input: tokenized texts $T$\
    Output: tokenized texts $T$ with leading full stops separated as individual tokens


5. **Spelling correction**\
    Input: triage note texts $T$, previously learned dictionary of misspellings $S$\
    Output: triage note texts $T$ with corrected misspelled words
    
    
6. **Slang replacement**\
    Input: triage note texts $T$, medication names $M$\
    Output: triage note texts $T$ with slang drug names replaced with generic drug names
    

7. **Extract concepts**\
    Input: triage note texts $T$\
    Output: lists of concepts $C$ extracted from triage note texts $T$
    
    
8. **Make predictions**\
    Input: lists of concepts $C$, pipeline $P$ consisting of a pretrained vectorizer and a pretrained ML classifier, previously learned threshold value $\theta$ used to convert predicted probabilities into class labels\
    Output: class labels $\hat{Y}$
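
A few of the inference steps above (tokenization, separating leading full stops, dictionary-based spelling correction, and thresholding) can be sketched in plain Python as shown below. This is a simplified illustration, not the code from the notebooks: whitespace tokenization, the example misspelling dictionary, the function names, and the use of `>=` when comparing against $\theta$ are all assumptions. Re-tokenization, slang replacement, and concept extraction are not shown.

```python
def tokenize(note: str) -> list[str]:
    # Step 2 (assumed): split on whitespace.
    return note.split()


def separate_leading_full_stop(tokens: list[str]) -> list[str]:
    # Step 4: turn a token such as ".denies" into "." and "denies".
    out = []
    for tok in tokens:
        if len(tok) > 1 and tok.startswith("."):
            out.extend([".", tok[1:]])
        else:
            out.append(tok)
    return out


def correct_spelling(tokens: list[str], misspellings: dict[str, str]) -> list[str]:
    # Step 5: look each token up in the dictionary of misspellings S.
    return [misspellings.get(tok, tok) for tok in tokens]


def to_labels(probabilities: list[float], threshold: float) -> list[int]:
    # Step 8: convert predicted probabilities into class labels using theta.
    return [int(p >= threshold) for p in probabilities]


# Hypothetical dictionary of misspellings S, for illustration only.
S = {"overdoes": "overdose"}

tokens = tokenize("pt took overdoes of paracetamol .denies si")
tokens = separate_leading_full_stop(tokens)
tokens = correct_spelling(tokens, S)
print(tokens)
# ['pt', 'took', 'overdose', 'of', 'paracetamol', '.', 'denies', 'si']

labels = to_labels([0.12, 0.50, 0.87], threshold=0.5)
print(labels)  # [0, 1, 1]
```

In the actual workflow the probabilities would come from the pipeline $P$ applied to the concept lists $C$, and $\theta$ would be the threshold learned during training rather than the 0.5 used here.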