## DrCoding: Predicting ICD-9 codes from clinician notes

Date: Feb 18, 2020

- Added `NOTEEVENTS.csv`, `DIAGNOSES_ICD.csv`, `D_ICD_DIAGNOSES.csv` to `data` folder
- Created a script that joins `DIAGNOSES_ICD.csv` with `D_ICD_DIAGNOSES.csv` to produce a single file containing the ICD-9 diagnoses for each patient
- Create the environment with `conda env create --file local_env.yml`, then activate it with `conda activate drcoding`
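
A rough sketch of the join described above, using only the standard library. It is not the actual script. The column names (`HADM_ID`, `SEQ_NUM`, `ICD9_CODE`, `LONG_TITLE`) are assumed to follow the MIMIC-III layout, the rows are invented for illustration, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for the two CSV files (rows and titles are illustrative only).
DIAGNOSES_ICD = """ROW_ID,SUBJECT_ID,HADM_ID,SEQ_NUM,ICD9_CODE
1,10,100,2,25000
2,10,100,1,4019
3,11,101,1,4280
"""
D_ICD_DIAGNOSES = """ROW_ID,ICD9_CODE,SHORT_TITLE,LONG_TITLE
1,4019,Hypertension NOS,Unspecified essential hypertension
2,25000,Diabetes NOS,Diabetes mellitus without complication
3,4280,CHF NOS,"Congestive heart failure, unspecified"
"""

def load_descriptions(handle):
    """Map each ICD-9 code to its long description."""
    return {row["ICD9_CODE"]: row["LONG_TITLE"] for row in csv.DictReader(handle)}

def diagnoses_per_admission(handle, descriptions):
    """Group (code, description) pairs by admission, ordered by SEQ_NUM."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        code = row["ICD9_CODE"]
        seq = int(row["SEQ_NUM"] or 0)  # assumption: a blank SEQ_NUM sorts first
        grouped[row["HADM_ID"]].append((seq, code, descriptions.get(code, "")))
    return {
        hadm_id: [(code, text) for _, code, text in sorted(entries)]
        for hadm_id, entries in grouped.items()
    }

descriptions = load_descriptions(io.StringIO(D_ICD_DIAGNOSES))
combined = diagnoses_per_admission(io.StringIO(DIAGNOSES_ICD), descriptions)
for hadm_id, codes in combined.items():
    print(hadm_id, codes)
```

For the real files, the `io.StringIO` handles would be replaced with `open(path, newline="")`.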


Date: Feb 21, 2020
- Continued working on the script
- Consider whether the model should output the ICD-9 code or the full ICD-9 description instead. Food for thought.

Ran the following for testing:
```
python preprocess.py --icdmap=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/D_ICD_DIAGNOSES.csv --icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/DIAGNOSES_ICD.csv --notes=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/NOTEEVENTS.csv --sample
```


Date: Feb 22, 2020
- Useful links:
    - https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d
    
- Ran the following for file generation: 
    
    
- For the baseline, create an LSTM model loosely based on https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
    

Date: Feb 23, 2020
   
- Wrote script to split the generated processed file
- Ran the following to split:
```
python split.py /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/NOTEEVENTS.csv.txt /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/DIAGNOSES_ICD.csv.txt /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split --size=1000
```

```
python vocab.py --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.train  --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.train vocab.json
```

- Extension idea: ICD codes have descriptions that provide much more context than just the code itself. What if we could use this context as part of the attention scheme itself?
   - The description would be metadata used to train an attention projection
   - Attention projection applied on the input sequence at test time to determine which words are more important. Those words get fed into a Transformer in the next layer.
   - Replace self-attention completely/partially/additionally with attention between source and description
   - https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/transformer.py
   - https://github.com/pytorch/pytorch/blob/7aa605ed92dbb726a2f285246369b0a17972ba12/torch/nn/modules/activation.py#L673
   - https://github.com/pytorch/pytorch/blob/7aa605ed92dbb726a2f285246369b0a17972ba12/torch/nn/functional.py#L3581
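
To make the extension idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention where the ICD description supplies the queries and the note tokens supply the keys. It omits the learned projection mentioned above and uses raw toy vectors, so it only illustrates how description-to-note attention could rank note words. The function names and the 2-dimensional embeddings are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def note_token_importance(description_vectors, note_vectors):
    """Average attention weight each note token receives from the description tokens."""
    scale = math.sqrt(len(note_vectors[0]))
    importance = [0.0] * len(note_vectors)
    for query in description_vectors:
        weights = softmax([dot(query, key) / scale for key in note_vectors])
        for i, weight in enumerate(weights):
            importance[i] += weight / len(description_vectors)
    return importance

# Toy embeddings (made up): the second note token points in the same direction as the description.
note_tokens = ["patient", "hypertension", "stable"]
note_vectors = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
description_vectors = [[0.0, 1.0]]

importance = note_token_importance(description_vectors, note_vectors)
ranked = sorted(zip(note_tokens, importance), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

In a real model the queries and keys would pass through trained projection matrices first, and the highest-weighted tokens (or the weighted sum) would feed the next Transformer layer.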
   
   
```
python baseline.py train --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.tiny.train --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.tiny.train --dev-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.tiny.dev --dev-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.tiny.dev --vocab=vocab.json
```

Date: Feb 24, 2020

```
(drcoding) f45c898c0b17:scripts boyanjin$ python preprocess.py --icdmap=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/D_ICD_DIAGNOSES.csv --icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/DIAGNOSES_ICD.csv --notes=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/NOTEEVENTS.csv
> Getting the ICD map description...
> Finished getting the ICD map description. Found 14567
> Extracting the hadmids with discharge summaries
> Number of notes = 2083180s so far
> Extracting the top 50 ICD codes...
   ICD 401 has count 20670
   ICD 427 has count 20453
   ICD 428 has count 20300
   ICD 276 has count 18051
   ICD 250 has count 15994
   ICD 414 has count 15413
   ICD 272 has count 14415
   ICD 285 has count 12707
   ICD 518 has count 12460
   ICD 584 has count 11190
   ICD 530 has count 7714
   ICD 599 has count 6936
   ICD 585 has count 6524
   ICD 403 has count 6299
   ICD 038 has count 6227
   ICD 305 has count 5922
   ICD 998 has count 5893
   ICD 424 has count 5627
   ICD 995 has count 5511
   ICD 765 has count 5414
   ICD 410 has count 5367
   ICD 785 has count 5328
   ICD 780 has count 5256
   ICD 244 has count 5117
   ICD 458 has count 4766
   ICD 486 has count 4733
   ICD 996 has count 4648
   ICD 707 has count 4608
   ICD 997 has count 4580
   ICD 496 has count 4296
   ICD 041 has count 4155
   ICD 790 has count 3922
   ICD 507 has count 3609
   ICD 348 has count 3422
   ICD 493 has count 3410
   ICD 311 has count 3347
   ICD 287 has count 3268
   ICD 412 has count 3203
   ICD 571 has count 3190
   ICD 511 has count 3095
   ICD 733 has count 2950
   ICD 300 has count 2934
   ICD 278 has count 2780
   ICD 070 has count 2692
   ICD 416 has count 2646
   ICD 770 has count 2580
   ICD 774 has count 2524
   ICD 578 has count 2445
   ICD 482 has count 2409
   ICD 197 has count 2398
> Extracted ICD codes: ['416', '410', '578', '458', '486', '305', '571', '599', '996', '278', '707', '785', '428', '272', '348', '518', '584', '790', '403', '493', '038', '511', '070', '401', '244', '482', '427', '998', '287', '765', '733', '414', '276', '585', '995', '530', '285', '770', '496', '412', '774', '197', '997', '300', '780', '041', '424', '250', '507', '311']
> Extracting the ICD codes...
> Wrote 51295 hadmids
> 6.000077980309972 ICD codes per hadmid
> Extracting and converting the discharge summaries...
> Finished extracting discharge summaries
> Number of discharge summaries = 58111
```
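
The top-50 extraction in the log above can be sketched as follows. The codes in the output are all three characters long, so this sketch assumes that full ICD-9 codes are truncated to their first three characters and that each category is counted once per admission. Neither assumption is confirmed by `preprocess.py`, and the sample data is invented.

```python
from collections import Counter

def top_k_categories(admission_codes, k):
    """Return the k most frequent 3-character ICD-9 categories.

    admission_codes maps an admission id to its list of full ICD-9 codes.
    Assumption: a category is counted at most once per admission.
    """
    counts = Counter()
    for codes in admission_codes.values():
        counts.update(sorted({code[:3] for code in codes}))
    return counts.most_common(k)

# Invented sample: four admissions with full ICD-9 codes.
sample = {
    "100": ["4019", "25000", "4011"],
    "101": ["4280", "4019"],
    "102": ["25001"],
    "103": ["4019"],
}

top = top_k_categories(sample, k=2)
for category, count in top:
    print(f"   ICD {category} has count {count}")
```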

```
(drcoding) f45c898c0b17:scripts boyanjin$ python split.py /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/NOTEEVENTS.csv.txt /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/DIAGNOSES_ICD.csv.txt /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split --size=1000
Reading note file...
Finished reading note file.
Reading ICD file...
Finished reading ICD file.
Splitting train/test...
Finished splitting train/test
Splitting train/dev...
   Generated 37190 train entries
   Generated 9298 val entries
   Generated 11623 test entries
```
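
The split sizes above are consistent with two successive 80/20 splits (train+dev vs. test, then train vs. dev) with the held-out size rounded up. That is an inference from the numbers, not something read from `split.py`. A small sketch under that assumption, with a hypothetical function name and seed:

```python
import math
import random

def split_indices(n, test_fraction=0.2, dev_fraction=0.2, seed=0):
    """Shuffle range(n), hold out a test set, then hold out a dev set from the rest."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_fraction)          # held-out size rounded up
    test, rest = indices[:n_test], indices[n_test:]
    n_dev = math.ceil(len(rest) * dev_fraction)    # same rule on the remainder
    dev, train = rest[:n_dev], rest[n_dev:]
    return train, dev, test

# 58111 is the number of discharge summaries reported by the preprocess step.
train, dev, test = split_indices(58111)
print(len(train), len(dev), len(test))  # 37190 9298 11623
```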

```
python vocab.py --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.train  --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.train vocab.json
```



Date: Feb 26, 2020
    
Created a dummy dataset because the LSTM baseline is too slow.

```
python baseline.py train --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.dummy.train --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.dummy.train --dev-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.dummy.dev --dev-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.dummy.dev --vocab=vocab.json
```

```
python baseline.py train --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.dummy.train --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.dummy.train --dev-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.dummy.dev --dev-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/icd.dummy.dev --vocab=vocab.json --target-length=10


python baseline.py predict model.bin /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split/note.dummy.test /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/predictions/predictions.txt

```

```
python baseline.py train --train-src=/workplace/boyanjin/stanford/DrCoding/data_split/note.train --train-icd=/workplace/boyanjin/stanford/DrCoding/data_split/icd.train --dev-src=/workplace/boyanjin/stanford/DrCoding/data_split/note.dev --dev-icd=/workplace/boyanjin/stanford/DrCoding/data_split/icd.dev --vocab=vocab.json --target-length=1000
```


Running on VM: 
```
python baseline.py train --train-src=/home/tom/DrCoding/data_split/note.train --train-icd=/home/tom/DrCoding/data_split/icd.train --dev-src=/home/tom/DrCoding/data_split/note.dev --dev-icd=/home/tom/DrCoding/data_split/icd.dev --vocab=vocab.json --target-length=1000 --cuda --log-every 1
```



Date: Feb 28, 2020
    
Modified the Reformer model to treat the task as a classification problem (using a CLS token).
```
python vocab.py --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/models/reformer/data/train.txt  --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/models/reformer/data/train_labels.txt vocab.json
```

```
python baseline.py predict model.bin /home/tom/DrCoding/data_split/note.test /home/tom/DrCoding/data_split/icd.test
```


Updated the preprocess script so that it can extract only the top ICD code per patient:

```
python preprocess.py --icdmap=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/D_ICD_DIAGNOSES.csv --icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/DIAGNOSES_ICD.csv --notes=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/NOTEEVENTS.csv --top-k=50 --top-k-per-patient=1
```

```
> Extracted ICD codes: ['733', '518', '416', '403', '424', '412', '584', '996', '496', '070', '348', '410', '197', '780', '997', '571', '530', '276', '272', '041', '414', '507', '486', '287', '244', '458', '285', '300', '790', '428', '493', '427', '765', '038', '599', '250', '998', '278', '311', '578', '482', '785', '305', '585', '707', '511', '995', '401', '774', '770']
> Extracting the ICD codes...
> Wrote 51295 hadmids
> 6.000077980309972 ICD codes per hadmid
> Extracting and converting the discharge summaries...
> Finished extracting discharge summaries
> Number of discharge summaries = 58111
```
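
One possible reading of `--top-k-per-patient=1`, sketched below: for each admission, keep the single code that belongs to the top-k set and has the lowest sequence number (the most primary diagnosis). Whether `preprocess.py` selects the code this way is an assumption, and the function name and rows are hypothetical.

```python
def top_code_per_admission(rows, allowed):
    """Keep one category per admission: the allowed one with the lowest sequence number.

    rows is an iterable of (hadm_id, seq_num, category) tuples.
    allowed is the set of top-k categories.
    """
    best = {}
    for hadm_id, seq_num, category in rows:
        if category not in allowed:
            continue  # codes outside the top-k set are ignored
        if hadm_id not in best or seq_num < best[hadm_id][0]:
            best[hadm_id] = (seq_num, category)
    return {hadm_id: category for hadm_id, (_, category) in best.items()}

# Invented rows: admission 100 has a primary code outside the allowed set.
rows = [
    ("100", 2, "401"),
    ("100", 1, "999"),
    ("100", 3, "250"),
    ("101", 1, "428"),
]
labels = top_code_per_admission(rows, allowed={"401", "250", "428"})
print(labels)
```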




```
python split.py /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/NOTEEVENTS.csv.txt /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data/DIAGNOSES_ICD.csv.top-1.txt /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1 --size=100

Splitting train/dev...
   Generated 37190 train entries
   Generated 9298 val entries
   Generated 11623 test entries
```

```
python vocab.py --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/note.train  --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/icd.train vocab.json
```


Testing with dummy data locally:
```
python baseline.py train --train-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/note.tiny.train --train-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/icd.tiny.train --dev-src=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/note.tiny.dev --dev-icd=/Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/icd.tiny.dev --vocab=vocab.json --target-length=10 --log-every 1

python baseline.py predict model.bin /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/note.tiny.test /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/data_split_top_1/icd.tiny.test /Users/boyanjin/Documents/Stanford/CS224N/Project/DrCoding/predictions/predictions.txt
```

Running on VM:
```
python baseline.py train --train-src=/home/tom/DrCoding/data_split_top_1/note.train --train-icd=/home/tom/DrCoding/data_split_top_1/icd.train --dev-src=/home/tom/DrCoding/data_split_top_1/note.dev --dev-icd=/home/tom/DrCoding/data_split_top_1/icd.dev --vocab=vocab.json --target-length=1000 --log-every 1
```


