# VA Incarceration Status Tutorial

This notebook shows how to run the HAIL Lab's Incarceration Status Longformer model on your notes and write the results to a CSV file. It is broken down into the following sections:

- What you need to start
- Installing HAIL's Incarceration Status Model
- Running the Incarceration Status Model

## What you need to start

**1. You will need the following packages:**

- pandas==1.5.0
- numpy==1.23.4
- tqdm==4.64.1
- pytorch==1.10.1 (the pip package is named `torch`)
    - build including CUDA: py3.9_cuda10.2_cudnn7.6.5_0
- transformers==4.20.1
    - the Hugging Face library
- datasets==2.12.0
- evaluate==0.4.0
- huggingface-hub==0.8.1
    - where you'll get the trained Longformer model


**2. You will also need your data in CSV format.**  
**3. You will need HAIL's trained Longformer model from the Hugging Face Hub.**  

These are the library versions we used for model development and inference. Because they depend on the underlying GPU support, OS, and Python version, we can't guarantee that the same versions will work for you. The key is to install a PyTorch version that matches your Python and CUDA versions, and then a transformers version that is compatible with that PyTorch version. The remaining packages are less sensitive to exact versions.
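
Before installing or pinning anything, it can help to see what is already present in your environment. The snippet below is a minimal sketch, not part of the original workflow: `report_versions` and `PACKAGES` are hypothetical names introduced here, and the lookup uses only the standard library, so it runs even when some packages are missing.

```python
import platform
from importlib import metadata

# Package names as published on PyPI ("torch" is the pip name for PyTorch).
PACKAGES = ["pandas", "numpy", "tqdm", "torch", "transformers",
            "datasets", "evaluate", "huggingface-hub"]


def report_versions(packages=PACKAGES):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python:", platform.python_version())
for name, version in report_versions().items():
    print(f"{name}: {version or 'not installed'}")

# If PyTorch is importable, also show the CUDA version it was built against
# (None for CPU-only builds).
try:
    import torch
    print("CUDA version PyTorch was built with:", torch.version.cuda)
except (ImportError, OSError):
    print("PyTorch could not be imported")
```

Comparing this output against the list above shows which packages differ from the versions we used.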

In [1]:
# to install the same versions we used, uncomment the pip commands below

# !pip install pandas==1.5.0
# !pip install numpy==1.23.4
# !pip install tqdm==4.64.1
# !pip install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu102/torch_stable.html
# !pip install transformers==4.20.1
# !pip install datasets==2.12.0
# !pip install evaluate==0.4.0
# !pip install huggingface-hub==0.8.1


In [2]:
# now import them! 
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import torch
from transformers import pipeline
import datasets
import evaluate

In [3]:
# check whether PyTorch can see a CUDA-capable GPU (returns True if it can)
torch.cuda.is_available()

True

## Installing HAIL's Incarceration Status Model

Instead of downloading and loading the model manually, we can use a transformers pipeline, which fetches the model from the Hugging Face Hub on first use, to label notes quickly and efficiently.

In [4]:
# Use a pipeline as a high-level helper
from transformers import pipeline

# use the first GPU if one is available; otherwise fall back to the CPU (device=-1)
BATCH_SIZE = 5
pipe = pipeline(
    "text-classification",
    model="vsocrates/incar-status-any",
    device=0 if torch.cuda.is_available() else -1,
    batch_size=BATCH_SIZE,
)

## Running the Incarceration Status Model


Now that the prerequisites are installed and the model is loaded, we're ready to process some notes. First we define a small dataset wrapper, then we read in a sample file using pandas.

In [5]:
# we'll create a simple Torch Dataset class so we can get progress bar updates
from torch.utils.data import Dataset

class ListDataset(Dataset):
    def __init__(self, original_list):
        self.original_list = original_list

    def __len__(self):
        return len(self.original_list)

    def __getitem__(self, i):
        return self.original_list[i]

In [6]:
# read the notes into a pandas DataFrame
notes = pd.read_csv("/path/to/sample/data.csv")


We can't show the file we used because it contains protected health information (PHI), but your file should have an ID column and a text column, something like:

| encounter_ID    | TEXT       |
|--------------|--------------|
| 1 | Patient arrived in the ED from jail... |
| 2 | Transferred to floor with ERT... |
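
Since the pipeline expects a string for every note, it may be worth checking the file before running the model. The following is a sketch under stated assumptions: `check_notes` and `REQUIRED_COLUMNS` are hypothetical names, the column names are taken from the example table above, and dropping rows with empty text is one possible policy rather than something the model requires. The tiny in-memory CSV is made up for illustration.

```python
import io

import pandas as pd

# Column names from the example table; adjust them to match your own file.
REQUIRED_COLUMNS = ["encounter_ID", "TEXT"]


def check_notes(notes, required=REQUIRED_COLUMNS, text_column="TEXT"):
    """Verify the expected columns exist and drop rows with no note text."""
    missing = [col for col in required if col not in notes.columns]
    if missing:
        raise ValueError(f"Missing required column(s): {missing}")
    # Keep only rows whose text is present and not just whitespace.
    text = notes[text_column]
    has_text = text.notna() & (text.astype(str).str.strip() != "")
    return notes[has_text].reset_index(drop=True)


# Made-up sample data standing in for a real file; row 2 has no text.
sample_csv = io.StringIO(
    "encounter_ID,TEXT\n"
    "1,Patient arrived in the ED from jail...\n"
    "2,\n"
    "3,Transferred to floor with ERT...\n"
)
sample = check_notes(pd.read_csv(sample_csv))
print(sample)
```

In the notebook itself, the equivalent call would be `notes = check_notes(notes)` right after reading the CSV.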

In [7]:
dataset = ListDataset(notes['TEXT'].tolist())

In [8]:
# Longformer accepts at most 4096 tokens, so longer notes are truncated to that length
labels = []
scores = []
for pred in tqdm(pipe(dataset, max_length=4096, truncation=True)):
    labels.append(pred['label'])
    scores.append(pred['score'])    

  0%|          | 0/100 [00:00<?, ?it/s]

In [9]:
# we can check out what the labels and scores lists look like: 
print(labels[:3], scores[:3])

['reject', 'reject', 'reject'] [0.9992249011993408, 0.9992402791976929, 0.9719255566596985]


Each entry in `labels` is either "accept" or "reject". Here, "accept" means the note indicates a history or presence of incarceration, and "reject" means it **does not**. Each entry in `scores` is the model's confidence in the corresponding predicted label.
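
For downstream analysis, a binary flag and a single probability of incarceration can be easier to work with than a label/score pair. The sketch below uses the hypothetical helper `accept_probability` and made-up example values. It assumes the score is a softmax probability over exactly these two labels, which is the usual behavior of a two-label text-classification pipeline but is worth confirming against the model's configuration.

```python
def accept_probability(label, score):
    """Convert a (label, score) pair into the probability of "accept".

    Assumes `score` is the probability of the predicted label and that the
    two labels' probabilities sum to 1.
    """
    if label == "accept":
        return score
    if label == "reject":
        return 1.0 - score
    raise ValueError(f"Unexpected label: {label!r}")


# Made-up values for illustration; in the notebook, use `labels` and `scores`.
example_labels = ["reject", "accept", "reject"]
example_scores = [0.99, 0.80, 0.60]

# 1 if the note was labeled as mentioning incarceration, else 0.
incarceration_flag = [int(label == "accept") for label in example_labels]

# Probability of "accept" for every note, regardless of the predicted label.
accept_probs = [
    accept_probability(label, score)
    for label, score in zip(example_labels, example_scores)
]

print(incarceration_flag)
print(accept_probs)
```

A single probability column also makes it straightforward to apply a decision threshold other than 0.5 if your use case calls for one.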

Now let's add these back to our dataframe and write it to a file.

In [10]:
notes['label'] = labels
notes['score'] = scores

In [11]:
# index=False avoids writing the dataframe's row index as an extra unnamed column
notes.to_csv("/path/to/sample/data_with_predictions.csv", index=False)