# Create patient-level snapshots

## About snapshots

I'm [Zella King](https://github.com/zmek/), a health data scientist in the Clinical Operational Research Unit (CORU) at University College London. Since 2020, I have worked with University College London Hospital (UCLH) on practical tools to improve patient flow through the hospital. With a team from UCLH, I developed a predictive tool that is now in daily use by bed managers at the hospital. 

The tool we built for UCLH takes a 'snapshot' of patients in the hospital at a point in time and, using data from the hospital's electronic record system, predicts the number of emergency admissions in the next 8 or 12 hours. We are working on predicting discharges in the same way.

The key principle is that we take data on hospital visits that are unfinished, and predict whether some outcome (admission from A&E, discharge from hospital, or transfer to another clinical specialty) will happen to each of those patients in a window of time. What the outcome is doesn't really matter; the same methods can be used. 

The utility of our approach - and the thing that makes it very generalisable - is that we then build up from the patient-level predictions into a prediction for a whole cohort of patients at a point in time. That step is what creates useful information for bed managers.
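As an illustration of that build-up step, here is a minimal sketch, not the method used in the UCLH tool. It assumes each patient's outcome is independent of the others, so that the number of patients with the outcome follows a Poisson binomial distribution. The function name `cohort_distribution` and the probabilities are hypothetical.

```python
import numpy as np

def cohort_distribution(probabilities):
    """Return P(k patients have the outcome) for k = 0..n, assuming independence."""
    dist = np.array([1.0])  # with no patients, P(0 outcomes) = 1
    for p in probabilities:
        # Convolve with this patient's Bernoulli distribution [1 - p, p]
        dist = np.convolve(dist, [1.0 - p, p])
    return dist

# Hypothetical admission probabilities for four patients in one snapshot
patient_probs = [0.8, 0.3, 0.3, 0.1]
dist = cohort_distribution(patient_probs)

# Expected number of admissions is the probability-weighted sum of counts
expected_admissions = float(np.dot(np.arange(len(dist)), dist))

for k, prob in enumerate(dist):
    print(f"P({k} admissions) = {prob:.3f}")
print(f"Expected admissions: {expected_admissions:.2f}")
```

The result is a full probability distribution over the number of admissions rather than a single number, which is one way a bed manager could see both the likely count and the uncertainty around it.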

Here I show what I mean by a snapshot, and suggest how to prepare them. 

## How to create patient-level snapshots

Below is some fake data resembling the historical data on ED visits you would find in the data warehouse of an Electronic Health Record (EHR) system. Each visit has one row. From the data we know the patient's triage score (where 1 is the highest level of acuity), and whether they were admitted after the ED visit. 

In [7]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

def generate_patient_visits(start_date, end_date, mean_patients_per_day):
    """
    Generate fake patient visit data with random arrival and departure times.
    
    Parameters:
    -----------
    start_date : str or datetime
        The minimum date to sample from (format: 'YYYY-MM-DD' if string)
    end_date : str or datetime
        The maximum date to sample from (format: 'YYYY-MM-DD' if string)
    mean_patients_per_day : float
        The average number of patients to generate per day
        
    Returns:
    --------
    pandas.DataFrame
        DataFrame with columns: visit_number, arrival_datetime, departure_datetime, 
        triage_score, is_admitted
    """
    # Convert string dates to datetime if needed
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, '%Y-%m-%d')
    if isinstance(end_date, str):
        end_date = datetime.strptime(end_date, '%Y-%m-%d')
    
    # Calculate total days in range
    days_range = (end_date - start_date).days + 1
    
    # Generate random number of patients for each day using Poisson distribution
    daily_patients = np.random.poisson(mean_patients_per_day, days_range)
    
    # Define admission probabilities based on triage score
    # Triage 1: 80% admission, Triage 2: 60%, Triage 3: 30%, Triage 4: 10%, Triage 5: 2%
    admission_probabilities = {
        1: 0.80,  # Highest severity - highest admission probability
        2: 0.60,
        3: 0.30,
        4: 0.10,
        5: 0.02   # Lowest severity - lowest admission probability
    }
    
    # Define triage score distribution
    # Most common is 3-4, less common are 2 and 5, least common is 1 (most severe)
    triage_probabilities = [0.05, 0.15, 0.35, 0.35, 0.10]  # For scores 1-5
    
    visits = []
    visit_number = 1
    
    for day_idx, num_patients in enumerate(daily_patients):
        current_date = start_date + timedelta(days=day_idx)
        
        # Generate patients for this day
        for _ in range(num_patients):
            # Random hour for arrival (more likely during daytime)
            arrival_hour = np.random.normal(13, 4)  # Mean at 1 PM, std dev of 4 hours
            arrival_hour = max(0, min(23, int(arrival_hour)))  # Clamp between 0-23
            
            # Random minutes
            arrival_minute = np.random.randint(0, 60)
            
            # Create arrival datetime
            arrival_datetime = current_date.replace(
                hour=arrival_hour,
                minute=arrival_minute,
                second=np.random.randint(0, 60)
            )
            
            # Generate triage score (1-5)
            triage_score = np.random.choice([1, 2, 3, 4, 5], p=triage_probabilities)
            
            # Generate length of stay (in minutes) - log-normal distribution
            # Most visits are 2 to 6 hours, but some can be shorter or longer
            length_of_stay = np.random.lognormal(mean=5.2, sigma=0.4)
            length_of_stay = max(30, min(1440, length_of_stay))  # Between 30 min and 24 hours
            
            # Make more severe cases (triage scores 1 and 2) stay longer on average
            if triage_score <= 2:
                length_of_stay *= 1.5  # 50% longer stays for more severe cases
            
            # Calculate departure time
            departure_datetime = arrival_datetime + timedelta(minutes=int(length_of_stay))
            
            # Generate admission status based on triage score
            admission_prob = admission_probabilities[triage_score]
            is_admitted = np.random.choice([0, 1], p=[1-admission_prob, admission_prob])
            
            visits.append({
                'visit_number': visit_number,
                'arrival_datetime': arrival_datetime,
                'departure_datetime': departure_datetime,
                'triage_score': triage_score,
                'is_admitted': is_admitted
            })
            
            visit_number += 1
    
    # Create DataFrame and sort by arrival time
    df = pd.DataFrame(visits)
    df = df.sort_values('arrival_datetime').reset_index(drop=True)
    
    return df

# Example usage:
# df = generate_patient_visits('2023-01-01', '2023-01-31', 25)
# print(df.head())

In [9]:

df = generate_patient_visits('2023-01-01', '2023-01-31', 25)
df.head()

| | visit_number | arrival_datetime | departure_datetime | triage_score | is_admitted |
|---|---|---|---|---|---|
| 0 | 14 | 2023-01-01 07:56:10 | 2023-01-01 10:48:10 | 4 | 0 |
| 1 | 8 | 2023-01-01 07:58:36 | 2023-01-01 10:15:36 | 5 | 0 |
| 2 | 9 | 2023-01-01 08:08:10 | 2023-01-01 12:20:10 | 3 | 1 |
| 3 | 6 | 2023-01-01 09:47:22 | 2023-01-01 16:29:22 | 4 | 0 |
| 4 | 11 | 2023-01-01 09:51:43 | 2023-01-01 17:00:43 | 3 | 0 |


Our goal is to create snapshots of these visits at a point in time. First, we define a function that collects the visits in progress at each prediction time on each date. Then we set the times of day at which we will issue predictions.

In [18]:
from datetime import datetime, time, date, timedelta
import pandas as pd

def create_snapshots(df, prediction_times, start_date, end_date):
    """
    Create snapshots of patients present at specific times between start_date and end_date inclusive.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame with patient visit data, must have 'arrival_datetime' and 'departure_datetime' columns
    prediction_times : list of tuples
        List of (hour, minute) tuples representing times to take snapshots
    start_date : datetime.date
        First date to take snapshots
    end_date : datetime.date
        Last date to take snapshots (inclusive)
    
    Returns:
    --------
    pandas.DataFrame
        DataFrame with snapshot information and patient data
    """
    # Create date range
    snapshot_dates = []
    current_date = start_date
    while current_date <= end_date:
        snapshot_dates.append(current_date)
        current_date += timedelta(days=1)
    
    # Create empty list to store all results
    all_results = []
    
    # For each combination of date and time
    # The loop variable is named snapshot_date so it does not shadow datetime.date
    for snapshot_date in snapshot_dates:
        for hour, minute in prediction_times:
            snapshot_datetime = datetime.combine(
                snapshot_date,
                time(hour=hour, minute=minute)
            )

            # Keep visits that had arrived but not yet departed at the snapshot time
            mask = (df['arrival_datetime'] <= snapshot_datetime) & (df['departure_datetime'] > snapshot_datetime)
            snapshot_df = df[mask].copy()  # Create copy to avoid SettingWithCopyWarning

            # Skip if no patients at this time
            if len(snapshot_df) == 0:
                continue

            # Add snapshot information columns
            snapshot_df['snapshot_date'] = snapshot_date
            snapshot_df['prediction_time'] = [(hour, minute)] * len(snapshot_df)
            snapshot_df['snapshot_datetime'] = snapshot_datetime
            
            # Append to results list
            all_results.append(snapshot_df)
    
    # Combine all results into single dataframe
    if all_results:
        final_df = pd.concat(all_results, ignore_index=True)
        
        # Define column order
        snapshot_cols = ['snapshot_date', 'prediction_time', 'snapshot_datetime']
        visit_cols = ['visit_number', 'arrival_datetime', 'departure_datetime', 'triage_score', 'is_admitted']
        
        # Reorder columns
        final_df = final_df[snapshot_cols + visit_cols]
    else:
        # Create empty dataframe with correct columns if no results found
        columns = ['snapshot_date', 'prediction_time', 'snapshot_datetime', 
                  'visit_number', 'arrival_datetime', 'departure_datetime', 
                  'triage_score', 'is_admitted']
        final_df = pd.DataFrame(columns=columns)
    
    return final_df

In [16]:
prediction_times = [(6, 0), (9, 30), (12, 0), (15, 30), (22, 0)] # each time is expressed as a tuple of (hour, minute)

In [21]:
from datetime import date
start_date = date(2023, 1, 1)
end_date = date(2023, 1, 31)

# Create snapshots
snapshots_df = create_snapshots(df, prediction_times, start_date, end_date)
snapshots_df.head()

| | snapshot_date | prediction_time | snapshot_datetime | visit_number | arrival_datetime | departure_datetime | triage_score | is_admitted |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | (9, 30) | 2023-01-01 09:30:00 | 14 | 2023-01-01 07:56:10 | 2023-01-01 10:48:10 | 4 | 0 |
| 1 | 2023-01-01 | (9, 30) | 2023-01-01 09:30:00 | 8 | 2023-01-01 07:58:36 | 2023-01-01 10:15:36 | 5 | 0 |
| 2 | 2023-01-01 | (9, 30) | 2023-01-01 09:30:00 | 9 | 2023-01-01 08:08:10 | 2023-01-01 12:20:10 | 3 | 1 |
| 3 | 2023-01-01 | (12, 0) | 2023-01-01 12:00:00 | 9 | 2023-01-01 08:08:10 | 2023-01-01 12:20:10 | 3 | 1 |
| 4 | 2023-01-01 | (12, 0) | 2023-01-01 12:00:00 | 6 | 2023-01-01 09:47:22 | 2023-01-01 16:29:22 | 4 | 0 |
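The `is_admitted` column records the outcome of the whole visit, whenever it ends. If the prediction is meant to cover a fixed window after the snapshot, such as the 8 or 12 hours mentioned earlier, a windowed target could be derived from `departure_datetime`. The following is a minimal sketch that assumes admission happens at the moment the patient leaves the ED. The helper `label_outcome_in_window` and the example rows are hypothetical.

```python
import pandas as pd

def label_outcome_in_window(snapshots, window_hours):
    """Flag snapshots whose visit ended in admission within the window."""
    window_end = snapshots["snapshot_datetime"] + pd.Timedelta(hours=window_hours)
    within_window = snapshots["departure_datetime"] <= window_end
    return (snapshots["is_admitted"] == 1) & within_window

# Three hypothetical snapshot rows, all taken at 09:30
example = pd.DataFrame({
    "snapshot_datetime": pd.to_datetime(["2023-01-01 09:30"] * 3),
    "departure_datetime": pd.to_datetime(
        ["2023-01-01 12:20", "2023-01-01 20:00", "2023-01-01 10:15"]
    ),
    "is_admitted": [1, 1, 0],
})

# Only the first row is both admitted and departs before 17:30
example["admitted_in_8h"] = label_outcome_in_window(example, window_hours=8)
print(example)
```

The second row shows why the window matters: the patient is eventually admitted, but not within 8 hours of the snapshot, so the windowed target is `False`.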


## Train a model to predict the outcome of each snapshot


In [27]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import OrdinalEncoder
from typing import Dict, List, Tuple, Any

def train_admission_model(
    snapshots_df: pd.DataFrame,
    prediction_time: Tuple[int, int],
    exclude_from_training_data: List[str],
    ordinal_mappings: Dict[str, List[Any]]
):
    """
    Train a Random Forest model to predict patient admission based on filtered data.
    
    Parameters:
    -----------
    snapshots_df : pandas.DataFrame
        DataFrame containing patient snapshot data
    prediction_time : Tuple[int, int]
        The specific (hour, minute) tuple to filter training data by
    exclude_from_training_data : List[str]
        List of column names to exclude from model training
    ordinal_mappings : Dict[str, List[Any]]
        Dictionary mapping column names to ordered categories for ordinal encoding
    
    Returns:
    --------
    tuple
        (trained_model, X_test, y_test, accuracy, feature_importances)
    """
    # Filter data for the specific prediction time
    filtered_df = snapshots_df[snapshots_df['prediction_time'].apply(lambda x: x == prediction_time)]
    
    if filtered_df.empty:
        raise ValueError(f"No data found for prediction time {prediction_time}")
    
    # Prepare feature columns - exclude specified columns and target variable
    all_columns = filtered_df.columns.tolist()
    exclude_cols = exclude_from_training_data + ['is_admitted', 'prediction_time', 'snapshot_date', 'snapshot_datetime']
    feature_cols = [col for col in all_columns if col not in exclude_cols]
    
    # Create feature matrix
    X = filtered_df[feature_cols].copy()
    y = filtered_df['is_admitted']
    
    # Apply ordinal encoding to categorical features
    for col, categories in ordinal_mappings.items():
        if col in X.columns:
            # Create an ordinal encoder with the specified categories
            encoder = OrdinalEncoder(categories=[categories])
            # OrdinalEncoder expects 2D input; flatten the result back into a single column
            X[col] = encoder.fit_transform(X[[col]]).ravel()
    
    # One-hot encode any remaining categorical columns
    X = pd.get_dummies(X)
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    
    # Initialize and train the model
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=5,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Evaluate the model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Get feature importances
    feature_importances = pd.DataFrame({
        'Feature': X.columns,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    # Return the model, test data, accuracy, and feature importances
    return model, X_test, y_test, accuracy, feature_importances


Let's train a model to predict admission for the 9:30 prediction time. We will specify that the triage scores are ordinal, and make use of sklearn's OrdinalEncoder to maintain the natural order of categories. We also need to exclude columns that are not relevant to the snapshot, such as the visit number, or that would not be known at the time of the snapshot, such as the departure time.

In [28]:

model, X_test, y_test, accuracy, importance = train_admission_model(
    snapshots_df,
    prediction_time=(9, 30),
    exclude_from_training_data=['visit_number', 'arrival_datetime', 'departure_datetime'],
    ordinal_mappings={'triage_score': [1, 2, 3, 4, 5]}
)
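Accuracy summarises hard yes/no predictions, but the snapshot approach relies on a probability of admission for each patient. The sketch below shows how those probabilities could be read from a fitted scikit-learn classifier with `predict_proba`. It is self-contained and uses hypothetical synthetic data (the names `clf`, `snapshot` and `admit_rate` are assumptions for illustration), not the model trained above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Hypothetical training data: triage score and admission outcome only
triage = rng.integers(1, 6, size=500)  # scores 1 to 5
admit_rate = {1: 0.80, 2: 0.60, 3: 0.30, 4: 0.10, 5: 0.02}
rates = np.array([admit_rate[int(t)] for t in triage])
y = (rng.random(500) < rates).astype(int)
X = pd.DataFrame({"triage_score": triage})

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
clf.fit(X, y)

# A hypothetical snapshot of five patients currently in the ED
snapshot = pd.DataFrame({"triage_score": [1, 2, 3, 4, 5]})

# Column 1 of predict_proba is the probability of the positive class (admitted)
admission_probs = clf.predict_proba(snapshot)[:, 1]

print(pd.DataFrame({
    "triage_score": snapshot["triage_score"],
    "p_admitted": admission_probs.round(2),
}))
print(f"Expected admissions in this snapshot: {admission_probs.sum():.2f}")
```

With the objects returned by `train_admission_model`, the equivalent call would be `model.predict_proba(X_test)[:, 1]`.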

## Conclusion

Here I have shown:

* how to create snapshots from finished patient visits
* how to train a very simple model to predict, for each snapshot, whether the visit will end in admission.