# Identifying Patients At Risk of Catheter-associated Urinary Tract Infections (CAUTIs) Using MIMIC-IV

## Table of Contents
1. [Introduction](#introduction)
2. [Data Retrieval](#data-retrieval)
3. [Data Cleaning](#data-cleaning)
4. [Feature Engineering](#feature-engineering)
5. [Exploratory Data Analysis](#exploratory-data-analysis)
6. [Model Selection and Training](#model-selection-and-training)
7. [Validation and Testing](#validation-and-testing)
8. [Results](#results)
9. [Conclusion](#conclusion)



## Data Retrieval
1. **Connect to the Database**:
    - Use SQL queries to access relevant tables.
2. **Select Relevant Tables**:
    - `admissions`, `patients`, `diagnoses_icd`, `procedures_icd`, `prescriptions`, and the `d_icd_*` dictionary tables in the `mimiciv_hosp` dataset.
3. **Extract Relevant Fields**:
    - Demographics, admission type, ICD codes, procedure codes, etc.

In [None]:
# Import libraries
from datetime import timedelta
import os

import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from IPython.display import display, HTML, Image
%matplotlib inline

plt.style.use('ggplot')
plt.rcParams.update({'font.size': 20})

# Access data using Google BigQuery.
from google.colab import auth
from google.cloud import bigquery

# authenticate
auth.authenticate_user()

# Set up environment variables
project_id = 'eighth-arbor-396212'

os.environ["GOOGLE_CLOUD_PROJECT"] = project_id

# Read data from BigQuery into pandas dataframes.
def run_query(query, project_id=project_id):
  return pd.io.gbq.read_gbq(
      query,
      project_id=project_id,
      dialect='standard')

# set the dataset
dataset = 'mimiciv'


Identifying patients at risk for catheter-associated urinary tract infections (CAUTIs) requires understanding both clinical and non-clinical risk factors. The criteria below can be used to identify patients at higher risk:

### Clinical Factors:

1. **Duration of Catheterization**:
    - The longer the catheter remains in place, the higher the risk of infection.

2. **Previous UTIs**:
    - Patients with a history of urinary tract infections (UTIs) might be more susceptible.

3. **Recent Surgical Procedures**:
    - Especially urological surgeries or surgeries where a catheter was necessary.

4. **Abnormal Urinary Tract Structure or Function**:
    - Obstructions, congenital anomalies, or urinary retention can increase CAUTI risk.

5. **Compromised Immune System**:
    - Patients with conditions like HIV, diabetes, or those receiving immunosuppressants are more susceptible to infections, including CAUTIs.

6. **Age**:
    - Older adults, especially those in long-term care facilities, are at increased risk.

7. **Sex**:
    - Women are at higher risk for UTIs in general, and this can translate to increased risk for CAUTIs, especially in post-surgical contexts.

8. **Recent Antibiotic Usage**:
    - Can disrupt the normal flora, making patients more susceptible.



In [4]:
def add_query_labels(query, terms, label="long_title"):

    # Create the SQL WHERE clause using the list of terms
    where_clauses = [f"lower({label}) LIKE '%{term}%'" for term in terms]
    where_combined = " OR ".join(where_clauses)

    # Combine all parts to generate the final SQL query
    sql_query = f"{query} WHERE {where_combined}"

    return sql_query



"SELECT icd_code, long_title FROM `physionet-data.mimiciv_hosp.d_icd_procedures`\n WHERE lower(long_title) LIKE '%bladder%' OR lower(long_title) LIKE '%kidney%' OR lower(long_title) LIKE '%prostate%' OR lower(long_title) LIKE '%penile%' OR lower(long_title) LIKE '%penis%' OR lower(long_title) LIKE '%ureter%' OR lower(long_title) LIKE '%urethra%' OR lower(long_title) LIKE '%cystoscopy%' OR lower(long_title) LIKE '%nephrectomy%' OR lower(long_title) LIKE '%lithotripsy%' OR lower(long_title) LIKE '%renal%' OR lower(long_title) LIKE '%orchidectomy%' OR lower(long_title) LIKE '%orchiectomy%' OR lower(long_title) LIKE '%vasectomy%'"

In [1]:
def generate_sql_with_ctes(cte_dict, main_query):
    """Generate a SQL query with CTEs."""

    # Convert the CTEs into a list of formatted strings
    cte_strings = [f"{name} AS ({query})" for name, query in cte_dict.items()]

    # Join the CTEs with commas and then prepend WITH only once
    ctes_combined = "WITH\n" + ',\n'.join(cte_strings)

    # Combine the formatted CTEs with the main query
    return ctes_combined + '\n' + main_query
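
To make the composition concrete, the sketch below repeats the helper in compact form (so the cell runs on its own) and chains two CTEs. The CTE names `A` and `B` and the toy queries are illustrative only and do not come from the analysis.

```python
def generate_sql_with_ctes(cte_dict, main_query):
    """Compact copy of the helper above, repeated so this cell is self-contained."""
    cte_strings = [f"{name} AS ({query})" for name, query in cte_dict.items()]
    return "WITH\n" + ",\n".join(cte_strings) + "\n" + main_query

# Illustrative CTEs; dict insertion order determines the order in the WITH clause,
# so a CTE that references another one must be added after it.
demo_ctes = {
    "A": "SELECT 1 AS x",
    "B": "SELECT x + 1 AS y FROM A",
}
demo_sql = generate_sql_with_ctes(demo_ctes, "SELECT y FROM B")
print(demo_sql)
```

Because each CTE is keyed by name, reusing a key such as `ICDCodes` across dictionaries lets the same main query be paired with different code lists.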

In [None]:
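# Select explicit columns: SELECT * across these joins repeats icd_code,
# subject_id, and hadm_id, and BigQuery rejects duplicate column names.
# LIMIT 1000 keeps exploratory runs small; drop it to pull the full cohort.
icd_diagnoses = """SELECT d.subject_id, d.hadm_id, d.icd_code, d.icd_version,
  c.long_title, a.admittime, a.dischtime, a.admission_type,
  p.gender, p.anchor_age
FROM `physionet-data.mimiciv_hosp.diagnoses_icd` d
INNER JOIN ICDCodes c
  ON c.icd_code = d.icd_code
INNER JOIN `physionet-data.mimiciv_hosp.admissions` a
  ON d.hadm_id = a.hadm_id
INNER JOIN `physionet-data.mimiciv_hosp.patients` p
  ON d.subject_id = p.subject_id
LIMIT 1000"""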


In [None]:
cauti_cte = {}

cauti_icd_codes = """SELECT *
FROM `physionet-data.mimiciv_hosp.d_icd_diagnoses`
WHERE lower(long_title) LIKE '%urinary catheter%' AND lower(long_title) LIKE '%infection%'"""

cauti_cte["ICDCodes"] = cauti_icd_codes

cauti_icd_diagnoses = generate_sql_with_ctes(cauti_cte, icd_diagnoses)

1. **Duration of Catheterization**:
    - The longer the catheter remains in place, the higher the risk of infection.
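
No query for this criterion appears in the notebook yet. As a minimal sketch, assuming insertion and removal timestamps have already been obtained for each catheter episode (where they come from in MIMIC-IV is left open here), dwell time can be computed as follows. The function name and the example timestamps are hypothetical.

```python
from datetime import datetime

def catheter_days(inserted_at, removed_at):
    """Return catheter dwell time in fractional days."""
    if removed_at < inserted_at:
        raise ValueError("removal time precedes insertion time")
    return (removed_at - inserted_at).total_seconds() / 86400

# Hypothetical timestamps for a single catheter episode.
inserted = datetime(2150, 3, 1, 8, 0)
removed = datetime(2150, 3, 4, 20, 0)
dwell = catheter_days(inserted, removed)
print(f"{dwell:.1f} catheter-days")
```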

2. **Previous UTIs**:
    - Patients with a history of urinary tract infections (UTIs) might be more susceptible.

In [None]:
uti_cte = {}

uti_icd_codes = """SELECT *
FROM `physionet-data.mimiciv_hosp.d_icd_diagnoses`
WHERE lower(long_title) LIKE '%urinary%tract%' AND lower(long_title) LIKE '%infection%'
"""

uti_cte["ICDCodes"] = uti_icd_codes

uti_icd_diagnoses = generate_sql_with_ctes(uti_cte, icd_diagnoses)


3. **Recent Surgical Procedures**:
    - Especially urological surgeries or surgeries where a catheter was necessary.

In [6]:
urological_cte = {}

urological_terms = ["bladder", "kidney", "prostate", "penile", "penis",
                    "ureter", "urethra", "cystoscopy", "nephrectomy",
                    "lithotripsy", "renal", "orchidectomy", "orchiectomy", "vasectomy"]


urological_procedures = add_query_labels("SELECT icd_code, long_title FROM `physionet-data.mimiciv_hosp.d_icd_procedures`", urological_terms)

urological_procedure_patients = """SELECT DISTINCT p.subject_id, p.hadm_id
FROM `physionet-data.mimiciv_hosp.procedures_icd` p
INNER JOIN UrologicalProcedures u
  ON p.icd_code = u.icd_code"""

urological_cte["UrologicalProcedures"] = urological_procedures

urological_procedure_patients = generate_sql_with_ctes(urological_cte, urological_procedure_patients)

urological_procedure_patients

"WITH\nUrologicalProcedures AS (SELECT icd_code, long_title FROM `physionet-data.mimiciv_hosp.d_icd_procedures`\n WHERE lower(long_title) LIKE '%bladder%' OR lower(long_title) LIKE '%kidney%' OR lower(long_title) LIKE '%prostate%' OR lower(long_title) LIKE '%penile%' OR lower(long_title) LIKE '%penis%' OR lower(long_title) LIKE '%ureter%' OR lower(long_title) LIKE '%urethra%' OR lower(long_title) LIKE '%cystoscopy%' OR lower(long_title) LIKE '%nephrectomy%' OR lower(long_title) LIKE '%lithotripsy%' OR lower(long_title) LIKE '%renal%' OR lower(long_title) LIKE '%orchidectomy%' OR lower(long_title) LIKE '%orchiectomy%' OR lower(long_title) LIKE '%vasectomy%')\nSELECT DISTINCT p.subject_id, p.hadm_id\nFROM `physionet-data.mimiciv_hosp.procedures_icd` p\nINNER JOIN UrologicalProcedures u\n  ON p.icd_code = u.icd_code"

4. **Abnormal Urinary Tract Structure or Function**:
    - Obstructions, congenital anomalies, or urinary retention can increase CAUTI risk.


In [None]:
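urinary_abnorm_cte = {}

# Note: broad terms such as "obstruction" also match non-urinary diagnoses
# (e.g. intestinal obstruction), so review the matched titles before use.
urinary_abnorm_terms = ["obstruction", "urinary retention", "congenital anomaly",
                    "vesicoureteral reflux", "bladder diverticulum"]


urinary_abnormality_diagnoses = add_query_labels("SELECT icd_code, long_title FROM `physionet-data.mimiciv_hosp.d_icd_diagnoses`", urinary_abnorm_terms)

# Register the abnormality diagnosis codes (not the urological procedure codes) as the CTE.
urinary_abnorm_cte["ICDCodes"] = urinary_abnormality_diagnoses

# Use a dedicated name so the urological procedure query is not overwritten.
urinary_abnormality_patients = generate_sql_with_ctes(urinary_abnorm_cte, icd_diagnoses)

urinary_abnormality_patients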

5. **Compromised Immune System**:
    - Patients with conditions like HIV, diabetes, or those receiving immunosuppressants are more susceptible to infections, including CAUTIs.

In [None]:
immune_compromised_cte  = {}

immune_compromised_terms = [
    "hiv", "aids", "diabetes", "organ transplant", "leukemia", "lymphoma",
    "cancer", "chronic kidney disease", "end-stage renal disease", "splenectomy",
    "bone marrow transplant", "malnutrition", "congenital immune deficiencies"
]

# Search with the immune-compromise terms defined above (not the urinary abnormality terms).
immune_compromised_diagnoses = add_query_labels("SELECT icd_code, long_title FROM `physionet-data.mimiciv_hosp.d_icd_diagnoses`", immune_compromised_terms)

immune_compromised_cte["ICDCodes"] = immune_compromised_diagnoses

immune_compromised_patients = generate_sql_with_ctes(immune_compromised_cte, icd_diagnoses)


In [None]:
immunosuppressive_prescriptions = [
    "prednisone", "cyclosporine", "tacrolimus", "mycophenolate", "azathioprine",
    "sirolimus", "rapamycin", "methotrexate", "rituximab", "basiliximab", "antithymocyte globulin"
]

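# MIMIC-IV `prescriptions` stores the drug name in `drug`, so match on that column
# instead of the default `long_title`; the table has no alias, so columns are unqualified.
immunosuppressive_prescription_patients = add_query_labels("SELECT DISTINCT subject_id, hadm_id, drug FROM `physionet-data.mimiciv_hosp.prescriptions`", immunosuppressive_prescriptions, label="drug")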
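
Every term list in this section is interpolated directly into SQL by `add_query_labels`, and the resulting `OR` chain is not parenthesized, so a term containing a quote would break the query and an extra `AND` condition could not be appended safely. A minimal sketch of a more defensive variant follows; `build_like_filter` is a hypothetical helper, not part of the notebook, and it rejects problematic terms rather than trying to escape them.

```python
def build_like_filter(terms, label="long_title"):
    """Return a parenthesized OR-chain of case-insensitive LIKE clauses."""
    if not terms:
        raise ValueError("at least one search term is required")
    clauses = []
    for term in terms:
        # Refuse quotes and backslashes instead of guessing at SQL escaping rules.
        if "'" in term or "\\" in term:
            raise ValueError(f"unsupported character in term: {term!r}")
        # Lowercase the term so it can match against lower(column).
        clauses.append(f"lower({label}) LIKE '%{term.lower()}%'")
    return "(" + " OR ".join(clauses) + ")"

drug_filter = build_like_filter(["Prednisone", "tacrolimus"], label="drug")
print(drug_filter)
```

The parenthesized result can be combined with other predicates, for example `f"... WHERE {drug_filter} AND starttime IS NOT NULL"`.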


8. **Recent Antibiotic Usage**:
    - Can disrupt the normal flora, making patients more susceptible.
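
The notebook has no query for this criterion yet. The sketch below shows one way the flag could be derived once prescription rows are in memory, assuming each row is reduced to a `(drug, starttime)` pair. The 30-day window, the exact-name matching, and the short antibiotic list are illustrative assumptions, not clinical definitions.

```python
from datetime import datetime, timedelta

def had_recent_antibiotic(prescriptions, reference_time, antibiotics, window_days=30):
    """True if a listed antibiotic was started within window_days before reference_time."""
    window_start = reference_time - timedelta(days=window_days)
    names = {name.lower() for name in antibiotics}
    for drug, starttime in prescriptions:
        if drug.lower() in names and window_start <= starttime < reference_time:
            return True
    return False

# Hypothetical data: reference_time would typically be the catheter insertion time.
antibiotic_names = ["ciprofloxacin", "ceftriaxone", "vancomycin"]  # not exhaustive
reference = datetime(2150, 3, 10)
rows = [("Ceftriaxone", datetime(2150, 3, 1)), ("Heparin", datetime(2150, 3, 9))]
print(had_recent_antibiotic(rows, reference, antibiotic_names))
```

Real drug strings often carry strength or formulation text, so a substring or dictionary-based match would likely be needed in practice.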

### Non-Clinical Factors:

1. **Catheter Care and Handling**:
    - Improper care, multiple manipulations, or frequent disconnections can increase risk.

2. **Type of Catheter Material**:
    - Some materials might be more susceptible to biofilm formation, which can harbor bacteria.

3. **Catheter Insertion Technique**:
    - Non-sterile technique or inexperienced personnel can increase infection risk.

4. **Reason for Catheterization**:
    - Indwelling catheters placed for convenience rather than medical necessity might increase the risk.

5. **Location of Care**:
    - ICU stays, long-term care facilities, or certain wards with higher reported cases.

6. **Duration of Hospital Stay**:
    - Longer hospital stays can increase the risk due to prolonged exposure to hospital pathogens.

### Monitoring and Testing:

1. **Presence of Symptoms**:
    - Fever, cloudy or bloody urine, increased urgency or frequency, discomfort or pain.

2. **Urine Culture & Sensitivity**:
    - A positive culture from a catheter specimen supports a CAUTI diagnosis, particularly when accompanied by symptoms; bacteriuria without symptoms is common in catheterized patients. A negative culture does not rule out an infection, especially if the patient is on antibiotics.

3. **Increased Inflammatory Markers**:
    - Elevations in C-reactive protein (CRP), erythrocyte sedimentation rate (ESR), or procalcitonin might suggest an infection.

### Behavioral:

1. **Incontinence**:
    - The presence of incontinence can make maintaining catheter hygiene challenging.

2. **Catheter Dependence**:
    - Patients who are dependent on a catheter for long durations, due to conditions like paralysis, have an inherently higher risk.

By collecting data on these criteria, you can more effectively model and predict which patients are at a higher risk for CAUTIs. Remember, while some of these factors are directly linked to CAUTI risk, others might serve as proxies, so continuous validation against actual CAUTI cases is crucial.



## Data Cleaning
1. **Handle Missing Data**:
    - Imputation or deletion based on the nature of the missingness.
2. **Filter Data**:
    - Select patients with urinary catheters.
    - Identify confirmed CAUTI cases based on ICD codes.
3. **Data Transformation**:
    - Convert categorical variables into numerical formats (e.g., one-hot encoding).
    - Normalize continuous variables.
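
The three cleaning steps above can be sketched with pandas on a toy frame. The column names and values are hypothetical, and median imputation and min-max scaling are just one reasonable choice among several.

```python
import pandas as pd

# Toy cohort; column names and values are hypothetical.
df = pd.DataFrame({
    "age": [70.0, None, 50.0],
    "gender": ["F", "M", "F"],
})

# Handle missing data: impute missing ages with the median of observed values.
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["gender"], dtype=int)

# Normalize: min-max scale age to the [0, 1] range.
age_min, age_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - age_min) / (age_max - age_min)
print(df)
```

In a modeling pipeline, the imputation value and the scaling bounds should be computed on the training split only and then applied to the validation and test splits, so that no information leaks from held-out data.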

## Feature Engineering
1. **Extract Clinical Features**:
    - Length of hospital stay, prior admissions, co-morbidities, etc.
2. **Time Since Catheter Insertion**:
    - Calculate time since catheter insertion to the point of infection or discharge.
3. **Patient History**:
    - Prior history of UTIs, other infections, surgeries, etc.
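
As a rough illustration of two of these features, the sketch below computes length of stay for an index admission and counts earlier admissions that carried a UTI diagnosis. It assumes each admission has been reduced to a dict with `admittime`, `dischtime`, and a precomputed `has_uti` flag; those field choices and the example values are hypothetical.

```python
from datetime import datetime

def length_of_stay_days(admittime, dischtime):
    """Length of hospital stay in fractional days."""
    return (dischtime - admittime).total_seconds() / 86400

def prior_uti_count(admissions, index_admittime):
    """Count admissions with a UTI code that ended before the index admission began."""
    return sum(
        1 for adm in admissions
        if adm["has_uti"] and adm["dischtime"] <= index_admittime
    )

# Hypothetical admission history for one patient; the last row is the index admission.
history = [
    {"admittime": datetime(2150, 1, 1), "dischtime": datetime(2150, 1, 5), "has_uti": True},
    {"admittime": datetime(2150, 2, 1), "dischtime": datetime(2150, 2, 3), "has_uti": False},
    {"admittime": datetime(2150, 6, 1), "dischtime": datetime(2150, 6, 11, 12), "has_uti": True},
]
index_adm = history[-1]
los = length_of_stay_days(index_adm["admittime"], index_adm["dischtime"])
n_prior_uti = prior_uti_count(history, index_adm["admittime"])
print(los, n_prior_uti)
```

Restricting the count to admissions that ended before the index admission keeps the feature from including the outcome being predicted.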

## Exploratory Data Analysis
1. **Univariate Analysis**:
    - Distribution of age, gender, and other demographics.
    - Distribution of length of hospital stay, time since catheter insertion, etc.
2. **Bivariate Analysis**:
    - Relationship between CAUTIs and potential predictors.
3. **Visualizations**:
    - Use plots like histograms, box plots, scatter plots to visualize distributions and relationships.
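
A simple form of bivariate analysis is the outcome rate within each level of a candidate predictor. The sketch below shows the computation in plain Python; the records are toy data made up for illustration and are not results from MIMIC-IV.

```python
from collections import defaultdict

def outcome_rate_by_group(records, group_key, outcome_key):
    """Return {group: fraction of records with a positive outcome}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for rec in records:
        counts[rec[group_key]][0] += int(rec[outcome_key])
        counts[rec[group_key]][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}

# Toy records: catheter dwell-time bucket and a 0/1 CAUTI label (hypothetical).
toy_records = [
    {"dwell": "<=2d", "cauti": 0},
    {"dwell": "<=2d", "cauti": 0},
    {"dwell": ">2d", "cauti": 1},
    {"dwell": ">2d", "cauti": 0},
]
rates = outcome_rate_by_group(toy_records, "dwell", "cauti")
print(rates)
```

With the cohort in a pandas DataFrame, `df.groupby("dwell")["cauti"].mean()` gives the same per-group rates.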

## Model Selection and Training
1. **Model Choices**:
    - Logistic Regression, Random Forest, Gradient Boosted Trees, Neural Networks, etc.
2. **Model Training**:
    - Split the data into training, validation, and test sets.
    - Train models on the training set using various hyperparameters.
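
CAUTI cases are likely to be a small minority of admissions, so a purely random split can leave the validation or test set with very few positives. The sketch below is a minimal stratified split in plain Python; the 60/20/20 proportions, the seed, and the toy labels are assumptions for illustration.

```python
import random

def stratified_split(labels, train_frac=0.6, val_frac=0.2, seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_val = round(len(indices) * val_frac)
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to the test set
    return train, val, test

# Toy labels with a 10% positive rate (hypothetical).
labels = [1] * 10 + [0] * 90
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

scikit-learn's `train_test_split(..., stratify=y)` offers the same behavior for a two-way split. If a patient can contribute several admissions, splitting by `subject_id` rather than by row avoids the same patient appearing in both training and test data.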

## Validation and Testing
1. **Model Validation**:
    - Use the validation set to tune hyperparameters and avoid overfitting.
2. **Performance Metrics**:
    - Accuracy, precision, recall, F1 score, ROC-AUC, etc.
3. **Model Testing**:
    - Evaluate final model performance on the test set.
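
For a rare outcome, accuracy can look high even when most positive cases are missed, which is why precision and recall are listed alongside it. The sketch below computes the threshold-based metrics from binary labels and predictions; the example vectors are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 2 true cases out of 10, one of which the model finds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.8 while recall is only 0.5, which illustrates the gap between the two on imbalanced data.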

## Results
- Discuss the performance of the best model.
- Highlight important features/predictors.

## Conclusion
- Summarize findings and their implications.
- Suggestions for further research or improvements.

