# Prediction of New Onset Atrial Fibrillation Using Routinely Reported 12-Lead ECG Variables and Electronic Health Data

## Table of Contents

- Background
- Research Question
- Data Dictionary
- References

## Background

Atrial fibrillation (AF) is the most common irregular heart rhythm and is often described as a cardiovascular epidemic of the 21st century. The estimated lifetime risk of developing AF after the age of 45 is as high as 1 in 3 *(Kornej et al., 2020; Linz et al., 2024)*. One of the most dangerous complications of AF is the formation of blood clots in the heart, which can travel to the brain and cause a stroke. Individuals with AF are 4 to 5 times more likely to experience a stroke *(Kornej et al., 2020; Healey et al., 2012)*. Predicting who is likely to develop AF matters because it allows doctors to start preventive treatments, such as blood-thinning medications, at an early stage.

Several risk scores for predicting AF have been developed with traditional statistical models, such as C2HEST *(Li et al., 2019)* and CHARGE-AF *(Alonso et al., 2013)*, but they show only modest discrimination in validation datasets (C-index 0.59-0.73). In addition, some of these scores were derived in specific ethnic groups, which may limit how well they generalize to other populations.

Recently, machine learning (ML) algorithms have been explored for this task and have shown improved predictive performance. One study combined data from patients, electronic health records (EHR), and cardiac MRI scans to predict AF over five years, achieving a C-index of 0.78 *(Dykstra et al., 2022)*. Another model, FIND-AF, was developed in the UK using routine EHR data to predict new cases of AF within six months, achieving a ROC-AUC of 0.82 *(Nadarajah et al., 2023)*.
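The C-index and ROC-AUC values quoted above both measure discrimination: how often a model ranks a patient who develops AF above one who does not. As a minimal sketch, assuming a binary outcome with no censoring (time-to-event versions of the C-index also have to handle censored follow-up, which is ignored here), the statistic can be computed by comparing every event/non-event pair. The function name and the toy data are illustrative and are not taken from any of the cited studies.

```python
from itertools import product


def concordance_index(risk_scores, outcomes):
    """Share of (event, non-event) pairs in which the event case has the higher score.

    Tied scores count as half-concordant. For a binary outcome without
    censoring, this value equals the ROC-AUC.
    """
    event_scores = [s for s, y in zip(risk_scores, outcomes) if y == 1]
    non_event_scores = [s for s, y in zip(risk_scores, outcomes) if y == 0]
    if not event_scores or not non_event_scores:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    for event_score, non_event_score in product(event_scores, non_event_scores):
        if event_score > non_event_score:
            concordant += 1.0
        elif event_score == non_event_score:
            concordant += 0.5
    return concordant / (len(event_scores) * len(non_event_scores))


# Toy data, made up for illustration: 1 = developed AF, 0 = did not.
scores = [0.9, 0.8, 0.3, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
auc = concordance_index(scores, labels)  # 4 of the 6 pairs are concordant
```

A value of 0.5 corresponds to ranking patients at random and 1.0 to perfect ranking, which is why scores in the 0.59-0.73 range are described as modest.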

There is still a need for better AF prediction models that work well for all types of patients and clinical settings. **It is not yet clear if including ECG (heart rhythm test) data can make predictions more accurate than using only EHR data.**

## Research Question

Can a **risk prediction model** be developed using a **large repository of synthetic patient health data**, including **12-lead ECG** and **electronic health record (EHR) variables** from patients in **Southern Alberta** with suspected or known **cardiovascular disease**, to accurately predict the **future occurrence of new-onset atrial fibrillation (AF)** for individual patients?

**Study cohort**: A synthetic dataset of about 100,000 patients without a history of AF. These patients had a baseline ECG done between January 2010 and January 2023 and were followed for at least 12 months. Patients with current or past AF/flutter were excluded based on their baseline ECG and records from continuous ECG monitoring (Holter), diagnostic codes (ICD-10-CA), or procedure codes related to AF/flutter treatment. This synthetic data was created using a random sample of approximately 100,000 patients from the Cardiovascular Imaging Registry of Calgary (CIROC).

**Outcome of interest**: New-onset future AF/flutter detected by any follow-up ECG, continuous ambulatory ECG monitoring (Holter), ICD-10-CA code, or procedural code for AF/flutter intervention.
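The outcome is a composite: a patient counts as a case if any one of the four sources records AF/flutter during follow-up. A minimal sketch of that rule is shown below. The follow-up column names are hypothetical placeholders, since the actual "Future outcomes" variable names are not listed in this section.

```python
import pandas as pd

# Hypothetical follow-up flags (one per detection source); names are assumptions.
follow_up = pd.DataFrame({
    "patient_id": ["A1", "B2", "C3"],
    "af_on_follow_up_ecg": [False, False, True],
    "af_on_holter": [False, False, False],
    "af_icd10_code": [False, True, False],
    "af_procedure_code": [False, False, False],
})

source_columns = [
    "af_on_follow_up_ecg",
    "af_on_holter",
    "af_icd10_code",
    "af_procedure_code",
]

# A patient is positive if at least one source detected AF/flutter.
follow_up["new_onset_af"] = follow_up[source_columns].any(axis=1)
```

Here patient `A1` has no positive source and stays negative, while `B2` (diagnostic code) and `C3` (follow-up ECG) are both labelled as new-onset AF.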

In [3]:
import pandas as pd

## Data Dictionary

In [5]:
data_dictionary_df = pd.read_csv('../data/data_dictionary.csv')
data_dictionary_df.head()

Unnamed: 0,Variable name,Section category,Variable category,Variable type,Definition,cat_1,cat_2,cat_3,cat_4,cat_5,...,cat_12,cat_13,cat_14,cat_14.1,cat_15,cat_16,cat_17,cat_18,cat_19,cat_20
0,patient_id,System,Tracking,alpha_num,Randomly generated 9-digit alpha-numeric ident...,,,,,,...,,,,,,,,,,
1,demographics_age_index_ecg,Demographics,Age,numeric,Chronological age at time of referenced index ...,,,,,,...,,,,,,,,,,
2,demographics_birth_sex,Demographics,Sex,categorical,Sex assigned at birth,Male,Female,,,,...,,,,,,,,,,
3,hypertension_icd10,Cardiac Risk,Hypertension,boolean,ICD-10 coding of hypertension in either DAD or...,No,Yes,,,,...,,,,,,,,,,
4,diabetes_combined,Cardiac Risk,Diabetes,boolean,Documented presence of hyperglycaemic state in...,No,Yes,,,,...,,,,,,,,,,
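Two details in the output above hint at how the CSV was written. An `Unnamed: 0` column usually means a row index was saved to the file, and `cat_14.1` is the name pandas gives to a second column with the header `cat_14`. The sketch below reproduces both effects on a small stand-in for the file (the in-memory CSV is invented for illustration) and shows one possible way to load the index as an index rather than as a data column.

```python
import io

import pandas as pd

# Stand-in for ../data/data_dictionary.csv: an unnamed leading index column
# and a repeated "cat_14" header. The contents are illustrative only.
csv_text = (
    ",Variable name,Variable type,cat_14,cat_14\n"
    "0,patient_id,alpha_num,,\n"
    "1,demographics_birth_sex,categorical,,\n"
)

# Default load: the unnamed column becomes "Unnamed: 0" and the repeated
# header is renamed to "cat_14.1".
raw = pd.read_csv(io.StringIO(csv_text))

# Treating the first column as the index keeps it out of the data columns.
tidy = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Whether the repeated `cat_14` header in the real file is a typo for another category column is not something this section settles, so it is worth checking against the source dictionary before relying on it.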


In [6]:
data_dictionary_df['Section category'].value_counts()

Section category
Medications                    54
Laboratory                     50
Disease - Non-CV               13
Prior cardiovascular events    11
Disease - CV                   10
ECG                             6
Future outcomes                 5
Prior procedures - CV           4
Cardiac Risk                    3
Devices - CV                    3
Demographics                    2
System                          1
Name: count, dtype: int64

**System:**

- `patient_id`: A randomly generated identifier that carries no predictive information. It is needed to detect duplicate records, and can be dropped from the feature set once duplicates have been resolved.
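A minimal sketch of that two-step handling follows. The toy rows are invented, and keeping the first record per patient is an assumption; the source does not say how duplicates should be resolved.

```python
import pandas as pd

# Toy patient table with one duplicated identifier (values are made up).
patients = pd.DataFrame({
    "patient_id": ["A1", "A1", "B2"],
    "demographics_age_index_ecg": [64, 64, 71],
    "hypertension_icd10": ["Yes", "Yes", "No"],
})

# Step 1: resolve duplicates while the identifier is still available.
# Keeping the first row per patient is an assumption, not a documented rule.
deduplicated = patients.drop_duplicates(subset="patient_id", keep="first")

# Step 2: drop the identifier so it cannot leak into the model as a feature.
features = deduplicated.drop(columns=["patient_id"])
```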

**Demographics:**

In [8]:
data_dictionary_df[data_dictionary_df['Section category'] == 'Demographics']

Unnamed: 0,Variable name,Section category,Variable category,Variable type,Definition,cat_1,cat_2,cat_3,cat_4,cat_5,...,cat_12,cat_13,cat_14,cat_14.1,cat_15,cat_16,cat_17,cat_18,cat_19,cat_20
1,demographics_age_index_ecg,Demographics,Age,numeric,Chronological age at time of referenced index ...,,,,,,...,,,,,,,,,,
2,demographics_birth_sex,Demographics,Sex,categorical,Sex assigned at birth,Male,Female,,,,...,,,,,,,,,,


- `demographics_age_index_ecg`: Strong predictor; AF risk increases significantly with age due to cumulative cardiovascular changes.
- `demographics_birth_sex`: Captures sex-specific differences in AF risk and outcomes; men have a higher incidence, while women may have worse outcomes.

**Cardiac Risk:**

In [10]:
data_dictionary_df[data_dictionary_df['Section category'] == 'Cardiac Risk']

Unnamed: 0,Variable name,Section category,Variable category,Variable type,Definition,cat_1,cat_2,cat_3,cat_4,cat_5,...,cat_12,cat_13,cat_14,cat_14.1,cat_15,cat_16,cat_17,cat_18,cat_19,cat_20
3,hypertension_icd10,Cardiac Risk,Hypertension,boolean,ICD-10 coding of hypertension in either DAD or...,No,Yes,,,,...,,,,,,,,,,
4,diabetes_combined,Cardiac Risk,Diabetes,boolean,Documented presence of hyperglycaemic state in...,No,Yes,,,,...,,,,,,,,,,
5,dyslipidemia_combined,Cardiac Risk,Dyslipidemia,boolean,Documented presence of dyslipidemia (treated o...,No,Yes,,,,...,,,,,,,,,,


- `hypertension_icd10`: Hypertension is a major modifiable risk factor for AF due to its role in promoting structural and electrical remodeling of the heart.
- `diabetes_combined`: Diabetes contributes to AF risk through systemic inflammation, oxidative stress, and cardiac remodeling.
- `dyslipidemia_combined`: Dyslipidemia indirectly affects AF risk via its contribution to atherosclerosis and cardiovascular disease.
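The data dictionary lists `No`/`Yes` as the categories for the three cardiac risk flags and `Male`/`Female` for birth sex. Most modelling libraries expect numeric inputs, so one possible encoding is sketched below. It assumes the dataset stores these variables as the text labels shown in the dictionary (they could equally be stored as numeric codes), and the 0/1 assignment for sex is an arbitrary choice.

```python
import pandas as pd

# Toy rows using the category labels from the data dictionary (values made up).
patients = pd.DataFrame({
    "demographics_birth_sex": ["Male", "Female", "Female"],
    "hypertension_icd10": ["No", "Yes", "Yes"],
    "diabetes_combined": ["No", "No", "Yes"],
    "dyslipidemia_combined": ["Yes", "No", "No"],
})

boolean_columns = ["hypertension_icd10", "diabetes_combined", "dyslipidemia_combined"]

encoded = patients.copy()
for column in boolean_columns:
    # Any label other than "No"/"Yes" becomes NaN, which makes surprises visible.
    encoded[column] = encoded[column].map({"No": 0, "Yes": 1})

# Arbitrary coding choice: Male = 0, Female = 1.
encoded["demographics_birth_sex"] = encoded["demographics_birth_sex"].map(
    {"Male": 0, "Female": 1}
)
```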

**Disease - CV:**

## References

Alonso, A., et al. (2013). Simple risk model predicts incidence of atrial fibrillation in a racially and geographically diverse population: The CHARGE‐AF consortium. *Journal of the American Heart Association, 2*(2), Article e000102.

Dykstra, S., et al. (2022). Machine learning prediction of atrial fibrillation in cardiovascular patients using cardiac magnetic resonance and electronic health information. *Frontiers in Cardiovascular Medicine, 9*.

Healey, J. S., et al. (2012). Subclinical atrial fibrillation and the risk of stroke. *New England Journal of Medicine, 366*(2), 120–129.

Kornej, J., Börschel, C. S., Benjamin, E. J., & Schnabel, R. B. (2020). Epidemiology of atrial fibrillation in the 21st century. *Circulation Research, 127*(1), 4–20.

Li, Y.-G., et al. (2019). A simple clinical risk score (C2HEST) for predicting incident atrial fibrillation in Asian subjects: Derivation in 471,446 Chinese subjects, with internal validation and external application in 451,199 Korean subjects. *Chest, 155*(3), 510–518.

Linz, D., et al. (2024). Atrial fibrillation: Epidemiology, screening and digital health. *The Lancet Regional Health – Europe, 37*, Article 100786.

Nadarajah, R., et al. (2023). Prediction of short-term atrial fibrillation risk using primary care electronic health records. *Heart, 109*(12), 1072–1079.