In this notebook we go from raw data to a filtered, organized, and normalized dataset of ECGs, in which each recording has been pre-processed so that it is aligned on the R-peak near the 2-second mark.

First, we load the data.

In [1]:
import pickle
import pandas as pd
import numpy as np
# Use a context manager so the file is closed once it has been loaded
with open('../all_points_may_2024.pkl', 'rb') as f:
    data = pickle.load(f)
data = pd.DataFrame(data).T


Inspecting the data, we see that ECGs are stored in the "Structures" column. A sample of its structure:

In [15]:
# Print the first 3 rows of column "Structures" of the DataFrame
print(data['Structures'].head(3))

p186 = data['Structures']['P186']      # the "Structures" cell of row P186 (a nested dict)
print("P186 sessions:", list(p186.keys()))

P186    {'2-LV': {'P36': {'I': [-0.075 -0.075 -0.072 ....
P245    {'2-RV': {'P157': {'I': [ 0.     0.     0.    ...
P292    {'2-AO': {'P55': {'I': [-0.063 -0.051 -0.03  ....
Name: Structures, dtype: object
P186 sessions: ['2-LV', '1-1-ReAO', '1-AO']


Each row holds a nested dictionary in the "Structures" column. This is expected: a single ECG already consists of several signals, ideally 12, one per lead. A row may also contain several ECGs, which would mean that we have multiple recordings for the same patient. We need to check for that and handle it accordingly.
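
The nesting visible in the printed sample (session label, then `P__` key, then lead, then samples) can be walked explicitly to count how many ECGs one cell holds. The following is a minimal sketch on a small synthetic dictionary; `toy_structures` and `count_ecgs_per_session` are hypothetical names, and the real cells are assumed to follow the same layout with more leads and samples.

```python
import numpy as np

# Synthetic stand-in for one cell of the "Structures" column (hypothetical data).
toy_structures = {
    "2-LV": {
        "P36": {"I": np.zeros(5), "II": np.zeros(5)},
        "P122": {"I": np.zeros(5), "II": np.zeros(5)},
    },
    "1-AO": {
        "P36": {"I": np.zeros(5), "II": np.zeros(5)},
    },
}

def count_ecgs_per_session(structures):
    """Return {session_label: number of ECG dictionaries stored under it}."""
    counts = {}
    for session, recordings in structures.items():
        # Each value under a session is expected to be a dict of lead -> samples.
        counts[session] = sum(isinstance(ecg, dict) for ecg in recordings.values())
    return counts

session_counts = count_ecgs_per_session(toy_structures)
print(session_counts)
```

On the real data, the same function could be called with `data['Structures']['P186']` to see how many ECGs each session of that row contains.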

In [13]:
# Take a specific row's Structures dictionary and turn it into a dataframe
def get_structures_df(structures_series):
    # Get the first item from the series
    structures_dict = structures_series.iloc[0]
    if isinstance(structures_dict, dict):
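        # orient='columns': the outer keys (e.g. '2-LV') become columns and the P__ keys
        # become the index, as in the output below. The later cells index this frame
        # with P__ as columns, which is the layout that orient='index' produces.
        return pd.DataFrame.from_dict(structures_dict, orient='columns')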
    return None
# Example using the first row's Structures
first_row_structures_cell = get_structures_df(data['Structures'].head(1))
print("Name of the first cell:", data.index[0])
if first_row_structures_cell is not None:
    # Print every row of the first column of the DataFrame
    print(first_row_structures_cell.iloc[:, 0])


Name of the first cell: P186
P36     {'I': [-0.075, -0.075, -0.07200000000000001, -...
P122    {'I': [-0.042, -0.045, -0.048, -0.048, -0.048,...
P85     {'I': [-0.048, -0.045, -0.042, -0.042, -0.042,...
P103    {'I': [-0.045, -0.048, -0.048, -0.048, -0.045,...
P86     {'I': [0.003, 0.009000000000000001, 0.015, 0.0...
                              ...                        
P112    {'I': [0.213, 0.219, 0.228, 0.243, 0.261, 0.28...
P114    {'I': [0.033, 0.033, 0.036000000000000004, 0.0...
P17     {'I': [0.009000000000000001, 0.009000000000000...
P29                                                   NaN
P11                                                   NaN
Name: 2-LV, Length: 136, dtype: object


In [4]:

# Now pick one cell of the first_row_structures_cell dataframe and print it
if first_row_structures_cell is not None:
    # Get first column name and its value
    first_col = first_row_structures_cell.columns[0]
    print("Name of the first column:", first_col)
    ecg = first_row_structures_cell.iloc[0][first_col]
    print("Name of the first row:", first_row_structures_cell.index[0])
    print(ecg)


Name of the first column: P36
Name of the first row: 2-LV
{'I': array([-0.075, -0.075, -0.072, ..., -0.039, -0.039, -0.036]), 'II': array([-0.045, -0.048, -0.048, ...,  0.09 ,  0.09 ,  0.087]), 'III': array([0.03 , 0.027, 0.024, ..., 0.132, 0.129, 0.123]), 'AVR': array([-0.051, -0.051, -0.048, ..., -0.087, -0.084, -0.078]), 'AVL': array([ 0.06 ,  0.06 ,  0.06 , ..., -0.024, -0.024, -0.024]), 'AVF': array([-0.006, -0.009, -0.009, ...,  0.111,  0.108,  0.105]), 'V1': array([0.093, 0.093, 0.09 , ..., 0.021, 0.021, 0.018]), 'V2': array([ 0.078,  0.075,  0.075, ..., -0.006, -0.006, -0.006]), 'V3': array([0.078, 0.078, 0.075, ..., 0.129, 0.126, 0.126]), 'V4': array([0.024, 0.024, 0.021, ..., 0.117, 0.117, 0.114]), 'V5': array([-0.024, -0.027, -0.03 , ...,  0.093,  0.093,  0.09 ]), 'V6': array([-0.054, -0.054, -0.057, ...,  0.063,  0.063,  0.063])}


If we inspect this ECG directly (note: an extension for inspecting variables is recommended, so that the data does not have to be printed):

In [5]:
ecg

{'I': array([-0.075, -0.075, -0.072, ..., -0.039, -0.039, -0.036]),
 'II': array([-0.045, -0.048, -0.048, ...,  0.09 ,  0.09 ,  0.087]),
 'III': array([0.03 , 0.027, 0.024, ..., 0.132, 0.129, 0.123]),
 'AVR': array([-0.051, -0.051, -0.048, ..., -0.087, -0.084, -0.078]),
 'AVL': array([ 0.06 ,  0.06 ,  0.06 , ..., -0.024, -0.024, -0.024]),
 'AVF': array([-0.006, -0.009, -0.009, ...,  0.111,  0.108,  0.105]),
 'V1': array([0.093, 0.093, 0.09 , ..., 0.021, 0.021, 0.018]),
 'V2': array([ 0.078,  0.075,  0.075, ..., -0.006, -0.006, -0.006]),
 'V3': array([0.078, 0.078, 0.075, ..., 0.129, 0.126, 0.126]),
 'V4': array([0.024, 0.024, 0.021, ..., 0.117, 0.117, 0.114]),
 'V5': array([-0.024, -0.027, -0.03 , ...,  0.093,  0.093,  0.09 ]),
 'V6': array([-0.054, -0.054, -0.057, ...,  0.063,  0.063,  0.063])}

We can see that this is an example of a complete ECG: it contains one array of samples for each of the 12 leads.
From this we have learned that:
- The keys (the columns, once the ECG is viewed as a table) are named after the lead they represent: "I", "II", "III", "AVR", "AVL", "AVF", "V1", "V2", "V3", "V4", "V5", and "V6".
- Each array holds the sample values of that lead's signal, in millivolts (mV).
- Viewed as a table, the data has shape (2500, 12), which means that there are 2500 samples for each lead.
- We know that 2.5 s of ECG were recorded, so the sampling frequency is 2500 / 2.5 s = 1000 Hz.
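
The shape and sampling-frequency figures above can be checked with a few lines. This is a sketch on a synthetic ECG with the same layout as `ecg`; `toy_ecg`, `LEADS`, and the zero-valued samples are assumptions for illustration, and the 2.5 s duration is the value stated in the text.

```python
import numpy as np
import pandas as pd

LEADS = ["I", "II", "III", "AVR", "AVL", "AVF", "V1", "V2", "V3", "V4", "V5", "V6"]

# Synthetic ECG with the same layout as `ecg` above (hypothetical values).
toy_ecg = {lead: np.zeros(2500) for lead in LEADS}

# One column per lead, one row per sample.
ecg_df = pd.DataFrame(toy_ecg)
n_samples, n_leads = ecg_df.shape

duration_s = 2.5  # recording length stated in the text
sampling_rate_hz = n_samples / duration_s

print(ecg_df.shape, sampling_rate_hz)
```

Replacing `toy_ecg` with the real `ecg` dictionary should give the same shape if the recording is complete.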

To reach this ECG we started from a single cell of the "Structures" column and looked at one specific column of the dictionary stored in it. That column is labelled "P36"; all of the columns (keys) follow the pattern P__, where __ is a number.

The P36 column has 3 rows. Let's extract each of them and understand what they contain.

In [6]:
# Now pick one cell of the dataframe, print it
if first_row_structures_cell is not None:
    # Get the value stored in the second row of the first column
    ecg2 = first_row_structures_cell.iloc[1][first_col]
    print("Name of the second row:", first_row_structures_cell.index[1])
    print(ecg2)
    
    ecg3 = first_row_structures_cell.iloc[2][first_col]
    print("Name of the third row:", first_row_structures_cell.index[2])
    print(ecg3)

Name of the second row: 1-1-ReAO
{'I': array([0.768, 0.747, 0.729, ..., 0.054, 0.051, 0.051]), 'II': array([0.978, 0.972, 0.969, ..., 0.12 , 0.117, 0.117]), 'III': array([0.213, 0.225, 0.237, ..., 0.069, 0.066, 0.066]), 'AVR': array([ 0.276,  0.261,  0.246, ..., -0.006, -0.006, -0.006]), 'AVL': array([-0.87 , -0.858, -0.846, ..., -0.087, -0.087, -0.087]), 'AVF': array([0.6  , 0.6  , 0.603, ..., 0.096, 0.093, 0.093]), 'V1': array([-0.258, -0.243, -0.225, ..., -0.033, -0.033, -0.033]), 'V2': array([1.518, 1.476, 1.419, ..., 0.132, 0.132, 0.129]), 'V3': array([1.167, 1.167, 1.167, ..., 0.198, 0.195, 0.192]), 'V4': array([1.74 , 1.728, 1.71 , ..., 0.267, 0.264, 0.264]), 'V5': array([1.728, 1.71 , 1.689, ..., 0.234, 0.234, 0.231]), 'V6': array([1.614, 1.596, 1.569, ..., 0.189, 0.189, 0.186])}
Name of the third row: 1-AO
{'I': array([ 0.   ,  0.003,  0.012, ..., -0.036, -0.036, -0.033]), 'II': array([0.048, 0.048, 0.045, ..., 0.105, 0.102, 0.102]), 'III': array([0.048, 0.042, 0.033, ..., 0.1

We see that for P36 there is a complete ECG in each of the 3 rows of `first_row_structures_cell`: 2-LV, 1-1-ReAO, and 1-AO.
Other columns of `first_row_structures_cell` do not all have the same number of filled rows. Some have 1, some have 2, and some have 3, so when the dataframe is viewed as a table, some of its cells are empty:

In [7]:
# Pick the next 5 columns, for each iterate over the 3 rows, print the ECG or "it's empty"
# Get the next 5 columns
for i in range(1, 6):
    col_name = first_row_structures_cell.columns[i]
    print("------")
    print("Name of the column:", col_name)
    for j in range(3):
        iterating_ecg = first_row_structures_cell.iloc[j][col_name]
        print("Name of the row:", first_row_structures_cell.index[j], "-", iterating_ecg)

------
Name of the column: P122
Name of the row: 2-LV - {'I': array([-0.042, -0.045, -0.048, ..., -0.078, -0.072, -0.072]), 'II': array([-0.015, -0.015, -0.012, ...,  0.117,  0.12 ,  0.12 ]), 'III': array([0.027, 0.03 , 0.033, ..., 0.195, 0.195, 0.198]), 'AVR': array([-0.033, -0.039, -0.042, ..., -0.138, -0.132, -0.135]), 'AVL': array([ 0.027,  0.03 ,  0.03 , ..., -0.018, -0.021, -0.024]), 'AVF': array([0.006, 0.006, 0.009, ..., 0.156, 0.159, 0.159]), 'V1': array([0.054, 0.054, 0.054, ..., 0.078, 0.078, 0.075]), 'V2': array([ 0.03 ,  0.033,  0.033, ..., -0.024, -0.027, -0.027]), 'V3': array([0.027, 0.027, 0.03 , ..., 0.147, 0.147, 0.15 ]), 'V4': array([-0.057, -0.054, -0.051, ...,  0.105,  0.108,  0.108]), 'V5': array([-0.096, -0.093, -0.09 , ...,  0.069,  0.069,  0.069]), 'V6': array([-0.105, -0.105, -0.102, ...,  0.03 ,  0.03 ,  0.03 ])}
Name of the row: 1-1-ReAO - nan
Name of the row: 1-AO - nan
------
Name of the column: P85
Name of the row: 2-LV - {'I': array([-0.048, -0.045, -0.0
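
The per-column availability printed above can also be summarized without dumping every ECG. The sketch below builds a small synthetic frame with the same layout (sessions as rows, `P__` keys as columns) and counts the non-empty cells per column; `toy_structures` and `toy_cell` are hypothetical stand-ins for the real cell and for `first_row_structures_cell`.

```python
import pandas as pd

# Synthetic stand-in: P36 has an ECG in every session, P122 only in "2-LV".
toy_structures = {
    "2-LV": {"P36": {"I": [0.0]}, "P122": {"I": [0.0]}},
    "1-1-ReAO": {"P36": {"I": [0.0]}},
    "1-AO": {"P36": {"I": [0.0]}},
}

# Sessions become rows and P__ keys become columns; missing entries become NaN.
toy_cell = pd.DataFrame.from_dict(toy_structures, orient="index")

# Number of non-empty cells (ECGs) in each column.
ecgs_per_column = toy_cell.notna().sum()
print(ecgs_per_column)
```

Applied to the real frame, `first_row_structures_cell.notna().sum()` would give the same kind of summary for all columns at once.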

It is strange that `first_row_structures_cell`, which as seen before corresponds to "P186", contains other "P__" labels as columns inside its Structures cell.

We will now check whether the ECGs that are missing from some rows are actually stored in another cell of the original dataframe.

We first look up a few arbitrary P__ columns as cells of the original dataframe and display them as in the previous step. Later we will merge all the findings programmatically into a single complete dataframe.

In [8]:
def inspect_cell_ecgs(cell_name):
    # Get the structures dictionary from the cell
    try:
        cell_structures = data.loc[cell_name, 'Structures']
    except KeyError:
        print(f"Cell {cell_name} not found in the data.")
        return
    
    if isinstance(cell_structures, dict):
        # Convert the dictionary to a dataframe
        structures_df = pd.DataFrame.from_dict(cell_structures, orient='index')
        
        # Print first 5 columns and their ECGs
        for col in structures_df.columns[:5]:
            print(f"\nColumn: {col}")
            # Iterate through rows
            for row_idx in structures_df.index:
                ecg = structures_df.loc[row_idx, col]
                if isinstance(ecg, dict):
                    print(f"Row: {row_idx}")
                    # Just print first 5 samples of lead I to keep output manageable
                    if 'I' in ecg:
                        print(f"First 5 samples of lead I: {ecg['I'][:5]}")
                else:
                    print(f"Row: {row_idx} - Empty")

# Example usage
print("Inspecting cell P186:")
inspect_cell_ecgs('P186') # The original cell we started with in first_row_structures_cell


Inspecting cell P186:

Column: P36
Row: 2-LV
First 5 samples of lead I: [-0.075 -0.075 -0.072 -0.066 -0.063]
Row: 1-1-ReAO
First 5 samples of lead I: [0.768 0.747 0.729 0.705 0.669]
Row: 1-AO
First 5 samples of lead I: [0.    0.003 0.012 0.015 0.018]

Column: P122
Row: 2-LV
First 5 samples of lead I: [-0.042 -0.045 -0.048 -0.048 -0.048]
Row: 1-1-ReAO - Empty
Row: 1-AO - Empty

Column: P85
Row: 2-LV
First 5 samples of lead I: [-0.048 -0.045 -0.042 -0.042 -0.042]
Row: 1-1-ReAO - Empty
Row: 1-AO - Empty

Column: P103
Row: 2-LV
First 5 samples of lead I: [-0.045 -0.048 -0.048 -0.048 -0.045]
Row: 1-1-ReAO - Empty
Row: 1-AO - Empty

Column: P86
Row: 2-LV
First 5 samples of lead I: [0.003 0.009 0.015 0.018 0.018]
Row: 1-1-ReAO - Empty
Row: 1-AO - Empty


In [9]:

# Manually inspect cells we found inside the first_row_structures_cell dataframe
inspect_cell_ecgs('P122')
inspect_cell_ecgs('P103')
inspect_cell_ecgs('P85')
inspect_cell_ecgs('P86')
inspect_cell_ecgs('P42')

Cell P122 not found in the data.
Cell P103 not found in the data.
Cell P85 not found in the data.
Cell P86 not found in the data.
Cell P42 not found in the data.


We will try to see if *any* of the columns we had in first_row_structures_cell are present as cells of the original dataframe:

In [10]:
# Pick all columns of the first_row_structures_cell dataframe
# Then call inspect_cell_ecgs for each of them

list_of_columns = first_row_structures_cell.columns
print("List of columns:", list_of_columns)

for col in list_of_columns:
    inspect_cell_ecgs(col)

List of columns: Index(['P36', 'P122', 'P85', 'P103', 'P86', 'P42', 'P3', 'P65', 'P63', 'P5',
       ...
       'P38', 'P54', 'P13', 'P91', 'P52', 'P112', 'P114', 'P17', 'P29', 'P11'],
      dtype='object', length=136)
Cell P36 not found in the data.
Cell P122 not found in the data.
Cell P85 not found in the data.
Cell P103 not found in the data.
Cell P86 not found in the data.
Cell P42 not found in the data.
Cell P3 not found in the data.
Cell P65 not found in the data.
Cell P63 not found in the data.
Cell P5 not found in the data.
Cell P90 not found in the data.
Cell P23 not found in the data.
Cell P15 not found in the data.
Cell P62 not found in the data.
Cell P121 not found in the data.
Cell P14 not found in the data.
Cell P1 not found in the data.
Cell P72 not found in the data.

Column: P13
Row: 1-PA
First 5 samples of lead I: [0.006 0.003 0.    0.    0.   ]

Column: P5
Row: 1-PA
First 5 samples of lead I: [-0.006 -0.003 -0.003 -0.009 -0.012]

Column: P11
Row: 1-PA
First 5 sample

It seems like some of the columns of the first_row_structures_cell are cells themselves, and some are not. 
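
Instead of reading the "not found" messages one by one, the overlap between the two sets of labels can be computed directly. This is a sketch with made-up labels; `row_labels` stands in for `data.index` and `column_labels` for `first_row_structures_cell.columns`, and neither list reflects the full real data.

```python
import pandas as pd

# Hypothetical labels standing in for data.index and first_row_structures_cell.columns.
row_labels = pd.Index(["P186", "P245", "P292", "P13", "P5"])
column_labels = pd.Index(["P36", "P122", "P13", "P5", "P11"])

# Boolean mask: True where a column label is also a row label of the original frame.
is_also_row = column_labels.isin(row_labels)

also_rows = column_labels[is_also_row]      # labels that are cells themselves
only_columns = column_labels[~is_also_row]  # labels that only appear as columns

print("Also rows:", list(also_rows))
print("Only columns:", list(only_columns))
```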

Now we can try to create a unified dataframe where columns are P__ (extracted from inside each P__ cell) and rows are the different rows found across all cells. Maybe this unified dataframe will contain the missing ECGs that we are looking for when inspecting individual cells.

In [37]:
# Loop over all cells in the original dataframe; for each one, iterate over all columns and rows
# and store any ECG found in the unified dataframe, under the column and row it had inside the cell

# Create a dictionary to store all ECGs
ecg_dict = {}

# Loop through all rows in the original dataframe
for cell_name, row in data.iterrows():
    # Get the structures dictionary
    structures = row['Structures']
    if not isinstance(structures, dict):
        continue
        
    # Convert structures to dataframe
    structures_df = pd.DataFrame.from_dict(structures, orient='index')

    # Loop through each column (P__) in the structures dataframe
    for col in structures_df.columns:
        # Loop through each row in this column
        for row_idx in structures_df.index:
            ecg = structures_df.loc[row_idx, col]
            if isinstance(ecg, dict):  # Ensure it's a valid ECG dictionary
                # Initialize the column in the dictionary if not already present
                if col not in ecg_dict:
                    ecg_dict[col] = {}
                # Check if this combination already exists
                if row_idx in ecg_dict[col]:
                    print(f"Found duplicate: Patient {col}, Row {row_idx}")
                    # TODO: Compare if the ECGs are the same?

                # Store the ECG in the dictionary (a duplicate overwrites the earlier entry)
                ecg_dict[col][row_idx] = ecg

# Create the unified DataFrame from the dictionary
unified_df = pd.DataFrame.from_dict(ecg_dict, orient='index')

# Transpose so identifiers become columns
unified_df = unified_df.T

# Display the shape and first few rows of the unified dataframe
print(f"Shape of unified dataframe: {unified_df.shape}")


Found duplicate: Patient P11, Row 1-AO
Found duplicate: Patient P55, Row 2-LV
Found duplicate: Patient P43, Row 2-LV
Found duplicate: Patient P10, Row 2-LV
Found duplicate: Patient P7, Row 2-LV
Found duplicate: Patient P7, Row 1-AO
Found duplicate: Patient P39, Row 2-LV
Found duplicate: Patient P24, Row 2-LV
Found duplicate: Patient P52, Row 2-LV
Found duplicate: Patient P64, Row 2-LV
Found duplicate: Patient P46, Row 2-LV
Found duplicate: Patient P40, Row 2-LV
Found duplicate: Patient P45, Row 2-LV
Found duplicate: Patient P33, Row 2-LV
Found duplicate: Patient P53, Row 2-LV
Found duplicate: Patient P67, Row 2-LV
Found duplicate: Patient P30, Row 2-LV
Found duplicate: Patient P36, Row 2-LV
Found duplicate: Patient P51, Row 2-LV
Found duplicate: Patient P31, Row 2-LV
Found duplicate: Patient P57, Row 2-LV
Found duplicate: Patient P60, Row 2-LV
Found duplicate: Patient P17, Row 2-LV
Found duplicate: Patient P17, Row 1-AO
Found duplicate: Patient P62, Row 2-LV
Found duplicate: Patient P1
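
The loop above reports duplicates but does not check whether the two ECGs actually differ. One possible way to address the TODO is sketched below; `ecgs_equal` is a hypothetical helper, and it assumes that two ECGs count as the same only when they have the same leads and exactly identical samples.

```python
import numpy as np

def ecgs_equal(ecg_a, ecg_b):
    """Return True when both ECG dicts have the same leads and identical samples."""
    if ecg_a.keys() != ecg_b.keys():
        return False
    return all(np.array_equal(ecg_a[lead], ecg_b[lead]) for lead in ecg_a)

# Tiny synthetic ECGs (two leads, two samples) to exercise the helper.
first = {"I": np.array([0.0, 0.1]), "II": np.array([0.2, 0.3])}
same = {"I": np.array([0.0, 0.1]), "II": np.array([0.2, 0.3])}
different = {"I": np.array([0.0, 0.1]), "II": np.array([0.2, 0.4])}

print(ecgs_equal(first, same))       # identical leads and samples
print(ecgs_equal(first, different))  # one sample differs in lead II
```

Inside the loop, the helper could be called as `ecgs_equal(ecg_dict[col][row_idx], ecg)` before overwriting, so that only conflicting duplicates are reported. If small numerical differences should be tolerated, `np.allclose` could replace `np.array_equal`.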

In [None]:
# Count the ECGs in the unified dataframe: for each cell, check that its value is not NaN
# and that it has the shape of an ECG (12 leads, 2500 samples)
def is_ecg_valid(ecg):
    # First guard clause: check if ecg is a dictionary
    if not isinstance(ecg, dict):
        return False
    
    # Second guard clause: check if ecg has exactly 12 leads
    if len(ecg) != 12:
        print(f"Invalid number of leads: {len(ecg)}")
        return False
    
    # Check if each lead has 2500 samples
    for lead in ecg.values():
        if len(lead) != 2500:
            print(f"Invalid number of samples in lead: {len(lead)}")
            return False
    
    return True

# Initialize a counter for valid ECGs
valid_ecg_count = 0
# Loop through the unified dataframe
for col in unified_df.columns:
    for row_idx in unified_df.index:
        ecg = unified_df.loc[row_idx, col]
        if is_ecg_valid(ecg):
            valid_ecg_count += 1

print(f"Number of valid ECGs in the unified dataframe: {valid_ecg_count}")

Number of valid ECGs in the unified dataframe: 13861


In [27]:
# Count ECGs by patient
def count_ecgs_by_patient(unified_df):
    # Initialize a dictionary to store counts
    patient_counts = {}
    
    # Loop through the unified dataframe
    for col in unified_df.columns:
        for row_idx in unified_df.index:
            ecg = unified_df.loc[row_idx, col]
            if is_ecg_valid(ecg):
                # Extract patient ID from the column name
                patient_id = col.split('_')[0]  # Keep the part before any '_' (no-op for plain 'P36'-style names)
                if patient_id not in patient_counts:
                    patient_counts[patient_id] = 0
                patient_counts[patient_id] += 1
    
    return patient_counts

patient_counts = count_ecgs_by_patient(unified_df)
print("ECG counts by patient:")

# Keep only patient IDs whose numeric part (after removing 'P') is an integer, then sort by it
sorted_patients = []
for patient_id, count in patient_counts.items():
    try:
        # Validate that the ID is numeric once 'P' is removed
        int(patient_id.replace('P', ''))
    except ValueError:  # Skip non-numeric IDs
        continue
    sorted_patients.append((patient_id, count))
        
sorted_patients.sort(key=lambda x: int(x[0].replace('P', '')))


# Print counts and check for gaps in sorted Patient IDs
prev_num = None
gaps = []
for patient_id, count in sorted_patients:
    print(f"Patient {patient_id}: {count} ECGs")
    
    # Check for gaps in progression
    current_num = int(patient_id.replace('P', ''))
    if prev_num is not None and current_num - prev_num > 1:
        for missing_num in range(prev_num + 1, current_num):
            gaps.append(f"P{missing_num}")
    prev_num = current_num


ECG counts by patient:
Patient P1: 67 ECGs
Patient P2: 64 ECGs
Patient P3: 64 ECGs
Patient P4: 63 ECGs
Patient P5: 62 ECGs
Patient P6: 64 ECGs
Patient P7: 60 ECGs
Patient P8: 62 ECGs
Patient P9: 65 ECGs
Patient P10: 64 ECGs
Patient P11: 61 ECGs
Patient P12: 67 ECGs
Patient P13: 64 ECGs
Patient P14: 66 ECGs
Patient P15: 63 ECGs
Patient P16: 66 ECGs
Patient P17: 65 ECGs
Patient P18: 65 ECGs
Patient P19: 65 ECGs
Patient P20: 63 ECGs
Patient P21: 63 ECGs
Patient P22: 61 ECGs
Patient P23: 62 ECGs
Patient P24: 62 ECGs
Patient P25: 58 ECGs
Patient P26: 62 ECGs
Patient P27: 60 ECGs
Patient P28: 60 ECGs
Patient P29: 64 ECGs
Patient P30: 60 ECGs
Patient P31: 59 ECGs
Patient P32: 60 ECGs
Patient P33: 57 ECGs
Patient P34: 62 ECGs
Patient P35: 59 ECGs
Patient P36: 60 ECGs
Patient P37: 57 ECGs
Patient P38: 57 ECGs
Patient P39: 60 ECGs
Patient P40: 57 ECGs
Patient P41: 56 ECGs
Patient P42: 56 ECGs
Patient P43: 53 ECGs
Patient P44: 55 ECGs
Patient P45: 58 ECGs
Patient P46: 59 ECGs
Patient P47: 53 ECGs

In [28]:
if gaps:
    print("\nMissing patient IDs in progression:")
    print(", ".join(gaps))


Missing patient IDs in progression:
P1084, P1097, P1099, P1100, P1101, P1102, P1103, P1107, P1108, P1128
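
The gap check can also be written as a small standalone function, which makes it easier to test on a known list. This is a sketch under the assumption that every ID has the form `P` followed by digits; `missing_ids` is a hypothetical helper, and IDs that do not match that form are skipped, as in the loop above.

```python
def missing_ids(patient_ids):
    """Return the 'P<n>' IDs absent between the smallest and largest numeric ID."""
    # Keep only IDs whose part after the leading character is purely numeric.
    numbers = sorted(int(pid[1:]) for pid in patient_ids if pid[1:].isdigit())
    if not numbers:
        return []
    present = set(numbers)
    return [f"P{n}" for n in range(numbers[0], numbers[-1] + 1) if n not in present]

example_gaps = missing_ids(["P1", "P2", "P5", "Pxx"])
print(example_gaps)
```

With the real data, `missing_ids(patient_counts.keys())` would be expected to return the same IDs as the `gaps` list.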
