In this notebook we go from raw data to a filtered, organized, and normalized dataset of ECGs, each pre-processed so that it is aligned on the R-peak near the 2-second mark.

First, we load the data:

In [1]:
import pickle
import pandas as pd
import numpy as np
# Use a context manager so the file handle is closed after loading
with open('../all_points_may_2024.pkl', 'rb') as f:
    data = pickle.load(f)
data = pd.DataFrame(data).T


Inspecting the data, we see that ECGs are stored in the "Structures" column. A sample of its structure:

In [2]:
# Print the first 3 rows of column "Structures" of the DataFrame
print(data['Structures'].head(3))

P186    {'2-LV': {'P36': {'I': [-0.075 -0.075 -0.072 ....
P245    {'2-RV': {'P157': {'I': [ 0.     0.     0.    ...
P292    {'2-AO': {'P55': {'I': [-0.063 -0.051 -0.03  ....
Name: Structures, dtype: object


Each row has at least one dictionary in the "Structures" column. This is expected, since a single ECG contains multiple signals, ideally 12, one for each lead. A row may also contain multiple dictionaries, which would mean that we have multiple ECGs for that same patient. We will need to check for that and handle it accordingly.
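
The check for multiple ECGs per row could be sketched as follows. This is a minimal example under assumptions: the nesting `{structure: {point: {lead: samples}}}` is inferred from the printed sample above, `count_ecgs` is a hypothetical helper, and `toy` is a synthetic stand-in for the real "Structures" column.

```python
import numpy as np
import pandas as pd

def count_ecgs(structures):
    """Count ECG dictionaries nested as {structure: {point: {lead: samples}}}.

    Hypothetical helper; the nesting is an assumption based on the sample output.
    """
    if not isinstance(structures, dict):
        return 0
    return sum(
        len(points) for points in structures.values() if isinstance(points, dict)
    )

# Synthetic stand-in for data['Structures'] (values are placeholders, not real ECGs)
toy = pd.Series(
    [
        {"2-LV": {"P36": {"I": np.zeros(5)}}, "1-AO": {"P36": {"I": np.zeros(5)}}},
        {"2-RV": {"P157": {"I": np.zeros(5)}}},
    ],
    index=["P186", "P245"],
)

# One count per row; values above 1 flag rows holding several ECGs
ecg_counts = toy.apply(count_ecgs)
print(ecg_counts)
```

On the real data, the same function could be applied to `data['Structures']` to flag rows that need special handling.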

In [None]:
# Take a specific row's Structures dictionary and turn it into a dataframe
def get_structures_df(structures_series):
    # Get the first item from the series
    structures_dict = structures_series.iloc[0]
    if isinstance(structures_dict, dict):
        return pd.DataFrame.from_dict(structures_dict, orient='index')
    return None

# Example using the first row's Structures
first_row_structures_cell = get_structures_df(data['Structures'].head(1))
if first_row_structures_cell is not None:
    # Print all rows and the first column of the DataFrame, in this case just 3 rows
    print(first_row_structures_cell.iloc[:, 0])


2-LV        {'I': [-0.075, -0.075, -0.07200000000000001, -...
1-1-ReAO    {'I': [0.768, 0.747, 0.729, 0.705, 0.669, 0.62...
1-AO        {'I': [0.0, 0.003, 0.012, 0.015, 0.01800000000...
Name: P36, dtype: object


In [21]:

# Now pick one cell of the first_row_structures_cell dataframe and print it
if first_row_structures_cell is not None:
    # Get first column name and its value
    first_col = first_row_structures_cell.columns[0]
    print("Name of the first column:", first_col)
    ecg = first_row_structures_cell.iloc[0][first_col]
    print("Name of the first row:", first_row_structures_cell.index[0])
    print(ecg)


Name of the first column: P36
Name of the first row: 2-LV
{'I': array([-0.075, -0.075, -0.072, ..., -0.039, -0.039, -0.036]), 'II': array([-0.045, -0.048, -0.048, ...,  0.09 ,  0.09 ,  0.087]), 'III': array([0.03 , 0.027, 0.024, ..., 0.132, 0.129, 0.123]), 'AVR': array([-0.051, -0.051, -0.048, ..., -0.087, -0.084, -0.078]), 'AVL': array([ 0.06 ,  0.06 ,  0.06 , ..., -0.024, -0.024, -0.024]), 'AVF': array([-0.006, -0.009, -0.009, ...,  0.111,  0.108,  0.105]), 'V1': array([0.093, 0.093, 0.09 , ..., 0.021, 0.021, 0.018]), 'V2': array([ 0.078,  0.075,  0.075, ..., -0.006, -0.006, -0.006]), 'V3': array([0.078, 0.078, 0.075, ..., 0.129, 0.126, 0.126]), 'V4': array([0.024, 0.024, 0.021, ..., 0.117, 0.117, 0.114]), 'V5': array([-0.024, -0.027, -0.03 , ...,  0.093,  0.093,  0.09 ]), 'V6': array([-0.054, -0.054, -0.057, ...,  0.063,  0.063,  0.063])}


If we inspect this dictionary (note: we recommend using an extension that lets you inspect the data without printing it):

In [4]:
ecg

{'I': array([-0.075, -0.075, -0.072, ..., -0.039, -0.039, -0.036]),
 'II': array([-0.045, -0.048, -0.048, ...,  0.09 ,  0.09 ,  0.087]),
 'III': array([0.03 , 0.027, 0.024, ..., 0.132, 0.129, 0.123]),
 'AVR': array([-0.051, -0.051, -0.048, ..., -0.087, -0.084, -0.078]),
 'AVL': array([ 0.06 ,  0.06 ,  0.06 , ..., -0.024, -0.024, -0.024]),
 'AVF': array([-0.006, -0.009, -0.009, ...,  0.111,  0.108,  0.105]),
 'V1': array([0.093, 0.093, 0.09 , ..., 0.021, 0.021, 0.018]),
 'V2': array([ 0.078,  0.075,  0.075, ..., -0.006, -0.006, -0.006]),
 'V3': array([0.078, 0.078, 0.075, ..., 0.129, 0.126, 0.126]),
 'V4': array([0.024, 0.024, 0.021, ..., 0.117, 0.117, 0.114]),
 'V5': array([-0.024, -0.027, -0.03 , ...,  0.093,  0.093,  0.09 ]),
 'V6': array([-0.054, -0.054, -0.057, ...,  0.063,  0.063,  0.063])}

We can see that this is an example of a complete ECG: it contains arrays of samples for 12 leads, one array per lead.
From this we have learned that, viewing the dictionary as a table with one column per key:
Columns are named according to the lead they represent: "I", "II", "III", "AVR", "AVL", "AVF", "V1", "V2", "V3", "V4", "V5", and "V6".
Rows are the individual samples of the signal, in millivolts (mV).
The shape of the data is (2500, 12), which means that there are 2500 samples for each lead.
Each recording covers 2.5 s, so the sampling frequency is 2500 / 2.5 s = 1000 Hz.
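
These observations can be verified in code. The sketch below uses a synthetic dictionary of zeros in place of the real `ecg`; the names `LEADS`, `DURATION_S`, and `toy_ecg` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

# Lead names as listed above
LEADS = ["I", "II", "III", "AVR", "AVL", "AVF", "V1", "V2", "V3", "V4", "V5", "V6"]
DURATION_S = 2.5  # recording length stated above, in seconds

# Synthetic stand-in for one ECG dictionary (zeros instead of real samples)
toy_ecg = {lead: np.zeros(2500) for lead in LEADS}

# One column per lead, one row per sample
ecg_df = pd.DataFrame(toy_ecg)
n_samples, n_leads = ecg_df.shape

# Sampling frequency follows from the sample count and the duration
sampling_hz = n_samples / DURATION_S

# A complete ECG has exactly the 12 expected leads
is_complete = set(ecg_df.columns) == set(LEADS)

print(ecg_df.shape, sampling_hz, is_complete)
```

Replacing `toy_ecg` with the `ecg` dictionary from the cell above should run the same checks on the real recording.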

To get to this ECG, we started from a single cell of the "Structures" column, converted the dictionary in that cell into a dataframe, and looked at one specific column of it. This column is labelled "P36"; the columns (keys of the inner dictionaries) are labelled P__, where __ is a number.
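
The nesting described above can also be walked directly, without building a dataframe. This is a minimal sketch assuming the layout `{structure: {point: {lead: samples}}}`; `list_ecgs` is a hypothetical helper and `toy_cell` is a synthetic stand-in for one cell.

```python
import numpy as np

# Synthetic stand-in for one "Structures" cell (placeholder values)
toy_cell = {
    "2-LV": {"P36": {"I": np.zeros(4), "II": np.zeros(4)}},
    "1-AO": {"P36": {"I": np.zeros(4), "II": np.zeros(4)}},
}

def list_ecgs(cell):
    """Return a (structure, point, leads) tuple for every ECG in one cell."""
    found = []
    for structure, points in cell.items():
        for point, ecg in points.items():
            # sorted(ecg) yields the lead names, i.e. the dictionary keys
            found.append((structure, point, sorted(ecg)))
    return found

entries = list_ecgs(toy_cell)
for structure, point, leads in entries:
    print(structure, point, leads)
```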

The P36 column has 3 rows. Let's extract the contents of each and work out what they are.

In [None]:
# Now print the remaining two cells of the first column
if first_row_structures_cell is not None:
    # Get the value in the second row of the first column
    ecg2 = first_row_structures_cell.iloc[1][first_col]
    print("Name of the second row:", first_row_structures_cell.index[1])
    print(ecg2)

    # Get the value in the third row of the first column
    ecg3 = first_row_structures_cell.iloc[2][first_col]
    print("Name of the third row:", first_row_structures_cell.index[2])
    print(ecg3)

Name of the second row: 1-1-ReAO
{'I': array([0.768, 0.747, 0.729, ..., 0.054, 0.051, 0.051]), 'II': array([0.978, 0.972, 0.969, ..., 0.12 , 0.117, 0.117]), 'III': array([0.213, 0.225, 0.237, ..., 0.069, 0.066, 0.066]), 'AVR': array([ 0.276,  0.261,  0.246, ..., -0.006, -0.006, -0.006]), 'AVL': array([-0.87 , -0.858, -0.846, ..., -0.087, -0.087, -0.087]), 'AVF': array([0.6  , 0.6  , 0.603, ..., 0.096, 0.093, 0.093]), 'V1': array([-0.258, -0.243, -0.225, ..., -0.033, -0.033, -0.033]), 'V2': array([1.518, 1.476, 1.419, ..., 0.132, 0.132, 0.129]), 'V3': array([1.167, 1.167, 1.167, ..., 0.198, 0.195, 0.192]), 'V4': array([1.74 , 1.728, 1.71 , ..., 0.267, 0.264, 0.264]), 'V5': array([1.728, 1.71 , 1.689, ..., 0.234, 0.234, 0.231]), 'V6': array([1.614, 1.596, 1.569, ..., 0.189, 0.189, 0.186])}
Name of the third row: 1-AO
{'I': array([ 0.   ,  0.003,  0.012, ..., -0.036, -0.036, -0.033]), 'II': array([0.048, 0.048, 0.045, ..., 0.105, 0.102, 0.102]), 'III': array([0.048, 0.042, 0.033, ..., 0.1