This notebook takes the raw data and turns it into a filtered, organized, and normalized dataset of ECGs, in which each ECG has been pre-processed so that it is aligned on the R-peak near the 2-second mark.

First, we load the data.

In [1]:
import pickle
import pandas as pd
import numpy as np
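# Load the pickled data, closing the file handle once it has been read
with open('../all_points_may_2024.pkl', 'rb') as f:
    data = pickle.load(f)
# Transpose so that the original columns become rows
data = pd.DataFrame(data).T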


Inspecting the data, we see that ECGs are stored in the "Structures" column. A sample of its structure:

In [3]:
# Print the first 3 rows of column "Structures" of the DataFrame
print(data['Structures'].head(3))

P186    {'2-LV': {'P36': {'I': [-0.075 -0.075 -0.072 ....
P245    {'2-RV': {'P157': {'I': [ 0.     0.     0.    ...
P292    {'2-AO': {'P55': {'I': [-0.063 -0.051 -0.03  ....
Name: Structures, dtype: object


Each row has at least one dictionary in the "Structures" column. This is expected, since every ECG contains multiple signals, ideally 12, one per lead. A row may also contain several dictionaries, which would mean that we have several ECGs for the same patient. We need to check for that and handle it accordingly.
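
One way to run that check is to count the lead dictionaries nested under each row. The sketch below assumes the nesting suggested by the sample output above (structure name, then point ID, then a dictionary of leads). The helper `count_ecgs` and the toy series `toy_structures` are hypothetical names used only for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def count_ecgs(structures):
    """Count the ECGs (lead dictionaries) nested in one 'Structures' entry.

    Assumes the layout {structure_name: {point_id: {lead_name: samples}}}.
    Anything that is not a dictionary counts as zero ECGs.
    """
    if not isinstance(structures, dict):
        return 0
    return sum(
        len(points) for points in structures.values() if isinstance(points, dict)
    )

# Toy data that mimics the assumed layout (not real patient data)
toy_structures = pd.Series(
    [
        {'2-LV': {'P36': {'I': np.zeros(4), 'II': np.zeros(4)}}},
        {'2-RV': {'P157': {'I': np.zeros(4)}, 'P158': {'I': np.zeros(4)}}},
        None,
    ],
    index=['P1', 'P2', 'P3'],
)

# One count per row; rows with a count above 1 hold several ECGs
ecg_counts = toy_structures.apply(count_ecgs)
print(ecg_counts)
```

On the real data, the same function could be applied to `data['Structures']`, provided the nesting assumption holds. Rows with a count greater than 1 hold several ECGs for the same patient and need separate handling.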

In [16]:
# Take a specific row's Structures dictionary and turn it into a dataframe
def get_structures_df(structures_series):
    # Get the first item from the series
    structures_dict = structures_series.iloc[0]
    if isinstance(structures_dict, dict):
        return pd.DataFrame.from_dict(structures_dict, orient='index')
    return None

# Example using the first row's Structures
first_row_structures = get_structures_df(data['Structures'].head(1))

# Now pick one cell of the dataframe, print it
if first_row_structures is not None:
    # Get first column name and its value
    first_col = first_row_structures.columns[0]
    ecg = first_row_structures.iloc[0][first_col]
    print(ecg)

{'I': array([-0.075, -0.075, -0.072, ..., -0.039, -0.039, -0.036]), 'II': array([-0.045, -0.048, -0.048, ...,  0.09 ,  0.09 ,  0.087]), 'III': array([0.03 , 0.027, 0.024, ..., 0.132, 0.129, 0.123]), 'AVR': array([-0.051, -0.051, -0.048, ..., -0.087, -0.084, -0.078]), 'AVL': array([ 0.06 ,  0.06 ,  0.06 , ..., -0.024, -0.024, -0.024]), 'AVF': array([-0.006, -0.009, -0.009, ...,  0.111,  0.108,  0.105]), 'V1': array([0.093, 0.093, 0.09 , ..., 0.021, 0.021, 0.018]), 'V2': array([ 0.078,  0.075,  0.075, ..., -0.006, -0.006, -0.006]), 'V3': array([0.078, 0.078, 0.075, ..., 0.129, 0.126, 0.126]), 'V4': array([0.024, 0.024, 0.021, ..., 0.117, 0.117, 0.114]), 'V5': array([-0.024, -0.027, -0.03 , ...,  0.093,  0.093,  0.09 ]), 'V6': array([-0.054, -0.054, -0.057, ...,  0.063,  0.063,  0.063])}


If we inspect this dictionary (note: we recommend using a data-viewer extension to inspect the data without printing it):

In [17]:
ecg

{'I': array([-0.075, -0.075, -0.072, ..., -0.039, -0.039, -0.036]),
 'II': array([-0.045, -0.048, -0.048, ...,  0.09 ,  0.09 ,  0.087]),
 'III': array([0.03 , 0.027, 0.024, ..., 0.132, 0.129, 0.123]),
 'AVR': array([-0.051, -0.051, -0.048, ..., -0.087, -0.084, -0.078]),
 'AVL': array([ 0.06 ,  0.06 ,  0.06 , ..., -0.024, -0.024, -0.024]),
 'AVF': array([-0.006, -0.009, -0.009, ...,  0.111,  0.108,  0.105]),
 'V1': array([0.093, 0.093, 0.09 , ..., 0.021, 0.021, 0.018]),
 'V2': array([ 0.078,  0.075,  0.075, ..., -0.006, -0.006, -0.006]),
 'V3': array([0.078, 0.078, 0.075, ..., 0.129, 0.126, 0.126]),
 'V4': array([0.024, 0.024, 0.021, ..., 0.117, 0.117, 0.114]),
 'V5': array([-0.024, -0.027, -0.03 , ...,  0.093,  0.093,  0.09 ]),
 'V6': array([-0.054, -0.054, -0.057, ...,  0.063,  0.063,  0.063])}

We can see that this is an ECG: it holds one array of samples per lead, 12 in total.
The keys are named after the leads they represent: "I", "II", "III", "AVR", "AVL", "AVF", "V1", "V2", "V3", "V4", "V5", and "V6".
Each array holds the successive samples of that lead's signal, in millivolts (mV).
Each lead has 2500 samples, so stacking the leads as columns gives data of shape (2500, 12).
We know that the sampling frequency is _______ 
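
To make the (samples, leads) shape explicit, the lead dictionary can be stacked into a DataFrame with one column per lead. This is a minimal sketch, not the notebook's own method: `LEADS`, `ecg_to_frame`, and the all-zero `toy_ecg` are hypothetical names and placeholder data, and the helper assumes every lead array has the same length.

```python
import numpy as np
import pandas as pd

# Lead order as listed in the text above
LEADS = ['I', 'II', 'III', 'AVR', 'AVL', 'AVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6']

def ecg_to_frame(ecg_dict, leads=LEADS):
    """Stack a {lead_name: samples} dictionary into a (n_samples, n_leads) DataFrame."""
    missing = [lead for lead in leads if lead not in ecg_dict]
    if missing:
        raise ValueError(f"missing leads: {missing}")
    # Passing `leads` explicitly fixes the column order
    return pd.DataFrame({lead: np.asarray(ecg_dict[lead]) for lead in leads})

# Placeholder ECG with 2500 samples per lead (zeros, not real data)
toy_ecg = {lead: np.zeros(2500) for lead in LEADS}
ecg_df = ecg_to_frame(toy_ecg)
print(ecg_df.shape)
```

Raising on missing leads makes incomplete ECGs visible early, so they can be filtered out rather than silently producing a narrower table.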
