# Data preprocessing

As observed in the Exploratory Data Analysis, some columns are categorical, so one-hot encoding should be applied to them. The data also shows class imbalance, which must be handled appropriately.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

In [2]:
df = pd.read_excel("../Data/Diagnostics.xlsx")
df.head(5)

Unnamed: 0,FileName,Rhythm,Beat,PatientAge,Gender,VentricularRate,AtrialRate,QRSDuration,QTInterval,QTCorrected,RAxis,TAxis,QRSCount,QOnset,QOffset,TOffset
0,MUSE_20180113_171327_27000,AFIB,RBBB TWC,85,MALE,117,234,114,356,496,81,-27,19,208,265,386
1,MUSE_20180112_073319_29000,SB,TWC,59,FEMALE,52,52,92,432,401,76,42,8,215,261,431
2,MUSE_20180111_165520_97000,SA,NONE,20,FEMALE,67,67,82,382,403,88,20,11,224,265,415
3,MUSE_20180113_121940_44000,SB,NONE,66,MALE,53,53,96,456,427,34,3,9,219,267,447
4,MUSE_20180112_122850_57000,AF,STDD STTC,73,FEMALE,162,162,114,252,413,68,-40,26,228,285,354


The FileName column is not relevant to the prediction task, so we drop it.

In [3]:
df.drop(['FileName'], axis='columns', inplace=True)
df.head(5)

Unnamed: 0,Rhythm,Beat,PatientAge,Gender,VentricularRate,AtrialRate,QRSDuration,QTInterval,QTCorrected,RAxis,TAxis,QRSCount,QOnset,QOffset,TOffset
0,AFIB,RBBB TWC,85,MALE,117,234,114,356,496,81,-27,19,208,265,386
1,SB,TWC,59,FEMALE,52,52,92,432,401,76,42,8,215,261,431
2,SA,NONE,20,FEMALE,67,67,82,382,403,88,20,11,224,265,415
3,SB,NONE,66,MALE,53,53,96,456,427,34,3,9,219,267,447
4,AF,STDD STTC,73,FEMALE,162,162,114,252,413,68,-40,26,228,285,354


In [4]:
df_value_counts = df['Rhythm'].value_counts()
print(df_value_counts)

SB       3889
SR       1826
AFIB     1780
ST       1568
SVT       587
AF        445
SA        399
AT        121
AVNRT      16
AVRT        8
SAAWR       7
Name: Rhythm, dtype: int64


In [5]:
# Keep only the rhythm classes with more than 500 samples
frequent_classes = df_value_counts[df_value_counts > 500].index
df = df[df['Rhythm'].isin(frequent_classes)]
print(df['Rhythm'].value_counts())

SB      3889
SR      1826
AFIB    1780
ST      1568
SVT      587
Name: Rhythm, dtype: int64
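
The remaining classes are still imbalanced (SB has 3889 samples, SVT only 587). One common option, not necessarily the one used later in this project, is to weight each class inversely to its frequency. The following is a minimal sketch, assuming scikit-learn's `compute_class_weight` with the `"balanced"` heuristic; the label vector is rebuilt from the printed counts purely for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the output above.
counts = {"SB": 3889, "SR": 1826, "AFIB": 1780, "ST": 1568, "SVT": 587}

# Rebuild a label vector with the same class distribution (illustration only).
labels = np.repeat(list(counts.keys()), list(counts.values()))

classes = np.unique(labels)  # sorted alphabetically
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

# "balanced" weights equal n_samples / (n_classes * class_count),
# so rare classes receive larger weights.
class_weights = dict(zip(classes, weights))
for name, weight in class_weights.items():
    print(f"{name}: {weight:.3f}")
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter; resampling the minority classes is another possible approach.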


### One-hot encoding and frequency encoding

The Gender column is categorical, so we apply one-hot encoding to it.

In [6]:
df = pd.get_dummies(df, columns=["Gender"])
df.head()

Unnamed: 0,Rhythm,Beat,PatientAge,VentricularRate,AtrialRate,QRSDuration,QTInterval,QTCorrected,RAxis,TAxis,QRSCount,QOnset,QOffset,TOffset,Gender_FEMALE,Gender_MALE
0,AFIB,RBBB TWC,85,117,234,114,356,496,81,-27,19,208,265,386,0,1
1,SB,TWC,59,52,52,92,432,401,76,42,8,215,261,431,1,0
3,SB,NONE,66,53,53,96,456,427,34,3,9,219,267,447,0,1
5,SB,NONE,46,57,57,70,404,393,38,24,9,225,260,427,1,0
6,AFIB,TWC,80,98,86,74,360,459,69,83,17,215,252,395,1,0
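
To make the effect of `pd.get_dummies` easier to see, here is a small sketch on a toy frame (the rows are illustrative and not taken from `Diagnostics.xlsx`). It also shows the optional `drop_first=True` argument, which is not used in this notebook: for a two-category column such as Gender, the second indicator is redundant because it is always the complement of the first.

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "PatientAge": [85, 59, 66],
    "Gender": ["MALE", "FEMALE", "MALE"],
})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["Gender"], dtype=int)

# drop_first=True drops the first category's indicator (FEMALE here),
# leaving a single column that carries the same information.
reduced = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Keeping both indicator columns, as the notebook does, is harmless for tree-based models; for linear models the redundant column can introduce collinearity.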


The Beat column is also categorical, but it has 649 distinct categories. One-hot encoding is not viable here because it would add one column per category and sharply increase the dimensionality of the feature space.

In [7]:
df.Beat.value_counts()

NONE            5001
TWC              731
LVHV             554
STTC             394
LVHV TWC         169
                ... 
QTIE RBBB UW       1
ARS CR STTU        1
QTIE VPB           1
TWC VB             1
VFW                1
Name: Beat, Length: 649, dtype: int64

To solve this problem we use frequency encoding, which replaces each category with its relative frequency in the dataset.

In [8]:
frequencies = df.Beat.value_counts(normalize=True)
df['Beat_Freq'] = df.Beat.map(frequencies)
df.drop(['Beat'], axis='columns', inplace=True)
df.head(5)

Unnamed: 0,Rhythm,PatientAge,VentricularRate,AtrialRate,QRSDuration,QTInterval,QTCorrected,RAxis,TAxis,QRSCount,QOnset,QOffset,TOffset,Gender_FEMALE,Gender_MALE,Beat_Freq
0,AFIB,85,117,234,114,356,496,81,-27,19,208,265,386,0,1,0.002591
1,SB,59,52,52,92,432,401,76,42,8,215,261,431,1,0,0.075751
3,SB,66,53,53,96,456,427,34,3,9,219,267,447,0,1,0.518238
5,SB,46,57,57,70,404,393,38,24,9,225,260,427,1,0,0.518238
6,AFIB,80,98,86,74,360,459,69,83,17,215,252,395,1,0,0.075751
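
The mechanics of frequency encoding can be checked on a toy column (the values below are illustrative). This sketch also highlights two caveats: categories with the same frequency become indistinguishable after encoding, and a category that was not seen when the frequencies were computed maps to `NaN`. Filling such values with `0.0` is an assumption made here for illustration, not something the notebook does.

```python
import pandas as pd

# Toy column; the values are illustrative only.
toy = pd.DataFrame({"Beat": ["NONE", "NONE", "TWC", "NONE", "LVHV"]})

# Relative frequency of each category: NONE -> 0.6, TWC -> 0.2, LVHV -> 0.2.
frequencies = toy["Beat"].value_counts(normalize=True)
toy["Beat_Freq"] = toy["Beat"].map(frequencies)
print(toy)

# TWC and LVHV share the same frequency, so they collapse to the same value.
# A category unseen when the frequencies were computed ("STTC") maps to NaN;
# filling with 0.0 is a hypothetical choice for this sketch.
new_rows = pd.Series(["TWC", "STTC"])
encoded_new = new_rows.map(frequencies).fillna(0.0)
print(encoded_new.tolist())
```

If the data is later split into training and test sets, computing the frequencies on the training portion only would avoid leaking information from the test set.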


In [9]:
df.to_parquet("../Data/clean.parquet")

In [10]:
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(df["Rhythm"])

In [11]:
label_encoder.classes_

array(['AFIB', 'SB', 'SR', 'ST', 'SVT'], dtype=object)
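
`LabelEncoder` assigns integer codes following the sorted order of `classes_`, so here AFIB is 0 and SVT is 4. A short sketch of the round trip, using a hand-written list of rhythm labels rather than the real column, shows how `inverse_transform` recovers the class names from predicted codes:

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels for illustration; the real input is df["Rhythm"].
rhythms = ["AFIB", "SB", "SR", "ST", "SVT", "SB"]

encoder = LabelEncoder()
codes = encoder.fit_transform(rhythms)
print(codes)  # [0 1 2 3 4 1]

# Map integer codes (for example, model predictions) back to rhythm names.
decoded = encoder.inverse_transform([1, 4])
print(decoded)  # ['SB' 'SVT']
```

Keeping the fitted encoder (or its `classes_` array) alongside the saved data makes it possible to interpret predictions later.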

In [12]:
X = df.drop(columns=['Rhythm']).values

In [13]:
np.savetxt("../Data/Y.csv",Y,delimiter=',')
np.savetxt("../Data/X.csv",X,delimiter=',')
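
When these files are read back, note that `np.savetxt` writes every value in floating-point notation by default, so the integer labels in `Y.csv` come back as floats. The sketch below reproduces the round trip with small stand-in arrays and a temporary directory (both are assumptions for illustration; the real files live under `../Data/`), then casts the labels back to integers.

```python
import os
import tempfile

import numpy as np

# Small stand-in arrays; the real X and Y come from the cells above.
Y_demo = np.array([0, 1, 2, 1])
X_demo = np.array([[85.0, 117.0], [59.0, 52.0], [66.0, 53.0], [46.0, 57.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    y_path = os.path.join(tmp_dir, "Y.csv")
    x_path = os.path.join(tmp_dir, "X.csv")

    # Same calls as in the notebook, pointed at temporary files.
    np.savetxt(y_path, Y_demo, delimiter=",")
    np.savetxt(x_path, X_demo, delimiter=",")

    Y_loaded = np.loadtxt(y_path, delimiter=",")
    X_loaded = np.loadtxt(x_path, delimiter=",")

# Labels are loaded as float64; cast them back to integer class codes.
Y_restored = Y_loaded.astype(int)

# Sanity check: one label per feature row.
assert X_loaded.shape[0] == Y_restored.shape[0]
print(Y_loaded.dtype, Y_restored.dtype)
```

Passing `fmt="%d"` to `np.savetxt` when writing the labels would be an alternative way to keep them as integers on disk.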