In [1]:
# Import the required libraries
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from utils import get_train_data, get_train_split, get_test_split, preprocess

In [2]:
# Get the feature matrix (X) and target vector (y) for the train and test datasets
X_train, y_train = get_train_split()
X_test, y_test = get_test_split()

### Data Augmentation
A data augmentation technique should be chosen with the nature of the dataset and its domain in mind. Beyond general-purpose techniques, several domain-specific augmentation techniques exist for healthcare datasets. These take into account the specific characteristics and requirements of medical data.

To balance the class distribution, either over-sampling or under-sampling can be used. Under-sampling can be achieved using **RandomUnderSampler**. Since under-sampling can lead to loss of information, over-sampling was chosen instead.
This project uses **SMOTE** to address the class imbalance. SMOTE generates synthetic samples for the minority classes, in this case the underrepresented abnormal heartbeats, by interpolating between existing minority samples and their nearest neighbours. Balancing the class distribution in this way helps prevent the model from being overly biased towards the majority class and typically improves performance on the minority classes. With more diverse samples, the model can learn the patterns and variations associated with abnormal heartbeats, improving its ability to correctly classify unseen instances during testing.

### Feature Engineering
**Principal Component Analysis** (PCA) is commonly used to reduce the dimensionality of high-dimensional data. In this heartbeat classification task, PCA has been applied to reduce the dimensionality of the feature space while retaining most of the information present in the data. This reduces computational complexity and improves model efficiency.

PCA has been used to retain the principal components that together account for 95% of the variance in the data. As a result, the training set has been reduced to a shape of (362355, 29), that is, 29 components per sample. The reduced feature space can help improve model performance in heartbeat classification: by keeping only the most informative principal components, the model can focus on the essential features, which reduces the risk of overfitting and can improve generalization.
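
A minimal sketch of variance-based component selection with scikit-learn follows. The synthetic arrays `X_train_toy` and `X_test_toy` are assumptions for illustration, and the actual `preprocess` helper may set PCA up differently. Passing a float between 0 and 1 as `n_components` asks `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (illustrative only): 20 correlated features driven by 3 latent factors.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 20))
X_train_toy = rng.normal(size=(500, 3)) @ mixing + 0.05 * rng.normal(size=(500, 20))
X_test_toy = rng.normal(size=(100, 3)) @ mixing + 0.05 * rng.normal(size=(100, 20))

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")

# Fit on the training data only, then reuse the same projection for the test data.
X_train_reduced = pca.fit_transform(X_train_toy)
X_test_reduced = pca.transform(X_test_toy)

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", pca.explained_variance_ratio_.sum())
print("Reduced shapes:", X_train_reduced.shape, X_test_reduced.shape)
```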

It is essential to *standardize* the data before performing PCA so that every feature has *zero mean and unit variance*. Since the dataset at hand is already standardized (checked using StandardScaler), no additional scaling was needed.
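
One way such a check could be done is sketched below. The helper `is_standardized` and its tolerance are hypothetical and not taken from the project's `utils` module. It fits a `StandardScaler` and inspects the per-feature means and standard deviations that the scaler learned.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def is_standardized(X, atol=1e-6):
    """Return True if every feature has (approximately) zero mean and unit variance.

    Hypothetical helper for illustration; `atol` is an assumed tolerance.
    """
    scaler = StandardScaler().fit(X)
    # mean_ holds the per-feature means, scale_ the per-feature standard deviations.
    zero_mean = np.allclose(scaler.mean_, 0.0, atol=atol)
    unit_variance = np.allclose(scaler.scale_, 1.0, atol=atol)
    return bool(zero_mean and unit_variance)


# Toy data (illustrative only): raw features with non-zero mean and non-unit variance.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_scaled = StandardScaler().fit_transform(X_raw)

print(is_standardized(X_raw))
print(is_standardized(X_scaled))
```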

In [3]:
# Pass X and y to preprocess() to get the preprocessed versions (after SMOTE and PCA)
X_train_processed, y_train_processed, X_test_ = preprocess(X_train,y_train,X_test)

Counter({0.0: 72471, 4.0: 6431, 2.0: 5788, 1.0: 2223, 3.0: 641})
Counter({0.0: 72471, 1.0: 72471, 2.0: 72471, 3.0: 72471, 4.0: 72471})


The features have zero mean and unit variance.


Explained variance ratio: [0.43600339 0.12020286 0.08053981 0.04585722 0.03819538 0.03286634
 0.02462946 0.02214805 0.01758784 0.01574594 0.01391617 0.0120935
 0.01020785 0.00898517 0.0081478  0.00747258 0.00695072 0.00645686
 0.00591228 0.00544051 0.00487866 0.00437789 0.00424439 0.00374857
 0.00356697 0.00343599 0.00323012 0.00298821 0.00287535]


Principal components: [[-0.09055349 -0.07082363 -0.02486204 ...  0.0046117   0.00445436
   0.00436019]
 [ 0.08265825  0.05845756  0.06012891 ...  0.01997771  0.01910435
   0.01863422]
 [-0.08799919 -0.05715752 -0.12904523 ...  0.03641012  0.03497725
   0.03442125]
 ...
 [-0.10113452  0.00867999  0.13066281 ... -0.03729483 -0.03680375
  -0.03586752]
 [-0.01808309  0.03983699  0.02062407 ...  0.05903707  0.05806673
   0.05685286]
 [-0.12622701  0.03360272  0.1

In [4]:
# Check for missing (NaN) values in the dataset
ecg_train = get_train_data()
ecg_train.isna().sum()

0      0
1      0
2      0
3      0
4      0
      ..
183    0
184    0
185    0
186    0
187    0
Length: 188, dtype: int64

There are no missing values in the dataset. When only a very small number of values are missing, a common practice is to drop the affected rows. Otherwise, the missing values can be imputed. The simplest way of imputing missing values is SimpleImputer from scikit-learn, which fills them with a constant value or with the mean, median, or most frequent value of the available data. Since healthcare data is sensitive, domain-specific methods can be chosen based on the characteristics of the data, the nature of the missingness, and the intended analysis. Common methods include Clinical Expert Imputation, Deep Learning-based Imputation, Pattern Recognition and Time-Series Analysis, and Patient Similarity-based Imputation.
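
For reference, a minimal sketch of median imputation with `SimpleImputer` is shown below. The small array `X_missing` is made up for the example, since the ECG dataset itself has no missing values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with two missing entries (illustrative only).
X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Replace each NaN with the median of the observed values in its column.
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X_missing)

print(X_imputed)
```

Other supported strategies are `"mean"`, `"most_frequent"`, and `"constant"` (together with `fill_value`).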