Principal Component Analysis (PCA) on the [heart disease dataset](https://www.kaggle.com/fedesoriano/heart-failure-prediction):

In [1]:
import pandas as pd
df = pd.read_csv('heart.csv')
df.shape

(918, 12)

In [2]:
df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [3]:
df.describe()

,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


Remove outliers that lie more than 3 standard deviations from the mean.\
Here we define a function that can be called for each feature in turn.

In [4]:
def remove_outlier(df, column):
    upper_bound = df[column].mean() + 3 * df[column].std()
    lower_bound = df[column].mean() - 3 * df[column].std()
    return df[(df[column] <= upper_bound) & (df[column] >= lower_bound)]

In [5]:
df = remove_outlier(df, 'Cholesterol')
df.shape

(915, 12)

In [6]:
df = remove_outlier(df, 'MaxHR')
df.shape

(914, 12)

In [7]:
df = remove_outlier(df, 'FastingBS')
df.shape

(914, 12)

In [8]:
df = remove_outlier(df, 'RestingBP')
df.shape

(906, 12)

In [9]:
df = remove_outlier(df, 'Oldpeak')
df.shape

(899, 12)
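The five calls above filter one column at a time, so each call recomputes the mean and standard deviation on the rows that survived the previous call. As a variation, the sketch below applies the same 3-standard-deviation rule to several columns in a single pass, with the bounds computed once on the full frame. The helper name `remove_outliers_multi` and the toy data are hypothetical, and this variant will not necessarily reproduce the sequential row counts shown above.

```python
import numpy as np
import pandas as pd

def remove_outliers_multi(df, columns, n_std=3):
    """Keep rows within n_std sample standard deviations of the mean in every listed column."""
    subset = df[columns]
    # z-score of each value, using each column's own mean and sample std
    z_scores = (subset - subset.mean()) / subset.std()
    keep = (z_scores.abs() <= n_std).all(axis=1)
    return df[keep]

# Toy frame: 101 rows with one extreme value in each column (hypothetical data)
base = (np.arange(100) % 10).tolist()
toy = pd.DataFrame({
    "a": base + [500],    # outlier in the last row
    "b": [-500] + base,   # outlier in the first row
})

cleaned = remove_outliers_multi(toy, ["a", "b"])
print(toy.shape, "->", cleaned.shape)
```

Computing the bounds once makes the result independent of the order in which columns are listed, which is not true of repeated single-column filtering.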

In [10]:
df.ChestPainType.unique()

array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object)

In [11]:
df.RestingECG.unique()

array(['Normal', 'ST', 'LVH'], dtype=object)

In [12]:
df.ExerciseAngina.unique()

array(['N', 'Y'], dtype=object)

In [13]:
df.ST_Slope.unique()

array(['Up', 'Flat', 'Down'], dtype=object)

Label Encoding

In [14]:
pd.set_option('future.no_silent_downcasting', True) # opt in to the future behavior so replace() does not emit a downcasting FutureWarning

df['ExerciseAngina'] = df['ExerciseAngina'].replace({'N': 0, 'Y': 1})

df['ST_Slope'] = df['ST_Slope'].replace({'Down': 1,'Flat': 2,'Up': 3})

df['RestingECG'] = df['RestingECG'].replace({'Normal': 1,'ST': 2,'LVH': 3})

df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,1,172,0,0.0,3,0
1,49,F,NAP,160,180,0,1,156,0,1.0,2,1
2,37,M,ATA,130,283,0,2,98,0,0.0,3,0
3,48,F,ASY,138,214,0,1,108,1,1.5,2,1
4,54,M,NAP,150,195,0,1,122,0,0.0,3,0
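As an alternative sketch of the same encoding, `Series.map` with a dictionary followed by an explicit `astype("int64")` produces integer columns without relying on any downcasting option. The three-row frame below is made up for illustration; the mappings mirror the ones used above.

```python
import pandas as pd

# Hypothetical three-row sample with the same category labels as the dataset
toy = pd.DataFrame({
    "ExerciseAngina": ["N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Down"],
    "RestingECG": ["Normal", "ST", "LVH"],
})

mappings = {
    "ExerciseAngina": {"N": 0, "Y": 1},
    "ST_Slope": {"Down": 1, "Flat": 2, "Up": 3},
    "RestingECG": {"Normal": 1, "ST": 2, "LVH": 3},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    # map() yields NaN for unmapped values, so astype raises if a category was missed
    encoded[column] = encoded[column].map(mapping).astype("int64")

print(encoded.dtypes)
```

With integer columns, a later `pd.get_dummies` call leaves them untouched, since by default it only expands object, string, and category columns.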


One Hot Encoding

In [15]:
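# get_dummies expands every object, string, or category column. The columns
# label-encoded above are still object dtype (replace() did not downcast them),
# so they are one-hot encoded too: see RestingECG_*, ExerciseAngina_1, ST_Slope_*.
df = pd.get_dummies(df, drop_first=True)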
df.head()

,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_2,RestingECG_3,ExerciseAngina_1,ST_Slope_2,ST_Slope_3
0,40,140,289,0,172,0.0,0,True,True,False,False,False,False,False,False,True
1,49,160,180,0,156,1.0,1,False,False,True,False,False,False,False,True,False
2,37,130,283,0,98,0.0,0,True,True,False,False,True,False,False,False,True
3,48,138,214,0,108,1.5,1,False,False,False,False,False,False,True,True,False
4,54,150,195,0,122,0.0,0,True,False,True,False,False,False,False,False,True
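To control which columns are expanded, `pd.get_dummies` accepts a `columns` list and a `dtype` argument. The sketch below, on a made-up three-row frame, encodes only `Sex` and `ChestPainType` and emits 0/1 integers instead of booleans.

```python
import pandas as pd

# Hypothetical three-row sample
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
})

# drop_first=True drops the first category (in sorted order) of each encoded
# column, here Sex_F and ChestPainType_ASY, which then acts as the baseline
dummies = pd.get_dummies(
    toy, columns=["Sex", "ChestPainType"], drop_first=True, dtype=int
)
print(dummies)
```

Listing the columns explicitly keeps ordinal features, such as a label-encoded `ST_Slope`, out of the one-hot step even if their dtype is still object.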


Separate Data and Target

In [16]:
X = df.drop("HeartDisease",axis='columns')
y = df.HeartDisease

X.head()

,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_2,RestingECG_3,ExerciseAngina_1,ST_Slope_2,ST_Slope_3
0,40,140,289,0,172,0.0,True,True,False,False,False,False,False,False,True
1,49,160,180,0,156,1.0,False,False,True,False,False,False,False,True,False
2,37,130,283,0,98,0.0,True,True,False,False,True,False,False,False,True
3,48,138,214,0,108,1.5,False,False,False,False,False,False,True,True,False
4,54,150,195,0,122,0.0,True,False,True,False,False,False,False,False,True


Scale the data

In [17]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled

array([[-1.42815446,  0.46590022,  0.84963584, ..., -0.8229452 ,
        -0.99888827,  1.13469459],
       [-0.47585532,  1.63471366, -0.16812204, ..., -0.8229452 ,
         1.00111297, -0.88129441],
       [-1.7455875 , -0.1185065 ,  0.79361247, ..., -0.8229452 ,
        -0.99888827,  1.13469459],
       ...,
       [ 0.3706328 , -0.1185065 , -0.62564622, ...,  1.21514774,
         1.00111297, -0.88129441],
       [ 0.3706328 , -0.1185065 ,  0.35476274, ..., -0.8229452 ,
         1.00111297, -0.88129441],
       [-1.63977649,  0.34901888, -0.21480818, ..., -0.8229452 ,
        -0.99888827,  1.13469459]], shape=(899, 15))
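The cell above fits the scaler on all rows before the train/test split, so the test rows influence the scaling statistics. A common alternative, sketched here on synthetic data (the array `X_demo` and the other names are assumptions for illustration), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                 # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean and std from training rows
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

The scaled test columns will generally not have exactly zero mean and unit variance, which is expected: they are transformed with statistics they did not contribute to.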

Split train and test data

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=30)

In [19]:
X_train.shape

(719, 15)

In [20]:
X_test.shape

(180, 15)

In [21]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)
model_rf.score(X_test, y_test)

0.8388888888888889
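`RandomForestClassifier()` is created without a `random_state`, so the score above can vary from run to run, and a single 80/20 split is a noisy estimate. One way to get a steadier number is k-fold cross-validation with a fixed seed. The sketch below uses `make_classification` data as a stand-in for the heart dataset, so its scores say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 15 features (not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=15, n_informative=5, random_state=0
)

model_demo = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=5)  # accuracy per fold

print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds makes it easier to judge whether a later change in accuracy is larger than the run-to-run noise.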

PCA retaining 95% of the variance

In [22]:
from sklearn.decomposition import PCA

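pca = PCA(0.95)  # keep enough components to explain 95% of the variance
X_pca = pca.fit_transform(X)  # note: fit on the unscaled X, not X_scaled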
X_pca

array([[ 93.12953998,  29.67837326],
       [-16.33857702,  14.79924872],
       [ 82.66937536, -38.91147272],
       ...,
       [-68.22613844, -17.69863447],
       [ 40.02665151,  33.46662536],
       [-20.61257906,  37.61626925]], shape=(899, 2))
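Passing a float to `PCA` keeps the smallest number of components whose cumulative explained variance reaches that fraction; the fitted `n_components_` and `explained_variance_ratio_` attributes show what was kept. Because PCA works on variance, a feature measured on a large scale can dominate when the input is not standardized. The sketch below illustrates this on synthetic data (not the heart dataset) in which one of five independent features is multiplied by 100.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))  # five independent synthetic features
X_demo[:, 0] *= 100                 # put the first feature on a much larger scale

pca_raw = PCA(0.95).fit(X_demo)
pca_scaled = PCA(0.95).fit(StandardScaler().fit_transform(X_demo))

print("raw:", pca_raw.n_components_, pca_raw.explained_variance_ratio_.round(3))
print("scaled:", pca_scaled.n_components_, pca_scaled.explained_variance_ratio_.round(3))
```

On the unscaled array a single component already passes the 95% threshold because it is essentially the rescaled feature, while the standardized array needs several components.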

Train on the PCA-reduced dataset and measure the impact on accuracy

In [23]:
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=30)

In [24]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X_train_pca, y_train)
model_rf.score(X_test_pca, y_test)

0.6444444444444445

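Accuracy drops from about 0.84 with all 15 features to about 0.64 with the 2 principal components. Note that PCA was fit on the unscaled `X`, so the two retained components mostly reflect the features with the largest raw variance (such as `Cholesterol`) rather than the dataset as a whole.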
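A possible follow-up experiment is to chain scaling, PCA, and the classifier in a scikit-learn `Pipeline`, so that PCA sees standardized features and both transformers are fit on the training split only. This is a sketch on synthetic `make_classification` data with one feature rescaled; the variable names are hypothetical and the resulting score does not predict what the heart dataset would give.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=15, n_informative=6, random_state=0
)
X_demo[:, 0] *= 100  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30
)

pipe = Pipeline([
    ("scale", StandardScaler()),                     # standardize before PCA
    ("pca", PCA(0.95)),                              # keep 95% of the variance
    ("rf", RandomForestClassifier(random_state=0)),  # fixed seed for repeatability
])
pipe.fit(X_tr, y_tr)

print("components kept:", pipe.named_steps["pca"].n_components_)
print("test accuracy:", pipe.score(X_te, y_te))
```

Swapping `X_demo` and `y_demo` for the notebook's unscaled `X` and `y` would give a like-for-like comparison against the 15-feature baseline.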