## PCA on Heart Disease Dataset

### Tasks:

1. Load heart disease dataset in pandas dataframe
2. Convert text columns to numbers using label encoding and one hot encoding
3. Apply scaling
4. Build a classification model using various methods (logistic regression, random forest) and check which model gives you the best accuracy
5. Now use PCA to reduce dimensions, retrain your model and see what impact it has on your model in terms of accuracy.

1. Load heart disease dataset in pandas dataframe

In [1]:
import pandas as pd

df = pd.read_csv("heart_disease_dataset.csv")
df.head()

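| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |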


In [2]:
df.shape

(918, 12)

2. Convert text columns to numbers using label encoding and one hot encoding

In [5]:
print("Sex = ",df.Sex.unique())
print("ChestPainType = ",df.ChestPainType.unique())
print("RestingECG = ",df.RestingECG.unique())
print("ExerciseAngina = ",df.ExerciseAngina.unique())
print("ST_Slope = ",df.ST_Slope.unique())

Sex =  ['M' 'F']
ChestPainType =  ['ATA' 'NAP' 'ASY' 'TA']
RestingECG =  ['Normal' 'ST' 'LVH']
ExerciseAngina =  ['N' 'Y']
ST_Slope =  ['Up' 'Flat' 'Down']


In [15]:
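df_new = df.copy()

# Label-encode each text column with an explicit mapping.
# The result is assigned back to the column: calling replace(..., inplace=True)
# on df_new.<column> may not update df_new under pandas copy-on-write.
df_new["Sex"] = df_new["Sex"].map({'M': 1, 'F': 0})

df_new["ChestPainType"] = df_new["ChestPainType"].map(
    {'ATA': 1, 'NAP': 2, 'ASY': 3, 'TA': 4}
)

df_new["RestingECG"] = df_new["RestingECG"].map(
    {'Normal': 1, 'ST': 2, 'LVH': 3}
)

df_new["ExerciseAngina"] = df_new["ExerciseAngina"].map({'Y': 1, 'N': 0})

df_new["ST_Slope"] = df_new["ST_Slope"].map(
    {'Up': 1, 'Flat': 2, 'Down': 3}
)

df_new.head()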

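| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 1 | 1 | 140 | 289 | 0 | 1 | 172 | 0 | 0.0 | 1 | 0 |
| 1 | 49 | 0 | 2 | 160 | 180 | 0 | 1 | 156 | 0 | 1.0 | 2 | 1 |
| 2 | 37 | 1 | 1 | 130 | 283 | 0 | 2 | 98 | 0 | 0.0 | 1 | 0 |
| 3 | 48 | 0 | 3 | 138 | 214 | 0 | 1 | 108 | 1 | 1.5 | 2 | 1 |
| 4 | 54 | 1 | 2 | 150 | 195 | 0 | 1 | 122 | 0 | 0.0 | 1 | 0 |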
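
Task 2 also mentions one hot encoding, which the cell above does not use. Integer codes imply an order (for example ASY > NAP > ATA) that nominal columns such as ChestPainType do not really have. One possible alternative is `pd.get_dummies`. The sketch below assumes a small hand-built frame, `sample`, made from the five rows shown by `df.head()`, in place of the full CSV; the names `sample`, `nominal_cols` and `encoded` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for df: the five rows shown by df.head() above,
# restricted to a few columns to keep the example short.
sample = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54],
    "Sex": ["M", "F", "M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY", "NAP"],
    "RestingECG": ["Normal", "Normal", "ST", "Normal", "Normal"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Up"],
})

# Binary column: a 0/1 label encoding loses no information.
sample["Sex"] = sample["Sex"].map({"M": 1, "F": 0})

# Nominal columns: one indicator column per category.
# drop_first=True drops one category per column to avoid redundant columns.
nominal_cols = ["ChestPainType", "RestingECG", "ST_Slope"]
encoded = pd.get_dummies(sample, columns=nominal_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Whether one hot encoding actually improves accuracy on this dataset is not something the notebook measures; it would need to be checked the same way as the models below.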


3. Apply scaling
4. Build a classification model using various methods (logistic regression, random forest) and check which model gives you the best accuracy

In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = df_new.drop("HeartDisease",axis='columns')
y = df_new.HeartDisease

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("",X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=30)
print("",X_train.shape)
print("",X_test.shape)

 [[-1.4331398   0.51595242 -1.70557305 ... -0.8235563  -0.83243239
  -1.05211381]
 [-0.47848359 -1.93816322 -0.53099236 ... -0.8235563   0.10566353
   0.59607813]
 [-1.75135854  0.51595242 -1.70557305 ... -0.8235563  -0.83243239
  -1.05211381]
 ...
 [ 0.37009972  0.51595242  0.64358833 ...  1.21424608  0.29328271
   0.59607813]
 [ 0.37009972 -1.93816322 -1.70557305 ... -0.8235563  -0.83243239
   0.59607813]
 [-1.64528563  0.51595242 -0.53099236 ... -0.8235563  -0.83243239
  -1.05211381]]
 (734, 11)
 (184, 11)
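
The cell above fits the scaler on all 918 rows before splitting, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart disease features; the `*_demo` names and the seed are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y): 918 rows and 11 features, like the dataset above.
X_demo, y_demo = make_classification(n_samples=918, n_features=11, random_state=0)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=30
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```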


In [18]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)
model_rf.score(X_test, y_test)

0.8532608695652174

In [19]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
model_lr.score(X_test, y_test)

0.842391304347826
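
With 184 test rows, the two scores above differ by only two samples (157 vs 155 correct), so a single split is a weak basis for ranking the models. Cross-validation gives a steadier comparison. The sketch below runs on synthetic data, not the heart disease CSV; the `candidates` dictionary, fold count and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and target.
X_demo, y_demo = make_classification(n_samples=918, n_features=11, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Scaling sits inside the pipeline so it is refit on each training fold.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)  # accuracy per fold
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```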

## Now, let's use PCA to reduce dimensions and retrain the models

In [20]:
X

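| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 1 | 1 | 140 | 289 | 0 | 1 | 172 | 0 | 0.0 | 1 |
| 1 | 49 | 0 | 2 | 160 | 180 | 0 | 1 | 156 | 0 | 1.0 | 2 |
| 2 | 37 | 1 | 1 | 130 | 283 | 0 | 2 | 98 | 0 | 0.0 | 1 |
| 3 | 48 | 0 | 3 | 138 | 214 | 0 | 1 | 108 | 1 | 1.5 | 2 |
| 4 | 54 | 1 | 2 | 150 | 195 | 0 | 1 | 122 | 0 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | 1 | 4 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 2 |
| 914 | 68 | 1 | 3 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 2 |
| 915 | 57 | 1 | 3 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 2 |
| 916 | 57 | 0 | 1 | 130 | 236 | 0 | 3 | 174 | 0 | 0.0 | 2 |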


In [28]:
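from sklearn.decomposition import PCA

# PCA(0.95) keeps the fewest components that explain at least 95% of the variance.
# Note: it is fit on the unscaled X here, not X_scaled, so columns with large
# raw ranges (such as Cholesterol) dominate the components.
pca = PCA(0.95)
X_pca = pca.fit_transform(X)
X_pca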

array([[ 92.31183305, -29.45192864],
       [-17.14349356, -13.74190654],
       [ 81.90835785,  38.21332096],
       ...,
       [-69.00464262,  17.33467901],
       [ 39.20885781, -33.60535182],
       [-21.43744118, -37.21615987]])
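
PCA ranks directions by raw variance, so the number of components that `PCA(0.95)` keeps depends on the scale of the columns. The sketch below illustrates this on made-up data where one column has a much larger spread than the others; `X_demo`, `pca_raw` and `pca_scaled` are illustrative names, and the data is not related to the heart disease dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: four independent columns; the first has a far larger spread.
X_demo = rng.normal(size=(500, 4)) * np.array([100.0, 1.0, 1.0, 1.0])

# Without scaling, the wide column accounts for nearly all of the variance.
pca_raw = PCA(0.95).fit(X_demo)

# After standardizing, each column contributes roughly equally.
pca_scaled = PCA(0.95).fit(StandardScaler().fit_transform(X_demo))

print(pca_raw.n_components_, pca_raw.explained_variance_ratio_)
print(pca_scaled.n_components_, pca_scaled.explained_variance_ratio_)
```

Inspecting `explained_variance_ratio_` in this way is a quick check on whether a small component count reflects real structure in the data or just one wide column.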

In [29]:
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=30)

In [30]:
model_rf_pca = RandomForestClassifier()
model_rf_pca.fit(X_train_pca, y_train)
model_rf_pca.score(X_test_pca, y_test)

0.6956521739130435

In [32]:
model_lr_pca = LogisticRegression()
model_lr_pca.fit(X_train_pca, y_train)
model_lr_pca.score(X_test_pca, y_test)

0.7010869565217391
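
To see how accuracy changes with the number of components, scaling and PCA can be chained with the classifier in a `Pipeline`, so that both are fit on the training folds only. The sketch below is an assumption-laden illustration on synthetic data: the component counts, step names and `results` dictionary are arbitrary choices, and the scores it prints say nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features, 5 of them informative.
X_demo, y_demo = make_classification(
    n_samples=918, n_features=11, n_informative=5, random_state=0
)

results = {}
for k in (2, 5, 8, 11):
    pipe = Pipeline([
        ("scale", StandardScaler()),      # standardize before PCA
        ("pca", PCA(n_components=k)),     # keep the first k components
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Mean 5-fold accuracy for this number of components.
    results[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()
    print(f"n_components={k}: mean accuracy={results[k]:.3f}")
```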

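### Result: After PCA (keeping only the principal components that together explain 95% of the variance), both models are less accurate than before: random forest drops from about 0.85 to 0.70, and logistic regression from about 0.84 to 0.70.<br/>Note that PCA was fit on the unscaled `X`, so the two retained components mostly reflect the columns with the largest raw variance, and the drop cannot be attributed to dimensionality reduction alone. In general, though, reducing dimensions trades some accuracy for cheaper computation, and that trade-off has to be weighed when building models in real life.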