## PCA (Principal Component Analysis)


1.   Dataset: Heart Disease UCI dataset from Kaggle
2.   Link: https://www.kaggle.com/ronitf/heart-disease-uci



In [0]:
from google.colab import drive
drive.mount('/gdrive')


Mounted at /gdrive


In [0]:
# Some standard imports

import os

# scipy imports
# There are several universal functions for numpy arrays that are available through the scipy package
import scipy as sc
from scipy import stats, integrate
from scipy.stats.mstats import mode

# numpy imports
# pandas depends on numpy
import numpy as np
np.set_printoptions(precision=4, threshold=500, suppress=True)
# Seed NumPy's global RNG once for reproducibility.
np.random.seed(sum(map(ord, "distributions")))

# pandas imports
# The convention is to import pandas package with a pd prefix. 
# Also, since we most commonly use Series and DataFrame classes from this package, 
# we import them into the current namespace, so we do not have to refer to them with the pd prefix.
import pandas as pd
from pandas import Series, DataFrame
pd.set_option('display.max_columns', None) # enables showing all columns
PREVIOUS_MAX_ROWS = pd.options.display.max_rows  # remember the default before overriding it
pd.options.display.max_rows = 25
pd.options.display.notebook_repr_html = True

# matplotlib imports
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
plt.subplots(figsize=(10,6))
%matplotlib inline

# seaborn imports
import seaborn as sns
sns.set(color_codes=True)

# bokeh imports
from bokeh.io import output_file, output_notebook, show
from bokeh.plotting import figure

# ignore warnings
import warnings
warnings.filterwarnings('ignore')
#warnings.filterwarnings(action='once') #enable if needed to see the warning the first time.

# logging setup
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Dataset: Heart Disease UCI dataset from Kaggle

#### Data Import

In [0]:
df = pd.read_csv("/gdrive/My Drive/Data-Quality/heart.csv")
df.head(10)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB


In [0]:
df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [0]:
# Drop the target column so that only the feature columns remain.
df.drop(columns='target',inplace=True)

In [0]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


The data is clean: `df.info()` reports 303 non-null values in every column, so there are no missing values.
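
The statement above is read off the `df.info()` output. A more direct check counts missing values per column. The sketch below uses a small hypothetical frame (`toy_df`, with one value deliberately set to NaN) rather than the real `heart.csv`, so that it runs on its own; in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the heart-disease features.
# One chol value is deliberately missing to show what the check reports.
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, np.nan, 236],
    "thalach": [150, 187, 172, 178],
})

# Count missing values per column; all zeros would mean the frame is complete.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy_df.isna().any().any())
print("any missing values:", has_missing)
```

On the real data, `df.isna().sum()` should return zero for every column if the reading of `df.info()` is right.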

## Data standardization

In [0]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(df)

new_scaled = scaler.transform(df)  # equivalent to scaler.fit_transform(df) in a single call

In [0]:
new_scaled

array([[ 0.9522,  0.681 ,  1.9731, ..., -2.2746, -0.7144, -2.1489],
       [-1.9153,  0.681 ,  1.0026, ..., -2.2746, -0.7144, -0.5129],
       [-1.4742, -1.4684,  0.032 , ...,  0.9764, -0.7144, -0.5129],
       ...,
       [ 1.5036,  0.681 , -0.9385, ..., -0.6491,  1.2446,  1.123 ],
       [ 0.2905,  0.681 , -0.9385, ..., -0.6491,  0.2651,  1.123 ],
       [ 0.2905, -1.4684,  0.032 , ..., -0.6491,  0.2651, -0.5129]])
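
To make the standardization step concrete, here is a minimal sketch of what `StandardScaler` computes: each column is centered on its mean and divided by its population standard deviation (`ddof=0`). The array `X_small` is an illustrative subset (the `age`, `trestbps` and `chol` values of the first four rows shown in `df.head(10)`), not the full dataset, so the scaled numbers differ from the output above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative subset: age, trestbps, chol for the first four rows of the data.
X_small = np.array([
    [63, 145, 233],
    [37, 130, 250],
    [41, 130, 204],
    [56, 120, 236],
], dtype=float)

# Two-step version, as in the notebook cell.
scaler_demo = StandardScaler()
scaler_demo.fit(X_small)
two_step = scaler_demo.transform(X_small)

# One-step version.
one_step = StandardScaler().fit_transform(X_small)

# Manual formula: z = (x - column mean) / column std, with np.std's default ddof=0.
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

print(np.allclose(two_step, one_step))
print(np.allclose(two_step, manual))
```

After scaling, every column has mean 0 and population variance 1, which keeps features measured on large scales (such as `chol`) from dominating the PCA.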

In [0]:
# Convert the scaled array back to a DataFrame with the original column names:
df_new_scaled = pd.DataFrame(new_scaled, columns = df.columns.to_list())
df_new_scaled.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,0.952197,0.681005,1.973123,0.763956,-0.256334,2.394438,-1.005832,0.015443,-0.696631,1.087338,-2.274579,-0.714429,-2.148873
1,-1.915313,0.681005,1.002577,-0.092738,0.072199,-0.417635,0.898962,1.633471,-0.696631,2.122573,-2.274579,-0.714429,-0.512922
2,-1.474158,-1.468418,0.032031,-0.092738,-0.816773,-0.417635,-1.005832,0.977514,-0.696631,0.310912,0.976352,-0.714429,-0.512922
3,0.180175,0.681005,0.032031,-0.663867,-0.198357,-0.417635,0.898962,1.239897,-0.696631,-0.206705,0.976352,-0.714429,-0.512922
4,0.290464,-1.468418,-0.938515,-0.663867,2.08205,-0.417635,0.898962,0.583939,1.435481,-0.379244,0.976352,-0.714429,-0.512922


In [0]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
      dtype='object')

In [0]:
# Apply PCA, keeping all 13 components:
from sklearn.decomposition import PCA

pca = PCA(n_components=13)

pca.fit(df_new_scaled.values) 

arr_new_pca = pca.transform(df_new_scaled.values) 

# Check the shape of the arr_new_pca array
print("shape of arr_new_pca", arr_new_pca.shape)

shape of arr_new_pca (303, 13)
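
As a sanity check on what `transform` returns, the sketch below reproduces the PCA scores by hand on synthetic data. The array `X_demo` and its size are assumptions made only so the example runs by itself. It relies on two documented attributes of a fitted `PCA`: `mean_` (the per-feature mean) and `components_`, whose shape is `(n_components, n_features)`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix (50 samples, 4 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))

pca_demo = PCA(n_components=4)
scores = pca_demo.fit_transform(X_demo)

# Each ROW of components_ is one principal component (a unit-length direction).
print(pca_demo.components_.shape)

# Projection by hand: center the data, then project onto the component directions.
manual_scores = (X_demo - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(scores, manual_scores))

# The component directions are orthonormal.
gram = pca_demo.components_ @ pca_demo.components_.T
print(np.allclose(gram, np.eye(4)))
```

With all components kept, the transform is only a rotation of the centered data, which is why the output above has the same shape as the input, `(303, 13)`.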


In [0]:
elements=["Z1", "Z2", "Z3", "Z4", "Z5", "Z6", "Z7", "Z8", "Z9", "Z10", "Z11", "Z12", "Z13"]
len(elements)

13

In [0]:
# pca.components_ has shape (n_components, n_features): each row is one principal
# component and each column is one original feature, so the rows are labelled Z1..Z13.
pca_df = pd.DataFrame(pca.components_, index=elements, columns=df.columns.to_list())
pca_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
Z1,0.314203,0.090838,-0.274607,0.18392,0.117375,0.07364,-0.127728,-0.416498,0.361267,0.419639,-0.379772,0.273262,0.222024
Z2,0.406149,-0.377792,0.297266,0.438187,0.364514,0.317433,-0.220882,0.077876,-0.263118,-0.052255,0.048374,0.094147,-0.20072
Z3,-0.094077,0.554849,0.356974,0.203849,-0.407825,0.481736,-0.089191,0.158255,-0.126356,0.110343,-0.073818,0.183569,0.125011
Z4,-0.020662,-0.255309,0.2879,0.022601,-0.34341,-0.068605,0.266096,-0.184125,-0.115056,0.326296,-0.494849,-0.328016,-0.389191
Z5,-0.307153,0.050704,0.163179,0.188138,0.320067,-0.233442,-0.393667,0.323284,0.034536,0.250579,-0.246823,-0.435365,0.33195
Z6,-0.128296,0.054969,-0.193411,-0.17946,-0.10473,0.249614,-0.666813,-0.120984,0.230699,-0.17008,-0.064069,-0.182107,-0.508857
Z7,-0.22373,-0.162507,-0.21539,0.332763,0.049329,0.510818,0.396896,0.101473,0.449919,-0.112888,0.055038,-0.337606,0.055165
Z8,-0.262477,-0.175992,0.04795,-0.595334,0.372381,0.432863,0.099841,0.143461,-0.112607,0.192323,-0.261807,0.259678,0.034349
Z9,-0.379,-0.198925,-0.351432,0.350392,-0.153975,-0.177004,-0.038304,0.372044,-0.0585,0.233603,-0.028505,0.485808,-0.284201
Z10,-0.016722,0.535619,0.164351,0.071524,0.49517,-0.153696,0.269966,0.030813,0.198732,0.111384,0.055934,0.035325,-0.530831


In [0]:
pca_exp_var = pd.DataFrame(pca.explained_variance_, index = elements, columns = ["variance"])



In [0]:
pca_percent_exp_var = pd.DataFrame(pca.explained_variance_ratio_ * 100, index = elements, columns = ["% variance"])

In [0]:
pca_exp_var.merge(pca_percent_exp_var, left_index = True, right_index = True)

Unnamed: 0,variance,% variance
Z1,2.772176,21.254053
Z2,1.54178,11.820708
Z3,1.226883,9.406418
Z4,1.185057,9.085735
Z5,1.025351,7.861281
Z6,0.973228,7.461661
Z7,0.865627,6.636692
Z8,0.778515,5.968811
Z9,0.721306,5.530196
Z10,0.623628,4.781309
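
The two columns of this table are linked by a simple ratio: the percentage is the component's variance divided by the total variance. A short worked check follows, under the assumption that the inputs were standardized with `StandardScaler` (population variance 1 per column) while `explained_variance_` uses the sample estimate with `n - 1` in the denominator. Under that assumption the total is `13 * 303 / 302`, slightly above 13. The sample and feature counts come from the `df.info()` output after dropping `target`.

```python
n_samples, n_features = 303, 13  # rows and feature columns in the notebook's data

# Each standardized column has population variance 1, hence sample variance n/(n-1).
# The explained variances of all components add up to the sum over the columns.
total_variance = n_features * n_samples / (n_samples - 1)

z1_variance = 2.772176  # "variance" value for Z1, copied from the table above
z1_percent = 100 * z1_variance / total_variance

print(round(total_variance, 4))
print(round(z1_percent, 2))
```

The result agrees with the 21.25% reported for Z1, which suggests that the `% variance` column is the `variance` column rescaled by this total.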


As shown above, after standardizing the variables and applying PCA, the variance is not concentrated in the first two components: the remaining components also carry noticeable weight. Their explained variances are close to one another, so none of them can be dropped without losing a meaningful share of the information, and a machine learning model built on these data should take them into account.

Moreover, Z1 accounts for about 21.3% of the total variability and Z2 for about 11.8%. Components Z3 through Z13 are together responsible for the remaining 67% of the variation.
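
The arithmetic behind these figures can be sketched with a cumulative sum. The percentages below are copied from the table for Z1 to Z10; the values for Z11 to Z13 are not shown there, so they are left out rather than guessed. The 80% `threshold` is a hypothetical cut-off chosen only for illustration.

```python
import numpy as np

# "% variance" for Z1..Z10, copied from the table above (Z11..Z13 are not shown).
pct_var = np.array([21.254053, 11.820708, 9.406418, 9.085735, 7.861281,
                    7.461661, 6.636692, 5.968811, 5.530196, 4.781309])

cumulative = np.cumsum(pct_var)

first_two = cumulative[1]                # share explained by Z1 + Z2
remaining_after_two = 100.0 - first_two  # share left for Z3..Z13

threshold = 80.0  # hypothetical target for cumulative explained variance
# Index of the first cumulative value that reaches the threshold, plus one.
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(round(first_two, 2), round(remaining_after_two, 2))
print("components needed:", n_needed)
```

With the fitted `pca` object from the notebook, the same curve would come from `np.cumsum(pca.explained_variance_ratio_)`. Because the variance is spread so evenly, reaching even 80% takes nine of the thirteen components.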