# $\color{purple}{\text{Heart Attack EDA and Prediction with Cross Validation  }}$

# $\color{blue}{\text{1. Introduction  }}$

About this dataset

1. Age : Age of the patient  --> *Discrete*

2. Sex : Sex of the patient --> *Categorical*



3. cp : Chest pain type --> *Categorical*

    * Value 1: typical angina
    * Value 2: atypical angina
    * Value 3: non-anginal pain
    * Value 4: asymptomatic
    

4. trtbps : resting blood pressure (in mm Hg) --> *Numerical (real)*

5. chol : cholesterol in mg/dl fetched via BMI sensor --> *Numerical (real)*

6. fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) --> *Categorical*

7. restecg : resting electrocardiographic results --> *Categorical*
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    
8. thalachh : maximum heart rate achieved --> *Numerical*

9. exng : exercise-induced angina (1 = yes; 0 = no) --> *Categorical*

10. caa : number of major vessels (0-3) --> *Categorical*

11. output : target (0 = less chance of heart attack; 1 = more chance of heart attack) --> *Categorical*

In [None]:
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import pandas as pd 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate
import warnings
warnings.filterwarnings("ignore")



data1 = pd.read_csv(r'/kaggle/input/heart-attack-analysis-prediction-dataset/o2Saturation.csv')
data2 = pd.read_csv(r'/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
print("original shape is:",data2.shape)
data2.head()

In [None]:
print(data1.shape)
data1.head()

**Data Cleaning and NaN/Null Checking**

In [None]:
data2.isnull().sum()

# $\color{Blue}{\text{2. Exploratory Data Analysis  }}$

We will check:
* Multicollinearity
* P-values
* Outliers, via a feature deep-dive and the IQR method


In [None]:
data2.describe()

# **$\color{orange}{\text{2.1 Multi-Collinearity Check }}$**

In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.

In [None]:
korelasyon=data2.corr()
figure, axis=plt.subplots(figsize=(10,10))
sns.heatmap(korelasyon, annot=True)

In [None]:
sns.pairplot(data2, hue="output")

There appear to be no highly correlated features in our data set.
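A visual reading of the heatmap can be backed up by listing the strongest pairwise correlations numerically. The following is a minimal sketch on synthetic data; the helper `high_corr_pairs`, the 0.8 threshold and the `demo` frame are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic demo: "b" is almost a linear function of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = high_corr_pairs(demo)
print(pairs)
```

Applied to the heart data, an empty result at a chosen threshold would support the conclusion that no pair of features is strongly collinear.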

# **$\color{orange}{\text{2.2 P Value check for Categorical features and Numerical Values  }}$**

In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. The p-value is used as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.

* A small p (≤ 0.05) means we reject the null hypothesis: the data provide strong evidence against it.
* A large p (> 0.05) means the evidence against the null hypothesis is weak, so we do not reject it.

We will use two different methods to obtain p-values, since we have both numerical and categorical features. For categorical features, we use the chi-square test of independence together with Cramer’s V. For numerical features, we use the Kruskal-Wallis H-test. With that insight, we will then investigate each feature one by one to decide whether it is worth keeping.

For this analysis, we first split the features into categorical and numerical (continuous) groups, with the help of pandas DataFrame features.

* sex, cp, fbs, restecg, exng, thall, caa and slp are categorical.
* age, trtbps, chol, thalachh and oldpeak are numerical (continuous).


Note that age can be treated as either continuous or categorical, which deserves a closer look. We could try both and decide, but first I will discretize it by binning and see.

In [None]:
categorical = ['sex', 'cp', 'fbs', 'restecg', 'exng', 'thall', 'caa', 'slp']
continous = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']
print('Categorical Variables :', ', '.join(categorical))
print('Continous Variables :', ', '.join(continous))

**$\color{orange}{\text{2.2.1 Cramer’s V  }}$**

Cramer’s V  is a number between 0 and 1 that indicates how strongly two categorical variables are associated. If we'd like to know if 2 categorical variables are associated, our first option is the chi-square independence test. A p-value close to zero means that our variables are very unlikely to be completely unassociated in some population. However, this does not mean the variables are strongly associated; a weak association in a large sample size may also result in p = 0.000.
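The point about large samples can be illustrated with a small sketch. The 2x2 table below is a made-up example (an assumption, not taken from the heart data), and it uses the plain, uncorrected form of Cramer's V rather than the bias-corrected version used in this notebook.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical 2x2 contingency table with 20,000 observations
# and only a slight imbalance between the cells.
table = np.array([[5200, 4800],
                  [4800, 5200]])

chi2, p, dof, expected = ss.chi2_contingency(table, correction=False)
n = table.sum()
# Plain (uncorrected) Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2 = {chi2:.1f}, p = {p:.2e}, Cramer's V = {cramers_v:.3f}")
```

Here the p-value is far below 0.05 while V stays close to zero, which is why the p-value and the effect size are reported side by side.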

In [None]:
import scipy.stats as ss   #Statistic 

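def cramers_corrected_stat(x, y):
    """Return (bias-corrected Cramer's V, chi-square p-value) for two categorical series."""
    conf_matrix = pd.crosstab(x, y)
    # SciPy applies Yates' continuity correction only when the table has one
    # degree of freedom (2x2), so it is switched off for two-row tables.
    correct = conf_matrix.shape[0] != 2

    chi2, p = ss.chi2_contingency(conf_matrix, correction=correct)[0:2]

    n = sum(conf_matrix.sum())  # total number of observations
    phi2 = chi2/n
    r, k = conf_matrix.shape
    # Bias correction of phi^2 and of the table dimensions
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
    return round(result, 6), round(p, 6)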


for var in categorical:
    
    x = data2[var]
    y = data2['output']
    cramersV, p = cramers_corrected_stat(x, y)
    print(f'For variable {var}, Cramer\'s V: {cramersV} and p value: {p}')

`fbs` has a large p-value, so we cannot reject the null hypothesis that it is independent of the target. We will therefore drop it when building the machine learning models.

**$\color{orange}{\text{2.2.2 Kruskal-Wallis H-test }}$**


The Kruskal-Wallis H test is also called the "one-way ANOVA on ranks". It is a rank-based nonparametric test that can be used to determine if there are statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable.
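As a minimal sketch of how the test behaves, the toy groups below (made-up numbers, not from the dataset) compare two clearly separated samples with two interleaved ones using `scipy.stats.kruskal`.

```python
import scipy.stats as ss

# Two groups whose values do not overlap at all
low = [1, 2, 3, 4, 5]
high = [6, 7, 8, 9, 10]

# Two groups whose values are fully interleaved
mixed_a = [1, 3, 5, 7, 9]
mixed_b = [2, 4, 6, 8, 10]

h_sep, p_sep = ss.kruskal(low, high)
h_mix, p_mix = ss.kruskal(mixed_a, mixed_b)

print(f"separated groups:   H = {h_sep:.3f}, p = {p_sep:.4f}")
print(f"interleaved groups: H = {h_mix:.3f}, p = {p_mix:.4f}")
```

The separated groups give a p-value below 0.05, while the interleaved groups do not, because the test only looks at how the ranks are distributed between the groups.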

In [None]:
for var in continous:
    gp = data2[[var, 'output']].groupby(['output'])
    gp_array = [group[var].to_numpy() for name, group in gp]
    kstat, p = ss.kruskal(*gp_array)
    kstat, p = round(kstat, 6), round(p, 6)
    print(f'For variable {var}, Kruskal-Wallis H-test: {kstat} and p value: {p}')
    

The numerical variables seem OK.

# **$\color{orange}{\text{2.3 Feature Deep-dive }}$**

# $\color{purple}{\text{Feature 1: Age }}$

Age has a minimum of 29 and a maximum of 77, so we can discretize it into 5 groups:
* 29 - 40
* 40 - 50
* 50 - 60
* 60 - 70
* 70 - 77
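To see how values on a bin boundary are assigned, here is a small sketch with `pd.cut` on a handful of sample ages (the ages are made up for illustration). It uses the same bin edges and labels as this section and relies on the pandas default of right-inclusive intervals, so an age of exactly 40 falls into the first group.

```python
import numpy as np
import pandas as pd

ages = pd.Series([29, 40, 41, 55, 77], name="age")
bins = [-np.inf, 40, 50, 60, 70, np.inf]
labels = ["Young", "Grown", "Mature", "Senior", "Elder"]

# Intervals are (left, right], so 40 -> "Young" and 41 -> "Grown"
binned = pd.cut(ages, bins=bins, labels=labels)
print(binned.tolist())
```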


In [None]:
age = data2.iloc[:, :1].copy()  # copy so that a new column can be added safely

sns.distplot(data2['age'])

In [None]:
bins = [-np.inf, 40, 50, 60, 70, np.inf]
labels = ["Young", "Grown","Mature","Senior","Elder"]
age['binned age'] = pd.cut(age['age'], bins = bins, labels = labels)
print(age.isnull().sum())
age.head()


In [None]:
sns.catplot(x="output", y="age", data=data2, kind="box")

One-hot encoding works better for features that contain more than two categories.
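As a quick illustration, the sketch below one-hot encodes a toy column with four categories. The `toy` frame and its values are assumptions made up for the example; the encoder is scikit-learn's `OneHotEncoder`, as used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with four distinct codes
toy = pd.DataFrame({"cp": [0, 1, 2, 3, 1]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_)  # categories found for each input column
print(encoded)              # one column per category, exactly one 1 per row
```

Each row ends up with a single 1 in the column of its category, so the model does not read the category codes as an ordered quantity.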

In [None]:
from sklearn import preprocessing
ohe = preprocessing.OneHotEncoder()
agge = age.iloc[:, :1]   # raw age column
age = age.iloc[:, 1:]    # binned age column
age_hot = ohe.fit_transform(age).toarray()
dfage = pd.DataFrame(data=age_hot, index=range(age.shape[0]))
dfage.head()

In [None]:
agge.head()

# $\color{purple}{\text{Feature 2: Sex }}$

In [None]:
sns.distplot(data2['sex'])
dfsex = data2.iloc[:,1:2]

In [None]:
dfsex.head()

# $\color{purple}{\text{Feature 3: cp : chest pain type }}$

* Value 1: typical angina
* Value 2: atypical angina
* Value 3: non-anginal pain
* Value 4: asymptomatic

In [None]:
sns.distplot(data2['cp'])
dfc = data2.iloc[:,2:3]

In [None]:
dfc.head()

In [None]:
ohe2= preprocessing.OneHotEncoder()
c_hot = ohe2.fit_transform(dfc).toarray()
dfcp =pd.DataFrame(data=c_hot, index=range(dfc.shape[0]))
dfcp.head()

# $\color{purple}{\text{Feature 4: trtbps : resting blood pressure (in mm Hg) }}$


In [None]:
sns.distplot(data2['trtbps'])


In [None]:
dftrtbps = data2.iloc[:,3:4]
dftrtbps.tail()


In [None]:
sns.catplot(x="output", y="trtbps", data=data2, kind="box")

# $\color{purple}{\text{Feature 5: chol : cholesterol in mg/dl fetched via BMI sensor }}$

In [None]:
sns.distplot(data2['chol'])

In [None]:
dfchol= data2.iloc[:,4:5]
dfchol.tail()

In [None]:
sns.catplot(x="output", y="chol", data=data2, kind="box")

# $\color{purple}{\text{Feature 6: fbs : (fasting blood sugar > 120 mg/dl)  }}$

In [None]:
sns.distplot(data2['fbs'])

In [None]:
dffbs= data2.iloc[:,5:6]
dffbs.tail()

# $\color{purple}{\text{Feature 7: Resting Electrocardiographic Results }}$


In [None]:
sns.distplot(data2['restecg'])

In [None]:
restecg = data2.iloc[:,6:7]
restecg.head()

In [None]:
ohe3 = preprocessing.OneHotEncoder()
c_res = ohe3.fit_transform(restecg).toarray()
dfres =pd.DataFrame(data=c_res, index=range(restecg.shape[0]))
dfres.head()

In [None]:
sns.catplot(x="output", y="restecg", data=data2, kind="box")

# $\color{purple}{\text{Feature 8: thalachh : maximum heart rate achieved  }}$

In [None]:
sns.distplot(data2['thalachh'])

In [None]:
dfftha= data2.iloc[:,7:8]
dfftha.tail()

# $\color{purple}{\text{Feature 9: exng : exercise-induced angina (1 = yes; 0 = no)  }}$

In [None]:
sns.distplot(data2['exng'])

In [None]:
dfex= data2.iloc[:,8:9]
dfex.tail()

# $\color{purple}{\text{Feature 10: oldpeak  }}$

In [None]:
sns.distplot(data2['oldpeak'])


In [None]:
dfold= data2.iloc[:,9:10]
dfold.head()

In [None]:
sns.catplot(x="output", y="oldpeak", data=data2, kind="box")

# $\color{purple}{\text{Feature 11: slp  }}$

In [None]:
sns.distplot(data2['slp'])


In [None]:
dfslp= data2.iloc[:,10:11]
dfslp.head()

In [None]:
sns.catplot(x="output", y="slp", data=data2, kind="box")

# $\color{purple}{\text{Feature 12: caa  }}$

In [None]:
sns.distplot(data2['caa'])

In [None]:
dfcaa= data2.iloc[:,11:12]
dfcaa.head()

# $\color{purple}{\text{Feature 13: thall  }}$

In [None]:
sns.distplot(data2['thall'])

In [None]:
dfthall= data2.iloc[:,12:13]
dfthall.describe()

# $\color{purple}{\text{Target: Heart Attack  }}$

We investigated all features one by one. Especially for the continuous features, the distribution of the data suggests there might be some outliers. We drew box plots (via catplot) of those features to check for outliers. This approach is grounded in statistics: the Inter-Quartile Range (IQR) method is used for that.

In [None]:
sns.distplot(data2['output'])

In [None]:
y= data2.iloc[:,13:14]

# $\color{purple}{\text{Outlier Cleaning and Dataframe Setting }}$

In [None]:
x=pd.concat(   [ agge, dfsex , dfcp,  dftrtbps,dfchol,dffbs,dfres,dfftha,dfex,dfold, dfslp, dfcaa, dfthall ], axis=1)
x.head(10)

In [None]:
dataframe=pd.concat([x,y],axis=1)
numlist=dataframe.loc[:,dataframe.nunique()>5].columns
numlist




For outlier detection, a common practice is to use the IQR method. This method lets us illustrate the data distribution as a box plot. The difference between Q3 (75%) and Q1 (25%) is called the Inter-Quartile Range. It describes the distribution characteristics of the data: its range, boundaries and skewness. As can be seen in the figure below, a box plot lets us draw inferences from ordered data.

> IQR = Q3 - Q1
>
> Lower Bound: Q1 - 1.5 * IQR
>
> Upper Bound: Q3 + 1.5 * IQR

* The median, also called the second quartile, is the centre point of the ordered data.
* Q1 is the first quartile of the data, i.e., 25% of the data lies between the minimum and Q1.
* Q3 is the third quartile of the data, i.e., 75% of the data lies between the minimum and Q3.

![Outlier](https://miro.medium.com/max/1246/1*0MvBAT8zFSOt6sfEojbI-A.png)



The quantile function returns values at the given quantile over the requested axis in a pandas DataFrame.
> DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')




https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html
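A small worked example of the IQR bounds, using a made-up series with one obviously extreme value (the numbers are illustrative only) and the default linear interpolation of `quantile`:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = values.quantile(0.25)   # 3.25
q3 = values.quantile(0.75)   # 7.75
iqr = q3 - q1                # 4.5

lower = q1 - 1.5 * iqr       # -3.5
upper = q3 + 1.5 * iqr       # 14.5

# Anything outside [lower, upper] is flagged as an outlier
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

Only the value 100 falls outside the bounds, so it is the only point that an IQR-based filter would drop here.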

In [None]:
def clean_outliers(df, features):
    for i in features:
        Q1=df[i].quantile(0.25)
        Q3=df[i].quantile(0.75)
        IQR= (Q3-Q1)
        print("Feature {} has min value: {} max value: {}".format(i, Q1-IQR*1.5,Q3+IQR*1.5))
        df=df[((df[i]>(Q1-IQR*1.5))&(df[i]<(Q3+IQR*1.5)))]
    return df

In [None]:
df_clean=clean_outliers(dataframe, numlist)
print("New shape: ",df_clean.shape)
df_clean.head()

**24 outlier rows were removed from the dataset. Removing the outliers increased the accuracy by 2%!**

# $\color{Blue}{\text{3. Machine Learning Models: Classification Problem  }}$

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score as ass
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# Linear Discriminant Analysis libraries
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.covariance import LedoitWolf
from sklearn.covariance import MinCovDet
from sklearn.covariance import OAS
from sklearn.covariance import GraphicalLasso
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier


from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
#x=x.iloc[:,:-2]
#x.drop(columns=['sex'],inplace=True,axis=1)
df_clean = df_clean.drop(columns=['fbs'])
#x.drop(columns=['restecg'],inplace=True,axis=1)
#x.drop(columns=['exng'],inplace=True,axis=1)
df_clean.head()

In [None]:
x=df_clean.iloc[:,:-1]
y=df_clean.iloc[:,-1:]
print("x shape: ", x.shape)
print("y shape: ", y.shape)
x.head()

**SMOTE to deal with imbalanced data**

With SMOTE, the data set was synthetically increased from 279 to 316 samples.

In [None]:
from imblearn.over_sampling import SMOTENC
# Positional indices of the categorical columns: every column of x that is not continuous
categorical_idx = [i for i, col in enumerate(x.columns) if col not in continous]
xv = x.values
yv = y.values.ravel()
smote_nc = SMOTENC(categorical_features=categorical_idx, random_state=0)
x, y = smote_nc.fit_resample(xv, yv)
x = pd.DataFrame(data=x, index=range(x.shape[0]))
y = pd.DataFrame(data=y, index=range(y.shape[0]))
print(x.shape)
print(y.shape)

In [None]:
k_nn=KNeighborsClassifier(n_neighbors=8, metric="chebyshev")
logi = LogisticRegression(random_state=5)
DT = DecisionTreeClassifier(max_features="sqrt")
SDF = SGDClassifier(penalty="l2", random_state=10)
S_VC= SVC(degree=4,C=20, kernel="poly")
RF= RandomForestClassifier(n_estimators=108, criterion= "entropy") # criterion = "gini" or "entropy"
Bayes=  GaussianNB()
MBayes = MultinomialNB()
BBayes = BernoulliNB()
LDA = LinearDiscriminantAnalysis(solver="eigen")    #solver= ‘svd’, ‘lsqr’, ‘eigen’
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=0)
Result =[]

In [None]:
cv_sonuc= cross_validate(clf, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of GradientBoost: ", res*100, "%")
Result.append( "GradientBoost :")
Result.append(res)

In [None]:
cv_sonuc= cross_validate(k_nn, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of KNN: ", res*100, "%")
Result.append( "KNN :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(SDF, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of SGD: ", res*100, "%")
Result.append( "SGD :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(logi, x, y, cv=10 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Logistic Regression: ", res*100, "%")
Result.append( "LR :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(DT, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()

print("Accuracy of Decision Tree: ", res*100, "%")
Result.append( "DT :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(S_VC, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Support Vector Classifier: ", res*100, "%")
Result.append( "SVC :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(RF, x, y, cv=20 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Random Forest: ", res*100, "%")
Result.append( "RF :")
Result.append( res)

In [None]:
cv_sonuc= cross_validate(Bayes, x, y, cv=5 , scoring='accuracy')
res=cv_sonuc['test_score'].mean()
print("Accuracy of Naive Bayes: ", res*100, "%")
Result.append( "NB :")
Result.append( res)

# $\color{Blue}{\text{4. Results and Discussion  }}$


In [None]:
print(Result)
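The `Result` list printed above alternates model labels and scores, which is hard to compare at a glance. As a possible alternative, the sketch below collects mean cross-validated accuracies in a dictionary and sorts them. It runs on synthetic data from `make_classification`, and the model selection, `X_demo`, `y_demo` and `summary` are all assumptions for illustration, so the printed numbers say nothing about the heart data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem standing in for the real features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=8),
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
}

scores = {}
for name, model in models.items():
    cv_result = cross_validate(model, X_demo, y_demo, cv=5, scoring="accuracy")
    scores[name] = cv_result["test_score"].mean()

# One row per model, best mean accuracy first
summary = pd.Series(scores, name="mean_cv_accuracy").sort_values(ascending=False)
print(summary)
```

Keeping the number of folds identical across models, as in this sketch, also makes the reported accuracies easier to compare.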