**DATASET COLUMNS EXPLANATION** :

* Age: (age in years)
* Sex: (1 = male; 0 = female)
* CP: (chest pain type)
* TRESTBPS: (resting blood pressure (in mm Hg on admission to the hospital))
* CHOL: (serum cholesterol in mg/dl)
* FBS: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* RESTECG: (resting electrocardiographic results)
* THALACH:  (maximum heart rate achieved)
* EXANG : (exercise induced angina (1 = yes; 0 = no))
* OLDPEAK : (ST depression induced by exercise relative to rest)
* SLOPE : (the slope of the peak exercise ST segment)
* CA : (number of major vessels (0-3) colored by fluoroscopy)
* THAL : (3 = normal; 6 = fixed defect; 7 = reversible defect)
* TARGET : (1 = heart disease; 0 = no heart disease)

**Load Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier 
from sklearn.metrics import confusion_matrix, classification_report, recall_score, precision_score, accuracy_score, f1_score, roc_curve, roc_auc_score


In [None]:
df= pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
x= df.loc[:, df.columns!='target']
y=df['target']




**I will use a heatmap to see whether any of the features are correlated with each other**

In [None]:
f,ax = plt.subplots(figsize=(14, 14))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
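A heatmap with 13 features can be hard to read, so it may help to list the strongest correlations as a table as well. The sketch below is only an illustration on synthetic data; `top_correlated_pairs` and `demo` are hypothetical names that are not part of this notebook, and the same function could be called on `x`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=5):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Synthetic stand-in for `x` (assumption: not the real heart data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),  # built to correlate strongly with 'a'
    'c': rng.normal(size=200),                 # independent noise
})

pairs = top_correlated_pairs(demo, n=3)
print(pairs)
```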

**In this task I will try different feature selection approaches: SelectKBest, the feature importances provided by the models, and RFECV**

1. SelectKBest

In [None]:
x_train, x_test, y_train , y_test= train_test_split(x, y, test_size=0.2)
select_feature= SelectKBest(chi2, k=5).fit(x_train, y_train)
feature_importances_kbest= pd.DataFrame(columns=['scores'], index= x.columns)
feature_importances_kbest['scores']= select_feature.scores_
feature_importances_kbest=feature_importances_kbest.sort_values('scores', ascending= False)
print(feature_importances_kbest)
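Two details of `SelectKBest(chi2, ...)` are worth keeping in mind: `chi2` only accepts non-negative features, and because it treats values like counts, features measured on larger scales tend to receive larger scores. The sketch below, on synthetic data with hypothetical column names, shows how the selected column names can be read back with `get_support()`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data (assumption: not the real heart data)
rng = np.random.default_rng(0)
n = 300
y_demo = rng.integers(0, 2, size=n)
x_demo = pd.DataFrame({
    'signal': y_demo * 3 + rng.integers(0, 2, size=n),  # depends on the label
    'noise_1': rng.integers(0, 5, size=n),
    'noise_2': rng.integers(0, 5, size=n),
})

# chi2 raises an error on negative values, so check before fitting
all_non_negative = bool((x_demo >= 0).all().all())

selector = SelectKBest(chi2, k=1).fit(x_demo, y_demo)
scores = pd.Series(selector.scores_, index=x_demo.columns).sort_values(ascending=False)
selected = x_demo.columns[selector.get_support()].tolist()

print(scores)
print(selected)
```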



**I'm going to fit a RandomForest model on growing subsets of features, from the top 3 ranked by SelectKBest up to all 13, and then repeat the same steps with the other two feature selection methods. After that I will put the results in a DataFrame so we can visualize and compare the different combinations**

In [None]:
num_features=['3 features', '4 features', '5 features', '6 features', '7 features', '8 features', '9 features', '10 features', '11 features', '12 features', '13 features' ]
results_rf_features_kbest=pd.DataFrame(columns=['accuracy score', 'recall score', 'precision score', 'f1 score', 'roc score'], index=num_features)

In [None]:
rf_model= RandomForestClassifier()
for ind in range(3, len(feature_importances_kbest.index)+1):
    
    x_rf_kbest_train= x_train.loc[:, feature_importances_kbest.index[0:ind]]
    x_rf_kbest_test= x_test.loc[:, feature_importances_kbest.index[0:ind]]
    rf_model.fit(x_rf_kbest_train, y_train)
    y_pred_kbest= rf_model.predict(x_rf_kbest_test)
    
    results_rf_features_kbest.iloc[ind-3, 0]=accuracy_score(y_test, y_pred_kbest)
    results_rf_features_kbest.iloc[ind-3, 1]=recall_score(y_test, y_pred_kbest)
    results_rf_features_kbest.iloc[ind-3, 2]=precision_score(y_test, y_pred_kbest)
    results_rf_features_kbest.iloc[ind-3, 3]=f1_score(y_test, y_pred_kbest)
    results_rf_features_kbest.iloc[ind-3, 4]=roc_auc_score(y_test, y_pred_kbest)



In [None]:
results_rf_features_kbest
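The same fit-and-score loop is repeated for every selection method in this notebook, so it could be wrapped in one helper. The sketch below is one possible way to do that, not the code used above: `evaluate_top_k` is a hypothetical name, the data is synthetic, and it computes ROC AUC from `predict_proba` instead of hard labels, which gives the threshold-independent AUC.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

def evaluate_top_k(model, ranked_features, x_tr, x_te, y_tr, y_te, start=3):
    """Fit `model` on the top-k ranked features for k = start..len(ranked_features)."""
    rows = {}
    for k in range(start, len(ranked_features) + 1):
        cols = list(ranked_features[:k])
        model.fit(x_tr[cols], y_tr)
        pred = model.predict(x_te[cols])
        proba = model.predict_proba(x_te[cols])[:, 1]  # probability of class 1
        rows[f'{k} features'] = {
            'accuracy score': accuracy_score(y_te, pred),
            'recall score': recall_score(y_te, pred),
            'precision score': precision_score(y_te, pred),
            'f1 score': f1_score(y_te, pred),
            'roc auc score': roc_auc_score(y_te, proba),
        }
    # One row per feature count, float dtype, ready for sort_values
    return pd.DataFrame.from_dict(rows, orient='index')

# Synthetic stand-in data (assumption: not the real heart data)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list('abcde'))
y_demo = (x_demo['a'] + x_demo['b'] > 0).astype(int)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

table = evaluate_top_k(RandomForestClassifier(random_state=0),
                       ['a', 'b', 'c', 'd', 'e'], x_tr, x_te, y_tr, y_te)
print(table)
```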


**Now I'm going to show only the two best combinations of features in terms of accuracy, recall and AUC score**


In [None]:
results_rf_features_kbest.sort_values(['accuracy score'], ascending= False).iloc[0:2,:]


In [None]:
results_rf_features_kbest.sort_values(['recall score'], ascending= False).iloc[0:2,:]

In [None]:
results_rf_features_kbest.sort_values(['roc score'], ascending= False).iloc[0:2,:] 

We got the best results with the top 8 features.
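Instead of sorting once per metric, the best feature count for every metric can be read in one step. This is a small sketch with made-up scores (the numbers below are illustrative only, not results from this notebook); the results tables above have object dtype, so they are cast to float first.

```python
import pandas as pd

# Made-up scores for illustration only (assumption: not real results)
demo_results = pd.DataFrame(
    {'accuracy score': [0.80, 0.85, 0.82],
     'recall score': [0.78, 0.81, 0.90],
     'roc auc score': [0.79, 0.84, 0.83]},
    index=['3 features', '4 features', '5 features'],
    dtype=object,
)

numeric = demo_results.astype(float)
best_per_metric = numeric.idxmax()                        # best row label per column
top_two_by_accuracy = numeric.nlargest(2, 'accuracy score')

print(best_per_metric)
print(top_two_by_accuracy)
```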


**Now let's try the second method, the feature importances provided by the model itself, and repeat the same steps as before**

2. Feature importances

In [None]:

feature_importances_rf= pd.DataFrame(columns=['scores'], index= x.columns)
feature_importances_rf['scores']= rf_model.feature_importances_
feature_importances_rf=feature_importances_rf.sort_values('scores', ascending= False)
feature_importances_rf
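The cell above reads `feature_importances_` from `rf_model`, which only lines up with `x.columns` because the last fit in the previous loop used all 13 features. A more explicit variant, sketched here on synthetic data with hypothetical names, fits a fresh seeded forest on all columns before reading the importances.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the real heart data)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y_demo = (x_demo['f0'] > 0).astype(int)  # only f0 carries signal

# Fit on ALL columns so the importances match the column index one to one
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_demo, y_demo)
importances = pd.Series(forest.feature_importances_,
                        index=x_demo.columns).sort_values(ascending=False)
print(importances)
```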




In [None]:
results_rf_features_rf=pd.DataFrame(columns=['accuracy score', 'recall score', 'precision score', 'f1 score', 'roc auc score'], index=num_features)

# Start at 3 features so that row ind-3 matches the labels in num_features
for ind in range(3, len(feature_importances_rf.index)+1):
        
    x_rf_rf_train=x_train.loc[:, feature_importances_rf.index[0:ind]]
    x_rf_rf_test= x_test.loc[:, feature_importances_rf.index[0:ind]]
    
    rf_model.fit(x_rf_rf_train, y_train)
    y_pred3= rf_model.predict(x_rf_rf_test)


    results_rf_features_rf.iloc[ind-3, 0]=accuracy_score(y_test, y_pred3)
    results_rf_features_rf.iloc[ind-3, 1]=recall_score(y_test, y_pred3)
    results_rf_features_rf.iloc[ind-3, 2]=precision_score(y_test, y_pred3)
    results_rf_features_rf.iloc[ind-3, 3]=f1_score(y_test, y_pred3)
    results_rf_features_rf.iloc[ind-3, 4]=roc_auc_score(y_test, y_pred3)

    



In [None]:
results_rf_features_rf

In [None]:
results_rf_features_rf.sort_values(['accuracy score'], ascending= False).iloc[0:2,:]



In [None]:
results_rf_features_rf.sort_values(['recall score'], ascending= False).iloc[0:2,:]


In [None]:
results_rf_features_rf.sort_values(['roc auc score'], ascending= False).iloc[0:2,:]

With this approach, we got the best results with all the features.

But what about selecting features with recursive feature elimination with cross-validation? Let's see.

3. RFECV

In [None]:
rf_model=RandomForestClassifier()
rfe= RFECV(estimator=rf_model, cv=5, step=1, scoring='accuracy')
rfe=rfe.fit(x_train, y_train)
print('chosen best features ', x_train.columns[rfe.support_])
feature_importances_rfe= pd.DataFrame(columns=['important features'] )
feature_importances_rfe['important features'] = x_train.columns[rfe.support_]
feature_importances_rfe
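Note that `support_` is a boolean mask over the columns, so the selected features come back in their original column order, not ranked by importance. The sketch below, on synthetic data with hypothetical names, shows the fitted attributes `n_features_` and `ranking_`, where every kept feature has rank 1 and eliminated features have higher ranks.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (assumption: not the real heart data)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(200, 5)),
                      columns=['f0', 'f1', 'f2', 'f3', 'f4'])
y_demo = (x_demo['f0'] + x_demo['f1'] > 0).astype(int)

selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 cv=3, step=1, scoring='accuracy')
selector.fit(x_demo, y_demo)

kept = x_demo.columns[selector.support_].tolist()   # original column order
ranking = pd.Series(selector.ranking_, index=x_demo.columns).sort_values()

print('number of features kept:', selector.n_features_)
print(ranking)
```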


In [None]:
results_rfecv_features=pd.DataFrame(columns=['accuracy score', 'recall score', 'precision score', 'f1 score', 'roc auc score'], index=num_features)
for ind in range(3, len(feature_importances_rfe.index)+1):
    
    
    x_f4_train=x_train.loc[:, feature_importances_rfe['important features'] [0:ind]]
    x_f4_test= x_test.loc[:, feature_importances_rfe['important features'] [0:ind]]
    
    rf_model.fit(x_f4_train, y_train)
    y_pred4= rf_model.predict(x_f4_test)
    
    
    results_rfecv_features.iloc[ind-3, 0]=accuracy_score(y_test, y_pred4)
    results_rfecv_features.iloc[ind-3, 1]=recall_score(y_test, y_pred4)
    results_rfecv_features.iloc[ind-3, 2]=precision_score(y_test, y_pred4)
    results_rfecv_features.iloc[ind-3, 3]=f1_score(y_test, y_pred4)
    results_rfecv_features.iloc[ind-3, 4]=roc_auc_score(y_test, y_pred4)
    


In [None]:
results_rfecv_features

In [None]:
results_rfecv_features.sort_values(['accuracy score'], ascending= False).iloc[0:2,:]



In [None]:
results_rfecv_features.sort_values(['recall score'], ascending= False).iloc[0:2,:]


In [None]:
results_rfecv_features.sort_values(['roc auc score'], ascending= False).iloc[0:2,:]

**As we can see above, eliminating the weakest feature, 'thal' (thalassemia), gives the best scores**

Now I'm going to take the best combination from each of the 3 feature selection methods and compare these 3 selected ones.

In [None]:
results_random_forest=pd.DataFrame(columns=[ 'RF_features_importances','RF_KBEST' ,'RF_rfecv'], index=['accuracy_score', 'recall_score', 'precision_score', 'f1score','roc_score'])
results_rf_features_sorted= results_rf_features_rf.sort_values('accuracy score', ascending=False)
results_rf_features_kbest_sorted= results_rf_features_kbest.sort_values('accuracy score', ascending= False)
results_rfecv_features_sorted= results_rfecv_features.sort_values('accuracy score', ascending= False)


# Copy the best row (highest accuracy) of each method into its column
results_random_forest['RF_features_importances']=results_rf_features_sorted.iloc[0].to_numpy()
results_random_forest['RF_KBEST']=results_rf_features_kbest_sorted.iloc[0].to_numpy()
results_random_forest['RF_rfecv']=results_rfecv_features_sorted.iloc[0].to_numpy()

results_random_forest
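The comparison table can also be built directly from the best row of each results table. This is a minimal sketch with made-up scores; `best_row`, `demo_a` and `demo_b` are hypothetical names, and the numbers are illustrative only.

```python
import pandas as pd

def best_row(results, by='accuracy score'):
    """Return the row with the highest value of `by`, as a float Series."""
    return results.astype(float).sort_values(by, ascending=False).iloc[0]

# Made-up scores for illustration only (assumption: not real results)
demo_a = pd.DataFrame({'accuracy score': [0.80, 0.85], 'recall score': [0.70, 0.75]},
                      index=['3 features', '4 features'])
demo_b = pd.DataFrame({'accuracy score': [0.90, 0.82], 'recall score': [0.88, 0.79]},
                      index=['3 features', '4 features'])

# One column per method, one row per metric
comparison = pd.DataFrame({'method_a': best_row(demo_a),
                           'method_b': best_row(demo_b)})
print(comparison)
```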

In [None]:
plt.figure(figsize = (10, 7))
sns.heatmap(results_random_forest[results_random_forest.columns.to_list()].astype(float), cmap = 'Blues', annot = True, linewidths = 1, cbar = False, annot_kws = {'fontsize': 12},
           yticklabels = ['Accuracy', 'Recall', 'Precision', 'F1 score', 'ROC AUC'])
sns.set(font_scale = 1.5)
plt.yticks(rotation = 0)
plt.show()   


As we can observe, RandomForest gives the best results with SelectKBest feature selection, using 8 features.


For the sake of better results, I'm going to repeat these steps with XGBoost.

1. XGBoost feature importances

In [None]:
xgb_model= XGBClassifier(random_state= 22, use_label_encoder=False)
xgb_model.fit(x_train, y_train)
f_imp_xgboost= pd.DataFrame(columns= ['feature importances'], index= x_train.columns)
results_xgboost_scores= pd.DataFrame(columns=['accuracy score', 'recall score', 'precision score', 'roc auc score'], index= num_features)


f_imp_xgboost['feature importances']=abs( xgb_model.feature_importances_)    
f_imp_xgboost=f_imp_xgboost.sort_values('feature importances', ascending= False)    
f_imp_xgboost



In [None]:

for ind in range(3, len(f_imp_xgboost.index)+1):
        
    x_xgb_train2=x_train.loc[:, f_imp_xgboost.index[0:ind]]
    x_xgb_test2= x_test.loc[:, f_imp_xgboost.index[0:ind]]
    
    xgb_model.fit(x_xgb_train2, y_train)
    y_pred_xgb2= xgb_model.predict(x_xgb_test2)


    results_xgboost_scores.iloc[ind-3, 0]=accuracy_score(y_test, y_pred_xgb2)
    results_xgboost_scores.iloc[ind-3, 1]=recall_score(y_test, y_pred_xgb2)
    results_xgboost_scores.iloc[ind-3, 2]=precision_score(y_test, y_pred_xgb2)
    results_xgboost_scores.iloc[ind-3, 3]=roc_auc_score(y_test, y_pred_xgb2)



 

In [None]:
results_xgboost_scores

In [None]:
results_xgboost_scores.sort_values(['accuracy score'], ascending= False).iloc[0:2,:]


In [None]:
results_xgboost_scores.sort_values(['recall score'], ascending= False).iloc[0:2,:]
 

In [None]:
results_xgboost_scores.sort_values(['roc auc score'], ascending= False).iloc[0:2,:]

Here, including all features gives the best results.

2. XGBoost with SelectKBest

In [None]:
results_xgb_kbest= pd.DataFrame(columns=['accuracy score', 'recall score', 'precision score', 'roc auc score'], index= num_features)
for ind in range(3,  len(feature_importances_kbest.index)+1):
        
    x_xgb_train1=x_train.loc[:,feature_importances_kbest.index[0:ind]]
    x_xgb_test1= x_test.loc[:, feature_importances_kbest.index[0:ind]]
    
    xgb_model.fit(x_xgb_train1, y_train)
    y_pred_xgb1= xgb_model.predict(x_xgb_test1)


    results_xgb_kbest.iloc[ind-3, 0]=accuracy_score(y_test, y_pred_xgb1)
    results_xgb_kbest.iloc[ind-3, 1]=recall_score(y_test, y_pred_xgb1)
    results_xgb_kbest.iloc[ind-3, 2]=precision_score(y_test, y_pred_xgb1)
    results_xgb_kbest.iloc[ind-3, 3]=roc_auc_score(y_test, y_pred_xgb1)





In [None]:
results_xgb_kbest.sort_values(['accuracy score'], ascending= False).iloc[0:2,:]


In [None]:
results_xgb_kbest.sort_values(['recall score'], ascending= False).iloc[0:2,:]
  

In [None]:
results_xgb_kbest.sort_values(['roc auc score'], ascending= False).iloc[0:2,:]

For this type of feature selection, 8 features give the highest results.

3. XGBoost with RFE

In [None]:
xgboost_model=XGBClassifier(random_state= 22, use_label_encoder=False)     

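# Note: with 13 input features, n_features_to_select=13 keeps every feature,
# so the selected columns come back in their original order rather than ranked
rfe2=RFE(estimator=xgboost_model, n_features_to_select=13, step=1)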

rfe2=rfe2.fit(x_train, y_train)

feature_importances_rfe2= pd.DataFrame(columns=['important features'] )
feature_importances_rfe2['important features'] = x_train.columns[rfe2.support_]


In [None]:
feature_importances_rfe2
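When `n_features_to_select` equals the number of input columns, RFE eliminates nothing and every feature gets rank 1. To obtain a full ordering, one option is to run RFE down to a single feature and read `ranking_`. The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, so it is an illustration rather than the method used above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (assumption: not the real heart data)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y_demo = (x_demo['f0'] > 0).astype(int)  # only f0 carries signal

# Selecting as many features as there are columns keeps everything
keep_all = RFE(GradientBoostingClassifier(random_state=0),
               n_features_to_select=4, step=1).fit(x_demo, y_demo)

# Eliminating down to one feature yields a complete ranking 1..n
full_rank = RFE(GradientBoostingClassifier(random_state=0),
                n_features_to_select=1, step=1).fit(x_demo, y_demo)
ordered = pd.Series(full_rank.ranking_,
                    index=x_demo.columns).sort_values().index.tolist()

print(keep_all.ranking_)
print(ordered)
```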

In [None]:
results_rfecv_features2=pd.DataFrame(columns=['accuracy score', 'recall score', 'precision score',  'roc auc score'], index=num_features)
for ind in range(3, len(feature_importances_rfe2.index)+1):
    
    
    x_gboost_train=x_train.loc[:, feature_importances_rfe2['important features'] [0:ind]]
    x_gboost_test= x_test.loc[:, feature_importances_rfe2['important features'] [0:ind]]
    
    xgboost_model.fit(x_gboost_train, y_train)
    y_pred_xgboost2= xgboost_model.predict(x_gboost_test)
    
    
    results_rfecv_features2.iloc[ind-3, 0]=accuracy_score(y_test, y_pred_xgboost2)
    results_rfecv_features2.iloc[ind-3, 1]=recall_score(y_test, y_pred_xgboost2)
    results_rfecv_features2.iloc[ind-3, 2]=precision_score(y_test, y_pred_xgboost2)
    results_rfecv_features2.iloc[ind-3, 3]=roc_auc_score(y_test, y_pred_xgboost2)
    



In [None]:
results_rfecv_features2

In [None]:
results_rfecv_features2.sort_values(['roc auc score'], ascending= False).iloc[0:2,:]

In [None]:
results_rfecv_features2.sort_values(['recall score'], ascending= False).iloc[0:2,:]

In [None]:
results_rfecv_features2.sort_values(['accuracy score'], ascending= False).iloc[0:2,:]

Here, 12 features provide the best scores.


In [None]:
results_xgboost=pd.DataFrame(columns=[ 'XGBOOST_features_importances','XGBOOST_KBEST' ,'XGBOOST_rfecv'], index=['accuracy_score', 'recall_score', 'precision_score','roc_score'])
results_xgboost_features_sorted= results_xgboost_scores.sort_values('accuracy score', ascending=False)
results_xgboost_features_kbest_sorted= results_xgb_kbest.sort_values('accuracy score', ascending= False)
results_xgboost_rfecv_features_sorted= results_rfecv_features2.sort_values('accuracy score', ascending= False)


# Copy the best row (highest accuracy) of each method into its column
results_xgboost['XGBOOST_features_importances']=results_xgboost_features_sorted.iloc[0].to_numpy()
results_xgboost['XGBOOST_KBEST']=results_xgboost_features_kbest_sorted.iloc[0].to_numpy()
results_xgboost['XGBOOST_rfecv']=results_xgboost_rfecv_features_sorted.iloc[0].to_numpy()




In [None]:
results_xgboost

In [None]:
plt.figure(figsize = (10, 7))
sns.heatmap(results_xgboost[results_xgboost.columns.to_list()].astype(float), cmap = 'Blues', annot = True, linewidths = 1, cbar = False, annot_kws = {'fontsize': 12},
           yticklabels = ['Accuracy', 'Recall', 'Precision', 'ROC AUC'])
sns.set(font_scale = 1.5)
plt.yticks(rotation = 0)
plt.show()   


The best results with XGBoost are with 12 features.


**Now let's compare the best results achieved by XGBoost and RandomForest**

In [None]:
results_xgboost_random_forest=pd.DataFrame(columns=['xgboost', 'RandomForest'], index= ['accuracy_score', 'recall_score', 'precision_score', 'roc auc score'])

In [None]:
results_xgboost

In [None]:
results_random_forest

**I will take XGBoost with RFE feature selection and RandomForest with SelectKBest feature selection and compare them**

In [None]:
results_xgboost_random_forest.iloc[0,0]= results_xgboost.iloc[0, 2]
results_xgboost_random_forest.iloc[1,0]=results_xgboost.iloc[1, 2]
results_xgboost_random_forest.iloc[2,0]=results_xgboost.iloc[2, 2]
results_xgboost_random_forest.iloc[3,0]=results_xgboost.iloc[3, 2]

results_xgboost_random_forest.iloc[0,1]=results_random_forest.iloc[0, 1]
results_xgboost_random_forest.iloc[1,1]=results_random_forest.iloc[1, 1]
results_xgboost_random_forest.iloc[2,1]=results_random_forest.iloc[2, 1]
results_xgboost_random_forest.iloc[3,1]=results_random_forest.iloc[4, 1]
results_xgboost_random_forest

In [None]:
plt.figure(figsize = (10, 7))
sns.heatmap(results_xgboost_random_forest[results_xgboost_random_forest.columns.to_list()].astype(float), cmap = 'Blues', annot = True, linewidths = 1, cbar = False, annot_kws = {'fontsize': 12},
           yticklabels = ['Accuracy', 'Recall', 'Precision', 'ROC AUC'])
sns.set(font_scale = 1.5)
plt.yticks(rotation = 0)
plt.show()   

As we can see, RandomForest did slightly better on accuracy and AUC score.

**Finally, to get better insight, I will use a heatmap to show the confusion matrix for both models**

1. Confusion Matrix with XGBOOST

In [None]:
x_test_xgb_3=x_test.loc[: ,feature_importances_rfe2['important features'][0:12].to_list()]
xgb_model2= XGBClassifier(random_state= 22, use_label_encoder=False)
xgb_model2.fit(x_train.loc[: ,feature_importances_rfe2['important features'][0:12].to_list()], y_train)

y_pred_xgb_3=xgb_model2.predict(x_test_xgb_3)
xgb_cm= confusion_matrix(y_test, y_pred_xgb_3)

sns.heatmap(xgb_cm, cmap = 'Blues', annot = True, fmt = 'd', linewidths = 5, cbar = False, annot_kws = {'fontsize': 15}, 
            yticklabels = ['No heart disease', 'Heart disease'], xticklabels = ['Predicted no heart disease', 'Predicted heart disease'])
plt.yticks(rotation = 0)
plt.show()

2. Confusion Matrix with RandomForest

In [None]:
x_test_rf_11=x_test.loc[: ,feature_importances_kbest.index[0:9]]
rf_model2= RandomForestClassifier()
rf_model2.fit(x_train.loc[: ,feature_importances_kbest.index[0:9]], y_train)

y_pred_rf_11=rf_model2.predict(x_test_rf_11)
rf_cm= confusion_matrix(y_test, y_pred_rf_11)

sns.heatmap(rf_cm, cmap = 'Blues', annot = True, fmt = 'd', linewidths = 5, cbar = False, annot_kws = {'fontsize': 15}, 
            yticklabels = ['No heart disease', 'Heart disease'], xticklabels = ['Predicted no heart disease', 'Predicted heart disease'])
plt.yticks(rotation = 0)
plt.show()
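For a binary problem, the four cells of the confusion matrix can be unpacked directly and turned into rates, which makes the heatmaps above easier to relate to the recall and precision scores. The labels below are made up for illustration and are not predictions from either model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score, precision_score

# Made-up labels for illustration only (assumption: not real predictions)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)        # share of true disease cases that were caught
precision = tp / (tp + fp)     # share of predicted disease cases that were right
specificity = tn / (tn + fp)   # share of healthy cases correctly cleared

print(tn, fp, fn, tp)
print(recall, precision, specificity)
```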

Well, fitting both models a second time with the selected features gives roughly the same results.
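The results are only roughly the same because neither `train_test_split` nor `RandomForestClassifier` is seeded in this notebook. A minimal sketch, on synthetic data and with an arbitrary seed of 42 chosen for illustration, shows that fixing `random_state` makes repeated fits give identical predictions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the real heart data)
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=['f0', 'f1', 'f2', 'f3'])
y_demo = (x_demo['f0'] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# A fixed seed gives the same split every run; stratify keeps the class balance
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Two separate fits with the same seed produce the same forest
pred_1 = RandomForestClassifier(n_estimators=50, random_state=42).fit(x_tr, y_tr).predict(x_te)
pred_2 = RandomForestClassifier(n_estimators=50, random_state=42).fit(x_tr, y_tr).predict(x_te)

print((pred_1 == pred_2).all())
```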

**Conclusion**

Based on this analysis, I think both models did well enough to be used for diagnosing heart disease, with RandomForest maybe slightly better.

RandomForest used the top 9 features: ['thalach', 'ca', 'cp', 'oldpeak', 'exang', 'age', 'chol', 'trestbps', 'slope'].

XGBoost used the top 12 features: ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca'].

This is my first notebook, so please give me your feedback or some advice on how to improve the quality of my work. Thank you for your time.





