<table align="left">
  <td>
    <a href="https://colab.research.google.com/drive/1hR92YKQZ6_PWoeOEH8TWe9MjPbFx5FAr" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td>
    <a target="_blank" href="https://www.kaggle.com/coderyug/heart-disease-multiple-models-rfc">
  <img src="https://kaggle.com/static/images/open-in-kaggle.svg" /></a>
  </td>
</table>

#$Imports$

In [1]:
# Common
import numpy as np
import pandas as pd
# Splitting
from sklearn.model_selection import train_test_split
# Feature Selection
from sklearn.ensemble import RandomForestClassifier
# Visualization
import matplotlib.pyplot as plt
# Models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
# Evaluation
from sklearn.metrics import confusion_matrix, f1_score
# Tuning
from sklearn.model_selection import GridSearchCV
# Scaler
from sklearn.preprocessing import StandardScaler

#$Data$

In [1]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head()

Great!! Every column is already numeric (there are no string-valued categorical columns), so no encoding is needed. Let's check for null values.

In [1]:
df.isnull().sum()

$Great!!$ There are `no null values`. I suspect that some `features`, such as `"sex"`, have little `effect` on the `target`, so let's `check` this with `feature selection`.

In [1]:
X_data = df.iloc[:,:-1]
Y_data = df['target']
print(f" CD : {np.bincount(Y_data)}")

There is `not` much `Class Imbalance`, so there is `no need` for `Resampling or Stratifying`.
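Even when the classes are roughly balanced, a stratified split is cheap and keeps the class ratio the same in the train and test parts. Below is a minimal sketch on synthetic data; `X_demo`, `y_demo`, the sample size and the `random_state` values are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (assumed: 300 rows, 13 features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, weights=[0.45, 0.55], random_state=0
)

# Class proportions in the full data
full_ratio = np.bincount(y_demo) / len(y_demo)

# stratify=y_demo keeps those proportions in both parts;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
train_ratio = np.bincount(y_tr) / len(y_tr)
test_ratio = np.bincount(y_te) / len(y_te)

print(f"Full  : {full_ratio}")
print(f"Train : {train_ratio}")
print(f"Test  : {test_ratio}")
```

In this notebook the same idea would be `train_test_split(X_data,Y_data,test_size=0.2,stratify=Y_data)`.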

#$Splitting$ $Data$

In [1]:
X_train,X_test,Y_train,Y_test = train_test_split(X_data,Y_data,test_size=0.2)

In [1]:
print(f"X Train : {X_train.shape}")
print(f"Y Train : {Y_train.shape}")
print(f"X Test : {X_test.shape}")
print(f"Y Test : {Y_test.shape}")

Good, let's move on to feature selection/importance.

#$Feature$ $Importance$

In [1]:
RFC = RandomForestClassifier().fit(X_train,Y_train)
feature_imp_ = RFC.feature_importances_

In [1]:
X_cols = X_data.columns
X_cols

In [1]:
plt.figure(figsize=(10,8))
plt.bar(X_cols,feature_imp_,color="steelblue",label="Features")
plt.bar(X_cols[np.where(feature_imp_==feature_imp_.max())],feature_imp_.max(),color="green",label="Max")
plt.bar(X_cols[np.where(feature_imp_==feature_imp_.min())],feature_imp_.min(),color="purple",label="Min")
plt.xticks(rotation=70)
plt.legend()
plt.ylabel("Importance Score")
plt.title("Feature Importance")
plt.show()

As we can see, $"fbs"$ and $"restecg"$ are `not that important` compared to the others, so we could `remove` them from the `dataset`. This would `save training time` and might help the `model` `generalize better`. But since this is a `small dataset`, I am `not removing them`.
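If one did want to drop the weakest features, it could look roughly like the sketch below. It runs on synthetic data; the column names `f0`..`f5` and the "drop the two least important" cutoff are hypothetical choices, not part of this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 6 features, only 3 of them informative (assumption)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Fit a forest and rank the features by impurity-based importance
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_demo, y_demo)
importances = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values()

# Drop the two least important columns (hypothetical cutoff)
to_drop = importances.index[:2].tolist()
X_reduced = X_demo.drop(columns=to_drop)

print(importances)
print(f"Dropped : {to_drop}, new shape : {X_reduced.shape}")
```

For the heart data the equivalent would be `X_data.drop(columns=["fbs","restecg"])`, applied before splitting so that train and test keep the same columns.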

#$Models$

##Function

In [1]:
def evaluation(model):
  y_pred_model = model.predict(X_test)
  print(f"Accuracy : {model.score(X_test,Y_test)}")
  print(f"CM : \n {confusion_matrix(Y_test,y_pred_model)}")
  print(f"F1 Score  : {f1_score(Y_test,y_pred_model)}")

##$SVC$

In [1]:
vector_machine = SVC()
vector_machine.fit(X_train,Y_train)
evaluation(vector_machine)

Let's try tuning it.

In [1]:
gs_svc = GridSearchCV(vector_machine,param_grid={"C":[0.001,0.01,0.1,1.0,10.0,100.0]},cv=10)
gs_svc.fit(X_train,Y_train)

In [1]:
gs_svc.best_params_

In [1]:
best_svc = gs_svc.best_estimator_
best_svc.fit(X_train,Y_train)
evaluation(best_svc)

A little better, but not by much. Scaling the input may help.

In [1]:
Sc = StandardScaler()
X_train_sc, X_test_sc = Sc.fit_transform(X_train), Sc.transform(X_test) 

In [1]:
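from sklearn.base import clone
from sklearn.pipeline import make_pipeline
# Use a fresh copy of the tuned SVC (so best_svc is not refit) inside a pipeline, so X_test is scaled before predict
scaled_svc = make_pipeline(StandardScaler(), clone(gs_svc.best_estimator_))
scaled_svc.fit(X_train,Y_train)
evaluation(scaled_svc)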

Nahh!!! Let's keep the previous version.
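One way to give scaling a fair chance is to put the scaler and the SVC in a single pipeline and tune `C` on that, so the scaler is refit inside every cross-validation fold and the test set is scaled automatically. The sketch below uses synthetic data; the step names `scale` and `svc`, the `C` grid and the data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Scaler + SVC as one estimator; "svc__C" addresses C of the "svc" step
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
gs_pipe = GridSearchCV(pipe, param_grid={"svc__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
gs_pipe.fit(X_tr, y_tr)

# score() runs the whole pipeline, so X_te is scaled with the training statistics
test_acc = gs_pipe.score(X_te, y_te)
print(f"Best params : {gs_pipe.best_params_}")
print(f"Accuracy    : {test_acc}")
```

Because the fitted pipeline accepts unscaled input, it can be passed to `evaluation` like any other model.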

##$RandomForest$

In [1]:
RFC = RandomForestClassifier(max_depth=10)
RFC.fit(X_train,Y_train)
evaluation(RFC)

I expected better results than this.

In [1]:
gs_rfc = GridSearchCV(RFC,param_grid={"max_depth":[10,30,50,70,100],"n_estimators":[100,200,300,500]},cv=5)
gs_rfc.fit(X_train,Y_train)

In [1]:
gs_rfc.best_params_

In [1]:
best_rfc = gs_rfc.best_estimator_
best_rfc.fit(X_train,Y_train)
evaluation(best_rfc)

This time too, let's keep the untuned one.

##$AdaBoost$

In [1]:
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4)
AdaBC = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=10),n_estimators=100,learning_rate=1.0)
AdaBC.fit(X_train,Y_train)
evaluation(AdaBC)

In [1]:
gs_ada = GridSearchCV(AdaBC,param_grid={"learning_rate":[0.001,0.01,0.1,1.0,10.0,100.0],"n_estimators":[50,100,200,300,500]},cv=5,scoring="accuracy")
gs_ada.fit(X_train,Y_train)

In [1]:
gs_ada.best_params_

In [1]:
gs_ada.best_score_

In [1]:
best_ada = gs_ada.best_estimator_
best_ada.fit(X_train,Y_train)
evaluation(best_ada)

##$DecisionTree$

In [1]:
DTC = DecisionTreeClassifier(max_depth=10)
DTC.fit(X_train,Y_train)
evaluation(DTC)

In [1]:
gs_dtc = GridSearchCV(DTC,param_grid={"max_depth":[50,10,20,30,80,100]},cv=5,scoring="accuracy")
gs_dtc.fit(X_train,Y_train)

In [1]:
best_dtc = gs_dtc.best_estimator_
best_dtc.fit(X_train,Y_train)
evaluation(best_dtc)

##$KMeans$

In [1]:
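Kmeans = KMeans(n_clusters=2)
Kmeans.fit(X_train,Y_train) # KMeans is unsupervised: Y_train is ignored
evaluation(Kmeans) # "Accuracy" here is score(), i.e. the -ve of inertia; cluster ids are arbitrary, so CM and F1 may be label-swapped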

##$Gaussian$ $Mixture$

In [1]:
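gmm = GaussianMixture(n_components=2)
gmm.fit(X_train,Y_train) # GaussianMixture is unsupervised: Y_train is ignored
evaluation(gmm) # "Accuracy" here is score(), i.e. the average per-sample log-likelihood (not inertia); component ids are arbitrary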
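Since cluster (or mixture component) ids are arbitrary, comparing them directly with the target can under-report how well the clusters match the classes. A possible workaround, sketched below on synthetic blobs, is to map each cluster to the most common true label among its members before scoring. The helper `map_clusters_to_labels` and the demo data are hypothetical, not part of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score

# Synthetic two-class data (assumption, stands in for X_test / Y_test)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
clusters = km.predict(X_demo)

def map_clusters_to_labels(cluster_ids, y_true):
    """Assign each cluster the most common true label among its members."""
    mapping = {}
    for c in np.unique(cluster_ids):
        mapping[c] = np.bincount(y_true[cluster_ids == c]).argmax()
    return np.array([mapping[c] for c in cluster_ids])

y_pred = map_clusters_to_labels(clusters, y_demo)
raw_acc = accuracy_score(y_demo, clusters)    # depends on the arbitrary cluster ids
mapped_acc = accuracy_score(y_demo, y_pred)   # after majority-vote mapping

print(f"Raw accuracy    : {raw_acc}")
print(f"Mapped accuracy : {mapped_acc}")
```

With the notebook's data, pass `np.asarray(Y_test)` as `y_true` so that the boolean indexing works on a plain array. The same helper works for the component ids returned by `gmm.predict`.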

#$Cause$
---
Interesting, none of the models exceeds 80% accuracy.

----
Cause of such a low accuracy:
1. $Small$ $Dataset$

---
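With a small dataset, a single 20% test split is also a noisy estimate of accuracy, so part of the gap between models may just be split-to-split variation. Repeated cross-validation gives a mean and a spread instead of one number. The sketch below uses synthetic data; the fold counts, `random_state` values and the data itself are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the heart data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# 5 folds repeated 3 times -> 15 accuracy values
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo, y_demo, cv=cv, scoring="accuracy",
)

print(f"Accuracy : {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the spread is comparable to the differences between models, the ranking in the comparison plot should be read with caution.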

In [1]:
def best_model(models,X_test,Y_test,n_):
  acc_ = []
  for i in models:
    acc = i.score(X_test,Y_test)
    acc_.append(acc)
  plt.bar(n_,acc_,color="green",label="Models")
  plt.bar(n_[acc_.index(np.max(acc_))],np.max(acc_),color="red",label="Max")
  plt.bar(n_[acc_.index(np.min(acc_))],np.min(acc_),color="blue",label="Min")
  plt.title("Model Comparison")
  plt.legend()
  plt.xticks(rotation=45)
  plt.ylabel("Accuracy Score")
  plt.show()
models = [vector_machine,best_svc,scaled_svc,RFC,best_rfc,AdaBC,best_ada,DTC,best_dtc]
best_model(models,X_test,Y_test,n_=["SVC","GS_SVC","Scaled_SVC","RFC","GS_RFC","AdaBoost","GS_AdaBoost","DTC","GS_DTC"])

We can `conclude` that $RFC$ works `best` for this `problem`, at least on this train/test `split`.

----