# Raudhatul Jannah
# PYTN-KS19-09
# Python for Data Science by Hacktiv8
# **Final Projects 3**

# Introduction

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality from heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity, and harmful use of alcohol through population-wide strategies.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or an already established disease) need early detection and management, where a machine learning model can be of great help.


**Attribute Information:**
1. age - age of the patient
2. anaemia - whether the patient has reduced haemoglobin
3. creatinine_phosphokinase - level of the CPK enzyme in mcg/L
4. diabetes - whether the patient has a history of diabetes
5. ejection_fraction - percentage of blood leaving the heart at each contraction
6. high_blood_pressure - whether the patient has high blood pressure
7. platelets - platelet count in the blood in kiloplatelets/mL
8. serum_creatinine - level of serum creatinine in the blood in mg/dL
9. serum_sodium - level of serum sodium in the blood in mEq/L
10. sex - whether the patient is male or female
11. smoking - whether the patient smokes
12. time - follow-up period in days
13. DEATH_EVENT - whether the patient died during the follow-up period


TARGET : DEATH_EVENT


# Import Library

In [4]:
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz  # Python bindings; rendering also needs the Graphviz system package
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier, VotingClassifier, AdaBoostClassifier,
                              BaggingClassifier, ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from xgboost import XGBClassifier  # needed by the XGBoost section

%matplotlib inline
warnings.filterwarnings("ignore")

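ModuleNotFoundError: No module named 'graphviz'

The import cell fails when the `graphviz` Python package is not installed. It has to be installed (for example with `pip install graphviz`) before the decision tree can be rendered.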

# Data Loading

In [None]:
df = pd.read_csv('dataset/heart_failure_clinical_records_dataset.csv')
df.head()

In [None]:
df.info()

## Viewing the dataset description

In [None]:
df.describe()

**Age**

*   The mean age of the heart disease patients is about 60 years
*   The youngest patient is 40 years old
*   The oldest patient is 95 years old

**Anaemia**

* The mean of the anaemia flag is 0.431438, meaning about 43.1% of the patients have anaemia
* The minimum value is 0 (**no anaemia**)
* The maximum value is 1.0 (**anaemia present**)

**Creatinine Phosphokinase**
* The mean creatinine phosphokinase level is 581.839465
* The lowest creatinine phosphokinase level is 23.00
* The highest creatinine phosphokinase level is 7861.00

**Diabetes**
* The mean of the diabetes flag is 0.418060, meaning about 41.8% of the patients have diabetes
* The minimum value is 0 (**no diabetes**)
* The maximum value is 1.0 (**diabetes present**)
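
Because `anaemia` and `diabetes` are 0/1 flags, their mean in `df.describe()` can be read as the share of patients with the condition. The sketch below illustrates this on a small made-up frame (the `toy` data is hypothetical, not the project dataset):

```python
import pandas as pd

# Hypothetical toy data: 1 = condition present, 0 = absent
toy = pd.DataFrame({
    "anaemia":  [1, 0, 0, 1, 0],
    "diabetes": [0, 0, 1, 0, 0],
})

# For a 0/1 column, the mean equals the fraction of rows equal to 1
share_with_condition = toy.mean()
print(share_with_condition)

# The same numbers obtained by counting explicitly
explicit_share = toy.sum() / len(toy)
print(explicit_share)
```

Applied to the real data, `df[['anaemia', 'diabetes']].mean()` should reproduce the means quoted above.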

## Checking the dataset for null values

In [None]:
df.isnull().sum()

## Viewing the dataset dimensions

In [None]:
df.shape

## Data Cleaning

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True)
plt.savefig("korelasi_fitur.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

Drop the features considered not relevant to DEATH_EVENT

In [None]:
new_df = df.copy()
new_df.drop(columns=['time','serum_sodium','ejection_fraction'], inplace=True)

In [None]:
new_df.head()
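
When reading the heatmap to decide which columns to drop, note that a strongly negative coefficient indicates a relationship just as a strongly positive one does. One way to check relevance is to rank features by the absolute value of their correlation with the target. The sketch below does this on synthetic data; the column names `neg_feature`, `pos_feature`, `noise` and `target` are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Hypothetical features: one negatively related, one positively related, one pure noise
demo = pd.DataFrame({
    "neg_feature": -2.0 * target + rng.normal(size=n),
    "pos_feature": 1.0 * target + rng.normal(size=n),
    "noise": rng.normal(size=n),
    "target": target,
})

# Signed correlation of every feature with the target
corr_with_target = demo.corr()["target"].drop("target")

# Rank by magnitude so that strong negative relationships are not overlooked
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

In the notebook, the equivalent call would be `df.corr()['DEATH_EVENT'].drop('DEATH_EVENT').abs().sort_values(ascending=False)`.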

# EDA

In [None]:
df_EDA = df.copy()
df_EDA['young_person'] = df_EDA['age'].apply(lambda x: 1 if x < 50 else 0)
df_EDA

## Bar plot

In [None]:
young_anemia_diabetes_hbp_sex_smoking = df_EDA[['young_person','anaemia','diabetes','high_blood_pressure','sex','smoking']]
young_anemia_diabetes_hbp_sex_smoking.head()

In [None]:
group1 = young_anemia_diabetes_hbp_sex_smoking.groupby(['young_person','sex']).sum()
group1

In [None]:
plt.style.use('ggplot')
# DataFrame.plot creates its own figure, so the size is passed here instead of via plt.figure
group1.plot(kind='bar', figsize=(16,10), title='Comparison of conditions by age group and sex')
plt.savefig("perbandingan_sex_umur.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

## Scatter Plot

In [None]:
np.unique(df_EDA['serum_creatinine'])

In [None]:
np.unique(df_EDA['time'])

In [None]:
plt.figure(figsize=(14,10))
plt.title('Distribution of serum creatinine level versus patient follow-up time')
sns.scatterplot(data=df_EDA, x='serum_creatinine', y='time', hue='sex')
plt.savefig("scatter_serum_time.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
plt.figure(figsize=(16,10))

df_EDA['DEATH_EVENT'].value_counts().plot(
    kind='pie',
    autopct='%.2f%%',
    shadow=False
)
plt.title('Percentage of deaths in the dataset')
plt.ylabel('')
plt.savefig("pie_DE.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')
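
The percentages drawn by the pie chart can also be printed directly, which makes it explicit which share belongs to which class. A minimal sketch on made-up labels, assuming (per the attribute list) that `DEATH_EVENT = 1` means the patient died:

```python
import pandas as pd

# Hypothetical labels: 0 = alive at follow-up, 1 = died
death_event = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="DEATH_EVENT")

# normalize=True returns fractions instead of counts
share = death_event.value_counts(normalize=True).sort_index() * 100
share.index = share.index.map({0: "survived", 1: "died"})
print(share.round(2))
```

On the project data the same call would be `df['DEATH_EVENT'].value_counts(normalize=True)`.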

# Data Preprocessing

In [None]:
new_df_scaled = new_df.copy()
numerics = ['int64','float64']
scaler = StandardScaler()
# Scale all numeric feature columns in one call; the target DEATH_EVENT is left untouched
feature_cols = new_df.select_dtypes(include=numerics).columns.drop('DEATH_EVENT')
new_df_scaled[feature_cols] = scaler.fit_transform(new_df[feature_cols])
new_df_scaled['young_person'] = new_df['age'].apply(lambda x: 1 if x < 50 else 0)
# new_df_scaled = new_df_scaled.drop(columns=['age'])

In [None]:
new_df_scaled
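
The cell above standardizes the full dataset before it is split. A common alternative is to fit the scaler on the training rows only and to leave 0/1 flags unscaled, so that no statistics from the test rows enter preprocessing. The sketch below shows that variant on made-up data; the frame `demo` and its values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in for the project data
demo = pd.DataFrame({
    "age": rng.uniform(40, 95, size=100),
    "platelets": rng.normal(260000, 90000, size=100),
    "smoking": rng.integers(0, 2, size=100),
})

train, test = train_test_split(demo, test_size=0.1, random_state=42)
train, test = train.copy(), test.copy()

continuous = ["age", "platelets"]  # 0/1 flags such as smoking are left unscaled

# Mean and standard deviation come from the training rows only
scaler = StandardScaler().fit(train[continuous])
train[continuous] = scaler.transform(train[continuous])
test[continuous] = scaler.transform(test[continuous])

print(train[continuous].mean().round(6))
```

The fitted `scaler` can then be reused unchanged on any new patient record at inference time.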

## Data splitting with and without SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is used to oversample the minority class so that both DEATH_EVENT classes are equally represented

In [None]:
X = new_df_scaled.drop(columns=['DEATH_EVENT'])
y = new_df_scaled['DEATH_EVENT']

smote = SMOTE()
X_s, y_s = smote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_s, y_s, test_size=0.1, random_state=42)

# Pendefinisian Model

## Decision Tree

In [None]:
dt_model = DecisionTreeClassifier(criterion='gini')
dt_model_s = DecisionTreeClassifier(criterion='gini')
dt_model.fit(X_train, y_train)
dt_model_s.fit(Xs_train, ys_train)

In [None]:
dot_data = export_graphviz(dt_model, out_file=None, 
                      feature_names=list(X.columns),
                      class_names=['0', '1'],  # one label per DEATH_EVENT class, in ascending order
                      filled=True, rounded=True,  
                      special_characters=True,
                      )  
graph = graphviz.Source(dot_data)
graph
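
If `graphviz` is not available (the import cell at the top raised `ModuleNotFoundError`), scikit-learn's `plot_tree` can draw the tree with matplotlib alone. This is only a sketch on synthetic data with a shallow tree; the data and the `feature_{i}` names are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical data; in the notebook this would be X_train and y_train
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X_demo, y_demo)

fig, ax = plt.subplots(figsize=(10, 6))
nodes = plot_tree(
    tree,
    feature_names=[f"feature_{i}" for i in range(4)],
    class_names=["0", "1"],  # one label per class, in ascending class order
    filled=True,
    rounded=True,
    ax=ax,
)
plt.close(fig)
```

Limiting `max_depth` keeps the drawing readable. A fully grown tree on the real data may be too large to inspect.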

**Model accuracy without SMOTE**

In [None]:
y_pred_dt = dt_model.predict(X_test)
accuracy_score_dt = accuracy_score(y_test, y_pred_dt)
accuracy_score_dt

**Model accuracy with SMOTE**

In [None]:
y_pred_dts = dt_model_s.predict(Xs_test)
accuracy_score_dts = accuracy_score(ys_test, y_pred_dts)
accuracy_score_dts

## Decision Tree with Bagging

In [None]:
bag_clf = BaggingClassifier(
            DecisionTreeClassifier(), 
            bootstrap=True, 
            n_jobs=-1
          )
bag_clf_s = BaggingClassifier(
            DecisionTreeClassifier(), 
            bootstrap=True, 
            n_jobs=-1
          )
bag_clf.fit(X_train, y_train)
bag_clf_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_dt_bag = bag_clf.predict(X_test)
accuracy_score_dt_bag = accuracy_score(y_test, y_pred_dt_bag)
accuracy_score_dt_bag

**Model accuracy with SMOTE**

In [None]:
y_pred_dt_bags = bag_clf_s.predict(Xs_test)
accuracy_score_dt_bags = accuracy_score(ys_test, y_pred_dt_bags)
accuracy_score_dt_bags

## Random Forest

In [None]:
rnd_clf = RandomForestClassifier()
rnd_clf_s = RandomForestClassifier()
rnd_clf.fit(X_train, y_train)
rnd_clf_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_rnd_clf = rnd_clf.predict(X_test)
accuracy_score_rnd_clf = accuracy_score(y_test, y_pred_rnd_clf)
accuracy_score_rnd_clf

**Model accuracy with SMOTE**

In [None]:
y_pred_rnd_clfs = rnd_clf_s.predict(Xs_test)
accuracy_score_rnd_clfs = accuracy_score(ys_test, y_pred_rnd_clfs)
accuracy_score_rnd_clfs

## LogisticRegression

In [None]:
lgr_clf =LogisticRegression(random_state=42)
lgr_clf_s =LogisticRegression(random_state=42)
lgr_clf.fit(X_train, y_train)
lgr_clf_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_lgr_clf = lgr_clf.predict(X_test)
accuracy_score_lgr_clf = accuracy_score(y_test, y_pred_lgr_clf)
accuracy_score_lgr_clf

**Model accuracy with SMOTE**

In [None]:
y_pred_lgr_clfs = lgr_clf_s.predict(Xs_test)
accuracy_score_lgr_clfs = accuracy_score(ys_test, y_pred_lgr_clfs)
accuracy_score_lgr_clfs

## SVM

In [None]:
svc = SVC()
svc_s = SVC()
svc.fit(X_train, y_train)
svc_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_svc = svc.predict(X_test)
accuracy_score_svc = accuracy_score(y_test, y_pred_svc)
accuracy_score_svc

**Model accuracy with SMOTE**

In [None]:
y_pred_svcs = svc_s.predict(Xs_test)
accuracy_score_svcs = accuracy_score(ys_test, y_pred_svcs)
accuracy_score_svcs

## VotingClassifier

In [None]:
voting_clf = VotingClassifier(
                estimators=[('lr', lgr_clf), ('rf', rnd_clf), ('svc', svc)],
                voting='hard'
              )
voting_clf_s = VotingClassifier(
                estimators=[('lr', lgr_clf), ('rf', rnd_clf), ('svc', svc)],
                voting='hard'
              )
voting_clf.fit(X_train, y_train)
voting_clf_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_voting = voting_clf.predict(X_test)
accuracy_score_pred_voting = accuracy_score(y_test, y_pred_voting)
accuracy_score_pred_voting

**Model accuracy with SMOTE**

In [None]:
y_pred_votings = voting_clf_s.predict(Xs_test)
accuracy_score_pred_votings = accuracy_score(ys_test, y_pred_votings)
accuracy_score_pred_votings

## XGBoost

In [None]:
xgb = XGBClassifier()
xgb_s = XGBClassifier()
xgb.fit(X_train, y_train)
xgb_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_xgb = xgb.predict(X_test)
accuracy_score_xgb = accuracy_score(y_test, y_pred_xgb)
accuracy_score_xgb

**Model accuracy with SMOTE**

In [None]:
y_pred_xgbs = xgb_s.predict(Xs_test)
accuracy_score_xgbs = accuracy_score(ys_test, y_pred_xgbs)
accuracy_score_xgbs

## AdaBoost

In [None]:
# algorithm="SAMME.R" is omitted: it was the default in older scikit-learn releases
# and has been deprecated and later removed in recent ones
ada_clf = AdaBoostClassifier(
            DecisionTreeClassifier(), 
            n_estimators=299,
            learning_rate=0.35
          )

ada_clf_s = AdaBoostClassifier(
            DecisionTreeClassifier(), 
            n_estimators=299,
            learning_rate=0.35
          )

ada_clf.fit(X_train, y_train)
ada_clf_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_adaBoost = ada_clf.predict(X_test)
accuracy_score_adaBoost = accuracy_score(y_test, y_pred_adaBoost)
accuracy_score_adaBoost

**Model accuracy with SMOTE**

In [None]:
y_pred_adaBoosts = ada_clf_s.predict(Xs_test)
accuracy_score_adaBoosts = accuracy_score(ys_test, y_pred_adaBoosts)
accuracy_score_adaBoosts

## Extra Trees Classifier

In [None]:
extra_tree = ExtraTreesClassifier()
extra_tree_s = ExtraTreesClassifier()
extra_tree.fit(X_train, y_train)
extra_tree_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_tree = extra_tree.predict(X_test)
accuracy_score_tree = accuracy_score(y_test, y_pred_tree)
accuracy_score_tree

**Model accuracy with SMOTE**

In [None]:
y_pred_trees = extra_tree_s.predict(Xs_test)
accuracy_score_trees = accuracy_score(ys_test, y_pred_trees)
accuracy_score_trees

## GradientBoost

In [None]:
grad_clf = GradientBoostingClassifier()
grad_clf_s = GradientBoostingClassifier()
grad_clf.fit(X_train, y_train)
grad_clf_s.fit(Xs_train, ys_train)

**Model accuracy without SMOTE**

In [None]:
y_pred_grad_clf = grad_clf.predict(X_test)
accuracy_score_grad_clf = accuracy_score(y_test, y_pred_grad_clf)
accuracy_score_grad_clf

**Model accuracy with SMOTE**

In [None]:
y_pred_grad_clfs = grad_clf_s.predict(Xs_test)
accuracy_score_grad_clfs = accuracy_score(ys_test, y_pred_grad_clfs)
accuracy_score_grad_clfs
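
All of the model sections above follow the same fit, predict, score pattern on a single 10% hold-out. Because such a small hold-out gives a noisy accuracy estimate, one option is to compare the models with k-fold cross-validation in a loop. The sketch below uses synthetic data and only four of the models, with settings (`n_estimators=50`, `cv=5`, `max_iter=1000`) chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data standing in for X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.68, 0.32], random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=42),
}

cv_accuracy = {}
for name, model in models.items():
    # Each model is refit on 5 different train/validation partitions
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough idea of how much a difference between two models can be trusted.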

# Evaluasi Model

In [None]:
list_pred = [y_pred_dt,
             y_pred_dt_bag,
             y_pred_rnd_clf,
             y_pred_lgr_clf,
             y_pred_svc,
             y_pred_voting,
             y_pred_xgb,
             y_pred_adaBoost,
             y_pred_tree,
             y_pred_grad_clf]

list_pred_s = [y_pred_dts,
               y_pred_dt_bags,
               y_pred_rnd_clfs,
               y_pred_lgr_clfs,
               y_pred_svcs,
               y_pred_votings,
               y_pred_xgbs,
               y_pred_adaBoosts,
               y_pred_trees,
               y_pred_grad_clfs]

list_model = ['Decision Tree',
              'Decision Tree with Bagging',
              'Random Forest Classifier',
              'Logistic Regression',
              'Support Vector Classifier',
              'Voting Classifier',
              'XGBoost',
              'AdaBoost',
              'Extra Trees Classifier',
              'GradientBoost']

## F1 Score

In [None]:
for i in range(len(list_pred)):
  print('F1 score of the {} model without SMOTE: {}'.format(list_model[i], f1_score(y_test,list_pred[i])))
  print('F1 score of the {} model with SMOTE: {}'.format(list_model[i], f1_score(ys_test,list_pred_s[i])))
  print('============================================')

## Recall

In [None]:
for i in range(len(list_pred)):
  print('Recall score of the {} model without SMOTE: {}'.format(list_model[i], recall_score(y_test,list_pred[i])))
  print('Recall score of the {} model with SMOTE: {}'.format(list_model[i], recall_score(ys_test,list_pred_s[i])))
  print('============================================')

## Precision

In [None]:
for i in range(len(list_pred)):
  print('Precision score of the {} model without SMOTE: {}'.format(list_model[i], precision_score(y_test,list_pred[i])))
  print('Precision score of the {} model with SMOTE: {}'.format(list_model[i], precision_score(ys_test,list_pred_s[i])))
  print('============================================')

## Classification Report

In [None]:
for i in range(len(list_pred)):
  print('=====================================================================')
  print('Classification report of the {} model without SMOTE'.format(list_model[i]))
  print('=====================================================================')
  print('{}'.format(classification_report(y_test,list_pred[i])))
  print('\n')
  print('=====================================================================')
  print('Classification report of the {} model with SMOTE'.format(list_model[i]))
  print('=====================================================================')
  print('{}'.format(classification_report(ys_test,list_pred_s[i])))
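
The per-metric loops above print one line per model. Collecting the same scores into a single table can make the comparison easier to read. A minimal sketch with made-up labels and predictions (`y_true`, `model_a` and `model_b` are hypothetical placeholders for `y_test`, `list_pred` and `list_model`):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels and predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
predictions = {
    "model_a": [0, 0, 1, 0, 0, 1, 0, 1],
    "model_b": [0, 1, 1, 1, 0, 1, 0, 0],
}

rows = []
for name, y_pred in predictions.items():
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    })

# One row per model, best F1 score first
summary = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(summary.round(3))
```

Sorting by recall instead may be more appropriate when missing a death event is considered the costlier error.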

## Confusion Matrix

In [None]:
dtm = confusion_matrix(y_test, y_pred_dt)
dtsm = confusion_matrix(ys_test, y_pred_dts)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(dtm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('Decision Tree without SMOTE')
sns.heatmap(dtsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('Decision Tree with SMOTE')
plt.savefig("dtm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
dtbm = confusion_matrix(y_test, y_pred_dt_bag)
dtbsm = confusion_matrix(ys_test, y_pred_dt_bags)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(dtbm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('Decision Tree with Bagging without SMOTE')
sns.heatmap(dtbsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('Decision Tree with Bagging with SMOTE')
plt.savefig("dtbm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
rfm = confusion_matrix(y_test, y_pred_rnd_clf)
rfsm = confusion_matrix(ys_test, y_pred_rnd_clfs)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(rfm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('Random Forest without SMOTE')
sns.heatmap(rfsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('Random Forest with SMOTE')
plt.savefig("rfm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
lgrm = confusion_matrix(y_test, y_pred_lgr_clf)
lgrsm = confusion_matrix(ys_test, y_pred_lgr_clfs)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(lgrm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('Logistic Regression without SMOTE')
sns.heatmap(lgrsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('Logistic Regression with SMOTE')
plt.savefig("lgrm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
svmm = confusion_matrix(y_test, y_pred_svc)
svmsm = confusion_matrix(ys_test, y_pred_svcs)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(svmm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('Support Vector Classifier without SMOTE')
sns.heatmap(svmsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('Support Vector Classifier with SMOTE')
plt.savefig("svmm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
vcm = confusion_matrix(y_test, y_pred_voting)
vcsm = confusion_matrix(ys_test, y_pred_votings)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(vcm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('Voting Classifier without SMOTE')
sns.heatmap(vcsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('Voting Classifier with SMOTE')
plt.savefig("vcm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
xgbm = confusion_matrix(y_test, y_pred_xgb)
xgbsm = confusion_matrix(ys_test, y_pred_xgbs)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(xgbm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('XGBoost without SMOTE')
sns.heatmap(xgbsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('XGBoost with SMOTE')
plt.savefig("xgbm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
adabm = confusion_matrix(y_test, y_pred_adaBoost)
adabsm = confusion_matrix(ys_test, y_pred_adaBoosts)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(adabm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('AdaBoost without SMOTE')
sns.heatmap(adabsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('AdaBoost with SMOTE')
plt.savefig("adabm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
xtm = confusion_matrix(y_test, y_pred_tree)
xtsm = confusion_matrix(ys_test, y_pred_trees)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(xtm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('Extra Trees Classifier without SMOTE')
sns.heatmap(xtsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('Extra Trees Classifier with SMOTE')
plt.savefig("xtm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

In [None]:
gbm = confusion_matrix(y_test, y_pred_grad_clf)
gbsm = confusion_matrix(ys_test, y_pred_grad_clfs)

fig, ax = plt.subplots(1,2,figsize=(10,5), constrained_layout=True)
sns.heatmap(gbm, annot=True, fmt="d", linewidths=.5, cmap = 'YlGnBu', ax=ax[0])
ax[0].set_title('GradientBoost Classifier without SMOTE')
sns.heatmap(gbsm, annot=True, fmt="d", linewidths=.5, cmap = 'Reds', ax=ax[1])
ax[1].set_title('GradientBoost Classifier with SMOTE')
plt.savefig("gbm.png",
            bbox_inches ="tight",
            pad_inches = 1,
            transparent = True,
            orientation ='landscape')

## Best Model

In [None]:
def finding_max(list_pred,y_test,type_pred=None):
  """Return (index, score) of the prediction list entry with the highest score."""
  metrics = {
    'accuracy': accuracy_score,
    'recall': recall_score,
    'precision': precision_score,
    'f1': f1_score,
  }
  metric = metrics[type_pred]  # raises KeyError for an unknown metric name
  predicted_value = [metric(y_test, pred) for pred in list_pred]
  best = max(predicted_value)
  return predicted_value.index(best), best

In [None]:
best_accuracy = finding_max(list_pred,y_test,'accuracy')
best_accuracy_s = finding_max(list_pred_s,ys_test,'accuracy')

best_recall = finding_max(list_pred,y_test,'recall')
best_recall_s = finding_max(list_pred_s,ys_test,'recall')

best_precision = finding_max(list_pred,y_test,'precision')
best_precision_s = finding_max(list_pred_s,ys_test,'precision')

best_f1 = finding_max(list_pred,y_test,'f1')
best_f1_s = finding_max(list_pred_s,ys_test,'f1')

print('====================================')
print('Models trained without SMOTE')
print('====================================')
print(f'Model with the highest accuracy score: {list_model[best_accuracy[0]]} ({best_accuracy[1]})')
print(f'Model with the highest precision score: {list_model[best_precision[0]]} ({best_precision[1]})')
print(f'Model with the highest recall score: {list_model[best_recall[0]]} ({best_recall[1]})')
print(f'Model with the highest F1 score: {list_model[best_f1[0]]} ({best_f1[1]})')
print('\n')
print('====================================')
print('Models trained with SMOTE')
print('====================================')
print(f'Model with the highest accuracy score: {list_model[best_accuracy_s[0]]} ({best_accuracy_s[1]})')
print(f'Model with the highest precision score: {list_model[best_precision_s[0]]} ({best_precision_s[1]})')
print(f'Model with the highest recall score: {list_model[best_recall_s[0]]} ({best_recall_s[1]})')
print(f'Model with the highest F1 score: {list_model[best_f1_s[0]]} ({best_f1_s[1]})')

# Model Inference

In [None]:
# age=float(input('age :'))
# anaemia=int(input('anaemia :'))
# creatinine_phosphokinase=int(input('creatinine_phosphokinase :'))
# diabetes=int(input('diabetes :'))
# high_blood_pressure=int(input('high_blood_pressure :'))
# platelets=float(input('platelets : '))
# serum_creatinine=float(input('serum_creatinine :'))
# sex=int(input('sex :'))
# smoking=int(input('smoking :'))
# young_person = int(input('young_person :'))

# x_input=[[
#     age,
#     anaemia,
#     creatinine_phosphokinase,
#     diabetes,
#     high_blood_pressure,
#     platelets,
#     serum_creatinine,
#     sex,
#     smoking,
#     young_person
# ]]

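# # Scale with statistics learned from the training features; calling
# # fit_transform on a single row would set every value to 0.
# feature_cols = [c for c in X.columns if c != 'young_person']
# inference_scaler = StandardScaler().fit(new_df[feature_cols])
# x_scaled = inference_scaler.transform([x_input[0][:-1]])
# x_input = np.hstack([x_scaled, [[young_person]]])
# y_output=bag_clf_s.predict(x_input)
# if y_output==0:
#     print('survived')
# else:
#     print('died')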

In [None]:
# The context manager closes the file once the model has been written
with open('model_ETC.pkl','wb') as f:
  pickle.dump(extra_tree_s, f)
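
Before relying on a saved model, it can be worth checking that it loads back and predicts identically. The sketch below does such a round trip on a small synthetic model; the data, the temporary directory and the file name `model_demo.pkl` are hypothetical:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical data and model standing in for Xs_train, ys_train and extra_tree_s
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)
model = ExtraTreesClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A pickled scikit-learn model should be loaded with the same library version it was saved with, and any scaler used in preprocessing has to be saved alongside it.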

# Kesimpulan

The conclusions drawn from the data analysis are:

* About 67.89% of the patients in the dataset were still alive at follow-up (DEATH_EVENT = 0), while the remaining 32.11% died, so the target classes are imbalanced. Heart disease in this dataset occurs mostly in male patients.

* The **Extra Trees Classifier** trained with SMOTE reached a high accuracy of 82.9% on the SMOTE-resampled test split, so this model is considered suitable for predicting the survival of heart disease patients.