# Introduction to Python for Data Science Final Project 3: Ensemble

## Introduction

### Team Member:
1. Qaris Ardian Pratama

### Abstract
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, killing about 17.9 million people per year, or 31% of all deaths worldwide. CVDs are the most common cause of heart failure.

Most cardiovascular diseases can be prevented by addressing lifestyle factors such as tobacco use, diet, physical activity, and alcohol consumption. People with CVDs, or at high risk of CVDs, need to be detected early. Machine learning can therefore help save lives here.

The dataset is the Heart Failure Prediction dataset from Kaggle. It has 12 features: age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, and time. The target column is DEATH_EVENT.

### Objective
The objectives of this analysis are as follows.
1. Understand the concept of classification with an ensemble model.
2. Prepare the data for use in an ensemble model.
3. Implement an ensemble model to make predictions.
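
To make the idea of an ensemble classifier concrete, here is a minimal sketch. It assumes a soft-voting combination of three scikit-learn models and uses synthetic data from `make_classification` in place of the clinical records. The choice of base models and every parameter value are illustrative assumptions, not necessarily what this project uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical data (assumption): 299 rows, 12 features.
X, y = make_classification(n_samples=299, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

test_accuracy = ensemble.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {test_accuracy:.2f}")
```

The scale-sensitive models (logistic regression and k-nearest neighbours) are wrapped in a pipeline with `StandardScaler`, while the random forest is left unscaled because tree splits do not depend on feature scale.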

## Import Libraries

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import pickle
import warnings
import datetime
import missingno as msno
import plotly.figure_factory as ff

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
%matplotlib inline

## Data Loading

In [22]:
df = pd.read_csv('../../../../data_set/heart_failure_clinical_records_dataset.csv')
df.head()


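| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |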


In [4]:
df.shape

(299, 13)

**Notes**
<br>
The heart_failure_clinical_records dataset has 299 rows and 13 columns.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [7]:
df.describe()

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 60.833893 | 0.431438 | 581.839465 | 0.41806 | 38.083612 | 0.351171 | 263358.029264 | 1.39388 | 136.625418 | 0.648829 | 0.32107 | 130.26087 | 0.32107 |
| std | 11.894809 | 0.496107 | 970.287881 | 0.494067 | 11.834841 | 0.478136 | 97804.236869 | 1.03451 | 4.412477 | 0.478136 | 0.46767 | 77.614208 | 0.46767 |
| min | 40.0 | 0.0 | 23.0 | 0.0 | 14.0 | 0.0 | 25100.0 | 0.5 | 113.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 51.0 | 0.0 | 116.5 | 0.0 | 30.0 | 0.0 | 212500.0 | 0.9 | 134.0 | 0.0 | 0.0 | 73.0 | 0.0 |
| 50% | 60.0 | 0.0 | 250.0 | 0.0 | 38.0 | 0.0 | 262000.0 | 1.1 | 137.0 | 1.0 | 0.0 | 115.0 | 0.0 |
| 75% | 70.0 | 1.0 | 582.0 | 1.0 | 45.0 | 1.0 | 303500.0 | 1.4 | 140.0 | 1.0 | 1.0 | 203.0 | 1.0 |
| max | 95.0 | 1.0 | 7861.0 | 1.0 | 80.0 | 1.0 | 850000.0 | 9.4 | 148.0 | 1.0 | 1.0 | 285.0 | 1.0 |


**Notes**
<br>
For the creatinine_phosphokinase feature, the gap between the 75th percentile (582) and the maximum (7861) is very large, so it is worth checking whether the feature contains outliers.
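
One common way to check for outliers is the 1.5 × IQR rule. The sketch below is only an illustration of that rule, not necessarily the method used later in this notebook. The helper name `iqr_outliers` and the multiplier `k` are assumptions introduced here, and the sample holds only the five creatinine_phosphokinase values visible in the `df.head()` output.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# The five creatinine_phosphokinase values shown by df.head().
cpk_sample = pd.Series([582, 7861, 146, 111, 160], name="creatinine_phosphokinase")
flagged = iqr_outliers(cpk_sample)
print(flagged.tolist())
```

On the full data the same helper could be called as `iqr_outliers(df['creatinine_phosphokinase'])`. Whether the flagged values should be removed, capped, or kept is a separate modeling decision.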

## Data Cleaning

In [12]:
print('Number of missing values: ', df.isna().sum().sum())

Number of missing values:  0


In [13]:
print('Number of duplicated rows: ', df.duplicated().sum())

Number of duplicated rows:  0


**Notes**
<br>
The dataset has no missing values and no duplicated rows.

In [24]:
df_clean = df.rename(columns={
    'creatinine_phosphokinase': 'cpk_level',
    'DEATH_EVENT': 'death_event'
})
df_clean.head()

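| | age | anaemia | cpk_level | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | death_event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |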


**Notes**
<br>
Two columns are renamed to make them easier to work with: creatinine_phosphokinase becomes cpk_level and DEATH_EVENT becomes death_event.

## EDA and Visualization

In [25]:
hist_data = [df_clean["age"].values]
group_labels = ['age'] 

fig = ff.create_distplot(hist_data, group_labels)
fig.update_layout(title_text='Age Distribution plot')

fig.show()

**Notes**
<br>
Ages vary widely. Most patients are between 40 and 80, and the density peaks at around 60.

In [26]:
fig = px.box(df_clean, x='sex', y='age', points="all")
fig.update_layout(
    title_text="Gender wise Age Spread - Male = 1 Female =0")
fig.show()

In [27]:
male = df_clean[df_clean["sex"]==1]
female = df_clean[df_clean["sex"]==0]

# Filter each subset with its own mask so the boolean index matches the frame being sliced
male_survi = male[male["death_event"]==0]
male_not = male[male["death_event"]==1]
female_survi = female[female["death_event"]==0]
female_not = female[female["death_event"]==1]

labels = ['Male - Survived', 'Male - Not Survived', 'Female - Survived', 'Female - Not Survived']
values = [len(male_survi), len(male_not), len(female_survi), len(female_not)]
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.4)])
fig.update_layout(
    title_text="Analysis on Survival - Gender")
fig.show()

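**Notes**
<br>
Male patients account for a larger share of survivors than female patients, but they also account for a larger share of deaths. This largely reflects the sample itself: about 65% of the patients are male (the mean of sex is 0.649), so the pie chart's shares of the whole sample are not survival rates within each sex.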
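
Because the pie chart shows shares of the whole sample, a within-group rate can be easier to compare between the sexes. The sketch below shows one way to compute it with `pd.crosstab`. The `toy` frame is hypothetical data made up for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy records, NOT taken from the dataset (sex: 1 = male, 0 = female).
toy = pd.DataFrame({
    "sex":         [1, 1, 1, 1, 0, 0],
    "death_event": [1, 0, 0, 0, 1, 0],
})

# normalize="index" makes each row (each sex) sum to 1,
# so the death_event == 1 column is the death rate within that sex.
rates = pd.crosstab(toy["sex"], toy["death_event"], normalize="index")
print(rates)
```

On the notebook's data the same call would be `pd.crosstab(df_clean["sex"], df_clean["death_event"], normalize="index")`.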

In [28]:
surv = df_clean[df_clean["death_event"]==0]["age"]
not_surv = df_clean[df_clean["death_event"]==1]["age"]
hist_data = [surv,not_surv]
group_labels = ['Survived', 'Not Survived']
fig = ff.create_distplot(hist_data, group_labels, bin_size=0.5)
fig.update_layout(
    title_text="Analysis in Age on Survival Status")
fig.show()

**Notes**
<br>
Survival is higher at ages 40 to 70 and declines above 70.
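
The age pattern read from the plot can also be checked numerically by binning age and computing the death rate per bin. This is a minimal sketch under stated assumptions: the helper `death_rate_by_age_band`, the bin edges, and the `toy` frame are all hypothetical and chosen only for illustration.

```python
import pandas as pd


def death_rate_by_age_band(frame: pd.DataFrame, edges: list, labels: list) -> pd.Series:
    """Mean of death_event (the death rate) within each left-closed age band."""
    bands = pd.cut(frame["age"], bins=edges, labels=labels, right=False)
    return frame.groupby(bands, observed=True)["death_event"].mean()


# Hypothetical toy records, NOT taken from the dataset.
toy = pd.DataFrame({
    "age":         [45, 55, 65, 72, 75, 85],
    "death_event": [0, 0, 0, 0, 1, 1],
})

band_rates = death_rate_by_age_band(toy, edges=[40, 70, 100], labels=["40-69", "70+"])
print(band_rates)
```

With `right=False` each band includes its lower edge and excludes its upper edge, so a 70-year-old falls in the "70+" band.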

In [30]:
fig = px.violin(df_clean, y="age", x="sex", color="death_event", box=True, points="all", hover_data=df_clean.columns)
fig.update_layout(title_text="Analysis in Age and Gender on Survival Status")
fig.show()

**Notes**
<br>
Among men, survivors are concentrated at ages 50 to 60. Among women, they are concentrated at ages 60 to 70.

In [31]:
fig = px.violin(df_clean, y="age", x="smoking", color="death_event", box=True, points="all", hover_data=df_clean.columns)
fig.update_layout(title_text="Analysis in Age and Smoking on Survival Status")
fig.show()

**Notes**
<br>
Among non-smokers, survivors are concentrated at ages 55 to 65. Among smokers, they are concentrated at around ages 50 to 60.

In [33]:
fig = px.violin(df_clean, y="age", x="diabetes", color="death_event", box=True, points="all", hover_data=df_clean.columns)
fig.update_layout(title_text="Analysis in Age and Diabetes on Survival Status")
fig.show()

**Notes**
<br>
Among patients without diabetes, survivors are spread evenly across ages 50 to 70. Among patients with diabetes, survivors are concentrated at ages 60 to 65.
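
The age ranges read from the violin plots correspond roughly to the quartiles of age within each group, so they can be tabulated directly. The sketch below assumes a hypothetical helper `age_quartiles` and a made-up `toy` frame. It would apply equally to the sex, smoking, or diabetes columns.

```python
import pandas as pd


def age_quartiles(frame: pd.DataFrame, factor: str) -> pd.DataFrame:
    """Age quartiles (25%, 50%, 75%) for every (factor, death_event) group."""
    return (
        frame.groupby([factor, "death_event"])["age"]
        .quantile([0.25, 0.5, 0.75])
        .unstack()  # move the quantile level into the columns
    )


# Hypothetical toy records, NOT taken from the dataset.
toy = pd.DataFrame({
    "diabetes":    [0, 0, 0, 1, 1, 1, 0, 1],
    "death_event": [0, 0, 0, 0, 0, 0, 1, 1],
    "age":         [50, 60, 70, 60, 62, 65, 80, 75],
})

summary = age_quartiles(toy, "diabetes")
print(summary)
```

On the notebook's data, `age_quartiles(df_clean, "smoking")` would give the interquartile age range of survivors and non-survivors for smokers and non-smokers.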