# Machine Learning Workflow

### 1. Importing Data to Python
* Drop Duplicates
### 2. Data Preprocessing
* Input-output Split
* Train - Test Split
* Imputation
* Processing Categorical
* Normalization
### 3. Training Machine Learning
* Choose Score to optimize and Hyperparameter Space

## **Heart Disease Analysis**
* Task: Classification
* Objective: Predict which patients are at risk of heart disease

### Data Description
**Additional information:**

This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by machine learning researchers to date. The "target" column refers to the presence of heart disease in the patient, as an integer from 0 (no disease) to 4. Experiments with the Cleveland database have focused on distinguishing the presence of disease (values 1, 2, 3, 4) from its absence (value 0).
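
The presence-versus-absence framing described above can be sketched as follows. This is only an illustration on a hypothetical toy `target` series; the notebook itself keeps the original five-class target.

```python
import pandas as pd

# Hypothetical toy target column using the 0-4 coding described above
target = pd.Series([0, 2, 1, 0, 4, 3], name="target")

# Collapse values 1-4 into a single "disease present" class (1); 0 stays 0
target_binary = (target > 0).astype(int)

print(target_binary.tolist())
```

Whether to binarize is a modeling choice: the binary task is easier and matches the published experiments, while the five-class task keeps the severity information.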

**Dataset Attributes**

| Attribute  | Type         | Description                      |
|------------|--------------|----------------------------------|
| age        | Integer      | Patient age                      |
| sex        | Categorical  | Patient sex *(0=female, 1=male)* |
| cp         | Categorical  | Chest pain type *(1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic)* |
| trestbps   | Integer      | Resting blood pressure (mm Hg)   |
| chol       | Integer      | Serum cholesterol (mg/dl)        |
| fbs        | Categorical  | Fasting blood sugar > 120 mg/dl *(1=true, 0=false)* |
| restecg    | Categorical  | Resting electrocardiographic result *(0=normal, 1=ST-T wave abnormality (T wave inversion and/or ST elevation or depression > 0.05 mV), 2=probable or definite left ventricular hypertrophy by Estes' criteria)* |
| thalach    | Integer      | Maximum heart rate achieved      |
| exang      | Categorical  | Exercise-induced angina *(1=yes, 0=no)* |
| oldpeak    | Float        | ST depression induced by exercise relative to rest |
| slope      | Categorical  | Slope of the peak exercise ST segment (1-3) |
| ca         | Integer      | Number of major vessels colored by fluoroscopy (0-3) |
| thal       | Categorical  | `thal` test result *(3 = normal; 6 = fixed defect; 7 = reversible defect)* |
| target     | Integer      | Heart disease diagnosis *(0=absent, 1-4=present)* |



## <b><font color='orange'> 1. Importing Data to Python </font></b>

In [1]:
# Library for data structures
import pandas as pd

# Library for numerical computing
import numpy as np

# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

In [2]:
# Define a function to import the dataset
def ImportData(data_file):
    """
    Import the data and drop duplicate rows.
    :param data_file: <string> input file name (.data format)
    :return heart_df: <pandas dataframe> sample data
    """
    # Column names as given in the dataset documentation
    column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

    # Read the data (the file has no header row)
    heart_df = pd.read_csv(data_file, names=column_names)

    # Print the shape of the data
    print('Original data:',heart_df.shape, '-(#observations, #columns)')
    print('Number of rows',heart_df.shape[0], 'and number of columns',heart_df.shape[1])

    # Check for duplicate rows
    duplicate_status = heart_df.duplicated(keep='first')

    if duplicate_status.sum() == 0:
        print('No duplicate rows found')
    else:
        heart_df = heart_df.drop_duplicates()
        print('Data after dropping duplicates:', heart_df.shape, '-(#observations, #columns)')

    return heart_df

# data_file is the function's parameter.
# If the function is called with data_file = "processed.cleveland.data",
# every use of `data_file` inside the function refers to 'processed.cleveland.data'.

In [3]:
# Input argument
data_file = 'processed.cleveland.data'

# Call the function
heart_df = ImportData(data_file)

Original data: (303, 14) -(#observations, #columns)
Number of rows 303 and number of columns 14
No duplicate rows found


In [4]:
# Check summary statistics
heart_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,303.0,54.438944,9.038662,29.0,48.0,56.0,61.0,77.0
sex,303.0,0.679868,0.467299,0.0,0.0,1.0,1.0,1.0
cp,303.0,3.158416,0.960126,1.0,3.0,3.0,4.0,4.0
trestbps,303.0,131.689769,17.599748,94.0,120.0,130.0,140.0,200.0
chol,303.0,246.693069,51.776918,126.0,211.0,241.0,275.0,564.0
fbs,303.0,0.148515,0.356198,0.0,0.0,0.0,0.0,1.0
restecg,303.0,0.990099,0.994971,0.0,0.0,1.0,2.0,2.0
thalach,303.0,149.607261,22.875003,71.0,133.5,153.0,166.0,202.0
exang,303.0,0.326733,0.469794,0.0,0.0,0.0,1.0,1.0
oldpeak,303.0,1.039604,1.161075,0.0,0.0,0.8,1.6,6.2


In [5]:
# Check the number of unique values and the unique values in each column
summary_dict = {}
for i in list(heart_df.columns):
    summary_dict[i] = {
        'Unique Count': heart_df[i].value_counts().shape[0],
        'Unique Values': heart_df[i].unique()
    }
summary_df = pd.DataFrame(summary_dict).T

summary_df

Unnamed: 0,Unique Count,Unique Values
age,41,"[63.0, 67.0, 37.0, 41.0, 56.0, 62.0, 57.0, 53...."
sex,2,"[1.0, 0.0]"
cp,4,"[1.0, 4.0, 3.0, 2.0]"
trestbps,50,"[145.0, 160.0, 120.0, 130.0, 140.0, 172.0, 150..."
chol,152,"[233.0, 286.0, 229.0, 250.0, 204.0, 236.0, 268..."
fbs,2,"[1.0, 0.0]"
restecg,3,"[2.0, 0.0, 1.0]"
thalach,91,"[150.0, 108.0, 129.0, 187.0, 172.0, 178.0, 160..."
exang,2,"[0.0, 1.0]"
oldpeak,40,"[2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 3.1, ..."


>The `ca` and `thal` columns contain the value **'?'**. We need to convert it to NA/NaN

In [6]:
print('Number of "?" values in column ca   :', (heart_df['ca'] == '?').sum())
print('Number of "?" values in column thal :', (heart_df['thal'] == '?').sum())

Number of "?" values in column ca   : 4
Number of "?" values in column thal : 2


In [7]:
# Show all rows that contain the value '?'
heart_df_unique = heart_df[heart_df.isin(['?']).any(axis=1)]
heart_df_unique

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
87,53.0,0.0,3.0,128.0,216.0,0.0,2.0,115.0,0.0,0.0,1.0,0.0,?,0
166,52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0
192,43.0,1.0,4.0,132.0,247.0,1.0,2.0,143.0,1.0,0.1,2.0,?,7.0,1
266,52.0,1.0,4.0,128.0,204.0,1.0,0.0,156.0,1.0,1.0,2.0,0.0,?,2
287,58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,?,7.0,0
302,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,?,3.0,0


In [8]:

# sns.histplot(heart_df['ca'])
# plt.show()

In [9]:
# sns.histplot(heart_df['thal'])
# plt.show()

In [10]:
# Handle missing values
def handle_missing_value(df):
    """
    Replace missing values marked with '?' by NaN.
    :param df: <pandas dataframe> input data
    :return df: <pandas dataframe> data with '?' replaced by NaN
    """
    # Replace '?' with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
    df = df.replace('?', np.nan)

    # Show the number of missing values per column
    print('Number of missing values per column:\n', df.isnull().sum())

    return df

In [11]:
# Call the function to handle missing values
heart_df = handle_missing_value(heart_df)

Number of missing values per column:
 age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
target      0
dtype: int64


In [12]:
heart_df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [13]:
# sns.histplot(heart_df['ca'])
# plt.show()

In [14]:
# sns.histplot(heart_df['thal'])
# plt.show()

In [15]:
print('Unique values in column ca   :', heart_df['ca'].unique())
print('Unique values in column thal :', heart_df['thal'].unique())

Unique values in column ca   : ['0.0' '3.0' '2.0' '1.0' nan]
Unique values in column thal : ['6.0' '3.0' '7.0' nan]
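
The unique values printed above are strings such as `'0.0'`, so `ca` and `thal` are still object columns after the `'?'` replacement. A minimal sketch of converting them to floats with `pd.to_numeric` follows, on a hypothetical toy frame; the notebook does not perform this step and instead relies on the later imputer to coerce the values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the string-coded columns
toy = pd.DataFrame({"ca": ["0.0", "3.0", "?"], "thal": ["6.0", "?", "7.0"]})

# Mark '?' as missing, then convert the remaining strings to floats
toy = toy.replace("?", np.nan)
toy[["ca", "thal"]] = toy[["ca", "thal"]].apply(pd.to_numeric)

print(toy.dtypes)
```

With numeric dtypes, `describe()` and the plotting calls also include these two columns.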


## Data Visualization

In [16]:
# sns.countplot(data=heart_df, x='sex', hue='sex')
# plt.show()

Insight:
- Male = 1, Female = 0
- There are more men than women in the data

In [17]:
# sns.countplot(data=heart_df, x='cp')
# plt.show()

## <b><font color='orange'> 2. Data Preprocessing:</font></b>
---
* Input-Output Split, Train-Test Split
* Processing Categorical
* Imputation, Normalization, Drop Duplicates

### **Input-Output Split**

- `y` is the output variable, taken from the `target` column
- all other columns are inputs

Create the output data

In [18]:
# Build the output data
# Select the 'target' column as the output data
output_data = heart_df['target']

output_data.head()

0    0
1    2
2    1
3    0
4    0
Name: target, dtype: int64

**Create the input data**

- DATA = INPUT + OUTPUT
- DATA - OUTPUT = INPUT
- So if we drop the OUTPUT variable from the data, only the INPUT variables remain.

In [19]:
def extractInputOutput(data, output_column_name, column_to_drop=None):
    """
    Split the data into input and output.
    :param data: <pandas dataframe> all sample data
    :param output_column_name: <string> name of the output column
    :param column_to_drop: <list> column names to drop before splitting
    :return input_data: <pandas dataframe> input data
    :return output_data: <pandas series> output data
    """
    # Drop columns that are not needed, if any
    if column_to_drop:
        data = data.drop(columns=column_to_drop)

    # Separate the output data
    output_data = data[output_column_name]

    # Drop the output column to obtain input_data
    input_data = data.drop(columns=output_column_name)

    return input_data, output_data

# data and output_column_name are the function's parameters.
# If the function is called with data = heart_df,
# every use of `data` inside the function refers to heart_df.

In [20]:
# input_data, output_data = extractInputOutput(heart_df, 'target')
x, y = extractInputOutput(heart_df, 'target')

**Always run a sanity check**

In [21]:
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0


In [22]:
y.head()

0    0
1    2
2    1
3    0
4    0
Name: target, dtype: int64

**Check count value data**

In [23]:
# Check the number of unique values and the unique values in input_data (x)
summary_dict = {}
for i in list(x.columns):
    summary_dict[i] = {
        'Unique Count': x[i].value_counts().shape[0],
        'Unique Values': x[i].unique()
    }
summary_df = pd.DataFrame(summary_dict).T

summary_df

Unnamed: 0,Unique Count,Unique Values
age,41,"[63.0, 67.0, 37.0, 41.0, 56.0, 62.0, 57.0, 53...."
sex,2,"[1.0, 0.0]"
cp,4,"[1.0, 4.0, 3.0, 2.0]"
trestbps,50,"[145.0, 160.0, 120.0, 130.0, 140.0, 172.0, 150..."
chol,152,"[233.0, 286.0, 229.0, 250.0, 204.0, 236.0, 268..."
fbs,2,"[1.0, 0.0]"
restecg,3,"[2.0, 0.0, 1.0]"
thalach,91,"[150.0, 108.0, 129.0, 187.0, 172.0, 178.0, 160..."
exang,2,"[0.0, 1.0]"
oldpeak,40,"[2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 3.1, ..."


### **Train-Test Split**

- **Why?**
  - To avoid overfitting to the training data
  - The test data stands in for future, unseen data
  - We train the ML model on the training data, with CV (cross-validation)
  - Then we evaluate it on the test data

In [24]:
# Import the train-test split function from sklearn
from sklearn.model_selection import train_test_split

**Train Test Split Function**
1. `x` is the input
2. `y` is the output (target)
3. `test_size` is the proportion of the data held out for testing. For example, `test_size = 0.2` puts 20% of the rows in the test set.
4. `random_state` is the seed for the random shuffle. Keep it fixed so the split is reproducible, e.g. `random_state = 12`.
5. Output:
   - `x_train` = input of the training data
   - `x_test` = input of the test data
   - `y_train` = output of the training data
   - `y_test` = output of the test data
6. The outputs are returned in the order `x_train, x_test, y_train, y_test`. Do not swap them.

> Read more: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
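
A minimal sketch of the call on hypothetical toy arrays, showing how `test_size` determines the shapes. The `stratify` argument is an addition for illustration (it keeps class proportions equal in both splits); the notebook's own call does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, 2 balanced classes
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# 25% of the rows go to the test set; the fixed seed makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=12, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
```

Stratifying can matter for a target like this one, where the rarest class makes up only a few percent of the rows.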

In [25]:
# Train Test Split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    random_state=12)

In [26]:
print('Shape of x_train:', x_train.shape)
print('Shape of x_test :', x_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test :', y_test.shape)

Shape of x_train: (227, 13)
Shape of x_test : (76, 13)
Shape of y_train: (227,)
Shape of y_test : (76,)


In [27]:
# Ratio
x_test.shape[0] / x.shape[0]

# Result is about 0.25 --> matches test_size

0.2508250825082508

**We now have train and test data**

### **Data Imputation**

- The process of filling in missing data (NaN)
- Two cases to handle:
  - Numerical Imputation
  - Categorical Imputation

In [28]:
# Percentage of missing values per column in x_train
x_train.isnull().sum() / x_train.shape[0]*100

age         0.000000
sex         0.000000
cp          0.000000
trestbps    0.000000
chol        0.000000
fbs         0.000000
restecg     0.000000
thalach     0.000000
exang       0.000000
oldpeak     0.000000
slope       0.000000
ca          1.762115
thal        0.881057
dtype: float64

**Separate categorical and numerical data**

In [29]:
x_train.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
255,42.0,0.0,3.0,120.0,209.0,0.0,0.0,173.0,0.0,0.0,2.0,0.0,3.0
250,57.0,1.0,4.0,110.0,201.0,0.0,0.0,126.0,1.0,1.5,2.0,0.0,6.0
38,55.0,1.0,4.0,132.0,353.0,0.0,0.0,132.0,1.0,1.2,2.0,1.0,7.0
24,60.0,1.0,4.0,130.0,206.0,0.0,2.0,132.0,1.0,2.4,2.0,2.0,7.0
247,47.0,1.0,4.0,110.0,275.0,0.0,2.0,118.0,1.0,1.0,2.0,1.0,3.0


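**Categorical Data**
- sex
- cp
- fbs
- restecg
- exang

The remaining columns are treated as numerical. Note that this includes `slope` and `thal`, which the attribute table above lists as categorical.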

In [30]:
# Show the columns of x_train
x_train.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
      dtype='object')

In [31]:
# Define the numerical and categorical column lists
categorical_column = ['sex', 'cp', 'fbs', 'restecg', 'exang']

numerical_column = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak',
                    'slope', 'ca', 'thal']

In [32]:
# Show the resulting column lists
print(categorical_column)
print(numerical_column)

['sex', 'cp', 'fbs', 'restecg', 'exang']
['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca', 'thal']


In [33]:
# Select the numerical columns of x_train
x_train_numerical = x_train[numerical_column]

In [34]:
x_train_numerical.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,slope,ca,thal
255,42.0,120.0,209.0,173.0,0.0,2.0,0.0,3.0
250,57.0,110.0,201.0,126.0,1.5,2.0,0.0,6.0
38,55.0,132.0,353.0,132.0,1.2,2.0,1.0,7.0
24,60.0,130.0,206.0,132.0,2.4,2.0,2.0,7.0
247,47.0,110.0,275.0,118.0,1.0,2.0,1.0,3.0


**Check whether any numerical data is missing**

In [35]:
x_train_numerical.isnull().any()


age         False
trestbps    False
chol        False
thalach     False
oldpeak     False
slope       False
ca           True
thal         True
dtype: bool

>Two columns have missing values: `ca` and `thal`

**Use sklearn's imputer for the numerical data only**

- `fit`: the imputer learns the mean or median of each column
- `transform`: fills the missing entries with that mean or median
- the output of `transform` is a NumPy array
- so rebuild a DataFrame with the same columns and index as the original data
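
The fit/transform split can be sketched on hypothetical toy frames: the imputer learns the median from the training rows only, and the same fitted object then fills the test rows. The names `train_toy` and `test_toy` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train/test frames with one missing value each
train_toy = pd.DataFrame({"ca": [0.0, 1.0, np.nan, 3.0]}, index=[10, 11, 12, 13])
test_toy = pd.DataFrame({"ca": [np.nan, 2.0]}, index=[20, 21])

# fit: learn the training median (median of 0, 1, 3 is 1)
imputer = SimpleImputer(strategy="median")
imputer.fit(train_toy)

# transform returns a NumPy array, so restore the column names and index
test_filled = pd.DataFrame(
    imputer.transform(test_toy), columns=test_toy.columns, index=test_toy.index
)

print(test_filled)
```

Keeping the fitted imputer around, rather than calling `fit_transform` inside a helper and discarding it, is what allows the test data to be filled with training statistics later.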

In [36]:
# Import the imputer
from sklearn.impute import SimpleImputer

In [37]:
def impute_missing_values(data, strategy="median"):
    """
    Impute missing values in numerical data.
    :param data: <pandas dataframe> numerical data to impute
    :param strategy: <string> imputation strategy, default "median"
                     Other options: "mean", "most_frequent", "constant"
    :return data_imputed: <pandas dataframe> numerical data with missing values imputed
    """
    # Initialise SimpleImputer with the chosen strategy
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)

    # Fit the imputer on the data and transform it
    imputed_data = imputer.fit_transform(data)

    # Convert the result back to a DataFrame
    data_imputed = pd.DataFrame(imputed_data, columns=data.columns, index=data.index)

    return data_imputed

In [38]:
# Run the function
x_train_numerical_imputed = impute_missing_values(x_train_numerical)

In [39]:
# Check for missing values after imputation
x_train_numerical_imputed.isnull().any()

age         False
trestbps    False
chol        False
thalach     False
oldpeak     False
slope       False
ca          False
thal        False
dtype: bool

>After imputation, no missing values remain

In [40]:
# Select the categorical columns of x_train
x_train_categorical = x_train[categorical_column]

In [41]:
x_train_categorical.head()

Unnamed: 0,sex,cp,fbs,restecg,exang
255,0.0,3.0,0.0,0.0,0.0
250,1.0,4.0,0.0,0.0,1.0
38,1.0,4.0,0.0,0.0,1.0
24,1.0,4.0,0.0,2.0,1.0
247,1.0,4.0,0.0,2.0,1.0


In [42]:
# Check for missing values in the categorical data
x_train_categorical.isnull().sum()

sex        0
cp         0
fbs        0
restecg    0
exang      0
dtype: int64

>There are no missing values in the categorical data

**Preprocessing Categorical Variables**
- Categorical data cannot be fed to the model unless it is converted to numerical form
- Solution --> One Hot Encoding (OHE)

In [43]:
def extractCategorical(data, categorical_column):
    """
    Encode categorical data with One Hot Encoding (OHE).
    :param data: <pandas dataframe> sample data
    :param categorical_column: <list> categorical column names
    :return categorical_ohe: <pandas dataframe> encoded categorical columns
    """
    # Encode the categorical columns of `data` (not the global x_train_categorical).
    # Note: by default pd.get_dummies only encodes object, string, or category columns,
    # so these float-coded columns pass through unchanged.
    categorical_ohe = pd.get_dummies(data[categorical_column])

    return categorical_ohe

In [44]:
# Call the function to perform the encoding
x_train_categorical_ohe = extractCategorical(data=x_train_categorical, categorical_column=categorical_column)

In [45]:
x_train_categorical_ohe.head()

Unnamed: 0,sex,cp,fbs,restecg,exang
255,0.0,3.0,0.0,0.0,0.0
250,1.0,4.0,0.0,0.0,1.0
38,1.0,4.0,0.0,0.0,1.0
24,1.0,4.0,0.0,2.0,1.0
247,1.0,4.0,0.0,2.0,1.0
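
The output above is identical to the input because `pd.get_dummies` leaves numeric columns untouched unless told which ones to encode. A minimal sketch on a hypothetical toy frame shows the difference when the `columns` argument is passed; this is an alternative for illustration, not what the notebook does.

```python
import pandas as pd

# Hypothetical toy frame with float-coded categories, as in the notebook
toy = pd.DataFrame({"sex": [0.0, 1.0, 1.0], "cp": [3.0, 4.0, 1.0]})

# Default call: numeric columns pass through unchanged
unchanged = pd.get_dummies(toy)

# Naming the columns forces one indicator column per category value
encoded = pd.get_dummies(toy, columns=["sex", "cp"])

print(list(unchanged.columns))
print(list(encoded.columns))
```

If the encoding is switched on, the saved training columns become important: a test split may lack some category values, so its columns have to be aligned to the training ones.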


In [46]:
# Save the OHE columns so they can be applied to the test data
# and keep its shape consistent
ohe_columns = x_train_categorical_ohe.columns

### Join Numerical and Categorical Data
- The numerical and categorical data must be joined back together
- Join them with `pd.concat`

In [47]:
# Join the imputed numerical data with the encoded categorical data
x_train_concat = pd.concat([x_train_numerical_imputed, x_train_categorical_ohe], axis=1)

In [48]:
# Show the result
x_train_concat.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,slope,ca,thal,sex,cp,fbs,restecg,exang
255,42.0,120.0,209.0,173.0,0.0,2.0,0.0,3.0,0.0,3.0,0.0,0.0,0.0
250,57.0,110.0,201.0,126.0,1.5,2.0,0.0,6.0,1.0,4.0,0.0,0.0,1.0
38,55.0,132.0,353.0,132.0,1.2,2.0,1.0,7.0,1.0,4.0,0.0,0.0,1.0
24,60.0,130.0,206.0,132.0,2.4,2.0,2.0,7.0,1.0,4.0,0.0,2.0,1.0
247,47.0,110.0,275.0,118.0,1.0,2.0,1.0,3.0,1.0,4.0,0.0,2.0,1.0


In [49]:
x_train_concat.isnull().sum()

age         0
trestbps    0
chol        0
thalach     0
oldpeak     0
slope       0
ca          0
thal        0
sex         0
cp          0
fbs         0
restecg     0
exang       0
dtype: int64

>No missing values after joining the numerical and categorical data

### Standardizing Variables
- Puts the input variables on the same scale
- `fit`: the scaler learns the mean and standard deviation of each column
- `transform`: replaces the values with their standardized versions
- the output of `transform` is a NumPy array, which is wrapped back into a pandas DataFrame
- the fitted scaler is returned as well, because it is reused on the test data
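
As a small worked check on a hypothetical toy column, `StandardScaler` computes `(x - mean) / std` using the population standard deviation (`ddof=0`). pandas' `std()` and `describe()` report the sample standard deviation (`ddof=1`) instead, so a standardized column shows a std of `sqrt(n / (n - 1))` rather than exactly 1; for the 227 training rows that is about 1.00221.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column
toy = pd.DataFrame({"age": [42.0, 57.0, 55.0, 60.0, 47.0]})

# StandardScaler result versus the manual z-score with population std
scaled = StandardScaler().fit_transform(toy)
manual = ((toy - toy.mean()) / toy.std(ddof=0)).to_numpy()

scaled_df = pd.DataFrame(scaled, columns=toy.columns, index=toy.index)

print(np.allclose(scaled, manual))
print(scaled_df["age"].std())  # sample std: sqrt(5 / 4), not exactly 1
```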


In [50]:
from sklearn.preprocessing import StandardScaler

def standardizerData(data):
    """
    Standardize the data.
    :param data: <pandas dataframe> sample data
    :return standarized_data: <pandas dataframe> standardized sample data
    :return standarizer: <sklearn StandardScaler> fitted scaler
    """
    data_columns = data.columns # keep the column names
    data_index = data.index # keep the index

    # Create and fit the scaler
    standarizer = StandardScaler()
    standarizer.fit(data)

    # Transform the data and wrap the array back into a DataFrame
    standarized_data_raw = standarizer.transform(data)
    standarized_data = pd.DataFrame(standarized_data_raw)
    standarized_data.columns = data_columns
    standarized_data.index = data_index

    return standarized_data, standarizer

In [51]:
# Run the function
x_train_clean, standardizer = standardizerData(data=x_train_concat)

In [52]:
x_train_clean.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,slope,ca,thal,sex,cp,fbs,restecg,exang
255,-1.367549,-0.611322,-0.734267,1.003189,-0.879338,0.649338,-0.702793,-0.946333,-1.545335,-0.15253,-0.419721,-1.00002,-0.674632
250,0.322039,-1.181688,-0.883387,-1.107861,0.355871,0.649338,-0.702793,0.59174,0.647109,0.896689,-0.419721,-1.00002,1.48229
38,0.096761,0.073117,1.949889,-0.838366,0.108829,0.649338,0.333142,1.104431,0.647109,0.896689,-0.419721,-1.00002,1.48229
24,0.659957,-0.040956,-0.790187,-0.838366,1.096996,0.649338,1.369078,1.104431,0.647109,0.896689,-0.419721,1.008869,1.48229
247,-0.804353,-1.181688,0.495971,-1.467189,-0.055866,0.649338,0.333142,-0.946333,0.647109,0.896689,-0.419721,1.008869,1.48229


In [53]:
# Check for missing values
x_train_clean.isnull().sum()

age         0
trestbps    0
chol        0
thalach     0
oldpeak     0
slope       0
ca          0
thal        0
sex         0
cp          0
fbs         0
restecg     0
exang       0
dtype: int64

>There are no missing values in the standardized data

In [82]:
x_train_clean.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,227.0,-1.242276e-16,1.00221,-2.831859,-0.804353,0.096761,0.659957,2.574824
trestbps,227.0,-6.67112e-16,1.00221,-2.094274,-0.611322,-0.040956,0.52941,3.951608
chol,227.0,4.401765e-17,1.00221,-2.281385,-0.687667,-0.100508,0.579851,5.882923
thalach,227.0,6.377669e-16,1.00221,-3.57824,-0.523954,0.10487,0.756152,2.305753
oldpeak,227.0,-8.020995000000001e-17,1.00221,-0.879338,-0.879338,-0.22056,0.602913,4.226193
slope,227.0,-1.5650720000000002e-17,1.00221,-0.970439,-0.970439,0.649338,0.649338,2.269114
ca,227.0,-2.5432420000000003e-17,1.00221,-0.702793,-0.702793,-0.702793,0.333142,2.405014
thal,227.0,-2.034594e-16,1.00221,-0.946333,-0.946333,-0.946333,1.104431,1.104431
sex,227.0,-8.999165000000001e-17,1.00221,-1.545335,-1.545335,0.647109,0.647109,0.647109
cp,227.0,-4.401765e-18,1.00221,-2.250966,-0.15253,-0.15253,0.896689,0.896689


## <b><font color='orange'> 3. Training Machine Learning </font></b>

* Choose score to optimize and Hyperparameter Space
* Cross-Validation: Random vs Grid Search CV
* We have to beat the benchmark

### **Benchmark / Baseline**
- The baseline used for evaluation later
- Because this is classification, we can take it from the proportion of the largest target class
- In other words, predict class 0 (no disease) for every patient, without any modeling

In [54]:
y_train.value_counts(normalize=True)

target
0    0.528634
1    0.198238
3    0.123348
2    0.110132
4    0.039648
Name: proportion, dtype: float64
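
The majority-class baseline can also be expressed as a model. The sketch below uses scikit-learn's `DummyClassifier` on hypothetical toy labels; on the notebook's `y_train` the same approach would score the proportion of class 0 shown above, about 0.53.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: class 0 is the majority (5 of 8)
y_toy = np.array([0, 0, 0, 1, 2, 0, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

print(baseline.score(X_toy, y_toy))
```

Any trained model should score clearly above this number to justify its use.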

### 1. Import Model
- We will use 3 ML models for classification:
    - K-nearest neighbor (K-NN)
    - Logistic Regression
    - Random Forest

In [55]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier


### 2. Fitting Model
- Fit / train each model following its documentation

In [56]:
# K-nearest neighbor (KNN) model
knn = KNeighborsClassifier()
knn.fit(x_train_clean, y_train)

KNeighborsClassifier()

In [57]:
# Model Logistic Regression
logreg = LogisticRegression(random_state=12)
logreg.fit(x_train_clean, y_train)

LogisticRegression(random_state=12)

In [58]:
# Model Random Forest
random_forest = RandomForestClassifier(random_state=12)
random_forest.fit(x_train_clean, y_train)

RandomForestClassifier(random_state=12)

In [59]:
# Random Forest Classifier 1
# Change a random forest hyperparameter --> n_estimators
# Set n_estimators = 200
random_forest_1 = RandomForestClassifier(random_state=12, n_estimators=200)
random_forest_1.fit(x_train_clean, y_train)

RandomForestClassifier(n_estimators=200, random_state=12)

### 3. Prediction
- Time to make predictions

In [60]:
# KNN predictions
predicted_knn = pd.DataFrame(knn.predict(x_train_clean))
predicted_knn.head()

Unnamed: 0,0
0,0
1,2
2,2
3,1
4,1


In [61]:
# Logistic Regression predictions
predicted_logreg = pd.DataFrame(logreg.predict(x_train_clean))
predicted_logreg.head()

Unnamed: 0,0
0,0
1,0
2,2
3,3
4,1


In [62]:
# Random Forest predictions
predicted_random_forest = pd.DataFrame(random_forest.predict(x_train_clean))
predicted_random_forest.head()

Unnamed: 0,0
0,0
1,0
2,3
3,4
4,1


In [63]:
predicted_random_forest_1 = pd.DataFrame(random_forest_1.predict(x_train_clean))
predicted_random_forest_1.head()

Unnamed: 0,0
0,0
1,0
2,3
3,4
4,1


### 4. Check model performance on the training data

In [64]:
benchmark = y_train.value_counts(normalize=True)
benchmark

target
0    0.528634
1    0.198238
3    0.123348
2    0.110132
4    0.039648
Name: proportion, dtype: float64

In [65]:
# Check the accuracy of each model
print('Accuracy KNN                 :',knn.score(x_train_clean, y_train))
print('Accuracy Logistic Regression :',logreg.score(x_train_clean, y_train))
print('Accuracy Random Forest       :',random_forest.score(x_train_clean, y_train))
print('Accuracy Random Forest 1     :',random_forest_1.score(x_train_clean, y_train))

Accuracy KNN                 : 0.6960352422907489
Accuracy Logistic Regression : 0.6828193832599119
Accuracy Random Forest       : 1.0
Accuracy Random Forest 1     : 1.0
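
A training accuracy of 1.0 for both random forests suggests the trees have memorized the training rows, so these scores say little about unseen data. Cross-validation, mentioned earlier in the workflow, gives a more honest estimate. The sketch below runs a small grid search on synthetic data from `make_classification`; the grid values and the data are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (x_train_clean, y_train)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=12
)

# Hypothetical hyperparameter space: number of neighbours
param_grid = {"n_neighbors": [3, 5, 7, 9]}

# 5-fold CV: each candidate is scored on folds it was not trained on
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

`RandomizedSearchCV` has the same interface and samples a fixed number of candidates instead of trying every combination, which helps when the space is large.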


### 5. Test Prediction
1. Prepare the test dataset
2. Apply the same preprocessing that was applied to the training dataset
3. Use the `numerical_imputer` and `standardizer` that were fitted on the training dataset
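
One way to guarantee that the test data goes through exactly the transformers fitted on the training data is scikit-learn's `Pipeline`. The sketch below is an alternative to the notebook's hand-written helper, shown on hypothetical toy arrays.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy train/test arrays with missing values
X_train_toy = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test_toy = np.array([[np.nan, 20.0]])

# Impute, then scale; both steps are fitted on the training data only
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
prep.fit(X_train_toy)

# The test row is filled and scaled with training medians, means, and stds
X_test_prepared = prep.transform(X_test_toy)

print(X_test_prepared)
```

Here the missing test value in the first column is filled with the training median 2.5, which equals the training mean, so it standardizes to 0.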

In [66]:
# Check for missing values in x_test
x_test.isnull().any()

age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
slope       False
ca          False
thal        False
dtype: bool

In [67]:
# Check for missing values in y_test
y_test.isnull().any()

False

In [68]:
# Fit the numerical imputer on the training data only
numerical_imputer = SimpleImputer(strategy='median')
numerical_imputer.fit(x_train_numerical)

# Function to extract & clean the test data
def extractTest(data, numerical_column, categorical_column, ohe_columns,
                numerical_imputer, standardizer):
    """
    Extract and clean the test data.
    :param data: <pandas dataframe> sample dataset
    :param numerical_column: <list> numerical columns
    :param categorical_column: <list> categorical columns
    :param ohe_columns: <list> one-hot-encoded columns from the training data
    :param numerical_imputer: <sklearn SimpleImputer> imputer fitted on the training data
    :param standardizer: <sklearn StandardScaler> scaler fitted on the training data
    :return cleaned_data: <pandas dataframe> final data
    """
    # Filter data
    numerical_data = data[numerical_column]
    categorical_data = data[categorical_column]

    # Process the numerical data
    numerical_data = pd.DataFrame(numerical_imputer.transform(numerical_data))
    numerical_data.columns = numerical_column
    numerical_data.index = data.index

    # Process the categorical data
    categorical_data = pd.get_dummies(categorical_data)
    # reindex returns a new DataFrame, so assign it back;
    # training columns absent from the test data are filled with 0
    categorical_data = categorical_data.reindex(columns=ohe_columns, fill_value=0)

    # Join the data and apply the scaler fitted on the training data
    concat_data = pd.concat([numerical_data, categorical_data], axis=1)
    cleaned_data = pd.DataFrame(standardizer.transform(concat_data))
    cleaned_data.columns = concat_data.columns
    # The row index is left as the default 0..n-1

    return cleaned_data

In [69]:
# Run the function
x_test_clean = extractTest(data=x_test,
                           numerical_column=numerical_column,
                           categorical_column=categorical_column,
                           ohe_columns=ohe_columns,
                           numerical_imputer = numerical_imputer,
                           standardizer=standardizer)

In [70]:
x_test_clean.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,slope,ca,thal,sex,cp,fbs,restecg,exang
0,0.885235,-0.040956,-0.324188,-0.209542,0.602913,0.649338,2.405014,1.104431,0.647109,-0.15253,-0.419721,-1.00002,-0.674632
1,-1.142271,0.52941,-0.249628,1.317601,-0.879338,-0.970439,-0.702793,-0.946333,0.647109,-0.15253,-0.419721,1.008869,-0.674632
2,1.223153,1.670143,2.080369,0.015038,-0.22056,-0.970439,-0.702793,-0.946333,-1.545335,-0.15253,-0.419721,1.008869,-0.674632
3,2.236906,-0.611322,0.384131,-1.332441,-0.714644,-0.970439,0.333142,-0.946333,-1.545335,-1.201748,-0.419721,1.008869,1.48229
4,0.772596,0.187191,-0.268268,-0.254458,1.261691,0.649338,1.369078,-0.946333,0.647109,-2.250966,-0.419721,-1.00002,-0.674632


In [71]:
y_test.head()

92     0
85     0
75     0
233    0
243    2
Name: target, dtype: int64

In [72]:
# Check for NaN and infinity in the test data after standardization
print('Any NaN in test data     :',np.any(np.isnan(x_test_clean)))
print('Any infinity in test data:',np.any(np.isinf(x_test_clean)))

Any NaN in test data     : False
Any infinity in test data: False


In [73]:
x_test_clean.isna().sum()

age         0
trestbps    0
chol        0
thalach     0
oldpeak     0
slope       0
ca          0
thal        0
sex         0
cp          0
fbs         0
restecg     0
exang       0
dtype: int64

In [74]:
x_test_clean[x_test_clean.isna().any(axis=1)]

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,slope,ca,thal,sex,cp,fbs,restecg,exang


In [75]:
x_test_clean.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,76.0,0.133813,1.064686,-2.268663,-0.607235,0.2094,0.885235,2.462184
trestbps,76.0,0.220962,0.996915,-1.752055,-0.611322,0.301264,0.900148,2.468656
chol,76.0,-0.126261,0.843741,-1.647626,-0.720287,-0.258948,0.384131,3.142847
thalach,76.0,-0.189448,1.094019,-2.724837,-1.040488,0.082412,0.55403,1.632013
oldpeak,76.0,-0.092705,0.80486,-0.879338,-0.879338,-0.302907,0.355871,2.085164
slope,76.0,0.009952,0.992486,-0.970439,-0.970439,0.649338,0.649338,2.269114
ca,76.0,-0.062149,0.862043,-0.702793,-0.702793,-0.702793,0.333142,2.405014
thal,76.0,-0.251502,0.949808,-0.946333,-0.946333,-0.946333,1.104431,1.104431
sex,76.0,-0.218329,1.078774,-1.545335,-1.545335,0.647109,0.647109,0.647109
cp,76.0,0.054553,1.028301,-2.250966,-0.15253,0.896689,0.896689,0.896689


In [76]:
# def impute_missing_values_data_test(data, strategy="mean"):
#     """
#     Impute missing values in numeric data.
#     param data: <pandas dataframe> Numeric data to impute
#     param strategy: <string> Imputation strategy, default is "mean"
#                     Other options: "median", "most_frequent", "constant"
#     return data_imputed: <pandas dataframe> Numeric data with missing values imputed
#     """
#     # Initialize SimpleImputer with the chosen strategy
#     imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)

#     # Fit the imputer on the data and transform it
#     imputed_data = imputer.fit_transform(data)

#     # Convert the transformed array back to a dataframe
#     data_imputed = pd.DataFrame(imputed_data, columns=data.columns, index=data.index)

#     return data_imputed

In [77]:
# x_test_clean = impute_missing_values_data_test(x_test_clean)

In [78]:
# x_test_clean.isna().sum()
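
If imputation is needed on the test set, one common approach is to fit the imputer on the training data and only transform the test data, so that no test-set statistics leak into preprocessing. The sketch below assumes scikit-learn's `SimpleImputer` and small made-up frames; `impute_with_train_statistics` is a hypothetical helper, not a function from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_with_train_statistics(train, test, strategy="median"):
    """Fit the imputer on train only, then fill both splits with those statistics."""
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    train_imputed = pd.DataFrame(
        imputer.fit_transform(train), columns=train.columns, index=train.index
    )
    test_imputed = pd.DataFrame(
        imputer.transform(test), columns=test.columns, index=test.index
    )
    return train_imputed, test_imputed


# Made-up values for illustration only (assumption)
train_demo = pd.DataFrame(
    {"chol": [200.0, 240.0, np.nan, 260.0], "thalach": [150.0, np.nan, 170.0, 160.0]}
)
test_demo = pd.DataFrame(
    {"chol": [np.nan, 230.0], "thalach": [155.0, np.nan]}, index=[10, 11]
)

train_filled, test_filled = impute_with_train_statistics(train_demo, test_demo)
```

Here the missing test values are filled with the training medians, and the original row index of each frame is preserved.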

In [79]:
# Predict on the test data
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# NOTE: standardizerData is called here without a previously fitted scaler, so it
# presumably fits new statistics on x_test; reusing the training scaler would avoid leakage.
x_test_clean, _ = standardizerData(data=x_test)
y_pred_knn = knn.predict(x_test_clean)
y_pred_logreg = logreg.predict(x_test_clean)
y_pred_rf = random_forest.predict(x_test_clean)
y_pred_rf_1 = random_forest_1.predict(x_test_clean)

# Compute evaluation metrics on the test data
print('Evaluation on test data:')
print('KNN:')
print('Accuracy     :', accuracy_score(y_test, y_pred_knn))
print('Precision    :', precision_score(y_test, y_pred_knn, average='weighted'))
print('Recall       :', recall_score(y_test, y_pred_knn, average='weighted'))
print('F1-score     :', f1_score(y_test, y_pred_knn, average='weighted'))

print('\nLogistic Regression:')
print('Accuracy     :', accuracy_score(y_test, y_pred_logreg))
print('Precision    :', precision_score(y_test, y_pred_logreg, average='weighted'))
print('Recall       :', recall_score(y_test, y_pred_logreg, average='weighted'))
print('F1-score     :', f1_score(y_test, y_pred_logreg, average='weighted'))

print('\nRandom Forest:')
print('Accuracy     :', accuracy_score(y_test, y_pred_rf))
print('Precision    :', precision_score(y_test, y_pred_rf, average='weighted'))
print('Recall       :', recall_score(y_test, y_pred_rf, average='weighted'))
print('F1-score     :', f1_score(y_test, y_pred_rf, average='weighted'))

print('\nRandom Forest 1:')
print('Accuracy     :', accuracy_score(y_test, y_pred_rf_1))
print('Precision    :', precision_score(y_test, y_pred_rf_1, average='weighted'))
print('Recall       :', recall_score(y_test, y_pred_rf_1, average='weighted'))
print('F1-score     :', f1_score(y_test, y_pred_rf_1, average='weighted'))

Evaluation on test data:
KNN:
Accuracy     : 0.4868421052631579
Precision    : 0.3626545176500574
Recall       : 0.4868421052631579
F1-score     : 0.41566598535172883

Logistic Regression:
Accuracy     : 0.40789473684210525
Precision    : 0.4415935672514621
Recall       : 0.40789473684210525
F1-score     : 0.4128158233421391

Random Forest:
Accuracy     : 0.5394736842105263
Precision    : 0.43568342498036133
Recall       : 0.5394736842105263
F1-score     : 0.4466389466389467

Random Forest 1:
Accuracy     : 0.5526315789473685
Precision    : 0.43993846556690236
Recall       : 0.5526315789473685
F1-score     : 0.45597393952657117
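
In every block above, the weighted recall equals the accuracy. That follows from the definition: with `average='weighted'`, each class's recall is weighted by its support, so the supports cancel and the sum reduces to correct predictions divided by total samples. The pure-Python sketch below illustrates this with made-up labels (not the notebook's data); `weighted_scores` is a hypothetical helper that mirrors the definition rather than scikit-learn's implementation.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Support-weighted precision and recall, plus accuracy, from the definitions."""
    n = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = 0.0
    recall = 0.0
    for label, count in support.items():
        weight = count / n
        # A class that is never predicted contributes a precision of 0
        class_precision = correct[label] / predicted[label] if predicted[label] else 0.0
        class_recall = correct[label] / count
        precision += weight * class_precision
        recall += weight * class_recall

    accuracy = sum(correct.values()) / n
    return precision, recall, accuracy


# Made-up multi-class labels in the same 0..4 range as the target column
y_true_demo = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4]
y_pred_demo = [0, 0, 0, 1, 1, 0, 0, 2, 0, 0]
precision, recall, accuracy = weighted_scores(y_true_demo, y_pred_demo)
```

In this toy case classes 3 and 4 are never predicted, which pulls the weighted precision below the weighted recall. A similar effect may be behind the gap between precision and recall in the results above.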


**Feature Exploration and Selection:**

In [80]:
# # Feature correlation analysis
# import matplotlib.pyplot as plt
# import seaborn as sns
# # x_train_clean has no 'target' column, so join y_train before computing correlations
# corr_matrix = x_train_clean.join(y_train).corr()
# plt.figure(figsize=(12, 10))
# sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
# plt.title('Correlation Heatmap')
# plt.show()

# # Select features by their correlation with the target
# target_corr = corr_matrix['target'].drop('target')
# high_corr_features = target_corr.index[target_corr.abs() > 0.3]
# x_train_selected = x_train_clean[high_corr_features]
# x_test_selected = x_test_clean[high_corr_features]
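
Because `x_train_clean` holds only features, the target has to be brought in before its correlation with each feature can be read off. A minimal sketch on synthetic data, assuming pandas: the column names and the helper `select_by_target_correlation` are illustrative, and the 0.3 threshold is taken from the cell above.

```python
import numpy as np
import pandas as pd


def select_by_target_correlation(features, target, threshold=0.3):
    """Return feature names whose absolute Pearson correlation with the target exceeds the threshold."""
    corr_with_target = features.corrwith(target).abs()
    return corr_with_target[corr_with_target > threshold].index.tolist()


# Synthetic data: one column tracks the target, the other is pure noise (assumption)
rng = np.random.default_rng(42)
n = 300
signal = rng.normal(size=n)
features_demo = pd.DataFrame(
    {
        "informative": signal + 0.3 * rng.normal(size=n),
        "noise": rng.normal(size=n),
    }
)
target_demo = pd.Series(signal, name="target")  # continuous stand-in for a label

selected = select_by_target_correlation(features_demo, target_demo)
```

The same list of names can then be used to index both the training and the test frame, so that both splits keep identical columns.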

**Hyperparameter Tuning with Grid Search CV:**

In [81]:
# # Hyperparameter tuning with Grid Search CV:
# from sklearn.model_selection import GridSearchCV

# # Tune KNN hyperparameters
# param_grid_knn = {'n_neighbors': [3,5,7,9,11]}
# knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5)
# knn_grid.fit(x_train_clean, y_train)
# print('Best parameters for KNN:', knn_grid.best_params_)