# Predicting Concrete Compressive Strength with KNN, Random Forest, and Decision Tree
- In this article we use a dataset with nine columns: eight input variables and one target, the compressive strength.
- The goal of this tutorial is to compare the prediction scores of KNN, Random Forest, and Decision Tree regressors.

## Load Library

In [1]:
# Import libraries
import pandas as pd
import numpy as np
# Feature importance for feature selection (imported here but not used below)
from sklearn.ensemble import ExtraTreesClassifier
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
# Standardization
from sklearn.preprocessing import StandardScaler
# KNN, decision tree, and random forest regressors
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# Error metrics: MSE and MAE
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Square root, used to compute RMSE
from math import sqrt

## Load Dataset

In [2]:
# Prepare the DataFrame column names used to rename the columns
col_names = ['Cement',
             'Blast',
             'FlyAsh',
             'Water',
             'Superplasticizer',
             'CoarseAggregate',
             'FineAggregate',
             'Age',
             'CompressiveStrength']
# Load the dataset from its source, an Excel file (reading .xls requires the xlrd package)
df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls', names=col_names)
# Show the first 5 rows of the data
df.head()

| | Cement | Blast | FlyAsh | Water | Superplasticizer | CoarseAggregate | FineAggregate | Age | CompressiveStrength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


In [3]:
# Keep only the samples measured at an age of 28 days
df_new = df.loc[df['Age'] == 28]
# Check the dimensions of the filtered data
df_new.shape

(425, 9)
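The filtering step above relies on a boolean mask. As a minimal sketch on a toy frame (the name `toy` and its handful of values are illustrative, loosely based on the rows shown earlier), the mask keeps the original row labels, which is why the filtered data's index is no longer consecutive:

```python
import pandas as pd

# Toy frame standing in for the concrete data (illustrative values only).
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 380.0, 266.0],
    "Age": [28, 270, 28, 28],
    "CompressiveStrength": [79.99, 40.27, 36.45, 45.85],
})

# Boolean mask: True where the sample was tested at 28 days.
mask = toy["Age"] == 28

# .copy() returns an independent frame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
toy_28 = toy.loc[mask].copy()

print(toy_28.shape)           # (3, 3)
print(toy_28.index.tolist())  # [0, 2, 3]
```

If consecutive labels are preferred, `reset_index(drop=True)` can be chained after the filter; the rest of the tutorial works on `.values`, so it is unaffected either way.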

In [4]:
# Show the first 5 rows of the filtered data
df_new.head()

| | Cement | Blast | FlyAsh | Water | Superplasticizer | CoarseAggregate | FineAggregate | Age | CompressiveStrength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 7 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 36.44777 |
| 8 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28 | 45.854291 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 39.28979 |


## Check Missing Value

In [5]:
# Check whether the data contains any missing values
df_new.isnull().values.any()

False
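The check above returns a single boolean for the whole frame. When it comes back `True`, a per-column count is usually more informative. The sketch below uses a toy frame with one deliberately missing value; the frame and its values are assumptions for illustration, not part of the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing entry (illustrative values only).
toy = pd.DataFrame({
    "Cement": [540.0, np.nan, 380.0],
    "Water": [162.0, 228.0, 228.0],
})

# Single yes/no answer for the whole frame, as in the cell above.
has_missing = bool(toy.isnull().values.any())

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

print(has_missing)                   # True
print(missing_per_column.to_dict())  # {'Cement': 1, 'Water': 0}
```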

## Train & Test Data

In [6]:
# Build the input variables (Age is constant after filtering, so it is dropped along with the target)
X = df_new.drop(['Age', 'CompressiveStrength'], axis=1).values
X.shape

(425, 7)

In [7]:
# Build the output variable
y = df_new["CompressiveStrength"].values
y

array([79.98611076, 61.88736576, 36.44776979, 45.85429086, 39.28978986,
       28.02168359, 47.81378165, 28.23748958, 37.42751518, 30.07976945,
       33.01900564, 40.85696881, 71.98818916, 61.09446836, 59.79825348,
       60.2946762 , 61.79773388, 56.69561148, 68.29949256, 66.89985628,
       60.2946762 , 50.69717028, 56.3991368 , 60.2946762 , 55.49592324,
       68.4994406 , 71.29871316, 74.69782984, 52.20022796, 71.29871316,
       67.69964844, 71.29871316, 65.99664272, 74.4978818 , 71.29871316,
       49.89737812, 24.8900836 , 22.83544512, 25.72434956, 26.40003604,
       24.90387312, 24.48329276, 28.46846404, 21.53923024, 24.24197616,
       45.70536404, 40.2309246 , 24.53845084, 30.2335226 , 29.21999288,
       31.64005364, 37.404073  , 38.50033984, 33.72916592, 29.65436276,
       32.66047812, 27.77209328, 31.2677366 , 31.11605188, 34.73580088,
       48.28400428, 39.94134468, 33.94290348, 30.84715624, 50.60064364,
       52.5035974 , 51.3314882 , 47.401475  , 40.14818748, 45.93

In [8]:
# Split X and y into training and test sets with an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [9]:
X_train.shape

(340, 7)

In [10]:
X_test.shape

(85, 7)
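The 340/85 split follows directly from `test_size=0.2` applied to 425 rows. As a rough check, the sketch below repeats the split on synthetic arrays of the same shape (`X_demo` and `y_demo` are random stand-ins, not the concrete data) and also shows that a fixed `random_state` makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the filtered data: 425 rows, 7 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(425, 7))
y_demo = rng.normal(size=425)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (340, 7) (85, 7)

# Repeating the call with the same random_state yields the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))  # True
```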

In [11]:
y_test

array([33.68779736, 17.59680647, 29.65436276, 25.17966352, 33.01900564,
       31.41942132, 24.29092896, 38.45897128, 33.7153764 , 32.40123514,
       50.69717028, 61.23581094, 44.02993736, 31.74347504, 41.94082508,
       31.87447548, 29.72606826, 27.923778  ,  9.73540112, 28.62980142,
       17.95947085, 41.68434001, 29.07313449, 61.88736576, 22.83544512,
       43.94237391, 52.5035974 , 37.43785732, 43.5748832 , 71.98818916,
       44.51946532, 33.76226077, 27.8748252 , 39.84481804, 44.38846488,
       33.0431373 , 67.86512268, 29.06830816, 38.6306508 , 19.69143456,
       19.98790924, 37.81362174, 37.43165204, 33.4195912 , 60.2946762 ,
       32.9569528 , 48.28400428, 36.80491836, 39.45595358, 37.2661778 ,
       44.86834018, 23.74417449, 47.81378165, 52.30364936, 18.19871902,
       31.1787942 , 26.200088  , 57.226508  , 25.17966352, 22.48932817,
       43.89204216, 34.73580088, 39.45181672, 39.42147978, 52.42637609,
       35.85964676, 64.0178466 , 30.88163004, 25.74503384, 36.34

## Standardization

In [12]:
# Fit the scaler on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)
test_scaled = scaler.transform(X_test)
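Standardization maps each feature to z = (x - mean) / std, where the mean and standard deviation come from the training set only. The sketch below checks this by hand on synthetic data (the arrays are random stand-ins, and their location and scale are arbitrary assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=300.0, scale=100.0, size=(340, 7))
X_te = rng.normal(loc=300.0, scale=100.0, size=(85, 7))

scaler = StandardScaler()
tr_scaled = scaler.fit_transform(X_tr)  # learns mean/std from the training data
te_scaled = scaler.transform(X_te)      # reuses the training mean/std

# Manual version: StandardScaler uses the population std (ddof=0),
# which is also NumPy's default.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

print(np.allclose(te_scaled, manual))            # True
print(np.allclose(tr_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(tr_scaled.std(axis=0), 1.0))   # True
```

The scaled test set will generally not have exactly zero mean and unit variance, because it is transformed with the training statistics. That is intended: it keeps information about the test data out of the fitting step.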

## Modeling

In [49]:
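# KNN with a single neighbor, a tree limited to depth 10, and a 3000-tree forest with out-of-bag scoring
knn_model = KNeighborsRegressor(n_neighbors=1)
tree_model = DecisionTreeRegressor(max_depth=10)
rf_model = RandomForestRegressor(n_estimators=3000, oob_score=True, random_state=100)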

In [50]:
knn_model.fit(train_scaled, y_train) 
tree_model.fit(train_scaled, y_train) 
rf_model.fit(train_scaled, y_train)

RandomForestRegressor(n_estimators=3000, oob_score=True, random_state=100)
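The forest is created with `oob_score=True`, but the out-of-bag estimate is not read anywhere in the tutorial. After fitting, scikit-learn exposes it as the `oob_score_` attribute, an R² computed from samples each tree did not see during training. A minimal sketch on synthetic data follows; the data-generating formula and the smaller `n_estimators=200` are assumptions chosen to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 7))
y_demo = 10.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

rf_demo = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=100)
rf_demo.fit(X_demo, y_demo)

train_r2 = rf_demo.score(X_demo, y_demo)  # R^2 on the data the forest was fit on
oob_r2 = rf_demo.oob_score_               # R^2 from out-of-bag predictions

print(round(train_r2, 3), round(oob_r2, 3))
```

The out-of-bag score is typically lower than the training score and gives a rough preview of test performance without touching the test set.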

## Model Evaluation

In [51]:
# Metrics on the training data; .score() returns the coefficient of determination (R^2)
knn_mse = round(mean_squared_error(y_train, knn_model.predict(train_scaled)), 3)
knn_mae = round(mean_absolute_error(y_train, knn_model.predict(train_scaled)), 3)
knn_score = round(knn_model.score(train_scaled, y_train), 3)
tree_mse = round(mean_squared_error(y_train, tree_model.predict(train_scaled)), 3)
tree_mae = round(mean_absolute_error(y_train, tree_model.predict(train_scaled)), 3)
tree_score = round(tree_model.score(train_scaled, y_train), 3)
rf_mse = round(mean_squared_error(y_train, rf_model.predict(train_scaled)), 3)
rf_mae = round(mean_absolute_error(y_train, rf_model.predict(train_scaled)), 3)
rf_score = round(rf_model.score(train_scaled, y_train), 3)

In [52]:
training_ev = [['Model', 'MSE', 'MAE', 'RMSE', 'Prediction Score'], 
               ['KNN', knn_mse, knn_mae, round(sqrt(knn_mse), 3), knn_score], 
               ['Decision Tree', tree_mse, tree_mae, round(sqrt(tree_mse), 3), tree_score], 
               ['Random Forest', rf_mse, rf_mae, round(sqrt(rf_mse), 3), rf_score]]

for row in training_ev:
    print("{: >20} {: >20} {: >20} {: >20} {: >20}".format(*row))

               Model                  MSE                  MAE                 RMSE     Prediction Score
                 KNN                0.163                 0.03                0.404                0.999
       Decision Tree                0.887                0.336                0.942                0.996
       Random Forest                4.763                 1.48                2.182                0.979


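- On the training data, KNN achieves the best prediction score (R² of 0.999), ahead of Decision Tree (0.996) and Random Forest (0.979). Because `n_neighbors=1` makes each training point its own nearest neighbor, a near-perfect training score is expected and says little about how well the model generalizes.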
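To illustrate why a training score from 1-nearest-neighbor regression is not informative, the sketch below fits KNN with `n_neighbors=1` on pure noise, where there is nothing to learn. The data are synthetic and the 160/40 split is an arbitrary assumption:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Features and target are independent noise, so no model can generalize here.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
y_demo = rng.normal(size=200)

X_tr, y_tr = X_demo[:160], y_demo[:160]
X_te, y_te = X_demo[160:], y_demo[160:]

knn1 = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

# Each training point is its own nearest neighbor, so predictions match exactly.
train_r2 = knn1.score(X_tr, y_tr)
test_r2 = knn1.score(X_te, y_te)

print(train_r2)            # 1.0
print(test_r2 < train_r2)  # True
```

The small non-zero training error reported for KNN above would be consistent with repeated mixes that have different measured strengths, though the tutorial does not check for duplicates.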

In [53]:
# The same metrics, now computed on the held-out test data
knn_test_mse = round(mean_squared_error(y_test, knn_model.predict(test_scaled)), 3)
knn_test_mae = round(mean_absolute_error(y_test, knn_model.predict(test_scaled)), 3)
knn_test_score = round(knn_model.score(test_scaled, y_test), 3)
tree_test_mse = round(mean_squared_error(y_test, tree_model.predict(test_scaled)), 3)
tree_test_mae = round(mean_absolute_error(y_test, tree_model.predict(test_scaled)), 3)
tree_test_score = round(tree_model.score(test_scaled, y_test), 3)
rf_test_mse = round(mean_squared_error(y_test, rf_model.predict(test_scaled)), 3)
rf_test_mae = round(mean_absolute_error(y_test, rf_model.predict(test_scaled)), 3)
rf_test_score = round(rf_model.score(test_scaled, y_test), 3)

In [54]:
testing_ev = [['Model', 'MSE', 'MAE', 'RMSE', 'Prediction Score'],
              ['KNN', knn_test_mse, knn_test_mae, round(sqrt(knn_test_mse), 3), knn_test_score],
              ['Decision Tree', tree_test_mse, tree_test_mae, round(sqrt(tree_test_mse), 3), tree_test_score],
              ['Random Forest', rf_test_mse, rf_test_mae, round(sqrt(rf_test_mse), 3), rf_test_score]]

for row in testing_ev:
    print("{: >20} {: >20} {: >20} {: >20} {: >20}".format(*row))

               Model                  MSE                  MAE                 RMSE     Prediction Score
                 KNN               42.968                3.972                6.555                0.724
       Decision Tree               56.793                4.508                7.536                0.636
       Random Forest                39.69                4.316                  6.3                0.745


- Although KNN scores best on the training data, Random Forest has the best prediction score on the test data (R² of 0.745, versus 0.724 for KNN and 0.636 for Decision Tree) and the lowest MSE. KNN does have a slightly lower test MAE (3.972 versus 4.316).
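The evaluation cells repeat the same four computations for every model and data split. One way to reduce the repetition is a small helper. The sketch below is a suggestion rather than part of the original notebook: `evaluate` is a hypothetical function name, and the demo data are synthetic.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, X, y):
    """Return MSE, MAE, RMSE and R^2 for a fitted regressor (hypothetical helper)."""
    pred = model.predict(X)  # predict once and reuse for every metric
    mse = mean_squared_error(y, pred)
    return {
        "MSE": round(mse, 3),
        "MAE": round(mean_absolute_error(y, pred), 3),
        "RMSE": round(sqrt(mse), 3),        # RMSE = sqrt(MSE), from the unrounded MSE
        "R2": round(r2_score(y, pred), 3),  # same quantity as model.score(X, y)
    }


# Synthetic demo data: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 7))
y_demo = 10.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

tree_demo = DecisionTreeRegressor(max_depth=10, random_state=0)
tree_demo.fit(X_demo[:160], y_demo[:160])

train_metrics = evaluate(tree_demo, X_demo[:160], y_demo[:160])
test_metrics = evaluate(tree_demo, X_demo[160:], y_demo[160:])
print(train_metrics)
print(test_metrics)
```

Unlike the cells above, this helper takes the square root of the unrounded MSE, so its RMSE can differ from theirs in the last decimal place.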