### Chefboost
Chefboost is a Python library for building and training decision tree models. A decision tree is a predictive model that represents a decision process as a series of logical rules. Each internal node of the tree tests a feature (attribute), each branch corresponds to a rule applied to that feature, and each leaf holds the resulting decision.

In [1]:
import pandas as pd
from chefboost import Chefboost as chef
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('golf.txt')
df

Unnamed: 0,Outlook,Temp.,Humidity,Wind,Decision
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


In [3]:
config = {'algorithm': 'ID3'}
model = chef.fit(df, config)

[INFO]:  1 CPU cores will be allocated in parallel running
ID3  tree is going to be built...
-------------------------
finished in  5.946926832199097  seconds
-------------------------
Evaluate  train set
-------------------------
Accuracy:  100.0 % on  14  instances
Labels:  ['No' 'Yes']
Confusion matrix:  [[5, 0], [0, 9]]
Precision:  100.0 %, Recall:  100.0 %, F1:  100.0 %


In [4]:
# obj[0]: Outlook, obj[1]: Temp., obj[2]: Humidity, obj[3]: Wind
def findDecision(obj):
    # {"feature": "Outlook", "instances": 14, "metric_value": 0.9403, "depth": 1}
    if obj[0] == 'Rain':
        # {"feature": "Wind", "instances": 5, "metric_value": 0.971, "depth": 2}
        if obj[3] == 'Weak':
            return 'Yes'
        elif obj[3] == 'Strong':
            return 'No'
        else:
            return 'No'
    elif obj[0] == 'Sunny':
        # {"feature": "Humidity", "instances": 5, "metric_value": 0.971, "depth": 2}
        if obj[2] == 'High':
            return 'No'
        elif obj[2] == 'Normal':
            return 'Yes'
        else:
            return 'Yes'
    elif obj[0] == 'Overcast':
        return 'Yes'
    else:
        return 'Yes'
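
The `metric_value` fields in the generated comments look like Shannon entropies of the class labels at each node, which is the quantity ID3 uses to choose a split. The following is a minimal sketch to check that reading, not Chefboost's own code. It assumes 9 Yes / 5 No at the root (taken from the confusion matrix above) and a 3/2 class split in each 5-instance branch (an assumption that is consistent with 0.971). The helper name `entropy` is introduced here for illustration.

```python
from math import log2

def entropy(class_counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

# Root node: 14 instances, 9 'Yes' and 5 'No' (from the confusion matrix).
root_entropy = entropy([9, 5])
# Assumed 3/2 class split inside a 5-instance branch.
branch_entropy = entropy([3, 2])

print(round(root_entropy, 4))    # 0.9403
print(round(branch_entropy, 4))  # 0.971
```

Both values match the `metric_value` entries at depth 1 and depth 2, which supports the entropy interpretation.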


In [5]:
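# Raw importance of a node = instances * metric_value,
# minus the same product for each of its child decision nodes.
# The numbers come from the comments in findDecision above.
outlook = 14 * 0.9403 - 5 * 0.971 - 5 * 0.971
wind = 5 * 0.971      # no child decision nodes to subtract
humidity = 5 * 0.971  # no child decision nodes to subtract
temperature = 0       # Temp. is never used by the tree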

In [6]:
total = outlook + wind + humidity + temperature

In [7]:
print('outlook = ', 100*outlook/total)
print('wind = ', 100*wind/total)
print('humidity = ', 100*humidity/total)
print('temperature = ', 100*temperature/total)

outlook =  26.239346105346325
wind =  36.880326947326836
humidity =  36.880326947326836
temperature =  0.0
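
The hand calculation above can be generalized to any tree. The sketch below is one possible way to do that, not Chefboost's implementation. It assumes the tree is described as nested dictionaries holding the `feature`, `instances`, and `metric_value` fields from the generated comments. The `children` key and the function names are hypothetical and introduced only for this example.

```python
from collections import defaultdict

def weighted(node):
    """Instances at a node multiplied by its metric value."""
    return node["instances"] * node["metric_value"]

def raw_importance(node, totals=None):
    """Accumulate per-feature importance over a nested-dict tree."""
    totals = defaultdict(float) if totals is None else totals
    children = node.get("children", [])
    # A node's contribution is its weighted metric minus that of its children.
    totals[node["feature"]] += weighted(node) - sum(weighted(c) for c in children)
    for child in children:
        raw_importance(child, totals)
    return totals

# Tree structure transcribed from the comments in findDecision.
tree = {
    "feature": "Outlook", "instances": 14, "metric_value": 0.9403,
    "children": [
        {"feature": "Wind", "instances": 5, "metric_value": 0.971},
        {"feature": "Humidity", "instances": 5, "metric_value": 0.971},
    ],
}

totals = raw_importance(tree)
grand_total = sum(totals.values())
percent = {feature: 100 * value / grand_total for feature, value in totals.items()}
print(percent)
```

Features that never appear in the tree, such as `Temp.` here, are simply absent from the result, which corresponds to an importance of 0.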


### Forward Feature Selection
In this case we use linear regression as the estimator for forward feature selection.
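
Forward selection starts from an empty feature set and, at each step, adds the single feature that most improves a score. The sketch below shows that greedy loop in plain Python. It is a simplified illustration, not mlxtend's implementation: `forward_select`, `toy_score`, the feature names `a` to `d`, and the toy weights are all made up for this example. In the cells that follow, the score is instead obtained by fitting a linear regression on each candidate subset.

```python
def forward_select(candidates, score_fn, k):
    """Greedily pick k features, adding the best-scoring one at each step."""
    selected = []
    remaining = list(candidates)
    history = []
    while remaining and len(selected) < k:
        # Try each remaining feature together with those already selected.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), score_fn(selected)))
    return selected, history

# Arbitrary toy weights, used only to make the example runnable.
toy_weights = {"a": 0.5, "b": 0.3, "c": 0.2, "d": 0.1}

def toy_score(subset):
    """Hypothetical score: higher is better."""
    return sum(toy_weights[f] for f in subset)

selected, history = forward_select(toy_weights, toy_score, k=3)
print(selected)
```

Because the search is greedy, it evaluates far fewer subsets than an exhaustive search, at the cost of possibly missing the best combination.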

In [9]:
df2 = pd.read_csv('golf_label_num.txt')
df2

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Decision
0,1,1,1,1,1
1,1,1,1,2,1
2,2,1,1,1,2
3,3,2,1,1,2
4,3,3,2,1,2
5,3,3,2,2,1
6,2,3,2,2,2
7,1,2,1,1,1
8,1,3,2,1,2
9,3,2,2,1,2


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Outlook   14 non-null     object
 1   Temp.     14 non-null     object
 2   Humidity  14 non-null     object
 3   Wind      14 non-null     object
 4   Decision  14 non-null     object
dtypes: object(5)
memory usage: 688.0+ bytes


In [10]:
df2.shape

(14, 5)

In [12]:
df2.isnull().sum()

Outlook        0
Temperature    0
Humidity       0
Wind           0
Decision       0
dtype: int64

In [13]:
x = df2.drop(['Decision'], axis=1)
y = df2['Decision']

In [14]:
x.shape, y.shape

((14, 4), (14,))

In [15]:
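# Workaround for mlxtend versions that still import sklearn.externals.joblib,
# which recent scikit-learn releases no longer ship: alias the standalone
# joblib package under that name before importing mlxtend.
import joblib
import sys
sys.modules['sklearn.externals.joblib'] = joblib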
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

In [17]:
lreg = LinearRegression()
feat_sec = sfs(lreg, k_features=3, forward=True, verbose=2, scoring='neg_mean_squared_error')

feat_sec = feat_sec.fit(x,y)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.4s finished

[2023-05-20 13:43:12] Features: 1/3 -- score: -0.231037037037037
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.4s finished

[2023-05-20 13:43:12] Features: 2/3 -- score: -0.25595913533631903
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.2s finished

[2023-05-20 13:43:12] Features: 3/3 -- score: -0.30242567633645
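
Because the scoring is `neg_mean_squared_error`, values closer to zero are better. In the log above the score moves further from zero with each added feature, so by this metric the single-feature subset scored best, and `k_features=3` simply forces the search to continue to three features. The short sketch below picks the best subset size from the logged scores, which are copied from the output above. The variable names are introduced here for illustration.

```python
# Scores copied from the verbose log (negative MSE, so higher is better).
logged_scores = {
    1: -0.231037037037037,
    2: -0.25595913533631903,
    3: -0.30242567633645,
}

# Subset size with the highest (least negative) score.
best_k = max(logged_scores, key=logged_scores.get)
print(best_k, logged_scores[best_k])
```

mlxtend's documentation also describes a `k_features='best'` option that lets the selector choose the best-scoring subset size itself. Whether that is preferable here depends on how many features the later steps need.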

In [18]:
feat_names = list(feat_sec.k_feature_names_)
feat_names

['Outlook', 'Temperature', 'Humidity']

In [19]:
# Take an explicit copy so that adding a column does not write to a slice of df2
# (assigning to the bare slice triggers pandas' SettingWithCopyWarning).
new_data = df2[feat_names].copy()
new_data['Decision'] = df['Decision']

new_data


Unnamed: 0,Outlook,Temperature,Humidity,Decision
0,1,1,1,No
1,1,1,1,No
2,2,1,1,Yes
3,3,2,1,Yes
4,3,3,2,Yes
5,3,3,2,No
6,2,3,2,Yes
7,1,2,1,No
8,1,3,2,Yes
9,3,2,2,Yes
