# Boosting Algorithm - HAR dataset

<br></br>

### --▶ Dataset

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/features.txt"

feature_name_df = pd.read_csv(url, sep=r'\s+', header=None, names=['column_index', 'column_name'])

X_train_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/train/X_train.txt'
X_test_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/test/X_test.txt'

X_train = pd.read_csv(X_train_url, sep=r'\s+', header=None)
X_test = pd.read_csv(X_test_url, sep=r'\s+', header=None)

X_train.columns = feature_name_df.column_name.tolist()
X_test.columns = feature_name_df.column_name.tolist()

In [2]:
y_train_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/train/y_train.txt'
y_test_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/HAR_dataset/test/y_test.txt'

y_train = pd.read_csv(y_train_url, sep=r'\s+', header=None, names=['action'])
y_test = pd.read_csv(y_test_url, sep=r'\s+', header=None, names=['action'])

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((7352, 561), (2947, 561), (7352, 1), (2947, 1))

<br></br>

## 🎫 GBM

> Gradient Boosting Machine

In [5]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time
import warnings
warnings.filterwarnings("ignore")

In [5]:
%%time

start_time = time.time()

gb_clf = GradientBoostingClassifier(random_state=13)
gb_clf.fit(X_train, y_train.values.ravel())  # 1-D labels avoid scikit-learn's column-vector warning
gb_pred = gb_clf.predict(X_test)

print('ACC :', accuracy_score(y_test, gb_pred))
print('Fit time :', time.time() - start_time)

ACC : 0.9385816084153377
Fit time : 2463.7074313163757
CPU times: total: 40min 49s
Wall time: 41min 3s


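👉 Accuracy is 93.9%, with a run time of 2463 seconds (about 41 minutes)... wow.

(The course material reported 522 seconds... my computer is really slow.)

- GBM is generally known to outperform random forest in predictive performance.

- scikit-learn's GBM is known to be very slow.<br></br>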
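
One reason training is slow is that boosting is inherently sequential: each tree is fit to the errors left by the trees before it, so the rounds cannot be built in parallel. Below is a minimal sketch of that loop for squared-error regression. It only illustrates the idea and is not scikit-learn's implementation; the synthetic data, `learning_rate`, and `n_rounds` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression data (assumption, for illustration only)
rng = np.random.default_rng(13)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the targets)
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared error, the negative gradient is simply the residual
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=13).fit(X, residual)
    # Each new tree depends on the current prediction, hence the sequential loop
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print('initial MSE :', initial_mse)
print('final MSE   :', final_mse)
```

For multi-class problems such as HAR, scikit-learn's `GradientBoostingClassifier` builds one tree per class in every round, which multiplies the work further.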

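### --▶ Finding the best model with GridSearch

- Skipped for now because it takes too long. (It took more than 45 minutes in the course material, so it would probably take over 3 hours on this machine.)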

In [None]:
from sklearn.model_selection import GridSearchCV

params = {
	'n_estimators': [100, 500],
	'learning_rate': [0.05, 0.1]
}

start_time = time.time()
grid = GridSearchCV(gb_clf, param_grid=params, cv=2, verbose=1, n_jobs=-1)
grid.fit(X_train, y_train)
print('Fit time :', time.time() - start_time)
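
For a rough sense of why this search is expensive, the number of model fits can be counted directly from the grid. The sketch below uses only the standard library and mirrors the `params` and `cv=2` values from the cell above. The `+ 1` assumes `GridSearchCV`'s default `refit=True`, and `n_classes = 6` is the number of HAR activity labels.

```python
from itertools import product

params = {
    'n_estimators': [100, 500],
    'learning_rate': [0.05, 0.1],
}
cv = 2
n_classes = 6

# Every combination of parameter values is one candidate model
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]

# Each candidate is fit once per fold, plus one final refit of the best candidate
n_fits = len(candidates) * cv + 1

# Boosting rounds spent during cross-validation alone
cv_rounds = sum(c['n_estimators'] for c in candidates) * cv

# scikit-learn's GBM builds one tree per class per round for multi-class targets
cv_trees = cv_rounds * n_classes

print('candidates :', len(candidates))
print('fits       :', n_fits)
print('CV rounds  :', cv_rounds)
print('CV trees   :', cv_trees)
```

Each cross-validation fit sees only half of the training rows, so this count is not a direct time estimate, but it shows how quickly the work grows compared with the single 100-round fit above.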

In [None]:
grid.best_score_

In [None]:
grid.best_params_

In [None]:
# Performance on the test data

accuracy_score(y_test, grid.best_estimator_.predict(X_test))

<br></br>


## 🎫 XGBoost

> eXtreme Gradient Boosting

</br>

### --▶ `xgboost` install

- If `pip install xgboost` fails, try `conda install py-xgboost`.

In [7]:
#!pip install xgboost

Collecting xgboost
  Downloading xgboost-2.0.3-py3-none-win_amd64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.3-py3-none-win_amd64.whl (99.8 MB)

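### --▶ Key XGB params

📍 Key parameters

- `nthread` : number of CPU threads to use. By default, all available CPU threads are used.

- `eta` : the learning rate, as in GBM

- `num_boost_round` : number of boosting rounds; the same as `n_estimators`

- `max_depth` : maximum depth of each tree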
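
The names above come from XGBoost's native training API, while the cells in this notebook use the scikit-learn wrapper `XGBClassifier`, which spells some of them differently. Here is a small sketch of the correspondence; `to_sklearn_kwargs` is a hypothetical helper written for this illustration and is not part of XGBoost.

```python
# Native-API name -> scikit-learn wrapper name.
# (`num_boost_round` is an argument of `xgboost.train`, not a key in its params dict.)
NATIVE_TO_SKLEARN = {
    'nthread': 'n_jobs',
    'eta': 'learning_rate',
    'num_boost_round': 'n_estimators',
    'max_depth': 'max_depth',
}

def to_sklearn_kwargs(native_params):
    """Rename native-style parameter names to their wrapper equivalents."""
    return {NATIVE_TO_SKLEARN.get(name, name): value
            for name, value in native_params.items()}

kwargs = to_sklearn_kwargs({'eta': 0.1, 'num_boost_round': 400, 'max_depth': 3})
print(kwargs)  # {'learning_rate': 0.1, 'n_estimators': 400, 'max_depth': 3}
```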

In [10]:
X_train.values

array([[ 0.28858451, -0.02029417, -0.13290514, ..., -0.84124676,
         0.17994061, -0.05862692],
       [ 0.27841883, -0.01641057, -0.12352019, ..., -0.8447876 ,
         0.18028889, -0.05431672],
       [ 0.27965306, -0.01946716, -0.11346169, ..., -0.84893347,
         0.18063731, -0.04911782],
       ...,
       [ 0.27338737, -0.01701062, -0.04502183, ..., -0.77913261,
         0.24914484,  0.04081119],
       [ 0.28965416, -0.01884304, -0.15828059, ..., -0.78518142,
         0.24643223,  0.02533948],
       [ 0.35150347, -0.01242312, -0.20386717, ..., -0.78326693,
         0.24680852,  0.03669484]])

In [30]:
import numpy as np

np.unique(y_train-1)

array([0, 1, 2, 3, 4, 5], dtype=int64)

In [31]:
y_train.values -1

array([[4],
       [4],
       [4],
       ...,
       [1],
       [1],
       [1]], dtype=int64)

📍 XGBoost's `XGBClassifier()` expects class labels to start at 0.

- Labels must be the integers 0 through (number of classes - 1). With 6 classes, that means 0 to 5.

- The HAR labels are encoded as 1 to 6, which `XGBClassifier()` does not accept as-is, so they have to be shifted to 0 to 5 before fitting (here with `y_train - 1`).

- This differs from scikit-learn's own classifiers, which accept arbitrary labels and encode them internally. That is why `GradientBoostingClassifier` above was trained on the 1 to 6 labels directly.

- Predictions come back in the same 0 to 5 encoding, so compare them against `y_test - 1`, or add 1 to recover the original labels.
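
Subtracting 1 works here because the HAR labels happen to be consecutive integers. For labels that are not, scikit-learn's `LabelEncoder` is one way to get a 0-based encoding and to map predictions back afterwards. A minimal sketch on a made-up label array (the values are an assumption for illustration, not the real `y_train`):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the 1..6 range, standing in for the HAR 'action' column
labels = np.array([5, 5, 4, 1, 2, 6, 3])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # sorted unique labels map to 0..5
restored = encoder.inverse_transform(encoded)  # back to the original 1..6

print(encoded)   # [4 4 3 0 1 5 2]
print(restored)  # [5 5 4 1 2 6 3]
```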

In [37]:
from xgboost import XGBClassifier

start_time = time.time()

xgb = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3) #, num_class=6, objective='multi:softmax')
xgb.fit(X_train.values, y_train-1) #--> XGBoost accepts DataFrames in general; a 'numpy.array' is passed here, most likely because the HAR feature names contain duplicates, which XGBoost rejects.

print('Fit time :', time.time() - start_time)

Fit time : 70.76708054542542


In [38]:
accuracy_score(y_test-1, xgb.predict(X_test.values))

0.9497794367153037
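
The cells above pass `X_train.values` rather than the DataFrame itself. One likely reason, stated here as an assumption rather than something verified in this notebook, is that XGBoost requires unique feature names and the HAR feature list repeats some of them. If passing a DataFrame is preferred, the column names could be made unique first. A minimal sketch: `make_unique` is a hypothetical helper, and the sample names are illustrative stand-ins modeled on HAR feature names.

```python
from collections import Counter

def make_unique(names):
    """Append _1, _2, ... to repeated names so every name is distinct."""
    seen = Counter()
    unique = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix
        unique.append(name if count == 0 else f'{name}_{count}')
        seen[name] += 1
    return unique

sample_names = [
    'fBodyAcc-bandsEnergy()-1,8',
    'fBodyAcc-bandsEnergy()-1,8',
    'tBodyAcc-mean()-X',
]
print(make_unique(sample_names))
```

The result could then be assigned with `X_train.columns = make_unique(feature_name_df.column_name.tolist())`. This simple version does not guard against a suffixed name colliding with an existing one.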

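### --▶ Early stopping & validation data

- An early-stopping condition and a validation set can be specified.

- `early_stopping_rounds` : stops training once the validation metric has failed to improve for the given number of consecutive rounds (10 in the cell below).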
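
The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified illustration of the idea, not XGBoost's actual implementation; `early_stopping_round` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_round(losses, patience):
    """Return (best_round, last_round_run) for a sequence of validation losses."""
    best_loss = float('inf')
    best_round = -1
    for i, loss in enumerate(losses):
        if loss < best_loss:
            # New best score: remember it and reset the waiting period
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here
            return best_round, i
    # The sequence ended before the patience ran out
    return best_round, len(losses) - 1

# Made-up validation losses: the best value is at round 5
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.71, 0.72, 0.73]
best_round, last_round = early_stopping_round(losses, patience=3)
print(best_round, last_round)  # 5 8
```

Training runs `patience` rounds past the best one before it stops, which is why the log keeps printing for a while after the lowest loss appears.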

In [39]:
%%time

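evals = [(X_test.values, y_test-1)] # note: the test set is reused as the validation set here

start_time = time.time()

# early_stopping_rounds is set on the estimator; passing it to fit() is deprecated since XGBoost 1.6
xgb = XGBClassifier(n_estimators=400, learning_rate=0.1, max_depth=3, num_class=6, objective='multi:softmax', early_stopping_rounds=10)
xgb.fit(X_train.values, y_train-1, eval_set=evals) #--> the validation data has to be passed in separately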

print('Fit time :', time.time() - start_time)

[0]	validation_0-mlogloss:1.58912
[1]	validation_0-mlogloss:1.43298
[2]	validation_0-mlogloss:1.30579
[3]	validation_0-mlogloss:1.19398
[4]	validation_0-mlogloss:1.10151
[5]	validation_0-mlogloss:1.01952
[6]	validation_0-mlogloss:0.94821
[7]	validation_0-mlogloss:0.88468
[8]	validation_0-mlogloss:0.82846
[9]	validation_0-mlogloss:0.77660
[10]	validation_0-mlogloss:0.73051
[11]	validation_0-mlogloss:0.68873
[12]	validation_0-mlogloss:0.65163
[13]	validation_0-mlogloss:0.61809
[14]	validation_0-mlogloss:0.58776
[15]	validation_0-mlogloss:0.55936
[16]	validation_0-mlogloss:0.53447
[17]	validation_0-mlogloss:0.51131
[18]	validation_0-mlogloss:0.49076
[19]	validation_0-mlogloss:0.47043
[20]	validation_0-mlogloss:0.45119
[21]	validation_0-mlogloss:0.43441
[22]	validation_0-mlogloss:0.41777
[23]	validation_0-mlogloss:0.40352
[24]	validation_0-mlogloss:0.38949
[25]	validation_0-mlogloss:0.37684
[26]	validation_0-mlogloss:0.36371
[27]	validation_0-mlogloss:0.35286
[28]	validation_0-mlogloss:0.3

In [40]:
accuracy_score(y_test-1, xgb.predict(X_test.values))

0.9453681710213777

<br></br>


## 🎫 LightGBM

> Light Gradient Boosting Machine

</br>

### --▶ `lightgbm` install

- conda install lightgbm --> pip install lightgbm

In [49]:
#conda install lightgbm


In [41]:
#!pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-4.3.0-py3-none-win_amd64.whl.metadata (19 kB)
Downloading lightgbm-4.3.0-py3-none-win_amd64.whl (1.3 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-4.3.0


### --▶ LGBMClassifier()

In [6]:
from lightgbm import LGBMClassifier

start_time = time.time()

lgbm = LGBMClassifier(n_estimators=400)
lgbm.fit(X_train.values, y_train)

print('Fit time :', time.time()-start_time)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.047368 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 140170
[LightGBM] [Info] Number of data points in the train set: 7352, number of used features: 561
[LightGBM] [Info] Start training from score -1.791216
[LightGBM] [Info] Start training from score -1.924514
[LightGBM] [Info] Start training from score -2.009071
[LightGBM] [Info] Start training from score -1.743436
[LightGBM] [Info] Start training from score -1.677246
[LightGBM] [Info] Start training from score -1.653513
Fit time : 26.891047477722168


In [7]:

pred = lgbm.predict(X_test.values)
print('ACC :', accuracy_score(y_test, pred))

ACC : 0.9375636240244316


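### --▶ Early stopping & validation data

- An early-stopping condition and a validation set can be specified.

- `early_stopping_rounds=N` : stops training once the validation score has failed to improve for N consecutive rounds.

- The API changed considerably when LightGBM was upgraded to 4.x, and the `early_stopping_rounds` argument of `fit()` no longer works. To get the same behavior, either

	- downgrade to LightGBM 3.x (`pip install lightgbm==3.3.2`) and run the exercise with `early_stopping_rounds=N`, or

	- run `from lightgbm import early_stopping` and replace `early_stopping_rounds=50` with `callbacks=[early_stopping(stopping_rounds=50)]`.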

In [10]:
%%time

from lightgbm import early_stopping

evals = [(X_test.values, y_test)]

start_time = time.time()

lgbm = LGBMClassifier(n_estimators=400)
lgbm.fit(X_train.values, y_train, callbacks=[early_stopping(stopping_rounds=20)], eval_set=evals)

print('Fit time :', time.time()-start_time)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.045982 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 140170
[LightGBM] [Info] Number of data points in the train set: 7352, number of used features: 561
[LightGBM] [Info] Start training from score -1.791216
[LightGBM] [Info] Start training from score -1.924514
[LightGBM] [Info] Start training from score -2.009071
[LightGBM] [Info] Start training from score -1.743436
[LightGBM] [Info] Start training from score -1.677246
[LightGBM] [Info] Start training from score -1.653513
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[38]	valid_0's multi_logloss: 0.233106
Fit time : 12.108598947525024
CPU times: total: 47.3 s
Wall time: 12.1 s


In [11]:
print('ACC :', accuracy_score(y_test, lgbm.predict(X_test.values)))

ACC : 0.9260264675941635
