<a href="https://colab.research.google.com/github/subin1005/project/blob/main/%EA%B1%B4%EC%84%A4%EA%B8%B0%EA%B3%84_%EC%98%A4%EC%9D%BC%EC%83%81%ED%83%9C_%EB%B6%84%EB%A5%98_%5BClassifier_Regressor%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Construction Machinery Oil Condition Classification Competition

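### 0. Competition overview

 - **Competition** : Construction Machinery Oil Condition Classification AI Competition
 - **Task** : Build a model that judges the condition of the working oil in construction equipment (binary classification: normal vs. abnormal)
 - **Approach** : Train the model with knowledge distillation

**Knowledge distillation** : a method in which an already trained model (the teacher) is used to train a model that has less information to learn from (the student). In this analysis, train carries more information than test (train has more variables than test), so a teacher model is first trained on all train variables, and a student model is then trained on only the variables that also exist in test, using the teacher's outputs as its target.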
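
The teacher/student idea can be sketched with NumPy alone. Everything below is an illustrative assumption rather than the notebook's actual pipeline: the data are synthetic, the teacher is a small hand-written logistic regression, and the student is an ordinary least-squares fit on the first three of six features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 6 features at training time, only the first 3 at test time.
n, d_full, d_test = 2000, 6, 3
X_full = rng.normal(size=(n, d_full))
true_w = np.array([1.5, -1.0, 0.5, 2.0, -1.5, 1.0])
y = (X_full @ true_w + rng.normal(scale=0.5, size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=500):
    """Plain gradient-descent logistic regression (bias column appended)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return sigmoid(Xb @ w)

# Teacher: sees every training-time feature and the hard labels.
teacher_w = fit_logistic(X_full, y)
soft_targets = predict_proba(teacher_w, X_full)

# Student: sees only the test-time features and regresses on the teacher's probabilities.
X_student = np.hstack([X_full[:, :d_test], np.ones((n, 1))])
student_w, *_ = np.linalg.lstsq(X_student, soft_targets, rcond=None)
student_score = X_student @ student_w

# Threshold the student's score to obtain class labels.
student_label = (student_score >= 0.5).astype(int)
teacher_acc = np.mean((soft_targets >= 0.5) == y)
student_acc = np.mean(student_label == y)
print(f"teacher accuracy: {teacher_acc:.3f}, student accuracy: {student_acc:.3f}")
```

The student cannot match the teacher because it never sees the last three features, but it still learns from a smoother target (probabilities) than the raw 0/1 labels.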

### 1. Importing libraries

In [1]:
import torch
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import os
import pandas as pd
import numpy as np
import random
import warnings
warnings.filterwarnings(action='ignore')

### 2. Loading the data

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Load the data
train = pd.read_csv('/content/drive/MyDrive/건설기계 오일 상태 분류 경진대회/train.csv')
test = pd.read_csv('/content/drive/MyDrive/건설기계 오일 상태 분류 경진대회/test.csv')

- Fix the random seed

In [4]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark mode can select kernels non-deterministically

seed_everything(0)

### 3. Data exploration and preprocessing

- Checking the shape of the data

In [5]:
train.head()

Unnamed: 0,ID,COMPONENT_ARBITRARY,ANONYMOUS_1,YEAR,SAMPLE_TRANSFER_DAY,ANONYMOUS_2,AG,AL,B,BA,...,U25,U20,U14,U6,U4,V,V100,V40,ZN,Y_LABEL
0,TRAIN_00000,COMPONENT3,1486,2011,7,200,0,3,93,0,...,,,,,,0,,154.0,75,0
1,TRAIN_00001,COMPONENT2,1350,2021,51,375,0,2,19,0,...,2.0,4.0,6.0,216.0,1454.0,0,,44.0,652,0
2,TRAIN_00002,COMPONENT2,2415,2015,2,200,0,110,1,1,...,0.0,3.0,39.0,11261.0,41081.0,0,,72.6,412,1
3,TRAIN_00003,COMPONENT3,7389,2010,2,200,0,8,3,0,...,,,,,,0,,133.3,7,0
4,TRAIN_00004,COMPONENT3,3954,2015,4,200,0,1,157,0,...,,,,,,0,,133.1,128,0


In [6]:
test.head()

Unnamed: 0,ID,COMPONENT_ARBITRARY,ANONYMOUS_1,YEAR,ANONYMOUS_2,AG,CO,CR,CU,FE,H2O,MN,MO,NI,PQINDEX,TI,V,V40,ZN
0,TEST_0000,COMPONENT1,2192,2016,200,0,0,0,1,12,0.0,0,0,0,10,0,0,91.3,1091
1,TEST_0001,COMPONENT3,2794,2011,200,0,0,2,1,278,0.0,3,0,0,2732,1,0,126.9,12
2,TEST_0002,COMPONENT2,1982,2010,200,0,0,0,16,5,0.0,0,0,0,11,0,0,44.3,714
3,TEST_0003,COMPONENT3,1404,2009,200,0,0,3,4,163,0.0,4,3,0,8007,0,0,142.8,94
4,TEST_0004,COMPONENT2,8225,2013,200,0,0,0,6,13,0.0,0,0,0,16,0,0,63.4,469


In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14095 entries, 0 to 14094
Data columns (total 54 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   14095 non-null  object 
 1   COMPONENT_ARBITRARY  14095 non-null  object 
 2   ANONYMOUS_1          14095 non-null  int64  
 3   YEAR                 14095 non-null  int64  
 4   SAMPLE_TRANSFER_DAY  14095 non-null  int64  
 5   ANONYMOUS_2          14095 non-null  int64  
 6   AG                   14095 non-null  int64  
 7   AL                   14095 non-null  int64  
 8   B                    14095 non-null  int64  
 9   BA                   14095 non-null  int64  
 10  BE                   14095 non-null  int64  
 11  CA                   14095 non-null  int64  
 12  CD                   12701 non-null  float64
 13  CO                   14095 non-null  int64  
 14  CR                   14095 non-null  int64  
 15  CU                   14095 non-null 

In [8]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6041 entries, 0 to 6040
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   6041 non-null   object 
 1   COMPONENT_ARBITRARY  6041 non-null   object 
 2   ANONYMOUS_1          6041 non-null   int64  
 3   YEAR                 6041 non-null   int64  
 4   ANONYMOUS_2          6041 non-null   int64  
 5   AG                   6041 non-null   int64  
 6   CO                   6041 non-null   int64  
 7   CR                   6041 non-null   int64  
 8   CU                   6041 non-null   int64  
 9   FE                   6041 non-null   int64  
 10  H2O                  6041 non-null   float64
 11  MN                   6041 non-null   int64  
 12  MO                   6041 non-null   int64  
 13  NI                   6041 non-null   int64  
 14  PQINDEX              6041 non-null   int64  
 15  TI                   6041 non-null   i

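 1. train : 54 variables (2 categorical variables, and Y_LABEL also appears to be categorical), 14095 rows in total
 2. test : 19 variables, 6041 rows in total

Both sets have an `ID` variable that acts as an index, and train has 35 more variables than test (including the target variable Y_LABEL).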
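
The count of 35 can be checked with a plain set difference. The column names below are copied from the outputs shown in this notebook (the 37 columns that survive preprocessing plus the 17 that are dropped for missingness); the variable names `train_columns`, `test_columns`, and `train_only` are introduced here only for illustration.

```python
# Column names taken from the notebook outputs: 37 kept columns + 17 dropped columns = 54.
train_columns = [
    "ID", "COMPONENT_ARBITRARY", "ANONYMOUS_1", "YEAR", "SAMPLE_TRANSFER_DAY",
    "ANONYMOUS_2", "AG", "AL", "B", "BA", "BE", "CA", "CD", "CO", "CR", "CU",
    "FE", "H2O", "K", "LI", "MG", "MN", "MO", "NA", "NI", "P", "PB", "PQINDEX",
    "S", "SB", "SI", "SN", "TI", "V", "V40", "ZN", "Y_LABEL",
    "FH2O", "FNOX", "FOPTIMETHGLY", "FOXID", "FSO4", "FTBN", "FUEL",
    "SOOTPERCENTAGE", "U100", "U75", "U50", "U25", "U20", "U14", "U6", "U4", "V100",
]
test_columns = [
    "ID", "COMPONENT_ARBITRARY", "ANONYMOUS_1", "YEAR", "ANONYMOUS_2", "AG",
    "CO", "CR", "CU", "FE", "H2O", "MN", "MO", "NI", "PQINDEX", "TI", "V", "V40", "ZN",
]

# Columns that exist only at training time (the target Y_LABEL is one of them).
test_set = set(test_columns)
train_only = [c for c in train_columns if c not in test_set]

print(len(train_columns), len(test_columns), len(train_only))
```

With the DataFrames loaded, the same check would be `train.columns.difference(test.columns)`.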

- Checking missing values

In [9]:
train.isnull().sum()/len(train)

ID                     0.000000
COMPONENT_ARBITRARY    0.000000
ANONYMOUS_1            0.000000
YEAR                   0.000000
SAMPLE_TRANSFER_DAY    0.000000
ANONYMOUS_2            0.000000
AG                     0.000000
AL                     0.000000
B                      0.000000
BA                     0.000000
BE                     0.000000
CA                     0.000000
CD                     0.098900
CO                     0.000000
CR                     0.000000
CU                     0.000000
FH2O                   0.724016
FNOX                   0.724016
FOPTIMETHGLY           0.724016
FOXID                  0.724016
FSO4                   0.724016
FTBN                   0.724016
FE                     0.000000
FUEL                   0.724016
H2O                    0.000000
K                      0.163107
LI                     0.000000
MG                     0.000000
MN                     0.000000
MO                     0.000000
NA                     0.000000
NI      

In [10]:
test.isnull().sum()

ID                     0
COMPONENT_ARBITRARY    0
ANONYMOUS_1            0
YEAR                   0
ANONYMOUS_2            0
AG                     0
CO                     0
CR                     0
CU                     0
FE                     0
H2O                    0
MN                     0
MO                     0
NI                     0
PQINDEX                0
TI                     0
V                      0
V40                    0
ZN                     0
dtype: int64

1. train : of the 54 variables, 17 are at least 70% missing, while `K` and `CD` are less than 20% missing
2. test : no missing values

##### Handling missing values

- Drop the columns that are at least 70% missing (17 in total) : none of them exist in the test data

In [11]:
train = train.drop(columns = ['FH2O', 'FNOX', 'FOPTIMETHGLY', 'FOXID', 'FSO4', 'FTBN',
                              'FUEL', 'SOOTPERCENTAGE', 'U100', 'U75', 'U50', 'U25',
                              'U20', 'U14', 'U6', 'U4', 'V100'])

- Imputing missing values of `CD`

In [12]:
train['CD'].describe()

count    12701.000000
mean         0.015589
std          0.209407
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         18.000000
Name: CD, dtype: float64

In [13]:
len(train[train['CD']==0])/len(train)

0.889890031926215

In [14]:
train['CD'] = train['CD'].fillna(0)

> Looking at `CD`, about 89% of the rows have the value 0 => impute the missing values with 0

- Imputing missing values of `K`

In [15]:
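from sklearn.impute import KNNImputer
# Only the `K` column is passed in, so a row with a missing `K` has no other feature
# to measure neighbour distance on; KNNImputer then falls back to the column mean.
imputer = KNNImputer(n_neighbors = 5)
train['K'] = imputer.fit_transform(train[['K']]).ravel()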

In [16]:
train.isnull().sum()

ID                     0
COMPONENT_ARBITRARY    0
ANONYMOUS_1            0
YEAR                   0
SAMPLE_TRANSFER_DAY    0
ANONYMOUS_2            0
AG                     0
AL                     0
B                      0
BA                     0
BE                     0
CA                     0
CD                     0
CO                     0
CR                     0
CU                     0
FE                     0
H2O                    0
K                      0
LI                     0
MG                     0
MN                     0
MO                     0
NA                     0
NI                     0
P                      0
PB                     0
PQINDEX                0
S                      0
SB                     0
SI                     0
SN                     0
TI                     0
V                      0
V40                    0
ZN                     0
Y_LABEL                0
dtype: int64

Confirmed that all missing values have been imputed and none remain.

##### Changing the type of Y_LABEL

In [17]:
train['Y_LABEL'] = train['Y_LABEL'].astype('category')

##### Dropping the `ID` variable, which is not needed for the analysis

In [18]:
train.drop(['ID'], axis = 1, inplace = True)
test.drop(['ID'], axis = 1, inplace = True)

##### Storing the categorical variables and the variables available in test (excluding the target variable `Y_LABEL`)

In [19]:
categorical_features = ['COMPONENT_ARBITRARY', 'YEAR']
test_stage_features = list(test.columns)

##### Splitting the data

In [20]:
X = train.drop(['Y_LABEL'], axis = 1)
y = train['Y_LABEL']
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

##### Min-max scaling (numeric variables)

In [21]:
from sklearn.preprocessing import MinMaxScaler

def get_values(value):
    return value.values.reshape(-1, 1)

for col in train_X.columns:
    if col not in categorical_features:
        scaler = MinMaxScaler()
        train_X[col] = scaler.fit_transform(get_values(train_X[col]))
        val_X[col] = scaler.transform(get_values(val_X[col]))

        if col in test.columns:
            test[col] = scaler.transform(get_values(test[col]))
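
A small NumPy sketch of what fitting the scaler on the training split and reusing it elsewhere implies. The numbers are made up for illustration; the point is that validation or test values outside the training range land outside [0, 1].

```python
import numpy as np

train_col = np.array([200.0, 375.0, 200.0, 1000.0])  # made-up training values
val_col = np.array([150.0, 500.0, 1200.0])           # made-up validation values

# Statistics come from the training split only, mirroring fit_transform on train_X.
col_min, col_max = train_col.min(), train_col.max()

def min_max(x):
    return (x - col_min) / (col_max - col_min)

train_scaled = min_max(train_col)  # always within [0, 1]
val_scaled = min_max(val_col)      # may fall below 0 or above 1

print(train_scaled)
print(val_scaled)
```

Tree-based models are insensitive to this, while distance-based or linear models used later on the scaled columns can be affected by such out-of-range values.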

##### Label encoding (categorical variables)

In [22]:
le = LabelEncoder()
for col in categorical_features:
    train_X[col] = le.fit_transform(train_X[col])
    val_X[col] = le.transform(val_X[col])
    if col in test.columns:
        test[col] = le.transform(test[col])

In [23]:
# Assign column names to train_X and val_X
train_X = pd.DataFrame(train_X, columns = X.columns)
val_X = pd.DataFrame(val_X, columns = X.columns)

##### Checking the shape of the data

In [24]:
# Y_LABEL turns out to be imbalanced.
train_y.value_counts()

Y_LABEL
0    9024
1     842
Name: count, dtype: int64

### 4. OverSampling
 - Oversample the minority class to mitigate the class imbalance.

In [25]:
# # (1) Random oversampling

# from imblearn.over_sampling import RandomOverSampler
# X_train_over, y_train_over = RandomOverSampler(random_state=0).fit_resample(train_X, train_y)

In [26]:
# (2) smote

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_over, y_train_over = smote.fit_resample(train_X, train_y)

In [27]:
# # (3) borderline-SMOTE

# from imblearn.over_sampling import BorderlineSMOTE
# from collections import Counter
# oversample = BorderlineSMOTE(random_state = 0)
# X_train_over, y_train_over = oversample.fit_resample(train_X, train_y)
# counter = Counter(y_train_over)
# print(counter)

Of RandomOverSampler, SMOTE, and BorderlineSMOTE, SMOTE gave the best score, so it is used for oversampling.
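
For intuition, the core of SMOTE is an interpolation between a minority sample and one of its nearest minority neighbours. The function `smote_like` below is a simplified, hypothetical re-implementation of that step for illustration, not the imbalanced-learn code. In this notebook the training split has 9024 normal and 842 abnormal rows, so balancing the classes takes 9024 - 842 = 8182 synthetic rows.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances among the minority points.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                   # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]     # indices of the k nearest neighbours
    base = rng.integers(0, n, size=n_new)          # which minority point to start from
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

X_min = np.random.default_rng(1).normal(size=(20, 3))  # made-up minority samples
X_new = smote_like(X_min, n_new=50)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two real minority points, label-encoded categorical columns such as `COMPONENT_ARBITRARY` can receive non-integer values under this scheme.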

### 5. Modeling

#### [1] Classifier model

##### 1) Catboost

In [None]:
pip install catboost

Collecting catboost
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.5


In [None]:
from catboost import CatBoostClassifier, Pool
model = CatBoostClassifier(iterations=500,
                           depth=5,
                           learning_rate=0.1,
                           loss_function='Logloss',
                           verbose=True,
                           random_state = 0)
model.fit(X_train_over, y_train_over)
pred = model.predict(val_X)

0:	learn: 0.5852179	total: 60.1ms	remaining: 30s
1:	learn: 0.5119804	total: 71.5ms	remaining: 17.8s
2:	learn: 0.4592224	total: 82.5ms	remaining: 13.7s
3:	learn: 0.4172240	total: 92.9ms	remaining: 11.5s
4:	learn: 0.3841836	total: 103ms	remaining: 10.2s
5:	learn: 0.3521945	total: 114ms	remaining: 9.41s
6:	learn: 0.3293910	total: 125ms	remaining: 8.83s
7:	learn: 0.3131558	total: 136ms	remaining: 8.37s
8:	learn: 0.2946358	total: 147ms	remaining: 8.02s
9:	learn: 0.2753902	total: 157ms	remaining: 7.71s
10:	learn: 0.2602894	total: 168ms	remaining: 7.46s
11:	learn: 0.2494542	total: 178ms	remaining: 7.26s
12:	learn: 0.2383208	total: 189ms	remaining: 7.08s
13:	learn: 0.2284004	total: 203ms	remaining: 7.06s
14:	learn: 0.2223367	total: 216ms	remaining: 6.97s
15:	learn: 0.2139972	total: 226ms	remaining: 6.84s
16:	learn: 0.2081899	total: 237ms	remaining: 6.74s
17:	learn: 0.2036264	total: 248ms	remaining: 6.64s
18:	learn: 0.1973996	total: 266ms	remaining: 6.74s
19:	learn: 0.1920820	total: 277ms	remai

In [None]:
from sklearn.metrics import f1_score

f1_score(val_y, pred)

0.68135593220339
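
scikit-learn documents the signature as `f1_score(y_true, y_pred)`. Swapping the two arguments exchanges precision and recall, and since F1 is their harmonic mean the score itself does not change. A minimal pure-Python check on made-up labels (the helper `binary_f1` is written here for illustration):

```python
def binary_f1(y_true, y_pred):
    """F1 for the positive class (label 1), computed from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # 3 true positives, 2 false positives, 1 false negative

print(binary_f1(y_true, y_pred))  # precision 0.6, recall 0.75
print(binary_f1(y_pred, y_true))  # precision 0.75, recall 0.6, same F1
```

Metrics that are not symmetric in this way, such as precision or recall on their own, do depend on the argument order.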

In [None]:
pred_cb = model.predict_proba(X_train_over)

##### 2) XGB Classifier

In [31]:
import xgboost as xgb

clf =xgb.XGBClassifier(max_depth =17, min_child_weight = 1, n_estimators = 120, random_state = 0)#,learning_rate = 0.1,  objective = 'binary:logistic' ,)
clf.fit(X_train_over, y_train_over)

In [32]:
from sklearn.metrics import f1_score
pred = clf.predict(val_X)
f1_score(val_y, pred)

0.6821192052980133

In [33]:
pred_xgb = clf.predict_proba(X_train_over)

##### 3) Histogram-based Gradient Boosting

In [None]:
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(random_state = 0, max_depth = 17, learning_rate = 0.2)
hgb.fit(X_train_over, y_train_over)


In [None]:
from sklearn.metrics import f1_score
pred = hgb.predict(val_X)
f1_score(val_y, pred)

0.6789915966386555

In [None]:
# 32, 0.4 : 0.573099
# 20, 0.1 : 0.608163
# 15, 0.1 : 0.604
# 15, 0.01 : 0.504
# 15, 0.2 : 0.64055299 <- PICK
# 18, 0.2 :0.6241457
# 20, 0.2 : 0.62414578
# 13, 0.2 : 0.607888
# 16, 0.2 : 0.6178489
# 15, 0.3 :0.6124409
# 15, 0.4 : 0.625390

from sklearn.metrics import f1_score
pred = hgb.predict(val_X)
f1_score(val_y, pred)

0.6730769230769232

In [None]:
pred_hgb = hgb.predict_proba(X_train_over)

##### 3) LightGBM Classifier

In [None]:
import lightgbm as lgb

# Named lgbm_clf so that the estimator does not shadow the `lgb` module imported above.
lgbm_clf = lgb.LGBMClassifier(boosting_type = 'gbdt', max_depth = 12,
                              learning_rate = 0.2, n_estimators = 150, random_state = 0, force_col_wise=True,
                              feature_fraction = 0.4)
# feature_fraction : fraction of the features randomly sampled at each boosting iteration.
lgbm_clf.fit(X_train_over, y_train_over)
pred = lgbm_clf.predict(val_X)

[LightGBM] [Info] Number of positive: 9024, number of negative: 9024
[LightGBM] [Info] Total Bins 7779
[LightGBM] [Info] Number of data points in the train set: 18048, number of used features: 35
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


In [None]:
from sklearn.metrics import f1_score
f1_score(val_y, pred)

0.6839378238341969

In [None]:
pred_lgbm = lgbm_clf.predict_proba(X_train_over)

##### 4) RandomForest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=0, max_depth=18, min_samples_leaf=3, min_samples_split=2, n_estimators=90)
rf.fit(X_train_over, y_train_over)
pred = rf.predict(val_X)

In [None]:
from sklearn.metrics import f1_score
f1_score(val_y, pred)

In [None]:
pred_rf = rf.predict_proba(X_train_over)

##### 5) Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(min_samples_leaf =3, random_state=0, max_depth = 12, learning_rate = 0.2) # defaults: max_depth=3, learning_rate=0.1
gbc.fit(X_train_over, y_train_over)

In [None]:
# 3, 12, 0.1 : 0.700507  <- PICK
# 2, 11, 0.2 : 0.68062
# 2, 11, 0.1 : 0.690355
# 2, 12, 0.1 :0.69565217
# 2, 13, 0.1 : 0.69387755

from sklearn.metrics import f1_score
pred = gbc.predict(val_X)
f1_score(val_y, pred)

0.6723259762308998

In [None]:
pred_gbc = gbc.predict_proba(X_train_over)

##### Stacking

In [None]:
from catboost import CatBoostClassifier, Pool
import xgboost as xgb
from sklearn.ensemble import HistGradientBoostingClassifier
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings(action = 'ignore')

# Base models
cat_clf = CatBoostClassifier(iterations=500, depth=7, learning_rate=0.4,loss_function='Logloss',verbose=True,random_state = 0)
xgb_clf = xgb.XGBClassifier(max_depth =12, min_child_weight = 1, n_estimators = 100, random_state = 0)#,learning_rate = 0.1,  objective = 'binary:logistic' ,)
# hgb_clf = HistGradientBoostingClassifier(random_state = 0, max_depth = 14, learning_rate = 0.2)
lgb_clf = lgb.LGBMClassifier(boosting_type	= 'gbdt', max_depth = 25, learning_rate = 0.1, n_estimators = 500,random_state = 0,force_col_wise=True, feature_fraction = 1 )
rf_clf = RandomForestClassifier(random_state=0, max_depth=18, min_samples_leaf=3, min_samples_split=3, n_estimators=90)

# Final meta-model
gbc_final = GradientBoostingClassifier(min_samples_leaf =4,  max_depth = 15, learning_rate = 0.2) #

In [None]:
# Apply CV inside each base model => stacking
def get_stacking_datasets(model, x_train_n, y_train_n, x_test_n, n_folds) :
  # Set up K-fold (contiguous, unshuffled folds)
  kf = KFold(n_splits=n_folds, shuffle = False)

  # Zero-initialised arrays that will hold the data the final meta-model is trained and evaluated on
  train_fold_pred = np.zeros((x_train_n.shape[0], 1))
  test_pred = np.zeros((x_test_n.shape[0], n_folds))
  # print(model.__class__.__name__, 'model start')

  for folder_counter, (train_idx, valid_idx) in enumerate(kf.split(x_train_n)):
    # Split off the rows used to fit the base model and the single fold it will predict
    print(f" Fold : {folder_counter+1}")
    x_tr = x_train_n.iloc[train_idx,:]
    y_tr = y_train_n.iloc[train_idx]  # positional indexing, consistent with x_tr
    x_te = x_train_n.iloc[valid_idx,:]

    # Fit the base model, predict the held-out fold, and store the result as a meta-model training feature
    model.fit(x_tr, y_tr)
    train_fold_pred[valid_idx, :] = model.predict(x_te).reshape(-1,1)
    # Predict the validation set with the model fitted on this fold and store it for the meta-model
    test_pred[:, folder_counter] = model.predict(x_test_n)

  # Average the per-fold predictions on the validation set and reshape to 2-D
  test_pred_mean = np.mean(test_pred, axis = 1).reshape(-1,1)

  return train_fold_pred, test_pred_mean
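
One thing worth checking with unshuffled folds on oversampled data is the class mix per fold. The sketch below assumes the oversampler appends its synthetic minority rows after the original rows (imbalanced-learn's SMOTE is understood to behave this way, but treat it as an assumption). It uses only the class counts reported in this notebook, with a randomly permuted stand-in for the unknown original row order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class counts reported in the notebook: 9024 normal and 842 abnormal rows before
# oversampling, 9024 of each afterwards.
original = rng.permutation(np.array([0] * 9024 + [1] * 842))
synthetic = np.ones(9024 - 842, dtype=int)   # assumed to be appended at the end
y_over = np.concatenate([original, synthetic])

# Contiguous, unshuffled folds, assigned the same way KFold(shuffle=False) does.
folds = np.array_split(np.arange(len(y_over)), 5)
positive_share = [float(y_over[idx].mean()) for idx in folds]

print([round(s, 3) for s in positive_share])
```

Under that assumption the last two folds contain only the positive class, so the base models are evaluated on single-class folds. `StratifiedKFold(n_splits=5, shuffle=True, random_state=0)` would be one way to keep every fold balanced, at the cost of changing the scores reported here.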

In [None]:
# Use all variables
cat_train, cat_test = get_stacking_datasets(cat_clf, X_train_over, y_train_over, val_X, 5)
xgb_train, xgb_test = get_stacking_datasets(xgb_clf, X_train_over, y_train_over, val_X, 5)
# hgb_train, hgb_test = get_stacking_datasets(hgb_clf, X_train_over[['AL', 'CA', 'YEAR', 'ANONYMOUS_1', 'ANONYMOUS_2', 'FE', 'SI']], y_train_over, val_X[['AL', 'CA', 'YEAR', 'ANONYMOUS_1', 'ANONYMOUS_2', 'FE', 'SI']], 5)
lgb_train, lgb_test = get_stacking_datasets(lgb_clf, X_train_over, y_train_over, val_X, 5)
# rf_train, rf_test = get_stacking_datasets(rf_clf, X_train_over, y_train_over, val_X, 5)

 Fold : 1
0:	learn: 0.3716646	total: 25.8ms	remaining: 12.9s
1:	learn: 0.2665095	total: 45.3ms	remaining: 11.3s
2:	learn: 0.2084828	total: 65.8ms	remaining: 10.9s
3:	learn: 0.1838455	total: 85.9ms	remaining: 10.7s
4:	learn: 0.1598075	total: 111ms	remaining: 11s
5:	learn: 0.1398903	total: 135ms	remaining: 11.2s
6:	learn: 0.1295247	total: 157ms	remaining: 11.1s
7:	learn: 0.1245917	total: 176ms	remaining: 10.9s
8:	learn: 0.1174635	total: 198ms	remaining: 10.8s
9:	learn: 0.1136381	total: 217ms	remaining: 10.6s
10:	learn: 0.1050656	total: 238ms	remaining: 10.6s
11:	learn: 0.1017472	total: 258ms	remaining: 10.5s
12:	learn: 0.0979490	total: 280ms	remaining: 10.5s
13:	learn: 0.0922919	total: 300ms	remaining: 10.4s
14:	learn: 0.0882738	total: 320ms	remaining: 10.4s
15:	learn: 0.0852553	total: 343ms	remaining: 10.4s
16:	learn: 0.0824790	total: 363ms	remaining: 10.3s
17:	learn: 0.0807548	total: 383ms	remaining: 10.3s
18:	learn: 0.0790253	total: 403ms	remaining: 10.2s
19:	learn: 0.0771456	total

In [None]:
# F1 with all variables : 0.6678 -> 0.6656 (hgb model removed) -> 0.6678 (hgb and rf models removed)
# F1 with variable selection : 0.6498 -> 0.6531 (hgb model removed) -> 0.6656 (hgb and rf models removed)
# Concatenate the base-model predictions for the final meta-model
stack_final_x_train = np.concatenate((cat_train, xgb_train, lgb_train), axis = 1) # hgb_train, rf_train
stack_final_x_test = np.concatenate((cat_test, xgb_test, lgb_test), axis = 1) # hgb_test, rf_test

# Train the final meta-model
gbc_final.fit(stack_final_x_train, y_train_over)
stack_final_pred = gbc_final.predict(stack_final_x_test)

print(f"Final meta-model F1 score : {f1_score(val_y, stack_final_pred): .4f}")

Final meta-model F1 score :  0.6678


In [None]:
pred_stacking = gbc_final.predict_proba(stack_final_x_train)

#### [2] Regressor model

##### 0) Defining X and y

In [34]:
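# Student inputs: only the variables that are also available in test
X = X_train_over[test_stage_features]
# Student target: the teacher's predicted probability of class 1
# y = pd.DataFrame(pred_stacking).iloc[:,1]
y = pd.DataFrame(pred_xgb).iloc[:,1]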


##### 1) Elasticnet Regressor

In [None]:
from sklearn.linear_model import Lasso,ElasticNet,Ridge
from sklearn.model_selection import GridSearchCV

elasticnet = ElasticNet(random_state = 0)
alphas = np.logspace(-4, 0, 200)
parameters = {'alpha': alphas }

elasticnet_reg = GridSearchCV(elasticnet, parameters, scoring='neg_mean_squared_error',cv=5)
elasticnet_reg.fit(X,y)
print(elasticnet_reg.best_params_)
print(elasticnet_reg.best_score_)

{'alpha': 0.0001}
-0.2945661897516635


In [None]:
from sklearn.linear_model import Lasso,ElasticNet,Ridge

ER = ElasticNet(alpha = 0.0001, random_state = 0)
ER.fit(X,y)

In [None]:
pred = ER.predict(val_X[test_stage_features])
for i in range(len(pred)) :
  if pred[i] >= 0.57 :
    pred[i] = 1
  else :
    pred[i] = 0

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5362278495695132
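
The cut-offs in this section (0.57, 0.58, 0.46, ...) are tuned by hand. A threshold sweep makes that search explicit. The helper names `macro_f1` and `best_threshold` and the random scores below are assumptions made for illustration; in the notebook, `scores` would be a regressor's predictions on `val_X[test_stage_features]` and `y_true` would be `val_y`.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_true == cls) & (y_pred == cls))
        fp = np.sum((y_true != cls) & (y_pred == cls))
        fn = np.sum((y_true == cls) & (y_pred != cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def best_threshold(y_true, scores, candidates):
    """Return (best macro F1, threshold) over the candidate cut-offs."""
    results = [(macro_f1(y_true, (scores >= t).astype(int)), float(t)) for t in candidates]
    return max(results)

# Made-up labels and scores, purely for illustration.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)
scores = np.clip(0.3 + 0.3 * y_true + rng.normal(scale=0.15, size=1000), 0, 1)

best_f1, best_t = best_threshold(y_true, scores, np.arange(0.05, 0.96, 0.01))
print(f"best macro F1 {best_f1:.4f} at threshold {best_t:.2f}")
```

A threshold chosen this way is fitted to the validation split, so the resulting score is an optimistic estimate of how the same cut-off will perform on test.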

In [None]:
# Using the data from the classifier run with variable selection (7 variables)
pred = ER.predict(val_X[test_stage_features])
pred = (pred >= 0.57).astype(int)

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5337348312813646

##### 2) Huber Regressor

In [None]:
from sklearn.linear_model import HuberRegressor
huber = HuberRegressor(epsilon= 1.4).fit(X, y)

In [None]:
pred = huber.predict(val_X[test_stage_features])
pred = (pred >= 0.58).astype(int)  # reference threshold: 0.63

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5361213027062658

In [None]:
# With variable selection
pred = huber.predict(val_X[test_stage_features])
pred = (pred >= 0.58).astype(int)  # reference threshold: 0.63

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.528514519181545

In [None]:
pred = huber.predict(test)
for i in range(len(pred)) :
  if pred[i] >= 0.58:
    pred[i] = 1
  else :
    pred[i] = 0

In [None]:
pred = pred.astype(int)

##### 3) XGB Regressor

In [None]:
from xgboost import XGBRegressor

model_xgb = XGBRegressor(learning_rate= 0.2, max_depth = 16, #0.05, 32
                          n_estimators = 100, random_state = 0)

model_xgb.fit(X,y)

In [None]:
# 543
pred = model_xgb.predict(val_X[test_stage_features])
for i in range(len(pred)) :
  if pred[i] >= 0.46: #0.5
    pred[i] = 1
  else :
    pred[i] = 0

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5420821471153612

In [None]:
pred = model_xgb.predict(test)

In [None]:
for i in range(len(pred)) :
  if pred[i] >= 0.3:
    pred[i] = 1
  else :
    pred[i] = 0


In [None]:
pred_test = pred.astype(int)

##### 4) Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(max_depth = 16, learning_rate = 0.04, n_estimators = 90, subsample = 0.7, random_state = 0)
gbr.fit(X, y)

In [None]:
# Results with the XGB model (threshold 0.4)
# 16,0.06,80,0.7 : 0.55668
# 16,0.06,90,0.7 : 0.55793
# 16,0.06,100,0.7 : 0.55546
# 16,0.05,90,0.7 : 0.5591
# 16,0.04,90,0.7 : 0.5596 -> 0.5624 (threshold 0.42)
# 16,0.03,90,0.7 : 0.55027
# 18,0.04,90,0.7 : 0.5486
# 14,0.04,90,0.7 : 0.5560

pred = gbr.predict(val_X[test_stage_features])
pred = (pred >= 0.42).astype(int)

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5624011651726735

In [None]:
pred = gbr.predict(test)

In [None]:
for i in range(len(pred)) :
  if pred[i] >= 0.42:
    pred[i] = 1
  else :
    pred[i] = 0


In [None]:
pred_test = pred.astype(int)

##### 5) LightGBM Regressor

In [None]:
import lightgbm as ltb

lgb = ltb.LGBMRegressor(learning_rate = 0.05, max_depth = 16, metric = 'rmse')
lgb.fit(X,y)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002867 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3855
[LightGBM] [Info] Number of data points in the train set: 18048, number of used features: 18
[LightGBM] [Info] Start training from score 0.500000


In [None]:
# Stacking version (0.05,16,rmse)
pred = lgb.predict(val_X[test_stage_features])
pred = (pred >= 0.33).astype(int)  # the best threshold for lgb -> lgb was 0.595

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')



0.5673411952332847

In [None]:
# 0.35 : 0.5605092231748506
# 0.36 : 0.5575167077193213
# 0.34 : 0.5592002230578205

pred = lgb.predict(val_X[test_stage_features])
pred = (pred >= 0.36).astype(int)  # the best threshold for lgb -> lgb was 0.595

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')



0.5575167077193213

In [None]:
pred = lgb.predict(test)



In [None]:
for i in range(len(pred)) :
  if pred[i] >= 0.33:
    pred[i] = 1
  else :
    pred[i] = 0

In [None]:
pred_test = pred.astype(int)

##### 6) K Neighbors Regressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor

regressor = KNeighborsRegressor(n_neighbors = 20, weights = "distance")
regressor.fit(X, y)


In [None]:
# 20, 0.79 : 0.5509905526601195
# 0.8 : 0.5504737386672168
# 0.75 : 0.5466300097303332
pred  = regressor.predict(val_X[test_stage_features])
for i in range(len(pred)) :
  if pred[i] >= 0.79:
    pred[i] = 1
  else :
    pred[i] = 0

from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5509905526601195

In [None]:
pred = regressor.predict(test)

In [None]:
for i in range(len(pred)) :
  if pred[i] >= 0.79:
    pred[i] = 1
  else :
    pred[i] = 0

In [None]:
pred_test = pred.astype(int)

##### 7) RandomForest

In [35]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=0, max_depth=20, min_samples_leaf=4, min_samples_split=10,n_estimators=110)
rf.fit(X, y)

In [None]:
# XGB score (threshold 0.42)
# 20,4,10,110 : 0.550668
# 20,4,10,100 : 0.548910
# 20,4,9,110 : 0.55027
# 20,3,10,110 : 0.5494
# 20,5,10,110 : 0.5460
# 18,4,10,110 : 0.5459
# 22,4,10,110 : 0.5470
# 20,4,10,120 : 0.5415
# 20,4,10,80 : 0.5436

pred = rf.predict(val_X[test_stage_features])
pred = (pred >= 0.42).astype(int)  # 0.42 : 0.5485
from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5506686717192808

In [None]:
# CATBOOST score (threshold 0.44)
# 18,4,10,100 : 0.5503046385530627
# 20,4,10,100 : 0.5548732424299898
# 20,4,10,110 : 0.555846504243899
# 20,4,10,120 : 0.552112609342271

pred = rf.predict(val_X[test_stage_features])
pred = (pred >= 0.44).astype(int)  # 0.42 : 0.5485
from sklearn.metrics import f1_score
f1_score(val_y, pred, average = 'macro')

0.5577470266709132

In [36]:
pred = rf.predict(test)

In [37]:
for i in range(len(pred)) :
  if pred[i] >= 0.42:
    pred[i] = 1
  else :
    pred[i] = 0

In [38]:
pred_test = pred.astype(int)

### 6. SUBMISSION

The combination of the XGBClassifier classification model and the RandomForestRegressor regression model gave the best score, so it was chosen as the final model.

In [39]:
submit = pd.read_csv('/content/drive/MyDrive/건설기계 오일 상태 분류 경진대회/sample_submission.csv')
submit['Y_LABEL'] = pred_test
submit.head()

Unnamed: 0,ID,Y_LABEL
0,TEST_0000,0
1,TEST_0001,0
2,TEST_0002,0
3,TEST_0003,0
4,TEST_0004,1


In [40]:
len(submit[submit['Y_LABEL']==1])/len(submit)

0.0880648899188876

In [41]:
submit.to_csv('./xgb(모든변수_17,1,120)_gbr(0.42기준16,0.04,90,0.7).csv', index=False)