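- Run regression with XGBoost, then cross-validate and save predicted vs. actual results to CSV.
- Compute per-feature importance with Permutation Feature Importance (PFI).
- [feature selection] Starting from the least important feature, apply Sequential Backward Elimination to obtain an optimal subset.
- Train and evaluate a RandomForest model on the optimal subset (the cells below use `LGBMRegressor` for this step).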

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score
from scipy.stats import skew
from catboost import CatBoostRegressor
import gc
from sklearn.linear_model import Ridge , LogisticRegression
from lightgbm import LGBMRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor
import xgboost as xgb
import time
from tqdm import tqdm

In [10]:
# Data loading and preprocessing

# Load the data
pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)
file = r"cycletime_minmax.csv"
ct_df = pd.read_csv(file)
ct_df = ct_df.sample(frac=0.01, random_state=42) # subsample 1% of the rows to speed up training

# Split features (x) and target (y); log-transform the target
y_ct = ct_df.iloc[:,-1]
x_ct = ct_df.iloc[:,:-1]
y_ct = np.log1p(y_ct)

# Separate numerical and categorical columns
numerical_list=[]
categorical_list=[]

for i in x_ct.columns :
  if x_ct[i].dtypes == 'O' :
    categorical_list.append(i)
  else :
    numerical_list.append(i)

# One-hot encode the categorical columns
x_ct = pd.get_dummies(x_ct, columns=categorical_list, drop_first=True)

# Cast float64 columns to float32
for col in x_ct.select_dtypes(include=['float64']).columns:
    x_ct[col] = x_ct[col].astype('float32')
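Because the target is transformed with `np.log1p`, every metric computed later is on the log scale. A minimal sketch of mapping predictions back to the original units with `np.expm1` follows; the values in `y_raw` and the offset added to build `pred_log` are made-up numbers for illustration, not data from `cycletime_minmax.csv`.

```python
import numpy as np

# Hypothetical target values on the original scale (illustration only)
y_raw = np.array([0.0, 12.5, 48.0, 310.0])

# Forward transform, same call as in the preprocessing cell
y_log = np.log1p(y_raw)

# Hypothetical log-scale predictions: the targets plus a small offset
pred_log = y_log + 0.05

# expm1 is the exact inverse of log1p, so it restores the original units
pred_raw = np.expm1(pred_log)
y_back = np.expm1(y_log)

print(np.allclose(y_back, y_raw))
print(pred_raw)
```

A constant offset on the log scale becomes a multiplicative error on the original scale, which is worth keeping in mind when reading the MSE and MAPE values below.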

In [11]:
ct_df.shape

(5218, 57)

In [12]:
print(type(x_ct), type(y_ct))

<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>


In [13]:
x_train, x_val, y_train, y_val = train_test_split(x_ct, y_ct, test_size=0.2, random_state=42)

Machine learning models
1. XGBoost

In [14]:
xgboost = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=120, learning_rate=0.4, max_depth=5, n_jobs=-1)

In [15]:
xgboost.fit(x_train, y_train)

In [16]:
# Predict on validation data
predictions = xgboost.predict(x_val)

# Calculate individual MSE and MAPE
individual_mses = (predictions - y_val) ** 2
individual_mapes = np.abs((predictions - y_val) / y_val) * 100

In [13]:
import csv

# Save individual MSE and MAPE to CSV
def save_to_csv(mses, mapes, filename="evaluation_results_XGBoost.csv"):
    with open(filename, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["MSE", "MAPE"])
        for mse, mape in zip(mses, mapes):
            writer.writerow([mse, mape])

# Call the function to save results
save_to_csv(individual_mses, individual_mapes)

print("MSE and MAPE values have been saved to evaluation_results_XGBoost.csv.")

MSE and MAPE values have been saved to evaluation_results_XGBoost.csv.
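The summary at the top mentions saving predicted and actual values, while the cell above writes only the per-row MSE and MAPE. One possible way to keep all four columns together is sketched below with pandas. The helper name `build_results_frame`, the column names, and the sample numbers are assumptions for illustration; the sketch writes to an in-memory buffer rather than a file.

```python
import io

import numpy as np
import pandas as pd


def build_results_frame(y_true, y_pred):
    """Collect actual values, predictions, and per-row errors in one DataFrame."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return pd.DataFrame({
        "actual": y_true,
        "predicted": y_pred,
        "squared_error": (y_pred - y_true) ** 2,
        "abs_pct_error": np.abs((y_pred - y_true) / y_true) * 100,
    })


# Hypothetical log-scale targets and predictions (illustration only)
results = build_results_frame([3.0, 4.0, 5.0], [3.3, 3.8, 5.0])

# Write to an in-memory buffer; pass a file path instead to save to disk
buffer = io.StringIO()
results.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In the notebook this would be called as `build_results_frame(y_val, predictions)`, followed by `to_csv` with the desired filename.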


In [20]:
# Cross-validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)

mse_scores = cross_val_score(xgboost, x_ct, y_ct, cv=kf, scoring='neg_mean_squared_error')
print("MSE Scores: ", mse_scores)
print("Average MSE: ", np.mean(mse_scores))

rmse_scores = np.sqrt(-mse_scores)  # Calculating RMSE from MSE
print("RMSE Scores: ", rmse_scores)
print("Average RMSE: ", np.mean(rmse_scores))

mape_scores = cross_val_score(xgboost, x_ct, y_ct, cv=kf, scoring='neg_mean_absolute_percentage_error')
print("MAPE Scores: ", -mape_scores * 100)  # Convert from negative fraction to positive percentage
print("Average MAPE: ", np.mean(-mape_scores) * 100)

r2_scores = cross_val_score(xgboost, x_ct, y_ct, cv=kf, scoring='r2')
print("R^2 Scores: ", r2_scores)
print("Average R^2: ", np.mean(r2_scores))

MSE Scores:  [-0.18242299 -0.18187483 -0.18231599 -0.18147723 -0.18282936]
Average MSE:  -0.18218407883190446
RMSE Scores:  [0.42711004 0.42646785 0.42698477 0.42600144 0.42758549]
Average RMSE:  0.4268299200739178
MAPE Scores:  [8.80649878 8.81643753 8.8090694  8.7697892  8.83820175]
Average MAPE:  8.807999332350686
R^2 Scores:  [0.76344753 0.76362934 0.76172196 0.76425327 0.76114483]
Average R^2:  0.7628393852422206


Each cross_val_score call takes about 15 minutes.
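Since each `cross_val_score` call refits the model on every fold, computing three metrics costs three full rounds of training. A sketch of collecting all metrics in one pass with `sklearn.model_selection.cross_validate` is shown below. It uses a small synthetic dataset and a `DecisionTreeRegressor` as stand-ins so that it runs quickly; neither is part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption, not the cycle-time dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
y = np.abs(y) + 1.0  # keep targets away from zero so MAPE is well defined

model = DecisionTreeRegressor(max_depth=4, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One dict of scorers: each fold is fitted once and scored three times
scoring = {
    "mse": "neg_mean_squared_error",
    "mape": "neg_mean_absolute_percentage_error",
    "r2": "r2",
}
cv = cross_validate(model, X, y, cv=kf, scoring=scoring)

mse = -cv["test_mse"]          # flip the sign of the negated scorer
rmse = np.sqrt(mse)
mape = -cv["test_mape"] * 100  # fraction to percentage
r2 = cv["test_r2"]

print("Average RMSE:", rmse.mean())
print("Average MAPE:", mape.mean())
print("Average R^2:", r2.mean())
```

Swapping in `xgboost`, `x_ct`, and `y_ct` should give the same metrics as the cell above while fitting each fold only once.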


2. PFI-based feature selection + RF

In [23]:
from sklearn.inspection import permutation_importance
from sklearn.metrics import r2_score

x_sample = x_ct.sample(frac=0.5, random_state=42)
y_sample = y_ct.loc[x_sample.index]
X_train, X_test, y_train, y_test = train_test_split(x_sample, y_sample, test_size=0.25, random_state=42)


# Train the baseline model (an LGBMRegressor, despite the variable name rf)
rf = LGBMRegressor(n_estimators=10, learning_rate=0.5, random_state=156)
rf.fit(X_train, y_train)

# Evaluate the baseline model
baseline_preds = rf.predict(X_test)
baseline_r2 = r2_score(y_test, baseline_preds)
print(f'Baseline R2 Score: {baseline_r2}')

# Calculate permutation feature importances
perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
feature_importance = pd.DataFrame({'features': X_train.columns, 'importance': perm_importance.importances_mean})

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001413 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6628
[LightGBM] [Info] Number of data points in the train set: 1956, number of used features: 199
[LightGBM] [Info] Start training from score 3.740476
Baseline R2 Score: 0.6258290142785398
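To make explicit what `permutation_importance` measures, here is a hand-rolled sketch of the same idea: shuffle one column at a time on held-out data and record the average drop in R2. The synthetic data, the `LinearRegression` model, and the helper name `manual_permutation_importance` are assumptions for illustration; the notebook itself relies on the scikit-learn function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data where only column 0 drives the target (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X[:400], y[:400])
X_test, y_test = X[400:], y[400:]


def manual_permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Mean drop in R2 after shuffling each column, averaged over n_repeats."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops.append(baseline - r2_score(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances


importances = manual_permutation_importance(model, X_test, y_test)
print(importances)
```

A feature the model never uses gets an importance of roughly zero, which is consistent with the many `0.000000` rows in the table below.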


In [31]:
feature_importance

          features  importance
0              QTY    0.112206
1            GRADE    0.000000
2    COMPLETE_RATE    0.017019
3         DUE_DATE    0.001167
4          WIPTURN    0.000530
..             ...         ...
639  SHIFT_TYPE_HS    0.000000
640  SHIFT_TYPE_WD    0.000000
641  SHIFT_TYPE_WG    0.000000
642  SHIFT_TYPE_WO    0.000000


In [35]:
pd.set_option('display.max_rows', None)
# Sort features by importance
sorted_features = feature_importance.sort_values(by='importance', ascending=False)['features'].tolist()
print(len(sorted_features), sorted_features)

644 ['PRC_WAIT_MEDIAN', 'QTY', 'MOVE_1', 'EQPTYPE_WETCCU', 'Q_R', 'ROOM_DI', 'EQP_WORKLOAD_0', 'ROOM_ME', 'Q_W', 'COMPLETE_RATE', 'WIP_WAITTIME', 'EQPTYPE_WETSTN', 'Q_8', 'EQP_WORKLOAD_2', 'PRC_WAIT_STD', 'L_3', 'Q_H', 'MOVE_2', 'L_4', 'PRC_WAIT_MAX', 'HOLD_FLAG_Y', 'EQPTYPE_EFEASH', 'L_R', 'EQP_WORKLOAD_1', 'Q_4', 'ROOM_CM', 'FLOORID_S3-B', 'PROCESS_GROUP_CU_PE_NDC', 'L_P', 'L_E', 'LAYER_GROUP_LG51', 'L_5', 'BLOCK_GROUP_PU6', 'Q_E', 'EQPTYPE_EBEMTL', 'L_H', 'DUE_DATE', 'L_7', 'LAYER_TITLE_WF', 'EQPTYPE_EBEOXD', 'LAYER_GROUP_LG62', 'ROOM_CV', 'L_W', 'ROOM_ET', 'WIPTURN', 'EQPTYPE_WETSSW', 'ROOM_IM', 'LOT_PURPOSE_U', 'LAYER_GROUP_LG31', 'LAYER_GROUP_LG42', 'Q_2', 'Q_3', 'PROCESS_ID_CT', 'LAYER_TITLE_L1', 'PRC_WAIT_MIN', 'LAYER_TITLE_FC', 'LAYER_TITLE_EO', 'LAYER_TITLE_FMD', 'LAYER_TITLE_FM', 'LAYER_TITLE_FL', 'LAYER_TITLE_FE', 'LAYER_TITLE_ER', 'LAYER_TITLE_FB', 'LAYER_TITLE_FD', 'LAYER_TITLE_F4E', 'LAYER_TITLE_FA', 'LAYER_TITLE_F5', 'PROCESS_GROUP_SI_GHM_KIYO-EXP', 'LAYER_TITLE_FR', 'P

In [45]:
# Implementing Sequential Backward Search using Permutation Importance
while sorted_features:
    rf_temp = LGBMRegressor(n_estimators=13, learning_rate=0.5, random_state=156, verbose=-1)
    rf_temp.fit(X_train[sorted_features], y_train)
    temp_preds = rf_temp.predict(X_test[sorted_features])
    temp_r2 = r2_score(y_test, temp_preds)
    print(temp_r2)

    # Keep eliminating while the current subset stays within 95% of the baseline R2
    if temp_r2 >= baseline_r2*0.95:
        # Remove the least important feature (last element of the sorted list)
        least_important = sorted_features.pop(-1)
        print(f'Removed feature: {least_important} - R2 before removal: {temp_r2}')
    else:
        break

print(f'Optimal feature set \'{len(sorted_features)}ea\': {sorted_features}')
print(f'Baseline R2 Score: {baseline_r2}')

0.592781653199747
Optimal feature set '244ea': ['PRC_WAIT_MEDIAN', 'QTY', 'MOVE_1', 'EQPTYPE_WETCCU', 'Q_R', 'ROOM_DI', 'EQP_WORKLOAD_0', 'ROOM_ME', 'Q_W', 'COMPLETE_RATE', 'WIP_WAITTIME', 'EQPTYPE_WETSTN', 'Q_8', 'EQP_WORKLOAD_2', 'PRC_WAIT_STD', 'L_3', 'Q_H', 'MOVE_2', 'L_4', 'PRC_WAIT_MAX', 'HOLD_FLAG_Y', 'EQPTYPE_EFEASH', 'L_R', 'EQP_WORKLOAD_1', 'Q_4', 'ROOM_CM', 'FLOORID_S3-B', 'PROCESS_GROUP_CU_PE_NDC', 'L_P', 'L_E', 'LAYER_GROUP_LG51', 'L_5', 'BLOCK_GROUP_PU6', 'Q_E', 'EQPTYPE_EBEMTL', 'L_H', 'DUE_DATE', 'L_7', 'LAYER_TITLE_WF', 'EQPTYPE_EBEOXD', 'LAYER_GROUP_LG62', 'ROOM_CV', 'L_W', 'ROOM_ET', 'WIPTURN', 'EQPTYPE_WETSSW', 'ROOM_IM', 'LOT_PURPOSE_U', 'LAYER_GROUP_LG31', 'LAYER_GROUP_LG42', 'Q_2', 'Q_3', 'PROCESS_ID_CT', 'LAYER_TITLE_L1', 'PRC_WAIT_MIN', 'LAYER_TITLE_FC', 'LAYER_TITLE_EO', 'LAYER_TITLE_FMD', 'LAYER_TITLE_FM', 'LAYER_TITLE_FL', 'LAYER_TITLE_FE', 'LAYER_TITLE_ER', 'LAYER_TITLE_FB', 'LAYER_TITLE_FD', 'LAYER_TITLE_F4E', 'LAYER_TITLE_FA', 'LAYER_TITLE_F5', 'PROCESS_G
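The loop above scores the current subset and only then pops a feature, so the subset that is finally returned has not itself been scored. A variant that evaluates the reduced candidate before committing to the removal is sketched below. The function name `backward_eliminate`, the `fit_score` callback, the synthetic data, and the fixed ranking are all assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def backward_eliminate(fit_score, ranked_features, baseline, tolerance=0.95):
    """Drop features from the end of ranked_features (least important last)
    while the score of the reduced set stays >= tolerance * baseline."""
    kept = list(ranked_features)
    while len(kept) > 1:
        candidate = kept[:-1]  # subset without the least important feature
        if fit_score(candidate) < tolerance * baseline:
            break  # this removal costs too much, so the feature stays
        kept = candidate
    return kept


# Synthetic data: columns 0 and 1 are informative, 2 and 3 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=600)
X_tr, X_te, y_tr, y_te = X[:450], X[450:], y[:450], y[450:]


def fit_score(cols):
    """Fit on the chosen columns and return held-out R2."""
    model = LinearRegression().fit(X_tr[:, cols], y_tr)
    return r2_score(y_te, model.predict(X_te[:, cols]))


ranked = [0, 1, 2, 3]  # assumed ranking, most important first
baseline = fit_score(ranked)
selected = backward_eliminate(fit_score, ranked, baseline)
print(selected)
```

With this ordering, every subset that is kept has been scored against the threshold, so the returned list can be used directly for the final model.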

In [46]:
x_ct_reduced = x_ct[sorted_features]

In [47]:
print(x_ct_reduced.shape)

(5218, 244)


In [14]:
random_forest = LGBMRegressor(n_estimators=200, learning_rate=0.5, random_state=156)

In [15]:
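# Re-split on the reduced feature set: y_train was overwritten by the PFI split above
x_train, x_val, y_train, y_val = train_test_split(x_ct_reduced, y_ct, test_size=0.2, random_state=42)
random_forest.fit(x_train, y_train)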

LGBMRegressor(learning_rate=0.5, n_estimators=200, random_state=156)

In [16]:
# Predict on validation data
predictions = random_forest.predict(x_val)

# Calculate individual MSE and MAPE
individual_mses = (predictions - y_val) ** 2
individual_mapes = np.abs((predictions - y_val) / y_val) * 100

In [17]:
import csv

# Save individual MSE and MAPE to CSV
def save_to_csv(mses, mapes, filename="evaluation_results_SFS+RF.csv"):
    with open(filename, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["MSE", "MAPE"])
        for mse, mape in zip(mses, mapes):
            writer.writerow([mse, mape])

# Call the function to save results
save_to_csv(individual_mses, individual_mapes)

print("MSE and MAPE values have been saved to evaluation_results_SFS+RF.csv.")

MSE and MAPE values have been saved to evaluation_results_SFS+RF.csv.


In [8]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold, cross_val_score
import numpy as np

# Initialize the model (an LGBMRegressor, despite the variable name random_forest)
random_forest = LGBMRegressor(n_estimators=200, learning_rate=0.5, random_state=156)

# Set up KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42) 

# Cross-validate using the reduced dataset (x_ct_reduced)
mse_scores = cross_val_score(random_forest, x_ct_reduced, y_ct, cv=kf, scoring='neg_mean_squared_error')
print("MSE Scores: ", mse_scores)
print("Average MSE: ", np.mean(mse_scores))

# Calculate RMSE from MSE
rmse_scores = np.sqrt(-mse_scores)
print("RMSE Scores: ", rmse_scores)
print("Average RMSE: ", np.mean(rmse_scores))

# Calculate MAPE
mape_scores = cross_val_score(random_forest, x_ct_reduced, y_ct, cv=kf, scoring='neg_mean_absolute_percentage_error')
print("MAPE Scores: ", -mape_scores * 100)  # Convert from negative fraction to positive percentage
print("Average MAPE: ", np.mean(-mape_scores) * 100)

# Calculate R^2
r2_scores = cross_val_score(random_forest, x_ct_reduced, y_ct, cv=kf, scoring='r2')
print("R^2 Scores: ", r2_scores)
print("Average R^2: ", np.mean(r2_scores))

MSE Scores:  [-0.16788393 -0.16860606 -0.16832098 -0.16737038 -0.16909894]
Average MSE:  -0.16825605709986635
RMSE Scores:  [0.40973641 0.41061668 0.4102694  0.40910925 0.41121642]
Average RMSE:  0.4101896317175265
MAPE Scores:  [8.3616131  8.39254689 8.36628713 8.33582955 8.40619166]
Average MAPE:  8.372493666380185
R^2 Scores:  [0.78230069 0.78087388 0.78001275 0.78257868 0.77908276]
Average R^2:  0.7809697528040624
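For a quick side-by-side reading, the averages printed in the two cross-validation cells can be compared directly. The sketch below copies those averages by hand; the dictionary names are assumptions. All values are on the `log1p` scale, and the two runs differ in both algorithm and feature set, so the gap cannot be attributed to feature selection alone.

```python
# Averages copied from the cross-validation outputs above
xgb_cv = {"rmse": 0.4268299200739178, "mape": 8.807999332350686, "r2": 0.7628393852422206}
reduced_cv = {"rmse": 0.4101896317175265, "mape": 8.372493666380185, "r2": 0.7809697528040624}

# Relative RMSE reduction in percent, absolute MAPE reduction in points, R^2 gain
rmse_rel_drop = (xgb_cv["rmse"] - reduced_cv["rmse"]) / xgb_cv["rmse"] * 100
mape_abs_drop = xgb_cv["mape"] - reduced_cv["mape"]
r2_gain = reduced_cv["r2"] - xgb_cv["r2"]

print(f"RMSE reduction: {rmse_rel_drop:.2f}%")
print(f"MAPE reduction: {mape_abs_drop:.3f} percentage points")
print(f"R^2 gain: {r2_gain:.4f}")
```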

