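# Problem 6

[Kaggle-style] Build a model that predicts failure from train_prob.csv, then predict the probability that failure is 1 for each row of test_prob.csv and write the result to answer6.csv in the format shown below.

The evaluation metric is AUC (area under the ROC curve). id is the id of the test case, and failure is the predicted probability that failure equals 1.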

id,failure

16115, 0.1

16116, 0.2
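
A minimal sketch of writing a file in this format is given here, assuming pandas is available. The helper name `make_submission` is hypothetical, and the ids and probabilities are placeholders copied from the sample rows, not real predictions.

```python
import pandas as pd


def make_submission(ids, probs):
    """Return a DataFrame with the two columns the problem asks for."""
    sub = pd.DataFrame({"id": list(ids), "failure": list(probs)})
    # failure must be a probability, so reject values outside [0, 1] early.
    if not sub["failure"].between(0.0, 1.0).all():
        raise ValueError("failure must be a probability in [0, 1]")
    return sub


# Placeholder ids and probabilities (hypothetical, taken from the sample rows).
submission = make_submission([16115, 16116], [0.1, 0.2])

# index=False keeps the file to exactly the two columns id,failure.
submission.to_csv("answer6.csv", index=False)
print(submission.to_csv(index=False))
```

Because AUC depends only on how the predictions rank the test cases, the probabilities do not need to be calibrated for this metric.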


**Instructor: 강선구, Multicampus (sunku0316.kang@multicampus.com, sun9sun9@gmail.com)**

# Kaggle-style solution steps

Step 0: Build the Kaggle-style dataset.

Step 1: Decide on a validation method and build the validation routine.

Step 2: Build a baseline model.

Step 3: Build the model selection routine.


Step 4: Improve the model.

In [1]:
# Check the execution environment

import pandas as pd
import numpy as np
import sklearn
import scipy
import statsmodels
import mlxtend
import sys
import xgboost as xgb

print(sys.version)
for i in [pd, np, sklearn, scipy, mlxtend, statsmodels, xgb]:
    print(i.__name__, i.__version__)

3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
pandas 0.25.1
numpy 1.18.5
sklearn 0.21.3
scipy 1.5.2
mlxtend 0.15.0.0
statsmodels 0.11.1
xgboost 0.80


In [2]:
df_train = pd.read_csv('train_prob.csv', index_col='id')
df_test = pd.read_csv('test_prob.csv', index_col='id')

In [3]:
# Missing-value indicators for measurement_3 and measurement_5 (created before imputation)
df_train['na_1'] = df_train['measurement_3'].isna()
df_train['na_2'] = df_train['measurement_5'].isna()

df_test['na_1'] = df_test['measurement_3'].isna()
df_test['na_2'] = df_test['measurement_5'].isna()

In [5]:
df_train['product_code'].value_counts()

C    5765
E    5343
B    5250
A    5100
Name: product_code, dtype: int64

In [4]:
df_test['product_code'].value_counts()

D    5112
Name: product_code, dtype: int64
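
The two counts above show that the test set contains only product code D, which never appears in the training set (A, B, C, E). The sketch below shows one way to make that check explicit. The frames `toy_train` and `toy_test` are hypothetical stand-ins for `df_train` and `df_test`.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test; only product_code matters here.
toy_train = pd.DataFrame({"product_code": list("AABCE")})
toy_test = pd.DataFrame({"product_code": list("DD")})

train_codes = set(toy_train["product_code"])
test_codes = set(toy_test["product_code"])

# Codes the model will be asked to predict on but has never seen in training.
unseen = test_codes - train_codes
print(sorted(unseen))
```

When every test code is unseen, a validation split that also holds out whole product codes is likely to resemble the test setting more closely than a random row split.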

In [6]:
# Method 2: groupby ~ fit_transform
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.linear_model import LinearRegression
from sklearn.impute import IterativeImputer

# Impute measurement_3..9 and measurement_17 separately for each product_code
X_imp = ['measurement_{}'.format(i) for i in range(3, 10)] + ['measurement_17']
imp = IterativeImputer(
    estimator=LinearRegression(fit_intercept=True),
    random_state=123
)
# group_keys=False keeps the original id index, so the assignment aligns row by row
df_train[X_imp] = df_train.groupby('product_code', group_keys=False)[X_imp].apply(
    lambda x: pd.DataFrame(imp.fit_transform(x), index=x.index, columns=X_imp)
)

In [7]:
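# Same imputation for the test set, refit on the test rows
X_imp = ['measurement_{}'.format(i) for i in range(3, 10)] + ['measurement_17']
imp = IterativeImputer(
    estimator=LinearRegression(fit_intercept=True),
    random_state=123
)
# group_keys=False keeps the original id index, so the assignment aligns row by row
df_test[X_imp] = df_test.groupby('product_code', group_keys=False)[X_imp].apply(
    lambda x: pd.DataFrame(imp.fit_transform(x), index=x.index, columns=X_imp)
)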

In [8]:
# Fill measurement_10..16 with the column mean within each product_code.
# transform returns a result indexed like the input, so no index reset is needed.
X_mean = ['measurement_{}'.format(i) for i in range(10, 17)]
df_train[X_mean] = df_train.groupby('product_code')[X_mean].transform(
    lambda x: x.fillna(x.mean())
)

df_test[X_mean] = df_test.groupby('product_code')[X_mean].transform(
    lambda x: x.fillna(x.mean())
)

In [11]:
# Fill missing loading values with the mean over train and test combined
m = pd.concat([df_train['loading'], df_test['loading']]).mean()
df_train['loading'] = df_train['loading'].fillna(m)
df_test['loading'] = df_test['loading'].fillna(m)

In [12]:
df_train.isna().sum().pipe(
    lambda x: x.loc[x > 0]
)

Series([], dtype: int64)

- Create na_1 and na_2 as derived features.

- The combination of attribute_0 and attribute_1 shows no association with failure.

- LR, solver='lbfgs', ['loading', 'measurement_1', 'measurement_4', 'measurement_14', 'measurement_17', 'na_1']: 0.5838326230092876

- LinearDiscriminantAnalysis: transform / predict on measurement_0 ~ 17

- LR, solver='lbfgs', n_components=7: 0.581757510516433

- RandomForestClassifier {'max_depth': 7, 'min_samples_split': 512, 'n_estimators': 15}, random_state=123: 0.5687712018291998

- loading -> loading_log
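
The notes above do not say how the attribute_0 / attribute_1 combination was checked against failure. One common option is a chi-square test of independence on a contingency table, sketched here with scipy. The data is synthetic and the category values "a", "b", "x", "y", "z" are hypothetical, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.RandomState(123)
n = 2000

# Synthetic stand-in: the attributes are drawn independently of failure.
toy = pd.DataFrame({
    "attribute_0": rng.choice(["a", "b"], size=n),
    "attribute_1": rng.choice(["x", "y", "z"], size=n),
    "failure": rng.binomial(1, 0.2, size=n),
})

# Combine the two attributes into a single categorical value per row.
combo = toy["attribute_0"] + "_" + toy["attribute_1"]

# Rows: attribute combinations, columns: failure = 0 / 1.
table = pd.crosstab(combo, toy["failure"])

chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print("dof:", dof, "p-value:", p_value)
```

A large p-value would be consistent with no association, although it does not prove independence.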

In [13]:
df_train['loading_log'] = np.log(df_train['loading'])
df_test['loading_log'] = np.log(df_test['loading'])

In [18]:
X_all = df_test.columns.tolist()
np.array(X_all)

array(['product_code', 'loading', 'attribute_0', 'attribute_1',
       'attribute_2', 'attribute_3', 'measurement_0', 'measurement_1',
       'measurement_2', 'measurement_3', 'measurement_4', 'measurement_5',
       'measurement_6', 'measurement_7', 'measurement_8', 'measurement_9',
       'measurement_10', 'measurement_11', 'measurement_12',
       'measurement_13', 'measurement_14', 'measurement_15',
       'measurement_16', 'measurement_17', 'na_1', 'na_2', 'loading_log'],
      dtype='<U14')

# Step 1: Decide on the validation method

In [19]:
from sklearn.model_selection import GroupKFold

gcv = GroupKFold(4)
for train_idx, valid_idx in gcv.split(df_train[X_all], df_train['failure'], groups=df_train['product_code']):
    df_cv_train, df_valid = df_train.iloc[train_idx], df_train.iloc[valid_idx]
    print(df_cv_train['product_code'].unique(), df_valid['product_code'].unique())

['A' 'B' 'E'] ['C']
['A' 'B' 'C'] ['E']
['A' 'C' 'E'] ['B']
['B' 'C' 'E'] ['A']


In [25]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
ct = ColumnTransformer([
    ('std', StandardScaler(), ['loading', 'measurement_1', 'measurement_4', 'measurement_14', 'measurement_17'] ),
    ('pt', 'passthrough', ['na_1'])
])
clf_lr = make_pipeline(ct, LogisticRegression(solver='lbfgs'))
cross_validate(
    clf_lr, df_train[X_all], df_train['failure'], scoring='roc_auc', cv=gcv, groups = df_train['product_code'],
    return_train_score=True
)

{'fit_time': array([0.04114151, 0.02462268, 0.04014039, 0.02982831]),
 'score_time': array([0.00506544, 0.00512266, 0.        , 0.01019859]),
 'test_score': array([0.58821746, 0.58491734, 0.58894014, 0.59540058]),
 'train_score': array([0.59262252, 0.59350804, 0.59192438, 0.58956303])}
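
The fold test scores above average about 0.589 and the training scores about 0.592, so the baseline behaves similarly on seen and held-out product codes. A possible next step, sketched below under stated assumptions, is to fit the baseline on all training rows and predict the probability of failure = 1 for the test rows. The data is synthetic, only two of the feature names are used, and `make_toy` is a hypothetical helper, so this is an illustration rather than the notebook's actual final step.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(123)
features = ["loading", "measurement_1"]  # a subset, for illustration only


def make_toy(n, first_id):
    """Build a synthetic feature frame indexed by id (hypothetical helper)."""
    index = pd.RangeIndex(first_id, first_id + n, name="id")
    return pd.DataFrame(rng.normal(size=(n, len(features))),
                        columns=features, index=index)


toy_train = make_toy(400, 0)
# Synthetic labels that depend weakly on loading.
p = 1.0 / (1.0 + np.exp(-(0.5 * toy_train["loading"].values - 1.5)))
toy_train["failure"] = rng.binomial(1, p)
toy_test = make_toy(100, 16115)

clf = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
clf.fit(toy_train[features], toy_train["failure"])

# Column 1 of predict_proba is the probability of the positive class (failure = 1).
proba = clf.predict_proba(toy_test[features])[:, 1]
answer = pd.DataFrame({"id": toy_test.index, "failure": proba})
print(answer.head())
```

With the real data, the same pattern would use `clf_lr`, `df_train`, and `df_test`, followed by `answer.to_csv('answer6.csv', index=False)`.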