https://dacon.io/competitions/official/236590/codeshare/12889?page=1&dtype=recent

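[Background]
- Typical equipment running in the field is continuously monitored by multiple sensors such as temperature, pressure, vibration, and current.
- If small anomalous patterns are not identified in time, unnecessary shutdowns, quality degradation, and safety risks increase.
- Develop a model that classifies the normal/abnormal operating type of equipment from its sensor data.

[Goal]
: Design a field-ready diagnostic tool that captures relationships between sensors and subtle changes to support fast inspection and maintenance.

[Topic]
: Abnormal-operation diagnosis and classification based on anomaly signal detection

[Description]
- Key equipment records its state through [multiple sensor types such as temperature, pressure, vibration, and current].
- A model that classifies normal/abnormal types using only "de-identified data (X_01, X_02, etc.)" in a black-box setting where the domain meaning is hidden.

Commentary on this task:
        https://dacon.io/edu/527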

In [28]:
# Develop a model that classifies the normal/abnormal operating type of equipment from its sensor data

# Basics
import os
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing / evaluation
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             precision_recall_fscore_support, confusion_matrix,
                             classification_report)
# ML / DL
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, BatchNormalization, MaxPooling1D, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

In [29]:
test = pd.read_csv('/home/alpaco/sryang/DACON/test.csv')
train = pd.read_csv('/home/alpaco/sryang/DACON/train.csv')

In [30]:
df_test = test
df_test.head(), df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15004 entries, 0 to 15003
Data columns (total 53 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      15004 non-null  object 
 1   X_01    15004 non-null  float64
 2   X_02    15004 non-null  float64
 3   X_03    15004 non-null  float64
 4   X_04    15004 non-null  float64
 5   X_05    15004 non-null  float64
 6   X_06    15004 non-null  float64
 7   X_07    15004 non-null  float64
 8   X_08    15004 non-null  float64
 9   X_09    15004 non-null  float64
 10  X_10    15004 non-null  float64
 11  X_11    15004 non-null  float64
 12  X_12    15004 non-null  float64
 13  X_13    15004 non-null  float64
 14  X_14    15004 non-null  float64
 15  X_15    15004 non-null  float64
 16  X_16    15004 non-null  float64
 17  X_17    15004 non-null  float64
 18  X_18    15004 non-null  float64
 19  X_19    15004 non-null  float64
 20  X_20    15004 non-null  float64
 21  X_21    15004 non-null  float64
 22

(           ID   X_01      X_02      X_03      X_04      X_05      X_06  \
 0  TEST_00000  0.027  0.248234  0.521686  0.507419  0.391153  0.583795   
 1  TEST_00001  0.021  0.237060  0.537939  0.545298  0.359449  0.657034   
 2  TEST_00002  0.020  0.244556  0.541783  0.511458  0.380849  0.673393   
 3  TEST_00003  0.011  0.241627  0.600781  0.514907  0.374210  0.618073   
 4  TEST_00004  0.019  0.251017  0.504123  0.512723  0.378423  0.614282   
 
        X_07      X_08      X_09  ...      X_43      X_44      X_45      X_46  \
 0  0.663798  0.501200  0.571666  ...  0.260703  0.428539  0.583749  0.746367   
 1  0.647725  0.501224  0.586882  ...  0.253675  0.374611  0.657051  0.768609   
 2  0.649568  0.485117  0.565430  ...  0.262817  0.442951  0.673385  0.750324   
 3  0.668874  0.494310  0.584442  ...  0.262562  0.428725  0.618055  0.748490   
 4  0.644375  0.456430  0.553999  ...  0.263064  0.442768  0.614234  0.751743   
 
        X_47      X_48      X_49      X_50      X_51      X_

In [31]:
df_train = train
df_train.head(), df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21693 entries, 0 to 21692
Data columns (total 54 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      21693 non-null  object 
 1   X_01    21693 non-null  float64
 2   X_02    21693 non-null  float64
 3   X_03    21693 non-null  float64
 4   X_04    21693 non-null  float64
 5   X_05    21693 non-null  float64
 6   X_06    21693 non-null  float64
 7   X_07    21693 non-null  float64
 8   X_08    21693 non-null  float64
 9   X_09    21693 non-null  float64
 10  X_10    21693 non-null  float64
 11  X_11    21693 non-null  float64
 12  X_12    21693 non-null  float64
 13  X_13    21693 non-null  float64
 14  X_14    21693 non-null  float64
 15  X_15    21693 non-null  float64
 16  X_16    21693 non-null  float64
 17  X_17    21693 non-null  float64
 18  X_18    21693 non-null  float64
 19  X_19    21693 non-null  float64
 20  X_20    21693 non-null  float64
 21  X_21    21693 non-null  float64
 22

(            ID   X_01      X_02      X_03      X_04      X_05      X_06  \
 0  TRAIN_00000  0.016  0.242994  0.538536  0.522295  0.374494  0.555348   
 1  TRAIN_00001  0.019  0.240380  0.517223  0.538976  0.371149  0.693825   
 2  TRAIN_00002  0.012  0.248946  0.547109  0.466713  0.415830  0.656887   
 3  TRAIN_00003  0.013  0.245877  0.527870  0.515534  0.379199  0.594391   
 4  TRAIN_00004  0.024  0.239237  0.566087  0.514384  0.378451  0.610543   
 
        X_07      X_08      X_09  ...      X_44      X_45      X_46      X_47  \
 0  0.639091  0.494800  0.584233  ...  0.435885  0.555359  0.751714  0.376801   
 1  0.663667  0.530931  0.577200  ...  0.479859  0.693855  0.748955  0.356118   
 2  0.681782  0.580773  0.527069  ...  0.416115  0.656884  0.750059  0.417200   
 3  0.663816  0.494931  0.581796  ...  0.436761  0.594364  0.746297  0.374659   
 4  0.644811  0.508567  0.593614  ...  0.422407  0.610526  0.749565  0.372742   
 
        X_48      X_49      X_50      X_51      X_52  

In [32]:
df_test.columns

Index(['ID', 'X_01', 'X_02', 'X_03', 'X_04', 'X_05', 'X_06', 'X_07', 'X_08',
       'X_09', 'X_10', 'X_11', 'X_12', 'X_13', 'X_14', 'X_15', 'X_16', 'X_17',
       'X_18', 'X_19', 'X_20', 'X_21', 'X_22', 'X_23', 'X_24', 'X_25', 'X_26',
       'X_27', 'X_28', 'X_29', 'X_30', 'X_31', 'X_32', 'X_33', 'X_34', 'X_35',
       'X_36', 'X_37', 'X_38', 'X_39', 'X_40', 'X_41', 'X_42', 'X_43', 'X_44',
       'X_45', 'X_46', 'X_47', 'X_48', 'X_49', 'X_50', 'X_51', 'X_52'],
      dtype='object')

In [33]:
df_train.columns

Index(['ID', 'X_01', 'X_02', 'X_03', 'X_04', 'X_05', 'X_06', 'X_07', 'X_08',
       'X_09', 'X_10', 'X_11', 'X_12', 'X_13', 'X_14', 'X_15', 'X_16', 'X_17',
       'X_18', 'X_19', 'X_20', 'X_21', 'X_22', 'X_23', 'X_24', 'X_25', 'X_26',
       'X_27', 'X_28', 'X_29', 'X_30', 'X_31', 'X_32', 'X_33', 'X_34', 'X_35',
       'X_36', 'X_37', 'X_38', 'X_39', 'X_40', 'X_41', 'X_42', 'X_43', 'X_44',
       'X_45', 'X_46', 'X_47', 'X_48', 'X_49', 'X_50', 'X_51', 'X_52',
       'target'],
      dtype='object')
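
The two column listings above differ only by `target`. A quick check along these lines can confirm that before the features are split off. It is a minimal sketch: the column lists are rebuilt from the printed names rather than read from the CSV files, and the variable names are illustrative.

```python
# Rebuild the column names printed above (X_01 ... X_52) instead of reading the CSVs.
feature_cols = [f"X_{i:02d}" for i in range(1, 53)]
train_cols = ["ID"] + feature_cols + ["target"]
test_cols = ["ID"] + feature_cols

# Columns present in one frame but missing from the other.
only_in_train = [c for c in train_cols if c not in test_cols]
only_in_test = [c for c in test_cols if c not in train_cols]

print("only in train:", only_in_train)
print("only in test:", only_in_test)
```

With the real frames, `list(df_train.columns)` and `list(df_test.columns)` would take the place of the two hand-built lists.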

In [34]:
df_train['target'], df_train['target'].value_counts()

(0         0
 1        20
 2         1
 3        19
 4        15
          ..
 21688    17
 21689     0
 21690     5
 21691     3
 21692    17
 Name: target, Length: 21693, dtype: int64,
 target
 0     1033
 20    1033
 1     1033
 19    1033
 15    1033
 8     1033
 16    1033
 12    1033
 14    1033
 18    1033
 3     1033
 4     1033
 5     1033
 11    1033
 13    1033
 6     1033
 10    1033
 2     1033
 9     1033
 17    1033
 7     1033
 Name: count, dtype: int64)
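
The counts above show 21 classes with 1033 rows each, so the training set is perfectly balanced. The sketch below turns that into a reference number: the accuracy of always predicting a single class. The `target` series here is a synthetic stand-in built from the printed counts, not the real column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_train['target']: 21 classes x 1033 rows each.
target = pd.Series(np.repeat(np.arange(21), 1033), name="target")

n_classes = target.nunique()
class_share = target.value_counts(normalize=True)

# Accuracy obtained by always predicting the most common class.
majority_baseline = class_share.max()

print(f"classes: {n_classes}, rows: {len(target)}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Because the classes are balanced, this baseline equals 1/21, which is the figure any validation accuracy should be compared against.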

In [35]:
print(df_train.isnull().sum().to_numpy())
print(df_test.isnull().sum().to_numpy())

print(df_train.duplicated().sum())
print(df_test.duplicated().sum())

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0
0


In [36]:
train_x = train.drop(columns=['ID', 'target'])
train_y = train['target']
test_x = test.drop(columns=['ID'])

train_x.shape, test_x.shape, train_y.shape  
# DataFrame, DataFrame, Series

((21693, 52), (15004, 52), (21693,))

In [37]:
model = RandomForestClassifier(random_state=42)

In [38]:
train_y = df_train['target']
print("target unique:", train_y.unique())


target unique: [ 0 20  1 19 15  8 16 12 14 18  3  4  5 11 13  6 10  2  9 17  7]


In [39]:


def build_model():
    model = Sequential()
    # each row is treated as a length-52 sequence (one step per sensor column) with 1 channel
    model.add(Conv1D(filters=64, kernel_size=6, activation='relu', 
                    padding='same', input_shape=(52, 1)))
    model.add(BatchNormalization())
    
    # adding a pooling layer
    model.add(MaxPooling1D(pool_size=3, strides=2, padding='same'))
    
    model.add(Conv1D(filters=64, kernel_size=6, activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=3, strides=2, padding='same'))
    
    model.add(Conv1D(filters=64, kernel_size=6, activation='relu', padding='same'))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=3, strides=2, padding='same'))
    
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(21, activation='softmax'))   # multi-class: 21 target classes (0-20)
    
    # target holds integer labels, so use the sparse variant of categorical cross-entropy
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = build_model()
model.summary()

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
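
Keras `Conv1D` layers take input of shape (samples, steps, channels), so the 52-column feature table needs a trailing channel axis before it can be fed to a 1D CNN. The sketch below shows one way to prepare it. `StandardScaler` is imported at the top but never used in this notebook, so scaling here is an assumption, and the arrays are random stand-ins for the real split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-ins for a train/validation split of the 52 sensor columns.
rng = np.random.default_rng(42)
X_tr = rng.random((100, 52))
X_va = rng.random((20, 52))

# Fit the scaler on the training part only, then reuse it for validation.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_va_scaled = scaler.transform(X_va)

# Add a channel axis: (samples, 52) -> (samples, 52, 1).
X_tr_cnn = X_tr_scaled[..., np.newaxis]
X_va_cnn = X_va_scaled[..., np.newaxis]

print(X_tr_cnn.shape, X_va_cnn.shape)
```

Fitting the scaler on the training rows alone keeps validation statistics out of preprocessing. With DataFrames, `scaler.fit_transform(X_train)` returns a NumPy array, so the same indexing applies.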


In [40]:
X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, test_size=0.2, random_state=42) 
classifi = RandomForestClassifier(random_state=42, n_estimators = 300, n_jobs = -1)
classifi.fit(X_train, y_train)

preds = classifi.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, preds))

Validation Accuracy: 0.7711454252131827
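
The score above comes from a single 80/20 split, so it can shift with the random seed. `StratifiedKFold` and `f1_score` are imported at the top but unused. The sketch below shows how they could give a more stable estimate. It runs on synthetic data from `make_classification` with a smaller forest to stay fast, and reporting macro F1 is an assumption, not something the notebook specifies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for train_x / train_y (assumption: 52 features, 21 classes).
X, y = make_classification(
    n_samples=1050, n_features=52, n_informative=20,
    n_classes=21, n_clusters_per_class=1, random_state=42,
)

# Each fold keeps the class proportions of the full data set.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_acc, fold_f1 = [], []

for train_idx, val_idx in skf.split(X, y):
    clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    fold_preds = clf.predict(X[val_idx])
    fold_acc.append(accuracy_score(y[val_idx], fold_preds))
    fold_f1.append(f1_score(y[val_idx], fold_preds, average="macro"))

print(f"accuracy: {np.mean(fold_acc):.4f} +/- {np.std(fold_acc):.4f}")
print(f"macro F1: {np.mean(fold_f1):.4f} +/- {np.std(fold_f1):.4f}")
```

With the real DataFrames, the row selection would use `train_x.iloc[train_idx]` and `train_y.iloc[train_idx]` instead of plain array indexing.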


In [42]:
pred_test = classifi.predict(test_x)
sub = pd.DataFrame({"ID": test["ID"], "target": pred_test})  # use the test IDs, not the built-in id()

out_path = "/home/alpaco/sryang/DACON/submission_v1.csv"
sub.to_csv(out_path, index=False)
print(f"Saved: {out_path}")

Saved: /home/alpaco/sryang/DACON/submission_v1.csv
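
Before uploading, it can help to reload the saved file and confirm its shape, column names, and label range. The sketch below does this on a tiny synthetic submission written to a temporary directory. The IDs, predictions, and file name are placeholders, and the 0 to 20 label range is taken from the `target` values printed earlier.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-ins (assumptions): 10 test IDs and 10 predicted labels.
test_ids = pd.Series([f"TEST_{i:05d}" for i in range(10)], name="ID")
pred_test = np.array([0, 20, 1, 19, 15, 8, 16, 12, 14, 18])

sub = pd.DataFrame({"ID": test_ids, "target": pred_test})

# Write to a temporary file and read it back, as a round-trip check.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "submission_check.csv")
    sub.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

checks = {
    "row_count_matches": len(reloaded) == len(test_ids),
    "columns_match": list(reloaded.columns) == ["ID", "target"],
    "ids_unique": bool(reloaded["ID"].is_unique),
    "labels_in_range": bool(reloaded["target"].between(0, 20).all()),
}
print(checks)
```

For the real submission, the expected row count would be the 15004 rows of the test set.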
