# Semi_defect_finder
A notebook that builds a binary classifier for **defect status (defective/normal)** from semiconductor process and metrology data.

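- EDA → preprocessing → AutoML modeling (A/B) → evaluation → interpretation → operations (model saving, schema, drift)
- **Scenario A**: includes `Etch_Depth` (may behave like a post-hoc judgment)
- **Scenario B**: excludes `Etch_Depth` (pre-/in-process judgment based on process conditions and equipment state)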
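
The two scenarios differ only in whether `Etch_Depth` is part of the feature set. A minimal sketch of that branching follows; the helper name `scenario_features` and the constant `ALL_FEATURES` are hypothetical, and the column list assumes the feature columns that remain after the ID, time, and label-duplicate columns are dropped.

```python
from typing import List

# Assumed feature columns (taken from the data dictionary in this notebook).
ALL_FEATURES: List[str] = [
    "Tool_Type", "Chamber_Temperature", "Gas_Flow_Rate", "RF_Power",
    "Etch_Depth", "Rotation_Speed", "Vacuum_Pressure",
    "Stage_Alignment_Error", "Vibration_Level",
    "UV_Exposure_Intensity", "Particle_Count",
]


def scenario_features(scenario: str, etch_col: str = "Etch_Depth") -> List[str]:
    """Return the features for scenario 'A' (with etch_col) or 'B' (without it)."""
    if scenario == "A":
        return list(ALL_FEATURES)
    if scenario == "B":
        return [c for c in ALL_FEATURES if c != etch_col]
    raise ValueError(f"Unknown scenario: {scenario!r}")


features_a = scenario_features("A")
features_b = scenario_features("B")
print(len(features_a), len(features_b))
```

Keeping the branch in one function makes it harder for the two scenarios to drift apart in any column other than `Etch_Depth`.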

> Per the requirements, `Process_ID` and `Timestamp` are **removed from the features** (if they are needed only for splitting, use them through the time-aware option).
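
One way to read "use `Timestamp` for splitting only" is to sort by time, hold out the most recent rows, and drop the ID and time columns afterwards. The sketch below illustrates this on toy data; the function name `time_ordered_split` is hypothetical and this is not necessarily how the notebook's time-aware option is implemented.

```python
import pandas as pd


def time_ordered_split(df, time_col="Timestamp", test_size=0.2,
                       drop_cols=("Process_ID", "Timestamp")):
    """Hold out the most recent rows as the test set, then drop ID/time columns."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_test = max(1, int(round(len(ordered) * test_size)))
    cut = len(ordered) - n_test
    to_drop = [c for c in drop_cols if c in ordered.columns]
    train = ordered.iloc[:cut].drop(columns=to_drop)
    test = ordered.iloc[cut:].drop(columns=to_drop)
    return train, test


# Toy data for illustration only (not the notebook's dataset).
toy = pd.DataFrame({
    "Process_ID": [f"P{i}" for i in range(10)],
    "Timestamp": pd.date_range("2024-01-01", periods=10, freq="D")[::-1],
    "RF_Power": range(10),
    "Defect": [0, 1] * 5,
})
train_df, test_df = time_ordered_split(toy)
print(train_df.shape, test_df.shape)
```

The time column orders the rows but never reaches the model, which matches the requirement quoted above.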

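[Data]
| Column | Data type (example) | Description | EDA focus |
| --- | --- | --- | --- |
| **Process_ID** | String | Unique ID of the process run | Check for duplicates and identify the unit of work |
| **Timestamp** | DateTime | Time the record was logged | Process stability and trends over time |
| **Tool_Type** | Categorical | Type of process equipment (Lithography, etc.) | Differences in defect rate across tools |
| **Wafer_ID** | String | Unique wafer identifier | Check whether a specific wafer lot is problematic |
| **Chamber_Temperature** | Float | Temperature inside the chamber | Relationship between temperature and defects (upper/lower thresholds) |
| **Gas_Flow_Rate** | Float | Flow rate of the injected gas | Effect of gas-supply stability on quality |
| **RF_Power** | Float | Radio-frequency power level | Relationship between energy level and etch/deposition quality |
| **Etch_Depth** | Float | Etching depth | Deviation from the target depth |
| **Rotation_Speed** | Float | Equipment/wafer rotation speed (RPM) | Relationship between rotation uniformity and coating/etch uniformity |
| **Vacuum_Pressure** | Float | Vacuum pressure inside the chamber | Particle generation and process errors caused by pressure changes |
| **Stage_Alignment_Error** | Float | Stage alignment error | Relationship between precision errors and defects |
| **Vibration_Level** | Float | Fine vibration level of the equipment | Effect of vibration on process precision |
| **UV_Exposure_Intensity** | Float | UV exposure intensity | Pattern defects caused by under- or over-exposure |
| **Particle_Count** | Integer | Number of fine particles | Direct link between contamination and defects |
| **Defect** | Binary (0, 1) | Whether a defect occurred (target) | Dependent variable of the classifier (0: normal, 1: defective) |
| **Join_Status** | Categorical | Final process pass status | Final judgment (Joining: pass, Non-Joining: fail) |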

In [114]:
# =========================================
# 0) CONFIG (edit only this block to adapt the notebook)
# =========================================
CONFIG = {
    # Data
    "csv_path": "data/semiconductor_quality_control.csv",

    # Target candidates (searched in priority order)
    "target_candidates": ["Defect"],

    # Column candidates used to separate records by process (tool)
    "tool_type_candidates": ["Tool_Type", "tool_type"],

    # Set to True to use Join_Status as the target (includes checks for overlap/leakage with Defect)
    "use_join_status_as_target": False,
    "join_status_col_candidates": ["Join_Status"],
    "join_status_positive_values": ["Non-Joining", "FAIL", "Fail", "NG", "Bad", "Defect", "1"],  # edit as needed

    # Leakage-prone, ID, and time columns (removed from the features)
    "drop_cols_always": ["Process_ID", "Timestamp"],

    # Etch_Depth candidates (branch point for scenarios A/B)
    "etch_depth_candidates": ["Etch_Depth"],

    # Group-split priority: wafer/lot level
    "group_candidates": ["Wafer_ID", "wafer_id", "Lot_ID", "lot_id", "die_id", "Die_ID"],

    # Time-split column candidates (used for splitting only)
    "time_candidates": ["Timestamp", "timestamp", "DateTime", "datetime", "time", "Time"],

    # Model/validation
    "random_state": 42,
    "test_size": 0.2,
    "cv_splits": 5,
    "n_iter_search": 30,
    "scoring_primary": "average_precision",  # PR-AUC (AP)
    "precision_constraint": 0.90,            # minimum precision when selecting the threshold
    "calibration": None,                     # None / "sigmoid" / "isotonic"

    # Preprocessing options
    "use_iterative_imputer": False,          # set to True to use IterativeImputer
    "winsorize_limits": (0.01, 0.01),        # winsorize the lower and upper 1%
    "use_isolation_forest": False,           # optional (as a rule, filter the training set only)

    # Output/saving
    "model_output_dir": "./artifacts",
    "model_name_prefix": "semi_defect_model",
}
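
The `*_candidates` entries in `CONFIG` are lists meant to be searched in priority order. A minimal sketch of such a lookup follows; `first_present` is a hypothetical helper name and the column list is a shortened stand-in for `df.columns`.

```python
from typing import Iterable, Optional, Sequence


def first_present(candidates: Sequence[str], columns: Iterable[str]) -> Optional[str]:
    """Return the first candidate name that exists in columns, or None."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    return None


# Shortened stand-in for df.columns (illustrative only).
cols = ["Process_ID", "Timestamp", "Tool_Type", "Wafer_ID", "Etch_Depth", "Defect"]

resolved_target = first_present(["Defect"], cols)
resolved_group = first_present(["Wafer_ID", "wafer_id", "Lot_ID", "lot_id"], cols)
resolved_missing = first_present(["Lot_ID", "lot_id"], cols)
print(resolved_target, resolved_group, resolved_missing)
```

Returning `None` instead of raising lets the caller decide whether a missing column is an error (the target) or simply disables an option (group or time splitting).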



In [115]:
# =========================================
# 1) Imports (preprocess & modeling)
# =========================================
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from dataclasses import dataclass
from typing import List, Optional

from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit, RandomizedSearchCV, train_test_split
from sklearn.metrics import (
    average_precision_score, precision_recall_curve, roc_auc_score,
    f1_score, precision_score, recall_score, confusion_matrix,
    ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.svm import SVC

from sklearn.base import BaseEstimator, TransformerMixin
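
`BaseEstimator` and `TransformerMixin` are the usual base classes for a custom preprocessing step, and `CONFIG["winsorize_limits"]` suggests one is intended for winsorization. The sketch below is one possible shape for it, implemented as quantile clipping; the class name `QuantileClipper` is hypothetical and is not taken from the notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Winsorize columns by clipping to quantiles learned during fit."""

    def __init__(self, limits=(0.01, 0.01)):
        self.limits = limits

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        lower_q, upper_q = self.limits
        # Per-column bounds, computed on the training data only.
        self.lower_ = np.nanquantile(X, lower_q, axis=0)
        self.upper_ = np.nanquantile(X, 1.0 - upper_q, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_, self.upper_)


# Synthetic data with one injected outlier (illustrative only).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 2))
X_train[0, 0] = 50.0

clipper = QuantileClipper(limits=(0.01, 0.01)).fit(X_train)
X_clipped = clipper.transform(X_train)
print(X_train.max(), X_clipped.max())
```

Because the bounds are learned in `fit`, placing the step inside a `Pipeline` keeps validation and test rows from influencing the clipping thresholds.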



In [116]:
# =========================================
# 2) Load data & basic checks
# =========================================

df = pd.read_csv(CONFIG["csv_path"])
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())

target_col = "Defect"
tool_col = "Tool_Type"


print("Target col:", target_col)
print("Tool_Type col:", tool_col)

# Check the missing ratio of every column
print("Missing ratio (all columns):")
print(df.isna().mean().sort_values(ascending=False).head(16))


Shape: (4219, 16)
Columns: ['Process_ID', 'Timestamp', 'Tool_Type', 'Wafer_ID', 'Chamber_Temperature', 'Gas_Flow_Rate', 'RF_Power', 'Etch_Depth', 'Rotation_Speed', 'Vacuum_Pressure', 'Stage_Alignment_Error', 'Vibration_Level', 'UV_Exposure_Intensity', 'Particle_Count', 'Defect', 'Join_Status']
Target col: Defect
Tool_Type col: Tool_Type
Missing ratio (all columns):
Process_ID               0.0
Timestamp                0.0
Tool_Type                0.0
Wafer_ID                 0.0
Chamber_Temperature      0.0
Gas_Flow_Rate            0.0
RF_Power                 0.0
Etch_Depth               0.0
Rotation_Speed           0.0
Vacuum_Pressure          0.0
Stage_Alignment_Error    0.0
Vibration_Level          0.0
UV_Exposure_Intensity    0.0
Particle_Count           0.0
Defect                   0.0
Join_Status              0.0
dtype: float64


In [117]:
# =========================================
# 3) Distribution and defect rate by Tool_Type
# =========================================

tool_summary = (df.groupby(tool_col)[target_col].agg(count='size', defect_rate='mean').sort_values('count', ascending=False))
print(tool_summary)

# Warn about Tool_Type values with too few samples
min_n = 100
small_tools = tool_summary[tool_summary['count'] < min_n]
if not small_tools.empty:
    print("\n[Warning] Tool_Type values with few samples:")
    print(small_tools)


             count  defect_rate
Tool_Type                      
Etching       1418     0.143159
Deposition    1416     0.153955
Lithography   1385     0.141516


In [118]:
# =========================================
# Agreement rate between Defect and Join_Status (reloads the raw file)
# =========================================
df_raw = pd.read_csv(CONFIG["csv_path"])

js_bin = df_raw["Join_Status"].astype(str).str.strip().str.lower().map(
    {"non-joining": 1, "joining": 0}
)

valid = js_bin.notna()
agree = (df_raw.loc[valid, "Defect"].astype(int).values == js_bin.loc[valid].values).mean()

print(f"Agreement rate: {agree:.4f}")


Agreement rate: 1.0000
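
An agreement rate of 1.0 means `Join_Status` is a re-encoding of `Defect`, so keeping it as a feature would leak the label. The same idea can be applied to any categorical column, as in the sketch below; the helper name `perfectly_predictive_columns` is hypothetical and the data are a toy example.

```python
import pandas as pd


def perfectly_predictive_columns(df, target_col, candidate_cols):
    """Return columns whose every distinct value maps to a single target value."""
    flagged = []
    for col in candidate_cols:
        n_targets_per_value = df.groupby(col)[target_col].nunique()
        if (n_targets_per_value <= 1).all():
            flagged.append(col)
    return flagged


# Toy data for illustration only.
toy = pd.DataFrame({
    "Join_Status": ["Joining", "Non-Joining", "Joining", "Non-Joining"],
    "Tool_Type": ["Etching", "Etching", "Deposition", "Deposition"],
    "Defect": [0, 1, 0, 1],
})
leaky = perfectly_predictive_columns(toy, "Defect", ["Join_Status", "Tool_Type"])
print(leaky)
```

This check only makes sense for low-cardinality columns: a unique ID column is trivially flagged because each value occurs once.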


In [119]:
# =========================================
# 3.5) Drop ID, time, and label-duplicate columns (before splitting the raw data)
# =========================================

cols_to_drop = [c for c in ["Process_ID", "Timestamp","Join_Status","Wafer_ID"] if c in df.columns]
if cols_to_drop:
    df = df.drop(columns=cols_to_drop)
    print("Dropped columns:", cols_to_drop)
else:
    print("No Process_ID/Timestamp/Join_Status/Wafer_ID columns found.")


Dropped columns: ['Process_ID', 'Timestamp', 'Join_Status', 'Wafer_ID']


In [120]:
df.head()

| | Tool_Type | Chamber_Temperature | Gas_Flow_Rate | RF_Power | Etch_Depth | Rotation_Speed | Vacuum_Pressure | Stage_Alignment_Error | Vibration_Level | UV_Exposure_Intensity | Particle_Count | Defect |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Lithography | 74.077728 | 56.527432 | 324.281923 | 554.358076 | 1397.936121 | 0.549974 | 2.147302 | 0.009007 | 128.419361 | 118 | 0 |
| 1 | Deposition | 74.341499 | 39.350802 | 364.527083 | 493.382895 | 1433.488274 | 0.407351 | 2.970405 | 0.007927 | 109.449848 | 525 | 0 |
| 2 | Deposition | 74.626094 | 38.181393 | 314.257182 | 589.544476 | 1311.34543 | 0.480282 | 1.310555 | 0.008856 | 161.172686 | 729 | 0 |
| 3 | Etching | 79.467364 | 47.569284 | 301.464082 | 488.986118 | 1342.92897 | 0.49294 | 1.56459 | 0.009416 | 133.259454 | 178 | 0 |
| 4 | Lithography | 76.221205 | 59.152873 | 289.702098 | 458.012763 | 1785.025252 | 0.557101 | 2.338089 | 0.00959 | 117.129348 | 514 | 1 |


In [121]:
# =========================================
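# 4) Prepare Tool_Type for per-tool splitting: normalize and integer-encode
# =========================================

# Normalize the strings (cast to str, trim whitespace, lowercase)
df["Tool_Type"] = df["Tool_Type"].astype(str).str.strip().str.lower()

# Map each tool name to an integer code; the codes are labels, not an ordering
tool_map = {"etching": 1, "deposition": 2, "lithography": 3}
df["Tool_Type"] = df["Tool_Type"].map(tool_map)

# Number of rows whose Tool_Type was not in tool_map (expected: 0)
print(df["Tool_Type"].isna().sum())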

df.head()



0


| | Tool_Type | Chamber_Temperature | Gas_Flow_Rate | RF_Power | Etch_Depth | Rotation_Speed | Vacuum_Pressure | Stage_Alignment_Error | Vibration_Level | UV_Exposure_Intensity | Particle_Count | Defect |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3 | 74.077728 | 56.527432 | 324.281923 | 554.358076 | 1397.936121 | 0.549974 | 2.147302 | 0.009007 | 128.419361 | 118 | 0 |
| 1 | 2 | 74.341499 | 39.350802 | 364.527083 | 493.382895 | 1433.488274 | 0.407351 | 2.970405 | 0.007927 | 109.449848 | 525 | 0 |
| 2 | 2 | 74.626094 | 38.181393 | 314.257182 | 589.544476 | 1311.34543 | 0.480282 | 1.310555 | 0.008856 | 161.172686 | 729 | 0 |
| 3 | 1 | 79.467364 | 47.569284 | 301.464082 | 488.986118 | 1342.92897 | 0.49294 | 1.56459 | 0.009416 | 133.259454 | 178 | 0 |
| 4 | 3 | 76.221205 | 59.152873 | 289.702098 | 458.012763 | 1785.025252 | 0.557101 | 2.338089 | 0.00959 | 117.129348 | 514 | 1 |


In [123]:
# =========================================
#  Convert Stage_Alignment_Error and Vibration_Level to absolute values
# =========================================

df["Stage_Alignment_Error"] = df["Stage_Alignment_Error"].abs()
df["Vibration_Level"] = df["Vibration_Level"].abs()
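
Taking absolute values discards the sign, so it can help to record how many values were negative before the transform. A minimal sketch on toy data follows; `negative_share` is a hypothetical helper name.

```python
import pandas as pd


def negative_share(df, cols):
    """Fraction of strictly negative values in each of the given columns."""
    return (df[list(cols)] < 0).mean()


# Toy data for illustration only.
toy = pd.DataFrame({
    "Stage_Alignment_Error": [-1.0, 2.0, 0.5, -0.2],
    "Vibration_Level": [0.01, 0.02, -0.01, 0.03],
})
shares = negative_share(toy, ["Stage_Alignment_Error", "Vibration_Level"])
print(shares)
```

If the share is large for a column, the sign may carry information (for example, the direction of an alignment offset) and a signed feature could be kept alongside the magnitude.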


In [124]:
# =========================================
# Basic EDA
# =========================================

# 1) Basic information
print("Shape:", df.shape)
print("\nDtypes:")
print(df.dtypes)

# 2) Missing values
print("\nMissing ratio (top 10):")
print(df.isna().mean().sort_values(ascending=False).head(10))

# 3) Target distribution
print("\nDefect rate:")
print(df["Defect"].value_counts(normalize=True))

# 4) Summary statistics of the numeric columns
num_cols = df.select_dtypes(include=[np.number]).columns
print("\nNumeric summary:")
display(df[num_cols].describe())

# 5) Pearson correlation with Defect, sorted in descending order
#    (Tool_Type is an integer code for a nominal category, so its value here is not meaningful)
corr = df[num_cols].corr()["Defect"].sort_values(ascending=False)
print("\nCorrelation with Defect:")
print(corr)

# 6) Defect rate by Tool_Type
if "Tool_Type" in df.columns:
    tool_summary = df.groupby("Tool_Type")["Defect"].agg(count="size", defect_rate="mean")
    print("\nTool_Type summary:")
    print(tool_summary)


Shape: (4219, 12)

Dtypes:
Tool_Type                  int64
Chamber_Temperature      float64
Gas_Flow_Rate            float64
RF_Power                 float64
Etch_Depth               float64
Rotation_Speed           float64
Vacuum_Pressure          float64
Stage_Alignment_Error    float64
Vibration_Level          float64
UV_Exposure_Intensity    float64
Particle_Count             int64
Defect                     int64
dtype: object

Missing ratio (top 10):
Tool_Type                0.0
Chamber_Temperature      0.0
Gas_Flow_Rate            0.0
RF_Power                 0.0
Etch_Depth               0.0
Rotation_Speed           0.0
Vacuum_Pressure          0.0
Stage_Alignment_Error    0.0
Vibration_Level          0.0
UV_Exposure_Intensity    0.0
dtype: float64

Defect rate:
Defect
0    0.853757
1    0.146243
Name: proportion, dtype: float64

Numeric summary:


| | Tool_Type | Chamber_Temperature | Gas_Flow_Rate | RF_Power | Etch_Depth | Rotation_Speed | Vacuum_Pressure | Stage_Alignment_Error | Vibration_Level | UV_Exposure_Intensity | Particle_Count | Defect |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 | 4219.0 |
| mean | 1.992178 | 75.077152 | 49.936138 | 301.363854 | 498.669397 | 1504.745316 | 0.50066 | 2.005564 | 0.01016 | 119.92993 | 555.856838 | 0.146243 |
| std | 0.815151 | 5.001785 | 10.118648 | 49.478589 | 101.450988 | 202.833239 | 0.049935 | 0.810941 | 0.0048 | 15.057384 | 261.977363 | 0.353392 |
| min | 1.0 | 55.718123 | 5.343961 | 127.597854 | 157.862038 | 859.588353 | 0.346181 | 0.004709 | 1.3e-05 | 69.040348 | 100.0 | 0.0 |
| 25% | 1.0 | 71.709006 | 42.950962 | 267.563073 | 430.825513 | 1368.01513 | 0.46718 | 1.468576 | 0.006732 | 109.977057 | 327.0 | 0.0 |
| 50% | 2.0 | 75.088763 | 49.944598 | 301.023758 | 500.385843 | 1506.984248 | 0.500761 | 2.012931 | 0.010114 | 119.904629 | 560.0 | 0.0 |
| 75% | 3.0 | 78.468876 | 57.010941 | 335.050447 | 566.789573 | 1640.389926 | 0.534421 | 2.554972 | 0.013398 | 130.103221 | 787.0 | 0.0 |
| max | 3.0 | 97.395421 | 86.02415 | 476.826583 | 874.537923 | 2199.022643 | 0.688008 | 4.537797 | 0.029149 | 170.990733 | 999.0 | 1.0 |



Correlation with Defect:
Defect                   1.000000
Chamber_Temperature      0.017555
Vibration_Level          0.007963
Etch_Depth               0.006065
Stage_Alignment_Error    0.000612
Tool_Type               -0.001789
Vacuum_Pressure         -0.002940
Gas_Flow_Rate           -0.007495
UV_Exposure_Intensity   -0.009095
RF_Power                -0.014321
Rotation_Speed          -0.019876
Particle_Count          -0.028406
Name: Defect, dtype: float64

Tool_Type summary:
           count  defect_rate
Tool_Type                    
1           1418     0.143159
2           1416     0.153955
3           1385     0.141516
