# **TabNet MI: Missing Values Removed**

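*   Input files:

1. 유전형인코딩_결측치_행_제거.csv (encoded genotypes, rows with missing values removed)
2. 2016년 표현형 데이터.xlsx (2016 phenotype data)
3. 상위N개_MI.csv (top-N SNPs by MI)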


In [1]:
# ✅ Install TabNet
!pip install pytorch-tabnet


In [2]:
import pandas as pd
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from google.colab import files

# ✅ Upload files
print("⬆️ Upload the genotype encoding file, the phenotype file, and the MI-based SNP files.")
uploaded = files.upload()

⬆️ Upload the genotype encoding file, the phenotype file, and the MI-based SNP files.


Saving 유전형인코딩_결측치_행_제거.csv to 유전형인코딩_결측치_행_제거.csv
Saving 상위2000개_MI.csv to 상위2000개_MI.csv
Saving 상위1000개_MI.csv to 상위1000개_MI.csv
Saving 상위500개_MI.csv to 상위500개_MI.csv
Saving 상위100개_MI.csv to 상위100개_MI.csv
Saving 상위50개_MI.csv to 상위50개_MI.csv
Saving 상위20개_MI.csv to 상위20개_MI.csv
Saving 2016년 표현형 데이터.xlsx to 2016년 표현형 데이터.xlsx


In [3]:
# ✅ File paths
geno_path = "유전형인코딩_결측치_행_제거.csv"
pheno_path = "2016년 표현형 데이터.xlsx"
mi_files = {
    "Top 20": "상위20개_MI.csv",
    "Top 50": "상위50개_MI.csv",
    "Top 100": "상위100개_MI.csv",
    "Top 500": "상위500개_MI.csv",
    "Top 1000": "상위1000개_MI.csv",
    "Top 2000": "상위2000개_MI.csv"
}

# ✅ Load genotype data (must be transposed so that samples are rows and SNPs are columns)
geno_df = pd.read_csv(geno_path, index_col=0).T
geno_df.index = geno_df.index.astype(str).str.strip()
geno_df.columns = geno_df.columns.astype(str).str.strip()

# ✅ Phenotype data (indexed by the "Genotype" sample ID column)
pheno_df = pd.read_excel(pheno_path)
pheno_df.columns = pheno_df.columns.str.strip()
pheno_df.set_index("Genotype", inplace=True)
pheno_df.index = pheno_df.index.astype(str).str.strip()
phenotypes = pheno_df.columns.tolist()

# ✅ Collect results
results = []
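
The cell above transposes the genotype matrix so that samples become rows, then strips whitespace from both indexes before they are matched. In current pandas, `.loc` raises a `KeyError` when any requested label is missing, so aligning on the shared sample IDs first can be safer. A minimal sketch with made-up data; the sample IDs, SNP names, and the `trait` column are all hypothetical:

```python
import pandas as pd

# Toy genotype matrix stored SNP-by-sample, with stray whitespace in the sample IDs.
geno_raw = pd.DataFrame(
    {" S1": [0, 1, 2], "S2 ": [1, 1, 0], "S3": [2, 0, 1]},
    index=["snp_a", "snp_b", "snp_c"],
)
geno = geno_raw.T  # samples become rows
geno.index = geno.index.astype(str).str.strip()

# Toy phenotype table: S2 has a missing value, S4 has no genotype.
pheno = pd.DataFrame({"trait": [10.0, None, 12.5]}, index=["S1", "S2", "S4"])

y_all = pheno["trait"].dropna()
shared = geno.index.intersection(y_all.index)  # samples present in both tables
X = geno.loc[shared]
y = y_all.loc[shared]
print(X.shape, y.shape)
```

Only `S1` survives here: `S2` is dropped for its missing phenotype and `S4` has no genotype row.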

In [4]:
# ✅ TabNet training loop: one model per (phenotype, SNP set) pair
for pheno in phenotypes:
    y_all = pheno_df[pheno].dropna()

    for label, file in mi_files.items():
        try:
            mi_df = pd.read_csv(file)
            mi_df["SNP"] = mi_df["SNP"].astype(str).str.strip()

            # Keep only the SNPs selected for this phenotype
            snps = mi_df[mi_df["Phenotype"] == pheno]["SNP"].tolist()
            snps = [s for s in snps if s in geno_df.columns]

            if len(snps) == 0:
                raise ValueError("no usable SNPs")

            # Align X and y on sample IDs
            X = geno_df[snps].loc[y_all.index].dropna()
            y = y_all.loc[X.index]

            # Standardize (note: the scaler is fit on all samples, before the split)
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)

            # 80/20 train/test split
            X_train, X_test, y_train, y_test = train_test_split(
                X_scaled, y, test_size=0.2, random_state=42
            )

            # Train TabNet (note: the test split doubles as the early-stopping eval_set)
            model = TabNetRegressor(verbose=0)
            model.fit(
                X_train=X_train, y_train=y_train.values.reshape(-1, 1),
                eval_set=[(X_test, y_test.values.reshape(-1, 1))],
                eval_metric=['rmse'],
                max_epochs=100,
                patience=10,
                batch_size=256,
                virtual_batch_size=128,
                num_workers=0,
                drop_last=False
            )

            # Evaluate on the test split
            y_pred = model.predict(X_test).flatten()
            rmse = np.sqrt(mean_squared_error(y_test, y_pred))
            r2 = r2_score(y_test, y_pred)

            results.append({
                "표현형": pheno,
                "SNP 개수": label,
                "R²": round(r2, 4),
                "RMSE": round(rmse, 4)
            })

        except Exception as e:
            print(f"❌ {pheno} - {label} error: {e}")
            # Record failed runs as NaN so the metric columns stay numeric
            results.append({
                "표현형": pheno,
                "SNP 개수": label,
                "R²": np.nan,
                "RMSE": np.nan
            })




Early stopping occurred at epoch 58 with best_epoch = 48 and best_val_0_rmse = 76.66087
Early stopping occurred at epoch 58 with best_epoch = 48 and best_val_0_rmse = 47.41489
Early stopping occurred at epoch 69 with best_epoch = 59 and best_val_0_rmse = 45.94399
Early stopping occurred at epoch 41 with best_epoch = 31 and best_val_0_rmse = 57.22556
Early stopping occurred at epoch 62 with best_epoch = 52 and best_val_0_rmse = 45.15551
Early stopping occurred at epoch 56 with best_epoch = 46 and best_val_0_rmse = 59.23724
Early stopping occurred at epoch 67 with best_epoch = 57 and best_val_0_rmse = 8.72892
Early stopping occurred at epoch 51 with best_epoch = 41 and best_val_0_rmse = 10.8719
Early stopping occurred at epoch 44 with best_epoch = 34 and best_val_0_rmse = 11.87581
Early stopping occurred at epoch 43 with best_epoch = 33 and best_val_0_rmse = 14.88653
Early stopping occurred at epoch 60 with best_epoch = 50 and best_val_0_rmse = 11.3125
Early stopping occurred at epoch 62 with best_epoch = 52 and best_val_0_rmse = 9.95434
Early stopping occurred at epoch 69 with best_epoch = 59 and best_val_0_rmse = 15.83568
Early stopping occurred at epoch 60 with best_epoch = 50 and best_val_0_rmse = 14.45873
Early stopping occurred at epoch 46 with best_epoch = 36 and best_val_0_rmse = 13.04639
Early stopping occurred at epoch 45 with best_epoch = 35 and best_val_0_rmse = 17.10711
Early stopping occurred at epoch 47 with best_epoch = 37 and best_val_0_rmse = 21.73287
Early stopping occurred at epoch 50 with best_epoch = 40 and best_val_0_rmse = 15.61012
Early stopping occurred at epoch 47 with best_epoch = 37 and best_val_0_rmse = 0.86761
Early stopping occurred at epoch 43 with best_epoch = 33 and best_val_0_rmse = 0.89701
Early stopping occurred at epoch 27 with best_epoch = 17 and best_val_0_rmse = 1.17213
Early stopping occurred at epoch 26 with best_epoch = 16 and best_val_0_rmse = 1.87522
Early stopping occurred at epoch 32 with best_epoch = 22 and best_val_0_rmse = 1.44323
Early stopping occurred at epoch 60 with best_epoch = 50 and best_val_0_rmse = 1.06543
Stop training because you reached max_epochs = 100 with best_epoch = 91 and best_val_0_rmse = 0.79429
Early stopping occurred at epoch 23 with best_epoch = 13 and best_val_0_rmse = 1.46525
Early stopping occurred at epoch 24 with best_epoch = 14 and best_val_0_rmse = 1.09168
Early stopping occurred at epoch 25 with best_epoch = 15 and best_val_0_rmse = 1.61024
Early stopping occurred at epoch 33 with best_epoch = 23 and best_val_0_rmse = 1.10663
Early stopping occurred at epoch 55 with best_epoch = 45 and best_val_0_rmse = 1.07744
Early stopping occurred at epoch 99 with best_epoch = 89 and best_val_0_rmse = 0.095
Early stopping occurred at epoch 19 with best_epoch = 9 and best_val_0_rmse = 0.32369
Early stopping occurred at epoch 80 with best_epoch = 70 and best_val_0_rmse = 0.10698
Early stopping occurred at epoch 10 with best_epoch = 0 and best_val_0_rmse = 0.5004
Early stopping occurred at epoch 66 with best_epoch = 56 and best_val_0_rmse = 0.12171
Early stopping occurred at epoch 14 with best_epoch = 4 and best_val_0_rmse = 0.28188
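
In the training loop above, `StandardScaler` is fit on every sample before `train_test_split`, and the test split is also passed as `eval_set` for early stopping, so the reported scores may be optimistic. One possible way to separate these roles is a three-way split with the scaler fit on the training part only. This is a sketch on synthetic data, not the notebook's method; the 60/20/20 proportions and the `_s` suffix for scaled arrays are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 5)).astype(float)  # synthetic 0/1/2 genotype codes
y = rng.normal(size=100)                             # synthetic phenotype

# Hold out the test set first, then carve a validation set out of the remainder.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42  # 0.25 * 80% = 20% of all samples
)

# Fit the scaler on the training rows only, then apply it to every split.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)    # would go to eval_set for early stopping
X_test_s = scaler.transform(X_test)  # touched once, for the final R² and RMSE

print(X_train_s.shape, X_val_s.shape, X_test_s.shape)
```

With this layout the validation arrays would take the place of the test arrays in `eval_set`, and the test arrays would only be used in the final `predict` call.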




In [5]:
# ✅ Show results
results_df = pd.DataFrame(results)
# pivot sorts the column labels lexicographically, so restore the order by SNP count
ordered_cols = ["Top 20", "Top 50", "Top 100", "Top 500", "Top 1000", "Top 2000"]
results_pivot = results_df.pivot(index="표현형", columns="SNP 개수", values="R²")[ordered_cols]
print("📊 TabNet performance comparison (R²):")
display(results_pivot)

📊 TabNet performance comparison (R²):

| 표현형 (phenotype) | Top 20 | Top 50 | Top 100 | Top 500 | Top 1000 | Top 2000 |
|---|---|---|---|---|---|---|
| 과실경도 (kg), fruit firmness | 0.2934 | -7.2024 | 0.1041 | -18.6036 | -0.1597 | -5.2207 |
| 과장 (mm), fruit length | 0.3734 | 0.0279 | -0.1599 | -0.8225 | -0.0525 | 0.1851 |
| 과중 (g), fruit weight | -0.6477 | 0.3697 | 0.4082 | 0.0819 | 0.4283 | 0.0162 |
| 과폭 (mm), fruit width | -0.0044 | 0.1627 | 0.3183 | -0.1721 | -0.8917 | 0.0241 |
| 과피두께 (mm), pericarp thickness | 0.214 | 0.1598 | -0.4346 | -2.6718 | -1.1749 | -0.1853 |
| 당도 (%), sugar content | 0.5529 | -0.5214 | 0.1555 | -0.8374 | 0.1322 | 0.1773 |
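
Several R² values in the table above are negative, some strongly so. This follows from the definition that `r2_score` implements, R² = 1 - SS_res / SS_tot: predicting the test-set mean for every sample gives 0, and any model whose squared error exceeds that baseline scores below 0. A small worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # mean 2.5, SS_tot = 5.0

# Baseline: always predict the mean, so SS_res equals SS_tot.
r2_mean = r2_score(y_true, np.full(4, y_true.mean()))

# Reversed predictions: SS_res = 9 + 1 + 1 + 9 = 20, four times SS_tot.
r2_bad = r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))

print(r2_mean, r2_bad)  # 0.0 -3.0
```

Read this way, a value such as -18.6 means the squared error on the test split was roughly 19.6 times that of the mean baseline.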


# **TabNet GWAS: Missing Values Removed**

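*   Input files:

1. 유전형인코딩_결측치_행_제거.csv (encoded genotypes, rows with missing values removed)
2. 2016년 표현형 데이터.xlsx (2016 phenotype data)
3. GWAS_SNP(N).csv (top-N SNPs from GWAS)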

In [6]:
# ✅ Libraries
import pandas as pd
import numpy as np
from pytorch_tabnet.tab_model import TabNetRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

In [7]:
from google.colab import files
uploaded = files.upload()

Saving 2016년 표현형 데이터.xlsx to 2016년 표현형 데이터 (1).xlsx
Saving GWAS_plink.ipynb to GWAS_plink.ipynb
Saving GWAS_SNP(20).csv to GWAS_SNP(20).csv
Saving GWAS_SNP(50).csv to GWAS_SNP(50).csv
Saving GWAS_SNP(100).csv to GWAS_SNP(100).csv
Saving GWAS_SNP(500).csv to GWAS_SNP(500).csv
Saving GWAS_SNP(1000).csv to GWAS_SNP(1000).csv
Saving GWAS_SNP(2000).csv to GWAS_SNP(2000).csv
Saving 유전형인코딩_결측치_행_제거.csv to 유전형인코딩_결측치_행_제거 (1).csv


In [8]:
# ✅ File paths
geno_path = "유전형인코딩_결측치_행_제거.csv"
pheno_path = "2016년 표현형 데이터.xlsx"
gwas_snp_files = {
    "Top 20": "GWAS_SNP(20).csv",
    "Top 50": "GWAS_SNP(50).csv",
    "Top 100": "GWAS_SNP(100).csv",
    "Top 500": "GWAS_SNP(500).csv",
    "Top 1000": "GWAS_SNP(1000).csv",
    "Top 2000": "GWAS_SNP(2000).csv"
}

# ✅ Genotype data (transposed so that samples are rows and SNPs are columns)
geno_df = pd.read_csv(geno_path, index_col=0).T
geno_df.index = geno_df.index.astype(str).str.strip()
geno_df.columns = geno_df.columns.astype(str).str.strip()

# ✅ Phenotype data (indexed by the "Genotype" sample ID column)
pheno_df = pd.read_excel(pheno_path)
pheno_df.columns = pheno_df.columns.str.strip()
pheno_df.set_index("Genotype", inplace=True)
pheno_df.index = pheno_df.index.astype(str).str.strip()
phenotypes = pheno_df.columns.tolist()

# ✅ Collect results
results = []


In [9]:
# ✅ TabNet training loop: one model per (phenotype, SNP set) pair
for pheno in phenotypes:
    y_all = pheno_df[pheno].dropna()

    for label, file in gwas_snp_files.items():
        try:
            snp_df = pd.read_csv(file)
            snp_df["SNP"] = snp_df["SNP"].astype(str).str.strip()

            # 🔧 No per-trait filter: every SNP in the file is used for every phenotype
            snps = snp_df["SNP"].tolist()
            snps = [s for s in snps if s in geno_df.columns]

            if len(snps) == 0:
                raise ValueError("no usable SNPs")

            # Align X and y on sample IDs
            X = geno_df[snps].loc[y_all.index].dropna()
            y = y_all.loc[X.index]

            # Standardize (note: the scaler is fit on all samples, before the split)
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)

            # 80/20 train/test split
            X_train, X_test, y_train, y_test = train_test_split(
                X_scaled, y, test_size=0.2, random_state=42
            )

            # Train TabNet (note: the test split doubles as the early-stopping eval_set)
            model = TabNetRegressor(verbose=0)
            model.fit(
                X_train=X_train, y_train=y_train.values.reshape(-1, 1),
                eval_set=[(X_test, y_test.values.reshape(-1, 1))],
                eval_metric=['rmse'],
                max_epochs=100,
                patience=10,
                batch_size=256,
                virtual_batch_size=128,
                num_workers=0,
                drop_last=False
            )

            # Predict and evaluate on the test split
            y_pred = model.predict(X_test).flatten()
            rmse = np.sqrt(mean_squared_error(y_test, y_pred))
            r2 = r2_score(y_test, y_pred)

            results.append({
                "표현형": pheno,
                "SNP 개수": label,
                "R²": round(r2, 4),
                "RMSE": round(rmse, 4)
            })

        except Exception as e:
            print(f"❌ {pheno} - {label} error: {e}")
            # Record failed runs as NaN so the metric columns stay numeric
            results.append({
                "표현형": pheno,
                "SNP 개수": label,
                "R²": np.nan,
                "RMSE": np.nan
            })


Early stopping occurred at epoch 43 with best_epoch = 33 and best_val_0_rmse = 38.65518
Early stopping occurred at epoch 53 with best_epoch = 43 and best_val_0_rmse = 51.4846
Early stopping occurred at epoch 64 with best_epoch = 54 and best_val_0_rmse = 42.71335
Early stopping occurred at epoch 69 with best_epoch = 59 and best_val_0_rmse = 46.82139
Early stopping occurred at epoch 69 with best_epoch = 59 and best_val_0_rmse = 51.63574
Early stopping occurred at epoch 58 with best_epoch = 48 and best_val_0_rmse = 52.85795
Early stopping occurred at epoch 36 with best_epoch = 26 and best_val_0_rmse = 18.26959
Early stopping occurred at epoch 47 with best_epoch = 37 and best_val_0_rmse = 11.81922
Early stopping occurred at epoch 63 with best_epoch = 53 and best_val_0_rmse = 12.88209
Early stopping occurred at epoch 60 with best_epoch = 50 and best_val_0_rmse = 9.52904
Early stopping occurred at epoch 65 with best_epoch = 55 and best_val_0_rmse = 11.11242
Early stopping occurred at epoch 54 with best_epoch = 44 and best_val_0_rmse = 14.43175
Early stopping occurred at epoch 34 with best_epoch = 24 and best_val_0_rmse = 19.85939
Early stopping occurred at epoch 51 with best_epoch = 41 and best_val_0_rmse = 14.29513
Early stopping occurred at epoch 56 with best_epoch = 46 and best_val_0_rmse = 18.16874
Early stopping occurred at epoch 52 with best_epoch = 42 and best_val_0_rmse = 17.28598
Early stopping occurred at epoch 55 with best_epoch = 45 and best_val_0_rmse = 15.96052
Early stopping occurred at epoch 53 with best_epoch = 43 and best_val_0_rmse = 19.51835
Early stopping occurred at epoch 54 with best_epoch = 44 and best_val_0_rmse = 0.95734
Early stopping occurred at epoch 24 with best_epoch = 14 and best_val_0_rmse = 1.42391
Early stopping occurred at epoch 35 with best_epoch = 25 and best_val_0_rmse = 1.20712
Early stopping occurred at epoch 35 with best_epoch = 25 and best_val_0_rmse = 1.15459
Early stopping occurred at epoch 88 with best_epoch = 78 and best_val_0_rmse = 0.87587
Early stopping occurred at epoch 31 with best_epoch = 21 and best_val_0_rmse = 1.59801
Early stopping occurred at epoch 50 with best_epoch = 40 and best_val_0_rmse = 0.98984
Early stopping occurred at epoch 22 with best_epoch = 12 and best_val_0_rmse = 1.63334
Early stopping occurred at epoch 75 with best_epoch = 65 and best_val_0_rmse = 0.91759
Early stopping occurred at epoch 35 with best_epoch = 25 and best_val_0_rmse = 0.99175
Early stopping occurred at epoch 30 with best_epoch = 20 and best_val_0_rmse = 2.02624
Early stopping occurred at epoch 30 with best_epoch = 20 and best_val_0_rmse = 1.59919
Early stopping occurred at epoch 68 with best_epoch = 58 and best_val_0_rmse = 0.10866
Stop training because you reached max_epochs = 100 with best_epoch = 98 and best_val_0_rmse = 0.09098
Early stopping occurred at epoch 16 with best_epoch = 6 and best_val_0_rmse = 0.2576
Early stopping occurred at epoch 19 with best_epoch = 9 and best_val_0_rmse = 0.30119
Early stopping occurred at epoch 18 with best_epoch = 8 and best_val_0_rmse = 0.27632
Early stopping occurred at epoch 75 with best_epoch = 65 and best_val_0_rmse = 0.12878
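
Each run above is scored on a single 80/20 split, which can be noisy, particularly when few samples are available. A hedged sketch of how k-fold cross-validation could report a mean and spread instead. `Ridge` is only a lightweight stand-in so the example runs without `pytorch-tabnet`, and the data are synthetic; neither comes from the notebook:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(120, 10)).astype(float)  # synthetic 0/1/2 genotype codes
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=120)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kfold.split(X):
    # Stand-in estimator; a TabNetRegressor would be fit here in the real pipeline.
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(f"R2 mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two SNP sets exceeds the split-to-split variation.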




In [10]:
# ✅ Show results
results_df = pd.DataFrame(results)

# Column order by SNP count
ordered_cols = ["Top 20", "Top 50", "Top 100", "Top 500", "Top 1000", "Top 2000"]

# R²
results_r2 = results_df.pivot(index="표현형", columns="SNP 개수", values="R²")
results_r2 = results_r2[ordered_cols]

# RMSE
results_rmse = results_df.pivot(index="표현형", columns="SNP 개수", values="RMSE")
results_rmse = results_rmse[ordered_cols]

print("📊 TabNet performance comparison (R²):")
display(results_r2)

print("📉 TabNet performance comparison (RMSE):")
display(results_rmse)

📊 TabNet performance comparison (R²):

| 표현형 (phenotype) | Top 20 | Top 50 | Top 100 | Top 500 | Top 1000 | Top 2000 |
|---|---|---|---|---|---|---|
| 과실경도 (kg), fruit firmness | 0.0756 | 0.352 | -4.1951 | -6.102 | -4.9777 | -0.2984 |
| 과장 (mm), fruit length | -1.745 | -0.1488 | -0.3648 | 0.2532 | -0.0156 | -0.7129 |
| 과중 (g), fruit weight | 0.5811 | 0.2568 | 0.4885 | 0.3854 | 0.2525 | 0.2167 |
| 과폭 (mm), fruit width | -0.5796 | 0.1816 | -0.3221 | -0.1967 | -0.0202 | -0.5258 |
| 과피두께 (mm), pericarp thickness | 0.043 | -1.1171 | -0.5215 | -0.392 | 0.199 | -1.6665 |
| 당도 (%), sugar content | 0.3057 | -0.8905 | 0.4033 | 0.303 | -1.9095 | -0.8123 |

📉 TabNet performance comparison (RMSE, in each phenotype's own unit):

| 표현형 (phenotype) | Top 20 | Top 50 | Top 100 | Top 500 | Top 1000 | Top 2000 |
|---|---|---|---|---|---|---|
| 과실경도 (kg), fruit firmness | 0.1087 | 0.091 | 0.2576 | 0.3012 | 0.2763 | 0.1288 |
| 과장 (mm), fruit length | 18.2696 | 11.8192 | 12.8821 | 9.529 | 11.1124 | 14.4317 |
| 과중 (g), fruit weight | 38.6552 | 51.4846 | 42.7134 | 46.8214 | 51.6357 | 52.858 |
| 과폭 (mm), fruit width | 19.8594 | 14.2951 | 18.1687 | 17.286 | 15.9605 | 19.5184 |
| 과피두께 (mm), pericarp thickness | 0.9573 | 1.4239 | 1.2071 | 1.1546 | 0.8759 | 1.598 |
| 당도 (%), sugar content | 0.9898 | 1.6333 | 0.9176 | 0.9918 | 2.0262 | 1.5992 |
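
To read the GWAS R² table at a glance, one option is to pick the SNP set with the highest R² for each phenotype. The sketch below re-enters the reported R² values by hand; `best_set` and `best_r2` are illustrative column names, not part of the notebook. Because the choice is made on the same test split that produced the scores, it is a descriptive summary rather than an unbiased estimate:

```python
import pandas as pd

cols = ["Top 20", "Top 50", "Top 100", "Top 500", "Top 1000", "Top 2000"]
r2 = pd.DataFrame(
    [
        [0.0756, 0.352, -4.1951, -6.102, -4.9777, -0.2984],
        [-1.745, -0.1488, -0.3648, 0.2532, -0.0156, -0.7129],
        [0.5811, 0.2568, 0.4885, 0.3854, 0.2525, 0.2167],
        [-0.5796, 0.1816, -0.3221, -0.1967, -0.0202, -0.5258],
        [0.043, -1.1171, -0.5215, -0.392, 0.199, -1.6665],
        [0.3057, -0.8905, 0.4033, 0.303, -1.9095, -0.8123],
    ],
    index=["과실경도 (kg)", "과장 (mm)", "과중 (g)", "과폭 (mm)", "과피두께 (mm)", "당도 (%)"],
    columns=cols,
)

# Column label of the row-wise maximum, alongside the maximum itself.
best = pd.DataFrame({"best_set": r2.idxmax(axis=1), "best_r2": r2.max(axis=1)})
print(best)
```

No single SNP count wins across phenotypes in this table, and the best R² stays below 0.6 for every phenotype.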
