# GU001 — Kidney Histology Dataset Split
## Papillary vs Clear Cell Renal Cell Carcinoma

### Objective
This notebook defines a reproducible train/validation/test split for a
binary kidney histology classification task:

- Papillary renal cell carcinoma
- Clear cell renal cell carcinoma

The goal is to establish a clean and transparent dataset split that can be
used consistently across downstream baseline and foundation model experiments.

---

### Rationale
- Papillary and clear cell RCC are two common but morphologically distinct renal tumor subtypes.
- A controlled split is necessary to avoid data leakage and ensure fair model comparison.
- All splits are performed at the slide level.
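
One way to check the slide-level guarantee is to confirm that no file name lands in more than one split. The sketch below assumes the `File Name` and `Split` columns used in this notebook; the helper `find_leaked_slides` and the toy file names are hypothetical and only illustrate the check.

```python
import pandas as pd

def find_leaked_slides(split_df, id_col="File Name", split_col="Split"):
    """Return the slide identifiers that appear in more than one split."""
    splits_per_slide = split_df.groupby(id_col)[split_col].nunique()
    return sorted(splits_per_slide[splits_per_slide > 1].index)

# Toy example with made-up file names: "a.png" is in both train and test.
toy = pd.DataFrame({
    "File Name": ["a.png", "b.png", "c.png", "a.png"],
    "Split": ["train", "val", "test", "test"],
})
leaked = find_leaked_slides(toy)
print(leaked)
```

An empty list means every slide belongs to exactly one split. This only covers slide-level leakage, since the metadata columns listed in this notebook carry no other grouping key.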

---

### Dataset
Source metadata file:
- data/kidney MetaData_Release_1.csv

Key columns used:
- File Name
- Diagnosis
- Slide Type
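- Data Split (the assignment shipped with the metadata; this notebook leaves it in place and writes its own assignment to a new `Split` column)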
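
Before splitting, it can help to confirm that the metadata file actually exposes these columns. The following is a minimal sketch; `missing_columns` and the toy frame are hypothetical, and the column names are the four listed above.

```python
import pandas as pd

REQUIRED_COLUMNS = ["File Name", "Diagnosis", "Slide Type", "Data Split"]

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Toy frame with "Data Split" deliberately left out.
toy_meta = pd.DataFrame({
    "File Name": ["x.png"],
    "Diagnosis": ["Papillary"],
    "Slide Type": ["Resection"],
})
print(missing_columns(toy_meta))
```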

---

### Split Strategy
For Papillary RCC:
- 84 slides → Training
- 21 slides → Validation
- 45 slides → Testing

For Clear Cell RCC:
- 84 slides → Training
- 21 slides → Validation
- 398 slides → Testing

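Rows are shuffled once with a fixed seed (`random_state=42`) and each class is then sliced in order, so rerunning the notebook on the same metadata file reproduces the same assignment.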
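
The counts above can also be written as a small table-driven helper. This is a sketch under assumptions, not the notebook's own procedure: `assign_splits` and `SPLIT_COUNTS` are hypothetical names, the class labels `Papillary` and `Clearcell` are the `Diagnosis` values used in the code cell, and shuffling is done per class, so the resulting assignment will not match the notebook's row for row. Unlike plain `iloc` slicing, which silently returns fewer rows when a class is too small, it raises an error.

```python
import pandas as pd

# Per-class slide counts, copied from the Split Strategy list.
SPLIT_COUNTS = {
    "Papillary": {"train": 84, "val": 21, "test": 45},
    "Clearcell": {"train": 84, "val": 21, "test": 398},
}

def assign_splits(df, counts=SPLIT_COUNTS, label_col="Diagnosis", seed=42):
    """Shuffle each class with a fixed seed and slice it into fixed-size splits."""
    parts = []
    for diagnosis, per_split in counts.items():
        rows = df[df[label_col] == diagnosis]
        needed = sum(per_split.values())
        if len(rows) < needed:
            raise ValueError(
                f"{diagnosis}: need {needed} slides, found {len(rows)}"
            )
        rows = rows.sample(frac=1, random_state=seed)
        start = 0
        for split_name, n in per_split.items():
            part = rows.iloc[start:start + n].copy()
            part["Split"] = split_name
            parts.append(part)
            start += n
    return pd.concat(parts, ignore_index=True)

# Synthetic metadata with made-up file names: 150 Papillary, 503 Clearcell.
demo = pd.DataFrame({
    "File Name": [f"slide_{i:04d}.png" for i in range(653)],
    "Diagnosis": ["Papillary"] * 150 + ["Clearcell"] * 503,
})
demo_split = assign_splits(demo)
print(pd.crosstab(demo_split["Diagnosis"], demo_split["Split"]))
```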


In [None]:
import pandas as pd

df = pd.read_csv("kidney MetaData_Release_1.csv")
df.columns


Index(['File Name', 'Diagnosis', 'Slide Type', 'Data Split'], dtype='object')

In [None]:
import pandas as pd

# 1. Load the metadata
df = pd.read_csv("kidney MetaData_Release_1.csv")

# 2. Keep only the two classes of interest
df = df[df["Diagnosis"].isin(["Papillary", "Clearcell"])].copy()

# 3. Shuffle the rows with a fixed seed so the split is reproducible
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# 4. Separate the two classes
pap = df[df["Diagnosis"] == "Papillary"]
cc = df[df["Diagnosis"] == "Clearcell"]

# 5. Slice each class according to the specified counts
pap_trainval = pap.iloc[:105]  # 84 train + 21 val
pap_test = pap.iloc[105:150]   # 45 test

cc_trainval = cc.iloc[:105]    # 84 train + 21 val
cc_test = cc.iloc[105:503]     # 398 test

# 6. Divide train+val into train and val
pap_train = pap_trainval.iloc[:84]
pap_val = pap_trainval.iloc[84:105]

cc_train = cc_trainval.iloc[:84]
cc_val = cc_trainval.iloc[84:105]

# 7. Tag each subset with its split name in a new "Split" column
def label(df, name):
    df = df.copy()
    df["Split"] = name
    return df

final_df = pd.concat([
    label(pap_train, "train"),
    label(pap_val, "val"),
    label(pap_test, "test"),
    label(cc_train, "train"),
    label(cc_val, "val"),
    label(cc_test, "test"),
])

# 8. Save to Excel (writing .xlsx requires an Excel engine such as openpyxl)
final_df.to_excel("kidney_Papillary_vs_Clearcell_split.xlsx", index=False)

final_df.head()


|    | File Name     | Diagnosis | Slide Type | Data Split | Split |
|----|---------------|-----------|------------|------------|-------|
| 1  | DHMC_0439.png | Papillary | Resection  | Train      | train |
| 5  | DHMC_0072.png | Papillary | Resection  | Test       | train |
| 8  | DHMC_0442.png | Papillary | Resection  | Train      | train |
| 9  | DHMC_0417.png | Papillary | Resection  | Train      | train |
| 12 | DHMC_0448.png | Papillary | Resection  | Train      | train |
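
The preview shows that the new `Split` column can disagree with the original `Data Split` column (DHMC_0072.png moves from Test to train). A cross-tabulation makes the overlap between the two assignments explicit. The sketch below is illustrative: `compare_splits` is a hypothetical helper, the demo frame holds only the five preview rows, and lower-casing is assumed to be enough to align the two label sets, which may not hold for the validation label.

```python
import pandas as pd

def compare_splits(df, old_col="Data Split", new_col="Split"):
    """Cross-tabulate the original split against the new one, ignoring case."""
    old = df[old_col].str.lower()
    new = df[new_col].str.lower()
    return pd.crosstab(old, new)

# The five rows shown in the preview above.
preview = pd.DataFrame({
    "File Name": ["DHMC_0439.png", "DHMC_0072.png", "DHMC_0442.png",
                  "DHMC_0417.png", "DHMC_0448.png"],
    "Data Split": ["Train", "Test", "Train", "Train", "Train"],
    "Split": ["train"] * 5,
})
table = compare_splits(preview)
print(table)
```

Running `compare_splits(final_df)` on the full output would show how many slides keep their original role and how many are reassigned.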
