# GU002 – Lung Cancer Histology Dataset Split  
## Solid vs Acinar (Train / Validation / Test)

### Overview
This notebook documents the construction of a train/validation/test split for a
lung cancer histology dataset, focusing on the **solid** and **acinar**
histologic subtypes. The training and validation sets are class-balanced; the
test set holds the remaining slides of each class.

The split is designed to support downstream machine learning experiments
(e.g., patch-level or slide-level classification) while maintaining strict
separation between training, validation, and testing sets.

---

### Input Data
- **Metadata file**: `Lung MetaData_Release_1.csv`
- Each row corresponds to one whole-slide image (WSI)
- Key fields used in this notebook:
  - `File Name`: unique slide identifier (e.g., `DHMC_0001.tif`)
  - `Class`: histologic subtype label (e.g., solid, acinar)

Only slides labeled as **solid** or **acinar** are included in this analysis.
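
A minimal sketch of this selection step, assuming only the two columns listed above. The helper `select_subtypes` and the toy table are hypothetical and stand in for the real metadata file; the uniqueness check reflects the statement that `File Name` identifies one slide per row.

```python
import pandas as pd

def select_subtypes(meta: pd.DataFrame, keep=("solid", "acinar")) -> pd.DataFrame:
    """Validate the metadata table and keep only the requested subtypes."""
    missing = {"File Name", "Class"} - set(meta.columns)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    # One row per slide is assumed, so identifiers must not repeat.
    if meta["File Name"].duplicated().any():
        raise ValueError("duplicate slide identifiers found")
    return meta[meta["Class"].isin(keep)].copy()

# Toy stand-in for the metadata file (illustrative values only).
toy_meta = pd.DataFrame({
    "File Name": ["A.tif", "B.tif", "C.tif", "D.tif"],
    "Class": ["solid", "acinar", "lepidic", "solid"],
})
toy_selected = select_subtypes(toy_meta)
print(toy_selected["Class"].value_counts())
```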

---

### Dataset Split Strategy
The dataset is split **by slide**, not by patch, so that patches from one slide
never land in more than one split (a common source of data leakage).
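
The slide-level rule can be sketched as follows for a later patch-level experiment. The patch index, its `patch_id` column, and the slide names are hypothetical; the only assumption carried over from this notebook is that every patch records the `File Name` of its parent slide.

```python
import pandas as pd

# Hypothetical slide-level split table (same columns as the exported split).
slide_split = pd.DataFrame({
    "File Name": ["S1.tif", "S2.tif", "S3.tif"],
    "Class": ["solid", "acinar", "solid"],
    "Split": ["train", "val", "test"],
})

# Hypothetical patch index: several patches per slide.
patches = pd.DataFrame({
    "patch_id": ["S1_0", "S1_1", "S2_0", "S3_0", "S3_1"],
    "File Name": ["S1.tif", "S1.tif", "S2.tif", "S3.tif", "S3.tif"],
})

# Each patch inherits the split of its parent slide. validate="many_to_one"
# makes pandas raise if a slide is listed more than once in slide_split.
patch_split = patches.merge(
    slide_split, on="File Name", how="left", validate="many_to_one"
)

# Leakage check: every slide must contribute patches to exactly one split.
splits_per_slide = patch_split.groupby("File Name")["Split"].nunique()
assert (splits_per_slide == 1).all()
```

Assigning the split at the slide level first and joining afterwards means the patch table can be regenerated (different patch size, stride, or magnification) without changing which slides belong to which split.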

The target split sizes are:

| Class   | Train | Validation | Test |
|--------|-------|------------|------|
| Solid  | 28    | 8          | 15   |
| Acinar | 28    | 8          | 23   |

Slides are shuffled within each class using a fixed random seed (42) and then
assigned to train, validation, and test in that order, which preserves the
specified counts for each split.
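
As a quick arithmetic check on the table above, the quotas of each class should add up to the number of available slides (51 solid and 59 acinar, per the class counts reported later in this notebook). The sketch below only restates the table; the `test_share` figure is an added illustration of how the test set differs between classes.

```python
# Quotas copied from the target split table.
split_plan = {
    "solid":  {"train": 28, "val": 8, "test": 15},
    "acinar": {"train": 28, "val": 8, "test": 23},
}

# Total slides consumed per class.
class_totals = {cls: sum(quota.values()) for cls, quota in split_plan.items()}
print(class_totals)  # {'solid': 51, 'acinar': 59}

# Fraction of each class that ends up in the test set.
test_share = {
    cls: round(quota["test"] / class_totals[cls], 3)
    for cls, quota in split_plan.items()
}
print(test_share)
```

Training and validation counts are identical across classes, while the test set absorbs the surplus acinar slides, so test metrics are best reported per class.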

---

### Outputs
The notebook produces:
- A verified split with class-balanced training and validation sets
- A summary table confirming split counts per class
- An exportable split table mapping:
  - `File Name` → `Class` → `Split`

This split is intended to serve as the **fixed experimental baseline**
for subsequent modeling experiments in GU002.


In [10]:
import pandas as pd

df = pd.read_csv("Lung MetaData_Release_1.csv")
df.head()


Unnamed: 0,File Name,Class,Microns Per Pixel,Magnification,The Number of Pyramid Levels in Tiff,Level[0] Downsampling Factor,Level[0] Image Width (Pixels),Level[0] Image Height (Pixels),Level[0] Tile Image Width (Pixels),Level[0] Tile Image Height (Pixels),...,Level[9] Downsampling Factor,Level[9] Image Width (Pixels),Level[9] Image Height (Pixels),Level[9] Tile Image Width (Pixels),Level[9] Tile Image Height (Pixels),Level[10] Downsampling Factor,Level[10] Image Width (Pixels),Level[10] Image Height (Pixels),Level[10] Tile Image Width (Pixels),Level[10] Tile Image Height (Pixels)
0,DHMC_0001.tif,solid,0.5028,20,9,1,39839,30468,240,240,...,,,,,,,,,,
1,DHMC_0002.tif,solid,0.5038,20,8,1,49800,25855,512,512,...,,,,,,,,,,
2,DHMC_0003.tif,solid,0.5038,20,8,1,45816,26741,512,512,...,,,,,,,,,,
3,DHMC_0004.tif,solid,0.5038,20,8,1,45816,26097,512,512,...,,,,,,,,,,
4,DHMC_0005.tif,lepidic,0.5038,20,8,1,31872,33269,512,512,...,,,,,,,,,,


In [11]:
df.columns


Index(['File Name', 'Class', 'Microns Per Pixel', 'Magnification',
       'The Number of Pyramid Levels in Tiff', 'Level[0] Downsampling Factor',
       'Level[0] Image Width (Pixels)', 'Level[0] Image Height (Pixels)',
       'Level[0] Tile Image Width (Pixels)',
       'Level[0] Tile Image Height (Pixels)', 'Level[1] Downsampling Factor',
       'Level[1] Image Width (Pixels)', 'Level[1] Image Height (Pixels)',
       'Level[1] Tile Image Width (Pixels)',
       'Level[1] Tile Image Height (Pixels)', 'Level[2] Downsampling Factor',
       'Level[2] Image Width (Pixels)', 'Level[2] Image Height (Pixels)',
       'Level[2] Tile Image Width (Pixels)',
       'Level[2] Tile Image Height (Pixels)', 'Level[3] Downsampling Factor',
       'Level[3] Image Width (Pixels)', 'Level[3] Image Height (Pixels)',
       'Level[3] Tile Image Width (Pixels)',
       'Level[3] Tile Image Height (Pixels)', 'Level[4] Downsampling Factor',
       'Level[4] Image Width (Pixels)', 'Level[4] Image Height (Pi

In [12]:
# Keep only the solid and acinar slides
df_sa = df[df["Class"].isin(["solid", "acinar"])].copy()

# Count the slides in each class
df_sa["Class"].value_counts()


Class,count
acinar,59
solid,51


In [13]:
import numpy as np

# Fix the random seed (very important for reproducibility)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

splits = []

# Define the per-class quota for each split
split_plan = {
    "acinar": {"train": 28, "val": 8, "test": 23},
    "solid":  {"train": 28, "val": 8, "test": 15},
}

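for cls, plan in split_plan.items():
    # Shuffle the slides of this class reproducibly
    df_cls = df_sa[df_sa["Class"] == cls].sample(frac=1, random_state=RANDOM_SEED)

    # Guard: the quotas must account for every slide of this class
    assert len(df_cls) == sum(plan.values()), f"quota mismatch for {cls}"

    n_train = plan["train"]
    n_val   = plan["val"]

    # Slice in order: train first, then val; the remainder becomes test
    df_train = df_cls.iloc[:n_train].copy()
    df_val   = df_cls.iloc[n_train:n_train + n_val].copy()
    df_test  = df_cls.iloc[n_train + n_val:].copy()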

    df_train["Split"] = "train"
    df_val["Split"]   = "val"
    df_test["Split"]  = "test"

    splits.append(df_train)
    splits.append(df_val)
    splits.append(df_test)

# Concatenate all per-class pieces into one table
df_split = pd.concat(splits, ignore_index=True)
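
A short sketch of why the shuffle above is reproducible. It assumes nothing beyond pandas' documented behavior: when `sample` receives an integer `random_state`, the resulting order depends on that value rather than on the global NumPy seed. The toy slide names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy slide list (illustrative identifiers only).
toy = pd.DataFrame({"File Name": [f"slide_{i:02d}.tif" for i in range(10)]})

# Shuffle twice with the same random_state but different global seeds.
np.random.seed(0)
first = toy.sample(frac=1, random_state=42)["File Name"].tolist()
np.random.seed(123)
second = toy.sample(frac=1, random_state=42)["File Name"].tolist()

# The order is identical, so the split depends only on random_state.
assert first == second
```

Passing `random_state` explicitly, as the cell above does, keeps the split stable even if other code in the session consumes random numbers beforehand.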


In [14]:
df_split.groupby(["Class", "Split"]).size()


Class,Split,count
acinar,test,23
acinar,train,28
acinar,val,8
solid,test,15
solid,train,28
solid,val,8
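
The visual check above can also be automated. The sketch below is one possible way to do it, not part of the original notebook: the helper `check_split` is hypothetical, and it is demonstrated on a toy table with the same three columns (`File Name`, `Class`, `Split`).

```python
import pandas as pd

def check_split(split_df: pd.DataFrame, plan: dict) -> None:
    """Raise ValueError if the split deviates from the plan or reuses a slide."""
    if split_df["File Name"].duplicated().any():
        raise ValueError("a slide appears more than once in the split table")
    # Map (class, split) pairs to their observed row counts.
    counts = split_df.groupby(["Class", "Split"]).size().to_dict()
    for cls, quota in plan.items():
        for split_name, expected in quota.items():
            actual = counts.get((cls, split_name), 0)
            if actual != expected:
                raise ValueError(
                    f"{cls}/{split_name}: expected {expected}, got {actual}"
                )

# Toy plan and split table (illustrative values only).
toy_plan = {"solid": {"train": 2, "val": 1, "test": 1}}
toy_split = pd.DataFrame({
    "File Name": ["a.tif", "b.tif", "c.tif", "d.tif"],
    "Class": ["solid"] * 4,
    "Split": ["train", "train", "val", "test"],
})
check_split(toy_split, toy_plan)  # passes silently when the plan is met
```

In this notebook the equivalent call would be `check_split(df_split, split_plan)`, assuming the helper is defined in an earlier cell.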


In [15]:
# Keep only the three core columns (reviewer-friendly)
df_out = df_split[["File Name", "Class", "Split"]].copy()

# Sort the rows (purely cosmetic)
df_out = df_out.sort_values(by=["Class", "Split", "File Name"]).reset_index(drop=True)

df_out.head()


Unnamed: 0,File Name,Class,Split
0,DHMC_0007.tif,acinar,test
1,DHMC_0009.tif,acinar,test
2,DHMC_0016.tif,acinar,test
3,DHMC_0027.tif,acinar,test
4,DHMC_0050.tif,acinar,test


In [16]:
output_path = "GU002_lung_solid_vs_acinar_train_val_test_split.xlsx"
# Writing .xlsx requires an Excel engine such as openpyxl to be installed
df_out.to_excel(output_path, index=False)

output_path


'GU002_lung_solid_vs_acinar_train_val_test_split.xlsx'