# Notebook 1: Data Processing and Feature Engineering

## HCT Survival Prediction Challenge

This notebook covers:
1. Loading raw clinical and molecular data
2. Understanding the domain shift between train and test
3. Feature engineering (83-feature and 128-feature sets)
4. Handling missing values (NaN fixing)
5. Defining risk groups for evaluation
6. Scaling features for models that need it
7. Inspecting the target variable

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler


# Paths
BASE_PATH = '/your_path/SurvivalPrediction'
TRAIN_PATH = f'{BASE_PATH}/data'

## 1. Load Raw Data

In [7]:
# Training data
clinical_train = pd.read_csv(f'{TRAIN_PATH}/X_train/clinical_train.csv')
molecular_train = pd.read_csv(f'{TRAIN_PATH}/X_train/molecular_train.csv')
target_train = pd.read_csv(f'{TRAIN_PATH}/target_train.csv')  # Root level, not in y_train/

# Test data
clinical_test = pd.read_csv(f'{TRAIN_PATH}/X_test/clinical_test.csv')
molecular_test = pd.read_csv(f'{TRAIN_PATH}/X_test/molecular_test.csv')

print(f"Clinical train: {clinical_train.shape}")
print(f"Molecular train: {molecular_train.shape}")
print(f"Target train: {target_train.shape}")
print(f"Clinical test: {clinical_test.shape}")
print(f"Molecular test: {molecular_test.shape}")

Clinical train: (3323, 9)
Molecular train: (10935, 11)
Target train: (3323, 3)
Clinical test: (1193, 9)
Molecular test: (3089, 11)


In [8]:
# Check clinical columns
print("Clinical columns:")
print(clinical_train.columns)

Clinical columns:
Index(['ID', 'CENTER', 'BM_BLAST', 'WBC', 'ANC', 'MONOCYTES', 'HB', 'PLT',
       'CYTOGENETICS'],
      dtype='object')


In [9]:
# Check molecular data structure
print("Molecular columns:")
print(molecular_train.columns)
print(f"\nUnique genes: {molecular_train['GENE'].nunique()}")

Molecular columns:
Index(['ID', 'CHR', 'START', 'END', 'REF', 'ALT', 'GENE', 'PROTEIN_CHANGE',
       'EFFECT', 'VAF', 'DEPTH'],
      dtype='object')

Unique genes: 124
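
The molecular table is in long format (one row per mutation), so per-patient features such as the `has_<GENE>` flags used in the feature sets have to be aggregated by `ID`. The following is a minimal sketch on a toy table, not the pipeline that produced the saved feature files; the helper name `gene_flags` and the toy IDs are hypothetical.

```python
import pandas as pd

def gene_flags(molecular, patient_ids, genes):
    """Return a 0/1 DataFrame with one has_<GENE> column per requested gene."""
    # Count mutations per (patient, gene), then reduce counts to presence flags
    counts = pd.crosstab(molecular['ID'], molecular['GENE'])
    flags = (counts > 0).astype(int)
    # Patients with no mutation rows and genes never observed are filled with 0
    flags = flags.reindex(index=patient_ids, columns=genes, fill_value=0)
    flags.columns = [f'has_{g}' for g in genes]
    return flags

# Toy stand-in for molecular_train (hypothetical values)
toy_molecular = pd.DataFrame({
    'ID':   ['P1', 'P1', 'P2', 'P1'],
    'GENE': ['TP53', 'TET2', 'ASXL1', 'TP53'],
})
toy_flags = gene_flags(
    toy_molecular,
    patient_ids=['P1', 'P2', 'P3'],
    genes=['TP53', 'TET2', 'ASXL1', 'SF3B1'],
)
print(toy_flags)
```

Passing the full list of patient IDs from the clinical table matters here: patients without any reported mutation never appear in the molecular table and would otherwise be dropped silently.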


## 2. Domain Shift Analysis

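**Key finding**: Test data comes from a single center (KYW) that does not appear in the training set, and its patients are sicker on average. Differences in means, test relative to train:
- Higher BM_BLAST (+117%)
- Lower PLT (-28%)
- Lower HB (-11%)
- Higher TP53+ rate (25.7% vs 11.4%)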

In [10]:
# Compare train vs test distributions
def compare_distributions(train_df, test_df, columns):
    """Compare train and test distributions for numeric columns."""
    results = []
    for col in columns:
        if col in train_df.columns and col in test_df.columns:
            train_mean = train_df[col].mean()
            test_mean = test_df[col].mean()
            # Relative difference of the test mean with respect to the train mean
            percent_diff = (test_mean - train_mean) / train_mean * 100
            results.append({
                'Feature': col,
                'Train Mean': train_mean,
                'Test Mean': test_mean,
                '% Difference': percent_diff
            })
    return pd.DataFrame(results)

numeric_cols = ['BM_BLAST', 'WBC', 'ANC', 'HB', 'PLT']
comparison = compare_distributions(clinical_train, clinical_test, numeric_cols)
print("Train vs Test Distribution Comparison:")
print(comparison)

Train vs Test Distribution Comparison:
    Feature  Train Mean   Test Mean  % Difference
0  BM_BLAST    5.982545   12.966605    116.740611
1       WBC    6.535164    7.325801     12.098195
2       ANC    3.264735    2.978544     -8.766144
3        HB    9.893549    8.823937    -10.811205
4       PLT  167.048900  120.631633    -27.786634
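
The TP53+ rate listed among the key findings is not produced by the comparison above, because it depends on the molecular table rather than the clinical one. One way it might be computed is as the share of patients with at least one TP53 row. This is a sketch under that assumption; `mutation_rate` is a hypothetical helper and the toy frames stand in for the real tables.

```python
import pandas as pd

def mutation_rate(clinical, molecular, gene):
    """Fraction of patients in `clinical` with at least one mutation in `gene`."""
    mutated_ids = set(molecular.loc[molecular['GENE'] == gene, 'ID'])
    # The denominator is every patient, including those absent from `molecular`
    return clinical['ID'].isin(mutated_ids).mean()

# Toy stand-ins for the clinical and molecular tables (hypothetical values)
toy_clinical = pd.DataFrame({'ID': ['P1', 'P2', 'P3', 'P4']})
toy_molecular = pd.DataFrame({
    'ID':   ['P1', 'P1', 'P3'],
    'GENE': ['TP53', 'TP53', 'TET2'],
})
tp53_rate = mutation_rate(toy_clinical, toy_molecular, 'TP53')
print(f"TP53+ rate: {tp53_rate * 100:.1f}%")
```

On the real data the same call would be `mutation_rate(clinical_train, molecular_train, 'TP53')` and the equivalent for the test tables.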


In [11]:
# Center distribution
print("\nTraining centers:")
print(clinical_train['CENTER'].value_counts())
print(f"\nTest center: {clinical_test['CENTER'].unique()}")


Training centers:
CENTER
KI       900
DUS      455
PV       316
GESMD    246
RMCN     199
CCH      159
CGM      107
ROM      104
UOB       88
HMS       83
MUV       83
TUD       73
FUCE      73
ICO       71
FLO       68
DUTH      66
UOXF      50
HIAE      47
MSK       37
IHBT      33
VU        33
UMG       26
REL        6
Name: count, dtype: int64

Test center: ['KYW']


## 3. Feature Engineering

We create two feature sets:
1. **83-feature set**: Base clinical + molecular features
2. **128-feature set**: 83 features + 45 interaction features

### 83-Feature Set Components:
- Core clinical (5): BM_BLAST, WBC, ANC, HB, PLT
- Missing indicators (6)
- Ratios (2): ANC_WBC_ratio, PLT_WBC_ratio
- Cytogenetic features (9)
- Gene mutations (50)
- VAF features (5)
- Prognostic features (6)
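
As an illustration of the first three component groups, the sketch below derives the missing indicators and the two ratios from raw clinical columns. The formulas behind the saved feature files are not shown in this notebook, so the plain division and the treatment of a zero WBC as undefined are assumptions; `add_clinical_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

CORE_COLS = ['BM_BLAST', 'WBC', 'ANC', 'HB', 'PLT']
INDICATOR_COLS = CORE_COLS + ['CYTOGENETICS']

def add_clinical_features(clinical):
    """Build core values, missing indicators, and ratios (13 columns)."""
    out = clinical[CORE_COLS].copy()
    # One 0/1 flag per raw column, recorded before any imputation
    for col in INDICATOR_COLS:
        out[f'{col}_missing'] = clinical[col].isna().astype(int)
    # Assumption: a WBC of 0 makes the ratio undefined, so use NaN instead of inf
    wbc = clinical['WBC'].replace(0, np.nan)
    out['ANC_WBC_ratio'] = clinical['ANC'] / wbc
    out['PLT_WBC_ratio'] = clinical['PLT'] / wbc
    return out

# Toy stand-in for clinical_train (hypothetical values)
toy_clinical = pd.DataFrame({
    'BM_BLAST': [5.0, np.nan],
    'WBC': [4.0, 0.0],
    'ANC': [2.0, 1.0],
    'HB': [9.5, 11.0],
    'PLT': [120.0, 80.0],
    'CYTOGENETICS': ['46,xy[20]', None],
})
toy_features = add_clinical_features(toy_clinical)
print(toy_features.T)
```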

In [13]:
# Load pre-processed feature sets
X_train_83 = pd.read_csv(f'{TRAIN_PATH}/X_train_83features.csv')
X_train_128 = pd.read_csv(f'{TRAIN_PATH}/X_train_128features_clean.csv')

print(f"83-feature set: {X_train_83.shape}")
print(f"128-feature set: {X_train_128.shape}")

83-feature set: (3323, 83)
128-feature set: (3120, 128)


In [15]:
# List 83 features
features_83 = list(X_train_83.columns)
print("Features:")
for i, f in enumerate(features_83):
    print(f"  {i+1:2d}. {f}")

Features:
   1. BM_BLAST
   2. WBC
   3. ANC
   4. HB
   5. PLT
   6. BM_BLAST_missing
   7. WBC_missing
   8. ANC_missing
   9. HB_missing
  10. PLT_missing
  11. CYTOGENETICS_missing
  12. ANC_WBC_ratio
  13. PLT_WBC_ratio
  14. num_cytogenetic_abnormalities
  15. cyto_risk_score
  16. is_normal_karyotype
  17. has_complex_karyotype
  18. has_isolated_del5q
  19. has_multiple_clones
  20. num_clones
  21. dominant_clone_pct
  22. abnormal_clone_size
  23. has_TET2
  24. has_ASXL1
  25. has_SF3B1
  26. has_SRSF2
  27. has_DNMT3A
  28. has_RUNX1
  29. has_TP53
  30. has_STAG2
  31. has_U2AF1
  32. has_EZH2
  33. has_BCOR
  34. has_CBL
  35. has_ZRSR2
  36. has_NRAS
  37. has_IDH2
  38. has_CUX1
  39. has_NF1
  40. has_KRAS
  41. has_SETBP1
  42. has_DDX41
  43. has_PHF6
  44. has_JAK2
  45. has_MLL
  46. has_IDH1
  47. has_PTPN11
  48. has_CEBPA
  49. has_ETV6
  50. has_ETNK1
  51. has_MPL
  52. has_SH2B3
  53. has_WT1
  54. has_PPM1D
  55. has_BRCC3
  56. has_BCORL1
  57. has_NPM1
  ... (output truncated)

In [16]:
# Additional 45 interaction features in 128-feature set
features_128 = X_train_128.columns
interaction_features = [f for f in features_128 if f not in features_83]
print(f"\nInteraction Features ({len(interaction_features)}):")
for f in interaction_features:
    print(f"  - {f}")


Interaction Features (45):
  - TP53_x_complex
  - TP53_x_cyto_high_risk
  - TP53_x_num_cyto_abn
  - TP53_x_high_blast
  - TP53_x_blast_cont
  - TP53_x_normal_karyo
  - TP53_mutation_count
  - TP53_multihit
  - n_splicing_mutations
  - multiple_splicing
  - SF3B1_isolated
  - SF3B1_x_del5q
  - SRSF2_x_TET2
  - SRSF2_x_RUNX1
  - SRSF2_x_ASXL1
  - SRSF2_x_IDH2
  - U2AF1_x_ASXL1
  - DNMT3A_x_IDH
  - TET2_x_IDH
  - ASXL1_x_RUNX1
  - ASXL1_x_EZH2
  - n_epigenetic_mutations
  - has_RAS_pathway
  - RAS_x_high_blast
  - RAS_x_splicing
  - n_cytopenias
  - bicytopenia
  - pancytopenia
  - severe_anemia_x_low_plt
  - TP53_x_bicytopenia
  - blast_x_cyto_risk
  - high_blast_x_complex
  - high_blast_x_normal_karyo
  - very_high_blast
  - very_high_blast_x_TP53
  - has_cohesin
  - has_MLL_FLT3
  - SF3B1_favorable
  - high_risk_molecular
  - very_high_risk_molecular
  - NPM1_x_FLT3
  - NPM1_no_FLT3
  - n_mutations
  - high_mutation_burden
  - high_mutations_x_complex
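
The names suggest that most of these features are products of binary flags or counts over related features. The sketch below shows how a few of them might be built. The definitions, and in particular the cytopenia cut-offs (HB < 10, PLT < 100, ANC < 1.8), are assumptions chosen for illustration and are not taken from the saved feature files; `add_example_interactions` is a hypothetical helper.

```python
import pandas as pd

def add_example_interactions(X):
    """Illustrative versions of four interaction features."""
    out = pd.DataFrame(index=X.index)
    # Product of two 0/1 flags: 1 only when both are present
    out['TP53_x_complex'] = X['has_TP53'] * X['has_complex_karyotype']
    # Count how many of the three blood lineages fall below the assumed cut-offs
    cytopenias = pd.concat(
        [X['HB'] < 10, X['PLT'] < 100, X['ANC'] < 1.8], axis=1
    )
    out['n_cytopenias'] = cytopenias.sum(axis=1)
    out['bicytopenia'] = (out['n_cytopenias'] >= 2).astype(int)
    out['pancytopenia'] = (out['n_cytopenias'] == 3).astype(int)
    return out

# Toy feature rows (hypothetical values)
toy_X = pd.DataFrame({
    'has_TP53': [1, 0, 1],
    'has_complex_karyotype': [1, 1, 0],
    'HB': [8.0, 12.0, 9.0],
    'PLT': [40.0, 200.0, 150.0],
    'ANC': [1.0, 3.0, 2.5],
})
toy_interactions = add_example_interactions(toy_X)
print(toy_interactions)
```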


## 4. NaN Handling

**Issue discovered**: `np.max(vafs)` returns NaN if ANY VAF is missing.

**Root cause**: 89 patients have MLL gene alterations (chromosomal rearrangements, not point mutations), so VAF doesn't apply.

**Fix**: Use `np.nanmax(vafs)` to ignore missing values.
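
The difference is easy to reproduce on a small array. The sketch below also covers a case the fix does not address by itself: when every VAF for a patient is missing, `np.nanmax` emits a `RuntimeWarning` and still returns NaN, so a fallback value is needed. The helper `safe_max_vaf` and its default of 0.0 are hypothetical choices, not confirmed behavior of the fixed feature files.

```python
import numpy as np

vafs = np.array([0.12, np.nan, 0.45])
naive_max = np.max(vafs)      # NaN propagates through np.max
fixed_max = np.nanmax(vafs)   # NaN entries are ignored
print(naive_max, fixed_max)

def safe_max_vaf(vafs, default=0.0):
    """Max VAF ignoring NaN; fall back to `default` if nothing is usable."""
    vafs = np.asarray(vafs, dtype=float)
    # Guard the empty and all-NaN cases, where np.nanmax cannot return a value
    if vafs.size == 0 or np.isnan(vafs).all():
        return default
    return float(np.nanmax(vafs))

print(safe_max_vaf([np.nan, np.nan]))
print(safe_max_vaf([0.3, np.nan]))
```

Whether 0.0 or NaN is the better fallback depends on the model: a 0.0 reads as "no measurable clone", while NaN keeps the value explicitly unknown for models that handle it natively.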

In [17]:
# Check NaN in feature sets
print("NaN counts in unfixed 83-feature set:")
nan_counts = X_train_83.isna().sum()
nan_cols = nan_counts[nan_counts > 0]
print(nan_cols)

NaN counts in unfixed 83-feature set:
max_vaf      89
mean_vaf     89
vaf_range    89
dtype: int64


In [18]:
# Load fixed datasets
X_train_83_fixed = pd.read_csv(f'{TRAIN_PATH}/X_train_83features_with_id_fixed.csv')
X_train_128_fixed = pd.read_csv(f'{TRAIN_PATH}/X_train_128features_clean_fixed.csv')

print(f"NaN in 83-feature fixed: {X_train_83_fixed.isna().sum().sum()}")
print(f"NaN in 128-feature fixed: {X_train_128_fixed.isna().sum().sum()}")


## 5. Risk Groups for Evaluation

Evaluation uses **weighted C-index** across three groups:
- Overall population: 30% weight
- Test-like population (1+ risk factors): 40% weight
- High-risk population (2+ risk factors): 30% weight

**Risk factors**:
1. High blast count (BM_BLAST > 10)
2. TP53 mutation (has_TP53 > 0)
3. Low hemoglobin (HB < 10)
4. Low platelets (PLT < 50)
5. High cytogenetic risk (cyto_risk_score >= 3)

This is not a perfect classification, but it gives a general idea of patient risk levels.

In [19]:
def define_risk_groups(X):
    """Define risk groups based on clinical risk factors."""
    risk_factors = pd.DataFrame(index=X.index)
    risk_factors['high_blast'] = (X['BM_BLAST'] > 10).astype(int)
    risk_factors['has_TP53'] = (X['has_TP53'] > 0).astype(int)
    risk_factors['low_hb'] = (X['HB'] < 10).astype(int)
    risk_factors['low_plt'] = (X['PLT'] < 50).astype(int)
    risk_factors['high_cyto'] = (X['cyto_risk_score'] >= 3).astype(int)
    
    n_risk_factors = risk_factors.sum(axis=1)
    
    return {
        'test_like': n_risk_factors >= 1,  # 1+ risk factors
        'high_risk': n_risk_factors >= 2,  # 2+ risk factors
    }

# Apply to training data
X_train_unscaled = pd.read_csv(f'{TRAIN_PATH}/X_train_128features_clean_fixed.csv')
risk_groups = define_risk_groups(X_train_unscaled)

print(f"Total samples: {len(X_train_unscaled)}")
print(f"Test-like (1+ risk factors): {risk_groups['test_like'].sum()} ({risk_groups['test_like'].mean()*100:.1f}%)")
print(f"High-risk (2+ risk factors): {risk_groups['high_risk'].sum()} ({risk_groups['high_risk'].mean()*100:.1f}%)")

Total samples: 3120
Test-like (1+ risk factors): 2192 (70.3%)
High-risk (2+ risk factors): 772 (24.7%)
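
The weighted C-index described at the start of this section can be sketched as follows. The notebook does not say which C-index variant is used, so Harrell's estimator is assumed here, implemented directly in NumPy with pairwise comparisons. The function names and toy values are hypothetical.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    # Pair (i, j) is comparable when i had the event strictly before time j
    comparable = event[:, None] & (time[:, None] < time[None, :])
    n_pairs = comparable.sum()
    if n_pairs == 0:
        return np.nan
    # Concordant: the patient who died earlier has the higher predicted risk
    concordant = (risk[:, None] > risk[None, :]) & comparable
    tied = (risk[:, None] == risk[None, :]) & comparable
    return (concordant.sum() + 0.5 * tied.sum()) / n_pairs

def weighted_c_index(time, event, risk, test_like, high_risk):
    """0.3 * overall + 0.4 * test-like + 0.3 * high-risk C-index."""
    time, event, risk = (np.asarray(a) for a in (time, event, risk))
    test_like = np.asarray(test_like, dtype=bool)
    high_risk = np.asarray(high_risk, dtype=bool)
    c_all = harrell_c_index(time, event, risk)
    c_test = harrell_c_index(time[test_like], event[test_like], risk[test_like])
    c_high = harrell_c_index(time[high_risk], event[high_risk], risk[high_risk])
    return 0.3 * c_all + 0.4 * c_test + 0.3 * c_high

# Toy example (hypothetical values): risk is perfectly ordered against time
toy_time = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toy_event = np.array([1, 1, 0, 1, 1, 0])
toy_risk = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
toy_test_like = np.array([True, True, True, True, False, False])
toy_high_risk = np.array([True, True, False, False, False, False])

score = weighted_c_index(toy_time, toy_event, toy_risk, toy_test_like, toy_high_risk)
print(f"Weighted C-index: {score:.3f}")
```

With the objects in this notebook, the masks would come from `risk_groups['test_like']` and `risk_groups['high_risk']`, and the risk scores from whichever model is being evaluated. The pairwise matrices use memory quadratic in the number of patients, which is acceptable at roughly 3,000 rows but would need a different implementation for much larger cohorts.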


## 6. Data Scaling

StandardScaler is applied for models that require scaled features (CoxPH, DeepSurv).
XGBoost is tree-based and insensitive to feature scale, so it uses the unscaled features.

In [20]:
# Load scaled datasets
X_train_83_scaled = pd.read_csv(f'{TRAIN_PATH}/X_train_83features_fixed_scaled.csv')
X_train_128_scaled = pd.read_csv(f'{TRAIN_PATH}/X_train_128features_clean_fixed_scaled.csv')

print(f"83-feature scaled: {X_train_83_scaled.shape}")
print(f"128-feature scaled: {X_train_128_scaled.shape}")

# Verify scaling
print("\nMean of first 5 columns (should be ~0):")
print(X_train_128_scaled.iloc[:, :5].mean())
print("\nStd of first 5 columns (should be ~1):")
print(X_train_128_scaled.iloc[:, :5].std())

83-feature scaled: (3120, 83)
128-feature scaled: (3120, 128)

Mean of first 5 columns (should be ~0):
BM_BLAST    7.287618e-17
WBC        -6.832142e-18
ANC         2.960595e-17
HB          5.875642e-16
PLT        -7.743094e-17
dtype: float64

Std of first 5 columns (should be ~1):
BM_BLAST    1.00016
WBC         1.00016
ANC         1.00016
HB          1.00016
PLT         1.00016
dtype: float64
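
The standard deviations print as 1.00016 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample estimate (`ddof=1`). With 3120 rows the two differ by a factor of sqrt(3120/3119), which is about 1.00016. The NumPy-only sketch below reproduces the effect on synthetic data; the random values are placeholders, and the manual standardization mirrors what the scaler does.

```python
import numpy as np

n_rows = 3120
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=n_rows)  # synthetic feature column

# Standardize with the population std (ddof=0), as StandardScaler does
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)       # what the scaler normalizes to 1
sample_std = z.std(ddof=1)    # what pandas .std() reports by default
expected_ratio = np.sqrt(n_rows / (n_rows - 1))

print(f"ddof=0: {pop_std:.5f}")
print(f"ddof=1: {sample_std:.5f}")
print(f"sqrt(n/(n-1)): {expected_ratio:.5f}")
```

So the printed value confirms that scaling worked as intended; calling `.std(ddof=0)` on the scaled frame would report 1 up to floating-point error.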


## 7. Target Variable

In [21]:
# Load aligned target
target = pd.read_csv(f'{TRAIN_PATH}/target_train_clean_aligned.csv')
print(f"Target shape: {target.shape}")
print(target.head())

print("\nSurvival time (OS_YEARS):")
print(f"  Min: {target['OS_YEARS'].min():.2f}")
print(f"  Max: {target['OS_YEARS'].max():.2f}")
print(f"  Mean: {target['OS_YEARS'].mean():.2f}")

print(f"\nEvent rate (deaths): {target['OS_STATUS'].mean()*100:.1f}%")

Target shape: (3120, 3)
        ID  OS_YEARS  OS_STATUS
0  P132697  1.115068        1.0
1  P132698  4.928767        0.0
2  P116889  2.043836        0.0
3  P132699  2.476712        1.0
4  P132700  3.145205        0.0

Survival time (OS_YEARS):
  Min: 0.00
  Max: 22.04
  Mean: 2.52

Event rate (deaths): 51.3%
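
The `_aligned` suffix implies that target rows are in the same order as the rows of the feature matrix, which matters because the 128-feature file carries no explicit join in this notebook. A sketch of how such an alignment might be enforced by `ID` is shown below; `align_target` is a hypothetical helper, and it assumes the feature IDs are available as a list.

```python
import pandas as pd

def align_target(target, feature_ids):
    """Reorder and subset `target` so its rows follow `feature_ids` exactly."""
    indexed = target.set_index('ID')
    missing = [i for i in feature_ids if i not in indexed.index]
    if missing:
        raise KeyError(f"{len(missing)} feature IDs have no target row")
    # .loc with a list returns rows in the order of that list
    return indexed.loc[feature_ids].rename_axis('ID').reset_index()

# Toy target with one extra patient and a different order (hypothetical values)
toy_target = pd.DataFrame({
    'ID': ['P2', 'P1', 'P3'],
    'OS_YEARS': [4.9, 1.1, 2.0],
    'OS_STATUS': [0.0, 1.0, 0.0],
})
toy_aligned = align_target(toy_target, ['P1', 'P2'])
print(toy_aligned)
```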


## Summary

### Feature Sets
| Feature Set | # Features | NaN Fixed? | Used For |
|-------------|------------|------------|------------|
| 83 unfixed | 83 | No | XGB AFT (handles NaN natively) |
| 83 fixed | 83 | Yes | DeepSurv |
| 128 fixed | 128 | Yes | CoxPH, Two-Model |