In [1]:
#1 Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Step #1**
This initial setup step connects the Colab environment to the user's Google Drive, enabling seamless access to project files stored remotely. By importing the drive module from google.colab and invoking drive.mount('/content/drive'), the notebook gains permission to read and write files within the user's Drive. This is essential for loading datasets, saving outputs, and maintaining reproducibility across sessions. If the drive is already mounted, Colab provides a message confirming the mount and suggests using force_remount=True if reauthorization is needed.

In [2]:
#2 Set project paths

from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path('/content/drive/MyDrive/projects/metabric')
DATA_DIR = PROJECT_ROOT / 'data'
OUTPUTS_DIR = PROJECT_ROOT / 'outputs'
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)

RAW_PATH = DATA_DIR / 'METABRIC_cleaned_imputed.csv'
PREP_PATH = DATA_DIR / 'METABRIC_preprocessed.parquet'


**Step #2** This step establishes a structured directory layout for the project using Python’s pathlib library, which provides a clean and platform-independent way to manage file paths. The root path PROJECT_ROOT points to the METABRIC project folder within Google Drive. From this, two subdirectories are defined: DATA_DIR for storing input datasets and OUTPUTS_DIR for saving processed results. The OUTPUTS_DIR is created if it doesn’t already exist, ensuring that output files can be written without errors. Finally, specific file paths for the raw and preprocessed METABRIC datasets are assigned to RAW_PATH and PREP_PATH, respectively, enabling consistent access throughout the pipeline.
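As a minimal sketch of the same layout pattern, the snippet below builds the paths under a temporary directory instead of Google Drive, so it can run outside Colab. The helper name `build_project_paths` and the temporary root are illustrative assumptions, not part of the notebook.

```python
import tempfile
from pathlib import Path


def build_project_paths(root):
    """Return data/output paths for a project rooted at `root` (hypothetical helper)."""
    root = Path(root)
    paths = {
        "data_dir": root / "data",
        "outputs_dir": root / "outputs",
    }
    # parents=True creates missing ancestors; exist_ok=True makes the call safe to repeat.
    paths["outputs_dir"].mkdir(parents=True, exist_ok=True)
    paths["raw_path"] = paths["data_dir"] / "METABRIC_cleaned_imputed.csv"
    paths["prep_path"] = paths["data_dir"] / "METABRIC_preprocessed.parquet"
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = build_project_paths(Path(tmp) / "projects" / "metabric")
    outputs_created = paths["outputs_dir"].is_dir()  # created by mkdir
    data_dir_created = paths["data_dir"].exists()    # only a path object, never created
    raw_name = paths["raw_path"].name
```

As in the cell above, only the outputs directory is created; the data directory is assumed to exist already because it holds the input file.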

In [3]:
#3 Load data

df = pd.read_csv('/content/drive/MyDrive/METABRIC_cleaned_imputed.csv')


**Step #3** This step starts the data pipeline by loading the METABRIC dataset into the notebook. pandas.read_csv reads the cleaned and imputed CSV file from Google Drive and stores it in a DataFrame named df, making the data available for inspection, preprocessing, and modeling. Note that the call uses a hard-coded path at the top level of MyDrive rather than the RAW_PATH defined in Step #2, which points to a file of the same name under projects/metabric/data. The two locations need to be kept in sync, or the call changed to pd.read_csv(RAW_PATH). The cell assumes the file exists at the given location and is formatted correctly for survival analysis and feature engineering.
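One way to make the "file exists and is formatted correctly" assumption explicit is to check it before loading. The sketch below is illustrative only: `load_csv_checked` is a hypothetical helper, and it is exercised on a small temporary CSV rather than the METABRIC file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_checked(path, required_columns=()):
    """Read a CSV after checking that the file and expected columns exist (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected dataset at {path}")
    frame = pd.read_csv(path)
    missing = [c for c in required_columns if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return frame


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    # Two made-up rows, used only to exercise the helper.
    pd.DataFrame({"Patient ID": ["P1", "P2"], "Age at Diagnosis": [76, 43]}).to_csv(
        demo_path, index=False
    )
    demo_df = load_csv_checked(demo_path, required_columns=["Patient ID"])
    try:
        load_csv_checked(Path(tmp) / "absent.csv")
        missing_file_raised = False
    except FileNotFoundError:
        missing_file_raised = True
```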

In [4]:
#4 Inspect Cancer Type Categories

print("Unique values in 'Cancer Type' column:")
print(df['Cancer Type'].unique())

Unique values in 'Cancer Type' column:
['Breast Cancer' 'Breast Sarcoma']


**Step #4** Inspects the distinct categories present in the 'Cancer Type' column of the dataset by using df['Cancer Type'].unique(). This command retrieves all unique values in that column, helping analysts understand the scope of cancer types represented. The output reveals two categories: 'Breast Cancer' and 'Breast Sarcoma', indicating that the dataset includes both common and rare breast malignancies. This step is essential for guiding stratified analysis, ensuring that modeling or visualization efforts account for categorical diversity in clinical outcomes.
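unique() lists the categories but not how often each occurs. A small sketch with value_counts, using invented toy data rather than the METABRIC counts, shows how the frequencies could be inspected before deciding on a stratified analysis:

```python
import pandas as pd

# Toy data invented for illustration; these are not the METABRIC frequencies.
toy = pd.DataFrame({"Cancer Type": ["Breast Cancer"] * 5 + ["Breast Sarcoma"]})

counts = toy["Cancer Type"].value_counts()                # absolute frequency per category
shares = toy["Cancer Type"].value_counts(normalize=True)  # relative frequency per category
summary = pd.DataFrame({"n": counts, "share": shares.round(3)})
print(summary)
```

A category with very few rows may be too small to model on its own, which is worth knowing before stratifying by it.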

In [5]:
#5 Generate Numeric Summary Statistics

display(df.describe())

Unnamed: 0,Age at Diagnosis,Cohort,ER Status,Neoplasm Histologic Grade,Lymph nodes examined positive,Mutation Count,Nottingham prognostic index,Overall Survival (Months),Relapse Free Status (Months),Tumor Size,Tumor Stage
count,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0
mean,60.420885,2.90032,0.743324,2.411479,1.955759,5.579992,4.022479,125.207254,108.862096,26.218972,1.709725
std,13.000852,1.957908,0.436886,0.63351,3.798771,3.845849,1.141107,67.623861,74.642188,14.907284,0.553185
min,22.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
25%,51.0,1.0,0.0,2.0,0.0,3.0,3.1,76.0,43.0,18.0,1.0
50%,61.0,3.0,1.0,2.4,1.0,5.0,4.0,125.0,104.0,24.0,1.7
75%,70.0,4.0,1.0,3.0,2.0,7.0,5.0,164.0,163.0,30.0,2.0
max,96.0,9.0,1.0,3.0,45.0,80.0,7.2,355.0,384.0,182.0,4.0


**Step #5** Generates a comprehensive numeric summary of key clinical and demographic features in the METABRIC dataset using df.describe(). This output includes count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum values for each numeric column, such as age at diagnosis, tumor size, mutation count, and survival metrics. These statistics provide a foundational overview of data distribution, central tendency, and variability, helping identify potential outliers, skewed variables, and normalization needs. This step is essential for informing preprocessing decisions and guiding downstream modeling strategies.
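By default describe() covers numeric columns only. The sketch below, on invented toy values, shows how the non-numeric columns could be summarized too and how skewness could be quantified rather than judged by eye:

```python
import pandas as pd

# Toy values invented for illustration.
toy = pd.DataFrame({
    "Tumor Size": [10.0, 15.0, 22.0, 25.0, 180.0],
    "Cellularity": ["High", "High", "Moderate", "High", "Low"],
})

numeric_summary = toy.describe()                      # numeric columns only (default)
categorical_summary = toy.describe(exclude="number")  # count, unique, top, freq
skew = toy.select_dtypes("number").skew()             # > 0 indicates a long right tail
```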

In [6]:
#6 Audit Missing Values

print('Missing values in each column of df:')
print(df.isnull().sum())

Missing values in each column of df:
Patient ID                        0
Age at Diagnosis                  0
Type of Breast Surgery            0
Cancer Type                       0
Cancer Type Detailed              0
Cellularity                       0
Chemotherapy                      0
Pam50 + Claudin-low subtype       0
Cohort                            0
ER status measured by IHC         0
ER Status                         0
Neoplasm Histologic Grade         0
HER2 status measured by SNP6      0
HER2 Status                       0
Tumor Other Histologic Subtype    0
Hormone Therapy                   0
Inferred Menopausal State         0
Integrative Cluster               0
Primary Tumor Laterality          0
Lymph nodes examined positive     0
Mutation Count                    0
Nottingham prognostic index       0
Oncotree Code                     0
Overall Survival (Months)         0
Overall Survival Status           0
PR Status                         0
Radio Therapy              

**Step #6** Audits missing data across all columns using df.isnull().sum(), which counts the null entries per feature. Every column listed in the output reports zero missing values, as expected for a file that has already been cleaned and imputed. This check is worth running before modeling or statistical analysis, because missing data can bias results, break algorithms, or require imputation. A zero count only covers values that pandas recognizes as null; placeholder strings such as 'Unknown' would not be detected.
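When a dataset does contain gaps, a per-column count is easier to act on alongside a percentage. The following sketch uses invented toy data with deliberate gaps to show one possible form of such a report:

```python
import numpy as np
import pandas as pd

# Toy data with deliberate gaps, invented for illustration.
toy = pd.DataFrame({
    "Tumor Size": [22.0, np.nan, 15.0, 25.0],
    "Cellularity": ["High", "High", None, "Moderate"],
    "Cohort": [1.0, 1.0, 1.0, 1.0],
})

missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
# Keep only columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
total_missing = int(toy.isna().sum().sum())
```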

In [7]:
#7 Detailed Clinical Feature Summary

display(df.describe())

Unnamed: 0,Age at Diagnosis,Cohort,ER Status,Neoplasm Histologic Grade,Lymph nodes examined positive,Mutation Count,Nottingham prognostic index,Overall Survival (Months),Relapse Free Status (Months),Tumor Size,Tumor Stage
count,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0,2509.0
mean,60.420885,2.90032,0.743324,2.411479,1.955759,5.579992,4.022479,125.207254,108.862096,26.218972,1.709725
std,13.000852,1.957908,0.436886,0.63351,3.798771,3.845849,1.141107,67.623861,74.642188,14.907284,0.553185
min,22.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0
25%,51.0,1.0,0.0,2.0,0.0,3.0,3.1,76.0,43.0,18.0,1.0
50%,61.0,3.0,1.0,2.4,1.0,5.0,4.0,125.0,104.0,24.0,1.7
75%,70.0,4.0,1.0,3.0,2.0,7.0,5.0,164.0,163.0,30.0,2.0
max,96.0,9.0,1.0,3.0,45.0,80.0,7.2,355.0,384.0,182.0,4.0


**Step #7** Repeats the df.describe() call from Step #5 and produces the same table: count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum for each numeric column. Two details are worth noting. The medians of Neoplasm Histologic Grade (2.4) and Tumor Stage (1.7) are non-integer values for ordinal variables, which suggests that missing entries were imputed with a non-integer value. The maxima of Lymph nodes examined positive (45), Mutation Count (80), and Tumor Size (182) lie far above their 75th percentiles (2, 7, and 30), which points to right-skewed distributions and possible outliers to review before scaling.

In [8]:
#8 Inspect Dataset Structure and Types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2509 entries, 0 to 2508
Data columns (total 34 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Patient ID                      2509 non-null   object 
 1   Age at Diagnosis                2509 non-null   int64  
 2   Type of Breast Surgery          2509 non-null   object 
 3   Cancer Type                     2509 non-null   object 
 4   Cancer Type Detailed            2509 non-null   object 
 5   Cellularity                     2509 non-null   object 
 6   Chemotherapy                    2509 non-null   object 
 7   Pam50 + Claudin-low subtype     2509 non-null   object 
 8   Cohort                          2509 non-null   float64
 9   ER status measured by IHC       2509 non-null   object 
 10  ER Status                       2509 non-null   int64  
 11  Neoplasm Histologic Grade       2509 non-null   float64
 12  HER2 status measured by SNP6    25

**Step #8** Uses df.info() to print a structural summary of the dataset: the number of entries, the column names, non-null counts, and data types. The output reports 2,509 rows and 34 columns, and each column visible in the output has 2,509 non-null values. Numeric variables are stored as int64 or float64 and categorical variables as object, which determines the preprocessing each column needs, such as encoding or scaling.
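df.info() prints its summary rather than returning it. If the same facts are needed programmatically, for example to split columns by type, a sketch along these lines could be used (toy data, invented for illustration):

```python
import pandas as pd

# Toy frame with one integer, one float, and one string column.
toy = pd.DataFrame({
    "Age at Diagnosis": [76, 43],
    "Cohort": [1.0, 1.0],
    "Cellularity": ["High", "High"],
})

dtype_counts = toy.dtypes.astype(str).value_counts()  # number of columns per dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
n_rows, n_cols = toy.shape
```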

In [9]:
#9 Define Features and Survival Targets

# Adjust column names to your schema based on previous corrections
time_col = 'Overall Survival (Months)'
event_col = 'Overall Survival Status'

# Reconstruct feature_cols to include all columns except the excluded ones
# Note: df is the full dataframe as loaded from METABRIC_cleaned_imputed.csv
excluded_cols = ['Patient ID', time_col, event_col]
feature_cols = [col for col in df.columns if col not in excluded_cols]

# This line will overwrite the global df with only the selected columns.
df = df[feature_cols + [time_col, event_col]].copy()

**Step #9** Defines the modeling schema by selecting the feature and target columns. It sets 'Overall Survival (Months)' as the time-to-event variable and 'Overall Survival Status' as the event indicator, then builds feature_cols from every remaining column except 'Patient ID', so that the identifier and the two targets are not used as predictors. The final line overwrites df with the selected features followed by the survival targets. Note that other outcome-related columns, such as "Patient's Vital Status", 'Relapse Free Status', and 'Relapse Free Status (Months)', remain in feature_cols; they may need to be excluded as well to avoid leaking the outcome into the predictors.
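The selection logic can be sketched on a toy frame. The `leakage_cols` list below is an assumption added for illustration: it treats a column that restates the outcome as something to exclude, which the notebook itself does not do.

```python
import pandas as pd

# Toy rows with placeholder IDs, invented for illustration.
toy = pd.DataFrame({
    "Patient ID": ["P1", "P2"],
    "Age at Diagnosis": [76, 43],
    "Patient's Vital Status": ["Living", "Living"],
    "Overall Survival (Months)": [141, 85],
    "Overall Survival Status": ["Living", "Living"],
})

time_col = "Overall Survival (Months)"
event_col = "Overall Survival Status"
excluded_cols = ["Patient ID", time_col, event_col]
# Assumption: columns that restate the outcome are treated as leakage here.
leakage_cols = ["Patient's Vital Status"]

feature_cols = [c for c in toy.columns if c not in excluded_cols + leakage_cols]
model_df = toy[feature_cols + [time_col, event_col]].copy()
```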

In [10]:
#10 Preview the Selected Columns

display(df.head())

Unnamed: 0,Age at Diagnosis,Type of Breast Surgery,Cancer Type,Cancer Type Detailed,Cellularity,Chemotherapy,Pam50 + Claudin-low subtype,Cohort,ER status measured by IHC,ER Status,...,Radio Therapy,Relapse Free Status (Months),Relapse Free Status,Sex,3-Gene classifier subtype,Tumor Size,Tumor Stage,Patient's Vital Status,Overall Survival (Months),Overall Survival Status
0,76,Mastectomy,Breast Cancer,Breast Invasive Ductal Carcinoma,High,No,claudin-low,1.0,Positve,1,...,Yes,139,Not Recurred,Female,ER-/HER2-,22.0,2.0,Living,141,Living
1,43,Breast Conserving,Breast Cancer,Breast Invasive Ductal Carcinoma,High,No,LumA,1.0,Positve,1,...,Yes,84,Not Recurred,Female,ER+/HER2- High Prolif,10.0,1.0,Living,85,Living
2,49,Mastectomy,Breast Cancer,Breast Invasive Ductal Carcinoma,High,Yes,LumB,1.0,Positve,1,...,No,151,Recurred,Female,ER+/HER2- Low Prolif,15.0,2.0,Died of Disease,164,Deceased
3,48,Mastectomy,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,Yes,LumB,1.0,Positve,1,...,Yes,163,Not Recurred,Female,ER+/HER2- Low Prolif,25.0,2.0,Living,165,Living
4,77,Mastectomy,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,Yes,LumB,1.0,Positve,1,...,Yes,19,Recurred,Female,ER+/HER2- High Prolif,40.0,2.0,Died of Disease,41,Deceased


**Step #10:** Preview of the Selected Columns

This cell displays the first five rows of df after the column selection in Step #9. 'Patient ID' is no longer present, and the two survival targets now appear as the last columns. The preview also surfaces details that matter for later steps: 'Overall Survival Status' holds the string labels 'Living' and 'Deceased' rather than 0/1 codes, 'ER status measured by IHC' contains the misspelled label 'Positve', and outcome-related columns such as 'Relapse Free Status' and "Patient's Vital Status" are still among the features.

In [11]:
#11 Type conversions and categorical handling

# Ensure the survival time is numeric
df[time_col] = pd.to_numeric(df[time_col], errors='coerce')

# 'Overall Survival Status' holds the labels 'Living'/'Deceased' (see the Step #10 preview).
# pd.to_numeric(..., errors='coerce') would turn every label into NaN, so map them explicitly.
# Any label outside this mapping becomes NaN and should be checked before modeling.
df[event_col] = df[event_col].map({'Living': 0, 'Deceased': 1})

# Set categorical types for known categorical columns
categorical_cols = ['PR Status', 'ER Status', 'HER2 Status', 'Neoplasm Histologic Grade']
for c in categorical_cols:
    df[c] = df[c].astype('category')

**Step #11:** Event-Time Conversion and Categorical Encoding

This step formats the survival variables and key clinical features for modeling. The time column, 'Overall Survival (Months)', is converted with pd.to_numeric, with invalid entries set to NaN. The event column, 'Overall Survival Status', stores the labels 'Living' and 'Deceased', so it is mapped to 0 (censored) and 1 (event); applying pd.to_numeric with errors='coerce' to these labels would produce NaN for every row. Next, the known categorical columns PR Status, ER Status, HER2 Status, and Neoplasm Histologic Grade are cast to the category dtype, so that the pipeline in Step #12 one-hot encodes them instead of scaling them. Consistent types matter for downstream tasks such as Cox regression or Kaplan-Meier estimation, which expect a numeric duration and a binary event indicator.
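The preview in Step #10 shows 'Overall Survival Status' holding the labels 'Living' and 'Deceased'. The sketch below, on a four-row toy Series, illustrates how pd.to_numeric with errors='coerce' treats such labels compared with an explicit mapping; the 0/1 coding (1 for 'Deceased') is an assumption about the intended event definition.

```python
import pandas as pd

status = pd.Series(
    ["Living", "Deceased", "Deceased", "Living"], name="Overall Survival Status"
)

# errors='coerce' turns every non-numeric string into NaN, so the event information is lost.
coerced = pd.to_numeric(status, errors="coerce")

# An explicit mapping keeps the information; unmapped labels become NaN and can be counted.
event = status.map({"Living": 0, "Deceased": 1})
unmapped = int(event.isna().sum())
```

Checking that `unmapped` is zero before modeling guards against labels that the mapping does not cover.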

In [12]:
#12 Feature Encoding and Scaling Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

# Dynamically identify categorical and numeric columns from the current df
categorical_cols = df[feature_cols].select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = df[feature_cols].select_dtypes(exclude=['object', 'category']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
    ],
    remainder='drop'
)

# Fit preprocessor and transform
X = preprocessor.fit_transform(df[feature_cols])

# Build transformed feature names
ohe = preprocessor.named_transformers_['cat']
cat_feature_names = list(ohe.get_feature_names_out(categorical_cols))
transformed_feature_names = numeric_cols + cat_feature_names

**Step #12:** Encode Categoricals and Scale Numerics
In this preprocessing step, the dataset is prepared for modeling by encoding categorical features and scaling numeric ones. First, the code identifies which columns in df[feature_cols] are categorical (object or category dtype) and which are numeric. A ColumnTransformer then applies StandardScaler to the numeric columns and OneHotEncoder to the categorical ones; handle_unknown='ignore' encodes categories unseen during fitting as all zeros, and sparse_output=False (available from scikit-learn 1.2) returns a dense array. The transformer is fitted and applied to the data, producing a fully numeric matrix X. Finally, the code rebuilds the transformed feature names by concatenating the numeric column names with the one-hot encoded names, in the same order as the transformers, so that each column of X can be traced back to its source feature.

In [13]:
#13 Finalize Preprocessed Dataset with Survival and save

# Create df_proc from the transformed features (X) and original survival columns
df_proc = pd.DataFrame(X, columns=transformed_feature_names)
df_proc[time_col] = df[time_col]
df_proc[event_col] = df[event_col]

PROC_FILE = PREP_PATH
# PREP_PATH ends in .parquet, so write Parquet (needs a Parquet engine such as pyarrow)
df_proc.to_parquet(PROC_FILE, index=False)
print("Saved preprocessed data:", PROC_FILE)

Saved preprocessed data: /content/drive/MyDrive/projects/metabric/data/METABRIC_preprocessed.parquet


**Step #13:** Combine Transformed Features with Survival Columns and Save
In this final preprocessing step, the transformed feature matrix X, which contains the encoded and scaled predictors, is converted into a DataFrame named df_proc with transformed_feature_names as column labels. The survival columns (time_col and event_col) from df are then appended, so the output contains both the processed features and the survival targets. Both frames share the same default row index, so the columns align row by row. Finally, the DataFrame is saved as a Parquet file at PROC_FILE, matching the .parquet extension of PREP_PATH, so that later steps can reload it with pd.read_parquet. The print statement reports the path once the write has completed.
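One way to keep the on-disk format and the file extension consistent is to choose the writer from the path suffix. The sketch below is illustrative: `save_frame` is a hypothetical helper, and only its CSV branch is exercised here because the Parquet branch needs an engine such as pyarrow to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frame(frame, path):
    """Write `frame` in the format implied by the file suffix (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        frame.to_csv(path, index=False)
    elif suffix == ".parquet":
        # Requires a Parquet engine such as pyarrow.
        frame.to_parquet(path, index=False)
    else:
        raise ValueError(f"Unsupported file suffix: {suffix!r}")
    return path


# Toy values invented for illustration.
toy = pd.DataFrame({"x": [0.5, -1.25], "Overall Survival (Months)": [141, 85]})
with tempfile.TemporaryDirectory() as tmp:
    saved = save_frame(toy, Path(tmp) / "toy_preprocessed.csv")
    round_trip = pd.read_csv(saved)
    matches = round_trip.equals(toy)  # reload and compare as a basic export check
```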