In [1]:
# Step 1 — Connect Google Drive to Colab Workspace

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Step #1** gives the notebook access to your Google Drive from within the Google Colab environment. Importing drive from google.colab and running drive.mount('/content/drive') starts an authorization flow that links the notebook session to your Drive account. Once mounted, Drive appears as a directory inside the Colab file system, with your own files under /content/drive/MyDrive, so you can load datasets, save outputs, and keep files across sessions.

**Why This Step Matters**
Colab’s runtime is temporary—files stored locally disappear when the session resets. Mounting Google Drive solves this by giving you a stable, persistent storage location. It also centralizes your workflow: datasets, scripts, checkpoints, and results all live in one place that you can reuse across sessions and share with collaborators. Without this step, you would need to re-upload files repeatedly or risk losing work.
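
The google.colab package is only importable inside a Colab runtime, so the mount call fails with an ImportError when the notebook is run elsewhere. A minimal sketch of a guarded mount follows; the helper name mount_drive_if_colab is hypothetical and is not part of the original notebook.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return True if mounted.

    Hypothetical helper for illustration only. The notebook itself
    calls drive.mount directly.
    """
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


mounted = mount_drive_if_colab()
print("Drive mounted:", mounted)
```

Outside Colab the helper returns False, which lets the rest of a script fall back to a local data path instead of stopping at the import.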

In [2]:
# Step 2 — Load the cleaned colorectal dataset

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/colorectal_cancer_prediction_cleaned_imputed.csv')


**Step #2** loads the cleaned colorectal cancer dataset into your Colab environment so it can be used for analysis and modeling. By importing pandas and reading the CSV file stored in Google Drive, the dataset is loaded into a DataFrame named df, giving you a structured, tabular object that supports filtering, exploration, visualization, and feature engineering. This step establishes the core data object that the rest of the workflow will operate on.

**Why This Step Matters**
Everything in the pipeline depends on having the dataset properly loaded into memory. Without this step, no analysis, visualization, or modeling can occur. It also ensures reproducibility: by loading the exact cleaned dataset from Drive, you maintain consistency across sessions and collaborators. This step anchors the workflow by providing a reliable, standardized starting point for all subsequent operations.
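
Before moving on, it can help to confirm that the file was parsed as expected. The sketch below uses a small in-memory CSV as a stand-in for the file on Drive; the column names and values are invented for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV on Drive (made-up columns and values).
csv_text = """Patient_ID,Age,Stage,Notes
P001,64,II,follow-up due
P002,,III,
P003,58,I,stable
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # which columns were parsed as numbers or text
print(sample.isna().sum())  # missing values per column
```

The same three checks can be run on df right after the read_csv call in Step 2.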

In [3]:
# Step 3 — Remove Non‑Analytic and Identifier Columns

df.drop(columns=['Patient_ID', 'Notes'], inplace=True, errors='ignore')


**Step #3** removes columns that do not contribute to analysis or modeling. The call df.drop(columns=['Patient_ID', 'Notes'], inplace=True, errors='ignore') modifies df in place and drops the patient identifier and the free-text notes field, which either provide no predictive value or could introduce noise. Because errors='ignore' is set, a listed column that is absent from the file is skipped silently instead of raising a KeyError. The resulting DataFrame contains only the variables intended for preprocessing and modeling.

**Why This Step Matters**
Removing irrelevant columns reduces noise, prevents accidental leakage of personally identifiable information, and keeps the modeling process focused on meaningful clinical features. It also improves computational efficiency and reduces the risk of models overfitting to non‑informative or inappropriate variables. This step is foundational for building a trustworthy, reproducible clinical prediction pipeline.
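
The effect of errors='ignore' can be seen on a toy frame. This is a sketch with made-up data in which the Notes column is deliberately absent.

```python
import pandas as pd

# Hypothetical frame: it has Patient_ID but no Notes column.
sample = pd.DataFrame({"Patient_ID": ["P001", "P002"], "Age": [64, 58]})

# errors='ignore' skips labels that are absent instead of raising KeyError.
sample.drop(columns=["Patient_ID", "Notes"], inplace=True, errors="ignore")
print(list(sample.columns))

# Without errors='ignore', a missing label raises KeyError.
try:
    sample.drop(columns=["Notes"])
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)
```

The trade-off is that a misspelled column name is also skipped silently, so it is worth checking df.columns afterwards to confirm the identifier is really gone.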

In [4]:
# Step 4 — Encode Categorical Features into Dummy Variables

df = pd.get_dummies(df, drop_first=True)


**Step #4** converts the categorical variables in the dataset into numerical features using one-hot encoding. With pd.get_dummies(df, drop_first=True), every column with an object, string, or category dtype is expanded into binary indicator columns, one per category, and the first category of each is dropped to avoid perfect multicollinearity among the indicators. Columns that are already numeric are left unchanged, including categorical variables stored as integer codes. Because no columns argument is passed, a text-valued outcome column would be encoded along with the features. The result is a table that machine-learning algorithms, which generally require numeric inputs, can consume directly.

**Why This Step Matters**
Most machine‑learning models cannot interpret raw text categories, so encoding them is essential for model performance and correctness. One‑hot encoding preserves the meaning of each category without imposing an artificial order, ensuring that the model learns appropriate relationships. Dropping the first category also protects against the dummy‑variable trap, improving model stability and interpretability. Without this step, the dataset would be incompatible with many algorithms and could produce misleading results.
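
A small example makes the effect of drop_first=True concrete. The column names and values below are invented for illustration.

```python
import pandas as pd

# Hypothetical columns and values, for illustration only.
sample = pd.DataFrame({
    "Age": [64, 58, 71],
    "Smoker": ["No", "Yes", "No"],
    "Stage": ["I", "II", "III"],
})

encoded = pd.get_dummies(sample, drop_first=True)
print(list(encoded.columns))
# ['Age', 'Smoker_Yes', 'Stage_II', 'Stage_III']
```

The dropped categories (Smoker "No" and Stage "I") become the reference levels: a row with all of a variable's indicators set to zero belongs to that reference category.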

In [5]:
# Step 5 — Configure median imputation for missing values

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')


**Step #5** sets up a median‑based imputation strategy to handle missing values in the dataset. By importing SimpleImputer from sklearn.impute and initializing it with strategy='median', you define a consistent rule for replacing missing entries in numerical columns. This prepares the imputer object for later application to the DataFrame, ensuring that gaps in the data are filled using a robust, distribution‑aware method that is less sensitive to outliers than mean imputation.

**Why This Step Matters**
Missing data is common in clinical datasets, and how you handle it directly affects model performance and validity. Median imputation is a simple, outlier-resistant approach: the fill value is the middle of the observed values, so a few extreme measurements do not shift it. Configuring the imputer up front ensures that missingness is treated consistently and reproducibly throughout the workflow.
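
The difference between median and mean fill values can be checked directly on a single made-up feature that contains one extreme value. This is an illustrative sketch, not data from the colorectal file.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up feature with a missing entry and one extreme value.
X = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0]])

# statistics_ holds the fill value learned for each column during fit.
median_fill = SimpleImputer(strategy="median").fit(X).statistics_[0]
mean_fill = SimpleImputer(strategy="mean").fit(X).statistics_[0]

print("median fill:", median_fill)
print("mean fill:", mean_fill)
```

Here the median fill is 2.5, close to the bulk of the observed values, while the mean fill is pulled up to 26.5 by the single value of 100.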

In [6]:
# Step 6 — Apply Median Imputation to Dataset

imputed_data = imputer.fit_transform(df)


**Step #6** applies the median‑imputation strategy to the dataset by fitting the imputer to the DataFrame and transforming it in a single operation. Using imputer.fit_transform(df), the imputer first learns the median value of each column and then replaces any missing entries with those learned medians. The result is a fully imputed numerical array stored in imputed_data, ensuring that the dataset is complete and ready for downstream modeling without gaps or inconsistencies.

**Why This Step Matters**
Most scikit-learn estimators raise an error when the input contains missing values, and handling missingness differently from column to column can introduce bias or instability. Fitting and transforming in one step replaces every missing value by the same reproducible rule, the column median. This prevents training failures and keeps the clinical variables complete, which matters in medical prediction tasks where data quality directly affects outcomes.
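
The following sketch shows what fit_transform returns on a toy DataFrame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values for illustration.
sample = pd.DataFrame({
    "Age": [64.0, np.nan, 58.0, 70.0],
    "CEA_Level": [2.0, 4.0, np.nan, 9.0],
})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(sample)

print(type(filled).__name__)  # a NumPy array, not a DataFrame
print(imputer.statistics_)    # per-column medians learned during fit
print(filled)
```

If the data is later split into training and test sets, one common choice is to fit the imputer on the training rows only and reuse it on the test rows, so that test-set values do not influence the learned medians.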

In [7]:
# Step 7 — Retrieve Final Imputed Feature Names

processed_columns = imputer.get_feature_names_out(df.columns)


**Step #7** retrieves the final set of feature names after imputation has been applied to the dataset. Using imputer.get_feature_names_out(df.columns), the workflow extracts the processed column names in the exact order produced by the imputer. This ensures that you can correctly label the imputed NumPy array, maintain alignment between features and their transformed values, and preserve interpretability as the dataset moves into modeling stages.

**Why This Step Matters**
Once the dataset is imputed, the output becomes a NumPy array with no inherent column labels. Without retrieving the processed feature names, you risk losing track of which values correspond to which clinical variables—an issue that can compromise interpretability, reproducibility, and downstream analysis. Extracting the feature names ensures that the imputed data can be reassembled into a clean, labeled DataFrame, preserving transparency and enabling accurate modeling and reporting.

In [8]:
# Step 8 — Reconstruct Labeled DataFrame After Imputation

df = pd.DataFrame(imputed_data, columns=processed_columns, index=df.index)


**Step #8** reconstructs the dataset into a fully labeled pandas DataFrame after imputation. Using pd.DataFrame(imputed_data, columns=processed_columns, index=df.index), the workflow converts the imputed NumPy array back into a structured table with the correct column names and original row index. This restores the familiar DataFrame format, ensuring that all downstream analysis—such as feature inspection, visualization, and model training—can proceed with a clean, well‑organized dataset that retains its original structure.

**Why This Step Matters**
After imputation, the dataset temporarily loses its column names and index, which can make analysis error‑prone and difficult to interpret. Rebuilding the DataFrame ensures that every value is correctly aligned with its feature name and original row position. This preserves the integrity of the dataset, supports reproducibility, and prevents subtle mistakes in downstream modeling. In clinical prediction workflows, maintaining precise feature labeling is essential for transparency, auditability, and trustworthiness.

In [9]:
# Step 9 — Apply Z‑Score Standardization to All Features

from sklearn.preprocessing import StandardScaler

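scaler = StandardScaler()
# Note: this scales every column in df, including the dummy indicators
# and any outcome column that has not been separated out beforehand.
df[df.columns] = scaler.fit_transform(df)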


**Step #9** standardizes every column in the dataset using Z-score normalization. By creating a StandardScaler() object and assigning scaler.fit_transform(df) back to all columns, each column is rescaled to a mean of 0 and a standard deviation of 1 (StandardScaler uses the population standard deviation). Variables measured on different scales, such as lab values, biomarker levels, or demographic metrics, are thereby placed on a comparable footing, so that no single feature dominates the learning process because of its magnitude. Because all columns are numeric at this point, the one-hot indicator columns are standardized as well, and so is the outcome column if it is still part of df; a classification target is normally kept out of this step.

**Why This Step Matters**
Many machine-learning models, such as regularized logistic regression, SVMs, neural networks, and K-means, are sensitive to feature scale and can perform poorly when features differ widely in magnitude. Standardization prevents large-magnitude variables from overpowering smaller ones, improves numerical stability, speeds up convergence of gradient-based optimizers, and often improves predictive performance. It also makes coefficients easier to compare across features, which supports interpretability.
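
The sketch below shows one way to standardize only the feature columns and to confirm that StandardScaler computes z = (x - mean) / std with the population standard deviation. The data is invented, and the column name Outcome is a hypothetical stand-in for whatever the target column is called in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data; "Outcome" is a hypothetical 0/1 target column.
sample = pd.DataFrame({
    "Age": [50.0, 60.0, 70.0, 80.0],
    "CEA_Level": [1.0, 2.0, 3.0, 10.0],
    "Outcome": [0.0, 1.0, 0.0, 1.0],
})

# Scale the features only and leave the target untouched.
feature_cols = [c for c in sample.columns if c != "Outcome"]

scaler = StandardScaler()
scaled = sample.copy()
scaled[feature_cols] = scaler.fit_transform(sample[feature_cols])

# Manual z-score using the population standard deviation (ddof=0).
features = sample[feature_cols]
manual = (features - features.mean()) / features.std(ddof=0)

print(scaled[feature_cols].mean().round(6).tolist())
print(scaled[feature_cols].std(ddof=0).round(6).tolist())
print(np.allclose(scaled[feature_cols], manual))
print(scaled["Outcome"].tolist())
```

Keeping the fitted scaler object around also allows the same means and standard deviations to be applied to new data with scaler.transform.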

In [10]:
import os

# Step 10 — Export Final Preprocessed Dataset to Disk

output_directory = '/content/drive/MyDrive/projects/colorectal/notebooks'
os.makedirs(output_directory, exist_ok=True)

df.to_csv(os.path.join(output_directory, 'colorectal_cancer_prediction_preprocessed.csv'),
          index=False)

**Step #10** saves the fully preprocessed dataset to a designated project directory so it can be reused, shared, or version-controlled. The code creates the output folder if it does not already exist (exist_ok=True suppresses the error when it does) and then writes the cleaned, encoded, imputed, and standardized DataFrame to a CSV file. Because index=False is passed, the row index is not written, so the file contains only the data columns. Later notebooks or modeling scripts can load this file directly without rerunning the earlier steps.

**Why This Step Matters**
A reproducible pipeline depends on stable, versioned outputs. Saving the preprocessed dataset ensures that downstream modeling is consistent across sessions, collaborators, and environments. It also protects against data loss in temporary runtimes like Google Colab and allows you to checkpoint your progress. In clinical prediction workflows, having a fixed, documented preprocessed dataset is essential for transparency, auditability, and regulatory‑friendly research practices.
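
A quick round trip confirms what the export does and does not preserve. This sketch writes to a temporary directory instead of Drive; the folder name, file name, and values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a non-default index, for illustration.
sample = pd.DataFrame(
    {"Age": [0.5, -1.5], "BMI": [1.25, -0.75]},
    index=[10, 25],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "projects", "demo")
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    out_path = os.path.join(out_dir, "preprocessed.csv")

    sample.to_csv(out_path, index=False)  # row labels are not written
    reloaded = pd.read_csv(out_path)

print(list(reloaded.index))
print(reloaded)
```

The values and column names survive the round trip, but the original row labels (10 and 25) are replaced by a fresh 0-based index. If those labels carry meaning, writing with index=True and reading back with index_col=0 keeps them.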