In [1]:
# 1. Initialize Analysis Environment and Core Libraries
# -----------------------------
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

print("Environment initialized.")

Environment initialized.


**Step #1** sets up the entire analytical workspace by importing the core Python libraries needed for data handling, numerical computation, visualization, and preprocessing. This includes pandas for structured data manipulation, numpy for numerical operations, seaborn and matplotlib for plotting, and scikit‑learn tools for encoding, scaling, and imputing missing values. By loading these modules at the very beginning, the environment becomes fully equipped to read datasets, clean them, transform variables, and generate visual insights without interruption.

**Why This Step Matters**
Initializing the environment is crucial because it establishes consistency, reproducibility, and readiness. When collaborators or future readers run the notebook, they immediately see which tools the workflow depends on, reducing confusion and preventing errors. It also ensures that the analysis begins from a stable, predictable starting point, which is a hallmark of well‑designed scientific and clinical research pipelines.
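
One way to make the dependency list more useful for reproducibility is to record the library versions alongside the imports. The snippet below is an illustrative sketch, not part of the original notebook; the `versions` dictionary is a name introduced here for the example.

```python
import sys

import numpy as np
import pandas as pd

# Collect the interpreter and library versions used in this session.
# `versions` is a hypothetical name introduced for this sketch.
versions = {
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "numpy": np.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

Printing these once at the top of a run gives later readers a reference point if results differ between environments.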

In [2]:
# 2. Connect to Google Drive for File Access
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Step #2** connects your Google Colab notebook to your Google Drive so that the notebook can access datasets, scripts, models, and output directories stored there. Mounting makes Drive appear as a directory under /content/drive in the Colab runtime, so you can read and write Drive files with ordinary file paths. This lets the workflow load input data, save processed outputs, and keep files across sessions.

**Why This Step Matters**
Mounting Google Drive matters for reproducibility and workflow continuity. Files written to the Colab runtime's local disk are lost when the runtime is recycled, so without the mount, inputs would have to be re-uploaded each session and results would not persist. For clinical research pipelines, where data integrity and traceability are critical, reading from and writing to one persistent location means every subsequent step operates on the same files.
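
The mount call only works inside Colab, since `google.colab` is not importable elsewhere. A minimal sketch of a guard that lets the same notebook run outside Colab is shown below; the helper name `mount_drive_if_colab` is an assumption made for illustration.

```python
def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return True if a mount was attempted."""
    try:
        # This import succeeds only inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


# Hypothetical usage: fall back to local paths when not in Colab.
in_colab = mount_drive_if_colab()
print("Running in Colab:", in_colab)
```

Outside Colab the function returns `False` without raising, so later cells can choose a local project root instead.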

In [3]:
# 3. Define Project File Paths and Output Structure
# -----------------------------
project_root = '/content/drive/MyDrive/colorectal_cancer_prediction'
data_path = os.path.join(project_root, "data", "processed", "colorectal_cancer_prediction_cleaned_imputed.csv")
output_path = os.path.join(project_root, "data", "preprocessed")
CLEANED_DATA_PATH = os.path.join(output_path, "colorectal_cancer_prediction_preprocessed.csv")

os.makedirs(output_path, exist_ok=True)

print("Project paths set.")
print("Data path:", data_path)
print("Output path:", output_path)
print("Cleaned data save path:", CLEANED_DATA_PATH)

Project paths set.
Data path: /content/drive/MyDrive/colorectal_cancer_prediction/data/processed/colorectal_cancer_prediction_cleaned_imputed.csv
Output path: /content/drive/MyDrive/colorectal_cancer_prediction/data/preprocessed
Cleaned data save path: /content/drive/MyDrive/colorectal_cancer_prediction/data/preprocessed/colorectal_cancer_prediction_preprocessed.csv


**Step #3** establishes all of the key file paths that the workflow will rely on, ensuring that the notebook knows exactly where to find the input dataset, where to store preprocessed outputs, and how to maintain a clean project structure. By defining a project_root and constructing paths for the processed data, preprocessed output directory, and final cleaned dataset, this step centralizes all file‑location logic in one place. It also creates the output directory if it doesn’t already exist, guaranteeing that subsequent steps can write files without errors.

**Why This Step Matters**
Defining file paths early prevents broken references and hard-coded paths scattered throughout the notebook. In clinical research pipelines, where reproducibility and traceability are essential, centralizing file paths means a collaborator whose Drive uses a different layout needs to edit only project_root. It also makes the project more maintainable: if directory structures change, only this step needs updating.
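
The same path logic can be wrapped in a small function so that the project root is the only value that changes between machines. This is a sketch using `pathlib` rather than the notebook's `os.path` calls; the function name `build_paths` and the dictionary keys are assumptions, and the demonstration uses a temporary directory in place of Google Drive.

```python
import tempfile
from pathlib import Path


def build_paths(project_root):
    """Derive the input and output locations from a single project root."""
    root = Path(project_root)
    paths = {
        "input_csv": root / "data" / "processed"
        / "colorectal_cancer_prediction_cleaned_imputed.csv",
        "output_dir": root / "data" / "preprocessed",
    }
    paths["output_csv"] = (
        paths["output_dir"] / "colorectal_cancer_prediction_preprocessed.csv"
    )
    # Create the output directory (and any missing parents) if needed.
    paths["output_dir"].mkdir(parents=True, exist_ok=True)
    return paths


# Demonstration against a temporary directory instead of Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    paths = build_paths(tmp)
    output_dir_exists = paths["output_dir"].is_dir()
    input_exists = paths["input_csv"].exists()

print("Output directory created:", output_dir_exists)
print("Input file present:", input_exists)
```

Checking `input_exists` before loading gives a clearer error than a failed `read_csv` call when the root is wrong.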

In [4]:
# 4. Load and Inspect the Input Dataset
# -----------------------------
df = pd.read_csv(data_path)

print("Dataset loaded successfully.")
print("Shape:", df.shape)
print("Columns:", list(df.columns))

Dataset loaded successfully.
Shape: (89945, 30)
Columns: ['Patient_ID', 'Age', 'Gender', 'Race', 'Region', 'Urban_or_Rural', 'Socioeconomic_Status', 'Family_History', 'Previous_Cancer_History', 'Stage_at_Diagnosis', 'Tumor_Aggressiveness', 'Colonoscopy_Access', 'Screening_Regularity', 'Diet_Type', 'BMI', 'Physical_Activity_Level', 'Smoking_Status', 'Alcohol_Consumption', 'Red_Meat_Consumption', 'Fiber_Consumption', 'Insurance_Coverage', 'Time_to_Diagnosis', 'Treatment_Access', 'Chemotherapy_Received', 'Radiotherapy_Received', 'Surgery_Received', 'Follow_Up_Adherence', 'Survival_Status', 'Recurrence', 'Time_to_Recurrence']


**Step #4** loads the colorectal cancer dataset into memory using pandas, creating a DataFrame that becomes the central object for all subsequent analysis and preprocessing. By reading the CSV file from the previously defined data_path, this step verifies that the dataset is accessible and correctly formatted. It then prints the dataset’s shape and column names, giving you an immediate snapshot of its size and structure—an essential first look before performing cleaning, feature engineering, or modeling.

**Why This Step Matters**
Loading the dataset is the gateway to the entire analytical workflow. If the dataset is missing, corrupted, or incorrectly formatted, every downstream step—from cleaning to modeling—would fail. By inspecting the shape and column names immediately, you catch structural issues early and ensure that the dataset matches the schema expected by the pipeline. In clinical research, where data integrity is non‑negotiable, this early validation step is essential for reproducibility and trustworthiness.
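
Printing the column list relies on a reader spotting a missing field by eye. A small programmatic check is one alternative; the sketch below uses a toy frame, and both `missing_columns` and the three-column `expected_columns` list are hypothetical stand-ins for whatever schema the pipeline actually requires.

```python
import pandas as pd


def missing_columns(df, expected):
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for the real dataset.
toy = pd.DataFrame({"Patient_ID": [1, 2], "Age": [54, 61]})

# Hypothetical subset of the schema, chosen for illustration only.
expected_columns = ["Patient_ID", "Age", "BMI"]

absent = missing_columns(toy, expected_columns)
print("Missing columns:", absent)  # -> Missing columns: ['BMI']
```

An empty list means the frame carries every expected column; a non-empty one can be turned into an early, explicit error.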

In [5]:
# 5. Standardize Column Names
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

**Step #5** performs a foundational round of basic data cleaning by standardizing all column names in the dataset. The code removes leading and trailing whitespace, converts all characters to lowercase, and replaces spaces with underscores. This creates a consistent, machine‑friendly naming convention that prevents errors when referencing columns later in the workflow. By normalizing column names early, the pipeline becomes more robust, readable, and easier to maintain.

**Why This Step Matters**
Consistent column naming is essential for avoiding subtle bugs and confusion later in the pipeline. Many modeling and preprocessing functions expect clean, uniform variable names, and inconsistent formatting can lead to errors that are difficult to trace. In clinical research workflows—where clarity, reproducibility, and auditability matter—standardizing column names ensures that collaborators, scripts, and automated tools all reference variables reliably.
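
To see what the three string operations do, the same chain can be applied to a toy set of headers. The column names below are made up to show whitespace, mixed case, and an embedded space.

```python
import pandas as pd

# Toy headers with leading/trailing whitespace, mixed case, and a space.
toy = pd.DataFrame(columns=[" Patient ID", "Age ", "Urban_or_Rural"])

# Same chain as the notebook cell: trim, lowercase, spaces to underscores.
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_")

normalized = list(toy.columns)
print(normalized)  # -> ['patient_id', 'age', 'urban_or_rural']
```

Because the order is strip first, a trailing space never turns into a trailing underscore.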

In [6]:
# 6. Remove Duplicate Records
df = df.drop_duplicates()

**Step #6** removes duplicate rows so that repeated entries do not bias the analysis. df.drop_duplicates() drops only rows that match an earlier row in every column; because it runs while patient_id is still present, two records with identical clinical values but different IDs are both kept. Eliminating exact duplicates prevents redundant observations from distorting summary statistics, inflating the sample size, or misleading downstream modeling.

**Why This Step Matters**
Duplicate records can artificially skew distributions, inflate counts, and mislead machine‑learning models—especially in clinical datasets where each row represents a patient or clinical event. Removing duplicates ensures that the dataset reflects true sample size and prevents biased results. In a clinical research pipeline, this is essential for maintaining scientific validity, reproducibility, and ethical data handling.
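
It can help to count duplicates before dropping them, and to see how the count depends on which columns are compared. The toy frame below is invented for illustration; the `subset` comparison is shown only as a contrast and is not what the notebook cell does.

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "age": [50, 61, 61, 61],
    "stage": ["I", "II", "II", "II"],
})

# Rows identical to an earlier row in every column (what drop_duplicates removes).
n_exact = int(toy.duplicated().sum())

# Rows identical to an earlier row when the identifier is ignored.
n_ignoring_id = int(toy.duplicated(subset=["age", "stage"]).sum())

deduplicated = toy.drop_duplicates()

print("Exact duplicates:", n_exact)                # -> Exact duplicates: 1
print("Duplicates ignoring id:", n_ignoring_id)    # -> Duplicates ignoring id: 2
print("Rows after drop_duplicates:", len(deduplicated))  # -> Rows after drop_duplicates: 3
```

Whether rows that differ only in their identifier should count as duplicates is a judgment about the data source, so the sketch reports both numbers rather than choosing one.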

In [7]:
# 7. Remove Identifier Columns for De‑Identification
id_cols = ["patient_id", "mrn", "record_id"]
# errors="ignore" skips any listed column that is not present
df = df.drop(columns=id_cols, errors="ignore")

**Step #7** removes identifying columns—such as patient IDs or medical record numbers—from the dataset to protect privacy and ensure that no personally identifiable information (PII) is used in downstream analysis. The code defines a list of potential identifier columns and then drops only those that actually appear in the DataFrame. This creates a safer, analysis‑ready dataset that focuses solely on clinical and behavioral variables relevant to colorectal cancer prediction.

**Why This Step Matters**
Removing identifier columns supports patient privacy, ethical data handling, and regulatory compliance, especially in clinical research settings. Identifiers carry no predictive signal and can cause leakage or overfitting if left in place. Dropping them reduces risk, but it does not by itself make the dataset de-identified: combinations of remaining fields such as age, race, and region can still narrow down individuals, so this step is one part of de-identification rather than the whole of it.

In [8]:
# 8. Standardize and Convert Date Columns
date_cols = [col for col in df.columns if "date" in col]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors="coerce")


**Step #8** identifies any columns whose name contains “date” and converts them into proper datetime objects using pandas, so that fields such as diagnosis, follow-up, or recurrence dates would be stored in a standardized, machine-readable format. With errors="coerce", entries that cannot be parsed become NaT instead of raising an error. None of the 30 columns listed in Step #4 contains “date” in its name, so for this dataset the loop finds no columns and changes nothing; it acts as a safeguard for inputs that do include date fields.

**Why This Step Matters**
Datetime consistency is essential for any clinical pipeline that relies on timing—such as time to diagnosis, time to recurrence, or follow‑up intervals. If date columns remain as raw strings, analyses can break, models can misinterpret values, and time‑based features cannot be computed reliably. Converting dates early ensures accuracy, prevents subtle bugs, and supports reproducible, trustworthy clinical research.
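
Because `errors="coerce"` silently turns unparseable entries into `NaT`, it is worth counting how many values were lost in the conversion. The series below is a made-up example with one valid date, one invalid string, and one value that was already missing.

```python
import pandas as pd

raw = pd.Series(["2021-03-01", "not a date", None])

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

# Total missing values after parsing.
n_unparsed = int(parsed.isna().sum())

# Values that were present in the raw data but could not be parsed.
n_coerced = int((raw.notna() & parsed.isna()).sum())

print("Missing after parsing:", n_unparsed)  # -> Missing after parsing: 2
print("Lost to coercion:", n_coerced)        # -> Lost to coercion: 1
```

A non-zero `n_coerced` points to formatting problems in the source rather than genuinely missing dates.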

In [9]:
# 9. Identify Numeric and Categorical Columns for Imputation

# Separate numeric and categorical
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

**Step #9** begins the missing‑value imputation process by separating the dataset into numeric and categorical columns. This distinction is essential because numeric and categorical variables require different imputation strategies—such as mean or median for numeric fields and mode or constant values for categorical ones. By programmatically identifying column types, the workflow ensures that each variable is handled appropriately and consistently, setting the stage for a clean, well‑structured imputation pipeline.

**Why This Step Matters**
Numeric and categorical variables behave differently, and applying the wrong imputation method can distort the dataset or introduce bias. For example, filling numeric values with a mode or categorical values with a mean would be inappropriate and misleading. By cleanly separating column types upfront, the workflow ensures that imputation is accurate, reproducible, and aligned with best practices—especially important in clinical research where data integrity directly affects model validity.
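
Listing `'int64'` and `'float64'` explicitly selects only those two widths. If a column arrives as `int32` or `float32`, it falls into neither group. The sketch below contrasts that with `include="number"` on a toy frame; the `int32` column is constructed deliberately to show the difference.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": np.array([50, 61], dtype="int32"),  # deliberately not int64
    "bmi": [24.5, 30.1],
    "gender": ["Male", "Female"],
})

# Explicit widths, as in the notebook cell.
narrow = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Generic selector covering every integer and float width.
broad = toy.select_dtypes(include="number").columns.tolist()

print("Explicit widths:", narrow)  # -> Explicit widths: ['bmi']
print("include='number':", broad)  # -> include='number': ['age', 'bmi']
```

For a CSV read with default settings, pandas normally produces 64-bit types, so the two selectors usually agree; the broader one is simply more tolerant of other inputs.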

In [10]:
# 10. Impute Missing Values Using Median and Mode
num_imputer = SimpleImputer(strategy="median")
df[numeric_cols] = num_imputer.fit_transform(df[numeric_cols])

cat_imputer = SimpleImputer(strategy="most_frequent")
df[categorical_cols] = cat_imputer.fit_transform(df[categorical_cols])

**Step #10** performs the actual imputation of missing values for both numeric and categorical variables using appropriate, statistically sound strategies. Numeric columns are imputed with the median, which is robust to outliers and preserves the central tendency of skewed clinical variables. Categorical columns are imputed with the most frequent value, ensuring that missing entries are replaced with the most common and contextually plausible category. This step transforms the dataset into a complete, analysis‑ready form without discarding valuable patient records.

**Why This Step Matters**
Missing data is inevitable in clinical datasets, and improper handling can bias results, reduce statistical power, or cause machine-learning models to fail. Median imputation keeps extreme values from distorting the fill value for numeric variables. Mode imputation keeps every categorical entry a valid category, though it increases the share of the most common one. Both approaches retain every patient record, at the cost of somewhat understating variability in the imputed columns.
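
The effect of the two strategies can be seen on a toy frame. The sketch below uses plain pandas to compute the same fill values that the median and most-frequent strategies would choose; it is not the notebook's `SimpleImputer` code, and the values and category labels are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "bmi": [22.0, 24.0, 26.0, 90.0, np.nan],  # 90.0 is an outlier
    "diet_type": ["Western", "Western", "Balanced", None, "Western"],
})

# The median ignores the outlier's magnitude; the mean does not.
bmi_median = toy["bmi"].median()
bmi_mean = toy["bmi"].mean()

# Most frequent category among the observed values.
diet_mode = toy["diet_type"].mode().iloc[0]

filled = toy.copy()
filled["bmi"] = filled["bmi"].fillna(bmi_median)
filled["diet_type"] = filled["diet_type"].fillna(diet_mode)

print("Median:", bmi_median)   # -> Median: 25.0
print("Mean:", bmi_mean)       # -> Mean: 40.5
print("Mode:", diet_mode)      # -> Mode: Western
print("Missing after fill:", int(filled.isna().sum().sum()))  # -> Missing after fill: 0
```

The single outlier pulls the mean to 40.5 while the median stays at 25.0, which is why the median is the safer fill value for skewed variables.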

In [11]:
# 11. Engineer and Transform Model‑Ready Features

# Example: BMI
if {"weight_kg", "height_cm"}.issubset(df.columns):
    df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100)**2

# Example: Age from DOB (approximate: whole years counted as 365 days, leap days ignored)
if "date_of_birth" in df.columns:
    df["age"] = (pd.Timestamp("today") - df["date_of_birth"]).dt.days // 365

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Standardize numeric columns
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])


**Step #11** performs feature engineering by creating new variables, transforming existing ones, and preparing the dataset for machine-learning algorithms. The BMI and age calculations run only when weight_kg and height_cm, or date_of_birth, are present; the dataset loaded in Step #4 has none of these columns and already provides age and bmi, so both branches are skipped here. The cell then one-hot encodes every column in categorical_cols, which includes outcome fields such as survival_status and recurrence, and standardizes the numeric columns to zero mean and unit variance.

**Why This Step Matters**
Feature engineering is one of the most influential steps in any predictive modeling pipeline. Well-constructed features can improve model performance, interpretability, and clinical relevance. One-hot encoding makes categorical variables usable by algorithms that require numeric input, and standardization prevents models from being dominated by variables with large numeric ranges. Two caveats apply before modeling: the scaler here is fitted on the full dataset, so a later train/test split would carry test-set statistics into training, and any column chosen as the prediction target should be set aside before encoding.
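
If an outcome column is stored as text, encoding all categorical columns at once turns the target into dummy features as well. A minimal sketch of holding the target out first is shown below. The choice of `survival_status` as the target is an assumption, since the notebook does not name one, and the toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [50, 61, 47],
    "gender": ["Male", "Female", "Male"],
    "survival_status": ["Survived", "Deceased", "Survived"],
})

# Assumption: survival_status is the prediction target.
target_col = "survival_status"

# Separate the target before any encoding happens.
y = toy[target_col]
X = toy.drop(columns=[target_col])

# Encode only the non-numeric feature columns.
feature_categoricals = X.select_dtypes(exclude="number").columns.tolist()
X_encoded = pd.get_dummies(X, columns=feature_categoricals, drop_first=True)

print(list(X_encoded.columns))  # -> ['age', 'gender_Male']
```

With `drop_first=True`, the first category in sorted order (`Female` here) becomes the reference level and gets no column of its own.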

In [12]:
# 12. Validate Data Quality and Final Dataset Structure
print(df.isnull().sum().sort_values(ascending=False).head())
print(df.describe())
print(df.shape)


age                   0
bmi                   0
time_to_recurrence    0
gender_Male           0
race_Black            0
dtype: int64
                age           bmi  time_to_recurrence
count  8.994500e+04  8.994500e+04        8.994500e+04
mean  -1.671586e-16  6.010127e-16        9.278253e-17
std    1.000006e+00  1.000006e+00        1.000006e+00
min   -1.701155e+00 -1.733538e+00       -1.710836e+00
25%   -8.588247e-01 -8.642018e-01       -8.421942e-01
50%   -1.649443e-02  5.134727e-03        2.644731e-02
75%    8.753847e-01  8.583724e-01        8.371794e-01
max    1.717715e+00  1.727709e+00        1.705821e+00
(89945, 46)


**Step #12** performs a set of essential data‑quality checks to confirm that the dataset is clean, complete, and structurally sound after all preprocessing steps. It examines the remaining missing values, reviews summary statistics for numeric variables, and verifies the final shape of the DataFrame. These checks provide a quick diagnostic snapshot of the dataset, ensuring that imputation, feature engineering, and scaling were applied correctly and that the dataset is ready for modeling.

**Why This Step Matters**
Quality checks are critical in clinical research pipelines because even small preprocessing errors can propagate into biased models or incorrect conclusions. Confirming that no missing values remain, that numeric variables are properly scaled, and that the dataset has the expected dimensions ensures that the pipeline is stable, reproducible, and trustworthy. This step acts as a safeguard before committing to modeling, preventing costly downstream errors.
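
The printed checks can also be written as assertions so that a failed check stops the run. The sketch below is one possible form; `check_scaled` and its tolerance are assumptions introduced here. It also shows why `describe()` reports a standard deviation of 1.000006 rather than exactly 1: standardization divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1), and the two differ by a factor of sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd


def check_scaled(df, numeric_cols, tol=1e-8):
    """Raise AssertionError if df has missing values or unscaled numeric columns."""
    assert df.isnull().sum().sum() == 0, "missing values remain"
    assert np.allclose(df[numeric_cols].mean(), 0.0, atol=tol), "mean is not 0"
    # Population standard deviation (ddof=0) should be 1 after standardization.
    assert np.allclose(df[numeric_cols].std(ddof=0), 1.0, atol=tol), "std is not 1"


# Toy column standardized the same way (population standard deviation).
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0, 70.0]})
toy["age"] = (toy["age"] - toy["age"].mean()) / toy["age"].std(ddof=0)
check_scaled(toy, ["age"])

# Sample standard deviation expected for n = 89945 rows.
n = 89945
expected_sample_std = np.sqrt(n / (n - 1))
print(f"{expected_sample_std:.6e}")  # -> 1.000006e+00
```

The printed value matches the `std` row in the output above, which confirms the scaling behaved as intended rather than indicating a small error.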

In [13]:
# 13. Export Final Cleaned Dataset
df.to_csv(CLEANED_DATA_PATH, index=False)
print("Saved cleaned dataset to:", CLEANED_DATA_PATH)

Saved cleaned dataset to: /content/drive/MyDrive/colorectal_cancer_prediction/data/preprocessed/colorectal_cancer_prediction_preprocessed.csv


**Step #13** finalizes the preprocessing pipeline by saving the fully cleaned and transformed dataset to a designated output path. Using df.to_csv() with index=False, the step writes the processed DataFrame to disk in a clean, reproducible format, ensuring that all prior cleaning, imputation, feature engineering, and validation steps are preserved. The printed confirmation message provides immediate feedback that the dataset has been successfully exported and stored in the correct project directory.

**Why This Step Matters**
Saving the cleaned dataset supports reproducibility, collaboration, and workflow efficiency. It creates a single artifact that can be used for modeling without rerunning the entire preprocessing pipeline. In clinical research, where transparency and traceability are critical, exporting the dataset means downstream analyses can share one documented data snapshot, provided the file is not overwritten by a later run with different settings.
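
A quick way to confirm the export is to read the file back and compare its structure with the in-memory frame. The sketch below does this with a toy frame and a temporary directory; the file name `preprocessed.csv` is a placeholder, not the notebook's output path.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one scaled numeric column and one boolean dummy column.
toy = pd.DataFrame({"age": [0.5, -0.5], "gender_Male": [True, False]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessed.csv")  # placeholder file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)

print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```

Writing with `index=False` is what keeps the column count unchanged on reload; otherwise the index would come back as an extra unnamed column.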