In [None]:
#1 Initialize Google Drive Access
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Step#1** initializes access to your Google Drive within a Google Colab environment. By importing drive from google.colab and running drive.mount('/content/drive'), you mount your Drive storage into the notebook session's file system. This allows you to read, write, and organize files such as datasets, scripts, and outputs directly from Drive as if they were part of the Colab file system. The output message "Mounted at /content/drive" confirms that the mount was successful.
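A quick check after mounting can catch a failed or stale mount before later cells try to read from Drive. The sketch below uses only the standard library. The helper name drive_is_mounted is hypothetical, and the test it applies (a MyDrive folder exists under the mount point) is an assumption based on the paths used later in this notebook.

```python
from pathlib import Path


def drive_is_mounted(mount_point: str = "/content/drive") -> bool:
    """Return True if a 'MyDrive' folder exists under the mount point (hypothetical helper)."""
    return (Path(mount_point) / "MyDrive").is_dir()


if not drive_is_mounted():
    print("Drive does not appear to be mounted; re-run drive.mount('/content/drive').")
```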

In [None]:
#2 Set Up Project File Paths

from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path('/content/drive/MyDrive/projects/prostate_cancer')
DATA_DIR = PROJECT_ROOT / 'data'
OUTPUTS_DIR = PROJECT_ROOT / 'outputs'
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)

RAW_PATH = DATA_DIR / 'prostate_cancer_prediction_cleaned_imputed.csv'
PREP_PATH = DATA_DIR / 'prostate_cancer_prediction_preprocessed.csv'


**Step#2** establishes the core directory structure for your project by defining consistent, reusable file paths that the rest of your workflow will rely on. Using pathlib.Path, you set a PROJECT_ROOT that points to your prostate cancer project folder in Google Drive, then derive subdirectories for data and outputs. The code ensures the outputs directory exists, preventing errors later when saving figures, tables, or processed datasets. Finally, you define explicit paths for the raw and preprocessed CSV files, giving your pipeline stable references for loading and saving data throughout the analysis.
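The path-building pattern can be tried safely outside Drive. In this minimal sketch a temporary folder stands in for the project root, and the file name example.csv is purely illustrative. It shows that the / operator only builds a path, while mkdir(parents=True, exist_ok=True) is what actually creates folders and can be re-run without error.

```python
import tempfile
from pathlib import Path

# A temporary folder stands in for the Drive project root (assumption for illustration).
project_root = Path(tempfile.mkdtemp()) / "prostate_cancer"
outputs_dir = project_root / "outputs"

# parents=True creates missing ancestors; exist_ok=True makes repeat runs harmless.
outputs_dir.mkdir(parents=True, exist_ok=True)
outputs_dir.mkdir(parents=True, exist_ok=True)  # second call does not raise

data_path = project_root / "data" / "example.csv"  # hypothetical file name
print(outputs_dir.is_dir())  # the outputs folder now exists
print(data_path.exists())    # the data folder and file were never created
```

Only the outputs folder is created in Step#2, so the data folder must already exist in Drive before anything is read from it.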

In [None]:
#3 Load Raw Dataset

RAW_PATH = '/content/drive/MyDrive/prostate_cancer_prediction_cleaned_imputed.csv'
df = pd.read_csv(RAW_PATH)
df.head()

,Patient_ID,Age,Family_History,Race_African_Ancestry,PSA_Level,DRE_Result,Biopsy_Result,Difficulty_Urinating,Weak_Urine_Flow,Blood_in_Urine,...,Alcohol_Consumption,Hypertension,Diabetes,Cholesterol_Level,Screening_Age,Follow_Up_Required,Prostate_Volume,Genetic_Risk_Factors,Previous_Cancer_History,Early_Detection
0,1,78,No,Yes,5.07,Normal,Benign,No,No,No,...,Moderate,No,No,Normal,45,No,46.0,No,No,Yes
1,2,68,No,Yes,10.24,Normal,Benign,Yes,No,No,...,Low,No,No,High,65,No,78.2,No,No,Yes
2,3,54,No,No,13.79,Normal,Benign,No,No,No,...,Low,No,No,Normal,61,No,21.1,No,No,Yes
3,4,82,No,No,8.03,Abnormal,Benign,No,No,No,...,Low,No,No,Normal,47,Yes,79.9,No,Yes,Yes
4,5,47,Yes,No,1.89,Normal,Malignant,Yes,Yes,No,...,Moderate,Yes,No,Normal,72,No,32.0,No,No,Yes


**Step#3** loads the prostate cancer dataset into your analysis environment by reading the cleaned and imputed CSV file directly from Google Drive. Note that this cell reassigns RAW_PATH, replacing the project-folder path defined in Step#2 with a plain string that points to the file's actual location at the top level of MyDrive. The code then uses pd.read_csv() to import the data into a pandas DataFrame named df. Displaying df.head() previews the first five rows, so you can verify that the file loaded correctly and that the column names and values look as expected before moving on to preprocessing or modeling.
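Beyond eyeballing df.head(), a programmatic check can confirm that the columns later steps depend on are present. The following sketch reads a tiny inline CSV in place of the real file. The rows and the required set are illustrative assumptions, not a statement of the full schema.

```python
import io

import pandas as pd

# Tiny inline CSV standing in for the real file (values are illustrative only).
csv_text = (
    "Patient_ID,Age,PSA_Level,Survival_5_Years\n"
    "1,78,5.07,Yes\n"
    "2,68,10.24,No\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# Columns that downstream cells rely on (assumed subset for this sketch).
required = {"Patient_ID", "Age", "PSA_Level", "Survival_5_Years"}
missing = required - set(sample.columns)
if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}")

print(sample.shape)
```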

In [None]:
# 4 Inspect Categorical Feature Values

for col in df.select_dtypes(include=['object']).columns:
    print(f"Unique values in '{col}':")
    print(df[col].unique(), "\n")


Unique values in 'Family_History':
['No' 'Yes'] 

Unique values in 'Race_African_Ancestry':
['Yes' 'No'] 

Unique values in 'DRE_Result':
['Normal' 'Abnormal'] 

Unique values in 'Biopsy_Result':
['Benign' 'Malignant'] 

Unique values in 'Difficulty_Urinating':
['No' 'Yes'] 

Unique values in 'Weak_Urine_Flow':
['No' 'Yes'] 

Unique values in 'Blood_in_Urine':
['No' 'Yes'] 

Unique values in 'Pelvic_Pain':
['No' 'Yes'] 

Unique values in 'Back_Pain':
['No' 'Yes'] 

Unique values in 'Erectile_Dysfunction':
['No' 'Yes'] 

Unique values in 'Cancer_Stage':
['Localized' 'Metastatic' 'Advanced'] 

Unique values in 'Treatment_Recommended':
['Active Surveillance' 'Radiation' 'Immunotherapy' 'Chemotherapy'
 'Surgery' 'Hormone Therapy'] 

Unique values in 'Survival_5_Years':
['Yes' 'No'] 

Unique values in 'Exercise_Regularly':
['No' 'Yes'] 

Unique values in 'Healthy_Diet':
['Yes' 'No'] 

Unique values in 'Smoking_History':
['Yes' 'No'] 

Unique values in 'Alcohol_Consumption':
['Moderate' 'Low

**Step#4** performs an exploratory scan of all categorical variables in the dataset by selecting columns with the object data type and printing their unique values. This quick inspection shows which categories each feature can take, such as “Yes/No” patterns, multi‑class labels, or clinically meaningful groupings like cancer stage or treatment type. Note that unique() lists the distinct values but not how often each occurs. By reviewing these values early, you can identify inconsistencies (for example stray spellings or mixed capitalization), plan encoding strategies, and ensure that downstream preprocessing steps are informed by the categories actually present in the data.
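If the frequency of each category matters (for instance to spot rare classes before encoding), value_counts() is a natural complement to unique(). A minimal sketch on a toy DataFrame follows. The rows are made up for illustration and the column list is given explicitly rather than detected by dtype.

```python
import pandas as pd

# Toy data; the category labels mirror the notebook, the rows are invented.
toy = pd.DataFrame({
    "Cancer_Stage": ["Localized", "Metastatic", "Localized", "Advanced"],
    "Family_History": ["No", "Yes", "No", "No"],
})

for col in ["Cancer_Stage", "Family_History"]:
    # dropna=False also reports missing values as their own row, if any exist.
    counts = toy[col].value_counts(dropna=False)
    print(f"Counts for '{col}':")
    print(counts, "\n")
```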

In [None]:
#5 Generate Numeric Summary Statistics

df.describe()


,Patient_ID,Age,PSA_Level,BMI,Screening_Age,Prostate_Volume
count,27945.0,27945.0,27945.0,27945.0,27945.0,27945.0
mean,13973.0,64.459939,7.751599,26.511605,56.902308,47.75577
std,8067.170973,14.404755,4.175012,4.888293,10.118064,18.704286
min,1.0,40.0,0.5,18.0,40.0,15.0
25%,6987.0,52.0,4.13,22.3,48.0,31.7
50%,13973.0,64.0,7.75,26.5,57.0,47.7
75%,20959.0,77.0,11.32,30.7,66.0,63.9
max,27945.0,89.0,15.0,35.0,74.0,80.0


**Step#5** generates a numerical summary of the dataset using df.describe(), which computes key descriptive statistics for all numeric columns. This includes the count of observations, measures of central tendency (mean and median), measures of spread (standard deviation and quartiles), and the minimum and maximum values. Reviewing these statistics helps you quickly understand the distribution, scale, and variability of important clinical variables such as age, PSA level, BMI, screening age, and prostate volume. It also provides an early check for anomalies, outliers, or unexpected ranges before moving into deeper preprocessing or modeling.
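The quartiles reported by describe() can be turned into a simple outlier screen. The sketch below applies the common 1.5 × IQR rule to a toy PSA series. The rule is one convention among several, and the value 60.0 is a made-up outlier added for illustration.

```python
import pandas as pd

# Toy PSA values; 60.0 is an invented outlier for demonstration.
psa = pd.Series([0.5, 4.13, 7.75, 11.32, 15.0, 60.0], name="PSA_Level")

q1, q3 = psa.quantile(0.25), psa.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not automatically removed.
outliers = psa[(psa < lower) | (psa > upper)]
print(outliers.tolist())  # [60.0]
```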

In [None]:
#6 Audit Missing Values

print("Missing values per column:")
df.isnull().sum()


Missing values per column:


Patient_ID,0
Age,0
Family_History,0
Race_African_Ancestry,0
PSA_Level,0
DRE_Result,0
Biopsy_Result,0
Difficulty_Urinating,0
Weak_Urine_Flow,0
Blood_in_Urine,0


**Step#6** performs a systematic audit of missing values across the dataset using df.isnull().sum(). This command counts how many null entries appear in each column, allowing you to quickly assess data completeness and identify any variables that may require imputation, removal, or special handling. In this case, every column listed in the output has zero missing values, which is what you would expect from a file that has already been cleaned and imputed, so no additional handling of missingness is needed before preprocessing and modeling.
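With many columns, a long list of zeros is easy to misread. One option is a compact report that keeps only the columns that actually have gaps. The sketch below runs on a toy DataFrame with one deliberately missing value. The names report, n_missing and pct_missing are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data with one deliberately missing Age value.
toy = pd.DataFrame({
    "Age": [78, 68, np.nan, 82],
    "PSA_Level": [5.07, 10.24, 13.79, 8.03],
})

missing = toy.isnull().sum()
report = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(2),
})

# Keep only columns with gaps; an empty report means the data is complete.
report = report[report["n_missing"] > 0]
print(report)
```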

In [None]:
#7 Inspect Dataset Structure and Variable Types
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27945 entries, 0 to 27944
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Patient_ID               27945 non-null  int64  
 1   Age                      27945 non-null  int64  
 2   Family_History           27945 non-null  object 
 3   Race_African_Ancestry    27945 non-null  object 
 4   PSA_Level                27945 non-null  float64
 5   DRE_Result               27945 non-null  object 
 6   Biopsy_Result            27945 non-null  object 
 7   Difficulty_Urinating     27945 non-null  object 
 8   Weak_Urine_Flow          27945 non-null  object 
 9   Blood_in_Urine           27945 non-null  object 
 10  Pelvic_Pain              27945 non-null  object 
 11  Back_Pain                27945 non-null  object 
 12  Erectile_Dysfunction     27945 non-null  object 
 13  Cancer_Stage             27945 non-null  object 
 14  Treatment_Recommended 

**Step#7** summarizes the dataset’s overall structure by displaying every column, its data type, and the count of non‑missing values, giving a complete snapshot of how the raw data is organized before any preprocessing begins. The output shows 30 columns with a mix of integers, floats, and many object‑typed fields, which signals that several predictors will later need type correction or categorical handling. This structural overview is essential because it reveals which variables are numeric, which are stored as strings, and whether any columns contain unexpected types or missingness patterns that could affect downstream survival modeling.

In [None]:
#8 Define Survival Outcomes and Select Feature Set

# The raw data has no 'time' or 'event' columns, so both are derived here.
# Assumption: 'Survival_5_Years' supplies the event indicator (1 = 'Yes', 0 = 'No'),
# and every record is assigned the same 5-year follow-up time.
df['event'] = df['Survival_5_Years'].map({'Yes': 1, 'No': 0})
df['time'] = 5  # fixed 5-year follow-up for all records

time_col = 'time'
event_col = 'event'

excluded_cols = ['Patient_ID', 'Survival_5_Years', time_col, event_col] # Exclude original 'Survival_5_Years'
feature_cols = [c for c in df.columns if c not in excluded_cols]

df = df[feature_cols + [time_col, event_col]].copy()

**Step#8** creates the survival analysis targets and isolates the modeling feature set. The code maps the “Yes/No” labels in Survival_5_Years to 1/0 to form the event column, assigns a fixed follow‑up time of 5 years to every record, and excludes the patient identifier and the original survival column from the predictors. It then rebuilds the DataFrame so that all modeling features appear first, followed by the survival outcome variables. Two caveats follow from these assumptions: with this mapping, event = 1 means the patient survived five years, whereas survival models conventionally use 1 to mark that the event of interest (such as death) occurred; and because time is identical for every record, the time column carries no information that distinguishes patients. Keep both in mind when interpreting downstream models.
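The direction of the event coding is worth making explicit, because survival libraries generally read 1 as "the event occurred". The sketch below puts the notebook's mapping next to an alternative in which the event of interest is assumed to be death within five years. Which one is appropriate depends on the intended analysis. The column names event_as_survival and event_as_death are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Survival_5_Years": ["Yes", "No", "Yes"]})

# Coding used in Step 8: 1 = survived five years.
toy["event_as_survival"] = toy["Survival_5_Years"].map({"Yes": 1, "No": 0})

# Alternative coding (assumption: the event of interest is death within five years):
# 1 = event occurred, 0 = censored at the end of follow-up.
toy["event_as_death"] = toy["Survival_5_Years"].map({"Yes": 0, "No": 1})

# Labels missing from the mapping would become NaN, so verify before modeling.
assert toy["event_as_death"].notna().all()
print(toy)
```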

In [None]:
#9 Standardize Dataset Structure and Variable Types

# Convert survival columns
df[time_col] = pd.to_numeric(df[time_col], errors='coerce')
df[event_col] = pd.to_numeric(df[event_col], errors='coerce').clip(0, 1)

# Identify categorical columns
categorical_cols = df[feature_cols].select_dtypes(include=['object']).columns.tolist()

# Convert to category dtype
for c in categorical_cols:
    df[c] = df[c].astype('category')


**Step#9** standardizes the dataset’s structure by converting the survival outcome columns to numeric form and storing all categorical predictors with an explicit categorical data type. pd.to_numeric(..., errors='coerce') converts the time and event columns to numbers and turns any unparseable value into NaN, and clip(0, 1) caps the event column to the range 0 to 1 (it does not remove NaN values or round non‑integer ones). The code then identifies all object‑typed feature columns and converts them to pandas’ category dtype, which typically reduces memory use for low‑cardinality columns, makes the set of allowed values explicit, and prepares these variables for later encoding steps.
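The combined effect of errors='coerce' and clip(0, 1) is easiest to see on a small example. In this sketch the input strings are invented: an out-of-range value is capped to 1, while an unparseable value becomes NaN and survives the clip, so a separate check is still needed.

```python
import pandas as pd

# Invented raw values: two valid codes, one out-of-range code, one unparseable label.
raw = pd.Series(["1", "0", "2", "unknown"], dtype=object)

event = pd.to_numeric(raw, errors="coerce").clip(0, 1)

print(event.dropna().tolist())  # [1.0, 0.0, 1.0]  ("2" was capped to 1)
print(int(event.isna().sum()))  # 1  ("unknown" became NaN)

# A stricter validation: every value must be exactly 0 or 1.
is_valid = event.isin([0, 1])
print(bool(is_valid.all()))     # False, because of the NaN
```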

In [None]:
#10 Encode and Scale Feature Matrix

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = df[feature_cols].select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = df[feature_cols].select_dtypes(exclude=['object', 'category']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
    ],
    remainder='drop'
)

X = preprocessor.fit_transform(df[feature_cols])

# Build transformed feature names
ohe = preprocessor.named_transformers_['cat']
cat_feature_names = list(ohe.get_feature_names_out(categorical_cols))
transformed_feature_names = numeric_cols + cat_feature_names


**Step#10** applies a unified preprocessing step that standardizes numeric variables and one‑hot encodes categorical variables, producing a model‑ready feature matrix. The code first identifies which columns are numeric and which are categorical, then uses a ColumnTransformer to apply StandardScaler to numeric columns and OneHotEncoder to categorical ones. After fitting and transforming the data, it reconstructs the full list of transformed feature names by combining the original numeric column names with the expanded one‑hot‑encoded category names, in the same order the ColumnTransformer emits them. Because the transformer is fit on the full dataset here, this step on its own does not guard against data leakage: if you later split the data into training and test sets, fit the preprocessor on the training portion only and reuse it to transform the test portion.
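If the data is later split for evaluation, fitting the preprocessor on the training rows only keeps test-set statistics out of the scaler and encoder. The sketch below shows that pattern on a toy DataFrame. The rows, the split size and the variable names are assumptions for illustration, not part of the notebook's workflow.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "Age": [78, 68, 54, 82, 47, 60],
    "DRE_Result": ["Normal", "Normal", "Normal", "Abnormal", "Normal", "Abnormal"],
})

train, test = train_test_split(toy, test_size=2, random_state=0)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["Age"]),
    # handle_unknown="ignore" encodes categories unseen in training as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["DRE_Result"]),
])

X_train = pre.fit_transform(train)  # means, scales and categories learned from train only
X_test = pre.transform(test)        # the same fitted parameters are reused, nothing is refit
print(X_train.shape, X_test.shape)
```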

In [None]:
# 11 Assemble and Save Final Preprocessed Dataset

# Rebuild a DataFrame from the transformed matrix, reusing df's row index
# so the outcome columns align row by row.
df_proc = pd.DataFrame(X, columns=transformed_feature_names, index=df.index)
df_proc[time_col] = df[time_col]
df_proc[event_col] = df[event_col]

# Ensure the directory exists before saving (PREP_PATH is already a Path from Step 2)
PREP_PATH.parent.mkdir(parents=True, exist_ok=True)
df_proc.to_csv(PREP_PATH, index=False)
print("Saved preprocessed dataset to:", PREP_PATH)

Saved preprocessed dataset to: /content/drive/MyDrive/projects/prostate_cancer/data/prostate_cancer_prediction_preprocessed.csv


**Step#11** takes the fully transformed feature matrix and turns it into a complete, analysis‑ready dataset by rebuilding it into a DataFrame, reattaching the survival time and event columns, and saving the final preprocessed file to disk. The code first constructs a new DataFrame using the transformed features and their corresponding names, then appends the original outcome variables so the dataset is ready for modeling. It ensures the output directory exists, writes the dataset to the specified path, and prints a confirmation message. This step marks the transition from preprocessing to modeling by producing a clean, reproducible artifact that downstream scripts can load directly.
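A round-trip check is a cheap way to confirm that the saved file can be reloaded with the same shape and column order. The sketch below writes a toy DataFrame to a temporary folder in place of Drive. The file name and values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the preprocessed dataset (values are illustrative).
saved = pd.DataFrame({"Age": [0.5, -0.5], "time": [5, 5], "event": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessed.csv"  # hypothetical file name
    saved.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# index=False on write means no extra index column appears on read.
print(reloaded.shape == saved.shape)
print(list(reloaded.columns) == list(saved.columns))
```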