**Steps 1-4.** Mount Google Drive, load the dataset, and inspect its structure

In [None]:
#1 Initialize Google Drive Integration
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Step#1** establishes access to your Google Drive within the Colab environment. The code imports the google.colab.drive module and mounts your Drive at the directory /content/drive, which allows Colab to read and write files just as if they were stored locally. Once mounted, any datasets, scripts, or outputs saved in your Drive become available for the rest of your workflow, ensuring smooth integration between cloud storage and your analysis pipeline.
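Because every later step depends on the mount succeeding, a quick existence check on the file path can catch a wrong folder or a misspelled file name before `read_csv` fails. The following is a minimal sketch; the helper name `require_file` is hypothetical and not part of the notebook.

```python
import os

def require_file(path):
    """Return the path unchanged if it points to an existing file, else raise."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected a file at {path!r}; is Drive mounted?")
    return path

# Usage in Colab, after drive.mount(...):
# csv_path = require_file('/content/drive/MyDrive/colorectal_cancer_prediction.csv')
```

Failing early with a clear message is usually easier to debug than a stack trace from deep inside the CSV parser.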

In [None]:
#2 Load pandas Library for Data Handling
import pandas as pd

**Step#2** loads the pandas library, which is one of the core tools for data manipulation and analysis in Python. Importing it at the beginning of your workflow ensures that you can create DataFrames, read files, clean datasets, and perform transformations throughout the rest of your pipeline. The alias pd is a widely accepted convention that keeps your code concise and readable, especially when calling pandas functions repeatedly.

In [None]:
#3 Load Primary Dataset into DataFrame
df = pd.read_csv('/content/drive/MyDrive/colorectal_cancer_prediction.csv')

**Step#3** brings your dataset into the workflow by reading a CSV file from your Google Drive and loading it into a pandas DataFrame. This action transforms the raw file into a structured, table‑like object (df) that you can filter, clean, explore, and analyze throughout the rest of your pipeline. By specifying the full path to the file, the notebook knows exactly where to find your data, ensuring a smooth transition from storage to analysis.
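One optional safeguard right after loading is to confirm that the columns the later steps rely on are actually present. The sketch below uses a tiny in-memory CSV with made-up rows as a stand-in for the real file; the `expected` set is an assumption to adjust for your own data.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the CSV (hypothetical rows, not the real data)
csv_text = "Patient_ID,Age,Gender\n1,71,Male\n2,34,Female\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Columns the later steps rely on; adjust to your own file
expected = {"Patient_ID", "Age", "Gender"}
missing_cols = expected - set(sample.columns)
if missing_cols:
    raise ValueError(f"CSV is missing expected columns: {sorted(missing_cols)}")
```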

In [None]:
#4 Explore Dataset Structure and Metadata

print(df.shape)      # dimensions
print(df.head())     # preview first rows
df.info()            # column types + non-null counts (prints directly; wrapping it in print() adds a stray "None")

(89945, 30)
   Patient_ID  Age  Gender   Race         Region Urban_or_Rural  \
0           1   71    Male  Other         Europe          Urban   
1           2   34  Female  Black  North America          Urban   
2           3   80  Female  White  North America          Urban   
3           4   40    Male  Black  North America          Rural   
4           5   43  Female  White         Europe          Urban   

  Socioeconomic_Status Family_History Previous_Cancer_History  \
0               Middle            Yes                      No   
1               Middle             No                      No   
2               Middle             No                      No   
3                  Low             No                      No   
4                 High            Yes                      No   

  Stage_at_Diagnosis  ... Insurance_Coverage Time_to_Diagnosis  \
0                III  ...                Yes           Delayed   
1                  I  ...                 No            Timely

**Steps 1-4.** Goal: confirm that each row represents one patient, that the key clinical variables (e.g., stage at diagnosis, treatment, outcome) are present, and that the data types look right.

**Step#4** focuses on inspecting the structure and contents of your dataset to ensure it loaded correctly and to understand what you’re working with before any cleaning or modeling begins. By displaying the first few rows and reviewing the dataset’s dimensions, column names, and data types, you gain an immediate sense of variable formats, potential missing values, and the overall shape of the data. This early exploration helps you identify issues such as inconsistent categories, incorrect data types, or unexpected values, setting the stage for a clean and reliable preprocessing workflow.
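To spot the inconsistent categories mentioned above, it can help to list the distinct labels in each non-numeric column. This is a sketch on a hypothetical toy frame in which the inconsistent labels are deliberate; on the real data you would loop over `df` instead of `toy`.

```python
import pandas as pd

# Hypothetical toy frame; the inconsistent labels are deliberate
toy = pd.DataFrame({
    "Gender": ["Male", "male ", "Female", "Female"],
    "Stage_at_Diagnosis": ["I", "III", "II", "I"],
})

# Show how often each label occurs in every non-numeric column
for col in toy.select_dtypes(exclude=["number"]).columns:
    print(toy[col].value_counts(dropna=False))
    print()
```

Here `Gender` reports three distinct labels instead of two, because `"male "` differs from `"Male"` in case and trailing whitespace.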

In [None]:
#5 Assess Missing Data Across Variables
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts)

Patient_ID                 0
Age                        0
Gender                     0
Race                       0
Region                     0
Urban_or_Rural             0
Socioeconomic_Status       0
Family_History             0
Previous_Cancer_History    0
Stage_at_Diagnosis         0
Tumor_Aggressiveness       0
Colonoscopy_Access         0
Screening_Regularity       0
Diet_Type                  0
BMI                        0
Physical_Activity_Level    0
Smoking_Status             0
Alcohol_Consumption        0
Red_Meat_Consumption       0
Fiber_Consumption          0
Insurance_Coverage         0
Time_to_Diagnosis          0
Treatment_Access           0
Chemotherapy_Received      0
Radiotherapy_Received      0
Surgery_Received           0
Follow_Up_Adherence        0
Survival_Status            0
Recurrence                 0
Time_to_Recurrence         0
dtype: int64


**Step#5** evaluates the completeness of your dataset by counting the missing values in each column. The code flags null entries, sums them per variable, and sorts the result in descending order so that the columns with the most missing values appear first. This quick check helps you confirm data quality before moving into preprocessing or modeling. In this case, every column shows a count of zero, so there are no null entries to address. Note that `isnull()` only detects true nulls (NaN/None); placeholder strings such as "Unknown" or "N/A" would not be counted.

**Step#5**. Quantify and review missing values

What to look for:

Clinically critical variables: e.g., tumor stage, grade, biomarkers, outcome (survival/recurrence).

High missingness: candidates for dropping or special handling (e.g., >40%).
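Raw counts are harder to compare against a percentage rule such as ">40%", so it can be convenient to express missingness as a fraction of rows. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with some gaps
toy = pd.DataFrame({
    "Age": [71, 34, None, 40],
    "Stage_at_Diagnosis": ["III", None, None, "I"],
})

# isnull().mean() gives the fraction of missing entries per column
missing_frac = toy.isnull().mean().sort_values(ascending=False)
print((missing_frac * 100).round(1))
```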

In [None]:
#6 Filter Out Low‑Completeness Variables
threshold = 0.4
cols_to_drop = missing_counts[missing_counts > threshold * len(df)].index
print("Dropping columns:", list(cols_to_drop))

df = df.drop(columns=cols_to_drop)


Dropping columns: []


**Step#6** focuses on evaluating whether any columns in the dataset should be removed due to excessive missing data. By defining a threshold of 40%, the code identifies columns where more than 40% of their values are missing and prepares them for removal. This protects the integrity of your analysis by ensuring that variables with too much incomplete information don’t distort downstream modeling or require heavy imputation. In this case, the output shows an empty list, meaning no columns exceeded the threshold and the dataset remains fully intact.
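Since the real dataset has nothing to drop, the effect of the threshold is easier to see on a small example. The toy frame below is hypothetical; the `Biomarker` column is deliberately 60% missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [71, 34, 80, 40, 43],
    "Biomarker": [None, None, None, "Positive", "Negative"],  # 60% missing
    "BMI": [27.1, None, 31.4, 22.8, 25.0],                    # 20% missing
})

threshold = 0.4
missing_counts = toy.isnull().sum()
# Strict ">" means a column that is exactly 40% missing is kept
cols_to_drop = missing_counts[missing_counts > threshold * len(toy)].index
toy = toy.drop(columns=cols_to_drop)
print(list(toy.columns))
```

Only `Biomarker` exceeds the cutoff (3 missing out of 5 rows, against a limit of 2), so `Age` and `BMI` remain.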

In [None]:
#7 Standardize and Encode Categorical Variables

# Example: encode a binary biomarker column as 1/0
if 'Biomarker_Status' in df.columns:
    df['Biomarker_Status'] = (
        df['Biomarker_Status']
        .str.strip()
        .str.upper()
        .map({'POSITIVE': 1, 'NEGATIVE': 0})
    )

# Example: standardize a sex column
if 'Sex' in df.columns:
    df['Sex'] = df['Sex'].str.strip().str.upper()


**Step#7** cleans and standardizes categorical variables so they are consistent and ready for modeling. The code first checks whether a column exists, such as a binary biomarker status or a sex column, and then makes its values uniform. For the biomarker example, the text values are trimmed, converted to uppercase, and mapped to numeric indicators (1 for positive, 0 for negative), which is essential for machine‑learning algorithms that require numeric inputs; any label not covered by the mapping becomes NaN. For the sex column, whitespace is removed and values are standardized to uppercase, which prevents subtle formatting differences from creating spurious categories later in the pipeline. Neither `Biomarker_Status` nor `Sex` appears in this dataset (the corresponding column here is `Gender`), so both checks are skipped and the cell leaves `df` unchanged.

**Step#7**. Standardize and encode key clinical variables
**Goal:** Make clinically important variables consistent and model ready.
Adapt the code above to your actual column names; `Biomarker_Status` and `Sex` are only examples.
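Adapted to column names that do exist in this dataset, the same pattern might look like the sketch below. The toy values are hypothetical, and encoding `Family_History` as 1/0 is an assumption for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical toy values with inconsistent case and whitespace
toy = pd.DataFrame({
    "Gender": [" male", "Female ", "MALE"],
    "Family_History": ["Yes", "no", "YES "],
})

# Trim whitespace and unify case so 'male' and 'MALE' become one category
toy["Gender"] = toy["Gender"].str.strip().str.upper()

# Map a Yes/No column to 1/0; labels outside the mapping become NaN
toy["Family_History"] = (
    toy["Family_History"].str.strip().str.upper().map({"YES": 1, "NO": 0})
)
print(toy)
```

After any `map`, it is worth re-running the missing-value count, because unmapped labels silently turn into NaN.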

In [None]:
#8 Identify Numeric and Categorical Feature Groups

numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(exclude=['number']).columns

print("Numeric columns:", list(numeric_cols))
print("Categorical columns:", list(categorical_cols))


Numeric columns: ['Patient_ID', 'Age', 'BMI', 'Time_to_Recurrence']
Categorical columns: ['Gender', 'Race', 'Region', 'Urban_or_Rural', 'Socioeconomic_Status', 'Family_History', 'Previous_Cancer_History', 'Stage_at_Diagnosis', 'Tumor_Aggressiveness', 'Colonoscopy_Access', 'Screening_Regularity', 'Diet_Type', 'Physical_Activity_Level', 'Smoking_Status', 'Alcohol_Consumption', 'Red_Meat_Consumption', 'Fiber_Consumption', 'Insurance_Coverage', 'Time_to_Diagnosis', 'Treatment_Access', 'Chemotherapy_Received', 'Radiotherapy_Received', 'Surgery_Received', 'Follow_Up_Adherence', 'Survival_Status', 'Recurrence']


**Step#8** organizes your dataset by separating numeric and categorical columns, which prepares the data for targeted preprocessing and modeling. By using select_dtypes, the code automatically identifies which variables contain numerical values and which contain non‑numeric, label‑based information. This distinction matters because numeric features often require scaling or normalization, while categorical features typically need encoding or grouping. Note that `Patient_ID` lands in the numeric group because of its dtype even though it is an identifier rather than a measurement, so it should be left out of scaling and modeling. Defining these two groups early keeps your pipeline clean, modular, and ready for the next transformation steps.

**Step#8** Reason: Numeric and categorical features usually need different imputation and encoding strategies.
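Because `select_dtypes` groups columns purely by dtype, identifiers such as `Patient_ID` end up among the numeric columns. One way to separate them is sketched below on a hypothetical toy frame; the name `feature_numeric_cols` is an assumption, not a variable from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Age": [71, 34, 80],
    "BMI": [27.1, 22.8, 31.4],
    "Gender": ["Male", "Female", "Female"],
})

numeric_cols = toy.select_dtypes(include=["number"]).columns
# Patient_ID is numeric by dtype but is a label, not a measurement
feature_numeric_cols = numeric_cols.drop("Patient_ID")
print(list(feature_numeric_cols))
```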

In [None]:
#9 Fill numeric columns with their mean (or median if preferred)
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())


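**Step#9**. Impute missing values in numeric columns

Because Step#5 found no missing values, this step changes nothing for this dataset; it is kept so the pipeline still works on files that do have gaps.

Optional: median imputation is often more robust than the mean for skewed clinical data. To use it, replace the line in the cell above with:

df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

**Clinical note:** For categorical variables (e.g., “Family history”), you might instead fill gaps with a category like "Unknown" if that’s more interpretable.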
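The difference between the two strategies shows up as soon as a column contains an extreme value. The sketch below uses hypothetical toy values and also illustrates the "Unknown" approach for a categorical column.

```python
import pandas as pd

toy = pd.DataFrame({
    "BMI": [22.0, 24.0, 26.0, 60.0, None],             # one extreme value
    "Family_History": ["Yes", None, "No", "No", "Yes"],
})

mean_fill = toy["BMI"].fillna(toy["BMI"].mean())       # gap filled with 33.0
median_fill = toy["BMI"].fillna(toy["BMI"].median())   # gap filled with 25.0

# Categorical gap: keep it visible as its own category
toy["Family_History"] = toy["Family_History"].fillna("Unknown")
```

The single outlier pulls the mean up to 33.0, while the median stays at 25.0, closer to the typical observed values.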


In [None]:
#10. Deduplicate Patient-Level Records

if 'Patient_ID' in df.columns:
    df = df.drop_duplicates(subset=['Patient_ID'])
else:
    # If no explicit ID, you can still drop exact row duplicates
    df = df.drop_duplicates()


**Step#10** ensures that each patient appears only once in the dataset by removing duplicate records at the patient level. The code first checks whether a dedicated identifier such as Patient_ID exists; if it does, duplicates are removed based on that column, keeping the first row found for each patient. If no identifier is available, the fallback is to remove exact row‑level duplicates. This step protects the integrity of downstream analyses by preventing inflated sample sizes, repeated observations, or biased modeling results.
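Before dropping anything, it can be worth looking at the rows that share an identifier, since they may disagree on other fields. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient_ID": [1, 2, 2, 3],
    "Age": [71, 34, 35, 80],   # the two rows for patient 2 disagree
})

# keep=False flags every row that shares its Patient_ID with another row
dupes = toy[toy.duplicated(subset=["Patient_ID"], keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence by default
deduped = toy.drop_duplicates(subset=["Patient_ID"])
print(deduped)
```

If the duplicated rows conflict, as they do here, which one to keep is a data-quality decision rather than something the default should settle silently.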

In [None]:
#11. Validate Absence of Missing Data
# You want to see zeros (or only intentional missingness) across all columns before modeling.

print(df.isnull().sum())


Patient_ID                 0
Age                        0
Gender                     0
Race                       0
Region                     0
Urban_or_Rural             0
Socioeconomic_Status       0
Family_History             0
Previous_Cancer_History    0
Stage_at_Diagnosis         0
Tumor_Aggressiveness       0
Colonoscopy_Access         0
Screening_Regularity       0
Diet_Type                  0
BMI                        0
Physical_Activity_Level    0
Smoking_Status             0
Alcohol_Consumption        0
Red_Meat_Consumption       0
Fiber_Consumption          0
Insurance_Coverage         0
Time_to_Diagnosis          0
Treatment_Access           0
Chemotherapy_Received      0
Radiotherapy_Received      0
Surgery_Received           0
Follow_Up_Adherence        0
Survival_Status            0
Recurrence                 0
Time_to_Recurrence         0
dtype: int64


**Step#11** confirms that no missing values remain before any modeling or statistical analysis begins. Printing the null count for every column gives a quick diagnostic of whether any variable still contains gaps that could distort model training, bias estimates, or break downstream algorithms. Here every column shows zero, matching the Step#5 result: the dataset had no nulls to begin with, so this check verifies that the intermediate steps introduced none and that the data is ready for feature engineering or modeling.
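If the notebook is rerun unattended, a printed table is easy to overlook. One option is to turn the check into an assertion that stops execution when gaps remain. The helper name `assert_no_missing` below is hypothetical.

```python
import pandas as pd

def assert_no_missing(frame):
    """Raise AssertionError naming the columns that still contain nulls."""
    remaining = frame.isnull().sum()
    remaining = remaining[remaining > 0]
    assert remaining.empty, f"Columns with missing values: {remaining.to_dict()}"

# Passes silently on a complete frame (hypothetical toy data)
assert_no_missing(pd.DataFrame({"Age": [71, 34], "Gender": ["Male", "Female"]}))
```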

In [None]:
#12. Export Final Cleaned Dataset

output_path = '/content/drive/MyDrive/colorectal_cancer_prediction_cleaned_imputed.csv'
df.to_csv(output_path, index=False)
print("Cleaned dataset saved to:", output_path)


Cleaned dataset saved to: /content/drive/MyDrive/colorectal_cancer_prediction_cleaned_imputed.csv


**Step#12** finalizes the data‑cleaning workflow by exporting the fully processed dataset to long‑term storage so it can be reused for modeling, sharing, or documentation. After all cleaning, deduplication, and validation steps are complete, the code writes the DataFrame to a CSV file in the user’s Drive. This ensures the cleaned dataset is preserved in a stable, accessible location, preventing the need to rerun preprocessing and enabling consistent results across future analyses.
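A quick way to confirm the export worked is to read the file back and compare its shape and column names with the DataFrame in memory. In this sketch a temporary directory stands in for the Drive folder, and the toy frame is hypothetical.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "Patient_ID": [1, 2],
    "Age": [71, 34],
    "Gender": ["Male", "Female"],
})

# A temporary directory stands in for the Drive folder
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "cleaned.csv")
    toy.to_csv(out_path, index=False)   # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Without `index=False`, the reloaded file would gain an extra `Unnamed: 0` column and the shape comparison would fail.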