In [1]:
# Step 1 — Mount Google Drive for File Access

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


**Step#1** gives the notebook access to your Google Drive so that later cells can load and save project files from one consistent location. The cell calls drive.mount('/content/drive'), which prompts you to authorize the Colab session and then attaches your Drive at /content/drive; the files used in this notebook live under /content/drive/MyDrive. Every later step that reads or writes a file (loading the dataset in Step 2, saving the cleaned copy in Step 8) depends on this mount being in place.
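Because later cells fail with a generic error if the Drive is not mounted, it can help to check that a file exists before reading it. The following is a minimal sketch, not part of the original notebook; the helper name `require_file` is hypothetical.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path, or raise a clear error if the file is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found. Is Google Drive mounted at /content/drive?"
        )
    return p


# Example use in Colab (commented out so the sketch also runs outside Colab):
# csv_path = require_file('/content/drive/MyDrive/colorectal_cancer_prediction.csv')
```

Calling the helper at the top of a cell turns a missing mount into an error message that names the likely cause.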

In [2]:
# Step 2 — Load the colorectal cancer dataset

import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/colorectal_cancer_prediction.csv')
df.head()


Unnamed: 0,Patient_ID,Age,Gender,Race,Region,Urban_or_Rural,Socioeconomic_Status,Family_History,Previous_Cancer_History,Stage_at_Diagnosis,...,Insurance_Coverage,Time_to_Diagnosis,Treatment_Access,Chemotherapy_Received,Radiotherapy_Received,Surgery_Received,Follow_Up_Adherence,Survival_Status,Recurrence,Time_to_Recurrence
0,1,71,Male,Other,Europe,Urban,Middle,Yes,No,III,...,Yes,Delayed,Good,Yes,No,No,Good,Survived,No,16
1,2,34,Female,Black,North America,Urban,Middle,No,No,I,...,No,Timely,Good,No,Yes,Yes,Poor,Deceased,No,28
2,3,80,Female,White,North America,Urban,Middle,No,No,III,...,Yes,Timely,Limited,No,Yes,Yes,Good,Survived,No,26
3,4,40,Male,Black,North America,Rural,Low,No,No,I,...,Yes,Delayed,Limited,Yes,No,Yes,Poor,Deceased,No,44
4,5,43,Female,White,Europe,Urban,High,Yes,No,III,...,No,Delayed,Good,Yes,No,Yes,Poor,Deceased,Yes,20


**Step#2** loads the colorectal cancer dataset into memory so the notebook can inspect and clean it. The cell imports pandas and calls pd.read_csv() on the CSV file stored in Google Drive, which returns a DataFrame; df.head() then displays its first five rows. The leftmost column of that preview (values 0 to 4) is the DataFrame's row index, not a field in the file, and the "..." column marks columns hidden by the display. Every later step operates on this DataFrame rather than on the raw file.
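When a file is large, it can be useful to preview a few rows and columns before loading everything. The sketch below assumes a tiny in-memory CSV built from the first three rows shown in the preview above; `nrows` and `usecols` are standard `pd.read_csv` parameters, while the choice of columns is only illustrative.

```python
import io

import pandas as pd

# Small stand-in for the real file, using values from the preview above.
csv_text = """Patient_ID,Age,Gender,Survival_Status
1,71,Male,Survived
2,34,Female,Deceased
3,80,Female,Survived
"""

# nrows limits how many data rows are parsed;
# usecols keeps only the named columns.
preview = pd.read_csv(
    io.StringIO(csv_text),
    nrows=2,
    usecols=["Patient_ID", "Age", "Survival_Status"],
)
print(preview)
```

With the real dataset, the same two parameters can be passed alongside the Drive path used in the cell above.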

In [3]:
# Step 3 — Inspect Dataset Dimensions and Schema

print(df.shape)
print(df.columns)
df.info()


(89945, 30)
Index(['Patient_ID', 'Age', 'Gender', 'Race', 'Region', 'Urban_or_Rural',
       'Socioeconomic_Status', 'Family_History', 'Previous_Cancer_History',
       'Stage_at_Diagnosis', 'Tumor_Aggressiveness', 'Colonoscopy_Access',
       'Screening_Regularity', 'Diet_Type', 'BMI', 'Physical_Activity_Level',
       'Smoking_Status', 'Alcohol_Consumption', 'Red_Meat_Consumption',
       'Fiber_Consumption', 'Insurance_Coverage', 'Time_to_Diagnosis',
       'Treatment_Access', 'Chemotherapy_Received', 'Radiotherapy_Received',
       'Surgery_Received', 'Follow_Up_Adherence', 'Survival_Status',
       'Recurrence', 'Time_to_Recurrence'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89945 entries, 0 to 89944
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Patient_ID               89945 non-null  int64  
 1   Age                      89945 non-null  int64  
 2   

**Step#3** inspects the structure of the dataset so you know its size, column names, and data types before cleaning or modeling. The cell prints the DataFrame's shape (89,945 rows by 30 columns), lists the column names, and runs df.info() to show the non-null count and dtype of each column. This shows up front how large the dataset is, whether any column was read with an unexpected type, and whether missing values will need handling.
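The same information can be collected into a single table, which is easier to sort and filter than the printed output of df.info(). This is a sketch on a three-row demo frame; the function name `schema_summary` is hypothetical.

```python
import pandas as pd


def schema_summary(frame):
    """One row per column: dtype, non-null count, and number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "n_unique": frame.nunique(),
    })


# Demo frame using values from the first rows of the dataset.
demo = pd.DataFrame({
    "Age": [71, 34, 80],
    "Gender": ["Male", "Female", "Female"],
})
summary = schema_summary(demo)
print(summary)
```

On the full dataset, `schema_summary(df)` would return one row for each of the 30 columns.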

In [4]:
# Step 4 — Generate Comprehensive Summary Statistics

df.describe(include='all')


Unnamed: 0,Patient_ID,Age,Gender,Race,Region,Urban_or_Rural,Socioeconomic_Status,Family_History,Previous_Cancer_History,Stage_at_Diagnosis,...,Insurance_Coverage,Time_to_Diagnosis,Treatment_Access,Chemotherapy_Received,Radiotherapy_Received,Surgery_Received,Follow_Up_Adherence,Survival_Status,Recurrence,Time_to_Recurrence
count,89945.0,89945.0,89945,89945,89945,89945,89945,89945,89945,89945,...,89945,89945,89945,89945,89945,89945,89945,89945,89945,89945.0
unique,,,2,5,5,2,3,2,2,4,...,2,2,2,2,2,2,2,2,2,
top,,,Male,White,North America,Urban,Middle,No,No,II,...,Yes,Timely,Good,Yes,No,Yes,Good,Survived,No,
freq,,,49369,44887,31537,62990,45088,67372,80985,26869,...,72118,54080,62897,45067,53952,62879,54041,67341,62975,
mean,44973.0,54.332892,,,,,,,,,...,,,,,,,,,,29.543299
std,25965.029318,20.18222,,,,,,,,,...,,,,,,,,,,17.26844
min,1.0,20.0,,,,,,,,,...,,,,,,,,,,0.0
25%,22487.0,37.0,,,,,,,,,...,,,,,,,,,,15.0
50%,44973.0,54.0,,,,,,,,,...,,,,,,,,,,30.0
75%,67459.0,72.0,,,,,,,,,...,,,,,,,,,,44.0


**Step#4** summarizes every column so you can see central tendencies, spread, and dominant categories at a glance. The cell runs df.describe(include='all'), which reports count, mean, standard deviation, minimum, and quartiles for numeric columns, and count, number of unique values, most frequent value (top), and its frequency (freq) for categorical columns; cells that do not apply to a column's type are left empty. For example, the output shows a mean Age of about 54.3 and Male as the most frequent Gender (49,369 rows). Patterns such as class imbalance, unusual ranges, or unexpected category frequencies found here guide later cleaning, encoding, and modeling decisions.
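Because include='all' mixes numeric and categorical statistics in one table, many cells are empty. One alternative, sketched here on a five-row demo frame built from the first rows of the dataset, is to describe the two groups separately with the `include` and `exclude` parameters of `describe`.

```python
import pandas as pd

# Demo frame: Age and Stage_at_Diagnosis from the first five rows of the dataset.
demo = pd.DataFrame({
    "Age": [71, 34, 80, 40, 43],
    "Stage_at_Diagnosis": ["III", "I", "III", "I", "III"],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_stats = demo.describe(include="number")

# Everything that is not numeric: count, unique, top, freq.
categorical_stats = demo.describe(exclude="number")

print(numeric_stats)
print(categorical_stats)
```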

In [5]:
# Step 5 — Detect Missing Values Across All Columns

missing = df.isnull().sum()
missing[missing > 0]


Unnamed: 0,0


**Step#5** checks the dataset for missing values so you can decide whether imputation or other corrections are needed before modeling. The cell computes df.isnull().sum() to count nulls in every column, then keeps only the columns with at least one missing entry. The result is empty (only a header is displayed), so no column contains missing values and no null handling is required at this stage.
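When a dataset does contain gaps, a count alone is hard to judge without the share of rows affected. The sketch below adds a percentage column; the function name `missing_report` is hypothetical, and the gaps in the demo frame are synthetic, since the real dataset has none.

```python
import numpy as np
import pandas as pd


def missing_report(frame):
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(frame),
    })
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Synthetic demo: the missing entries here are made up for illustration.
demo = pd.DataFrame({
    "Age": [71, np.nan, 80, 40],
    "Gender": ["Male", "Female", None, None],
    "Region": ["Europe"] * 4,
})
report = missing_report(demo)
print(report)
```

An empty report corresponds to the empty result shown in the cell above.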

In [6]:
# Step 6 — Explore Categorical Variable Distributions

categorical_cols = df.select_dtypes(include='object').columns

for col in categorical_cols:
    print(f"\n{col} value counts:\n", df[col].value_counts(dropna=False))



Gender value counts:
 Gender
Male      49369
Female    40576
Name: count, dtype: int64

Race value counts:
 Race
White       44887
Black       18005
Asian       13502
Hispanic     9040
Other        4511
Name: count, dtype: int64

Region value counts:
 Region
North America    31537
Europe           27019
Asia Pacific     17916
Latin America     9050
Africa            4423
Name: count, dtype: int64

Urban_or_Rural value counts:
 Urban_or_Rural
Urban    62990
Rural    26955
Name: count, dtype: int64

Socioeconomic_Status value counts:
 Socioeconomic_Status
Middle    45088
Low       26868
High      17989
Name: count, dtype: int64

Family_History value counts:
 Family_History
No     67372
Yes    22573
Name: count, dtype: int64

Previous_Cancer_History value counts:
 Previous_Cancer_History
No     80985
Yes     8960
Name: count, dtype: int64

Stage_at_Diagnosis value counts:
 Stage_at_Diagnosis
II     26869
I      22594
III    22412
IV     18070
Name: count, dtype: int64

Tumor_Aggressivene

**Step#6** examines how each categorical variable is distributed so you can spot dominant classes and imbalances that may influence encoding and model performance. The cell selects all columns stored with the object dtype, then loops through them and prints their value counts; dropna=False ensures that missing values, if any, would be counted as their own category. Skew matters here: Previous_Cancer_History, for example, is "No" for 80,985 of 89,945 rows (about 90%), and imbalances of this kind affect feature engineering, class weighting, and interpretation in clinical modeling.
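Raw counts are easier to compare across columns when expressed as shares. The sketch below rebuilds the Previous_Cancer_History column from the counts printed above and uses `value_counts(normalize=True)`; the "imbalance ratio" is a simple illustrative measure, not something the original notebook computes.

```python
import pandas as pd

# Rebuild the column from the counts printed above (No: 80985, Yes: 8960).
history = pd.Series(
    ["No"] * 80985 + ["Yes"] * 8960,
    name="Previous_Cancer_History",
)

# normalize=True returns each category's share of rows instead of a raw count.
shares = history.value_counts(normalize=True)

# Ratio of the most common category to the least common one.
imbalance_ratio = shares.max() / shares.min()

print(shares.round(3))
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```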

In [7]:
# Step 7 — Clean and Impute Dataset Values

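# Numeric imputation: fill gaps with each column's median.
# Note: this selection also includes the Patient_ID identifier.
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Categorical imputation: fill gaps with each column's most frequent value.
# categorical_cols was defined in Step 6.
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

# Date parsing: convert any column whose name contains "date" to datetime.
# Values that cannot be parsed become NaT (errors='coerce').
date_cols = [col for col in df.columns if 'date' in col.lower()]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')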


**Step#7** applies basic cleaning so the dataset has no missing values and correctly typed date fields before modeling. The cell fills gaps in numeric columns with each column's median, fills gaps in categorical columns with each column's most frequent value, and converts any column whose name contains "date" to datetime. On this dataset the cell changes nothing: Step 5 found no missing values, and none of the 30 column names listed in Step 3 contains "date". The code still acts as a safeguard if the source file is later refreshed with incomplete records. Note that it does not rescale numeric values, so any scaling needed for modeling has to be done separately.
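To show what the imputation does when gaps are present, here is a sketch on a four-row demo frame with synthetic missing values. It also excludes the Patient_ID identifier from numeric imputation, which is an assumption about good practice rather than something the original cell does.

```python
import numpy as np
import pandas as pd

# Synthetic demo: the missing entries are made up for illustration.
demo = pd.DataFrame({
    "Patient_ID": [1, 2, 3, 4],
    "Age": [71.0, np.nan, 80.0, 40.0],
    "Gender": ["Male", "Female", None, "Female"],
})

# Numeric features, excluding the identifier column.
num_cols = demo.select_dtypes(include="number").columns.drop("Patient_ID")
# Everything that is not numeric is treated as categorical.
cat_cols = demo.select_dtypes(exclude="number").columns

# Median for numeric gaps, most frequent value for categorical gaps.
demo[num_cols] = demo[num_cols].fillna(demo[num_cols].median())
demo[cat_cols] = demo[cat_cols].fillna(demo[cat_cols].mode().iloc[0])

print(demo)
```

The missing Age is replaced by the median of the three known ages, and the missing Gender by the most frequent value in that column.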

In [8]:
# Step 8 — Save Cleaned Dataset to Drive

output_path = '/content/drive/MyDrive/colorectal_cancer_prediction_cleaned_imputed.csv'
df.to_csv(output_path, index=False)
output_path


'/content/drive/MyDrive/colorectal_cancer_prediction_cleaned_imputed.csv'

**Step#8** saves the cleaned dataset so it can be reused in later analysis, modeling, or sharing. The cell defines an output path inside your Drive and writes the DataFrame to a CSV file with index=False, which omits the row index; the last line evaluates output_path so the notebook displays where the file was saved. This gives downstream notebooks and collaborators one fixed file to work from. Note that rerunning the cell overwrites that file, so keeping earlier versions requires changing the filename.
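A quick way to confirm that a save worked is to read the file back and compare it with the DataFrame in memory. The sketch below does this in a temporary directory with a two-row demo frame; the filename `demo_cleaned.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({
    "Patient_ID": [1, 2],
    "Age": [71, 34],
    "Gender": ["Male", "Female"],
})

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "demo_cleaned.csv"
    # index=False keeps the row index out of the file.
    demo.to_csv(out, index=False)
    reloaded = pd.read_csv(out)

# Same shape and column names means nothing was added or lost on the way.
same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)
print(same_shape, same_columns)
```

If the index had been written, the reloaded frame would contain an extra "Unnamed: 0" column and the column check would fail.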