**Step 1-4:** Mount Google Drive and load the dataset

In [1]:
#1 Mount Google Drive for File Access
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Step#1** mounts your Google Drive inside the Colab runtime so that the notebook can read and write files stored there. Importing drive from google.colab and calling drive.mount('/content/drive') prompts you to authorize access to your Google account the first time it runs. Once mounted, your Drive appears under /content/drive (your own files sit under /content/drive/MyDrive), so datasets can be loaded and outputs saved across sessions. The message above shows that the Drive was already mounted in this session, so no new prompt appeared.
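If the notebook may also be run outside Colab, the mount can be guarded so the import does not fail. The following is a minimal sketch under that assumption; the DATA_DIR name and the local fallback directory are hypothetical and not part of the original notebook.

```python
from pathlib import Path

try:
    # google.colab is only importable inside a Colab runtime
    from google.colab import drive
except ImportError:
    drive = None

if drive is not None:
    drive.mount('/content/drive')             # asks for authorization on first use
    DATA_DIR = Path('/content/drive/MyDrive')
else:
    DATA_DIR = Path('.')                      # hypothetical local fallback

print("Reading data from:", DATA_DIR)
```

Later cells could then build file paths from DATA_DIR instead of hard-coding the Drive location.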

In [2]:
#2 Import Pandas Library
import pandas as pd

**Step#2** This step loads the pandas library, a core tool for data manipulation and analysis in Python. By importing it with the alias pd, you streamline your workflow and make future code more concise and readable. Pandas provides powerful data structures—especially DataFrames—that allow you to clean, transform, merge, summarize, and explore datasets efficiently. Importing it early in the workflow ensures that all subsequent steps can rely on its functionality without interruption.

In [3]:
#3 Load Raw Dataset into DataFrame

df = pd.read_csv('/content/drive/MyDrive/prostate_cancer_prediction.csv')

**Step#3** This step loads the raw prostate cancer dataset into your working environment so you can begin exploring, cleaning, and analyzing the data. Using pd.read_csv(), the code reads a CSV file stored in your Google Drive and converts it into a pandas DataFrame named df, which becomes the central object for all downstream processing. This step ensures that the dataset is accessible in memory, properly structured, and ready for any transformations, visualizations, or modeling tasks that follow.
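A quick schema check right after loading can catch a wrong or outdated file early. The sketch below uses a small in-memory CSV built from a few of the preview rows instead of the real Drive file; the expected_cols set is an assumed subset chosen for illustration.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file (three rows, four of the 30 columns)
csv_text = """Patient_ID,Age,PSA_Level,Biopsy_Result
1,78,5.07,Benign
2,68,10.24,Benign
5,47,1.89,Malignant
"""

expected_cols = {"Patient_ID", "Age", "PSA_Level", "Biopsy_Result"}  # assumed subset

toy = pd.read_csv(io.StringIO(csv_text))

# Fail loudly if a column the later steps rely on is absent
missing_cols = expected_cols - set(toy.columns)
if missing_cols:
    raise ValueError(f"CSV is missing expected columns: {sorted(missing_cols)}")

print(toy.shape)
```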

In [4]:
#4 Inspect Raw Dataset Structure
print(df.shape)      # dimensions
print(df.head())     # preview first rows
df.info()            # column dtypes + non-null counts (info() prints itself and returns None)


(27945, 30)
   Patient_ID  Age Family_History Race_African_Ancestry  PSA_Level DRE_Result  \
0           1   78             No                   Yes       5.07     Normal   
1           2   68             No                   Yes      10.24     Normal   
2           3   54             No                    No      13.79     Normal   
3           4   82             No                    No       8.03   Abnormal   
4           5   47            Yes                    No       1.89     Normal   

  Biopsy_Result Difficulty_Urinating Weak_Urine_Flow Blood_in_Urine  ...  \
0        Benign                   No              No             No  ...   
1        Benign                  Yes              No             No  ...   
2        Benign                   No              No             No  ...   
3        Benign                   No              No             No  ...   
4     Malignant                  Yes             Yes             No  ...   

  Alcohol_Consumption Hypertension Diabetes 

**Step#4** This step performs an initial inspection of the raw dataset to understand its structure, size, and basic characteristics before any cleaning or preprocessing begins. By printing the DataFrame’s shape, previewing the first few rows, and displaying detailed metadata such as column names, data types, and non‑null counts, you gain an early sense of data quality, potential missing values, and the overall layout of the dataset. This early diagnostic pass helps you plan downstream cleaning steps, identify categorical vs. numerical features, and confirm that the dataset loaded correctly.
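With 30 columns, the head() preview wraps and is hard to scan. One possible complement is a compact per-column summary table. This is a sketch on toy data; the summary name and the choice of fields are assumptions.

```python
import pandas as pd

# Toy frame with a few columns from the preview above
toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3, 4],
    "Age": [78, 68, 54, 82],
    "DRE_Result": ["Normal", "Normal", "Normal", "Abnormal"],
})

# One row per column: dtype, number of non-null cells, number of distinct values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "n_unique": toy.nunique(),
})
print(summary)
```

A column with very few distinct values is a likely categorical feature, while one with as many distinct values as rows (such as Patient_ID) is a likely identifier.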

**Step 1-4** Goal: Confirm patient-level granularity, identify outcome variables (here, candidates include Biopsy_Result, Cancer_Stage, and Survival_5_Years), and understand variable types.

**Step 5-6:** Quantify and review missing values

In [5]:
#5 Check total missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)


Patient_ID                 0
Age                        0
Family_History             0
Race_African_Ancestry      0
PSA_Level                  0
DRE_Result                 0
Biopsy_Result              0
Difficulty_Urinating       0
Weak_Urine_Flow            0
Blood_in_Urine             0
Pelvic_Pain                0
Back_Pain                  0
Erectile_Dysfunction       0
Cancer_Stage               0
Treatment_Recommended      0
Survival_5_Years           0
Exercise_Regularly         0
Healthy_Diet               0
BMI                        0
Smoking_History            0
Alcohol_Consumption        0
Hypertension               0
Diabetes                   0
Cholesterol_Level          0
Screening_Age              0
Follow_Up_Required         0
Prostate_Volume            0
Genetic_Risk_Factors       0
Previous_Cancer_History    0
Early_Detection            0
dtype: int64


**Step#5** This step checks the dataset for missing values, an essential early task in any data-cleaning workflow. df.isnull().sum() counts the null entries (NaN or None) in each column, which tells you whether imputation, row removal, or other preprocessing is needed. Here every column shows zero, so no cells are null and no missing-value handling is required at this stage. Note that isnull() does not detect placeholder strings such as "Unknown" or blank text, so the categorical columns are still worth a quick look.
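Missing values are sometimes stored as text rather than as NaN, and isnull() will not count those. The sketch below counts such cells on toy data; the placeholders list is an assumption and should be adapted to whatever tokens a given dataset actually uses.

```python
import pandas as pd

# Toy categorical columns containing disguised missing values
toy = pd.DataFrame({
    "Cancer_Stage": ["Localized", "Unknown", "Advanced", " "],
    "Smoking_History": ["Yes", "No", "N/A", "No"],
})

placeholders = ["", "unknown", "n/a", "na", "none", "?"]  # assumed tokens

# Normalise whitespace and case, then count cells matching a placeholder
normalised = toy.apply(lambda s: s.astype(str).str.strip().str.lower())
placeholder_counts = normalised.isin(placeholders).sum()
print(placeholder_counts)
```

If the counts are non-zero, the matching cells can be converted to NaN first so that the later imputation steps treat them as missing.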

In [6]:
#6 Calculate Missingness Percentages
missing_percent = (df.isnull().mean() * 100).round(2)
print(missing_percent)

Patient_ID                 0.0
Age                        0.0
Family_History             0.0
Race_African_Ancestry      0.0
PSA_Level                  0.0
DRE_Result                 0.0
Biopsy_Result              0.0
Difficulty_Urinating       0.0
Weak_Urine_Flow            0.0
Blood_in_Urine             0.0
Pelvic_Pain                0.0
Back_Pain                  0.0
Erectile_Dysfunction       0.0
Cancer_Stage               0.0
Treatment_Recommended      0.0
Survival_5_Years           0.0
Exercise_Regularly         0.0
Healthy_Diet               0.0
BMI                        0.0
Smoking_History            0.0
Alcohol_Consumption        0.0
Hypertension               0.0
Diabetes                   0.0
Cholesterol_Level          0.0
Screening_Age              0.0
Follow_Up_Required         0.0
Prostate_Volume            0.0
Genetic_Risk_Factors       0.0
Previous_Cancer_History    0.0
Early_Detection            0.0
dtype: float64


**Step#6** This step provides an optional but more interpretable view of missingness by converting raw null counts into percentages. Instead of simply knowing whether a column has missing values, calculating the proportion of missing entries helps you understand the relative impact of missingness on each feature—especially useful in larger datasets. By computing the mean of the boolean null mask, multiplying by 100, and rounding to two decimals, you generate a clean, human‑readable summary. In this dataset, every column shows 0.0% missingness, confirming once again that the data is fully complete and requires no imputation or removal of incomplete records.
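Counts and percentages can also be shown side by side in one table, sorted so the most incomplete columns come first. A minimal sketch on toy data with deliberately missing cells (the real dataset has none):

```python
import numpy as np
import pandas as pd

# Toy frame: PSA_Level has 1 of 4 cells missing, BMI has 2 of 4
toy = pd.DataFrame({
    "PSA_Level": [5.07, np.nan, 13.79, 8.03],
    "BMI": [25.1, 27.3, np.nan, np.nan],
    "Age": [78, 68, 54, 82],
})

missing_table = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(2),
}).sort_values("pct_missing", ascending=False)
print(missing_table)
```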

**Step 5-6:** Goal: Decide which variables are clinically essential (e.g., PSA, Gleason score, stage) and which may be dropped or imputed.

**Step 7:** Drop columns with excessive missingness (optional, threshold-based)

In [7]:
#7 Drop High‑Missingness Columns
threshold = 40  # percent
cols_to_drop = missing_percent[missing_percent > threshold].index
print("Dropping columns:", list(cols_to_drop))

df = df.drop(columns=cols_to_drop)


Dropping columns: []


**Step#7** This step shows how to automatically remove columns whose missingness exceeds a predefined threshold, in this case more than 40%. The code compares each column's missing-value percentage (missing_percent from Step 6) to the threshold, collects the columns above it into cols_to_drop, and removes them from the DataFrame. In your dataset, the output shows an empty list: no column exceeded the threshold, so none were dropped and df still has all 30 columns.
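Because the real dataset triggers no drops, the effect of the threshold is easier to see on a toy frame. In this sketch, Lab_X is a hypothetical column that is 80% missing; the toy_ prefixes are only there to avoid overwriting the notebook's own variables.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [78, 68, 54, 82, 47],
    "PSA_Level": [5.07, np.nan, 13.79, 8.03, 1.89],   # 20% missing, kept
    "Lab_X": [np.nan, np.nan, np.nan, 1.2, np.nan],   # 80% missing, dropped
})

threshold = 40  # percent, same cutoff as in the cell above
toy_missing_percent = toy.isnull().mean() * 100
toy_cols_to_drop = toy_missing_percent[toy_missing_percent > threshold].index
toy_reduced = toy.drop(columns=toy_cols_to_drop)

print("Dropping columns:", list(toy_cols_to_drop))
print("Remaining columns:", list(toy_reduced.columns))
```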

**Step 7:** Goal: Remove low-quality variables that are unlikely to be salvageable or clinically interpretable.

**Step 8:** Handle obvious duplicates (e.g., by patient ID)
If your dataset has a patient identifier column (e.g., "Patient_ID", "id", "patient_id"), use it here. Adjust the column name accordingly.

In [8]:
#8 Remove Duplicate Patient Records

id_col = 'Patient_ID'

if id_col in df.columns:
    before = df.shape[0]
    df = df.drop_duplicates(subset=[id_col])
    after = df.shape[0]
    print(f"Removed {before - after} duplicate rows based on {id_col}")
else:
    print("No explicit patient ID column found; consider checking duplicates on key clinical fields.")


Removed 0 duplicate rows based on Patient_ID


**Step#8** This step ensures that your dataset contains only unique patient records by identifying and removing duplicate rows based on a designated ID column. After defining the ID field (Patient_ID), the code checks whether that column exists in the DataFrame. If it does, it compares the number of rows before and after applying drop_duplicates to determine how many duplicate entries were removed. If the ID column is missing, the code provides a fallback message suggesting that duplicates be checked using key clinical variables instead. In your dataset, the output shows that 0 duplicate rows were removed, confirming that each patient record is already unique.
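drop_duplicates keeps the first row for each ID and silently discards the rest, so it can be worth looking at the duplicated rows before removing them. The sketch below, on toy data, lists every row that shares an ID and flags IDs whose records disagree.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient_ID": [1, 2, 2, 3],
    "PSA_Level": [5.07, 10.24, 11.00, 13.79],
})

# keep=False marks every row that shares an ID, not only the later copies
dupes = toy[toy.duplicated(subset=["Patient_ID"], keep=False)]
print(dupes)

# After removing exact copies, an ID that still appears twice has conflicting values
per_id = dupes.drop_duplicates().groupby("Patient_ID").size()
conflicting_ids = per_id[per_id > 1].index.tolist()
print("IDs with conflicting records:", conflicting_ids)
```

Conflicting records usually call for a manual decision (for example, keeping the most recent measurement) rather than an automatic drop.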

**Step 8:** Goal: Ensure each patient is represented once to avoid bias in modeling.

**Step 9:** Separate numeric and categorical columns

In [9]:
#9 Identify Numeric and Categorical Features

numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(exclude=['number']).columns

print("Numeric columns:", list(numeric_cols))
print("Categorical columns:", list(categorical_cols))


Numeric columns: ['Patient_ID', 'Age', 'PSA_Level', 'BMI', 'Screening_Age', 'Prostate_Volume']
Categorical columns: ['Family_History', 'Race_African_Ancestry', 'DRE_Result', 'Biopsy_Result', 'Difficulty_Urinating', 'Weak_Urine_Flow', 'Blood_in_Urine', 'Pelvic_Pain', 'Back_Pain', 'Erectile_Dysfunction', 'Cancer_Stage', 'Treatment_Recommended', 'Survival_5_Years', 'Exercise_Regularly', 'Healthy_Diet', 'Smoking_History', 'Alcohol_Consumption', 'Hypertension', 'Diabetes', 'Cholesterol_Level', 'Follow_Up_Required', 'Genetic_Risk_Factors', 'Previous_Cancer_History', 'Early_Detection']


**Step#9** This step splits the columns into numeric and categorical groups, which most downstream preprocessing, modeling, and visualization tasks rely on. select_dtypes identifies the columns with a numeric dtype (such as Age, PSA_Level, or Prostate_Volume) and those without one (such as Family_History or the symptom flags). With the groups separated, you can apply the appropriate transformation to each, for example scaling numeric variables or encoding categorical ones. The printed lists show 6 numeric and 24 categorical columns. Note that Patient_ID is listed as numeric only because of its dtype; it is an identifier, not a measurement, and should be excluded before scaling or modeling.
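One way to keep an identifier out of the numeric feature list is to drop it from the selected column index. This is a sketch on toy data; the numeric_features and categorical_features names are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Age": [78, 68, 54],
    "PSA_Level": [5.07, 10.24, 13.79],
    "DRE_Result": ["Normal", "Normal", "Normal"],
})

id_col = "Patient_ID"

# A numeric dtype alone does not make a column a feature; remove the identifier
numeric_features = toy.select_dtypes(include=["number"]).columns.drop(id_col)
categorical_features = toy.select_dtypes(exclude=["number"]).columns

print("Numeric features:", list(numeric_features))
print("Categorical features:", list(categorical_features))
```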

**Step 9:** Goal: Prepare for different imputation strategies for numeric vs categorical variables.

**Step 10:** Impute numeric columns (e.g., with mean or median)
For clinical data, median is often more robust than mean if there are outliers. You can choose either; here’s median:

In [10]:
#10 Impute Numeric Columns with Median Values
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())


**Step#10** This step fills missing values in numeric columns by replacing any NaN with the median of its column. Median imputation is widely used because it is robust to outliers, unlike the mean, which extreme values can pull away from the bulk of the data. It is not free of side effects: every imputed cell receives the same value, which shrinks the column's variance and can weaken its correlations with other features, and the effect grows with the share of missing values. Applying the operation to all numeric columns at once keeps the code short. Because this dataset has no missing numeric values, the step changes nothing here; it acts as a safeguard for future datasets or updates.

If you prefer mean:

# Alternative: fill numeric columns with mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
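The difference between the two choices shows up as soon as a column contains an outlier. A small worked example with toy PSA values (invented for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Five observed values, one of them extreme, plus one missing cell
psa = pd.Series([1.9, 5.1, 8.0, 10.2, 150.0, np.nan], name="PSA_Level")

print("median:", psa.median())
print("mean:", round(psa.mean(), 2))

filled_median = psa.fillna(psa.median())  # missing cell becomes 8.0
filled_mean = psa.fillna(psa.mean())      # missing cell becomes about 35.04
```

The single value of 150.0 pulls the mean far above four of the five observed values, so a mean-imputed cell would look like an elevated reading, while the median stays at a typical value.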


**Step 10:** Goal: Ensure all numeric features (e.g., PSA, age, lab values) are complete for modeling.

In [11]:
#11 Impute Categorical Columns with Mode

for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mode()[0])


**Step#11** This step imputes missing values in categorical columns by filling them with each column’s mode—the most frequently occurring category. Mode imputation is a common and practical approach because it preserves the dominant pattern within a feature without introducing unrealistic or synthetic categories. The loop checks each categorical column individually and only applies imputation if that column actually contains missing values, ensuring the process is efficient and minimally invasive. Even if your dataset currently has no missing categorical values, this step adds robustness and reproducibility for future datasets or updates.
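One edge case is worth guarding: if a column is entirely missing, mode() returns an empty Series and mode()[0] raises an error. The sketch below handles that case on toy data; the Notes column is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Cancer_Stage": ["Localized", "Localized", np.nan, "Advanced"],
    "Notes": [np.nan, np.nan, np.nan, np.nan],   # hypothetical all-missing column
})

for col in toy.columns:
    modes = toy[col].mode(dropna=True)
    if modes.empty:
        # No observed category to borrow from; leave the column untouched
        print(f"{col}: no non-missing values, skipped")
        continue
    toy[col] = toy[col].fillna(modes.iloc[0])

print(toy)
```

When several categories tie for most frequent, mode() returns them in sorted order, so iloc[0] picks the first of the tied values.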

**Step#11** Goal: Preserve the most common clinically plausible category (e.g., “Localized”, “Benign”, “Unknown”) rather than dropping rows.

**Step 12:** (Optional) Encode key clinical categorical variables
If you have a binary outcome like "cancer_present" or "diagnosis" with values such as "Malignant"/"Benign", you can encode it:

Goal: Prepare clinically meaningful targets for prediction models.

In [12]:
#12 Encode Binary Outcome Variable / Adjust 'Outcome' and value labels to match your dataset
if 'Outcome' in df.columns:
    df['Outcome_binary'] = df['Outcome'].str.upper().map({
        'MALIGNANT': 1,
        'BENIGN': 0
    })
    print(df['Outcome_binary'].value_counts(dropna=False))


**Step#12** This step converts a binary outcome column into a numeric format suitable for modeling by mapping text labels to numbers. The code first checks that a column named Outcome exists, then uppercases its text and maps MALIGNANT to 1 and BENIGN to 0. Any label not in the mapping becomes NaN, which is why the value counts are printed with dropna=False: they let you confirm that both classes are present and that nothing was left unmapped. Most machine-learning libraries expect a numeric target, so this conversion is usually needed before modeling. This dataset has no column named Outcome (its Benign/Malignant labels are in Biopsy_Result), so the if block is skipped and the cell prints nothing; change the column name to apply the encoding.
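For this dataset, the same mapping could be pointed at Biopsy_Result, which holds the Benign/Malignant labels seen in the preview. Treating that column as the modeling target is an assumption, as is the Biopsy_binary name. The sketch runs on toy data and raises an error for any label the mapping does not cover.

```python
import pandas as pd

# Toy labels, including inconsistent case and trailing whitespace
toy = pd.DataFrame({"Biopsy_Result": ["Benign", "Malignant", "benign ", "Benign"]})

label_map = {"MALIGNANT": 1, "BENIGN": 0}

cleaned = toy["Biopsy_Result"].str.strip().str.upper()
toy["Biopsy_binary"] = cleaned.map(label_map)

# Labels outside the mapping become NaN; surface them instead of ignoring them
unmapped = cleaned[toy["Biopsy_binary"].isnull()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped labels: {list(unmapped)}")

toy["Biopsy_binary"] = toy["Biopsy_binary"].astype(int)
print(toy["Biopsy_binary"].value_counts())
```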

In [13]:
#13 Final Missing‑Value Audit
print(df.isnull().sum())


Patient_ID                 0
Age                        0
Family_History             0
Race_African_Ancestry      0
PSA_Level                  0
DRE_Result                 0
Biopsy_Result              0
Difficulty_Urinating       0
Weak_Urine_Flow            0
Blood_in_Urine             0
Pelvic_Pain                0
Back_Pain                  0
Erectile_Dysfunction       0
Cancer_Stage               0
Treatment_Recommended      0
Survival_5_Years           0
Exercise_Regularly         0
Healthy_Diet               0
BMI                        0
Smoking_History            0
Alcohol_Consumption        0
Hypertension               0
Diabetes                   0
Cholesterol_Level          0
Screening_Age              0
Follow_Up_Required         0
Prostate_Volume            0
Genetic_Risk_Factors       0
Previous_Cancer_History    0
Early_Detection            0
dtype: int64


**Step#13** This step runs a final check for missing values after all earlier cleaning and imputation steps. df.isnull().sum() counts the null entries in every column, letting you confirm that none remain before modeling or feature engineering. The output shows zeros for every column, which matches Step 5: the dataset had no nulls to begin with, and none of the intervening steps introduced any.

**Step 13:** Verify all missing values are handled
You want to see zeros for all columns you plan to use in modeling.
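Reading 30 zeros by eye works, but the same check can be made to fail loudly. The helper below is a hypothetical sketch (assert_no_missing is not part of the notebook) that raises an error naming any column that still contains nulls.

```python
import numpy as np
import pandas as pd

def assert_no_missing(frame, columns=None):
    """Raise ValueError listing any checked column that still contains nulls."""
    subset = frame if columns is None else frame[list(columns)]
    counts = subset.isnull().sum()
    remaining = counts[counts > 0]
    if not remaining.empty:
        raise ValueError(f"Missing values remain: {remaining.to_dict()}")

toy = pd.DataFrame({"Age": [78, 68], "PSA_Level": [5.07, np.nan]})

assert_no_missing(toy, columns=["Age"])   # passes: Age is complete
try:
    assert_no_missing(toy)                # fails: PSA_Level has one null
except ValueError as err:
    print(err)
```

Passing only the columns you plan to model on keeps the check aligned with the note above.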

In [14]:
#14 Save Cleaned Dataset to Drive

df.to_csv('/content/drive/MyDrive/prostate_cancer_prediction_cleaned_imputed.csv', index=False)
print("Cleaned dataset saved to Drive.")


Cleaned dataset saved to Drive.


**Step#14** This step saves the cleaned dataset back to Google Drive as a separate file, leaving the raw CSV untouched, so it can be reused for modeling, sharing, or future analysis. df.to_csv() writes the DataFrame with the results of the earlier steps (missing-value handling and deduplication) to a portable CSV file; rerunning the cell overwrites the file of the same name. Passing index=False omits the row index, so no extra unnamed column appears when the file is read back. The printed message confirms that the call completed without raising an error.
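A short round-trip check can confirm that the saved file reads back with the same shape and columns. The sketch below writes a toy frame to a temporary directory rather than to Drive; the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "PSA_Level": [5.07, 10.24, 13.79],
    "Biopsy_Result": ["Benign", "Benign", "Malignant"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_cleaned.csv"   # hypothetical file name
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)           # read back before the directory is removed

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print("Round trip OK:", same_shape and same_columns)
```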
