In [1]:
# Step-by-Step Data Cleaning Workflow (Clinical Research Context)
# 1. Load and Inspect the Dataset
  # - Use pandas to read the CSV into a DataFrame.
  # - Check dataset dimensions, column names, and data types.


# Mount Google Drive if your dataset is stored there
from google.colab import drive
drive.mount('/content/drive')

# Load dataset
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Breast Cancer METABRIC.csv')

# Quick inspection
print(df.shape)      # dimensions
print(df.head())     # preview first rows
df.info()            # column types + non-null counts (prints directly; returns None)

# Goal: Understand dataset scope, patient-level granularity, and variable types.
# Clinical note: Confirm outcome variables (e.g., relapse, survival time) are present.

Mounted at /content/drive
(2509, 34)
  Patient ID  Age at Diagnosis Type of Breast Surgery    Cancer Type  \
0    MB-0000             75.65             Mastectomy  Breast Cancer   
1    MB-0002             43.19      Breast Conserving  Breast Cancer   
2    MB-0005             48.87             Mastectomy  Breast Cancer   
3    MB-0006             47.68             Mastectomy  Breast Cancer   
4    MB-0008             76.97             Mastectomy  Breast Cancer   

                        Cancer Type Detailed Cellularity Chemotherapy  \
0           Breast Invasive Ductal Carcinoma         NaN           No   
1           Breast Invasive Ductal Carcinoma        High           No   
2           Breast Invasive Ductal Carcinoma        High          Yes   
3  Breast Mixed Ductal and Lobular Carcinoma    Moderate          Yes   
4  Breast Mixed Ductal and Lobular Carcinoma        High          Yes   

  Pam50 + Claudin-low subtype  Cohort ER status measured by IHC  ...  \
0                 c

**Step#1** This cell is the first step of a structured data cleaning workflow for clinical research in Google Colab: loading and inspecting the METABRIC breast cancer dataset. After Google Drive is mounted with `drive.mount('/content/drive')`, `pd.read_csv(...)` reads the CSV into a pandas DataFrame. Three quick diagnostics follow: `df.shape` gives the dimensions (2509 rows × 34 columns), `df.head()` previews the first five patient records, and `df.info()` lists each column's type and non-null count. Together they show the dataset's structure, let you check that each row is one patient, and confirm that key clinical variables such as “Patient ID,” “Age at Diagnosis,” and “Type of Breast Surgery” are present. Outcome variables (relapse, survival time) should be confirmed the same way before moving on to deduplication, imputation, and survival modeling.
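The presence check described above can be made explicit. The following is a minimal sketch on a toy frame, not the real file: the survival values are invented, and the `required` list is an assumption about which columns an analysis would need.

```python
import pandas as pd

# Toy stand-in for the METABRIC frame; survival values are invented for illustration.
toy = pd.DataFrame({
    "Patient ID": ["MB-0000", "MB-0002", "MB-0005"],
    "Age at Diagnosis": [75.65, 43.19, None],
    "Overall Survival (Months)": [140.5, 84.6, 163.7],
})

# Columns the analysis is assumed to need (hypothetical list; adapt to your protocol).
required = [
    "Patient ID",
    "Age at Diagnosis",
    "Overall Survival (Months)",
    "Overall Survival Status",
]
missing_cols = [c for c in required if c not in toy.columns]

# One row per column: dtype and number of missing entries.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
})

print("Missing required columns:", missing_cols)
print(summary)
```

Failing early on an absent outcome column is cheaper than discovering it after imputation.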

In [2]:
# 2. Handle Missing Values
  # - Quantify missingness: df.isnull().sum()
  # - Clinical relevance: Some variables (e.g., ER status, HER2 status) are critical;
  #   missingness here may require imputation or exclusion.
  # - Strategies:
    # - Drop columns with excessive missingness (e.g., >40%).
    # - Impute categorical values with mode or "Unknown".
    # - Impute continuous values with median or clinical-informed methods.

# Encode ER status as 1/0. Values outside the mapping (including missing ones) become NaN.
df['ER Status'] = df['ER Status'].str.upper().map({'POSITIVE': 1, 'NEGATIVE': 0})

**Step#2** This cell outlines a clinically guided approach to missing values, but most of it lives in the comments rather than in executed code. The comments list the strategy: quantify missingness with `df.isnull().sum()`; treat variables such as ER and HER2 status as clinically critical, so that gaps in them call for careful imputation or exclusion; drop columns with excessive missingness (e.g., >40%); impute categorical values with the mode or a placeholder such as "Unknown"; and impute continuous values with the median or a clinically informed estimate. The only statement that runs encodes 'ER Status': it uppercases the strings and maps 'POSITIVE' to 1 and 'NEGATIVE' to 0. Any value outside that mapping, including an existing missing value, becomes NaN, so the encoding does not itself remove missingness. It also turns 'ER Status' into a numeric column, which means the numeric imputation in Step 5 applies to it.
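The quantification and threshold ideas from the comments could be sketched as follows. This is an illustration on invented toy data; the 0.40 cutoff mirrors the ">40%" example in the comments, and the `unmapped` check is an added safeguard rather than part of the original cell.

```python
import pandas as pd

# Toy data using two column names from the dataset; values are invented.
toy = pd.DataFrame({
    "ER Status": ["Positive", "negative", None, "Positive", "Positive"],
    "Cellularity": [None, None, None, "High", "Moderate"],
})

# Fraction of missing values per column.
missing_frac = toy.isnull().mean()

# Columns whose missingness exceeds the 40% threshold from the comments.
to_drop = missing_frac[missing_frac > 0.40].index.tolist()

# Encode ER status; anything outside the mapping (including NaN) becomes NaN.
er_encoded = toy["ER Status"].str.upper().map({"POSITIVE": 1, "NEGATIVE": 0})

# Rows that had a label in the raw column but lost it in the mapping.
unmapped = toy["ER Status"].notna() & er_encoded.isna()

print(missing_frac)
print("Candidates to drop:", to_drop)
print("Unmapped ER labels:", int(unmapped.sum()))
```

A nonzero `unmapped` count would point to spellings the mapping does not cover, which is worth resolving before any imputation.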

In [3]:
# 3. Check for Duplicates
  # - Remove duplicate patient records.
  # - Example:

df = df.drop_duplicates(subset=['Patient ID'])

**Step#3** This cell removes duplicate patient records, a necessary step for data integrity in clinical datasets. `df = df.drop_duplicates(subset=['Patient ID'])` compares rows on the 'Patient ID' column only and, with the default `keep='first'`, retains the first occurrence of each ID and discards later ones, even if their other fields differ. Removing repeats keeps a patient from being counted more than once in downstream analyses such as survival modeling or treatment outcome studies. Because the call drops rows silently, comparing `df.shape` before and after shows how many records were removed.
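Before dropping anything, it can help to look at the duplicated rows. A minimal sketch on toy data (the repeated ID is invented for illustration):

```python
import pandas as pd

# Toy frame with one repeated patient ID.
toy = pd.DataFrame({
    "Patient ID": ["MB-0000", "MB-0002", "MB-0002", "MB-0005"],
    "Age at Diagnosis": [75.65, 43.19, 43.19, 48.87],
})

# keep=False marks every row of a duplicated ID, so conflicting records can be reviewed.
dupes = toy[toy.duplicated(subset=["Patient ID"], keep=False)]

n_before = len(toy)
deduped = toy.drop_duplicates(subset=["Patient ID"])  # default keep="first"
n_removed = n_before - len(deduped)

print(dupes)
print(f"Removed {n_removed} duplicate row(s)")
```

If the rows sharing an ID disagree on clinical fields, keeping the first one is an arbitrary choice and may deserve a manual decision.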

In [4]:
# Step 4: Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(exclude=['number']).columns

**Step#4** This cell separates numeric from categorical columns so that each group can be cleaned with an appropriate method. `numeric_cols = df.select_dtypes(include=['number']).columns` collects every column with a numeric dtype, such as age, survival time, or mutation count, while `categorical_cols = df.select_dtypes(exclude=['number']).columns` collects the rest, typically labels such as cancer type, treatment, or patient status. The split is by dtype, not by meaning: 'ER Status', encoded as 0/1 in Step 2, lands in `numeric_cols`, and the 'Patient ID' identifier lands in `categorical_cols`. The two lists drive the next steps: mean imputation for numeric columns and mode imputation for categorical ones.
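Since the split follows dtype rather than clinical meaning, one option is to filter the lists before imputing. The sketch below uses a toy frame, and `do_not_impute` is a hypothetical exclusion list, not something the original notebook defines.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient ID": ["MB-0000", "MB-0002"],
    "Age at Diagnosis": [75.65, 43.19],
    "ER Status": [1, 0],            # already encoded as 0/1
    "Chemotherapy": ["No", "Yes"],
})

numeric_cols = toy.select_dtypes(include=["number"]).columns
categorical_cols = toy.select_dtypes(exclude=["number"]).columns

# Hypothetical exclusion list: identifiers and binary flags that should not
# be imputed like ordinary features.
do_not_impute = ["Patient ID", "ER Status"]
numeric_to_impute = [c for c in numeric_cols if c not in do_not_impute]
categorical_to_impute = [c for c in categorical_cols if c not in do_not_impute]

print(list(numeric_cols), list(categorical_cols))
print(numeric_to_impute, categorical_to_impute)
```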

In [5]:
# Step 5: Fill numeric columns with mean
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

**Step#5** This cell imputes missing values in the numeric columns. `df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())` replaces each NaN with the mean of its own column; because `fillna()` receives a Series of column means, values are matched by column name rather than filled with one global number. Mean imputation keeps each column's mean unchanged, but it shrinks the variance and weakens correlations with other variables, and the mean itself is sensitive to skew and outliers, which is why the Step 2 comments suggest the median for continuous values. The fill also applies to every numeric column without distinction: a 0/1 flag such as 'ER Status' receives a fractional value, and numeric outcome columns such as 'Overall Survival (Months)' are filled like any other feature, which deserves review before survival modeling. The step is still useful as a baseline, since many machine learning models cannot handle missing values directly.
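The effect of mean imputation on spread is easy to see on a small skewed series. This is an illustrative sketch with invented values; the column name is borrowed from the dataset only as a label.

```python
import pandas as pd

# Skewed toy series with two missing entries (values invented).
s = pd.Series([1.0, 2.0, 3.0, 100.0, None, None],
              name="Lymph nodes examined positive")

mean_filled = s.fillna(s.mean())      # fills with the mean of the observed values
median_filled = s.fillna(s.median())  # fills with the median of the observed values

print("mean before/after:", s.mean(), mean_filled.mean())
print("std  before/after:", s.std(), mean_filled.std())
print("median fill value:", s.median())
```

The mean is unchanged after filling while the standard deviation drops, and the median fill is far less affected by the single large value.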

In [6]:
# Step 6: Fill categorical columns with mode
for col in categorical_cols:
    if df[col].isnull().sum() > 0:  # only if missing values exist
        df[col] = df[col].fillna(df[col].mode()[0])

**Step#6** This cell fills missing values in the categorical columns. The loop visits each column in `categorical_cols`, checks for gaps with `isnull().sum() > 0`, and fills them with the mode, the most frequent value, via `fillna(df[col].mode()[0])`. When several values tie, `mode()` returns them sorted and `[0]` takes the first. Mode imputation keeps every value within the set of observed categories, but it inflates the share of the most common one, so for clinically important variables the "Unknown" placeholder from Step 2 may be the more honest choice. Two edge cases are worth knowing: `mode()[0]` raises an error on a column that is entirely missing, and identifier columns such as 'Patient ID' sit in `categorical_cols`, so a missing ID would be filled with another patient's ID.
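A guarded variant of the loop might look like the sketch below. It uses toy data, and the fallback to "Unknown" for an all-missing column is an assumption borrowed from the Step 2 comments, not behavior of the original cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "Cellularity": ["High", "High", "Moderate", None],
    "Hormone Therapy": [None, None, None, None],  # entirely missing
})

for col in toy.columns:
    modes = toy[col].mode()  # ignores NaN; empty when the column is all-missing
    # Fall back to a placeholder when no mode exists (assumed policy).
    fill_value = modes.iloc[0] if not modes.empty else "Unknown"
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

Using `.iloc[0]` behind an emptiness check avoids the error that `mode()[0]` would raise on the second column.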

In [7]:
# Step 7: Verify all missing values are handled
print(df.isnull().sum())

Patient ID                        0
Age at Diagnosis                  0
Type of Breast Surgery            0
Cancer Type                       0
Cancer Type Detailed              0
Cellularity                       0
Chemotherapy                      0
Pam50 + Claudin-low subtype       0
Cohort                            0
ER status measured by IHC         0
ER Status                         0
Neoplasm Histologic Grade         0
HER2 status measured by SNP6      0
HER2 Status                       0
Tumor Other Histologic Subtype    0
Hormone Therapy                   0
Inferred Menopausal State         0
Integrative Cluster               0
Primary Tumor Laterality          0
Lymph nodes examined positive     0
Mutation Count                    0
Nottingham prognostic index       0
Oncotree Code                     0
Overall Survival (Months)         0
Overall Survival Status           0
PR Status                         0
Radio Therapy                     0
Relapse Free Status (Months)

**Step#7** This cell verifies that no missing values remain. `print(df.isnull().sum())` prints the number of null (NaN) entries in each column of `df`, and zeros in every column confirm that the mean fill for numeric columns and the mode fill for categorical columns covered every gap. A zero count shows only that the cells are populated, not that the imputed values are clinically plausible, so range and category checks are still worthwhile before statistical analysis, survival modeling, or machine learning.
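A completeness check can be paired with a simple validity check. In this sketch the toy 'ER Status' column contains 0.5 to mimic what a mean-imputed binary flag could look like; the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age at Diagnosis": [75.65, 43.19, 59.42],
    "ER Status": [1.0, 0.0, 0.5],  # 0.5 mimics a mean-imputed 0/1 flag
})

# Completeness: no NaN anywhere in the frame.
no_missing = int(toy.isnull().sum().sum()) == 0

# Validity: a binary flag should contain only 0 or 1.
bad_er_rows = toy[~toy["ER Status"].isin([0, 1])]

print("No missing values:", no_missing)
print("Rows with invalid ER Status:", len(bad_er_rows))
```

Here the frame passes the null check yet still holds a value that has no clinical meaning.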

In [8]:
# Step 8: (Optional) Save cleaned dataset back to Drive
df.to_csv('/content/drive/MyDrive/METABRIC_cleaned_imputed.csv', index=False)

**Step#8** This optional cell saves the cleaned, imputed dataset to Google Drive for later use or sharing. `df.to_csv('/content/drive/MyDrive/METABRIC_cleaned_imputed.csv', index=False)` writes `df` to a CSV file named "METABRIC_cleaned_imputed.csv" in the Drive folder; `index=False` omits the row index so it does not reappear as an extra column when the file is read back. Writing to a new file name leaves the raw CSV untouched, which supports reproducibility and traceability, and collaborators can load the processed data without rerunning the pipeline.
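A quick round trip confirms that the saved file reloads with the same shape and columns. The sketch writes to a temporary directory instead of Drive, and the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Patient ID": ["MB-0000", "MB-0002"],
    "Age at Diagnosis": [75.65, 43.19],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cleaned.csv")  # hypothetical file name
    toy.to_csv(path, index=False)                # no index column in the file
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Without `index=False`, the reloaded frame would carry an extra unnamed column and the shape check would fail.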