In [1]:
#1 Mount Google Drive / Connect to Google Drive for File Access
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Step#1:** This step mounts the user's Google Drive in the Colab runtime so the notebook can read and write files stored in Drive. Importing the drive module from google.colab and calling drive.mount('/content/drive') exposes the Drive contents under /content/drive once the user has authorized access. Later steps rely on this mount to load the dataset and save the output. The message "Mounted at /content/drive" confirms that the mount succeeded.

In [2]:
#2 Load dataset / Import Cleaned FSGS Dataset into DataFrame
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/fsgs_dataset_cleaned.csv')

**Step#2:** This step loads the cleaned FSGS dataset with pandas. After importing pandas as pd, pd.read_csv() reads fsgs_dataset_cleaned.csv from the mounted Drive into a DataFrame named df, the tabular structure used for exploration, filtering, and analysis in the remaining steps. The path begins with /content/drive/MyDrive, so this cell only works after the mount in Step #1 has succeeded.
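
A quick sanity check right after loading can catch a wrong file or unexpected gaps early. The sketch below is illustrative only: it reads a small in-memory CSV with made-up column names and values instead of the real Drive file, then inspects the shape and the number of missing values per column.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV; column names and values are invented.
csv_text = "Patient_ID,Age,Sex\n101,34,F\n102,,M\n"

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# Basic checks: dimensions and missing values per column.
n_rows, n_cols = toy.shape
missing_per_column = toy.isna().sum()
print(n_rows, n_cols)
print(missing_per_column)
```

The same two checks (`df.shape` and `df.isna().sum()`) can be run on the real DataFrame immediately after Step #2.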

In [3]:
#3 Drop irrelevant columns / Remove Non-Analytic Columns for Clean Modeling
df.drop(columns=['Patient_ID', 'Notes'], inplace=True, errors='ignore')

**Step#3:** This step removes columns that are not useful for analysis. The call df.drop(columns=['Patient_ID', 'Notes'], inplace=True, errors='ignore') deletes the 'Patient_ID' and 'Notes' columns from the DataFrame df. The inplace=True argument modifies df directly instead of returning a new DataFrame, while errors='ignore' prevents a KeyError if either column is missing. Dropping the identifier and the free-text notes matters here because an ID carries no predictive signal, and unstructured text would otherwise be expanded into many indicator columns by the encoding in Step #4.
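
The effect of errors='ignore' is easiest to see on a frame where one of the listed columns does not exist. This is a minimal sketch with a made-up two-row DataFrame, not the FSGS data.

```python
import pandas as pd

# Toy frame with invented values; 'Notes' is deliberately absent.
toy = pd.DataFrame({"Patient_ID": [101, 102], "Age": [34, 51]})

# errors='ignore' skips labels that do not exist instead of raising KeyError.
toy.drop(columns=["Patient_ID", "Notes"], inplace=True, errors="ignore")

remaining = list(toy.columns)
print(remaining)
```

Without errors='ignore', the same call would raise a KeyError for the missing 'Notes' label.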

In [4]:
#4 Encode categorical variables / One-Hot Encode Categorical Features for Modeling
df = pd.get_dummies(df, drop_first=True)

**Step#4:** This step converts categorical variables into a numeric form that machine learning models can consume. pd.get_dummies(df, drop_first=True) replaces each categorical (object, string, or category dtype) column with binary indicator columns, a process known as one-hot encoding; numeric columns pass through unchanged. With drop_first=True, the first category of each feature is omitted and becomes the implicit baseline, so a feature with k categories yields k - 1 indicators. This avoids the perfect multicollinearity (the "dummy variable trap") that would otherwise affect regression-style models.
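
To make the k - 1 behaviour of drop_first=True concrete, here is a small sketch on invented data. The column names Sex and Stage are hypothetical and are not taken from the FSGS dataset.

```python
import pandas as pd

# Toy frame: two categorical columns and one numeric column (invented values).
toy = pd.DataFrame({
    "Sex": ["F", "M", "F"],
    "Stage": ["I", "II", "III"],
    "Age": [34, 51, 47],
})

# Categorical columns become indicator columns; the first category of each
# ('F' for Sex, 'I' for Stage) is dropped and acts as the baseline.
encoded = pd.get_dummies(toy, drop_first=True)

encoded_columns = list(encoded.columns)
print(encoded_columns)
```

Sex has two categories and yields one indicator; Stage has three and yields two. The numeric Age column is left as it is.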

In [5]:
#5 Configure Median-Based Imputation for Missing Values
from sklearn.impute import SimpleImputer
import pandas as pd # Ensure pandas is imported

imputer = SimpleImputer(strategy='median')

**Step#5:** This step configures how missing values will be handled. Importing SimpleImputer from sklearn.impute and instantiating it with strategy='median' creates an imputer that, once fitted in Step #6, replaces missing entries in each numeric column with that column's median. The median is less sensitive to outliers than the mean, so a few extreme values do not distort the filled-in entries. The repeated import pandas as pd is redundant, since pandas was already imported in Step #2, but harmless.

In [6]:
#6 Fit and transform the data, which returns a NumPy array / Apply Median Imputation to Fill Missing Values
imputed_data = imputer.fit_transform(df)



**Step#6:** This step applies the median imputation configured in Step #5. imputer.fit_transform(df) first fits the SimpleImputer, computing the median of the observed values in every column, and then replaces each missing entry with its column's median. The result, imputed_data, is a NumPy array rather than a DataFrame, so the column names and index are lost at this point and are restored in Steps #7 and #8. After this step the data contains no missing values.
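
The following sketch shows median imputation on a tiny invented dataset, including an outlier to illustrate why the median is a reasonable choice. The column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (invented): one gap per column, plus an outlier in 'Creatinine'.
toy = pd.DataFrame({
    "Creatinine": [1.0, np.nan, 3.0, 100.0],
    "Age": [30.0, 40.0, np.nan, 50.0],
})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy)  # returns a NumPy array, not a DataFrame

# Medians learned during fit, one per column.
learned_medians = imputer.statistics_
print(learned_medians)
print(filled)
```

The gap in Creatinine is filled with 3.0, the median of 1, 3, and 100; a mean-based imputer would have filled it with roughly 34.7 because of the outlier.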

In [7]:
#7 Get the names of the features that were actually processed and output by the imputer / Extract Imputed Feature Names
# This accounts for columns skipped due to being all NaN.
processed_columns = imputer.get_feature_names_out(df.columns)

**Step#7:** This step retrieves the names of the columns that the imputer actually output, using imputer.get_feature_names_out(df.columns). By default, SimpleImputer drops any column that was entirely NaN during fitting, because no median can be computed for it, so imputed_data may have fewer columns than df. get_feature_names_out reflects that exclusion by returning only the names of the retained features, which keeps the column labels aligned with the columns of the imputed array.

In [8]:
#8 Create a new DataFrame from the imputed NumPy array / Rebuild DataFrame with Imputed Values
# The column names and index are set explicitly so they match the original data.
df = pd.DataFrame(imputed_data, columns=processed_columns, index=df.index)

**Step#8:** This step reconstructs a pandas DataFrame from the imputed NumPy array (imputed_data) by explicitly assigning the correct column names (processed_columns) and preserving the original row index (df.index). This ensures that the transformed data remains aligned with the original dataset structure, which is critical for downstream analysis, merging, or interpretation. Without this step, the imputed data would lack meaningful labels and indexing, making it harder to trace or validate.

In [9]:
#9 Normalize features / Standardize Features with Z-Score Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[df.columns] = scaler.fit_transform(df)

**Step#9:** This step standardizes every column in the DataFrame using scikit-learn's StandardScaler, which subtracts each column's mean and divides by its standard deviation so that each column has a mean of 0 and a standard deviation of 1. Standardization matters for algorithms that are sensitive to the scale of input features, such as logistic regression, support vector machines, and k-nearest neighbors. Because scaler.fit_transform(df) is applied to the whole DataFrame, the one-hot indicator columns from Step #4 are scaled as well, along with the outcome column if it is still present in df. Assigning the result to df[df.columns] replaces the original values while preserving the column names and index.
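
The z-score transformation can be verified on a toy frame by checking the resulting column means and standard deviations. The data below is invented. Note that StandardScaler uses the population standard deviation (ddof=0), whereas pandas' std() defaults to the sample version (ddof=1), so the check passes ddof=0 explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric data (invented values).
toy = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0],
    "Weight": [60.0, 70.0, 95.0],
})

scaler = StandardScaler()
toy[toy.columns] = scaler.fit_transform(toy)  # same pattern as Step #9

# After scaling, each column should have mean ~0 and population std ~1.
col_means = toy.mean().to_numpy()
col_stds = toy.std(ddof=0).to_numpy()
print(np.round(col_means, 6), np.round(col_stds, 6))
```

The fitted scaler keeps the original means and scales in scaler.mean_ and scaler.scale_, which is what allows the same transformation to be applied to new data with scaler.transform.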

In [10]:
#10 Save preprocessed data / Export Final Dataset to CSV
df.to_csv('/content/drive/MyDrive/projects/fsgs/notebooks/fsgs_dataset_preprocessed.csv', index=False)


**Step#10:** This step completes the preprocessing pipeline by writing the transformed DataFrame to a CSV file with df.to_csv(...). The file is saved under projects/fsgs/notebooks in the user's Google Drive, a folder that must already exist because to_csv does not create missing directories. The index=False argument omits the row index from the file. The saved file captures all prior transformations (column removal, encoding, imputation, and scaling), so downstream modeling can start from it without re-running the earlier steps.
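
The effect of index=False shows up when the file is read back. This sketch writes an invented frame to a temporary directory instead of Google Drive; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame (invented values) with a non-default index.
toy = pd.DataFrame({"Age": [0.5, -0.5], "Weight": [1.0, -1.0]}, index=[10, 20])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_preprocessed.csv")
    toy.to_csv(out_path, index=False)  # the index is not written to the file
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
print(list(reloaded.index))
```

The reloaded frame has only the two data columns and a fresh 0-based index. With the default index=True, the old index would come back as an extra unnamed column unless index_col were set when reading.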