<a href="https://colab.research.google.com/github/tribeop/ML-from-scratch/blob/main/data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Data Loading


In [None]:
import pandas as pd
df = pd.read_csv("/content/Data.csv")
display(df.head())

## 2. Numerical Imputation
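This step fills missing values in the numerical columns `Age` and `Salary` with each column's median. `SimpleImputer(strategy="median")` learns one median per column during `fit` and substitutes it for every missing entry during `transform`. The median is less sensitive to outliers than the mean, which matters for skewed columns such as salaries. Note that in this cell the imputer is fitted on the whole dataset for illustration; the full implementations later in the notebook fit it on the training split only.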


In [None]:
from sklearn.impute import SimpleImputer

num_cols = ["Age", "Salary"]

imputer = SimpleImputer(strategy="median")
df_num_imputed = df.copy()
df_num_imputed[num_cols] = imputer.fit_transform(df[num_cols])

# Print each frame on its own lines so the column headers stay aligned
print("original_df:\n", df[num_cols])
print("imputed_df:\n", df_num_imputed[num_cols])
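
To make the median strategy concrete, here is a minimal from-scratch sketch of what median imputation computes for a single column. The helper `impute_median` and the toy `ages` list are illustrative assumptions; they are not part of the notebook's data or of scikit-learn.

```python
import math
import statistics

def impute_median(values):
    """Replace NaN entries with the median of the observed (non-NaN) entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = statistics.median(observed)
    return [fill if math.isnan(v) else v for v in values], fill

# Hypothetical toy column with one missing value.
ages = [44.0, 27.0, float("nan"), 38.0, 35.0]
imputed, fill_value = impute_median(ages)

print(fill_value)  # 36.5, the median of 27, 35, 38, 44
print(imputed)     # [44.0, 27.0, 36.5, 38.0, 35.0]
```

Only the observed values take part in the median, so the missing entry itself never influences the fill value.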

## 3. One-Hot Encoding

This step one-hot encodes the categorical `Country` column. `OneHotEncoder` creates one binary column per unique category, turning nominal data into a numerical representation that machine learning algorithms can consume. `sparse_output=False` returns a dense NumPy array instead of a SciPy sparse matrix (this parameter name requires scikit-learn 1.2 or later; earlier releases call it `sparse`). `handle_unknown="ignore"` encodes a category that was not seen during `fit` as an all-zero row instead of raising an error. The original `Country` column is then dropped.

In [None]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

country_ohe = ohe.fit_transform(df_num_imputed[["Country"]])
ohe_cols = ohe.get_feature_names_out(["Country"])

# Reuse the index of the frame that was encoded so the concat below aligns row by row
df_country_ohe = pd.DataFrame(country_ohe, columns=ohe_cols, index=df_num_imputed.index)
df_encoded = pd.concat([df_num_imputed, df_country_ohe], axis=1)

df_encoded = df_encoded.drop(columns="Country")
display(df_encoded)
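
The encoding itself is simple enough to sketch by hand. The following is a rough illustration of the fit/transform split and of the all-zero row produced for an unseen category; `fit_categories`, `one_hot`, and the toy country list are hypothetical names and data, not scikit-learn's implementation.

```python
def fit_categories(values):
    """Learn the sorted unique categories, as an encoder's fit step would."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a 0/1 row; an unseen category yields an all-zero row."""
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

# Hypothetical training values for the Country column.
train_countries = ["France", "Spain", "Germany", "Spain"]
categories = fit_categories(train_countries)

encoded_train = one_hot(train_countries, categories)
encoded_new = one_hot(["Italy"], categories)  # "Italy" was never seen during fit

print(categories)     # ['France', 'Germany', 'Spain']
print(encoded_train)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(encoded_new)    # [[0.0, 0.0, 0.0]]
```

The all-zero row mirrors the behaviour described for `handle_unknown="ignore"`: the unseen value is not an error, but it also carries no category information.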

## 4. Train-Test Split

This step divides the dataset into a training set (80%) and a testing set (20%). The model is trained on the training set, and the testing set is held out to evaluate performance on unseen data. The split does not prevent overfitting by itself, but it makes overfitting detectable: a model that scores much better on the training set than on the testing set is not generalizing. `random_state=42` makes the shuffle reproducible, and `stratify=y` keeps the proportion of each `Purchased` class approximately equal in both subsets.

In [None]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns="Purchased")
y = df_encoded["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

print("X_train head:\n", X_train.head())
print("X_test head:\n", X_test.head())
print("y_train head:\n", y_train.head())
print("y_test head:\n", y_test.head())
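
The idea behind stratification can be sketched without scikit-learn: split each class separately, then merge the parts. This is only an illustration of the principle under assumed toy labels; `stratified_split` is a hypothetical helper and does not reproduce the shuffling that `train_test_split` performs internally.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)                       # shuffle within the class
        n_test = round(len(idx) * test_size)   # per-class test count
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical balanced target: 10 "No" and 10 "Yes".
labels = ["No"] * 10 + ["Yes"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Because each class contributes 20% of its own rows to the test set, both subsets keep the original 50/50 class balance regardless of how the shuffle falls.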

## 5. Numerical Feature Scaling

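This step standardizes the numerical features with `StandardScaler`, which subtracts each column's mean and divides by its standard deviation. After scaling, `Age` and `Salary` are on comparable scales, so a feature with large raw values such as `Salary` does not dominate scale-sensitive models. The scaler is fitted on the training data only and then used to transform both the training and testing sets, so no statistics from the testing set influence the transformation.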

In [None]:
from sklearn.preprocessing import StandardScaler

num_cols = ["Age", "Salary"]

scaler = StandardScaler()
scaler.fit(X_train[num_cols])

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[num_cols] = scaler.transform(X_train[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test[num_cols])

print("Scaled X_train numerical features:\n", X_train_scaled[num_cols])
print("Scaled X_test numerical features:\n", X_test_scaled[num_cols])
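
As a rough from-scratch sketch of the same computation, standardization maps each value x to z = (x - mean) / std, where the mean and the population standard deviation come from the training values only. The helper names and salary figures below are assumptions chosen so the arithmetic is easy to check by hand.

```python
import statistics

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training values only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)  # population std (divides by n)
    return mean, std

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical salaries: training mean is 50000, population std is 20000.
train_salary = [20000.0, 40000.0, 40000.0, 40000.0, 50000.0, 50000.0, 70000.0, 90000.0]
test_salary = [70000.0]

mean, std = fit_standardizer(train_salary)
train_scaled = standardize(train_salary, mean, std)
test_scaled = standardize(test_salary, mean, std)  # reuses the training statistics

print(mean, std)
print(train_scaled)
print(test_scaled)
```

The test value is scaled with the training mean and standard deviation, which is the same fit-on-train, transform-both pattern the cell above follows.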

## Full Implementation - Linear Approach

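This cell runs the whole preprocessing workflow manually and in sequence: train-test split, imputation, one-hot encoding, and scaling. Unlike the step-by-step cells above, it splits first and fits every transformer on the training set only, then applies the fitted transformers to the testing set. It also uses a 75/25 split (`test_size=0.25`) instead of 80/20.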

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("/content/Data.csv")
X = df.drop(columns="Purchased")
y = df["Purchased"]

num_cols = ["Age", "Salary"]
cat_cols = ["Country"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

num_imp = SimpleImputer(strategy="median")
X_train_num = pd.DataFrame(num_imp.fit_transform(X_train[num_cols]),
                           columns=num_cols, index=X_train.index)
X_test_num  = pd.DataFrame(num_imp.transform(X_test[num_cols]),
                           columns=num_cols, index=X_test.index)

ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
ohe_train = ohe.fit_transform(X_train[cat_cols])
ohe_test  = ohe.transform(X_test[cat_cols])
ohe_cols = ohe.get_feature_names_out(cat_cols)

X_train_cat = pd.DataFrame(ohe_train, columns=ohe_cols, index=X_train.index)
X_test_cat  = pd.DataFrame(ohe_test,  columns=ohe_cols, index=X_test.index)

X_train_prep = pd.concat([X_train_num, X_train_cat], axis=1)
X_test_prep  = pd.concat([X_test_num,  X_test_cat],  axis=1)

scaler = StandardScaler()
scaler.fit(X_train_prep[num_cols])

X_train_prep[num_cols] = scaler.transform(X_train_prep[num_cols])
X_test_prep[num_cols]  = scaler.transform(X_test_prep[num_cols])

print("Train prepared shape:", X_train_prep.shape)
display(X_train_prep.head())
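
Splitting before fitting matters because any statistic learned from the full dataset is partly determined by the testing rows. A small sketch with made-up salaries (all values here are hypothetical) shows how the imputation median changes when testing rows are included:

```python
import statistics

# Hypothetical salaries after a split has already been made.
train_salary = [48000.0, 52000.0, 54000.0]
test_salary = [83000.0, 79000.0]

# Median learned from the training rows only (what the cell above does).
median_train_only = statistics.median(train_salary)

# Median learned from all rows (what happens if the imputer is fitted before splitting).
median_all_rows = statistics.median(train_salary + test_salary)

print(median_train_only)  # 52000.0
print(median_all_rows)    # 54000.0
```

With the second median, every imputed training value would carry information from rows the model is later evaluated on, which can make the evaluation look better than it should.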

## Full Implementation - Pipeline Approach

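This cell builds the same preprocessing with scikit-learn's `Pipeline` and `ColumnTransformer`. `num_pipe` chains imputation and scaling for the numerical columns, `cat_pipe` one-hot encodes the categorical column, and `ColumnTransformer` routes each group of columns to its pipeline and concatenates the results. Calling `fit_transform` on the training set and `transform` on the testing set guarantees that every step is fitted on training data only. Two details differ from the manual version: the target is mapped to 0/1, and the encoder uses `drop="first"`, which omits the first category's column because it is implied by the others. With `drop="first"` combined with `handle_unknown="ignore"`, an unseen category is encoded as all zeros, which is indistinguishable from the dropped category.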

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("/content/Data.csv")

y = df["Purchased"].map({"No": 0, "Yes": 1})
X = df.drop(columns="Purchased")

num_cols = ["Age", "Salary"]
cat_cols = ["Country"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_pipe = Pipeline([
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first"))
])

preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

feature_names = preprocessor.get_feature_names_out()

X_train_df = pd.DataFrame(
    X_train_preprocessed,
    columns=feature_names,
    index=X_train.index
)

X_test_df = pd.DataFrame(
    X_test_preprocessed,
    columns=feature_names,
    index=X_test.index
)

print("Preprocessed X_train DataFrame head:")
display(X_train_df.head())

print("\nPreprocessed X_test DataFrame head:")
display(X_test_df.head())
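
To show what chaining steps amounts to, here is a minimal single-column sketch of the fit/transform contract that a pipeline relies on. `MedianImputer`, `Standardizer`, and `MiniPipeline` are hypothetical classes written for this illustration; they are not scikit-learn's implementation, and the age values are made up.

```python
import math
import statistics

class MedianImputer:
    """Fill NaN with the median learned during fit."""
    def fit(self, values):
        observed = [v for v in values if not math.isnan(v)]
        self.fill_ = statistics.median(observed)
        return self

    def transform(self, values):
        return [self.fill_ if math.isnan(v) else v for v in values]

class Standardizer:
    """Scale with the mean and population std learned during fit."""
    def fit(self, values):
        self.mean_ = statistics.mean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

class MiniPipeline:
    """Run steps in order; each step is fitted on the previous step's output."""
    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        for step in self.steps:  # no fitting here: reuse learned statistics
            values = step.transform(values)
        return values

pipe = MiniPipeline([MedianImputer(), Standardizer()])
train_age = [30.0, float("nan"), 40.0, 50.0]  # hypothetical training column
test_age = [float("nan"), 60.0]               # hypothetical testing column

train_out = pipe.fit_transform(train_age)
test_out = pipe.transform(test_age)
print(train_out)
print(test_out)
```

The missing testing value is filled with the training median (40.0) and therefore scales to 0.0, since 40.0 is also the mean of the imputed training column.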