# 33_data_preprocessing_data_split.ipynb

To build a reliable and generalizable machine learning model, the dataset is split into three parts:

- Training Set: used to fit the model and adjust its internal parameters (e.g., weights in a neural network or splits in a decision tree). The model learns from this subset.

- Validation Set: used to tune hyperparameters and select the best-performing model. It acts as a proxy for unseen data during model development.

- Test Set: held out completely from training and validation. It is used only at the final stage to evaluate the model’s generalization performance on truly unseen data.

Training : Validation : Test = 6 : 2 : 2 (60% / 20% / 20% of the rows)
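Because `train_test_split` produces only two parts per call, a 6:2:2 split takes two chained calls, and the second `test_size` is a share of the remainder, not of the full dataset. The sketch below works through that arithmetic. The helper name `two_stage_test_sizes` is hypothetical, the row count is the sum of the shapes printed in this notebook, and the rounding-up of the held-out share is an assumption that happens to match those shapes.

```python
import math


def two_stage_test_sizes(train, val, test):
    """Return the test_size values for two chained two-way splits."""
    total = train + val + test
    first = test / total          # share of all rows held out as the test set
    second = val / (train + val)  # share of the remaining rows held out for validation
    return first, second


first, second = two_stage_test_sizes(6, 2, 2)
print(first, second)  # 0.2 0.25

# Row counts for this notebook: 2951 + 984 + 984 = 4919 rows in total.
n_rows = 4919
n_test = math.ceil(first * n_rows)             # assumption: held-out share rounds up
n_val = math.ceil(second * (n_rows - n_test))  # applied to the 80% that remains
n_train = n_rows - n_test - n_val
print(n_train, n_val, n_test)  # 2951 984 984
```

The second value is 0.25 rather than 0.2 because 20% of the whole dataset equals 25% of the 80% left after the test set is removed.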

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

# === 1. Load formatted input data ===
df_no_transform = pd.read_csv("../../data/processed/train_dataset_formatted_no_missing.csv")

# === 2. Split features and label ===
X_no_transform = df_no_transform.drop(columns=["label", "longitude", "latitude"])
y_no_transform = df_no_transform["label"]

# === 3. Train / Validation / Test split ===
# First split off the test set (80% train_val, 20% test)
X_train_val_no_transform, X_test_no_transform, y_train_val_no_transform, y_test_no_transform = train_test_split(
    X_no_transform, y_no_transform, test_size=0.2, random_state=42, stratify=y_no_transform
)

# Then split train and val (75% train, 25% val of the remainder → overall 60/20/20)
X_train_no_transform, X_val_no_transform, y_train_no_transform, y_val_no_transform = train_test_split(
    X_train_val_no_transform, y_train_val_no_transform, test_size=0.25, random_state=42, stratify=y_train_val_no_transform
)

# Print shape summary
print("Train:", X_train_no_transform.shape)
print("Validation:", X_val_no_transform.shape)
print("Test:", X_test_no_transform.shape)

# === 4. Save to disk ===
X_train_no_transform.to_csv("../../data/processed/no_transformed/X_train_no_transform.csv", index=False)
X_val_no_transform.to_csv("../../data/processed/no_transformed/X_val_no_transform.csv", index=False)
X_test_no_transform.to_csv("../../data/processed/no_transformed/X_test_no_transform.csv", index=False)

y_train_no_transform.to_csv("../../data/processed/no_transformed/y_train_no_transform.csv", index=False)
y_val_no_transform.to_csv("../../data/processed/no_transformed/y_val_no_transform.csv", index=False)
y_test_no_transform.to_csv("../../data/processed/no_transformed/y_test_no_transform.csv", index=False)

print("Train/Validation/Test datasets with no transformation saved.")


Train: (2951, 19)
Validation: (984, 19)
Test: (984, 19)
Train/Validation/Test datasets with no transformation saved.
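The shapes confirm the 60/20/20 row counts, but not that `stratify` kept the class balance in each subset. A minimal sketch of that check follows. It runs on synthetic data as a stand-in for the real file, so the single `feature` column and the 80/20 class mix are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real dataset (hypothetical data)
rng = np.random.default_rng(0)
y = pd.Series(rng.choice([0, 1], size=1000, p=[0.8, 0.2]), name="label")
X = pd.DataFrame({"feature": rng.normal(size=1000)})

# Same two-stage stratified split as in the cell above
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)

# Class proportions of each subset next to those of the full dataset
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "val": y_val.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
}).sort_index()
print(proportions.round(3))

# The three subsets should not share any rows
overlap = (set(X_train.index) & set(X_val.index)) | (set(X_train.index) & set(X_test.index))
print("Overlapping rows:", len(overlap))
```

Running the same comparison on `y_train_no_transform`, `y_val_no_transform`, and `y_test_no_transform` would show whether the saved subsets mirror the label distribution of the full file.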


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# === 1. Load formatted input data ===
df_transform = pd.read_csv("../../data/processed/train_dataset_log1p_transformed.csv")

# === 2. Split features and label ===
X_transform = df_transform.drop(columns=["label", "longitude", "latitude"])
y_transform = df_transform["label"]

# === 3. Train / Validation / Test split ===
# First split off the test set (80% train_val, 20% test)
X_train_val_transform, X_test_transform, y_train_val_transform, y_test_transform = train_test_split(
    X_transform, y_transform, test_size=0.2, random_state=42, stratify=y_transform
)

# Then split train and val (75% train, 25% val of the remainder → overall 60/20/20)
X_train_transform, X_val_transform, y_train_transform, y_val_transform = train_test_split(
    X_train_val_transform, y_train_val_transform, test_size=0.25, random_state=42, stratify=y_train_val_transform
)

# Print shape summary
print("Train:", X_train_transform.shape)
print("Validation:", X_val_transform.shape)
print("Test:", X_test_transform.shape)

# === 4. Save to disk ===
X_train_transform.to_csv("../../data/processed/transformed/X_train_transform.csv", index=False)
X_val_transform.to_csv("../../data/processed/transformed/X_val_transform.csv", index=False)
X_test_transform.to_csv("../../data/processed/transformed/X_test_transform.csv", index=False)

y_train_transform.to_csv("../../data/processed/transformed/y_train_transform.csv", index=False)
y_val_transform.to_csv("../../data/processed/transformed/y_val_transform.csv", index=False)
y_test_transform.to_csv("../../data/processed/transformed/y_test_transform.csv", index=False)

print("Train/Validation/Test datasets with log1p transformation saved.")


Train: (2951, 19)
Validation: (984, 19)
Test: (984, 19)
Train/Validation/Test datasets with log1p transformation saved.
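Both cells use `random_state=42` and stratify on `label`, and both report identical shapes. If the two CSV files list the same rows in the same order with the same labels (an assumption this notebook does not verify), the untransformed and log1p-transformed subsets should then contain the same rows, which keeps later model comparisons fair. The sketch below illustrates the idea on synthetic data. The `value` column, the `split_indices` helper, and the 70/30 class mix are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the two input files: same rows, same order, same labels
rng = np.random.default_rng(1)
raw = pd.DataFrame({
    "value": rng.exponential(scale=10.0, size=500),
    "label": rng.choice([0, 1], size=500, p=[0.7, 0.3]),
})
logged = raw.assign(value=np.log1p(raw["value"]))


def split_indices(df):
    """Apply the two-stage stratified split and return the row index of each subset."""
    X, y = df.drop(columns=["label"]), df["label"]
    X_train_val, X_test, y_train_val, _ = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    X_train, X_val, _, _ = train_test_split(
        X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
    )
    return X_train.index, X_val.index, X_test.index


# Compare train, validation, and test membership between the two variants
same = all(a.equals(b) for a, b in zip(split_indices(raw), split_indices(logged)))
print("Identical row assignment:", same)
```

If the real files differ in row order or row count, this guarantee no longer holds, and splitting once on a shared index before applying the transformation would be the safer route.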
