### TODO: Feature Engineering

#### Primary tasks (must-have)
- [ ] Document which raw columns are dropped/kept and why (e.g., `high_corr_features` list).
- [ ] Encode categoricals consistently. Tip: persist the fitted encoder (e.g., `OneHotEncoder(handle_unknown="ignore")`) so inference matches training.
- [ ] Normalize/standardize numeric ranges (`StandardScaler`, `MinMaxScaler`) and persist the scaler params.
- [ ] Capture any derived features (aggregations, ratios) with short examples so later notebooks can regenerate them.
- [ ] Drop socket/identity columns (`IPV4_SRC_ADDR`, `IPV4_DST_ADDR`, `L4_SRC_PORT`, `Timestamp`) to prevent data leakage and keep the model from memorizing specific hosts or capture times.
- [ ] Apply a log transform (`np.log1p`) to heavily skewed numeric features (e.g., `IN_BYTES`, `OUT_BYTES`, `FLOW_DURATION_MS`) before scaling.
- [ ] Implement a class-balancing strategy (e.g., SMOTE or class weights) to address the extreme scarcity of attack labels relative to benign traffic.
- [ ] Lock test/val sets: isolate the test and validation sets immediately after the stratified split; they should never "see" a balancer like SMOTE.
- [ ] Fit only on train: call `.fit()` strictly on the training split for all scalers and encoders to establish the processing "rules."
- [ ] Transform-only for val/test: use `.transform()` to apply the training "rules" to the validation and test sets without recalculating parameters.
- [ ] Oversample post-split: apply SMOTE or other balancing techniques strictly to the training data, so that no synthetic sample is interpolated from validation or test rows.
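
The drop, log-transform, split, and fit/transform items above can be sketched together as follows. This is a minimal sketch on synthetic data, not the project's actual pipeline: the `PROTOCOL` and `Label` column names, the 70/15/15 split, and the random values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the flow table. PROTOCOL and Label are assumed names.
df = pd.DataFrame({
    "IPV4_SRC_ADDR": [f"10.0.0.{i % 250}" for i in range(n)],
    "IPV4_DST_ADDR": [f"192.168.1.{i % 250}" for i in range(n)],
    "L4_SRC_PORT": rng.integers(1024, 65535, n),
    "Timestamp": pd.date_range("2024-01-01", periods=n, freq="min"),
    "IN_BYTES": rng.lognormal(8, 2, n),
    "OUT_BYTES": rng.lognormal(7, 2, n),
    "FLOW_DURATION_MS": rng.lognormal(5, 2, n),
    "PROTOCOL": rng.choice(["tcp", "udp", "icmp"], n),
    "Label": np.r_[np.ones(20, dtype=int), np.zeros(n - 20, dtype=int)],
})

ID_COLS = ["IPV4_SRC_ADDR", "IPV4_DST_ADDR", "L4_SRC_PORT", "Timestamp"]
SKEWED = ["IN_BYTES", "OUT_BYTES", "FLOW_DURATION_MS"]
CATEGORICAL = ["PROTOCOL"]

# 1. Drop socket/identity columns and separate the label.
X = df.drop(columns=ID_COLS + ["Label"])
y = df["Label"]

# 2. log1p learns no parameters, so applying it before the split leaks nothing.
X[SKEWED] = np.log1p(X[SKEWED])

# 3. Stratified split: 70% train, 15% validation, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

# 4. Fit the scaler and encoder on the training split only.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), SKEWED),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ],
    sparse_threshold=0,  # always return a dense array
)
X_train_t = preprocess.fit_transform(X_train)

# 5. Validation and test data are only transformed with the training parameters.
X_val_t = preprocess.transform(X_val)
X_test_t = preprocess.transform(X_test)

print(X_train_t.shape, X_val_t.shape, X_test_t.shape)
```

The fitted `preprocess` object holds both the scaler parameters and the encoder categories, so persisting that single object (for example with `joblib.dump`) is one way to keep inference aligned with training.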

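For the class-balancing item, class weights are one option that never creates synthetic rows, so the validation and test sets stay untouched by construction. The sketch below uses scikit-learn's `compute_class_weight` in place of SMOTE; the label counts (95 benign, 5 attack) are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels after the split: 0 = benign, 1 = attack.
y_train = np.array([0] * 95 + [1] * 5)

# "balanced" weights are n_samples / (n_classes * count_of_class),
# computed from the training labels only.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

Many scikit-learn estimators, including `RandomForestClassifier`, accept such a dict through their `class_weight` parameter.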
#### Secondary tasks (second essentials)
- [ ] Attach MITRE ATT&CK tactic/technique tags as engineered features (Tip: keep a keyword→technique mapping dict and serialize it for inference).
- [ ] [Supervised ML] Preserve TF-IDF vocabulary/features used by the RandomForest text pipeline so SHAP explains identical dimensions (Tip: dump `tfidf.get_feature_names_out()` with the encoder artifacts).
- [ ] [Unsupervised ML] Assemble the Isolation-Forest feature frame (`Hour`, `Src_IP_LastOctet`, etc.) and record any imputations so runtime anomaly detection feeds match training.
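
The keyword→technique mapping from the MITRE ATT&CK item could look roughly like this. The keywords, the helper names `tag_techniques` and `to_feature_row`, and the `mitre_` column prefix are hypothetical; the three technique IDs are real ATT&CK identifiers used only as examples.

```python
import json

# Illustrative mapping only: keywords are assumptions, IDs are ATT&CK techniques
# (T1110 Brute Force, T1046 Network Service Discovery, T1566 Phishing).
KEYWORD_TO_TECHNIQUE = {
    "brute force": "T1110",
    "port scan": "T1046",
    "phishing": "T1566",
}


def tag_techniques(text, mapping):
    """Return the sorted technique IDs whose keyword occurs in text (case-insensitive)."""
    lowered = text.lower()
    return sorted({tech for keyword, tech in mapping.items() if keyword in lowered})


def to_feature_row(text, mapping):
    """Build one binary indicator per technique ID, in a fixed (sorted) column order."""
    hits = set(tag_techniques(text, mapping))
    return {f"mitre_{tech}": int(tech in hits) for tech in sorted(set(mapping.values()))}


# Serialize the mapping so inference uses exactly the same keywords and columns.
serialized = json.dumps(KEYWORD_TO_TECHNIQUE, sort_keys=True)
restored = json.loads(serialized)

row = to_feature_row("SSH Brute Force followed by a port scan", restored)
print(row)
```

Deriving the columns from the sorted technique IDs keeps the feature order stable between training and inference, as long as the same serialized mapping is loaded.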

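For the Isolation-Forest item, one way to keep runtime inputs aligned with training is to derive the features in a single function and record the imputation values learned from the training data. This is a sketch under assumptions: the raw column names `Timestamp`, `Src_IP`, and `IN_BYTES`, the median imputation, and the four sample rows are all illustrative.

```python
import json

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Tiny hypothetical training sample with missing values.
raw_train = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2024-01-01 03:15", "2024-01-01 14:40", "2024-01-01 14:55", "2024-01-02 23:05"]
    ),
    "Src_IP": ["10.0.0.5", "10.0.0.7", None, "192.168.1.200"],
    "IN_BYTES": [1200.0, np.nan, 900.0, 5_000_000.0],
})


def derive(df):
    """Derive the anomaly-detection features; missing values are left as NaN."""
    out = pd.DataFrame(index=df.index)
    out["Hour"] = df["Timestamp"].dt.hour
    last_octet = df["Src_IP"].str.rsplit(".", n=1).str[-1]
    out["Src_IP_LastOctet"] = pd.to_numeric(last_octet, errors="coerce")
    out["IN_BYTES_log1p"] = np.log1p(df["IN_BYTES"])
    return out


# Learn imputation values (medians) from training data and record them.
train_feats = derive(raw_train)
impute_values = {col: float(val) for col, val in train_feats.median().items()}
imputations_json = json.dumps(impute_values)  # store next to the model artifact

X_train = train_feats.fillna(impute_values)
model = IsolationForest(n_estimators=50, random_state=0).fit(X_train)

# Runtime: same derivation, same recorded imputations, same column order.
raw_new = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2024-01-03 02:00"]),
    "Src_IP": [None],
    "IN_BYTES": [750.0],
})
X_new = derive(raw_new).fillna(json.loads(imputations_json))[list(X_train.columns)]
pred = model.predict(X_new)  # 1 = inlier, -1 = anomaly
print(X_new, pred)
```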
In [None]:
# Load the dataset from Google Drive


