<a href="https://colab.research.google.com/github/sreent/machine-learning/blob/main/Final%20DNN%20Code%20Examples/Rain%20in%20Australia/Rain%20in%20Australia%20-%20Mixed%20Feature%20Type%20%26%20Missing%20Value%20Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Rain in Australia - Mixed Feature Type & Missing Value Example

This notebook demonstrates the **Universal ML Workflow** applied to a binary classification problem with **mixed feature types** (categorical and numerical) and **missing values**.

## Learning Objectives

By the end of this notebook, you will be able to:
- Handle **mixed feature types** (categorical + numerical) using `ColumnTransformer`
- Apply appropriate preprocessing: **One-Hot Encoding** for categorical, **Standardisation** for numerical
- Handle **missing values** using different strategies (kNN imputation for numerical, "Unknown" category for categorical)
- Address **class imbalance** using class weights during training
- Build and train deep neural networks for **binary classification**
- Use **Hyperband** for efficient hyperparameter tuning
- Apply **Dropout + L2 regularisation** to prevent overfitting

---

## Dataset Overview

| Attribute | Description |
|-----------|-------------|
| **Source** | [Kaggle Weather Dataset](https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package) |
| **Problem Type** | Binary Classification |
| **Target Variable** | RainTomorrow (Yes/No) |
| **Data Balance** | Imbalanced (~78% No, ~22% Yes) |
| **Data Type** | Structured Tabular (Mixed Categorical & Numerical) |
| **Missing Data** | Significant missing values in many columns |
| **Features** | 16 numerical + 5 categorical variables |
| **Imbalance Handling** | Class Weights during Training |

---

## Technique Scope

This notebook uses only techniques from **Chapters 1–4** of *Deep Learning with Python* (Chollet, 2021). This means:

| Technique | Status | Rationale |
|-----------|--------|-----------|
| **Dense layers (DNN)** | ✓ Used | Core building block (Ch. 3-4) |
| **Dropout** | ✓ Used | Regularisation technique (Ch. 4) |
| **L2 regularisation** | ✓ Used | Weight penalty (Ch. 4) |
| **Early stopping** | ✗ Not used | Introduced in Ch. 7 |
| **CNN** | ✗ Not used | Introduced in Ch. 8 |
| **RNN/LSTM** | ✗ Not used | Introduced in Ch. 10 |

We demonstrate that **Dropout + L2 regularisation** alone can effectively control overfitting without requiring early stopping.

---

## 1. Defining the Problem and Assembling a Dataset

The first step in any machine learning project is to clearly define the problem and understand the data.

**Problem Statement:** Given weather observations from various Australian locations, predict whether it will rain tomorrow.

**Why this problem is interesting:**
- Requires handling **mixed feature types** - we have both categorical (location, wind direction) and numerical (temperature, humidity) features
- Contains **missing values** that must be handled appropriately
- Features **class imbalance** - rain days are less common than non-rain days
- Real-world application: agricultural planning, event scheduling, water resource management

**Data Source:** This dataset contains about 10 years of daily weather observations from numerous Australian weather stations.

## 2. Choosing a Measure of Success

### Metric Selection Based on Class Imbalance

The choice of evaluation metric depends on **class imbalance**. We use practical guidelines derived from the literature:

| Imbalance Ratio | Classification | Primary Metric | Rationale |
|-----------------|----------------|----------------|-----------|
| ≤ 1.5:1 | Balanced | **Accuracy** | Classes roughly equal |
| 1.5:1 – 3:1 | Mild Imbalance | **Accuracy** | Majority class < 75% |
| > 3:1 | Moderate/Severe | **F1-Score** | Accuracy becomes misleading |

**Why these thresholds?**
- **3:1 ratio**: When the majority class exceeds 75%, a naive classifier that always predicts the majority class already scores above 75% accuracy while ignoring the minority class entirely
- **F1-Score**: Harmonic mean of precision and recall, effective for imbalanced data (He and Garcia, 2009)
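
To make the accuracy trap concrete, the sketch below scores an always-"No" classifier on a hypothetical 1,000-day sample with the ~78/22 split quoted in the dataset overview. The counts and the helper names `accuracy` and `f1` are illustrative assumptions, not values computed from the dataset.

```python
# Hypothetical sample mirroring the ~78% / ~22% class split (illustrative counts).
y_true = [0] * 780 + [1] * 220      # 0 = No rain, 1 = Rain
y_pred = [0] * len(y_true)          # naive classifier: always predict "No"

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0                  # no true positives: F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

naive_accuracy = accuracy(y_true, y_pred)
naive_f1 = f1(y_true, y_pred)
print(f"Accuracy: {naive_accuracy:.2f}, F1: {naive_f1:.2f}")
```

The classifier never identifies a rainy day, yet its accuracy looks respectable; F1 exposes the failure.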

### References

- Branco, P., Torgo, L. and Ribeiro, R.P. (2016) 'A survey of predictive modelling on imbalanced domains', *ACM Computing Surveys*, 49(2), pp. 1–50.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

*Note: The 3:1 threshold is a practical guideline, not a strict academic standard. The literature suggests metric choice depends on domain-specific costs of errors.*

## 3. Deciding on an Evaluation Protocol

### Hold-Out vs K-Fold Cross-Validation

The choice between hold-out and K-fold depends on **dataset size** and **computational cost**:

| Dataset Size | Recommended Method | Rationale |
|--------------|-------------------|-----------|
| < 1,000 | K-Fold (K=5 or 10) | High variance with small hold-out sets |
| 1,000 – 10,000 | K-Fold or Hold-Out | Either works; K-fold more robust |
| > 10,000 | Hold-Out | Sufficient data; K-fold computationally expensive |
| Deep Learning | Hold-Out (preferred) | Training cost prohibitive for K iterations |

**Why 10,000 as a practical threshold?**
- Below 10,000 samples, hold-out validation has higher variance (Kohavi, 1995)
- Above 10,000, statistical estimates from hold-out are reliable
- Deep learning models are expensive to train; K-fold multiplies cost by K (Chollet, 2021)
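
One rough way to see why small hold-out sets are noisy is the binomial standard error of an accuracy estimate. The sketch below assumes independent prediction errors and a hypothetical true accuracy of 0.85; both are simplifying assumptions for illustration, not results from this notebook.

```python
import math

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)

assumed_accuracy = 0.85  # hypothetical true accuracy, for illustration only
for n_eval in (100, 1_000, 10_000):
    se = accuracy_standard_error(assumed_accuracy, n_eval)
    # 1.96 * se is the half-width of an approximate 95% interval
    print(f"n={n_eval:>6,}: {assumed_accuracy:.2f} +/- {1.96 * se:.3f}")
```

The interval narrows with the square root of the evaluation-set size, so each tenfold increase in samples shrinks the uncertainty by roughly a factor of three.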

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning: data mining, inference, and prediction*. 2nd edn. New York: Springer.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *Proceedings of the 14th International Joint Conference on Artificial Intelligence*, 2, pp. 1137–1145.

- Pedregosa, F. et al. (2011) 'Scikit-learn: machine learning in Python', *Journal of Machine Learning Research*, 12, pp. 2825–2830.

*Note: The 10,000 threshold is a practical guideline. For computationally cheap models, K-fold is preferred regardless of size.*

### Data Split Strategy (This Notebook)

```
Original Data (~142,000 samples) → Hold-Out Selected
├── Test Set (10%) - Final evaluation only
└── Training Pool (90%)
    ├── Training Set (~81%) - Model training
    └── Validation Set (~9%) - Hyperparameter tuning
```

**Important:** We use the `stratify` parameter to maintain class proportions in all splits.
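
The two-stage split can be sketched in plain Python to show where the 81% / 9% / 10% figures come from and what stratification preserves. The `stratified_split` helper and the 1,000-label sample are hypothetical; the notebook itself relies on `train_test_split`.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions in both."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["No"] * 780 + ["Yes"] * 220          # hypothetical 78/22 sample

# Stage 1: hold out 10% as the test set.
pool_idx, test_idx = stratified_split(labels, 0.10)
pool_labels = [labels[i] for i in pool_idx]

# Stage 2: hold out 10% of the remaining pool for validation.
# Note: train_idx and val_idx index into pool_labels, not labels.
train_idx, val_idx = stratified_split(pool_labels, 0.10)

print(len(train_idx), len(val_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

Because 10% of the remaining 90% is 9% of the original data, the training set ends up with about 81%.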

## 4. Preparing Your Data

### 4.1 Import Libraries and Set Random Seed

We set random seeds for reproducibility - this makes repeated runs of the notebook as consistent as possible, although some GPU operations can still introduce small run-to-run differences.

In [None]:
import pandas as pd
import numpy as np

from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner for hyperparameter search
%pip install -q -U keras-tuner
import keras_tuner as kt

import matplotlib.pyplot as plt

SEED = 204

tf.random.set_seed(SEED)
np.random.seed(SEED)

import warnings
warnings.filterwarnings('ignore')

### 4.2 Load and Explore the Dataset

Let's download the weather data from Google Drive and examine its structure. Notice the many NaN (missing) values in several columns.

In [None]:
# Load data directly from Google Drive
GDRIVE_FILE_ID = '1gt0c-jdMPYs_o7SBP67Al7Kg-5Ij3xY5'
DATA_URL = f'https://drive.google.com/uc?id={GDRIVE_FILE_ID}&export=download'

weather = pd.read_csv(DATA_URL)

print(f"Dataset shape: {weather.shape}")
weather.head()

In [None]:
# Examine numerical features
weather.describe()

In [None]:
# Define feature columns
NUMERICAL_VARIABLES = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 
                       'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
                       'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm',
                       'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm']

CATEGORICAL_VARIABLES = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

In [None]:
# Examine categorical features
weather.drop(NUMERICAL_VARIABLES, axis=1).describe(include=object)

In [None]:
# Check missing values
missing_pct = (weather.isnull().sum() / len(weather) * 100).round(1)
print("Missing Values (%):\n")
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

### 4.3 Clean Data and Define Features

We drop rows where the target variable or location is missing, as these cannot be imputed meaningfully.

In [None]:
# Drop rows where target variable or location is missing
weather = weather[~weather['RainTomorrow'].isnull() & ~weather['Location'].isnull()]

print(f"Dataset shape after cleaning: {weather.shape}")

In [None]:
# Define features and target
COLUMNS = CATEGORICAL_VARIABLES + NUMERICAL_VARIABLES

features = weather[COLUMNS]

TARGET_VARIABLE = 'RainTomorrow'
target = weather[TARGET_VARIABLE]

### 4.4 Split Data into Train and Test Sets

We reserve 10% of the data for final testing. The `stratify` parameter ensures that each split maintains the same class proportions as the original dataset - critical for imbalanced data.

In [None]:
TEST_SIZE = 0.10

(features_train, features_test, 
 target_train, target_test) = train_test_split(features, target, 
                                                test_size=TEST_SIZE, stratify=target,
                                                shuffle=True, random_state=SEED)

print(f"Training set: {len(features_train):,} samples")
print(f"Test set: {len(features_test):,} samples")

### 4.5 Handle Missing Values

**Missing Value Strategies:**

| Feature Type | Strategy | Rationale |
|--------------|----------|-----------|
| **Numerical** | kNN Imputation | Estimates missing values based on similar observations |
| **Categorical** | Fill with "Unknown" | Creates an explicit category for missing data |

**Why kNN Imputation for numerical features?**
- Preserves relationships between features better than simple mean/median imputation
- Uses information from similar samples to estimate missing values
- More sophisticated than assuming all missing values equal the mean

**Why "Unknown" for categorical features?**
- Missing categorical data may carry information (e.g., "wind direction not recorded" could correlate with calm conditions)
- Creating an explicit category allows the model to learn patterns associated with missingness

**Important:** We fit imputers only on training data to prevent data leakage.
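
The idea behind kNN imputation, and why donors must come from the training set only, can be sketched with NumPy. This is a simplified illustration that assumes the training rows are complete; `KNNImputer` also handles missing values among the donors. The function name and the toy readings are hypothetical.

```python
import numpy as np

def knn_impute_value(train, row, col, k=2):
    """Estimate row[col] as the mean of that column over the k nearest training rows.

    Distance uses only the features observed in `row`; training rows are
    assumed complete, which is a simplification of what KNNImputer does.
    """
    observed = ~np.isnan(row)
    distances = np.sqrt(((train[:, observed] - row[observed]) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return train[nearest, col].mean()

# Hypothetical training rows: [Humidity3pm, Temp3pm]
train = np.array([[80.0, 15.0],
                  [78.0, 16.0],
                  [30.0, 33.0],
                  [25.0, 35.0]])

test_row = np.array([79.0, np.nan])        # Temp3pm is missing
imputed = knn_impute_value(train, test_row, col=1, k=2)
print(f"kNN estimate: {imputed:.1f}, column mean: {train[:, 1].mean():.2f}")
```

The humid test day borrows its temperature from the two humid training days rather than from the overall column mean, and no test-set value ever acts as a donor.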

In [None]:
# Impute missing values for numerical features using kNN
knn_imputer = KNNImputer(n_neighbors=5)
knn_imputer.fit(features_train[NUMERICAL_VARIABLES])

numerical_train = knn_imputer.transform(features_train[NUMERICAL_VARIABLES])
numerical_test = knn_imputer.transform(features_test[NUMERICAL_VARIABLES])

print(f"Numerical features imputed: {numerical_train.shape[1]} columns")

In [None]:
# Fill missing categorical values with "Unknown"
categorical_train = features_train[CATEGORICAL_VARIABLES].fillna('Unknown')
categorical_test = features_test[CATEGORICAL_VARIABLES].fillna('Unknown')

print(f"Categorical features: {categorical_train.shape[1]} columns")

### 4.6 Preprocessing Pipeline with ColumnTransformer

We use `ColumnTransformer` to apply different preprocessing to different feature types simultaneously:

```
ColumnTransformer
├── Categorical Features → One-Hot Encoding
│   (Creates binary columns for each category)
└── Numerical Features → Standard Scaling
    (Mean=0, Std=1)
```

This is a common pattern for mixed-type datasets.

---

#### Why One-Hot Encoding for Categorical Features?

| Encoding | How it works | Pros | Cons |
|----------|--------------|------|------|
| **One-Hot** | Creates binary column per category | No ordinal assumption, works with any model | High dimensionality for many categories |
| **Label Encoding** | Maps categories to integers | Compact representation | Implies false ordinal relationship |
| **Target Encoding** | Maps to target mean | Captures target relationship | Risk of overfitting, data leakage |

We use **One-Hot Encoding** because:
1. Neural networks work well with binary features
2. No false ordinal relationships (e.g., "Sydney" is not > "Melbourne")
3. `handle_unknown="ignore"` gracefully handles unseen categories in test data
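
A minimal pure-Python sketch of the encoding behaviour described above: the vocabulary is learned from training values only, and a category never seen in training maps to an all-zero row, which mirrors the effect of `handle_unknown="ignore"`. The helper names and wind directions are hypothetical.

```python
def fit_one_hot(values):
    """Learn the category vocabulary from training values only."""
    return sorted(set(values))

def transform_one_hot(values, categories):
    """Encode each value as a binary row; unseen categories become all zeros."""
    return [[1 if value == cat else 0 for cat in categories] for value in values]

train_dirs = ["N", "SE", "W", "N", "Unknown"]   # hypothetical wind directions
test_dirs = ["W", "NNE"]                        # "NNE" never appeared in training

categories = fit_one_hot(train_dirs)
encoded_test = transform_one_hot(test_dirs, categories)
print(categories)
print(encoded_test)
```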

#### Why Standardisation for Numerical Features?

Neural networks train faster and more stably when inputs have similar scales:
- **Before:** Features range from 0-100 (humidity) to 980-1040 (pressure)
- **After:** All features have mean=0 and std=1
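
Standardisation is simply z = (x - mean) / std, with the mean and standard deviation taken from the training data and then reused for the test data. The pressure readings below are hypothetical.

```python
from statistics import mean, pstdev

train_pressure = [1002.0, 1010.0, 1018.0, 1026.0]   # hypothetical hPa readings
test_pressure = [1014.0, 1030.0]

mu = mean(train_pressure)            # statistics come from training data only
sigma = pstdev(train_pressure)       # population std, as StandardScaler uses

def standardise(values, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardise(train_pressure, mu, sigma)
test_scaled = standardise(test_pressure, mu, sigma)
print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally will not, because they are scaled with the training statistics.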

In [None]:
# Combine imputed numerical and filled categorical features
combined_train = pd.DataFrame(
    data=np.hstack((numerical_train, categorical_train)), 
    columns=NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES
)

combined_test = pd.DataFrame(
    data=np.hstack((numerical_test, categorical_test)), 
    columns=NUMERICAL_VARIABLES + CATEGORICAL_VARIABLES
)

In [None]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer([
    ('one-hot-encoder', OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_VARIABLES),
    ('standard_scaler', StandardScaler(), NUMERICAL_VARIABLES)
])

# Fit on training data only (prevent data leakage)
preprocessor.fit(combined_train)

# Transform both sets
X_train_full = preprocessor.transform(combined_train)
X_test = preprocessor.transform(combined_test)

print(f"Preprocessed training shape: {X_train_full.shape}")
print(f"Preprocessed test shape: {X_test.shape}")

### 4.7 Encode Target Variable

In [None]:
# Encode target variable (Yes=1, No=0)
label_encoder = LabelEncoder()
label_encoder.fit(target_train)  # fit on training labels only, like the other transformers

y_train_full = label_encoder.transform(target_train)
y_test = label_encoder.transform(target_test)

print(f"Classes: {label_encoder.classes_}")
print(f"Encoding: No=0, Yes=1")

## 5. Developing a Model That Does Better Than a Baseline

Before building complex models, we need to establish **baseline performance**. This gives us a reference point to know if our model is actually learning something useful.

### 5.1 Examine Class Distribution

In [None]:
counts = target.value_counts().sort_index()
print("Class distribution:")
print(counts)
print(f"\nTotal samples: {counts.sum():,}")

In [None]:
# =============================================================================
# DATA-DRIVEN ANALYSIS: Dataset Size & Imbalance
# =============================================================================

# Dataset size analysis (for hold-out vs K-fold decision)
n_samples = len(weather)
HOLDOUT_THRESHOLD = 10000  # Use hold-out if samples > 10,000 (Kohavi, 1995; Chollet, 2021)

# Imbalance analysis (for metric selection)
majority_class = counts.max()
minority_class = counts.min()
imbalance_ratio = majority_class / minority_class
IMBALANCE_THRESHOLD = 3.0  # Use F1-Score if ratio > 3.0 (He & Garcia, 2009)

# Determine evaluation strategy and metric
use_holdout = n_samples > HOLDOUT_THRESHOLD
use_f1 = imbalance_ratio > IMBALANCE_THRESHOLD

print("=" * 60)
print("DATA-DRIVEN CONFIGURATION")
print("=" * 60)
print(f"\n1. DATASET SIZE: {n_samples:,} samples")
print(f"   Threshold: {HOLDOUT_THRESHOLD:,} samples (Kohavi, 1995)")
print(f"   Decision: {'Hold-Out' if use_holdout else 'K-Fold Cross-Validation'}")

print(f"\n2. CLASS IMBALANCE: {imbalance_ratio:.2f}:1 ratio")
print(f"   Threshold: {IMBALANCE_THRESHOLD:.1f}:1 (He & Garcia, 2009)")
print(f"   Decision: {'F1-Score (imbalanced)' if use_f1 else 'Accuracy (balanced)'}")

print("\n" + "=" * 60)
PRIMARY_METRIC = 'f1' if use_f1 else 'accuracy'
print(f"PRIMARY METRIC: {PRIMARY_METRIC.upper()}")
print("=" * 60)

### 5.2 Calculate Baseline Metrics

**Naive Baseline (Majority Class):** If we always predict "No Rain", we achieve ~78% accuracy. This is our accuracy baseline.

**Balanced Accuracy Baseline:** A random classifier would achieve 50% balanced accuracy. This is more meaningful for imbalanced data.
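
Balanced accuracy is the mean of the per-class recalls, which is why an uninformative classifier sits at 50%. The sketch below checks this for an always-"No" predictor on a hypothetical 78/22 sample; the helper is a simplified stand-in for `balanced_accuracy_score`.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (binary or multiclass)."""
    recalls = []
    for cls in sorted(set(y_true)):
        n_cls = sum(t == cls for t in y_true)
        hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls.append(hits / n_cls)
    return sum(recalls) / len(recalls)

y_true = [0] * 780 + [1] * 220     # hypothetical 78/22 sample
always_no = [0] * len(y_true)      # majority-class predictor

plain_accuracy = sum(t == p for t, p in zip(y_true, always_no)) / len(y_true)
majority_balanced = balanced_accuracy(y_true, always_no)
print(f"Accuracy: {plain_accuracy:.2f}")
print(f"Balanced accuracy: {majority_balanced:.2f}")
```

Recall is 1.0 on "No" and 0.0 on "Yes", so the average is 0.5 even though plain accuracy is 0.78.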

In [None]:
# Baseline accuracy (always predict majority class)
baseline = counts['No'] / counts.sum()

# Balanced accuracy baseline (random classifier)
balanced_accuracy_baseline = 0.5  # For binary classification

print(f"Baseline accuracy (always predict 'No'): {baseline:.2%}")
print(f"Balanced accuracy baseline (random): {balanced_accuracy_baseline:.2%}")

### 5.3 Create Validation Set

We split off a portion of the training data for validation. This will be used to:
- Evaluate model performance during hyperparameter tuning
- Compare models without touching the test set

In [None]:
VALIDATION_SIZE = 0.10

X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, 
    test_size=VALIDATION_SIZE, stratify=y_train_full,
    shuffle=True, random_state=SEED
)

print(f"Training set: {X_train.shape[0]:,} samples")
print(f"Validation set: {X_val.shape[0]:,} samples")
print(f"Test set: {X_test.shape[0]:,} samples")

### 5.4 Configure Training Parameters

**Key training settings:**
- **Optimiser:** Adam - adaptive learning rate optimiser with momentum, widely used for deep learning
- **Loss:** Binary cross-entropy - standard loss for binary classification
- **Training Metrics:** Accuracy, Precision, Recall, AUC (tracked by Keras during training)
- **Primary Metric:** F1-Score - computed separately after training using sklearn
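
For intuition about the loss, binary cross-entropy can be written out by hand. The sketch below is an unweighted version with hypothetical sigmoid outputs; Keras additionally applies the class weights when they are passed to `fit`.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]   # hypothetical sigmoid outputs
confident_wrong = [0.1, 0.9, 0.2, 0.8]

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print(f"{loss_right:.3f} vs {loss_wrong:.3f}")
```

Confident wrong predictions are penalised far more heavily than confident correct ones are rewarded, which is what drives the sigmoid output towards calibrated probabilities.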

In [None]:
INPUT_DIMENSION = X_train.shape[1]
OUTPUT_DIMENSION = 1  # Binary classification: single output neuron

OPTIMIZER = 'adam'
LOSS_FUNC = 'binary_crossentropy'

# Training metrics (tracked by Keras during training)
# Note: F1-Score (our primary metric) is computed separately using sklearn
METRICS = ['accuracy', 
           tf.keras.metrics.Precision(name='precision'), 
           tf.keras.metrics.Recall(name='recall'),
           tf.keras.metrics.AUC(name='auc')]

print(f"Input dimension: {INPUT_DIMENSION}")
print(f"Output dimension: {OUTPUT_DIMENSION}")

In [None]:
# Single-Layer Perceptron (no hidden layers) - Baseline
slp_model = Sequential(name='Single_Layer_Perceptron')
slp_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
slp_model.add(Dense(OUTPUT_DIMENSION, activation='sigmoid'))
slp_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

slp_model.summary()

In [None]:
# =============================================================================
# TRAINING CONFIGURATION
# =============================================================================

BATCH_SIZE = 512

# We use DIFFERENT epoch counts for different training phases:
#
# EPOCHS_BASELINE (100): For SLP and unregularised DNN
#   - SLP converges quickly (simple model)
#   - Unregularised DNN: 100 epochs clearly shows overfitting (val_loss increasing)
#
# EPOCHS_REGULARIZED (150): For DNN with Dropout + L2
#   - WHY train longer? Regularisation SLOWS DOWN learning:
#     * Dropout randomly masks neurons, so each update uses partial information
#     * L2 penalty constrains weight updates, preventing large steps
#     * The model needs MORE iterations to reach the same level of convergence
#   - Without extra epochs, we'd stop before the model reaches its full potential
#   - With regularisation, longer training is SAFER (overfitting risk is reduced, not eliminated)
#
# The trade-off: Regularisation exchanges faster convergence for overfitting protection.
# We compensate by allowing more training time.

EPOCHS_BASELINE = 100      # SLP and DNN (no regularisation)
EPOCHS_REGULARIZED = 150   # DNN with Dropout + L2 (needs more time to converge)

### 5.5 Handle Class Imbalance with Class Weights

To handle imbalanced classes, we compute **class weights** that give more importance to the minority class during training:
- **No (majority):** Lower weight
- **Yes (minority):** Higher weight

This makes errors on the minority class "cost more", encouraging the model to learn it better.

---

#### Why Class Weights Instead of Resampling?

| Technique | How it works | Pros | Cons |
|-----------|--------------|------|------|
| **Class Weights** | Adjusts loss function to penalise minority errors more | Simple, no data modification | Doesn't add information |
| **Oversampling (SMOTE)** | Creates synthetic minority samples | Adds training data | Risk of overfitting to synthetic data |
| **Undersampling** | Removes majority class samples | Balances dataset | Loses potentially useful information |

We use **class weights** because:
1. **Simplicity:** No need to modify the dataset
2. **No synthetic data risk:** SMOTE can create unrealistic samples
3. **Efficiency:** Training time unchanged
4. **Keras integration:** Native support via `class_weight` parameter
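
The `'balanced'` heuristic used by `compute_class_weight` is n_samples / (n_classes * count_c). The sketch below reproduces that formula in plain Python on a hypothetical 78/22 sample, so the printed weights are illustrative rather than the notebook's actual values.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

labels = [0] * 780 + [1] * 220        # hypothetical 78/22 sample
example_weights = balanced_class_weights(labels)
print({cls: round(w, 4) for cls, w in example_weights.items()})
```

With these weights each class contributes the same total weight to the loss (780 × 0.641 ≈ 220 × 2.273 ≈ 500), which is what "balanced" means here.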

In [None]:
# Compute class weights for imbalanced data
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
CLASS_WEIGHTS = dict(enumerate(weights))

print("Class weights:")
print(f"  No (0):  {CLASS_WEIGHTS[0]:.4f}")
print(f"  Yes (1): {CLASS_WEIGHTS[1]:.4f}")

In [None]:
# Train the Single-Layer Perceptron
history_slp = slp_model.fit(
    X_train, y_train, 
    class_weight=CLASS_WEIGHTS,
    batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
    validation_data=(X_val, y_val),
    verbose=0
)
val_score_slp = slp_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
# Display SLP validation metrics
print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_slp[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_slp[1]))
print('Recall (Validation): {:.2f}'.format(val_score_slp[2]))
print('AUC (Validation): {:.2f}'.format(val_score_slp[3]))

preds_slp_val = (slp_model.predict(X_val, verbose=0) > 0.5).astype('int32').flatten()
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val, preds_slp_val), balanced_accuracy_baseline))

# Calculate F1-Score (primary metric for imbalanced data)
f1_slp_val = f1_score(y_val, preds_slp_val)
print(f'F1-Score (Validation): {f1_slp_val:.2f}  ← Primary Metric')

In [None]:
def plot_training_history(history, title=None):
    """
    Plot training and validation metrics over epochs.
    Plots: (1) Loss, (2) Accuracy
    
    Parameters:
    -----------
    history : keras History object
        Training history from model.fit()
    title : str, optional
        Model name to display in plot titles (e.g., 'SLP', 'DNN')
    """
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    epochs = range(1, len(history.history['loss']) + 1)
    title_suffix = f' ({title})' if title else ''

    # Plot 1: Loss
    axs[0].plot(epochs, history.history['loss'], 'b-', label='Training', linewidth=1.5)
    axs[0].plot(epochs, history.history['val_loss'], 'r-', label='Validation', linewidth=1.5)
    axs[0].set_title(f'Loss{title_suffix}')
    axs[0].set_xlabel('Epochs')
    axs[0].set_ylabel('Loss')
    axs[0].legend()
    axs[0].grid(alpha=0.3)

    # Plot 2: Accuracy
    axs[1].plot(epochs, history.history['accuracy'], 'b-', label='Training', linewidth=1.5)
    axs[1].plot(epochs, history.history['val_accuracy'], 'r-', label='Validation', linewidth=1.5)
    axs[1].set_title(f'Accuracy{title_suffix}')
    axs[1].set_xlabel('Epochs')
    axs[1].set_ylabel('Accuracy')
    axs[1].legend()
    axs[1].grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

In [None]:
# Plot SLP training history
plot_training_history(history_slp, title='SLP Baseline')

## 6. Scaling Up: Developing a Model That Overfits

The next step in the Universal ML Workflow is to build a model with **enough capacity to overfit**. If a model can't overfit, it may be too simple to learn the patterns in the data.

**Strategy:** Add hidden layers and neurons to increase model capacity.

**No regularisation applied:** We intentionally train this model **without any regularisation** (no dropout, no L2, no early stopping) to observe overfitting behaviour. In the training plots, you should see:
- Training loss continues to decrease
- Validation loss starts increasing after some epochs (overfitting)

---

#### Architecture Design Decisions

**Why 64 neurons in the hidden layer?**

This is a practical starting point that balances capacity and efficiency:
- **Too few (e.g., 16):** May not have enough capacity to learn complex patterns
- **Too many (e.g., 512):** Increases overfitting risk and training time
- **64 neurons:** A common choice for tabular data that provides sufficient capacity
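
To get a feel for how capacity grows with layer width, the sketch below counts trainable parameters for a one-hidden-layer network. The input width of 100 is a hypothetical placeholder; the real value is `INPUT_DIMENSION`, which depends on how many one-hot columns the encoder produces.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 100   # hypothetical input width after one-hot encoding
for hidden_units in (16, 64, 512):
    total = dense_params(n_features, hidden_units) + dense_params(hidden_units, 1)
    print(f"{hidden_units:>3} hidden units -> {total:,} trainable parameters")
```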

**Why only 1 hidden layer?**

Per the **Universal ML Workflow**, the goal is to demonstrate that the model *can* overfit. Once overfitting is observed:
1. **Capacity is proven sufficient**
2. **No need for more depth**
3. **Regularise, don't expand**

*"The right question is not 'How many layers?' but 'Can it overfit?' If yes, regularise. If no, add capacity."*

### 6.1 Build a Deep Neural Network (DNN)

In [None]:
# Deep Neural Network (1 hidden layer, no regularisation for overfitting demo)
dnn_model = Sequential(name='Deep_Neural_Network')
dnn_model.add(layers.Input(shape=(INPUT_DIMENSION,)))
dnn_model.add(Dense(64, activation='relu'))
dnn_model.add(Dense(OUTPUT_DIMENSION, activation='sigmoid'))
dnn_model.compile(optimizer=OPTIMIZER, loss=LOSS_FUNC, metrics=METRICS)

dnn_model.summary()

In [None]:
# Train the Deep Neural Network (without regularisation to demonstrate overfitting)
history_dnn = dnn_model.fit(
    X_train, y_train, 
    class_weight=CLASS_WEIGHTS,
    batch_size=BATCH_SIZE, epochs=EPOCHS_BASELINE, 
    validation_data=(X_val, y_val), 
    verbose=0
)
val_score_dnn = dnn_model.evaluate(X_val, y_val, verbose=0)[1:]

In [None]:
# Plot DNN training history (expect overfitting: val_loss increasing)
plot_training_history(history_dnn, title='DNN - No Regularisation')

In [None]:
# Display DNN validation metrics
print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_dnn[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_dnn[1]))
print('Recall (Validation): {:.2f}'.format(val_score_dnn[2]))
print('AUC (Validation): {:.2f}'.format(val_score_dnn[3]))

preds_dnn_val = (dnn_model.predict(X_val, verbose=0) > 0.5).astype('int32').flatten()
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val, preds_dnn_val), balanced_accuracy_baseline))

# Calculate F1-Score (primary metric for imbalanced data)
f1_dnn_val = f1_score(y_val, preds_dnn_val)
print(f'F1-Score (Validation): {f1_dnn_val:.2f}  ← Primary Metric')

## 7. Regularising Your Model and Tuning Hyperparameters

Now we address the overfitting observed in Section 6 by adding **regularisation**. We use two complementary techniques:

| Technique | How it works | Effect |
|-----------|--------------|--------|
| **Dropout** | Randomly drops neurons during training | Acts like ensemble averaging, reduces co-adaptation |
| **L2 (Weight Decay)** | Adds penalty for large weights to loss | Keeps weights small, smoother decision boundaries |
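
As a rough illustration of the two techniques in the table, the NumPy sketch below applies inverted dropout to a vector of activations and computes an L2 penalty for a small weight matrix. The shapes, rate, and strength are hypothetical; Keras performs both steps internally.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the rest (training mode)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def l2_penalty(weights, strength):
    """Term added to the loss: strength * sum of squared weights."""
    return strength * np.sum(weights ** 2)

hidden = np.ones((1, 64))                       # hypothetical hidden activations
dropped = inverted_dropout(hidden, rate=0.5, rng=rng)
kernel = np.full((4, 3), 0.5)                   # hypothetical weight matrix
penalty = l2_penalty(kernel, strength=0.01)

print(f"Units kept: {int((dropped > 0).sum())} of {hidden.size}")
print(f"L2 penalty: {penalty:.3f}")
```

Surviving units are scaled by 1 / (1 - rate) so the expected activation is unchanged, which is why no rescaling is needed at inference time.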

**Same architecture, different regularisation:** We keep the same 1-layer architecture (64 neurons) as Section 6, so the only difference is regularisation.

Hyperparameters are tuned with **Hyperband**.

---

#### Why Hyperband?

| Method | How it works | Pros | Cons |
|--------|--------------|------|------|
| **Grid Search** | Tries all combinations exhaustively | Thorough, reproducible | Exponentially expensive |
| **Random Search** | Samples random combinations | More efficient than grid | Still trains all configs to completion |
| **Hyperband** | Early stopping of poor performers | Very efficient for deep learning | May discard slow starters prematurely |

We use **Hyperband** because:
1. **Efficiency:** Eliminates poor configurations early
2. **Deep learning fit:** Training epochs are a natural "resource" to allocate adaptively
3. **Keras Tuner integration:** Native support via `kt.Hyperband`
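
The core idea, successive halving, can be sketched in a few lines. This is a simplified single-bracket illustration, not Keras Tuner's implementation: `toy_score` is a hypothetical stand-in for "train for this many epochs and return validation AUC", and the 27 starting configurations are arbitrary.

```python
import random

def successive_halving(configs, score_fn, factor=3, min_epochs=1, max_epochs=20):
    """Train all configs briefly, keep the top 1/factor, and repeat with more epochs."""
    survivors, epochs, schedule = list(configs), min_epochs, []
    while True:
        schedule.append((len(survivors), epochs))
        ranked = sorted(survivors, key=lambda cfg: score_fn(cfg, epochs), reverse=True)
        if len(ranked) == 1 or epochs >= max_epochs:
            return ranked[0], schedule
        survivors = ranked[:max(1, len(ranked) // factor)]
        epochs = min(max_epochs, epochs * factor)

# Hypothetical stand-in for "train for `epochs` and return validation AUC".
def toy_score(cfg, epochs):
    return cfg["quality"] * (1 - 0.5 ** epochs)

rng = random.Random(0)
configs = [{"id": i, "quality": rng.random()} for i in range(27)]
best, schedule = successive_halving(configs, toy_score)
print(schedule)   # (number of configs, epochs) per round
print(best["id"])
```

Most of the budget is spent on a handful of promising configurations; the cost is that a configuration which learns slowly at first can be eliminated before it shows its strength.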

### 7.1 Hyperband Search

In [None]:
# Hyperband Model Builder for Binary Classification
def build_model_hyperband(hp):
    """
    Build Rain in Australia model with FIXED architecture (1 hidden layer, 64 neurons).
    Same architecture as Section 6 DNN - only tunes regularisation and learning rate.
    """
    model = keras.Sequential()
    model.add(layers.Input(shape=(INPUT_DIMENSION,)))

    # L2 regularisation strength
    l2_reg = hp.Float('l2_reg', 1e-5, 1e-2, sampling='log')

    # Fixed architecture: 1 hidden layer with 64 neurons (same as Section 6)
    model.add(layers.Dense(64, activation='relu', 
                           kernel_regularizer=regularizers.l2(l2_reg)))
    dropout_rate = hp.Float('dropout', 0.0, 0.5, step=0.1)
    model.add(layers.Dropout(dropout_rate))

    # Output layer for binary classification
    model.add(layers.Dense(OUTPUT_DIMENSION, activation='sigmoid'))

    lr = hp.Float('lr', 1e-4, 1e-2, sampling='log')
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss=LOSS_FUNC,
        metrics=METRICS
    )
    return model

In [None]:
# Configure Hyperband tuner
# ===========================================================================
# TUNING OBJECTIVE: AUC for imbalanced data
# ===========================================================================
# AUC is threshold-independent and handles imbalanced data well.
# F1-Score is computed separately for final evaluation.

TUNING_OBJECTIVE = 'val_accuracy' if PRIMARY_METRIC == 'accuracy' else 'val_auc'

tuner = kt.Hyperband(
    build_model_hyperband,
    objective=kt.Objective(TUNING_OBJECTIVE, direction='max'),  # AUC and accuracy are both maximised
    max_epochs=20,
    factor=3,
    directory='rain_australia_hyperband',
    project_name='rain_australia_tuning',
    overwrite=True
)

print(f"Tuning objective: {TUNING_OBJECTIVE}")
print("(Note: Final evaluation uses F1-Score as primary metric)")

# Run Hyperband search
tuner.search(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20,
    batch_size=BATCH_SIZE,
    class_weight=CLASS_WEIGHTS
)

In [None]:
# Get best hyperparameters and best model directly
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters found by Hyperband:")
print(f"  L2 Regularisation: {best_hp.get('l2_reg'):.6f}")
print(f"  Dropout Rate: {best_hp.get('dropout')}")
print(f"  Learning Rate: {best_hp.get('lr'):.6f}")

# Get the best model directly - reloaded with the weights saved at its best checkpoint during the search
opt_model = tuner.get_best_models(num_models=1)[0]
opt_model.summary()

### 7.2 Using the Best Model Directly

Rather than rebuilding and retraining from scratch, we retrieve the best model directly from the tuner using `tuner.get_best_models()`. This approach avoids the **epoch mismatch problem**:

---

#### The Epoch Mismatch Problem

Hyperband uses **successive halving** - most configurations train for few epochs, only top performers get more:

```
Illustrative schedule for max_epochs=20, factor=3 (actual trial counts are set by the tuner):
Round 1: 81 configs × ~1 epoch  → Keep top 27
Round 2: 27 configs × ~2 epochs → Keep top 9
Round 3:  9 configs × ~7 epochs → Keep top 3
Round 4:  3 configs × ~20 epochs → Select best
```

The best hyperparameters were found optimal at a **specific epoch count** (e.g., 20 epochs). If we rebuild and retrain for a different number of epochs (e.g., 150), the hyperparameters may no longer be optimal - **this is the epoch mismatch problem**.

---

#### Clean Solution: Use Best Model Directly

Instead of rebuilding, we use `tuner.get_best_models(num_models=1)[0]` to retrieve the model that **already achieved the best validation performance** during tuning. This model:

- Has weights restored from the best checkpoint of the winning trial, so they match the epoch count at which its hyperparameters were evaluated
- Achieved the best validation AUC during the Hyperband search
- Avoids any mismatch between tuning epochs and final epochs

| Approach | Epochs Match? | Issue |
|----------|---------------|-------|
| ~~Rebuild + retrain for 150 epochs~~ | ✗ No | Hyperparameters may be suboptimal at 150 epochs |
| **Use best model directly** | ✓ Yes | Model already trained at optimal epochs |

> *"Use the model that actually achieved the best performance, not a rebuilt version that might perform differently."*

In [None]:
# The best model is already trained - evaluate on validation set
val_score_opt = opt_model.evaluate(X_val, y_val, verbose=0)[1:]
print(f"Validation Accuracy: {val_score_opt[0]:.4f}")
print(f"Validation Precision: {val_score_opt[1]:.4f}")
print(f"Validation Recall: {val_score_opt[2]:.4f}")
print(f"Validation AUC: {val_score_opt[3]:.4f}")

In [None]:
# Note: Training history plot is not available when using get_best_models()
# The best model was retrieved directly from the tuner, which doesn't preserve
# the training history. To visualise training curves, you would need to either:
# 1. Use TensorBoard callbacks during tuning, or
# 2. Retrain the model (but this risks the epoch mismatch problem)
#
# For this notebook, we skip the training history plot since we're using
# the best model directly to ensure optimal performance.
print("Training history not available when using get_best_models() directly.")

In [None]:
preds_opt_val = (opt_model.predict(X_val, verbose=0) > 0.5).astype('int32').flatten()

print('Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(val_score_opt[0], baseline))
print('Precision (Validation): {:.2f}'.format(val_score_opt[1]))
print('Recall (Validation): {:.2f}'.format(val_score_opt[2]))
print('AUC (Validation): {:.2f}'.format(val_score_opt[3]))
print('Balanced Accuracy (Validation): {:.2f} (baseline={:.2f})'.format(
    balanced_accuracy_score(y_val, preds_opt_val), balanced_accuracy_baseline))

# Calculate F1-Score (primary metric for imbalanced data)
f1_opt_val = f1_score(y_val, preds_opt_val)
print(f'F1-Score (Validation): {f1_opt_val:.2f}  ← Primary Metric')

### 7.3 Final Model Evaluation on Test Set

Now we evaluate our best model on the held-out test set that was never used during training or tuning.

In [None]:
# Final evaluation on test set
test_score = opt_model.evaluate(X_test, y_test, verbose=0)[1:]
preds_test = (opt_model.predict(X_test, verbose=0) > 0.5).astype('int32').flatten()

# Calculate F1-Score (our primary metric)
test_f1 = f1_score(y_test, preds_test)

print('=' * 50)
print('FINAL TEST SET RESULTS')
print('=' * 50)
print(f'F1-Score (Test): {test_f1:.4f}  ← Primary Metric')
print(f'Accuracy (Test): {test_score[0]:.4f} (baseline={baseline:.4f})')
print(f'Precision (Test): {test_score[1]:.4f}')
print(f'Recall (Test): {test_score[2]:.4f}')
print(f'AUC (Test): {test_score[3]:.4f}')
print(f'Balanced Accuracy (Test): {balanced_accuracy_score(y_test, preds_test):.4f}')

In [None]:
# Display confusion matrix for test predictions
fig, ax = plt.subplots(figsize=(8, 6))
cm = confusion_matrix(y_test, preds_test)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_encoder.classes_)
disp.plot(ax=ax, cmap='Blues', values_format='d')
plt.title('Confusion Matrix - Test Set Predictions')
plt.tight_layout()
plt.show()

# Print per-class metrics
print("\nPer-Class Performance:")
for i, class_name in enumerate(label_encoder.classes_):
    class_mask = y_test == i
    class_recall = (preds_test[class_mask] == i).mean()
    print(f"  {class_name}: {class_recall:.2%} recall ({class_mask.sum():,} samples)")
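
The per-class recall printed above links directly to the other reported metrics: recall for the positive class is the usual recall, recall for the negative class is specificity, and balanced accuracy is their mean. A minimal pure-Python sketch follows. It uses made-up confusion-matrix counts (not results from this notebook) and a hypothetical helper name, `metrics_from_counts`.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive the reported metrics from binary confusion-matrix counts."""
    recall_pos = tp / (tp + fn)   # recall for the positive class ("Yes")
    recall_neg = tn / (tn + fp)   # recall for the negative class ("No")
    precision = tp / (tp + fp)
    return {
        'accuracy': (tp + tn) / (tn + fp + fn + tp),
        'precision': precision,
        'recall': recall_pos,
        'f1': 2 * precision * recall_pos / (precision + recall_pos),
        'balanced_accuracy': (recall_pos + recall_neg) / 2,
    }

# Illustrative counts only: 780 true "No" days and 220 true "Yes" days
example = metrics_from_counts(tn=624, fp=156, fn=55, tp=165)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts, accuracy (0.789) barely exceeds what an always-"No" predictor would achieve on the same data (0.78), while balanced accuracy (0.775) and F1 (about 0.61) show how well the minority class is actually handled.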

---

## 8. Results Summary

The following dynamically-generated table compares all models trained in this notebook.

In [None]:
# =============================================================================
# RESULTS SUMMARY
# =============================================================================

# Create results DataFrame
results = pd.DataFrame({
    'Model': ['Naive Baseline', 'SLP (No Hidden)', 'DNN (No Regularisation)', 'DNN (Dropout + L2)', 'DNN (Dropout + L2) - Test'],
    'Accuracy': [baseline, val_score_slp[0], val_score_dnn[0], val_score_opt[0], test_score[0]],
    'F1-Score': [0.0, f1_slp_val, f1_dnn_val, f1_opt_val, test_f1],
    'Dataset': ['N/A', 'Validation', 'Validation', 'Validation', 'Test']
})

print("=" * 70)
print("MODEL COMPARISON - RESULTS SUMMARY")
print("=" * 70)
print(f"Primary Metric: F1-SCORE (imbalance ratio: {imbalance_ratio:.2f}:1)")
print("=" * 70)
print(results.to_string(index=False, float_format='{:.4f}'.format))
print("=" * 70)
print("\nKey Observations:")
print(f"  - Naive baseline accuracy to beat: {baseline:.2%}")
print(f"  - F1 change from regularisation (validation): {f1_dnn_val:.4f} → {f1_opt_val:.4f} ({f1_opt_val - f1_dnn_val:+.4f})")
print(f"  - Final test F1-Score: {test_f1:.4f}")

---

## 9. Key Takeaways

### Decision Framework Summary

| Decision | Threshold | This Dataset | Choice | Reference |
|----------|-----------|--------------|--------|-----------|
| **Hold-Out vs K-Fold** | > 10,000 samples | 142,193 samples | Hold-Out | Kohavi (1995); Chollet (2021) |
| **Accuracy vs F1-Score** | > 3:1 imbalance | 3.51:1 ratio | F1-Score | He and Garcia (2009) |
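
The two thresholds in the table can be written as simple rules. The sketch below only illustrates that logic: the function and constant names are hypothetical, and the threshold values are the ones stated in the table.

```python
HOLDOUT_MIN_SAMPLES = 10_000   # from the table: hold-out when > 10,000 samples
IMBALANCE_THRESHOLD = 3.0      # from the table: F1-Score when ratio > 3:1

def choose_validation(n_samples):
    """Pick the validation strategy from the dataset size."""
    return 'Hold-Out' if n_samples > HOLDOUT_MIN_SAMPLES else 'K-Fold'

def choose_metric(imbalance_ratio):
    """Pick the primary metric from the majority:minority ratio."""
    return 'F1-Score' if imbalance_ratio > IMBALANCE_THRESHOLD else 'Accuracy'

# Values taken from the "This Dataset" column of the table
print(choose_validation(142_193))
print(choose_metric(3.51))
```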

### Lessons Learned

1. **Mixed Feature Types:** Use `ColumnTransformer` to apply different preprocessing to different feature types:
   - **Categorical:** One-Hot Encoding (no false ordinal relationships)
   - **Numerical:** Standard Scaling (consistent scale for neural networks)

2. **Missing Value Handling:** Different strategies for different feature types:
   - **Numerical:** kNN imputation preserves feature relationships
   - **Categorical:** "Unknown" category allows model to learn from missingness patterns

3. **Data-Driven Metric Selection:** With an imbalance ratio above 3:1 (3.51:1 in this dataset), F1-Score replaces Accuracy as the primary metric.

4. **Class Imbalance Handling:** Class weights during training help the model learn minority class patterns without synthetic data risks.

5. **Regularisation Prevents Overfitting:** Combining **Dropout + L2 regularisation** controls overfitting effectively.

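6. **Use the Tuned Model Directly:** The final model is the one retrieved with `tuner.get_best_models()`, trained within Hyperband's 20-epoch budget. We do not rebuild and retrain it on a longer schedule (e.g., 150 epochs), which avoids the epoch mismatch problem.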

7. **Technique Scope:** We use only techniques from Chapters 1–4 of *Deep Learning with Python* (Chollet, 2021).
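
Lessons 3 and 4 can be illustrated with a few lines of arithmetic. The sketch below assumes the common "balanced" weighting heuristic, n_samples / (n_classes × class_count); the way this notebook's `CLASS_WEIGHTS` were actually computed may differ. The class counts are round illustrative numbers matching the ~78% / ~22% split, not the real dataset counts, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Illustrative counts (label 0 = "No", label 1 = "Yes")
counts = {0: 7800, 1: 2200}
weights = balanced_class_weights(counts)
print(weights)

# A classifier that always predicts the majority class looks accurate,
# but it never finds a rainy day, so its F1-Score for "Yes" is zero.
majority_accuracy = counts[0] / sum(counts.values())
tp, fp, fn = 0, 0, counts[1]
majority_f1 = 2 * tp / (2 * tp + fp + fn)
print(f"Majority-class accuracy: {majority_accuracy:.2f}, F1: {majority_f1:.2f}")
```

The minority class receives a weight roughly 3.5 times larger than the majority class, so each misclassified rainy day contributes proportionally more to the loss.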

### References

- Chollet, F. (2021) *Deep learning with Python*. 2nd edn. Shelter Island, NY: Manning Publications.

- Hastie, T., Tibshirani, R. and Friedman, J. (2009) *The elements of statistical learning*. 2nd edn. New York: Springer.

- He, H. and Garcia, E.A. (2009) 'Learning from imbalanced data', *IEEE Transactions on Knowledge and Data Engineering*, 21(9), pp. 1263–1284.

- Kohavi, R. (1995) 'A study of cross-validation and bootstrap for accuracy estimation and model selection', *IJCAI*, 2, pp. 1137–1145.

- Pedregosa, F. et al. (2011) 'Scikit-learn: machine learning in Python', *Journal of Machine Learning Research*, 12, pp. 2825–2830.

---

## Appendix: Modular Helper Functions

For cleaner code organisation, you can wrap the model building and training patterns into reusable functions.

In [None]:
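# =============================================================================
# MODULAR HELPER FUNCTIONS
# =============================================================================

# scikit-learn metrics used by evaluate_binary_model (repeated here in case
# they were not imported earlier in the notebook)
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score, roc_auc_score)
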
def build_binary_classifier(input_dim, hidden_units=None, dropout=0.0, l2_reg=0.0,
                            optimizer='adam', loss='binary_crossentropy', 
                            metrics=['accuracy'], name=None):
    """
    Build a binary classification neural network.
    
    Parameters:
    -----------
    input_dim : int
        Number of input features
    hidden_units : list of int, optional
        Neurons per hidden layer, e.g., [64] or [128, 64]
        None or [] creates a single-layer perceptron
    dropout : float
        Dropout rate (0.0 to 0.5)
    l2_reg : float
        L2 regularisation strength
    optimizer : str or keras.optimizers.Optimizer
        Optimiser name or instance
    loss : str
        Loss function name
    metrics : list
        Metrics to track during training
    name : str, optional
        Model name
        
    Returns:
    --------
    keras.Sequential : Compiled model ready for training
    """
    model = Sequential(name=name)
    model.add(layers.Input(shape=(input_dim,)))
    
    hidden_units = hidden_units or []
    kernel_reg = regularizers.l2(l2_reg) if l2_reg > 0 else None
    
    for units in hidden_units:
        model.add(Dense(units, activation='relu', kernel_regularizer=kernel_reg))
        if dropout > 0:
            model.add(Dropout(dropout))
    
    # Output layer for binary classification
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    return model


def train_model(model, X_train, y_train, X_val, y_val,
                class_weights=None, batch_size=512, epochs=100, verbose=0):
    """
    Train a model and return training history.
    
    Parameters:
    -----------
    model : keras.Model
        Compiled Keras model
    X_train, y_train : array-like
        Training data and labels
    X_val, y_val : array-like
        Validation data and labels
    class_weights : dict, optional
        Class weights for imbalanced data
    batch_size : int
        Training batch size
    epochs : int
        Number of training epochs
    verbose : int
        Verbosity mode
        
    Returns:
    --------
    keras.callbacks.History : Training history object
    """
    return model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        class_weight=class_weights,
        batch_size=batch_size, 
        epochs=epochs,
        verbose=verbose
    )


def evaluate_binary_model(model, X, y_true, threshold=0.5):
    """
    Evaluate binary classification model.
    
    Parameters:
    -----------
    model : keras.Model
        Trained Keras model
    X : array-like
        Input features
    y_true : array-like
        True labels (0 or 1)
    threshold : float
        Classification threshold
        
    Returns:
    --------
    dict : Dictionary containing all metrics
    """
    y_pred_proba = model.predict(X, verbose=0).flatten()
    y_pred = (y_pred_proba > threshold).astype('int32')
    
    metrics = {
        'accuracy': accuracy_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'auc': roc_auc_score(y_true, y_pred_proba),
        'balanced_accuracy': balanced_accuracy_score(y_true, y_pred),
    }
    
    return metrics


# =============================================================================
# USAGE EXAMPLES
# =============================================================================
# 
# # Build models
# slp = build_binary_classifier(INPUT_DIMENSION, name='SLP')
# dnn = build_binary_classifier(INPUT_DIMENSION, hidden_units=[64], name='DNN')
# dnn_reg = build_binary_classifier(INPUT_DIMENSION, hidden_units=[64], 
#                                   dropout=0.3, l2_reg=0.001, name='DNN_Regularized')
# 
# # Train
# history = train_model(dnn, X_train, y_train, X_val, y_val, 
#                       class_weights=CLASS_WEIGHTS)
# 
# # Evaluate
# metrics = evaluate_binary_model(dnn, X_val, y_val)
# print(f"F1-Score: {metrics['f1']:.4f}")
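
The `threshold` argument of `evaluate_binary_model` controls how predicted probabilities become class labels. The notebook uses 0.5 throughout; the sketch below only illustrates how F1 responds to other values. The labels and probabilities are made up, and `f1_at_threshold` is a hypothetical pure-Python helper rather than part of the notebook.

```python
def f1_at_threshold(y_true, y_proba, threshold=0.5):
    """Compute the positive-class F1-Score for a given decision threshold."""
    y_pred = [int(p > threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels and predicted probabilities for illustration
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_proba = [0.1, 0.2, 0.3, 0.35, 0.6, 0.45, 0.4, 0.7, 0.9, 0.05]

for threshold in (0.3, 0.5, 0.7):
    print(f"threshold={threshold}: F1={f1_at_threshold(y_true, y_proba, threshold):.3f}")
```

If a threshold other than 0.5 is ever considered, it should be chosen on the validation set so that the test set stays untouched.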