# Industrial Pump Predictive Maintenance using RNN

**Course:** 62FIT4ATI - Artificial Intelligence

**Topic 2:** Recurrent Neural Network for Predictive Maintenance

---

## Section 1: Problem Formulation

### 1.1 Problem Overview

**Predictive maintenance** is a proactive maintenance strategy that uses data analysis and machine learning to predict when equipment might fail, allowing maintenance to be scheduled before failures occur. This approach offers significant advantages over traditional reactive maintenance (fixing after failure) or preventive maintenance (scheduled maintenance regardless of condition).

In this project, we develop a **Recurrent Neural Network (RNN)** model to predict the operational status of industrial pumps based on time-series sensor data. The goal is to classify the machine's status into one of three categories:

| Class | Description |
|-------|-------------|
| **NORMAL** | The pump is operating within normal parameters |
| **RECOVERING** | The pump is in a recovery state after an issue |
| **BROKEN** | The pump has failed or is in a failure state |

### 1.2 Why RNN for This Problem?

Industrial sensor data is inherently **temporal**: the current state of a machine depends on its previous states. Many traditional machine learning models treat each data point independently and therefore lose valuable sequential information. RNNs, particularly **LSTM (Long Short-Term Memory)** networks, are designed to:

1. **Capture temporal dependencies**: Learn patterns across time steps
2. **Handle variable-length sequences**: Process sensor readings over different time windows
3. **Remember long-term patterns**: Detect gradual degradation that precedes failures

### 1.3 Key Challenges

This project presents several significant challenges:

#### Challenge 1: Extreme Class Imbalance

The dataset exhibits severe class imbalance:
- **NORMAL**: 205,836 samples (93.43%)
- **RECOVERING**: 14,477 samples (6.57%)
- **BROKEN**: 7 samples (0.003%)

This imbalance means a naive model could achieve >93% accuracy by always predicting "NORMAL", while completely failing to detect actual failures. We address this through:
- Class weighting during training
- Focal loss function
- Appropriate evaluation metrics (F1-score, macro-average)
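
To make the accuracy trap concrete, the sketch below scores a hypothetical classifier that always predicts NORMAL, using only the class counts listed above. The helper name `f1_always_predicting` is illustrative and not part of the project code.

```python
# Class counts taken from the dataset summary above.
counts = {"NORMAL": 205_836, "RECOVERING": 14_477, "BROKEN": 7}
total = sum(counts.values())


def f1_always_predicting(majority: str, cls: str) -> float:
    """F1 for class `cls` when every sample is predicted as `majority`."""
    tp = counts[cls] if cls == majority else 0
    fp = total - counts[cls] if cls == majority else 0
    fn = 0 if cls == majority else counts[cls]
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


accuracy = counts["NORMAL"] / total
per_class_f1 = {c: f1_always_predicting("NORMAL", c) for c in counts}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(f"Accuracy: {accuracy:.4f}")
print(f"Macro F1: {macro_f1:.4f}")
```

The baseline reaches roughly 0.93 accuracy but a macro F1 of only about 0.32, because the F1-scores for RECOVERING and BROKEN are both zero. This is why macro-averaged metrics are the more informative choice here.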

#### Challenge 2: Temporal Pattern Recognition

Equipment failures often develop gradually through subtle changes in sensor readings. The model must learn to:
- Identify early warning signs in sensor patterns
- Distinguish between normal variations and anomalous trends
- Capture both short-term fluctuations and long-term degradation

#### Challenge 3: High-Dimensional Input

With 52 sensor features, the model must handle high-dimensional input while avoiding overfitting. We employ:
- Dropout regularization
- Feature normalization
- Appropriate model architecture

### 1.4 Project Objectives

1. Build an LSTM-based classifier for machine status prediction
2. Implement techniques to handle extreme class imbalance
3. Apply optimization techniques for stable RNN training
4. Evaluate model performance with appropriate metrics
5. Create an inference pipeline for new sensor data

## Section 2: Identify Inputs and Outputs

### 2.1 Dataset Overview

The dataset contains time-series sensor readings from industrial pumps with the following characteristics:

- **Total samples**: 220,320 time-series records
- **Time period**: Continuous sensor readings at regular intervals
- **Features**: 52 continuous sensor measurements + timestamp
- **Target**: Machine operational status (3 classes)

### 2.2 Input Features (52 Sensors)

The input consists of **52 continuous sensor measurements** (sensor_00 to sensor_51) that capture various physical properties of the industrial pump:

| Feature Group | Sensors | Description |
|--------------|---------|-------------|
| sensor_00 - sensor_51 | 52 sensors | Continuous measurements including temperature, pressure, vibration, flow rate, and other operational parameters |

Each sensor provides real-valued measurements that vary over time, capturing the operational state of the pump.

### 2.3 Output Target (Machine Status)

The target variable **machine_status** is a categorical variable with three possible values:

| Class | Label | Description | Count | Percentage |
|-------|-------|-------------|-------|------------|
| 0 | NORMAL | Pump operating normally | 205,836 | 93.43% |
| 1 | RECOVERING | Pump in recovery state | 14,477 | 6.57% |
| 2 | BROKEN | Pump has failed | 7 | 0.003% |

The model will output probability distributions over these three classes using softmax activation.
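
As a minimal illustration of how softmax turns raw scores into class probabilities, consider the NumPy sketch below. The logits are made-up values for a single input window, and `softmax` is a stand-alone helper rather than the layer used inside the model.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


CLASS_NAMES = ["NORMAL", "RECOVERING", "BROKEN"]
logits = np.array([[4.0, 1.0, -2.0]])  # hypothetical raw scores for one window
probs = softmax(logits)
predicted = CLASS_NAMES[int(probs.argmax(axis=-1)[0])]

print(dict(zip(CLASS_NAMES, probs[0].round(4))), "->", predicted)
```

The probabilities sum to 1, and the predicted class is simply the one with the largest probability.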

### 2.4 Data Shapes and Types

**Raw Data Shape:**
- Input: `(220320, 52)` - 220,320 samples × 52 features
- Target: `(220320,)` - 220,320 labels

**After Sequence Creation (for RNN):**
- Input: `(n_sequences, 60, 52)` - 3D tensor
- Target: `(n_sequences, 3)` - One-hot encoded labels
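
One possible way to obtain the 3D input tensor is a sliding window over the time axis. The sketch below is an assumption about how sequences could be built, not the project's actual preprocessing: it uses a window of 60 steps with stride 1 and labels each window with the status at its final time step. The name `make_sequences` and the synthetic data are illustrative.

```python
import numpy as np


def make_sequences(X, y, seq_length=60, n_classes=3):
    """Slice (n_samples, n_features) readings into overlapping windows.

    Assumption: each window is labelled with the status at its last time step.
    """
    n_windows = X.shape[0] - seq_length + 1
    if n_windows <= 0:
        raise ValueError("Need at least seq_length samples")
    # Row i of `idx` holds the sample indices i, i+1, ..., i+seq_length-1.
    idx = np.arange(n_windows)[:, None] + np.arange(seq_length)[None, :]
    X_seq = X[idx]                      # (n_windows, seq_length, n_features)
    y_last = y[idx[:, -1]]              # label at the final step of each window
    y_onehot = np.eye(n_classes, dtype=np.float32)[y_last]
    return X_seq, y_onehot


# Tiny synthetic stand-in for the real data: 200 samples, 52 sensors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 52)).astype(np.float32)
y_demo = rng.integers(0, 3, size=200)

X_seq, y_seq = make_sequences(X_demo, y_demo)
print(X_seq.shape, y_seq.shape)  # (141, 60, 52) (141, 3)
```

Because fancy indexing copies the data, materializing every window of the full dataset this way can be memory-hungry; a generator-based approach may be preferable in practice.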

**Data Types:**
- Sensor features: `float64` (continuous values)
- Timestamp: 
- Machine status:  (categorical string)

In [None]:
# Setup: Install dependencies and configure environment
import sys

# Check if running in Colab
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    # Mount Google Drive
    from google.colab import drive
    drive.mount("/content/drive")
    
    # Install required packages
    !pip install -q hypothesis imbalanced-learn
else:
    print("Running locally - ensure dependencies are installed via requirements.txt")

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option("display.max_columns", 60)
plt.style.use("seaborn-v0_8-whitegrid")

print("Libraries imported successfully!")

In [None]:
# Import our custom data loader
import sys
# Make the project's "src" directory importable (only add it once)
if "src" not in sys.path:
    sys.path.insert(0, "src")

from data_loader import load_csv, get_feature_columns, get_target_column, get_class_names

# Display input/output specifications
print("=" * 50)
print("INPUT FEATURES")
print("=" * 50)
feature_cols = get_feature_columns()
print(f"Number of sensor features: {len(feature_cols)}")
print(f"Feature names: {feature_cols[:5]} ... {feature_cols[-3:]}")

print("\n" + "=" * 50)
print("OUTPUT TARGET")
print("=" * 50)
print(f"Target column: {get_target_column()}")
print(f"Class names: {get_class_names()}")

## Section 3: Data Preparation - Inspection

In this section, we load and inspect the sensor dataset to understand its structure, identify data quality issues, and visualize key characteristics.

### 3.1 Load and Display Dataset Information

In [None]:
# Load the sensor data
# For Colab: Update path to your Google Drive location
# For local: Use relative path

DATA_PATH = "sensor.csv"  # Update this path as needed

df = load_csv(DATA_PATH)

print("Dataset Shape:", df.shape)
print(f"\nTotal samples: {len(df):,}")
print(f"Total columns: {len(df.columns)}")

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Display data types
print("Data Types:")
print(df.dtypes.value_counts())
print("\nDetailed column info:")
df.info()

In [None]:
# Statistical summary of sensor features
feature_cols = get_feature_columns()
print("Statistical Summary of Sensor Features:")
df[feature_cols].describe()

### 3.2 Class Distribution Analysis

In [None]:
# Analyze class distribution
target_col = get_target_column()
class_counts = df[target_col].value_counts()
class_percentages = df[target_col].value_counts(normalize=True) * 100

print("Class Distribution:")
print("=" * 50)
for cls in class_counts.index:
    count = class_counts[cls]
    pct = class_percentages[cls]
    print(f"{cls:12s}: {count:>10,} samples ({pct:>6.3f}%)")
print("=" * 50)
print(f"Total: {len(df):>16,} samples")

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
colors = ["#2ecc71", "#f39c12", "#e74c3c"]
ax1 = axes[0]
bars = ax1.bar(class_counts.index, class_counts.values, color=colors)
ax1.set_xlabel("Machine Status")
ax1.set_ylabel("Count")
ax1.set_title("Class Distribution (Bar Chart)")
ax1.set_yscale("log")  # Log scale due to extreme imbalance

# Add count labels on bars
for bar, count in zip(bars, class_counts.values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height(), 
             f"{count:,}", ha="center", va="bottom", fontsize=10)

# Pie chart
ax2 = axes[1]
ax2.pie(class_counts.values, labels=class_counts.index, autopct="%1.2f%%",
        colors=colors, explode=[0, 0.05, 0.1])
ax2.set_title("Class Distribution (Pie Chart)")

plt.tight_layout()
plt.show()

print("\n⚠️ CRITICAL: Extreme class imbalance detected!")
# Use single quotes inside the f-string so it parses on Python versions before 3.12
print(f"   BROKEN class has only {class_counts.get('BROKEN', 0)} samples ({class_percentages.get('BROKEN', 0):.4f}%)")

### 3.3 Missing Values Analysis

In [None]:
# Check for missing values
missing_counts = df[feature_cols].isnull().sum()
missing_pct = (missing_counts / len(df)) * 100

# Create summary DataFrame
missing_df = pd.DataFrame({
    "Missing Count": missing_counts,
    "Missing %": missing_pct
}).sort_values("Missing Count", ascending=False)

# Show only columns with missing values
missing_with_nulls = missing_df[missing_df["Missing Count"] > 0]

print(f"Columns with missing values: {len(missing_with_nulls)} out of {len(feature_cols)}")
print("\nMissing Values Summary:")
if len(missing_with_nulls) > 0:
    print(missing_with_nulls)
else:
    print("No missing values found in sensor columns!")

print(f"\nTotal missing values: {missing_counts.sum():,}")

In [None]:
# Visualize missing values pattern
if missing_counts.sum() > 0:
    fig, ax = plt.subplots(figsize=(12, 4))
    
    # Show missing values heatmap for columns with missing data
    cols_with_missing = missing_with_nulls.index.tolist()[:10]  # Top 10
    if cols_with_missing:
        sns.heatmap(df[cols_with_missing].isnull().T, cbar=True, 
                    yticklabels=True, cmap="YlOrRd", ax=ax)
        ax.set_title("Missing Values Pattern (Top 10 columns)")
        ax.set_xlabel("Sample Index")
        plt.tight_layout()
        plt.show()
else:
    print("No missing values to visualize.")

### 3.4 Sensor Distributions

In [None]:
# Visualize distribution of selected sensors
selected_sensors = ["sensor_00", "sensor_10", "sensor_20", "sensor_30", "sensor_40", "sensor_50"]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for idx, sensor in enumerate(selected_sensors):
    ax = axes[idx]
    df[sensor].hist(bins=50, ax=ax, color="steelblue", edgecolor="white")
    ax.set_title(f"{sensor} Distribution")
    ax.set_xlabel("Value")
    ax.set_ylabel("Frequency")

plt.suptitle("Distribution of Selected Sensors", fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

### 3.5 Sensor Correlations

In [None]:
# Compute correlation matrix for sensor features
# Using a subset for visualization clarity
sensor_subset = feature_cols[:20]  # First 20 sensors

corr_matrix = df[sensor_subset].corr()

fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=False, cmap="RdBu_r", center=0,
            square=True, linewidths=0.5, ax=ax)
ax.set_title("Correlation Matrix (First 20 Sensors)")
plt.tight_layout()
plt.show()

# Find highly correlated pairs
high_corr_threshold = 0.9
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > high_corr_threshold:
            high_corr_pairs.append((
                corr_matrix.columns[i],
                corr_matrix.columns[j],
                corr_matrix.iloc[i, j]
            ))

print(f"\nHighly correlated sensor pairs (|r| > {high_corr_threshold}):")
for s1, s2, corr in high_corr_pairs[:10]:
    print(f"  {s1} <-> {s2}: {corr:.3f}")

### 3.6 Time Series Visualization

In [None]:
# Visualize sensor readings over time with machine status
fig, axes = plt.subplots(3, 1, figsize=(15, 10), sharex=True)

# Sample a subset for visualization (every 100th point)
sample_df = df.iloc[::100].copy()

# Plot selected sensors
sensors_to_plot = ["sensor_00", "sensor_25", "sensor_50"]

for idx, sensor in enumerate(sensors_to_plot):
    ax = axes[idx]
    
    # Color by machine status
    colors_map = {"NORMAL": "green", "RECOVERING": "orange", "BROKEN": "red"}
    for status, color in colors_map.items():
        mask = sample_df["machine_status"] == status
        ax.scatter(sample_df.index[mask], sample_df.loc[mask, sensor],
                   c=color, label=status, alpha=0.5, s=1)
    
    ax.set_ylabel(sensor)
    ax.legend(loc="upper right")

axes[-1].set_xlabel("Sample Index")
plt.suptitle("Sensor Readings Over Time (colored by Machine Status)", fontsize=14)
plt.tight_layout()
plt.show()

### 3.7 Class Imbalance Handling Analysis

The extreme class imbalance in this dataset presents a critical challenge for model training. Without proper handling, the model would simply learn to predict the majority class (NORMAL) and achieve high accuracy while completely failing to detect actual failures.

#### The Imbalance Problem

| Class | Count | Percentage | Imbalance Ratio |
|-------|-------|------------|----------------|
| NORMAL | 205,836 | 93.43% | 1x (baseline) |
| RECOVERING | 14,477 | 6.57% | ~14x underrepresented |
| BROKEN | 7 | 0.003% | ~29,405x underrepresented |

#### Techniques to Address Imbalance

We employ two main techniques:

1. **Class Weighting**: Assign higher weights to minority classes during training
2. **Focal Loss**: A loss function that down-weights well-classified examples and focuses on hard cases

In [None]:
# Import our imbalance handler module
from imbalance_handler import compute_class_weights, get_class_distribution, compute_alpha_from_weights

# Get class distribution statistics
target_col = get_target_column()

# Encode labels for weight computation
from preprocessor import encode_labels
labels_encoded, label_encoder = encode_labels(
    df[target_col], 
    class_order=['NORMAL', 'RECOVERING', 'BROKEN']
)

# Compute class weights using inverse frequency
class_weights = compute_class_weights(labels_encoded)

print("Class Weights (Inverse Frequency):")
print("=" * 60)
class_names = ['NORMAL', 'RECOVERING', 'BROKEN']
for idx, name in enumerate(class_names):
    weight = class_weights[idx]
    print(f"{name:12s}: weight = {weight:>12.4f}")
print("=" * 60)

In [None]:
# Visualize class weights
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart of class weights
ax1 = axes[0]
colors = ['#2ecc71', '#f39c12', '#e74c3c']
bars = ax1.bar(class_names, [class_weights[i] for i in range(3)], color=colors)
ax1.set_xlabel('Machine Status')
ax1.set_ylabel('Class Weight')
ax1.set_title('Class Weights (Inverse Frequency)')
ax1.set_yscale('log')  # Log scale due to extreme differences

# Add weight labels on bars
for bar, idx in zip(bars, range(3)):
    weight = class_weights[idx]
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height(), 
             f'{weight:.2f}', ha='center', va='bottom', fontsize=10)

# Pie chart showing effective contribution after weighting
ax2 = axes[1]
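# Effective contribution of each class to the loss: weight × sample count
# (should come out roughly equal if the weights balance the classes)
weighted_contributions = [class_weights[i] * class_counts.get(name, 0)
                          for i, name in enumerate(class_names)]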
ax2.pie(weighted_contributions, labels=class_names, autopct='%1.1f%%',
        colors=colors, explode=[0, 0.05, 0.1])
ax2.set_title('Effective Class Contribution After Weighting')

plt.tight_layout()
plt.show()

In [None]:
# Verify class weight proportionality
# The product of weight * count should be approximately equal for all classes
class_counts = df[target_col].value_counts()

print("Verification: weight × count should be approximately equal for all classes")
print("=" * 70)
products = []
for idx, name in enumerate(class_names):
    count = class_counts.get(name, 0)
    weight = class_weights[idx]
    product = weight * count
    products.append(product)
    print(f"{name:12s}: {weight:>12.4f} × {count:>10,} = {product:>15.2f}")

print("=" * 70)
print(f"Mean product: {np.mean(products):,.2f}")
print(f"Std deviation: {np.std(products):,.2f}")
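# Only report success if the products really are close to each other (within 5%)
if np.allclose(products, np.mean(products), rtol=0.05):
    print("\n✓ Products are approximately equal, confirming balanced contribution to loss.")
else:
    print("\n✗ Products differ by more than 5%; check the class weight computation.")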

#### Focal Loss Explanation

In addition to class weighting, we use **Focal Loss** to further address the imbalance:

$$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

Where:
- $p_t$ is the probability of the correct class
- $\gamma$ (gamma) is the focusing parameter (we use $\gamma = 2$)
- $\alpha_t$ is the class weight

**How Focal Loss Helps:**
- When $p_t$ is high (easy example), $(1-p_t)^\gamma$ becomes small, reducing the loss contribution
- When $p_t$ is low (hard example), the loss remains high
- This focuses training on hard-to-classify examples, which are often minority class samples
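
The formula can be written out for one-hot targets as in the NumPy sketch below. This is an illustrative stand-alone function (`focal_loss` is a name chosen here), not the loss implementation used by the project's training code, and the uniform `alpha` is for demonstration only.

```python
import numpy as np


def focal_loss(y_true, y_pred, alpha, gamma=2.0, eps=1e-7):
    """Mean multi-class focal loss for one-hot targets.

    y_true: (batch, n_classes) one-hot labels
    y_pred: (batch, n_classes) softmax probabilities
    alpha:  (n_classes,) per-class weights
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    p_t = (y_true * y_pred).sum(axis=-1)       # probability of the true class
    alpha_t = (y_true * alpha).sum(axis=-1)    # weight of the true class
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


y_true = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
y_pred = np.array([[0.9, 0.08, 0.02],    # easy NORMAL example
                   [0.6, 0.30, 0.10]])   # hard BROKEN example
alpha = np.ones(3)  # uniform weights, for illustration only

ce = focal_loss(y_true, y_pred, alpha, gamma=0.0)  # gamma=0 reduces to cross-entropy
fl = focal_loss(y_true, y_pred, alpha, gamma=2.0)
print(f"cross-entropy={ce:.4f}, focal={fl:.4f}")
```

With $\gamma = 0$ the modulating factor is 1 and the expression reduces to weighted cross-entropy; with $\gamma = 2$ the easy example is almost entirely suppressed, so the hard example dominates the batch loss.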

In [None]:
# Demonstrate focal loss behavior
import numpy as np

# Compare cross-entropy vs focal loss
p_t = np.linspace(0.01, 0.99, 100)  # Probability of true class

# Cross-entropy loss: -log(p_t)
ce_loss = -np.log(p_t)

# Focal loss with different gamma values
gamma_values = [0, 1, 2, 5]
focal_losses = {}
for gamma in gamma_values:
    focal_losses[gamma] = -((1 - p_t) ** gamma) * np.log(p_t)

# Plot comparison
fig, ax = plt.subplots(figsize=(10, 6))

for gamma in gamma_values:
    label = f'γ={gamma}' + (' (Cross-Entropy)' if gamma == 0 else '')
    ax.plot(p_t, focal_losses[gamma], label=label, linewidth=2)

ax.set_xlabel('Probability of True Class ($p_t$)', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Focal Loss vs Cross-Entropy Loss', fontsize=14)
ax.legend(fontsize=10)
ax.set_xlim([0, 1])
ax.set_ylim([0, 5])
ax.grid(True, alpha=0.3)

# Add annotation
ax.annotate('Well-classified\n(low loss)', xy=(0.9, 0.3), fontsize=10,
            ha='center', color='green')
ax.annotate('Hard examples\n(high loss)', xy=(0.2, 3), fontsize=10,
            ha='center', color='red')

plt.tight_layout()
plt.show()

print("\nKey Insight:")
print("- With γ=2, well-classified examples (p_t > 0.8) contribute very little to the loss")
print("- This allows the model to focus on learning the minority classes")

### 3.8 Data Inspection Summary

**Key Findings:**

1. **Dataset Size**: 220,320 samples with 52 sensor features
2. **Class Imbalance**: Extreme imbalance with BROKEN class having only 7 samples (0.003%)
3. **Missing Values**: Some sensors have missing values that need imputation
4. **Feature Correlations**: Several sensors show high correlation, suggesting potential redundancy
5. **Data Types**: All sensor features are continuous (float64)

**Imbalance Handling Strategy:**
- **Class Weights**: NORMAL=0.36, RECOVERING=5.14, BROKEN=10,490.29
- **Focal Loss**: γ=2.0 to focus on hard-to-classify examples
- **Evaluation Metrics**: Macro F1-score and per-class metrics (not just accuracy)

**Next Steps:**
- Handle missing values using forward fill imputation
- Normalize features using StandardScaler
- Create sequences for RNN input
- Apply class weighting and focal loss during training
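
The first two steps could look roughly like the sketch below, shown on a tiny synthetic frame. The column values, the 4/2 chronological split, and the trailing backward fill for leading gaps are assumptions made for illustration; the manual mean and standard deviation mirror what `StandardScaler` computes.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the sensor columns.
demo = pd.DataFrame({
    "sensor_00": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "sensor_01": [np.nan, 10.0, 10.0, np.nan, 30.0, 50.0],
})

# Forward fill carries the last valid reading forward in time;
# the backward fill only covers gaps at the very start of a column.
filled = demo.ffill().bfill()

# Chronological split (assumed 4/2 here) so no future data leaks into the fit.
train, test = filled.iloc[:4], filled.iloc[4:]

# Same statistics StandardScaler would use: per-column mean and population std.
mean = train.mean()
std = train.std(ddof=0).replace(0.0, 1.0)  # guard against constant columns
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(filled)
print(train_scaled.round(3))
```

Fitting the scaling statistics on the training portion only, and reusing them for later data, keeps information from the evaluation period out of the model inputs.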

## Section 4: Optimization Techniques

Training RNNs on imbalanced time-series data requires careful optimization to achieve stable convergence and good generalization. This section explains the optimization techniques we employ.

### 4.1 Class Weighting Strategy

Class weighting assigns higher importance to minority classes during training by scaling the loss contribution of each sample based on its class.

**Formula:** $weight_i = \frac{N}{n_{classes} \times count_i}$

Where:
- $N$ = total number of samples
- $n_{classes}$ = number of classes (3)
- $count_i$ = number of samples in class $i$

**Effect:** The product $weight_i \times count_i$ becomes approximately equal for all classes, ensuring balanced contribution to the total loss.

| Class | Count | Weight | Effective Contribution |
|-------|-------|--------|------------------------|
| NORMAL | 205,836 | 0.36 | ~73,440 |
| RECOVERING | 14,477 | 5.14 | ~73,440 |
| BROKEN | 7 | 10,490.29 | ~73,440 |
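
The formula can be checked directly from the class counts. The sketch below is a plain-Python illustration (`inverse_frequency_weights` is a hypothetical helper, not the project's `compute_class_weights`). Evaluated on the full-dataset counts, it gives values close to, but not necessarily identical with, the weights in the table, for instance if those were computed on a subset of the labels.

```python
counts = {"NORMAL": 205_836, "RECOVERING": 14_477, "BROKEN": 7}


def inverse_frequency_weights(class_counts):
    """weight_i = N / (n_classes * count_i)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * count)
            for name, count in class_counts.items()}


weights = inverse_frequency_weights(counts)
for name, weight in weights.items():
    product = weight * counts[name]
    print(f"{name:12s} weight={weight:12.4f}  weight*count={product:,.0f}")
```

By construction, every `weight * count` product equals $N / n_{classes}$, which is 73,440 for these counts.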

### 4.2 Learning Rate Scheduling (ReduceLROnPlateau)

Learning rate scheduling dynamically adjusts the learning rate during training to improve convergence.

**ReduceLROnPlateau Strategy:**
- Monitor validation loss
- When loss stops improving for `patience` epochs, reduce learning rate by `factor`
- Continue until `min_lr` is reached

**Our Configuration:**
```python
ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,        # Reduce LR by half
    patience=5,        # Wait 5 epochs before reducing
    min_lr=1e-6        # Minimum learning rate
)
```

**Benefits:**
- Starts with larger learning rate for fast initial progress
- Reduces learning rate for fine-tuning as training progresses
- Lets the optimizer settle into a minimum instead of oscillating around it
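
To see how this configuration behaves over a training run, the sketch below re-implements the plateau rule in plain Python. It is a simplified illustration rather than the Keras callback itself, and the validation losses are invented.

```python
def simulate_reduce_on_plateau(val_losses, lr=1e-3, factor=0.5,
                               patience=5, min_lr=1e-6):
    """Return the learning rate in effect during each epoch.

    Simplified re-implementation for illustration; the Keras callback also
    supports `min_delta` and `cooldown`, which are ignored here.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        history.append(lr)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never go below min_lr
                wait = 0
    return history


# Hypothetical validation losses: improvement for 3 epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.7] * 12
lrs = simulate_reduce_on_plateau(losses)
for epoch, (loss, lr) in enumerate(zip(losses, lrs)):
    print(f"epoch {epoch:2d}  val_loss={loss:.2f}  lr={lr:.6f}")
```

In this run the learning rate is halved after each stretch of five epochs without improvement, and `min_lr` acts as a floor.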

### 4.3 Gradient Clipping for RNN Stability

RNNs are prone to the **exploding gradient problem** where gradients can grow exponentially during backpropagation through time (BPTT).

**Gradient Clipping** limits the maximum norm of gradients:

$$\text{if } ||g|| > \text{threshold}: g \leftarrow \frac{g \times \text{threshold}}{||g||}$$

**Our Configuration:**
```python
Adam(learning_rate=0.001, clipnorm=1.0)
```

**Benefits:**
- Prevents gradient explosion during training
- Maintains training stability with long sequences
- Allows use of higher learning rates without divergence
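
The clipping rule itself is only a few lines. The NumPy sketch below applies it to a single gradient array; `clip_by_norm` is an illustrative helper. As far as the Keras documentation describes it, `clipnorm` clips each weight's gradient individually, whereas `global_clipnorm` uses the norm across all weights.

```python
import numpy as np


def clip_by_norm(grad: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Rescale `grad` so its L2 norm does not exceed `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


large = np.array([3.0, 4.0])   # norm 5.0, gets rescaled to norm 1.0
small = np.array([0.3, 0.4])   # norm 0.5, left unchanged

print(clip_by_norm(large), np.linalg.norm(clip_by_norm(large)))
print(clip_by_norm(small), np.linalg.norm(clip_by_norm(small)))
```

Clipping changes only the length of the gradient, not its direction, so the update still points the same way but takes a bounded step.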

### 4.4 Early Stopping to Prevent Overfitting

Early stopping monitors validation performance and stops training when the model begins to overfit.

**Our Configuration:**
```python
EarlyStopping(
    monitor='val_loss',
    patience=10,           # Stop after 10 epochs without improvement
    restore_best_weights=True  # Restore weights from best epoch
)
```

**Benefits:**
- Prevents overfitting to training data
- Saves computation time by stopping unnecessary epochs
- Automatically selects the best model checkpoint
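
The stopping rule can be traced by hand with the plain-Python sketch below. It is a simplified stand-in for the Keras callback (no `min_delta`, no weight handling), and the loss curve is invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Simplified sketch: training stops once `patience` consecutive epochs
    fail to improve on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical curve: improves until epoch 4, then slowly overfits.
losses = [0.9, 0.7, 0.6, 0.55, 0.5] + [0.5 + 0.01 * k for k in range(1, 21)]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With `restore_best_weights=True`, the weights kept after stopping would be those from the best epoch rather than the last one trained.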

### 4.5 Optimization Summary

| Technique | Purpose | Configuration |
|-----------|---------|---------------|
| Class Weighting | Handle class imbalance | Inverse frequency weights |
| Focal Loss | Focus on hard examples | γ=2.0, α=class weights |
| Learning Rate Scheduling | Adaptive learning | ReduceLROnPlateau, factor=0.5 |
| Gradient Clipping | Prevent exploding gradients | clipnorm=1.0 |
| Early Stopping | Prevent overfitting | patience=10 epochs |
| Dropout | Regularization | rate=0.3 |

## Section 5: Neural Network Model

In this section, we define, build, and compile our LSTM-based neural network for multi-class classification of machine status.

### 5.1 Why LSTM for Time-Series Classification?

**Long Short-Term Memory (LSTM)** networks are a type of Recurrent Neural Network (RNN) specifically designed to learn long-term dependencies in sequential data.

#### LSTM Architecture

Each LSTM cell contains three gates that control information flow:

1. **Forget Gate**: Decides what information to discard from the cell state
   - $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

2. **Input Gate**: Decides what new information to store in the cell state
   - $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
   - $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
   - $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (cell state update, combining the forget and input gates)

3. **Output Gate**: Decides what to output based on the cell state
   - $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
   - $h_t = o_t \odot \tanh(C_t)$
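
The gate equations translate almost line by line into NumPy. The sketch below runs a single LSTM cell over 60 time steps, using the standard cell state update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. The stacked weight layout, the small hidden size, and the random inputs are assumptions chosen for the demo, not the layout Keras uses internally.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps the concatenated [h_prev, x_t] to the four gate pre-activations,
    stacked in the (assumed) order forget, input, candidate, output.
    """
    hidden = h_prev.shape[-1]
    z = np.concatenate([h_prev, x_t], axis=-1) @ W + b
    f_t = sigmoid(z[..., 0 * hidden:1 * hidden])      # forget gate
    i_t = sigmoid(z[..., 1 * hidden:2 * hidden])      # input gate
    c_tilde = np.tanh(z[..., 2 * hidden:3 * hidden])  # candidate cell state
    o_t = sigmoid(z[..., 3 * hidden:4 * hidden])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde                # cell state update
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
n_features, hidden = 52, 8  # small hidden size for the demo
W = rng.normal(scale=0.1, size=(hidden + n_features, 4 * hidden))
b = np.zeros(4 * hidden)

h = np.zeros((1, hidden))
c = np.zeros((1, hidden))
for x_t in rng.normal(size=(60, 1, n_features)):  # 60 time steps, batch of 1
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

The cell state is updated additively, which is what lets gradients flow across many time steps more easily than in a standard RNN.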

#### Why LSTM Over Standard RNN?

| Feature | Standard RNN | LSTM |
|---------|-------------|------|
| Long-term memory | Poor (vanishing gradients) | Much better (additive cell state) |
| Gradient flow | Degrades over time | Better preserved via gates |
| Training stability | Difficult | More stable |
| Sequence length | Practical mainly for short sequences | Handles considerably longer sequences |

### 5.2 Model Architecture Design

Our LSTM model architecture is designed to:
1. Process sequences of 60 time steps with 52 sensor features
2. Learn hierarchical temporal patterns through stacked LSTM layers
3. Prevent overfitting through dropout regularization
4. Output probability distributions over 3 classes

```
┌─────────────────────────────────────────────────────────┐
│                    Input Layer                          │
│              Shape: (batch, 60, 52)                     │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   LSTM Layer 1                          │
│         128 units, return_sequences=True                │
│         Output: (batch, 60, 128)                        │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   Dropout (0.3)                         │
│         Randomly zeroes 30% of activations              │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   LSTM Layer 2                          │
│         64 units, return_sequences=False                │
│         Output: (batch, 64)                             │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   Dropout (0.3)                         │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   Dense Layer                           │
│         32 units, ReLU activation                       │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│                   Output Layer                          │
│         3 units, Softmax activation                     │
│         Output: (batch, 3) - probabilities              │
└─────────────────────────────────────────────────────────┘
```

### 5.3 Hyperparameter Choices

| Hyperparameter | Value | Rationale |
|----------------|-------|----------|
| Sequence Length | 60 | Captures ~1 hour of sensor data (assuming 1-min intervals) |
| LSTM Units (Layer 1) | 128 | Sufficient capacity to learn complex patterns |
| LSTM Units (Layer 2) | 64 | Reduces dimensionality while preserving key features |
| Dropout Rate | 0.3 | Balances regularization without losing too much information |
| Dense Units | 32 | Intermediate layer before classification |
| Learning Rate | 0.001 | Standard starting point for Adam optimizer |
| Batch Size | 64 | Good balance between training speed and gradient stability |

### 5.4 Regularization Techniques

#### Dropout Regularization

Dropout randomly sets a fraction of input units to 0 during training, which:
- Prevents co-adaptation of neurons
- Acts as an ensemble of multiple networks
- Reduces overfitting on the training data

**Our Implementation:**
- Dropout rate: 0.3 (30% of neurons dropped)
- Applied after each LSTM layer
- Only active during training (disabled during inference)
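
The training and inference behaviour can be illustrated with a small NumPy sketch of inverted dropout, the variant in which kept units are rescaled by `1 / (1 - rate)` so the expected activation is unchanged. The `dropout` function below is a stand-alone illustration, not the Keras layer.

```python
import numpy as np


def dropout(x, rate=0.3, training=True, rng=None):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 64))
train_out = dropout(activations, rate=0.3, training=True, rng=rng)
infer_out = dropout(activations, rate=0.3, training=False)

print(f"fraction zeroed in training: {(train_out == 0).mean():.3f}")
print(f"mean activation in training: {train_out.mean():.3f}")
print(f"unchanged at inference: {np.array_equal(infer_out, activations)}")
```

Roughly 30% of the units are zeroed in training while the mean activation stays close to its original value; at inference the input passes through untouched.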

#### Why Dropout is Important for This Problem:
1. **High-dimensional input**: 52 sensors can lead to overfitting
2. **Class imbalance**: Model might memorize minority class examples
3. **Limited BROKEN samples**: Only 7 samples make overfitting very likely

In [None]:
# Import TensorFlow and model builder
import tensorflow as tf
from model_builder import build_model, compile_model, get_model_summary

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU')) > 0}")

In [None]:
# Define model configuration
MODEL_CONFIG = {
    'seq_length': 60,           # Number of time steps
    'n_features': 52,           # Number of sensor features
    'lstm_units': [128, 64],    # Units in each LSTM layer
    'dropout_rate': 0.3,        # Dropout rate for regularization
    'n_classes': 3,             # NORMAL, RECOVERING, BROKEN
}

TRAINING_CONFIG = {
    'learning_rate': 0.001,
    'loss': 'focal',            # Use focal loss for imbalanced data
    'focal_loss_gamma': 2.0,
    'clipnorm': 1.0,            # Gradient clipping
}

print("Model Configuration:")
for key, value in MODEL_CONFIG.items():
    print(f"  {key}: {value}")

print("\nTraining Configuration:")
for key, value in TRAINING_CONFIG.items():
    print(f"  {key}: {value}")

In [None]:
# Build the LSTM model
model = build_model(
    seq_length=MODEL_CONFIG['seq_length'],
    n_features=MODEL_CONFIG['n_features'],
    lstm_units=MODEL_CONFIG['lstm_units'],
    dropout_rate=MODEL_CONFIG['dropout_rate'],
    n_classes=MODEL_CONFIG['n_classes']
)

print("Model built successfully!")
print("\n" + "=" * 70)
print("MODEL ARCHITECTURE SUMMARY")
print("=" * 70)
model.summary()

In [None]:
# Compile the model with focal loss and gradient clipping
# First, compute alpha values from class weights for focal loss
alpha_values = compute_alpha_from_weights(class_weights, n_classes=3)
print(f"Focal loss alpha values: {alpha_values}")

model = compile_model(
    model=model,
    learning_rate=TRAINING_CONFIG['learning_rate'],
    loss=TRAINING_CONFIG['loss'],
    focal_loss_gamma=TRAINING_CONFIG['focal_loss_gamma'],
    focal_loss_alpha=alpha_values,
    clipnorm=TRAINING_CONFIG['clipnorm']
)

print("\nModel compiled successfully!")
print(f"  Optimizer: Adam (lr={TRAINING_CONFIG['learning_rate']}, clipnorm={TRAINING_CONFIG['clipnorm']})")
print(f"  Loss: Focal Loss (gamma={TRAINING_CONFIG['focal_loss_gamma']})")
print(f"  Metrics: accuracy")

In [None]:
# Verify model output with a sample input
import numpy as np

# Create a sample input batch
sample_input = np.random.randn(2, MODEL_CONFIG['seq_length'], MODEL_CONFIG['n_features']).astype(np.float32)

# Get predictions
sample_output = model.predict(sample_input, verbose=0)

print("Model Output Verification:")
print(f"  Input shape: {sample_input.shape}")
print(f"  Output shape: {sample_output.shape}")
print(f"  Output (probabilities):")
for i, probs in enumerate(sample_output):
    print(f"    Sample {i+1}: NORMAL={probs[0]:.4f}, RECOVERING={probs[1]:.4f}, BROKEN={probs[2]:.4f}")
    print(f"             Sum={probs.sum():.6f} (should be ~1.0)")

### 5.5 Model Architecture Summary

**Total Parameters:** 144,259
- LSTM Layer 1: 92,672 parameters
- LSTM Layer 2: 49,408 parameters  
- Dense Layer: 2,080 parameters
- Output Layer: 99 parameters
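
These counts follow from the standard formulas for LSTM and dense layers, and can be reproduced with the short sketch below. The helper names are illustrative; dropout layers have no trainable parameters and are therefore omitted.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim: int, units: int) -> int:
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


layers = {
    "LSTM Layer 1": lstm_params(52, 128),
    "LSTM Layer 2": lstm_params(128, 64),
    "Dense Layer": dense_params(64, 32),
    "Output Layer": dense_params(32, 3),
}
total = sum(layers.values())

for name, n_params in layers.items():
    print(f"{name:14s}: {n_params:>7,}")
print(f"{'Total':14s}: {total:>7,}")
```

The two LSTM layers account for nearly all of the parameters, which is typical for this kind of stacked architecture.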

**Key Design Decisions:**

1. **Stacked LSTM Layers**: Two LSTM layers allow the model to learn hierarchical temporal features - the first layer captures low-level patterns, the second captures higher-level abstractions.

2. **Decreasing Units**: 128 → 64 units creates a bottleneck that forces the model to learn compressed representations.

3. **Dropout After Each LSTM**: Prevents overfitting by randomly zeroing unit activations during training.

4. **Softmax Output**: Produces valid probability distributions that sum to 1.0, enabling confidence-based predictions.

5. **Focal Loss**: Addresses extreme class imbalance by focusing on hard-to-classify examples.

6. **Gradient Clipping**: Prevents exploding gradients common in RNN training.