In [12]:
!git clone https://github.com/santhoshkuamares/reinforcement-anomaly-detection.git
%cd reinforcement-anomaly-detection


Cloning into 'reinforcement-anomaly-detection'...
/content/reinforcement-anomaly-detection/reinforcement-anomaly-detection


**Step 1: Imports**

This section imports libraries for data handling (pandas, numpy), plotting (matplotlib, seaborn), preprocessing and metrics (scikit-learn), and deep reinforcement learning (PyTorch). It sets up everything needed for loading data, training the RL model, and evaluating results.

In [2]:
#Step 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque
sns.set(style="whitegrid")


**Step 2: Load Data**

The AI4I 2020 dataset is loaded from the UCI repository and cleaned by dropping the identifier columns (UDI, Product ID), which carry no predictive signal. The temperature-difference and power features are engineered in a later cell, after a quick look at the data.

In [3]:
# Step 2: Load Data
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv'
data = pd.read_csv(url)
data_clean = data.drop(['UDI', 'Product ID'], axis=1)


In [5]:
# EDA: shape, first rows, dtypes, and summary statistics
print(data_clean.shape)
print(data_clean.head())
print(data_clean.info())
print(data_clean.describe())

(10000, 12)
  Type  Air temperature [K]  Process temperature [K]  Rotational speed [rpm]  \
0    M                298.1                    308.6                    1551   
1    L                298.2                    308.7                    1408   
2    L                298.1                    308.5                    1498   
3    L                298.2                    308.6                    1433   
4    L                298.2                    308.7                    1408   

   Torque [Nm]  Tool wear [min]  Machine failure  TWF  HDF  PWF  OSF  RNF  
0         42.8                0                0    0    0    0    0    0  
1         46.3                3                0    0    0    0    0    0  
2         49.4                5                0    0    0    0    0    0  
3         39.5                7                0    0    0    0    0    0  
4         40.0                9                0    0    0    0    0    0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1

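**About the engineered features**

Temp_Diff (process temperature minus air temperature) captures how well the machine sheds heat. In the AI4I dataset description, heat dissipation failure (HDF) is tied to a small gap (below 8.6 K) combined with low rotational speed, so this difference is a direct signal for the HDF target used below.

Power (torque × rotational speed) uses physics to reflect the machine’s actual workload, which is more informative than torque or speed alone. Because speed is given in rpm, the product is proportional to mechanical power rather than a value in watts.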
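
As a quick check of the two formulas, the sketch below recomputes both features for the first row shown in the EDA output (air 298.1 K, process 308.6 K, 1551 rpm, 42.8 Nm). The helper names and the conversion to watts are illustrative additions and are not part of the notebook, which uses the raw torque × rpm product.

```python
import math

def temp_diff(process_temp_k, air_temp_k):
    """Process temperature minus air temperature, in kelvin."""
    return process_temp_k - air_temp_k

def power_raw(torque_nm, speed_rpm):
    """Raw product used as the 'Power' feature (units: Nm * rpm)."""
    return torque_nm * speed_rpm

def power_watts(torque_nm, speed_rpm):
    """Mechanical power in watts: torque times angular speed in rad/s.
    Hypothetical helper, shown only to relate the raw product to physical units."""
    omega = speed_rpm * 2 * math.pi / 60
    return torque_nm * omega

# First row of the EDA output
print(round(temp_diff(308.6, 298.1), 1))
print(round(power_raw(42.8, 1551), 1))
print(round(power_watts(42.8, 1551), 1))
```

The raw product and the value in watts differ only by the constant factor 2π/60, so after standardization they carry the same information.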


In [21]:
# Feature engineering
data_clean['Temp_Diff'] = data_clean['Process temperature [K]'] - data_clean['Air temperature [K]']
data_clean['Power'] = data_clean['Torque [Nm]'] * data_clean['Rotational speed [rpm]']

# Drop the overall failure flag and every failure-mode column so no label leaks into the inputs
features = data_clean.drop(['Machine failure', 'TWF','HDF','PWF','OSF','RNF'], axis=1)
# Target: the heat dissipation failure (HDF) flag
labels = data_clean['HDF']

categorical_cols = ['Type']
numerical_cols = features.select_dtypes(include=[np.number]).columns.tolist()
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(drop='first'), categorical_cols)
    ])



In [None]:
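# Split first, then fit the preprocessor on the training rows only,
# so test-set statistics do not leak into the scaler
features_train, features_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
X_train = preprocessor.fit_transform(features_train)
X_test = preprocessor.transform(features_test)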

In [22]:
X_train = X_train.toarray() if hasattr(X_train, 'toarray') else X_train
X_test = X_test.toarray() if hasattr(X_test, 'toarray') else X_test
y_train, y_test = y_train.values, y_test.values
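
One way to keep test rows out of the scaling statistics is to estimate the mean and standard deviation on the training rows and reuse them for the test rows. The sketch below shows the idea with plain Python on hypothetical rotational-speed values; `fit_scaler` and `transform` are made-up helper names, not scikit-learn functions.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Estimate mean and population standard deviation from training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    return mu, (sigma if sigma > 0 else 1.0)

def transform(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

train_rpm = [1400.0, 1500.0, 1600.0]   # hypothetical training rows
test_rpm = [1700.0]                    # hypothetical test row

mu, sigma = fit_scaler(train_rpm)
print(transform(train_rpm, mu, sigma))
print(transform(test_rpm, mu, sigma))
```

The test value is scaled with the training mean and spread, so it can land well outside the range seen in training, which is the behavior a deployed model would also face.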

**Step 3: Oversample anomalies in training**

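Since failures are rare, the anomaly rows in the training set are resampled with replacement until they match the number of normal rows. This keeps the model from being biased toward predicting only normal cases. The test set is left untouched, so it keeps its original class ratio.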

In [23]:
#Step 3: Oversample anomalies in training
anomaly_idx = np.where(y_train == 1)[0]
normal_idx = np.where(y_train == 0)[0]
anomaly_samples = np.random.choice(anomaly_idx, size=len(normal_idx), replace=True)

X_train_bal = np.vstack([X_train[normal_idx], X_train[anomaly_samples]])
y_train_bal = np.hstack([y_train[normal_idx], y_train[anomaly_samples]])

# Shuffle
idx = np.random.permutation(len(y_train_bal))
X_train, y_train = X_train_bal[idx], y_train_bal[idx]
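
The same oversampling idea can be checked on a toy array. This is a minimal sketch, assuming a seeded NumPy `Generator` for reproducibility (the cell above uses the unseeded global `np.random` state); `oversample_minority` and the toy data are hypothetical.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Resample class-1 rows with replacement until they match class-0 in count."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    drawn = rng.choice(pos, size=len(neg), replace=True)
    idx = rng.permutation(np.concatenate([neg, drawn]))  # shuffle the balanced set
    return X[idx], y[idx]

# Toy data: 8 normal rows, 2 anomaly rows
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(len(y_bal), int(y_bal.sum()))
```

Every anomaly row in the balanced set is a copy of one of the two original anomaly rows, so oversampling adds no new information; it only changes how often the agent sees each class.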

**Step 4: RL Environment**

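A custom environment (AnomalyEnv) frames detection as a sequential decision problem: at each step the agent sees one sample and chooses an action (0 = normal, 1 = anomaly). The reward is +6 for a true positive, +2 for a true negative, -5 for a false alarm, and -6 for a missed failure, so both kinds of error are penalized heavily and a missed failure costs slightly more than a false alarm.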

In [24]:
#Step 4: RL Environment
class AnomalyEnv:
    def __init__(self, data, labels):
        self.data, self.labels = data, labels
        self.current = 0
        self.max_steps = len(data)
    def reset(self):
        self.current = 0
        return self.data[0]
    def step(self, action):
        true = self.labels[self.current]
        # Reward shaping: a missed failure (FN) costs slightly more than a false alarm (FP)
        if action == 1 and true == 1: reward = 6   # TP
        elif action == 1 and true == 0: reward = -5  # FP
        elif action == 0 and true == 1: reward = -6  # FN
        else: reward = 2   # TN
        self.current += 1
        done = self.current >= self.max_steps
        next_state = self.data[self.current] if not done else np.zeros_like(self.data[0])
        return next_state, reward, done
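
To see how the reward values rank simple policies, the sketch below totals the reward over a small hypothetical label sequence. The reward values are copied from `AnomalyEnv.step`; `episode_reward` and the toy labels are illustrative.

```python
# Reward values taken from AnomalyEnv.step
REWARDS = {"TP": 6, "FP": -5, "FN": -6, "TN": 2}

def episode_reward(y_true, y_pred):
    """Sum the shaped reward over paired true labels and chosen actions."""
    total = 0
    for true, action in zip(y_true, y_pred):
        if action == 1 and true == 1:
            total += REWARDS["TP"]
        elif action == 1 and true == 0:
            total += REWARDS["FP"]
        elif action == 0 and true == 1:
            total += REWARDS["FN"]
        else:
            total += REWARDS["TN"]
    return total

y_true = [0, 0, 0, 1]                          # hypothetical labels
print(episode_reward(y_true, [0, 0, 0, 0]))    # always predict normal
print(episode_reward(y_true, [1, 1, 1, 1]))    # always predict anomaly
print(episode_reward(y_true, [0, 0, 0, 1]))    # perfect predictions
```

On this toy sequence the always-normal policy scores 0, the always-anomaly policy scores -9, and perfect predictions score 12, so neither trivial policy is rewarded.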

**Step 5: Tiny DQN**

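A small Deep Q-Network (DQN) is defined with two fully connected layers: a hidden layer of 32 ReLU units and an output layer. It maps a sample's feature vector to one Q-value per action (normal, anomaly), and the greedy policy picks the action with the larger value.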

In [25]:
#Step 5: Tiny DQN
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 32)
        self.fc2 = nn.Linear(32, action_size)
    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
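
To get a feel for how small this network is, the sketch below counts its trainable parameters. It assumes `state_size = 9` (seven scaled numeric columns plus two one-hot columns, which holds only if `Type` has three categories); the helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def dqn_params(state_size, action_size, hidden=32):
    """Parameter count for a state_size -> hidden -> action_size network."""
    return linear_params(state_size, hidden) + linear_params(hidden, action_size)

# Assumed input width of 9; adjust to X_train.shape[1] in the notebook
print(dqn_params(state_size=9, action_size=2))  # 386
```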


**Step 6: Training loop**

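The agent interacts with the environment for multiple episodes, choosing actions with an epsilon-greedy policy, collecting rewards, and storing transitions in a replay memory. The DQN is trained on random minibatches drawn from that memory. A separate target network, synced with the online network every 10 episodes, supplies the bootstrap values to stabilize learning.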
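
For a single transition, the regression target used in the loop is the reward plus the discounted best Q-value of the next state, with the bootstrap term dropped on terminal steps. The sketch below is a plain-Python version of that formula; `td_target` and the example Q-values are hypothetical, and `gamma=0.9` matches the cell below.

```python
def td_target(reward, next_q_values, done, gamma=0.9):
    """One-step TD target: r + gamma * max_a Q_target(s', a), or just r if done."""
    bootstrap = 0.0 if done else gamma * max(next_q_values)
    return reward + bootstrap

def squared_error(q_value, target):
    """Per-sample loss term that MSELoss averages over the batch."""
    return (q_value - target) ** 2

# Non-terminal step: true positive (+6), next-state Q-values [1.0, 3.0]
print(td_target(6, [1.0, 3.0], done=False))
# Terminal step: false alarm (-5), bootstrap term is ignored
print(td_target(-5, [1.0, 3.0], done=True))
```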

In [26]:
# Step 6: Training loop
state_size = X_train.shape[1]
action_size = 2
model = DQN(state_size, action_size)
target = DQN(state_size, action_size)
target.load_state_dict(model.state_dict())
optimizer = optim.Adam(model.parameters(), lr=0.001)

gamma = 0.9
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.98
memory = deque(maxlen=1000)
batch_size = 16
episodes = 100   # a bit longer but still lightweight

env = AnomalyEnv(X_train, y_train)

for ep in range(episodes):
    state = torch.FloatTensor(env.reset()).unsqueeze(0)
    done, total_reward = False, 0
    while not done:
        if random.random() < epsilon:
            action = random.randrange(action_size)
        else:
            with torch.no_grad():
                action = torch.argmax(model(state)).item()
        next_state, reward, done = env.step(action)
        next_state = torch.FloatTensor(next_state).unsqueeze(0)
        memory.append((state, action, reward, next_state, done))
        state = next_state
        total_reward += reward

        if len(memory) >= batch_size:
            batch = random.sample(memory, batch_size)
            states = torch.cat([b[0] for b in batch])
            actions = torch.LongTensor([b[1] for b in batch])
            rewards = torch.FloatTensor([b[2] for b in batch])
            next_states = torch.cat([b[3] for b in batch])
            dones = torch.FloatTensor([b[4] for b in batch])

            q_values = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            next_q = target(next_states).max(1)[0]
            target_q = rewards + (1 - dones) * gamma * next_q
            loss = nn.MSELoss()(q_values, target_q.detach())

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    if ep % 10 == 0:
        target.load_state_dict(model.state_dict())
    epsilon = max(eps_min, epsilon * eps_decay)
    print(f"Episode {ep+1}/{episodes}, Reward: {total_reward}, Epsilon: {epsilon:.2f}")



Episode 1/100, Reward: -10993, Epsilon: 0.98
Episode 2/100, Reward: -10695, Epsilon: 0.96
Episode 3/100, Reward: -9001, Epsilon: 0.94
Episode 4/100, Reward: -6562, Epsilon: 0.92
Episode 5/100, Reward: -5052, Epsilon: 0.90
Episode 6/100, Reward: -5368, Epsilon: 0.89
Episode 7/100, Reward: -3737, Epsilon: 0.87
Episode 8/100, Reward: -2491, Epsilon: 0.85
Episode 9/100, Reward: -1338, Epsilon: 0.83
Episode 10/100, Reward: 986, Epsilon: 0.82
Episode 11/100, Reward: 1928, Epsilon: 0.80
Episode 12/100, Reward: 3331, Epsilon: 0.78
Episode 13/100, Reward: 4713, Epsilon: 0.77
Episode 14/100, Reward: 6219, Epsilon: 0.75
Episode 15/100, Reward: 6208, Epsilon: 0.74
Episode 16/100, Reward: 7760, Epsilon: 0.72
Episode 17/100, Reward: 8970, Epsilon: 0.71
Episode 18/100, Reward: 8511, Epsilon: 0.70
Episode 19/100, Reward: 10855, Epsilon: 0.68
Episode 20/100, Reward: 11560, Epsilon: 0.67
Episode 21/100, Reward: 13517, Epsilon: 0.65
Episode 22/100, Reward: 13473, Epsilon: 0.64
Episode 23/100, Reward: 151
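
The epsilon values in the log follow a simple geometric decay. The sketch below reproduces the schedule from the training cell's settings (start 1.0, floor 0.05, decay 0.98 per episode) and checks how many episodes it takes to reach the floor; the function names are hypothetical.

```python
def epsilon_after(n_episodes, start=1.0, eps_min=0.05, decay=0.98):
    """Epsilon after n_episodes updates of eps = max(eps_min, eps * decay)."""
    eps = start
    for _ in range(n_episodes):
        eps = max(eps_min, eps * decay)
    return eps

def episodes_to_floor(start=1.0, eps_min=0.05, decay=0.98):
    """Number of episodes until epsilon first reaches eps_min."""
    eps, n = start, 0
    while eps > eps_min:
        eps = max(eps_min, eps * decay)
        n += 1
    return n

print(round(epsilon_after(1), 2))    # 0.98, as printed for episode 1
print(round(epsilon_after(10), 2))   # 0.82, as printed for episode 10
print(round(epsilon_after(100), 2))  # 0.13
print(episodes_to_floor())           # 149
```

With these settings the floor of 0.05 is not reached within 100 episodes, so the agent still takes a random action on roughly 13% of steps at the end of training.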

**Step 7: Evaluation with confidence threshold**

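The trained DQN is evaluated on the held-out test data. A sample is flagged as an anomaly only when the Q-value of the anomaly action exceeds 1.5 times the Q-value of the normal action, which is meant to reduce false positives. This ratio test is stricter than a plain argmax only when the normal-action Q-value is positive. Predictions are compared with ground truth using a classification report and confusion matrix.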
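
The behavior of the ratio test depends on the sign of the Q-values. The sketch below compares it with a margin-based rule on hypothetical Q-value pairs; `margin_rule` is an alternative shown for illustration and is not what the notebook uses.

```python
def ratio_rule(q_normal, q_anomaly, threshold=1.5):
    """Rule from predict(): anomaly if q_anomaly > threshold * q_normal."""
    return int(q_anomaly > threshold * q_normal)

def margin_rule(q_normal, q_anomaly, margin=1.0):
    """Hypothetical alternative: anomaly if q_anomaly beats q_normal by a fixed margin."""
    return int(q_anomaly - q_normal > margin)

cases = [
    (2.0, 2.5),    # both positive, anomaly slightly higher
    (2.0, 4.0),    # both positive, anomaly clearly higher
    (-2.0, -2.5),  # both negative, anomaly lower
]
for q_normal, q_anomaly in cases:
    print(q_normal, q_anomaly, ratio_rule(q_normal, q_anomaly), margin_rule(q_normal, q_anomaly))
```

In the third case the ratio rule flags an anomaly even though the anomaly Q-value is the lower of the two, because multiplying a negative value by 1.5 lowers the bar instead of raising it. A margin on the difference behaves the same way regardless of sign.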

In [28]:
#Step 7: Evaluation with confidence threshold
def predict(model, X, threshold=1.5):
    preds = []
    with torch.no_grad():
        for s in X:
            s = torch.FloatTensor(s).unsqueeze(0)
            q = model(s)
            # anomaly only if much stronger than normal
            if q[0,1] > threshold * q[0,0]:
                preds.append(1)
            else:
                preds.append(0)
    return np.array(preds)

y_pred = predict(model, X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1983
           1       0.89      1.00      0.94        17

    accuracy                           1.00      2000
   macro avg       0.95      1.00      0.97      2000
weighted avg       1.00      1.00      1.00      2000

Confusion Matrix:
[[1981    2]
 [   0   17]]
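
The anomaly-class figures in the report can be recomputed by hand from the confusion matrix above (rows are true classes, columns are predicted classes). The sketch below does so in plain Python; the function name is hypothetical.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts read from the confusion matrix: [[1981, 2], [0, 17]]
precision, recall, f1 = binary_metrics(tn=1981, fp=2, fn=0, tp=17)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.89 1.0 0.94
```

All 17 failures in the test set are caught, and the 0.89 precision comes from 2 false alarms among 19 flagged samples.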
