## Problem Statement

In [5]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/svgoudar/datasets/refs/heads/main/financial_anomaly_data.csv")

In [6]:
df.head()

| Unnamed: 0 | Timestamp | TransactionID | AccountID | Amount | Merchant | TransactionType | Location |
|---|---|---|---|---|---|---|---|
| 0 | 01-01-2023 08:00 | TXN1127 | ACC4 | 95071.92 | MerchantH | Purchase | Tokyo |
| 1 | 01-01-2023 08:01 | TXN1639 | ACC10 | 15607.89 | MerchantH | Purchase | London |
| 2 | 01-01-2023 08:02 | TXN872 | ACC8 | 65092.34 | MerchantE | Withdrawal | London |
| 3 | 01-01-2023 08:03 | TXN1438 | ACC6 | 87.87 | MerchantE | Purchase | London |
| 4 | 01-01-2023 08:04 | TXN1338 | ACC6 | 716.56 | MerchantI | Purchase | Los Angeles |
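
Before clustering, it can help to check column types and missing values and to parse `Timestamp` into a real datetime. The following is a minimal sketch that uses an inline copy of the five rows shown above (without the `Unnamed: 0` index column) instead of the full CSV. The day-first date format `%d-%m-%Y %H:%M` is an assumption, since the sample rows do not disambiguate day and month, and the `hour` column is a hypothetical derived feature.

```python
import io

import pandas as pd

# Inline copy of the five rows printed by df.head(), so the sketch runs offline.
sample_csv = """Timestamp,TransactionID,AccountID,Amount,Merchant,TransactionType,Location
01-01-2023 08:00,TXN1127,ACC4,95071.92,MerchantH,Purchase,Tokyo
01-01-2023 08:01,TXN1639,ACC10,15607.89,MerchantH,Purchase,London
01-01-2023 08:02,TXN872,ACC8,65092.34,MerchantE,Withdrawal,London
01-01-2023 08:03,TXN1438,ACC6,87.87,MerchantE,Purchase,London
01-01-2023 08:04,TXN1338,ACC6,716.56,MerchantI,Purchase,Los Angeles
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Assumption: dates are day-first. errors="coerce" turns unparseable values into NaT.
sample["Timestamp"] = pd.to_datetime(
    sample["Timestamp"], format="%d-%m-%Y %H:%M", errors="coerce"
)

# Hypothetical derived feature: hour of day of the transaction.
sample["hour"] = sample["Timestamp"].dt.hour

# Count missing values per column (including timestamps that failed to parse).
missing_per_column = sample.isna().sum()
print(sample.dtypes)
print(missing_per_column)
```

On the full dataset, the same calls can be applied to `df` directly; any rows reported as missing would then need to be dropped or imputed before scaling.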



Below are Kaggle datasets suited to anomaly detection, followed by a starter code sketch using DBSCAN.

---

## Dataset recommendation

**“Network Traffic Anomaly Detection Dataset”** on Kaggle. ([Kaggle][1])

* Contains network traffic data.
* Useful for detecting anomalous patterns (e.g. intrusion, unusual behaviour). ([Kaggle][1])

Another option: **Financial Anomaly Data** on Kaggle. ([Kaggle][2])

---

## Starter code sketch (in Python) using DBSCAN


In [None]:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import classification_report, confusion_matrix

# 1. Load dataset
df = pd.read_csv('path/to/network_traffic_anomaly_detection.csv')

# 2. Preprocess
#    - handle missing values
#    - select relevant features
#    - encode categorical variables (if any)
#    - scale features (DBSCAN is distance-based)
features = ['feature1', 'feature2', 'feature3', ...]  # placeholders: replace with real column names
X = df[features].copy()
# Impute missing numeric values with the column mean
X = X.fillna(X.mean(numeric_only=True))
# if categorical:
#   X = pd.get_dummies(X, columns=['cat_feature1', ...])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. DBSCAN clustering
eps = 0.5   # to tune
min_samples = 5  # to tune
db = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
labels = db.fit_predict(X_scaled)

# 4. Mark anomalies
# scikit-learn's DBSCAN assigns the label -1 to noise points; treat those as anomalies
df['cluster'] = labels
df['anomaly'] = (labels == -1).astype(int)

# 5. If you have true labels, evaluate
if 'true_label' in df.columns:
    y_true = df['true_label']
    y_pred = df['anomaly']
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

# 6. Explore clusters
# Count anomalies (noise points) and clusters (distinct labels, excluding noise)
print("Anomalies detected:", int(df['anomaly'].sum()))
print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
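
The `eps` value marked "to tune" above is often chosen with a k-distance heuristic: compute each point's distance to its `min_samples`-th nearest neighbour, sort those distances, and look for the elbow. The following is a minimal sketch on synthetic data. The blobs, the injected outliers, and the 95th-percentile cut-off are all illustrative assumptions and not part of the original notebook; in practice the elbow is usually read off a plot of the sorted distances.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for X: two dense blobs plus three far-away outliers.
blob_a = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(200, 2))
outliers = np.array([[20.0, 20.0], [-15.0, 18.0], [25.0, -20.0]])
X_demo = StandardScaler().fit_transform(np.vstack([blob_a, blob_b, outliers]))

min_samples = 5

# Querying the training data returns each point as its own nearest neighbour
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# Assumption: a high percentile as a crude stand-in for reading the elbow off a plot.
eps = float(np.percentile(k_distances, 95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
n_noise = int((labels == -1).sum())
print(f"eps={eps:.3f}, noise points={n_noise}")
```

A smaller percentile gives a tighter `eps` and flags more points as noise; a larger one merges sparse regions into clusters, so the cut-off is worth checking against the plotted curve and any labelled anomalies.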
