<a href="https://colab.research.google.com/github/srilamaiti/ml_works/blob/main/bert_anomaly_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from transformers import BertTokenizer, BertModel
from sklearn.ensemble import IsolationForest
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample texts (normal and anomalous examples)
texts = [
    "The server is running smoothly.",
    "Error detected in database connection.",
    "Data pipeline completed successfully.",
    "Disk space critically low.",
    # Add more normal and anomalous texts
]

# Tokenize and convert texts to BERT embeddings
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the pooled [CLS] output as the sentence embedding
    # (averaging the token embeddings is a common alternative)
    return outputs.pooler_output.numpy().flatten()

# Generate embeddings for all texts
embeddings = [get_bert_embedding(text) for text in texts]

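# Use Isolation Forest for anomaly detection.
# contamination is the expected fraction of anomalies in the data; no random_state
# is set, so the fitted forest (and the labels below) can differ between runs.
clf = IsolationForest(contamination=0.1)
clf.fit(embeddings)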

# Predict anomalies (-1 indicates anomaly, 1 indicates normal)
predictions = clf.predict(embeddings)

# Display results
for i, text in enumerate(texts):
    result = "Anomalous" if predictions[i] == -1 else "Normal"
    print(f"Text: {text} --> {result}")


Text: The server is running smoothly. --> Anomalous
Text: Error detected in database connection. --> Normal
Text: Data pipeline completed successfully. --> Normal
Text: Disk space critically low. --> Normal
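
The labels above come from a forest fitted on only four texts, so they are hard to interpret on their own. The following is a minimal sketch, not the notebook's method, of how the continuous anomaly scores behind `predict` could be inspected. It uses synthetic 8-dimensional vectors as a stand-in for BERT embeddings (an assumption made so the example runs without downloading a model), and the sample sizes, `contamination=0.05`, and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for sentence embeddings (assumption: 200 points in 8-D,
# not real BERT vectors), plus 5 points shifted far from the bulk.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 8))
outliers = rng.normal(loc=6.0, scale=1.0, size=(5, 8))
X = np.vstack([normal, outliers])

# Fixing random_state makes the fitted forest reproducible between runs.
clf = IsolationForest(contamination=0.05, random_state=0)
clf.fit(X)

labels = clf.predict(X)        # -1 = anomaly, 1 = normal
scores = clf.score_samples(X)  # lower score = more anomalous

# Rank samples from most to least anomalous and show the top five.
ranking = np.argsort(scores)
for idx in ranking[:5]:
    print(f"index={idx} score={scores[idx]:.3f} label={labels[idx]}")
```

Looking at `score_samples` rather than only the -1/1 labels shows how far each sample sits from the decision threshold, which is useful when `contamination` is a guess.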

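
The embedding function takes `pooler_output`, while its comment also mentions averaging the token embeddings. A rough sketch of that alternative is below. The helper name `mean_pool` is hypothetical, and the function works on NumPy arrays so it can be tried on a toy input without loading BERT.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real (non-padding) tokens.

    last_hidden_state: array of shape (batch, seq_len, hidden)
    attention_mask:    array of shape (batch, seq_len), 1 = token, 0 = padding
    """
    hidden = np.asarray(last_hidden_state, dtype=np.float64)
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = np.asarray(attention_mask, dtype=np.float64)[..., None]
    summed = (hidden * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy input: batch of 2, sequence length 3, hidden size 2.
# The last token of the first sequence is padding and must be ignored.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

In the notebook's setting, `outputs.last_hidden_state.numpy()` and `inputs['attention_mask'].numpy()` could be passed as the two arguments. Whether mean pooling separates anomalies better than `pooler_output` is not established here and would need to be checked on real data.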