# Embedding Model Comparison for PLC Data

**Objective:** Identify an open-source embedding model that captures the semantics of PLC documentation and code snippets well enough to retrieve the relevant chunks for a given query.

**Methodology:**
1.  **Load Data:** Prepare a representative dataset of PLC documents and code snippets.
2.  **Select Models:** Choose a few candidate sentence-transformer models.
3.  **Embed & Query:** For each model, embed the dataset and a list of test queries.
4.  **Retrieve & Evaluate:** Perform similarity search, retrieve top K chunks, and qualitatively evaluate relevance.

In [None]:
# 1. Import Libraries
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss  # Exact nearest-neighbor search; for small datasets, sklearn.metrics.pairwise.cosine_similarity works too
import numpy as np

# Placeholder for your data loading function
def load_plc_data():
    """
    Loads PLC documents and code snippets.
    Replace this with your actual data loading logic.
    Should return a list of strings, where each string is a document or code chunk.
    """
    # Example:
    documents = [
        "CODESYS Function Block Diagram (FBD) is a graphical programming language.",
        "Structured Text (ST) in CODESYS resembles Pascal or C.",
        "The TON timer function block provides on-delay timing.",
        "VAR_INPUT TempSensor : REAL; END_VAR",
        "IF TempSensor > 100 THEN Alarm := TRUE; END_IF;"
        # Add more representative PLC documents and code snippets here
    ]
    return documents

# Placeholder for your test queries
def get_test_queries():
    """
    Returns a list of test queries relevant to PLC programming.
    """
    queries = [
        "How to use a TON timer in CODESYS?",
        "What is Structured Text syntax for IF statements?",
        "Explain Function Block Diagrams.",
        "Example of variable declaration in CODESYS."
        # Add more realistic queries
    ]
    return queries

plc_documents = load_plc_data()
test_queries = get_test_queries()

print(f"Loaded {len(plc_documents)} documents.")
print(f"Loaded {len(test_queries)} test queries.")

## 2. Define Embedding Models to Compare
List the Sentence Transformer model names you want to test.

In [None]:
embedding_model_names = [
    "all-MiniLM-L6-v2",       # A good general-purpose lightweight model
    "BAAI/bge-small-en-v1.5", # Another strong contender, good balance
    "sentence-transformers/multi-qa-mpnet-base-dot-v1" # Tuned for QA tasks
    # Add other models you want to test, e.g., multilingual if needed
]

models = {name: SentenceTransformer(name) for name in embedding_model_names}
print(f"Loaded models: {list(models.keys())}")

## 3. Embed Documents and Queries & Perform Retrieval

For each model:
- Embed the PLC documents.
- Embed the test queries.
- For each query, find the top K most similar documents.
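The retrieval step comes down to ranking documents by vector similarity. The following is a minimal, dependency-free sketch of that ranking, assuming cosine similarity as the metric. The three-dimensional vectors are made up for illustration, and `cosine_similarity`, `top_k`, and `normalize` are hypothetical helper names, not part of any library. It also checks numerically the identity `||a - b||^2 = 2 - 2 * cos(a, b)` for unit-length vectors, which is what makes a squared-L2 ranking equivalent to a cosine ranking only when the embeddings are normalized.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Toy vectors, invented for illustration only (real embeddings have hundreds of dimensions).
toy_docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.9, 0.1, 0.0]

ranking = top_k(toy_query, toy_docs, k=2)
print(ranking)

# For unit vectors, squared L2 distance is a monotone function of cosine similarity.
a, b = normalize(toy_query), normalize(toy_docs[1])
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
print(squared_l2, 2 - 2 * cosine_similarity(a, b))
```

If the embeddings are not unit length, the two metrics can rank documents differently, so it is worth checking whether a given model normalizes its output before mixing them.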

In [None]:
def get_top_k_similar(query_embedding, document_embeddings, documents, k=3):
    # Exact FAISS search (plain cosine similarity is an equally valid choice for small datasets).
    # Work on normalized float32 copies so the inner product equals cosine similarity.
    # Note: models trained for dot-product scoring may rank differently under cosine.
    doc_vecs = np.array(document_embeddings, dtype=np.float32, order="C")
    query_vec = np.array(query_embedding, dtype=np.float32, order="C").reshape(1, -1)
    faiss.normalize_L2(doc_vecs)
    faiss.normalize_L2(query_vec)

    index = faiss.IndexFlatIP(doc_vecs.shape[1])
    index.add(doc_vecs)

    k = min(k, len(documents))  # FAISS pads results with index -1 if k exceeds the index size
    scores, indices = index.search(query_vec, k)

    return [(documents[i], float(s)) for i, s in zip(indices[0], scores[0])]

results = {} # To store results for each model

for model_name, model in models.items():
    print(f"\n--- Processing with model: {model_name} ---")
    model_results = []
    
    # Embed documents
    doc_embeddings = model.encode(plc_documents, convert_to_tensor=False, show_progress_bar=True)
    
    for query in test_queries:
        query_embedding = model.encode(query, convert_to_tensor=False)
        
        # Get the top K similar documents.
        # Alternative for small datasets, if FAISS is not available: plain cosine similarity.
        # from sklearn.metrics.pairwise import cosine_similarity
        # similarities = cosine_similarity(query_embedding.reshape(1, -1), doc_embeddings)
        # top_k_indices = np.argsort(similarities[0])[::-1][:3]
        # top_k_docs = [(plc_documents[i], similarities[0][i]) for i in top_k_indices]
        
        top_k_docs = get_top_k_similar(query_embedding, doc_embeddings, plc_documents, k=3)
        
        model_results.append({
            "query": query,
            "retrieved_chunks": top_k_docs
        })
        
        print(f"\nQuery: {query}")
        for doc, score in top_k_docs:
            print(f"  Retrieved: {doc[:100]}... (Score: {score:.4f})")
            
    results[model_name] = model_results

## 4. Qualitative Evaluation

Review the `results` dictionary. For each model and each query, assess the relevance of the retrieved chunks.

**Considerations for Evaluation:**
- **Relevance:** How well do the retrieved chunks address the query?
- **Specificity:** Do they provide specific information or just general context?
- **Diversity:** If multiple chunks are retrieved, do they offer different facets of the answer or are they redundant?
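Redundancy among retrieved chunks can be flagged mechanically before reading them. The sketch below is one crude way to do that, assuming whitespace tokenization and Jaccard overlap as the similarity measure. The names `jaccard` and `redundant_pairs`, the 0.6 threshold, and the sample chunks are illustrative choices, not part of the notebook.

```python
from itertools import combinations

def jaccard(text_a, text_b):
    """Jaccard overlap between the lowercase whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty chunks count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def redundant_pairs(chunks, threshold=0.6):
    """Return index pairs of chunks whose token overlap meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(chunks), 2)
        if jaccard(a, b) >= threshold
    ]

# Made-up retrieved chunks for one query.
retrieved = [
    "The TON timer function block provides on-delay timing.",
    "The TON timer function block provides on-delay timing in CODESYS.",
    "Structured Text (ST) in CODESYS resembles Pascal or C.",
]
print(redundant_pairs(retrieved))
```

Lexical overlap only catches near-duplicate wording. Two chunks that say the same thing in different words still need a human read.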

Based on this qualitative review, you can decide which embedding model performs best for your PLC data.
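To make the review more structured, the nested results can be flattened into one row per retrieved chunk and written to a CSV file for grading in a spreadsheet. The sketch below assumes `results` has the shape built earlier in this notebook (model name mapped to a list of dicts with `query` and `retrieved_chunks`). The helpers `flatten_results` and `write_review_csv`, the empty `relevance` column, and the sample data are hypothetical additions for illustration.

```python
import csv
import io

def flatten_results(results):
    """Turn {model: [{"query", "retrieved_chunks"}]} into one dict per retrieved chunk."""
    rows = []
    for model_name, model_results in results.items():
        for item in model_results:
            for rank, (chunk, score) in enumerate(item["retrieved_chunks"], start=1):
                rows.append({
                    "model": model_name,
                    "query": item["query"],
                    "rank": rank,
                    "score": round(float(score), 4),
                    "chunk": chunk,
                    "relevance": "",  # left blank for the reviewer to fill in
                })
    return rows

def write_review_csv(rows, file_obj):
    """Write the flattened rows to an open text file object."""
    fieldnames = ["model", "query", "rank", "score", "chunk", "relevance"]
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Made-up results with the same shape as the `results` dictionary.
example_results = {
    "model-a": [{"query": "How to use a TON timer?",
                 "retrieved_chunks": [("The TON timer provides on-delay timing.", 0.81)]}],
}
rows = flatten_results(example_results)
buffer = io.StringIO()
write_review_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open("review.csv", "w", newline="", encoding="utf-8")` and pass the file object in place of `buffer`. Grading every model's rows on the same scale makes the per-model comparison a simple matter of sorting or pivoting in the spreadsheet.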

In [None]:
# Example: print the results for one model.
# For a more structured review, export `results` to a spreadsheet or step through the dictionary manually.
if results:
    sample_model_name = list(results.keys())[0]
    print(f"\n--- Detailed Results for Model: {sample_model_name} ---")
    for item in results[sample_model_name]:
        print(f"\nQuery: {item['query']}")
        for doc, score in item['retrieved_chunks']:
            print(f"  Retrieved: {doc} (Score: {score:.4f})")
else:
    print("No results to display. Ensure the previous cells ran correctly.")