# LLM Response Quality Comparison (Simulated RAG)

**Objective:** Determine which locally runnable, quantized LLM gives the most accurate, coherent, and helpful answers to PLC-related questions when it is supplied with relevant context (simulating RAG retrieval).

**Methodology:**
1.  **Prepare Test Cases:** Create pairs of (PLC query, manually selected relevant context chunks).
2.  **Select LLMs:** Choose a few candidate quantized LLMs accessible via Ollama.
3.  **Generate Responses:** For each LLM and test case, construct a prompt and generate a response.
4.  **Qualitative Evaluation:** Manually score responses based on accuracy, relevance, faithfulness, clarity, and helpfulness.
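
Because every later step depends on the test cases from step 1 being well-formed, it can help to check them before any model is called. The following is a minimal sketch, assuming each test case is a dictionary with a `query` string and a `context` list of strings; the helper name `validate_test_cases` is hypothetical and not part of this notebook.

```python
def validate_test_cases(test_cases):
    """Return a list of problems found; an empty list means every case looks well-formed."""
    problems = []
    for i, case in enumerate(test_cases):
        query = case.get("query")
        context = case.get("context")
        # A usable query is a non-empty string.
        if not isinstance(query, str) or not query.strip():
            problems.append(f"case {i}: 'query' must be a non-empty string")
        # The context stands in for retriever output, so it must contain at least one chunk.
        if not isinstance(context, list) or not context:
            problems.append(f"case {i}: 'context' must be a non-empty list")
        elif not all(isinstance(chunk, str) and chunk.strip() for chunk in context):
            problems.append(f"case {i}: every context chunk must be a non-empty string")
    return problems


# Illustrative input only: one well-formed case and one deliberately broken case.
sample_cases = [
    {"query": "What is the purpose of the TON timer?", "context": ["TON is an on-delay timer."]},
    {"query": "", "context": []},
]
for problem in validate_test_cases(sample_cases):
    print(problem)
```

Running a check like this on the real test cases before generation avoids spending model time on a malformed entry.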

## 1. Prepare Test Cases

In [None]:
import ollama
import pandas as pd

def get_llm_test_cases():
    """
    Returns a list of dictionaries, each containing a query and relevant context.
    Replace this with your actual test cases.
    Context should be what you expect a good retriever to find.
    """
    test_cases = [
        {
            "query": "How do I declare a global variable in CODESYS Structured Text that can be accessed by multiple POUs?",
            "context": [
                "Global variables in CODESYS are typically declared in a Global Variable List (GVL).",
                "To declare a variable in a GVL, you use the VAR_GLOBAL keyword. Example: VAR_GLOBAL MyGlobalCounter : INT; END_VAR",
                "POUs can then access this variable directly by its name, provided the GVL is part of the project."
            ]
        },
        {
            "query": "What is the purpose of the TON timer in PLC programming?",
            "context": [
                "The TON (Timer On-Delay) function block is used to create a time delay before an output is set to TRUE.",
                "When the input IN of the TON block becomes TRUE, the timer starts counting up to the preset time (PT).",
                "Once the elapsed time (ET) reaches the preset time (PT), the output Q becomes TRUE and remains TRUE as long as IN is TRUE."
            ]
        },
        {
            "query": "Explain the difference between a Function and a Function Block in CODESYS.",
            "context": [
                "Functions (FCs) in CODESYS are POUs that execute and return a value. They do not have memory of their own between calls (stateless).",
                "Function Blocks (FBs) are POUs that have their own instance memory. This means they can retain their state between calls.",
                "Typically, FBs are used for operations that require internal state, like timers, counters, or state machines, while Functions are for reusable calculations."
            ]
        }
    ]
    return test_cases

llm_test_cases = get_llm_test_cases()
print(f"Loaded {len(llm_test_cases)} LLM test cases.")

## 2. Define LLMs to Compare (from Ollama)
List the Ollama model names you want to test (ensure they are pulled locally: `ollama pull <model_name>`).

In [None]:
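ollama_model_names = [
    "llama3:8b-instruct-q4_K_M",
    "mistral:7b-instruct-v0.2-q4_K_M",
]

# Warn about any requested model that has not been pulled locally.
# Assumption: each entry exposes its tag as "model" (recent ollama-python) or "name" (older releases).
local_models = ollama.list()["models"]
local_model_names = {m.get("model") or m.get("name") for m in local_models}
missing_models = [name for name in ollama_model_names if name not in local_model_names]
if missing_models:
    print(f"Not pulled locally (run `ollama pull <model_name>`): {missing_models}")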


## 3. Generate Responses from LLMs

For each LLM and each test case:
- Construct a prompt using the query and the provided context.
- Generate a response from the LLM.

In [None]:
def generate_llm_response(model_name, query, context_chunks):
    context_str = "\n".join(context_chunks)
    prompt = f"Given the following information:\n---\n{context_str}\n---\nAnswer the question: {query}"
    
    try:
        response = ollama.chat(
            model=model_name,
            messages=[
                {
                    'role': 'user',
                    'content': prompt,
                }
            ]
        )
        return response["message"]["content"]
    except Exception as e:
        print(f"Error generating response from {model_name}: {e}")
        return "Error: Could not generate response."

llm_results = []

for test_case in llm_test_cases:
    query = test_case["query"]
    context = test_case["context"]
    
    case_results = {"query": query, "context": "\n".join(context), "responses": {}}
    print(f"\n--- Processing Query: {query[:80]}... ---")
    
    for model_name in ollama_model_names:
        print(f"  Generating response from {model_name}...")
        response_text = generate_llm_response(model_name, query, context)
        case_results["responses"][model_name] = response_text
        print(f"    {model_name} Response: {response_text[:150]}...") # Print a snippet
        
    llm_results.append(case_results)

# df_llm_results = pd.DataFrame(llm_results)
# print(df_llm_results.head())

## 4. Qualitative Evaluation of LLM Responses

Review the `llm_results`. For each query and each LLM, evaluate the generated response based on the following criteria:

*   **Accuracy:** Is the answer factually correct based on the provided context?
*   **Relevance:** Does it directly answer the question?
*   **Faithfulness:** Does it stick to the provided context, or does it introduce outside information/hallucinate?
*   **Clarity & Conciseness:** Is the answer easy to understand and to the point?
*   **Helpfulness:** Would this answer be genuinely useful to a student trying to solve a PLC programming problem?

Assign scores (e.g., on a 1-5 scale) for each criterion to help in comparing the models. This will guide your choice of LLM for the RAG system.
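
Once scores have been assigned by hand, they can be averaged per model to make the comparison easier to read. The sketch below uses only the standard library and assumes one record per (model, query) pair holding a 1-5 score for each of the five criteria above. The function name `summarize_scores`, the record layout, the unweighted overall mean, and the placeholder model names and scores are all illustrative assumptions, not results from this notebook.

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ["accuracy", "relevance", "faithfulness", "clarity", "helpfulness"]


def summarize_scores(score_rows):
    """Average each criterion per model and add an unweighted overall mean."""
    rows_by_model = defaultdict(list)
    for row in score_rows:
        rows_by_model[row["model"]].append(row)

    summary = {}
    for model_name, rows in rows_by_model.items():
        means = {criterion: mean(r[criterion] for r in rows) for criterion in CRITERIA}
        means["overall"] = mean(means.values())
        summary[model_name] = means
    return summary


# Placeholder scores for illustration only; these are NOT real evaluation results.
example_scores = [
    {"model": "model-a", "query": "q1", "accuracy": 5, "relevance": 4,
     "faithfulness": 5, "clarity": 4, "helpfulness": 4},
    {"model": "model-a", "query": "q2", "accuracy": 3, "relevance": 4,
     "faithfulness": 5, "clarity": 2, "helpfulness": 4},
    {"model": "model-b", "query": "q1", "accuracy": 2, "relevance": 2,
     "faithfulness": 2, "clarity": 2, "helpfulness": 2},
]

for model_name, means in summarize_scores(example_scores).items():
    print(model_name, means)
```

An unweighted mean treats all criteria as equally important. If faithfulness matters more for the RAG system, a weighted average may be the better choice.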

In [None]:
for item in llm_results:
    print(f"\n{'='*20} QUERY: {item['query']} {'='*20}")
    print(f"\nCONTEXT PROVIDED:\n{item['context']}")
    for model_name, response in item['responses'].items():
        print(f"\n--- Response from {model_name} ---")
        print(response)
    print(f"\n{'='*50}")

print("\nReview complete. Consider creating a spreadsheet or structured document for formal scoring.")