## Initial Phase 2 implementation: Developed by Arka Mukherjee

In [2]:
import lmstudio as lms

downloaded = lms.list_downloaded_models()
llm_only = lms.list_downloaded_models("llm")
embedding_only = lms.list_downloaded_models("embedding")

for model in downloaded:
    print(model.model_key)

text-embedding-nomic-embed-text-v1.5
internvl3-8b
smolvlm2-2.2b-instruct
irix-12b-model_stock
deepseek-r1-0528-qwen3-8b
qwen2.5-vl-32b-instruct
qwen2.5-vl-7b-instruct
gemma-3-4b-it
qwen2-vl-2b-instruct
granite-vision-3.2-2b
smolvlm2-500m-video-instruct
gemma-3-27b-it-qat
gemma-3-12b-it-qat


In [3]:
model = lms.llm('gemma-3-4b-it')
print(model.respond("What is the meaning of life?"))

Okay, let's tackle this big one – the meaning of life. The short answer is: **there isn’t a single, universally agreed-upon answer.** It's arguably *the* most pondered question in human history, and that's because it's deeply personal and philosophical. 

Here's a breakdown of different perspectives, grouped into categories:

**1. Philosophical Perspectives:**

* **Nihilism:** This view suggests life is inherently meaningless. There’s no objective purpose or value. It can be bleak, but some find freedom in accepting this lack of inherent meaning and creating their own.
* **Existentialism:**  This philosophy argues that existence precedes essence. We are born into the world without a predetermined purpose. *We* create our own meaning through our choices and actions. Key figures include Sartre and Camus. It emphasizes responsibility – we're responsible for defining ourselves.
* **Absurdism:** Closely related to existentialism, absurdism recognizes the conflict between humanity’s innate d

In [4]:
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main")

Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 1158658.65 examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 344690.78 examples/s]
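
Each GSM8K `answer` field ends with the final result after a `####` marker, and the later cells depend on that convention. Below is a minimal sketch of pulling that number out, assuming the marker is present and that the number may carry thousands separators. `parse_gsm8k_final` is a hypothetical helper name (not part of `datasets`), and the sample string is illustrative.

```python
from typing import Optional


def parse_gsm8k_final(answer: str) -> Optional[float]:
    """Return the number after the last '####' marker, or None if it cannot be parsed."""
    if "####" not in answer:
        return None
    # Drop thousands separators (assumption: they may appear in the final value)
    final_str = answer.split("####")[-1].strip().replace(",", "")
    try:
        return float(final_str)
    except ValueError:
        return None


# Illustrative answer string in the GSM8K style
sample = "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72"
print(parse_gsm8k_final(sample))
```

Returning `None` instead of raising keeps the caller in charge of what to do with rows that cannot be parsed.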


In [None]:
import pandas as pd
import random
import json
import re

# Define common mistake patterns for each concept
MISTAKE_PATTERNS = {
    "Fractions (Specifically Halves)": [
        "When finding half, divide by 1 instead of 2",
        "When finding half, subtract 1 instead of dividing by 2",
        "When finding half, multiply by 2 instead of dividing by 2"
    ],
    "Calculating Hourly Rates": [
        "Use the hourly rate directly without converting time units",
        "Multiply by 60 instead of dividing when converting hours to minutes",
        "Add the hourly rate to the time instead of multiplying"
    ],
    "Addition And Subtraction Of Integers": [
        "Skip one of the numbers in a multi-step addition",
        "Add when should subtract or subtract when should add",
        "Forget to include one of the given amounts"
    ],
    "Multi-Step Problem Solving": [
        "Stop after the first calculation and give that as the final answer",
        "Skip intermediate steps and guess the final answer",
        "Mix up the order of operations"
    ],
    "Application Of Arithmetic Operations In Real-World Scenarios": [
        "Use the wrong operation (add instead of multiply, etc.)",
        "Misinterpret what the numbers represent in the problem",
        "Apply operations to wrong numbers"
    ],
    "Multiplication": [
        "Add instead of multiply",
        "Forget to multiply by one of the numbers",
        "Multiply by the wrong number"
    ],
    "Addition": [
        "Skip one of the numbers to add",
        "Add numbers in wrong order causing confusion",
        "Make arithmetic errors in basic addition"
    ],
    "Time Conversion (Minutes To Hours)": [
        "Multiply by 60 instead of dividing by 60",
        "Add 60 instead of dividing by 60",
        "Use the wrong conversion factor"
    ],
    "Money Calculations": [
        "Forget to account for all money sources",
        "Add when should subtract or vice versa",
        "Lose track of decimal places"
    ]
}

def identify_prerequisites_for_question(question, concepts_list):
    """
    Use LLM to identify which prerequisites are needed for a specific question
    """
    prompt = f"""
Given this math word problem and a list of mathematical concepts, identify which concepts are ESSENTIAL prerequisites to solve this problem.

Problem: {question}

Available Concepts:
{', '.join(concepts_list)}

Instructions:
- Only list concepts that are absolutely necessary to solve this problem
- Be precise - don't include concepts that are merely related but not required
- Return only the concept names, separated by commas
- If none of the listed concepts are needed, return "None"

Essential Prerequisites:"""
    
    response = model.respond(prompt)
    # Parse the response to extract concept names
    prerequisites = [concept.strip() for concept in response.content.split(',') if concept.strip() in concepts_list]
    return prerequisites

def generate_wrong_answer_with_specific_mistake(question, missing_concept, correct_answer):
    """
    Generate wrong answer using specific mistake patterns
    """
    
    # Get mistake patterns for this concept
    if missing_concept in MISTAKE_PATTERNS:
        mistake_pattern = random.choice(MISTAKE_PATTERNS[missing_concept])
    else:
        mistake_pattern = "make a computational error or use wrong operation"
    
    # Extract correct final answer for avoidance
    correct_final = "the correct answer"
    if '####' in correct_answer:
        correct_final = correct_answer.split('####')[-1].strip()
    
    prompt = f"""
You are a confused student who makes a specific mistake because you don't understand "{missing_concept}".

Problem: {question}

Your specific confusion: {mistake_pattern}

CRITICAL REQUIREMENTS:
- Show your step-by-step work
- Make the specific mistake described above
- Your final answer MUST be wrong (definitely NOT {correct_final})
- Be realistic and show partial understanding of other concepts
- You WILL get this wrong because of your confusion
- END with your final numerical answer in this exact format: $\\boxed{{your_wrong_answer}}$

Your solution with the mistake:"""
    
    return prompt

def generate_wrong_answer_adversarial(question, missing_concept):
    """
    Use adversarial prompting to force wrong answers
    """
    prompt = f"""
ROLEPLAY SCENARIO: You are a student who has a knowledge gap in "{missing_concept}".

Problem: {question}

ROLEPLAY REQUIREMENTS (MUST FOLLOW):
- You CANNOT use "{missing_concept}" correctly because you don't understand it
- You must show your work but make a realistic mistake
- Your final numerical answer MUST be incorrect
- Stay in character as a confused student
- If you give the right answer, you FAIL the roleplay
- END with your final numerical answer in this exact format: $\\boxed{{your_wrong_answer}}$

IMPORTANT: Success means getting the wrong answer due to your confusion.

Student's confused attempt:"""
    
    return prompt

def generate_wrong_answer_original(question, missing_concept):
    """
    Enhanced version of original approach with stronger instructions
    """
    prompt = f"""
You are a grade schooler who is CONFUSED about "{missing_concept}" and WILL make a mistake.

Problem: {question}

ABSOLUTE REQUIREMENTS:
- You don't understand "{missing_concept}" so you WILL get it wrong
- Show your step-by-step thinking
- Make a realistic mistake that a real student would make
- Your final answer MUST be incorrect
- You are guaranteed to fail this problem because of your confusion
- END with your final numerical answer in this exact format: $\\boxed{{your_wrong_answer}}$

Your confused solution:"""
    
    return prompt

def extract_final_number(text):
    """
    Extract the final numerical answer from a response, prioritizing boxed format
    """
    
    # First, look for LaTeX boxed format: $\boxed{number}$ or $$\boxed{number}$$
    boxed_patterns = [
        r'\$\$\\boxed\{([^}]+)\}\$\$',  # $$\boxed{answer}$$
        r'\$\\boxed\{([^}]+)\}\$',      # $\boxed{answer}$
        r'\\boxed\{([^}]+)\}'           # \boxed{answer} without dollar signs
    ]
    
    for pattern in boxed_patterns:
        matches = re.findall(pattern, text)
        if matches:
            # Get the last match (final answer)
            answer_str = matches[-1].strip()
            try:
                # Try to convert to float
                # Handle common formats like fractions, decimals, etc.
                if '/' in answer_str:
                    # Handle fractions like "3/4"
                    parts = answer_str.split('/')
                    if len(parts) == 2:
                        return float(parts[0]) / float(parts[1])
                else:
                    # Handle regular numbers, including decimals
                    # Remove any non-numeric characters except decimal point and negative sign
                    cleaned = re.sub(r'[^\d.-]', '', answer_str)
                    if cleaned:
                        return float(cleaned)
            except (ValueError, ZeroDivisionError):
                continue
    
    # If no boxed format found, look for other common answer patterns
    answer_patterns = [
        r'(?:answer|result|solution)(?:\s*is)?(?:\s*:)?\s*([+-]?\d+(?:\.\d+)?)',  # "answer is 42"
        r'(?:final|total)(?:\s*answer)?(?:\s*is)?(?:\s*:)?\s*([+-]?\d+(?:\.\d+)?)',  # "final answer is 42"
        r'(?:equals?|=)\s*([+-]?\d+(?:\.\d+)?)',  # "equals 42"
        r'([+-]?\d+(?:\.\d+)?)\s*(?:dollars?|cents?|\$)',  # "42 dollars"
        r'\$\s*([+-]?\d+(?:\.\d+)?)',  # "$42"
    ]
    
    for pattern in answer_patterns:
        matches = re.findall(pattern, text, re.IGNORECASE)
        if matches:
            try:
                return float(matches[-1])
            except ValueError:
                continue
    
    # Last resort: find the last number in the text
    all_numbers = re.findall(r'([+-]?\d+(?:\.\d+)?)', text)
    if all_numbers:
        try:
            return float(all_numbers[-1])
        except ValueError:
            pass
    
    # If nothing found, return None
    return None

def validate_wrong_answer(generated_answer, correct_answer):
    """
    Check if the generated answer is actually wrong
    """
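    # Extract correct final answer (the value after '####'; strip thousands
    # separators so values such as "1,000" still parse)
    correct_final = None
    if '####' in correct_answer:
        correct_final_str = correct_answer.split('####')[-1].strip().replace(',', '')
        try:
            correct_final = float(correct_final_str)
        except ValueError:
            pass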
    
    # Extract generated final answer
    generated_final = extract_final_number(generated_answer)
    
    if correct_final is not None and generated_final is not None:
        is_wrong = abs(generated_final - correct_final) > 0.001  # Account for floating point
        return is_wrong, f"Generated: {generated_final}, Correct: {correct_final}"
    
    return True, "Could not extract numbers for comparison"  # Assume wrong if can't compare

def inject_manual_mistake(question, correct_answer, missing_concept):
    """
    Fallback: manually create a wrong answer by modifying the correct solution
    """
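    # Extract the final number from correct answer
    correct_final = None
    if '####' in correct_answer:
        try:
            correct_final = int(float(correct_answer.split('####')[-1].strip().replace(',', '')))
        except ValueError:
            pass
    
    if correct_final is not None:
        # Create a plausible wrong answer. Taking abs() of a negative candidate can
        # land back on the correct value (e.g. 5 - 10 -> -5 -> 5), so filter those out.
        offsets = [-10, -5, -3, -2, -1, 1, 2, 3, 5, 10]
        candidates = [abs(correct_final + d) for d in offsets
                      if abs(correct_final + d) != correct_final]
        wrong_final = random.choice(candidates)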
        
        return f"""Let me solve this step by step:

Looking at this problem, I need to work through it carefully.

[Due to confusion with {missing_concept}, I make an error in my calculations]

After working through the problem, I get {wrong_final}.

$\\boxed{{{wrong_final}}}$"""
    
    return f"I'm having trouble with this problem because I don't understand {missing_concept} well.\n\n$\\boxed{{0}}$"

def enhanced_wrong_answer_generation(question, missing_concept, correct_answer, max_attempts=3):
    """
    Try multiple strategies to generate wrong answers
    """
    strategies = [
        lambda: generate_wrong_answer_with_specific_mistake(question, missing_concept, correct_answer),
        lambda: generate_wrong_answer_adversarial(question, missing_concept),
        lambda: generate_wrong_answer_original(question, missing_concept)
    ]
    
    for attempt in range(max_attempts):
        for strategy_idx, strategy in enumerate(strategies):
            try:
                prompt = strategy()
                response = model.respond(prompt)
                
                # Validate if answer is actually wrong
                is_wrong, validation_msg = validate_wrong_answer(response.content, correct_answer)
                
                if is_wrong:
                    return response.content, f"Strategy {strategy_idx + 1}, Attempt {attempt + 1}"
                
                print(f"  Strategy {strategy_idx + 1}, Attempt {attempt + 1} failed: {validation_msg}")
                
            except Exception as e:
                print(f"  Strategy {strategy_idx + 1}, Attempt {attempt + 1} error: {e}")
                continue
    
    # If all strategies fail, use manual mistake injection
    manual_answer = inject_manual_mistake(question, correct_answer, missing_concept)
    return manual_answer, "Manual mistake injection"

# Load prerequisites data
concepts_df = pd.read_csv(r"D:\eduvlm-bench\cps\team-arka-eduvlmbench\gsm8k_unique_concepts.csv")
unique_concepts = concepts_df['Concept'].tolist()

# Main processing loop for first 200 questions
results = []
successful_generations = 0
failed_generations = 0

print("Starting enhanced wrong answer generation...")
print("=" * 50)

for i in range(min(200, len(ds['train']))):
    question = ds['train'][i]['question']
    correct_answer = ds['train'][i]['answer']
    
    print(f"\nProcessing question {i+1}/200...")
    print(f"Question: {question[:100]}...")
    
    # Step 1: Identify prerequisites for this specific question
    try:
        prerequisites = identify_prerequisites_for_question(question, unique_concepts)
    except Exception as e:
        print(f"  Error identifying prerequisites: {e}")
        continue
    
    if not prerequisites:
        print("  No prerequisites identified, skipping...")
        continue
    
    print(f"  Prerequisites found: {prerequisites}")
    
    # Step 2: Randomly select one prerequisite to remove
    missing_concept = random.choice(prerequisites)
    print(f"  Missing concept: {missing_concept}")
    
    # Step 3: Generate wrong answer with missing prerequisite
    try:
        wrong_answer, method_used = enhanced_wrong_answer_generation(
            question, missing_concept, correct_answer, max_attempts=3
        )
        print(f"  ✓ Generated wrong answer using: {method_used}")
        successful_generations += 1
        
    except Exception as e:
        print(f"  ✗ Failed to generate wrong answer: {e}")
        failed_generations += 1
        continue
    
    # Store results
    result = {
        'question_id': i,
        'question': question,
        'correct_answer': correct_answer,
        'all_prerequisites': prerequisites,
        'missing_prerequisite': missing_concept,
        'wrong_answer': wrong_answer,
        'generation_method': method_used
    }
    
    results.append(result)
    
    # Save intermediate results every 10 questions
    if (i + 1) % 10 == 0:
        checkpoint_filename = f'gsm8k_wrong_answers_checkpoint_{i+1}.json'
        with open(checkpoint_filename, 'w') as f:
            json.dump(results, f, indent=2)
        print(f"  Saved checkpoint: {checkpoint_filename}")

print("\n" + "=" * 50)
print("GENERATION COMPLETE!")
print(f"Successful generations: {successful_generations}")
print(f"Failed generations: {failed_generations}")
print(f"Total processed: {len(results)}")

# Save final results
if results:
    results_df = pd.DataFrame(results)
    results_df.to_csv('gsm8k_wrong_answers_with_missing_prerequisites_enhanced.csv', index=False)
    
    # Save as JSON as well
    with open('gsm8k_wrong_answers_enhanced.json', 'w') as f:
        json.dump(results, f, indent=2)
    
    print("Results saved to:")
    print("- gsm8k_wrong_answers_with_missing_prerequisites_enhanced.csv")
    print("- gsm8k_wrong_answers_enhanced.json")
    
    # Show some statistics
    print("\nGeneration method breakdown:")
    if 'generation_method' in results_df.columns:
        method_counts = results_df['generation_method'].value_counts()
        for method, count in method_counts.items():
            print(f"  {method}: {count}")
else:
    print("No results to save!")

print("\nDone! 🎉")

Starting enhanced wrong answer generation...

Processing question 1/200...
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How m...
  Prerequisites found: ['Addition', 'Multiplication', 'Fractions (Specifically Halves)', 'Application Of Arithmetic Operations In Real-World Scenarios', 'Applying Multiple Steps In A Calculation', 'Multi-Step Problem Solving', 'Interpreting Word Problems']
  Missing concept: Addition
  Strategy 1, Attempt 1 failed: Generated: 72.0, Correct: 72.0
  ✓ Generated wrong answer using: Strategy 2, Attempt 1

Processing question 2/200...
Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much ...
  Prerequisites found: ['Addition', 'Addition And Subtraction Of Time', 'Application Of Arithmetic Operations In Real-World Scenarios', 'Application Of Unit Rates', 'Calculating Hourly Rates', 'Converting Units (Minutes To Hours)', 'Monetary Units (Cents And Doll
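
The log above shows Strategy 1 being rejected for question 1 because the model's answer matched the reference (72.0 vs 72.0). Here is a reduced sketch of that check, assuming the response ends with `\boxed{...}`. `last_boxed_number` and `is_wrong` are hypothetical stand-ins for `extract_final_number` and `validate_wrong_answer`; they skip the fallback patterns, and the sample responses are made up.

```python
import re
from typing import Optional


def last_boxed_number(text: str) -> Optional[float]:
    """Return the number inside the last \\boxed{...}, or None if absent or non-numeric."""
    matches = re.findall(r"\\boxed\{([^}]+)\}", text)
    if not matches:
        return None
    # Keep only digits, decimal point, and minus sign (drops "$", ",", units)
    cleaned = re.sub(r"[^\d.\-]", "", matches[-1])
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_wrong(response: str, correct: float, tol: float = 1e-3) -> Optional[bool]:
    """True or False when the answers are comparable; None when no boxed number is found."""
    got = last_boxed_number(response)
    if got is None:
        return None
    return abs(got - correct) > tol


# Made-up responses for illustration
print(is_wrong(r"48 + 24 = 72, so $\boxed{72}$", 72.0))
print(is_wrong(r"48 + 48 = 96, so $\boxed{96}$", 72.0))
print(is_wrong("I am not sure.", 72.0))
```

Unlike the notebook's `validate_wrong_answer`, which treats an unparseable response as wrong, this sketch returns `None` in that case so the caller can decide whether to retry or accept it.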

In [1]:
import pandas as pd

df = pd.read_csv(r"D:\eduvlm-bench\cps\team-arka-eduvlmbench\gsm8k_wrong_answers_with_missing_prerequisites.csv")

df.head()

Unnamed: 0,question_id,question,correct_answer,all_prerequisites,missing_prerequisite,wrong_answer
0,0,Natalia sold clips to 48 of her friends in Apr...,Natalia sold 48/2 = <<48/2=24>>24 clips in May...,"['Addition', 'Multiplication', 'Fractions (Spe...",Multiplication,"Okay, let's tackle this problem! It says Natal..."
1,1,Weng earns $12 an hour for babysitting. Yester...,Weng earns 12/60 = $<<12/60=0.2>>0.2 per minut...,"['Addition', 'Addition And Subtraction Of Inte...",Converting Units (Minutes To Hours),"Okay, let's solve this!\n\nFirst, I need to fi..."
2,2,Betty is saving money for a new wallet which c...,"In the beginning, Betty has only 100 / 2 = $<<...","['Addition', 'Addition And Subtraction Of Inte...",Subtraction,"Okay, let's tackle this problem! It seems like..."
3,3,"Julie is reading a 120-page book. Yesterday, s...",Maila read 12 x 2 = <<12*2=24>>24 pages today....,"['Addition', 'Multiplication', 'Fractions', 'S...",Subtraction,"Okay, let's tackle this problem! It seems like..."
4,4,James writes a 3-page letter to 2 different fr...,He writes each friend 3*2=<<3*2=6>>6 pages a w...,"['Addition', 'Multiplication', 'Application Of...",Multi-Step Problem Solving,"Okay, let’s tackle this problem! It says James..."
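
In the preview above, `all_prerequisites` appears to be stored as the string form of a Python list, which is what `to_csv` writes for list-valued cells. A small sketch of turning one such cell back into a list with `ast.literal_eval`; the cell value here is illustrative, not copied from the file.

```python
import ast

# A cell value as it comes back from the CSV: a string, not a list (illustrative value)
raw_cell = "['Addition', 'Multiplication', 'Fractions (Specifically Halves)']"

# literal_eval parses Python literals only, so it is safer than eval() for file contents
prerequisites = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(prerequisites).__name__)
print(prerequisites[2])
```

With pandas, the same function could be applied to the whole column, for example `df["all_prerequisites"].apply(ast.literal_eval)`.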
