# Preprocess Problem Statements

This notebook reads the problem statements YAML file and creates a CSV with one row per problem, persona, and version (natural or corrected). Each row contains:
- Problem ID (e.g., "1_experimentalist_basic_natural", "1_experimentalist_basic_corrected")
- Domain (e.g., "materials_science")
- Complexity (e.g., "simple", "intermediate", "advanced")
- Persona (e.g., "experimentalist_basic", "industrial_practitioner", "research_scientist")
- Description (problem category description)
- Notes (version-specific notes)
- Problem (the actual problem text)
- Expected grid selections (up to eight named keys, serialized as JSON)
- Solution (ground truth, serialized as JSON)
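
As a rough illustration of how these fields map onto rows, the sketch below builds one record with the keys the notebook reads and derives the row IDs it would produce. The record is hypothetical: its values are placeholders, and `row_ids` is a helper introduced here for illustration, not a function from the notebook.

```python
# Hypothetical record mirroring the keys the notebook reads; all values are placeholders.
record = {
    "id": 1,
    "domain": "materials_science",
    "complexity": "simple",
    "description": "Ceramic sintering optimization",
    "expected_grid_selections": {"objective": "Single", "model": "Default"},
    "solution": {"search_space": []},
    "personas": {
        "experimentalist_basic": {
            "natural_version": {"description": "Find the best\n temperature.", "notes": "Underspecified"},
            "corrected_version": {"description": "Optimize temperature and time.", "notes": ""},
        }
    },
}


def row_ids(problem):
    """Return one ID per persona/version pair present in the record."""
    ids = []
    for persona_name, persona_data in problem.get("personas", {}).items():
        for version in ("natural", "corrected"):
            # A version is only emitted when its "<version>_version" key exists.
            if f"{version}_version" in persona_data:
                ids.append(f"{problem['id']}_{persona_name}_{version}")
    return ids


print(row_ids(record))
```

Each persona contributes at most two rows, so a problem with three personas yields up to six rows.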

In [23]:
import yaml
import pandas as pd
import json
from pathlib import Path

In [24]:
# Load the problem statements YAML file
yaml_path = Path("C:\\Users\\hsayeed\\Documents\\GitHub\\honegumi_rag_assistant\\data\\raw\\problem_statements.yaml")

with open(yaml_path, 'r', encoding='utf-8') as f:
    data = yaml.safe_load(f)

problem_statements = data['problem_statements']
print(f"Loaded {len(problem_statements)} problem statements")

Loaded 30 problem statements
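
The processing cell indexes `id`, `expected_grid_selections`, and `solution` directly, so a record missing any of them would raise a `KeyError` partway through the loop. A minimal pre-check, assuming those three keys are the only mandatory ones, might look like the sketch below. `REQUIRED_KEYS` and `find_missing_keys` are hypothetical names, not part of the notebook.

```python
# Assumption: these are the keys the processing loop accesses without a default.
REQUIRED_KEYS = ("id", "expected_grid_selections", "solution")


def find_missing_keys(problem_statements):
    """Map each record's list position to the required keys it lacks."""
    missing = {}
    for index, problem in enumerate(problem_statements):
        absent = [key for key in REQUIRED_KEYS if key not in problem]
        if absent:
            missing[index] = absent
    return missing


# Toy input: the second record has no expected_grid_selections.
sample = [
    {"id": 1, "expected_grid_selections": {}, "solution": {}},
    {"id": 2, "solution": {}},
]
print(find_missing_keys(sample))
```

Running this on `problem_statements` right after loading would report every incomplete record at once instead of stopping at the first failure.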


In [25]:
# Process each problem statement and create rows for CSV
rows = []

for problem in problem_statements:
    problem_id = problem['id']
    
    # Extract metadata
    domain = problem.get('domain', '')
    complexity = problem.get('complexity', '')
    description = problem.get('description', '')
    
    # Extract grid selections (only the eight keys listed below, when present)
    grid_selections = problem['expected_grid_selections']
    grid_keys = ['objective', 'model', 'task', 'categorical', 'sum_constraint', 
                 'order_constraint', 'linear_constraint', 'composition_constraint']
    grid_dict = {k: grid_selections[k] for k in grid_keys if k in grid_selections}
    
    # Extract solution
    solution = problem['solution']
    
    # Iterate through personas
    personas = problem.get('personas', {})
    for persona_name, persona_data in personas.items():
        # Create rows for natural and corrected versions
        for version in ['natural', 'corrected']:
            version_key = f"{version}_version"
            if version_key in persona_data:
                # Collapse all whitespace runs (including newlines) into single spaces
                problem_text = persona_data[version_key]['description'].strip()
                problem_text = ' '.join(problem_text.split())
                
                # Extract notes from the version; treat a missing or null value as empty
                notes = persona_data[version_key].get('notes') or ''
                notes = ' '.join(notes.split())
                
                row = {
                    'problem_id': f"{problem_id}_{persona_name}_{version}",
                    'domain': domain,
                    'complexity': complexity,
                    'persona': persona_name,
                    'description': description,
                    'notes': notes,
                    'problem': problem_text,
                    'expected_grid_selections': json.dumps(grid_dict, ensure_ascii=False),
                    'solution': json.dumps(solution, ensure_ascii=False)
                }
                rows.append(row)

print(f"Created {len(rows)} rows")
df = pd.DataFrame(rows)
print(f"Columns: {list(df.columns)}")
df.head()

Created 180 rows
Columns: ['problem_id', 'domain', 'complexity', 'persona', 'description', 'notes', 'problem', 'expected_grid_selections', 'solution']
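
The count of 180 is consistent with 30 problems × 3 personas × 2 versions, assuming every problem defines three personas with both versions. A small sanity check along those lines is sketched here. `check_rows` is a hypothetical helper, and the default of three personas is an assumption rather than something the YAML schema is known to guarantee.

```python
from collections import Counter


def check_rows(rows, n_problems, personas_per_problem=3, versions_per_persona=2):
    """Return (count_matches, duplicate_ids) for a list of row dicts."""
    expected = n_problems * personas_per_problem * versions_per_persona
    counts = Counter(row["problem_id"] for row in rows)
    duplicates = sorted(pid for pid, n in counts.items() if n > 1)
    return len(rows) == expected, duplicates


# Toy input: one problem, three personas, two versions each.
toy_rows = [
    {"problem_id": f"1_{persona}_{version}"}
    for persona in ("experimentalist_basic", "industrial_practitioner", "research_scientist")
    for version in ("natural", "corrected")
]
count_ok, duplicate_ids = check_rows(toy_rows, n_problems=1)
print(count_ok, duplicate_ids)
```

In the notebook this could be called as `check_rows(rows, len(problem_statements))` before building the DataFrame.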


Unnamed: 0,problem_id,domain,complexity,persona,description,notes,problem,expected_grid_selections,solution
0,1_experimentalist_basic_natural,materials_science,simple,experimentalist_basic,Ceramic sintering optimization,Typical underspecified request from a material...,I need to find the best temperature and time f...,"{""objective"": ""Single"", ""model"": ""Default"", ""t...","{""search_space"": [{""name"": ""temperature"", ""typ..."
1,1_experimentalist_basic_corrected,materials_science,simple,experimentalist_basic,Ceramic sintering optimization,"More conversational, includes key details but ...",I want to optimize sintering conditions for ce...,"{""objective"": ""Single"", ""model"": ""Default"", ""t...","{""search_space"": [{""name"": ""temperature"", ""typ..."
2,1_industrial_practitioner_natural,materials_science,simple,industrial_practitioner,Ceramic sintering optimization,Industrial practitioner with cost consciousnes...,We need to optimize our ceramic sintering proc...,"{""objective"": ""Single"", ""model"": ""Default"", ""t...","{""search_space"": [{""name"": ""temperature"", ""typ..."
3,1_industrial_practitioner_corrected,materials_science,simple,industrial_practitioner,Ceramic sintering optimization,More structured but maintains industrial/pract...,We're optimizing our ceramic sintering line to...,"{""objective"": ""Single"", ""model"": ""Default"", ""t...","{""search_space"": [{""name"": ""temperature"", ""typ..."
4,1_research_scientist_natural,materials_science,simple,research_scientist,Ceramic sintering optimization,"Academic researcher with formal, literature-ba...",I'm investigating the sintering behavior of ce...,"{""objective"": ""Single"", ""model"": ""Default"", ""t...","{""search_space"": [{""name"": ""temperature"", ""typ..."
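
The `expected_grid_selections` and `solution` columns hold JSON strings, so anything reading the CSV needs to decode them before use. The sketch below shows the round trip using only the standard library and an in-memory buffer. The sample values are placeholders. With pandas, applying `json.loads` to each of those columns after `pd.read_csv` should achieve the same result.

```python
import csv
import io
import json

# Placeholder values standing in for one problem's grid selections and solution.
grid = {"objective": "Single", "model": "Default"}
solution = {"search_space": [{"name": "temperature"}]}

# Write one row the way the notebook does: nested data serialized with json.dumps.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["problem_id", "expected_grid_selections", "solution"])
writer.writeheader()
writer.writerow({
    "problem_id": "1_experimentalist_basic_natural",
    "expected_grid_selections": json.dumps(grid, ensure_ascii=False),
    "solution": json.dumps(solution, ensure_ascii=False),
})

# Read the row back and decode the JSON columns into Python objects.
buffer.seek(0)
row = next(csv.DictReader(buffer))
decoded_grid = json.loads(row["expected_grid_selections"])
decoded_solution = json.loads(row["solution"])
print(decoded_grid, decoded_solution)
```

Storing nested structures as JSON keeps the CSV flat, at the cost of this extra decoding step for consumers.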


In [26]:
# Save to CSV
output_csv = Path("C:\\Users\\hsayeed\\Documents\\GitHub\\honegumi_rag_assistant\\data\\processed\\questions.csv")
output_csv.parent.mkdir(parents=True, exist_ok=True)

df.to_csv(output_csv, index=False)
print(f"\n✓ Saved {len(df)} problems to: {output_csv}")
print(f"\nColumns: {list(df.columns)}")
print(f"Problem IDs: {df['problem_id'].tolist()}")


✓ Saved 180 problems to: C:\Users\hsayeed\Documents\GitHub\honegumi_rag_assistant\data\processed\questions.csv

Columns: ['problem_id', 'domain', 'complexity', 'persona', 'description', 'notes', 'problem', 'expected_grid_selections', 'solution']
Problem IDs: ['1_experimentalist_basic_natural', '1_experimentalist_basic_corrected', '1_industrial_practitioner_natural', '1_industrial_practitioner_corrected', '1_research_scientist_natural', '1_research_scientist_corrected', '2_computational_intermediate_natural', '2_computational_intermediate_corrected', '2_industrial_practitioner_natural', '2_industrial_practitioner_corrected', '2_research_scientist_natural', '2_research_scientist_corrected', '3_data_science_advanced_natural', '3_data_science_advanced_corrected', '3_industrial_practitioner_natural', '3_industrial_practitioner_corrected', '3_research_scientist_natural', '3_research_scientist_corrected', '4_experimentalist_basic_natural', '4_experimentalist_basic_corrected', '4_industrial_p