sablet/dspy_custom_coder

DSPyCustomCoder

Overview

A comprehensive ML pipeline development system consisting of two main components:

  1. Task Dialogue Bot: Interactive requirement gathering system that transforms unstructured user requests into structured task specifications
  2. DSPy Custom Coder: Code generation system that converts structured requirements into complete ML pipeline implementations

This implementation is inspired by the Paper2Code methodology, using DSPy's structured LLM interactions throughout both components.

Note: This is a DSPy adaptation of the original Paper2Code framework. Please refer to the original paper and repository for the foundational methodology and concepts.


🎯 Task Dialogue Bot

Overview

The Task Dialogue Bot is an interactive requirement gathering system that transforms unstructured user requests into structured task specifications. Using DSPy's signature-based approach, it conducts intelligent conversations to collect comprehensive domain-specific information.

Key Features

  • Adaptive Questioning: Context-aware questions with concrete examples and options
  • Domain-Specific Schemas: Tailored information collection for different task types
  • Flexible Input Handling: Natural language interpretation with LLM-powered intent recognition
  • Comprehensive Validation: Multi-stage verification ensuring complete and clear requirements
  • Structured Output: Final requirements saved as YAML files for downstream processing

Supported Domains

  • Data Processing (data_processing): Input/output formats, processing steps, purposes
  • Web Development (web_development): Site purpose, features, design preferences, technical requirements
  • General Requirements (general_requirements): Ideal scenarios, motivations, blocking factors
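Each domain implies a set of elements the bot must collect before the session can complete. As an illustrative sketch only (the element names below are hypothetical, loosely derived from the lists above; the actual schema keys live in `task_dialogue_bot.py`):

```python
# Hypothetical domain schemas mirroring the elements listed above;
# the actual field names in task_dialogue_bot.py may differ.
DOMAIN_SCHEMAS = {
    "data_processing": [
        "input_data_examples",
        "output_data_examples",
        "processing_overview",
        "processing_purpose",
    ],
    "web_development": [
        "site_purpose",
        "features",
        "design_preferences",
        "technical_requirements",
    ],
    "general_requirements": [
        "ideal_scenario",
        "motivation",
        "blocking_factors",
    ],
}

def missing_elements(domain: str, collected: dict) -> list:
    """Return schema elements not yet collected for a domain."""
    return [key for key in DOMAIN_SCHEMAS[domain] if key not in collected]
```

A schema-driven loop like this is what lets the bot know when to move from COLLECTING to comprehensive validation.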

DSPy Implementation Architecture

Core Signatures

  • ResponseValidationSignature: Validates user responses and determines follow-up needs
  • ValueInterpretationSignature: Extracts structured information from natural language input
  • OptionGenerationSignature: Creates contextually appropriate multiple-choice options
  • ComprehensiveValidationSignature: Performs final validation of collected requirements
  • IntentInterpretationSignature: Interprets user intentions (confirm/modify/skip actions)

State Management

class DialogueState(Enum):
    INITIAL = "initial"
    COLLECTING = "collecting"
    OPTION_SELECTION = "option_selection"
    CONFIRMING_VALUE = "confirming_value"
    MODIFYING = "modifying"
    COMPREHENSIVE_CLARIFICATION = "comprehensive_clarification"
    FINAL_REVIEW = "final_review"
    COMPLETED = "completed"

Workflow Process

  1. Domain Selection: User selects task domain (flexible input handling)
  2. Information Collection: Adaptive questioning with contextual options
  3. Value Confirmation: User confirms or modifies collected information
  4. Comprehensive Validation: System validates completeness and clarity
  5. Final Review: User reviews and finalizes all collected requirements
  6. Output Generation: Structured YAML file creation with timestamp

State Transition Flow

sequenceDiagram
    participant User
    participant Bot
    participant State as DialogueState

    Note over State: INITIAL
    User->>Bot: start_session()
    Bot-->>User: Domain selection menu
    
    User->>Bot: select_domain("1")
    Note over State: INITIAL → COLLECTING
    Bot-->>User: Task description request
    
    User->>Bot: "I want to analyze sales data from a CSV file"
    Bot->>Bot: _ask_next_question_with_options()
    Note over State: COLLECTING → OPTION_SELECTION
    Bot-->>User: Question with options (1,2,3,4)
    
    alt Option 1-3 selected
        User->>Bot: "1"
        Bot->>Bot: option_selector()
        Note over State: OPTION_SELECTION → COLLECTING
        Bot-->>User: Next question or validation
        
        alt All elements completed
            Bot->>Bot: _start_comprehensive_validation()
            Note over State: COLLECTING → COMPREHENSIVE_CLARIFICATION
            Bot-->>User: Validation questions
            
            alt Validation passed
                Bot->>Bot: _start_final_review()
                Note over State: COMPREHENSIVE_CLARIFICATION → FINAL_REVIEW
                Bot-->>User: Final review summary
            else Needs clarification
                User->>Bot: Clarification response
                Bot->>Bot: _handle_comprehensive_clarification()
                Note over State: COMPREHENSIVE_CLARIFICATION → MODIFYING
                Bot-->>User: Modification request
            end
        else More elements to collect
            Bot->>Bot: _ask_next_question_with_options()
            Note over State: Stay in COLLECTING
            Bot-->>User: Next question with options
        end
        
    else Option 4 (direct input) selected
        User->>Bot: "4"
        Note over State: OPTION_SELECTION → COLLECTING
        Bot-->>User: Direct input request
        
        User->>Bot: "A concrete answer"
        Bot->>Bot: value_interpreter()
        Note over State: COLLECTING → CONFIRMING_VALUE
        Bot-->>User: Confirmation message
        
        alt User confirms
            User->>Bot: "Yes"
            Note over State: CONFIRMING_VALUE → COLLECTING
            Bot->>Bot: Move to next element
            Bot-->>User: Next question or validation
        else User wants to modify
            User->>Bot: "Modify"
            Note over State: CONFIRMING_VALUE → MODIFYING
            Bot-->>User: Modification request
            
            User->>Bot: "New content"
            Bot->>Bot: _handle_modification()
            Note over State: MODIFYING → CONFIRMING_VALUE
            Bot-->>User: Confirm modified content
        else User wants to skip
            User->>Bot: "Continue"
            Note over State: CONFIRMING_VALUE → COLLECTING
            Bot->>Bot: Move to next element
            Bot-->>User: Next question
        end
    end
    
    rect rgb(255, 245, 245)
        Note over User, State: Final Review Phase
        Bot->>Bot: _start_final_review()
        Note over State: → FINAL_REVIEW
        Bot-->>User: Complete summary for final confirmation
        
        alt User confirms completion
            User->>Bot: "Finalize"
            Note over State: FINAL_REVIEW → COMPLETED
            Bot->>Bot: _save_result_as_yaml()
            Bot-->>User: Success message with file path
        else User requests modification
            User->>Bot: "Modify the processing overview"
            Bot->>Bot: modification_interpreter()
            Note over State: FINAL_REVIEW → MODIFYING
            Bot-->>User: Specific modification request
            
            User->>Bot: "New processing details"
            Note over State: MODIFYING → FINAL_REVIEW
            Bot-->>User: Updated final review
        end
    end
    
    Note over State: COMPLETED
    Bot-->>User: Session ended

Key State Transitions:

  • Initialization Flow: INITIAL → COLLECTING after domain selection
  • Information Collection: COLLECTING → OPTION_SELECTION with adaptive questioning
  • Confirmation Flow: COLLECTING → CONFIRMING_VALUE → MODIFYING (if needed)
  • Validation Flow: COLLECTING → COMPREHENSIVE_CLARIFICATION → FINAL_REVIEW
  • Completion Flow: FINAL_REVIEW → COMPLETED with YAML output generation
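The transitions above can be encoded as a simple adjacency map over `DialogueState`. This is an illustrative sketch of the legal transitions read off the sequence diagram, not the bot's actual dispatch logic:

```python
from enum import Enum

class DialogueState(Enum):
    INITIAL = "initial"
    COLLECTING = "collecting"
    OPTION_SELECTION = "option_selection"
    CONFIRMING_VALUE = "confirming_value"
    MODIFYING = "modifying"
    COMPREHENSIVE_CLARIFICATION = "comprehensive_clarification"
    FINAL_REVIEW = "final_review"
    COMPLETED = "completed"

# Legal transitions taken from the sequence diagram above.
TRANSITIONS = {
    DialogueState.INITIAL: {DialogueState.COLLECTING},
    DialogueState.COLLECTING: {
        DialogueState.OPTION_SELECTION,
        DialogueState.CONFIRMING_VALUE,
        DialogueState.COMPREHENSIVE_CLARIFICATION,
    },
    DialogueState.OPTION_SELECTION: {DialogueState.COLLECTING},
    DialogueState.CONFIRMING_VALUE: {DialogueState.COLLECTING, DialogueState.MODIFYING},
    DialogueState.MODIFYING: {DialogueState.CONFIRMING_VALUE, DialogueState.FINAL_REVIEW},
    DialogueState.COMPREHENSIVE_CLARIFICATION: {
        DialogueState.FINAL_REVIEW,
        DialogueState.MODIFYING,
    },
    DialogueState.FINAL_REVIEW: {DialogueState.COMPLETED, DialogueState.MODIFYING},
    DialogueState.COMPLETED: set(),
}

def can_transition(src: DialogueState, dst: DialogueState) -> bool:
    """Check whether a state change is permitted by the dialogue flow."""
    return dst in TRANSITIONS[src]
```

Guarding state changes through a table like this keeps the dialogue flow auditable as new states are added.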

Usage Examples

Starting a Session

from task_dialogue_bot import TaskDialogueBot

bot = TaskDialogueBot()
response = bot.start_session("Japanese")
print(response)
# Output: Domain selection menu with options

Interactive Dialogue Flow

# Domain selection
user_input = "1"  # Select data processing
response = bot.select_domain(user_input)

# Information collection with adaptive questioning
user_input = "CSVファイルの売上データを分析したい"  # "I want to analyze CSV sales data"
response = bot.process_user_input(user_input)
# Bot generates contextual options and follow-up questions

# Confirmation and modification
user_input = "はい"  # "Yes": confirm current value
response = bot.process_user_input(user_input)
# Bot moves to next question or final review

CLI Interactive Mode

uv run python task_dialogue_bot.py

Output Structure

Generated requirements are saved to output/requirements/ in YAML format:

output/requirements/202412191530_data_analysis_pipeline.yaml

Example Output:

processing_purpose: "売上データの傾向分析と予測モデル構築"  # sales trend analysis and forecasting-model building
input_data_examples: "CSV形式:日付,店舗,商品,売上金額"  # CSV format: date, store, product, sales amount
output_data_examples: "分析結果レポート(PDF)、予測値(CSV)"  # analysis report (PDF), predicted values (CSV)
processing_overview: "データクリーニング→統計分析→可視化→予測モデル訓練"  # data cleaning → statistical analysis → visualization → model training
metadata:
  generated_at: "2024-12-19T15:30:00"
  domain: "data_processing"
  completion_percentage: 100
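The output filename uses a `YYYYMMDDHHMM` prefix followed by a task slug, as in the path above. A minimal sketch of how such a path could be built (the helper name is hypothetical):

```python
from datetime import datetime
from pathlib import Path

def requirements_path(task_slug: str, now: datetime, base: str = "output/requirements") -> Path:
    """Build a timestamped YAML path like 202412191530_data_analysis_pipeline.yaml."""
    stamp = now.strftime("%Y%m%d%H%M")  # YYYYMMDDHHMM prefix
    return Path(base) / f"{stamp}_{task_slug}.yaml"

path = requirements_path("data_analysis_pipeline", datetime(2024, 12, 19, 15, 30))
```

Timestamped filenames keep repeated sessions from overwriting each other's requirements.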

🚀 DSPy Custom Coder

Overview

The DSPy Custom Coder is a code generation system that converts structured requirements (from the Task Dialogue Bot or direct input) into complete ML pipeline implementations. It follows the original Paper2Code methodology, implemented with DSPy signatures and modules.

Data Flow Architecture

Complete System Data Flow

flowchart TD
    %% Input Layer
    A["🎯 User Requirements<br/>- Natural language text<br/>- Optional test_config"]
    
    %% Main Agent Layer
    B["📋 PlanningOnlyAgent.forward()"]
    
    %% Step 1: Planning Agent Components
    C1["📝 OverallPlanningSignature<br/>Input: requirements<br/>Output: planning_response"]
    C2["🏗️ ArchitectureDesignSignature<br/>Input: planning_context<br/>Output: implementation_approach,<br/>file_list, data_structures_interfaces,<br/>program_call_flow"]
    C3["📋 TaskListGenerationSignature<br/>Input: combined_context<br/>Output: required_packages,<br/>logic_analysis, task_list"]
    C4["⚙️ ConfigurationGenerationSignature<br/>Input: compact_context<br/>Output: config_yaml"]
    
    %% Test Strategy Generation
    TS["🧪 Test Strategy Generation<br/>Input: requirements, file_list,<br/>has_api, has_ml<br/>Output: TestStrategy"]
    
    %% Context Management
    CI["📚 ContextIndex Creation<br/>Input: planning_summary,<br/>config_summary<br/>Output: context_index"]
    
    %% Step 2: Logic Analysis Loop
    LA["🔍 Logic Analysis Loop<br/>For each file in logic_analysis"]
    LAS["📊 LogicAnalysisSignature<br/>Input: requirements, planning_context,<br/>config_yaml, target_file,<br/>logic_analysis_desc<br/>Output: detailed_analysis"]
    
    %% Output Generation
    O1["📄 Master YAML Generation<br/>create_master_yaml()"]
    O2["📝 Planning Report Generation<br/>create_planning_markdown()"]
    O3["⚡ Task YAML Files<br/>create_task_yaml()"]
    O4["🧪 Test Task YAML Files<br/>create_test_task_yaml()"]
    O5["📊 Summary YAML<br/>planning_summary.yaml"]
    
    %% File System Outputs
    F1["📁 output/planning_output_*/"]
    F2["master_plan.yaml"]
    F3["planning_report.md"]
    F4["task_*.yaml files"]
    F5["test_task_*.yaml files"]
    F6["planning_summary.yaml"]
    
    %% Flow Connections
    A --> B
    B --> C1
    C1 --> C2
    C2 --> C3
    C3 --> C4
    C2 --> TS
    C1 --> CI
    C4 --> CI
    B --> LA
    LA --> LAS
    LAS --> LA
    
    %% Context flows
    CI -.-> LAS
    
    %% Output flows
    B --> O1
    O1 --> O2
    O1 --> O3
    O1 --> O4
    O1 --> O5
    
    %% File outputs
    O1 --> F1
    O2 --> F1
    O3 --> F1
    O4 --> F1
    O5 --> F1
    F1 --> F2
    F1 --> F3
    F1 --> F4
    F1 --> F5
    F1 --> F6
    
    %% Styling
    classDef inputNode fill:#e1f5fe
    classDef processNode fill:#f3e5f5
    classDef outputNode fill:#e8f5e8
    classDef fileNode fill:#fff3e0
    
    class A inputNode
    class B,C1,C2,C3,C4,TS,CI,LA,LAS processNode
    class O1,O2,O3,O4,O5 outputNode
    class F1,F2,F3,F4,F5,F6 fileNode

Detailed Step-by-Step Data Flow

sequenceDiagram
    participant User
    participant Main as main()
    participant POA as PlanningOnlyAgent
    participant PA as PlanningAgent
    participant OPS as OverallPlanningSignature
    participant ADS as ArchitectureDesignSignature
    participant TLS as TaskListGenerationSignature
    participant CGS as ConfigurationGenerationSignature
    participant TSG as TestStrategyGeneration
    participant LAA as LogicAnalysisAgent
    participant LAS as LogicAnalysisSignature
    participant FS as FileSystem
    
    User->>Main: requirements, test_config
    Main->>POA: forward(requirements, test_config)
    
    rect rgb(240, 248, 255)
        Note over POA: Step 1: Planning Phase
        POA->>PA: forward(requirements, test_config)
        
        PA->>OPS: requirements
        OPS-->>PA: planning_response
        
        PA->>ADS: planning_context=planning_response
        ADS-->>PA: implementation_approach, file_list,<br/>data_structures_interfaces, program_call_flow
        
        Note over PA: Create combined_context
        PA->>TLS: planning_context=combined_context
        TLS-->>PA: required_packages, logic_analysis,<br/>task_list, development_tools_config
        
        PA->>TSG: requirements, file_list, has_api, has_ml
        TSG-->>PA: TestStrategy(config, test_files)
        
        Note over PA: Create compact_context
        PA->>CGS: planning_context=compact_context
        CGS-->>PA: config_yaml
        
        PA-->>POA: planning_result{planning_response, implementation_approach,<br/>file_list, required_packages, logic_analysis,<br/>task_list, config_yaml, test_strategy}
    end
    
    rect rgb(255, 248, 240)
        Note over POA: Context Index Creation
        POA->>POA: create_planning_summary(planning_result)
        POA->>POA: create_config_summary(config_yaml)
        POA->>POA: ContextIndex(planning_summary, config_summary)
    end
    
    rect rgb(248, 255, 248)
        Note over POA: Step 2: Logic Analysis Phase
        loop For each file in logic_analysis
            POA->>LAA: requirements, planning_context,<br/>config_yaml, target_file, logic_analysis_desc
            LAA->>LAS: All input parameters
            LAS-->>LAA: detailed_analysis
            LAA-->>POA: detailed_analysis
            Note over POA: Store in detailed_analyses[filename]
        end
    end
    
    POA-->>Main: PlanningResult{planning_result, detailed_analyses,<br/>context_index, requirements, test_strategy}
    
    rect rgb(255, 240, 245)
        Note over Main: Output Generation Phase
        Main->>Main: create_master_yaml()
        Main->>Main: create_planning_markdown()
        Main->>Main: analyze_file_dependencies()
        
        loop For each logic_analysis file
            Main->>Main: create_task_yaml(filename, dependencies)
        end
        
        loop For each test_file
            Main->>Main: create_test_task_yaml(test_file, source_file)
        end
        
        Main->>FS: Save master_plan.yaml
        Main->>FS: Save planning_report.md
        Main->>FS: Save task_*.yaml files
        Main->>FS: Save test_task_*.yaml files
        Main->>FS: Save planning_summary.yaml
    end
    
    FS-->>User: output/planning_output_*/<br/>All generated files

Input/Output Data Structure Details

Planning Agent Input/Output

  • Input:
    • requirements: str - Natural language requirements
    • test_config: Optional[Dict[str, bool]] - Test configuration flags
  • Output:
    • planning_response: str - Strategic planning response
    • implementation_approach: str - Architecture design approach
    • file_list: List[str] - List of files to generate
    • data_structures_interfaces: Optional[str] - Mermaid class diagram
    • program_call_flow: Optional[str] - Mermaid sequence diagram
    • required_packages: List[str] - Python dependencies
    • logic_analysis: List[List[str]] - [filename, description] pairs
    • task_list: List[str] - Prioritized file generation order
    • config_yaml: str - Complete pyproject.toml content
    • test_strategy: TestStrategy - Test configuration and file list
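The planning output above can be represented as one typed container. This is a simplified sketch whose field names follow the list above; the repository's actual result classes may differ:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PlanningOutput:
    """Simplified container mirroring the Planning Agent outputs listed above."""
    planning_response: str
    implementation_approach: str
    file_list: List[str]
    required_packages: List[str]
    logic_analysis: List[List[str]]   # [filename, description] pairs
    task_list: List[str]              # prioritized file generation order
    config_yaml: str
    data_structures_interfaces: Optional[str] = None  # Mermaid class diagram
    program_call_flow: Optional[str] = None           # Mermaid sequence diagram

plan = PlanningOutput(
    planning_response="...",
    implementation_approach="layered pipeline",
    file_list=["main.py", "src/data.py"],
    required_packages=["pandas", "scikit-learn"],
    logic_analysis=[["src/data.py", "CSV loading and cleaning"]],
    task_list=["src/data.py", "main.py"],
    config_yaml='[project]\nname = "demo"\n',
)
```

A typed container like this makes the planning-to-analysis handoff explicit rather than passing loose dicts.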

Logic Analysis Agent Input/Output

  • Input:
    • requirements: str - Original requirements
    • planning_context: str - Accumulated context from ContextIndex
    • config_yaml: str - Configuration content
    • target_file: str - File to analyze
    • logic_analysis_desc: str - Analysis description from planning
  • Output:
    • detailed_analysis: str - Comprehensive implementation analysis
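Each logic-analysis call receives one prompt context assembled from these five inputs. A rough sketch of such an assembly step (the exact prompt format used by the repository is not shown here):

```python
def build_logic_analysis_input(
    requirements: str,
    planning_context: str,
    config_yaml: str,
    target_file: str,
    logic_analysis_desc: str,
) -> str:
    """Concatenate the Logic Analysis Agent inputs into one prompt context."""
    return "\n\n".join([
        f"## Requirements\n{requirements}",
        f"## Planning Context\n{planning_context}",
        f"## Configuration\n{config_yaml}",
        f"## Target File: {target_file}\n{logic_analysis_desc}",
    ])

ctx = build_logic_analysis_input("req", "plan", "cfg", "src/data.py", "load CSV")
```

Keeping the assembly in one function makes it easy to see exactly what context each per-file analysis receives.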

Context Management Flow

graph LR
    A[Planning Result] --> B[create_planning_summary]
    C[Config YAML] --> D[create_config_summary] 
    B --> E[ContextIndex]
    D --> E
    E --> F[get_context_string]
    F --> G[Logic Analysis Input]
    
    style E fill:#f9f,stroke:#333,stroke-width:2px
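A minimal sketch of the ContextIndex role shown in the graph above (field and method names are taken from the node labels; the real implementation may differ):

```python
from dataclasses import dataclass

@dataclass
class ContextIndex:
    """Holds compact summaries and renders them as one context string."""
    planning_summary: str
    config_summary: str

    def get_context_string(self) -> str:
        # Combined string becomes the planning_context input for logic analysis.
        return f"{self.planning_summary}\n\n{self.config_summary}"

index = ContextIndex(
    planning_summary="Files: main.py, src/data.py. Approach: layered pipeline.",
    config_summary="Packages: pandas, scikit-learn.",
)
```

Summarizing before the per-file loop keeps each logic-analysis prompt compact instead of re-sending the full planning output every time.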

Input & Output

Input

  • Requirements (string): Natural language description of ML pipeline needs
    Example: "Create a machine learning pipeline that loads CSV data, 
    trains a classification model, evaluates performance, and provides CLI predictions"
    
  • Structured Requirements (YAML): Output from Task Dialogue Bot

Output

  • Generated Files: Complete Python codebase with proper structure
    • main.py / app.py: Entry point
    • src/: Source modules (data processing, training, evaluation)
    • config.yaml: Configuration file
    • requirements.txt: Dependencies
  • Validation Results: Quality checks and warnings for each file
  • Metadata: Planning documents, architecture diagrams, task breakdowns

DSPy Implementation Workflow

This implementation follows the original Paper2Code methodology using DSPy signatures and modules:

1. Planning Agent (DSPy Module)

LLM Role: Strategic planner following DSPyCustomCoder's planning phase

  • Input: User requirements text
  • DSPy Signatures:
    • OverallPlanningSignature: Strategic planning for ML experiments
    • ArchitectureDesignSignature: System design with file structure
    • TaskListGenerationSignature: Task decomposition and dependencies
    • ConfigurationGenerationSignature: Configuration template creation
  • Output: Structured plan with file list, architecture diagrams, and config

2. Logic Analysis Agent (DSPy Module)

LLM Role: Implementation analyzer following DSPyCustomCoder's analysis phase

  • Input: Planning context + target filename
  • DSPy Signature: LogicAnalysisSignature
  • LLM Tasks:
    • Analyze specific implementation logic for each file
    • Define data structures and interfaces
    • Plan function signatures and class hierarchies
  • Output: Detailed implementation specifications per file

3. Code Generation Agent (DSPy Module)

LLM Role: Code generator following DSPyCustomCoder's generation phase

  • Input: Requirements + planning context + logic analysis + previous files
  • DSPy Signature: CodeGenerationSignature
  • LLM Tasks:
    • Generate complete Python code with type hints
    • Follow Ruff linting standards
    • Create clean code without exception handling
    • Maintain consistency across files
  • Output: Syntactically correct Python files
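Consistency across files comes from feeding previously generated files back into each new generation call. A rough sketch of that accumulation step (the helper name is hypothetical):

```python
def previous_files_context(generated: dict) -> str:
    """Render already-generated files so the next generation call can stay consistent."""
    sections = [f"### {name}\n{source}" for name, source in generated.items()]
    return "\n\n".join(sections)

# Files generated so far, keyed by filename.
generated = {"src/data.py": "def load_csv(path): ..."}
context = previous_files_context(generated)
# The next file's prompt would include `context` alongside requirements and analysis.
```

Because files are generated in task_list order, each file can reference interfaces defined in its predecessors.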

4. Validation System (Pydantic)

System Role: Quality assurance layer (DSPy enhancement)

  • Content completeness checks using GeneratedFile model
  • Code block formatting validation
  • Syntax and structure verification via CodeGenerationResult model
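The repository performs these checks with Pydantic models (`GeneratedFile`, `CodeGenerationResult`). As an illustrative stand-in using only the standard library, a completeness and syntax check can be sketched with `compile()`:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GeneratedFileCheck:
    """Stdlib stand-in for GeneratedFile validation: non-empty content + syntax check."""
    filename: str
    content: str

    def warnings(self) -> List[str]:
        issues = []
        if not self.content.strip():
            issues.append(f"{self.filename}: file is empty")
        try:
            # compile() parses without executing, catching syntax errors early.
            compile(self.content, self.filename, "exec")
        except SyntaxError as exc:
            issues.append(f"{self.filename}: syntax error at line {exc.lineno}")
        return issues
```

Collecting warnings rather than raising lets the pipeline report all problem files in one validation pass.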

Processing Flow

graph TD
    A[User Requirements] --> B(Planning Agent);
    B --> C{Planning Context, File List, Config};
    C --> D(Loop: For Each File in File List);
    D --> E(Logic Analysis Agent);
    E --> F{Detailed Logic Analysis for Current File};
    F --> G(Code Generation Agent);
    G --> H{Generated Code for Current File};
    H --> I[Accumulate Generated Files];
    I --> J(End Loop);
    J --> K(Pydantic Validation);
    K --> L[Validated Codebase];
    L --> M[Save Files to output/generated_code/];

Output Structure

Generated code is saved to output/generated_code/ with a complete ML pipeline ready for use.


🔧 Installation

# Install dependencies
uv sync

# Run the interactive Task Dialogue Bot
uv run python task_dialogue_bot.py

# Run the DSPy Custom Coder implementation
uv run python main.py

Acknowledgments

This implementation is based on the Paper2Code methodology. Please cite the original work when using this DSPy adaptation.
