A comprehensive ML pipeline development system consisting of two main components:
- Task Dialogue Bot: Interactive requirement gathering system that transforms unstructured user requests into structured task specifications
- DSPy Custom Coder: Code generation system that converts structured requirements into complete ML pipeline implementations
This implementation is inspired by the Paper2Code methodology, using DSPy's structured LLM interactions throughout both components.
Note: This is a DSPy adaptation of the original Paper2Code framework. Please refer to the original paper and repository for the foundational methodology and concepts.
The Task Dialogue Bot is an interactive requirement gathering system that transforms unstructured user requests into structured task specifications. Using DSPy's signature-based approach, it conducts intelligent conversations to collect comprehensive domain-specific information.
- Adaptive Questioning: Context-aware questions with concrete examples and options
- Domain-Specific Schemas: Tailored information collection for different task types
- Flexible Input Handling: Natural language interpretation with LLM-powered intent recognition
- Comprehensive Validation: Multi-stage verification ensuring complete and clear requirements
- Structured Output: Final requirements saved as YAML files for downstream processing
Supported task domains:
- Data Processing (`data_processing`): Input/output formats, processing steps, purposes
- Web Development (`web_development`): Site purpose, features, design preferences, technical requirements
- General Requirements (`general_requirements`): Ideal scenarios, motivations, blocking factors
DSPy signatures used by the bot:
- `ResponseValidationSignature`: Validates user responses and determines follow-up needs
- `ValueInterpretationSignature`: Extracts structured information from natural language input
- `OptionGenerationSignature`: Creates contextually appropriate multiple-choice options
- `ComprehensiveValidationSignature`: Performs final validation of collected requirements
- `IntentInterpretationSignature`: Interprets user intentions (confirm/modify/skip actions)
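The exact field definitions live in the source; as an illustration of the pattern, here is a minimal DSPy sketch of what a response-validation signature could look like (the field names are assumptions, not the project's actual definitions):

```python
import dspy

# Assumes an LM has been configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ResponseValidationSignature(dspy.Signature):
    """Judge whether a user's answer fully addresses the question that was asked."""

    # Illustrative fields; the repo's actual signature may differ.
    question: str = dspy.InputField(desc="Question the bot asked")
    user_response: str = dspy.InputField(desc="User's raw, natural-language answer")
    is_sufficient: bool = dspy.OutputField(desc="True when no follow-up is needed")
    follow_up_question: str = dspy.OutputField(desc="Follow-up to ask when the answer is incomplete")

# Wrapped in a predictor, the signature can be called once per user answer:
validate_response = dspy.ChainOfThought(ResponseValidationSignature)
result = validate_response(
    question="What format is your input data?",
    user_response="Some CSV files from our sales system",
)
# result.is_sufficient and result.follow_up_question drive the next dialogue step.
```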
The bot tracks conversation progress with a dialogue state machine:

```python
from enum import Enum

class DialogueState(Enum):
    INITIAL = "initial"
    COLLECTING = "collecting"
    OPTION_SELECTION = "option_selection"
    CONFIRMING_VALUE = "confirming_value"
    MODIFYING = "modifying"
    COMPREHENSIVE_CLARIFICATION = "comprehensive_clarification"
    FINAL_REVIEW = "final_review"
    COMPLETED = "completed"
```

The dialogue proceeds through the following phases:
- Domain Selection: User selects task domain (flexible input handling)
- Information Collection: Adaptive questioning with contextual options
- Value Confirmation: User confirms or modifies collected information
- Comprehensive Validation: System validates completeness and clarity
- Final Review: User reviews and finalizes all collected requirements
- Output Generation: Structured YAML file creation with timestamp
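How these phases are driven in code is an implementation detail; the sketch below shows one plausible way to dispatch on the `DialogueState` enum. `process_user_input` and `_handle_comprehensive_clarification` appear in the repo's usage example and diagram, while the other `_handle_*` helper names are illustrative assumptions.

```python
def process_user_input(self, user_input: str) -> str:
    """Route the user's message to a handler based on the current dialogue state."""
    # Sketch only: handler names other than _handle_comprehensive_clarification are assumed.
    if self.state == DialogueState.INITIAL:
        return self._handle_domain_selection(user_input)
    if self.state in (DialogueState.COLLECTING, DialogueState.OPTION_SELECTION):
        return self._handle_answer(user_input)
    if self.state == DialogueState.CONFIRMING_VALUE:
        return self._handle_confirmation(user_input)  # confirm / modify / skip
    if self.state == DialogueState.COMPREHENSIVE_CLARIFICATION:
        return self._handle_comprehensive_clarification(user_input)
    if self.state == DialogueState.FINAL_REVIEW:
        return self._handle_final_review(user_input)  # may save YAML and set COMPLETED
    return "Session already completed."
```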
The sequence diagram below shows the full set of state transitions:

```mermaid
sequenceDiagram
participant User
participant Bot
participant State as DialogueState
Note over State: INITIAL
User->>Bot: start_session()
Bot-->>User: Domain selection menu
User->>Bot: select_domain("1")
Note over State: INITIAL → COLLECTING
Bot-->>User: Task description request
User->>Bot: "CSVファイルの売上データを分析したい"
Bot->>Bot: _ask_next_question_with_options()
Note over State: COLLECTING → OPTION_SELECTION
Bot-->>User: Question with options (1,2,3,4)
alt Option 1-3 selected
User->>Bot: "1"
Bot->>Bot: option_selector()
Note over State: OPTION_SELECTION → COLLECTING
Bot-->>User: Next question or validation
alt All elements completed
Bot->>Bot: _start_comprehensive_validation()
Note over State: COLLECTING → COMPREHENSIVE_CLARIFICATION
Bot-->>User: Validation questions
alt Validation passed
Bot->>Bot: _start_final_review()
Note over State: COMPREHENSIVE_CLARIFICATION → FINAL_REVIEW
Bot-->>User: Final review summary
else Needs clarification
User->>Bot: Clarification response
Bot->>Bot: _handle_comprehensive_clarification()
Note over State: COMPREHENSIVE_CLARIFICATION → MODIFYING
Bot-->>User: Modification request
end
else More elements to collect
Bot->>Bot: _ask_next_question_with_options()
Note over State: Stay in COLLECTING
Bot-->>User: Next question with options
end
else Option 4 (direct input) selected
User->>Bot: "4"
Note over State: OPTION_SELECTION → COLLECTING
Bot-->>User: Direct input request
User->>Bot: "具体的な回答"
Bot->>Bot: value_interpreter()
Note over State: COLLECTING → CONFIRMING_VALUE
Bot-->>User: Confirmation message
alt User confirms
User->>Bot: "はい"
Note over State: CONFIRMING_VALUE → COLLECTING
Bot->>Bot: Move to next element
Bot-->>User: Next question or validation
else User wants to modify
User->>Bot: "修正"
Note over State: CONFIRMING_VALUE → MODIFYING
Bot-->>User: Modification request
User->>Bot: "新しい内容"
Bot->>Bot: _handle_modification()
Note over State: MODIFYING → CONFIRMING_VALUE
Bot-->>User: Confirm modified content
else User wants to skip
User->>Bot: "続ける"
Note over State: CONFIRMING_VALUE → COLLECTING
Bot->>Bot: Move to next element
Bot-->>User: Next question
end
end
rect rgb(255, 245, 245)
Note over User, State: Final Review Phase
Bot->>Bot: _start_final_review()
Note over State: → FINAL_REVIEW
Bot-->>User: Complete summary for final confirmation
alt User confirms completion
User->>Bot: "確定"
Note over State: FINAL_REVIEW → COMPLETED
Bot->>Bot: _save_result_as_yaml()
Bot-->>User: Success message with file path
else User requests modification
User->>Bot: "処理概要を修正"
Bot->>Bot: modification_interpreter()
Note over State: FINAL_REVIEW → MODIFYING
Bot-->>User: Specific modification request
User->>Bot: "新しい処理内容"
Note over State: MODIFYING → FINAL_REVIEW
Bot-->>User: Updated final review
end
end
Note over State: COMPLETED
Bot-->>User: Session ended
```

Key State Transitions:
- Initialization Flow: `INITIAL` → `COLLECTING` after domain selection
- Information Collection: `COLLECTING` ↔ `OPTION_SELECTION` with adaptive questioning
- Confirmation Flow: `COLLECTING` → `CONFIRMING_VALUE` → `MODIFYING` (if needed)
- Validation Flow: `COLLECTING` → `COMPREHENSIVE_CLARIFICATION` → `FINAL_REVIEW`
- Completion Flow: `FINAL_REVIEW` → `COMPLETED` with YAML output generation
Usage example:

```python
from task_dialogue_bot import TaskDialogueBot
bot = TaskDialogueBot()
response = bot.start_session("Japanese")
print(response)
# Output: Domain selection menu with options

# Domain selection
user_input = "1" # Select data processing
response = bot.select_domain(user_input)
# Information collection with adaptive questioning
user_input = "CSVファイルの売上データを分析したい"
response = bot.process_user_input(user_input)
# Bot generates contextual options and follow-up questions
# Confirmation and modification
user_input = "はい" # Confirm current value
response = bot.process_user_input(user_input)
# Bot moves to next question or final review
```

Run the interactive bot with:

```bash
uv run python task_dialogue_bot.py
```

Generated requirements are saved to output/requirements/ in YAML format:

```
output/requirements/202412191530_data_analysis_pipeline.yaml
```
Example Output:

```yaml
processing_purpose: "売上データの傾向分析と予測モデル構築"  # Sales-data trend analysis and building a forecasting model
input_data_examples: "CSV形式:日付,店舗,商品,売上金額"  # CSV format: date, store, product, sales amount
output_data_examples: "分析結果レポート(PDF)、予測値(CSV)"  # Analysis report (PDF), predicted values (CSV)
processing_overview: "データクリーニング→統計分析→可視化→予測モデル訓練"  # Data cleaning → statistical analysis → visualization → model training
metadata:
  generated_at: "2024-12-19T15:30:00"
  domain: "data_processing"
  completion_percentage: 100
```

The DSPy Custom Coder is a code generation system that converts structured requirements (from the Task Dialogue Bot or direct input) into complete ML pipeline implementations, following the original Paper2Code methodology using DSPy signatures and modules.
```mermaid
flowchart TD
%% Input Layer
A["🎯 User Requirements<br/>- Natural language text<br/>- Optional test_config"]
%% Main Agent Layer
B["📋 PlanningOnlyAgent.forward()"]
%% Step 1: Planning Agent Components
C1["📝 OverallPlanningSignature<br/>Input: requirements<br/>Output: planning_response"]
C2["🏗️ ArchitectureDesignSignature<br/>Input: planning_context<br/>Output: implementation_approach,<br/>file_list, data_structures_interfaces,<br/>program_call_flow"]
C3["📋 TaskListGenerationSignature<br/>Input: combined_context<br/>Output: required_packages,<br/>logic_analysis, task_list"]
C4["⚙️ ConfigurationGenerationSignature<br/>Input: compact_context<br/>Output: config_yaml"]
%% Test Strategy Generation
TS["🧪 Test Strategy Generation<br/>Input: requirements, file_list,<br/>has_api, has_ml<br/>Output: TestStrategy"]
%% Context Management
CI["📚 ContextIndex Creation<br/>Input: planning_summary,<br/>config_summary<br/>Output: context_index"]
%% Step 2: Logic Analysis Loop
LA["🔍 Logic Analysis Loop<br/>For each file in logic_analysis"]
LAS["📊 LogicAnalysisSignature<br/>Input: requirements, planning_context,<br/>config_yaml, target_file,<br/>logic_analysis_desc<br/>Output: detailed_analysis"]
%% Output Generation
O1["📄 Master YAML Generation<br/>create_master_yaml()"]
O2["📝 Planning Report Generation<br/>create_planning_markdown()"]
O3["⚡ Task YAML Files<br/>create_task_yaml()"]
O4["🧪 Test Task YAML Files<br/>create_test_task_yaml()"]
O5["📊 Summary YAML<br/>planning_summary.yaml"]
%% File System Outputs
F1["📁 output/planning_output_*/"]
F2["master_plan.yaml"]
F3["planning_report.md"]
F4["task_*.yaml files"]
F5["test_task_*.yaml files"]
F6["planning_summary.yaml"]
%% Flow Connections
A --> B
B --> C1
C1 --> C2
C2 --> C3
C3 --> C4
C2 --> TS
C1 --> CI
C4 --> CI
B --> LA
LA --> LAS
LAS --> LA
%% Context flows
CI -.-> LAS
%% Output flows
B --> O1
O1 --> O2
O1 --> O3
O1 --> O4
O1 --> O5
%% File outputs
O1 --> F1
O2 --> F1
O3 --> F1
O4 --> F1
O5 --> F1
F1 --> F2
F1 --> F3
F1 --> F4
F1 --> F5
F1 --> F6
%% Styling
classDef inputNode fill:#e1f5fe
classDef processNode fill:#f3e5f5
classDef outputNode fill:#e8f5e8
classDef fileNode fill:#fff3e0
class A inputNode
class B,C1,C2,C3,C4,TS,CI,LA,LAS processNode
class O1,O2,O3,O4,O5 outputNode
class F1,F2,F3,F4,F5,F6 fileNode
```

```mermaid
sequenceDiagram
participant User
participant Main as main()
participant POA as PlanningOnlyAgent
participant PA as PlanningAgent
participant OPS as OverallPlanningSignature
participant ADS as ArchitectureDesignSignature
participant TLS as TaskListGenerationSignature
participant CGS as ConfigurationGenerationSignature
participant TSG as TestStrategyGeneration
participant LAA as LogicAnalysisAgent
participant LAS as LogicAnalysisSignature
participant FS as FileSystem
User->>Main: requirements, test_config
Main->>POA: forward(requirements, test_config)
rect rgb(240, 248, 255)
Note over POA: Step 1: Planning Phase
POA->>PA: forward(requirements, test_config)
PA->>OPS: requirements
OPS-->>PA: planning_response
PA->>ADS: planning_context=planning_response
ADS-->>PA: implementation_approach, file_list,<br/>data_structures_interfaces, program_call_flow
Note over PA: Create combined_context
PA->>TLS: planning_context=combined_context
TLS-->>PA: required_packages, logic_analysis,<br/>task_list, development_tools_config
PA->>TSG: requirements, file_list, has_api, has_ml
TSG-->>PA: TestStrategy(config, test_files)
Note over PA: Create compact_context
PA->>CGS: planning_context=compact_context
CGS-->>PA: config_yaml
PA-->>POA: planning_result{planning_response, implementation_approach,<br/>file_list, required_packages, logic_analysis,<br/>task_list, config_yaml, test_strategy}
end
rect rgb(255, 248, 240)
Note over POA: Context Index Creation
POA->>POA: create_planning_summary(planning_result)
POA->>POA: create_config_summary(config_yaml)
POA->>POA: ContextIndex(planning_summary, config_summary)
end
rect rgb(248, 255, 248)
Note over POA: Step 2: Logic Analysis Phase
loop For each file in logic_analysis
POA->>LAA: requirements, planning_context,<br/>config_yaml, target_file, logic_analysis_desc
LAA->>LAS: All input parameters
LAS-->>LAA: detailed_analysis
LAA-->>POA: detailed_analysis
Note over POA: Store in detailed_analyses[filename]
end
end
POA-->>Main: PlanningResult{planning_result, detailed_analyses,<br/>context_index, requirements, test_strategy}
rect rgb(255, 240, 245)
Note over Main: Output Generation Phase
Main->>Main: create_master_yaml()
Main->>Main: create_planning_markdown()
Main->>Main: analyze_file_dependencies()
loop For each logic_analysis file
Main->>Main: create_task_yaml(filename, dependencies)
end
loop For each test_file
Main->>Main: create_test_task_yaml(test_file, source_file)
end
Main->>FS: Save master_plan.yaml
Main->>FS: Save planning_report.md
Main->>FS: Save task_*.yaml files
Main->>FS: Save test_task_*.yaml files
Main->>FS: Save planning_summary.yaml
end
FS-->>User: output/planning_output_*/<br/>All generated files
```

PlanningAgent

- Input:
  - `requirements: str` - Natural language requirements
  - `test_config: Optional[Dict[str, bool]]` - Test configuration flags
- Output:
  - `planning_response: str` - Strategic planning response
  - `implementation_approach: str` - Architecture design approach
  - `file_list: List[str]` - List of files to generate
  - `data_structures_interfaces: Optional[str]` - Mermaid class diagram
  - `program_call_flow: Optional[str]` - Mermaid sequence diagram
  - `required_packages: List[str]` - Python dependencies
  - `logic_analysis: List[List[str]]` - [filename, description] pairs
  - `task_list: List[str]` - Prioritized file generation order
  - `config_yaml: str` - Complete pyproject.toml content
  - `test_strategy: TestStrategy` - Test configuration and file list

LogicAnalysisAgent

- Input:
  - `requirements: str` - Original requirements
  - `planning_context: str` - Accumulated context from ContextIndex
  - `config_yaml: str` - Configuration content
  - `target_file: str` - File to analyze
  - `logic_analysis_desc: str` - Analysis description from planning
- Output:
  - `detailed_analysis: str` - Comprehensive implementation analysis
```mermaid
graph LR
A[Planning Result] --> B[create_planning_summary]
C[Config YAML] --> D[create_config_summary]
B --> E[ContextIndex]
D --> E
E --> F[get_context_string]
F --> G[Logic Analysis Input]
style E fill:#f9f,stroke:#333,stroke-width:2px
```
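A compact sketch of the ContextIndex shown in the graph above; the constructor arguments and `get_context_string` come from the diagram, while the concrete formatting is an assumption:

```python
from dataclasses import dataclass

@dataclass
class ContextIndex:
    """Condensed planning and config summaries, reused as shared context for logic analysis."""

    planning_summary: str
    config_summary: str

    def get_context_string(self) -> str:
        # Join the two summaries into a single prompt-sized context block.
        return (
            "## Planning Summary\n"
            f"{self.planning_summary}\n\n"
            "## Configuration Summary\n"
            f"{self.config_summary}"
        )

# planning_context = ContextIndex(planning_summary, config_summary).get_context_string()
# This string becomes the planning_context input of LogicAnalysisSignature.
```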
Input formats:
- Requirements (string): Natural language description of ML pipeline needs.
  Example: "Create a machine learning pipeline that loads CSV data, trains a classification model, evaluates performance, and provides CLI predictions"
- Structured Requirements (YAML): Output from the Task Dialogue Bot
Output artifacts:
- Generated Files: Complete Python codebase with proper structure
  - `main.py` / `app.py`: Entry point
  - `src/`: Source modules (data processing, training, evaluation)
  - `config.yaml`: Configuration file
  - `requirements.txt`: Dependencies
- Validation Results: Quality checks and warnings for each file
- Metadata: Planning documents, architecture diagrams, task breakdowns
This implementation follows the original Paper2Code methodology using DSPy signatures and modules:
Planning Agent - LLM Role: Strategic planner following DSPyCustomCoder's planning phase
- Input: User requirements text
- DSPy Signatures:
  - `OverallPlanningSignature`: Strategic planning for ML experiments
  - `ArchitectureDesignSignature`: System design with file structure
  - `TaskListGenerationSignature`: Task decomposition and dependencies
  - `ConfigurationGenerationSignature`: Configuration template creation
- Output: Structured plan with file list, architecture diagrams, and config
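The planning phase chains these signatures inside a DSPy module. Below is a hedged sketch with only two of the four signatures and abbreviated field lists; the repo's actual classes carry more fields:

```python
import dspy

class OverallPlanningSignature(dspy.Signature):
    """Produce a strategic plan for the requested ML experiment."""
    requirements: str = dspy.InputField()
    planning_response: str = dspy.OutputField()

class ArchitectureDesignSignature(dspy.Signature):
    """Design the system architecture from the planning context."""
    planning_context: str = dspy.InputField()
    implementation_approach: str = dspy.OutputField()
    file_list: list[str] = dspy.OutputField(desc="Files to generate, e.g. ['main.py', 'src/train.py']")

class PlanningAgent(dspy.Module):
    def __init__(self):
        super().__init__()
        self.plan = dspy.ChainOfThought(OverallPlanningSignature)
        self.design = dspy.ChainOfThought(ArchitectureDesignSignature)

    def forward(self, requirements: str) -> dspy.Prediction:
        # Each predictor's output feeds the next signature as accumulated context.
        plan = self.plan(requirements=requirements)
        design = self.design(planning_context=plan.planning_response)
        return dspy.Prediction(
            planning_response=plan.planning_response,
            implementation_approach=design.implementation_approach,
            file_list=design.file_list,
        )
```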
Logic Analysis Agent - LLM Role: Implementation analyzer following DSPyCustomCoder's analysis phase
- Input: Planning context + target filename
- DSPy Signature: `LogicAnalysisSignature`
- LLM Tasks:
- Analyze specific implementation logic for each file
- Define data structures and interfaces
- Plan function signatures and class hierarchies
- Output: Detailed implementation specifications per file
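The per-file analysis loop can be sketched as follows. The input fields match the LogicAnalysisSignature spec above, while the surrounding variables (`requirements`, `config_yaml`, `context_index`, `logic_analysis`) are assumed to come from the planning step:

```python
import dspy

class LogicAnalysisSignature(dspy.Signature):
    """Produce a detailed implementation analysis for one target file."""
    requirements: str = dspy.InputField()
    planning_context: str = dspy.InputField(desc="Condensed context from ContextIndex")
    config_yaml: str = dspy.InputField()
    target_file: str = dspy.InputField()
    logic_analysis_desc: str = dspy.InputField(desc="One-line description from planning")
    detailed_analysis: str = dspy.OutputField()

analyze = dspy.ChainOfThought(LogicAnalysisSignature)

detailed_analyses: dict[str, str] = {}
for filename, description in logic_analysis:  # logic_analysis: [filename, description] pairs
    result = analyze(
        requirements=requirements,
        planning_context=context_index.get_context_string(),
        config_yaml=config_yaml,
        target_file=filename,
        logic_analysis_desc=description,
    )
    detailed_analyses[filename] = result.detailed_analysis
```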
Code Generation Agent - LLM Role: Code generator following DSPyCustomCoder's generation phase
- Input: Requirements + planning context + logic analysis + previous files
- DSPy Signature: `CodeGenerationSignature`
- LLM Tasks:
- Generate complete Python code with type hints
- Follow Ruff linting standards
- Create clean code without exception handling
- Maintain consistency across files
- Output: Syntactically correct Python files
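A minimal sketch of how the code-generation signature could encode these constraints in its instructions; the field names are assumptions aligned with the inputs listed above:

```python
import dspy

class CodeGenerationSignature(dspy.Signature):
    """Generate one complete, Ruff-compliant Python file with full type hints.
    Do not add exception handling. Keep names and interfaces consistent with the
    previously generated files supplied as context."""

    requirements: str = dspy.InputField()
    planning_context: str = dspy.InputField()
    detailed_analysis: str = dspy.InputField(desc="Logic analysis for the target file")
    previous_files: str = dspy.InputField(desc="Concatenated files generated so far")
    target_file: str = dspy.InputField()
    generated_code: str = dspy.OutputField(desc="Full contents of the target file")

generate_code = dspy.ChainOfThought(CodeGenerationSignature)
# Files are generated in task_list order so each file can see the ones before it.
```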
Pydantic Validation - System Role: Quality assurance layer (DSPy enhancement)
- Content completeness checks using the `GeneratedFile` model
- Code block formatting validation
- Syntax and structure verification via the `CodeGenerationResult` model
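A hedged sketch of the kind of Pydantic models this validation layer might use; the model names come from the bullets above, while the specific checks shown are assumptions:

```python
import ast

from pydantic import BaseModel, ValidationInfo, field_validator

class GeneratedFile(BaseModel):
    filename: str
    content: str

    @field_validator("content")
    @classmethod
    def check_completeness(cls, value: str) -> str:
        # Completeness / formatting checks: no empty files, no leftover Markdown fences.
        if not value.strip():
            raise ValueError("generated file is empty")
        if value.lstrip().startswith("```"):
            raise ValueError("generated file still contains a Markdown code fence")
        return value

    @field_validator("content")
    @classmethod
    def check_syntax(cls, value: str, info: ValidationInfo) -> str:
        # Syntax verification for Python sources; other files (YAML, txt) are skipped.
        if info.data.get("filename", "").endswith(".py"):
            try:
                ast.parse(value)
            except SyntaxError as exc:
                raise ValueError(f"generated code has a syntax error: {exc}") from exc
        return value

class CodeGenerationResult(BaseModel):
    files: list[GeneratedFile]
    warnings: list[str] = []
```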
```mermaid
graph TD
A[User Requirements] --> B(Planning Agent);
B --> C{Planning Context, File List, Config};
C --> D(Loop: For Each File in File List);
D --> E(Logic Analysis Agent);
E --> F{Detailed Logic Analysis for Current File};
F --> G(Code Generation Agent);
G --> H{Generated Code for Current File};
H --> I[Accumulate Generated Files];
I --> J(End Loop);
J --> K(Pydantic Validation);
K --> L[Validated Codebase];
L --> M[Save Files to output/generated_code/];
```

Generated code is saved to output/generated_code/ with a complete ML pipeline ready for use.
```bash
# Install dependencies
uv sync
# Run the interactive Task Dialogue Bot
uv run python task_dialogue_bot.py
# Run the DSPy Custom Coder implementation
uv run python main.py
```

This implementation is based on the Paper2Code methodology. Please cite the original work when using this DSPy adaptation.