adding option to set task_type + tests#94
adding option to set task_type + tests#94tekrajchhetri merged 12 commits intosensein:improvementfrom
Conversation
…if it is an existing path; adding click.Paths to all options that should be existing texts
…into add_source_checks
…es the files that are passed with source argument; removing the text processing from StructSenseFlow and doing the processing before; updating cli.run_agent
… to api (but keep it in a separate function); adding the same arguments to StructSenseFlow as in cli: source and source_text
…ssed properly; removing src/tests from gitignore
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the flexibility and robustness of the system's input processing and task type identification. By introducing separate options for file-based and raw text inputs, and a prioritized mechanism for task type detection, the changes aim to provide clearer control and more reliable operation. The update also includes comprehensive testing and dependency management to ensure stability and maintainability. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Pull request overview
This PR improves task-type reliability by allowing an explicit task_type in task config (with fallback to LLM/heuristics), and refactors input handling to cleanly separate file-path input (source) from raw text input (source_text) across the Python API and CLI. It also adds tests and updates tutorials/docs accordingly.
Changes:
- Add
task_typeoverride intask_configand prefer it during task-type detection before falling back to LLM/heuristics. - Replace
input_source/process_input_datawithsource+source_textandprocess_file, including stricter file validation. - Add pytest coverage for source validation, CLI invalid source behavior, and task-type detection paths; update README/tutorial examples.
Reviewed changes
Copilot reviewed 12 out of 14 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/structsense/app.py |
Adds source/source_text handling and updates _get_detected_task_type to honor task_type in config before LLM/heuristics. |
src/utils/utils.py |
Introduces process_file() + _structured_data_to_text() for file-to-text conversion and stricter error handling. |
src/structsense/cli.py |
Adds --source_text, makes --source optional, and enforces mutual exclusivity; updates config/source option types. |
src/tests/task_detection_test.py |
New tests for config-based, heuristic, and (optional) LLM-based task-type detection. |
src/tests/structsense_flow_test.py |
New tests for source/source_text validation and initialization behavior. |
src/tests/cli_test.py |
New test asserting CLI fails cleanly when --source path doesn’t exist. |
src/tests/configs/ner-config_free.yaml |
Adds a test config fixture used by new test cases. |
README.md |
Updates CLI and programmatic usage examples to use source/source_text. |
tutorial/python-example/*.py |
Updates tutorial scripts to use source= instead of input_source= and applies formatting tweaks. |
.gitignore |
Stops ignoring src/tests so tests can be committed/run. |
poetry.lock |
Updates dependency lockfile (Poetry generator/version and resolved packages). |
src/tests/app_test.py |
Removes placeholder “assert 1 == 1” test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| async def kickoff( | ||
| agentconfig: Union[str, dict], | ||
| taskconfig: Union[str, dict], | ||
| embedderconfig: Union[str, dict], | ||
| input_source: Union[str, dict], | ||
| knowledgeconfig: Optional[Union[str, dict]] = None, | ||
| enable_human_feedback: bool = True, | ||
| agent_feedback_config: Optional[Dict[str, bool]] = None, | ||
| env_file: Optional[str] = None, | ||
| api_key: Optional[str] = None, | ||
| enable_chunking: bool = False, | ||
| chunk_size: Optional[int] = None, | ||
| max_workers: Optional[int] = None, | ||
| downstream_max_input_chars: Optional[int] = None, | ||
| max_extraction_chunk_chars: Optional[int] = None, | ||
| agentconfig: Union[str, dict], | ||
| taskconfig: Union[str, dict], | ||
| embedderconfig: Union[str, dict], | ||
| source: Optional[str] = None, | ||
| source_text: Optional[str] = None, | ||
| knowledgeconfig: Optional[Union[str, dict]] = None, | ||
| enable_human_feedback: bool = True, | ||
| agent_feedback_config: Optional[Dict[str, bool]] = None, | ||
| env_file: Optional[str] = None, | ||
| api_key: Optional[str] = None, | ||
| enable_chunking: bool = False, | ||
| chunk_size: Optional[int] = None, | ||
| max_workers: Optional[int] = None, | ||
| downstream_max_input_chars: Optional[int] = None, | ||
| max_extraction_chunk_chars: Optional[int] = None, |
| "--source", | ||
| required=True, | ||
| help=("The source—whether a file (text or PDF), a folder, or a text string."), | ||
| type=click.Path(exists=True), |
| @click.option( | ||
| "--source", | ||
| type=click.Path(exists=True), | ||
| help="Path to the file to process (PDF, CSV or TXT). Alternative to --source_text.", | ||
| ) | ||
| @click.option( | ||
| "--source_text", | ||
| type=str, | ||
| help="Text string to use as input directly. Alternative to --source.", | ||
| ) |
| |--------|--------------------------------------------------------------------------------------| | ||
| | `--config` | **(Required)** Path to YAML config (agent + task + embedder). | | ||
| | `--source` | Path to a PDF, CSV, or TXT file to process. Mutually exclusive with `--source_text`. | | ||
| | `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`. | |
| } | ||
| } | ||
| else: | ||
| task_config = BASE_TASK_CONFIG |
| assert task_type == expected_task_type | ||
|
|
||
|
|
||
| @pytest.mark.requires_openrouter |
| task_data = self.task_config.get(task_key, {}) | ||
| task_type = self.task_config.get(task_key, {}).get("task_type") | ||
| if task_type in DEFAULT_TAXONOMY: | ||
| logger.info(f"Using task type from agent config for agent '{agent_key}': {task_type}") |
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable option to explicitly set task_type in the configuration, improving model guidance and behavior. The refactoring of input handling to distinguish between source and source_text significantly clarifies the API for both CLI and programmatic use. The addition of unit tests for these new features is excellent and enhances the robustness of the codebase. My feedback includes a few minor suggestions for code simplification and a documentation fix.
| |--------|--------------------------------------------------------------------------------------| | ||
| | `--config` | **(Required)** Path to YAML config (agent + task + embedder). | | ||
| | `--source` | Path to a PDF, CSV, or TXT file to process. Mutually exclusive with `--source_text`. | | ||
| | `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`. | |
There was a problem hiding this comment.
There's a small typo in the description for the --source_text option. It seems like a placeholder _ was left in the text.
| | `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`. | | |
| | `--source_text` | Raw text string to use as input. Mutually exclusive with `--source`. | |
| task_data = self.task_config.get(task_key, {}) | ||
| task_type = self.task_config.get(task_key, {}).get("task_type") |
There was a problem hiding this comment.
The task_data variable is defined on line 1478 but not used on line 1479. This results in an unnecessary second dictionary lookup. You can simplify this by using task_data to get the task_type.
| task_data = self.task_config.get(task_key, {}) | |
| task_type = self.task_config.get(task_key, {}).get("task_type") | |
| task_data = self.task_config.get(task_key, {}) | |
| task_type = task_data.get("task_type") |
| @@ -51,14 +95,16 @@ def extract(config, api_key, source, env_file, save_file, chunk_size, max_worker | |||
| enable_human_feedback = bool(human_in_loop.get("humanfeedback_agent", False)) | |||
| if "ENABLE_HUMAN_FEEDBACK" in os.environ: | |||
| from utils.utils import str_to_bool | |||
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## improvement #94 +/- ##
==============================================
Coverage ? 14.68%
==============================================
Files ? 22
Lines ? 4964
Branches ? 0
==============================================
Hits ? 729
Misses ? 4235
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
contains #67 since I'm using "source text" for testing.
I created this PR to address #92
We could have an option to use llm model to identify task_type, but would like to have also an option to help the model, since we are having some problems.
I've also added 3 tests that checks if the task type is correctly identifyied in
app.py, based on eithertask_typeset in the config file, llm model, or keyword match with the description.