adding option to set task_type + tests #94

Merged
tekrajchhetri merged 12 commits into sensein:improvement from djarecka:task_type on Mar 18, 2026

@djarecka
Contributor

Contains #67, since I'm using "source text" for testing.

I created this PR to address #92.

We can keep the option of letting an LLM identify task_type, but I would also like an option to help the model, since we are having some problems.

I've also added 3 tests that check whether the task type is correctly identified in app.py, based on either the task_type set in the config file, the LLM, or a keyword match against the description.

…if it is an existing path; adding click.Paths to all options that should be existing texts
…es the files that are passed with source argument; removing the text processing from StructSenseFlow and doing the processing before; updating cli.run_agent
… to api (but keep it in a separate function); adding the same arguments to StructSenseFlow as in cli: source and source_text
…ssed properly; removing src/tests from gitignore
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility and robustness of the system's input processing and task type identification. By introducing separate options for file-based and raw text inputs, and a prioritized mechanism for task type detection, the changes aim to provide clearer control and more reliable operation. The update also includes comprehensive testing and dependency management to ensure stability and maintainability.

Highlights

  • Refactored Input Handling: The system's input mechanism has been significantly refactored. The previous input_source parameter has been replaced by two distinct and mutually exclusive options: source for file paths (PDF, CSV, TXT) and source_text for direct raw text input. This change improves clarity and flexibility in how data is provided to the StructSenseFlow.
  • Enhanced Task Type Detection: The logic for detecting the task type has been improved to offer more control and robustness. It now prioritizes an explicitly defined task_type within the configuration file. If not specified, it falls back to detection by an LLM, and finally to a heuristic based on keyword matching in the task description. This ensures more accurate and configurable task identification.
  • Updated Command-Line Interface (CLI): The structsense-cli extract and run_agent commands have been updated to reflect the new input handling. Users can now specify --source for file paths or --source_text for raw text, with validation to prevent using both simultaneously. This aligns the CLI with the core application's input changes.
  • Comprehensive Test Coverage: New test files have been added to validate the refactored input handling and the enhanced task type detection logic. These tests cover scenarios such as invalid source paths, mutual exclusivity of source inputs, and the different methods of task type identification (config, heuristic, LLM).
  • Dependency Updates: The poetry.lock file has undergone significant updates, reflecting changes in various package versions and dependencies. Notable changes include the addition of abnf, fastmcp, and srsly, and adjustments to click and litellm versions.
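The prioritized detection described in the highlights above can be sketched roughly as follows. This is a minimal illustration and not the code in app.py: `TAXONOMY`, `KEYWORDS`, `detect_task_type`, and the `llm_classify` callable are hypothetical names and values chosen for the example.

```python
from typing import Callable, Mapping, Optional

# Hypothetical taxonomy and keyword table (assumptions for this sketch only).
TAXONOMY = ("extraction", "alignment", "judgement")
KEYWORDS = {
    "extraction": ("extract", "identify entities"),
    "alignment": ("align", "ontology"),
    "judgement": ("judge", "evaluate"),
}


def detect_task_type(
    task: Mapping[str, str],
    llm_classify: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return a task type using config first, then an LLM, then keywords."""
    # 1. An explicit, valid task_type in the config wins.
    configured = task.get("task_type")
    if configured in TAXONOMY:
        return configured

    description = task.get("description", "")

    # 2. Otherwise ask the LLM, accepting only labels from the taxonomy.
    if llm_classify is not None:
        try:
            guess = llm_classify(description)
        except Exception:
            guess = None  # treat an LLM failure as "no answer"
        if guess in TAXONOMY:
            return guess

    # 3. Finally fall back to keyword matching on the description.
    lowered = description.lower()
    for task_type, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return task_type
    return None
```

Validating the LLM answer against the taxonomy means a malformed or unexpected label falls through to the keyword heuristic instead of propagating downstream.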

Changelog
  • .gitignore
    • Removed src/tests from ignored paths.
  • README.md
    • Updated CLI options for source input, replacing input_source with source and source_text.
    • Corrected a formatting issue.
  • poetry.lock
    • Updated Poetry version from 2.2.1 to 2.3.2.
    • Added abnf package.
    • Downgraded click from 8.3.1 to 8.1.8 and adjusted its Python version compatibility.
    • Added fastmcp package.
    • Updated grpcio from 1.71.0 to 1.76.0.
    • Updated jsonschema-specifications version format.
    • Updated certifi version format.
    • Downgraded litellm from 1.75.3 to 1.53.3 and modified its dependencies.
    • Removed mcp package.
    • Removed pytest-xdist package.
    • Removed pytube package.
    • Added srsly package.
    • Removed stack-data package.
    • Added new platform-specific wheels for torch.
    • Reordered markers for uvloop.
    • Updated the content hash.
  • src/structsense/app.py
    • Refactored input handling to use source for file paths and source_text for raw input, replacing input_source.
    • Updated task type detection logic to prioritize explicit configuration, then LLM inference, then keyword matching.
    • Replaced process_input_data with process_file and _structured_data_to_text for clearer file processing.
    • Made minor formatting adjustments.
  • src/structsense/cli.py
    • Modified CLI commands (extract, run_agent) to support distinct --source (file path) and --source_text (raw text) options.
    • Ensured mutual exclusivity for --source and --source_text in CLI arguments.
    • Updated StructSenseFlow initialization in CLI to correctly pass source and source_text.
  • src/tests/app_test.py
    • Deleted the placeholder test file.
  • src/tests/cli_test.py
    • Added new tests for the CLI extract command, specifically for invalid source paths.
  • src/tests/configs/ner-config_free.yaml
    • Added a new YAML configuration file for neuroscience NER, defining agent and task configurations.
  • src/tests/structsense_flow_test.py
    • Added new tests for StructSenseFlow initialization, covering invalid source paths, mutual exclusivity of source inputs, and correct handling of source_text and source file inputs.
  • src/tests/task_detection_test.py
    • Added new tests for task type detection logic within StructSenseFlow, verifying behavior when task_type is set in config, inferred by heuristic, or detected by an LLM.
  • src/utils/utils.py
    • Refactored input processing by replacing process_input_data with process_file and _structured_data_to_text.
    • Enhanced file type handling in process_file for PDF, CSV, and TXT, and improved error management.
    • Added _structured_data_to_text to convert structured data from file processors into plain text.
    • Made minor formatting changes.
  • tutorial/python-example/run_structsense_full_pipeline.py
    • Updated the example script to use the new source parameter instead of input_source.
  • tutorial/python-example/run_structsense_partial_extraction_only.py
    • Updated the example script to use the new source parameter instead of input_source.
  • tutorial/python-example/run_structsense_partial_extraction_only_without_key.py
    • Updated the example script to use the new source parameter instead of input_source.
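As a rough illustration of the file-to-text step listed in the changelog above, the sketch below dispatches on the file extension and flattens structured data into plain text. It is not the PR's `process_file` or `_structured_data_to_text`: the names `file_to_text` and `flatten_to_text` are hypothetical, the output format is an assumption, and PDF handling is deliberately left out because it depends on an external parser.

```python
import csv
from pathlib import Path
from typing import Any


def flatten_to_text(data: Any) -> str:
    """Render nested dicts and lists as newline-separated plain text."""
    if isinstance(data, dict):
        return "\n".join(
            f"{key}: {flatten_to_text(value)}" for key, value in data.items()
        )
    if isinstance(data, (list, tuple)):
        return "\n".join(flatten_to_text(item) for item in data)
    return str(data)


def file_to_text(path: str) -> str:
    """Convert a TXT or CSV file to plain text (hypothetical helper)."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Not a file: {path}")

    suffix = file_path.suffix.lower()
    if suffix == ".txt":
        return file_path.read_text(encoding="utf-8")
    if suffix == ".csv":
        # Each CSV row becomes a dict keyed by the header row.
        with file_path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        return flatten_to_text(rows)
    if suffix == ".pdf":
        # Assumption: PDF extraction is delegated to an external parser.
        raise NotImplementedError("PDF extraction needs an external parser")
    raise ValueError(f"Unsupported file type: {suffix}")
```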
Activity
  • The author, djarecka, created this pull request to address issue #92 ("Inconsistent task type"), which likely relates to improving task type identification.
  • The changes also incorporate aspects of #67 ("adding source checks and source_text option"), specifically regarding the use of 'source text' for testing purposes.
  • Three new tests have been added to verify the correct identification of task_type in app.py, covering cases where task_type is set in the config, determined by an LLM, or matched via keywords in the description.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


Copilot AI left a comment

Pull request overview

This PR improves task-type reliability by allowing an explicit task_type in task config (with fallback to LLM/heuristics), and refactors input handling to cleanly separate file-path input (source) from raw text input (source_text) across the Python API and CLI. It also adds tests and updates tutorials/docs accordingly.

Changes:

  • Add task_type override in task_config and prefer it during task-type detection before falling back to LLM/heuristics.
  • Replace input_source/process_input_data with source + source_text and process_file, including stricter file validation.
  • Add pytest coverage for source validation, CLI invalid source behavior, and task-type detection paths; update README/tutorial examples.
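The separation of `source` and `source_text` summarized above could be validated along these lines. This is a hedged sketch, not the PR's implementation: `resolve_input` and `SUPPORTED_SUFFIXES` are hypothetical names, and treating "neither argument given" as an error is an assumption (the PR only states that the two options are mutually exclusive).

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumption: the supported file types are the ones named in the PR summary.
SUPPORTED_SUFFIXES = {".pdf", ".csv", ".txt"}


def resolve_input(
    source: Optional[str] = None,
    source_text: Optional[str] = None,
) -> Tuple[str, str]:
    """Return ("file", path) or ("text", raw_text) after validation."""
    if source is not None and source_text is not None:
        raise ValueError("Pass either source or source_text, not both.")
    if source is None and source_text is None:
        raise ValueError("One of source or source_text is required.")

    if source_text is not None:
        return ("text", source_text)

    path = Path(source)
    if not path.is_file():
        raise FileNotFoundError(f"Source file does not exist: {source}")
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return ("file", str(path))
```

Doing this check once, before any processing starts, lets the CLI and the Python API share the same error behavior.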

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| src/structsense/app.py | Adds source/source_text handling and updates _get_detected_task_type to honor task_type in config before LLM/heuristics. |
| src/utils/utils.py | Introduces process_file() + _structured_data_to_text() for file-to-text conversion and stricter error handling. |
| src/structsense/cli.py | Adds --source_text, makes --source optional, and enforces mutual exclusivity; updates config/source option types. |
| src/tests/task_detection_test.py | New tests for config-based, heuristic, and (optional) LLM-based task-type detection. |
| src/tests/structsense_flow_test.py | New tests for source/source_text validation and initialization behavior. |
| src/tests/cli_test.py | New test asserting CLI fails cleanly when --source path doesn’t exist. |
| src/tests/configs/ner-config_free.yaml | Adds a test config fixture used by new test cases. |
| README.md | Updates CLI and programmatic usage examples to use source/source_text. |
| tutorial/python-example/*.py | Updates tutorial scripts to use source= instead of input_source= and applies formatting tweaks. |
| .gitignore | Stops ignoring src/tests so tests can be committed/run. |
| poetry.lock | Updates dependency lockfile (Poetry generator/version and resolved packages). |
| src/tests/app_test.py | Removes placeholder “assert 1 == 1” test. |

Comment on lines 1537 to +1552
  async def kickoff(
-     agentconfig: Union[str, dict],
-     taskconfig: Union[str, dict],
-     embedderconfig: Union[str, dict],
-     input_source: Union[str, dict],
-     knowledgeconfig: Optional[Union[str, dict]] = None,
-     enable_human_feedback: bool = True,
-     agent_feedback_config: Optional[Dict[str, bool]] = None,
-     env_file: Optional[str] = None,
-     api_key: Optional[str] = None,
-     enable_chunking: bool = False,
-     chunk_size: Optional[int] = None,
-     max_workers: Optional[int] = None,
-     downstream_max_input_chars: Optional[int] = None,
-     max_extraction_chunk_chars: Optional[int] = None,
+     agentconfig: Union[str, dict],
+     taskconfig: Union[str, dict],
+     embedderconfig: Union[str, dict],
+     source: Optional[str] = None,
+     source_text: Optional[str] = None,
+     knowledgeconfig: Optional[Union[str, dict]] = None,
+     enable_human_feedback: bool = True,
+     agent_feedback_config: Optional[Dict[str, bool]] = None,
+     env_file: Optional[str] = None,
+     api_key: Optional[str] = None,
+     enable_chunking: bool = False,
+     chunk_size: Optional[int] = None,
+     max_workers: Optional[int] = None,
+     downstream_max_input_chars: Optional[int] = None,
+     max_extraction_chunk_chars: Optional[int] = None,
"--source",
required=True,
help=("The source—whether a file (text or PDF), a folder, or a text string."),
type=click.Path(exists=True),
Comment on lines +140 to +149
@click.option(
"--source",
type=click.Path(exists=True),
help="Path to the file to process (PDF, CSV or TXT). Alternative to --source_text.",
)
@click.option(
"--source_text",
type=str,
help="Text string to use as input directly. Alternative to --source.",
)
|--------|--------------------------------------------------------------------------------------|
| `--config` | **(Required)** Path to YAML config (agent + task + embedder). |
| `--source` | Path to a PDF, CSV, or TXT file to process. Mutually exclusive with `--source_text`. |
| `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`. |
}
}
else:
task_config = BASE_TASK_CONFIG
assert task_type == expected_task_type


@pytest.mark.requires_openrouter
task_data = self.task_config.get(task_key, {})
task_type = self.task_config.get(task_key, {}).get("task_type")
if task_type in DEFAULT_TAXONOMY:
logger.info(f"Using task type from agent config for agent '{agent_key}': {task_type}")

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable option to explicitly set task_type in the configuration, improving model guidance and behavior. The refactoring of input handling to distinguish between source and source_text significantly clarifies the API for both CLI and programmatic use. The addition of unit tests for these new features is excellent and enhances the robustness of the codebase. My feedback includes a few minor suggestions for code simplification and a documentation fix.

|--------|--------------------------------------------------------------------------------------|
| `--config` | **(Required)** Path to YAML config (agent + task + embedder). |
| `--source` | Path to a PDF, CSV, or TXT file to process. Mutually exclusive with `--source_text`. |
| `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`. |

medium

There's a small typo in the description for the --source_text option. It seems like a placeholder _ was left in the text.

Suggested change
- | `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`. |
+ | `--source_text` | Raw text string to use as input. Mutually exclusive with `--source`. |

Comment on lines +1478 to +1479
task_data = self.task_config.get(task_key, {})
task_type = self.task_config.get(task_key, {}).get("task_type")

medium

The task_data variable is defined on line 1478 but not used on line 1479. This results in an unnecessary second dictionary lookup. You can simplify this by using task_data to get the task_type.

Suggested change
  task_data = self.task_config.get(task_key, {})
- task_type = self.task_config.get(task_key, {}).get("task_type")
+ task_type = task_data.get("task_type")

@@ -51,14 +95,16 @@ def extract(config, api_key, source, env_file, save_file, chunk_size, max_worker
enable_human_feedback = bool(human_in_loop.get("humanfeedback_agent", False))
if "ENABLE_HUMAN_FEEDBACK" in os.environ:
from utils.utils import str_to_bool

medium

According to PEP 8, imports should usually be at the top of the file. Please move this import to the top of cli.py (e.g., with the other imports from utils.utils) to improve readability and avoid local imports.

@codecov-commenter

Codecov Report

❌ Patch coverage is 61.30653% with 77 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (improvement@fc11999). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/utils/utils.py | 23.80% | 48 Missing ⚠️ |
| src/structsense/app.py | 46.66% | 16 Missing ⚠️ |
| src/structsense/cli.py | 64.00% | 9 Missing ⚠️ |
| src/tests/task_detection_test.py | 89.47% | 4 Missing ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             improvement      #94   +/-   ##
==============================================
  Coverage               ?   14.68%           
==============================================
  Files                  ?       22           
  Lines                  ?     4964           
  Branches               ?        0           
==============================================
  Hits                   ?      729           
  Misses                 ?     4235           
  Partials               ?        0           
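The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the report. Truncating (not rounding) to two decimals is an assumption about the display format that happens to reproduce the shown 14.68%, and the patch size of 199 lines is inferred from the reported percentage and does not appear in the report itself.

```python
import math


def truncated_percent(hits: int, total: int) -> float:
    """Percentage truncated (not rounded) to two decimal places."""
    return math.floor(hits / total * 10_000) / 100


# Project coverage: hits and misses from the coverage diff table.
hits, misses = 729, 4235
lines = hits + misses                       # 4964, matching "Lines"
project = truncated_percent(hits, lines)    # 14.68

# Patch coverage: missing lines per file from the table above.
missing_by_file = {
    "src/utils/utils.py": 48,
    "src/structsense/app.py": 16,
    "src/structsense/cli.py": 9,
    "src/tests/task_detection_test.py": 4,
}
patch_missing = sum(missing_by_file.values())       # 77, as reported

# Inferred: total patch lines implied by 61.30653% coverage with 77 missing.
patch_lines = round(patch_missing / (1 - 0.6130653))
patch_coverage = (patch_lines - patch_missing) / patch_lines
```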

@tekrajchhetri tekrajchhetri mentioned this pull request Mar 18, 2026
@tekrajchhetri tekrajchhetri merged commit 8e15b63 into sensein:improvement Mar 18, 2026
7 of 8 checks passed
4 participants