adding option to set task_type + tests by djarecka · Pull Request #94 · sensein/structsense

djarecka · 2026-03-17T22:05:51Z

This PR contains #67, since I'm using "source text" for testing.

I created this PR to address #92.

We could have an option to use an LLM to identify task_type, but I would also like an option to help the model, since we are having some problems.

I've also added 3 tests that check whether the task type is correctly identified in app.py, based on either the task_type set in the config file, the LLM, or a keyword match with the description.

gemini-code-assist · 2026-03-17T22:06:23Z

Summary of Changes

This pull request significantly enhances the flexibility and robustness of the system's input processing and task type identification. By introducing separate options for file-based and raw text inputs, and a prioritized mechanism for task type detection, the changes aim to provide clearer control and more reliable operation. The update also includes comprehensive testing and dependency management to ensure stability and maintainability.

Highlights

- Refactored Input Handling: The system's input mechanism has been significantly refactored. The previous input_source parameter has been replaced by two distinct and mutually exclusive options: source for file paths (PDF, CSV, TXT) and source_text for direct raw text input. This change improves clarity and flexibility in how data is provided to the StructSenseFlow.
- Enhanced Task Type Detection: The logic for detecting the task type has been improved to offer more control and robustness. It now prioritizes an explicitly defined task_type within the configuration file. If not specified, it falls back to detection by an LLM, and finally to a heuristic based on keyword matching in the task description. This ensures more accurate and configurable task identification.
- Updated Command-Line Interface (CLI): The structsense-cli extract and run_agent commands have been updated to reflect the new input handling. Users can now specify --source for file paths or --source_text for raw text, with validation to prevent using both simultaneously. This aligns the CLI with the core application's input changes.
- Comprehensive Test Coverage: New test files have been added to validate the refactored input handling and the enhanced task type detection logic. These tests cover scenarios such as invalid source paths, mutual exclusivity of source inputs, and the different methods of task type identification (config, heuristic, LLM).
- Dependency Updates: The poetry.lock file has undergone significant updates, reflecting changes in various package versions and dependencies. Notable changes include the addition of abnf, fastmcp, and srsly, and adjustments to click and litellm versions.
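
The precedence described under "Enhanced Task Type Detection" can be sketched as a small function. This is only an illustration under assumptions: `detect_task_type`, `TAXONOMY`, `KEYWORDS`, and the `llm_detect` callable are hypothetical names, and the labels are placeholders rather than the project's actual `DEFAULT_TAXONOMY`.

```python
from typing import Callable, Optional

# Hypothetical labels and keywords; the real taxonomy in app.py may differ.
TAXONOMY = {"ner", "extraction", "classification"}
KEYWORDS = {
    "ner": ("entity", "entities"),
    "classification": ("classify", "label"),
}


def detect_task_type(
    task_data: dict,
    llm_detect: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Resolve a task type: explicit config first, then LLM, then keywords."""
    # 1. An explicit, valid task_type in the config wins outright.
    explicit = task_data.get("task_type")
    if explicit in TAXONOMY:
        return explicit

    description = task_data.get("description", "")

    # 2. Otherwise ask the LLM (if one is available) and accept only known labels.
    if llm_detect is not None:
        try:
            guess = llm_detect(description)
        except Exception:
            guess = None  # treat any LLM failure as "no answer"
        if guess in TAXONOMY:
            return guess

    # 3. Finally fall back to keyword matching on the description.
    lowered = description.lower()
    for task_type, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return task_type
    return None
```

Accepting the LLM's answer only when it is a known label keeps an unexpected reply from bypassing the keyword fallback.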

Changelog

.gitignore
- Removed src/tests from ignored paths.
README.md
- Updated CLI options for source input, replacing input_source with source and source_text.
- Corrected a formatting issue.
poetry.lock
- Updated Poetry version from 2.2.1 to 2.3.2.
- Added abnf package.
- Downgraded click from 8.3.1 to 8.1.8 and adjusted its Python version compatibility.
- Added fastmcp package.
- Updated grpcio from 1.71.0 to 1.76.0.
- Updated jsonschema-specifications version format.
- Updated certifi version format.
- Downgraded litellm from 1.75.3 to 1.53.3 and modified its dependencies.
- Removed mcp package.
- Removed pytest-xdist package.
- Removed pytube package.
- Added srsly package.
- Removed stack-data package.
- Added new platform-specific wheels for torch.
- Reordered markers for uvloop.
- Updated the content hash.
src/structsense/app.py
- Refactored input handling to use source for file paths and source_text for raw input, replacing input_source.
- Updated task type detection logic to prioritize explicit configuration, then LLM inference, then keyword matching.
- Replaced process_input_data with process_file and _structured_data_to_text for clearer file processing.
- Made minor formatting adjustments.
src/structsense/cli.py
- Modified CLI commands (extract, run_agent) to support distinct --source (file path) and --source_text (raw text) options.
- Ensured mutual exclusivity for --source and --source_text in CLI arguments.
- Updated StructSenseFlow initialization in CLI to correctly pass source and source_text.
src/tests/app_test.py
- Deleted the placeholder test file.
src/tests/cli_test.py
- Added new tests for the CLI extract command, specifically for invalid source paths.
src/tests/configs/ner-config_free.yaml
- Added a new YAML configuration file for neuroscience NER, defining agent and task configurations.
src/tests/structsense_flow_test.py
- Added new tests for StructSenseFlow initialization, covering invalid source paths, mutual exclusivity of source inputs, and correct handling of source_text and source file inputs.
src/tests/task_detection_test.py
- Added new tests for task type detection logic within StructSenseFlow, verifying behavior when task_type is set in config, inferred by heuristic, or detected by an LLM.
src/utils/utils.py
- Refactored input processing by replacing process_input_data with process_file and _structured_data_to_text.
- Enhanced file type handling in process_file for PDF, CSV, and TXT, and improved error management.
- Added _structured_data_to_text to convert structured data from file processors into plain text.
- Made minor formatting changes.
tutorial/python-example/run_structsense_full_pipeline.py
- Updated the example script to use the new source parameter instead of input_source.
tutorial/python-example/run_structsense_partial_extraction_only.py
- Updated the example script to use the new source parameter instead of input_source.
tutorial/python-example/run_structsense_partial_extraction_only_without_key.py
- Updated the example script to use the new source parameter instead of input_source.
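
The split between file processing and structured-data flattening listed for src/utils/utils.py might look roughly like the sketch below. It is not the PR's code: `file_to_text` and `structured_data_to_text` are hypothetical stand-ins for `process_file` and `_structured_data_to_text`, the "key: value" output format is an assumption, and PDF handling is left out because it needs a third-party parser.

```python
import csv
from pathlib import Path
from typing import Dict, List


def structured_data_to_text(rows: List[Dict[str, str]]) -> str:
    """Flatten rows into 'key: value' lines, with a blank line between rows."""
    return "\n\n".join(
        "\n".join(f"{key}: {value}" for key, value in row.items()) for row in rows
    )


def file_to_text(path: str) -> str:
    """Convert a TXT or CSV file to plain text; raise on anything else."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"source is not an existing file: {path}")

    suffix = file_path.suffix.lower()
    if suffix == ".txt":
        return file_path.read_text(encoding="utf-8")
    if suffix == ".csv":
        with file_path.open(newline="", encoding="utf-8") as handle:
            return structured_data_to_text(list(csv.DictReader(handle)))
    # A PDF branch would go here in a fuller version.
    raise ValueError(f"Unsupported file type: {suffix}")
```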

Activity

The author, djarecka, created this pull request to address issue #92 ("Inconsistent task type"), which likely relates to improving task type identification.
The changes also incorporate #67 ("adding source checks and source_text option"), specifically regarding the use of 'source text' for testing purposes.
Three new tests have been added to verify the correct identification of task_type in app.py, covering cases where task_type is set in the config, determined by an LLM, or matched via keywords in the description.

Copilot

Pull request overview

This PR improves task-type reliability by allowing an explicit task_type in task config (with fallback to LLM/heuristics), and refactors input handling to cleanly separate file-path input (source) from raw text input (source_text) across the Python API and CLI. It also adds tests and updates tutorials/docs accordingly.

Changes:

- Add task_type override in task_config and prefer it during task-type detection before falling back to LLM/heuristics.
- Replace input_source/process_input_data with source + source_text and process_file, including stricter file validation.
- Add pytest coverage for source validation, CLI invalid source behavior, and task-type detection paths; update README/tutorial examples.
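
The `source` / `source_text` contract summarized above (exactly one of the two, and `source` must be an existing file) can be illustrated with a small validator. `resolve_input` is a hypothetical helper written for this sketch; the checks in `StructSenseFlow` may be structured differently and may raise different exception types.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_input(
    source: Optional[str] = None,
    source_text: Optional[str] = None,
) -> Tuple[str, str]:
    """Return ("file", path) or ("text", text) after validating the arguments."""
    # The two inputs are mutually exclusive.
    if source is not None and source_text is not None:
        raise ValueError("Provide either source or source_text, not both.")
    if source is None and source_text is None:
        raise ValueError("Provide one of source or source_text.")

    if source is not None:
        path = Path(source)
        # Fail early instead of treating a bad path as raw text.
        if not path.is_file():
            raise FileNotFoundError(f"source is not an existing file: {source}")
        return ("file", str(path))
    return ("text", source_text)
```

Failing on a missing path, rather than silently falling back to treating the string as text, is the behavior that separating the two options makes possible.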

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| `src/structsense/app.py` | Adds `source`/`source_text` handling and updates `_get_detected_task_type` to honor `task_type` in config before LLM/heuristics. |
| `src/utils/utils.py` | Introduces `process_file()` + `_structured_data_to_text()` for file-to-text conversion and stricter error handling. |
| `src/structsense/cli.py` | Adds `--source_text`, makes `--source` optional, and enforces mutual exclusivity; updates config/source option types. |
| `src/tests/task_detection_test.py` | New tests for config-based, heuristic, and (optional) LLM-based task-type detection. |
| `src/tests/structsense_flow_test.py` | New tests for source/source_text validation and initialization behavior. |
| `src/tests/cli_test.py` | New test asserting CLI fails cleanly when `--source` path doesn’t exist. |
| `src/tests/configs/ner-config_free.yaml` | Adds a test config fixture used by new test cases. |
| `README.md` | Updates CLI and programmatic usage examples to use `source`/`source_text`. |
| `tutorial/python-example/*.py` | Updates tutorial scripts to use `source=` instead of `input_source=` and applies formatting tweaks. |
| `.gitignore` | Stops ignoring `src/tests` so tests can be committed/run. |
| `poetry.lock` | Updates dependency lockfile (Poetry generator/version and resolved packages). |
| `src/tests/app_test.py` | Removes placeholder “assert 1 == 1” test. |

src/structsense/app.py

 async def kickoff(
-        agentconfig: Union[str, dict],
-        taskconfig: Union[str, dict],
-        embedderconfig: Union[str, dict],
-        input_source: Union[str, dict],
-        knowledgeconfig: Optional[Union[str, dict]] = None,
-        enable_human_feedback: bool = True,
-        agent_feedback_config: Optional[Dict[str, bool]] = None,
-        env_file: Optional[str] = None,
-        api_key: Optional[str] = None,
-        enable_chunking: bool = False,
-        chunk_size: Optional[int] = None,
-        max_workers: Optional[int] = None,
-        downstream_max_input_chars: Optional[int] = None,
-        max_extraction_chunk_chars: Optional[int] = None,
+    agentconfig: Union[str, dict],
+    taskconfig: Union[str, dict],
+    embedderconfig: Union[str, dict],
+    source: Optional[str] = None,
+    source_text: Optional[str] = None,
+    knowledgeconfig: Optional[Union[str, dict]] = None,
+    enable_human_feedback: bool = True,
+    agent_feedback_config: Optional[Dict[str, bool]] = None,
+    env_file: Optional[str] = None,
+    api_key: Optional[str] = None,
+    enable_chunking: bool = False,
+    chunk_size: Optional[int] = None,
+    max_workers: Optional[int] = None,
+    downstream_max_input_chars: Optional[int] = None,
+    max_extraction_chunk_chars: Optional[int] = None,


src/structsense/cli.py

    "--source",
-    required=True,
-    help=("The source—whether a file (text or PDF), a folder, or a text string."),
+    type=click.Path(exists=True),


src/structsense/cli.py

+@click.option(
+    "--source",
+    type=click.Path(exists=True),
+    help="Path to the file to process (PDF, CSV or TXT). Alternative to --source_text.",
+)
+@click.option(
+    "--source_text",
+    type=str,
+    help="Text string to use as input directly. Alternative to --source.",
+)


README.md

+|--------|--------------------------------------------------------------------------------------|
+| `--config` | **(Required)** Path to YAML config (agent + task + embedder).                        |
+| `--source` | Path to a PDF, CSV, or TXT file to process. Mutually exclusive with `--source_text`. |
+| `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`.                     |


src/tests/task_detection_test.py

+            }
+        }
+    else:
+        task_config = BASE_TASK_CONFIG


src/tests/task_detection_test.py

+    assert task_type == expected_task_type
+
+
+@pytest.mark.requires_openrouter


src/structsense/app.py

+        task_data = self.task_config.get(task_key, {})
+        task_type = self.task_config.get(task_key, {}).get("task_type")
+        if task_type in DEFAULT_TAXONOMY:
+            logger.info(f"Using task type from agent config for agent '{agent_key}': {task_type}")


gemini-code-assist

Code Review

This pull request introduces a valuable option to explicitly set task_type in the configuration, improving model guidance and behavior. The refactoring of input handling to distinguish between source and source_text significantly clarifies the API for both CLI and programmatic use. The addition of unit tests for these new features is excellent and enhances the robustness of the codebase. My feedback includes a few minor suggestions for code simplification and a documentation fix.

gemini-code-assist · 2026-03-17T22:13:12Z

README.md

+|--------|--------------------------------------------------------------------------------------|
+| `--config` | **(Required)** Path to YAML config (agent + task + embedder).                        |
+| `--source` | Path to a PDF, CSV, or TXT file to process. Mutually exclusive with `--source_text`. |
+| `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`.                     |


There's a small typo in the description for the --source_text option. It seems like a placeholder _ was left in the text.

Suggested change

-| `--source_text` | Raw text string to use as _. Mutually exclusive with `--source`. |
+| `--source_text` | Raw text string to use as input. Mutually exclusive with `--source`. |

gemini-code-assist · 2026-03-17T22:13:12Z

src/structsense/app.py

+        task_data = self.task_config.get(task_key, {})
+        task_type = self.task_config.get(task_key, {}).get("task_type")


The task_data variable is defined on line 1478 but not used on line 1479. This results in an unnecessary second dictionary lookup. You can simplify this by using task_data to get the task_type.

Suggested change

-        task_data = self.task_config.get(task_key, {})
-        task_type = self.task_config.get(task_key, {}).get("task_type")
+        task_data = self.task_config.get(task_key, {})
+        task_type = task_data.get("task_type")
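
As a standalone illustration of the single-lookup form, here is a minimal sketch. `configured_task_type` is a hypothetical function, and the taxonomy values are placeholders. It also uses `or {}` instead of a `.get` default, an addition of mine that covers a config entry that exists but is empty (`None`), which a `.get(task_key, {})` default alone would not.

```python
from typing import Optional

# Placeholder values; the project's DEFAULT_TAXONOMY is not shown in this PR excerpt.
DEFAULT_TAXONOMY = {"ner", "extraction"}


def configured_task_type(task_config: dict, task_key: str) -> Optional[str]:
    """Return the explicitly configured task type, or None if absent or unknown."""
    # One lookup, reused below; `or {}` also handles a key whose value is None.
    task_data = task_config.get(task_key) or {}
    task_type = task_data.get("task_type")
    return task_type if task_type in DEFAULT_TAXONOMY else None
```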

gemini-code-assist · 2026-03-17T22:13:12Z

src/structsense/cli.py

@@ -51,14 +95,16 @@ def extract(config, api_key, source, env_file, save_file, chunk_size, max_worker
    enable_human_feedback = bool(human_in_loop.get("humanfeedback_agent", False))
    if "ENABLE_HUMAN_FEEDBACK" in os.environ:
        from utils.utils import str_to_bool


According to PEP 8, imports should usually be at the top of the file. Please move this import to the top of cli.py (e.g., with the other imports from utils.utils) to improve readability and avoid local imports.
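
A sketch of the same logic with the helper defined at module level rather than imported inside the `if` block. The body of `str_to_bool` here is a hypothetical stand-in (the real `utils.utils.str_to_bool` is not shown in the diff), and the assumption that the environment variable overrides the config value is inferred from the hunk, not confirmed by it.

```python
import os
from typing import Mapping


def str_to_bool(value: str) -> bool:
    """Hypothetical stand-in for utils.utils.str_to_bool."""
    return value.strip().lower() in {"1", "true", "yes", "y", "on"}


def resolve_human_feedback(
    human_in_loop: dict,
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Config value first; ENABLE_HUMAN_FEEDBACK in the environment overrides it."""
    enabled = bool(human_in_loop.get("humanfeedback_agent", False))
    if "ENABLE_HUMAN_FEEDBACK" in environ:
        enabled = str_to_bool(environ["ENABLE_HUMAN_FEEDBACK"])
    return enabled
```

Passing `environ` as a parameter keeps the function testable without modifying the process environment.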

codecov-commenter · 2026-03-17T22:16:03Z

Codecov Report

❌ Patch coverage is 61.30653% with 77 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (improvement@fc11999). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/utils/utils.py | 23.80% | 48 Missing ⚠️ |
| src/structsense/app.py | 46.66% | 16 Missing ⚠️ |
| src/structsense/cli.py | 64.00% | 9 Missing ⚠️ |
| src/tests/task_detection_test.py | 89.47% | 4 Missing ⚠️ |

Additional details and impacted files

@@              Coverage Diff               @@
##             improvement      #94   +/-   ##
==============================================
  Coverage               ?   14.68%           
==============================================
  Files                  ?       22           
  Lines                  ?     4964           
  Branches               ?        0           
==============================================
  Hits                   ?      729           
  Misses                 ?     4235           
  Partials               ?        0
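
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch assumes that patch coverage is hits / (hits + misses) with no partial lines (the report shows 0 partials); the per-file totals it prints are back-calculated from the percentages and are not stated anywhere in the report itself.

```python
def patch_total(missing: int, percent: float) -> int:
    """Infer the number of changed lines from the missing count and coverage %."""
    return round(missing / (1 - percent / 100))


# (patch %, missing lines) as listed in the report table.
report = {
    "src/utils/utils.py": (23.80, 48),
    "src/structsense/app.py": (46.66, 16),
    "src/structsense/cli.py": (64.00, 9),
    "src/tests/task_detection_test.py": (89.47, 4),
}

inferred = {}
for path, (percent, missing) in report.items():
    total = patch_total(missing, percent)
    inferred[path] = (total - missing, total)  # (covered, changed)
    print(f"{path}: {total - missing}/{total} changed lines covered")

# Whole patch: 61.30653% with 77 lines missing.
patch_lines = patch_total(77, 61.30653)
listed_lines = sum(total for _, total in inferred.values())
missing_lines = sum(missing for _, missing in report.values())

# Project coverage from the diff block: hits / (hits + misses).
project_percent = 729 / (729 + 4235) * 100

print(f"patch: {patch_lines - 77}/{patch_lines} lines covered")
print(f"changed lines in files not listed (fully covered): {patch_lines - listed_lines}")
print(f"project coverage: {project_percent:.4f}%")
```

Under these assumptions the four listed files account for all 77 missing lines, and the remaining changed lines sit in files with no missing coverage.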

djarecka added 12 commits February 20, 2026 09:58

[wip] adding source_text option for text and adding check for source if it is an existing path; adding click.Paths to all options that should be existing texts

33927b3

Merge branch 'improvement' of https://github.com/sensein/structsense into add_source_checks

d659a70

changing process_input_source to process_files, since it only processes the files that are passed with source argument; removing the text processing from StructSenseFlow and doing the processing before; updating cli.run_agent

f5ee673

finishing changes to cli.run_agent

edb5fee

moving back processing to the StructSenseFlow to minimize the changes to api (but keep it in a separate function); adding the same arguments to StructSenseFlow as in cli: source and source_text

0f335c9

updating docs and tutorial

36ae0a4

adding simple tests to check that the source and source_text is processed properly; removing src/tests from gitignore

3eaaafd

update poetry.lock

65f7ec0

add option to set the task_type in the config, add tests

c886efb

removing conflict

25395be

updating the Readme (since the tutorial readme has been removed in the meantime)

7c16448

Merge branch 'add_source_checks' into task_type

ba45f29

djarecka requested review from Copilot and puja-trivedi March 17, 2026 22:06

Copilot started reviewing on behalf of djarecka March 17, 2026 22:06

Copilot AI reviewed Mar 17, 2026

gemini-code-assist bot reviewed Mar 17, 2026

tekrajchhetri mentioned this pull request Mar 18, 2026

Add NER tests #79

Merged

tekrajchhetri merged commit 8e15b63 into sensein:improvement Mar 18, 2026
7 of 8 checks passed

tekrajchhetri merged 12 commits into sensein:improvement from djarecka:task_type
