feat: add split_by_row feature to CSVDocumentSplitter #9031

Merged
Amnah199 merged 13 commits into main from add-split-by-row-csv-splitter
Mar 19, 2025

Conversation

Amnah199
Contributor

@Amnah199 Amnah199 commented Mar 12, 2025

Related Issues

Proposed Changes:

Add a new parameter split_by_row to CSVDocumentSplitter. When split_by_row=True, the other split arguments are ignored.

How did you test it?

Added new unit tests

Notes for the reviewer

If this is merged, we need to update the documentation of this component slightly.

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@github-actions github-actions bot added the type:documentation (Improvements on the docs) and topic:tests labels Mar 12, 2025
@coveralls
Collaborator

coveralls commented Mar 12, 2025

Pull Request Test Coverage Report for Build 13944618786

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.002%) to 90.068%

Files with Coverage Reduction                        New Missed Lines    %
components/preprocessors/csv_document_splitter.py    4                   96.26%

Totals
Change from base Build 13920793695:    -0.002%
Covered Lines:                         9740
Relevant Lines:                        10814

💛 - Coveralls

@Amnah199 Amnah199 marked this pull request as ready for review March 12, 2025 17:43
@Amnah199 Amnah199 requested review from a team as code owners March 12, 2025 17:43
@Amnah199 Amnah199 requested review from dfokina and sjrl and removed request for a team March 12, 2025 17:43
@sjrl
Contributor

sjrl commented Mar 14, 2025

@Amnah199 thanks for working on this!

I think we should take a slightly different approach to toggle between the two different split modes. Instead of using the boolean split_by_row, I think we should introduce a split_mode variable with a Literal (or Enum) type to explicitly define the two modes. An idea for the two different split mode names could be threshold and row-wise, so something like

from typing import Literal

SplitMode = Literal["threshold", "row-wise"]

  • "threshold" → Uses row/column thresholds to determine splits (default behavior).
  • "row-wise" → Each row becomes its own sub-table, ignoring thresholds.
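
To make the two modes concrete, here is a minimal sketch using only the standard library. It is not the CSVDocumentSplitter implementation: `split_csv`, `_to_csv`, and the `row_split_threshold` argument are hypothetical names for illustration, and "threshold" mode is reduced to splitting on runs of empty rows (column thresholds are left out). Skipping empty rows in "row-wise" mode is also an assumption.

```python
import csv
import io
from typing import List, Literal, get_args

SplitMode = Literal["threshold", "row-wise"]


def _to_csv(rows: List[List[str]]) -> str:
    """Serialize rows back into a CSV string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def split_csv(
    content: str,
    split_mode: SplitMode = "threshold",
    row_split_threshold: int = 2,  # hypothetical parameter for this sketch
) -> List[str]:
    """Hypothetical helper illustrating the two proposed split modes."""
    # Literal is not enforced at runtime, so validate explicitly.
    if split_mode not in get_args(SplitMode):
        raise ValueError(
            f"split_mode must be one of {get_args(SplitMode)}, got {split_mode!r}"
        )

    rows = list(csv.reader(io.StringIO(content)))

    def is_empty(row: List[str]) -> bool:
        return not any(cell.strip() for cell in row)

    if split_mode == "row-wise":
        # Every non-empty row becomes its own single-row table;
        # the threshold argument is ignored (assumption: empty rows are skipped).
        return [_to_csv([row]) for row in rows if not is_empty(row)]

    # "threshold": start a new table after a run of at least
    # `row_split_threshold` empty rows. Shorter runs are simply dropped.
    tables: List[List[List[str]]] = []
    current: List[List[str]] = []
    empty_run = 0
    for row in rows:
        if is_empty(row):
            empty_run += 1
            continue
        if empty_run >= row_split_threshold and current:
            tables.append(current)
            current = []
        empty_run = 0
        current.append(row)
    if current:
        tables.append(current)
    return [_to_csv(table) for table in tables]
```

With a string-valued mode, a call such as `split_csv(data, split_mode="row-wise")` states the intended behavior at the call site, which is the readability point made above; a bare `True` would not.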

I think this would improve readability and make it immediately clear what the expected behavior is. Let me know what you think!

@Amnah199 Amnah199 requested a review from sjrl March 16, 2025 23:15
@Amnah199 Amnah199 requested a review from sjrl March 17, 2025 12:08
@sjrl
Contributor

sjrl commented Mar 18, 2025

@Amnah199 In test_from_dict_defaults, could we add an assert checking that split_mode is equal to "threshold"?
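
A rough sketch of what such an assertion could look like. `SplitterStub` is a hypothetical stand-in so the example runs on its own; the real test would construct CSVDocumentSplitter instead, and the `init_parameters` dictionary shape used here is an assumption about the serialized form.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal


@dataclass
class SplitterStub:
    """Hypothetical stand-in for the component under test."""

    split_mode: Literal["threshold", "row-wise"] = "threshold"

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SplitterStub":
        # Missing init parameters fall back to the dataclass defaults.
        return cls(**data.get("init_parameters", {}))


def test_from_dict_defaults() -> None:
    splitter = SplitterStub.from_dict({"type": "SplitterStub", "init_parameters": {}})
    # With no parameters serialized, the default mode should be "threshold".
    assert splitter.split_mode == "threshold"
```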

@Amnah199 Amnah199 requested a review from sjrl March 18, 2025 22:44
Contributor

@sjrl sjrl left a comment

Looks good!

@Amnah199 Amnah199 merged commit 3c101cd into main Mar 19, 2025
17 checks passed
@Amnah199 Amnah199 deleted the add-split-by-row-csv-splitter branch March 19, 2025 11:18
Labels
topic:tests type:documentation Improvements on the docs
3 participants