fix for 61123 read_excel nrows param reads extra rows #61129

@zanuka zanuka commented Mar 15, 2025

Issue: GH-61123
When reading Excel files with pd.read_excel and specifying nrows=4, the behavior differs depending on whether there’s a blank row between tables. For a file with two tables (each with a header and 3 data rows), nrows=4 should yield a DataFrame with one header and 3 data rows (shape (3, n)). However:

  • In test1.xlsx (with a blank row), it correctly reads the first table (header + 3 rows).
  • In test2.xlsx (no blank row), it incorrectly includes the second table’s header as a data row, resulting in a shape of (4, n).

This inconsistency occurs because read_excel doesn’t properly respect table boundaries when tables are adjacent, despite the nrows limit.
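The two reported shapes can be reproduced with a toy model in plain Python. This is only a sketch and not pandas' code: the assumption that the reader takes a fixed window of header + nrows sheet rows and drops blank rows afterwards is mine and is not confirmed in this thread, and the column names and cell values are made up for illustration.

```python
# Toy model of the reported behavior. NOT pandas' implementation.
# Assumption (hypothetical): the reader takes the first 1 + nrows sheet rows
# (header plus nrows rows) and only afterwards drops fully blank rows.

def simulate_nrows(sheet_rows, nrows):
    """Return (header, data_rows) under the assumed windowing rule."""
    window = sheet_rows[: 1 + nrows]
    header, *body = window
    # Drop rows in which every cell is empty.
    data = [row for row in body if any(cell is not None for cell in row)]
    return header, data


# Hypothetical two-column table: one header row and 3 data rows.
table = [("a", "b"), (1, 2), (3, 4), (5, 6)]
blank = (None, None)

with_blank_row = table + [blank] + table   # layout described for test1.xlsx
without_blank_row = table + table          # layout described for test2.xlsx

for name, rows in [("with blank row", with_blank_row),
                   ("without blank row", without_blank_row)]:
    header, data = simulate_nrows(rows, nrows=4)
    print(name, (len(data), len(header)))
```

Under this assumption the blank row uses up one slot of the window and is then discarded, while in the adjacent layout the second table's header fills that slot and is kept as a data row.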

Fix:

  • Modified pandas/io/excel/_base.py and related reader modules (_openpyxl.py, _pyxlsb.py, _xlrd.py) to ensure nrows limits reading to the specified number of rows, excluding subsequent table headers even when tables are adjacent.
  • Added a new test test_excel_read_tables_with_and_without_blank_row in pandas/tests/io/excel/test_readers.py to verify that nrows=4 consistently returns a DataFrame with shape (3, 2) (header + 3 data rows) for both cases.

Changes:

  • Updated Excel reader logic to stop at nrows without parsing beyond table boundaries.
  • Ensured consistent behavior across openpyxl, pyxlsb, and xlrd engines.
  • Squashed commits into a single commit for clarity.

Verification:

  • Tested with test1.xlsx (blank row) and test2.xlsx (no blank row).
  • Confirmed both now yield a DataFrame with shape (3, 2) and only the first table’s data.

Steps to Test:

  1. Run pytest pandas/tests/io/excel/test_readers.py::TestReaders::test_excel_read_tables_with_and_without_blank_row.
  2. Verify df1.shape == (3, 2) and df2.shape == (3, 2) match the expected output.

Related Files:

  • pandas/io/excel/_base.py
  • pandas/io/excel/_openpyxl.py
  • pandas/io/excel/_pyxlsb.py
  • pandas/io/excel/_xlrd.py
  • pandas/tests/io/excel/test_readers.py

Closes #61123

@zanuka zanuka requested a review from rhshadrach as a code owner March 15, 2025 05:32
@zanuka zanuka force-pushed the fix/61123-read_excel-nrows-param-reads-extra-rows branch from 3d54264 to 476a24d Compare March 16, 2025 07:59
Member

@mroeschke mroeschke left a comment

Did you use an LLM to largely solve this issue (I see a commit from Jolt AI)? At this time the project does not want contributions that are largely AI-generated.

Author

zanuka commented Mar 17, 2025

yeah, was running a test on behalf of Jolt AI to see how their system would handle these types of issues. am fine to close the PR.

@zanuka zanuka closed this Mar 17, 2025

Successfully merging this pull request may close these issues.

DOC: read_excel nrows parameter reads extra rows when tables are adjacent (no blank row)
3 participants