IndexError in test_read_parquet_partitioned_filtered[True-files-pfilters1] #15295

Closed
jakirkham opened this issue Mar 13, 2024 · 2 comments

@jakirkham
Member

jakirkham commented Mar 13, 2024

Seeing this test failure on CI:

=================================== FAILURES ===================================
_________ test_read_parquet_partitioned_filtered[True-files-pfilters1] _________
[gw1] linux -- Python 3.9.18 /opt/conda/envs/test/bin/python3.9

tmpdir = local('/tmp/pytest-of-root/pytest-0/popen-gw1/test_read_parquet_partitioned_3')
pfilters = [('b', '==', 'a'), ('c', '==', 1)], selection = 'files'
use_cat = True

    @pytest.mark.parametrize(
        "pfilters",
        [[("b", "==", "b")], [("b", "==", "a"), ("c", "==", 1)]],
    )
    @pytest.mark.parametrize("selection", ["directory", "files", "row-groups"])
    @pytest.mark.parametrize("use_cat", [True, False])
    def test_read_parquet_partitioned_filtered(
        tmpdir, pfilters, selection, use_cat
    ):
        path = str(tmpdir)
        size = 100
        df = cudf.DataFrame(
            {
                "a": np.arange(0, stop=size, dtype="int64"),
                "b": np.random.choice(list("abcd"), size=size),
                "c": np.random.choice(np.arange(4), size=size),
            }
        )
        df.to_parquet(path, partition_cols=["c", "b"])
    
        if selection == "files":
            # Pass in a list of paths
            fs = get_fs_token_paths(path)[0]
            read_path = fs.find(path)
            row_groups = None
        elif selection == "row-groups":
            # Pass in a list of paths AND row-group ids
            fs = get_fs_token_paths(path)[0]
            read_path = fs.find(path)
            row_groups = [[0] for p in read_path]
        else:
            # Pass in a directory path
            # (row-group selection not allowed in this case)
            read_path = path
            row_groups = None
    
        # Filter on partitioned columns
        expect = pd.read_parquet(read_path, filters=pfilters)
>       got = cudf.read_parquet(
            read_path,
            filters=pfilters,
            row_groups=row_groups,
            categorical_partitions=use_cat,
        )

tests/test_parquet.py:2144: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/test/lib/python3.9/site-packages/nvtx/nvtx.py:116: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/test/lib/python3.9/site-packages/cudf/io/parquet.py:577: in read_parquet
    df = _parquet_to_frame(
/opt/conda/envs/test/lib/python3.9/site-packages/nvtx/nvtx.py:116: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/test/lib/python3.9/site-packages/cudf/io/parquet.py:721: in _parquet_to_frame
    return _read_parquet(
/opt/conda/envs/test/lib/python3.9/site-packages/nvtx/nvtx.py:116: in inner
    result = func(*args, **kwargs)
/opt/conda/envs/test/lib/python3.9/site-packages/cudf/io/parquet.py:831: in _read_parquet
    return libparquet.read_parquet(
parquet.pyx:124: in cudf._lib.parquet.read_parquet
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   IndexError: list index out of range

parquet.pyx:275: IndexError
-------- generated xml file: /__w/cudf/cudf/test-results/junit-cudf.xml --------

Edit: Seen recently on CI for an unrelated Doxygen build fix (#15289).

@jakirkham
Member Author

Looks like the Spark team ran into this recently as well:

#15219 (comment)

@jakirkham jakirkham mentioned this issue Mar 13, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 14, 2024
xref #15295

Hoping this test will be easier to debug once the input data is deterministic.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15296
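
For readers following along, here is a minimal sketch of what deterministic input data for this test could look like. It is an illustration under stated assumptions, not necessarily the change made in #15296: the helper name `make_columns` and the seed value are hypothetical, and plain NumPy arrays stand in for the `cudf.DataFrame` so the snippet runs without a GPU.

```python
import numpy as np


def make_columns(size=100, seed=0):
    """Build the test's three columns from a seeded generator.

    Hypothetical helper: the name and the default seed are assumptions
    made for illustration, not taken from the cudf test suite.
    """
    rng = np.random.default_rng(seed)  # seeded, so every run draws the same values
    return {
        "a": np.arange(0, stop=size, dtype="int64"),
        "b": rng.choice(list("abcd"), size=size),
        "c": rng.choice(np.arange(4), size=size),
    }


first = make_columns()
second = make_columns()
# Two calls with the same seed produce identical columns.
print(all(np.array_equal(first[name], second[name]) for name in first))
```

With a fixed seed, a failure like the one above would at least reproduce on every run, which is the debugging benefit the commit message is after.
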
@vyasr
Contributor

vyasr commented May 17, 2024

Seems like this was fixed by #15296. Perhaps we were randomly generating invalid data on occasion. Feel free to reopen if we find a meaningful reproducer again.
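
One possible reading of "invalid data" (an assumption, not something confirmed in this thread) is that an unseeded draw occasionally contains no row matching every filter, for example no row with `b == "a"` and `c == 1`. The sketch below checks a sample for that condition and estimates how often it would happen if `b` and `c` are drawn independently and uniformly, as in the test. The helper names are hypothetical, and only the standard library is used.

```python
import random


def has_matching_row(rows, filters):
    """Return True if any row satisfies every filter.

    Only "==" filters are handled, which is all the failing test uses.
    """
    return any(
        all(row[col] == val for col, _op, val in filters) for row in rows
    )


def prob_no_match(size, n_combinations):
    """Chance that `size` independent uniform draws all miss one given combination."""
    return (1 - 1 / n_combinations) ** size


# Assumption: "b" has 4 possible values and "c" has 4, drawn independently.
rng = random.Random(0)
rows = [{"b": rng.choice("abcd"), "c": rng.randrange(4)} for _ in range(100)]
pfilters = [("b", "==", "a"), ("c", "==", 1)]

print(has_matching_row(rows, pfilters))
print(prob_no_match(size=100, n_combinations=16))
```

Under these assumptions the miss probability is (15/16)^100, which is on the order of 0.1%. That is rare enough to look like an occasional CI flake, though whether an empty match is what actually triggered the `IndexError` remains unverified here.
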

@vyasr vyasr closed this as completed May 17, 2024