BUG: Chained urls (allowed by fsspec) not recognized by is_fsspec_url #48978

Closed
Labels: Bug, IO (Data IO issues that don't fit into a more specific label)

Comments

ligon commented Oct 6, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from pandas.io.common import is_fsspec_url
import fsspec

# This fsspec trick chains a cache fs with an s3 file system using "::".
# See https://filesystem-spec.readthedocs.io/en/latest/features.html#url-chaining.
fn = "filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv"

with fsspec.open(fn, storage_options={"s3": {"anon": True}}) as f:
    foo = pd.read_csv(f)

print(foo.shape)

# But this url isn't recognized as a valid fsspec url by pandas...
print(pd.io.common.is_fsspec_url(fn))

# ...and so attempts to use chained url directly fail:
bar = pd.read_csv("filecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
                  "SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
                  storage_options={"s3": {"anon": True},
                                   "filecache": {"cache_storage": "/tmp/cache"}})

Issue Description

fsspec allows one to pass "chained" URLs (useful, e.g., for caching), such as "filecache::https://example.com/my_file". However, since commit eeff2b0 (meant to address issue #36271), supplying URLs of this form to, e.g., pd.read_csv has failed.

Expected Behavior

Pandas should pass the chained URL through to fsspec. In the example code, the DataFrames foo and bar should be identical.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 87cfe4e
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.131-19115-g2e2fb0ed324d
Version : #1 SMP PREEMPT Mon Sep 12 18:55:51 PDT 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : 2022.8.2
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@ligon ligon added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 6, 2022
JMBurley (Contributor) commented Oct 19, 2022

Can confirm the issue is as stated. The root cause is the pattern match for is_fsspec_url being _RFC_3986_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://").

Therefore the double-colon :: causes the regex to fail and fsspec is not called.

It would be easy to adjust that to allow a double colon; however, I am not sure exactly what syntax we would be conforming to by allowing a double colon at the start of a (pseudo-)URL, or whether any unintended behaviour could result...

I believe that double colons are not part of RFC 3986, which wouldn't stop us making the change, but we should ensure that the regex pattern is named to reflect its actual function.

JMBurley (Contributor) commented Oct 19, 2022

In theory can fix with a PR that:

  • modifies regex pattern in pandas/io/common.py
  • updates test test_read_json_with_url_value with a double-colon case
  • updates test test_is_fsspec_url with a double-colon case

which should get the fsspec functionality AND ensure we don't have a collision with the prior JSON read issue.

If we want to action it, I can take this PR, as I recently investigated the fsspec handoff conditions as part of another PR and am therefore pretty familiar with (some of) the relevant code.

ligon (Author) commented Oct 19, 2022 via email

@lithomas1 lithomas1 added IO Data IO issues that don't fit into a more specific label and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 20, 2023
martindurant (Contributor) commented

I am coming very late to this, and believe that chained fsspec URLs would be very handy to permit. The specific issue that led to the URL pattern being tightened involved URLs embedded in JSON strings passed to read_json - so always appearing after both a quote and a { or [.

If fsspec makes a canonical "this is what our URLs should look like" function or regex, could it be used here?

Also, http(s) continues to be excluded from fsspec handling, but with compression="infer" we do guess compressors for gzip etc. Routing http(s) through fsspec could simplify pandas' code, but that code path is very stable, so this is hardly urgent.

ynouri commented Feb 26, 2024

Not sure if there's a cleaner workaround, but here's a quick hack I used to load a parquet file using filecache::

import pandas as pd

# Patch Pandas bug https://github.com/pandas-dev/pandas/issues/48978
pd.io.common.is_fsspec_url = lambda x: True

df = pd.read_parquet("filecache::s3://my-bucket/my_file.parquet")

(This works with pandas==1.5.0)
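A narrower stand-in could be used in place of lambda x: True, which routes every string path through fsspec (a sketch; the helper name and pattern are made up for illustration, not pandas API):

```python
import re

# Accept chained URLs ("a::b://...") as well as plain "scheme://" URLs,
# while still rejecting local paths. Like the real is_fsspec_url, keep
# excluding http(s), which pandas handles separately.
_CHAINED_URL = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*(::|://)")

def is_chained_fsspec_url(url) -> bool:
    return (
        isinstance(url, str)
        and bool(_CHAINED_URL.match(url))
        and not url.startswith(("http://", "https://"))
    )

# pd.io.common.is_fsspec_url = is_chained_fsspec_url  # apply like the patch above

print(is_chained_fsspec_url("filecache::s3://my-bucket/my_file.parquet"))  # True
print(is_chained_fsspec_url("/tmp/my_file.parquet"))                       # False
```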

@schmidt-ai

I'm also hitting this. Another workaround could be:

import pandas as pd
import fsspec

with fsspec.open("filecache::s3://my-bucket/my_file.parquet") as f:
    df = pd.read_parquet(f)

snitish (Contributor) commented Mar 3, 2025

take
