BUG: Chained urls (allowed by fsspec) not recognized by is_fsspec_url #48978
Comments
Can confirm the issue is as stated. The root cause is the pattern match for is_fsspec_url being _RFC_3986_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://"). Therefore the double colon causes the regex to fail and fsspec is not called. It is easy to adjust the pattern to allow a double colon; however, I am not sure exactly what syntax we would be conforming to by allowing a double colon at the start of a (pseudo-)URL, or whether any unintended behaviour could result... I believe that double colons are not part of RFC 3986, which wouldn't stop us from making the change, but we should ensure that the regex pattern is named to reflect its function. |
In theory can fix with a PR that:
which should get the fsspec functionality AND ensure we don't have a collision with the prior JSON read issue. If we want to action it, I can take this PR, as I recently investigated the fsspec handoff conditions as part of another PR and am therefore pretty familiar with (some of) the relevant code. |
Thanks for tackling this!
It may not matter for this case, but the double colons do show up in RFC
3986 for the case of elisions in IPv6 addresses. I suppose one would like
to be able to handle the case of
"filecache::s3://2607:f8b0:4005:811::200e/foo", though I don't know whether
fsspec itself does this correctly or not.
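As a rough sketch of why the IPv6 case need not trip up a prefix check: a pattern anchored at the start of the string only inspects the scheme chain before the first "://", so a "::" later in the address is never examined. The pattern `CHAINED_PREFIX` and the helper `split_chain_prefix` below are hypothetical, written only for this example; whether fsspec then parses such an address correctly is a separate question that this sketch does not answer.

```python
import re

# Hypothetical chained-prefix pattern (illustrative assumption, not pandas' code).
CHAINED_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*(::[A-Za-z0-9+\-+.]+)*://")


def split_chain_prefix(url: str):
    """Return (protocols, remainder) if url starts with a scheme chain, else None."""
    match = CHAINED_PREFIX.match(url)
    if match is None:
        return None
    # Drop the trailing "://" and split the leading chain on "::".
    protocols = match.group(0)[: -len("://")].split("::")
    return protocols, url[match.end():]


result = split_chain_prefix("filecache::s3://2607:f8b0:4005:811::200e/foo")
print(result)
```

Here the leading chain is split into `filecache` and `s3`, and the elided IPv6 address is left untouched in the remainder.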
…On Wed, Oct 19, 2022 at 6:16 AM JMBurley ***@***.***> wrote:
Can confirm the issue is as stated. The root cause is the pattern match
for is_fsspec_url being _RFC_3986_PATTERN =
re.compile(r"^[A-Za-z][A-Za-z0-9+\-+.]*://").
Therefore the double-colon causes the regex to fail and fsspec is not
called.
Easy to adjust that to allow a double colon, however I am not sure what
exactly syntax we are corresponding to by allowing a double colon at the
start of a (pseudo)URL, or if there is any unintended behaviour that could
result...
I believe that double-colons are not part of RFC_3986
<https://datatracker.ietf.org/doc/html/rfc3986>
--
Ethan Ligon, Professor
Agricultural & Resource Economics
University of California, Berkeley
|
I am coming very late to this, and believe that chained fsspec URLs would be very handy to permit. The specific issue, why the URL pattern was tightened up, was to do with URLs embedded in JSON strings passed to from_json - so always after both " and {-or-[. If fsspec makes a canonical "this is what our URLs should look like" function or regex, could it be used here? Also, http(s) continues to be excluded from fsspec handling, but with |
Not sure if there's a cleaner workaround, but here's a quick hack I used to load a parquet file using
import pandas as pd
# Patch Pandas bug https://github.com/pandas-dev/pandas/issues/48978
pd.io.common.is_fsspec_url = lambda x: True
df = pd.read_parquet("filecache::s3://my-bucket/my_file.parquet")
(This works with |
I'm also hitting this. Another workaround could be:
import pandas as pd
import fsspec
with fsspec.open("filecache::s3://my-bucket/my_file.parquet") as f:
    df = pd.read_parquet(f)
|
take |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
fsspec allows one to pass "chained" URLs (useful, e.g., for caching), for example "filecache::https://example.com/my_file". However, since commit eeff2b0, which was meant to address issue #36271, supplying URLs of this form to, e.g., pd.read_csv has failed.
Expected Behavior
Pandas should pass the chained URL to fsspec. In the example code, the pd.DataFrame objects foo and bar should be identical.
Installed Versions
pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : 2022.8.2
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None