
Commit 9dcce63

ENH: Allow third-party packages to register IO engines (#61642)
1 parent b91fa1d commit 9dcce63

7 files changed, with 352 additions and 1 deletion
doc/source/development/extending.rst

Lines changed: 63 additions & 0 deletions
@@ -489,6 +489,69 @@ registers the default "matplotlib" backend as follows.
489489
More information on how to implement a third-party plotting backend can be found at
490490
https://github.com/pandas-dev/pandas/blob/main/pandas/plotting/__init__.py#L1.
491491

492+
.. _extending.io-engines:
493+
494+
IO engines
495+
-----------
496+
497+
pandas provides several IO connectors such as :func:`read_csv` or :meth:`DataFrame.to_parquet`, and many
498+
of those support multiple engines. For example, :func:`read_csv` supports the ``python``, ``c``
499+
and ``pyarrow`` engines, each with its advantages and disadvantages, making each more appropriate
500+
for certain use cases.
501+
502+
Third-party package developers can implement engines for any of the pandas readers and writers.
503+
When a ``pandas.read_*`` function or ``DataFrame.to_*`` method is called with an ``engine="<name>"``
504+
that is not known to pandas, pandas will look into the entry points registered in the group
505+
``pandas.io_engine`` by the packages in the environment, and will call the corresponding method.
506+
507+
An engine is a simple Python class which implements one or more of the pandas readers and writers
508+
as class methods:
509+
510+
.. code-block:: python
511+
512+
class EmptyDataEngine:
513+
@classmethod
514+
def read_json(cls, path_or_buf=None, **kwargs):
515+
return pd.DataFrame()
516+
517+
@classmethod
518+
def to_json(cls, path_or_buf=None, **kwargs):
519+
with open(path_or_buf, "w") as f:
520+
f.write("")
521+
522+
@classmethod
523+
def read_clipboard(cls, sep='\\s+', dtype_backend=None, **kwargs):
524+
return pd.DataFrame()
525+
526+
A single engine can support multiple readers and writers. When possible, it is a good practice for
527+
an engine to provide both a reader and a writer for the supported formats. But it is possible to
528+
provide just one of them.
529+
530+
The package implementing the engine needs to create an entry point for pandas to be able to discover
531+
it. This is done in ``pyproject.toml``:
532+
533+
```toml
534+
[project.entry-points."pandas.io_engine"]
535+
empty = "empty_data:EmptyDataEngine"
536+
```
537+
538+
The first line should always be the same, creating the entry point in the ``pandas.io_engine`` group.
539+
In the second line, ``empty`` is the name of the engine, and ``empty_data:EmptyDataEngine`` is where
540+
to find the engine class in the package (``empty_data`` is the module name in this case).
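To see how the ``module:attribute`` value is interpreted, the standard library's ``EntryPoint`` class can parse it directly. This is only an illustration of the naming convention, built by hand rather than read from an installed package.

```python
from importlib.metadata import EntryPoint

# Built manually for illustration; normally this object comes from the
# metadata of an installed package.
ep = EntryPoint(
    name="empty",
    value="empty_data:EmptyDataEngine",
    group="pandas.io_engine",
)

print(ep.name)    # empty
print(ep.module)  # empty_data
print(ep.attr)    # EmptyDataEngine
```

Calling ``ep.load()`` would import ``empty_data`` and return the ``EmptyDataEngine`` class, which only works when that module is importable.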
541+
542+
If a user has the example package installed, then it would be possible to use:
543+
544+
.. code-block:: python
545+
546+
pd.read_json("myfile.json", engine="empty")
547+
548+
When pandas detects that no ``empty`` engine exists for the ``read_json`` reader in pandas, it will
549+
look at the entry points, will find the ``EmptyDataEngine`` engine, and will call the ``read_json``
550+
method on it with the arguments provided by the user (except the ``engine`` parameter).
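The dispatch step can be sketched without pandas as follows. This is a simplified model, not the pandas implementation: ``REGISTRY`` and ``dispatch`` are hypothetical names, the registry is a plain dictionary standing in for the discovered entry points, and the engine returns a list rather than a ``DataFrame`` so that the example has no dependencies.

```python
class EmptyDataEngine:
    """Stand-in engine; returns a list instead of a DataFrame."""

    @classmethod
    def read_json(cls, path_or_buf=None, **kwargs):
        return []


# Stand-in for the entry points discovered in the environment (assumption).
REGISTRY = {"empty": EmptyDataEngine}


def dispatch(func_name, *args, engine, **kwargs):
    """Call ``func_name`` on the engine registered under ``engine``."""
    try:
        engine_cls = REGISTRY[engine]
    except KeyError as err:
        raise ValueError(f"'{engine}' is not a known engine.") from err

    method = getattr(engine_cls, func_name, None)
    if method is None:
        raise ValueError(
            f"The engine '{engine}' does not provide a '{func_name}' function"
        )
    # ``engine`` is keyword-only here, so it is not forwarded to the engine.
    return method(*args, **kwargs)


result = dispatch("read_json", "myfile.json", engine="empty", lines=True)
```

The engine receives every argument the user passed except ``engine`` itself, matching the behaviour described above.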
551+
552+
To avoid conflicts in the names of engines, we keep an "IO engines" section in our
553+
`Ecosystem page <https://pandas.pydata.org/community/ecosystem.html#io-engines>`_.
554+
492555
.. _extending.pandas_priority:
493556

494557
Arithmetic with 3rd party types

doc/source/whatsnew/v3.0.0.rst

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ Other enhancements
9494
- Support passing a :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
9595
- Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
9696
- Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)
97+
- Third-party packages can now register engines that can be used in pandas I/O operations :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` (:issue:`61584`)
9798

9899
.. ---------------------------------------------------------------------------
99100
.. _whatsnew_300.notable_bug_fixes:

pandas/core/frame.py

Lines changed: 10 additions & 1 deletion
@@ -188,7 +188,10 @@
188188
nargsort,
189189
)
190190

191-
from pandas.io.common import get_handle
191+
from pandas.io.common import (
192+
allow_third_party_engines,
193+
get_handle,
194+
)
192195
from pandas.io.formats import (
193196
console,
194197
format as fmt,
@@ -3547,6 +3550,7 @@ def to_xml(
35473550

35483551
return xml_formatter.write_output()
35493552

3553+
@allow_third_party_engines
35503554
def to_iceberg(
35513555
self,
35523556
table_identifier: str,
@@ -3556,6 +3560,7 @@ def to_iceberg(
35563560
location: str | None = None,
35573561
append: bool = False,
35583562
snapshot_properties: dict[str, str] | None = None,
3563+
engine: str | None = None,
35593564
) -> None:
35603565
"""
35613566
Write a DataFrame to an Apache Iceberg table.
@@ -3580,6 +3585,10 @@ def to_iceberg(
35803585
If ``True``, append data to the table, instead of replacing the content.
35813586
snapshot_properties : dict of {str: str}, optional
35823587
Custom properties to be added to the snapshot summary
3588+
engine : str, optional
3589+
The engine to use. Engines can be installed via third-party packages. For an
3590+
updated list of existing pandas I/O engines check the I/O engines section of
3591+
the pandas Ecosystem page.
35833592
35843593
See Also
35853594
--------

pandas/io/common.py

Lines changed: 152 additions & 0 deletions
@@ -9,13 +9,15 @@
99
import codecs
1010
from collections import defaultdict
1111
from collections.abc import (
12+
Callable,
1213
Hashable,
1314
Mapping,
1415
Sequence,
1516
)
1617
import dataclasses
1718
import functools
1819
import gzip
20+
from importlib.metadata import entry_points
1921
from io import (
2022
BufferedIOBase,
2123
BytesIO,
@@ -90,6 +92,10 @@
9092

9193
from pandas import MultiIndex
9294

95+
# registry of I/O engines. It is populated the first time a non-core
96+
# pandas engine is used
97+
_io_engines: dict[str, Any] | None = None
98+
9399

94100
@dataclasses.dataclass
95101
class IOArgs:
@@ -1282,3 +1288,149 @@ def dedup_names(
12821288
counts[col] = cur_count + 1
12831289

12841290
return names
1291+
1292+
1293+
def _get_io_engine(name: str) -> Any:
1294+
"""
1295+
Return an I/O engine by its name.
1296+
1297+
pandas I/O engines can be registered via entry points. The first time this
1298+
function is called it will register all the entry points of the "pandas.io_engine"
1299+
group and cache them in the global `_io_engines` variable.
1300+
1301+
Engines are implemented as classes with the `read_<format>` and `to_<format>`
1302+
methods (classmethods) for the formats they wish to provide. This function will
1303+
return the method from the engine and format being requested.
1304+
1305+
Parameters
1306+
----------
1307+
name : str
1308+
The engine name provided by the user in `engine=<value>`.
1309+
1310+
Examples
1311+
--------
1312+
An engine is implemented with a class like:
1313+
1314+
>>> class DummyEngine:
1315+
... @classmethod
1316+
... def read_csv(cls, filepath_or_buffer, **kwargs):
1317+
... # the engine signature must match the pandas method signature
1318+
... return pd.DataFrame()
1319+
1320+
It must be registered as an entry point with the engine name:
1321+
1322+
```
1323+
[project.entry-points."pandas.io_engine"]
1324+
dummy = "pandas.io.dummy:DummyEngine"
1325+
1326+
```
1327+
1328+
Then the `read_csv` method of the engine can be used with:
1329+
1330+
>>> _get_io_engine(name="dummy").read_csv("myfile.csv") # doctest: +SKIP
1331+
1332+
This is used internally to dispatch the next pandas call to the engine caller:
1333+
1334+
>>> df = read_csv("myfile.csv", engine="dummy") # doctest: +SKIP
1335+
"""
1336+
global _io_engines
1337+
1338+
if _io_engines is None:
1339+
_io_engines = {}
1340+
for entry_point in entry_points().select(group="pandas.io_engine"):
1341+
if entry_point.dist:
1342+
package_name = entry_point.dist.metadata["Name"]
1343+
else:
1344+
package_name = None
1345+
if entry_point.name in _io_engines:
1346+
_io_engines[entry_point.name]._packages.append(package_name)
1347+
else:
1348+
_io_engines[entry_point.name] = entry_point.load()
1349+
_io_engines[entry_point.name]._packages = [package_name]
1350+
1351+
try:
1352+
engine = _io_engines[name]
1353+
except KeyError as err:
1354+
raise ValueError(
1355+
f"'{name}' is not a known engine. Some engines are only available "
1356+
"after installing the package that provides them."
1357+
) from err
1358+
1359+
if len(engine._packages) > 1:
1360+
msg = (
1361+
f"The engine '{name}' has been registered by the package "
1362+
f"'{engine._packages[0]}' and will be used. "
1363+
)
1364+
if len(engine._packages) == 2:
1365+
msg += (
1366+
f"The package '{engine._packages[1]}' also tried to register "
1367+
"the engine, but it couldn't because it was already registered."
1368+
)
1369+
else:
1370+
msg += (
1371+
f"The packages {str(engine._packages[1:])[1:-1]} also tried to register "
1372+
"the engine, but they couldn't because it was already registered."
1373+
)
1374+
warnings.warn(msg, RuntimeWarning, stacklevel=find_stack_level())
1375+
1376+
return engine
1377+
1378+
1379+
def allow_third_party_engines(
1380+
skip_engines: list[str] | Callable | None = None,
1381+
) -> Callable:
1382+
"""
1383+
Decorator to avoid boilerplate code when allowing readers and writers to use
1384+
third-party engines.
1385+
1386+
The decorator will introspect the function to know which format should be obtained,
1387+
and to know if it's a reader or a writer. Then it will check if the engine has been
1388+
registered, and if it has, it will dispatch the execution to the engine with the
1389+
arguments provided by the user.
1390+
1391+
Parameters
1392+
----------
1393+
skip_engines : list of str, optional
1394+
For engines that are implemented in pandas, we want to skip them for this engine
1395+
dispatching system. They should be specified in this parameter.
1396+
1397+
Examples
1398+
--------
1399+
The decorator works both with the `skip_engines` parameter, or without:
1400+
1401+
>>> class DataFrame:
1402+
... @allow_third_party_engines(["python", "c", "pyarrow"])
1403+
... def read_csv(filepath_or_buffer, **kwargs):
1404+
... pass
1405+
...
1406+
... @allow_third_party_engines
1407+
... def read_sas(filepath_or_buffer, **kwargs):
1408+
... pass
1409+
"""
1410+
1411+
def decorator(func: Callable) -> Callable:
1412+
@functools.wraps(func)
1413+
def wrapper(*args: Any, **kwargs: Any) -> Any:
1414+
if callable(skip_engines) or skip_engines is None:
1415+
skip_engine = False
1416+
else:
1417+
skip_engine = kwargs.get("engine") in skip_engines
1418+
1419+
if "engine" in kwargs and not skip_engine:
1420+
engine_name = kwargs.pop("engine")
1421+
engine = _get_io_engine(engine_name)
1422+
try:
1423+
return getattr(engine, func.__name__)(*args, **kwargs)
1424+
except AttributeError as err:
1425+
raise ValueError(
1426+
f"The engine '{engine_name}' does not provide a "
1427+
f"'{func.__name__}' function"
1428+
) from err
1429+
else:
1430+
return func(*args, **kwargs)
1431+
1432+
return wrapper
1433+
1434+
if callable(skip_engines):
1435+
return decorator(skip_engines)
1436+
return decorator
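The decorator above can be applied either bare or with a list of engines to skip. The sketch below isolates that pattern in a dependency-free form. It is a simplification under stated assumptions: ``allow_engines`` is a hypothetical name, dispatching is replaced by returning a string, and, unlike the code in this commit, an explicit ``engine=None`` is treated as "no engine requested".

```python
import functools


def allow_engines(skip_engines=None):
    """Decorator usable as ``@allow_engines`` or ``@allow_engines([...])``."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # In bare usage ``skip_engines`` is the decorated function itself,
            # so there is nothing to skip.
            if callable(skip_engines) or skip_engines is None:
                skipped = ()
            else:
                skipped = skip_engines
            engine = kwargs.get("engine")
            if engine is not None and engine not in skipped:
                # Placeholder for the real dispatch to a third-party engine.
                return f"dispatched {func.__name__} to {engine}"
            return func(*args, **kwargs)

        return wrapper

    if callable(skip_engines):
        # Bare usage: decorate the function that was passed in directly.
        return decorator(skip_engines)
    return decorator


@allow_engines(["python", "c", "pyarrow"])
def read_csv(path, engine=None):
    return f"core read_csv with engine={engine}"


@allow_engines
def read_sas(path, engine=None):
    return "core read_sas"
```

Checking ``callable(skip_engines)`` is what lets one function serve both call styles, at the cost of never accepting a callable as a genuine ``skip_engines`` value.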

pandas/io/iceberg.py

Lines changed: 8 additions & 0 deletions
@@ -6,7 +6,10 @@
66

77
from pandas import DataFrame
88

9+
from pandas.io.common import allow_third_party_engines
910

11+
12+
@allow_third_party_engines
1013
def read_iceberg(
1114
table_identifier: str,
1215
catalog_name: str | None = None,
@@ -18,6 +21,7 @@ def read_iceberg(
1821
snapshot_id: int | None = None,
1922
limit: int | None = None,
2023
scan_properties: dict[str, Any] | None = None,
24+
engine: str | None = None,
2125
) -> DataFrame:
2226
"""
2327
Read an Apache Iceberg table into a pandas DataFrame.
@@ -52,6 +56,10 @@ def read_iceberg(
5256
scan_properties : dict of {str: obj}, optional
5357
Additional Table properties as a dictionary of string key value pairs to use
5458
for this scan.
59+
engine : str, optional
60+
The engine to use. Engines can be installed via third-party packages. For an
61+
updated list of existing pandas I/O engines check the I/O engines section of
62+
the pandas Ecosystem page.
5563
5664
Returns
5765
-------
