
wip-feat: pandas as soft dependency #3384

Closed
wants to merge 10 commits

Conversation

mattijn
Contributor

@mattijn mattijn commented Mar 25, 2024

This PR is an attempt to make pandas a soft dependency. I hope it can be used as inspiration, as I was not able to make the types happy. I have no real idea how it should be done, but I've tried a few things, some with success and others without.

I also attempted to prioritize the DataFrameLike approach over the pandas routine, but decided against it, as otherwise using a pandas DataFrame within Altair would require pyarrow for inference/serialization. My current feeling is that using pandas to infer and serialize the data is still preferable, since it does not depend on pyarrow.
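
To make that ordering concrete, here is a rough sketch (the helper names and bodies below are illustrative only, not the code in this PR): the pandas route comes first, so pandas users never need pyarrow, and the DataFrameLike/interchange route is the fallback that does.

from typing import Any


def _is_pandas_dataframe(obj: Any) -> bool:
    # Stand-in for the helper added in this PR: pandas is only imported on demand.
    try:
        import pandas as pd
    except ImportError:
        return False
    return isinstance(obj, pd.DataFrame)


def _to_values(data: Any) -> dict:
    # Illustrative dispatch: prefer the pandas-only path, fall back to the
    # interchange-protocol path, which in altair requires pyarrow.
    if _is_pandas_dataframe(data):
        return {"values": data.to_dict(orient="records")}
    elif hasattr(data, "__dataframe__"):
        raise NotImplementedError("interchange-protocol path (requires pyarrow)")
    else:
        raise TypeError(f"Unsupported data type: {type(data).__name__}")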

@binste
Contributor

binste commented Mar 29, 2024

Great to get the ball rolling on this, thank you @mattijn! I did not yet have time to review but just wanted to say that I'm happy to have a look at the types once I get to it. As long as the package works, I'm optimistic that we can make mypy happy.

@mattijn
Contributor Author

mattijn commented Mar 29, 2024

Thanks @binste! No rush! Maybe something for version 5.4.

Contributor

@binste binste left a comment


Just some first comments. I haven't had the chance to run mypy on this PR yet (I reviewed it in the browser), but I have some ideas for how to make it work which I want to try out depending on the errors it throws.



def import_pandas() -> ModuleType:
    min_version = "0.25"
Contributor


Could you add a comment in pyproject.toml, next to the pandas requirement, noting that if the pandas version is updated it also needs to be changed here? Although I'm realizing now that that file needs to be changed anyway to make pandas optional.

        return curried.pipe(data, data_transformers.get())
    elif isinstance(data, str):
        return {"url": data}
    elif _is_pandas_dataframe(data):
Contributor


Is my understanding correct that this line is only reached if it's an old pandas version which does not support the dataframe interchange protocol? Otherwise it would already stop at line 43, right?

If yes, could you add a comment about this?

@@ -53,6 +52,11 @@ def __dataframe__(
) -> DfiDataFrame: ...


def _is_pandas_dataframe(obj: Any) -> bool:
Contributor


Could this function be a simple isinstance(obj, pd.DataFrame)?

Contributor Author


Thanks for starting to review this PR @binste! I don't think I can do this without importing pandas first.

I tried setting up a function with which I can do some duck typing:

def instance(obj):
    return type(obj).__name__

But I found out that both polars and pandas use the class name DataFrame for their dataframes, so the bare name is not enough to tell them apart.
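
One possible workaround, purely as a sketch (not what this PR ends up doing), is to duck-type on the class's module as well as its bare name:

from typing import Any


def _is_pandas_dataframe(obj: Any) -> bool:
    # pandas.DataFrame lives in the "pandas.*" module tree and polars.DataFrame
    # in "polars.*", so the module prefix disambiguates them without importing
    # either library. (Subclasses defined outside pandas would not match, so
    # this is only a sketch.)
    cls = type(obj)
    return cls.__name__ == "DataFrame" and cls.__module__.startswith("pandas.")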

Contributor


Maybe I'm missing something, but couldn't we call the pandas import function you created here, and if it raises an ImportError, we know it's not a pandas dataframe anyway?
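
A minimal sketch of that idea (using importlib directly here; in the PR it would call the import_pandas helper instead):

from importlib import import_module
from typing import Any


def _is_pandas_dataframe(obj: Any) -> bool:
    # If pandas cannot be imported at all, obj cannot be a pandas DataFrame.
    try:
        pd = import_module("pandas")
    except ImportError:
        return False
    return isinstance(obj, pd.DataFrame)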

Contributor Author


It's pragmatic, I admit. But that would be an unnecessary import of pandas when it is available in the environment yet the data object is something else.
I wish we could sniff the type without importing modules first.

Contributor


Here's the optional import logic I added to plotly.py a while back: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/_plotly_utils/optional_imports.py. If should_load is False, it won't perform the import even if the library is installed. This was used with isinstance checks, because if pandas hasn't been loaded yet, you know the object you're dealing with isn't a pandas DataFrame, even if pandas is installed.
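
For illustration, a rough sketch of that pattern (not plotly's actual code): with should_load=False the helper only consults sys.modules and never triggers an import itself.

import importlib
import sys
from types import ModuleType
from typing import Optional


def get_module(name: str, should_load: bool = True) -> Optional[ModuleType]:
    # Return the module if it is (or can be made) available, else None.
    if not should_load:
        # Only look at modules that were already imported elsewhere.
        return sys.modules.get(name)
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def looks_like_pandas_df(obj) -> bool:
    # If pandas was never imported, obj cannot be a pandas DataFrame.
    pd = get_module("pandas", should_load=False)
    return pd is not None and isinstance(obj, pd.DataFrame)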

Contributor

@MarcoGorelli MarcoGorelli Jun 25, 2024


A trick I learned from scikit-learn is to check if pandas is in sys.modules before doing the isinstance check, something like

if (pd := sys.modules.get('pandas')) is not None and isinstance(df, pd.DataFrame):
    ...

If pandas was never imported, then df is definitely not a pandas DataFrame.

(this is also what we do in Narwhals, where pandas/polars/etc. are never explicitly imported)

Contributor Author


I just saw your response here @MarcoGorelli! I also made this observation recently; see the comment I just added in #3384 (comment)...

        return pd
    except ImportError as err:
        raise ImportError(
            f"Serialization of the DataFrame requires\n"
Contributor


Suggested change
-            f"Serialization of the DataFrame requires\n"
+            f"Serialization of this data requires\n"

It can also be a dict as in data.py: _data_to_csv_string. Furthermore, if it's a dataframe, it's already given that Pandas is installed.

Comment on lines +47 to +49
if TYPE_CHECKING:
    pass

Contributor


Suggested change
-if TYPE_CHECKING:
-    pass

Aware that it's just a wip PR, thought I'd just note it anyway :)

Comment on lines +51 to +53
class _PandasTimestamp:
    def isoformat(self):
        return "dummy_isoformat"  # Return a dummy ISO format string
Contributor


I think this should inherit from a Protocol, as a pd.Timestamp is not an instance of _PandasTimestamp. You'll then also need to add the @runtime_checkable decorator from typing. Alternatively, we could directly test for a pandas timestamp in a function similar to _is_pandas_dataframe, to keep these approaches consistent?
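
A possible shape for that suggestion, as a sketch rather than the PR's final code (note that a runtime_checkable Protocol only checks for the method's presence, so any object with an isoformat() method would match):

import sys
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class _PandasTimestamp(Protocol):
    # Structural type: pd.Timestamp satisfies this without pandas being imported here.
    def isoformat(self) -> str: ...


def _is_pandas_timestamp(obj: Any) -> bool:
    # Stricter companion to _is_pandas_dataframe: only true for a real
    # pd.Timestamp, and only if pandas has already been imported.
    pd = sys.modules.get("pandas")
    return pd is not None and isinstance(obj, pd.Timestamp)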

@@ -4,11 +4,11 @@

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype
Contributor


Let's make the tests also run without pandas installed, so that we can run the whole test suite once with pandas installed and once without. This prevents us from accidentally reintroducing a hard dependency in the future.
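
One way to set that up, sketched with standard pytest tooling (the marker and test names below are hypothetical):

import importlib.util

import pytest

# Skip pandas-specific tests when pandas is absent, so the same suite can also
# run in an environment without pandas installed.
requires_pandas = pytest.mark.skipif(
    importlib.util.find_spec("pandas") is None, reason="pandas is not installed"
)


@requires_pandas
def test_infer_dtype_with_pandas():
    import pandas as pd
    from pandas.api.types import infer_dtype

    assert infer_dtype(pd.Series([1, 2, 3]), skipna=False) == "integer"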

@dangotbanned
Contributor

dangotbanned commented Jun 25, 2024

@mattijn just throwing this in as a suggestion: have you considered narwhals?

narwhals is quite new but seems promising:

  • The author (@MarcoGorelli) is a maintainer for both pandas and polars
  • It has zero dependencies
  • Uses a single API, which could potentially simplify a lot of altair compatibility code
    • It could be worthwhile to review what they've implemented so far, and to what extent this covers altair's use case
    • I'm likely biased towards it as I'm a big fan of the polars API it is based upon

Even if you don't go down this route, they have collected a range of issues/PRs in narwhals-dev/narwhals#62 from projects interested in the same topic as this PR, which could prove to be a great resource regardless.
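
For a rough sense of the single-API point above, a minimal sketch (assuming narwhals' from_native entry point; this is not altair code):

import narwhals as nw


def column_names(native_df) -> list[str]:
    # The same call works for pandas, polars, and other supported backends,
    # without this module importing any of them directly.
    return nw.from_native(native_df).columns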

Side note:
I was initially thinking narwhals could help with #3213 (comment), as you could use nw.col - but AFAIK key-completions aren't in there yet.

Related

@MarcoGorelli
Contributor

MarcoGorelli commented Jun 25, 2024

Thanks @mattijn @dangotbanned for the ping! Indeed, this is exactly the kind of use-case Narwhals is designed for - happy to help out if there's any interest; any feature request is considered in scope if it can benefit a project of Altair's caliber!

And if not, no worries, I'm still happy to see that you're going down this route, thanks for all your work here 🙌

@dangotbanned
Contributor

Thanks @mattijn for the ping!

@mattijn if you arrive here confused, I was the one who summoned @MarcoGorelli 😄

@mattijn
Contributor Author

mattijn commented Jun 25, 2024

Interesting! Thanks for sharing! This is possibly both interesting for type inference of columns and serialization of the dataframe?

See also related historical issues/PRs:

I think @jonmmease is also well placed to judge whether this is of interest for altair. What do you think?

@dangotbanned
Contributor

dangotbanned commented Jun 26, 2024

@mattijn

Interesting! Thanks for sharing!

No problem

This is possibly both interesting for type inference of columns

Note

While I was writing this up, @MarcoGorelli opened #3445 so I'm not covering the pyarrow part here.

Original

From my understanding so far, part of this would be solved with translate_dtype.
However that only covers the case where the d/type is known.

altair/altair/utils/core.py

Lines 600 to 651 in 62ab14d

    # if data is specified and type is not, infer type from data
    if "type" not in attrs:
        if pyarrow_available() and data is not None and isinstance(data, DataFrameLike):
            dfi = data.__dataframe__()
            if "field" in attrs:
                unescaped_field = attrs["field"].replace("\\", "")
                if unescaped_field in dfi.column_names():
                    column = dfi.get_column_by_name(unescaped_field)
                    try:
                        attrs["type"] = infer_vegalite_type_for_dfi_column(column)
                    except (NotImplementedError, AttributeError, ValueError):
                        # Fall back to pandas-based inference.
                        # Note: The AttributeError catch is a workaround for
                        # https://github.com/pandas-dev/pandas/issues/55332
                        if _is_pandas_dataframe(data):
                            attrs["type"] = infer_vegalite_type(data[unescaped_field])
                        else:
                            raise
                    if isinstance(attrs["type"], tuple):
                        attrs["sort"] = attrs["type"][1]
                        attrs["type"] = attrs["type"][0]
        elif _is_pandas_dataframe(data):
            # Fallback if pyarrow is not installed or if pandas is older than 1.5
            #
            # Remove escape sequences so that types can be inferred for columns with special characters
            if "field" in attrs and attrs["field"].replace("\\", "") in data.columns:
                attrs["type"] = infer_vegalite_type(
                    data[attrs["field"].replace("\\", "")]
                )
                # ordered categorical dataframe columns return the type and sort order as a tuple
                if isinstance(attrs["type"], tuple):
                    attrs["sort"] = attrs["type"][1]
                    attrs["type"] = attrs["type"][0]
    # If an unescaped colon is still present, it's often due to an incorrect data type specification
    # but could also be due to using a column name with ":" in it.
    if (
        "field" in attrs
        and ":" in attrs["field"]
        and attrs["field"][attrs["field"].rfind(":") - 1] != "\\"
    ):
        raise ValueError(
            '"{}" '.format(attrs["field"].split(":")[-1])
            + "is not one of the valid encoding data types: {}.".format(
                ", ".join(TYPECODE_MAP.values())
            )
            + "\nFor more details, see https://altair-viz.github.io/user_guide/encodings/index.html#encoding-data-types. "
            + "If you are trying to use a column name that contains a colon, "
            + 'prefix it with a backslash; for example "column\\:name" instead of "column:name".'
        )
    return attrs

The infer_vegalite_type cases above depend on infer_dtype, a pandas C-extension function.

narwhals has maybe_convert_dtypes, which wraps pandas.NDFrame.convert_dtypes or is a no-op.

@MarcoGorelli was this restriction intentional?

import narwhals
import pandas

pandas.DataFrame.convert_dtypes
pandas.Series.convert_dtypes
narwhals.maybe_convert_dtypes # seems to only apply for DataFrame

These altair tests seem to only cover list, which approximates to pd.Series:

@pytest.mark.parametrize(
    "value,expected_type",
    [
        ([1, 2, 3], "integer"),
        ([1.0, 2.0, 3.0], "floating"),
        ([1, 2.0, 3], "mixed-integer-float"),
        (["a", "b", "c"], "string"),
        (["a", "b", np.nan], "mixed"),
    ],
)
def test_infer_dtype(value, expected_type):
    assert infer_dtype(value, skipna=False) == expected_type

Overall, these seem like minor, solvable issues to me

@MarcoGorelli
Contributor

@MarcoGorelli was this restriction intentional?

as in, the restriction of maybe_convert_dtypes to DataFrame? No reason, we could (and should!) do it for Series too

However that only covers the case where the d/type is known.

could you clarify please? when is the dtype not known?

Perhaps we should have a separate thread to discuss this so as to not risk losing focus on this PR too much. I think Narwhals support is related but orthogonal, and that the simplest way to go about things might be:

  1. get this working without Narwhals (as per this PR)
  2. once it's working, evaluate whether Narwhals can help keep down complexity / simplify maintenance

@dangotbanned
Contributor

dangotbanned commented Jun 26, 2024

@MarcoGorelli was this restriction intentional?

as in, the restriction of maybe_convert_dtypes to DataFrame? No reason, we could (and should!) do it for Series too

Yeah, that was what I meant.
I wasn't sure whether I had spotted an easy future extension to narwhals, or whether this was considered while implementing maybe_convert_dtypes but rejected for some reason I couldn't see from my brief look.

However that only covers the case where the d/type is known.

could you clarify please? when is the dtype not known?

Apologies, maybe that makes more sense when expanding to the code prior to the block I linked.
That function takes a shorthand parameter, which could be a column name with potentially additional information (see Encoding Shorthands).
This is used in combination with any metadata provided by data, which in theory must work for a generic dataframe with or without datatypes present.

Perhaps we should have a separate thread to discuss this so as to not risk losing focus on this PR too much. I think Narwhals support is related but orthogonal, and that the simplest way to go about things might be:

1. get this working without Narwhals (as per this PR)

2. once it's working, evaluate whether Narwhals can help keep down complexity / simplify maintenance

That would be fine with me. @mattijn, what are your thoughts on this plan?

@mattijn
Contributor Author

mattijn commented Jun 26, 2024

Makes sense to open a new issue with the suggestion of utilizing narwhals. Thanks!

Regarding this issue, the recent work within vegafusion to make imports lazy might also be of interest here. See vega/vegafusion#491.

Especially this approach:

pd = sys.modules.get("pandas", None)
pl = sys.modules.get("polars", None)

if pd is not None and isinstance(value, pd.DataFrame):
    ...
if pl is not None and isinstance(value, pl.DataFrame):
    ...

@binste
Contributor

binste commented Jun 27, 2024

Great to see all the activity on this topic and thanks to everyone chiming in! :) Regarding narwhals, not sure how it relates to https://github.com/data-apis/dataframe-api but as mentioned by others, best to continue this discussion separately and first strip Pandas out as a hard dependency.

The approach of scikit-learn/vegafusion with sys.modules looks efficient to me so I'd be in favor of adopting that one! @mattijn How would you like to proceed here? Would you prefer a more detailed review or are there some open items you first want to get implemented such as changing to sys.modules?

@mattijn
Contributor Author

mattijn commented Jul 15, 2024

Superseded by #3452

@mattijn mattijn closed this Jul 15, 2024
@jonmmease
Contributor

Superseded by #3452

Thanks for getting the ball rolling @mattijn!
