Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support extracting transformed chart data using VegaFusion #3081

Merged
merged 23 commits into from
Jun 14, 2023

Conversation

jonmmease
Copy link
Contributor

@jonmmease jonmmease commented Jun 9, 2023

Overview

As discussed in #3054, this PR ports the transformed_data logic from VegaFusion into Altair. See https://vegafusion.io/transformed_data.html for some background on VegaFusion's transformed_data function.

A standalone transformed_data utility function is added to altair/utils/transformed_data.py, and a chart._transformed_data method is added to Chart, FacetChart, LayerChart, ConcatChart, HConcatChart, and VConcatChart (but not RepeatChart, as I'm not sure how to do this yet). I added a leading underscore to the chart method so that we can use this for internal testing (and documentation) before making it a public, but happy to reconsider that decision.

Example

import altair as alt
from vega_datasets import data

source = data.movies.url

chart = alt.Chart(source).mark_bar().encode(
    alt.X("aggregate_gross:Q").aggregate("mean").title(None),
    alt.Y("ranked_director:N")
        .sort(op="mean", field="aggregate_gross", order="descending")
        .title(None)
).transform_aggregate(
    aggregate_gross='mean(Worldwide_Gross)',
    groupby=["Director"],
).transform_window(
    rank='row_number()',
    sort=[alt.SortField("aggregate_gross", order="descending")],
).transform_calculate(
    ranked_director="datum.rank < 10 ? datum.Director : 'All Others'"
).properties(
    title="Top Directors by Average Worldwide Gross",
)
chart

visualization

chart._transformed_data()
|    | ranked_director   |   mean_aggregate_gross |
|---:|:------------------|-----------------------:|
|  0 | All Others        |            8.87602e+07 |
|  1 | James Cameron     |            8.29781e+08 |
|  2 | George Lucas      |            6.73577e+08 |
|  3 | Peter Jackson     |            5.95566e+08 |
|  4 | Andrew Stanton    |            7.00319e+08 |
|  5 | David Yates       |            9.37984e+08 |
|  6 | Carlos Saldanha   |            7.69293e+08 |
|  7 | Andrew Adamson    |            6.43134e+08 |
|  8 | David Slade       |            6.88155e+08 |
|  9 | Pete Docter       |            7.31305e+08 |

Testing

I added tests for the expected transformed data length and (a subset of) the columns for all of the example charts that are currently supported

Limitations

VegaFusion doesn't support every Vega transform, and it doesn't support every option for all of the transforms it does support. When an unsupported transform is required, VegaFusion raises an, admittedly not very helpful, error like this:

ValueError: Pre-transform error: Requested variable (Variable { name: "data_2", namespace: Data }, [])
 requires transforms or signal expressions that are not yet supported

Tracking improvements to this error message in vega/vegafusion#336. Tracking improvements to the documentation of supported transforms in vega/vegafusion#337.

@jonmmease jonmmease requested review from mattijn and binste June 9, 2023 13:07
# Compile to Vega and extract inline DataFrames
with data_transformers.enable("vegafusion-inline"):
vega_spec = chart.to_dict(format="vega")
inline_datasets = get_inline_datasets_for_spec(vega_spec)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to bring the vegafusion-inline transformer and the get_inline_datasets_for_spec function over to Altair eventually, but will save that for a future PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own understanding, does the vegafusion-inline transformer use the entrypoint system to register itself or why is it available here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... ]
... }
>>> get_group_mark_for_scope(spec, (1,))
{'type': 'group', 'marks': [{'type': 'rect'}]}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does hatch run test automatically run doctests?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hatch run test runs the following:

black --diff --color --check .
ruff check .
mypy altair tests
python -m pytest --pyargs --doctest-modules tests

See https://github.com/altair-viz/altair/blob/master/pyproject.toml#L104-L107

Copy link
Contributor

@binste binste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonmmease, this is going to be a great feature! I don't have a preference if we keep it as ._transformed_data for now until we have some documentation as well.

altair/utils/transformed_data.py Outdated Show resolved Hide resolved
altair/utils/transformed_data.py Outdated Show resolved Hide resolved
Comment on lines 58 to 60
DataFrame or list of DataFrame
If input chart is a Chart or Facet Chart, returns a DataFrame of the transformed data
Otherwise, returns a list of DataFrames of the transformed data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DataFrame or list of DataFrame
If input chart is a Chart or Facet Chart, returns a DataFrame of the transformed data
Otherwise, returns a list of DataFrames of the transformed data
Pandas DataFrame, list of Pandas DataFrames or None
If input chart is a Chart or Facet Chart, returns a Pandas DataFrame of the transformed data if it exists and None if the chart does not have a dataset
Otherwise, returns a list of Pandas DataFrames of the transformed data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 75cf958.

As a side note, pandas is supposed to always be lower case, even at the start of a sentence. See https://pandas.pydata.org/about/citing.html.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, thanks!

# Compile to Vega and extract inline DataFrames
with data_transformers.enable("vegafusion-inline"):
vega_spec = chart.to_dict(format="vega")
inline_datasets = get_inline_datasets_for_spec(vega_spec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my own understanding, does the vegafusion-inline transformer use the entrypoint system to register itself or why is it available here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making this module "internal/private" by renaming it to _transformed_data.py? Almost everything in Altair so far is public which makes further development difficult in some areas (and bloats up the top-level API and the "API reference" section in the documentation, see #2918, although that would not be the case here). I'd be in favour that we are more selective moving forward with new features and mark modules or functions as private if we don't expect a user to access them directly. Users should use the Chart.transformed_data method anyway and it would make it easier to iterate on this once you start integrating more of the VegaFusion functionality.

from altair.utils.schemapi import Undefined

Scope = Tuple[int, ...]
FacetMapping = Dict[Tuple[str, Scope], Tuple[str, Scope]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate that you're adding type hints! :)

altair/utils/transformed_data.py Outdated Show resolved Hide resolved
DataFrame
Transformed data as a DataFrame
"""
from ...utils.transformed_data import transformed_data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing codebase uses relative imports in many places so this would be consistent but I'm in favour of switching to absolute ones. They are easier to read, especially as Altair has many modules with the same names on different levels.

source = pkgutil.get_data(examples_methods_syntax.__name__, filename)
chart = eval_block(source)
df = chart._transformed_data()
assert len(df) == rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it makes sense to also check if this dataframe no nulls? assert df.notnull().all().all()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. When the input DataFrame has nulls it's possible for these to be pass through to the transformed data. Vega-Lite usually filters null values for the columns that are used in the chart, but transformed_data returns all of the columns, so the unused columns can still have nulls.

jonmmease and others added 6 commits June 10, 2023 12:09
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
Co-authored-by: Stefan Binder <binder_stefan@outlook.com>
@jonmmease
Copy link
Contributor Author

Thanks for the review @binste! I think I've addressed all of your feedback.

Copy link
Contributor

@binste binste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks! :)

Copy link
Contributor

@mattijn mattijn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonmmease! I've placed a few more questions inline.

Scope = Tuple[int, ...]
FacetMapping = Dict[Tuple[str, Scope], Tuple[str, Scope]]

MAGIC_CHART_NAME = "_vf_mark{}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so fond of the word magic. Can we describe what it actually is? And replace all places where the word is used?

Is this the name of each view? I think we name if elsewhere view+_+<incrementing value>. Maybe align? Or, if different, add a trailing underscore before the incrementing number.

Or is this the reference name of the dataset within a view?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, happy to rename this. This is the name that is applied to each unnamed Chart/Subchart/View.

Changed to:

VIEW_NAME = "altair_view_{}"

in 280eb0f.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, if this also defines a view name, could we also use _get_name()? Or will you run into conflicts then?
_get_name() is defined on the Chart class, see here: https://github.com/altair-viz/altair/blob/master/altair/vegalite/v5/api.py#L2569-L2571.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't familiar with _get_name(). Yeah, I think this would work. I'll give it a try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 88fceb5

) from err

if isinstance(chart, Chart):
# Add dummy mark if None specified to satisfy Vega-Lite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename into template mark if this section is really necessary, but I was wondering if we need this? Shouldn't this raise a warning or error instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded comment in aabf5d6.

The reason to do this is so that it's possible to call chart.transformed_data() on a chart with transforms even if no mark has been defined. For example:

import pandas as pd
chart = alt.Chart(
    pd.DataFrame({"a": [1, 2, 3], "b": ["A", "BB", "CCC"]})
)
chart.transform_filter("datum.a > 1")._transformed_data()
a b
0 2 BB
1 3 CCC

Not sure if this will be a common use case, but I think it's neat to be able to use an Altair chart kind of like a DataFrame this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a chart anymore😉! Thinking aloud, without judging: I always thought it was good that dataframe libraries make a clear distinction between dataframe operations and visualizations. With the powerful options introduced in this PR, altair is slowly turning into a dataframe library as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea I've had is that Altair could provide a alt.Dataset object that supports the .transform_* methods and the mark_* methods. The .transform_* methods would return a new Dataset and the .mark_* methods would return a Chart.

This is somewhat similar to how HoloViews works (https://holoviews.org/getting_started/Tabular_Datasets.html)

Copy link
Contributor

@mattijn mattijn Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this concept. I don't think it is possible to run transforms in Vega-Lite without defining a mark.

Naming is also an issue. alt.Data and alt.Dataset already exists. Maybe we can extend these? Worth raising a follow up issue on this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this it is possible to run transforms in Vega-Lite without defining a mark

Yes, this is why I add a mark when none is defined before calling vega-lite through vl-convert.

Naming is also an issue. alt.Data and alt.Dataset already exists

Or maybe just alt.DataFrame? I'll open a follow up issue.

subcharts = chart.vconcat
elif isinstance(chart, ConcatChart):
subcharts = chart.concat
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make a special raise for RepeatChart? Or is this not applicable here? It may feel like an omission.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, RepeatChart doesn't have a _transformed_data() method, so it won't get this far. We could add a _transformed_data() method to RepeatChart that always raises an exception that it's not supported. But this might also be confusing (why have the method as all if it never works). What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, why can it never works? A NotImplementedError seems OK for me for now. It is also confusing if RepeatChart is the only Chart without a transformed_data() method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, a NotImplementedError sounds like a good approach. I'll add that. I think it's probably possible to support RepeatChart in the future, I just haven't work through quite what it means, and how to process the structure of the Vega that Vega-Lite produces in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a738408

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that all top-level charts have the ._transformed_data method, you could add it to TopLevelMixin so there is no need anymore to repeat it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. The return type in the type signature is different between Chart/FacetChart (where it's Optional[DataFrameLike], and the others (where it's List[DataFrameLike].

For the standalone utility transformed_data function I used typeing.overload to express the two signatures. But I wasn't sure how to accomplish this in a superclass method when the type signature depends on the type of self. Do you have any ideas on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Just tried and got this far with defining _transformed_data on TopLevelMixin:

   @overload
    def _transformed_data(
        self: Union["Chart", "FacetChart", "RepeatChart"],
        row_limit: Optional[int] = ...,
        exclude: Optional[Iterable[str]] = ...,
    ) -> Optional[DataFrameLike]:
        ...

    @overload
    def _transformed_data(
        self: Union["LayerChart", "ConcatChart", "HConcatChart", "VConcatChart"],
        row_limit: Optional[int] = ...,
        exclude: Optional[Iterable[str]] = ...,
    ) -> List[DataFrameLike]:
        ...

    def _transformed_data(
        self,
        row_limit=None,
        exclude=None,
    ):

But then mypy complains, rightly so, that these types are not superclasses of TopLevelMixin:

altair/vegalite/v5/api.py:2407: error: The erased type of self "Union[altair.vegalite.v5.api.Chart, altair.vegalite.v5.api.FacetChart, altair.vegalite.v5.api.RepeatChart]" is not a supertype of its class "altair.vegalite.v5.api.TopLevelMixin"  [misc]
altair/vegalite/v5/api.py:2415: error: The erased type of self "Union[altair.vegalite.v5.api.LayerChart, altair.vegalite.v5.api.ConcatChart, altair.vegalite.v5.api.HConcatChart, altair.vegalite.v5.api.VConcatChart]" is not a supertype of its class "altair.vegalite.v5.api.TopLevelMixin"  [misc]

According to this GH issue and some others, the solution might be to create protocols whcih identify the individual chart types but that seems to be more complicated then just keeping it as it is now -> I'm ok with leaving it.

self,
row_limit: Optional[int] = None,
exclude: Optional[Iterable[str]] = None,
) -> List[pd.DataFrame]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output is always a list of pandas dataframes? Do we want this to eventually support other dataframes-a-like as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now in VegaFusion, these results can actually be an Arrow tables (if the input Chart(s) wrap arrow tables), a Polars DataFrames (if the input Chart(s) wrap Polars DataFrames), or pandas DataFrames (all other cases).

It seemed to me that this would make for a pretty awkward type signature, so I left it as just pandas. It could be a Union, but arrow and polars are optional dependencies so I'm not sure how to have the Union include different options based on what's installed. Do you have any ideas here @mattijn or @binste?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define a DataFrameLike type (elsewhere in the codebase) that can refer to any object with a .__dataframe__ attribute, regardless of whether it's a pandas DataFrame, polars DataFrame, or any other object? And use that type here? Would that help?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, I'll look into that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 6f43d6b

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you double check this @binste? Should a type definition comes back in the root of altair (see changes in __init__)?

Copy link
Contributor

@binste binste Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of renaming it to _DataFrameLike so no one starts using it themselves and it won't show up in __init__. I'm doing the same in #2976. Gives us the flexibility to gather these types at a later stage into a types.py module if that makes sense or move them around/redefine otherwise.

Apart from that I like the protocol usage. We can narrow down the type further once we figure out how to best do that but at least it's better then Any while being correct. At a later stage (not in this PR), we can try to use generics to define the return type of transformed_data based on the type of the .data attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also move the Protocol import into a if sys.version_info >= (3, 8): statement? It was introduced in Python 3.8 and we try to import from the official library as soon as it's available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll move the protocol.

I might not have the right understanding here, but since DataFrameLike is in the return type of the public method, doesn't the type need to be public as well? Otherwise, Altair users wouldn't be able to work with it in their own typed code.

@mattijn
Copy link
Contributor

mattijn commented Jun 10, 2023

One comment. Naming is hard, but I’m not sure if transformed_data covers all. I think this function also will return data if there are no transforms defined in the specification. Maybe just name it dataframes() or something similar?

@jonmmease
Copy link
Contributor Author

One comment. Naming is hard, but I’m not sure if transformed_data covers all. I think this function also will return data if there are no transforms defined in the specification. Maybe just name it dataframes() or something similar?

I chose .transformed_data() to mirror the .data property of a chart. It is true that this method will return data even if there are no transforms, but I think it's appropriate for a chart with no transforms to return the original dataset. @binste do you have any thoughts on the naming here?

@mattijn
Copy link
Contributor

mattijn commented Jun 11, 2023

What about to_df() or to_data(). Limitations (that I'm willing to accept):

  • to_df() is maybe too narrow once we support array data too?
  • to_data() is maybe too wide, how to know upfront what will you get?

@jonmmease jonmmease mentioned this pull request Jun 11, 2023
1 task
@jonmmease
Copy link
Contributor Author

What about to_df() or to_data()

I'm not a fan of .to_data(), since we already have a .data property on the Chart (I think it would be confusing why they are both there, and why they mean different things). .to_df() is interesting, though it would feel odd to me if we eventually added an alt.DataFrame class as suggested in #3083.

If we think of the transforms on a chart as being analogous to a lazy data frame, we could consider names like .collect() (used by Polars, but maybe not a great idea since Vega has a collect transform), .compute() (used by Dask), or .evaluate() or .eval().

I like .eval() since it's a natural to say we're "evaluating the transforms" in the chart, and it's nice and short.

@binste
Copy link
Contributor

binste commented Jun 11, 2023

Agree, naming is hard... If GitHub CoPilot could solve this I would be very happy ;) Two points from my side:

  • For me, something like to_data does not convey that it's not just the original data but the one after all transformations have been applied. transformed_data works for me even if no transforms are defined as it still technically is the "data with all transformations applied" but maybe I'm just biased because I now used it in vegafusion for a while and got used to it.
    • -> I think it's great if it has something like transformed/evaluated/eval in the name
  • On the other hand, it should also have data/df/... in the name as else it's unclear what is being evaluated. eval() could mean that the Chart specification is evaluated to Vega-Lite or even to Vega with VegaFusion.

Hence, I like transformed_data. Synonyms for transformed could be used such as evaluated but transformed maps closest to the transformation concept: alt.Chart().transform_pivot().transformed_data vs. alt.Chart().transform_pivot().evaluated_data().

@mattijn
Copy link
Contributor

mattijn commented Jun 11, 2023

I liked the eval(), but agree with @binste that the meaning is ambiguous. Regarding the following:

Right now in VegaFusion, these results can actually be an Arrow tables (if the input Chart(s) wrap arrow tables), a Polars DataFrames (if the input Chart(s) wrap Polars DataFrames), or pandas DataFrames (all other cases).

Should the output type be identical to the input type (eg. what is the input type of a url)? Otherwise can do .pl() for polars, .arrow() for pyarrow and .df() for pandas, like duckdb is doing?

Anyhow, no strong feelings for or against the suggestions.

@binste
Copy link
Contributor

binste commented Jun 11, 2023

That's a good point. Makes it more stable if a user can declare what they want in return independent of what they put in. As another idea: transformed_data(return_type="pandas"/"polars"/"arrow") or df_type.

@mattijn
Copy link
Contributor

mattijn commented Jun 12, 2023

To summarize, up until now we have discussed:

  • .transformed_data()
  • .to_data()
  • .to_df()
  • .collect()
  • .compute()
  • .evaluate()
  • .eval()
  • .evaluated_data()
  • .df()
  • .pl()
  • .arrow()

Currently in Altair we do it as such:

  • .save(format=***)
  • .to_dict(format=***) # defaults to vega-lite option for vega
  • .to_json(format=***) # defaults to vega-lite option for vega

So than it could be consistent to say

  • transformed_data(format=*** OR dtype=***), # eg defaults to auto (equals input-dtype) option for manually define dtype

Personally, I like it when there is a grouping using to_***().
Using the autocomplete I can discover what this object can be evaluated into (similar one can autocomplete to .mark_***() and transform_***() to see what is possible).

For example:

  • to_svg()
  • to_png()
  • to_pdf(()
  • to_df()
  • to_pl()
  • to_arrow()
  • to_vgjson()
  • to_vljson()
  • to_json() # is vega-lite
  • to_vgdict()
  • to_vldict()
  • to_dict() # is vega-lite

I'm aware that the latter (using to_***()) is probably better to discuss in a different issue. So also tempted to say, leave it as is (.transformed_data()).

@jonmmease
Copy link
Contributor Author

Thanks for the thorough summary @mattijn. It sounds like we've all landed on leaving the name as transformed_data. I'll still leave it as an internal method for this PR, so it's not completely final, but let's move forward as-is.

I think the last remaining question for this PR is whether the DataFrameLike protocol should be public or private. As I mentioned above, my assumption was that it should be public since it will be used in the return type of public methods. But I'm honestly not that familiar with mypy conventions.

@binste
Copy link
Contributor

binste commented Jun 12, 2023

It's a great point that the types should be public as well. I've never type hinted a publicly accessible library so if someone has more experience please chime in but it sounds reasonable to me. I'm still unsure how to best organise these types. In #2976 I actually wanted to introduce the same protocol. I'll probably switch to yours in case this PR is merged first.

How about this: We mark these types as private for now to give us the freedom to refactor. Altair is not yet marked as a typed package (i.e. it does not have a py.typed file in its repo, see the mypy docs). Once we have typed most of the public API and add that file, we convert the private types to public ones and make it official by adding it to the release notes that this is ready. As this might still be a few weeks out, for any release until then, users won't have any downside of these private types being in the codebase but it's clear that they should not yet be used. I added a note to do this in #2951 which I'm working through.

Thanks @mattijn for the overview, this is helpful! I also find the to_* methods convenient but I'm not sure yet how to best do it. I'm also ok with leaving as-is and moving this forward.

@jonmmease
Copy link
Contributor Author

Sounds good @binste, I made the protocol private in c665e8f.

I think that's everything. I'll let this sit until tomorrow in case anything else comes to mind, but if I don't hear anything I'll merge tomorrow. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants