Description
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
Right now the string aliases for our types are inconsistent:
>>> import pandas as pd
>>> pd.Series(range(3), dtype="int8") # NumPy type
>>> pd.Series(range(3), dtype="Int8") # Pandas extension type
>>> pd.Series(range(3), dtype="int8[pyarrow]") # Arrow type
Strings have a similar inconsistency with "string", "string[pyarrow]", and "string[pyarrow_numpy]".
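For example, all three of these spellings are accepted today (assuming pandas 2.x with pyarrow installed):
>>> pd.Series(["a"], dtype="string")                  # pandas extension type, python storage
>>> pd.Series(["a"], dtype="string[pyarrow]")         # pyarrow-backed extension type
>>> pd.Series(["a"], dtype="string[pyarrow_numpy]")   # pyarrow storage, NumPy nullability semantics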
Feature Description
I think we should create "int8[numpy]" and "int8[pandas]" aliases to stay consistent with pyarrow. This also has the advantage of decoupling "int8" from NumPy, so perhaps in the future we could let the backend setting determine whether NumPy or pyarrow types are returned.
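(A sketch of the proposal next to current behavior; the [numpy] and [pandas] suffixes do not exist in pandas today:)
>>> pd.Series(range(3), dtype="int8[pyarrow]")  # works today
>>> pd.Series(range(3), dtype="int8[numpy]")    # proposed here; raises today
>>> pd.Series(range(3), dtype="int8[pandas]")   # proposed here; raises today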
The pattern thus becomes "data_type[backend]", with the exception of "string[pyarrow_numpy]", which combines the backend and nullability semantics together. I am less sure what to do in that case; maybe even that should be called "string[pyarrow, numpy]", where the second argument is the nullability?
In any case, I am just hoping we can start to detach the logical type from the physical storage / nullability semantics with a well-defined pattern.
Alternative Solutions
n/a
Additional Context
No response
Activity
WillAyd commented on Apr 4, 2024
Meant to tag @jorisvandenbossche
jbrockmendel commented on Apr 8, 2024
I like this idea, though as I mentioned at the sprint, I think we should avoid "backend". Maybe dtype "family"?
WillAyd commented on Apr 9, 2024
Maybe "type provider"?
WillAyd commented on Apr 10, 2024
Thinking through this some more: what I suggested above, type_category[type_provider, nullability_provider], won't always work as a pattern, because there are still types that accept more arguments, e.g. datetime, pa.list_, pa.dictionary, etc. (see the constructor examples below). I am wondering now whether it is even worth trying to support string aliases, or whether we should push users towards more explicit dtype construction. This would be a change from where we are today but could be better in the long run (?)
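For instance, these are today's explicit constructors for parametrized types, none of which maps cleanly onto a data_type[backend] string:

import pandas as pd
import pyarrow as pa

pd.DatetimeTZDtype(unit="ns", tz="UTC")                 # parametrized by unit and timezone
pd.ArrowDtype(pa.list_(pa.int64()))                     # parametrized by a value type
pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string()))   # parametrized by index and value types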
WillAyd commented on Apr 10, 2024
As an exercise I tried to map out all of the types that pandas supports today (or reasonably could support in the near term) and place them in a hierarchy. Here is what I was able to come up with:
Tagging @pandas-dev/pandas-core in case this is of use to the larger team
Graphviz source used to build this:
digraph type_graph {
node [shape=box];
"type"
"type" -> "scalar"
"scalar" -> "numeric"
"numeric" -> "integral"
"integral" -> "signed"
subgraph cluster_signed {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "np.int8";
node [fillcolor="lightgreen" style=filled] "np.int16";
node [fillcolor="lightgreen" style=filled] "np.int32";
node [fillcolor="lightgreen" style=filled] "np.int64";
node [fillcolor="lightblue" style=filled] "pd.Int8Dtype";
node [fillcolor="lightblue" style=filled] "pd.Int16Dtype";
node [fillcolor="lightblue" style=filled] "pd.Int32Dtype";
node [fillcolor="lightblue" style=filled] "pd.Int64Dtype";
node [fillcolor="lightgray" style=filled] "pa.int8";
node [fillcolor="lightgray" style=filled] "pa.int16";
node [fillcolor="lightgray" style=filled] "pa.int32";
node [fillcolor="lightgray" style=filled] "pa.int64";
"np.int8" -> "np.int16" -> "np.int32" -> "np.int64"
"pd.Int8Dtype" -> "pd.Int16Dtype" -> "pd.Int32Dtype" -> "pd.Int64Dtype"
"pa.int8" -> "pa.int16" -> "pa.int32" -> "pa.int64"
}
"signed" -> "pd.Int8Dtype" [arrowsize=0]
"integral" -> "unsigned"
subgraph cluster_unsigned {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "np.uint8";
node [fillcolor="lightgreen" style=filled] "np.uint16";
node [fillcolor="lightgreen" style=filled] "np.uint32";
node [fillcolor="lightgreen" style=filled] "np.uint64";
node [fillcolor="lightblue" style=filled] "pd.UInt8Dtype";
node [fillcolor="lightblue" style=filled] "pd.UInt16Dtype";
node [fillcolor="lightblue" style=filled] "pd.UInt32Dtype";
node [fillcolor="lightblue" style=filled] "pd.UInt64Dtype";
node [fillcolor="lightgray" style=filled] "pa.uint8";
node [fillcolor="lightgray" style=filled] "pa.uint16";
node [fillcolor="lightgray" style=filled] "pa.uint32";
node [fillcolor="lightgray" style=filled] "pa.uint64";
"np.uint8" -> "np.uint16" -> "np.uint32" -> "np.uint64"
"pd.UInt8Dtype" -> "pd.UInt16Dtype" -> "pd.UInt32Dtype" -> "pd.UInt64Dtype"
"pa.uint8" -> "pa.uint16" -> "pa.uint32" -> "pa.uint64"
}
"unsigned" -> "pd.UInt8Dtype" [arrowsize=0]
"numeric" -> "floating point"
subgraph cluster_floating {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "np.float32";
node [fillcolor="lightgreen" style=filled] "np.float64";
node [fillcolor="lightblue" style=filled] "pd.Float32Dtype";
node [fillcolor="lightblue" style=filled] "pd.Float64Dtype";
node [fillcolor="lightgray" style=filled] "pa.float32";
node [fillcolor="lightgray" style=filled] "pa.float64";
"np.float32" -> "np.float64"
"pd.Float32Dtype" -> "pd.Float64Dtype"
"pa.float32" -> "pa.float64"
}
"floating point" -> "pd.Float32Dtype" [arrowsize=0]
"numeric" -> "fixed point"
subgraph cluster_fixed {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.decimal128";
node [fillcolor="lightgray" style=filled] "pa.decimal256";
"pa.decimal128" -> "pa.decimal256"
}
"fixed point" -> "pa.decimal128" [arrowsize=0]
"scalar" -> "boolean"
subgraph cluster_boolean {
edge[style=invis]
node[fillcolor="lightgreen" style=filled] "np.bool_";
node[fillcolor="lightblue" style=filled] "pd.BooleanDtype";
node[fillcolor="lightgray" style=filled] "pa.bool_";
}
"boolean" -> "pd.BooleanDtype" [arrowsize=0]
"scalar" -> "temporal"
"temporal" -> "date"
subgraph cluster_date {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.date32"
node [fillcolor="lightgray" style=filled] "pa.date64"
"pa.date32" -> "pa.date64"
}
"date" -> "pa.date32" [arrowsize=0]
"temporal" -> "datetime"
subgraph cluster_timestamp {
edge [style=invis]
node [fillcolor="lightblue" style=filled] "datetime64[unit, tz]";
node [fillcolor="lightgray" style=filled] "pa.timestamp(unit, tz)";
"datetime64[unit, tz]" -> "pa.timestamp(unit, tz)" [style=invis]
}
"datetime" -> "datetime64[unit, tz]" [arrowsize=0]
"temporal" -> "duration"
subgraph cluster_duration {
edge [style=invis]
node [fillcolor="lightblue" style=filled] "timedelta64[unit]";
node [fillcolor="lightgray" style=filled] "pa.duration(unit)";
"timedelta64[unit]" -> "pa.duration(unit)" [style=invis]
}
"duration" -> "timedelta64[unit]" [arrowsize=0]
"temporal" -> "interval"
"pa.month_day_nano_interval" [fillcolor="lightgray" style=filled]
"interval" -> "pa.month_day_nano_interval"
"scalar" -> "binary"
subgraph cluster_binary {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.binary";
node [fillcolor="lightgray" style=filled] "pa.large_binary";
"pa.binary" -> "pa.large_binary"
}
"binary" -> "pa.binary"
"binary" -> "string"
subgraph cluster_string {
edge [style=invis]
node [fillcolor="lightgreen" style=filled] "object";
node [fillcolor="lightgreen" style=filled] "np.StringDType";
node [fillcolor="lightblue" style=filled] "pd.StringDtype";
node [fillcolor="lightgray" style=filled] "pa.string";
node [fillcolor="lightgray" style=filled] "pa.large_string";
node [fillcolor="lightgray:lightgreen" style=filled] "string[pyarrow_numpy]";
"object" -> "np.StringDType"
"pa.string" -> "pa.large_string"
}
"string" -> "pa.string" [arrowsize=0]
"scalar" -> "categorical"
subgraph cluster_categorical {
edge [style=invis]
node [fillcolor="lightblue" style=filled] "pd.CategoricalDtype";
node [fillcolor="lightgray" style=filled] "pa.dictionary(index_type, value_type)";
"pd.CategoricalDtype" -> "pa.dictionary(index_type, value_type)"
}
"categorical" -> "pd.CategoricalDtype" [arrowsize=0]
"scalar" -> "sparse"
"pd.SparseDtype(dtype)" [fillcolor="lightblue" style=filled];
"sparse" -> "pd.SparseDtype(dtype)" [arrowsize=0]
"type" -> "aggregate"
"aggregate" -> "list"
subgraph cluster_list {
edge [style=invis]
node [fillcolor="lightgray" style=filled] "pa.list_(value_type)";
node [fillcolor="lightgray" style=filled] "pa.large_list(value_type)";
"pa.list_(value_type)" -> "pa.large_list(value_type)"
}
"list" -> "pa.list_(value_type)" [arrowsize=0]
"aggregate" -> "struct"
"pa.struct(fields)" [fillcolor="lightgray" style=filled]
"struct" -> "pa.struct(fields)" [arrowsize=0]
"aggregate" -> "dictionary"
"dictionary" -> "pa.dictionary(index_type, value_type)" [arrowsize=0]
"pa.map(index_type, value_type)" [fillcolor="lightgray" style=filled]
"dictionary" -> "pa.map(index_type, value_type)" [arrowsize=0]
}
Dr-Irv commented on Apr 10, 2024
From a typing perspective, supporting all the different string versions of valid types for dtype is a PITA in pandas-stubs, so I'd be supportive of just having a class hierarchy to represent valid dtypes. Having said that, if we are to deprecate the strings, we'd probably need a PDEP for that.
mroeschke commented on Apr 10, 2024
I would be supportive of this as well. Especially for dtypes-as-strings that take parameters (timezone types, decimal types), it would be great to avoid parsing strings in order to construct dtype objects.
jorisvandenbossche commented on Apr 10, 2024
To your original point, I very much agree with this (at least for the physical storage; not necessarily for the nullability semantics, because I personally think we should move to having just one nullability semantic, but that's a topic for another PDEP).

This is a topic that I brought up last summer during the sprint but never got around to writing up publicly. The summary is that I would like to see us move to having just "pandas" dtypes, at least for the majority of users who don't need to know the lower-level details.

Most users just need to know that they have e.g. an "int64" or "string" column, and shouldn't have to care whether that is stored under the hood using a single numpy array, a combination of numpy arrays (our masked arrays), or a pyarrow array.

The current string aliases for non-default dtypes are, I think, mostly a band-aid to let people specify those dtypes more easily, and I fully agree they aren't very pretty. I do think it will be hard (and maybe not even desirable) to fully do away with string aliases, though, at least for the default data types, because they are so widespread.
But IMO we should at least make the alternative to string aliases, constructing dtypes programmatically, better supported and more consistent (e.g. so a user can just do pd.Series(..., dtype=pd.int64()) or pd.Series(..., dtype=pd.Int64Dtype()) and get the default int64 dtype based on their settings, which currently is the numpy dtype but could also be a masked or pyarrow dtype depending on those settings).

WillAyd commented on Apr 10, 2024
I was thinking this as well. So maybe then for each category in the type hierarchy above we have wrappers with signatures like:
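(A minimal sketch of what one such wrapper could look like; the function name and the set of dtype_backend values are illustrative, not an existing pandas API:)

import numpy as np
import pandas as pd
import pyarrow as pa

def int64(*, dtype_backend: str = "numpy"):
    """Return an int64 dtype for the requested backend (hypothetical helper)."""
    if dtype_backend == "numpy":
        return np.dtype("int64")
    if dtype_backend == "pandas":
        return pd.Int64Dtype()
    if dtype_backend == "pyarrow":
        return pd.ArrowDtype(pa.int64())
    raise ValueError(f"unknown dtype_backend: {dtype_backend}")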
I know @jbrockmendel prefers something besides dtype_backend, but I'm keeping that name for now for consistency with the I/O methods.
WillAyd commented on Apr 10, 2024
Yeah, this would be a long process. I think what's hard about string aliases is that they only work for very basic types. It definitely has been, and would continue to be, relatively easy for users to just say "int64" and get a 64-bit integer irrespective of what backs it, but if a user then wants to create a list column they can't just say "list".
I think users will end up with a Frankenstein of string aliases alongside arguments like dtype=pd.ArrowDtype(pa.list_(pa.string())), which I find confusing.

Dr-Irv commented on Apr 10, 2024
I agree. One possibility to consider is to limit the string aliases to simple types: "int", "float", "string", "object", "datetime", "timedelta". These would default to something based on the default backends, and even to default sizes (e.g., "int" means "int64"), as I'd guess only a few of the strings are really used that often.
jorisvandenbossche commented on Apr 10, 2024
I found the notebook that I presented at the sprint last summer. It's a little off topic for the discussion about string aliases specifically, but I think it is relevant to the bigger picture (which we need to look at anyway if we're considering moving away from string aliases), so I'm just dumping the content here (updated a little bit).
I would like to have "pandas data types" with a consistent interface (and at least I think we should allow you to write code that is agnostic to the underlying implementation).
For example, for datetime-like data, we currently have:
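(A sketch of those current spellings; all of the constructors below exist in pandas today:)

import numpy as np
import pandas as pd
import pyarrow as pa

np.dtype("datetime64[ns]")                   # numpy dtype, timezone-naive
pd.DatetimeTZDtype(unit="ns", tz="UTC")      # pandas extension dtype with a timezone
pd.ArrowDtype(pa.timestamp("ns", tz="UTC"))  # pyarrow-backed dtype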
Another example: we currently have pd.ArrowDtype(pa.date64()) or "date64[pyarrow]", but if we want to enable a date dtype by default, users shouldn't need to know this is stored using pyarrow under the hood, so this could be pd.date() or "date"?

Logical data types vs physical data types:
For pandas, I think most users should care about logical data types, and not too much about the physical data type (we can choose the best default, and advanced users can give hints about which one to use for performance optimizations).
Assuming we want a single pandas interface to all dtypes, we need to decide:
- functional constructors (pd.string()) vs class constructors (pd.StringDtype())
- backend-parametrized classes (pd.StringDtype(backend="arrow"), pd.StringDtype(backend="numpy")) vs separate classes based on the physical storage (what we have right now with e.g. pd.Int64Dtype() and pd.ArrowDtype(pa.int64()))

Either we use "backend-parametrized" classes, or we hide the classes a bit more and use dtype constructor factory functions:
-> but that means choosing the approach of the current StringDtype with different backends, instead of ArrowDtype("string").

Or we could have different classes, but then we definitely need the functional interface and dtype-checking helpers (because isinstance then doesn't work):
(and maybe pd.string(backend="arrow", storage="string_view")?)

In this case we are more free to keep whatever class structure we want under the hood.
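(A sketch of that functional interface plus a checking helper; pd.StringDtype and pd.ArrowDtype exist today, while the string() and is_string() helpers here are hypothetical:)

import pandas as pd
import pyarrow as pa

def string(backend: str = "python"):
    """Hypothetical factory: return a string dtype for the requested backend."""
    if backend == "python":
        return pd.StringDtype(storage="python")
    if backend == "pyarrow":
        return pd.ArrowDtype(pa.string())
    raise ValueError(f"unknown backend: {backend}")

def is_string(dtype) -> bool:
    """Hypothetical dtype-checking helper: True for any concrete string dtype class."""
    return isinstance(dtype, pd.StringDtype) or (
        isinstance(dtype, pd.ArrowDtype) and pa.types.is_string(dtype.pyarrow_dtype)
    )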
jbrockmendel commented on Apr 10, 2024
I forget the details, but remember finding Joris's presentation at the sprint compelling.
WillAyd commented on Apr 10, 2024
This is an interesting example, but do we even need to support pyarrow's date64? I'm not really clear what advantage it has over date32. Per the hierarchy above I would just abstract this as pd.date(), which under the hood would only use pyarrow's date32. It would be a suboptimal API if we had to do something like pd.date(backend="pyarrow", size=32), but I'm not sure how likely that is.

Outside of date types, I do see that issue with strings, where dtype_backend="pyarrow" would leave it open to interpretation whether you wanted pa.string(), pa.large_string(), or any of the other pyarrow string types you already mentioned.

In an ideal world I would be indifferent, but the problem with the class constructors is that they already exist (pd.StringDtype, pd.Int64Dtype, etc.). Repurposing them might only add to the confusion.
Overall, though, I agree with your sentiment of thinking in terms of logical data types foremost, which should cover the majority of use cases, and then giving some control over the physical data types via keyword arguments or options.
WillAyd commented on Apr 10, 2024
Is this in reference to how nulls are stored, or to how they are expressed to the end user? Storage-wise, I feel it would be a mistake to stray from the Arrow implementation.
jorisvandenbossche commented on Apr 13, 2024
I think we already have both approaches to some extent, so we will need to clean this up whichever choice we make:
- pd.StringDtype(backend="python"|"pyarrow") is an example of using a single dtype class (that users can instantiate) as an interface to multiple implementations of the actual memory/compute; in this case we actually have multiple ExtensionArray classes that map to this single dtype depending on the backend (although I know we also have pd.ArrowDtype(pa.string()), which then uses a different dtype class)
- np.dtype("int64") / pd.Int64Dtype() / pd.ArrowDtype(pa.int64()) is essentially an example of a logical integer type where the entry point is a different class depending on which implementation you want (I know this wasn't necessarily designed together and grew historically, and I am mixing in numpy dtypes as well, but it is the current situation as users face it)

While we could decide to have a single pd.Int64Dtype(backend="numpy"|"pyarrow") parametrized class (mimicking the string case above), we could also decide that we are fine with having both the pd.Int64Dtype (numpy-based) and pd.ArrowDtype classes; but then I think we would need another entry point for users, like a pd.int64() factory function that can create an instance of either of those classes (depending on your settings, or on a keyword you pass).

So while those class constructors indeed already exist, I think we have to repurpose or change the existing ones (and add new ones) to some extent anyway. And just because we have those classes right now doesn't mean we can't decide to hide them more from the user by providing an alternative. I don't think there are many users who use pd.Int64Dtype() directly, and (if we would prefer that interface) there is certainly still room to start pushing a functional constructor interface.

jorisvandenbossche commented on Apr 13, 2024
In the first place, to how they are expressed to the end user, because IMO that's the most important aspect (since we are talking about the user-facing interface for how dtypes are specified and presented). Personally I would also prefer a consistent implementation storage-wise, but that's more of an implementation detail that we could discuss and compromise on per dtype.