<img src="../../../images/banners/pandas-cropped.jpeg" width="600"/>

<a class="anchor" id="essential_basic_functionality"></a>
# <img src="../../../images/logos/pandas.png" width="23"/> Data Types

## <img src="../../../images/logos/toc.png" width="20"/> Table of Contents 
* [defaults](#defaults)
* [upcasting](#upcasting)
* [astype](#astype)
* [object conversion](#object_conversion)
* [gotchas](#gotchas)
* [Selecting columns based on `dtype`](#selecting_columns_based_on_)

---

For the most part, pandas uses NumPy arrays and dtypes for Series or individual
columns of a DataFrame. NumPy provides support for `float`,
`int`, `bool`, `timedelta64[ns]` and `datetime64[ns]` (note that NumPy
does not support timezone-aware datetimes).

pandas and third-party libraries *extend* NumPy’s type system in a few places.
This section describes the extensions pandas has made internally.
See [Extension types](https://pandas.pydata.org/docs/development/extending.html#extending-extension-types) for how to write your own extension that
works with pandas. See [Extension data types](https://pandas.pydata.org/docs/ecosystem.html#ecosystem-extensions) for a list of third-party
libraries that have implemented an extension.
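As a small illustration of where pandas extends NumPy, the sketch below compares a timezone-naive and a timezone-aware datetime Series. The variable names and the choice of the `"UTC"` timezone are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# A timezone-naive range is backed by a plain NumPy datetime64 dtype.
naive = pd.Series(pd.date_range("2013-01-01", periods=3))

# A timezone-aware range needs the pandas DatetimeTZDtype extension type.
aware = pd.Series(pd.date_range("2013-01-01", periods=3, tz="UTC"))

print(naive.dtype)
print(aware.dtype)
print(isinstance(naive.dtype, np.dtype))            # NumPy dtype
print(isinstance(aware.dtype, pd.DatetimeTZDtype))  # pandas extension dtype
```

Checking `isinstance(..., np.dtype)` is one simple way to tell whether a column is stored with a native NumPy dtype or with an extension dtype.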

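pandas' own extension types include `CategoricalDtype`, `DatetimeTZDtype`, `PeriodDtype`, `IntervalDtype`, `SparseDtype`,
the nullable integer dtypes such as `Int64Dtype`, `StringDtype` and `BooleanDtype`. For methods requiring `dtype`
arguments, string aliases such as `"category"`, `"Int64"`, `"string"` or `"boolean"` can be specified instead. See the respective
documentation sections for more on each type.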

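pandas has two ways to store strings: the `object` dtype, which can hold any Python object (including strings), and `StringDtype`, which is dedicated to strings.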

Generally, we recommend using [`StringDtype`](../reference/api/pandas.StringDtype.html#pandas.StringDtype "pandas.StringDtype"). See [Text data types](text.html#text-types) for more.

Finally, arbitrary objects may be stored using the `object` dtype, but this should
be avoided to the extent possible (for performance and interoperability with
other libraries and methods). See [object conversion](#object_conversion).
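The difference between the two string storage options can be seen in a short sketch. The sample values are made up for this example; the dtypes are requested explicitly so the result does not depend on the default string inference of the installed pandas version.

```python
import pandas as pd

# Explicit object dtype: each element is an arbitrary Python object.
as_object = pd.Series(["a", "b", None], dtype="object")

# Dedicated string dtype, requested through its "string" alias.
as_string = pd.Series(["a", "b", None], dtype="string")

print(as_object.dtype)
print(as_string.dtype)
print(as_string.str.upper())
```

With the dedicated dtype, the missing entry stays missing after string operations and the column cannot silently pick up non-string values.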

A convenient [`dtypes`](../reference/api/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes "pandas.DataFrame.dtypes") attribute for DataFrame returns a Series
with the data type of each column.

In [None]:
import numpy as np
import pandas as pd

dft = pd.DataFrame(
    {
        "A": np.random.rand(3),
        "B": 1,
        "C": "foo",
        "D": pd.Timestamp("20010102"),
        "E": pd.Series([1.0] * 3).astype("float32"),
        "F": False,
        "G": pd.Series([1] * 3, dtype="int8"),
    }
)

dft
dft.dtypes

On a `Series` object, use the [`dtype`](../reference/api/pandas.Series.dtype.html#pandas.Series.dtype "pandas.Series.dtype") attribute.

In [None]:
dft["A"].dtype

If a pandas object contains data with multiple dtypes *in a single column*, the
dtype of the column will be chosen to accommodate all of the data types
(`object` is the most general).

In [None]:
pd.Series([1, 2, 3, 4, 5, 6.0])
pd.Series([1, 2, 3, 6.0, "foo"])

The number of columns of each type in a `DataFrame` can be found by calling
`DataFrame.dtypes.value_counts()`.

In [None]:
dft.dtypes.value_counts()

Numeric dtypes will propagate and can coexist in DataFrames.
If a dtype is passed (either directly via the `dtype` keyword, a passed `ndarray`,
or a passed `Series`), then it will be preserved in DataFrame operations. Furthermore,
different numeric dtypes will **NOT** be combined. The following example will give you a taste.

In [None]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=["A"], dtype="float32")
df1
df1.dtypes
df2 = pd.DataFrame(
    {
        "A": pd.Series(np.random.randn(8), dtype="float16"),
        "B": pd.Series(np.random.randn(8)),
        # draw integers inside the uint8 range instead of casting negative floats
        "C": pd.Series(np.random.randint(0, 255, size=8), dtype="uint8"),
    }
)

df2
df2.dtypes

<a class="anchor" id="defaults"></a>
### defaults

By default integer types are `int64` and float types are `float64`,
*regardless* of platform (32-bit or 64-bit).
The following will all result in `int64` dtypes.

In [None]:
pd.DataFrame([1, 2], columns=["a"]).dtypes
pd.DataFrame({"a": [1, 2]}).dtypes
pd.DataFrame({"a": 1}, index=list(range(2))).dtypes

Note that NumPy will choose *platform-dependent* types when creating arrays.
The following **WILL** result in `int32` on a 32-bit platform.

In [None]:
frame = pd.DataFrame(np.array([1, 2]))
frame.dtypes
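If the platform-dependent default is a concern, one option is to request the dtype explicitly when building the array. This is a minimal sketch; the names `values` and `frame64` are illustrative.

```python
import numpy as np
import pandas as pd

# Requesting the dtype explicitly avoids depending on NumPy's platform default.
values = np.array([1, 2], dtype="int64")
frame64 = pd.DataFrame(values, columns=["a"])

print(frame64.dtypes)
```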

<a class="anchor" id="upcasting"></a>
### upcasting

Types can be *upcast* when combined with other types, meaning they are promoted
from the current type to a more general one (e.g. `int` to `float`).

In [None]:
df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
df3
df3.dtypes

[`DataFrame.to_numpy()`](../reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy "pandas.DataFrame.to_numpy") will return the *lowest common denominator* of the dtypes, meaning
the dtype that can accommodate **ALL** of the types in the resulting homogeneously typed NumPy array. This can
force some *upcasting*.

In [None]:
df3.to_numpy().dtype
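To make the common-dtype rule concrete, here is a small sketch with two hypothetical frames: one purely numeric, and one that also carries a text column.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
with_text = numeric.assign(s=["x", "y"])

numeric_dtype = numeric.to_numpy().dtype      # int64 and float64 columns
with_text_dtype = with_text.to_numpy().dtype  # numbers plus strings

print(numeric_dtype, with_text_dtype)
```

The integer column is promoted to float in the first array, while adding a text column pushes the whole array to `object`, which is usually something to avoid for numeric work.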

<a class="anchor" id="astype"></a>
### astype

You can use the [`astype()`](../reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype "pandas.DataFrame.astype") method to explicitly convert dtypes from one to another. By default it returns a copy,
even if the dtype is unchanged (pass `copy=False` to change this behavior). In addition, it raises an
exception if the requested conversion is invalid.
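The failure mode can be sketched as follows. The sample strings are invented, and catching `ValueError` reflects what this particular string-to-integer conversion raises; other invalid conversions may raise a different exception type.

```python
import pandas as pd

raw = pd.Series(["1", "2", "x"], dtype="object")

try:
    raw.astype("int64")
except ValueError as exc:
    # "x" cannot be parsed as an integer, so the whole conversion fails.
    conversion_error = exc
    print(f"astype failed: {exc}")
else:
    conversion_error = None

clean = pd.Series(["1", "2", "3"], dtype="object").astype("int64")
print(clean.dtype)
```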

Upcasting is always according to the **NumPy** rules. If two different dtypes are involved in an operation,
then the more *general* one will be used as the result of the operation.

In [None]:
df3
df3.dtypes
df3.astype("float32").dtypes
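The NumPy promotion rules referred to in this section can be inspected directly with `np.result_type`. The Series below are illustrative, and the sketch covers only two dtype pairs.

```python
import numpy as np
import pandas as pd

small_int = pd.Series([1, 2, 3], dtype="int8")
wide_int = pd.Series([1, 2, 3], dtype="int32")
single = pd.Series([0.5, 0.5, 0.5], dtype="float32")

# np.result_type reports the dtype NumPy promotes a pair of dtypes to.
print(np.result_type(np.int8, np.float32))
print(np.result_type(np.int32, np.float32))

# Arithmetic between Series follows the same promotion.
print((small_int + single).dtype)
print((wide_int + single).dtype)
```

The result dtype depends on both operands: `int8` fits into `float32` without loss, whereas `int32` does not, so the second sum is promoted further.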

Convert a subset of columns to a specified type using [`astype()`](../reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype "pandas.DataFrame.astype").

In [None]:
dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
dft[["a", "b"]] = dft[["a", "b"]].astype(np.uint8)
dft
dft.dtypes

Convert certain columns to a specific dtype by passing a dict to [`astype()`](../reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype "pandas.DataFrame.astype").

In [None]:
dft1 = pd.DataFrame({"a": [1, 0, 1], "b": [4, 5, 6], "c": [7, 8, 9]})
dft1 = dft1.astype({"a": np.bool_, "c": np.float64})
dft1
dft1.dtypes

Note

When trying to convert a subset of columns to a specified type using [`astype()`](../reference/api/pandas.DataFrame.astype.html#pandas.DataFrame.astype "pandas.DataFrame.astype") and [`loc()`](../reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc "pandas.DataFrame.loc"), upcasting occurs.

[`loc()`](../reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc "pandas.DataFrame.loc") tries to fit what we are assigning into the current dtypes, while `[]` will overwrite them, taking the dtype from the right-hand side. Therefore, the following code produces an unintended result: the assigned `uint8` values are upcast to fit the columns' existing dtypes.

In [None]:
dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
dft.loc[:, ["a", "b"]].astype(np.uint8).dtypes
dft.loc[:, ["a", "b"]] = dft.loc[:, ["a", "b"]].astype(np.uint8)
dft.dtypes
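Two ways to sidestep the `loc` behavior are sketched below: assigning through `[]`, or passing a column-to-dtype mapping to `astype()`. The names `via_brackets` and `via_mapping` are illustrative.

```python
import numpy as np
import pandas as pd

dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Option 1: assign through [] so the right-hand side dtype replaces the old one.
via_brackets = dft.copy()
via_brackets[["a", "b"]] = via_brackets[["a", "b"]].astype(np.uint8)

# Option 2: pass a column-to-dtype mapping and let astype build a new frame.
via_mapping = dft.astype({"a": np.uint8, "b": np.uint8})

print(via_brackets.dtypes)
print(via_mapping.dtypes)
```

The mapping form leaves the original frame untouched, which tends to be easier to reason about in longer pipelines.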

<a class="anchor" id="object_conversion"></a>
### object conversion

pandas offers various functions to try to force conversion of types from the `object` dtype to other types.
In cases where the data is already of the correct type, but stored in an `object` array, the
[`DataFrame.infer_objects()`](../reference/api/pandas.DataFrame.infer_objects.html#pandas.DataFrame.infer_objects "pandas.DataFrame.infer_objects") and [`Series.infer_objects()`](../reference/api/pandas.Series.infer_objects.html#pandas.Series.infer_objects "pandas.Series.infer_objects") methods can be used to soft convert
to the correct type.

```
In [383]: import datetime

In [384]: df = pd.DataFrame(
   .....:     [
   .....:         [1, 2],
   .....:         ["a", "b"],
   .....:         [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)],
   .....:     ]
   .....: )
   .....: 

In [385]: df = df.T

In [386]: df
Out[386]: 
   0  1           2
0  1  a  2016-03-02
1  2  b  2016-03-02

In [387]: df.dtypes
Out[387]: 
0            object
1            object
2    datetime64[ns]
dtype: object
```

Because the data was transposed, the original inference left columns 0 and 1 as `object`, even though column 0 holds only integers;
`infer_objects` corrects this.

```
In [388]: df.infer_objects().dtypes
Out[388]: 
0             int64
1            object
2    datetime64[ns]
dtype: object
```
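The same soft conversion can be reproduced without a transpose. In this sketch the frame `boxed` is a made-up example whose columns are deliberately created with the `object` dtype.

```python
import pandas as pd

# Integers and strings stored in object columns, e.g. after a transpose.
boxed = pd.DataFrame(
    {
        "n": pd.Series([1, 2, 3], dtype="object"),
        "s": pd.Series(["a", "b", "c"], dtype="object"),
    }
)

inferred = boxed.infer_objects()

print(boxed.dtypes)
print(inferred.dtypes)
```

Only columns whose values already share a more specific type are converted; `infer_objects` does not parse strings into numbers.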

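The following functions are available for one-dimensional object arrays or scalars to perform
hard conversion of objects to a specified type:

* `pd.to_numeric()` (conversion to numeric dtypes)
* `pd.to_datetime()` (conversion to datetime objects)
* `pd.to_timedelta()` (conversion to timedelta objects)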

To force a conversion, we can pass in an `errors` argument, which specifies how pandas should deal with elements
that cannot be converted to desired dtype or object. By default, `errors='raise'`, meaning that any errors encountered
will be raised during the conversion process. However, if `errors='coerce'`, these errors will be ignored and pandas
will convert problematic elements to `pd.NaT` (for datetime and timedelta) or `np.nan` (for numeric). This might be
useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has
non-conforming elements intermixed that you want to represent as missing:

In [None]:
import datetime
m = ["apple", datetime.datetime(2016, 3, 2)]
pd.to_datetime(m, errors="coerce")
m = ["apple", 2, 3]
pd.to_numeric(m, errors="coerce")
m = ["apple", pd.Timedelta("1day")]
pd.to_timedelta(m, errors="coerce")
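A common follow-up to `errors="coerce"` is to locate the values that failed to parse. The sketch below uses an invented Series named `raw`.

```python
import pandas as pd

raw = pd.Series(["1.5", "oops", "3"])
numbers = pd.to_numeric(raw, errors="coerce")

# Rows that failed to parse are now NaN, so they can be inspected or dropped.
bad_rows = raw[numbers.isna()]

print(numbers)
print(bad_rows.tolist())
```

Keeping the original column next to the coerced one makes it possible to report exactly which inputs were rejected.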

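The `errors` parameter has a third option, `errors='ignore'`, which simply returns the passed-in data if it
encounters any errors during conversion to the desired data type. Note that pandas 2.2 deprecated this option in favor of catching the exception explicitly, so recent versions may warn or fail on the calls below: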

In [None]:
import datetime
m = ["apple", datetime.datetime(2016, 3, 2)]
pd.to_datetime(m, errors="ignore")
m = ["apple", 2, 3]
pd.to_numeric(m, errors="ignore")
m = ["apple", pd.Timedelta("1day")]
pd.to_timedelta(m, errors="ignore")
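Where `errors="ignore"` is unavailable or discouraged, the same fallback can be written with explicit exception handling. The helper `to_numeric_or_original` is a hypothetical name introduced for this sketch, not a pandas function.

```python
import pandas as pd


def to_numeric_or_original(values):
    """Hypothetical helper: return parsed numbers, or the input unchanged on failure."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values


parsed = to_numeric_or_original(pd.Series(["1", "2", "3"]))
untouched = to_numeric_or_original(pd.Series(["apple", "2", "3"]))

print(parsed.dtype)
print(untouched.tolist())
```

Making the fallback explicit also leaves room to log or count the failures, which the silent `ignore` option does not.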

In addition to object conversion, [`to_numeric()`](../reference/api/pandas.to_numeric.html#pandas.to_numeric "pandas.to_numeric") provides another argument `downcast`, which gives the
option of downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:

In [None]:
m = ["1", 2, 3]
pd.to_numeric(m, downcast="integer")  # smallest signed int dtype
pd.to_numeric(m, downcast="signed")  # same as 'integer'
pd.to_numeric(m, downcast="unsigned")  # smallest unsigned int dtype
pd.to_numeric(m, downcast="float")  # smallest float dtype
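The memory effect of `downcast` can be measured with `Series.memory_usage`. This is a minimal sketch on an invented Series of 1000 small integers.

```python
import pandas as pd

values = pd.Series(range(1000))  # int64 by default
compact = pd.to_numeric(values, downcast="integer")

print(values.dtype, values.memory_usage(index=False))
print(compact.dtype, compact.memory_usage(index=False))
```

The values are unchanged; only the storage width shrinks to the smallest signed integer type that can hold the data.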

As these methods apply only to one-dimensional arrays, lists or scalars, they cannot be used directly on multi-dimensional objects such
as DataFrames. However, with [`apply()`](../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply "pandas.DataFrame.apply"), we can “apply” the function over each column efficiently:

In [None]:
import datetime
df = pd.DataFrame([["2016-07-09", datetime.datetime(2016, 3, 2)]] * 2, dtype="O")
df
df.apply(pd.to_datetime)
df = pd.DataFrame([["1.1", 2, 3]] * 2, dtype="O")
df
df.apply(pd.to_numeric)
df = pd.DataFrame([["5us", pd.Timedelta("1day")]] * 2, dtype="O")
df
df.apply(pd.to_timedelta)

<a class="anchor" id="gotchas"></a>
### gotchas

Performing selection operations on integer data can easily upcast the data to floating point.
The dtype of the input data will be preserved in cases where `NaN` values are not introduced.
See also [Support for integer NA](gotchas.html#gotchas-intna).

In [None]:
dfi = df3.astype("int32")
dfi["E"] = 1
dfi
dfi.dtypes
casted = dfi[dfi > 0]
casted
casted.dtypes

Float dtypes, by contrast, are unchanged.

In [None]:
dfa = df3.copy()
dfa["A"] = dfa["A"].astype("float32")
dfa.dtypes
casted = dfa[df2 > 0]
casted
casted.dtypes
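One way to keep integer data as integers when a selection introduces missing values is the nullable `Int64` extension dtype. The sketch below compares it with a NumPy-backed Series; the variable names are illustrative.

```python
import pandas as pd

plain = pd.Series([1, -2, 3])                    # NumPy-backed int64
nullable = pd.Series([1, -2, 3], dtype="Int64")  # pandas nullable integer

# Masking out the negative value introduces a missing entry in both cases.
plain_masked = plain.where(plain > 0)
nullable_masked = nullable.where(nullable > 0)

print(plain_masked.dtype)
print(nullable_masked.dtype)
```

The NumPy-backed Series has to become floating point to hold `NaN`, while the nullable Series keeps its integer dtype and marks the gap as `<NA>`.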

<a class="anchor" id="selecting_columns_based_on_"></a>
## Selecting columns based on `dtype`

The [`select_dtypes()`](../reference/api/pandas.DataFrame.select_dtypes.html#pandas.DataFrame.select_dtypes "pandas.DataFrame.select_dtypes") method implements subsetting of columns
based on their `dtype`.

First, let’s create a [`DataFrame`](../reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame") with a slew of different
dtypes:

In [None]:
df = pd.DataFrame(
    {
        "string": list("abc"),
        "int64": list(range(1, 4)),
        "uint8": np.arange(3, 6).astype("u1"),
        "float64": np.arange(4.0, 7.0),
        "bool1": [True, False, True],
        "bool2": [False, True, False],
        "dates": pd.date_range("now", periods=3),
        "category": pd.Series(list("ABC")).astype("category"),
    }
)

df["tdeltas"] = df.dates.diff()
df["uint64"] = np.arange(3, 6).astype("u8")
df["other_dates"] = pd.date_range("20130101", periods=3)
df["tz_aware_dates"] = pd.date_range("20130101", periods=3, tz="US/Eastern")
df

And the dtypes:

In [None]:
df.dtypes

[`select_dtypes()`](../reference/api/pandas.DataFrame.select_dtypes.html#pandas.DataFrame.select_dtypes "pandas.DataFrame.select_dtypes") has two parameters `include` and `exclude` that allow you to
say “give me the columns *with* these dtypes” (`include`) and/or “give me the
columns *without* these dtypes” (`exclude`).

For example, to select `bool` columns:

In [None]:
df.select_dtypes(include=[bool])

You can also pass the name of a dtype in the [NumPy dtype hierarchy](https://numpy.org/doc/stable/reference/arrays.scalars.html):

In [None]:
df.select_dtypes(include=["bool"])

[`select_dtypes()`](../reference/api/pandas.DataFrame.select_dtypes.html#pandas.DataFrame.select_dtypes "pandas.DataFrame.select_dtypes") also works with generic dtypes.

For example, to select all numeric and boolean columns while excluding unsigned
integers:

In [None]:
df.select_dtypes(include=["number", "bool"], exclude=["unsignedinteger"])
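The combined effect of `include` and `exclude` is easier to verify on a tiny frame. The frame `small` and its column names are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "i": np.array([1, 2], dtype="int64"),
        "u": np.array([1, 2], dtype="uint8"),
        "f": [0.5, 1.5],
        "b": [True, False],
        "s": ["x", "y"],
    }
)

# Keep numeric and boolean columns, then drop the unsigned integer ones.
selected = small.select_dtypes(include=["number", "bool"], exclude=["unsignedinteger"])

print(list(selected.columns))
```

The selected columns keep their original order, and the text column is left out because it matches neither `include` entry.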

To select string columns you must use the `object` dtype:

In [None]:
df.select_dtypes(include=["object"])

To see all the child dtypes of a generic `dtype` like `numpy.number` you
can define a function that returns a tree of child dtypes:

In [None]:
def subdtypes(dtype):
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]


All NumPy dtypes are subclasses of `numpy.generic`:

In [None]:
subdtypes(np.generic)
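For a single yes-or-no question about the hierarchy, `np.issubdtype` is often simpler than walking the whole tree. A short sketch:

```python
import numpy as np

print(np.issubdtype(np.uint8, np.unsignedinteger))
print(np.issubdtype(np.int64, np.number))
print(np.issubdtype(np.float64, np.integer))
print(np.issubdtype(np.bool_, np.number))
```

The last check shows why `"bool"` has to be listed separately from `"number"` when selecting columns: booleans are not part of the numeric branch of the hierarchy.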

Note

pandas also defines the types `category` and `datetime64[ns, tz]`, which are not integrated into the normal
NumPy hierarchy and won’t show up with the above function.
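These pandas-specific types can still be selected by name with `select_dtypes()`. The sketch below uses the `"category"` and `"datetimetz"` selectors on an invented frame; the `"UTC"` timezone is chosen only for the example.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "cat": pd.Series(list("ABC")).astype("category"),
        "when": pd.date_range("2013-01-01", periods=3, tz="UTC"),
        "n": [1, 2, 3],
    }
)

print(list(frame.select_dtypes(include=["category"]).columns))
print(list(frame.select_dtypes(include=["datetimetz"]).columns))
```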