# 7.3 Extension Data Types

pandas was originally built on top of NumPy's functionality. That design has caused a few problems:

1. Missing data handling isn't great for integers and Booleans.
1. Datasets with lots of strings were computationally expensive.
1. Time intervals, timedeltas, and timestamps couldn't be easily supported.

To get around this, *extension types* have been built to handle data types not supported by NumPy.



In [16]:
import pandas as pd
import numpy as np

Example 1: Creating a Series of integers with a missing value converts the type to `float64` and the missing value to `NaN`.

In [17]:
# Series of integers with a missing value:
s = pd.Series([1, 2, 3, None])
s

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [18]:
s.isna()

0    False
1    False
2    False
3     True
dtype: bool

In [19]:
s.dtype

dtype('float64')

Using the *extension type* `Int64Dtype`, we can create this Series and keep both the integer dtype and the missing value, which is now shown as `<NA>`.

In [20]:
# Series of integers with extension integer type
s = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())
s

0       1
1       2
2       3
3    <NA>
dtype: Int64

In [21]:
# Confirm that None is still NA
s.isna()

0    False
1    False
2    False
3     True
dtype: bool

In [22]:
# Confirm that type is still integer
s.dtype

Int64Dtype()

The `<NA>` in this instance is the `pandas.NA` sentinel value:

In [23]:
s[3] is pd.NA

True
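
As a small sketch of how `pd.NA` behaves once it is in an `Int64` Series, the example below runs an arithmetic operation, a reduction, and a comparison. The variable names `doubled`, `total`, and `mask` are illustrative and not part of the original notes.

```python
import pandas as pd

s = pd.Series([1, 2, 3, None], dtype="Int64")

# Arithmetic keeps the integer extension dtype; the missing slot stays <NA>
doubled = s * 2

# Reductions skip <NA> by default (skipna=True)
total = s.sum()

# Comparisons return the nullable "boolean" dtype, with <NA> where the input was missing
mask = s > 1

print(doubled.dtype, total, mask.dtype)
```

The point is that the missing value propagates through element-wise operations instead of forcing the column back to `float64`.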

The `pd.` and the `Dtype()` can both be omitted in favor of just the string `"Int64"` (capitalized! Lowercase `"int64"` is the NumPy dtype).

In [24]:
s = pd.Series([1, 2, 3, None], dtype="Int64")
s

0       1
1       2
2       3
3    <NA>
dtype: Int64

The `StringDtype` extension type can be more efficient for large datasets with strings. Note that it also represents the missing value as `<NA>`.

In [25]:
s = pd.Series(['one', 'two', None, 'three'], dtype=pd.StringDtype())
s

0      one
1      two
2     <NA>
3    three
dtype: string

In [26]:
# Alternative name
s = pd.Series(['one', 'two', None, 'three'], dtype="string")
s

0      one
1      two
2     <NA>
3    three
dtype: string
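
A short sketch of what the `"string"` dtype means for the `.str` accessor: missing values stay `<NA>`, and numeric results come back as nullable extension types rather than `float64`. The names `upper` and `lengths` are illustrative only.

```python
import pandas as pd

s = pd.Series(['one', 'two', None, 'three'], dtype="string")

# String methods propagate the missing value instead of failing on it
upper = s.str.upper()

# Integer-valued results use the nullable Int64 dtype, so <NA> is preserved
lengths = s.str.len()

print(upper)
print(lengths)
```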

Another common and important extension type is `Categorical` (see Section 7.5 for more information).

A standard type can be converted to an extension type with the `astype` Series method.

In [27]:
# Example DF with int, str, and bool Series with missing values
df = pd.DataFrame({"A": [1, 2, None, 4],
                   "B": ["one", "two", "three", None],
                   "C": [False, None, False, True]})
df

     A      B      C
0  1.0    one  False
1  2.0    two   None
2  NaN  three  False
3  4.0   None   True


Notice that the missing value in the integer column is `NaN`, while in the string and Boolean columns it is `None`. After conversion, all three are `<NA>`.

In [28]:
# Convert each type
df["A"] = df["A"].astype("Int64")
df["B"] = df["B"].astype("string")
df["C"] = df["C"].astype("boolean")
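
To check the result of the conversion, a minimal sketch along these lines rebuilds the same DataFrame, converts all three columns in one `astype` call with a column-to-dtype mapping (an alternative to the column-by-column assignments above), and inspects the dtypes and missing values.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4],
                   "B": ["one", "two", "three", None],
                   "C": [False, None, False, True]})

# Convert every column at once by passing a {column: dtype} mapping
df = df.astype({"A": "Int64", "B": "string", "C": "boolean"})

# Each column now has an extension dtype
print(df.dtypes)

# Each column has exactly one missing value, and all of them are pd.NA
print(df.isna().sum())
print(df.loc[2, "A"] is pd.NA, df.loc[3, "B"] is pd.NA, df.loc[1, "C"] is pd.NA)
```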

The table below lists most extension types but omits the string type, `StringDtype`. It is covered above, and its string shorthand is simply `"string"`.

<img src="./myImages/table7.3_pandasExtensionDataTypes.png" width = 600>