# Agenda: dtypes

1. Basic dtypes (review)
2. Changing dtypes
3. Limits/issues with changing them
4. `NaN` ("Not a number")
5. Nullable types -- the evolution of Pandas

# Dtypes in Pandas

Each column in a data frame is a series. Each series (whether on its own or inside of a data frame) has a dtype. That determines the type of data that each value in the series contains.

In traditional Python lists, we can have any values, in any combination. In a series, all of the values must have exactly the same type. That means we need to tell Pandas what type of data we want, so that it can turn to NumPy (the lower-level layer) and allocate an array of the right size. Moreover, it needs to interpret the bits in memory in the right way.

If we see `12` and `12.34`, we understand that the first is an integer, and the second is a float. We also think of a float as having "extra stuff beyond the integer." The dtype not only tells Pandas what kinds of data we're going to store, and thus what the limits are on those values, but also how it needs to interpret the bits at the lowest level.
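
To see why interpretation matters, here is a small sketch (not part of the original session) that stores `12` as a 64-bit integer and then asks NumPy to read the very same eight bytes as a 64-bit float. The variable names are only for illustration.

```python
import numpy as np

# One value, stored as a 64-bit integer
as_int = np.array([12], dtype='int64')

# view() reuses the same 8 bytes, but reads them as a 64-bit float
as_float = as_int.view('float64')

print(as_int[0])      # 12
print(as_float[0])    # a tiny number, nowhere near 12.0
print(as_int.tobytes() == as_float.tobytes())   # True: the bits are identical
```

The bytes never change; only the dtype, and thus the meaning we assign to them, does.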

Choosing a dtype is thus important for (a) making sure that the values will work, (b) making sure that they'll fit, and (c) making sure that you don't use too much memory.

Normally, when we create a series, Pandas chooses a dtype for us:

- If it sees only integers (decimal digits), then we get a dtype of `int64` -- 64-bit integers, aka 8-byte integers. These are signed, meaning that half of the possible values are negative and half are non-negative (from -2^63 through 2^63 - 1).
- If it sees decimal digits and one decimal point, then we get a dtype of `float64` -- 64-bit floats, aka 8-byte floats. These are also signed.
- If it sees other things, then it basically assumes that we have strings. But it doesn't use NumPy's strings, which are awful. Instead, it uses Python strings, and refers to them using a dtype of `object`.
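
The rules above can be checked with a short sketch. The series names are made up for this example, and the mixed list stands in for the "other things" case.

```python
import pandas as pd

only_ints = pd.Series([10, 20, 30])       # nothing but integers
with_point = pd.Series([10, 20, 30.5])    # one value has a decimal point
mixed = pd.Series([10, 'hello'])          # not all numbers

print(only_ints.dtype)     # int64
print(with_point.dtype)    # float64
print(mixed.dtype)         # object
```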

If you want to tell Pandas what dtype to use when you create a series, you can pass the `dtype=` keyword argument. The value for that argument is typically going to be a string indicating what dtype you want.

Signed integers
- `int8`
- `int16`
- `int32`
- `int64`

Unsigned integers (no negative values, so the maximum is roughly twice that of the signed type with the same number of bits)
- `uint8`
- `uint16`
- `uint32`
- `uint64`

Floats
- `float16`
- `float32`
- `float64`

For everything else, we have `object`. If you see a dtype of `object`, it *probably* means that you have strings there, but it might mean that Pandas wasn't sure what to do with the objects, which could be dicts, lists, etc.
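
If you don't remember the limits of each numeric dtype, NumPy can report them. The following is a minimal sketch using `np.iinfo` and `np.finfo`; the `int_ranges` dict is just a convenience for this example.

```python
import numpy as np

int_names = ['int8', 'int16', 'int32', 'int64',
             'uint8', 'uint16', 'uint32', 'uint64']

# Smallest and largest value that each integer dtype can hold
int_ranges = {}
for name in int_names:
    info = np.iinfo(np.dtype(name))
    int_ranges[name] = (info.min, info.max)
    print(f'{name:>6}: {info.min} to {info.max}')

# For floats: the largest value, and roughly how many decimal digits are reliable
for name in ['float16', 'float32', 'float64']:
    info = np.finfo(np.dtype(name))
    print(f'{name:>7}: max {info.max}, about {info.precision} decimal digits')
```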

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30])  # here, Pandas sees integers and will thus choose int64
s.dtype

dtype('int64')

In [3]:
s * 10

0    100
1    200
2    300
dtype: int64

In [4]:
# we don't get a warning telling us that the numbers rolled over
s ** 15

0       1000000000000000
1   -4125488147419103232
2   -2659889346031157248
dtype: int64
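
Since there's no warning, one way to catch this kind of rollover is to compare against exact answers. This sketch assumes the same three values as above and uses plain Python integers, which never overflow, to see which results are too big for `int64`.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30])

# Python's own ints never overflow, so use them for the exact answers
exact = [int(value) ** 15 for value in s]

# Largest value that int64 can hold
limit = np.iinfo(np.int64).max

for value in exact:
    print(value, 'fits' if value <= limit else 'too big for int64')
```

Only the first result fits, which is why the other two came out as nonsense negative numbers.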

In [5]:
# What if we'll be using much smaller numbers? Then we can (and should) use a smaller dtype, such as int8 (-128 through 127)

s = Series([10, 20, 30], dtype='int8')
s

0    10
1    20
2    30
dtype: int8

In [6]:
s + 100

0    110
1    120
2   -126
dtype: int8

In [7]:
2 ** 8

256

In [8]:
(2 ** 8) / 2

128.0

In [12]:
s + 98

0    108
1    118
2   -128
dtype: int8
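
The last few cells can be tied together with plain Python arithmetic. This sketch assumes the usual two's-complement wraparound; `wrap_int8` is a hypothetical helper written for this example, not a Pandas or NumPy function.

```python
n_values = 2 ** 8            # 256 distinct bit patterns in 8 bits
lowest = -(n_values // 2)    # -128
highest = n_values // 2 - 1  # 127

def wrap_int8(value):
    """Return what `value` becomes after wrapping around in int8."""
    return (value - lowest) % n_values + lowest

print(wrap_int8(30 + 98))    # -128, matching s + 98 above
print(wrap_int8(30 + 100))   # -126, matching s + 100 above
print(wrap_int8(20 + 100))   # 120, which fits and is unchanged
```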

You have to consider not just the data that you're going to be entering, but also the operations that you're going to be using on that data, and the minimum and maximum values that those operations can produce.
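
One way to act on that advice is to do the calculation in a roomy dtype first, and then check whether the results would fit in a smaller one. In this sketch, `fits` is a hypothetical helper, not something Pandas provides; `pd.to_numeric` with `downcast=` is a real Pandas function that picks a small dtype for us.

```python
import numpy as np
import pandas as pd

def fits(values, dtype):
    """Check whether every value in `values` is within the range of integer `dtype`."""
    info = np.iinfo(np.dtype(dtype))
    return bool(info.min <= values.min() and values.max() <= info.max)

s = pd.Series([10, 20, 30])   # int64, so the addition below cannot wrap
result = s + 100

print(fits(result, 'int8'))    # False, because 130 > 127
print(fits(result, 'uint8'))   # True, because 130 <= 255

# pd.to_numeric can also pick the smallest signed integer dtype that fits
print(pd.to_numeric(result, downcast='integer').dtype)   # int16
```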

In [14]:
# let's use an unsigned int
s = Series([10, 20, 30], dtype='uint8')

s + 200

0    210
1    220
2    230
dtype: uint8

In [15]:
# how much memory are these really using?

s = Series([10, 20, 30], dtype='int8')  # 3 bytes of actual values
s.memory_usage()    # 135 total, meaning 132 overhead

135

In [17]:
s = Series([10, 20, 30, 40, 50, 60], dtype='int8')  # 6 bytes of actual values
s.memory_usage()    # 138 total, meaning 132 overhead

138

In [18]:
# how much memory are these really using?

s = Series([10, 20, 30], dtype='int64')  # 3 * 8
s.memory_usage()    # 132 + 24 = 156

156

In [19]:
s = Series([10, 20, 30, 40, 50, 60], dtype='int64')  # 6 * 8 = 48 bytes of actual values
s.memory_usage()    # 132 + 48 = 180

180
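
Rather than subtracting by hand, we can ask Pandas to leave the index out of the count. This sketch assumes that the fixed overhead seen above comes from the index; the exact number of overhead bytes may differ between Pandas versions. The `sizes` dict is just for this example.

```python
import pandas as pd

sizes = {}
for dtype in ['int8', 'int64']:
    s = pd.Series([10, 20, 30], dtype=dtype)
    values_only = s.memory_usage(index=False)   # bytes used by the values alone
    total = s.memory_usage()                    # values plus the index
    sizes[dtype] = values_only
    print(dtype, values_only, total - values_only)
```

The values take 3 bytes as `int8` and 24 bytes as `int64`, while the remainder stays the same for both.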

In [20]:
# floats are going to take the same amount of space (because we know how many bits they'll use)

s = Series([10, 20, 30.5, 40, 50])   # automatically, Pandas will assign a dtype of float64
s

0    10.0
1    20.0
2    30.5
3    40.0
4    50.0
dtype: float64

In [21]:
s.memory_usage()

172

In [25]:
s = Series([10.1, 20.2, 30.5, 40, 50], dtype='float16')
s

0    10.101562
1    20.203125
2    30.500000
3    40.000000
4    50.000000
dtype: float16
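
Notice that `10.1` and `20.2` weren't stored exactly. A minimal sketch of why, using `np.finfo` to ask NumPy how precise `float16` is:

```python
import numpy as np

info = np.finfo(np.float16)
print(info.precision)     # 3, the approximate number of reliable decimal digits
print(float(info.max))    # 65504.0, the largest value float16 can hold

# The closest float16 values to what we asked for
print(float(np.float16(10.1)))   # 10.1015625
print(float(np.float16(20.2)))   # 20.203125
```

With only about three reliable digits, `float16` saves memory at a real cost in accuracy.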

In [26]:
s = Series([10, 20.5, 'hello', [2,4,6]])
s

0           10
1         20.5
2        hello
3    [2, 4, 6]
dtype: object
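
When the dtype is `object`, each value keeps its original Python type. The following sketch checks this for the same series; `kinds` is a name chosen for this example.

```python
import pandas as pd

s = pd.Series([10, 20.5, 'hello', [2, 4, 6]])

# Ask each stored value what type it is
kinds = [type(value).__name__ for value in s]

print(s.dtype)   # object
print(kinds)     # ['int', 'float', 'str', 'list']
```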