# Agenda: dtypes

1. Basic dtypes (review)
2. Changing dtypes
3. Limits / issues with changing
4. `NaN`
5. Nullable types
6. Say "hi" in person!

# Dtypes in Pandas

In regular Python, everything is an object. We have to think about what *type* an object is, but not about how big it is or how it is stored. So long as I have a Python object, I can have a variable refer to it, and that's that.

In Pandas, things are dramatically different. All of our data is stored in C data structures (NumPy arrays, by default), which makes it small and fast. Pandas gives us a thin layer on top of those structures, so we need to know what kind of values they hold, and how many bytes each value takes, in order to use them.

This combination of kind and size is known as the "dtype." Every series (or column of a data frame) has a single dtype -- all of its values share that dtype.

Normally, when we create a series, Pandas guesses (and usually guesses well) what type we want:

- If it sees only integers (only decimal digits), we get a dtype of `int64` -- 64-bit integers (8 bytes). These are signed integers, meaning that half of the possible values are negative and the other half are zero or positive: the range runs from `-2**63` through `2**63 - 1`.
- If it sees decimal digits and a decimal point, then it assumes we have floats, and gives a dtype of `float64`.
- If it sees other things (any other things), then it assumes we have a string, and it gives a dtype of `object` -- which means, "I'm going to use Python strings, and hope for the best."
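
To see the guessing described above in action, here is a small sketch that builds three series and prints the dtype Pandas picked for each. The variable names are just for illustration. Note that the exact result for text can depend on your Pandas version: older versions report `object`, while newer ones may report a dedicated string dtype instead.

```python
from pandas import Series

# Only integers: Pandas picks a 64-bit integer dtype on most platforms
ints = Series([10, 20, 30])

# Digits plus a decimal point: Pandas picks float64
floats = Series([10.5, 20.0, 30.25])

# Anything else: treated as text (object, or a string dtype in newer versions)
strings = Series(['abc', 'def', 'ghi'])

print(ints.dtype)
print(floats.dtype)
print(strings.dtype)
```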

If you want another dtype, then you can set that when you create the series, by passing the keyword argument `dtype=` and then a dtype. Those can be specified either as strings or as attributes of the `np` (NumPy) package.
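
As a quick illustration of the two spellings, the sketch below creates the same series twice, once with the dtype given as a string and once as a NumPy attribute. The variable names are made up for this example; both forms should give you the same dtype.

```python
import numpy as np
from pandas import Series

# dtype given as a string
s_from_string = Series([10, 20, 30], dtype='int16')

# dtype given as an attribute of the NumPy package
s_from_numpy = Series([10, 20, 30], dtype=np.int16)

print(s_from_string.dtype)
print(s_from_numpy.dtype)
print(s_from_string.dtype == s_from_numpy.dtype)   # True
```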

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30])
s

0    10
1    20
2    30
dtype: int64

In [3]:
s * 10

0    100
1    200
2    300
dtype: int64

In [7]:
s ** 15

0       1000000000000000
1   -4125488147419103232
2   -2659889346031157248
dtype: int64
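
Why the negative numbers? `20 ** 15` and `30 ** 15` are too big for 64 bits, so the results wrap around. The sketch below models that wraparound using plain Python integers, which never overflow. The helper `wrap_signed` is hypothetical (it is not part of Pandas or NumPy); it assumes standard two's-complement behavior, which is consistent with the output above.

```python
def wrap_signed(value, bits=64):
    """Return what a signed integer with `bits` bits would hold for `value`."""
    modulus = 2 ** bits
    wrapped = value % modulus          # keep only the lowest `bits` bits
    if wrapped >= modulus // 2:        # top bit set: interpret as negative
        wrapped -= modulus
    return wrapped

print(wrap_signed(10 ** 15))   # 1000000000000000 (fits, so unchanged)
print(wrap_signed(20 ** 15))   # -4125488147419103232
print(wrap_signed(30 ** 15))   # -2659889346031157248
```

Passing a different `bits` value (for example, `bits=32`) models the smaller integer dtypes in the same way.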

What if you'll be using much smaller numbers? Do you really need 64 bits?

Answer: No. You can/should specify another dtype.

In [8]:
s = Series([10, 20, 30], dtype='int32')
s

0    10
1    20
2    30
dtype: int32

In [9]:
s.dtype

dtype('int32')

In [12]:
s ** 9

0    1000000000
1     898891776
2    -835117568
dtype: int32

When you choose a dtype, you have to balance (a) how much memory it'll use and (b) the largest (and smallest) value you expect to store, including the results of any calculations.

- 1 billion numbers * 8 bytes == 8 GB of values
- 1 billion numbers * 4 bytes == 4 GB of values
- 1 billion numbers * 2 bytes == 2 GB of values
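
You can check the arithmetic above on a small scale. This sketch stores the same 1,000 values with three different dtypes and uses the series' `nbytes` attribute to report how many bytes the values take. The names `values` and `sizes` are just for illustration.

```python
from pandas import Series

values = list(range(1000))   # 0 through 999, small enough for every dtype below

sizes = {}
for dtype in ['int64', 'int32', 'int16']:
    s = Series(values, dtype=dtype)
    sizes[dtype] = s.nbytes  # bytes used by the values themselves
    print(dtype, sizes[dtype])

# int64 8000
# int32 4000
# int16 2000
```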


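What if you don't need negative numbers? You can use `uint` dtypes. These are the same sizes as the `int` dtypes, but they hold only non-negative values (zero and up), which roughly doubles the maximum value for the same number of bytes.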

That means, for integers, we have:

- `int8`
- `int16`
- `int32`
- `int64`
- `uint8`  (unsigned int)
- `uint16`
- `uint32`
- `uint64`
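
To see what each of these dtypes can actually hold, you can ask NumPy. The sketch below uses `np.iinfo` to collect the minimum and maximum for every dtype in the list above. The `ranges` dictionary is just a convenience for this example.

```python
import numpy as np

dtype_names = ['int8', 'int16', 'int32', 'int64',
               'uint8', 'uint16', 'uint32', 'uint64']

ranges = {}
for name in dtype_names:
    info = np.iinfo(np.dtype(name))        # integer limits for this dtype
    ranges[name] = (int(info.min), int(info.max))
    print(f'{name:>6}: {info.min} to {info.max}')
```

If the largest value you expect (including the results of calculations) is above a dtype's maximum, that dtype is too small for your data.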

In [None]:
import n