# Agenda: dtypes

1. Basic dtypes (review)
2. Changing dtypes
3. Limits / issues with changing
4. `NaN`
5. Nullable types
6. Say "hi" in person!

# Dtypes in Pandas

In regular Python, everything is an object. We have to think about what *type* a value is, but not about how big it is or how it is stored. So long as I have a Python object, I can have a variable refer to it, and that's that.

In Pandas, things are dramatically different. By default, our data is stored in NumPy arrays, which are implemented in C, and that makes it small and fast. This means that Pandas is giving us a thin layer on top of those C data structures. To use them well, we need to know what kind of values they hold and how many bits each value takes up.

This is known as the "dtype." Every series (or column of a data frame) has a single dtype -- all of the values are the same dtype.

Normally, when we create a series, Pandas guesses (and usually guesses well) what type we want:

- If it sees only integers (only decimal digits), we get a dtype of `int64` -- 64-bit integers (8 bytes). These are signed integers, meaning that half of the possible values are negative and the other half are zero or positive, running from `-2**63` through `2**63 - 1`.
- If it sees decimal digits and a decimal point, then it assumes we have floats, and gives a dtype of `float64`.
- If it sees other things (any other things), then it assumes we have a string, and it gives a dtype of `object` -- which means, "I'm going to use Python strings, and hope for the best."

If you want another dtype, then you can set that when you create the series, by passing the keyword argument `dtype=` and then a dtype. Those can be specified either as strings or as attributes of the `np` (NumPy) package.
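
As a small sketch of the two spellings just mentioned (the variable names here are only for illustration), a string and the matching NumPy attribute should give the same dtype:

```python
import numpy as np
from pandas import Series

# The same dtype, specified once as a string and once as a NumPy attribute
s_str = Series([10, 20, 30], dtype='int32')
s_np = Series([10, 20, 30], dtype=np.int32)

print(s_str.dtype)                 # int32
print(s_str.dtype == s_np.dtype)   # True
```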

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30])
s

0    10
1    20
2    30
dtype: int64

In [3]:
s * 10

0    100
1    200
2    300
dtype: int64

In [7]:
s ** 15

0       1000000000000000
1   -4125488147419103232
2   -2659889346031157248
dtype: int64
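
The negative numbers in that output are integer overflow: the true results don't fit in 64 bits, so they wrap around. Here is a minimal sketch of the arithmetic, using plain Python integers (which never overflow) to simulate the wraparound. The variable names are just for illustration.

```python
import numpy as np

# Largest value a signed 64-bit integer can hold
int64_max = np.iinfo(np.int64).max
print(int64_max)                # 9223372036854775807

# Python ints have arbitrary precision, so this is the true answer
true_value = 20 ** 15
print(true_value > int64_max)   # True

# Simulate two's-complement wraparound into the int64 range
wrapped = (true_value + 2**63) % 2**64 - 2**63
print(wrapped)                  # -4125488147419103232
```

NumPy doesn't raise an error or warn when this happens in array arithmetic, which is why it pays to know the limits of the dtype you're using.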

What if you'll be using much smaller numbers? Do you really need 64 bits?

Answer: No. You can/should specify another dtype.

In [8]:
s = Series([10, 20, 30], dtype='int32')
s

0    10
1    20
2    30
dtype: int32

In [9]:
s.dtype

dtype('int32')

In [12]:
s ** 9

0    1000000000
1     898891776
2    -835117568
dtype: int32

When you choose a dtype, you have to balance (a) how much memory it'll use and (b) the largest (and smallest) values you expect to store.

- 1 billion numbers * 8 bytes == 8 GB of values
- 1 billion numbers * 4 bytes == 4 GB of values
- 1 billion numbers * 2 bytes == 2 GB of values


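What if you don't need negative numbers? You can use `uint` dtypes. These use the same number of bits as the corresponding `int` dtypes, but they can't be negative: the values start at 0, and the maximum is about twice as large.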

That means, for integers, we have:

- `int8`
- `int16`
- `int32`
- `int64`
- `uint8`  (unsigned int)
- `uint16`
- `uint32`
- `uint64`
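
If you don't want to memorize the limits of each of these, NumPy can report them. This is a small sketch using `np.iinfo`; the loop and the variable names are only for illustration.

```python
import numpy as np

int_dtypes = ['int8', 'int16', 'int32', 'int64',
              'uint8', 'uint16', 'uint32', 'uint64']

for name in int_dtypes:
    dt = np.dtype(name)
    info = np.iinfo(dt)   # integer limits for this dtype
    print(f'{name:>6}: {info.min} to {info.max}  ({dt.itemsize} bytes)')
```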

In [13]:
import numpy as np

In [14]:
np.random.seed(0)
s = Series(np.random.randint(0, 100, 100_000), dtype='int64')

In [15]:
s.memory_usage()

800132

In [16]:
np.random.seed(0)
s = Series(np.random.randint(0, 100, 100_000), dtype='int8')

In [17]:
s.memory_usage()

100132
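
The `memory_usage` numbers include a small amount of space for the index. To compare just the values across dtypes, one option is the `nbytes` attribute. A quick sketch, with a `sizes` dict that exists only for this example:

```python
import numpy as np
from pandas import Series

values = np.random.randint(0, 100, 100_000)

sizes = {}
for name in ['int64', 'int32', 'int16', 'int8']:
    s = Series(values, dtype=name)
    sizes[name] = s.nbytes   # bytes used by the values only, not the index
    print(name, sizes[name])
```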

In [18]:
# What about floats?

s = Series([10, 20, 30.5, 40, 50])
s

0    10.0
1    20.0
2    30.5
3    40.0
4    50.0
dtype: float64

In [19]:
s.loc[0]

10.0

In [20]:
type(s.loc[0])

numpy.float64

In [21]:
s = Series([10.1, 20.2, 30.5, 40, 50], dtype=np.float16)
s

0    10.101562
1    20.203125
2    30.500000
3    40.000000
4    50.000000
dtype: float16

In [25]:
# you might hope that the round method would remove these errors, but
# float16 can't represent 10.1 or 20.2 exactly, so the values don't change

s.round(1)

0    10.101562
1    20.203125
2    30.500000
3    40.000000
4    50.000000
dtype: float16
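
To see why these values look the way they do, it can help to look at a single number. This sketch converts 10.1 to each float size and back to a Python float, so that we can see what was actually stored:

```python
import numpy as np

# float16 has very few bits, so the nearest value it can store is a bit off
as_float16 = float(np.float16(10.1))
print(as_float16)             # 10.1015625

# float32 is much closer, but still not exactly the float64 value
as_float32 = float(np.float32(10.1))
print(as_float32 == 10.1)     # False
print(abs(as_float32 - 10.1) < abs(as_float16 - 10.1))   # True
```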

# What kinds of floats do we have?

- `float16`
- `float32`
- `float64` -- default
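
NumPy can also tell us roughly how precise each float dtype is, and how large a value it can hold. A small sketch using `np.finfo`:

```python
import numpy as np

for name in ['float16', 'float32', 'float64']:
    info = np.finfo(np.dtype(name))   # floating-point limits for this dtype
    print(f'{name}: about {info.precision} decimal digits, max {info.max}')
```

With only about 3 reliable decimal digits, `float16` is rarely a good choice for anything other than saving memory.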

In [26]:
# What about other things?

s = Series(['hello', 'out', 'there'])
s

0    hello
1      out
2    there
dtype: object

In [27]:
s = Series([10, 20.5, 'hello', [2,4,6]])
s

0           10
1         20.5
2        hello
3    [2, 4, 6]
dtype: object

If you see "object" as the dtype, almost certainly that means you're dealing with strings.
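
If you aren't sure what is actually inside an `object` series, one way to check is to ask each value for its Python type. This is only a sketch; `type_names` is a name made up for the example.

```python
from pandas import Series

s = Series([10, 20.5, 'hello', [2, 4, 6]])

# Map each element to the name of its Python type
type_names = s.map(lambda value: type(value).__name__)
print(type_names.tolist())   # ['int', 'float', 'str', 'list']
```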

In [28]:
# if you read in data that is dirty/bad/missing/corrupt, you might well
# get a dtype of "object" even though you really want/expect int or float

# Exercise: Ages

1. Create a series containing the ages of 5 people in your family.
2. What dtype would be appropriate?
3. What if you want to measure the age more precisely, using fractional years? What would you use then?
4. Create a new series in which you measure the age (approximately) in seconds. Create that series using multiplication. Is `int64` big enough for that?

In [29]:
s = Series([53, 51, 23, 21, 18])
s

0    53
1    51
2    23
3    21
4    18
dtype: int64

In [31]:
# what is the max we'll get with int64?
# it's 8 bytes (64 bits), but one bit is used for the sign,
# so we'll get up to 2 ** 63 - 1

2 ** 63 - 1

9223372036854775807

In [32]:
s = Series([53, 51, 23, 21, 18], dtype='int8')
s

0    53
1    51
2    23
3    21
4    18
dtype: int8

In [33]:
# int8 has 2 ** 8 distinct values, from -128 to 127 -- enough for ages
2 ** 8

256

In [34]:
s = Series([53.9, 51.3, 23.5, 21.5, 18.6])
s

0    53.9
1    51.3
2    23.5
3    21.5
4    18.6
dtype: float64

In [35]:
s = Series([53.9, 51.3, 23.5, 21.5, 18.6], dtype='float16')
s

0    53.90625
1    51.31250
2    23.50000
3    21.50000
4    18.59375
dtype: float16

In [36]:
s = Series([53, 51, 23, 21, 18])
s * 86400 * 365

0    1671408000
1    1608336000
2     725328000
3     662256000
4     567648000
dtype: int64

In [37]:
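# note: the result below came back as int32, because 86400 doesn't fit in int8
# newer versions of NumPy may raise an error here instead of upcasting
s = Series([53, 51, 23, 21, 18], dtype='int8')
s * 86400 * 365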

0    1671408000
1    1608336000
2     725328000
3     662256000
4     567648000
dtype: int32
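
To answer the last question in the exercise without relying on automatic upcasting, one approach is to convert to `int64` before multiplying, and then check how much room `int64` leaves. This is a sketch; `seconds_per_year` and the other names are only for illustration, and the calculation ignores leap years.

```python
import numpy as np
from pandas import Series

seconds_per_year = 86400 * 365

# Cast up to int64 before multiplying, so the result has room to grow
ages = Series([53, 51, 23, 21, 18], dtype='int8')
age_seconds = ages.astype('int64') * seconds_per_year
print(age_seconds.dtype)   # int64
print(age_seconds.max())   # 1671408000

# How many years' worth of seconds fit in an int64?
max_years = np.iinfo(np.int64).max // seconds_per_year
print(max_years)
```

The limit is in the hundreds of billions of years, so `int64` is more than big enough for a person's age in seconds.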

# Changing dtypes

It's very common for us to want to change the dtype of a series. We can't do that in place! We can't assign to the `dtype` of an existing series. However, we can get a new series back, based on the old one, with a different dtype.

In [38]:
s = Series('10 20 30'.split())
s

0    10
1    20
2    30
dtype: object

In [39]:
s + s

0    1010
1    2020
2    3030
dtype: object

In [40]:
s * 3

0    101010
1    202020
2    303030
dtype: object

In [41]:
s.dtype = 'int64'

AttributeError: property 'dtype' of 'Series' object has no setter

In [42]:
# we're going to use the "astype" method
# this returns a new series whose values are based on the old one
# (the old series doesn't change)
# you can, if you want to, assign the new one back to the old variable

s.astype('int64')

0    10
1    20
2    30
dtype: int64
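
A quick way to convince yourself that `astype` leaves the original alone is to keep both series around and compare them. A sketch, with `converted` as a name chosen just for this example:

```python
from pandas import Series

s = Series('10 20 30'.split())
converted = s.astype('int64')

print(s.dtype)            # the original still holds strings
print(converted.dtype)    # int64

# The old series still behaves like strings; the new one behaves like numbers
print((s + s).tolist())                   # ['1010', '2020', '3030']
print((converted + converted).tolist())   # [20, 40, 60]
```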

In [43]:
s.astype('int64') * 3

0    30
1    60
2    90
dtype: int64

In [44]:
s = s.astype('int64') 
s

0    10
1    20
2    30
dtype: int64

In [45]:
# if you pass a list of strings to Series, and you know it contains
# only digits, you can assign a dtype and the conversion will be done internally

Series('10 20 30'.split(), dtype='int64')

0    10
1    20
2    30
dtype: int64

In [46]:
# if some aren't digits, bad news!

Series('10 20 hello 30'.split(), dtype='int64')

ValueError: invalid literal for int() with base 10: 'hello'

In [47]:
s = Series([10, 20, 30])
s.astype('int8')

0    10
1    20
2    30
dtype: int8
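
Converting to a smaller integer dtype works well here because all of the values fit. When a value is out of range, `astype` may silently wrap it around rather than raise an error, so it's worth checking first. The sketch below uses a hypothetical helper, `fits_in`, which is not part of Pandas, along with the `downcast` option of `pd.to_numeric`:

```python
import numpy as np
import pandas as pd
from pandas import Series

def fits_in(series, dtype):
    """Hypothetical helper: True if every value is within the dtype's range."""
    info = np.iinfo(np.dtype(dtype))
    return bool(series.min() >= info.min and series.max() <= info.max)

s = Series([10, 200, 300])
print(fits_in(s, 'int8'))    # False
print(fits_in(s, 'int16'))   # True

# pd.to_numeric can choose the smallest signed integer dtype that fits
smallest = pd.to_numeric(s, downcast='integer')
print(smallest.dtype)        # int16
```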

In [48]:
s = Series('10 20 hello 30'.split())
s

0       10
1       20
2    hello
3       30
dtype: object

In [49]:
s.astype('int8')

ValueError: invalid literal for int() with base 10: 'hello'

In [50]:
# astype takes an "errors" keyword argument; the default is 'raise',
# so we get the same exception as before

s.astype('int8', errors='raise')  # default

ValueError: invalid literal for int() with base 10: 'hello'

In [52]:
# if you really want, you can tell Pandas to ignore errors
# as a general rule, I wouldn't do this, because you don't know what the dtype
# is of the series you get back!

s.astype('int8', errors='ignore')

0       10
1       20
2    hello
3       30
dtype: object
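
An alternative that gives a more predictable result is `pd.to_numeric` with `errors='coerce'`, which replaces anything it can't convert with `NaN`. This is a sketch rather than a recommendation for every case; the series is created with `dtype=object` to match the output above.

```python
import pandas as pd
from pandas import Series

s = Series('10 20 hello 30'.split(), dtype=object)

# Values that can't be parsed become NaN, which forces a float dtype
numbers = pd.to_numeric(s, errors='coerce')
print(numbers.dtype)          # float64
print(numbers.isna().sum())   # 1
```

Because `NaN` is a float, the result is `float64` rather than an integer dtype.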

In [54]:
s = Series([10, 20, 30])
s.astype('float16')

0    10.0
1    20.0
2    30.0
dtype: float16

In [55]:
s = Series([10, 20, 30], dtype='float16')
s

0    10.0
1    20.0
2    30.0
dtype: float16

In [56]:
s = Series([10, 20, 30], dtype=np.float16)
s

0    10.0
1    20.0
2    30.0
dtype: float16

In [57]:
s = Series([10, 20, 30], dtype=float)
s

0    10.0
1    20.0
2    30.0
dtype: float64

In [60]:
s = Series([10.9, 20.3, 30.4])
s

0    10.9
1    20.3
2    30.4
dtype: float64

In [61]:
s.astype('int64')

0    10
1    20
2    30
dtype: int64

In [62]:
int('10')

10

In [63]:
int(10.2)

10

In [64]:
int(10.9)

10

In [65]:
s.round()

0    11.0
1    20.0
2    30.0
dtype: float64

In [66]:
s

0    10.9
1    20.3
2    30.4
dtype: float64

In [67]:
# rounded + integers

s.round().astype('int')

0    11
1    20
2    30
dtype: int64
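
The difference between truncating and rounding is easier to see with negative numbers and exact halves. A small sketch, with values chosen only to show the edge cases:

```python
from pandas import Series

s = Series([-10.9, 10.9, 0.5, 1.5, 2.5])

# astype truncates toward zero, like int() on a Python float
truncated = s.astype('int64')
print(truncated.tolist())   # [-10, 10, 0, 1, 2]

# round() goes to the nearest integer; exact halves go to the even neighbor
rounded = s.round().astype('int64')
print(rounded.tolist())     # [-11, 11, 0, 2, 2]
```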

In [68]:
Series(np.random.randint(0, 100, 10))  # this gives me a series of 10 ints from 0 to 99

0    63
1     6
2    30
3     3
4    28
5    36
6    44
7     6
8    89
9    42
dtype: int64

In [None]:
np.random.rand(10)   # this returns 10 random floats in the range [0, 1)