# Agenda: dtypes

1. Basic dtypes (review)
2. Changing dtypes
3. Limits/issues with changing them
4. `NaN` ("Not a number")
5. Nullable types -- the evolution of Pandas

# Dtypes in Pandas

Each column in a data frame is a series. Each series (whether on its own or inside of a data frame) has a dtype. That determines the type of data that each value in the series contains.

In traditional Python lists, we can have any values, in any combination. In a series, all of the values must have the same dtype. That means we need to tell Pandas what type of data we want, so it can turn to NumPy (the lower-level layer) and allocate an array of the right size. Moreover, it needs to interpret the bits in memory in the right way.

If we see `12` and `12.34`, we understand that the first is an integer, and the second is a float. We also think of a float as having "extra stuff beyond the integer." The dtype not only tells Pandas what kinds of data we're going to store, and thus what the limits are on those values, but also how it needs to interpret the bits at the lowest level.

Choosing a dtype is thus important for (a) making sure that the values will work, (b) making sure that they'll fit, and (c) making sure that you don't use too much memory.
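To make "interpreting the bits" concrete, here is a small sketch that uses NumPy's `view` method (not otherwise used in these notes) to look at the same 8 bytes first as a float and then as an integer. Nothing is converted; only the interpretation changes.

```python
import numpy as np

# One float64 value occupies 8 bytes.
a = np.array([1.0], dtype='float64')

# view() reinterprets those same 8 bytes as an int64, without converting anything.
as_int = a.view('int64')

print(a[0])                     # the bytes read as a float
print(as_int[0])                # the very same bytes read as an integer
print(a.nbytes, as_int.nbytes)  # both arrays are 8 bytes long
```

The integer that comes back looks nothing like `1`, which is exactly why the dtype matters: the bits alone don't say what they mean.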

Normally, when we create a series, Pandas chooses a dtype for us:

- If it sees only integers (decimal digits), then we get a dtype of `int64` -- 64-bit integers, aka 8-byte integers. These are signed, meaning that half of the possible values are negative and half are non-negative: the range runs from `-2**63` to `2**63 - 1`.
- If it sees decimal digits and one decimal point, then we get a dtype of `float64` -- 64-bit floats, aka 8-byte floats. These are also signed.
- If it sees other things, then it basically assumes that we have strings. But it doesn't use NumPy's strings, which are awful. Instead, it uses Python strings, and refers to them using a dtype of `object`.

If you want to tell Pandas what dtype to use when you create a series, you can pass the `dtype=` keyword argument. The value for that argument is typically going to be a string indicating what dtype you want.

Signed integers
- `int8`
- `int16`
- `int32`
- `int64`

Unsigned integers (zero and positive only, so the maximum is about twice that of the signed type with the same number of bits)
- `uint8`
- `uint16`
- `uint32`
- `uint64`

Floats
- `float16`
- `float32`
- `float64`

For everything else, we have `object`. If you see a dtype of `object`, it *probably* means that you have strings there, but it might mean that Pandas wasn't sure what to do with the objects, which could be dicts, lists, etc.
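If you'd rather not memorize the limits of each dtype, you can ask NumPy for them. The following sketch uses `np.iinfo` (for integer dtypes) and `np.finfo` (for float dtypes); the formatting is just one way to print them.

```python
import numpy as np

# Integer dtypes: iinfo reports the exact minimum and maximum values.
for name in ['int8', 'int16', 'int32', 'int64',
             'uint8', 'uint16', 'uint32', 'uint64']:
    info = np.iinfo(name)
    print(f'{name:>6}: {info.min} to {info.max}')

# Float dtypes: finfo reports the largest value and the approximate
# number of decimal digits of precision.
for name in ['float16', 'float32', 'float64']:
    info = np.finfo(name)
    print(f'{name:>7}: max {info.max}, about {info.precision} decimal digits')
```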

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30])  # here, Pandas sees integers and will thus choose int64
s.dtype

dtype('int64')

In [3]:
s * 10

0    100
1    200
2    300
dtype: int64

In [4]:
# we don't get a warning telling us that the numbers rolled over
s ** 15

0       1000000000000000
1   -4125488147419103232
2   -2659889346031157248
dtype: int64

In [5]:
# What if we'll be using much smaller numbers, such as ages? Then we can (and often should) use a smaller dtype

s = Series([10, 20, 30], dtype='int8')
s

0    10
1    20
2    30
dtype: int8

In [6]:
s + 100

0    110
1    120
2   -126
dtype: int8

In [7]:
2 ** 8

256

In [8]:
(2 ** 8) / 2

128.0

In [12]:
s + 98

0    108
1    118
2   -128
dtype: int8

You have to consider not just the data that you're going to be entering, but also the operations that you're going to be using on that data, and what the possible min/max values are that you'll be getting.
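One way to act on this advice is to do the calculation in a roomy dtype first, and then check whether the results would fit in the smaller dtype you have in mind. This is only a sketch; `fits_in` is a hypothetical helper written for this example, not part of Pandas or NumPy.

```python
import numpy as np
from pandas import Series

def fits_in(values, dtype):
    """Return True if every value lies within the range of the integer dtype."""
    info = np.iinfo(dtype)
    return bool(values.min() >= info.min and values.max() <= info.max)

# Do the arithmetic in int64, where there is plenty of room...
s = Series([10, 20, 30], dtype='int64')
result = s + 100

# ...and then ask whether the results would have fit in something smaller.
print(fits_in(result, 'int8'))    # 130 is larger than the int8 maximum of 127
print(fits_in(result, 'int16'))
```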

In [14]:
# let's use an unsigned int
s = Series([10, 20, 30], dtype='uint8')

s + 200

0    210
1    220
2    230
dtype: uint8
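Unsigned integers aren't immune to rolling over; the limits are simply in different places. In this sketch both operands are `uint8` series, so the result stays `uint8` and wraps around at 256, and subtracting below zero wraps around to the top of the range.

```python
from pandas import Series

a = Series([10, 20, 30], dtype='uint8')
b = Series([250, 250, 250], dtype='uint8')

# 260, 270 and 280 don't fit in a uint8, so they wrap around modulo 256.
wrapped = a + b
print(wrapped)

# Going below zero wraps around, too.
zero = Series([0], dtype='uint8')
one = Series([1], dtype='uint8')
below_zero = zero - one
print(below_zero)
```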

In [15]:
# how much memory are these really using?

s = Series([10, 20, 30], dtype='int8')  # 3 bytes of actual values
s.memory_usage()    # 135 total: 3 bytes of values, plus 132 bytes of overhead for the index

135

In [17]:
s = Series([10, 20, 30, 40, 50, 60], dtype='int8')  # 6 bytes of actual values
s.memory_usage()    # 138 total, meaning 132 overhead

138

In [18]:
# how much memory are these really using?

s = Series([10, 20, 30], dtype='int64')  # 3 * 8
s.memory_usage()    # 132 + 24 = 156

156

In [19]:
s = Series([10, 20, 30, 40, 50, 60], dtype='int64')  
s.memory_usage()  

180
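To see the values and the index separately, rather than subtracting by hand, you can pass `index=False` to `memory_usage`, or ask the index for its own usage. A minimal sketch follows; the size of the index may differ across Pandas and Python versions, so it is printed here rather than assumed.

```python
from pandas import Series

s = Series([10, 20, 30], dtype='int8')

values_only = s.memory_usage(index=False)   # just the values
index_only = s.index.memory_usage()         # just the index
total = s.memory_usage()                    # values + index

print(values_only, index_only, total)
```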

In [20]:
# floats take a fixed amount of space, too: a float64 uses 8 bytes per value, just like an int64

s = Series([10, 20, 30.5, 40, 50])   # automatically, Pandas will assign a dtype of float64
s

0    10.0
1    20.0
2    30.5
3    40.0
4    50.0
dtype: float64

In [21]:
s.memory_usage()

172

In [25]:
s = Series([10.1, 20.2, 30.5, 40, 50], dtype='float16')
s


0    10.101562
1    20.203125
2    30.500000
3    40.000000
4    50.000000
dtype: float16

In [26]:
s = Series([10, 20.5, 'hello', [2,4,6]])
s

0           10
1         20.5
2        hello
3    [2, 4, 6]
dtype: object

In [28]:
type(s.loc[0])

int

In [29]:
type(s.loc[1])

float

In [30]:
type(s.loc[2])

str

In [31]:
type(s.loc[3])

list

In [32]:
s = Series(['a', 'b', 'c', 'd'])
s

0    a
1    b
2    c
3    d
dtype: object

In [33]:
s.memory_usage()

164

In [35]:
# let's try something else
s = Series(['a'*5000, 'b'*6000, 'c'*7000, 'd'*8000])
s.memory_usage()

164

In [34]:
'a' * 5

'aaaaa'

The moment that we have an `object` dtype in our series, `memory_usage` stops counting the actual values by default. Instead, it reports the size of the in-memory references to the Python objects. In other words, this is the size of the addresses in memory where the values are (8 bytes each on a 64-bit system), not the size of the values themselves.

In [36]:
# we can check if we ask for deep=True
# this means: I don't trust you -- go into Python's memory and ask each object how big it is
s.memory_usage(deep=True)

26328

In [37]:
s = Series(['a'*50000000, 'b'*6000000, 'c'*7000000, 'd'*8000000])
s.memory_usage()

164

In [38]:
s.memory_usage(deep=True)

71000328
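To get a feel for what `deep=True` adds, we can ask Python for the size of each string ourselves, using `sys.getsizeof`, and compare. This is a sketch; the series is created with `dtype=object` explicitly so that it behaves like the examples above, and the exact numbers depend on your Python version.

```python
import sys
from pandas import Series

s = Series(['a' * 5000, 'b' * 6000], dtype=object)

shallow = s.memory_usage()              # counts only the references (plus the index)
deep = s.memory_usage(deep=True)        # also counts the strings themselves
string_bytes = sum(sys.getsizeof(one_string) for one_string in s)

print(shallow, deep, string_bytes)
print(deep - shallow)                   # compare this with string_bytes
```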

# Exercise: Ages

1. Create a series containing the ages of 5 people in your family.
2. What dtype would be appropriate for such a series?
3. What if you want to measure the age more precisely, using fractional years? What kind of dtype would be appropriate?
4. Create a new series in which you measure the age (approximately) in seconds. Create that series using multiplication from the existing one. Is `int64` going to be large enough?

In [39]:
s = Series([53, 51, 23, 21, 18])
s

0    53
1    51
2    23
3    21
4    18
dtype: int64

In [41]:
# what is the biggest int64 number we can have?
# (dividing gives us a float approximation; the exact maximum is 2 ** 63 - 1)

(2 ** 64) / 2

9.223372036854776e+18

In [42]:
s = Series([53, 51, 23, 21, 18], dtype='uint8')
s

0    53
1    51
2    23
3    21
4    18
dtype: uint8

In [45]:
# let's use floats to describe the ages
s = Series([53.9, 51.3, 23.5, 21.5, 18.7], dtype='float32')
s

0    53.900002
1    51.299999
2    23.500000
3    21.500000
4    18.700001
dtype: float32

In [47]:
s = Series([53, 51, 23, 21, 18], dtype='uint8')
s * 86400 * 365

0    1671408000
1    1608336000
2     725328000
3     662256000
4     567648000
dtype: uint32

In [49]:
s = Series([53, 51, 23, 21, 18], dtype='int8')
s * 86400 * 365

0    1671408000
1    1608336000
2     725328000
3     662256000
4     567648000
dtype: int32

In [58]:
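# s is still int8; in this environment, multiplying by 1000 upcasts only as far as int16, so the results silently overflow
s * 1000 * 60 * 60 * 24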

0     3072
1    21504
2    17408
3   -29696
4    30720
dtype: int16
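One way to avoid this sort of surprise, sketched below, is to start from `int64` before multiplying, and then compare the largest result with the `int64` maximum. Like the cells above, this assumes 365-day years.

```python
import numpy as np
from pandas import Series

ages = Series([53, 51, 23, 21, 18], dtype='int64')

# seconds per day * days per year
seconds = ages * 86400 * 365
print(seconds)

# Is int64 large enough? Compare the largest result with the int64 maximum.
int64_max = np.iinfo('int64').max
print(seconds.max(), int64_max)
print(seconds.max() <= int64_max)
```

Even the oldest age here, measured in seconds, is many orders of magnitude below the `int64` maximum.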

In [59]:
import numpy as np

In [60]:
np.__version__

'1.26.4'

# Changing types

What happens if we have values in a series, and we want to change them to another dtype?

- We want a bigger/smaller number of bits
- We want a completely different type
    - float -> int
    - int -> float
    - object (string) -> int/float
 
We *cannot* assign to the `dtype` attribute; it is read-only.

In [61]:
s = Series('10 20 30'.split())
s

0    10
1    20
2    30
dtype: object

In [62]:
s.sum()

'102030'

In [64]:
# dtype is a read-only attribute

s.dtype = int

AttributeError: property 'dtype' of 'Series' object has no setter

# `astype`

We can use the `astype` method on any series. The argument to the method is the dtype that we want to get back, expressed as a string or as an `np` attribute, if you prefer (e.g., `np.int64`).

We will get back a new series, but this will *not* affect the original series! We can, of course, assign the new series back to the original variable.

In [65]:
s

0    10
1    20
2    30
dtype: object

In [66]:
s.astype('int64')

0    10
1    20
2    30
dtype: int64

In [67]:
# I could say

s = s.astype('int64')  # this will affect s, because we're assigning a new value to it

In [68]:
s

0    10
1    20
2    30
dtype: int64

In [71]:
s = Series([10.1, 20.7, 30.9])
s

0    10.1
1    20.7
2    30.9
dtype: float64

In [72]:
# converting floats to ints chops off the fractional part; it does not round
s.astype('int')

0    10
1    20
2    30
dtype: int64
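If you want rounding rather than truncation, one option is to call `round` on the series before `astype`. The sketch below compares the two; the negative value is added here just to show that truncation moves toward zero.

```python
from pandas import Series

s = Series([10.1, 20.7, 30.9, -1.7])

# astype alone drops the fractional part (truncating toward zero).
truncated = s.astype('int64')

# Rounding first gives the nearest integer instead.
rounded = s.round().astype('int64')

print(truncated.tolist())
print(rounded.tolist())
```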

In [73]:
s = Series('10 20 hello 30'.split())
s

0       10
1       20
2    hello
3       30
dtype: object

In [74]:
s.astype('int')

ValueError: invalid literal for int() with base 10: 'hello'

In [75]:
s = Series('10 20 20.5 30'.split())
s

0      10
1      20
2    20.5
3      30
dtype: object

In [76]:
s.astype('int')

ValueError: invalid literal for int() with base 10: '20.5'
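The string `'20.5'` isn't a valid integer, but it is a valid float. A sketch of one way around the error above: convert to `float64` first, and then (if you really want integers, and accept losing the fraction) convert again.

```python
from pandas import Series

s = Series('10 20 20.5 30'.split())

# Every one of these strings can be read as a float.
as_float = s.astype('float64')
print(as_float)

# From floats, we can then go to ints, dropping the fractional part.
as_int = as_float.astype('int64')
print(as_int)
```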

In [77]:
# you can, if you really want to, tell Pandas to ignore errors
# by default, the "errors" parameter of "astype" has a value of "raise", which means: raise an exception

# but you can instead say errors="ignore", in which case a failed conversion gives back the original series, unchanged
s.astype('int', errors='ignore')

0      10
1      20
2    20.5
3      30
dtype: object

In [78]:
s = s.astype('int', errors='ignore')
type(s.loc[0])

str
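Ignoring the error leaves us with strings, which is rarely what we want. A different approach, not used elsewhere in these notes, is `pd.to_numeric` with `errors='coerce'`: values that can't be converted become `NaN`, and the rest become numbers. Here is a sketch.

```python
import pandas as pd
from pandas import Series

s = Series('10 20 hello 30'.split())

# Anything that can't be parsed as a number turns into NaN.
converted = pd.to_numeric(s, errors='coerce')

print(converted)
print(converted.isna())
```

Because `NaN` is a float value, the result has a dtype of `float64`, even though the valid values were all integers.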

In [80]:
# let's get a series of random integers
# for this, I'll use NumPy

import numpy as np

np.random.seed(0)              # this guarantees that we'll all get the same set of random integers in order
np.random.randint(0, 100, 10)  # this returns a NumPy array (the lower-level structure underneath a Pandas series) with 10 integers, from 0 up to (but not including) 100

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87])

In [81]:
np.random.seed(0)              # reset the seed, so that we get the same integers as before
Series(np.random.randint(0, 100, 10))  # wrap the NumPy array in a Pandas series; 10 integers, from 0 up to (but not including) 100

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [83]:
# I can get random floats in a similar way

np.random.seed(0)                       #  seed the random number generator
Series(np.random.uniform(0, 100, 10))   # ask for 10 random floats between 0-100

0    54.881350
1    71.518937
2    60.276338
3    54.488318
4    42.365480
5    64.589411
6    43.758721
7    89.177300
8    96.366276
9    38.344152
dtype: float64
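NumPy also offers a newer interface for random numbers, `np.random.default_rng`, which returns a generator object rather than relying on a global seed. The sketch below uses it as an alternative to the calls above, and then applies `astype` to shrink the result; the values won't match the ones shown above, because the two interfaces use different algorithms.

```python
import numpy as np
from pandas import Series

rng = np.random.default_rng(0)            # a generator with its own seed

# 10 integers, from 0 up to (but not including) 100
ints = Series(rng.integers(0, 100, 10))

# All of these values fit in an int8, so we can safely use the smaller dtype.
small = ints.astype('int8')

print(ints.dtype, ints.memory_usage(index=False))
print(small.dtype, small.memory_usage(index=False))
```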