# Changing Data Types

This chapter will go deeper into the different data types that are available to pandas Series and DataFrames. As a reminder, all values in a Series are exactly one data type. Similarly, all values in each column of a DataFrame are exactly one data type. Earlier in the book, it was stated that the following data types are the most common:

* boolean
* integer
* float
* object
* datetime

While these data types are useful descriptions, they are not the technical names of the actual data types. This chapter will give a complete picture of all the exact data types that are available and what they mean. We will make heavy use of the `astype` method to change data types.

## Constructing a Series

All of our previous Series have been created by selecting a single column from a DataFrame. It is also possible to create a Series manually using `pd.Series` by passing it a list of values. Let's create a simple integer Series with a few values.

In [None]:
import pandas as pd
s_int = pd.Series([50, 99, 130])
s_int

Creating a Series in this manner is often referred to as using the **constructor**, which is a generic programming term used to describe the creation and instantiation of a new object of a particular type. For instance, we used the Series constructor to create a new Series of integers. The visual output of a Series displays the data type below the values.

## numpy data types

The data type is 'int64', which formally represents a 64-bit integer. This data type comes directly from numpy, which allows integers to be 8, 16, 32, or 64 bits in size. A 64-bit integer can take on 2 raised to the 64th power distinct values. Let's see how many numbers this is.

In [None]:
2 ** 64

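The 'int64' data type holds both positive and negative integers. Let's divide the above number by 2 to see how the range is split. The result, 2 raised to the 63rd power, is the count of negative integers. The maximum integer allowed is one less than this because 0 occupies one of the non-negative slots.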

In [None]:
2 ** 64 // 2

numpy has a function called `iinfo` that returns the exact integer information for each integer data type. Pass it the data type as a string to get the information. Note that the range from the minimum to the maximum, counting 0, contains exactly 2 raised to the 64th power integers.

In [None]:
import numpy as np
np.iinfo('int64')
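
As a quick check on the arithmetic above, the following sketch compares the powers of 2 with the limits that numpy reports. The variable names are only illustrative.

```python
import numpy as np

info = np.iinfo('int64')

# 64 bits give 2 ** 64 distinct values in total
n_values = 2 ** 64

# Half of them are negative; the other half hold 0 and the positive integers
smallest = -(2 ** 63)
largest = 2 ** 63 - 1

print(int(info.min) == smallest)                      # True
print(int(info.max) == largest)                       # True
print(int(info.max) - int(info.min) + 1 == n_values)  # True
```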

We can find the range of 8-bit integers with the following:

In [None]:
np.iinfo('int8')

## Changing Data Types with `astype`

You can change a Series data type with the `astype` method by passing it the string name of the data type. The string name is the base data type, which is 'int' in this case, followed directly by the number of bits (8, 16, 32, or 64).

Below, we change the data type to 'int8'. Notice that the third value now displays as -126 and not its original value of 130. The maximum 8-bit integer is 127, so 130 exceeds it by 3. numpy assumes you know what you are doing and does not check whether a value falls outside the range of the new type. Instead, the value wraps around to the minimum, -128, and continues counting up from there: 128 becomes -128, 129 becomes -127, and 130 becomes -126.

In [None]:
s_int.astype('int8')
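
The wraparound can be reproduced by hand with modular arithmetic. This is a minimal sketch of the idea, not a description of how numpy implements the cast internally; the name `manual` is only illustrative.

```python
import pandas as pd

s_int = pd.Series([50, 99, 130])
wrapped = s_int.astype('int8')

# Shift into the range 0..255, wrap with the modulus, then shift back
manual = [(v + 128) % 256 - 128 for v in s_int.tolist()]

print(wrapped.tolist())  # [50, 99, -126]
print(manual)            # [50, 99, -126]
```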

### Default integer types

Unfortunately, the default data type is dependent on the platform that you are using. numpy is built with the C programming language and uses the size of the C **long** type as its default. For 32-bit Linux, macOS, and Windows machines this will be 32 bits. For 64-bit Linux and macOS machines, it will be 64 bits. For 64-bit Windows machines this will be 32 bits. The [first two rows of this table][1] show the size of C long types for each platform. Note that NumPy 2.0 changed the default to 64 bits on 64-bit Windows as well.

### Windows machines

If you have a Windows machine, you may have noticed that your data type still says 'int64' even though the previous section just said otherwise. This is because we constructed our Series with a list and not a numpy array. Let's construct a numpy array of integers and output its data type using its constructor `array` from the numpy library.

[1]: https://en.wikipedia.org/wiki/Integer_(computer_science)#Common_long_integer_sizes

In [None]:
a = np.array([1, 5])
a

numpy arrays use the exact same `dtype` attribute as Series to access the data type. Windows users should see 'int32' as the data type, while 64-bit Linux and macOS users should see 'int64'.

In [None]:
a.dtype

Constructing the Series with a numpy array will show the same data type as above.

In [None]:
pd.Series(a)
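
If you are unsure which default applies to your machine, a short check such as the following reports it. The result depends on your platform and numpy version, so no expected output is shown.

```python
import numpy as np

a = np.array([1, 5])

# itemsize is the number of bytes used by each element
bits = a.dtype.itemsize * 8
print(a.dtype, bits)
```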

## Float data types

Float columns contain numbers with decimal places. The default 'float' size for all platforms is 64 bits (the same size as a C double), which is also referred to as 'double-precision'. numpy has additional float sizes of 16 and 32 bits and, on some platforms, 128 bits.

In [None]:
s_float = pd.Series([4.247, 1234.56789])
s_float

Below, we change the data type to a 32-bit float, also known as a 'single-precision' float. Notice how the second value has changed, as a 32-bit float does not have enough precision to represent its value exactly.

In [None]:
s_float.astype('float32')

We can use the numpy `finfo` function to get information on each float type. For instance, the 'float32' data type guarantees us 6 significant digits of precision as seen with the 'resolution' attribute below.

In [None]:
np.finfo('float32')
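
To see the loss of precision on a single value, the sketch below stores 1234.56789 as a 32-bit float and measures how far the stored value drifts from the original. The variable names are only illustrative.

```python
import numpy as np

x = 1234.56789
x32 = np.float32(x)

# Convert back to a Python float to see the value that was actually stored
stored = float(x32)
error = abs(stored - x)

print(stored)
print(error)
print(np.finfo('float32').precision)  # 6
```

The error is small but not zero, which is consistent with roughly 6 reliable significant digits.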

A 16-bit 'half-precision' float only guarantees 3 digits of precision.

In [None]:
np.finfo('float16')

In [None]:
s_float.astype('float16')

## Unsigned Integers

The default integer data types split half of their range between negative and positive integers. It is possible to limit your integers to just the non-negative integers by using the unsigned integer type abbreviated with the string 'uint'. The same sizes 8, 16, 32, and 64 bits are available.

In [None]:
s_int.astype('uint8')

As before, you can find the range of possible values for each type.

In [None]:
np.iinfo('uint16')
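
To compare the signed and unsigned ranges side by side, a loop over the available sizes can be sketched as follows.

```python
import numpy as np

for bits in (8, 16, 32, 64):
    signed = np.iinfo(f'int{bits}')
    unsigned = np.iinfo(f'uint{bits}')
    # The unsigned maximum is roughly double the signed maximum
    print(bits, (signed.min, signed.max), (unsigned.min, unsigned.max))
```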

## One size for booleans

Now that we have a better understanding of data types and their sizes, let's back up to booleans. Booleans have a single 8-bit data type. It makes sense that there are no 16-, 32-, or 64-bit boolean data types, as there are only two possible values for booleans. You may be curious as to why booleans are not represented with a single bit. This is because a byte (8 bits) is the smallest addressable unit of memory available to modern computers.
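
The one-byte size can be confirmed directly. In this sketch, `flags` is just an illustrative name.

```python
import numpy as np
import pandas as pd

flags = pd.Series([True, False, True])

print(flags.dtype)                # bool
print(np.dtype('bool').itemsize)  # 1 byte per value
print(flags.to_numpy().nbytes)    # 3 bytes for 3 values
```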

## Changing from float to int

So far, we have only changed the size of integer or float data types. We can change types from float to integer and vice-versa. Below, we go from a 'float64' to an 'int64'. This will truncate (and not round) the decimals.

In [None]:
s_float

In [None]:
s_float.astype('int64')
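
Truncation moves each value toward zero, which matters for negative numbers. If rounding is what you want, one option is to call `round` before converting. The Series `s_demo` below is a made-up example.

```python
import pandas as pd

s_demo = pd.Series([4.247, 1234.56789, -3.99])

# astype drops the decimals, moving each value toward zero
truncated = s_demo.astype('int64')

# Rounding first gives the nearest integer instead
rounded = s_demo.round().astype('int64')

print(truncated.tolist())  # [4, 1234, -3]
print(rounded.tolist())    # [4, 1235, -4]
```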

Going from integer to float has a less dramatic effect. There should be no loss of data as long as the float has enough bits to represent all the digits.

In [None]:
s_int.astype('float64')

### Visual display of integers and floats

pandas always displays floats with decimals even if there are no significant digits after the decimal. At a minimum, the `.0` will be present. On the other hand, integers are always displayed without a decimal. You can use this rule of thumb to determine the data type without actually accessing the `dtypes` attribute.

### Converting to and from boolean

It is possible to convert integer and float columns to boolean and vice-versa. The only value that will be converted to `False` is 0. All other values are converted to `True`. Use the string 'bool' to convert to boolean.

In [None]:
s = pd.Series([0, 99, -56])
s

In [None]:
s.astype('bool')

In [None]:
s = pd.Series([0, 0.0001, -3.99])
s

In [None]:
s.astype('bool')

Python itself uses the same rules to convert integers and floats to boolean. Let's see a couple examples with `bool`.

In [None]:
bool(5), bool(0)

In [None]:
bool(-3.4), bool(0.00000)

Converting a boolean Series to integer or float will convert all `True` values to 1 and `False` to 0.

In [None]:
s_bool = pd.Series([True, False])
s_bool

Using a 64-bit integer to store a boolean is overkill. The smallest integer type, `int8`, can be used to save memory.

In [None]:
s_bool.astype('int8')

### Missing values in integer and boolean Series

There is no missing value representation for integer or boolean data types. Therefore if you create a Series containing integers or booleans along with missing values, its type will be float which is a more flexible type and does have missing value representation. Here we create a Series consisting of two integers and the missing value `nan` from numpy. pandas returns us a float Series.

In [None]:
pd.Series([4, 5, np.nan])

### Changing a single value of a Series to a different data type

The data type of a Series can change if one of its values is changed to a different data type. Let's begin by creating a boolean Series.

In [None]:
s = pd.Series([True, False])
s

If we assign the first element in the Series to a non-boolean value, its data type changes. pandas chooses the most flexible type, object, to hold both an integer and a boolean in the same Series.

In [None]:
s.loc[0] = 2
s

### Changing data types with an operation

An operation on a Series can change the resulting Series data type. Division always converts an integer Series to float, even if the results are whole numbers.

In [None]:
s = pd.Series([-15, 45])
s

In [None]:
s / 15

This is consistent with core Python which makes the same type conversion.

In [None]:
type(15 / 3)
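
For contrast, floor division with `//` keeps the integer data type. A short sketch using the same values:

```python
import pandas as pd

s = pd.Series([-15, 45])

true_div = s / 15    # true division always produces floats
floor_div = s // 15  # floor division of integers stays integer

print(true_div.dtype)      # float64
print(floor_div.dtype)     # int64
print(floor_div.tolist())  # [-1, 3]
```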

## Object data types

Series with the object data type do not have analogous size representation like integer and float data types do. There is no 'object64', only a single 'object' data type. The reason for this is that Series with the object data type can contain any Python object. It is the most flexible data type. This is not a particularly satisfying revelation as we can never know for sure what type each Series value is. It's possible a Series with the object data type contains a boolean, integer, float, string, and list all in the same Series.

### Strings in object data types

pandas has no specific data type for strings and if your Series contains even a single string, then it will be an object data type. Let's verify this by creating a Series of strings.

In [None]:
s = pd.Series(['some', 'strings'])
s

Because object is the most flexible type, any Series may be converted to it. Here we convert a Series of integers to object.

In [None]:
s = pd.Series([5, 10])
s.astype('object')

This is something you would never want to do, as numpy integer arrays are optimized for fast computation. By converting to an object array, you would lose this excellent benefit. The following example creates a numpy array with one million random integers between 0 and 100. A second array is created by converting the data type to object. We then time how long it takes to sum each array. On my machine, the integer array is about 50x as fast as the object array even though they both hold the exact same data.

In [None]:
a = np.random.randint(low=0, high=100, size=1000000)
a1 = a.astype('object')

In [None]:
%timeit -n 5 a.sum()

In [None]:
%timeit -n 5 a1.sum()

### Many different types in a Series
There is no restriction on what can be placed in a Series with the object data type. The following example creates a Series that contains a list, a boolean, a string, a float, and a dictionary.

In [None]:
pd.Series([[1,2], True, 'some string', 4.5, {'key': 'value'}])
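
Since the data type alone says nothing about what each value is, one way to find out is to check the type of every element. This is a simple sketch that loops over the values.

```python
import pandas as pd

s_mixed = pd.Series([[1, 2], True, 'some string', 4.5, {'key': 'value'}])

# Look up the Python type of each element individually
type_names = [type(value).__name__ for value in s_mixed]

print(s_mixed.dtype)  # object
print(type_names)     # ['list', 'bool', 'str', 'float', 'dict']
```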

### Object Series usually contain strings

As we have mentioned in previous chapters, when you encounter a Series or column of a DataFrame that has object as its data type, it usually contains nothing but strings.

### Poor practice to store complex data types within Series

Even though you are allowed to place any Python object within a Series, it's generally considered poor practice to do so. Series with object data types are designed to be filled with strings as the `str` accessor is only available for this data type. This shouldn't be seen as an absolute statement since it might be necessary to use the flexibility of these object columns for some advanced operations.

## Setting data types in numpy arrays

It is possible to manually set the data type of a numpy array during construction with the `dtype` parameter. You can set it to the string name of the data type you desire. Let's construct an 8-bit integer array. Notice that 150 exceeds the max value of 127. Older versions of numpy silently wrap it around to -106, while NumPy 2.0 and later raise an `OverflowError` instead.

In [None]:
np.array([1, 5, 150], dtype='int8')

Here we force numpy to use a 32-bit float. Normally, it would default this array to a 32- or 64-bit integer.

In [None]:
np.array([1, 5, 150], dtype='float32')

### Setting data types on Series
The Series constructor has the same `dtype` parameter that we can use to enforce a different data type than the default.

In [None]:
pd.Series([1, 5, 150], dtype='int16')

We can also construct a Series with a numpy array. The data type of the array is used as the data type for the Series.

In [None]:
a = np.array([1, 5, 150], dtype='float32')
pd.Series(a)

## The numpy datetime64 data type
Both numpy and pandas have a 'datetime64' data type. pandas uses numpy's datetime64 as a base and builds quite a bit more functionality on top of it. As the name implies, a datetime64 value always uses 64 bits of memory. There is no other size for datetimes. However, a datetime64 object must have a date or time **unit**. The units can be years, months, weeks, days, hours, minutes, seconds, and parts of a second down to an attosecond (10<sup>-18</sup> of a second).

The unit determines the precision of a datetime value. For instance, if the unit is months then each datetime will have a year and month component. If the unit is hours then each datetime will have year, month, day, and hour components.

The official string representation of a datetime64 data type must contain the unit, which is placed within brackets. For instance, `datetime64[s]` has second precision and `datetime64[ns]` has nanosecond precision. Visit the [numpy documentation to view all of the possible units][1].

There are a few ways to create a datetime array in numpy. Just as we did above, we will pass the `dtype` parameter the data type we desire. The values passed to `np.array` are going to be integers. numpy converts these integers to a datetime with the specified unit. It does this by treating 0 as the **unix epoch** which is January 1, 1970 at midnight. In the following example we create an array from the three integers 10, -120, and 410. The data type is a datetime with month precision. The integer 10 corresponds to 10 months after the epoch or November, 1970.

[1]: https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units

In [None]:
np.array([10, -120, 410], dtype='datetime64[M]')

Using second precision we get the following.

In [None]:
np.array([10, -120, 410], dtype='datetime64[s]')
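
One way to convince yourself that the integers are offsets from the epoch is to add the same numbers, as month timedeltas, to January 1970. This is a small sketch; the names `epoch` and `offsets` are only illustrative.

```python
import numpy as np

months = np.array([10, -120, 410], dtype='datetime64[M]')

# Build the same dates by adding month offsets to the epoch
epoch = np.datetime64('1970-01', 'M')
offsets = np.array([10, -120, 410], dtype='timedelta64[M]')
shifted = epoch + offsets

print([str(m) for m in months])   # ['1970-11', '1960-01', '2004-03']
print((months == shifted).all())  # True
```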

### Available integers
You can use all integers that are available to 64-bit integers. Let's print this info out again.

In [None]:
np.iinfo('int64')

### Available time span
The precision of the datetime limits its available time span. For instance, the latest possible datetime for a given unit corresponds to the maximum 64-bit integer, which is 9223372036854775807 (2<sup>63</sup> - 1). Let's convert the min and max 64-bit integers to datetimes with millisecond precision.

In [None]:
np.array([-9223372036854775808, 9223372036854775807], dtype='datetime64[ms]')

### NaT
Notice that the first value returned as 'NaT' which stands for 'Not a Time'. Instead of using the minimum integer as a datetime, numpy uses it to signal a missing value. A value such as this is usually referred to as a **sentinel value** or a special reserved value for a specific situation. The minimum 64-bit integer is not available to be used as a normal datetime. 
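
numpy provides the `isnat` function to test for this sentinel. The sketch below also converts `NaT` back to an integer to show that it maps to the minimum 64-bit value.

```python
import numpy as np

int_info = np.iinfo('int64')
limits = np.array([int_info.min, int_info.max], dtype='int64')
as_dates = limits.astype('datetime64[ms]')

# Only the minimum integer is treated as a missing value
print(np.isnat(as_dates).tolist())  # [True, False]

# Going the other way, NaT maps back to the minimum 64-bit integer
nat_as_int = int(np.datetime64('NaT').astype('int64'))
print(nat_as_int == int_info.min)   # True
```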

There is more to the numpy datetime64 data type than what was written above. Visit [the official documentation][1] for more extensive coverage.

[1]: https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

## The pandas datetime64 data type
The pandas datetime64 data type is very similar to numpy's but, prior to pandas 2.0, is only available with **nanosecond** precision. Let's create a pandas Series from a numpy datetime array.

In [None]:
a = np.array([10, -120, 410], dtype='datetime64[M]')
a

In [None]:
s = pd.Series(a)
s

Notice that the Series data type has nanosecond precision even though it was created from a numpy array with month precision. You might be wondering why the hour, minute, second, and nanoseconds are not viewable in the above Series output. These components do exist, but pandas intelligently does not output them as showing lots of zeros would dilute the information. You can view the underlying numpy array to verify that the nanosecond precision exists.

In [None]:
s.values

You can convert a Series of integers to a datetime with the `astype` method. Let's first create a Series of integers.

In [None]:
s = pd.Series([0, 4, 30])
s

Choose the units for these integers. pandas eventually converts them to nanoseconds.

In [None]:
s.astype('datetime64[Y]')

Use months as the units which again eventually converts to nanoseconds.

In [None]:
s.astype('datetime64[M]')

## Conversion failure

It's not always possible to convert from one data type to another. By default, the `astype` method raises an error if it's unable to do so. Here, we attempt to convert a Series of floats to integer, but fail to do so as integers have no missing value representation.

In [None]:
s = pd.Series([4, np.nan])
s.astype('int64')

Converting strings that do not represent numbers to integers is not possible.

In [None]:
s = pd.Series(['some', 'series'])
s.astype('int64')
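
Both failed conversions above raise a `ValueError` (or a subclass of it). If you want your code to keep running after a failed conversion, you can catch the error, as in this minimal sketch. The dictionary `candidates` and the list `failures` are illustrative names.

```python
import numpy as np
import pandas as pd

candidates = {
    'missing value': pd.Series([4, np.nan]),
    'non-numeric strings': pd.Series(['some', 'series']),
}

failures = []
for label, values in candidates.items():
    try:
        values.astype('int64')
    except ValueError as err:
        # Record which conversion failed and show the message
        failures.append(label)
        print(f'{label}: {err}')
```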

It is possible to convert strings that represent valid numbers to either integer or float.

In [None]:
s = pd.Series(['4.5', '3.19'])
s

In [None]:
s.astype('float64')

## Force conversion with `pd.to_numeric` 

You may have a Series of string values where some of the strings can be converted to a number and others cannot. In this situation, it is not possible to use the `astype` method to make the conversion, as you can see with the following error.

In [None]:
s = pd.Series(['4.5', '3.19', 'NOT AVAILABLE'])
s

In [None]:
s.astype('float64')

pandas provides the `to_numeric` function which works very similarly to `astype` but has an option to force the conversion to happen. You do this by setting the `errors` parameter to the string 'coerce'. Any value that cannot be converted will be set as missing.

In [None]:
pd.to_numeric(s, errors='coerce')
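
After a forced conversion, it is often useful to know which original values were lost. One way to do this, sketched below, is to select the values that were present before the conversion but are missing afterwards.

```python
import pandas as pd

s = pd.Series(['4.5', '3.19', 'NOT AVAILABLE'])
converted = pd.to_numeric(s, errors='coerce')

# Present in the original but missing after conversion: could not be converted
failed = s[converted.isna() & s.notna()]

print(converted.tolist()[:2])  # [4.5, 3.19]
print(failed.tolist())         # ['NOT AVAILABLE']
```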

Notice that `to_numeric` is a function and not a method. You must access it directly from `pd`. The `astype` method does have an `errors` parameter, but it does not have the option for 'coerce'. It would be quite nice if the developers implemented this option for `astype`, then we wouldn't need to use `to_numeric`.

## DataFrame data type conversion

Since a DataFrame is essentially a collection of columns, converting data types happens in a similar manner as it does with a Series. Let's see some examples by reading in a few of the columns from the college dataset.

In [None]:
cols = ['instnm', 'hbcu', 'relaffil', 'ugds', 'md_earn_wne_p10', 'grad_debt_mdn_supp']
college = pd.read_csv('../data/college.csv', index_col='instnm', usecols=cols)
college.head(3)

Unlike Series, DataFrames do not display the column data types in their output. You must access them with the `dtypes` attribute. At first glance, it appears that each column is numeric, but surprisingly this isn't the case. The last two columns are objects.

In [None]:
college.dtypes

When a seemingly numeric column is read in as an object, it is a clue that there are strings in this column. One of the first things we can do to investigate this issue is output some of the values from the underlying numpy array with the `values` attribute. Here, we take a look at the first five values and indeed they appear as strings.

In [None]:
college['grad_debt_mdn_supp'].head().values

We can get the exact type of an individual value by extracting it from the numpy array. We have now verified that these are strings.

In [None]:
type(college['grad_debt_mdn_supp'].values[0])

The `read_csv` function won't read in a column of data as strings unless it contains non-numeric characters. One way to find non-numeric characters is to sort the string column in descending order. Numeric characters have a lower Unicode code point than alphabetic characters, so this should put the alphabetic strings at the top.

In [None]:
college['grad_debt_mdn_supp'].dropna().drop_duplicates() \
       .sort_values(ascending=False).head()

This method isn't perfect for uncovering non-numeric strings since it's possible a string can begin with a digit only to be followed by alphabetic characters. Regular expressions are needed to search for more specific patterns.
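
As a rough sketch of the regular expression approach, the `str.fullmatch` method can flag every string that does not look like a plain number. The values in `raw` are hypothetical and are not taken from the college dataset, and the pattern only covers simple integers and decimals.

```python
import pandas as pd

# Hypothetical values, not taken from the college dataset
raw = pd.Series(['12500', '9843.5', 'PrivacySuppressed', '12abc'])

# Optional minus sign, digits, and an optional decimal part
numeric_pattern = r'-?\d+(?:\.\d+)?'
is_numeric = raw.str.fullmatch(numeric_pattern).astype(bool)

print(raw[~is_numeric].tolist())  # ['PrivacySuppressed', '12abc']
```

Note that '12abc' is caught here even though it begins with digits.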

We need to use the `to_numeric` function to convert this column to a float. Again, we set the `errors` parameter to 'coerce' to force any value that isn't able to be converted to missing. One minor annoyance is that `to_numeric` converts only a single column at a time. Below, we overwrite both of the object columns with two separate calls of the `to_numeric` function.

In [None]:
college['grad_debt_mdn_supp'] = pd.to_numeric(college['grad_debt_mdn_supp'], errors='coerce')
college['md_earn_wne_p10'] = pd.to_numeric(college['md_earn_wne_p10'], errors='coerce')
college.dtypes

## The `astype` method for DataFrames

The `astype` method is still useful for DataFrames. We can convert all columns at once to a different type. Below, we convert each column to a 32-bit float.

In [None]:
college.astype('float32').head()

You can change the data type of specific columns by using a dictionary to map the column name to the desired type. Here, we change `relaffil` to an 8-bit integer and `ugds` to a 32-bit float.

In [None]:
college.astype({'relaffil': 'int8', 'ugds': 'float32'}).dtypes
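
The effect of choosing smaller types can be measured with the `memory_usage` method. The DataFrame below is a hypothetical stand-in for two of the college columns so that the example runs without the data file.

```python
import pandas as pd

# Hypothetical stand-in for two of the college columns
df = pd.DataFrame({'relaffil': [0, 1, 0],
                   'ugds': [4206.0, 11383.0, 291.0]})
smaller = df.astype({'relaffil': 'int8', 'ugds': 'float32'})

print(smaller.dtypes)

# Bytes used by the column data, excluding the index
print(df.memory_usage(index=False).sum())       # 48
print(smaller.memory_usage(index=False).sum())  # 15
```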

## Reading in data with known missing values

You can avoid having to use `to_numeric` if you know the missing value representation in your dataset before you read in your data. Set the `na_values` parameter of the `read_csv` function to the string that represents missing values. You can use a list to specify more values and a dictionary to specify different missing values for each column. Here, we read in our college dataset again and convert every occurrence of 'PrivacySuppressed' to missing on read.

In [None]:
cols = ['instnm', 'hbcu', 'relaffil', 'ugds', 'md_earn_wne_p10', 'grad_debt_mdn_supp']
college = pd.read_csv('../data/college.csv', index_col='instnm', 
                      usecols=cols, na_values='PrivacySuppressed')
college.head(3)

Let's verify that pandas has correctly read in the last two columns as floats.

In [None]:
college.dtypes
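
The same behavior can be tried without the data file by reading a small CSV from a string. The two rows below are made up for illustration; only the column names mirror the college dataset.

```python
import io
import pandas as pd

# Made-up rows; only the column names mirror the college dataset
csv_text = """instnm,ugds,md_earn_wne_p10
School A,4206,30300
School B,11383,PrivacySuppressed
"""

plain = pd.read_csv(io.StringIO(csv_text), index_col='instnm')
cleaned = pd.read_csv(io.StringIO(csv_text), index_col='instnm',
                      na_values='PrivacySuppressed')

print(plain['md_earn_wne_p10'].dtype)    # not numeric, because of the string
print(cleaned['md_earn_wne_p10'].dtype)  # float64
```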

## numpy and pandas timedelta64 data type

The pandas timedelta64 data type is just as valid as any other data type but does not appear as frequently as the others previously discussed, which is why it is being presented towards the end of the chapter. A timedelta refers to an amount of time like 4 days and 24 minutes or 123 milliseconds. Both numpy and pandas have timedelta64 data types. In numpy, a timedelta is expressed as an integer along with a unit ranging from years to attoseconds with the same character abbreviation as datetimes. Let's create a numpy array of `timedelta64[D]` values. The `D` represents day precision.

In [None]:
np.array([1, 2, 100], dtype='timedelta64[D]')

In pandas, all timedeltas have nanosecond precision. You are not given a choice. Below, we convert a Series of integers to timedeltas. We specify the unit as minutes ('m'), but pandas will eventually convert this to nanosecond precision.

In [None]:
pd.Series([10, 50, 423]).astype('timedelta64[m]')

Take a look at the value 423. We are telling pandas to treat this as 423 minutes, which is 7 hours, 3 minutes, 0 seconds, and 0 nanoseconds. This is the value that is returned. Let's use the same Series of integers and use hours as the units.

In [None]:
pd.Series([10, 50, 423]).astype('timedelta64[h]')
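
The arithmetic behind both conversions can be checked with `pd.Timedelta`, the pandas scalar for a single timedelta. It is not otherwise covered in this chapter, so treat this as a side sketch.

```python
import pandas as pd

from_minutes = pd.Timedelta(minutes=423)
from_hours = pd.Timedelta(hours=423)

# 423 minutes is 7 hours and 3 minutes
print(from_minutes == pd.Timedelta(hours=7, minutes=3))  # True

# 423 hours is 17 days and 15 hours
print(from_hours == pd.Timedelta(days=17, hours=15))     # True
```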

To prove that pandas uses nanosecond precision, you can view the underlying array.

In [None]:
pd.Series([10, 50, 423]).astype('timedelta64[h]').values

Even though numpy allows timedeltas to have year precision, the largest unit used in the representation of timedeltas within pandas is days. The following Series still has nanosecond precision, but the visual representation is shown in days. pandas does not use years in its representation as a year is not a consistent measure of time.

In [None]:
pd.Series([1, 10, 50]).astype('timedelta64[Y]')

## Period and Category data types

Both the period and category data types are unique to pandas and have no equivalent in numpy. More details on the period data type will be covered in the time series part. Category data types can help save a tremendous amount of memory and will also be covered in an upcoming chapter.

## Different syntax for data types

All of the data type conversions in this chapter were accomplished by using a string such as 'int8'. All of these data types are also available directly as numpy objects with the same name. For instance, we can use `np.int8` instead of the string to specify a data type.

In [None]:
pd.Series([10, 50]).astype(np.int8)

The following is equivalent to the above.

In [None]:
pd.Series([10, 50]).astype('int8')

You can even use the built-in `int` and `float` which will use the default bit size.

In [None]:
pd.Series([10, 50]).astype(float)
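
A short check confirms that the different spellings resolve to the same data type.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 50])

from_string = s.astype('int8').dtype
from_numpy = s.astype(np.int8).dtype

print(from_string == from_numpy)    # True
print(np.dtype('int8') == np.int8)  # True
print(s.astype(float).dtype)        # float64
```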

### Does it matter which one you use?
I typically use strings when specifying a data type as numpy must be imported to use the object directly. I also don't use the built-in Python `int` or `float` as they are not explicit about the bit size.

### Converting to strings
You can convert all the values in a Series or DataFrame to strings with either the string 'str' or the built-in `str`. Let's create a Series of integers and then convert it to strings.

In [None]:
s = pd.Series([10, 20, 99])
s.astype('str')

Let's verify that the underlying values are actually strings.

In [None]:
s.astype('str').values

Converting data to a string is an uncommon occurrence.

## Data Types Summary

<h3 style="text-align: center;">pandas data types in common with numpy</h3>

![][1]

Integers default to 64 bits on 64-bit Linux and macOS machines and 32 bits for all others.

![][2]

<h3 style="text-align: center;">Uncommon data types available to both pandas and numpy</h3>

![][3]


### More data types specific to pandas
* `SparseDtype`
* `IntervalDtype`
* `DatetimeTZDtype` - timezone aware datetime. Use string `datetime64[ns, <timezone>]`
* `UInt64` - Nullable unsigned integer. Sizes 8, 16, 32, 64
* `ExtensionDtype`

[1]: images/pandas_numpy_dtypes.png
[2]: images/pandas_only_dtypes.png
[3]: images/pandas_numpy_other.png

## Exercises

### Exercise 1

<span  style="color:green; font-size:16px">Find the maximum integer of a 16-bit integer. Then verify it with numpy's `iinfo` function.</span>

### Exercise 2

<span  style="color:green; font-size:16px">Read in the bikes data and select the `tripduration` column. Find its data type and then use the `memory_usage` method to find how much memory (in bytes) it is using. Change its data type to the smallest possible type so that no information is lost. What percentage of memory has been saved?</span>

### Exercise 3

<span  style="color:green; font-size:16px">Create three different Series. Make them each have a different data type and have a different number of items. Make a fourth Series that has these three Series as the values. Output the fourth Series. Can you make sense of it?</span>

### Exercise 4

<span  style="color:green; font-size:16px">What month is it 1 million minutes after the unix epoch?</span>

### Exercise 5

<span  style="color:green; font-size:16px">Convert the following Series to float.</span>

### Exercise 6

<span  style="color:green; font-size:16px">Take a look at the `dpcapacity_start` column from the bikes dataset. It contains the capacity of the bike rack when the ride began. This number should be an integer but it is a float. Why do you think pandas read this in as a float? Do something to the DataFrame as a whole so that you can convert just this column to an integer. Choose the lowest size integer that is possible.</span>