# Data Types and Missing Values

One of the most important pieces of information you can have about your DataFrame is the data type of each column. pandas stores its data such that each column is exactly one data type. A large number of data types are available for pandas DataFrame columns. This chapter focuses only on the most common data types and provides a brief summary of each one. For extensive coverage of each and every data type, see the chapter **Changing Data Types** in the **Essential Commands** part.

## Common data types

The following data types appear most frequently in DataFrames.

* **boolean** - Only two possible values, `True` and `False`
* **integer** - Whole numbers without decimals
* **float** - Numbers with decimals
* **object** - Typically strings, but may contain any object
* **datetime** - Specific date and time with nanosecond precision
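
As a minimal sketch of these data types, the following builds a small DataFrame with one column of each kind. The column names and values are made up for illustration and are not taken from any dataset used in this chapter.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration
df = pd.DataFrame({
    "is_member": [True, False, True],          # boolean
    "trip_count": [3, 10, 7],                  # integer
    "distance": [1.5, 2.25, 0.8],              # float
    "gender": ["Male", "Female", "Male"],      # strings
    "start": pd.to_datetime(["2013-06-28", "2013-06-29", "2013-06-30"]),  # datetime
})

# One data type is reported per column
print(df.dtypes)
```

The string column is the one to watch: in the pandas versions this chapter covers, it is reported as object.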

### More on the object data type

The object data type is the most confusing and deserves a longer discussion. Each value in an object column can be *any* Python object: an integer, a float, or even a data structure such as a list or dictionary. In practice, however, object columns almost always contain only **strings**, so when you see that a column has the object data type, you should expect its values to be strings. Unfortunately, pandas does not store strings in a dedicated string data type by default. If you have strings in your columns, the data type will be object.

### The importance of knowing the data type

Knowing the data type of each column of your pandas DataFrame is very important because every value in a column is of the same type. For instance, if you select a single value from a column that has an integer data type, you are guaranteed that this value is also an integer. The data type of a column is one of the most fundamental pieces of knowledge about your DataFrame.
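
The guarantee for integer columns can be checked with a short sketch. The `trip_count` column is a hypothetical example. Note that the selected value is a NumPy integer rather than a built-in Python `int`.

```python
import numpy as np
import pandas as pd

# Hypothetical integer column, named for illustration only
df = pd.DataFrame({"trip_count": [3, 10, 7]})

# Select a single value by integer position
value = df["trip_count"].iloc[0]

print(df["trip_count"].dtype)
print(type(value))
print(isinstance(value, np.integer))
```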

### A major exception with the object data type

The object data type is, unfortunately, an exception to the information in the previous section. Although the values in object columns are typically strings, there is no guarantee that each value will be a string. You could very well have an integer, a list, or even another DataFrame as a value in the same object column.
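
To make this exception concrete, here is a small sketch of an object Series that deliberately mixes several Python types. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical object column mixing several Python types
mixed = pd.Series(["a string", 5, 2.5, [1, 2, 3], {"key": "value"}])

print(mixed.dtype)

# Each value keeps its own Python type inside the object column
value_types = [type(v).__name__ for v in mixed]
print(value_types)
```

The data type alone tells you nothing about the individual values here, which is why object columns deserve an extra look.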

## Missing Value Representation

### `NaN`,  `None`, and `NaT`

pandas represents missing values differently based on the data type of the column.

* `NaN` - Stands for not a number and is a float data type
* `None` - The literal Python object `None` and only found in object columns
* `NaT` - Stands for not a time and is used for missing values in datetime columns

### Missing values for each data type

* **boolean and integer** - No representation for missing values exists for boolean and integer columns. This is an unfortunate limitation.
* **float** - Uses `NaN` as the missing value.
* **datetime** - Only uses `NaT` as the missing value.
* **object** - Can contain any Python object so all three of the missing value representations may appear in these columns, but typically you will encounter `NaN` or `None`.
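
The representations listed above can be seen side by side in a small sketch. The columns below are made up for illustration, with one missing value placed in each.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, each holding exactly one missing value
df = pd.DataFrame({
    "distance": [1.5, np.nan, 0.8],                                # float column
    "start": pd.to_datetime(["2013-06-28", None, "2013-06-30"]),   # datetime column
    "notes": ["late", None, 7],                                    # object column
})

print(df.dtypes)
print(type(np.nan))          # NaN itself is a float
print(df["start"].iloc[1])   # NaT
print(df["notes"].iloc[1])   # None

# isna treats all three representations as missing
print(df.isna().sum())
```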

### Missing values in boolean and integer columns

Knowing that a column is either a boolean or integer column guarantees that there are no missing values in that column as pandas does not allow for it. If, for instance, you would like to place a missing value in a boolean or integer column, then pandas converts the entire column to float. This is because a float column can accommodate missing values. When booleans are converted to floats, `False` becomes 0 and `True` becomes 1.
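
The integer case can be sketched as follows. Reindexing is only one assumed way of introducing a missing value; any operation that creates one would trigger the same conversion.

```python
import pandas as pd

ints = pd.Series([3, 10, 7])
print(ints.dtype)

# Label 3 does not exist, so reindexing introduces a missing value
# and pandas converts the integer data to float to hold NaN
with_missing = ints.reindex([0, 1, 2, 3])
print(with_missing.dtype)
print(with_missing)
```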

### Integer NaN update for pandas 0.24

With the release of pandas version 0.24 (January 2019), missing value representation became possible for a special kind of integer data type called **Int64Dtype**. There is still no missing value representation for the default integer data type.
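
A minimal sketch of the difference, assuming pandas 0.24 or later. The nullable type is requested with the string `"Int64"` (capital "I"), which is distinct from the default `int64`.

```python
import pandas as pd

# The capital "I" selects the nullable integer type, not the default int64
nullable = pd.Series([3, None, 7], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().sum())

# The same data without an explicit dtype falls back to float
default = pd.Series([3, None, 7])
print(default.dtype)
```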

## Finding the data type of each column

The `dtypes` DataFrame attribute (NOT a method) returns the data type of each column and is one of the first commands you should execute after reading in your data. Let's begin by using the `read_csv` function to read in the bikes dataset. 

In [None]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(3)

Let's get the data types of each column in our `bikes` DataFrame. The returned object is a Series with the data types as the values and the column names as the index.

In [None]:
bikes.dtypes

### Why do `starttime` and `stoptime` have object as the data type?

From the visual display of the bikes DataFrame above, it appears that both the `starttime` and `stoptime` columns are datetimes. The result of the `dtypes` attribute shows that they are objects (strings).

The `read_csv` function reads datetime columns in as strings unless you provide a list of those columns to the `parse_dates` parameter. Let's reread the data using the `parse_dates` parameter.

In [None]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.dtypes.head()
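
If you do not have the bikes file at hand, the effect of `parse_dates` can be reproduced with a small in-memory CSV. The text below is a hypothetical stand-in that borrows the `starttime` and `stoptime` column names. Its rows are invented.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file with two datetime-like columns
csv_text = """tripduration,starttime,stoptime
993,2013-06-28 19:01:00,2013-06-28 19:17:00
623,2013-06-28 22:53:00,2013-06-28 23:03:00
"""

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["starttime", "stoptime"])

print(raw.dtypes)      # starttime and stoptime are read in as strings
print(parsed.dtypes)   # starttime and stoptime are now datetimes

# Datetime columns support date arithmetic that strings do not
print(parsed["stoptime"] - parsed["starttime"])
```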

### What are all those 64's at the end of the data types?

Booleans, integers, floats, and datetimes all use a particular amount of memory for each of their values. The memory is measured in **bits**. The number of bits used for each value is the number appended to the end of the data type name. For instance, integers can be 8, 16, 32, or 64 bits, while floats can be 16, 32, 64, or 128 bits. A 128-bit float column will show up as `float128`.

Technically a `float128` is a different data type than a `float64` but generally you will not have to worry about such a distinction as the operations between different float columns will be the same. 

**Booleans** are always stored as 8 bits. There is no set bit size for values in an **object** column as each value can be of any size.
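
The bit sizes can be inspected directly. The sketch below uses made-up columns and the `astype` method to request smaller types, then reads the `itemsize` attribute of each data type, which reports bytes per value.

```python
import pandas as pd

# Hypothetical toy columns, named for illustration only
df = pd.DataFrame({
    "count": [3, 10, 7],
    "distance": [1.5, 2.25, 0.8],
    "flag": [True, False, True],
})

small_ints = df["count"].astype("int8")            # 8-bit integers
small_floats = df["distance"].astype("float32")    # 32-bit floats

# itemsize is the number of bytes per value; multiply by 8 to get bits
for column in (df["count"], small_ints, df["distance"], small_floats, df["flag"]):
    print(column.dtype, column.dtype.itemsize * 8)
```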

## Getting more metadata

**Metadata** can be defined as data about the data. The data type of each column is an example of metadata. The number of rows and columns is another piece of metadata. We find this with the `shape` attribute, which returns a tuple of integers.

In [None]:
bikes.shape

### Total number of values with the `size` attribute
The `size` attribute returns the total number of values (the number of columns multiplied by the number of rows) in the DataFrame.

In [None]:
bikes.size
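
The relationship between `shape` and `size` can be verified on a small made-up DataFrame:

```python
import pandas as pd

# Hypothetical toy data with 3 rows and 2 columns
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape unpacks into the number of rows and the number of columns
n_rows, n_cols = df.shape
print(df.shape)
print(df.size)
print(n_rows * n_cols == df.size)
```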

### Get data types plus more with the `info` method
The `info` DataFrame method provides output similar to `dtypes` but adds more information. Its output includes:

* The type of object (always a DataFrame)
* The type of index and number of rows
* The number of columns
* The data type of each column and its number of non-missing (a.k.a. non-null) values
* The frequency count of all data types
* The total memory usage

In [None]:
bikes.info()
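
To see how missing values affect the non-null counts, here is a sketch on made-up data. It passes a text buffer to the `buf` parameter of `info` so the summary can be kept as a string, and cross-checks the counts with the `count` method.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
df = pd.DataFrame({
    "distance": [1.5, np.nan, 0.8],
    "start": pd.to_datetime(["2013-06-28", "2013-06-29", None]),
})

# buf redirects the summary text into a buffer instead of the screen
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# count reports the same non-missing totals, one per column
print(df.count())
```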

## More data types

There are several more data types available in pandas. An extensive and formal discussion on all data types is available in the chapter **Changing Data Types** from the **Essential Commands** part.

## Exercises
Use the `bikes` DataFrame for the following:

### Exercise 1
<span  style="color:green; font-size:16px">What type of object is returned from the `dtypes` attribute?</span>

### Exercise 2
<span  style="color:green; font-size:16px">What type of object is returned from the `shape` attribute?</span>

### Exercise 3
<span style="color:green; font-size:16px">What type of object is returned from the `info` method?</span>

### Exercise 4
<span  style="color:green; font-size:16px">The memory usage reported by the `info` method is not accurate when your DataFrame has object columns. Read its docstring and find the true memory usage.</span>