# Optimizing Storage: NumPy Data Types

Now that you have a bit more practical experience, it's time to go back to theory and look at data types. Data types don't play a central role in a lot of Python code. Numbers work like they're supposed to, strings do other things, Booleans are true or false, and other than that, you make your own objects and collections.

In NumPy, though, there's a little more detail that needs to be covered. NumPy uses C code under the hood to optimize performance, and it can't do that unless all the items in an array are of the same type. That doesn't just mean the same Python type. They have to be the same underlying C type, with the same shape and size in bits!
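As a small illustration of the same-type requirement (a minimal sketch; the variable name `mixed` is only for this example), NumPy promotes a list that mixes integers and floats to one common type:

```python
import numpy as np

# A Python list can mix ints and floats, but an array cannot:
# NumPy promotes every element to one common type.
mixed = np.array([1, 2.5, 3])

print(mixed.dtype)     # float64
print(mixed.itemsize)  # 8 bytes per element
print(mixed.nbytes)    # 24 bytes for the three elements
```

Every element occupies the same number of bytes, which is what lets the underlying C code step through the array at a fixed stride.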

Python defines only one type of a particular data class (there is only one integer type, one floating-point type, etc.). This can be convenient in applications that don't need to be concerned with all the ways data can be represented in a computer. For scientific computing, however, more control is often needed.

In NumPy, there are 24 new fundamental Python types to describe different types of scalars. These type descriptors are mostly based on the types available in the C language that CPython is written in, with several additional types compatible with Python's types.

## Numerical Types: int, bool, float, and complex
Since most of your data science and numerical calculations will tend to involve numbers, they seem like the best place to start. There are essentially four numerical types in NumPy code, and each one can take a few different sizes.

The table below breaks down the details of these types:

|Name|# of Bits|Python Type|NumPy Type|
|:--|:--|:--|:--|
|Integer|64|int|np.int_|
|Boolean|8|bool|np.bool_|
|Float|64|float|np.float64|
|Complex|128|complex|np.complex128|

These are just the types that map to existing Python types. NumPy also has types for the smaller-sized versions of each, like 8-, 16-, and 32-bit integers, 32-bit single-precision floating-point numbers, and 64-bit single-precision complex numbers. The documentation lists them in their entirety.
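To get a feel for the sizes involved, the following sketch prints the width of a few of these types and compares the memory used by two arrays of a million zeros. The array names `small` and `large` are illustrative only.

```python
import numpy as np

# Print the width in bits of several fixed-size numerical types.
for scalar_type in (np.int8, np.int16, np.int32, np.int64,
                    np.float32, np.float64, np.complex64, np.complex128):
    dt = np.dtype(scalar_type)
    print(f"{dt.name:<12}{dt.itemsize * 8:>4} bits")

# Smaller integer types trade range for memory.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127

small = np.zeros(1_000_000, dtype=np.int8)
large = np.zeros(1_000_000, dtype=np.int64)
print(small.nbytes, large.nbytes)  # 1000000 8000000
```

Choosing the smallest type that safely holds your values can cut memory use considerably, at the cost of a narrower range.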

To specify the type when creating an array, you can provide a `dtype` argument:

In [7]:
import numpy as np

In [8]:
np.array([1, 3, 5.5, 7.7, 9.2], dtype=np.single)

array([1. , 3. , 5.5, 7.7, 9.2], dtype=float32)

In [11]:
np.array([1, 3, 5.5, 7.7, 9.2], dtype=np.uint8)

array([1, 3, 5, 7, 9], dtype=uint8)

NumPy automatically converts your platform-independent type `np.single` to whatever fixed-size type your platform supports for that size. In this case, it uses `np.float32`. If your provided values don't match the shape of the `dtype` you provided, then NumPy will either fix it for you or raise an error.
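Both outcomes can be seen in a short sketch. The input values here are made up for illustration:

```python
import numpy as np

# Floats that fit are converted; the fractional part is dropped.
converted = np.array([1.9, 2.1, -3.7], dtype=np.int32)
print(converted)  # [ 1  2 -3]

# Values with no sensible conversion raise an error instead.
raised = False
try:
    np.array(["not a number"], dtype=np.float64)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

Note that the conversion to an integer type truncates toward zero rather than rounding, so check your values first if the fractional part matters.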

### String Types: Sized Unicode

Strings behave a little strangely in NumPy code because NumPy needs to know how many bytes to expect, which isn't usually a factor in Python programming. Luckily, NumPy does a pretty good job at taking care of less complex cases for you:

In [5]:
import numpy as np

In [6]:
names = np.array(["bob", "amy", "han"], dtype=str)
names

array(['bob', 'amy', 'han'], dtype='<U3')

In [7]:
names.itemsize

12

In [8]:
names = np.array(["bob", "amy", "han"])
names

array(['bob', 'amy', 'han'], dtype='<U3')

In [9]:
more_names = np.array(["bobo", "jehosephat"])

In [10]:
more_names.dtype, more_names.itemsize

(dtype('<U10'), 40)

In [11]:
np.concatenate((names, more_names))

array(['bob', 'amy', 'han', 'bobo', 'jehosephat'], dtype='<U10')

In `names`, you provide a `dtype` of Python's built-in `str` type, but in its output, it's been converted into a little-endian Unicode string of size 3. When you check the size of a given item with `names.itemsize`, you see that they're each 12 bytes: three 4-byte Unicode characters.

> **Note:** When dealing with NumPy data types, you have to think about things like the endianness of your values. In this case, the dtype `'<U3'` means that each value is the size of three Unicode characters, with the least-significant byte stored first in memory and the most-significant byte stored last. A `dtype` of `'>U3'` would signify the reverse.
>
>As an example, NumPy represents the Unicode character “🐍” with the bytes `0x0D 0xF4 0x01 0x00` with a `dtype` of `'<U1'` and `0x00 0x01 0xF4 0x0D` with a `dtype` of `'>U1'`. Try it out by creating an array full of emoji, setting the `dtype` to one or the other, and then calling `.tobytes()` on your array!
>
>If you'd like to study up on how Python treats the ones and zeros of your normal Python data types, then the official documentation for the [struct library](https://docs.python.org/3/library/struct.html#struct-alignment), which is a standard library module that works with raw bytes, is another good resource.
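The byte-order experiment suggested in the note can be sketched as follows. It uses the snake emoji (code point U+1F40D) and a single-element array; the variable names are illustrative.

```python
import numpy as np

snake = "\N{SNAKE}"  # the snake emoji, code point U+1F40D

# Same character, stored with opposite byte orders.
little = np.array([snake], dtype="<U1").tobytes()
big = np.array([snake], dtype=">U1").tobytes()

print(little.hex(" "))  # 0d f4 01 00
print(big.hex(" "))     # 00 01 f4 0d

# NumPy's Unicode storage matches Python's UTF-32 encoding.
print(little == snake.encode("utf-32-le"))  # True
```

The four bytes are simply the code point `0x0001F40D` written least-significant byte first or most-significant byte first.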

When you combine that with an array that has a larger item to create a new array with `np.concatenate()`, NumPy helpfully figures out how big the new array's items need to be and grows them all to size `<U10`.

But here's what happens when you try to modify one of the slots with a value larger than the capacity of the dtype:

In [12]:
names[2] = "jamima"

names

array(['bob', 'amy', 'jam'], dtype='<U3')

It doesn't work as expected and truncates your value instead. If you already have an array, then NumPy's automatic size detection won't work for you. You get three characters and that's it. The rest get lost in the void.

This is all to say that, in general, NumPy has your back when you're working with strings, but you should always keep an eye on the size of your elements and make sure you have enough space when modifying or changing arrays in place.
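One way to make room before assigning a longer string is to convert the array to a wider string type first. This is a sketch of one possible approach rather than the only one; the width of 10 characters is an arbitrary choice for the example.

```python
import numpy as np

names = np.array(["bob", "amy", "han"])  # items hold 3 characters each

# astype() returns a copy whose items have room for 10 characters.
wider = names.astype("U10")
wider[2] = "jamima"

print(wider)           # ['bob' 'amy' 'jamima']
print(wider.itemsize)  # 40 bytes: ten 4-byte characters
```

The original `names` array is left untouched, so the extra memory is only paid for the copy that needs it.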

### Structured Arrays

Originally, you learned that array items all have to be the same data type, but that wasn't entirely correct. NumPy has a special kind of array, called a **record array** or **structured array**, with which you can specify a type and, optionally, a name on a per-column basis. This makes sorting and filtering even more powerful, and it can feel similar to working with data in Excel, CSVs, or relational databases.

In [None]:
import numpy as np

In [13]:
data = np.array([
    ("joe", 32, 6),
    ("mary", 15, 20),
    ("felipe", 80, 100),
    ("beyonce", 38, 9001),
], dtype=[("name", str, 10), ("age", int), ("power", int)])

In [14]:
data[0]

('joe', 32, 6)

In [15]:
data["name"]

array(['joe', 'mary', 'felipe', 'beyonce'], dtype='<U10')

In [16]:
data[data["power"] > 9000]["name"]

array(['beyonce'], dtype='<U10')

By defining `data`, you create an array, except each item is a `tuple` with a name, an age, and a power level. For the `dtype`, you actually provide a list of tuples with the information about each field: name is a 10-character Unicode field, and both age and power are standard 4-byte or 8-byte integers.

In `data[0]`, you can see that the rows, known as records, are still accessible using the index.

In `data['name']`, you see a new syntax for accessing an entire column, or field.

Finally, in `data[data["power"] > 9000]["name"]`, you see a super-powerful combination of mask-based filtering based on a field and field-based selection. Notice how it reads much like the following SQL query:

```sql
SELECT name FROM data WHERE power > 9000;
```

In both cases, the result is a list of names where the power level is over 9000.
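The SQL analogy extends to compound conditions. As a sketch, a query such as `WHERE age > 20 AND power < 1000` could be written by combining two masks with `&`. The array is rebuilt here so the example runs on its own, and the name `mask` is illustrative.

```python
import numpy as np

data = np.array([
    ("joe", 32, 6),
    ("mary", 15, 20),
    ("felipe", 80, 100),
    ("beyonce", 38, 9001),
], dtype=[("name", str, 10), ("age", int), ("power", int)])

# Each comparison yields a Boolean array; & combines them element-wise.
# The parentheses are required because & binds tighter than > and <.
mask = (data["age"] > 20) & (data["power"] < 1000)

print(data[mask]["name"])  # ['joe' 'felipe']
```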

You can even add in `ORDER BY` functionality by making use of `np.sort()`:

In [20]:
np.sort(data[data["age"] > 20], order="power")["name"]

array(['joe', 'felipe', 'beyonce'], dtype='<U10')

This sorts the data by power before retrieving it, which rounds out your selection of NumPy tools for selecting, filtering, and sorting items just like you might in SQL!
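`np.sort()` always sorts in ascending order, so one simple way to mimic `ORDER BY ... DESC` is to reverse the sorted result. The sketch below rebuilds the array so it is self-contained; `by_age` is an illustrative name.

```python
import numpy as np

data = np.array([
    ("joe", 32, 6),
    ("mary", 15, 20),
    ("felipe", 80, 100),
    ("beyonce", 38, 9001),
], dtype=[("name", str, 10), ("age", int), ("power", int)])

# Sort the records by the "age" field (ascending).
by_age = np.sort(data, order="age")
print(by_age["name"])        # ['mary' 'joe' 'beyonce' 'felipe']

# Reverse the sorted records to get descending order.
print(by_age[::-1]["name"])  # ['felipe' 'beyonce' 'joe' 'mary']
```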

## Data Type Objects (`dtype`)

A data type object (an instance of the `numpy.dtype` class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. It describes the following aspects of the data:

- Type of the data (integer, float, Python object, etc.)
- Size of the data (how many bytes are in, e.g., the integer)
- Byte order of the data (little-endian or big-endian)
- If the data type is a structured data type, that is, an aggregate of other data types (e.g., describing an array item consisting of an integer and a float):
  - the names of the "fields" of the structure, by which they can be accessed,
  - the data type of each field, and
  - which part of the memory block each field takes.
- If the data type is a sub-array, its shape and data type.
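Most of these aspects can be read directly from attributes of a `dtype` object. The following sketch uses a made-up structured type with three hypothetical fields (`id`, `score`, and `pos`) to show where each piece of information lives:

```python
import numpy as np

# Hypothetical structured type: an integer, a float, and a 3-element sub-array.
dt = np.dtype([("id", np.int32), ("score", np.float64), ("pos", np.float32, (3,))])

print(dt.kind)             # V: structured types are based on the void type
print(dt.itemsize)         # 24 bytes = 4 + 8 + 3 * 4
print(dt.names)            # ('id', 'score', 'pos')
print(dt.fields["score"])  # (dtype('float64'), 4): field type and byte offset
print(dt["pos"].shape)     # (3,)
print(dt["pos"].base)      # float32
```

The offset of 4 for `score` shows that, by default, fields are packed one after another without padding.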

To describe the type of scalar data, there are several built-in scalar types in NumPy for various precisions of integers, floating-point numbers, etc. An item extracted from an array, e.g., by indexing, will be a Python object whose type is the scalar type associated with the data type of the array.

> **Note:** The scalar types are not dtype objects, even though they can be used in place of one whenever a data type specification is needed in NumPy.
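The distinction between scalar types and dtype objects can be checked interactively. A brief sketch, with illustrative variable names:

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int16)
item = arr[0]

# Indexing yields an array scalar whose type matches the array's dtype.
print(type(item) is np.int16)           # True

# The scalar type is a class, not a dtype instance...
print(isinstance(np.int16, np.dtype))   # False

# ...but it is converted to one wherever a dtype is expected.
print(np.dtype(np.int16) == arr.dtype)  # True
print(arr.dtype.type is np.int16)       # True
```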

Structured data types are formed by creating a data type whose fields contain other data types. Each field has a name by which it can be accessed. The parent data type should be of sufficient size to contain all its fields; the parent is nearly always based on the void type which allows an arbitrary item size. Structured data types may also contain nested structured sub-array data types in their fields.

Finally, a data type can describe items that are themselves arrays of items of another data type. These sub-arrays must, however, be of a fixed size.

If an array is created using a data-type describing a sub-array, the dimensions of the sub-array are appended to the shape of the array when the array is created. Sub-arrays in a field of a structured type behave differently; see [Field Access Numpy Documentation](https://numpy.org/doc/stable/reference/arrays.indexing.html#arrays-indexing-fields).

Sub-arrays always have a C-contiguous memory layout.
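The difference between a bare sub-array dtype and a sub-array inside a structured field shows up in the resulting shapes. A minimal sketch, using a hypothetical field name `m`:

```python
import numpy as np

# A bare sub-array dtype: its dimensions are appended to the array's shape.
sub = np.dtype((np.int32, (2, 2)))
plain = np.zeros(3, dtype=sub)
print(plain.shape, plain.dtype)  # (3, 2, 2) int32

# The same sub-array as a field of a structured type: the array keeps
# its own shape, and the extra dimensions appear on field access.
structured = np.zeros(3, dtype=[("m", np.int32, (2, 2))])
print(structured.shape)       # (3,)
print(structured["m"].shape)  # (3, 2, 2)
```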

A simple data type containing a 32-bit big-endian integer:

In [15]:
dt = np.dtype('>i4')

In [17]:
dt.byteorder

'>'

In [18]:
dt.itemsize

4

In [19]:
dt.name

'int32'

In [20]:
dt.type is np.int32

True

A structured data type containing a 16-character string (in field `name`) and a sub-array of two 64-bit floating-point numbers (in field `grades`):

In [21]:
dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])  # np.unicode_ was removed in NumPy 2.0

In [22]:
dt['name']

dtype('<U16')

In [23]:
dt['grades']

dtype(('<f8', (2,)))

Items of an array of this data type are wrapped in an array scalar type that also has two fields:

In [24]:
x = np.array([('Sarah', (8.0, 7.0)), ('John', (6.0, 7.0))], dtype=dt)

In [25]:
x[1]

('John', [6., 7.])

In [26]:
x[1]['grades']

array([6., 7.])

In [28]:
type(x[1])

numpy.void

In [29]:
type(x[1]['grades'])

numpy.ndarray

## Specifying and Constructing Data Types

Whenever a data-type is required in a NumPy function or method, either a dtype object or something that can be converted to one can be supplied. Such conversions are done by the dtype constructor:

In [30]:
dt = np.dtype(np.int32)      # 32-bit integer

In [31]:
dt = np.dtype(np.complex128) # 128-bit complex floating-point number
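Besides scalar types, the constructor also accepts strings such as type codes and type names. As a sketch, the following three spellings all describe the same 32-bit integer type (the variable names are illustrative):

```python
import numpy as np

from_type = np.dtype(np.int32)
from_code = np.dtype("i4")     # 'i' = signed integer, 4 bytes
from_name = np.dtype("int32")

print(from_type == from_code == from_name)  # True

# Functions that take a dtype argument run the same conversion.
arr = np.zeros(3, dtype="f8")
print(arr.dtype)  # float64
```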

Several Python types are equivalent to a corresponding array scalar when used to generate a dtype object:

|Built-in Python type|NumPy type|
|:--|:--|
|`int`|`int_`|
|`bool`|`bool_`|
|`float`|`double`|
|`complex`|`cdouble`|
|`bytes`|`bytes_`|
|`str`|`str_`|
|`memoryview`|`void`|
|all others|`object_`|
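These equivalences can be verified by passing the built-in Python types to the dtype constructor. A short sketch:

```python
import numpy as np

print(np.dtype(int) == np.int_)            # True
print(np.dtype(bool) == np.bool_)          # True
print(np.dtype(float) == np.float64)       # True
print(np.dtype(complex) == np.complex128)  # True
print(np.dtype(object) == np.object_)      # True

# str and bytes map to flexible-size types, identified by their kind.
print(np.dtype(str).kind)    # U
print(np.dtype(bytes).kind)  # S
```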

### More on Data Types

This section of the tutorial was designed to get you just enough knowledge to be productive with NumPy's data types, understand a little of how things work under the hood, and recognize some common pitfalls. It's certainly not an exhaustive guide. The [NumPy documentation on `ndarrays`](https://numpy.org/doc/stable/reference/arrays.ndarray.html#internal-memory-layout-of-an-ndarray) has tons more resources.

There's also a lot more information on [`dtype` objects](https://numpy.org/doc/stable/reference/arrays.dtypes.html), including the different ways to construct, customize, and optimize them and how to make them more robust for all your data-handling needs. If you run into trouble and your data isn't loading into arrays exactly how you expected, then that's a good place to start.

Lastly, the NumPy `recarray` is a powerful object in its own right, and you've really only scratched the surface of the capabilities of structured datasets. It's definitely worth reading through the [`recarray` documentation](https://numpy.org/doc/stable/reference/generated/numpy.recarray.html) as well as the documentation for the other specialized array [subclasses](https://numpy.org/doc/stable/reference/arrays.classes.html) that NumPy provides.
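As a small taste of what `recarray` adds, the sketch below views a structured array as a record array so that fields become attributes. The sample data and the name `rec` are illustrative only.

```python
import numpy as np

data = np.array([
    ("joe", 32, 6),
    ("mary", 15, 20),
], dtype=[("name", str, 10), ("age", int), ("power", int)])

# View the same memory as a record array: fields become attributes.
rec = data.view(np.recarray)

print(rec.name)                  # ['joe' 'mary']
print(rec.age)                   # [32 15]
print(rec[rec.power > 10].name)  # ['mary']
```

Because `view()` does not copy the data, `rec` and `data` share the same underlying memory.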