# Structured Data: NumPy's Structured Arrays

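So far we've mostly dealt with arrays whose elements all share a single (homogeneous) data type.

Of course, some data is heterogeneous by nature: a single record may combine text, integers, and floats.

NumPy handles this via *structured arrays*, which are arrays with compound data types.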

In [1]:
import numpy as np

In [2]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

This is a bit clumsy: nothing tells us that the three lists are related. A single structured array can hold all of them.

In [3]:
# Recall the following to create a simple array
x = np.zeros(4, dtype=int)

In [14]:
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)

[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]


- `U10` translates to "Unicode string of maximum length 10," 
- `i4` translates to "4-byte (i.e., 32 bit) integer," 
- `f8` translates to "8-byte (i.e., 64 bit) float."
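
To make these format codes concrete, the short sketch below inspects how many bytes each field occupies. The name `person_dtype` is only illustrative, and the sizes rely on NumPy storing each Unicode character in 4 bytes.

```python
import numpy as np

# Same compound dtype as above, built on its own
person_dtype = np.dtype({'names': ('name', 'age', 'weight'),
                         'formats': ('U10', 'i4', 'f8')})

# Bytes used by each field within one record
field_sizes = {field: person_dtype[field].itemsize
               for field in person_dtype.names}
print(field_sizes)            # per-field sizes in bytes
print(person_dtype.itemsize)  # total bytes per record (fields are packed by default)
```

A `U10` field therefore takes 40 bytes, so one record occupies 40 + 4 + 8 = 52 bytes.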



In [15]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]


In [6]:
# We can refer to values by name
# Get all names
data['name']

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

In [7]:
# or refer to values by index
data[0]

('Alice', 25, 55.)

In [8]:
# Get the name from the last row
data[-1]['name']

'Doug'

In [9]:
# Use Boolean masks to perform sophisticated operations like filtering
# Get names where age is under 30
data[data['age'] < 30]['name']

array(['Alice', 'Doug'], dtype='<U10')
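
As a further sketch along the same lines, masks on several fields can be combined, and whole records can be sorted by one field using the `order` argument of `np.sort`. The snippet rebuilds `data` so that it runs on its own; the thresholds are arbitrary illustrations.

```python
import numpy as np

# Rebuild the example array so this snippet runs on its own
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

# Combine two conditions with & (the parentheses are required)
mask = (data['age'] < 40) & (data['weight'] > 60)
selected_names = data[mask]['name']
print(selected_names)

# Sort whole records by a single field
by_age = np.sort(data, order='age')
print(by_age['name'])
```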

For anything more complicated, you should really use the `pandas` package.

It provides `DataFrame` objects, which are built on NumPy arrays and offer far more functionality.

## Creating Structured Arrays

There are several ways to specify a structured data type:

In [10]:
# the dictionary method
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})

dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

In [11]:
# Python types and NumPy dtypes can be used too
# (`int` maps to the platform's default integer, so 'age' may show as '<i8' instead)
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})

dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f4')])

In [12]:
# the list-of-tuples method (note that 'S10' is a 10-byte string, not Unicode)
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])

In [13]:
# if the field names are not important, a comma-separated string of formats also works
np.dtype('S10,i4,f8')

dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])
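
Any of these dtypes can be passed straight to `np.array` along with a list of tuples, one tuple per record. The sketch below, with illustrative variable names, also shows the practical difference between the `S10` and `U10` formats used above: the former holds bytes, the latter text.

```python
import numpy as np

# 'S10' stores raw bytes, 'U10' stores Unicode text
bytes_dtype = np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
text_dtype = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])

# Each record is a tuple whose items follow the field order
as_bytes = np.array([(b'Alice', 25, 55.0), (b'Bob', 45, 85.5)],
                    dtype=bytes_dtype)
as_text = np.array([('Alice', 25, 55.0), ('Bob', 45, 85.5)],
                   dtype=text_dtype)

print(as_bytes['name'])  # byte strings
print(as_text['name'])   # ordinary text strings

# Unnamed fields get the default names f0, f1, f2
default_names = np.dtype('S10,i4,f8').names
print(default_names)
```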

## On to Pandas...

We've purposely looked at structured arrays last.

This leads on well to the next section of the course, on Pandas.

For day-to-day structured data, Pandas is friendlier.

![image.png](attachment:image.png)