In [111]:
# discography = np.array(
#     [('David Bowie', np.datetime64('1969-11-14'), 17),
#      ('The Man Who Sold the World', np.datetime64('1970-11-04'), 3),
#      ('Hunky Dory', np.datetime64('1971-12-17'), 5),
#      ('Ziggy Stardust', np.datetime64('1972-06-16'), 1),
#      ('Aladdin Sane', np.datetime64('1973-04-13'), 1),
#      ('Pin Ups', np.datetime64('1973-10-19'), 1),
#      ('Diamond Dogs', np.datetime64('1974-05-24'), 1),
#      ('Young Americans', np.datetime64('1975-03-07'), 2),
#      ('Station To Station', np.datetime64('1976-01-23'), 5),
#      ('Low', np.datetime64('1977-01-14'), 2),
#      ('Heroes', np.datetime64('1977-10-14'), 3),
#      ('Lodger', np.datetime64('1979-05-18'), 4)],
    
#      dtype=[('title','U32'), ('release', 'M8[D]'), ('toprank', 'i8')])

# np.save('discography.npy', discography)

# 03_05: Special arrays

In [1]:
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

In the last video for this chapter, I want to show you two NumPy features that are not always covered in tutorials but are still very useful.

One is _record arrays_ (the NumPy documentation calls them _structured arrays_), which allow us to mix different data types and give descriptive names to fields. We'll see a much more powerful version of this in the `pandas` library, but sometimes it's convenient to stay within NumPy.

The other feature is `datetime64` objects, which (as the name says) can encode a date and time.

I will load a simple example of a record array, which I have saved in the numpy format.

In [4]:
discography = np.load('discography.npy')

Let's have a look. This is a partial David Bowie discography. Each entry shows a record's name, date of release, and top rank in the UK charts.

The datatype is a list that shows the name and dtype of each field. For `title`, it's `U32`, which denotes a Unicode string of up to 32 characters; for `release`, it's `M8[D]`, which denotes a NumPy datetime object with a precision of one day (the precision can be much finer, down to nanoseconds and beyond); the 8 is there because each datetime object takes 8 bytes; last, `toprank` is `i8`, an 8-byte integer.

(If you are wondering about the "less than" symbols, they refer to the _endianness_ of the datatypes, that is, the order in which bytes are stored in memory. On Intel processors, they are little-endian.)

In [36]:
discography

array([('David Bowie', '1969-11-14', 17),
       ('The Man Who Sold the World', '1970-11-04',  3),
       ('Hunky Dory', '1971-12-17',  5),
       ('Ziggy Stardust', '1972-06-16',  1),
       ('Aladdin Sane', '1973-04-13',  1), ('Pin Ups', '1973-10-19',  1),
       ('Diamond Dogs', '1974-05-24',  1),
       ('Young Americans', '1975-03-07',  2),
       ('Station To Station', '1976-01-23',  5),
       ('Low', '1977-01-14',  2), ('Heroes', '1977-10-14',  3),
       ('Lodger', '1979-05-18',  4)],
      dtype=[('title', '<U32'), ('release', '<M8[D]'), ('toprank', '<i8')])
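The field layout described above can also be inspected programmatically. The following is a minimal sketch that rebuilds the same dtype by hand (so it does not need `discography.npy`) and queries it; the variable name `dt` is just for illustration.

```python
import numpy as np

# Same field layout as the discography array
dt = np.dtype([('title', 'U32'), ('release', 'M8[D]'), ('toprank', 'i8')])

print(dt.names)                # field names, in order
print(dt['title'].itemsize)    # 32 characters, 4 bytes each
print(dt['release'].itemsize)  # 8 bytes per datetime64
print(dt.itemsize)             # total bytes per record
```

Each Unicode character takes 4 bytes, so the `title` field dominates the size of every record.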

Of course we could also load data from a text file such as this:

In [29]:
!cat discography.txt

# title, release, toprank
David Bowie, 1969-11-14, 17
The Man Who Sold the World, 1970-11-04, 3
Hunky Dory, 1971-12-17, 5
Ziggy Stardust, 1972-06-16, 1
Aladdin Sane, 1973-04-13, 1
Pin Ups, 1973-10-19, 1
Diamond Dogs, 1974-05-24, 1
Young Americans, 1975-03-07, 2
Station To Station, 1976-01-23, 5
Low, 1977-01-14, 2
Heroes, 1977-10-14, 3
Lodger, 1979-05-18, 4


Loading these data takes a bit more work because we have to specify the `dtype` of every field, as well as the field delimiter. `names=True` lets us grab the field names from the file header.

In [37]:
discography_txt = np.genfromtxt('discography.txt', dtype=('U32', 'M8[D]', 'i8'), delimiter=',', names=True)

In [38]:
discography_txt == discography

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

So what can we do with a record array? Each record looks like a Python tuple, and we can extract its elements by position as we would for a tuple. Unlike a tuple, a record also lets us assign to its elements.

In [42]:
discography[0]

('David Bowie', '1969-11-14', 17)

In [43]:
discography[0][0]

'David Bowie'

In [44]:
discography[0][1]

numpy.datetime64('1969-11-14')

But we can also use a dictionary-style interface, indexing by field name.

In [45]:
discography[0]['title']

'David Bowie'

Using the field names on the entire array will give us a view of a column.

In [46]:
discography['title']

array(['David Bowie', 'The Man Who Sold the World', 'Hunky Dory',
       'Ziggy Stardust', 'Aladdin Sane', 'Pin Ups', 'Diamond Dogs',
       'Young Americans', 'Station To Station', 'Low', 'Heroes', 'Lodger'],
      dtype='<U32')
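Because a column is a view, it can be used both to select records and to modify the original array. Here is a small sketch on a three-album excerpt of the discography; the array name `albums` and the replacement rank of 3 are made up for illustration.

```python
import numpy as np

albums = np.array([('Hunky Dory', '1971-12-17', 5),
                   ('Ziggy Stardust', '1972-06-16', 1),
                   ('Aladdin Sane', '1973-04-13', 1)],
                  dtype=[('title', 'U32'), ('release', 'M8[D]'), ('toprank', 'i8')])

# A boolean mask built from one column selects whole records
number_ones = albums[albums['toprank'] == 1]['title']
print(number_ones)

# Writing through the column view changes the original array
# (the value 3 is arbitrary, not real chart data)
ranks = albums['toprank']
ranks[0] = 3
print(albums[0])
```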

To create a record array, we have to provide the data records as tuples, and be careful about describing the datatypes. For our discography, it looks like this.

In [68]:
np.array([('David Bowie', '1969-11-14', 17),
          ('The Man Who Sold the World', '1970-11-04', 3)],
         dtype = [('title', 'U32'), ('release', 'M8[D]'), ('toprank', 'i8')])

array([('David Bowie', '1969-11-14', 17),
       ('The Man Who Sold the World', '1970-11-04',  3)],
      dtype=[('title', '<U32'), ('release', '<M8[D]'), ('toprank', '<i8')])
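If the data start out as separate columns rather than as records, one possible approach is to zip the columns into tuples first. This is only a sketch; the names `titles`, `releases`, `ranks`, and `late_seventies` are illustrative.

```python
import numpy as np

dt = [('title', 'U32'), ('release', 'M8[D]'), ('toprank', 'i8')]

# Three columns of equal length
titles = ['Low', 'Heroes', 'Lodger']
releases = ['1977-01-14', '1977-10-14', '1979-05-18']
ranks = [2, 3, 4]

# zip() yields one tuple per record, which is what NumPy expects
late_seventies = np.array(list(zip(titles, releases, ranks)), dtype=dt)
print(late_seventies)
```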

Now for dates and times in NumPy. The dtype that we need is called `datetime64` to avoid confusion with the `datetime` object in the Python standard library, and to remind us that each element takes 64 bits. We initialize datetime objects from strings, and we can give as much detail as we want. The string format is ISO 8601, which goes from larger to smaller units (that is, from year to month to day and so on).

In [71]:
np.datetime64('1969')

numpy.datetime64('1969')

In [72]:
np.datetime64('1969-11-14')

numpy.datetime64('1969-11-14')

In [73]:
np.datetime64('2015-02-03 12:00')

numpy.datetime64('2015-02-03T12:00')
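The amount of detail in the string determines the unit of the resulting object. A short sketch that makes the units visible and shows that casting to a coarser unit drops the extra detail (variable names are illustrative):

```python
import numpy as np

year = np.datetime64('1969')
day = np.datetime64('1969-11-14')
minute = np.datetime64('2015-02-03T12:00')

# The unit appears in square brackets in the dtype
print(year.dtype, day.dtype, minute.dtype)

# Casting to a coarser unit truncates the date
month = day.astype('datetime64[M]')
print(month)
```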

We can also convert from a Python `datetime` object. By default the result carries a time of exactly midnight, with microsecond precision; specifying a granularity of `'D'` keeps only the date.

In [100]:
np.datetime64(datetime.datetime(2015, 2, 3))

numpy.datetime64('2015-02-03T00:00:00.000000')

In [101]:
np.datetime64(datetime.datetime(2015, 2, 3), 'D')

numpy.datetime64('2015-02-03')

The Python `datetime` module can also parse a generic string format.

In [102]:
np.datetime64(datetime.datetime.strptime('02/03/2015', '%m/%d/%Y'), 'D')

numpy.datetime64('2015-02-03')
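The conversion also works in the other direction. As a sketch, the `item()` method of a `datetime64` scalar returns a standard-library object: a `date` for day precision, a `datetime` for minute precision.

```python
import datetime
import numpy as np

# Day precision converts to datetime.date
as_date = np.datetime64('2015-02-03').item()

# Minute precision converts to datetime.datetime
as_datetime = np.datetime64('2015-02-03T12:00').item()

print(repr(as_date))
print(repr(as_datetime))
```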

Now, NumPy datetime objects can be compared:

In [177]:
np.datetime64('2015-02-03 12:00') < np.datetime64('2015-02-03 18:00')

True

And they can be subtracted, resulting in a `timedelta64` object; here it's expressed in minutes.

In [103]:
np.datetime64('2015-02-03 18:00') - np.datetime64('2015-02-03 12:00')

numpy.timedelta64(360,'m')
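To express the same interval in other units, one option is to divide by a unit-sized `timedelta64`, which gives a plain number; another is to cast with `astype`, which keeps a `timedelta64`. A minimal sketch:

```python
import numpy as np

delta = np.datetime64('2015-02-03 18:00') - np.datetime64('2015-02-03 12:00')

# Dividing two timedeltas gives a plain float
hours = delta / np.timedelta64(1, 'h')

# Casting keeps a timedelta64, now counted in hours
whole_hours = delta.astype('timedelta64[h]')

print(hours, whole_hours)
```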

The nice thing about these `datetime64` objects is that they work across NumPy. For instance, we can use the numpy function `diff`, which computes the difference between successive array elements, to see how long it took David to come up with each new record:

In [104]:
np.diff(discography['release'])

array([355, 408, 182, 301, 189, 217, 287, 322, 357, 273, 581],
      dtype='timedelta64[D]')

In [109]:
discography[3]

('Ziggy Stardust', '1972-06-16', 1)

"Ziggy Stardust" was especially quick!
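Rather than reading the shortest gap off the output by eye, we could locate it with `argmin`. The sketch below uses only the first five albums, typed in by hand, and converts the gaps to plain integers before searching; the variable names are illustrative.

```python
import numpy as np

titles = np.array(['David Bowie', 'The Man Who Sold the World',
                   'Hunky Dory', 'Ziggy Stardust', 'Aladdin Sane'])
releases = np.array(['1969-11-14', '1970-11-04', '1971-12-17',
                     '1972-06-16', '1973-04-13'], dtype='M8[D]')

# Gap i is the time between album i and album i + 1
days = np.diff(releases).astype('int64')
quickest = days.argmin()

print(titles[quickest + 1], days[quickest])
```

Note the offset by one: a gap at position `i` belongs to the album at position `i + 1`.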

Another example of using a standard NumPy function with `datetime64` is making a range of dates. Consistent with the usual convention, the last day in the range is excluded.

In [110]:
np.arange(np.datetime64('2015-02-03'), np.datetime64('2015-03-01'))

array(['2015-02-03', '2015-02-04', '2015-02-05', '2015-02-06',
       '2015-02-07', '2015-02-08', '2015-02-09', '2015-02-10',
       '2015-02-11', '2015-02-12', '2015-02-13', '2015-02-14',
       '2015-02-15', '2015-02-16', '2015-02-17', '2015-02-18',
       '2015-02-19', '2015-02-20', '2015-02-21', '2015-02-22',
       '2015-02-23', '2015-02-24', '2015-02-25', '2015-02-26',
       '2015-02-27', '2015-02-28'], dtype='datetime64[D]')
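As with numeric ranges, `arange` should also accept a step. A sketch with a weekly step, given as a `timedelta64` of seven days, over the same interval:

```python
import numpy as np

start = np.datetime64('2015-02-03')
stop = np.datetime64('2015-03-01')

# One date per week; the stop date is still excluded
weekly = np.arange(start, stop, np.timedelta64(7, 'D'))
print(weekly)
```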

As with record arrays, NumPy datetime objects are even more powerful in the `pandas` library, which we will discuss later in this course.