# Non-numeric data

Run the following cell and watch the video. 

In [1]:
%%html
<h1>ho ho ho</h1> 

# Reading non-numeric data
So far, we know how to read numeric and text data. Several burning questions remain:
* What happens if data is missing?
* What happens if rows have differing numbers of fields?
* How do we convert things that aren't numbers?
    * Labels?
    * Dates?

There is clearly a lot more to discuss.

# genfromtxt
* A power tool for data loading: `genfromtxt` 
* https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

In [6]:
%cat data1.txt

import numpy as np

stuff = np.genfromtxt('data1.txt',
                      delimiter=(7, 4, 9, 8, 4),
                      autostrip=True, encoding=None, 
                      names=("n1", "n2", "label", "n3", "n4"), 
                      dtype=("f8", "f8", "U8", "f8", "f8"))
print('\n')
print(stuff)

#2345671234123456789123456781234
    2.4  10 kilroy      25.2  10
   12.8  20  was        20.2   5
   20.2  30   here      15.4  15

[( 2.4, 10., 'kilroy', 25.2, 10.) (12.8, 20., 'was', 20.2,  5.)
 (20.2, 30., 'here', 15.4, 15.)]


Let's take this apart: 

* `delimiter`, given here as a tuple of integers, lists the width in characters of each fixed-width column.
* `autostrip` removes whitespace from both sides of each field.
* `names` are the names of the fields, like column headings.
* `dtype` lists the type of each field. If `dtype=None`, the type of each column is inferred from its contents.
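
As a sketch of the last point, the same fixed-width layout can be parsed with `dtype=None` so that `genfromtxt` picks each column's type. The data is held in an `io.StringIO` buffer (an assumption made here so the example runs without `data1.txt`), and `encoding="utf-8"` is passed explicitly so that text fields come back as `str`.

```python
import io
import numpy as np

# Same layout as data1.txt, kept in memory so the sketch is self-contained.
text = (
    "    2.4  10 kilroy      25.2  10\n"
    "   12.8  20  was        20.2   5\n"
    "   20.2  30   here      15.4  15\n"
)

inferred = np.genfromtxt(io.StringIO(text),
                         delimiter=(7, 4, 9, 8, 4),   # column widths in characters
                         autostrip=True,               # trim padding around each field
                         encoding="utf-8",
                         names=("n1", "n2", "label", "n3", "n4"),
                         dtype=None)                   # infer one type per column

print(inferred.dtype)   # shows which type was chosen for each field
print(inferred)
```

Under these assumptions, the whole-number columns `n2` and `n4` should come back as integers rather than the floats forced by the explicit `dtype` above.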

# Some important observations
* A `numpy` array has a single element type (its `dtype`).
* If the input file contains multiple types, each row becomes a record (printed as a tuple), and every record has the same structure and field types. The result is called a structured array.
* If you name the columns, you can get at each column as an ordinary single-type array:

In [8]:
stuff['label']

array(['kilroy', 'was', 'here'], dtype='<U8')

# Whoa there

* This is really handy. 
* You can break tables into columns easily. 
* This will be useful when dealing with more complex data types.
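
A minimal sketch of breaking a table into columns follows. The array `table` is built by hand here (an assumption, so the example does not depend on `data1.txt`), with the same field names and types as `stuff`.

```python
import numpy as np

# A small structured array mirroring the fields of `stuff`.
table = np.array(
    [(2.4, 10.0, "kilroy", 25.2, 10.0),
     (12.8, 20.0, "was", 20.2, 5.0),
     (20.2, 30.0, "here", 15.4, 15.0)],
    dtype=[("n1", "f8"), ("n2", "f8"), ("label", "U8"),
           ("n3", "f8"), ("n4", "f8")],
)

print(table.dtype.names)        # field names, in column order

n2_total = table["n2"].sum()    # a field is an ordinary one-type array
first_row = table[0]            # indexing by position returns one record
big = table[table["n1"] > 10]["label"]  # filter rows on one column, read another

print(n2_total, first_row["label"], big)
```

Indexing by field name selects a column, while indexing by position selects a row, so the two can be combined freely.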

# But several problems remain
* What if data can't be converted to a number?
* What if data has a numeric meaning but isn't written as a plain number, such as a date?

# Invalid data
Consider the following data table:

In [14]:
%cat data2.txt

1,2,3
4,kilroy,6
6,7,foo

What happens when a field can't be parsed as a number? Consider:

In [17]:
d2 = np.genfromtxt('data2.txt', delimiter=',')
d2

array([[ 1.,  2.,  3.],
       [ 4., nan,  6.],
       [ 6.,  7., nan]])

The special value `nan` (short for "not a number") marks each field where a number was required but the text could not be converted to one.
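
Once `nan` values are in an array, they usually have to be located or skipped before doing arithmetic. The sketch below shows one way to do that, assuming the three rows of `data2.txt` shown above are supplied through an `io.StringIO` buffer. Because `nan` never compares equal to anything (including itself), `np.isnan` is used instead of `==`.

```python
import io
import numpy as np

# The contents of data2.txt, kept in memory so the sketch is self-contained.
text = "1,2,3\n4,kilroy,6\n6,7,foo\n"
d2 = np.genfromtxt(io.StringIO(text), delimiter=",")

bad = np.isnan(d2)                   # boolean mask, True where parsing failed
bad_rows = bad.any(axis=1)           # rows containing at least one nan
clean = d2[~bad_rows]                # keep only the fully numeric rows
col_means = np.nanmean(d2, axis=0)   # column means that ignore nan entries

print(bad.sum(), "fields could not be parsed")
print(clean)
print(col_means)
```

Dropping rows and skipping `nan` in the statistics are two different policies; which one is appropriate depends on the data.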


# Custom conversions

* We already know about the conversions for int, float, etc. 
* What if you want to convert something in a custom fashion? 
* Most important: dates and times.

Consider:

In [18]:
%cat data3.txt

alva 1956-01-10
frank 1999-03-05
george 2001-03-21

In [24]:
from datetime import datetime
d3 = np.genfromtxt('data3.txt', dtype=[str, np.datetime64])
d3

ValueError: Cannot create a NumPy datetime other than NaT with generic units
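
The message refers to "generic units": `np.datetime64` was given as a type without saying whether it counts days, seconds, or something else. One possible workaround, offered as a sketch and not necessarily the approach this material goes on to use, is to read both columns as text and convert the date column afterwards with an explicit unit. The field names `name` and `born` are hypothetical, and the file contents are supplied through an `io.StringIO` buffer so the example runs on its own.

```python
import io
import numpy as np

# The contents of data3.txt, kept in memory so the sketch is self-contained.
text = "alva 1956-01-10\nfrank 1999-03-05\ngeorge 2001-03-21\n"

# Step 1: read both columns as fixed-length text (hypothetical field names).
raw = np.genfromtxt(io.StringIO(text),
                    dtype=[("name", "U10"), ("born", "U10")],
                    encoding="utf-8")

# Step 2: convert the ISO-formatted strings to dates with day resolution.
born = raw["born"].astype("datetime64[D]")

# Dates now support arithmetic; subtracting two gives a timedelta in days.
age_gap = born[2] - born[0]

print(raw["name"])
print(born)
print(age_gap)
```

This relies on the dates already being in ISO `YYYY-MM-DD` form; other layouts would need to be parsed before conversion.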