# <span style="color:blue">Data Ingestion Primitives</span>

This notebook is a tutorial on some of NumPy's primitives for reading data: structured arrays, `numpy.loadtxt`, and `numpy.genfromtxt`.

## Structured Arrays

In [None]:
import numpy as np

In [None]:
x = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
x

Here `x` is a one-dimensional array of length two whose datatype is a structure with three fields:

1. a string of length 10 or less named ‘name’,
2. a 32-bit integer named ‘age’, and
3. a 32-bit float named ‘weight’.
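The field layout described above can be checked directly from the array's `dtype`. The cell below is a small sketch that rebuilds the same data under a separate name, `pets` (chosen here for illustration only), and inspects its field names, shape, and per-record size.

```python
import numpy as np

# Same records and dtype as `x`, under a hypothetical name so `x` is untouched
pets = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])

print(pets.shape)            # one dimension, two records
print(pets.dtype.names)      # field names, in declaration order
print(pets.dtype['age'])     # dtype of a single field
print(pets.dtype.itemsize)   # bytes per record
```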

If you index `x` at position 1 you get a structure:

In [None]:
x[1]

You can access and modify individual fields of a structured array by indexing with the field name:

In [None]:
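print(x['age'])    # the 'age' field of every record
x[0]['age'] = 7    # change the field in one record
print(x)
x['age'] = 5       # assign the same value to the field in all records
print(x)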
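One detail worth seeing directly: indexing with a single field name returns a view of the original array rather than a copy, which is why the assignments above change `x` itself. A minimal sketch, using a separate array named `pets` (a name chosen here for illustration) so that `x` is left alone:

```python
import numpy as np

pets = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
                dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])

ages = pets['age']                    # a view onto the 'age' field
ages[0] = 7                           # writes through to pets
print(pets['age'])
print(np.shares_memory(ages, pets))   # the view and the array share a buffer
```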

## `numpy.loadtxt`

`numpy.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes', max_rows=None)`

This signature reflects older NumPy releases; as of NumPy 2.0 the default is `encoding=None`.

### `numpy.loadtxt` Parameters

**comments**: str or sequence of str, optional

The characters or list of characters used to indicate the start of a comment. None implies no comments. For backwards compatibility, byte strings will be decoded as ‘latin1’. The default is ‘#’.

**delimiter**: str, optional

The string used to separate values. For backwards compatibility, byte strings will be decoded as ‘latin1’. The default is whitespace.

**converters**: dict, optional

A dictionary mapping column number to a function that will parse the column string into the desired value. E.g., if column 0 is a date string: `converters = {0: datestr2num}`. Converters can also be used to provide a default value for missing data (but see also `genfromtxt`): `converters = {3: lambda s: float(s.strip() or 0)}`. Default: None.

**skiprows**: int, optional

Skip the first `skiprows` lines, including comments; default: 0.

**usecols**: int or sequence, optional

Which columns to read, with 0 being the first. For example, `usecols = (1, 4, 5)` will extract the 2nd, 5th and 6th columns. The default, None, results in all columns being read.
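As a small sketch of how `comments` and `skiprows` work together, the cell below reads an in-memory file whose first line is a free-text header and whose comments start with `%`. The sample text and the choice of `%` as the comment character are made up for illustration.

```python
from io import StringIO
import numpy as np

# Hypothetical input: one header line, then data with '%' comments
sample = StringIO(
    "measurements from run 1\n"
    "% full-line comment\n"
    "1 2 3\n"
    "4 5 6 % trailing comment\n"
)

# skiprows=1 drops the header line; comments='%' drops everything after a '%'
arr = np.loadtxt(sample, comments="%", skiprows=1)
print(arr)
```

Without `skiprows=1` the header line would be parsed as data and the call would fail, since its words cannot be converted to floats.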

In [None]:
from io import StringIO   # StringIO behaves like a file object
c = StringIO("0 1\n2 3")
np.loadtxt(c)

In [None]:
d = StringIO("M 21 72\nF 35 58")
np.loadtxt(d, dtype={'names': ('gender', 'age', 'weight'),
                     'formats': ('S1', 'i4', 'f4')})

In [None]:
c = StringIO("1,0,2\n3,0,4")
x, y = np.loadtxt(c, delimiter=',', usecols=(0, 2), unpack=True)
print(x)
print(y)

In [None]:
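s = StringIO('10.01 31.25-\n19.22 64.31\n17.57- 63.94')
def conv(fld):
    # Converters receive bytes or str depending on the NumPy version and `encoding`
    if isinstance(fld, bytes):
        fld = fld.decode('latin1')
    # A trailing '-' marks a negative number
    return -float(fld[:-1]) if fld.endswith('-') else float(fld)

np.loadtxt(s, converters={0: conv, 1: conv})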

## `numpy.genfromtxt`

See the [reference page](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) and the [`genfromtxt` tutorial](https://numpy.org/doc/stable/user/basics.io.genfromtxt.html).

### `numpy.genfromtxt` Parameters

#### delimiter

In [None]:
data = u"1, 2, 3\n4, 5, 6"
np.genfromtxt(StringIO(data), delimiter=",")

In [None]:
data = u"  1  2  3\n  4  5 67\n890123  4"
print(np.genfromtxt(StringIO(data), delimiter=3))

In [None]:
data = u"123456789\n   4  7 9\n   4567 9"
print(np.genfromtxt(StringIO(data), delimiter=(4, 3, 2)))

#### comments

In [None]:
data = u"""#
# Skip me !
# Skip me too !
1, 2
3, 4
5, 6 #This is the third line of the data
7, 8
# And here comes the last line
9, 0
"""
print(np.genfromtxt(StringIO(data), comments="#", delimiter=","))

#### names

In [None]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=[(name, int) for name in "abc"])

In [None]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, names="A, B, C")

In [None]:
data = StringIO("So it goes\n#a b c\n1 2 3\n 4 5 6")
np.genfromtxt(data, skip_header=1, names=True)

In [None]:
data = StringIO("1 2 3\n 4 5 6")
ndtype = [('a', int), ('b', float), ('c', int)]
names = ["A", "B", "C"]
# Names passed explicitly override the field names given in dtype
np.genfromtxt(data, names=names, dtype=ndtype)

#### defaultfmt

In [None]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int))

In [None]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int), names="a")

In [None]:
data = StringIO("1 2 3\n 4 5 6")
np.genfromtxt(data, dtype=(int, float, int), defaultfmt="var_%02i")

#### converters

In [None]:
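def convertfunc(x):
    # Converters receive bytes or str depending on the NumPy version and `encoding`
    if isinstance(x, bytes):
        x = x.decode("latin1")
    return float(x.strip().strip("%")) / 100.

data = u"1, 2.3%, 45.\n6, 78.9%, 0"
names = ("i", "p", "n")
# General case: without a converter, the "%" values cannot be parsed and become nan
np.genfromtxt(StringIO(data), delimiter=",", names=names)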

In [None]:
# Converted case ...
np.genfromtxt(StringIO(data), delimiter=",", names=names,
              converters={1: convertfunc})

In [None]:
# Using a name for the converter ...
np.genfromtxt(StringIO(data), delimiter=",", names=names,
              converters={"p": convertfunc})

In [None]:
data = u"1, , 3\n 4, 5, 6"
convert = lambda x: float(x.strip() or -999)
np.genfromtxt(StringIO(data), delimiter=",",
              converters={1: convert})

#### `missing_values` and `filling_values`

In [None]:
data = u"N/A, 2, 3\n4, ,???"
print(data)
kwargs = dict(delimiter=",",
              dtype=int,
              names="a,b,c",
              missing_values={0:"N/A", 'b':" ", 2:"???"},
              filling_values={0:0, 'b':0, 2:-999})
print(np.genfromtxt(StringIO(data), **kwargs))