# <span style="color:blue">Data ingestion</span>

## The special problem of reading data

There are several problems involved with getting data into a program. 
* Several high-level data formats. 
* Each one has several subformats.  
* A multitude of tools for reading.
* Each tool has limits. 

### The basic formats: 
* **text**: human-readable and editable data. 
* **binary**: machine-readable data. 
     
### Text data
* Numbers in one of many printable formats. 
* Less space-efficient than binary formats. 
* Tends to be portable between machines.

### Binary data
* Much more efficiently stored. 
* Tends not to be portable between machines and software. 
* Several variants are incompatible with one another. 
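
To make the trade-off concrete, here is a minimal sketch using the standard-library `struct` module. The value and variable names are illustrative, not taken from any file in these notes.

```python
import struct

value = 3.141592653589793

# Text form: one character per digit, so the length depends on the value.
as_text = repr(value).encode("ascii")

# Binary form: a 64-bit IEEE 754 double always takes 8 bytes.
little = struct.pack("<d", value)   # little-endian byte order
big = struct.pack(">d", value)      # big-endian byte order

print(len(as_text), len(little))    # 17 8
print(little == big[::-1])          # True: same bytes, opposite order
```

The two binary forms hold the same number but are not interchangeable, which is one reason binary files tend not to be portable.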

We've already seen a really simple example of reading a file:

In [None]:
# run this to see the file. 
%more data.txt

In [None]:
# read the file into a list of rows, each a list of strings
with open('data.txt', 'r') as f:  # the file is closed automatically
    out = []
    for line in f:
        numbers = line.strip().split(',')
        out.append(numbers)
out

## Oops! This is not quite right. 
There is a profound difference between '1' and 1 (without the quotes). 
* `'1'` is a string. 
* `1` is an integer. 
* (`1.` is a floating point number.) 
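
A quick way to see the difference is to add each kind of value to itself. This is just an illustrative sketch; the variable names are made up here.

```python
text_sum = '1' + '1'    # strings concatenate
int_sum = 1 + 1         # integers add
float_sum = 1. + 1.     # floats add, and stay floats

print(text_sum, int_sum, float_sum)   # 11 2 2.0
print(type(text_sum).__name__, type(int_sum).__name__, type(float_sum).__name__)
```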

### Internal and string representations of numbers
* Python has several built-in types for numbers: 
* `int`: an integer. 
* `float`: a floating point number. 
* `complex`: a complex number. 

### Internal representations of numbers share these attributes: 
* Binary representation. 
* Usually a fixed 4, 8, or 16 bytes long in files and arrays (Python's own `int` grows as needed). 
* Expressed in base 2. 

## String representations of numbers are obviously different
* Variable length. 
* Digits in base 10. 

### Converting between strings and numbers
* `str(numb)`: gives the string version of a number. 
* `float(strng)` or `int(strng)`: gives the numeric version. 
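
A short sketch of these conversions, including what happens when the text does not match the requested type. The sample values are made up for illustration.

```python
as_string = str(42)          # '42'
as_int = int('42')           # 42
as_float = float('3.5')      # 3.5

# int() refuses text that is not an integer literal.
try:
    int('3.5')
except ValueError as e:
    error_message = str(e)
    print(error_message)

# Surrounding whitespace is tolerated, so a stray newline from a file is fine.
with_newline = int(' 7\n')   # 7
```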

### So, we can solve our little problem as follows: 

In [None]:
# convert every string entry to an integer
out2 = []
for row in out: 
    new_row = []
    for entry in row: 
        new_row.append(int(entry))
    out2.append(new_row)
out2
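
The reading and converting steps can also be done in one pass with a list comprehension. This is a sketch that uses a small in-memory stand-in (`sample_text`, hypothetical) rather than the real `data.txt`, so it runs on its own.

```python
import io

# Hypothetical stand-in for a small comma-separated file.
sample_text = "1,2,3\n4,5,6\n"

with io.StringIO(sample_text) as f:
    # Split each line on commas and convert every entry to int in one pass.
    rows = [[int(entry) for entry in line.strip().split(',')] for line in f]

print(rows)  # [[1, 2, 3], [4, 5, 6]]
```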

# Alas, if it were only that easy...

The reality: 
* A large share of the typical data scientist's time (figures as high as 90% are often quoted) is spent finding ways to read data.
* Data reading errors are very common. 
* Data formats are not so neat as the above. 

Thus, an enormous amount of time has been spent on libraries for reading data. 

One of the simplest is `numpy.loadtxt`. 

See https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html

For example, we can write:

In [None]:
import numpy as np
stuff = np.loadtxt('data.txt', delimiter=',')
stuff

# A few observations

* default conversion type is float. 
* default delimiter is any "whitespace". 

So, for this file: 

In [None]:
# run this to see the file
%more data2.txt

we might write, instead: 

In [None]:
stuff = np.loadtxt('data2.txt')  # default delimiter is whitespace 
stuff
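
Both defaults can be checked without a file on disk, since `loadtxt` also accepts file-like objects. The sketch below uses made-up whitespace-separated text (`text` is a hypothetical stand-in, not the contents of `data2.txt`).

```python
import io
import numpy as np

# Hypothetical whitespace-separated data; note the uneven spacing.
text = "1 2 3\n4   5 6\n"

as_float = np.loadtxt(io.StringIO(text))            # defaults: float, whitespace
as_int = np.loadtxt(io.StringIO(text), dtype=int)   # ask for integers instead

print(as_float.dtype, as_float.shape)  # float64 (2, 3)
print(as_int.dtype.kind)               # i
```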

# Non-numeric data
Obviously, we need to do something about non-numeric data in the file. 
If we try naively to parse a file with non-numeric data,

In [None]:
%more data3.txt

In [None]:
try:
    stuff = np.loadtxt('data3.txt', delimiter=',')
except Exception as e:
    print(e)

... we get an error, and we must get more clever. Let's tell `loadtxt` what kinds of objects it is looking for. 
* `i4`: a 4-byte (32-bit) integer
* `U20`: a string of up to 20 characters (U means *unicode*)

In [None]:
stuff = np.loadtxt('data3.txt', delimiter=',', dtype={'names': ['apples', 'oranges', 'name'],
                                                      'formats': ['i4', 'i4', 'U20']})
stuff

# Notice several things in this example
* `loadtxt` returned a one-dimensional *structured array*: each row is a record that prints like a tuple, rather than a list. 
* Rows are indexed as before and columns can be picked out by name (e.g. `stuff['apples']`), but math on whole rows is not defined. 
* This is its way of saying *this is not a vector, matrix, or tensor.* 
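
Here is a sketch of what that means in practice. The rows are made up (`text` is a hypothetical stand-in for `data3.txt`), but the `dtype` is the one used above.

```python
import io
import numpy as np

# Hypothetical rows in the same shape as the example: two counts and a name.
text = "1,2,alice\n3,4,bob\n"

table = np.loadtxt(io.StringIO(text), delimiter=',',
                   dtype={'names': ['apples', 'oranges', 'name'],
                          'formats': ['i4', 'i4', 'U20']})

print(table.shape)            # (2,): one record per row, not a 2-D matrix
print(table['apples'])        # a whole column, selected by field name
print(table[0]['name'])       # one field of one row
print(table['apples'] + table['oranges'])  # math works on numeric columns

try:
    table + 1                 # but not on the mixed-type records themselves
except TypeError as e:
    print('TypeError:', e)
```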

# The curse of comma-separated values

Comma-separated values are a very common data representation, but have one **huge** problem. 
Microsoft Excel often puts out files like this. 

In [None]:
%more data3.csv

Excel thinks it's being clever, because there is a comma in the data. 
* It surrounds the field containing commas with double quotes. 
* Our parser knows nothing about that. 
So, when we try to parse it naively, we get: 

In [None]:
try:
    stuff = np.loadtxt('data3.csv', delimiter=',', dtype={'names': ['apples', 'oranges', 'name'],
                                                          'formats': ['i4', 'i4', 'U20']})
except Exception as e:
    print(e)

With the settings used here, this parser won't handle this case. 

We can, however, outsmart Excel by saving the sheet as tab-separated values instead: 

In [None]:
%more data4.txt

In [None]:
stuff = np.loadtxt('data4.txt', delimiter='\t', dtype={'names': ['apples', 'oranges', 'name'],
                                                       'formats': ['i4', 'i4', 'U20']})
stuff

*This works in practice because Excel won't allow a tab character to be typed directly into a cell!*
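
When re-exporting from Excel is not an option, another route is Python's standard-library `csv` module, which understands the double-quote convention. A minimal sketch, using made-up rows (`text` is hypothetical, not the contents of `data3.csv`):

```python
import csv
import io

# Hypothetical Excel-style CSV: the name field contains a comma, so it is quoted.
text = '1,2,"Smith, John"\n3,4,Jane\n'

rows = []
for apples, oranges, name in csv.reader(io.StringIO(text)):
    # csv.reader yields strings, so the numeric fields still need converting.
    rows.append((int(apples), int(oranges), name))

print(rows)  # [(1, 2, 'Smith, John'), (3, 4, 'Jane')]
```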

# Fixed-width fields

Some files we need to load have fields that occupy a fixed number of characters. Files like these are often produced by scientific modeling programs written in Fortran. For example, consider: 

In [None]:
%more data5.txt

We parse this kind of file with a different function, `numpy.genfromtxt`, whose `delimiter` argument is more general: it can be a sequence of integers giving the widths of the fields we should find. See https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html

For example: 

In [None]:
stuff = np.genfromtxt('data5.txt', delimiter=(7,4,8,4))
stuff

The `delimiter` argument says that the fields in the input are 7, 4, 8, and 4 characters wide. I included a comment line (starting with `#`) in the data to show you the column numbers. 
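
Fixed widths matter most when fields run together with no space between them. The sketch below builds two such lines from made-up numbers (not the contents of `data5.txt`) and reads them back with the same widths.

```python
import io
import numpy as np

# Hypothetical values, written Fortran-style into columns 7, 4, 8, and 4 wide.
records = [(1234.56, 1000, 9999.125, 2000),
           (1.5, 2, 3.25, 4)]
lines = [f"{a:7.2f}{b:4d}{c:8.3f}{d:4d}" for a, b, c, d in records]
print(lines[0])   # the fields of the first row touch: no whitespace to split on

# The widths tell genfromtxt where each field starts and ends.
data = np.genfromtxt(io.StringIO("\n".join(lines)), delimiter=(7, 4, 8, 4))
print(data)
```

Splitting the first line on whitespace would give a single unusable token, while the width-based split recovers all four numbers.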

When you are done with these notes, please proceed to [complete the related exercise](03-03-data-ingest.ipynb). 