# Importing a CSV to Numpy
The Extract, Transform, and Load (ETL) process is one that can be problematic.

We will use a small, silly data set to make our lives easier while we examine the tools for performing ETL.

# Step 1. Examine the data
We need to identify what needs to be extracted and what can be ignored.

I have a look at the data file:

<code>[root@localhost JupyterDataScience]# cat tmp.csv 
2019-4-8, 1, 2.3%, 45.6
2019-5-8, 6, 78.9%, 0
</code>

There are several key observations:
1. There are no column headers
2. Each column appears to have a different type (date, int, percent, float)
3. The data is not clean: there are extra spaces in the values
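
The stray spaces are easiest to see by printing each field with `repr`. This is only a minimal sketch: it assumes the two sample rows are held in a string (the name `sample_rows` is made up for illustration) rather than read from tmp.csv.

```python
import csv
import io

# Hypothetical in-memory copy of the two rows shown above
sample_rows = "2019-4-8, 1, 2.3%, 45.6\n2019-5-8, 6, 78.9%, 0\n"

# csv.reader splits on the commas but keeps the spaces that follow them
raw_fields = list(csv.reader(io.StringIO(sample_rows)))

for row in raw_fields:
    # repr() puts quotes around each value so the leading spaces are visible
    print([repr(field) for field in row])
```

Every field after the first starts with a space, which is the "not clean" part we will have to deal with.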

# Step 2. Determine if the desired data format matches the input format
If the data is already in the format we will be using for our algorithms then we don't need to do ETL.

But this is a tutorial on ETL so our life is not that easy...

Let's assume that we want the data to be put into a new specific format (date, int, float, float).

While the second and fourth columns can be recognized automatically, the first and third columns cannot, so we will need to write code to tell the system how to interpret, for example, "2.3%" as 0.023.

# Step 3. Determine how numpy implements the desired data formats
We know we want to use "dates" based on our input data. Most programming languages offer various "primitives" (built-in types) which implement basic things: for example, Python natively supports integers, booleans, strings, etc.

... but what does numpy offer? From working with numpy I know that it implements its own primitives separate from those of the vanilla Python language.

In other words, how does the numpy framework represent a date, and what features does this object type have?

To figure this out we turn to the documentation:
https://docs.scipy.org/doc/numpy/user/basics.types.html

But the dates are not mentioned here... we keep googling and we find:
https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

And this very useful article about datetimes: http://poquitopicante.blogspot.com/2013/06/dates-and-datetimes.html

A bit of an egg hunt but we did it... 
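
To give a feel for what the `datetime64` type offers, here is a small sketch of date arithmetic. The dates are illustrative values written in zero-padded ISO form, not values read from our file.

```python
import numpy

# Two dates written in zero-padded ISO 8601 form
start = numpy.datetime64("2019-04-08")
end = numpy.datetime64("2019-05-08")

# Subtracting two datetime64 values gives a timedelta64
gap = end - start
print(gap)

# Adding a timedelta64 moves a date forward
next_week = start + numpy.timedelta64(7, "D")
print(next_week)

# The unit ("D" for days here) is part of the dtype
print(start.dtype)
```

The point is that a `datetime64` is a real numeric type with a unit attached, so comparisons and differences work without any string handling.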

# Step 4. Test out all the types and get familiar with them
Let's see if these data types can be created from the kinds of strings found in our file.

In [1]:
# Import the numpy library
import numpy

# Prove that we can convert a string to a datetime
a = numpy.datetime64('2005-02-25')
print("The type of a is: {0}.".format(type(a)))
print("The value of a is {0}.".format(a))
print("")

# Prove that we can convert a string to an int
b = numpy.int64("1")
print("The type of b is: {0}.".format(type(b)))
print("The value of b is {0}.".format(b))
print("")

# There is no datatype called 'percent' and we cannot convert "2.3%" to 0.023 directly
# Prove this
try:
    c = numpy.float64("2.3%")
    print("The type of c is: {0}.".format(type(c)))
    print("")
except Exception as e:
    print("An exception was raised while parsing c")
    print("")

# Prove that we can convert a string to a float
d = numpy.float64("5")
print("The type of d is: {0}.".format(type(d)))
print("The value of d is {0}.".format(d))
print("")

The type of a is: <class 'numpy.datetime64'>.
The value of a is 2005-02-25.

The type of b is: <class 'numpy.int64'>.
The value of b is 1.

An exception was raised while parsing c

The type of d is: <class 'numpy.float64'>.
The value of d is 5.0.



# Step 3. ETL the data
In this step we are going to look at a few different options for parsing data (and yes there are a few ways to skin this cat).

The basic options are (in order of preference and ease):

1. Let numpy guess what to do (not recommended, we will see why)
2. Tell numpy what types you want and hope everything works out of the box (like in step #4)
3. Write a conversion function for the numpy framework
4. Write a parsing mechanism from scratch

# Step 3.1 Let numpy guess what to do

In [3]:
# Set up the parameters that tell the numpy function what we are importing and how
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-4-8, 1, 2.3%, 45.\n2019-4-8, 6, 78.9%, 0"
delimiter = ","


# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

# Load the data into a numpy array
import numpy
test_3_1 = numpy.genfromtxt(test_data_file_handle, delimiter=delimiter, names=column_names)

test_3_1

array([(nan, 1., nan, 45.), (nan, 6., nan,  0.)],
      dtype=[('dates', '<f8'), ('ints', '<f8'), ('percents', '<f8'), ('numbers', '<f8')])
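
The `nan` values appear because, with no `dtype` argument, `genfromtxt` reads every column as a float, and neither the dates nor the percents can be read that way. A closer match to "letting numpy guess" is `dtype=None`, which asks it to pick a type per column. The sketch below assumes a numpy version that supports the `encoding` argument; `autostrip=True` is added here to drop the extra spaces, and the name `guessed` is made up for illustration.

```python
import io
import numpy

column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-4-8, 1, 2.3%, 45.\n2019-4-8, 6, 78.9%, 0"

guessed = numpy.genfromtxt(
    io.StringIO(test_data_string),
    delimiter=",",
    names=column_names,
    dtype=None,        # infer a type for each column separately
    encoding="utf-8",  # keep text columns as ordinary strings
    autostrip=True,    # remove the spaces around each value
)

print(guessed.dtype)
print(guessed)
```

The numeric columns are recognized, but the dates and percents come back as plain text, so guessing alone still does not give us the (date, int, float, float) layout we want.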

# Step 3.2 Tell numpy what types you want (like in step #4)

In [6]:
# Set up the parameters that tell the numpy function what we are importing and how
column_names = ("dates", "ints", "percents", "numbers")
column_types = "str,int64,str,float64"
test_data_string = "2019-4-8, 1, 2.3%, 45.\n2019-4-8, 6, 78.9%, 0"
delimiter = ","

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

# Load the data into a numpy array
import numpy
test_3_2 = numpy.genfromtxt(test_data_file_handle, delimiter=delimiter, names=column_names, dtype=column_types)

test_3_2

array([('', 1, '', 45.), ('', 6, '',  0.)],
      dtype=[('dates', '<U'), ('ints', '<i8'), ('percents', '<U'), ('numbers', '<f8')])
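
The empty strings appear to come from the bare `str` type, which carries no length (the `<U` in the output), so the text has nowhere to go. A sketch of one possible fix is to give the text columns an explicit width. The width of 10, the `sized_types` and `test_sized` names, and the `encoding` and `autostrip` arguments are all assumptions added for this example.

```python
import io
import numpy

test_data_string = "2019-4-8, 1, 2.3%, 45.\n2019-4-8, 6, 78.9%, 0"

# "U10" means text of up to 10 characters; the width is a guess that
# happens to be long enough for the values in this data set
sized_types = [
    ("dates", "U10"),
    ("ints", "int64"),
    ("percents", "U10"),
    ("numbers", "float64"),
]

test_sized = numpy.genfromtxt(
    io.StringIO(test_data_string),
    delimiter=",",
    dtype=sized_types,
    encoding="utf-8",  # keep text columns as ordinary strings
    autostrip=True,    # remove the spaces around each value
)

print(test_sized)
```

This keeps the text, but the dates and percents are still strings rather than real dates and floats, which is why we need a conversion function.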

# Step 3.3.a Define a function to parse percents

In [9]:
def convert_percent_string_to_float(raw_value):

    # genfromtxt may hand the converter either bytes or a string
    # (this depends on its encoding argument), so we accept both
    if isinstance(raw_value, bytes):
        input_string = raw_value.decode("utf-8")
    else:
        input_string = raw_value

    # Strip the surrounding whitespace, then the percent sign
    input_string = input_string.strip()
    input_string = input_string.strip("%")

    # Make it a float
    input_float = float(input_string)

    # Convert from a percentage to a fraction (2.3 -> 0.023)
    result = input_float/100

    return result

test_string = "2.3%"
test_bytes = test_string.encode("utf-8")
x = convert_percent_string_to_float(test_bytes)
print("The type of x is: {0}.".format(type(x)))
print("The value of x is: {0}.".format(x))


The type of x is: <class 'float'>.
The value of x is: 0.023.


# Step 3.3.b Plug parsing function into the numpy framework

In [10]:
# Set up the parameters that tell the numpy function what we are importing and how
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-4-8, 1, 2.3%, 45.\n2019-4-8, 6, 78.9%, 0"
delimiter = ","
name_of_column_being_mapped_to_parsing_function = "percents"
converter_mapping = {
    name_of_column_being_mapped_to_parsing_function: convert_percent_string_to_float
}

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

# Load the data into a numpy array
import numpy
test_3_3 = numpy.genfromtxt(test_data_file_handle, delimiter=delimiter, names=column_names, converters=converter_mapping)

test_3_3

array([(nan, 1., 0.023, 45.), (nan, 6., 0.789,  0.)],
      dtype=[('dates', '<f8'), ('ints', '<f8'), ('percents', '<f8'), ('numbers', '<f8')])

# Step 3.4 - Write a custom parsing system
This is a bit outside the scope of this section and more advanced... We will cover it later if needed.

I doubt anyone would need this anyways...
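
For the curious, a from-scratch parser could look roughly like the sketch below. It is not a definitive implementation: it uses the standard `csv` and `datetime` modules instead of `genfromtxt`, and the names `row_types`, `parse_row`, and `test_3_4` are made up for this example.

```python
import csv
import datetime
import io

import numpy

test_data_string = "2019-4-8, 1, 2.3%, 45.\n2019-4-8, 6, 78.9%, 0"

# Target layout from Step 2: (date, int, float, float)
row_types = [
    ("dates", "datetime64[D]"),
    ("ints", "int64"),
    ("percents", "float64"),
    ("numbers", "float64"),
]


def parse_row(fields):
    """Convert one list of raw text fields into typed values."""
    date_text, int_text, percent_text, number_text = (f.strip() for f in fields)
    # strptime accepts months and days written without zero padding
    parsed_date = datetime.datetime.strptime(date_text, "%Y-%m-%d").date()
    return (
        numpy.datetime64(parsed_date),
        int(int_text),
        float(percent_text.rstrip("%")) / 100,  # percentage -> fraction
        float(number_text),
    )


reader = csv.reader(io.StringIO(test_data_string))
test_3_4 = numpy.array([parse_row(fields) for fields in reader], dtype=row_types)

print(test_3_4)
```

The trade-off is that we now own all of the error handling (missing fields, bad values) that `genfromtxt` would otherwise do for us.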

# Homework - Implement a parsing function for dates
Read the article: http://poquitopicante.blogspot.com/2013/06/dates-and-datetimes.html

It explains why datetimes are not working for us out of the box. TL;DR - 

In [13]:
# Load the library which contains the code to load data from a csv
import numpy
import io


def my_convert_function(raw_data):
    # genfromtxt may hand the converter either bytes or a string
    # (this depends on its encoding argument), so we accept both
    if isinstance(raw_data, bytes):
        input_string = raw_data.decode("utf-8")
    else:
        input_string = raw_data

    # Strip the surrounding whitespace, then the percent sign
    input_string = input_string.strip()
    input_string = input_string.strip("%")

    # Make it a float
    input_float = float(input_string)

    # Convert from a percentage to a fraction (2.3 -> 0.023)
    result = input_float/100

    return result


data = u"1, 2.3%, 45.\n6, 78.9%, 0"
names = ("i", "p", "n")
converters = {"p": my_convert_function}
#a = numpy.genfromtxt(io.StringIO(data), delimiter=",", names=names, converters=converters)
#a

file_name = "tmp.csv"
b = numpy.genfromtxt(file_name, delimiter=',',  names=names, converters=converters)
b


array([(1., 0.023, 45.), (6., 0.789,  0.)],
      dtype=[('i', '<f8'), ('p', '<f8'), ('n', '<f8')])