# Overview
The Extract, Transform, and Load (ETL) process is generally the first step in a machine learning workflow.

We will use a small, silly data set to make our lives easier while we examine the tools for performing ETL.

The high-level steps of this process will be:
1. Examine the raw data format
2. Identify required extractions and transformations
3. Test transformations
4. Perform extract, transform, and load steps.


# Step 1. Examine the raw data format
We need to look at the raw data contained in a CSV file to get an idea of which extract and transform steps are needed to load the data into Python objects (in our case, a numpy array).

<code>[root@localhost JupyterDataScience]# cat tmp.csv 
2019-04-08, 1, 2.3%, 45.6
2019-05-08, 6, 78.9%, 0
</code>

There are several key observations:
1. There are no column headers
2. Each column appears to have a different type (date, int, percent, float)
3. Some of the column types are not natively understood by our loader and will need to be extracted and/or transformed!
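
The same observations can be made from Python rather than the shell. The following is a minimal sketch that reads the two sample rows with the standard `csv` module and prints each field with `repr`. It uses an in-memory copy of the rows instead of opening `tmp.csv`, and the variable names are illustrative only.

```python
import csv
import io

# In-memory copy of the two sample rows shown above (stands in for tmp.csv)
raw_text = "2019-04-08, 1, 2.3%, 45.6\n2019-05-08, 6, 78.9%, 0\n"

# csv.reader splits on commas but does not strip whitespace or guess types
rows = list(csv.reader(io.StringIO(raw_text)))

for row in rows:
    # repr() makes the leading spaces and the % symbol visible
    print([repr(field) for field in row])

# Every field is still a plain string at this point
field_types = {type(field).__name__ for row in rows for field in row}
print(field_types)
```

Seeing the leading spaces here is a useful reminder that any converter we write later should strip whitespace before parsing.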

# Step 2. Identify required extractions and transformations

If the data is already in the format we will be using for our algorithms, then we don't need to do ETL. But this is a tutorial on ETL, so our life is not that easy... Let's assume that we want the data to be put into a new, specific format (date, int, float, float).

## 2.1. Identify ways to solve the problem

There are a few options we can try:
1. Let numpy guess what to do (not recommended, we will see why)
2. Tell numpy what types you want and hope everything works out of the box (like in step 3.1)
3. Write a conversion function for numpy framework
4. Write a parsing mechanism from scratch
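
For comparison, option 4 might look roughly like the sketch below, which parses each line by hand with only the standard library. The function name `parse_line` is hypothetical, and the sketch assumes every line has exactly four comma-separated fields in the order (date, int, percent, float). It returns plain Python tuples rather than a numpy array.

```python
import datetime

def parse_line(line):
    """Parse one 'date, int, percent, float' line into typed Python values."""
    # Split on commas and trim the whitespace around each field
    date_text, int_text, percent_text, float_text = [
        field.strip() for field in line.split(",")
    ]
    return (
        datetime.date.fromisoformat(date_text),
        int(int_text),
        float(percent_text.rstrip("%")) / 100,  # percent to fraction
        float(float_text),
    )

raw_text = "2019-04-08, 1, 2.3%, 45.6\n2019-05-08, 6, 78.9%, 0"
records = [parse_line(line) for line in raw_text.splitlines()]
print(records[0])
```

This works for a tiny file, but it means reimplementing error handling, missing values, and array construction ourselves, which is part of why option 3 is attractive.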


## 2.2. Trial and error

We can see that option #1 will not work: numpy will not do this out of the box. The problematic cells come back as nan, and every column gets a datatype of '<f8', which corresponds to a little-endian 64-bit float.

In [1]:
# Setup some other parameters to instruct numpy function how and what we are importing
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-4-8, 6, 78.9%, 0"
delimiter = ","


# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

# Load the data into a numpy array
import numpy
etl_data = numpy.genfromtxt(test_data_file_handle, delimiter=delimiter, names=column_names)

etl_data

array([(nan, 1., nan, 45.), (nan, 6., nan,  0.)],
      dtype=[('dates', '<f8'), ('ints', '<f8'), ('percents', '<f8'), ('numbers', '<f8')])
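
To check what a dtype string such as '<f8' in the output above means, we can ask numpy directly. This is a small sketch; the variable name is illustrative.

```python
import numpy

# Build a dtype object from the string that appeared in the output
float_dtype = numpy.dtype("<f8")

print(float_dtype.kind)      # 'f' means floating point
print(float_dtype.itemsize)  # size of one element in bytes
print(float_dtype.name)      # human-readable name of the type
```

The kind and item size together show that every column was read as an 8-byte float, which is why the date and percent strings could only be represented as nan.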

Additionally, we will see that option #2 does not work either: numpy cannot natively parse the percentage strings, as seen in part 3.3.

This leaves options 3 and 4. We will see that option 3 will work and is likely the fastest and easiest solution (assuming we know what we are doing).

## 2.3. Which data types to use?
We will need to write code to tell the system how to interpret the values. Some of this is implemented out-of-the-box by numpy and some is custom software.

For example "2.3%" should be interpreted as a float with a value of 0.023 (2.3 divided by 100).

Another example: '2019-04-08' should be loaded into a date type, but it is not obvious which one. To arrive at the right conclusion we can consult the [documentation for numpy data types](https://docs.scipy.org/doc/numpy/user/basics.types.html). Dates are special and are not mentioned explicitly on that page; they are described [here](https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html) instead. Some helpful examples of manipulating this data type can be found [here](http://poquitopicante.blogspot.com/2013/06/dates-and-datetimes.html).

We can extract some important information about datetimes:

> Starting in NumPy 1.7, there are core array data types which natively support datetime functionality. The data type is called “datetime64”, so named because “datetime” is already taken by the datetime library included in Python. 

> The most basic way to create datetimes is from strings in ISO 8601 date or datetime format. The unit for internal storage is automatically selected from the form of the string, and can be either a date unit or a time unit. The date units are years (‘Y’), months (‘M’), weeks (‘W’), and days (‘D’), while the time units are hours (‘h’), minutes (‘m’), seconds (‘s’), milliseconds (‘ms’), and some additional SI-prefix seconds-based units.
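
The unit selection described in the quote can be observed directly. The sketch below builds a few `datetime64` values from ISO 8601 strings of increasing precision and records the unit numpy picks for each. The dictionary name `inferred_units` is illustrative.

```python
import numpy

# The unit is inferred from how much of the ISO 8601 string is present
inferred_units = {}
for text in ("2019", "2019-04", "2019-04-08", "2019-04-08T13:30"):
    value = numpy.datetime64(text)
    inferred_units[text] = str(value.dtype)
    print(text, "->", value.dtype)

# A unit can also be requested explicitly; here a month is forced to days
first_of_month = numpy.datetime64("2019-04", "D")
print(first_of_month)
```

For our data we want day precision, so passing 'D' explicitly keeps the column consistent even if a row were to carry less detail.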



| Column  | Input Type | Extract / Transform            | Output Type |
|---------|------------|--------------------------------|-------------|
| 1       | string     |                                | datetime64  |
| 2       | string     |                                | int         |
| 3       | string     | Remove % symbol, divide by 100 | float       |
| 4       | string     |                                | float       |
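
The output types in the table can also be written down as a numpy structured dtype. This is a sketch: the field names are the ones used for `column_names` elsewhere in this tutorial, and choosing 64-bit integers and floats is an assumption.

```python
import numpy

# One (name, type) pair per column of the table above
column_dtype = numpy.dtype([
    ("dates", "datetime64[D]"),
    ("ints", "int64"),
    ("percents", "float64"),
    ("numbers", "float64"),
])

print(column_dtype.names)
print(column_dtype["dates"])
```

Writing the target types out this way gives us something concrete to compare against once the data is loaded.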

# Step 3. Test Extractions And Transformation Functions
Numpy allows us to plug custom functions into its ETL process. They need to tolerate a bytes representation of each string field. We will implement a few examples, test them here, and confirm that they convert our data into the right types.
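
Whether a converter receives `bytes` or `str` can depend on the numpy version and on the `encoding` argument passed to `numpy.genfromtxt`, so one defensive pattern is to normalize the input first. The helper name `to_text` below is hypothetical and not part of numpy. It is a sketch that assumes UTF-8 encoded input.

```python
def to_text(raw_value, encoding="utf-8"):
    """Return raw_value as a stripped str, whether it arrives as bytes or str."""
    if isinstance(raw_value, bytes):
        raw_value = raw_value.decode(encoding)
    return raw_value.strip()

# Both forms normalize to the same clean string
from_bytes = to_text(b" 2.3% ")
from_str = to_text(" 2.3% ")
print(repr(from_bytes), repr(from_str))
```

A converter built on a helper like this keeps working if the loader starts handing it decoded strings.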

## 3.1. Convert string to datetime64

In [2]:
# Import the numpy library
import numpy

# Prove that we can convert a string to a datetime
a = numpy.datetime64('2005-02-25')
print("The type of a is: {0}.".format(type(a)))
print("The value of a is: {0}.".format(a))
print("The units of a are: {0}".format(a.dtype))
print("")

The type of a is: <class 'numpy.datetime64'>.
The value of a is: 2005-02-25.
The units of a are: datetime64[D]



In [3]:
def convert_date_string_to_date(raw_bytes):
    
    # The input variable will be a byte array
    # We will convert this to a string
    input_string = raw_bytes.decode("utf-8")

    # We then do our manipulation
    input_string = input_string.strip()

    # Make it a date
    result = numpy.datetime64(input_string, 'D')

    return result

test_string = "2019-05-06"
test_bytes = test_string.encode("utf-8")
y = convert_date_string_to_date(test_bytes)
print("The type of y is: {0}.".format(type(y)))
print("The value of y is: {0}.".format(y))
print("The units of y are: {0}".format(y.dtype))

The type of y is: <class 'numpy.datetime64'>.
The value of y is: 2019-05-06.
The units of y are: datetime64[D]


## 3.2. Convert string to int

In [4]:
# Prove that we can convert a string to an int
b = numpy.int64("1")
print("The type of b is: {0}.".format(type(b)))
print("The value of b is {0}.".format(b))
print("")

The type of b is: <class 'numpy.int64'>.
The value of b is 1.



## 3.3. Convert percentage string to float

In [5]:
# There is no datatype called 'percent' and we cannot directly convert "2.3%" to 0.023
# Prove this
try:
    c = numpy.float64("2.3%")
    print("The type of c is: {0}.".format(type(c)))
    print("")
except ValueError:
    print("An exception was raised while parsing c")
    print("")

def convert_percent_string_to_float(raw_bytes):
    
    # The input variable will be a byte array
    # We will convert this to a string
    input_string = raw_bytes.decode("utf-8")

    # We then do our manipulation
    input_string = input_string.strip()
    input_string = input_string.strip("%")

    # Make it a float
    input_float = float(input_string)

    # We move the decimal place
    result = input_float/100

    return result

test_string = "2.3%"
test_bytes = test_string.encode("utf-8")
x = convert_percent_string_to_float(test_bytes)
print("The type of x is: {0}.".format(type(x)))
print("The value of x is: {0}.".format(x))

An exception was raised while parsing c

The type of x is: <class 'float'>.
The value of x is: 0.023.


## 3.4. Convert string to float

In [6]:
d = numpy.float64("5.6667")
print("The type of d is: {0}.".format(type(d)))
print("The value of d is {0}.".format(d))
print("")

The type of d is: <class 'numpy.float64'>.
The value of d is 5.6667.



# Step 4. Plug ET functions into the numpy framework
Numpy provides us the ability to dictate the datatypes to be used for each column as well as a mapping object to apply functions to each column.
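
The converter mapping can be keyed by column name or by column index. The sketch below uses index keys together with `autostrip=True` and an explicit `encoding`. The function names are hypothetical, the converters accept both bytes and str as a precaution, and the sample rows are the ones from `tmp.csv`.

```python
import io
import numpy

def text_of(raw_value):
    # Accept bytes or str, since what genfromtxt passes in can vary
    if isinstance(raw_value, bytes):
        raw_value = raw_value.decode("utf-8")
    return raw_value.strip()

def percent_to_fraction(raw_value):
    # Drop the trailing % symbol, then scale to a fraction
    return float(text_of(raw_value).rstrip("%")) / 100

def text_to_day(raw_value):
    # Parse an ISO 8601 date with day precision
    return numpy.datetime64(text_of(raw_value), "D")

sample = io.StringIO("2019-04-08, 1, 2.3%, 45.6\n2019-05-08, 6, 78.9%, 0")
loaded = numpy.genfromtxt(
    sample,
    delimiter=",",
    autostrip=True,    # trim whitespace around each field
    encoding="utf-8",  # ask for decoded text rather than bytes
    names=("dates", "ints", "percents", "numbers"),
    dtype="datetime64[D],int64,float64,float64",
    converters={0: text_to_day, 2: percent_to_fraction},
)
print(loaded)
```

Keying by index is handy when a file has no meaningful column names, while keying by name reads better when names are supplied.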

In [7]:
# Setup some other parameters to instruct numpy function how and what we are importing
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"
delimiter = ","
converter_mapping = {
    "percents": convert_percent_string_to_float,
    "dates": convert_date_string_to_date
}
column_types = "datetime64[D],int64,float64,float64"

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

# Load the data into a numpy array
import numpy
homework = numpy.genfromtxt(test_data_file_handle, delimiter=delimiter, names=column_names, converters=converter_mapping, dtype=column_types)

homework

array([('2019-04-08', 1, 0.023, 45.), ('2019-04-08', 6, 0.789,  0.)],
      dtype=[('dates', '<M8[D]'), ('ints', '<i8'), ('percents', '<f8'), ('numbers', '<f8')])
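
Once the data is in a structured array, each column can be pulled out by name and used like an ordinary numpy array. The sketch below rebuilds the result shown above by hand so that it runs on its own. The variable names are illustrative.

```python
import numpy

# Hand-built copy of the loaded result, so this block is self-contained
loaded = numpy.array(
    [("2019-04-08", 1, 0.023, 45.0), ("2019-04-08", 6, 0.789, 0.0)],
    dtype=[("dates", "datetime64[D]"), ("ints", "int64"),
           ("percents", "float64"), ("numbers", "float64")],
)

# Columns are selected by field name
int_total = loaded["ints"].sum()
percent_mean = loaded["percents"].mean()

# Date columns support date arithmetic
next_days = loaded["dates"] + numpy.timedelta64(1, "D")

print(int_total, percent_mean, next_days)
```

Because each field now has a real numeric or date type, the values are ready for whatever algorithm comes next without further string handling.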