# Overview
This Jupyter notebook follows the "Doing ETL with Numpy" notebook.

We will be using the same sample data and the same high-level steps:
1. Examine the raw data format
2. Identify required extractions and transformations
3. Test transformations
4. Perform extract, transform, and load steps.

It is assumed you have read the notebook on loading a CSV file using numpy, as much of this notebook is taken from it.

# Step 1. Examine the data

# Step 2. Identify required extractions and transformations

## 2.1. Identify ways to solve the problem
We can try the same basic options as the numpy notebook:

1. Let pandas guess what to do (not recommended, we will see why)
2. Tell pandas what types you want and hope everything works out of the box (like in step 3.1)
3. Write a conversion function for the pandas framework
4. Write a parsing mechanism from scratch


## 2.2. Trial and error

We can see that option #1 will not work: pandas will not interpret this data correctly out of the box. Option #2 will not work either, because of the percentages: a value such as 2.3% cannot be read directly as a float.

In [3]:
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"
column_types = "object,int,str,float"

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

import pandas
df = pandas.read_csv(test_data_file_handle, names=column_names, dtype=column_types, parse_dates=["dates"])

df

|   | dates      | ints | percents | numbers |
|---|------------|------|----------|---------|
| 0 | 2019-04-08 | 1    | 2.3%     | 45.0    |
| 1 | 2019-04-08 | 6    | 78.9%    | 0.0     |


We see the data types are not correct: the percents column is still an object (string) column rather than a float.

In [4]:
df.dtypes

dates       datetime64[ns]
ints                 int32
percents            object
numbers            float64
dtype: object
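
As a cross-check of option #2, the sketch below declares the types per column with a dict, which is a documented form of the `dtype` argument to `pandas.read_csv`. The `skipinitialspace=True` flag is an assumption added here so the space after each comma is dropped; it is not used in the cell above. Under these assumptions the percents column should still come back as text, because a value like 2.3% is not a valid float.

```python
import io
import pandas

column_names = ["dates", "ints", "percents", "numbers"]
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"

# Declare only the columns pandas can parse without extra help
typed_df = pandas.read_csv(
    io.StringIO(test_data_string),
    names=column_names,
    dtype={"ints": "int64", "numbers": "float64"},
    parse_dates=["dates"],
    skipinitialspace=True,  # assumption: strip the space after each comma
)

# The percent strings are left untouched, so the column is not numeric yet
percents_are_numeric = pandas.api.types.is_numeric_dtype(typed_df["percents"])
print(typed_df["percents"].tolist(), percents_are_numeric)
```

Asking for a float type on the percents column directly is expected to fail for the same reason, which is why a conversion step is needed.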

## 2.3. Which data types to use?
We will need to write code to tell the system how to interpret the values. Some of this is implemented out of the box by numpy and some requires custom code.

Consulting the documentation we see that Pandas supports the numpy datatypes:

> For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns] (note that NumPy does not support timezone-aware datetimes).
>
> Pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the extensions pandas has made internally. See Extension types for how to write your own extension that works with pandas. See Extension data types for a list of third-party libraries that have implemented an extension.
>
> https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes

We also see it has some extensions:

> pandas contains extensive capabilities and features for working with time series data for all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality for manipulating time series data.
>
> https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

Reading more of the documentation we see that pandas offers some additional types:
- Timestamp
- Timedelta
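
For orientation only, here is a minimal sketch of how these two pandas types relate to the numpy types they build on. The variable names are illustrative and are not used elsewhere in this notebook.

```python
import numpy
import pandas

# pandas.Timestamp is the pandas counterpart of a single numpy.datetime64 value
stamp = pandas.Timestamp("2019-04-08")

# pandas.Timedelta is the pandas counterpart of numpy.timedelta64
one_day = pandas.Timedelta(days=1)

# Arithmetic between the two yields another Timestamp
next_day = stamp + one_day
print(next_day)

# Both can be converted back to the underlying numpy types
as_numpy_date = stamp.to_datetime64()
as_numpy_delta = one_day.to_timedelta64()
```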

We will stick with the original numpy datatypes used in the previous notebook:

| Column  | Input Type | Extract / Transform | Output Type |
|---------|------------|---------------------|-------------|
| 1       | string     |                     | datetime64  |
| 2       | string     |                     | int         |
| 3       | string     | Remove % symbol     | float       |
| 4       | string     |                     | float       |

# Step 3. Test extraction and transformation functions
We will use the same transformation functions as in the numpy notebook, as we are using the same data types.

In [5]:
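import numpy

def convert_date_string_to_date(raw_bytes):
    
    # The input variable will normally be a byte array
    # We will convert this to a string (and tolerate input that is already a string)
    input_string = raw_bytes.decode("utf-8") if isinstance(raw_bytes, bytes) else raw_bytes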

    # We then do our manipulation
    input_string = input_string.strip()

    # Make it a date
    result = numpy.datetime64(input_string, 'D')

    return result

def convert_percent_string_to_float(raw_bytes):
    
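    # The input variable will normally be a byte array
    # We will convert this to a string (and tolerate input that is already a string)
    input_string = raw_bytes.decode("utf-8") if isinstance(raw_bytes, bytes) else raw_bytes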

    # We then do our manipulation
    input_string = input_string.strip()
    input_string = input_string.strip("%")

    # Make it a float
    input_float = float(input_string)

    # We move the decimal place
    result = input_float/100

    return result
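
Before plugging the converters into the framework, it helps to call them directly on sample values. The sketch below is one way to do that; compact copies of the two converters are repeated so that the snippet runs on its own, and the byte-string inputs are taken from the test data used in this notebook.

```python
import math
import numpy

def convert_date_string_to_date(raw_bytes):
    # Compact copy of the date converter above
    return numpy.datetime64(raw_bytes.decode("utf-8").strip(), "D")

def convert_percent_string_to_float(raw_bytes):
    # Compact copy of the percent converter above
    return float(raw_bytes.decode("utf-8").strip().strip("%")) / 100

# Sample values from the test data, including the leading space after the comma
date_result = convert_date_string_to_date(b"2019-04-08")
percent_result = convert_percent_string_to_float(b" 2.3%")

# Compare floats with a tolerance rather than exact equality
assert date_result == numpy.datetime64("2019-04-08")
assert math.isclose(percent_result, 0.023)
print(date_result, percent_result)
```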

# Step 4. Plug ET functions into the numpy/pandas framework

In [7]:
# Set up the other parameters that tell the numpy function what we are importing and how
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"
delimiter = ","
converter_mapping = {
    "percents": convert_percent_string_to_float,
    "dates": convert_date_string_to_date
}
column_types = "datetime64[D],int64,float64,float64"

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

# Load the data into a numpy array
import numpy
nda = numpy.genfromtxt(test_data_file_handle, delimiter=delimiter, names=column_names, converters=converter_mapping, dtype=column_types)

# Convert the ndarray into a dataframe
df = pandas.DataFrame(nda)

df

|   | dates      | ints | percents | numbers |
|---|------------|------|----------|---------|
| 0 | 2019-04-08 | 1    | 0.023    | 45.0    |
| 1 | 2019-04-08 | 6    | 0.789    | 0.0     |


In [8]:
df.dtypes

dates       datetime64[ns]
ints                 int64
percents           float64
numbers            float64
dtype: object
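
The cell above routes the data through numpy first. As an alternative sketch of option #3 from section 2.1, and not the approach this notebook takes, the `converters` argument of `pandas.read_csv` can apply a function to a column while the file is read. The helper `percent_text_to_float` is a hypothetical name introduced here, and `skipinitialspace=True` is an assumption to drop the space after each comma. pandas passes each cell to a converter as a string, so no decode step is needed.

```python
import io
import pandas

def percent_text_to_float(text):
    # Hypothetical helper: strip whitespace and the % symbol, then scale to a fraction
    return float(text.strip().rstrip("%")) / 100

column_names = ["dates", "ints", "percents", "numbers"]
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"

pandas_only_df = pandas.read_csv(
    io.StringIO(test_data_string),
    names=column_names,
    converters={"percents": percent_text_to_float},
    parse_dates=["dates"],
    skipinitialspace=True,  # assumption: strip the space after each comma
)

print(pandas_only_df.dtypes)
```

The exact datetime resolution reported for the dates column may differ between pandas versions.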