# Doing ETL with Pandas (Importing a CSV to Pandas)
This Jupyter notebook follows the "Doing ETL with Numpy" notebook.

We will be using the same high-level steps and the same sample data:
1. Examine the data
2. Determine if the desired data format matches the input format
3. Determine how Pandas implements the desired data formats
4. Test out all the types and get familiar with them
5. ETL the data

# Prerequisites

The first thing you need installed on your system is pip, the Python package installer.

If pip is installed, open an OS terminal (a bash shell on Linux or a command prompt on Windows).

To ask pip to install Pandas, run the following:

<code>pip install pandas</code>
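To confirm that the installation worked, a quick check along these lines can be run from Python. It is only a sketch: it imports the library and prints whatever version happens to be installed in your environment.

```python
# Confirm that pandas can be imported and report the installed version
import pandas

installed_version = pandas.__version__
print("pandas version: {0}".format(installed_version))
```

If the import fails, pip most likely installed the package for a different Python interpreter than the one running the notebook.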

# Step 1. Examine the data
We need to identify what needs to be extracted and what can be ignored.

I have a look at the data file:

<code>[root@localhost JupyterDataScience]# cat tmp.csv 
2019-04-08, 1, 2.3%, 45.6
2019-05-08, 6, 78.9%, 0
</code>

There are several key observations:
1. There are no column headers
2. Each column appears to have a different type (date, int, percent, float)
3. The data is not clean: there are extra spaces in the values
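The third observation is easy to miss by eye. One way to make the stray spaces visible is to split the rows with Python's built-in `csv` module and print each field with `repr()`. The sketch below holds the two sample rows in a string (the name `sample_text` is just for illustration) so that it does not depend on the file being present.

```python
import csv
import io

# The two sample rows from tmp.csv, held in a string so the sketch is self-contained
sample_text = "2019-04-08, 1, 2.3%, 45.6\n2019-05-08, 6, 78.9%, 0\n"

rows = list(csv.reader(io.StringIO(sample_text)))
for row in rows:
    # repr() puts quotes around each field, which makes leading spaces visible
    print([repr(field) for field in row])

first_row = rows[0]
```

Every field after the first one starts with a space, which is exactly the kind of detail the ETL step has to deal with.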

# Step 2. Determine if the desired data format matches the input format
If the data is already in the format we will be using for our algorithms, then we don't need to do ETL.

But this is a tutorial on ETL so our life is not that easy...

Let's assume that we want the data to be put into a new specific format (date, int, float, float).

While the second and fourth columns can magically be recognized, the first and third columns cannot, and we will need to write code to tell the system how to interpret "2.3%" as 0.023.

# Step 3. Determine how Pandas implements the desired data formats

We know we want to use "dates" based on our input data. Most programming languages offer various "primitives" (built-in objects) which implement basic things: for example, Python natively supports integers, booleans, strings, etc.

... but what does Pandas offer? In other words, how does the Pandas framework represent a date, and what features does this object type have?

From googling (or referring to the previous lecture) and working with Pandas, we will learn the following:

The DataFrame is the main object that a user will work with when using the Pandas library. The term DataFrame is a common one which refers to a 2D array (i.e. a matrix... think Excel spreadsheet). The Pandas library implements its own primitives separate from those of the vanilla Python language. It then uses these primitives to create DataFrame objects.

At this point we are more focused on the primitives rather than the DataFrame.

For more information we turn to the documentation:

> For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns] (note that NumPy does not support timezone-aware datetimes).

> Pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the extensions pandas has made internally. See Extension types for how to write your own extension that works with pandas. See Extension data types for a list of third-party libraries that have implemented an extension.

https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes
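To see these dtypes in practice, the sketch below builds a tiny DataFrame by hand and prints the dtype of each column. The column names and values are made up for illustration; the exact integer width reported may vary by platform.

```python
import pandas

# A small hand-built DataFrame with one column per basic type
example = pandas.DataFrame({
    "ints": [1, 6],
    "floats": [45.6, 0.0],
    "flags": [True, False],
    "dates": pandas.to_datetime(["2019-04-08", "2019-05-08"]),
})

# Each column carries its own dtype
print(example.dtypes)
```

Each column is stored with a single dtype, which is why getting the types right at import time matters.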


A bit of an egg hunt but we did it... 

## Step 3.1 Examine the options for handling dates and times

Consulting the documentation we are reminded that Pandas supports the numpy datatypes.

> pandas contains extensive capabilities and features for working with time series data for all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality for manipulating time series data.
>
> https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

Reading more of the documentation we see that pandas offers some additional types:
- Timestamp
- Timedelta


# Step 4. Test out all the types and get familiar with them
Let's see if these data types can satisfy our needs.

## Step 4.1 Examine the Timestamp object

Consulting the documentation: 

> Timestamp is the pandas equivalent of python’s Datetime and is interchangeable with it in most cases. It’s the type used for the entries that make up a DatetimeIndex, and other timeseries oriented data structures in pandas.
>
> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html

Further down the page you can see the class constructor, class methods, etc.




In [1]:
import pandas

my_timestamp = pandas.Timestamp(year=2019, month=1, day=1, hour=14)

# Look at some attributes on the object
print("Day of the week (Tuesday): {0}.".format(my_timestamp.dayofweek))
print("Day of the year: {0}.".format(my_timestamp.dayofyear))

# Look at some methods on the object
print("The weekday name: {0}.".format(my_timestamp.day_name()))
print("The object as a datetime64: {0}.".format(my_timestamp.to_datetime64()))


Day of the week (Tuesday): 1.
Day of the year: 1.
The weekday name: Tuesday.
The object as a datetime64: 2019-01-01T14:00:00.000000000.


If we continue googling we learn a little bit more about this data type:

> pandas stores datetimes as data with type datetime64 in index/columns (this are not datetime.datetime objects). This is the standard numpy type for datetimes and is more performant than using  datetime.datetime objects. 
>
> when retrieving one value of such a datetime column/index, you will see a Timestamp object. This is a more convenient object to work with the datetimes (more methods, better representation, etc than the datetime64), and this is a subclass of datetime.datetime, and so has all methods of it.
>
> https://stackoverflow.com/questions/23755146/why-does-pandas-return-timestamps-instead-of-datetime-objects-when-calling-pd-to#23756824

So, the Timestamp is just a more convenient datetime64 or datetime object that Pandas uses.
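The relationship between the three types can be checked directly. The sketch below creates one Timestamp and converts it to the standard library and numpy equivalents; the variable names are illustrative only.

```python
import datetime

import numpy
import pandas

ts = pandas.Timestamp("2019-04-08 14:00")

# A Timestamp is a subclass of the standard library datetime
is_datetime = isinstance(ts, datetime.datetime)
print(is_datetime)

# It can be converted to a plain datetime or to a numpy datetime64
as_python = ts.to_pydatetime()
as_numpy = ts.to_datetime64()
print(type(as_python))
print(type(as_numpy))
```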

## Step 4.2 Have a look at how the Pandas library parses dates and times

In [2]:
# Import the pandas library
import pandas

# Let's have a look at the object relating to the date 2018-03-10
a = pandas.Timestamp('20180310')
print("The type of a is: {0}.".format(type(a)))
print("The value of a is {0}.".format(a))
print("")

# Let's try a date written with dashes
b = pandas.Timestamp('2019-03-10')
print("The type of b is: {0}.".format(type(b)))
print("The value of b is {0}.".format(b))
print("")

# Let's try a date written with slashes (read as month/day/year)
c = pandas.Timestamp('03/10/2020')
print("The type of c is: {0}.".format(type(c)))
print("The value of c is {0}.".format(c))
print("")

# Let's try a date with a time that includes seconds
d = pandas.Timestamp('2021-03-10 01:00:00')
print("The type of d is: {0}.".format(type(d)))
print("The value of d is {0}.".format(d))
print("")

# Let's try a date with a time that omits the seconds
e = pandas.Timestamp('2022-03-10 01:00')
print("The type of e is: {0}.".format(type(e)))
print("The value of e is {0}.".format(e))
print("")

# Let's try a date with a 12-hour clock time
f = pandas.Timestamp('2022-03-10 01:00 PM')
print("The type of f is: {0}.".format(type(f)))
print("The value of f is {0}.".format(f))
print("")

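# So far, this is MUCH easier than numpy! Pandas has some magic!?

# Let's now see if we can deal with a date rather than a datetime.
# The freq argument used in older pandas releases is not accepted by
# recent ones, so we normalize the timestamp to midnight instead.
g = pandas.Timestamp('20180310').normalize()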
print("The type of g is: {0}.".format(type(g)))
print("The value of g is {0}.".format(g))
print("")


The type of a is: <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
The value of a is 2018-03-10 00:00:00.

The type of b is: <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
The value of b is 2019-03-10 00:00:00.

The type of c is: <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
The value of c is 2020-03-10 00:00:00.

The type of d is: <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
The value of d is 2021-03-10 01:00:00.

The type of e is: <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
The value of e is 2022-03-10 01:00:00.

The type of f is: <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
The value of f is 2022-03-10 13:00:00.

The type of g is: <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
The value of g is 2018-03-10 00:00:00.
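One caveat with letting Pandas guess: a string like '03/10/2020' is ambiguous, and as the output for `c` shows it is read month first. When the input is known to be day first, the format can be stated explicitly with `pandas.to_datetime`. The sketch below assumes, purely for illustration, that the same string should be read as day/month/year.

```python
import pandas

ambiguous = "03/10/2020"

# Without a format, the string is read month first (March 10th)
month_first = pandas.Timestamp(ambiguous)

# With an explicit format, the same string is read day first (October 3rd)
day_first = pandas.to_datetime(ambiguous, format="%d/%m/%Y")

print(month_first)
print(day_first)
```

Being explicit about the format is usually the safer choice when the source of the data is known.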



# Step 5. ETL the data
In this step we are going to look at a few different options for parsing data (and yes there are a few ways to skin this cat).

The basic options are (in order of preference and ease):

1. Let Pandas guess what to do (not recommended, we will see why)
2. Tell Pandas what types you want and hope everything works out of the box (like in step #4)
3. Write a conversion function for the Pandas framework
4. Write a parsing mechanism from scratch
5. Create a DataFrame using a numpy ndarray

## Step 5.1 Let Pandas guess what to do (not recommended)

In [3]:
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

test_1 = pandas.read_csv(test_data_file_handle, names=column_names)

test_1

        dates  ints percents  numbers
0  2019-04-08     1     2.3%     45.0
1  2019-04-08     6    78.9%      0.0


We can see this is not ideal, as our data is not in the correct datatype:

In [17]:
test_1["dates"]

0    2019-04-08
1    2019-04-08
Name: dates, dtype: object

In [5]:
test_1["dates"][0]

'2019-04-08'

In [6]:
type(test_1["dates"][0])

str
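If the data has already been loaded this way, the column can still be repaired after the fact. The sketch below repeats the same read and then converts the string column with `pandas.to_datetime`; the names `raw` and `guessed` are only for illustration.

```python
import io

import pandas

column_names = ("dates", "ints", "percents", "numbers")
raw = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"

# Same read as above: pandas leaves the dates as strings
guessed = pandas.read_csv(io.StringIO(raw), names=column_names)

# Convert the string column into datetime values after loading
guessed["dates"] = pandas.to_datetime(guessed["dates"])

first_value = guessed["dates"][0]
print(type(first_value))
```

This works, but it means every column needs its own clean-up pass, which is why the following steps handle the types during the import itself.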

## Step 5.2 Tell Pandas what types you want (like in step #4)

In [7]:
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"
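# Map column names to the desired types (the documented dict form of dtype);
# "dates" is left out because parse_dates handles that column
column_types = {"ints": int, "percents": str, "numbers": float}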

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

test_2 = pandas.read_csv(test_data_file_handle, names=column_names, dtype=column_types, parse_dates=["dates"])

test_2

       dates  ints percents  numbers
0 2019-04-08     1     2.3%     45.0
1 2019-04-08     6    78.9%      0.0


In [8]:
test_2["dates"]

0   2019-04-08
1   2019-04-08
Name: dates, dtype: datetime64[ns]

In [9]:
test_2["dates"][0]

Timestamp('2019-04-08 00:00:00')

In [10]:
type(test_2["dates"][0])

pandas._libs.tslibs.timestamps.Timestamp

In [11]:
type(test_2["numbers"][0])

numpy.float64
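The percents column is still a string at this point, and it also still carries the leading space we spotted in Step 1. `read_csv` has a `skipinitialspace` option for exactly this. The sketch below compares the two reads; the variable names are illustrative.

```python
import io

import pandas

column_names = ("dates", "ints", "percents", "numbers")
raw = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"

# Default behaviour: the space after each comma stays in string fields
default_read = pandas.read_csv(io.StringIO(raw), names=column_names)

# skipinitialspace=True drops whitespace that follows the delimiter
trimmed_read = pandas.read_csv(io.StringIO(raw), names=column_names, skipinitialspace=True)

print(repr(default_read["percents"][0]))
print(repr(trimmed_read["percents"][0]))
```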

## Step 5.3 Write a conversion function for the Pandas framework
While this is similar to the conversion function used in numpy, notice that instead of dealing with bytes we deal with strings, which is much easier.

In [6]:
import pandas

def convert_percent_string_to_float(raw_string):
    
    # With pandas the input variable is already a string,
    # so no decoding from bytes is needed

    # Strip surrounding whitespace and the percent sign
    raw_string = raw_string.strip()
    raw_string = raw_string.strip("%")

    # Make it a float
    input_float = float(raw_string)

    # We move the decimal place
    result = input_float/100

    return result

# Set up some other parameters to tell the pandas function how and what we are importing
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"
delimiter = ","
converter_mapping = {
    "percents": convert_percent_string_to_float
}
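# "dates" is handled by parse_dates and "percents" by the converter,
# so only the remaining columns need an explicit type
column_types = {"ints": int, "numbers": float}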

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

test_3 = pandas.read_csv(test_data_file_handle, names=column_names, dtype=column_types, parse_dates=["dates"], converters=converter_mapping)

test_3



       dates  ints  percents  numbers
0 2019-04-08     1     0.023     45.0
1 2019-04-08     6     0.789      0.0


In [60]:
",".join(test_3.iloc[:1].to_string().split("\n")[1].split()[1:])

'2019-04-08,1,0.023,45.0'
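A converter function is not the only way to do this. As an alternative sketch (not the approach used above), the column can be loaded as text and then converted in one pass with the pandas string methods. The name `frame` is only for illustration.

```python
import io

import pandas

column_names = ("dates", "ints", "percents", "numbers")
raw = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"
frame = pandas.read_csv(io.StringIO(raw), names=column_names)

# Strip whitespace, drop the percent sign, convert to float, move the decimal place
frame["percents"] = frame["percents"].str.strip().str.rstrip("%").astype(float) / 100

print(frame["percents"])
```

The converter keeps everything inside the `read_csv` call, while this version keeps the import simple and does the clean-up afterwards; which one reads better is a matter of taste.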

## Step 5.4 Write a custom parsing system
This is a bit outside the scope of this section and more advanced... We will cover it later if needed.

I doubt anyone would need this anyways...
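For the curious, a from-scratch approach could look roughly like the sketch below. It is not a full parser: it assumes exactly the four columns of our sample data, uses the built-in `csv` module to split the rows, and relies on a hypothetical helper named `parse_row` to convert each field by hand.

```python
import csv
import datetime
import io

import pandas


def parse_row(fields):
    # Hypothetical helper: turns one list of raw strings into typed values
    date_text, int_text, percent_text, number_text = (f.strip() for f in fields)
    return {
        "dates": datetime.datetime.strptime(date_text, "%Y-%m-%d"),
        "ints": int(int_text),
        "percents": float(percent_text.rstrip("%")) / 100,
        "numbers": float(number_text),
    }


raw = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"

# Split the text into rows of fields, then convert each row
records = [parse_row(fields) for fields in csv.reader(io.StringIO(raw))]
parsed = pandas.DataFrame(records)

print(parsed)
```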

## Step 5.5 Create a DataFrame using an ndarray

In [13]:
# Define two functions used to parse data into ndarray

def convert_date_string_to_date(raw_bytes):
    
    # Depending on the NumPy version and encoding settings, the input
    # may arrive as bytes or as a str; normalize it to a str
    input_string = raw_bytes.decode("utf-8") if isinstance(raw_bytes, bytes) else raw_bytes

    # We then do our manipulation
    input_string = input_string.strip()

    # Make it a date
    result = numpy.datetime64(input_string, 'D')

    return result

def convert_percent_string_to_float(raw_bytes):
    
    # Depending on the NumPy version and encoding settings, the input
    # may arrive as bytes or as a str; normalize it to a str
    input_string = raw_bytes.decode("utf-8") if isinstance(raw_bytes, bytes) else raw_bytes

    # We then do our manipulation
    input_string = input_string.strip()
    input_string = input_string.strip("%")

    # Make it a float
    input_float = float(input_string)

    # We move the decimal place
    result = input_float/100

    return result

# Set up some other parameters to tell the numpy function how and what we are importing
column_names = ("dates", "ints", "percents", "numbers")
test_data_string = "2019-04-08, 1, 2.3%, 45.\n2019-04-08, 6, 78.9%, 0"
delimiter = ","
converter_mapping = {
    "percents": convert_percent_string_to_float,
    "dates": convert_date_string_to_date
}
column_types = "datetime64[D],int64,float64,float64"

# Import a module to help us import data
# This module implements a file-like class, StringIO, that reads and writes a string buffer
import io

# Create a file handle for our string data
test_data_file_handle = io.StringIO(test_data_string)

# Load the data into a numpy array
import numpy
test_5 = numpy.genfromtxt(test_data_file_handle, delimiter=delimiter, names=column_names, converters=converter_mapping, dtype=column_types)

# Convert the ndarray into a dataframe
dataframe = pandas.DataFrame(test_5)

dataframe

       dates  ints  percents  numbers
0 2019-04-08     1     0.023     45.0
1 2019-04-08     6     0.789      0.0
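The conversion in the last line works because `genfromtxt` returns a structured ndarray whose field names become the column names. To see that part in isolation, the sketch below builds a small structured array by hand (the values are copied from the output above) and passes it to `pandas.DataFrame`.

```python
import numpy
import pandas

# One named, typed field per column
record_type = numpy.dtype([
    ("dates", "datetime64[D]"),
    ("ints", "int64"),
    ("percents", "float64"),
    ("numbers", "float64"),
])

records = numpy.array(
    [
        (numpy.datetime64("2019-04-08"), 1, 0.023, 45.0),
        (numpy.datetime64("2019-04-08"), 6, 0.789, 0.0),
    ],
    dtype=record_type,
)

# The field names of the structured array become the column names
frame = pandas.DataFrame(records)
print(frame.dtypes)
```

The resolution that pandas reports for the dates column can differ between pandas versions, so it is worth printing the dtypes rather than assuming them.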
