# Easy Functional Data Engineering Examples

In [39]:
# Let's get our dependencies
import sys
!{sys.executable} -m pip install pandas
import pandas as pd
import sqlite3
import shutil
from datetime import datetime

# some constants
DATA_FILE_PATH="data.csv"
CONN = sqlite3.connect(':memory:')



In [18]:
df = pd.read_csv(DATA_FILE_PATH)
print("This is our dataframe...\n", df.head(5))

This is our dataframe...
    A  B  C
0  1  2  a
1  3  4  b
2  5  6  a


## Task 1

**Data Engineering Task:** Alright, the task is simple. DATA_FILE_PATH is our external data source, and we want to load its data into our analytical database CONN. While doing that, it'd be cool to turn the "a"s and "b"s in column C into the proper names "analytical_data" and "business_data".

In [27]:
# Let's write a simple function to do that...

def get_n_write_data(path):
    df = pd.read_csv(path)
    df['C'] = df['C'].apply(lambda x: "analytical_data" if x == "a" else "business_data")
    rows_written = df.to_sql("our_data_table",CONN, if_exists="append")
    print(f"Written {rows_written} rows into the db")
    return 0

In [29]:
get_n_write_data(DATA_FILE_PATH)

Written 3 rows into the db


0

### Let's see how this is not functional data engineering

Note: I'm gonna say "function", but really the right unit is more like a "task", or simply a "unit", which might incorporate more than just the function we're calling. It might be a Spark job plus some files, it might be a dbt run, it might be a lot of things.

Alright, so our functions should be pure, immutable, and idempotent. How does that work out right now?

1. Idempotency. What happens if we call it twice?
=> We duplicate the data, because every call appends the same rows again. Not cool.

=> The easy fix for this dataset? We could just switch from "append" to "replace", so each run overwrites the table.

2. Immutability. What happens if the underlying data file changes?
=> We get a different outcome, which might be intended, or it might not. In any case, we have some mutability in here.

3. Purity. What are the side effects of the function?
=> The function is influenced by external conditions (namely the contents of the file at "path"), and it also influences external things (the database table).

=> Let's see what we could do about that...
=> There's no way around side effects if we're reading inputs and writing outputs, but there is a way to minimize them: keep the transformation pure and put thin wrappers around just the I/O part!

#### 1.1 Making parts of the task functionally pure

In [34]:
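def transform_data(input_dataframe):
    # Work on a copy so the caller's dataframe is left untouched (no hidden mutation).
    output_dataframe = input_dataframe.copy()
    output_dataframe['C'] = output_dataframe['C'].apply(lambda x: "analytical_data" if x == "a" else "business_data")
    return output_dataframe

def get_n_write_data(path):
    df = pd.read_csv(path)
    output_dataframe = transform_data(df)
    # Write the transformed dataframe, not the raw one.
    rows_written = output_dataframe.to_sql("our_data_table", CONN, if_exists="append")
    print(f"Written {rows_written} rows into the db")
    return 0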

In [35]:
get_n_write_data(DATA_FILE_PATH)

Written 3 rows into the db


0

In [36]:
# The benefit? The transformation behavior just became reproducible and testable! 
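
To make "testable" concrete, here is a minimal sketch of a unit test for the transformation. It restates a copy-based transform_data so the snippet runs on its own; the test name test_transform_data and the sample rows are assumptions made up for illustration, and no file or database is involved.

```python
import pandas as pd

def transform_data(input_dataframe):
    # Assumption: the transform works on a copy, so the input is never modified.
    output_dataframe = input_dataframe.copy()
    output_dataframe["C"] = output_dataframe["C"].apply(
        lambda x: "analytical_data" if x == "a" else "business_data"
    )
    return output_dataframe

def test_transform_data():
    # Hypothetical sample rows that mirror the shape of data.csv.
    raw = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6], "C": ["a", "b", "a"]})
    result = transform_data(raw)

    # The codes are mapped to their long names.
    assert result["C"].tolist() == ["analytical_data", "business_data", "analytical_data"]
    # The input dataframe is unchanged.
    assert raw["C"].tolist() == ["a", "b", "a"]
    # Same input, same output.
    assert result.equals(transform_data(raw))

test_transform_data()
```

Because the test builds its own dataframe, it runs without data.csv and without CONN.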

#### 1.2 Making the task idempotent
To make the task idempotent, we need to change the "appending" behavior. In this case, that's pretty simple: we stop appending and simply always replace everything.

In [37]:
def transform_data(input_dataframe):
    # Work on a copy so the caller's dataframe is left untouched (no hidden mutation).
    output_dataframe = input_dataframe.copy()
    output_dataframe['C'] = output_dataframe['C'].apply(lambda x: "analytical_data" if x == "a" else "business_data")
    return output_dataframe

def get_n_write_data(path):
    df = pd.read_csv(path)
    output_dataframe = transform_data(df)
    # "replace" overwrites the table, so running the task twice leaves the same rows.
    rows_written = output_dataframe.to_sql("our_data_table", CONN, if_exists="replace")
    print(f"Written {rows_written} rows into the db")
    return 0
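
A quick way to see the difference is to run the same load twice and count the rows afterwards. The sketch below is self-contained and makes a few assumptions: the helper name load is hypothetical, each mode gets its own in-memory database, and index=False is passed so only the three data columns are written.

```python
import sqlite3
import pandas as pd

def load(df, conn, if_exists):
    # Hypothetical helper: write the dataframe, then return the table's row count.
    df.to_sql("our_data_table", conn, if_exists=if_exists, index=False)
    return conn.execute("SELECT COUNT(*) FROM our_data_table").fetchone()[0]

df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6], "C": ["a", "b", "a"]})

# Appending twice duplicates the rows.
append_conn = sqlite3.connect(":memory:")
append_counts = [load(df, append_conn, "append") for _ in range(2)]

# Replacing twice leaves the table in the same state as replacing once.
replace_conn = sqlite3.connect(":memory:")
replace_counts = [load(df, replace_conn, "replace") for _ in range(2)]

print(append_counts)   # [3, 6]
print(replace_counts)  # [3, 3]
```

Replacing the whole table only works while the dataset is small enough to rewrite on every run, which is why the text calls it the easy fix for this dataset.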

#### 1.3 Making the task more immutable
Finally, we can do one more thing to make our task a bit more immutable. What happens if the file changes? Well, everything goes down the drain: we can no longer reproduce what an earlier run saw. So what could we do? We could simply copy the file to a unique place before running our stuff.

In [46]:
# THIS FUNCTION IS UNITTESTABLE AND PURE
def transform_data(input_dataframe):
    # Work on a copy so the caller's dataframe is left untouched (no hidden mutation).
    output_dataframe = input_dataframe.copy()
    output_dataframe['C'] = output_dataframe['C'].apply(lambda x: "analytical_data" if x == "a" else "business_data")
    return output_dataframe


# I/O WRAPPER, NOT PURE: reads the clock and copies a file
def copy_to_unique(path):
    # Get the current date and time and use it as a unique PROCESSING TIME stamp
    dt = datetime.now()
    ts = datetime.timestamp(dt)
    target_file = f'{ts}_data.csv'
    shutil.copyfile(path, target_file)
    return target_file

# I/O WRAPPER, NOT PURE: reads the staging file and writes to the db
def write_data(staging_file):
    df = pd.read_csv(staging_file)
    output_dataframe = transform_data(df)
    # "replace" keeps the task idempotent; write the transformed dataframe.
    rows_written = output_dataframe.to_sql("our_data_table", CONN, if_exists="replace")
    print(f"Written {rows_written} rows into the db")

# THE ENTRY POINT: ONLY WIRES THE SMALL NON-PURE WRAPPERS TOGETHER
def get_n_write_data(path):
    staging_file = copy_to_unique(path)
    write_data(staging_file)
    return 0

In [47]:
get_n_write_data(DATA_FILE_PATH)

Written 3 rows into the db


0
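
To see what the copy buys us, the sketch below snapshots a file, changes the source afterwards, and checks that the snapshot still holds the original contents. It assumes a variant of copy_to_unique that takes a staging_dir argument (not part of the notebook's version) and uses a temporary directory with made-up file contents, so nothing is left behind on disk.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def copy_to_unique(path, staging_dir):
    # Assumed variant: stamp the copy with the processing time and
    # place it in staging_dir instead of the working directory.
    ts = datetime.now().timestamp()
    target_file = Path(staging_dir) / f"{ts}_{Path(path).name}"
    shutil.copyfile(path, target_file)
    return target_file

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.csv"
    source.write_text("A,B,C\n1,2,a\n")

    snapshot = copy_to_unique(source, tmp)

    # The external source changes after we took the snapshot.
    source.write_text("A,B,C\n9,9,b\n")

    snapshot_text = snapshot.read_text()
    source_text = source.read_text()

print(snapshot_text == "A,B,C\n1,2,a\n")  # True
print(source_text == snapshot_text)        # False
```

Any later rerun that reads from the snapshot sees the same input as the first run, no matter what happens to the source file in the meantime.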