# Easy Functional Data Engineering Examples - Encapsulating changing logic into data

In [1]:
# Let's get our dependencies
import sys
!{sys.executable} -m pip install pandas
import pandas as pd
import sqlite3
import shutil
from datetime import datetime

# some constants
DATA_FILE_PATH="data_2.csv"




In [3]:
df = pd.read_csv(DATA_FILE_PATH)
print("This is our dataframe...\n", df.head(5))

This is our dataframe...
    revenue  B  C   time
0        1  2  a  01-01
1        3  4  b  01-02
2        5  6  a  01-03


## Task 2

**Data Engineering Task:** The task is simple: calculate the revenue for "a" and the tax portion of it. The tax rate is 20% for 01-01 to 01-02 and 30% for 01-03. Let's start with a naive version of this.

In [18]:
df=pd.read_csv(DATA_FILE_PATH)
# Calculating tax:
df = df.reset_index()  # make sure indexes pair with number of rows

def calc_taxes(df):
    for index, row in df.iterrows():
        if row['time'] == "01-03":
            df.at[index,'tax_portion'] = row['revenue']*0.3
        else:
            df.at[index,'tax_portion'] = row['revenue']*0.2
    return df

df = calc_taxes(df)

In [20]:
df

   index  revenue  B  C   time  tax_portion
0      0        1  2  a  01-01          0.2
1      1        3  4  b  01-02          0.6
2      2        5  6  a  01-03          1.5


So what is the problem?

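What happens when the tax rate is set to 0.25 beginning on 01-04? Right now, the function would apply a rate of 0.2 to 01-04, so we would have to edit its code. Once we do that, the function no longer returns the same values given the same inputs: the result depends on which version of the code ran, not only on the data passed in.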
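To make the problem concrete, here is a minimal sketch that runs the hard-coded logic on a row for 01-04. The in-memory frame and the revenue value of 8 are made up for illustration, and `calc_taxes_naive` is a hypothetical, compact stand-in for the naive function above (it returns a copy instead of mutating its input).

```python
import pandas as pd


def calc_taxes_naive(df):
    """Hard-coded rates: 0.3 for 01-03, 0.2 for everything else."""
    out = df.copy()
    out["tax_portion"] = [
        revenue * (0.3 if time == "01-03" else 0.2)
        for revenue, time in zip(out["revenue"], out["time"])
    ]
    return out


# Illustrative data only: the 01-04 row is an assumption, not part of data_2.csv
df = pd.DataFrame({
    "revenue": [1, 3, 5, 8],
    "time": ["01-01", "01-02", "01-03", "01-04"],
})

result = calc_taxes_naive(df)
print(result)
# The 01-04 row silently falls into the 0.2 branch (8 * 0.2),
# although the intended rate from 01-04 on would be 0.25 (8 * 0.25).
```

Nothing fails here; the wrong rate is applied quietly, which is what makes hard-coded rules risky.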

The solution? We take the changing business logic out of the function and cement it into immutable data. Let us see how that works.

#### Future-proofing pure functions by extracting the parts that may change

In [34]:
df=pd.read_csv(DATA_FILE_PATH)
# Calculating tax:

df = df.reset_index()  # make sure indexes pair with number of rows

tax_rates = pd.DataFrame({'time': ['01-01','01-02','01-03'],
                   'tax_rate': [0.2,0.2,0.3]})

tax_rates.set_index('time')

# So now we encapsulated the business logic. We can easily add/exchange that... Let's continue!

       tax_rate
time
01-01       0.2
01-02       0.2
01-03       0.3


In [35]:
def calc_taxes(df, tax_rates):
    new_df = pd.merge(df,tax_rates, on='time')
    new_df['tax_portion']=new_df['revenue'] * new_df['tax_rate']
    return new_df

calc_taxes(df, tax_rates)

   index  revenue  B  C   time  tax_rate  tax_portion
0      0        1  2  a  01-01       0.2          0.2
1      1        3  4  b  01-02       0.2          0.6
2      2        5  6  a  01-03       0.3          1.5


**Why is this much better than before?** The tax rates are now an input to the function, so there are no side effects. We could test the function before as well, but now it is pure: we can change tax_rates in the future without touching the function, and we can recompute old analyses from their inputs alone (the earlier version hard-coded the rates, so it could not do that). So the function is now...

1. Future-proof: the parts that may change are no longer part of the function.
2. Pure, in the sense that you're able to recalculate old analyses based on the inputs alone.
3. Simpler: the complexity of the whole is much lower due to this decoupling.
4. Transparent: the changing business logic can be made public! Before, it was hard for an end user to see how the tax portion is calculated, but now you can simply display the always accurate "business logic" behind it.
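As a rough sketch of what this buys us, the example below adds a 0.25 rate for 01-04 by changing only the data, and checks that the old rows come out unchanged. The 01-04 row and its revenue of 8 are illustrative assumptions. The `how="left"` and `validate="many_to_one"` arguments and the missing-rate check are additions that go beyond the cell above; they are one possible way to stop a date without a rate from being dropped silently.

```python
import pandas as pd


def calc_taxes(df, tax_rates):
    # how="left" keeps every revenue row; validate raises if a date
    # appears more than once in tax_rates
    merged = pd.merge(df, tax_rates, on="time", how="left", validate="many_to_one")
    missing = merged.loc[merged["tax_rate"].isna(), "time"].unique()
    if len(missing) > 0:
        raise ValueError(f"no tax rate for: {list(missing)}")
    merged["tax_portion"] = merged["revenue"] * merged["tax_rate"]
    return merged


df_old = pd.DataFrame({"revenue": [1, 3, 5],
                       "time": ["01-01", "01-02", "01-03"]})
rates_v1 = pd.DataFrame({"time": ["01-01", "01-02", "01-03"],
                         "tax_rate": [0.2, 0.2, 0.3]})

# Assumed change: a 0.25 rate starts on 01-04. Only the data is extended.
rates_v2 = pd.concat(
    [rates_v1, pd.DataFrame({"time": ["01-04"], "tax_rate": [0.25]})],
    ignore_index=True,
)
df_new = pd.concat(
    [df_old, pd.DataFrame({"revenue": [8], "time": ["01-04"]})],
    ignore_index=True,
)

old_result = calc_taxes(df_old, rates_v1)
new_result = calc_taxes(df_new, rates_v2)
print(new_result)
```

Calling `calc_taxes(df_old, rates_v1)` again at any later point reproduces the old analysis exactly, because the function itself never had to change.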


## Doing this in other tools
Nothing about the functional data engineering approach is limited to Python or to "functions" per se. If you have a
dbt instance running and your SQL contains CASE statements that keep changing, you can do the very same thing:
encapsulate the changing business logic in "dbt seeds", and this way make your dbt tasks functional as well.

Note: This additionally requires you to "log the inputs", which might mean snapshotting your seed data so that
    you know which input was present at what time.
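In the pandas setting of this notebook, "logging the inputs" could look roughly like the sketch below. It is not a dbt feature: it writes each version of the rate table to a file named after a hash of its content, so an analysis can record which snapshot it used. `snapshot_rates` and the file naming scheme are hypothetical choices made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

import pandas as pd


def snapshot_rates(tax_rates, directory):
    """Write tax_rates to a content-addressed CSV file and return its path."""
    csv_text = tax_rates.sort_values("time").to_csv(index=False)
    digest = hashlib.sha256(csv_text.encode("utf-8")).hexdigest()[:12]
    path = Path(directory) / f"tax_rates_{digest}.csv"
    if not path.exists():  # identical content is stored only once
        path.write_text(csv_text, encoding="utf-8")
    return path


rates_v1 = pd.DataFrame({"time": ["01-01", "01-02", "01-03"],
                         "tax_rate": [0.2, 0.2, 0.3]})
rates_v2 = pd.DataFrame({"time": ["01-01", "01-02", "01-03", "01-04"],
                         "tax_rate": [0.2, 0.2, 0.3, 0.25]})

with tempfile.TemporaryDirectory() as tmp:
    name_v1 = snapshot_rates(rates_v1, tmp).name
    name_v1_again = snapshot_rates(rates_v1, tmp).name
    name_v2 = snapshot_rates(rates_v2, tmp).name
    # Reload the first snapshot to recompute an old analysis later on
    restored = pd.read_csv(Path(tmp) / name_v1)
```

Storing the snapshot name next to each result is enough to find the exact rate table again when an old analysis has to be rerun.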