# Workflow Orchestration

![img](https://miro.medium.com/max/1400/1*VTRH5WAotHAz_yKD-1XqWg.png)

Workflow orchestration frameworks are primarily used to monitor and observe the movement of data in production applications. 

Such frameworks typically include a family of independent features that collectively make modern data pipelines fault-tolerant and robust. These features include:

* scheduling and triggering jobs
* retrying failed work
* dependency and state management
* caching expensive tasks
* resource management
* observability

These allow us to gracefully handle failure events, including scenarios beyond our control like cloud outages or API failures. Data pipelines that do not explicitly track state become prone to triggering jobs prematurely, re-running already completed work, or even failing haphazardly.
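To make "retrying failed work" concrete, here is a minimal sketch of a retry helper in plain Python. The name `retry` and its parameters are hypothetical and not taken from any particular orchestration framework; real frameworks offer this as a built-in setting.

```python
import time


def retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and try again, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` argument exists only so the waiting can be swapped out when trying the helper, for example `retry(fetch_data, sleep=lambda seconds: None)`, where `fetch_data` stands in for any function that may fail.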

The features workflow orchestration provides are not limited to supporting the scheduled movement of data from a source to a destination. 

These features are also heavily applied in other domains such as machine learning and parameterized report generation. Presently, workflow orchestration is getting simple enough for hobbyists to adopt for personal projects. 


## Negative Engineering
Negative engineering happens when engineers write defensive code to make sure the positive code actually runs, that is, code that anticipates the infinite number of possible failures.

For example, give a data engineer a task and they will probably ask for Python, cron for automation, and more computers so that they can run the Python somewhere.

![img](https://media-exp1.licdn.com/dms/image/C5612AQGJ5uRJxPycPQ/article-cover_image-shrink_423_752/0/1520219397019?e=2147483647&v=beta&t=sxO6C8v4yIHfmWdLMfG1cDdgwbXNWV_mHHrbw98OCqs)

Negative engineering that just happened: 
- provisioning infrastructure (an always-on VM)
- how do we know the cron job ran?
- how do we debug failures?

No worries! We're smart, we can solve this. 
- We'll just add some logging
- Write to a file when the job completes 
- Add try / excepts with some alert code
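A rough sketch of what those three fixes tend to look like once they are written down. Everything here is hypothetical: `run_with_guards`, `send_alert`, and the marker file are illustrative names, and the "alert" only logs an error instead of paging anyone.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("nightly_job")


def send_alert(message):
    """Hypothetical alert hook; a real one might send an email or a chat message."""
    logger.error("ALERT: %s", message)


def run_with_guards(job, marker_path, alert=send_alert):
    """Run job(), log the outcome, and write a marker file only on success."""
    try:
        logger.info("job starting")
        job()
    except Exception as exc:
        alert(f"job failed: {exc!r}")
        return False
    # The marker file answers "how do we know the job ran?"
    Path(marker_path).write_text("done\n")
    logger.info("job finished")
    return True
```

None of this code does the actual work. It only exists so that we can tell whether the work happened, which is exactly the kind of code negative engineering describes.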

You can imagine how the story goes on: requirements change, more issues arise, and the engineer keeps adding code to anticipate each new failure.

You can watch the entire story in this [YouTube video](https://www.youtube.com/watch?v=wejJzGQ4XDo).


#### Why does this matter to you?
- continually patching legacy pipelines
- time spent fixing problems instead of building something new

## Consequences of pipeline failures
* time spent finding where in the pipeline the failure occurred
* premature job triggers
* data staleness 
* expensive compute rerunning tasks 
* duplicating work


## Common workflow patterns

- ETL 
- ELT
- ML
- Dashboarding
- DevOps


## Exercise: Native Python Workflow Example

Say I have a pair of shoes I really want to buy, but I have a tight budget. I want to find out when the shoe price drops so that I can buy them. For this example, let's create a Python script that finds the price of the shoes online, compares it to my budget, and prints out whether or not I should buy the shoes.

First, let's install the Python library we need as a dependency to scrape HTML from a shoe website.

In [None]:
!pip install beautifulsoup4

Import the libraries we will use for this example.

In [None]:
import requests
import re
from bs4 import BeautifulSoup
import time

Our function to find the price of a shoe takes a URL, parses the HTML looking for the product-price element, and returns the price.

In [None]:
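def find_nike_price(url):
    """Fetch a product page and return its price as an integer."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # The price is expected in a <div class="product-price"> element
    price_string = soup.find('div', {"class": "product-price"}).text
    price_string = price_string.replace(' ', '')
    # Keep only the first run of digits, so any cents are dropped
    price = int(re.search(r'[0-9]+', price_string).group(0))
    return price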
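The string-to-number step is easy to try without touching the network. Below is a small sketch that isolates it; `parse_price` is a hypothetical helper name, and unlike the function above it raises a clear error when no digits are found.

```python
import re


def parse_price(price_string):
    """Return the first run of digits in price_string as an int."""
    cleaned = price_string.replace(" ", "")
    match = re.search(r"[0-9]+", cleaned)
    if match is None:
        raise ValueError(f"no price found in {price_string!r}")
    return int(match.group(0))
```

With this approach `parse_price("$ 150")` gives `150` and `parse_price("$129.99")` gives `129`. Note that a price with a thousands separator, such as `"$1,200"`, would come back as `1`, which is one more failure case to anticipate.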

We'll build a function to compare the price returned from the URL to our budget. 

In [None]:
def compare_price(price, budget):
    if price <= budget:
        print("Buy the shoes! Good deal!")
    else:
        print("Don't buy the shoes. They're too expensive")


We should test our function to make sure it's working properly.

In [None]:
# Test the function - should print buy the shoes
compare_price(120, 150)

# Test the function - should print too expensive
compare_price(150, 120)

Now that we have a function to grab a price from a URL, and a function that compares the prices, we can put it all together. 

In [None]:

def nike_flow(url, budget):
    price = find_nike_price(url)
    compare_price(price, budget)


url = "https://www.nike.com/t/air-max-270-womens-shoes-Pgb94t/AH6789-601"
budget = 120

nike_flow(url, budget)

Now if I wanted to put this on a schedule I might add something like this:

In [None]:
# time.sleep with infinite loop to put this on a schedule

while True:
    time.sleep(300)
    nike_flow(url, budget)
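One weakness of the loop above is that a single uncaught exception, say the website being down for a minute, ends the schedule for good. A minimal sketch of a more forgiving loop follows; `run_on_schedule` and its parameters are hypothetical names, and `max_runs` is only there so the loop can be stopped when experimenting.

```python
import time


def run_on_schedule(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, sleeping between runs; return (runs, failures)."""
    runs = 0
    failures = 0
    while max_runs is None or runs < max_runs:
        try:
            job()
        except Exception as exc:
            # Record the failure and keep the schedule alive
            failures += 1
            print(f"run {runs + 1} failed: {exc!r}")
        runs += 1
        sleep(interval_seconds)
    return runs, failures
```

It could be called as `run_on_schedule(lambda: nike_flow(url, budget), 300)`. Even so, we still cannot see past runs, get notified of failures, or restart cleanly after the machine reboots, which is where an orchestration framework comes in.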

## Discussion: What Can You Use Workflow Orchestration For?

Fun Examples:
- March Madness brackets
- Notification on shoe prices 
- Turning off your lights (us not being lazy)
- Notifications on crypto 

## Q&A

## Reference
Full code example, all put together:

In [None]:
import requests
import re
from bs4 import BeautifulSoup
import time

def find_nike_price(url):
    """Fetch a product page and return its price as an integer."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # The price is expected in a <div class="product-price"> element
    price_string = soup.find('div', {"class": "product-price"}).text
    price_string = price_string.replace(' ', '')
    # Keep only the first run of digits, so any cents are dropped
    price = int(re.search(r'[0-9]+', price_string).group(0))
    return price

def compare_price(price, budget):
    if price <= budget:
        print("Buy the shoes! Good deal!")
    else:
        print("Don't buy the shoes. They're too expensive")

def nike_flow(url, budget):
    price = find_nike_price(url)
    compare_price(price, budget)

if __name__ == "__main__":
    
    url = "https://www.nike.com/t/air-max-270-womens-shoes-Pgb94t/AH6789-601"
    budget = 120

    # time.sleep with infinite loop to put this on a schedule

    # while True:
    #     time.sleep(300)
    #     nike_flow(url, budget)

    nike_flow(url, budget)