<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="../../resources/logo.png" alt="Intellinum Bootcamp" style="width: 600px; height: 163px">
</div>

# Job Failure

Apache Spark&trade; allows for the creation of robust job failure strategies.

## In this lesson you:
* Use antijoins to ensure that duplicate records are not loaded into your target database
* Design a job failure monitoring strategy using job IDs
* Evaluate job failure recovery strategies for idempotence


### Idempotent Failure Recovery

Jobs can fail for any number of reasons.  The majority of job failures are caused by input/output (I/O) problems, but other causes include schema evolution, data corruption, and hardware failures.  Recovery from job failure should be guided by the principle of *idempotence*: the property of an operation whereby it can be applied multiple times without changing the result beyond the first application.

More formally, a function `f` is idempotent when applying it to `x` once gives the same result as applying it two or more times:

&nbsp;&nbsp;&nbsp;&nbsp;`f(x) = f(f(x)) = f(f(f(x))) = ...`
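
As a minimal plain-Python sketch of this definition in ETL terms, treat "load this batch into the target" as the function `f`. The helper names `append_load` and `keyed_load` and the `id` key are hypothetical and chosen only for illustration; lists of dictionaries stand in for tables.

```python
def append_load(target, batch):
    """Not idempotent: every rerun appends the same records again."""
    return target + batch


def keyed_load(target, batch):
    """Idempotent: records whose key is already in the target are skipped."""
    loaded = list(target)
    seen = {row["id"] for row in loaded}
    for row in batch:
        if row["id"] not in seen:
            loaded.append(row)
            seen.add(row["id"])
    return loaded


batch = [{"id": 1}, {"id": 2}]

# f(x) compared with f(f(x)) for the keyed load
once = keyed_load([], batch)
twice = keyed_load(once, batch)
print(once == twice)

# The same comparison for a blind append
appended_once = append_load([], batch)
appended_twice = append_load(appended_once, batch)
print(len(appended_once), len(appended_twice))
```

Rerunning the keyed load leaves the target unchanged, while rerunning the append doubles the record count.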

In ETL job recovery, we need to be able to run a job multiple times and get our data into our target database without duplicates.  This can be accomplished in a few ways:

* A **left antijoin** of new data on data already in a database returns only the records that have not yet been inserted
* Overwriting all data is a resource-intensive way to ensure that all data was written
* The transactionality of databases enables all-or-nothing writes, where failure of any part of the job results in no committed data
* Leveraging primary keys in a database lets you write only the rows whose primary key is not already present, or upsert the rows whose key already exists
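
The primary-key and transaction strategies above can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not part of the lab environment: the in-memory database, the `production` table, and the `load` helper are hypothetical stand-ins for a real target database.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production (id INTEGER PRIMARY KEY, status TEXT)")

batch = [(1, "ok"), (2, "ok"), (3, "ok")]


def load(connection, rows):
    """Insert rows, skipping any whose primary key is already present."""
    with connection:  # commits on success, rolls back if an exception is raised
        connection.executemany(
            "INSERT OR IGNORE INTO production (id, status) VALUES (?, ?)", rows
        )


load(conn, batch)
load(conn, batch)  # simulated rerun after a failure
row_count = conn.execute("SELECT COUNT(*) FROM production").fetchone()[0]

# All-or-nothing write: the second row violates the primary key,
# so the whole batch is rolled back, including the valid first row.
try:
    with conn:
        conn.executemany(
            "INSERT INTO production (id, status) VALUES (?, ?)",
            [(4, "ok"), (1, "duplicate")],
        )
except sqlite3.IntegrityError:
    pass

count_after_failed_batch = conn.execute(
    "SELECT COUNT(*) FROM production"
).fetchone()[0]
print(row_count, count_after_failed_batch)
```

The rerun adds no rows because the keys already exist, and the failed batch leaves the table exactly as it was before the write started.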

Run the following cell to create the lab environment:

In [1]:
#MODE = "LOCAL"
MODE = "CLUSTER"

import sys
from pyspark.sql import SparkSession
from pyspark import SparkConf
import os
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel
from matplotlib import interactive
interactive(True)
import matplotlib.pyplot as plt
%matplotlib inline
import json
import math
import numbers
import numpy as np
import plotly
import uuid
import time
plotly.offline.init_notebook_mode(connected=True)

sys.path.insert(0,'../../src')
from settings import *

# Download the packaged Python environment only if it is not already present
if not os.path.exists('../../libs/pyspark24_py36.zip'):
    !aws s3 cp s3://devops.intellinum.co/bins/pyspark24_py36.zip ../../libs/pyspark24_py36.zip

try:
    spark.stop()
    print("Stopped a SparkSession")
except Exception as e:
    print("No existing SparkSession detected")
    print("Creating a new SparkSession")

SPARK_DRIVER_MEMORY= "1G"
SPARK_DRIVER_CORE = "1"
SPARK_EXECUTOR_MEMORY= "1G"
SPARK_EXECUTOR_CORE = "1"
SPARK_EXECUTOR_INSTANCES = 12



conf = None
if MODE == "LOCAL":
    os.environ["PYSPARK_PYTHON"] = "/home/yuan/anaconda3/envs/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_etl_13-failure-recovery").\
            setMaster('local[*]').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars', '../../libs/mysql-connector-java-5.1.45-bin.jar').\
            set('spark.jars.packages','net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1')
else:
    os.environ["PYSPARK_PYTHON"] = "./MN/pyspark24_py36/bin/python"
    conf = SparkConf().\
            setAppName("pyspark_etl_13-failure-recovery").\
            setMaster('yarn-client').\
            set('spark.executor.cores', SPARK_EXECUTOR_CORE).\
            set('spark.executor.memory', SPARK_EXECUTOR_MEMORY).\
            set('spark.driver.cores', SPARK_DRIVER_CORE).\
            set('spark.driver.memory', SPARK_DRIVER_MEMORY).\
            set("spark.executor.instances", SPARK_EXECUTOR_INSTANCES).\
            set('spark.sql.files.ignoreCorruptFiles', 'true').\
            set('spark.yarn.dist.archives', '../../libs/pyspark24_py36.zip#MN').\
            set('spark.driver.maxResultSize', '0').\
            set('spark.jars.packages','io.delta:delta-core_2.11:0.2.0,org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.2,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.2,net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1'). \
            set('spark.jars', 's3://devops.intellinum.co/bins/mysql-connector-java-5.1.45-bin.jar')
        

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()


sc = spark.sparkContext

sc.addPyFile('../../src/settings.py')

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

def display(df, limit=10):
    try:
        # For spark-core 
        result = df.limit(limit).toPandas()
    except Exception as e:
        # For structured-streaming
        stream_name = str(uuid.uuid1()).replace("-","")
        query = (
          df
            .writeStream
            .format("memory")        # memory sink: keeps results in an in-memory table (for debugging only)
            .queryName(stream_name)  # name of the in-memory table
            .trigger(processingTime='1 seconds')  # process a micro-batch every second
            .outputMode("append")    # only emit rows added since the last trigger
            .start()
        )
        while query.isActive:
            time.sleep(1)
            result = spark.sql(f"select * from {stream_name} limit {limit}").toPandas()
            print("Wait until the stream is ready...")
            if not result.empty:
                break
        result = spark.sql(f"select * from {stream_name} limit {limit}").toPandas()
    
    return result

def untilStreamIsReady(name):
    queries = list(filter(lambda query: query.name == name, spark.streams.active))

    if len(queries) == 0:
        print("The stream is not active.")

    else:
        while (queries[0].isActive and len(queries[0].recentProgress) == 0):
            time.sleep(0.1) # wait until there is any type of progress, without spinning the CPU

        if queries[0].isActive:
            queries[0].awaitTermination(5)
            print("The stream is active and ready.")
        else:
            print("The stream is not active.")
            
            
def dfTest(id, expected, result):
    assert str(expected) == str(result), "{} does not equal expected {}".format(result, expected)

No existing SparkSession detected
Creating a new SparkSession


### One Idempotent Strategy: Left Antijoin

In traditional ETL, a job recovery strategy for the case where only partial data was written to the database would look something like the following:

```
begin transaction;
  delete from production_table where batch_period = failed_batch_period;
  insert into production_table select * from staging_table;
  drop table staging_table;  
end transaction;
```

This won't work in a Spark environment because Spark DataFrames are immutable: rows cannot be deleted or updated in place.  One alternative strategy among the several listed above relies on a left antijoin, which returns all data in the left table that doesn't exist in the right table.

Run the following cell to create a mock production and staging table. It creates a staging table from Parquet files that contain log records, and then a production table that holds only about 20 percent of the records from staging.

In [2]:
from pyspark.sql.functions import col 

staging_table = (spark.read.parquet("s3a://data.intellinum.co/bootcamp/common/EDGAR-Log-20170329/enhanced/EDGAR-Log-20170329-sample.parquet/")
  .dropDuplicates(['ip', 'date', 'time']))

production_table = staging_table.sample(.2, seed=123)

Run the following cell to see that `production_table` has only about 20% of the data from `staging_table`.

In [3]:
production_table.count() / staging_table.count()

0.2002676166946979

Join the two tables using a left antijoin.

In [4]:
failedDF = staging_table.join(production_table, on=["ip", "date", "time"], how="left_anti")

Union `production_table` with the results from the left antijoin.

Append operations are generally not idempotent as they can result in duplicate records.  Streaming operations that maintain state and append to an always up-to-date parquet or Delta table are idempotent.

In [None]:
fullDF = production_table.union(failedDF)

The two tables now contain the same number of records.

In [None]:
staging_table.count() == fullDF.count()
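
To see why this recovery is idempotent, here is a plain-Python sketch of the same antijoin-then-union logic on lists of dictionaries. The helpers `left_anti` and `recover` and the sample rows are hypothetical and mimic the DataFrame operations above; like the lab, the sketch assumes staging has already been deduplicated on `ip`, `date`, and `time`.

```python
KEYS = ("ip", "date", "time")


def key_of(row):
    """Build the join key for a record."""
    return tuple(row[k] for k in KEYS)


def left_anti(left, right):
    """Return the rows of `left` whose key does not appear in `right`."""
    right_keys = {key_of(row) for row in right}
    return [row for row in left if key_of(row) not in right_keys]


def recover(production, staging):
    """Union production with the staging rows it is still missing."""
    return production + left_anti(staging, production)


# Made-up sample records, unique on the join key
staging = [
    {"ip": "10.0.0.1", "date": "2017-03-29", "time": "00:00:00"},
    {"ip": "10.0.0.2", "date": "2017-03-29", "time": "00:00:01"},
    {"ip": "10.0.0.3", "date": "2017-03-29", "time": "00:00:02"},
]
production = staging[:1]  # simulate a partial load

first_run = recover(production, staging)
second_run = recover(first_run, staging)  # rerunning the recovery adds nothing
print(len(first_run), first_run == second_run)
```

Because the second antijoin finds no missing keys, the recovery can be rerun safely after any failure.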

## Review
**Question:** What is idempotence?  
**Answer:** For ETL jobs, idempotence is the ability to run the same job multiple times without getting duplicate data.  This is the primary axiom for ensuring that ETL workloads do not have any unexpected behavior.

**Question:** How can I accomplish idempotence in Spark jobs?  
**Answer:** There are a number of strategies for accomplishing this.  Doing an antijoin of your full data on already loaded data is one method.  This can take the form of an incremental update script that runs in the case of job failure.  By counting the records at the beginning and end of a job, you can detect whether any unexpected behavior calls for the use of this incremental script.

**Question:** How can I detect job failure?  
**Answer:** This depends largely on the pipeline you're creating.  One common best practice is to have a monitoring job that periodically checks jobs for failure.  This can be tied to email or other alerting mechanisms.

&copy; 2019 [Intellinum Analytics, Inc](http://www.intellinum.co). All rights reserved.<br/>