
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Corrupt Record Handling

Apache Spark&trade; and Databricks&reg; provide ways to handle corrupt records.

## In this lesson you:
* Define corruption logic to handle corrupt records
* Pipe corrupt records into a directory for later analysis

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Software Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Concept (optional): <a href="https://academy.databricks.com/collections/frontpage/products/dataframes" target="_blank">DataFrames course from Databricks Academy</a>

<iframe  
src="//fast.wistia.net/embed/iframe/4m6z26iu8h?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/4m6z26iu8h?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

## Working with Corrupt Data

ETL pipelines need robust solutions to handle corrupt data because the amount of corruption grows with the size of the data and the complexity of the data application. Corrupt data includes:  
<br>
* Missing information
* Incomplete information
* Schema mismatch
* Differing formats or data types
* User errors when writing data producers

Since ETL pipelines are built to be automated, production-oriented solutions must ensure pipelines behave as expected. This means that **data engineers must both expect and systematically handle corrupt records.**

In the road map for ETL, this is the **Handle Corrupt Records** step:
<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-3.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

<iframe  
src="//fast.wistia.net/embed/iframe/5y70n1k6vz?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/5y70n1k6vz?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [7]:
%run "./Includes/Classroom-Setup"

Run the following cell, which contains a corrupt record, `{"a": 1, "b, "c":10}`:

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This is not the preferred way to make a DataFrame.  This code allows us to mimic a corrupt record you might see in production.

In [9]:
data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

corruptDF = (spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json(sc.parallelize(data))
)

display(corruptDF)

_corrupt_record,a,b,c
,1.0,2.0,3.0
,1.0,2.0,3.0
"{""a"": 1, ""b, ""c"":10}",,,


In the previous results, Spark placed the raw text of the corrupt record in the `_corrupt_record` column and parsed the other records as expected. This is the default behavior for corrupt records, so you didn't technically need to set the two options `mode` and `columnNameOfCorruptRecord`.

There are three options for handling corrupt records, set through the `mode` option [and defined by `ParseMode`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L34):

| `ParseMode` | Behavior |
|-------------|----------|
| `PERMISSIVE` | Includes corrupt records in a "_corrupt_record" column (by default) |
| `DROPMALFORMED` | Ignores all corrupted records |
| `FAILFAST` | Throws an exception when it meets corrupted records |
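
The three modes in the table can be illustrated without a cluster. The following is a plain-Python sketch that only mimics the described behavior using the standard `json` module; it is not Spark's implementation, and the function name `parse_records` is hypothetical.

```python
import json

VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}

def parse_records(lines, mode="PERMISSIVE", corrupt_col="_corrupt_record"):
    """Mimic the three parse modes on a list of JSON strings (illustration only)."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            if mode == "FAILFAST":
                # Stop at the first malformed record
                raise ValueError(f"Malformed record: {line}") from err
            if mode == "PERMISSIVE":
                # Keep the raw text of the bad record in its own column
                rows.append({corrupt_col: line})
            # DROPMALFORMED: silently skip the record
            continue
        if mode == "PERMISSIVE":
            # Well-formed rows carry an empty corrupt-record column
            row[corrupt_col] = None
        rows.append(row)
    return rows

data = '{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}'.split("|")
permissive = parse_records(data, "PERMISSIVE")
dropped = parse_records(data, "DROPMALFORMED")
```

Here `permissive` holds three rows, the last of which contains only the raw corrupt text, while `dropped` holds the two well-formed rows. Calling `parse_records(data, "FAILFAST")` raises on the third record.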

The following cell acts on the same data but drops corrupt records:

In [11]:
data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

corruptDF = (spark.read
  .option("mode", "DROPMALFORMED")
  .json(sc.parallelize(data))
)
display(corruptDF)

a,b,c
1,2,3
1,2,3


The following cell throws an error once a corrupt record is found, rather than ignoring or saving the corrupt records:

In [13]:
try:
  data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

  corruptDF = (spark.read
    .option("mode", "FAILFAST")
    .json(sc.parallelize(data))
  )
  display(corruptDF)
  
except Exception as e:
  print(e)

### Recommended Pattern: `badRecordsPath`

Databricks Runtime has [a built-in feature](https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html) that saves corrupt records to a given location. To use it, set the `badRecordsPath` option on the reader.

This is a preferred design pattern since it persists the corrupt records for later analysis even after the cluster shuts down.
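
The idea behind this pattern can be sketched in plain Python: good records flow on, while bad records are written to durable storage together with the reason they failed. This is only an illustration under assumptions, not how Databricks implements `badRecordsPath`; the function name `load_with_bad_records_path`, the file name `bad_records.json`, and the use of a local temp directory are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

def load_with_bad_records_path(lines, bad_records_dir):
    """Return parsed rows and append unparseable lines to a file under bad_records_dir."""
    good_rows = []
    bad_file = Path(bad_records_dir) / "bad_records.json"  # hypothetical file name
    bad_file.parent.mkdir(parents=True, exist_ok=True)
    with bad_file.open("a", encoding="utf-8") as sink:
        for line in lines:
            try:
                good_rows.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Persist the failure reason next to the raw record for later analysis
                sink.write(json.dumps({"reason": str(err), "record": line}) + "\n")
    return good_rows, bad_file

data = '{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}'.split("|")
tmp_dir = tempfile.mkdtemp()
rows, bad_file = load_with_bad_records_path(data, tmp_dir)

# Read the persisted bad records back, one JSON object per line
saved = [json.loads(text) for text in bad_file.read_text(encoding="utf-8").splitlines()]
```

Because the bad records live in a file rather than in memory, they remain available for inspection after the process that found them has ended.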

<iframe  
src="//fast.wistia.net/embed/iframe/4t8m5hbwp8?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/4t8m5hbwp8?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Before getting started, we need some place to store bad records.

We can use our `workingDir` to create a temporary folder for this purpose.

From there we can safely create temp files without any collision from other users.

**Note:** We defined `workingDir` for you in the call to `Classroom-Setup`.

In [17]:
myBadRecords = f"{workingDir}/badRecordsPath"

print(f"""Your temp directory is "{myBadRecords}" """)

And now let's put your `myBadRecords` to work:

In [19]:
data = """{"a": 1, "b":2, "c":3}|{"a": 1, "b":2, "c":3}|{"a": 1, "b, "c":10}""".split('|')

corruptDF = (spark.read
  .option("badRecordsPath", myBadRecords)
  .json(sc.parallelize(data))
)
display(corruptDF)

a,b,c
1,2,3
1,2,3


See the results in the path specified by `myBadRecords`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Recall that this directory is backed by S3 and is available to all clusters.

In [21]:
path = "{}/*/*/*".format(myBadRecords)
display(spark.read.json(path))

reason,record
"com.fasterxml.jackson.core.JsonParseException: Unexpected character ('c' (code 99)): was expecting a colon to separate field name and value  at [Source: {""a"": 1, ""b, ""c"":10}; line: 1, column: 16]","{""a"": 1, ""b, ""c"":10}"


## Exercise 1: Working with Corrupt Records

### Step 1: Diagnose the Problem

Import the data used in the last lesson, which is located at `/mnt/training/UbiqLog4UCI/14_F/log*`.  Import the corrupt records into a new column `SMSCorrupt`.  <br>

Save only the columns `SMS` and `SMSCorrupt` to the new DataFrame `SMSCorruptDF`.

In [24]:
from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import col

# Read SMS as a raw string and capture unparseable lines in SMSCorrupt
schema1 = StructType().add("SMS", StringType()).add("SMSCorrupt", StringType())

data = "/mnt/training/UbiqLog4UCI/14_F/log*"
SMSCorruptDF = (spark.read
  .schema(schema1)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "SMSCorrupt")
  .json(data)
  .filter(col("SMSCorrupt").isNotNull())  # keep only the corrupt records
)
display(SMSCorruptDF)

SMS,SMSCorrupt
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:14"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:20"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:30"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:26"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:59"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:59"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:53:26"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"
,"{""SMS"":{""Address"":""+98912503####"",""type"":""1"",""date"":""12-24-2013 11:52:30"",""body"":""ANONYMIZED"",""Type"":""1"", ""metadata"": {""name"": ""mr Khojasteh""flash""""}}}"


In [25]:
# TEST - Run this cell to test your solution
cols = set(SMSCorruptDF.columns)
SMSCount = SMSCorruptDF.cache().count()

dbTest("ET1-P-06-01-01", True, "SMS" in cols)
dbTest("ET1-P-06-01-02", True, "SMSCorrupt" in cols)
dbTest("ET1-P-06-01-03", 8, SMSCount)

print("Tests passed!")

Examine the corrupt records to determine what the problem is with the bad records.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Take a look at the name in metadata.

The entry `{"name": "mr Khojasteh"flash""}` should have single quotes around `flash` since the double quotes are interpreted as the end of the value.  It should read `{"name": "mr Khojasteh'flash'"}` instead.

The optimal solution is to fix the initial producer of the data to correct the problem at its source.  In the meantime, you could write ad hoc logic to turn this into a readable field.
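
The ad hoc repair mentioned above can be sketched in plain Python. This is a minimal example that assumes the only defect is a double-quoted word at the end of the `name` value, as in the records shown earlier; the function name `repair_name_quotes` and the regular expression are illustrative, and real data would likely need more careful handling.

```python
import json
import re

# Matches: "name": "<text>"<word>""  (an unescaped quoted word ending the name value)
NESTED_QUOTES = re.compile(r'("name":\s*")([^"]*)"([^"]*)""')

def repair_name_quotes(raw):
    """Replace the inner double quotes in the name value with single quotes."""
    return NESTED_QUOTES.sub(
        lambda m: f"{m.group(1)}{m.group(2)}'{m.group(3)}'\"", raw
    )

# One corrupt record copied from the output above
raw = ('{"SMS":{"Address":"+98912503####","type":"1","date":"12-24-2013 11:52:14",'
       '"body":"ANONYMIZED","Type":"1", "metadata": {"name": "mr Khojasteh"flash""}}}')

fixed = repair_name_quotes(raw)
parsed = json.loads(fixed)  # parses once the quotes are repaired
name = parsed["SMS"]["metadata"]["name"]
```

After the repair, `name` holds `mr Khojasteh'flash'`. Records that do not match the pattern pass through unchanged, so the function is safe to apply to every line.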

### Step 2: Use `badRecordsPath`

Use the `badRecordsPath` option to save corrupt records to the directory specified by the `corruptPath` variable below.

In [29]:
# TODO

corruptPath = f"{workingDir}/corruptSMS"
data="/mnt/training/UbiqLog4UCI/14_F/log*"
SMSCorruptDF2 = (spark.read
  .option("badRecordsPath", corruptPath)
  .json(data)
)
display(SMSCorruptDF2)

Application,Bluetooth,Call,Location,SMS,WiFi
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:56:44 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -85, Thursday, January 9, 2014 11:56:44 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -86, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -86, Thursday, January 9, 2014 11:57:19 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:57:44 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -85, Thursday, January 9, 2014 11:57:44 PM Iran Standard Time)"
,,,,,"List(74:ea:3a:a2:e1:b4, Saba, [WPA-PSK-TKIP], 2437, -88, Thursday, January 9, 2014 11:58:19 PM Iran Standard Time)"
,,,,,"List(90:94:e4:83:57:83, Milad, [WPA-PSK-TKIP+CCMP][WPA2-PSK-TKIP+CCMP][WPS], 2472, -87, Thursday, January 9, 2014 11:58:19 PM Iran Standard Time)"


In [30]:
# TEST - Run this cell to test your solution
SMSCorruptDF2.count()

testPath = f"{workingDir}/corruptSMS/*/*/*"
corruptCount = spark.read.json(testPath).count()

dbTest("ET1-P-06-02-01", True, corruptCount >= 8)

print("Tests passed!")

## Review
**Question:** By default, how are corrupt records dealt with using `spark.read.json()`?  
**Answer:** They appear in a column called `_corrupt_record`.

**Question:** How can a query persist corrupt records in a separate destination?  
**Answer:** The Databricks feature `badRecordsPath` allows a query to save corrupt records to a given location so the pipeline engineer can investigate corruption issues.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [33]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Loading Data and Productionalizing]($./07-Loading-Data-and-Productionalizing).

## Additional Topics & Resources

**Q:** Where can I get more information on dealing with corrupt records?  
**A:** Check out the Spark Summit talk <a href="https://databricks.com/session/exceptions-are-the-norm-dealing-with-bad-actors-in-etl" target="_blank">Exceptions are the Norm: Dealing with Bad Actors in ETL</a>.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>