## Permissive Mode [DEFAULT]

In [1]:
import sys
import os

In [2]:
os.environ.get('JAVA_HOME')

'C:\\Program Files\\Java\\jdk1.8.0_311'

In [3]:
import findspark
findspark.init()

In [4]:
from pyspark.sql import SparkSession
import numpy as np

In [5]:
spark = SparkSession.builder.master("local[*]").appName("PermissiveCorruptRec").getOrCreate()

In [6]:
df = spark.read \
    .schema("id integer, name string, join_date date, salary integer, _corrupt_record string") \
    .option("header", True) \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .option("dateFormat", "dd.MM.yyyy") \
    .csv("data/corruptRecords.csv")

In [7]:
df.show(truncate=False)

+---+-----+----------+------+-----------------------+
|id |name |join_date |salary|_corrupt_record        |
+---+-----+----------+------+-----------------------+
|1  |John |2019-12-10|150000|null                   |
|2  |Adam |2019-04-10|50000 |null                   |
|3  |Sam  |2019-03-13|90000 |null                   |
|4  |Karen|2019-03-14|null  |4,Karen,14.03.2019,100K|
+---+-----+----------+------+-----------------------+



In [8]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- join_date: date (nullable = true)
 |-- salary: integer (nullable = true)
 |-- _corrupt_record: string (nullable = true)



In [9]:
from pyspark.sql.functions import col

In [10]:
corrupt_rec_df = df.filter(col("_corrupt_record").isNotNull())
corrupt_rec_df.show()

+---+-----+----------+------+--------------------+
| id| name| join_date|salary|     _corrupt_record|
+---+-----+----------+------+--------------------+
|  4|Karen|2019-03-14|  null|4,Karen,14.03.201...|
+---+-----+----------+------+--------------------+



In [11]:
corrupt_rec_df.select(col("_corrupt_record")).show(truncate=False)

+---------------+
|_corrupt_record|
+---------------+
+---------------+



Spark evaluates lazily: no data is read until an action such as show() runs. At that point the optimizer sees that the query built from filter() and select() references only "_corrupt_record", so it pushes that projection down to the CSV reader. "_corrupt_record" is not a column in the file; Spark fills it in only when parsing one of the requested data columns fails. With every data column pruned away, the bad salary value is never parsed, "_corrupt_record" stays null for all rows, and the filter returns nothing. That is why the output above is empty.
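
To make the pruning effect concrete, here is a minimal pure-Python sketch. It is not Spark's implementation: `parse_permissive`, `corrupt_records`, and the `CONVERTERS` table are hypothetical names, and the sample lines are reconstructed from the outputs above because the file itself is not shown. The only point it illustrates is that a row can be flagged as corrupt only if a column that fails to convert is actually requested.

```python
from datetime import datetime

# Sample lines reconstructed from the outputs above (assumption: the real
# file is not shown in this notebook).
LINES = [
    "1,John,10.12.2019,150000",
    "2,Adam,10.04.2019,50000",
    "3,Sam,13.03.2019,90000",
    "4,Karen,14.03.2019,100K",
]

# Column name -> converter, mirroring the schema passed to spark.read.
CONVERTERS = {
    "id": int,
    "name": str,
    "join_date": lambda s: datetime.strptime(s, "%d.%m.%Y").date(),
    "salary": int,
}
CORRUPT = "_corrupt_record"


def parse_permissive(line, required):
    """Convert only the data columns listed in `required` (column pruning)."""
    raw = dict(zip(CONVERTERS, line.split(",")))
    row = {}
    malformed = False
    for name in required:
        if name == CORRUPT:
            continue
        try:
            row[name] = CONVERTERS[name](raw[name])
        except ValueError:
            # A failed conversion nulls the field and marks the row.
            row[name] = None
            malformed = True
    if CORRUPT in required:
        # The raw line is kept only when a requested column failed to parse.
        row[CORRUPT] = line if malformed else None
    return row


def corrupt_records(required):
    """Return the non-null corrupt-record values for the requested columns."""
    rows = [parse_permissive(line, required) for line in LINES]
    return [r[CORRUPT] for r in rows if r[CORRUPT] is not None]


all_columns = list(CONVERTERS) + [CORRUPT]
print(corrupt_records([CORRUPT]))    # pruned read: salary is never converted
print(corrupt_records(all_columns))  # full parse: the bad salary is detected
```

The first call returns an empty list because no data column is ever converted; the second returns the raw line for row 4, which matches what the notebook shows before and after caching.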

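Caching corrupt_rec_df fixes this. The cached plan reads all five columns, so the rows are fully parsed once, and later queries select from the in-memory result instead of re-reading the file with a pruned schema.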

In [12]:
# CACHE IS NEEDED TO SEE THE _corrupt_record COLUMN
corrupt_rec_df.cache()

DataFrame[id: int, name: string, join_date: date, salary: int, _corrupt_record: string]

In [13]:
corrupt_rec_df.select(col("_corrupt_record")).show(truncate=False)

+-----------------------+
|_corrupt_record        |
+-----------------------+
|4,Karen,14.03.2019,100K|
+-----------------------+



## DROP MALFORMED ROWS

In [16]:
emp_df = spark.read \
    .schema("id integer, name string, join_date date, salary integer") \
    .option("header", True) \
    .option("mode", "DROPMALFORMED") \
    .option("dateFormat", "dd.MM.yyyy") \
    .csv("data/corruptRecords.csv")

In [17]:
emp_df.show(truncate=False)

+---+----+----------+------+
|id |name|join_date |salary|
+---+----+----------+------+
|1  |John|2019-12-10|150000|
|2  |Adam|2019-04-10|50000 |
|3  |Sam |2019-03-13|90000 |
+---+----+----------+------+



## FAILFAST MODE

In [20]:
emp_df = spark.read \
    .schema("id integer, name string, join_date date, salary integer") \
    .option("header", True) \
    .option("mode", "FAILFAST") \
    .option("dateFormat", "dd.MM.yyyy") \
    .csv("data/corruptRecords.csv")

In [22]:
try:
    emp_df.show(truncate=False)
except Exception as e:
    print(e)

An error occurred while calling o109.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8) (DESKTOP-SSC2TF1 executor driver): org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING] Malformed records are detected in record parsing: [4,Karen,17969,null].
Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.malformedRecordsDetectedInRecordParsingError(QueryExecutionErrors.scala:1764)
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:69)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.$anonfun$parseIterator$2(UnivocityParser.scala:456)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at scala.collection.Iterator$$anon$
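
The three modes shown so far differ only in what happens to a row that fails to parse. The pure-Python sketch below mimics that behavior on the same four rows. It is an illustration, not Spark's parser: `read_rows`, `SCHEMA`, and `parse_date` are hypothetical names, and the CSV text is reconstructed from the outputs above.

```python
from datetime import datetime

# File content reconstructed from the outputs above (assumption: the real
# file is not shown in this notebook).
CSV_TEXT = """id,name,join_date,salary
1,John,10.12.2019,150000
2,Adam,10.04.2019,50000
3,Sam,13.03.2019,90000
4,Karen,14.03.2019,100K
"""


def parse_date(value):
    """Parse a dd.MM.yyyy date string."""
    return datetime.strptime(value, "%d.%m.%Y").date()


# (column name, converter) pairs mirroring the schema used in the notebook.
SCHEMA = [("id", int), ("name", str), ("join_date", parse_date), ("salary", int)]
MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def read_rows(text, mode="PERMISSIVE"):
    """Parse CSV text, handling malformed rows according to `mode`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        values, malformed = [], False
        for raw, (_, convert) in zip(line.split(","), SCHEMA):
            try:
                values.append(convert(raw))
            except ValueError:
                values.append(None)
                malformed = True
        if malformed and mode == "FAILFAST":
            raise ValueError(f"malformed record: {line}")
        if malformed and mode == "DROPMALFORMED":
            continue
        if mode == "PERMISSIVE":
            # Keep the row and carry the raw line in an extra field.
            values.append(line if malformed else None)
        rows.append(tuple(values))
    return rows


for mode in ("PERMISSIVE", "DROPMALFORMED"):
    print(mode, len(read_rows(CSV_TEXT, mode)))
try:
    read_rows(CSV_TEXT, "FAILFAST")
except ValueError as err:
    print("FAILFAST", err)
```

PERMISSIVE keeps all four rows and attaches the raw line to the bad one, DROPMALFORMED silently returns three rows, and FAILFAST raises on the first bad row.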

## Redirect Bad Records to a file

In [18]:
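# NOTE: badRecordsPath is documented as a Databricks option. Open-source Spark
# appears to ignore it, so this read falls back to the default PERMISSIVE mode
# and row 4 still appears in the output with a null salary.
df = spark.read \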
    .schema("id integer, name string, join_date date, salary integer") \
    .option("header", True) \
    .option("badRecordsPath", "data") \
    .option("dateFormat", "dd.MM.yyyy") \
    .csv("data/corruptRecords.csv")

In [19]:
df.show(truncate=False)

+---+-----+----------+------+
|id |name |join_date |salary|
+---+-----+----------+------+
|1  |John |2019-12-10|150000|
|2  |Adam |2019-04-10|50000 |
|3  |Sam  |2019-03-13|90000 |
|4  |Karen|2019-03-14|null  |
+---+-----+----------+------+

