## Handling Corrupted Records

**Potential Interview Questions**
- Have you worked with corrupted records?
- When do you say that it's a corrupted record?
- What happens when we encounter corrupted records in different read modes?
- How can we print bad records?
- Where do you store corrupted records, and how can we access them later?

#### When do you say that it's a corrupted record?
When working with large datasets, corrupted records can disrupt data processing workflows.
A record is considered corrupted when:
- It does not conform to the specified schema.
- It has missing fields or malformed values (for example, text in a numeric column).
- The data format is invalid (e.g., unparseable JSON, incorrect CSV delimiters).

In PySpark's permissive mode, the raw text of a corrupted record is captured in the `_corrupt_record` column, provided the schema includes that column.
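To make the criteria above concrete, here is a minimal pure-Python sketch of a schema check on a single CSV line. It is only an illustration of the idea, not how Spark parses files; `SCHEMA` and `corruption_reason` are hypothetical names, the first two sample lines are reconstructed from the tables in this section, and the third line is invented to show a type mismatch.

```python
import csv

# Expected columns and the Python type each value must parse as (assumed schema).
SCHEMA = [("id", int), ("name", str), ("age", int),
          ("salary", float), ("address", str), ("nominee", str)]

def corruption_reason(line):
    """Return why `line` is corrupted, or None if it matches SCHEMA."""
    fields = next(csv.reader([line]))
    # Check 1: the number of fields must match the number of columns.
    if len(fields) != len(SCHEMA):
        return f"expected {len(SCHEMA)} fields, got {len(fields)}"
    # Check 2: every non-empty value must parse as the declared type.
    for (name, cast), value in zip(SCHEMA, fields):
        if value == "":
            continue  # an empty field is treated as null here, not as corruption
        try:
            cast(value)
        except ValueError:
            return f"column {name!r}: cannot parse {value!r} as {cast.__name__}"
    return None

print(corruption_reason("1,Manish,26,75000,bihar,nominee1"))
print(corruption_reason("3,Pritam,22,150000,Bangalore,India,nominee3"))
print(corruption_reason("6,Asha,abc,1000,Pune,nominee6"))  # hypothetical row
```

The second line fails the field-count check because it carries seven values for six columns, which is the kind of corruption in the example file used in this section.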

#### What happens when we encounter corrupted records in different read modes?

**PERMISSIVE (default):** Corrupted rows are kept, and their raw text is captured in a special `_corrupt_record` column when the schema defines one.

**DROPMALFORMED:** Rows with corrupted records are dropped.

**FAILFAST:** An exception is thrown as soon as a corrupted record is encountered, and the read fails.
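The difference between the three modes can be sketched with a toy reader in plain Python. This is a simplified model of the behavior described above, not Spark's implementation: `read_lines`, `HEADER`, and `SAMPLE` are hypothetical names, a row counts as corrupted only when its field count differs from the header, and the sample lines are reconstructed from the tables in this section.

```python
import csv

HEADER = ["id", "name", "age", "salary", "address", "nominee"]

def read_lines(lines, mode="permissive"):
    """Toy model of the three read modes for rows with a wrong field count."""
    mode = mode.lower()
    if mode not in {"permissive", "dropmalformed", "failfast"}:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        fields = next(csv.reader([line]))
        if len(fields) == len(HEADER):
            rows.append(dict(zip(HEADER, fields), _corrupt_record=None))
        elif mode == "permissive":
            # Keep the row: extra values are cut off, missing ones become None,
            # and the raw line is preserved for later inspection.
            row = dict(zip(HEADER, fields))
            for name in HEADER:
                row.setdefault(name, None)
            row["_corrupt_record"] = line
            rows.append(row)
        elif mode == "dropmalformed":
            continue  # silently skip the corrupted row
        else:  # failfast
            raise ValueError(f"Malformed record: {line}")
    return rows

SAMPLE = [
    "1,Manish,26,75000,bihar,nominee1",
    "3,Pritam,22,150000,Bangalore,India,nominee3",  # seven values, six columns
    "2,Nikita,23,100000,uttarpradesh,nominee2",
]

print(len(read_lines(SAMPLE, "permissive")))     # corrupted row is kept
print(len(read_lines(SAMPLE, "dropmalformed")))  # corrupted row is dropped
```

Calling `read_lines(SAMPLE, "failfast")` raises a `ValueError` at the second line, which mirrors how a FAILFAST read aborts instead of returning partial data.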

##### Example with actual data:
In the CSV file below, two records are corrupted: ids 3 and 4. Each has seven comma-separated values for a six-column header.

In [0]:
%fs
head dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv

In [0]:
employee_df = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .option("mode", "permissive") \
            .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv")

employee_df.show()

# Records 3 and 4 are loaded with incorrect data: the nominee column holds "India" and the real nominee value is lost

+---+--------+---+------+------------+--------+
| id|    name|age|salary|     address| nominee|
+---+--------+---+------+------------+--------+
|  1|  Manish| 26| 75000|       bihar|nominee1|
|  2|  Nikita| 23|100000|uttarpradesh|nominee2|
|  3|  Pritam| 22|150000|   Bangalore|   India|
|  4|Prantosh| 17|200000|     Kolkata|   India|
|  5|  Vikash| 31|300000|        null|nominee5|
+---+--------+---+------+------------+--------+



**Let's run in DROPMALFORMED mode**

In [0]:
employee_df_1 = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .option("mode", "dropmalformed") \
            .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv")

employee_df_1.show()

# The rows with corrupted records (ids 3 and 4) are dropped

+---+------+---+------+------------+--------+
| id|  name|age|salary|     address| nominee|
+---+------+---+------+------------+--------+
|  1|Manish| 26| 75000|       bihar|nominee1|
|  2|Nikita| 23|100000|uttarpradesh|nominee2|
|  5|Vikash| 31|300000|        null|nominee5|
+---+------+---+------+------------+--------+



**Now let's run in FAILFAST mode**

In [0]:
employee_df_3 = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "true") \
            .option("mode", "failfast") \
            .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv")

employee_df_3.show()

# The read fails with an exception when a corrupted record is encountered

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-755740500396355>:7
      1 employee_df_3 = spark.read.format("csv") \
      2             .option("header", "true") \
      3             .option("inferSchema", "true") \
      4             .option("mode", "failfast") \
      5             .load("dbfs:/FileStore/shared_

#### How can we print bad records?
In permissive mode, you can print bad records by including a `_corrupt_record` column in a manually defined schema:


In [0]:
# Define the schema manually and include a _corrupt_record STRING column
schema = "id INT, name STRING, age INT, salary FLOAT, address STRING, nominee STRING, _corrupt_record STRING"

# inferSchema is omitted: it has no effect once an explicit schema is supplied
df = spark.read.format("csv") \
            .option("header", "true") \
            .option("mode", "permissive") \
            .schema(schema) \
            .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv")

df.show()

+---+--------+---+--------+------------+--------+--------------------+
| id|    name|age|  salary|     address| nominee|     _corrupt_record|
+---+--------+---+--------+------------+--------+--------------------+
|  1|  Manish| 26| 75000.0|       bihar|nominee1|                null|
|  2|  Nikita| 23|100000.0|uttarpradesh|nominee2|                null|
|  3|  Pritam| 22|150000.0|   Bangalore|   India|3,Pritam,22,15000...|
|  4|Prantosh| 17|200000.0|     Kolkata|   India|4,Prantosh,17,200...|
|  5|  Vikash| 31|300000.0|        null|nominee5|                null|
+---+--------+---+--------+------------+--------+--------------------+



In [0]:
# Use truncate=False to display the full text of each record
df.show(truncate=False)

+---+--------+---+--------+------------+--------+-------------------------------------------+
|id |name    |age|salary  |address     |nominee |_corrupt_record                            |
+---+--------+---+--------+------------+--------+-------------------------------------------+
|1  |Manish  |26 |75000.0 |bihar       |nominee1|null                                       |
|2  |Nikita  |23 |100000.0|uttarpradesh|nominee2|null                                       |
|3  |Pritam  |22 |150000.0|Bangalore   |India   |3,Pritam,22,150000,Bangalore,India,nominee3|
|4  |Prantosh|17 |200000.0|Kolkata     |India   |4,Prantosh,17,200000,Kolkata,India,nominee4|
|5  |Vikash  |31 |300000.0|null        |nominee5|null                                       |
+---+--------+---+--------+------------+--------+-------------------------------------------+



#### Where do you store corrupted records, and how can we access them later?

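Corrupted records can be stored using the `badRecordsPath` option. This option is a Databricks feature, so it may not be available in other Spark environments. Rows that fail to parse are left out of the resulting DataFrame and written to the given path instead.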

In [0]:
# Store bad records using the badRecordsPath option
schema = "id INT, name STRING, age INT, salary FLOAT, address STRING, nominee STRING, _corrupt_record STRING"

# inferSchema is omitted: it has no effect once an explicit schema is supplied
df = spark.read.format("csv") \
            .option("header", "true") \
            .schema(schema) \
            .option("badRecordsPath", "dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/bad_records/") \
            .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv")

df.show()

+---+------+---+--------+------------+--------+---------------+
| id|  name|age|  salary|     address| nominee|_corrupt_record|
+---+------+---+--------+------------+--------+---------------+
|  1|Manish| 26| 75000.0|       bihar|nominee1|           null|
|  2|Nikita| 23|100000.0|uttarpradesh|nominee2|           null|
|  5|Vikash| 31|300000.0|        null|nominee5|           null|
+---+------+---+--------+------------+--------+---------------+



In [0]:
%fs
ls dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/bad_records/20241205T111712/bad_records/

path,name,size,modificationTime
dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/bad_records/20241205T111712/bad_records/part-00000-a5ca3e7c-619b-46db-9ab9-ee2cb424778c,part-00000-a5ca3e7c-619b-46db-9ab9-ee2cb424778c,544,1733397435000


**Access:** The bad records are stored as JSON files under a timestamped subdirectory of the specified path (here `20241205T111712/bad_records/`). You can read them later like any other dataset:


In [0]:
bad_records_df = spark.read.format("json").load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/bad_records/20241205T111712/bad_records/")

bad_records_df.show()

+--------------------+--------------------+--------------------+
|                path|              reason|              record|
+--------------------+--------------------+--------------------+
|dbfs:/FileStore/s...|org.apache.spark....|3,Pritam,22,15000...|
|dbfs:/FileStore/s...|org.apache.spark....|4,Prantosh,17,200...|
+--------------------+--------------------+--------------------+
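Each stored entry carries the three fields shown above: `path`, `reason`, and `record`. As a rough sketch of how such entries could be summarized outside Spark, the snippet below counts bad records per bracketed error class in `reason`. It assumes one JSON object per line (which the successful `format("json")` read suggests but does not prove); `summarize_reasons` is a hypothetical helper, and the sample entry is built from values printed in this section's untruncated output.

```python
import json
import re
from collections import Counter

def summarize_reasons(json_lines):
    """Count bad records per bracketed error class found in `reason`."""
    counts = Counter()
    for line in json_lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        # Look for a token such as [MALFORMED_CSV_RECORD] inside the reason text.
        match = re.search(r"\[([A-Z0-9_]+)\]", entry.get("reason", ""))
        counts[match.group(1) if match else "UNKNOWN"] += 1
    return counts

# One sample entry, assembled from values shown in this section's outputs.
sample = json.dumps({
    "path": "dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv",
    "reason": "org.apache.spark.SparkRuntimeException: [MALFORMED_CSV_RECORD] "
              "Malformed CSV record: 3,Pritam,22,150000,Bangalore,India,nominee3",
    "record": "3,Pritam,22,150000,Bangalore,India,nominee3",
})

print(summarize_reasons([sample]))
```

A summary like this can help decide whether bad records share one cause, such as an unquoted comma, or come from several unrelated problems.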



In [0]:
# Use truncate=False to display the full text of each record
bad_records_df.show(truncate=False)

+-----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|path                                                             |reason                                                                                                                          |record                                     |
+-----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------+
|dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee.csv|org.apache.spark.SparkRuntimeException: [MALFORMED_CSV_RECORD] Malformed CSV record: 3,Pritam,22,150000,Bangalore,India,nominee3|3,Pritam,22,150000,Bangalore,India,nominee3|
|dbfs:/FileStore/shared_uploads/zade