In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.config('spark.driver.host', 'localhost').getOrCreate()

**Sample data for this program**\
File name: corrupt_data.csv

month,emp_count,production_unit,expense\
jan,244,657,30\
feb,54,876,32\
mar,37,673,82\
apr,72,721,72,test_msg\
may,29,weare,63
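
To reproduce the notebook, the sample file has to exist next to it. A minimal sketch for creating it with the standard library follows; the helper name `write_sample_csv` is an assumption made for illustration, not part of the original notebook.

```python
from pathlib import Path

# Sample rows copied from the listing above. The apr row carries an extra
# fifth field, and the may row has a non-numeric production_unit.
SAMPLE_ROWS = [
    "month,emp_count,production_unit,expense",
    "jan,244,657,30",
    "feb,54,876,32",
    "mar,37,673,82",
    "apr,72,721,72,test_msg",
    "may,29,weare,63",
]


def write_sample_csv(path="corrupt_data.csv"):
    """Write the sample rows to `path` (hypothetical helper) and return the path."""
    target = Path(path)
    target.write_text("\n".join(SAMPLE_ROWS) + "\n", encoding="utf-8")
    return target


csv_path = write_sample_csv()
print(csv_path.read_text(encoding="utf-8"))
```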

### **create sample dataframe**

In [4]:
df=spark.read.format('csv').options(header=True,sep=',',inferschema=True).load("corrupt_data.csv")

In [5]:
df.show()
# Note: the fifth value in the apr row (test_msg) does not appear in the output.

+-----+---------+---------------+-------+
|month|emp_count|production_unit|expense|
+-----+---------+---------------+-------+
|  jan|      244|            657|     30|
|  feb|       54|            876|     32|
|  mar|       37|            673|     82|
|  apr|       72|            721|     72|
|  may|       29|          weare|     63|
+-----+---------+---------------+-------+



**Define schema**

In [6]:
from pyspark.sql.types import StructType,StructField,IntegerType,StringType

In [7]:
schema = StructType([StructField("month",StringType(),True),
                    StructField("emp_count",IntegerType(),True),
                    StructField("production_unit",IntegerType(),True),
                    StructField("expense",IntegerType(),True),
                    StructField("_corrupt_record",StringType(),True)])

### **permissive mode**

In [8]:
# Define the schema for your CSV file
schema = StructType([
    StructField("month", StringType(), True),
    StructField("emp_count", IntegerType(), True),
    StructField("production_unit", IntegerType(), True),
    StructField("expense", IntegerType(), True),
    StructField("_corrupt_record",StringType(),True)])



In [9]:
# Read the CSV file with the specified options
df2 = spark.read.format('csv') \
    .option("mode", "PERMISSIVE") \
    .option("header", True) \
    .option("sep", ',') \
    .schema(schema) \
    .load("corrupt_data.csv")

# Show the DataFrame
df2.show(truncate=False)


+-----+---------+---------------+-------+----------------------+
|month|emp_count|production_unit|expense|_corrupt_record       |
+-----+---------+---------------+-------+----------------------+
|jan  |244      |657            |30     |null                  |
|feb  |54       |876            |32     |null                  |
|mar  |37       |673            |82     |null                  |
|apr  |72       |721            |72     |apr,72,721,72,test_msg|
|may  |29       |null           |63     |may,29,weare,63       |
+-----+---------+---------------+-------+----------------------+



In PERMISSIVE mode, a field whose value does not match the declared data type is shown as null, and the full raw record is stored in the _corrupt_record column.

The apr record has 5 values, but the schema defines only 4 data columns, so this record is corrupt.

Similarly, the may record has a string value in production_unit where the schema expects an integer, so it is also a corrupt record.
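
The two rules described above (wrong number of fields, or a value that cannot be cast to the declared type) can be modelled in plain Python. This is only a simplified sketch of the behaviour seen in the output, not Spark's actual parser; `SCHEMA` and `parse_permissive` are names invented for this illustration.

```python
# Simplified model of the four data columns: (name, caster).
SCHEMA = [
    ("month", str),
    ("emp_count", int),
    ("production_unit", int),
    ("expense", int),
]


def parse_permissive(line):
    """Parse one CSV line the way the PERMISSIVE output above behaves."""
    tokens = line.split(",")
    # Rule 1: a field count that differs from the schema marks the row corrupt.
    corrupt = len(tokens) != len(SCHEMA)
    row = {}
    for index, (name, caster) in enumerate(SCHEMA):
        try:
            row[name] = caster(tokens[index])
        except (IndexError, ValueError):
            # Rule 2: a value that cannot be cast becomes None (null).
            row[name] = None
            corrupt = True
    # The raw line is kept only for corrupt rows.
    row["_corrupt_record"] = line if corrupt else None
    return row


for line in ["mar,37,673,82", "apr,72,721,72,test_msg", "may,29,weare,63"]:
    print(parse_permissive(line))
```

In this model the mar row keeps all its values with an empty _corrupt_record, the apr row keeps its first four values plus the raw line, and the may row gets None for production_unit plus the raw line, which mirrors the table shown above.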

### **DropMalformed Mode**

In [10]:
# Define the schema for your CSV file
schema = StructType([
    StructField("month", StringType(), True),
    StructField("emp_count", IntegerType(), True),
    StructField("production_unit", IntegerType(), True),
    StructField("expense", IntegerType(), True),
    StructField("_corrupt_record",StringType(),True)])

In [11]:
df3=spark.read.format('csv').option("mode","DROPMALFORMED").option("header",True).option("sep",",").schema(schema).load("corrupt_data.csv")

In [12]:
df3.show()

+-----+---------+---------------+-------+---------------+
|month|emp_count|production_unit|expense|_corrupt_record|
+-----+---------+---------------+-------+---------------+
|  jan|      244|            657|     30|           null|
|  feb|       54|            876|     32|           null|
|  mar|       37|            673|     82|           null|
+-----+---------+---------------+-------+---------------+



As you can see, the apr and may records are not present: both were corrupt, and DROPMALFORMED mode drops corrupt records.
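
The dropping behaviour can be pictured as a filter over the raw lines. The sketch below is a plain-Python approximation under the same two corruption rules used in this notebook (field count and type), not Spark's implementation; `EXPECTED_TYPES` and `is_well_formed` are hypothetical names.

```python
# Expected type of each column, in order: month, emp_count, production_unit, expense.
EXPECTED_TYPES = [str, int, int, int]


def is_well_formed(line):
    """Return True if the line has the right field count and castable values."""
    tokens = line.split(",")
    if len(tokens) != len(EXPECTED_TYPES):
        return False
    try:
        for caster, token in zip(EXPECTED_TYPES, tokens):
            caster(token)
    except ValueError:
        return False
    return True


data_lines = [
    "jan,244,657,30",
    "feb,54,876,32",
    "mar,37,673,82",
    "apr,72,721,72,test_msg",
    "may,29,weare,63",
]

# Keep only the well-formed lines, as DROPMALFORMED does in the output above.
kept = [line for line in data_lines if is_well_formed(line)]
print(kept)
```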

### **FailFast Mode**

In [13]:
# Define the schema for your CSV file
schema = StructType([
    StructField("month", StringType(), True),
    StructField("emp_count", IntegerType(), True),
    StructField("production_unit", IntegerType(), True),
    StructField("expense", IntegerType(), True),
    StructField("_corrupt_record",StringType(),True)])

In [14]:
df4=spark.read.format('csv').option("mode","FAILFAST").option("header",True).option("sep",",").schema(schema).load("corrupt_data.csv")

df4.show() raises an error as soon as a corrupt record is encountered, so no rows are displayed at all.
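
As a rough analogy for FAILFAST, the sketch below uses a Python generator: defining the read does nothing, and the failure only appears once the rows are actually consumed, which is similar in spirit to calling an action such as show(). The function `read_failfast` and the use of `ValueError` are assumptions for illustration; Spark raises its own exception type.

```python
def read_failfast(lines):
    """Lazily yield parsed rows; raise on the first malformed line."""
    for line in lines:
        tokens = line.split(",")
        if len(tokens) != 4:
            raise ValueError(f"Malformed record: {line!r}")
        month, *numbers = tokens
        try:
            values = [int(number) for number in numbers]
        except ValueError:
            raise ValueError(f"Malformed record: {line!r}") from None
        yield (month, *values)


sample = [
    "jan,244,657,30",
    "feb,54,876,32",
    "mar,37,673,82",
    "apr,72,721,72,test_msg",
    "may,29,weare,63",
]

rows = read_failfast(sample)  # no error yet: nothing has been parsed
try:
    result = list(rows)       # consuming the generator triggers parsing
except ValueError as err:
    result = None             # the whole read is abandoned, no partial result
    print("read failed:", err)
```

Wrapping the action in try/except, as done here, is one way to handle such a failure explicitly instead of letting the notebook cell crash.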