In [1]:
from pyspark.sql import SparkSession

In [2]:
session = SparkSession.builder.appName('file options').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/03 11:09:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


#### Path Glob Filter

In [11]:
# Load only the files whose names match the glob pattern *.parquet
df = session.read.load('/opt/bitnami/spark/data',
                       format='parquet',
                       pathGlobFilter='*.parquet')

df.show()

+-------------+
|         file|
+-------------+
|         NULL|
|         NULL|
|file1.parquet|
|file2.parquet|
+-------------+
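
pathGlobFilter keeps only the files whose names match the given pattern, and it does not change how partitions are discovered. To preview which names a pattern would select before running a read, a rough stand-in can be written in plain Python. This is only a sketch: Spark follows the syntax of Hadoop's org.apache.hadoop.fs.GlobFilter, while the code below uses Python's fnmatch, which does not support every Hadoop construct (for example {a,b} alternation). The helper name filter_by_glob and the file list are hypothetical.

```python
from fnmatch import fnmatchcase


def filter_by_glob(file_names, pattern):
    """Keep only the names that match the glob pattern (case-sensitive)."""
    return [name for name in file_names if fnmatchcase(name, pattern)]


# Hypothetical directory listing, for illustration only.
names = ["file1.parquet", "file2.parquet", "people.json", "users.orc"]
print(filter_by_glob(names, "*.parquet"))
```

Only the two .parquet names pass the filter; the JSON and ORC names are dropped, which mirrors what the option does to the file listing before Spark reads anything.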



#### Recursive File Lookup

recursiveFileLookup loads files recursively from nested directories and disables partition inference.

Its default value is false. If the data source explicitly specifies a partitionSpec while recursiveFileLookup is true, Spark throws an exception.
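
As a plain-Python analogy (not Spark's implementation), the sketch below shows what a recursive listing collects: every file under the root, at any depth. Directory names such as year=2024 are treated as ordinary path components here, which is the spirit of partition inference being disabled. The helper names and the directory layout are hypothetical and exist only for this illustration.

```python
import os
import tempfile


def list_files_recursively(root):
    """Return the path of every file under root, relative to root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append(os.path.relpath(full_path, root))
    return sorted(found)


def build_demo_tree(root):
    """Create a hypothetical nested layout with partition-style directory names."""
    layout = (
        "top.parquet",
        os.path.join("year=2024", "a.parquet"),
        os.path.join("year=2025", "month=01", "b.parquet"),
    )
    for relative in layout:
        target = os.path.join(root, relative)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "wb").close()  # empty placeholder, not a real Parquet file


with tempfile.TemporaryDirectory() as demo_root:
    build_demo_tree(demo_root)
    for path in list_files_recursively(demo_root):
        print(path)
```

All three files are listed regardless of depth, and nothing is derived from the year= or month= directory names.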

In [18]:
df = (session.read.format('parquet')
      .option('recursiveFileLookup', 'true')
      .load('/opt/bitnami/spark/data/parquet/'))
df.show()

+-------------+
|         file|
+-------------+
|         NULL|
|         NULL|
|file1.parquet|
|file2.parquet|
+-------------+



#### Modification Time Path Filter

modifiedBefore and modifiedAfter can be applied together or separately to control more precisely which files are loaded during a Spark batch query. (Note that Structured Streaming file sources don’t support these options.)

1. modifiedBefore: an optional timestamp; only files modified before the specified time are included. The timestamp must be in the format YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).

2. modifiedAfter: an optional timestamp; only files modified after the specified time are included. The timestamp must be in the format YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
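
When the cutoff comes from a Python datetime rather than a hand-typed string, it can be formatted into the YYYY-MM-DDTHH:mm:ss shape described above. The snippet is a minimal sketch; the names FILTER_TS_FORMAT and to_filter_timestamp are hypothetical helpers, not part of PySpark.

```python
from datetime import datetime

# strftime equivalent of YYYY-MM-DDTHH:mm:ss
FILTER_TS_FORMAT = "%Y-%m-%dT%H:%M:%S"


def to_filter_timestamp(moment):
    """Format a datetime as a string usable for modifiedBefore / modifiedAfter."""
    return moment.strftime(FILTER_TS_FORMAT)


cutoff = to_filter_timestamp(datetime(2025, 3, 3, 6, 40, 0))
print(cutoff)  # 2025-03-03T06:40:00
```

The resulting string can then be passed as the value of either option.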


In [22]:
# Keep only files last modified before the cutoff
df = session.read.load('/opt/bitnami/spark/data/parquet/',
                       format='parquet',
                       modifiedBefore='2025-03-03T06:40:00')
df.show()

+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          NULL|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



In [23]:
# Keep only files last modified after the cutoff
df = session.read.load('/opt/bitnami/spark/data/parquet/',
                       format='parquet',
                       modifiedAfter='2025-03-03T06:40:00')
df.show()

+-------------+
|         file|
+-------------+
|file1.parquet|
|file2.parquet|
+-------------+



#### Ignore Corrupt Files

To skip corrupt files while reading, set either the configuration spark.sql.files.ignoreCorruptFiles or the data source option ignoreCorruptFiles to true.

With either one enabled, the Spark job continues to run when it encounters a corrupt file, and the contents that have already been read are still returned.

In [19]:
# When reading as Parquet, files in other formats (here people.json) are treated as corrupt files.

test_df = (session.read
           .option("ignoreCorruptFiles", "true")
           .parquet('/opt/bitnami/spark/data/parquet/',
                    '/opt/bitnami/spark/data/temp/'))

test_df.show()

+-------------+
|         file|
+-------------+
|         NULL|
|         NULL|
|         NULL|
|         NULL|
|file1.parquet|
|file2.parquet|
+-------------+



25/03/03 11:42:59 WARN FileScanRDD: Skipped the rest of the content in the corrupted file: path: file:///opt/bitnami/spark/data/temp/people.json, range: 0-73, partition values: [empty row]
java.lang.RuntimeException: file:/opt/bitnami/spark/data/temp/people.json is not a Parquet file. Expected magic number at tail, but found [49, 57, 125, 10]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:71)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:66)
	at org.apache.spark.sql.execution.datasources.parquet.Parq

In [21]:
# Same behavior, enabled through the session configuration instead of the per-read option
session.sql('set spark.sql.files.ignoreCorruptFiles=true')
test_dfo = session.read.parquet('/opt/bitnami/spark/data/parquet/',
                                '/opt/bitnami/spark/data/temp/')
test_dfo.show()

+-------------+
|         file|
+-------------+
|         NULL|
|         NULL|
|         NULL|
|         NULL|
|file1.parquet|
|file2.parquet|
+-------------+



25/03/03 11:43:34 WARN FileScanRDD: Skipped the rest of the content in the corrupted file: path: file:///opt/bitnami/spark/data/temp/people.json, range: 0-73, partition values: [empty row]
java.lang.RuntimeException: file:/opt/bitnami/spark/data/temp/people.json is not a Parquet file. Expected magic number at tail, but found [49, 57, 125, 10]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:71)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:66)
	at org.apache.spark.sql.execution.datasources.parquet.Parq
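
The warnings above report that the reader expected a magic number at the tail of the file. A Parquet file starts and ends with the 4-byte marker PAR1, so a quick pre-check in plain Python can flag files that would be skipped as corrupt. This is a sketch under stated assumptions: it only inspects the first and last four bytes, it does not validate the footer, and it does not cover encrypted-footer files. The names PARQUET_MAGIC and looks_like_parquet and the sample file are hypothetical.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file


def looks_like_parquet(path):
    """Return True if the file both starts and ends with the Parquet magic bytes."""
    magic_len = len(PARQUET_MAGIC)
    if os.path.getsize(path) < 2 * magic_len:
        return False  # too short to hold both markers
    with open(path, "rb") as handle:
        head = handle.read(magic_len)
        handle.seek(-magic_len, os.SEEK_END)
        tail = handle.read(magic_len)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# The four bytes reported in the warning are ASCII text, not a Parquet marker.
print(bytes([49, 57, 125, 10]))  # b'19}\n'

with tempfile.TemporaryDirectory() as demo_dir:
    # Hypothetical JSON file standing in for a non-Parquet input.
    json_path = os.path.join(demo_dir, "sample.json")
    with open(json_path, "wb") as handle:
        handle.write(b'{"age":19}\n')
    print(looks_like_parquet(json_path))  # False
```

A check like this could be used to list suspicious files up front, while ignoreCorruptFiles remains the mechanism that lets the Spark job itself proceed.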