# Data Sources
This notebook introduces the data sources that you can use with Spark out of the box, as well as the many other sources built by the wider community. Spark has six “core” data sources and hundreds of external data sources written by the community. The ability to read from and write to many different kinds of data sources, and for the community to contribute its own, is arguably one of Spark’s greatest strengths. The following are Spark’s core data sources:

* CSV
* JSON
* Parquet
* ORC
* JDBC/ODBC connections
* Plain-text files

As mentioned, Spark has numerous community-created data sources. Here’s just a small sample:

* BigQuery
* Cassandra
* HBase
* MongoDB
* AWS Redshift
* XML
* And many, many others

The goal of this notebook is to give you the ability to read and write from Spark’s core data sources and know enough to understand what you should look for when integrating with third-party data sources. To achieve this, we will focus on the core concepts that you need to be able to recognize and understand.

## The Structure of the Data Sources API
Before proceeding with how to read and write from certain formats, let’s visit the overall organizational structure of the data source APIs.

### Read API Structure
The core structure for reading data is as follows:

```
DataFrameReader.format(...).option("key", "value").schema(...).load()
```

We will use this format to read from all of our data sources. `format` is optional because by default Spark will use the Parquet format. `option` allows you to set key-value configurations to parameterize how you will read data. Lastly, `schema` is optional if the data source provides a schema or if you intend to use schema inference. Naturally, there are some required options for each format, which we will discuss when we look at each format.

### Basics of Reading Data
The foundation for reading data in Spark is the DataFrameReader. We access this through the SparkSession via the read attribute:

```python
spark.read
```

After we have a DataFrame reader, we specify several values:

* The format
* The schema
* The read mode
* A series of options

The format, options, and schema calls each return a `DataFrameReader` that can undergo further transformations. All of them are optional, except for one option: at a minimum, you must supply the `DataFrameReader` with a path from which to read. Each data source has a specific set of options that determine how the data is read into Spark (we cover these options shortly).

Here’s an example of the overall layout:

``` python
spark.read.format("csv")\
  .option("mode", "FAILFAST")\
  .option("inferSchema", "true")\
  .option("path", "path/to/file(s)")\
  .load()
```
  
As we have seen before, we can also pass the path directly to the `load()` method instead of setting the "path" option, i.e.: `.load("path/to/file(s)")`

**READ MODES**

Reading data from an external source naturally entails encountering malformed data, especially when working with only semi-structured data sources. Read modes specify what will happen when Spark does come across malformed records. The table below lists the read modes:


|Read mode | Description |
| -- | -- |
| permissive | Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column called \_corrupt_record |
| dropMalformed | Drops the row that contains malformed records |
| failFast | Fails immediately upon encountering malformed records |

The default is *permissive*.
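
To make the three modes concrete, here is a plain-Python sketch (no Spark required) that mimics the behavior described in the table on a few CSV lines. The function `read_rows`, its `expected_fields` parameter, and the definition of "malformed" as "wrong number of fields" are assumptions made for illustration; Spark's real parsers apply richer rules.

```python
import csv

def read_rows(text, expected_fields, mode="PERMISSIVE"):
    """Mimic the read modes from the table above (illustrative only)."""
    mode = mode.upper()
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError("unknown mode: " + mode)
    rows = []
    for raw in text.splitlines():
        fields = next(csv.reader([raw]))
        if len(fields) == expected_fields:
            # Well-formed row: the extra last slot plays the role of _corrupt_record.
            rows.append(tuple(fields) + (None,))
        elif mode == "PERMISSIVE":
            # Null out every field and keep the raw text in the corrupt-record slot.
            rows.append((None,) * expected_fields + (raw,))
        elif mode == "DROPMALFORMED":
            continue  # silently skip the bad row
        else:
            raise ValueError("malformed record: " + raw)  # FAILFAST
    return rows

# Hypothetical input: the second line does not have three fields.
sample = "United States,Romania,1\nbroken line\nEgypt,United States,24"
permissive = read_rows(sample, 3, "PERMISSIVE")
dropped = read_rows(sample, 3, "DROPMALFORMED")
print(permissive)
print(dropped)
```

Calling `read_rows(sample, 3, "FAILFAST")` raises an error on the second line, which is the behavior you want when a bad record should stop the job rather than be hidden.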

**Example:** Let's load a sample CSV file, similar to what we have seen before:

In [1]:
# the following line gets the bucket name attached to our cluster
bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")

# specifying the path to our bucket where the data is located (no need to edit this path anymore)
data = "gs://" + bucket + "/Big-Data-Analytics-for-Business/data/"
print(data)

gs://qst843/Big-Data-Analytics-for-Business/data/


In [2]:
df = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load(data + "flight-data/csv/2010-summary.csv")

                                                                                

In [3]:
df.show(2)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
+-----------------+-------------------+-----+
only showing top 2 rows



                                                                                

### Write API Structure
The core structure for writing data is as follows:

```
DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...)\
  .sortBy(...).save()
```

We will use this format to write to all of our data sources. `format` is optional because, by default, Spark will use the Parquet format. `option`, again, allows us to configure how to write out our given data. `partitionBy`, `bucketBy`, and `sortBy` work only for file-based data sources; you can use them to control the specific layout of files at the destination.

### Basics of Writing Data
The foundation for writing data is quite similar to that of reading data. Instead of the *DataFrameReader*, we have the *DataFrameWriter*. Because we always need to write out some given data source, we access the *DataFrameWriter* on a per-DataFrame basis via the `write` attribute:

In [4]:
df.write

<pyspark.sql.readwriter.DataFrameWriter at 0x7fb72dc80a50>

After we have a DataFrameWriter, we specify three values: the format, a series of options, and the save mode. At a minimum, you must supply a path. We will cover the available options, which vary from data source to data source, shortly.

In [5]:
df.write.format("csv")\
  .option("header", "True")\
  .mode("overwrite")\
  .save("path/to/file(s)")

                                                                                

### SAVE MODES
Save modes specify what will happen if Spark finds data at the specified location (assuming all else equal). The following table lists the save modes:

| Save mode | Description |
| -- | -- |
| append | Appends the output files to the list of files that already exist at that location |
| overwrite | Will completely overwrite any data that already exists there |
| errorIfExists | Throws an error and fails the write if data or files already exist at the specified location |
| ignore | If data or files exist at the location, do nothing with the current DataFrame |

The default is `errorIfExists`. This means that if Spark finds data at the location to which you’re writing, it will fail the write immediately.
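
The decision each save mode makes can be sketched in plain Python against a local folder. This is only an illustration of the table above, not Spark's implementation: the helper `save_lines` and the `part-0000N.txt` file names are made up for this sketch.

```python
import os
import tempfile

def save_lines(lines, path, mode="errorifexists"):
    """Write lines into the folder `path`, following the save-mode table (illustrative)."""
    mode = mode.lower()
    if mode not in ("append", "overwrite", "errorifexists", "ignore"):
        raise ValueError("unknown save mode: " + mode)
    exists = os.path.isdir(path) and bool(os.listdir(path))
    if exists and mode == "errorifexists":
        raise FileExistsError(path + " already contains data")
    if exists and mode == "ignore":
        return False  # leave the existing data untouched and write nothing
    if exists and mode == "overwrite":
        for name in os.listdir(path):
            os.remove(os.path.join(path, name))
    os.makedirs(path, exist_ok=True)
    # "append" (or a fresh folder) simply adds the next part file.
    part = "part-%05d.txt" % len(os.listdir(path))
    with open(os.path.join(path, part), "w") as f:
        f.write("\n".join(lines) + "\n")
    return True

target = os.path.join(tempfile.mkdtemp(), "out")
save_lines(["a"], target)                   # first write: nothing exists yet
save_lines(["b"], target, "append")         # adds a second file
files_after_append = sorted(os.listdir(target))
save_lines(["c"], target, "overwrite")      # removes both files, writes one
files_after_overwrite = sorted(os.listdir(target))
wrote = save_lines(["d"], target, "ignore") # data exists, so nothing is written
print(files_after_append, files_after_overwrite, wrote)
```

A second call with the default mode would raise `FileExistsError` here, mirroring how the default `errorIfExists` fails the write.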

We’ve largely covered the core concepts that you’re going to need when using data sources, so now let’s dive into each of Spark’s native data sources.

### CSV Files
CSV stands for comma-separated values. This is a common text file format in which each line represents a single record, and commas separate each field within a record. CSV files, while seeming well structured, are actually one of the trickiest file formats you will encounter because not many assumptions can be made in production scenarios about what they contain or how they are structured. For this reason, the CSV reader has a large number of options. These options give you the ability to work around issues like certain characters needing to be escaped (for example, commas inside of columns when the file is also comma-delimited) or null values labeled in an unconventional way.
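
The comma-inside-a-column problem is easy to reproduce with Python's standard `csv` module, without Spark. The record below is a made-up example; the point is only that a naive split on commas corrupts it, while a quoting convention lets it round-trip.

```python
import csv
import io

# Hypothetical record whose second field contains a comma.
record = ["United States", "Washington, D.C.", "264"]

# Naive approach: join and split on commas. The embedded comma adds a field.
naive_line = ",".join(record)
naive_fields = naive_line.split(",")

# csv module: the writer quotes the field, and the reader undoes the quoting.
buffer = io.StringIO()
csv.writer(buffer).writerow(record)
quoted_line = buffer.getvalue().strip()
parsed_fields = next(csv.reader([quoted_line]))

print(naive_fields)    # four fields: the record is silently corrupted
print(quoted_line)     # United States,"Washington, D.C.",264
print(parsed_fields)   # three fields: the record round-trips
```

Options such as `sep`, `escape`, and `quoteAll` in the table that follows exist to handle exactly this kind of ambiguity.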

### CSV Options
The following table presents the options available when reading and writing CSV files:


|Read/write|Key|Potential values|Default |Description|
|--|--|--|--|--|
|Both|sep|Any single string character|,|The single character that is used as separator for each field and value.|
|Both|header|true, false|false|A Boolean flag that declares whether the first line in the file(s) are the names of the columns.|
|Read|escape|Any string character|"\\"|The character Spark should use to escape other characters in the file.|
|Read|inferSchema|true, false|false|Specifies whether Spark should infer column types when reading the file.|
|Read|ignoreLeadingWhiteSpace|true, false|false|Declares whether leading spaces from values being read should be skipped.|
|Read|ignoreTrailingWhiteSpace|true, false|false|Declares whether trailing spaces from values being read should be skipped.|
|Both|nullValue|Any string character|""|Declares what character represents a null value in the file.|
|Both|nanValue|Any string character|NaN|Declares what character represents a NaN or missing character in the CSV file.|
|Both|positiveInf|Any string or character|Inf|Declares what character(s) represent a positive infinite value.|
|Both|negativeInf|Any string or character|-Inf|Declares what character(s) represent a negative infinite value.|
|Both|compression or codec|None, uncompressed, bzip2, deflate, gzip, lz4, or snappy|none|Declares what compression codec Spark should use to read or write the file.|
|Both|dateFormat|Any string or character that conforms to Java’s SimpleDateFormat.|yyyy-MM-dd|Declares the date format for any columns that are date type.|
|Both|timestampFormat|Any string or character that conforms to Java’s SimpleDateFormat.|yyyy-MM-dd'T'HH:mm:ss.SSSZZ|Declares the timestamp format for any columns that are timestamp type.|
|Read|maxColumns|Any integer|20480|Declares the maximum number of columns in the file.|
|Read|maxCharsPerColumn|Any integer|1000000|Declares the maximum number of characters in a column.|
|Write|escapeQuotes|true, false|true|Declares whether values that contain quotes should always be enclosed in quotes.|
|Read|maxMalformedLogPerPartition|Any integer|10|Sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored.|
|Write|quoteAll|true, false|false|Specifies whether all values should be enclosed in quotes, as opposed to just escaping values that have a quote character.|
|Read|multiLine|true, false|false|This option allows you to read multiline CSV files where each logical row in the CSV file might span multiple rows in the file itself.|

### Reading CSV Files
To read a CSV file, like any other format, we must first create a `DataFrameReader` for that specific format. Here, we specify the format to be CSV:

In [6]:
spark.read.format("csv")

<pyspark.sql.readwriter.DataFrameReader at 0x7fb72dca5a50>

After this, we have the option of specifying a schema as well as modes as options. Let’s set a couple of options, some that we have seen above and others that we haven’t seen yet. We’ll set the `header` to `true` for our CSV file, the `mode` to be `FAILFAST`, and `inferSchema` to `true`:

In [7]:
spark.read.format("csv")\
  .option("header", "true")\
  .option("mode", "FAILFAST")\
  .option("inferSchema", "true")\
  .load(data + "flight-data/csv/2010-summary.csv")
# FAILFAST: Fails immediately upon encountering malformed records

DataFrame[DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string, count: int]
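
The `inferSchema` option asks Spark to scan the data and choose a type for each column, which is how `count` came back as `int` above. The idea can be sketched in plain Python as follows. This is a simplified illustration, not Spark's algorithm: the helper `infer_type` only distinguishes three types, and the `ratio` column is a hypothetical addition.

```python
def infer_type(values):
    """Pick the narrowest of int, double, string that fits every non-empty value."""
    def fits(cast):
        for v in values:
            if v == "":
                continue  # treat empty strings as nulls, which fit any type
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"

columns = {
    "DEST_COUNTRY_NAME": ["United States", "Egypt"],
    "count": ["1", "264"],
    "ratio": ["0.5", "2"],  # hypothetical column, not in the flight data
}
inferred = {name: infer_type(vals) for name, vals in columns.items()}
print(inferred)
```

Note that inference has to look at the values before it can decide, which is one reason to supply a schema yourself when you already know it.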

As mentioned, we can use the mode to specify how much tolerance we have for malformed data. For example, we can use these modes and a manual schema to ensure that our file(s) conform to the data that we expected:

In [8]:
from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), True)
])

spark.read.format("csv")\
  .option("header", "true")\
  .option("mode", "FAILFAST")\
  .schema(myManualSchema)\
  .load(data + "flight-data/csv/2010-summary.csv")\
  .show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



### Writing CSV Files

Just as with reading data, there are a variety of options (listed in the table above) for writing data when we write CSV files. This is a subset of the reading options because many do not apply when writing data (like `maxColumns` and `inferSchema`).

Let's first look at our loaded `df` and then use a filter to create a small DataFrame from it:

In [9]:
df.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



In [10]:
from pyspark.sql.functions import desc, col
outbound_US_2010 = df.where("ORIGIN_COUNTRY_NAME == 'United States'")\
  .where("DEST_COUNTRY_NAME <> 'United States'")\
  .select("DEST_COUNTRY_NAME", "count")\
  .orderBy(col("count").desc())

outbound_US_2010.show(5)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|           Canada| 8271|
|           Mexico| 6200|
|   United Kingdom| 1629|
|          Germany| 1392|
|            Japan| 1383|
+-----------------+-----+
only showing top 5 rows



                                                                                

We will write this DataFrame to a TSV file in our Google Cloud bucket. Before writing let's check out the current physical plan for our DataFrame `outbound_US_2010`:

In [11]:
outbound_US_2010.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#19 DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#19 DESC NULLS LAST, 1000), ENSURE_REQUIREMENTS, [plan_id=113]
      +- Project [DEST_COUNTRY_NAME#17, count#19]
         +- Filter (((isnotnull(ORIGIN_COUNTRY_NAME#18) AND isnotnull(DEST_COUNTRY_NAME#17)) AND (ORIGIN_COUNTRY_NAME#18 = United States)) AND NOT (DEST_COUNTRY_NAME#17 = United States))
            +- FileScan csv [DEST_COUNTRY_NAME#17,ORIGIN_COUNTRY_NAME#18,count#19] Batched: false, DataFilters: [isnotnull(ORIGIN_COUNTRY_NAME#18), isnotnull(DEST_COUNTRY_NAME#17), (ORIGIN_COUNTRY_NAME#18 = Un..., Format: CSV, Location: InMemoryFileIndex(1 paths)[gs://qst843/Big-Data-Analytics-for-Business/data/flight-data/csv/2010-..., PartitionFilters: [], PushedFilters: [IsNotNull(ORIGIN_COUNTRY_NAME), IsNotNull(DEST_COUNTRY_NAME), EqualTo(ORIGIN_COUNTRY_NAME,United..., ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>




The default number of shuffle partitions is 200 (the plan above shows that this cluster is configured with 1000). The `orderBy` operation triggers a shuffle, so if we write the current DataFrame as is, it will be split into many small files. In fact, since this is a small DataFrame, it will be 124 files, since we have 124 rows in the DataFrame. This is less than ideal, as it adds significant time to reads and writes. We can fix this by resetting the `spark.sql.shuffle.partitions` value:

In [12]:
spark.conf.set('spark.sql.shuffle.partitions', '1')

If we write the DataFrame now, we end up with only one file:

In [13]:
outbound_US_2010.write.format("csv")\
  .option("header", "True")\
  .option("sep", "\t")\
  .mode("overwrite")\
  .save(data + "tmp/outbound_US_2010.tsv")

print("outbound_US_2010 DataFrame was written to a CSV file in the following path: {}tmp/outbound_US_2010.tsv".format(data))

outbound_US_2010 DataFrame was written to a CSV file in the following path: gs://qst843/Big-Data-Analytics-for-Business/data/tmp/outbound_US_2010.tsv


In [14]:
data + "tmp/outbound_US_2010.tsv"

'gs://qst843/Big-Data-Analytics-for-Business/data/tmp/outbound_US_2010.tsv'

### Writing to Hadoop File System
We can also write this file to our cluster's Hadoop Distributed File System (HDFS):

In [15]:
outbound_US_2010.write.format("csv")\
  .option("header", "True")\
  .option("sep", "\t")\
  .mode("overwrite")\
  .save("/tmp/outbound_US_2010.tsv")

                                                                                

In [16]:
!hadoop fs -ls /tmp/outbound_US_2010.tsv

Found 2 items
-rw-r--r--   1 root hadoop          0 2025-10-27 23:28 /tmp/outbound_US_2010.tsv/_SUCCESS
-rw-r--r--   1 root hadoop       1727 2025-10-27 23:28 /tmp/outbound_US_2010.tsv/part-00000-8a80259d-419c-44f8-9f4b-39cd27338197-c000.csv


The number of files written depends on the number of partitions the DataFrame has at the time you write out the data. By default, one file is written per partition. This means that although we specify a “file”, we actually get a folder with the name of the specified file, containing one file for each partition that is written.

Here, outbound_US_2010.tsv is a folder that contains our CSV file(s). If we were to repartition our data before writing, we would end up with a different number of files. Let's look at the first few rows of this file and confirm that the format is what we expected.

Use the following command to check the first few rows. Replace FILENAME with the name of the file from the result of the list above:

```bash
!hadoop fs -cat FILENAME | head
```

In [17]:
# Your code goes here
!hadoop fs -cat /tmp/outbound_US_2010.tsv/* | head

DEST_COUNTRY_NAME	count
Canada	8271
Mexico	6200
United Kingdom	1629
Germany	1392
Japan	1383
Dominican Republic	1109
Brazil	995
The Bahamas	903
Colombia	785


## JSON Files
Refer to chapter 9 for the specifics of reading and writing this type of file.

## Parquet Files
Parquet is an open source column-oriented data store that provides a variety of storage optimizations, especially for analytics workloads. It provides columnar compression, which saves storage space and allows for reading individual columns instead of entire files. It is a file format that works exceptionally well with Apache Spark and is in fact the default file format. We recommend writing data out to Parquet for long-term storage because reading from a Parquet file is generally more efficient than reading from JSON or CSV. Another advantage of Parquet is that it supports complex types. This means that if your column is an array (which would fail with a CSV file, for example), map, or struct, you’ll still be able to read and write that file without issue. Here’s how to specify Parquet as the read format:

In [18]:
spark.read.format("parquet")

<pyspark.sql.readwriter.DataFrameReader at 0x7fb72dc9ba90>

### Reading Parquet Files
Parquet has very few options because it enforces its own schema when storing data. Thus, all you need to set is the format and you are good to go. We can set the schema if we have strict requirements for what our DataFrame should look like. Oftentimes this is not necessary because we can use schema on read, which is similar to the inferSchema with CSV files. However, with Parquet files, this method is more powerful because the schema is built into the file itself (so no inference needed).

Here is a simple example of reading from Parquet:

In [19]:
spark.read.format("parquet").load(data + "flight-data/parquet/2010-summary.parquet").show(5)

                                                                                

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|    1|
|    United States|            Ireland|  264|
|    United States|              India|   69|
|            Egypt|      United States|   24|
|Equatorial Guinea|      United States|    1|
+-----------------+-------------------+-----+
only showing top 5 rows



                                                                                

### PARQUET OPTIONS
As we just mentioned, there are very few Parquet options (precisely two, in fact) because it has a well-defined specification that aligns closely with the concepts in Spark. The table below presents the options:

|Read/Write|Key|Potential Values|Default|Description|
|--|--|--|--|--|
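|Write|compression or codec|none, uncompressed, snappy, gzip, lzo, brotli, lz4, or zstd (the available codecs depend on the Spark version)|Value of the configuration spark.sql.parquet.compression.codec (snappy by default)|Declares what compression codec Spark should use to write the file.|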
|Read|mergeSchema|true, false|Value of the configuration spark.sql.parquet.mergeSchema|You can incrementally add columns to newly written Parquet files in the same table/folder. Use this option to enable or disable this feature.|

### Writing Parquet Files
Writing Parquet is as easy as reading it. We simply specify the location for the file. The same partitioning rules apply:

In [20]:
outbound_US_2010.write.format("parquet").mode("overwrite")\
  .save(data + "tmp/outbound_US_2010.parquet") 

                                                                                

We can easily read back this file into a DataFrame:

In [21]:
spark.read.format("parquet").load(data + "tmp/outbound_US_2010.parquet").show(5)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
|           Canada| 8271|
|           Mexico| 6200|
|   United Kingdom| 1629|
|          Germany| 1392|
|            Japan| 1383|
+-----------------+-----+
only showing top 5 rows



## BigQuery

The BigQuery connector can be used with Apache Spark to read data from and write data to BigQuery. This section provides example code that uses the BigQuery connector with PySpark.

The following code reads the BigQuery public table `bigquery-public-data.samples.shakespeare` into a Spark DataFrame, performs a word count, and writes the result back to BigQuery.

The `temporaryGcsBucket` setting names the Google Cloud Storage bucket that the connector uses to stage data when writing to BigQuery:

In [22]:
spark.conf.set('temporaryGcsBucket', bucket)

In [23]:
# Load data from BigQuery.
words = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:samples.shakespeare') \
  .load()
words.createOrReplaceTempView('words')

# Perform word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word ORDER BY word_count DESC')
word_count.show(10)
word_count.printSchema()

# Saving the data to BigQuery
word_count.write.format('bigquery') \
  .mode("overwrite") \
  .option('table', 'examples.wordcount_output') \
  .save()

                                                                                

+----+----------+
|word|word_count|
+----+----------+
| the|     25568|
|   I|     21028|
| and|     19649|
|  to|     17361|
|  of|     16438|
|   a|     13409|
| you|     12527|
|  my|     11291|
|  in|     10589|
|  is|      8735|
+----+----------+
only showing top 10 rows

root
 |-- word: string (nullable = false)
 |-- word_count: long (nullable = true)



                                                                                

## Partitioning
Partitioning is a tool that allows you to control what data is stored (and where) as you write it. When you write a file to a partitioned directory (or table), you basically encode a column as a folder. This allows you to skip lots of data when you go to read it in later, reading only the data relevant to your problem instead of scanning the complete dataset. Partitioning is supported for all file-based data sources:

In [24]:
df.write.mode("overwrite").partitionBy("DEST_COUNTRY_NAME")\
  .save("/tmp/partitioned-files.parquet")

                                                                                

Upon writing, you get a list of folders in your Parquet “file”:

In [None]:
!hadoop fs -ls /tmp/partitioned-files.parquet | head

Each of these folders contains the Parquet files holding the rows for which the predicate in the folder name (for example, `DEST_COUNTRY_NAME=Afghanistan`) is true:

In [None]:
!hadoop fs -ls /tmp/partitioned-files.parquet/DEST_COUNTRY_NAME=Afghanistan

This is probably the lowest-hanging optimization that you can use when you have a table that readers frequently filter by before manipulating. For instance, date is particularly common for a partition because, downstream, often we want to look at only the previous week’s data (instead of scanning the entire list of records). This can provide massive speedups for readers.
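
The "column encoded as a folder" idea can be sketched in plain Python on a local disk. This is an illustration of the directory layout only, not of how Spark writes Parquet: the helpers `write_partitioned` and `read_partition`, the CSV part files, and the three sample rows are assumptions made for this sketch.

```python
import os
import tempfile
from collections import defaultdict

rows = [
    ("United States", "Romania", 1),
    ("United States", "Ireland", 264),
    ("Egypt", "United States", 24),
]

def write_partitioned(rows, root):
    """Write one folder per DEST_COUNTRY_NAME value, named column=value."""
    groups = defaultdict(list)
    for dest, origin, count in rows:
        groups[dest].append((origin, count))
    for dest, members in groups.items():
        folder = os.path.join(root, "DEST_COUNTRY_NAME=" + dest)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, "part-00000.csv"), "w") as f:
            for origin, count in members:
                # The partition column is not stored in the file;
                # the folder name carries its value.
                f.write("%s,%d\n" % (origin, count))

def read_partition(root, dest):
    """Read only the folder for one destination, never opening the others."""
    folder = os.path.join(root, "DEST_COUNTRY_NAME=" + dest)
    with open(os.path.join(folder, "part-00000.csv")) as f:
        return [line.strip().split(",") for line in f]

root = tempfile.mkdtemp()
write_partitioned(rows, root)
folders = sorted(os.listdir(root))
egypt_rows = read_partition(root, "Egypt")
print(folders)
print(egypt_rows)
```

A filter on the partition column maps directly to a folder name, so the reader can skip every other folder without opening a single file in it.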

## Managing File Size
Managing file sizes is an important factor not so much for writing data but reading it later on. When you’re writing lots of small files, there’s a significant metadata overhead that you incur managing all of those files. Spark especially does not do well with small files, although many file systems (like HDFS) don’t handle lots of small files well, either. You might hear this referred to as the “small file problem.” The opposite is also true: you don’t want files that are too large either, because it becomes inefficient to have to read entire blocks of data when you need only a few rows.

Spark 2.2 introduced a new method for controlling file sizes in a more automatic way. We saw previously that the number of output files is derived from the number of partitions we had at write time (and the partitioning columns we selected). Now, you can take advantage of another tool to limit output file sizes so that you can target an optimum file size. You can use the *maxRecordsPerFile* option and specify a number of your choosing. This allows you to better control file sizes by controlling the number of records that are written to each file. For example, if you set an option for a writer as `df.write.option("maxRecordsPerFile", 5000)`, Spark will ensure that files will contain at most 5,000 records.
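
The effect of a per-file record cap on a single partition can be worked through in plain Python. The helper `split_partition` and the 12,000-record partition are hypothetical; the sketch only shows the arithmetic, not how Spark writes files.

```python
def split_partition(records, max_records_per_file):
    """Split one partition's records into chunks of at most max_records_per_file."""
    return [records[i:i + max_records_per_file]
            for i in range(0, len(records), max_records_per_file)]

partition = list(range(12_000))       # hypothetical partition with 12,000 records
chunks = split_partition(partition, 5000)
sizes = [len(c) for c in chunks]
print(sizes)                          # [5000, 5000, 2000]
```

With a cap of 5,000, this one partition would yield three files instead of one, and no file exceeds the cap. The cap limits records rather than bytes, so the resulting file sizes still depend on how wide each record is.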

## SQL Databases (JDBC Connection)

SQL data sources are one of the more powerful connectors because there are a variety of systems to which you can connect (as long as that system speaks SQL). For instance, you can connect to a MySQL database, a PostgreSQL database, or an Oracle database. Despite the importance of these kinds of data sources, the details of JDBC connections are beyond the scope of this course. If you are interested, more details can be found in chapter 9 and in the Spark documentation.