
# DataSource

## Description

The DataSource framework is a utility framework that helps configure and read DataFrames.

The framework supports reading from a given path in a specified format, such as avro, parquet, orc, json, csv, jdbc or delta.

The framework is composed of two main traits:

- `DataSource`, which is created based on a `DataSourceConfiguration` class and provides two main functions:

  ```scala
  override def reader(implicit spark: SparkSession): Reader
  override def read(implicit spark: SparkSession): Try[DataFrame]
  ```

- `DataSourceConfiguration`: a marker trait used to define `DataSource` configuration classes
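To make the relationship between the two traits concrete, here is a simplified, Spark-free sketch of the pattern: a marker configuration trait, a data source built from its configuration, and a `read` that returns a `Try`. The names `FileSourceConfiguration` and `FileDataSource` and their bodies are illustrative stand-ins for this sketch, not the framework's actual implementations.

```scala
import scala.util.{ Failure, Success, Try }

// Marker trait for configuration classes, as described above
trait DataSourceConfiguration

// A hypothetical configuration carrying a format and a path
case class FileSourceConfiguration(format: String, path: String) extends DataSourceConfiguration

// A data source created from its configuration; read() returns a Try,
// so failures are captured as values rather than thrown
trait DataSource[C <: DataSourceConfiguration] {
  def configuration: C
  def read: Try[String]
}

case class FileDataSource(configuration: FileSourceConfiguration)
    extends DataSource[FileSourceConfiguration] {
  // Stand-in body; the real framework would build a Spark reader here
  def read: Try[String] = Try(s"reading ${configuration.path} as ${configuration.format}")
}

val source = FileDataSource(FileSourceConfiguration("parquet", "/tmp/data"))
source.read match {
  case Success(result) => println(result)
  case Failure(error)  => println(s"Failed: $error")
}
```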

## Usage

The framework provides several predefined `DataSource` implementations.

For convenience, the `DataSourceFactory` trait and a default implementation are provided. To create a `DataSource` out of a given `DataSourceConfiguration` instance, one can call

```scala
DataSource( someDataSourceConfigurationInstance )
```

Also, to easily extract the configuration from a given TypeSafe Config instance, the `FormatAwareDataSourceConfiguration` factory is provided:

```scala
FormatAwareDataSourceConfiguration( someTypesafeConfigurationInstance )
```
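As an illustration of what such a TypeSafe Config instance might look like, here is a hypothetical HOCON fragment; the exact keys and their names depend on the concrete `DataSourceConfiguration` implementation being configured, so treat this only as a shape, not as the framework's actual schema.

```hocon
# Hypothetical example; key names are assumptions for illustration
input {
  format = "csv"
  path = "/tmp/input-data"
  options {
    header = true
    delimiter = ","
  }
}
```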

A convenience implicit decorator for the Spark session is available through the following import statements:

```scala
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._
```

The `org.tupol.spark.io` package contains the implicit factories for data sources, and `org.tupol.spark.io.implicits` contains the actual `SparkSession` decorator.

This allows us to create the `DataSource` by calling the `source()` function on the given Spark session, passing a `DataSourceConfiguration` instance.

```scala
import org.apache.spark.sql.SparkSession
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._

def spark: SparkSession = ???
def dataSourceConfiguration: DataSourceConfiguration = ???
val dataSource = spark.source(dataSourceConfiguration)
```

For streaming sources:

```scala
import org.apache.spark.sql.SparkSession
import org.tupol.spark.io._
import org.tupol.spark.io.implicits._

def spark: SparkSession = ???
def dataSourceConfiguration: DataSourceConfiguration = ???
val dataSource = spark.streamingSource(dataSourceConfiguration)
```

## Configuration Parameters