# Programmatically Specifying the Schema

+ https://spark.apache.org/docs/latest/sql-getting-started.html#programmatically-specifying-the-schema

When case classes cannot be defined ahead of time (for example, when the structure of records is encoded in a string, or when a text dataset will be parsed and its fields projected differently for different users), a DataFrame can be created programmatically in three steps.

1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD created in Step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
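The three steps can be sketched in isolation with a small in-memory dataset. This is a minimal illustration, not the notebook's own code: the names `session`, `rawRDD`, `sketchRows`, `sketchSchema`, and `sketchDF`, the sample records, and the local master setting are all assumptions made for the example. Unlike the cells that follow, it declares the age column as IntegerType, so the value is converted with `toInt` before it goes into the Row.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session for illustration only.
val session = SparkSession.builder
  .master("local[*]")
  .appName("SchemaSketch")
  .getOrCreate()

// Step 1: build an RDD of Rows from the raw records (hypothetical sample data).
val rawRDD = session.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30"))
val sketchRows = rawRDD
  .map(_.split(","))
  .map(parts => Row(parts(0).trim, parts(1).trim.toInt))

// Step 2: describe the Rows with a StructType, one StructField per column.
// The field types must match the values placed in each Row.
val sketchSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// Step 3: apply the schema to the RDD of Rows.
val sketchDF = session.createDataFrame(sketchRows, sketchSchema)
sketchDF.printSchema()
```

If a Row value does not match the type declared in the schema (for example, a String where IntegerType is declared), the mismatch typically surfaces as a runtime error when the DataFrame is evaluated, not when it is created.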

In [1]:
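import org.apache.spark.sql.SparkSession

// Reuse the active session if one exists, otherwise create it
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()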


spark = org.apache.spark.sql.SparkSession@4aeee1bd


org.apache.spark.sql.SparkSession@4aeee1bd

In [6]:
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Base directory of the sample data, relative to the notebook
val prefix = "../../data"

// Create an RDD
val peopleRDD = spark.sparkContext.textFile(s"$prefix/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Create a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()

+-------------+
|        value|
+-------------+
|Name: Michael|
|   Name: Andy|
| Name: Justin|
+-------------+



prefix = ../../data
peopleRDD = ../../data/resources/people.txt MapPartitionsRDD[10] at textFile at <console>:35
schemaString = name age
fields = Array(StructField(name,StringType,true), StructField(age,StringType,true))
schema = StructType(StructField(name,StringType,true), StructField(age,StringType,true))
rowRDD = MapPartitionsRDD[12] at map at <console>:48
peopleDF = [name: string, age: string]
results = [name: string]




[name: string]
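The comment in the cell above notes that columns can be read by field index or by field name, but only the index form is shown. The following sketch shows both forms side by side. It is an illustration under assumptions: `localSpark`, `sampleSchema`, `sampleRows`, `sampleDF`, `byName`, and `byIndex` are hypothetical names, and the two sample rows stand in for the people data.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session for illustration only.
val localSpark = SparkSession.builder
  .master("local[*]")
  .appName("FieldAccessSketch")
  .getOrCreate()
// Provides the Encoder[String] needed by Dataset.map below
import localSpark.implicits._

// A one-column DataFrame built from hypothetical sample rows
val sampleSchema = StructType(Seq(StructField("name", StringType, nullable = true)))
val sampleRows = localSpark.sparkContext.parallelize(Seq(Row("Michael"), Row("Andy")))
val sampleDF = localSpark.createDataFrame(sampleRows, sampleSchema)

// Access by field name: getAs looks the column up in the Row's schema
val byName = sampleDF.map(row => "Name: " + row.getAs[String]("name"))

// Access by field index: position 0 is the first column
val byIndex = sampleDF.map(row => "Name: " + row.getString(0))

byName.show()
```

Access by name keeps working if columns are later reordered in the query, while access by index avoids the name lookup but depends on column position.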