# Notebook 2 - SparkSQL and DataFrames ("relational RDDs")


To run SQL queries on Spark you only need a SparkSession, which is preloaded as "spark" on our cluster.
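Outside this cluster no "spark" value is predefined. The following is a minimal sketch of how such a session could be created, assuming Spark is on the classpath and a local run is acceptable. The names `localSpark` and `localSc`, the application name, and the `local[*]` master are illustrative assumptions, not part of the course setup.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session; on the course cluster "spark" already exists.
val localSpark = SparkSession.builder()
  .appName("notebook2-sketch")   // illustrative application name
  .master("local[*]")            // assumption: run locally on all available cores
  .getOrCreate()                 // reuses an already running session if there is one

// The lower-level SparkContext remains reachable through the session.
val localSc = localSpark.sparkContext
```

If a session is already running, as in this notebook, `getOrCreate()` returns that session instead of starting a second one.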


In [None]:
val dataPath = "/home/adbs22/shared/diamonds.csv"
val diamonds = spark.read.format("csv")
  .option("header","true")
  .option("inferSchema", "true")
  .load(dataPath)


Reading a file through the SparkSession always produces a DataFrame (relational data). Most data analysts are more used to working with DataFrames than with SQL, from their experience with Python and R.

In [None]:
diamonds.printSchema()

We use the "show" function to actually produce the results. The first argument limits the number of rows displayed, and setting the second argument to false disables truncation, so long entries are printed in full instead of being cut off.

In [None]:
diamonds.show(20, false)
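In "show", the first argument is the row count and the second switches truncation on or off. The effect of the second argument is easiest to see on a value longer than 20 characters. A small sketch, assuming the notebook's "spark" session; the `demo` DataFrame and its single row are made up for illustration only.

```scala
import spark.implicits._  // assumes the notebook's "spark" session

// Made-up one-row DataFrame with a text value longer than 20 characters.
val demo = Seq(("Ideal", "a deliberately long text value")).toDF("cut", "note")

demo.show(1)         // truncation on (the default): "note" is shortened to fit 20 characters
demo.show(1, false)  // truncation off: "note" is printed in full
```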

In [None]:
val res1 = diamonds.groupBy("cut", "color").avg("price")

Spark has not executed the query yet, because DataFrame transformations are lazy. It already knows the schema of the result, though, and "explain" prints the physical plan it will use to compute it.

In [None]:
res1.explain()
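To make the difference between planning and execution concrete, the sketch below inspects `res1` from the cell above without running it and then triggers it with an action. The variable `groupCount` is a hypothetical name introduced here.

```scala
// Assumes `res1` from the cell above.
res1.printSchema()    // the result schema is known without running the aggregation
res1.explain(true)    // extended output: logical plans as well as the physical plan

// count() is an action, so this line is what actually launches a Spark job.
val groupCount = res1.count()
println(s"number of (cut, color) groups: $groupCount")
```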

In [None]:
res1.show(20, false)

In [None]:
val res2 = res1.join(diamonds, "color")

In [None]:
res2.explain()

In [None]:
res2.printSchema()

In [None]:

res2.show(20, false)
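Because the join above uses only "color", its schema contains two "cut" columns (one from each side), and every row is paired with the averages of all cuts of its color. If the intent is to attach each group's average price to the rows of that group, one possible variant joins on both grouping columns. This is a sketch of that alternative, not necessarily what the notebook intends; `res2Both` is a hypothetical name.

```scala
// Assumes `res1` and `diamonds` from the cells above.
// Joining on both grouping columns keeps a single "cut" and "color" column
// and matches each row only with the average of its own (cut, color) group.
val res2Both = res1.join(diamonds, Seq("cut", "color"))

res2Both.printSchema()
res2Both.show(5, false)
```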

### Write SQL Queries

You have to register a DataFrame as a temporary view before you can refer to it by name in SQL queries.


In [None]:
diamonds.createOrReplaceTempView("diamonds")

val temp = spark.sql("SELECT cut, color FROM diamonds")

(sql1 creates the same result as res1)

In [None]:
val sql1 = spark.sql("SELECT cut, color, avg(price) FROM diamonds GROUP BY cut, color")

In [None]:
sql1.explain()

In [None]:
res1.explain()

In [None]:
sql1.show(20, false)

In [None]:
res1.show(20, false)
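Comparing the two outputs by eye only covers the first 20 rows. A rough way to check the whole result is to subtract one DataFrame from the other in both directions. The sketch below assumes the "diamonds" DataFrame and temporary view from above; the alias `avg_price` and the names `dfAvg`, `sqlAvg`, `onlyInDf` and `onlyInSql` are introduced here for illustration. In general, floating-point aggregates can differ in their last digits, so an exact comparison like this one should be read with that caveat.

```scala
import org.apache.spark.sql.functions.avg

// Assumes "spark", the `diamonds` DataFrame and the "diamonds" temp view from above.
val dfAvg = diamonds.groupBy("cut", "color").agg(avg("price").as("avg_price"))
val sqlAvg = spark.sql(
  "SELECT cut, color, avg(price) AS avg_price FROM diamonds GROUP BY cut, color")

// Rows that appear in one result but not in the other.
// If both queries agree, neither difference contains any rows.
val onlyInDf = dfAvg.except(sqlAvg).count()
val onlyInSql = sqlAvg.except(dfAvg).count()
println(s"only in DataFrame result: $onlyInDf, only in SQL result: $onlyInSql")
```

Giving the aggregate an explicit alias also avoids having to refer to the generated column name "avg(price)" in later steps.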

### For more information see: https://spark.apache.org/docs/latest/sql-programming-guide.html