# INFO 323: Cloud Computing and Big Data
## An End-to-End Example for Spark

Spark reads a DataFrame from a file. The DataFrame has a fixed set of columns but an unspecified number of rows. The number of rows is unspecified because reading data is a transformation, and
is therefore a lazy operation. Spark peeked at only a couple of rows of data to try to guess what type
each column should be.
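
The idea of guessing column types from a small sample can be sketched in plain Python. This is only an illustration of the concept, not Spark's actual inference code: the helper names (infer_type, infer_schema), the sample size of two rows, and the inline CSV text standing in for the file are all assumptions made for this example.

```python
import csv
import io
from itertools import islice

# Hypothetical inline stand-in for the first lines of the CSV file.
SAMPLE_CSV = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
"""


def infer_type(values):
    """Guess 'int' if every sampled value parses as an integer, else 'string'."""
    try:
        for value in values:
            int(value)
        return "int"
    except ValueError:
        return "string"


def infer_schema(text, sample_rows=2):
    """Peek at the header plus a few rows and guess one type per column."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    sample = list(islice(reader, sample_rows))  # only the first rows are read
    columns = zip(*sample)  # transpose sampled rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


schema = infer_schema(SAMPLE_CSV)
print(schema)
```

Only the header and the sampled rows are consumed, which mirrors why the total row count is still unknown after the read has been declared.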

In [1]:
flightdata = spark.read.option("inferSchema", "true").option("header", "true").csv("2015-summary.csv")

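Note: this cell assumes an active SparkSession bound to the name spark. The pyspark shell and most Spark notebook kernels create it automatically; in a plain Python session the cell fails with NameError: name 'spark' is not defined until a session has been created.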

In [2]:
flightdata.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]

Let us specify a wide transformation, sort(). Nothing happens to the data when we call sort() because it’s just a transformation. However, we can
see that Spark is building up a plan for how it will execute the sort across the cluster by looking at the
explain plan. We can call explain() on any DataFrame object to see the DataFrame’s lineage.

In [3]:
flightdata.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#12 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#12 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/cloudera/spark-definitive/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


Now we have a sequence of transformations: narrow (read) -> wide (sort)

You can read explain plans from top to bottom, the top being the
end result and the bottom being the source(s) of data. In this case, take a look at the first keyword on each line.
You will see Sort, Exchange, and FileScan. The Exchange step appears because sorting our data is a wide
transformation: rows need to be compared with one another across partitions. Don’t worry too much about
understanding everything about explain plans at this point. They are simply helpful tools for debugging
and for improving your knowledge as you progress with Spark.
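
The Exchange rangepartitioning step followed by Sort can be pictured with a plain-Python sketch: rows are first routed to partitions by value range (the shuffle), each partition is then sorted locally, and reading the partitions in order yields a globally sorted result. This is a simplified model rather than Spark's implementation; the function names, the fixed boundaries, and the sample counts are assumptions made for illustration.

```python
from bisect import bisect_right


def range_partition(values, boundaries):
    """Route each value to a partition by value range (the 'exchange' step)."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for value in values:
        # bisect_right finds which range the value falls into.
        partitions[bisect_right(boundaries, value)].append(value)
    return partitions


def distributed_sort(values, boundaries):
    """Shuffle by range, sort each partition locally, then concatenate in order."""
    partitions = range_partition(values, boundaries)
    sorted_partitions = [sorted(p) for p in partitions]
    return [value for p in sorted_partitions for value in p]


counts = [15, 1, 344, 1, 62, 588, 40, 325]  # hypothetical flight counts
boundaries = [10, 100]                      # hypothetical range boundaries

print(range_partition(counts, boundaries))
# [[1, 1], [15, 62, 40], [344, 588, 325]]
result = distributed_sort(counts, boundaries)
print(result)
# [1, 1, 15, 40, 62, 325, 344, 588]
```

Because every value in one partition is smaller than every value in the next, no comparison across partitions is needed after the shuffle, which is why the plan shows the exchange below the sort.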

Next, we can specify an action to kick off this plan. Before doing
that, however, we’re going to set a configuration. By default, when we perform a shuffle, Spark outputs 200
shuffle partitions. Let’s set this value to 5 to reduce the number of output partitions from the
shuffle.

In [4]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [5]:
flightdata.sort("count").take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]
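
One way to see why a smaller partition count suits a small dataset is to bucket a handful of keys into 200 partitions and then into 5. The sketch below is a plain-Python stand-in, not Spark's partitioner: the shuffle function, the use of zlib.crc32 as a deterministic hash, and the list of keys are assumptions made for illustration.

```python
import zlib


def shuffle(keys, num_partitions):
    """Assign each key to a partition using a deterministic hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partition_id = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[partition_id].append(key)
    return partitions


# Hypothetical keys, taken from country names that appear in the sample output.
keys = ["United States", "Romania", "Croatia", "Ireland", "Singapore", "Moldova"]

for n in (200, 5):
    parts = shuffle(keys, n)
    non_empty = sum(1 for p in parts if p)
    print(f"{n} partitions: {non_empty} non-empty, {n - non_empty} empty")
```

With only six keys, at least 194 of the 200 partitions must stay empty, and each one would still have to be scheduled as a task.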

## DataFrame and SQL

We worked through a simple transformation in the previous example. Let’s now work through a more
complex one and follow along in both DataFrames and SQL. Spark can run the same transformations
in exactly the same way, regardless of the language. You can express your business logic in SQL or
DataFrames (in R, Python, Scala, or Java) and Spark will compile that logic down to an
underlying plan (which you can see in the explain plan) before actually executing your code. With Spark
SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure
SQL. There is no performance difference between writing SQL queries and writing DataFrame code;
both “compile” to the same underlying plan.
You can make any DataFrame into a table or view with one simple method call:

In [8]:
flightdata.createOrReplaceTempView("flight_data_table")

We can now query our data in SQL. To do so, we’ll use the spark.sql function (remember, spark is our SparkSession variable), which conveniently returns a new DataFrame. Although it might seem a bit circular that a SQL query against a DataFrame returns another DataFrame, it’s actually
quite powerful: you can specify transformations in whichever manner is most convenient at any given point in time without sacrificing any efficiency. To confirm
that both forms produce the same plan, let’s take a look at two explain plans:

In [10]:
flightdata.groupBy("DEST_COUNTRY_NAME").count().explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/cloudera/spark-definitive/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


In [13]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_table
GROUP BY DEST_COUNTRY_NAME
""")

In [17]:
sqlWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/cloudera/spark-definitive/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
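
Both plans have the same three stages above the file scan: a HashAggregate computing partial_count within each input partition, an Exchange hashpartitioning that moves each key's partial result to one partition, and a final HashAggregate that merges them. A minimal plain-Python sketch of that two-phase aggregation follows. It models the idea only; the function names, the crc32-based hash, and the input rows are assumptions, not Spark internals.

```python
import zlib
from collections import Counter


def partial_count(partition):
    """Stage 1: count keys locally within one input partition."""
    return Counter(partition)


def exchange(partials, num_partitions):
    """Shuffle: send each key's partial count to the partition chosen by its hash."""
    shuffled = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, n in partial.items():
            partition_id = zlib.crc32(key.encode("utf-8")) % num_partitions
            shuffled[partition_id].append((key, n))
    return shuffled


def final_count(shuffled_partition):
    """Stage 2: merge the partial counts that arrived for each key."""
    totals = Counter()
    for key, n in shuffled_partition:
        totals[key] += n
    return totals


# Hypothetical input: destination names split across two input partitions.
input_partitions = [
    ["United States", "United States", "Canada"],
    ["United States", "Mexico", "Canada"],
]

partials = [partial_count(p) for p in input_partitions]
shuffled = exchange(partials, num_partitions=5)

result = {}
for part in shuffled:
    result.update(final_count(part))  # each key lives in exactly one partition

print(sorted(result.items()))
# [('Canada', 2), ('Mexico', 1), ('United States', 3)]
```

Counting locally first means only one small record per key per partition crosses the shuffle, rather than every input row.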


Let’s pull out some interesting statistics from our data. One thing to understand is that DataFrames
(and SQL) in Spark already have a huge number of manipulations available. There are hundreds of
functions that you can import and use to help you solve your big data problems faster. We will use
the max function to establish the maximum number of flights to and from any given location. This simply
scans each value in the relevant column of the DataFrame and checks whether it’s greater than the
values seen so far. This is a transformation, because we are effectively filtering
down to one row. Let’s see what that looks like:

In [18]:
spark.sql("SELECT max(count) FROM flight_data_table").take(1)

[Row(max(count)=370002)]
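
The scan described above can be sketched in a few lines of plain Python. The version below also applies the scan per partition and then once more over the local results, which is one plausible way a distributed maximum can be combined. The function name and the partition contents are assumptions for illustration; only the value 370002 comes from the output above.

```python
def scan_max(values):
    """Scan once, keeping the largest value seen so far (None for empty input)."""
    best = None
    for value in values:
        if best is None or value > best:
            best = value
    return best


# Hypothetical counts split across two partitions.
partitions = [[15, 1, 344], [1, 370002, 62]]

local_maxima = [scan_max(p) for p in partitions]  # one pass per partition
overall = scan_max(local_maxima)                  # combine the local results

print(local_maxima, overall)
# [344, 370002] 370002
```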

Let’s perform something a bit more complicated and find the top five destination countries in the data.
This is our first multi-transformation query, so we’ll take it step by step.
Let’s begin with a fairly straightforward SQL aggregation:

In [20]:
maxsql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_table
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

In [21]:
maxsql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+
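
To experiment with the shape of this query without a cluster, the same SQL can be run against a tiny in-memory table using Python's built-in sqlite3 module. This swaps SQLite in for Spark SQL purely for illustration, so it says nothing about Spark's execution. The sample rows are made up, and the count column is quoted here only to avoid any ambiguity with the SQL function of the same name.

```python
import sqlite3

# Hypothetical sample rows: (destination, origin, count).
rows = [
    ("United States", "Romania", 15),
    ("United States", "Ireland", 344),
    ("Canada", "United States", 8),
    ("Mexico", "United States", 7),
    ("Japan", "United States", 3),
    ("Ireland", "United States", 2),
    ("Croatia", "United States", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_data_table "
    '(DEST_COUNTRY_NAME TEXT, ORIGIN_COUNTRY_NAME TEXT, "count" INTEGER)'
)
conn.executemany("INSERT INTO flight_data_table VALUES (?, ?, ?)", rows)

# Same structure as the Spark SQL query: aggregate, order, then limit.
top5 = conn.execute("""
SELECT DEST_COUNTRY_NAME, sum("count") AS destination_total
FROM flight_data_table
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum("count") DESC
LIMIT 5
""").fetchall()
conn.close()

print(top5)
# [('United States', 359), ('Canada', 8), ('Mexico', 7), ('Japan', 3), ('Ireland', 2)]
```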



Now, let’s move to the DataFrame syntax, which is semantically similar but slightly different in
implementation and ordering. As we mentioned, the underlying plans for both of them are the
same. Let’s run the DataFrame query and compare its result with the SQL version as a sanity check:

In [24]:
from pyspark.sql.functions import desc

In [25]:
flightdata.groupBy("DEST_COUNTRY_NAME")\
.sum("count").withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total")).limit(5).show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



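The DataFrame query above has seven steps: six transformations (read -> groupBy -> sum -> withColumnRenamed -> sort -> limit) followed by one action (show() in this cell; collect() would instead return the rows to the driver).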
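
The difference between declaring steps and running them can be modelled with a toy pipeline in plain Python: each transformation is only recorded, and the source is not read until an action is called. The class LazyPipeline, its methods, and the sample rows are hypothetical names invented for this sketch and are not part of the Spark API.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, an action runs them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def transform(self, name, fn):
        # Return a new pipeline with one more recorded step; no data is touched.
        return LazyPipeline(self._source, self._steps + ((name, fn),))

    def explain(self):
        return " -> ".join(["read"] + [name for name, _ in self._steps])

    def collect(self):
        # The action: only now is the source read and every step applied.
        data = list(self._source())
        for _, fn in self._steps:
            data = fn(data)
        return data


reads = []  # records each time the source is actually read


def read_rows():
    reads.append("read")
    return [("United States", 15), ("Canada", 8), ("United States", 344), ("Mexico", 7)]


def sum_by_dest(rows):
    totals = {}
    for dest, n in rows:
        totals[dest] = totals.get(dest, 0) + n
    return list(totals.items())


pipeline = (
    LazyPipeline(read_rows)
    .transform("groupBy+sum", sum_by_dest)
    .transform("sort", lambda rows: sorted(rows, key=lambda r: r[1], reverse=True))
    .transform("limit", lambda rows: rows[:2])
)

print(pipeline.explain())  # read -> groupBy+sum -> sort -> limit
print(reads)               # [] because nothing has been read yet
top2 = pipeline.collect()
print(top2)                # [('United States', 359), ('Canada', 8)]
```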

In [26]:
flightdata.groupBy("DEST_COUNTRY_NAME")\
.sum("count").withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total")).limit(5).explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#109L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#10,destination_total#109L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[sum(cast(count#12 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_sum(cast(count#12 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/cloudera/spark-definitive/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>
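
The TakeOrderedAndProject node at the top of this plan combines the sort and the limit, so a full sort of every aggregated row is not needed. One way to picture it is as a top-k selection within each partition followed by a final top-k over the surviving candidates. The sketch below is a simplified plain-Python model under that assumption. The function names and the partition layout are hypothetical, the five largest totals are taken from the output above, and the Ireland and France rows are made-up filler.

```python
import heapq


def top_k_per_partition(partition, k):
    """Keep only the k rows with the largest totals from one partition."""
    return heapq.nlargest(k, partition, key=lambda row: row[1])


def take_ordered(partitions, k):
    """Merge the per-partition candidates and keep the overall top k."""
    candidates = [row for p in partitions for row in top_k_per_partition(p, k)]
    return heapq.nlargest(k, candidates, key=lambda row: row[1])


# Hypothetical layout of aggregated rows across two partitions.
partitions = [
    [("United States", 411352), ("Japan", 1548), ("Ireland", 335)],
    [("Canada", 8399), ("Mexico", 7140), ("United Kingdom", 2025), ("France", 935)],
]

top5 = take_ordered(partitions, 5)
for dest, total in top5:
    print(dest, total)
```

Each partition forwards at most k rows, so the final merge handles a small candidate set regardless of how many rows were aggregated.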
