# An End to End Example

In this section, we will reinforce everything we learned previously in this chapter with a more realistic example and explain, step by step, what is happening under the hood. We’ll use Spark to analyze some flight data from the United States Bureau of Transportation Statistics.

We will start with `data/flight-data/csv/2015-summary.csv`. The first few rows of this file look like this:

```
$ head data/flight-data/csv/2015-summary.csv

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
Egypt,United States,15
```

First, we will read this file into Spark.

Note: Replace the bucket name (`is843`) with your own bucket name and make sure that the data file can be found in Google Cloud Storage.

In [1]:
datapath = "gs://is843/notebooks/jupyter/data/flight-data/csv/2015-summary.csv"

In [2]:
flightData2015 = spark.read.option("inferSchema", "true")\
  .option("header", "true")\
  .csv(datapath)

`inferSchema` tells Spark to guess the data types by reading some of the data. In real life it is more reliable to specify the schema explicitly, but for the purpose of this example inference is fine.

We also want to specify that the first row is the header in the file, so we’ll specify that as an option, too.
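
To make the two options more concrete, here is a minimal pure-Python sketch of what "use the first row as the header" and "guess the types from a sample" could look like. It is an illustration only, not Spark's actual inference algorithm; `guess_type`, `infer_schema`, and the `sample_rows` parameter are hypothetical names introduced for this example.

```python
import csv
import io

# The same rows as the `head` output shown earlier in this section.
SAMPLE_CSV = """DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
Egypt,United States,15
"""


def guess_type(values):
    """Return the narrowest type name that fits every value (hypothetical helper)."""
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "string"


def infer_schema(text, sample_rows=100):
    """Use the first line as the header and guess each column's type from a sample."""
    reader = csv.DictReader(io.StringIO(text))  # first row becomes the column names
    sample = [row for _, row in zip(range(sample_rows), reader)]
    return {name: guess_type([row[name] for row in sample]) for name in reader.fieldnames}


schema = infer_schema(SAMPLE_CSV)
print(schema)
```

For these sample rows the sketch arrives at two string columns and one integer column. A guess based on a sample can be wrong when later rows contain values that the sample did not, which is why an explicit schema is the safer choice outside of examples.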

Reading data is a transformation, and is therefore a lazy operation: Spark does not return any rows until we call an action on the DataFrame.
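
Laziness can be illustrated with a plain Python generator. This is only an analogy, assuming a tiny in-memory list in place of the CSV file; it does not show how Spark implements lazy evaluation, and `scan` and `rows_read` are hypothetical names.

```python
import itertools

# A tiny in-memory stand-in for the CSV file.
source = [
    ("United States", "Romania", 15),
    ("United States", "Croatia", 1),
    ("United States", "Ireland", 344),
    ("Egypt", "United States", 15),
]

rows_read = []  # records which rows have actually been read


def scan(rows):
    """Lazily yield rows, recording each one at the moment it is read."""
    for row in rows:
        rows_read.append(row)
        yield row


lazy = scan(source)                           # like a transformation: nothing is read yet
read_before_action = len(rows_read)

first_two = list(itertools.islice(lazy, 2))   # like an action: pulls only what it needs
read_after_action = len(rows_read)

print(read_before_action, read_after_action)
```

Defining `lazy` reads nothing; rows are read only when something asks for results, and only as many as are needed.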

If we perform the take action on the DataFrame, we will be able to see the same results that we saw before when we used the command line:

In [3]:
flightData2015.take(4)

[Row(DEST_COUNTRY_NAME=u'United States', ORIGIN_COUNTRY_NAME=u'Romania', count=15),
 Row(DEST_COUNTRY_NAME=u'United States', ORIGIN_COUNTRY_NAME=u'Croatia', count=1),
 Row(DEST_COUNTRY_NAME=u'United States', ORIGIN_COUNTRY_NAME=u'Ireland', count=344),
 Row(DEST_COUNTRY_NAME=u'Egypt', ORIGIN_COUNTRY_NAME=u'United States', count=15)]

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/04-03-DataFrame-take.png?raw=true" width="700" align="center"/>

Now, let’s sort our data according to the `DEST_COUNTRY_NAME` column:

In [4]:
flightData2015.sort("DEST_COUNTRY_NAME").take(3)

[Row(DEST_COUNTRY_NAME=u'Algeria', ORIGIN_COUNTRY_NAME=u'United States', count=4),
 Row(DEST_COUNTRY_NAME=u'Angola', ORIGIN_COUNTRY_NAME=u'United States', count=15),
 Row(DEST_COUNTRY_NAME=u'Anguilla', ORIGIN_COUNTRY_NAME=u'United States', count=41)]

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/04-03-DataFrame-sort-take.png?raw=true" width="700" align="center"/>

Nothing happens to the data when we call `sort`, because it is just a transformation. However, we can see the plan that Spark builds for executing it across the cluster by calling `explain()`. We can call `explain()` on any DataFrame object to see the DataFrame’s lineage (that is, how Spark will execute the query):

In [5]:
flightData2015.sort("DEST_COUNTRY_NAME").explain()

== Physical Plan ==
*(2) Sort [DEST_COUNTRY_NAME#10 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(DEST_COUNTRY_NAME#10 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://is843/notebooks/jupyter/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


You can read explain plans from top to bottom, with the top being the end result and the bottom being the source(s) of data.

In this case, take a look at the first keyword on each line: `Sort`, `Exchange`, and `FileScan`. The `Exchange` step is there because sorting is a wide transformation: rows need to be compared with one another across partitions, so Spark has to shuffle the data.
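
The `Exchange rangepartitioning` step followed by `Sort` can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the split points in `boundaries` are hard-coded here purely for the example, and `range_partition_sort` is a hypothetical helper, not a Spark API.

```python
import bisect


def range_partition_sort(rows, key, boundaries):
    """Route each row to a partition by key range, then sort each partition locally."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        # bisect_right returns the index of the key range that this row falls into
        partitions[bisect.bisect_right(boundaries, key(row))].append(row)
    return [sorted(partition, key=key) for partition in partitions]


# (destination, count) pairs in arbitrary order
rows = [
    ("United States", 15),
    ("Egypt", 15),
    ("Algeria", 4),
    ("Canada", 8399),
    ("Mexico", 7140),
    ("Angola", 15),
]

boundaries = ["C", "M"]  # assumed split points, giving three key ranges
partitions = range_partition_sort(rows, key=lambda r: r[0], boundaries=boundaries)

# Because the partitions cover increasing key ranges, concatenating them
# in order yields a globally sorted result.
globally_sorted = [row for partition in partitions for row in partition]
print(globally_sorted)
```

Every row may have to move to a different partition before any local sort can start, which is what makes the sort a wide transformation.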

By default, when we perform a shuffle, Spark outputs 200 shuffle partitions. Let’s set this value to 5 to reduce the number of output partitions from the shuffle:

In [6]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [7]:
flightData2015.sort("DEST_COUNTRY_NAME").take(3)

[Row(DEST_COUNTRY_NAME=u'Algeria', ORIGIN_COUNTRY_NAME=u'United States', count=4),
 Row(DEST_COUNTRY_NAME=u'Angola', ORIGIN_COUNTRY_NAME=u'United States', count=15),
 Row(DEST_COUNTRY_NAME=u'Anguilla', ORIGIN_COUNTRY_NAME=u'United States', count=41)]

<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/04-03-DataFrame-partition.png?raw=true" width="700" align="center"/>

In [8]:
flightData2015.sort("DEST_COUNTRY_NAME").explain()

== Physical Plan ==
*(2) Sort [DEST_COUNTRY_NAME#10 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(DEST_COUNTRY_NAME#10 ASC NULLS FIRST, 5)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,ORIGIN_COUNTRY_NAME#11,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://is843/notebooks/jupyter/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


From `explain()` we can confirm that we now have 5 shuffle partitions. If you experiment with different values on larger datasets, you should see drastically different runtimes.

## DataFrames and SQL

Spark can run the same transformations, regardless of the language, in the exact same way. You can express your business logic in SQL or DataFrames (in R, Python, Scala, or Java) and Spark will compile that logic down to an underlying plan (which you can see with `explain()`) before actually executing your code. With Spark SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure SQL. There is no performance difference between writing SQL queries and writing DataFrame code; both “compile” to the same underlying plan.

You can make any DataFrame into a table or view with one simple method call:

In [9]:
flightData2015.createOrReplaceTempView("flight_data_2015")

Now we can query our data in SQL. To do so, we’ll use the `spark.sql` function (remember, `spark` is our SparkSession variable), which conveniently returns a new DataFrame. Although it might seem a bit circular that a SQL query against a DataFrame returns another DataFrame, it’s actually quite powerful. It lets you specify transformations in whichever manner is most convenient at any given point without sacrificing any efficiency. To see that this is the case, let’s take a look at two explain plans:

In [10]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(*)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

dataFrameWay = flightData2015.groupBy("DEST_COUNTRY_NAME").count()

sqlWay.explain()
dataFrameWay.explain()

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://is843/notebooks/jupyter/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://is843/notebooks/jupyter/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


**Notice that both queries compile to the exact same underlying plan!**
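
The shape of that shared plan (a partial count per input partition, an exchange by hash of the key, then a final count) can be sketched in plain Python. This is a rough illustration rather than Spark's implementation: `zlib.crc32` is only a deterministic stand-in for Spark's hash function, the split of rows into two input partitions is made up, and all function names are hypothetical.

```python
import zlib
from collections import Counter


def partial_count(partition):
    """First HashAggregate: count rows per destination within one input partition."""
    return Counter(dest for dest, _origin, _count in partition)


def shuffle_partition(key, num_partitions):
    """Exchange: choose the target partition from a hash of the key (crc32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def group_count(input_partitions, num_shuffle_partitions=5):
    """Final HashAggregate: merge the partial counts that landed in each shuffle partition."""
    shuffled = [Counter() for _ in range(num_shuffle_partitions)]
    for partition in input_partitions:
        for dest, n in partial_count(partition).items():
            shuffled[shuffle_partition(dest, num_shuffle_partitions)][dest] += n
    result = {}
    for counter in shuffled:
        result.update(counter)  # each key lives in exactly one shuffle partition
    return result


# The four sample rows from the file, split across two assumed input partitions.
input_partitions = [
    [("United States", "Romania", 15), ("United States", "Croatia", 1)],
    [("United States", "Ireland", 344), ("Egypt", "United States", 15)],
]

counts = group_count(input_partitions)
print(counts)
```

Counting within each partition first means that only one small record per key and partition has to cross the shuffle, instead of every row.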

Let’s pull out some interesting statistics from our data. One thing to understand is that DataFrames and SQL in Spark already have a huge number of manipulations available. There are hundreds of functions that you can import and use to help you solve your big data problems faster.

Let's find the maximum number of flights to and from any given location, first in SQL and then with the DataFrame API:

In [11]:
spark.sql("SELECT max(count) from flight_data_2015").take(1)

[Row(max(count)=370002)]

In [12]:
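# Import under an alias so that Spark's max does not shadow Python's built-in max
from pyspark.sql.functions import max as spark_max

flightData2015.select(spark_max("count")).take(1)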

[Row(max(count)=370002)]

Let's now find the top five destination countries in the data:

In [13]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [14]:
from pyspark.sql.functions import desc

flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)\
  .show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [15]:
flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)\
  .explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#110L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#10,destination_total#110L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[sum(cast(count#12 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_sum(cast(count#12 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://is843/notebooks/jupyter/data/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>


<img src="https://github.com/soltaniehha/Big-Data-Analytics-for-Business/blob/master/figs/04-03-DataFrame-transformation-flow.png?raw=true" width="700" align="center"/>

The true execution plan (the one visible in `explain()`) will differ from the one shown in the figure above because of optimizations in the physical execution. However, the illustration is a good starting point. This execution plan is a directed acyclic graph (DAG) of transformations, each resulting in a new immutable DataFrame, on which we call an action to generate a result.
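
One such optimization is visible in the plan above: the sort and the limit are fused into a single `TakeOrderedAndProject` step instead of sorting everything and then discarding all but five rows. The sketch below shows the general idea in plain Python, assuming the totals are already aggregated and spread over two made-up partitions; `take_ordered` is a hypothetical helper and only an approximation of what Spark does.

```python
import heapq


def take_ordered(partitions, limit, key):
    """Top-`limit` rows overall without fully sorting every partition."""
    # Step 1: each partition keeps only its own top `limit` rows.
    local_tops = [heapq.nlargest(limit, partition, key=key) for partition in partitions]
    # Step 2: merge the small candidate lists and take the top `limit` again.
    candidates = [row for top in local_tops for row in top]
    return heapq.nlargest(limit, candidates, key=key)


# Destination totals from the output above, split across two assumed partitions.
partitions = [
    [("United States", 411352), ("Japan", 1548), ("Canada", 8399)],
    [("Mexico", 7140), ("United Kingdom", 2025)],
]

top_three = take_ordered(partitions, limit=3, key=lambda row: row[1])
print(top_three)
```

Any row in the overall top `limit` must also be in the top `limit` of its own partition, so discarding the rest early is safe.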