# Spark: Make sure your environment is configured properly

This notebook only checks that you can access Spark and use it through a Jupyter notebook, so you don't need to pay any attention to the code in it. Just run all the cells and check that you are able to instantiate Spark and use it.

* I am using Jupyter Notebooks backed by Cloud Dataproc, as I can provision and use the cluster from anywhere
* Data persistence is not tied to the compute; in this case the data is stored in GCS
* Make sure you don't store any state on the cloud compute instances, as it is not a best practice to do so. Instead, make use of one of the following:
    - `AWS S3`
    - `GCP GCS`
    - `Azure Blob Storage`
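
As a rough illustration of keeping state outside the cluster, the sketch below builds the kind of object-store URI a Spark job would read from or write to. The helper name `storage_uri`, the provider labels and the `notebooks/output` path are assumptions made for this example; `s3a://`, `gs://` and `wasbs://` are the schemes commonly used by the Hadoop connectors for these stores, and the connector setup itself depends on your cluster.

```python
def storage_uri(provider, bucket, path, account=None):
    """Build an object-store URI for the given provider.

    provider: "aws", "gcp" or "azure" (labels chosen for this sketch).
    bucket:   bucket name (container name for Azure).
    account:  Azure storage account name; only used for "azure".
    """
    path = path.lstrip("/")
    if provider == "aws":
        return "s3a://{}/{}".format(bucket, path)
    if provider == "gcp":
        return "gs://{}/{}".format(bucket, path)
    if provider == "azure":
        if not account:
            raise ValueError("Azure URIs need a storage account name")
        return "wasbs://{}@{}.blob.core.windows.net/{}".format(bucket, account, path)
    raise ValueError("unknown provider: {}".format(provider))


# Hypothetical output location inside the bucket used by the cluster.
print(storage_uri("gcp", "dataproc-statestore", "notebooks/output"))
```

Writing results to a URI like this means the cluster can be deleted and recreated without losing anything.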
    
**Provisioning a Cloud Dataproc cluster**

Run the following command to provision the cluster:

```sh
# Cluster provisioning command
# Make sure you have already logged in using 'gcloud auth login'

./gcp/dataproc \
          --gcloud-email=sample@gmail.com \
          --project-id=spark-dataproc-cluster \
          --cluster-name=test-spark-cluster \
          --bucket=dataproc-statestore \
          --action=create
```

Follow the instructions in the `README.md` for more information. All code can be found in my [Codebooks repository on GitHub](https://github.com/reddy-s/codebooks).

In [36]:
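# Imports
import os
import random

# SparkContext is used in the cells below, so it has to be imported as well
from pyspark import SparkContext
from pyspark.sql import SparkSession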

## Configure the Spark context

* App name
* Spark master location (the master setting is needed only when running in cluster mode)
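
The two settings listed above can be passed through the `SparkSession` builder. The following is a minimal sketch, not the setup used by this notebook: the helper names `session_options` and `build_session` and the app name `environment-check` are assumptions for illustration, while `spark.app.name` and `spark.master` are the standard Spark configuration keys.

```python
def session_options(app_name, master=None):
    """Collect the settings named above as plain key/value pairs."""
    options = {"spark.app.name": app_name}
    # Only set the master when the environment does not already provide one
    # (spark-submit or a managed notebook usually does).
    if master is not None:
        options["spark.master"] = master
    return options


def build_session(app_name, master=None):
    """Create (or reuse) a SparkSession configured with the given options."""
    # Imported here so session_options() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in session_options(app_name, master).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With this in place, `spark = build_session("environment-check")` gives a session, and `spark.sparkContext` gives the underlying context. On a single machine you could pass `master="local[*]"` instead.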

In [44]:
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In [45]:
sc.getConf().getAll()

[(u'spark.eventLog.enabled', u'true'),
 (u'spark.dynamicAllocation.minExecutors', u'1'),
 (u'spark.executor.memory', u'2688m'),
 (u'spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
  u'test-spark-cluster-m'),
 (u'spark.ui.proxyBase', u'/proxy/application_1569758767276_0003'),
 (u'spark.driver.port', u'38757'),
 (u'spark.yarn.am.memory', u'640m'),
 (u'spark.history.fs.logDirectory',
  u'hdfs://test-spark-cluster-m/user/spark/eventlog'),
 (u'spark.eventLog.dir', u'hdfs://test-spark-cluster-m/user/spark/eventlog'),
 (u'spark.executor.instances', u'2'),
 (u'spark.serializer.objectStreamReset', u'100'),
 (u'spark.executorEnv.PYTHONPATH',
  u'/usr/lib/spark/python/lib/py4j-0.10.7-src.zip:/usr/lib/spark/python/:<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.7-src.zip'),
 (u'spark.submit.deployMode', u'client'),
 (u'spark.ui.filters',
  u'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
 (u'spark.org.apache.hadoop.yarn.server.webproxy.amfilter.A
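
The list returned by `sc.getConf().getAll()` is long, so it can be easier to turn it into a dictionary and look up only the settings you care about. The sketch below uses a few pairs copied from the output above; the helper name `describe_executors` is made up for this example.

```python
# A few (key, value) pairs copied from the output above.
conf_pairs = [
    ("spark.executor.memory", "2688m"),
    ("spark.executor.instances", "2"),
    ("spark.submit.deployMode", "client"),
    ("spark.yarn.am.memory", "640m"),
]

# dict() accepts an iterable of pairs, so the result of getAll() converts directly.
conf = dict(conf_pairs)


def describe_executors(conf):
    """Summarise executor sizing; absent settings come back as None."""
    return {
        "instances": conf.get("spark.executor.instances"),
        "memory": conf.get("spark.executor.memory"),
        "deploy_mode": conf.get("spark.submit.deployMode"),
    }


print(describe_executors(conf))
```

In the notebook itself, `dict(sc.getConf().getAll())` would replace the hand-copied list.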

In [46]:
spark

In [47]:
# Creating a dataframe from a range of numbers
df = spark.range(10).toDF("numbers")

In [48]:
# Printing the schema
df.printSchema()

root
 |-- numbers: long (nullable = false)



In [49]:
# Displaying the dataframe
df.show()

+-------+
|numbers|
+-------+
|      0|
|      1|
|      2|
|      3|
|      4|
|      5|
|      6|
|      7|
|      8|
|      9|
+-------+



In [50]:
# Fetching the first two rows as Row objects
df.take(2)

[Row(numbers=0), Row(numbers=1)]

In [51]:
# Counting the number of entries in the DF
df.count()

10

### Access to Data

All the code in this repository reads data from a `GCS` bucket which I own. I have set the billing property of this bucket to `requester pays`. If you want, you may use it at your own expense.

To validate this, make sure the data is available in one of the GCS buckets.

In [59]:
%%bash
gsutil ls gs://reddys-data-for-experimenting/flight-data/csv/

gs://reddys-data-for-experimenting/flight-data/csv/2010-summary.csv
gs://reddys-data-for-experimenting/flight-data/csv/2011-summary.csv
gs://reddys-data-for-experimenting/flight-data/csv/2012-summary.csv
gs://reddys-data-for-experimenting/flight-data/csv/2013-summary.csv
gs://reddys-data-for-experimenting/flight-data/csv/2014-summary.csv
gs://reddys-data-for-experimenting/flight-data/csv/2015-summary.csv
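
Because the bucket is `requester pays`, a listing from outside the owner's project generally has to name a project to bill. The sketch below builds the `gsutil` command with its top-level `-u` option, which sets that billing project. The helper names are assumptions for this example, and any project id you pass is a placeholder for your own.

```python
import subprocess


def gsutil_ls_command(url, billing_project=None):
    """Build a `gsutil ls` command, optionally naming a project to bill."""
    command = ["gsutil"]
    if billing_project:
        # -u names the project billed for requests to a requester-pays bucket.
        command += ["-u", billing_project]
    return command + ["ls", url]


def list_objects(url, billing_project=None):
    """Run the command and return one URL per line (needs gsutil on PATH)."""
    output = subprocess.check_output(gsutil_ls_command(url, billing_project))
    return output.decode("utf-8").splitlines()
```

For example, `list_objects("gs://reddys-data-for-experimenting/flight-data/csv/", "your-project-id")` would run the same listing as the cell above, billed to `your-project-id`.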


In [60]:
flightData = spark \
    .read \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .csv("gs://reddys-data-for-experimenting/flight-data/csv/2015-summary.csv")
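
With `inferSchema` enabled, Spark makes an extra pass over the file to guess the column types. As an alternative sketch, the schema can be stated up front as a DDL string, which `DataFrameReader.schema` accepts in recent Spark versions. The column types below are taken from the `ReadSchema` shown in the physical plans later in this notebook; the names `FLIGHT_COLUMNS`, `ddl_schema` and `read_flights` are assumptions for this example.

```python
# Column names and types as reported by ReadSchema in the explain() output.
FLIGHT_COLUMNS = [
    ("DEST_COUNTRY_NAME", "STRING"),
    ("ORIGIN_COUNTRY_NAME", "STRING"),
    ("count", "INT"),
]


def ddl_schema(columns):
    """Render (name, type) pairs as a DDL schema string.

    Names are backtick-quoted so that a column called `count` cannot be
    confused with a SQL keyword or function name.
    """
    return ", ".join("`{}` {}".format(name, dtype) for name, dtype in columns)


def read_flights(spark, path):
    """Read a flight summary CSV with an explicit schema (no inference pass)."""
    return (
        spark.read
        .schema(ddl_schema(FLIGHT_COLUMNS))
        .option("header", "true")
        .csv(path)
    )


print(ddl_schema(FLIGHT_COLUMNS))
```

For a small file like this one the difference is negligible, but an explicit schema also guards against a column silently changing type between files.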

In [61]:
flightData.take(3)

[Row(DEST_COUNTRY_NAME=u'United States', ORIGIN_COUNTRY_NAME=u'Romania', count=15),
 Row(DEST_COUNTRY_NAME=u'United States', ORIGIN_COUNTRY_NAME=u'Croatia', count=1),
 Row(DEST_COUNTRY_NAME=u'United States', ORIGIN_COUNTRY_NAME=u'Ireland', count=344)]

In [62]:
flightData.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#92 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#92 ASC NULLS FIRST, 200)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#90,ORIGIN_COUNTRY_NAME#91,count#92] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://reddys-data-for-experimenting/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>


In [63]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [64]:
flightData.sort("count").explain()

== Physical Plan ==
*(2) Sort [count#92 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#92 ASC NULLS FIRST, 5)
   +- *(1) FileScan csv [DEST_COUNTRY_NAME#90,ORIGIN_COUNTRY_NAME#91,count#92] Batched: false, Format: CSV, Location: InMemoryFileIndex[gs://reddys-data-for-experimenting/flight-data/csv/2015-summary.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>
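
The only difference between the two plans is the last argument of `rangepartitioning`: the number of shuffle partitions drops from 200 (Spark's default for `spark.sql.shuffle.partitions`) to 5. The sketch below pulls that number out of the plan text so the change can be checked programmatically. It relies on the plan format printed above, which may differ between Spark versions, and the helper name `exchange_partitions` is made up for this example.

```python
import re

# The two "Exchange" lines copied from the physical plans above.
PLAN_BEFORE = "+- Exchange rangepartitioning(count#92 ASC NULLS FIRST, 200)"
PLAN_AFTER = "+- Exchange rangepartitioning(count#92 ASC NULLS FIRST, 5)"

# Captures the trailing integer argument of rangepartitioning(...).
_EXCHANGE = re.compile(r"rangepartitioning\(.*,\s*(\d+)\)")


def exchange_partitions(plan_text):
    """Return the partition count of the first range Exchange, or None."""
    match = _EXCHANGE.search(plan_text)
    return int(match.group(1)) if match else None


print(exchange_partitions(PLAN_BEFORE))  # 200
print(exchange_partitions(PLAN_AFTER))   # 5
```

Inside the notebook, `spark.conf.get("spark.sql.shuffle.partitions")` returns the current setting directly.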


In [65]:
flightData.sort("count").take(2)

[Row(DEST_COUNTRY_NAME=u'United States', ORIGIN_COUNTRY_NAME=u'Singapore', count=1),
 Row(DEST_COUNTRY_NAME=u'Moldova', ORIGIN_COUNTRY_NAME=u'United States', count=1)]

In [66]:
# Stopping spark context
sc.stop()
spark.stop()

`Info: If all of the above cells have executed successfully, please go ahead and start with the book as mentioned in the README.md`