# Entering Spark World

## 1. Using SparkContext (Spark Version < 2.x)
In Spark versions before 2.x, we created a SparkConf and a SparkContext to interact with Spark, as shown below. The same works in Spark 2 as well, but there SparkSession (not SparkContext) is the main entry point.

### Creating sconf variable using SparkConf (which will be used for SparkContext Creation) 

In [1]:
from pyspark import SparkConf
from pyspark import SparkContext

In [2]:
sconf = SparkConf().setAppName("SparkContext_Usage").set("spark.executor.memory","6g").set("spark.executor.cores","3")

### Creating SparkContext 'sc'

SparkContext: Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.


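Note: the pyspark shell already provides a SparkContext named sc. If we stop it (sc.stop()) in order to create a SparkContext under a new name, sc comes up again automatically. So do NOT create a new SparkContext next to "sc": it will be killed with the error "SparkContext has been shutdown". Stop sc and re-create it under the same name instead, as in the next cell.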

In [4]:
sc.stop()
sc = SparkContext(conf=sconf)

### Get the Configuration Details

In [5]:
# sc.getConf().get("spark.executor.memory")  # alternative on versions where sc.getConf() exists
sconf.get("spark.executor.memory")

u'6g'

In [6]:
sconf.set("spark.executor.memory","4g")
sconf.get("spark.executor.memory")

u'4g'

## SQLContext: To Create DataFrames

### Creating SQLContext 'sqlCtx'

The entry point into all relational functionality in Spark is the SQLContext class. It enables working with structured data (rows and columns) in Spark, and allows the creation of DataFrame objects as well as the execution of SQL queries.

When created, SQLContext adds a method called toDF to RDD, which can be used to convert an RDD into a DataFrame. It is shorthand for SQLContext.createDataFrame().

Make sure you use toDF() only after you create the SQLContext.
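
Why the order matters can be illustrated with a toy model in plain Python. This is only a sketch of the idea described above (a context that attaches toDF to the RDD class when it is created); ToyRDD and ToySQLContext are hypothetical stand-ins and not the real pyspark classes.

```python
# Toy stand-ins for illustration only, NOT the real pyspark classes.
class ToyRDD(object):
    def __init__(self, rows):
        self.rows = list(rows)


class ToySQLContext(object):
    def __init__(self):
        ctx = self

        def toDF(rdd, names=None):
            # Shorthand that simply forwards to createDataFrame().
            return ctx.createDataFrame(rdd, names)

        # Attaching the method to the class makes it visible on every
        # ToyRDD instance, including ones created before the context.
        ToyRDD.toDF = toDF

    def createDataFrame(self, rdd, names=None):
        width = len(rdd.rows[0])
        # Without explicit names, fall back to the _1, _2, ... pattern.
        names = names or ["_%d" % (i + 1) for i in range(width)]
        return [dict(zip(names, row)) for row in rdd.rows]


rdd = ToyRDD([("Naresh", 1), ("Bhanu", 2), ("Ravi", 3)])
before = hasattr(rdd, "toDF")
print(before)                # False: no context has been created yet
ToySQLContext()
after = hasattr(rdd, "toDF")
print(after)                 # True: creating the context patched the class
```

In this toy model, calling rdd.toDF() before ToySQLContext() exists raises an AttributeError, which mirrors the reason for the ordering rule above.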

In [7]:
from pyspark.sql import SQLContext

sqlCtx = SQLContext(sc)
rdd = sc.parallelize([('Naresh',1),('Bhanu',2),('Ravi',3)])
rdd.toDF().show()

+------+---+
|    _1| _2|
+------+---+
|Naresh|  1|
| Bhanu|  2|
|  Ravi|  3|
+------+---+



#### Load the data from HDFS and Create DF using sqlCtx

The data file used below is available in this repo under the data folder. Load this file into HDFS under the /tmp directory.


In [8]:
file = sc.textFile("/tmp/people.txt")
line = file.map(lambda x: x.split(" ")).map(lambda t: (t[0],t[1],int(t[2]),t[3]))
df = sqlCtx.createDataFrame(line,["name","gender","age","language"])
df.show()



+-------+------+---+--------+
|   name|gender|age|language|
+-------+------+---+--------+
| naresh|     M| 30|  python|
|   ravi|     M| 22|       c|
|akansha|     F| 34|    java|
| ravina|     F| 33|    ruby|
|  bhanu|     M| 35|    java|
|  vikas|     M| 11|   shell|
|   jiya|     F| 39|    perl|
+-------+------+---+--------+
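
The parsing lambda above assumes that every line has exactly four space-separated fields and an integer age. A more defensive version could look like the sketch below, written in plain Python so it can be tried without a cluster. parse_person is a hypothetical helper name, and the two malformed sample lines are made up for illustration.

```python
def parse_person(line):
    """Split one space-separated record into (name, gender, age, language)."""
    fields = line.split(" ")
    if len(fields) != 4:
        return None  # wrong number of fields
    name, gender, age, language = fields
    try:
        return (name, gender, int(age), language)
    except ValueError:
        return None  # age is not an integer


# Two well-formed records plus two deliberately malformed ones.
sample = ["naresh M 30 python", "ravi M 22 c", "broken line", "jiya F x perl"]
rows = [r for r in map(parse_person, sample) if r is not None]
print(rows)  # [('naresh', 'M', 30, 'python'), ('ravi', 'M', 22, 'c')]
```

On an RDD, the same helper could presumably be applied as file.map(parse_person).filter(lambda r: r is not None) before calling createDataFrame.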



## HiveContext: To access Hive DB/Tables or save data to Hive

### Creating HiveContext 'hqlCtx'

DataFrames can also be saved as persistent tables in the Hive metastore using the saveAsTable command. If you have configured your existing Hive setup (by putting hive-site.xml in the Spark conf directory), you can check the new table there. If you do not have a Hive metastore set up, Spark will create a default local Hive metastore (using Derby) for you.

Unlike the createOrReplaceTempView command, saveAsTable materializes the contents of the DataFrame and creates a pointer to the data in the Hive metastore. By default, saveAsTable creates a 'managed table', meaning that the location of the data is controlled by the metastore.

In [9]:
from pyspark.sql import HiveContext

hqlCtx = HiveContext(sc)

print("\nDropping Hive Table if exists")
hqlCtx.sql("DROP TABLE IF EXISTS default.nartest")

print("\nSave Data Frame to Hive table default.nartest")
df.write.saveAsTable("default.nartest")

print("\nQuery Newly created Hive Table ")
hqlCtx.sql("SELECT * FROM default.nartest").show()



Dropping Hive Table if exists

Save Data Frame to Hive table default.nartest

Query Newly created Hive Table 
+-------+------+---+--------+
|   name|gender|age|language|
+-------+------+---+--------+
| naresh|     M| 30|  python|
|   ravi|     M| 22|       c|
|akansha|     F| 34|    java|
| ravina|     F| 33|    ruby|
|  bhanu|     M| 35|    java|
|  vikas|     M| 11|   shell|
|   jiya|     F| 39|    perl|
+-------+------+---+--------+



## 2. Using SparkSession (Spark Version >= 2.x)

In Spark 2.0 and later, the same results can be achieved through SparkSession without explicitly creating SparkConf, SparkContext, SQLContext or HiveContext, as they are encapsulated within the SparkSession. Using the builder design pattern, it instantiates a SparkSession object if one does not already exist, along with its associated underlying contexts.

We will only create a SparkSession, 'myspark', and repeat the same steps as in the previous code.

pyspark.sql.SparkSession
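
The "instantiate only if one does not already exist" behaviour of the builder can be sketched in plain Python. This is a toy model under stated assumptions, not the pyspark implementation: ToySession and its Builder are hypothetical names, and the way a later builder's options are merged into the existing session is an assumption of the sketch.

```python
class ToySession(object):
    """Toy model of a get-or-create builder (not the pyspark class)."""
    _active = None  # the one shared session, if any

    def __init__(self, options):
        self.options = dict(options)

    class Builder(object):
        def __init__(self):
            self._options = {}

        def config(self, key, value):
            self._options[key] = value
            return self  # returning self allows chained calls

        def getOrCreate(self):
            if ToySession._active is None:
                # No session yet: create one from the collected options.
                ToySession._active = ToySession(self._options)
            else:
                # Assumption of this sketch: options given to a later
                # builder are applied to the existing session.
                ToySession._active.options.update(self._options)
            return ToySession._active


first = ToySession.Builder().config("app", "demo").getOrCreate()
second = ToySession.Builder().config("extra", "1").getOrCreate()
print(first is second)  # True: the second call reused the first session
```

The chained config(...).getOrCreate() calls have the same shape as the SparkSession.builder calls used in the cells that follow.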

In [10]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

Creating the sconf variable using SparkConf, which will be used for SparkSession creation. You can also pass the configuration through SparkSession's 'config' method. We use SparkConf here just to show how SparkConf is used with SparkSession.



In [11]:
sconf = SparkConf().set("spark.executor.memory","6g").set("spark.executor.cores","3")


### Creating SparkSession 'myspark'

SparkSession: New entry point for Spark functionality enabling programming Spark with the Dataset and DataFrame API.


In [12]:
myspark = SparkSession.builder.master("yarn").appName("SparkSession_Usage")\
        .config(conf=sconf).enableHiveSupport().getOrCreate()

In [13]:
print("\nGet Spark Executor Cores using spark.conf.get {same as sconf.get} : ")
print(myspark.conf.get("spark.executor.cores"))

print("\nSet Spark Executor Cores to 3 using spark.conf.set {same as sconf.set} ")
myspark.conf.set("spark.executor.cores","3")

print("\nGet Spark New Executor Cores using spark.conf.get {same as sconf.get} : ")
print(myspark.conf.get("spark.executor.cores"))


Get Spark Executor Cores using spark.conf.get {same as sconf.get} : 
3

Set Spark Executor Cores to 3 using spark.conf.set {same as sconf.set} 

Get Spark New Executor Cores using spark.conf.get {same as sconf.get} : 
3


#### Using spark.sparkContext.parallelize without creating any SparkContext

In [14]:
rdd = myspark.sparkContext.parallelize([('Naresh',1),('Bhanu',2),('Ravi',3)])

print("\nRDD to DF using rdd.toDF() and use df.show() \n ")

rdd.toDF().show()



RDD to DF using rdd.toDF() and use df.show() 
 
+------+---+
|    _1| _2|
+------+---+
|Naresh|  1|
| Bhanu|  2|
|  Ravi|  3|
+------+---+



## myspark: To Create DataFrames without SQLContext

The data file used below is available in this repo under the data folder. Load this file into HDFS under the /tmp directory.


In [15]:

print("\nLoad an HDFS file using spark.sparkContext.textFile and create DF using spark.createDataFrame() ")

file = myspark.sparkContext.textFile("/tmp/people.txt")

line = file.map(lambda x: x.split(" ")).map(lambda t: (t[0],t[1],int(t[2]),t[3]))

df = myspark.createDataFrame(line,["name","gender","age","language"])

print("\nPrinting DataFrame \n")
df.show()



Load an HDFS file using spark.sparkContext.textFile and create DF using spark.createDataFrame() 

Printing DataFrame 

+-------+------+---+--------+
|   name|gender|age|language|
+-------+------+---+--------+
| naresh|     M| 30|  python|
|   ravi|     M| 22|       c|
|akansha|     F| 34|    java|
| ravina|     F| 33|    ruby|
|  bhanu|     M| 35|    java|
|  vikas|     M| 11|   shell|
|   jiya|     F| 39|    perl|
+-------+------+---+--------+



## myspark: To access Hive DB/Tables or save data to Hive without HiveContext

In [16]:
print("\nDropping Hive Table if exists using spark.sql ")
myspark.sql("DROP TABLE IF EXISTS default.mytest")

print("\nSave Data Frame to Hive table default.mytest")
df.write.saveAsTable("default.mytest")

print("\nQuery Newly created Hive Table\n")
myspark.sql("SELECT * FROM default.mytest").show()

print("\nDeleting the table default.mytest")
myspark.sql("DROP TABLE default.mytest")



Dropping Hive Table if exists using spark.sql 

Save Data Frame to Hive table default.mytest

Query Newly created Hive Table

+-------+------+---+--------+
|   name|gender|age|language|
+-------+------+---+--------+
| naresh|     M| 30|  python|
|   ravi|     M| 22|       c|
|akansha|     F| 34|    java|
| ravina|     F| 33|    ruby|
|  bhanu|     M| 35|    java|
|  vikas|     M| 11|   shell|
|   jiya|     F| 39|    perl|
+-------+------+---+--------+


Deleting the table default.mytest


DataFrame[]



# Setting up Run Time Properties
    Only applicable when running a Spark job with spark2-submit

Setting up a new configuration for SparkContext/SparkSession.

A runtime property can ONLY be changed through SparkConf(), through SparkSession.config(), or by passing it as an argument.

What we did earlier, repeated below, is only a way to check the properties. If you open the Spark UI and check these properties, you will still find the default values there.

In [17]:
print("\nGet Spark Executor Cores using spark.conf.get {same as sconf.get} : ")
print(myspark.conf.get("spark.executor.cores"))

print("\nSet Spark Executor Cores to 3 using spark.conf.set {same as sconf.set} ")
myspark.conf.set("spark.executor.cores","3")

print("\nGet Spark New Executor Cores using spark.conf.get {same as sconf.get} : ")
print(myspark.conf.get("spark.executor.cores"))


Get Spark Executor Cores using spark.conf.get {same as sconf.get} : 
3

Set Spark Executor Cores to 3 using spark.conf.set {same as sconf.set} 

Get Spark New Executor Cores using spark.conf.get {same as sconf.get} : 
3
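
The third option mentioned above, passing properties as arguments, usually means adding --conf key=value pairs to the submit command. The sketch below only assembles such a command line in plain Python and does not launch anything. build_submit_command and the script name my_job.py are hypothetical, and the property values are just examples.

```python
def build_submit_command(script, properties, submit_bin="spark2-submit"):
    """Return the argument list for a submit call with --conf key=value pairs."""
    command = [submit_bin]
    for key in sorted(properties):  # sorted only to get a stable order
        command += ["--conf", "%s=%s" % (key, properties[key])]
    command.append(script)
    return command


cmd = build_submit_command(
    "my_job.py",  # hypothetical script name
    {"spark.executor.cores": "2", "spark.executor.memory": "5g"},
)
print(" ".join(cmd))
# spark2-submit --conf spark.executor.cores=2 --conf spark.executor.memory=5g my_job.py
```

The resulting list could be handed to subprocess.call on a machine where the submit script is installed; that step is left out here.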



## 1. Using SparkConf()

To set the properties permanently, create a new SparkConf() object, set the required properties there and use that to create a SparkContext/SparkSession. 

Use myspark.sparkContext.getConf().getAll() to get all properties.

You can also check the same in Spark UI under “Environment” Tab

In [18]:
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf().setAll([("spark.executor.cores","2"),("spark.executor.memory","5g"),("hive.exec.dynamic.partition","true")])
myspark.stop()
myspark = SparkSession.builder.master("yarn").appName("Spark_Conf_Usage").config(conf=conf).enableHiveSupport().getOrCreate()

In [19]:
myspark.sparkContext.getConf().get("spark.executor.memory")

u'5g'

In [20]:
# To see all Properties
#myspark.sparkContext.getConf().getAll()

## 2. Using SparkSession.config()

In [21]:
from pyspark.sql import SparkSession

sparkwarehouseLocation = "/tmp/spark-warehouse"

myspark.stop()
myspark = SparkSession.builder.master("yarn").appName("Spark_Conf_Usage")\
        .config("spark.sql.warehouse.dir",sparkwarehouseLocation)\
        .config("spark.serializer","org.apache.spark.serializer.KryoSerializer")\
        .config("spark.kryoserializer.buffer.max","1g")\
        .config("hive.exec.dynamic.partition", "true")\
        .config("hive.exec.dynamic.partition.mode", "nonstrict")\
        .enableHiveSupport().getOrCreate()


In [22]:
myspark.sparkContext.getConf().get("spark.kryoserializer.buffer.max")

u'1g'

In [23]:
myspark.sparkContext.getConf().get("spark.sql.warehouse.dir")

u'/tmp/spark-warehouse'

### What's Next
1. To download this single notebook, open this file in my GitHub account, copy the URL and paste it into http://nbviewer.jupyter.org/. The download button is in the top right corner.

2. Open your Jupyter Notebook home page and upload the file using the "upload" button.

3. Continue Learning from the next Notebook Spark_02_DataFrame_Operations.ipynb