# Working with Spark
Spark users can access files, tables, or streams stored on the Iguazio data platform through the native Spark DataFrame interfaces. <br>
The Iguazio drivers for Spark implement the data-source API and support predicate push-down: queries are passed to the Iguazio database, which returns only the relevant data. This enables accelerated, high-speed access from Spark to data stored in the Iguazio database. For more details, read the [Spark API documentation]().

## Loading a file from AWS S3 into the Iguazio file system


In [None]:
%%sh 
mkdir -p /v3io/${V3IO_HOME}/examples
curl -L "iguazio-sample-data.s3.amazonaws.com/2018-03-26_BINS_XETR08.csv" > /v3io/${V3IO_HOME}/examples/stocks.csv


## Initiating a Spark session 

In [4]:
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Working with Spark notebook").getOrCreate()

# Data Source
To use the Iguazio Spark connector to read or write NoSQL data in the platform, use the format method to set the DataFrame’s data-source format to the platform’s custom NoSQL data source, "io.iguaz.v3io.spark.sql.kv". See the following read and write examples:

## Reading the CSV file using a Spark DF
You can use the custom NoSQL DataFrame inferSchema read option to automatically infer the schema of the read table from its contents.

In [5]:
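# Base directory for the example data, under the running user's home directory
file_path = os.path.join(os.getenv('V3IO_HOME_URL'), 'examples')

# Read the CSV file, treating the first line as the column headers
df = spark.read.option("header", "true").csv(os.path.join(file_path, 'stocks.csv'))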
df.show()

+------------+--------+--------------------+------------+--------+----------+----------+-----+----------+--------+--------+--------+------------+--------------+
|        ISIN|Mnemonic|        SecurityDesc|SecurityType|Currency|SecurityID|      Date| Time|StartPrice|MaxPrice|MinPrice|EndPrice|TradedVolume|NumberOfTrades|
+------------+--------+--------------------+------------+--------+----------+----------+-----+----------+--------+--------+--------+------------+--------------+
|CH0038389992|    BBZA|BB BIOTECH NAM.  ...|Common stock|     EUR|   2504244|2018-03-26|08:00|      56.4|    56.4|    56.4|    56.4|         320|             4|
|CH0038863350|    NESR|NESTLE NAM.      ...|Common stock|     EUR|   2504245|2018-03-26|08:00|     63.04|   63.06|      63|   63.06|         314|             3|
|LU0378438732|    C001|COMSTAGE-DAX UCIT...|         ETF|     EUR|   2504271|2018-03-26|08:00|    113.42|  113.42|  113.42|  113.42|         100|             1|
|LU0411075020|    DBPD|XTR.SHORTDA

## Writing the Spark DF into a table in Iguazio DB
Specify the path to the NoSQL table that is associated with the DataFrame as a fully qualified path of the format `v3io://<container name>/<data path>`, where `<container name>` is the name of the table’s parent container and `<data path>` is the path to the data within that container.
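
The path format described above can be sketched with a small helper. This is only an illustration: `v3io_table_path` is a hypothetical name, not part of the platform's API, and the sample container and data path are the ones that appear in the conditional-update example later in this notebook.

```python
def v3io_table_path(container: str, data_path: str) -> str:
    """Build a fully qualified v3io://<container name>/<data path> URL."""
    # Strip stray slashes so the two parts join with exactly one separator.
    return "v3io://{}/{}".format(container.strip("/"), data_path.strip("/"))


# Hypothetical usage: "users" is the container, the rest is the data path.
print(v3io_table_path("users", "iguazio/examples/stocks_tab"))
```

In the cells of this notebook the same prefix comes from the `V3IO_HOME_URL` environment variable, so the helper is useful mainly when a path has to be assembled by hand.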

In [6]:
# Specify the DB index key using the key option (note that the key must be unique)
df.write.format("io.iguaz.v3io.spark.sql.kv") \
    .mode("append") \
    .option("key", "ISIN") \
    .option("allow-overwrite-schema", "true") \
    .save(os.path.join(file_path)+'/stocks_tab/')

## Reading a table via Spark DF

In [9]:
spark.read.format("io.iguaz.v3io.spark.sql.kv").load(os.path.join(file_path)+'/stocks_tab').show()

+------------+--------+--------------------+------------+--------+----------+----------+-----+----------+--------+--------+--------+------------+--------------+
|        ISIN|Mnemonic|        SecurityDesc|SecurityType|Currency|SecurityID|      Date| Time|StartPrice|MaxPrice|MinPrice|EndPrice|TradedVolume|NumberOfTrades|
+------------+--------+--------------------+------------+--------+----------+----------+-----+----------+--------+--------+--------+------------+--------------+
|IE00BZ163H91|    VGEB|VAN.EUR EUROZ.G.B...|         ETF|     EUR|   2749243|2018-03-26|08:31|    25.185|  25.185|  25.185|  25.185|         600|             1|
|DE0005933964|    EXI1|ISHARES SLI UCITS...|         ETF|     EUR|   2504990|2018-03-26|08:56|      80.6|    80.6|    80.6|    80.6|          75|             1|
|IE00B1FZS350|    IQQ6|ISHSII-DEV.MKT.PR...|         ETF|     EUR|   2505587|2018-03-26|08:41|     19.63|   19.63|   19.63|   19.63|        1146|             1|
|DE0005008007|     ADL|ADLER REAL 

## Writing to Parquet

In [10]:
df.write.parquet(os.path.join(file_path)+'/stocks_tab.parquet')

## Table Schema-Overwrite Examples
The following example creates a table named mytable with AttrA and AttrB attributes of type string and an AttrC attribute of type long, and then overwrites the table schema to change the type of AttrC to double:

In [12]:
df = spark.createDataFrame([
    ("a", "z", 123),
    ("b", "y", 456)
], ["AttrA", "AttrB", "AttrC"])
df.write.format("io.iguaz.v3io.spark.sql.kv") \
    .mode("overwrite") \
    .option("key", "AttrA") \
    .save(os.path.join(file_path)+'/mytable/')
    
df = spark.createDataFrame([
    ("c", "x", 32.12),
    ("d", "v", 45.2)
], ["AttrA", "AttrB", "AttrC"])
df.write.format("io.iguaz.v3io.spark.sql.kv") \
    .mode("append") \
    .option("key", "AttrA") \
    .option("allow-overwrite-schema", "true") \
    .save(os.path.join(file_path)+'/mytable/')

## Creating a partitioned table

This example creates a partitioned “weather” table. The option("partition", "year, month, day, hour") write option partitions the table by the year, month, day, and hour item attributes. If you browse the container in the dashboard after running the example, you’ll see that the weather directory has year=<value>/month=<value>/day=<value>/hour=<value> partition directories that match the written items. If you select any of the nested hour partition directories, you can see the written items and their attributes. For example, the first item (with attribute values 2016, 3, 25, 17, 18, 0.2, 62) is saved to a 201632517 file in a weather/year=2016/month=3/day=25/hour=17 partition directory.

In [15]:
from pyspark.sql.functions import concat

table_path = os.path.join(os.getenv('V3IO_HOME_URL')+'/examples/weather/')

df = spark.createDataFrame([
    (2016,  3, 25, 17, 18, 0.2, 62),
    (2016,  7, 24,  7, 19, 0.0, 52),
    (2016, 12, 24,  9, 10, 0.1, 47),
    (2017,  5,  7, 14, 21, 0.0, 70),
    (2017, 11,  1, 10, 15, 0.0, 34),
    (2017, 12, 12, 16, 12, 0.0, 47),
    (2017, 12, 24, 17, 11, 1.0, 50),
    (2018,  1, 18, 17, 10, 2.0, 45),
    (2018,  5, 20, 21, 20, 0.0, 59),
    (2018, 11,  1, 11, 11, 0.1, 65)
], ["year", "month", "day", "hour", "degrees_cel", "rain_ml", "humidity_per"])
df_with_key = df.withColumn(
    "time", concat(df["year"], df["month"], df["day"], df["hour"]))
df_with_key.write.format("io.iguaz.v3io.spark.sql.kv") \
    .mode("overwrite") \
    .option("key", "time") \
    .option("partition", "year, month, day, hour") \
    .save(table_path)
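
How the key and the partition directory of each written item are derived can be sketched in plain Python. This is a minimal illustration and not the connector's implementation: `item_key` and `partition_dir` are hypothetical helpers, and the directory layout assumes one `<attribute>=<value>` level per attribute named in the partition option of the write call above.

```python
def item_key(year, month, day, hour):
    # Mirrors concat(year, month, day, hour): the values are joined
    # as strings, without any zero-padding.
    return "{}{}{}{}".format(year, month, day, hour)


def partition_dir(table, year, month, day, hour):
    # One <attribute>=<value> directory level per partition attribute (assumed layout).
    return "{}/year={}/month={}/day={}/hour={}".format(table, year, month, day, hour)


first_row = (2016, 3, 25, 17)  # year, month, day, hour of the first written item
print(item_key(*first_row))
print(partition_dir("weather", *first_row))
```

Because the values are not zero-padded, two different timestamps can in principle concatenate to the same string; for example, (2016, 1, 12, 3) and (2016, 11, 2, 3) both yield 20161123. The sample data has no such collision, but zero-padding the month, day, and hour would be one way to keep the key unique on larger data sets.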

## Reading from a partitioned table
The following cells read the partitioned table and show the output of each read. The filtered results are gathered by scanning only the partition directories that match the filter criteria.
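
The idea of scanning only the matching partition directories can be illustrated with a pure-Python sketch. It does not reflect how the connector is implemented: `parse` and `prune` are hypothetical helpers, and the directory names assume the `year=<value>/month=<value>/day=<value>/hour=<value>` layout implied by the partition option used when the table was written.

```python
# Partition directories for the first four sample rows (assumed layout).
partitions = [
    "year=2016/month=3/day=25/hour=17",
    "year=2016/month=7/day=24/hour=7",
    "year=2016/month=12/day=24/hour=9",
    "year=2017/month=5/day=7/hour=14",
]


def parse(partition):
    # Turn "year=2016/month=3/..." into {"year": 2016, "month": 3, ...}.
    return {k: int(v) for k, v in (part.split("=") for part in partition.split("/"))}


def prune(partitions, predicate):
    # Keep only the directories whose partition values satisfy the filter;
    # the remaining directories never need to be opened.
    return [p for p in partitions if predicate(parse(p))]


to_scan = prune(partitions, lambda p: p["month"] > 6)
print(to_scan)
```

With the `month > 6` filter only two of the four directories remain, so the items under the other two are never read.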


### Full table read

In [16]:
readDF = spark.read.format("io.iguaz.v3io.spark.sql.kv").load(table_path)
readDF.show()

+----+-----+---+----+-----------+-------+------------+----------+
|year|month|day|hour|degrees_cel|rain_ml|humidity_per|      time|
+----+-----+---+----+-----------+-------+------------+----------+
|2016|   12| 24|   9|         10|    0.1|          47| 201612249|
|2016|    3| 25|  17|         18|    0.2|          62| 201632517|
|2016|    7| 24|   7|         19|    0.0|          52|  20167247|
|2017|   11|  1|  10|         15|    0.0|          34| 201711110|
|2017|   12| 12|  16|         12|    0.0|          47|2017121216|
|2017|   12| 24|  17|         11|    1.0|          50|2017122417|
|2017|    5|  7|  14|         21|    0.0|          70|  20175714|
|2018|    1| 18|  17|         10|    2.0|          45| 201811817|
|2018|   11|  1|  11|         11|    0.1|          65| 201811111|
|2018|    5| 20|  21|         20|    0.0|          59| 201852021|
+----+-----+---+----+-----------+-------+------------+----------+



### month > 6 filter: retrieve all data for the last six months of each year

In [17]:
readDF = spark.read.format("io.iguaz.v3io.spark.sql.kv").load(table_path) \
    .filter("month > 6")
readDF.show()

+----+-----+---+----+-----------+-------+------------+----------+
|year|month|day|hour|degrees_cel|rain_ml|humidity_per|      time|
+----+-----+---+----+-----------+-------+------------+----------+
|2016|   12| 24|   9|         10|    0.1|          47| 201612249|
|2016|    7| 24|   7|         19|    0.0|          52|  20167247|
|2017|   11|  1|  10|         15|    0.0|          34| 201711110|
|2017|   12| 12|  16|         12|    0.0|          47|2017121216|
|2017|   12| 24|  17|         11|    1.0|          50|2017122417|
|2018|   11|  1|  11|         11|    0.1|          65| 201811111|
+----+-----+---+----+-----------+-------+------------+----------+



### month == 12 AND day == 24 filter — retrieve all hours on Dec 24 each year:

In [18]:
readDF = spark.read.format("io.iguaz.v3io.spark.sql.kv").load(table_path) \
    .filter("month == 12 AND day == 24")
readDF.show()

+----+-----+---+----+-----------+-------+------------+----------+
|year|month|day|hour|degrees_cel|rain_ml|humidity_per|      time|
+----+-----+---+----+-----------+-------+------------+----------+
|2016|   12| 24|   9|         10|    0.1|          47| 201612249|
|2017|   12| 24|  17|         11|    1.0|          50|2017122417|
+----+-----+---+----+-----------+-------+------------+----------+



### month < 7 AND hour >= 8 AND hour <= 20 filter: retrieve 08:00–20:00 data for every day in the first six months of each year

In [19]:

readDF = spark.read.format("io.iguaz.v3io.spark.sql.kv").load(table_path) \
    .filter("month < 7 AND hour >= 8 AND hour <= 20")
readDF.show()

+----+-----+---+----+-----------+-------+------------+---------+
|year|month|day|hour|degrees_cel|rain_ml|humidity_per|     time|
+----+-----+---+----+-----------+-------+------------+---------+
|2016|    3| 25|  17|         18|    0.2|          62|201632517|
|2017|    5|  7|  14|         21|    0.0|          70| 20175714|
|2018|    1| 18|  17|         10|    2.0|          45|201811817|
+----+-----+---+----+-----------+-------+------------+---------+



## Conditional update
This example demonstrates how to conditionally update NoSQL table items by using the condition write option. Each write call in the example is followed by matching read and show calls to read and display the value of the updated item in the target table after the write operation.

The first write command writes an item (row) to a “cars” table. The item’s reg_license primary-key (identity-column) attribute is set to 7843321, the model attribute is set to “Honda”, and the odometer attribute is set to 29321. The overwrite save mode is used to overwrite the table if it already exists and create it otherwise. The second write command uses the append save mode with the condition option `${odometer} > odometer`. In the output that follows the cell, the first read shows an odometer value of 29321 and the second read shows the updated value of 31718.

In [21]:
# Use the same table path for the writes and the reads
cars_path = os.path.join(file_path, 'cars/')

# First write: create (or overwrite) the table with the initial item
writeDF = spark.createDataFrame([("7843321", "Honda", 29321)],
                                ["reg_license", "model", "odometer"])
writeDF.write.format("io.iguaz.v3io.spark.sql.kv") \
    .option("key", "reg_license") \
    .mode("overwrite").save(cars_path)
readDF = spark.read.format("io.iguaz.v3io.spark.sql.kv") \
    .load(cars_path)
readDF.show()

# Second write: update the item subject to the write condition
writeDF = spark.createDataFrame([("7843321", "Honda", 31718)],
                                ["reg_license", "model", "odometer"])
writeDF.write.format("io.iguaz.v3io.spark.sql.kv") \
    .option("key", "reg_license") \
    .option("condition", "${odometer} > odometer") \
    .mode("append").save(cars_path)
readDF = spark.read.format("io.iguaz.v3io.spark.sql.kv") \
    .load(cars_path)
readDF.show()



+-----------+-----+--------+
|reg_license|model|odometer|
+-----------+-----+--------+
|    7843321|Honda|   29321|
+-----------+-----+--------+

+-----------+-----+--------+
|reg_license|model|odometer|
+-----------+-----+--------+
|    7843321|Honda|   31718|
+-----------+-----+--------+
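
The effect of a conditional write can be sketched with a plain dictionary standing in for the table. This is an illustration only, not the connector's logic: `conditional_update` and `newer` are hypothetical names, and the sketch assumes that `${odometer}` refers to the value in the written DataFrame and `odometer` to the value already stored in the table, which is consistent with the output above.

```python
# A dictionary keyed by reg_license stands in for the "cars" table.
table = {"7843321": {"model": "Honda", "odometer": 29321}}


def conditional_update(table, key, new_item, condition):
    """Replace table[key] with new_item only if condition(new_item, stored) holds."""
    stored = table[key]
    if condition(new_item, stored):
        table[key] = new_item
        return True
    return False


def newer(new, stored):
    # Plays the role of the "${odometer} > odometer" condition (assumed meaning).
    return new["odometer"] > stored["odometer"]


# 31718 > 29321, so this write is applied.
conditional_update(table, "7843321", {"model": "Honda", "odometer": 31718}, newer)
# 30000 < 31718, so this write is skipped and the stored item is unchanged.
conditional_update(table, "7843321", {"model": "Honda", "odometer": 30000}, newer)
print(table["7843321"]["odometer"])
```

Under this reading, a condition of this kind guards against an older odometer reading overwriting a newer one.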



# Running SQL queries using Presto
## Reading the stocks_tab table using SQL after it was written by the Spark DF


In [None]:
%sql select * from v3io.users."/iguazio/examples/stocks_tab" where tradedvolume > 20000

# Remove Data
When you are done, you can clean up the examples directory by running the following:

In [22]:
# Uncomment the following line to delete the example data
# !rm -rf $HOME/examples/*


To release the compute and memory resources used by Spark, we recommend running the following command:

In [23]:
spark.stop()