In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'

In [None]:
# Setup
import os
import pyspark
from splicemachine.spark.context import PySpliceContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
jdbc_host = os.environ['JDBC_HOST']

conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

splicejdbc=f"jdbc:splice://{jdbc_host}:1527/splicedb;user=splice;password=admin"

splice = PySpliceContext(spark, splicejdbc)



# Using our Spark Adapter

Data scientists have adopted Spark as the de facto data science platform, and Splice Machine provides an industry-leading in-process integration with a Spark cluster. This means data scientists and data engineers can use the full power of Spark to manipulate DataFrames while also getting full ANSI SQL with ACID-compliant transactions.
<sub>Learn more about ANSI [here](https://share.ansi.org/Shared%20Documents/News%20and%20Publications/Brochures/WhatIsANSI_brochure.pdf) and ACID compliance [here](https://www.clustrix.com/bettersql/acid-compliance-means-care/).</sub>

The Splice Machine Spark adapter provides:

* A durable, ACID-compliant persistence model for Spark DataFrames.
* Lazy result sets returned as Spark DataFrames.
* Access to Spark libraries such as MLlib and GraphX.
* Avoidance of expensive ETL of data from OLTP to OLAP.

This notebook is our first look at writing code in <b>Python</b> using <b>PySpark</b>, the Python API for Spark.

Learn more about <b>Python</b> [here](https://docs.python.org/3/tutorial/index.html).

Learn more about <b>Spark</b> and <b>PySpark</b> [here](https://spark.apache.org/docs/latest/api/python/index.html).

This notebook demonstrates using the Spark Adapter with Python, in these steps:

1. Import the `PySpliceContext` class to interface with the Python API.
2. Create a `PySpliceContext` that connects to Splice Machine.
3. Create a simple table in Splice Machine.
4. Create a Spark DataFrame and insert it into Splice Machine.
5. Run a simple Splice Machine transaction using the Splice context.
6. Roll back the transaction and view the results in the table.

<br />

## 1. Import the PySpliceContext Class

Our first step is to import the `PySpliceContext` class:


In [None]:
from splicemachine.spark.context import PySpliceContext

## 2. Create a Splice Context

Next, we use the `PySpliceContext` to create a connection to Splice Machine:
<sub>We can also use `inspect` to view the source code of the `PySpliceContext` class.</sub>
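
The JDBC URL follows the pattern used in this notebook's setup cell. As a rough sketch, the helper below assembles a URL of that shape from its parts. `build_splice_jdbc_url` is a hypothetical name that is not part of the `splicemachine` package, and the default port, database name, user, and password are simply the values from the setup cell, so they are assumptions that may not match your cluster.

```python
def build_splice_jdbc_url(host, port=1527, database="splicedb",
                          user="splice", password="admin"):
    """Assemble a Splice Machine JDBC URL (hypothetical helper).

    The defaults mirror the values used in this notebook's setup cell and
    are assumptions; replace them with the values for your own cluster.
    """
    if not host:
        raise ValueError("host must be a non-empty string")
    return f"jdbc:splice://{host}:{port}/{database};user={user};password={password}"


# Host name taken from the setup cell at the top of this notebook
example_url = build_splice_jdbc_url("jrtest01-splice-hregion")
print(example_url)
```

A URL built this way could be passed to `PySpliceContext(spark, ...)` in place of the value typed at the prompt.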

In [None]:
import inspect
jdbcURL = input('Enter your JDBC URL (shown at the bottom of the cloud UI for your cluster): ')
splice = PySpliceContext(spark, jdbcURL)
print(inspect.getsource(PySpliceContext))

## 3. Create a Simple Table

Now we create a simple table in Splice Machine that we'll subsequently populate:


In [None]:
%%sql 
drop table if exists foo;
create table foo (I int, F float, V varchar(100), primary key (I));


## 4. Create a Spark Dataframe and Insert into Splice Machine

Then we use `pyspark` to create a Spark dataframe from some sample data, and insert that into our Splice Machine table.

After inserting the data, we do a `select *` to display the contents of the Splice Machine table. 
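
Before inserting, it can help to check the sample data against the table definition from step 3 (`I int`, `F float`, `V varchar(100)`, primary key on `I`). The sketch below does this in plain Python, with no Spark required. `validate_foo_rows` is a hypothetical helper written for this notebook, not a Splice Machine or PySpark function, and it only covers the three constraints named here.

```python
def validate_foo_rows(rows, max_varchar=100):
    """Return a list of problems found in rows destined for table foo.

    Hypothetical helper: checks column count, basic Python types,
    the varchar length limit, and primary-key uniqueness on column I.
    """
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        if len(row) != 3:
            problems.append(f"row {index}: expected 3 columns, got {len(row)}")
            continue
        i_value, f_value, v_value = row
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(i_value, int) or isinstance(i_value, bool):
            problems.append(f"row {index}: I must be an int")
        elif i_value in seen_keys:
            problems.append(f"row {index}: duplicate primary key I={i_value}")
        else:
            seen_keys.add(i_value)
        if not isinstance(f_value, (int, float)) or isinstance(f_value, bool):
            problems.append(f"row {index}: F must be numeric")
        if not isinstance(v_value, str) or len(v_value) > max_varchar:
            problems.append(f"row {index}: V must be a string of at most {max_varchar} characters")
    return problems


sample_rows = [(0, 3.14, 'Turing'), (1, 4.14, 'Newell'), (2, 5.14, 'Simon'), (3, 6.14, 'Minsky')]
print(validate_foo_rows(sample_rows))  # prints []
```

An empty list means the sample data satisfies the checks; the database itself remains the final authority on what it accepts.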

In [None]:
from pyspark.sql import Row
l = [(0,3.14,'Turing'), (1,4.14,'Newell'), (2,5.14,'Simon'), (3,6.14,'Minsky')]
rdd = sc.parallelize(l)
rows = rdd.map(lambda x: Row(I=x[0], F=float(x[1]), V=str(x[2])))
schemaRows = spark.createDataFrame(rows)
splice.insert(schemaRows,'foo')


In [None]:
%%sql 
select * from foo;

## 5. Run a Simple Splice Machine Transaction

Now we'll add more data to that table in a transactional context: 

In [None]:
conn = splice.getConnection()
conn.setAutoCommit(False)
l = [(4,3.14,'Turing'), (5,4.14,'Newell'), (6,5.14,'Simon'), (7,6.14,'Minsky')]
rdd = sc.parallelize(l)
rows = rdd.map(lambda x: Row(I=x[0], F=float(x[1]), V=str(x[2])))
schemaRows = spark.createDataFrame(rows)
splice.insert(schemaRows,'foo')
df = splice.df("select * from foo")
# Display the current contents of foo
df.show()

## 6. Roll Back the Transaction

Finally, we'll roll back the transaction we just ran:
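
To see what a rollback does in isolation, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Splice Machine. It is not the Splice Machine API, and the names `demo_conn`, `first_batch`, and `second_batch` are illustrative only. The sketch commits a first batch of rows, inserts a second batch without committing, and then rolls back.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table foo (I int primary key, F float, V varchar(100))")

# First batch: inserted and committed, so it survives a later rollback
first_batch = [(0, 3.14, "Turing"), (1, 4.14, "Newell"), (2, 5.14, "Simon"), (3, 6.14, "Minsky")]
demo_conn.executemany("insert into foo values (?, ?, ?)", first_batch)
demo_conn.commit()

# Second batch: inserted inside an open transaction and never committed
second_batch = [(4, 3.14, "Turing"), (5, 4.14, "Newell"), (6, 5.14, "Simon"), (7, 6.14, "Minsky")]
demo_conn.executemany("insert into foo values (?, ?, ?)", second_batch)
count_before_rollback = demo_conn.execute("select count(*) from foo").fetchone()[0]

# Rolling back discards the uncommitted second batch
demo_conn.rollback()
count_after_rollback = demo_conn.execute("select count(*) from foo").fetchone()[0]

print(count_before_rollback, count_after_rollback)  # prints 8 4
demo_conn.close()
```

The same pattern is what the cell below exercises against Splice Machine: rows added while auto-commit is off are expected to disappear once the transaction is rolled back.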


In [None]:
conn.rollback()
df = splice.df("select * from foo")
# Display the contents of foo after the rollback
df.show()

## Where to Go Next

The next notebook in this presentation shows an example of <a href="./3.7%20Python%20MLlib%20example.ipynb">Using the Spark Machine Learning Library (MLlib) with Splice Machine.</a>
