In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'


In [None]:
# Setup: imports and worker Python configuration
import os
import pyspark
from splicemachine.spark.context import PySpliceContext


# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'

<link rel="stylesheet" href="https://doc.splicemachine.com/zeppelin/css/zepstyles2.css" />

# Using our Native Spark DataSource
This notebook demonstrates using the Spark Adapter with Python, in these steps:

1. *Import the `PySpliceContext` class to interface with the Python API.*
2. *Create a Spark context and a `PySpliceContext`.*
3. *Create a simple table in Splice Machine.*
4. *Create a Spark dataframe and insert it into Splice Machine.*
5. *Run a simple Splice Machine transaction using the Spark context.*
6. *Roll back that transaction using the same context.*

## About the Native Spark DataSource
Data scientists have adopted Spark as the de facto data science platform, and Splice Machine provides an industry-leading in-process integration with a Spark cluster. This means that data scientists and data engineers can use the full power of Spark to manipulate dataframes while also getting full ANSI, ACID-compliant SQL.

The Splice Machine Spark adapter provides:

* A durable, ACID-compliant persistence model for Spark Dataframes.
* Lazy result sets returned as Spark Dataframes.
* Access to Spark libraries such as MLlib and GraphX.
* Avoidance of expensive ETL of data from OLTP to OLAP.

## 1. Import the PySpliceContext Class

Your first step is to import the `PySpliceContext` class:

In [None]:
from splicemachine.spark.context import PySpliceContext

## 2. Create a Spark Context and PySpliceContext

Next, we create a Spark session.
Then, we use the `PySpliceContext` class to create a connection to Splice Machine:

In [None]:
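import os

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from splicemachine.spark.context import PySpliceContext

# Splice Machine host name, set in the first cell of this notebook
jdbc_host = os.environ['JDBC_HOST']

# Create the Spark session and take its SparkContext for building RDDs later
conf = SparkConf()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

# JDBC URL for the Splice Machine database
splicejdbc = f'jdbc:splice://{jdbc_host}:1527/splicedb;user=splice;password=admin'

# Connect the Spark session to Splice Machine
splice = PySpliceContext(spark, splicejdbc)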
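
The connection string above hard-codes the port, database name, and credentials. As a minimal sketch, the same URL could be assembled by a small helper that accepts overrides from environment-style settings. The helper name `build_splice_jdbc_url` and the variables `JDBC_PORT`, `SPLICE_USER`, and `SPLICE_PASSWORD` are hypothetical, introduced only for this illustration; the defaults mirror the values used in this notebook.

```python
import os


def build_splice_jdbc_url(env=None):
    """Assemble a Splice Machine JDBC URL from environment-style settings.

    Hypothetical helper: only JDBC_HOST appears in this notebook; the other
    variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    host = env["JDBC_HOST"]  # required; raises KeyError if it is missing
    port = env.get("JDBC_PORT", "1527")
    user = env.get("SPLICE_USER", "splice")
    password = env.get("SPLICE_PASSWORD", "admin")
    return f"jdbc:splice://{host}:{port}/splicedb;user={user};password={password}"


# With only the host set, this reproduces the URL used in this notebook.
example_url = build_splice_jdbc_url({"JDBC_HOST": "jrtest01-splice-hregion"})
print(example_url)
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to try without touching the real environment.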

## 3. Create a Simple Table

Now we create a simple table in Splice Machine that we'll subsequently populate:


In [None]:
%%sql 

create table DS.foo (I int, F float, V varchar(100), primary key (I));


## 4. Create a Spark Dataframe and Insert into Splice Machine

Then we use PySpark to create a Spark dataframe from some sample data, and insert it into our Splice Machine table.

<p class="noteNote">You can ignore the <code>RuntimeWarning:</code> warning messages that may display when you run the code in the next cell.</p>

After inserting the data, we run a `select *` to display the contents of the Splice Machine table.

In [None]:
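from pyspark.sql import Row

# Sample data: (I, F, V) tuples matching the columns of DS.foo
l = [(0,3.14,'Turing'), (1,4.14,'Newell'), (2,5.14,'Simon'), (3,6.14,'Minsky')]
rdd = sc.parallelize(l)
rows = rdd.map(lambda x: Row(I=x[0], F=float(x[1]), V=str(x[2])))
# Build the dataframe with the Spark session created in step 2
schemaRows = spark.createDataFrame(rows)
# Insert the dataframe directly into the Splice Machine table
splice.insert(schemaRows, 'DS.foo')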
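
The table definition in step 3 declares `I` as the primary key and `V` as `varchar(100)`, so rows that repeat a key or carry an over-long string would presumably be rejected by the database. A minimal sketch of checking the sample data on the driver before inserting it follows; `validate_rows` is a hypothetical helper written for this illustration and is not part of the Splice Machine API.

```python
def validate_rows(rows, max_v_len=100):
    """Check (I, F, V) tuples against the DS.foo column definitions.

    Hypothetical helper for illustration only. Raises ValueError on the
    first problem found and returns the number of rows checked otherwise.
    """
    seen_keys = set()
    for i, f, v in rows:
        if not isinstance(i, int):
            raise ValueError(f"I must be an int, got {i!r}")
        if i in seen_keys:
            raise ValueError(f"duplicate primary key I={i}")
        seen_keys.add(i)
        try:
            float(f)  # F must be convertible to a float
        except (TypeError, ValueError):
            raise ValueError(f"F must be numeric, got {f!r}")
        if len(str(v)) > max_v_len:
            raise ValueError(f"V is longer than varchar({max_v_len}) for I={i}")
    return len(seen_keys)


sample = [(0, 3.14, 'Turing'), (1, 4.14, 'Newell'), (2, 5.14, 'Simon'), (3, 6.14, 'Minsky')]
checked = validate_rows(sample)  # number of rows that passed the checks
```

A check like this only catches problems within the batch itself; keys that already exist in the table would still have to be detected by the database.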


### Native Spark DataSource

If you look closely, you'll see that we went straight from a Spark Dataframe into Splice Machine's database.

This is Splice Machine's Native Spark DataSource. Not only is this mechanism convenient, it is also very performant, leveraging parallelism in large datasets. The main Python API that we just used is the `PySpliceContext` class imported at the top of this notebook.

Here's the result:

In [None]:
%%sql 
select * from DS.foo;

## 5. Run a Simple Splice Machine Transaction

Now we'll add more data to that table in a transactional context: 

In [None]:
# Turn off auto-commit so the insert below stays in an open transaction
conn = splice.getConnection()
conn.setAutoCommit(False)
l = [(4,3.14,'Turing'), (5,4.14,'Newell'), (6,5.14,'Simon'), (7,6.14,'Minsky')]
rdd = sc.parallelize(l)
rows = rdd.map(lambda x: Row(I=x[0], F=float(x[1]), V=str(x[2])))
schemaRows = spark.createDataFrame(rows)
splice.insert(schemaRows, 'DS.foo')
# Read the table back and display it, including the newly inserted rows
df = splice.df("select * from DS.foo")
df.show()

## 6. Roll Back the Transaction

Finally, we'll roll back the transaction we just ran:


In [None]:
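# Undo the uncommitted insert from step 5
conn.rollback()
# Read the table back and display it; only the rows from step 4 should remain
df = splice.df("select * from DS.foo")
df.show()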
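
The insert-then-rollback pattern in steps 5 and 6 is standard transactional behavior and can be tried without a cluster. The sketch below uses Python's built-in `sqlite3` module as a local stand-in for Splice Machine, which is an assumption made purely for illustration: the table layout and sample rows follow this notebook, but none of the calls are Splice Machine APIs.

```python
import sqlite3

# In-memory SQLite database as a local stand-in (not Splice Machine).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table foo (I int primary key, F float, V varchar(100))")
demo_conn.executemany(
    "insert into foo values (?, ?, ?)",
    [(0, 3.14, "Turing"), (1, 4.14, "Newell"), (2, 5.14, "Simon"), (3, 6.14, "Minsky")],
)
demo_conn.commit()  # the first four rows are now committed

# In its default mode, sqlite3 opens a transaction implicitly before an insert.
demo_conn.executemany(
    "insert into foo values (?, ?, ?)",
    [(4, 3.14, "Turing"), (5, 4.14, "Newell"), (6, 5.14, "Simon"), (7, 6.14, "Minsky")],
)
count_in_txn = demo_conn.execute("select count(*) from foo").fetchone()[0]

demo_conn.rollback()  # discard the uncommitted rows
count_after_rollback = demo_conn.execute("select count(*) from foo").fetchone()[0]

print(count_in_txn, count_after_rollback)
demo_conn.close()
```

Inside the open transaction the query sees all eight rows; after the rollback only the four committed rows remain, which is the same outcome the two `select *` queries in steps 5 and 6 are meant to show.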

## Where to Go Next

Now let's explore machine learning with Splice Machine, starting with our next notebook: [*Machine Learning with MLlib*](./g.%20Machine%20Learning%20with%20Spark%20MLlib.ipynb).