# Splice Machine and Spark have a great relationship

<blockquote><p class='quotation'><span style='font-size:15px'>Spark is embedded into the DNA of Splice Machine. It is used in our database for large analytical queries, as well as in these notebooks for the large machine learning data manipulation workloads we'll cover later. Spark and PySpark come preconfigured on all of our clusters, and getting started takes just two lines of code. Your Spark Session automatically connects to your Kubernetes cluster and can scale to meet your demands.<footer>Splice Machine</footer>
</blockquote>

#### Let's start our Spark Session

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# That's it!
## You now have a powerful Spark Session running on Kubernetes
<blockquote> 
    You can access your Spark Session UI by calling the <code>get_spark_ui</code> function in our <code>splicemachine.notebook</code> module. This function takes either the port of your Spark Session or the Spark Session object itself, and returns both a link to your Spark UI as well as an embedded IFrame you can interact with right here in the notebook.
<footer>Splice Machine</footer>
</blockquote>

In [None]:
from splicemachine.notebook import get_spark_ui
# Get the port of our Spark Session
port = spark.sparkContext.uiWebUrl.split(':')[-1]
print('Spark UI Port:', port)
help(get_spark_ui)

In [None]:
# Get the Spark UI with the port
get_spark_ui(port=port)
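The cell above pulls the port out of the UI URL by splitting on `':'`. As a minimal sketch of a slightly more defensive alternative, the standard library's `urlparse` can do the same job and fail loudly when no port is present. The helper name `ui_port` is hypothetical and not part of `splicemachine.notebook`.

```python
from urllib.parse import urlparse


def ui_port(ui_web_url: str) -> int:
    """Return the port from a Spark UI URL such as 'http://host:4040'.

    `ui_port` is a hypothetical helper for illustration only.
    """
    port = urlparse(ui_web_url).port
    if port is None:
        # Raise instead of silently returning a non-port string
        raise ValueError(f"no port found in {ui_web_url!r}")
    return port
```

In the notebook this would be called as `ui_port(spark.sparkContext.uiWebUrl)`. The result is an `int`, whereas the cell above passes a string; the source does not say which types `get_spark_ui` accepts, so convert with `str(...)` if needed.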

# Let's talk Database
<blockquote> After all, Splice Machine is a powerful scale-out transactional and analytical database. To make it as useful as possible for Data Scientists, we've created the
    <a href="https://www.splicemachine.com/the-splice-machine-native-spark-datasource/">Native Spark Datasource</a>. It lets us run inserts, selects, upserts, updates and many other operations directly from code, without serialization. On top of this, we've implemented a wrapper called the <code>PySpliceContext</code> to establish our direct connection in Python. It has the same API as the native Scala implementation, plus a few extra Python-specific helpers. Check out the entire documentation <a href="https://pysplice.readthedocs.io/en/dbaas-4100/splicemachine.spark.html">here</a>.<br><br>
    You'll see in the docs that there is both the <code>PySpliceContext</code> and the <code>ExtPySpliceContext</code>. The <code>ExtPySpliceContext</code> is used when you are running your code outside of the Kubernetes cluster. The only difference in configuration is that you must manually set both the JDBC_URL (which you can get from your <a href="https://cloud.splicemachine.io">Cloud Manager UI</a>) and your kafkaServer URL. Everything else is identical.
<footer>Splice Machine</footer>
</blockquote>
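For code running outside the cluster, the two extra settings described above could be collected like this. This is only a sketch under stated assumptions: the environment variable names `SPLICE_JDBC_URL` and `SPLICE_KAFKA_SERVERS` and both helper functions are hypothetical, and the keyword names `JDBC_URL` and `kafkaServers` passed to `ExtPySpliceContext` are assumed from the linked documentation, so verify them with `help(ExtPySpliceContext)` before relying on them.

```python
import os
from typing import Mapping, Tuple


def external_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the JDBC URL and Kafka server address from a mapping.

    The variable names are hypothetical; use whatever your deployment defines.
    """
    required = ("SPLICE_JDBC_URL", "SPLICE_KAFKA_SERVERS")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"missing settings: {', '.join(missing)}")
    return env["SPLICE_JDBC_URL"], env["SPLICE_KAFKA_SERVERS"]


def make_external_context(spark, env: Mapping[str, str] = os.environ):
    """Build an ExtPySpliceContext from the settings above (hypothetical helper)."""
    # Imported lazily so the settings helper works without the package installed
    from splicemachine.spark import ExtPySpliceContext

    jdbc_url, kafka_servers = external_settings(env)
    # Keyword names are assumed from the docs, not confirmed by this notebook
    return ExtPySpliceContext(spark, JDBC_URL=jdbc_url, kafkaServers=kafka_servers)
```

Keeping the connection details in the environment rather than in the notebook avoids committing credentials, since a JDBC URL copied from the Cloud Manager UI may embed a user name and password.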

#### Let's create our PySpliceContext

In [None]:
from splicemachine.spark import PySpliceContext

splice = PySpliceContext(spark)
help(splice)

## Great! 
### Let's look at some common functions
<blockquote> 
    Some of the most commonly used functions by Data Scientists and Engineers are:
    <ul>
        <li><code>df</code>: This function takes an arbitrary SQL statement and returns the result as a Spark Dataframe. No matter how large the result is, it will be distributed among your available Spark Executors.</li>
        <li><code>createTable</code>: This function takes your Dataframe and the name of a table in the format "schema.table" and creates that table using the structure of your Dataframe, so you don't have to write the <code>CREATE TABLE</code> statement yourself.</li>
        <li><code>insert</code>: This function takes your Dataframe and the name of a table in the format "schema.table" and inserts the rows directly into the table. It's important to make sure <b>the schema of your Dataframe matches the schema of your table</b>.</li>
        <li><code>dropTableIfExists</code>: This function takes the name of a table in the format "schema.table" and drops that table if it exists.</li>
        <li><code>execute</code>: This function takes arbitrary SQL and executes it through a raw JDBC connection.</li>
    </ul>
    <br>
There are many other powerful functions available in our <a href="https://pysplice.readthedocs.io/en/dbaas-4100/splicemachine.spark.html">documentation</a>.
<footer>Splice Machine</footer>
</blockquote>
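Because `insert` expects the Dataframe schema to match the table schema, it can help to compare the two before writing. The sketch below is one way to do that and is not part of the Splice Machine API: `schema_mismatches` is a hypothetical helper that compares two lists of `(column name, type)` pairs, which is the shape PySpark's `DataFrame.dtypes` returns. Column names are compared case-insensitively here, which is an assumption about how you want to treat them.

```python
from typing import List, Tuple

ColumnTypes = List[Tuple[str, str]]


def schema_mismatches(source: ColumnTypes, target: ColumnTypes) -> List[str]:
    """Describe differences between two (name, type) listings.

    Hypothetical helper: returns an empty list when the schemas line up.
    """
    # Normalize case so 'a' and 'A' are treated as the same column
    src = {name.upper(): dtype.lower() for name, dtype in source}
    tgt = {name.upper(): dtype.lower() for name, dtype in target}

    problems = []
    for name in sorted(src.keys() - tgt.keys()):
        problems.append(f"{name}: missing from target table")
    for name in sorted(tgt.keys() - src.keys()):
        problems.append(f"{name}: missing from DataFrame")
    for name in sorted(src.keys() & tgt.keys()):
        if src[name] != tgt[name]:
            problems.append(f"{name}: DataFrame has {src[name]}, table has {tgt[name]}")
    return problems
```

With a `PySpliceContext` named `splice`, one possible usage is `schema_mismatches(df.dtypes, splice.df('select * from schema.table').dtypes)`, inserting only when the returned list is empty.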

#### Let's see an example

In [None]:
# help() prints the documentation itself and returns None,
# so wrapping it in print() would add a stray "None" after each entry
help(splice.df)
print('-------------------------------------------------------------------------------------')
help(splice.createTable)
print('-------------------------------------------------------------------------------------')
help(splice.insert)
print('-------------------------------------------------------------------------------------')
help(splice.dropTableIfExists)
print('-------------------------------------------------------------------------------------')
help(splice.execute)


#### Let's try it out

First, we'll create a SQL table and populate it. Then we'll grab that data as a Spark Dataframe, create a new table from it, and insert our data into the new table.

In [None]:
%%sql
DROP TABLE IF EXISTS FOO;
CREATE TABLE FOO(a INT, b FLOAT, c VARCHAR(25), d TIMESTAMP DEFAULT CURRENT TIMESTAMP);
INSERT INTO FOO (a,b,c) VALUES (240, 84.1189, 'bird');
INSERT INTO FOO (a,b,c) VALUES (207, 1120.7235, 'heal');
INSERT INTO FOO (a,b,c) VALUES (73, 1334.6568, 'scent');
INSERT INTO FOO (a,b,c) VALUES (24, 513.4238, 'toy');
INSERT INTO FOO (a,b,c) VALUES (127, 1030.0719, 'neat');
INSERT INTO FOO (a,b,c) VALUES (91, 694.5587, 'mailbox');
INSERT INTO FOO (a,b,c) VALUES (219, 238.7311, 'animal');
INSERT INTO FOO (a,b,c) VALUES (112, 698.1438, 'watch');
INSERT INTO FOO (a,b,c) VALUES (229, 1034.051, 'sheet');
INSERT INTO FOO (a,b,c) VALUES (246, 782.5559, 'challenge');
INSERT INTO FOO (a,b,c) VALUES (33, 241.8961, 'nutty');
INSERT INTO FOO (a,b,c) VALUES (127, 758.8009, 'python');
INSERT INTO FOO (a,b,c) VALUES (80, 1566.444, 'jumble');
INSERT INTO FOO (a,b,c) VALUES (246, 751.352, 'easy');
INSERT INTO FOO (a,b,c) VALUES (242, 717.3813, 'difficult');
INSERT INTO FOO (a,b,c) VALUES (118, 311.3499, 'answer');
INSERT INTO FOO (a,b,c) VALUES (174, 815.5917, 'xylophone');
INSERT INTO FOO (a,b,c) VALUES (235, 269.0144, 'crash');
INSERT INTO FOO (a,b,c) VALUES (21, 267.1351, 'chocolate');
INSERT INTO FOO (a,b,c) VALUES (82, 1097.7805, 'straw');

### Now we'll use the PySpliceContext to:
<blockquote> 
    <ul>
        <li>Grab our new data from our table directly into a Spark Dataframe</li>
        <li>Create a new table with our Dataframe</li>
        <li>Insert our data directly into it</li>
    </ul>
    <br>
<footer>Splice Machine</footer>
</blockquote>

In [None]:
from splicemachine.mlflow_support.utilities import get_user
schema = get_user()
# Get our data
df = splice.df(f'select * from {schema}.foo')
df.show()

# Create our new table
print('Dropping table new_foo if it exists...', end='')
# Use the public method described above rather than the underscore-prefixed helper
splice.dropTableIfExists(f"{schema}.new_foo")
print('done.')
print('Creating table new_foo...', end='')
splice.createTable(df, f"{schema}.new_foo")
print('done.')

# Insert our data
print('Inserting data into new_foo...', end='')
splice.insert(df, f"{schema}.new_foo")
print('done.')

In [None]:
%%sql
select a, b, varchar(c) c, d from new_foo

In [None]:
spark.stop()

## Amazing!
<blockquote> 
Now you have all of the tools necessary to start accessing and manipulating your Big Data with Spark and Splice Machine. Again, feel free to check out our <a href="https://pysplice.readthedocs.io/en/dbaas-4100/splicemachine.spark.html">documentation</a>!<br><br>
    Next Up: <a href='./7.2 Splice MLflow Support.ipynb'>Using Splice Machine's MLflow Support</a>
<footer>Splice Machine</footer>
</blockquote>