In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'


In [None]:
# Setup
import os
import pyspark
from splicemachine.spark.context import PySpliceContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# Make sure PySpark tells workers to use Python 3, not Python 2, if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
jdbc_host = os.environ['JDBC_HOST']

conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
# Example URL format:
# jdbc:splice://{FRAMEWORKNAME}-proxy.marathon.mesos:1527/splicedb;user=splice;password=admin

splicejdbc = f'jdbc:splice://{jdbc_host}:1527/splicedb;user=splice;password=admin'

splice = PySpliceContext(spark, splicejdbc)


<link rel="stylesheet" href="https://doc.splicemachine.com/zeppelin/css/zepstyles2.css" />

# Using Spark in Jupyter Notebooks
This notebook demonstrates how to use Spark in a Jupyter notebook, in the following sections:

* *Loading Data Into A Table Using Spark*
* *Using Spark SQL to Query the Loaded Data*

<div class="notePlain" style="font-size:12px">
The data used in this notebook is public; its source is:
<p style="margin-left:60px;">[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. <em>Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology.</em> In P. Novais et al. (Eds.), <em>Proceedings of the European Simulation and Modelling Conference - ESM'2011</em>, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.</p>

<p style="margin-left:60px;">Available at: <a href="http://hdl.handle.net/1822/14838">http://hdl.handle.net/1822/14838</a></p>
</div>

## Loading Data Into a Table Using Spark

The following cell uses the `%%scala` cell magic to load public bank data into a table. Note that Zeppelin creates and injects the SparkContext (`sc`) and SQL context (`HiveContext` or `SQLContext`) for you, so you don't need to create them manually there.

In [None]:
%%scala 
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// Zeppelin creates and injects sc (SparkContext) and sqlContext (HiveContext or SQLContext),
// so you don't need to create them manually

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
bank.createOrReplaceTempView("bank")
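
The parsing step in the Scala cell can be easier to follow on a couple of rows. The following is a minimal Python sketch of the same logic, assuming the layout the Scala code expects: semicolon-separated fields, double-quoted strings, and a header row whose first field is `"age"`. The sample lines and the `parse_bank_lines` helper are illustrative assumptions, not rows copied from the real file and not part of any Spark or Splice Machine API.

```python
from typing import Iterable, List, NamedTuple


class Bank(NamedTuple):
    age: int
    job: str
    marital: str
    education: str
    balance: int


def parse_bank_lines(lines: Iterable[str]) -> List[Bank]:
    """Mirror the Scala cell: split on ';', skip the header, strip quotes."""
    rows = []
    for line in lines:
        fields = line.strip().split(";")
        if fields[0] == '"age"':  # header row
            continue
        rows.append(Bank(
            age=int(fields[0]),
            job=fields[1].replace('"', ""),
            marital=fields[2].replace('"', ""),
            education=fields[3].replace('"', ""),
            # Field 4 is skipped, exactly as in the Scala code.
            balance=int(fields[5].replace('"', "")),
        ))
    return rows


# Hypothetical sample in the assumed layout (not taken from bank.csv).
sample = [
    '"age";"job";"marital";"education";"default";"balance"',
    '25;"student";"single";"secondary";"no";300',
    '41;"technician";"married";"tertiary";"no";-50',
]
banks = parse_bank_lines(sample)
```

Here `banks` holds one `Bank` tuple per data row, with the header dropped and the quotes removed from the string fields.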

## Using Spark SQL to Query the Loaded Data

The three cells below query the loaded data with Spark SQL, using syntax from Zeppelin's `%spark.sql` interpreter; each is meant to display the query results using one of the available Jupyter data visualizations.

We also demonstrate how to substitute variables into your queries; their values can be entered in a text box.
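
The queries in this section use placeholders of the form `${name=default}` and `${name=default,option1|option2}`. Jupyter does not expand these on its own, so the following is a minimal sketch of doing the substitution by hand in Python. The `fill_query` helper and its regular expression are assumptions made for illustration; they are not Zeppelin's implementation.

```python
import re

# Matches ${name=default} and ${name=default,opt1|opt2|...}
_VAR = re.compile(r"\$\{(\w+)=([^,}]*)(?:,([^}]*))?\}")


def fill_query(template: str, **values: object) -> str:
    """Replace each placeholder with a supplied value, or with its default."""
    def _sub(match):
        name, default, options = match.group(1), match.group(2), match.group(3)
        value = str(values.get(name, default))
        # When a list of options is given, reject anything outside it.
        if options is not None and value not in options.split("|"):
            raise ValueError(f"{value!r} is not an allowed value for {name}")
        return value
    return _VAR.sub(_sub, template)


template = (
    "select age, count(1) value from bank "
    "where age < ${maxAge=30} group by age order by age"
)
default_query = fill_query(template)             # uses the default, 30
custom_query = fill_query(template, maxAge=40)   # overrides the default
```

Values are inserted into the SQL text verbatim, so this approach is only suitable for trusted input such as your own notebook variables.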


In [None]:
%%scala
// TODO: convert this query and its visualization to Jupyter
select age, count(1) value
from bank 
where age < 30 
group by age 
order by age

In [None]:
%%scala
// TODO: convert this query and its visualization to Jupyter
select age, count(1) value 
from bank 
where age < ${maxAge=30} 
group by age 
order by age

In [None]:
%%scala
// TODO: convert this query and its visualization to Jupyter
select age, count(1) value 
from bank 
where marital="${marital=single,single|divorced|married}" 
group by age 
order by age
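
In a Python cell, queries like the ones in this section could be run through the `spark` session created in the setup cell. The sketch below is one way to do that. It assumes the `bank` view registered from the Scala cell is visible to that same session, which depends on how the `%%scala` magic is wired up in your environment. `QUERY` and `age_counts` are illustrative names, not part of PySpark or Splice Machine.

```python
QUERY = """
select age, count(1) value
from bank
where age < {max_age}
group by age
order by age
"""


def age_counts(spark, max_age=30):
    """Run the age histogram query and return whatever spark.sql returns.

    `spark` is expected to be a SparkSession, such as the one from the setup cell.
    """
    # int() keeps arbitrary text out of the query string.
    return spark.sql(QUERY.format(max_age=int(max_age)))


# In the notebook, something like:
#   df = age_counts(spark, 30)
#   df.show()                                   # plain-text table
#   df.toPandas().plot.bar(x="age", y="value")  # bar chart; assumes pandas and matplotlib are installed
```

Converting the result with `toPandas()` pulls all rows to the driver, which is fine for a small aggregate like this one but not for large result sets.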

## Where to Go Next
The next notebook in this class, [*Using the Database Console*](./d.%20Using%20the%20Database%20Console.ipynb), explores using the Database Console to learn where your queries are bogging down and how to use that information for additional query tuning.