In [1]:
%%HTML
<link rel="stylesheet" href="https://doc.splicemachine.com/jupyter/css/custom.css">

In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'


In [None]:
# Setup: imports, Spark session, and Splice Machine connection
import os
import pyspark
from splicemachine.spark.context import PySpliceContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# Make sure PySpark tells workers to use Python 3, not Python 2, if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
jdbc_host = os.environ['JDBC_HOST']

conf = SparkConf()
sc = pyspark.SparkContext(conf=conf)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

splicejdbc = f'jdbc:splice://{jdbc_host}:1527/splicedb;user=splice;password=admin'

splice = PySpliceContext(spark, splicejdbc)


# Exercises: Splice Machine Developer's Class, Part II

This notebook contains follow-on exercises for the material covered in this class. Complete the exercises and run the cells in this notebook to verify your understanding.

<p class="noteNote">These exercises build upon the exercises from <em>Developer's Class, Part I</em>, and assume that you've already loaded the data used there. If you haven't, please go back to the <a href="/#/notebook/2DYX3JGDF">Exercises notebook from Part I</a>, and at least create the table and load the data before continuing here.</p>

You'll be performing the following actions in these exercises:

1.  Enhancing our exercise schema with an additional table and data.
2.  Writing and monitoring more advanced queries.
3.  Implementing additional query tuning.
4.  Interacting with this data using the Native Spark DataSource.

## 1. Enhancing our Exercise Schema
We'll begin these exercises by:

* Extending our *Movie Rating Data* schema with a new table for user information
* Loading data into our new table
* Collecting statistics on the table

### Our Sample User Data
The sample user data that we'll be ingesting looks like this:

```
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
```

This data contains five fields, separated by `|` characters in the input:
&nbsp;&nbsp;&nbsp;&nbsp; `user_id | age | gender | occupation | zip_code`
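To make the record layout concrete, here is a minimal sketch in plain Python that splits the sample lines shown above into typed fields. The names `UserRecord` and `parse_user_line` are hypothetical and exist only for this illustration. They are not part of Splice Machine, and the actual load in these exercises is done with SQL.

```python
from typing import NamedTuple


class UserRecord(NamedTuple):
    """One row of the user data file (class name is hypothetical)."""
    user_id: int
    age: int
    gender: str
    occupation: str
    zip_code: str  # kept as text in this sketch


def parse_user_line(line: str) -> UserRecord:
    """Split one pipe-delimited line into typed fields."""
    user_id, age, gender, occupation, zip_code = line.strip().split("|")
    return UserRecord(int(user_id), int(age), gender, occupation, zip_code)


# The three sample lines shown above.
sample_lines = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
]

records = [parse_user_line(line) for line in sample_lines]
for record in records:
    print(record)
```

Thinking about which Python type fits each field is a useful warm-up for choosing SQL column types in the table definition.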

### Create the Table Definition
Now, let's create a table specification for the new user information data shown above, and call it `USER_DATA`. A couple of notes about the data that you should take into consideration when defining the table:

* Not all ZIP codes are integers.
* Though we support Foreign Keys, for these exercises we will skip adding them.
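The first note above matters when you pick a column type for the ZIP code. The following plain-Python sketch shows what can go wrong if ZIP codes are forced into integers. Only `85711` comes from the sample data; the other two values are hypothetical examples chosen to show a leading zero and a non-numeric code.

```python
def zip_as_int(zip_code: str):
    """Return the ZIP code as an int, or None if it is not purely numeric."""
    try:
        return int(zip_code)
    except ValueError:
        return None


# "85711" comes from the sample data; the other two values are hypothetical.
examples = ["85711", "02139", "V3N4P"]
results = {zip_code: zip_as_int(zip_code) for zip_code in examples}
print(results)
```

In this sketch the leading zero of `02139` is lost in the conversion and `V3N4P` cannot be converted at all, which suggests that a character type is the safer choice for this column.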

<p class="noteQuestion">What do you think the Primary Key should be?</p>

In the next cell, insert the SQL to create the table, then run the cell to actually create the table in your database.

For help with the syntax, review the notebooks in this class, or read about creating tables in <a href="https://doc.splicemachine.com/sqlref_statements_createtable.html" target="_blank">our documentation</a>.


In [None]:
%%sql 


### Loading Data into the User Data Table

Now we'll import all of our user data. We've copied the data file into this Docker image, so you can examine it if needed; you'll find the data here:

&nbsp;&nbsp;&nbsp;&nbsp; `/opt/data/user.csv`

<p class="noteNote">Use the following command to open a shell in your Docker container: <code>docker exec -it spliceserver /bin/bash</code>.</p>

Enter the proper `IMPORT` call in the next cell, then run the cell to load the data into the table in your database. If you need help, review the examples from this class or in our documentation.

<p class="noteHint">Use <code>/opt/data</code> as your bad record directory; if you have trouble with the import, you'll find valuable information in that directory.</p>



In [None]:
%%sql 


### Collecting Statistics on the User Data Table

As before, let's collect statistics on the new table.

In [None]:
%%sql 


## 2. Writing and Monitoring Advanced Queries

Now let's run some queries. 

Each of the following headings poses a question; enter a query that answers it in the `%%sql` cell below the heading, and then run the cell.

### a. How many total records are there in USER_DATA?

In [None]:
%%sql 


### b. How many ratings were submitted by students?

In [None]:
%%sql 
 

### c. Which gender gave higher average ratings?

In [None]:
%%sql 


### d. What is the average age of reviewers who submitted at least one rating of 5?

In [None]:
%%sql 


### e. What are the (distinct) occupations in this table?

In [None]:
%%sql 


### f. What is the review count by age, filtered by occupation?

In [None]:
%%sql 


## 3. Additional Query Tuning

Write and run a query that answers the following question, then create indexes to make it run faster:

Find the average rating given by users whose ZIP codes start with `9`, for ratings whose `time_entered` is April 1, 1998.
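Before writing the SQL, it can help to pin down the two filter conditions precisely. The plain-Python sketch below is not a solution to the exercise; it only spells out the predicate logic on a few hypothetical rows. It assumes that `time_entered` carries a time of day, in which case matching a single date means testing a half-open range from midnight to the following midnight. The row values and the function name `matches` are made up for this illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical joined rows: (zip_code, rating, time_entered).
rows = [
    ("94043", 4, datetime(1998, 4, 1, 9, 30)),
    ("90210", 2, datetime(1998, 4, 1, 23, 59)),
    ("94043", 5, datetime(1998, 4, 2, 0, 0)),   # next day: excluded
    ("85711", 1, datetime(1998, 4, 1, 12, 0)),  # ZIP does not start with 9
]

# Half-open interval covering all of April 1, 1998.
day_start = datetime(1998, 4, 1)
day_end = day_start + timedelta(days=1)


def matches(zip_code: str, time_entered: datetime) -> bool:
    """True when the ZIP starts with '9' and the rating falls on the target day."""
    return zip_code.startswith("9") and day_start <= time_entered < day_end


selected = [rating for zip_code, rating, ts in rows if matches(zip_code, ts)]
average_rating = mean(selected)
print(average_rating)
```

A prefix test on one column and a range test on another are the two predicates to keep in mind when you decide which columns your indexes should cover.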

In [None]:
%%sql 


## 4. Using the Native Spark DataSource with Our Data

We'll now review some of the basics of the Native Spark DataSource.  First, create a PySpliceContext object:



Now create a Splice table with 2-3 columns.  Don't insert any data yet:

In [None]:
%%sql 


Next, create some Spark data (similar to what was done in the previous notebook), and populate the Splice table you just created:

Finally, show that the data made it into the Splice table:

In [None]:
%%sql 


## Where to Go Next
Congratulations! You've just completed the *Splice Machine Developer's Class, Part II*. 

Visit [*Our Training Classes*](../About/Our%20Training%20Classes.ipynb) notebook to learn about our other training classes.

