In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'

<link rel="stylesheet" href="https://doc.splicemachine.com/zeppelin/css/zepstyles2.css" />

# Exercises: Splice Machine Beginning Developers Class

This notebook contains follow-on exercises for the material that we covered in this class. You can complete these exercises and run the paragraphs in this notebook to verify your understanding of what was covered.

You'll be performing the following actions in these exercises:

1. Creating Tables
2. Importing Data
3. Collecting Statistics
4. Running Queries
5. Monitoring Queries
6. Tuning Queries

The data you'll be loading is on your local machine, which will prove useful if you need to debug the data import process.

## 1. Creating Tables
In this exercise, we'll:

* Examine the data we want in the table
* Create a table with a primary key

### Our Sample Data
We'll be loading up sample movie rating data over the course of our training classes; for this exercise, we'll start by looking at a sample of the raw ratings data:

```
196|242|3|1997-12-04 07:55:49
186|302|3|1998-04-04 11:22:22
22|377|1|1997-11-06 23:18:36
```

The data contains four fields, separated by `|` characters in the input:
&nbsp;&nbsp;&nbsp;&nbsp; `user_id | item_id | rating | timestamp`
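
If you want to inspect a row outside the database, the field layout above can be sketched in Python. This is only an illustration: the `Rating` type, the `parse_rating` helper, and the `rated_at` field name are hypothetical, and the timestamp pattern is inferred from the sample rows rather than taken from any Splice Machine setting.

```python
from datetime import datetime
from typing import NamedTuple


class Rating(NamedTuple):
    """Hypothetical record type mirroring the four input fields."""
    user_id: int
    item_id: int
    rating: int
    rated_at: datetime


def parse_rating(line: str) -> Rating:
    """Split one pipe-delimited line into its four typed fields."""
    user_id, item_id, rating, timestamp = line.strip().split("|")
    return Rating(
        int(user_id),
        int(item_id),
        int(rating),
        # Assumed pattern, inferred from the sample rows above.
        datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S"),
    )


first = parse_rating("196|242|3|1997-12-04 07:55:49")
print(first)
```

A line with too few fields or a timestamp that does not match the pattern raises `ValueError`, which is roughly the kind of row a database import would reject.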

### Create the Table Definition
Now, let's create a table specification for the raw ratings data shown above, and call it `RATING_DATA`. Be sure to put in a Primary Key definition.

<p class="noteQuestion">What do you think the Primary Key should be?</p>

Insert the SQL to create the table in the next paragraph, and then run the cell to actually create the table in your database.

For help with the syntax, review the notebooks in this class, or read about creating tables in <a href="https://doc.splicemachine.com/sqlref_statements_createtable.html" target="_blank">our documentation</a>.


In [None]:
%%sql 


## 2. Importing Data

Now we'll import all of our ratings data, which contains 100,000 rows. We've copied the data file into this Docker image, so you can examine it if needed; you'll find the data here:
&nbsp;&nbsp;&nbsp;&nbsp; `/opt/data/rating.csv`

NOTE: If you ever need to log into your Docker container, open a fresh terminal window and enter `docker exec -it spliceserver /bin/bash`, then navigate to the appropriate directory.

Enter the proper `IMPORT` call in the next paragraph, then run the cell to load the data into the table in your database. If you need help, review the examples from this class or in our documentation.

<p class="noteHint">Use <code>/opt/data</code> as your bad records directory; if you have trouble with the import, you'll find valuable information in that directory.</p>

#### Questions:
1. What should the Schema specification be?
2. What is the field delimiter?
3. Do you need a character delimiter?
4. What should the timestamp format be?
5. What should the bad record count be?
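
Before deciding how many bad records to tolerate, it can help to check the file's structure yourself. The following is a minimal sketch in plain Python, not part of Splice Machine: `find_malformed_rows` is a hypothetical helper, it only checks the field count, and the third sample row is made up to show a failure.

```python
import csv
from typing import Iterable, List, Tuple

EXPECTED_FIELDS = 4  # user_id | item_id | rating | timestamp


def find_malformed_rows(lines: Iterable[str]) -> Tuple[int, List[int]]:
    """Return (rows_checked, line numbers whose field count is wrong)."""
    bad_lines = []
    total = 0
    for line_no, row in enumerate(csv.reader(lines, delimiter="|"), start=1):
        total += 1
        if len(row) != EXPECTED_FIELDS:
            bad_lines.append(line_no)
    return total, bad_lines


sample = [
    "196|242|3|1997-12-04 07:55:49",
    "186|302|3|1998-04-04 11:22:22",
    "186|302|1998-04-04 11:22:22",  # made-up malformed row: rating is missing
]
total, bad = find_malformed_rows(sample)
print(f"{total} rows checked, malformed at lines {bad}")
```

To check the real file from inside the container, you could pass an open file handle instead of the sample list, for example `open("/opt/data/rating.csv", newline="")`.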


In [None]:
%%sql 


## 3. Collecting Statistics

Now let's run the simple command to collect statistics for your table. Enter the command in the next paragraph, and then run the cell:


In [None]:
%%sql 


## 4. Running Queries

Now let's run some queries. 

Each of the following paragraphs poses a question; you should:

* Enter a query that will answer that question.
* Run the cell.

## a. How many total records are there in RATING_DATA?

In [None]:
%%sql 


## b. Did the last query run through HBase (control) or Spark?  

In [None]:
%%sql 


## c. What are the top 20 movies based on average rating? (Extra credit: your answer might need some casting to be accurate.)
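
The casting hint is easier to see with a small worked example. This Python sketch uses made-up ratings and assumes the database averages an integer column in integer arithmetic; check the `AVG` documentation to confirm how your column type behaves.

```python
from decimal import Decimal

ratings = [4, 5, 4]  # made-up ratings for one movie

# Integer arithmetic drops the fractional part of the average.
integer_average = sum(ratings) // len(ratings)

# Converting to a decimal type before dividing keeps the fraction.
decimal_average = Decimal(sum(ratings)) / len(ratings)

print(integer_average)
print(round(decimal_average, 2))
```

When many movies share the same truncated average, the ordering of a "top 20" list becomes arbitrary, which is why the extra precision matters.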

In [None]:
%%sql 


## d. What rating did User 100 give Movie 300?

In [None]:
%%sql 


## e. What is the most popular rating given out?

In [None]:
%%sql 


## f. How many ratings did Movie 50 get?

In [None]:
%%sql 


## g. How might we have sped that query up?

In [None]:
%%sql 


## 5. Monitoring Queries

Now we'll monitor a query. 

* First, please rerun the query that told you how many records are in the `RATING_DATA` table.
* Then, if you haven't already done so, point your browser to `localhost:4040` and look at the queries that were run in Spark.

#### Questions:

1. How long did the `count(*)` query take to run?
2. How many stages did it have? Why?
3. Which stage took longer to run?  Why?
4. How many tasks ran in each stage?
5. What do you think needs to happen in order for more tasks to be created?
6. What query did we run earlier that we will NOT see in the Spark UI?

## 6. Tuning Queries

Finally, let's do some tuning on a query. Suppose we want to find out how many users gave a rating of 1. Enter that query in the next paragraph, and then run the cell:


In [None]:
%%sql 


Not bad, but can we do better?  

If you ran an EXPLAIN on the query first (and if you haven't yet, do that now), you'll see that we did a full table scan on the RATING_DATA table.

Create an index for this table to help the query run faster. Once you have created your index, run EXPLAIN on the original query again to confirm that the query will use the index. There is a subtlety here: if EXPLAIN is not choosing your index, review the index sections of *Tuning Queries for Performance*.

Once EXPLAIN shows that your index is being chosen, you should see your query run faster.
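
The explain, index, explain-again workflow can be rehearsed outside the cluster. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in: SQLite's optimizer and its `EXPLAIN QUERY PLAN` output differ from Splice Machine's, the table is a scaled-down copy with made-up rows and no primary key, the query is a simplified row count, and the index name `ix_rating` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rating_data (user_id INT, item_id INT, rating INT)")
conn.executemany(
    "INSERT INTO rating_data VALUES (?, ?, ?)",
    # Made-up rows: 50 users x 20 items, ratings cycling through 1..5.
    [(u, i, (u + i) % 5 + 1) for u in range(1, 51) for i in range(1, 21)],
)

QUERY = "SELECT COUNT(*) FROM rating_data WHERE rating = 1"


def plan(sql: str) -> str:
    """Join the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)


before = plan(QUERY)  # no index yet, so the whole table is scanned
conn.execute("CREATE INDEX ix_rating ON rating_data (rating)")
after = plan(QUERY)   # the plan should now mention the new index

print("before:", before)
print("after: ", after)
```

The point is the habit rather than the specific output: compare the plan before and after creating the index, and only trust a speedup once the plan names your index.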

In [None]:
%%sql 


## Where to Go Next
Congratulations! You've just completed the *Splice Machine Developer's Class, Part I*. 

To continue learning about developing with Splice Machine, consider taking our [*Splice Machine Developers, Part II*](../For%20Developers%2C%20Part%20II/a.%20Introduction%20to%20Developer%20Training%2C%20Part%20II.ipynb) class.

Visit [*Our Training Classes*](../About/Our%20Training%20Classes.ipynb) notebook to learn about our other training classes.

