In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'

In [2]:
%%HTML
<link rel="stylesheet" href="https://doc.splicemachine.com/jupyter/css/custom.css">

In [None]:
# setup-- 
import os
import pyspark
from splicemachine.spark.context import PySpliceContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
jdbc_host = os.environ['JDBC_HOST']

conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
# JDBC URL template: jdbc:splice://{FRAMEWORKNAME}-proxy.marathon.mesos:1527/splicedb;user=splice;password=admin

splicejdbc=f'jdbc:splice://{jdbc_host}:1527/splicedb;user=splice;password=admin'

splice = PySpliceContext(spark, splicejdbc)


<link rel="stylesheet" href="https://doc.splicemachine.com/zeppelin/css/zepstyles2.css" />

# Exercises: Data Scientists

This notebook contains follow-on exercises for the material that we covered in this class. You can complete these exercises and run the cells in this notebook to verify your work.

You'll be performing the following actions in these exercises:

1. Creating Tables
2. Importing Data
3. Visualizing Data
4. Performing some basic Machine Learning

The data you'll be loading is on your local machine, which will prove useful if you need to debug the data import process.

## Prerequisite Database Tables
We'll start with three tables from a simple Movie Rating schema:

* The `rating_data` table, which stores movie ratings, was introduced in the exercises for our *Developer Training, Part I* class.
* The `user_data` table, which stores reviewer information, was added in the exercise for our *Developer Training, Part II* class.
* In this notebook, you'll create and load the `item_data` table, which stores movie title and genre information.

If you've completed the developer classes, you may already have these tables loaded in your database. If so, you can skip the next cell and start with the following cell, __1. Creating Tables.__

Otherwise, please run the next cell to create and load the `RATING_DATA` and `USER_DATA` tables, before proceeding. 

In [None]:
%%sql 
drop table if exists rating_data;
create table RATING_DATA (
    user_id bigint,
    item_id bigint,
    rating integer, 
    time_entered timestamp,
    primary key (user_id, item_id)
);

call syscs_util.import_data('SPLICE', 'RATING_DATA', null, '/opt/data/rating.csv', '|', null, null, null, null, 0, '/opt/data/', null, null);
analyze table rating_data;

drop table if exists user_data;
create table USER_DATA (
  user_id bigint primary key,
  age integer,
  gender varchar(1),
  occupation varchar(20),
  zip varchar(10)
);
call syscs_util.import_data('SPLICE', 'USER_DATA', null, '/opt/data/user.csv', '|', null, null, null, null, 0, '/opt/data/', null, null);
analyze table user_data;


## 1. Creating Tables

Here we'll load one more table to enhance our Movie Data schema; this table will store categorization information for individual movie titles. As you'll see, this will be useful for our Machine Learning exercises.

### Our Sample Data

The sample movie data that we're using is a table of movie titles and genre information. The raw data looks like:

```
1|Toy Story|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
```

Each record contains the following fields:

`item_id | movie title | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy |`
&nbsp;&nbsp;&nbsp;&nbsp; `Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western`
    
Each of the columns after `movie title` is a genre flag; each flag has a value of either `1`, indicating that the genre does apply to this movie, or `0`, indicating that it does not.
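
To make the record layout concrete, here is a small sketch in plain Python that splits one raw line on `|` and pairs each genre flag with its name. The `GENRES` list and the `parse_item` helper are illustrative names, not part of the class materials, and the sample line is rebuilt in code from the Toy Story genres shown above.

```python
# Genre names in the order listed above (hypothetical helper constant).
GENRES = [
    "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]

def parse_item(line):
    """Split one pipe-delimited record into (item_id, title, genres that apply)."""
    fields = line.rstrip("\n").split("|")
    item_id, title, flags = int(fields[0]), fields[1], fields[2:]
    if len(flags) != len(GENRES):
        raise ValueError(f"expected {len(GENRES)} genre flags, got {len(flags)}")
    # Keep only the genres whose flag is 1.
    genres = [name for name, flag in zip(GENRES, flags) if flag == "1"]
    return item_id, title, genres

# Rebuild the Toy Story record from its three genres, then parse it back.
flags = ["1" if name in {"Animation", "Children's", "Comedy"} else "0" for name in GENRES]
line = "|".join(["1", "Toy Story"] + flags)
print(parse_item(line))
```

A layout with one flag column per genre is also what makes filters such as `animation = 1` possible in SQL once the data is loaded.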


### Create the Table Definition

Now, let's create a table specification for the movie and genre data shown above, and call it `ITEM_DATA`. Be sure to put in a Primary Key definition.

<p class="noteQuestion">What do you think the Primary Key should be?</p>

Insert the SQL to create the table in the next cell and run it to create the table in your database.

For help with the syntax, review the notebooks in this class, or read about creating tables in <a href="https://doc.splicemachine.com/sqlref_statements_createtable.html" target="_blank">our documentation.</a>



In [None]:
%%sql 


## 2. Importing Data

Now we'll import all of our movie/genre data. We've copied the data file into this Docker image, so you can examine it if needed; you'll find the data here:

&nbsp;&nbsp;&nbsp;&nbsp; `/opt/data/item.csv`

Enter the proper `IMPORT` call to load the data in the next cell, then run the cell to load the data into the table in your database. You can review examples from this class or in our documentation for any required help.

<p class="noteHint">Use <code>/opt/data</code> as your BAD records file directory; if you have trouble with the import, you'll find valuable information in that directory.</p>




In [None]:
%%sql 


## 3. Visualizing Data

Now we have 3 tables loaded:

* Movie Ratings
* Movie Reviewers
* Movies and their Genres

With all three tables in place, our visualizations should become more interesting. Explore different visualizations by:

1. Running the pre-entered queries in the next cells.
2. Applying different visualization techniques to look for any possible interesting correlations.


In [None]:
%%sql 
select occupation, avg(cast(age as float)), avg(cast(rating as float)), count(*) from rating_data r, user_data u, item_data i where r.user_id = u.user_id and i.item_id = r.item_id and i.animation = 1 and i.childrens = 1 and comedy = 1 group by occupation order by avg(cast(rating as float)) desc

In [None]:
%%sql 
select occupation, count(*), avg(cast(rating as float)) from rating_data r, user_data u, item_data i where r.user_id = u.user_id and i.item_id = r.item_id group by occupation order by avg(cast(rating as float)) desc

In [None]:
%%sql 
select occupation, count(*), avg(cast(rating as float)) from rating_data r, user_data u, item_data i where r.user_id = u.user_id and i.item_id = r.item_id and horror = 1 group by occupation order by avg(cast(rating as float)) desc


##### Now try your own queries and visualizations.  Can you find any correlations?


In [None]:
%%sql 


## 4. Machine Learning Example

A future version of this class will include a more in-depth Machine Learning exercise based on the Movie Rating data we have worked with. In the meantime, we'll work through a more streamlined example: Spam detection.

Here we'll create a table to hold a set of SMS records, where each record has been labeled as "ham" or "spam" (where the spam label indicates a spam record).  We'll then apply a set of transformations and create a model for our own Spam predictor.

First we'll create the table for the data and import it. So that you can get to the Machine Learning quickly, we have provided the schema and import statements for you to run.


In [None]:
%%sql 
drop table if exists sms;
CREATE TABLE SMS (
    LABEL VARCHAR(10),
    SMS_CONTENT VARCHAR(10000)
);


We went ahead and provided the import statement for you to run here.  A couple of points:

1.  This is tab-separated data, hence we are using `\t` as the value of the field separator.
2.  For string delimiters, we can't use our default since the data may contain double-quote characters.  Therefore the best option is to include an unprintable character (in this case `^A`) that we don't expect to see in the SMS strings.
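
The effect of the string delimiter can be illustrated outside the database with Python's standard `csv` module. This is only an analogy, not how the Splice Machine import works internally: the two sample records are made up, and `csv.QUOTE_NONE` stands in for choosing a delimiter character that never occurs in the data.

```python
import csv
import io

# Two made-up tab-separated records; the first text field starts with a stray double quote.
raw = 'ham\t"call me later\nspam\tWIN a prize now\n'

# With the default double-quote delimiter, the unmatched quote opens a quoted
# field that runs past the end of the line and swallows the second record.
default_rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))

# Treating quotes as ordinary characters keeps one record per line.
plain_rows = list(csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))

print(len(default_rows), len(plain_rows))
print(plain_rows)
```

The same reasoning applies to the import call: if the delimiter can appear inside an SMS message, records may be merged or rejected, so a character that is absent from the data is the safer choice.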


In [None]:
%%sql 
call syscs_util.import_data ('SPLICE', 'SMS', null, '/opt/data/sms.csv', '\t', '', null, null, null, 0, '/opt/data/', null, null);

It's good to take a look at the data, perhaps filtering on the label to see what we have. (Note: the data has NOT been cleansed.)

In [None]:
%%sql 


### Native Spark DataSource to create DataFrame from Splice table
Now create a variable `df` that holds a DataFrame of all records from the SMS table. Then chain a call to `withColumnRenamed` on that DataFrame to rename the `LABEL` column to `correct`:

###  Preprocess the Data and Define generate_pipeline()

We're now going to preprocess the data before learning:

1. Tokenize the content into words (using `Tokenizer()`)
2. Filter out words of little significance (using `StopWordsRemover()`)
3. Hash the remaining words to get term frequencies (using `HashingTF()`)
4. Weight the terms by their significance across messages (using `IDF()`)
5. Convert the _ham_ and _spam_ labels to 0s and 1s (using `StringIndexer()`)

Finish out the following definition of `generate_pipeline()`. We've started you off with the Tokenizer, Hashing, and IDF.  

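* Add a `StopWordsRemover()` and name it `remover`, following the pattern of the other stages. Its input column must be `words` (the tokenizer's output) and its output column must be `filtered` (the input that `hashingTF` expects).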
* Consult the Spark documentation for `StopWordsRemover` as necessary. Don't worry about any warnings when it runs.
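
For intuition about what steps 1 through 4 compute, the following is a toy, pure-Python imitation rather than Spark code. The stop-word list, the bucket count, the two sample messages, and the use of `zlib.crc32` as the hash are assumptions made for illustration (Spark's `HashingTF` uses a different hash function). The IDF weight follows the smoothed formula `log((N + 1) / (df + 1))` documented for Spark's `IDF`.

```python
import math
import zlib
from collections import Counter

STOP_WORDS = {"a", "the", "to", "your", "is"}  # tiny illustrative list
NUM_BUCKETS = 16                                # stands in for FEATURE_NUM_HASHING

def tokenize(text):
    # Lowercase and split on whitespace.
    return text.lower().split()

def remove_stop_words(words):
    # Drop words that carry little meaning.
    return [w for w in words if w not in STOP_WORDS]

def hashing_tf(words):
    # Map each word to a bucket index and count occurrences per bucket.
    return Counter(zlib.crc32(w.encode("utf-8")) % NUM_BUCKETS for w in words)

def idf_weights(tf_vectors):
    # Smoothed inverse document frequency for every bucket that occurs.
    n_docs = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}

docs = ["Claim your free prize", "Call me when you claim the bag"]
tf_vectors = [hashing_tf(remove_stop_words(tokenize(d))) for d in docs]
idf = idf_weights(tf_vectors)
features = [{b: count * idf[b] for b, count in tf.items()} for tf in tf_vectors]
print(features)
```

In this sketch a word such as "claim", which occurs in every message, ends up with a weight of zero, while words confined to a single message keep a positive weight. That is the sense in which IDF establishes term significance.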


In [None]:

from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml import Pipeline

FEATURE_NUM_HASHING = 7500 
df = df.withColumnRenamed("LABEL", "correct")

def generate_pipeline(predicting=False):
    tokenizer = Tokenizer().setInputCol("SMS_CONTENT").setOutputCol("words")

    # StopWordsRemover code goes here to set up remover variable:
    
    
    hashingTF = HashingTF().setNumFeatures(FEATURE_NUM_HASHING).setInputCol("filtered").setOutputCol("rawFeatures")
    idf = IDF().setInputCol("rawFeatures").setOutputCol("features").setMinDocFreq(0)
    stages = [tokenizer, remover, hashingTF, idf]
    
    if not predicting: # ignore label if we are predicting
        labelidx = StringIndexer().setInputCol("correct").setOutputCol("label")
        stages.append(labelidx)
        
    pipe = Pipeline(stages=stages)
    return pipe
    
pipe = generate_pipeline()


### Transformation, modeling, and evaluation

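We'll train and evaluate on several random splits of the dataset rather than on a single split. Comparing results across iterations gives a more reliable picture of model quality and helps reveal overfitting, although it does not prevent overfitting by itself. Each iteration uses Logistic Regression for training and testing.

Put in the proper `LogisticRegression()` code below (assigning it to the variable `lr`).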

Questions:
1.  What is our training set size?  What is our testing set size?
2.  Why is it important not to overfit?
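
To see how repeated random splitting behaves, here is a small stand-alone sketch in plain Python. It imitates the idea behind a weighted random split by drawing one random number per row, so the split sizes are only approximately proportional to the requested fraction and vary between iterations. The `random_split` helper, the seed values, and the 1000-row stand-in dataset are all assumptions for illustration.

```python
import random

CV_ITERATIONS = 4
TRAIN_SIZE = 0.7

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for the rows of a DataFrame
for iteration in range(1, CV_ITERATIONS + 1):
    train, test = random_split(rows, TRAIN_SIZE, seed=iteration)
    print(f"ITERATION {iteration}: train={len(train)} test={len(test)}")
```

Because every iteration sees a different split, a model that scores well on only one of them is more likely to have memorized its training rows than to have learned a general pattern.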



In [None]:

from pyspark.ml.classification import LogisticRegression
from splicemachine.ml.utilities import SpliceBinaryClassificationEvaluator

CV_ITERATIONS = 4
TRAIN_SIZE = 0.7

evaluator = SpliceBinaryClassificationEvaluator(spark)

for iteration in range(1, CV_ITERATIONS + 1):
    transformed = pipe.fit(df).transform(df)
    train, test = transformed.randomSplit([TRAIN_SIZE, 1 - TRAIN_SIZE])
    
    # LogisticRegression initialization goes here

    
    fitlr = lr.fit(train)
    predicted = fitlr.transform(test)
    print("ITERATION {iteration}".format(iteration=iteration))
    evaluator.input(predicted)

### Use our model to make Predictions

Now that we have trained and tested our model, it's time to make predictions.  This code is ready to go, so you don't need to make any changes.  Note that there is a text input field to enter sample SMS strings and get a prediction (Spam or Not Spam). 

Tests to run:
1. _Run the following cell, in which we are testing the following SMS:_

   ```
   free gummies. call 1800-393-2939 to claim your prize.
   ```

   This evaluates the text in place.  Do you agree with the results?
   <br />

2. _Replace the SMS contents with:_

   ```
   George I got your message, please call me.
   ```

   Run the cell again.  Do you agree with the results?
   <br />
   
3. _Try other strings as well._  Do you agree with the results?
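
The decision rule applied to each message in the next cell can be isolated as a small function. This sketch assumes, as that cell does, that the first entry of the model's probability vector is the probability of the ham class; the `label_message` name and the example probabilities are hypothetical.

```python
def label_message(prob_ham, threshold=0.75):
    """Return a (label, spam probability in percent) pair for one prediction."""
    label = "Not Spam" if prob_ham > threshold else "Spam"
    spam_percent = round((1 - prob_ham) * 100, 2)
    return label, spam_percent

print(label_message(0.9))  # ham probability well above the threshold
print(label_message(0.6))  # more likely ham than spam, yet still labeled Spam
```

With a threshold of 0.75, a message is labeled Not Spam only when the model is fairly sure it is ham. Keep this in mind when deciding whether you agree with a result.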


In [None]:

from pyspark.sql import Row
from pyspark.sql.types import StringType
from splicemachine.ml.utilities import display

text_contents = input("SMS Message Contents: ")  # z.input() is only available in Zeppelin
if len(text_contents) > 1:
    # Use the SparkSession created in the setup cell; sqlContext is not defined in this notebook
    pred_df = spark.createDataFrame([text_contents], StringType()).withColumnRenamed('value', 'SMS_CONTENT')
    predPipe = generate_pipeline(predicting=True)

    pred = predPipe.fit(df).transform(pred_df)
    predictions = fitlr.transform(pred)
    prob = predictions.select('probability').collect()[0][0][0]
    if prob > 0.75:
        display("<h2>This message is <b><i>Not Spam</i></b></h2>")
    else:
        display("<h2>This message is <b><i>Spam</i></b></h2>")
   
    # Scale to a percentage before rounding to avoid floating-point noise in the output
    display("<h3>Spam Probability: " + str(round((1 - prob) * 100, 2)) + "%</h3>")

## Where to Go Next
Congratulations! You've just completed the *Splice Machine Data Science Class*. 

Visit [*Our Training Classes*](../About/Our%20Training%20Classes.ipynb) notebook to learn about our other training classes.
