In [None]:
import os
os.environ['JDBC_HOST'] = 'jrtest01-splice-hregion'


In [None]:
# setup-- 
import os
import pyspark
from splicemachine.spark.context import PySpliceContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# make sure pyspark tells workers to use python3 not 2 if both are installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
jdbc_host = os.environ['JDBC_HOST']

conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

splicejdbc=f'jdbc:splice://{jdbc_host}:1527/splicedb;user=splice;password=admin'

splice = PySpliceContext(spark, splicejdbc)


<link rel="stylesheet" href="https://doc.splicemachine.com/zeppelin/css/zepstyles2.css" />

# Machine Learning with Spark MLlib

This notebook contains code that uses the Machine Learning (<em>ML</em>) Library embedded in Spark, *MLlib*, with the Splice Machine Spark Adapter to realize in-process machine learning. Specifically, the example in this notebook uses data that tracks international shipments to learn, and then predicts how late a shipment will be, based on various factors.

If you're not familiar with Machine Learning with Spark MLlib, you can learn more about this library here: <a href="https://spark.apache.org/docs/latest/ml-guide.html" target="_blank">https://spark.apache.org/docs/latest/ml-guide.html</a>.

The code in this project was written in the Scala programming language as well as the Python programming language.

The remainder of this notebook contains these sections:

* <em>Basic Terminology</em> defines a few major ML terms used in this notebook.
* <em>About our Sample Data</em> introduces the shipping data that we use.
* <em>About our Learning Model</em> describes the learning model method we're using.
* <em>Create our Splice Machine Database</em> walks you through setting up our database with our sample data.
* <em>Create, Train, and Deploy our Learning Model</em> walks you through our Machine Learning sample code.
* <em>The Python Code</em> contains a listing of all of the code used in this notebook.

## Basic Terminology

Here's some basic terminology you need to be familiar with to understand the code in this notebook. These descriptions are paraphrased from the <a href="https://spark.apache.org/docs/latest/ml-guide.html" target="_blank">above-mentioned Spark MLlib guide.</a>

<table class="splicezep">
    <col />
    <col />
    <thead>
        <tr>
            <th>Term</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td class="ItalicFont">DataFrame</td>
            <td>A DataFrame is a basic Spark SQL concept. A DataFrame is similar to a table in a database: it contains rows of data with columns of varying types. MLlib operates on datasets that are organized in DataFrames.</td>
        </tr>
        <tr>
            <td class="ItalicFont">Pipeline</td>
            <td>In MLlib, you chain together a sequence of algorithms, or <em>stages</em>, that operate on your DataFrame into a <em>pipeline</em> that learns.</td>
        </tr>
        <tr>
            <td class="ItalicFont">Transformer</td>
            <td>An algorithm that transforms a DataFrame into another DataFrame. Each transformer implements a method named <code>transform</code> that converts the DataFrame, typically by appending additional columns to it. A <em>model</em> is a kind of transformer.</td>
        </tr>
        <tr>
            <td class="ItalicFont">Estimator</td>
            <td>A learning algorithm that trains or <em>fits</em> on a DataFrame and produces a <code>model</code>. Each estimator implements a method named <code>fit</code> that produces a model.</td>
        </tr>
    </tbody>
</table>
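
To make the `fit`/`transform` contract concrete, here is a minimal pure-Python sketch. It is not Spark code: the classes `MeanCenterer`, `MeanCenterModel`, and `ToyPipeline` are hypothetical names invented for this illustration, and a "DataFrame" is modeled as a plain list of dictionaries. It assumes every stage is an estimator, whereas a real MLlib pipeline can also contain plain transformers.

```python
# Toy illustration of the estimator/transformer contract (not Spark code).
# A "DataFrame" here is just a list of dicts, one dict per row.

class MeanCenterModel:
    """Transformer: appends a centered copy of a numeric column."""

    def __init__(self, column, mean):
        self.column = column
        self.mean = mean

    def transform(self, rows):
        # Build new rows with one extra column; the input rows are not modified.
        new_col = self.column + "_centered"
        return [dict(row, **{new_col: row[self.column] - self.mean}) for row in rows]


class MeanCenterer:
    """Estimator: learns the column mean and returns a model (a transformer)."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        mean = sum(row[self.column] for row in rows) / len(rows)
        return MeanCenterModel(self.column, mean)


class ToyPipeline:
    """Fits each stage in order, feeding each stage's output to the next stage."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        models = []
        for stage in self.stages:
            model = stage.fit(rows)       # estimator -> model
            rows = model.transform(rows)  # model output feeds the next stage
            models.append(model)
        return models


rows = [{"quantity": 10}, {"quantity": 20}, {"quantity": 30}]
models = ToyPipeline([MeanCenterer("quantity")]).fit(rows)
centered = models[0].transform(rows)
print(centered)
```

The key point is that `fit` returns a new object that carries what was learned (here, the mean), and that object's `transform` appends a column rather than changing the input.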
    

## About our Sample Data

We've obtained some actual shipping data that tracks international shipments between ports, and have imported that data into a Splice Machine schema that we've named `DS_ASN`. The tables of interest are named `SHIPMENT_IN_TRANSIT` and `SHIPMENT_HISTORY`; you'll see these tables used in the sample code below. We also create a database table named `FEATURES` that forms the basis of the DataFrame we use for our learning model; this is the table you'll see featured in this notebook's code. The idea of this model is to predict, in real time, how late a specific shipment will be, based on past data and other factors. As the model is retrained on more data, its predictions can become more accurate.

## About our Learning Model

We use a Logistic Regression *estimator* as the final stage in our pipeline to produce a Logistic Regression Model of lateness from our data, and then deploy that model on a dataset to predict lateness.

The estimator operates on data that is formatted into numeric vectors. Since most of the fields in our input dataset contain string values, we need to convert any data that will be used by the estimator into this format, as you'll see below.

## Create our Splice Machine Database

Before working with MLlib, we need to create a Splice Machine database that contains the shipping data we're using. To do so, we:

1. Connect to your database via JDBC
2. Create the schema and tables
3. Import the data
4. Create our features table


### 1. Connect to your Database via JDBC
First we'll configure the URL we'll use in our JDBC connection to Splice Machine. For this class, you can simply use the `defaultJDBCURL` assignment in the next cell.

In [None]:
defaultJDBCURL = """jdbc:splice://localhost:1527/splicedb;user=splice;password=admin"""



### 2. Create the Schema and Tables

We'll now create our new schema, make it our default schema, and then create the tables for the `shipment_in_transit` and `shipment_history` data that we will import.


In [None]:
%%sql 

CREATE SCHEMA DS_ASN;
SET SCHEMA DS_ASN;


In [None]:
%%sql 

DROP TABLE IF EXISTS SHIPMENT_IN_TRANSIT;
CREATE TABLE SHIPMENT_IN_TRANSIT(
    SHIPMENTID VARCHAR(11) NOT NULL PRIMARY KEY,
    STATUS VARCHAR(50),
    SHIPMODE VARCHAR(30),
    PRODUCT_DESCRIPTION VARCHAR(500),
    CONSIGNEE VARCHAR(200),
    SHIPPER VARCHAR(100),
    ARRIVAL_DATE TIMESTAMP,
    GROSS_WEIGHT_LB INTEGER,
    GROSS_WEIGHT_KG INTEGER,
    FOREIGN_PORT VARCHAR(50),
    US_PORT VARCHAR(50),
    VESSEL_NAME VARCHAR(40),
    COUNTRY_OF_ORIGIN VARCHAR(40),
    CONSIGNEE_ADDRESS VARCHAR(150),
    SHIPPER_ADDRESS VARCHAR(150),
    ZIPCODE VARCHAR(20),
    NO_OF_CONTAINERS INTEGER,
    CONTAINER_NUMBER VARCHAR(200),
    CONTAINER_TYPE VARCHAR(80),
    QUANTITY INTEGER,
    QUANTITY_UNIT VARCHAR(10),
    MEASUREMENT INTEGER,
    MEASUREMENT_UNIT VARCHAR(5),
    BILL_OF_LADING VARCHAR(20),
    HOUSE_VS_MASTER CHAR(1),
    DISTRIBUTION_PORT VARCHAR(40),
    MASTER_BL VARCHAR(20),
    VOYAGE_NUMBER VARCHAR(10),
    SEAL VARCHAR(300),
    SHIP_REGISTERED_IN VARCHAR(40),
    INBOND_ENTRY_TYPE VARCHAR(30),
    CARRIER_CODE VARCHAR(10),
    CARRIER_NAME VARCHAR(40),
    CARRIER_CITY VARCHAR(40),
    CARRIER_STATE VARCHAR(10),
    CARRIER_ZIP VARCHAR(10),
    CARRIER_ADDRESS VARCHAR(200),
    NOTIFY_PARTY VARCHAR(50),
    NOTIFY_ADDRESS VARCHAR(200),
    PLACE_OF_RECEIPT VARCHAR(50),
    DATE_OF_RECEIPT TIMESTAMP
    );



In [None]:
%%sql 
DROP TABLE IF EXISTS SHIPMENT_HISTORY;
CREATE TABLE SHIPMENT_HISTORY(
    SHIPMENTID VARCHAR(11) NOT NULL PRIMARY KEY,
    STATUS VARCHAR(50),
    SHIPMODE VARCHAR(30),
    PRODUCT_DESCRIPTION VARCHAR(500),
    CONSIGNEE VARCHAR(200),
    SHIPPER VARCHAR(100),
    ARRIVAL_DATE TIMESTAMP,
    GROSS_WEIGHT_LB INTEGER,
    GROSS_WEIGHT_KG INTEGER,
    FOREIGN_PORT VARCHAR(50),
    US_PORT VARCHAR(50),
    VESSEL_NAME VARCHAR(40),
    COUNTRY_OF_ORIGIN VARCHAR(40),
    CONSIGNEE_ADDRESS VARCHAR(150),
    SHIPPER_ADDRESS VARCHAR(150),
    ZIPCODE VARCHAR(20),
    NO_OF_CONTAINERS INTEGER,
    CONTAINER_NUMBER VARCHAR(200),
    CONTAINER_TYPE VARCHAR(80),
    QUANTITY INTEGER,
    QUANTITY_UNIT VARCHAR(10),
    MEASUREMENT INTEGER,
    MEASUREMENT_UNIT VARCHAR(5),
    BILL_OF_LADING VARCHAR(20),
    HOUSE_VS_MASTER CHAR(1),
    DISTRIBUTION_PORT VARCHAR(40),
    MASTER_BL VARCHAR(20),
    VOYAGE_NUMBER VARCHAR(10),
    SEAL VARCHAR(300),
    SHIP_REGISTERED_IN VARCHAR(40),
    INBOND_ENTRY_TYPE VARCHAR(30),
    CARRIER_CODE VARCHAR(10),
    CARRIER_NAME VARCHAR(40),
    CARRIER_CITY VARCHAR(40),
    CARRIER_STATE VARCHAR(10),
    CARRIER_ZIP VARCHAR(10),
    CARRIER_ADDRESS VARCHAR(200),
    NOTIFY_PARTY VARCHAR(50),
    NOTIFY_ADDRESS VARCHAR(200),
    PLACE_OF_RECEIPT VARCHAR(50),
    DATE_OF_RECEIPT TIMESTAMP
);


### 3. Import the Data

Next we import the shipping data, which is in CSV format, into our Splice Machine database.


In [None]:
%%sql 
call SYSCS_UTIL.IMPORT_DATA (
     'DS_ASN',
     'SHIPMENT_IN_TRANSIT',
     null,
     's3a://splice-demo/shipment/shipment_in_transit.csv',
     '|',
     null,
     'yyyy-MM-dd HH:mm:ss.SSSSSS',
     'yyyy-MM-dd',
     null,
     -1,
     '/tmp',
     true, null);

In [None]:
%%sql 
call SYSCS_UTIL.IMPORT_DATA (
     'DS_ASN',
     'SHIPMENT_HISTORY',
     null,
     's3a://splice-demo/shipment/shipment_history.csv',
     '|',
     null,
     'yyyy-MM-dd HH:mm:ss.SSSSSS',
     'yyyy-MM-dd',
     null,
     -1,
     '/tmp',
     true, null);

### 4. Create our Features Table

We create a features table in our database that we use with our learning model. We add three computed fields in the `features` table that are important to our model:

* `quantity_bin` categorizes shipping quantities into bins, to improve learning accuracy 
* `lateness` computes how many days late a shipment was
* `label` categorizes lateness into one of four values:

<table class="spliceZepNoBorder" style="margin: 0 0 100px 50px;">
    <tbody>
            <tr><td>0</td><td>Not late (0 days or fewer)</td></tr>
            <tr><td>1</td><td>Up to 5 days late</td></tr>
            <tr><td>2</td><td>More than 5, up to 10 days late</td></tr>
            <tr><td>3</td><td>More than 10 days late</td></tr>
    </tbody>
</table>
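
As a rough illustration of the binning rules, the following pure-Python sketch mirrors the nested `CASE` expressions used to build the features table: quantity thresholds at 10, 100, and 1000, and lateness thresholds at 0, 5, and 10 days. The function names `quantity_bin` and `lateness_label` are hypothetical helpers, not part of any Splice Machine or Spark API, and treating a missing value as bin 0 is an assumption based on how a SQL comparison with `NULL` falls through to the `ELSE` branch.

```python
def quantity_bin(quantity):
    """Bin a shipment quantity into 0-3 using thresholds 10, 100, and 1000."""
    if quantity is None:
        return 0  # a NULL comparison is never true in SQL, so ELSE 0 applies
    if quantity > 1000:
        return 3
    if quantity > 100:
        return 2
    if quantity > 10:
        return 1
    return 0


def lateness_label(days_late):
    """Map lateness in days to a label 0-3 using thresholds 0, 5, and 10."""
    if days_late is None:
        return 0
    if days_late > 10:
        return 3
    if days_late > 5:
        return 2
    if days_late > 0:
        return 1
    return 0


for days in (0, 3, 5, 6, 10, 11):
    print(days, "->", lateness_label(days))
```

Note that the boundaries are strict: a shipment exactly 5 days late gets label 1, and one exactly 10 days late gets label 2.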

In [None]:
%%sql 
drop table IF EXISTS DS_ASN.FEATURES;
CREATE table DS_ASN.FEATURES AS
    SELECT 
    SHIPMENTID,
    STATUS,
    SHIPMODE,
    PRODUCT_DESCRIPTION,
    CONSIGNEE,
    SHIPPER,
    ARRIVAL_DATE,
    GROSS_WEIGHT_LB,
    GROSS_WEIGHT_KG,
    FOREIGN_PORT,
    US_PORT,
    VESSEL_NAME,
    COUNTRY_OF_ORIGIN,
    CONSIGNEE_ADDRESS,
    SHIPPER_ADDRESS,
    ZIPCODE,
    NO_OF_CONTAINERS,
    CONTAINER_NUMBER,
    CONTAINER_TYPE,
    QUANTITY,
    QUANTITY_UNIT,
    MEASUREMENT,
    MEASUREMENT_UNIT,
    BILL_OF_LADING,
    HOUSE_VS_MASTER,
    DISTRIBUTION_PORT,
    MASTER_BL,
    VOYAGE_NUMBER,
    SEAL,
    SHIP_REGISTERED_IN,
    INBOND_ENTRY_TYPE,
    CARRIER_CODE,
    CARRIER_NAME,
    CARRIER_CITY,
    CARRIER_STATE,
    CARRIER_ZIP,
    CARRIER_ADDRESS,
    NOTIFY_PARTY,
    NOTIFY_ADDRESS,
    PLACE_OF_RECEIPT,
    DATE_OF_RECEIPT,
    CASE
    WHEN DS_ASN.SHIPMENT_HISTORY.QUANTITY > 10
    THEN
        CASE
            WHEN DS_ASN.SHIPMENT_HISTORY.QUANTITY > 100
            THEN
                CASE
                    WHEN DS_ASN.SHIPMENT_HISTORY.QUANTITY > 1000
                    THEN 3
                    ELSE 2
                END
            ELSE 1
    END
    ELSE 0
    END AS QUANTITY_BIN,
    DS_ASN.SHIPMENT_HISTORY.DATE_OF_RECEIPT - DS_ASN.SHIPMENT_HISTORY.ARRIVAL_DATE as LATENESS,
    CASE
    WHEN  DS_ASN.SHIPMENT_HISTORY.DATE_OF_RECEIPT - DS_ASN.SHIPMENT_HISTORY.ARRIVAL_DATE > 0
    THEN
        CASE
            WHEN  DS_ASN.SHIPMENT_HISTORY.DATE_OF_RECEIPT - DS_ASN.SHIPMENT_HISTORY.ARRIVAL_DATE > 5
            THEN
                CASE
                    WHEN  DS_ASN.SHIPMENT_HISTORY.DATE_OF_RECEIPT - DS_ASN.SHIPMENT_HISTORY.ARRIVAL_DATE > 10
                    THEN 3
                    ELSE 2
                END
            ELSE 1
    END
    ELSE 0
    END AS LABEL
FROM DS_ASN.SHIPMENT_HISTORY;

## Create, Train, and Deploy our Learning Model

The remainder of this notebook walks you through the code we use to create, train, and deploy our learning model, in these steps:

1. Perform Spark + MLlib Setup Tasks
2. Create our DataFrame
3. Create Pipeline Stages
4. Assemble our Pipeline
5. Train our Model
6. Materialize the Model

We include the entire program at the end of this notebook.

### 1. Perform Spark + MLlib Setup Tasks

The Python Splice Machine API communicates with your database through the `PySpliceContext` class.

We use the following code to import our modules and instantiate `PySpliceContext`:

```
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

splice = PySpliceContext(spark, defaultJDBCURL)
```

### 2. Create our DataFrame

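Now we create a Spark DataFrame from the results of a SQL `SELECT` query against the database. This allows us to manipulate our Splice Machine data as a Spark DataFrame. We keep only the columns our model uses, plus the `label` column, and rename them to lowercase:

```
# splice is the PySpliceContext instance created above
df_with_uppercase_schema = splice.df("select * from DS_ASN.FEATURES")
newNames = [
    "consignee",
    "shipper",
    "shipmode",
    "gross_weight_lb",
    "foreign_port",
    "us_port",
    "vessel_name",
    "country_of_origin",
    "container_number",
    "container_type",
    "quantity",
    "ship_registered_in",
    "carrier_code",
    "carrier_city",
    "notify_party",
    "place_of_receipt",
    "zipcode",
    "quantity_bin",
    "label"
    ]
# Keep only the columns of interest (Spark resolves column names
# case-insensitively by default), then rename them to lowercase
df = df_with_uppercase_schema.select(*newNames).toDF(*newNames)
```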

### 3. Create Pipeline Stages

Our pipeline stages are fairly simple:

* Transform the string values in each row of the input dataset into numeric indexes.
* Assemble the indexed columns into a single feature vector.
* Use a Logistic Regression estimator to create our model.

#### Transform each row of data into numbers

The Logistic Regression estimator operates on numeric vectors, so we need to convert the string values in each row of our input DataFrame into numbers. Remember that each row contains only the fields from our database that are of interest to our model: the fields that we previously listed in `newNames` when creating our DataFrame.

Spark includes a `StringIndexer` class that does exactly that: it maps each distinct string in a column to a numeric index. We create a `StringIndexer` for each field, and we'll later use each of these as a stage in our learning pipeline. The `StringIndexer` transforms the data from a specified input column in our DataFrame and stores the output in a specified new output column. By convention, we name each string indexer with the name of the field + `Indexer`, and name the output column with the name of the field + `Index`; for example, we create an indexer named `consigneeIndexer` to transform the input column `consignee` into the new output column `consigneeIndex`.

```
# Transform strings into numbers
consigneeIndexer =  StringIndexer(inputCol="consignee", outputCol="consigneeIndex", handleInvalid="skip")
shipperIndexer = StringIndexer(inputCol="shipper", outputCol="shipperIndex", handleInvalid="skip")
shipmodeIndexer = StringIndexer(inputCol="shipmode", outputCol="shipmodeIndex", handleInvalid="skip")
gross_weight_lbIndexer = StringIndexer(inputCol="gross_weight_lb", outputCol="gross_weight_lbIndex", handleInvalid="skip")
foreign_portIndexer =  StringIndexer(inputCol="foreign_port", outputCol="foreign_portIndex", handleInvalid="skip")
us_portIndexer = StringIndexer(inputCol="us_port", outputCol="us_portIndex", handleInvalid="skip")
vessel_nameIndexer = StringIndexer(inputCol="vessel_name", outputCol="vessel_nameIndex",  handleInvalid="skip")
country_of_originIndexer = StringIndexer(inputCol="country_of_origin", outputCol="country_of_originIndex",  handleInvalid="skip")
container_numberIndexer = StringIndexer(inputCol="container_number", outputCol="container_numberIndex", handleInvalid="skip")
container_typeIndexer = StringIndexer(inputCol="container_type", outputCol="container_typeIndex", handleInvalid="skip")
ship_registered_inIndexer = StringIndexer(inputCol="ship_registered_in", outputCol="ship_registered_inIndex", handleInvalid="skip")
carrier_codeIndexer = StringIndexer(inputCol="carrier_code", outputCol="carrier_codeIndex", handleInvalid="skip")
carrier_cityIndexer = StringIndexer(inputCol="carrier_city", outputCol="carrier_cityIndex", handleInvalid="skip")
notify_partyIndexer = StringIndexer(inputCol="notify_party", outputCol="notify_partyIndex", handleInvalid="skip")
place_of_receiptIndexer = StringIndexer(inputCol="place_of_receipt", outputCol="place_of_receiptIndex", handleInvalid="skip")
zipcodeIndexer = StringIndexer(inputCol="zipcode", outputCol="zipcodeIndex", handleInvalid="skip")
```
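
To show what string indexing does to a column, here is a small pure-Python sketch. It assumes frequency-descending ordering (the most frequent value gets index 0), which is the documented default for `StringIndexer`, with ties broken alphabetically. The helper names `fit_string_index` and `transform_skip_invalid` and the sample ship modes are hypothetical and are not taken from the shipping dataset.

```python
from collections import Counter


def fit_string_index(values):
    """Learn a value -> index mapping; the most frequent value gets 0.0."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties (assumption).
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}


def transform_skip_invalid(values, mapping):
    """Index the values, dropping any value that was not seen during fitting.

    This loosely imitates handleInvalid="skip", which filters out rows
    containing unseen labels instead of raising an error.
    """
    return [mapping[value] for value in values if value in mapping]


shipmodes = ["SEA", "AIR", "SEA", "RAIL", "SEA", "AIR"]  # hypothetical sample values
mapping = fit_string_index(shipmodes)
print(mapping)
print(transform_skip_invalid(["AIR", "TRUCK", "SEA"], mapping))
```

Because the indexes reflect frequency rather than any natural order, they are category identifiers, not measurements.
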
#### Assemble the Vectors

After our pipeline has transformed data into numbers, we need to assemble those into vectors. Spark includes a `VectorAssembler` object that does just that, transforming a set of input columns into a vector that is stored in the `features` column in the DataFrame:

```
# Assemble raw features
assembler = VectorAssembler(inputCols=[
                    "shipmodeIndex",
                    "consigneeIndex",
                    "shipperIndex",
                    "gross_weight_lbIndex",
                    "foreign_portIndex",
                    "us_portIndex",
                    "vessel_nameIndex",
                    "country_of_originIndex",
                    "container_numberIndex",
                    "container_typeIndex",
                    "ship_registered_inIndex",
                    "carrier_codeIndex",
                    "carrier_cityIndex",
                    "notify_partyIndex",
                    "place_of_receiptIndex",
                    "zipcodeIndex",
                    "quantity_bin"
                    ], outputCol='features')
```
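
Conceptually, assembling is just concatenating the chosen columns, in order, into one numeric vector per row. The pure-Python sketch below illustrates that idea only; the `assemble` function and the sample row are hypothetical, and a real `VectorAssembler` produces Spark vector objects rather than Python lists.

```python
def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the input columns packed into one vector."""
    vector = [float(row[col]) for col in input_cols]
    return dict(row, **{output_col: vector})


# Hypothetical row, as it might look after the string indexers have run
row = {"shipmodeIndex": 1.0, "consigneeIndex": 4.0, "quantity_bin": 2}
assembled = assemble(row, ["shipmodeIndex", "consigneeIndex", "quantity_bin"])
print(assembled["features"])
```

The order of `input_cols` fixes the position of each feature in the vector, so it must be the same at training time and at prediction time.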

#### Create the Estimator

Creating the estimator is a simple matter of specifying a few parameters, including which column in the DataFrame is the label, and which column contains the feature set:

```
# Create the logistic regression estimator
lr = LogisticRegression(maxIter=30, labelCol="label", featuresCol="features", regParam=0.3)
```


### 4. Assemble our Pipeline

```
# Chain the indexers, the assembler, and the estimator in a Pipeline
lrPipeline = Pipeline(stages=
        [consigneeIndexer,
        shipperIndexer,
        shipmodeIndexer,
        gross_weight_lbIndexer,
        foreign_portIndexer,
        us_portIndexer,
        vessel_nameIndexer,
        country_of_originIndexer,
        container_numberIndexer,
        container_typeIndexer,
        ship_registered_inIndexer,
        carrier_codeIndexer,
        carrier_cityIndexer,
        notify_partyIndexer,
        place_of_receiptIndexer,
        zipcodeIndexer,
        assembler,
        lr]
        )
```

### 5. Train our Model

Now that our pipeline is set up, all we need to do to train our model is feed our DataFrame into the pipeline's `fit` method, which learns from the data.

```
# Train the model
lrModel = lrPipeline.fit(df)
```


### 6. Materialize the Model

After training our model, we can apply it to real data and display the results. For simplicity's sake, in this example we'll simply apply the model to our features table itself.

```
lrModel.transform(df).select("prediction", "probability", "features").show(100)
```
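
The `probability` and `prediction` columns are related in a simple way for a multiclass logistic regression: the per-class scores are turned into probabilities with a softmax, and, assuming no custom thresholds are set, the prediction is the class with the highest probability. The sketch below illustrates that relationship in pure Python; the score values are made up for the example, and `softmax` and `predict` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict(probabilities):
    """Return the index of the most probable class, as a float label."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return float(best)


raw_scores = [2.0, 0.5, 0.1, -1.0]  # made-up scores for labels 0 to 3
probabilities = softmax(raw_scores)
print(probabilities, predict(probabilities))
```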

In [None]:
%%sql 
select *  from DS_ASN.features { limit 100 }


## The Python Code

Our Python code is listed in the next cell.


<p class="noteNote">You can ignore the <code>RuntimeWarning:</code> warning messages that may display when you run the code in the next cell.</p>


In [None]:

from __future__ import print_function
import string
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import DataFrame
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator


# Query Features
#splice was created above
query_results = splice.df("select * from DS_ASN.Features")
newNames = ["shipmentid",
            "status",
            "shipmode",
            "product_description",
            "consignee",
            "shipper",
            "arrival_date",
            "gross_weight_lb",
            "gross_weight_kg",
            "foreign_port",
            "us_port",
            "vessel_name",
            "country_of_origin",
            "consignee_address",
            "shipper_address",
            "zipcode",
            "no_of_containers",
            "container_number",
            "container_type",
            "quantity",
            "quantity_unit",
            "measurement",
            "measurement_unit",
            "bill_of_lading",
            "house_vs_master",
            "distribution_port",
            "master_bl",
            "voyage_number",
            "seal",
            "ship_registered_in",
            "inbond_entry_type",
            "carrier_code",
            "carrier_name",
            "carrier_city",
            "carrier_state",
            "carrier_zip",
            "carrier_address",
            "notify_party",
            "notify_address",
            "place_of_receipt",
            "date_of_receipt",
            "quantity_bin",
            "lateness",
            "label"]

df = query_results.toDF(*newNames)

# Assemble Vectors
assembler = VectorAssembler(inputCols=[
    "shipmodeIndex",
    "consigneeIndex",
    "shipperIndex",
    "gross_weight_lbIndex",
    "foreign_portIndex",
    "us_portIndex",
    "vessel_nameIndex",
    "country_of_originIndex",
    "container_numberIndex",
    "container_typeIndex",
    "ship_registered_inIndex",
    "carrier_codeIndex",
    "carrier_cityIndex",
    "notify_partyIndex",
    "place_of_receiptIndex",
    "zipcodeIndex",
    "quantity_bin"
], outputCol='features')

# Transform strings into numbers
zipcodeIndexer = StringIndexer(inputCol="zipcode", outputCol="zipcodeIndex", handleInvalid="skip")
consigneeIndexer = StringIndexer(inputCol="consignee", outputCol="consigneeIndex", handleInvalid="skip")
shipperIndexer = StringIndexer(inputCol="shipper", outputCol="shipperIndex", handleInvalid="skip")
statusIndexer = StringIndexer(inputCol="status", outputCol="statusIndex", handleInvalid="skip")
shipmodeIndexer = StringIndexer(inputCol="shipmode", outputCol="shipmodeIndex", handleInvalid="skip")
gross_weight_lbIndexer = StringIndexer(inputCol="gross_weight_lb", outputCol="gross_weight_lbIndex",
                                       handleInvalid="skip")
foreign_portIndexer = StringIndexer(inputCol="foreign_port", outputCol="foreign_portIndex", handleInvalid="skip")
us_portIndexer = StringIndexer(inputCol="us_port", outputCol="us_portIndex", handleInvalid="skip")
vessel_nameIndexer = StringIndexer(inputCol="vessel_name", outputCol="vessel_nameIndex", handleInvalid="skip")
country_of_originIndexer = StringIndexer(inputCol="country_of_origin", outputCol="country_of_originIndex",
                                         handleInvalid="skip")
container_numberIndexer = StringIndexer(inputCol="container_number", outputCol="container_numberIndex",
                                        handleInvalid="skip")
container_typeIndexer = StringIndexer(inputCol="container_type", outputCol="container_typeIndex",
                                      handleInvalid="skip")
distribution_portIndexer = StringIndexer(inputCol="distribution_port", outputCol="distribution_portIndex",
                                         handleInvalid="skip")
ship_registered_inIndexer = StringIndexer(inputCol="ship_registered_in", outputCol="ship_registered_inIndex",
                                          handleInvalid="skip")
inbond_entry_typeIndexer = StringIndexer(inputCol="inbond_entry_type", outputCol="inbond_entry_typeIndex",
                                         handleInvalid="skip")
carrier_codeIndexer = StringIndexer(inputCol="carrier_code", outputCol="carrier_codeIndex", handleInvalid="skip")
carrier_cityIndexer = StringIndexer(inputCol="carrier_city", outputCol="carrier_cityIndex", handleInvalid="skip")
carrier_stateIndexer = StringIndexer(inputCol="carrier_state", outputCol="carrier_stateIndex", handleInvalid="skip")
carrier_zipIndexer = StringIndexer(inputCol="carrier_zip", outputCol="carrier_zipIndex", handleInvalid="skip")
notify_partyIndexer = StringIndexer(inputCol="notify_party", outputCol="notify_partyIndex", handleInvalid="skip")
place_of_receiptIndexer = StringIndexer(inputCol="place_of_receipt", outputCol="place_of_receiptIndex",
                                        handleInvalid="skip")

lr = LogisticRegression(maxIter=30, labelCol="label", featuresCol="features", regParam=0.3)

lrPipeline = Pipeline(stages=
                      [consigneeIndexer,
                       shipperIndexer,
                       shipmodeIndexer,
                       gross_weight_lbIndexer,
                       foreign_portIndexer,
                       us_portIndexer,
                       vessel_nameIndexer,
                       country_of_originIndexer,
                       container_numberIndexer,
                       container_typeIndexer,
                       ship_registered_inIndexer,
                       carrier_codeIndexer,
                       carrier_cityIndexer,
                       notify_partyIndexer,
                       place_of_receiptIndexer,
                       zipcodeIndexer,
                       assembler,
                       lr]
                      )
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

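# The label has four classes (0-3), so we evaluate with a multiclass evaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

crossval = CrossValidator(estimator=lrPipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                          numFolds=2)  # use 3+ folds in practice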

# Run cross-validation and choose the best parameters, to demonstrate grid search in Spark MLlib
lrModel = crossval.fit(df)

transformed_df = lrModel.transform(df)
transformed_df.createOrReplaceTempView("res_view")
# spark is the SparkSession created in the setup cell
results = spark.sql('SELECT prediction, probability, features from res_view')
results.show(20)


In [None]:
%%sql 
drop table IF EXISTS DS_ASN.TEST_FEATURES;
CREATE table DS_ASN.TEST_FEATURES AS
    SELECT 
    SHIPMENTID,
    STATUS,
    SHIPMODE,
    PRODUCT_DESCRIPTION,
    CONSIGNEE,
    SHIPPER,
    ARRIVAL_DATE,
    GROSS_WEIGHT_LB,
    GROSS_WEIGHT_KG,
    FOREIGN_PORT,
    US_PORT,
    VESSEL_NAME,
    COUNTRY_OF_ORIGIN,
    CONSIGNEE_ADDRESS,
    SHIPPER_ADDRESS,
    ZIPCODE,
    NO_OF_CONTAINERS,
    CONTAINER_NUMBER,
    CONTAINER_TYPE,
    QUANTITY,
    QUANTITY_UNIT,
    MEASUREMENT,
    MEASUREMENT_UNIT,
    BILL_OF_LADING,
    HOUSE_VS_MASTER,
    DISTRIBUTION_PORT,
    MASTER_BL,
    VOYAGE_NUMBER,
    SEAL,
    SHIP_REGISTERED_IN,
    INBOND_ENTRY_TYPE,
    CARRIER_CODE,
    CARRIER_NAME,
    CARRIER_CITY,
    CARRIER_STATE,
    CARRIER_ZIP,
    CARRIER_ADDRESS,
    NOTIFY_PARTY,
    NOTIFY_ADDRESS,
    PLACE_OF_RECEIPT,
    DATE_OF_RECEIPT,
    CASE
    WHEN DS_ASN.SHIPMENT_IN_TRANSIT.QUANTITY > 10
    THEN
        CASE
            WHEN DS_ASN.SHIPMENT_IN_TRANSIT.QUANTITY > 100
            THEN
                CASE
                    WHEN DS_ASN.SHIPMENT_IN_TRANSIT.QUANTITY > 1000
                    THEN 3
                    ELSE 2
                END
            ELSE 1
        END
    ELSE 0
    END AS QUANTITY_BIN
    FROM DS_ASN.SHIPMENT_IN_TRANSIT;
    


 
## Testing the Code

Now we'll test our code on the `TEST_FEATURES` table:

In [None]:

test_data_with_uppercase_schema = splice.df('SELECT * FROM DS_ASN.TEST_FEATURES')

newNames = ["shipmentid",
            "status",
            "shipmode",
            "product_description",
            "consignee",
            "shipper",
            "arrival_date",
            "gross_weight_lb",
            "gross_weight_kg",
            "foreign_port",
            "us_port",
            "vessel_name",
            "country_of_origin",
            "consignee_address",
            "shipper_address",
            "zipcode",
            "no_of_containers",
            "container_number",
            "container_type",
            "quantity",
            "quantity_unit",
            "measurement",
            "measurement_unit",
            "bill_of_lading",
            "house_vs_master",
            "distribution_port",
            "master_bl",
            "voyage_number",
            "seal",
            "ship_registered_in",
            "inbond_entry_type",
            "carrier_code",
            "carrier_name",
            "carrier_city",
            "carrier_state",
            "carrier_zip",
            "carrier_address",
            "notify_party",
            "notify_address",
            "place_of_receipt",
            "date_of_receipt",
            "quantity_bin"]

df = test_data_with_uppercase_schema.toDF(*newNames)
lrPredictions = lrModel.transform(df)
lrPredictions.createOrReplaceTempView("pred_view")
# spark is the SparkSession created in the setup cell
results = spark.sql('SELECT prediction, probability, features FROM pred_view ORDER BY features')
results.show(20)

In [None]:
%%sql 
DROP TABLE IF EXISTS DS_ASN.PREDICTIONS;
CREATE TABLE DS_ASN.PREDICTIONS (
    SHIPMENTID VARCHAR(11) NOT NULL PRIMARY KEY,
    PREDICTION DOUBLE
    );

In [None]:

predictions = lrPredictions.select("SHIPMENTID", "PREDICTION")
predictions.printSchema()
splice.insert(predictions, 'DS_ASN.predictions')

In [None]:
%%sql 
select * from DS_ASN.predictions;

## Where to Go Next
Now we're ready to move on to exploring other examples of Machine Learning with Splice Machine, starting with our [*KMeans Example*](./g.%20KMeans%20Example.ipynb).