$$\huge{\text{PL/Python Tutorial}}$$

This notebook serves as a hands-on introduction to the data science pipeline, focusing on the use of **procedural languages (PL/Python)**.  Using a single dataset throughout, it begins with loading the data into a Greenplum Database (GPDB), then proceeds to data exploration, feature engineering, model development, and finally model evaluation.

We’ll be using the publicly available [Abalone dataset from the University of California, Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/abalone).  The dataset contains nine columns: eight measured attributes plus the `Rings` column, from which we derive our prediction target.

| Column Name | Data Type | Description |
| --- | :---: | ---: |
| Sex | text | M, F, or I (infant) |
| Length | float | Longest shell measurement |
| Diameter | float | Perpendicular to length |
| Height | float | With meat in shell |
| Whole weight | float | Whole abalone |
| Shucked weight | float | Weight of meat only |
| Viscera weight | float | Gut weight (after bleeding) |
| Shell weight | float | Post-drying |
| Rings | integer | +1.5 gives the age in years |

All of the code to conduct this enterprise has already been filled in for you. You should feel free to make as many comments and notes as is helpful for your future self (you can make an inline comment by beginning a line with the `#` character).

# Set Up Your Notebook Environment

In [None]:
# this command allows for visualizations to appear 
# in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

import math
import inspect
import six
import pandas as pd
from sqlalchemy import create_engine

In [None]:
pd.set_option('display.max_columns', 200)

# Connect to Database

To get to the analytics sections more quickly, establishing the SQL connection and loading the data into GPDB are handled behind the scenes by a helper function from a custom module called `dbconnect`. This module should be in the same folder as this notebook, in a file called `dbconnect.py`.

In [None]:
import dbconnect

A prerequisite to establishing the SQL connection to GPDB is a set of credentials stored in a credentials file (`db_credentials.txt` in the connection cell further below).  The file contents should look something like the following.

    [database_creds]
    host: <HOSTNAME_OR_IP>
    port: 5432
    user: <USERNAME>
    database: <DATABASE_NAME>
    password: <PASSWORD>

The values in angle brackets (\<...\>) are placeholders that need to be filled in. For example:

    [database_creds]
    host: 1.2.3.4
    port: 5432
    user: scott
    database: practice_db
    password: my_$ecretP@ss

Running the `connect_and_register_sql_magic()` function below will add a global variable `conn` that is a SQLAlchemy connection object.

In [None]:
db_credential_file = 'db_credentials.txt'
dbconnect.connect_and_register_sql_magic(
    db_credential_file,
    conn_name='conn'
)
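The internals of `dbconnect` are not shown in this notebook. As a rough sketch only, a credential file in the format above could be parsed with the Python standard library and turned into a SQLAlchemy-style connection URL as follows. The function name `build_db_url` and the `postgresql://` URL scheme are assumptions made for illustration; they are not the helper's actual implementation.

```python
import configparser
from urllib.parse import quote_plus


def build_db_url(cred_text, section="database_creds"):
    """Hypothetical helper: turn credential-file text into a connection URL."""
    # interpolation=None keeps characters such as '%' in passwords literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cred_text)
    creds = parser[section]
    # quote_plus escapes characters like '@' and '$' that would otherwise
    # be misread as URL syntax
    return "postgresql://{user}:{password}@{host}:{port}/{database}".format(
        user=quote_plus(creds["user"]),
        password=quote_plus(creds["password"]),
        host=creds["host"],
        port=creds["port"],
        database=creds["database"],
    )


# Same placeholder values as the example credential file above
example_creds = """
[database_creds]
host: 1.2.3.4
port: 5432
user: scott
database: practice_db
password: my_$ecretP@ss
"""

example_url = build_db_url(example_creds)
print(example_url)
```

A URL of this shape is what `sqlalchemy.create_engine` expects as its first argument, which is why the password is escaped before being embedded.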

## Load Abalone Data Locally

An [abalone](https://simple.wikipedia.org/wiki/Abalone) is a salt water univalve mollusc.

We'll load the data straight from the UCI Machine Learning Repository and then start looking at it.

In [None]:
abalone_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone_columns = (
    'sex',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'rings'
)
len(abalone_columns)

In [None]:
df_abalone = pd.read_csv(abalone_data_url, names=abalone_columns)

In [None]:
df_abalone.info()

## Looking at the Target
We're interested in estimating the age of the abalone in the data.  To get age, add 1.5 to the number of rings.  A good place to begin is to create a histogram of the target variable.

In [None]:
(df_abalone.rings + 1.5).hist(bins=30)

In [None]:
((df_abalone.rings + 1.5) >= 3).value_counts()

In [None]:
_cumsum = (df_abalone.rings + 1.5).value_counts().sort_index().cumsum()
_cumsum / df_abalone.shape[0]

## Upload data to database

### `psql` approach

`psql` is an alternative to Python for running SQL commands and for uploading a small data set. If you have it installed or are logged into the Greenplum master node, you could use the following commands from the command line to copy the data into the database. However, if you are running this notebook in Python, you can skip ahead to the code that uses Python to upload the data.

    psql --dbname=DBNAME --host=HOSTNAME --port=6432 --username=USER --password

    \copy ds_training.abalone (sex, length, diameter, height, whole_weight, shucked_weight, viscera_weight, shell_weight, rings) FROM 'data-science-training/input/abalone.data' DELIMITER ','

After the `psql` `\copy` command you'd need to add an ID column for the following exercises.

### Python pandas approach

The following Python code uploads the data (currently held only in the local `df_abalone` dataframe) into the database.

In [None]:
schema = 'ds_training_plpy'

In [None]:
%read_sql DROP SCHEMA IF EXISTS {schema} CASCADE;
%read_sql CREATE SCHEMA {schema};

In [None]:
df_abalone.to_sql(
    'abalone', 
    conn, 
    schema=schema, 
    if_exists='replace', 
    index=True, 
    index_label='id',
    chunksize=10000)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone
LIMIT 10

# Data pre-processing

## Define target (age)

Our first order of business is to generate our prediction target.  This is a two-step process. We’ll create a new column in our data table (“age”) by adding 1.5 to the “rings” column to generate the abalone age.  We’ll then create another column (“mature”) denoting abalone maturity as either 1 or 0, based on whether the age is greater than or equal to 10 years.

A second transformation is added to the query below to streamline later processing. The `sex` column has three possible values: "M" for male, "F" for female, and "I" for infant. The function we will use later to one-hot encode the column works better on lower-cased characters, so let's convert the `sex` field to lowercase while building the new table.

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_target;
CREATE TABLE {schema}.abalone_target
AS
SELECT 
    id,
    lower(sex) as sex,
    "length",
    diameter,
    height,
    whole_weight,
    shucked_weight,
    viscera_weight,
    shell_weight,
    rings,
    rings + 1.5 as age,
    CASE WHEN 
            (rings + 1.5) >= 10.0
        THEN 1
        ELSE 0
    END as mature
FROM {schema}.abalone
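For comparison, the same two derived columns and the lower-casing can be sketched locally in pandas. The three rows below are made-up values used only to show the transformation, and `toy` / `toy_target` are illustrative names rather than objects used elsewhere in this notebook.

```python
import pandas as pd

# Three made-up rows standing in for the abalone table
toy = pd.DataFrame({"sex": ["M", "F", "I"], "rings": [7, 9, 15]})

toy_target = toy.assign(
    sex=toy["sex"].str.lower(),  # mirrors lower(sex)
    age=toy["rings"] + 1.5,      # mirrors rings + 1.5 AS age
)
# mirrors CASE WHEN (rings + 1.5) >= 10.0 THEN 1 ELSE 0 END AS mature
toy_target["mature"] = (toy_target["age"] >= 10.0).astype(int)

print(toy_target)
```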

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_target 
LIMIT 5

In [None]:
%%read_sql
SELECT sum(mature), count(*)
FROM {schema}.abalone_target 


## Encode categorical variables

The next thing is to leverage [MADlib to one-hot encode](http://madlib.apache.org/docs/latest/group__grp__encode__categorical.html) the “sex” column which is a categorical variable.  In order to create a predictive model, we need all our columns to be numerical values.  Making sure all our model inputs conform to this standard is an important part of the data science modeling pipeline and is considered part of the preprocessing/data cleaning step of the process.

In [None]:
%%read_sql
SELECT
madlib.encode_categorical_variables (
        '{schema}.abalone_target',  -- input table
        '{schema}.abalone_encoded',  -- output table
        'sex',   -- categorical_cols
        NULL,  --categorical_cols_to_exclude    -- Optional
        NULL,  --row_id,                         -- Optional
        NULL,  --top,                            -- Optional
        NULL,  --value_to_drop,                  -- Optional
        NULL,  --encode_null,                    -- Optional
        NULL,  --output_type,                    -- Optional
        NULL,  --output_dictionary,              -- Optional
        NULL  --distributed_by                  -- Optional
    )
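To get a feel for what one-hot encoding produces, here is a small local sketch using `pandas.get_dummies` on made-up rows. It is an analogy rather than a description of MADlib's internals; the resulting column names happen to match the `sex_f`, `sex_i`, and `sex_m` columns used in later queries of this notebook.

```python
import pandas as pd

# Made-up rows; only the `sex` column is categorical
toy = pd.DataFrame({"id": [0, 1, 2], "sex": ["m", "f", "i"]})

# Each distinct value of `sex` becomes its own 0/1 indicator column
toy_encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

print(toy_encoded)
```

Each row has exactly one indicator set to 1, which is why one of the three columns can later be dropped without losing information.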

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_encoded
LIMIT 5

# Explore data

The next step through the modeling process is to explore our data.  We’ll again use some of MADlib’s built-in functionality to generate [descriptive statistics](http://madlib.apache.org/docs/latest/group__grp__summary.html) of our data.  This will generate important information about the data including count, number of missing values, the mean, median, maximum, minimum, interquartile range, mode, and variance.

Note that you only want to do this after converting categorical data to numeric data because otherwise the statistics will not be computed correctly.


In [None]:
%%read_sql
SELECT
madlib.summary ( 
    '{schema}.abalone_encoded',  -- source_table,
    '{schema}.abalone_summary',  -- output_table,
    NULL,  -- target_cols,
    NULL  -- grouping_cols,
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_summary
LIMIT 15

In [None]:
%%read_sql
SELECT target_column
from {schema}.abalone_summary

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_encoded
limit 3

Another aspect of the data that we might want to know about is the correlation between different columns.  We turn again to MADlib, which provides a ready-made function: [correlation()](http://madlib.apache.org/docs/latest/group__grp__correlation.html).

In [None]:
%%read_sql
SELECT
madlib.correlation(
    '{schema}.abalone_encoded', -- source_table,
    '{schema}.abalone_correlations', -- output_table,
    'length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings', -- target_cols,
    TRUE, -- verbose,
    'sex_f,sex_i,sex_m'  -- grouping_columns
)

In [None]:
%%read_sql
SELECT * 
from {schema}.abalone_correlations
ORDER BY sex_m, sex_f, sex_i, column_position

Ensuring predictive power depends in large part on creating a hold-out data set that we don’t train our model with.  By creating this subset of the data, we can test any model we develop against “unseen” data to guard against overfitting.  This has the benefit of producing a predictive model that will generalize better.

There’s no right answer as to how much data to set aside in the test table; a 70-30% split, weighted towards the training data, is a good rule of thumb.  

In the MADlib tutorial we showed how to use its [train_test_split](http://madlib.apache.org/docs/latest/group__grp__train__test__split.html) method. Here we will show an alternative to do it in pure SQL using the `random` function. 

>*Just for illustration purposes, not necessarily better. Note that the number of records that end up in the train and test portions is non-deterministic and will vary.*


In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_eval CASCADE;
CREATE TABLE {schema}.abalone_eval
AS
SELECT
    *,
    random() >= 0.7 as test
FROM 
    {schema}.abalone_encoded
;
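The same idea can be sketched locally with NumPy to see why the split sizes vary from run to run. This is only an illustration of the `random() >= 0.7` rule; the function name `random_split` and the seeds are assumptions, and 4177 is the row count of the UCI abalone file.

```python
import numpy as np


def random_split(n_rows, threshold=0.7, seed=None):
    """Return a boolean array: True marks a row assigned to the test set."""
    rng = np.random.default_rng(seed)
    # Each row independently draws a uniform number in [0, 1);
    # draws at or above the threshold go to the test set (about 30%)
    return rng.random(n_rows) >= threshold


n_rows = 4177
is_test = random_split(n_rows, seed=0)
is_test_again = random_split(n_rows, seed=1)

# The two counts are close to 30% of the rows but generally not identical
print(is_test.sum(), is_test_again.sum())
```

Because each row is assigned independently, only the expected proportion is fixed, not the exact number of test rows.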

In [None]:
%%read_sql
SELECT 
    test,
    count(*) as n
FROM {schema}.abalone_eval
GROUP BY test

# Local Python analysis

Doing in-database analytics using PL/Python will mimic what can be done locally with regular Python. Before writing PL/Python let's write what the analysis might look like if done locally. 

## Download sample for local testing

First, execute the following queries to download the evaluation data set (training and test). 

>*We can download the whole data set in this case because we know it is small. For really big tables you could just download a sample for local testing purposes.*

In [None]:
%%read_sql -d df_train
SELECT * FROM {schema}.abalone_eval
WHERE test = FALSE;

In [None]:
%%read_sql -d df_test
SELECT * FROM {schema}.abalone_eval
WHERE test = TRUE;

In [None]:
df_train.info()

## Features and target column names

In [None]:
classific_target = 'mature'

# Get column names that are features 
# (not target-related, identifiers or train/test flags)
features_all = [
    column
    for column in df_train.columns
    if column not in ('id', 'test', 'age', 'mature', 'rings')
]
features_all

In [None]:
# Remove one of the one-hot-encoded columns to avoid collinearity
features_dummy_coded = [
    feature
    for feature in features_all
    if feature != 'sex_m'
]
features_dummy_coded
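A short numeric sketch of why one indicator column is dropped: with all three one-hot columns present, they sum to the intercept column, so the design matrix loses a rank. The four rows below are made up for illustration.

```python
import numpy as np

# Made-up one-hot rows for (sex_f, sex_i, sex_m)
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
intercept = np.ones((one_hot.shape[0], 1))

# Intercept plus all three indicators: 4 columns, but linearly dependent
full = np.hstack([intercept, one_hot])
# Intercept plus two indicators (sex_m dropped): 3 independent columns
dropped = np.hstack([intercept, one_hot[:, :2]])

rank_full = np.linalg.matrix_rank(full)
rank_dropped = np.linalg.matrix_rank(dropped)
print(full.shape[1], rank_full)
print(dropped.shape[1], rank_dropped)
```

The dropped category becomes the baseline against which the remaining indicator coefficients are interpreted.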

## Logistic Regression in local python

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression(solver='lbfgs')

In [None]:
logreg.fit(df_train[features_dummy_coded], df_train[classific_target])

In [None]:
logreg.intercept_

In [None]:
logreg.coef_.flatten().shape

In [None]:
# Show coefficients
pd.DataFrame.from_records(zip(features_dummy_coded, logreg.coef_.flatten()))

Compare these coefficients to what we got in MADlib. Do they differ?

Use the model to generate predictions for the test data. Let's evaluate how well we did.

In [None]:
logreg_test_predict = logreg.predict_proba(df_test[features_dummy_coded])

The predictions object is a matrix with 2 columns, one for each class (False, True). When putting our predictions into metrics functions we'll only use the probability of the True class, i.e. the second column which is indexed `1` (zero-indexing in Python). 

In [None]:
logreg_test_predict.shape

In [None]:
logreg_test_predict[:5, :]

In [None]:
from sklearn import metrics

In [None]:
logreg_auc = metrics.roc_auc_score(df_test[classific_target], logreg_test_predict[:, 1])
logreg_auc
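One way to read the AUC value is as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The sketch below computes that pairwise definition directly on made-up labels and scores. It is meant to build intuition and is not how scikit-learn computes the metric internally; `pairwise_auc` is an illustrative name.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correctly ranked pair
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Made-up labels and predicted probabilities
toy_labels = [0, 0, 1, 1, 1]
toy_scores = [0.2, 0.6, 0.4, 0.7, 0.9]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 5 of the 6 pairs are ranked correctly
```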

In [None]:
metrics.explained_variance_score(df_test[classific_target], logreg_test_predict[:, 1])

In [None]:
logreg_fpr, logreg_tpr, logreg_thresholds = \
    metrics.roc_curve(
        df_test[classific_target], 
        logreg_test_predict[:, 1]
    )

In [None]:
plt.plot(
    logreg_fpr, 
    logreg_tpr, 
    color='darkorange',
    label='ROC curve (area = {:0.2f})'.format(logreg_auc)
)
plt.plot([0,1], [0,1], color='navy', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic (LogRegr)')
plt.legend(loc="lower right")

In [None]:
logreg_conf_matrix = metrics.confusion_matrix(
    df_test[classific_target],
    logreg_test_predict[:, 1] > 0.5
)
logreg_conf_matrix

In [None]:
pd.DataFrame(
    logreg_conf_matrix, 
    index=pd.Index([False, True], name='True Label'),
    columns=pd.Index(
        [False, True], name='Predicted Label'
    )
)
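To make the layout of the confusion matrix concrete, the following sketch builds one by hand with NumPy from made-up labels and probabilities, using the same `> 0.5` cutoff as above. Rows correspond to the true label and columns to the predicted label.

```python
import numpy as np

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.2, 0.6, 0.4, 0.7, 0.9])

# Threshold the probabilities to get hard 0/1 predictions
y_pred = (proba > 0.5).astype(int)

# conf[i, j] counts records with true label i and predicted label j
conf = np.zeros((2, 2), dtype=int)
np.add.at(conf, (y_true, y_pred), 1)

print(conf)
```

The diagonal holds the correctly classified records; the off-diagonal cells are the false positives (top right) and false negatives (bottom left).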

# Combine data to single cell

Now that we've seen how to do logistic regression in Python locally using the popular `sklearn` machine learning library, we are ready to replicate it within Greenplum. 

The way modeling with procedural languages works is that the data needs to be aggregated into a single row or cell to pass into the PL/Python function. Let's look at how to do this aggregation below. 

In [None]:
%%read_sql
CREATE or replace FUNCTION {schema}.array_append_2d(integer[][], integer[])
    RETURNS integer[][]
    LANGUAGE SQL
    AS 'select array_cat($1, ARRAY[$2])'
    IMMUTABLE
;
CREATE or replace FUNCTION {schema}.array_append_2d(numeric[][], numeric[])
    RETURNS numeric[][]
    LANGUAGE SQL
    AS 'select array_cat($1, ARRAY[$2])'
    IMMUTABLE
;
CREATE or replace FUNCTION {schema}.array_append_2d(double precision[][], double precision[])
    RETURNS double precision[][]
    LANGUAGE SQL
    AS 'select array_cat($1, ARRAY[$2])'
    IMMUTABLE
;

In [None]:
%%read_sql
-- Define a user-defined aggregate (UDA) to concatenate arrays
DROP AGGREGATE IF EXISTS {schema}.array_agg_array(double precision[]) CASCADE;
CREATE ORDERED AGGREGATE {schema}.array_agg_array(double precision[])
(
    SFUNC = {schema}.array_append_2d,
    STYPE = double precision[][]
);
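Conceptually, the aggregate folds one feature vector at a time into a growing 2-dimensional array. A plain-Python analogy of that fold is sketched below with made-up feature vectors; `None` stands in for the empty initial state of the aggregate, and the function name simply mirrors the SQL state function.

```python
from functools import reduce


def array_append_2d(state, row):
    """Append one 1-D feature vector to a list-of-lists state."""
    # Mirrors array_cat(state, ARRAY[row]); an empty state starts a new list
    return (state or []) + [list(row)]


# Made-up (id, feature_vec, mature) rows
rows = [
    (0, [0.455, 0.365], 1),
    (1, [0.350, 0.265], 0),
    (2, [0.530, 0.420], 1),
]

ids = [row[0] for row in rows]
features = reduce(array_append_2d, (row[1] for row in rows), None)
y_vector = [row[2] for row in rows]

print(features)
```

The result has one inner list per input row, which is the shape scikit-learn expects for its `fit` method.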


In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_train_agg;
CREATE TABLE {schema}.abalone_train_agg
AS
SELECT
    array_agg(id) as ids,
    {schema}.array_agg_array(feature_vec) AS features,
    array_agg(mature) as y_vector
FROM (
    SELECT
        id,
        mature,
        ARRAY[
            length,
            diameter,
            height,
            whole_weight,
            shucked_weight,
            viscera_weight,
            shell_weight,
            sex_f,
            sex_i
        ] AS feature_vec
    FROM {schema}.abalone_eval
    WHERE test = FALSE
) tmp
;

In [None]:
%%read_sql tmp
SELECT * FROM {schema}.abalone_train_agg;

In [None]:
tmp.iloc[0,:]['features'][:5]

## Gotchas combining to single cell

If the final training feature set exceeds 1GB, then this approach of combining everything into a single row/column for putting into PL/Python won't work out of the box. There is a [workaround](http://engineering.pivotal.io/post/running-sklearn-models-at-scale-on-mpp/). Also, another approach is to use a `GROUP BY` and build separate models for different groups of your data (e.g. a different model for each state in the US). 
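A back-of-the-envelope check can tell you whether a feature matrix is likely to approach that limit. The sketch below assumes 8 bytes per `double precision` value and ignores array headers and other storage overhead, so treat the result as a lower bound; the function names are illustrative.

```python
BYTES_PER_DOUBLE = 8      # size of one double precision value
ONE_GB = 1024 ** 3


def estimate_feature_bytes(n_rows, n_features):
    """Rough lower bound on the size of an n_rows x n_features float array."""
    return n_rows * n_features * BYTES_PER_DOUBLE


def fits_in_single_cell(n_rows, n_features, limit=ONE_GB):
    """True if the rough estimate stays below the given size limit."""
    return estimate_feature_bytes(n_rows, n_features) < limit


# The full abalone file (4177 rows, 9 features) is far below the limit
small = estimate_feature_bytes(4177, 9)
# A hypothetical table with 20 million rows and 9 features is not
large_fits = fits_in_single_cell(20_000_000, 9)

print(small, large_fits)
```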

# Create training UDF

Now we will create a user-defined function (UDF) to take in the aggregated training data, train the model, and output the serialized model. 

The content of a PL/Python function is normal Python. Let's develop the modeling logic we used above into a self-contained function before putting it into a PL/Python function definition. 

In [None]:
def logreg_train(features, targets):
    """Training function for Logistic Regression
    
    INPUTS
    features: 2-dimensional array or list-of-lists
    targets: 1-dimensional array or list
    
    RETURNS: serialized model (as a string)
    """
    from sklearn.linear_model import LogisticRegression
    import six
    pickle = six.moves.cPickle

    logreg = LogisticRegression(solver='lbfgs')
    logreg.fit(features, targets)
    return pickle.dumps(logreg, protocol=2)

# Test that the function works and the serialized model can be de-serialized
serialized = logreg_train(
    df_train[features_dummy_coded], 
    df_train[classific_target]
)

temp_model = six.moves.cPickle.loads(serialized)

# Test that the deserialized model can predict on new data
temp_model.predict_proba(df_test[features_dummy_coded])

In [None]:
%%read_sql
DROP FUNCTION IF EXISTS {schema}.logreg_train(features float[][], targets integer[]);
CREATE OR REPLACE FUNCTION 
        {schema}.logreg_train(features float[][], targets integer[])
RETURNS bytea
LANGUAGE plpythonu
AS
$$
def logreg_train(features, targets):
    """Training function for Logistic Regression
    
    INPUTS
    features: 2-dimensional array or list-of-lists
    targets: 1-dimensional array or list
    
    RETURNS: serialized model (as a string)
    """
    from sklearn.linear_model import LogisticRegression
    import six
    pickle = six.moves.cPickle

    logreg = LogisticRegression(solver='lbfgs')
    logreg.fit(features, targets)
    return pickle.dumps(logreg, protocol=2)

return logreg_train(features, targets)
$$;

# Train model in-database

In [None]:
%%read_sql
CREATE TABLE {schema}.logreg_model
AS
SELECT 
    {schema}.logreg_train(features, y_vector) as model,
    now() as serialized_on
FROM {schema}.abalone_train_agg

In [None]:
%%read_sql df_model
SELECT serialized_on, length(model), model::text
FROM {schema}.logreg_model

# Score model in-database

First we need to create a scoring UDF.

In [None]:
def sklearn_predict_1(serialized_model, features):
    """Predict a single record
    
    INPUT
    serialized_model: string
    features: 1-dimensional array/list
    
    RETURNS: float
    """
    # make sure that features is only 1-dimensional
    assert not hasattr(features[0], '__len__')
    
    import six
    pickle = six.moves.cPickle
    
    model = pickle.loads(serialized_model)
    
    result = model.predict_proba([features])
    # `result` is a 1x2 matrix. 
    # The second column shows probability of the true class, 
    # which is what we want to return
    return result[0, 1]

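# Test that the function can take a serialized model and a feature 
# vector and return a probability. `.tolist()` passes a plain
# 1-dimensional list, like the one PL/Python receives for a float[] argument
sklearn_predict_1(serialized, df_test.loc[0, features_dummy_coded].tolist())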

In [None]:
%%read_sql
DROP FUNCTION IF EXISTS 
    {schema}.sklearn_predict_1(serialized_model bytea, features float[]);
CREATE OR REPLACE FUNCTION 
        {schema}.sklearn_predict_1(serialized_model bytea, features float[])
RETURNS float
LANGUAGE plpythonu
AS
$$
def sklearn_predict_1(serialized_model, features):
    """Predict a single record
    
    INPUT
    serialized_model: string
    features: 1-dimensional array/list
    
    RETURNS: float
    """
    # make sure that features is only 1-dimensional
    assert not hasattr(features[0], '__len__')
    
    import six
    pickle = six.moves.cPickle
    
    model = pickle.loads(serialized_model)
    
    result = model.predict_proba([features])
    # `result` is a 1x2 matrix. 
    # The second column shows probability of the true class, 
    # which is what we want to return
    return result[0, 1]

return sklearn_predict_1(serialized_model, features)
$$;

In [None]:
print('ARRAY[\n  ' + ',\n  '.join(features_dummy_coded) + '\n]')

**Cross join model with table to be scored**

Now that we have a model with coefficients, we can make predictions on records previously unseen by the model. In the current version of MADlib (1.15.1), the way to predict probability using a logistic regression model is to `CROSS JOIN` the test set records with the single-row model table. A `CROSS JOIN` produces the Cartesian product between all records in both tables, meaning it pairs every record from one table with every record in the other table. In Postgres/Greenplum this can be done explicitly using the `CROSS JOIN` statement, or you can simply list the two tables in the `FROM` clause separated by a comma. 

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_logreg_test_proba;
CREATE TABLE {schema}.abalone_logreg_test_proba 
AS
SELECT 
    test.id,
    test.age,
    test.mature,
    {schema}.sklearn_predict_1(
        model_table.model,        
        ARRAY[
          length,
          diameter,
          height,
          whole_weight,
          shucked_weight,
          viscera_weight,
          shell_weight,
          sex_f,
          sex_i
        ]
    ) AS proba
FROM 
    {schema}.abalone_eval as test, 
    {schema}.logreg_model as model_table
WHERE test = TRUE;
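The effect of cross joining with a single-row model table can be sketched in plain Python with `itertools.product`. The rows and the placeholder model bytes below are made up; the point is only that every test record gets paired with the one model row, so the output has exactly as many rows as the test set.

```python
from itertools import product

# Made-up test records and a single-row "model table"
test_rows = [
    {"id": 3, "features": [0.44, 0.365]},
    {"id": 8, "features": [0.475, 0.37]},
    {"id": 9, "features": [0.55, 0.44]},
]
model_rows = [{"model": b"<serialized model bytes>"}]

# Cartesian product: every test row paired with every model row
pairs = list(product(test_rows, model_rows))

for test_row, model_row in pairs:
    print(test_row["id"], model_row["model"])
```

If the model table ever held more than one row, each test record would be scored once per model, which is worth keeping in mind when storing several model versions in the same table.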

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_logreg_test_proba
LIMIT 5

In [None]:
%%read_sql
SELECT
madlib.area_under_roc(
    '{schema}.abalone_logreg_test_proba', -- table_in, 
    '{schema}.abalone_logreg_test_auc',  --table_out,
    'proba',  -- prediction_col, 
    'mature'  --observed_col, 
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_auc

In [None]:
%%read_sql
CREATE TABLE {schema}.abalone_logreg_test_predict
AS
SELECT
    (proba >= 0.5)::integer as predicted,
    mature
FROM {schema}.abalone_logreg_test_proba

In [None]:
%%read_sql
SELECT
madlib.confusion_matrix(
    '{schema}.abalone_logreg_test_predict', -- table_in
    '{schema}.abalone_logreg_test_conf_matrix', -- table_out
    'predicted',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT 
    row_id,
    class,
    confusion_arr[1] as predicted_0,
    confusion_arr[2] as predicted_1
FROM {schema}.abalone_logreg_test_conf_matrix
ORDER BY row_id

In [None]:
%%read_sql
SELECT
madlib.binary_classifier(
    '{schema}.abalone_logreg_test_proba', -- table_in
    '{schema}.abalone_logreg_test_binary_metrics', -- table_out
    'proba',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_binary_metrics
WHERE 
    --round(threshold::numeric, 1) = 0.5
    threshold >= 0.49 AND
    threshold <= 0.51
ORDER BY threshold

The `-d` flag for the `%%read_sql` magic command below suppresses display of the query result. Here the result has many rows, which we want stored in the `logreg_metrics` dataframe without printing them all. 

In [None]:
%%read_sql -d logreg_metrics
SELECT *
FROM {schema}.abalone_logreg_test_binary_metrics
ORDER BY threshold

In [None]:
logreg_metrics.plot('fpr', 'tpr')
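Each point on the plotted curve corresponds to one threshold. As a sketch of what the `fpr` and `tpr` values mean, the function below computes both rates for a single threshold on made-up data. Treating a probability equal to the threshold as a positive prediction (`>=`) is an assumption made here, and `roc_point` is an illustrative name.

```python
import numpy as np


def roc_point(y_true, proba, threshold):
    """Return (fpr, tpr) when predicting positive for proba >= threshold."""
    actual = np.asarray(y_true).astype(bool)
    predicted = np.asarray(proba) >= threshold
    true_pos = np.sum(predicted & actual)
    false_pos = np.sum(predicted & ~actual)
    tpr = true_pos / actual.sum()        # share of positives caught
    fpr = false_pos / (~actual).sum()    # share of negatives flagged
    return float(fpr), float(tpr)


# Made-up labels and predicted probabilities
toy_labels = [0, 0, 1, 1, 1]
toy_proba = [0.2, 0.6, 0.4, 0.7, 0.9]

point_mid = roc_point(toy_labels, toy_proba, 0.5)
point_low = roc_point(toy_labels, toy_proba, 0.0)
point_high = roc_point(toy_labels, toy_proba, 1.0)
print(point_low, point_mid, point_high)
```

Sweeping the threshold from 1 down to 0 moves the point from the bottom-left corner of the plot to the top-right corner.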

# Random Forest Classifier

## Create RF UDF

Let's go through the same exercise we did with Logistic Regression via PL/Python, this time with Random Forest.

In [None]:
def rf_train(features, targets):
    """Training function for Random Forest
    
    INPUTS
    features: 2-dimensional array or list-of-lists
    targets: 1-dimensional array or list
    
    RETURNS: serialized model (as a string)
    """
    from sklearn.ensemble import RandomForestClassifier
    import six
    pickle = six.moves.cPickle

    model = RandomForestClassifier()
    model.fit(features, targets)
    return pickle.dumps(model, protocol=2)

# Test that the function works and the serialized model can be de-serialized
rf_serialized = rf_train(
    df_train[features_dummy_coded], 
    df_train[classific_target]
)

rf_model_deserialized = six.moves.cPickle.loads(rf_serialized)

# Test that the deserialized model can predict on new data
rf_model_deserialized.predict_proba(df_test[features_dummy_coded])

In [None]:
%%read_sql
DROP FUNCTION IF EXISTS {schema}.rf_train(features float[][], targets integer[]);
CREATE OR REPLACE FUNCTION 
        {schema}.rf_train(features float[][], targets integer[])
RETURNS bytea
LANGUAGE plpythonu
AS
$$
def rf_train(features, targets):
    """Training function for Random Forest
    
    INPUTS
    features: 2-dimensional array or list-of-lists
    targets: 1-dimensional array or list
    
    RETURNS: serialized model (as a string)
    """
    from sklearn.ensemble import RandomForestClassifier
    import six
    pickle = six.moves.cPickle

    model = RandomForestClassifier()
    model.fit(features, targets)
    return pickle.dumps(model, protocol=2)

return rf_train(features, targets)
$$;

## Train RF model in-database

In [None]:
%%read_sql
CREATE TABLE {schema}.rf_model
AS
SELECT 
    {schema}.rf_train(features, y_vector) as model,
    now() as serialized_on
FROM {schema}.abalone_train_agg

In [None]:
%%read_sql df_rf_model
SELECT serialized_on, length(model), model::text
FROM {schema}.rf_model

## Get fitted model information

In [None]:
[i for i in dir(rf_model_deserialized) if not i.startswith('_')]

In [None]:
%%read_sql
DROP TYPE IF EXISTS {schema}.rf_info CASCADE;
CREATE TYPE {schema}.rf_info
    AS
    (
        classes text[],
        max_depth integer,
        n_estimators integer,
        feature_importances float[]
    )
;


CREATE OR REPLACE FUNCTION 
    {schema}.get_rf_info(serialized_model bytea)
RETURNS {schema}.rf_info
LANGUAGE plpythonu
AS
$$
import six
pickle = six.moves.cPickle

model = pickle.loads(serialized_model)

return [
    model.classes_,
    model.max_depth,
    model.n_estimators,
    model.feature_importances_
]
$$

In [None]:
%%read_sql
SELECT 
    serialized_on,
    ({schema}.get_rf_info(model)).*
FROM {schema}.rf_model

## Score test set with RF model

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_proba;
CREATE TABLE {schema}.abalone_rf_test_proba 
AS
SELECT 
    test.id,
    test.age,
    test.mature,
    {schema}.sklearn_predict_1(
        model_table.model,        
        ARRAY[
          length,
          diameter,
          height,
          whole_weight,
          shucked_weight,
          viscera_weight,
          shell_weight,
          sex_f,
          sex_i
        ]
    ) AS proba
FROM 
    {schema}.abalone_eval as test, 
    {schema}.rf_model as model_table
WHERE test = TRUE;

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_rf_test_proba 
LIMIT 4

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_binary_metrics;
SELECT
madlib.binary_classifier(
    '{schema}.abalone_rf_test_proba', -- table_in
    '{schema}.abalone_rf_test_binary_metrics', -- table_out
    'proba',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_binary_metrics
ORDER BY threshold
LIMIT 5

In [None]:
%%read_sql -d rf_metrics
SELECT fpr, tpr
FROM {schema}.abalone_rf_test_binary_metrics
ORDER BY threshold

In [None]:
rf_metrics.plot('fpr', 'tpr')

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_auc CASCADE;
SELECT
madlib.area_under_roc(
    '{schema}.abalone_rf_test_proba', -- table_in
    '{schema}.abalone_rf_test_auc', -- table_out
    'proba',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_auc
LIMIT 15