$$\huge{\text{Data Science Workshop}}$$

This notebook serves as a hands-on introduction to the data science pipeline.  Using a single dataset throughout, it begins with loading the data into a Greenplum Database (GPDB), then proceeds to data exploration, feature engineering, model development, and finally, model evaluation.

We’ll be using the publicly available [Abalone dataset from the University of California, Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/abalone).  The dataset contains nine columns: eight measured attributes plus the ring count, from which we derive our prediction target.

| Column Name | Data Type | Description |
| --- | :---: | ---: |
| Sex | text | M, F, or I (infant) |
| Length | float | Longest shell measurement |
| Diameter | float | Perpendicular to length |
| Height | float | With meat in shell |
| Whole weight | float | Whole abalone |
| Shucked weight | float | Weight of meat only |
| Viscera weight | float | Gut weight (after bleeding) |
| Shell weight | float | Post-drying |
| Rings | integer | Ring count; adding 1.5 gives the age in years |

Much of the code for this workshop has already been filled in for you.  Places where your input is required are clearly marked with “*code here*”.  You’ll replace those markers with the appropriate code snippets you learn as we go through this notebook together.  Feel free to add as many comments and notes as will help your future self (in Python code, you can make a comment by beginning a line with the “#” character).

# Connect to Database

To get to the analytics sections more quickly, establishing the SQL connection and loading the data into GPDB are done behind the scenes here by calling a helper function from a custom module called `dbconnect`.

In [None]:
%matplotlib inline

In [None]:
import dbconnect

In [None]:
db_credential_file = '../.dbcred'
dbconnect.connect_and_register_sql_magic(
    db_credential_file,
    conn_name='conn'
)

In [None]:
import math
import pandas as pd
from sqlalchemy import create_engine

In [None]:
pd.set_option('display.max_columns', 200)

## Load Data: Abalone

In [None]:
abalone_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone_columns = (
    'sex',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'rings'
)
len(abalone_columns)

In [None]:
df_abalone = pd.read_csv(abalone_data_url, names=abalone_columns)

In [None]:
df_abalone.info()

In [None]:
df_abalone['sex'] = df_abalone['sex'].str.lower()

In [None]:
(df_abalone.rings + 1.5).hist(bins=30)

In [None]:
((df_abalone.rings + 1.5) >= 3).value_counts()

In [None]:
_cumsum = (df_abalone.rings + 1.5).value_counts().sort_index().cumsum()
_cumsum / df_abalone.shape[0]

In [None]:
schema = 'ds_training'

In [None]:
%read_sql DROP SCHEMA IF EXISTS {schema} CASCADE;
%read_sql CREATE SCHEMA {schema};

In [None]:
df_abalone.to_sql(
    'abalone', 
    conn, 
    schema=schema, 
    if_exists='replace', 
    index=True, 
    index_label='id',
    chunksize=10000)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone
LIMIT 10

# Data pre-processing

## Define target (age)

Our first order of business is to generate our prediction target.  This is a two-step process.  We’ll create a new column in our data table (“age”) by adding 1.5 to the “rings” column to generate the abalone age.  We’ll then create another column (“mature”) that denotes abalone maturity as either 1 or 0, based on whether the age is greater than or equal to 10 years.
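
The two-step derivation described above can be sketched in plain Python, without a database. This is only an illustration: the function name `derive_target` and the sample ring counts are made up for this example and are not part of the workshop tables.

```python
# Illustrative ring counts only; these are not rows from the actual table.
sample_rings = [4, 8, 9, 15]

def derive_target(rings, maturity_age=10.0):
    """Return (age, mature) for one abalone, mirroring the two steps above."""
    age = rings + 1.5                          # step 1: rings -> age in years
    mature = 1 if age >= maturity_age else 0   # step 2: age -> binary flag
    return age, mature

targets = [derive_target(r) for r in sample_rings]
print(targets)
```

Because ring counts are integers, the condition `age >= 10` is equivalent to `rings >= 9` in this sketch.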

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_target;
CREATE TABLE {schema}.abalone_target
AS
SELECT 
    *,
    rings + 1.5 as age,
    CASE WHEN 
            (rings + 1.5) >= 10.0
        THEN 1
        ELSE 0
    END as mature
FROM {schema}.abalone

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_target 
LIMIT 10

In [None]:
%%read_sql
SELECT sum(mature), count(*)
FROM {schema}.abalone_target 


## Encode categorical variables

Next, we leverage [MADlib to one-hot encode](http://madlib.apache.org/docs/latest/group__grp__encode__categorical.html) the “sex” column, which is a categorical variable.  To create a predictive model, we need all our columns to hold numerical values.  Making sure all our model inputs conform to this standard is an important part of the data science modeling pipeline and belongs to the preprocessing/data cleaning step of the process.
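
As a rough illustration of what one-hot encoding produces, here is a minimal plain-Python sketch. The helper `one_hot` and the sample values are hypothetical; the sketch is not how MADlib implements the encoding, it only shows the shape of the result (one 0/1 column per category).

```python
def one_hot(values, prefix):
    """One-hot encode a list of category labels into per-category 0/1 columns."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

sex = ["m", "f", "i", "m"]  # made-up sample values
encoded = one_hot(sex, "sex")
print(encoded)
```

Each row has exactly one 1 across the `sex_f`, `sex_i`, and `sex_m` columns.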

In [None]:
%%read_sql
SELECT
madlib.encode_categorical_variables (
        '{schema}.abalone_target',  -- input table
        '{schema}.abalone_encoded',  -- output table
        'sex',   -- categorical_cols
        NULL,  --categorical_cols_to_exclude    -- Optional
        NULL,  --row_id,                         -- Optional
        NULL,  --top,                            -- Optional
        NULL,  --value_to_drop,                  -- Optional
        NULL,  --encode_null,                    -- Optional
        NULL,  --output_type,                    -- Optional
        NULL,  --output_dictionary,              -- Optional
        NULL  --distributed_by                  -- Optional
    )

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_encoded
LIMIT 5

# Explore data

The next step in the modeling process is to explore our data.  We’ll again use some of MADlib’s built-in functionality to generate [descriptive statistics](http://madlib.apache.org/docs/latest/group__grp__summary.html) of our data.  This produces important information about each column, including the count, number of missing values, mean, median, maximum, minimum, interquartile range, mode, and variance.

Note that you only want to do this after converting categorical data to numeric data because otherwise the statistics will not be computed correctly.
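
To make the listed statistics concrete, the following sketch computes a few of them for a single numeric column using Python's standard `statistics` module. The values are invented for illustration, and `statistics.variance` is the sample variance; whether that matches the exact definition MADlib uses is not verified here.

```python
import statistics

heights = [0.095, 0.09, 0.135, 0.125, 0.08]  # illustrative values only

summary = {
    "count": len(heights),
    "min": min(heights),
    "max": max(heights),
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "variance": statistics.variance(heights),  # sample variance (n - 1)
}
print(summary)
```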


In [None]:
%%read_sql
SELECT
madlib.summary ( 
    '{schema}.abalone_encoded',  -- source_table,
    '{schema}.abalone_summary',  -- output_table,
    NULL,  -- target_cols,
    NULL  -- grouping_cols,
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_summary
LIMIT 15

In [None]:
%%read_sql
SELECT target_column
from {schema}.abalone_summary

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_encoded
limit 3

Another aspect of the data that we might want to know about is the correlation between different columns.  We turn again to MADlib, which provides a ready-made function: [correlation()](http://madlib.apache.org/docs/latest/group__grp__correlation.html).
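
For intuition about what a correlation value measures, here is a small sketch of the Pearson correlation coefficient between two columns. The function `pearson` and the sample measurements are assumptions made for this example; it is not MADlib's implementation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

length = [0.2, 0.3, 0.4, 0.5]        # made-up values
diameter = [0.15, 0.25, 0.30, 0.40]  # made-up values
r = pearson(length, diameter)
print(round(r, 3))
```

Values close to 1 indicate that the two columns rise together, values close to -1 that one falls as the other rises.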

In [None]:
%%read_sql
SELECT
madlib.correlation(
    '{schema}.abalone_encoded', -- source_table,
    '{schema}.abalone_correlations', -- output_table,
    'length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings', -- target_cols,
    TRUE, -- verbose,
    'sex_f,sex_i,sex_m'  -- grouping_columns
)

In [None]:
%%read_sql
SELECT * 
from {schema}.abalone_correlations
ORDER BY sex_m, sex_f, sex_i, column_position

Making sure a model has real predictive power depends in large part on creating a hold-out data set that we don’t train our model with.  By setting aside this subset of the data, we can test any model we develop against “unseen” data and so detect overfitting.  This helps us arrive at a predictive model that generalizes better.

There’s no right answer as to how much data to set aside in the test table; a 70%-30% split, weighted towards the training data, is a good rule of thumb.  This process is referred to as the [train-test split](http://madlib.apache.org/docs/latest/group__grp__train__test__split.html).
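
The idea of a 70%-30% split can be sketched in plain Python as follows. The helper `split_rows`, its `seed` parameter, and the ten sample ids are hypothetical; the sketch samples without replacement and is only meant to show the principle, not MADlib's sampling procedure.

```python
import random

def split_rows(rows, train_proportion=0.7, seed=0):
    """Shuffle rows and cut them into disjoint train and test lists."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_proportion))
    return shuffled[:cut], shuffled[cut:]

ids = list(range(10))
train_ids, test_ids = split_rows(ids)
print(len(train_ids), len(test_ids))
```

Every row ends up in exactly one of the two lists, which is what keeps the test data "unseen" during training.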


In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_encoded LIMIT 2

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_classif CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_classif_train CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_classif_test CASCADE;
SELECT madlib.train_test_split(
    '{schema}.abalone_encoded', -- source_table,
    '{schema}.abalone_classif', -- output_table,
    0.7, -- train_proportion,
    NULL, -- test_proportion,
    NULL, -- grouping_cols,
    'id,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_i,sex_m,rings,age,mature', -- target_cols,
    FALSE, -- with_replacement,
    TRUE -- separate_output_tables
)

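Because `separate_output_tables` is `TRUE`, the split is written to two tables: `abalone_classif_train` and `abalone_classif_test`. With a single output table, the train/test flag would instead be in column `split`, where `1` means train and `0` means test.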

In [None]:
%%read_sql
SELECT count(*) as n
FROM {schema}.abalone_classif_train

In [None]:
%%read_sql
SELECT count(*) as n
FROM {schema}.abalone_classif_test

# Modeling

## Classification

### Logistic Regression

We’re now ready to create our first predictive model.  We’ll start with a classic logistic regression since we’ve decided that we have a classification problem.  

Note: we drop one of the one-hot-encoded variables (here `sex_i`) to remove perfect collinearity with the intercept term.
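
The collinearity mentioned in the note can be seen with a tiny example: when all three dummy columns are kept, they sum to the intercept column of 1s in every row. The rows below are made up for illustration.

```python
# Made-up one-hot rows: exactly one of the three flags is set per row.
rows = [
    {"sex_f": 1, "sex_i": 0, "sex_m": 0},
    {"sex_f": 0, "sex_i": 1, "sex_m": 0},
    {"sex_f": 0, "sex_i": 0, "sex_m": 1},
    {"sex_f": 0, "sex_i": 1, "sex_m": 0},
]

intercept = [1 for _ in rows]

# With all three dummies, their sum reproduces the intercept column exactly.
dummy_sum = [r["sex_f"] + r["sex_i"] + r["sex_m"] for r in rows]
is_collinear = dummy_sum == intercept

# After dropping sex_i, the remaining dummies no longer sum to the intercept.
reduced_sum = [r["sex_f"] + r["sex_m"] for r in rows]
still_collinear = reduced_sum == intercept

print(is_collinear, still_collinear)
```

With one dummy dropped, the omitted category (infant) becomes the baseline that the other coefficients are measured against.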

In [None]:
%%read_sql
SELECT
madlib.logregr_train(
    '{schema}.abalone_classif_train', -- source_table,
    '{schema}.abalone_logreg_model', -- out_table,
    'mature', -- dependent_varname,
    'ARRAY[
        1,
        length,
        diameter,
        height,
        whole_weight,
        shucked_weight,
        viscera_weight,
        shell_weight,
        sex_f,
        sex_m
    ]' -- independent_varname,
    --, -- grouping_cols,
    --, -- max_iter,
    --, -- optimizer,
    --, -- tolerance,
     -- verbose
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_model
LIMIT 5

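To score the test set, we cross join the model table (a single row here, since no grouping columns were used) with the table to be scored, so that every test row is paired with the fitted coefficients.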
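
Conceptually, scoring pairs every test row with the one set of fitted coefficients and applies the logistic function to their dot product. The sketch below shows this in plain Python; the coefficient values and rows are hypothetical, and `predict_prob` is an illustrative stand-in rather than MADlib's function.

```python
import math

def predict_prob(coef, features):
    """Logistic regression probability: sigmoid of the dot product."""
    z = sum(c * x for c, x in zip(coef, features))
    return 1.0 / (1.0 + math.exp(-z))

coef = [-1.0, 2.0]  # hypothetical [intercept, slope]; not fitted values
test_rows = [[1, 0.0], [1, 0.5], [1, 1.0]]  # leading 1 is the intercept term

# "Cross join": the single coefficient vector is applied to every test row.
probs = [predict_prob(coef, row) for row in test_rows]
print([round(p, 3) for p in probs])
```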

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_logreg_test_proba;
CREATE TABLE {schema}.abalone_logreg_test_proba
AS
SELECT madlib.logregr_predict_prob(
        coef, 
        ARRAY[
            1,
            length,
            diameter,
            height,
            whole_weight,
            shucked_weight,
            viscera_weight,
            shell_weight,
            sex_f,
            sex_m
        ] 
    ) as proba,
    test.mature
FROM {schema}.abalone_classif_test test, {schema}.abalone_logreg_model model
;

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_proba
LIMIT 4

In [None]:
%%read_sql
SELECT
madlib.area_under_roc(
    '{schema}.abalone_logreg_test_proba', -- table_in, 
    '{schema}.abalone_logreg_test_auc',  --table_out,
    'proba',  -- prediction_col, 
    'mature'  --observed_col, 
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_auc

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_logreg_test_predict;
CREATE TABLE {schema}.abalone_logreg_test_predict
AS
SELECT
    (proba >= 0.5)::integer as predicted,
    mature
FROM {schema}.abalone_logreg_test_proba

In [None]:
%%read_sql
SELECT
madlib.confusion_matrix(
    '{schema}.abalone_logreg_test_predict', -- table_in
    '{schema}.abalone_logreg_test_conf_matrix', -- table_out
    'predicted',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT 
    row_id,
    class,
    confusion_arr[1] as predicted_0,
    confusion_arr[2] as predicted_1
FROM {schema}.abalone_logreg_test_conf_matrix
ORDER BY row_id

**Check by hand which axis corresponds to actual vs predicted**

In the confusion matrix, there is no indication of which axis holds the actual classes and which the predicted classes, so we cannot tell the false positives from the false negatives.  We can check this by counting the rows in our predictions table where we predict 1 but the actual class is 0 (the false positives) and comparing that count with the matrix.
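
The same check can be illustrated on a toy example. The labels below are invented and deliberately asymmetric (two false positives, one false negative) so that the orientation of the matrix can be identified from the counts.

```python
from collections import Counter

actual    = [0, 0, 0, 1, 1, 1, 1]  # made-up observed classes
predicted = [0, 1, 1, 1, 1, 1, 0]  # made-up predicted classes

counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes: matrix[a][p].
matrix = [[counts[(a, p)] for p in (0, 1)] for a in (0, 1)]

# Independent count of false positives: predicted 1 while actual is 0.
false_positives = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)

print(matrix, false_positives)
```

Here the independent count matches `matrix[0][1]`, which confirms that rows are the actual classes in this toy matrix. The same comparison tells you the orientation of the MADlib output.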

In [None]:
%%read_sql
SELECT
    count(*)
FROM {schema}.abalone_logreg_test_predict
WHERE
    predicted = 1 AND
    mature = 0


Get ROC values (thresholds, true-positives, false-positives)
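
As a sketch of what these ROC values mean, the following plain-Python function computes the true-positive rate and false-positive rate at a few thresholds. The scores and labels are invented, and the convention that a score at or above the threshold counts as a positive prediction is an assumption of this example.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, true-positive rate, false-positive rate) tuples."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points

scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities
labels = [0, 0, 1, 1]           # made-up observed classes
points = roc_points(scores, labels, [0.0, 0.5, 1.0])
print(points)
```

Lowering the threshold moves the classifier toward the top-right corner of the ROC plot (everything predicted positive), raising it toward the bottom-left.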

In [None]:
%%read_sql
SELECT
madlib.binary_classifier(
    '{schema}.abalone_logreg_test_proba', -- table_in
    '{schema}.abalone_logreg_test_binary_metrics', -- table_out
    'proba',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_binary_metrics
WHERE 
    --round(threshold::numeric, 1) = 0.5
    threshold >= 0.48 AND
    threshold <= 0.52
ORDER BY threshold

#### LogReg with Cross-validation

In [None]:
%%read_sql
SELECT madlib.cross_validation_general(
    -- modelling_func
        'madlib.logregr_train',
    -- modelling_params
        '{{%data%, %model%, mature, "length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_m",NULL,max_iter}}'::varchar[],
    -- modelling_params_type
        '{{varchar, varchar, varchar, varchar, varchar, integer}}',
    -- param_explored
        'max_iter',
    -- explore_values
        '{{5, 10, 20, 50}}'::varchar[],
    -- predict_func
        'madlib.logregr_predict_prob',
    -- predict_params
        '{{%model%, %data%, %prediction%, prob}}'::varchar[],
    -- predict_params_type
        '{{text, text, text, text}}'::varchar[],
    -- metric_func
        'madlib.area_under_roc',
    -- metric_params
        '{{%prediction%, %data%, %id%, mature}}'::varchar[],
    -- metric_params_type
        '{{varchar, varchar, varchar, varchar}}'::varchar[],
    -- data_tbl
        '{schema}.abalone_classif_train',
    -- data_id
        'id',
    -- id_is_random
        FALSE,
    -- validation_result
        '{schema}.abalone_cls_logreg_cv',
    -- data_cols
        NULL,
    -- fold_num
        3
)


### Random Forest

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_classif_train
LIMIT 5

In [None]:
%%read_sql
SELECT
madlib.forest_train(
    '{schema}.abalone_classif_train',  -- training_table_name
    '{schema}.abalone_rf_model',  -- output_table_name
    'id',  -- id_col_name
    'mature',  -- dependent_variable
    'length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_m',  -- list_of_features
    NULL,  -- list_of_features_to_exclude
    NULL,  -- grouping_columns
    10  -- number of trees
)


In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_model
LIMIT 10

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_proba;
SELECT
madlib.forest_predict(
    '{schema}.abalone_rf_model',  -- random_forest_model
    '{schema}.abalone_classif_test',  -- new_data_table
    '{schema}.abalone_rf_test_proba',  -- output_table
    'prob'  -- type
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_proba
LIMIT 5

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_predict_actual;
CREATE TABLE {schema}.abalone_rf_test_predict_actual
AS
SELECT 
    test.id,
    prob.estimated_prob_1,
    prob.estimated_prob_1 >= 0.5 as predicted_class,
    test.mature as actual_class
FROM 
    {schema}.abalone_rf_test_proba prob
INNER JOIN
    {schema}.abalone_classif_test test
ON
    prob.id = test.id

In [None]:
%%read_sql
SELECT
madlib.binary_classifier(
    '{schema}.abalone_rf_test_predict_actual', -- table_in
    '{schema}.abalone_rf_test_binary_metrics', -- table_out
    'estimated_prob_1',  --prediction_col
    'actual_class' --observation_col
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_binary_metrics
ORDER BY threshold
LIMIT 15

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_auc CASCADE;
SELECT
madlib.area_under_roc(
    '{schema}.abalone_rf_test_predict_actual', -- table_in
    '{schema}.abalone_rf_test_auc', -- table_out
    'estimated_prob_1',  --prediction_col
    'actual_class' --observation_col
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_auc
LIMIT 15

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_model_v2 CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_rf_model_v2_group CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_rf_model_v2_summary CASCADE;
SELECT
madlib.forest_train(
    '{schema}.abalone_classif_train',  -- training_table_name
    '{schema}.abalone_rf_model_v2',  -- output_table_name
    'id',  -- id_col_name
    'mature',  -- dependent_variable
    'length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_m',  -- list_of_features
    NULL,  -- list_of_features_to_exclude
    NULL,  -- grouping_columns
    10,  -- number of trees
    NULL,  -- num_random_features
    TRUE,  -- importance
    1,  -- num_permutations
    4,  -- max_tree_depth
    NULL,  -- min_split
    NULL,  -- min_bucket
    NULL,  -- num_splits
    NULL,  -- null_handling_params
    TRUE,  -- verbose
    NULL   -- sample_ratio
)


In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_v2_test_proba;
SELECT
madlib.forest_predict(
    '{schema}.abalone_rf_model_v2',  -- random_forest_model
    '{schema}.abalone_classif_test',  -- new_data_table
    '{schema}.abalone_rf_v2_test_proba',  -- output_table
    'prob'  -- type
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_rf_v2_test_proba
LIMIT 4

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_v2_test_predict_actual;
CREATE TABLE {schema}.abalone_rf_v2_test_predict_actual
AS
SELECT 
    test.id,
    prob.estimated_prob_1,
    prob.estimated_prob_1 >= 0.5 as predicted_class,
    test.mature as actual_class
FROM 
    {schema}.abalone_rf_v2_test_proba prob
INNER JOIN
    {schema}.abalone_classif_test test
ON
    prob.id = test.id

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_v2_test_auc CASCADE;
SELECT
madlib.area_under_roc(
    '{schema}.abalone_rf_v2_test_predict_actual', -- table_in
    '{schema}.abalone_rf_v2_test_auc', -- table_out
    'estimated_prob_1',  --prediction_col
    'actual_class' --observation_col
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_rf_v2_test_auc

#### RF with Cross-validation

Note that with `sql_magic`'s `%%read_sql`, any text in braces, such as `{some_key}`, is replaced with the value of the variable of that name in the current scope. To use literal braces, you have to escape them by doubling each brace. Literal braces are a way of specifying arrays in Postgres/Greenplum.
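
Python's own `str.format` follows the same doubling convention, so it can serve as a quick illustration of the escaping rule. Whether `sql_magic` uses `str.format` internally is an assumption here; the sketch only mirrors the behavior described above.

```python
schema = "ds_training"

# Doubled braces become literal braces; {schema} is substituted.
template = "SELECT '{{5, 10, 20}}'::varchar[] AS vals, '{schema}.abalone' AS tbl"
rendered = template.format(schema=schema)
print(rendered)
```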

In [None]:
%%read_sql
SELECT madlib.cross_validation_general(
    -- modelling_func
        'madlib.forest_train',
    -- modelling_params
        '{{%data%, %model%, id, mature, "length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_m",NULL,NULL,numtrees}}'::varchar[],
    -- modelling_params_type
        '{{varchar, varchar, varchar, varchar, varchar, varchar, varchar, integer}}',
    -- param_explored
        'numtrees',
    -- explore_values
        '{{5, 10, 20, 50}}'::varchar[],
    -- predict_func
        'madlib.forest_predict',
    -- predict_params
        '{{%model%, %data%, %prediction%, prob}}'::varchar[],
    -- predict_params_type
        '{{text, text, text, text}}'::varchar[],
    -- metric_func
        'madlib.area_under_roc',
    -- metric_params
        '{{%prediction%, %data%, %id%, mature}}'::varchar[],
    -- metric_params_type
        '{{varchar, varchar, varchar, varchar}}'::varchar[],
    -- data_tbl
        '{schema}.abalone_classif_train',
    -- data_id
        'id',
    -- id_is_random
        FALSE,
    -- validation_result
        '{schema}.abalone_cls_rf_cv',
    -- data_cols
        NULL,
    -- fold_num
        3
)


The above fails with the following error:

    ERROR:  plpy.SPIError: plpy.Error: Prediction Metrics error: Input table 'pg_temp.__madlib_temp_output_error26841664_1547679153_22982462__' does not exist (plpython.c:5038)
    CONTEXT:  Traceback (most recent call last):
      PL/Python function "cross_validation_general", line 20, in <module>
        return cross_validation.cross_validation_general(**globals())
      PL/Python function "cross_validation_general", line 366, in cross_validation_general
      PL/Python function "cross_validation_general", line 271, in _one_step_cv
      PL/Python function "cross_validation_general", line 124, in __cv_funcall_general
    PL/Python function "cross_validation_general"
    SQL state: XX000

In [None]:
%%read_sql
SELECT count(*) FROM {schema}.abalone_cls_rf_cv

## Regression

Previously, our target variable was a binary one that we constructed to represent maturity: an abalone was either mature or not mature. Now let's predict its age instead of the binary target.

### Linear Regression

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_classif_train
LIMIT 2

In [None]:
%%read_sql
SELECT madlib.linregr_train(
    '{schema}.abalone_classif_train',  -- source_table
    '{schema}.abalone_linreg_model',  -- out_table
    'age',  -- dependent_varname
    'ARRAY[
        1,
        length,
        diameter,
        height,
        whole_weight,
        shucked_weight,
        viscera_weight,
        shell_weight,
        sex_f,
        sex_m
    ]',  -- independent_varname
    NULL,  -- grouping_cols
    TRUE  -- heteroskedasticity_option
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_linreg_model
LIMIT 10

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_linreg_test_predict;
CREATE TABLE {schema}.abalone_linreg_test_predict
AS
SELECT 
    test.id,
    madlib.linregr_predict(
        coef, 
        ARRAY[
            1,
            length,
            diameter,
            height,
            whole_weight,
            shucked_weight,
            viscera_weight,
            shell_weight,
            sex_f,
            sex_m
        ] 
    ) as predicted_age,
    test.age as actual_age
FROM {schema}.abalone_classif_test test, {schema}.abalone_linreg_model model
;

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_linreg_test_predict
LIMIT 5

In [None]:
%%read_sql
SELECT madlib.mean_squared_error(
    '{schema}.abalone_linreg_test_predict',  -- table_in
    '{schema}.abalone_linreg_test_predict_mse',  -- table_out
    'predicted_age',  -- prediction_col
    'actual_age'  -- observed_col
)

In [None]:
%%read_sql linreg_mse
SELECT * FROM {schema}.abalone_linreg_test_predict_mse

In [None]:
# Root Mean Squared Error (RMSE)
math.sqrt(linreg_mse.iloc[0,0])
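
For reference, the RMSE computed above is the square root of the mean of the squared prediction errors. A minimal sketch on invented values (the helper `rmse` is hypothetical and independent of the MADlib output table):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse)

predicted_age = [10.0, 12.0, 9.0]  # made-up predictions
actual_age = [11.0, 10.0, 9.0]     # made-up observations
error = rmse(predicted_age, actual_age)
print(round(error, 3))
```

Because of the square root, the RMSE is expressed in the same unit as the target, here years of age.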

### Elastic Net Regression

Elastic Net Regression is linear regression with a combination of L1 (lasso) and L2 (ridge) penalties assigned to the size of the coefficients.
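
To make the penalty concrete, here is a sketch of one common parameterization of the elastic net penalty term, in which `alpha` mixes the L1 and L2 parts and `lambda_value` scales the total. The exact form used by a given library may differ, so treat this formula as an assumption and check the library documentation; the coefficient values are invented.

```python
def elastic_net_penalty(coefs, alpha, lambda_value):
    """Penalty = lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared)."""
    l1 = sum(abs(c) for c in coefs)      # sum of absolute values
    l2_squared = sum(c * c for c in coefs)  # sum of squares
    return lambda_value * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

coefs = [1.0, -2.0]  # made-up coefficients (intercept excluded)
mixed = elastic_net_penalty(coefs, alpha=0.5, lambda_value=0.5)
lasso_only = elastic_net_penalty(coefs, alpha=1.0, lambda_value=0.5)
ridge_only = elastic_net_penalty(coefs, alpha=0.0, lambda_value=0.5)
print(mixed, lasso_only, ridge_only)
```

This penalty is added to the usual squared-error loss, so larger coefficients make the objective worse and the fit is pulled toward smaller ones.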

Note that MADlib's elastic net fits an intercept automatically, so you shouldn't include an explicit intercept column of 1s in your independent variable array.

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_elasticnet_model CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_elasticnet_model_summary CASCADE;
SELECT madlib.elastic_net_train( 
    '{schema}.abalone_classif_train',  -- tbl_source
    '{schema}.abalone_elasticnet_model',  -- tbl_result
    'age',  -- col_dep_var
    'ARRAY[
        length,
        diameter,
        height,
        whole_weight,
        shucked_weight,
        viscera_weight,
        shell_weight,
        sex_f,
        sex_m
    ]',  -- col_ind_var
    'gaussian',  -- regress_family
    0.5,  -- alpha
    0.5,  -- lambda_value
    TRUE  -- standardize
    --,  -- grouping_col
    --,  -- optimizer
    --,  -- optimizer_params
    --,  -- excluded
    --,  -- max_iter
      -- tolerance
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_elasticnet_model

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_elasticnet_model_summary

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_elasticnet_test_predict;
CREATE TABLE {schema}.abalone_elasticnet_test_predict
AS
SELECT 
    test.id,
    madlib.elastic_net_gaussian_predict(
        model.coef_all, 
        model.intercept,
        ARRAY[
            length,
            diameter,
            height,
            whole_weight,
            shucked_weight,
            viscera_weight,
            shell_weight,
            sex_f,
            sex_m
        ] 
    ) as predicted_age,
    test.age as actual_age
FROM {schema}.abalone_classif_test test, {schema}.abalone_elasticnet_model model
;

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_elasticnet_test_predict
LIMIT 5

In [None]:
%%read_sql
SELECT madlib.mean_squared_error(
    '{schema}.abalone_elasticnet_test_predict',  -- table_in
    '{schema}.abalone_elasticnet_test_predict_mse',  -- table_out
    'predicted_age',  -- prediction_col
    'actual_age'  -- observed_col
)

In [None]:
%%read_sql elasticnet_mse
SELECT * FROM {schema}.abalone_elasticnet_test_predict_mse

In [None]:
# Root Mean Squared Error (RMSE)
math.sqrt(elasticnet_mse.iloc[0,0])