### $$\huge{\text{MADlib Tutorial}}$$

This notebook serves as a hands-on introduction to the data science pipeline using the [MADlib](http://madlib.apache.org) machine learning library.  Using a single dataset throughout, it begins with loading the data into a Greenplum Database (GPDB), then proceeds to data exploration, feature engineering, model development, and finally, model evaluation.

We’ll be using the publicly available [Abalone dataset from the University of California, Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/abalone).  The dataset contains nine attributes (including our target prediction column).

| Column Name | Data Type | Description |
| --- | :---: | ---: |
| Sex | text | M, F, I (infant) |
| Length | float | Longest shell measurement |
| Diameter | float | Perpendicular to length |
| Height | float | With meat in shell |
| Whole weight | float | Whole abalone |
| Shucked weight | float | Weight of meat only |
| Viscera weight | float | Gut weight (after bleeding) |
| Shell weight | float | Post-drying |
| Rings | integer | +1.5 gives the age in years |

All of the code needed for this tutorial has already been filled in for you. Feel free to add as many comments and notes as will help your future self (in a Python cell, you can make an inline comment by beginning a line with the `#` character).

# Set Up Your Notebook Environment

In [None]:
# this command allows for visualizations to appear in the notebook
%matplotlib inline

In [None]:
import math
import six
import pandas as pd
from sqlalchemy import create_engine

An optional visualization step in this notebook relies on the `graphviz` package.  The following cell checks whether it is installed. If it is not, you will be prompted to optionally install it now. In the event that you run into problems with package installation, please check the [graphviz download page](https://www.graphviz.org/download/) for more instructions.

In [None]:
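graphviz_installed = True

try:
    import graphviz
except ImportError:
    # Assume failure until the package has been installed and imported.
    graphviz_installed = False
    install_graphviz = six.moves.input('Install `graphviz`? (y/n) ')
    if install_graphviz.strip().lower() == 'y':
        print("installing graphviz")
        !pip install graphviz
        try:
            # Retry the import now that the package should be available.
            import graphviz
            graphviz_installed = True
        except ImportError:
            pass

if not graphviz_installed:
    print("Could not load or install graphviz. Will not show random forest visualization below. ")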

In [None]:
pd.set_option('display.max_columns', 200)

# Connect to Database

To get to the analytics sections more quickly, establishing the SQL connection to GPDB is done behind the scenes by calling a helper function from a custom module called `dbconnect`. This module should be in the same folder as this notebook, in a file called `dbconnect.py`.

In [None]:
import dbconnect

A prerequisite to establishing the SQL connection to GPDB is a set of credentials stored in a credential file (the cell further below expects it to be named `db_credentials.txt`).  The credential file contents should look something like the following.

    [database_creds]
    host: <HOSTNAME_OR_IP>
    port: 5432
    user: <USERNAME>
    database: <DATABASE_NAME>
    password: <PASSWORD>

The values in angle brackets (\<...\>) are placeholders that need to be filled in. For example:

    [database_creds]
    host: 1.2.3.4
    port: 5432
    user: scott
    database: practice_db
    password: my_$ecretP@ss

Running the `connect_and_register_sql_magic()` function below will add a global variable `conn` that is a SQLAlchemy connection object.

In [None]:
db_credential_file = 'db_credentials.txt'
dbconnect.connect_and_register_sql_magic(
    db_credential_file,
    conn_name='conn'
)
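
The internals of `dbconnect` are not shown in this notebook. As a rough illustration only, the sketch below shows one way a helper could turn the credential format above into a connection URL. The function name `build_db_url` and the `postgresql://` URL scheme are assumptions made for this example, not the module's actual code.

```python
import configparser
from urllib.parse import quote_plus

# Credential text in the format shown above. These are the placeholder
# example values from this notebook, not real credentials.
EXAMPLE_CREDS = """
[database_creds]
host: 1.2.3.4
port: 5432
user: scott
database: practice_db
password: my_$ecretP@ss
"""


def build_db_url(cred_text, section="database_creds"):
    """Parse credential text and return a PostgreSQL-style connection URL.

    Hypothetical helper for illustration; `dbconnect` may work differently.
    """
    # interpolation=None keeps characters such as '%' in passwords literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cred_text)
    creds = parser[section]
    # URL-quote the user and password so that characters like '@' and '$'
    # do not break the structure of the URL.
    return "postgresql://{user}:{password}@{host}:{port}/{database}".format(
        user=quote_plus(creds["user"]),
        password=quote_plus(creds["password"]),
        host=creds["host"],
        port=creds["port"],
        database=creds["database"],
    )


db_url = build_db_url(EXAMPLE_CREDS)
print(db_url)
```

A URL of this shape is the kind of string that SQLAlchemy's `create_engine` accepts, which is presumably why `create_engine` is imported at the top of the notebook.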

## Load Abalone Data Locally

An [abalone](https://simple.wikipedia.org/wiki/Abalone) is a saltwater univalve mollusc.

We'll load the data straight from the UCI Machine Learning Repository and then start looking at it.

In [None]:
abalone_data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone_columns = (
    'sex',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'rings'
)
len(abalone_columns)

In [None]:
df_abalone = pd.read_csv(abalone_data_url, names=abalone_columns)

One of the key steps in the data science life cycle is exploring the data.  Key aspects of the data to notice are the presence of null values and data types.

In [None]:
df_abalone.info()

## Looking at the Target
We're interested in estimating the age of the abalone in the data.  To get age, add 1.5 to the number of rings.  A good place to begin is to create a histogram of the target variable.

In [None]:
(df_abalone.rings + 1.5).hist(bins=30)

In [None]:
((df_abalone.rings + 1.5) >= 10).value_counts()

In [None]:
_cumsum = (df_abalone.rings + 1.5).value_counts().sort_index().cumsum()
_cumsum / df_abalone.shape[0]

## Upload data to database

### `psql` approach

`psql` is an alternative to python for running SQL commands and for uploading a small data set. If you have it installed or are logged into the Greenplum master node, you could use the following commands from the command line to copy the data into the database. However, if you are running this notebook in Python you can skip ahead to the code that will use Python to upload the data. 

    psql --dbname=DBNAME --host=HOSTNAME --port=6432 --username=USER --password

    \copy ds_training.abalone (sex, length, diameter, height, whole_weight, shucked_weight, viscera_weight, shell_weight, rings) FROM 'data-science-training/input/abalone.data' DELIMITER ','

After the `psql` `\copy` command you'd need to add an ID column for the following exercises.

### Python pandas approach

The following Python code uploads the data (so far held only in the local DataFrame) into the database. 

In [None]:
schema = 'ds_training'

In [None]:
%read_sql DROP SCHEMA IF EXISTS {schema} CASCADE;
%read_sql CREATE SCHEMA {schema};

In [None]:
df_abalone.to_sql(
    'abalone', 
    conn, 
    schema=schema, 
    if_exists='replace', 
    index=True, 
    index_label='id',
    chunksize=10000)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone
LIMIT 10

# Data pre-processing

## Define target (age)

Our first order of business is to generate our prediction target.  This is a two-step process. We’ll create a new column in our data table (“age”) by adding 1.5 to the “rings” column to generate the abalone age.  We’ll then create another column (“mature”) denoting abalone maturity as either 1 or 0, based on whether the age is greater than or equal to 10 years.

A second transformation is added to the query below to streamline later processing. The `sex` column has three possible values: "M" for male, "F" for female, and "I" for infant. The function we will use later to one-hot encode this column works better on lower-cased characters, so the query also converts the `sex` field to lowercase while building the new table. 
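
The same per-row logic can be sketched in plain Python before running it in SQL. The function name `make_target` and the sample `(sex, rings)` pairs are hypothetical and used only for illustration.

```python
def make_target(sex, rings, maturity_age=10.0):
    """Return (lowercased sex, age, mature flag) for one abalone record."""
    age = rings + 1.5
    mature = 1 if age >= maturity_age else 0
    return sex.lower(), age, mature


# Hypothetical (sex, rings) pairs, not rows from the real data set.
rows = [("M", 15), ("F", 9), ("I", 8)]
targets = [make_target(sex, rings) for sex, rings in rows]
print(targets)
```

Because `rings` is an integer, the condition `age >= 10` is equivalent to `rings >= 9`: an abalone with 9 rings has an age of 10.5 and counts as mature, while one with 8 rings has an age of 9.5 and does not.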

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_target;
CREATE TABLE {schema}.abalone_target
AS
SELECT 
    id,
    lower(sex) as sex,
    "length",
    diameter,
    height,
    whole_weight,
    shucked_weight,
    viscera_weight,
    shell_weight,
    rings,
    rings + 1.5 as age,
    CASE WHEN 
            (rings + 1.5) >= 10.0
        THEN 1
        ELSE 0
    END as mature
FROM {schema}.abalone

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_target 
LIMIT 10

In [None]:
%%read_sql
SELECT sum(mature), count(*)
FROM {schema}.abalone_target 


## Encode categorical variables

The next thing is to leverage [MADlib to one-hot encode](http://madlib.apache.org/docs/latest/group__grp__encode__categorical.html) the “sex” column which is a categorical variable.  In order to create a predictive model, we need all our columns to be numerical values.  Making sure all our model inputs conform to this standard is an important part of the data science modeling pipeline and is considered part of the preprocessing/data cleaning step of the process.
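
Conceptually, one-hot encoding replaces a single categorical column with one 0/1 indicator column per category. The following pure-Python sketch illustrates the idea on a few made-up values; it is not MADlib's implementation, and the helper name `one_hot` is hypothetical.

```python
def one_hot(values, prefix):
    """One-hot encode a list of category labels into a list of dict rows."""
    categories = sorted(set(values))
    return [
        {"{}_{}".format(prefix, cat): int(value == cat) for cat in categories}
        for value in values
    ]


# Made-up values of the lower-cased `sex` column.
encoded = one_hot(["m", "f", "i", "m"], prefix="sex")
for row in encoded:
    print(row)
```

Each row has exactly one indicator set to 1, which is why the encoded table ends up with the columns `sex_f`, `sex_i`, and `sex_m` used later in this notebook.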

In [None]:
%%read_sql
SELECT
madlib.encode_categorical_variables (
        '{schema}.abalone_target',  -- input table
        '{schema}.abalone_encoded',  -- output table
        'sex',   -- categorical_cols
        NULL,  --categorical_cols_to_exclude    -- Optional
        NULL,  --row_id,                         -- Optional
        NULL,  --top,                            -- Optional
        NULL,  --value_to_drop,                  -- Optional
        NULL,  --encode_null,                    -- Optional
        NULL,  --output_type,                    -- Optional
        NULL,  --output_dictionary,              -- Optional
        NULL  --distributed_by                  -- Optional
    )

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_encoded
LIMIT 5

# Explore data

The next step in the modeling process is to explore our data.  We’ll again use some of MADlib’s built-in functionality to generate [descriptive statistics](http://madlib.apache.org/docs/latest/group__grp__summary.html) of our data.  This will generate important information about the data including count, number of missing values, the mean, median, maximum, minimum, interquartile range, mode, and variance.

Note that you only want to do this after converting categorical data to numeric data because otherwise the statistics will not be computed correctly.
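
To make a few of these statistics concrete, here is a small sketch using Python's standard `statistics` module on a made-up column containing one missing value. The helper name `summarize` is hypothetical, and the sample variance used here is an assumption; MADlib's exact definitions may differ.

```python
import statistics


def summarize(values):
    """Compute a few descriptive statistics, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "mode": statistics.mode(present),
        # Sample variance (divides by n - 1); an assumption for this sketch.
        "variance": statistics.variance(present),
    }


# Made-up ring counts with one missing value.
summary = summarize([15, 7, 9, None, 10, 9])
print(summary)
```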


In [None]:
%%read_sql
SELECT
madlib.summary ( 
    '{schema}.abalone_encoded',  -- source_table,
    '{schema}.abalone_summary',  -- output_table,
    NULL,  -- target_cols,
    NULL  -- grouping_cols,
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_summary
LIMIT 15

In [None]:
%%read_sql
SELECT target_column
from {schema}.abalone_summary

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_encoded
limit 3

Another aspect of the data that we might want to know about is the correlation between different columns.  We turn again to MADlib, which provides a ready-made function: [correlation()](http://madlib.apache.org/docs/latest/group__grp__correlation.html).
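
As a reminder of what a correlation coefficient measures, the sketch below computes the Pearson correlation between two columns by hand. The function name `pearson` and the sample values are illustrative assumptions, not MADlib code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up measurements: `diameter` is exactly twice `length`.
length = [1.0, 2.0, 3.0, 4.0]
diameter = [2.0, 4.0, 6.0, 8.0]
r_positive = pearson(length, diameter)
r_negative = pearson(length, list(reversed(diameter)))
print(r_positive, r_negative)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.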

In [None]:
%%read_sql
SELECT
madlib.correlation(
    '{schema}.abalone_encoded', -- source_table,
    '{schema}.abalone_correlations', -- output_table,
    'length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,rings', -- target_cols,
    TRUE, -- verbose,
    'sex_f,sex_i,sex_m'  -- grouping_columns
)

In [None]:
%%read_sql
SELECT * 
from {schema}.abalone_correlations
ORDER BY sex_m, sex_f, sex_i, column_position

A large part of ensuring predictive power is creating a hold-out data set that we don’t train our model with.  By holding back this subset of the data, we can test any model we develop against “unseen” data and detect overfitting.  This helps us arrive at a predictive model that generalizes better.

There’s no right answer as to how much data to set aside in the test table; a 70%-30% split, weighted towards the training data, is a good rule of thumb.  This process is referred to as the [train-test split](http://madlib.apache.org/docs/latest/group__grp__train__test__split.html).
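
The idea behind a split without replacement can be sketched in a few lines of plain Python: shuffle the rows, then cut the shuffled list at the desired proportion. The helper name `split_rows` and the fixed seed are assumptions for this illustration; MADlib's own sampling is done inside the database.

```python
import random


def split_rows(rows, train_proportion=0.7, seed=42):
    """Randomly split rows into (train, test) without replacement."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_proportion * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


ids = list(range(10))  # stand-in for the `id` column
train_ids, test_ids = split_rows(ids)
print(len(train_ids), len(test_ids))
```

Every row lands in exactly one of the two lists, so no test record can leak into training.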


In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_encoded LIMIT 2

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_classif CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_classif_train CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_classif_test CASCADE;
SELECT madlib.train_test_split(
    '{schema}.abalone_encoded', -- source_table,
    '{schema}.abalone_classif', -- output_table,
    0.7, -- train_proportion,
    NULL, -- test_proportion,
    NULL, -- grouping_cols,
    'id,length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_i,sex_m,rings,age,mature', -- target_cols,
    FALSE, -- with_replacement,
    TRUE -- separate_output_tables
)

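Because `separate_output_tables` is `TRUE`, the result is written to two tables, `abalone_classif_train` and `abalone_classif_test`. If it were `FALSE`, a single output table would be created with the train/test flag in column `split`, where `1` means train and `0` means test.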

In [None]:
%%read_sql
SELECT count(*) as n
FROM {schema}.abalone_classif_train

In [None]:
%%read_sql
SELECT count(*) as n
FROM {schema}.abalone_classif_test

# Modeling

## Classification

### Logistic Regression

We’re now ready to create our first predictive model.  We’ll start with a classic logistic regression since we’ve decided that we have a classification problem.  

Note: we drop one of the one-hot-encoded variables (here `sex_i`) to remove perfect collinearity with the intercept term.
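
The collinearity is easy to see on a few made-up rows: the three indicator columns always sum to 1, which is exactly the constant intercept column.

```python
# Made-up one-hot rows, one per category of `sex`.
rows = [
    {"sex_f": 1, "sex_i": 0, "sex_m": 0},
    {"sex_f": 0, "sex_i": 1, "sex_m": 0},
    {"sex_f": 0, "sex_i": 0, "sex_m": 1},
]

intercept = [1 for _ in rows]
dummy_sums = [row["sex_f"] + row["sex_i"] + row["sex_m"] for row in rows]

# The sum of all three indicators reproduces the intercept column exactly,
# so one indicator is redundant once an intercept is in the model.
print(dummy_sums == intercept)
```

With `sex_i` removed, infants become the baseline category, and the coefficients on `sex_f` and `sex_m` are interpreted relative to that baseline.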

In [None]:
%%read_sql
SELECT
madlib.logregr_train(
    '{schema}.abalone_classif_train', -- source_table,
    '{schema}.abalone_logreg_model', -- out_table,
    'mature', -- dependent_varname,
    'ARRAY[
        1,
        length,
        diameter,
        height,
        whole_weight,
        shucked_weight,
        viscera_weight,
        shell_weight,
        sex_f,
        sex_m
    ]' -- independent_varname,
    --, -- grouping_cols,
    --, -- max_iter,
    --, -- optimizer,
    --, -- tolerance,
     -- verbose
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_model_summary

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_model
LIMIT 5

Show coefficients from model

In [None]:
%%read_sql logreg_coefs
SELECT coef 
FROM {schema}.abalone_logreg_model

In [None]:
logreg_coef_names = (
    'intercept',
    'length',
    'diameter',
    'height',
    'whole_weight',
    'shucked_weight',
    'viscera_weight',
    'shell_weight',
    'sex_f',
    'sex_m'
)
tuple(zip(logreg_coef_names, logreg_coefs.iloc[0, 0]))

**Cross join model with table to be scored**

Now that we have a model with coefficients, we can make predictions on records previously unseen by the model. In the current version of MADlib (1.15.1), the way to predict probability using a logistic regression model is to `CROSS JOIN` the test set records with the single-row model table. A `CROSS JOIN` produces the Cartesian product of the records in both tables, meaning it pairs every record from one table with every record in the other table. In Postgres/Greenplum this can be done explicitly using the `CROSS JOIN` statement, or you can simply list the two tables in the `FROM` clause separated by a comma. 
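
The following sketch mimics that pattern in plain Python: `itertools.product` plays the role of the cross join, and a small function applies the logistic (sigmoid) formula to each paired row. The coefficient values, the rows, and the name `predict_prob` are made up for illustration and are not MADlib's implementation.

```python
import math
from itertools import product


def predict_prob(coef, features):
    """Logistic regression probability: sigmoid of the dot product."""
    z = sum(c * x for c, x in zip(coef, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical single-row model table: an intercept and one slope.
model_rows = [{"coef": [-1.0, 2.0]}]
# Hypothetical test rows; the leading 1 is the intercept term.
test_rows = [
    {"id": 0, "x": [1, 0.0]},
    {"id": 1, "x": [1, 0.5]},
    {"id": 2, "x": [1, 1.0]},
]

# Cross join: every test row is paired with the one model row.
scored = [
    (test["id"], predict_prob(model["coef"], test["x"]))
    for test, model in product(test_rows, model_rows)
]
print(scored)
```

Because the model table has a single row, the cross join returns exactly one scored row per test record.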

In [None]:
%%read_sql
CREATE TABLE {schema}.abalone_logreg_test_proba
AS
SELECT madlib.logregr_predict_prob(
        coef, 
        ARRAY[
            1,
            length,
            diameter,
            height,
            whole_weight,
            shucked_weight,
            viscera_weight,
            shell_weight,
            sex_f,
            sex_m
        ] 
    ) as proba,
    test.mature
FROM {schema}.abalone_classif_test test, {schema}.abalone_logreg_model model
;

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_proba
LIMIT 4

In [None]:
%%read_sql
SELECT
madlib.area_under_roc(
    '{schema}.abalone_logreg_test_proba', -- table_in, 
    '{schema}.abalone_logreg_test_auc',  --table_out,
    'proba',  -- prediction_col, 
    'mature'  --observed_col, 
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_auc

In [None]:
%%read_sql
CREATE TABLE {schema}.abalone_logreg_test_predict
AS
SELECT
    (proba >= 0.5)::integer as predicted,
    mature
FROM {schema}.abalone_logreg_test_proba

In [None]:
%%read_sql
SELECT
madlib.confusion_matrix(
    '{schema}.abalone_logreg_test_predict', -- table_in
    '{schema}.abalone_logreg_test_conf_matrix', -- table_out
    'predicted',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT 
    row_id,
    class,
    confusion_arr[1] as predicted_0,
    confusion_arr[2] as predicted_1
FROM {schema}.abalone_logreg_test_conf_matrix
ORDER BY row_id

**Check by hand which axis corresponds to actual vs predicted**

In the confusion matrix, there is no indication of which axis holds the actual classes and which holds the predicted classes.  We can check this by counting the rows in our predictions table where we predicted 1 but the actual class is 0 (the false positives) and comparing that count with the matrix.
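
For reference, the four cells of a binary confusion matrix can be counted by hand as in this sketch. The helper name `confusion_counts` and the sample pairs are hypothetical.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count outcomes for (actual, predicted) pairs of binary labels."""
    counts = Counter(pairs)
    return {
        "true_negative": counts[(0, 0)],
        "false_positive": counts[(0, 1)],
        "false_negative": counts[(1, 0)],
        "true_positive": counts[(1, 1)],
    }


# Made-up (actual, predicted) pairs.
pairs = [(0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0)]
matrix = confusion_counts(pairs)
print(matrix)
```

The false-positive count is the quantity that the query in the next cell computes, so matching it against the matrix reveals the orientation of the axes.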

In [None]:
%%read_sql
SELECT
    count(*)
FROM {schema}.abalone_logreg_test_predict
WHERE
    predicted = 1 AND
    mature = 0


Get ROC values (thresholds, true-positives, false-positives)
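
To make these quantities concrete, the sketch below computes the false-positive rate and true-positive rate at each threshold for a handful of made-up scores, then integrates the resulting curve with the trapezoidal rule. The function names and the data are assumptions for illustration; MADlib's functions may compute these values differently.

```python
def roc_points(scored, thresholds):
    """Return (threshold, fpr, tpr) tuples; scored holds (prob, actual)."""
    positives = sum(1 for _, actual in scored if actual == 1)
    negatives = len(scored) - positives
    points = []
    for t in thresholds:
        # A record is predicted positive when its probability is >= t.
        tp = sum(1 for p, a in scored if p >= t and a == 1)
        fp = sum(1 for p, a in scored if p >= t and a == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the (fpr, tpr) curve."""
    curve = sorted((fpr, tpr) for _, fpr, tpr in points)
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    )


# Made-up (predicted probability, actual class) pairs.
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.3, 0), (0.1, 0)]
# One threshold per distinct score, plus one above every score.
thresholds = sorted({p for p, _ in scored}) + [float("inf")]
points = roc_points(scored, thresholds)
auc_value = area_under_curve(points)
print(auc_value)
```

In this toy data, 8 of the 9 positive/negative pairs are ranked correctly, and the area under the curve comes out to the same fraction, 8/9.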

In [None]:
%%read_sql
SELECT
madlib.binary_classifier(
    '{schema}.abalone_logreg_test_proba', -- table_in
    '{schema}.abalone_logreg_test_binary_metrics', -- table_out
    'proba',  --prediction_col
    'mature' --observation_col
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_logreg_test_binary_metrics
WHERE 
    --round(threshold::numeric, 1) = 0.5
    threshold >= 0.48 AND
    threshold <= 0.52
ORDER BY threshold

The `-d` flag for the `%%read_sql` magic command below keeps it from displaying the query result. In this case the result has many rows, which we want stored in the `logreg_metrics` dataframe without printing the whole thing. 

In [None]:
%%read_sql -d logreg_metrics
SELECT *
FROM {schema}.abalone_logreg_test_binary_metrics
ORDER BY threshold

In [None]:
logreg_metrics.plot('fpr', 'tpr')

### Random Forest Classifier

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_classif_train
LIMIT 5

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_model;
DROP TABLE IF EXISTS {schema}.abalone_rf_model_group;
DROP TABLE IF EXISTS {schema}.abalone_rf_model_summary;
SELECT
madlib.forest_train(
    '{schema}.abalone_classif_train',  -- training_table_name
    '{schema}.abalone_rf_model',  -- output_table_name
    'id',  -- id_col_name
    'mature',  -- dependent_variable
    'length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_m',  -- list_of_features
    NULL,  -- list_of_features_to_exclude
    NULL,  -- grouping_columns
    10  -- number of trees
)


In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_model
LIMIT 4

In [None]:
%%read_sql rf_tree1
SELECT madlib.get_tree(
    '{schema}.abalone_rf_model',
    1,
    1,
    FALSE  -- return results in dot_format? (boolean)
) 

In [None]:
print(rf_tree1.iloc[0,0])

In [None]:
%%read_sql rf_tree1_dot
SELECT madlib.get_tree(
    '{schema}.abalone_rf_model',
    1,
    1,
    TRUE  -- return results in dot_format? (boolean)
) 

In [None]:
if graphviz_installed:
    rf_dot_source = graphviz.Source(rf_tree1_dot.iloc[0,0])
    display(rf_dot_source)

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_importances;
SELECT madlib.get_var_importance(
    '{schema}.abalone_rf_model',  -- model_table
    '{schema}.abalone_rf_importances'  -- output_table
)

In [None]:
%%read_sql
SELECT *
FROM {schema}.abalone_rf_importances
ORDER BY impurity_var_importance DESC

In [None]:
%%read_sql -d
DROP TABLE IF EXISTS {schema}.abalone_rf_test_proba;
SELECT
madlib.forest_predict(
    '{schema}.abalone_rf_model',  -- random_forest_model
    '{schema}.abalone_classif_test',  -- new_data_table
    '{schema}.abalone_rf_test_proba',  -- output_table
    'prob'  -- type
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_proba
LIMIT 5

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_predict_actual;
CREATE TABLE {schema}.abalone_rf_test_predict_actual
AS
SELECT 
    test.id,
    prob.estimated_prob_1,
    prob.estimated_prob_1 >= 0.5 as predicted_class,
    test.mature as actual_class
FROM 
    {schema}.abalone_rf_test_proba prob
INNER JOIN
    {schema}.abalone_classif_test test
ON
    prob.id = test.id

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_binary_metrics;
SELECT
madlib.binary_classifier(
    '{schema}.abalone_rf_test_predict_actual', -- table_in
    '{schema}.abalone_rf_test_binary_metrics', -- table_out
    'estimated_prob_1',  --prediction_col
    'actual_class' --observation_col
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_binary_metrics
ORDER BY threshold
LIMIT 15

In [None]:
%%read_sql -d rf_metrics
SELECT fpr, tpr
FROM {schema}.abalone_rf_test_binary_metrics
ORDER BY threshold

In [None]:
rf_metrics.plot('fpr', 'tpr')

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_test_auc CASCADE;
SELECT
madlib.area_under_roc(
    '{schema}.abalone_rf_test_predict_actual', -- table_in
    '{schema}.abalone_rf_test_auc', -- table_out
    'estimated_prob_1',  --prediction_col
    'actual_class' --observation_col
)

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_rf_test_auc
LIMIT 15

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_model_v2 CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_rf_model_v2_group CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_rf_model_v2_summary CASCADE;
SELECT
madlib.forest_train(
    '{schema}.abalone_classif_train',  -- training_table_name
    '{schema}.abalone_rf_model_v2',  -- output_table_name
    'id',  -- id_col_name
    'mature',  -- dependent_variable
    'length,diameter,height,whole_weight,shucked_weight,viscera_weight,shell_weight,sex_f,sex_m',  -- list_of_features
    NULL,  -- list_of_features_to_exclude
    NULL,  -- grouping_columns
    10,  -- number of trees
    NULL,  -- num_random_features
    TRUE,  -- importance
    1,  -- num_permutations
    4,  -- max_tree_depth
    NULL,  -- min_split
    NULL,  -- min_bucket
    NULL,  -- num_splits
    NULL,  -- null_handling_params
    TRUE,  -- verbose
    NULL   -- sample_ratio
)


In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_v2_test_proba;
SELECT
madlib.forest_predict(
    '{schema}.abalone_rf_model_v2',  -- random_forest_model
    '{schema}.abalone_classif_test',  -- new_data_table
    '{schema}.abalone_rf_v2_test_proba',  -- output_table
    'prob'  -- type
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_rf_v2_test_proba
LIMIT 4

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_v2_test_predict_actual;
CREATE TABLE {schema}.abalone_rf_v2_test_predict_actual
AS
SELECT 
    test.id,
    prob.estimated_prob_1,
    prob.estimated_prob_1 >= 0.5 as predicted_class,
    test.mature as actual_class
FROM 
    {schema}.abalone_rf_v2_test_proba prob
INNER JOIN
    {schema}.abalone_classif_test test
ON
    prob.id = test.id

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_rf_v2_test_auc CASCADE;
SELECT
madlib.area_under_roc(
    '{schema}.abalone_rf_v2_test_predict_actual', -- table_in
    '{schema}.abalone_rf_v2_test_auc', -- table_out
    'estimated_prob_1',  --prediction_col
    'actual_class' --observation_col
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_rf_v2_test_auc

## Regression

Previously, our target variable was a binary one that we constructed to represent maturity: an abalone was either mature or not mature. Now let's predict its age instead of the binary target. 

### Linear Regression

In [None]:
%%read_sql
SELECT * 
FROM {schema}.abalone_classif_train
LIMIT 2

In [None]:
%%read_sql
SELECT madlib.linregr_train(
    '{schema}.abalone_classif_train',  -- source_table
    '{schema}.abalone_linreg_model',  -- out_table
    'age',  -- dependent_varname
    'ARRAY[
        1,
        length,
        diameter,
        height,
        whole_weight,
        shucked_weight,
        viscera_weight,
        shell_weight,
        sex_f,
        sex_m
    ]',  -- independent_varname
    NULL,  -- grouping_cols
    TRUE  -- heteroskedasticity_option
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_linreg_model
LIMIT 10

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_linreg_test_predict;
CREATE TABLE {schema}.abalone_linreg_test_predict
AS
SELECT 
    test.id,
    madlib.linregr_predict(
        coef, 
        ARRAY[
            1,
            length,
            diameter,
            height,
            whole_weight,
            shucked_weight,
            viscera_weight,
            shell_weight,
            sex_f,
            sex_m
        ] 
    ) as predicted_age,
    test.age as actual_age
FROM {schema}.abalone_classif_test test, {schema}.abalone_linreg_model model
;

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_linreg_test_predict
LIMIT 5

In [None]:
%%read_sql
SELECT madlib.mean_squared_error(
    '{schema}.abalone_linreg_test_predict',  -- table_in
    '{schema}.abalone_linreg_test_predict_mse',  -- table_out
    'predicted_age',  -- prediction_col
    'actual_age'  -- observed_col
)

In [None]:
%%read_sql linreg_mse
SELECT * FROM {schema}.abalone_linreg_test_predict_mse

In [None]:
# Root Mean Squared Error (RMSE)
math.sqrt(linreg_mse.iloc[0,0])
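
For intuition, the mean squared error and its square root can be computed by hand on a few made-up predictions. The values and the helper name below are hypothetical; the real numbers come from the table produced above.

```python
import math


def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and truth."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Made-up predicted and actual ages, in years.
predicted_age = [10.5, 9.0, 12.5]
actual_age = [11.5, 9.0, 10.5]

mse = mean_squared_error(predicted_age, actual_age)
rmse = math.sqrt(mse)
print(mse, rmse)
```

Taking the square root puts the error back in the units of the target, so the RMSE can be read as a typical prediction error in years.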

### Elastic Net Regression

Elastic Net Regression is linear regression with penalties assigned to the size of the coefficients. 

Note that MADlib's elastic net automatically fits an intercept, so you shouldn't include an explicit intercept column of 1's in your independent variable array.
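
As a rough sketch of what the penalty looks like, the code below evaluates one commonly used form of the elastic net penalty, in which `alpha` mixes an L1 (lasso) term with an L2 (ridge) term and `lambda_value` scales the total. This form and the helper name are assumptions for illustration; consult the MADlib documentation for the exact objective it minimizes.

```python
def elastic_net_penalty(coefs, alpha, lambda_value):
    """One common form: lambda * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(c) for c in coefs)
    l2_squared = sum(c * c for c in coefs)
    return lambda_value * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


coefs = [3.0, -4.0]  # made-up coefficients (the intercept is not penalized)
mixed = elastic_net_penalty(coefs, alpha=0.5, lambda_value=0.5)
lasso_only = elastic_net_penalty(coefs, alpha=1.0, lambda_value=0.5)
ridge_only = elastic_net_penalty(coefs, alpha=0.0, lambda_value=0.5)
print(mixed, lasso_only, ridge_only)
```

Under this form, setting `alpha` to 1 gives a pure lasso penalty and setting it to 0 gives a pure ridge penalty, while a larger `lambda_value` shrinks the coefficients more strongly.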

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_elasticnet_model CASCADE;
DROP TABLE IF EXISTS {schema}.abalone_elasticnet_model_summary CASCADE;
SELECT madlib.elastic_net_train( 
    '{schema}.abalone_classif_train',  -- tbl_source
    '{schema}.abalone_elasticnet_model',  -- tbl_result
    'age',  -- col_dep_var
    'ARRAY[
        length,
        diameter,
        height,
        whole_weight,
        shucked_weight,
        viscera_weight,
        shell_weight,
        sex_f,
        sex_m
    ]',  -- col_ind_var
    'gaussian',  -- regress_family
    0.5,  -- alpha
    0.5,  -- lambda_value
    TRUE  -- standardize
    --,  -- grouping_col
    --,  -- optimizer
    --,  -- optimizer_params
    --,  -- excluded
    --,  -- max_iter
      -- tolerance
)

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_elasticnet_model

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_elasticnet_model_summary

In [None]:
%%read_sql
DROP TABLE IF EXISTS {schema}.abalone_elasticnet_test_predict;
CREATE TABLE {schema}.abalone_elasticnet_test_predict
AS
SELECT 
    test.id,
    madlib.elastic_net_gaussian_predict(
        model.coef_all, 
        model.intercept,
        ARRAY[
            length,
            diameter,
            height,
            whole_weight,
            shucked_weight,
            viscera_weight,
            shell_weight,
            sex_f,
            sex_m
        ] 
    ) as predicted_age,
    test.age as actual_age
FROM {schema}.abalone_classif_test test, {schema}.abalone_elasticnet_model model
;

In [None]:
%%read_sql
SELECT * FROM {schema}.abalone_elasticnet_test_predict
LIMIT 5

In [None]:
%%read_sql
SELECT madlib.mean_squared_error(
    '{schema}.abalone_elasticnet_test_predict',  -- table_in
    '{schema}.abalone_elasticnet_test_predict_mse',  -- table_out
    'predicted_age',  -- prediction_col
    'actual_age'  -- observed_col
)

In [None]:
%%read_sql elasticnet_mse
SELECT * FROM {schema}.abalone_elasticnet_test_predict_mse

In [None]:
# Root Mean Squared Error (RMSE)
math.sqrt(elasticnet_mse.iloc[0,0])