### Set up database connectivity

We'll reuse our module from the previous notebook (***`00_database_connectivity_setup.ipynb`***) to establish connectivity to the database.

In [1]:
%run '00_database_connectivity_setup.ipynb'
IPython.display.clear_output()

### Apache MADlib

Model-parallel (or task-parallel) problems typically involve fitting a single machine learning model, on a distributed cluster, to a dataset that is too large to fit in memory. The [Apache MADlib](http://madlib.apache.org/) project explicitly parallelizes such models: it splits the fitting into sub-tasks that run simultaneously on multiple nodes of the cluster, then combines the results of those sub-tasks to fit the original model.
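The split-and-combine idea can be sketched in a few lines of Python. This is a toy illustration, not MADlib's implementation: it fits a one-variable least-squares line by keeping running sums per data partition, merging the partial states, and solving once at the end. The names `State`, `transition`, `merge`, `final`, and `fit_partition`, as well as the two data partitions, are hypothetical. The same transition/merge/final shape shows up in the aggregate definition printed later in this notebook (`SFUNC`, `prefunc`, `FINALFUNC`).

```python
from dataclasses import dataclass


@dataclass
class State:
    """Sufficient statistics for a one-variable least-squares fit."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0


def transition(state, x, y):
    """Fold one row into the running state (runs on each node)."""
    return State(
        state.n + 1,
        state.sum_x + x,
        state.sum_y + y,
        state.sum_xx + x * x,
        state.sum_xy + x * y,
    )


def merge(a, b):
    """Combine the partial states computed on two nodes."""
    return State(
        a.n + b.n,
        a.sum_x + b.sum_x,
        a.sum_y + b.sum_y,
        a.sum_xx + b.sum_xx,
        a.sum_xy + b.sum_xy,
    )


def final(state):
    """Turn the merged state into (slope, intercept)."""
    denom = state.n * state.sum_xx - state.sum_x ** 2
    slope = (state.n * state.sum_xy - state.sum_x * state.sum_y) / denom
    intercept = (state.sum_y - slope * state.sum_x) / state.n
    return slope, intercept


def fit_partition(rows):
    """Run the transition function over one partition of the data."""
    state = State()
    for x, y in rows:
        state = transition(state, x, y)
    return state


# Two hypothetical partitions of the same table, drawn from y = 2x + 1.
segment_a = [(0.0, 1.0), (1.0, 3.0)]
segment_b = [(2.0, 5.0), (3.0, 7.0)]

slope, intercept = final(merge(fit_partition(segment_a), fit_partition(segment_b)))
print(slope, intercept)
```

Because the state is just a handful of sums, each node only needs to ship a few numbers to the coordinator, regardless of how many rows it scanned.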

MADlib implements a broad collection of machine learning algorithms in-database and performs well on large-scale datasets.

![Apache MADlib](https://github.com/vatsan/postgresopen-2017/blob/master/docs/images/madlib_1.png?raw=true)

### Breadth of algorithms

![Apache MADlib User Guide](https://github.com/vatsan/postgresopen-2017/blob/master/docs/images/madlib_2.png?raw=true)

### Architecture

![Apache MADlib Architecture](https://github.com/vatsan/postgresopen-2017/blob/master/docs/images/madlib_3.png?raw=true)

### Elastic Net Regression - Demo

In [2]:
%%time
%%execsql
drop table if exists wine_sample_madlib;
create table wine_sample_madlib
as
(
    select
        id,
        ARRAY[
            fixed_acidity,
            volatile_acidity,
            citric_acid,
            residual_sugar,
            chlorides,
            free_sulfur_dioxide,
            total_sulfur_dioxide,
            density,
            ph,
            sulphates,
            alcohol
        ] as features,
        quality
    from
        wine_sample
)

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 10.3 ms


In [3]:
%%time
%%execsql
drop table if exists wine_sample_madlib_mdl, wine_sample_madlib_mdl_summary;
select 
    madlib.elastic_net_train(
        -- Source table
        'wine_sample_madlib',
        -- Result table
        'wine_sample_madlib_mdl', 
        -- Dependent variable
        'quality',                 
        -- Independent variables (array column)
        'features',
        -- Regression family
        'gaussian',             
        -- Alpha value
        0.5,           
        -- Lambda value
        0.1,          
        -- Standardize
        TRUE,
        -- Grouping column(s)
        NULL,         
        -- Optimizer
        'fista',     
        -- Optimizer parameters
        '',   
        -- Excluded columns
        NULL,        
        -- Maximum iterations
        10000,     
        -- Tolerance value
        1e-6                       
    );

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 411 ms
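To make the `alpha` and `lambda` arguments above more concrete, here is a minimal Python sketch of the elastic net penalty term. It assumes the penalty has the form λ[(1 − α)/2 · ‖w‖₂² + α · ‖w‖₁], which is how the MADlib elastic net documentation describes it as far as I recall; treat that form as an assumption and check the documentation for your version. The function name `elastic_net_penalty` and the weight vector are hypothetical, and the sketch covers the penalty only, not the optimizer.

```python
def elastic_net_penalty(weights, alpha, lambda_value):
    """Penalty added to the loss: a mix of L1 (lasso) and L2 (ridge) terms.

    Assumed form: lambda * ((1 - alpha) / 2 * ||w||_2^2 + alpha * ||w||_1).
    """
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return lambda_value * ((1.0 - alpha) / 2.0 * l2_norm_squared + alpha * l1_norm)


# Hypothetical coefficient vector, for illustration only.
weights = [0.5, -1.0, 0.0]

mixed = elastic_net_penalty(weights, alpha=0.5, lambda_value=0.1)  # as in the call above
ridge = elastic_net_penalty(weights, alpha=0.0, lambda_value=0.1)  # pure L2
lasso = elastic_net_penalty(weights, alpha=1.0, lambda_value=0.1)  # pure L1
print(mixed, ridge, lasso)
```

Under this form, `alpha = 0` reduces to a ridge penalty and `alpha = 1` to a lasso penalty, so the `0.5` used in the cell above weights the two equally. The L1 part is what can drive some coefficients exactly to zero.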


In [4]:
%%time
%%showsql
select
    *
from
    wine_sample_madlib_mdl;

| | family | features | features_selected | coef_nonzero | coef_all | intercept | log_likelihood | standardize | iteration_run |
|---|---|---|---|---|---|---|---|---|---|
| 0 | gaussian | [features[1], features[2], features[3], features[4], features[5], features[6], features[7], features[8], features[9], features[10], features[11]] | [features[1], features[2], features[5], features[7], features[10], features[11]] | [0.00213245063091, -0.985673014513, -0.273021090308, -0.000964230457507, 0.489861692677, 0.251937832341] | [0.00213245063091, -0.985673014513, 0.0, 0.0, -0.273021090308, 0.0, -0.000964230457507, 0.0, 0.0, 0.489861692677, 0.251937832341] | 3.25888 | -0.249234 | True | 64 |


CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 35.1 ms
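The model row above can be read as a plain linear predictor: the intercept plus the dot product of `coef_all` with a row's `features` array. The Python sketch below copies `coef_all` and `intercept` from that output (the intercept is the rounded value as displayed) and shows how `features_selected` follows from the nonzero entries. The `predict` helper and the sample feature values are hypothetical and exist only for illustration.

```python
# Values copied from the model table output above.
coef_all = [
    0.00213245063091, -0.985673014513, 0.0, 0.0, -0.273021090308, 0.0,
    -0.000964230457507, 0.0, 0.0, 0.489861692677, 0.251937832341,
]
intercept = 3.25888  # rounded as displayed in the output


def predict(features):
    """Linear prediction: intercept + coef_all . features."""
    if len(features) != len(coef_all):
        raise ValueError("expected %d features, got %d" % (len(coef_all), len(features)))
    return intercept + sum(c * x for c, x in zip(coef_all, features))


# 1-based positions of the nonzero coefficients, matching features_selected.
selected = [i + 1 for i, c in enumerate(coef_all) if c != 0.0]
print(selected)

# Hypothetical feature row in the same column order as the ARRAY[...] above.
sample = [7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]
print(predict(sample))
```

Features whose coefficient is exactly zero (positions 3, 4, 6, 8, and 9 here) have no effect on the prediction, which is the practical consequence of the L1 part of the penalty.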


### Apache MADlib - looking under the hood

In [5]:
%%bash
psql -d vatsandb -c 'select madlib.version();'

                                                                                                                version                                                                                                                
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 MADlib version: 1.12, git revision: rc/1.12-rc1, cmake configuration time: Wed Aug 23 22:33:09 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0
(1 row)



In [6]:
%%bash
locate libmadlib.so

/usr/local/madlib/Versions/1.12/ports/greenplum/4.2/lib/libmadlib.so
/usr/local/madlib/Versions/1.12/ports/greenplum/4.3/lib/libmadlib.so
/usr/local/madlib/Versions/1.12/ports/greenplum/4.3ORCA/lib/libmadlib.so
/usr/local/madlib/Versions/1.12/ports/hawq/2/lib/libmadlib.so
/usr/local/madlib/Versions/1.12/ports/postgres/9.5/lib/libmadlib.so
/usr/local/madlib/Versions/1.12/ports/postgres/9.6/lib/libmadlib.so


In [7]:
%%bash
tail -n 50 /usr/local/madlib/Versions/1.12/ports/postgres/modules/regress/linear.sql_in

 * </pre>
 */
DROP AGGREGATE IF EXISTS MADLIB_SCHEMA.heteroskedasticity_test_linregr(
    DOUBLE PRECISION, DOUBLE PRECISION [], DOUBLE PRECISION []);
CREATE AGGREGATE MADLIB_SCHEMA.heteroskedasticity_test_linregr(
    /*+ "dependentVariable" */ DOUBLE PRECISION,
    /*+ "independentVariables" */ DOUBLE PRECISION[],
    /*+ "olsCoefficients" */ DOUBLE PRECISION[]) (

    SFUNC=MADLIB_SCHEMA.hetero_linregr_transition,
    STYPE=MADLIB_SCHEMA.bytea8,
    FINALFUNC=MADLIB_SCHEMA.hetero_linregr_final,
    m4_ifdef(`__POSTGRESQL__', `', `prefunc=MADLIB_SCHEMA.hetero_linregr_merge_states,')
    INITCOND=''
);
---------------------------------------------------------------------------

/**
 * @brief Predict the boolean value of a dependent variable for a specific
 *        independent variable value in a linear regression model
 *
 * @param coef Coefficients obtained by running linear regression.
 * @param col_ind Independent variable array
 * @returns DOUBLE PRECISION Predicted value
 *
 * T

### Learn More

More examples are available here: [Apache MADlib Community Artifacts](https://github.com/apache/madlib-site/tree/asf-site/community-artifacts)