# Linear Regression - Vertica Direct

This example demonstrates Vertica's Linear Regression algorithm by querying the Vertica database directly.

Old Faithful is a geyser that sits in Yellowstone National Park. Using Linear Regression we want to train a model that can predict how long an eruption will last based on the waiting time between eruptions.

## Vertica Setup

First we need to set everything up. In this instance we will be using Vertica's Python driver to make direct SQL queries to Vertica.

In [None]:
import vertica_python
import math
from prettytable import from_db_cursor

conn_info = {'host': 'vertica-demo',
             'port': 5433,
             'user': 'dbadmin',
             'password': '',
             'database': 'demo'}

## Import Data

Our Faithful dataset has been randomly split into two sets: one for training the model and one for testing it. Both sets are stored in local .csv files, so let's open them and copy each one into its respective Vertica table, "faithful_training" and "faithful_testing."\
Normally when performing training and testing in ML, we start with one full dataset and use a function that randomly splits it. In our case, however, we want the datasets to be the same across Vertica examples so that training and results stay consistent.
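
For reference, a random split of that kind could look roughly like the sketch below. It is not how the Faithful files were produced: the helper name `train_test_split`, the 30% test fraction, the seed, and the sample rows are all assumptions made up for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(rows)           # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Made-up (id, eruptions, waiting) rows, not taken from the real dataset.
rows = [(i, 1.5 + 0.1 * i, 50 + i) for i in range(1, 11)]

training, testing = train_test_split(rows, test_fraction=0.3)
print(len(training), "training rows,", len(testing), "testing rows")
```

Fixing the seed gives the same benefit as shipping pre-split files: every run sees the same training and testing rows.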

In [None]:
with vertica_python.connect(**conn_info) as conn:

    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS faithful_training; CREATE TABLE faithful_training (id int, eruptions float, waiting int);")
    
    with open("/project/data/old_faithful/faithful_training.csv", "rb") as fs:
        cur.copy("COPY faithful_training FROM STDIN DELIMITER ',' ENCLOSED BY '\"' SKIP 1", fs, buffer_size=65536)

    cur.execute("DROP TABLE IF EXISTS faithful_testing; CREATE TABLE faithful_testing (id int, eruptions float, waiting int);")
    
    with open("/project/data/old_faithful/faithful_testing.csv", "rb") as fs:
        cur.copy("COPY faithful_testing FROM STDIN DELIMITER ',' ENCLOSED BY '\"' SKIP 1", fs, buffer_size=65536)

Now let's query Vertica to see what the data looks like by running a SELECT on faithful_training.

In [None]:
with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    cur.execute("SELECT * FROM faithful_training LIMIT 20;")

    print("Data of the Old Faithful geyser in Yellowstone National Park.")
    print("eruptions = duration of eruption \nwaiting = time between eruptions")
    print(from_db_cursor(cur))

## Train Model

Linear Regression analyzes the relationship between independent and dependent variables using a line of best fit. The dependent variable (eruptions) is what we are trying to predict, whereas the independent variables are the features we use to build our model. In this case we have just the one variable, "waiting", and it makes up our features.
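
To make "line of best fit" concrete, here is a minimal sketch that fits a line to a single feature using the closed-form least-squares formulas. This is a different technique from the BFGS optimizer that the Vertica call in this notebook requests, and the helper name `fit_line` and the data points are invented for illustration only.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up points that lie exactly on a line, not real Old Faithful data.
waiting = [50, 60, 70, 80, 90]
eruptions = [2.0, 2.5, 3.0, 3.5, 4.0]

intercept, slope = fit_line(waiting, eruptions)
print("eruptions ~ {:.3f} + {:.3f} * waiting".format(intercept, slope))
```

With one feature, the model is just these two numbers: an intercept and a coefficient for "waiting".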

In this SELECT command we build our model. We will call the model "linear_reg_faithful" and train it with the dataset from "faithful_training." We want to predict the "eruptions" column from our features, which consist of "waiting."

In [None]:
with vertica_python.connect(**conn_info) as conn:

    cur = conn.cursor()

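    # Drop any earlier copy of the model so this cell can be re-run.
    cur.execute("DROP MODEL IF EXISTS linear_reg_faithful;")

    cur.execute("SELECT LINEAR_REG('linear_reg_faithful', 'faithful_training',\
    'eruptions', 'waiting' USING PARAMETERS optimizer='BFGS');")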

    print(cur.fetchall())

## Test Model

Now that our Regression Model has been built, it's time to see its predictions. To do this we will create a faithful_predictions table that includes a new column called "pred" holding our predicted values. We will then lay this new column against our testing set, matched by the ID of the eruption.

In [None]:
with vertica_python.connect(**conn_info) as conn:

    cur = conn.cursor()

    cur.execute("DROP TABLE IF EXISTS faithful_predictions; CREATE TABLE \
    faithful_predictions AS (SELECT id, eruptions, PREDICT_LINEAR_REG(waiting \
    USING PARAMETERS model_name='linear_reg_faithful') AS pred FROM faithful_testing);")

    cur.execute("SELECT id, pred FROM faithful_predictions LIMIT 20;")

    print(from_db_cursor(cur))


## Results

Our Linear Regression model has been created and we've displayed its predictions. Now let's SELECT all of the columns from faithful_predictions, which pairs each prediction with the actual value from faithful_testing, to see how close we really were. Afterwards, let's fetch the Mean Squared Error and R Squared so we can see how our predictions fared.

In [None]:
with vertica_python.connect(**conn_info) as conn:

    cur = conn.cursor()

    cur.execute("SELECT * FROM faithful_predictions ORDER BY id LIMIT 20;")

    print(from_db_cursor(cur))

    cur.execute("SELECT MSE (eruptions::float, pred::float) OVER() FROM (SELECT\
    eruptions, pred FROM faithful_predictions) AS prediction_output;")

    print(from_db_cursor(cur))

    cur.execute("SELECT RSQUARED (eruptions::float, pred::float) OVER() FROM (\
    SELECT eruptions, pred FROM faithful_predictions) AS prediction_output;")

    print(from_db_cursor(cur))



So we've arrived at our Mean Squared Error (MSE) and R Squared values. Since MSE is expressed in squared units, however, let's take its square root to get our Root Mean Squared Error (RMSE). This value gives us something in the same units as the eruption durations.

In [None]:
with vertica_python.connect(**conn_info) as conn:

    cur = conn.cursor()
    cur.execute("SELECT MSE (eruptions::float, pred::float) OVER() FROM (SELECT\
    eruptions, pred FROM faithful_predictions) AS prediction_output;")

    mse = cur.fetchall()[0][0]
    print("RMSE: " + str(math.sqrt(mse)))

**R Squared** quantifies how much of the variation in the dependent variable our model explains. \
It is usually read as a percentage, with 100% meaning the predictions match the observed values exactly.

**RMSE** is the typical deviation of the observed values of the dependent variable from the regression line. \
As such, a value closer to 0 means there is less deviation and therefore less error. Given our units (minutes), an RMSE under 0.5 means the model's predictions are typically off by less than about half a minute.
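
As a rough illustration of how these numbers relate, the sketch below computes MSE, RMSE, and R Squared in plain Python using the usual textbook definitions. Whether Vertica's MSE and RSQUARED functions follow exactly the same conventions is an assumption here, and the helper name `regression_metrics` and the values are made up rather than taken from the notebook's results.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE and R Squared for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation of the actual values around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r_squared": 1 - ss_res / ss_tot}

# Made-up eruption durations in minutes, not real results.
actual = [2.0, 3.0, 4.0, 5.0]
predicted = [2.5, 2.5, 4.5, 4.5]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Here every prediction is off by half a minute, so the RMSE comes out in minutes while the MSE is in squared minutes.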