# Linear Regression - VerticaPy

This example demonstrates Vertica's Linear Regression algorithm, run inside the Vertica database through VerticaPy.

Old Faithful is a geyser in Yellowstone National Park. Using Linear Regression, we want to train a model that predicts how long an eruption will last based on the waiting time between eruptions.

## Vertica Setup

Let's get our necessary tools installed and set up our connection with Vertica. To do this we will install VerticaPy, Vertica's Python API for Data Science. This toolkit allows a user to work with Vertica's in-database Machine Learning functions without requiring much (if any) use of SQL.

In [None]:
pip install verticapy

Now we can fill out the information VerticaPy needs in order to establish a connection.

In [None]:
import verticapy as vp

vp.new_connection({"host": "vertica", 
                   "port": "5433", 
                   "database": "dbadmin", 
                   "password": "", 
                   "user": "dbadmin"},
                   name = "MyVerticaConnection")

vp.connect("MyVerticaConnection")

## Import Data

Our Faithful dataset has been randomly split into two sets: one for training the model and one for testing it. Each set is stored in a local .csv file, so let's read them in and write each one to Vertica in its respective table, "faithful_training" and "faithful_testing".\
Normally when performing training and testing in ML, we start with one full dataset and use a function that randomly splits it. In our case, however, we want the datasets to be the same across Vertica examples so that training and results stay consistent.
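
For reference, a random split can be sketched in a few lines of plain Python. This is only an illustration of the idea and is not how the notebook's CSV files were produced; the `train_test_split` helper, the 25% test fraction, the seed, and the sample rows are all assumptions made up for this sketch.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative (eruptions, waiting) rows, not the notebook's data
rows = [(3.5, 78), (1.9, 53), (3.4, 75), (2.2, 61),
        (4.4, 84), (2.8, 56), (4.6, 87), (3.7, 83)]

train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "testing rows")
```

Fixing the seed is what makes such a split reproducible, which serves the same goal as shipping pre-split files.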

In [None]:
vp.drop(name="public.faithful_training")

df_training = vp.read_csv("/spark-connector/examples/jupyter/data/faithful_training.csv",  
                            table_name = "faithful_training",
                            schema = "public",
                            quotechar = '"',
                            sep = ",",
                            na_rep = "")

vp.drop(name="public.faithful_testing")

df_testing = vp.read_csv("/spark-connector/examples/jupyter/data/faithful_testing.csv",  
                            table_name = "faithful_testing",
                            schema = "public",
                            quotechar = '"',
                            sep = ",",
                            na_rep = "")

Now let's query Vertica to see what the data looks like by running a SELECT on faithful_training.

In [None]:
print("Data of the Old Faithful geyser in Yellowstone National Park.")
print("eruptions = duration of eruption \nwaiting = time between eruptions")

df_training

## Train Model

Linear Regression models the relationship between one or more independent variables and a dependent variable using a line of best fit. The dependent variable (eruptions) is what we are trying to predict, whereas the independent variables are the features we use to build our model. In this case we have just one feature, "waiting".

In [None]:
from verticapy.learn.linear_model import LinearRegression

model = LinearRegression("LR_faithful")

model.fit(df_training, ["waiting"], "eruptions")
model.plot()

A Linear Regression model fits well if our data points lie close to the line of best fit. This shows that there is a strong linear relationship between our training columns.
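
To make the "line of best fit" concrete, here is a minimal sketch of ordinary least squares for a single feature in plain Python. It is not Vertica's implementation, only the textbook closed-form solution; the `fit_line` helper and the sample values are assumptions chosen for illustration.

```python
def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative values only, not the notebook's data
waiting = [50, 60, 70, 80, 90]
eruptions = [2.0, 2.5, 3.5, 4.0, 4.5]

slope, intercept = fit_line(waiting, eruptions)
print(f"eruptions ~ {slope:.3f} * waiting + {intercept:.3f}")
```

The slope says how much longer an eruption is expected to last for each additional minute of waiting, and the intercept anchors the line so that it passes through the point of means.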

## Test Model

Now that our regression model has been built, it's time to see its predictions. To do this we will use the predict method, which adds a "predictions" column alongside the rows of our faithful_testing set.

In [None]:
model.predict(df_testing, name = "predictions")

## Results

Our Linear Regression model has been created and we've displayed its predictions. Now let's see some statistics regarding our model to see how it held up.

In [None]:
model.report()

**R Squared** quantifies how much of the variance in the dependent variable is explained by the model. \
It is often expressed as a percentage, with 100% meaning every data point lies exactly on the regression line.

**RMSE** (root mean squared error) is the square root of the average squared deviation of the observed values from the regression line. \
As such, a value closer to 0 means there is less deviation and therefore less error. Given our units (minutes), an RMSE under 0.5 suggests the model can predict eruption durations reasonably accurately.
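
As a rough sketch of what these two statistics measure, they can be computed by hand from observed and predicted values. The helper names `r_squared` and `rmse` and the sample numbers below are assumptions for illustration; the values in the report above come from Vertica, not from this code.

```python
import math

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared difference, in the units of y."""
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Illustrative observed and predicted eruption durations (minutes)
observed = [2.0, 2.5, 3.5, 4.0, 4.5]
predicted = [2.0, 2.65, 3.3, 3.95, 4.6]

print("R squared:", round(r_squared(observed, predicted), 3))
print("RMSE:", round(rmse(observed, predicted), 3))
```

Because RMSE is expressed in the same units as the target, it can be read directly as a typical prediction error in minutes.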