# Linear Regression

In this example, we will build a linear regressor
and take a look at the line it produces.

We will be using a toy dataset from scikit-learn called "Diabetes",
whose target is a measure of the progression of diabetes in a patient.
The exact details of the dataset are not particularly important for us;
we just need some sample data to graph.

For another example of linear regression using this same dataset,
see scikit-learn's [Linear Regression Example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html).

Let's start by loading the dataset.

In [None]:
import math

import matplotlib.pyplot
import sklearn.datasets
import sklearn.linear_model

# Load a dataset.
# The exact details of this dataset don't matter; we are just using it as a source of sample data points.
dataset = sklearn.datasets.load_diabetes(as_frame = True)

# Get the features and targets/labels.
features = dataset['data']
targets = dataset['target']

# We only care about a single feature, BMI (so we can graph it).
# Note that the data has been normalized, so the values may not look like what you would expect from "BMI".
# But this is no problem for us, since we are just interested in the data for graphing.
features = features['bmi']

# Grab a subset of the data points so we can see them more easily on the graph.
# (We happen to know that the last 20 look pretty good for graphing.)
features = features[-20:]
targets = targets[-20:]

# sklearn tends to prefer data as numpy ndarrays, with one row per data point.
# The reshape "stacks" our flat, 1-dimensional list of values
# into a vertical list (many rows, 1 column).
features = features.to_numpy().reshape(-1, 1)
targets = targets.to_numpy().reshape(-1, 1)

print(features[0:10])
print(targets[0:10])
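
To make the reshape step concrete, here is a minimal sketch on a small made-up array (the values and the names `flat` and `column` are illustrative assumptions, not part of the Diabetes data). It shows how `reshape(-1, 1)` turns a flat array into a single column, which is the (rows, columns) layout that scikit-learn expects for features.

```python
import numpy as np

# A hypothetical 1-D array of four feature values (not taken from the Diabetes data).
flat = np.array([0.1, 0.2, 0.3, 0.4])

# reshape(-1, 1) asks for exactly 1 column and lets numpy infer the number of rows.
column = flat.reshape(-1, 1)

print(flat.shape)    # (4,)
print(column.shape)  # (4, 1)
```

The values themselves are unchanged; only the shape of the array is different.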

Now that we have some data, let's just plot it.

In [None]:
# The 'k.' tells matplotlib to use a "black dot" when plotting.
matplotlib.pyplot.plot(features, targets, 'k.')
matplotlib.pyplot.show()

We can see a slight upward trend,
but it will be easier to see once we have fit a line to the data.

In [None]:
# Train a linear regressor on the data.
regressor = sklearn.linear_model.LinearRegression()
regressor.fit(features, targets)

# Have the regressor predict back the line it fit.
# Making a prediction for each data point and graphing the results essentially gives us back the line.
predictions = regressor.predict(features)

# Plot the actual data points.
# The 'k.' tells matplotlib to use a "black dot" when plotting.
matplotlib.pyplot.plot(features, targets, 'k.')

# Plot the predictions.
# The 'g-' tells matplotlib to use a "green line" when plotting.
matplotlib.pyplot.plot(features, predictions, 'g-')

matplotlib.pyplot.show()

Looks pretty cool!
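
The fitted line itself is described by a slope and an intercept, which scikit-learn exposes as the `coef_` and `intercept_` attributes of a fitted `LinearRegression`. The sketch below uses made-up points that lie exactly on y = 2x + 1 (the data and the names `toy_x`, `toy_y`, and `toy_model` are hypothetical), so we know what values to expect back.

```python
import numpy as np
import sklearn.linear_model

# Hypothetical points that lie exactly on the line y = 2x + 1.
toy_x = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
toy_y = 2.0 * toy_x + 1.0

toy_model = sklearn.linear_model.LinearRegression()
toy_model.fit(toy_x, toy_y)

# Because the targets are 2-D (many rows, 1 column),
# coef_ has shape (1, 1) and intercept_ has shape (1,).
slope = toy_model.coef_[0][0]
intercept = toy_model.intercept_[0]

print("slope: %.2f, intercept: %.2f" % (slope, intercept))
```

Reading the same two attributes from `regressor` should give the slope and intercept of the green line plotted above.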

But how well does the newly fitted line actually fit the data?

We have already talked about error and have seen RMSE.
Let's take a look at (and graph!) the error for this regression.
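
As a reminder, one common way to write RMSE is given here. The symbols are our own shorthand, not something defined by the dataset: $n$ is the number of data points, $y_i$ is the true target for the $i$-th point, and $\hat{y}_i$ is the prediction for that point.

$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

Each individual error $y_i - \hat{y}_i$ is the vertical distance between a data point and the fitted line.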

In [None]:
# Accumulate the sum of squared errors; it is turned into the RMSE after the loop.
rmse = 0.0

for i in range(len(features)):
    feature = features[i]
    target = targets[i]
    prediction = predictions[i]
    
    error = target - prediction
    rmse += (error ** 2)
    
    # Draw a red line ('r-') between the true target (located at the point (feature, target))
    # and the prediction (located at the point (feature, prediction)).
    matplotlib.pyplot.plot([feature, feature], [target, prediction], 'r-')

# Put the real targets and predictions on the plot again.
matplotlib.pyplot.plot(features, targets, 'k.')
matplotlib.pyplot.plot(features, predictions, 'g-')

matplotlib.pyplot.show()

# The element-wise subtraction (`target - prediction`) makes rmse a 1-element vector (1-dimensional numpy array),
# so we need to pull out the single value it holds.
rmse = rmse[0]

rmse = math.sqrt(rmse / len(features))
print("RMSE: %3.2f" % (rmse))

Now we not only have the numerical RMSE, but we can also see how far each prediction is from its true target.
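
The loop above is handy for drawing each error, but if we only want the number, numpy can compute it in one expression. Here is a minimal sketch on made-up values (the names `sample_targets`, `sample_predictions`, and `sample_rmse` and their contents are illustrative assumptions), shaped `(n, 1)` like the arrays used in this example.

```python
import numpy as np

# Hypothetical targets and predictions, each shaped (4, 1).
sample_targets = np.array([3.0, 5.0, 7.0, 9.0]).reshape(-1, 1)
sample_predictions = np.array([2.0, 5.0, 8.0, 11.0]).reshape(-1, 1)

# Element-wise errors: 1, 0, -1, -2.
errors = sample_targets - sample_predictions

# Square each error, take the mean, then the square root: sqrt((1 + 0 + 1 + 4) / 4).
sample_rmse = float(np.sqrt(np.mean(errors ** 2)))

print("RMSE: %3.2f" % (sample_rmse))  # RMSE: 1.22
```

Swapping in `targets` and `predictions` from this example should give the same RMSE that the loop printed.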