# Linear Regression (Documentation Example)

The documentation example is available here: https://spark.apache.org/docs/latest/ml-classification-regression.html. 

Objective: Work through the documentation example first. This lets us read the documentation, understand it, and then apply it.

This dataset is quite unrealistic, but it is useful for understanding some of the basic elements of Spark's MLlib library. More relevant datasets are used in the advanced linear regression exercise.

In [10]:
# Must be included at the beginning of each new notebook. Remember to change the app name.
import findspark
findspark.init('/home/ubuntu/spark-3.2.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('linear_regression_docs').getOrCreate()

# If you're getting an error with numpy, please type 'sudo pip install numpy --user' into the EC2 console.
from pyspark.ml.regression import LinearRegression

In [11]:
# Load model training data. Location of the data may be different.
training = spark.read.format("libsvm").load("Datasets/sample_linear_regression_data.txt")

The libsvm format might be new to you. It's not used often, and may not be relevant to your dataset. However, it's used extensively throughout the Spark documentation. Let's see what the training data looks like:

In [12]:
# Visualise the training data format.
training.show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
| -9.490009878824548|(10,[0,1,2,3,4,5,...|
| 0.2577820163584905|(10,[0,1,2,3,4,5,...|
| -4.438869807456516|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -7.966593841555266|(10,[0,1,2,3,4,5,...|
| -7.896274316726144|(10,[0,1,2,3,4,5,...|
| -8.464803554195287|(10,[0,1,2,3,4,5,...|
| 2.1214592666251364|(10,[0,1,2,3,4,5,...|
| 1.0720117616524107|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -5.082010756207233|(10,[0,1,2,3,4,5,...|
|  7.887786536531237|(10,[0,1,2,3,4,5,...|
| 14.323146365332388|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-0.8995693247765151|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|  5.601801561245534|(10,[0,1,2,3,4,5,...|
|-3.2256352187273354|(10,[0,1,2,3,4,5,...|
| 1.5299675726687754|(10,[0,1,2,3,4,5,...|
| -0.250102447941961|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

This is the format that Spark needs to run a machine learning algorithm: one column named "label" and another named "features". The label represents the output/answer/target (for example, house value), while the features represent the inputs (the predictors).

The "label" column needs to hold a numerical value: either the value to predict in a regression, or a number that maps to a class in a classification problem.

The "features" column holds a vector of all the features that belong to that row. Usually we end up combining the various feature columns we have into a single 'features' column using the data transformations from the previous lab.
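
To make the `features` display above more concrete, here is a minimal pure-Python sketch of how one libsvm line maps to the sparse `(size, [indices], [values])` form. The function name `parse_libsvm_line` and the sample line are made up for illustration. This is not Spark's actual loader, only the idea behind it. It assumes the usual libsvm convention of one-based feature indices in the file, which Spark shifts to zero-based indices when loading.

```python
def parse_libsvm_line(line, num_features):
    """Parse one libsvm line into (label, (size, indices, values))."""
    parts = line.split()
    label = float(parts[0])  # the first token is the label
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # one-based in the file, zero-based in the vector
        values.append(float(value))
    return label, (num_features, indices, values)


# Hypothetical line: a label followed by index:value pairs for the non-zero features.
label, features = parse_libsvm_line("-9.49 1:0.45 3:-0.61 10:0.23", num_features=10)
print(label)     # -9.49
print(features)  # (10, [0, 2, 9], [0.45, -0.61, 0.23])
```

Features that are not listed on a line are treated as zero, which is why the format is convenient for sparse data.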

In [13]:
# These are the default values:
# featuresCol: What is the features column named? 
# labelCol: What is the label column named?
# predictionCol: What should the column holding the model's predictions be named?
lr = LinearRegression(featuresCol='features', labelCol='label', predictionCol='prediction')

In [14]:
# Fit/train the model. Fit the model onto the training data.
lrModel = lr.fit(training)

In [15]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {}".format(str(lrModel.coefficients))) # One coefficient per feature.
print('\n')
print("Intercept:{}".format(str(lrModel.intercept)))

Coefficients: [0.0073350710225801715,0.8313757584337543,-0.8095307954684084,2.441191686884721,0.5191713795290003,1.1534591903547016,-0.2989124112808717,-0.5128514186201779,-0.619712827067017,0.6956151804322931]


Intercept:0.14228558260358093
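
As a rough illustration of what these numbers mean, a linear regression prediction is the intercept plus the sum of each coefficient multiplied by its feature value. The sketch below uses small made-up coefficients and a made-up row (not the values printed above) so the arithmetic is easy to follow. The names `predict`, `row` and `true_label` are hypothetical.

```python
# Hypothetical model parameters, chosen so the arithmetic is easy to check by hand.
coefficients = [0.5, -1.0, 2.0]
intercept = 0.25


def predict(features):
    """Return intercept + sum(coefficient * feature) for one row."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


row = [2.0, 1.0, 0.5]   # hypothetical feature values
true_label = 3.0        # hypothetical label for that row

prediction = predict(row)           # 0.25 + (1.0 - 1.0 + 1.0)
residual = true_label - prediction  # residual = label - prediction

print(prediction)  # 1.25
print(residual)    # 1.75
```

A residual is the gap between the true label and the prediction for one row, which is the quantity the metrics later in this notebook summarise.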


You can use the summary attribute to get even more information.

In [16]:
# Summarize the model over the training set and print out some metrics.
trainingSummary = lrModel.summary

The summary holds a lot of information. Here are a few examples:

In [17]:
trainingSummary.residuals.show()

# Print Root Mean Squared Error. 
print("RMSE: {}".format(trainingSummary.rootMeanSquaredError))

# Print R-Squared.
print("r2: {}".format(trainingSummary.r2))

+-------------------+
|          residuals|
+-------------------+
|-11.011130022096554|
| 0.9236590911176538|
|-4.5957401897776675|
|  -20.4201774575836|
|-10.339160314788181|
|-5.9552091439610555|
|-10.726906349283922|
|  2.122807193191233|
|  4.077122222293811|
|-17.316168071241652|
| -4.593044343959059|
|  6.380476690746936|
| 11.320566035059846|
|-20.721971774534094|
| -2.736692773777401|
| -16.66886934252847|
|  8.242186378876315|
|-1.3723486332690233|
|-0.7060332131264666|
|-1.1591135969994064|
+-------------------+
only showing top 20 rows

RMSE: 10.16309157133015
r2: 0.027839179518600154


## Train/Test Splits

By following the Spark documentation, we've actually missed a fundamental step in our nine-step process! We never split our data into training and testing sets. Instead, we trained the model on all of our data, which you know by now is not a good idea.

Luckily, Spark DataFrames have a convenient method for splitting the data. Let's see it:

In [18]:
# Reload the full dataset. Adjust the path if your data is stored somewhere else.
all_data = spark.read.format("libsvm").load("Datasets/sample_linear_regression_data.txt")

In [19]:
# Pass in the split between training/test as a list of weights.
# The right split depends on your test design, but 70/30 or 60/40 splits are common,
# depending on how much data you have and how unbalanced it is.
train_data,test_data = all_data.randomSplit([0.7,0.3])
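
To give a feel for what a weighted random split does, here is a simplified pure-Python sketch. It is an illustration of the general idea under stated assumptions, not Spark's implementation: each row gets a random draw between 0 and 1, and the draw decides which split the row lands in. The function `random_split` and its `seed` handling are hypothetical stand-ins.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Turn the weights into cumulative boundaries, e.g. [0.7, 0.3] -> [0.7, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw not claimed by an earlier one.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(501))  # stand-in for the 501 rows in this dataset
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))  # roughly 70/30, but not exactly
```

Because the assignment is random, the split sizes are only approximately 70/30 and change from run to run. In PySpark, `randomSplit` also accepts an optional `seed` argument, for example `all_data.randomSplit([0.7, 0.3], seed=42)`, which helps make a split repeatable.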

In [20]:
# Let's check out our training data.
train_data.show()

# Let's check out the count (348 in this run; the split is random, so your count may differ).
train_data.describe().show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|-28.571478869743427|(10,[0,1,2,3,4,5,...|
|-26.805483428483072|(10,[0,1,2,3,4,5,...|
|-26.736207182601724|(10,[0,1,2,3,4,5,...|
| -23.51088409032297|(10,[0,1,2,3,4,5,...|
|-22.949825936196074|(10,[0,1,2,3,4,5,...|
|-22.837460416919342|(10,[0,1,2,3,4,5,...|
|-20.212077258958672|(10,[0,1,2,3,4,5,...|
|-20.057482615789212|(10,[0,1,2,3,4,5,...|
|-19.884560774273424|(10,[0,1,2,3,4,5,...|
|-19.872991038068406|(10,[0,1,2,3,4,5,...|
|-19.782762789614537|(10,[0,1,2,3,4,5,...|
| -19.66731861537172|(10,[0,1,2,3,4,5,...|
|-18.845922472898582|(10,[0,1,2,3,4,5,...|
| -18.27521356600463|(10,[0,1,2,3,4,5,...|
|-17.494200356883344|(10,[0,1,2,3,4,5,...|
|-17.428674570939506|(10,[0,1,2,3,4,5,...|
| -17.32672073267595|(10,[0,1,2,3,4,5,...|
|-17.065399625876015|(10,[0,1,2,3,4,5,...|
|-17.026492264209548|(10,[0,1,2,3,4,5,...|
| -16.26143027545273|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

In [21]:
# And our test data. 
test_data.show()

# Let's check out the count (153 in this run, approximately a 70/30 split).
test_data.describe().show()

+-------------------+--------------------+
|              label|            features|
+-------------------+--------------------+
|-28.046018037776633|(10,[0,1,2,3,4,5,...|
|-23.487440120936512|(10,[0,1,2,3,4,5,...|
|-21.432387764165806|(10,[0,1,2,3,4,5,...|
|-19.402336030214553|(10,[0,1,2,3,4,5,...|
| -19.16829262296376|(10,[0,1,2,3,4,5,...|
|-17.803626188664516|(10,[0,1,2,3,4,5,...|
| -16.71909683360509|(10,[0,1,2,3,4,5,...|
|-16.692207021311106|(10,[0,1,2,3,4,5,...|
| -16.08565904102149|(10,[0,1,2,3,4,5,...|
|-15.375857723312297|(10,[0,1,2,3,4,5,...|
|-15.359544879832677|(10,[0,1,2,3,4,5,...|
|-13.772441561702871|(10,[0,1,2,3,4,5,...|
| -13.15333560636553|(10,[0,1,2,3,4,5,...|
|-13.039928064104615|(10,[0,1,2,3,4,5,...|
|-12.977848725392104|(10,[0,1,2,3,4,5,...|
|-12.773226999251197|(10,[0,1,2,3,4,5,...|
|-12.558575788856189|(10,[0,1,2,3,4,5,...|
|-12.094351278535258|(10,[0,1,2,3,4,5,...|
|-11.640549677888826|(10,[0,1,2,3,4,5,...|
| -11.43180236554046|(10,[0,1,2,3,4,5,...|
+-------------------+--------------------+

In [22]:
# Now we train only on train_data.
correct_model = lr.fit(train_data)

In [23]:
# The evaluate method returns a summary object for the test data directly.
test_results = correct_model.evaluate(test_data)

In [24]:
# And generate some basic evaluation metrics. 
test_results.residuals.show()
print("RMSE: {}".format(test_results.rootMeanSquaredError))

+-------------------+
|          residuals|
+-------------------+
|-26.807162722626384|
| -21.59371356869086|
|  -21.9464814828004|
|-20.784412270368783|
|-17.815241924516652|
|-16.627443595267554|
| -17.05982262760015|
| -17.94762687581216|
|-15.654922779606231|
|-15.349049359554938|
|-15.695588171296139|
| -16.38920332633152|
|-12.990305637371522|
|-12.854812920968358|
|-14.716372011315832|
|-10.767144737423443|
|-13.030990836368117|
|-11.164342817225434|
| -8.477199122731658|
|-13.573983042656442|
+-------------------+
only showing top 20 rows

RMSE: 10.504373377842208


## Linear Regression Evaluation Metrics

Let's check out a few different evaluation metrics applicable to regression models:

Mean Absolute Error (MAE) - The average of the absolute differences between the true and predicted values. For example, if you're predicting house prices, MAE lets you say 'on average, the model is off by $100,000 per house'.

Mean Squared Error (MSE) - Instead of taking the absolute value of each error, the errors are squared. A large difference between the predicted and true value becomes even more prominent once it is squared, which makes it easier to notice when your model is off by a large margin. The drawback of MSE is that the units change: instead of an error in dollars, the error is in dollars squared.

Root Mean Squared Error (RMSE) - So what do you do? Take the square root of MSE to convert it back into the original unit. This keeps the extra weight on large errors while maintaining the appropriate unit. Because of this, RMSE is a very popular metric.

R2 (aka Coefficient of Determination) - This measures the proportion of the variance in the label that your model can explain. It is usually between 0 and 100%, though it can be negative on held-out data when the model does worse than always predicting the mean label. It does not tell the entire story: it's less an error metric and more a statistical measure of how well your model fits.
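
The four metrics above can be worked through by hand on a tiny example. The sketch below is plain Python with made-up labels and predictions (not values from this notebook), intended only to show how each number is computed.

```python
import math

# Hypothetical true values and model predictions.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.0, 5.0, 8.0, 12.0]

n = len(labels)
errors = [y - p for y, p in zip(labels, predictions)]  # [1.0, 0.0, -1.0, -3.0]

mae = sum(abs(e) for e in errors) / n  # average absolute error
mse = sum(e ** 2 for e in errors) / n  # average squared error
rmse = math.sqrt(mse)                  # back in the original unit

# R2 compares the model's squared error with that of always predicting the mean label.
mean_label = sum(labels) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((y - mean_label) ** 2 for y in labels)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)    # 1.25
print("MSE:", mse)    # 2.75
print("RMSE:", rmse)  # about 1.658
print("r2:", r2)      # about 0.45
```

Notice how the single error of 3 dominates MSE and RMSE far more than it does MAE, which is the "large errors stand out" effect described above.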