## Introduction to Machine Learning with Linear Regression

**Author**: Robert Hryniewicz from Hortonworks
See the original [here](https://hortonworks.com/tutorial/intro-to-machine-learning-with-apache-spark-and-apache-zeppelin/)

**Ported from Scala to PySpark:** David Tilson

In this lab we'll cover the basics of building a Linear Regression model using the Apache Spark ML Pipeline API.

- Starting from a simple two-dimensional array
- Using the Pipeline API to create a vectorised version of the features and build the model
- Using the Pipeline API to calculate predictions
- Saving and loading back the model
- Some simple plotting

A **model** is a **mathematical formula** with a number of parameters that need to be learned from the data. **Fitting a model to the data** is a process known as **model training**.

Take, for instance, linear regression with one feature (variable), where the goal is to fit a line (described by the well-known equation `y = ax + b`) to a set of scattered data points.
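
One common way to make "fit" precise is ordinary least squares: choose `a` and `b` so that the sum of squared vertical distances between the line and the data points is as small as possible. The notation below (`n` data points `(x_i, y_i)`) is introduced here for illustration only, and it assumes no regularization is applied:

$$
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2
$$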

For example, assume that once model training is complete we get a model equation `y = 2x + 5`. Then for a set of inputs `[1, 0, 7, 2, …]` we would get a set of outputs `[7, 5, 19, 9, …]`. That's it!
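
The arithmetic above can be checked in a few lines of plain Python. This is only a sketch: the helper name `predict` is hypothetical and is not part of Spark.

```python
# Apply the example model y = 2x + 5 to a handful of inputs.
# `predict` is a hypothetical helper used only for illustration.
def predict(x, a=2, b=5):
    """Return the value of the line y = a*x + b at x."""
    return a * x + b

inputs = [1, 0, 7, 2]
outputs = [predict(x) for x in inputs]
print(outputs)  # [7, 5, 19, 9]
```

Training is the reverse problem: given pairs of inputs and outputs, find the `a` and `b` that best reproduce them.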

In this notebook you will walk step by step through training a one-variable linear regression model with Spark.

We're introducing Machine Learning with **Linear Regression** because it's one of the more basic and **commonly used predictive analytics methods**. It's also easy to explain and to grasp intuitively as you make your way through the examples. We will not cover the details of how the underlying Linear Regression algorithm works; we will focus only on applying the algorithm and generating a model.

#### Create a small data set for training a Linear Regression model

In [0]:
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame([
	(-12.0,  -4.9),
	( -6.0,  -4.5),
	( -7.2,  -4.1),
	( -5.0,  -3.2),
	( -2.0,  -3.0),
	( -3.1,  -2.1),
	( -4.0,  -1.5),
	( -2.2,  -1.2),
	( -2.0,  -0.7),
	( 1.0,   -0.5),
	( -0.7,  -0.2),
	( 1.2,   0.1),
	( 2.2,   0.3),  
	( 6.5,   0.52),
	( 4.2,   0.72),
	( 8.6,   1.1),
	( 9.5,   2.3),
	( 14.52, 3.4),
	( 12.9,  3.61), 
	( 16.3,  3.8)
], ["y", "x"])  

In [0]:
display(data)
# data.show()

y,x
-12.0,-4.9
-6.0,-4.5
-7.2,-4.1
-5.0,-3.2
-2.0,-3.0
-3.1,-2.1
-4.0,-1.5
-2.2,-1.2
-2.0,-0.7
1.0,-0.5


#### Run Linear Regression

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, LinearRegressionModel

# Set Features
features = VectorAssembler()\
    .setInputCols(["x"])\
    .setOutputCol("features")

linreg = LinearRegression().setLabelCol("y")
  
pipeline = Pipeline().setStages([features, linreg])
pipeline_model = pipeline.fit(data)

In [0]:
# We can see that pipeline_model is indeed a pipeline model with several stages
print('pipeline_model is a {}'.format(type(pipeline_model)))
pipeline_model.stages

In [0]:
# We can access the stages of the pipeline using its stages attribute
# Here we access the linear regression model 

linRegModel = pipeline_model.stages[1]
type(linRegModel)

#### Summarize model training

In [0]:
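# Training summary: root mean squared error and R-squared on the training data
print("RMSE: {}".format(linRegModel.summary.rootMeanSquaredError))
print("R2: {}".format(linRegModel.summary.r2))
# coefficients is a vector; with a single feature, the slope is its first element
print("Model: Y = {} * X + {}".format(linRegModel.coefficients[0], linRegModel.intercept))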

linRegModel.summary.residuals.show()
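
As a rough cross-check that does not need Spark, the same quantities can be computed in plain Python with the closed-form least-squares formulas for a single feature. This is a sketch, not how Spark computes them internally: it assumes an unregularized fit with an intercept, and all variable names below are introduced here for illustration. Under those assumptions the numbers should agree closely with the summary printed above.

```python
# Plain-Python cross-check of slope, intercept, RMSE and R2 (no Spark needed).
# Same 20 (y, x) pairs as the DataFrame built earlier; note that y comes first.
pairs = [
    (-12.0, -4.9), (-6.0, -4.5), (-7.2, -4.1), (-5.0, -3.2), (-2.0, -3.0),
    (-3.1, -2.1), (-4.0, -1.5), (-2.2, -1.2), (-2.0, -0.7), (1.0, -0.5),
    (-0.7, -0.2), (1.2, 0.1), (2.2, 0.3), (6.5, 0.52), (4.2, 0.72),
    (8.6, 1.1), (9.5, 2.3), (14.52, 3.4), (12.9, 3.61), (16.3, 3.8),
]
ys = [y for y, _ in pairs]
xs = [x for _, x in pairs]
n = len(pairs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form ordinary least squares for one feature:
# slope = covariance(x, y) / variance(x); intercept follows from the means.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed label minus predicted value.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

sse = sum(r ** 2 for r in residuals)        # sum of squared errors
sst = sum((y - mean_y) ** 2 for y in ys)    # total sum of squares
rmse = (sse / n) ** 0.5                     # root mean squared error
r2 = 1 - sse / sst                          # coefficient of determination

print("slope: {:.4f}, intercept: {:.4f}".format(slope, intercept))
print("RMSE: {:.4f}, R2: {:.4f}".format(rmse, r2))
```

An RMSE close to zero and an R2 close to one both indicate that the line follows the training points closely.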

#### Use the same data to generate predictions

In [0]:
result = pipeline_model.transform(data).select("x", "y", "prediction")
result.show()

#### Save the model

In [0]:
# Save the fitted model (linRegModel), not the unfitted estimator (linreg)
linRegModel.write().overwrite().save("/mnt/my-data/practice/ml/linregmodel")

In [0]:
display(dbutils.fs.ls("/mnt/my-data/practice/ml/linregmodel/metadata"))


path,name,size
dbfs:/mnt/my-data/practice/ml/linregmodel/metadata/_SUCCESS,_SUCCESS,0
dbfs:/mnt/my-data/practice/ml/linregmodel/metadata/part-00000,part-00000,472


In [0]:
print(dbutils.fs.head("/mnt/my-data/practice/ml/linregmodel/metadata/part-00000"))

#### Load back the model

In [0]:
from pyspark.ml.regression import LinearRegressionModel

# Load with the model class so that it matches what was saved
sameModel = LinearRegressionModel.load("/mnt/my-data/practice/ml/linregmodel")

type(sameModel)


In this lab we have looked at Linear Regression, but there are other popular algorithms:

- [Decision trees](https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-trees)
- [Random forest](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests)
- [K-Means Clustering](https://spark.apache.org/docs/latest/ml-clustering.html#k-means)