## Predictive Analytics using Simple Linear Regression on Azure Machine Learning

In this article we will look at how to model the relationship between two variables by fitting a linear equation to the observed data. 

A detailed discussion about the concepts of linear regression is out of the scope of this article, but I will try to refresh some of the concepts related to our discussion along the way.

To begin with, let's recall that a linear regression line has an equation of the form <b>Y = a + bX</b>, where <b>X</b> is the explanatory variable and <b>Y</b> is the dependent variable. The slope of the line is <b>b</b>, and <b>a</b> is the intercept (the value of <b>Y</b> when <b>X = 0</b>).

### Linear Regression on Azure Machine Learning

Azure Machine Learning provides many different types of regression modules to create a model for predicting a numeric outcome in an experiment. The <b>Linear Regression</b> module is a good choice when you want a very simple model involving a single independent variable and a dependent variable. This is called <b>simple linear regression</b>.

<b>Note:</b> Besides the classic regression problem involving a single independent variable and a dependent variable, the <b>Linear Regression</b> module can also solve problems in which multiple inputs are used to predict a single numeric outcome - known as <b>multiple linear regression</b> (sometimes also called <b>multivariate linear regression</b>). We will look at these problems in a future post.

So, let's jump right in and attempt to solve a simple linear regression problem with Azure Machine Learning. 

### Predicting brain weight from body weight of mammals

We have a dataset containing two variables - the average brain weight and the average body weight for 62 species of mammals. We want to predict the brain weight of mammals from body weight, based on the given sample of data.

##### Create a Dataset on Azure Machine Learning Workspace

As a first step, we need to upload our dataset to Azure Machine Learning Studio to create a Dataset in our workspace. The dataset is currently available as a CSV file on my hard drive. Therefore, I will choose <b>+NEW</b>, and upload my mammals.csv file using the <b>From Local File</b> option under the <b>DATASET</b> menu.

![Upload mammals dataset](images/upload-mammals-dataset.png "Upload mammals dataset")

##### Create an Experiment and Visualize the Data 

Let's create a new experiment by clicking <b>+NEW</b> at the bottom of the Azure Machine Learning page and selecting the <b>Blank Experiment</b> option under the <b>EXPERIMENT</b> menu.

The previously uploaded mammals dataset is now available under <b>My Datasets</b>. Once the blank experiment canvas is presented, we will drag and drop the dataset on the experiment canvas.

Before proceeding further, it is necessary to explore the data in order to elicit basic information on the columns in the data. Is there any missing data? Do we have any influential points or outliers? Is it necessary to transform one or more variables in the data?

As a memory refresher, let's recall that data transformations are helpful to ensure that:
- all columns have a distribution that is reasonably well spread out over the whole range of values
- relationships between columns are roughly linear
- the scatter about any relationship is similar across the whole range of values.

Click on the output port of the dataset block, then click <b>Visualize</b> to bring up the data visualization dialog.

![Statistics of mammals data](images/mammals-statistics.png "Statistics of mammals data")

A closer look at the basic statistical information on each column of the data reveals that the response variable (<i>brain</i>) has 6 rows with <b>missing values</b>; however, the explanatory variable (<i>body</i>) has no such problem.

Next, we will use the <b>compare to</b> feature of the data visualization dialog to generate scatter plots and view relationships between the columns in the data. 

![Log transformation](images/log-transformation.png "Log transformation")

We see that the scatter plot of <b>brain ~ body</b> (on the far-right) does not conform to a bivariate normality pattern; rather most of the points on it are clustered together with only a few outliers. In cases like this, we should look at making a transformation of the response (i.e. dependent) variable or the explanatory (i.e. independent) variable or both.

We will use the <b>Snapshot</b> function (the icon next to the <b>compare to</b> dropdown) to copy the <b>brain ~ body</b> scatter plot to the side, and create a new scatter plot by transforming body to log(body) using the checkbox at the bottom of the plot. The <b>brain ~ log(body)</b> scatter plot in the middle does not conform to a bivariate normality pattern either, but shows some improvement.

Finally, we use the <b>Snapshot</b> function again to copy the <b>brain ~ log(body)</b> scatter plot to the side, and create a new scatter plot by applying the log transformation to both the body and the brain data. The <b>log(brain) ~ log(body)</b> scatter plot on the left conforms reasonably well to a bivariate normal distribution. 

##### Prepare the Data: Create a Data Preprocessing Workflow

What did we learn by exploring the data?

1. There are 6 missing values in the response variable (<i>brain</i>) column; we should remove these rows from our data in order to ensure that we are not drawing an inaccurate inference about the data. 
2. Most of the values are concentrated at one end of the range, with a very small number of values lying at the other end of the range. A regression line drawn with such a column as the only explanatory variable will be controlled by those few large values (also called <b>high-leverage points</b>) as these values will have an overly large say in determining the coefficient of that variable. Therefore, we need to do a log-log transformation on our data.

First, let's drag and drop the <b>Clean Missing Data</b> module on our experiment canvas and connect the output port of the <b>Mammals Brain and Body Weights</b> dataset to the input port of the <b>Clean Missing Data</b> module. In the <b>Properties</b> pane on the right, select the <b>Remove entire row</b> option from the <b>Cleaning mode</b> dropdown. 
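To make the effect of removing entire rows concrete, here is a minimal Python sketch of the same idea outside Azure Machine Learning Studio. The rows and the `remove_rows_with_missing` helper are hypothetical and only illustrate the behaviour; this is not how the module is implemented.

```python
# Hypothetical rows in the shape of the mammals data: (species, body, brain).
# None stands in for a missing brain weight; the values are made up.
rows = [
    ("species_a", 1.5, 8.0),
    ("species_b", 60.0, None),
    ("species_c", 250.0, 400.0),
    ("species_d", 0.02, None),
]


def remove_rows_with_missing(rows):
    """Keep only the rows in which every field is present."""
    return [row for row in rows if all(value is not None for value in row)]


cleaned = remove_rows_with_missing(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Dropping rows is a reasonable choice here because the missing values are in the response column, so those rows carry nothing the model could learn from.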

Next, we need to perform log transformation on the two columns in our data. We will use a custom R script to do so. You can find out more on how to extend your experiment by using custom <b>R</b> scripts [here](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-extend-your-experiment-with-r). You can also use Python scripts with Azure Machine Learning. Find out more on executing <b>Python</b> scripts [here](https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-execute-python-scripts).

Let's find the <b>Execute R Script</b> module in the left pane, drag and drop the module on the experiment canvas, and connect the output port of the <b>Clean Missing Data</b> module to the left-most input port of the <b>Execute R Script</b> module. In the <b>Properties</b> pane on the right, we will enter the R code inside the <b>R script</b> editor. This R code will be executed by Azure Machine Learning Studio as part of our data processing workflow. 

The R script will transform the <i>brain</i> and <i>body</i> columns by replacing each value with its logarithm. The complete code for this log transformation R script can be accessed from my [GitHub repository](https://github.com/smdev15/azure-ml/tree/master/notebook/Rscripts/mammals). 
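For readers who prefer Python, the transformation can be sketched as follows. This is not the R script from the repository; the sample rows and the `log_transform` helper are hypothetical, and the natural logarithm is assumed.

```python
import math

# Hypothetical sample rows; the values are made up for illustration.
rows = [
    {"body": 0.5, "brain": 3.0},
    {"body": 10.0, "brain": 100.0},
    {"body": 2500.0, "brain": 4500.0},
]


def log_transform(rows, columns=("body", "brain")):
    """Return copies of the rows with the named columns replaced by their natural logs."""
    transformed = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        for column in columns:
            new_row[column] = math.log(row[column])  # needs a strictly positive value
        transformed.append(new_row)
    return transformed


logged = log_transform(rows)
for row in logged:
    print(row)
```

Note that the logarithm is only defined for positive values, which is another reason to remove rows with missing data before this step.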

![Data preprocessing workflow](images/data-preprocessing.png "Data preprocessing workflow")

Now that our data preprocessing workflow is complete, let's run it by clicking <b>RUN</b> at the bottom of the page.

Once the processing completes, we can right-click on the output port of the <b>Execute R Script</b> module and select <b>Visualize</b> to verify whether the data has been cleaned up and transformed according to our needs for further processing.

![Data preprocessing results](images/data-preprocessing-results.png "Data preprocessing results")

As we see, there are now 56 rows in the dataset; all the rows with missing data have been removed, leaving <b>0 missing values</b> in the <i>brain</i> column. Furthermore, the data in both columns is log transformed. The <b>brain ~ body</b> scatter plot of the transformed data now conforms reasonably well to a bivariate normal distribution.

##### Build the model

We need some data to train the model and some to test it. So in the next step of the experiment, we will drag and drop the <b>Split Data</b> module to split the dataset into two separate datasets: one for training our model and one for testing it.

Let's connect the output port of the <b>Execute R Script</b> module to the input port of the <b>Split Data</b> module, and configure the <b>Split Data</b> module to output 70% of our dataset as the training set.
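As a rough illustration of what a randomized 70/30 split does, here is a small Python sketch. The `split_rows` helper, its seed, and its rounding rule are assumptions made for this example; the <b>Split Data</b> module may shuffle and round differently.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into a training part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the 56 cleaned rows: row indices instead of real data.
rows = list(range(56))
train, test = split_rows(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed matters when two models are compared: both should be trained and tested on the same rows so that the evaluation metrics are comparable.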

Next, we will find the <b>Linear Regression</b> module, and drag and drop it on our experiment canvas. 

The <b>Linear Regression</b> module in Azure Machine Learning supports two <b>Solution method</b>s for fitting the linear model.

- <b>Online gradient descent:</b> Gradient descent is an optimization method that updates the values for the intercept and the regression coefficient at each step of the model training process in order to minimize the error function. Intuitively, a gradient descent algorithm starts with an initial set of parameter values and iteratively moves towards a better set of parameter values, taking steps in the negative direction of the function gradient, until the output of the function reaches a steady plateau.
- <b>Ordinary least squares:</b> The method of least squares is one of the most commonly used methods for fitting a regression line. This method determines the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Here, the error is calculated as the sum of the squared distances from the actual values to the fitted line; there are no cancellations between positive and negative values, as the deviations are first squared and then summed.
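The two solution methods can be sketched in plain Python for the single-variable case. This is only an illustration under stated assumptions: the data points are made up, the function names are hypothetical, and the gradient descent shown is the simpler full-batch variant with a fixed learning rate rather than the online variant the module uses.

```python
def fit_ols(xs, ys):
    """Closed-form least squares estimates (a, b) for the line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Minimize the mean squared error by repeated steps against the gradient."""
    a, b = 0.0, 0.0  # initial parameter values
    n = len(xs)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = (a + b * x) - y
            grad_a += 2 * error / n
            grad_b += 2 * error * x / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Made-up data lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

a_ols, b_ols = fit_ols(xs, ys)
a_gd, b_gd = fit_gradient_descent(xs, ys)
print("OLS:", a_ols, b_ols)
print("Gradient descent:", a_gd, b_gd)
```

On this tiny dataset both approaches arrive at essentially the same line. Gradient descent, however, depends on the learning rate and the number of passes over the data, so with unsuitable settings it can stop well short of the least squares solution.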

In this example, we will use two separate linear regression models - each with a different solution method - and then compare the results to find out which one better meets our needs.

In order to do so, let's select the <b>Split Data</b> module that has already been configured to output 70% of our dataset as the training set, and copy it to create a second <b>Split Data</b> module with exactly the same configuration. Connect the output port of the <b>Execute R Script</b> module to the input port of the second <b>Split Data</b> module so that the same data flows through both pipelines. Then, drag and drop a second <b>Linear Regression</b> module on the experiment canvas. 

Next, drag and drop two <b>Train Model</b> modules and configure them using the corresponding <b>Launch column selector</b> buttons; select the <i>brain</i> column as target for both <b>Train Model</b> modules.

1. We will configure the first <b>Linear Regression</b> module with <b>Solution method</b> as <b>Ordinary least squares</b>, and connect the output port of the first <b>Linear Regression</b> module and the first (training) output port of the first <b>Split Data</b> module to the input ports of one <b>Train Model</b> module. Drag and drop a <b>Score Model</b> module and connect the output port of the <b>Train Model</b> module and the second (test) output port of the <b>Split Data</b> module to the input ports of the <b>Score Model</b> module.
2. We will configure the second <b>Linear Regression</b> module with <b>Solution method</b> as <b>Online gradient descent</b>, and connect the output port of the second <b>Linear Regression</b> module and the first (training) output port of the second <b>Split Data</b> module to the input ports of the other <b>Train Model</b> module. Drag and drop another <b>Score Model</b> module and connect the output port of the <b>Train Model</b> module and the second (test) output port of the <b>Split Data</b> module to the input ports of the <b>Score Model</b> module.

Finally, drag and drop the <b>Evaluate Model</b> module, and connect the output ports of the two <b>Score Model</b> modules to the input ports of the <b>Evaluate Model</b> module. 

![Linear regression model](images/linear-regression.png "Linear regression model")

### Predict and Evaluate 

Now that our experiment is built, let's click on <b>RUN</b> at the bottom of the Azure Machine Learning Studio page.

Once the experiment finishes execution, we can click on the output port of the <b>Evaluate Model</b> module and select <b>Visualize</b> to compare the two models.

![Compare linear regression models](images/compare-regressions.png "Compare linear regression models")

The evaluation metrics available for regression models are: <i>Mean Absolute Error</i>, <i>Root Mean Squared Error</i>, <i>Relative Absolute Error</i>, <i>Relative Squared Error</i>, and the <i>Coefficient of Determination</i>.

The coefficient of determination - usually denoted by <i>R<sup>2</sup></i> or <i>R-squared</i> - is a standard way of measuring how well the model fits the data. Let's recall that <i>R<sup>2</sup></i> can be interpreted as the proportion of variation in the dependent variable that can be explained from the independent variable(s); the coefficient of determination usually falls between 0 and 1 - a higher proportion is better.

<b>Note:</b> For each of the error statistics, smaller is better. A smaller value indicates that the predictions more closely match the actual values. The Coefficient of Determination is the only statistic for which the higher its value (the closer its value is to 1), the better the predictions.
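To see how these statistics relate to one another, here is a small Python sketch using the usual textbook definitions. I am assuming these match what the <b>Evaluate Model</b> module reports; the `regression_metrics` helper and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute common regression error statistics from paired values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    sq_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # Baseline: the error of always predicting the mean of the actual values.
    abs_baseline = sum(abs(a - mean_actual) for a in actual)
    sq_baseline = sum((a - mean_actual) ** 2 for a in actual)
    relative_squared_error = sum(sq_errors) / sq_baseline
    return {
        "mean_absolute_error": sum(abs_errors) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_errors) / n),
        "relative_absolute_error": sum(abs_errors) / abs_baseline,
        "relative_squared_error": relative_squared_error,
        "coefficient_of_determination": 1 - relative_squared_error,
    }


# Made-up actual and predicted values.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in metrics.items():
    print(name, round(value, 4))
```

With these definitions the coefficient of determination is one minus the relative squared error, which is why one of them goes up exactly when the other goes down.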

In our case, the model using <b>Ordinary least squares</b> solution method returns <i>R<sup>2</sup></i> of <b>.91</b> while the model using <b>Online gradient descent</b> solution method returns <i>R<sup>2</sup></i> of <b>.33</b>, indicating that the former fits the data better.

We can also look at the scatter plot of the predicted values for <i>brain</i> against <i>body</i> next to the scatter plot of the actual values on our test dataset.

![Compare scored labels](images/scored-labels.png "Compare scored labels")

### Regression Summary

Let's click on the output port of the <b>Train Model</b> module connected to the <b>Linear Regression</b> module that uses the least squares solution method, and select <b>Visualize</b> to see the summary of the linear regression solution.

![Regression summary](images/linear-regression-summary.png "Regression summary")

The <i>bias</i> of <b>2.05</b> represents the intercept while the regression coefficient for <i>body</i> is <b>.763</b>.

Recall the linear regression line has an equation of the form <b>Y = a + bX</b>. At this point, let's consider fitting the model:

    log(brain weight) = a + b log(body weight)

Therefore, our regression equation is: 
    
    log(brain) = 2.05 + .763 log(body)

An <i>R<sup>2</sup></i> value of <b>.91</b> on the test dataset implies that the fitted relationship between log(brain weight) and log(body weight) explains about 91% of the variation in log(brain weight).

Therefore, there is a strong linear relationship between log(brain weight) and log(body weight), with the average log(brain weight) increasing as log(body weight) increases.
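The fitted equation can also be used to predict on the original scale by undoing the log transformation. The sketch below assumes that the natural logarithm was used and that weights are given in the same units as the dataset; the `predict_brain_weight` helper is hypothetical.

```python
import math

INTERCEPT = 2.05  # bias reported in the regression summary
SLOPE = 0.763     # regression coefficient for body


def predict_brain_weight(body_weight):
    """Predict brain weight from body weight using the fitted log-log line."""
    log_brain = INTERCEPT + SLOPE * math.log(body_weight)
    return math.exp(log_brain)  # back-transform from the log scale


for body_weight in (1.0, 2.0, 100.0):
    print(body_weight, "->", round(predict_brain_weight(body_weight), 2))
```

On the original scale the model is a power law: doubling the body weight multiplies the predicted brain weight by 2<sup>0.763</sup>, that is, by roughly 1.7 rather than by 2.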

From the regression summary information, it is also evident that the species with the largest brain, relative to its body weight, is the human. The water opossum, on the other hand, has the smallest brain size after taking account of differences in body weight.