In [2]:
# Intel DAAL related imports
from daal.data_management import (
    DataSourceIface, FileDataSource, HomogenNumericTable, MergedNumericTable, NumericTable
)


# The "utils" module can be found in the IDP environment installation folder:
# <install_dir>\share\pydaal_examples\examples\python\source
# Uncomment the sys.path.append line below and replace <install_dir> with the correct path
import os
import sys
# sys.path.append(r'<install_dir>\share\pydaal_examples\examples\python\source')
from utils import printNumericTable
sys.path.append(os.path.realpath('../3-custom-modules'))
from customUtils import getArrayFromNT

# Import numpy
import numpy as np

# Boilerplate
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

# Plotting configurations
%config InlineBackend.figure_format = 'retina'
plt.rcParams["figure.figsize"] = (12, 9)

# Online Ridge Regression 

### Tutorial brief
This tutorial is an example of using ridge regression algorithms from PyDAAL to build predictive models.
We use the well-studied Boston House Prices dataset to train a ridge regression model in online processing mode. We then test the accuracy of the model in median house price prediction. The code for ridge regression model training and prediction is provided.

### Learning objectives
* To understand how to process data that does not fit into memory using online computing mode.
* To understand and practice the typical code sequence of using PyDAAL for supervised learning.
* To practice interactions and conversions between DAAL NumericTables and NumPy ndarrays.


### Linear regression introduction
Supervised learning involves training a model on data with known responses, and then applying the model to predict responses for unseen data. In the case of **linear regression** and **ridge regression**, the model is linear. That is,

$$ f_{\beta}(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k $$

$\beta_0, \beta_1, \cdots, \beta_k$ are the regression model coefficients. 

PyDAAL provides two linear regression algorithms:
* **Multiple Linear Regression**: The model is trained by minimizing an objective function in the form of **Residual Sum of Squares**. PyDAAL supports two ways to train the model: 1) Normal Equation method, and 2) QR method.

$$ \sum \limits_{i=1}^n\left ( y_i - f_{\beta}(X^i)\right )^2 $$  
* **Ridge Regression**: It is similar to multiple linear regression, but adds a regularization term to the objective function. The regularization term penalizes large coefficient values, which makes the model less prone to overfitting. The parameter $\lambda \geq 0$ controls the strength of the penalty.

$$ \sum \limits_{i=1}^n\left ( y_i - f_{\beta}(X^i)\right )^2 + \lambda \sum \limits_{j=1}^k \beta_j^2 $$
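
To make the effect of the regularization term concrete, here is a minimal NumPy sketch that minimizes the objective above in closed form. It is an illustration only and does not use PyDAAL; the helper name `ridge_fit` and the synthetic data are assumptions made for this example. The intercept $\beta_0$ is left unpenalized, as in the formula.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit (illustrative helper, not a PyDAAL function).

    Minimizes sum_i (y_i - b0 - x_i . b)^2 + lam * sum_j b_j^2.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    # Centering the data removes the unpenalized intercept from the problem
    Xc, yc = X - x_mean, y - y_mean
    k = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Synthetic data (an assumption for this sketch): y = 1 + 2*x1 - x2 + 0.5*x3 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

b0_ols, b_ols = ridge_fit(X, y, lam=0.0)       # lam = 0 gives ordinary least squares
b0_ridge, b_ridge = ridge_fit(X, y, lam=50.0)  # lam > 0 shrinks the coefficients

print("OLS coefficients:  ", b_ols)
print("Ridge coefficients:", b_ridge)
print("Norms:", np.linalg.norm(b_ols), np.linalg.norm(b_ridge))
```

With $\lambda = 0$ the fit reduces to multiple linear regression; increasing $\lambda$ pulls the coefficient vector toward zero.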

### Online processing mode
Some Intel DAAL algorithms enable processing of data sets in blocks. In the online processing mode, the `compute()` and `finalizeCompute()` methods of a particular algorithm class are used.

This computation mode assumes that the data arrives in blocks i = 1, 2, 3, … nBlocks.

Call the `compute()` method each time new input becomes available.
![](https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-73A24EF1-070A-40DA-A3A9-FD62079C370A-low.png)

When the last block of data arrives, call the `finalizeCompute()` method to produce final results.
![](https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-65441AC7-7991-4966-98D4-FD49E3313889-low.png)

If the input data arrives in an asynchronous mode, you can use the `getStatus()` method for a given data source to check whether a new block of data is available for load.
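
The idea behind block-wise training can be sketched in plain NumPy. The class below is a conceptual illustration, not DAAL's implementation: the name `OnlineRidge` and its methods are hypothetical and only mirror the `compute()` / `finalizeCompute()` pattern. It accumulates the sums $X^T X$ and $X^T y$ one block at a time, so only one block has to be in memory at once, and solves the ridge system at the end.

```python
import numpy as np

class OnlineRidge:
    """Hypothetical block-wise ridge trainer (conceptual sketch only)."""

    def __init__(self, n_features, lam=1.0):
        self.lam = lam
        d = n_features + 1               # one extra column for the intercept
        self.xtx = np.zeros((d, d))      # running sum of X^T X
        self.xty = np.zeros(d)           # running sum of X^T y

    def compute(self, X_block, y_block):
        # Update the partial results with one block of data
        Xa = np.hstack([np.ones((X_block.shape[0], 1)), X_block])
        self.xtx += Xa.T @ Xa
        self.xty += Xa.T @ y_block

    def finalize_compute(self):
        # Solve the regularized normal equations; the intercept is not penalized
        penalty = self.lam * np.eye(self.xtx.shape[0])
        penalty[0, 0] = 0.0
        return np.linalg.solve(self.xtx + penalty, self.xty)

# Synthetic data (an assumption for this sketch)
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = 3.0 + X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)

# Online mode: feed blocks of 50, 50 and 20 rows
online = OnlineRidge(n_features=4, lam=1.0)
for start in range(0, len(X), 50):
    online.compute(X[start:start + 50], y[start:start + 50])
beta_online = online.finalize_compute()

# Batch mode: feed everything at once
batch = OnlineRidge(n_features=4, lam=1.0)
batch.compute(X, y)
beta_batch = batch.finalize_compute()

print(np.allclose(beta_online, beta_batch))
```

Because the accumulated sums are the same regardless of how the rows are split, the block-wise result agrees with the all-at-once result up to floating-point rounding.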

### The Boston House Prices dataset
The dataset has already been downloaded to the ./mldata folder. There are 506 rows and 14 columns. The first 13 columns are features (explanatory variables), and the last column is the dependent variable we try to make predictions for. Here's detailed information about this dataset, including descriptions of each feature:

> Origin: 

> This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. 

> Creator: 

> Harrison, D. and Rubinfeld, D.L. 
> 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.

> Data Set Information:

> Concerns housing values in suburbs of Boston.


> Attribute Information:

> 1. CRIM: per capita crime rate by town 
> 2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft. 
> 3. INDUS: proportion of non-retail business acres per town 
> 4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) 
> 5. NOX: nitric oxides concentration (parts per 10 million) 
> 6. RM: average number of rooms per dwelling 
> 7. AGE: proportion of owner-occupied units built prior to 1940 
> 8. DIS: weighted distances to five Boston employment centres 
> 9. RAD: index of accessibility to radial highways 
> 10. TAX: full-value property-tax rate per \$10,000 
> 11. PTRATIO: pupil-teacher ratio by town 
> 12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town 
> 13. LSTAT: % lower status of the population 
> 14. MEDV: Median value of owner-occupied homes in \$1000's



### Quality metrics
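
The two helper functions in this section compute the mean squared error and the coefficient of determination for each output column. For reference, with true values $y_i$, predicted values $\hat{y}_i$, and the mean of the true values $\bar{y}$ over $n$ samples, the standard definitions are:

$$ \mathrm{MSE} = \frac{1}{n} \sum \limits_{i=1}^n \left ( y_i - \hat{y}_i \right )^2 $$

$$ R^2 = 1 - \frac{\sum \limits_{i=1}^n \left ( y_i - \hat{y}_i \right )^2}{\sum \limits_{i=1}^n \left ( y_i - \bar{y} \right )^2} $$

A lower MSE is better. An $R^2$ of 1 means perfect predictions, and an $R^2$ of 0 means the model does no better than always predicting $\bar{y}$.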

In [None]:
def mse(values, fitted_values):
    """Return the mean squared error of fitted values w.r.t. true values

    Args:
        values: True values. NumericTable, nsamples-by-noutputs
        fitted_values: Predicted values. NumericTable, nsamples-by-noutputs

    Returns:
        A tuple containing one MSE value per output
    """

    y_t = getArrayFromNT(values)
    y_p = getArrayFromNT(fitted_values)
    rss = ((y_t - y_p) ** 2).sum(axis = 0)
    mse = rss / y_t.shape[0]
    return tuple(mse)


def score(y_true, y_pred):
    """Compute R-squared (coefficient of determination)

    Args:
        y_true: True values. NumericTable, shape = (nsamples, noutputs)
        y_pred: Predicted values. NumericTable, shape = (nsamples, noutputs)

    Returns:
        R2: A tuple with noutputs values
    """

    y_t = getArrayFromNT(y_true)
    y_p = getArrayFromNT(y_pred)
    rss = ((y_t - y_p) ** 2).sum(axis = 0)
    tss = ((y_t - y_t.mean(axis = 0)) ** 2).sum(axis = 0)
    r2 = 1 - rss/tss
    return tuple(r2)

### Ridge regression model training for Boston house prices
The code below reads data from the file `housing.data.train.csv` and creates 2 NumericTables: training data (`xTrain`) and ground truth (`yTrain`). We use the `FileDataSource` to stream the data from the file into an in-memory representation: numeric tables.

The ridge regression model is updated after each new block of data.

In [None]:
from daal.algorithms.ridge_regression import training as ridge_training

# Number of features in the dataset
nFeatures = 13

# Initialize FileDataSource to retrieve the input data from a .csv file
trainDataSource = FileDataSource(
    './mldata/housing.data.train.csv', DataSourceIface.notAllocateNumericTable, DataSourceIface.doDictionaryFromContext
)

# Create Numeric Tables for training data and dependent variables
xTrain = HomogenNumericTable(nFeatures, 0, NumericTable.notAllocate)
yTrain = HomogenNumericTable(1,         0, NumericTable.notAllocate)
mergedDataTrain = MergedNumericTable(xTrain, yTrain)

# Create an algorithm object to train ridge regression model in online processing mode
regr = ridge_training.Online()

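# Process the training data in blocks of 50 rows. The loop ends at the first block
# with fewer than 50 rows, so a trailing partial block is not used for training.
while trainDataSource.loadDataBlock(50, mergedDataTrain) == 50: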
    # Pass new block of data from the training data set and dependent values to the algorithm
    regr.input.set(ridge_training.data, xTrain)
    regr.input.set(ridge_training.dependentVariables, yTrain)

    # Update ridge regression model
    regr.compute()

model = regr.finalizeCompute().get(ridge_training.model)

# Peek at the model (Betas)
printNumericTable(model.getBeta())

### Prediction with Ridge Regression model

The code below reads data from the file `housing.data.test.csv` and creates 2 NumericTables: test data (`xTest`) and test ground truth (`yTest`). We use the ridge regression prediction algorithm and the model obtained at the training stage to compute predictions for new, previously unseen data.

In [8]:
from daal.algorithms.ridge_regression import prediction as ridge_prediction

testDataSource = FileDataSource(
    './mldata/housing.data.test.csv', DataSourceIface.notAllocateNumericTable, DataSourceIface.doDictionaryFromContext
)

# Create Numeric Tables for testing data and dependent variables
xTest = HomogenNumericTable(nFeatures, 0, NumericTable.notAllocate)
yTest = HomogenNumericTable(1,         0, NumericTable.notAllocate)
mergedDataTest = MergedNumericTable(xTest, yTest)

testDataSource.loadDataBlock(mergedDataTest)

# Create a prediction algorithm object
alg = ridge_prediction.Batch()
# Set input
alg.input.setModel(ridge_prediction.model, model)
alg.input.setTable(ridge_prediction.data, xTest)

# Compute
predictions = alg.compute().get(ridge_prediction.prediction)

### Plotting predicted values against the ground truth
To see if the model has done a good job, we print the MSE and R-squared values and plot the predicted values against the ground truth. If the model did a perfect job, all points on the plot would fall on the dashed diagonal line, where the predicted value equals the measured value. As we see, this is not quite the case, but the predictions are still close to the true values in many cases.

In [None]:
print(mse(yTest, predictions))
print(score(yTest, predictions))

predicted = getArrayFromNT(predictions)
expected = getArrayFromNT(yTest)

fig, ax = plt.subplots()
ax.scatter(expected, predicted)
ax.plot([0, 30], [0, 30], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

### Summary
In this lab, we learned how to train a ridge regression model in online processing mode, feeding the training data to the algorithm block by block. We saw how to apply the model to the Boston House Prices dataset, and we practiced the PyDAAL API for ridge regression training and prediction.