# Linear Regression Implementation from Scratch
:label:`sec_linear_scratch`

Now that you understand the key ideas behind linear regression,
we can begin to work through a hands-on implementation in code.
In this section, we will implement the entire method from scratch,
including the data pipeline, the model,
the loss function, and the gradient descent optimizer.
While modern deep learning frameworks can automate nearly all of this work,
implementing things from scratch is the only way
to make sure that you really know what you are doing.
Moreover, when it comes time to customize models,
defining our own layers, loss functions, etc.,
understanding how things work under the hood will prove handy.
In this section, we will rely only on `NDArray` and `GradientCollector`.
Afterwards, we will introduce a more compact implementation,
taking advantage of DJL's bells and whistles.
To start off, we import the few required packages.

In [1]:
%use @file[../djl.json]
%use lets-plot

## Generating the Dataset

To keep things simple, we will construct an artificial dataset
according to a linear model with additive noise.
Our task will be to recover this model's parameters
using the finite set of examples contained in our dataset.
We will keep the data low-dimensional so we can visualize it easily.
In the following code snippet, we generate a dataset
containing $1000$ examples, each consisting of $2$ features
sampled from a standard normal distribution.
Thus our synthetic dataset will be an object
$\mathbf{X}\in \mathbb{R}^{1000 \times 2}$.

The true parameters generating our data will be 
$\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$
and our synthetic labels will be assigned according 
to the following linear model with noise term $\epsilon$:

$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.$$

You could think of $\epsilon$ as capturing potential 
measurement errors on the features and labels.
We will assume that the standard assumptions hold and thus
that $\epsilon$ obeys a normal distribution with mean of $0$.
To make our problem easy, we will set its standard deviation to $0.01$.
The following code generates our synthetic dataset:

In [2]:
// Container for the feature matrix X and the label vector y
data class DataPoints(val X: NDArray, val y: NDArray)

// Generate y = X w + b + noise
fun syntheticData(manager: NDManager, w: NDArray, b: Float, numExamples: Int): DataPoints {
    val X = manager.randomNormal(Shape(numExamples.toLong(), w.size()))
    var y = X.matMul(w).add(b)
    // Add noise
    y = y.add(manager.randomNormal(0f, 0.01f, y.getShape(), DataType.FLOAT32))
    return DataPoints(X, y)
}

val manager = NDManager.newBaseManager()

val trueW = manager.create(floatArrayOf(2f, -3.4f))
val trueB = 4.2f

val dp = syntheticData(manager, trueW, trueB, 1000)
val features = dp.X
val labels = dp.y

Note that each row in `features` consists of a 2-dimensional data point 
and that each row in `labels` consists of a 1-dimensional target value (a scalar).

In [3]:
println("features: [%f, %f]".format(features.get(0).getFloat(0), features.get(0).getFloat(1)))
println("label: ${labels.getFloat(0)}")

features: [2.212206, 1.163079]
label: 4.662078
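
As a quick sanity check, we can recompute the first label by hand from the printed features and the true parameters. The sketch below uses plain Kotlin rather than DJL; the helper `noiselessLabel` is a hypothetical name introduced only for this illustration, and the numbers are copied from the output above.

```kotlin
// Hypothetical helper (not part of DJL): the noise-free label w.x + b for one example.
fun noiselessLabel(x: FloatArray, w: FloatArray, b: Float): Float {
    var sum = b
    for (j in x.indices) sum += w[j] * x[j]
    return sum
}

// Values copied from the printed output above.
val firstFeatures = floatArrayOf(2.212206f, 1.163079f)
val firstLabel = 4.662078f

// Label predicted by the true parameters w = [2, -3.4] and b = 4.2.
val expectedLabel = noiselessLabel(firstFeatures, floatArrayOf(2f, -3.4f), 4.2f)
// The residual is the noise term for this example.
val residual = firstLabel - expectedLabel
```

Here `expectedLabel` is roughly $4.67$ and `residual` is roughly $-0.008$, which is within one standard deviation ($0.01$) of the noise we added.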


By generating a scatter plot using the second feature `features[:, 1]` and `labels`, 
we can clearly observe the linear correlation between the two.

In [13]:
val X = features.get(NDIndex(":, 1")).toFloatArray()
val y = labels.toFloatArray()

// The plotted column is the second feature (index 1), so we label it X(1)
val data = mapOf<String, FloatArray>(
    "X(1)" to X,
    "y" to y
)

var p = letsPlot(data)
p += geomPoint() { x = "X(1)"; y = "y" }
p + ggsize(700, 500)

## Reading the Dataset

Recall that training models consists of 
making multiple passes over the dataset, 
grabbing one minibatch of examples at a time,
and using them to update our model. 
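We can use `ArrayDataset` to sample the data
and access it in minibatches.

In the following code, we instantiate an `ArrayDataset` through its builder.
We pass in the `features` and `labels`, and we use `setSampling`
to set the `batchSize` and to choose whether the examples are sampled at random.
Here we pass `false`, so the minibatches are read in order.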

With `dataset.getData`, we can get minibatches of size `batchSize`,
each consisting of its features and labels.

In [5]:
import ai.djl.training.dataset.ArrayDataset
import ai.djl.training.dataset.Batch

val batchSize = 10

val dataset = ArrayDataset.Builder()
                          .setData(features) // Set the Features
                          .optLabels(labels) // Set the Labels
                          .setSampling(batchSize, false) // Set the batch size; false disables random sampling
                          .build()

In general, note that we want to use reasonably sized minibatches
to take advantage of the GPU hardware,
which excels at parallelizing operations.
Because each example can be fed through our models in parallel
and the gradient of the loss function for each example can also be taken in parallel,
GPUs allow us to process hundreds of examples in scarcely more time
than it might take to process just a single example.

To build some intuition, let us read and print
the first small batch of data examples.
The shape of the features in each minibatch tells us
both the minibatch size and the number of input features.
Likewise, our minibatch of labels will have a shape given by `batchSize`.

In [6]:
val batch = dataset.getData(manager).iterator().next()
// Call head() to get the first NDArray
val X = batch.getData().head()
val y = batch.getLabels().head()
println(X);
println(y);
// Don't forget to close the batch!
batch.close();

ND: (10, 2) cpu() float32
[[ 2.2122,  1.1631],
 [ 0.774 ,  0.4838],
 [ 1.0434,  0.2996],
 [ 1.1839,  0.153 ],
 [ 1.8917, -1.1688],
 [-1.2347,  1.5581],
 [-1.771 , -0.5459],
 [-0.4514, -2.3556],
 [ 0.5794,  0.5414],
 [-1.8561,  2.6785],
]

ND: (10) cpu() float32
[ 4.6621,  4.0969,  5.2684,  6.0158, 11.9587, -3.5633,  2.5178, 11.3157,  3.505 , -8.6263]



As we run the iterator, we obtain distinct minibatches
successively until all the data has been exhausted (try this).
While `ArrayDataset` is good for didactic purposes,
it is inefficient in ways that might get us in trouble on real problems.
For example, it requires that we load all data in memory
and, when sampling at random, that we perform lots of random memory access.
Other datasets built into DJL can also deal
with data stored in files and data fed via a data stream.
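
To make the idea of minibatch iteration concrete, here is a minimal sketch in plain Kotlin that only manipulates example indices. It is not how DJL implements sampling; the helper `minibatchIndices` and its parameters are hypothetical names chosen for this illustration.

```kotlin
import kotlin.random.Random

// Hypothetical helper (not part of DJL): split example indices into minibatches.
// When `shuffle` is true the indices are permuted first, so the examples are
// visited in a random order; otherwise they are visited sequentially.
fun minibatchIndices(
    numExamples: Int, batchSize: Int, shuffle: Boolean, rng: Random = Random(0)
): List<List<Int>> {
    val indices = (0 until numExamples).toList()
    val ordered = if (shuffle) indices.shuffled(rng) else indices
    // The last chunk is smaller when batchSize does not divide numExamples.
    return ordered.chunked(batchSize)
}

val sequentialBatches = minibatchIndices(1000, 10, shuffle = false)
val shuffledBatches = minibatchIndices(1000, 10, shuffle = true)
```

With $1000$ examples and a batch size of $10$, both calls yield $100$ minibatches per epoch. In the sequential variant the first minibatch holds indices $0$ through $9$, while the shuffled variant covers the same indices in a permuted order.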

## Initializing Model Parameters

Before we can begin optimizing our model's parameters by gradient descent,
we need to have some parameters in the first place.
In the following code, we initialize weights by sampling
random numbers from a normal distribution with mean 0
and a standard deviation of $0.01$, setting the bias $b$ to $0$.

In [7]:
val w = manager.randomNormal(0f, 0.01f, Shape(2, 1), DataType.FLOAT32)
val b = manager.zeros(Shape(1))
val params = NDList(w, b)

Now that we have initialized our parameters,
our next task is to update them until 
they fit our data sufficiently well.
Each update requires taking the gradient
(a multi-dimensional derivative)
of our loss function with respect to the parameters.
Given this gradient, we can update each parameter
in the direction that reduces the loss.

Since nobody wants to compute gradients explicitly
(this is tedious and error-prone),
we use automatic differentiation to compute the gradient.
See :numref:`sec_gradcollector` for more details.
Recall from the autograd chapter
that in order for `GradientCollector` to know
that it should store a gradient for our parameters,
we need to call `setRequiresGradient(true)` on each of them
(we do this just before the training loop),
allocating memory to store the gradients that we plan to take.

## Defining the Model

Next, we must define our model,
relating its inputs and parameters to its outputs.
Recall that to calculate the output of the linear model,
we simply take the matrix-vector dot product
of the examples $\mathbf{X}$ and the model's weights $\mathbf{w}$,
and add the offset $b$ to each example.
Note that below `X.dot(w)` is a vector and `b` is a scalar.
Recall that when we add a vector and a scalar,
the scalar is added to each component of the vector.

In [14]:
// Saved in Training.java for later use
fun linreg(X:NDArray , w:NDArray , b:NDArray ) : NDArray = X.dot(w).add(b)
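
To see what the model computes row by row, the following sketch spells out the same calculation on plain Kotlin arrays. The function `linregPlain` and the demo values are hypothetical and exist only for illustration; the DJL version above remains the one used in the rest of this section.

```kotlin
// Hypothetical plain-Kotlin counterpart of `linreg`: one prediction per row of X.
fun linregPlain(X: Array<FloatArray>, w: FloatArray, b: Float): FloatArray =
    FloatArray(X.size) { i ->
        var dot = 0f
        for (j in w.indices) dot += X[i][j] * w[j]
        dot + b  // the same scalar bias is added to every row
    }

// Three made-up examples with two features each.
val demoX = arrayOf(floatArrayOf(1f, 2f), floatArrayOf(3f, 4f), floatArrayOf(0f, 0f))
val demoYHat = linregPlain(demoX, floatArrayOf(2f, -3.4f), 4.2f)
```

The three predictions are approximately $-0.6$, $-3.4$, and $4.2$; the last one shows that an all-zero input returns just the bias.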

## Defining the Loss Function

Since updating our model requires taking 
the gradient of our loss function,
we ought to define the loss function first.
Here we will use the squared loss function
as described in the previous section.
In the implementation, we need to reshape the true values `y`
to the shape of the predicted values `yHat`.
The result returned by the following function
will also have the same shape as `yHat`.

In [9]:
// Saved in Training.java for later use
fun squaredLoss(yHat: NDArray, y: NDArray): NDArray {
    // Reshape y to match yHat, then compute (yHat - y)^2 / 2 element-wise
    val diff = yHat.sub(y.reshape(yHat.getShape()))
    return diff.mul(diff).div(2)
}
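
As a small worked example, the per-example squared loss can be written on plain Kotlin arrays as follows. The name `squaredLossPlain` and the demo values are hypothetical; this is a sketch of the formula, not a replacement for the DJL function.

```kotlin
// Hypothetical plain-Kotlin counterpart of `squaredLoss`: (yHat - y)^2 / 2 per example.
fun squaredLossPlain(yHat: FloatArray, y: FloatArray): FloatArray {
    // Predictions and labels must line up one-to-one.
    require(yHat.size == y.size) { "predictions and labels must have the same length" }
    return FloatArray(yHat.size) { i ->
        val diff = yHat[i] - y[i]
        diff * diff / 2f
    }
}

// Made-up predictions and labels.
val demoLoss = squaredLossPlain(floatArrayOf(1f, 2f, 3f), floatArrayOf(1f, 0f, 5f))
```

The result is `[0.0, 2.0, 2.0]`: a perfect prediction costs nothing, and an error of $2$ in either direction costs $2^2 / 2 = 2$.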

## Defining the Optimization Algorithm

As we discussed in the previous section,
linear regression has a closed-form solution.
However, this is not a book about linear regression;
it is a book about deep learning.
Since none of the other models that this book introduces
can be solved analytically, we will take this opportunity
to introduce your first working example of stochastic gradient descent (SGD).


At each step, using one batch randomly drawn from our dataset,
we will estimate the gradient of the loss with respect to our parameters.
Next, we will update our parameters (a small amount)
in the direction that reduces the loss.
Recall from :numref:`sec_gradcollector` that after we call `backward()`, 
each parameter (`param`) will have its gradient stored in `param.getGradient()`.
The following code applies the SGD update,
given a set of parameters, a learning rate, and a batch size.
The size of the update step is determined by the learning rate `lr`.
Because our loss is calculated as a sum over the batch of examples,
we normalize our step size by the batch size (`batchSize`),
so that the magnitude of a typical step size
does not depend heavily on our choice of the batch size.

In [15]:
// Saved in Training.java for later use
fun sgd(params: NDList, lr: Float, batchSize: Int) {
    for (param in params) {
        // Update param
        // param = param - param.gradient * lr / batchSize
        param.subi(param.getGradient().mul(lr).div(batchSize));
    }
}
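
To check the arithmetic of a single update, here is a sketch of one SGD step on plain Kotlin arrays. Unlike the in-place `sgd` above, the hypothetical helper `sgdStepPlain` returns a new array, and `gradSum` is assumed to hold the gradient summed over the minibatch.

```kotlin
// Hypothetical plain-Kotlin counterpart of `sgd` for one parameter vector.
// Returns param - lr * gradSum / batchSize without modifying `param`.
fun sgdStepPlain(param: FloatArray, gradSum: FloatArray, lr: Float, batchSize: Int): FloatArray =
    FloatArray(param.size) { j -> param[j] - lr * gradSum[j] / batchSize }

// Made-up parameter and summed gradient.
val demoUpdated = sgdStepPlain(
    floatArrayOf(1.0f, -2.0f), floatArrayOf(2.0f, -4.0f), lr = 0.03f, batchSize = 10
)
```

Each component moves against its gradient by $0.03 / 10$ times the summed gradient, so the parameters become approximately $0.994$ and $-1.988$.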

## Training

Now that we have all of the parts in place,
we are ready to implement the main training loop.
It is crucial that you understand this code
because you will see nearly identical training loops
over and over again throughout your career in deep learning.

In each iteration, we will grab a minibatch of training examples
and pass it through our model to obtain a set of predictions.
After calculating the loss, we call the `backward()` function
to initiate the backward pass through the network,
storing the gradient with respect to each parameter in its corresponding `gradient` attribute.
Technically, since `NDArray` is an interface implemented by each engine,
there is no standard `gradient` attribute,
but we can access the gradients, however they are stored, with `getGradient()`.
Finally, we will call the optimization algorithm `sgd`
to update the model parameters.
Since we previously set the batch size `batchSize` to $10$,
the shape of the loss `l` for each minibatch is ($10$, $1$).

In summary, we will execute the following loop:

* Initialize parameters $(\mathbf{w}, b)$
* Repeat until done
    * Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^i, y^i, \mathbf{w}, b)$
    * Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
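
For the squared loss used in this section, the gradient in the first step can be written out explicitly. Writing $\hat{y}^i = \mathbf{w}^\top \mathbf{x}^i + b$ for the prediction on example $i$ (the hat notation is introduced here only for brevity), the per-example loss is $l(\mathbf{x}^i, y^i, \mathbf{w}, b) = \frac{1}{2} (\hat{y}^i - y^i)^2$, and the chain rule gives

$$\partial_{\mathbf{w}} l(\mathbf{x}^i, y^i, \mathbf{w}, b) = (\hat{y}^i - y^i)\, \mathbf{x}^i, \qquad \partial_{b} l(\mathbf{x}^i, y^i, \mathbf{w}, b) = \hat{y}^i - y^i.$$

Averaging these quantities over the minibatch $\mathcal{B}$ yields $\mathbf{g}$. Automatic differentiation computes the same values for us, so we never need to code this derivation by hand.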

In the code below, `l` is a vector of the losses
for each example in the minibatch.

In each epoch (a pass through the data),
we will iterate through the entire dataset
(using the `dataset.getData()` function) once,
passing through every example in the training dataset
(assuming the number of examples is divisible by the batch size).
The number of epochs `numEpochs` and the learning rate `lr` are both hyper-parameters,
which we set here to $3$ and $0.03$, respectively.
Unfortunately, setting hyper-parameters is tricky
and requires some adjustment by trial and error.
We elide these details for now but revisit them
later in
:numref:`chap_optimization`.

Note: We can replace `linreg` and `squaredLoss` with any net or loss function respectively
and still keep the same training structure shown here.

In [16]:
val lr = 0.03f  // Learning rate
val numEpochs = 3  // Number of epochs

// Attach Gradients
for (param in params) {
    param.setRequiresGradient(true)
}

for (epoch in 0 until numEpochs) {
    // Assuming the number of examples can be divided by the batch size, all
    // the examples in the training dataset are used once in one epoch
    // iteration. The features and labels of minibatch examples are given by X
    // and y respectively.
    for (batch in dataset.getData(manager)) {
        val X = batch.getData().head()
        val y = batch.getLabels().head()
        
        // `use` closes the gradient collector even if an exception is thrown
        Engine.getInstance().newGradientCollector().use { gc ->
            // Minibatch loss in X and y
            val l = squaredLoss(linreg(X, params.get(0), params.get(1)), y)
            gc.backward(l)  // Compute gradient on l with respect to w and b
        }
        sgd(params, lr, batchSize)  // Update parameters using their gradient

        batch.close()
    }
    val trainL = squaredLoss(linreg(features, params.get(0), params.get(1)), labels);
    println("epoch %d, loss %f".format(epoch + 1, trainL.mean().getFloat()))
}

epoch 1, loss 0.000051
epoch 2, loss 0.000051
epoch 3, loss 0.000051


In this case, because we synthesized the data ourselves,
we know precisely what the true parameters are. 
Thus, we can evaluate our success in training 
by comparing the true parameters
with those that we learned through our training loop. 
Indeed they turn out to be very close to each other.

In [17]:
// Use a separate name so that the parameter `w` defined earlier is not shadowed
val wError = trueW.sub(params.get(0).reshape(trueW.getShape())).toFloatArray()
println("Error in estimating w: [%f, %f]".format(wError[0], wError[1]))
println("Error in estimating b: %f".format(trueB - params.get(1).getFloat()))

Error in estimating w: [0.000462, -0.000288]
Error in estimating b: 0.000358
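
As a cross-check, the whole procedure can also be sketched in plain Kotlin, with the gradient of the squared loss written out by hand instead of relying on `GradientCollector`. Everything in this block (`LinearFit`, `fitLinear`, the seed, and the variable names) is hypothetical and introduced only for illustration; it mirrors the settings of this section (sequential minibatches of size $10$, learning rate $0.03$, $3$ epochs) but is not the DJL implementation.

```kotlin
import java.util.Random

// Hypothetical container for the learned parameters (not a DJL type).
class LinearFit(val w: FloatArray, val b: Float)

// Minibatch SGD for linear regression with the squared loss. The gradient
// (yHat - y) * x is written out by hand instead of using GradientCollector.
fun fitLinear(
    X: Array<FloatArray>, y: FloatArray,
    lr: Float, numEpochs: Int, batchSize: Int, rng: Random
): LinearFit {
    val d = X[0].size
    // Small random weights and a zero bias, as in this section.
    val w = FloatArray(d) { 0.01f * rng.nextGaussian().toFloat() }
    var b = 0f
    repeat(numEpochs) {
        // Sequential minibatches; the last one may be smaller.
        for (start in X.indices step batchSize) {
            val end = minOf(start + batchSize, X.size)
            val gradW = FloatArray(d)
            var gradB = 0f
            for (i in start until end) {
                var err = b - y[i]  // after the next line: yHat - y
                for (j in 0 until d) err += w[j] * X[i][j]
                for (j in 0 until d) gradW[j] += err * X[i][j]
                gradB += err
            }
            // Average the summed gradient over the minibatch and take a step.
            val n = end - start
            for (j in 0 until d) w[j] -= lr * gradW[j] / n
            b -= lr * gradB / n
        }
    }
    return LinearFit(w, b)
}

// Synthetic data generated as in this section; the seed 42 is an arbitrary choice.
val plainRng = Random(42)
val plainX = Array(1000) { FloatArray(2) { plainRng.nextGaussian().toFloat() } }
val plainY = FloatArray(1000) { i ->
    2f * plainX[i][0] - 3.4f * plainX[i][1] + 4.2f + 0.01f * plainRng.nextGaussian().toFloat()
}
val plainFit = fitLinear(plainX, plainY, lr = 0.03f, numEpochs = 3, batchSize = 10, rng = plainRng)
```

With these settings, `plainFit.w` and `plainFit.b` should land close to $[2, -3.4]$ and $4.2$, although the exact digits depend on the seed.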


Note that we should not take it for granted
that we are able to recover the parameters accurately.
This only happens for a special category of problems:
strongly convex optimization problems with "enough" data to ensure
that the noisy samples allow us to recover the underlying dependency.
In most cases this is *not* the case.
In fact, the parameters of a deep network 
are rarely the same (or even close) between two different runs, 
unless all conditions are identical,
including the order in which the data is traversed.
However, in machine learning, we are typically less concerned
with recovering true underlying parameters,
and more concerned with parameters that lead to accurate prediction.
Fortunately, even on difficult optimization problems,
stochastic gradient descent can often find remarkably good solutions,
owing partly to the fact that, for deep networks,
there exist many configurations of the parameters 
that lead to accurate prediction.

## Summary

We saw how a deep network can be implemented
and optimized from scratch, using just `NDArray` and `GradientCollector`,
without any need for defining layers, fancy optimizers, etc.
This only scratches the surface of what is possible.
In the following sections, we will describe additional models
based on the concepts that we have just introduced
and learn how to implement them more concisely.

## Exercises

1. What would happen if we were to initialize the weights $\mathbf{w} = 0$? Would the algorithm still work?
1. Assume that you are [Georg Simon Ohm](https://en.wikipedia.org/wiki/Georg_Ohm) trying to come up with a model between voltage and current. Can you use `GradientCollector` to learn the parameters of your model?
1. Can you use [Planck's Law](https://en.wikipedia.org/wiki/Planck%27s_law) to determine the temperature of an object using spectral energy density?
1. What are the problems you might encounter if you wanted to extend `GradientCollector` to second derivatives? How would you fix them?
1. Why is the `reshape()` function needed in the `squaredLoss()` function?
1. Experiment using different learning rates to find out how fast the loss function value drops.
1. If the number of examples cannot be divided by the batch size, what happens to the `dataset.getData()` function's behavior?
