# Weight Decay

:label:`sec_weight_decay`


Now that we have characterized the problem of overfitting,
we can introduce some standard techniques for regularizing models.
Recall that we can always mitigate overfitting
by going out and collecting more training data.
That can be costly, time consuming,
or entirely out of our control,
making it impossible in the short run.
For now, we can assume that we already have
as much high-quality data as our resources permit
and focus on regularization techniques.

Recall that in our
polynomial curve-fitting example
(:numref:`sec_model_selection`)
we could limit our model's capacity
simply by tweaking the degree 
of the fitted polynomial.
Indeed, limiting the number of features 
is a popular technique to avoid overfitting.
However, simply tossing aside features
can be too blunt an instrument for the job.
Sticking with the polynomial curve-fitting
example, consider what might happen
with high-dimensional inputs.
The natural extensions of polynomials
to multivariate data are called *monomials*, 
which are simply products of powers of variables.
The degree of a monomial is the sum of the powers.
For example, $x_1^2 x_2$ and $x_3 x_5^2$
are both monomials of degree $3$.

Note that the number of terms with degree $d$
blows up rapidly as $d$ grows larger.
Given $k$ variables, the number of monomials 
of degree $d$ is $\binom{k - 1 + d}{k - 1}$.
Even small changes in degree, say from $2$ to $3$,
dramatically increase the complexity of our model.
Thus we often need a more fine-grained tool
for adjusting function complexity.
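
To get a feel for how quickly this count grows, here is a minimal sketch in plain Kotlin that evaluates the binomial coefficient above. The function name `countMonomials` and the choice of $k = 20$ variables are ours, picked purely for illustration.

```kotlin
// Number of monomials of degree exactly d in k variables: C(k - 1 + d, k - 1).
fun countMonomials(k: Int, d: Int): Long {
    // Evaluate C(n, r) multiplicatively with n = k - 1 + d and r = min(d, k - 1).
    val n = k - 1 + d
    val r = minOf(d, k - 1)
    var result = 1L
    for (i in 1..r) {
        // After this step, result equals C(n - r + i, i), so the division is exact.
        result = result * (n - r + i) / i
    }
    return result
}

fun main() {
    val k = 20 // hypothetical number of input variables
    for (d in 1..4) {
        println("k = $k, d = $d: ${countMonomials(k, d)} monomials")
    }
}
```

With $k = 20$, moving from degree $2$ to degree $3$ takes the count from 210 to 1540 terms.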

## Squared Norm Regularization

*Weight decay* (commonly called *L2* regularization)
might be the most widely used technique
for regularizing parametric machine learning models.
The technique is motivated by the basic intuition
that among all functions $f$,
the function $f = 0$ 
(assigning the value $0$ to all inputs) 
is in some sense the *simplest*,
and that we can measure the complexity 
of a function by its distance from zero.
But how precisely should we measure
the distance between a function and zero?
There is no single right answer.
In fact, entire branches of mathematics,
including parts of functional analysis 
and the theory of Banach spaces,
are devoted to answering this question.

One simple interpretation might be 
to measure the complexity of a linear function
$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$
by some norm of its weight vector, e.g., $\| \mathbf{w} \|^2$.
The most common method for ensuring a small weight vector
is to add its norm as a penalty term
to the problem of minimizing the loss.
Thus we replace our original objective,
*minimize the prediction loss on the training labels*,
with a new objective,
*minimize the sum of the prediction loss and the penalty term*.
Now, if our weight vector grows too large,
our learning algorithm might *focus*
on minimizing the weight norm $\| \mathbf{w} \|^2$
rather than minimizing the training error.
That is exactly what we want.
To illustrate things in code, 
let us revive our previous example
from :numref:`sec_linear_regression` for linear regression.
There, our loss was given by

$$l(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$

Recall that $\mathbf{x}^{(i)}$ are the observations,
$y^{(i)}$ are labels, and $(\mathbf{w}, b)$
are the weight and bias parameters respectively.
To penalize the size of the weight vector,
we must somehow add $\| \mathbf{w} \|^2$ to the loss function,
but how should the model trade off the
standard loss for this new additive penalty?
In practice, we characterize this tradeoff
via the *regularization constant* $\lambda \geq 0$,
a non-negative hyperparameter
that we fit using validation data:

$$l(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2.$$

For $\lambda = 0$, we recover our original loss function.
For $\lambda > 0$, we restrict the size of $\| \mathbf{w} \|$.
The astute reader might wonder why we work with the squared
norm and not the standard norm (i.e., the Euclidean distance).
We do this for computational convenience.
By squaring the L2 norm, we remove the square root, 
leaving the sum of squares of 
each component of the weight vector.
This makes the derivative of the penalty easy to compute
(the sum of derivatives equals the derivative of the sum).
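
To make this convenience concrete, here is the short calculation, using only the symbols defined above and writing $w_j$ for the components of $\mathbf{w}$ (the index $j$ is our notation):

$$\frac{\lambda}{2} \|\mathbf{w}\|^2 = \frac{\lambda}{2} \sum_j w_j^2
\quad \Longrightarrow \quad
\frac{\partial}{\partial w_j} \left( \frac{\lambda}{2} \|\mathbf{w}\|^2 \right) = \lambda w_j,
\quad \text{i.e.,} \quad
\nabla_{\mathbf{w}} \left( \frac{\lambda}{2} \|\mathbf{w}\|^2 \right) = \lambda \mathbf{w}.$$

By comparison, the unsquared norm has gradient $\mathbf{w} / \|\mathbf{w}\|$, which is undefined at $\mathbf{w} = \mathbf{0}$.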

Moreover, you might ask why we work with the L2 norm 
in the first place and not, say, the L1 norm.

In fact, other choices are valid and 
popular throughout statistics.
While L2-regularized linear models constitute
the classic *ridge regression* algorithm,
L1-regularized linear regression
is a similarly fundamental model in statistics
(popularly known as *lasso regression*).

More generally, the $\ell_2$ norm is just one
among an infinite class of norms called $p$-norms,
many of which you might encounter in the future.
In general, for some number $p \geq 1$,
the $\ell_p$ norm is defined via

$$\|\mathbf{w}\|_p^p := \sum_{i=1}^d |w_i|^p.$$


One reason to work with the L2 norm
is that it places an outsize penalty
on large components of the weight vector.
This biases our learning algorithm 
towards models that distribute weight evenly 
across a larger number of features.
In practice, this might make them more robust
to measurement error in a single variable.
By contrast, L1 penalties lead to models
that concentrate weight on a small set of features,
which may be desirable for other reasons. 
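
The contrast between the two penalties can be checked numerically. The following is a minimal sketch in plain Kotlin; the helper `pNorm` and the two example vectors are hypothetical, chosen so that both vectors have the same $\ell_1$ norm.

```kotlin
import kotlin.math.abs
import kotlin.math.pow

// The l_p norm of a vector: (sum_i |w_i|^p)^(1/p), for p >= 1.
fun pNorm(w: DoubleArray, p: Double): Double {
    require(p >= 1.0) { "p must be at least 1" }
    return w.sumOf { abs(it).pow(p) }.pow(1.0 / p)
}

fun main() {
    // Two hypothetical weight vectors with the same l1 norm.
    val concentrated = doubleArrayOf(1.0, 0.0, 0.0, 0.0)
    val spread = doubleArrayOf(0.25, 0.25, 0.25, 0.25)
    for ((name, w) in listOf("concentrated" to concentrated, "spread" to spread)) {
        val l1 = pNorm(w, 1.0)
        val l2Squared = pNorm(w, 2.0).pow(2)
        println("$name: l1 = $l1, squared l2 = $l2Squared")
    }
}
```

Both vectors incur the same $\ell_1$ penalty, but the squared $\ell_2$ penalty of the spread-out vector is four times smaller, so an $\ell_2$ penalty favors distributing weight across features.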

The stochastic gradient descent updates 
for L2-regularized regression follow:

$$
\begin{aligned}
\mathbf{w} & \leftarrow \left(1- \eta\lambda \right) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned}
$$

As before, we update $\mathbf{w}$ based on the amount 
by which our estimate differs from the observation.
However, we also shrink the size of $\mathbf{w}$ towards $0$.
That is why the method is sometimes called "weight decay":
given the penalty term alone,
our optimization algorithm *decays*
the weight at each step of training.
In contrast to feature selection,
weight decay offers us a continuous mechanism
for adjusting the complexity of $f$.
Small values of $\lambda$ correspond 
to unconstrained $\mathbf{w}$,
whereas large values of $\lambda$ 
constrain $\mathbf{w}$ considerably.
Whether we include a corresponding bias penalty $b^2$ 
can vary across implementations, 
and may vary across layers of a neural network.
Often, we do not regularize the bias term
of a network's output layer.
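
The update rule above can be sketched in a few lines of plain Kotlin. This is only an illustration under our own assumptions: the function `sgdStepWithDecay` and its argument `lossGrad` (the minibatch-averaged gradient of the unregularized loss with respect to $\mathbf{w}$) are hypothetical names, and the bias update is omitted.

```kotlin
// One SGD step for the L2-regularized objective:
//   w <- (1 - eta * lambd) * w - eta * lossGrad
// where lossGrad is the minibatch-averaged gradient of the unregularized loss.
fun sgdStepWithDecay(
    w: DoubleArray,
    lossGrad: DoubleArray,
    eta: Double,
    lambd: Double
): DoubleArray {
    require(w.size == lossGrad.size) { "w and lossGrad must have the same length" }
    return DoubleArray(w.size) { j -> (1 - eta * lambd) * w[j] - eta * lossGrad[j] }
}

fun main() {
    var w = doubleArrayOf(1.0, -2.0)
    val zeroGrad = doubleArrayOf(0.0, 0.0) // pretend the data term contributes nothing
    val eta = 0.1
    val lambd = 0.5
    repeat(3) { step ->
        w = sgdStepWithDecay(w, zeroGrad, eta, lambd)
        println("step ${step + 1}: ${w.joinToString()}")
    }
}
```

With the data gradient set to zero, every step multiplies each weight by $1 - \eta\lambda = 0.95$, which is the "decay" in weight decay.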
 

## High-Dimensional Linear Regression

We can illustrate the benefits of 
weight decay over feature selection
through a simple synthetic example.
First, we generate some data as before according to

$$y = 0.05 + \sum_{i = 1}^d 0.01 x_i + \epsilon \text{ where }
\epsilon \sim \mathcal{N}(0, 0.01^2),$$

choosing our label to be a linear function of our inputs,
corrupted by Gaussian noise with zero mean and standard deviation 0.01.
To make the effects of overfitting pronounced,
we can increase the dimensionality of our problem to $d = 200$
and work with a small training set containing only 20 examples.

We will now import the relevant libraries to show weight decay in action.

In [1]:
%use @file[../djl.json]
%use lets-plot
@file:DependsOn("org.apache.commons:commons-lang3:3.12.0")
import ai.djl.metric.Metrics

class Accumulator(n: Int) {
    val data = FloatArray(n) { 0f }


    /* Adds a set of numbers to the array */
    fun add(args: FloatArray) {
        for (i in 0..args.size - 1) {
            data[i] += args[i]
        }
    }

    /* Resets the array */
    fun reset() {
        data.fill(0f)
    }

    /* Returns the data point at the given index */
    fun get(index: Int): Float {
        return data[index]
    }
}

class DataPoints(X:NDArray , y:NDArray ) {
    private val X = X
    private val y = y

    fun  getX() : NDArray{
        return X
    }
    
    fun getY() :NDArray {
        return y
    }
}

fun syntheticData(manager:NDManager , w: NDArray , b : Float, numExamples: Int) : DataPoints {
    val X = manager.randomNormal(Shape(numExamples.toLong(), w.size()))
    var y = X.matMul(w).add(b)
    // Add noise
    y = y.add(manager.randomNormal(0f, 0.01f, y.getShape(), DataType.FLOAT32))
    return DataPoints(X, y);
}

object Training {

    fun linreg(X: NDArray, w: NDArray, b: NDArray): NDArray {
        return X.dot(w).add(b);
    }

    fun squaredLoss(yHat: NDArray, y: NDArray): NDArray {
        return (yHat.sub(y.reshape(yHat.getShape())))
            .mul((yHat.sub(y.reshape(yHat.getShape()))))
            .div(2);
    }

    fun sgd(params: NDList, lr: Float, batchSize: Int) {
        val lrt = Tracker.fixed(lr)
        val opt = Optimizer.sgd().setLearningRateTracker(lrt).build()
        for (param in params) {
            // Update param in place: param = param - lr * param.gradient / batchSize
            opt.update(param.toString(), param, param.getGradient().div(batchSize))
        }
    }

    /**
     * Allows to do gradient calculations on a subManager. This is very useful when you are training
     * on a lot of epochs. This subManager could later be closed and all NDArrays generated from the
     * calculations in this function will be cleared from memory when subManager is closed. This is
     * always a great practice but the impact is most notable when there is lot of data on various
     * epochs.
     */
    fun sgd(params: NDList, lr: Float, batchSize: Int, subManager: NDManager) {
        for (param in params) {
            // Update param in place.
            // param = param - param.gradient * lr / batchSize
            val gradient = param.getGradient()
            gradient.attach(subManager);
            param.subi(gradient.mul(lr).div(batchSize))
        }
    }

    fun accuracy(yHat: NDArray, y: NDArray): Float {
        // Check size of 1st dimension greater than 1
        // to see if we have multiple samples
        if (yHat.getShape().size(1) > 1) {
            // Argmax gets index of maximum args for given axis 1
            // Convert yHat to same dataType as y (int32)
            // Sum up number of true entries
            return yHat.argMax(1)
                .toType(DataType.INT32, false)
                .eq(y.toType(DataType.INT32, false))
                .sum()
                .toType(DataType.FLOAT32, false)
                .getFloat();
        }
        return yHat.toType(DataType.INT32, false)
            .eq(y.toType(DataType.INT32, false))
            .sum()
            .toType(DataType.FLOAT32, false)
            .getFloat();
    }

    fun trainingChapter6(
        trainIter: ArrayDataset,
        testIter: ArrayDataset,
        numEpochs: Int,
        trainer: Trainer,
        evaluatorMetrics: MutableMap<String, DoubleArray>
    ): Double {

        trainer.setMetrics(Metrics())

        EasyTrain.fit(trainer, numEpochs, trainIter, testIter)

        val metrics = trainer.getMetrics()

        // Record the per-epoch training and validation values of every evaluator.
        // The lambda body runs directly; wrapping it in extra braces would only
        // create a nested lambda that is never invoked.
        trainer.getEvaluators()
            .forEach { evaluator ->
                evaluatorMetrics.put(
                    "train_epoch_" + evaluator.getName(),
                    metrics.getMetric("train_epoch_" + evaluator.getName()).stream()
                        .mapToDouble { x -> x.getValue() }
                        .toArray())
                evaluatorMetrics.put(
                    "validate_epoch_" + evaluator.getName(),
                    metrics
                        .getMetric("validate_epoch_" + evaluator.getName())
                        .stream()
                        .mapToDouble { x -> x.getValue() }
                        .toArray())
            }

        return metrics.mean("epoch")
    }

    /* Softmax-regression-scratch */
    fun evaluateAccuracy(net: UnaryOperator<NDArray>, dataIterator: Iterable<Batch>): Float {
        val metric = Accumulator(2) // numCorrectedExamples, numExamples
        for (batch in dataIterator) {
            val X = batch.getData().head()
            val y = batch.getLabels().head()
            metric.add(floatArrayOf(accuracy(net.apply(X), y), y.size().toFloat()))
            batch.close()
        }
        return metric.get(0) / metric.get(1)
    }
    /* End Softmax-regression-scratch */

    /* MLP */
    /* Evaluate the loss of a model on the given dataset */
    fun evaluateLoss(
        net: UnaryOperator<NDArray>,
        dataIterator: Iterable<Batch>,
        loss: BinaryOperator<NDArray>
    ): Float {
        val metric = Accumulator(2) // sumLoss, numExamples

        for (batch in dataIterator) {
            val X = batch.getData().head()
            val y = batch.getLabels().head()
            metric.add(
                floatArrayOf(loss.apply(net.apply(X), y).sum().getFloat(), y.size().toFloat()) )
            batch.close()
        }
        return metric.get(0) / metric.get(1)
    }
    /* End MLP */
}

// %load ../utils/djl-imports
// %load ../utils/plot-utils
// %load ../utils/DataPoints.java
// %load ../utils/Training.java
// %load ../utils/Accumulator.java

In [2]:
import org.apache.commons.lang3.ArrayUtils

In [3]:
val nTrain = 20;
val nTest = 100;
val numInputs = 200;
val batchSize = 5;

val trueB = 0.05f;
val manager = NDManager.newBaseManager();
val trueW = manager.ones(Shape(numInputs.toLong(), 1)).mul(0.01)

fun  loadArray(features: NDArray , labels: NDArray , batchSize: Int, shuffle: Boolean) : ArrayDataset {
    return ArrayDataset.Builder()
                  .setData(features) // set the features
                  .optLabels(labels) // set the labels
                  .setSampling(batchSize, shuffle) // set the batch size and random sampling
                  .build();
}

val trainData = syntheticData(manager, trueW, trueB, nTrain);

val trainIter = loadArray(trainData.getX(), trainData.getY(), batchSize, true);

val testData = syntheticData(manager, trueW, trueB, nTest);

val testIter = loadArray(testData.getX(), testData.getY(), batchSize, false);

## Implementation from Scratch

Next, we will implement weight decay from scratch,
simply by adding the squared $\ell_2$ penalty
to the original target function.

### Initializing Model Parameters

First, we will define a function 
to randomly initialize our model parameters 
and run `setRequiresGradient(true)` on each to allocate
memory for the gradients we will calculate.

In [4]:
class InitParams{

    private val manager = NDManager.newBaseManager()
    private val w = manager.randomNormal(0f, 1.0f, Shape(numInputs.toLong(), 1), DataType.FLOAT32)
    private val b = manager.zeros(Shape(1))
    
    init {
        w.setRequiresGradient(true)
        b.setRequiresGradient(true)
    }
    
//    private NDList l;
    
    fun getW(): NDArray{
        return this.w;
    }
    
    fun getB(): NDArray {
        return this.b;
    }
    
}

### Defining $\ell_2$ Norm Penalty

Perhaps the most convenient way to implement this penalty
is to square all terms in place and sum them up.
We divide by $2$ by convention
(when we take the derivative of a quadratic function,
the $2$ and $1/2$ cancel out, ensuring that the expression
for the update looks nice and simple).

In [5]:
fun l2Penalty(w: NDArray): NDArray{
    return ((w.pow(2)).sum()).div(2);
}

In [6]:
val l2loss = Loss.l2Loss();

### Defining the Train and Test Functions

The following code fits a model on the training set
and evaluates it on the test set.
The linear network and the squared loss
have not changed since the previous chapter,
so we will just import them via `Training.linreg()` and `Training.squaredLoss()`.
The only change here is that our loss now includes the penalty term.

In [7]:
val epochCount = mutableListOf<Int>()
val trainLoss = mutableListOf<Float>()
val testLoss = mutableListOf<Float>()

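fun train(lambd: Float) {
    // Reset the recorded curves so repeated calls do not mix results from different runs
    epochCount.clear()
    trainLoss.clear()
    testLoss.clear()

    val initParams = InitParams();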
    
    val params = NDList(initParams.getW(), initParams.getB());
    
    val numEpochs = Integer.getInteger("MAX_EPOCH", 100);
    val lr = 0.003f;
    
    for (epoch in 1..numEpochs){
        
        for (batch in trainIter.getData(manager)){
            
            val X = batch.getData().head();
            val y = batch.getLabels().head();
            
             val w = params.get(0);
             val b = params.get(1);
            
             Engine.getInstance().newGradientCollector().use { gc ->
                // The L2 norm penalty term has been added, and broadcasting
                // makes `l2Penalty(w)` a vector whose length is `batch_size`
                val l = Training.squaredLoss(Training.linreg(X, w, b), y).add(l2Penalty(w).mul(lambd));
                gc.backward(l);  // Compute gradient on l with respect to w and b
                
            }
            
            batch.close();
            Training.sgd(params, lr, batchSize);  // Update parameters using their gradient
        }
        
        if(epoch % 5 == 0){
            val testL = Training.squaredLoss(Training.linreg(testData.getX(), params.get(0), params.get(1)), testData.getY());
            val trainL = Training.squaredLoss(Training.linreg(trainData.getX(), params.get(0), params.get(1)), trainData.getY());
            
            epochCount.add(epoch)  
            trainLoss.add(trainL.mean().log10().getFloat())
            testLoss.add(testL.mean().log10().getFloat())
        }
        
    }
    
    println("l1 norm of w: " + params.get(0).abs().sum());
}

### Training without Regularization

We now run this code with `lambd = 0`, 
disabling weight decay.
Note that we overfit badly, 
decreasing the training error but not the 
test error, a textbook case of overfitting.

In [8]:
train(0f);

val trainLabel = Array<String>(trainLoss.size) { "train loss" }
val testLabel = Array<String>(testLoss.size) { "test loss" }

val data = mapOf( "epochCount" to epochCount + epochCount,
                "loss" to trainLoss + testLoss,
                "lossLabel" to trainLabel + testLabel)
var plot = letsPlot(data)
plot += geomLine { x = "epochCount" ; y = "loss" ; color = "lossLabel"}
plot + ggsize(500, 500)

l1 norm of w: ND: () cpu() float32
3.7243



### Using Weight Decay

Below, we run with substantial weight decay.
Note that the training error increases
but the test error decreases.
This is precisely the effect 
we expect from regularization.
As an exercise, you might want to check
that the $\ell_2$ norm of the weights $\mathbf{w}$
has actually decreased.

In [9]:
// calling training with weight decay lambda = 3.0
train(3f);

val trainLabel = Array<String>(trainLoss.size) { "train loss" }
val testLabel = Array<String>(testLoss.size) { "test loss" }

val data = mapOf( "epochCount" to epochCount + epochCount,
                "loss" to trainLoss + testLoss,
                "lossLabel" to trainLabel + testLabel)
var plot = letsPlot(data)
plot += geomLine { x = "epochCount" ; y = "loss" ; color = "lossLabel"}
plot + ggsize(500, 500)

l1 norm of w: ND: () cpu() float32
1.6956



## Concise Implementation

Because weight decay is ubiquitous 
in neural network optimization,
DJL makes it especially convenient,
integrating weight decay into the optimization algorithm itself
for easy use in combination with any loss function.
Moreover, this integration serves a computational benefit,
allowing implementation tricks to add weight decay to the algorithm
without any additional computational overhead:
the weight decay portion of the update
depends only on the current value of each parameter,
and the optimizer must touch each parameter once anyway.

In the following code, we pass
the weight decay hyperparameter `wd`
to `train_djl`, which adds the penalty to the loss
and applies the updates with DJL's built-in SGD optimizer.
By default, DJL decays both 
weights and biases simultaneously.

In [10]:
val epochCount = mutableListOf<Int>()
val trainLoss = mutableListOf<Float>()
val testLoss = mutableListOf<Float>()

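fun train_djl(wd: Float) {
    // Reset the recorded curves so repeated calls do not mix results from different runs
    epochCount.clear()
    trainLoss.clear()
    testLoss.clear()

    val initParams = InitParams();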
    
    val params = NDList(initParams.getW(), initParams.getB());
    
    val numEpochs = Integer.getInteger("MAX_EPOCH", 100);
    val lr = 0.003f;
    
    val lrt = Tracker.fixed(lr);
    val sgd = Optimizer.sgd().setLearningRateTracker(lrt).build();
    
    val config = DefaultTrainingConfig(l2loss)
     .optOptimizer(sgd) // Optimizer
     .optDevices(Engine.getInstance().getDevices(1)) // single CPU/GPU
     .addEvaluator(Accuracy()) // Model Accuracy
     .addEvaluator(l2loss)
     .addTrainingListeners(*TrainingListener.Defaults.logging()); // Logging
    
    val model = Model.newInstance("mlp");

    val net = SequentialBlock();
    val linearBlock = Linear.builder().optBias(true).setUnits(1).build();
    net.add(linearBlock);

    model.setBlock(net);
    val trainer = model.newTrainer(config);
        
    trainer.initialize(Shape(batchSize.toLong(), numInputs.toLong()));
    for (epoch in 1..numEpochs){
        
        for (batch in trainer.iterateDataset(trainIter)){
            
            val X = batch.getData().head();
            val y = batch.getLabels().head();
            
             val w = params.get(0);
             val b = params.get(1);
            
            Engine.getInstance().newGradientCollector().use { gc ->
                // Minibatch loss in X and y
                val l = Training.squaredLoss(Training.linreg(X, w, b), y).add(l2Penalty(w).mul(wd));
                gc.backward(l);  // Compute gradient on l with respect to w and b
                
            }
            batch.close();
            for(param in params) {
                sgd.update(param.toString(), param, param.getGradient().div(batchSize))
            }
//            Training.sgd(params, lr, batchSize);  // Update parameters using their gradient
        }
        
        if(epoch % 5 == 0){
            val testL = Training.squaredLoss(Training.linreg(testData.getX(), params.get(0), params.get(1)), testData.getY());
            val trainL = Training.squaredLoss(Training.linreg(trainData.getX(), params.get(0), params.get(1)), trainData.getY());
            
            epochCount.add(epoch)  
            trainLoss.add(trainL.mean().log10().getFloat())
            testLoss.add(testL.mean().log10().getFloat())
        }
        
    }
    println("l1 norm of w: " + params.get(0).abs().sum());
}

The plots look identical to those when 
we implemented weight decay from scratch.
However, they run appreciably faster 
and are easier to implement,
a benefit that will become more
pronounced for large problems.

In [11]:
train_djl(0f);

val trainLabel = Array<String>(trainLoss.size) { "train loss" }
val testLabel = Array<String>(testLoss.size) { "test loss" }

val data = mapOf( "epochCount" to epochCount + epochCount,
                "loss" to trainLoss + testLoss,
                "lossLabel" to trainLabel + testLabel)
var plot = letsPlot(data)
plot += geomLine { x = "epochCount" ; y = "loss" ; color = "lossLabel"}
plot + ggsize(500, 500)

l1 norm of w: ND: () cpu() float32
3.7242



In [12]:
train_djl(10f);

val trainLabel = Array<String>(trainLoss.size) { "train loss" }
val testLabel = Array<String>(testLoss.size) { "test loss" }

val data = mapOf( "epochCount" to epochCount + epochCount,
                "loss" to trainLoss + testLoss,
                "lossLabel" to trainLabel + testLabel)
var plot = letsPlot(data)
plot += geomLine { x = "epochCount" ; y = "loss" ; color = "lossLabel"}
plot + ggsize(500, 500)

l1 norm of w: ND: () cpu() float32
0.5661



So far, we only touched upon one notion of
what constitutes a simple *linear* function.
Moreover, what constitutes a simple *nonlinear* function
can be an even more complex question.
For instance, [Reproducing Kernel Hilbert Spaces (RKHS)](https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space)
allow one to apply tools introduced
for linear functions in a nonlinear context.
Unfortunately, RKHS-based algorithms
tend to scale poorly to large, high-dimensional data.
In this book we will default to the simple heuristic
of applying weight decay on all layers of a deep network.

## Summary

* Regularization is a common method for dealing with overfitting. It adds a penalty term to the loss function on the training set to reduce the complexity of the learned model.
* One particular choice for keeping the model simple is weight decay using an $\ell_2$ penalty. This leads to weight decay in the update steps of the learning algorithm.
* DJL provides automatic weight decay functionality in the optimizer by setting the hyperparameter `wd`.
* You can have different optimizers within the same training loop, e.g., for different sets of parameters.


## Exercises

1. Experiment with the value of $\lambda$ in the estimation problem on this page. Plot training and test loss as a function of $\lambda$. What do you observe?
1. Use a validation set to find the optimal value of $\lambda$. Is it really the optimal value? Does this matter?
1. What would the update equations look like if instead of $\|\mathbf{w}\|^2$ we used $\sum_i |w_i|$ as our penalty of choice (this is called $\ell_1$ regularization)?
1. We know that $\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$. Can you find a similar equation for matrices (mathematicians call this the [Frobenius norm](https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm))?
1. Review the relationship between training error and generalization error. In addition to weight decay, increased training, and the use of a model of suitable complexity, what other ways can you think of to deal with overfitting?
1. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via $P(w \mid x) \propto P(x \mid w) P(w)$. How can you identify $P(w)$ with regularization?

