Training a model means finding the weights (parameters) that minimize the loss.

The expression "argmin loss(w)" denotes the value of the parameter vector "w" at which a given loss function attains its minimum. Note that it refers to the minimizing argument "w" itself, not to the minimum value of the loss. This notation is common in optimization problems.

Let's break down the components:

- "argmin": This is short for "argument of the minimum." It refers to the input value (or argument) that minimizes the function following it.

- "loss(w)": This is the loss function that depends on the parameter vector "w." The loss function quantifies the discrepancy between the model's predictions and the actual (ground truth) values. In the context of machine learning and optimization, the goal is to find the parameter values that minimize this loss function.

So, "argmin loss(w)" means the parameter vector "w" that minimizes the loss function. Mathematically, it can be written as:

```
w_optimal = argmin(loss(w))
```
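
To make the notation concrete, here is a minimal brute-force sketch in Python. It assumes a toy dataset where y = 2x and a one-parameter model ŷ = x * w; the helper `total_loss` and the candidate grid are illustrative choices, not a recommended way to train a model.

```python
# Brute-force illustration of argmin for the one-parameter model y_hat = x * w.
# Assumption: toy data following y = 2x.
X = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

def total_loss(w):
    # Sum of squared errors over the whole dataset for a given w.
    return sum((x_val * w - y_val) ** 2 for x_val, y_val in zip(X, y))

# Hypothetical candidate grid: 0.0, 0.1, ..., 4.0
candidates = [i / 10 for i in range(41)]

# argmin: the candidate w with the smallest loss (not the loss value itself)
w_optimal = min(candidates, key=total_loss)
print(w_optimal)  # prints 2.0
```

Trying every candidate only works for a tiny search space like this one, which is why iterative methods are used in practice.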

In practical machine learning and optimization tasks, finding the exact value of "w" that minimizes the loss function might involve complex mathematical computations and iterative optimization algorithms. These algorithms, such as gradient descent, stochastic gradient descent, or more advanced methods like Adam, iteratively adjust the parameter values to approach the optimal value that minimizes the loss.

The process of finding the argument that minimizes a loss function is fundamental in training machine learning models, as it allows the model to learn the best parameter values that result in accurate predictions on the given data.

<hr>

*Gradient descent* is a widely used optimization algorithm in machine learning and deep learning. It's used to minimize a given objective function, typically referred to as the "loss" or "cost" function. The main idea behind gradient descent is to iteratively adjust the parameters of a model in the direction that reduces the value of the loss function, ultimately reaching a local minimum (or potentially a global minimum) of the function.
<br>

![Alt text](gd.png)

Here's how gradient descent works:

1. **Initialization**: Start by initializing the model's parameters randomly or with some predefined values.

2. **Compute Gradient**: Calculate the gradient of the loss function with respect to each parameter. The gradient points in the direction of steepest increase of the loss function. In other words, each component indicates how quickly the loss would change if the corresponding parameter were increased by a small amount.

    $g = \frac{\partial l}{\partial w}$, where $g$ is the gradient, $l$ is the loss, and $w$ is the weight

3. **Update Parameters**: Adjust the parameters in the opposite direction of the gradient. This means subtracting a fraction of the gradient from each parameter. The fraction of the gradient that's subtracted is determined by a parameter called the learning rate.

    $w = w - \alpha \frac{\partial l}{\partial w}$

    where $\alpha$ is the learning rate.
    

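For our linear model $\hat{y} = xw$ with squared-error loss $l = (\hat{y} - y)^2$: <br>
    $g = \frac{\partial l}{\partial w} = \frac{\partial (\hat{y} - y)^2}{\partial w} = \frac{\partial (xw - y)^2}{\partial w} = 2x(xw - y)$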

![Loss Calculation](loss.png)
    
4. **Iterate**: Repeat steps 2 and 3 for a certain number of iterations or until a convergence criterion is met. The goal is to iteratively move closer to the minimum of the loss function.

    ***In each iteration, w is updated using this equation:***

    $w = w - \alpha \cdot 2x(xw - y)$

5. **Convergence**: The algorithm continues to adjust the parameters, ideally leading to convergence, where the loss function stops decreasing significantly, and the parameter values stabilize.
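
The gradient formula for the linear model, 2x(xw - y), can be sanity-checked numerically. The following is a minimal sketch that compares it against a central finite difference; the names `squared_error`, `analytic_gradient`, `numeric_gradient`, and the sample point are illustrative assumptions.

```python
# Compare the analytic gradient 2x(xw - y) with a finite-difference estimate.
# All names and the sample point below are illustrative assumptions.

def squared_error(w0, x0, y0):
    # loss = (x*w - y)^2 for a single sample
    return (x0 * w0 - y0) ** 2

def analytic_gradient(w0, x0, y0):
    # derivative of (x*w - y)^2 with respect to w
    return 2 * x0 * (x0 * w0 - y0)

def numeric_gradient(w0, x0, y0, eps=1e-6):
    # central difference: (f(w + eps) - f(w - eps)) / (2 * eps)
    return (squared_error(w0 + eps, x0, y0) - squared_error(w0 - eps, x0, y0)) / (2 * eps)

w0, x0, y0 = 1.0, 3.0, 6.0
g_analytic = analytic_gradient(w0, x0, y0)
g_numeric = numeric_gradient(w0, x0, y0)
print(g_analytic, g_numeric)  # the two values should agree closely
```

The gradient is negative at this point, so subtracting α times the gradient increases w, moving it toward the value that fits the sample.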

The choice of learning rate, batch size, and other hyperparameters can significantly influence the performance and convergence of gradient descent. Fine-tuning these hyperparameters often requires experimentation and monitoring the training process.
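
As a rough illustration of how much the learning rate matters, the sketch below runs the same per-sample update with two different values of alpha. It assumes toy data where y = 2x, so the best weight is 2; the function name `train` and the specific alpha values are illustrative choices.

```python
# Effect of the learning rate on per-sample gradient descent.
# Assumption: toy data following y = 2x, so the ideal weight is 2.0.

def train(alpha, epochs=5, w=1.0):
    X = [1.0, 2.0, 3.0, 4.0]
    y = [2.0, 4.0, 6.0, 8.0]
    for _ in range(epochs):
        for x_val, y_val in zip(X, y):
            grad = 2 * x_val * (x_val * w - y_val)  # gradient of (x*w - y)^2
            w = w - alpha * grad                     # step against the gradient
    return w

w_small = train(alpha=0.01)  # small steps: w moves steadily toward 2.0
w_large = train(alpha=0.2)   # steps too large: w overshoots and diverges
print(f'alpha=0.01 -> w={w_small}')
print(f'alpha=0.2  -> w={w_large}')
```

With the larger value, each update overshoots the minimum by more than it corrects, so the error grows from epoch to epoch instead of shrinking.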

Gradient descent forms the basis for many optimization algorithms used in training neural networks and other machine learning models. Its goal is to find the parameter values that result in the best model fit to the data and the lowest possible loss.



In [22]:
# Implementation of Gradient Descent
X = [1.0,2.0,3.0,4.0] # data
y = [2.0,4.0,6.0,8.0] # target

# weight : taking a random guess
w = 1.0

# learning rate
alpha = 0.01 

In [2]:
# forward pass for our model
def forward(x):
  # ŷ = x * w
  return x * w

In [3]:

# loss function
def loss(x,y):
  y_pred = forward(x)
  # loss = (ŷ-y)²
  return (y_pred-y) * (y_pred-y)

In [5]:
# compute gradient of the loss with respect to w
def gradient(x,y):
    # d/dw (x*w - y)² = 2x(xw - y); reads the current global weight w
    return 2*x*(x*w - y)

In [8]:
# Test before training
print(f'Prediction before training : 4 -> {forward(4)}')

Prediction before training : 4 -> 4.0


In [23]:
# Training loop: adjust the weight one sample at a time
for epoch in range(5):
    for x_val,y_val in zip(X,y):
        grad = gradient(x_val,y_val)
        # update the weight using the computed gradient and the learning rate (alpha)
        w = w - alpha * grad
        print(f'grad: {x_val},{y_val},{grad}')
        # loss on this sample, measured after the update
        l = loss(x_val,y_val)

    # l holds the loss of the last sample seen in this epoch
    print(f'Progress: {epoch}, w={w}, l={l}')

grad: 1.0,2.0,-2.0
grad: 2.0,4.0,-7.84
grad: 3.0,6.0,-16.2288
grad: 4.0,8.0,-23.657984
Progress: 0, w=1.4972678400000001, l=4.043833995172248
grad: 1.0,2.0,-1.0054643199999997
grad: 2.0,4.0,-3.9414201343999995
grad: 3.0,6.0,-8.158739678208
grad: 4.0,8.0,-11.89362939756544
Progress: 1, w=1.7472603753017344, l=1.0220370862819224
grad: 1.0,2.0,-0.5054792493965312
grad: 2.0,4.0,-1.981478657634403
grad: 3.0,6.0,-4.101660821303216
grad: 4.0,8.0,-5.979309997277575
Progress: 2, w=1.8729396625578516, l=0.2583092696146019
grad: 1.0,2.0,-0.2541206748842968
grad: 2.0,4.0,-0.9961530455464427
grad: 3.0,6.0,-2.062036804281137
grad: 4.0,8.0,-3.0059914302409467
Progress: 3, w=1.9361226821073798, l=0.06528498785847768
grad: 1.0,2.0,-0.12775463578524038
grad: 2.0,4.0,-0.5007981722781416
grad: 3.0,6.0,-1.036652216615753
grad: 4.0,8.0,-1.5112085646665179
Progress: 4, w=1.9678868180008364, l=0.016500103329782457


In [24]:
# Test after training
print(f'Prediction after training : 4 -> {forward(4)}')

Prediction after training : 4 -> 7.871547272003346
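
As a sanity check on the trained weight, this one-parameter model also has a closed-form minimizer. The sketch below assumes the total squared error over the dataset as the objective; setting its derivative with respect to w to zero gives w* = Σxy / Σx². The name `w_star` is illustrative.

```python
# Closed-form minimizer of sum((x*w - y)^2) for the model y_hat = x * w.
X = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

# d/dw sum((x*w - y)^2) = 0  =>  w* = sum(x*y) / sum(x*x)
w_star = sum(x_val * y_val for x_val, y_val in zip(X, y)) / sum(x_val * x_val for x_val in X)

print(w_star)      # prints 2.0
print(4 * w_star)  # prints 8.0, the exact target for x = 4
```

The weight reached after five epochs of gradient descent is close to this exact value, and more epochs would bring it closer.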
