# Answer 1
# Objective Function Representation

The objective function $f_\lambda(x)$ can be written as a sum of per-sample objective functions $f_i(x)$, where:

$$
f_i(x) = \frac{1}{2} \left\|a_i^T x - y_i \right\|^2 + \frac{\lambda}{2N} x^Tx
$$

Where:
- $a_i^T$ is the $i$-th row of matrix $A$, so $a_i$ is the corresponding column vector.
- $y_i$ is the $i$-th element of vector $y$.
- $\lambda$ is a regularization parameter.
- $N$ is the total number of data points.

The overall objective function $f_\lambda(x)$ is the sum of all $f_i(x)$:

$$
f_\lambda(x) = \sum_{i=1}^{N} f_i(x) = \frac{1}{2} \|Ax - y\|^2 + \frac{\lambda}{2} x^Tx
$$

The gradient of $f_\lambda(x)$ with respect to $x$ is:

$$
\nabla f_\lambda(x) = \sum_{i=1}^N (a_i^T x - y_i) a_i + \lambda x
$$

Each term of the sum is the scalar residual $(a_i^T x - y_i)$ multiplied by the vector $a_i$; the regularization term contributes $\lambda x$.
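A quick way to check a gradient formula like this one is to compare it against central finite differences. The sketch below does that on a small random problem; the sizes, the seed, and the names `f_lambda` and `grad_f_lambda` are illustrative assumptions and are not part of the lab code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 5, 3, 0.1          # small illustrative sizes, not the lab's
A = rng.standard_normal((N, d))
y = rng.standard_normal(N)
x = rng.standard_normal(d)

def f_lambda(x):
    # 0.5 * ||Ax - y||^2 + 0.5 * lam * x^T x
    r = A @ x - y
    return 0.5 * r @ r + 0.5 * lam * x @ x

def grad_f_lambda(x):
    # sum_i (a_i^T x - y_i) a_i + lam * x, written as A^T (Ax - y) + lam * x
    return A.T @ (A @ x - y) + lam * x

# Central finite differences along each coordinate direction
h = 1e-6
fd = np.array([(f_lambda(x + h * e) - f_lambda(x - h * e)) / (2 * h)
               for e in np.eye(d)])
max_err = np.max(np.abs(fd - grad_f_lambda(x)))
print("largest coordinate-wise discrepancy:", max_err)
```

Because the objective is quadratic, central differences are exact up to rounding, so the discrepancy should be tiny.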


# Answer 2
# Gradient Calculation for $f_i(x)$

The full gradient above is a sum of per-sample gradients. Denoting the gradient of $f_i(x)$ by $g_i(x)$:

$$
g_i(x) = \nabla f_i(x) = \nabla_x \left( \frac{1}{2} \|a_i^T x - y_i\|^2 + \frac{\lambda}{2N} x^Tx \right)
$$

$$
= (a_i^Tx - y_i)a_i + \frac{\lambda}{N}x
$$

The first term follows from the chain rule applied to the squared residual, and the second from $\nabla_x \left( \frac{\lambda}{2N} x^Tx \right) = \frac{\lambda}{N}x$. Summing $g_i(x)$ over $i = 1, \dots, N$ recovers $\nabla f_\lambda(x)$.
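The relation between the per-sample gradients and the full gradient can be confirmed numerically. The following is a minimal sketch on random data; the sizes, the seed, and the helper name `g_i` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, lam = 6, 4, 0.5          # illustrative sizes, not the lab's
A = rng.standard_normal((N, d))
y = rng.standard_normal(N)
x = rng.standard_normal(d)

def g_i(i, x):
    # Per-sample gradient: (a_i^T x - y_i) a_i + (lam / N) x
    return (A[i] @ x - y[i]) * A[i] + (lam / N) * x

# Full gradient of 0.5 * ||Ax - y||^2 + 0.5 * lam * x^T x
full_grad = A.T @ (A @ x - y) + lam * x
sum_of_g = sum(g_i(i, x) for i in range(N))

matches = np.allclose(sum_of_g, full_grad)
print("sum of g_i equals full gradient:", matches)
```

The $N$ copies of $\frac{\lambda}{N}x$ add up to $\lambda x$, which is why the regularizer is split evenly across the samples.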


# Answer 3

In [1]:
import numpy as np
from numpy.linalg import norm
import time
import timeit

np.random.seed(1000) # to ensure same randomness
N = 200
d = 20000 # as the failure found at this number
lambda_reg = 0.001
eps = np.random.randn(N,1)
A = np.random.randn(N,d)
#Normalize the columns
for j in range(A.shape[1]):
  A[:,j] = A[:,j]/np.linalg.norm(A[:,j])
xorig = np.ones( (d,1) )
y = np.dot(A,xorig) + eps

In [3]:
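# Objective f_lambda and its full gradient (x is a 1-D array of length d)
def f_x(x, lamda):
  # y has shape (N, 1); flatten it so A@x - y stays 1-D instead of broadcasting to (N, N)
  return (1/2)*norm(A@x - y.flatten())**2 + (1/2)*lamda*np.dot(x,x)

def grad_fx(x, lamda):
  # sum_i (a_i^T x - y_i) a_i + lamda*x, accumulated one row at a time
  grad = np.zeros(d)
  for i in range(N):
    grad = grad + (A[i]@x - y[i])[0]*A[i]
  grad = grad + lamda*x
  return grad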

In [4]:
x = np.zeros(d) # start from the zero vector (1-D, length d)
epochs = 10**4
t = 1
arr = np.arange(N) # index array
start = timeit.default_timer()
for epoch in range(epochs):
  np.random.shuffle(arr) # shuffle every epoch
  for i in arr: # pass through the data points
    # per-sample gradient g_i(x)
    gi = (A[i]@x - y[i])[0]*A[i] + (lambda_reg/N)*x
    # update x using x <- x - (1/t) * g_i(x)
    x = x - (gi/t)
    t = t+1
    if t>1e4: # restart the step-size counter
      t = 1
algtime = timeit.default_timer()- start #time is in seconds
x_opt = x
print("Total Epochs is ", epochs)
print("Time Taken is ", algtime)
print("Norm of Gradient at x opt is", norm(grad_fx(x_opt, lambda_reg)))
# Caution: A@x has shape (N,) and y has shape (N, 1), so A@x - y broadcasts to (N, N);
# norm(A@x - y.flatten())**2 gives the intended squared residual
print("||Ax_alg- y||^2  is",norm(A@x - y)**2)
print("||x_opt- xorig||^2 is   ", norm(x_opt - xorig.flatten())**2)

Total Epochs is  10000
Time Taken is  360.69493395600006
Norm of Gradient at x opt is 0.012318516350085741
||Ax_alg- y||^2  is 6274934.393931907
||x_opt- xorig||^2 is    19847.919033137703


# Optimization Report

**Total Epochs:** 10,000  
**Time Taken:** 360.69 seconds  
**Norm of Gradient at Optimal Solution:** 0.0123  
**Squared Norm of Residual (||Ax_alg - y||^2):** 6,274,934.39  
**Squared Norm of Difference Between Optimal and Original Solution (||x_opt - x_orig||^2):** 19,847.92  

**Summary:**

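The run used a fixed budget of 10,000 epochs (2,000,000 per-sample updates); there is no stopping criterion, so the epoch count is a setting rather than a convergence result. The gradient norm of $f_\lambda$ at the final iterate is 0.0123, so the iterate is close to a stationary point, and because $f_\lambda$ is strongly convex for $\lambda > 0$, its only stationary point is the global minimizer. The printed residual of 6,274,934.39 is not the true squared residual: `x` has shape `(d,)` and `y` has shape `(N, 1)`, so `A@x - y` broadcasts to an `(N, N)` array and `norm` returns its Frobenius norm. The squared distance to `xorig` is 19,847.92, against $\|x_{\text{orig}}\|^2 = 20{,}000$; with $N = 200$ equations and $d = 20000$ unknowns the problem is underdetermined, so a small gradient does not imply recovery of `xorig`.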

**Conclusion:**

The algorithm reaches a near-stationary point of $f_\lambda$ in about six minutes. The residual should be recomputed with `y.flatten()` before drawing conclusions about the fit, and a large distance to `xorig` is expected in this underdetermined setting.
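The residual printed by the cell above is computed as `norm(A@x - y)**2` with `x` of shape `(d,)` and `y` of shape `(N, 1)`. The toy example below, with made-up three-element arrays, shows what NumPy does with that combination of shapes and how flattening `y` changes the result.

```python
import numpy as np

pred = np.array([1.0, 2.0, 3.0])          # shape (3,), like A @ x with 1-D x
y_col = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like y in the notebook

diff_broadcast = pred - y_col             # broadcasts to shape (3, 3)
diff_intended = pred - y_col.ravel()      # stays shape (3,)

# For a 2-D input, np.linalg.norm defaults to the Frobenius norm
broadcast_sq = np.linalg.norm(diff_broadcast) ** 2
intended_sq = np.linalg.norm(diff_intended) ** 2

print(diff_broadcast.shape, round(broadcast_sq, 6))   # (3, 3) 12.0
print(diff_intended.shape, intended_sq)               # (3,) 0.0
```

Here the predictions match the targets exactly, yet the broadcast version reports a squared norm of 12 because it sums the squared differences over all nine prediction/target pairs.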


In [5]:
e_p_c = np.array([10**3, 10**5], dtype = int)
for epoc in e_p_c:
  x = np.zeros((d,1))
  x = x.flatten()
  t = 1
  arr = np.arange(N)
  start = timeit.default_timer()
  for epoch in range(epoc):
    # print("epoch no.: ", epoch)
    np.random.shuffle(arr) #shuffle every epoch
    for i in np.nditer(arr): #Pass through the data points
      gi = (A[i]@x - y[i])[0]*A[i] + (lambda_reg/N)*x
      x = x - (gi/t)
      # Update x using x <- x- 1/t * g_i (x)
      t = t+1
      if t>1e4:
        t = 1
  algtime = timeit.default_timer()- start #time is in seconds
  x_opt = x
  print("Total Epochs: ", epoc)
  print("Time Taken: ", algtime)
  print("Norm of Gradient at x-opt : ", norm(grad_fx(x_opt, lambda_reg)))
  print("||Ax_alglab6- y||^2  : ",norm(A@x - y)**2)
  print("||x_opt- xorig||^2 :   ", norm(x_opt - xorig.flatten())**2)
  #print the time taken, ||Ax_alglab6- y||^2, ||x_opt- xorig||^2

Total Epochs:  1000
Time Taken:  35.010605478
Norm of Gradient at x-opt :  0.023168546465511516
||Ax_alglab6- y||^2  :  6274932.05369787
||x_opt- xorig||^2 :    19847.91910130088
Total Epochs:  100000
Time Taken:  3522.6349271589997
Norm of Gradient at x-opt :  0.011548188866872306
||Ax_alglab6- y||^2  :  6274943.615264697
||x_opt- xorig||^2 :    19847.919104880646


**Observation:**

Increasing the number of epochs from 1,000 to 100,000 increases the run time from 35.01 seconds to 3,522.63 seconds, a factor of about 100, so the cost grows linearly with the number of epochs.

The norm of the gradient at the final iterate decreases from 0.0232 to 0.0115, so the additional 99,000 epochs only halve it.

The values of ||Ax_alg - y||^2 (6,274,932.05 vs 6,274,943.62) and ||x_opt - x_orig||^2 (19,847.9191 in both runs) remain almost unchanged between the two cases. Running beyond 1,000 epochs therefore has little effect on either quantity.
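The cost per epoch can be read off the two recorded run times. The snippet below only does that arithmetic, using the wall-clock values copied from the output above; the variable names are illustrative.

```python
# Wall-clock times (seconds) copied from the recorded output above
runs = {1_000: 35.010605478, 100_000: 3522.6349271589997}

per_epoch = {epochs: seconds / epochs for epochs, seconds in runs.items()}
ratio = runs[100_000] / runs[1_000]

for epochs, secs in per_epoch.items():
    print(f"{epochs:>7d} epochs: {secs * 1e3:.2f} ms per epoch")
print(f"time ratio: {ratio:.1f}")
```

Both runs cost about 35 ms per epoch, which is what a fixed amount of work per update predicts.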


# Answer 5


In [6]:
lamdas = [10**3, 10**2, 10, 1, 0.1, 0.01, 0.001]
epochs = 10**4

for lamda in lamdas:
  x = np.zeros((d,1))
  x = x.flatten()
  t = 1
  arr = np.arange(N) #index array
  start = timeit.default_timer() #start the timer
  for epoch in range(epochs):
    # print("epoch no.: ", epoch)
    np.random.shuffle(arr) #shuffle every epoch
    for i in np.nditer(arr): #Pass through the data points
      gi = (A[i]@x - y[i])[0]*A[i] + (lamda/N)*x
      x = x - (gi/t)
      # Update x using x <- x- 1/t * g_i (x)
      t = t+1
      if t>1e4:
        t = 1

  algtime = timeit.default_timer()- start #time is in seconds
  x_opt = x
  print("Total Epochs: is ", epochs)
  print("Lambda Regularizer taken is ", lamda)
  print("Time Taken is  ", algtime)
  print("Norm of Gradient at x_opt is ", norm(grad_fx(x_opt, lamda)))
  print("||Ax_alg- y||^2  is ",norm(A@x - y)**2)
  print("||x_opt- xorig||^2 is   ", norm(x_opt - xorig.flatten())**2)

Total Epochs: is  10000
Lambda Regularizer taken is  1000
Time Taken is   351.4397935750003
Norm of Gradient at x_opt is  10.536531092036121
||Ax_alg- y||^2  is  3168745.629600299
||x_opt- xorig||^2 is    19972.92904927731
Total Epochs: is  10000
Lambda Regularizer taken is  100
Time Taken is   357.01959316600005
Norm of Gradient at x_opt is  35.15603756434728
||Ax_alg- y||^2  is  3973835.750232319
||x_opt- xorig||^2 is    19883.25213217986
Total Epochs: is  10000
Lambda Regularizer taken is  10
Time Taken is   356.4242595420001
Norm of Gradient at x_opt is  171.23596525032818
||Ax_alg- y||^2  is  5899096.136836119
||x_opt- xorig||^2 is    19851.25422843475
Total Epochs: is  10000
Lambda Regularizer taken is  1
Time Taken is   352.01076549699974
Norm of Gradient at x_opt is  20.970398913874348
||Ax_alg- y||^2  is  6242588.240153459
||x_opt- xorig||^2 is    19847.914555481064
Total Epochs: is  10000
Lambda Regularizer taken is  0.1
Time Taken is   357.5772837429995
Norm of Gradient at x

**Observation:**

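For the recorded runs, the norm of the gradient at the final iterate is not monotone in $ \lambda $: it is 10.54, 35.16, 171.24, and 20.97 for $ \lambda = 1000, 100, 10, 1 $, and the $ \lambda = 0.001 $ run from Answer 3 gives 0.0123. Only the smallest $ \lambda $ ends near a stationary point, so $10^4$ epochs are not enough to reach the minimizer for the larger values of $ \lambda $. The time taken is essentially constant (351 to 358 seconds) because the number of epochs is fixed and $ \lambda $ does not change the cost of an update. The printed $ ||Ax_{\text{alg}} - y||^2 $ rises from about $3.17 \times 10^6$ to $6.24 \times 10^6$ as $ \lambda $ goes from 1000 to 1, but this is not evidence of overfitting: overfitting would show up as a smaller training residual, and the printed value comes from an `(N, N)` broadcast of `A@x - y` (`x` is 1-D, `y` is a column), so it should not be read as a measure of fit.

The value of $ ||x_{\text{opt}} - x_{\text{orig}}||^2 $ decreases slightly as $ \lambda $ decreases, from 19,972.93 at $ \lambda = 1000 $ to 19,847.91 at $ \lambda = 1 $, which is consistent with weaker shrinkage toward zero. It stays close to $ ||x_{\text{orig}}||^2 = 20000 $ throughout, so no choice of $ \lambda $ in this range brings the iterate near `xorig`.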
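A useful reference point for a sweep over $\lambda$ is the exact minimizer of $f_\lambda$, which for this quadratic objective has the standard closed form $x^\star = A^T (A A^T + \lambda I)^{-1} y$. The sketch below evaluates it on a small synthetic problem; the sizes, the seed, and the helper name `ridge_solution` are assumptions for illustration, and it is not part of ALG-LAB6.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 20, 200                     # illustrative sizes with d >> N, not the lab's
A = rng.standard_normal((N, d))
A /= np.linalg.norm(A, axis=0)     # unit-norm columns, as in the notebook
y = A @ np.ones(d) + rng.standard_normal(N)

def ridge_solution(lam):
    # Minimizer of 0.5*||Ax - y||^2 + 0.5*lam*||x||^2 via an N x N solve:
    # x* = A^T (A A^T + lam I)^{-1} y, cheaper than a d x d solve when d >> N
    return A.T @ np.linalg.solve(A @ A.T + lam * np.eye(N), y)

lams = [1e3, 1e2, 10, 1, 0.1, 0.01, 0.001]
norms, residuals, grad_norms = [], [], []
for lam in lams:
    x_star = ridge_solution(lam)
    r = A @ x_star - y             # true residual, shape (N,)
    norms.append(np.linalg.norm(x_star))
    residuals.append(r @ r)
    grad_norms.append(np.linalg.norm(A.T @ r + lam * x_star))
    print(f"lam={lam:g}  ||x*||={norms[-1]:.4f}  ||Ax*-y||^2={residuals[-1]:.3e}")
```

At the exact minimizer the gradient vanishes for every $\lambda$, the solution norm grows as $\lambda$ shrinks, and the true residual falls. Comparing an SGD iterate against these trends indicates how far it is from the minimizer.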


# Answer 6
**Answer:** Yes, ALG-LAB6 works at the failure dimension from the previous question. The failure there occurred at $d = 20000$, so I checked only that value and not larger values of $d$.

# Answer 7
**Observations:**

ALG-LAB6 is a stochastic gradient descent (SGD) method: each update uses the gradient $g_i(x)$ of a single sample rather than the full gradient.

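Its objective is to minimize the regularized least-squares function $f_\lambda(x)$, that is, the squared difference between the predicted output $Ax_{\text{alg-lab6}}$ and the actual output $y$ plus the penalty $\frac{\lambda}{2} x^Tx$, by iteratively updating the optimization variable $x$.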

The algorithm uses a decreasing step size $1/t$, where $t$ counts the updates. In this implementation $t$ is reset to 1 once it exceeds $10^4$, so the step size returns to 1 every $10^4$ updates.
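The step-size rule used in the loops above can be isolated as a small generator. This is a sketch of that rule only; the function name `step_sizes` and the parameter `reset_after` are hypothetical names introduced here.

```python
def step_sizes(num_updates, reset_after=10**4):
    """Yield 1/t for t = 1, 2, ..., restarting t at 1 once it exceeds reset_after."""
    t = 1
    for _ in range(num_updates):
        yield 1.0 / t
        t += 1
        if t > reset_after:
            t = 1

# With a short reset period the restart is easy to see
first_five = list(step_sizes(5, reset_after=3))
print(first_five)   # 1, 1/2, 1/3, then back to 1, 1/2
```

With the notebook's setting of $10^4$ and $N = 200$, the counter restarts every 50 epochs.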

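Shuffling the data points in each epoch means every sample is visited exactly once per epoch in a fresh random order, which avoids the systematic bias a fixed ordering can introduce. Since $f_\lambda$ is convex, there are no spurious local minima for the algorithm to get stuck in.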

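The reported metrics each measure a different aspect of the run: the time taken gives the computational cost, the gradient norm measures closeness to a stationary point, the squared norm of the residual measures the fit to $y$, and the squared norm of the difference measures how well `xorig` is recovered.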
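The loop that is repeated in the cells above can be wrapped in one function so that each experiment becomes a single call. The sketch below is one possible refactoring, not the lab's reference code: the name `sgd_ridge`, its parameters, and the test problem are assumptions, and the test problem scales the rows of `A` to unit norm (the notebook normalizes columns) so that a step size of 1 cannot overshoot.

```python
import numpy as np

def sgd_ridge(A, y, lam, epochs, reset_after=10**4, seed=0):
    """Per-sample updates x <- x - (1/t) * g_i(x), reshuffling every epoch."""
    rng = np.random.default_rng(seed)
    N, d = A.shape
    y = np.ravel(y)                 # accept y of shape (N,) or (N, 1)
    x = np.zeros(d)
    t = 1
    for _ in range(epochs):
        for i in rng.permutation(N):            # new random order each epoch
            g_i = (A[i] @ x - y[i]) * A[i] + (lam / N) * x
            x -= g_i / t
            t += 1
            if t > reset_after:                 # restart the step-size counter
                t = 1
    return x

# Small noise-free test problem (hypothetical sizes)
rng = np.random.default_rng(3)
N, d = 20, 5
A = rng.standard_normal((N, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm rows (assumption)
x_true = rng.standard_normal(d)
y = A @ x_true

x_sgd = sgd_ridge(A, y, lam=0.001, epochs=20)
err_start = np.linalg.norm(x_true)              # distance from the zero start
err_end = np.linalg.norm(x_sgd - x_true)
print("distance to x_true: start", err_start, "end", err_end)
```

Flattening `y` inside the function keeps every residual one-dimensional, whichever shape the caller passes.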
