## Exercise 2 - Implementing the LU Decomposition

Your exercise today is straightforward. Write a function `my_lu(A)` that takes a square matrix $A$ and returns the factors $L$ and $U$ of its LU decomposition without pivoting.

* Try to vectorize the code: the innermost operation, subtracting a multiple of one row from another row, should act on a whole slice of the matrix at once. Overall, your code should need only two nested for-loops.
* Test your function by computing the residual $\|A-LU\|$. If this value is small relative to $\|A\|$, that is, on the order of machine precision up to modest growth with the matrix size, for various examples, your algorithm is very likely correct.
* Compare the speed of your LU decomposition with the speed of the `scipy.linalg.lu` function.
* You may want to use the following implementation, named `lu` here rather than `my_lu`, as a starting point for your own function.

In [1]:
def lu(A):
    
    import numpy as np
    
    # Return an error if matrix is not square
    if not A.shape[0]==A.shape[1]:
        raise ValueError("Input matrix must be square")
        
    n = A.shape[0] # The number of rows/columns of A
    
    L = np.zeros((n,n),dtype='float64') # Reserve space for L
    U = np.zeros((n,n),dtype='float64') # Reserve space for U
    U[:] = A # Copy A into U as we do not want to modify A
    np.fill_diagonal(L,1) # fill the diagonal of L with 1
    
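    for i in range(n-1):
        # Outer loop over the pivot columns
        if U[i,i] == 0:
            raise ValueError("Zero pivot encountered; LU without pivoting fails")
        for j in range(i+1,n):
            # Inner loop over the rows below the pivot row
            L[j,i] = U[j,i]/U[i,i] # Multiplier that eliminates U[j,i]
            U[j,i:] = U[j,i:]-L[j,i]*U[i,i:] # Vectorized update of row j
            U[j,i] = 0 # Remove any rounding residue below the diagonal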
    return (L,U)
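
The row-by-row update above can be vectorized one step further. The following is a minimal sketch, not part of the exercise itself: it computes all multipliers of a column at once and applies a single rank-1 update to the trailing block, so only one explicit loop remains. The name `lu_outer` and the diagonally dominant test matrix are assumptions made for illustration; diagonal dominance is chosen so that elimination without pivoting is safe.

```python
import numpy as np

def lu_outer(A):
    """LU decomposition without pivoting, one rank-1 update per column.

    Hypothetical variant for illustration only.
    """
    A = np.asarray(A, dtype='float64')
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("Input matrix must be square")
    n = A.shape[0]
    L = np.eye(n)   # unit lower triangular factor
    U = A.copy()    # work on a copy so that A is not modified
    for i in range(n - 1):
        if U[i, i] == 0:
            raise ValueError("Zero pivot encountered; pivoting would be required")
        # All multipliers for column i at once
        L[i+1:, i] = U[i+1:, i] / U[i, i]
        # Subtract (multipliers) x (pivot row) from the trailing block
        U[i+1:, i:] -= np.outer(L[i+1:, i], U[i, i:])
        # Remove rounding residue below the diagonal
        U[i+1:, i] = 0.0
    return L, U

# Strictly diagonally dominant test matrix (assumption: avoids the need for pivoting)
rng = np.random.default_rng(0)
n = 50
A = rng.random((n, n)) + n * np.eye(n)
L, U = lu_outer(A)
rel_residual = np.linalg.norm(A - L @ U) / np.linalg.norm(A)
print("Relative residual: {0}".format(rel_residual))
```

Dividing the residual by $\|A\|$ makes the result independent of the scaling of the test matrix, which is convenient when trying several examples.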

In [4]:
# Test implementation
import numpy as np
from scipy.linalg import lu as scipy_lu

n = 200
A = np.random.rand(n,n)
L,U = lu(A)
residual = np.linalg.norm(A-np.dot(L,U))
print("Implementation test")
print("The residual is {0}".format(residual))
print("===========")

# Now do some timing comparisons

runtime_my_lu = %timeit -o lu(A)
runtime_scipy_lu = %timeit -o scipy_lu(A)


print("Time for my LU decomposition is {0} s".format(runtime_my_lu.best))

print("Time for Scipy LU decomposition is {0} s".format(runtime_scipy_lu.best))

print("Speed ratio: {0}".format(runtime_my_lu.best / runtime_scipy_lu.best))




Implementation test
The residual is 1.9058066906814195e-12
53.8 ms ± 3.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
578 µs ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Time for my LU decomposition is 0.05074618400012696 s
Time for Scipy LU decomposition is 0.0004466583260000334 s
Speed ratio: 113.61298121222791


On my computer, SciPy is roughly 100 times faster than our own implementation of the LU decomposition for larger dimensions (a factor of about 114 for $n=200$ in the run above). We will later learn more details of why this is the case.
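
When comparing results rather than run times, keep in mind that `scipy.linalg.lu` uses partial pivoting and by default returns three factors with $A = PLU$, where $P$ is a permutation matrix. Its $L$ and $U$ therefore generally differ from the factors of an unpivoted decomposition. A small check, with the random test matrix and seed chosen only for illustration:

```python
import numpy as np
from scipy.linalg import lu as scipy_lu

rng = np.random.default_rng(1)
A = rng.random((5, 5))

# SciPy returns the permutation matrix P together with L and U
P, L, U = scipy_lu(A)

# The product of all three factors reproduces A
print(np.allclose(A, P @ L @ U))

# With partial pivoting, no multiplier in L exceeds 1 in magnitude
print(np.max(np.abs(L)))
```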

**Note**: The `%timeit` magic used above is an IPython feature. It is somewhat more involved, but very accurate because it repeats the measurement many times. A simpler but less accurate way of measuring execution times is as follows:

    import time
    start_time = time.time()
    # your code
    elapsed_time = time.time() - start_time
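
A middle ground between the two approaches is to repeat the simple measurement a few times and keep the best result. The sketch below assumes `time.perf_counter`, which is the standard-library clock intended for measuring short durations. The helper name `best_time` and the choice of five repetitions are hypothetical.

```python
import time
import numpy as np
from scipy.linalg import lu as scipy_lu

def best_time(func, *args, repeats=5):
    """Return the smallest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

A = np.random.rand(200, 200)
elapsed = best_time(scipy_lu, A)
print("Best of 5 runs: {0} s".format(elapsed))
```

Taking the minimum rather than the mean reduces the influence of other processes running on the machine during the measurement.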