


## Linear Algebra

These exercises cover vector and matrix math using the <a href="http://wiki.scipy.org/Tentative_NumPy_Tutorial">NumPy</a> Python package.

This exercise will be divided into two parts:

#### 1. Math checkup
Where you will do some of the math by hand.

#### 2. NumPy and Spark linear algebra
Where you will do some exercises using the NumPy package.

<br>
In the following exercises you will need to replace parts of the code in every cell that starts with the comment "#Replace the `<INSERT>`".

To work through the notebook, fill in each `<INSERT>` with appropriate code.
Press Shift-Enter to run a cell and advance to the next one, or Ctrl-Enter to run the cell and stay on it. Do the exercises from top to bottom, because later cells may depend on code in earlier cells.

If you want to execute these lines in a Python script, you will first need to create a Spark context:

In [None]:
# import os
# os.environ["SPARK_OPTS"] = "--driver-java-options=-Xms1024M --driver-java-options=-Xmx1536M --driver-java-options=-Dlog4j.logLevel=info"

# from pyspark import SparkContext, StorageLevel
# from pyspark.sql import SQLContext

# sc = SparkContext(master="local[*]")
# sqlContext = SQLContext(sc)


But since we are using the notebooks, those lines are not needed here.

## 1. Math checkup

### 1.1 Euclidean norm

$$
\mathbf{v} = \begin{bmatrix}
  666 \\
  1337 \\
  1789 \\
  1066 \\
  1945 \\
 \end{bmatrix}
 \qquad
 \|\mathbf{v}\| = ?
 $$

Calculate the Euclidean norm of $\mathbf{v}$ using the following definition, where $v_i$ is the $i$-th component of $\mathbf{v}$:

$$
\|\mathbf{v}\|_2 = \sqrt{\sum\limits_{i=1}^n {v_i}^2} = \sqrt{{v_1}^2+\cdots+{v_n}^2}
$$
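
As a quick sanity check of the definition, here is a minimal sketch in plain Python on a small vector that is not part of the exercise. The helper name `euclidean_norm` is only illustrative and does not come from NumPy or Spark.

```python
import math

def euclidean_norm(vec):
    """Square each component, sum the squares, and take the square root."""
    return math.sqrt(sum(x * x for x in vec))

# sqrt(1^2 + 2^2 + 2^2) = sqrt(9)
small_norm = euclidean_norm([1, 2, 2])
print(small_norm)  # 3.0
```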

In [None]:
#Replace the <INSERT>
import math
import numpy as np
v = [666, 1337, 1789, 1066, 1945]
rdd = sc.parallelize(v)
#sumOfSquares = rdd.map(<INSERT>).reduce(<INSERT>) 
sumOfSquares = rdd.map(lambda x: x*x ).reduce(lambda x,y : x+y) 
norm = math.sqrt(sumOfSquares)
# <INSERT round to 8 decimals > 
norm = format(norm, '.8f') 
# the np calculation is an added extra here - the students don't actually do it till section 2.3 
norm_numpy= np.linalg.norm(v)
print("norm: "+str(norm) +" norm_numpy: "+ str(norm_numpy))

In [None]:
#Helper functions to check results
import hashlib
def hashCheck(x, hashCompare):
    #Compare the MD5 digest of str(x) with the expected digest
    digest = hashlib.md5(str(x).encode('utf-8')).hexdigest()
    print(digest)
    if digest == hashCompare:
        print('Yay, you succeeded!')
    else:
        print('Try again!')

def check(x,y,label):
    #Exact equality check for scalar results
    if x == y:
        print("Yay, "+label+" is correct!")
    else:
        print("Nay, "+label+" is incorrect, please try again!")

def checkArray(x,y,label):
    if np.allclose(x,y):
        print("Yay, "+label+" is correct!")
    else:
        print("Nay, "+label+" is incorrect, please try again!")

In [None]:
#Check if the norm is correct
hashCheck(norm_numpy, '6de149ccbc081f9da04a0bbd8fe05d8c')

### 1.2 Transpose

$$
\mathbf{A} = \begin{bmatrix}
  1 & 2 & 3\\
  4 & 5 & 6\\
  7 & 8 & 9\\
 \end{bmatrix}
 \qquad
 \mathbf{A}^T = ?
$$

Transpose is an operation on matrices that swaps the rows and the columns.

$$
\begin{bmatrix}
  2 & 7 \\
  3 & 11\\
  5 & 13\\
 \end{bmatrix}^T
 \Rightarrow
 \begin{bmatrix}
  2 & 3 & 5 \\
  7 & 11 & 13\\
 \end{bmatrix}
$$
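
The same example can be reproduced in plain Python. This is a small sketch, assuming the matrix is stored as a list of rows; the names `M` and `M_t` are illustrative.

```python
# The 3x2 example matrix from above, stored as a list of rows
M = [[2, 7],
     [3, 11],
     [5, 13]]

# zip(*M) groups the first entries of every row, then the second entries,
# so each group is one column of M and therefore one row of the transpose
M_t = [list(column) for column in zip(*M)]
print(M_t)  # [[2, 3, 5], [7, 11, 13]]
```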

Transpose $\mathbf{A}$ by hand and enter the result:

In [None]:
#Replace the <INSERT>
#Input At like this: At = [[1, 2, 3],[4, 5, 6],[7, 8, 9]]
#At = <INSERT>

A= np.matrix([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
print(A)
print("\n")
At = np.matrix.transpose(A)
print (At)

At =[[1,4, 7],[2, 5, 8],[3, 6, 9]]
print("\n")
print (At)

In [None]:
#Check if the transpose is correct
hashCheck(At, '1c8dc4c2349277cbe5b7c7118989d8a5')

### 1.3 Scalar matrix multiplication

$$
\mathbf{A} = 3\times\begin{bmatrix}
  1 & 2 & 3\\
  4 & 5 & 6\\
  7 & 8 & 9\\
 \end{bmatrix}
=?
\qquad
\mathbf{B} = 5\times\begin{bmatrix}
  1\\
  -4\\
  7\\
 \end{bmatrix}
=?
$$

The operation is done element-wise: if $k\times\mathbf{A}=\mathbf{C}$, then $c_{i,j}=k\times a_{i,j}$.

$$
 2
 \times
 \begin{bmatrix}
  1 & 6 \\
  4 & 8 \\
 \end{bmatrix} 
 = 
 \begin{bmatrix}
  2\times1& 2\times6 \\
  2\times4 & 2\times8\\
 \end{bmatrix}
 =
 \begin{bmatrix}
  2& 12 \\
  8 & 16\\
 \end{bmatrix}
 $$
 
 $$
 11
 \times
 \begin{bmatrix}
  2  \\
  3  \\
  5  \\
 \end{bmatrix} 
 = 
 \begin{bmatrix}
  11\times2  \\
  11\times3  \\
  11\times5  \\
 \end{bmatrix}
 =
 \begin{bmatrix}
 22\\
 33\\
 55\\
 \end{bmatrix}
 $$
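
The two worked examples above can be checked with a short plain-Python sketch. The variable names are illustrative, and nested lists stand in for the matrix.

```python
# Scalar times matrix: multiply every entry of every row by k
k = 2
M = [[1, 6],
     [4, 8]]
kM = [[k * entry for entry in row] for row in M]
print(kM)  # [[2, 12], [8, 16]]

# Scalar times vector: same idea with a single list
w = [2, 3, 5]
kw = [11 * entry for entry in w]
print(kw)  # [22, 33, 55]
```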

Do the scalar multiplications of $\mathbf{A}$ and $\mathbf{B}$ by hand and write them in:

In [None]:
#Replace the <INSERT>
#Input A like this: A = [[1, 2, 3],[4, 5, 6],[7, 8, 9]]
#And B like this: B = [1, -4, 7]

#A = <INSERT>
#B = <INSERT>

A = np.array([[1, 2, 3],[4, 5, 6],[7, 8, 9]])
print(3*A)
print ("\n")
B = np.array([1, -4, 7])
print (5*B)
print ("\n")

A = [[ 3,  6,  9], [12, 15,18], [21, 24, 27]]
B = [5, -20, 35]

In [None]:
#Check if the scalar matrix multiplication is correct
hashCheck(A, '91b9508ec9099ee4d2c0a6309b0d69de')
hashCheck(B, '88bddc0ee0eab409cee011770363d007')

### 1.4 Dot product
$$
c_1=\begin{bmatrix}
  11  \\
  2  \\
 \end{bmatrix} 
 \cdot
 \begin{bmatrix}
  3 \\
  5 \\
 \end{bmatrix}
 =?
\qquad
c_2=\begin{bmatrix}
  1  \\
  2  \\
  3  \\
 \end{bmatrix} 
 \cdot
 \begin{bmatrix}
  4 \\
  5 \\
  6 \\
 \end{bmatrix}
 =?
$$
The dot product multiplies corresponding elements and sums the results: if $\mathbf{v}\cdot\mathbf{w}=k$, then $k=\sum_i v_i \times w_i$.

$$
 \begin{bmatrix}
  2  \\
  3  \\
  5  \\
 \end{bmatrix} 
 \cdot
 \begin{bmatrix}
  1 \\
  4 \\
  6 \\
 \end{bmatrix}
 = 2\times1+3\times4+5\times6=44
 $$
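
The worked example can be traced step by step in plain Python. This is only a sketch; the names `a`, `b` and `products` are illustrative.

```python
a = [2, 3, 5]
b = [1, 4, 6]

# Multiply corresponding elements, then sum the products
products = [ai * bi for ai, bi in zip(a, b)]
print(products)  # [2, 12, 30]

dot = sum(products)
print(dot)  # 44
```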
 
 Calculate the values of $c_1$ and $c_2$ by hand and write them in:

In [None]:
#Replace the <INSERT>
#Input c1 and c2 like this: c = 1337
#c1 = <INSERT>
#c2 = <INSERT>

c1_1 = np.array([11,2])
c1_2 = np.array([3,5])
c1 = c1_1.dot(c1_2)
print (c1)
c1 = 43
c2_1 = np.array([1,2,3])
c2_2 = np.array([4,5,6])
c2 = c2_1.dot(c2_2)
print (c2)
c2 = 32


In [None]:
#Check if the dot product is correct
hashCheck(c1, '17e62166fc8586dfa4d1bc0e1742c08b')
hashCheck(c2, '6364d3f0f495b6ab9dcf8d3b5c6e0b01')

### 1.5 Matrix multiplication
 $$
 \mathbf{A}=
 \begin{bmatrix}
 682 &  848 & 794 & 954 \\
 700 & 1223 & 1185 &  816 \\
 942 & 428 &  324 &  526 \\
 321 &  543 &  532 &  614 \\
 \end{bmatrix}
 \qquad
  \mathbf{B}=
 \begin{bmatrix}
  869 & 1269 & 1306 & 358 \\
  1008 & 836 & 690 & 366 \\
  973 & 619 &  407 & 1149 \\
  323 & 42 & 405 & 117 \\
 \end{bmatrix}
 \qquad
 \mathbf{A}\times\mathbf{B}=\mathbf{C}=?
 $$

The $c_{i,j}$ entry is the dot product of the $i$-th row of $\mathbf{A}$ and the $j$-th column of $\mathbf{B}$.
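
To see the rule on a small case first, the sketch below computes a single entry of a 2x2 product. The matrices `P` and `Q` and the helper `entry` are made up for illustration and are not part of the exercise.

```python
P = [[1, 2],
     [3, 4]]
Q = [[5, 6],
     [7, 8]]

def entry(left, right, i, j):
    """Dot product of row i of `left` with column j of `right`."""
    row = left[i]
    column = [right_row[j] for right_row in right]
    return sum(p * q for p, q in zip(row, column))

# Row 0 of P is [1, 2], column 1 of Q is [6, 8]: 1*6 + 2*8
print(entry(P, Q, 0, 1))  # 22
```

Repeating this for every pair (i, j) gives the full product, which is what the three nested loops in the exercise do.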

Calculate $\mathbf{C}$ by implementing the naive matrix multiplication algorithm, which runs in $\mathcal{O}(n^3)$ time, using the three nested for-loops below:

In [None]:
# The convention is to import NumPy as the alias np
import numpy as np

In [None]:
A = [[ 682,  848,  794,  954],
     [ 700, 1223, 1185,  816],
     [ 942,  428,  324,  526],
     [ 321,  543,  532,  614]]

B = [[ 869, 1269, 1306,  358],
     [1008,  836,  690,  366],
     [ 973,  619,  407, 1149],
     [ 323,   42,  405,  117]]

C = [[0]*4 for i in range(4)]

#Iterate through rows of A
for i in range(len(A)):
    #Iterate through columns of B
    for j in range(len(B[0])):
        #Iterate through rows of B
        for k in range(len(B)):
            C[i][j] += A[i][k] * B[k][j]
        
print(np.matrix(C))
print(np.matrix(A)*np.matrix(B))

In [None]:
#Check if the matrix multiplication is correct
hashCheck(C, 'f6b7b0500a6355e8e283f732ec28fa76')

## 2. NumPy and Spark linear algebra

<a href="http://wiki.scipy.org/Tentative_NumPy_Tutorial">NumPy</a> is a Python library for working with arrays. The library is optimized to be fast and memory efficient, and provides abstractions corresponding to vectors, matrices, and the operations on these objects.

NumPy's array class is called <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html">ndarray</a>; it is also known by the alias array. An ndarray is a fixed-size multidimensional array whose elements all have the same numerical type, e.g. floats or integers.
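
The "one type" property is easy to observe. In the sketch below (illustrative names only), mixing an integer list with a single float makes NumPy store every element as a float.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])

# 'i' means a signed integer type; the exact width depends on the platform
print(ints.dtype.kind)  # i
# One float in the input promotes the whole array to float64
print(mixed.dtype)      # float64
```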


### 2.1 Scalar matrix multiplication using NumPy

$$
\mathbf{A} = \begin{bmatrix}
  1 & 2 & 3\\
  4 & 5 & 6\\
  7 & 8 & 9\\
 \end{bmatrix}
\quad
5\times\mathbf{A}=\mathbf{C}=?
\qquad
\mathbf{B} = \begin{bmatrix}
  1&-4& 7\\
 \end{bmatrix}
 \quad
3\times\mathbf{B}=\mathbf{D}=?
$$

Using the <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html">np.array()</a> function, create the above matrix $\mathbf{A}$ and vector $\mathbf{B}$ and multiply them by 5 and 3, respectively.

Note that if you use a Python list of integers to create an array you will get a one-dimensional array, which is, for our purposes, equivalent to a vector.
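
A short sketch of the difference, using made-up values: a flat list gives a one-dimensional array, while a list of lists gives a two-dimensional one.

```python
import numpy as np

vec = np.array([2, 3, 5])                 # flat list -> 1-D array
mat = np.array([[1, 2, 3], [4, 5, 6]])    # list of lists -> 2-D array

print(vec.ndim, vec.shape)  # 1 (3,)
print(mat.ndim, mat.shape)  # 2 (2, 3)
```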

Calculate C and D by inputting the following statements:

In [None]:
#Replace the <INSERT>. You will use np.array()
A = np.array([[1, 2, 3],[4,5,6],[7,8,9]])
B = np.array([1,-4, 7])
C = A *5
D = 3 * B
print(A)
print(B)
print(C)
print(D)

In [None]:
#Check if the scalar matrix multiplication is correct
checkArray(C,[[5, 10, 15],[20, 25, 30],[35, 40, 45]], "the scalar multiplication")
checkArray(D,[3, -12,  21], "the scalar multiplication")

### 2.2 Dot product and element-wise multiplication

Both the dot product and element-wise multiplication are supported by ndarrays.

Element-wise multiplication is the default between two arrays of the same shape when using the operator *.

For the dot product you can use either <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html#numpy.dot">np.dot()</a> or <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.dot.html">np.ndarray.dot()</a>. For one-dimensional arrays the dot product is commutative, i.e. the order of the arrays does not matter: if you have the one-dimensional ndarrays x and y, you can write the dot product in any of the following four ways: np.dot(x, y), np.dot(y, x), x.dot(y), or y.dot(x).
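
The difference between the two operations, and the four equivalent spellings of the dot product for one-dimensional arrays, can be seen in this small sketch. The values are illustrative and differ from the exercise below.

```python
import numpy as np

x = np.array([2, 3, 5])
y = np.array([1, 4, 6])

# Element-wise: one product per position, result has the same shape
print(x * y)         # [ 2 12 30]

# Dot product: the products summed into a single number
print(np.dot(x, y))  # 44

# For 1-D arrays all four spellings agree
print(np.dot(y, x), x.dot(y), y.dot(x))  # 44 44 44
```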

Calculate the element wise product and the dot product by filling in the following statements:

In [None]:
#Replace the <INSERT>
u = np.arange(0, 5)
v = np.arange(5, 10)
elementWise = np.multiply(u,v)
dotProduct = np.dot(u,v)
print(elementWise)
print(dotProduct)

In [None]:
#Check if the dot product and element wise is correct
checkArray(elementWise,[0,6,14,24,36], "the element wise multiplication")
check(dotProduct, 80, "the dot product")

### 2.3 Cosine similarity
The cosine similarity between two vectors is defined by the following equation:

$$
cosine\_similarity(\mathbf{u},\mathbf{v})=\cos\theta=\frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\|\mathbf{v}\|}
$$

The norm of a vector $\|v\|$ can be calculated by using <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html#numpy.linalg.norm">np.linalg.norm()</a>.
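
As a brief illustration on a made-up vector, `np.linalg.norm()` with its default arguments returns the Euclidean norm of a one-dimensional array, which equals the square root of the vector's dot product with itself.

```python
import numpy as np

p = np.array([3, 4])

# Default arguments give the Euclidean (2-)norm for a 1-D array
print(np.linalg.norm(p))      # 5.0

# Same value from the definition: sqrt(p . p) = sqrt(9 + 16)
print(np.sqrt(np.dot(p, p)))  # 5.0
```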

Implement the following function that calculates the cosine similarity:

In [None]:
def cosine_similarity(u,v):
    dotProduct = np.dot(u,v)
    normProduct = np.linalg.norm(u)*np.linalg.norm(v)
    return dotProduct/normProduct

u = np.array([2503,2992,1042])
v = np.array([2217,2761,990])

w = np.array([0,1,1])
x = np.array([1,0,1])

uv = cosine_similarity(u,v)
wx = cosine_similarity(w,x)

print(uv)
print(wx)

In [None]:
#Check if the cosine similarity is correct
check(round(uv,5),0.99974,"cosine similarity between u and v")
check(round(wx,5),0.5,"cosine similarity between w and x")

### 2.4 Matrix math
To represent matrices, you can use the <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html">np.matrix()</a> class. To create a matrix object, pass it a two-dimensional ndarray, a list of lists, or a string such as '1 2; 3 4'. For matrix objects the operator * performs matrix multiplication instead of element-wise multiplication.

To transpose a matrix, you can use either <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.transpose.html">np.matrix.transpose()</a> or <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.T.html">.T</a> on the matrix object.

To calculate the inverse of a matrix, you can use <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.inv.html">np.linalg.inv()</a> or <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.I.html">.I</a> on the matrix object. Remember that the inverse is only defined for square matrices and does not always exist (for conditions equivalent to invertibility, see <a href="https://en.wikipedia.org/wiki/Invertible_matrix#The_invertible_matrix_theorem">the invertible matrix theorem</a>); if it does not exist, NumPy raises a LinAlgError. If you multiply the original matrix by its inverse, you get the identity matrix, which is a square matrix with ones on the main diagonal and zeros elsewhere, i.e. $\mathbf{A} \mathbf{A}^{-1} = \mathbf{I_n}$.
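
Both cases can be tried on small made-up matrices. This is a minimal sketch: `M` is invertible, while the second row of `singular` is twice its first row, so no inverse exists. The flag `is_singular` is only there to record what happened.

```python
import numpy as np

M = np.matrix([[2.0, 1.0],
               [1.0, 1.0]])
M_inv = np.linalg.inv(M)

# For np.matrix objects, * is matrix multiplication;
# the product is the 2x2 identity up to rounding error
print(M * M_inv)

singular = np.matrix([[1.0, 2.0],
                      [2.0, 4.0]])
is_singular = False
try:
    np.linalg.inv(singular)
except np.linalg.LinAlgError as err:
    is_singular = True
    print("not invertible:", err)
```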

In the following exercise, calculate $\mathbf{A}^T$, multiply $\mathbf{A}$ by it, invert the product $\mathbf{AA}^T$, and finally multiply $\mathbf{AA}^T[\mathbf{AA}^T]^{-1}=\mathbf{I}_n$ to get the identity matrix:


In [None]:
#Replace the <INSERT>

#We generate a Vandermonde matrix (np.matrix is used because the np.mat alias was removed in NumPy 2.0)
A = np.matrix(np.vander([2,3], 5))
print(A)

#Calculate the transpose of A
At = np.transpose(A)
print(At)

#Calculate the multiplication of A and A^T
AAt = np.dot(A,At)
print(AAt)

#Calculate the inverse of AA^T
AAtInv = np.linalg.inv(AAt)
print(AAtInv)

#Calculate the multiplication of AA^T and (AA^T)^-1
I = np.dot(AAt,AAtInv)
print(I)

#To get the identity matrix we round it because of numerical precision
I = I.round(13)

In [None]:
#Check if the matrix math is correct
checkArray(I,[[1.,0.], [0.,1.]], "the matrix math")

### 2.5 Slices

It is possible to select subsets of one-dimensional arrays using <a href="http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html">slices</a>. The basic syntax for slices is $\mathbf{v}$[i:j:k], where i is the starting index, j is the stopping index (not included in the result), and k is the step ($k\neq0$). If k is not specified, it defaults to 1. For a positive step, i defaults to 0 and j defaults to the end of the array.

For example [0,1,2,3,4][:3] = [0,1,2] i.e. the three first elements of the array. You can use negative indices also, for example [0,1,2,3,4][-3:] = [2,3,4] i.e. the three last elements.

The following function can be used to concatenate two or more arrays: <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html">np.concatenate</a>. The syntax is np.concatenate((a1, a2, ...)).
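
The slice syntax and `np.concatenate` can be tried on a throwaway array. The names and values in this sketch are illustrative and differ from the exercise below.

```python
import numpy as np

a = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]

print(a[2:8:2])     # start 2, stop before 8, step 2 -> [2 4 6]
print(a[::3])       # every third element -> [0 3 6 9]
print(a[-3:])       # the last three elements -> [7 8 9]

# Splitting at one index and concatenating restores the original array
head, tail = a[:4], a[4:]
rejoined = np.concatenate((head, tail))
print(rejoined)     # [0 1 2 3 4 5 6 7 8 9]
```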

Slice the following array into three pieces and concatenate them to form the original array:


In [None]:
#Replace the <INSERT>
v = np.arange(1, 9)
print(v)
#The first two elements of v
v1 = v[:2]

#The last two elements of v
v3 = v[-2:]

#The middle four elements of v
v2 = v[2:6]
print(v1)
print(v2)
print(v3)
print(v1,v2,v3)
#Concatenating the three vectors to get the original array
u = np.concatenate((v1, v2, v3))

In [None]:
#Check if the slices are correct
checkArray(u,[1,2,3,4,5,6,7,8], "the slicing")

### 2.6 Stacking
The NumPy library provides many functions to <a href="http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html">manipulate</a> existing arrays. We will try out two of them: <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html">np.hstack()</a>, which takes two or more arrays and stacks them horizontally to make a single array (column-wise; for one-dimensional arrays this is equivalent to np.concatenate), and <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html">np.vstack()</a>, which takes two or more arrays and stacks them vertically (row-wise). The syntax is np.hstack((a1, a2, ...)) and np.vstack((a1, a2, ...)).
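
The effect on the shape is easiest to see with two short arrays. This is a small sketch with illustrative names and values.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

side_by_side = np.hstack((a, b))   # one longer 1-D array
stacked = np.vstack((a, b))        # each input becomes one row

print(side_by_side, side_by_side.shape)  # [1 2 3 4 5 6] (6,)
print(stacked.shape)                     # (2, 3)
```

Note that stacking one-dimensional arrays horizontally yields a one-dimensional result of shape (6,), not a two-dimensional 1x6 array.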

Stack the following two arrays $\mathbf{u}$ and $\mathbf{v}$ to create a 1x20 and a 2x10 array:

In [None]:
#Replace the <INSERT>
u = np.arange(1, 11)
v = np.arange(11, 21)

#A 1x20 array
oneRow = np.hstack((u,v))
print(oneRow)

#A 2x10 array
twoRows = np.vstack((u,v))
print(twoRows)

In [None]:
#Check if the stacks are correct
checkArray(oneRow,[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], "the hstack")
checkArray(twoRows,[[1,2,3,4,5,6,7,8,9,10],[11,12,13,14,15,16,17,18,19,20]], "the vstack")

### 2.7 PySpark's DenseVector
In PySpark there exists a <a href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseVector">DenseVector</a> class within the module <a href="https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.linalg">pyspark.mllib.linalg</a>. The DenseVector stores the values as a NumPy array and delegates the calculations to this object. You can create a new DenseVector by using DenseVector() and passing it a NumPy array or a Python list.

The DenseVector class implements several functions. An important one is the dot product, DenseVector.dot(), which operates just like np.ndarray.dot().

The DenseVector stores all values as np.float64, so even if you pass it an integer vector, the resulting vector will contain floats. A DenseVector can be used in a distributed setting either by passing functions that contain it to resilient distributed dataset (RDD) transformations or by distributing DenseVectors directly as RDDs.

Create the DenseVector $\mathbf{u}$ containing the 10 elements [0.1,0.2,...,1.0] and the DenseVector $\mathbf{v}$ containing the 10 elements [1.0,2.0,...,10.0] and calculate the dot product of $\mathbf{u}$ and $\mathbf{v}$: 

In [None]:
#To use the DenseVector first import it
from pyspark.mllib.linalg import DenseVector

In [None]:
#Replace the <INSERT>
#[0.1,0.2,...,1.0]
u = DenseVector((0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1))
# or 
# u = np.arange(0.1,1.1,0.1) # arange isn't inclusive, stop needs to be end+interval
print(u)

#[1.0,2.0,...,10.0]
v = DenseVector((1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0))
# or
# np.arange(1,11,1) # arange isn't inclusive, stop needs to be end+interval
print(v)

#The dot product between u and v
dotProduct = np.dot(u,v)

In [None]:
#Check if the dense vectors are correct
check(dotProduct, 38.5, "the dense vectors")