# Tools
1. NumPy
2. Matplotlib

In [None]:
import numpy as np
np.set_printoptions(precision=2)  # reduced display precision on numpy arrays

# Problem Statement

We will use the example of housing price prediction. The training dataset contains three examples with four features (size, bedrooms, floors, and age), shown in the table below. Note that, unlike the earlier labs, size is in sqft rather than 1000 sqft. This causes an issue, which you will solve in the next lab!

| Size (sqft) | Number of bedrooms | Number of floors | Age of home (years) | Price (1000s of dollars) |
| ----------- | ------------------ | ---------------- | ------------------- | ------------------------ |
| 2104        | 5                  | 1                | 45                  | 460                      |
| 1416        | 3                  | 2                | 40                  | 232                      |
| 852         | 2                  | 1                | 35                  | 178                      |

We will build a linear regression model from these values so that we can then predict the price of other houses, for example a house with 1200 sqft, 3 bedrooms, and 1 floor that is 40 years old.

In [None]:
# Training Data (multiple features)
x_train = np.array([[2104, 5, 1, 45], 
                    [1416, 3, 2, 40], 
                    [852, 2, 1, 35]])
# Target data
y_train = np.array([460, 232, 178])

## Matrix X containing our Training Data
Similar to the table above, examples are stored in a NumPy array `x_train`. Each row of the array represents one example. With $m$ training examples ($m$ is 3 in our example) and $n$ features ($n$ is 4 in our example), $\mathbf{X}$ is a matrix with dimensions ($m$, $n$), that is, $m$ rows and $n$ columns.


$$\mathbf{X} = 
\begin{pmatrix}
 x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ 
 x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\
 \vdots & \vdots & \ddots & \vdots \\
 x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} 
\end{pmatrix}
$$
Notation:
- $\mathbf{x}^{(i)}$ is the vector containing example $i$: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots, x^{(i)}_{n-1})$
- $x^{(i)}_j$ is element $j$ in example $i$. The superscript in parentheses indicates the example number, while the subscript indicates the element (feature).
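
As a small illustration of this notation in NumPy (a sketch that re-creates the training matrix from the cell above so that it runs on its own), indexing a row yields $\mathbf{x}^{(i)}$ and indexing a row and a column yields $x^{(i)}_j$. The names `x_1` and `x_1_0` are introduced here only for the example.

```python
import numpy as np

# Same training matrix as in the cell above, repeated so this snippet is self-contained
x_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [852, 2, 1, 35]])

m, n = x_train.shape      # m examples (rows), n features (columns)
x_1 = x_train[1]          # x^(1): the whole second example, shape (n,)
x_1_0 = x_train[1, 0]     # x^(1)_0: feature 0 (size in sqft) of the second example

print(f"m = {m}, n = {n}")
print(f"x^(1)   = {x_1}")
print(f"x^(1)_0 = {x_1_0}")
```

Because indices start at 0, the last example is `x_train[m-1]` and the last feature of any example is at column `n-1`, matching the matrix above.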

In [None]:
print(f"X-Shape = {x_train.shape}")
print(f"Y-Shape = {y_train.shape}")
print(f"X-train = \n{x_train}")
print(f"Y-train = {y_train}")

X-Shape = (3, 4)
Y-Shape = (3,)
X-train = 
[[2104    5    1   45]
 [1416    3    2   40]
 [ 852    2    1   35]]
Y-train = [460 232 178]


## Parameter vector w, b

* $\mathbf{w}$ is a vector with $n$ elements.
  - Each element contains the parameter associated with one feature.
  - In our dataset, $n$ is 4.
  - Notionally, we draw this as a column vector:

$$\mathbf{w} = \begin{pmatrix}
w_0 \\ 
w_1 \\
\cdots\\
w_{n-1}
\end{pmatrix}
$$
* $b$ is a scalar parameter.

In [None]:
# For demonstration, w and b are loaded with initial values that are near the optimum. Note that w is a 1-D NumPy vector.
w_init = np.array([ 0.39133535, 18.75376741, -53.36032453, -26.42131618])
b_init = 785.1811367994083
print(f"w_init shape: {w_init.shape}, w_init type: {type(w_init)}")
print(f"b_init type: {type(b_init)}")

w_init shape: (4,), w_init type: <class 'numpy.ndarray'>
b_init type: <class 'float'>


## 1 Model Prediction With Multiple Variables
The model's prediction with multiple variables is given by the linear model:

$$ f_{\mathbf{w},b}(\mathbf{x}) =  w_0x_0 + w_1x_1 +... + w_{n-1}x_{n-1} + b \tag{1}$$
or in vector notation:
$$ f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b  \tag{2} $$ 
where $\cdot$ is the vector dot product.

To demonstrate the dot product, we will implement prediction using (1) and (2).
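
Equation (1) can be sketched as an explicit loop over the features. The function name `compute_model_loop` is hypothetical and chosen here for illustration; the parameter values are the near-optimal `w_init` and `b_init` defined earlier, and `x_0` is the first training example.

```python
import numpy as np

def compute_model_loop(x, w, b):
    """Prediction for one example using the explicit sum in equation (1).

    Illustrative helper, not part of the original lab code.
    """
    n = x.shape[0]
    f_wb = 0.0
    for j in range(n):
        f_wb = f_wb + w[j] * x[j]   # accumulate w_j * x_j
    return f_wb + b                 # add the bias once, after the sum

# First training example and the near-optimal parameters from the cells above
x_0 = np.array([2104, 5, 1, 45])
w_init = np.array([0.39133535, 18.75376741, -53.36032453, -26.42131618])
b_init = 785.1811367994083

f_loop = compute_model_loop(x_0, w_init, b_init)   # equation (1)
f_dot = np.dot(w_init, x_0) + b_init               # equation (2)
print(f"loop: {f_loop:0.4f}, dot: {f_dot:0.4f}")
```

Both forms should agree up to floating-point rounding; the dot-product form is shorter and lets NumPy do the summation in compiled code.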

In [None]:
# Computes the prediction with np.dot, following equation (2) above
# NOTE: It works on ONE row (example) at a time; below we test it on row 0 using 'x_train[0,:]'
def compute_model(x, w, b):
  """
    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar):             model parameter

    Returns:
      f_wb (scalar):  prediction
  """
  f_wb = np.dot(w, x) + b
  return f_wb

# Testing the method by getting a 'Row' from the matrix
print(f"f_wb for 0th row (i=0) = {compute_model(x_train[0,:], w_init, b_init)}")

f_wb for 0th row (i=0) = 459.9999976194083
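
The same function can score a house that is not in the training set. The sketch below re-creates `compute_model` and the near-optimal `w_init`, `b_init` so that it runs on its own, and applies them to the example house from the problem statement (1200 sqft, 3 bedrooms, 1 floor, 40 years old). The name `x_house` is introduced here for illustration.

```python
import numpy as np

def compute_model(x, w, b):
    """Single prediction f_wb = w . x + b, as in equation (2)."""
    return np.dot(w, x) + b

# Near-optimal parameters from the cell above
w_init = np.array([0.39133535, 18.75376741, -53.36032453, -26.42131618])
b_init = 785.1811367994083

# Example house from the problem statement: size, bedrooms, floors, age
x_house = np.array([1200, 3, 1, 40])

price = compute_model(x_house, w_init, b_init)
print(f"predicted price: {price:0.2f} (1000s of dollars)")
```

With these parameters the prediction comes out at roughly 200.8, that is, about 200,800 dollars.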


## 2 Compute Cost With Multiple Variables
The equation for the cost function with multiple variables $J(\mathbf{w},b)$ is:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}$$ 
where:
$$ f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b  \tag{4} $$ 


Here, $\mathbf{w}$ and $\mathbf{x}^{(i)}$ are vectors rather than scalars, which is what lets the model support multiple features.

In [None]:
# Computes the cost J(w,b) for the training-data
def compute_cost(X, y, w, b):
  """
    Args:
      X (ndarray (m,n)): Data, m examples with n features [Matrix]
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      cost (scalar): cost
  """
  m = X.shape[0]                                # Number of examples (rows)
  cost = 0.0
  for i in range(m):
    f_wb = compute_model(X[i], w, b)            # Prediction for example i
    cost = cost + (f_wb - y[i]) ** 2            # Accumulate the squared error
  total_cost = cost / (2 * m)
  return total_cost
      
# Testing the method
print(f"Total Cost = {compute_cost(x_train, y_train, w_init, b_init)}")


Total Cost = 1.5578904428966628e-12
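
Equation (3) can also be evaluated without an explicit loop. The following is a minimal vectorized sketch, assuming the same data and parameters as above; `compute_cost_vectorized` is a hypothetical name, not a function from the original lab.

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    """Cost J(w, b) from equation (3), computed for all examples at once."""
    m = X.shape[0]
    err = X @ w + b - y               # shape (m,): prediction minus target per example
    return np.sum(err ** 2) / (2 * m)

x_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])
w_init = np.array([0.39133535, 18.75376741, -53.36032453, -26.42131618])
b_init = 785.1811367994083

cost_opt = compute_cost_vectorized(x_train, y_train, w_init, b_init)
cost_zero = compute_cost_vectorized(x_train, y_train, np.zeros(4), 0.0)
print(f"cost at near-optimal w, b: {cost_opt}")
print(f"cost at w = 0, b = 0:      {cost_zero}")
```

At the near-optimal parameters the cost is close to zero, while at all-zero parameters it is $(460^2 + 232^2 + 178^2)/6 = 49518$, which gives a feel for how far gradient descent has to travel.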


## 3 Compute Gradient With Multiple Variables
Gradient descent for multiple variables:

$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline\;
& w_j = w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{5}  \; & \text{for j = 0..n-1}\newline
&b\ \ = b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b}  \newline \rbrace
\end{align*}$$

where $n$ is the number of features, the parameters $w_j$ and $b$ are updated simultaneously, and the partial derivatives are:

$$
\begin{align}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{6}  \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7}
\end{align}
$$
* $m$ is the number of training examples in the data set.
* $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model's prediction, while $y^{(i)}$ is the target value.
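
Equations (6) and (7) follow from applying the chain rule to the cost (3). As a short worked derivation, note that from (4) we have $\partial f_{\mathbf{w},b}(\mathbf{x}^{(i)}) / \partial w_j = x_j^{(i)}$ and $\partial f_{\mathbf{w},b}(\mathbf{x}^{(i)}) / \partial b = 1$, so:

$$
\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}
  &= \frac{1}{2m} \sum\limits_{i = 0}^{m-1} 2\,(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})\,\frac{\partial f_{\mathbf{w},b}(\mathbf{x}^{(i)})}{\partial w_j}
   = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})\,x_{j}^{(i)} \\
\frac{\partial J(\mathbf{w},b)}{\partial b}
  &= \frac{1}{2m} \sum\limits_{i = 0}^{m-1} 2\,(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \cdot 1
   = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})
\end{align*}
$$

The factor 2 from differentiating the square cancels the $\frac{1}{2}$ in the cost, which is why the cost is defined with $\frac{1}{2m}$ rather than $\frac{1}{m}$.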


In [None]:
# Computes the gradient, i.e. the partial derivatives with respect to 'w' and 'b', using equations (6) and (7)
def compute_gradient(X, y, w, b):
  """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
    """
  m, n = X.shape
  dj_dw = np.zeros((n,))
  dj_db = 0.

  for i in range (0, m):
    f_wb = compute_model(X[i], w, b)
    error = f_wb - y[i]

    for j in range (0, n):
      dj_dw[j] =  dj_dw[j] + error * X[i,j]
    dj_db = dj_db + error

  dj_dw = dj_dw/m
  dj_db = dj_db/m
  return dj_dw, dj_db
      
# Testing the method by computing and displaying the gradient
# compute_gradient returns (dj_dw, dj_db), so unpack in that order
tmp_dj_dw, tmp_dj_db = compute_gradient(x_train, y_train, w_init, b_init)
print(f'dj_db at initial w,b: {tmp_dj_db}')
print(f'dj_dw at initial w,b: \n {tmp_dj_dw}')

dj_db at initial w,b: -1.6739251501955248e-06
dj_dw at initial w,b: 
 [-2.73e-03 -6.27e-06 -2.22e-06 -6.92e-05]
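
The two nested loops in `compute_gradient` can be collapsed into matrix operations. The following is a minimal vectorized sketch of equations (6) and (7), assuming the same data and parameters as above; `compute_gradient_vectorized` is a hypothetical name introduced for this example.

```python
import numpy as np

def compute_gradient_vectorized(X, y, w, b):
    """Gradient of the cost, equations (6) and (7), without explicit loops.

    Returns (dj_dw, dj_db) in that order: shape (n,) and scalar.
    """
    m = X.shape[0]
    err = X @ w + b - y          # shape (m,): prediction error per example
    dj_dw = X.T @ err / m        # equation (6): one entry per feature j
    dj_db = np.sum(err) / m      # equation (7)
    return dj_dw, dj_db

x_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])
w_init = np.array([0.39133535, 18.75376741, -53.36032453, -26.42131618])
b_init = 785.1811367994083

dj_dw, dj_db = compute_gradient_vectorized(x_train, y_train, w_init, b_init)
print(f"dj_db: {dj_db}")
print(f"dj_dw: {dj_dw}")
```

Here `X.T @ err` computes, for each feature $j$, the sum over examples of the error times $x_j^{(i)}$, which is exactly the inner loop of the original function.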


## 4 Gradient Descent With Multiple Variables
The function below implements equation (5) above: on every iteration it computes the gradient with `compute_gradient` and then updates all $w_j$ and $b$ simultaneously.

In [None]:
def gradient_descent(X, y, w, b, alpha, num_iters):
  """
    Performs batch gradient descent to learn w and b. Updates w and b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w (ndarray (n,))    : initial model parameters  
      b (scalar)          : initial model parameter
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent
      
    Returns:
      w (ndarray (n,))    : Updated values of parameters 
      b (scalar)          : Updated value of parameter 
    """
  # Copy w so that the caller's array is never modified; b is a scalar, so plain assignment is enough
  w_out = np.copy(w)
  b_out = b
  for _ in range(num_iters):
    # Gradient at the current parameters, equations (6) and (7)
    dj_dw, dj_db = compute_gradient(X, y, w_out, b_out)

    # Simultaneous update of w and b, equation (5)
    w_out = w_out - alpha * dj_dw
    b_out = b_out - alpha * dj_db

  return w_out, b_out


# initialize parameters
initial_w = np.zeros_like(w_init)
initial_b = 0.
# some gradient descent settings
iterations = 1000
alpha = 5.0e-7
# run gradient descent 
w_final, b_final = gradient_descent(x_train, y_train, initial_w, initial_b, alpha, iterations)
print(f"b,w found by gradient descent: {b_final:0.2f},{w_final} ")

b,w found by gradient descent: -0.00,[ 0.2   0.   -0.01 -0.07] 
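
To check that gradient descent is actually making progress, it can help to record the cost after every update. The sketch below is one possible way to do this, not the lab's own implementation: it uses vectorized versions of the cost and gradient so that it runs on its own, and the name `gradient_descent_with_history` and the list `J_history` are assumptions introduced here.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Cost J(w, b), equation (3), in vectorized form."""
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * X.shape[0])

def compute_gradient(X, y, w, b):
    """Gradient (dj_dw, dj_db), equations (6) and (7), in vectorized form."""
    m = X.shape[0]
    err = X @ w + b - y
    return X.T @ err / m, np.sum(err) / m

def gradient_descent_with_history(X, y, w, b, alpha, num_iters):
    """Batch gradient descent, equation (5), that also records the cost per step."""
    w = np.array(w, dtype=float)     # work on a copy of the initial parameters
    J_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw        # simultaneous update of w ...
        b = b - alpha * dj_db        # ... and b
        J_history.append(compute_cost(X, y, w, b))
    return w, b, J_history

x_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [852, 2, 1, 35]])
y_train = np.array([460, 232, 178])

# Same settings as the run above
w_hist, b_hist, J_history = gradient_descent_with_history(
    x_train, y_train, np.zeros(4), 0.0, alpha=5.0e-7, num_iters=1000)

print(f"cost after first step: {J_history[0]:0.2f}")
print(f"cost after last step:  {J_history[-1]:0.2f}")
```

If the learning rate is small enough, the recorded cost should decrease on every iteration; a cost that grows is a sign that `alpha` is too large.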


In [None]:
# Prediction
m = x_train.shape[0]
for i in range(m):
    print(f"prediction: {np.dot(x_train[i], w_final) + b_final:0.2f}, target value: {y_train[i]}")

prediction: 426.19, target value: 460
prediction: 286.17, target value: 232
prediction: 171.47, target value: 178
