# Lab 2 (Part C) - Linear regression with multiple features

Make sure that you watch the videos of Lecture 2 before starting this lab:
- Introduction to Linear Regression: https://youtu.be/-wmjwMWRsZU
- Introduction to Nonlinear Regression: https://youtu.be/Hyu8QMLEHrE

In this part of the lab, you will implement linear regression with multiple variables to predict the price of houses. Suppose you are selling your house and you want to know what a good market price would be. One way to do this is to first collect information on recent houses sold and make a model of housing prices.

# 1. Loading the dataset
The file `housing-dataset.csv` contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house. The following Python code helps you load the dataset from the data file into the variables $X$ and $y$. Read the code and print a small subset of $X$ and $y$ to see what they look like.
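
As a minimal sketch of the same loading and slicing pattern, the example below reads an in-memory CSV instead of the real file. The three rows are made-up illustrative values (not taken from `housing-dataset.csv`), and the names `toy_csv`, `toy_data`, `X_toy` and `y_toy` are hypothetical.

```python
from io import StringIO
import numpy as np

# Made-up rows in the same layout: size (sq ft), bedrooms, price
toy_csv = StringIO("1000,2,200000\n1500,3,300000\n2000,4,400000")
toy_data = np.genfromtxt(toy_csv, delimiter=",")

X_toy = toy_data[:, :2]   # first two columns: the features
y_toy = toy_data[:, -1]   # last column: the output

print(X_toy.shape, y_toy.shape)  # (3, 2) (3,)
print(X_toy[:2])                 # first two data-points
print(y_toy[:2])                 # first two outputs
```

Slicing with `[:k]` is a convenient way to look at the first `k` rows without printing the whole array.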

In [None]:
%matplotlib notebook
import numpy as np

filename = "datasets/housing-dataset.csv"
mydata = np.genfromtxt(filename, delimiter=",")

# We have n data-points (houses)
n = len(mydata)

# X is a matrix of two columns, i.e. an array of n 2-dimensional data-points
X = mydata[:, :2].reshape(n, 2)

# y is the vector of outputs, i.e. an array of n scalar values
y = mydata[:, -1]


""" TODO:
You can print a small subset of X and y to see what it looks like.
"""


# 2. Data normalization (scaling or standardization)
By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, performing feature scaling first can make gradient descent converge much more quickly. Your task here is to write code to:
- Subtract the mean value of each feature from the dataset.
- After subtracting the mean, additionally scale (divide) the feature values by their respective *standard deviations*.

In Python, you can use the numpy function `np.mean(..)` to compute the mean. This function can directly be used on a $d$-dimensional dataset to compute a $d$-dimensional mean vector `mu` where each value `mu[j]` is the mean of the $j^{th}$ feature. This is done by setting the $2^{nd}$ argument `axis` of this function to `0`. For example, consider the following matrix `A` where each row corresponds to one data-point and each column corresponds to one feature:

```python
A = [[ 100,    10],
     [ 30,     10],
     [ 230,    25]]
```

In this case, `np.mean(A, axis=0)` will give `[120, 15]`, where 120 is the mean of the 1st column (1st feature) and 15 is the mean of the 2nd column (2nd feature). Similarly, the function `np.std(..)` computes the standard deviation, which measures how much the values of a particular feature vary around their mean (usually, most data points will lie within the interval: mean $\pm$ 2 standard deviations).
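
The following sketch applies both functions to the matrix `A` above. The name `sigma` is an assumption made here for illustration; note that `np.std` uses the population form (`ddof=0`) by default.

```python
import numpy as np

A = np.array([[100, 10],
              [30, 10],
              [230, 25]])

mu = np.mean(A, axis=0)     # per-feature means: 120 and 15
sigma = np.std(A, axis=0)   # per-feature standard deviations (ddof=0)

print(mu)
print(sigma)
```

With `axis=0` the reduction runs down the rows, so the result has one entry per column (feature).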

Once the features are normalized, you can do a scatter plot of the original dataset `X` (size of the house vs. number of bedrooms) and a scatter plot of the normalized dataset `X_normalized`. You will notice that the normalized dataset still has the same shape as the original one; the difference is that the new feature values have a similar scale and are centred around the origin.

**Implementation Note**: When normalizing the features, it is important to store the values used for normalization (the mean and the standard deviation used for the computations). Indeed, after learning the parameters of a model, we often want to predict the prices of houses we have not seen before. Given a new $x$ value (house size and number of bedrooms), we must first normalize $x$ using the mean and standard deviation that we had previously computed from the training set.
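
A minimal sketch of this note on toy data (the values of the matrix `A` from earlier in this section). The names `X_train`, `sigma` and `x_new` are hypothetical; the point is only that the statistics are computed once, from the training set, and then reused unchanged.

```python
import numpy as np

# Toy training set with two features per data-point
X_train = np.array([[100.0, 10.0],
                    [30.0, 10.0],
                    [230.0, 25.0]])

# Compute and keep the normalization statistics of the training set
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_normalized = (X_train - mu) / sigma

# A hypothetical unseen data-point is normalized with the SAME mu and sigma
x_new = np.array([120.0, 15.0])
x_new_normalized = (x_new - mu) / sigma

print(X_train_normalized.mean(axis=0))  # approximately 0 for each feature
print(X_train_normalized.std(axis=0))   # approximately 1 for each feature
print(x_new_normalized)                 # x_new equals the mean here, so this is [0, 0]
```

Recomputing the mean and standard deviation from the new point itself would not work: a single point has a standard deviation of zero.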

In [None]:
import matplotlib.pyplot as plt

""" TODO:
Complete the following code to compute a normalized version of X called: X_normalized
"""
# TODO: compute mu, the mean vector from X
# TODO: compute std, the standard deviation vector from X
# X_normalized = (X - mu) / std



""" TODO:
- Do a scatter plot of the original dataset X
- Do a scatter plot of the normalized dataset X_normalized
"""


Similar to what you did in Lab2 Part B, you can simplify your implementation of linear regression by adding an additional first column to `X_normalized` with all the values of this column set to $1$. To do this you can re-use the function `add_all_ones_column(..)` defined in Lab2 Part B, which takes a matrix as argument and returns a new matrix with an additional first column (of ones).
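
The definition from Lab2 Part B is not reproduced in this notebook, so the sketch below uses a hypothetical stand-in called `prepend_ones` to show one possible way of building such a matrix with `np.hstack`. It is not necessarily how `add_all_ones_column(..)` was written.

```python
import numpy as np

def prepend_ones(M):
    """Hypothetical stand-in: return a copy of M with an extra first column of ones."""
    ones = np.ones((M.shape[0], 1))
    return np.hstack([ones, M])

M = np.array([[0.5, -1.0],
              [1.5, 2.0]])
M_new = prepend_ones(M)

print(M_new.shape)  # (2, 3)
print(M_new)
```

The column of ones lets the intercept $\theta_0$ be treated like any other parameter in the matrix product $X \theta$.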

In [None]:
""" TODO:
Copy-paste here the definition of the function add_all_ones_column(...) that 
you have seen in Lab 2 (Part B).
"""
# definition of the function add_all_ones_column() here ...


""" TODO:
Just uncomment the following lines to create a matrix 
X_normalized_new with an additional first column (of ones).
"""
# X_normalized_new = add_all_ones_column(X_normalized)

# print("Subset of X_normalized_new")
# print(X_normalized_new[:10])

# 3. Linear Regression from Scratch
You are now ready to implement linear regression using gradient descent (with more than one feature). In this multivariate case, you can further simplify your implementation by writing the cost function in the following vectorized form:

$$E(\theta) = \frac{1}{2n} (X \theta - y)^T (X \theta - y)$$

$$\text{where }\quad
X = \begin{bmatrix}
-- ~ {x^{(1)}}^T ~ -- \\ 
-- ~ {x^{(2)}}^T ~ -- \\ 
\vdots \\ 
-- ~ {x^{(n)}}^T ~ --
\end{bmatrix}
\quad \quad \quad
y = \begin{bmatrix}
y^{(1)} \\ 
y^{(2)} \\ 
\vdots \\ 
y^{(n)} 
\end{bmatrix}
$$

The vectorized form of the gradient of $E(\theta)$ is a vector denoted as $\nabla E(\theta)$ and defined as follows:

$$\nabla E(\theta) = \left ( \frac{\partial E}{\partial \theta_0}, \frac{\partial E}{\partial \theta_1}, \dots, \frac{\partial E}{\partial \theta_d} \right ) = \frac{1}{n} X^T (X \theta - y)$$

This is a **vector** where the $j^{th}$ value corresponds to $\frac{\partial E}{\partial \theta_j}$ (the derivative of the function $E$ with respect to the parameter $\theta_j$).
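
One way to convince yourself that the two formulas agree is a numerical gradient check. The sketch below is only an illustration on random toy data: the names `cost`, `cost_gradient`, `X_toy` and `y_toy` are assumptions made here, and the finite-difference comparison is an extra sanity check, not part of the lab.

```python
import numpy as np

def cost(theta, X, y):
    # E(theta) = 1/(2n) * (X theta - y)^T (X theta - y)
    residual = X @ theta - y
    return (residual @ residual) / (2 * len(y))

def cost_gradient(theta, X, y):
    # grad E(theta) = 1/n * X^T (X theta - y)
    return X.T @ (X @ theta - y) / len(y)

# Random toy problem: 5 data-points, a column of ones plus 2 features
rng = np.random.default_rng(0)
X_toy = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y_toy = rng.normal(size=5)
theta = np.array([0.3, -0.2, 0.1])

# Central finite differences, one coordinate direction at a time
h = 1e-6
numeric = np.array([
    (cost(theta + h * e, X_toy, y_toy) - cost(theta - h * e, X_toy, y_toy)) / (2 * h)
    for e in np.eye(3)
])

analytic = cost_gradient(theta, X_toy, y_toy)
print(np.allclose(numeric, analytic, atol=1e-6))
```

If the analytic gradient had a sign error or a missing factor of $\frac{1}{n}$, the comparison would fail.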

Once your code is finished, you will get to try out different learning rates $\alpha$ for the dataset and find a learning rate that converges quickly. To do so, you can plot the history of the cost $E(\theta)$ with respect to the number of iterations at the end of your code.

For example, for alpha values of 0.01, 0.05 and 0.1, the plot should look as follows:
<img src="imgs/costLab2C.png" width="400px" />

If your learning rate is too large, $E(\theta)$ can diverge and *blow up*, resulting in values that are too large for floating-point calculations. In these situations, Python will tend to return `NaN` or `inf` (NaN stands for "*not a number*" and is often caused by undefined operations that involve $-\infty$ and $+\infty$). If your value of $E(\theta)$ increases or even blows up, reduce your learning rate and try again.
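
The effect of the learning rate can be seen on a much simpler problem. The sketch below runs gradient descent on the one-dimensional function $f(t) = t^2$ (an assumption chosen for illustration, unrelated to the housing data); the helper `run_gd` is hypothetical.

```python
def run_gd(alpha, steps=50):
    """Gradient descent on f(t) = t^2 starting from t = 1; returns the cost history."""
    t = 1.0
    history = []
    for _ in range(steps):
        history.append(t ** 2)   # current cost f(t)
        t = t - alpha * 2 * t    # gradient step, since f'(t) = 2t
    return history

small = run_gd(alpha=0.1)   # each step multiplies t by 0.8
large = run_gd(alpha=1.1)   # each step multiplies t by -1.2

print(small[-1] < small[0])  # True: the cost decreases
print(large[-1] > large[0])  # True: the cost increases at every step
```

With `alpha=1.1` every step overshoots the minimum and lands further away on the other side, which is the same mechanism that makes $E(\theta)$ blow up.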

In [None]:
""" TODO: 
Write the cost function E using the vectorized form
"""
def E(theta, X, y):
    ...
    return ...

""" TODO: 
Define the function grad_E (the gradient of E) using the vectorized form.
This should return a vector of the same dimension as theta
"""
def grad_E(theta, X, y):
    ...
    return ...

""" TODO: 
Complete the definition of the function LinearRegressionWithGD(...) below
Note: don't forget to call the functions E(..) and grad_E(..) with X_normalized_new instead of X

The arguments of LinearRegressionWithGD(..) are:
*** theta: vector of initial parameter values
*** alpha: the learning rate (used by gradient descent)
*** max_iterations: maximum number of iterations to perform
*** epsilon: to stop iterating if the cost decreases by less than epsilon

The function returns:
*** errs: a list corresponding to the historical cost values
*** theta: the final parameter values
"""
def LinearRegressionWithGD(theta, alpha, max_iterations, epsilon):
    errs = []
    
    for itr in range(max_iterations):
        mse = E(theta, X_normalized_new, y)
        errs.append(mse)
        
        # TODO: take a gradient descent step to adapt the vector of parameters theta
        # ...
        
        # TODO: test if the cost decreases by less than epsilon (to stop iterating)
        #if CONDITION:
            #break
    
    return errs, theta


""" TODO: 
Here you will call LinearRegressionWithGD(..) in a loop with different values of alpha, 
and plot the cost history (errs) returned by each call of LinearRegressionWithGD(..)
"""
fig, ax = plt.subplots()
ax.set_xlabel("Number of Iterations")
ax.set_ylabel(r"Cost $E(\theta)$")

theta_init = np.zeros(3)  # float array, so in-place updates of theta do not fail on an integer dtype
max_iterations = 100
epsilon = 0.000000000001

for alpha in [0.01, 0.05, 0.1]:
    pass
    # TODO: call LinearRegressionWithGD(...) using the current alpha, to get errs and theta
    # ...
    
    # print("alpha = {}, theta = {}".format(alpha, theta))
    # ...
    
    # plot the errs using ax.plot(..)
    # ...
    
plt.legend()
fig.show()


Now, once you have found a good $\theta$ using gradient descent, use it to make a price prediction for a new 1650-square-foot house with 3 bedrooms. **Note**: since the parameter vector $\theta$ was learned using the normalized dataset, you will need to normalize the new data-point corresponding to this new house before predicting its price.

In [None]:
""" TODO: 
Use theta to predict the price of a 1650-square-foot house with 3 bedrooms
Don't forget to normalize the feature values of this new house first.
"""
# Create a data-point x corresponding to the new house
# Normalize the feature values of x
# Use the vector of parameters theta to predict the price of x

"""
HINT: if you are not able to compute the dot product between x and theta, then 
make sure that the arrays have the same size. Did you forget something?
"""


### 4. (removed). 

# 5. Normal Equation: Linear regression without gradient descent

As you know from the lecture, the MSE cost function $E(\theta)$ that we are trying to minimize is a convex function, and its derivative at the optimal $\theta$ (that minimizes $E(\theta)$) is equal to $0$. Therefore, to find the optimal $\theta$, one can simply compute the derivative of $E(\theta)$ with respect to $\theta$, set it equal to $0$, and solve for $\theta$.

We have seen in the lecture that, by doing this, the closed-form solution is given as follows:
$$\theta = (X^T X)^{-1} X^T y$$

Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no "*loop until convergence*" like in gradient descent.

You are asked to implement this equation to directly compute the best parameter vector $\theta$ for the linear regression. In Python, you can use the `inv` function from `numpy.linalg` to compute the inverse of a matrix.

Remember that although you don't need to scale your features, you still need to add a column of 1's to the $X$ matrix to have an intercept term ($\theta_0$).
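
As a sketch of the closed-form formula, the example below uses toy data generated exactly from $y = 1 + 2 x_1 + 3 x_2$, so the expected parameters are known in advance. The names `features`, `X_toy`, `y_toy` and `theta_solve` are assumptions made for illustration.

```python
import numpy as np
from numpy.linalg import inv

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise)
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 5.0],
                     [4.0, 3.0]])
y_toy = 1 + 2 * features[:, 0] + 3 * features[:, 1]

# Add the column of ones for the intercept term
X_toy = np.hstack([np.ones((4, 1)), features])

# Normal equation: theta = (X^T X)^{-1} X^T y
theta = inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy
print(theta)  # close to [1, 2, 3]

# Alternative: solve the linear system (X^T X) theta = X^T y without forming the inverse
theta_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)
print(theta_solve)
```

Both approaches give the same parameters here; solving the linear system is often preferred numerically because it avoids computing an explicit inverse.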

In [None]:
from numpy.linalg import inv

""" TODO: 
Use the function add_all_ones_column(..) to add a column of 1's to X. 
Let's call the returned dataset new_X.
"""
# new_X = ...

""" TODO: 
Compute the optimal theta using new_X and y (without using gradient descent).
Use the normal equation shown above. You can use the function inv (imported above)
to compute the inverse of a matrix.
"""
# theta = ...
# print("With the original (non-normalized) dataset: theta = {}".format(theta))

Now, once you have computed the optimal $\theta$, use it to make a price prediction for the new 1650-square-foot house with 3 bedrooms. Remember that $\theta$ was computed above based on the original dataset (without normalization), so you do not need to normalize the feature values of the new house to make the prediction in this case.

In [None]:
""" TODO: 
Use theta to predict the price of a 1650-square-foot house with 3 bedrooms
"""
# x = ...
# prediction = ...
# print(prediction)

Using the previous formula does not require any feature normalization or scaling. However, you can still recompute the optimal $\theta$ using `X_normalized_new` instead of `new_X`.

By doing this, you will be able to compare the $\theta$ that you compute here with the one you got previously when you used gradient descent. The two parameter vectors should be quite similar (but not necessarily exactly the same).

In [None]:
""" TODO: 
Compute the optimal theta using X_normalized_new and y (without using gradient descent). 
Use the normal equation (shown previously).
"""
# theta = ...
# print("With the normalized dataset: theta = {}".format(theta))

Again, now that you have computed the optimal $\theta$ based on `X_normalized_new`, use it to make a price prediction for the new 1650-square-foot house with 3 bedrooms. Do you need to normalize the feature values of the new house here? Remember that $\theta$ was computed here based on the normalized dataset.

You should find that this predicted price is similar to the price you predicted previously for the same house. 
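
The sketch below illustrates why the two predictions should agree, using small made-up toy data rather than the housing dataset. The helper `fit_normal_equation` and all variable names are hypothetical; it solves the normal equation once with raw features and once with normalized features, then predicts for the same new point.

```python
import numpy as np

def fit_normal_equation(F, y):
    """Hypothetical helper: add a column of ones to F and solve (X^T X) theta = X^T y."""
    X = np.hstack([np.ones((len(F), 1)), F])
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up toy data: 5 data-points with 2 features each
features = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 4.0]])
y_toy = np.array([3.0, 2.5, 7.0, 6.0, 8.5])
x_new = np.array([2.5, 3.0])

# Fit and predict with the raw features
theta_raw = fit_normal_equation(features, y_toy)
pred_raw = np.concatenate([[1.0], x_new]) @ theta_raw

# Fit and predict with normalized features (the new point is normalized too)
mu, sigma = features.mean(axis=0), features.std(axis=0)
theta_norm = fit_normal_equation((features - mu) / sigma, y_toy)
pred_norm = np.concatenate([[1.0], (x_new - mu) / sigma]) @ theta_norm

print(theta_raw, theta_norm)            # the parameter vectors differ
print(np.isclose(pred_raw, pred_norm))  # the predictions agree
```

Normalization only re-expresses the same features in different units, so the parameters change but the fitted model, and therefore the prediction, stays the same.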

In [None]:
""" TODO: 
Use theta to predict the price of a 1650-square-foot house with 3 bedrooms
"""
# Create a data-point x corresponding to the new house
# Normalize the feature values of x
# Use the vector of parameters theta to predict the price of x
# print("prediction:", pred)

# 6. Linear Regression with scikit-learn (sklearn)
You will now use the scikit-learn library to perform ordinary linear regression.

First, the code below shows you how to scale your data using scikit-learn. The `preprocessing` module provides a class `StandardScaler` that computes the mean and standard deviation on the training data so as to be able to later reapply the same transformation on the testing data.

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the StandardScaler to the training data (to compute the mean and std-deviation)
scaler = StandardScaler().fit(X)

# You can now use scaler to transform (scale) the training data or any new test data
X_normalized = scaler.transform(X)

print("Original X:\n", X[:5])
print()
print("X_normalized:\n", X_normalized[:5])

Complete the following code to train an ordinary linear regression on the scaled training dataset. Then, use the trained model to predict the price of new test houses.
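
For orientation, here is a sketch of the fit/transform/predict pattern on toy data generated exactly from $y = 1 + 2 x_1 + 3 x_2$ (not the housing dataset). The names `toy_scaler`, `toy_reg` and `test_points` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise)
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 5.0],
                     [4.0, 3.0]])
y_toy = 1 + 2 * features[:, 0] + 3 * features[:, 1]

# Fit the scaler on the training features, then train on the scaled features
toy_scaler = StandardScaler().fit(features)
toy_reg = LinearRegression().fit(toy_scaler.transform(features), y_toy)

# New points must go through the same scaler before prediction
test_points = np.array([[2.0, 2.0],
                        [0.0, 0.0]])
predictions = toy_reg.predict(toy_scaler.transform(test_points))
print(predictions)  # close to [11, 1]
```

`LinearRegression` fits an intercept by default, so no column of ones needs to be added to the features.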

In [None]:
from sklearn.linear_model import LinearRegression

""" TODO:
Train the linear regression model on the scaled training data
"""
# reg = ...


In [None]:
""" TODO: 
Use the trained regression model to predict the price of the following test houses
* 1650-square-foot house with 3 bedrooms
* 1020-square-foot house with 2 bedrooms
* 2300-square-foot house with 4 bedrooms
"""
# X_test = ... # Create the test dataset
# Scale the test dataset using scaler.transform(...)
# Predict the prices using reg.predict(...)


The prediction you get for the first test house (`[1650, 3]`) should be similar to the prediction you got for this house when you implemented the linear regression from scratch.

- For more information about data scaling in scikit-learn, check the link: https://scikit-learn.org/stable/modules/preprocessing.html
- For more information about the ordinary linear regression, check the link: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html 