# Learn Linear Regression! House Sale Prices
On Day 1 of this AI/ML workshop, we will learn simple and multiple linear regression by training a machine learning model to predict house sale prices.

Live Coding Tutorial Recording (May 30): https://www.youtube.com/watch?v=d-SyV8yZV1M

Tutorial and Notebook created by Eban Ebssa

Import the necessary Python libraries and submodules:
* **numpy** - math including arrays
* **pandas** - data processing and analysis
* **matplotlib.pyplot** - plotting figures
* **seaborn** - visualizations
* **sklearn.model_selection** - splitting data into training and testing sets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

print("done importing libraries!")

# Data Preparation
First, we will take a look at the dataset.

Use the pandas library to create a **DataFrame**, which is a labeled 2-D data structure made up of rows (**observations**) and columns (**features**).
There are 80 features recorded with each of the 1460 house observations in the data set that we can use to predict the last column, SalePrice.
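
To make the rows-and-columns idea concrete, here is a minimal sketch of a tiny DataFrame. The three houses and all of their values are made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up toy data: three houses (rows) described by two features plus the target
toy = pd.DataFrame({
    "GrLivArea": [1500, 1200, 2000],
    "OverallQual": [7, 5, 8],
    "SalePrice": [200000, 150000, 280000],
})

print(toy.shape)                 # (number of rows, number of columns)
print(list(toy.columns))         # the column labels
print(toy["SalePrice"].mean())   # a single column is a pandas Series
```

Selecting one column with square brackets, as in the last line, is the same pattern the notebook uses to pull out SalePrice.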

In [1]:
dataframe = pd.read_csv("../input/train.csv")  # read file into DataFrame
print(dataframe.head())  # preview

sale_price = dataframe['SalePrice']  # holds all values of SalePrice column
sale_price.describe()  # descriptive statistics

Generate a heat map based on the pairwise correlations of all of the columns. What do you notice? What does the vertical bar on the right represent? How is that reflected in the graph?

In [1]:
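# keep only numeric columns: newer pandas versions raise an error if text columns are included
correlation_matrix = dataframe.select_dtypes(include='number').corr()  # correlation coefficients of all paired numeric columns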
figure, axis = plt.subplots(figsize=(12, 10))  # make the graph bigger
sns.heatmap(correlation_matrix, vmin=-0.7, vmax=0.7, cmap='RdBu', square=True)
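
The numbers behind a heat map can also be read directly. The sketch below uses made-up data (the columns `area`, `price`, and `noise` are hypothetical, not from the dataset) to show what a correlation matrix looks like for a strongly related pair and an unrelated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.normal(1500, 400, size=200)

# Made-up data: price rises with area, noise is unrelated to both
toy = pd.DataFrame({
    "area": area,
    "price": 100 * area + rng.normal(0, 20000, size=200),
    "noise": rng.normal(size=200),
})

corr = toy.corr()  # pairwise Pearson correlation coefficients
print(corr.round(2))

# Rank the columns by how strongly they correlate with price
print(corr["price"].sort_values(ascending=False))
```

Each column correlates perfectly with itself, so the diagonal is all 1s, and the matrix is symmetric. Sorting one column of the matrix, as in the last line, is one possible way to shortlist explanatory variables for SalePrice.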

# Time for regression!
First, we will walk through the methods behind training a regression model. Then, we can use the sklearn library to do the training for us.

**What is linear regression?**

Linear regression is a type of **predictive analysis**: we want to predict an outcome based on given input(s). The outcome, aka the dependent variable, is referred to as y, and the input, aka the independent or explanatory variable, is x. When we say we want to create a regression model, essentially, we are trying to find a line of best fit.

There are two main types of linear regression:
* **Simple Linear Regression**: one input variable and one output variable 
* **Multiple Linear Regression**: multiple input variables and one output variable

**How do we find that line of best fit?**

You know that the equation of a line of best fit is y = mx + b, where m is the slope coefficient and b is the y-intercept. In machine learning, it's quite similar, except we adjust for possible multiple inputs: **y(x_1, ..., x_n) = w_1\*x_1 + ... + w_n\*x_n + w_0**.
* x_1, ..., x_n represent the **explanatory variables** we are using to predict y
* w_1 ... w_n represent the **weight** coefficients, representing each variable's relative importance
* w_0 is the **intercept**
    * to simplify product calculations, we will add another column of 1s to each of the inputs, so it becomes 1\*w_0

**Why would we add a column of 1s to the x inputs?**

This formula y(x_1, ..., x_n) = w_1\*x_1 + ... + w_n\*x_n + w_0 can be rewritten as the dot product **y(x) = w · x + w_0**, where x is the list (vector) of all x_1 to x_n and w is the list (vector) of all w_1 to w_n.
The dot product is an operation on two vectors of the same length: multiply each pair of corresponding elements, then add up all of the products.

Example: If w = (2, 3, 4), x = (4, 3, 2), and w_0 = 5, what is y(x)?

y(x) = w · x + w_0

y(x) = 2\*4 + 3\*3 + 4\*2 + 5

y(x) = 30

Instead of thinking of y(x) as a dot product of two vectors plus another number, we can fold everything into a single dot product: append w_0 to the end of the w vector and append a 1 to the end of the x vector. When x is a whole table of observations, appending a 1 to every row is what creates the column of 1s.

So now, w = (2, 3, 4, **5**) and x = (4, 3, 2, **1**), and there's no need for w_0 separately.

y(x) = w · x

y(x) = 2\*4 + 3\*3 + 4\*2 + 1\*5

y(x) = 30
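
The same worked example can be checked with numpy. This is a small sketch using the numbers above; the variable names are chosen here for illustration.

```python
import numpy as np

w = np.array([2, 3, 4])
x = np.array([4, 3, 2])
w_0 = 5

# Version 1: dot product plus a separate intercept
y_separate = np.dot(w, x) + w_0

# Version 2: fold the intercept into the vectors
w_aug = np.append(w, w_0)  # (2, 3, 4, 5)
x_aug = np.append(x, 1)    # (4, 3, 2, 1)
y_folded = np.dot(w_aug, x_aug)

print(y_separate, y_folded)  # both are 30
```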



**How do we write all this in code?**

We find the optimal values for the linear regression model through training. The main purpose of training is to minimize the error between the predicted values and the actual values. With each step of training, we update the weights to keep minimizing the error.

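This first function, calculate_loss, is based on the **mean squared error** (MSE): the sum of all of the squared differences divided by the number of observations. The code divides by 2n rather than n, so it returns half of the MSE; the extra factor of 1/2 cancels the 2 that appears when we take the derivative, and it does not change which weights are best. MSE is one example of an **evaluation metric**, used to measure the quality of a machine learning model.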

In [1]:
# calculates half of the mean squared error as an evaluation metric
# predicted: the values calculated by the model
# actual: the true (observed) values from the dataset

def calculate_loss(predicted, actual):
    squared_errors = (actual - predicted)**2
    n = len(predicted)
    return 1.0 / (2*n) * squared_errors.sum()  # sum of squared errors divided by 2n
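
As a quick sanity check, here is the loss worked out on three made-up predictions. The function is repeated so the snippet runs on its own; the arrays are hypothetical values chosen to keep the arithmetic easy.

```python
import numpy as np

def calculate_loss(predicted, actual):
    squared_errors = (actual - predicted) ** 2
    n = len(predicted)
    return 1.0 / (2 * n) * squared_errors.sum()

predicted = np.array([1.0, 2.0, 3.0])
actual = np.array([2.0, 2.0, 5.0])

# Squared errors are 1, 0, and 4
loss = calculate_loss(predicted, actual)   # (1 + 0 + 4) / (2 * 3)
mse = np.mean((actual - predicted) ** 2)   # (1 + 0 + 4) / 3

print(loss, mse)
```

The value returned by calculate_loss is exactly half of the plain MSE, because the function divides by 2n instead of n.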

This function, evaluate_prediction, simply multiplies the x inputs by the weights of the model to return a prediction of the output y.

In [1]:
# returns the weighted sum of the inputs as a prediction of y
# x: rows of values for each column (inputs)
# weights: factors representing relative importance of each column

def evaluate_prediction(x, weights):
    return np.dot(x, weights)  # dot: add individual products

This function, gradient_descent, returns the updated weights after taking one step in the direction that decreases the loss. The **gradient** of a function generalizes the **derivative** to several inputs: it collects the rate of change (slope) of the function with respect to each weight.

For our loss function, the gradient is the dot product of the transposed x values and the error (predictions minus actual values), divided by the number of observations. Because the gradient points in the direction in which the loss increases fastest, we step in the negative direction so that the error decreases (descent).

![Gradient Descent Visual](http://miro.medium.com/max/1024/1*G1v2WBigWmNzoMuKOYQV_g.png)
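
For readers who want to see where the gradient formula comes from, here is a short derivation sketch. It assumes the loss is the sum of squared errors divided by 2n, as in calculate_loss; the symbols X (the matrix of inputs, one row per observation), n (number of observations), and α (the learning rate) are notation introduced here.

```latex
\begin{aligned}
L(\mathbf{w}) &= \frac{1}{2n} \sum_{i=1}^{n} \left( \mathbf{x}_i \cdot \mathbf{w} - y_i \right)^2 \\
\frac{\partial L}{\partial w_j} &= \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{x}_i \cdot \mathbf{w} - y_i \right) x_{ij} \\
\nabla L(\mathbf{w}) &= \frac{1}{n} X^{\top} \left( X\mathbf{w} - \mathbf{y} \right) \\
\mathbf{w}_{\text{new}} &= \mathbf{w} - \alpha \, \nabla L(\mathbf{w})
\end{aligned}
```

The 2 from differentiating the square cancels the 1/2 in the loss, which leaves the 1/n factor.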

In [1]:
# returns the new weights after taking a step in the direction of the negative gradient
# x: rows of inputs
# weights: coefficients for each column in x
# y: rows of actual outputs
# learning_rate: how quickly the model should learn (how big of a step)

def gradient_descent(x, weights, y, learning_rate):
    predictions = evaluate_prediction(x, weights)
    error = predictions - y  # how bad are the current weight values
    gradient = np.dot(x.T,  error) / len(x)  # plug into derivative function
    new_weights = weights - learning_rate * gradient  # update
    return new_weights
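
One way to convince yourself that the gradient formula is right is to compare it with a numerical estimate of the slope. The sketch below does this on random made-up data; `loss`, `analytic_gradient`, and `numerical_gradient` are helper names invented for this check, and the loss is assumed to be the sum of squared errors divided by 2n.

```python
import numpy as np

def loss(x, weights, y):
    error = np.dot(x, weights) - y
    return (error ** 2).sum() / (2 * len(x))

def analytic_gradient(x, weights, y):
    # same formula as in gradient_descent
    error = np.dot(x, weights) - y
    return np.dot(x.T, error) / len(x)

def numerical_gradient(x, weights, y, eps=1e-6):
    # nudge one weight at a time and measure how the loss changes
    grad = np.zeros_like(weights)
    for j in range(len(weights)):
        step = np.zeros_like(weights)
        step[j] = eps
        grad[j] = (loss(x, weights + step, y) - loss(x, weights - step, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = np.c_[rng.normal(size=(20, 2)), np.ones(20)]  # two features plus a column of 1s
y = rng.normal(size=20)
weights = rng.normal(size=3)

analytic = analytic_gradient(x, weights, y)
numerical = numerical_gradient(x, weights, y)
print(analytic)
print(numerical)
```

The two printed vectors should agree to several decimal places.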

Now, for training!
Essentially, we repeat gradient descent, updating the weights after each iteration. I also keep track of every value of the loss function so that later, we can visualize how our model improves with training.

In [1]:
# returns the final weights and list of losses after updating weights using repeated gradient descent
# x: rows of inputs
# y: rows of actual outputs
# iterations: number of times to update (epochs)
# learning_rate: how quickly the model should learn

def train_model(x, y, iterations, learning_rate):
    weights = np.zeros(x.shape[1])  # initially all zeros
    loss_history = []
    for i in range(iterations):  # iterations aka epochs
        prediction = evaluate_prediction(x, weights)
        current_loss = calculate_loss(prediction, y)
        loss_history.append(current_loss)
        weights = gradient_descent(x, weights, y, learning_rate)  # update
    return weights, loss_history
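
Before moving to the housing data, it can help to try the training loop on a problem where we already know the answer. The sketch below repeats compact versions of the functions above so it runs on its own, then trains on made-up, noiseless data generated from y = 3x + 2. The learning rate of 0.1 is an assumption that happens to work well for this toy data.

```python
import numpy as np

def evaluate_prediction(x, weights):
    return np.dot(x, weights)

def calculate_loss(predicted, actual):
    return ((actual - predicted) ** 2).sum() / (2 * len(predicted))

def gradient_descent(x, weights, y, learning_rate):
    error = evaluate_prediction(x, weights) - y
    return weights - learning_rate * np.dot(x.T, error) / len(x)

def train_model(x, y, iterations, learning_rate):
    weights = np.zeros(x.shape[1])
    loss_history = []
    for _ in range(iterations):
        loss_history.append(calculate_loss(evaluate_prediction(x, weights), y))
        weights = gradient_descent(x, weights, y, learning_rate)
    return weights, loss_history

rng = np.random.default_rng(0)
feature = rng.normal(size=100)
x = np.c_[feature, np.ones(100)]   # one feature plus the column of 1s
y = 3.0 * feature + 2.0            # made-up data with known slope and intercept

weights, loss_history = train_model(x, y, iterations=1000, learning_rate=0.1)
print(weights)                               # should be close to [3, 2]
print(loss_history[0], loss_history[-1])     # loss before and after training
```

Because the column of 1s comes last, the last weight plays the role of the intercept.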

# Let's use these functions to train our models!

**Simple Linear Regression Model**

Choose GrLivArea (above-ground living area, in square feet) as a single explanatory variable.
Then split the data into training (80%) and testing (20%) sets.
Then, standardize the inputs as z-scores, and concatenate a column of 1s.

Standardizing allows the gradient descent to converge more quickly.
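
Here is a minimal sketch of z-score standardization on made-up living areas. It follows one common convention: compute the mean and standard deviation from the training set only, then reuse them for the test set so both sets are measured on the same scale.

```python
import numpy as np

train = np.array([1000.0, 1500.0, 2000.0, 2500.0])  # made-up training areas
test = np.array([1200.0, 3000.0])                    # made-up test areas

mean = train.mean()
std = train.std(ddof=1)  # ddof=1 matches the default of pandas' .std()

std_train = (train - mean) / std
std_test = (test - mean) / std   # reuse the training mean and std

print(std_train.mean(), std_train.std(ddof=1))  # about 0 and 1
print(std_test)
```

After standardizing, a value of 1.0 means "one standard deviation above the training average", regardless of the original units.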

In [1]:
area = dataframe['GrLivArea']
x_train, x_test, y_train, y_test = train_test_split(area, sale_price, test_size=0.2)

train_mean, train_std = x_train.mean(), x_train.std()  # statistics from the training set only
std_x_train = (x_train - train_mean) / train_std  # calculate z-scores
std_x_train = np.c_[std_x_train, np.ones(x_train.shape[0])]  # concatenate column of 1s
std_x_test = (x_test - train_mean) / train_std  # reuse training statistics so both sets share one scale
std_x_test = np.c_[std_x_test, np.ones(x_test.shape[0])]

weights, loss_history = train_model(std_x_train, y_train, 1000, 0.01)
print(weights)

What do these weights mean?
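
One way to explore this question is to fit made-up data where the true relationship is known. The sketch below swaps gradient descent for an exact least-squares solve with `np.linalg.lstsq`, just to keep it short; the areas and prices are generated at random and are not from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
area = rng.normal(1500.0, 500.0, size=200)  # made-up living areas
price = 100.0 * area + 20000.0 + rng.normal(0.0, 10000.0, size=200)  # made-up prices

z = (area - area.mean()) / area.std(ddof=1)   # z-scores
x = np.c_[z, np.ones(len(z))]                 # column of 1s comes last
weights, *_ = np.linalg.lstsq(x, price, rcond=None)  # exact least-squares fit

slope_per_std, intercept = weights
print(intercept, price.mean())                 # the intercept equals the average price
print(slope_per_std / area.std(ddof=1))        # price change per square foot, near 100
```

With standardized inputs, the last weight is the predicted price of a house of average size, and the first weight is the change in predicted price for a one-standard-deviation increase in area.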

In [1]:
plt.plot(loss_history)
plt.title('Loss During Training')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.show()

Notice how much repeated training reduces the loss! The loss decreases dramatically over the first couple hundred iterations, and then the improvement slows down as we keep fine-tuning our weights.

Let's plot this model as a line of best fit against the original training data.

In [1]:
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, evaluate_prediction(std_x_train, weights), color='blue')
plt.title('Sale Price vs. Living Area (Training set)')
plt.xlabel('Living Area (sq. ft.)')
plt.ylabel('Sale Price ($)')
plt.show()

Now, let's see how well the model fits the test data.

In [1]:
plt.scatter(x_test, y_test, color='red')
plt.plot(x_test, evaluate_prediction(std_x_test, weights), color='blue')
plt.title('Sale Price vs. Living Area (Test set)')
plt.xlabel('Living Area (sq. ft.)')
plt.ylabel('Sale Price ($)')
plt.show()
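
A plot gives a visual impression of the fit; a number makes it easier to compare models. The sketch below computes two common metrics with `sklearn.metrics` on made-up values. The arrays `y_test` and `predictions` here are generated at random for illustration; in the notebook, the predictions would come from evaluate_prediction on the test inputs.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
y_test = rng.normal(180000.0, 50000.0, size=50)           # made-up actual prices
predictions = y_test + rng.normal(0.0, 10000.0, size=50)  # made-up predictions with some error

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)                 # root mean squared error, in dollars
r2 = r2_score(y_test, predictions)  # 1.0 would be a perfect fit

print(rmse, r2)
```

RMSE is in the same units as the sale price, which makes it easier to interpret than the raw loss.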

**Multiple Linear Regression Model**

Choose GrLivArea (above-ground living area), OverallQual (overall material and finish quality), and GarageCars (garage size in car capacity) as three explanatory variables. Don't forget to standardize and concatenate the column of ones.

In [1]:
multi_vars = dataframe[['GrLivArea', 'OverallQual', 'GarageCars']]

std_multi_vars = (multi_vars - multi_vars.mean()) / multi_vars.std()
std_multi_vars = np.c_[std_multi_vars, np.ones(std_multi_vars.shape[0])]

weights, loss_history = train_model(std_multi_vars, sale_price, 1000, 0.01)
print(weights)

Here's the graph of the decreasing loss function again :)

In [1]:
plt.plot(loss_history)
plt.title('Loss During Training')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.show()

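Now that we understand how linear regression models are trained, we can use scikit-learn's LinearRegression class to do the training automatically. LinearRegression fits its own intercept by default, so the column of 1s in our inputs is redundant here: its coefficient comes out as essentially 0, and the intercept is reported separately in intercept_.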

In [1]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(std_x_train, y_train)

print(regressor.coef_)
print(regressor.intercept_)

plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regressor.predict(std_x_train), color='blue')
plt.title('Sale Price vs. Living Area (Training set)')
plt.xlabel('Living Area (sq. ft.)')
plt.ylabel('Sale Price ($)')
plt.show()

plt.scatter(x_test, y_test, color='red')
plt.plot(x_test, regressor.predict(std_x_test), color='blue')
plt.title('Sale Price vs. Living Area (Test set)')
plt.xlabel('Living Area (sq. ft.)')
plt.ylabel('Sale Price ($)')
plt.show()
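
To see how the library result relates to the weight vector approach, the sketch below fits made-up data two ways: with `LinearRegression` on the raw features, and with an exact least-squares solve (`np.linalg.lstsq`) on the features plus a column of 1s. The data and the "true" coefficients are invented for this comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
# Made-up target: known coefficients, an intercept of 2, and a little noise
y = features @ np.array([3.0, -1.0, 0.5]) + 2.0 + rng.normal(0.0, 0.1, size=100)

# Library version: the intercept is handled for us
model = LinearRegression().fit(features, y)

# Weight-vector version: the intercept is the weight on the column of 1s
x_with_ones = np.c_[features, np.ones(len(features))]
weights, *_ = np.linalg.lstsq(x_with_ones, y, rcond=None)

print(model.coef_, model.intercept_)
print(weights)  # first three match coef_, last one matches intercept_
```

Gradient descent run for enough iterations should approach the same values, since all three approaches minimize the same squared error.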

# Check Your Understanding
* In your own words, what is a DataFrame?
* What's the significance of the heat map?
* Summarize how to train a linear regression model in machine learning.
* What's the benefit of coding a machine learning algorithm to create a regression model over calculating the statistical regression formula, say, with a calculator?
* Any questions, clarifications, comments?