In this exercise we explore linear regression. We will train a model to predict a house's sale price from its above-ground living area in square feet.

We first load the necessary tools:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as ipydis
import time

In [None]:
We use Pandas to read the data file, which is stored as comma-separated values (CSV), and print the column labels. A CSV file is similar to an Excel sheet.

In [None]:
data = pd.read_csv('slimmed_realestate_data.csv')
print(data.columns)

In [None]:
data.plot(x='GrLivArea', y='SalePrice',style='.')

Here x is the above-ground square footage and y is the sale price. The equations need a few other quantities, such as n, the number of data points.

Note: we also need to convert the Pandas objects (in this case Series) to NumPy arrays using the to_numpy() function.

In [None]:
n = len(data)

In [None]:
x = data['GrLivArea'].to_numpy()
y = data['SalePrice'].to_numpy()

In [None]:
sum_xy = np.sum(x*y)
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_x2 = np.sum(x*x)

The denominator is shared by the equations for m and b (please refer to the lecture notes on regression), so we compute it once.
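
For reference, the closed-form least-squares expressions that the code in this section evaluates can be written out as follows. This simply restates the code in mathematical notation (sums run over all n data points); the lecture notes may use different symbols.

```latex
m = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
b = \frac{\sum_i y_i \sum_i x_i^2 - \sum_i x_i \sum_i x_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2}
```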

In [None]:
denominator = n * sum_x2 - sum_x * sum_x

Now, we compute m & b:

In [None]:
m = (n * sum_xy - sum_x * sum_y) / denominator
b = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
print('y = %f * x + %f' % (m,b))

# saving these for later comparison
m_calc = m
b_calc = b
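
One way to sanity-check the closed-form result is to compare it with NumPy's built-in least-squares polynomial fit. The sketch below uses synthetic stand-in data (an assumption for illustration, not the real-estate file), and the names x_demo, y_demo, m_closed, and b_closed are hypothetical.

```python
import numpy as np

# Synthetic stand-in data (assumption: roughly linear with noise)
rng = np.random.default_rng(0)
x_demo = rng.uniform(500, 3000, size=200)
y_demo = 100.0 * x_demo + 20000.0 + rng.normal(0, 5000, size=200)

# Same closed-form expressions as in the cells above
n_demo = len(x_demo)
sxy = np.sum(x_demo * y_demo)
sx = np.sum(x_demo)
sy = np.sum(y_demo)
sxx = np.sum(x_demo * x_demo)
denom = n_demo * sxx - sx * sx
m_closed = (n_demo * sxy - sx * sy) / denom
b_closed = (sy * sxx - sx * sxy) / denom

# np.polyfit with deg=1 returns the coefficients as [slope, intercept]
m_fit, b_fit = np.polyfit(x_demo, y_demo, deg=1)

print('closed form: y = %f * x + %f' % (m_closed, b_closed))
print('np.polyfit : y = %f * x + %f' % (m_fit, b_fit))
```

On the real data, the same comparison could be made by passing x and y to np.polyfit and comparing against m_calc and b_calc.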

In [None]:
def plot_data(x,y,m,b,plt = plt):
   # plot our data points with 'bo' = blue circles
   plt.plot(x,y,'bo')
   # create the line based on our linear fit
   # first we need to make x points
   # the 'arange' function generates points from min up to (not including) max, in steps of 1
   linear_x = np.arange(x.min(),x.max())
   # now we use our fit parameters to calculate the y points based on our x points
   linear_y = linear_x * m + b
   # plot the linear points using 'g-' = green line
   plt.plot(linear_x,linear_y,'g-',label='fit')

In [None]:
plot_data(x,y,m,b)

Stochastic Gradient Descent (SGD) is a popular optimization algorithm used to train machine learning models, particularly neural networks. It is an iterative method for minimizing a loss function that we define.

In our example, the model describes how house prices vary with house size. We assume the relationship is roughly linear, y = mx + b, and we need to find m and b.

In [None]:
def model(x,m,b):
   return m * x + b

Let's create the loss function, the squared difference between the actual and the predicted price for each data point:

In [None]:
def loss(x,y,m,b):
   y_predicted = model(x,m,b)
   return np.power( y - y_predicted, 2 )

Let's now update m & b:
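
For reference, the update rules in the next cell follow from differentiating the mean squared error with respect to each parameter. The symbol \eta for the learning rate is introduced here for illustration only.

```latex
L(m, b) = \frac{1}{n} \sum_i \left(y_i - (m x_i + b)\right)^2

\frac{\partial L}{\partial m} = -\frac{2}{n} \sum_i x_i \left(y_i - (m x_i + b)\right),
\qquad
\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_i \left(y_i - (m x_i + b)\right)

m \leftarrow m - \eta \, \frac{\partial L}{\partial m},
\qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
```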

In [None]:
def updated_m(x,y,m,b,learning_rate):
   dL_dm = - 2 * x * (y - model(x,m,b))
   dL_dm = np.mean(dL_dm)
   return m - learning_rate * dL_dm

def updated_b(x,y,m,b,learning_rate):
   dL_db = - 2 * (y - model(x,m,b))
   dL_db = np.mean(dL_db)
   return b - learning_rate * dL_db
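
A common way to check hand-derived gradients like these is to compare them with a finite-difference estimate. The sketch below does that on a tiny toy data set; the data, the helper names mean_loss and analytic_grads, and the step size eps are all assumptions for illustration. The model function is repeated only to keep the block self-contained.

```python
import numpy as np

def model(x, m, b):
    return m * x + b

def mean_loss(x, y, m, b):
    # mean squared error over all points
    return np.mean((y - model(x, m, b)) ** 2)

def analytic_grads(x, y, m, b):
    # same expressions as in updated_m and updated_b, before the update step
    residual = y - model(x, m, b)
    return np.mean(-2 * x * residual), np.mean(-2 * residual)

# tiny toy data set (hypothetical values)
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([3.0, 5.0, 7.5, 9.0])
m0, b0 = 0.5, 0.1
eps = 1e-6

# central finite differences in each parameter
num_dm = (mean_loss(x_toy, y_toy, m0 + eps, b0)
          - mean_loss(x_toy, y_toy, m0 - eps, b0)) / (2 * eps)
num_db = (mean_loss(x_toy, y_toy, m0, b0 + eps)
          - mean_loss(x_toy, y_toy, m0, b0 - eps)) / (2 * eps)

ana_dm, ana_db = analytic_grads(x_toy, y_toy, m0, b0)
print('dL/dm  analytic: %f  numeric: %f' % (ana_dm, num_dm))
print('dL/db  analytic: %f  numeric: %f' % (ana_db, num_db))
```

If the two columns agree to several digits, the analytic gradients are very likely correct.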

Let's now put it all together and train our model.

We start from an arbitrary initial slope (m) and intercept (b):

In [None]:
m = 5.
b = 1000.
print('y_i = %.2f * x + %.2f' % (m,b))


Then we can calculate the loss for each data point:

In [None]:
l = loss(x,y,m,b)
print('first 10 loss values: ',l[:10])

In [None]:
learning_rate = 1e-9
m = updated_m(x,y,m,b,learning_rate)
b = updated_b(x,y,m,b,learning_rate)
print('y_i = %.2f * x + %.2f     previously calculated: y_i = %.2f * x + %.2f' % (m,b,m_calc,b_calc))
plot_data(x,y,m,b)

In [None]:
# set our initial slope and intercept
m = 5.
b = 1000.
#batch_size = 512
# set a learning rate for each parameter
learning_rate_m = 1e-7
learning_rate_b = 1e-1
# use these to plot our progress over time
loss_history = []
# convert pandas data to numpy arrays, one for the "Ground Living Area" and one for "Sale Price"
data_x = data['GrLivArea'].to_numpy()
data_y = data['SalePrice'].to_numpy()
#Sample data randomly in batches of size batch_size
#data_batch = data.sample(batch_size)
#data_x = data_batch['GrLivArea'].to_numpy()
#data_y = data_batch['SalePrice'].to_numpy()
# we run our loop N times
#loop_N = 30 * len(data)//batch_size
loop_N = 30
for i in range(loop_N):
   # update our slope and intercept based on the current values
   m = updated_m(data_x,data_y,m,b,learning_rate_m)
   b = updated_b(data_x,data_y,m,b,learning_rate_b)

   # calculate the loss value
   loss_value = np.mean(loss(data_x,data_y,m,b))

   # keep a history of our loss values
   loss_history.append(loss_value)

   # print our progress
   print('[%03d]  y_i = %.2f * x + %.2f     previously calculated: y_i = %.2f * x + %.2f    loss: %f' % (i,m,b,m_calc,b_calc,loss_value))

   # close/delete previous plots
   plt.close('all')

   # create a 1 by 2 plot grid
   fig,ax = plt.subplots(1,2,figsize=(18,6),dpi=80)
   # plot our usual output
   plot_data(data_x,data_y,m,b,ax[0])

   # here we also plot the calculated linear fit for comparison
   line_x = np.arange(data_x.min(),data_x.max())
   line_y = line_x * m_calc + b_calc
   ax[0].plot(line_x,line_y,'b-',label='calculated')
   # add a legend to the plot and x/y labels
   ax[0].legend()
   ax[0].set_xlabel('square footage')
   ax[0].set_ylabel('sale price')

   # plot the loss
   loss_x = np.arange(0,len(loss_history))
   loss_y = np.asarray(loss_history)
   ax[1].plot(loss_x,loss_y, 'o-')
   ax[1].set_yscale('log')
   ax[1].set_xlabel('loop step')
   ax[1].set_ylabel('loss')
   plt.show()
   # gives us time to see the plot
   time.sleep(2.5)
   # clears the plot when the next plot is ready to show.
   ipydis.clear_output(wait=True)
    

Exercise: Vary the learning rates for m and b and observe how the output changes. The starting values are learning_rate_m = 1e-7 and learning_rate_b = 1e-1.
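
The training loop above uses every data point in each step, so strictly speaking it is batch gradient descent; the commented-out batch_size lines point toward the stochastic variant. A minimal sketch of mini-batch SGD follows. It runs on synthetic stand-in data (an assumption, not the real-estate file) with x in the range 0 to 1, so that a single learning rate works for both parameters. All names ending in _sgd, and the chosen batch size and step count, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: true slope 3.0, true intercept 2.0, small noise
x_sgd = rng.uniform(0.0, 1.0, size=1000)
y_sgd = 3.0 * x_sgd + 2.0 + rng.normal(0.0, 0.1, size=1000)

m_sgd, b_sgd = 0.0, 0.0
learning_rate = 0.1
batch_size = 32

for step in range(2000):
    # draw a random mini-batch of indices without replacement
    idx = rng.choice(len(x_sgd), size=batch_size, replace=False)
    xb, yb = x_sgd[idx], y_sgd[idx]

    # gradients of the mean squared error on this batch only
    residual = yb - (m_sgd * xb + b_sgd)
    grad_m = np.mean(-2 * xb * residual)
    grad_b = np.mean(-2 * residual)

    # compute both gradients first, then update both parameters
    m_sgd -= learning_rate * grad_m
    b_sgd -= learning_rate * grad_b

print('SGD estimate: y = %.3f * x + %.3f' % (m_sgd, b_sgd))
```

Because each step sees only a random subset, the estimates fluctuate slightly around the best-fit values instead of settling exactly; a smaller learning rate or a larger batch reduces that jitter.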