# Simple Linear Regression using Closed Form


# Fire up graphlab create

In [65]:
import graphlab
import numpy as np

# Load house sales data

The dataset contains house sales in King County, the region where the city of Seattle, WA is located.

In [67]:
sales = graphlab.SFrame('../kc_house_data.gl/')
sales.head(3)

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650,1,0
6414100192,2014-12-09 00:00:00+00:00,538000.0,3.0,2.25,2570.0,7242,2,0
5631500400,2015-02-25 00:00:00+00:00,180000.0,2.0,1.0,770.0,10000,1,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398
0,3,7,2170,400,1951,1991,98125,47.72102274
0,3,6,770,0,1933,0,98028,47.73792661

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0


# Split data into training and testing

**seed = 0** is used so that running this notebook multiple times gives the same results. In practice, the seed may be chosen at random.

In [68]:
train_data,test_data = sales.random_split(.8,seed=0)
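
To illustrate why fixing the seed matters, here is a minimal pure-Python sketch of a seeded random split. The helper `seeded_split` is hypothetical and is not how `random_split` is implemented internally; it only shows that the same seed reproduces the same partition.

```python
import random

def seeded_split(rows, fraction, seed):
    """Hypothetical helper: send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(20))
split_a = seeded_split(rows, 0.8, seed=0)
split_b = seeded_split(rows, 0.8, seed=0)
print(split_a == split_b)  # True: same seed, same partition
```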

# Useful SFrame summary functions

To use the closed-form solution and take advantage of graphlab's built-in functions, we first review a few important ones. In particular:
* Computing the sum of an SArray
* Computing the arithmetic average (mean) of an SArray
* Multiplying SArrays by constants
* Multiplying SArrays by other SArrays

In [69]:
# Let's compute the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray

sum_prices = prices.sum()
num_houses = prices.size() # when prices is an SArray .size() returns its length
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # if you just want the average, use the .mean() function
print "average price via method 1: " + str(avg_price_1)
print "average price via method 2: " + str(avg_price_2)

average price via method 1: 540088.141905
average price via method 2: 540088.141905


As we can see, both methods give the same answer.

In [70]:
# if we want to multiply every price by 0.5 it's as simple as:
half_prices = 0.5*prices
# Let's compute the sum of squares of price. We can also multiply two SArrays of the same length elementwise with *
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is an SArray of the squares and we want to add them up.
print "the sum of price squared is: " + str(sum_prices_squared)

the sum of price squared is: 9.21732513355e+15


# Build a generic simple linear regression function 

Armed with these SArray functions, we can use the closed-form solution from lecture to compute the slope and intercept for a simple linear regression on observations stored as SArrays: input_feature, output.
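
For reference, the closed-form solution can be written in terms of four averages. This is the standard least-squares result for a single feature. The notation is chosen for this sketch and should be checked against the lecture notes: $x_i$ is the input feature, $y_i$ the output, $N$ the number of observations, and $\bar{x}$, $\bar{y}$ the sample means.

$$
\text{slope} = \frac{\frac{1}{N}\sum_{i=1}^{N} x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \bar{x}^2},
\qquad
\text{intercept} = \bar{y} - \text{slope}\cdot\bar{x}
$$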

Complete the following function (or write your own) to compute the simple linear regression slope and intercept:

In [71]:
input_feature = train_data['sqft_living']
output = train_data['price']

mean_of_x_y = (input_feature * output).mean()
mean_of_x_square = (input_feature ** 2).mean()
mean_of_y = output.mean()
mean_of_x = input_feature.mean()
N = len(input_feature)

numerator_slope = mean_of_x_y - (mean_of_x * mean_of_y)
denominator_slope = mean_of_x_square - (mean_of_x * mean_of_x)

In [72]:
def simple_linear_regression(input_feature, output):
    # compute every average from the arguments, not from notebook-level variables
    mean_of_x = input_feature.mean()
    mean_of_y = output.mean()
    mean_of_x_y = (input_feature * output).mean()
    mean_of_x_square = (input_feature * input_feature).mean()
    # closed-form slope and intercept
    slope = (mean_of_x_y - mean_of_x * mean_of_y) / (mean_of_x_square - mean_of_x * mean_of_x)
    intercept = mean_of_y - slope * mean_of_x
    return intercept, slope

We can test that our function works by passing it something where we know the answer. In particular, we can generate a feature and then put the output exactly on a line: output = 1 + 1\*input_feature. Then we know both our slope and intercept should be 1.

In [73]:
test_feature = graphlab.SArray(range(5))
test_output = graphlab.SArray(1 + 1*test_feature)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print "Intercept: " + str(test_intercept)
print "Slope: " + str(test_slope)

Intercept: 1.0
Slope: 1.0
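
As a further check, the same closed form can be tried on a second known line. The sketch below is a pure-Python stand-in that uses plain lists instead of SArrays and a hypothetical helper `closed_form_fit`. It is fitted to points on y = 3 - 2x; a function that ignored its arguments would fail this test.

```python
def closed_form_fit(xs, ys):
    """Hypothetical list-based analogue of simple_linear_regression."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    slope = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x * mean_x)
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [0, 1, 2, 3, 4]
ys = [3 - 2 * x for x in xs]  # points exactly on the line y = 3 - 2x
fit_intercept, fit_slope = closed_form_fit(xs, ys)
print("Intercept: " + str(fit_intercept))  # Intercept: 3.0
print("Slope: " + str(fit_slope))          # Slope: -2.0
```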


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!

In [74]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print "Intercept: " + str(sqft_intercept)
print "Slope: " + str(sqft_slope)

Intercept: -47116.0765749
Slope: 281.958838568


# Predicting Values

Now that we have the model parameters (intercept and slope), we can make predictions. With SArrays it is easy to multiply an SArray by a constant and add a constant value. Complete the following function to return the predicted output given the input_feature, slope and intercept:

In [75]:
def get_regression_predictions(input_feature, intercept, slope):
    
    predicted_output = intercept + slope * input_feature
    return predicted_output

Now that we can calculate a prediction given the slope and intercept, let's make one. Use (or alter) the following to find the estimated price for a house with 2650 squarefeet according to the squarefeet model we estimated above.

In [76]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print "The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price)

The estimated price for a house with 2650 squarefeet is $700074.85
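
The figure above can be reproduced by hand from the printed parameters. This small sketch hard-codes the intercept and slope reported earlier in the notebook (copied from the printed output, so they are rounded) and applies the line equation directly.

```python
# Parameters as printed by the notebook for the sqft_living model (rounded printout).
sqft_intercept = -47116.0765749
sqft_slope = 281.958838568

my_house_sqft = 2650
estimated_price = sqft_intercept + sqft_slope * my_house_sqft
print("%.2f" % estimated_price)  # 700074.85
```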


# Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate our model using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, where a residual is just a fancy word for the difference between the predicted output and the true output.

In [77]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    
    predicted_output = get_regression_predictions(input_feature, intercept, slope)
    square_difference = np.square(output - predicted_output)
    RSS = np.sum(square_difference)
    return RSS 

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

In [81]:
print get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope) # should be 0.0

0
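
To see a non-zero RSS that can be verified by hand, consider a small made-up example in which one point sits off the line. The helper `rss` below is a hypothetical list-based stand-in for get_residual_sum_of_squares, and the data are invented purely for illustration.

```python
def rss(xs, ys, intercept, slope):
    """Hypothetical list-based analogue of get_residual_sum_of_squares."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2]
ys = [1, 2, 5]  # the last point is 2 above the line y = 1 + x
print(rss(xs, ys, 1, 1))  # 4: the residuals are 0, 0 and 2
```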


In [79]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is : 1201918356321967.2


# Predict the squarefeet given price

What if we want to predict the squarefeet given the price? Since we have an equation y = a + b\*x, we can solve it for x: x = (y - a) / b. So if we have the intercept (a), the slope (b) and the price (y), we can solve for the estimated squarefeet (x).

In [82]:
def inverse_regression_predictions(output, intercept, slope):

    estimated_feature = (output - intercept) / slope
    return estimated_feature

Now that we have a function to compute the squarefeet given the price from our simple regression model, let's see how big we might expect a house that costs $800,000 to be.

In [83]:
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print "The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet)

The estimated squarefeet for a house worth $800000.00 is 3004
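
A quick way to check an inverse function is a round trip: invert a price to squarefeet, then predict the price back from that value. The sketch below hard-codes the parameters printed earlier in the notebook, and the helper names `predict` and `inverse_predict` are hypothetical stand-ins for the notebook's functions.

```python
# Parameters as printed by the notebook for the sqft_living model (rounded printout).
sqft_intercept = -47116.0765749
sqft_slope = 281.958838568

def predict(x, intercept, slope):
    return intercept + slope * x

def inverse_predict(y, intercept, slope):
    return (y - intercept) / slope

sqft = inverse_predict(800000, sqft_intercept, sqft_slope)
print(int(sqft))  # 3004

# Predicting from the inverted value should recover the original price.
round_trip = predict(sqft, sqft_intercept, sqft_slope)
print(abs(round_trip - 800000) < 1e-6)  # True
```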


# New Model: estimate prices from bedrooms

We have made one model for predicting house prices using squarefeet, but there are many other features in the sales SFrame.
Use your simple linear regression function to estimate the regression parameters for predicting price based on the number of bedrooms. Use the training data!

In [90]:
input_feature = train_data['bedrooms']
output = train_data['price']

mean_of_x_y = (input_feature * output).mean()
mean_of_x_square = (input_feature ** 2).mean()
mean_of_y = output.mean()
mean_of_x = input_feature.mean()

numerator_slope = mean_of_x_y - (mean_of_x * mean_of_y)
denominator_slope = mean_of_x_square - (mean_of_x * mean_of_x)

bedrooms_intercept, bedrooms_slope = simple_linear_regression(input_feature, output)

# Testing our Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Compute the RSS on the TEST data for both the bedrooms model and the squarefeet model. The model with the lower test RSS predicts better on houses it has not seen.

In [91]:
# Compute RSS when using bedrooms on TEST data:
rss_bedrooms_test = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedrooms_intercept, bedrooms_slope)
print 'Test RSS using bedrooms: ' + str(rss_bedrooms_test)

In [89]:
# Compute RSS when using squarefeet on TEST data:
rss_sqft_test = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)
print 'Test RSS using squarefeet: ' + str(rss_sqft_test)
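
The comparison procedure can be sketched end to end on a tiny made-up dataset: fit each single-feature model on training rows, then compare RSS on held-out rows. Everything here is hypothetical (the helper names, the features `a` and `b`, and the numbers), and plain lists stand in for SArrays.

```python
def fit(xs, ys):
    """Closed-form simple linear regression on plain lists."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    slope = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x * mean_x)
    return mean_y - slope * mean_x, slope

def rss(xs, ys, intercept, slope):
    """Sum of squared differences between true and predicted outputs."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Invented data: the output follows feature `a` exactly, and `b` only loosely.
train = {'a': [1, 2, 3, 4], 'b': [1, 1, 2, 2], 'y': [2, 4, 6, 8]}
test = {'a': [5, 6], 'b': [3, 1], 'y': [10, 12]}

test_rss = {}
for name in ('a', 'b'):
    intercept, slope = fit(train[name], train['y'])             # train on training rows only
    test_rss[name] = rss(test[name], test['y'], intercept, slope)  # evaluate on held-out rows

best = min(test_rss, key=test_rss.get)
print("lower test RSS: " + best)
```

Fitting on the training rows and scoring on the test rows keeps the comparison honest, since neither model has seen the rows it is judged on.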

# Good Luck !!!