# Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:
* Use GraphLab Create SArray and SFrame functions to compute important summary statistics
* Write a function to compute the simple linear regression weights using the closed-form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices


# Fire up GraphLab Create

In [1]:
import graphlab

# Load house sales data

The dataset comes from house sales in King County, the region where the city of Seattle, WA is located.

In [4]:
sales = graphlab.SFrame('kc_house_data.gl/')

# Split data into training and testing

We use seed=0 so that everyone running this notebook gets the same results.  In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).  

In [13]:
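# Split the data roughly 80% / 20%; seed=0 makes the split reproducible
train_data, test_data = sales.random_split(.8, seed=0)

prices = sales['price']
prices.size()
test_data  # last expression in the cell, so the notebook displays the first rows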

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
114101516,2014-05-28 00:00:00+00:00,310000.0,3.0,1.0,1430.0,19901,1.5,0
9297300055,2015-01-24 00:00:00+00:00,650000.0,4.0,3.0,2950.0,5000,2.0,0
1202000200,2014-11-03 00:00:00+00:00,233000.0,3.0,2.0,1710.0,4697,1.5,0
8562750320,2014-11-10 00:00:00+00:00,580500.0,3.0,2.5,2320.0,3980,2.0,0
7589200193,2014-11-10 00:00:00+00:00,535000.0,3.0,1.0,1090.0,3000,1.5,0
2078500320,2014-06-20 00:00:00+00:00,605000.0,4.0,2.5,2620.0,7553,2.0,0
7766200013,2014-08-11 00:00:00+00:00,775000.0,4.0,2.25,4220.0,24186,1.0,0
9478500640,2014-08-19 00:00:00+00:00,292500.0,4.0,2.5,2250.0,4495,2.0,0
9558200045,2014-08-28 00:00:00+00:00,289000.0,3.0,1.75,1260.0,8400,1.0,0
8820901275,2014-06-10 00:00:00+00:00,571000.0,4.0,2.0,2750.0,7807,1.5,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,4,7,1430,0,1927,0,98028,47.75584254
3,3,9,1980,970,1979,0,98126,47.57136955
0,5,6,1710,0,1941,0,98002,47.30482931
0,3,8,2320,0,2003,0,98027,47.5391103
0,4,8,1090,0,1929,0,98117,47.68889559
0,3,8,2620,0,1996,0,98056,47.53013988
0,3,8,2600,1620,1984,0,98166,47.44504345
0,3,7,2250,0,2008,0,98042,47.36628767
0,3,7,1260,0,1954,0,98148,47.43658598
0,5,7,2250,500,1916,0,98125,47.7168015

long,sqft_living15,sqft_lot15
-122.22874498,1780.0,12697.0
-122.37541218,2140.0,4000.0
-122.21774909,1030.0,4705.0
-122.06971484,2580.0,3980.0
-122.3752359,1570.0,5080.0
-122.18000831,2620.0,11884.0
-122.34720874,2410.0,30617.0
-122.11356981,2250.0,4500.0
-122.3346675,1290.0,8750.0
-122.28694727,1510.0,7807.0
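
To see why fixing the seed makes the split reproducible, here is a minimal sketch in plain Python. It is not GraphLab Create's implementation: `random_split_rows` is a hypothetical helper that sends each row to the first list with probability `fraction`, using a generator seeded with `seed`.

```python
import random


def random_split_rows(rows, fraction, seed):
    """Hypothetical helper: assign each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a private generator, so the seed fully determines the split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))  # stand-in for the rows of the sales table
train_a, test_a = random_split_rows(rows, 0.8, seed=0)
train_b, test_b = random_split_rows(rows, 0.8, seed=0)

# Same seed, same split; every row lands in exactly one of the two parts.
same_split = (train_a == train_b) and (test_a == test_b)
```

Because the split is random, the first part holds approximately (not exactly) 80% of the rows.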


# Useful SFrame summary functions


This section reviews a few SArray operations that are useful for the closed-form solution:

* Computing the sum of an SArray
* Computing the arithmetic average (mean) of an SArray
* Multiplying SArrays by constants
* Multiplying SArrays by other SArrays

In [5]:
# Let's compute the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = prices.size() # when prices is an SArray .size() returns its length
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # if you just want the average, use the .mean() function
print "average price via method 1: " + str(avg_price_1)
print "average price via method 2: " + str(avg_price_2)

average price via method 1: 540088.141905
average price via method 2: 540088.141905


As we can see, both methods give the same answer.

In [6]:
# if we want to multiply every price by 0.5 it's as simple as:
half_prices = 0.5*prices
# Let's compute the sum of squares of price. We can also multiply two SArrays of the same length elementwise with *
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum() # prices_squared is an SArray of the squares and we want to add them up.
print "the sum of price squared is: " + str(sum_prices_squared)

the sum of price squared is: 9.21732513355e+15


# Build a generic simple linear regression function 

In [7]:
def simple_linear_regression(input_feature, output):
    # compute the sum of input_feature and output, and the number of observations
    input_sum = input_feature.sum()
    output_sum = output.sum()
    N = float(len(input_feature))  # float, so every division below is a true division in Python 2

    # compute the product of the output and the input_feature and its sum
    prod = map(lambda x, y: x*y, input_feature, output)
    prod_sum = sum(prod)

    # compute the squared value of the input_feature and its sum
    sq = map(lambda x: x*x, input_feature)
    sq_sum = sum(sq)

    # print the intermediate sums as a sanity check
    print("sq_sum", sq_sum)
    print("prod_sum", prod_sum)
    print("input_sum", input_sum)
    print("output_sum", output_sum)

    # slope = (sum(x*y) - sum(x)*sum(y)/N) / (sum(x*x) - sum(x)*sum(x)/N)
    slope = (prod_sum - input_sum * output_sum / N) / (sq_sum - input_sum * input_sum / N)

    # intercept = mean(output) - slope * mean(input_feature)
    intercept = output_sum / N - slope * input_sum / N

    return (intercept, slope)
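
For reference, the function above evaluates the standard closed-form least-squares solution for a single input. The symbols are introduced here for illustration and do not appear in the notebook: $x_i$ is the input feature of house $i$, $y_i$ its output, and $N$ the number of houses.

```latex
\text{slope} =
  \frac{\sum_{i} x_i y_i - \frac{1}{N} \sum_{i} x_i \sum_{i} y_i}
       {\sum_{i} x_i^2 - \frac{1}{N} \left( \sum_{i} x_i \right)^2},
\qquad
\text{intercept} = \frac{1}{N} \sum_{i} y_i - \text{slope} \cdot \frac{1}{N} \sum_{i} x_i
```

In the code, `prod_sum` is $\sum_i x_i y_i$, `sq_sum` is $\sum_i x_i^2$, `input_sum` is $\sum_i x_i$ and `output_sum` is $\sum_i y_i$.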



We can test the function on data where we already know the answer. If we generate a feature and place the output exactly on a line, output = 1 + 1\*input_feature, then both the slope and the intercept should be 1.

In [8]:
test_feature = graphlab.SArray(range(5))
test_output = graphlab.SArray(1 + 1*test_feature)
print(test_feature)
print(test_output)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print "Intercept: " + str(test_intercept)
print "Slope: " + str(test_slope)

[0L, 1L, 2L, 3L, 4L]
[1L, 2L, 3L, 4L, 5L]
('sq_sum', 30L)
('prod_sum', 40L)
('input_sum', 10L)
('output_sum', 15L)
Intercept: 1.0
Slope: 1.0
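
The same check can be reproduced without GraphLab Create. The sketch below is a plain-Python analogue of the function above, operating on ordinary lists. The name `simple_linear_regression_lists` is an assumption made for this illustration and is not part of the notebook.

```python
def simple_linear_regression_lists(xs, ys):
    """Closed-form least-squares fit for one input; returns (intercept, slope)."""
    n = float(len(xs))  # float keeps every division a true division on Python 2 as well
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Points placed exactly on the line output = 1 + 1*input
xs = list(range(5))
ys = [1 + 1 * x for x in xs]
intercept, slope = simple_linear_regression_lists(xs, ys)

# A second line, output = 3 - 2*input, to check a negative slope
ys_down = [3 - 2 * x for x in xs]
intercept_down, slope_down = simple_linear_regression_lists(xs, ys_down)
```

Both fits recover the line that generated the data, which is the behaviour the test cell above relies on.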


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!

In [9]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print "Intercept: " + str(sqft_intercept)
print "Slope: " + str(sqft_slope)
print train_data['sqft_living']

('sq_sum', 89977452623.0)
('prod_sum', 23666256847942.0)
('input_sum', 36159233.0)
('output_sum', 9376349465.0)
Intercept: -47116.0765749
Slope: 281.958838568
[1180.0, 2570.0, 770.0, 1960.0, 1680.0, 5420.0, 1715.0, 1060.0, 1780.0, 1890.0, 3560.0, 1160.0, 1370.0, 1810.0, 1890.0, 1600.0, 1200.0, 1250.0, 1620.0, 3050.0, 2270.0, 1070.0, 2450.0, 2450.0, 1400.0, 1520.0, 2570.0, 1190.0, 2330.0, 2060.0, 2300.0, 1660.0, 2360.0, 1220.0, 2570.0, 3595.0, 1570.0, 1280.0, 3160.0, 990.0, 2290.0, 1250.0, 2753.0, 1190.0, 3150.0, 1410.0, 1980.0, 2730.0, 2830.0, 2420.0, 3250.0, 1850.0, 2150.0, 2519.0, 1540.0, 1660.0, 2770.0, 2720.0, 2240.0, 1000.0, 3200.0, 4770.0, 1260.0, 2380.0, 3430.0, 1760.0, 1040.0, 1410.0, 3450.0, 2350.0, 2020.0, 1680.0, 960.0, 2140.0, 2660.0, 2770.0, 1610.0, 1030.0, 3520.0, 1200.0, 1580.0, 1580.0, 3300.0, 1160.0, 1810.0, 2320.0, 2070.0, 1980.0, 2190.0, 2920.0, 1210.0, 2340.0, 1670.0, 1240.0, 3140.0, 2310.0, 1260.0, 1540.0, 2080.0, 4380.0, ... ]


# Predicting Values

Now that we have the model parameters (intercept and slope), we can make predictions. With SArrays it's easy to multiply an SArray by a constant and add a constant value. Complete the following function to return the predicted output given the input_feature, slope and intercept:

In [13]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = slope * input_feature + intercept
    
    return predicted_values

Now that we can calculate a prediction given the slope and intercept, let's make one. Use (or alter) the following to find the estimated price for a house with 2650 square feet according to the square-feet model we estimated above:

In [14]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print "The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price)

The estimated price for a house with 2650 squarefeet is $700074.85
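
The printed estimate can be checked by hand from the intercept and slope shown earlier. The sketch below copies those two printed values as literals, so it agrees with the notebook only to within the rounding of the printout.

```python
# Values copied from the printed output of the sqft_living fit above
sqft_intercept = -47116.0765749
sqft_slope = 281.958838568

my_house_sqft = 2650

# prediction = intercept + slope * input
estimated_price = sqft_intercept + sqft_slope * my_house_sqft
formatted = "%.2f" % estimated_price
```

Here `formatted` matches the dollar amount printed by the cell above.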


# Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate the model using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, where a residual is just a fancy word for the difference between the predicted output and the true output.

Complete the following (or write your own) function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [15]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First get the predictions
    predictions = get_regression_predictions(input_feature, intercept, slope)

    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residuals = output - predictions

    # square the residuals and add them up
    RSS = (residuals * residuals).sum()

    return RSS

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

In [16]:
print get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope) # should be 0.0

0.0
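
To make the definition concrete, here is a small plain-Python sketch of RSS on ordinary lists. The function name `residual_sum_of_squares_lists` and the perturbed data set are assumptions made for this illustration only.

```python
def residual_sum_of_squares_lists(xs, ys, intercept, slope):
    """Sum of squared differences between the true outputs and the line's predictions."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals)


xs = [0, 1, 2, 3, 4]
on_line = [1, 2, 3, 4, 5]    # exactly output = 1 + 1*input
off_line = [1, 2, 4, 4, 5]   # the point at input 2 is moved up by 1

rss_exact = residual_sum_of_squares_lists(xs, on_line, 1.0, 1.0)
rss_off = residual_sum_of_squares_lists(xs, off_line, 1.0, 1.0)
```

Moving a single point one unit off the line gives one residual of size 1, so the RSS rises from 0 to 1.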


Now compute the RSS of the sqft_living model on the training data:

In [17]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is : 1.20191835632e+15
