# NOTE: Modified from sample to use Pandas instead of SFrame/SArray and graphlab create

# Regression Week 1: Simple Linear Regression

In this notebook we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:
* Use pandas Series and DataFrame functions to compute important summary statistics
* Write a function to compute the Simple Linear Regression weights using the closed form solution
* Write a function to make predictions of the output given the input feature
* Turn the regression around to predict the input given the output
* Compare two different models for predicting house prices

In this notebook you will be provided with some already complete code as well as some code that you should complete yourself in order to answer quiz questions. The code we provide to complete is optional and is there to assist you with solving the problems, but feel free to ignore the helper code and write your own.

In [1]:
import pandas as pd
import numpy as np

## Load house sales data

In [2]:
# dnames_list = ["id","date","price","bedrooms","bathrooms","sqft_living","sqft_lot","floors","waterfront","view","condition","grade","sqft_above","sqft_basement","yr_built","yr_renovated","zipcode","lat","long","sqft_living15","sqft_lot15"]
# Data types
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 
              'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 
              'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 
              'sqft_lot':int, 'view':int}

train_datafile = 'kc_house_train_data.csv'
train_data = pd.read_csv(train_datafile, dtype=dtype_dict)

test_datafile = 'kc_house_test_data.csv'
test_data = pd.read_csv(test_datafile, dtype=dtype_dict)

In [3]:
train_data.describe()

statistic,price,bedrooms,bathrooms,sqft_living,sqft_lot,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,lat,long,sqft_living15,sqft_lot15
count,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0,17384.0
mean,539366.627934,3.369363,2.115048,2080.02951,15091.91124,0.007651,0.236079,3.41078,7.655028,1787.844512,292.184998,1971.152727,83.107973,47.559313,-122.213281,1985.994995,12776.380867
std,369691.178858,0.906468,0.771783,921.630888,41459.272327,0.087136,0.768008,0.649792,1.169818,827.107595,444.404136,29.328722,398.692283,0.138703,0.140906,686.512835,27175.730523
min,75000.0,0.0,0.0,290.0,520.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,47.1593,-122.519,399.0,651.0
25%,320000.0,3.0,1.75,1420.0,5049.5,0.0,0.0,3.0,7.0,1200.0,0.0,1952.0,0.0,47.46865,-122.328,1490.0,5100.0
50%,450000.0,3.0,2.25,1910.0,7616.0,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,47.5714,-122.229,1840.0,7620.0
75%,640000.0,4.0,2.5,2550.0,10665.25,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,47.677625,-122.125,2360.0,10065.25
max,7700000.0,10.0,8.0,13540.0,1651359.0,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,47.7776,-121.315,6210.0,871200.0


In [4]:
test_data.describe()

statistic,price,bedrooms,bathrooms,sqft_living,sqft_lot,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,lat,long,sqft_living15,sqft_lot15
count,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0,4229.0
mean,543054.043036,3.376921,2.113561,2079.36628,15168.859068,0.007094,0.227004,3.403878,7.66446,1790.635848,288.730433,1970.398439,89.722629,47.563092,-122.216426,1988.844171,12735.877749
std,356421.245803,1.021434,0.76356,905.317454,41265.627014,0.083936,0.759375,0.654686,1.198476,832.21553,435.016314,29.552143,413.736867,0.137965,0.140497,680.837632,27829.200218
min,85000.0,0.0,0.0,370.0,600.0,0.0,0.0,1.0,4.0,370.0,0.0,1900.0,0.0,47.1559,-122.514,700.0,660.0
25%,325000.0,3.0,1.75,1430.0,5027.0,0.0,0.0,3.0,7.0,1180.0,0.0,1951.0,0.0,47.4766,-122.33,1490.0,5105.0
50%,453000.0,3.0,2.25,1920.0,7633.0,0.0,0.0,3.0,7.0,1570.0,0.0,1974.0,0.0,47.5734,-122.239,1840.0,7611.0
75%,650000.0,4.0,2.5,2550.0,10760.0,0.0,0.0,4.0,8.0,2230.0,560.0,1996.0,0.0,47.6795,-122.125,2370.0,10159.0
max,6885000.0,33.0,7.75,9890.0,1024068.0,1.0,4.0,5.0,13.0,8860.0,2610.0,2015.0,2015.0,47.7776,-121.315,5030.0,858132.0


## Compute some numbers from the data

In [5]:
# Let's compute the mean of the house prices in King County (here, the test set) in 2 different ways.
prices = test_data['price'] # prices as a single data series

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = prices.count()
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean()
print("average price via method 1: {}".format(str(avg_price_1)))
print("average price via method 2: {}".format(str(avg_price_2)))

average price via method 1: 543054.043036
average price via method 2: 543054.0430361788


As we can see, we get the same answer both ways.
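
One detail worth keeping in mind when computing a mean by hand: `count()` ignores missing values, while `len()` counts every row. The two only agree when the column has no missing entries. The sketch below uses a made-up Series (`toy_prices` is hypothetical and not part of the King County data) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices with one missing value (not from the King County data).
toy_prices = pd.Series([100.0, np.nan, 300.0])

n_rows = len(toy_prices)       # counts every row, including the NaN
n_valid = toy_prices.count()   # counts only the non-missing values

manual_mean = toy_prices.sum() / n_valid   # sum() skips NaN by default
builtin_mean = toy_prices.mean()           # mean() also skips NaN by default

print(n_rows, n_valid, manual_mean, builtin_mean)
```

Dividing by `len()` instead of `count()` here would understate the mean, so the choice of denominator matters whenever a column may contain gaps.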

In [6]:
# if we want to multiply every price by 0.5 it's as simple as:
half_prices = 0.5*prices

# Let's compute the sum of squares of price. We can also multiply two pandas Series of the same length elementwise with *
prices_squared = prices*prices
sum_prices_squared = prices_squared.sum()
print("the sum of price squared is: {}".format(str(sum_prices_squared)))

the sum of price squared is: 1784273286136298.0


# Build a generic simple linear regression function

Complete the following function to compute the simple linear regression slope and intercept using the closed-form solution:

In [7]:
def simple_linear_regression(input_feature, output):
    N = len(input_feature)
    
    # compute the sum of input_feature and output
    sum_input_feature = input_feature.sum()
    sum_output = output.sum()
    
    # compute the product of the output and the input_feature and its sum
    prod_input_output = input_feature * output
    sum_prod_input_output = prod_input_output.sum()
    
    # compute the squared value of the input_feature and its sum
    square_input_feature = input_feature * input_feature
    sum_square_input_feature = square_input_feature.sum()
    
    # use the formula for the slope
    slope = (sum_prod_input_output - sum_input_feature * sum_output / N) / (sum_square_input_feature - sum_input_feature * sum_input_feature / N)
    
    # use the formula for the intercept
    intercept = sum_output / N - slope * sum_input_feature / N
    
    return (intercept, slope)
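
For reference, the two expressions used in the function correspond to the closed-form least-squares solution. Writing $x_i$ for the values of `input_feature` and $y_i$ for the values of `output` (this notation is an assumption introduced here, not part of the original notebook):

$$
\text{slope} = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}{\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2},
\qquad
\text{intercept} = \frac{1}{N}\sum_{i=1}^{N} y_i - \text{slope}\cdot\frac{1}{N}\sum_{i=1}^{N} x_i
$$

The intercept is simply the mean output minus the slope times the mean input, so the fitted line always passes through the point of means.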

We can test that our function works by passing it data for which we know the answer. In particular, we can generate a feature and place the output exactly on a line, output = 1 + 1\*input_feature, so both the slope and the intercept should be 1.

In [8]:
test_input = np.arange(0,5)
test_output = 1 + 1.0 * test_input
test_output
test_intercept, test_slope = simple_linear_regression(test_input,test_output)
print("Intercept: {}".format(test_intercept))
print("Slope: {}".format(test_slope))

Intercept: 1.0
Slope: 1.0
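
A perfect line is an easy test. As a slightly stronger check, the closed-form result can be compared against NumPy's `np.polyfit` on noisy data. The sketch below restates the function so that it runs on its own; the synthetic data (a line `y = 3 + 2x` plus Gaussian noise, with a fixed seed) is an assumption made purely for illustration.

```python
import numpy as np

def simple_linear_regression(input_feature, output):
    # Same closed-form formulas as above, restated so this cell is self-contained.
    N = len(input_feature)
    sum_x = input_feature.sum()
    sum_y = output.sum()
    sum_xy = (input_feature * output).sum()
    sum_xx = (input_feature * input_feature).sum()
    slope = (sum_xy - sum_x * sum_y / N) / (sum_xx - sum_x * sum_x / N)
    intercept = sum_y / N - slope * sum_x / N
    return (intercept, slope)

# Hypothetical noisy data around the line y = 3 + 2x (fixed seed for repeatability).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.shape)

intercept, slope = simple_linear_regression(x, y)

# np.polyfit with degree 1 returns coefficients highest power first: [slope, intercept].
ref_slope, ref_intercept = np.polyfit(x, y, 1)

print(np.isclose(slope, ref_slope), np.isclose(intercept, ref_intercept))
```

Both approaches solve the same least-squares problem, so the two sets of coefficients should agree up to floating-point error.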


Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!

In [9]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print("Intercept: {}".format(sqft_intercept))
print("Slope: {}".format(sqft_slope))

Intercept: -47116.07907289418
Slope: 281.9588396303426


# Predicting Values

Now that we have the model parameters intercept and slope, we can make predictions. Complete the following function to return the predicted output given the input_feature, slope, and intercept:

In [10]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = intercept + slope * input_feature
    
    return predicted_values

Now that we can calculate a prediction given the slope and intercept, let's make a prediction.  Estimate the price for a house with 2650 sqft according to the sqft model above.

**Quiz Question: Using your slope and intercept from the sqft model above, what is the predicted price for a house with 2650 sqft?**

In [11]:
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print("The estimated price for a house with {} sqft is ${:.2f}".format(my_house_sqft, estimated_price))

The estimated price for a house with 2650 sqft is $700074.85
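
Because the prediction is plain arithmetic, the same function also works on a whole pandas Series at once. The sketch below hard-codes the intercept and slope printed for the sqft model above; the list of living areas is hypothetical, apart from the quiz value of 2650.

```python
import pandas as pd

def get_regression_predictions(input_feature, intercept, slope):
    # Works for a scalar, a NumPy array, or a pandas Series.
    return intercept + slope * input_feature

# Values copied from the sqft model output above.
sqft_intercept = -47116.07907289418
sqft_slope = 281.9588396303426

# Hypothetical living areas; 2650 is the quiz value.
sqft_values = pd.Series([1000, 2650, 4000])
predicted_prices = get_regression_predictions(sqft_values, sqft_intercept, sqft_slope)

print(predicted_prices.round(2))
```

The middle entry reproduces the single-house estimate, and the other two are computed in the same pass without an explicit loop.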


# Residual Sum of Squares

Now that we have a model that can make predictions, let's evaluate it using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, and a residual is just a fancy word for the difference between the predicted output and the true output.

In [12]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First get the predictions
    predictions = get_regression_predictions(input_feature, intercept, slope)
    
    # Then compute the residuals
    residuals = predictions - output
    
    # Then square the residuals and add them up
    RSS = (residuals * residuals).sum()
    
    return RSS

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

In [13]:
print(get_residual_sum_of_squares(test_input, test_output, test_intercept, test_slope)) # should be 0.0

0.0
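
A zero RSS only confirms the perfect-fit case. As a complementary check, here is a tiny worked example where the answer is non-zero and easy to verify by hand. The three points and the candidate line `y = 1 + x` are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 2.0])   # hypothetical outputs that do not lie on a line
intercept, slope = 1.0, 1.0     # hypothetical candidate line y = 1 + x

predictions = intercept + slope * x    # [1, 2, 3]
residuals = predictions - y            # [0, -1, 1]
rss = (residuals * residuals).sum()    # 0 + 1 + 1

print(rss)  # 2.0
```

Note that the sign of each residual does not matter, since every residual is squared before summing.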


Now use your function to calculate the RSS on training data from the squarefeet model calculated above.

**Quiz Question: According to this function and the slope and intercept from the squarefeet model, what is the RSS for the simple linear regression using squarefeet to predict prices on TRAINING data?**

In [14]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print("The RSS of predicting prices based on sqft is: {}".format(str(rss_prices_on_sqft)))

The RSS of predicting prices based on sqft is: 1201918354177286.2


# Predict sqft given prices

What if we want to predict the square footage given the price? Since we have an equation y = a + bx, we can solve it for x: x = (y - a) / b.

In [15]:
def inverse_regression_predictions(output, intercept, slope):
    estimated_feature = (output - intercept) / slope
    return estimated_feature

Now that we have a function to compute the squarefeet given the price from our simple regression model, let's see how big we might expect a house that costs $800,000 to be.

**Quiz Question: According to this function and the regression slope and intercept from the sqft model above, what is the estimated square-feet for a house costing $800,000?**

In [16]:
my_house_price = 800000
estimated_sqft = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print("The estimated sqft for a house worth {:.2f} is {:.0f}".format(my_house_price, estimated_sqft))

The estimated sqft for a house worth 800000.00 is 3004
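
Algebraically inverting the fitted line is not the same as fitting a new regression of sqft on price. The two generally give different lines unless the data lie exactly on a line. The sketch below illustrates this with synthetic data; the "sqft" and "price" stand-ins, the noise level, and the seed are all assumptions chosen for illustration, not the King County data.

```python
import numpy as np

def simple_linear_regression(input_feature, output):
    # Same closed-form formulas as above, restated so this cell is self-contained.
    N = len(input_feature)
    sum_x = input_feature.sum()
    sum_y = output.sum()
    sum_xy = (input_feature * output).sum()
    sum_xx = (input_feature * input_feature).sum()
    slope = (sum_xy - sum_x * sum_y / N) / (sum_xx - sum_x * sum_x / N)
    intercept = sum_y / N - slope * sum_x / N
    return (intercept, slope)

def inverse_regression_predictions(output, intercept, slope):
    return (output - intercept) / slope

# Hypothetical stand-ins for sqft (x) and price (y), with substantial noise.
rng = np.random.default_rng(1)
x = rng.uniform(500.0, 4000.0, size=200)
y = 50000.0 + 250.0 * x + rng.normal(0.0, 100000.0, size=200)

a, b = simple_linear_regression(x, y)   # fit y on x
c, d = simple_linear_regression(y, x)   # fit x on y directly

# Round trip: inverting a prediction from the same line recovers the input.
x0 = 2650.0
round_trip = inverse_regression_predictions(a + b * x0, a, b)

print("round trip:", round_trip)
print("inverted slope 1/b:", 1.0 / b)
print("direct slope d:    ", d)
```

The product `b * d` equals the squared correlation between the two variables, so with noisy data and a positive slope the direct slope `d` is smaller than `1 / b`.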


# New Model: estimate prices from bedrooms

We have made one model for predicting house prices using sqft, but there are other features in the sales data. Use the simple linear regression function to estimate the regression parameters for predicting prices based on the number of bedrooms. Use the training data!

In [17]:
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
bdr_intercept, bdr_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])

print("Intercept: {}".format(bdr_intercept))
print("Slope: {}".format(bdr_slope))

Intercept: 109473.1776229596
Slope: 127588.95293398784


# Test your Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using squarefeet.

**Quiz Question: Which model (square feet or bedrooms) has lowest RSS on TEST data? Think about why this might be the case.**

In [18]:
# Compute RSS using bedroom model with TEST data:
rss_prices_on_bdr_test = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bdr_intercept, bdr_slope)
print("The RSS of predicting prices based on bedrooms with TEST data is: {}".format(str(rss_prices_on_bdr_test)))

The RSS of predicting prices based on bedrooms with TEST data is: 493364585960301.4


In [19]:
# Compute RSS using sqft model with TEST data:
rss_prices_on_sqft_test = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)
print("The RSS of predicting prices based on sqft with TEST data is: {}".format(str(rss_prices_on_sqft_test)))

The RSS of predicting prices based on sqft with TEST data is: 275402933617813.1


**The sqft model performs better than the bedroom model (RSS_sqft < RSS_bdr).  This makes sense as the integer bedroom data is very coarse and likely not as strong a factor as sqft.**
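
Raw RSS values are hard to interpret because they grow with the number of houses and are in squared dollars. One optional way to put them on a more readable scale is the root mean squared error, RMSE = sqrt(RSS / N), which is in dollars. The sketch below uses the two TEST RSS values and the test row count (4229) printed above; RMSE is an added convenience here and not part of the original assignment.

```python
import math

# RSS values and row count copied from the outputs above (the test set has 4229 rows).
n_test = 4229
rss_sqft_test = 275402933617813.1
rss_bdr_test = 493364585960301.4

# RMSE = sqrt(RSS / N): a typical prediction error expressed in dollars.
rmse_sqft = math.sqrt(rss_sqft_test / n_test)
rmse_bdr = math.sqrt(rss_bdr_test / n_test)

print("RMSE (sqft model):     ${:,.0f}".format(rmse_sqft))
print("RMSE (bedrooms model): ${:,.0f}".format(rmse_bdr))
```

Since both models are evaluated on the same test set, ranking them by RMSE gives the same ordering as ranking them by RSS.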