# Simple Linear Regression

# Start graphlab

In [5]:
import graphlab

# Loading the house sales data
The dataset has been described previously.

In [9]:
sales = graphlab.SFrame('kc_house_data.gl/')

# Split data into training and testing

The data is split into two parts: training data, used to fit the model, and test data, used to evaluate the fitted model. The training set holds roughly 80% of the rows and the test set the remaining 20%.


In [10]:
train_data,test_data = sales.random_split(.8,seed=0) 
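
To make the idea of a seeded split concrete without GraphLab, here is a minimal sketch on a plain Python list. The helper `random_split_rows` is hypothetical and only mimics the spirit of `random_split`: it assumes each row is sent to the first part with probability `fraction`, so the sizes are approximately, not exactly, 80/20.

```python
import random

def random_split_rows(rows, fraction, seed=0):
    """Hypothetical helper: send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the sales data
train_rows, test_rows = random_split_rows(rows, 0.8, seed=0)
print("train: %d rows, test: %d rows" % (len(train_rows), len(test_rows)))
```

Calling the helper again with the same seed returns the same two lists, which is why the notebook fixes `seed=0`.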

In [11]:
prices = sales['price'] # defining price array of the sales dataset
sum_prices = prices.sum() # .sum() returns the sum of all values in that array
num_houses = prices.size() # .size() returns the length of an array
avg_price = sum_prices/num_houses
print "average price: " + str(avg_price)

average price: 540088.141905


# Building a simple linear regression function 

Now let's compute the slope and intercept of a simple linear regression. The function below uses the closed-form solution.
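
For reference, the closed-form solution comes from setting the derivatives of the residual sum of squares with respect to the intercept and the slope to zero. The notation is assumed here and does not appear in the notebook: $x_i$ is the input feature, $y_i$ the output, $N$ the number of rows, $\hat{w}_0$ the intercept and $\hat{w}_1$ the slope.

```latex
\[
\hat{w}_1 = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}
                 {\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2},
\qquad
\hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}
\]
```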

In [12]:
def simple_linear_regression(input_feature, output):
    Xi = input_feature
    Yi = output
    N = float(len(Xi))  # float, so the divisions below are never integer division in Python 2
    # compute the mean of input_feature and output
    Ymean = Yi.mean()
    Xmean = Xi.mean()
    
    # compute the sum of products x_i * y_i, and (sum of x) * (sum of y) / N
    SumYiXi = (Yi * Xi).sum()
    YiXiByN = (Yi.sum() * Xi.sum()) / N
    
    # compute the sum of squares x_i^2, and (sum of x)^2 / N
    XiSq = (Xi * Xi).sum()
    XiXiByN = (Xi.sum() * Xi.sum()) / N
    
    # use the formula for the slope
    slope = (SumYiXi - YiXiByN) / (XiSq - XiXiByN)
    
    # use the formula for the intercept
    intercept = Ymean - (slope * Xmean)
    return (intercept, slope)
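
As a quick sanity check, the same closed-form arithmetic can be applied to points that lie exactly on a known line. This sketch works on plain Python lists rather than SArrays; the name `simple_linear_regression_lists` and the toy data on y = 1 + 2x are assumptions made for illustration.

```python
def simple_linear_regression_lists(xs, ys):
    """Hypothetical list-based version of the closed-form fit."""
    n = float(len(xs))
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # slope and intercept from the closed-form formulas
    slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x * sum_x / n)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]  # points exactly on y = 1 + 2x
toy_intercept, toy_slope = simple_linear_regression_lists(xs, ys)
print("Intercept: %.3f, Slope: %.3f" % (toy_intercept, toy_slope))
# prints "Intercept: 1.000, Slope: 2.000"
```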

Let's build a regression model for predicting price based on sqft_living.

In [13]:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])

print "Intercept: " + str(sqft_intercept)
print "Slope: " + str(sqft_slope)

Intercept: -47116.0765749
Slope: 281.958838568


# Predicting Values
The function below predicts house prices from the given feature values, using a fitted intercept and slope.

In [15]:
def get_regression_predictions(input_feature, intercept, slope):
    # calculating the predicted values
    predicted_values = intercept + (slope * input_feature)
    return predicted_values

In [16]:
my_house_sqft = 1000 # Let's estimate the price for a house with 1000 square feet with my model defined above.
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print "The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price)

The estimated price for a house with 1000 squarefeet is $234842.76
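
The estimate above is just the fitted line evaluated at 1000 square feet. The sketch below reproduces that arithmetic with the intercept and slope copied from the notebook's printed output, so it runs without the fitted model.

```python
# Values copied from the output printed earlier in the notebook
sqft_intercept = -47116.0765749
sqft_slope = 281.958838568

my_house_sqft = 1000
# price = intercept + slope * square feet
estimated_price = sqft_intercept + sqft_slope * my_house_sqft
print("$%.2f" % estimated_price)
# prints "$234842.76"
```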


# Residual Sum of Squares

Let's evaluate our model using the residual sum of squares (RSS).
The function below computes the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

In [17]:
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First, get the predictions
    # Note that output holds the true values
    predicted_values = intercept + (slope * input_feature)
    # compute the residuals, i.e. the difference between the true and predicted values
    residuals = output - predicted_values
    # square the residuals and add them up
    RSS = (residuals * residuals).sum()
    return RSS
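
To see the RSS computation on numbers small enough to check by hand, here is a minimal sketch on plain Python lists. The function name `rss_lists` and the three toy points are assumptions for illustration. With the line y = 2x, the predictions are 2, 4 and 6, so only the last point contributes a residual of 1.

```python
def rss_lists(xs, ys, intercept, slope):
    """Hypothetical list-based RSS: sum of squared (true - predicted) values."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
toy_rss = rss_lists(xs, ys, 0.0, 2.0)
print(toy_rss)
# prints 1.0
```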

Let's calculate the RSS of the simple linear regression model above, which uses square feet to predict prices, on the TRAINING data.

In [21]:
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'], train_data['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft) + '. Please note that it is the RSS for the training data'

The RSS of predicting Prices based on Square Feet is : 1.20191835632e+15. Please note that it is the RSS for the training data


# Predict the square feet given price

Given a price, we can also estimate the square footage of a house by inverting the fitted line, using the function below.

In [22]:
def inverse_regression_predictions(output, intercept, slope):
    # Solve output = intercept + slope * feature for the feature
    estimated_feature = (output - intercept)/slope
    return estimated_feature

In [23]:
my_house_price = 5000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print "The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet)

The estimated squarefeet for a house worth $5000.00 is 184
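
Because the inverse is only an algebraic rearrangement of the fitted line, feeding its result back into the price formula should return the original price. The sketch below checks that round trip with the intercept and slope copied from the notebook's printed output; the helper names `predict_price` and `predict_sqft` are hypothetical.

```python
# Values copied from the output printed earlier in the notebook
sqft_intercept = -47116.0765749
sqft_slope = 281.958838568

def predict_price(sqft):
    return sqft_intercept + sqft_slope * sqft

def predict_sqft(price):
    return (price - sqft_intercept) / sqft_slope

sqft = predict_sqft(5000.0)
print("%.1f" % sqft)                 # about 184.8; the %d format above truncates it to 184
print("%.2f" % predict_price(sqft))  # the round trip recovers 5000.00
```

Note that inverting a price-on-square-feet fit is not the same as fitting a regression of square feet on price; the two generally give different lines.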


# New Model: estimate prices from bedrooms

My first model predicts house prices from square feet, but there are many other features in the sales SFrame.
Let's try another feature from the sales dataset, bedrooms. Again, the new model is trained on the training data defined at the start of this notebook.

In [24]:
# Estimating the slope and intercept for predicting 'price' based on 'bedrooms'
# (separate names, so the sqft_living model's parameters are not overwritten)
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])

print "Intercept: " + str(bedrooms_intercept)
print "Slope: " + str(bedrooms_slope)

Intercept: 109473.180469
Slope: 127588.952175


# Testing the Linear Regression Algorithm

So far I have defined two models for predicting the price of a house. How do we know which one is better? Let's calculate the RSS of both models on the TEST data, which has not been used until now. Since RSS is the cost that the fit minimizes, the model with the lower test RSS is the better one; this tells us which feature, sqft_living or bedrooms, predicts price better.

In [25]:
# Compute RSS when using bedrooms on TEST data:
bedrooms_intercept, bedrooms_slope = simple_linear_regression(train_data['bedrooms'], train_data['price'])
rss_prices_on_bedrooms = get_residual_sum_of_squares(test_data['bedrooms'], test_data['price'], bedrooms_intercept, bedrooms_slope)
print 'The RSS of predicting Prices based on Bedrooms is : ' + str(rss_prices_on_bedrooms)

The RSS of predicting Prices based on Bedrooms is : 4.93364582868e+14


In [26]:
# Compute RSS when using square feet on TEST data:
sqft_intercept, sqft_slope = simple_linear_regression(train_data['sqft_living'], train_data['price'])
rss_prices_on_sqft = get_residual_sum_of_squares(test_data['sqft_living'], test_data['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)

The RSS of predicting Prices based on Square Feet is : 2.75402936247e+14


In [None]:
# The RSS on test data is lower for the sqft_living model (2.75e+14) than for the bedrooms model (4.93e+14),
# so sqft_living predicts price better than bedrooms does on this data.
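
The comparison can also be written as a small check. This sketch uses the two test RSS values copied from the printed output above; it reports which feature has the lower test RSS and by what factor the other one is larger.

```python
# Test RSS values copied from the output printed above
rss_prices_on_bedrooms = 4.93364582868e+14
rss_prices_on_sqft = 2.75402936247e+14

# The model with the lower RSS on test data fits unseen houses better
better = "sqft_living" if rss_prices_on_sqft < rss_prices_on_bedrooms else "bedrooms"
ratio = rss_prices_on_bedrooms / rss_prices_on_sqft
print("Lower test RSS: %s (bedrooms RSS is %.2fx larger)" % (better, ratio))
```

The two values are comparable because both are computed on the same test rows; an RSS from the training data covers a different number of houses and should not be compared against them directly.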