In [1]:
import pandas as pd
import numpy as np

---

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 
              'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 
              'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 
              'view':int}

In [3]:
train_data = pd.read_csv('./data/kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('./data/kc_house_test_data.csv', dtype=dtype_dict)

---

Write a generic function that accepts a column of data ‘input_feature’ (e.g., an SArray, or a pandas Series as used here) and another column ‘output’, and returns the simple linear regression parameters ‘intercept’ and ‘slope’. Use the closed-form solution from lecture to calculate the slope and intercept.

In [4]:
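def simple_linear_regression(x, y):
    '''Return (intercept, slope) of the least-squares line y = intercept + slope * x.'''
    # Closed-form slope written out with sums (equivalent to the line below):
    #   N = len(x)
    #   slope = (np.sum(x * y) - np.sum(y) * np.sum(x) / N) / (np.sum(x**2) - np.sum(x)**2 / N)
    # Population covariance (bias=True) divided by population variance (np.var uses ddof=0).
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    # The fitted line passes through the point (mean(x), mean(y)).
    intercept = np.mean(y) - slope * np.mean(x)
    return intercept, slope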
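
As a quick sanity check, the closed-form estimate can be compared with `np.polyfit` on data whose answer is known in advance. This is a minimal sketch: `x_toy` and `y_toy` are made-up values lying exactly on y = 1 + 2x, not part of the housing data, and the function is repeated so the cell runs on its own.

```python
import numpy as np

def simple_linear_regression(x, y):
    # Same closed form as above: covariance over variance, then the intercept.
    slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    intercept = np.mean(y) - slope * np.mean(x)
    return intercept, slope

# Synthetic data (an assumption for illustration): points exactly on y = 1 + 2x.
x_toy = np.arange(5, dtype=float)
y_toy = 1.0 + 2.0 * x_toy

toy_intercept, toy_slope = simple_linear_regression(x_toy, y_toy)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
ref_slope, ref_intercept = np.polyfit(x_toy, y_toy, 1)

print(toy_intercept, toy_slope)
print(ref_intercept, ref_slope)
```

Both approaches should recover an intercept of 1 and a slope of 2 up to floating-point error.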

---

Use your function to calculate the estimated slope and intercept on the training data to predict ‘price’ given ‘sqft_living’.

In [5]:
x_Tr = train_data['sqft_living']
y_Tr = train_data['price']

In [6]:
squarefeet_intercept, squarfeet_slope = simple_linear_regression(x_Tr, y_Tr)

In [7]:
squarefeet_intercept, squarfeet_slope

(-47116.079072878, 281.9588396303348)

---

Write a function that accepts a column of data ‘input_feature’, the ‘slope’, and the ‘intercept’ you learned, and returns a column of predictions ‘predicted_output’ for each entry in the input column.

In [8]:
def get_regression_predictions(x, intercept, slope):
    yhat = intercept + slope * x
    return yhat

In [9]:
get_regression_predictions(2650, squarefeet_intercept, squarfeet_slope)

700074.8459475093
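
The call above passes a single number, but the same function should also work element-wise on a whole column, which is what the prompt asks for. A minimal sketch, assuming the intercept and slope printed after In [7] are hard-coded, and with `sqft` as a hypothetical array of square footages:

```python
import numpy as np

def get_regression_predictions(x, intercept, slope):
    # Broadcasting applies intercept + slope * x to every entry of x.
    return intercept + slope * x

# Parameters copied from the values printed after In [7].
intercept = -47116.079072878
slope = 281.9588396303348

# Hypothetical inputs; 2650 repeats the single-value example above.
sqft = np.array([1000.0, 2650.0, 4000.0])
predictions = get_regression_predictions(sqft, intercept, slope)
print(predictions)
```

The middle entry should match the single-value prediction for 2650 square feet shown above.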

---

Write a function that accepts the columns of data ‘input_feature’ and ‘output’ and the regression parameters ‘slope’ and ‘intercept’, and outputs the Residual Sum of Squares (RSS).

In [10]:
def get_residual_sum_of_squares(x, y, intercept, slope):
    residuals = y - get_regression_predictions(x, intercept, slope)
    RSS = np.sum(residuals**2)
    return RSS

In [11]:
get_residual_sum_of_squares(x_Tr, y_Tr, squarefeet_intercept, squarfeet_slope)

1201918354177283.0
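
An RSS on the order of 10^15 is hard to check by eye, so it may help to run the same function on a case small enough to verify by hand. A minimal sketch with made-up values (`x_toy`, `y_toy`, and the line y = 1 + 2x are assumptions for illustration): the predictions are 1, 3, 5, so only the last point is off, by 1.

```python
import numpy as np

def get_regression_predictions(x, intercept, slope):
    return intercept + slope * x

def get_residual_sum_of_squares(x, y, intercept, slope):
    # Residual = observed output minus predicted output.
    residuals = y - get_regression_predictions(x, intercept, slope)
    return np.sum(residuals**2)

# Toy data: the first two points lie on y = 1 + 2x, the third is 1 above it.
x_toy = np.array([0.0, 1.0, 2.0])
y_toy = np.array([1.0, 3.0, 6.0])

toy_rss = get_residual_sum_of_squares(x_toy, y_toy, 1.0, 2.0)
print(toy_rss)  # 0^2 + 0^2 + 1^2 = 1.0
```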

---

Write a function that accepts a column of data ‘output’ and the regression parameters ‘slope’ and ‘intercept’, and outputs the column of data ‘estimated_input’. Do this by solving the linear function output = intercept + slope\*input for the ‘input’ variable (i.e., ‘input’ should be on one side of the equals sign by itself).

In [12]:
def inverse_regression_predictions(y, intercept, slope):
    xhat = (y - intercept) / slope
    return xhat

In [13]:
inverse_regression_predictions(8e5, squarefeet_intercept, squarfeet_slope)

3004.396245152302
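
Because the inverse is obtained by solving the same linear equation, predicting a price and then inverting it should return the original square footage. A minimal round-trip sketch, assuming the intercept and slope printed after In [7] are hard-coded:

```python
import numpy as np

def get_regression_predictions(x, intercept, slope):
    return intercept + slope * x

def inverse_regression_predictions(y, intercept, slope):
    # Solve y = intercept + slope * x for x.
    return (y - intercept) / slope

# Parameters copied from the values printed after In [7].
intercept = -47116.079072878
slope = 281.9588396303348

sqft = 2650.0
price = get_regression_predictions(sqft, intercept, slope)
recovered_sqft = inverse_regression_predictions(price, intercept, slope)

# The round trip should agree up to floating-point error.
print(np.isclose(recovered_sqft, sqft))
```

Note that the inverse is only defined when the slope is nonzero.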

---

Instead of using ‘sqft_living’ to estimate prices, we could use ‘bedrooms’ (a count of the number of bedrooms in the house). Using your function from (3), calculate the simple linear regression slope and intercept for estimating price based on bedrooms. Save this slope and intercept for later (you might want to call them, e.g., bedroom_slope and bedroom_intercept).

In [14]:
x2_Tr = train_data['bedrooms']

In [15]:
bedroom_intercept, bedroom_slope = simple_linear_regression(x2_Tr, y_Tr)

---

Compute the RSS of both models (square feet and bedrooms) on the test data.

In [16]:
x_Te  = test_data['sqft_living']
x2_Te = test_data['bedrooms']
y_Te  = test_data['price']

In [17]:
get_residual_sum_of_squares(x_Te, y_Te, squarefeet_intercept, squarfeet_slope)

275402933617811.75

In [18]:
get_residual_sum_of_squares(x2_Te, y_Te, bedroom_intercept, bedroom_slope)

493364585960301.9
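
Both RSS values are computed against the same test prices, so they can be compared directly: the model with the lower RSS fits the test data better. A minimal sketch of that comparison, with the two values copied from the outputs above (the variable names are hypothetical):

```python
# Test-set RSS values copied from the outputs above.
rss_sqft_test = 275402933617811.75
rss_bedrooms_test = 493364585960301.9

# How many times larger the bedrooms RSS is than the square-feet RSS.
ratio = rss_bedrooms_test / rss_sqft_test

# The feature with the lower test RSS gives the better fit on this data.
better_feature = 'sqft_living' if rss_sqft_test < rss_bedrooms_test else 'bedrooms'
print(better_feature, round(ratio, 2))
```

This comparison only works because both models are scored on the same rows; the training RSS above covers a different number of houses and is not on the same scale.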