In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.linear_model import LinearRegression

---

In [3]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 
              'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 
              'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 
              'view':int}

In [4]:
train_data = pd.read_csv('./data/kc_house_train_data.csv', dtype=dtype_dict, index_col=0)
test_data = pd.read_csv('./data/kc_house_test_data.csv', dtype=dtype_dict, index_col=0)

---

We often think of multiple regression as using several distinct features (e.g. number of bedrooms, square feet, and number of bathrooms), but we can also consider transformations of existing variables, e.g. the log of the square feet, or even "interaction" variables such as the product of bedrooms and bathrooms. Add 4 new variables to both train_data and test_data.

In [5]:
train_data['bedrooms_squared'] = train_data.bedrooms**2
train_data['bed_bath_rooms']   = train_data.bedrooms * train_data.bathrooms
train_data['log_sqft_living']  = np.log(train_data.sqft_living)
train_data['lat_plus_long']    = train_data.lat + train_data.long

In [6]:
test_data['bedrooms_squared'] = test_data.bedrooms**2
test_data['bed_bath_rooms']   = test_data.bedrooms * test_data.bathrooms
test_data['log_sqft_living']  = np.log(test_data.sqft_living)
test_data['lat_plus_long']    = test_data.lat + test_data.long

Before we continue let’s explain these new variables:

- Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
- Bedrooms times bathrooms is what's called an "interaction" variable. It is large when both of them are large.
- Taking the log of square feet has the effect of bringing large values closer together and spreading out small values.
- Adding latitude to longitude is nonsensical, but we will do it anyway (you'll see why).
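
The effect of the squared and log transformations described above can be checked on a few numbers. The sketch below uses made-up toy values (not rows from the King County data) and only illustrates the arithmetic.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration.
bedrooms = np.array([1.0, 2.0, 4.0])
sqft_living = np.array([500.0, 1000.0, 4000.0, 8000.0])

# Squaring widens the gap between few and many bedrooms: 1, 4, 16.
bedrooms_squared = bedrooms ** 2

# The log compresses large values: doubling the area always adds log(2),
# whether the house grows from 500 to 1000 sqft or from 4000 to 8000 sqft.
log_sqft = np.log(sqft_living)
gap_small = log_sqft[1] - log_sqft[0]
gap_large = log_sqft[3] - log_sqft[2]

print(bedrooms_squared)
print(gap_small, gap_large)
```

On the raw scale the second gap (4000 sqft) is eight times the first (500 sqft); on the log scale both gaps are the same size.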

In [7]:
test_data.loc[:, ['bedrooms_squared', 'bed_bath_rooms', 
                  'log_sqft_living', 'lat_plus_long']].mean()

bedrooms_squared    12.446678
bed_bath_rooms       7.503902
log_sqft_living      7.550275
lat_plus_long      -74.653334
dtype: float64

---

Use graphlab.linear_regression.create (or any other regression library/function; the cells below use scikit-learn's LinearRegression) to estimate the regression coefficients/weights for predicting ‘price’ for the following three models. In all 3 models include an intercept (most software does this by default).

- Model 1: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’, and ‘long’
- Model 2: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’, ‘long’, and ‘bed_bath_rooms’
- Model 3: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’, ‘long’, ‘bed_bath_rooms’, ‘bedrooms_squared’, ‘log_sqft_living’, and ‘lat_plus_long’

You’ll note that the three models here are “nested”: all of the features of Model 1 are in Model 2, and all of the features of Model 2 are in Model 3.

Learn all three models on the TRAINING data set. Save your model results for quiz questions later.

In [8]:
predictors1 = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
fit1 = LinearRegression().fit(X=train_data.loc[:, predictors1], y=train_data.price)

In [9]:
dict(zip(predictors1, fit1.coef_))

{'sqft_living': 312.25864627320277,
 'bedrooms': -59586.53315361201,
 'bathrooms': 15706.742082734603,
 'lat': 658619.2639305174,
 'long': -309374.35126823315}

In [10]:
predictors2 = predictors1 + ['bed_bath_rooms']
fit2 = LinearRegression().fit(X=train_data.loc[:, predictors2], y=train_data.price)

In [11]:
dict(zip(predictors2, fit2.coef_))

{'sqft_living': 306.6100534589952,
 'bedrooms': -113446.36807020311,
 'bathrooms': -71461.30829275977,
 'lat': 654844.6295033027,
 'long': -294298.96913811873,
 'bed_bath_rooms': 25579.652000752216}

In [12]:
predictors3 = predictors2 + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
fit3 = LinearRegression().fit(X=train_data.loc[:, predictors3], y=train_data.price)

In [13]:
dict(zip(predictors3, fit3.coef_))

{'sqft_living': 529.4228196465249,
 'bedrooms': 34514.229577990445,
 'bathrooms': 67060.7813189112,
 'lat': 534085.6108674955,
 'long': -406750.71086103976,
 'bed_bath_rooms': -8570.504394630156,
 'bedrooms_squared': -6788.586670340485,
 'log_sqft_living': -561831.4840755223,
 'lat_plus_long': 127334.90000645655}

---

Now, using your three estimated models, compute the RSS (Residual Sum of Squares) on the training data.

In [14]:
np.sum((train_data.price - fit1.predict(train_data.loc[:,predictors1]))**2)

967879963049546.4

In [15]:
np.sum((train_data.price - fit2.predict(train_data.loc[:,predictors2]))**2)

958419635074069.2

In [16]:
np.sum((train_data.price - fit3.predict(train_data.loc[:,predictors3]))**2)

903436455050479.0
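
The RSS used in the cells above is the sum of squared differences between actual and predicted values. A minimal sketch with a hypothetical helper, `residual_sum_of_squares`, and hand-checkable toy numbers:

```python
import numpy as np

def residual_sum_of_squares(actual, predicted):
    # Hypothetical helper name; computes sum((actual - predicted)^2).
    residuals = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sum(residuals ** 2))

# Toy values: the residuals are 1, 0 and -2, so the RSS is 1 + 0 + 4 = 5.
toy_actual = [3.0, 5.0, 7.0]
toy_predicted = [2.0, 5.0, 9.0]
toy_rss = residual_sum_of_squares(toy_actual, toy_predicted)
print(toy_rss)
```

With the fitted models above, the same helper could be called as `residual_sum_of_squares(train_data.price, fit1.predict(train_data.loc[:, predictors1]))`.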

---

Now, using your three estimated models, compute the RSS on the testing data.

In [17]:
np.sum((test_data.price - fit1.predict(test_data.loc[:,predictors1]))**2)

225500469795490.16

In [18]:
np.sum((test_data.price - fit2.predict(test_data.loc[:,predictors2]))**2)

223377462976466.88

In [19]:
np.sum((test_data.price - fit3.predict(test_data.loc[:,predictors3]))**2)

259236319207179.44
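
In the numbers above, the training RSS falls with each larger model, while the test RSS of Model 3 is higher than that of Model 2. The first half of that pattern is guaranteed for least squares on nested feature sets: adding columns can never increase the training RSS. The sketch below illustrates this on synthetic data (the data, seed, and helper name `train_rss` are assumptions for illustration, and `np.linalg.lstsq` stands in for LinearRegression).

```python
import numpy as np

# Synthetic data, generated only for this illustration.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

def train_rss(columns, y):
    # Least-squares fit with an intercept; returns the RSS on the same data.
    X = np.column_stack([np.ones(len(y))] + columns)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ w) ** 2))

# Three nested models: each one contains all features of the previous one.
rss_small = train_rss([x1], y)
rss_medium = train_rss([x1, x2], y)
rss_large = train_rss([x1, x2, x1 * x2], y)

print(rss_small, rss_medium, rss_large)
```

No such guarantee holds on held-out data, which is why the test RSS is the more useful number for comparing the three models.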

---

Next, write a function that takes a data set, a list of features (e.g. [‘sqft_living’, ‘bedrooms’]) to be used as inputs, and the name of the output (e.g. ‘price’). The function should return a features_matrix (2D array) consisting of a column of ones followed by columns containing the values of the input features, in the same order as the input list. It should also return an output_array, an array of the values of the output in the data set (e.g. ‘price’). e.g. if you’re using SFrames (or pandas, as in this notebook) and numpy you can complete the following function:

In [20]:
def get_numpy_data(df, features, output):
    df['constant'] = 1  # add a constant column to the DataFrame (note: this modifies df in place)
    features = ['constant'] + features
    df_features = df.loc[:, features]
    features_matrix = df_features.values
    output_array = df.loc[:, output].values
    return features_matrix, output_array
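
As a usage sketch, the function below is a hypothetical variant, `get_numpy_data_copy`, that builds the same matrix without adding a 'constant' column to the caller's DataFrame. The toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

def get_numpy_data_copy(df, features, output):
    # Hypothetical non-mutating variant: prepend a column of ones to the
    # selected feature columns instead of writing 'constant' into df.
    ones = np.ones((len(df), 1))
    feature_values = df.loc[:, features].to_numpy(dtype=float)
    features_matrix = np.hstack([ones, feature_values])
    output_array = df.loc[:, output].to_numpy(dtype=float)
    return features_matrix, output_array

# Toy data, not taken from the housing files.
toy = pd.DataFrame({
    'sqft_living': [1000.0, 2000.0],
    'bedrooms': [2.0, 3.0],
    'price': [300000.0, 500000.0],
})
X, y = get_numpy_data_copy(toy, ['sqft_living', 'bedrooms'], 'price')
print(X)
print(y)
```

The first column of `X` is all ones, and the remaining columns follow the order of the feature list.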

---

If the features matrix (including a column of 1s for the constant) is stored as a 2D array (or matrix) and the regression weights are stored as a 1D array, then the predicted output is just the dot product between the features matrix and the weights (with the weights on the right). Write a function ‘predict_outcome’ which accepts a 2D array ‘feature_matrix’ and a 1D array ‘weights’ and returns a 1D array ‘predictions’. e.g. in python:

In [21]:
def predict_outcome(feature_matrix, weights):
    predictions = feature_matrix @ weights
    return predictions
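
A small worked example of this matrix-vector product, with made-up weights and feature values (the function is repeated so the snippet runs on its own):

```python
import numpy as np

def predict_outcome(feature_matrix, weights):
    return feature_matrix @ weights

# Toy matrix: a column of ones (intercept) and one feature column.
toy_matrix = np.array([[1.0, 1000.0],
                       [1.0, 2000.0]])
toy_weights = np.array([10000.0, 200.0])  # intercept, slope (illustrative values)

# Row 1: 10000 + 200 * 1000 = 210000; row 2: 10000 + 200 * 2000 = 410000.
toy_predictions = predict_outcome(toy_matrix, toy_weights)
print(toy_predictions)
```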

---

If we have the values of a single input feature in an array ‘feature’ and the prediction ‘errors’ (predictions - output), then the derivative of the regression cost function with respect to the weight of ‘feature’ is just twice the dot product between ‘feature’ and ‘errors’. Write a function that accepts an ‘errors’ array and a ‘feature’ array and returns the ‘derivative’ (a single number). e.g. in python:

In [22]:
def feature_derivative(errors, feature):
    derivative = 2 * np.dot(errors, feature)
    return derivative
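
One way to sanity-check this formula is to compare it against a finite-difference approximation of the RSS. The sketch below does that on toy data; the helper `rss`, the data, and the step `eps` are assumptions for illustration.

```python
import numpy as np

def feature_derivative(errors, feature):
    return 2 * np.dot(errors, feature)

def rss(X, y, w):
    # Regression cost: sum of squared errors for weights w.
    return np.sum((X @ w - y) ** 2)

# Toy data: intercept column plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 5.0])
w = np.array([0.5, 1.0])

# Analytic derivative with respect to the slope weight w[1].
errors = X @ w - y                      # [-0.5, -0.5, -1.5]
analytic = feature_derivative(errors, X[:, 1])

# Central finite difference on the same weight.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[1] += eps
w_minus[1] -= eps
numeric = (rss(X, y, w_plus) - rss(X, y, w_minus)) / (2 * eps)

print(analytic, numeric)
```

Both values should agree up to floating-point rounding, since the cost is quadratic in each weight.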

---

Now we will use our predict_outcome and feature_derivative functions to write a gradient descent function. Although we can compute the derivative for all the features simultaneously (the gradient), we will explicitly loop over the features individually for simplicity. Write a gradient descent function that does the following:

- Accepts a numpy feature_matrix 2D array, a 1D output array, an array of initial weights, a step size and a convergence tolerance.
- While not converged, updates each feature weight by subtracting the step size times the derivative for that feature given the current weights.
- At each step, computes the magnitude/length of the gradient (square root of the sum of squared components).
- When the magnitude of the gradient is smaller than the input tolerance, returns the final weight vector.

e.g. if you’re using SFrames (or pandas, as in this notebook) and numpy you can complete the following function:

In [23]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    # copy as floats so in-place updates do not truncate integer initial weights
    weights = np.array(initial_weights, dtype=float)
    while not converged:
        # compute the predictions based on feature_matrix and weights:
        predictions = predict_outcome(feature_matrix, weights)
        # compute the errors as predictions - output:
        errors = predictions - output
        gradient_sum_squares = 0  # running sum of squared derivatives (squared gradient magnitude)
        # while not converged, update each weight individually:
        for i in range(len(weights)):
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative_i = feature_derivative(errors, feature_matrix[:,i])
            # add the squared derivative to the gradient magnitude
            gradient_sum_squares += (derivative_i**2)
            # update the weight based on step size and derivative:
            weights[i] -= (step_size * derivative_i)
        gradient_magnitude = np.sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return weights
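
The text above notes that the derivatives for all features can be computed simultaneously. A minimal sketch of that vectorized form follows, checked against a closed-form least-squares fit on toy data. The function name, the `max_iterations` guard, and the toy data are assumptions for illustration, not part of the assignment.

```python
import numpy as np

def gradient_descent_vectorized(X, y, initial_weights, step_size, tolerance,
                                max_iterations=100_000):
    # Same update rule as the loop version, with the whole gradient computed
    # at once as 2 * X^T (Xw - y). max_iterations is an added safety guard.
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        errors = X @ weights - y
        gradient = 2 * X.T @ errors
        weights -= step_size * gradient
        if np.linalg.norm(gradient) < tolerance:
            break
    return weights

# Toy data lying exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

gd_weights = gradient_descent_vectorized(X, y, [0.0, 0.0],
                                         step_size=0.01, tolerance=1e-6)
exact_weights, *_ = np.linalg.lstsq(X, y, rcond=None)

print(gd_weights)
print(exact_weights)
```

On this small, well-scaled problem a step size of 0.01 converges quickly; the housing data needs far smaller steps because its feature values are in the thousands.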

---

Now we will run the regression_gradient_descent function on some actual data. In particular, we will use gradient descent to estimate the simple regression model from Week 1, with just an intercept and a slope. Use the following parameters:

- features: ‘sqft_living’
- output: ‘price’
- initial weights: -47000, 1 (intercept, sqft_living respectively)
- step_size = 7e-12
- tolerance = 2.5e7

In [24]:
simple_features = ['sqft_living']
my_output = 'price'
simple_feature_matrix, output = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7

In [25]:
simple_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights,
                                             step_size, tolerance)

In [26]:
dict(zip(['constant']+simple_features, simple_weights))

{'constant': -46999.88716554671, 'sqft_living': 281.91211917520917}

---

Now build a corresponding ‘test_simple_feature_matrix’ and ‘test_output’ using test_data. Using ‘test_simple_feature_matrix’ and ‘simple_weights’ compute the predicted house prices on all the test data.

In [27]:
test_simple_feature_matrix, test_output = get_numpy_data(test_data, simple_features, my_output)

In [28]:
price_hat_simple = test_simple_feature_matrix @ simple_weights

In [29]:
price_hat_simple[0]

356134.4432550024

---

Now compute RSS on all test data for this model. Record the value and store it for later.

In [30]:
RSS_simple = np.sum((test_output - price_hat_simple)**2)
print('{:,}'.format(RSS_simple))

275,400,044,902,128.3


---

Now we will use gradient descent to fit a model with more than 1 predictor variable (and an intercept). Use the following parameters:

- model features = ‘sqft_living’, ‘sqft_living15’
- output = ‘price’
- initial weights = [-100000, 1, 1] (intercept, sqft_living, and sqft_living15 respectively)
- step size = 4e-12
- tolerance = 1e9

In [31]:
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
feature_matrix, output = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

In [32]:
multi_weights = regression_gradient_descent(feature_matrix, output, initial_weights,
                                            step_size, tolerance)

In [33]:
dict(zip(['constant']+model_features, multi_weights))

{'constant': -99999.96884887619,
 'sqft_living': 245.0726034645802,
 'sqft_living15': 65.27952669888786}

---

Use the regression weights from this second model (using sqft_living and sqft_living15) to predict the prices of all the houses in the TEST data.

In [34]:
test_multi_feature_matrix, test_output = get_numpy_data(test_data, model_features, my_output)

In [35]:
price_hat_multi = test_multi_feature_matrix @ multi_weights

In [36]:
price_hat_multi[0]

366651.4116294939

In [37]:
test_output[0]

310000.0

In [38]:
RSS_multi = np.sum((test_output - price_hat_multi)**2)
print('{:,}'.format(RSS_multi))

270,263,443,629,803.56
