## Load Data

In [1]:
import numpy as np
import pandas as pd 

In [2]:
test_data = pd.read_csv('kc_house_test_data.csv')
train_data = pd.read_csv('kc_house_train_data.csv')

In [3]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

In [4]:
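# astype returns a converted copy; the results are only displayed here, not stored,
# so test_data and train_data keep the dtypes inferred by read_csv
test_data.astype(dtype_dict)
train_data.astype(dtype_dict)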

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3.0,1.00,1180.0,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340.0,5650.0
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570.0,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.7210,-122.319,1690.0,7639.0
2,5631500400,20150225T000000,180000.0,2.0,1.00,770.0,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720.0,8062.0
3,2487200875,20141209T000000,604000.0,4.0,3.00,1960.0,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360.0,5000.0
4,1954400510,20150218T000000,510000.0,3.0,2.00,1680.0,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800.0,7503.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17379,7936000429,20150326T000000,1007500.0,4.0,3.50,3510.0,7200,2.0,0,0,...,9,2600,910,2009,0,98136,47.5537,-122.398,2050.0,6200.0
17380,2997800021,20150219T000000,475000.0,3.0,2.50,1310.0,1294,2.0,0,0,...,8,1180,130,2008,0,98116,47.5773,-122.409,1330.0,1265.0
17381,263000018,20140521T000000,360000.0,3.0,2.50,1530.0,1131,3.0,0,0,...,8,1530,0,2009,0,98103,47.6993,-122.346,1530.0,1509.0
17382,291310100,20150116T000000,400000.0,3.0,2.50,1600.0,2388,2.0,0,0,...,8,1600,0,2004,0,98027,47.5345,-122.069,1410.0,1287.0
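
A minimal sketch of two ways the converted types could be kept, since `astype` does not modify a DataFrame in place. The two-row in-memory CSV and the shortened `small_dtypes` dict are assumptions made for illustration; they are not part of the data set.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
csv_text = (
    "id,price,sqft_living,floors\n"
    "7129300520,221900,1180,1\n"
    "6414100192,538000,2570,2\n"
)
small_dtypes = {'id': str, 'price': float, 'sqft_living': float, 'floors': str}

# Option 1: let read_csv apply the types while parsing.
typed_at_read = pd.read_csv(io.StringIO(csv_text), dtype=small_dtypes)

# Option 2: convert afterwards and keep the returned copy.
raw = pd.read_csv(io.StringIO(csv_text))
typed_after = raw.astype(small_dtypes)

print(raw['price'].dtype, typed_after['price'].dtype)
print(typed_at_read['price'].dtype)
```

Either option would change how later outputs are displayed (for example, feature columns would print as floats), so the outputs shown in this notebook correspond to the unconverted data.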


- Write a function that takes a data set, a list of features to be used as inputs (e.g. ['sqft_living', 'bedrooms']), and the name of the output (e.g. 'price'). The function should return a features_matrix (2D array) consisting of a column of ones followed by columns containing the values of the input features, in the same order as the input list. It should also return an output_array, which is an array of the values of the output in the data set (e.g. 'price').

In [5]:
# Takes a data set, a list of input features, and the name of the output column
def get_numpy_data(data, features, output):
    # copy the selected columns so the original DataFrame is not modified
    features_df = data[features].copy()
    # add a constant column of ones as the first column
    features_df.insert(0, 'constant', 1)
    # convert features_df into a 2D numpy array
    features_matrix = np.array(features_df)
    # extract the target column as a 1D numpy array
    output_array = np.array(data[output])
    return features_matrix, output_array
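
As a quick check of the shapes this produces, here is a small sketch on a made-up three-row DataFrame. The `toy` data is hypothetical, and the function is restated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def get_numpy_data(data, features, output):
    # Same idea as the function above: ones column first, then the features.
    features_df = data[features].copy()
    features_df.insert(0, 'constant', 1)
    return np.array(features_df), np.array(data[output])

# Hypothetical toy data, not taken from the house sales files.
toy = pd.DataFrame({
    'sqft_living': [1000, 2000, 3000],
    'bedrooms': [2, 3, 4],
    'price': [200000.0, 400000.0, 600000.0],
})

X, y = get_numpy_data(toy, ['sqft_living', 'bedrooms'], 'price')
print(X.shape)   # one row per house, one column per feature plus the constant
print(X[0])      # constant, sqft_living, bedrooms for the first house
print(y)
```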

- If the features matrix (including a column of 1s for the constant) is stored as a 2D array and the regression weights are stored as a 1D array, then the predicted output is just the dot product between the features matrix and the weights (with the weights on the right). Write a function 'predict_outcome' which accepts a 2D array 'feature_matrix' and a 1D array 'weights' and returns a 1D array 'predictions'. E.g. in Python:

In [6]:
# dot product between features and weights
def predict_outcome(feature_matrix, weights):
    predictions = np.dot(feature_matrix, weights)
    return(predictions)
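
To make the dot product concrete, here is a small sketch with made-up numbers (the feature values and weights are assumptions chosen for easy arithmetic). Each prediction is the intercept plus the slope times the feature value.

```python
import numpy as np

# Hypothetical feature matrix: a constant column and one feature column.
feature_matrix = np.array([[1., 1000.],
                           [1., 2000.],
                           [1., 3000.]])
# Hypothetical weights: intercept 50000, slope 200 per square foot.
weights = np.array([50000., 200.])

predictions = np.dot(feature_matrix, weights)
print(predictions)

# The @ operator gives the same result for a 2D array times a 1D array.
same_predictions = feature_matrix @ weights
```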

- If we have the values of a single input feature in an array 'feature' and the prediction 'errors' (predictions - output), then the derivative of the regression cost function with respect to the weight of 'feature' is just twice the dot product between 'feature' and 'errors'. Write a function that accepts an 'errors' array and a 'feature' array and returns the 'derivative' (a single number). E.g. in Python:

In [7]:
def feature_derivative(errors, feature):
    derivative = 2*np.dot(errors, feature)
    return(derivative)
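
The formula follows from the cost $RSS(w) = \sum_i (x_i \cdot w - y_i)^2$, whose derivative with respect to $w_j$ is $2 \sum_i x_{ij} (x_i \cdot w - y_i)$. One way to gain confidence in it is a finite-difference check. The sketch below uses made-up data, and the helper `rss` is a hypothetical name introduced only for this check.

```python
import numpy as np

def rss(feature_matrix, weights, output):
    # Residual sum of squares for the given weights.
    errors = np.dot(feature_matrix, weights) - output
    return np.sum(errors ** 2)

# Hypothetical toy problem.
X = np.array([[1., 1.], [1., 2.], [1., 3.]])
y = np.array([2., 3., 5.])
w = np.array([0.5, 1.0])

# Analytic derivative with respect to the second weight.
errors = np.dot(X, w) - y
analytic = 2 * np.dot(errors, X[:, 1])

# Central finite difference on the same weight.
h = 1e-6
w_plus = w.copy()
w_plus[1] += h
w_minus = w.copy()
w_minus[1] -= h
numeric = (rss(X, w_plus, y) - rss(X, w_minus, y)) / (2 * h)

print(analytic, numeric)  # the two values should agree closely
```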

In [8]:
(example_features, example_output) = get_numpy_data(train_data, ['sqft_living'], 'price') 
my_weights = np.array([0., 0.]) # this makes all the predictions 0
test_predictions = predict_outcome(example_features, my_weights) 
# numpy arrays can be subtracted elementwise with '-'; the reshape turns the output into a 1 x n array,
# so errors is also 1 x n and the derivative below comes back as a one-element array
errors = test_predictions - np.reshape(example_output,(1, len(example_output))) # with all-zero predictions the errors are just -example_output
feature = example_features[:,0] # let's compute the derivative with respect to 'constant'; the ":" indicates "all rows"
derivative = feature_derivative(errors, feature)
print(derivative)
print(-np.sum(example_output)*2) # should be the same as derivative

[-1.87526989e+10]
-18752698920.0


In [9]:
example_features

array([[   1, 1180],
       [   1, 2570],
       [   1,  770],
       ...,
       [   1, 1530],
       [   1, 1600],
       [   1, 1020]])

### Gradient Descent

Now we will use predict_outcome and feature_derivative to write a gradient descent function. Although we could compute the derivatives for all the features simultaneously (the gradient), we will explicitly loop over the features individually for simplicity. Write a gradient descent function that does the following:

- Accepts a numpy feature_matrix 2D array, a 1D output array, an array of initial weights, a step size, and a convergence tolerance.
- While not converged, updates each feature weight by subtracting the step size times the derivative for that feature given the current weights.
- At each step, computes the magnitude (length) of the gradient: the square root of the sum of squared components.
- When the magnitude of the gradient is smaller than the input tolerance, returns the final weight vector.

In [10]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    # copy the initial weights as floats so the caller's array is not modified
    weights = np.array(initial_weights, dtype=float)
    while not converged:
        # compute the predictions based on feature_matrix and weights
        predictions = predict_outcome(feature_matrix, weights)
        # compute the errors as predictions - output
        errors = predictions - output
        gradient_sum_squares = 0 # running sum of the squared derivatives
        # update each weight individually; errors stays fixed during this pass
        for i in range(len(weights)):
            # feature_matrix[:, i] is the feature column associated with weights[i]
            derivative = feature_derivative(errors, feature_matrix[:, i])
            # add the squared derivative to the running sum
            gradient_sum_squares += derivative * derivative
            # update the weight based on step size and derivative
            weights[i] = weights[i] - (step_size * derivative)
        # the gradient magnitude is the square root of the sum of squared derivatives
        gradient_magnitude = np.sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return weights
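
For comparison, the whole gradient can be computed in one matrix product instead of a loop over features. The following is a sketch of that variant, not the version used in this notebook. The name `gradient_descent_vectorized`, the `max_iterations` cap, and the toy data are assumptions added for illustration. It also checks convergence before each update rather than after, so it can differ from the loop version by one step.

```python
import numpy as np

def gradient_descent_vectorized(feature_matrix, output, initial_weights,
                                step_size, tolerance, max_iterations=100000):
    # Work on a float copy so the caller's array is left untouched.
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        errors = feature_matrix @ weights - output
        # All partial derivatives at once: 2 * X^T (Xw - y).
        gradient = 2 * feature_matrix.T @ errors
        if np.linalg.norm(gradient) < tolerance:
            break
        weights -= step_size * gradient
    return weights

# Hypothetical toy data generated from y = 1 + 2x, so the fit should recover [1, 2].
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([1., 3., 5., 7.])
fitted = gradient_descent_vectorized(X, y, np.array([0., 0.]),
                                     step_size=0.02, tolerance=1e-8)
print(fitted)
```

The iteration cap guards against an endless loop when the step size is too large for the data, a case in which the `while not converged` loop above would never terminate.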

## Running the Gradient Descent

In [11]:
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
# simple_feature_matrix is a 2D array (constant column plus sqft_living)
# output is a 1D array of prices
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7

In [12]:
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights, step_size,tolerance)

In [13]:
# Quiz Question: What is the value of the weight for sqft_living, the second element of simple_weights (rounded to 1 decimal place)?
list(zip(simple_weights, ['constant','sqft_living']))

[(-46999.88716554671, 'constant'), (281.91211917520917, 'sqft_living')]

Build a corresponding ‘test_simple_feature_matrix’ and ‘test_output’ using test_data. Using ‘test_simple_feature_matrix’ and ‘simple_weights’ compute the predicted house prices on all the test data.

In [14]:
(test_simple_feature_matrix,test_output) = get_numpy_data(test_data,simple_features, my_output)
test_predictions = predict_outcome(test_simple_feature_matrix, simple_weights) 
print(test_predictions)

[356134.443255   784640.86440132 435069.83662406 ... 663418.65315598
 604217.10812919 240550.47439317]


In [15]:
# Quiz Question: What is the predicted price for the 1st house in the Test data set for model 1 (round to nearest dollar)?
print(test_predictions[0])

356134.4432550024


Compute the residual sum of squares (RSS) on all test data for this model. Record the value and store it for later.

In [16]:
residual_model1 = test_output-test_predictions
RSS_model1 = np.sum(residual_model1*residual_model1)
print(RSS_model1)

275400044902128.3
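
Since RSS is computed again for the second model, the calculation could be wrapped in a small helper. This is a sketch; the name `residual_sum_of_squares` and the toy arrays are assumptions, not part of the notebook.

```python
import numpy as np

def residual_sum_of_squares(output, predictions):
    # Sum of squared differences between actual and predicted values.
    residuals = output - predictions
    return np.dot(residuals, residuals)

# Hypothetical toy values: residuals are 1, 0 and -2.
toy_output = np.array([3.0, 5.0, 7.0])
toy_predictions = np.array([2.0, 5.0, 9.0])
toy_rss = residual_sum_of_squares(toy_output, toy_predictions)
print(toy_rss)
```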


Use the gradient descent to fit a model with more than 1 predictor variable (and an intercept). Use the following parameters:

In [17]:
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features,my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

Run gradient descent on a model with 'sqft_living' and 'sqft_living15' as well as an intercept, using the above parameters. Save the resulting regression weights.

In [18]:
weight_2 = regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance)
weight_2

array([-9.99999688e+04,  2.45072603e+02,  6.52795267e+01])

Use the regression weights from this second model (using sqft_living and sqft_living15) to predict the house prices on the TEST data.

In [19]:
test_feature_matrix, test_output =get_numpy_data(test_data, model_features, my_output)

test_predictions_2 = predict_outcome(test_feature_matrix, weight_2)
print(test_predictions_2)

[366651.41162949 762662.39850726 386312.09557541 ... 682087.39916306
 585579.27901327 216559.20391786]


In [20]:
# Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?
test_predictions_2[0]

366651.4116294939

What is the actual price for the 1st house in the TEST data set?

In [21]:
test_data.loc[0,['price']]

price    310000
Name: 0, dtype: object

In [22]:
# Quiz Question: Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2?
# model 1 (it predicted about 356134 against the actual 310000; model 2 predicted about 366651)
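
The comparison can be checked directly from the values printed above: the estimate with the smaller absolute error is the closer one. A short sketch using those printed numbers (the variable names are made up here):

```python
# Values copied from the outputs of the earlier cells.
actual_price = 310000.0
prediction_model1 = 356134.4432550024
prediction_model2 = 366651.4116294939

# Absolute distance of each prediction from the actual price.
error_model1 = abs(prediction_model1 - actual_price)
error_model2 = abs(prediction_model2 - actual_price)

print(round(error_model1), round(error_model2))
closer = 'model 1' if error_model1 < error_model2 else 'model 2'
print(closer)
```

On this single house, model 1 is off by roughly 46 thousand dollars and model 2 by roughly 57 thousand, even though model 2 has the lower RSS over the whole test set.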

Compute RSS on all test data for the second model. Record the value and store it for later.

In [23]:
RSS_model2 = np.sum((test_output-test_predictions_2)**2)
RSS_model2

270263443629803.56

In [24]:
# Quiz Question: Which model (1 or 2) has lowest RSS on all of the TEST data?  
# Model 2