In [1]:
import pandas as pd
import numpy as np

dtype_dict = {'bathrooms':float, 
              'waterfront':int, 
              'sqft_above':int, 
              'sqft_living15':float, 
              'grade':int, 
              'yr_renovated':int, 
              'price':float, 
              'bedrooms':float, 
              'zipcode':str, 
              'long':float, 
              'sqft_lot15':float, 
              'sqft_living':float, 
              'floors':float, 
              'condition':int, 
              'lat':float, 
              'date':str, 
              'sqft_basement':int, 
              'yr_built':int, 
              'id':str, 
              'sqft_lot':int, 
              'view':int}

sales = pd.read_csv('kc_house_data_small.csv', dtype=dtype_dict)
training = pd.read_csv('kc_house_data_small_train.csv', dtype = dtype_dict)
testing = pd.read_csv('kc_house_data_small_test.csv', dtype = dtype_dict)
validation = pd.read_csv('kc_house_data_validation.csv', dtype = dtype_dict)

sales.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3.0,1.0,1180.0,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340.0,5650.0
1,6414100192,20141209T000000,538000.0,3.0,2.25,2570.0,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690.0,7639.0
2,5631500400,20150225T000000,180000.0,2.0,1.0,770.0,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720.0,8062.0
3,2487200875,20141209T000000,604000.0,4.0,3.0,1960.0,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360.0,5000.0
4,1954400510,20150218T000000,510000.0,3.0,2.0,1680.0,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800.0,7503.0


To compute pairwise distances among data points efficiently, we will convert the SFrame (or DataFrame) into a 2D NumPy array. First import the numpy library, then copy and paste get_numpy_data() (or an equivalent). The function takes a dataset, a list of features (e.g. [‘sqft_living’, ‘bedrooms’]) to be used as inputs, and the name of the output (e.g. ‘price’). It returns a ‘features_matrix’ (2D array) consisting of a column of ones followed by columns containing the values of the input features, in the same order as the input list. It also returns an ‘output_array’, which holds the values of the output (e.g. ‘price’) for each row of the dataset.

In [2]:
def get_numpy_data(data_frame, features, output):
    data_frame['constant'] = 1 # add a constant column to the DataFrame (modifies it in place)
    # prepend variable 'constant' to the features list
    features = ['constant'] + features

    features_dataframe = data_frame[features]

    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    features_matrix = features_dataframe.to_numpy()

    output_dataframe = data_frame[output]
    output_array = output_dataframe.to_numpy()

    return(features_matrix, output_array)

Similarly, copy and paste the normalize_features function (or equivalent) from Module 5 (Ridge Regression). Given a feature matrix, each column is divided (element-wise) by its 2-norm. The function returns two items: (i) a feature matrix with normalized columns and (ii) the norms of the original columns.

In [3]:
def normalize_features(features):
    norms = np.linalg.norm(features, axis = 0)
    normalized_features = features / norms
    return(normalized_features, norms)
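As a quick sanity check, the sketch below applies the same function to a small made-up matrix (the values are illustrative and are not taken from the housing data). Each normalized column should have a 2-norm of 1, and multiplying by the returned norms should recover the original matrix.

```python
import numpy as np

def normalize_features(features):
    norms = np.linalg.norm(features, axis=0)
    return features / norms, norms

# Toy feature matrix (illustrative values): constant, sqft_living, bedrooms
toy = np.array([[1.0, 1180.0, 3.0],
                [1.0, 2570.0, 3.0],
                [1.0,  770.0, 2.0]])

toy_normalized, toy_norms = normalize_features(toy)

# Every column of the normalized matrix has unit 2-norm
print(np.linalg.norm(toy_normalized, axis=0))
# Rescaling by the stored norms recovers the original values
print(np.allclose(toy_normalized * toy_norms, toy))
```

One caveat: a column that is entirely zero has norm 0, so the division would produce NaN values for that column.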

Using get_numpy_data (or equivalent), extract numpy arrays of the training, test, and validation sets.

In computing distances, it is crucial to normalize features. Otherwise, for example, the ‘sqft_living’ feature (typically on the order of thousands) would exert a much larger influence on distance than the ‘bedrooms’ feature (typically on the order of ones). We divide each column of the training feature matrix by its 2-norm, so that the transformed column has unit norm.

IMPORTANT: Make sure to store the norms of the features in the training set. The features in the test and validation sets must be divided by these same norms, so that the training, test, and validation sets are normalized consistently.

* feature_train, norms = normalize_features(feature_train)
* feature_test = feature_test / norms
* feature_valid = feature_valid / norms
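To illustrate why scaling matters, here is a minimal sketch with hypothetical houses (the numbers are made up). Before normalization, the square-footage gap decides which house is nearest. After dividing by norms computed from the training rows only, the bedroom gap counts as well and the nearest house changes.

```python
import numpy as np

# Hypothetical houses, columns: [sqft_living, bedrooms]
train_toy = np.array([[2050.0, 8.0],   # similar size, many more bedrooms
                      [2100.0, 3.0]])  # slightly larger, same bedroom count
query_toy = np.array([2000.0, 3.0])

# Raw distances: the 50 vs. 100 sqft gap swamps the 5-bedroom gap
raw = np.sqrt(np.sum((train_toy - query_toy) ** 2, axis=1))

# Norms come from the training rows only and are reused for the query
norms_toy = np.linalg.norm(train_toy, axis=0)
scaled = np.sqrt(np.sum((train_toy / norms_toy - query_toy / norms_toy) ** 2, axis=1))

print(np.argmin(raw))     # 0: the 8-bedroom house looks nearest
print(np.argmin(scaled))  # 1: the 3-bedroom house is nearest after scaling
```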

In [4]:
all_features = ['bedrooms',  
                'bathrooms',  
                'sqft_living',  
                'sqft_lot',  
                'floors',
                'waterfront',  
                'view',  
                'condition',  
                'grade',  
                'sqft_above',  
                'sqft_basement',
                'yr_built',  
                'yr_renovated',  
                'lat',  
                'long',  
                'sqft_living15',  
                'sqft_lot15']

my_output = 'price'

(feature_train, output_train) = get_numpy_data(training, all_features, my_output)
(feature_train, norms) = normalize_features(feature_train)

(feature_test, output_test) = get_numpy_data(testing, all_features, my_output)
(feature_valid, output_valid) = get_numpy_data(validation, all_features, my_output)

feature_test = feature_test / norms
feature_valid = feature_valid / norms

### Compute a single distance

To start, let's just explore computing the “distance” between two given houses. We will take our query house to be the first house of the test set and look at the distance between this house and the 10th house of the training set.

To see the features associated with the query house, print the first row (index 0) of the test feature matrix. You should get an 18-dimensional vector whose components are small in magnitude. They are not all between 0 and 1: the ‘long’ component is negative, because the longitudes in this dataset are negative. Similarly, print the 10th row (index 9) of the training feature matrix.

In [5]:
feature_test[0]

array([ 0.01345102,  0.01551285,  0.01807473,  0.01759212,  0.00160518,
        0.017059  ,  0.        ,  0.05102365,  0.0116321 ,  0.01564352,
        0.01362084,  0.02481682,  0.01350306,  0.        ,  0.01345387,
       -0.01346922,  0.01375926,  0.0016225 ])

In [6]:
feature_train[9]

array([ 0.01345102,  0.01163464,  0.00602491,  0.0083488 ,  0.00050756,
        0.01279425,  0.        ,  0.        ,  0.01938684,  0.01390535,
        0.0096309 ,  0.        ,  0.01302544,  0.        ,  0.01346821,
       -0.01346251,  0.01195898,  0.00156612])

# Question 1
From the section "Compute a single distance": we take our query house to be the first house of the test set.

What is the Euclidean distance between the query house and the 10th house of the training set? Enter your answer in American-style decimals (e.g. 0.044) rounded to 3 decimal places.

In [7]:
np.sqrt(((feature_test[0] - feature_train[9]) ** 2).sum())

0.059723593713980783
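The Euclidean distance between two vectors is the square root of the sum of squared component differences. The sketch below, on two made-up vectors, shows that the explicit expression used above agrees with `np.linalg.norm` applied to the difference.

```python
import numpy as np

# Two illustrative vectors forming a 3-4-5 right triangle
a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])

explicit = np.sqrt(((a - b) ** 2).sum())   # same form as the cell above
via_norm = np.linalg.norm(a - b)           # 2-norm of the difference
via_dot = np.sqrt(np.dot(a - b, a - b))    # inner product of the difference with itself

print(explicit, via_norm, via_dot)         # all three equal 5.0
```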

### Compute multiple distances

Of course, to do nearest neighbor regression, we need to compute the distance between our query house and all houses in the training set.

To visualize this nearest-neighbor search, let's first compute the distance from our query house (feature_test[0]) to the first 10 houses of the training set (feature_train[0:10]) and then search for the nearest neighbor within this small set of houses. By restricting ourselves to a small set of houses to begin with, we can visually scan the list of 10 distances to verify that our code for finding the nearest neighbor is working.

Write a loop to compute the Euclidean distance from the query house to each of the first 10 houses in the training set.

In [8]:
for i in range(10):
    euclidean_distance = np.sqrt(((feature_test[0] - feature_train[i]) ** 2).sum())
    print("the Euclidean distance from the query house to house ", i+1, " in the training set is: ", euclidean_distance)

the Euclidean distance from the query house to house  1  in the training set is:  0.060274709163
the Euclidean distance from the query house to house  2  in the training set is:  0.0854688114764
the Euclidean distance from the query house to house  3  in the training set is:  0.0614994643528
the Euclidean distance from the query house to house  4  in the training set is:  0.0534027397929
the Euclidean distance from the query house to house  5  in the training set is:  0.0584448406017
the Euclidean distance from the query house to house  6  in the training set is:  0.0598792150981
the Euclidean distance from the query house to house  7  in the training set is:  0.0546314049678
the Euclidean distance from the query house to house  8  in the training set is:  0.0554310832361
the Euclidean distance from the query house to house  9  in the training set is:  0.0523836278402
the Euclidean distance from the query house to house  10  in the training set is:  0.059723593714


# Question 2
From the section "Compute multiple distances": we take our query house to be the first house of the test set.

Among the first 10 training houses, which house is the closest to the query house? Enter the 0-based index of the closest house.

* 8

Looping over every house in the training set to compute distances one at a time is slow in Python. Fortunately, many NumPy operations are vectorized: they apply the same operation to multiple values or vectors at once. We now walk through this process.

Consider the following loop that computes the element-wise difference between the features of the query house (feature_test[0]) and the first 3 training houses (feature_train[0:3]):

In [9]:
for i in range(3):
    print(feature_train[i]-feature_test[0])
    # should print 3 vectors of length 18

[  0.00000000e+00  -3.87821276e-03  -1.20498190e-02  -1.05552733e-02
   2.08673616e-04  -8.52950206e-03   0.00000000e+00  -5.10236549e-02
   0.00000000e+00  -3.47633726e-03  -5.50336860e-03  -2.48168183e-02
  -1.63756198e-04   0.00000000e+00  -1.70254220e-05   1.29876855e-05
  -5.14364795e-03   6.69281453e-04]
[  0.00000000e+00  -3.87821276e-03  -4.51868214e-03  -2.26610387e-03
   7.19763456e-04   0.00000000e+00   0.00000000e+00  -5.10236549e-02
   0.00000000e+00  -3.47633726e-03   1.30705004e-03  -1.45830788e-02
  -1.91048898e-04   6.65082271e-02   4.23090220e-05   6.16364736e-06
  -2.89330197e-03   1.47606982e-03]
[  0.00000000e+00  -7.75642553e-03  -1.20498190e-02  -1.30002801e-02
   1.60518166e-03  -8.52950206e-03   0.00000000e+00  -5.10236549e-02
   0.00000000e+00  -5.21450589e-03  -8.32384500e-03  -2.48168183e-02
  -3.13866046e-04   0.00000000e+00   4.70885840e-05   1.56292487e-05
   3.72914476e-03   1.64764925e-03]


The subtraction operator (-) in numpy is vectorized as follows:

In [10]:
print(feature_train[0:3] - feature_test[0])

[[  0.00000000e+00  -3.87821276e-03  -1.20498190e-02  -1.05552733e-02
    2.08673616e-04  -8.52950206e-03   0.00000000e+00  -5.10236549e-02
    0.00000000e+00  -3.47633726e-03  -5.50336860e-03  -2.48168183e-02
   -1.63756198e-04   0.00000000e+00  -1.70254220e-05   1.29876855e-05
   -5.14364795e-03   6.69281453e-04]
 [  0.00000000e+00  -3.87821276e-03  -4.51868214e-03  -2.26610387e-03
    7.19763456e-04   0.00000000e+00   0.00000000e+00  -5.10236549e-02
    0.00000000e+00  -3.47633726e-03   1.30705004e-03  -1.45830788e-02
   -1.91048898e-04   6.65082271e-02   4.23090220e-05   6.16364736e-06
   -2.89330197e-03   1.47606982e-03]
 [  0.00000000e+00  -7.75642553e-03  -1.20498190e-02  -1.30002801e-02
    1.60518166e-03  -8.52950206e-03   0.00000000e+00  -5.10236549e-02
    0.00000000e+00  -5.21450589e-03  -8.32384500e-03  -2.48168183e-02
   -3.13866046e-04   0.00000000e+00   4.70885840e-05   1.56292487e-05
    3.72914476e-03   1.64764925e-03]]


Note that the output of this vectorized operation is identical to that of the loop above, which can be verified below:

In [11]:
# verify that vectorization works
results = feature_train[0:3] - feature_test[0]
print(results[0] - (feature_train[0]-feature_test[0]))
# should print all 0's if results[0] == (feature_train[0]-feature_test[0])
print(results[1] - (feature_train[1]-feature_test[0]))
# should print all 0's if results[1] == (feature_train[1]-feature_test[0])
print(results[2] - (feature_train[2]-feature_test[0]))
# should print all 0's if results[2] == (feature_train[2]-feature_test[0])

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
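The vectorized subtraction works because of NumPy broadcasting: a 1D array of length d is stretched across every row of an (n, d) matrix. The sketch below reproduces the comparison on a small made-up matrix and checks equality with `np.array_equal` instead of printing rows of zeros.

```python
import numpy as np

train_toy = np.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])   # shape (3, 2): three "houses", two features
query_toy = np.array([1.0, 1.0])     # shape (2,): one query "house"

# Broadcasting stretches the query across the rows, so the result has shape (3, 2)
diff_vectorized = train_toy - query_toy

# The same result built row by row with an explicit loop
diff_loop = np.array([train_toy[i] - query_toy for i in range(train_toy.shape[0])])

print(diff_vectorized.shape)                       # (3, 2)
print(np.array_equal(diff_vectorized, diff_loop))  # True
```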


### Perform 1-nearest neighbor regression

Now that we have the element-wise differences, it is not too hard to compute the Euclidean distances between our query house and all of the training houses. First, write a single-line expression to define a variable ‘diff’ such that ‘diff[i]’ gives the element-wise difference between the features of the query house and the i-th training house.

To test your code, print diff[-1].sum(), which should be -0.0934339605842.


In [12]:
diff = feature_train - feature_test[0]
diff[-1].sum()

-0.093433998746546426

The next step in computing the Euclidean distances is to take these feature-by-feature differences in ‘diff’, square each, and take the sum over feature indices. That is, compute the sum of squared feature differences for each training house (row in ‘diff’).

By default, ‘np.sum’ sums up everything in the matrix and returns a single number. To instead sum only over a row or column, we need to specify the ‘axis’ parameter described in the np.sum documentation. In particular, ‘axis=1’ computes the sum across each row.

In [13]:
np.sum(diff**2, axis=1)

array([ 0.00363304,  0.00730492,  0.00378218, ...,  0.0032681 ,
        0.00325555,  0.00341846])

In [14]:
np.sum(diff**2, axis=1)[15]

0.0033070590284564457

In [15]:
np.sum(diff[15]**2)

0.0033070590284564453
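The effect of the ‘axis’ parameter is easiest to see on a tiny matrix. This illustrative sketch contrasts the default behavior with ‘axis=0’ (one sum per column) and ‘axis=1’ (one sum per row).

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

total = np.sum(m)             # 10: every entry summed into one number
per_column = np.sum(m, axis=0)  # [4, 6]: collapses the rows
per_row = np.sum(m, axis=1)     # [3, 7]: collapses the columns, one value per row

print(total, per_column, per_row)
```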

With this result in mind, write a single-line expression to compute the Euclidean distances from the query to all the instances. Assign the result to variable distances.

Hint: don't forget to take the square root of the sum of squares.

Hint: distances[100] should contain 0.0237082324496.

In [19]:
distances = np.sqrt((np.sum(diff**2, axis=1)))
distances[100]

0.023708232416678195

Now you are ready to write a function that computes the distances from a query house to all training houses. The function should take two parameters: (i) the matrix of training features and (ii) the single feature vector associated with the query.

In [20]:
def compute_distances(features_instances, features_query):
    diff = features_instances - features_query
    distances = np.sqrt(np.sum(diff**2, axis=1))
    return(distances)
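A small check of this function on made-up points, where the distances are easy to verify by hand. The sketch also shows that `np.linalg.norm` with ‘axis=1’ gives the same result, as an alternative way to write the same computation.

```python
import numpy as np

def compute_distances(features_instances, features_query):
    diff = features_instances - features_query
    return np.sqrt(np.sum(diff**2, axis=1))

# Illustrative points: the origin and two multiples of a 3-4-5 triangle
toy_instances = np.array([[0.0, 0.0],
                          [3.0, 4.0],
                          [6.0, 8.0]])
toy_query = np.array([0.0, 0.0])

toy_distances = compute_distances(toy_instances, toy_query)
print(toy_distances)  # distances 0, 5 and 10

# Equivalent formulation: row-wise 2-norm of the difference matrix
alt_distances = np.linalg.norm(toy_instances - toy_query, axis=1)
print(np.allclose(toy_distances, alt_distances))  # True
```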

# Question 3
From the section "Perform 1-nearest neighbor regression":

Take the query house to be the third house of the test set (feature_test[2]). What is the (0-based) index of the house in the training set that is closest to this query house?


In [23]:
distances_3house = compute_distances(feature_train, feature_test[2])
np.argmin(distances_3house)

382

# Question 4
From the section "Perform 1-nearest neighbor regression":

Take the query house to be the third house of the test set (feature_test[2]). What is the predicted value of the query house based on 1-nearest neighbor regression? Enter your answer in simple decimals without comma separators (e.g. 300000), rounded to the nearest whole number.

In [25]:
training['price'][382]

249000.0

### Perform k-nearest neighbor regression

Using the functions above, implement a function that takes in

 *   the value of k;
 *   the feature matrix for the instances; and
 *   the feature vector of the query

and returns the indices of the k closest training houses. For instance, with 2-nearest neighbor, a return value of [5, 10] would indicate that the 6th and 11th training houses are closest to the query house.

In [26]:
def k_nearest_neighbors(k, feature_train, features_query):
    distances = compute_distances(feature_train, features_query)
    neighbors = np.argsort(distances, axis = 0)[:k]
    return(neighbors)
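Sorting all distances is more work than needed when k is small. As a possible alternative (not part of the assignment), `np.argpartition` can pull out the k smallest distances without a full sort, and only those k are sorted afterwards. The function name `k_nearest_neighbors_partition` and the random data below are hypothetical, introduced just for this sketch.

```python
import numpy as np

def k_nearest_neighbors_partition(k, features_train, features_query):
    distances = np.sqrt(np.sum((features_train - features_query) ** 2, axis=1))
    # argpartition puts the indices of the k smallest distances first, in no particular order
    candidates = np.argpartition(distances, k - 1)[:k]
    # sort only those k candidates so the nearest neighbor comes first
    return candidates[np.argsort(distances[candidates])]

# Random illustrative data: 50 training points with 4 features each
rng = np.random.default_rng(0)
train_rand = rng.random((50, 4))
query_rand = rng.random(4)

fast = k_nearest_neighbors_partition(5, train_rand, query_rand)

# Reference result from a full sort, as in the function above
full_distances = np.sqrt(np.sum((train_rand - query_rand) ** 2, axis=1))
reference = np.argsort(full_distances)[:5]

print(np.array_equal(fast, reference))
```

When several training houses are exactly the same distance from the query, the two approaches may order the tied indices differently.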

# Question 5
From the section "Perform k-nearest neighbor regression":

Take the query house to be the third house of the test set (feature_test[2]). Which of the following is NOT part of the 4 training houses closest to the query house? (Note that all indices are 0-based.)

In [27]:
k_nearest_neighbors(4, feature_train, feature_test[2])

array([ 382, 1149, 4087, 3142])

Now that we know how to find the k-nearest neighbors, write a function that predicts the value of a given query house. For simplicity, take the average of the prices of the k nearest neighbors in the training set. The function should have the following parameters:

 *   the value of k;
 *   the feature matrix for the instances;
 *   the output values (prices) of the instances; and
 *   the feature vector of the query, whose price we’re predicting.

The function should return a predicted value of the query house.

In [28]:
def predict_output_of_query(k, features_train, output_train, features_query):
    neighbors = k_nearest_neighbors(k, features_train, features_query)
    prediction = np.mean(output_train[neighbors])
    return(prediction)

# Question 6
From the section "Perform k-nearest neighbor regression":

Take the query house to be the third house of the test set (feature_test[2]). Predict the value of the query house by the simple averaging method. Enter your answer in simple decimals without comma separators (e.g. 241242), rounded to the nearest whole number.

In [29]:
predict_output_of_query(4, feature_train, output_train, feature_test[2])

413987.5

Finally, write a function to predict the value of each and every house in a query set. (The query set can be any subset of the dataset, be it the test set or validation set.) The idea is to have a loop where we take each house in the query set as the query house and make a prediction for that specific house. The new function should take the following parameters:

  *  the value of k;
  *  the feature matrix for the training set;
  *  the output values (prices) of the training houses; and
  *  the feature matrix for the query set.

The function should return a set of predicted values, one for each house in the query set.

In [30]:
def predict_output(k, features_train, output_train, features_query):
    num = features_query.shape[0]
    predictions = []
    for i in range(num):
        predictions.append(predict_output_of_query(k, features_train, output_train, features_query[i]))
    return(predictions)
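The loop over query houses can also be removed with broadcasting, at the cost of memory. The sketch below is one possible vectorized variant, checked against a loop-based version on random data. The name `predict_output_vectorized` and the synthetic data are assumptions made for illustration and do not appear in the assignment.

```python
import numpy as np

def predict_output_loop(k, features_train, output_train, features_query):
    predictions = []
    for query in features_query:
        distances = np.sqrt(np.sum((features_train - query) ** 2, axis=1))
        neighbors = np.argsort(distances)[:k]
        predictions.append(np.mean(output_train[neighbors]))
    return np.array(predictions)

def predict_output_vectorized(k, features_train, output_train, features_query):
    # diff has shape (n_query, n_train, n_features)
    diff = features_query[:, np.newaxis, :] - features_train[np.newaxis, :, :]
    distances = np.sqrt(np.sum(diff ** 2, axis=2))     # (n_query, n_train)
    neighbors = np.argsort(distances, axis=1)[:, :k]   # (n_query, k)
    return output_train[neighbors].mean(axis=1)        # one prediction per query

# Synthetic data for the comparison
rng = np.random.default_rng(1)
train_rand = rng.random((100, 3))
prices_rand = rng.random(100) * 1e6
query_rand = rng.random((8, 3))

loop_result = predict_output_loop(4, train_rand, prices_rand, query_rand)
vectorized_result = predict_output_vectorized(4, train_rand, prices_rand, query_rand)
print(np.allclose(loop_result, vectorized_result))
```

The intermediate array holds n_query × n_train × n_features values, so for large query sets the loop (or processing the queries in batches) may be the more practical choice.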

# Question 7
From the section "Perform k-nearest neighbor regression": Make predictions for the first 10 houses of the test set using k-nearest neighbors with k=10.

What is the predicted value of the house in this query set that has the lowest predicted value? Enter your answer in simple decimals without comma separators (e.g. 312000), rounded to nearest whole number.


In [32]:
predictions_q7 = predict_output(10, feature_train, output_train, feature_test[:10])
min(predictions_q7)

350032.0

### Choosing the best value of k using a validation set

There remains a question of choosing the value of k to use in making predictions. Here, we use a validation set to choose this value. Write a loop that does the following:

For k in [1, 2, … 15]:

  *  Make predictions for the VALIDATION data using the k-nearest neighbors from the TRAINING data.
  *  Compute the RSS on VALIDATION data

Report which k produced the lowest RSS on validation data.

In [33]:
RSS = []
for k in range(1,16):
    predictions_valid = predict_output(k, feature_train, output_train, feature_valid)
    residual = output_valid - predictions_valid
    RSS.append(sum(residual ** 2)) 

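# np.argmin returns a 0-based position in RSS; RSS[0] corresponds to k=1,
# so the best k is the returned index plus 1
np.argmin(RSS)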

7
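For reference, the residual sum of squares is the sum of squared differences between the true and predicted outputs over the validation houses. The sketch below runs the same selection procedure on synthetic data (the helper `knn_predict` and all data here are made up for illustration). It keeps the list of candidate k values next to the RSS values, so the best k is looked up directly instead of being inferred from a 0-based index.

```python
import numpy as np

def knn_predict(k, features_train, output_train, features_query):
    predictions = []
    for query in features_query:
        distances = np.sqrt(np.sum((features_train - query) ** 2, axis=1))
        neighbors = np.argsort(distances)[:k]
        predictions.append(output_train[neighbors].mean())
    return np.array(predictions)

# Synthetic training and validation sets: output is a noisy sum of two features
rng = np.random.default_rng(0)
x_train = rng.random((200, 2))
y_train = x_train.sum(axis=1) + rng.normal(scale=0.1, size=200)
x_valid = rng.random((50, 2))
y_valid = x_valid.sum(axis=1) + rng.normal(scale=0.1, size=50)

k_values = list(range(1, 16))
rss_values = []
for k in k_values:
    residuals = y_valid - knn_predict(k, x_train, y_train, x_valid)
    rss_values.append(np.sum(residuals ** 2))

# Map the position of the smallest RSS back to the k that produced it
best_k = k_values[int(np.argmin(rss_values))]
print(best_k)
```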

# Question 8
What is the RSS on the TEST data using the value of k found above? To be clear, sum over all houses in the TEST set.

In [35]:
predictions_test = predict_output(8, feature_train, output_train, feature_test)
residual_test = output_test - predictions_test
print(sum(residual_test ** 2))

1.33118823552e+14
