# Predicting house prices using k-nearest neighbors regression

In this notebook, you will implement k-nearest neighbors regression. You will:

- Find the k-nearest neighbors of a given query input
- Predict the output for the query input using the k-nearest neighbors
- Choose the best value of k using a validation set

In [1]:
import pandas as pd
import numpy as np
import graphlab

In [2]:
sales = graphlab.SFrame('kc_house_data_small.gl/')
(train_and_validation, test) = sales.random_split(.8, seed=1)
(train, validation) = train_and_validation.random_split(.8, seed=1)

In [3]:
def get_numpy_data(data_sframe, features, output):
    # Add a constant column so the feature matrix carries an intercept term.
    # Note: this modifies the SFrame that was passed in.
    data_sframe['one'] = 1
    features = ['one'] + features
    # Select the feature columns and convert them to a float NumPy matrix.
    feature_matrix = np.asarray(data_sframe[features].to_numpy(), dtype=float)

    # Extract the target column as a NumPy array.
    output_array = data_sframe[output].to_numpy()

    return (feature_matrix, output_array)

In [4]:
def normalize_features(features):
    # Column-wise 2-norms, equivalent to np.sqrt(np.sum(features*features, axis=0)).
    norms = np.linalg.norm(features, axis=0)
    normalized_features = features/norms
    return (normalized_features, norms)

In [5]:
all_features = ['bedrooms',
                'bathrooms',
                'sqft_living',
                'sqft_lot',
                'floors',
                'waterfront', 
                'view', 
                'condition', 
                'grade',
                'sqft_above',
                'sqft_basement',
                'yr_built', 
                'yr_renovated',
                'lat',
                'long',
                'sqft_living15',
                'sqft_lot15']

In [6]:
len(all_features)

17

In [7]:
train_fm, train_oa = get_numpy_data(train, all_features, 'price')
test_fm, test_oa = get_numpy_data(test, all_features, 'price')
validation_fm, validation_oa = get_numpy_data(validation, all_features, 'price')

In [8]:
features_train, norms = normalize_features(train_fm)
features_test = test_fm / norms
features_valid = validation_fm / norms
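
The cell above divides the test and validation matrices by the norms computed on the training matrix, so that all three sets share one scale. A minimal sketch of that idea on made-up numbers follows; the `toy_*` names are hypothetical and are not part of the notebook's data. One caveat that the sketch does not handle: a column that is entirely zero in the training set has norm 0, and dividing by it produces `nan` or `inf`.

```python
import numpy as np

def normalize_features(features):
    # Same idea as the notebook's helper: scale each column to unit 2-norm.
    norms = np.linalg.norm(features, axis=0)
    return features / norms, norms

# Toy data (made up for illustration): a constant column plus one feature.
toy_train = np.array([[1.0, 3.0],
                      [1.0, 4.0]])
toy_test = np.array([[1.0, 8.0]])

toy_train_normalized, toy_norms = normalize_features(toy_train)

# Reuse the TRAINING norms so the test row is expressed on the same scale.
toy_test_normalized = toy_test / toy_norms

print(toy_norms)
print(toy_test_normalized)
```

After this step every training column has unit length, while test columns generally do not, which is expected because they were scaled by the training norms.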

Compute a single distance

In [9]:
print features_test[0]
print features_train[9]

[ 0.01345102  0.01551285  0.01807473  0.01759212  0.00160518  0.017059    0.
  0.05102365  0.0116321   0.01564352  0.01362084  0.02481682  0.01350306
  0.          0.01345386 -0.01346927  0.01375926  0.0016225 ]
[ 0.01345102  0.01163464  0.00602491  0.0083488   0.00050756  0.01279425
  0.          0.          0.01938684  0.01390535  0.0096309   0.
  0.01302544  0.          0.01346821 -0.01346254  0.01195898  0.00156612]


# What is the Euclidean distance between the query house and the 10th house of the training set?

In [10]:
diff = features_test[0]-features_train[9]
distance = np.sqrt(np.sum(diff**2,axis=0))
print distance

0.0597235937167
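
As a sanity check on the formula used above, the square root of the summed squared differences should agree with `np.linalg.norm` applied to the difference vector. The sketch below uses two made-up points (`query_point` and `train_point` are hypothetical names) rather than the house data.

```python
import numpy as np

# Two made-up points chosen so the distance is easy to verify by hand.
query_point = np.array([0.0, 3.0, 1.0])
train_point = np.array([4.0, 0.0, 1.0])

diff = query_point - train_point

# Explicit formula: square each difference, sum, take the square root.
distance_manual = np.sqrt(np.sum(diff ** 2))

# Same quantity through NumPy's built-in 2-norm.
distance_norm = np.linalg.norm(diff)

print(distance_manual)
print(distance_norm)
```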


In [11]:
def get_distances(query, train):
    diff = (train - query)
    distances = np.sqrt(np.sum(diff**2,axis=1))
    return distances
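
`get_distances` relies on NumPy broadcasting: subtracting a length-d query vector from an N-by-d training matrix subtracts the query from every row, and summing along `axis=1` then yields one distance per training row. A small sketch on made-up data (the `toy_*` names are hypothetical):

```python
import numpy as np

def get_distances(query, train):
    # (N, d) minus (d,) broadcasts the query across all N rows.
    diff = train - query
    # Sum squared differences within each row, giving N distances.
    return np.sqrt(np.sum(diff ** 2, axis=1))

toy_train = np.array([[0.0, 0.0],
                      [3.0, 4.0],
                      [1.0, 1.0]])
toy_query = np.array([1.0, 1.0])

toy_distances = get_distances(toy_query, toy_train)

# Index of the closest training row, as in 1-nearest-neighbor lookup.
nearest_index = int(np.argmin(toy_distances))

print(toy_distances.shape)
print(nearest_index)
```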

# Take the query house to be the third house of the test set (features_test[2]). What is the index of the house in the training set that is closest to this query house?

In [12]:
distances = get_distances(features_test[2], features_train)
np.argmin(distances)

382

# What is the predicted value of the query house based on 1-nearest neighbor regression?

In [13]:
train[382]['price']

249000

Perform k-nearest neighbor regression

In [14]:
def predict_output_of_query(k, features_train, output_train, features_query):
    # Euclidean distance from the query to every training row.
    distances = get_distances(features_query, features_train)
    # Indices of the k closest training houses.
    neighbors = np.argsort(distances)[0:k]
    # Predict with the simple average of the neighbors' outputs.
    prediction = np.average(output_train[neighbors])
    return prediction
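
The function above sorts all distances with `np.argsort` and keeps the first k. Because a simple average does not depend on the order of the neighbors, one possible variant is to select the k smallest distances with `np.argpartition`, which avoids a full sort. This is an alternative sketch, not the notebook's method; `predict_with_argpartition` and the `toy_*` arrays are hypothetical names, and when distances are tied the two approaches may pick different neighbors.

```python
import numpy as np

def predict_with_argpartition(k, features_train, output_train, features_query):
    diff = features_train - features_query
    distances = np.sqrt(np.sum(diff ** 2, axis=1))
    # After partitioning at position k-1, the first k indices point to the
    # k smallest distances (in no particular order). Requires k <= N.
    neighbors = np.argpartition(distances, k - 1)[:k]
    return np.mean(output_train[neighbors])

def predict_with_argsort(k, features_train, output_train, features_query):
    # Reference version using a full sort, for comparison.
    diff = features_train - features_query
    distances = np.sqrt(np.sum(diff ** 2, axis=1))
    neighbors = np.argsort(distances)[:k]
    return np.mean(output_train[neighbors])

# Made-up one-feature data set with no tied distances.
toy_features = np.array([[0.0], [1.0], [2.0], [10.0]])
toy_prices = np.array([100.0, 200.0, 300.0, 1000.0])
toy_query = np.array([0.9])

pred_partition = predict_with_argpartition(2, toy_features, toy_prices, toy_query)
pred_sort = predict_with_argsort(2, toy_features, toy_prices, toy_query)

print(pred_partition)
print(pred_sort)
```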

# Again taking the query house to be the third house of the test set (features_test[2]), predict the value of the query house using k-nearest neighbors with k=4 and the simple averaging method described and implemented above.

In [15]:
pred = predict_output_of_query(4, features_train, train_oa, features_test[2])
print pred

413987.5


In [16]:
def predict_output(k, features_train, output_train, features_query):
    n = len(features_query)
    predictions = np.zeros(n)
    # Predict each query row independently with the single-query helper.
    for i in range(n):
        predictions[i] = predict_output_of_query(k, features_train, output_train, features_query[i])
    return predictions

# Make predictions for the first 10 houses in the test set, using k=10. What is the index of the house in this query set that has the lowest predicted value? What is the predicted value of this house?

In [17]:
preds = predict_output(10, features_train, train_oa, features_test[0:10])

In [18]:
print preds  

[ 881300.   431860.   460595.   430200.   766750.   667420.   350032.
  512800.7  484000.   457235. ]


In [19]:
np.argmin(preds)

6

Choosing the best value of k using a validation set

In [20]:
def get_residual_sum_of_squares(target, pred):
    error = target - pred
    sq_err = error * error
    RSS = np.sum(sq_err)
    return RSS

In [21]:
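rss = np.zeros(15)
# Try k = 1..15; rss[k-1] holds the validation RSS for that k.
for k in range(1, 16):
    preds = predict_output(k, features_train, train_oa, features_valid)
    rss[k-1] = get_residual_sum_of_squares(validation_oa, preds)
print rss
# argmin is a zero-based index, so the best k is this value plus 1.
print np.argmin(rss)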

[  1.05453830e+14   8.34450735e+13   7.26920960e+13   7.19467217e+13
   6.98465174e+13   6.88995444e+13   6.83419735e+13   6.73616787e+13
   6.83727280e+13   6.93350487e+13   6.95238552e+13   6.90499696e+13
   7.00112545e+13   7.09086989e+13   7.11069284e+13]
7
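
Note that `np.argmin` returns a zero-based position in the RSS array, so an index of 7 corresponds to k = 8. The sketch below repeats the selection procedure on synthetic data and maps the index back to k through an explicit list of candidates. All names (`knn_predict`, `candidate_ks`, the `toy_*` arrays) are hypothetical, and the data are randomly generated, so the chosen k says nothing about the house data.

```python
import numpy as np

def knn_predict(k, features_train, output_train, features_query):
    # Average the outputs of the k nearest training rows for each query row.
    predictions = np.zeros(len(features_query))
    for i, query in enumerate(features_query):
        distances = np.sqrt(np.sum((features_train - query) ** 2, axis=1))
        neighbors = np.argsort(distances)[:k]
        predictions[i] = np.mean(output_train[neighbors])
    return predictions

def residual_sum_of_squares(target, pred):
    error = target - pred
    return np.sum(error * error)

# Synthetic data: a noisy sine curve, split into training and validation parts.
rng = np.random.RandomState(0)
toy_x = rng.uniform(0.0, 10.0, size=(120, 1))
toy_y = np.sin(toy_x[:, 0]) + rng.normal(0.0, 0.3, size=120)
toy_train_x, toy_valid_x = toy_x[:80], toy_x[80:]
toy_train_y, toy_valid_y = toy_y[:80], toy_y[80:]

# Evaluate validation RSS for each candidate k.
candidate_ks = list(range(1, 16))
rss_per_k = np.array([
    residual_sum_of_squares(
        toy_valid_y, knn_predict(k, toy_train_x, toy_train_y, toy_valid_x))
    for k in candidate_ks
])

# Map the zero-based argmin position back to the actual value of k.
best_k = candidate_ks[int(np.argmin(rss_per_k))]
print(best_k)
```

Looking k up in `candidate_ks` keeps the index-to-k conversion in one place, which helps if the candidate range later changes.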


# What is the RSS on the TEST data using the value of k found above? To be clear, sum over all houses in the TEST set.

In [22]:
# k = 8 is the best value found above (argmin index 7, plus 1).
preds = predict_output(8, features_train, train_oa, features_test)

In [23]:
rss = get_residual_sum_of_squares(test_oa, preds)
print rss

1.33118823552e+14
