
Do some feature engineering with built-in DataFrame functions  
Use built-in sklearn functions to fit the regression and access its parameters (coefficients)  
Given a fitted model (the regression weights), predictors and outcome, write a function to compute the Residual Sum of Squares (RSS)  
Look at the coefficients and interpret their meaning  
Evaluate multiple models via RSS


In [1]:
import numpy as np
import pandas as pd


In [2]:
full_data = pd.read_csv("../datasets/kc_house_data.csv")
full_data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180.0,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170.0,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770.0,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050.0,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680.0,0,1987,0,98074,47.6168,-122.045,1800,7503


In [3]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(full_data, train_size=0.8, test_size=0.2, random_state=0)


In [5]:
train_data.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
5268,5100402668,20150218T000000,495000.0,3,1.0,1570,5510,1.0,0,0,...,7,1070.0,500,1940,0,98115,47.6942,-122.319,1770,6380
16909,7856560480,20140808T000000,635000.0,3,2.5,1780,11000,1.0,0,0,...,8,1210.0,570,1980,0,98006,47.5574,-122.149,2310,9700
16123,2872900010,20150414T000000,382500.0,3,1.5,1090,9862,1.0,0,0,...,8,1090.0,0,1987,0,98074,47.6256,-122.036,1710,9862
12181,3216900070,20140617T000000,382500.0,4,2.5,2210,7079,2.0,0,0,...,8,2210.0,0,1993,0,98031,47.4206,-122.183,1970,7000
12617,976000790,20141020T000000,670000.0,3,2.5,1800,4763,2.0,0,0,...,7,1240.0,560,1985,0,98119,47.646,-122.362,1790,4763


In [7]:
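def extract_features(data, features_list):
    """Return a 2-D array of shape (n_examples, n_features) built from the named columns."""
    features = [data[title].values for title in features_list]
    # Stack the 1-D column arrays along the last axis: [examples, properties]
    return np.stack(features, axis=-1)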
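
To see what the stacking step produces, here is a minimal sketch on a made-up three-row DataFrame (the values are hypothetical, not taken from kc_house_data.csv). It also checks that selecting the columns and calling `to_numpy()` gives the same matrix, which is one alternative way to get the feature array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from kc_house_data.csv
toy = pd.DataFrame({
    "sqft_living": [1000, 2000, 1500],
    "bedrooms": [2, 4, 3],
    "price": [200000.0, 400000.0, 300000.0],
})
columns = ["sqft_living", "bedrooms"]

# Same approach as extract_features: one 1-D array per column,
# stacked along the last axis so rows are examples and columns are features
stacked = np.stack([toy[name].values for name in columns], axis=-1)

# Selecting the columns and converting to NumPy gives the same matrix
selected = toy[columns].to_numpy()

print(stacked.shape)  # (3, 2)
print(np.array_equal(stacked, selected))  # True
```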

In [9]:
from sklearn.linear_model import LinearRegression

example_features_list = ['sqft_living','bedrooms','bathrooms']

example_features = extract_features(train_data,example_features_list)
example_labels = train_data['price']
example_model = LinearRegression().fit(example_features,example_labels)

Now that we have fitted the model, we can extract the regression weights (coefficients) from it:

In [10]:
example_weight_summary = example_model.coef_
print(example_weight_summary)

[   313.17055038 -56754.66651422   6887.71910816]


There are three coefficients, one per feature, in the same order as example_features_list: sqft_living, bedrooms, bathrooms.
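
Because `coef_` follows the column order of the feature matrix, the weights can be paired with their names using `zip`. The sketch below uses synthetic data (an assumption for illustration, not the housing data) where the target is an exact linear function of the features, so the fitted weights should match the ones used to build it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is an exact linear function of the features
rng = np.random.default_rng(0)
feature_names = ["sqft_living", "bedrooms", "bathrooms"]
X = rng.normal(size=(50, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + 10.0

model = LinearRegression().fit(X, y)

# coef_ follows the column order of X, so zip pairs each name with its weight
weights = dict(zip(feature_names, model.coef_))
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")

# The intercept is stored separately from the coefficients
print(f"intercept: {model.intercept_:.3f}")
```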

Making predictions

In [11]:
ex_predictions = example_model.predict(example_features)
ex_predictions[0]

np.float64(395813.4988028941)

In [12]:
#RSS

In [13]:
def get_residual_sum_of_squares(model, data, outcome):
    """Return the sum of squared differences between predictions and outcomes."""
    pred_outcome = model.predict(data)
    RSS = np.sum(np.square(pred_outcome - outcome))
    return RSS
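
As a quick check of the arithmetic, here is the same computation on three made-up predictions and outcomes (hypothetical numbers), small enough to verify by hand.

```python
import numpy as np

# Hypothetical predictions and true outcomes
predictions = np.array([1.0, 2.0, 3.0])
outcomes = np.array([1.0, 1.0, 5.0])

residuals = predictions - outcomes   # [0., 1., -2.]
rss = np.sum(np.square(residuals))   # 0 + 1 + 4
print(rss)  # 5.0
```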

In [14]:
ex_test_features = extract_features(test_data,example_features_list)
ex_test_labels = test_data['price']
rss_ex_test = get_residual_sum_of_squares(example_model,ex_test_features,ex_test_labels)
print(rss_ex_test)

259213572106085.34


Create some new features

In [15]:
from math import log

In [26]:
train_data['bedrooms_squared'] = train_data['bedrooms'].map(lambda x : x**2)
test_data['bedrooms_squared'] = test_data['bedrooms'].map(lambda x: x**2)

In [27]:
# Interaction feature: bedrooms times bathrooms
train_data['bed_bath_rooms'] = train_data[['bedrooms','bathrooms']].apply(lambda row : row.bedrooms * row.bathrooms,axis=1)
test_data['bed_bath_rooms'] = test_data[['bedrooms','bathrooms']].apply(lambda row : row.bedrooms * row.bathrooms,axis=1)

In [28]:
train_data['log_sqft_living'] = train_data['sqft_living'].map(lambda x: log(x))
test_data['log_sqft_living'] = test_data['sqft_living'].map(lambda x: log(x))
train_data['lat_plus_long'] = train_data[['lat', "long"]].agg("sum", axis=1)
test_data['lat_plus_long'] = test_data[['lat', "long"]].agg("sum", axis=1)



Squaring bedrooms increases the separation between houses with few bedrooms (e.g. 1) and houses with many (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.  
Bedrooms times bathrooms gives what's called an "interaction" feature. It is large when both of them are large.  
Taking the log of square feet has the effect of bringing large values closer together and spreading out small values.  
Adding latitude to longitude is totally nonsensical, but we will do it anyway (you'll see why).
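
The effects described above can be seen on a tiny made-up table (hypothetical values). This sketch uses vectorized column arithmetic rather than `map`/`apply`, which is an alternative way to build the same features.

```python
import numpy as np
import pandas as pd

# Hypothetical toy houses
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4],
    "bathrooms": [1.0, 1.0, 2.0, 2.5],
    "sqft_living": [500, 1000, 5000, 10000],
})

toy["bedrooms_squared"] = toy["bedrooms"] ** 2
toy["bed_bath_rooms"] = toy["bedrooms"] * toy["bathrooms"]
toy["log_sqft_living"] = np.log(toy["sqft_living"])

print(toy["bedrooms_squared"].tolist())  # [1, 4, 9, 16]
print(toy["bed_bath_rooms"].tolist())    # [1.0, 2.0, 6.0, 10.0]

# On the log scale, 500 -> 1000 and 5000 -> 10000 are the same step (log 2),
# even though the raw gaps are 500 and 5000 square feet
gaps = np.diff(toy["log_sqft_living"])
print(np.isclose(gaps[0], gaps[2]))  # True
```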




Learning Multiple Models

Now we will learn the weights for three (nested) models for predicting house prices. The first model will have the fewest features, the second will add one more feature, and the third will add a few more:

    Model 1: square feet, # bedrooms, # bathrooms, latitude & longitude  
    Model 2: add bedrooms*bathrooms  
    Model 3: add log square feet, bedrooms squared, and the (nonsensical) latitude + longitude



In [29]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']

In [30]:
features_1 = extract_features(train_data, model_1_features)
features_2 = extract_features(train_data, model_2_features)
features_3 = extract_features(train_data, model_3_features)
output_label = train_data['price']
model_1 = LinearRegression().fit(features_1, output_label)
model_2 = LinearRegression().fit(features_2, output_label)
model_3 = LinearRegression().fit(features_3, output_label)

In [37]:
print(model_1.coef_)



[ 3.12942010e+02 -5.30962691e+04  1.47770428e+04  6.53983343e+05
 -3.25707336e+05]


Comparing models  
Now that you have learnt three models and extracted their weights, we want to evaluate which model is best.  
First, compute the RSS on TRAINING data for each of the three models:

In [38]:
print(f"Model 1 RSS {get_residual_sum_of_squares(model_1,features_1,output_label)}")
print(f"Model 2 RSS {get_residual_sum_of_squares(model_2,features_2,output_label)}")
print(f"Model 3 RSS {get_residual_sum_of_squares(model_3,features_3,output_label)}")

Model 1 RSS 979843597588329.5
Model 2 RSS 970799199729578.0
Model 3 RSS 913653644974958.9
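
For nested least-squares models, training RSS can only go down or stay the same as features are added, because the larger model can always reproduce the smaller one by giving the extra features a weight of zero. A lower training RSS on its own is therefore not evidence of a better model. The sketch below illustrates this on synthetic data (an assumption for illustration; the helper name `training_rss` is hypothetical), where only the first column carries any signal.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
# Only the first column is related to the target; the other two are pure noise
y = 2.0 * X[:, 0] + rng.normal(size=n)

def training_rss(features, target):
    """Fit ordinary least squares and return the RSS on the same data."""
    model = LinearRegression().fit(features, target)
    return np.sum(np.square(model.predict(features) - target))

# Nested models: each uses all columns of the previous one plus one more
rss_values = [training_rss(X[:, :k], y) for k in (1, 2, 3)]
print(rss_values)
```

The printed values should be non-increasing even though the added columns are noise.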


In [39]:


# Compute the RSS on TESTING data for each of the three models and display the values:
features_1_test = extract_features(test_data, model_1_features)
features_2_test = extract_features(test_data, model_2_features)
features_3_test = extract_features(test_data, model_3_features)
output_labels_test = test_data['price']
print("Model 1 RSS: ", get_residual_sum_of_squares(model_1, features_1_test, output_labels_test))
print("Model 2 RSS: ", get_residual_sum_of_squares(model_2, features_2_test, output_labels_test))
print("Model 3 RSS: ", get_residual_sum_of_squares(model_3, features_3_test, output_labels_test))



Model 1 RSS:  213487129319104.5
Model 2 RSS:  210778544168942.56
Model 3 RSS:  203972051917608.06
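
RSS is a sum over rows, so its size depends on how many rows are scored. The training set here holds 80% of the data, so its RSS values are naturally larger than the test ones. One possible way to put the two on the same footing is to divide by the row count and take the square root, which gives an error in the same units as price. The helper name `rmse_from_rss` and the numbers below are hypothetical.

```python
import math

def rmse_from_rss(rss, n_rows):
    """Root mean squared error: square root of RSS divided by the row count."""
    return math.sqrt(rss / n_rows)

# Hypothetical numbers, not taken from the housing data
print(rmse_from_rss(400.0, 4))  # 10.0
```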
