# Multiple Regression 

# Firing up graphlab

In [26]:
import graphlab

# Load in house sales data

The dataset is from house sales in King County, the region where the city of Seattle, WA is located.

In [27]:
sales = graphlab.SFrame('kc_house_data.gl/')

# Split data into training and testing

I split the data into two parts: training data, which is used to fit the model, and test data, which is used to evaluate the performance of the fitted model. I take the training data to be about 80% of the total data, so the test data is the remaining 20%.

In [28]:
train_data,test_data = sales.random_split(.8,seed=0)
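
To illustrate the idea behind a seeded random split, here is a minimal pure-Python sketch. It is not graphlab's implementation of `random_split`; the helper name `random_split_rows` and the toy rows are hypothetical, and the per-row random assignment is an assumption about the general technique. Under that assumption the split is only approximately 80/20, and reusing the same seed reproduces the same split.

```python
import random

def random_split_rows(rows, fraction, seed):
    """Hypothetical helper: send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # hypothetical stand-in for the rows of the sales table
train_rows, test_rows = random_split_rows(rows, 0.8, seed=0)
print(len(train_rows))
print(len(test_rows))
```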

# Create a multiple regression model

For the multiple regression model, I am using three features: sqft_living, bedrooms, and bathrooms.
So I define a variable sample_features = ['sqft_living', 'bedrooms', 'bathrooms'] and fit the model on the training data.

In [29]:
sample_features = ['sqft_living', 'bedrooms', 'bathrooms']
# Use graphlab's built-in function for creating a multiple linear regression model.
# The target is price.
sample_model = graphlab.linear_regression.create(train_data, target = 'price', features = sample_features, 
                                                  validation_set = None)

Now that we have fitted the model we can extract the regression weights (coefficients) as an SFrame as follows:

In [30]:
sample_weight = sample_model.get("coefficients")
print sample_weight

+-------------+-------+----------------+---------------+
|     name    | index |     value      |     stderr    |
+-------------+-------+----------------+---------------+
| (intercept) |  None | 87910.0724924  |  7873.3381434 |
| sqft_living |  None | 315.403440552  | 3.45570032585 |
|   bedrooms  |  None | -65080.2155528 | 2717.45685442 |
|  bathrooms  |  None | 6944.02019265  | 3923.11493144 |
+-------------+-------+----------------+---------------+
[4 rows x 4 columns]



# Making Predictions

A graphlab model has a .predict() function that returns the predicted values for a dataset, so I will use it to make predictions.

In [31]:
sample_predictions = sample_model.predict(train_data)
print sample_predictions[0] 

271789.505878
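
As a sanity check, the first prediction can be reproduced by hand as the intercept plus the weighted sum of the features. The sketch below copies the coefficients from the table printed above. The feature values of the first training row are partly an assumption: bedrooms = 3 and bathrooms = 1 come from the `head()` output later in this notebook, and sqft_living = 1180 is inferred from the log_sqft_living value 7.07326971746 shown there.

```python
# Coefficients copied from the printed coefficient table of sample_model.
weights = {
    '(intercept)': 87910.0724924,
    'sqft_living': 315.403440552,
    'bedrooms': -65080.2155528,
    'bathrooms': 6944.02019265,
}

# First training row. sqft_living = 1180 is an assumption inferred from
# exp(7.07326971746); bedrooms and bathrooms are read off the head() output.
first_house = {'sqft_living': 1180.0, 'bedrooms': 3.0, 'bathrooms': 1.0}

# Linear model: intercept + sum over features of weight * value.
prediction = weights['(intercept)'] + sum(
    weights[name] * value for name, value in first_house.items()
)
print(prediction)
```

Up to rounding, the result should agree with the value 271789.505878 returned by .predict() for the first row.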


# Compute RSS

In [32]:
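def get_residual_sum_of_squares(model, data, outcome):
    # Predict the outcome for every row of the data
    predictions = model.predict(data)
    # Residual = observed outcome - predicted outcome
    residuals = outcome - predictions
    # Square the residuals and add them up
    RSS = (residuals * residuals).sum()
    return RSS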
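
The same computation can be sketched without graphlab on plain Python lists. The function name `residual_sum_of_squares` and the toy numbers are hypothetical and only illustrate the arithmetic: each residual is the observed value minus the predicted value, and the RSS is the sum of the squared residuals.

```python
def residual_sum_of_squares(outcome, predictions):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(outcome, predictions))

observed = [3.0, 5.0, 7.0]   # hypothetical observed values
predicted = [2.5, 5.0, 8.0]  # hypothetical predictions

rss = residual_sum_of_squares(observed, predicted)
print(rss)  # 0.5**2 + 0.0**2 + (-1.0)**2 = 1.25
```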

# Difference between inputs and features of a model


Let's see the difference between inputs and features. A model typically has at least as many features as inputs, because every input can be used as a feature by itself and further features can be derived from it. For example, let's say the inputs are x and y. Then my features will be at least x and y, and in addition they can include the square of x, the cube of x, the log of y, or the cube root of y.
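
The x and y example can be sketched in a few lines of Python. The helper `make_features` and the feature names are hypothetical, and the sketch assumes y > 0 so that the log and the cube root are both defined.

```python
from math import log

def make_features(x, y):
    """Hypothetical helper: derive six features from two inputs (assumes y > 0)."""
    return {
        'x': x,
        'y': y,
        'x_squared': x ** 2,
        'x_cubed': x ** 3,
        'log_y': log(y),
        'cube_root_y': y ** (1.0 / 3.0),
    }

features = make_features(2.0, 8.0)
print(len(features))  # two inputs produce six features
```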

# Let's add some more features

In [33]:
from math import log

I am creating the following 4 new features as columns in both the TEST and TRAIN data:
* bedrooms_squared = bedrooms*bedrooms
* bed_bath_rooms = bedrooms*bathrooms
* log_sqft_living = log(sqft_living)
* lat_plus_long = lat + long

As an example, here's the first one:

In [34]:
train_data['bedrooms_squared'] = train_data['bedrooms'].apply(lambda x: x**2)
test_data['bedrooms_squared'] = test_data['bedrooms'].apply(lambda x: x**2)

In [35]:
train_data['bed_bath_rooms'] = train_data['bedrooms'] * train_data['bathrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms'] * test_data['bathrooms']

train_data['log_sqft_living'] = train_data['sqft_living'].apply(lambda x: log(x))
test_data['log_sqft_living'] = test_data['sqft_living'].apply(lambda x: log(x))

train_data['lat_plus_long'] = train_data['lat'] + train_data['long']  # add latitude and longitude together
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']

In [36]:
train_data[['bedrooms','bathrooms','lat','long','bedrooms_squared','bed_bath_rooms','log_sqft_living','lat_plus_long']].head()

bedrooms,bathrooms,lat,long,bedrooms_squared,bed_bath_rooms,log_sqft_living,lat_plus_long
3.0,1.0,47.51123398,-122.25677536,9.0,3.0,7.07326971746,-74.74554138
3.0,2.25,47.72102274,-122.3188624,9.0,6.75,7.85166117789,-74.59783966
2.0,1.0,47.73792661,-122.23319601,4.0,2.0,6.64639051485,-74.4952694
4.0,3.0,47.52082,-122.39318505,16.0,12.0,7.58069975222,-74.87236505
3.0,2.0,47.61681228,-122.04490059,9.0,6.0,7.4265490724,-74.42808831
4.0,4.5,47.65611835,-122.00528655,16.0,18.0,8.59785109443,-74.3491682
3.0,2.25,47.30972002,-122.32704857,9.0,6.75,7.4471683596,-75.01732855
3.0,1.5,47.40949984,-122.31457273,9.0,4.5,6.96602418711,-74.90507289
3.0,1.0,47.51229381,-122.33659507,9.0,3.0,7.48436864329,-74.82430126
3.0,2.5,47.36840673,-122.0308176,9.0,7.5,7.54433210805,-74.66241087


# Learning Multiple Models

First of all, I am creating 3 different models and learning their weights for predicting house prices. The first model has the fewest features, the second adds one more feature, and the third adds a few more:
* Model 1: square feet, # bedrooms, # bathrooms, latitude & longitude
* Model 2: add bedrooms\*bathrooms
* Model 3: add log square feet, bedrooms squared, and the (nonsensical) latitude + longitude

In [37]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']

We can create the multiple regression models using graphlab's built-in function and then analyze the coefficient weights obtained in each of the models.

In [38]:
model_1 = graphlab.linear_regression.create(train_data, target = 'price', features = model_1_features, 
                                                  validation_set = None)
model_2 = graphlab.linear_regression.create(train_data, target = 'price', features = model_2_features, 
                                                  validation_set = None)
model_3 = graphlab.linear_regression.create(train_data, target = 'price', features = model_3_features, 
                                                  validation_set = None)

In [39]:
# Using .get() function of graphlab.
model_1_weight_summary = model_1.get("coefficients")
model_2_weight_summary = model_2.get("coefficients")
model_3_weight_summary = model_3.get("coefficients")
print model_1_weight_summary 
print model_2_weight_summary
print model_3_weight_summary

+-------------+-------+----------------+---------------+
|     name    | index |     value      |     stderr    |
+-------------+-------+----------------+---------------+
| (intercept) |  None | -56140675.7444 | 1649985.42028 |
| sqft_living |  None | 310.263325778  | 3.18882960408 |
|   bedrooms  |  None | -59577.1160682 | 2487.27977322 |
|  bathrooms  |  None | 13811.8405418  | 3593.54213297 |
|     lat     |  None | 629865.789485  | 13120.7100323 |
|     long    |  None | -214790.285186 | 13284.2851607 |
+-------------+-------+----------------+---------------+
[6 rows x 4 columns]

+----------------+-------+----------------+---------------+
|      name      | index |     value      |     stderr    |
+----------------+-------+----------------+---------------+
|  (intercept)   |  None | -54410676.1152 | 1650405.16541 |
|  sqft_living   |  None | 304.449298057  | 3.20217535637 |
|    bedrooms    |  None | -116366.043231 | 4805.54966546 |
|   bathrooms    |  None | -77972.3305135 | 7565

# Comparing multiple models

Now that we have learned three models and extracted their weights, we want to evaluate which model is best. The models can be compared by calculating the RSS on the test data for each of them.

First, let's see how the RSS on the training data varies with the number of features used in the model, by calculating the training RSS for all 3 models.

In [40]:
# RSS on TRAINING data for each of the three models:
rss_model_1_train = get_residual_sum_of_squares(model_1, train_data, train_data['price'])
rss_model_2_train = get_residual_sum_of_squares(model_2, train_data, train_data['price'])
rss_model_3_train = get_residual_sum_of_squares(model_3, train_data, train_data['price'])
print rss_model_1_train
print rss_model_2_train
print rss_model_3_train

9.71328233544e+14
9.61592067856e+14
9.05276314555e+14


# Useful Points
The above results match our intuition that a model with more features can fit the training data more closely, and hence model_3, which has the most features, has the lowest training RSS. But that does not mean we should always use more features. A model with more features can do well on the TRAINING data while its weakness becomes visible when it is applied to the TEST data. This phenomenon is known as OVERFITTING.

Now compute the RSS on the TEST data for each of the three models.

In [41]:
# RSS on TEST data for each of the three models:
rss_model_1_test = get_residual_sum_of_squares(model_1, test_data, test_data['price'])
rss_model_2_test = get_residual_sum_of_squares(model_2, test_data, test_data['price'])
rss_model_3_test = get_residual_sum_of_squares(model_3, test_data, test_data['price'])
print rss_model_1_test
print rss_model_2_test
print rss_model_3_test

2.26568089093e+14
2.24368799994e+14
2.51829318952e+14


As we can see from the above results, a model with more features does not necessarily have a lower error on the test data. In this case model_3, which has the most features, has the highest test RSS, which suggests that model_3 overfits the training data.
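
The comparison can also be written down explicitly. The short sketch below copies the RSS values from the outputs printed above and picks the model with the lowest RSS on each dataset; the dictionary names are hypothetical. Note that the training and test RSS values are not comparable with each other in magnitude, because the two datasets have different numbers of rows.

```python
# RSS values copied from the printed outputs for the three models.
train_rss = {
    'model_1': 9.71328233544e+14,
    'model_2': 9.61592067856e+14,
    'model_3': 9.05276314555e+14,
}
test_rss = {
    'model_1': 2.26568089093e+14,
    'model_2': 2.24368799994e+14,
    'model_3': 2.51829318952e+14,
}

# Pick the model name with the smallest RSS on each dataset.
best_on_train = min(train_rss, key=train_rss.get)
best_on_test = min(test_rss, key=test_rss.get)
print(best_on_train)  # model_3
print(best_on_test)   # model_2
```

The model that fits the training data best (model_3) is not the one that generalizes best to the test data (model_2).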