## Multiple Linear Regression Assignment 1
In this notebook you will use data on house sales in King County to predict prices using multiple regression. The first assignment explores multiple regression, in particular the impact of adding features to a regression and how to measure error. In the second assignment you will implement a gradient descent algorithm.

In [9]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}

tran_data = pd.read_csv("kc_house_train_data.csv",dtype = dtype_dict)
test_data = pd.read_csv("kc_house_test_data.csv", dtype = dtype_dict)

In [4]:
tran_data.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

We often think of multiple regression as including multiple different features (e.g. # of bedrooms, square feet, and # of bathrooms), but we can also consider transformations of existing variables, e.g. the log of the square feet, or even "interaction" variables such as the product of bedrooms and bathrooms. Add 4 new variables to both your training data (tran_data in this notebook) and test_data.

- ‘bedrooms_squared’ = ‘bedrooms’*‘bedrooms’
- ‘bed_bath_rooms’ = ‘bedrooms’*‘bathrooms’
- ‘log_sqft_living’ = log(‘sqft_living’)
- ‘lat_plus_long’ = ‘lat’ + ‘long’

Before we continue let’s explain these new variables:

1. Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
2. Bedrooms times bathrooms is what's called an "interaction" variable. It is large when both of them are large.
3. Taking the log of square feet has the effect of bringing large values closer together and spreading out small values.
4. Adding latitude to longitude is nonsensical, but we will do it anyway (you'll see why).
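
To make points 1 and 3 concrete, here is a small sketch on made-up numbers. The arrays below are hypothetical values chosen for illustration, not rows from the King County data.

```python
import numpy as np

# Hypothetical bedroom counts and living areas, chosen only for illustration.
bedrooms = np.array([1.0, 2.0, 4.0])
sqft_living = np.array([500.0, 1000.0, 5000.0, 10000.0])

bedrooms_squared = bedrooms * bedrooms
log_sqft_living = np.log(sqft_living)

# Squaring widens the gap between few and many bedrooms:
# the raw gap 4 - 1 = 3 becomes 16 - 1 = 15.
gap_raw = bedrooms[-1] - bedrooms[0]
gap_squared = bedrooms_squared[-1] - bedrooms_squared[0]

# The log turns equal ratios into equal differences: doubling 500 -> 1000
# and doubling 5000 -> 10000 both add log(2), although the raw gaps are
# 500 and 5000 square feet.
small_step = log_sqft_living[1] - log_sqft_living[0]
large_step = log_sqft_living[3] - log_sqft_living[2]

print(gap_raw, gap_squared, small_step, large_step)
```

On the log scale the two doublings are the same distance apart, which is the sense in which large values are pulled together relative to small ones.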

In [5]:
def add_new_features(df):
    df["bedrooms_squared"] = df["bedrooms"] * df["bedrooms"]
    df["bed_bath_rooms"] = df["bedrooms"] * df["bathrooms"]
    df["log_sqft_living"] = np.log(df["sqft_living"])
    df["lat_plus_long"] = df["lat"] + df["long"]

add_new_features(tran_data)
add_new_features(test_data)

In [6]:
tran_data.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15', 'bedrooms_squared',
       'bed_bath_rooms', 'log_sqft_living', 'lat_plus_long'],
      dtype='object')

## 4. Quiz Question: what are the mean (arithmetic average) values of your 4 new variables on TEST data? (round to 2 digits)

In [8]:
test_data[["bedrooms_squared","bed_bath_rooms","log_sqft_living","lat_plus_long"]].mean()

bedrooms_squared    12.446678
bed_bath_rooms       7.503902
log_sqft_living      7.550275
lat_plus_long      -74.653334
dtype: float64

## 5. Use graphlab.linear_regression.create (or any other regression library/function) to estimate the regression coefficients/weights for predicting ‘price’ for the following three models. (In all 3 models include an intercept -- most software does this by default.)

- Model 1: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’, and ‘long’
- Model 2: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’,‘long’, and ‘bed_bath_rooms’
- Model 3: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’,‘long’, ‘bed_bath_rooms’, ‘bedrooms_squared’, ‘log_sqft_living’, and ‘lat_plus_long’

You’ll note that the three models here are “nested” in that all of the features of the Model 1 are in Model 2 and all of the features of Model 2 are in Model 3.
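
One consequence of nesting is that a least squares fit can never do worse on the TRAINING data when features are added, because the larger model can always reproduce the smaller one by setting the extra weights to zero. The sketch below illustrates this on synthetic data with numpy's least squares solver. The data, the helper name training_rss, and the variable names are assumptions made for this example, not part of the assignment.

```python
import numpy as np

# Synthetic data: three candidate features, only the first two matter.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

def training_rss(features, target):
    """Fit least squares with an intercept and return the training RSS."""
    design = np.column_stack([np.ones(len(target)), features])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ weights
    return float(residuals @ residuals)

# Three nested models: each one contains all features of the previous one.
rss_small = training_rss(X[:, :1], y)
rss_medium = training_rss(X[:, :2], y)
rss_large = training_rss(X, y)

print(rss_small, rss_medium, rss_large)
```

The same guarantee does not hold on held-out data, which is why the assignment asks about training and test RSS separately.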

In [14]:
def create_model(df, feature_names):
    # Fit an ordinary least squares model (intercept included by default) that predicts price
    clf = LinearRegression()
    features = df[feature_names]
    y = df["price"]
    clf.fit(features, y)
    return clf

model1 = create_model(tran_data, ["sqft_living", "bedrooms", "bathrooms", "lat", "long"])
model2 = create_model(tran_data, ["sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms"])
model3 = create_model(tran_data, ["sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms","bedrooms_squared","log_sqft_living","lat_plus_long"])    

## Quiz Question: What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 1?

In [15]:
model1.coef_

array([ 3.12258646e+02, -5.95865332e+04,  1.57067421e+04,  6.58619264e+05,
       -3.09374351e+05])

## Quiz Question: What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 2?

In [16]:
model2.coef_

array([ 3.06610053e+02, -1.13446368e+05, -7.14613083e+04,  6.54844630e+05,
       -2.94298969e+05,  2.55796520e+04])
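
The entries of coef_ follow the column order passed to fit (the intercept is stored separately in intercept_), so pairing the printed values with the feature list makes the sign questions easier to read. The sketch below copies the two arrays shown above by hand; the variable names are introduced here for illustration.

```python
# Feature order used when fitting model1 and model2.
model1_features = ["sqft_living", "bedrooms", "bathrooms", "lat", "long"]
model2_features = model1_features + ["bed_bath_rooms"]

# Values copied from the model1.coef_ and model2.coef_ outputs above.
model1_coefficients = [3.12258646e+02, -5.95865332e+04, 1.57067421e+04,
                       6.58619264e+05, -3.09374351e+05]
model2_coefficients = [3.06610053e+02, -1.13446368e+05, -7.14613083e+04,
                       6.54844630e+05, -2.94298969e+05, 2.55796520e+04]

# Pair each coefficient with its feature name.
model1_by_name = dict(zip(model1_features, model1_coefficients))
model2_by_name = dict(zip(model2_features, model2_coefficients))

print("Model 1 bathrooms:", model1_by_name["bathrooms"])
print("Model 2 bathrooms:", model2_by_name["bathrooms"])
```

The weight on bathrooms is positive in Model 1 and negative in Model 2, where the interaction term bed_bath_rooms carries a positive weight.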

## Quiz Question: Which model (1, 2 or 3) had the lowest RSS on TRAINING data?

In [17]:
def rss(y_pred,y_true):
    return np.sum(np.square(y_pred - y_true))

In [21]:
def get_model_rss(df,feature_names,clf):
    y_pred = clf.predict(df[feature_names])
    return rss(y_pred,df['price'])


model1_rss = get_model_rss(tran_data,["sqft_living", "bedrooms", "bathrooms", "lat", "long"],model1)
model2_rss = get_model_rss(tran_data,["sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms"], model2)
model3_rss = get_model_rss(tran_data, ["sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms","bedrooms_squared","log_sqft_living","lat_plus_long"], model3)

In [22]:
model1_rss,model2_rss,model3_rss

(967879963049545.8, 958419635074068.8, 903436455050478.0)

In [25]:
print(min([model1_rss,model2_rss,model3_rss]))

903436455050478.0


## Quiz Question: Which model (1, 2, or 3) had the lowest RSS on TESTING data?

In [23]:
test_model1_rss = get_model_rss(test_data,["sqft_living", "bedrooms", "bathrooms", "lat", "long"],model1)
test_model2_rss = get_model_rss(test_data,["sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms"], model2)
test_model3_rss = get_model_rss(test_data, ["sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms","bedrooms_squared","log_sqft_living","lat_plus_long"], model3)

In [24]:
test_model1_rss,test_model2_rss,test_model3_rss

(225500469795489.56, 223377462976467.16, 259236319207179.3)

In [26]:
print(min([test_model1_rss,test_model2_rss,test_model3_rss]))

223377462976467.16


## Regression Week 2: Multiple Linear Regression Quiz 2

Estimating Multiple Regression Coefficients (Gradient Descent)
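In the first notebook we explored multiple regression using a library model (scikit-learn's LinearRegression here, in place of GraphLab Create). Now we will use pandas DataFrames (in place of SFrames) along with numpy to solve for the regression weights with gradient descent.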


In [73]:
def get_numpy_data(data_sframe, features, output):
    data_sframe['constant'] = 1 # add a constant column (note: this modifies the caller's DataFrame)
    # prepend variable 'constant' to the features list
    features = ['constant'] + features
    # select the columns of data_sframe given by the 'features' list
    features_sframe = data_sframe[features]
    # convert the selected columns into a numpy matrix
    features_matrix = features_sframe.to_numpy()
    # assign the column of data_sframe associated with the target to the variable 'output_sarray'
    output_sarray = data_sframe[output]
    # convert the target column into a numpy array
    output_array = output_sarray.to_numpy()
    return(features_matrix, output_array)

In [74]:
def predict_outcome(feature_matrix, weights):
    predictions = np.dot(feature_matrix,weights)
    return predictions

In [75]:
def feature_derivative(errors, feature):
    derivative = -2 * np.dot(feature,errors)
    return(derivative)
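
For reference, the factor of -2 in feature_derivative comes from differentiating the residual sum of squares with respect to a single weight. The symbols below (N houses, x_ij for feature j of house i, y_i for its price, w_j for the weights) are notation introduced here, not taken from the notebook.

```latex
\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} \Bigl( y_i - \sum_{k} w_k x_{ik} \Bigr)^2
\qquad\Longrightarrow\qquad
\frac{\partial\,\mathrm{RSS}}{\partial w_j}
  = -2 \sum_{i=1}^{N} x_{ij} \Bigl( y_i - \sum_{k} w_k x_{ik} \Bigr)
```

With errors defined as output minus predictions, the sum on the right is the dot product of the feature column with errors, which is what the function computes.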

In [79]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        # compute the predictions based on feature_matrix and weights:
        predictions = predict_outcome(feature_matrix,weights)
        # compute the errors as output - predictions (feature_derivative applies the -2 factor):
        errors = output - predictions

        gradient_sum_squares = 0 # initialize the sum of squared derivatives
        # update each weight individually:
        for i in range(len(weights)):
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            derivative = feature_derivative(errors,feature_matrix[:,i])
            # add the squared derivative to the gradient magnitude
            gradient_sum_squares += derivative**2
            # update the weight based on step size and derivative:
            weights[i] -= step_size * derivative
        gradient_magnitude = np.sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
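
The loop above computes errors once per pass and then takes one dot product per feature column, so all the derivatives can also be obtained in a single matrix product. Below is a minimal vectorized sketch checked against numpy's least squares solver on synthetic data. The function name, the iteration cap max_iterations, and the synthetic data are assumptions made for this example. It also tests convergence before updating the weights, which differs slightly from the cell above.

```python
import numpy as np

def gradient_descent_vectorized(feature_matrix, output, initial_weights,
                                step_size, tolerance, max_iterations=100_000):
    """Gradient descent on RSS using one matrix product per step."""
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        errors = output - feature_matrix @ weights
        # Gradient of RSS with respect to all weights at once.
        gradient = -2.0 * (feature_matrix.T @ errors)
        if np.linalg.norm(gradient) < tolerance:
            break
        weights -= step_size * gradient
    return weights

# Synthetic problem: intercept 1, slope 2, small noise.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)

weights_gd = gradient_descent_vectorized(X, y, [0.0, 0.0],
                                         step_size=1e-3, tolerance=1e-8)
weights_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print(weights_gd, weights_exact)
```

The step size here suits features of roughly unit scale. The house data has features in the thousands, which is why the notebook uses step sizes on the order of 1e-12.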

In [80]:
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(tran_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7

In [81]:
simple_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)

## Quiz Question: What is the value of the weight for sqft_living -- the second element of ‘simple_weights’ (rounded to 1 decimal place)?

In [82]:
simple_weights

array([-46999.88716555,    281.91211918])

In [83]:
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)

In [85]:
predict_prices = np.dot(test_simple_feature_matrix,simple_weights.T)

## Quiz Question: What is the predicted price for the 1st house in the Test data set for model 1 (round to nearest dollar)?

In [88]:
round(predict_prices[0],0)

356134.0

In [89]:
rss(predict_prices,test_output)

275400044902128.3

In [91]:
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(tran_data, model_features,my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9

In [92]:
simple_weights = regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance)

In [93]:
simple_weights

array([-9.99999688e+04,  2.45072603e+02,  6.52795267e+01])

In [95]:
(feature_matrix_test, output_test) = get_numpy_data(test_data, model_features,my_output)

In [96]:
predict_prices = np.dot(feature_matrix_test,simple_weights.T)

## Quiz Question: What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?

In [97]:
round(predict_prices[0],0)

366651.0

In [98]:
output_test[0]

310000.0

In [99]:
rss(predict_prices,output_test)

270263443629803.56