## Diego Orejuela
## Machine Learning Project

## Implement regression on the dataset provided.
Dataset: The Housing dataset contains information about houses in the suburbs of a US city in the 1970s. The features
of the 506 samples in the dataset are summarized here:
* CRIM: Per capita crime rate by town
* ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.
* INDUS: Proportion of non-retail business acres per town
* CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX: Nitric oxide concentration (parts per 10 million)
* RM: Average number of rooms per dwelling
* AGE: Proportion of owner-occupied units built prior to 1940
* DIS: Weighted distances to five Boston employment centers
* RAD: Index of accessibility to radial highways
* TAX: Full-value property tax rate per $10,000
* PTRATIO: Pupil-teacher ratio by town
* B: 1000(Bk - 0.63)^2, where Bk is the proportion of [people of African American descent] by town 
* LSTAT: Percentage of the population of lower status
* MEDV: Median value of owner-occupied homes in $1000s

The house prices (MEDV) will be regarded as the target variable (the variable that we want to predict using one or more of the 13 explanatory variables).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Load the dataset into a pandas dataframe and display the first 5 lines of the dataset along with the column headings

In [2]:
#load dataset
dataset = pd.read_csv('data.txt',header=None)

#split dataset by whitespace into columns
dataset = pd.DataFrame(dataset[0].str.split(expand=True))

#look for missing values
print(dataset.isnull().sum())

col_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']

dataset.columns = col_names

dataset.head()


0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
dtype: int64


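|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |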
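
The load cell reads each line as a single string and then splits it on whitespace, which leaves every column as text until the later `astype(float)` call. As an alternative sketch, `pd.read_csv` can split on whitespace and parse numbers in one step. The in-memory `sample` below is a hypothetical stand-in for `data.txt`, built from the first two rows shown above; the real file's exact spacing is an assumption.

```python
import io
import pandas as pd

col_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
             'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Hypothetical stand-in for data.txt: two whitespace-delimited rows.
sample = io.StringIO(
    "0.00632 18.0 2.31 0 0.538 6.575 65.2 4.09 1 296.0 15.3 396.9 4.98 24.0\n"
    "0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.9 9.14 21.6\n"
)

# sep=r'\s+' splits on any run of whitespace; names= assigns the headings.
df = pd.read_csv(sample, sep=r'\s+', header=None, names=col_names)

print(df.dtypes)   # numeric dtypes, no string columns
print(df.head())
```

With the real file, passing `'data.txt'` in place of `sample` would be the analogous call.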


## Split the dataset into training (70%) and testing set (30%)

In [3]:
X = dataset.iloc[: , 0:13].values
y = dataset.iloc[ : ,13].values

X = X.astype(float)
y = y.astype(float)

# add a column of ones to X (corresponding to the intercept term)
X = np.append(arr = np.ones((X.shape[0], 1)), values = X, axis = 1)

#split data into Training and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
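
As a rough illustration of what a 70/30 split means for 506 samples, the sketch below computes the subset sizes and draws a shuffled index split by hand. The test count is rounded up, which is consistent with the 152 values printed in the test-set output further down. The seed and the shuffling here are hypothetical and will not reproduce the rows chosen by `train_test_split(..., random_state = 0)`.

```python
import math
import numpy as np

n_samples, test_size = 506, 0.3

# Round the test count up; the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Hypothetical shuffle (not scikit-learn's): permute row indices, then cut.
rng = np.random.default_rng(0)
perm = rng.permutation(n_samples)
test_idx, train_idx = perm[:n_test], perm[n_test:]

print(n_train, n_test)
```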

## Model 1: Build a linear regression model with all the variables using Normal Equations

In [4]:
# Normal Equation: w = (X^T X)^-1 X^T y
xTx = X_train.T.dot(X_train)
xTx = np.linalg.inv(xTx)
xTx_xT = xTx.dot(X_train.T)
w = xTx_xT.dot(y_train)
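
The cell above forms the inverse of `X^T X` explicitly. A minimal sketch on synthetic data (the names `X_demo`, `y_demo`, and `w_true` are made up for illustration) shows two commonly used alternatives that give the same weights without an explicit inverse: solving the linear system `X^T X w = X^T y`, and calling a least-squares solver directly.

```python
import numpy as np

# Synthetic, noise-free data so the true weights are recoverable.
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
w_true = np.array([1.0, 2.0, -3.0, 0.5])
y_demo = X_demo @ w_true

# 1) Explicit inverse, as in the notebook cell.
w_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# 2) Solve the normal equations as a linear system.
w_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# 3) Least-squares solver applied to X directly.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(w_inv, w_solve, w_lstsq, sep="\n")
```

All three agree here; the solver-based forms tend to be better behaved numerically when columns of `X` are nearly collinear.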

## What are the weight parameters for model 1?

In [5]:
w

array([ 3.79371077e+01, -1.21310401e-01,  4.44664254e-02,  1.13416945e-02,
        2.51124642e+00, -1.62312529e+01,  3.85906801e+00, -9.98516565e-03,
       -1.50026956e+00,  2.42143466e-01, -1.10716124e-02, -1.01775264e+00,
        6.81446545e-03, -4.86738066e-01])

## Using Model 1, make predictions on the test set and calculate the Mean Squared Error

In [6]:
y_pred_m1 = X_test.dot(w)
print( "Test Set: \n", y_test,"\n")
print( "Predict Set: \n", y_pred_m1)

#Mean Squared Error
mse_m1 = np.square(np.subtract(y_pred_m1, y_test)).mean()
print("\n Mean Squared Error: ",mse_m1)

Test Set: 
 [22.6 50.  23.   8.3 21.2 19.9 20.6 18.7 16.1 18.6  8.8 17.2 14.9 10.5
 50.  29.  23.  33.3 29.4 21.  23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
 38.7 18.7 14.6 20.  20.5 20.1 23.6 16.8  5.6 50.  14.5 13.3 23.9 20.
 19.8 13.8 16.5 21.6 20.3 17.  11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
 17.1 17.2 15.  21.7 18.6 21.  33.1 31.5 20.1 29.8 15.2 15.  27.5 22.6
 20.  21.4 23.5 31.2 23.7  7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
 50.  23.  21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
 14.1 18.7 28.1 19.8 26.7 21.7 22.  22.9 10.4 21.9 20.6 26.4 41.3 17.2
 27.1 20.4 16.5 24.4  8.4 23.   9.7 50.  30.5 12.3 19.4 21.2 20.3 18.8
 33.4 18.5 19.6 33.2 13.1  7.5 13.6 17.4  8.4 35.4 24.  13.4 26.2  7.2
 13.1 24.5 37.2 25.  24.1 16.6 32.9 36.2 11.   7.2 22.8 28.7] 

Predict Set: 
 [24.9357079  23.75163164 29.32638296 11.97534566 21.37272478 19.19148525
 20.5717479  21.21154015 19.04572003 20.35463238  5.44119126 16.93688709
 17.15482272  5.3928209  40.20270696 32.31327348 22.46
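
The mean squared error used above is the average of the squared differences between predictions and targets. A tiny hand-checkable sketch, with made-up values, follows the same `np.square(np.subtract(...)).mean()` pattern:

```python
import numpy as np

# Made-up values for illustration only.
y_true_demo = np.array([3.0, 5.0, 2.0])
y_pred_demo = np.array([2.0, 5.0, 4.0])

# Errors are -1, 0, 2; their squares are 1, 0, 4; the mean is 5/3.
mse_demo = np.square(np.subtract(y_pred_demo, y_true_demo)).mean()
print(mse_demo)
```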

## Model 2: Using scikit-learn, build a Linear Regression model with all the variables

In [7]:
#Fitting Multiple Linear Regression to Training Set
# import LinearRegression class from scikit-learn
# initialize a LinearRegression object and fit X and y train sets
from sklearn.linear_model import LinearRegression
mlrObj_m2 = LinearRegression()
mlrObj_m2.fit(X_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## What are the weight parameters for model 2?

In [8]:
# weight parameters
mlrObj_m2.coef_

array([ 0.00000000e+00, -1.21310401e-01,  4.44664254e-02,  1.13416945e-02,
        2.51124642e+00, -1.62312529e+01,  3.85906801e+00, -9.98516565e-03,
       -1.50026956e+00,  2.42143466e-01, -1.10716124e-02, -1.01775264e+00,
        6.81446545e-03, -4.86738066e-01])
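
The first coefficient is 0 here but 37.937 in Model 1, while the other 13 weights match. `X` already contains a column of ones, and `LinearRegression` was fitted with `fit_intercept=True`, so the model estimates its own intercept (available as `mlrObj_m2.intercept_`, which should be close to Model 1's first weight) and the constant column carries no extra information. The NumPy sketch below mimics this under the assumption that the fit centers `X` and `y` before solving; the data and names are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 2))
X_demo = np.column_stack([np.ones(40), features])   # ones column, as in the notebook
y_demo = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1]

# Centering turns the ones column into all zeros.
x_mean, y_mean = X_demo.mean(axis=0), y_demo.mean()
X_centered = X_demo - x_mean

# The minimum-norm least-squares solution gives a zero column a zero weight.
coef, *_ = np.linalg.lstsq(X_centered, y_demo - y_mean, rcond=None)
intercept = y_mean - x_mean @ coef

print(coef)        # first entry is (numerically) zero
print(intercept)   # recovers the constant term
```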

## Using Model 2, make predictions on the test set and calculate the Mean Squared Error

In [9]:
# Predicting on Test Set
y_pred_m2 = mlrObj_m2.predict(X_test)
print( "Test Set: \n", y_test,"\n")
print( "Predict Set: \n", y_pred_m2)

#Mean Squared Error
mse_m2 = np.square(np.subtract(y_pred_m2, y_test)).mean()
print("\n Mean Squared Error: ",mse_m2)

Test Set: 
 [22.6 50.  23.   8.3 21.2 19.9 20.6 18.7 16.1 18.6  8.8 17.2 14.9 10.5
 50.  29.  23.  33.3 29.4 21.  23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
 38.7 18.7 14.6 20.  20.5 20.1 23.6 16.8  5.6 50.  14.5 13.3 23.9 20.
 19.8 13.8 16.5 21.6 20.3 17.  11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
 17.1 17.2 15.  21.7 18.6 21.  33.1 31.5 20.1 29.8 15.2 15.  27.5 22.6
 20.  21.4 23.5 31.2 23.7  7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
 50.  23.  21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
 14.1 18.7 28.1 19.8 26.7 21.7 22.  22.9 10.4 21.9 20.6 26.4 41.3 17.2
 27.1 20.4 16.5 24.4  8.4 23.   9.7 50.  30.5 12.3 19.4 21.2 20.3 18.8
 33.4 18.5 19.6 33.2 13.1  7.5 13.6 17.4  8.4 35.4 24.  13.4 26.2  7.2
 13.1 24.5 37.2 25.  24.1 16.6 32.9 36.2 11.   7.2 22.8 28.7] 

Predict Set: 
 [24.9357079  23.75163164 29.32638296 11.97534566 21.37272478 19.19148525
 20.5717479  21.21154015 19.04572003 20.35463238  5.44119126 16.93688709
 17.15482272  5.3928209  40.20270696 32.31327348 22.46

## Function forwardSelection(x, sl) to implement the forward feature selection technique

In [10]:
## Automatic Forward Selection
# OLS with array inputs lives in statsmodels.api (formula.api exposes the formula interface)
import statsmodels.api as sm
def forwardSelection(x, sl):
    row, column = x.shape
    minIndex = -2
    k = column

    #loop through k-1 size combinations starting at 0 (i)
    for i in range(0, k+1):
        #print("i: " + str(i) + " | X" + str(minIndex) + " added to model" +  " | " + str(x[0]))
        if i > 1 and minIndex != -1:
            selected = np.column_stack((selected[:,], x[:,minIndex]))
            x = np.delete(x, minIndex, 1)
        elif i == 1 and minIndex != -1:
            selected = x[:,minIndex]
            x = np.delete(x, minIndex, 1)
        
        minPval = 1
        minIndex = -1
        #loop through columns k-i times (j) starting at 0
        for j in range(0, k-i):
            if i > 0:
                obj_OLS = sm.OLS(y, np.column_stack((selected[:,], x[:,j]))).fit()
            else:
                obj_OLS = sm.OLS(y, x[:,j]).fit()
 
            pVal = obj_OLS.pvalues[-1].astype(float)
            print(obj_OLS.pvalues)
            #print("----X" + str(j) + ": p-value = " + str(pVal) + " | Min P-Value = " + str(minPval) + "-----")
            if pVal < sl:
                if pVal < minPval:
                    minPval = pVal
                    minIndex = j
    return selected
    
SL = 0.05
X_sig = X[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]]
X_ModeledForward = forwardSelection(X_sig, SL)

[9.37062373e-216]
[3.55554392e-07]
[8.87823995e-38]
[9.55082965e-74]
[1.48290625e-12]
[5.53284345e-160]
[3.74375819e-256]
[1.8485505e-115]
[1.51525782e-146]
[2.24637612e-48]
[1.02622585e-108]
[1.51036031e-181]
[3.98926958e-219]
[2.71419481e-67]
[2.48722887e-74 6.95022903e-34]
[2.04117444e-270 3.50077821e-022]
[1.27852731e-230 1.88182618e-012]
[1.61635352e-215 2.41985348e-033]
[8.14575415e-248 1.89019593e-004]
[2.33956493e-135 3.19222070e-039]
[5.07510979e-169 2.11454128e-028]
[2.14231854e-113 2.59936465e-003]
[2.34800721e-232 7.24337088e-025]
[2.64223375e-187 5.10142300e-040]
[2.67570806e-126 9.87378369e-052]
[7.52189309e-41 2.48299534e-03]
[1.61368901e-261 4.81185167e-073]
[3.47225760e-27 6.66936548e-41 6.68764941e-01]
[2.01550663e-257 1.24430924e-054 1.77647804e-003]
[2.38938025e-223 1.63865679e-062 1.77928703e-001]
[1.61536326e-252 7.40186618e-042 1.48998799e-001]
[9.23297771e-259 7.36380657e-074 1.88614494e-005]
[1.79592847e-112 1.13706862e-035 4.33402411e-001]
[5.20920834e-204 1.7
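
For comparison, here is a compact sketch of p-value-based forward selection on synthetic data. It is not the function above: it takes `y` as an argument, always includes an intercept column that is never a candidate, and stops as soon as no remaining feature has a p-value below the threshold. The helper `ols_pvalues` and all names are hypothetical, and the p-values are computed by hand with SciPy's t-distribution rather than with statsmodels.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-value for each OLS coefficient."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # coefficient covariance
    t_stat = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stat), df=n - p)

def forward_select(X, y, sl=0.05):
    """Return the column indices of X picked by forward selection."""
    ones = np.ones((X.shape[0], 1))
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        best_p, best_j = sl, None
        for j in remaining:
            design = np.column_stack([ones, X[:, selected + [j]]])
            p_new = ols_pvalues(design, y)[-1]    # p-value of the candidate
            if p_new < best_p:
                best_p, best_j = p_new, j
        if best_j is None:                        # nothing significant left
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data: only columns 1 and 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 + 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

chosen = forward_select(X_demo, y_demo)
print(chosen)
```

Returning indices rather than a column-stacked array makes it easy to map the selection back to feature names.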

## Display the features selected by the function

In [11]:
print(X_ModeledForward.shape)
X_ModeledForward

(506, 12)


array([[6.5750e+00, 4.9800e+00, 1.5300e+01, ..., 6.3200e-03, 1.0000e+00,
        2.9600e+02],
       [6.4210e+00, 9.1400e+00, 1.7800e+01, ..., 2.7310e-02, 2.0000e+00,
        2.4200e+02],
       [7.1850e+00, 4.0300e+00, 1.7800e+01, ..., 2.7290e-02, 2.0000e+00,
        2.4200e+02],
       ...,
       [6.9760e+00, 5.6400e+00, 2.1000e+01, ..., 6.0760e-02, 1.0000e+00,
        2.7300e+02],
       [6.7940e+00, 6.4800e+00, 2.1000e+01, ..., 1.0959e-01, 1.0000e+00,
        2.7300e+02],
       [6.0300e+00, 7.8800e+00, 2.1000e+01, ..., 4.7410e-02, 1.0000e+00,
        2.7300e+02]])

## Model 3: Using scikit-learn, build a Linear Regression model using only the features selected by the forward selection function

In [12]:
#Splitting the data into Training Set and Test Set
X_sig_train, X_sig_test, y_sig_train, y_sig_test = train_test_split(X_ModeledForward, y, test_size=0.3, random_state=0)

#Fitting Multiple Linear Regression to Training Set
# import LinearRegression class from scikit-learn
# initialize a LinearRegression object and fit X and y train sets
from sklearn.linear_model import LinearRegression
mlrObj_m3 = LinearRegression()
mlrObj_m3.fit(X_sig_train, y_sig_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## Using Model 3, make predictions on the test set and calculate the Mean Squared Error

In [13]:
# Predicting on Test Set
y_pred_m3 = mlrObj_m3.predict(X_sig_test)
print( "Test Set: \n", y_sig_test,"\n")
print( "Predict Set: \n", y_pred_m3)

#Mean Squared Error
mse_m3 = np.square(np.subtract(y_pred_m3, y_sig_test)).mean()
print("\n Mean Squared Error: ",mse_m3)

Test Set: 
 [22.6 50.  23.   8.3 21.2 19.9 20.6 18.7 16.1 18.6  8.8 17.2 14.9 10.5
 50.  29.  23.  33.3 29.4 21.  23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
 38.7 18.7 14.6 20.  20.5 20.1 23.6 16.8  5.6 50.  14.5 13.3 23.9 20.
 19.8 13.8 16.5 21.6 20.3 17.  11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
 17.1 17.2 15.  21.7 18.6 21.  33.1 31.5 20.1 29.8 15.2 15.  27.5 22.6
 20.  21.4 23.5 31.2 23.7  7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
 50.  23.  21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
 14.1 18.7 28.1 19.8 26.7 21.7 22.  22.9 10.4 21.9 20.6 26.4 41.3 17.2
 27.1 20.4 16.5 24.4  8.4 23.   9.7 50.  30.5 12.3 19.4 21.2 20.3 18.8
 33.4 18.5 19.6 33.2 13.1  7.5 13.6 17.4  8.4 35.4 24.  13.4 26.2  7.2
 13.1 24.5 37.2 25.  24.1 16.6 32.9 36.2 11.   7.2 22.8 28.7] 

Predict Set: 
 [24.67908024 23.92880684 29.48092818 12.03270293 21.30919509 19.17352252
 20.37758536 21.23489071 18.76934955 20.56617036  5.5810561  17.02526238
 17.15581931  5.42147833 40.3085712  32.17527919 22.40

## Function backwardElimination(x, sl) to implement the backward feature elimination technique

In [14]:
## Automatic Backward Elimination
# OLS with array inputs lives in statsmodels.api (formula.api exposes the formula interface)
import statsmodels.api as sm
def backwardElimination(x, sl):
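    # x: design matrix including the intercept column; sl: significance level
    # note: y is read from the notebook's global scope
    numVars = x.shape[1]
    for i in range(0, numVars):
        obj_OLS = sm.OLS(y, x).fit()
        pvalues = obj_OLS.pvalues
        worst = int(np.argmax(pvalues))
        if pvalues[worst] <= sl:
            # every remaining feature is significant, so stop
            break
        # drop the single feature with the largest p-value and refit
        x = np.delete(x, worst, 1)
    return x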
SL = 0.05
X_sig_bm = X[:,[0,1,2,3,4,5,6,7,8,9,10,11,12,13]]
X_Modeled_bm = backwardElimination(X_sig_bm, SL)
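
The same idea can be sketched on synthetic data with an explicit `y` argument and index bookkeeping. This is an illustrative variant, not the function above: it assumes column 0 is the intercept and never drops it, and it returns the indices of the surviving columns. The helper `ols_pvalues` and all names are hypothetical, with p-values computed by hand using SciPy's t-distribution.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-value for each OLS coefficient."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stat = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stat), df=n - p)

def backward_eliminate(X, y, sl=0.05):
    """Return indices of the columns of X kept by backward elimination."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        pvals = ols_pvalues(X[:, keep], y)
        worst = 1 + int(np.argmax(pvals[1:]))   # skip the intercept at position 0
        if pvals[worst] <= sl:                  # all remaining features significant
            break
        del keep[worst]
    return keep

# Synthetic data: intercept plus five features; only columns 2 and 4 matter.
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(200), rng.normal(size=(200, 5))])
y_demo = 1.0 + 2.5 * X_demo[:, 2] - 1.5 * X_demo[:, 4] + rng.normal(scale=0.5, size=200)

kept = backward_eliminate(X_demo, y_demo)
print(kept)
```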


## Display the features selected by the function

In [15]:
print(X_Modeled_bm.shape)
X_Modeled_bm

(506, 12)


array([[1.0000e+00, 6.3200e-03, 1.8000e+01, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [1.0000e+00, 2.7310e-02, 0.0000e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [1.0000e+00, 2.7290e-02, 0.0000e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [1.0000e+00, 6.0760e-02, 0.0000e+00, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0000e+00, 1.0959e-01, 0.0000e+00, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [1.0000e+00, 4.7410e-02, 0.0000e+00, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

## Model 4: Using scikit-learn, build a Linear Regression model using only the features selected by the backward elimination function

In [16]:
#Splitting the data into Training Set and Test Set
X_sig_train_bm, X_sig_test_bm, y_sig_train_bm, y_sig_test_bm = train_test_split(X_Modeled_bm, y, test_size=0.3, random_state=0)

#Fitting Multiple Linear Regression to Training Set
# import LinearRegression class from scikit-learn
# initialize a LinearRegression object and fit X and y train sets
from sklearn.linear_model import LinearRegression
mlrObj_m4 = LinearRegression()
mlrObj_m4.fit(X_sig_train_bm, y_sig_train_bm)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## Using Model 4, make predictions on the test set and calculate the Mean Squared Error

In [17]:
# Predicting on Test Set
y_pred_m4 = mlrObj_m4.predict(X_sig_test_bm)
print( "Test Set: \n", y_sig_test_bm,"\n")
print( "Predict Set: \n", y_pred_m4)

#Mean Squared Error
mse_m4 = np.square(np.subtract(y_pred_m4, y_sig_test_bm)).mean()
print("\n Mean Squared Error: ",mse_m4)

Test Set: 
 [22.6 50.  23.   8.3 21.2 19.9 20.6 18.7 16.1 18.6  8.8 17.2 14.9 10.5
 50.  29.  23.  33.3 29.4 21.  23.8 19.1 20.4 29.1 19.3 23.1 19.6 19.4
 38.7 18.7 14.6 20.  20.5 20.1 23.6 16.8  5.6 50.  14.5 13.3 23.9 20.
 19.8 13.8 16.5 21.6 20.3 17.  11.8 27.5 15.6 23.1 24.3 42.8 15.6 21.7
 17.1 17.2 15.  21.7 18.6 21.  33.1 31.5 20.1 29.8 15.2 15.  27.5 22.6
 20.  21.4 23.5 31.2 23.7  7.4 48.3 24.4 22.6 18.3 23.3 17.1 27.9 44.8
 50.  23.  21.4 10.2 23.3 23.2 18.9 13.4 21.9 24.8 11.9 24.3 13.8 24.7
 14.1 18.7 28.1 19.8 26.7 21.7 22.  22.9 10.4 21.9 20.6 26.4 41.3 17.2
 27.1 20.4 16.5 24.4  8.4 23.   9.7 50.  30.5 12.3 19.4 21.2 20.3 18.8
 33.4 18.5 19.6 33.2 13.1  7.5 13.6 17.4  8.4 35.4 24.  13.4 26.2  7.2
 13.1 24.5 37.2 25.  24.1 16.6 32.9 36.2 11.   7.2 22.8 28.7] 

Predict Set: 
 [24.67908024 23.92880684 29.48092818 12.03270293 21.30919509 19.17352252
 20.37758536 21.23489071 18.76934955 20.56617036  5.5810561  17.02526238
 17.15581931  5.42147833 40.3085712  32.17527919 22.40