# Machine Learning Model Selection

## Goal

Once the data has been cleaned, the next step is model selection: finding the algorithm that gives the best predictive score on our test dataset. To accomplish this goal, we will first try several different models with their default settings, running cross-validation to measure each model's validation score. We will then choose the model with the best validation score and tune its hyperparameters to improve the fit. The following models will be compared:
1. Linear Regression
2. ElasticNet Regression
3. LARS Regression
4. Bayesian Regression
5. Perceptron
6. Support Vector Machines
7. KNN Regression
8. Random Forest Regression
9. XGBoost Regression
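
The comparison procedure described above can be sketched as a loop over candidate estimators, each scored with 5-fold cross-validation. This is a minimal sketch rather than the notebook's actual code: the synthetic data from `make_regression` stands in for the housing features, and only three of the listed models are included to keep it short.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the cleaned housing data (assumption, not the real dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=1)

# A subset of the candidate models, all with default settings.
candidates = {
    "Linear Regression": LinearRegression(),
    "ElasticNet Regression": ElasticNet(),
    "KNN Regression": KNeighborsRegressor(),
}

baseline_scores = {}
for name, model in candidates.items():
    # neg_mean_squared_error: values are <= 0, and closer to 0 is better.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5,
                                  scoring="neg_mean_squared_error")
    baseline_scores[name] = fold_scores.mean()
    print("{} Baseline Cross Validation Score {}".format(name, baseline_scores[name]))

# The model with the highest (least negative) score would move on to tuning.
best_model_name = max(baseline_scores, key=baseline_scores.get)
print("Best baseline: {}".format(best_model_name))
```

Averaging the fold scores gives one number per model, which makes the candidates directly comparable before any tuning is done.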

Note: Based on previous work, 2nd-degree polynomial features will be added for all the features, and PCA will be performed to retain 98% of the variance.
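
Passing a float between 0 and 1 as `n_components` asks scikit-learn's `PCA` to keep the smallest number of components whose cumulative explained variance exceeds that fraction, so retaining 98% of the variance corresponds to `PCA(.98)`. A small sketch on made-up correlated data (an assumption, standing in for the polynomial features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(1)
# 20 observed columns driven by 5 latent factors plus a little noise (made-up data).
latent = rng.normal(size=(300, 5))
X_demo = latent @ rng.normal(size=(5, 20)) + 0.05 * rng.normal(size=(300, 20))

pca = PCA(n_components=0.98)  # keep enough components for 98% of the variance
X_reduced = pca.fit_transform(X_demo)

retained = pca.explained_variance_ratio_.sum()
print("components kept: {}".format(pca.n_components_))
print("variance retained: {:.4f}".format(retained))
```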

## Data Loading and Adjustments

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

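# Load the cleaned training data and separate the target from the features
data = pd.read_csv('/Users/Tomas/Desktop/Kaggle-House-Prices-Challenge/data/final_train.csv', index_col=0)
prices = data['SalePrice']
data = data.drop(['SalePrice'], axis=1)

# Expand the features with all 2nd-degree polynomial terms
poly = PolynomialFeatures(2)
data = poly.fit_transform(data)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(data, prices, test_size=0.2, random_state=1)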
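
`PolynomialFeatures(2)` produces a bias column, the original features, and every square and pairwise product, so n input features become (n + 1)(n + 2) / 2 output columns. The toy array below is made up purely to show the expansion; the rapid growth in column count is one reason to follow this step with dimensionality reduction.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two rows, three features (illustrative values only).
X_demo = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

poly_demo = PolynomialFeatures(2)
X_poly = poly_demo.fit_transform(X_demo)

n = X_demo.shape[1]
# 1 bias column + n linear terms + n(n+1)/2 quadratic terms
expected_columns = (n + 1) * (n + 2) // 2

print("columns after expansion: {}".format(X_poly.shape[1]))
print("expected from formula:   {}".format(expected_columns))
print(X_poly[0])
```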

## 1. Linear Regression

In [2]:
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import GridSearchCV

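# PCA keeps enough components to explain 98% of the variance before the regression
linear_regressor = Pipeline([('pca', PCA(.98)),
                     ('regr', linear_model.LinearRegression())])

# An empty param_grid means GridSearchCV only runs 5-fold cross-validation on the defaults
estimator = GridSearchCV(linear_regressor, cv=5, n_jobs=-1, param_grid={}, scoring='neg_mean_squared_error')

estimator.fit(X_train, y_train)
# best_score_ is the negated MSE (closer to 0 is better); the test score below is a plain MSE
print("Linear Regression Baseline Cross Validation Score {}".format(estimator.best_score_))
predictions = estimator.predict(X_test)
print("Linear Regression Baseline Test Score {}".format(mean_squared_error(y_test, predictions)))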


Linear Regression Baseline Cross Validation Score -1123168771.3887084
Linear Regression Baseline Test Score 869083595.8032187
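
Because the scorer is `neg_mean_squared_error`, the cross-validation score is a negated MSE, while the test score is a plain MSE; both are in squared units of `SalePrice`. Taking the square root of the magnitude gives an RMSE in the same units as `SalePrice`, which is easier to interpret. A small sketch using the two values printed above:

```python
from math import sqrt

# Values copied from the output above.
cv_score = -1123168771.3887084   # negated MSE from 5-fold cross-validation
test_mse = 869083595.8032187     # MSE on the held-out test split

cv_rmse = sqrt(-cv_score)        # flip the sign before taking the root
test_rmse = sqrt(test_mse)

print("Linear Regression Baseline Cross Validation RMSE {:.0f}".format(cv_rmse))
print("Linear Regression Baseline Test RMSE {:.0f}".format(test_rmse))
```

The two values come out at roughly 33,500 and 29,500, so the baseline's typical error is on the order of 30,000 in `SalePrice` units.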


## 2. ElasticNet Regression

In [3]:
elasticnet_regression = Pipeline([('pca', PCA(.98)), 
                     ('regr', linear_model.ElasticNet())])
estimator = GridSearchCV(elasticnet_regression, cv=5, n_jobs=2, param_grid={}, scoring='neg_mean_squared_error', verbose=1)

estimator.fit(X_train, y_train)
print("ElasticNet Regression Baseline Cross Validation Score {}".format(estimator.best_score_))
predictions = estimator.predict(X_test)
print("ElasticNet Regression Baseline Test Score {}".format(mean_squared_error(y_test, predictions)))

Fitting 5 folds for each of 1 candidates, totalling 5 fits


KeyboardInterrupt: 
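
With `param_grid={}`, `GridSearchCV` only cross-validates the pipeline at its default settings. For the later tuning step, the parameters of a pipeline step are addressed as `<step name>__<parameter>`. The sketch below shows that naming on the ElasticNet pipeline; the grid values, the `max_iter` setting, and the synthetic data are illustrative assumptions, not the settings used in this project, and `n_jobs` is left at its default so the search runs in a single process.

```python
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=15, noise=5.0, random_state=1)

pipeline = Pipeline([('pca', PCA(.98)),
                     ('regr', linear_model.ElasticNet(max_iter=10000))])

# Keys follow the '<step>__<parameter>' convention; the values are arbitrary examples.
param_grid = {
    'regr__alpha': [0.01, 0.1, 1.0],
    'regr__l1_ratio': [0.2, 0.5, 0.8],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

print("Best parameters {}".format(search.best_params_))
print("Best Cross Validation Score {}".format(search.best_score_))
```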