# Choosing a Machine Learning Algorithm

First, we compare the baseline performance of several ML algorithms on our training data, so that we can then tune the best-performing model.

We will test the following algorithms:
1. Lasso
2. ElasticNet
3. Linear Regression
4. Ridge Regression
5. SVR (linear kernel)
6. SVR (RBF kernel)
7. Ensemble regressor (RandomForest)

## Load Data

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv('/Users/Tomas/Desktop/Kaggle-House-Prices-Challenge/data/final_train.csv', index_col=0)
prices = data['SalePrice']
data = data.drop(['SalePrice'], axis=1)


# expand the features to all degree-2 polynomial and interaction terms
poly = PolynomialFeatures(2)
data = poly.fit_transform(data)

In [20]:
print(data.shape)

(1460, 36856)
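
The column count above follows from how a degree-2 polynomial expansion grows. As a quick sanity check, here is a sketch using only the standard library (`n_poly_features` is a hypothetical helper, not part of scikit-learn). It assumes the default `PolynomialFeatures` settings, where the bias column and all interaction terms are kept, so the number of terms of degree at most d in n variables is C(n + d, d):

```python
from math import comb

def n_poly_features(n_inputs, degree=2):
    """Number of polynomial terms of total degree <= `degree` in `n_inputs`
    variables, counting the constant (bias) term."""
    return comb(n_inputs + degree, degree)

# 270 input columns expand to the 36856 columns printed above.
print(n_poly_features(270))
```

With 1460 rows and 36856 columns there are far more features than samples, which is presumably why every pipeline below starts with a PCA step.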


In [21]:
# split into train (60%), validation (20%), and test (20%)
X_train, X_test, y_train, y_test = train_test_split(data, prices, test_size=0.4, random_state=1)
X_validation, X_test, y_validation, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(X_validation.shape)
print(y_validation.shape)

(876, 36856)
(876,)
(292, 36856)
(292,)
(292, 36856)
(292,)
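
The shapes above can be reproduced by hand. The sketch below assumes that a float `test_size` is rounded up to a whole number of rows and that the training part takes the remainder, which is consistent with the printed shapes; `split_sizes` is a hypothetical helper used only for illustration.

```python
from math import ceil

def split_sizes(n_samples, holdout=0.4, test_share=0.5):
    """Sizes of (train, validation, test) for a two-stage split.

    Stage 1 holds out `holdout` of all rows; stage 2 gives `test_share`
    of the held-out rows to the test set and the rest to validation.
    """
    n_holdout = ceil(holdout * n_samples)
    n_test = ceil(test_share * n_holdout)
    n_validation = n_holdout - n_test
    return n_samples - n_holdout, n_validation, n_test

print(split_sizes(1460))
```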


The goal is to find the model with the best baseline performance on the validation set; we can then tune that model's hyperparameters with a grid search.
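
The tuning step mentioned above could look roughly like the following. This is a minimal sketch, not the notebook's actual tuning code: the data is synthetic (`X_demo`, `y_demo` are stand-ins for `X_train`, `y_train`), and the grid values are arbitrary illustrations. It relies on scikit-learn's `GridSearchCV`, which addresses pipeline parameters as `<step name>__<parameter>`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; replace with X_train / y_train from the split above.
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(120, 10))
y_demo = X_demo @ rng.normal(size=10) + rng.normal(scale=0.1, size=120)

pipe = Pipeline([('pca', PCA(.95)),
                 ('regr', ElasticNet())])

# Illustrative grid only; the step name 'regr' prefixes each parameter.
param_grid = {
    'regr__alpha': [0.01, 0.1, 1.0],
    'regr__l1_ratio': [0.2, 0.5, 0.8],
}

search = GridSearchCV(pipe, param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X_demo, y_demo)

best_rmse = float(np.sqrt(-search.best_score_))  # root of the mean CV MSE
print(search.best_params_)
print(best_rmse)
```

Because the search is cross-validated, it scores each setting on several folds rather than on a single 292-row validation split.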

## Lasso 

In [23]:
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from math import sqrt

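# PCA keeps enough components to explain 95% of the variance, then fits Lasso
lasso = Pipeline([('pca', PCA(.95)),
                  ('regr', linear_model.Lasso())])
lasso.fit(X_train, y_train)
predictions = lasso.predict(X_validation)
# RMSE on the validation set
print(sqrt(mean_squared_error(y_validation, predictions)))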

29596.85147428503


## ElasticNet

In [24]:
elasticnet = Pipeline([('pca', PCA(.95)), 
                      ('regr', linear_model.ElasticNet())])
elasticnet.fit(X_train, y_train)
ELPredictions = elasticnet.predict(X_validation)
print(sqrt(mean_squared_error(y_validation, ELPredictions)))

30084.579153483595


## Linear Regression

In [25]:
linearreg = Pipeline([('pca', PCA(.95)), 
                      ('regr', linear_model.LinearRegression())])
linearreg.fit(X_train, y_train)
p = linearreg.predict(X_validation)
print(sqrt(mean_squared_error(y_validation, p)))

29595.506114194606


## Ridge Regression

In [26]:
ridge = Pipeline([('pca', PCA(.95)), 
                     ('regr', linear_model.Ridge())])
ridge.fit(X_train, y_train)
predictions = ridge.predict(X_validation)
print(sqrt(mean_squared_error(y_validation, predictions)))

29596.128643553104
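
Lasso, Linear Regression, and Ridge score within about 1.5 of each other on the validation set. One possible contributor, which is an assumption and not something this notebook verifies, is that PCA runs on unscaled polynomial features, so `PCA(.95)` may keep only the directions with the largest raw magnitude. The sketch below shows that effect on synthetic data (`X_demo` and `y_demo` are made up) by comparing the pipeline with and without a `StandardScaler` step:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
# Two informative columns on very different scales (synthetic stand-in data).
X_demo = np.column_stack([rng.normal(scale=1000.0, size=200),
                          rng.normal(scale=1.0, size=200)])
y_demo = 0.01 * X_demo[:, 0] + 5.0 * X_demo[:, 1]

unscaled = Pipeline([('pca', PCA(.95)),
                     ('regr', Ridge())])
scaled = Pipeline([('scale', StandardScaler()),
                   ('pca', PCA(.95)),
                   ('regr', Ridge())])

unscaled.fit(X_demo, y_demo)
scaled.fit(X_demo, y_demo)

# Without scaling, the large-magnitude column alone explains more than 95%
# of the variance, so PCA keeps one component and the second column is lost.
n_unscaled = unscaled.named_steps['pca'].n_components_
n_scaled = scaled.named_steps['pca'].n_components_
print(n_unscaled, n_scaled)
```

Whether scaling helps on the real data would need to be checked against the validation set.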


## Support Vector Machines

In [27]:
from sklearn import svm

linear_svr = Pipeline([('pca', PCA(.95)), 
                      ('regr', svm.SVR(kernel='linear'))])
linear_svr.fit(X_train, y_train)
predictions = linear_svr.predict(X_validation)
print(sqrt(mean_squared_error(y_validation, predictions)))

49844.468014612044


In [28]:
rbf = Pipeline([('pca', PCA(.95)), 
                      ('regr', svm.SVR(kernel='rbf'))])
rbf.fit(X_train, y_train)
predictions = rbf.predict(X_validation)
print(sqrt(mean_squared_error(y_validation, predictions)))

88698.54847264643
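
The RBF SVR has the worst validation score by a wide margin. A plausible explanation, stated here as an assumption, is that the default `C=1.0` limits how far predictions can move from the intercept, which matters when the target is in the hundreds of thousands. The sketch below uses synthetic data with a price-like target and compares a plain SVR with one wrapped in scikit-learn's `TransformedTargetRegressor`, which standardizes `y` for fitting and inverts the transform on prediction.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.RandomState(1)
# Synthetic stand-in: a target on a price-like scale.
X_demo = rng.normal(size=(200, 3))
y_demo = 180000 + 40000 * X_demo[:, 0] + rng.normal(scale=1000, size=200)

plain = Pipeline([('scale', StandardScaler()),
                  ('regr', SVR(kernel='rbf'))])

scaled_target = TransformedTargetRegressor(
    regressor=Pipeline([('scale', StandardScaler()),
                        ('regr', SVR(kernel='rbf'))]),
    transformer=StandardScaler())  # applied to y only

# Fit on the first 150 rows, evaluate on the remaining 50.
plain.fit(X_demo[:150], y_demo[:150])
scaled_target.fit(X_demo[:150], y_demo[:150])

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_plain = rmse(y_demo[150:], plain.predict(X_demo[150:]))
rmse_scaled = rmse(y_demo[150:], scaled_target.predict(X_demo[150:]))
print(rmse_plain, rmse_scaled)
```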


## Random Forest

In [29]:
from sklearn.ensemble import RandomForestRegressor

rf = Pipeline([('pca', PCA(.95)), 
                      ('regr', RandomForestRegressor(n_jobs=-1))])

rf.fit(X_train, y_train)
predictions = rf.predict(X_validation)
print(sqrt(mean_squared_error(y_validation, predictions)))

39012.94576453533


# Testing

In [30]:
print("ElasticNet Performance {}".format(sqrt(mean_squared_error(y_test, elasticnet.predict(X_test)))))
print("Lasso Performance {}".format(sqrt(mean_squared_error(y_test, lasso.predict(X_test)))))
print("RandomForest Performance {}".format(sqrt(mean_squared_error(y_test, rf.predict(X_test)))))
print("Linear Regression Performance {}".format(sqrt(mean_squared_error(y_test, linearreg.predict(X_test)))))
print("Ridge Regression Performance {}".format(sqrt(mean_squared_error(y_test, ridge.predict(X_test)))))
print("Linear SVR Performance {}".format(sqrt(mean_squared_error(y_test, linear_svr.predict(X_test)))))

ElasticNet Performance 23318.062591916772
Lasso Performance 24138.131028289234
RandomForest Performance 29874.55033616314
Linear Regression Performance 24140.469957673748
Ridge Regression Performance 24137.856658127275
Linear SVR Performance 32220.19637946001
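
The printout above lists six of the seven pipelines; the RBF SVR is missing. Keeping the fitted models in a dictionary and looping over it makes it harder to skip one. The following is a sketch with hypothetical names (`rank_models`, `ConstantModel`); in the notebook, the dictionary would map names to the fitted pipelines such as `lasso` and `rbf`.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rank_models(models, X, y):
    """Return (name, RMSE) pairs sorted from best to worst.

    `models` maps a display name to any fitted object with a `predict` method.
    """
    scores = {name: rmse(y, model.predict(X)) for name, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1])

class ConstantModel:
    """Stand-in for a fitted pipeline, used only to make this sketch runnable."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]

demo_models = {'low': ConstantModel(1.0), 'close': ConstantModel(2.9)}
ranking = rank_models(demo_models, X=[[0], [0]], y=[3.0, 3.0])
for name, score in ranking:
    print("{} Performance {}".format(name, score))
```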


# Submissions

In [35]:
test = pd.read_csv('/Users/Tomas/Desktop/Kaggle-House-Prices-Challenge/data/final_test.csv', index_col=0)
indices = test.index
test = poly.transform(test)

In [36]:
print(test.shape)

(1459, 36856)


In [37]:
predictions = elasticnet.predict(test)

In [38]:
submision = pd.DataFrame(predictions, index=indices)
submision

Id,0
1461,102560.328869
1462,144296.688841
1463,175885.109505
1464,189457.465256
1465,180300.042021
1466,175108.469820
1467,179549.838853
1468,156741.363932
1469,194478.153921
1470,123602.024415


In [39]:
submision.to_csv('/Users/Tomas/Desktop/Kaggle-House-Prices-Challenge/data/poly_elastic_net.csv')
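
The frame written above has a single prediction column named `0`. Assuming the competition expects the header `Id,SalePrice`, as in its sample submission file, the column can be named when the frame is built. This sketch uses made-up stand-ins (`demo_ids`, `demo_predictions`) for `indices` and `predictions`, and writes to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Hypothetical stand-ins for `indices` and `predictions` from the cells above.
demo_ids = pd.Index([1461, 1462, 1463], name='Id')
demo_predictions = [102560.33, 144296.69, 175885.11]

# Naming the column gives the CSV a 'SalePrice' header instead of '0'.
submission = pd.DataFrame({'SalePrice': demo_predictions}, index=demo_ids)

buffer = io.StringIO()
submission.to_csv(buffer)  # the named index becomes the leading 'Id' column
header = buffer.getvalue().splitlines()[0]
print(header)
```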

In [40]:
elasticnet

Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('regr', ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False))])