#### UPDATES:
<b>2/1/22</b>

Changed the XGB model to include hyperparameter tuning.

<b>2/4/22</b>

Added new features to the model: OverallCond, Neighborhood, OutsideArea (Feature Engineering), TotalBathrooms (Feature Engineering), and SFRatio (Feature Engineering). Resulted in a lower MAE (17466.63).

<b>3/1/22</b>

-Modified all visualization plots (except for boxplots). Countplots now show the exact frequency values.

-Data is now tested on 8 different models (see predicting values).

-Added GarageFinish as a feature for models, and changed categorical encoding to MEstimateEncoder to handle high-cardinality columns better.

-New MAE score: 16756.02 (For XGB)

<b>4/25/22</b>

-Reconstructed the hyperparameter model

-Visualizations using ordinal data are now sorted in order (still need to fix MSSubClass)

-Added a function that counts possible combinations of hyperparameters

-New RMSLE score: 0.1345

### Intro

For this competition, I will do some exploratory data analysis and use various models to create predictions for the Ames Housing dataset. I will start with visualizations of significant aspects of the dataset, then do some data cleaning, and finally build the models for my predictions. Please note that there will be various versions of this notebook as I work toward better results and a lower error (MAE and RMSLE).

To start, let's import the needed packages.

In [None]:
#Import our packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.svm import SVR
from xgboost import XGBRegressor

from category_encoders import MEstimateEncoder

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data Exploration

Here, the data is split into two datasets: Train and Test. We will leave those aside and make a new dataset with the two concatenated together.

In [None]:
path = '../input/house-prices-advanced-regression-techniques/'
train_file = 'train.csv'
test_file = 'test.csv'

pd.set_option('display.max_columns', None) # Show all columns of the dataset
pd.set_option('display.float_format', '{:.2f}'.format)

train = pd.read_csv(path + train_file, dtype={'MSSubClass': 'object'})
test = pd.read_csv(path + test_file, dtype={'MSSubClass': 'object'})

df = pd.concat([train,test])
df
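As a small aside, here is a sketch of what concatenation does to the row index, using two made-up frames (`a` and `b` are illustrative stand-ins, not the competition data). By default `pd.concat` keeps each frame's own index, so labels repeat, while `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Two toy frames standing in for train and test (hypothetical values)
a = pd.DataFrame({'Id': [1, 2], 'SalePrice': [200000.0, 150000.0]})
b = pd.DataFrame({'Id': [3, 4]})  # no SalePrice column, like the test set

stacked = pd.concat([a, b])                        # keeps the original labels
renumbered = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
# Rows that came from b have no price, so SalePrice is NaN there
print(renumbered['SalePrice'].isnull().sum())
```

The repeated labels are harmless for the plots in this notebook, but they matter as soon as rows are selected by label.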

In [None]:
df.shape

In [None]:
print(df.describe())

In [None]:
df.info()

Glancing at the information for this dataset, there appear to be a lot of missing values. We will deal with this later.

In [None]:
for col in df.select_dtypes('object'):
    print('Unique values of %s: %d' % (col, df[col].nunique()))
    print(df[col].unique())

We can compute a correlation matrix to see which variables are most strongly correlated. I have also included the correlations with SalePrice alone, just to get an idea for the model.

In [None]:
plt.figure(figsize=(16,16))
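#Restrict to numeric columns; recent pandas versions raise an error on text columns
sns.heatmap(df.select_dtypes('number').corr(), cmap='ocean', annot=True, fmt='.2f', annot_kws={'size': 8});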

In [None]:
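#Restrict to numeric columns; recent pandas versions raise an error on text columns
df.select_dtypes('number').corrwith(df.SalePrice).sort_values(ascending=False)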

We can see that GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, and YearRemodAdd have the highest correlations with SalePrice.
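The ranking step can be sketched on a toy frame. The values below are made up purely for illustration; only the column names are borrowed from the dataset. Text columns are dropped first, then the remaining columns are ranked by the absolute value of their correlation with the target.

```python
import pandas as pd

# Made-up values; 'Street' is text and must be excluded before correlating
toy = pd.DataFrame({
    'GrLivArea': [1000, 1500, 2000, 2500],
    'YrSold':    [2008, 2006, 2009, 2007],
    'Street':    ['Pave', 'Pave', 'Grvl', 'Pave'],
    'SalePrice': [100000, 150000, 200000, 250000],
})

numeric = toy.select_dtypes('number')
corr = numeric.corrwith(numeric['SalePrice']).drop('SalePrice')

# Rank by strength of the linear relationship, regardless of sign
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value keeps strongly negative correlations from being pushed to the bottom of the list.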

## Visualization

We will now start plotting visualizations with the data. Before we begin, I will customize the graphs for a better viewing experience.

In [None]:
sns.set_style('darkgrid')
plt.rc('axes', labelweight='bold', titlesize=14, titleweight='bold')

In [None]:
def create_countplot(name, xlabel, x=None, y=None, figsize=None):
    if figsize is not None:
        plt.figure(figsize=figsize)

    #The plotted column is whichever of x or y was supplied
    column = x if y is None else y
    values = df[column].value_counts()
    ordinal_cols = ['MSSubClass','MoSold','YrSold','YearBuilt','YearRemodAdd','OverallQual',
                    'OverallCond','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd']
    #Ordinal columns are sorted by value instead of by frequency
    if column in ordinal_cols:
        values = values.sort_index()

    ax = sns.countplot(x=x, y=y, data=df, order=values.index)
    ax.bar_label(container=ax.containers[0], labels=values.values, fontsize=13)
    plt.title('Frequencies of ' + name)
    #For horizontal plots (y given), the categories are on the y-axis
    if y is None:
        plt.xlabel(xlabel)
        plt.ylabel('Count')
    else:
        plt.xlabel('Count')
        plt.ylabel(xlabel)
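One caveat with sorting the index: MSSubClass is read as text in this notebook, so its codes sort alphabetically rather than numerically. The sketch below shows the difference on a few hypothetical codes, along with one possible fix that passes a `key` to `sort_index`. This is an illustration under those assumptions, not a change to the function above.

```python
import pandas as pd

# Hypothetical class codes stored as strings, as MSSubClass is in this notebook
classes = pd.Series(['20', '120', '60', '20', '60', '20'])
counts = classes.value_counts()

# Plain sort_index compares strings, so '120' comes before '20'
lexicographic = counts.sort_index()

# Converting each label to int for the comparison gives numeric order
numeric = counts.sort_index(key=lambda idx: idx.map(int))

print(lexicographic.index.tolist())
print(numeric.index.tolist())
```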

Now that we have gotten that out of the way, I will plot the counts on some categories using Seaborn's countplot.

In [None]:
create_countplot('Sale Conditions', 'Condition', x='SaleCondition');

In [None]:
create_countplot('Sale Types', 'Type', x='SaleType');

In [None]:
create_countplot('Neighborhoods', 'Neighborhood', y='Neighborhood', figsize=(16,6))

It can be concluded that North Ames is the biggest neighborhood in terms of housing, and College Creek is the second biggest.

In [None]:
create_countplot('Building Types', 'Type', x='BldgType')

In [None]:
create_countplot('House Styles', 'House Style', x='HouseStyle')

In [None]:
create_countplot('Sales By Month', 'Month', x='MoSold')

In [None]:
create_countplot('Sales By Year', 'Year', x='YrSold')

2007-2009 were years where the most purchases were made.

In [None]:
create_countplot('Class Counts', 'Class', x='MSSubClass', figsize=(7,5))

The plot shows that class 20 (one-story homes built in 1946 or newer, all styles) is the most common type of home in Ames, which is not surprising given how broad that category is. Two-story homes from 1946 or newer, all styles, are the second most common type of home in the town.

In [None]:
create_countplot('Overall Quality', 'Quality', x='OverallQual')

In [None]:
create_countplot('Overall Condition', 'Condition', x='OverallCond')

From both graphs, a lot of the houses have been rated average in terms of both quality and condition. This would make sense given that many of the houses are decades old, which would mean that their value has depreciated and some of them have become run down over the years.

This is a visualization of the median sale price of houses, grouped by the decade in which they were built.

In [None]:
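#Median sale price of houses built in the half-open range [start, end)
#(start/end are used as names to avoid shadowing the built-in min and max)
def get_decade_median(start, end):
    return df.SalePrice[(df.YearBuilt >= start) & (df.YearBuilt < end)].median()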

#This list will be used to obtain the median housing price for each decade
decade = [str(i) + 's' for i in range(min(df.YearBuilt)-2,2020,10)]
med = [get_decade_median(i,i+10) for i in range(min(df.YearBuilt)-2,2020,10)]

medplot = sns.lineplot(x=decade, y=med, marker='o')
plt.title('Medians For Each Decade')
plt.draw()
plt.xlabel('Decade')
plt.ylabel('Median House Price')
plt.xticks(rotation=35)
new_ticks = [str(int(x//1000)) + 'K' for x in medplot.axes.get_yticks()]
medplot.axes.set_yticklabels(new_ticks);
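The same per-decade medians can also be obtained with a `groupby`, which avoids building the decade list by hand. The sketch below uses a handful of made-up rows; integer division by 10 maps each build year to the first year of its decade.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'YearBuilt': [1872, 1879, 1965, 1968, 2003, 2009],
    'SalePrice': [100000, 120000, 140000, 160000, 220000, 260000],
})

# 1872 -> 1870, 1968 -> 1960, and so on
decade = (toy['YearBuilt'] // 10) * 10
medians = toy.groupby(decade)['SalePrice'].median()

# Turn 1870 into the label '1870s' for plotting
medians.index = [f'{d}s' for d in medians.index]
print(medians)
```

Unlike the range-based version, this only produces decades that actually contain houses, so empty decades do not show up as gaps.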

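From this graph, we can see that median prices rise gradually for homes built from the 1970s onward and jump sharply for homes built in the 2010s. Keep in mind that the x-axis is the decade of construction, not the decade of sale: every sale in this dataset took place between 2006 and 2010, so the trend mainly shows that newer homes sell for more, rather than the effect of inflation or the economy over time.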

Here are some contingency tables, starting with the number of houses sold in each month of each year. The data starts in 2006 and runs up to the latest year in the dataset.

In [None]:
def create_heatmap(x, y, xlabel, ylabel, title, figsize=(10,6)):
    if figsize is not None:
        plt.figure(figsize=figsize)
    
    sns.heatmap(pd.crosstab(df[x],df[y]), annot=True, fmt='g', annot_kws={'size': 14})
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

In [None]:
create_heatmap('MoSold', 'YrSold', 'Year', 'Month', 'Houses Sold on a Monthly Basis');
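For reference, here is what `pd.crosstab` produces on a few made-up sales. The first argument becomes the rows (the y-axis of the heatmap) and the second becomes the columns (the x-axis).

```python
import pandas as pd

# Five made-up sales: month and year sold
toy = pd.DataFrame({
    'MoSold': [6, 6, 7, 7, 7],
    'YrSold': [2007, 2008, 2007, 2007, 2008],
})

# Rows are months, columns are years, cells are counts of sales
table = pd.crosstab(toy['MoSold'], toy['YrSold'])
print(table)
```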

From the table, we can see that a lot of houses were sold during the summers of 2007-2009. Sales then appear to drop sharply in 2010, but July 2010 is the last month recorded, so that year is incomplete.

In [None]:
#crosstab puts the first column (quality) on the rows, so it labels the y-axis
create_heatmap('OverallQual', 'OverallCond', 'Condition', 'Quality', 'Houses Sold on Condition and Quality');

This suggests that most of the houses are in average condition but of above-average quality.

We can use histograms to see the common ranges of house prices, Lot Areas, Living Areas, 1st Floor Areas, 2nd floor Areas, and Basement areas.

In [None]:
hist_params = {'kde': True, 'bins': 50}

def create_histplot(col, title, xlabel, figsize=(10,6), **hist_params):
    
    #Set size
    if figsize is not None:
        plt.figure(figsize=figsize)
    
    #Create histogram
    s = sns.histplot(df[col], **hist_params)
    plt.title(title)
    plt.xlabel(xlabel)
    
    #Modify ticks of x-axis
    new_ticks = [str(int(x//1000)) + 'K' for x in s.axes.get_xticks()]
    new_ticks[1] = 0
    s.axes.set_xticklabels(new_ticks)
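An alternative to rewriting the tick labels by hand is a small formatting function. The helper below, `thousands_label`, is a hypothetical name introduced here for illustration. It takes the `(value, pos)` arguments that matplotlib tick formatters receive, so it could be wrapped in `matplotlib.ticker.FuncFormatter` and passed to `ax.xaxis.set_major_formatter`.

```python
def thousands_label(value, pos=None):
    """Format an axis value such as 150000 as '150K'; zero stays '0'."""
    if value == 0:
        return '0'
    return f'{int(value // 1000)}K'


# Quick check on a few typical tick values
for tick in [0, 2500.0, 150000, 700000.0]:
    print(tick, '->', thousands_label(tick))
```

Because the formatter is applied whenever the axis is drawn, it does not depend on the position of the zero tick in the tick list.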

##### Price

In [None]:
create_histplot('SalePrice', 'Price Counts', 'Price', **hist_params);

From this histogram, we can see that prices are roughly bell-shaped but skewed to the right, and that most of the houses cost around $150,000.
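How close a distribution like this is to normal can also be checked numerically with the sample skewness, which is near 0 for symmetric data. The sketch below uses synthetic, log-normally distributed "prices" (not the competition data) to show how a log transform reduces right skew.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices, generated only for illustration
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12, scale=0.4, size=1000)))

raw_skew = prices.skew()            # positive for a long right tail
log_skew = np.log1p(prices).skew()  # much closer to 0 after the log

print(f'skew of raw prices: {raw_skew:.2f}')
print(f'skew of log prices: {log_skew:.2f}')
```

This is also one reason the competition metric, RMSLE, works on the logarithm of the price.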

##### Lot Area

In [None]:
create_histplot('LotArea', 'Area Measures', 'Area', **hist_params);

From the graph, the vast majority of lots are under 25,000 ft$^2$, with only a few above that amount. Thus, it can be concluded that spacious lots are a rarity in Ames, Iowa.

##### Area Above Ground

In [None]:
create_histplot('GrLivArea', '', 'Area Abv. Ground', **hist_params);

From the histogram, most of the houses have 1000-2000 ft$^2$ of living area above ground, and there are only a few houses with more space. That's to be expected since Ames is a small town. Also note that the data does not follow a normal distribution.

##### Garage area based on cars and sq. ft
Since the dataset measures garage size both by car capacity and by square footage, I will plot both the count of cars that can fit in a garage and a histogram of the amount of space in each garage.

In [None]:
fig, (ax1,ax2) = plt.subplots(2,figsize=(10,6))

sns.countplot(y='GarageCars', data=df, ax=ax1)
ax1.set_title('Sq. Ft Based on Car Capacity')
ax1.set_xlabel('')
ax1.set_yticklabels([int(t) for t in ax1.axes.get_yticks()])
ax1.set_ylabel('Car Spaces')

sns.histplot(df.GarageArea, kde=True, ax=ax2)
ax2.set_title('Sq. Ft Based on Area')
ax2.set_xlabel('Area')

fig.tight_layout()

Based on the count plot of car capacity, most garages fit two cars, with one-car and three-car garages being the next most common.

Most of the garage areas fall in the 400-600 ft$^{2}$ range, with some being around 300 ft$^{2}$. It should also be noted that this data is not normally distributed, as the distribution plot is not in the shape of a bell curve.

##### Bathrooms
I will do two graphs here: a heatmap of full baths against half baths, and a countplot of the number of half bathrooms in a house.

In [None]:
#crosstab puts the first column (full baths) on the rows, so it labels the y-axis
create_heatmap('FullBath', 'HalfBath', 'Half', 'Full', 'Heatmap of Full and Half Baths', figsize=None)

In [None]:
create_countplot('Baths', 'Half Baths', x='HalfBath')

Shockingly, most of the houses have no half baths.

##### Bedrooms

In [None]:
create_countplot('Bedrooms Above Ground', 'Bedrooms', x='BedroomAbvGr');

The average house has 3 bedrooms. On another note, it is shocking to see an 8-bedroom house in Iowa.

##### Kitchens

In [None]:
create_countplot('Kitchens Above Ground', 'Kitchens', x='KitchenAbvGr')

There is a tiny portion of houses that have two or more kitchens.

In [None]:
create_countplot('Fireplaces', 'Fireplaces', x='Fireplaces');

Surprisingly, most of the houses don't have fireplaces. This could possibly be due to some houses not being reported to have fireplaces, or because those houses are really small and/or have far fewer features.

In [None]:
create_countplot('Total Rooms Above Ground', 'Total Rooms', x='TotRmsAbvGrd');

The average house has around 6 total rooms above ground.

In [None]:
fig, (ax1,ax2) = plt.subplots(1, 2, figsize=(10,6), sharey=True)

sns.regplot(x='OverallQual', y='SalePrice', data=df, ax=ax1, scatter_kws={'alpha': .4})
ax1.set_xlabel('Quality')
ax1.set_xticks(np.arange(1,11))
ax1.set_ylabel('Price')

sns.regplot(x='OverallCond', y='SalePrice', data=df, ax=ax2, scatter_kws={'alpha': .4})
ax2.set_xlabel('Condition')
ax2.set_xticks(np.arange(1,11))
ax2.set_ylabel('')

plt.suptitle('Relation of Quality/Condition to Price', fontweight='bold', fontsize=16);

As seen here, houses with a rating of 1-3 start at the 30K-200K range. After that, not only do the prices gradually increase, but the price range widens. There are some high quality houses that sell for cheap, but that wouldn't always mean that it's high quality. Also, from the correlation plot, we can see that quality and price have a high correlation.

For condition, however, the two variables don't seem to correlate very well. The regression line goes slightly downhill, and the prices are nearly the same for every rating, except for 5. Thus, quality seems to be the more reliable variable.

## Boxplots
Here, I will plot boxplots of significant features against the prices. To be efficient, I created this function to plot boxplots based on the selected column.

In [None]:
def boxplot(col, type_, xlab, rotation=0):
    plt.figure(figsize=(10,6))
    sns.boxplot(x=col, y=df.SalePrice, data=df)
    plt.title('Quantiles of ' + type_)
    plt.xlabel(xlab)
    plt.xticks(rotation=rotation)
    plt.ylabel('Price')

In [None]:
#Change to regression plot
boxplot('OverallCond','Condition','Condition Rating')

In [None]:
boxplot('Neighborhood','Neighborhoods','Neighborhood',rotation=45)

From this boxplot, Northridge Heights and Stone Brook are the most expensive neighborhoods in the town.

In [None]:
boxplot('MSZoning','Zoning','Zoning')

In [None]:
boxplot('Street','Street Types','Street Type')

In [None]:
boxplot('Utilities','Utility Types','Utility Type')

In [None]:
boxplot('LotConfig','Lot Configurations','Lot')

In [None]:
boxplot('BldgType','Building Types','Building Type')

In [None]:
boxplot('HouseStyle','House Styles','House Style')

In [None]:
boxplot('KitchenQual','Kitchen Qualities','Kitchen Quality Rating')

In [None]:
boxplot('GarageType','Garage Types','Garage Type')

In [None]:
boxplot('SaleType','Sale Types','Sale Type')

In [None]:
boxplot('SaleCondition','Sale Conditions','Sale Condition')

## Predicting Values
For this part, I will train the models on the training data only, since the testing data doesn't have any prices. This is my process for building the models and predicting values:
* Clean the data by removing and imputing missing values
* Create useful features to make the model better
* Calculate feature importance to see which aspects have more potential for modeling
* Create multiple models. I will use the following:
    * XGBoost
    * Random Forest Regression
    * Decision Tree Regression
    * Linear Regression
    * Lasso Regression
    * Ridge Regression
    * SVR
    * Linear SVR
* Obtain the MAE and RMSLE of each model using hyperparameter tuning
* Create the predictions
* Create CSV file for submission

#### Dealing with Missing Values

We can use the function below to compute the null values of the training and testing sets separately. The chart will be limited to 20 rows as there are way too many columns in the dataset.

In [None]:
def get_null_info(data):
    amt_of_null_vals = pd.Series([data[col].isnull().sum() for col in data.columns], index=data.columns)
    percentages = pd.Series([data[col].isnull().sum() / len(data) for col in data.columns], index=data.columns)
    null_vals = pd.DataFrame({'missing_values': amt_of_null_vals, 'percentage': percentages})
    return null_vals.sort_values(['missing_values'],ascending=False).head(20)

In [None]:
get_null_info(train)

In [None]:
get_null_info(test)

This is a heatmap of missing values in the data. The light strips mark the values that are missing.

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False);

As seen from the charts and the visualization, Alley, PoolQC, FireplaceQu, and LotFrontage have the most missing values. SalePrice has half of its values missing because this column does not exist in the testing set.

Also, note that the rate of missing values for PoolQC is 100%. This is not true as the values are rounded due to the formatting. The actual value is around 99.7%.
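The rounding effect is easy to reproduce. The sketch below formats a fraction of about 0.997 (the approximate PoolQC rate quoted above) with the same two-decimal float format set earlier in the notebook, and then as a percentage with one decimal.

```python
# Approximate missing-value rate for PoolQC, taken from the text above
fraction_missing = 0.997

# The notebook's display format keeps two decimals, which rounds up to 1.00
two_decimals = '{:.2f}'.format(fraction_missing)

# Formatting as a percentage with one decimal keeps the detail
as_percent = '{:.1%}'.format(fraction_missing)

print(two_decimals)
print(as_percent)
```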

Now it's time to impute our data. A simple function should suffice.

In [None]:
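def impute(data):
    #Assign the filled column back; fillna(inplace=True) on a selected column
    #is not guaranteed to modify the original frame in recent pandas versions
    for name in data.select_dtypes('number'):
        data[name] = data[name].fillna(0)
    for name in data.select_dtypes('object'):
        data[name] = data[name].fillna('NA')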

impute(train)
impute(test)
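To see the effect of this kind of imputation in isolation, here is a sketch on a two-row toy frame. `impute_copy` is a hypothetical variant introduced here that returns a new frame instead of modifying its argument. Numeric gaps become 0 and all other gaps become the string 'NA'.

```python
import numpy as np
import pandas as pd


def impute_copy(data):
    """Return a copy with numeric gaps set to 0 and text gaps set to 'NA'."""
    filled = data.copy()
    num_cols = filled.select_dtypes('number').columns
    other_cols = filled.select_dtypes(exclude='number').columns
    filled[num_cols] = filled[num_cols].fillna(0)
    filled[other_cols] = filled[other_cols].fillna('NA')
    return filled


# Made-up rows: one numeric gap and one text gap
toy = pd.DataFrame({'LotFrontage': [65.0, np.nan], 'Alley': [np.nan, 'Grvl']})
clean = impute_copy(toy)
print(clean)
```

Returning a copy leaves the original frame untouched, which makes it easier to compare the data before and after filling.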

In [None]:
get_null_info(train)

In [None]:
get_null_info(test)

#### Obtain Mutual Information

This is where feature engineering will be performed. The first part will be creating features using mathematical functions of the data, and the next one is measuring importance with mutual_info_regression().

##### Creating Features

In [None]:
train['TotalBathrooms'] = train.BsmtFullBath + train.BsmtHalfBath / 2 + train.FullBath + train.HalfBath / 2
train['OutsideArea'] = train.WoodDeckSF + train.OpenPorchSF + train.EnclosedPorch + train['3SsnPorch'] + train.ScreenPorch + train.PoolArea
train['SFRatio'] = train['2ndFlrSF'] / train['1stFlrSF']

test['TotalBathrooms'] = test.BsmtFullBath + test.BsmtHalfBath / 2 + test.FullBath + test.HalfBath / 2
test['OutsideArea'] = test.WoodDeckSF + test.OpenPorchSF + test.EnclosedPorch + test['3SsnPorch'] + test.ScreenPorch + test.PoolArea
test['SFRatio'] = test['2ndFlrSF'] / test['1stFlrSF']

The three new features are the total amount of bathrooms in a house, the outside area of a house, and the ratio between the area of the second floor and the area of the first floor.
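Since the same three features are built for both train and test, the logic could be wrapped in a single function. The sketch below does that on a one-row toy frame. `add_features` is a hypothetical helper name, and the values are made up.

```python
import pandas as pd


def add_features(data):
    """Return a copy of data with the three engineered columns added."""
    out = data.copy()
    # Half baths count as half a bathroom
    out['TotalBathrooms'] = (out.BsmtFullBath + out.BsmtHalfBath / 2
                             + out.FullBath + out.HalfBath / 2)
    # Sum of all deck, porch and pool areas
    outside_cols = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
                    '3SsnPorch', 'ScreenPorch', 'PoolArea']
    out['OutsideArea'] = out[outside_cols].sum(axis=1)
    # Second floor area relative to first floor area
    out['SFRatio'] = out['2ndFlrSF'] / out['1stFlrSF']
    return out


toy = pd.DataFrame({
    'BsmtFullBath': [1], 'BsmtHalfBath': [0], 'FullBath': [2], 'HalfBath': [1],
    'WoodDeckSF': [100], 'OpenPorchSF': [50], 'EnclosedPorch': [0],
    '3SsnPorch': [0], 'ScreenPorch': [0], 'PoolArea': [0],
    '1stFlrSF': [1000], '2ndFlrSF': [500],
})
features = add_features(toy)
print(features[['TotalBathrooms', 'OutsideArea', 'SFRatio']])
```

With such a helper, applying the same transformation to train and test is a single call each, which reduces the risk of the two copies drifting apart.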

##### Feature Importance

In [None]:
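#Separate the target variable from the data
#(pop first so that the copy used as X no longer contains SalePrice)
y = train.pop('SalePrice')
X = train.copy()

#Create Mutual Information
def mi_scores(X, y):
    #Label-encode text columns so that mutual_info_regression accepts them
    for col in X.select_dtypes(['object','category']):
        X[col], _ = X[col].factorize()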
    
    scores = mutual_info_regression(X, y)
    scores = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return scores

scores = mi_scores(X, y)
scores[:20]

It can be noted that TotalBathrooms scores higher than FullBath, so at least one of the new features is working.
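For reference, this is what `factorize` does to a text column before the scores are computed. Each distinct value gets an integer code in order of first appearance, and missing values get -1. The values below are made up; in this notebook the gaps have already been filled, so -1 should not occur.

```python
import pandas as pd

# Made-up street types, including one missing value
street = pd.Series(['Pave', 'Grvl', 'Pave', None])

codes, uniques = street.factorize()
print(codes)    # integer code per row
print(uniques)  # the distinct values, in order of first appearance
```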

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x=scores.values[:20], y=scores.index[:20])
plt.title('Mutual Information', fontsize=14)
plt.xlabel('Importance')
plt.ylabel('Column');

#### Model Building

As mentioned before, each of the models listed above will be tested.

In [None]:
#Features we will use for our model
feats = ['OverallQual','OverallCond','GrLivArea','YearBuilt','GarageArea',
         'Neighborhood','MSSubClass','TotalBathrooms','OutsideArea', 'SFRatio']

X_train = train[feats]
X_test = test[feats]
y_train = y
num_cols = list(X_train.select_dtypes(exclude=['object']))

#Two of our features are categorical, so they must be encoded
cat_cols = list(X_train.select_dtypes('object'))
cat_cols

In [None]:
# ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
# ohe_cols_train = pd.DataFrame(ohe.fit_transform(X_train[cat_cols]))
# ohe_cols_test = pd.DataFrame(ohe.transform(X_test[cat_cols]))

# ohe_cols_train.index = X_train.index
# ohe_cols_test.index = X_test.index

# num_X_train = X_train.drop(cat_cols, axis=1)
# num_X_test = X_test.drop(cat_cols, axis=1)

# X_train_encoded = pd.concat([num_X_train,ohe_cols_train], axis=1)
# X_test_encoded = pd.concat([num_X_test,ohe_cols_test], axis=1)

In [None]:
X_train

In [None]:
#Pass the columns by keyword; the first positional parameter is not the column list
encoder = MEstimateEncoder(cols=cat_cols, m=5).fit(X_train,y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
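As I understand it, the M-estimate encoding replaces each category with a blend of that category's mean target and the overall mean target, where `m` controls how strongly small categories are pulled toward the overall mean. The sketch below implements that formula directly in pandas on made-up data. `m_estimate` is a hypothetical helper, not the library's code.

```python
import pandas as pd


def m_estimate(categories, target, m=5):
    """Encode each category as (sum of target + prior * m) / (count + m)."""
    prior = target.mean()  # overall mean of the target
    stats = target.groupby(categories).agg(['sum', 'count'])
    encoding = (stats['sum'] + prior * m) / (stats['count'] + m)
    return categories.map(encoding)


# Made-up neighborhoods and prices
hood = pd.Series(['A', 'A', 'A', 'B'])
price = pd.Series([100.0, 200.0, 300.0, 400.0])

encoded = m_estimate(hood, price, m=5)
print(encoded.tolist())
```

Category B has a single house priced at 400, but its encoding is pulled most of the way back toward the overall mean of 250, which is the point of the smoothing for rare categories.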

In [None]:
def getCombinationAmount(params):
    combinations = 1
    for k, v in params.items():
        print('Options for {}: {}'.format(k,len(v)))
        combinations *= len(v)
        
    print('Possible combinations:', combinations)
    
def get_model_info(model, params, scoring='neg_mean_absolute_error', scale=True):
    X = X_train.copy()
    if scale:
        #Keep the scaled values (previously the result was discarded);
        #the returned estimator then expects scaled inputs
        X[num_cols] = MinMaxScaler().fit_transform(X[num_cols])

    gs = GridSearchCV(model, params, scoring=scoring)
    gs.fit(X,y_train)
    #sklearn negates error metrics, so flip the sign back
    best_score = -gs.best_score_ if 'neg' in scoring else gs.best_score_
    return best_score, gs.best_estimator_, gs.best_params_
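When scaling is used, the usual pattern is to learn the minimum and maximum from the training data only and reuse them on the test data. Below is a minimal NumPy sketch of that idea with made-up numbers. It mirrors what fitting a `MinMaxScaler` on the training set and then calling `transform` on the test set would compute.

```python
import numpy as np

# Made-up living areas: three training rows and two test rows
train_vals = np.array([[1000.0], [1500.0], [2000.0]])
test_vals = np.array([[1250.0], [2500.0]])

# Learn the range from the training data only
lo = train_vals.min(axis=0)
hi = train_vals.max(axis=0)

train_scaled = (train_vals - lo) / (hi - lo)
# Reuse the same range on the test data; values may fall outside [0, 1]
test_scaled = (test_vals - lo) / (hi - lo)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting a second scaler on the test set instead would map the two sets onto different scales, so the same house would get different feature values depending on which file it came from.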

In [None]:
param_grid_xgb = {'n_estimators': [10,20,30,50,100,500,1000], 
                  'learning_rate': [.001,.01,.05,.1,.3], 
                  'max_depth': np.arange(2,7), 
                  #'model__min_child_weight': np.arange(1,11), 
                  #'model__colsample_bytree': np.arange(0.2,1,0.1), 
                  #'model__subsample': np.arange(0.2,1,0.1)
                 }

getCombinationAmount(param_grid_xgb)
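The count is simply the product of the number of options per parameter. The sketch below reproduces it for the XGBoost grid above using only the standard library, and also enumerates the combinations themselves. The number of model fits assumes 5-fold cross-validation, which I believe is the `GridSearchCV` default in recent scikit-learn versions.

```python
from itertools import product
from math import prod

# Same options as param_grid_xgb, written as plain lists
grid = {
    'n_estimators': [10, 20, 30, 50, 100, 500, 1000],
    'learning_rate': [.001, .01, .05, .1, .3],
    'max_depth': [2, 3, 4, 5, 6],
}

# 7 * 5 * 5 combinations
n_combinations = prod(len(options) for options in grid.values())

# Every combination as a dict, in the order the grid is expanded here
all_settings = [dict(zip(grid, combo)) for combo in product(*grid.values())]

# Assumption: 5 cross-validation folds per combination
n_fits = n_combinations * 5

print(n_combinations, n_fits)
```

Seeing the number of fits up front helps explain the long run times noted in the cells below.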

In [None]:
mae_scores = {}
rmsle_scores = {}

##### XGBoost

In [None]:
#this took 23 minutes to run
xgb_mae, xgb_model, xgb_params = get_model_info(XGBRegressor(), param_grid_xgb, scale=False)
xgb_rmsle, xgb_model, xgb_params = get_model_info(XGBRegressor(), param_grid_xgb, 
                                                  scoring='neg_mean_squared_log_error', scale=False)

print('MAE for XGB: {:.4f}'.format(xgb_mae))
print('RMSLE for XGB: {:.4f}'.format(np.sqrt(xgb_rmsle)))
mae_scores['XGB'] = xgb_mae
rmsle_scores['XGB'] = np.sqrt(xgb_rmsle)
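The `neg_mean_squared_log_error` scorer returns the (negated) mean squared log error, which is why a square root is taken above to obtain the RMSLE. For reference, here is a small NumPy sketch of the metric itself. `rmsle` is a helper name introduced for illustration.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared error between log(1 + prediction) and log(1 + truth)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Perfect predictions give 0
print(rmsle([100000, 200000], [100000, 200000]))
# Predicting 199 for a true value of 99 is off by log(200 / 100) = log(2)
print(rmsle([99.0], [199.0]))
```

Because the error is measured on the log scale, being off by a factor of two costs the same for a cheap house as for an expensive one.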

In [None]:
xgb_params

##### Random Forest

In [None]:
param_grid_rf = {'n_estimators': [10,20,30,50,100,500,1000], 
              'max_features': [1.0,'sqrt','log2'], #1.0 = all features ('auto' was removed in newer sklearn)
              'bootstrap': [True,False], 
              'max_depth': np.arange(1,4)}

mae_rf, model_rf, params_rf = get_model_info(RandomForestRegressor(), param_grid_rf, scale=False)
rmsle_rf, model_rf, params_rf = get_model_info(RandomForestRegressor(), param_grid_rf, 
                                               scoring='neg_mean_squared_log_error', scale=False)

print('MAE for Random Forest: {:.4f}'.format(mae_rf))
print('RMSLE for Random Forest: {:.4f}'.format(np.sqrt(rmsle_rf)))

mae_scores['Random Forest'] = mae_rf
rmsle_scores['Random Forest'] = np.sqrt(rmsle_rf)

In [None]:
getCombinationAmount(param_grid_rf)

In [None]:
params_rf

##### Decision Tree

In [None]:
param_grid_dt = {'splitter': ['best','random'], 'max_depth': np.arange(2,7)}
mae_dt, model_dt, params_dt = get_model_info(DecisionTreeRegressor(), param_grid_dt, scale=False)
rmsle_dt, model_dt, params_dt = get_model_info(DecisionTreeRegressor(), param_grid_dt,
                                               scoring='neg_mean_squared_log_error',
                                               scale=False)

print('MAE for Decision Tree: {:.4f}'.format(mae_dt))
print('RMSLE for Decision Tree: {:.4f}'.format(np.sqrt(rmsle_dt)))

mae_scores['Decision Tree'] = mae_dt
rmsle_scores['Decision Tree'] = np.sqrt(rmsle_dt)

In [None]:
getCombinationAmount(param_grid_dt)

In [None]:
params_dt

##### Linear Regression

In [None]:
# mae_linear, model_linear, params_linear = get_model_info(LinearRegression(), {})
# rmsle_linear, model_linear, params_linear = get_model_info(LinearRegression(), {}, scoring='neg_mean_squared_log_error')

# print('MAE for Linear Regression: {:.4f}'.format(mae_linear))
# print('RMSLE for Linear Regression: {:.4f}'.format(np.sqrt(rmsle_linear)))

# mae_scores['Linear'] = mae_linear
# rmsle_scores['Linear'] = np.sqrt(rmsle_linear)

##### LASSO Regression

In [None]:
# mae_lasso, model_lasso, params_lasso = get_model_info(Lasso(), {})
# rmsle_lasso, model_lasso, params_lasso = get_model_info(Lasso(), {}, scoring='neg_mean_squared_log_error')

# print('MAE for Lasso: {:.4f}'.format(mae_lasso))
# print('RMSLE for Lasso: {:.4f}'.format(np.sqrt(rmsle_lasso)))

# mae_scores['Lasso'] = mae_lasso
# rmsle_scores['Lasso'] = np.sqrt(rmsle_lasso)

##### Ridge Regression

In [None]:
# mae_ridge, model_ridge, params_ridge = get_model_info(Ridge(), {})
# rmsle_ridge, model_ridge, params_ridge = get_model_info(Ridge(), {}, scoring='neg_mean_squared_log_error')

# print('MAE for Ridge: {:.4f}'.format(mae_ridge))
# print('RMSLE for Ridge: {:.4f}'.format(np.sqrt(rmsle_ridge)))
# mae_scores['Ridge'] = mae_ridge
# rmsle_scores['Ridge'] = np.sqrt(rmsle_ridge)

##### SVR

In [None]:
param_grid_svr = {'degree': np.arange(2,5), 
                  'C': [0.0001,0.001,0.01,0.1,1,10], 
                  'epsilon': [0.0001,0.001,0.01,0.1]}

mae_svr, model_svr, params_svr = get_model_info(SVR(), param_grid_svr)
rmsle_svr, model_svr, params_svr = get_model_info(SVR(), param_grid_svr, scoring='neg_mean_squared_log_error')

print('MAE for SVR: {:.4f}'.format(mae_svr))
print('RMSLE for SVR: {:.4f}'.format(np.sqrt(rmsle_svr)))

mae_scores['SVR'] = mae_svr
rmsle_scores['SVR'] = np.sqrt(rmsle_svr)

In [None]:
getCombinationAmount(param_grid_svr)

In [None]:
params_svr

##### Linear SVR

In [None]:
#Use a linear kernel explicitly; 'degree' only affects the polynomial kernel, so only C is tuned
param_grid_linear_svr = {'C': [0.0001,0.001,0.01,0.1,1,10]}
mae_linear_svr, model_linear_svr, params_linear_svr = get_model_info(SVR(kernel='linear'), param_grid_linear_svr)
rmsle_linear_svr, model_linear_svr, params_linear_svr = get_model_info(SVR(kernel='linear'), param_grid_linear_svr, 
                                                                       scoring='neg_mean_squared_log_error')

print('MAE for Linear SVR: {:.4f}'.format(mae_linear_svr))
print('RMSLE for Linear SVR: {:.4f}'.format(np.sqrt(rmsle_linear_svr)))

mae_scores['Linear SVR'] = mae_linear_svr
rmsle_scores['Linear SVR'] = np.sqrt(rmsle_linear_svr)

In [None]:
getCombinationAmount(param_grid_linear_svr)

In [None]:
params_linear_svr

##### Total Results

In [None]:
#List results in ascending order
results = pd.DataFrame({'MAE': list(mae_scores.values()), 
                        'RMSLE': list(rmsle_scores.values())}, 
                       index=list(mae_scores.keys()))
results.sort_values('RMSLE')

From this ranking, the XGBoost model would be the best model to use for the data. Although the Ridge and Linear models have nearly the same scores as each other, they are nowhere close to XGBoost. Now we can submit our results.

In [None]:
#Obtain predictions
preds = xgb_model.predict(X_test)
#Use the Id column from the test file, with the column name used by the competition
output = pd.DataFrame({'Id': test['Id'], 'SalePrice': preds})

#Create CSV file for submission!
output.to_csv('amespredictions.csv', index=False)
output
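As a quick sanity check on the file layout, the sketch below writes a tiny made-up submission to an in-memory buffer and inspects the header. As far as I know, the competition's sample submission uses the column names `Id` and `SalePrice`. The ids and prices here are invented for illustration.

```python
import io

import pandas as pd

# Made-up ids and predictions, only to show the layout
ids = [1461, 1462, 1463]
preds = [120000.0, 150000.5, 98000.25]

submission = pd.DataFrame({'Id': ids, 'SalePrice': preds})

# Write to memory instead of disk so that the result can be inspected
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

lines = buffer.getvalue().splitlines()
header = lines[0]
print(header)
print(len(lines) - 1, 'prediction rows')
```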

And that is the end of the project. All feedback is welcome.