# Housing Prices - My Submission

I created this notebook to share my work with my friends over at #kaggle_competitions for the Machine Learning Foundations Scholarship with Microsoft Azure.

### A brief disclaimer before I get started

This is a copy of the work I did in Jupyter notebooks on my local machine using Anaconda 4.8.2 (Python 3.7.6).  Kaggle's versions of the Python libraries differ somewhat from mine, so I can't promise that everything here runs verbatim on Kaggle or reproduces my exact results, but I did give it a run through and it seems to return similar figures.  You can find a copy of my Jupyter notebooks here: https://github.com/cassova/Kaggle-HousingPrices

* [Data Cleanup](#section-data-cleanup)
    - [MSZoning](#subsection-MSZoning)
    - [MSSubClass](#subsection-MSSubClass)
    - [LotFrontage](#subsection-LotFrontage)
    - [LotArea](#subsection-LotArea)
    - [Street and Alley](#subsection-StreetAlley)
    - [Drop unused fields](#subsection-DropFields)
    - [LotConfig](#subsection-LotConfig)
    - [Neighborhood](#subsection-Neighborhood)
    - [NeighborhoodLotArea](#subsection-NeighborhoodLotArea)
    - [BldgType](#subsection-BldgType)
    - [NeighborhoodBldgType](#subsection-NeighborhoodBldgType)
    - [Total Square Feet](#subsection-TotalSquareFeet)
    - [HouseStyle](#subsection-HouseStyle)
    - [NeighborhoodHouseStyle](#subsection-NeighborhoodHouseStyle)
    - [OverallQualNCond](#subsection-OverallQualNCond)
    - [NeighborhoodOverallQualNCond](#subsection-NeighborhoodOverallQualNCond)
    - [Age](#subsection-Age)
    - [Average Sale Price](#subsection-AverageSalePrice)
    - [Exterior](#subsection-Exterior)
    - [Foundation and Basement](#subsection-FondBase)
    - [Heating / Cooling](#subsection-HeatCool)
    - [Floor Sizes and Room Conditions](#subsection-FloorRoom)
    - [Garage](#subsection-Garage)
    - [Other Features](#subsection-Other)
    - [Sale Info](#subsection-SaleInfo)
    - [Results](#subsection-Results)
* [One-Hot Encoding](#section-onehot)
* [Column Selection](#section-columnselect)
* [Training](#section-training)
    - [Linear Regression](#subsection-LinearReg)
    - [Stochastic Gradient Descent (SGD)](#subsection-StoGradDes)
    - [Random Forest Classifier](#subsection-RandFore)
    - [Polynomial Regression](#subsection-PolyReg)
    - [Logistic Regression](#subsection-LogReg)
    - [Gaussian Naive Bayes](#subsection-GNB)
    - [Perceptron](#subsection-Perceptron)
    - [Linear Support Vector Machine](#subsection-SVM)
    - [Decision Tree](#subsection-DesiTree)
    - [Random Forest Regressor](#subsection-RanForReg)
    - [Gradient Boost Regressor](#subsection-GradBooReg)
    - [AdaBoost](#subsection-AdaBoost)
    - [Extremely Randomized Trees](#subsection-ExtRanTree)
    - [Ensemble: VotingRegressor](#subsection-EnsVotReg)
    - [Ensemble: Stacked Generalization](#subsection-EnsStakGen)
    - [XGBoost](#subsection-XGBoost)
    - [Best](#subsection-Best)


<a id="section-data-cleanup"></a>
# Data Cleanup

Here I went through all the columns and either dropped them outright or cleaned them.  I cleaned categorical columns by converting each category to a number, ordered by the category's mean sale price.  I normalized measurements such as number of rooms and area using mean normalization (z-score); others were normalized with min-max scaling.
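
The three transformations described above can be sketched on a toy frame. This is only a minimal illustration, not the code used in the cells below: the helper names `encode_by_mean_price`, `zscore`, and `minmax`, and the toy data itself, are assumptions made up for the example.

```python
import pandas as pd

# Toy data, made up purely for illustration
toy = pd.DataFrame({
    'Zone': ['A', 'B', 'A', 'C', 'B', 'C'],
    'Area': [1000.0, 1500.0, 1200.0, 3000.0, 1700.0, 2600.0],
    'SalePrice': [100000, 150000, 120000, 300000, 170000, 280000],
})

def encode_by_mean_price(df, col, target='SalePrice'):
    """Map each category to its rank (1 = lowest) by mean target value."""
    order = df.groupby(col)[target].mean().sort_values().index
    return df[col].map({cat: rank + 1 for rank, cat in enumerate(order)})

def zscore(s, ref):
    """Mean normalization (z-score) using the statistics of a reference series."""
    return (s - ref.mean()) / ref.std()

def minmax(s, ref):
    """Min-max scaling using the range of a reference series."""
    return (s - ref.min()) / (ref.max() - ref.min())

toy['ZoneRank'] = encode_by_mean_price(toy, 'Zone')
toy['AreaZ'] = zscore(toy['Area'], toy['Area'])
toy['AreaMinMax'] = minmax(toy['Area'], toy['Area'])
print(toy)
```

In the notebook itself the reference series is always the `combined` frame, so that train and test are scaled with the same statistics.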

In [None]:
import numpy as np
import pandas as pd

train_df = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test_df = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")
X_train = train_df.drop(['SalePrice'], axis=1)
y_train = train_df['SalePrice']
train_df

# We will use a "combined" set to look across both the train and test to ensure we normalize our values and make our checks across both data sets.
combined = pd.concat([train_df,test_df])
all_data = [train_df,test_df]

total = combined.isnull().sum().sort_values(ascending=False)
percent_1 = combined.isnull().sum()/combined.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(20)

In [None]:
train_df['SalePrice'].describe()

<a id="subsection-MSZoning"></a>
## MSZoning

In [None]:
combined['MSZoning'].unique()

In [None]:
combined.groupby('MSZoning')['Id'].count()

In [None]:
train_df.groupby('MSZoning')['SalePrice'].describe()

In [None]:
combined[combined['MSZoning'].isnull()]

### Summary
Based on all this, I think we can fill the MSZoning null values (which only appear in the test data set) with the most frequent value, 'RL'. We can also convert the zoning codes to numbers.

In [None]:
MSZoningMap = {'FV': 1, 'RL': 2, 'RH': 3, 'RM': 4, 'C (all)': 5}

for d in all_data:
    d['MSZoning'] = d['MSZoning'].fillna('RL')
    d['MSZoning'] = d['MSZoning'].map(MSZoningMap)

combined['MSZoning'] = combined['MSZoning'].fillna('RL')
combined['MSZoning'] = combined['MSZoning'].map(MSZoningMap)
    
combined['MSZoning'].unique()

In [None]:
all_data[0].groupby('MSZoning')['SalePrice'].describe().sort_values('mean')

<a id="subsection-MSSubClass"></a>
## MSSubClass

In [None]:
subclass_map = train_df.groupby('MSSubClass')['SalePrice'].mean().sort_values()
for d in all_data:
    for idx, s in enumerate(subclass_map.index):
        d.loc[d['MSSubClass'] == s,'MSSubClass'] = idx+1

        
for idx, s in enumerate(subclass_map.index):
    combined.loc[combined['MSSubClass'] == s,'MSSubClass'] = idx+1

train_df.groupby('MSSubClass')['SalePrice'].describe().sort_values('mean')

In [None]:
combined['MSSubClass'].isnull().sum()

### Summary
I don't see much direct correlation between this and sale price, but I'm not sure at this point.  For now, I'll check other features to see whether this might be useful in some way or should be dropped.
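
One way to put a number on that impression is a rank correlation between the encoded column and the sale price. The sketch below runs on made-up values (the toy frame is an assumption, not the competition data) and computes Spearman's rho as the Pearson correlation of the ranks, which needs nothing beyond pandas.

```python
import pandas as pd

# Made-up encoded class values (1 = lowest mean price) and sale prices
toy = pd.DataFrame({
    'MSSubClass': [1, 1, 2, 2, 3, 3, 4, 4],
    'SalePrice': [90000, 140000, 110000, 200000,
                  130000, 180000, 160000, 250000],
})

# Spearman rank correlation is the Pearson correlation of the ranks
rho = toy['MSSubClass'].rank().corr(toy['SalePrice'].rank())
print(f"Spearman rho: {rho:.3f}")
```

Keep in mind that the encoding is itself ordered by mean sale price on the training rows, so a coefficient measured on those same rows will look somewhat better than it would on unseen data.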

<a id="subsection-LotFrontage"></a>
## LotFrontage

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.scatter(y_train, X_train['LotFrontage'])
plt.show()

In [None]:
# Use mean normalization
combined['LotFrontage'] = combined['LotFrontage'].fillna(0)

for d in all_data:
    d['LotFrontage'] = d['LotFrontage'].fillna(0)
    d['LotFrontage'] = (d['LotFrontage']-combined['LotFrontage'].mean())/combined['LotFrontage'].std()
    
    
combined['LotFrontage'] = (combined['LotFrontage']-combined['LotFrontage'].mean())/combined['LotFrontage'].std()

### Summary
Lot frontage is loosely correlated to sale price, so we probably have to add some features to take advantage of this.  For now I've filled the missing values with 0 and normalized the column.

<a id="subsection-LotArea"></a>
## LotArea

In [None]:
all_data[0].head()

In [None]:
plt.plot(y_train, X_train['LotArea'], 'o')
m, b = np.polyfit(y_train, X_train['LotArea'], 1)
plt.plot(y_train, m*y_train + b)

In [None]:
# Use mean normalization
combined['LotArea'] = combined['LotArea'].fillna(0)

for d in all_data:
    d['LotArea'] = d['LotArea'].fillna(0)
    d['LotArea'] = (d['LotArea']-combined['LotArea'].mean())/combined['LotArea'].std()

combined['LotArea'] = (combined['LotArea']-combined['LotArea'].mean())/combined['LotArea'].std()

### Summary
The lot area is correlated to the price so we'll normalize it.

<a id="subsection-StreetAlley"></a>
## Street and Alley

In [None]:
combined['Alley'] = combined['Alley'].fillna('None')

for d in all_data:
    d['Alley'] = d['Alley'].fillna('None')

In [None]:
combined.groupby(['Street', 'Alley']).describe()

In [None]:
all_data[0].groupby('Street')['SalePrice'].describe()

### Summary
I don't think either of these features really contribute at all so I'm going to drop them.

In [None]:
combined.drop(['Street','Alley'], axis=1, inplace=True)

for d in all_data:
    d.drop(['Street','Alley'], axis=1, inplace=True)

<a id="subsection-DropFields"></a>
## Drop unused fields
Here is where I decided which fields I wouldn't use at all.  I did some analysis on some of these elsewhere, but I trusted my gut on the others.

In [None]:
combined.drop(['LotShape','LandContour', 'Utilities', 'LandSlope', \
               'Condition1', 'Condition2', 'YearBuilt', 'RoofStyle', \
               'RoofMatl', 'Heating', 'Electrical', 'Functional', \
               'GarageYrBlt', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', \
               'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', \
               'SaleType'], axis=1, inplace=True)

for d in all_data:
    d.drop(['LotShape','LandContour', 'Utilities', 'LandSlope', \
            'Condition1', 'Condition2', 'YearBuilt', 'RoofStyle', \
            'RoofMatl', 'Heating', 'Electrical', 'Functional', \
            'GarageYrBlt', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', \
            'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal', \
            'SaleType'], axis=1, inplace=True)

<a id="subsection-LotConfig"></a>
## LotConfig

In [None]:
subclass_map = train_df.groupby('LotConfig')['SalePrice'].mean().sort_values()
for d in all_data:
    for idx, s in enumerate(subclass_map.index):
        d.loc[d['LotConfig'] == s,'LotConfig'] = idx+1

        
for idx, s in enumerate(subclass_map.index):
    combined.loc[combined['LotConfig'] == s,'LotConfig'] = idx+1

train_df.groupby('LotConfig')['SalePrice'].describe().sort_values('mean')

<a id="subsection-Neighborhood"></a>
## Neighborhood

In [None]:
subclass_map = train_df.groupby('Neighborhood')['SalePrice'].mean().sort_values()
for d in all_data:
    for idx, s in enumerate(subclass_map.index):
        d.loc[d['Neighborhood'] == s,'Neighborhood'] = idx+1

        
for idx, s in enumerate(subclass_map.index):
    combined.loc[combined['Neighborhood'] == s,'Neighborhood'] = idx+1

train_df.groupby('Neighborhood')['SalePrice'].describe().sort_values('mean')

<a id="subsection-NeighborhoodLotArea"></a>
## NeighborhoodLotArea

In [None]:
# Added new column and used min/max normalization

combined['NeighborhoodLotArea'] = combined['Neighborhood'] * combined['LotArea']

for d in all_data:
    d['NeighborhoodLotArea'] = d['Neighborhood'] * d['LotArea']
    d['NeighborhoodLotArea'] = (d['NeighborhoodLotArea'] - combined['NeighborhoodLotArea'].min()) \
                                      / (combined['NeighborhoodLotArea'].max()-combined['NeighborhoodLotArea'].min())

combined['NeighborhoodLotArea'] = (combined['NeighborhoodLotArea'] - combined['NeighborhoodLotArea'].min()) \
                                  / (combined['NeighborhoodLotArea'].max()-combined['NeighborhoodLotArea'].min())

<a id="subsection-BldgType"></a>
## BldgType

In [None]:
subclass_map = train_df.groupby('BldgType')['SalePrice'].mean().sort_values()
for d in all_data:
    for idx, s in enumerate(subclass_map.index):
        d.loc[d['BldgType'] == s,'BldgType'] = idx+1

        
for idx, s in enumerate(subclass_map.index):
    combined.loc[combined['BldgType'] == s,'BldgType'] = idx+1

train_df.groupby('BldgType')['SalePrice'].describe().sort_values('mean')

<a id="subsection-NeighborhoodBldgType"></a>
## NeighborhoodBldgType

In [None]:
# Added new column and used min/max normalization

combined['NeighborhoodBldgType'] = combined['Neighborhood'] * combined['BldgType']

for d in all_data:
    d['NeighborhoodBldgType'] = d['Neighborhood'] * d['BldgType']
    d['NeighborhoodBldgType'] = (d['NeighborhoodBldgType'] - combined['NeighborhoodBldgType'].min()) \
                                      / (combined['NeighborhoodBldgType'].max()-combined['NeighborhoodBldgType'].min())

combined['NeighborhoodBldgType'] = (combined['NeighborhoodBldgType'] - combined['NeighborhoodBldgType'].min()) \
                                  / (combined['NeighborhoodBldgType'].max()-combined['NeighborhoodBldgType'].min())

<a id="subsection-TotalSquareFeet"></a>
## Total Square Feet

In [None]:
combined['TotalBsmtSF'] = combined['TotalBsmtSF'].fillna(0)
combined['GarageArea'] = combined['GarageArea'].fillna(0)

combined['TotalSquareFeet'] = combined['TotalBsmtSF'] + combined['1stFlrSF'] \
                            + combined['2ndFlrSF'] + combined['GarageArea'] \
                            + combined['WoodDeckSF']

for d in all_data:
    d['TotalBsmtSF'] = d['TotalBsmtSF'].fillna(0)
    d['GarageArea'] = d['GarageArea'].fillna(0)
    d['TotalSquareFeet'] = d['TotalBsmtSF'] + d['1stFlrSF'] + d['2ndFlrSF'] + d['GarageArea'] + d['WoodDeckSF']
    d['TotalSquareFeet'] = (d['TotalSquareFeet']-combined['TotalSquareFeet'].mean())/combined['TotalSquareFeet'].std()
    
combined['TotalSquareFeet'] = (combined['TotalSquareFeet']-combined['TotalSquareFeet'].mean())/combined['TotalSquareFeet'].std()

<a id="subsection-HouseStyle"></a>
## HouseStyle

In [None]:
subclass_map = train_df.groupby('HouseStyle')['SalePrice'].mean().sort_values()
for d in all_data:
    for idx, s in enumerate(subclass_map.index):
        d.loc[d['HouseStyle'] == s,'HouseStyle'] = idx+1

        
for idx, s in enumerate(subclass_map.index):
    combined.loc[combined['HouseStyle'] == s,'HouseStyle'] = idx+1

train_df.groupby('HouseStyle')['SalePrice'].describe().sort_values('mean')

<a id="subsection-NeighborhoodHouseStyle"></a>
## NeighborhoodHouseStyle

In [None]:
# Added new column and used min/max normalization

combined['NeighborhoodHouseStyle'] = combined['Neighborhood'] * combined['HouseStyle']

for d in all_data:
    d['NeighborhoodHouseStyle'] = d['Neighborhood'] * d['HouseStyle']
    d['NeighborhoodHouseStyle'] = (d['NeighborhoodHouseStyle'] - combined['NeighborhoodHouseStyle'].min()) \
                                      / (combined['NeighborhoodHouseStyle'].max()-combined['NeighborhoodHouseStyle'].min())

combined['NeighborhoodHouseStyle'] = (combined['NeighborhoodHouseStyle'] - combined['NeighborhoodHouseStyle'].min()) \
                                  / (combined['NeighborhoodHouseStyle'].max()-combined['NeighborhoodHouseStyle'].min())

<a id="subsection-OverallQualNCond"></a>
## OverallQualNCond

In [None]:
# Added new column and used min/max normalization

combined['OverallQualNCond'] = combined['OverallQual'] * combined['OverallCond']

for d in all_data:
    d['OverallQualNCond'] = d['OverallQual'] * d['OverallCond']
    d['OverallQualNCond'] = (d['OverallQualNCond'] - combined['OverallQualNCond'].min()) \
                                      / (combined['OverallQualNCond'].max()-combined['OverallQualNCond'].min())

combined['OverallQualNCond'] = (combined['OverallQualNCond'] - combined['OverallQualNCond'].min()) \
                                  / (combined['OverallQualNCond'].max()-combined['OverallQualNCond'].min())

<a id="subsection-NeighborhoodOverallQualNCond"></a>
## NeighborhoodOverallQualNCond

In [None]:
# Added new column and used min/max normalization

combined['NeighborhoodOverallQualNCond'] = combined['Neighborhood'] * combined['OverallQualNCond']

for d in all_data:
    d['NeighborhoodOverallQualNCond'] = d['Neighborhood'] * d['OverallQualNCond']
    d['NeighborhoodOverallQualNCond'] = (d['NeighborhoodOverallQualNCond'] - combined['NeighborhoodOverallQualNCond'].min()) \
                                      / (combined['NeighborhoodOverallQualNCond'].max()-combined['NeighborhoodOverallQualNCond'].min())

combined['NeighborhoodOverallQualNCond'] = (combined['NeighborhoodOverallQualNCond'] - combined['NeighborhoodOverallQualNCond'].min()) \
                                  / (combined['NeighborhoodOverallQualNCond'].max()-combined['NeighborhoodOverallQualNCond'].min())

<a id="subsection-Age"></a>
## Age
We'll determine this by taking the latest MoSold/YrSold in the data and counting the months back to each house's last remodel, since we determined that the build date is less correlated with sale price than the remodel date.  (This assumes houses were built and remodelled in the first month of the year.)
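
As a small worked example of that calculation, here is a sketch on made-up remodel years. The latest sale date (July 2010) and the four years are assumptions chosen for illustration; the arithmetic mirrors the description above.

```python
import pandas as pd

# Made-up remodel years; assume the latest sale in the data is July 2010
remod_year = pd.Series([1950, 1980, 2005, 2010])
max_year, max_month = 2010, 7

# Months from the remodel year to the latest sale month
# (any constant offset cancels out in the scaling below)
months_ago = (max_year - remod_year) * 12 + max_month

# Min-max scale, then flip so the most recent remodel scores highest
scaled = (months_ago - months_ago.min()) / (months_ago.max() - months_ago.min())
adj = 1 - scaled
print(adj.round(3).tolist())
```

The flip means a larger value stands for a newer remodel, which keeps the feature pointing in the same direction as sale price.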

In [None]:
max_year = combined['YrSold'].max()
max_month = combined.loc[combined['YrSold'] == max_year, 'MoSold'].max()

for d in all_data:
    months_ago = (max_year-d['YearRemodAdd'])*12 + max_month
    d['AdjRemodAdd'] = (months_ago - months_ago.min()) / (months_ago.max()-months_ago.min()) \
                     * -1 + 1 # reverse the order
    d.drop(['YearRemodAdd'], axis=1, inplace=True)

months_ago = (max_year-combined['YearRemodAdd'])*12 + max_month
combined['AdjRemodAdd'] = (months_ago - months_ago.min()) / (months_ago.max()-months_ago.min()) \
                        * -1 + 1 # reverse the order
combined.drop(['YearRemodAdd'], axis=1, inplace=True)

In [None]:
from sklearn.metrics import mean_squared_error

X_train = train_df.drop(['SalePrice'], axis=1)
plt.plot(y_train, X_train['AdjRemodAdd'], 'o')
m, b = np.polyfit(y_train, X_train['AdjRemodAdd'], 1)
plt.plot(y_train, m*y_train + b)
# Compare the fitted line against the values it was fitted to
print(mean_squared_error(X_train['AdjRemodAdd'], m*y_train + b))

<a id="subsection-AverageSalePrice"></a>
## Average Sale Price

In [None]:
# Get the average of each year/month
df = train_df.groupby(['YrSold','MoSold'])[['SalePrice']].mean()
df = df.rename({'SalePrice': 'MonthMean'}, axis=1)

# get the rolling average spanning 3 months
qt = (df[['MonthMean']] + df[['MonthMean']].shift(1) + df[['MonthMean']].shift(-1)) / 3
df['QuarterMean'] = qt['MonthMean']
df['QuarterMean'] = df['QuarterMean'].fillna(df['MonthMean'])
print(df)
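
The three-month average above can also be written as a centered rolling window. The sketch below checks that the two forms agree on made-up monthly means; the index values and prices are assumptions for illustration only.

```python
import pandas as pd

# Made-up monthly means, indexed like the groupby result above
idx = pd.MultiIndex.from_tuples(
    [(2009, 11), (2009, 12), (2010, 1), (2010, 2)],
    names=['YrSold', 'MoSold'])
monthly = pd.Series([180000.0, 170000.0, 190000.0, 200000.0],
                    index=idx, name='MonthMean')

# Shift-based version: previous, current, and next row averaged
shifted = (monthly + monthly.shift(1) + monthly.shift(-1)) / 3

# Centered rolling window; the first and last rows are NaN in both versions
rolled = monthly.rolling(window=3, center=True).mean()

# Fall back to the month's own mean where the window is incomplete
quarter = rolled.fillna(monthly)
print(quarter)
```

Both versions treat neighbouring rows as neighbouring months, so a month with no sales at all would silently widen the window.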

In [None]:
# Merge the averages onto our datasets
for idx, d in enumerate(all_data):
    all_data[idx] = pd.merge(d, df, how='left', left_on=['YrSold','MoSold'], right_on=['YrSold','MoSold'])
    all_data[idx].drop(['MoSold', 'YrSold'], axis=1, inplace=True)
    
combined = pd.merge(combined, df, how='left', left_on=['YrSold','MoSold'], right_on=['YrSold','MoSold'])
combined.drop(['MoSold', 'YrSold'], axis=1, inplace=True)

# Fix the reference since above we generated new references for all_data
train_df = all_data[0]
test_df = all_data[1]

In [None]:
# Normalize the values (the original columns are kept; the drop is commented out)

for d in all_data:
    d['QuarterMeanNorm'] = (d['QuarterMean']-combined['QuarterMean'].mean())/combined['QuarterMean'].std()
    d['MonthMeanNorm'] = (d['MonthMean']-combined['MonthMean'].mean())/combined['MonthMean'].std()
    #d.drop(['QuarterMean','MonthMean'], axis=1, inplace=True)

combined['QuarterMeanNorm'] = (combined['QuarterMean']-combined['QuarterMean'].mean())/combined['QuarterMean'].std()
combined['MonthMeanNorm'] = (combined['MonthMean']-combined['MonthMean'].mean())/combined['MonthMean'].std()
#combined.drop(['QuarterMean','MonthMean'], axis=1, inplace=True)

<a id="subsection-Exterior"></a>
## Exterior

In [None]:
ex1_map = {'Other': 0, 'BrkFace': 1, 'CemntBd': 2, 'Plywood': 3, 'Wd Sdng': 4, 'MetalSd': 5, 'HdBoard': 6, 'VinylSd': 7}
ex2_map = {'Other': 0, 'AsbShng': 1, 'BrkFace': 2, 'Stucco': 3, 'Wd Shng': 4, 'CmentBd': 5, 'Plywood': 7, 'Wd Sdng': 8, 'HdBoard': 9, 'MetalSd': 10, 'VinylSd': 11}
max_map = {'None': 0, 'BrkCmn': 1, 'Stone': 2, 'BrkFace': 3}
exg_map = {'Po': 0.0, 'Fa': 0.25, 'TA': 0.5, 'Gd': 0.75, 'Ex': 1.0}

combined['MasVnrArea'] = combined['MasVnrArea'].fillna(0)

for d in all_data:
    d['Exterior1st'] = d['Exterior1st'].fillna(combined['Exterior1st'].mode()[0])
    d['Exterior1st'] = d['Exterior1st'].replace(['AsphShn','CBlock','ImStucc','BrkComm','Stone','AsbShng','Stucco','WdShing'], 'Other')
    d['Exterior1st'] = d['Exterior1st'].map(ex1_map)
    d['Exterior2nd'] = d['Exterior2nd'].fillna(combined['Exterior2nd'].mode()[0])
    d['Exterior2nd'] = d['Exterior2nd'].replace(['CBlock','AsphShn','Stone','Brk Cmn','ImStucc'], 'Other')
    d['Exterior2nd'] = d['Exterior2nd'].map(ex2_map)
    d['MasVnrType'] = d['MasVnrType'].fillna('None')
    d['MasVnrType'] = d['MasVnrType'].map(max_map)
    d['MasVnrArea'] = d['MasVnrArea'].fillna(0)
    d['MasVnrArea'] = (d['MasVnrArea'] - combined['MasVnrArea'].min()) \
                      / (combined['MasVnrArea'].max()-combined['MasVnrArea'].min())
    d['ExterQual'] = d['ExterQual'].map(exg_map)
    d['ExterCond'] = d['ExterCond'].map(exg_map)

combined['Exterior1st'] = combined['Exterior1st'].fillna(combined['Exterior1st'].mode()[0])
combined['Exterior1st'] = combined['Exterior1st'].replace(['AsphShn','CBlock','ImStucc','BrkComm','Stone','AsbShng','Stucco','WdShing'], 'Other')
combined['Exterior1st'] = combined['Exterior1st'].map(ex1_map)
combined['Exterior2nd'] = combined['Exterior2nd'].fillna(combined['Exterior2nd'].mode()[0])
combined['Exterior2nd'] = combined['Exterior2nd'].replace(['CBlock','AsphShn','Stone','Brk Cmn','ImStucc'], 'Other')
combined['Exterior2nd'] = combined['Exterior2nd'].map(ex2_map)
combined['MasVnrType'] = combined['MasVnrType'].fillna('None')
combined['MasVnrType'] = combined['MasVnrType'].map(max_map)
combined['MasVnrArea'] = (combined['MasVnrArea'] - combined['MasVnrArea'].min()) \
                         / (combined['MasVnrArea'].max()-combined['MasVnrArea'].min())
combined['ExterQual'] = combined['ExterQual'].map(exg_map)
combined['ExterCond'] = combined['ExterCond'].map(exg_map)

<a id="subsection-FondBase"></a>
## Foundation and Basement

In [None]:
fnd_map = {'Slab': 0, 'BrkTil': 1, 'CBlock': 2, 'Other': 3, 'PConc': 4}
bsm_map = {'Po': 0.0, 'Fa': 0.25, 'TA': 0.5, 'Gd': 0.75, 'Ex': 1.0}
bex_map = {'NA': 0.0, 'No': 0.25, 'Mn': 0.5, 'Av': 0.75, 'Gd': 1.0}
bfi_map = {'NA': 0.0, 'Unf': 0.17, 'LwQ': 0.33, 'Rec': 0.5, 'BLQ': 0.67, 'ALQ': 0.83, 'GLQ': 1.0}

combined['BsmtFinSF1'] = combined['BsmtFinSF1'].fillna(0)
combined['BsmtFinSF2'] = combined['BsmtFinSF2'].fillna(0)
combined['BsmtUnfSF'] = combined['BsmtUnfSF'].fillna(0)
combined['TotalBsmtSF'] = combined['TotalBsmtSF'].fillna(0)
combined['BsmtFullBath'] = combined['BsmtFullBath'].fillna(0)
combined['BsmtHalfBath'] = combined['BsmtHalfBath'].fillna(0)

for d in all_data:
    d['Foundation'] = d['Foundation'].replace(['Stone', 'Wood'], 'Other')
    d['Foundation'] = d['Foundation'].map(fnd_map)
    d['BsmtQual'] = d['BsmtQual'].fillna('Po')
    d['BsmtQual'] = d['BsmtQual'].map(bsm_map)
    d['BsmtCond'] = d['BsmtCond'].fillna('Po')
    d['BsmtCond'] = d['BsmtCond'].map(bsm_map)
    d['BsmtExposure'] = d['BsmtExposure'].fillna('NA')
    d['BsmtExposure'] = d['BsmtExposure'].map(bex_map)
    d['BsmtFinType1'] = d['BsmtFinType1'].fillna('NA')
    d['BsmtFinType1'] = d['BsmtFinType1'].map(bfi_map)
    d['BsmtFinType2'] = d['BsmtFinType2'].fillna('NA')
    d['BsmtFinType2'] = d['BsmtFinType2'].map(bfi_map)
    d['BsmtFinSF1'] = d['BsmtFinSF1'].fillna(0)
#     d['BsmtFinSF1'] = (d['BsmtFinSF1'] - combined['BsmtFinSF1'].min()) \
#                       / (combined['BsmtFinSF1'].max()-combined['BsmtFinSF1'].min())
    d['BsmtFinSF1'] = (d['BsmtFinSF1']-combined['BsmtFinSF1'].mean())/combined['BsmtFinSF1'].std()
    d['BsmtFinSF2'] = d['BsmtFinSF2'].fillna(0)
#     d['BsmtFinSF2'] = (d['BsmtFinSF2'] - combined['BsmtFinSF2'].min()) \
#                       / (combined['BsmtFinSF2'].max()-combined['BsmtFinSF2'].min())
    d['BsmtFinSF2'] = (d['BsmtFinSF2']-combined['BsmtFinSF2'].mean())/combined['BsmtFinSF2'].std()
    d['BsmtUnfSF'] = d['BsmtUnfSF'].fillna(0)
#     d['BsmtUnfSF'] = (d['BsmtUnfSF'] - combined['BsmtUnfSF'].min()) \
#                       / (combined['BsmtUnfSF'].max()-combined['BsmtUnfSF'].min())
    d['BsmtUnfSF'] = (d['BsmtUnfSF']-combined['BsmtUnfSF'].mean())/combined['BsmtUnfSF'].std()
    d['TotalBsmtSF'] = d['TotalBsmtSF'].fillna(0)
#     d['TotalBsmtSF'] = (d['TotalBsmtSF'] - combined['TotalBsmtSF'].min()) \
#                       / (combined['TotalBsmtSF'].max()-combined['TotalBsmtSF'].min())
    d['TotalBsmtSF'] = (d['TotalBsmtSF']-combined['TotalBsmtSF'].mean())/combined['TotalBsmtSF'].std()
    d['BsmtFullBath'] = d['BsmtFullBath'].fillna(0)
    d['BsmtHalfBath'] = d['BsmtHalfBath'].fillna(0)
    d['BsmtFullBath'] = (d['BsmtFullBath']-combined['BsmtFullBath'].mean())/combined['BsmtFullBath'].std()
    d['BsmtHalfBath'] = (d['BsmtHalfBath']-combined['BsmtHalfBath'].mean())/combined['BsmtHalfBath'].std()

combined['Foundation'] = combined['Foundation'].replace(['Stone', 'Wood'], 'Other')
combined['Foundation'] = combined['Foundation'].map(fnd_map)
combined['BsmtQual'] = combined['BsmtQual'].fillna('Po')
combined['BsmtQual'] = combined['BsmtQual'].map(bsm_map)
combined['BsmtCond'] = combined['BsmtCond'].fillna('Po')
combined['BsmtCond'] = combined['BsmtCond'].map(bsm_map)
combined['BsmtExposure'] = combined['BsmtExposure'].fillna('NA')
combined['BsmtExposure'] = combined['BsmtExposure'].map(bex_map)
combined['BsmtFinType1'] = combined['BsmtFinType1'].fillna('NA')
combined['BsmtFinType1'] = combined['BsmtFinType1'].map(bfi_map)
combined['BsmtFinType2'] = combined['BsmtFinType2'].fillna('NA')
combined['BsmtFinType2'] = combined['BsmtFinType2'].map(bfi_map)
# combined['BsmtFinSF1'] = (combined['BsmtFinSF1'] - combined['BsmtFinSF1'].min()) \
#                          / (combined['BsmtFinSF1'].max()-combined['BsmtFinSF1'].min())
combined['BsmtFinSF1'] = (combined['BsmtFinSF1']-combined['BsmtFinSF1'].mean())/combined['BsmtFinSF1'].std()
# combined['BsmtFinSF2'] = (combined['BsmtFinSF2'] - combined['BsmtFinSF2'].min()) \
#                          / (combined['BsmtFinSF2'].max()-combined['BsmtFinSF2'].min())
combined['BsmtFinSF2'] = (combined['BsmtFinSF2']-combined['BsmtFinSF2'].mean())/combined['BsmtFinSF2'].std()
# combined['BsmtUnfSF'] = (combined['BsmtUnfSF'] - combined['BsmtUnfSF'].min()) \
#                          / (combined['BsmtUnfSF'].max()-combined['BsmtUnfSF'].min())
combined['BsmtUnfSF'] = (combined['BsmtUnfSF']-combined['BsmtUnfSF'].mean())/combined['BsmtUnfSF'].std()
# combined['TotalBsmtSF'] = (combined['TotalBsmtSF'] - combined['TotalBsmtSF'].min()) \
#                          / (combined['TotalBsmtSF'].max()-combined['TotalBsmtSF'].min())
combined['TotalBsmtSF'] = (combined['TotalBsmtSF']-combined['TotalBsmtSF'].mean())/combined['TotalBsmtSF'].std()
combined['BsmtFullBath'] = (combined['BsmtFullBath']-combined['BsmtFullBath'].mean())/combined['BsmtFullBath'].std()
combined['BsmtHalfBath'] = (combined['BsmtHalfBath']-combined['BsmtHalfBath'].mean())/combined['BsmtHalfBath'].std()

<a id="subsection-HeatCool"></a>
## Heating / Cooling

In [None]:
ht_map = {'Po': 0.0, 'Fa': 0.25, 'TA': 0.5, 'Gd': 0.75, 'Ex': 1.0}
yn_map = {'N': 0, 'Y': 1}

for d in all_data:
    d['HeatingQC'] = d['HeatingQC'].map(ht_map)
    d['CentralAir'] = d['CentralAir'].map(yn_map)
    
combined['HeatingQC'] = combined['HeatingQC'].map(ht_map)
combined['CentralAir'] = combined['CentralAir'].map(yn_map)

<a id="subsection-FloorRoom"></a>
## Floor Sizes and Room Conditions

In [None]:
rqc_map = {'Po': 0.0, 'Fa': 0.25, 'TA': 0.5, 'Gd': 0.75, 'Ex': 1.0}

for d in all_data:
    d['OverallQual'] = (d['OverallQual'] - combined['OverallQual'].min()) \
                       / (combined['OverallQual'].max()-combined['OverallQual'].min())
    d['OverallCond'] = (d['OverallCond'] - combined['OverallCond'].min()) \
                       / (combined['OverallCond'].max()-combined['OverallCond'].min())
#     d['1stFlrSF'] = (d['1stFlrSF'] - combined['1stFlrSF'].min()) \
#                     / (combined['1stFlrSF'].max()-combined['1stFlrSF'].min())
    d['1stFlrSF'] = (d['1stFlrSF']-combined['1stFlrSF'].mean())/combined['1stFlrSF'].std()
#     d['2ndFlrSF'] = (d['2ndFlrSF'] - combined['2ndFlrSF'].min()) \
#                     / (combined['2ndFlrSF'].max()-combined['2ndFlrSF'].min())
    d['2ndFlrSF'] = (d['2ndFlrSF']-combined['2ndFlrSF'].mean())/combined['2ndFlrSF'].std()
#     d['LowQualFinSF'] = (d['LowQualFinSF'] - combined['LowQualFinSF'].min()) \
#                     / (combined['LowQualFinSF'].max()-combined['LowQualFinSF'].min())
    d['LowQualFinSF'] = (d['LowQualFinSF']-combined['LowQualFinSF'].mean())/combined['LowQualFinSF'].std()
#     d['GrLivArea'] = (d['GrLivArea'] - combined['GrLivArea'].min()) \
#                     / (combined['GrLivArea'].max()-combined['GrLivArea'].min())
    d['GrLivArea'] = (d['GrLivArea']-combined['GrLivArea'].mean())/combined['GrLivArea'].std()
    d['KitchenQual'] = d['KitchenQual'].fillna(d['KitchenQual'].mode()[0])
    d['KitchenQual'] = d['KitchenQual'].map(rqc_map)
    d['FireplaceQu'] = d['FireplaceQu'].fillna('Po')
    d['FireplaceQu'] = d['FireplaceQu'].map(rqc_map)
    d['FullBath'] = (d['FullBath']-combined['FullBath'].mean())/combined['FullBath'].std()
    d['HalfBath'] = (d['HalfBath']-combined['HalfBath'].mean())/combined['HalfBath'].std()
    d['BedroomAbvGr'] = (d['BedroomAbvGr']-combined['BedroomAbvGr'].mean())/combined['BedroomAbvGr'].std()
    d['KitchenAbvGr'] = (d['KitchenAbvGr']-combined['KitchenAbvGr'].mean())/combined['KitchenAbvGr'].std()
    d['Fireplaces'] = (d['Fireplaces']-combined['Fireplaces'].mean())/combined['Fireplaces'].std()
    d['TotRmsAbvGrd'] = (d['TotRmsAbvGrd']-combined['TotRmsAbvGrd'].mean())/combined['TotRmsAbvGrd'].std()

combined['OverallQual'] = (combined['OverallQual'] - combined['OverallQual'].min()) \
                         / (combined['OverallQual'].max()-combined['OverallQual'].min())
combined['OverallCond'] = (combined['OverallCond'] - combined['OverallCond'].min()) \
                         / (combined['OverallCond'].max()-combined['OverallCond'].min())
# combined['1stFlrSF'] = (combined['1stFlrSF'] - combined['1stFlrSF'].min()) \
#                          / (combined['1stFlrSF'].max()-combined['1stFlrSF'].min())
combined['1stFlrSF'] = (combined['1stFlrSF']-combined['1stFlrSF'].mean())/combined['1stFlrSF'].std()
# combined['2ndFlrSF'] = (combined['2ndFlrSF'] - combined['2ndFlrSF'].min()) \
#                          / (combined['2ndFlrSF'].max()-combined['2ndFlrSF'].min())
combined['2ndFlrSF'] = (combined['2ndFlrSF']-combined['2ndFlrSF'].mean())/combined['2ndFlrSF'].std()
# combined['LowQualFinSF'] = (combined['LowQualFinSF'] - combined['LowQualFinSF'].min()) \
#                          / (combined['LowQualFinSF'].max()-combined['LowQualFinSF'].min())
combined['LowQualFinSF'] = (combined['LowQualFinSF']-combined['LowQualFinSF'].mean())/combined['LowQualFinSF'].std()
# combined['GrLivArea'] = (combined['GrLivArea'] - combined['GrLivArea'].min()) \
#                          / (combined['GrLivArea'].max()-combined['GrLivArea'].min())
combined['GrLivArea'] = (combined['GrLivArea']-combined['GrLivArea'].mean())/combined['GrLivArea'].std()
combined['KitchenQual'] = combined['KitchenQual'].fillna(combined['KitchenQual'].mode()[0])
combined['KitchenQual'] = combined['KitchenQual'].map(rqc_map)
combined['FireplaceQu'] = combined['FireplaceQu'].fillna('Po')
combined['FireplaceQu'] = combined['FireplaceQu'].map(rqc_map)
combined['FullBath'] = (combined['FullBath']-combined['FullBath'].mean())/combined['FullBath'].std()
combined['HalfBath'] = (combined['HalfBath']-combined['HalfBath'].mean())/combined['HalfBath'].std()
combined['BedroomAbvGr'] = (combined['BedroomAbvGr']-combined['BedroomAbvGr'].mean())/combined['BedroomAbvGr'].std()
combined['KitchenAbvGr'] = (combined['KitchenAbvGr']-combined['KitchenAbvGr'].mean())/combined['KitchenAbvGr'].std()
combined['Fireplaces'] = (combined['Fireplaces']-combined['Fireplaces'].mean())/combined['Fireplaces'].std()
combined['TotRmsAbvGrd'] = (combined['TotRmsAbvGrd']-combined['TotRmsAbvGrd'].mean())/combined['TotRmsAbvGrd'].std()

<a id="subsection-Garage"></a>
## Garage

In [None]:
grt_map = {'None': 0, 'Detchd': 1, 'Other': 2, 'Attchd': 3, 'BuiltIn': 4}
grf_map = {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}
gqc_map = {'Po': 0.0, 'Fa': 0.25, 'TA': 0.5, 'Gd': 0.75, 'Ex': 1.0}
pdr_map = {'N': 0.0, 'P': 0.5, 'Y': 1.0}

combined['GarageCars'] = combined['GarageCars'].fillna(0)
combined['GarageArea'] = combined['GarageArea'].fillna(0)

for d in all_data:
    d['GarageType'] = d['GarageType'].fillna('None')
    d['GarageType'] = d['GarageType'].replace(['CarPort', '2Types', 'Basment'], 'Other')
    d['GarageType'] = d['GarageType'].map(grt_map)
    d['GarageFinish'] = d['GarageFinish'].fillna('None')
    d['GarageFinish'] = d['GarageFinish'].map(grf_map)
    d['GarageCars'] = d['GarageCars'].fillna(0)
    d['GarageCars'] = (d['GarageCars']-combined['GarageCars'].mean())/combined['GarageCars'].std()
    d['GarageArea'] = d['GarageArea'].fillna(0)
    d['GarageArea'] = (d['GarageArea']-combined['GarageArea'].mean())/combined['GarageArea'].std()
    d['GarageQual'] = d['GarageQual'].fillna('Po')
    d['GarageQual'] = d['GarageQual'].map(gqc_map)
    d['GarageCond'] = d['GarageCond'].fillna('Po')
    d['GarageCond'] = d['GarageCond'].map(gqc_map)
    d['PavedDrive'] = d['PavedDrive'].map(pdr_map)
    
combined['GarageType'] = combined['GarageType'].fillna('None')
combined['GarageType'] = combined['GarageType'].replace(['CarPort', '2Types', 'Basment'], 'Other')
combined['GarageType'] = combined['GarageType'].map(grt_map)
combined['GarageFinish'] = combined['GarageFinish'].fillna('None')
combined['GarageFinish'] = combined['GarageFinish'].map(grf_map)
combined['GarageCars'] = (combined['GarageCars']-combined['GarageCars'].mean())/combined['GarageCars'].std()
combined['GarageArea'] = (combined['GarageArea']-combined['GarageArea'].mean())/combined['GarageArea'].std()
combined['GarageQual'] = combined['GarageQual'].fillna('Po')
combined['GarageQual'] = combined['GarageQual'].map(gqc_map)
combined['GarageCond'] = combined['GarageCond'].fillna('Po')
combined['GarageCond'] = combined['GarageCond'].map(gqc_map)
combined['PavedDrive'] = combined['PavedDrive'].map(pdr_map)

<a id="subsection-Other"></a>
## Other Features

In [None]:
plt.plot(y_train, X_train['WoodDeckSF'], 'o')
m, b = np.polyfit(y_train, X_train['WoodDeckSF'], 1)
plt.plot(y_train, m*y_train + b)

In [None]:
for d in all_data:
    d['WoodDeckSF'] = (d['WoodDeckSF']-combined['WoodDeckSF'].mean())/combined['WoodDeckSF'].std()
    d['OpenPorchSF'] = (d['OpenPorchSF']-combined['OpenPorchSF'].mean())/combined['OpenPorchSF'].std()
    
combined['WoodDeckSF'] = (combined['WoodDeckSF']-combined['WoodDeckSF'].mean())/combined['WoodDeckSF'].std()
combined['OpenPorchSF'] = (combined['OpenPorchSF']-combined['OpenPorchSF'].mean())/combined['OpenPorchSF'].std()

<a id="subsection-SaleInfo"></a>
## Sale Info

In [None]:
sac_map = {'Other': 0, 'Abnorml': 1, 'Normal': 2, 'Partial': 3}

for d in all_data:
    d['SaleCondition'] = d['SaleCondition'].replace(['AdjLand', 'Family', 'Alloca'], 'Other')
    d['SaleCondition'] = d['SaleCondition'].map(sac_map)
    
combined['SaleCondition'] = combined['SaleCondition'].replace(['AdjLand', 'Family', 'Alloca'], 'Other')
combined['SaleCondition'] = combined['SaleCondition'].map(sac_map)

<a id="subsection-Results"></a>
## Results

In [None]:
pd.set_option('display.max_columns', None)
train_df.head(10)

In [None]:
field = 'SaleCondition'
print(f'{combined[field].isnull().sum()} or {combined[field].isnull().mean() * 100}% empty')
train_df.groupby(field)['SalePrice'].describe().sort_values('mean')

### Output new files

In [None]:
train_df.to_csv('new_train.csv', index=False)
test_df.to_csv('new_test.csv', index=False)
print("New files have been created.")

<a id="section-onehot"></a>
# One-Hot Encoding

Some of you might be wondering why one-hot encoding gets its own section.  To be honest, it was an afterthought.  After I finished the data cleanup above, I tested it and got good results.  I then tried one-hot encoding to see what would change, got better results, and kept it in.
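
As a quick illustration of what `pd.get_dummies(..., drop_first=True)` produces, here is a minimal sketch on a made-up frame. The toy values are assumptions; only the `get_dummies` call mirrors the cells below. Encoding the train and test rows together keeps both halves on the same set of dummy columns.

```python
import pandas as pd

# Toy train/test frames; category codes are invented for illustration
toy_train = pd.DataFrame({'Id': [1, 2, 3], 'LotConfig': [0, 1, 2]})
toy_test = pd.DataFrame({'Id': [4, 5], 'LotConfig': [2, 0]})

# Encode both halves together so they share one set of dummy columns
both = pd.concat([toy_train, toy_test], ignore_index=True)
dummies = pd.get_dummies(both['LotConfig'], prefix='LotConfig', drop_first=True)
both = pd.concat([both.drop('LotConfig', axis=1), dummies], axis=1)

# Split back into the original halves by row count
enc_train = both[:len(toy_train)]
enc_test = both[len(toy_train):]
print(list(both.columns))
```

With `drop_first=True`, the first level (0 here) becomes the implicit baseline, so three levels yield two dummy columns.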

In [None]:
# import numpy as np
# import pandas as pd
# from sklearn.metrics import mean_squared_log_error

X = pd.read_csv("/kaggle/working/new_train.csv")
y = pd.read_csv("/kaggle/working/new_test.csv")

all_data = [X,y]

pd.set_option('display.max_columns', None)
X.head(20)

In [None]:
B = pd.concat([X,y])

B = pd.concat([B, pd.get_dummies(B['Neighborhood'], prefix='Neighborhood', drop_first=True)], axis=1)
B = pd.concat([B, pd.get_dummies(B['LotConfig'], prefix='LotConfig', drop_first=True)], axis=1)
B = pd.concat([B, pd.get_dummies(B['BldgType'], prefix='BldgType', drop_first=True)], axis=1)
B = pd.concat([B, pd.get_dummies(B['HouseStyle'], prefix='HouseStyle', drop_first=True)], axis=1)
B = pd.concat([B, pd.get_dummies(B['SaleCondition'], prefix='SaleCondition', drop_first=True)], axis=1)
B = pd.concat([B, pd.get_dummies(B['GarageFinish'], prefix='GarageFinish', drop_first=True)], axis=1)
B = pd.concat([B, pd.get_dummies(B['GarageType'], prefix='GarageType', drop_first=True)], axis=1)
B = pd.concat([B, pd.get_dummies(B['Foundation'], prefix='Foundation', drop_first=True)], axis=1)

#B.drop(['Neighborhood'], axis=1, inplace=True)
B.drop(['LotConfig', 'BldgType', 'HouseStyle', 'SaleCondition', 'GarageFinish', 'GarageType', 'Foundation'], axis=1, inplace=True)
B.drop(['MasVnrType', 'Exterior2nd', 'Exterior1st', 'MSZoning', 'MSSubClass'], axis=1, inplace=True)

B.head(20)

In [None]:
X = B[:len(X)]
y = B[len(X):]
y = y.drop('SalePrice', axis=1)

len(X)

In [None]:
X.to_csv('new_train_1h.csv', index=False)
y.to_csv('new_test_1h.csv', index=False)
print("New files have been created.")

<a id="section-columnselect"></a>
# Column Selection
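
Every score in this section and in Training is `np.sqrt(mean_squared_log_error(...))`, the root mean squared log error (RMSLE). Lower is better, so the variables named `acc_*` hold errors rather than accuracies. Here is a small sketch of the same quantity in plain NumPy, assuming the standard definition based on `log(1 + y)`. The function name `rmsle` is mine.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Invented prices: the error depends on the ratio, not the absolute gap
print(rmsle([100000, 200000], [110000, 220000]))
print(rmsle([100000, 200000], [100000, 200000]))
```

Because the metric works on logs, a 10% miss on a cheap house costs about the same as a 10% miss on an expensive one.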

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_log_error

# X_train = pd.read_csv("new_train.csv")
# X_test = pd.read_csv("new_test.csv")
X = pd.read_csv("/kaggle/working/new_train_1h.csv")
test_df = pd.read_csv("/kaggle/working/new_test_1h.csv")

In [None]:
train_df = X.sample(frac=0.6,random_state=200)
evaluate_df = X.drop(train_df.index)

X_train = train_df.drop(['Id','SalePrice'], axis=1)
Y_train = train_df["SalePrice"]
X_train_full = X.drop(['Id','SalePrice'], axis=1)
Y_train_full = X["SalePrice"]
X_eval  = evaluate_df.drop(['Id','SalePrice'], axis=1)
Y_eval  = evaluate_df["SalePrice"]
X_test = test_df.drop('Id', axis=1)

len(X_train_full.columns)

In [None]:
#from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb

# grad_boost = GradientBoostingRegressor(random_state=0, n_estimators=500, max_features= 0.3)
# grad_boost_full = GradientBoostingRegressor(random_state=0, n_estimators=500, max_features= 0.3)
grad_boost = xgb.XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, gamma=0,
             importance_type='gain', learning_rate=0.01, max_delta_step=0,
             max_depth=5, min_child_weight=1.5, n_estimators=1900, 
             n_jobs=1, nthread=None, objective='reg:squarederror',
             reg_alpha=0.6, reg_lambda=0.6, scale_pos_weight=1, 
             silent=None, subsample=0.8, verbosity=1)
grad_boost_full = xgb.XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, gamma=0,
             importance_type='gain', learning_rate=0.01, max_delta_step=0,
             max_depth=5, min_child_weight=1.5, n_estimators=1900, 
             n_jobs=1, nthread=None, objective='reg:squarederror',
             reg_alpha=0.6, reg_lambda=0.6, scale_pos_weight=1, 
             silent=None, subsample=0.8, verbosity=1)
num_feat_to_drop = 1
dropped_features = []
i = 0
best_acc = 1  # best (lowest) average RMSLE seen so far

while len(X_train.columns) > 5:
    # Train our model with the set
    grad_boost.fit(X_train, Y_train)
    grad_boost_full.fit(X_train_full, Y_train_full)
    
    # Score this model with RMSLE (lower is better)
    acc_train = np.sqrt(mean_squared_log_error(Y_train, grad_boost.predict(X_train)))
    acc_eval = np.sqrt(mean_squared_log_error(Y_eval, grad_boost.predict(X_eval)))
    acc_full = np.sqrt(mean_squared_log_error(Y_train_full, grad_boost_full.predict(X_train_full)))
    avg_acc = (acc_train + acc_eval + acc_full) / 3
    if (avg_acc >= best_acc):
        print (f'{i}: Train = {acc_train}  Evaluate = {acc_eval} Full = {acc_full} Average = {avg_acc}')
    else:
        best_acc = avg_acc
        print (f'{i}: Train = {acc_train}  Evaluate = {acc_eval} Full = {acc_full} Average = {avg_acc} - NEW BEST!!!')
        print (f'  Dropped: {dropped_features}')
    
    # Determine the least important features
    importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(grad_boost.feature_importances_,3)})
    importances = importances.sort_values('importance',ascending=True).set_index('feature')
    
    # Drop the X least important features
    features_to_drop = importances[0:num_feat_to_drop].index.tolist()
    if len(features_to_drop) <= 0:
        break
    X_train.drop(features_to_drop, axis=1, inplace=True)
    X_eval.drop(features_to_drop, axis=1, inplace=True)
    X_train_full.drop(features_to_drop, axis=1, inplace=True)
    dropped_features.append(features_to_drop)
    
    i += num_feat_to_drop

In [None]:
# This is the result of a run on my own machine, so it will differ from a run on Kaggle
best_dropped = [['LotConfig_4'], ['Foundation_3'], ['Neighborhood_3'], ['BldgType_2'], ['Neighborhood_21'], ['Neighborhood_12'], ['Neighborhood_8'], ['HouseStyle_8'], ['BldgType_3'], ['HouseStyle_2'], ['Neighborhood_10']]
best_dropped = [x for inner in best_dropped for x in inner]
best_dropped

<a id="section-training"></a>
# Training

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_log_error

X = pd.read_csv("/kaggle/working/new_train_1h.csv")
test_df = pd.read_csv("/kaggle/working/new_test_1h.csv")

train_df = X.sample(frac=0.6,random_state=200)
evaluate_df = X.drop(train_df.index)

X_train = train_df.drop(['Id','SalePrice'], axis=1)
Y_train = train_df["SalePrice"]
X_train_full = X.drop(['Id','SalePrice'], axis=1)
Y_train_full = X["SalePrice"]
X_eval  = evaluate_df.drop(['Id','SalePrice'], axis=1)
Y_eval  = evaluate_df["SalePrice"]
X_test = test_df.drop('Id', axis=1)

In [None]:
# Check if any columns have null values
# (DataFrame.items() replaces iteritems(), which newer pandas versions removed)
for columnName, columnData in X_test.items():
    print (f'{columnName}: {columnData.isna().sum()}')

In [None]:
pd.set_option('display.max_columns', None)
X_test.describe()

**Final Column Drop**

Use the drop list from the Column Selection run above.
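
The selection loop stores one list per round, which is why `best_dropped` is flattened before use. Here is a minimal sketch of flattening such a list and dropping the same columns from several frames. The toy frames and column names are assumptions, and `errors='ignore'` is an optional extra that skips names a frame does not have.

```python
import pandas as pd

# One inner list per elimination round, as the selection loop records them
dropped_rounds = [['A_1'], ['B_2'], ['C_3']]
drop_list = [name for round_ in dropped_rounds for name in round_]

# Toy frames standing in for the train and test feature tables
toy_train = pd.DataFrame({'A_1': [0, 1], 'B_2': [1, 0], 'Keep': [5.0, 6.0]})
toy_test = pd.DataFrame({'A_1': [1, 1], 'Keep': [7.0, 8.0]})

# errors='ignore' skips names missing from a frame instead of raising KeyError
toy_train = toy_train.drop(columns=drop_list, errors='ignore')
toy_test = toy_test.drop(columns=drop_list, errors='ignore')
print(list(toy_train.columns), list(toy_test.columns))
```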

In [None]:
drop_list = ['LotConfig_4', 'Foundation_3', 'Neighborhood_3', 'BldgType_2', 'Neighborhood_21', 'Neighborhood_12', 'Neighborhood_8', 'HouseStyle_8', 'BldgType_3', 'HouseStyle_2', 'Neighborhood_10']

X_train.drop(drop_list, axis=1,inplace=True)
X_eval.drop(drop_list, axis=1,inplace=True)
X_train_full.drop(drop_list, axis=1,inplace=True)
X_test.drop(drop_list, axis=1,inplace=True)

<a id="subsection-LinearReg"></a>
## Linear Regression

In [None]:
from sklearn import linear_model

lr = linear_model.LinearRegression()
lr.fit(X_train, Y_train)

acc_train_lr = np.sqrt(mean_squared_log_error(Y_train, lr.predict(X_train)))
acc_eval_lr = np.sqrt(mean_squared_log_error(Y_eval, lr.predict(X_eval)))

print (f'LR: Train = {acc_train_lr}  Evaluate = {acc_eval_lr}')

<a id="subsection-StoGradDes"></a>
## Stochastic Gradient Descent (SGD)

In [None]:
from sklearn import linear_model

sgd = linear_model.SGDClassifier(random_state=1, max_iter=5, tol=None)
sgd.fit(X_train, Y_train)

acc_train_sgd = np.sqrt(mean_squared_log_error(Y_train, sgd.predict(X_train)))
acc_eval_sgd = np.sqrt(mean_squared_log_error(Y_eval, sgd.predict(X_eval)))

print (f'SGD: Train = {acc_train_sgd}  Evaluate = {acc_eval_sgd}')

<a id="subsection-RandFore"></a>
## Random Forest Classifier
I'm not sure why I tried this.  Predicting a sale price is a regression problem, not a classification problem...
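
One way to see the mismatch: a classifier treats every distinct sale price as its own class, so it can only ever predict a price it saw during training. Here is a small sketch on invented data, with `DecisionTreeClassifier` and `LinearRegression` chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Invented data: price is exactly 3 * size
X_toy = np.arange(10).reshape(-1, 1)
y_toy = 3 * np.arange(10)

clf = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)
reg = LinearRegression().fit(X_toy, y_toy)

X_new = np.array([[4.5]])
clf_pred = clf.predict(X_new)[0]   # always one of the training labels
reg_pred = reg.predict(X_new)[0]   # can fall between training values
print(clf_pred, reg_pred)
```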

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

acc_train_rf = np.sqrt(mean_squared_log_error(Y_train, random_forest.predict(X_train)))
acc_eval_rf = np.sqrt(mean_squared_log_error(Y_eval, random_forest.predict(X_eval)))

print (f'RF: Train = {acc_train_rf}  Evaluate = {acc_eval_rf}')

<a id="subsection-PolyReg"></a>
## Polynomial Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=1)
X_train_poly = poly_features.fit_transform(X_train)
# Reuse the transformer fitted on the training data instead of refitting it
X_eval_poly = poly_features.transform(X_eval)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, Y_train)

acc_train_pr = np.sqrt(mean_squared_log_error(Y_train, poly_model.predict(X_train_poly)))
acc_eval_pr = np.sqrt(mean_squared_log_error(Y_eval, abs(poly_model.predict(X_eval_poly))))

print (f'PR: Train = {acc_train_pr}  Evaluate = {acc_eval_pr}')

<a id="subsection-LogReg"></a>
## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

#logreg = LogisticRegression(random_state=1, max_iter=1000)
logreg = LogisticRegression(random_state=1)
logreg.fit(X_train, Y_train)

# Separate names so the Linear Regression scores above are not overwritten
acc_train_logreg = np.sqrt(mean_squared_log_error(Y_train, logreg.predict(X_train)))
acc_eval_logreg = np.sqrt(mean_squared_log_error(Y_eval, logreg.predict(X_eval)))

print (f'LogReg: Train = {acc_train_logreg}  Evaluate = {acc_eval_logreg}')

<a id="subsection-GNB"></a>
## Gaussian Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)

acc_train_gnb = np.sqrt(mean_squared_log_error(Y_train, gaussian.predict(X_train)))
acc_eval_gnb = np.sqrt(mean_squared_log_error(Y_eval, gaussian.predict(X_eval)))

print (f'GNB: Train = {acc_train_gnb}  Evaluate = {acc_eval_gnb}')

<a id="subsection-Perceptron"></a>
## Perceptron

In [None]:
from sklearn.linear_model import Perceptron

perceptron = Perceptron(random_state=1, max_iter=50)
perceptron.fit(X_train, Y_train)

acc_train_prc = np.sqrt(mean_squared_log_error(Y_train, perceptron.predict(X_train)))
acc_eval_prc = np.sqrt(mean_squared_log_error(Y_eval, perceptron.predict(X_eval)))

print (f'Perceptron: Train = {acc_train_prc}  Evaluate = {acc_eval_prc}')

<a id="subsection-SVM"></a>
## Linear Support Vector Machine

In [None]:
from sklearn.svm import LinearSVC

#linear_svc = LinearSVC(random_state=1, max_iter=100000)
linear_svc = LinearSVC(random_state=1)
linear_svc.fit(X_train, Y_train)

acc_train_svm = np.sqrt(mean_squared_log_error(Y_train, linear_svc.predict(X_train)))
acc_eval_svm = np.sqrt(mean_squared_log_error(Y_eval, linear_svc.predict(X_eval)))

print (f'SVM: Train = {acc_train_svm}  Evaluate = {acc_eval_svm}')

<a id="subsection-DesiTree"></a>
## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier(random_state=1)
decision_tree.fit(X_train, Y_train)

acc_train_dt = np.sqrt(mean_squared_log_error(Y_train, decision_tree.predict(X_train)))
acc_eval_dt = np.sqrt(mean_squared_log_error(Y_eval, decision_tree.predict(X_eval)))

print (f'DT: Train = {acc_train_dt}  Evaluate = {acc_eval_dt}')

<a id="subsection-RanForReg"></a>
## Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor

random_forest_reg = RandomForestRegressor(random_state=1, n_estimators=100)
random_forest_reg.fit(X_train, Y_train)

acc_train_rfr = np.sqrt(mean_squared_log_error(Y_train, random_forest_reg.predict(X_train)))
acc_eval_rfr = np.sqrt(mean_squared_log_error(Y_eval, random_forest_reg.predict(X_eval)))

print (f'RFR: Train = {acc_train_rfr}  Evaluate = {acc_eval_rfr}')

<a id="subsection-GradBooReg"></a>
## Gradient Boost Regressor

In [None]:
## Current Best

from sklearn.ensemble import GradientBoostingRegressor

grad_boost = GradientBoostingRegressor(random_state=1, n_estimators=500, max_features= 0.3)
grad_boost.fit(X_train, Y_train)

acc_train_gb = np.sqrt(mean_squared_log_error(Y_train, grad_boost.predict(X_train)))
acc_eval_gb = np.sqrt(mean_squared_log_error(Y_eval, grad_boost.predict(X_eval)))

print (f'GB: Train = {acc_train_gb}  Evaluate = {acc_eval_gb}')

<a id="subsection-AdaBoost"></a>
## AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostRegressor

ada_boost = AdaBoostRegressor(random_state=1, n_estimators=100)
ada_boost.fit(X_train, Y_train)

acc_train_ab = np.sqrt(mean_squared_log_error(Y_train, ada_boost.predict(X_train)))
acc_eval_ab = np.sqrt(mean_squared_log_error(Y_eval, ada_boost.predict(X_eval)))

print (f'AdaBoost: Train = {acc_train_ab}  Evaluate = {acc_eval_ab}')

<a id="subsection-ExtRanTree"></a>
## Extremely Randomized Trees

In [None]:
from sklearn.ensemble import ExtraTreesRegressor

ex_trees = ExtraTreesRegressor(random_state=1, n_estimators=100)
ex_trees.fit(X_train, Y_train)

acc_train_et = np.sqrt(mean_squared_log_error(Y_train, ex_trees.predict(X_train)))
acc_eval_et = np.sqrt(mean_squared_log_error(Y_eval, ex_trees.predict(X_eval)))

print (f'ET: Train = {acc_train_et}  Evaluate = {acc_eval_et}')

<a id="subsection-EnsVotReg"></a>
## Ensemble: VotingRegressor
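
`VotingRegressor` fits each base regressor and, with default weights, returns the plain average of their predictions. Here is a minimal sketch on invented data that checks this. The two small base models are stand-ins, not the five used below.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Invented data for illustration only
rng = np.random.RandomState(0)
X_toy = rng.rand(40, 3)
y_toy = 100 + 50 * X_toy[:, 0] + 10 * rng.rand(40)

vote = VotingRegressor(estimators=[
    ('lin', LinearRegression()),
    ('tree', DecisionTreeRegressor(random_state=0)),
]).fit(X_toy, y_toy)

# Average the fitted base models by hand and compare with the ensemble
manual = np.mean([est.predict(X_toy) for est in vote.estimators_], axis=0)
print(np.allclose(vote.predict(X_toy), manual))
```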

In [None]:
# Import every estimator this cell uses so it runs on its own
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
import xgboost as xgb

reg1 = GradientBoostingRegressor(random_state=1, n_estimators=100)
reg2 = RandomForestRegressor(random_state=1, n_estimators=100)
reg3 = ExtraTreesRegressor(random_state=1, n_estimators=100)
reg4 = AdaBoostRegressor(random_state=1, n_estimators=100)
reg5 = xgb.XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, gamma=0,
             importance_type='gain', learning_rate=0.01, max_delta_step=0,
             max_depth=4, min_child_weight=1.5, n_estimators=2400,
             n_jobs=1, nthread=None, objective='reg:squarederror',
             reg_alpha=0.6, reg_lambda=0.6, scale_pos_weight=1, 
             silent=None, subsample=0.8, verbosity=1)
ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('et', reg3), ('ad', reg4), ('xg', reg5)])
ereg = ereg.fit(X_train, Y_train)

acc_train_vr = np.sqrt(mean_squared_log_error(Y_train, ereg.predict(X_train)))
acc_eval_vr = np.sqrt(mean_squared_log_error(Y_eval, ereg.predict(X_eval)))

print (f'VR: Train = {acc_train_vr}  Evaluate = {acc_eval_vr}')

<a id="subsection-EnsStakGen"></a>
## Ensemble: Stacked Generalization

In [None]:
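# Import every estimator this cell uses so it runs on its own
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.svm import SVR
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
import xgboost as xgb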

est_cnt = 100
seed = 42

final_estimator = xgb.XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, gamma=0,
             importance_type='gain', learning_rate=0.01, max_delta_step=0,
             max_depth=4, min_child_weight=1.5, n_estimators=2400,
             n_jobs=1, nthread=None, objective='reg:squarederror',
             reg_alpha=0.6, reg_lambda=0.6, scale_pos_weight=1, 
             silent=None, subsample=0.8, verbosity=1)

final_layer = StackingRegressor(
    estimators=[
                ('abar', AdaBoostRegressor(random_state=seed, n_estimators=est_cnt))
                #,('extr', ExtraTreesRegressor(random_state=seed, n_estimators=est_cnt))
                ,('adar', AdaBoostRegressor(random_state=seed, n_estimators=est_cnt))
                ,('gbrt', GradientBoostingRegressor(random_state=seed, n_estimators=est_cnt))
                #,('rf', RandomForestRegressor(random_state=seed, n_estimators=est_cnt))
               ],
    final_estimator=final_estimator
)

second_layer = StackingRegressor(
    estimators=[
                ('abar', AdaBoostRegressor(random_state=seed, n_estimators=est_cnt))
                ,('extr', ExtraTreesRegressor(random_state=seed, n_estimators=est_cnt))
                ,('adar', AdaBoostRegressor(random_state=seed, n_estimators=est_cnt))
                ,('gbrt', GradientBoostingRegressor(random_state=seed, n_estimators=est_cnt))
                ,('rf', RandomForestRegressor(random_state=seed, n_estimators=est_cnt))
               ],
    final_estimator=final_layer
)

multi_layer_regressor = StackingRegressor(
    estimators=[
                #('ridge', RidgeCV())
                ('lasso', LassoCV(random_state=seed))
                ,('svr', SVR(C=1, gamma=1e-6, kernel='rbf'))
                ,('etr', ExtraTreesRegressor(random_state=seed, n_estimators=est_cnt))
               ],
    final_estimator=second_layer
)

multi_layer_regressor.fit(X_train, Y_train)

acc_train_sg = np.sqrt(mean_squared_log_error(Y_train, multi_layer_regressor.predict(X_train)))
acc_eval_sg = np.sqrt(mean_squared_log_error(Y_eval, multi_layer_regressor.predict(X_eval)))

print (f'SG: Train = {acc_train_sg}  Evaluate = {acc_eval_sg}')

<a id="subsection-XGBoost"></a>
## XGBoost

In [None]:
import xgboost as xgb


gbm = xgb.XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=2,
             importance_type='gain', learning_rate=0.01, max_delta_step=0,
             max_depth=5, min_child_weight=0, n_estimators=1900, 
             n_jobs=1, nthread=None, objective='reg:squarederror',
             reg_alpha=0.6, reg_lambda=0.3, scale_pos_weight=1, 
             silent=None, subsample=0.8, verbosity=1).fit(X_train, Y_train)
acc_train_xg = np.sqrt(mean_squared_log_error(Y_train, gbm.predict(X_train)))
acc_eval_xg = np.sqrt(mean_squared_log_error(Y_eval, gbm.predict(X_eval)))

print (f'XG: Train = {acc_train_xg}  Evaluate = {acc_eval_xg}')

<a id="subsection-Best"></a>
## Best

In [None]:
import xgboost as xgb

best_reg = xgb.XGBRegressor( booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.6, gamma=0,
             importance_type='gain', learning_rate=0.01, max_delta_step=0,
             max_depth=5, min_child_weight=1.5, n_estimators=1900, 
             n_jobs=1, nthread=None, objective='reg:squarederror',
             reg_alpha=0.6, reg_lambda=0.6, scale_pos_weight=1, 
             silent=None, subsample=0.8, verbosity=1)
best_reg.fit(X_train_full, Y_train_full)

acc_train_full = np.sqrt(mean_squared_log_error(Y_train_full, best_reg.predict(X_train_full)))

print (f'Best (previous): Train = {acc_train_xg}  Evaluate = {acc_eval_xg}')
print (f'Best FULL Train = {acc_train_full}')

predictions = best_reg.predict(X_test)
output = pd.DataFrame({'Id': test_df.Id, 'SalePrice': predictions})
output.to_csv('my_submission.csv', index=False)

In [None]:
import matplotlib.pyplot as plt

# Predictions on the full training set (not the test set), as a fit check
train_predictions = best_reg.predict(X_train_full).flatten()

a = plt.axes(aspect='equal')
plt.scatter(Y_train_full, train_predictions)
plt.xlabel('True Values')
plt.ylabel('Predictions')
lims = [0, 1000000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)