# "House Prices: Advanced Regression Techniques" Kaggle competition

## Authors: David Fernández & Rafael Lazcano

## Introduction:

This document is the final report for the "Artificial Intelligence and Machine Learning" subject. It illustrates a complete prediction workflow, specifically the challenge of modelling house prices from a wide set of features.

The only rule is to use models covered by the subject's syllabus.



## Data cleansing

Raw data usually cannot be processed directly because it contains empty or incoherent values. The first stage of this analysis transforms the raw data into a suitable dataset while maintaining the integrity of the information as much as possible.

The first step is filling in missing data. For most numeric features, such as surfaces or distances describing parts of the house, we assumed that an "na" means the house lacks that characteristic. For example, a house with "na" in the basement surface is treated as having no basement at all, so the "na" is replaced with 0. The one exception to this rule is the garage construction year ("GarageYrBlt"), for which the median is used to replace the "na".

The second step is handling the categorical variables. First we identify all categorical features and mark them as such. Then we apply one-hot encoding so they are suitable for processing. Finally, because this encoding creates a new feature for each distinct value, some columns might exist in the train dataset and not in the test one, or vice versa, so we apply a function that adds the missing columns.

The functions used are:

In [1]:
import pandas as pd


def fill_na_values(df):
    # A missing value in these features means the house lacks the characteristic, so use 0
    zero_fill_columns = ['LotFrontage', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
                         'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'GarageArea', 'GarageCars']
    df[zero_fill_columns] = df[zero_fill_columns].fillna(0)
    # The garage construction year is the exception: fill it with the median
    df['GarageYrBlt'] = df['GarageYrBlt'].fillna(df['GarageYrBlt'].median())
    return df

def one_hot_encode(df, skip_features=()):
    for col in [
                'Alley', 'PoolQC', 'Fence', 'MiscFeature',
                'MSSubClass', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
                'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
                'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
                'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
                'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
                'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition']:
        if col in df.columns and col not in skip_features:
            df[col] = df[col].astype('category')

    # Apply one-hot encoding to all categorical variables.
    categorical_columns = df.select_dtypes(include='category').columns.tolist()

    for categoricalVariable in categorical_columns:
        dummy = pd.get_dummies(df[categoricalVariable], prefix=categoricalVariable).astype('category')
        df = pd.concat([df, dummy], axis=1)
        df.drop([categoricalVariable], axis=1, inplace=True)

    return df


def merge_one_hot_encoded_columns(train_df, test_df):
    """
    After one-hot encoding, some columns might exist in the train dataset and not in the test one, or vice versa.
    If a column exists in one of the two dataframes and not in the other, we create it and fill it with zeros.
    """
    clean_test_set = set(list(test_df.columns.values))
    clean_train_set = set(list(train_df.columns.values))

    differences = list(clean_test_set ^ clean_train_set)
    differences.remove('SalePrice')

    for dif in differences:
        if dif not in train_df:
            train_df[dif] = 0
            train_df[dif] = train_df[dif].astype('category')
        if dif not in test_df:
            test_df[dif] = 0
            test_df[dif] = test_df[dif].astype('category')
    return train_df, test_df
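
As a toy illustration of the column mismatch, the sketch below one-hot encodes two tiny, made-up frames and aligns the test columns to the training ones with `DataFrame.reindex`. It is an alternative way to reach the same goal, not the function used in this report, and the frame names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 'Grvl' only appears in the training frame
train_toy = pd.DataFrame({"Street": ["Pave", "Grvl"], "SalePrice": [200000, 120000]})
test_toy = pd.DataFrame({"Street": ["Pave", "Pave"]})

train_enc = pd.get_dummies(train_toy, columns=["Street"])
test_enc = pd.get_dummies(test_toy, columns=["Street"])

# The test frame has no 'Street_Grvl' column after encoding
missing_in_test = set(train_enc.columns) - set(test_enc.columns) - {"SalePrice"}

# Align the test features to the training features, filling absent columns with 0
feature_columns = [col for col in train_enc.columns if col != "SalePrice"]
test_aligned = test_enc.reindex(columns=feature_columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Note that `reindex` also drops columns that exist only in the test frame and fixes the column order, which is usually what a fitted model expects at prediction time.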


## Feature selection and engineering

The next stage of the process is feature selection and engineering. The main goal is to reduce noise in the data by deleting the features that correlate poorly with the target variable, and to transform those that do correlate so that the relationship is easier for the prediction model to "understand".

Since we do not know "a priori" which functions will achieve this, the process has 2 stages. The first one is the definition of a set of candidate functions based on statistical considerations (for example, features that originally had a lot of "na" values contain a lot of noise) and business considerations (for example, the sum of the surfaces is a better price indicator than each surface on its own). The second one, covered in the forward selection section, is choosing which of those functions to apply.

You can see the original set of functions here:

In [None]:
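import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax, skew
from sklearn import linear_model
from sklearn.feature_selection import SelectKBest, f_regression

###############################
# FEATURE ENGINEERING FUNCTIONS
###############################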

def sum_SF(df):
    columns_to_add = ['1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'BsmtFinSF2']
    if pd.Series(columns_to_add).isin(df.columns).all():
        df['House_SF'] = df[columns_to_add].sum(axis=1)
        df.drop(columns_to_add, axis=1, inplace=True)
    return df


def sum_Baths(df):
    bath_features = ['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath']
    if pd.Series(bath_features).isin(df.columns).all():
        df['Total_Baths'] = (df['FullBath'] +
                             df['BsmtFullBath'] +
                             (0.8*df['HalfBath']) +
                             (0.8*df['BsmtHalfBath']))
        df.drop(bath_features, axis=1,inplace = True)
    return df


def sum_Porch(df):
    columns_to_add = ['OpenPorchSF','3SsnPorch','EnclosedPorch','ScreenPorch','WoodDeckSF']
    if pd.Series(columns_to_add).isin(df.columns).all():
        df['Porch_sf'] = df[columns_to_add].sum(axis=1)
        df.drop(columns_to_add, axis=1,inplace=True)
    return df


def feature_skewness(df):
    numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric_features = []
    for i in df.columns:
        if df[i].dtype in numeric_dtypes: 
            numeric_features.append(i)

    # Skewness of every numeric feature, most skewed first
    feature_skew = df[numeric_features].apply(lambda x: skew(x)).sort_values(ascending=False)
    return feature_skew, numeric_features


def fix_skewness(df):
    feature_skew, numeric_features = feature_skewness(df)
    high_skew = feature_skew[feature_skew > 0.9]
    skew_index = high_skew.index
    for i in skew_index:
        df[i] = boxcox1p(df[i], boxcox_normmax(df[i]+1))

    #skew_features = df[numeric_features].apply(lambda x: skew(x)).sort_values(ascending=False)
    #skews = pd.DataFrame({'skew':skew_features})
    return df


def categorical_to_ordinal(df):
    """
    Some textual features (e.g. basement quality) should be handled as numerical (i.e. ordinal) values
    """

    ordinal_features = ['ExterQual', 'BsmtQual', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'PoolQC',
                        'ExterCond', 'BsmtCond', 'GarageCond']
    for ordinalFeature in ordinal_features:
        if ordinalFeature in df:
            df[ordinalFeature].fillna(value=0, inplace=True)
            df[ordinalFeature] = df[ordinalFeature].replace({
                                            'Ex': 5,
                                            'Gd': 4,
                                            'TA': 3,
                                            'Fa': 2,
                                            'Po': 1,
                                            'NA': 0
                                            }).astype('int32')
    if 'Foundation' in df:
        df['Foundation'].fillna(value=0, inplace=True)
        df['Foundation'] = df['Foundation'].replace({
                                            'PConc': 3,
                                            'CBlock': 2,
                                            'BrkTil': 1,
                                            'Slab': 0,
                                            'Stone': 0,
                                            'Wood': 0,
                                            'NA': 0
                                            }).astype('int32')
    return df


def transform_sales_to_log_of_sales(df):
    """
    The distribution of the target values gets closer to a normal distribution after the log transformation
    """
    if 'SalePrice' in df:
        df['SalePrice'] = df['SalePrice'].apply(np.log1p)
    return df


def add_expensive_neighborhood_feature(df):
    """
    Instead of using all the neighborhoods, we use a binary classification: are they located in one of the 5 most
    expensive neighborhoods?
    """
    expensive_neighborhoods = ['Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_StoneBr',
                               'Neighborhood_Somerst', 'Neighborhood_Crawfor']

    for neighborhood in expensive_neighborhoods:
        df.loc[df[neighborhood] == 1, "Expensive_Neighborhood"] = 1
    df["Expensive_Neighborhood"].fillna(0, inplace=True)
    df.drop([col for col in df if col.startswith('Neighborhood')], axis=1, inplace=True, errors="ignore")
    return df


def add_home_quality(df):
    df['HomeQuality'] = df['OverallQual'] + df['OverallCond']
    return df


def add_years_since_last_remodel(df):
    df['YearsSinceLastRemodel'] = df['YrSold'].astype(int) - df['YearRemodAdd'].astype(int)
    return df


def remove_too_cheap_outliers(df):
    new_df = df[df["SalePrice"] > 50000]
    if new_df.shape[0] > 500:
        return new_df
    else:
        return df


def remove_garage_cars_feature(df):
    """
    'GarageCars' feature is related to GarageArea feature, it might be interesting to remove it
    """
    df.drop(['GarageCars'], axis=1, inplace=True)
    return df


def remove_lotfrontage_feature(df):
    df.drop(['LotFrontage'], axis=1, inplace=True)
    return df


def drop_empty_features(df):
    """
    Drop features 'Alley', 'PoolQC', 'Fence' and 'MiscFeature', which are almost empty
    """
    df.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], axis=1, inplace=True, errors='ignore')
    return df


def remove_under_represented_features(df):
    """
    Eliminate those columns with most of the information belonging to the same class
    """
    under_rep = []
    for i in df.columns:
        if i != 'SalePrice':
            counts = df[i].value_counts()
            zeros = counts.iloc[0]
            if ((zeros / len(df)) * 100) > 99.0:
                under_rep.append(i)
    #not_dropped_features = set(df.columns) - set(under_rep)
    df.drop(under_rep, axis=1, inplace=True)
    return df


def feature_selection_lasso(df):
    """
    Use Lasso to select the most meaningful features
    """
    clf = linear_model.Lasso(alpha=0.01)
    X = df.drop(['SalePrice'], axis=1)
    y = df.SalePrice.reset_index(drop=True)
    clf.fit(X, y)
    zero_indexes = np.where(clf.coef_ == 0)[0]
    #not_dropped_features = set(df.columns) - set(zero_indexes)
    if len(df.columns) - len(zero_indexes) > 5:
        df.drop(X.columns[zero_indexes], axis=1, inplace=True)
    return df


def f_regression_feature_filtering(df):
    """
    Select the 18 best features to the target using f-test regression
    (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)
    """
    X = df.drop(['SalePrice'], axis=1)
    y = df.SalePrice.reset_index(drop=True)
    best_features_indexes = SelectKBest(k=18, score_func=f_regression).fit(X, y).get_support(indices=True)
    filtered_features = df.filter(items=X.columns[best_features_indexes], axis=1)
    return filtered_features.join(df.SalePrice)



def drop_categories(df):
    categorical_columns = df.select_dtypes(include='category').columns.tolist()

    for categoricalVariable in categorical_columns:
        if categoricalVariable not in ['ExterQual_Ex', 'ExterQual_Gd', 'ExterQual_TA', 'ExterQual_Fa', 'ExterQual_Po',
                                       'KitchenQual_Ex', 'KitchenQual_Gd', 'KitchenQual_TA', 'KitchenQual_Fa',
                                       'KitchenQual_Po']:
            df.drop([categoricalVariable], axis=1, inplace=True, errors='ignore')

    return df
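
To make the effect of `transform_sales_to_log_of_sales` more concrete, here is a small sketch with made-up prices. It shows that `np.log1p` is undone by `np.expm1` (which is why the final prediction is passed through `np.expm1`), and that on the log scale a 10% error weighs roughly the same for a cheap house as for an expensive one. The price values are hypothetical.

```python
import numpy as np

# Hypothetical sale prices, including one very expensive house
prices = np.array([50_000.0, 120_000.0, 180_000.0, 755_000.0])

log_prices = np.log1p(prices)      # what the model is trained on
restored = np.expm1(log_prices)    # inverse transform applied to predictions

# A 10% overestimate on a cheap and on an expensive house
diff_cheap = np.log1p(110_000.0) - np.log1p(100_000.0)
diff_expensive = np.log1p(550_000.0) - np.log1p(500_000.0)
print(diff_cheap, diff_expensive)  # both are close to log(1.1)
```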
 

### Forward selecting the best feature engineering function combinations
Once the functions are ready, we have to select the group of them (and the features they apply to) that leads to the best result of the model.
The number of possible combinations grows exponentially with the number of functions, so brute force is not feasible in a reasonable time. We use a technique known as "forward selection" instead, which takes us to a local optimum of the search space in a reasonable computation time. We start with the empty set of functions. In each iteration, every combination kept so far is extended with each function not yet in it, and the model is trained and validated on the result. The 3 best-scoring combinations of the iteration are kept and "fixed" as the starting points for the next iteration.


In pseudo-code:



In [None]:
FUNCTIONS_FROM_PREVIOUS_STEP <- [[]]
FOR _ IN ALL_FUNCTIONS:
    FUNCTIONS_FROM_CURRENT_STEP <- []
    FOR FIXED_FUNCTIONS IN FUNCTIONS_FROM_PREVIOUS_STEP:
        FOR OTHER_FUNCTION IN (ALL_FUNCTIONS - FIXED_FUNCTIONS):
            SCORE <- EVALUATE(FIXED_FUNCTIONS + OTHER_FUNCTION)
            IF LEN(FUNCTIONS_FROM_CURRENT_STEP) < 3 OR SCORE > WORST_SCORE(FUNCTIONS_FROM_CURRENT_STEP):
                ADD (FIXED_FUNCTIONS + OTHER_FUNCTION) TO FUNCTIONS_FROM_CURRENT_STEP, KEEPING ONLY THE 3 BEST
    FUNCTIONS_FROM_PREVIOUS_STEP <- FUNCTIONS_FROM_CURRENT_STEP

In [None]:


from sklearn import ensemble
from sklearn.model_selection import cross_val_score

# all_fe_functions, fe_functions_only_for_training_set and dynamic_feature_selection_functions
# are lists of function names (strings) that are expected to be defined before this cell runs.

# Allow info in bigger dataframes
pd.options.display.max_info_columns = 350

functions_kept_per_step = 3
functions_kept_from_previous_step = [([], None)]
for _ in all_fe_functions:
    functions_kept_from_current_step = []
    while len(functions_kept_from_previous_step) > 0:
        functions_kept = functions_kept_from_previous_step.pop()[0]
        other_functions = set(all_fe_functions) - set(functions_kept)
        for other_function in other_functions:
            functions_to_evaluate = functions_kept + [other_function]

            try:
                print("\nStarting functions {}".format(functions_to_evaluate))
                # Load data set
                train = pd.read_csv('train.csv').set_index('Id')
                test = pd.read_csv('test.csv').set_index('Id')

                # in some cases, there are specific features that we do not want to one-hot encode
                skip_one_hot_encode_features = []
                if 'categorical_to_ordinal' in functions_to_evaluate:
                    skip_one_hot_encode_features = ['PoolQC']

                # Data preparation
                clean_train = one_hot_encode(fill_na_values(train), skip_one_hot_encode_features)
                clean_test = one_hot_encode(fill_na_values(test), skip_one_hot_encode_features)
                clean_train, clean_test = merge_one_hot_encoded_columns(clean_train, clean_test)

                # Feature engineering
                for fe_function in functions_to_evaluate:
                    clean_train = globals()[fe_function](clean_train)
                    if fe_function in fe_functions_only_for_training_set:
                        continue
                    elif fe_function in dynamic_feature_selection_functions:
                        # some functions remove features dynamically, we need to apply the same changes to the test data set
                        clean_test = clean_test[clean_train.drop('SalePrice', axis=1).columns]
                    else:
                        clean_test = globals()[fe_function](clean_test)

                X = clean_train.loc[:, clean_train.columns != 'SalePrice']
                y = clean_train.loc[:, 'SalePrice']

                # Model used to compare the combinations (mean R^2 over 5 folds)
                regr = ensemble.GradientBoostingRegressor()
                score = cross_val_score(regr, X, y, cv=5, n_jobs=-1).mean()
            except Exception as error:
                # A combination that cannot be evaluated is skipped
                print("Skipping {}: {}".format(functions_to_evaluate, error))
                continue

            # if we haven't yet kept 3 sets of functions in this step, or this score is better than the worst set kept
            if len(functions_kept_from_current_step) < functions_kept_per_step:
                functions_kept_from_current_step.append((functions_to_evaluate, score))
            elif score > functions_kept_from_current_step[-1][1]:
                functions_kept_from_current_step[-1] = (functions_to_evaluate, score)
            functions_kept_from_current_step.sort(key=lambda tup: tup[1], reverse=True)

    # the combinations kept in this step are the starting points of the next one
    functions_kept_from_previous_step = functions_kept_from_current_step
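
To see the search logic in isolation, the following sketch runs the same kind of forward selection on a toy problem. The function name `beam_forward_selection`, the candidate names and the additive `toy_score` are all hypothetical stand-ins: in the report, the score comes from cross-validating a model after applying the feature engineering functions. Unlike the loop above, this sketch treats a combination as a set (order is ignored) and remembers the best combination seen across all steps.

```python
def beam_forward_selection(candidates, evaluate, beam_width=3):
    """Grow combinations one element at a time, keeping the best `beam_width` per step."""
    beams = [()]                      # start from the empty combination
    best = ((), float("-inf"))        # best (combination, score) seen so far
    for _ in candidates:
        scored = {}
        for kept in beams:
            for other in candidates:
                if other in kept:
                    continue
                combo = kept + (other,)
                key = frozenset(combo)    # assumption: order of application is irrelevant
                if key not in scored:
                    scored[key] = (combo, evaluate(combo))
        if not scored:
            break
        ranked = sorted(scored.values(), key=lambda item: item[1], reverse=True)
        beams = [combo for combo, _ in ranked[:beam_width]]
        if ranked[0][1] > best[1]:
            best = ranked[0]
    return best


# Hypothetical gains: 'c' hurts the score, the others help
toy_gains = {"a": 0.3, "b": 0.2, "c": -0.1, "d": 0.05}

def toy_score(combo):
    return sum(toy_gains[name] for name in combo)

best_combo, best_score = beam_forward_selection(list(toy_gains), toy_score)
print(sorted(best_combo), round(best_score, 2))
```

With this toy score the search settles on the three helpful candidates and leaves out the harmful one. With a real, non-additive score the result is only a local optimum, as noted in the text.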


## Model selection and tuning

This stage consists of finding the optimal values for the hyperparameters of the models that we are going to use. These models are:

* Linear Regression  
* Lasso Regression
* Ridge Regression
* Decision Trees
* Random Forest
* Gradient Boosting (GBoost)
* XGBoost

Using GridSearchCV with cross-validation, we define a range of values for each hyperparameter of each model and try every combination. The best-performing combination for each model is saved:

In [2]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, Lasso, RidgeCV, Ridge
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import xgboost as xgb


lassoCV = LassoCV(alphas=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1], cv=5, random_state=1)

bestLasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0001, random_state=1))


bestLinear = make_pipeline(RobustScaler(), LinearRegression())


parameters = {'n_estimators': [500, 1000, 2000, 3000, 5000], 
              'learning_rate': [0.05, 0.1, 0.5, 1],
              'max_depth': [3, 4, 5],
              'min_samples_leaf': [5, 10, 15, 20],
              'min_samples_split': [2, 5, 7, 10]}

GBoost = GridSearchCV(GradientBoostingRegressor(max_features='sqrt', loss='huber', random_state=1), parameters, cv=5)
GBoost.fit(X, y)
GBoost.best_estimator_


bestGBoost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.05,
                                   max_depth=5, max_features='sqrt',
                                   min_samples_leaf=10, min_samples_split=7, 
                                   loss='huber', random_state=1)

ridge = RidgeCV(alphas=(0.001, 0.005, 0.1, 0.5, 1))
ridge.fit(X, y)
ridge.alpha_

bestRidge = Ridge(alpha=1)


parameters = {'max_depth': [3, 5, 10],
              'min_samples_leaf': [5, 10, 15, 20],
              'min_samples_split': [2, 5, 7, 10],
              # None uses all features (the former 'auto' option, removed in recent scikit-learn releases)
              'max_features': [None, 'sqrt', 'log2']}
decisionTree = GridSearchCV(DecisionTreeRegressor(random_state=1), parameters, cv=5)
decisionTree.fit(X, y)
decisionTree.best_estimator_


# Best estimator found; parameters that only repeated the defaults are omitted
bestDecisionTree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=15,
                                         min_samples_split=2, random_state=1)


parameters = {
              'max_depth': [3, 5, 7, 10, 50, None],
              'min_samples_leaf': [5, 10, 15, 20],
              'min_samples_split': [2, 5, 7, 10],
              # None uses all features (the former 'auto' option, removed in recent scikit-learn releases)
              'max_features': [None, 'sqrt', 'log2']}

randomForest = GridSearchCV(RandomForestRegressor(random_state=1), parameters, cv=5)
randomForest.fit(X, y)
randomForest.best_estimator_

# Best estimator found; parameters that only repeated the defaults are omitted
bestRandomForest = RandomForestRegressor(n_estimators=100, max_depth=10, max_features='log2',
                                         min_samples_leaf=5, min_samples_split=2,
                                         random_state=1)

parameters = {'max_depth': [1, 3, 5],
             'learning_rate': [0.05, 0.1, 0.5, 1],
             'gamma': [0.01, 0.05, 0.1],
             'min_child_weight': [0.5, 1, 3],
             'reg_alpha': [0.1, 0.5, 1],
             'reg_lambda': [0.1, 0.5, 1],
             'colsample_bytree': [0.1, 0.5, 1]}
xgbReg = GridSearchCV(xgb.XGBRegressor(verbosity=0, n_jobs=-1, random_state=1), parameters, cv=5)
xgbReg.fit(X, y)
xgbReg.best_estimator_


bestXGB = xgb.XGBRegressor(n_estimators=2500,
                           learning_rate=0.05, max_depth=3,
                           colsample_bytree=0.5, gamma=0.05, 
                           min_child_weight=1, reg_alpha=0.5,
                           reg_lambda=1, subsample=0.5,
                           random_state=1, n_jobs=-1)
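
The same search pattern can be tried quickly on synthetic data. The sketch below is only an illustration: the data comes from `make_regression` rather than the house prices dataset, and the grid of `alpha` values is an arbitrary choice. It shows where the selected hyperparameters and the cross-validated score can be read after fitting.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (stand-in for the prepared X and y)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=1)

# Every value in the grid is evaluated with 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]   # winning hyperparameter value
best_cv_score = search.best_score_          # mean cross-validated R^2 of the winner
best_model = search.best_estimator_         # model refit on all the data with the winning value
print(best_alpha, round(best_cv_score, 3))
```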


## Ensemble methods

After the previous stage, each of the models has been optimized. One option would be to select the best-performing model, but it has been shown empirically that a linear combination of models, including weaker ones, often gives better results than the best individual model.

That said, our strategy is to try different linear combinations of all the models and check whether any of them gives a better result than the best individual model.

In [551]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [4]:
bestModels = [bestXGB, bestLasso, bestLinear, bestRidge, bestGBoost, bestRidge, bestDecisionTree, bestRandomForest]
for model in bestModels:
    model.fit(X_train, y_train)


In [482]:

import itertools
from sklearn.metrics import r2_score

candidate_weights = [0, 0.1, 0.5, 1]

# The predictions do not depend on the weights, so compute them once (one column per model)
predictions = np.column_stack([
    model.predict(X_test) for model in bestModels
])

all_results = []
for weights in itertools.product(candidate_weights, repeat=len(bestModels)):
    if np.sum(weights) == 0:
        continue
    # Weighted average of the model predictions
    weighted_predictions = predictions * weights
    summed_predictions = np.sum(weighted_predictions, axis=1)
    result_predictions = summed_predictions / np.sum(weights)
    score = r2_score(y_test, result_predictions)
    all_results.append((weights, score))

all_results.sort(key=lambda tup: tup[1], reverse=True)
all_results
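
The weighting arithmetic used in this search can be checked by hand on a tiny example. The numbers below are made up: two samples, two models, and weights 1 and 3. The sketch also compares the result with `np.average`, which computes the same normalised weighted mean.

```python
import numpy as np

# Rows are samples, columns are models (hypothetical predictions)
toy_predictions = np.array([[10.0, 20.0],
                            [30.0, 40.0]])
toy_weights = (1, 3)

# Broadcasting multiplies each column by the weight of its model
weighted = toy_predictions * toy_weights
combined = np.sum(weighted, axis=1) / np.sum(toy_weights)

# Sample 1: (10*1 + 20*3) / 4 = 17.5, sample 2: (30*1 + 40*3) / 4 = 37.5
print(combined)
same_as_average = np.average(toy_predictions, axis=1, weights=toy_weights)
```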



## Final prediction

After all these stages we found that several combinations of models outperform the best individual model. Among all those possibilities, the one that works best is composed of: 
* XGBoost
* Ridge (in a small percentage)
* GBoost
* Random Forest

As the weights in the cells that follow show, it relies on XGBoost more than on any other model. We could say that our final model is an XGBoost with a second-order approximation using Ridge, GBoost and Random Forest.

In [559]:
bestModels = [bestXGB, bestLasso, bestLinear, bestRidge, bestGBoost, bestRidge, bestDecisionTree, bestRandomForest]
for model in bestModels:
    model.fit(X, y)

In [560]:
best_weights = (2, 0, 0, 0.1, 1.1, 0.1, 0, 0.1)
predictions = np.column_stack([
        model.predict(clean_test) for model in bestModels
    ])
weighted_predictions = predictions * best_weights
summed_predictions = np.sum(weighted_predictions, axis=1)
result_predictions = summed_predictions / np.sum(best_weights)

result_predictions = np.expm1(result_predictions)
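
To relate these weights to the description of the final model, the share of each model in the weighted average can be computed directly. This is a small sketch: the list `model_names` simply mirrors the order of `bestModels` as strings, and the two entries for `bestRidge` are added together.

```python
# Same order as bestModels (note that bestRidge appears twice)
model_names = ["bestXGB", "bestLasso", "bestLinear", "bestRidge",
               "bestGBoost", "bestRidge", "bestDecisionTree", "bestRandomForest"]
best_weights = (2, 0, 0, 0.1, 1.1, 0.1, 0, 0.1)

total_weight = sum(best_weights)
shares = {}
for name, weight in zip(model_names, best_weights):
    # Accumulate so that repeated models get their combined share
    shares[name] = shares.get(name, 0.0) + weight / total_weight

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    if share > 0:
        print("{}: {:.1%}".format(name, share))
```

With these weights XGBoost contributes a little under 60% of the final prediction and GBoost roughly a third, while Ridge and Random Forest together account for the remaining tenth or so.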

In [561]:

clean_test['SalePrice'] = result_predictions
submission = clean_test[['SalePrice']]
submission.to_csv('stacked.csv')
