> At the time of writing this notebook (26/10/20), my rank is `125`.
### An upvote would be much appreciated and will keep me motivated :)
# House Prices: Advanced Regression Techniques using Ensemble Learning

### Contents

* 1.Importing Dataset
* 2.Data Analysis
    * 2.1 Corr Plot
    * 2.2 Analyzing Categorical Data
    * 2.3 Analyzing Numerical Data
* 3.Feature Engineering
    * 3.1 Missing Data
    * 3.2 Encoding Categorical Data
    * 3.3 Create New Features
    * 3.4 Analyzing New Features
    * 3.5 Skewness and Kurtosis
    * 3.6 Scaling Data
    * 3.7 Feature Importance using Lasso Coefficients
* 4.Model
    * 4.1 First let's define a function for hyperparameter tuning
    * 4.2 XG Boost Regressor
    * 4.3 Random Forest Regressor
    * 4.4 Support Vector Machine
    * 4.5 Ridge Regressor
    * 4.6 LGBM Regressor
    * 4.7 Gradient Boosting Regressor
    * 4.8 Cat Boost Regressor
* 5.Final Prediction

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Importing Dataset

In [1]:
# training set
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
train.head()

In [1]:
# test set
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
test.head()

In [1]:
print('Shape of Training Set : {}'.format(train.shape))
print('Number of training data points : {}\n'.format(len(train)))
print('Shape of Test Set : {}'.format(test.shape))
print('Number of test data points : {}\n'.format(len(test)))
print('Columns : {}'.format(train.columns))

# 2. Let's Analyze
## 2.1 Corr Plot

In [1]:
plt.figure(figsize=(17, 17))
g = sns.heatmap(train.drop('Id', axis=1).corr(), annot=True, cmap='coolwarm',  square=True, fmt='.1f')

**Observations:**
In the heatmap, blue cells mark weak or negative correlations and red cells mark strong positive ones.
* The overall condition of the house seems to matter less for the price than I expected.
* There is a strong relationship between the overall quality of a house and its sale price.
* Above-grade living area (`GrLivArea`) is a strong indicator of sale price.
* Garage features, the number of baths and rooms, the age of the building, etc. are important too.
* There are some obvious relationships we are going to skip, such as total square footage vs. number of rooms, or garage car capacity vs. garage area.
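
To read the heatmap more easily, the correlations with `SalePrice` can be pulled out and sorted. The sketch below uses a small synthetic frame: the `toy` name and all values are made up for illustration, and only the column names come from the dataset, so treat it as a pattern rather than real results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training frame (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, size=n)
area = rng.normal(1500, 400, size=n)
cond = rng.integers(1, 10, size=n)          # unrelated to the price by construction
price = 20000 * qual + 60 * area + rng.normal(0, 20000, size=n)
toy = pd.DataFrame({'OverallQual': qual, 'GrLivArea': area,
                    'OverallCond': cond, 'SalePrice': price})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (toy.corr()['SalePrice']
                    .drop('SalePrice')
                    .sort_values(key=abs, ascending=False))
print(corr_with_target)
```

Running the same expression on the numeric columns of `train` should give the ranking that the heatmap shows in its `SalePrice` row.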

## 2.2 Let's Look at Categorical Data

In [1]:
fig, ax = plt.subplots(14, 3, figsize=(25,80))
ax = ax.flatten()

for i,j in zip(train.select_dtypes(include=['object']).columns, ax):
    srtd = train.groupby(i)['SalePrice'].median().sort_values(ascending=False)
    sns.boxplot(x=i,
                y='SalePrice',
                data=train,
                order=srtd.index,
                ax=j
               )
    j.tick_params(labelrotation=45)
    plt.tight_layout()

**Observations:**

* MSZoning
    * Residential high and low density zones look similar, while commercial is the cheapest.


* LandContour
    * Hillside houses seem a little more expensive than the rest.
    * Banked houses are the cheapest.


* Neighborhood
    * Northridge Heights and Northridge are the most expensive places for houses.
    * Timberland, Somerset, Veenker, Crawford, Clear Creek, College Creek and Bloomington Heights seem above average.
    * Sawyer West has a wide range of prices compared with similarly priced neighborhoods.
    * Old Town and Edwards have some outlier prices, but they are generally below average.
    * Briardale, Iowa DOT and Rail Road, and Meadow Village seem to be the cheapest places for houses.
   

* Conditions
    * Despite a wide range of values, being close to the North-South Railroad seems to have a positive effect on the price.
    * Being near or adjacent to a positive off-site feature (park, greenbelt, etc.) increases the price.
    * These values are pretty similar, but we can still get some useful information from them.
    
    
* MasVnrType: houses with stone masonry veneer seem better priced than houses with brick.

* Quality features: there are many categorical quality values that affect the price to some degree. We are going to quantify them so we can create new features based on them, so we don't dive deep into them in this part.

* CentralAir: having central air conditioning has a decent positive effect on sale prices.


* GarageType
    * Houses with a built-in garage (garage part of the house, typically with a room above it) are the most expensive ones.
    * Attached garages follow the built-in ones.
    * Car ports are the cheapest.
* Misc: sale type has some effect on the prices, but we won't get into details here. By the way, it seems that having a tennis court really adds to the price of a house. Who would have known?

## 2.3 Analyzing Numerical Data

In [1]:
fig, ax = plt.subplots(12, 3, figsize=(25,80))
ax = ax.flatten()

for i,j in zip(train.select_dtypes(include=['number']).columns, ax):
    
    sns.regplot(x=i,
                y='SalePrice',
                data=train,
                ax=j,
                ci=None,
                line_kws={'color': 'black'},
                scatter_kws={'alpha':0.4}
               )
    j.tick_params(labelrotation=45)
    plt.tight_layout()

**Some Observations:**
* OverallQual: it's clearly visible that the sale price of a house increases with overall quality. This confirms the correlation in the first table we made at the beginning (Pearson correlation was 0.8).

* OverallCond: overall condition looks left skewed, with most houses around 5/10. But it doesn't affect the price the way the quality indicator does.

* YearBuilt: again, new buildings are generally more expensive than old ones.

* Basement: the general trend shows that bigger basements increase the price, but I see some outliers there.

* GrLivArea: this feature is pretty linear, but we can spot two outliers affecting the trend. There are some huge houses with pretty cheap prices. There might be a reason behind it, but we had better drop them.
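
The notebook itself does not drop those `GrLivArea` outliers later on, so here is a minimal sketch of how it could be done. The toy rows and both cut-off values (`AREA_LIMIT`, `PRICE_LIMIT`) are hypothetical and picked by eye for illustration; they are not taken from the data.

```python
import pandas as pd

# Toy rows standing in for `train`; the last two mimic huge but cheap houses.
toy = pd.DataFrame({
    'GrLivArea': [1200, 1800, 2500, 4700, 5600],
    'SalePrice': [150000, 210000, 320000, 185000, 160000],
})

# Hypothetical thresholds, not values used anywhere in this notebook.
AREA_LIMIT = 4000
PRICE_LIMIT = 300000

# Flag rows that are both very large and unusually cheap, then keep the rest.
is_outlier = (toy['GrLivArea'] > AREA_LIMIT) & (toy['SalePrice'] < PRICE_LIMIT)
cleaned = toy.loc[~is_outlier].reset_index(drop=True)
print(f'Dropped {is_outlier.sum()} rows, {len(cleaned)} remain')
```

If something like this were applied to `train`, it would have to happen before the target is split off, so that features and target stay aligned.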

### Let's Separate Sale Prices from the data and make a set of target values

In [1]:
y = train['SalePrice']
train.drop('SalePrice', axis=1, inplace=True)

In [1]:
# combining train and test for feature engineering
data = pd.concat([train, test])
data.drop('Id', axis=1, inplace=True)
data.shape
data.info()

# 3. Feature Engineering
## 3.1 Missing Data

In [1]:
nan_categorical = []
nan_num = []
missing = {}
tot = len(data)

for i in data.columns:
    var = data[i].isnull().sum()
    if (var!=0):
        missing.update({i:var/tot*100})
        if (data[i].dtype=='object'):
            nan_categorical.append(i)
        else:
            nan_num.append(i)

In [1]:
missing = pd.DataFrame.from_dict(missing, orient='index', columns=['Percentage'])
missing.sort_values(by='Percentage', ascending=False, inplace=True)
missing = missing.T

plt.figure(figsize=(20,4))
g = sns.barplot(data=missing, palette='Reds_r')
plt.ylabel('Percentage')
plt.xticks(rotation=90)

display(missing.style.background_gradient(cmap='Reds', axis=1))

There is a lot of missing data in the dataset, especially `PoolQC` with about 99% missing, followed closely by `MiscFeature`, `Alley`, `Fence`, etc.

In [1]:
print("Categorical columns with NaN values \n",nan_categorical)
print("\n Int or float columns with NaN values \n",nan_num)
print("\n Total columns with NaN values:",len(nan_categorical)+len(nan_num) )

I have separated the columns with NaN values into categorical and numerical ones, which makes it easier to look them up in the **Data Description** file provided. 
Take a look at the `data_description.txt` file to better understand how the missing data should be filled.

**This is how we are going to fix most of the missing data:**

* First, in the columns where NaN means 'None' (the feature does not exist), we replace NaN with 'None'.
* Then, in the numerical columns where a missing value indicates that there is no parent feature to measure, we replace NaN with 0.
* Even after that, some data is actually missing. By checking the general trends of these features, we can fill them with the most frequent value (the mode).
* MSZoning and LotFrontage are a little bit tricky. I chose to fill MSZoning with the most common value within the same MSSubClass, and LotFrontage with the median within the same Neighborhood. It's not perfect, but at least we decrease the randomness a little bit.
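
Here is a tiny worked example of the three kinds of fill described above, on a made-up frame (the `toy` name and its six rows are hypothetical; only the column names are borrowed from the dataset). It is a sketch of the pattern, not the notebook's actual cleaning step.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'LotFrontage':  [60.0, np.nan, 80.0, 30.0, 40.0, np.nan],
    'Alley':        [np.nan, 'Grvl', np.nan, 'Pave', np.nan, np.nan],
    'GarageArea':   [400.0, np.nan, 520.0, 0.0, 300.0, np.nan],
})

# NaN means "feature absent" -> explicit 'None' category.
toy['Alley'] = toy['Alley'].fillna('None')

# NaN means "nothing to measure" -> 0.
toy['GarageArea'] = toy['GarageArea'].fillna(0)

# Genuinely missing -> median of the same neighborhood.
toy['LotFrontage'] = (toy.groupby('Neighborhood')['LotFrontage']
                      .transform(lambda s: s.fillna(s.median())))
print(toy)
```

The missing frontage in neighborhood `A` becomes 70 (the median of 60 and 80) and the one in `B` becomes 35, so each gap is filled from houses nearby rather than from the whole dataset.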


In [1]:
# columns to be filled by none in NA
none_columns = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu', 'GarageType', 
                'GarageFinish', 'GarageQual', 'GarageCond',  'PoolQC',  'Fence', 'MiscFeature', 'MasVnrType']

# int columns to be filled by 0
zero_columns = [ 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 
                'GarageYrBlt', 'GarageCars', 'GarageArea']

# list of columns with Nan values to be replaced by mode 
mode_columns = ['Utilities', 'Exterior1st', 'Exterior2nd', 'Electrical', 'KitchenQual', 'Functional', 'SaleType']

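# Assign the result back instead of calling fillna(inplace=True) on a column selection;
# with pandas Copy-on-Write the in-place call would not update `data`.
for i in none_columns:
    data[i] = data[i].fillna('None')

for i in zero_columns:
    data[i] = data[i].fillna(0)

for i in mode_columns:
    data[i] = data[i].fillna(data[i].mode()[0])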
    
# Filling MSZoning according to MSSubClass.
data['MSZoning'] = data.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
    
# Filling LotFrontage according to Neighborhood.
data['LotFrontage'] = data.groupby(['Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))

In [1]:
# Features which numerical on data but should be treated as category.

data['MSSubClass'] = data['MSSubClass'].astype(object)

data['YrSold'] = data['YrSold'].astype(object)

data['MoSold'] = data['MoSold'].astype(object)

### Double check for nan values

In [1]:
print(f'Number of missing values: {data.isnull().sum().sum()}')

## 3.2 Encoding Categorical Data
The values given to the categorical variables are based on observations made from the box plots.

In [1]:
#Excellent, Good, Typical, Fair, Poor, None: Convert to 0-5 scale
cols_ExGd = ['ExterQual','ExterCond','BsmtQual','BsmtCond',
             'HeatingQC','KitchenQual','FireplaceQu','GarageQual',
             'GarageCond','PoolQC']

dict_ExGd = {'Ex':5,'Gd':4,'TA':3,'Fa':2,'Po':1,'None':0}

for col in cols_ExGd:
    # assign back and cast explicitly so the column ends up numeric
    data[col] = data[col].replace(dict_ExGd).astype(int)

#display(data[cols_ExGd].head(5))
dict_nigh = {'NridgHt': 8, 'NoRidge' : 8, 'StoneBr' : 8, 
             'Timber' : 7, 'Somerst' : 7, 'Veenker' : 7, 'Crawfor' : 7, 
             'ClearCr' : 7, 'CollgCr' : 7, 'Blmngtn' : 7,
             'Gilbert' : 6, 
             'SawyerW' : 5, 
             'Mitchel' : 4, 
             'NPkVill' : 3, 'NAmes': 3, 'NWAmes' : 3, 'SWISU' : 3, 'Blueste' : 3, 
             'Sawyer' : 2, 'BrkSide' : 2, 'Edwards' : 2, 'OldTown' : 2,
             'BrDale' : 1, 'IDOTRR' : 1, 'MeadowV' : 1,
             }
data['Neighborhood'] = data['Neighborhood'].replace(dict_nigh).astype(int)

# Remaining columns (assigned back and cast to int so they are treated as numeric)
data['BsmtExposure'] = data['BsmtExposure'].replace({'Gd':4,'Av':3,'Mn':2,'No':1,'None':0}).astype(int)

data['CentralAir'] = data['CentralAir'].replace({'Y':1,'N':0}).astype(int)

data['Functional'] = data['Functional'].replace({'Typ':7,'Min1':6,'Min2':5,'Mod':4,'Maj1':3,'Maj2':2,'Sev':1,'Sal':0}).astype(int)

data['GarageFinish'] = data['GarageFinish'].replace({'Fin':3,'RFn':2,'Unf':1,'None':0}).astype(int)

data['LotShape'] = data['LotShape'].replace({'Reg':3,'IR1':2,'IR2':1,'IR3':0}).astype(int)

data['Utilities'] = data['Utilities'].replace({'AllPub':3,'NoSewr':2,'NoSeWa':1,'ELO':0}).astype(int)

data['LandSlope'] = data['LandSlope'].replace({'Gtl':2,'Mod':1,'Sev':0}).astype(int)

bsm_dict = {'None': 0,
            'Unf': 1,
            'LwQ': 2,
            'Rec': 3,
            'BLQ': 4,
            'ALQ': 5,
            'GLQ': 6
           }
data['BsmtFinType1'] = data['BsmtFinType1'].replace(bsm_dict).astype(int)
data['BsmtFinType2'] = data['BsmtFinType2'].replace(bsm_dict).astype(int)

## 3.3 Let's Create Some New Interesting Features

In [1]:
data['HasAlley'] = data['Alley'].apply(lambda x: 1 if x!='None' else 0)

data['HasFence'] = data['Fence'].apply(lambda x: 1 if x!='None' else 0)

data['HasBsmt'] = data['BsmtQual'].apply(lambda x: 1 if x>0 else 0)

data['HasGarage'] = data['GarageType'].apply(lambda x: 1 if x!='None' else 0)

data['HasFirePlace'] = data['FireplaceQu'].apply(lambda x: 1 if x>0 else 0)

data['HasPool'] = data['PoolArea'].apply(lambda x: 1 if x>0 else 0)

data['Has2ndFloor'] = data['2ndFlrSF'].apply(lambda x: 1 if x>0 else 0)

# Merging quality and conditions.
data['TotalExtQual'] = (data['ExterQual'] + data['ExterCond'])

data['TotalBsmQual'] = (data['BsmtQual'] + data['BsmtCond'] + data['BsmtFinType1'] + data['BsmtFinType2'])

data['TotalGrgQual'] = (data['GarageQual'] + data['GarageCond'])

data['TotalQual'] = (data['OverallQual'] + data['TotalExtQual'] + data['TotalBsmQual'] + data['TotalGrgQual']+ 
                     data['KitchenQual'] + data['HeatingQC']
                    )
# Creating new data by using new quality indicators.
data['QualGr'] = data['TotalQual'] * data['GrLivArea']

data['QualBsm'] = data['TotalBsmQual'] * (data['BsmtFinSF1'] + data['BsmtFinSF2'])

data['QualExt'] = data['TotalExtQual'] * data['MasVnrArea']

data['QualGrg'] = data['TotalGrgQual'] * data['GarageArea']

data['QualSFNg'] = data['QualGr'] * data['Neighborhood']

# creating features
data['TotalSF'] = (data['BsmtFinSF1'] + data['BsmtFinSF2'] + data['1stFlrSF'] + data['2ndFlrSF'])

data['TotalBathrooms'] = (data['FullBath'] + (0.5 * data['HalfBath']) +
                          data['BsmtFullBath'] + (0.5 * data['BsmtHalfBath'])
                         )
data['TotalPorchSF'] = (data['OpenPorchSF'] + data['3SsnPorch'] + data['EnclosedPorch'] +
                        data['ScreenPorch'] + data['WoodDeckSF']
                       )
data['YearBlRm'] = (data['YearBuilt'] + data['YearRemodAdd'])

## 3.4 Analyzing New Features

In [1]:
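new_features = ['TotalExtQual', 'TotalBsmQual', 'TotalGrgQual', 'TotalQual', 'QualGr', 
                'QualBsm', 'QualExt','QualGrg', 'QualSFNg', 'TotalSF', 'TotalBathrooms', 'TotalPorchSF', 'YearBlRm'
               ]
# 13 features need a 4x4 grid; with 3x4 the zip below would silently skip the last one
fig, ax = plt.subplots(4, 4, figsize=(16,16))
ax = ax.flatten()
# join the target to the training rows only (train and test share index labels after concat)
features = data.iloc[:len(train)].join(y)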
for i,j in zip(new_features, ax):
    
    sns.regplot(x=i,
                y='SalePrice',
                data=features,
                ax=j,
                ci=None,
                line_kws={'color': 'black'},
                scatter_kws={'alpha':0.4}
               )
    j.tick_params(labelrotation=45)
    plt.tight_layout()

## 3.5 Skewness and Kurtosis

**Skewness**

* Skewness is the degree of distortion from the symmetrical bell curve (the normal curve).
* A symmetrical distribution has a skewness of 0.
* There are two types of skewness: **positive and negative.**
* **Positive skewness** (as in our target variable's distribution) means the tail on the right side of the distribution is longer and fatter.
* With **positive skewness**, the mean and median are greater than the mode, as in this dataset. This means more houses were sold for less than the average price.
* **Negative skewness** means the tail on the left side of the distribution is longer and fatter.
* With **negative skewness**, the mean and median are less than the mode.
* In short, skewness tells us how much the extreme values in one tail outweigh those in the other.

**Kurtosis** According to Wikipedia,

In probability theory and statistics, kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. In other words, it measures the extreme values (outliers) present in the distribution.

* There are three types of kurtosis: **mesokurtic, leptokurtic, and platykurtic.**
* Mesokurtic: similar to the normal curve, which has a kurtosis of 3 (an excess kurtosis of 0). The extreme values of such a distribution are similar to those of a normal distribution.
* Leptokurtic: heavier tails than a normal distribution, so more extreme values. Examples of leptokurtic distributions are the t-distributions with small degrees of freedom.
* Platykurtic: thinner tails than a normal distribution. Because of its thin tails, such a distribution has fewer outliers (e.g., extreme values three or more standard deviations from the mean) than mesokurtic and leptokurtic distributions.
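
To make these definitions concrete, here is a small numeric check on simulated data (the two samples are made up; nothing here touches the house data). It compares a normal sample with a right-skewed lognormal one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
normal = pd.Series(rng.normal(size=100_000))
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=100_000))

# pandas reports *excess* kurtosis, so a normal sample comes out near 0, not 3.
print('normal:       skew=%.2f  excess kurtosis=%.2f'
      % (normal.skew(), normal.kurt()))
print('right-skewed: skew=%.2f  excess kurtosis=%.2f'
      % (right_skewed.skew(), right_skewed.kurt()))

# With a long right tail the mean is pulled above the median.
print('right-skewed: mean=%.3f  median=%.3f'
      % (right_skewed.mean(), right_skewed.median()))
```

Keep in mind that `Series.kurt()` and `DataFrame.kurt()` use Fisher's definition, so the "standard value of 3" for a normal curve shows up as roughly 0 in pandas output.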

In [1]:
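plt.figure(figsize=(15,5))

# histplot replaces the deprecated distplot; numeric_only skips the object columns
plt.subplot(1, 2, 1)
sns.histplot(data.skew(numeric_only=True), kde=True).set_xlabel('Skewness')

plt.subplot(1, 2, 2)
sns.histplot(data.kurt(numeric_only=True), kde=True).set_xlabel('Kurtosis')

plt.show()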

In [1]:
from scipy.stats import skew

numeric_feats = data.dtypes[data.dtypes != "object"].index

skewed_feats = data[numeric_feats].apply(lambda x: skew(x)).sort_values(ascending=False)

skewed_feats

In [1]:
sns.histplot(data['1stFlrSF'], kde=True);

In [1]:
def fixing_skewness(df):
    """
    Box-Cox transform the highly skewed numeric columns of the dataframe in place (nothing is returned).
    """
    ## Import necessary modules 
    from scipy.stats import skew
    from scipy.special import boxcox1p
    from scipy.stats import boxcox_normmax
    
    ## Getting all the data that are not of "object" type. 
    numeric_feats = df.dtypes[df.dtypes != "object"].index

    # Check the skew of all numerical features
    skewed_feats = df[numeric_feats].apply(lambda x: skew(x)).sort_values(ascending=False)
    high_skew = skewed_feats[abs(skewed_feats) > 0.5]
    skewed_features = high_skew.index

    for feat in skewed_features:
        df[feat] = boxcox1p(df[feat], boxcox_normmax(df[feat] + 1))

fixing_skewness(data)
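
To see what the Box-Cox step does to a single column, here is a minimal sketch on simulated right-skewed data (the lognormal sample `x` is an assumption standing in for a feature like a floor area). It uses the same two SciPy calls as the function above.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax, skew

# Simulated positive, right-skewed feature (not real house data).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=7.0, sigma=0.6, size=2000)

# Pick the lambda that makes x + 1 look most normal, then transform.
lam = boxcox_normmax(x + 1)
x_t = boxcox1p(x, lam)

print('lambda = %.3f' % lam)
print('skew before = %.3f, after = %.3f' % (skew(x), skew(x_t)))
```

The `+ 1` matters because Box-Cox needs strictly positive inputs, and many area columns contain zeros; `boxcox1p` applies the same shift internally.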

In [1]:
sns.histplot(data['1stFlrSF'], kde=True);

In [1]:
# creating dummies
cols = []
for i in data.columns:
    if data[i].dtype==object:
        cols.append(i)

data = pd.get_dummies(data=data, columns=cols)
data.shape

Finally, we have 256 features. I hope they give good results :)

In [1]:
# *********************Splitting data into train and test set**************************
X_train = data.iloc[:len(train), :]
X_test = data.iloc[len(train):, :]

## 3.6 Scaling Data Train, Test Set and Target Values

In [1]:
# I'm using a log transform to scale the target values
y_scaled = np.log(y)

from sklearn.preprocessing import RobustScaler
robust_scaler = RobustScaler()

X_train_scaled = robust_scaler.fit_transform(X_train)
X_test_scaled = robust_scaler.transform(X_test)
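
Two small checks may help explain the choices in this cell. The numbers below are made up for illustration. The first part shows what an error looks like on the log scale (the competition scores RMSE on log prices, so training on `np.log(y)` lines the loss up with the metric). The second part shows that `RobustScaler` centers on the median and divides by the interquartile range, which keeps a single extreme value from squashing the rest.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# --- Target: errors on the log scale (hypothetical prices) ---
y_true = np.array([100000.0, 200000.0, 400000.0])
y_pred = np.array([110000.0, 180000.0, 400000.0])
rmse_log = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))
print('RMSE on log prices: %.4f' % rmse_log)

# np.exp undoes np.log, which is how predictions get mapped back to prices.
roundtrip = np.exp(np.log(y_true))

# --- Features: RobustScaler uses median and IQR ---
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier
scaled = RobustScaler().fit_transform(X)
# median = 3, IQR = 4 - 2 = 2, so each value becomes (x - 3) / 2
print(scaled.ravel())
```

Fitting the scaler on the training rows and only calling `transform` on the test rows, as done above, keeps test-set statistics out of the model.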

## 3.7 Feature Importance using Lasso Coefficients

In [1]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.001)

lasso.fit(X_train_scaled, y_scaled)

y_pred_lasso = lasso.predict(X_test_scaled)

lasso_coeff = pd.DataFrame({'Feature Importance':lasso.coef_}, index=data.columns)
lasso_coeff.sort_values('Feature Importance', ascending=False)

In [1]:
g = lasso_coeff[lasso_coeff['Feature Importance'] != 0].sort_values('Feature Importance').plot(kind='barh',figsize=(20,20))

`QualSFNg`, `GrLivArea` and `OverallQual` are some of the most important features.
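
Why Lasso coefficients can be read as a rough importance score: the L1 penalty pushes the coefficients of uninformative features to exactly zero. The sketch below shows this on synthetic data, where the feature names `f0` to `f4`, the data and the `alpha` value are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only the first two columns drive the target; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = Lasso(alpha=0.1).fit(X, y)
for name, coef in zip(['f0', 'f1', 'f2', 'f3', 'f4'], model.coef_):
    print(f'{name}: {coef:+.3f}')
```

The coefficient sizes are only comparable when the features share a scale, which is one reason the scaling step comes before this one. Correlated features can also split or swap importance, so the ranking is a hint rather than a verdict.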


# 4. Model
These are the models we are going to use:
* XG Boost Regressor
* Random Forest Regressor
* LGBM Regressor
* Support Vector Machine
* Ridge Regressor
* Gradient Boosting Regressor
* Cat Boost Regressor

## 4.1 First let's define a function for hyperparameter tuning

In [1]:
from sklearn.model_selection import GridSearchCV

# defining grid search function
def hyperparameter_tuning(model, parameters, X_train=X_train_scaled, y_train=y_scaled, jobs=-1):
    grid_search = GridSearchCV(estimator=model,
                               param_grid=parameters,
                               cv=5,
                               scoring='neg_mean_squared_error',
                           #    n_jobs=jobs,
                           #    verbose=2
                              )
    grid_search.fit(X_train, y_train)
    print("Best Score: {:.5f}".format(np.sqrt(-grid_search.best_score_)))
    print("Best Parameters:", grid_search.best_params_)
    best_model = grid_search.best_estimator_
    return grid_search, best_model
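
A note on the `np.sqrt(-grid_search.best_score_)` line: scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. The sketch below shows the same search on a small synthetic regression problem (the data and the `alpha` grid are assumptions chosen to run quickly).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=120)

search = GridSearchCV(Ridge(), {'alpha': [0.01, 1.0, 100.0]},
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign before taking the root.
cv_rmse = np.sqrt(-search.best_score_)
print('Best alpha:', search.best_params_['alpha'])
print('CV RMSE: %.4f' % cv_rmse)
```

Because the models here are trained on log prices, the RMSE printed by `hyperparameter_tuning` is on the log scale as well.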

## 4.2 XGB Regressor

In [1]:
from xgboost import XGBRegressor

xgb = XGBRegressor(learning_rate =0.0139,
                   n_estimators =4500,
                   max_depth =4,
                   min_child_weight =0,
                   subsample =0.7968,
                   colsample_bytree =0.4064,
                   n_jobs =-1,
                   scale_pos_weight =2,
                   random_state=42,
                  )

## 4.3 Random Forest Regressor

In [1]:
from sklearn.ensemble import RandomForestRegressor

rf_param = {'n_estimators': [150, 300],
            'min_samples_split': [2, 5, 8]
           }
gs_rf, rf = hyperparameter_tuning(RandomForestRegressor(), rf_param)

## 4.4 Support Vector Machine

In [1]:
from sklearn.svm import SVR

svm_param = {'C':[15, 10, 1.5, 1, 0.1, 0.01],
             'epsilon':[0.0001, 0.001, 0.1, 0.2],
             'gamma': [0.0001, 0.001, 0.005, 0.1, 1]
            }
gs_svr, svr = hyperparameter_tuning(SVR(), svm_param)

## 4.5 Ridge Regressor

In [1]:
from sklearn.linear_model import Ridge

ridge_param = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0], 
               'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
              }
gr_ridge, ridge = hyperparameter_tuning(Ridge(), ridge_param)

## 4.6 LGBM Regressor

In [1]:
from lightgbm import LGBMRegressor
'''
lgb_param = {'objective': ['regression'],
             'learning_rate': [0.00721],
             'n_estimators': [4000, 7000],
         #    'boosting_type': ['gbdt', 'dart'],
             'num_leaves': [50, 100, 200],
             'reg_alpha': [0, 1.2, 1.3],
             'reg_lambda': [0, 1.2, 1.3]
            }
gs_lgb, lgb = hyperparameter_tuning(LGBMRegressor(), lgb_param)            
'''
lgb = LGBMRegressor(objective='regression',
                    n_estimators=5000,
                    num_leaves=5,
                    learning_rate=0.00721,
                    max_bin=163,
                    bagging_fraction=0.35711,
                    n_jobs=-1,
                    bagging_seed=42,
                    feature_fraction_seed=42,
                    bagging_freq=7,
                    feature_fraction=0.1294,
                    min_data_in_leaf=8
                   )

## 4.7 Gradient Boosting Regressor

In [1]:
from sklearn.ensemble import GradientBoostingRegressor

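# 'squared_error' is the current scikit-learn name for the least-squares loss formerly called 'ls'
gbr_param = {'loss': ['squared_error', 'huber', 'quantile'],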
             'n_estimators': [400, 700],
             'min_samples_leaf': [17, 15],
             'max_features': ['sqrt'],
             'max_depth': [4]
            }
gs_gbr, gbr = hyperparameter_tuning(GradientBoostingRegressor(), gbr_param)

## 4.8 Cat Boost Regressor

In [1]:
from catboost import CatBoostRegressor
'''cat_param = {'learning_rate': [0.03, 0.1],
             'depth': [4, 6, 10],
             'l2_leaf_reg': [1, 3, 5, 7, 9]
            }
model = CatBoostRegressor()
gs_cbr, cbr = hyperparameter_tuning(model, cat_param)'''

cbr = CatBoostRegressor(iterations=3500,
                        learning_rate=0.03,
                        od_type='Iter',
                        od_wait=1500,
                        depth=6,
                        random_strength=1,
                        l2_leaf_reg=10,
                        sampling_frequency='PerTree',
                        verbose=0
                       )
cbr.fit(X_train_scaled, y_scaled)

In [1]:
from sklearn.model_selection import cross_val_score, KFold, cross_validate
from sklearn.metrics import r2_score

#A function for testing multiple estimators
def model_check_cv(X, y, estimators, labels):
   
    model_table = pd.DataFrame()
    row_index = 0
    for est, label in zip(estimators, labels):
        MLA_name = label
        model_table.loc[row_index, 'Model Name'] = label

        cv_results = cross_validate(est,
                                    X,
                                    y,
                                    cv=5,
                                    scoring='neg_root_mean_squared_error',
                                    return_train_score=True
                              #      n_jobs=-1
                                   )
        model_table.loc[row_index, 'Train RMSE'] = -cv_results['train_score'].mean()
        model_table.loc[row_index, 'Test RMSE'] = -cv_results['test_score'].mean()
        model_table.loc[row_index, 'Test Std'] = cv_results['test_score'].std()
        model_table.loc[row_index, 'Time'] = cv_results['fit_time'].mean()

        row_index += 1

    model_table.sort_values(by=['Test RMSE'], inplace=True)
    return model_table

# Function for r2scores of all models
def r2_table(X, y, estimators, labels):
    table = pd.DataFrame()
    i = 0
    for est, label in zip(estimators, labels):
        table.loc[i, 'Model Name'] = label
        table.loc[i, 'R2 Score'] = r2_score(y, np.exp(est.predict(X)))
        i += 1
    return table.sort_values(by=['R2 Score'], ascending=False)

In [1]:
estimators = [xgb, rf, svr, ridge, lgb, gbr, cbr]
labels = ['XGBoostRegressor', 'Random Forest Regressor', 'Support Vector Regressor', 
          'Ridge Regression', 'LGBM Regressor', 'GradientBoostingRegressor', 'CatBoostRegressor']

models = model_check_cv(X_train_scaled, y_scaled, estimators, labels)
display(models.style.background_gradient(cmap='Reds'))

Based on the Root Mean Squared Error values and R2 scores, I will assign weights to the models.

**CatBoost and XGBoost** outperform all the other models, while **SVM has the lowest score**, but it is still comparable.

In [1]:
# fitting all models on whole training data
xgb.fit(X_train_scaled, y_scaled)
rf.fit(X_train_scaled, y_scaled)
ridge.fit(X_train_scaled, y_scaled)
lgb.fit(X_train_scaled, y_scaled)
gbr.fit(X_train_scaled, y_scaled)
cbr.fit(X_train_scaled, y_scaled)
svr.fit(X_train_scaled, y_scaled)

# these are r2 scores on whole training set
array = r2_table(X_train_scaled, y, estimators, labels)
display(array.style.background_gradient(cmap='Reds'))

# 5. Final Prediction

In [1]:
# function for predicting 
def price_predict(X):
    return (
            (0.2 * xgb.predict(X))+
            (0.2 * cbr.predict(X))+
            (0.15 * rf.predict(X))+
            (0.1 * ridge.predict(X))+
            (0.12 * lgb.predict(X))+
            (0.15 * gbr.predict(X))+
            (0.08 * svr.predict(X))
           )
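
The blend above is a weighted average of log-price predictions. Here is a quick sketch of what that means numerically, using the same weights but made-up predictions (the `log_preds` values and `offsets` are hypothetical, not model output).

```python
import numpy as np

# Same weights as in price_predict; they should add up to 1.
weights = {'xgb': 0.20, 'cbr': 0.20, 'rf': 0.15, 'ridge': 0.10,
           'lgb': 0.12, 'gbr': 0.15, 'svr': 0.08}
print('Sum of weights: %.2f' % sum(weights.values()))

# Hypothetical log-price predictions for two houses, one array per model.
base = np.array([12.0, 12.5])
offsets = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, 0.05]
log_preds = {name: base + off for name, off in zip(weights, offsets)}

# Weighted average on the log scale, then back to prices with exp.
blend_log = sum(w * log_preds[name] for name, w in weights.items())
blend_price = np.exp(blend_log)
print(blend_price)
```

Because the averaging happens before `np.exp`, the final price is a weighted geometric mean of the individual price predictions, which is a bit less sensitive to one model overshooting than a plain average of prices would be.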

In [1]:
pred = np.exp(price_predict(X_test_scaled))
pred = pd.Series(pred, name='SalePrice')
results = pd.concat((test['Id'], pred), axis=1)
results.to_csv("mysubmission.csv", index=False)
results.head()

* Your feedback in the comments is much appreciated. Please comment if you have any questions or suggestions for improvement.
* Please **UPVOTE** if you LIKE this notebook, it will keep me motivated