# House Prices - Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting

# Loading the Dataset

The dataset can be downloaded from the link below:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

In [None]:
import pandas as pd
import numpy as np
from sklearn import model_selection
pd.set_option('display.max_columns', None)
train_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test_data=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train_data

The training DataFrame above has 1,460 observations and 81 columns (79 predictors plus `Id` and `SalePrice`). That should be enough data points to train a reasonable model.

# EDA
Our data is now in the form of a DataFrame. The first step in EDA is to identify any missing data and examine how it relates to the target variable. That analysis is usually useful in deciding how to replace missing values.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(25,10))
sns.heatmap(train_data.isnull(), cmap="viridis")

The missing (NaN) values are indicated by the yellow streaks in the preceding image. Although the majority of the columns are complete, a few columns, such as Alley and Fence, have more than 75% of their values missing.
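
The heatmap gives a visual impression; the same information can be read as numbers. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the competition data) that computes the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Fence": [np.nan, "MnPrv", np.nan, "GdWo"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull() gives a boolean frame; its column-wise mean is the missing share.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually contain gaps.
missing_share = missing_share[missing_share > 0]
print(missing_share)
```

Running the same two lines on `train_data` instead of `toy` should list the columns that show up as yellow streaks in the heatmap.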

In [None]:
Id=train_data['Id']

In [None]:
plt.figure(figsize=(10, 8))
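# numeric_only=True skips the string columns, which recent pandas versions refuse to correlate
sns.heatmap(train_data.corr(numeric_only=True))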
plt.title('HeatMap- Correlation between predictor Variables')
plt.show()

In [None]:
train_data.corr(numeric_only=True)['SalePrice'].sort_values()


OverallQual and GrLivArea are the features most strongly correlated with SalePrice in the preceding series. This suggests that buyers are prepared to pay more for homes with more above-ground living area and a higher overall quality rating. It also seems plausible that a larger share of high-quality properties in a neighbourhood raises the area's average sale price, although the correlation alone does not show this.

In [None]:
plt.figure(figsize=(10, 8))
sns.boxplot(x='OverallQual', y='SalePrice',data=train_data )
plt.show()

The boxplot clearly shows that overall quality is one of the primary factors influencing property prices. The median sale price rises steadily as the OverallQual rating goes up.

In [None]:
plt.scatter( x='GrLivArea',y='SalePrice',data=train_data)
plt.ylabel('SalePrice')
plt.xlabel('GrLivArea')
plt.show()

The spread of sale prices widens as the ground living area increases. There are also a few outliers on the right side of the plot.

In [None]:
numerical_col=[col for col in train_data.columns if train_data[col].dtypes!='O']
numerical_col.remove('Id')
year_col=['YearBuilt','YearRemodAdd','GarageYrBlt','YrSold']
train_data.groupby('YrSold')['SalePrice'].mean().plot()
plt.ylabel('SalePrice')

The graph above shows that the mean SalePrice moves irregularly from one sale year to the next, which suggests that the selling price of a property has only a weak relationship with the year in which it is sold.

In [None]:
for i in year_col:
    data1=train_data.copy()
    if i!= 'YrSold':
        data1['new']=data1['YrSold']-data1[i]
        sns.scatterplot(x='new',y='SalePrice',data=data1)
        plt.xlabel('Number of years since'+' '+ i)
        plt.title(i)
        plt.show()
        

According to the three graphs above, older houses sell for less than newly built ones. Likewise, properties with recently built garages or recent remodeling sold for more, and the price declines as the number of years since the garage was built or the house was remodeled increases.

In [None]:
discrete_col=[col for col in numerical_col if len(train_data[col].value_counts())< 20 and col not in year_col]
for i in discrete_col:
    df1=train_data.copy()
    df1.groupby(i)['SalePrice'].mean().plot.bar()
    plt.ylabel('Sale Price')
    plt.show()

The figures above plot the mean SalePrice against each discrete numerical feature. Some of them, such as OverallQual, TotRmsAbvGrd, Fireplaces, and GarageCars, have a considerable relationship with the sale price.

In [None]:
train_data.groupby(['YrSold','MoSold']).count()['SalePrice'].plot(kind='barh',figsize=(20,25))

The bar chart above shows the number of houses sold in each month of each year, from January 2006 to July 2010. There is a clear seasonal pattern: the number of houses sold rises sharply in May, June, and July of every year.

# Feature Engineering

## Imputing Missing Values

As stated in the dataset description on Kaggle, NA values in fields such as Alley, Fence, and FireplaceQu mean that the house lacks that feature or amenity, so I replaced them with 'None' for categorical variables and 0 for numerical ones.

In [None]:
train_data["MiscFeature"] = train_data["MiscFeature"].fillna("None")
train_data["Alley"] = train_data["Alley"].fillna("None")
train_data["Fence"] = train_data["Fence"].fillna("None")
train_data["FireplaceQu"] = train_data["FireplaceQu"].fillna("None")

test_data["MiscFeature"] = test_data["MiscFeature"].fillna("None")
test_data["Alley"] = test_data["Alley"].fillna("None")
test_data["Fence"] = test_data["Fence"].fillna("None")
test_data["FireplaceQu"] = test_data["FireplaceQu"].fillna("None")

train_data["MasVnrArea"] = train_data["MasVnrArea"].fillna(0)
test_data["MasVnrArea"] = test_data["MasVnrArea"].fillna(0)
train_data["MasVnrType"] = train_data["MasVnrType"].fillna("None")
test_data["MasVnrType"] = test_data["MasVnrType"].fillna("None")
train_data["PoolQC"] = train_data["PoolQC"].fillna("None")
test_data["PoolQC"] = test_data["PoolQC"].fillna("None")

In [None]:
Basement_cat = ("BsmtQual" , "BsmtCond", "BsmtExposure" , "BsmtFinType1" , "BsmtFinType2")
for i in Basement_cat:
    train_data[i] = train_data[i].fillna("None")
    test_data[i] = test_data[i].fillna("None")
    
Basement_num = ("BsmtFinSF1" , "BsmtFinSF2" , "BsmtUnfSF", "TotalBsmtSF" ,
"BsmtFullBath" , "BsmtHalfBath")
for i in Basement_num:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)

In [None]:
garage_cat= ("GarageType" , "GarageFinish" , "GarageQual" ,"GarageCond")
for i in garage_cat:
    train_data[i] = train_data[i].fillna('None')
    test_data[i] = test_data[i].fillna('None')
    
garage_num = ("GarageYrBlt" , "GarageArea" , "GarageCars")
for i in garage_num:
    train_data[i] = train_data[i].fillna(0)
    test_data[i] = test_data[i].fillna(0)
    

In [None]:
train_data["LotFrontage"] = train_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
test_data["LotFrontage"] = test_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))


In the cell above, I grouped the rows by Neighborhood and filled the missing LotFrontage values with each neighbourhood's median, because properties in the same neighbourhood tend to have similar lot frontage.
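
To make the effect of the group-wise fill concrete, here is a small worked example on a made-up frame with two hypothetical neighbourhoods, "A" and "B". It uses the same `groupby(...).transform(...)` pattern as the cell above.

```python
import numpy as np
import pandas as pd

# Toy data: two hypothetical neighbourhoods with one gap each.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, np.nan],
})

# Each gap is filled with the median of its own neighbourhood,
# not with the global median of the column.
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

The gap in neighbourhood "A" is filled with 70 (the median of 60 and 80), and the gap in "B" with 40.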

In [None]:
numeric_cols = train_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = train_data.select_dtypes(include=['object']).columns.tolist()
year_col=['YearBuilt','YearRemodAdd','GarageYrBlt','YrSold']

In [None]:
print('Values along with Count in the Categorical Columns','\n')
for i in categorical_cols:
    print(i)
    print(train_data[i].value_counts(),'\n')  

In [None]:
train_data.drop(['Utilities','Street',"PoolQC"], axis = 1,inplace=True)
test_data.drop(['Utilities','Street',"PoolQC"], axis = 1,inplace=True)


I removed the Utilities, PoolQC, and Street features because more than 95 percent of the rows share a single value in each of them. With so little variation, these features add almost nothing to the model.

In [None]:
train_data

In [None]:
missing_counts = train_data.isnull().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

In [None]:
missing_counts = test_data.isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

There are still a few missing values left in the training and test data. Below, I use scikit-learn's SimpleImputer with the mean for numerical columns and the mode for categorical ones. SimpleImputer fills missing values either with a supplied constant or with a statistic (mean, median, or most frequent value) of the column in which they occur.

Numerical columns

In [None]:
from sklearn.impute import SimpleImputer
numeric_cols.remove('SalePrice')

In [None]:
imputer1 = SimpleImputer(strategy='mean')
imputer1.fit(train_data[numeric_cols])
train_data[numeric_cols] = imputer1.transform(train_data[numeric_cols])
test_data[numeric_cols] = imputer1.transform(test_data[numeric_cols])

Categorical columns

In [None]:
from sklearn.impute import SimpleImputer
categorical_cols = train_data.select_dtypes(include=['object']).columns.tolist()
imputer1 = SimpleImputer(strategy='most_frequent')
imputer1.fit(train_data[categorical_cols])
train_data[categorical_cols] = imputer1.transform(train_data[categorical_cols])
test_data[categorical_cols] = imputer1.transform(test_data[categorical_cols])

# Target variable transformation

Normality means that a variable follows a normal (Gaussian) distribution.
The simplest way to check for normality is to draw a histogram and a Q-Q plot.

In [None]:
# histplot replaces the deprecated distplot; kde=True keeps the density curve
sns.histplot(train_data['SalePrice'], bins=50, kde=True)

The sale price is right-skewed, as shown in the graph above. Skewed data makes it harder for many models to pick up the correct pattern, so we transform the target to be closer to a normal (Gaussian) distribution. A log transformation removes most of the skewness here.
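
As a rough illustration of what the log transformation does, the sketch below uses synthetic log-normal "prices" (generated data, not the competition data) and compares the sample skewness before and after `np.log1p`. It also checks that `np.expm1` undoes the transformation, which matters when predictions are converted back to prices.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (log-normal); an assumption for illustration only.
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

# log1p computes log(1 + x).
log_prices = np.log1p(prices)

skew_before = skew(prices)
skew_after = skew(log_prices)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# expm1 computes exp(x) - 1 and is the inverse of log1p.
recovered = np.expm1(log_prices)
print(np.allclose(recovered, prices))
```

The skewness after the transformation should be close to zero, while the original values are clearly right-skewed.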

In [None]:
from scipy import stats
stats.probplot(train_data['SalePrice'], plot=plt)
plt.show()

In [None]:
train_data["SalePrice"] = np.log1p(train_data["SalePrice"])
from scipy import stats
stats.probplot(train_data['SalePrice'], plot=plt)
plt.show()

In [None]:
Tvariable=train_data['SalePrice']
train_data.drop('SalePrice',axis=1,inplace=True)

#  ADDITIONAL FEATURES

In [None]:
yr_col=['YearBuilt','YearRemodAdd']
for i in yr_col:
        train_data['NYS'+i]=train_data['YrSold']-train_data[i]
        test_data['NYS'+i]=test_data['YrSold']-test_data[i]

I've added two new fields to the dataset: the number of years between the sale and the year the home was built, and the number of years between the sale and the last remodel. As the plots in the EDA section show, both should help the model predict prices.

# Transforming  numerical variables that are categorical

Some of the numerical features are really categorical, so I converted them to strings so that they are picked up when the categorical columns are encoded. To identify these columns, I first built a list of numerical columns and then kept those with fewer than 30 distinct values that are not in the year list.

In [None]:
num_cols = train_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
year_col=['YearBuilt','YearRemodAdd','GarageYrBlt','YrSold']
num_discrete_col=[col for col in num_cols if len(train_data[col].value_counts())<30 and col not in year_col]
train_data[num_discrete_col]

In [None]:
train_data["MSSubClass"] = train_data["MSSubClass"].apply(str)
test_data["MSSubClass"] = test_data["MSSubClass"].apply(str)
train_data["YrSold"] = train_data["YrSold"].apply(str)
test_data["YrSold"] = test_data["YrSold"].apply(str)
train_data["MoSold"] = train_data["MoSold"].apply(str)
test_data["MoSold"] = test_data["MoSold"].apply(str)

From the final list, MSSubClass is categorical because its values identify the type of dwelling involved in the sale. Moreover, YrSold has only five distinct values (2006 to 2010), and MoSold indicates the month in which the house was sold.

# Outliers 

The statistics and distribution of the input variables affect machine learning algorithms. Outliers in the data can distort and mislead the training process, leading to longer training times, less accurate models, and ultimately poorer results.

I separated the numerical columns into two groups: discrete and continuous. All columns with fewer than 15 distinct values went into the discrete list, and the rest went into the continuous list. I then plotted the distribution of each continuous column.

In [None]:
numerical_col=[col for col in train_data.columns if train_data[col].dtypes!='O']
discrete_col=[col for col in numerical_col if len(train_data[col].value_counts())< 15 and col not in year_col]
cont_col=[col for col in numerical_col if col not in discrete_col+year_col ]
for i in cont_col:
    df1=train_data.copy()
    df1[i].hist(bins=50)
    plt.xlabel(i)
    plt.show()

In [None]:
l=['LotFrontage','LotArea','BsmtUnfSF','TotalBsmtSF','1stFlrSF','GrLivArea','GarageArea']
for i in l:
    Q1=train_data[i].quantile(0.25)
    Q3=train_data[i].quantile(0.75)
    IQR=Q3-Q1
    W1=Q1-(1.5*IQR)
    W2=Q3+(1.5*IQR)
    # Cap values outside the whiskers in one vectorised step
    train_data[i] = train_data[i].clip(lower=W1, upper=W2)

After evaluating the distributions and the unique values in continuous columns, I have chosen a few columns that may include outliers. I later substituted the outliers with the corresponding whisker values.
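
The whisker capping described above can be wrapped in a small helper. This is a sketch only: `cap_to_whiskers` is a name I made up for illustration, and the series is toy data with one obvious outlier.

```python
import pandas as pd


def cap_to_whiskers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy values with a single extreme point at the end.
toy = pd.Series([10, 12, 11, 13, 12, 11, 100], name="GrLivArea")
capped = cap_to_whiskers(toy)
print(capped.tolist())
```

Here Q1 is 11 and Q3 is 12.5, so the upper whisker is 12.5 + 1.5 * 1.5 = 14.75. The value 100 is pulled down to 14.75 and the other values are left unchanged.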

# Skewness

Skewed data forces a model to cope with rare occurrences at extreme values, which reduces its capacity to explain typical cases. Some models, such as tree-based models, are robust to this, but relying on them limits the range of models we can try. It is therefore useful to transform skewed features towards a Gaussian (normal) distribution, which lets us test a wider range of models.

To deal with the skewness in the data, I applied the Box-Cox transformation (`boxcox1p`) below.
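
For reference, `scipy.special.boxcox1p(x, lam)` computes ((1 + x)^lam - 1) / lam for a non-zero lam and log(1 + x) when lam is 0. The short check below compares the manual formula with the library call on a few made-up values; lam = 0.17 is simply the value used in this notebook for positively skewed features.

```python
import numpy as np
from scipy.special import boxcox1p

# A few illustrative non-negative values.
x = np.array([0.0, 1.0, 10.0, 1000.0])
lam = 0.17

# Manual Box-Cox of (1 + x) for lam != 0.
manual = ((1.0 + x) ** lam - 1.0) / lam
library = boxcox1p(x, lam)
print(np.allclose(manual, library))

# With lam = 0 the transform reduces to log1p.
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))
```

Because the transform is applied to 1 + x, zeros in the data (for example, houses without a basement) are mapped to 0 rather than causing an error.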

In [None]:
cdata=pd.concat([train_data,test_data])
numerical_col=[col for col in train_data.columns if train_data[col].dtypes!='O']
skew=cdata[numerical_col].skew().sort_values()
skew_score=pd.DataFrame({'Skew' :skew})
skew_score

In [None]:
skewness_p = skew_score[(skew_score['Skew']) > 0.75]
skewness_n = skew_score[(skew_score['Skew']) < -0.75]

from scipy.special import boxcox1p
skewed_features_p = skewness_p.index
lam = 0.17
for feat in skewed_features_p:
    cdata[feat] = boxcox1p(cdata[feat], lam)
    
skewed_features_n = skewness_n.index
lam=2    
for feat in skewed_features_n:
    cdata[feat] = boxcox1p(cdata[feat], lam)
    
# .copy() avoids SettingWithCopyWarning when columns are reassigned later
train_data=cdata.iloc[0:1460].copy()
test_data=cdata.iloc[1460:].copy()

# Scaling Numerical Values

Feature scaling brings the independent features onto a comparable range. Without it, many machine learning algorithms implicitly give more weight to features with larger numeric values, regardless of their unit of measurement. I used RobustScaler because several columns contain outliers, and it copes well with them. This scaler subtracts the median and divides by the interquartile range (IQR), which is the difference between the third and first quartiles (the 75th and 25th percentiles). Because its centering and scaling statistics are based on percentiles, they are not affected by a small number of very large marginal outliers.
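
The following sketch reproduces what RobustScaler does with its default settings on a single made-up column, by computing (x - median) / IQR by hand and comparing it with the library result.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One toy feature with a large outlier in the last row.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual version: centre on the median, scale by the interquartile range.
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

# Library version with default settings (quantile_range=(25.0, 75.0)).
scaled = RobustScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.ravel())
```

The outlier still ends up far from the rest after scaling, but it does not change the median or the IQR, so the other four values keep a sensible scale.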

In [None]:
num_cols = train_data.select_dtypes(include=['int64', 'float64']).columns.tolist()
train_data[num_cols].describe().loc[['min', 'max']]

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(pd.concat([train_data[num_cols], test_data[num_cols]]))
train_data[num_cols] = scaler.transform(train_data[num_cols])
test_data[num_cols] = scaler.transform(test_data[num_cols])

# Encode Categorical Columns

Most machine learning models require all input and output variables to be numeric. If the data contains categorical columns, they must be converted to numbers before a model can be fitted and evaluated, so encoding is a required pre-processing step. Some of the categorical columns have a natural order, so I encoded those with explicit ordinal mappings (label encoding); the remaining columns were one-hot encoded with `pd.get_dummies`.
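
As a small illustration of the two encodings, the sketch below uses a made-up four-row frame with one ordered column (ExterQual) and one unordered column (Foundation). The choice of Foundation as the unordered example is mine.

```python
import pandas as pd

# Toy rows; category labels follow the dataset's conventions.
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", "Fa"],
    "Foundation": ["PConc", "CBlock", "PConc", "BrkTil"],
})

# Ordered categories: map each label to its rank.
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
toy["ExterQual"] = toy["ExterQual"].map(quality_order)

# Unordered categories: one indicator column per level, dropping the first level.
encoded = pd.get_dummies(toy, columns=["Foundation"], drop_first=True)
print(encoded)
```

With `drop_first=True` the first level (here "BrkTil") becomes the baseline and gets no column of its own, which avoids perfectly collinear indicator columns in linear models such as Ridge.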

## LABEL ENCODING

In [None]:
cdata=pd.concat([train_data,test_data])
cdata['LotShape']=cdata['LotShape'].map({'Reg':3,'IR3':0,'IR2':1,'IR1':2})
cdata['LandSlope']=cdata['LandSlope'].map({'Gtl':0,'Mod':1,'Sev':2})
cdata['ExterQual']=cdata['ExterQual'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
cdata['ExterCond']=cdata['ExterCond'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
cdata['BsmtQual']=cdata['BsmtQual'].map({'None':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
cdata['BsmtCond']=cdata['BsmtCond'].map({'None':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
cdata['BsmtExposure']=cdata['BsmtExposure'].map({'None':0,'No':1,'Mn':2,'Av':3,'Gd':4})
cdata['BsmtFinType1']=cdata['BsmtFinType1'].map({'None':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
cdata['BsmtFinType2']=cdata['BsmtFinType2'].map({'None':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
cdata['HeatingQC']=cdata['HeatingQC'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
cdata['KitchenQual']=cdata['KitchenQual'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
cdata['FireplaceQu']=cdata['FireplaceQu'].map({'None':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
cdata['GarageFinish']=cdata['GarageFinish'].map({'None':0,'Unf':1,'RFn':2,'Fin':3})
cdata['GarageQual']=cdata['GarageQual'].map({'None':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
cdata['GarageCond']=cdata['GarageCond'].map({'None':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
cdata['YrSold']=cdata['YrSold'].map({'2006.0':0,'2007.0':1,'2008.0':2,'2009.0':3,'2010.0':4})
cdata['MoSold']=cdata['MoSold'].apply(float).apply(int)
cdata['PavedDrive']=cdata['PavedDrive'].map({'N':0,'P':1,'Y':2})
cdata['Fence']=cdata['Fence'].map({'None':0,'MnWw':1,'GdWo':2,'MnPrv':3,'GdPrv':4})

In [None]:
dataf=pd.get_dummies(cdata, drop_first=True)
dataf.drop(['Id'],axis=1,inplace=True)
ftrain_data=dataf.iloc[0:1460]
ftest_data=dataf.iloc[1460:]

# Model Building

In [None]:
train_inputs = ftrain_data.copy()
test_inputs = ftest_data.copy()

In [None]:
train_inputs['Id']=Id

# Hyper parameter optimization

The more sophisticated the model, the more complicated its hyperparameter configuration becomes, and the combination of hyperparameters can have a big influence on the model's performance.

I first experimented with ordinary least squares, lasso, and ridge regression, and picked ridge because it performed best of the three.
The other two models are XGBoost and a gradient boosting regressor. I tuned the hyperparameters of all of these models with RandomizedSearchCV.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RandomizedSearchCV

# MODEL-1 RIDGE REGRESSION

## Regularization
Alpha is a Ridge hyperparameter: it is not learnt from the data by the model and must be set beforehand.
We use RandomizedSearchCV to find a good alpha for Ridge regularization.

In [None]:
from sklearn.linear_model import Ridge
param_grid={}
param_grid['alpha'] = np.arange(0, 20, 0.1)
model=Ridge()
searcher=RandomizedSearchCV(model,param_grid,n_iter=2,scoring='neg_mean_squared_error',verbose=1,cv=10)
searcher.fit(ftrain_data, Tvariable)
print(searcher.best_params_)
print(searcher.best_score_)

In [None]:
# Let us explore the coefficients for each of the independent attributes
ridge_model = Ridge(alpha=9)
ridge_model.fit(ftrain_data,Tvariable)
weights = ridge_model.coef_
weights_df = pd.DataFrame({
    'columns': ftrain_data.columns,
    'weight': weights
}).sort_values('weight', ascending=False)
weights_df=pd.concat([weights_df.iloc[0:4],weights_df.iloc[216:]])

In [None]:
plt.figure(figsize=(14,7))
sns.barplot(x='weight',y='columns',data=weights_df)


I plotted the variable coefficients with the largest positive and negative values. As anticipated, the model gave higher weight to GrLivArea (ground living area) and OverallQual.

# MODEL-2 XGBOOST

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. It is not always adequate to depend just on the outcomes of a single machine learning model. Ensemble learning provides a methodical approach to combining the predictive capacity of numerous learners. The end result is a single model that aggregates the output of numerous models.

In [None]:
n_estimators = [850,900,950]
max_depth = [2, 3, 5, 10, 15]
learning_rate=[0.05,0.1,0.15,0.20]
min_child_weight=[1,2,3,4]

# Define the grid of hyperparameters to search
param_grid = {
    'n_estimators': n_estimators,
    'max_depth':max_depth,
    'learning_rate':learning_rate,
    'min_child_weight':min_child_weight,
    }

from xgboost import XGBRegressor
my_model = XGBRegressor()
XGB=RandomizedSearchCV(my_model,param_grid,n_iter=5,scoring='neg_mean_squared_error',verbose=2,cv=10)
XGB.fit(ftrain_data,Tvariable)
print(XGB.best_params_)
print(XGB.best_score_)


# MODEL-3 GRADIENT BOOSTING REGRESSOR

Gradient boosting builds an additive model by using several fixed-size decision trees as weak learners. The `n_estimators` parameter determines how many trees (boosting stages) are used. Gradient boosting can be used for both regression and classification problems; since the target here is continuous, the regression variant (GradientBoostingRegressor) is used.
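
To show the additive idea in its simplest form, here is a hand-rolled sketch of gradient boosting for squared-error loss on synthetic data. It is a teaching example under my own assumptions (toy data, fixed learning rate, no subsampling), not how GradientBoostingRegressor is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic one-dimensional regression problem.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_estimators = 50
learning_rate = 0.1
max_depth = 2

# Stage 0: a constant model that predicts the mean of the target.
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_estimators):
    # For squared error, the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X, residuals)
    # Add a damped copy of the new tree to the running model.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.3f} -> {final_mse:.3f}")
```

Each shallow tree corrects a little of what the previous trees got wrong, which is why `n_estimators` and `learning_rate` are usually tuned together.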

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
my_model2=GradientBoostingRegressor()

In [None]:
n_estimators = [850,900,950]
max_depth = [2, 3, 5, 10, 15]
learning_rate=[0.05,0.1,0.15,0.20]

param_grid = {
    'n_estimators': n_estimators,
    'max_depth':max_depth,
    'learning_rate':learning_rate
    }

GBR=RandomizedSearchCV(my_model2,param_grid,n_iter=5,scoring='neg_mean_squared_error',verbose=2,cv=10)
GBR.fit(ftrain_data,Tvariable)
print(GBR.best_params_)
print(GBR.best_score_)

#  KFold dataset

I produced a Kfold dataset with 5 splits to utilise in the model blending strategy described below.

In [None]:
from sklearn.model_selection import KFold
K_train_data=train_inputs.copy()

In [None]:
K_train_data["kfold"] = -1
kf =KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_indicies, valid_indicies) in enumerate(kf.split(X=ftrain_data)):
    # Single .loc call avoids chained assignment, which may silently fail
    K_train_data.loc[valid_indicies, "kfold"] = fold

In [None]:
K_train_data['target']=Tvariable

In [None]:
K_train_data

# MODEL BLENDING

In [None]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso

I trained each model on the KFold dataset above, leaving out one fold in each iteration, and then predicted the (log-transformed) sale prices for the left-out fold as well as for the test set. As a result, for each model I have one out-of-fold prediction set for the training data and five prediction sets for the test data.

In [None]:
# MODEL 1
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
final_test_predictions = []
final_valid_predictions = {}
for fold in range(5):
    xtrain =  K_train_data[K_train_data.kfold != fold].reset_index(drop=True)
    xvalid = K_train_data[K_train_data.kfold == fold].reset_index(drop=True)
    xtest = test_inputs.copy()
    
    valid_ids = xvalid.Id.values.tolist()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain.drop(['target','kfold','Id'],axis=1,inplace=True)
    xvalid.drop(['target','kfold','Id'],axis=1,inplace=True)
    model =XGBRegressor(n_estimators=900, min_child_weight= 2, max_depth= 5, learning_rate= 0.05)
    model.fit(xtrain, ytrain)
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))

final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["Id", "pred_1"]
final_valid_predictions.to_csv("train_pred_1.csv", index=False)

t=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df=pd.DataFrame({'Id':t['Id'].apply(int),'pred_1':np.mean(np.column_stack(final_test_predictions), axis=1)})
df.to_csv("test_pred_1.csv", index=False)



In [None]:
# MODEL 2
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
final_test_predictions = []
final_valid_predictions = {}
for fold in range(5):
    xtrain =  K_train_data[K_train_data.kfold != fold].reset_index(drop=True)
    xvalid = K_train_data[K_train_data.kfold == fold].reset_index(drop=True)
    xtest = test_inputs.copy()
    
    valid_ids = xvalid.Id.values.tolist()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain.drop(['target','kfold','Id'],axis=1,inplace=True)
    xvalid.drop(['target','kfold','Id'],axis=1,inplace=True)
    
    model=Ridge(alpha=12.9)
    model.fit(xtrain, ytrain)
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))

final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["Id", "pred_2"]
final_valid_predictions.to_csv("train_pred_2.csv", index=False)

t=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df=pd.DataFrame({'Id':t['Id'].apply(int),'pred_2':np.mean(np.column_stack(final_test_predictions), axis=1)})
df.to_csv("test_pred_2.csv", index=False)



In [None]:
# MODEL 3
final_test_predictions = []
final_valid_predictions = {}
for fold in range(5):
    xtrain =  K_train_data[K_train_data.kfold != fold].reset_index(drop=True)
    xvalid = K_train_data[K_train_data.kfold == fold].reset_index(drop=True)
    xtest = test_inputs.copy()
    
    valid_ids = xvalid.Id.values.tolist()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain.drop(['target','kfold','Id'],axis=1,inplace=True)
    xvalid.drop(['target','kfold','Id'],axis=1,inplace=True)
    
    # Model 3 is the gradient boosting regressor tuned in the MODEL-3 section
    model=GradientBoostingRegressor(n_estimators=850, max_depth= 5, learning_rate=0.1)
    model.fit(xtrain, ytrain)
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    final_test_predictions.append(test_preds)
    final_valid_predictions.update(dict(zip(valid_ids, preds_valid)))

final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["Id", "pred_3"]
final_valid_predictions.to_csv("train_pred_3.csv", index=False)

t=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df=pd.DataFrame({'Id':t['Id'].apply(int),'pred_3':np.mean(np.column_stack(final_test_predictions), axis=1)})
df.to_csv("test_pred_3.csv", index=False)



Once the prediction sets were formed, I saved each model's out-of-fold predictions on the training data to a separate CSV file. For the test data, I took the mean of the five prediction sets and saved that.

In [None]:
df1 = pd.read_csv("train_pred_1.csv")
df2 = pd.read_csv("train_pred_2.csv")
df3 = pd.read_csv("train_pred_3.csv")

df_test1 = pd.read_csv("test_pred_1.csv")
df_test2 = pd.read_csv("test_pred_2.csv")
df_test3 = pd.read_csv("test_pred_3.csv")

In [None]:
df=K_train_data
df = df.merge(df1, on="Id", how="left")
df = df.merge(df2, on="Id", how="left")
df = df.merge(df3, on="Id", how="left")

df_test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")
df_test = df_test.merge(df_test1, on="Id", how="left")
df_test = df_test.merge(df_test2, on="Id", how="left")
df_test = df_test.merge(df_test3, on="Id", how="left")


Below I train a new ridge model that uses the three models' out-of-fold predictions on the training data as features and the (log-transformed) training sale prices as the target. I then predict the final sale prices for the test data by feeding this model the three models' test predictions. I reuse the KFold assignment created earlier, which again yields five test prediction sets, one for each left-out fold. The mean of these five sets is the final prediction.
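
For comparison, scikit-learn ships a `StackingRegressor` that automates a similar idea. The sketch below is an alternative on synthetic data, not the procedure used in this notebook: it trains the final Ridge on cross-validated predictions of the base models, but it refits each base model on the full training set for test-time predictions instead of averaging the per-fold models. The base models and their settings are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge

# Synthetic regression data standing in for the prepared features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=Ridge(),  # meta-model trained on out-of-fold predictions
    cv=5,
)
stack.fit(X_train, y_train)
test_predictions = stack.predict(X_test)
print(test_predictions[:5])
```

The manual loop in this notebook gives more control, for example over which folds are used and how the test predictions are averaged, at the cost of more code.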

In [None]:
from sklearn.linear_model import Ridge
useful_features = ["pred_1", "pred_2", "pred_3"]
df_test = df_test[useful_features]

predictions = []
for fold in range(5):
    xtrain =  df[df.kfold != fold].reset_index(drop=True)
    xvalid = df[df.kfold == fold].reset_index(drop=True)
    xtest = df_test.copy()

    ytrain = xtrain.target
    yvalid = xvalid.target
    
    xtrain = xtrain[useful_features]
    xvalid = xvalid[useful_features]
    
    model = Ridge()
    model.fit(xtrain, ytrain)
    
    preds_valid = model.predict(xvalid)
    test_preds = model.predict(xtest)
    predictions.append(test_preds)

# EXPONENTIAL TRANSFORMATION

In [None]:
# The target was transformed with np.log1p, so np.expm1 is the matching inverse
predictions=np.expm1(predictions)

In [None]:
predictions

# FINAL PREDICTIONS

In [None]:
final_predictions=np.mean(np.column_stack(predictions), axis=1)

In [None]:
t=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
Df = pd.DataFrame({'Id':t['Id'].apply(int), 'SalePrice':(final_predictions)})


In [None]:
Df.to_csv('submission.csv', index=False)


In [None]:
Df

# CONCLUSION

In this project, we covered handling missing values, feature engineering, hyperparameter optimization, and model building. We started with exploratory data analysis to become acquainted with the data, which helped in creating new features and removing unnecessary ones. Finally, we blended the models, which improved accuracy at the cost of a more complex model. Personally, I found that model blending had a lot of potential and was the most effective step in helping me get to the top in the competition.