# My first Kaggle notebook

Hello Kagglers,

Below is my first ever Kaggle notebook, in which I work with the famous dataset of residential homes in Ames, Iowa. I started this notebook to better understand advanced regression problems. Along the way, I learned about regression models in practice and put together my first Kaggle submission. After several attempts, thanks to blended regression, I reached the top 26% with a score of 0.12371 on the public leaderboard.

I realize that this is a very popular problem among beginners and that there are many great kernels for it, but I still wanted to share my work to get valuable feedback from the community on how I can improve my score.

Here I've tried to make everything as understandable as possible for me and to get the best possible result. I think that this way of working on the dataset will be helpful for beginners like me. Before I started, I read many interesting notebooks which gave me a lot of valuable information. I would especially like to thank the authors of:

1. [Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python) by **Pedro Marcelino**
2. [Regularized Linear Models](https://www.kaggle.com/apapiu/regularized-linear-models) by **Alexandru Papiu**
3. [Stacked Regressions : Top 4% on LeaderBoard](https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard) by **Serigne**

If you find my notebook useful, please upvote it, and leave a comment if you see any possible improvement. I will be very grateful for any feedback. It will keep me motivated to update this notebook regularly.

Thank you in advance!

# Main objectives

1. Performing exploratory data analysis on both qualitative and quantitative data.
2. Improving data shape and quality through a number of transformations that help regression models perform better.
3. Checking and tuning different models to get the best results on the target variable.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from scipy.special import boxcox1p


from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold
from sklearn.linear_model import LinearRegression, Lasso, Ridge, RidgeCV, LassoCV
from sklearn.ensemble import RandomForestRegressor

import xgboost as xgb

sns.set(style='whitegrid')
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
train_original = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_original = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

# Work on copies so the untouched originals stay available
train = train_original.copy()
test = test_original.copy()

In [None]:
train.info()

We need to deal with integer, float and categorical (object) data in this dataset.
Let's start cleaning up the data by removing the unnecessary 'Id' column.

In [None]:
train = train.drop('Id', axis=1)
test = test.drop('Id', axis=1)

In [None]:
train.head()

In [None]:
test.head()

The main target of the analysis is "SalePrice", so let's start with a brief overview.

In [None]:
train['SalePrice'].describe()

No zero values, which could break my model. That is good news. Now let's take a look at the histogram.

In [None]:
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(train['SalePrice'], kde=True)

A quick preview of skewness and kurtosis.

In [None]:
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())

Based on the chart, we can see that SalePrice has:

 - Right (positive) skewness
 - A visible peak
 - A distribution which is far from normal

A log transformation could be helpful later. Why a log transformation? It compresses the long right tail, bringing skewed data closer to a normal distribution, which often helps regression models perform better.
The next sensible step for further exploratory data analysis is to divide the data into qualitative and quantitative features. This division will allow for a better understanding of the dataset.
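
To make the effect of the log transformation concrete, here is a minimal sketch on synthetic data (not the Ames dataset). The names `prices` and `log_prices` and the log-normal parameters are my own assumptions, chosen only to produce a right-skewed sample similar in spirit to house prices.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (assumption: log-normal sample, not real data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# log(1 + x) compresses the long right tail
log_prices = np.log1p(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print("Skewness before: %.3f" % skew_before)
print("Skewness after:  %.3f" % skew_after)
```

The skewness after the transformation should be much closer to zero than before.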

In [None]:
train_quantitative = train[[d for d in train.columns if train.dtypes[d] != 'object']].copy()
test_quantitative = test[[d for d in test.columns if test.dtypes[d] != 'object']].copy()

In [None]:
train_qualitative = train[[d for d in train.columns if train.dtypes[d] == 'object']].copy()
test_qualitative = test[[d for d in test.columns if test.dtypes[d] == 'object']].copy()

In [None]:
train_quantitative.describe()

In [None]:
train_qualitative.describe()

Thanks to the division we obtained 37 quantitative and 43 qualitative columns, so we can now start exploratory data analysis. I will start with the qualitative data.

# Exploratory data analysis on categorical data

Which columns are qualitative data?

In [None]:
train_qualitative.columns

How much categorical data is missing? Let's find out.

In [None]:
missing_train = train_qualitative.isnull().sum().sort_values(ascending=False)
percentage_train = (train_qualitative.isnull().sum()/train_qualitative.isnull().count()).sort_values(ascending=False)
train_info = pd.concat([missing_train,percentage_train],keys=['Missing','Percentage'],axis=1)
train_info.head(25)

In [None]:
fig = plt.figure(figsize=(10,5))
train_plot = sns.barplot(x=missing_train.index[0:20],y=missing_train[0:20])
train_plot.set_xticklabels(train_plot.get_xticklabels(),rotation=90)
plt.title('Number of missing values in categorical data(train)')

In [None]:
missing_test = test_qualitative.isnull().sum().sort_values(ascending=False)
percentage_test = (test_qualitative.isnull().sum()/test_qualitative.isnull().count()).sort_values(ascending=False)
test_info = pd.concat([missing_test,percentage_test],keys=['Missing','Percentage'],axis=1)
test_info.head(25)

In [None]:
fig = plt.figure(figsize=(10,5))
test_plot = sns.barplot(x=missing_test.index[0:20],y=missing_test[0:20])
test_plot.set_xticklabels(test_plot.get_xticklabels(),rotation=90)
plt.title('Number of missing values in categorical data(test)')
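
The same missing-value table is built twice above (once for train, once for test). As a hedged sketch, the logic could be wrapped in a small helper. The function name `missing_summary` and the toy frame `toy` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and fraction of missing values per column, worst first."""
    summary = pd.DataFrame({
        "Missing": df.isnull().sum(),
        # mean of a boolean mask equals the fraction of missing rows
        "Percentage": df.isnull().mean(),
    })
    # Keep only columns that actually have missing values
    return summary[summary["Missing"] > 0].sort_values("Missing", ascending=False)

# Toy data standing in for train_qualitative (illustrative values only)
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "Street": ["Pave", "Pave", "Pave", "Pave"],
})
report = missing_summary(toy)
print(report)
```

Note that, as in the cells above, the "Percentage" column holds a fraction between 0 and 1.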

As we can see, we are missing quite a lot of data. For the columns where only a few values are missing, I use the 'pad' (forward fill) method, which propagates the last valid observation forward to the next row. This has to run before the blanket fill below; otherwise there would be no NaN values left to pad.

In [None]:
train_qualitative['Electrical'] = train_qualitative['Electrical'].ffill()
test_qualitative['SaleType'] = test_qualitative['SaleType'].ffill()
test_qualitative['KitchenQual'] = test_qualitative['KitchenQual'].ffill()
test_qualitative['Exterior1st'] = test_qualitative['Exterior1st'].ffill()
test_qualitative['Exterior2nd'] = test_qualitative['Exterior2nd'].ffill()
test_qualitative['Functional'] = test_qualitative['Functional'].ffill()
test_qualitative['Utilities'] = test_qualitative['Utilities'].ffill()
test_qualitative['MSZoning'] = test_qualitative['MSZoning'].ffill()

For the remaining columns, a simple and effective solution is to replace NaN values with the string "None".

In [None]:
for column in train_qualitative.columns:
    train_qualitative[column] = train_qualitative[column].fillna("None")
for column in test_qualitative.columns:
    test_qualitative[column] = test_qualitative[column].fillna("None")
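
Forward fill only affects cells that are still NaN, so the order of the two fill steps matters. The sketch below uses a toy Series (illustrative values, not taken from the dataset) to show both orders side by side.

```python
import numpy as np
import pandas as pd

# Toy column with two gaps (illustrative values only)
s = pd.Series(["SBrkr", np.nan, "FuseA", np.nan], name="Electrical")

# Forward fill first, then fill whatever is left with "None"
padded_first = s.ffill().fillna("None")

# Blanket fill first: no NaN is left, so the forward fill changes nothing
filled_first = s.fillna("None").ffill()

print(padded_first.tolist())  # ['SBrkr', 'SBrkr', 'FuseA', 'FuseA']
print(filled_first.tolist())  # ['SBrkr', 'None', 'FuseA', 'None']
```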

Finally, any missing data in train data?

In [None]:
train_qualitative.isnull().sum().sum()

Or maybe something in test data?

In [None]:
test_qualitative.isnull().sum().sum()

Quick look at shape of our dataset.

In [None]:
train_qualitative.shape

In [None]:
test_qualitative.shape

Surprisingly, it was not that bad. We will see how it goes with the quantitative data.

# Exploratory data analysis on quantitative data

Let's check what really matters in the quantitative data.
Seaborn's heatmap is a great place to start.

In [None]:
top = 10
corr = train_quantitative.corr()
top10 = corr.nlargest(top,'SalePrice')['SalePrice'].index
corr_top10 = train_quantitative[top10].corr()
f,ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr_top10, square=True, ax=ax, annot=True, fmt='.2f', annot_kws={'size':12})
plt.title('Top correlated quantitative features of dataset')
plt.show()

In [None]:
corr = train_quantitative.corr()['SalePrice'].sort_values(ascending=False)
print(corr)

Looking at the heatmap, we can see that "OverallQual", "GrLivArea" and "TotalBsmtSF" are among the features most strongly correlated with "SalePrice".

Of course, we can't forget about the "GarageCars" and "GarageArea" features; however, they are closely related to each other. As "GarageArea" increases, the number of "GarageCars" tends to increase as well.

"GarageCars" and "GarageArea" are not the only features that show multicollinearity. A similar situation occurs between "TotRmsAbvGrd" and "GrLivArea".

The graphs below show how each of these features relates to "SalePrice". Based on them, I chose to keep the following features for further analysis:

- GarageArea
- GrLivArea

From each collinear pair I dropped the other feature ("GarageCars" and "TotRmsAbvGrd").
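
Collinear pairs can also be listed programmatically instead of being read off the heatmap. This is only a sketch on synthetic data: the helper name `correlated_pairs`, the 0.8 threshold and the toy columns are my assumptions, not values taken from the Ames dataset.

```python
import numpy as np
import pandas as pd

# Synthetic features: GarageCars is built from GarageArea plus noise (assumption)
rng = np.random.default_rng(1)
n = 500
garage_area = rng.uniform(200, 900, n)
toy = pd.DataFrame({
    "GarageArea": garage_area,
    "GarageCars": np.round(garage_area / 280 + rng.normal(0, 0.2, n)),
    "LotFrontage": rng.uniform(20, 120, n),  # independent of the other two
})

def correlated_pairs(df, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 2)))
    return pairs

pairs = correlated_pairs(toy)
print(pairs)
```

For each reported pair, one feature can then be kept and the other dropped.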

In [None]:
fig,ax = plt.subplots(2,2,figsize=(15,15))
sns.scatterplot(data=train_quantitative, x='SalePrice', y='GarageArea', ax=ax[0][0])
sns.scatterplot(data=train_quantitative, x='SalePrice', y='GarageCars', ax=ax[0][1])
sns.scatterplot(data=train_quantitative, x='SalePrice', y='TotRmsAbvGrd', ax=ax[1][0])
sns.scatterplot(data=train_quantitative, x='SalePrice', y='GrLivArea', ax=ax[1][1])

plt.show()

In [None]:
corr = train_quantitative.corr()['SalePrice'].sort_values(ascending=False)
print(corr)

In [None]:
train_quantitative = train_quantitative.drop(['GarageCars','TotRmsAbvGrd'], axis=1)
test_quantitative = test_quantitative.drop(['GarageCars','TotRmsAbvGrd'], axis=1)

The dataset has been tidied up a bit, but what about the other features? A quick look at the rest of the data lets us spot features without a strong linear correlation with "SalePrice", as well as numerical features that are really categorical.
To look at the whole dataset at once, the seaborn library will be very helpful again.

In [None]:
fig,ax = plt.subplots(17,2,figsize=(15,60))

for i in range(len(train_quantitative.columns)-1):
    #-1 in iterator to avoid regplot between "SalePrice" and "SalePrice"
    r=i//2
    c=i%2
    sns.scatterplot(data=train_quantitative, x=train_quantitative.columns[i], y='SalePrice', hue='SalePrice', palette='rocket', ax=ax[r][c])
    
plt.tight_layout()
plt.show()

As we can see, some of the numerical features are essentially categorical. Many of them show very weak correlations with our target, "SalePrice". I will move some of them to the qualitative data, while others will be used for feature engineering.

# Missing values

In [None]:
missing_train_num = train_quantitative.isnull().sum().sort_values(ascending=False)
percentage_train_num = (train_quantitative.isnull().sum()/train_quantitative.isnull().count()).sort_values(ascending=False)
train_info = pd.concat([missing_train_num,percentage_train_num],keys=['Missing','Percentage'],axis=1)
train_info.head(10)

In [None]:
fig = plt.figure(figsize=(10,5))
train_plot = sns.barplot(x=missing_train_num.index[0:5],y=missing_train_num[0:5])
train_plot.set_xticklabels(train_plot.get_xticklabels(),rotation=90)
plt.title('Number of missing values in numerical data(train)')

In [None]:
missing_test_num = test_quantitative.isnull().sum().sort_values(ascending=False)
percentage_test_num = (test_quantitative.isnull().sum()/test_quantitative.isnull().count()).sort_values(ascending=False)
test_info = pd.concat([missing_test_num,percentage_test_num],keys=['Missing','Percentage'],axis=1)
test_info.head(10)

In [None]:
fig = plt.figure(figsize=(10,5))
test_plot = sns.barplot(x=missing_test_num.index[0:5],y=missing_test_num[0:5])
test_plot.set_xticklabels(test_plot.get_xticklabels(),rotation=90)
plt.title('Number of missing values in numerical data(test)')

Because the lot frontage of a property is usually similar to that of other houses in its neighborhood, we can fill in the missing values with the median LotFrontage of the neighborhood.

In [None]:
train_quantitative['LotFrontage'] = train_quantitative.groupby(train_qualitative['Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))
test_quantitative['LotFrontage'] = test_quantitative.groupby(test_qualitative['Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))
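
To show what the group-wise median fill does, here is a small sketch on a toy frame. The values are made up for illustration; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods, each with one missing LotFrontage
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan],
})

# Within each neighborhood, replace NaN with that neighborhood's median
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda x: x.fillna(x.median()))
)
print(toy["LotFrontage"].tolist())  # [60.0, 80.0, 70.0, 50.0, 50.0]
```

The NAmes gap is filled with 70.0 (median of 60 and 80), and the OldTown gap with 50.0.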

For the rest of the missing data we will use the fillna method with the value 0. The exception is 'GarageYrBlt', which we will fill with its median value.

In [None]:
train_quantitative['GarageYrBlt']=train_quantitative['GarageYrBlt'].fillna(train_quantitative['GarageYrBlt'].median())
test_quantitative['GarageYrBlt']=test_quantitative['GarageYrBlt'].fillna(test_quantitative['GarageYrBlt'].median())

for column in train_quantitative.columns:
    train_quantitative[column] = train_quantitative[column].fillna(0)
for column in test_quantitative.columns:
    test_quantitative[column] = test_quantitative[column].fillna(0)

Finally, any missing data in train data?

In [None]:
train_quantitative.isnull().sum().sum()

Or maybe something in test data?

In [None]:
test_quantitative.isnull().sum().sum()

# Feature engineering

By carefully reading the dataset description, we can apply some feature engineering.

In [None]:
def engineer_features(df):
    """Combine related columns into aggregate features and drop the originals."""
    df = df.copy()
    # Total area: basement plus first and second floor
    df['TotalSF'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']
    # Sum of the construction year and the remodel year
    df['YrBltAndRemod'] = df['YearBuilt'] + df['YearRemodAdd']
    # Total finished basement area
    df['Bsmt'] = df['BsmtFinSF1'] + df['BsmtFinSF2']
    # Bathrooms, with half baths counted as 0.5
    df['TotalBathroom'] = (df['FullBath'] + (0.5 * df['HalfBath']) +
                           df['BsmtFullBath'] + (0.5 * df['BsmtHalfBath']))
    return df.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF',
                            'YearBuilt', 'YearRemodAdd',
                            'BsmtFinSF1', 'BsmtFinSF2',
                            'FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath'])

train_quantitative = engineer_features(train_quantitative)
test_quantitative = engineer_features(test_quantitative)

Let's check how our newly created features relate to SalePrice.

In [None]:
fig,ax = plt.subplots(14,2,figsize=(15,60))

for i in range(len(train_quantitative.columns)):
    r=i//2
    c=i%2
    sns.scatterplot(data=train_quantitative, x=train_quantitative.columns[i], y='SalePrice', hue='SalePrice', palette='viridis', ax=ax[r][c])
    
plt.tight_layout()
plt.show()

As we can see from the charts, there are still numerical features which are in fact categorical. I will move them to the categorical data and drop them from the numerical data. After that I will concatenate them with the qualitative datasets and use pd.get_dummies to obtain the final qualitative datasets.

In [None]:
numerical_to_categorical = ['TotalBathroom','Fireplaces','MSSubClass','OverallCond','BedroomAbvGr','LowQualFinSF','KitchenAbvGr','MoSold','YrSold','PoolArea','MiscVal','LotArea','3SsnPorch','ScreenPorch']

In [None]:
numerical_categorical_train=train_quantitative[numerical_to_categorical]
train_quantitative.drop(columns=numerical_to_categorical,inplace=True)
train_quantitative

In [None]:
numerical_categorical_test = test_quantitative[numerical_to_categorical]
test_quantitative.drop(columns=numerical_to_categorical, inplace=True)
test_quantitative

In [None]:
corr = train_quantitative.corr()['SalePrice'].sort_values(ascending=False)
print(corr)

In [None]:
train_qualitative = pd.concat([train_qualitative, numerical_categorical_train], axis=1)
test_qualitative = pd.concat([test_qualitative, numerical_categorical_test], axis=1)

In [None]:
qualitative = pd.concat((train_qualitative, test_qualitative), sort=False).reset_index(drop=True)
qualitative = pd.get_dummies(qualitative)

In [None]:
train_qualitative_final = qualitative[:train_qualitative.shape[0]]
test_qualitative_final = qualitative[train_qualitative.shape[0]:]
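
One detail worth knowing about `pd.get_dummies`: by default it only encodes object, string and category columns, while numeric columns pass through unchanged. The sketch below illustrates this on a toy frame; casting a numeric column to `str` first is one possible way to have it one-hot encoded, shown here as an assumption rather than as what this notebook does.

```python
import pandas as pd

# Toy frame: one object column and one integer-coded category
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "MSSubClass": [20, 60, 20],
})

# Default behavior: MSSubClass stays a single numeric column
default_encoded = pd.get_dummies(toy)
print(list(default_encoded.columns))

# After casting to str, MSSubClass is one-hot encoded as well
cast_encoded = pd.get_dummies(toy.astype({"MSSubClass": str}))
print(list(cast_encoded.columns))
```

Concatenating train and test before encoding, as done above, ensures that both end up with the same dummy columns.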

Another quick look at the dataset shapes.

In [None]:
train_qualitative_final.shape

In [None]:
test_qualitative_final.shape

Categorical data are finished. Let's get back to the numerical data.

# Outliers

- GrLivArea > 4000
- GarageArea > 1200
- TotalBsmtSF > 3000

In several notebooks I read that you should drop outliers; however, in my case the model achieves a lower (better) RMSE when the outliers are kept. If you want to remove outliers, you can find the code to do so below.

In [None]:
# train_quantitative = train_quantitative.drop(train_quantitative[(train_quantitative['GrLivArea']>4000) & (train_quantitative['SalePrice']<300000)].index)
# train_quantitative = train_quantitative.drop(train_quantitative[(train_quantitative['GarageArea']>1200) & (train_quantitative['SalePrice']<500000)].index)
# train_quantitative = train_quantitative.drop(train_quantitative[(train_quantitative['Bsmt']>3000) & (train_quantitative['SalePrice']<700000)].index)
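
If rows are removed from the quantitative frame, the same rows also have to be removed from every frame that is later merged with it by index. The following is only a sketch with toy frames `quant` and `qual` (hypothetical names and values) showing one way to keep them aligned.

```python
import pandas as pd

# Toy stand-ins for the quantitative and qualitative frames (made-up values)
quant = pd.DataFrame({
    "GrLivArea": [1500, 4700, 5600, 2000],
    "SalePrice": [180000, 184000, 160000, 250000],
})
qual = pd.DataFrame({"Neighborhood": ["NAmes", "Edwards", "Edwards", "CollgCr"]})

# Index labels of large houses that sold for an unusually low price
outliers = quant[(quant["GrLivArea"] > 4000) & (quant["SalePrice"] < 300000)].index

# Drop the same rows from both frames and renumber them identically
quant = quant.drop(outliers).reset_index(drop=True)
qual = qual.drop(outliers).reset_index(drop=True)

print(quant)
print(qual)
```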

# Skewed features

As I mentioned before, the distribution of our target, 'SalePrice', is far from Gaussian. We therefore need a transformation that brings it closer to normal. For this I use the log(1+x) transformation.

In [None]:
y_train = np.log1p(train_quantitative['SalePrice'])

In [None]:
train_quantitative.drop('SalePrice',axis=1, inplace=True)

In [None]:
sns.histplot(y_train, kde=True)

Now, as we can see, our target 'SalePrice' is much closer to a normal distribution. But what about other skewed features? Let's take a deeper look at them.

In [None]:
print('Train quantitative skewness')
skewed_features_train = []
for column in train_quantitative:
    skew = abs(train_quantitative[column].skew())
    print('{:15}'.format(column), 
          'Skewness: {:05.2f}'.format(skew))
    if skew > 0.5:
        skewed_features_train.append(column)

For all features with skewness above 0.5, a boxcox1p transformation will be applied.

In [None]:
skewed_features_train

In [None]:
lam = 0.15
for feat in skewed_features_train:
    train_quantitative[feat] = boxcox1p(train_quantitative[feat], lam)
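
For reference, `boxcox1p(x, lam)` computes `((1 + x)**lam - 1) / lam`, and reduces to `log1p(x)` when `lam` is 0. The short check below verifies this numerically on a few arbitrary sample values.

```python
import numpy as np
from scipy.special import boxcox1p

x = np.array([0.0, 1.0, 10.0, 100.0])  # arbitrary sample values
lam = 0.15

# Box-Cox transform of (1 + x), written out by hand
manual = ((1 + x) ** lam - 1) / lam
matches_formula = np.allclose(boxcox1p(x, lam), manual)

# With lam = 0 the transform is the same as log(1 + x)
matches_log1p = np.allclose(boxcox1p(x, 0), np.log1p(x))

print(matches_formula, matches_log1p)  # True True
```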

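And the same story for the test quantitative features. The skewness report below is informational; the transformation itself is applied to the same columns that were transformed in the training set, so that both sets share the same feature scale.

In [None]:
print('Test quantitative skewness')
skewed_features_test = []
for column in test_quantitative:
    skew = abs(test_quantitative[column].skew())
    print('{:15}'.format(column), 
          'Skewness: {:05.2f}'.format(skew))
    if skew > 0.5:
        skewed_features_test.append(column)

In [None]:
skewed_features_test

In [None]:
lam = 0.15
# Reuse the column list selected on the training set
for feat in skewed_features_train:
    test_quantitative[feat] = boxcox1p(test_quantitative[feat], lam)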

# Scaling

In [None]:
scaling = StandardScaler()
# Fit the scaler on the training data only, then apply the same transformation to the test data
train_quantitative_final = pd.DataFrame(scaling.fit_transform(train_quantitative),columns=train_quantitative.columns)
test_quantitative_final = pd.DataFrame(scaling.transform(test_quantitative),columns=test_quantitative.columns)
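
A common convention is to learn the scaling parameters (mean and standard deviation) from the training data only and reuse them for the test data, so that the same raw value maps to the same scaled value in both sets. A minimal sketch with toy frames (made-up values):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames standing in for the train and test quantitative data
train_toy = pd.DataFrame({"GrLivArea": [1000.0, 1500.0, 2000.0]})
test_toy = pd.DataFrame({"GrLivArea": [1500.0, 2500.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns mean=1500 from train
test_scaled = scaler.transform(test_toy)        # reuses the train mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this approach, a 1500 sq ft house is scaled to 0 in both sets. Calling `fit_transform` on the test frame instead would center it on the test mean.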

Merging quantitative and qualitative data.

In [None]:
train_final=train_quantitative_final.merge(train_qualitative_final,left_index=True,right_index=True).reset_index(drop=True)
train_final.head()

In [None]:
test_qualitative_final = test_qualitative_final.reset_index(drop=True)
test_final=test_quantitative_final.merge(test_qualitative_final,left_index=True,right_index=True).reset_index(drop=True)
test_final.head()

In [None]:
train_final.shape

In [None]:
test_final.shape

# Train test split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(train_final, y_train, test_size = .3, random_state=0)

# RMSE evaluation

In [None]:
def rmse(actual,predicted):
    return(str(np.sqrt(mean_squared_error(actual, predicted))))
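
Because the target is log-transformed, the RMSE computed here is an error on the log scale, which roughly corresponds to a relative error in price. The sketch below uses made-up prices and a hypothetical helper `rmse_value` (returning a float rather than a string) to show that predictions that are uniformly 10% too high give an RMSE close to log(1.1).

```python
import numpy as np

def rmse_value(actual, predicted):
    """Root mean squared error as a float."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up prices; every prediction is 10% too high
actual_prices = np.array([100000.0, 200000.0, 300000.0])
predicted_prices = actual_prices * 1.1

log_rmse = rmse_value(np.log1p(actual_prices), np.log1p(predicted_prices))
print(log_rmse, np.log(1.1))
```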

# Simple Linear Regression without regularization

In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)

y_pred_train = lin_reg.predict(X_train)
y_pred_test = lin_reg.predict(X_test)

print('RMSE train = ' + rmse(Y_train,y_pred_train))
print('RMSE test = ' + rmse(Y_test,y_pred_test)) 
print()

# Lasso Regression (L1 regularization)

In [None]:
lasso_reg = Lasso()
parameters = {'alpha': [0.0005,0.001,0.1,1,5,10,20]}

# Grid search over alpha; the default score for regressors is R^2
lasso_reg = GridSearchCV(lasso_reg, param_grid=parameters)
lasso_reg.fit(X_train,Y_train)
alpha = lasso_reg.best_params_
lasso_score = lasso_reg.best_score_
print("The best alpha value found is:",alpha['alpha'],'with score:',lasso_score)

# Fit on the training split only, so that the RMSE on X_test is a true hold-out estimate
lasso_reg_alpha = Lasso(alpha=alpha['alpha'])
lasso_reg_alpha.fit(X_train,Y_train)
y_pred_train = lasso_reg_alpha.predict(X_train)
y_pred_test = lasso_reg_alpha.predict(X_test)

print('RMSE train = ' + rmse(Y_train,y_pred_train))
print('RMSE test = ' + rmse(Y_test,y_pred_test))
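
By default, `GridSearchCV` scores regressors with R², so `best_score_` above is not an RMSE. If the goal is to tune directly on RMSE, the `scoring` parameter can be set to `'neg_root_mean_squared_error'`. The sketch below uses synthetic data and an arbitrary alpha grid; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (assumption: 5 features, 2 of them irrelevant)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
    scoring="neg_root_mean_squared_error",  # higher is better, hence the sign
    cv=5,
)
search.fit(X, y)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(search.best_params_, best_rmse)
```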

# Ridge Regression (L2 regularization)

In [None]:
ridge = Ridge()
parameters = {'alpha': [0.0005,0.001,0.1,0.2,0.4,0.5,0.7,0.8,1]}

# Grid search over alpha; the default score for regressors is R^2
ridge_reg = GridSearchCV(ridge, param_grid=parameters)
ridge_reg.fit(X_train,Y_train)
alpha = ridge_reg.best_params_
ridge_score = ridge_reg.best_score_
print("The best alpha value found is:",alpha['alpha'],'with score:',ridge_score)

# Evaluate on the hold-out split with a model that has not seen X_test
ridge_reg_alpha = Ridge(alpha=alpha['alpha'])
ridge_reg_alpha.fit(X_train,Y_train)
y_pred_train = ridge_reg_alpha.predict(X_train)
y_pred_test = ridge_reg_alpha.predict(X_test)

print('RMSE train = ' + rmse(Y_train,y_pred_train))
print('RMSE test = ' + rmse(Y_test,y_pred_test))

# Refit on the full training set for the final predictions
ridge_reg_alpha.fit(train_final,y_train)

# Random Forest Regressor

In [None]:
rf_reg = RandomForestRegressor()
parameters = {"max_depth":[5, 8, 15, 25, 30], "n_estimators":[25,50,100,200]}

rf_reg_param = GridSearchCV(rf_reg, parameters, cv = 10, n_jobs =10)
rf_reg_param.fit(X_train, Y_train)
rf_reg_best=rf_reg_param.best_estimator_
y_pred_train = rf_reg_best.predict(X_train)
y_pred_test = rf_reg_best.predict(X_test)

print('RMSE train = ' + rmse(Y_train,y_pred_train))
print('RMSE test = ' + rmse(Y_test,y_pred_test))

# XGBoost Regressor

In [None]:
import xgboost as xgb 

# In recent xgboost versions, early_stopping_rounds is set on the estimator rather than in fit()
xgb_reg = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=5)
xgb_reg.fit(X_train, Y_train,
            eval_set=[(X_test, Y_test)], verbose=False)

In [None]:
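xgb_reg_param = xgb.XGBRegressor(learning_rate=0.05,
                      n_estimators=1000,
                      max_depth=3)

# Evaluate on the hold-out split with a model that has not seen X_test
xgb_reg_param.fit(X_train, Y_train)
xgb_train_pred = xgb_reg_param.predict(X_train)
xgb_test_pred = xgb_reg_param.predict(X_test)

print('RMSE train = ' + rmse(Y_train,xgb_train_pred))
print('RMSE test = ' + rmse(Y_test,xgb_test_pred))

# Refit on the full training set for the final predictions
xgb_reg_param.fit(train_final, y_train)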

# Conclusions and possible future development of the model 

Among the tested models, the XGBRegressor performed best. Therefore, I used it for the final submission.

For simple linear regression and the random forest regressor, the RMSE on the test set differs significantly from the RMSE on the training set, which may indicate overfitting. It is also interesting to note that the model with outliers performs better than the model without them. Any feedback related to this problem will be very helpful.

Looking through other notebooks, I noticed that the best performing models are based on blended regressions. My goal for the future is to build a blended model that achieves a better score.

# Blended approach

To test blended regression, I decided to build a simple model and tune it by trial and error. For this purpose I chose the L2-regularized (Ridge) model and the XGBoost regressor because they achieved the best RMSE scores on their own.

In [None]:
def blended_regression(X):
    return ((0.3 * ridge_reg_alpha.predict(X)) + (0.7 * xgb_reg_param.predict(X)))

In [None]:
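# Both models were last fit on the full training set (which includes X_test),
# so the "test" RMSE printed here is optimistic
y_pred_train = blended_regression(X_train)
y_pred_test = blended_regression(X_test)
print('RMSE train = ' + rmse(Y_train,y_pred_train))
print('RMSE test = ' + rmse(Y_test,y_pred_test))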
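
Instead of picking the blend weights by hand, they can be scanned over a small grid. The sketch below is only an illustration on synthetic predictions: `pred_a` and `pred_b` are hypothetical stand-ins for the two models' hold-out predictions, and in practice the weights should be chosen on data that neither model was trained on.

```python
import numpy as np

def rmse_value(actual, predicted):
    """Root mean squared error as a float."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic log-prices and two noisy "model" predictions (assumption)
rng = np.random.default_rng(0)
y_true = rng.normal(12.0, 0.4, 300)
pred_a = y_true + rng.normal(0, 0.15, 300)  # noisier model
pred_b = y_true + rng.normal(0, 0.10, 300)  # more accurate model

# Try weights 0.0, 0.1, ..., 1.0 for model A; model B gets the remainder
weights = np.linspace(0, 1, 11)
scores = [rmse_value(y_true, w * pred_a + (1 - w) * pred_b) for w in weights]

best_w = float(weights[int(np.argmin(scores))])
print("Best weight for model A:", best_w, "RMSE:", min(scores))
```

When the two models make partly independent errors, a weight strictly between 0 and 1 usually beats either model alone.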

# Submission

In [None]:
y_test=blended_regression(test_final)

In [None]:
final_y_test=np.expm1(y_test)
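
The models were trained on `log1p(SalePrice)`, so predictions have to be mapped back to prices with the inverse function, `expm1`. A quick round-trip check on a few made-up prices:

```python
import numpy as np

prices = np.array([34900.0, 163000.0, 755000.0])  # made-up example prices

log_prices = np.log1p(prices)    # forward transform used for the target
restored = np.expm1(log_prices)  # inverse transform used for the submission

round_trip_ok = np.allclose(restored, prices)
print(round_trip_ok)  # True
```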

In [None]:
sample=pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission=pd.DataFrame({"Id":sample['Id'],
                         "SalePrice":final_y_test})
submission.to_csv('submission.csv',index=False)

In [None]:
final_y_test