<br>



<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 35px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">House Price Problem&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>

![](https://miro.medium.com/max/1024/1*Zr0rsnWzE0A_fqCHfDndMA.jpeg)

#### If this helped your learning, please **UPVOTE**, as upvotes are a source of motivation!

<br>



<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Main Steps:&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>

#### **1. Data Cleaning**
#### **2. EDA**
#### **3. Pre-Processing**
#### **4. Model training and evaluating**

<br>



<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Import Libraries&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import StackingRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train_data.head()

<br>



<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">&nbsp;&nbsp;Basic Info&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>

In [None]:
train_data.info()

In [None]:
train_data.describe()



Split the object-type columns from the numeric columns so each group can be analyzed on its own first.

In [None]:
num_data = train_data.select_dtypes(exclude=['object']).copy()
cat_data = train_data.select_dtypes(include=['object']).copy()

<br>



<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">&nbsp;&nbsp;Numeric Data Analysis&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>


In [None]:
num_data.info()

In [None]:
num_data.head()

Checking whether any column contains NaN values: LotFrontage, MasVnrArea, and GarageYrBlt do.

In [None]:
num_data.isnull().sum()
# LotFrontage GarageYrBlt MasVnrArea

Filling these columns: LotFrontage with the median, MasVnrArea with the mean, and GarageYrBlt with the mode. I use the mode for GarageYrBlt because it holds years, and the most frequent year is a reasonable fill for NaN.

In [None]:
num_data['LotFrontage'] = num_data['LotFrontage'].fillna(abs(num_data['LotFrontage'].median()))
num_data['MasVnrArea'] = num_data['MasVnrArea'].fillna(abs(num_data['MasVnrArea'].mean()))

In [None]:
num_data['GarageYrBlt'] = num_data['GarageYrBlt'].fillna(num_data['GarageYrBlt'].mode()[0])

<br>



<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Checking the Skewness of Columns with density distribution&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>


Some columns are skewed, so a log transformation is used to reduce the skew. Here the LotArea column is replaced with its log to bring it from a positive skew closer to a normal distribution.

In [None]:
num_data['LotArea'] = np.log(num_data.loc[:,'LotArea'])

Plot the transformed column to see how its distribution changes.

In [None]:
# kde expects a bool; True overlays the density curve on the histogram
sns.displot(x=num_data.loc[:,'LotArea'].dropna(),kde=True,color='b')
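
To put a number on the change rather than only reading it off the plot, skewness can be compared before and after the transform. The sketch below uses a synthetic right-skewed column (`lot_area` is a made-up stand-in, not the Kaggle data) and pandas' `Series.skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for a column such as LotArea (assumption).
rng = np.random.default_rng(0)
lot_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=2000))

skew_before = lot_area.skew()          # skewness of the raw values
skew_after = np.log(lot_area).skew()   # skewness after the log transform

print(f"skew before log: {skew_before:.2f}")
print(f"skew after log:  {skew_after:.2f}")
```

A value near 0 after the transform suggests the column is now roughly symmetric.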

Checking how some columns relate to the SalePrice column.
MSSubClass 20 has the largest effect on SalePrice, followed by 60, 50, and so on.

In [None]:
plt.figure(figsize=(12,14))
plt.subplot(2,2,1)
num_data.groupby('MSSubClass')['SalePrice'].count().plot(kind='bar',legend=True)
plt.title('Count')
plt.subplot(2,2,2)
num_data.groupby('MSSubClass')['SalePrice'].sum().plot(kind='bar',legend=True)
plt.title('Sum')

Checking which years had the most sales: 2009 and 2007 had the most.

In [None]:
num_data.groupby('YrSold')['SalePrice'].count().plot(kind='bar',legend=True)

In [None]:
corrr = num_data.corr()['SalePrice']
corrr = corrr[corrr >0.7]
corrr

Checking the correlations. Some columns are highly correlated with each other, so drop them.

In [None]:
plt.figure(figsize=(22,24))
sns.heatmap(num_data.corr()>0.7,annot=True)       


In [None]:
num_data.drop(['GrLivArea','OverallQual','GarageCars'],axis=1,inplace=True)
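
Reading pairs off a large heatmap is error-prone. As a rough alternative, the highly correlated pairs can be listed directly from the correlation matrix. `high_corr_pairs` is a hypothetical helper, and the small frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.7):
    """Return (col_a, col_b, corr) for every column pair whose correlation exceeds threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            value = corr.iloc[i, j]
            if value > threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Invented toy data: 'b' is nearly a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Calling the helper on a numeric frame such as `num_data` would list each pair once, which makes it easier to decide which column of a pair to drop.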

<br>

<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Checking again the relationship of columns with SalePrice&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>


In [None]:
fig = plt.figure(figsize=(24,24))
for ind,col in enumerate(num_data):
    plt.subplot(6,8,ind+1)
    sns.scatterplot(x=num_data.loc[:,col],y='SalePrice',data=num_data.dropna())

fig.tight_layout(pad=1.5)

From the scatter plots we see that some columns contain outliers, so drop those rows.

In [None]:
num_data = num_data.drop(num_data[num_data['LotFrontage'] > 200].index)
# LotArea is on a log scale at this point, so the threshold is log-transformed too
num_data = num_data.drop(num_data[num_data['LotArea'] > np.log(100000)].index)
num_data = num_data.drop(num_data[num_data['Fireplaces'] > 2.5].index)
num_data = num_data.drop(num_data[num_data['TotalBsmtSF'] > 5000].index)
num_data = num_data.drop(num_data[num_data['BsmtFinSF1'] > 4000].index)

In [None]:
num_data.drop(['Id'],axis=1,inplace=True)

#### **After removing irrelevant data, we check the correlations again to finalize the columns**

In [None]:
plt.figure(figsize=(22,24))
sns.heatmap(num_data.corr(),annot=True)

<br>

<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Now We analyze the Categorical Data&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>


In [None]:
cat_data.isnull().sum()

In [None]:
cat_data.columns

Checking the frequency of each value in GarageType. We do this because the NaN values have to be filled with something, and a categorical column has no mean or median: it can only be filled with one of its own values. So we first check which value occurs most often, then use that value in place of NaN.

In [None]:
train_data.groupby('GarageType')['SalePrice'].count()
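
The same idea can be written without reading the counts by hand: `value_counts()` shows the frequencies and `mode()[0]` picks the most frequent value. A minimal sketch on an invented column (the values are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Invented sample; the real GarageType column has more categories.
garage = pd.Series(["Attchd", "Detchd", "Attchd", np.nan, "BuiltIn", "Attchd", np.nan])

print(garage.value_counts())        # frequency of each non-missing value
most_common = garage.mode()[0]      # mode() ignores NaN by default
filled = garage.fillna(most_common)
print(filled.tolist())
```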


<br>

<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Checking Frequencies of Every Column&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>


In [None]:
fig = plt.figure(figsize=(18,16))
for ind,col in enumerate(cat_data):
    plt.subplot(9,6,ind+1)
    sns.countplot(x=cat_data.loc[:,col],data=cat_data.dropna())

fig.tight_layout(pad=1.5)

Some columns have too many NaN values, and others are dominated by a single category in the count plots above, so drop them.

In [None]:
cat_data.drop(['Alley','PoolQC','Fence','MiscFeature','FireplaceQu','Heating','RoofMatl','Condition2','Utilities','Street'],axis=1,inplace=True)



<br>

<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Now filling the Nan&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>


In [None]:
cat_data['MasVnrType'] = cat_data['MasVnrType'].fillna('None')
cat_data['BsmtQual'] = cat_data['BsmtQual'].fillna('TA')
cat_data['BsmtCond'] = cat_data['BsmtCond'].fillna('TA')
cat_data['BsmtExposure'] = cat_data['BsmtExposure'].fillna('No')
cat_data['BsmtFinType1'] = cat_data['BsmtFinType1'].fillna('Unf')
cat_data['BsmtFinType2'] = cat_data['BsmtFinType2'].fillna('Unf')
cat_data['GarageType'] = cat_data['GarageType'].fillna('Attchd')
cat_data['GarageFinish'] = cat_data['GarageFinish'].fillna('Unf')
# 'Unf' is not a GarageQual/GarageCond level; 'TA' is the most frequent value in both columns
cat_data['GarageQual'] = cat_data['GarageQual'].fillna('TA')
cat_data['GarageCond'] = cat_data['GarageCond'].fillna('TA')
cat_data['Electrical'] = cat_data['Electrical'].fillna('SBrkr')


In [None]:
cat_data.head()

#### **ML models cannot work with string data, so encode the columns**

In [None]:

le = LabelEncoder()
for col in cat_data.columns:
    le.fit(cat_data[col].values)
    cat_data[col] = le.transform(cat_data[col].values)
    

In [None]:
cat_data.head()

Now all the EDA and pre-processing is done.
We now repeat the important steps on train_data, because train_data is what we use from here on.

In [None]:
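# Assign the result back; fillna(inplace=True) on a column selection may not update the DataFrame
train_data['LotFrontage'] = train_data['LotFrontage'].fillna(train_data['LotFrontage'].median())
train_data['MasVnrArea'] = train_data['MasVnrArea'].fillna(train_data['MasVnrArea'].median())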

# Log Transformation: Applied on High Variance columns


train_data['LotArea'] = np.log(train_data['LotArea'])

# Dropped Skewed Columns

skewed_cols = ['LowQualFinSF','3SsnPorch','PoolArea','MiscVal','GarageYrBlt']
train_data.drop(skewed_cols,axis=1,inplace=True)

# Finding Outliers 
for col in train_data.columns:
    
    if train_data[col].dtype !='object':
        first_quartile = train_data[col].quantile(0.25) 
        third_quartile = train_data[col].quantile(0.75)

        IQR = third_quartile - first_quartile
        out = third_quartile + 3*IQR 
        train_data.drop(train_data[train_data[col] > out].index,axis=0,inplace=True)



train_data.drop(['Id','GarageCars'],axis=1,inplace=True)

# Categorical : 


drop_col = ['Alley','PoolQC','MiscFeature','Fence']
train_data.drop(drop_col,axis=1,inplace=True)

# Fill missing categorical values with each column's most frequent value
mode_fill_cols = ['MasVnrType','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
                  'FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','Electrical']
for col in mode_fill_cols:
    train_data[col] = train_data[col].fillna(train_data[col].mode()[0])


le = LabelEncoder()
for col in train_data.columns:
    if train_data[col].dtype == 'object':
        train_data[col] = le.fit_transform(train_data[col])



The `dtype == 'object'` check in the loop above is only there to find the object-type columns. We reuse it in later work.
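
The outlier step earlier in this cell drops rows lying above `Q3 + 3 * IQR`. The sketch below isolates that rule in a hypothetical helper, `upper_fence`, and runs it on invented numbers. It also shows a side effect worth knowing: when most of a column holds a single value (for example zeros), the IQR is 0 and every larger value falls above the fence.

```python
import pandas as pd

def upper_fence(series, k=3.0):
    """Upper outlier bound Q3 + k * IQR for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Invented values with one clear outlier.
areas = pd.Series([10, 12, 11, 13, 12, 14, 11, 200])
fence = upper_fence(areas)
kept = areas[areas <= fence]
print(fence, kept.tolist())

# Invented mostly-zero column: the IQR is 0, so the fence is 0 as well.
mostly_zero = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 5, 7])
zero_fence = upper_fence(mostly_zero)
print(zero_fence, (mostly_zero > zero_fence).sum())
```

Checking how many rows each column would remove before dropping them is a cheap way to spot columns where the rule is too aggressive.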

In [None]:
train_data.isnull().sum()

<br>

<a id="imports"></a>

<h1 style="font-family: Verdana; font-size: 20px; font-style: normal; font-weight: bold; text-decoration: none; text-transform: none; letter-spacing: 1px; background-color: #ffffff; color: #03045e;" id="imports">Checking the Updated train_data at last time&nbsp;&nbsp;&nbsp;&nbsp;<a href="#toc">&#10514;</a></h1>


#### **The same work is done on test_data as on train_data, except for dropping the rows that contain outliers.**

In [None]:
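# Assign the result back; fillna(inplace=True) on a column selection may not update the DataFrame
test_data['LotFrontage'] = test_data['LotFrontage'].fillna(test_data['LotFrontage'].median())
test_data['MasVnrArea'] = test_data['MasVnrArea'].fillna(test_data['MasVnrArea'].median())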

# Log Transformation: Applied on High Variance columns


test_data['LotArea'] = np.log(test_data['LotArea'])

# Dropped Skewed Columns

skewed_cols = ['LowQualFinSF','3SsnPorch','PoolArea','MiscVal','GarageYrBlt']
test_data.drop(skewed_cols,axis=1,inplace=True)

test_data.drop(['Id','GarageCars'],axis=1,inplace=True)

# Categorical : 


drop_col = ['Alley','PoolQC','MiscFeature','Fence']
test_data.drop(drop_col,axis=1,inplace=True)

# Fill missing categorical values with each column's most frequent value
mode_fill_cols = ['MasVnrType','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2',
                  'FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','Electrical']
for col in mode_fill_cols:
    test_data[col] = test_data[col].fillna(test_data[col].mode()[0])


le = LabelEncoder()
for col in test_data.columns:
    if test_data[col].dtype == 'object':
        test_data[col] = le.fit_transform(test_data[col])


test_data['BsmtFinSF2'] = test_data['BsmtFinSF2'].fillna(test_data['BsmtFinSF2'].median())
test_data['BsmtHalfBath'] = test_data['BsmtHalfBath'].fillna(test_data['BsmtHalfBath'].median())
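
Fitting a separate `LabelEncoder` on the test set can assign different integers to the same category than the encoder fitted on the training set, for instance when a category is missing from one of the two files. One possible alternative, sketched here on invented frames, is to fit scikit-learn's `OrdinalEncoder` on the training columns only and reuse it for the test columns. This is an illustration under those assumptions, not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented frames; 'Wood' never appears in the training data.
train = pd.DataFrame({"Heating": ["GasA", "GasW", "GasA", "Wall"]})
test = pd.DataFrame({"Heating": ["Wall", "GasA", "Wood"]})

# Unknown test categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train[["Heating"]])
test_enc = encoder.transform(test[["Heating"]])

print(encoder.categories_[0])   # categories learned from the training data
print(test_enc.ravel())         # test codes using the training mapping
```

With one shared mapping, a given integer means the same category in both sets, which is what the model assumes at prediction time.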

In [None]:
features = train_data.drop(['SalePrice'],axis=1)  # named 'features' to avoid shadowing the built-in input()
target = train_data.SalePrice
x_train,x_test,y_train,y_test = train_test_split(features,target,test_size=0.2)

#### **I applied multiple models to find the best one among them**

In [None]:
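# The parameter is spelled colsample_bytree. With booster='gblinear', the tree-specific
# settings (subsample, min_child_weight, max_depth, colsample_bytree) are not used.
xg = xgb.XGBRegressor(subsample=0.1,
                      n_estimators=700,
                      min_child_weight=2,
                      max_depth=4,
                      learning_rate=0.2,
                      colsample_bytree=1,
                      booster='gblinear',
                      alpha=22)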
xg.fit(x_train,y_train)
xg.score(x_test,y_test)
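
Since several regressors are imported at the top, one way to compare them is cross-validated RMSE. The sketch below uses synthetic data from `make_regression` in place of the house-price features; the model settings are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and target (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gboost": GradientBoostingRegressor(random_state=0),
}

rmse = {}
for name, model in models.items():
    # scikit-learn returns negated errors so that higher is always better.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

# Print the models from lowest to highest cross-validated RMSE.
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name:8s} RMSE = {value:.2f}")
```

Cross-validation averages over several splits, so the ranking depends less on one lucky or unlucky `train_test_split`.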

In [None]:
pred = xg.predict(test_data)

In [None]:
test_data

In [None]:
pred = xg.predict(test_data)
test_data['Price'] = pred

In [None]:
test_data

In [None]:
sub = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')

In [None]:
sub['SalePrice'] = pred
sub.to_csv('my_submission.csv',index=False)

In [None]:
sub = pd.read_csv('./my_submission.csv')
sub