# Quick start for the house price competition
This kernel is for beginners who want a quick journey through a whole house price prediction project. It covers loading the data, dealing with missing values, preprocessing both categorical and numeric features for training, and the modeling steps.

Because predicting a house's price is a regression problem, many powerful regression models are appropriate here. After trying several base models, the results suggest that an XGBoost regression model may be the most suitable for this problem. Even though it takes some time to train, it gives a satisfying result for a quick-start model.

I encourage you to fork this kernel, play with the code, and get an overview before jumping into this competition. Good luck!

If you like this kernel, please give it an upvote. Thank you!

## Model performance
This kernel scores about 0.13185 on the leaderboard, which ranks in the top 10% of competitors.

## Kernel outline

15/8/2021
* [**1. Loading data**](#1)
* [**2. Missing value**](#2)
* [**3. Feature engineering**](#3)
    * [3.1 Numeric features](#3.1)
    * [3.2 Categorical features](#3.2)
* [**4. Modeling**](#4)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
import missingno as msno
%matplotlib inline  

from scipy import stats
from sklearn import preprocessing
from sklearn import feature_selection
import warnings
warnings.filterwarnings('ignore')
SEED = 42

<a name='1'></a>
# 1. Loading data

In [None]:
def concat_df(train_data, test_data):
    # Returns a concatenated df of training and test set
    return pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

def divide_df(all_data):
    # Returns divided dfs of training and test set
    # Rows 0-1459 are the training rows; .loc slicing is label-based and inclusive
    return all_data.loc[:1459], all_data.loc[1460:]

df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
y_train = df_train.SalePrice
id_val = df_train.Id
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df_all = concat_df(df_train, df_test).drop(['SalePrice', 'Id'], axis=1)

df_train.name = 'Training Set'
df_test.name = 'Test Set'
df_all.name = 'All Set' 
dfs = [df_train, df_test]

In [None]:
df_all.head()

<a name='2'></a>
# 2. Missing value
I divide the features with missing values into 3 types:
- (1) Features with fewer than 100 missing values
- (2) Features with more than 1000 missing values
- (3) The remaining features with missing values
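
As a minimal sketch of the three types above, the helper below groups columns by their missing-value count. The function name `bucket_missing`, its `low`/`high` parameters, and the toy data are assumptions made for illustration; they are not part of the kernel's own code.

```python
import numpy as np
import pandas as pd

def bucket_missing(df, low=100, high=1000):
    """Group columns that contain missing values by how many values are missing."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]  # keep only columns with at least one missing value
    return {
        'few': counts[counts < low].index.tolist(),
        'many': counts[counts > high].index.tolist(),
        'other': counts[(counts >= low) & (counts <= high)].index.tolist(),
    }

# Toy frame (hypothetical data): 2000 rows with three levels of missingness
toy = pd.DataFrame({
    'a': [np.nan] * 5 + [1.0] * 1995,     # 5 missing    -> type (1)
    'b': [np.nan] * 1500 + ['x'] * 500,   # 1500 missing -> type (2)
    'c': [np.nan] * 300 + ['y'] * 1700,   # 300 missing  -> type (3)
    'd': [0] * 2000,                      # nothing missing, so not listed
})
buckets = bucket_missing(toy)
print(buckets)
```

Calling `bucket_missing(df_all)` on the real data would list which columns fall into each type before deciding how to handle them.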

In [None]:
for df in dfs:
    print(f'Only features contained missing value in {df.name}')
    temp = df.isnull().sum()
    print(temp.loc[temp!=0], '\n')

In [None]:
null_features = df_all.isnull().sum()

# For features with fewer than 100 missing values
null_100 = df_all.columns[list((null_features < 100) & (null_features != 0))]
num = df_all[null_100].select_dtypes(include=np.number).columns
non_num = df_all[null_100].select_dtypes(include='object').columns
# Numeric features --> Fill with their median
df_all[num] = df_all[num].apply(lambda x: x.fillna(x.median()))
# Object features --> Fill with the most frequent value of the feature
df_all[non_num] = df_all[non_num].apply(lambda x: x.fillna(x.value_counts().index[0]))


# For features with more than 1000 missing values --> Drop them
null_1000 = df_all.columns[list(null_features > 1000)]
df_all.drop(null_1000, axis=1, inplace=True)
# 'GarageYrBlt' and 'LotFrontage' are dropped as well
df_all.drop(['GarageYrBlt', 'LotFrontage'], axis=1, inplace=True)


# For the other features with missing values --> Fill NA values with "Null"
for col in ['GarageCond', 'GarageFinish', 'GarageQual', 'GarageType']:
    df_all[col] = df_all[col].fillna('Null')

In [None]:
df_train, df_test = divide_df(df_all)
df_train = pd.concat([df_train, y_train], axis=1)  # Concatenate for analysis

# Count the columns that still contain missing values
print(df_all.isnull().any().sum())

<a name='3'></a>
# 3. Feature engineering

#### Binning all features that represent a year, then encoding the bins with label encoding
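
As a small illustration of what quantile binning with `duplicates='drop'` does, the sketch below bins a toy series of sale years. The toy data is made up, and `.cat.codes` is used here as a lightweight stand-in for a label encoder, so the integer values may differ from what `LabelEncoder` returns when a bin is empty.

```python
import pandas as pd

# Toy sale years (hypothetical data) with only four distinct values
yr_sold = pd.Series([2006] * 4 + [2007] * 3 + [2008] * 2 + [2009])

# Ask for 10 quantile bins; repeated bin edges are dropped instead of raising an error
binned = pd.qcut(yr_sold, 10, duplicates='drop')

# Turn each interval into an integer code that preserves the order of the bins
codes = binned.cat.codes

print(binned.cat.categories)
print(codes.tolist())
```

With so few distinct years, fewer than 10 bins survive, which is why `duplicates='drop'` is needed for a column such as `YrSold`.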

In [None]:
# Bin "YearBuilt", "YearRemodAdd" & "YrSold" into (up to) 10 quantile bins
df_all['YearBuilt'] = pd.qcut(df_all['YearBuilt'], 10, duplicates='drop')
df_all['YearRemodAdd'] = pd.qcut(df_all['YearRemodAdd'], 10, duplicates='drop')
df_all['YrSold'] = pd.qcut(df_all['YrSold'], 10, duplicates='drop')

In [None]:
# Label-encode the binned year features as integers
for cate_col in ['YearBuilt', 'YearRemodAdd', 'YrSold']:
    df_all[cate_col] = preprocessing.LabelEncoder().fit_transform(df_all[cate_col].values)
    
df_train, df_test = divide_df(df_all)

<a name='3.1'></a>
## 3.1 Numeric features

#### Adding some important features

In [None]:
# Total square feet of porch in a house
df_all['TotalPorchSF'] = (df_all['OpenPorchSF'] + df_all['3SsnPorch'] +
                          df_all['EnclosedPorch'] + df_all['ScreenPorch'] + df_all['WoodDeckSF'])
# Whether the house has a garage
df_all['HasGarage'] = df_all['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
# Total number of bathrooms (half baths count as 0.5)
df_all['TotalBath'] = (df_all['FullBath'] + (0.5 * df_all['HalfBath']) +
                       df_all['BsmtFullBath'] + (0.5 * df_all['BsmtHalfBath']))
# Whether the house has a fireplace
df_all['HasFireplace'] = df_all['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
# Total number of bathrooms in the basement
df_all['TotalBsmtbath'] = df_all['BsmtFullBath'] + (0.5 * df_all['BsmtHalfBath'])
# Total square feet: finished basement areas plus first and second floors
df_all['TotalSF'] = df_all['BsmtFinSF1'] + df_all['BsmtFinSF2'] + df_all['1stFlrSF'] + df_all['2ndFlrSF']

In [None]:
# These columns are used for generating above new features --> Drop the old features
df_all.drop(['OpenPorchSF', '3SsnPorch', 'EnclosedPorch', 'ScreenPorch', 'WoodDeckSF', 'FullBath', 'HalfBath',
            'BsmtFullBath', 'BsmtHalfBath'], axis=1, inplace=True)

#### Choosing numeric features and normalizing highly skewed features
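
To show the effect of the skewness step on its own, here is a minimal sketch on synthetic data. The column names and distributions are invented for illustration; the 0.5 threshold and the `np.log1p` transform mirror the cells in this section.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'area': rng.lognormal(mean=9.0, sigma=1.0, size=1000),     # strongly right-skewed
    'quality': rng.integers(1, 11, size=1000).astype(float),   # roughly symmetric
})

# Skewness of every column, then keep the ones above the threshold
skewness = toy.apply(lambda col: stats.skew(col))
skewed_cols = skewness[skewness.abs() > 0.5].index

# log1p compresses the long right tail of the skewed columns
transformed = toy.copy()
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])

print(skewness)
print(transformed.apply(lambda col: stats.skew(col)))
```

Only the skewed column is transformed; the roughly symmetric one is left untouched.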

In [None]:
num_features = ['OverallQual', 'GrLivArea', 'TotalSF', 'GarageCars', 'TotalBath', 'GarageArea', 'TotalBsmtSF',
 '1stFlrSF', 'TotRmsAbvGrd', 'MasVnrArea', 'HasFireplace', 'Fireplaces', 'TotalPorchSF', '2ndFlrSF',
 'LotArea', 'HasGarage', 'TotalBsmtbath', 'BsmtUnfSF', 'YearBuilt', 'YearRemodAdd', 'YrSold']

# Drop the unused numeric columns also
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
num_cols = df_all.select_dtypes(include=numeric_dtypes).columns
drop_num = np.setdiff1d(num_cols, num_features)

df_all.drop(drop_num, axis=1, inplace=True)

In [None]:
# Reduce skewness with a log transform
skew_features = df_all[num_features].apply(lambda x: stats.skew(x)).sort_values(ascending=False)
skew_features = skew_features[abs(skew_features) > 0.5]
print(skew_features) 

# Apply log1p to every feature whose absolute skewness exceeds 0.5
for feat in skew_features.index:
    df_all[feat] = np.log1p(df_all[feat])

df_train, df_test = divide_df(df_all)

In [None]:
df_train[num_features].head()

<a name='3.2'></a>
## 3.2 Categorical features

#### Some features have values that exist in the training dataset but not in the testing dataset, so we fix that
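
One way to find such values is to compare the sets of categories per column. The sketch below is an assumption about how the mismatches could be located, not the method used to build this kernel; `train_only_categories` and the toy frames are hypothetical.

```python
import pandas as pd

def train_only_categories(train, test, columns):
    """Map each column to the category values seen in train but not in test."""
    result = {}
    for col in columns:
        extra = set(train[col].dropna().unique()) - set(test[col].dropna().unique())
        if extra:
            result[col] = sorted(extra)
    return result

# Toy frames (hypothetical data)
toy_train = pd.DataFrame({'Heating': ['GasA', 'GasW', 'OthW', 'Floor'],
                          'Street': ['Pave', 'Grvl', 'Pave', 'Pave']})
toy_test = pd.DataFrame({'Heating': ['GasA', 'GasW', 'GasA', 'Grav'],
                         'Street': ['Pave', 'Pave', 'Grvl', 'Pave']})

mismatch = train_only_categories(toy_train, toy_test, ['Heating', 'Street'])
print(mismatch)
```

Columns that appear in the result are candidates for remapping rare values to a common one.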

In [None]:
# Replace values that appear only in the training set with a common value.
# Column-level replace avoids chained assignment, which newer pandas versions may silently ignore.
# "Electrical" feature
df_train['Electrical'] = df_train['Electrical'].replace('Mix', 'SBrkr')
# "Exterior2nd" feature
df_train['Exterior2nd'] = df_train['Exterior2nd'].replace('Other', 'VinylSd')
# "Heating" feature
df_train['Heating'] = df_train['Heating'].replace({'OthW': 'GasA', 'Floor': 'GasA'})
# "HouseStyle" feature
df_train['HouseStyle'] = df_train['HouseStyle'].replace('2.5Fin', '1.5Fin')

#### Choosing the appropriate categorical features

In [None]:
cate_features = ['BldgType', 'BsmtExposure', 'BsmtFinType1', 'BsmtQual', 'CentralAir', 'Condition1', 'Electrical',
 'ExterCond', 'ExterQual', 'Exterior2nd', 'Functional', 'GarageCond', 'GarageType', 'Heating', 'HouseStyle',
 'KitchenQual', 'LandContour', 'LandSlope', 'LotShape', 'Neighborhood', 'PavedDrive', 'RoofStyle',
 'SaleCondition', 'SaleType', 'Street', 'YearBuilt', 'YearRemodAdd', 'YrSold']

# Drop the unused categorical columns by choosing the only set of columns above
cols = df_train.select_dtypes(include=['object', 'category']).columns
# Choose features only in "cols" but not in "cate_features"
drop_cate = np.setdiff1d(cols, cate_features)

df_train.drop(drop_cate, axis=1, inplace=True)
df_test.drop(drop_cate, axis=1, inplace=True)

#### Encode the categorical features by using One-hot encoding technique
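
The cell below fits a separate encoder on each DataFrame, which relies on both sets having the same category values per feature. As an alternative sketch (not what this kernel does), the encoder can be fitted once on the training data and reused on the test data. The toy frames and the `encode` helper are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (hypothetical data); the test set lacks the 'Grvl' category
toy_train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
toy_test = pd.DataFrame({'Street': ['Pave', 'Pave']})

# Fit on the training data only; unseen test categories become all-zero rows
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(toy_train[['Street']])

def encode(df):
    """One-hot encode 'Street' using the categories learned from the training data."""
    cols = [f'Street_{value}' for value in encoder.categories_[0]]
    dense = encoder.transform(df[['Street']]).toarray()
    return pd.DataFrame(dense, columns=cols, index=df.index)

train_encoded = encode(toy_train)
test_encoded = encode(toy_test)
print(test_encoded)
```

Both encoded frames end up with identical columns, even when a category is missing from one of them.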

In [None]:
print(df_train.shape, df_test.shape)

In [None]:
# Transform categorical feature to dummies features
encoded_features = list()

for df in [df_train, df_test]:
    for feature in cate_features:
        # Convert to a dense array after encoding so the columns can be added back to a DataFrame
        encoded_feat = preprocessing.OneHotEncoder().fit_transform(df[feature].values.reshape(-1, 1)).toarray()
        # "n": Number of unique values in this feature
        n = df[feature].nunique()
        # Column names after one-hot encoding: "<feature>_<k>" for k = 1..n
        cols = ['{}_{}'.format(feature, k) for k in range(1, n + 1)]
        
        encoded_df = pd.DataFrame(encoded_feat, columns=cols)
        encoded_df.index = df.index
        encoded_features.append(encoded_df)
        
df_train = pd.concat([df_train, *encoded_features[:len(cate_features)]], axis=1)
df_test = pd.concat([df_test, *encoded_features[len(cate_features):]], axis=1)

In [None]:
print(df_train.shape, df_test.shape)

#### Drop the original categorical features; only the one-hot features are used to train the model

In [None]:
# Drop original category features
df_train.drop(cate_features, axis=1, inplace=True)
df_test.drop(cate_features, axis=1, inplace=True)

df_all = concat_df(df_train, df_test)

In [None]:
print(df_train.shape, df_test.shape)

<a name='4'></a>
# 4. Modeling 

In [None]:
from sklearn.model_selection import KFold # for K-fold cross validation
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score # score evaluation

In [None]:
# 10-fold cross validation with shuffling
kfolds = KFold(n_splits=10, shuffle=True, random_state=SEED)

# Return the per-fold root mean squared error from cross validation
def evaluate_model_cv(model, X, y):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kfolds))
    return rmse

#### Initialize the xgboost model

In [None]:
# Base model
xgboost = XGBRegressor(learning_rate=0.01, n_estimators=3460,
                       max_depth=3, min_child_weight=0,
                       gamma=0, subsample=0.7,
                       colsample_bytree=0.7, verbosity = 0,
                       objective='reg:squarederror', nthread=-1,
                       scale_pos_weight=1, seed=SEED, reg_alpha=0.00006)

#### Training the model

In [None]:
# Train the model, then estimate the root mean squared error with cross validation
xgboost = xgboost.fit(np.array(df_train), np.array(y_train))
print('Finish training')
cv_rmse_result = evaluate_model_cv(xgboost, np.array(df_train), np.array(y_train))
print(f'xgboost\'s rmse (apply cv) after training: {np.mean(cv_rmse_result)}\n')

#### Create the submission

In [None]:
# Testing ID
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
test_id = test['Id']

# Predict on the test set and pair each prediction with its Id
submit = pd.concat((test_id, 
                    pd.Series(xgboost.predict(np.array(df_test)), name='SalePrice')), axis=1)
submit.to_csv('Submission.csv', index=False)

We got about 0.13185 on the leaderboard, which ranks in the top 10% of competitors. I think this score is not a bad start.
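
The cross-validation RMSE printed during training is measured in price units, so it is not directly comparable to the leaderboard number. As far as I understand the competition's evaluation rules, the leaderboard uses the RMSE between the logarithm of the predicted price and the logarithm of the actual price. The sketch below shows that calculation; `rmse_log` and the toy prices are hypothetical and not part of this kernel.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error between log(actual) and log(predicted) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Toy prices (hypothetical data)
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]
print(rmse_log(actual, predicted))
```

Applying a function like this to held-out predictions gives a local score on the same scale as the leaderboard.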

If you like this notebook, please give it an upvote. Thank you!