Hey guys!
I hope you've had a great day so far.
This notebook covers data cleaning, data analysis, feature engineering, normalization, and modeling. I tried my best to keep it simple and beginner-friendly. Feel free to share your thoughts and ideas with me, and ask any questions about this code in the comment section.
Don't forget to share this kernel with your friends, and please upvote if you've learned anything from it.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from scipy import stats
from warnings import filterwarnings
import pprint
from sklearn.preprocessing import StandardScaler
from scipy.stats import skew

In [None]:
filterwarnings(action='ignore')

## Loading datasets and getting some info

In [None]:
train_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
submission = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train_df.describe()

In [None]:
train_df.info()

In [None]:
# separating categorical features from non-categorical ones

categoricals = train_df.dtypes[train_df.dtypes == 'object'].index
non_categoricals = train_df.dtypes[train_df.dtypes != 'object'].index

print('Categoricals: ', categoricals)
print('\n Non-Categoricals: ', non_categoricals)

## Finding and handling missing values

In [None]:
nums = train_df.isna().sum().sort_values(ascending=False)
percent = (train_df.isna().sum() / train_df.isna().count()).sort_values(ascending=False)
missings = pd.concat([nums, percent], axis=1, keys=['Total', 'Percent'])

missings[missings['Total'] != 0]

According to the data description, NaN in some features doesn't mean the value is missing; it means the house doesn't have that feature. For example, NaN in Fence means there is no fence.
So we're gonna replace those NaNs with the string 'None'.

In [None]:
trap_missings = ['Fence', 'PoolQC', 'Alley', 'FireplaceQu',
                 'GarageFinish', 'GarageType', 'GarageQual',
                 'GarageCond', 'MiscFeature', 'BsmtFinType2',
                 'BsmtFinType1', 'BsmtExposure', 'BsmtCond',
                 'BsmtQual', 'MasVnrType']

for col in trap_missings:
    # assign the result back instead of calling fillna(inplace=True) on a column
    train_df[col] = train_df[col].fillna('None')
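
To make the idea concrete, here is a minimal sketch on a made-up three-row frame. The toy values are assumptions for illustration, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): NaN here means "the house has no fence/pool"
toy = pd.DataFrame({'Fence': ['MnPrv', np.nan, 'GdWo'],
                    'PoolQC': [np.nan, np.nan, 'Ex']})

# Replace NaN with an explicit 'None' category, assigning the result back
for col in ['Fence', 'PoolQC']:
    toy[col] = toy[col].fillna('None')

print(toy)
```

After this, 'None' is just another category, so "no fence" can be told apart from real missing data.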

In [None]:
# filling numerical missing values -> using the column mean

for col in ['LotFrontage', 'GarageYrBlt', 'MasVnrArea']:
    train_df[col] = train_df[col].fillna(train_df[col].mean())

In [None]:
# filling Categorical missing values -> using mode

# mode() returns a Series; [0] picks the most frequent value
train_df['Electrical'] = train_df['Electrical'].fillna(train_df['Electrical'].mode()[0])

Let's see if there's anything missed.

In [None]:
nums = train_df.isna().sum().sort_values(ascending=False)
percent = (train_df.isna().sum() / train_df.isna().count()).sort_values(ascending=False)
missings = pd.concat([nums, percent], axis=1, keys=['Total', 'Percent'])

missings[missings['Total'] != 0]

In [None]:
# snapshot of the cleaned training data; reused later to impute the test set
df = train_df.copy()

All clean!

## Outliers

I'd rather spot outliers visually, so we're gonna draw scatter plots of every numerical feature against SalePrice.

In [None]:
non_categoricals = non_categoricals.drop(['Id', 'SalePrice'])

In [None]:
fig, axes = plt.subplots(6, 6, figsize=(30, 30))

for col, ax in zip(non_categoricals, axes.flatten()):
    # pass x and y by keyword; recent seaborn versions treat the first positional argument as `data`
    sns.scatterplot(x=train_df[col], y=train_df['SalePrice'], ax=ax, alpha=0.3)

Can we do anything with them? No. Outliers are lying to us, so as a punishment we drop them.

In [None]:
train_df = train_df.drop(train_df[train_df['GrLivArea']>5000].index)
train_df = train_df.drop(train_df[train_df['LotArea']>200000].index)
train_df = train_df.drop(train_df[train_df['TotalBsmtSF']>4000].index)
train_df = train_df.drop(train_df[train_df['LotFrontage']>200].index)
train_df = train_df.drop(train_df[train_df['1stFlrSF']>4000].index)
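
The same kind of filtering can also be written as one boolean mask. Here's a small sketch on toy data; the rows are made up and only two of the thresholds above are used, so treat it as an illustration rather than a replacement for the cell above.

```python
import pandas as pd

# Toy rows (hypothetical values); the limits are two of the thresholds used above
toy = pd.DataFrame({'GrLivArea': [1500, 5600, 2100],
                    'LotArea': [8000, 9000, 215000],
                    'SalePrice': [180000, 160000, 375000]})

limits = {'GrLivArea': 5000, 'LotArea': 200000}

# A row counts as an outlier if it exceeds the limit in any listed column
is_outlier = pd.Series(False, index=toy.index)
for col, limit in limits.items():
    is_outlier |= toy[col] > limit

cleaned = toy[~is_outlier]
print(len(toy), '->', len(cleaned))
```

Keeping the thresholds in one dictionary makes it easy to see, and later change, what counts as an outlier.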

## Feature importance

This dataset has so many columns that they can make our machine learning model overly complex. There are some tricks to get the best out of the features. I wanna use feature selection based on correlation: it keeps the features that are most strongly correlated with SalePrice.

Here you can see the heatmap of the raw features. By raw I mean the default features of the dataset.

In [None]:
# getting the features which are highly correlated to SalePrice

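# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
cols = train_df.corr(numeric_only=True).nlargest(20, 'SalePrice')['SalePrice'].index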

plt.figure(figsize=(16, 12))
sns.heatmap(train_df[cols].corr(), cmap='Greys', annot=True)
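
If the one-liner that builds `cols` looks dense, here is the same idea as a sketch on synthetic data. The column names and numbers below are made up: one column drives the price, one is pure noise, and one is text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(1500, 300, n)

# Synthetic frame (hypothetical): Area drives the price, Noise does not
toy = pd.DataFrame({
    'Area': area,
    'Noise': rng.normal(0, 1, n),
    'Street': ['Pave'] * n,                      # text column, ignored by corr
    'SalePrice': 100 * area + rng.normal(0, 5000, n),
})

# Correlation of every numeric column with the target, strongest first
corr = toy.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)
top = corr.head(2).index.tolist()
print(corr)
```

Note that the target itself always comes first with a correlation of 1, so a "top 20" list really holds SalePrice plus 19 features.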

### Creating new features

So we just saw a heatmap that shows how correlated the features are. But what if the raw features aren't enough? What if there are some hidden features behind these raw features?
Can we add new features built from the raw ones? Of course we can.
Here are some new features made from the raw features:

In [None]:
none = ['None', 'NA']
for row in [train_df]:
    row['HasPool'] = 1
    row.loc[(row['PoolQC'].isin(none)), 'HasPool'] = 0

    # the square-footage columns are numeric, so "not present" means 0
    row['HasWoodDeck'] = 1
    row.loc[(row['WoodDeckSF'] == 0), 'HasWoodDeck'] = 0

    row['HasOpenPorch'] = 1
    row.loc[(row['OpenPorchSF'] == 0), 'HasOpenPorch'] = 0

    row['HasScreenPorch'] = 1
    row.loc[(row['ScreenPorch'] == 0), 'HasScreenPorch'] = 0

    row['HasAlleyAccess'] = 1
    row.loc[(row['Alley'].isin(none)), 'HasAlleyAccess'] = 0

    row['HasFirePlace'] = 1
    row.loc[(row['Fireplaces'] == 0), 'HasFirePlace'] = 0

    row['HasGarage'] = 1
    row.loc[(row['GarageType'].isin(none)), 'HasGarage'] = 0
    
    row['HasMVArea'] = 1
    row.loc[(row['MasVnrArea'] == 0), 'HasMVArea'] = 0
    

    row['Remodeled'] = 1
    row.loc[(row['YearBuilt'] == row['YearRemodAdd']), 'Remodeled'] = 0

    row['TotalHouseSF'] = row['1stFlrSF'] + \
        row['TotalBsmtSF'] + row['2ndFlrSF']

    row['HasBasement'] = 1
    row.loc[(row['BsmtFinType1'].isin(none) & (
        row['BsmtFinType2'].isin(none))), 'HasBasement'] = 0

    row['TotalBathroom'] = row['FullBath'] + \
        (row['HalfBath']*0.5) + row['BsmtFullBath'] + \
        (row['BsmtHalfBath']*0.5)

    row['TotalHouseQuality'] = row['OverallQual'] + row['OverallCond']
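
A shorter way to build this kind of 0/1 flag is to turn a condition straight into an integer column. This is only a sketch on a toy frame with made-up values. It also shows that a numeric column is checked against 0, while a text column is checked against the 'None' label.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({'WoodDeckSF': [0, 140, 0],
                    'PoolQC': ['None', 'Ex', 'None']})

# Numeric column: the feature exists when the area is above zero
toy['HasWoodDeck'] = (toy['WoodDeckSF'] > 0).astype(int)

# Text column: the feature exists when the label is not 'None'/'NA'
toy['HasPool'] = (~toy['PoolQC'].isin(['None', 'NA'])).astype(int)

print(toy)
```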
    

Checking if there is any change in the correlation after creating new features

In [None]:
cols = train_df.corr(numeric_only=True).nlargest(20, 'SalePrice')['SalePrice'].index

plt.figure(figsize=(16, 12))
sns.heatmap(train_df[cols].corr(), cmap='Greys', annot=True)

In [None]:
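# select the top columns and drop the redundant ones in one step,
# so we get a new frame instead of modifying a slice in place
train_df = train_df[cols].drop(
    columns=['GarageArea', '1stFlrSF', 'Fireplaces', 'MasVnrArea', 'BsmtFinSF1'])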

We dropped columns that carry the same information about SalePrice and are highly correlated with each other, such as GarageCars and GarageArea. I also dropped MasVnrArea because it contains lots of zero values and does not follow a normal distribution. Normalizing columns with zero values by log transformation is a problem, since log(0) is undefined. We created HasMVArea so that hopefully we don't lose much information.
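
Here is a quick sketch of the log(0) problem on three made-up veneer areas. It also shows `np.log1p`, which computes log(1 + x) and stays finite at zero. This notebook doesn't use `np.log1p`; it's only here as a common alternative. Each transform has its own inverse: `np.exp` undoes `np.log`, and `np.expm1` undoes `np.log1p`.

```python
import numpy as np

# Hypothetical veneer areas; the first house has no veneer at all
areas = np.array([0.0, 120.0, 350.0])

with np.errstate(divide='ignore'):      # silence the divide-by-zero warning
    plain_log = np.log(areas)           # first entry becomes -inf

shifted_log = np.log1p(areas)           # log(1 + x), finite at zero
restored = np.expm1(shifted_log)        # expm1 is the inverse of log1p

print(plain_log)
print(shifted_log)
print(restored)
```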

## Normalization

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,6))

sns.distplot(train_df['SalePrice'], fit=stats.norm, ax=ax[0])
ax[0].set_title('Before Normalization')

train_df['SalePrice'] = np.log(train_df['SalePrice'])
ax[1].set_title('After Normalization')
sns.distplot(train_df['SalePrice'], fit=stats.norm, ax=ax[1])

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,6))

sns.distplot(train_df['GrLivArea'], fit=stats.norm, ax=ax[0])
ax[0].set_title('Before Normalization')

train_df['GrLivArea'] = np.log(train_df['GrLivArea'])
ax[1].set_title('After Normalization')
sns.distplot(train_df['GrLivArea'], fit=stats.norm, ax=ax[1])

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,6))

sns.distplot(train_df['TotalHouseSF'], fit=stats.norm, ax=ax[0])
ax[0].set_title('Before Normalization')

train_df['TotalHouseSF'] = np.log(train_df['TotalHouseSF'])
ax[1].set_title('After Normalization')
sns.distplot(train_df['TotalHouseSF'], fit=stats.norm, ax=ax[1])
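
Besides looking at the plots, we can put a number on how lopsided a distribution is with `scipy.stats.skew`. The sketch below uses synthetic right-skewed "prices" (an assumption, not the real SalePrice column), just to show what the log transform does to the skewness.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed prices (hypothetical data drawn from a log-normal)
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12, sigma=0.4, size=5000)

before = skew(prices)             # clearly positive: long right tail
after = skew(np.log(prices))      # close to zero: roughly symmetric

print('skew before:', round(float(before), 2))
print('skew after: ', round(float(after), 2))
```

A skewness near 0 means the distribution is roughly symmetric, which is what the "After Normalization" plots show visually.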

## Modeling

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

In [None]:
X = train_df.drop(['SalePrice'], axis=1)
y = train_df['SalePrice']

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, random_state=42, test_size=0.2)

In [None]:
lr = LinearRegression()
rf = RandomForestRegressor(random_state=42)
lgb = LGBMRegressor(random_state=42, objective='regression')

In [None]:
ensemble_regressor = VotingRegressor(
    [('lr', lr), ('rf', rf), ('lgb', lgb)])
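
With default settings, `VotingRegressor` simply averages the predictions of its fitted base models. The sketch below checks that on synthetic data. The data is made up, and it uses only two base models to keep it small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 0.5, 100)

voter = VotingRegressor([
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=20, random_state=0)),
])
voter.fit(X_toy, y_toy)

# estimators_ holds the fitted base models; average their predictions by hand
manual = np.mean([est.predict(X_toy[:5]) for est in voter.estimators_], axis=0)

print(np.allclose(manual, voter.predict(X_toy[:5])))
```

Averaging tends to smooth out the individual mistakes of each model, which is why ensembles like this are popular.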

In [None]:
for reg in (lr, rf, lgb, ensemble_regressor):
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_val)
    print(reg.__class__.__name__, mean_squared_error(y_val, y_pred))
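
Since SalePrice was log-transformed, the numbers printed above are mean squared errors on the log scale. Taking the square root gives an RMSE in log units, which is a bit easier to read. A small sketch with three made-up prices:

```python
import numpy as np

# Hypothetical prices in dollars, moved to the log scale like y in this notebook
y_true_log = np.log(np.array([200000.0, 150000.0, 320000.0]))
y_pred_log = np.log(np.array([210000.0, 140000.0, 300000.0]))

mse = np.mean((y_true_log - y_pred_log) ** 2)   # same quantity mean_squared_error returns
rmse = np.sqrt(mse)                             # back to log units

print('MSE: ', round(float(mse), 5))
print('RMSE:', round(float(rmse), 5))
```

For small errors, an RMSE on the log scale can be read roughly as a relative error, so a value around 0.06 means the predictions are off by about 6% on average.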


## Predicting the test set

In [None]:
trap_missings = ['PoolQC', 'FireplaceQu', 'Alley']

In [None]:
test_df.drop(['SaleType', 'Exterior1st', 'KitchenQual', 'Utilities', 'MSZoning',
             'BsmtQual', 'BsmtCond', 'LotFrontage', 'MiscFeature',
              'Fence', 'Electrical', 'Exterior2nd', 'Functional', 'GarageFinish',
              'BsmtUnfSF', 'BsmtFinSF2', 'BsmtFinSF1', 'GarageCond', 'BsmtExposure',
              'MasVnrType', 'GarageQual'], axis=1, inplace=True)

### Missing values

In [None]:
nums = test_df.isna().sum().sort_values(ascending=False)
percent = (test_df.isna().sum() / test_df.isna().count()).sort_values(ascending=False)
missings = pd.concat([nums, percent], axis=1, keys=['Total', 'Percent'])

missings[missings['Total'] != 0]

In [None]:
# fill with 'None' where the data description says NaN means "not present",
# the same way these columns were handled in the training set
for col in trap_missings + ['BsmtFinType1', 'BsmtFinType2', 'GarageType']:
    test_df[col] = test_df[col].fillna('None')

# impute numerical features with statistics from the cleaned training data (df)
for col in ['GarageYrBlt', 'MasVnrArea', 'TotalBsmtSF', 'GarageArea']:
    test_df[col] = test_df[col].fillna(df[col].mean())

# count-like features -> most frequent value in the training data
for col in ['BsmtFullBath', 'BsmtHalfBath', 'GarageCars']:
    test_df[col] = test_df[col].fillna(df[col].mode()[0])
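
Notice that the gaps in the test set are filled with statistics taken from the training data (`df`), not from the test set itself. That way the test set goes through the same preprocessing the models saw during training. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Toy train/test frames (hypothetical values)
train = pd.DataFrame({'GarageArea': [400.0, 500.0, 600.0]})
test = pd.DataFrame({'GarageArea': [450.0, np.nan]})

# The fill value comes from the training data only
train_mean = train['GarageArea'].mean()
test['GarageArea'] = test['GarageArea'].fillna(train_mean)

print(test)
```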

In [None]:
nums = test_df.isna().sum().sort_values(ascending=False)
percent = (test_df.isna().sum() / test_df.isna().count()).sort_values(ascending=False)
missings = pd.concat([nums, percent], axis=1, keys=['Total', 'Percent'])

missings[missings['Total'] != 0]

All clean!

In [None]:
none = ['None', 'NA']
for row in [test_df]:
    row['HasPool'] = 1
    row.loc[(row['PoolQC'].isin(none)), 'HasPool'] = 0

    # the square-footage columns are numeric, so "not present" means 0
    row['HasWoodDeck'] = 1
    row.loc[(row['WoodDeckSF'] == 0), 'HasWoodDeck'] = 0

    row['HasOpenPorch'] = 1
    row.loc[(row['OpenPorchSF'] == 0), 'HasOpenPorch'] = 0

    row['HasScreenPorch'] = 1
    row.loc[(row['ScreenPorch'] == 0), 'HasScreenPorch'] = 0

    row['HasAlleyAccess'] = 1
    row.loc[(row['Alley'].isin(none)), 'HasAlleyAccess'] = 0

    row['HasFirePlace'] = 1
    row.loc[(row['Fireplaces'] == 0), 'HasFirePlace'] = 0

    row['HasGarage'] = 1
    row.loc[(row['GarageType'].isin(none)), 'HasGarage'] = 0
    
    row['HasMVArea'] = 1
    row.loc[(row['MasVnrArea'] == 0), 'HasMVArea'] = 0
    

    row['Remodeled'] = 1
    row.loc[(row['YearBuilt'] == row['YearRemodAdd']), 'Remodeled'] = 0

    row['TotalHouseSF'] = row['1stFlrSF'] + \
        row['TotalBsmtSF'] + row['2ndFlrSF']

    row['HasBasement'] = 1
    row.loc[(row['BsmtFinType1'].isin(none) & (
        row['BsmtFinType2'].isin(none))), 'HasBasement'] = 0

    row['TotalBathroom'] = row['FullBath'] + \
        (row['HalfBath']*0.5) + row['BsmtFullBath'] + \
        (row['BsmtHalfBath']*0.5)

    row['TotalHouseQuality'] = row['OverallQual'] + row['OverallCond']

In [None]:
features = X_train.columns.to_list()
# .copy() gives an independent frame, so the log transforms below don't write into a slice
test_df = test_df[features].copy()

### Normalization

In [None]:
test_df['GrLivArea'] = np.log(test_df['GrLivArea'])
test_df['TotalHouseSF'] = np.log(test_df['TotalHouseSF'])

In [None]:
# SalePrice was transformed with np.log, so np.exp brings predictions back to dollars
y_pred = np.exp(ensemble_regressor.predict(test_df))

In [None]:
y_pred = pd.DataFrame(y_pred, columns=['SalePrice'])

In [None]:
y_pred
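
To actually submit, the predictions need to sit next to the house Ids. Here's a sketch, assuming the competition expects an `Id` column and a `SalePrice` column. The Ids and prices below are made up. In this notebook, the `submission` frame loaded at the top still has the `Id` column, so it could supply the real Ids.

```python
import io
import pandas as pd

# Hypothetical Ids and predicted prices, for illustration only
ids = pd.Series([1461, 1462, 1463], name='Id')
preds = [120000.0, 155000.0, 180000.0]

result = pd.DataFrame({'Id': ids, 'SalePrice': preds})

# Written to an in-memory buffer here; a file name such as 'submission.csv' works the same way
buffer = io.StringIO()
result.to_csv(buffer, index=False)

print(buffer.getvalue())
```

`index=False` matters here: without it pandas writes an extra unnamed index column to the file.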