***

### HOUSE PRICES DATA : COMPREHENSIVE DATA ANALYSIS & PREDICTION MODELLING

***

We start by taking a closer look at the given train dataset. Since 'SalePrice' is the most important variable in the dataset, we then explore how it correlates with the other variables. The first part of this notebook uses Python with the pandas and seaborn packages to understand the data in more depth and to visualize it in appropriate ways.

### LB ~ 0.12 [Top 20%], CV RMSE ~ 0.05


#### *Content:*
- Basic EDA & Visualizations
- Optuna hyperparameter tuning with LGBMRegressor
- K-Fold cross-validation training and test set prediction with LGBMRegressor

In [None]:
# libraries
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# sklearn
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder

# LightGBM
from lightgbm import LGBMRegressor, log_evaluation, early_stopping

# Hyperparams tuning
import optuna

#### Importing training and testing data

In [None]:
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

df_train.head(n=5)

In [None]:
# train and test data sizes

print("Size of the train data: ", df_train.shape)
print("Size of the test data: ", df_test.shape)

The test data has one column fewer than the train data because it does not contain the regression target column "SalePrice".

_Let us take a closer look at all the columns in the train data:_

In [None]:
df_train.columns

In [None]:
df_train.describe()

In [None]:
# Missing values
# percentage/ratio of the missing values by columns

for col, missing_ratio in (df_train.isnull().sum()/df_train.shape[0]).to_dict().items():
    if missing_ratio > 0:
        print(col, ":\t", round(missing_ratio, 3))
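
The same per-column missing ratio can also be computed without an explicit loop. The snippet below is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the competition data); it relies on `isnull().mean()` giving the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull().mean() is the fraction of missing values in each column
missing_ratio = toy.isnull().mean()

# keep only the columns that have missing values, largest ratio first
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print(missing_ratio.round(3))
```

Sorting the ratios makes it easier to pick a drop threshold such as the one-third cut-off used in the next cell.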

In [None]:
# columns in which more than one third of the values are missing
drop_columns = list(df_train.columns[df_train.isnull().sum()/df_train.shape[0] > 0.33])
drop_columns

In [None]:
df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

### Missing value imputation

In [None]:
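# features whose correlation with LotFrontage is above 0.35
# numeric_only=True (pandas >= 1.5) skips the string columns, which recent pandas versions refuse to correlate
for k, v in df_train.corr(numeric_only=True)["LotFrontage"].to_dict().items():
    if v > 0.35 and v < 1.0:
        print(k, ":\t", v)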

We impute the missing LotFrontage values with a linear regression model. LotFrontage appears to have a roughly linear relationship with LotArea, so we fit LotFrontage on LotArea using the rows where both are known and then predict the missing values as follows:

In [None]:
# Fit a linear regression of LotFrontage on LotArea using the rows where both are known.
# The function returns the slope and the intercept, which are then used to predict
# LotFrontage for the rows with missing values.

def regression_coeffs(X_train, y_train):
    X_train = X_train.reshape(len(X_train), 1)
    y_train = y_train.reshape(len(y_train), 1)
    reg = linear_model.LinearRegression()
    reg.fit(X_train, y_train)
    return reg.coef_[0][0], reg.intercept_[0] 

# linear reg coeffs
tmp = df_train[["LotArea", "LotFrontage"]].dropna()
w, intercept = regression_coeffs(tmp.LotArea.values, tmp.LotFrontage.values)
print(f"Regression params: weight={w}, intercept={intercept}")

for i in range(len(df_train["LotFrontage"])):
    if pd.isnull(df_train.loc[i, "LotFrontage"]):
        df_train.loc[i, "LotFrontage"] = df_train.loc[i, "LotArea"]*w + intercept

In [None]:
# Check: no NaN values should be left in the LotFrontage column
df_train["LotFrontage"].isnull().sum()

In [None]:
# Imputing in the Test set as well
for i in range(df_test["LotFrontage"].shape[0]):
    if pd.isnull(df_test.loc[i, "LotFrontage"]):
        df_test.loc[i, "LotFrontage"] = df_test.loc[i, "LotArea"]*w + intercept
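
The two row-by-row loops above can be expressed as one masked assignment that is reused for train and test. This is a sketch under stated assumptions: `impute_lot_frontage` is a hypothetical helper, the toy values are made up, and `np.polyfit` is swapped in for `LinearRegression` only to keep the example small (for a single feature it fits the same least-squares line).

```python
import numpy as np
import pandas as pd

def impute_lot_frontage(df, w, intercept):
    """Hypothetical helper: fill missing LotFrontage with LotArea * w + intercept."""
    out = df.copy()
    mask = out["LotFrontage"].isnull()
    out.loc[mask, "LotFrontage"] = out.loc[mask, "LotArea"] * w + intercept
    return out

# Toy data (made up): the known points lie on LotFrontage = 0.01 * LotArea + 10
toy_train = pd.DataFrame({"LotArea": [1000.0, 2000.0, 3000.0, 4000.0],
                          "LotFrontage": [20.0, 30.0, np.nan, 50.0]})
toy_test = pd.DataFrame({"LotArea": [5000.0], "LotFrontage": [np.nan]})

# Fit on the train rows where both values are known; degree 1 returns (slope, intercept)
known = toy_train.dropna()
w, intercept = np.polyfit(known["LotArea"].to_numpy(), known["LotFrontage"].to_numpy(), deg=1)

# The same train-fitted coefficients are applied to both sets
toy_train = impute_lot_frontage(toy_train, w, intercept)
toy_test = impute_lot_frontage(toy_test, w, intercept)
print(toy_train, toy_test, sep="\n")
```

Fitting only on the train rows and reusing the coefficients for the test set mirrors what the cells above do.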

### "MasVnrType" and "MasVnrArea" Imputation

***
Here we look at the possible 'MasVnrType' values and find the most frequent one. Since only 8 data points have missing values, we simply replace them with the most frequent type, which is "None" as obtained below, and set the corresponding 'MasVnrArea' to 0.

In [None]:
# Let us see the distribution of "MasVnrType" in the data
df_train["MasVnrType"].describe()

Since most of the houses have a "MasVnrType" of "None", let us replace the 8 missing values with "None" and set the corresponding "MasVnrArea" to 0.

In [None]:
for i in range(len(df_train["MasVnrType"])):  
    if pd.isnull(df_train.loc[i, "MasVnrType"]) and pd.isnull(df_train.loc[i, "MasVnrArea"]):
        df_train.loc[i, "MasVnrType"] = "None"
        df_train.loc[i, "MasVnrArea"] = 0
        
# Check: the NaN values in both MasVnrType and MasVnrArea are now filled
df_train["MasVnrType"].isnull().sum(), df_train["MasVnrArea"].isnull().sum()

In [None]:
# same for test set as well
for i in range(len(df_test["MasVnrType"])):  
    if pd.isnull(df_test.loc[i, "MasVnrType"]) and pd.isnull(df_test.loc[i, "MasVnrArea"]):
        df_test.loc[i, "MasVnrType"] = "None"
        df_test.loc[i, "MasVnrArea"] = 0
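
Since the same rule is applied to both sets, it can be wrapped in a small function. The sketch below uses a hypothetical helper name (`fill_masonry_veneer`) and made-up toy rows; like the loops above, it only touches rows where both columns are missing.

```python
import numpy as np
import pandas as pd

def fill_masonry_veneer(df):
    """Hypothetical helper: fill rows where both veneer columns are missing."""
    out = df.copy()
    both_missing = out["MasVnrType"].isnull() & out["MasVnrArea"].isnull()
    out.loc[both_missing, "MasVnrType"] = "None"
    out.loc[both_missing, "MasVnrArea"] = 0
    return out

# Toy rows (made up): row 1 misses both values, row 3 misses only the type
toy = pd.DataFrame({
    "MasVnrType": ["BrkFace", np.nan, "None", np.nan],
    "MasVnrArea": [196.0, np.nan, 0.0, 120.0],
})
filled = fill_masonry_veneer(toy)
print(filled)
```

Row 3 keeps its missing type because its area is known, which matches the `and` condition in the loops above.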

In [None]:
# Imputation could be done manually for each column like this. However, we will use a boosting model,
# which can handle the remaining missing values itself with reasonable effectiveness.

### Categorical and Numerical features:

Of the 66 columns remaining after the steps above, let us separate the categorical and numerical variables.

In [None]:
df_train.dtypes

In [None]:
# numerical and categorical columns
numerical_vars = []
categorical_vars = []

for col in df_train.columns:
    if df_train[col].dtype == "object":
        categorical_vars.append(col)
    else:
        numerical_vars.append(col)

# number of numerical and categorical features left
len(numerical_vars), len(categorical_vars)
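
pandas also offers `select_dtypes` for this split. The sketch below runs on a made-up toy frame; it uses `exclude="number"` for the categorical side instead of comparing with `"object"`, on the assumption that this is more robust across pandas versions that store strings in a dedicated dtype.

```python
import pandas as pd

# Toy frame (made up) with two numerical columns and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, 80.0],
    "Street": ["Pave", "Grvl"],
})

# numeric columns (ints and floats) versus everything else
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols, categorical_cols)
```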

- Plotting a few important numerical columns

In [None]:
interesting_cols = ["OverallCond", "GrLivArea", "GarageCars", "YearBuilt", "LotArea", "SalePrice"]

# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(df_train[interesting_cols], dropna=True);
del interesting_cols

### House Built Year, Sold Year, and Age

In [None]:
sns.histplot(x="YearBuilt", data=df_train, bins=40);

In [None]:
# YearBuilt values in test that do not appear in train
for yr in df_test.YearBuilt.unique():
    if yr not in df_train.YearBuilt.unique():
        print(yr)
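
The same check can be written as a set difference, which avoids recomputing `unique()` inside the loop. A minimal sketch with made-up years:

```python
import pandas as pd

# Toy year columns (made up) standing in for df_train.YearBuilt and df_test.YearBuilt
toy_train_years = pd.Series([1995, 2003, 2003, 2008])
toy_test_years = pd.Series([1995, 2001, 2008, 2009])

# years that occur in test but never in train
unseen = sorted(int(y) for y in set(toy_test_years.unique()) - set(toy_train_years.unique()))
print(unseen)
```

The same pattern applies to the YrSold check further below.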

##### 2008 Housing Crash
- How do house prices in 2006 compare with those in 2010? There should be a significant effect because of the price correction. Age (created as a separate feature later) will not be enough to capture it.

In [None]:
# Sold Year
df_train.YrSold.unique(), df_test.YrSold.unique()

In [None]:
# to categorical variable
df_train.YrSold = df_train.YrSold.astype("object")
df_test.YrSold = df_test.YrSold.astype("object")

In [None]:
# YrSold values in test that do not appear in train
for yr in df_test.YrSold.unique():
    if yr not in df_train.YrSold.unique():
        print(yr)

In [None]:
df_train.YearBuilt.isnull().sum(), df_test.YearBuilt.isnull().sum()

In [None]:
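# YearBuilt and YrSold: calculate the house age at the time of sale
# YrSold was cast to object above; cast it back to int so that Age stays numeric
# (object minus int gives an object column, which would later be label-encoded as a category)
df_train["Age"] = df_train.YrSold.astype(int) - df_train.YearBuilt
df_test["Age"] = df_test.YrSold.astype(int) - df_test.YearBuilt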
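
Because YrSold was converted to an object column earlier, it is worth checking the dtype of the resulting Age column. The sketch below uses made-up years to show how the dtype of the difference depends on the dtype of the operands.

```python
import pandas as pd

yr_sold = pd.Series([2008, 2010]).astype("object")  # mimics the YrSold cast above
year_built = pd.Series([2003, 1976])

# subtracting from an object column keeps the object dtype
age_object = yr_sold - year_built

# casting back to int first gives a numeric Age column
age_numeric = yr_sold.astype(int) - year_built

print(age_object.dtype, age_numeric.dtype)
```

An object-typed Age would be picked up by the `dtype == "object"` test used later to collect categorical features, so a numeric dtype is the safer choice for an age in years.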

In [None]:
# Categorical buckets for YearBuilt, intended to be used in addition to YrSold as categorical features
# (currently unused: the apply calls below are commented out and YearBuilt is dropped instead)
def year_built_category(year):
    if year < 1900:
        return "1800s"
    return f"{math.floor(year / 10) * 10}s"
        
# df_train.YearBuilt = df_train.YearBuilt.apply(lambda x: year_built_category(x))
# df_test.YearBuilt = df_test.YearBuilt.apply(lambda x: year_built_category(x))

df_train = df_train.drop(columns=["YearBuilt"])
df_test = df_test.drop(columns=["YearBuilt"])


### Correlation between the variables:

Let us study the correlations between SalePrice and the other variables and identify the most important ones. The variables with the highest correlation with SalePrice matter most for the analysis ahead. We start with OverallQual.

In [None]:
# OverallQual: Boxplot
plt.figure(figsize=(10,6))
sns.boxplot(x="OverallQual", y="SalePrice", data=df_train);

Looking at the boxplot above, the median SalePrice increases steadily with the OverallQual of the house. OverallQual is therefore a very important variable to take into account in the further calculations.
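
To complement the boxplot, the variables can also be ranked by their absolute correlation with SalePrice. The snippet below is a minimal sketch on synthetic data (the column "Unrelated" and all values are made up), not a result from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n)

# Synthetic data: SalePrice depends on OverallQual, "Unrelated" is pure noise
toy = pd.DataFrame({
    "OverallQual": quality,
    "Unrelated": rng.normal(size=n),
    "Street": ["Pave"] * n,  # non-numeric column, skipped by numeric_only=True
    "SalePrice": 20000 * quality + rng.normal(scale=5000, size=n),
})

corr = toy.corr(numeric_only=True)

# rank the other variables by the strength of their correlation with SalePrice
top = corr["SalePrice"].drop("SalePrice").abs().sort_values(ascending=False)
print(top)
```

On the real data, `corr` could also be passed to `sns.heatmap` to view all pairwise correlations at once.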

In [None]:
# Data normalization
# Normalizing the right-skewed SalePrice with a log10 transform
# Note: predictions will be in LogSalePrice --> need conversion back before submission

df_train["LogSalePrice"] = np.log10(df_train.SalePrice)

fig, ax = plt.subplots(1, 2, figsize=(10, 3))
sns.histplot(x='SalePrice', data=df_train, bins=70, kde=True, ax=ax[0])
sns.histplot(x='LogSalePrice', data=df_train, bins=70, kde=True, ax=ax[1]);
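
The effect of the log transform can also be quantified with the sample skewness. The sketch below uses synthetic log-normal "prices" (an assumption for illustration, not the competition data) and checks that the inverse transform `10 ** x` recovers the original values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed "prices" drawn from a log-normal distribution
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

log_prices = np.log10(prices)   # forward transform, as used for LogSalePrice
recovered = 10 ** log_prices    # inverse transform, needed before submission

print("skew before:", round(prices.skew(), 3))
print("skew after: ", round(log_prices.skew(), 3))
```

A skewness close to zero after the transform indicates a more symmetric target, which is what the right-hand histogram shows visually.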

### Outliers! 

Let us look at outliers in some of the most important variables used in the prediction model. Such data points could be dropped from the train data entirely, depending on how extreme they are in the important columns. Here we look at 'LotArea'; 'GrLivArea' can be checked in the same way.

In [None]:
# Plotting the LotArea - SalePrice graph

plt.scatter(df_train["LotArea"], df_train["SalePrice"])
plt.xlabel("Lot Area")
plt.ylabel("Sale Price")
plt.show()

The four data points on the far right of the graph are outliers with respect to the LotArea/SalePrice distribution. We could remove them from the dataset, but we keep them here because a non-linear model should be able to handle them.
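
If the outliers were to be removed after all, the rows with the largest LotArea could be selected with `nlargest` and dropped by index. This is only a sketch with made-up rows; the notebook itself keeps all data points.

```python
import pandas as pd

# Toy rows (made up): two lots are far larger than the rest
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "LotArea": [8450, 9600, 200000, 11250, 150000, 9550],
    "SalePrice": [208500, 181500, 375000, 223500, 228950, 140000],
})

k = 2  # number of extreme lots to inspect (it would be 4 for the plot above)
extreme = toy.nlargest(k, "LotArea")

# dropping by index leaves the remaining rows untouched
trimmed = toy.drop(index=extreme.index)
print(extreme)
print(trimmed.shape)
```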

### Correlation with 'SalePrice'

Let us check the correlation of remaining numerical variables with 'SalePrice' once again now.

In [None]:
corr_with_SalePrice = df_train.drop(["Id"], axis=1).corr(numeric_only=True)  # skip the string columns
plot_data = corr_with_SalePrice["SalePrice"].sort_values(ascending=True)
plt.figure(figsize=(12,6))
plot_data.plot.bar()
plt.title("Correlations with the Sale Price")
plt.show()
del plot_data

The plot shows the correlation of the remaining numerical columns with 'SalePrice'. Columns with a clear correlation (strongly positive or strongly negative) are important for the prediction model, whereas those with a correlation close to zero have little linear effect on 'SalePrice', so we can drop a few of them.

In [None]:
# Removing a few columns that have low correlation with SalePrice
drop_columns = ["LowQualFinSF", "MiscVal", "BsmtHalfBath", "BsmtFinSF2"]

df_train = df_train.drop(columns=drop_columns)
df_test = df_test.drop(columns=drop_columns)

## Categorical columns

We looked into the numerical columns above and got a rough idea of their distributions and their importance for determining 'SalePrice'. The most important of them will be considered later when developing the prediction model. Many columns do not hold numerical values but descriptive categorical values instead. We now concentrate on those categorical columns.

In [None]:
# categorical variable and unique item by variable
for col in df_train.columns:
    if df_train[col].dtype == "object":
        print(col, ":\t", df_train[col].nunique())

Let us now look at how 'SalePrice' varies across the categories of the categorical columns. This gives us some idea of which columns are important and should be considered further.

In [None]:
# violinplot: SalePrice distribution for a selection of the categorical columns

few_cat_variables = ['KitchenQual', 'BsmtQual', 'Heating', 'ExterQual', 'LandSlope', 'HeatingQC', 'Foundation', 'Electrical',
                     'LandContour', 'LotShape', 'CentralAir', 'SaleType']
# all categorical variables were plotted first; only the few in the list above are shown here
for col in few_cat_variables:
    sns.violinplot(x=col, y='SalePrice', data=df_train)
    plt.show()

In [None]:
df_train.Neighborhood.value_counts()

In [None]:
plt.figure(figsize=(10, 4))
sns.histplot(x="SalePrice", hue="Neighborhood", data=df_train, bins=30);
plt.xticks(rotation=90);

### Okay,
The effect of the categories on 'SalePrice' is not clearly conclusive for most of the columns. Ideally, the categories could be mapped to ordered numerical values based on the plots above, but we are not going to be that precise and do not want to make things too complicated. Instead, I concentrate on the few variables whose categories have a comparatively clear effect on 'SalePrice'. Preferences may differ; my choice is based on a quick visual inspection of both the mean and the kernel density of each category in the plots above.

### Regression Modelling and Prediction on Test Data

Now we take the cleaned data above, df_train, and build the prediction model. We tune an LGBMRegressor with Optuna and then train it with K-fold cross-validation, using the root mean squared error (RMSE) on LogSalePrice as the score.

In [None]:
df_train.shape, df_test.shape

In [None]:
y = df_train.LogSalePrice
X = df_train.drop(columns=["Id", "SalePrice", "LogSalePrice"])

# test data
X_test = df_test.drop(columns=["Id"])

In [None]:
categorical_features = []
for col in X.columns:
    if X[col].dtype == "object":
        categorical_features.append(col)

In [None]:
# Label encoding
for col in categorical_features:
    encoder = LabelEncoder()
    X[col] = X[col].astype(str)
    X_test[col] = X_test[col].astype(str)
    
    encoder.fit(pd.concat([X[col], X_test[col]]))
    
    X[col] = encoder.transform(X[col])
    X_test[col] = encoder.transform(X_test[col])
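
The reason for fitting the encoder on the train and test values together is that both sets must share one code per category, including categories that occur in only one of them. The sketch below illustrates this with `pd.Categorical` instead of `LabelEncoder` (a swapped-in technique to keep the example small) and made-up category values.

```python
import pandas as pd

# Toy columns (made up): "Fa" occurs only in test, "Ex" and "TA" only in train
toy_train = pd.Series(["Gd", "TA", "Ex", "TA"], name="KitchenQual")
toy_test = pd.Series(["Fa", "Gd"], name="KitchenQual")

# sorted union of all categories seen in either set
categories = sorted(set(toy_train) | set(toy_test))

# both sets are encoded against the same category list
train_codes = pd.Categorical(toy_train, categories=categories).codes
test_codes = pd.Categorical(toy_test, categories=categories).codes

print(categories)
print(list(train_codes), list(test_codes))
```

Here "Gd" maps to the same integer in both sets, and "Fa" still gets a valid code even though it never appears in the train column.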

In [None]:
# Cat feature indices on X
cat_indices = []
for c in categorical_features:
    if c in X.columns:
        idx = list(X.columns).index(c)
        cat_indices.append(idx)

In [None]:
# Hyperparams tuning with Optuna
def objective(trial, data=X, target=y):
    
    X_train, X_valid, y_train, y_valid = train_test_split(data, target, test_size=0.10, random_state=42)
    
    params = {
                'metric': 'rmse', 
                'random_state': 22,
                'n_estimators': 20000,
                'boosting_type': trial.suggest_categorical("boosting_type", ["gbdt", "goss"]),
                # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
                'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
                'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
                'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
                'subsample': trial.suggest_categorical('subsample', [0.6, 0.7, 0.85, 1.0]),
                'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.01, 0.02, 0.03, 0.05, 0.1]),
                'max_depth': trial.suggest_int('max_depth', 2, 12, step=1),
                'num_leaves' : trial.suggest_int('num_leaves', 13, 148, step=5),
                'min_child_samples': trial.suggest_int('min_child_samples', 1, 96, step=5),
            }
    
    reg = LGBMRegressor(**params)  
    reg.fit(X_train ,y_train,
            eval_set=[(X_valid, y_valid)],
            #categorical_feature=cat_indices,
            callbacks=[log_evaluation(period=1000), 
                       early_stopping(stopping_rounds=50)
                      ],
           )
    
    y_pred = reg.predict(X_valid)
    # RMSE via np.sqrt: the squared=False argument is deprecated in recent scikit-learn versions
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
    
    return rmse

In [None]:
params_search = True
# Optuna: run study trials

if params_search:
    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=120)

In [None]:
# Results from Hyperparameters tuning
if params_search:
    print('Total number of trials: ', len(study.trials))
    print(f"Best RMSE score on validation data: {study.best_value}")

    print("-"*30)
    print('Best params:')
    print("-"*30)
    for param, v in study.best_trial.params.items():
        print(f"{param} :\t {v}")

In [None]:
# K-FOLD Cross-validation training and Prediction on test data
# Modeling with Best params

NFOLDS = 5
folds = KFold(n_splits=NFOLDS)
columns = X.columns
splits = folds.split(X, y)

y_preds = np.zeros(X_test.shape[0]) 
cv_score = 0
for fold_n, (train_idx, valid_idx) in enumerate(splits):
    print(f"FOLD: {fold_n}")
    
    X_train, X_valid = X[columns].iloc[train_idx], X[columns].iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx] 
    
    # params tuned further by hand, starting from the best params found by Optuna
    params = {
             'n_estimators': 20000,
             'boosting_type': "gbdt",
             'reg_alpha': 1.0,
             'reg_lambda': 2.0,
             'colsample_bytree': 0.70,
             'subsample': 1.0,
             'learning_rate': 0.02,
             'max_depth': 4,
             'num_leaves': 65,
             'min_child_samples': 3,
             }
    
    reg = LGBMRegressor(**params) # **study.best_trial.params

    reg.fit(X_train, y_train,
            eval_set=[(X_valid, y_valid), (X_train, y_train)],
            categorical_feature=cat_indices,
            callbacks=[log_evaluation(period=100), 
                       early_stopping(stopping_rounds=100)
                      ],
           )
    
    # prediction on the test set
    y_preds += reg.predict(X_test)/NFOLDS   
    # cross-validation score
    cv_score += np.sqrt(mean_squared_error(y_valid, reg.predict(X_valid)))/NFOLDS

In [None]:
print(f"Cross-validation mean RMSE score = {cv_score}")
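
Note that this CV score is measured on log10 prices. Assuming the leaderboard metric is the RMSE on natural-log prices, the two scales differ by a constant factor of ln(10) ≈ 2.3026, so a CV RMSE of about 0.05 corresponds to roughly 0.115, in line with the LB score quoted at the top. The sketch below checks this relation on made-up numbers; `rmse` is a small helper defined here only for illustration.

```python
import math
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error (helper defined for this example)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up targets and predictions on the log10 scale
log10_prices = np.array([5.0, 5.2, 5.4])
log10_preds = np.array([5.05, 5.15, 5.45])

rmse_log10 = rmse(log10_prices, log10_preds)

# ln(x) = log10(x) * ln(10), so the errors scale by the same constant factor
rmse_ln = rmse_log10 * math.log(10)

# the same value computed directly on natural-log prices
rmse_ln_direct = rmse(log10_prices * math.log(10), log10_preds * math.log(10))
print(rmse_log10, rmse_ln, rmse_ln_direct)
```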

In [None]:
sample_sub = pd.read_csv("/kaggle/input/sample_submission.csv")
sample_sub.head(2)

In [None]:
submission = pd.DataFrame(data={"Id": df_test.Id.values, "SalePrice": y_preds})
submission.SalePrice = np.round(10 ** submission.SalePrice, 3)  # convert log10 predictions back to prices
submission.head()
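
Before writing the file, a few sanity checks on the back-transformed predictions can catch a forgotten inverse transform. The sketch below uses made-up ids and log10 predictions; the specific checks are assumptions about what a valid submission should look like, not competition rules.

```python
import numpy as np
import pandas as pd

toy_ids = np.array([1461, 1462, 1463])
toy_log10_preds = np.array([5.0, 5.5, 6.0])  # made-up model outputs on the log10 scale

toy_submission = pd.DataFrame({
    "Id": toy_ids,
    "SalePrice": np.round(10 ** toy_log10_preds, 3),  # back to the price scale
})

# sanity checks: no missing values, and prices are positive and not on the log scale
assert toy_submission["SalePrice"].notnull().all()
assert toy_submission["SalePrice"].gt(0).all()
assert toy_submission["SalePrice"].min() > 100  # log10 values would be around 5

print(toy_submission)
```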

In [None]:
submission.to_csv("submission.csv", index=False)

In [None]:
# Done!