    Main Challenges: 
    
    1. Handling Missing Values (For both Categorical and Numerical variables)
    2. Feature Engineering (For both Categorical and Numerical variables)
    
    Techniques Used for different Challenges:
    
    1. Box-Cox Transformation (to bring the target closer to a Normal distribution)
    2. Principal Component Analysis (for numerical variables)
    3. Kruskal-Wallis H-test (for categorical variables)
    4. Advanced Regression Techniques
    5. Root Mean Squared Logarithmic Error (RMSLE) for evaluation
    (Techniques to handle missing values are explained later)
    
    We can use the same procedure for any other regression problem with small modifications to suit the data. 
    More techniques can be used for Feature Engineering. Here we haven't considered each column separately, so for a specific problem we can improve our results by studying each variable individually.

### Importing Necessary Libraries

In [45]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from scipy import stats
from scipy.stats import norm, skew
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.special import boxcox, inv_boxcox
from sklearn.metrics import mean_squared_log_error


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings("ignore")

In [46]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor, StackingRegressor, ExtraTreesRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, RidgeCV, Lasso , LassoCV
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.svm import SVR
from mlxtend.regressor import StackingCVRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

### Read the train and test data

In [47]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [48]:
print("Shape of Train data:", train.shape)
print("Shape of Test data:", test.shape)

Shape of Train data: (1460, 81)
Shape of Test data: (1459, 80)


In [49]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [50]:
train = train.drop(columns=['Id'])
test = test.drop(columns=['Id'])
print("Shape of Train data:", train.shape)
print("Shape of Test data:", test.shape)

Shape of Train data: (1460, 80)
Shape of Test data: (1459, 79)


In [51]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [52]:
train.loc[:,train.dtypes==object].columns  # np.object was removed from recent NumPy; the builtin object works everywhere

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')

> ### Handling missing data

First we check the following - 

    1. Total no of missing values
    2. Out of all the variables how many variables contain missing values
    3. Total no of missing values for each variables
    4. Percentage of missing values for each variables
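
The four checks above can be collected in one small helper. This is only a sketch: `missing_summary` and the toy frame are hypothetical names and made-up values for illustration, not part of the competition data.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Per-column missing counts and fractions, most-missing first."""
    total = df.isnull().sum()
    percent = df.isnull().mean()  # fraction of rows missing in each column
    out = pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    # Keep only the columns that actually have missing values
    return out[out["Total"] > 0].sort_values("Total", ascending=False)

# Toy frame with made-up values (not the real train.csv)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

summary = missing_summary(toy)
print("Total missing values:", int(toy.isnull().sum().sum()))
print("Variables with missing values:", len(summary))
print(summary)
```

`df.isnull().mean()` gives the same fraction as dividing the missing count by the row count, just in one call.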

Now we have to deal with the missing values. This is a tough task, and it gets harder when you have a lot of variables. You can go through the variables one by one and decide on a separate treatment for each. (For example, if you have a categorical variable and a numerical variable, you will most probably not treat their missing values the same way. You may even choose different treatments for two variables of the same type, categorical or numerical.)

If you have enough time and passion to go through each variable separately (even when the data has many variables), you should do it. But as I am lazy, I am going to use one criterion for all the numerical variables and one criterion for all the categorical variables. I may be lazy, but I won't use the same criterion for a categorical and a numerical variable: a mean is not defined for categories, so the two types need different fill values.

The steps to deal with missing values - 
    
    1. I will not use any variable that has more than 10% - 15% missing values; I simply remove those variables. The exact threshold is your choice. (I choose 15%.)
    2. Then choose a way to deal with the missing values in the remaining variables. 
    I choose to do the following for this dataset. 
    (a) For numerical variables, I will replace the missing values with the mean. (If we went through the variables separately, we might have chosen the median or mode for some of them.)
    (b) For categorical variables, I will replace the missing values with the most frequent category. 
    
   


### Should we consider all the data (train and test) together and then deal with missing values?

    First of all, the number of test examples is also a factor. In this dataset we have an almost equal number of train and test examples. What follows is my opinion on the question. 
    
    Combining the data is an option, but it has drawbacks, and I don't prefer it.
    We train the model on the train data only, so combining the data may lead to a bad model for the following reasons - 
    
    1. A variable in the training data may have more than 15% missing values, while after combining the data the missing value percentage may be less than 15%. In that case we won't remove the variable. This may hurt our model, because at training time we only see the train data.
    2. The opposite case can also happen. The missing value percentage for a variable may be less than 15% in the training data but more than 15% after combining the data. I would say that in this case combining the data can help. If we train on such a variable (less than 15% missing in the train data and more than 15% missing in the test data), the model may be good, but because the value is unavailable for a test example, the model may perform badly on that example. It also depends on the effect of that variable on the response variable. (If the amount of test data is huge and a variable has many missing values there, it would be absurd to replace that many missing values in the test data with the mean/median/mode of the variable's values in the training data.)
    
    I agree that all problems cannot be taken care of at the same time, but to overcome the difficulty raised by these two points we can do the following - 
    
    We won't combine the train and test data, but we will identify all the variables having more than 15% missing values in either the train or the test data, and then remove all of these variables from both. Some variables will be common to both, and some extra variables may come from the test data alone. 
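
The rule just described (drop a variable if it crosses the threshold in either split) could look roughly like this. It is a sketch on made-up toy frames; `high_missing_columns` is a hypothetical helper name, and the 15% threshold is the one chosen in this notebook.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.15):
    """Names of columns whose fraction of missing values exceeds the threshold."""
    frac = df.isnull().mean()
    return set(frac[frac > threshold].index)

# Toy frames with made-up values
toy_train = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0, 5.0],     # 20% missing in train only
    "B": [1.0, 2.0, 3.0, 4.0, 5.0],        # complete in train
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
    "SalePrice": [10, 20, 30, 40, 50],
})
toy_test = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [np.nan, np.nan, 3.0, 4.0, 5.0],  # 40% missing in test only
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Union of the two sets: a column is dropped if it is too sparse in either split
to_drop = sorted(high_missing_columns(toy_train) | high_missing_columns(toy_test))
toy_train = toy_train.drop(columns=to_drop)
toy_test = toy_test.drop(columns=to_drop)
print(to_drop)
```

Each split is inspected on its own, so the two frames are never concatenated, yet both end up with the same set of feature columns.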
   

### Check the no of variables with missing values

In [53]:
miss_vars_train = train.columns[train.isnull().any()]
print("No of variables with missing values in Train data:",len(miss_vars_train))

miss_vars_test = test.columns[test.isnull().any()]
print("No of variables with missing values in Test data:",len(miss_vars_test))

No of variables with missing values in Train data: 19
No of variables with missing values in Test data: 33


### Check the percentage of missing values for Train data

In [54]:
#missing values in train data
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

Unnamed: 0,Total,Percent
PoolQC,1453,0.995205
MiscFeature,1406,0.963014
Alley,1369,0.937671
Fence,1179,0.807534
FireplaceQu,690,0.472603
LotFrontage,259,0.177397
GarageType,81,0.055479
GarageCond,81,0.055479
GarageFinish,81,0.055479
GarageQual,81,0.055479


### Check the percentage of missing values for Test data

In [55]:
#missing values in test data
total_test = test.isnull().sum().sort_values(ascending=False)
percent_test = (test.isnull().sum()/test.isnull().count()).sort_values(ascending=False)
missing_data_test = pd.concat([total_test, percent_test], axis=1, keys=['Total', 'Percent'])
missing_data_test.head(25)

Unnamed: 0,Total,Percent
PoolQC,1456,0.997944
MiscFeature,1408,0.965045
Alley,1352,0.926662
Fence,1169,0.801234
FireplaceQu,730,0.500343
LotFrontage,227,0.155586
GarageCond,78,0.053461
GarageFinish,78,0.053461
GarageYrBlt,78,0.053461
GarageQual,78,0.053461


So here we can see that the variables with more than 15% missing values are the same for the train and test data, so we simply remove those variables from both.

### Drop the variables having more than 15%  missing values

In [56]:
## As both have the same variables with more than 15% missing values
## (use the columns= keyword; the positional axis argument no longer works in recent pandas)
cols_to_drop = missing_data[missing_data['Percent'] > 0.15].index
train = train.drop(columns=cols_to_drop)
test = test.drop(columns=cols_to_drop)

#### Again check the no of variables with missing values

In [57]:
miss_vars = train.columns[train.isnull().any()]
print("No of variables with missing values in train data:",len(miss_vars))

No of variables with missing values in train data: 13


So we removed 6 variables from our dataset. Now we have to deal with the remaining variables with the missing values.

In [58]:
train.isnull().sum().max() ## Maximum no of missing values among all the variables

81

### Replace the missing values 

### Separate out the numerical and categorical variables

In [59]:
## Separate the categorical variables
train_cat= train.select_dtypes(include='object')
test_cat= test.select_dtypes(include='object')

### Separate out the numerical variables (except SalePrice, as this is our response variable and the column is not there in the test data)
train_num= train.drop(columns=['SalePrice']).select_dtypes(exclude='object')
test_num= test.select_dtypes(exclude='object')


## Store the column names
cat_cols = train_cat.columns
num_cols = train_num.columns

### Choose the strategy to replace the missing values

In [60]:
## For numerical
imp_for_num = SimpleImputer(strategy='mean')
imp_for_num.fit(train_num)

## For categorical
imp_for_cat = SimpleImputer(strategy="most_frequent")
imp_for_cat.fit(train_cat)

SimpleImputer()

SimpleImputer(strategy='most_frequent')

### Replace the missing values based on the training data

In [61]:
train_num = pd.DataFrame(imp_for_num.transform(train_num),columns = num_cols)
train_cat = pd.DataFrame(imp_for_cat.transform(train_cat),columns = cat_cols)

test_num = pd.DataFrame(imp_for_num.transform(test_num),columns = num_cols)
test_cat = pd.DataFrame(imp_for_cat.transform(test_cat),columns = cat_cols)
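
To see why the imputers are fitted on the train data and only then applied to the test data, here is a tiny sketch with made-up numbers (the column name is reused purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values, not taken from the real dataset
toy_train_num = pd.DataFrame({"GarageArea": [400.0, 600.0, np.nan, 500.0]})
toy_test_num = pd.DataFrame({"GarageArea": [np.nan, 300.0]})

imp = SimpleImputer(strategy="mean")
imp.fit(toy_train_num)  # the fill value is learned from the train rows only

filled_test = pd.DataFrame(imp.transform(toy_test_num), columns=toy_test_num.columns)
print(imp.statistics_)  # the learned per-column fill values
print(filled_test)
```

The missing test value is filled with the train mean, so nothing about the test distribution leaks into the preprocessing.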

### Our final data after handling the missing values

In [62]:
target_data = train['SalePrice']
train = pd.concat([train_cat,train_num], axis=1)
test = pd.concat([test_cat,test_num], axis=1)

### Check the number of missing values

In [63]:
print("No of missing values in the Train data:",train.isnull().sum().sum())
print("No of missing values in the Test data:",test.isnull().sum().sum())

No of missing values in the Train data: 0
No of missing values in the Test data: 0


So the total no of missing values is now zero. So we are good to proceed. 

In [64]:
print("Shape of Train data:", train.shape)
print("Shape of Test data:", test.shape)

Shape of Train data: (1460, 73)
Shape of Test data: (1459, 73)


In [65]:
## For label encoding
# le = LabelEncoder()
# le.fit(train["LotConfig"])
# le.transform(train["LotConfig"])[:10]

# def label_encode_all(data_to_fit, data_to_transform):
#     for i in cat_cols:
#         le = LabelEncoder()
#         le.fit(data_to_fit[i])
#         data_to_transform[i] = le.transform(data_to_transform[i])
#     return(data_to_transform)

# train_data = label_encode_all(train,train)
# test_data = label_encode_all(train,test)
# train_data.head()
# test_data.head()
#all_data = pd.concat([train, test], keys=[0,1])
#all_data_encoded = label_encode_all(all_data, all_data)
# train_data = all_data_encoded.xs(0)
# test_data = all_data_encoded.xs(1)

### Transform the target variable (Box cox Transformation)

In [66]:
target, lambda_ = stats.boxcox(target_data.values)
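
For reference, the Box-Cox transform maps a positive value y to (y^λ − 1)/λ when λ ≠ 0 and to log(y) when λ = 0; `stats.boxcox` picks λ by maximum likelihood when none is given. The sketch below uses synthetic log-normal "prices" (an assumption, not the real `SalePrice` column) to show the effect on skewness and the round trip through `inv_boxcox`:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic, right-skewed, strictly positive values standing in for prices
prices = rng.lognormal(mean=12.0, sigma=0.4, size=500)

transformed, lam = stats.boxcox(prices)   # lam is estimated from the data
recovered = inv_boxcox(transformed, lam)  # undo the transform with the same lam

print("lambda:", lam)
print("skew before:", skew(prices))
print("skew after: ", skew(transformed))
print("round trip ok:", np.allclose(recovered, prices))
```

The estimated λ has to be kept around (here `lambda_`), because predictions made on the transformed scale can only be mapped back to prices with the same value.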

## Feature Selection

### Feature selection for Categorical variables

In [67]:
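def kruskal_wallis(data, target, level_of_significance=0.05):
    """Keep the categorical columns whose levels differ significantly in SalePrice."""
    selected_cat_cols = []
    for col in cat_cols:
        # Group the target values by the levels of this categorical column
        grouped = pd.concat([data[col], target], axis=1).groupby(col)
        samples = [grouped.get_group(level)["SalePrice"].values for level in data[col].unique()]
        result = stats.kruskal(*samples)
        if result.pvalue < level_of_significance:
            selected_cat_cols.append(col)
    return selected_cat_cols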

In [68]:
target_df = pd.DataFrame(target, columns = ["SalePrice"])
selected_cat_cols = kruskal_wallis(train,target_df)

print("No of categorical features:",len(cat_cols))
print("No of selected categorical features:",len(selected_cat_cols))

No of categorical features: 38
No of selected categorical features: 34
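
To get a feel for what the test is doing, here is a small sketch on synthetic data. The column names `Quality`, `Noise` and `Price` and the helper `kruskal_pvalue` are made up for illustration. One categorical column shifts the price level and the other is unrelated to it:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    "Quality": np.repeat(["Low", "Mid", "High"], n),  # drives the price level
    "Noise": np.tile(["a", "b", "c"], n),             # unrelated to the price
})
level = toy["Quality"].map({"Low": 100.0, "Mid": 150.0, "High": 200.0})
toy["Price"] = level + rng.normal(0, 10, size=len(toy))

def kruskal_pvalue(df, feature, target):
    """p-value of the Kruskal-Wallis H-test of target across the levels of feature."""
    groups = [g[target].values for _, g in df.groupby(feature)]
    return stats.kruskal(*groups).pvalue

p_quality = kruskal_pvalue(toy, "Quality", "Price")
p_noise = kruskal_pvalue(toy, "Noise", "Price")
print("Quality p-value:", p_quality)
print("Noise p-value:  ", p_noise)
```

Because the test works on ranks, and Box-Cox is a monotonically increasing transform of a positive target, running it on the raw or on the transformed `SalePrice` selects the same columns.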


In [69]:
# selected_cat_cols = []
# p_vals = f_regression(train_data[list(cat_cols)],target)[1]
# for i in range(len(cat_cols)):
#     if p_vals[i] > 0.05:
#         selected_cat_cols.append(cat_cols[i])

#Result: Following variables got removed.
# ['Street',
#  'LandContour',
#  'Utilities',
#  'LandSlope',
#  'Condition2',
#  'MasVnrType',
#  'BsmtFinType2']

# ['Street', 'Utilities', 'LandSlope', 'BsmtFinType2']

### PCA function

In [70]:
def apply_pca(data_to_fit,data_to_transform):
    pca = PCA(0.80)
    pca.fit(data_to_fit[list(num_cols)])
    #print(pca.explained_variance_ratio_)
    out = pca.transform(data_to_transform[list(num_cols)])
    out = pd.DataFrame(out, columns = ["x_"+str(i) for i in range(len(pca.components_))])
    return(out)

### Standardize the data before applying PCA

In [71]:
sc_X = StandardScaler()
sc_X.fit(train[num_cols])

train_num_data = pd.DataFrame(sc_X.transform(train[num_cols]), columns = num_cols)
test_num_data = pd.DataFrame(sc_X.transform(test[num_cols]), columns = num_cols)

StandardScaler()

### Numerical and categorical Train and test data 

In [72]:
pca_train_data = apply_pca(train_num_data,train_num_data)
pca_test_data = apply_pca(train_num_data,test_num_data)

cat_train_data = train[selected_cat_cols]
cat_test_data = test[selected_cat_cols]

print("Shape - Numerical Train data: {} | Categorical Train data: {}".format(pca_train_data.shape, cat_train_data.shape))
print("Shape - Numerical Test data: {} | Categorical Test data: {}".format(pca_test_data.shape, cat_test_data.shape))

Shape - Numerical Train data: (1460, 17) | Categorical Train data: (1460, 34)
Shape - Numerical Test data: (1459, 17) | Categorical Test data: (1459, 34)
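
Passing a float such as `0.80` to `PCA` asks scikit-learn to keep as many components as are needed to explain that share of the variance. A sketch on synthetic correlated features (not the housing data), with the scaler and PCA fitted on one part and applied to the other:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 3))   # 3 hidden factors
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.3 * rng.normal(size=(300, 10))  # 10 correlated synthetic features
X_fit, X_new = X[:200], X[200:]

# Fit scaler and PCA on the "train" part only, then reuse both on new rows
scaler = StandardScaler().fit(X_fit)
pca = PCA(n_components=0.80).fit(scaler.transform(X_fit))
Z_new = pca.transform(scaler.transform(X_new))

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
print("shape of projected new data:", Z_new.shape)
```

Standardizing first matters because PCA is driven by variance: without it, a column measured in large units (such as an area in square feet) would dominate the components.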


In [73]:
cat_train = train[cat_cols]
cat_test = test[cat_cols]

In [74]:
def label_encode_all(data_to_fit, data_to_transform):
    for i in cat_cols:
        le = LabelEncoder()
        le.fit(data_to_fit[i])
        data_to_transform[i] = le.transform(data_to_transform[i])
    return(data_to_transform)


cat_data = pd.concat([cat_train, cat_test], keys=[0,1])
cat_data_encoded = label_encode_all(cat_data, cat_data)
cat_train = cat_data_encoded.xs(0)
cat_test= cat_data_encoded.xs(1)

In [75]:
cat_train.head()

Unnamed: 0,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,3,1,3,3,0,4,0,5,2,2,...,4,2,6,1,1,4,4,2,8,4
1,3,1,3,3,0,2,0,24,1,2,...,4,3,6,1,1,4,4,2,8,4
2,3,1,0,3,0,4,0,5,2,2,...,4,2,6,1,1,4,4,2,8,4
3,3,1,0,3,0,0,0,6,2,2,...,4,2,6,5,2,4,4,2,8,0
4,3,1,0,3,0,2,0,15,2,2,...,4,2,6,1,1,4,4,2,8,4


In [76]:
target_df = pd.DataFrame(target)
target_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       1460 non-null   float64
dtypes: float64(1)
memory usage: 11.5 KB


In [77]:
# from sklearn.feature_selection import chi2
# fs = SelectKBest(score_func=chi2, k='all')
# fs.fit(cat_train, target_df)
# X_train_fs = fs.transform(cat_train)
# # X_test_fs = fs.transform(X_test)

### Create dummy variables (for the categorical data)

In [78]:
train_final = pd.concat([pca_train_data, cat_train_data],axis=1)
test_final = pd.concat([pca_test_data, cat_test_data],axis=1)

all_data = pd.concat([train_final, test_final], keys=[0,1])
all_data = pd.get_dummies(all_data)
train_final = all_data.xs(0)
test_final = all_data.xs(1)

print("Shape of our Final Train data:", train_final.shape)
print("Shape of our Final Test data:", test_final.shape)
print("--"*50)
print("Columns of the final data: \n \n",train_final.columns )
print("--"*50)
print("Information about the columns: \n \n")
train_final.info()

Shape of our Final Train data: (1460, 238)
Shape of our Final Test data: (1459, 238)
----------------------------------------------------------------------------------------------------
Columns of the final data: 
 
 Index(['x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9',
       ...
       'SaleType_ConLw', 'SaleType_New', 'SaleType_Oth', 'SaleType_WD',
       'SaleCondition_Abnorml', 'SaleCondition_AdjLand',
       'SaleCondition_Alloca', 'SaleCondition_Family', 'SaleCondition_Normal',
       'SaleCondition_Partial'],
      dtype='object', length=238)
----------------------------------------------------------------------------------------------------
Information about the columns: 
 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 0 to 1459
Columns: 238 entries, x_0 to SaleCondition_Partial
dtypes: float64(17), uint8(221)
memory usage: 520.4 KB
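
The reason for stacking train and test before calling `get_dummies` is column alignment: a category that appears in only one split would otherwise produce different columns. A toy sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; "New" appears only in train and "Oth" only in test
toy_train = pd.DataFrame({"SaleType": ["WD", "New", "WD"]})
toy_test = pd.DataFrame({"SaleType": ["WD", "Oth"]})

# Encoding each split separately gives mismatched columns
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
print(list(sep_train.columns), list(sep_test.columns))

# Encoding the stacked frame and splitting by key keeps them aligned
both = pd.get_dummies(pd.concat([toy_train, toy_test], keys=[0, 1]))
enc_train, enc_test = both.xs(0), both.xs(1)
print(list(enc_train.columns) == list(enc_test.columns))
```

An alternative that never looks at the test categories is scikit-learn's `OneHotEncoder(handle_unknown="ignore")` fitted on the train data; it is not what this notebook does, but it avoids the concatenation.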


### Train-test split

In [79]:
X_train, X_test, y_train, y_test = train_test_split(train_final, target, test_size=0.3, random_state=101)

### Cross validation function

In [80]:
#Validation function
n_folds = 5

def rmsle_cv(model):
    # Pass the KFold object itself as cv; get_n_splits() would only return the integer 5 and drop the shuffling
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse= np.sqrt(-cross_val_score(model, train_final.values, target, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)
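
A detail that is easy to miss with helpers like this: `KFold.get_n_splits` returns just the number of folds, so handing its result to `cv` is the same as `cv=5` and the `shuffle` and `random_state` settings never reach `cross_val_score`. The sketch below, on synthetic data with a plain `Ridge` model (both are stand-ins, not the notebook's data or models), passes the splitter object itself:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
print(type(kf.get_n_splits(X)))  # a plain int, with no shuffling information attached

# Passing the splitter object keeps the shuffled folds
neg_mse = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error", cv=kf)
rmse = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse)
print("mean and std:", rmse.mean(), rmse.std())
```

Since the target here is Box-Cox transformed, the score from such a helper is an RMSE on the transformed scale, which is not directly comparable to an RMSLE computed on prices.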

### Base models

In [81]:
models = {
    'Lasso': LassoCV(),

    'ElasticNet': ElasticNetCV(),

    'KernelRidge': KernelRidge(alpha=0.2, kernel='polynomial', degree=2, coef0=2.5),

    'GradientBoost': GradientBoostingRegressor(n_estimators=3000, learning_rate=0.03,
                                               max_depth=4, max_features='sqrt',
                                               min_samples_leaf=15, min_samples_split=10,
                                               loss='huber'),

    'XGBoost': XGBRegressor(learning_rate=0.02, max_depth=5,
                            min_child_weight=1.7817, n_estimators=2000,
                            reg_alpha=0.008, reg_lambda=0.6,
                            subsample=0.5213),

    'Light GBM': LGBMRegressor(objective='regression', num_leaves=5, learning_rate=0.02,
                               n_estimators=2177, max_bin=50, bagging_fraction=0.65,
                               bagging_freq=5, bagging_seed=7,
                               feature_fraction=0.201, feature_fraction_seed=7, n_jobs=-1),

    'Support Vector': SVR(kernel='rbf'),

    'Random Forest': RandomForestRegressor(max_depth=2, random_state=0),
}


In [82]:
# score = rmsle_cv(models['KernelRidge'])
# print("\n {} score: {:.4f} ({:.4f})\n".format(str('KR'),score.mean(), score.std()))

In [83]:
# for i in models.keys():
#     score = rmsle_cv(models[i])
#     print("\n {} score: {:.4f} ({:.4f})\n".format(str(i),score.mean(), score.std()))


### Check RMSLE score for any model

In [84]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_log_error(y, y_pred))

def predict_score(model , y_test = y_test):
    
    model.fit(X_train, y_train)
    y_pred= model.predict(X_test)

    y_pred = inv_boxcox(y_pred.reshape(-1) , lambda_)
    y_actual = inv_boxcox(y_test, lambda_)
    
    return(rmsle(y_actual,y_pred))
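
For reference, RMSLE is the square root of the mean of (log(1 + ŷ) − log(1 + y))² over all examples. A quick check with made-up prices shows that the manual formula agrees with the `mean_squared_log_error` based helper:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up actual and predicted prices
y_true = np.array([200000.0, 150000.0, 320000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0])

# Manual RMSLE: root mean squared difference of log(1 + value)
manual = np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2))
via_sklearn = np.sqrt(mean_squared_log_error(y_true, y_hat))
print(manual, via_sklearn)
```

Because the metric takes logs itself, it has to be fed values on the original price scale, which is why `predict_score` applies `inv_boxcox` to both the predictions and the held-out target first.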

In [85]:
for i in models.keys():
    print("Model: {} & RMSLE Score: {}".format(i, predict_score(models[i])) )

Model: Lasso & RMSLE Score: 0.1799706815183176
Model: ElasticNet & RMSLE Score: 0.17937964402697013
Model: KernelRidge & RMSLE Score: 0.13836065646337548
Model: GradientBoost & RMSLE Score: 0.15239258583332405
Model: XGBoost & RMSLE Score: 0.13982249127070165
Model: Light GBM & RMSLE Score: 0.14468195917457097
Model: Support Vector & RMSLE Score: 0.18643740878968085
Model: Random Forest & RMSLE Score: 0.2132727638956891


### Stacking Models Prediction

In [86]:
models.keys()

dict_keys(['Lasso', 'ElasticNet', 'KernelRidge', 'GradientBoost', 'XGBoost', 'Light GBM', 'Support Vector', 'Random Forest'])

In [87]:
model_names = ['KernelRidge', 'GradientBoost', 'XGBoost', 'Light GBM']
estimators = [(i, models[i]) for i in model_names]
stack_reg=StackingRegressor(estimators=estimators,n_jobs=-1)

print('StackingRegressor:')
print('RMSLE Score: ',predict_score(stack_reg))

StackingRegressor:
RMSLE Score:  0.13771437147290216
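
As a rough picture of what `StackingRegressor` does: each base model produces out-of-fold predictions, and a final estimator (`RidgeCV` by default, as in the cell above where none is passed) learns how to combine them. The sketch below uses synthetic data and two light stand-in base models instead of the four tuned ones:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
# Synthetic target with a linear part and a non-linear part
y = 3.0 * X[:, 0] + np.sin(2.0 * X[:, 1]) + 0.1 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    # final_estimator is left at its default (RidgeCV)
)
stack.fit(X_tr, y_tr)

print("R^2 on held-out data:", stack.score(X_te, y_te))
print("meta-model weights:", stack.final_estimator_.coef_)
```

The fitted weights in `final_estimator_.coef_` give a rough idea of how much each base model contributes to the blend.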


### Final Prediction for test data

In [88]:
# stack_reg=StackingRegressor(estimators=estimators,n_jobs=-1)
# model = stack_reg
# model.fit(train_final, target)

# y_pred= model.predict(test_final)
# y_pred = inv_boxcox(y_pred.reshape(-1) , lambda_)
# print("Length of the output:",len(y_pred))

In [89]:
# Id = [i+1461 for i in range(1459)]
# df = pd.DataFrame({'Id':Id , 'SalePrice': y_pred})

# df.to_csv('Submission', sep=',',index=False)