# What is EDA?
Exploratory Data Analysis (EDA) is an approach to (or philosophy of) data analysis that employs a variety of techniques, mostly graphical, to:
* Maximize insight into a data set;
* Uncover underlying structure;
* Extract important variables;
* Detect outliers and anomalies;
* Test underlying assumptions;
* Develop parsimonious models; and
* Determine optimal factor settings.

# What is Ensemble Learning?
> An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data.

> Empirically, ensembles tend to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees). Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity.
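
As a rough illustration of the over-fitting point above, the sketch below compares a single fully grown decision tree with a bagged ensemble of such trees. It uses synthetic data rather than the house-price data; the sine-shaped target, the noise level, the sample sizes, and the seeds are all assumptions chosen only for the demonstration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed for illustration only)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A single unpruned tree memorizes the noise in the training data
single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Bagging: each tree sees a different bootstrap sample, predictions are averaged
bagged_trees = BaggingRegressor(
    DecisionTreeRegressor(), n_estimators=100, random_state=0
).fit(X_tr, y_tr)

mse_single = mean_squared_error(y_te, single_tree.predict(X_te))
mse_bagged = mean_squared_error(y_te, bagged_trees.predict(X_te))
print('single tree MSE: {:.3f}'.format(mse_single))
print('bagged trees MSE: {:.3f}'.format(mse_bagged))
```

On data like this the averaged ensemble usually scores a lower held-out error than the single tree, because the individual trees make different mistakes that partly cancel out.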


# Import Libraries

In [None]:
%matplotlib inline
import numpy as np 
import pandas as pd 
import random as rnd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
matplotlib.style.use('ggplot')

# Load train and test data 

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

train.describe()

In [None]:
test.describe()

**Getting Correlation between variables**

In [None]:
corr = train.select_dtypes(include = ['float64', 'int64']).iloc[:, 1:].corr()
plt.figure(figsize=(12, 12))
sns.heatmap(corr, vmax=1, square=True)

**Top 20 variables correlated with SalePrice with score**

In [None]:
k = 20 #number of variables for heatmap
cols = corr.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(12, 12))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

**Finding outliers in GrLivArea**

In [None]:
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

**Deleting the two outliers in the bottom right**

In [None]:
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

**Finding Skewness in SalePrice**

In [None]:
from scipy.stats import norm, skew
from scipy import stats
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
# raw string so that \m and \s are not treated as (invalid) escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

**SalePrice is right-skewed, so we log-transform it (`log1p`) to bring its distribution closer to normal.**
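
The effect of the transform can be sketched on synthetic data. The log-normal sample below is only a stand-in for a right-skewed price column (its parameters are assumptions, not estimates from the competition data); the sketch also shows that `np.expm1` inverts `np.log1p`, which matters when predictions are converted back to prices.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (assumed parameters, illustration only)
rng = np.random.RandomState(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

skew_raw = skew(prices)
log_prices = np.log1p(prices)   # log(1 + x)
skew_log = skew(log_prices)

# expm1 is the exact inverse of log1p
recovered = np.expm1(log_prices)

print('skewness before: {:.2f}'.format(skew_raw))
print('skewness after:  {:.2f}'.format(skew_log))
```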

In [None]:
train["SalePrice"] = np.log1p(train["SalePrice"])

In [None]:
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
# raw string so that \m and \s are not treated as (invalid) escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

# Preprocessing

In [None]:
ntrain = train.shape[0]
ntest = test.shape[0]

# get the targets
y_train_sale = train.SalePrice.values

# combine train and test
combined = pd.concat((train, test)).reset_index(drop=True)
combined.drop(['SalePrice'], axis=1, inplace=True)

**Finding features with NA values**

In [None]:
all_data_na = (combined.isnull().sum() / len(combined)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data

In [None]:
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)  # numeric angle; recent matplotlib versions reject the string '90'
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

**Filling NAs**

Using the variable description file provided, the features with missing values can be filled as follows.

In [None]:
combined['MasVnrArea'] = combined['MasVnrArea'].fillna(0.0)
combined["MasVnrType"] = combined["MasVnrType"].fillna("None")
# LotFrontage is filled further down with the median of each Neighborhood;
# filling it with the global median here would turn that step into a no-op
combined['BsmtFinSF1'] = combined['BsmtFinSF1'].fillna(0.0)
combined['BsmtFinSF2'] = combined['BsmtFinSF2'].fillna(0.0)
combined['BsmtUnfSF'] = combined['BsmtUnfSF'].fillna(0.0)
combined['TotalBsmtSF'] = combined['TotalBsmtSF'].fillna(0.0)
combined['BsmtFullBath'] = combined['BsmtFullBath'].fillna(0)
combined['BsmtHalfBath'] = combined['BsmtHalfBath'].fillna(0)
combined['GarageYrBlt'] = combined['GarageYrBlt'].fillna(0)
combined['GarageCars'] = combined['GarageCars'].fillna(0)
combined['GarageArea'] = combined['GarageArea'].fillna(0)
combined['GarageFinish'] = combined['GarageFinish'].fillna('None')

# using the most frequent zone
combined['MSZoning'] = combined['MSZoning'].fillna(combined['MSZoning'].mode()[0])

combined = combined.drop(['Utilities'], axis=1)

# most common functionality
combined["Functional"] = combined["Functional"].fillna("Typ")

combined['Electrical'] = combined['Electrical'].fillna(combined['Electrical'].mode()[0])
combined['KitchenQual'] = combined['KitchenQual'].fillna(combined['KitchenQual'].mode()[0])
combined['Exterior1st'] = combined['Exterior1st'].fillna(combined['Exterior1st'].mode()[0])
combined['Exterior2nd'] = combined['Exterior2nd'].fillna(combined['Exterior2nd'].mode()[0])
combined['SaleType'] = combined['SaleType'].fillna(combined['SaleType'].mode()[0])
combined['MSSubClass'] = combined['MSSubClass'].fillna("None")
combined['PoolQC'] = combined['PoolQC'].fillna('None')
combined['MiscFeature'] = combined['MiscFeature'].fillna('None')
combined['Alley'] = combined['Alley'].fillna('None')
combined['Fence'] = combined['Fence'].fillna('None')
combined['FireplaceQu'] = combined['FireplaceQu'].fillna('None')
combined["LotFrontage"] = combined.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    combined[col] = combined[col].fillna('None')

for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    combined[col] = combined[col].fillna('None')

**Label Encoding of some categorical features**

In [None]:
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')

for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(combined[c].values)) 
    combined[c] = lbl.transform(list(combined[c].values))
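
One caveat worth illustrating: `LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, so for quality scales such as Ex/Gd/TA/Fa/Po the codes do not follow the quality ranking. The sketch below shows this on a toy series; the `quality_order` mapping is a hypothetical alternative, not something the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy quality column using the labels from the data description
quality = pd.Series(['Gd', 'TA', 'Ex', 'None', 'Fa'])

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)
print(list(encoder.classes_))  # labels in sorted order
print(list(codes))             # codes follow that sorted order, not quality

# Hypothetical explicit mapping that preserves the quality ranking
quality_order = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
ordinal_codes = quality.map(quality_order)
print(list(ordinal_codes))
```

Tree-based models can cope with arbitrary integer codes, but a linear model such as Lasso treats them as magnitudes, so an explicit ordinal mapping may be worth trying.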

**Combining all area features to a single feature**

In [None]:
combined['TotalSF'] = combined['TotalBsmtSF'] + combined['1stFlrSF'] + combined['2ndFlrSF']

**Generating Dummies**

In [None]:
combined = pd.get_dummies(combined)

In [None]:
train = combined[:ntrain]
test = combined[ntrain:]

# Split the data into Train and Test

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
x_train, x_test, y_train, y_test = train_test_split(train, y_train_sale, test_size=0.1, random_state=200)

# Import libraries for Ensemble modeling

In [None]:
from sklearn import ensemble, tree, linear_model
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
import lightgbm as lgb

# Function for Scoring, Training and Testing

In [None]:
def get_score(prediction, labels):
    # scikit-learn metrics expect (y_true, y_pred); R2 is not symmetric in its arguments
    print('R2: {}'.format(r2_score(labels, prediction)))
    print('RMSE: {}'.format(np.sqrt(mean_squared_error(labels, prediction))))

def train_test(estimator, x_trn, x_tst, y_trn, y_tst):
    # score the fitted estimator on the training split
    prediction_train = estimator.predict(x_trn)
    get_score(prediction_train, y_trn)

    # score the same estimator on the held-out split
    prediction_test = estimator.predict(x_tst)
    get_score(prediction_test, y_tst)

# Ensembling

**Gradient Boosting Regressor**
> Gradient boosting is a machine learning technique for regression problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

> GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
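
The stage-wise idea can be sketched from scratch for the squared-error loss, where the negative gradient is simply the residual. This is a minimal illustration on synthetic data, not the implementation inside scikit-learn; the learning rate, number of stages, tree depth, and data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustration only)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_stages = 50

# Stage 0: a constant model; the mean minimizes squared error
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(n_stages):
    # For the loss 0.5 * (y - F)^2 the negative gradient w.r.t. F is y - F
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    # Additive update: move the current model a small step along the fitted tree
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print('training MSE at stage 0:  {:.4f}'.format(mse_history[0]))
print('training MSE at stage {}: {:.4f}'.format(n_stages, mse_history[-1]))
```

With another differentiable loss (such as the Huber loss used in the cell that follows), only the `residual` line changes: the trees are fit to that loss's negative gradient instead.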

In [None]:
GBR = ensemble.GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3, max_features='sqrt',
                                               min_samples_leaf=15, min_samples_split=10, loss='huber').fit(x_train, y_train)
train_test(GBR, x_train, x_test, y_train, y_test)

**Lasso**
> Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multicollinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination.
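
The sparsity claim can be checked on a small synthetic problem in which only two of ten features carry signal. The data, the `alpha` value, and the name `lasso_demo` are assumptions for illustration and are unrelated to the settings used for the house-price model.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Ten features, only the first two influence the target (illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso_demo = Lasso(alpha=0.1).fit(X, y)

# Ordinary least squares keeps every coefficient; the L1 penalty sets some to exactly zero
n_nonzero_ols = int(np.sum(ols.coef_ != 0))
n_nonzero_lasso = int(np.sum(lasso_demo.coef_ != 0))
print('non-zero coefficients, OLS:  ', n_nonzero_ols)
print('non-zero coefficients, Lasso:', n_nonzero_lasso)
```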

In [None]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0007, random_state=1)).fit(x_train, y_train)
train_test(lasso, x_train, x_test, y_train, y_test)

**Light GBM**
> Light GBM is a gradient boosting framework that uses tree-based learning algorithms. Light GBM is prefixed as ‘Light’ because of its high speed. It can handle large data sets and uses less memory to run.

> Light GBM grows trees vertically while other algorithms grow trees horizontally, meaning that Light GBM grows trees leaf-wise while other algorithms grow level-wise. It chooses the leaf with the maximum delta loss to grow. For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.

In [None]:
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11).fit(x_train, y_train)
train_test(model_lgb, x_train, x_test, y_train, y_test)

# Modeling 

In [None]:
GB_model = GBR.fit(train, y_train_sale)
gbr_labels = np.expm1(GB_model.predict(test))

lasso_model = lasso.fit(train, y_train_sale)
lasso_labels = np.expm1(lasso_model.predict(test))

lgb_model = model_lgb.fit(train, y_train_sale)
lgb_labels = np.expm1(lgb_model.predict(test))

In [None]:
# blend weights chosen after trying a few values; they sum to 1
output = lgb_labels*0.50 + lasso_labels*0.25 + gbr_labels*0.25
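
The blend above is a weighted average of the three models' predictions. A small helper makes the constraint that the weights sum to 1 explicit; the function name `blend` and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of equally shaped prediction arrays; weights must sum to 1."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError('weights must sum to 1')
    # Stack to shape (n_models, n_samples) and average over the model axis
    return np.average(np.vstack(predictions), axis=0, weights=weights)

# Toy predictions from three hypothetical models
preds_a = np.array([100.0, 200.0])
preds_b = np.array([110.0, 190.0])
preds_c = np.array([90.0, 210.0])

blended = blend([preds_a, preds_b, preds_c], [0.50, 0.25, 0.25])
print(blended)
```

Weights that do not sum to 1 would systematically scale the predicted prices up or down, which is why the helper rejects them.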

In [None]:
pd.DataFrame({'Id': test.Id, 'SalePrice': output}).to_csv('submission.csv', index=False)