<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#House-Prices:-Advanced-Regression-Techniques" data-toc-modified-id="House-Prices:-Advanced-Regression-Techniques-1">House Prices: Advanced Regression Techniques</a></span><ul class="toc-item"><li><span><a href="#Exploratory-Data-Analysis" data-toc-modified-id="Exploratory-Data-Analysis-1.1">Exploratory Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1.1.1">Import Libraries</a></span></li><li><span><a href="#Extract-data-and-get-info-about-the-dataframe" data-toc-modified-id="Extract-data-and-get-info-about-the-dataframe-1.1.2">Extract data and get info about the dataframe</a></span></li><li><span><a href="#Check-for-duplicates" data-toc-modified-id="Check-for-duplicates-1.1.3">Check for duplicates</a></span></li><li><span><a href="#Visualising-missing-values" data-toc-modified-id="Visualising-missing-values-1.1.4">Visualising missing values</a></span></li><li><span><a href="#Estimate-Skewness-and-Kurtosis" data-toc-modified-id="Estimate-Skewness-and-Kurtosis-1.1.5">Estimate Skewness and Kurtosis</a></span></li><li><span><a href="#Analysing-'SalePrice'" data-toc-modified-id="Analysing-'SalePrice'-1.1.6">Analysing 'SalePrice'</a></span></li><li><span><a href="#Multicollinearity-Check" data-toc-modified-id="Multicollinearity-Check-1.1.7">Multicollinearity Check</a></span></li><li><span><a href="#Correlation-between--'SalePrice'-and-numeric-features" data-toc-modified-id="Correlation-between--'SalePrice'-and-numeric-features-1.1.8">Correlation between  'SalePrice' and numeric features</a></span></li><li><span><a href="#Correlation-between-'SalePrice'-and-'OverallQual'" data-toc-modified-id="Correlation-between-'SalePrice'-and-'OverallQual'-1.1.9">Correlation between 'SalePrice' and 'OverallQual'</a></span></li><li><span><a href="#Correlation-between-'SalePrice'-and-categorical-features" 
data-toc-modified-id="Correlation-between-'SalePrice'-and-categorical-features-1.1.10">Correlation between 'SalePrice' and categorical features</a></span></li><li><span><a href="#Normalizing-independent-variables" data-toc-modified-id="Normalizing-independent-variables-1.1.11">Normalizing independent variables</a></span></li></ul></li><li><span><a href="#Feature-Engineering" data-toc-modified-id="Feature-Engineering-1.2">Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#Upfront-remove-unnecessary-columns(id's-etc.)" data-toc-modified-id="Upfront-remove-unnecessary-columns(id's-etc.)-1.2.1">Upfront remove unnecessary columns(id's etc.)</a></span></li><li><span><a href="#Remove-outliers" data-toc-modified-id="Remove-outliers-1.2.2">Remove outliers</a></span></li><li><span><a href="#Combine-train,-test-dataset" data-toc-modified-id="Combine-train,-test-dataset-1.2.3">Combine train, test dataset</a></span></li><li><span><a href="#Fill-missing-values" data-toc-modified-id="Fill-missing-values-1.2.4">Fill missing values</a></span><ul class="toc-item"><li><span><a href="#Missing-numeric-values" data-toc-modified-id="Missing-numeric-values-1.2.4.1">Missing numeric values</a></span></li><li><span><a href="#Missing-object-values" data-toc-modified-id="Missing-object-values-1.2.4.2">Missing object values</a></span></li></ul></li><li><span><a href="#Create-interesting-features:" data-toc-modified-id="Create-interesting-features:-1.2.5">Create interesting features:</a></span></li><li><span><a href="#Feature-transformation---Categorical-to-ordinal" data-toc-modified-id="Feature-transformation---Categorical-to-ordinal-1.2.6">Feature transformation - Categorical to ordinal</a></span></li><li><span><a href="#Fix-skewed-features" data-toc-modified-id="Fix-skewed-features-1.2.7">Fix skewed features</a></span></li><li><span><a href="#Feature-transformation---Numeric-to-categorical" 
data-toc-modified-id="Feature-transformation---Numeric-to-categorical-1.2.8">Feature transformation - Numeric to categorical</a></span></li><li><span><a href="#Normalize-'SalePrice'" data-toc-modified-id="Normalize-'SalePrice'-1.2.9">Normalize 'SalePrice'</a></span></li><li><span><a href="#Getting-dummy-categorical-features" data-toc-modified-id="Getting-dummy-categorical-features-1.2.10">Getting dummy categorical features</a></span></li><li><span><a href="#Train-test-split" data-toc-modified-id="Train-test-split-1.2.11">Train test split</a></span></li></ul></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-1.3">Modelling</a></span><ul class="toc-item"><li><span><a href="#Linear-Regression" data-toc-modified-id="Linear-Regression-1.3.1">Linear Regression</a></span></li></ul></li></ul></li></ul></div>

# House Prices: Advanced Regression Techniques
<b> Kaggle : https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview </b>

<b> Metric : Root-Mean-Squared-Error (RMSE) </b> between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
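
The metric described above can be sketched in a few lines. This is only an illustration: the helper name `rmsle` and the prices are made up for the example and are not part of the competition code.

```python
import math

def rmsle(y_true, y_pred):
    """Root-mean-squared error between log(predicted) and log(observed) prices."""
    squared_log_errors = [
        (math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical prices: a 10% over-prediction on a cheap and on an expensive house
cheap = rmsle([100_000], [110_000])
expensive = rmsle([500_000], [550_000])
print(cheap, expensive)
```

Both calls give (up to floating-point rounding) the same value, log(1.1), which is why the log scale treats cheap and expensive houses equally.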

## Exploratory Data Analysis

https://scikit-learn.org/stable/modules/preprocessing.html

 - Univariate visualization, Bivariate visualization, Multivariate visualization
 
   Identify: 
   
 - Trends
 - Distribution
 - Mean
 - Median
 - Outlier
 - Spread measurement (SD)
 - Correlations
 - Hypothesis testing
 - Visual Exploration

### Import Libraries

In [None]:
import numpy as np
import pandas as pd

import scipy.stats as st
from scipy.special import boxcox1p

import missingno as msno
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings


warnings.filterwarnings('ignore')
%matplotlib inline

### Extract data and get info about the dataframe

In [None]:
df_raw_train = pd.read_csv('../input/train.csv')
df_raw_test = pd.read_csv('../input/test.csv')

In [None]:
df_raw_train.shape, df_raw_test.shape

In [None]:
df_raw_train.info()

In [None]:
df_raw_train.dtypes.value_counts()

In [None]:
# segregate numeric and categorical features
numeric_features = df_raw_train.select_dtypes(include=[np.number])
categorical_features = df_raw_train.select_dtypes(include=['object'])

In [None]:
pd.set_option('display.max_columns', len(df_raw_train.columns))
display(df_raw_train.head())

In [None]:
df_raw_train.describe()

### Check for duplicates

In [None]:
idsUnique = len(set(df_raw_train.Id))
idsTotal = df_raw_train.shape[0]
idsDupli = idsTotal - idsUnique
print("There are " + str(idsDupli) + " duplicate IDs for " + str(idsTotal) + " total entries")

### Visualising missing values

In [None]:
# Visualizing the patterns of missing value occurrence in training set

sns.heatmap(df_raw_train.isnull(), cbar=True)

The "BsmtX" and "GarageX" columns have missing values in the same rows, which suggests that those houses simply have no basement or garage.

In [None]:
missing_df = df_raw_train.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df['missing_ratio'] = missing_df['missing_count'] / df_raw_train.shape[0]
missing_df = missing_df.sort_values('missing_count', ascending = False)
missing_df.loc[missing_df['missing_count'] > 0]
#missing_df

In [None]:
# selecting rows whose column value is null / None / nan
df_raw_train[df_raw_train['MasVnrType'].isna()]

In [None]:
# missingno correlation heatmap measures nullity correlation: how strongly the presence/absence of one variable affects another variable
msno.heatmap(df_raw_train)

In [None]:
# Dendrogram

msno.dendrogram(df_raw_train)

### Estimate Skewness and Kurtosis

https://towardsdatascience.com/intro-to-descriptive-statistics-252e9c464ac9

In [None]:
sns.distplot(numeric_features.skew(),color='blue',axlabel ='Skewness')

In a distplot, the y-axis shows probability density, not probability, so it can take values greater than one. The only requirement of a density plot is that the total area under the curve integrates to one.
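
A small numerical check of this point, using synthetic data rather than the housing data (the narrow normal sample is an assumption chosen only to make the effect visible):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=0.1, size=10_000)  # narrow, made-up data

# density=True rescales bar heights so that the bars integrate to one
heights, edges = np.histogram(sample, bins=30, density=True)
bin_widths = np.diff(edges)
area = float(np.sum(heights * bin_widths))

print(heights.max())  # tallest bar, above 1 for such a narrow distribution
print(area)           # total area under the bars
```

The bars can be taller than one because each bar is very narrow; it is the height times the width that sums to one.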

In [None]:
skewness = numeric_features.skew().sort_values(ascending = False)
skewness

In [None]:
skew_features = df_raw_train[skewness[abs(skewness)>0.5].index]
skew_features.columns

In [None]:
# We can treat skewness of a feature with the help of log transformation

# skew_features = np.log1p(skew_features)

# OR

# We can use the scipy function boxcox1p which computes the Box-Cox transformation.
# The goal is to find a simple transformation that lets us normalize data.

# from scipy.stats import boxcox_normmax
# for i in skew_features:
#    all_features[i] = boxcox1p(all_features[i], boxcox_normmax(all_features[i] + 1)) # all_features = train + test
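
To see what the log transformation mentioned in the comments does to skewness, here is a minimal sketch on synthetic data. The log-normal sample is an assumption standing in for a right-skewed feature; it is not drawn from the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a skewed numeric feature
feature = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5_000))

skew_before = feature.skew()
skew_after = np.log1p(feature).skew()
print(skew_before, skew_after)
```

The skewness drops from a large positive value to roughly zero, which is the effect the transformation is meant to have on the skewed features selected above.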

In [None]:
kurtosis = numeric_features.kurt().sort_values(ascending = False)
kurtosis

In [None]:
sns.distplot(numeric_features.kurt(),color='blue',axlabel ='Kurtosis')

### Analysing 'SalePrice'

In [None]:
df_raw_train['SalePrice'].describe()

In [None]:
# Histogram -  To get an idea of the distribution.
plt.figure(figsize=(14,10))
plt.subplot(2,2,1)
plt.hist(df_raw_train['SalePrice'])

plt.subplot(2,2,2)
sns.distplot(df_raw_train['SalePrice'], color="r", kde=True)
plt.title("Distribution of Sale Price")
plt.ylabel("Density")
plt.xlabel("Sale Price")

plt.subplot(2,2,3)
plt.scatter(range(df_raw_train.shape[0]), df_raw_train["SalePrice"].values,color='orange')
plt.title("Sale Price by Row")
plt.xlabel("Row Index")
plt.ylabel("Sale Price")

plt.show()

There are some outliers with Sale Price above 600,000.

In [None]:
# Removing Outliers
# upperlimit = np.percentile(df_raw_train.SalePrice.values, 99.5)
# df_raw_train.loc[df_raw_train['SalePrice'] > upperlimit, 'SalePrice'] = upperlimit

In [None]:
#skewness and kurtosis
print("Skewness: %f" % df_raw_train['SalePrice'].skew())
print("Kurtosis: %f" % df_raw_train['SalePrice'].kurt())

In [None]:
(mu, sigma) = st.norm.fit(df_raw_train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

In [None]:
plt.figure(figsize=(8,8))

plt.subplot(2,2,1)
plt.title('Normal')
sns.distplot(df_raw_train['SalePrice'], kde=False, fit=st.norm)

plt.subplot(2,2,2)
st.probplot(df_raw_train['SalePrice'], plot=plt)
plt.show()

In [None]:
plt.figure(figsize=(8,8))

plt.subplot(2,2,1)
plt.title('Log-transformed')
sns.distplot(np.log1p(df_raw_train['SalePrice']), kde=False, fit=st.norm)

plt.subplot(2,2,2)
st.probplot(np.log1p(df_raw_train['SalePrice']), plot=plt)
plt.show()

SalePrice is clearly right-skewed rather than normally distributed, while its logarithm is much closer to normal. It should therefore be log-transformed before performing regression.

In [None]:
# df_raw_train.SalePrice = np.log1p(df_raw_train.SalePrice)

### Multicollinearity Check

Multicollinearity occurs when your model includes multiple features that are correlated not just with the target variable, but also with each other.

Problem:
- Multicollinearity increases the standard errors of the coefficients. As a result, some variables can appear statistically insignificant when they should be significant.

To avoid this we can do 3 things:
- Completely remove those variables.
- Combine them into a new feature, for example by adding them.
- Use PCA, which reduces the feature set to a small number of non-collinear features.
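
Before choosing one of these remedies, it can help to quantify how collinear each feature is. One common diagnostic, which this notebook does not use and is shown here only as a sketch, is the variance inflation factor (VIF). The function name and the synthetic columns below are assumptions for the example.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of `a`
c = rng.normal(size=500)                 # unrelated to both
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get a very large VIF, while the independent column stays close to 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.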

In [None]:
correlations = numeric_features.corr()
attrs = correlations.iloc[:-1, :-1]  # all except target

threshold = 0.5
important_corrs = (attrs[abs(attrs) > threshold][attrs != 1.0]) \
    .unstack().dropna().to_dict()

unique_important_corrs = pd.DataFrame(
    list(set([(tuple(sorted(key)), important_corrs[key])
              for key in important_corrs])),
    columns=['Attribute Pair', 'Correlation'])

# sorted by absolute value
order = unique_important_corrs['Correlation'].abs().to_numpy().argsort()[::-1]
unique_important_corrs = unique_important_corrs.iloc[order]

unique_important_corrs

### Correlation between  'SalePrice' and numeric features
 - Correlation Heat Map
 - Zoomed Heat Map
 - Pair Plot
 - Scatter Plot

In [None]:
correlation = numeric_features.corr()
print(correlation['SalePrice'].sort_values(ascending = False),'\n')

<b> Correlation Heat Map </b>

In [None]:
f , ax = plt.subplots(figsize = (14,12))
plt.title('Correlation of Numeric Features with Sale Price',y=1,size=16)
sns.heatmap(correlation,square = True,  vmax=1)

At first glance, two light-red squares stand out.

 - The first one refers to the 'TotalBsmtSF' and '1stFlrSF' variables.
 - The second one refers to the 'GarageX' variables. We can conclude that they give almost the same information (multicollinearity).

<b> Zoomed Heat Map </b>

In [None]:
k= 11
top_corr_features = correlation.nlargest(k,'SalePrice')['SalePrice'].index
print(top_corr_features)
cm = np.corrcoef(df_raw_train[top_corr_features].values.T)
f , ax = plt.subplots(figsize = (14,12))
sns.heatmap(cm, vmax=.8, linewidths=0.01,square=True,annot=True,cmap='viridis',
            linecolor="white",xticklabels = top_corr_features.values ,annot_kws = {'size':10},yticklabels = top_corr_features.values)

In [None]:
corrMatrix=df_raw_train[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
       'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt',
       'YearRemodAdd', 'GarageYrBlt', 'MasVnrArea', 'Fireplaces']].corr()

sns.set(font_scale=1.10)
plt.figure(figsize=(14, 12))

sns.heatmap(corrMatrix, vmax=.8, linewidths=0.01,
            square=True,annot=True,cmap='viridis',linecolor="white")
plt.title('Correlation between features');

- GarageCars & GarageArea are closely correlated.
- TotalBsmtSF and 1stFlrSF are also closely correlated.

<b> Pair Plot </b>
 - Check Outliers
 - Check relations b/w variables

In [None]:
sns.set()
df_raw_train_copy = df_raw_train.copy()
df_raw_train_copy['SalePrice'] = np.log1p(df_raw_train_copy['SalePrice'])
sns.pairplot(df_raw_train[top_corr_features],height = 2 ,kind ='scatter',diag_kind='kde')
plt.show()

- One interesting observation is between 'TotalBsmtSF' and 'GrLivArea'. In this figure the dots form a straight line that almost acts as a border, with the majority of the dots staying below it. This makes sense: a basement can be as large as the above-ground living area, but it is not expected to be larger.

- Another interesting observation is between 'SalePrice' and 'YearBuilt'. At the bottom of the 'dots cloud' we see what almost appears to be an exponential function. The same tendency shows in the upper limit of the 'dots cloud'.

-  The last observation is that prices are increasing faster now than in previous years.

-  Visible outliers in 'TotalBsmtSF', 'GrLivArea'

<b> Scatter Plot </b>

In [None]:
columns = top_corr_features.drop('SalePrice')
for c in columns:
    df = pd.concat([df_raw_train_copy['SalePrice'], df_raw_train_copy[c]], axis=1)
    sns.lmplot(x=c, y="SalePrice", data=df)
    

<b>Quick Tip:</b>

df [['a', 'b']] vs df ['a']

df [['a', 'b']]:
 - type = DataFrame
 - selects one or more columns from a dataframe
 
df ['a']:
 - type = Series
 - selects a single column from a dataframe
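
The difference is easy to verify on a toy dataframe (the column names and values are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})  # toy data

single = df['a']            # one column label       -> Series
multiple = df[['a', 'b']]   # list of column labels  -> DataFrame
one_col_frame = df[['a']]   # a one-element list still gives a DataFrame

print(type(single).__name__, type(multiple).__name__, type(one_col_frame).__name__)
```

The last case is useful when an API expects two-dimensional input but only one feature is being passed.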

### Correlation between 'SalePrice' and 'OverallQual'

In [None]:
df_raw_train[['OverallQual', 'SalePrice']].groupby(['OverallQual'],
                                                   as_index=False).mean().sort_values(by='OverallQual', ascending=False)

In [None]:
sns.barplot(x=df_raw_train.OverallQual, y=df_raw_train.SalePrice)

In [None]:
# boxplot is a method for graphically depicting groups of numerical data through their quartiles.

var = 'OverallQual'
data = pd.concat([df_raw_train['SalePrice'], df_raw_train[var]], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

In [None]:
df_raw_train['OverallQual'].value_counts().plot(kind="bar");

### Correlation between 'SalePrice' and categorical features

In [None]:
# Filling 'NaN' with string 'MISSING' in all categorical variables
# Creating a copy, will impute relevant missing values to the main dataframe in the later section

'''
Copying advice

# IS STILL POINTING TO ORIGINAL, EDITS TO NEW WILL ALSO BE APPLIED TO ORIGINAL
new_df = master_df 

new_df.iloc[1, 2] = b # will also update `master_df`. Careful!

# makes a new copy to avoid this issue
new_df = master_df.copy()
'''

df_raw_train_copy = df_raw_train.copy()
for c in categorical_features:
    df_raw_train_copy[c] = df_raw_train_copy[c].astype('category')
    if df_raw_train_copy[c].isnull().any():
        df_raw_train_copy[c] = df_raw_train_copy[c].cat.add_categories(['MISSING'])
        df_raw_train_copy[c] = df_raw_train_copy[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)
f = pd.melt(df_raw_train_copy, id_vars=['SalePrice'], value_vars=list(categorical_features.columns))
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "SalePrice")
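
The copying advice in the cell above can be checked on a toy dataframe. The names `alias` and `independent` are made up for this sketch.

```python
import pandas as pd

master_df = pd.DataFrame({'a': [1, 2, 3]})  # toy data

alias = master_df                # same object, just a second name
independent = master_df.copy()   # separate object with its own data

independent.loc[0, 'a'] = 100    # does not touch master_df
alias.loc[1, 'a'] = 200          # shows up in master_df too

print(master_df['a'].tolist())   # [1, 200, 3]
print(independent['a'].tolist()) # [100, 2, 3]
```

Plain assignment only creates a second name for the same dataframe, so `.copy()` is the safer choice whenever the original must stay untouched.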

In [None]:
var = 'Neighborhood'
data = pd.concat([df_raw_train['SalePrice'], df_raw_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)

In [None]:
plt.figure(figsize = (12, 6))
sns.countplot(x = 'Neighborhood', data = data)
xt = plt.xticks(rotation=45)

In [None]:
data = pd.concat([df_raw_train['SalePrice'], df_raw_train['YearBuilt']], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=df_raw_train['YearBuilt'], y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=45);

In [None]:
sns.distplot(df_raw_train["YearBuilt"], kde=False);

In [None]:
ConstructionAge =  df_raw_train['YrSold'] - df_raw_train['YearBuilt']
plt.scatter(ConstructionAge, df_raw_train['SalePrice'])
plt.ylabel('SalePrice')
plt.xlabel("Construction Age of house")

House prices tend to go down as the house gets older.

### Normalizing independent variables

- <b> Normality </b> - When we talk about normality, we mean that the data should look like a normal distribution. This is important because several statistical tests rely on it (e.g. t-statistics). In this exercise we'll just check univariate normality for 'SalePrice' (which is a limited approach). Remember that univariate normality doesn't ensure multivariate normality (which is what we would like to have), but it helps. Another detail to take into account is that in big samples (>200 observations) normality is not such an issue. However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedasticity), so that's the main reason why we are doing this analysis.

- <b> Homoscedasticity </b> - Homoscedasticity refers to the 'assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)' (Hair et al., 2013). Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.

- <b> Linearity </b> - The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations. However, we'll not get into this because most of the scatter plots we've seen appear to have linear relationships.

- <b> Absence of correlated errors </b> - Correlated errors, as the name suggests, happen when one error is correlated with another. For instance, if a positive error is systematically followed by a negative error, there is a relationship between the errors. This occurs often in time series, where some patterns are time related. We'll also not get into this. However, if you detect something, try to add a variable that can explain the effect you're getting. That's the most common solution for correlated errors.



## Feature Engineering

### Upfront remove unnecessary columns(id's etc.)

In [None]:
# Remove the Ids from train and test, as they are unique for each row and hence not useful for the model
train_ID = df_raw_train['Id']
test_ID = df_raw_test['Id']
df_raw_train.drop(['Id'], axis=1, inplace=True)
df_raw_test.drop(['Id'], axis=1, inplace=True)
df_raw_train.shape, df_raw_test.shape

### Remove outliers

<b> Look into 'TotalBsmtSF' and 'GrLivArea' </b>

 - Outlier removal is not always safe. We decided to delete these two points because they are extreme and clearly atypical (very large areas for very low prices).

 - There are probably other outliers in the training data. However, removing all of them may hurt our models if the test data also contains outliers. That's why, instead of removing them all, we will try to make some of our models robust to them. You can refer to the modelling part of this notebook for that.

In [None]:
plt.figure(figsize=(16,4))

plt.subplot(1,3,1)
plt.title('GrLivArea')
plt.scatter(y =df_raw_train.SalePrice,x = df_raw_train.GrLivArea)

plt.subplot(1,3,2)
plt.title('TotalBsmtSF')
plt.scatter(y =df_raw_train.SalePrice,x = df_raw_train.TotalBsmtSF)

plt.subplot(1,3,3)
plt.title('OverallQual')
plt.scatter(y =df_raw_train.SalePrice,x = df_raw_train.OverallQual)


plt.show()

In [None]:
df_raw_train.drop(df_raw_train[(df_raw_train['GrLivArea']>4000) & (df_raw_train['SalePrice']<300000)].index, inplace=True)
df_raw_train.drop(df_raw_train[(df_raw_train['OverallQual']<5) & (df_raw_train['SalePrice']>200000)].index, inplace=True)

In [None]:
plt.figure(figsize=(16,4))

plt.subplot(1,3,1)
plt.title('GrLivArea')
plt.scatter(y =df_raw_train.SalePrice,x = df_raw_train.GrLivArea)

plt.subplot(1,3,2)
plt.title('TotalBsmtSF')
plt.scatter(y =df_raw_train.SalePrice,x = df_raw_train.TotalBsmtSF)

plt.subplot(1,3,3)
plt.title('OverallQual')
plt.scatter(y =df_raw_train.SalePrice,x = df_raw_train.OverallQual)


plt.show()

In [None]:
df_raw_train.reset_index(drop=True, inplace=True)

### Combine train, test dataset
Combine train and test features in order to apply the feature transformation pipeline to the entire dataset

In [None]:
y_train = df_raw_train.SalePrice
df_raw_train.drop('SalePrice', axis=1, inplace=True)
df_raw_train.shape, df_raw_test.shape

In [None]:
all_features = pd.concat([df_raw_train, df_raw_test]).reset_index(drop=True)
all_features.shape

<b>Quick Tip:</b>
Feature Scaling:
 - StandardScaler - subtract the mean and divide by std
 - MaxAbsScaler - transform down to [-1, 1] bounds
 - QuantileTransformer - transform down to [0, 1] bounds
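
The arithmetic behind these scalers can be sketched with plain NumPy. The first two lines follow the definitions in the tip. The rank-based mapping is only a simplified stand-in for a quantile transform and is not scikit-learn's actual algorithm. The feature values are made up.

```python
import numpy as np

x = np.array([-4.0, -1.0, 0.0, 2.0, 8.0])  # made-up feature values

standardized = (x - x.mean()) / x.std()    # mean 0, standard deviation 1
max_abs_scaled = x / np.abs(x).max()       # largest magnitude becomes 1

ranks = x.argsort().argsort()              # 0 .. n-1, smallest to largest
rank_scaled = ranks / (len(x) - 1)         # evenly spread over [0, 1]

print(standardized)
print(max_abs_scaled)
print(rank_scaled)
```

Note that the rank-based version ignores the distances between values entirely, which is why quantile-style transforms are robust to outliers such as the value 8 here.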



### Fill missing values

<b>Note the difference between NaN, '', None.</b>

- NaN = "not a number"; it is still a float, so think of it as an empty space that can still be passed through numerical operations

- '' = an empty string, which pandas does not treat as missing

- None = also an empty space, but in DataFrames it is stored as a Python object, which cannot be processed through optimized numerical operations
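
A short sketch of how these three values behave in practice (the toy Series are assumptions for the example):

```python
import numpy as np
import pandas as pd

print(type(np.nan).__name__)   # float
print(np.nan + 1)              # nan: arithmetic runs, but the result stays NaN
print(np.nan == np.nan)        # False: NaN never compares equal, use isna()

numeric = pd.Series([1.0, None])       # None becomes NaN in a float Series
strings = pd.Series(['a', None, ''])   # text data with a None and an empty string

numeric_missing = numeric.isna().tolist()
string_missing = strings.isna().tolist()
print(numeric.dtype, numeric_missing, string_missing)
```

`isna()` flags both NaN and None as missing, but not the empty string, so empty strings need to be handled explicitly if they are meant to represent missing data.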

#### Missing numeric values
<b>Try different strategies of filling in missing values (modes/means/medians/etc.)</b>

In [None]:
numeric_df =  all_features.select_dtypes(include=[np.number])

missing_numeric_df = numeric_df.isnull().sum(axis=0).reset_index()
missing_numeric_df.columns = ['column_name', 'missing_count']
missing_numeric_df['missing_ratio'] = missing_numeric_df['missing_count'] / numeric_df.shape[0]
missing_numeric_df = missing_numeric_df.sort_values('missing_count', ascending = False)
missing_numeric_df.loc[missing_numeric_df['missing_count'] > 0]

In [None]:
# Group by neighborhood and fill in missing values with the median LotFrontage of that neighborhood
all_features['LotFrontage'] = all_features.groupby(
    'Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))

# Replacing the missing values with 0, since no garage = no cars in garage
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
        all_features[col] = all_features[col].fillna(0)
        
# Replacing the missing values with 0, since no basement = no bathrooms, no surface area
for col in ('BsmtHalfBath', 'BsmtFullBath', 'TotalBsmtSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'):
        all_features[col] = all_features[col].fillna(0)
        
all_features['MasVnrArea'] = all_features['MasVnrArea'].fillna(0)
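
The group-wise median fill used for 'LotFrontage' is easier to follow on a tiny example. The dataframe `toy` and its values are made up for this sketch.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B'],
    'LotFrontage': [60.0, np.nan, 80.0, 40.0, np.nan],
})

# transform() returns a result aligned with the original rows,
# so each NaN is replaced by the median of its own neighborhood
toy['LotFrontage'] = toy.groupby('Neighborhood')['LotFrontage'].transform(
    lambda x: x.fillna(x.median())
)
print(toy['LotFrontage'].tolist())  # [60.0, 70.0, 80.0, 40.0, 40.0]
```

The missing value in neighborhood A gets 70 (the median of 60 and 80), while the one in neighborhood B gets 40.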



#### Missing object values

In [None]:
object_df =  all_features.select_dtypes(include='object')

missing_object_df = object_df.isnull().sum(axis=0).reset_index()
missing_object_df.columns = ['column_name', 'missing_count']
missing_object_df['missing_ratio'] = missing_object_df['missing_count'] / object_df.shape[0]
missing_object_df = missing_object_df.sort_values('missing_count', ascending = False)
missing_object_df.loc[missing_object_df['missing_count'] > 0]

In [None]:
# For a few columns there is lots of NaN entries.
# However, reading the data description we find this is not missing data:
# For PoolQC, NaN is not missing data but means no pool, likewise for Fence, FireplaceQu etc.

cols_fillna = ['PoolQC', 'Alley', 'Fence', 'FireplaceQu', 'GarageCond', 'GarageQual', 'GarageFinish',
               'GarageType', 'BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1', 'MasVnrType']

# replace NaN with the string 'None' in these columns
for col in cols_fillna:
    all_features[col] = all_features[col].fillna('None')
all_features['MSZoning'] = all_features.groupby(
    'MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
all_features['Exterior1st'] = all_features['Exterior1st'].fillna(
    all_features['Exterior1st'].mode()[0])
all_features['Exterior2nd'] = all_features['Exterior2nd'].fillna(
    all_features['Exterior2nd'].mode()[0])
all_features['SaleType'] = all_features['SaleType'].fillna(
    all_features['SaleType'].mode()[0])

# the data description states that NA refers to typical ('Typ') values
all_features['Functional'] = all_features['Functional'].fillna('Typ')

all_features['KitchenQual'] = all_features['KitchenQual'].fillna("TA")
# Replacing missing values with most frequent ones.
all_features['Electrical'] = all_features['Electrical'].fillna("SBrkr")

all_features["Utilities"] = all_features["Utilities"].fillna("None")
all_features['MiscFeature'] = all_features['MiscFeature'].fillna("None")

### Create interesting features:

- Simplifications of existing features
- Combinations of existing features
- Polynomials on the top 10 existing features

In [None]:
all_features['HasWoodDeck'] = all_features['WoodDeckSF'].apply(lambda x: 1 if x > 0 else 0)
all_features['HasOpenPorch'] = all_features['OpenPorchSF'].apply(lambda x: 1 if x > 0 else 0)
all_features['Has3SsnPorch'] = all_features['3SsnPorch'].apply(lambda x: 1 if x > 0 else 0)
all_features['HasScreenPorch'] = all_features['ScreenPorch'].apply(lambda x: 1 if x > 0 else 0)
all_features['haspool'] = all_features['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
all_features['has2ndfloor'] = all_features['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
all_features['hasgarage'] = all_features['GarageArea'].apply(lambda x: 1 if x > 0 else 0)
all_features['hasbsmt'] = all_features['TotalBsmtSF'].apply(lambda x: 1 if x > 0 else 0)
all_features['hasfireplace'] = all_features['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)

all_features['Total_sqr_footage'] = (all_features['BsmtFinSF1'] + all_features['BsmtFinSF2'] +
                                 all_features['1stFlrSF'] + all_features['2ndFlrSF'])
all_features['Total_Bathrooms'] = (all_features['FullBath'] + (0.5 * all_features['HalfBath']) +
                               all_features['BsmtFullBath'] + (0.5 * all_features['BsmtHalfBath']))
all_features['Total_porch_sf'] = (all_features['OpenPorchSF'] + all_features['3SsnPorch'] +
                              all_features['EnclosedPorch'] + all_features['ScreenPorch'] +
                              all_features['WoodDeckSF'])

all_features['Total_Home_Quality'] = all_features['OverallQual'] + all_features['OverallCond']

all_features['TotalSF'] = all_features['TotalBsmtSF'] + all_features['1stFlrSF'] + all_features['2ndFlrSF']

all_features['YearsSinceRemodel'] = all_features['YrSold'].astype(int) - all_features['YearRemodAdd'].astype(int)

#all_features['ConstructionAge'] = all_features['YrSold'] - all_features['YearBuilt']

In [None]:
#test = all_features.ix[all_features['ConstructionAge'] < 0]
#test

Adding squared, cubed and square-root terms is motivated by the non-linearities seen in the scatterplots of each predictor against SalePrice and log(SalePrice).

In [None]:
def addSquared(dataframe, column_list):
    # Return a new dataframe with a '<col>_sq' column for each listed column
    new_cols = {col + '_sq': dataframe[col] ** 2 for col in column_list}
    return dataframe.assign(**new_cols)

def addCubed(dataframe, column_list):
    # Return a new dataframe with a '<col>_cube' column for each listed column
    new_cols = {col + '_cube': dataframe[col] ** 3 for col in column_list}
    return dataframe.assign(**new_cols)

def addsqrt(dataframe, column_list):
    # Return a new dataframe with a '<col>_sqrt' column for each listed column
    new_cols = {col + '_sqrt': np.sqrt(dataframe[col]) for col in column_list}
    return dataframe.assign(**new_cols)

columns = top_corr_features.drop('SalePrice')
# The helpers return new dataframes, so the results must be assigned back
all_features = addSquared(all_features, columns)
all_features = addCubed(all_features, columns)
all_features = addsqrt(all_features, columns)

### Feature transformation - Categorical to ordinal
<b>Label Encoding some categorical variables that may contain information in their ordering set</b>

In [None]:
all_features['ExterQual'] = pd.Categorical(all_features['ExterQual'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True).codes
all_features['ExterCond'] = pd.Categorical(all_features['ExterCond'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True).codes
all_features['KitchenQual'] = pd.Categorical(all_features['KitchenQual'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True).codes
all_features['HeatingQC'] = pd.Categorical(all_features['HeatingQC'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True).codes
all_features['BsmtQual'] = pd.Categorical(all_features['BsmtQual'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po', 'None'], ordered=True).codes
all_features['BsmtCond'] = pd.Categorical(all_features['BsmtCond'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po', 'None'], ordered=True).codes
all_features['BsmtFinType1'] = pd.Categorical(all_features['BsmtFinType1'], categories=[
    'GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'None'], ordered=True).codes
all_features['BsmtFinType2'] = pd.Categorical(all_features['BsmtFinType2'], categories=[
    'GLQ', 'ALQ', 'BLQ', 'Rec', 'LwQ', 'Unf', 'None'], ordered=True).codes
all_features['FireplaceQu'] = pd.Categorical(all_features['FireplaceQu'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po', 'None'], ordered=True).codes
all_features['GarageQual'] = pd.Categorical(all_features['GarageQual'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po', 'None'], ordered=True).codes
all_features['GarageCond'] = pd.Categorical(all_features['GarageCond'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'Po', 'None'], ordered=True).codes
all_features['PoolQC'] = pd.Categorical(all_features['PoolQC'], categories=[
    'Ex', 'Gd', 'TA', 'Fa', 'None'], ordered=True).codes
all_features['LandSlope'] = pd.Categorical(all_features['LandSlope'], categories=[
    'Gtl', 'Mod', 'Sev'], ordered=True).codes
all_features['PavedDrive'] = pd.Categorical(all_features['PavedDrive'], categories=[
    'Y', 'P', 'N'], ordered=True).codes
all_features['GarageFinish'] = pd.Categorical(all_features['GarageFinish'], categories=[
    'Fin', 'RFn', 'Unf', 'None'], ordered=True).codes
all_features['BsmtExposure'] = pd.Categorical(all_features['BsmtExposure'], categories=[
    'Gd', 'Av', 'Mn', 'No', 'None'], ordered=True).codes
all_features['Functional'] = pd.Categorical(all_features['Functional'], categories=[
    'Typ', 'Min1', 'Min2', 'Mod', 'Maj1', 'Maj2', 'Sev', 'Sal', 'None'], ordered=True).codes
all_features['Fence'] = pd.Categorical(all_features['Fence'], categories=[
    'GdPrv', 'MnPrv', 'GdWo', 'MnWw', 'None'], ordered=True).codes
all_features['LotShape'] = pd.Categorical(all_features['LotShape'], categories=[
    'Reg', 'IR1', 'IR2', 'IR3'], ordered=True).codes


all_features['CentralAir'] = all_features['CentralAir'].apply(
    lambda x: 0 if x == 'N' else 1)
all_features['Street'] = all_features['Street'].apply(
    lambda x: 0 if x == 'Pave' else 1)

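Alternatively, let sklearn's LabelEncoder assign the integer codes. Note that LabelEncoder numbers the classes in sorted (alphabetical) order, not by quality, so the resulting codes do not preserve the ordinal ranking used above.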

In [None]:
#from sklearn.preprocessing import LabelEncoder
#cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
#        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
#        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
#        'LotShape', 'PavedDrive', 'CentralAir')
# process columns, apply LabelEncoder to categorical features
#for c in cols:
#    lbl = LabelEncoder()
#    lbl.fit(list(all_features[c].values))
#    all_features[c] = lbl.transform(list(all_features[c].values))

# shape
#print('Shape all_data: {}'.format(all_features.shape))
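The difference between the two encodings can be seen on a toy column. This is a minimal sketch; the `quality` series is made up for illustration and is not part of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with the five quality levels
quality = pd.Series(['Gd', 'Ex', 'TA', 'Po', 'Fa'])

# Explicit ordering: the position in `categories` becomes the code (0 = 'Ex')
manual_codes = pd.Categorical(
    quality, categories=['Ex', 'Gd', 'TA', 'Fa', 'Po'], ordered=True).codes

# LabelEncoder: classes are sorted alphabetically before being numbered
encoder = LabelEncoder()
sklearn_codes = encoder.fit_transform(quality)

print(list(encoder.classes_))
print(list(manual_codes))
print(list(sklearn_codes))
```

With the explicit categories, the codes follow the quality ranking; with LabelEncoder, 'Fa' is numbered before 'Gd' simply because it sorts first alphabetically.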

<b>Quick Tip </b>
fit vs. transform vs. fit_transform
 - fit
   - when you fit a scaler to dataset A, it calculates and stores the mean of A and the standard deviation of A
 - transform
   - this can be applied to ANY dataset: it subtracts the previously fitted mean of A and divides by the previously fitted standard deviation of A
   - $X_{test,norm} = \frac{X_{test} - \mu_{train}}{\sigma_{train}}$
 - fit_transform
   - does both of these things in a single call on the same dataset
   - $X_{train,norm} = \frac{X_{train} - \mu_{train}}{\sigma_{train}}$


For consistency, it is best to fit_transform on your training dataset, but only transform your validation and test sets. This ensures that all sets are scaled with the same statistics and that nothing is learned from the validation data during preprocessing.
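As a minimal sketch of the tip above, the snippet below uses `StandardScaler` on synthetic data. The arrays `X_tr` and `X_va` are made-up stand-ins for a training and a validation set, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a training and a validation set
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_va = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean and std from X_tr, then scales X_tr
X_va_scaled = scaler.transform(X_va)      # reuses the training mean and std

# Same result as applying the formula by hand with the training statistics
manual = (X_va - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_va_scaled, manual))
```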

### Fix skewed features
Fixing skewness is motivated by the fact that:
- linear methods may assign very small weights to heavily skewed predictors, so most of the information contained in their values is lost
- when such predictors take very large values, the predictions can become very large as well, or otherwise misleading.

<b>http://onlinestatbook.com/2/transformations/box-cox.html</b>

<b>https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.special.boxcox1p.html</b>

In [None]:
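# Skewness is only defined for numeric columns
skewness = all_features.skew(numeric_only=True).sort_values(ascending=False)
# Names of the columns whose absolute skewness exceeds 0.5
skew_features = skewness[abs(skewness) > 0.5].index
for i in skew_features:
    # boxcox_normmax estimates lambda and needs strictly positive input, hence the + 1
    all_features[i] = boxcox1p(all_features[i], st.boxcox_normmax(all_features[i] + 1))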
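To see what the transform does to a single column, here is a small sketch on synthetic right-skewed data. The `toy` array is an assumption made for illustration; the calls are the same `boxcox_normmax` and `boxcox1p` used in the cell above.

```python
import numpy as np
import pandas as pd
import scipy.stats as st
from scipy.special import boxcox1p

# Synthetic right-skewed, non-negative feature (log-normal draws)
rng = np.random.default_rng(0)
toy = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Estimate lambda on the shifted data, then apply the Box-Cox transform of 1 + x
lam = st.boxcox_normmax(toy + 1)
transformed = boxcox1p(toy, lam)

skew_before = pd.Series(toy).skew()
skew_after = pd.Series(transformed).skew()
print('skew before: {:.2f}'.format(skew_before))
print('skew after:  {:.2f}'.format(skew_after))
```

The skewness after the transform should be much closer to zero than before, which is the effect the 0.5 threshold above is meant to target.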

Alternatively, we can keep the original skewed columns in the dataframe and add their log or Box-Cox transforms as new columns.

In [None]:
#def add_log_columns(dataframe, column_list):
#    for column_name in column_list:
#        # keep the original column and add its log(1 + x) transform as '<name>_log'
#        dataframe[column_name + '_log'] = np.log1p(dataframe[column_name])
#    return dataframe

#all_features = add_log_columns(all_features, skew_features)

In [None]:
#def add_boxcox_columns(dataframe, column_list):
#    for column_name in column_list:
#        # estimate lambda on the shifted column, then add the transform as '<name>_boxcox'
#        lam = st.boxcox_normmax(dataframe[column_name] + 1)
#        dataframe[column_name + '_boxcox'] = boxcox1p(dataframe[column_name], lam)
#    return dataframe

#all_features = add_boxcox_columns(all_features, skew_features)

### Feature transformation - Numeric to categorical

<b>Transforming some numerical variables that are really categorical</b>

In [None]:
# MSSubClass = building class
all_features['MSSubClass'] = all_features['MSSubClass'].astype(str)


#Changing OverallCond into a categorical variable
all_features['OverallCond'] = all_features['OverallCond'].astype(str)


#Year and month sold are transformed into categorical features.
all_features['YrSold'] = all_features['YrSold'].astype(str)
all_features['MoSold'] = all_features['MoSold'].astype(str)
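Why the conversion to strings matters becomes visible once the features are one-hot encoded: `pd.get_dummies` leaves numeric columns untouched by default and only expands string or categorical ones. A minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frame: MSSubClass is a building-class code, LotArea a true quantity
toy = pd.DataFrame({'MSSubClass': [20, 60, 20, 120],
                    'LotArea': [8450, 9600, 11250, 9550]})

# While MSSubClass is numeric, get_dummies passes it through as a single column
as_numeric = pd.get_dummies(toy)

# After the cast to str, each class code gets its own indicator column
toy['MSSubClass'] = toy['MSSubClass'].astype(str)
as_category = pd.get_dummies(toy)

print(as_numeric.columns.tolist())
print(as_category.columns.tolist())
```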

### Normalize 'SalePrice'
- SalePrice is skewed to the right. This is a problem because many models, linear ones in particular, tend to fit better when the target is roughly normally distributed. We can apply a log(1+x) transform to reduce the skew.
- Taking logs also means that the same relative error counts equally for expensive and cheap houses, so errors on expensive houses no longer dominate the result.
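The second point can be checked with a small numeric sketch. The prices below are made up for illustration: both predictions are 10% too high, one for a cheap house and one for an expensive house.

```python
import numpy as np

# Hypothetical prices: a cheap and an expensive house, both over-predicted by 10%
actual = np.array([100_000.0, 1_000_000.0])
predicted = actual * 1.10

# On the raw scale the expensive house has a ten times larger error
abs_error = np.abs(predicted - actual)

# On the log1p scale both errors are (almost exactly) the same
log_error = np.abs(np.log1p(predicted) - np.log1p(actual))

# np.expm1 inverts np.log1p, so log-scale values can be mapped back to prices
recovered = np.expm1(np.log1p(actual))

print(abs_error)
print(log_error)
print(recovered)
```

Since the model is trained on the log1p scale, its predictions need `np.expm1` applied before they can be read as prices.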

In [None]:
y_train = np.log1p(y_train)
(mu, sigma) = st.norm.fit(y_train)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

In [None]:
# Remove any duplicated column names
all_features = all_features.loc[:,~all_features.columns.duplicated()]
all_features.shape

### Getting dummy categorical features

In [None]:
all_features = pd.get_dummies(all_features).reset_index(drop=True)
all_features.shape

### Train test split

In [None]:
X_train = all_features[:df_raw_train.shape[0]]
X_test = all_features[df_raw_train.shape[0]:]
X_train.shape, X_test.shape

## Modelling

In [None]:
# Fix random_state so the split (and the scores below) are reproducible
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)

### Linear Regression

In [None]:
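lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_valid)
# RMSE is on the log1p scale, since the target was log-transformed
rmse = np.sqrt(metrics.mean_squared_error(y_valid, y_pred))

# R^2 on the training set, R^2 on the validation set, validation RMSE
lm.score(X_train, y_train), lm.score(X_valid, y_valid), rmse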

Ridge Regression

https://www.quora.com/What-is-Ridge-Regression-in-laymans-terms
    
https://www.youtube.com/watch?v=Q81RR3yKn30

Lasso

https://chrisalbon.com/machine_learning/linear_regression/effect_of_alpha_on_lasso_regression/