# Table of Contents



* [Preamble](#sec-preamble)
* [Part 0 - Imports, Shared Functions and Common Code](#sec-0)
    * [Imports](#subsec-imports)
    * [Shared Functions](#subsec-shared)
    * [Common Code](#subsec-common)
* [Part 1 - Intro To Machine Learning](#sec-1)
    * [Part 1.a - Random Forest Regressor](#subsec-1a)
* [Part 2 - Intermediate Machine Learning](#sec-2)
    * [Part 2.a - Missing Values (Dropping Values)](#subsec-2a)
    * [Part 2.b - Missing Values (Simple Imputation)](#subsec-2b)
    * [Part 2.c - Missing Values (Extended Imputation)](#subsec-2c)
    * [Part 2.d - Categorical Variables (Drop Categorical Variables)](#subsec-2d)
    * [Part 2.e - Categorical Variables (Label Encoding)](#subsec-2e)
    * [Part 2.f - Categorical Variables (One-Hot Encoding)](#subsec-2f)
    * [Part 2.g - Intermediate Machine Learning - Pipelines](#subsec-2g)
    * [Part 2.h - Intermediate Machine Learning - Cross Validation](#subsec-2h)
    * [Part 2.i - XGBoost (Gradient Boost)](#subsec-2i)
    * [Part 2.j - XGBoost (Parameter Tuning)](#subsec-2j)
* [Part 3 - Exploratory Data Analysis](#sec-3)
    * [Part 3.a - Visualizing SalePrice](#subsec-3a)
    * [Part 3.b - Heatmaps Comparing Properties](#subsec-3b)
    * [Part 3.c - Missing Data](#subsec-3c)
* [Part 4 - Intro to Deep Learning](#sec-4)
    * [Part 4.a - Initializing the Data](#subsec-4a)
    * [Part 4.b - Simple EDA](#subsec-4b)
    * [Part 4.c - Dealing with Missing Data](#subsec_4c)
    * [Part 4.d - Setting up the Training and Testing Data](#subsec-4d)
    * [Part 4.e - Creating and Training a Model](#subsec-4e)
    * [Part 4.f - Generating the Submission](#subsec-4f)
* [Part 5 - Feature Engineering](#sec-5)
    * [Part 5.a - Baseline lightGBM](#subsec-5a)
    * [Part 5.b - Numerical Transforms - Logarithm](#subsec-5b)
    * [Part 5.c - Numerical Transforms - Square Root](#subsec-5c)
    * [Part 5.d - Complete Feature Engineering](#subsec-5d)
* [Part N - Determining the Best Model](#sec-N)

<a id="sec-preamble"></a>
# **Preamble**

This is meant to be a rolling notebook where we analyze the "House Prices: Advanced Regression Techniques" dataset using the techniques learned in the courses offered by Kaggle. The following courses have been completed so far; the italicized ones are relevant to this notebook:
1. Python
2. _Intro to Machine Learning_
3. _Intermediate Machine Learning_
4. _Data Visualization_
5. _Pandas_
6. Intro to SQL
7. Advanced SQL
8. _Intro to Deep Learning_
9. Computer Vision
10. Data Cleaning
11. Geospatial Analysis
12. Machine Learning Explainability
13. Microchallenges
14. _Feature Engineering_

Currently in progress:
* Natural Language Processing

While I have previous experience in some of these courses, I will try to only update this notebook with concepts/ideas introduced in the courses.

## **Goals**

The goal of the analysis is to predict the sale price of each house: for every Id in the test set, we want to predict SalePrice using a model fit on the training set.

## **Notes**

This notebook is updated after each course is finished.

* The imports in part 1 were commented out and moved to a combined part 0 section
* Shared code, such as reading in the data, was consolidated into part 0
* The most effective ML method was XGBoost with parameter tuning. This results in a score of 0.13719, which is in the top 42% of scores.
* Due to the order the courses were taken in, the data visualization comes _AFTER_ the intermediate ML, which doesn't make much practical sense. In principle one would use data visualization first, to explore the features and determine which properties are relevant to the analysis.
* The pandas course does not explicitly add any additional tools to analyze the housing data, but we can use it to look at which columns are missing a large number of values.
* Similar to how data visualization came after intermediate ML, ML explainability and feature engineering also come later in the notebook.
* Interestingly, the simple neural network doesn't perform better than any of the intermediate ML techniques.
* Machine Learning Explainability covers material that fits into exploratory data analysis, which we do using the data visualization course.

<a id="sec-0"></a>
# **Part 0. Imports, Shared Functions and Common Code**

After finishing the intermediate course it became clear that having a combined section for imports and any shared functions would be best.

<a id="subsec-imports"></a>
## Imports

In [None]:
# Imports used throughout the notebook

# plotting packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# math packages
import numpy as np

# packages needed for introductory ML
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# packages needed for intermediate ML
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# packages needed for deep learning
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GroupShuffleSplit

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import callbacks
from tensorflow.keras.callbacks import EarlyStopping

# packages needed for feature engineering
import lightgbm as lgb
from scipy.stats import skew

print("Finished Imports")

<a id="subsec-shared"></a>
## Shared Functions

In [None]:
def get_MAE(X_train, X_valid, y_train, y_valid):
    """
    Calculates the mean absolute error (MAE) for a ML approach
    
    Input
    -----
    X_train:
        the training data used
    X_valid:
        the data to be compared to
    y_train:
        the y values that are used for training the model
    y_valid:
        the y values we want our comparison to be tested against
    
    Output:
    -------
    mean_absolute_error:
        sum of total absolute error divided by sample size
    """
    
    model = RandomForestRegressor(n_estimators=100)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_valid)
    mae = mean_absolute_error(y_valid, y_predict)
    return mae

def get_score(n_estimators, X, y):
    """
    Return the average MAE over 3 CV folds of a random forest model.
    
    Input
    -----
    n_estimators:
        the number of trees in the forest
    X:
        the feature data (missing values are imputed inside the pipeline)
    y:
        the target values
        
    Output:
    -------
    mean_score:
        the mean absolute error averaged over the 3 cross-validation folds
    """
    pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                               ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))])
    # scikit-learn reports negative MAE, so multiply by -1 to get positive errors
    scores = -1 * cross_val_score(pipeline, X, y, cv=3, scoring='neg_mean_absolute_error')
    return scores.mean()

def gen_prediction(training_data, target_data, test_data, estimators=100):
    """
    Fit a random forest on the training data and predict the test data.

    Input
    -----
    training_data:
        pandas DataFrame of features used to fit the model
    target_data:
        pandas Series of target values used to fit the model
    test_data: 
        pandas DataFrame of features to generate predictions for
    estimators:
        the number of trees in the forest (default 100)
    
    Output:
    -------
    predictions for test data
    """
    # Define and fit model
    my_model = RandomForestRegressor(n_estimators=estimators, random_state=0)
    my_model.fit(training_data, target_data)

    # Get test predictions
    predictions = my_model.predict(test_data)
    print("Submission data calculated")
    return predictions


<a id="subsec-common"></a>
## Common Code

In [None]:
# Set the paths to our data
test_data_path = "../input/house-prices-advanced-regression-techniques/test.csv"
train_data_path = "../input/house-prices-advanced-regression-techniques/train.csv"
sample_data_path = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"

# Define the data
test_data = pd.read_csv(test_data_path, index_col='Id')
train_data = pd.read_csv(train_data_path, index_col='Id')
sample_data = pd.read_csv(sample_data_path)

# Create dictionaries to hold the scores and submissions of each part with different techniques
scores_dict = {}
submission_dict = {}

print("Data loaded and dictionaries initialized")

### Quick data property checks

In [None]:
# Check the head of each file
test_data.head(5)

In [None]:
train_data.head(5)

In [None]:
sample_data.head(5)

In [None]:
# Output the shape of our test, train, and sample data
print("Training data shape: {}".format(train_data.shape))
print("Testing data shape: {}".format(test_data.shape))
print("Sample data shape: {}".format(sample_data.shape))

<a id="sec-1"></a>
# **Part 1. Intro to Machine Learning**
This section is a brief look at the material covered in [Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)
<a id="subsec-1a"></a>
## Part 1.a Using a Random Forest Regressor
* This first submission is unlikely to score very highly as we are only using simple techniques, but it's a good place to start.
* This submission scored 0.18806, which is in the top 79%; we can definitely improve on this score with more advanced analysis.

## Test using Random Forest Regressor
This is the simplest approach; I won't include the attempts using other methods from the intro course.
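
The validation metric used throughout this notebook is the mean absolute error (MAE): the average of the absolute differences between predicted and true values. As a minimal sketch of what `get_MAE` reports, the example below computes it by hand and with scikit-learn. The prices are made-up toy values, not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy sale prices (hypothetical values, not from the housing data)
y_true = np.array([200000, 150000, 320000, 180000])
y_pred = np.array([210000, 140000, 300000, 185000])

# MAE by hand: average of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))

# The same quantity via scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

The absolute errors here are 10000, 10000, 20000 and 5000, so both values come out to 11250.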

In [None]:
# Define the data for testing
y = train_data.SalePrice
X = train_data.drop(['SalePrice'], axis=1)

# Divide our data into training and validation data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [None]:
# Define properties that are useful from the intro course
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X_train_features = X_train[features]
X_valid_features = X_valid[features]

In [None]:
# Check the MAE of this model
scores_dict['1.a'] = get_MAE(X_train_features, X_valid_features, y_train, y_valid)

print("MAE of simple Random Forest Regressor: ")
print(scores_dict['1.a'])

## Generate the Submission using Random Forest Regressor

In [None]:
# Create a target object
train_y = train_data.SalePrice

# Define properties that are useful from the intro course
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
train_X = train_data[features]
test_X = test_data[features]

In [None]:
# Make a prediction on the values using RandomForestRegressor
test_prediction_1a = gen_prediction(train_X, train_y, test_X)

In [None]:
# Save the prediction to our dictionary
submission_dict['1.a'] = test_prediction_1a
print("Random Forest Regressor Submission Saved")

<a id="sec-2"></a>
# **Part 2. Intermediate Machine Learning**
* This section will contain predictions done using the techniques taught in [intermediate machine learning course](https://www.kaggle.com/learn/intermediate-machine-learning)
* The score using these techniques was 0.14855 which is in the top 58%

In [None]:
# Define the data that will be used for all tests
y = train_data.SalePrice
X_full = train_data.drop(['SalePrice'], axis=1)

# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

# Define the data that will be used for all submission generation
X_test = test_data.select_dtypes(exclude=['object'])

<a id="subsec-2a"></a>
## Part 2.a Missing Values (Dropping values)
This section will use the approach where we simply drop any columns missing data
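
As a small illustration of the idea, the sketch below drops columns with missing values from a toy frame. The column names exist in the housing data, but the values are made up for the example. Note that the columns are selected from the training split only, so a column that is missing values only in the other split is kept.

```python
import numpy as np
import pandas as pd

# Toy frames with hypothetical values
toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0],
})
toy_valid = pd.DataFrame({
    "LotArea": [9550, 14260],
    "LotFrontage": [60.0, 84.0],
    "GarageYrBlt": [np.nan, 2000.0],
})

# Columns with missing values, identified on the training split only
cols_with_missing = [c for c in toy_train.columns if toy_train[c].isnull().any()]

# Drop the same columns from both splits so the features stay aligned
toy_train_reduced = toy_train.drop(columns=cols_with_missing)
toy_valid_reduced = toy_valid.drop(columns=cols_with_missing)

print(cols_with_missing)
print(toy_valid_reduced.isnull().sum())
```

In this toy case `GarageYrBlt` still contains a missing value in the validation frame, which is why the submission code further down combines the missing-column lists of both data sets before dropping.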
## Test the effectiveness of dropping values

In [None]:
# Check for missing values in each column of training data
missing_val_train = [column for column in X_train.columns
                     if X_train[column].isnull().any()]
print(missing_val_train)

In [None]:
# drop columns in training and validation data
reduced_X_train = X_train.drop(missing_val_train, axis=1)
reduced_X_valid = X_valid.drop(missing_val_train, axis=1)

In [None]:
# Check the MAE of this model
scores_dict['2.a'] = get_MAE(reduced_X_train, reduced_X_valid, y_train, y_valid)

print("MAE (Drop columns with missing values):")
print(scores_dict['2.a'])

## Generate the submission dropping missing values

In [None]:
# Check for missing values in each column of training data and test data using list comprehension
missing_val_X = [column for column in X.columns
                 if X[column].isnull().any()]

missing_val_X_test = [column for column in X_test.columns
                      if X_test[column].isnull().any()]

print("Columns with missing training data: ")
print(missing_val_X)

print("\nColumns with missing test data: ")
print(missing_val_X_test)

In [None]:
# Combine the two sets of missing columns together
# If we combine the two sets of missing columns together arbitrarily we'll get duplicates
print(missing_val_X + missing_val_X_test)

combined_missing_val = list(set(missing_val_X + missing_val_X_test))
print(combined_missing_val)

print("\nTotal number of columns to be dropped:")
print(len(combined_missing_val))

In [None]:
# Drop the columns with missing values from both sets
X_drop = X.drop(combined_missing_val, axis=1)
X_test_drop = X_test.drop(combined_missing_val, axis=1)

print ("Shape of X: {}".format(X.shape))
print ("Shape of X_test: {}".format(X_test.shape))

print ("Shape of X_drop: {}".format(X_drop.shape))
print ("Shape of X_test_drop: {}".format(X_test_drop.shape))

In [None]:
# Generate the predictions
test_prediction_2a = gen_prediction(X_drop, y, X_test_drop)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.a'] = test_prediction_2a
print("Drop Value Submission Saved")

<a id="subsec-2b"></a>
## Part 2.b Missing Values (Simple Imputation)
This section will use the approach where we use imputation to fill in missing data
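
A minimal sketch of what the imputer does, on made-up toy values: with its default settings `SimpleImputer` replaces each missing entry with the column mean learned from the data it was fit on, and the same learned means are reused for any other split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with hypothetical values
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                          "LotArea": [8000, 9000, 10000]})
toy_valid = pd.DataFrame({"LotFrontage": [np.nan, 70.0],
                          "LotArea": [8500, 9500]})

imputer = SimpleImputer()  # the default strategy is "mean"

# Fit on the training split, then reuse the learned means on the validation split
filled_train = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)
filled_valid = pd.DataFrame(imputer.transform(toy_valid),
                            columns=toy_valid.columns, index=toy_valid.index)

print(filled_train)
print(filled_valid)
```

Passing `columns` and `index` when rebuilding the frame is an alternative to restoring the column names afterwards, as the cells below do.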
## Test the effectiveness of simple imputation

In [None]:
# Imputation, setup the simple imputer and apply it to our training data
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

In [None]:
# Check the MAE of this model
scores_dict['2.b'] = get_MAE(imputed_X_train, imputed_X_valid, y_train, y_valid)

print("MAE (Imputation):")
print(scores_dict['2.b'])

## Generate the output for simple imputation

In [None]:
# Imputation, setup the simple imputer and apply it to our full data
my_imputer = SimpleImputer()
imputed_X = pd.DataFrame(my_imputer.fit_transform(X))
imputed_X_test = pd.DataFrame(my_imputer.transform(X_test))

# imputation removed column names; put them back
imputed_X.columns = X.columns
imputed_X_test.columns = X_test.columns

In [None]:
# Generate the predictions
test_prediction_2b = gen_prediction(imputed_X, y, imputed_X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.b'] = test_prediction_2b
print("Simple Imputation Submission Saved")

<a id="subsec-2c"></a>
## Part 2.c Missing Values (Extended Imputation)
In this section we extend simple imputation: for each column with missing data we add an indicator column recording which entries were missing, then impute as before
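
One way to picture extended imputation is as imputation plus a set of "was missing" flags. The cells below build these flags by hand; as an alternative sketch, scikit-learn's `SimpleImputer` can append them itself through its `add_indicator` option. The toy values and the `_missing` suffix used for the new column names are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical values
toy = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                    "LotArea": [8000, 9000, 10000]})

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="mean", add_indicator=True)
values = imputer.fit_transform(toy)

# indicator_.features_ holds the positions of the features that got an indicator
indicator_cols = [toy.columns[i] + "_missing" for i in imputer.indicator_.features_]
toy_imputed = pd.DataFrame(values,
                           columns=list(toy.columns) + indicator_cols,
                           index=toy.index)

print(toy_imputed)
```

The model can then use the extra column to learn whether "this value was originally missing" carries information of its own.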
## Test the effectiveness of extended imputation

In [None]:
# hard copy the data to ensure that we don't change the original
X_train_ext = X_train.copy()
X_valid_ext = X_valid.copy()

# Use the missing columns found in section 2.a, reminder of how these
# columns were found
# missing_column = [col for col in X_train.columns
#                   if X_train[col].isnull().any()]

# add indicator columns recording which entries were missing
for column in missing_val_train:
    X_train_ext[column + '_missing'] = X_train_ext[column].isnull()
    X_valid_ext[column + '_missing'] = X_valid_ext[column].isnull()
    
# Imputation, setup the simple imputer and apply it to the extended training data
my_imputer = SimpleImputer()
imputed_X_train_ext = pd.DataFrame(my_imputer.fit_transform(X_train_ext))
imputed_X_valid_ext = pd.DataFrame(my_imputer.transform(X_valid_ext))

# imputation removed column names; put them back
imputed_X_train_ext.columns = X_train_ext.columns
imputed_X_valid_ext.columns = X_valid_ext.columns

In [None]:
# Check the MAE of this model
scores_dict['2.c'] = get_MAE(imputed_X_train_ext, imputed_X_valid_ext, y_train, y_valid)

print("MAE (Extended Imputation):")
print(scores_dict['2.c'])

## Generate the output for extended imputation

In [None]:
# hard copy the data to ensure that we don't change the original
X_ext = X.copy()
X_test_ext = X_test.copy()

# Use the combined list of columns with missing values (training and test)
# built in section 2.a, reminder of how this list was built
# combined_missing_val = list(set(missing_val_X + missing_val_X_test))

# add indicator columns recording which entries were missing
for column in combined_missing_val:
    X_ext[column + '_missing'] = X_ext[column].isnull()
    X_test_ext[column + '_missing'] = X_test_ext[column].isnull()
    
# Imputation, setup the simple imputer and apply it to the extended full data
my_imputer = SimpleImputer()
imputed_X_ext = pd.DataFrame(my_imputer.fit_transform(X_ext))
imputed_X_test_ext = pd.DataFrame(my_imputer.transform(X_test_ext))

# imputation removed column names; put them back
imputed_X_ext.columns = X_ext.columns
imputed_X_test_ext.columns = X_test_ext.columns

In [None]:
# Generate the predictions
test_prediction_2c = gen_prediction(imputed_X_ext, y, imputed_X_test_ext)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.c'] = test_prediction_2c
print("Extended Imputation Submission Saved")

<a id="subsec-2d"></a>
## Part 2.d Categorical Variables (Drop Categorical Variables)


## **Categorical Variables Shared Code**
This section contains the code shared by the parts that use the intermediate machine learning course's techniques for handling categorical variables

In [None]:
# Define the data that will be used for all tests
y = train_data.SalePrice
X = train_data.drop(['SalePrice'], axis=1)

# Define the data that will be used for all submission generation
X_test = test_data.copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

This section will use the approach where we just drop categorical variables
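
As a quick sketch of the idea on a toy frame (made-up values), the example below keeps only the numeric columns. It uses `select_dtypes(include="number")`, which is a slightly different spelling from the `exclude=['object']` call used in the cells below but has the same effect on this data.

```python
import pandas as pd

# Toy frame with hypothetical values: two numeric columns and one categorical
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Grvl"],
    "YearBuilt": [2003, 1976],
})

# Keep only the numeric columns; the categorical column is discarded
numeric_only = toy.select_dtypes(include="number")

print(list(numeric_only.columns))
```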
## Test the effectiveness of dropping variables

In [None]:
# Check for missing values in each column of training data
missing_val_train = [column for column in X_train.columns
                     if X_train[column].isnull().any()]

# drop columns in training and validation data
reduced_X_train = X_train.drop(missing_val_train, axis=1)
reduced_X_valid = X_valid.drop(missing_val_train, axis=1)

In [None]:
print ("Shape of X_train: {}".format(X_train.shape))
print ("Shape of X_valid: {}".format(X_valid.shape))

print ("Shape of reduced_X_train: {}".format(reduced_X_train.shape))
print ("Shape of reduced_X_valid: {}".format(reduced_X_valid.shape))

In [None]:
# Drop the objects from our dataset
drop_X_train = reduced_X_train.select_dtypes(exclude=['object'])
drop_X_valid = reduced_X_valid.select_dtypes(exclude=['object'])

In [None]:
print ("Shape of drop_X_train: {}".format(drop_X_train.shape))
print ("Shape of drop_X_valid: {}".format(drop_X_valid.shape))

In [None]:
# Check the MAE of this model
scores_dict['2.d'] = get_MAE(drop_X_train, drop_X_valid, y_train, y_valid)

print("MAE (Drop categorical variables):")
print(scores_dict['2.d'])

## Generate the output for dropping variables

In [None]:
# Check for missing values in each column of training data and test data using list comprehension
# This is taken from section 2.a for finding columns with missing data
missing_val_X = [column for column in X.columns
                 if X[column].isnull().any()]

missing_val_X_test = [column for column in X_test.columns
                      if X_test[column].isnull().any()]

combined_missing_val = list(set(missing_val_X + missing_val_X_test))

In [None]:
# drop columns in the full training data and the test data
reduced_X = X.drop(combined_missing_val, axis=1)
reduced_X_test = X_test.drop(combined_missing_val, axis=1)

In [None]:
print ("Shape of X: {}".format(X.shape))
print ("Shape of X_test: {}".format(X_test.shape))

print ("Shape of reduced_X: {}".format(reduced_X.shape))
print ("Shape of reduced_X_test: {}".format(reduced_X_test.shape))

In [None]:
# Drop the objects from our dataset
drop_X = reduced_X.select_dtypes(exclude=['object'])
drop_X_test = reduced_X_test.select_dtypes(exclude=['object'])

In [None]:
print ("Shape of drop_X: {}".format(drop_X.shape))
print ("Shape of drop_X_test: {}".format(drop_X_test.shape))

In [None]:
# Generate the predictions
test_prediction_2d = gen_prediction(drop_X, y, drop_X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.d'] = test_prediction_2d
print("Drop Variable Submission Saved")

<a id="subsec-2e"></a>
## Part 2.e Categorical Variables (Label Encoding)
This section will use the approach where we assign each unique value of a categorical column a different integer
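
To make the mapping concrete, here is a minimal sketch on toy data. It uses scikit-learn's `OrdinalEncoder` rather than the `LabelEncoder` used in the cells below, because `OrdinalEncoder` can be told how to handle categories it never saw during fitting. The `"Dirt"` category and the `-1` code for unknown values are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; "Dirt" is a hypothetical category that only appears in validation
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
toy_valid = pd.DataFrame({"Street": ["Grvl", "Dirt"]})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train)
valid_codes = encoder.transform(toy_valid)

print(encoder.categories_)
print(train_codes.ravel(), valid_codes.ravel())
```

Categories are numbered in sorted order, so `Grvl` becomes 0 and `Pave` becomes 1. The cells below avoid the unseen-category problem differently, by dropping any column whose values differ between the two splits.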
## Test the effectiveness of converting labels to integer values

In [None]:
# All categorical columns
object_cols = [column for column in X_train.columns if
               X_train[column].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [column for column in object_cols if 
                   set(X_train[column]) == set(X_valid[column])]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols) - set(good_label_cols))
        
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)

In [None]:
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

print ("Shape of X_train: {}".format(X_train.shape))
print ("Shape of X_valid: {}".format(X_valid.shape))

print ("\nShape of label_X_train after dropping bad labels: {}".format(label_X_train.shape))
print ("Shape of label_X_valid after dropping bad labels: {}".format(label_X_valid.shape))

In [None]:
# Apply label encoder 
# Cannot use the code shown in the course, will raise error:
#     TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
# For solution:
#     https://stackoverflow.com/questions/46406720/labelencoder-typeerror-not-supported-between-instances-of-float-and-str

label_encoder = LabelEncoder()
for column in set(good_label_cols):
    label_X_train[column] = label_encoder.fit_transform(X_train[column].astype(str))
    label_X_valid[column] = label_encoder.transform(X_valid[column].astype(str))

In [None]:
# We can't directly calculate the MAE from this point:
# the numeric columns still contain missing values, so impute them first
my_imputer = SimpleImputer()
imputed_label_X_train = pd.DataFrame(my_imputer.fit_transform(label_X_train))
imputed_label_X_valid = pd.DataFrame(my_imputer.transform(label_X_valid))

# imputation removed column names; put them back
imputed_label_X_train.columns = label_X_train.columns
imputed_label_X_valid.columns = label_X_valid.columns

In [None]:
# Check the MAE of this model
scores_dict['2.e'] = get_MAE(imputed_label_X_train, imputed_label_X_valid, y_train, y_valid)

print("MAE (Label Encoding):") 
print(scores_dict['2.e'])

## Generate the output for label encoding

If we directly follow the same steps as above for calculating the MAE, we encounter an error because NaN values remain in the data set. To solve this we split the data into categorical and numerical columns and process them separately. Once both chunks have been processed we concatenate them back together and generate the predictions.
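
The split, process and recombine pattern described here can also be expressed with scikit-learn's `ColumnTransformer`, which applies a different preprocessing step to each group of columns and stitches the results back together. The sketch below shows the idea on toy data; the values are made up and `OrdinalEncoder` stands in for the label encoding step, so this is an illustration of the pattern rather than the code used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with hypothetical values and one gap in each column
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Street": pd.Series(["Pave", np.nan, "Pave", "Grvl"], dtype=object),
})
num_cols = ["LotFrontage"]
cat_cols = ["Street"]

# Categorical columns: fill gaps with the most frequent value, then encode
cat_steps = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder()),
])

# Numerical columns: fill gaps with the column mean
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), num_cols),
    ("cat", cat_steps, cat_cols),
])

# The output columns follow the order of the transformers: numeric, then categorical
processed = pd.DataFrame(preprocessor.fit_transform(toy),
                         columns=num_cols + cat_cols, index=toy.index)
print(processed)
```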


In [None]:
# All categorical columns
object_cols_full = [column for column in X.columns if
                    X[column].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols_full = [column for column in object_cols_full if 
                        set(X[column]) == set(X_test[column])]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols_full = list(set(object_cols_full) - set(good_label_cols_full))

print('\nNumber of Categorical columns that will be label encoded:', len(good_label_cols_full))
print('Categorical columns that will be label encoded:', good_label_cols_full)

print('\nNumber of categorical columns that will be dropped from the dataset:', len(bad_label_cols_full))
print('Categorical columns that will be dropped from the dataset:', bad_label_cols_full)

In [None]:
# Separate the categorical columns and numeric columns
cat_X = X[object_cols_full].copy()
cat_X_test = X_test[object_cols_full].copy()

num_X = X.select_dtypes(exclude=['object']).copy()

num_X_test = X_test.select_dtypes(exclude=['object']).copy()

# Drop categorical columns that will not be encoded
label_cat_X = cat_X.drop(bad_label_cols_full, axis=1)
label_cat_X_test = cat_X_test.drop(bad_label_cols_full, axis=1)

In [None]:
print ("Shape of X: {}".format(X.shape))
print ("Shape of X_test: {}".format(X_test.shape))

print ("\nShape of cat_X: {}".format(cat_X.shape))
print ("Shape of cat_X_test: {}".format(cat_X_test.shape))

print ("\nShape of num_X: {}".format(num_X.shape))
print ("Shape of num_X_test: {}".format(num_X_test.shape))

print ("\nShape of label_cat_X after dropping bad labels: {}".format(label_cat_X.shape))
print ("Shape of label_cat_X_test after dropping bad labels: {}".format(label_cat_X_test.shape))

In [None]:
# Impute the two different sets of data:

# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')

imputed_num_X = pd.DataFrame(num_imputer.fit_transform(num_X))
imputed_num_X_test = pd.DataFrame(num_imputer.transform(num_X_test))

# imputation removed column names; put them back
imputed_num_X.columns = num_X.columns
imputed_num_X_test.columns = num_X_test.columns

In [None]:
# Impute category columns
cat_imputer = SimpleImputer(strategy='most_frequent')

imputed_label_cat_X = pd.DataFrame(cat_imputer.fit_transform(label_cat_X))
imputed_label_cat_X_test = pd.DataFrame(cat_imputer.transform(label_cat_X_test))

# imputation removed column names; put them back
imputed_label_cat_X.columns = label_cat_X.columns
imputed_label_cat_X_test.columns = label_cat_X_test.columns

In [None]:
# Apply label encoder 
# Cannot use the code shown in the course, will raise error:
#     TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']
# For solution:
#     https://stackoverflow.com/questions/46406720/labelencoder-typeerror-not-supported-between-instances-of-float-and-str

label_encoder = LabelEncoder()
for column in set(good_label_cols_full):
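    # encode the imputed values so the imputation above is not discarded
    imputed_label_cat_X[column] = label_encoder.fit_transform(imputed_label_cat_X[column].astype(str))
    imputed_label_cat_X_test[column] = label_encoder.transform(imputed_label_cat_X_test[column].astype(str))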

In [None]:
print ("\nShape of imputed_label_cat_X: {}".format(imputed_label_cat_X.shape))
print ("Shape of imputed_label_cat_X_test: {}".format(imputed_label_cat_X_test.shape))

In [None]:
full_label_X = pd.concat([imputed_num_X, imputed_label_cat_X], axis=1)
full_label_X_test = pd.concat([imputed_num_X_test, imputed_label_cat_X_test], axis=1)

print ("\nShape of full_label_X after merge: {}".format(full_label_X.shape))
print ("Shape of full_label_X_test after merge: {}".format(full_label_X_test.shape))

In [None]:
# Generate the predictions
test_prediction_2e = gen_prediction(full_label_X, y, full_label_X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.e'] = test_prediction_2e
print("Label Variable Submission Saved")

<a id="subsec-2f"></a>
## Part 2.f Categorical Variables (One-Hot Encoding)
Here we will use one-hot encoding where we create new columns that indicate the presence or absence of values in the original data.
## Testing One-Hot Encoding

In [None]:
# Investigate cardinality
object_cols = [column for column in X_train.columns if X_train[column].dtype == "object"]

# Columns that will be one-hot encoded
low_cardinality_cols = [column for column in object_cols if X_train[column].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

OH_X_train = X_train[low_cardinality_cols]
OH_X_valid = X_valid[low_cardinality_cols]

In [None]:
# Imputation to categorical columns or we'll encounter problems with NaN
# Impute category columns
cat_imputer = SimpleImputer(strategy='most_frequent')

imputed_X_train = pd.DataFrame(cat_imputer.fit_transform(OH_X_train))
imputed_X_valid = pd.DataFrame(cat_imputer.transform(OH_X_valid))

# imputation removed column names; put them back
imputed_X_train.columns = OH_X_train.columns
imputed_X_valid.columns = OH_X_valid.columns


In [None]:
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

In [None]:
# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')

imputed_num_X_train = pd.DataFrame(num_imputer.fit_transform(num_X_train))
imputed_num_X_valid = pd.DataFrame(num_imputer.transform(num_X_valid))

# imputation removed column names; put them back
imputed_num_X_train.columns = num_X_train.columns
imputed_num_X_valid.columns = num_X_valid.columns

In [None]:
# Apply one-hot encoder to each column with categorical data
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(imputed_X_train))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(imputed_X_valid))

# One-hot encoding removed index; put it back
OH_cols_train.index = imputed_X_train.index
OH_cols_valid.index = imputed_X_valid.index

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([OH_cols_train, imputed_num_X_train], axis=1)
OH_X_valid = pd.concat([OH_cols_valid, imputed_num_X_valid], axis=1)

# The one-hot columns have integer names; newer scikit-learn requires all-string names
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

In [None]:
print ("\nShape of num_X_train: {}".format(num_X_train.shape))
print ("Shape of num_X_valid: {}".format(num_X_valid.shape))

print ("\nShape of OH_cols_train: {}".format(OH_cols_train.shape))
print ("Shape of OH_cols_valid: {}".format(OH_cols_valid.shape))

print ("\nShape of OH_X_train after merge: {}".format(OH_X_train.shape))
print ("Shape of OH_X_valid after merge: {}".format(OH_X_valid.shape))

In [None]:
scores_dict['2.f'] = get_MAE(OH_X_train, OH_X_valid, y_train, y_valid)

print("MAE (One-Hot Encoding):") 
print(scores_dict['2.f'])

## Generate the output for One Hot Encoding
We can effectively just copy the code for generating the MAE value and change the inputs to take the entire dataset instead.

In [None]:
# Investigate cardinality
object_cols_full = [column for column in X.columns if X[column].dtype == "object"]

# Columns that will be one-hot encoded
low_cardinality_cols_full = [column for column in object_cols_full if X[column].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols_full = list(set(object_cols_full)-set(low_cardinality_cols_full))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols_full)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols_full)

OH_X = X[low_cardinality_cols_full]
OH_X_test = X_test[low_cardinality_cols_full]

In [None]:
# Imputation to categorical columns or we'll encounter problems with NaN
# Impute category columns
cat_imputer = SimpleImputer(strategy='most_frequent')

imputed_X = pd.DataFrame(cat_imputer.fit_transform(OH_X))
imputed_X_test = pd.DataFrame(cat_imputer.transform(OH_X_test))

# imputation removed column names; put them back
imputed_X.columns = OH_X.columns
imputed_X_test.columns = OH_X_test.columns


In [None]:
# Remove categorical columns (will replace with one-hot encoding)
num_X = X.drop(object_cols_full, axis=1)
num_X_test = X_test.drop(object_cols_full, axis=1)

In [None]:
# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')

imputed_num_X = pd.DataFrame(num_imputer.fit_transform(num_X))
imputed_num_X_test = pd.DataFrame(num_imputer.transform(num_X_test))

# imputation removed column names; put them back
imputed_num_X.columns = num_X.columns
imputed_num_X_test.columns = num_X_test.columns

In [None]:
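# Apply one-hot encoder to each column with categorical data
# (scikit-learn 1.2 renamed `sparse` to `sparse_output`; use `sparse=False` on older versions)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)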
OH_cols = pd.DataFrame(OH_encoder.fit_transform(imputed_X))
OH_cols_test = pd.DataFrame(OH_encoder.transform(imputed_X_test))

# One-hot encoding removed index; put it back
OH_cols.index = imputed_X.index
OH_cols_test.index = imputed_X_test.index

# Add one-hot encoded columns to numerical features
OH_X = pd.concat([OH_cols, imputed_num_X], axis=1)
OH_X_test = pd.concat([OH_cols_test, imputed_num_X_test], axis=1)

In [None]:
# Generate the predictions
test_prediction_2f = gen_prediction(OH_X, y, OH_X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.f'] = test_prediction_2f
print("One-Hot Encoding Submission Saved")

<a id="subsec-2g"></a>
## Part 2.g Intermediate Machine Learning - Pipelines
This section demonstrates the use of pipelines. Pipelines won't necessarily improve our MAE score, but they bundle the preprocessing and modelling steps together, which streamlines our code.
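
To make the bundling idea concrete, here is a minimal, self-contained sketch on a tiny made-up dataset. The names `toy_X`, `toy_y`, `toy_preprocessor` and `toy_pipeline` are hypothetical and not part of the notebook; the structure mirrors the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the housing features
toy_X = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "kind": ["flat", "house", "house", np.nan, "flat", "house"],
})
toy_y = pd.Series([100.0, 180.0, 170.0, 260.0, 120.0, 200.0])

# Numerical columns are imputed; categorical columns are imputed, then one-hot encoded
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["area"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["kind"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# One call imputes, encodes and fits; one call transforms and predicts
toy_pipeline.fit(toy_X, toy_y)
toy_preds = toy_pipeline.predict(toy_X)
print(toy_preds.shape)
```

Compared with the manual approach in Part 2.f, the imputers and encoder never have to be fitted and applied by hand, so there are fewer places to accidentally fit on validation or test data.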

In [None]:
# Define the data that will be used for all tests
y_full = train_data.SalePrice
X_full = train_data.drop(['SalePrice'], axis=1)

# Define the data that will be used for all submission generation
X_test_full = test_data.copy()

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y_full,
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

## Test the effectiveness of pipelines

In [None]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [None]:
# Preprocessing for numerical data
# depending on choice of strategy we can get very different MAE values later
numerical_transformer = SimpleImputer(strategy='mean')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [None]:
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

In [None]:
# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

In [None]:
scores_dict['2.g'] = mean_absolute_error(y_valid, preds)

print("MAE (Pipeline):") 
print(scores_dict['2.g'])

## Generate the output for pipelines

In [None]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols_full = [cname for cname in X_full.columns if
                         X_full[cname].nunique() < 10 and 
                         X_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols_full = [cname for cname in X_full.columns if 
                       X_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols_full = categorical_cols_full + numerical_cols_full
X = X_full[my_cols_full].copy()
X_test = X_test_full[my_cols_full].copy()

In [None]:
# Preprocessing for numerical data
# depending on choice of strategy we can get very different MAE values later
numerical_transformer = SimpleImputer(strategy='mean')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols_full),
        ('cat', categorical_transformer, categorical_cols_full)
    ])

In [None]:
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

In [None]:
# Preprocessing of training data, fit model 
clf.fit(X, y)

# Preprocessing of test data, get predictions
test_prediction_2g = clf.predict(X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.g'] = test_prediction_2g
print("Pipeline Submission Saved")

<a id="subsec-2h"></a>
## Part 2.h Intermediate Machine Learning - Cross-Validation
Cross-validation is a more reliable way to validate a model than a single train/validation split, but it won't produce a strong result without a solid algorithm to start with. For the test portion we still use the full dataset and choose the number of estimators that produces the best cross-validated score for the submission.
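
The cells below call the shared `get_score` function defined in Part 0. As a rough idea of what such a helper can look like, here is a minimal sketch; `sketch_get_score` and the `demo_*` data are hypothetical, and the real shared function may differ (for example in the number of folds).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def sketch_get_score(n_estimators, X, y, cv=3):
    """Average MAE over `cv` folds for a random forest with `n_estimators` trees."""
    pipeline = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("model", RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    # scikit-learn maximizes scores, so MAE is exposed as a negative number
    scores = -1 * cross_val_score(pipeline, X, y, cv=cv,
                                  scoring="neg_mean_absolute_error")
    return scores.mean()

# Hypothetical numeric data with a few missing entries
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "c"])
demo_y = 3 * demo_X["a"] - 2 * demo_X["b"] + rng.normal(scale=0.1, size=60)
demo_X.iloc[::10, 2] = np.nan

demo_results = {n: sketch_get_score(n, demo_X, demo_y) for n in (5, 10, 20)}
print(min(demo_results, key=demo_results.get))
```

Because the imputer sits inside the pipeline, it is refitted on the training folds of every split, so no information from the held-out fold leaks into preprocessing.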

In [None]:
# Define the data that will be used for all tests
y = train_data.SalePrice
X = train_data.drop(['SalePrice'], axis=1)

# Define the data that will be used for all submission generation
X_test = test_data.copy()

## Testing Cross-validation

In [None]:
# Take the column filtering from pipelines
categorical_cols = [cname for cname in X.columns if X[cname].nunique() < 10 and 
                        X[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols

In [None]:
# Filter down to the numerical columns
cv_X = X[numerical_cols].copy()
cv_test = X_test[numerical_cols].copy()

In [None]:
# Generate the results
results={}
for index in range(1, 9):
    results[50*index] = get_score(n_estimators=50*index, X=cv_X, y=y)

In [None]:
# Plot the results to see where the ideal number of n_estimators is
plt.plot(list(results.keys()), list(results.values()))
plt.show()

In [None]:
min_n_ests = min(results, key=results.get)

print("n_estimators with lowest score:")
print(min_n_ests)

In [None]:
scores_dict['2.h'] = min(results.values())

print("MAE (Cross-Validation):") 
print(scores_dict['2.h'])

## Generating Submission Using Cross-validation

In [None]:
# create the pipeline
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='mean')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [None]:
# Define the Model with the n_estimators with lowest score
model = RandomForestRegressor(n_estimators=min_n_ests, random_state=0)

In [None]:
# Preprocessing of training data, fit model 
cv_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
cv_X = X[my_cols].copy()
cv_X_test = X_test[my_cols].copy()
cv_pipeline.fit(cv_X, y)

# Preprocessing of test data, get predictions
test_prediction_2h = cv_pipeline.predict(cv_X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.h'] = test_prediction_2h
print("Cross Validation Submission Saved")

<a id="subsec-2i"></a>
## Part 2.i XGBoost (Gradient Boost)
In all of the previous sections we have been using a random forest. Here we use a different method called gradient boosting, which should yield much better results.
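
To give an intuition for what gradient boosting does, here is a minimal hand-rolled sketch for squared-error loss: each new shallow tree is fitted to the residuals of the current ensemble and added with a small learning rate. This is only an illustration on made-up data (`gb_X`, `gb_y` are hypothetical); XGBoost adds refinements that are not shown here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-dimensional regression problem
rng = np.random.default_rng(0)
gb_X = rng.uniform(-3, 3, size=(200, 1))
gb_y = np.sin(gb_X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1

# Start from a constant prediction (the mean of the target)
gb_pred = np.full_like(gb_y, gb_y.mean())
initial_mse = np.mean((gb_y - gb_pred) ** 2)

gb_trees = []
for _ in range(50):
    # For squared error the residuals are the negative gradient (up to a constant factor)
    residuals = gb_y - gb_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(gb_X, residuals)
    # Take a small step in the direction suggested by the new tree
    gb_pred += learning_rate * tree.predict(gb_X)
    gb_trees.append(tree)

final_mse = np.mean((gb_y - gb_pred) ** 2)
print(initial_mse, final_mse)
```

The training error shrinks as trees are added, which is also why the number of trees and the learning rate need tuning: too many rounds will eventually fit noise.
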
## Shared Code

In [None]:
# Define the data that will be used for all tests
y_full = train_data.SalePrice
X_full = train_data.drop(['SalePrice'], axis=1)

# Define the data that will be used for all submission generation
X_test_full = test_data.copy()

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y_full,
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

## Testing the Effectiveness of Gradient Boost

In [None]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)

In [None]:
# Train model
my_model = XGBRegressor(random_state=0)
my_model.fit(X_train, y_train)
# Predict
prediction_1 = my_model.predict(X_valid)

In [None]:
scores_dict['2.i'] = mean_absolute_error(y_valid, prediction_1)

print("MAE (Gradient Boost):") 
print(scores_dict['2.i'])

## Generating the Submission for Gradient Boost

In [None]:
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X.columns if X[cname].nunique() < 10 and 
                        X[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X = X_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X = pd.get_dummies(X)
X_test = pd.get_dummies(X_test)
X, X_test = X.align(X_test, join='left', axis=1)

In [None]:
my_model = XGBRegressor(random_state=0)
my_model.fit(X, y)
test_prediction_2i = my_model.predict(X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.i'] = test_prediction_2i
print("XGBoost Submission Saved")

<a id="subsec-2j"></a>
## Part 2.j XGBoost (Parameter Tuning)
## Testing Parameter Tuning

In [None]:
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols

X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()


# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)

In [None]:
# Define the model
my_model_2 = XGBRegressor(n_estimators=500, learning_rate=0.05)

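# Fit the model, stopping early once the validation score stops improving
# (newer XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor instead)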
my_model_2.fit(X_train, y_train,
               early_stopping_rounds=5,
               eval_set=[(X_valid, y_valid)],
               verbose=False)
               

# Get predictions
prediction_2 = my_model_2.predict(X_valid)

In [None]:
scores_dict['2.j'] = mean_absolute_error(y_valid, prediction_2)

print("MAE (Parameter Tuning):") 
print(scores_dict['2.j'])

## Generate Submission using Parameter Tuning
Unlike the other methods, this approach requires a training set, a validation set, and a test set, because early stopping monitors the validation set during fitting.
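
The role of the validation set can be illustrated without XGBoost. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in (an assumption made to keep the example version-independent) and made-up data; it tracks the validation MAE after every boosting round and picks the best round, which is the quantity early stopping tries to find while training.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical regression data
rng = np.random.default_rng(0)
es_X = rng.normal(size=(300, 4))
es_y = es_X[:, 0] ** 2 + es_X[:, 1] + rng.normal(scale=0.5, size=300)

es_X_train, es_X_valid, es_y_train, es_y_valid = train_test_split(
    es_X, es_y, train_size=0.8, test_size=0.2, random_state=0)

es_model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                     random_state=0)
es_model.fit(es_X_train, es_y_train)

# Validation MAE after each boosting round
valid_mae = [mean_absolute_error(es_y_valid, stage_pred)
             for stage_pred in es_model.staged_predict(es_X_valid)]

# The round with the lowest validation error is where we would want to stop
best_round = int(np.argmin(valid_mae)) + 1
print(best_round, valid_mae[best_round - 1])
```

Since the validation set is used to choose the stopping point, it no longer gives an unbiased error estimate, which is why a separate test set is needed for the final predictions.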

In [None]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()


# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)

In [None]:
# Define the model
my_model_3 = XGBRegressor(n_estimators=500, learning_rate=0.05)

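# Fit the model, stopping early once the validation score stops improving
# (newer XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor instead)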
my_model_3.fit(X_train, y_train,
               early_stopping_rounds=5,
               eval_set=[(X_valid, y_valid)],
               verbose=False)

test_prediction_2j = my_model_3.predict(X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['2.j'] = test_prediction_2j
print("Parameter Tuning Submission Saved")

<a id="sec-3"></a>
# Part 3. Exploratory Data Analysis
These sections will be devoted to exploring the data and seeing what we can learn about the dataset.

In [None]:
# Set the paths to our data
test_data_path = "../input/house-prices-advanced-regression-techniques/test.csv"
train_data_path = "../input/house-prices-advanced-regression-techniques/train.csv"
sample_data_path = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"

# Define the data
test_data = pd.read_csv(test_data_path, index_col='Id')
train_data = pd.read_csv(train_data_path, index_col='Id')
sample_data = pd.read_csv(sample_data_path)

# Define the plot style used
sns.set_style("darkgrid")

<a id="subsec-3a"></a>
## Part 3.a Analyzing "SalePrice"
The first thing we'll check is how SalePrice changes in relation to other properties. This section and the subsequent section contain techniques and material learned in the [data visualization course](https://www.kaggle.com/learn/data-visualization).

In [None]:
# Look at the sale price using the describe function
train_data['SalePrice'].describe()

In [None]:
# Look at the sale price visually using a histogram
plt.figure(figsize=(16, 8))
sns.distplot(train_data['SalePrice'])

It looks like the SalePrice is strongly peaked at ~150,000 with a longer tail towards higher prices.

In [None]:
# The documentation states that there are potential outliers in comparing
# SalePrice and GrLivArea (Above grade (ground) living area square feet)
# so let's see if that is true

plt.figure(figsize=(16,8))
sns.scatterplot(x=train_data['GrLivArea'], y=train_data['SalePrice'])

In [None]:
# There are some pretty extreme outliers out beyond GrLivArea > 4000 with SalePrice < 200000
# Let's draw trend lines with these points and without to see their effect

plt.figure(figsize=(16,8))
sns.regplot(x=train_data['GrLivArea'], y=train_data['SalePrice'])

In [None]:
# Removing those two data points
high_GrLivArea = np.where(train_data['GrLivArea'] > 4000)[0]
low_SalePrice = np.where(train_data['SalePrice'] < 200000)[0]

print("high GrLivArea points: ", high_GrLivArea)
print("\nlow SalePrice points: ", low_SalePrice)

outlier_inds = list(set(high_GrLivArea) & set(low_SalePrice))
outlier_inds.sort()

print("\noutliers: ", outlier_inds)

shortened_train_data = train_data.drop(train_data.index[outlier_inds])

plt.figure(figsize=(16,8))
sns.regplot(x=shortened_train_data['GrLivArea'], y=shortened_train_data['SalePrice'])

Removing the two outliers has reduced the spread of our trend, which suggests that properly handling outliers should be an important step in the analysis. Note that removing outliers is not always safe and should be done with caution. A safer option moving forward would be to make the machine learning model more robust to outliers. Unfortunately, I have not learned this skill yet from the courses, so we will not be applying these techniques yet.
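
As a side note, the outlier selection above can also be expressed with a boolean mask instead of `np.where` and a set intersection. A minimal sketch on a made-up miniature of the data (`mini_train` is hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the training data, indexed by Id
mini_train = pd.DataFrame(
    {"GrLivArea": [1500, 4700, 2000, 5600, 4300],
     "SalePrice": [180000, 185000, 250000, 160000, 745000]},
    index=pd.Index([1, 2, 3, 4, 5], name="Id"),
)

# True for large houses that sold cheaply
is_outlier = (mini_train["GrLivArea"] > 4000) & (mini_train["SalePrice"] < 200000)

# Keep every row that is not flagged
mini_shortened = mini_train[~is_outlier]
print(mini_shortened.index.tolist())  # → [1, 3, 5]
```

The mask works on index labels directly, so there is no need to translate between positional indices and `Id` values.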

<a id="subsec-3b"></a>
## Part 3.b Heatmaps of the Data
There are too many columns of data to compare individually to SalePrice. Here we use a heatmap to see how each property correlates with SalePrice.

In [None]:
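# need to generate a correlation matrix with our data
# (restricted to numeric columns, since correlation is undefined for strings)
correlation_matrix = train_data.select_dtypes(include='number').corr()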

plt.figure(figsize=(16,16))
sns.heatmap(correlation_matrix, square=True)

The bottom row is SalePrice, and we can see which properties seem to correlate most strongly with it:
1. OverallQual
2. GrLivArea

These two properties seem to be the most strongly correlated, but we can adjust the heatmap to find the ones that are most relevant.

In [None]:
plt.figure(figsize=(8,8))

# look for the top 10 properties, need to pass number of properties we're
# interested + 1 because SalePrice has a 1:1 correlation with itself
num_variables = 11 

top_cols = correlation_matrix.nlargest(num_variables, 'SalePrice')['SalePrice'].index
short_cm = np.corrcoef(train_data[top_cols].values.T)

sns.heatmap(short_cm, annot=True, yticklabels=top_cols.values, xticklabels=top_cols.values)

Doing this we can see in descending order the most important properties are:
1. OverallQual: Rates the overall material and finish of the house
2. GrLivArea: Above grade (ground) living area in square feet
3. GarageCars: Size of garage in car capacity
4. GarageArea: Size of garage in square feet
5. TotalBsmtSF: Total square feet of basement area
6. 1stFlrSF: First Floor square feet
7. FullBath: Full bathrooms above grade
8. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
9. YearBuilt: Original construction date
10. YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)


From this data we can draw some comparisons between the properties:

* GarageCars and GarageArea are effectively describing the same thing.
* TotRmsAbvGrd and GrLivArea are similar and also describe the total space above ground
* There is strong correlation between 1stFlrSF and TotalBsmtSF likely suggesting that if you have a large basement then you'll also have a large ground floor.
* There is effectively no correlation between when the house was built and the total square ft of the home.
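
Pairs like these can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch; `top_correlated_pairs` and the `pairs_demo` data are hypothetical and only imitate the GarageCars/GarageArea relationship.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.stack().dropna().sort_values(ascending=False).head(n)

# Hypothetical data: garage area grows with garage capacity, year is unrelated
rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=100)
pairs_demo = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(scale=20, size=100),
    "YearBuilt": rng.integers(1900, 2010, size=100),
})

print(top_correlated_pairs(pairs_demo, n=1))
```

Applied to the real training data, a list like this would be one way to spot near-duplicate features before deciding which of them to keep.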

<a id="subsec-3c"></a>
## Part 3.c Missing Data Analysis
The [pandas course](https://www.kaggle.com/learn/pandas) on Kaggle is a good introduction to the package but does not offer many additional tools to analyze the housing prices. One technique taught in the course is filtering and grouping data by properties, so in this section we will take a look at columns that contain missing values. Like the visualization sections, this analysis should ideally be done prior to modelling, but we will do it here instead. Determining where data is missing is important for the analysis. The questions we would like to answer when looking for missing data are the following:
1. Is the missing data structured or is it random?
2. How much data is missing and is that data relevant to our analysis?

If the missing data is structured then there can be additional insight that can be gained in analyzing the missing data and why it is missing.
If the missing data doesn't matter to our analysis then removing it from our model can be a perfectly fine approach but if it plays a significant role in analysis then our model must be robust in dealing with the missing rows.
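
Both questions can be approached with a boolean missingness table. Here is a minimal sketch on a made-up miniature dataset (`miss_demo` is hypothetical), in which homes without a garage lack every garage attribute:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature: rows without a garage miss both garage columns
miss_demo = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan, "Attchd"],
    "GarageQual": ["TA", np.nan, "Gd", np.nan, "TA"],
    "LotFrontage": [65.0, 80.0, np.nan, 70.0, 60.0],
})

is_missing = miss_demo.isnull()

# Question 2: how much is missing per column, as a percentage
missing_pct = is_missing.mean() * 100

# Question 1: columns whose missing rows coincide exactly point to structure
same_rows = bool((is_missing["GarageType"] == is_missing["GarageQual"]).all())
print(missing_pct.round(1).to_dict(), same_rows)
```

When two columns are missing in exactly the same rows, the gaps most likely encode a fact about the house (no garage) rather than a recording error.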

In [None]:
# Set the paths to our data
test_data_path = "../input/house-prices-advanced-regression-techniques/test.csv"
train_data_path = "../input/house-prices-advanced-regression-techniques/train.csv"
sample_data_path = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"

# Define the data
test_data = pd.read_csv(test_data_path, index_col='Id')
train_data = pd.read_csv(train_data_path, index_col='Id')
sample_data = pd.read_csv(sample_data_path)

In [None]:
# Looking for missing data
# Count the number of null values in our dataset

count_nulls = train_data.isnull().sum().sort_values(ascending=False)
print(count_nulls)

In [None]:
# Let's drop the properties where the count is 0
non_zero_counts = count_nulls != 0
nz_count_nulls = count_nulls[non_zero_counts]

In [None]:
# Calculate the percentage of the data that is missing

percentage_nulls = (train_data.isnull().sum()/train_data.isnull().count()) * 100
sorted_percentages = percentage_nulls.sort_values(ascending=False)
nz_sorted_percentage = sorted_percentages[non_zero_counts]

In [None]:
# Put the counts and percentages together
missing_data = pd.concat([nz_count_nulls, nz_sorted_percentage], keys=['Count', 'Percentage'], axis=1)
missing_data

Looking at the output above, four properties have over 50% of their data missing and two other properties have over 10% of their data missing. I'm not currently sure what we should do with these properties, but I assume I'll learn in the "feature engineering" course. With over 90% of the data missing in PoolQC, MiscFeature and Alley, I suspect that outright dropping these columns from the analysis would be reasonable.
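
Dropping columns above a missingness threshold could look like the following minimal sketch. The helper `drop_sparse_columns`, the 90% default and the `drop_demo` data are assumptions for illustration, not something the notebook applies.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, max_missing_pct=90.0):
    """Drop columns whose share of missing values exceeds max_missing_pct."""
    missing_pct = frame.isnull().mean() * 100
    sparse_cols = missing_pct[missing_pct > max_missing_pct].index
    return frame.drop(columns=sparse_cols)

# Hypothetical data with one almost empty column
drop_demo = pd.DataFrame({
    "PoolQC": [np.nan] * 19 + ["Gd"],            # 95% missing
    "LotFrontage": [np.nan] * 4 + [70.0] * 16,   # 20% missing
    "LotArea": list(range(20)),                  # complete
})

kept = drop_sparse_columns(drop_demo)
print(kept.columns.tolist())  # → ['LotFrontage', 'LotArea']
```

The same list of dropped columns would have to be applied to the test data so that both frames keep identical features.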

The missing data analysis also shows that the missing values of certain properties are correlated.
* GarageType, GarageCond, GarageFinish, GarageQual and GarageYrBlt all appear to have the same number of values and are likely missing properties of the same garages. 
* BsmtFinType2 and BsmtExposure appear to be correlated similar to the Garage properties.
* BsmtQual, BsmtCond and BsmtFinType1 are likely correlated as well and are only one count off from the other two basement properties.
* MasVnrArea and MasVnrType are likely correlated.
* There is only one missing electrical value.

As for how to handle this data, we will revisit it later on when we learn more about feature engineering, but naively I would remove any column that is missing over 10% of its data. Following that, if there is a property in the [heatmap](#subsec-3b) which can effectively replace any of the columns missing data here, I would also remove that column. One column I would avoid removing is "Electrical": there is only one missing value, and it seems more reasonable to just ignore that row instead of removing the entire column.

<a id="sec-4"></a>
# Part 4. Intro to Deep Learning
* These sections will focus on the introductory material in the [deep learning course](https://www.kaggle.com/learn/intro-to-deep-learning).
* This section will be more akin to the [Intro to Machine Learning](#sec-1) than [Intermediate Machine Learning](#sec-2) section, so we will produce models that don't quite match up with the topics in the course.

<a id="subsec-4a"></a>
## Part 4.a Initializing the Data
Load the data and define any setup functions we need.

In [None]:
# Set the paths to our data
test_data_path = "../input/house-prices-advanced-regression-techniques/test.csv"
train_data_path = "../input/house-prices-advanced-regression-techniques/train.csv"
sample_data_path = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"

# Define the data
test_data = pd.read_csv(test_data_path)
train_data = pd.read_csv(train_data_path)
sample_data = pd.read_csv(sample_data_path)

In [None]:
# Make copies of the data so we can recall the complete data if necessary
full_test_dl = test_data.copy()
full_train_dl = train_data.copy()
full_combined_dl = pd.concat([full_train_dl, full_test_dl], ignore_index=True)

dtypes = {
    'MSSubClass': str,
}

for col_, type_ in dtypes.items():
    full_combined_dl[col_] = full_combined_dl[col_].astype(type_)

In [None]:
print ('Full Test set:', full_test_dl.shape)
print ('Full Train set:', full_train_dl.shape)
print ('Full Combined set:', full_combined_dl.shape)

<a id="subsec-4b"></a>
## Part 4.b Simple EDA
We'll just output a summary of our data and look at a graphical output of which variables seem most correlated to sales price. This is largely a reminder of what was covered in [section 3](#sec-3).

In [None]:
# summary of the data frame information
full_combined_dl.info()

In [None]:
# Unlike what was done in section 3.b with the heatmap, we'll use a correlation barplot instead.
# Only numeric columns can be correlated, so select them first
corrmat = full_combined_dl.select_dtypes(include='number').corr()

# Plot the barplots
plt.figure(figsize=(10, 17))
sns.barplot(y=corrmat['SalePrice'].sort_values().index, x=corrmat['SalePrice'].sort_values().values)
plt.xlabel('Correlation with SalePrice')
plt.show()

Again, similar to what was found in [section 3.b](#subsec-3b), we find that some properties clearly correlate more strongly with SalePrice. For a more in-depth explanation see [section 3](#sec-3).

<a id="subsec-4c"></a>
## Part 4.c Dealing With Missing Data
We can see from the above EDA that there are some columns with significant amounts of data missing. For example, PoolQC is missing data in almost every row, so this needs to be dealt with. One method of dealing with missing data is to simply drop any columns or rows with missing values. If we were to drop any rows with NaN we would quickly reduce our dataset to a handful of rows making this a terrible option. Since we have covered imputation in [section 2](#sec-2), we will fill in the NaN with either 'None' if the column is an object or the median value in numerical categories.

In [None]:
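for cname in full_combined_dl.columns:
    if full_combined_dl[cname].dtype == 'object':
        # Assign the result back; in-place fillna on a selected column may not update the frame
        full_combined_dl[cname] = full_combined_dl[cname].fillna('None')
    else:
        full_combined_dl[cname] = full_combined_dl[cname].fillna(full_combined_dl[cname].median())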

In [None]:
# Check to make sure we've replaced all of the NaNs
full_combined_dl.isnull().sum().max()

Now we need to apply One Hot Encoding to convert the categorical features into numerical ones

In [None]:
# Select the categorical columns
features_cat = [cname for cname in full_combined_dl.columns if
                full_combined_dl[cname].dtype == "object"]

In [None]:
# Generate the dummies for the categorical columns
combined_dl_OH = full_combined_dl.join(pd.get_dummies(full_combined_dl[features_cat]))

print ('no dummies set:', full_combined_dl.shape)
print ('dummies set:', combined_dl_OH.shape)

In [None]:
# Filter out the categorical columns
numerical_features = [cname for cname in combined_dl_OH.columns if
                      combined_dl_OH[cname].dtype != "object"]

In [None]:
# Remove the Id number and the SalePrice target so neither is used as a feature
numerical_features.remove('Id')
numerical_features.remove('SalePrice')

In [None]:
numerical_features

<a id="subsec-4d"></a>
## Part 4.d Setting up the Training and Testing Data
Now that we've dealt with the NaNs we can split the data into training and testing data

In [None]:
# Copy the entire combined deep learning dataframe
training_dl = combined_dl_OH.copy()

# Filter the entire dataframe and only keep the rows with index matching the original training set
training_dl = training_dl[training_dl.Id.isin(full_train_dl.Id)]

In [None]:
# repeat the above with test data
testing_dl = combined_dl_OH.copy()

# Filter the entire dataframe and only keep the rows with index matching the original test set
testing_dl = testing_dl[testing_dl.Id.isin(full_test_dl.Id)]

In [None]:
print ('combined_dl_OH:', combined_dl_OH.shape)
print ('Training_dl:', training_dl.shape)
print ('Testing_dl:', testing_dl.shape)

In [None]:
# split the training data into properties (X) and target (y)
training_dl_X = training_dl[numerical_features]
training_dl_y = training_dl['SalePrice']

testing_dl_X = testing_dl[numerical_features]

In [None]:
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(training_dl_X, training_dl_y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

print ('Train set:', X_train.shape,  y_train.shape)
print ('Valid set:', X_valid.shape,  y_valid.shape)

In [None]:
# Normalize the data
scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(testing_dl_X)


In [None]:
print ('Train X:', X_train.shape)
print ('Valid X:', X_valid.shape)
print ('Test X:', X_test.shape)

<a id="subsec-4e"></a>
## Part 4.e Creating and Training a Model
This is the section where we will generate and train the deep learning model

In [None]:
# Clear out the backend to make sure things aren't affected by other models run
tf.keras.backend.clear_session()

In [None]:
# Determine the shape of our input into the model
input_shape = [X_train.shape[1]]
print("Input shape: {}".format(input_shape))

In [None]:
# Create the simple model
model = tf.keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=input_shape),
    layers.Dense(64, activation='relu'),    
    layers.Dense(1, activation='linear')
])

In [None]:
# Compile the model with a simple optimizer and keep track of the mean errors
model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mse', 'mae']
)

# Set an early stopping condition so we don't overfit
early_stopping = EarlyStopping(
    monitor='val_mae',
    patience=25, # how many epochs to wait before stopping
    restore_best_weights=True,
)

In [None]:
# Fit the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=1000,
    callbacks=[early_stopping],
    verbose=0  # reduce the output so we don't flood the notebook
)

In [None]:
# Evaluate our simple model
model.evaluate(X_valid,y_valid)

In [None]:
# Plot the mean absolute error
history_df = pd.DataFrame(history.history)

plt.figure(figsize=(8, 6))
plt.plot(history.history['mae'], label='Training MAE')
plt.plot(history.history['val_mae'], label='Validation MAE')
plt.ylabel('MAE')
plt.xlabel('Epoch Number')
plt.legend()
plt.show()

# We can see that initially it decreases very quickly for the first ~100 epochs then slows down
# realistically the 1000 epochs does help but not to the degree we might want

In [None]:
# Plotting the mean squared error

plt.figure(figsize=(8, 6))
plt.plot(history.history['mse'], label='Training MSE')
plt.plot(history.history['val_mse'], label='Validation MSE')
plt.ylabel('MSE')
plt.xlabel('Epoch Number')
plt.legend()
plt.show()

# Again a very similar figure as the mean absolute error

In [None]:
# Calculate the predicted y values to compare to the actual results

yhat_valid = model.predict(X_valid)
yhat_train = model.predict(X_train)

In [None]:
# Generate a plot of the fitted values compared to the actual values in our training data

plt.figure(figsize=(8, 6))

ax1 = sns.distplot(y_train, hist=False, color="r", label="Actual Value")
sns.distplot(yhat_train, hist=False, color="b", label="Fitted Values" , ax=ax1)

plt.title('Actual vs Fitted Values for Price')

plt.show()

# The general shape is similar but the peak is a bit off and there are
# more differences as we move to higher prices 

In [None]:
# Generate a plot of the fitted values compared to the actual values in our validation data

plt.figure(figsize=(8, 6))

ax2 = sns.distplot(y_valid, hist=False, color="r", label="Actual Value")
sns.distplot(yhat_valid, hist=False, color="b", label="Fitted Values" , ax=ax2)

plt.title('Actual vs Fitted Values for Price')

plt.show()

# Again, the general shape is similar but there are obvious differences

<a id="subsec-4f"></a>
## Part 4.f Generate The Submission

In [None]:
scores_dict['4.f'] = min(history.history['val_mae'])

print("MAE (Simple Deep Learning):") 
print(scores_dict['4.f'])

In [None]:
# Get predictions
prediction_4f = model.predict(X_test)

In [None]:
# Save the prediction to our dictionary
submission_dict['4.f'] = prediction_4f.flatten()
print("Simple Deep Learning Model Submission Saved")

<a id="sec-5"></a>
# Part 5. Feature Engineering


Following the order of the courses offered on Kaggle, we will use some of the techniques learned in [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) to hopefully improve the MAE of our models. Similar to what was done in the course, we will use a lightGBM model as the baseline.

<a id="subsec-5a"></a>
## Part 5.a Baseline lightGBM
This section will make a prediction using only the lightGBM model without any feature engineering to give us a baseline.

In [None]:
# Set the paths to our data
test_data_path = "../input/house-prices-advanced-regression-techniques/test.csv"
train_data_path = "../input/house-prices-advanced-regression-techniques/train.csv"
sample_data_path = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"

# Define the data
test_data = pd.read_csv(test_data_path)
train_data = pd.read_csv(train_data_path)
sample_data = pd.read_csv(sample_data_path)

In [None]:
# Combine the test and training data (test rows first) so transforms can be applied to both at once
combined_data = pd.concat([test_data, train_data])
combined_data.head()

In [None]:
# Apply one-hot encoding to the categorical data
categorical_columns = [column for column in combined_data.columns
                       if combined_data[column].dtype == "object"]
categorical_data = pd.get_dummies(combined_data.loc[:, categorical_columns],
                                  drop_first=True)
categorical_data.head()

In [None]:
# Baseline model will not transform numerical data
numerical_columns = [column for column in combined_data.columns
                     if combined_data[column].dtype != "object"]
numerical_data = combined_data[numerical_columns].drop("Id", axis=1)
numerical_data.head()

In [None]:
# Combine the transformed data back together
baseline_data = pd.concat([combined_data['Id'], categorical_data, numerical_data],
                          axis=1)
baseline_data.head()

In [None]:
# Split the data into training and testing sets again
base_test_data = baseline_data[:test_data.shape[0]]
base_train_data = baseline_data[test_data.shape[0]:]

print ("Shape of base_train_data: {}".format(base_train_data.shape))
print ("Shape of base_test_data: {}".format(base_test_data.shape))

In [None]:
# Define the data that will be used for all tests
y_train_full_base = base_train_data.SalePrice
X_train_full_base = base_train_data.drop(['SalePrice', 'Id'], axis=1)
test_full_base = base_test_data.drop(['SalePrice', 'Id'], axis=1)

In [None]:
print ("Shape of X_train_full: {}".format(X_train_full_base.shape))
print ("Shape of test_full: {}".format(test_full_base.shape))

## Testing the Baseline lightGBM model

In [None]:
# Break off validation set from training data
X_train_base, X_valid_base, y_train_base, y_valid_base = train_test_split(X_train_full_base, y_train_full_base,
                                                                          train_size=0.8, test_size=0.2,
                                                                          random_state=0)

In [None]:
lgb_train_base = lgb.Dataset(X_train_base, y_train_base)
params = {'objective': 'regression',
          'metric': {'rmse'}}
gbm_base = lgb.train(params, lgb_train_base)
prediction_gbm_base = gbm_base.predict(X_valid_base)

In [None]:
scores_dict['5.a'] = mean_absolute_error(y_valid_base, prediction_gbm_base)

print("Feature Engineering - GBM (Baseline):") 
print(scores_dict['5.a'])

## Generate Submission using Baseline lightGBM model

In [None]:
lgb_train_full_base = lgb.Dataset(X_train_full_base, y_train_full_base)
gbm_full_base = lgb.train(params, lgb_train_full_base)
prediction_5a = gbm_full_base.predict(test_full_base)

In [None]:
# Save the prediction to our dictionary
submission_dict['5.a'] = prediction_5a

<a id="subsec-5b"></a>
## Part 5.b Simple Numerical Transforms - Logarithm
Most of the basic feature engineering techniques taught in the course are categorical encoding methods, which resulted in very small changes in the accuracy of the prediction. Since we already used one-hot encoding to produce our baseline, we won't rerun the model with a large number of different encoding methods. Instead we will look at other simple techniques for transforming the data. One of them is to transform numerical features to constrain outliers. Because we are using a tree-based model, these numerical transformations are unlikely to change our predictions very much, but it is worth checking whether this is true. In this subsection we test whether taking the logarithm of the numerical features significantly changes the predictions.
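
The reason a tree-based model is largely insensitive to such feature transforms is that `log1p` is monotonic, so it preserves the ordering of values and therefore the way a threshold split partitions the rows. A minimal sketch with made-up values (the `areas` list and `threshold` are hypothetical, not taken from the dataset):

```python
import math

areas = [850, 1200, 1500, 2100, 4600]  # made-up feature values
threshold = 1500                       # a hypothetical split "area <= 1500"

# Which rows go to the left branch on the raw feature
left_raw = [a <= threshold for a in areas]

# The same partition exists on the transformed feature,
# using the transformed threshold
left_log = [math.log1p(a) <= math.log1p(threshold) for a in areas]

print(left_raw == left_log)
```

This argument only covers the input features. Here the target `SalePrice` is part of the numerical data and gets transformed too, which changes the errors the model minimises, so some difference in MAE can still be expected.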

In [None]:
linear_data = numerical_data.copy()
linear_data.head()

In [None]:
linear_skew = linear_data.apply(lambda x: skew(x.dropna())).sort_values()
linear_skew.plot.barh(figsize=(12,8), title="Skewness of Untransformed Data")
plt.show()

In [None]:
print("total skewness of unmodified data:", sum(abs(linear_skew)))

Quite a few features have significant skewness. We will try taking the natural log and the square root of the features to see whether the skewness decreases.
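
To make the quantity being compared concrete, here is a dependency-free sketch of sample skewness (third central moment divided by the second central moment to the power 3/2), applied to a small made-up right-skewed sample. The helper name `sample_skewness` and the values are illustrative assumptions, not part of the notebook's pipeline.

```python
import math


def sample_skewness(values):
    """Biased sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up right-skewed sample: mostly small values plus a few large ones
sample = [1, 2, 2, 3, 3, 3, 4, 10, 50]

raw = sample_skewness(sample)
logged = sample_skewness([math.log1p(v) for v in sample])
rooted = sample_skewness([math.sqrt(v) for v in sample])

print("raw:", raw)
print("log1p:", logged)
print("sqrt:", rooted)
```

For this sample both transforms pull the long right tail in, and the log does so more strongly than the square root. Whether the same holds for every feature in the dataset is what the following cells check.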

In [None]:
# we take ln(1 + value) to ensure we don't run into issues where the value = 0
ln1p_data = np.log1p(numerical_data.copy())
ln1p_data.head()

In [None]:
ln1p_skew = ln1p_data.apply(lambda x: skew(x.dropna())).sort_values()
ln1p_skew.plot.barh(figsize=(12,8), title="Skewness of ln(1 + Data)")
plt.show()

In [None]:
print("total skewness of ln(1 + data):", sum(abs(ln1p_skew)))

Numerically, the total skewness of the data has decreased. The base of the logarithm should not affect the result, since changing the base only rescales every value by the same constant factor. Next we check whether taking the square root of the values is more effective.
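
As a short justification, assume skewness is computed as $g_1 = m_3 / m_2^{3/2}$, where $m_2$ and $m_3$ are the second and third central moments. A logarithm in base $b > 1$ is the natural log divided by a positive constant:

$$\log_b(1 + x) = \frac{\ln(1 + x)}{\ln b}$$

Rescaling a variable by a constant $c > 0$ multiplies $m_3$ by $c^3$ and $m_2$ by $c^2$, so the factor cancels:

$$g_1(cX) = \frac{c^3 m_3}{\left(c^2 m_2\right)^{3/2}} = \frac{m_3}{m_2^{3/2}} = g_1(X)$$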

In [None]:
# we take sqrt(value)
sqrt_data = np.sqrt(numerical_data.copy())
sqrt_data.head()

In [None]:
sqrt_skew = sqrt_data.apply(lambda x: skew(x.dropna())).sort_values()
sqrt_skew.plot.barh(figsize=(12,8), title="Skewness of sqrt(Data)")
plt.show()

In [None]:
print("total skewness of sqrt(data):", sum(abs(sqrt_skew)))

Superimposing all three sets of skewness values on one plot lets us compare the transforms visually.

In [None]:
combined_skew = abs(pd.concat([linear_skew, ln1p_skew, sqrt_skew], axis=1)).rename(columns={0:'unscaled', 1:'natural log', 2:'sqrt'})
combined_skew.plot.barh(figsize=(20,16), title="Skewness of Data Using Different Transforms", width=0.8)
plt.show()

Based purely on this simple skewness check, the natural log performs slightly better, but we will compute MAE values with both numerical transforms to determine whether either one produces better predictions.

In [None]:
combined_ln1p_data = pd.concat([combined_data['Id'], categorical_data, ln1p_data],
                                axis=1)
combined_sqrt_data = pd.concat([combined_data['Id'], categorical_data, sqrt_data],
                                axis=1)

## Testing Numerical Transform - Logarithm

In [None]:
# Split the data into training and testing sets again
ln1p_test_data = combined_ln1p_data[:test_data.shape[0]]
ln1p_train_data = combined_ln1p_data[test_data.shape[0]:]

In [None]:
# Define the data that will be used for all tests
y_train_full_ln1p = ln1p_train_data.SalePrice
X_train_full_ln1p = ln1p_train_data.drop(['SalePrice', 'Id'], axis=1)
test_full_ln1p = ln1p_test_data.drop(['SalePrice', 'Id'], axis=1)

In [None]:
# Break off validation set from training data
X_train_ln1p, X_valid_ln1p, y_train_ln1p, y_valid_ln1p = train_test_split(X_train_full_ln1p, y_train_full_ln1p,
                                                                          train_size=0.8, test_size=0.2,
                                                                          random_state=0)

In [None]:
lgb_train_ln1p = lgb.Dataset(X_train_ln1p, y_train_ln1p)
params = {'objective': 'regression',
          'metric': {'rmse'}}
gbm_ln1p = lgb.train(params, lgb_train_ln1p)
prediction_gbm_ln1p = gbm_ln1p.predict(X_valid_ln1p)

In [None]:
scores_dict['5.b'] = mean_absolute_error(np.expm1(y_valid_ln1p),
                                         np.expm1(prediction_gbm_ln1p))

print("Feature Engineering - GBM (Log Transform):") 
print(scores_dict['5.b'])
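
Because the model was trained on `log1p(SalePrice)`, its predictions have to be mapped back with `expm1` before the MAE is comparable to the other sections. The sketch below uses made-up prices and a pretend model that is off by a constant 0.1 in log space; the helper `mean_absolute_error_plain` is a stand-in written for this example.

```python
import math


def mean_absolute_error_plain(actual, predicted):
    """Plain-Python MAE, used here only for illustration."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


prices = [100000.0, 200000.0, 400000.0]           # made-up sale prices
log_prices = [math.log1p(p) for p in prices]      # what the model is trained on
log_preds = [lp + 0.1 for lp in log_prices]       # pretend predictions in log space

# Error measured in log space: small and not in dollars
mae_log_space = mean_absolute_error_plain(log_prices, log_preds)

# Error measured after mapping predictions back to the original scale
mae_dollars = mean_absolute_error_plain(
    prices, [math.expm1(lp) for lp in log_preds]
)

print(mae_log_space, mae_dollars)
```

A constant error in log space becomes a roughly proportional error in dollars, so expensive houses contribute more to the back-transformed MAE.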

## Generate Submission using Log Transform

In [None]:
lgb_train_full_ln1p = lgb.Dataset(X_train_full_ln1p, y_train_full_ln1p)
gbm_full_ln1p = lgb.train(params, lgb_train_full_ln1p)
prediction_5b = gbm_full_ln1p.predict(test_full_ln1p)

In [None]:
# Save the prediction to our dictionary
submission_dict['5.b'] = np.expm1(prediction_5b)

<a id="subsec-5c"></a>
## Part 5.c Simple Numerical Transforms - Square Root
Most of the preliminary exploratory analysis is shown in the above [section](#subsec-5b). This section will simply calculate the MAE using lightGBM with a square root transform applied to the data.

## Testing Numerical Transform - Square Root

In [None]:
# Split the data into training and testing sets again
sqrt_test_data = combined_sqrt_data[:test_data.shape[0]]
sqrt_train_data = combined_sqrt_data[test_data.shape[0]:]

In [None]:
# Define the data that will be used for all tests
y_train_full_sqrt = sqrt_train_data.SalePrice
X_train_full_sqrt = sqrt_train_data.drop(['SalePrice', 'Id'], axis=1)
test_full_sqrt = sqrt_test_data.drop(['SalePrice', 'Id'], axis=1)

In [None]:
# Break off validation set from training data
X_train_sqrt, X_valid_sqrt, y_train_sqrt, y_valid_sqrt = train_test_split(X_train_full_sqrt, y_train_full_sqrt,
                                                                          train_size=0.8, test_size=0.2,
                                                                          random_state=0)

In [None]:
lgb_train_sqrt = lgb.Dataset(X_train_sqrt, y_train_sqrt)
params = {'objective': 'regression',
          'metric': {'rmse'}}
gbm_sqrt = lgb.train(params, lgb_train_sqrt)
prediction_gbm_sqrt = gbm_sqrt.predict(X_valid_sqrt)

In [None]:
scores_dict['5.c'] = mean_absolute_error(np.square(y_valid_sqrt),
                                         np.square(prediction_gbm_sqrt))

print("Feature Engineering - GBM (Square Root Transform):") 
print(scores_dict['5.c'])

## Generating Submission for Square Root Transform

In [None]:
lgb_train_full_sqrt = lgb.Dataset(X_train_full_sqrt, y_train_full_sqrt)
gbm_full_sqrt = lgb.train(params, lgb_train_full_sqrt)
prediction_5c = gbm_full_sqrt.predict(test_full_sqrt)

In [None]:
# Save the prediction to our dictionary
submission_dict['5.c'] = np.square(prediction_5c)

<a id="subsec-5d"></a>
## Part 5.d Complete Feature Engineering
The previous two subsections showed that applying a simple numerical transform like a square root or logarithm can reduce our MAE. In this section we will handle outliers and missing data to improve the score.

### Outliers
From [section 3](#sec-3) we already know there are outliers and portions of the data that are missing information. Similar to what was done there we will first remove any clear outliers.

```
# Set the paths to our data
test_data_path = "../input/house-prices-advanced-regression-techniques/test.csv"
train_data_path = "../input/house-prices-advanced-regression-techniques/train.csv"
sample_data_path = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"

# Define the data
test_data = pd.read_csv(test_data_path)
train_data = pd.read_csv(train_data_path)
sample_data = pd.read_csv(sample_data_path)

```

In [None]:
plt.figure(figsize=(16,8))
sns.regplot(x=train_data['GrLivArea'], y=train_data['SalePrice'])
plt.show()

In [None]:
# Remove the two points that have a very large GrLivArea but a low SalePrice
high_GrLivArea = np.where(train_data['GrLivArea'] > 4000)[0]
low_SalePrice = np.where(train_data['SalePrice'] < 200000)[0]

print("high GrLivArea points: ", high_GrLivArea)
print("\nlow SalePrice points: ", low_SalePrice)

outlier_inds = list(set(high_GrLivArea) & set(low_SalePrice))
outlier_inds.sort()

print("\noutliers: ", outlier_inds)

shortened_train_data = train_data.drop(train_data.index[outlier_inds])

plt.figure(figsize=(16,8))
sns.regplot(x=shortened_train_data['GrLivArea'], y=shortened_train_data['SalePrice'])
plt.show()

## Reapplying Log Transform
Now that we've removed the outliers, let's reapply the log transformation to the sale price to normalize it.

In [None]:
train_data_NO = shortened_train_data.copy()
train_data_NO["SalePrice"] = np.log1p(train_data_NO["SalePrice"])
# Take the target from the outlier-free, log-transformed frame so it lines up with the features
train_y_NO = train_data_NO["SalePrice"]

In [None]:
combined_NO_data = pd.concat((train_data_NO, test_data)).reset_index(drop=True)
combined_NO_data.drop(['SalePrice'], axis=1, inplace=True)
combined_NO_data.index = combined_NO_data['Id']
combined_NO_data.drop(["Id"], axis=1, inplace=True)
print("combined_NO_data size is : {}".format(combined_NO_data.shape))

In [None]:
combined_NO_data.head()

## Missing Data
Now that we've removed the outliers, let's look at the features with missing data.

In [None]:
# Looking for missing data
# Count the number of null values in our dataset

count_nulls = combined_NO_data.isnull().sum().sort_values(ascending=False)

# Let's drop the properties where the count is 0
non_zero_counts = count_nulls != 0
nz_count_nulls = count_nulls[non_zero_counts]
print(nz_count_nulls)

Again, similar to what was done in [section 3](#sec-3) we find that there are a few features that have a huge number of missing data points. Let's look at the data in a correlation map so we can get a feel for which properties are correlated to gain some insight on how to handle missing data.

In [None]:
# need to generate a correlation matrix with our data
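# only numeric columns can be correlated, so select them explicitly
correlation_matrix = combined_NO_data.select_dtypes(include='number').corr()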

plt.figure(figsize=(16,16))
sns.heatmap(correlation_matrix, square=True)
plt.show()

The various "garage" features are somewhat correlated, along with the total square footage of the house with the 1st and 2nd floor area. Let's look at the properties and decide on how to fill in the missing data. This section should basically be called 'read the documentation'. As a reminder here is the data:

```
PoolQC          2908
MiscFeature     2812
Alley           2719
Fence           2346
FireplaceQu     1420
LotFrontage      486
GarageCond       159
GarageQual       159
GarageYrBlt      159
GarageFinish     159
GarageType       157
BsmtCond          82
BsmtExposure      82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
BsmtHalfBath       2
Utilities          2
Functional         2
BsmtFullBath       2
BsmtFinSF2         1
BsmtFinSF1         1
Exterior2nd        1
BsmtUnfSF          1
TotalBsmtSF        1
Exterior1st        1
SaleType           1
Electrical         1
KitchenQual        1
GarageArea         1
GarageCars         1
```

The following features have documentation which states that a value of NA denotes that the property doesn't exist, so let's go ahead and fill all of those NA values with 'None'.


In [None]:
for col in ('PoolQC',
            'MiscFeature',
            'Alley',
            'Fence',
            'FireplaceQu',
            'GarageCond',
            'GarageQual',
            'GarageFinish',
            'GarageType',
            'BsmtCond',
            'BsmtExposure',
            'BsmtQual',
            'BsmtFinType2',
            'BsmtFinType1',
            'MasVnrType'):
    combined_NO_data[col] = combined_NO_data[col].fillna('None')

The following features have documentation which implies that NA values likely correspond to a value of 0, so let's go ahead and fill all of those NA values with 0.


In [None]:
for col in ('GarageYrBlt',
            'GarageArea',
            'GarageCars',
            'MasVnrArea',
            'BsmtHalfBath',
            'BsmtFullBath',
            'BsmtUnfSF',
            'BsmtFinSF1',
            'BsmtFinSF2',
            'TotalBsmtSF'):
    combined_NO_data[col] = combined_NO_data[col].fillna(0)

For the following features we can replace the NA values with the most common value, so we fill the missing data with the mode.


In [None]:
for col in ('MSZoning',
            'Functional',
            'Exterior1st',
            'Exterior2nd',
            'SaleType',
            'Electrical',
            'KitchenQual',
            'Utilities'):
    mode_val = combined_NO_data[col].mode()[0]
    combined_NO_data[col] = combined_NO_data[col].fillna(mode_val)

The only property left, LotFrontage, is the linear feet of street connected to the property. We can approximate it by taking the median value in the same neighbourhood.

In [None]:
combined_NO_data['LotFrontage'] = combined_NO_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
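
The group-wise fill can be sketched without pandas to show what the `groupby(...).transform(...)` call does for each neighbourhood. The rows below are made up for illustration and are not values from the dataset.

```python
from statistics import median

# Hypothetical rows: (neighbourhood, lot frontage or None if missing)
rows = [
    ("NAmes", 70.0), ("NAmes", 80.0), ("NAmes", None),
    ("OldTown", 50.0), ("OldTown", None), ("OldTown", 60.0), ("OldTown", 52.0),
]

# Collect the known values per neighbourhood
known = {}
for hood, frontage in rows:
    if frontage is not None:
        known.setdefault(hood, []).append(frontage)

# One median per neighbourhood, computed from the known values only
medians = {hood: median(values) for hood, values in known.items()}

# Replace each missing value with the median of its own neighbourhood
filled = [
    (hood, frontage if frontage is not None else medians[hood])
    for hood, frontage in rows
]

print(filled)
```

The median is less sensitive than the mean to a few unusually wide lots. Note that a neighbourhood with no known values at all would have no median to fill with, which this sketch does not handle.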

In [None]:
count_nulls = combined_NO_data.isnull().sum().sort_values(ascending=False)

# Let's drop the properties where the count is 0
non_zero_counts = count_nulls != 0
nz_count_nulls = count_nulls[non_zero_counts]
print(nz_count_nulls)

We've filled in all of the missing data. Now let's create some features that might be useful in predicting the sale price of a house.

In [None]:
combined_NO_data['YearsSinceReno'] = combined_NO_data['YrSold'].astype(int) - combined_NO_data['YearRemodAdd'].astype(int)
combined_NO_data['CombinedQual'] = combined_NO_data['OverallQual'] + combined_NO_data['OverallCond']
combined_NO_data['TotalSF'] = combined_NO_data['TotalBsmtSF'] + combined_NO_data['1stFlrSF'] + combined_NO_data['2ndFlrSF']

Now we're going to label encode some features where the ordering actually contains information.

In [None]:
categorical_variables = ('FireplaceQu','BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
                         'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
                         'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
                         'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir')

for variable in categorical_variables:
    label = LabelEncoder()
    label.fit(list(combined_NO_data[variable].values))
    combined_NO_data[variable] = label.transform(list(combined_NO_data[variable].values))
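
One caveat worth illustrating: `LabelEncoder` assigns integer codes to the sorted class labels, so the codes follow alphabetical order rather than quality order. The sketch below mimics that behaviour in plain Python and compares it with an explicit ordinal mapping. The `ordinal` dictionary is an assumption based on the usual quality scale (Po < Fa < TA < Gd < Ex) and is not something the notebook applies.

```python
quality = ["Gd", "TA", "Ex", "Fa", "None", "Po"]

# LabelEncoder-style codes: labels are sorted, then numbered
alphabetical = {label: code for code, label in enumerate(sorted(set(quality)))}

# Hypothetical explicit mapping that follows the quality scale instead
ordinal = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

for label in quality:
    print(label, alphabetical[label], ordinal[label])
```

With alphabetical codes, the best rating ("Ex") gets the lowest code and the middle rating ("TA") the highest. A tree-based model can still separate the categories with several splits, but an explicit mapping passed through `Series.map` would keep the ordering meaningful if that matters for the model.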

Now we're going to reduce the skewness of our data using a log transform. Unlike in [subsection 5.b](#subsec-5b), we'll only transform the properties that are highly skewed.

In [None]:
numerical_variables = [column for column in combined_NO_data.columns
                       if combined_NO_data[column].dtype != "object"]

skewed_properties = combined_NO_data[numerical_variables].apply(lambda x: skew(x.dropna())).sort_values()
skewed_properties.head(5)

In [None]:
abs_skewed_props = abs(skewed_properties).sort_values()
abs_skewed_props.plot.barh(figsize=(12,8), title="Absolute Skewness of Numerical Features")
plt.show()

In [None]:
# Let's transform the properties with large skews; we'll classify any absolute skewness > 0.5 as large
high_skew_props = abs_skewed_props[abs_skewed_props > 0.5]
high_skew_props.plot.barh(figsize=(12,8), title="Features with Absolute Skewness > 0.5")
plt.show()

In [None]:
high_skew_indices = high_skew_props.index
combined_NO_normalized_data = combined_NO_data.copy()
for prop_index in high_skew_indices:
    combined_NO_normalized_data[prop_index] = np.log1p(combined_NO_normalized_data[prop_index])

In [None]:
combined_NO_normalized_data = pd.get_dummies(combined_NO_normalized_data).reset_index(drop=True)
print(combined_NO_data.shape)
print(combined_NO_normalized_data.shape)

In [None]:
combined_NO_normalized_data.head(5)

In [None]:
# Split the data into training and testing sets again
# (the training rows were concatenated first, followed by the test rows)
n_train_rows = train_data_NO.shape[0]
engineered_train_data = combined_NO_normalized_data[:n_train_rows]
engineered_test_data = combined_NO_normalized_data[n_train_rows:]

print ("Shape of engineered_train_data: {}".format(engineered_train_data.shape))
print ("Shape of engineered_test_data: {}".format(engineered_test_data.shape))

In [None]:
# Define the data that will be used for all tests
y_train_full_engg = train_y_NO
X_train_full_engg = engineered_train_data
test_full_engg = engineered_test_data

In [None]:
# Break off validation set from training data
X_train_engg, X_valid_engg, y_train_engg, y_valid_engg = train_test_split(X_train_full_engg, y_train_full_engg,
                                                                          train_size=0.8, test_size=0.2,
                                                                          random_state=0)

In [None]:
# lgb_train_engg = lgb.Dataset(X_train_engg, y_train_engg)
# params = {'objective': 'regression',
#           'metric': {'rmse'}}
# gbm_engg = lgb.train(params, lgb_train_engg)
# prediction_gbm_engg = gbm_engg.predict(X_valid_engg)

LGBM = lgb.LGBMRegressor(n_estimators = 1000)
LGBM.fit(X_train_engg, y_train_engg)

<a id="sec-N"></a>
# Part N. Determining the Best Model
In this section we simply check which approach produced the lowest mean absolute error (MAE) and use that model to generate the submission. The name of the given method is the key in the dictionary pointing to the MAE value.

In [None]:
print("MAE values generated:")
for i in scores_dict:
    print(i + " : " + str(round(scores_dict[i], 2)))
    
min_key = min(scores_dict, key=scores_dict.get)

print("\nMethod with lowest MAE:")
print(min_key)

submission = submission_dict[min_key]
# print(submission)

output = pd.DataFrame({'Id': sample_data.Id,
                       'SalePrice': submission})
output.to_csv('submission.csv', index=False)
print("\nOutput generated as submission.csv")