# Ensemble Methods: Challenge Session

In [None]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

## House Prices: Advanced Regression Techniques

Source:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

> With 79 explanatory variables describing (almost) every aspect of
> residential homes in Ames, Iowa, this competition challenges you to
> predict the final price of each home.
> The potential for creative feature engineering provides a rich 
> opportunity for fun and learning. This dataset lends itself to
> advanced regression techniques like random forests and gradient
> boosting.

This prediction challenge uses the RMSE of predictions versus targets as the quality metric (the Kaggle leaderboard computes it on the logarithms of the prices):
$$ \mathtt{RMSE}
    = \sqrt{ \frac1T \sum_{t=1}^T (y_t - \hat{y}_t)^2}
    \,. $$

In [None]:
from sklearn.metrics import explained_variance_score, mean_squared_error
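
As a minimal sketch of the metric above, the RMSE can be obtained by taking the square root of ``mean_squared_error``. The helper name ``rmse`` and the toy arrays are illustrative assumptions, not part of the challenge code.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Toy values, invented purely for illustration.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Residuals are 1, 0 and -2, so the result is sqrt((1 + 0 + 4) / 3).
print(rmse(y_true, y_hat))
```

Taking the square root by hand keeps the sketch independent of the installed scikit-learn version.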

## Load and preprocess the data

Load the ``csv`` data

In [None]:
train = pd.read_csv("housing/train.csv")
test = pd.read_csv("housing/test.csv")

Inspect

In [None]:
train.head()

#### Data Fields
Target variable:
* **SalePrice** -- The property's sale price in dollars

Explanatory variables:
* **MSSubClass** -- The building class
* **MSZoning** -- The general zoning classification
* **LotFrontage** -- Linear feet of street connected to property
* **LotArea** -- Lot size in square feet
* **Street** -- Type of road access
* **Alley** -- Type of alley access
* **LotShape** -- General shape of property
* **LandContour** -- Flatness of the property
* **Utilities** -- Type of utilities available
* **LotConfig** -- Lot configuration
* **LandSlope** -- Slope of property
* **Neighborhood** -- Physical locations within Ames city limits
* **Condition1** -- Proximity to main road or railroad
* **Condition2** -- Proximity to main road or railroad (if a second is present)
* **BldgType** -- Type of dwelling
* **HouseStyle** -- Style of dwelling
* **OverallQual** -- Overall material and finish quality
* **OverallCond** -- Overall condition rating
* **YearBuilt** -- Original construction date
* **YearRemodAdd** -- Remodel date
* **RoofStyle** -- Type of roof
* **RoofMatl** -- Roof material
* **Exterior1st** -- Exterior covering on house
* **Exterior2nd** -- Exterior covering on house (if more than one material)
* **MasVnrType** -- Masonry veneer type
* **MasVnrArea** -- Masonry veneer area in square feet
* **ExterQual** -- Exterior material quality
* **ExterCond** -- Present condition of the material on the exterior
* **Foundation** -- Type of foundation
* **BsmtQual** -- Height of the basement
* **BsmtCond** -- General condition of the basement
* **BsmtExposure** -- Walkout or garden level basement walls
* **BsmtFinType1** -- Quality of basement finished area
* **BsmtFinSF1** -- Type 1 finished square feet
* **BsmtFinType2** -- Quality of second finished area (if present)
* **BsmtFinSF2** -- Type 2 finished square feet
* **BsmtUnfSF** -- Unfinished square feet of basement area
* **TotalBsmtSF** -- Total square feet of basement area
* **Heating** -- Type of heating
* **HeatingQC** -- Heating quality and condition
* **CentralAir** -- Central air conditioning
* **Electrical** -- Electrical system
* **1stFlrSF** -- First Floor square feet
* **2ndFlrSF** -- Second floor square feet
* **LowQualFinSF** -- Low quality finished square feet (all floors)
* **GrLivArea** -- Above grade (ground) living area square feet
* **BsmtFullBath** -- Basement full bathrooms
* **BsmtHalfBath** -- Basement half bathrooms
* **FullBath** -- Full bathrooms above grade
* **HalfBath** -- Half baths above grade
* **BedroomAbvGr** -- Number of bedrooms above basement level
* **KitchenAbvGr** -- Number of kitchens
* **KitchenQual** -- Kitchen quality
* **TotRmsAbvGrd** -- Total rooms above grade (does not include bathrooms)
* **Functional** -- Home functionality rating
* **Fireplaces** -- Number of fireplaces
* **FireplaceQu** -- Fireplace quality
* **GarageType** -- Garage location
* **GarageYrBlt** -- Year garage was built
* **GarageFinish** -- Interior finish of the garage
* **GarageCars** -- Size of garage in car capacity
* **GarageArea** -- Size of garage in square feet
* **GarageQual** -- Garage quality
* **GarageCond** -- Garage condition
* **PavedDrive** -- Paved driveway
* **WoodDeckSF** -- Wood deck area in square feet
* **OpenPorchSF** -- Open porch area in square feet
* **EnclosedPorch** -- Enclosed porch area in square feet
* **3SsnPorch** -- Three season porch area in square feet
* **ScreenPorch** -- Screen porch area in square feet
* **PoolArea** -- Pool area in square feet
* **PoolQC** -- Pool quality
* **Fence** -- Fence quality
* **MiscFeature** -- Miscellaneous feature not covered in other categories
* **MiscVal** -- Dollar value of miscellaneous feature
* **MoSold** -- Month Sold
* **YrSold** -- Year Sold
* **SaleType** -- Type of sale
* **SaleCondition** -- Condition of sale

Pool the train and test datasets

In [None]:
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

Get the column names of all the numeric features

In [None]:
numeric_ = all_data.dtypes[all_data.dtypes != "object"].index

Log-transform the numeric features whose skewness on the train exceeds 0.75:

In [None]:
from scipy.stats import skew
skewed_ = train[numeric_].apply(lambda x: skew(x.dropna()))
skewed_ = skewed_[skewed_ > 0.75].index

all_data[skewed_] = np.log1p(all_data[skewed_])
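
To see why the log transform helps, here is a small sketch on synthetic data. The log-normal sample is an assumption that stands in for a right-skewed column such as ``LotArea``; it is not taken from the housing data.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Synthetic right-skewed sample (log-normal), used only for illustration.
rng = np.random.default_rng(0)
toy = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Sample skewness before and after the log1p transform.
skew_before = skew(toy.dropna())
skew_after = skew(np.log1p(toy).dropna())

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The transform compresses the long right tail, so the skewness drops noticeably.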

One-hot encode the categorical (string) columns

In [None]:
all_data = pd.get_dummies(all_data)
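
A minimal sketch of what ``pd.get_dummies`` does, on a toy frame whose values are invented for illustration: numeric columns pass through unchanged, and each string column is replaced by one indicator column per category.

```python
import pandas as pd

# Toy frame with one numeric and one string column (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

encoded = pd.get_dummies(toy)

# The string column becomes one indicator column per observed category.
print(encoded.columns.tolist())  # ['LotArea', 'Street_Grvl', 'Street_Pave']
```

Encoding the pooled train and test data in one call keeps the indicator columns identical across the two sets, even when a category appears in only one of them.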

Fill the missing values in each column with its mean over the training rows

In [None]:
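# Take the means from the training rows of the pooled data, so they match its transformed scale and columns
all_data = all_data.fillna(all_data[:train.shape[0]].mean())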
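
The imputation step can be sketched on toy frames as follows. The frames ``toy_train`` and ``toy_test`` and their values are hypothetical; the point is that the fill values come from the training rows only.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with missing entries (values are made up).
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 65.0]})

# Column means are computed on the training rows only (NaNs are skipped).
train_means = toy_train.mean()

# fillna with a Series matches the fill values to columns by name.
filled_train = toy_train.fillna(train_means)
filled_test = toy_test.fillna(train_means)

print(filled_test["LotFrontage"].tolist())  # [70.0, 65.0]
```

Using training statistics for both sets avoids leaking information from the test set into the preprocessing.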

Log-transform the target (``SalePrice``)

In [None]:
train["SalePrice"] = np.log1p(train["SalePrice"])
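
Since the model is fitted to ``log1p(SalePrice)``, its predictions live on the log scale. A short sketch, with invented prices, of how ``np.expm1`` undoes the transform:

```python
import numpy as np

# Example prices in dollars, invented for illustration.
prices = np.array([100000.0, 208500.0, 755000.0])

# Forward transform, as applied to the target.
log_prices = np.log1p(prices)

# Inverse transform, to apply to predictions made on the log scale.
recovered = np.expm1(log_prices)

print(np.allclose(recovered, prices))  # True
```

Any prediction made on the transformed target should pass through ``np.expm1`` before it is reported in dollars.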

Now do the split!

In [None]:
df_X_train = all_data[:train.shape[0]]
df_X_test = all_data[train.shape[0]:]
df_y_train = train.SalePrice

Inspect the data

In [None]:
df_X_train.head()

Now do a 75-25 train/validation split

In [None]:
from sklearn.model_selection import train_test_split

X, y = df_X_train.values.copy(), df_y_train.values.copy()
X_test = df_X_test.values.copy()

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25)

# Train, validate & apply regression

Train/validate your ensemble model here

Samples:
* ``(X_train, y_train)`` -- train
* ``(X_valid, y_valid)`` -- validation

In [None]:
###############################
##### PUT YOUR MODEL HERE #####
###############################
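
One possible starting point, sketched on synthetic data so that it runs on its own: fit two tree ensembles and compare their validation RMSE against a trivial baseline. The toy dataset, the variable names and the hyperparameters are all assumptions chosen for illustration, not a recommended solution to the challenge.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the housing data), split 75-25 as above.
X_toy, y_toy = make_regression(n_samples=300, n_features=10, n_informative=5,
                               noise=10.0, random_state=0)
Xt_train, Xt_valid, yt_train, yt_valid = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# Two common tree ensembles; the hyperparameters are arbitrary starting points.
models = {
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the train part and score it on the validation part.
scores = {}
for name, model in models.items():
    model.fit(Xt_train, yt_train)
    valid_mse = mean_squared_error(yt_valid, model.predict(Xt_valid))
    scores[name] = float(np.sqrt(valid_mse))
    print(f"{name}: validation RMSE = {scores[name]:.3f}")

# Baseline for comparison: always predict the training mean.
baseline = float(np.sqrt(np.mean((yt_valid - yt_train.mean()) ** 2)))
print(f"mean baseline: validation RMSE = {baseline:.3f}")
```

On the housing data the same loop would use ``(X_train, y_train)`` and ``(X_valid, y_valid)`` in place of the toy arrays.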

## Make a submission

Samples:
* ``(X, y)`` -- full train dataset
* ``X_test`` -- the test dataset (no target $y$)

In [None]:
###############################
##### PUT YOUR MODEL HERE #####
###############################

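# Placeholder predictions. The model is trained on log1p(SalePrice), so convert
# real predictions back to dollars with np.expm1 before submitting.
y_pred = np.zeros(X_test.shape[0], dtype=float)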

Write the submission to ``"my_submission.csv"``
* ``y_pred`` -- predictions on the test ``X_test``

In [None]:
pd.DataFrame(dict(Id=test.Id,
                  SalePrice=y_pred.astype(float)),
             columns=["Id", "SalePrice"])\
  .to_csv("my_submission.csv", index=False)