In [0]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import set_config; set_config(display='diagram')

# Houses Kaggle Competition (bis 🔥) 

[<img src='https://github.com/lewagon/data-images/blob/master/ML/kaggle-batch-challenge.png?raw=true' width=600>](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

Let's reuse the pipeline we built in module `05-07-Ensemble-Methods` and improve our final predictions using a Neural Network!

# Re-use already-built preprocessing

### Load data

In [0]:
# Let's load our training dataset
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_train_raw.csv")
X = data.drop(columns='SalePrice')
y = data['SalePrice']

# You don't have access to y_test! Only Kaggle has it.
X_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/houses_test_raw.csv")

print(X.shape, y.shape, X_test.shape)

### Import preprocessor

You will find in `utils/preprocessor.py` the data-preprocessing pipeline that was built in our previous iteration.

❓ Run the cell below, and make sure you understand what the pipeline does. Look at the code in `preprocessor.py`

In [0]:
from utils.preprocessor import create_preproc
preproc = create_preproc(X)
preproc

❓ Fit the preprocessor on your train set and create the feature matrix `X_preproc` that will be used by the Neural Network
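
As a reminder of the fit/transform pattern, here is a minimal sketch. It assumes the object returned by `create_preproc` behaves like any scikit-learn transformer; the `preproc_demo` pipeline and the two toy DataFrames below are hypothetical stand-ins, not the real preprocessor or data. If the pipeline in `preprocessor.py` contains a supervised step (feature selection, for instance), you would probably need to pass `y` to `fit` as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for `create_preproc(X)`: impute, then scale
preproc_demo = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Toy data standing in for X and X_test
X_demo = pd.DataFrame({"GrLivArea": [1500.0, 2000.0, np.nan, 1200.0],
                       "LotArea": [8000.0, 9500.0, 7000.0, 11000.0]})
X_test_demo = pd.DataFrame({"GrLivArea": [1800.0], "LotArea": [np.nan]})

# Fit on the train set only, then reuse the fitted preprocessor everywhere
preproc_demo.fit(X_demo)
X_preproc_demo = preproc_demo.transform(X_demo)
X_test_preproc_demo = preproc_demo.transform(X_test_demo)

print(X_preproc_demo.shape, X_test_preproc_demo.shape)
```

The test set is only transformed, never fitted, so that the statistics used for imputing and scaling come from the train set alone.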

# Your prediction in Keras

This is your first **regression** task with Keras! 
- The cell below contains the compile and fit hyper-parameters we recommend you start with.
- Kaggle's [rule](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation) requires minimizing `rmsle` (Root Mean Squared Log Error). As you can see, we can specify `msle` directly as the loss function in Keras! Just remember to take the square root of your loss values to read your `rmsle` metric.
- The best boosted-tree `rmsle` score to beat is around **0.13**
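
To make the link between `msle` and `rmsle` concrete, here is a small NumPy sketch. It assumes the loss is computed on `log(1 + x)`, which is our understanding of how Keras defines `msle`; the `msle` and `rmsle` helpers and the three prices are illustrative, not part of the challenge code.

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean squared log error, computed on log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

def rmsle(y_true, y_pred):
    """Root of the msle: the metric you compare against the 0.13 baseline."""
    return np.sqrt(msle(y_true, y_pred))

# Illustrative prices only
y_true = np.array([200_000, 150_000, 320_000])
y_pred = np.array([210_000, 140_000, 300_000])

print(f"msle:  {msle(y_true, y_pred):.5f}")
print(f"rmsle: {rmsle(y_true, y_pred):.5f}")
```

Because the error is measured on logarithms, it penalizes relative errors: being 10% off on a cheap house costs roughly as much as being 10% off on an expensive one.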

❓ **Question** ❓
- Your responsibility is to build the best model architecture, and to control the epoch number to avoid overfitting.
- We recommend creating a train/val split upfront so you can visually monitor the validation loss with `plot_history`
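
One possible way to create such a split is scikit-learn's `train_test_split`. The sketch below runs on random stand-in arrays (`X_preproc_demo`, `y_demo`); the 30% validation share and the `random_state` are arbitrary choices, not recommendations from the challenge.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the preprocessed features and the sale prices
rng = np.random.default_rng(0)
X_preproc_demo = rng.normal(size=(100, 5))
y_demo = rng.uniform(50_000, 500_000, size=100)

# Hold out 30% of the rows to monitor the validation loss during training
X_train, X_val, y_train, y_val = train_test_split(
    X_preproc_demo, y_demo, test_size=0.3, random_state=42
)

print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
```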

In [0]:
# Create a train-val split here

In [0]:
def initialize_model():

    ### YOUR MODEL ARCHITECTURE HERE
    # model = ...

    
    # Recommended compile settings
    model.compile(optimizer='adam',
                  loss='msle')  # directly optimize for the squared log error!
    return model

model = initialize_model()

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, # Tune this: stop before the validation loss starts to rise (overfitting)
                    batch_size=16, # Keep batch size to 16 today
                    verbose=0)

In [0]:
def plot_history(history):
    plt.plot(np.sqrt(history.history['loss']))
    plt.plot(np.sqrt(history.history['val_loss']))
    plt.title('Model Loss')
    plt.ylabel('RMSLE')  # we plot the square root of the msle loss
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='best')
    plt.show()

❓ **Question** ❓
- Are you satisfied with your score?
- Before you publish it, ask yourself if you can trust it entirely? Has it been cross-validated? 
- Feel free to cross-validate it manually with a Python `for` loop before submitting to Kaggle
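
A manual cross-validation loop could look like the sketch below. To stay runnable without TensorFlow it uses a hypothetical stand-in model (`MeanLogModel`) and random data; `cross_validate_rmsle` is an illustrative helper, not something provided by the challenge.

```python
import numpy as np
from sklearn.model_selection import KFold

class MeanLogModel:
    """Hypothetical stand-in model: always predicts the mean log-price."""
    def fit(self, X, y):
        self.mean_log_ = np.mean(np.log1p(y))
        return self

    def predict(self, X):
        return np.full(len(X), np.expm1(self.mean_log_))

def cross_validate_rmsle(build_model, X, y, n_splits=5, seed=0):
    """Return one rmsle score per fold, training a fresh model each time."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        model = build_model()  # never reuse weights across folds
        model.fit(X[train_idx], y[train_idx])
        y_pred = np.ravel(model.predict(X[val_idx]))  # flattens (n, 1) outputs
        msle = np.mean((np.log1p(y_pred) - np.log1p(y[val_idx])) ** 2)
        scores.append(np.sqrt(msle))
    return np.array(scores)

# Random stand-ins for the preprocessed features and the sale prices
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = rng.uniform(50_000, 500_000, size=200)

scores = cross_validate_rmsle(MeanLogModel, X_demo, y_demo)
print(f"rmsle per fold: {np.round(scores, 3)}, mean: {scores.mean():.3f}")
```

With your Keras model, you would call `initialize_model()` inside the loop and pass your own `epochs` and `batch_size` to `fit`. Rebuilding the model at each fold matters: otherwise the weights learned on one fold leak into the next and the scores look better than they are.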

# 🏅FINAL SUBMISSION

Predict the house prices of your test set and submit your results to Kaggle! Be careful with the format.
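
For this competition, Kaggle expects a CSV with two columns, `Id` and `SalePrice`. The sketch below builds such a file from made-up values; `test_ids` and `y_pred` are hypothetical stand-ins, and it assumes you can recover the ids from an `Id` column in `X_test`, which you should check on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: in practice, ids would come from X_test
# and y_pred from model.predict on the preprocessed test set
test_ids = pd.Series([1461, 1462, 1463], name="Id")
y_pred = np.array([[169000.0], [187000.0], [175000.0]])  # Keras-style (n, 1) output

# Flatten the predictions so each row holds a single price
submission = pd.DataFrame({"Id": test_ids, "SalePrice": y_pred.ravel()})

# index=False keeps the pandas index out of the file
csv_text = submission.to_csv(index=False)
print(csv_text)
```

To write the file to disk, pass a path as the first argument, for example `submission.to_csv("submission.csv", index=False)`.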