Update 2017-03-11: Kaggle added sklearn_pandas to kernels, yeah !

# Why this kernel

The purpose of this kernel is to provide you with a hackable baseline for this competition.

The unofficial secondary purpose (shhh!) is to get a shiny Kaggle medal. If you find it useful, please upvote!


## Introduction
So far, none of the kernels I have seen do feature engineering in a maintainable and scalable way:
* Feature engineering has to be done twice: once for training, once for testing.
* It is contamination-prone for cross-validation and GridSearch (e.g. you compute the mean of the whole dataset and use it as a feature even though you cross-validate on only 80% of it).
* Feature testing and scaling are scattered all over the code.
* At some point you have to leave Pandas and use NumPy arrays, meaning you lose the context and labels of the data.
* You can't find useful features in an easy, automated way, especially after OneHotEncoding or LabelBinarizer.
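
To make the contamination point concrete, here is a minimal sketch on synthetic data (not the competition data). The names `leaky_scores`, `leak_free` and `clean_scores` are made up for this illustration, and LogisticRegression is only a stand-in classifier. In the first variant the scaler sees every row before cross-validation starts; in the second it is a Pipeline step, so it is refitted on the training part of each fold only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Contamination-prone: the scaler is fitted on ALL rows,
# including the rows later used as validation folds
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Contamination-free: the scaler lives inside the Pipeline, so
# cross_val_score refits it on the training part of each fold
leak_free = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
clean_scores = cross_val_score(leak_free, X, y, cv=5)

print(leaky_scores.round(3))
print(clean_scores.round(3))
```

On a toy dataset like this the two sets of scores will likely be very close; the difference matters more with target-dependent or heavily data-dependent features.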

## Learning outcomes
You will learn:
* How to scale feature engineering with a Pipeline
* How to easily debug any step in your feature engineering Pipeline
* How to structure your code to enable/disable features in a single place (aka Command Center) and preprocess them properly (StandardScaler, LabelBinarizer, OneHotEncoder ...)
* How to extract the most useful features from a feature set, even after OneHotEncoding or Binarization

**The end goal is to have very clean code that lets you test features very rapidly**

What you will not learn:
* Data exploration and visualization
* Stacking in two lines; use mlxtend for that: https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
* Imputing missing values with advanced techniques (beyond mean/median/mode): check my Titanic kernels in [Python](https://www.kaggle.com/mratsim/titanic/titanic-end-to-end-pipeline-stacking-gridsearch) and [Julia](https://www.kaggle.com/mratsim/titanic/titanic-julia-end-to-end-pipelining) for examples.


## Notes
This is a port of [Li Li's kernel](https://www.kaggle.com/aikinogard/two-sigma-connect-rental-listing-inquiries/random-forest-starter-with-numerical-features) to Scikit's Pipeline. Thank you Li Li for some clean and to the point code.

Originally this kernel could not run completely on Kaggle because the sklearn-pandas library, which lets you use Pandas DataFrames with Scikit-Learn, was missing. Since the 2017-03-11 update it is available in kernels.

The baseline score is: 0.63

# Import libraries
* Numerical libraries
* ScikitLearn Tools
* Classifier: XGBoost, using the Scikit Learn API
* time: to name the output files

sklearn-pandas is imported separately below: it originally could not run on Kaggle, but the 2017-03-11 update made it available.

In [None]:
import numpy as np
import pandas as pd

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelBinarizer, RobustScaler, Binarizer, StandardScaler, OneHotEncoder

In [None]:
from xgboost import XGBClassifier

In [None]:
import time

In [None]:
# Update 2017-03-11 - Kaggle added sklearn_pandas support
from sklearn_pandas import DataFrameMapper

# Import and display data

In [None]:
df_train = pd.read_json("../input/train.json")  # pass the path directly so no file handle is left open
df_train.head()

In [None]:
df_test = pd.read_json("../input/test.json")  # pass the path directly so no file handle is left open
df_test.head()

# Transformers

This is the main lesson of this code. Transformers make your code easy to maintain by wrapping each preprocessing step in an easily testable class:
* You will always be sure that the same transformations are applied to train and test data
* You won't have to track matrix sizes or column names
* You can store temporary data in the training phase and reuse it in the testing phase in a clean way, for example the number of occurrences of a manager id.

Transformers have 4 methods:

**\_\_init\_\_**

Usually just "pass", but it is useful:
* for temporary variables, like a dictionary to do some mapping
* if you need to store data from the training phase to reuse during the predict phase. Check my [Titanic Kernel](https://www.kaggle.com/mratsim/titanic/titanic-end-to-end-pipeline-stacking-gridsearch)

**fit**

If the transformer doesn't learn from the input data (same transformation in the training and testing phases), fit should simply return self.

If the transformer "learns" from the data, i.e. there are one or more internal properties declared in the \_\_init\_\_ method, update those properties and then return self. They can then be used in the transform method.

**transform**

How the transformer transforms the data.

**predict**

This is used if you build [your custom classifier](http://scikit-learn.org/dev/developers/contributing.html#rolling-your-own-estimator)
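
As a sketch of a transformer that "learns" during fit, here is the manager id example mentioned above. `ManagerCountTransformer`, the `ManagerCount` column and the toy frames are hypothetical names made up for this illustration, and the column name `manager_id` is an assumption. The learned counts are stored in `counts_` (with a trailing underscore, following the usual Scikit convention for fitted attributes) rather than being declared in \_\_init\_\_.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ManagerCountTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical example: adds how often each manager id was seen in training."""

    def __init__(self, column="manager_id"):
        # Only configuration goes here; nothing is learned yet
        self.column = column

    def fit(self, X, y=None, **fit_params):
        # Learn the counts from the training data and keep them on the instance
        self.counts_ = pd.DataFrame(X)[self.column].value_counts()
        return self

    def transform(self, X, **transform_params):
        df = pd.DataFrame(X)
        # Managers never seen during fit get a count of 0
        counts = df[self.column].map(self.counts_).fillna(0).astype(int)
        return df.assign(ManagerCount=counts)

# Toy data: manager "a" appears twice in training, "c" never
train = pd.DataFrame({"manager_id": ["a", "a", "b"]})
test = pd.DataFrame({"manager_id": ["a", "c"]})

counter = ManagerCountTransformer().fit(train)
out = counter.transform(test)
print(out)
```

Because the counts come from fit only, the test data never influences the feature, which is exactly what keeps cross-validation clean.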

In [None]:
# This transformer extracts the number of photos
class PP_NumPhotTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        df = pd.DataFrame(X) #Python thinks X is a list input by default instead of a Dataframe
        return df.assign(
            NumPhotos = df['photos'].str.len()
            )
    
# This transformer extracts the number of features
class PP_NumFeatTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        df = pd.DataFrame(X) #Python thinks X is a list input by default instead of a Dataframe
        return df.assign(
            NumFeat = df['features'].str.len()
            )
    
# This transformer extracts the number of words in the description
class PP_NumDescWordsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        df = pd.DataFrame(X) #Python thinks X is a list input by default instead of a Dataframe
        return df.assign(
            NumDescWords = df["description"].apply(lambda x: len(x.split(" ")))
            )
    
# This transformer extracts the date/month/year and timestamp in a neat package
class PP_DateTimeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        df = pd.DataFrame(X) #Python thinks X is a list input by default instead of a Dataframe
        df = df.assign(
            Created_TS = pd.to_datetime(df["created"])
        )
        return df.assign(
            Created_Year = df["Created_TS"].dt.year,
            Created_Month = df["Created_TS"].dt.month,
            Created_Day = df["Created_TS"].dt.day
            )

####### Debug Transformer ###########
# Use this transformer anywhere in your Pipeline to dump your dataframe to CSV
class DebugTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        X.to_csv('./debug.csv')
        return X

In [None]:
# Import sklearn_pandas, which provides Pandas DataFrame compatibility with Scikit's classifiers and Pipeline
from sklearn_pandas import DataFrameMapper

# Command Center - enabled features

This is where you:
* enable/disable features
* specify scaling, onehotencoding, labelbinarizing


Note: LabelBinarizer uses the syntax ("feature", LabelBinarizer())
instead of (["feature"], LabelBinarizer())

Note2: Copy-pasting your DataFrameMapper config when you submit your results makes for a superb description of your model

In [None]:
mapper = DataFrameMapper([
    (["bathrooms"],RobustScaler()), #Some bathrooms number are 1.5, Some outliers are 112 or 20
    (["bedrooms"],OneHotEncoder()),
    (["latitude"],None),
    (["longitude"],None),
    (["price"],RobustScaler()),
    # (["NumDescWords"],None),
    (["NumFeat"],StandardScaler()),
    (["Created_Year"],None),
    (["Created_Month"],None),
    (["Created_Day"],None)
])

# Command Center - feature engineering pipeline + classifier

This is your transformation pipeline in the format:
("arbitrary_name", Transformer())

You can comment/uncomment to remove/add steps.
The last step should be your classifier.

Note: to configure a specific step named "step", use the format step\_\_parameter = value

For example, I wanted to set the eval_metric of xgboost (step name xgb) during the fit step, so I used:

pipe.fit(X_train, y_train, **xgb\__eval_metric**='mlogloss')
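
The same double-underscore convention can be sketched on a toy pipeline. Everything below is illustrative: `toy_pipe`, the step names `scale` and `clf`, and the use of LogisticRegression instead of XGBoost are assumptions made to keep the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Constructor parameters of a step: <step name>__<parameter>
toy_pipe.set_params(clf__C=0.5)

# Synthetic data, only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] > 0).astype(int)
weights = np.ones(len(y))

# Fit-time arguments use the same prefix and are forwarded
# to the fit method of the step called "clf"
toy_pipe.fit(X, y, clf__sample_weight=weights)

print(toy_pipe.named_steps["clf"].C)
```

The same prefixed names are what GridSearch expects in its parameter grid, so one naming scheme covers configuration, fitting and tuning.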

In [None]:
pipe = Pipeline([
    ("extract_numphot", PP_NumPhotTransformer()),
    ("extract_numfeat", PP_NumFeatTransformer()),
    ("extract_numdesc", PP_NumDescWordsTransformer()),
    ("extract_datetime", PP_DateTimeTransformer()),
    # ("DEBUG", DebugTransformer()), #Uncomment to debug
    ("featurize", mapper),
    ("xgb",XGBClassifier(
        n_estimators=1000,
        seed=42,
        objective='multi:softprob',
        subsample=0.8,
        colsample_bytree=0.8,
    ))
])

# Helper functions

## Cross Validation
5 folds, results summarized with 3 decimals of precision
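
One thing to keep in mind: for a classifier, cross_val_score reports accuracy unless told otherwise. If you would rather have a number comparable to the mlogloss metric, a sketch under that assumption is to pass scoring="neg_log_loss" and flip the sign. The data and the LogisticRegression stand-in below are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data, only for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(150, 2))
y = np.repeat(["high", "low", "medium"], 50)

# Default scoring for a classifier: accuracy (higher is better)
accuracy = cross_val_score(LogisticRegression(), X, y, cv=5)

# neg_log_loss is the negated log loss, so flip the sign
# to get the usual "lower is better" value
log_loss = -cross_val_score(LogisticRegression(), X, y, cv=5,
                            scoring="neg_log_loss")

print("Accuracy per fold:", accuracy.round(3))
print("Log loss per fold:", log_loss.round(3))
```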

## Get features that contributes most to the score
This function gives sensible names in case of OneHotEncoding or LabelBinarizer.
This is only possible with classifiers that implement **feature\_importances_** like RandomForest, ExtraTrees or XGBoost.

It outputs a top_features.csv file:

![enter image description here][1]
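
The naming idea can be sketched without DataFrameMapper on a tiny made-up dataset. The `toy` frame, the `names` list and the RandomForest stand-in are assumptions for this illustration, and `categories_` is the attribute exposed by recent scikit-learn releases of OneHotEncoder.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up dataset, only for illustration
toy = pd.DataFrame({
    "bedrooms": [0, 1, 2, 1, 0, 2, 1, 0],
    "price": [1500, 2400, 3900, 2600, 1400, 4100, 2500, 1600],
})
y = ["low", "medium", "high", "medium", "low", "high", "medium", "low"]

# One-hot encode bedrooms and keep price as is
encoder = OneHotEncoder()
bedrooms_onehot = encoder.fit_transform(toy[["bedrooms"]]).toarray()
X = np.hstack([bedrooms_onehot, toy[["price"]].to_numpy()])

# One readable name per encoded column: bedrooms_0, bedrooms_1, ...
names = ["bedrooms_" + str(v) for v in encoder.categories_[0]] + ["price"]

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

top = (pd.DataFrame({"feature": names,
                     "importance": model.feature_importances_})
       .sort_values("importance", ascending=False)
       .set_index("feature"))
print(top.round(3))
```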

## Predict and format the output
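
The key detail when formatting the output is that the columns of predict_proba follow the order of classes_. A minimal sketch of that idea, with made-up data, hypothetical listing ids and LogisticRegression as a stand-in classifier:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data, only for illustration
rng = np.random.RandomState(0)
X_fit = rng.normal(size=(90, 2))
y_fit = np.repeat(["high", "low", "medium"], 30)
X_new = rng.normal(size=(4, 2))
listing_ids = [101, 102, 103, 104]  # hypothetical ids

clf = LogisticRegression().fit(X_fit, y_fit)
proba = clf.predict_proba(X_new)

# Column i of predict_proba is the probability of clf.classes_[i]
submission = pd.DataFrame(proba, columns=clf.classes_)
submission.insert(0, "listing_id", listing_ids)
print(submission.round(3))
```

Building the frame directly from the probability matrix avoids indexing each class by position, so it keeps working if the number of classes changes.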


  [1]: https://i.imgur.com/a7ElIO8.png

In [None]:
##### Cross Validation #######
def crossval():
    cv = cross_val_score(pipe, X_train, y_train, cv=5)
    print("Cross Validation Scores are: ", cv.round(3))
    print("Mean CrossVal score is: ", round(cv.mean(),3))
    print("Std Dev CrossVal score is: ", round(cv.std(),3))

In [None]:
####### Get top features and noise #######
def top_feat():
    dummy, model = pipe.steps[-1]

    feature_list = []
    for feature in pipe.named_steps['featurize'].features:
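        if isinstance(feature[1], OneHotEncoder):
            encoder = feature[1]
            # active_features_ only exists in older scikit-learn releases;
            # newer ones expose categories_ instead
            values = (encoder.categories_[0] if hasattr(encoder, 'categories_')
                      else encoder.active_features_)
            for feature_value in values:
                feature_list.append(feature[0][0]+'_'+str(feature_value))
        else:
            try:
                # LabelBinarizer: one output column per class
                for feature_value in feature[1].classes_:
                    feature_list.append(feature[0]+'_'+str(feature_value))
            except AttributeError:
                # No classes_ (scaler or None): the column keeps its own name
                feature_list.append(feature[0])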


    top_features = pd.DataFrame({'feature':feature_list,'importance':np.round(model.feature_importances_,3)})
    top_features = top_features.sort_values('importance',ascending=False).set_index('feature')
    top_features.to_csv('./top_features.csv')
    top_features.plot.bar()

In [None]:
####### Predict and format output #######
def output():
    predictions = pipe.predict_proba(df_test)
    
    #debug
    print(pipe.classes_)
    print(predictions)
    
    result = pd.DataFrame({
        'listing_id': df_test['listing_id'],
        pipe.classes_[0]: [row[0] for row in predictions], 
        pipe.classes_[1]: [row[1] for row in predictions],
        pipe.classes_[2]: [row[2] for row in predictions]
        })
    result.to_csv(time.strftime("%Y-%m-%d_%H%M-")+'baseline.csv', index=False)

In [None]:
################ Training ################################

X_train = df_train
y_train = df_train['interest_level']

In [None]:
################ Cross Validation ################################
crossval()

In [None]:
##### Fit ######
pipe.fit(X_train, y_train, xgb__eval_metric='mlogloss')

In [None]:
######### Most influential features ########
top_feat()

In [None]:
######## Predict ########
output()

# The End

I hope you enjoyed the kernel and that it will help you iterate and test faster in your feature engineering quest.

Thank you for your attention, feel free to post a comment and upvote.

You can check advanced transformer usage on my Titanic kernels in [Python](https://www.kaggle.com/mratsim/titanic/titanic-end-to-end-pipeline-stacking-gridsearch) and [Julia](https://www.kaggle.com/mratsim/titanic/titanic-julia-end-to-end-pipelining).