# Regression preparation with basic methods

We're just going to take the simplest approach to combining all the data. This rough-and-ready approach should help identify the most fruitful avenue to pursue.

Comments on workflow are appreciated! 

## Data processing

We'll pull in all the data, combine it, take a look at some stats and do very basic preparation (with no real 'feature engineering').

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('../input/train.csv')
macro = pd.read_csv('../input/macro.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
train = pd.merge(train, macro, how='left', on='timestamp')
print(train.shape)
train.head()

### Processing the target variable

First let's sort out our predicted value. Often in housing price datasets there is a lot of skewness in the value to predict and taking a log gives a more normal distribution. This tends to lead to less bias in the regression. Let's examine that value here.

In [None]:
target = train['price_doc']
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10,3))

target.plot(ax=axes[0], kind='hist', bins=100)
np.log(target).plot(ax=axes[1], kind='hist', bins=100, color='green', secondary_y=True)
plt.show()

Looks like the same thing is present in this dataset, so we will aim to predict log(house price).
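
One way to put a number on what the histograms show is the sample skewness. The sketch below is only illustrative: it uses synthetic log-normal "prices" (an assumption standing in for `price_doc`, not the competition data) and compares `Series.skew()` before and after taking the log.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (an assumption, not the real price_doc column)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=15.0, sigma=0.6, size=10_000))

skew_raw = prices.skew()          # clearly positive: long right tail
skew_log = np.log(prices).skew()  # near zero once the tail is compressed

print("skew(raw) = {:.2f}, skew(log) = {:.2f}".format(skew_raw, skew_log))
```

Running the same two `.skew()` calls on `target` and `np.log(target)` would give the equivalent check on the real data.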

In [None]:
y = np.log(target)



### Processing the input features

Based on just the top of the dataset we can see there are plenty of missing values in some columns. Let's see how bad the problem is. Since we have 490 features currently, it's likely we will be able to discard some of the features with many missing values.

We can always come up with ways to add them in later.

In [None]:
percent_null = train.isnull().mean(axis=0) > 0.20
print("{:.2%} of columns have more than 20% missing values.".format(np.mean(percent_null)))

I'm happy to lose 5% of the features and not have to worry about a proper imputation strategy. We'll also pull out the uninformative columns.

In [None]:
df = train.loc[:, ~percent_null]
df = df.drop(['id', 'price_doc'], axis=1)

print(df.dtypes.value_counts())
np.array([c for c in df.columns if df[c].dtype == 'object'])

The next question is how to handle the object data types. I'm going to make the basic assumption that all floating and integer valued columns can be treated as one-dimensional values, so there is no need to dummy these.

It's possible that you could do better processing on some of the object columns - for example, 'sub_area' could be replaced with a 2D co-ordinate vector of the area, so distance metrics make sense between each class. However, I'm just going to dummy every object variable except 'timestamp'. I'll replace 'timestamp' with a numeric value, since it makes sense to treat this as 1-dimensional and the distance is well-defined.
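
To make the plan concrete, here is a minimal sketch on a toy frame. The column names `area_m2` and `district` and all values are made up for illustration; only `timestamp` mirrors the real data. It shows that once `timestamp` is numeric, `pd.get_dummies` leaves it and the other numeric column alone and only expands the remaining string column.

```python
import pandas as pd

# Toy frame with hypothetical columns: one numeric, one date string, one categorical
toy = pd.DataFrame({
    'area_m2': [43.0, 34.0, 89.0],
    'timestamp': ['2011-08-20', '2011-08-23', '2011-08-27'],
    'district': ['area_a', 'area_b', 'area_a'],
})

# Convert the date to a number first so that get_dummies does not expand it
toy['timestamp'] = pd.to_numeric(pd.to_datetime(toy['timestamp'])) / 1e18

# Only the remaining string column is turned into indicator columns
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

If `timestamp` were left as a string, every distinct date would become its own indicator column, which is why the conversion comes first.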

In [None]:
df['timestamp'] = pd.to_numeric(pd.to_datetime(df['timestamp'])) / 1e18
print(df['timestamp'].head())

# This automatically only dummies object columns
df = pd.get_dummies(df).astype(np.float64)
print(df.shape)

In [None]:
X = df

So we have 636 features and 30,471 total observations at this point. Since the number of features is low relative to the number of observations, the challenge is going to be finding a model with high capacity, rather than necessarily adding lots of regularization at this stage. I'm not worried about overfitting yet.

Now let's prepare the data for learning. To do this, we'll follow the basic steps:

1. Make a train/test split.
2. Impute the missing values - we will replace them with the column mean.
3. Scale every value by mean and standard deviation.

N.B. We are going to use the imputer class from sklearn, which (afaik) doesn't support different imputing methods for different columns. Ideally you would use the mode for the 0/1 columns we converted from strings and the mean for the continuous columns. You could write a class to implement this, but again this is just a rough and ready approach.
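
For anyone who wants to try the per-column idea, a rough sketch of such a class follows. `ColumnwiseImputer` is a hypothetical name, not an sklearn class, and the toy columns are made up. It learns the mode for 0/1 columns and the mean for everything else in `fit`, then reuses those values in `transform`.

```python
import numpy as np
import pandas as pd

class ColumnwiseImputer:
    """Hypothetical helper: mode for 0/1 columns, mean for all other columns."""

    def fit(self, X):
        self.fill_values_ = {}
        for col in X.columns:
            values = X[col].dropna()
            # A column counts as binary if its observed values are a subset of {0, 1}
            is_binary = len(values) > 0 and set(values.unique()) <= {0.0, 1.0}
            # The statistic is learned here and reused unchanged in transform()
            self.fill_values_[col] = values.mode().iloc[0] if is_binary else values.mean()
        return self

    def transform(self, X):
        # fillna with a dict fills each column with its own value
        return X.fillna(value=self.fill_values_)

# Made-up data: one indicator column and one continuous column
toy_train = pd.DataFrame({'is_a': [1.0, 1.0, 0.0, np.nan],
                          'size': [10.0, 20.0, np.nan, 30.0]})
imputer = ColumnwiseImputer().fit(toy_train)
filled = imputer.transform(toy_train)
print(filled)
```

This class does not plug into `make_pipeline` as written, since it works on DataFrames rather than arrays. Newer sklearn versions also offer `ColumnTransformer`, which may be a cleaner route to the same result.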

In [None]:
from sklearn.model_selection import train_test_split

# Fix the seed so the split (and the errors reported below) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
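from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Make a pipeline that transforms X.
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in
# scikit-learn 0.22; like the old class, it fills with the column mean by default.
pipe = make_pipeline(SimpleImputer(), StandardScaler())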
pipe.fit(X_train)
pipe.transform(X_train)

Now finally we want a single real-valued metric for comparing our models and implementations. Handily, Kaggle already tells us what that should be - the RMSLE. Recent versions of sklearn ship `mean_squared_log_error` in `sklearn.metrics` (the RMSLE is its square root), but it's easy to define yourself.

Also, for convenience, I will take the exponential in this function, since our model is working in log(house price).
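
A side note on what this metric means for our setup: because prices are in the millions, the `+ 1` inside the logs is negligible, so the RMSLE on prices is almost exactly the plain RMSE on the log-prices the models are trained on. The small check below uses made-up numbers (not competition data) and a stand-alone helper named `rmsle` purely for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error on raw (non-log) values
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up log-prices and predictions; exponentiated they are in the millions
y_true_log = np.array([15.2, 15.8, 16.1, 14.9])
y_pred_log = np.array([15.0, 16.0, 16.0, 15.3])

rmsle_value = rmsle(np.exp(y_true_log), np.exp(y_pred_log))
rmse_log = np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))

print("RMSLE on prices: {:.6f}, RMSE on log-prices: {:.6f}".format(rmsle_value, rmse_log))
```

So minimising squared error on `y` is, to a very good approximation, minimising the competition metric.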

In [None]:
from sklearn.metrics import make_scorer

def rmsle_exp(y_true_log, y_pred_log):
    y_true = np.exp(y_true_log)
    y_pred = np.exp(y_pred_log)
    return np.sqrt(np.mean(np.power(np.log(y_true + 1) - np.log(y_pred + 1), 2)))

def score_model(model, pipe):
    train_error = rmsle_exp(y_train, model.predict(pipe.transform(X_train)))
    test_error = rmsle_exp(y_test, model.predict(pipe.transform(X_test)))
    return train_error, test_error

We now have everything we need for making some predictions. Let's fit a basic linear model. I expect this to underfit the data.

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression(fit_intercept=True)
lr.fit(pipe.transform(X_train), y_train)

print("Train error: {:.4f}, Test error: {:.4f}".format(*score_model(lr, pipe)))

So it looks like the benchmark level for the competition was set with linear regression - it's close to our test error. 

I'll use a few more "out-of-the-box" methods here, and we can proceed from there. Specifically I'll try SVR, random forests and everyone's favourite, XGBoost.

SVR allows for nice non-linearities if you use the Gaussian kernel. The downside is it takes a long time to fit the data. You also should run cross-validation on the parameters C (the regularization parameter) and what sklearn calls 'gamma', which is inversely related to the width of the kernel (gamma = 1/(2σ²) for a Gaussian with standard deviation σ). Here, I'll just use the defaults.
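
For reference, sklearn's RBF kernel is exp(-gamma * ||x - z||²), so a Gaussian of width σ corresponds to gamma = 1/(2σ²). The sketch below evaluates that formula with plain numpy. The helper `rbf_kernel_value` and the chosen σ are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    # Gaussian (RBF) kernel in sklearn's parameterisation: exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

sigma = 2.0                        # hypothetical kernel width
gamma = 1.0 / (2.0 * sigma ** 2)   # the equivalent gamma

x = np.array([0.0, 0.0])
z = np.array([sigma, 0.0])         # a point exactly one sigma away from x

k = rbf_kernel_value(x, z, gamma)  # equals exp(-0.5) by construction
print(k)
```

A larger gamma therefore means a narrower kernel, so each training point influences a smaller neighbourhood.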

**Note: Due to high runtime, I have disabled this run of SVR. Suffice to say, it performs worse than the forests but better than LR, with train/test errors of around 0.48.**

In [None]:
#from sklearn.svm import SVR

#svr = SVR()
#svr.fit(pipe.transform(X_train), y_train)

#print("Train error: {:.4f}, Test error: {:.4f}".format(*score_model(svr, pipe)))
print("Train error: ~0.48, Test error: ~0.48")

The next two will be tree models. I'll use the same n_estimators with both. Random forests will overfit unless you set min_samples_leaf to a reasonable value, so I picked 50 relatively arbitrarily. Again, cross-validation can help find better-than-default settings for this and many other parameters.
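
As a rough illustration of that cross-validation step, the sketch below compares a few `min_samples_leaf` values with `cross_val_score`. It runs on a small synthetic regression problem (an assumption standing in for the housing data), and the candidate values and tiny forest size are chosen only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (an assumption, not the competition data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy[:, 0] + 0.5 * X_toy[:, 1] ** 2 + rng.normal(scale=0.3, size=300)

candidates = [1, 5, 20, 50]
cv_rmse = {}
for leaf in candidates:
    model = RandomForestRegressor(n_estimators=30, min_samples_leaf=leaf, random_state=0)
    # sklearn maximises scores, so the MSE comes back negated
    scores = cross_val_score(model, X_toy, y_toy, cv=3, scoring='neg_mean_squared_error')
    # RMSE computed from the fold-averaged MSE
    cv_rmse[leaf] = np.sqrt(-scores.mean())

best_leaf = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, best_leaf)
```

On the real data the same loop would use `pipe.transform(X_train)` and `y_train`. The best value depends heavily on dataset size, so nothing from this toy run carries over.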

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=100, min_samples_leaf=50, n_jobs=-1)
rfr.fit(pipe.transform(X_train), y_train)

print("Train error: {:.4f}, Test error: {:.4f}".format(*score_model(rfr, pipe)))

In [None]:
from xgboost import XGBRegressor

xgb = XGBRegressor()
xgb.fit(pipe.transform(X_train), y_train)

print("Train error: {:.4f}, Test error: {:.4f}".format(*score_model(xgb, pipe)))

So in just base implementations, XGBoost is (very slightly) the winner. Shocking, I know.

Of course, there are many steps you can take to improve any of these models. For example:

1. Engineer better features from the data.
2. Use some of the features we threw out at the start.
3. Cross-validate on the many parameters the more complicated models have.
4. Try a model we haven't used yet (a deep network, polynomial features in regression).

But this notebook should serve as a baseline workflow!

And finally, basic code for submission below:

In [None]:
# Refit the model on everything, including our held-out test set.
pipe.fit(X)
xgb.fit(pipe.transform(X), y)

In [None]:
# Apply the same steps to process the test data
test_data = pd.merge(test, macro, how='left', on='timestamp')
test_data['timestamp'] = pd.to_numeric(pd.to_datetime(test_data['timestamp'])) / 1e18
test_data = pd.get_dummies(test_data).astype(np.float64)

# Make sure it's in the same format as the training data:
# keep the training columns in order; columns missing from the test set become NaN
df_test = test_data.reindex(columns=df.columns)

# Make the predictions
predictions = np.exp(xgb.predict(pipe.transform(df_test)))

# And put this in a dataframe
predictions_df = pd.DataFrame()
predictions_df['id'] = test['id']
predictions_df['price_doc'] = predictions
predictions_df.head()

In [None]:
# Now, output it to CSV
predictions_df.to_csv('predictions.csv', index=False)