Hey, I've been predicting house prices for a while now and thought I'd share some of my major lessons.

In [None]:
import pandas as pd
import numpy
from sklearn import linear_model

# init
train_2d = pd.read_csv("../input/train.csv")
pred_2d = pd.read_csv("../input/test.csv")
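x_train_2d = train_2d.loc[:, "MSSubClass":"SaleCondition"]
y_train_1d = train_2d["SalePrice"]
x_pred_2d = pred_2d.loc[:, "MSSubClass":"SaleCondition"]
# stack train and test features so every transform is applied to both
x_2d = pd.concat((x_train_2d, x_pred_2d))
# fresh 0..n-1 row labels; drop=True avoids keeping the old labels as an extra "index" column
x_2d = x_2d.reset_index(drop=True)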

* logging/normalizing gave a major boost to linear regression (0.135 -> ~0.120). This is a subtle thing though: make sure you only log the right features, because logging some features really hurts the score. Also note that XGB sometimes does not like logged features.

In [None]:
# example
cur_x_train_2d = x_train_2d.copy()
cur_x_train_2d['GrLivArea'] = numpy.log1p(cur_x_train_2d['GrLivArea'])
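
A rough sketch of one way to pick which features to log: measure the skew of each numeric column and only log the heavily right-skewed ones. The toy data and the 0.75 cutoff are assumptions for illustration, not the settings behind the scores above.

```python
import numpy
import pandas as pd

# toy data standing in for the real training frame (values are made up)
toy_2d = pd.DataFrame({
    "GrLivArea": [800, 900, 1000, 1100, 1200, 5600],  # long right tail
    "OverallQual": [4, 5, 6, 7, 8, 9],                # symmetric
})

skew_cutoff = 0.75  # assumed threshold, tune it against your CV score
skew_1d = toy_2d.skew()
skewed_cols = skew_1d[skew_1d > skew_cutoff].index

# log only the skewed columns, leave the rest untouched
logged_2d = toy_2d.copy()
logged_2d[skewed_cols] = numpy.log1p(logged_2d[skewed_cols])
```

Whatever the skew says, it is still worth checking each candidate against CV, since some features hurt the score once logged.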

* engineer new features: make many and trust your model to pick the best. This improved my score drastically (at the time it went 0.119 -> 0.116 for my lasso regression model).

In [None]:
# example: re-express the sale year as "years before 2010"
x_2d["YrSold"] = 2010 - x_2d["YrSold"]
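
A few more candidate features, as a sketch of the "make many" idea. The input column names come from the competition data, but the toy values and the new feature names (`AgeAtSale`, `TotalSF`, `Has2ndFlr`) are hypothetical, picked only for illustration.

```python
import pandas as pd

# toy rows using column names from the competition data (values are made up)
toy_2d = pd.DataFrame({
    "YrSold": [2008, 2010],
    "YearBuilt": [1998, 2005],
    "TotalBsmtSF": [800, 0],
    "1stFlrSF": [900, 1200],
    "2ndFlrSF": [0, 600],
})

# age of the house when it was sold
toy_2d["AgeAtSale"] = toy_2d["YrSold"] - toy_2d["YearBuilt"]
# total indoor area across basement and both floors
toy_2d["TotalSF"] = toy_2d["TotalBsmtSF"] + toy_2d["1stFlrSF"] + toy_2d["2ndFlrSF"]
# simple 0/1 flag: does the house have a second floor?
toy_2d["Has2ndFlr"] = (toy_2d["2ndFlrSF"] > 0).astype(int)
```

A regularized model like lasso can then zero out the candidates that don't help.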

* clean data according to what makes sense. Worry less about the CV score if the change makes sense, but still check that it doesn't hurt results too much. For me, if a change made sense and didn't hurt the CV score by more than 0.0003, I let it stay. Common fix types: fill NaNs according to what makes sense (mostly just some string), convert ordered classes to numeric and vice versa, remove the index feature, create dummy variables, etc.

In [None]:
# example
x_2d['LotShape'] = x_2d['LotShape'].replace({'Reg': 4, 'IR1': 3, 'IR2': 2, 'IR3': 1})
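
The other fix types can be sketched in the same way. This is a minimal illustration on made-up rows, assuming that a missing `PoolQC` means "no pool" and that `MSSubClass` is a category code rather than a quantity.

```python
import numpy
import pandas as pd

# toy rows using column names from the competition data (values are made up)
toy_2d = pd.DataFrame({
    "PoolQC": ["Ex", numpy.nan, numpy.nan],
    "MSSubClass": [20, 60, 20],
    "LotArea": [8450, 9600, 11250],
})

# a NaN here is assumed to mean "no pool", so fill it with a plain string label
toy_2d["PoolQC"] = toy_2d["PoolQC"].fillna("None")
# numeric code that is really a class: convert it to a string so it gets dummies
toy_2d["MSSubClass"] = toy_2d["MSSubClass"].astype(str)
# one 0/1 column per class for every string column; numeric columns pass through
clean_2d = pd.get_dummies(toy_2d)
```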

* remove outliers. Another major boost came from proper outlier removal (I removed about 20 observations overall).

In [None]:
# example
x_2d, y_train_1d = x_2d.drop([1298]), y_train_1d.drop([1298])
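
Instead of hard-coding row labels, outliers can also be selected with a rule. The sketch below uses made-up rows and an assumed rule (very large living area sold at a low price); the thresholds are hypothetical and are not the criteria behind the roughly 20 rows mentioned above.

```python
import pandas as pd

# toy training features and target with matching row labels (values are made up)
toy_x_2d = pd.DataFrame({"GrLivArea": [1500, 2000, 5600, 4500]})
toy_y_1d = pd.Series([180000, 250000, 160000, 750000], name="SalePrice")

# assumed rule: a huge house that sold cheaply is treated as an outlier
outlier_mask = (toy_x_2d["GrLivArea"] > 4000) & (toy_y_1d < 300000)
outlier_idx = toy_x_2d.index[outlier_mask]

# drop the same labels from features and target so they stay aligned
toy_x_2d = toy_x_2d.drop(outlier_idx)
toy_y_1d = toy_y_1d.drop(outlier_idx)
```

Only training rows should ever be dropped; every test row still needs a prediction.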

* ensemble. My ensemble was just an average of xgb (score 0.11366) and lasso regression (score 0.11594). I actually meant this average to be temporary and had several plans to improve it a lot, but decided to move on to the Zillow competition instead.

In [None]:
# add some filler values for the sake of the example
linear_y_pred_1d = numpy.ones(100)
xgb_y_pred_1d = numpy.ones(100)
# note that sometimes 0.6/0.4 or 0.7/0.3 works better; it depends on the relative strength of the models
avg_pred_1d = linear_y_pred_1d * 0.5 + xgb_y_pred_1d * 0.5
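
One way to choose the weights rather than guess them is to try a small grid on hold-out predictions and keep the blend with the lowest RMSE. This is a sketch only: the hold-out targets and predictions below are made up, and `rmse` is a helper defined here, not a library function.

```python
import numpy

def rmse(y_true_1d, y_pred_1d):
    """Root mean squared error between two 1-d arrays."""
    return float(numpy.sqrt(numpy.mean((y_true_1d - y_pred_1d) ** 2)))

# made-up hold-out targets and model predictions (log-price scale)
y_val_1d = numpy.array([12.0, 11.5, 12.3, 11.9])
linear_val_1d = numpy.array([12.1, 11.4, 12.2, 12.0])
xgb_val_1d = numpy.array([11.8, 11.7, 12.5, 11.7])

# try linear weights 0.0, 0.1, ..., 1.0; xgb gets the remainder
weights = numpy.linspace(0.0, 1.0, 11)
scores = [rmse(y_val_1d, w * linear_val_1d + (1 - w) * xgb_val_1d) for w in weights]
best_w = float(weights[int(numpy.argmin(scores))])
```

The weights should be picked on predictions for rows the models were not trained on, otherwise the blend just rewards the model that overfits more.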

* make sure to optimize parameters once at the start (so that your CV score is somewhat relevant) and then once again at the end, after all your data tweaks (this moved me from 10th place to 3rd). In my opinion, don't tune too much in the middle of the competition; spend that time making progress with the data instead.

In [None]:
# my eventual lasso parameters
# max_iter is written as an int because newer scikit-learn versions reject a float like 1e6
regr = linear_model.Lasso(max_iter=1000000, alpha=5e-4)
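
A minimal sketch of what such a parameter search could look like for lasso, using scikit-learn's `GridSearchCV`. The synthetic data, the alpha grid and the 5-fold setting are all assumptions for illustration, not the search that produced the alpha above.

```python
import numpy
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# synthetic regression data standing in for the prepared features (made up)
rng = numpy.random.RandomState(0)
toy_x_2d = rng.normal(size=(80, 5))
true_coef_1d = numpy.array([1.0, 0.5, 0.0, 0.0, -0.3])
toy_y_1d = toy_x_2d @ true_coef_1d + rng.normal(scale=0.1, size=80)

# assumed grid of regularization strengths to try
alpha_grid = {"alpha": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2]}
search = GridSearchCV(
    linear_model.Lasso(max_iter=100000),
    alpha_grid,
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    cv=5,
)
search.fit(toy_x_2d, toy_y_1d)
best_alpha = search.best_params_["alpha"]
```

Running this once at the start and once after the final data tweaks matches the schedule described above.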