
Question about usage... #41

Closed
webzest opened this issue Aug 28, 2020 · 7 comments
Comments
webzest commented Aug 28, 2020

I am trying to predict housing prices, where I have a train data set and a test data set. The train data has a label, and I need to train on it so I can later use the trained model to predict the label for the test data, which does not have one. Also, I followed your process on my train data set, performed the stacking, and applied the second level to the S_train and S_test variables as indicated in your instructions.
Now that I have done that, how do I proceed to predict the label on the test (unknown) dataset?


vecxoz commented Aug 29, 2020

The label you predicted for S_test is your final prediction. We can view S_test as a representation of X_test.


webzest commented Aug 29, 2020

Great. Allow me to explain a bit further so we can be on the same page.
The train and test data are from here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
The train data shape is (1460, 81), where column 81 is the label (SalePrice).
The test data shape is (1460, 80), where SalePrice is missing (not provided) and is the label that I need to predict.

In your example, X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0), y represents the label of X, which in my case would be the label of train, i.e. SalePrice. So technically, you seem to be using only one dataset. In my case, I can pop SalePrice out of train and make it y, and use the remaining data as X, which is what I did when I tested. So I am essentially still using a piece of train when I get to S_test, as reflected in your example here:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
S_train, S_test = stacking(models,                    # list of models
                           X_train, y_train, X_test,  # data
                           regression=True, ...)

So, technically, by using S_test I have not yet reached the point where I introduce my unknown dataset, i.e. the test set that I need to predict a label for.
How would you recommend that I conduct my model training on the train data that comes with a label, and then use the trained model on my unknown test dataset?


vecxoz commented Aug 29, 2020

In my example the following lines are used just to create artificial data for demonstration purposes:

X, y = boston.data, boston.target 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

When we call the stacking function we don’t use y_test, because in a real task y_test is unknown:

S_train, S_test = stacking(models, X_train, y_train, X_test, ...)

So in your task, do not use train_test_split. When you call the stacking function, X_train should be the whole available training set of shape (1460, 80), and y_train should be the whole corresponding training label of shape (1460,).
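Concretely, the call amounts to the following out-of-fold scheme. This is a minimal sketch implemented directly with scikit-learn on random placeholder data (to my understanding it mirrors vecstack's default bagged out-of-fold behavior, but in practice just call the stacking function itself):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1460, 80))  # whole training set
y_train = rng.normal(size=1460)        # whole training label
X_test = rng.normal(size=(1460, 80))   # unlabeled test set

models = [Ridge(), DecisionTreeRegressor(max_depth=3, random_state=0)]
n_folds = 4
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

# one column per first-level model
S_train = np.zeros((X_train.shape[0], len(models)))
S_test = np.zeros((X_test.shape[0], len(models)))

for j, model in enumerate(models):
    fold_test_preds = np.zeros((X_test.shape[0], n_folds))
    for k, (tr_idx, val_idx) in enumerate(kf.split(X_train)):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        # out-of-fold predictions fill the training meta-features
        S_train[val_idx, j] = model.predict(X_train[val_idx])
        # each fold's model also predicts the full test set
        fold_test_preds[:, k] = model.predict(X_test)
    # test meta-features = average over the fold models
    S_test[:, j] = fold_test_preds.mean(axis=1)

print(S_train.shape, S_test.shape)  # (1460, 2) (1460, 2)
```

The second-level model is then fit on S_train / y_train, and its prediction on S_test is the final prediction for the unknown test set.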


webzest commented Aug 29, 2020

In my attempt to follow the recommendations, I tried this approach, which provided a so-so result and also scored on Kaggle:

Approach #1 (score on Kaggle: 0.12146):

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from vecstack import stacking

models = [
    LassoCV(random_state=1, n_jobs=-1),
    RidgeCV(),
    ElasticNetCV(l1_ratio=(.1, .5, .7, .9, .99, 1), random_state=1, n_jobs=-1),
    KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5),
    GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4,
                              max_features='sqrt', min_samples_leaf=15,
                              min_samples_split=10, loss='huber', random_state=1),
    XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05,
                 max_depth=3, min_child_weight=1.7817, n_estimators=2200,
                 reg_alpha=0.4640, reg_lambda=0.8571, subsample=0.5213,
                 random_state=1, n_jobs=-1),
    LGBMRegressor(objective='regression', num_leaves=5, learning_rate=0.05,
                  n_estimators=720, max_bin=55, feature_fraction_seed=9,
                  bagging_seed=9, random_state=1, n_jobs=-1)
]

S_train, S_test = stacking(models, train, y_train, test,
                           regression=True, n_folds=4, shuffle=True,
                           random_state=42, verbose=0)

# Initialize 2nd-level estimator
model = XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05,
                     max_depth=3, min_child_weight=1.7817, n_estimators=2200,
                     reg_alpha=0.4640, reg_lambda=0.8571, subsample=0.5213,
                     random_state=42, n_jobs=-1)

# Fit & predict (expm1 reverses the log1p transform applied to SalePrice)
model = model.fit(S_train, y_train)
y_pred = np.expm1(model.predict(S_test))
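For completeness, here is a sketch of how a y_pred like the one above could be turned into a submission file for this competition. The Id and SalePrice column names match the competition's sample submission; the id and prediction values below are placeholder stand-ins (in the real pipeline they come from test.csv and the second-level model):

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for model.predict output and test.csv's Id column
y_pred = np.array([120000.0, 158000.0, 210500.0])
test_ids = np.array([1461, 1462, 1463])

submission = pd.DataFrame({"Id": test_ids, "SalePrice": y_pred})
submission.to_csv("submission.csv", index=False)
```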

How would you recommend improving the model or the process to boost the accuracy and generate a better score on Kaggle?


vecxoz commented Aug 30, 2020

Your code and score look OK.

The best place to look for model improvement is the Kaggle Notebooks related to the specific competition. It’s standard practice on Kaggle to study high-scoring notebooks and try to incorporate ideas from them into your model. If you are particularly interested in stacking, you can also search the Notebooks by the keyword "stack". Also make sure to look at the famous Kaggle Ensembling Guide.

There are some general recommendations on how to improve a stacking model, but experiment is the only real answer:

1). It’s very important to remember that stacking is about the quality of models, not the quantity. And in this case quality means not only a good score but also low correlation between predictions. A stack of 3 good uncorrelated models can beat a stack of 30 highly correlated models.

2). Sometimes stacking is overkill. Always try simple averaging of the predictions of your individual models first. In some cases this approach may outperform stacking.
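To act on point 1), you can measure how correlated your first-level models actually are from their prediction columns, and point 2)'s averaging baseline is one line. A minimal sketch with synthetic stand-in predictions (preds plays the role of the columns of S_test):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical first-level test-set predictions from three models
p1 = rng.normal(size=200)
p2 = p1 + rng.normal(scale=0.1, size=200)  # nearly a copy of p1: highly correlated
p3 = rng.normal(size=200)                  # independent model: low correlation

preds = np.column_stack([p1, p2, p3])
corr = np.corrcoef(preds, rowvar=False)    # pairwise correlation of predictions
print(np.round(corr, 2))

# the simple-averaging baseline to try before stacking
avg = preds.mean(axis=1)
```

In the correlation matrix, p2 contributes almost nothing beyond p1, so dropping it (or replacing it with a less correlated model) is usually more useful than keeping both.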


webzest commented Aug 30, 2020

Thank you for your excellent recommendations and support. I will definitely spend some time in those areas.


vecxoz commented Aug 31, 2020

Good luck in Kaggle competitions!

vecxoz closed this as completed Aug 31, 2020