
Question about usage... #41

Closed
webzest opened this issue Aug 28, 2020 · 7 comments
Comments
webzest commented Aug 28, 2020

I am trying to predict housing prices, where I have a train data set and a test data set. The train data has a label, and I need to train on it so I can later use the trained model to predict the label for the test data, which does not have one. Also, I followed your process on my train data set, performed the stacking, and applied the second level to the S_train and S_test variables as indicated in your instructions.
Now that I have done that, how do I proceed to predict the label on the test (unknown) dataset?


vecxoz commented Aug 29, 2020

The label you predicted for S_test is your final prediction. We can view S_test as a representation of X_test.


webzest commented Aug 29, 2020

Great. Allow me to explain a bit further so we can be on the same page.
The train and test data are from here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
The train data shape is (1460, 81), where column 81 is the label (SalePrice).
The test data shape is (1460, 80), where SalePrice is missing (not provided) and is the label that I need to predict.

In your example, X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0), y represents the label of X, which in my case would be the label of train, i.e. SalePrice. So technically, you seem to be using only one dataset. In my case, I can pop SalePrice out of train and make it y, and use the remaining data as X, which is what I did when I tested. So I am essentially still using a piece of train when I get to S_test, as reflected in your example here:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
S_train, S_test = stacking(models,                    # list of models
                           X_train, y_train, X_test,  # data
                           regression=True, ...)

So, technically, by using S_test I have not yet reached the point where I introduce my unknown dataset, i.e. the test set that I need to predict a label for.
How would you recommend that I conduct my model training on the train data that comes with a label, and then use the trained model on my unknown test dataset?


vecxoz commented Aug 29, 2020

In my example the following lines are used just to create artificial data for demonstration purposes:

X, y = boston.data, boston.target 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

When we call the stacking function we don’t use y_test, because in a real task y_test is unknown:

S_train, S_test = stacking(models, X_train, y_train, X_test, ...)

So in your task, do not use train_test_split. When you call the stacking function, X_train should be the whole available training set of shape (1460, 80), and y_train should be the whole corresponding training label of shape (1460,).
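Concretely, the call amounts to the following out-of-fold scheme. This is a minimal sketch implemented directly with scikit-learn on random placeholder data (to my understanding it mirrors vecstack's default bagged out-of-fold behavior, but in practice just call the stacking function itself):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1460, 80))  # whole training set
y_train = rng.normal(size=1460)        # whole training label
X_test = rng.normal(size=(1460, 80))   # unlabeled test set

models = [Ridge(), DecisionTreeRegressor(max_depth=3, random_state=0)]
n_folds = 4
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

# one column per first-level model
S_train = np.zeros((X_train.shape[0], len(models)))
S_test = np.zeros((X_test.shape[0], len(models)))

for j, model in enumerate(models):
    fold_test_preds = np.zeros((X_test.shape[0], n_folds))
    for k, (tr_idx, val_idx) in enumerate(kf.split(X_train)):
        model.fit(X_train[tr_idx], y_train[tr_idx])
        # out-of-fold predictions fill the training meta-features
        S_train[val_idx, j] = model.predict(X_train[val_idx])
        # each fold's model also predicts the full test set
        fold_test_preds[:, k] = model.predict(X_test)
    # test meta-features = average over the fold models
    S_test[:, j] = fold_test_preds.mean(axis=1)

print(S_train.shape, S_test.shape)  # (1460, 2) (1460, 2)
```

The second-level model is then fit on S_train / y_train, and its prediction on S_test is the final prediction for the unknown test set.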


webzest commented Aug 29, 2020

In my attempt to follow the recommendations, I tried this approach, which provided a so-so result and also scored on Kaggle:

Approach #1 (score on Kaggle: 0.12146):

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from vecstack import stacking

models = [
    LassoCV(random_state=1, n_jobs=-1),
    RidgeCV(),
    ElasticNetCV(l1_ratio=(.1, .5, .7, .9, .99, 1), random_state=1, n_jobs=-1),
    KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5),
    GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=4,
                              max_features='sqrt', min_samples_leaf=15,
                              min_samples_split=10, loss='huber', random_state=1),
    XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05,
                 max_depth=3, min_child_weight=1.7817, n_estimators=2200,
                 reg_alpha=0.4640, reg_lambda=0.8571, subsample=0.5213,
                 random_state=1, n_jobs=-1),
    LGBMRegressor(objective='regression', num_leaves=5, learning_rate=0.05,
                  n_estimators=720, max_bin=55, feature_fraction_seed=9,
                  bagging_seed=9, random_state=1, n_jobs=-1)
]

S_train, S_test = stacking(models, train, y_train, test,
                           regression=True, n_folds=4, shuffle=True,
                           random_state=42, verbose=0)

# Initialize 2nd-level estimator
model = XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05,
                     max_depth=3, min_child_weight=1.7817, n_estimators=2200,
                     reg_alpha=0.4640, reg_lambda=0.8571, subsample=0.5213,
                     random_state=42, n_jobs=-1)

# Fit & predict (expm1 reverses the log1p transform applied to SalePrice)
model = model.fit(S_train, y_train)
y_pred = np.expm1(model.predict(S_test))
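For completeness, here is a sketch of how a y_pred like the one above could be turned into a submission file for this competition. The Id and SalePrice column names match the competition's sample submission; the id and prediction values below are placeholder stand-ins (in the real pipeline they come from test.csv and the second-level model):

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for model.predict output and test.csv's Id column
y_pred = np.array([120000.0, 158000.0, 210500.0])
test_ids = np.array([1461, 1462, 1463])

submission = pd.DataFrame({"Id": test_ids, "SalePrice": y_pred})
submission.to_csv("submission.csv", index=False)
```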

How would you recommend improving the model or the process to boost the accuracy and generate a better score on Kaggle?


vecxoz commented Aug 30, 2020

Your code and score look OK.

The best place to look for model improvement is the Kaggle Notebooks related to the specific competition. It’s standard practice on Kaggle to study high-scoring notebooks and try to incorporate ideas from them into your model. If you are particularly interested in stacking, you can also search the Notebooks by the keyword "stack". Also make sure to look at the famous Kaggle Ensembling Guide.

There are some general recommendations on how to improve a stacking model, but experiment is the only real answer:

1). It’s very important to remember that stacking is about the quality of models, not the quantity. And in this case quality means not only a good score but also low correlation between predictions. A stack of 3 good uncorrelated models can beat a stack of 30 highly correlated models.

2). Sometimes stacking is overkill. Always try simple averaging of the predictions of your individual models first. In some cases this approach may outperform stacking.
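To act on point 1), you can measure how correlated your first-level models actually are from their prediction columns, and point 2)'s averaging baseline is one line. A minimal sketch with synthetic stand-in predictions (preds plays the role of the columns of S_test):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical first-level test-set predictions from three models
p1 = rng.normal(size=200)
p2 = p1 + rng.normal(scale=0.1, size=200)  # nearly a copy of p1: highly correlated
p3 = rng.normal(size=200)                  # independent model: low correlation

preds = np.column_stack([p1, p2, p3])
corr = np.corrcoef(preds, rowvar=False)    # pairwise correlation of predictions
print(np.round(corr, 2))

# the simple-averaging baseline to try before stacking
avg = preds.mean(axis=1)
```

In the correlation matrix, p2 contributes almost nothing beyond p1, so dropping it (or replacing it with a less correlated model) is usually more useful than keeping both.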


webzest commented Aug 30, 2020

Thank you for your excellent recommendations and support. I will definitely spend some time in those areas.


vecxoz commented Aug 31, 2020

Good luck in Kaggle competitions!

vecxoz closed this as completed Aug 31, 2020