# The January 2021 Tabular Playground Series competition is almost over, and thanks go to the many people who have posted great notebooks full of great ideas. One question I often contemplate is, "How good is best?" It is said that time is money, so how much time (or money) should a person spend on a task to make it “best”? Unfortunately, this can be a difficult question to answer.
# 
# Given the above, suppose your boss asks you to predict thousands of outcomes from some data circulating around the office (think of the data in the Tabular Playground Series competition!). What do you do? Fortunately, your coworkers have been busy on the problem, and they tell you that XGBoost on the raw data should work fine. XGBoost sounds good to you, but how much extra work would it be to use Hyperopt to potentially improve the results? Is the extra effort worth it?
# 
# This notebook will not give a definitive answer to the above questions. The idea is: why not give it a go and see for yourself? Here is what I tried, starting with notebook code generously posted at:
# 
# https://www.kaggle.com/jamesmcguigan/tabular-playground-xgboost
# and
# https://www.kaggle.com/marionhesse/hyperopt-xgboost-parameter-tuning
# 
# This notebook contains the basic “apply XGBoost to data” approach; a separate notebook has the code using XGBoost with Hyperopt.
# As with the above notebooks, this notebook is released under the Apache 2.0 open source license. http://www.apache.org/licenses/LICENSE-2.0

# The code below applies XGBoost, with default hyperparameters, to the raw data. This initial submission resulted in a public score of 0.70426. How will this compare to XGBoost plus Hyperopt?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from xgboost import XGBRegressor
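import sklearn.model_selection  # import the submodule explicitly; a bare "import sklearn" does not guarantee it is loaded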

train_df = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv', index_col='id')
test_df  = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv',  index_col='id')

columns = test_df.columns
X       = train_df[columns]
Y       = train_df['target']
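# Hold out 1% of the training rows for validation (X_valid and Y_valid are not used further in this cell)
X_train, X_valid, Y_train, Y_valid = sklearn.model_selection.train_test_split(X, Y, test_size=0.01, random_state=42)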
X_test  = test_df[columns]


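# Baseline model: XGBoost with default hyperparameters
xgb = XGBRegressor(
    n_jobs=-1,        # use all available CPU cores
    verbosity=0,      # suppress XGBoost log messages
    random_state=42,  # fixed seed for reproducibility
)
xgb.fit(X_train, Y_train)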


predictions   = xgb.predict(X_test)

submission_df = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv', index_col='id')
submission_df['target'] = predictions
submission_df.to_csv('submission.csv')
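
# A local validation score makes it possible to compare this baseline with XGBoost plus Hyperopt without spending a submission. The sketch below is not part of the original notebooks: `rmse` is a hypothetical helper, and it assumes the public score is a root mean squared error (check the competition's evaluation page to confirm). The 1% hold-out is small, so treat the result as a rough estimate only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences.

    Hypothetical helper for illustration; it is not taken from the
    notebooks this one builds on.
    """
    y_true = list(y_true)
    y_pred = list(y_pred)
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / len(y_true))


# Usage with the objects defined in the cell above (left commented out so
# that this block can also run on its own):
# valid_score = rmse(Y_valid, xgb.predict(X_valid))
# print(f"Validation RMSE: {valid_score:.5f}")
```

# Running the same helper on the held-out rows after tuning gives a like-for-like number to set against the baseline.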



