# Beginner-friendly Tabular Playground tutorial (0.69973 score)



**Hello and welcome to my beginner-friendly tutorial for the Tabular Playground Series January 2021 competition!**

**This tutorial is meant for anybody who is new to Kaggle competitions. It doesn't matter if you have absolutely no experience with Kaggle competitions or if you have already gained some experience in a few of them; this tutorial should be helpful in both cases.**

**I was very happy about this new Kaggle competition series: on the first day of each month, a competition like this will be hosted :)**

**It is very helpful for beginners because the datasets are friendly and nicely structured, and in this competition we don't even have any categorical features.**


# What is going to happen in this tutorial?


**In this tutorial we will first look at the data and then simply train an XGBoost (xgb) regressor model.**


# 1.) Load data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# import needed modules

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

# this line is needed for plotting later
mpl.rcParams['agg.path.chunksize'] = 10000

In [None]:
train_data = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/train.csv')
test_data = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2021/test.csv')

print("successfully loaded!")

# 2.) Have a first look at data

**In this chapter we will simply print some interesting properties and information of our train_data and test_data sets.**

In [None]:
print(train_data.shape)
print(test_data.shape)

**The train_data has 300k rows and 16 columns, the test_data has 200k rows and 15 columns.**

In [None]:
# .info() is a helpful command to get a nice overview of a dataframe
# (it prints directly and returns None, so we do not wrap it in print())

train_data.info()
print()
test_data.info()

**Here we can see that the dataset of this competition consists of one 'id' column and 14 'cont' columns (features) of type float64. The train_data additionally has the 'target' column. Predicting this target as accurately as possible is the goal of the entire competition.**
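
**The same facts can also be checked in code instead of reading them off the .info() output. The following is only a minimal sketch: 'toy_train' and 'toy_test' are tiny made-up dataframes with the same column layout, used here so the example runs without the competition files.**

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feature_names = [f"cont{i}" for i in range(1, 15)]

# Tiny synthetic stand-ins for train_data and test_data (assumption:
# same column layout as the competition files, random values).
toy_train = pd.DataFrame(rng.random((5, 14)), columns=feature_names)
toy_train.insert(0, "id", list(range(5)))
toy_train["target"] = rng.random(5)

toy_test = pd.DataFrame(rng.random((3, 14)), columns=feature_names)
toy_test.insert(0, "id", list(range(5, 8)))

# Columns that exist in the training set but not in the test set
only_in_train = [c for c in toy_train.columns if c not in toy_test.columns]

# Total number of missing values across the whole training frame
missing_values = int(toy_train.isna().sum().sum())

# Names of all float64 columns (the 14 features plus the target)
float_columns = toy_train.select_dtypes(include="float64").columns.tolist()

print(only_in_train)
print(missing_values)
print(len(float_columns))
```

**With the real data you would run the same three checks on train_data and test_data.**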

**The next step will be to plot the train_data and test_data. We do this to get a better insight and understanding of our data.**

# 3.) Plot data

## 3.1) Plot x-axis = id, y-axis = feature


**The first kind of plots we are going to look at will have the 'id' column on the x-axis and the features 'cont1' to 'cont14' on the y-axis.** 
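
**One way to pick out the 14 'cont' columns is by their name prefix rather than by their position in the dataframe. This is just a small sketch; the column list is written out by hand so that it runs without the competition files, and 'cont_features' is a name chosen for this example.**

```python
# Column names as they appear in train_data (written out here so the
# sketch runs on its own).
columns = ["id"] + [f"cont{i}" for i in range(1, 15)] + ["target"]

# Keep only the feature columns, wherever they sit in the dataframe.
cont_features = [name for name in columns if name.startswith("cont")]

print(cont_features)
```

**Selecting by name keeps working even if the column order changes, whereas an index range such as range(1, len(columns) - 1) depends on 'id' being first and 'target' being last.**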

In [None]:
# create list containing all column names of train_data

list_of_train_features = train_data.columns.to_list()

print(list_of_train_features)

In [None]:
# plot all 14 'cont' features 

for i in range(1,len(list_of_train_features)-1):
    
    fig = plt.figure(figsize=(7,4.5))
    plt.plot(train_data["id"], train_data[list_of_train_features[i]])
    plt.title(list_of_train_features[i])
    plt.show()

**Scrolling through these 14 plots, we see immediately that all 14 features look alike when plotted against the 'id' column.**

**The range of all 14 features goes from 0.0 to 1.0, sometimes a little more, sometimes a little less.**

**But there are no interesting shapes or distributions or dependencies noticeable.**

**Let's see if this is the same for the 14 'cont' features of the test_data set:**

In [None]:
# create list containing all column names of test_data

list_of_test_features = test_data.columns.to_list()

print(list_of_test_features)

In [None]:
for i in range(1,len(list_of_test_features)):
    
    fig = plt.figure(figsize=(7,4.5))
    plt.plot(test_data["id"], test_data[list_of_test_features[i]])
    plt.title(list_of_test_features[i])
    plt.show()

**The 14 'cont' features of the test_data look extremely similar to the features of the train_data.**
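
**"Look extremely similar" can be backed up with a few numbers by comparing summary statistics of both sets. The sketch below uses made-up uniform random data ('toy_train', 'toy_test') with only two features, so it illustrates the idea and says nothing about the real competition data.**

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-ins for train_data and test_data (assumption: random
# values in [0, 1), two features only to keep the output short).
toy_train = pd.DataFrame({"cont1": rng.random(1000), "cont2": rng.random(1000)})
toy_test = pd.DataFrame({"cont1": rng.random(800), "cont2": rng.random(800)})

# One row per statistic, one column per feature
stats = ["min", "max", "mean", "std"]
train_stats = toy_train.agg(stats)
test_stats = toy_test.agg(stats)

# Absolute difference between train and test for each statistic
difference = (train_stats - test_stats).abs()
print(difference.round(3))
```

**Small differences in every row suggest that both sets cover the same range with a similar spread.**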

**Finally let's plot the target before we move on to the next kind of plots:**

In [None]:
fig = plt.figure(figsize=(7,4.5))
plt.plot(train_data["id"], train_data["target"])
plt.title("target")
plt.show()

**The target shows no interesting effects, shapes or relations; it is just numbers between roughly 6 and 10.**

**The only interesting thing is the single outlier at the bottom with a target value of about 0.0. Let's remove it.**

In [None]:
# find the outlier by looking for all values in the
# train_data set that have a target value smaller than 1.0

outlier = train_data.loc[train_data.target < 1.0]
print(outlier)

In [None]:
# remove the outlier from the train_data set, using the row label(s)
# found in the previous cell (row 170514 in the original run)
train_data.drop(outlier.index, inplace = True)
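
**The removal step can also be written as a boolean filter that keeps every row meeting a condition, so no specific row label is needed. A minimal sketch on a made-up four-row dataframe ('toy'), using the same threshold of 1.0 as above:**

```python
import pandas as pd

# Made-up data: one row has a target value far below the others.
toy = pd.DataFrame(
    {"id": [0, 1, 2, 3], "target": [7.9, 0.0, 8.3, 6.5]},
    index=[10, 11, 12, 13],
)

# True for every row we want to keep
keep = toy["target"] >= 1.0
cleaned = toy.loc[keep]

print(f"removed {len(toy) - len(cleaned)} row(s)")
```

**Printing the number of removed rows is a cheap way to confirm that only the expected outlier was dropped.**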

**The next kind of plots will have a 'cont' feature on the x-axis and the 'target' column on the y-axis.**

## 3.2) Plot x-axis = feature, y-axis = target 

**In these plots we will be able to see the relation between the 14 features and the 'target' column.**

In [None]:
for i in range(1,len(list_of_train_features)-1):
    
    fig = plt.figure(figsize=(7,4.5))
    plt.plot(train_data[list_of_train_features[i]], train_data["target"], linestyle = '', marker = 'x')
    plt.title(list_of_train_features[i])
    plt.show()

**We can see that most features simply look like a cloud: a broad distribution with no clear relation or dependency between the feature and the target.**

**The stripes in 'cont2' are interesting; this is the only feature that does not have this cloud-like distribution.**
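
**A visual impression like "no clear relation" can be complemented with a correlation coefficient. The sketch below is not part of the original analysis: it uses made-up data in which the hypothetical feature 'cont_a' drives the target and 'cont_b' does not. Note that the Pearson correlation only measures linear relations.**

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000

# Made-up features in [0, 1)
toy = pd.DataFrame({"cont_a": rng.random(n), "cont_b": rng.random(n)})

# Hypothetical target: depends on cont_a plus a little noise, not on cont_b
toy["target"] = 8.0 + 2.0 * toy["cont_a"] + rng.normal(0.0, 0.1, n)

# Pearson correlation of every feature with the target
correlations = toy.corr(method="pearson")["target"].drop("target")
print(correlations.sort_values(ascending=False))
```

**A value near 0 corresponds to the cloud-like plots, a value near 1 or -1 to a clearly visible line.**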

# 4.) Optimize xgb model



**Before we train our xgb regressor model, we need to find good parameter values for our dataset.**

**Models like XGBoost, LightGBM or CatBoost have many parameters that can be tuned to make the predictions more accurate.**

**Feel free to look at the list of xgb parameters, we will only use and optimize a few of them:** https://xgboost.readthedocs.io/en/latest/parameter.html

## xgb GridSearchCV

**GridSearchCV is a helpful tool for parameter optimization.**

**It works by trying out every possible combination of a given set of parameter values, scoring each one with cross-validation, and then picking the best combination.**

**These GridSearchCV calculations can take very long depending on the size of the dataset and the number of parameter combinations given.**

**It is recommended to keep the number of parameter combinations low in order to keep the total computation time as short as possible.**
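
**To see why the computation time grows so quickly, it helps to count the model fits. The sketch below enumerates the grid used in this tutorial with plain Python. It assumes 5 cross-validation folds, which is the GridSearchCV default in recent scikit-learn versions when 'cv' is not passed.**

```python
from itertools import product

# The same grid that is searched in this tutorial
params = {
    "n_estimators": [1500, 2000, 2500],
    "learning_rate": [0.01, 0.02],
}

# Assumption: default number of folds when cv is not passed
cv_folds = 5

# Every combination of one value per parameter
names = list(params)
combinations = [dict(zip(names, values)) for values in product(*params.values())]

# Each combination is trained once per fold (the final refit on the
# full training data comes on top of this)
total_fits = len(combinations) * cv_folds

for combo in combinations:
    print(combo)
print(total_fits)
```

**Adding a third parameter with four candidate values would multiply the number of fits by four, which is why small grids are recommended.**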

**The code in the following cells runs a GridSearchCV to find the best combination of the given 'n_estimators' and 'learning_rate' values.**

In [None]:
# create y_train which only contains the target
y_train = train_data["target"]

# remove target column from train_data set
train_data.drop(columns = ["target"], inplace = True)

In [None]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor


# parameter combinations
params = {
    'n_estimators' : [1500, 2000, 2500],
    'learning_rate' : [0.01, 0.02],
}


# xgb regressor model
# ('learning_rate' is overridden by each grid value during the search;
#  'gpu_hist' requires a GPU accelerator to be enabled for the notebook)
xgb = XGBRegressor(
        objective = 'reg:squarederror',
        subsample = 0.8,
        colsample_bytree = 0.8,
        learning_rate = 0.01,
        tree_method = 'gpu_hist')

# further parameters, currently disabled:
#   colsample_bynode = 0.85
#   colsample_bylevel = 0.2
#   gamma = 0.01
#   reg_lambda = 0.01
#   learning_rate = 0.02
#   max_depth = 3
#   min_child_weight = 3
#   n_estimators = 6100


grid_search = GridSearchCV(xgb, 
                           param_grid = params, 
                           scoring = 'neg_root_mean_squared_error', 
                           n_jobs = -1, 
                           verbose = 10)

grid_search.fit(train_data, y_train)


print('\n Best estimator:')
print(grid_search.best_estimator_)

print('\n Best score:')
print(grid_search.best_score_)

print('\n Best hyperparameters:')
print(grid_search.best_params_)

**As we can see, n_estimators = 2000 and learning_rate = 0.02 gave the best score among the combinations we tried.**

**You can optimize every parameter of xgboost and other models with this GridSearchCV method. Always keep an eye on the GridSearch score while you do so: with the 'neg_root_mean_squared_error' scoring used here, a better parameter set gives a higher score, meaning a value closer to zero.**
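
**The grid search above uses scoring = 'neg_root_mean_squared_error'. As a small worked example of what this number means, the sketch below computes the root mean squared error by hand for three made-up target values and predictions, and then flips the sign.**

```python
import numpy as np

# Made-up true targets and predictions
y_true = np.array([7.0, 8.0, 9.0])
y_pred = np.array([7.5, 8.0, 8.5])

# Root mean squared error: square the errors, average them, take the root
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# scikit-learn reports the negated value so that "higher is better"
neg_rmse = -rmse

print(rmse, neg_rmse)
```

**Because of the sign flip, best_score_ is negative, and a score of -0.70 is better than a score of -0.71.**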

# 5.) Train model

## xgb regressor model

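**We will now train an xgb regressor model. The learning_rate = 0.02 comes from the GridSearchCV above; note that the cell below sets n_estimators = 2500 rather than the 2000 reported above, and that max_depth = 7 was not part of that search.**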

**Then we will use the test_data to make a prediction, and then save this prediction as the submission file.**

In [None]:
clf = XGBRegressor(
    objective = 'reg:squarederror',
    subsample = 0.8,
    learning_rate = 0.02,
    max_depth = 7,
    n_estimators = 2500,
    tree_method = 'gpu_hist')


clf.fit(train_data, y_train)

y_pred_xgb = clf.predict(test_data)

print(y_pred_xgb)

**Now we can save the prediction:**

In [None]:
solution = pd.DataFrame({"id":test_data.id, "target":y_pred_xgb})

solution.to_csv("solution.csv", index = False)

print("saved successfully!")
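
**Before submitting, it can be worth reading the file back and checking its shape and contents. This is only a sketch: the ids and predictions are made up, and the file is written to a temporary folder so that nothing in the working directory is touched.**

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-ins for test_data.id and y_pred_xgb
toy_test_ids = pd.Series([300000, 300001, 300002], name="id")
toy_predictions = np.array([7.9, 8.1, 8.0])

solution = pd.DataFrame({"id": toy_test_ids, "target": toy_predictions})

# Write the file and read it back, exactly as a grader would see it
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "solution.csv")
    solution.to_csv(path, index=False)
    loaded = pd.read_csv(path)

checks = {
    "columns": list(loaded.columns) == ["id", "target"],
    "row_count": len(loaded) == len(toy_test_ids),
    "no_missing": not loaded["target"].isna().any(),
    "unique_ids": bool(loaded["id"].is_unique),
}
print(checks)
```

**With the real data, the row count should match the number of rows in test_data.**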

**We then have to click on "Save Version" and choose "Save & Run All (Commit)" so that the submission file is saved in the output tab of our notebook.**

**Then we can click on "Output" of this notebook and submit the submission file to the competition to see how well it scored.**

# Thank you for reading this Tabular Playground tutorial!

# If you have any questions or ideas, feel free to ask and comment :)