# Beginner-friendly February Tabular Tutorial! (0.84358)

**Hello and welcome to my beginner-friendly tutorial for the Tabular Playground series February 2021 competition!**

**This tutorial is meant for anybody who is new to kaggle competitions. It doesn't matter whether you have no experience with kaggle competitions at all or have already taken part in a few; this tutorial should be helpful in both cases.**

**I was very happy about this new kaggle competition series: on the first day of each month, a competition like this will be hosted :)**

**It is very helpful for beginners, because the datasets are very friendly and nicely structured.**

**link to competition:** https://www.kaggle.com/c/tabular-playground-series-feb-2021


# What is going to happen in this tutorial?

**In this tutorial we will first look at the data and analyze the numerical and categorical features separately.**

**Afterwards we are going to train a CatBoost model and finally make a prediction.**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1.) Load data

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
train_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/train.csv')
test_data = pd.read_csv('/kaggle/input/tabular-playground-series-feb-2021/test.csv')

print(train_data.head(), "\n")
print(test_data.head())

# 2.) Have a first look at data

**In this section we will simply print out some interesting properties and characteristics of our data.**

In [None]:
print(train_data.shape, "\n")
print(test_data.shape)

**The data sets of this competition have the same number of rows as the previous January competition.**

In [None]:
print(train_data.info())

**We have 24 feature columns and 1 target column.**

**The 24 feature columns consist of 10 categorical features and 14 numeric features.**
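**The split into categorical and numeric features can also be derived from the column dtypes instead of counting by hand. The sketch below uses a small made-up frame (`toy`) as a stand-in for `train_data`; the values in it are illustrative assumptions, not competition data.**

```python
import pandas as pd

# Toy stand-in for train_data (made-up values, only 2 + 2 feature columns).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "cat0": ["A", "B", "A"],
    "cat1": ["B", "B", "A"],
    "cont0": [0.1, 0.5, 0.9],
    "cont1": [0.3, 0.2, 0.7],
    "target": [7.2, 8.1, 6.9],
})

# id and target are not features, so leave them out before splitting.
features = toy.drop(columns=["id", "target"])

# Numeric columns are the continuous features; everything else is categorical.
num_cols = features.select_dtypes(include="number").columns.tolist()
cat_cols = features.select_dtypes(exclude="number").columns.tolist()

print(len(cat_cols), "categorical:", cat_cols)
print(len(num_cols), "numeric:", num_cols)
```

**Running the same lines on `train_data` should reproduce the 10 / 14 split described above.**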

# 3.) Plot data

## 3.1) Plot x-axis = id, y-axis = num. feature

In [None]:
# import needed modules
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

In [None]:
print(train_data.columns)

In [None]:
cat_features = ['cat0', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8', 'cat9']
numerical_features = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5','cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']

In [None]:
for i in numerical_features:
    
    fig, ax = plt.subplots(1,2, figsize=(18,6))

    ax[0].plot(train_data["id"], train_data[i])
    ax[1].plot(test_data["id"], test_data[i])
    
    ax[0].set(xlabel="id", ylabel=i)
    ax[0].set_title('train_data')

    ax[1].set(xlabel="id", ylabel=i)
    ax[1].set_title("test_data")

    plt.show()

**The numerical features look good for now: train and test data look very similar, with roughly the same y-range for all 14 features.**
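**The visual impression can be backed up with numbers by putting the minimum and maximum of each feature side by side. This is a minimal sketch: `range_table` is a hypothetical helper, and `toy_train` / `toy_test` are made-up stand-ins for `train_data` and `test_data`.**

```python
import pandas as pd

# Made-up stand-ins for train_data and test_data.
toy_train = pd.DataFrame({"cont0": [0.05, 0.40, 0.95], "cont1": [0.10, 0.55, 0.80]})
toy_test = pd.DataFrame({"cont0": [0.07, 0.35, 0.90], "cont1": [0.12, 0.50, 0.85]})

def range_table(train, test, columns):
    """Return min and max of each column for train and test, side by side."""
    return pd.DataFrame({
        "train_min": train[columns].min(),
        "test_min": test[columns].min(),
        "train_max": train[columns].max(),
        "test_max": test[columns].max(),
    })

ranges = range_table(toy_train, toy_test, ["cont0", "cont1"])
print(ranges)
```

**With the real data, `range_table(train_data, test_data, numerical_features)` would give one row per numerical feature.**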

**Next, let's look at the distributions of the numerical features and then plot the numerical features of the train data together with the target.**

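## 3.2) Distribution plots of num. features

In [None]:
for i in numerical_features:

    fig, ax = plt.subplots(1,2, figsize=(18,6))

    # sns.distplot is deprecated; histplot with kde=True draws a comparable plot
    sns.histplot(x = train_data[i], kde = True, stat = "density", ax = ax[0])
    ax[0].set(xlabel=i, ylabel='density')
    ax[0].set_title('train_data')

    sns.histplot(x = test_data[i], kde = True, stat = "density", ax = ax[1])
    ax[1].set(xlabel=i, ylabel='density')
    ax[1].set_title("test_data")

    plt.show()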

## 3.3) Plot x-axis = num. feature, y-axis = target

In [None]:
for i in numerical_features:
    
    fig = plt.figure(figsize=(10,6))
    plt.plot(train_data[i], train_data["target"], linestyle = '', marker = 'x')
    plt.title(i)
    plt.show()

**These plots look good for now, with only a few outliers at the bottom.**

**The target ranges from 0 to 10, and the numerical feature values range roughly from 0 to 1.1.**

**Only the feature cont1 shows some interesting structure, in the form of thin stripes between x = 0.2 and 0.6.**

**Let's remove the one extreme outlier at the bottom, where the target is below 1.0.**

In [None]:
outlier = train_data.loc[train_data.target < 1.0]
print(outlier, "\n")
print(outlier.index)

In [None]:
# remove the outlier found above from the train_data set
# (using outlier.index avoids hard-coding the row label 99682)
train_data.drop(outlier.index, inplace = True)

**Now let's plot the categorical features.**

**First we will simply plot the unique values of the categorical features.** 

**Afterwards we will plot the categorical feature together with the target.**


## 3.4) Plot x-axis = category, y-axis = count

In [None]:
for i in cat_features:

    fig, ax = plt.subplots(1,2, figsize=(18,6))
    
    # sort_index() orders the bars alphabetically, which makes the two panels easier to compare
    train_data[i].value_counts().sort_index().plot(kind = 'bar', ax = ax[0])
    ax[0].set(xlabel=i, ylabel='count')
    ax[0].set_title('train_data')
    
    test_data[i].value_counts().sort_index().plot(kind = 'bar', ax = ax[1])
    ax[1].set(xlabel=i, ylabel='count')
    ax[1].set_title("test_data")
    
    plt.show()

**The columns cat0, cat1 and cat2 have only 2 unique values: A and B.**

**The columns cat3, cat4, cat5 have 3 unique values: A, B and C.**

**The columns cat6 and cat7 have 8 unique values: A,B,C,D,E,F,G,H.**

**The column cat8 has 7 unique values: A,B,C,D,E,F,G.**

**The column cat9 has 15 unique values: A,B,C,D,E,F,G,H,I,J,K,L,M,N,O.**

**The categories are imbalanced: within each feature, some values occur far more often than others.**

**Train and test data again look very similar.**
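**Reading the number of categories off the bar charts is error-prone, so it can help to count them in code and to check whether the test set contains categories that never appear in the train set. The sketch below uses made-up frames, and `category_report` is a hypothetical helper, not part of pandas.**

```python
import pandas as pd

# Made-up stand-ins for train_data and test_data.
toy_train = pd.DataFrame({"cat0": ["A", "B", "A", "A"], "cat1": ["A", "B", "C", "A"]})
toy_test = pd.DataFrame({"cat0": ["B", "A", "A", "A"], "cat1": ["A", "B", "D", "A"]})

def category_report(train, test, columns):
    """Count distinct categories per column and list categories seen only in test."""
    rows = []
    for col in columns:
        train_levels = set(train[col].unique())
        test_levels = set(test[col].unique())
        rows.append({
            "feature": col,
            "n_train": len(train_levels),
            "n_test": len(test_levels),
            "only_in_test": sorted(test_levels - train_levels),
        })
    return pd.DataFrame(rows).set_index("feature")

report = category_report(toy_train, toy_test, ["cat0", "cat1"])
print(report)
```

**On the real data, `category_report(train_data, test_data, cat_features)` can be compared against the counts listed above.**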

## 3.5) Plot x-axis = cat. feature, y-axis = target

In [None]:
for i in cat_features:
    
    sns.catplot(x = i, y="target", data=train_data)
    plt.show()

**These plots reveal some information about the relationship between the categorical features and the target.**

**The first 3 plots, for the features with only 2 unique values, show that both values cover roughly the same target range.**
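**The same comparison can be made numerically by grouping the target by category. This is a minimal sketch on a made-up frame; the column names follow the tutorial, the values do not come from the competition data.**

```python
import pandas as pd

# Made-up stand-in for train_data with one categorical feature.
toy = pd.DataFrame({
    "cat0": ["A", "A", "B", "B", "B"],
    "target": [7.0, 8.0, 6.0, 7.0, 8.0],
})

# One row per category: how many samples it has and which target range it covers.
summary = toy.groupby("cat0")["target"].agg(["count", "min", "median", "max"])
print(summary)
```

**If two categories show nearly identical min, median and max, the feature on its own probably tells the model little about the target.**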

# 4.) Train CatBoost model

In [None]:
from catboost import CatBoostRegressor

categorical_features = cat_features

y_train = train_data["target"]

train_data.drop(columns = ['target'], inplace = True)


test_data_backup = test_data.copy()

# dropping the id column slightly improves the score
train_data.drop(columns = ["id"], inplace = True)
test_data.drop(columns = ["id"], inplace = True)

In [None]:
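model_ctb = CatBoostRegressor(iterations = 3000, 
                               learning_rate = 0.02,
                               # the overfitting detector only has an effect if an eval_set is passed to fit()
                               od_type = 'Iter',
                               loss_function = 'RMSE',
                               #eval_metric='AUC',
                               grow_policy = 'SymmetricTree',
                               #auto_class_weights = 'Balanced',
                               #max_depth = 8,
                               subsample = 0.8,
                               #colsample_bylevel = 0.9,
                               #l2_leaf_reg = 0.80, 
                               #one_hot_max_size = 4,
                               # print the training progress every 3 iterations
                               verbose = 3,
                               random_seed = 17)

# the categorical columns are passed by name, so no manual encoding is needed
model_ctb.fit(train_data, y_train, cat_features=categorical_features)

# predict the target for the test set
y_pred = model_ctb.predict(test_data)

print(y_pred)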
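**The model above is trained on all rows, so there is no local estimate of its score before submitting. A common approach is to hold out part of the train data and compute the RMSE on it (assuming RMSE is the metric, as the `loss_function` suggests). The sketch below is self-contained and uses synthetic targets plus a constant-mean baseline instead of CatBoost; `rmse` and `holdout_indices` are hypothetical helper names.**

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def holdout_indices(n_rows, valid_fraction=0.2, seed=17):
    """Shuffle the row positions and split them into train and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_valid = int(round(n_rows * valid_fraction))
    return order[n_valid:], order[:n_valid]

# Synthetic target values (an assumption for illustration, not competition data).
y = np.random.default_rng(0).normal(loc=7.5, scale=0.9, size=1000)

train_idx, valid_idx = holdout_indices(len(y))

# Baseline: always predict the mean target of the training part.
baseline_pred = np.full(len(valid_idx), y[train_idx].mean())
baseline_rmse = rmse(y[valid_idx], baseline_pred)
print(f"constant-mean baseline RMSE: {baseline_rmse:.4f}")
```

**With the real data, the model would be fitted on `train_data.iloc[train_idx]` and evaluated on `train_data.iloc[valid_idx]`; a model is only useful if its validation RMSE is clearly below this baseline.**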

In [None]:
solution = pd.DataFrame({"id":test_data_backup.id, "target":y_pred})

solution.to_csv("solution.csv", index = False)

print("saved successfully!")
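**Before uploading, it can be worth checking that the file has the expected shape. The sketch below assumes the submission needs exactly the columns `id` and `target`, as in the cell above; `check_submission` is a hypothetical helper and the frames are made up.**

```python
import numpy as np
import pandas as pd

def check_submission(solution, test_ids):
    """Return a list of problems found in a submission frame (empty list = OK)."""
    if list(solution.columns) != ["id", "target"]:
        return ["columns must be exactly ['id', 'target']"]
    problems = []
    if len(solution) != len(test_ids):
        problems.append("row count differs from the test set")
    elif not (solution["id"].to_numpy() == np.asarray(test_ids)).all():
        problems.append("ids differ from the test set")
    if solution["target"].isna().any():
        problems.append("target contains missing values")
    return problems

# Made-up examples: one valid submission and one with a missing prediction.
good = pd.DataFrame({"id": [5, 6, 7], "target": [7.1, 7.9, 8.3]})
bad = pd.DataFrame({"id": [5, 6, 7], "target": [7.1, np.nan, 8.3]})

print(check_submission(good, [5, 6, 7]))
print(check_submission(bad, [5, 6, 7]))
```

**In the notebook this would be called as `check_submission(solution, test_data_backup.id)`.**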

# Thanks for reading this tutorial!


# If you have any ideas or questions, feel free to ask :)