# Tabular-Catboost
[CatBoost](https://catboost.ai/) is an open-source gradient boosting library. I introduce it here because I used it while studying.
CatBoost has the following features ([Wikipedia](https://en.wikipedia.org/wiki/Catboost)):
* Ordered boosting to reduce overfitting
* Native handling of categorical features
* Oblivious (symmetric) trees for faster execution
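
To make the last point more concrete, here is a minimal sketch of how a single oblivious (symmetric) tree can be evaluated. Every node on a given level applies the same test, so the test results can be packed into a leaf index without walking a branching structure. This only illustrates the idea and is not CatBoost's internal implementation; the function name, the bit ordering, and the example tree are assumptions made for illustration.

```python
from typing import Sequence


def oblivious_tree_predict(
    x: Sequence[float],
    split_features: Sequence[int],
    split_thresholds: Sequence[float],
    leaf_values: Sequence[float],
) -> float:
    """Evaluate one oblivious tree of depth d on a single sample x.

    Level i applies the same test x[split_features[i]] > split_thresholds[i]
    to every node on that level, so the d results form a d-bit leaf index.
    """
    depth = len(split_features)
    assert len(split_thresholds) == depth
    assert len(leaf_values) == 2 ** depth

    leaf_index = 0
    for level in range(depth):
        bit = 1 if x[split_features[level]] > split_thresholds[level] else 0
        leaf_index |= bit << level  # level 0 is the lowest bit (arbitrary choice)
    return leaf_values[leaf_index]


# Hypothetical depth-2 tree: level 0 tests feature 0 > 0.5, level 1 tests feature 1 > 0.2
example_tree = dict(
    split_features=[0, 1],
    split_thresholds=[0.5, 0.2],
    leaf_values=[10.0, 20.0, 30.0, 40.0],
)
print(oblivious_tree_predict([0.9, 0.1], **example_tree))
```

Because the leaf index is computed with a fixed number of comparisons, evaluating such a tree needs no data-dependent branching, which is one reason this structure is fast at prediction time.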

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from catboost import Pool
from catboost import CatBoost

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Importing data



In [None]:
train = pd.read_csv('../input/tabular-playground-series-jan-2021/train.csv')
test  = pd.read_csv('../input/tabular-playground-series-jan-2021/test.csv')
sub = pd.read_csv('../input/tabular-playground-series-jan-2021/sample_submission.csv')

### Checking the data

In [None]:
features = [f'cont{x}' for x in range(1,15)]
data = train[features]
X_test = test[features]
target = train['target']
data.head()

### z-score normalization

In [None]:
# Scale both sets with the training mean and std so they share the same scale
mean, std = data.mean(), data.std()
train_data = (data - mean) / std
test_data = (X_test - mean) / std
train_data.describe()
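
A common convention for z-score normalization is to compute the mean and standard deviation on the training data only and to reuse them for the test data, so that both sets end up on the same scale. The sketch below shows this in plain Python for a single column. The helper names and the toy values are made up for illustration; `statistics.stdev` is the sample standard deviation, which matches the default of pandas `std()`.

```python
from statistics import mean, stdev
from typing import List, Tuple


def fit_zscore(values: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of a training column."""
    return mean(values), stdev(values)


def apply_zscore(values: List[float], mu: float, sigma: float) -> List[float]:
    """Scale values with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]


# Toy columns (hypothetical values)
train_col = [1.0, 2.0, 3.0]
test_col = [2.0, 4.0]

mu, sigma = fit_zscore(train_col)                   # fitted on train only
train_scaled = apply_zscore(train_col, mu, sigma)
test_scaled = apply_zscore(test_col, mu, sigma)     # reuses the train statistics
print(train_scaled, test_scaled)
```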

### Catboost modeling
 
The values in `params` are not tuned, so please experiment with different settings.
If you cannot use a GPU, delete `task_type`.
I referred to [Ensemble using XGBoost and LGBM with EDA](https://www.kaggle.com/jyotmakadiya/ensemble-using-xgboost-and-lgbm-with-eda) when building the CatBoost model. Please take a look at it because it contains very useful information!
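
One possible way to make the GPU setting optional, instead of editing the dictionary by hand, is to build it in a small helper. This is only a sketch: `make_params` and `use_gpu` are hypothetical names that do not come from CatBoost, and the values simply mirror the ones used in this notebook.

```python
from typing import Any, Dict


def make_params(use_gpu: bool) -> Dict[str, Any]:
    """Build the CatBoost parameter dictionary used in this notebook.

    `task_type` is added only when a GPU is requested; without it,
    CatBoost falls back to its default (CPU) training.
    """
    params: Dict[str, Any] = {
        'early_stopping_rounds': 300,
        'loss_function': 'RMSE',
        'num_boost_round': 5000,
        'learning_rate': 0.005,
        'max_depth': 15,
        'verbose': 200,
        'random_seed': 42,
    }
    if use_gpu:
        params['task_type'] = 'GPU'
    return params


cpu_params = make_params(use_gpu=False)
gpu_params = make_params(use_gpu=True)
print(sorted(set(gpu_params) - set(cpu_params)))
```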

In [None]:
params = {
    'early_stopping_rounds' : 300,
    'loss_function' : 'RMSE',
    'num_boost_round' : 5000,
    'learning_rate' : 0.005,
    'max_depth' : 15,
    'verbose' : 200,
    'random_seed' : 42,
    'task_type': 'GPU'}

preds = np.zeros(test.shape[0])

kf = KFold(n_splits = 7, random_state = 42, shuffle = True)

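rmse = []
for fold, (trn_idx, val_idx) in enumerate(kf.split(train[features], target), start=1):
    # Split this fold into a training part and a validation part
    X_train, X_valid = train[features].iloc[trn_idx], train[features].iloc[val_idx]
    y_train, y_valid = target.iloc[trn_idx], target.iloc[val_idx]
    
    train_pool = Pool(X_train, label = y_train)
    valid_pool = Pool(X_valid, label = y_valid)
    
    # Train with early stopping monitored on the validation fold
    model = CatBoost(params)
    model.fit(train_pool, eval_set = [valid_pool])
    # Average the test predictions over all folds
    preds += model.predict(test[features]) / kf.n_splits
    
    # np.sqrt(MSE) avoids the `squared` argument, which is deprecated in recent scikit-learn releases
    rmse.append(np.sqrt(mean_squared_error(y_valid, model.predict(X_valid))))
    print(fold, rmse[-1])

print(f"mean RMSE for all the folds is {np.mean(rmse)}")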
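
The loop above does two pieces of arithmetic: it computes one RMSE per validation fold, and it averages the test predictions of all fold models. The plain-Python sketch below shows both steps on toy numbers so they can be checked by hand. The helper names and the values are hypothetical and are not part of CatBoost or scikit-learn.

```python
import math
from typing import List, Sequence


def rmse_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def average_fold_predictions(fold_preds: Sequence[Sequence[float]]) -> List[float]:
    """Average the test predictions of several fold models row by row."""
    n_folds = len(fold_preds)
    return [sum(row) / n_folds for row in zip(*fold_preds)]


# Toy example: two fold models, each predicting two test rows
fold_preds = [[1.0, 2.0], [3.0, 4.0]]
blended = average_fold_predictions(fold_preds)

# Toy validation scores for the two folds
fold_rmse = [
    rmse_score([0.0, 0.0], [3.0, 3.0]),
    rmse_score([1.0, 2.0], [1.0, 2.0]),
]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(blended, fold_rmse, mean_rmse)
```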

### Output

In [None]:
sub['target']=preds
sub.to_csv('submission.csv', index = False)
sub