In this notebook, you will learn how to make your first submission to the [Tabular Playground Series - Mar 2021 competition](https://www.kaggle.com/c/tabular-playground-series-mar-2021).

# Make the most of this notebook!

You can use the "Copy and Edit" button in the upper right of the page to create your own copy of this notebook and experiment with different models. You can run it as is and then see if you can make improvements.

In [None]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# List every file available in the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

input_path = Path('/kaggle/input/tabular-playground-series-mar-2021/')

# Read in the data files

In [None]:
train = pd.read_csv(input_path / 'train.csv', index_col='id')
display(train.head())

In [None]:
# are there any missing values?
train.isna().any().any()

In [None]:
test = pd.read_csv(input_path / 'test.csv', index_col='id')
display(test.head())

In [None]:
# are there any missing values?
test.isna().any().any()

In [None]:
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')
display(submission.head())

In [None]:
# Flag the columns at positions 20-30 whose absolute correlation with the target exceeds 0.1
train.iloc[:, 20:31].corr()['target'].abs() > 0.1


## We need to encode the categoricals.

There are different strategies to accomplish this, and each can perform differently depending on the algorithm. You may decide to encode features with high cardinality (i.e., many distinct values) differently than features with low cardinality. For this starter notebook, we'll use simple encoding.
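
As a minimal sketch of the cardinality idea above, the snippet below counts distinct values per categorical column and splits the columns into two groups, which could then be routed to different encoders. The toy frame, its column names, and the `THRESHOLD` value are assumptions made for illustration; they are not taken from the competition data.

```python
import pandas as pd

# Toy frame standing in for the competition data (hypothetical values).
df = pd.DataFrame({
    "cat_low": ["A", "B", "A", "B", "A", "B"],
    "cat_high": ["u", "v", "w", "x", "y", "z"],
    "cont": [0.1, 0.4, 0.3, 0.9, 0.5, 0.7],
})

cat_cols = ["cat_low", "cat_high"]  # categorical columns, listed by hand here
THRESHOLD = 3  # hypothetical cut-off between "low" and "high" cardinality

# Number of distinct values per categorical column
cardinality = df[cat_cols].nunique()

low_card = cardinality[cardinality <= THRESHOLD].index.tolist()
high_card = cardinality[cardinality > THRESHOLD].index.tolist()

print(cardinality.to_dict())
print("low cardinality:", low_card)
print("high cardinality:", high_card)
```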

### Update 

* I wanted to implement a pipeline with one-hot encoding, as I learned in the tutorials, but one-hot encoding turned out to perform worse.
* Then I wanted to use the LabelEncoder within a pipeline, but it turns out it is not meant to be used for transforming [features](https://www.kaggle.com/getting-started/146568).
* As the next step, I try the OrdinalEncoder.
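
To illustrate the difference noted in the list above, here is a small sketch on made-up data: `OrdinalEncoder` accepts a 2-D feature matrix and encodes each column separately, while `LabelEncoder` expects a 1-D array and is intended for the target. The toy arrays are assumptions for demonstration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy feature matrix with two categorical columns (hypothetical values).
X_toy = np.array([["A", "x"], ["B", "y"], ["A", "z"]], dtype=object)

# OrdinalEncoder works on a 2-D feature matrix, one set of codes per column.
ord_enc = OrdinalEncoder()
X_encoded = ord_enc.fit_transform(X_toy)
print(X_encoded)

# LabelEncoder works on a 1-D array, which is why it suits targets, not features.
y_toy = ["no", "yes", "no"]
y_encoded = LabelEncoder().fit_transform(y_toy)
print(y_encoded)
```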

In [None]:
X = train.drop(columns=['target'])
y = train['target']
T = test.copy()

#X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.60)

In [None]:
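labels = []            # category lists, one per kept categorical column
categorical_cols = []  # low-cardinality categorical columns to encode
drop_list = []         # high-cardinality categorical columns to drop
continuous_cols = []   # numeric columns correlated with the target
MAX_CAT = 15           # maximum number of distinct categories for a column to be kept
MIN_CORR = 0.1         # minimum absolute correlation with the target
for c in train.columns:
    if train[c].dtype == 'object':
        # Collect categories from train and test so the encoder knows all of them
        all_labels = list(set(train[c].values).union(set(test[c].values)))
        if len(all_labels) <= MAX_CAT:
            labels.append(all_labels)
            categorical_cols.append(c)
        else:
            drop_list.append(c)
    elif c != 'target':
        # Keep numeric columns whose correlation with the target is strong enough
        if abs(train[c].corr(train['target'])) > MIN_CORR:
            continuous_cols.append(c)

# labels
#print(categorical_cols)
# drop_list
#continuous_cols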
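
The correlation filter in the cell above can also be sketched in vectorized form with `DataFrame.corrwith`. The toy frame and its column names below are assumptions for illustration, not the competition data.

```python
import pandas as pd

# Toy frame: cont_a follows the target, cont_b is unrelated to it (hypothetical values).
toy = pd.DataFrame({
    "cont_a": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "cont_b": [0.5, 0.1, 0.3, 0.3, 0.1, 0.5],
    "target": [0, 0, 0, 1, 1, 1],
})

MIN_CORR = 0.1  # same threshold name as in the notebook cell

# Absolute Pearson correlation of every feature column with the target
corr = toy.drop(columns="target").corrwith(toy["target"]).abs()

selected = corr[corr > MIN_CORR].index.tolist()
print(corr)
print("selected:", selected)
```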

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_transformer = Pipeline(steps=[
    ('encoder', OrdinalEncoder(categories=labels))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
        ('cont', 'passthrough', continuous_cols)  # list of continuous columns
    ], remainder='drop')

In [None]:
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', XGBClassifier(n_estimators=500,
                                                      booster='gbtree',
                                                      use_label_encoder=False,
                                                      learning_rate=0.02,
                                                      eval_metric='auc',
                                                      n_jobs=-1,
                                                      random_state=42))
                             ])


In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(my_pipeline, X, y,
                         cv=3,
                         scoring='roc_auc')

print("ROC AUC scores:\n", scores)
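
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. The sketch below makes the splitter explicit and summarizes the fold scores. It uses synthetic data and a `LogisticRegression` stand-in rather than the notebook's pipeline, so the numbers it prints say nothing about the competition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (an assumption, not the competition data).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Explicit splitter: stratified, shuffled, and reproducible.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

demo_scores = cross_val_score(
    LogisticRegression(max_iter=1000),  # stand-in model for the demo
    X_demo,
    y_demo,
    cv=cv,
    scoring="roc_auc",
)

print(f"mean AUC: {demo_scores.mean():.4f} +/- {demo_scores.std():.4f}")
```

Reporting the mean together with the spread across folds makes it easier to judge whether a change to the model is a real improvement or just noise.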

# Let's train it on all the data and make a submission!

In [None]:
my_pipeline.fit(X, y);

In [None]:
T.head()

In [None]:
# My Prediction
submission['target'] = my_pipeline.predict_proba(T)[:, 1]

In [None]:
submission.head(20)

In [None]:
# The model is XGBoost, so name the output file accordingly
submission.to_csv('xgboost_submission.csv')
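
Before uploading, it can help to sanity-check the submission frame. The helper below is a hypothetical sketch, not part of any Kaggle API: it assumes the frame has a single `target` column of probabilities and is indexed by the test ids, as in the cells above.

```python
import pandas as pd

def submission_problems(sub: pd.DataFrame, expected_index: pd.Index) -> list:
    """Return a list of human-readable problems; an empty list means the frame looks fine."""
    problems = []
    if list(sub.columns) != ["target"]:
        problems.append("expected exactly one column named 'target'")
        return problems
    if not sub.index.equals(expected_index):
        problems.append("index does not match the test ids")
    if sub["target"].isna().any():
        problems.append("missing predictions")
    if not sub["target"].dropna().between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems

# Toy ids and predictions (hypothetical values).
toy_ids = pd.Index([5, 6, 8], name="id")
good = pd.DataFrame({"target": [0.1, 0.8, 0.5]}, index=toy_ids)
bad = pd.DataFrame({"target": [0.1, 1.7, None]}, index=toy_ids)

print(submission_problems(good, toy_ids))
print(submission_problems(bad, toy_ids))
```

In the notebook, a call such as `submission_problems(submission, test.index)` would apply the same checks to the real frame.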