# About this kernel

This is the most common (to me) use case of machine learning. I just want to make a simple version for myself
to serve as a reference. Everything here is basic, nothing special, really.

In [None]:
%%time
# Import needed libraries (%%time must be the first line to time the whole cell)
import os
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
print("XGBoost version:", xgb.__version__)

# Preprocessing

We will use label encoding (instead of dropping the column or one-hot encoding) to balance
cleaning effort against data size: each categorical column stays a single integer column.
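
To make the size argument concrete, here is a minimal dependency-free sketch of the two encodings on a toy column. The category values are made up for illustration and are not taken from the competition data. The mapping is built from the union of train and test values, which is the same idea the loop in the next cell uses with `LabelEncoder` (which also assigns codes in sorted order).

```python
# Toy stand-ins for one categorical column in the train and test sets
# (hypothetical values, not taken from the competition data).
train_col = ["visa", "mastercard", "visa", "discover"]
test_col = ["amex", "visa", "mastercard"]

# Label encoding: build one mapping from the union of both sets so that
# a category seen only in the test set still gets a code.
categories = sorted(set(train_col) | set(test_col))
code_of = {cat: code for code, cat in enumerate(categories)}

train_encoded = [code_of[v] for v in train_col]
test_encoded = [code_of[v] for v in test_col]

# One-hot encoding: one 0/1 column per category instead of a single column.
train_one_hot = [[int(v == cat) for cat in categories] for v in train_col]

print(categories)
print(train_encoded)
print(test_encoded)
print(len(train_one_hot[0]), "columns after one-hot vs 1 after label encoding")
```

With many high-cardinality columns, the one-hot width adds up quickly, which matters when memory is tight.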

In [None]:
%%time
# Check process time (%%time must be the first line to time the whole cell)
# We only have limited memory (13GB)
# Need to manage it properly or the kernel will die

# Read the data
train_transaction = pd.read_csv('../input/train_transaction.csv', index_col='TransactionID')
test_transaction  = pd.read_csv('../input/test_transaction.csv', index_col='TransactionID')
train_identity    = pd.read_csv('../input/train_identity.csv', index_col='TransactionID')
test_identity     = pd.read_csv('../input/test_identity.csv', index_col='TransactionID')
sample_submission = pd.read_csv('../input/sample_submission.csv', index_col='TransactionID')

# Merge
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test  = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)

# Manage memory
del train_transaction, train_identity, test_transaction, test_identity
print(train.shape)
print(test.shape)

# Identify targets and features, also fill the NaNs
X_test  = test.copy()
X_test  = X_test.fillna(-999)
y = train['isFraud'].copy()
X = train.drop('isFraud', axis=1)
X = X.fillna(-999)

# Manage memory
del train, test

# Preprocess categorical features using label encoding for both test and train features
for f in X.columns:
    if X[f].dtype=='object' or X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X[f].values) + list(X_test[f].values))
        X[f] = lbl.transform(list(X[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values))
        
# Do the legendary split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Manage Memory
del X, y

# Training and Scoring

For GPU usage, simply use `tree_method='gpu_hist'` (took me an hour to figure out, I wish XGBoost documentation was clearer about that).
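
If you run this on a newer XGBoost, the spelling of that setting may differ. As far as I know, releases from 2.0 on prefer `tree_method='hist'` combined with `device='cuda'` and treat `gpu_hist` as deprecated. The helper below is a hypothetical convenience (`gpu_tree_params` is my own name, not part of XGBoost) that picks the parameters from the version string; treat the version cutoff as an assumption and check it against the docs for your install.

```python
def gpu_tree_params(xgb_version):
    """Pick GPU histogram settings for a given XGBoost version string.

    Assumption: releases from 2.0 on prefer tree_method='hist' together
    with device='cuda'; earlier releases use tree_method='gpu_hist'.
    """
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        return {"tree_method": "hist", "device": "cuda"}
    return {"tree_method": "gpu_hist"}


print(gpu_tree_params("0.90"))
print(gpu_tree_params("2.0.3"))
```

In the notebook this could be used as `xgb.XGBClassifier(n_estimators=500, **gpu_tree_params(xgb.__version__))` in place of the hard-coded `tree_method`.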

In [None]:
clf = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=9,
    learning_rate=0.1,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=-999,
    random_state=2019,
    tree_method='gpu_hist'
)
%time clf.fit(X_train, y_train)
preds = clf.predict(X_val)
# accuracy_score expects (y_true, y_pred)
score = round(accuracy_score(y_val, preds) * 100, 2)
print(score)

Some of you must be wondering how we were able to cut the fitting time by that much. The reason is not only that we are running on the GPU, but also that we are computing an approximation of the real underlying algorithm (which is an exact greedy algorithm). This may hurt your score slightly, but it makes training much faster.
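
To give a feel for where the speedup comes from, here is a rough, dependency-free sketch of split finding on a single feature. It is not XGBoost's implementation: it uses a plain squared-error gain and equal-width bins as simplifying assumptions (XGBoost uses its own regularized gain and quantile-based bins), and all function names are my own. The point is only that the exact greedy search tries every distinct feature value as a threshold, while the histogram approach tries a small, fixed number of bin edges.

```python
import random


def split_gain(left, right):
    """Reduction in squared error from splitting labels into left/right."""
    def sse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)
    return sse(left + right) - sse(left) - sse(right)


def best_split(xs, ys, thresholds):
    """Return (gain, threshold) for the best candidate threshold."""
    best = (0.0, None)
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        gain = split_gain(left, right)
        if gain > best[0]:
            best = (gain, t)
    return best


# Synthetic feature and labels (made up for illustration).
random.seed(0)
xs = [random.uniform(0, 100) for _ in range(1000)]
ys = [1 if x > 62.5 else 0 for x in xs]

# Exact greedy: every distinct feature value is a candidate threshold.
exact_candidates = sorted(set(xs))

# Histogram: only the interior edges of a fixed number of bins are candidates.
n_bins = 16
lo, hi = min(xs), max(xs)
hist_candidates = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

exact_gain, exact_t = best_split(xs, ys, exact_candidates)
hist_gain, hist_t = best_split(xs, ys, hist_candidates)
print(len(exact_candidates), "exact candidates, gain", round(exact_gain, 2))
print(len(hist_candidates), "histogram candidates, gain", round(hist_gain, 2))
```

The histogram search can never find a better split than the exact one, but it evaluates far fewer candidates, which is the trade-off described above.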

So why am I not using the CPU with `tree_method='hist'`? If you try it yourself, you'll see it takes about 7 minutes, which is still far from the GPU fitting time. Similarly, `tree_method='gpu_exact'` takes about 4 minutes, but likely yields better accuracy than `gpu_hist` or `hist` (note that, as far as I know, later XGBoost releases dropped `gpu_exact`).

The [docs on parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) have a section on `tree_method` that goes over the details of each option.

# Predicting

Time to make some predictions.

In [None]:
sample_submission['isFraud'] = clf.predict_proba(X_test)[:,1]
sample_submission.to_csv('simple_xgboost.csv')