**Disclaimer: I am new to Python; use this Kernel at your own risk. GL!**

v.23 Added the `KMeansSMOTE` oversampler, tweaked the xgboost params, and reduced the undersampling percentage. I don't expect much improvement: ~0.94 seems to be where this approach hits its limit in my hands. Next I need a good CV strategy for tuning and more work on feature engineering.

v.20 **the highest LB so far: 0.9398** with `xgboost_under_over_blend.csv` submission

v.16-21 trying to add variety by using different params (mostly for xgboost) and by adding another random sampler. 

v.15 same as v.14 but removed `TransactionDT` - aka "the better way"

v.14 Current setup: oversample using SMOTE, random oversample, and undersample with 17x runs. Equal blending of the three approaches.

v.13 `NearMiss version=1` does not perform well yet.

v.12 Removed the BorderlineSMOTE oversampling part since the kernel keeps crashing. In the future I will create the oversampled dataset in separate kernels.

v.8 and v.9 keep crashing. RAM limit? Will remove a few more columns in v.10.

v.8 **Warnings:** added back `TransactionDT` as a possible leak/overfit illustration. The feature is used in some top kernels. This is likely the wrong way of using it. 

v.7 update: added NearMiss for undersampling and BorderlineSmote for oversampling

v.5 update: trying SMOTE for oversampling '1' class, then blending preds with undersampled '0' class preds.

For simple random undersampling the main theme is to:
* randomly undersample 0 class, train on train_new, predict test
* rinse and repeat until cows come home
* average test predictions.
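
The three steps above can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions: `undersample_bagging` and `centroid_scorer` are hypothetical names that do not appear in this kernel, the data is synthetic, and the centroid scorer is a stand-in for a real model such as xgboost.

```python
import numpy as np

def undersample_bagging(X, y, X_test, fit_predict, n_runs=5, frac=0.1, seed=0):
    """Average test predictions over several random undersamples of class 0."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    n_keep = max(1, int(round(frac * len(idx_neg))))
    preds = np.zeros(len(X_test))
    for _ in range(n_runs):
        # Keep a random subset of class 0 and every class 1 row
        keep_neg = rng.choice(idx_neg, size=n_keep, replace=False)
        idx = np.concatenate([keep_neg, idx_pos])
        # Train on the reduced set, predict test, accumulate the average
        preds += fit_predict(X[idx], y[idx], X_test) / n_runs
    return preds

def centroid_scorer(X_tr, y_tr, X_te):
    """Hypothetical stand-in model: score by relative distance to class centroids."""
    mu0 = X_tr[y_tr == 0].mean(axis=0)
    mu1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - mu0, axis=1)
    d1 = np.linalg.norm(X_te - mu1, axis=1)
    return d0 / (d0 + d1 + 1e-12)

# Synthetic imbalanced data: 1000 rows of class 0, 50 rows of class 1
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (1000, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 1000 + [1] * 50)
X_test = np.array([[0.0, 0.0], [3.0, 3.0]])

preds = undersample_bagging(X, y, X_test, centroid_scorer, n_runs=5, frac=0.1)
print(preds)
```

Each run sees a different slice of the majority class, so averaging the runs recovers some of the information that a single undersample throws away.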

I've borrowed code from several Kernels. Let me know if I forgot to acknowledge your work.

[Undersampling](https://www.kaggle.com/artkulak/use-only-5-of-0-labels-get-negligible-lb-drop)

[Remove putative low information content columns](https://www.kaggle.com/artgor/eda-and-models)

[0.9383](https://www.kaggle.com/artkulak/ieee-fraud-simple-baseline-0-9383-lb)

[GPU Optimization](https://www.kaggle.com/xhlulu/ieee-fraud-xgboost-with-gpu-fit-in-40s)

In [None]:
import os

import numpy as np
import pandas as pd
from sklearn import preprocessing
import xgboost as xgb
import gc

## Pre-processing 

In [None]:
%%time
train_transaction = pd.read_csv('../input/train_transaction.csv', index_col='TransactionID')
test_transaction = pd.read_csv('../input/test_transaction.csv', index_col='TransactionID')

train_identity = pd.read_csv('../input/train_identity.csv', index_col='TransactionID')
test_identity = pd.read_csv('../input/test_identity.csv', index_col='TransactionID')

sample_submission = pd.read_csv('../input/sample_submission.csv', index_col='TransactionID')

train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)

print(train.shape)
print(test.shape)

y_train = train['isFraud'].copy()
del train_transaction, train_identity, test_transaction, test_identity

# Drop target (NaNs are filled later, after label encoding)
X_train = train.drop('isFraud', axis=1)
X_test = test.copy()

del train, test

many_null_cols = [col for col in X_train.columns if X_train[col].isnull().sum() / X_train.shape[0] > 0.96]
many_null_cols_X_test = [col for col in X_test.columns if X_test[col].isnull().sum() / X_test.shape[0] > 0.96]
big_top_value_cols = [col for col in X_train.columns if X_train[col].value_counts(dropna=False, normalize=True).values[0] > 0.96]
big_top_value_cols_X_test = [col for col in X_test.columns if X_test[col].value_counts(dropna=False, normalize=True).values[0] > 0.96]
cols_to_drop = list(set(many_null_cols + many_null_cols_X_test + big_top_value_cols + big_top_value_cols_X_test ))
print(len(cols_to_drop))
print(cols_to_drop)


X_train = X_train.drop(cols_to_drop, axis=1)
X_test = X_test.drop(cols_to_drop, axis=1)


X_train.drop('TransactionDT', axis=1, inplace=True)
X_test.drop('TransactionDT', axis=1, inplace=True)

print(X_train.shape)
print(X_test.shape)

# Label Encoding
for f in X_train.columns:
    if X_train[f].dtype=='object' or X_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(X_train[f].values) + list(X_test[f].values))
        X_train[f] = lbl.transform(list(X_train[f].values))
        X_test[f] = lbl.transform(list(X_test[f].values)) 
        
X_train = X_train.fillna(-999)
X_test = X_test.fillna(-999)

## Reducing RAM usage
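
The helper in this section carries a warning that it can damage the data. The likely culprit is the `float16` branch: half precision keeps only about three significant decimal digits, so values that pass the min/max range check can still be rounded. A small sketch with made-up sample values (not taken from the competition data):

```python
import numpy as np

# Made-up values: the -999 sentinel, a price-like amount, and two larger numbers
samples = np.array([-999.0, 117.95, 2049.0, 18396.0])

# Round-trip through float16 to see what survives the downcast
roundtrip = samples.astype(np.float16).astype(np.float64)

for before, after in zip(samples, roundtrip):
    print(f"{before:>10.2f} -> {after:>10.4f}  (abs error {abs(before - after):.4f})")
```

The `-999` fill value survives exactly, but larger or fractional values are rounded, which can matter if a float column holds ID-like values where distinct numbers must stay distinct.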

In [None]:
%%time
# From kernel https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
# WARNING! THIS CAN DAMAGE THE DATA: downcasting floats to float16 loses precision
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

X_train = reduce_mem_usage(X_train)
X_test = reduce_mem_usage(X_test)

## Oversampling using KMeansSMOTE
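
For the imbalanced-learn over-samplers, a float `sampling_strategy` is the desired ratio of minority to majority rows after resampling. The back-of-the-envelope helper below (`oversampled_counts` is a hypothetical name, and the class counts are made up) shows what `0.15` implies. The exact number of rows a sampler returns may differ slightly, especially for cluster-based variants.

```python
def oversampled_counts(n_majority, n_minority, sampling_strategy):
    """Approximate minority size after oversampling and the number of rows added."""
    target_minority = int(round(n_majority * sampling_strategy))
    if target_minority < n_minority:
        raise ValueError("ratio is below the current minority/majority ratio")
    return target_minority, target_minority - n_minority

# Made-up counts: 100,000 rows of class 0 and 3,500 rows of class 1
target, added = oversampled_counts(100_000, 3_500, 0.15)
print(f"class 1 grows to ~{target} rows ({added} added)")
```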

In [None]:
from imblearn.over_sampling import KMeansSMOTE

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = KMeansSMOTE(random_state=99, sampling_strategy = 0.15,  k_neighbors = 10,cluster_balance_threshold = 0.02, n_jobs=4)
# fit_resample is the current name; fit_sample was removed in newer imbalanced-learn releases
X_train_new, y_train_new = sm.fit_resample(X_train, y_train.ravel())

X_train_new = pd.DataFrame(X_train_new)
X_train_new.columns = X_train.columns
y_train_new = pd.DataFrame(y_train_new)

print('After OverSampling, the shape of X_train_new: {}'.format(X_train_new.shape))
print('After OverSampling, the shape of y_train_new: {} \n'.format(y_train_new.shape))

In [None]:
%%time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

EPOCHS = 4  # number of CV folds

y_preds = np.zeros(sample_submission.shape[0])

kf = StratifiedKFold(n_splits=EPOCHS, random_state= 99, shuffle=True)
y_oof_new = np.zeros(X_train_new.shape[0])
gc.collect()

for tr_idx, val_idx in kf.split(X_train_new, y_train_new):
    clf = xgb.XGBClassifier(
            n_estimators=500,
            max_depth=17,
            learning_rate=0.03,
            subsample=0.9,
            colsample_bytree=0.9,
            tree_method='gpu_hist',
            missing=-999
        )
    
    X_tr, X_vl = X_train_new.iloc[tr_idx, :], X_train_new.iloc[val_idx, :]
    y_tr, y_vl = y_train_new.iloc[tr_idx], y_train_new.iloc[val_idx]
    clf.fit(X_tr, y_tr)
    y_pred_train = clf.predict_proba(X_vl)[:,1]
    #y_oof[val_idx] = y_pred_train
    print('ROC AUC {}'.format(roc_auc_score(y_vl, y_pred_train)))
    y_oof_new[val_idx] = y_pred_train    
    y_preds+= clf.predict_proba(X_test)[:,1] / EPOCHS
    del clf
    gc.collect()
print('ROC AUC oof_new {}'.format(roc_auc_score(y_train_new, y_oof_new))) 
del X_train_new
gc.collect()

sample_submission1a = sample_submission.copy()
sample_submission1a['isFraud'] = y_preds
# Separate file name so the SMOTE section below does not overwrite this output
sample_submission1a.to_csv('xgboost_oversample_kmeans.csv')
sample_submission1a['isFraud'].describe()

## Oversampling using SMOTE
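
As a reminder of what SMOTE does, each synthetic row is a random point on the line segment between a minority row and one of its nearest minority neighbours. Below is a minimal NumPy sketch of that interpolation step only. `smote_like_samples` is a hypothetical helper, not the imbalanced-learn implementation, and the brute-force neighbour search is only suitable for tiny arrays.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Interpolate between minority rows and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Brute-force pairwise distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)             # a row is not its own neighbour
    neighbors = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)      # random starting rows
    nb = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(30, 3))               # toy minority class
synth = smote_like_samples(X_min, n_new=100, k=5)
print(synth.shape)
```

Because every feature is interpolated as if it were continuous, label-encoded categoricals and the `-999` fill value also get blended into in-between values that have no real meaning. That may be one reason SMOTE adds little over plain random oversampling here.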

In [None]:
from imblearn.over_sampling import SMOTE

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=99, sampling_strategy = 0.15)
X_train_new, y_train_new = sm.fit_resample(X_train, y_train.ravel())

X_train_new = pd.DataFrame(X_train_new)
X_train_new.columns = X_train.columns
y_train_new = pd.DataFrame(y_train_new)

print('After OverSampling, the shape of X_train_new: {}'.format(X_train_new.shape))
print('After OverSampling, the shape of y_train_new: {} \n'.format(y_train_new.shape))

In [None]:
%%time
#training on Smote dataset
#from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

EPOCHS = 4

#kf = KFold(n_splits = EPOCHS, shuffle = True)
y_preds = np.zeros(sample_submission.shape[0])


kf = StratifiedKFold(n_splits=EPOCHS, random_state= 99, shuffle=True)
#y_oof = np.zeros(X_train.shape[0])
#y_train_new = y_train_new.reset_index().drop(columns = 'TransactionID')
#X_train_new = X_train_new.reset_index().drop(columns = 'TransactionID')
y_oof_new = np.zeros(X_train_new.shape[0])
gc.collect()

for tr_idx, val_idx in kf.split(X_train_new, y_train_new):
    clf = xgb.XGBClassifier(
            n_estimators=500,
            max_depth=17,
            learning_rate=0.02,
            subsample=0.9,
            colsample_bytree=0.9,
            tree_method='gpu_hist',
            missing=-999
        )
    
    X_tr, X_vl = X_train_new.iloc[tr_idx, :], X_train_new.iloc[val_idx, :]
    y_tr, y_vl = y_train_new.iloc[tr_idx], y_train_new.iloc[val_idx]
    clf.fit(X_tr, y_tr)
    y_pred_train = clf.predict_proba(X_vl)[:,1]
    #y_oof[val_idx] = y_pred_train
    print('ROC AUC {}'.format(roc_auc_score(y_vl, y_pred_train)))
    y_oof_new[val_idx] = y_pred_train    
    y_preds+= clf.predict_proba(X_test)[:,1] / EPOCHS
    del clf
    gc.collect()
print('ROC AUC oof_new {}'.format(roc_auc_score(y_train_new, y_oof_new))) 
del X_train_new
gc.collect()

sample_submission1 = sample_submission.copy()
sample_submission1['isFraud'] = y_preds
sample_submission1.to_csv('xgboost_oversample.csv')
sample_submission1['isFraud'].describe()

## Oversampling using RandomOverSampler
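
One caveat about the fold scores printed in the oversampling sections: the data is resampled before the `StratifiedKFold` split, so copies of the same fraud row can land in both the training and the validation fold, and the fold AUCs are likely optimistic. The toy NumPy sketch below (made-up sizes, row ids only, no model) estimates how often that happens with random oversampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_minority = 200

# Random oversampling: keep the originals and draw 4x extra copies by row id
extra = rng.integers(0, n_minority, size=4 * n_minority)
source_id = np.concatenate([np.arange(n_minority), extra])

# Split AFTER oversampling, the way a shuffled 4-fold split would
perm = rng.permutation(len(source_id))
n_val = len(perm) // 4
val_ids = source_id[perm[:n_val]]
train_ids = source_id[perm[n_val:]]

# Share of validation rows whose exact copy is also in the training fold
leaked = np.isin(val_ids, train_ids).mean()
print(f"{leaked:.0%} of validation minority rows also appear in the training fold")
```

Resampling only the training part of each fold would avoid this, at the cost of running the sampler once per fold.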

In [None]:
from imblearn.over_sampling import RandomOverSampler

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = RandomOverSampler(random_state=99, sampling_strategy = 0.12)
X_train_new, y_train_new = sm.fit_resample(X_train, y_train.ravel())

X_train_new = pd.DataFrame(X_train_new)
X_train_new.columns = X_train.columns
y_train_new = pd.DataFrame(y_train_new)

print('After OverSampling, the shape of X_train_new: {}'.format(X_train_new.shape))
print('After OverSampling, the shape of y_train_new: {} \n'.format(y_train_new.shape))

In [None]:
%%time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

EPOCHS = 4

y_preds = np.zeros(sample_submission.shape[0])

kf = StratifiedKFold(n_splits=EPOCHS, random_state= 99, shuffle=True)
y_oof_new = np.zeros(X_train_new.shape[0])
gc.collect()

for tr_idx, val_idx in kf.split(X_train_new, y_train_new):
    clf = xgb.XGBClassifier(
            n_estimators=500,
            max_depth=11,
            learning_rate=0.05,
            subsample=0.9,
            colsample_bytree=0.9,
            tree_method='gpu_hist',
            missing=-999,
            min_child_weight=1
        )
    
    X_tr, X_vl = X_train_new.iloc[tr_idx, :], X_train_new.iloc[val_idx, :]
    y_tr, y_vl = y_train_new.iloc[tr_idx], y_train_new.iloc[val_idx]
    clf.fit(X_tr, y_tr)
    y_pred_train = clf.predict_proba(X_vl)[:,1]
    #y_oof[val_idx] = y_pred_train
    print('ROC AUC {}'.format(roc_auc_score(y_vl, y_pred_train)))
    y_oof_new[val_idx] = y_pred_train    
    y_preds+= clf.predict_proba(X_test)[:,1] / EPOCHS
    del clf
    gc.collect()
print('ROC AUC oof_new {}'.format(roc_auc_score(y_train_new, y_oof_new))) 
del X_train_new
gc.collect()

sample_submission2 = sample_submission.copy()
sample_submission2['isFraud'] = y_preds
sample_submission2.to_csv('xgboost_oversample_random.csv')
sample_submission2['isFraud'].describe()

## Oversampling using RandomOverSampler (slightly different params)

In [None]:
from imblearn.over_sampling import RandomOverSampler

print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = RandomOverSampler(random_state=999, sampling_strategy = 0.13)
X_train_new, y_train_new = sm.fit_resample(X_train, y_train.ravel())

X_train_new = pd.DataFrame(X_train_new)
X_train_new.columns = X_train.columns
y_train_new = pd.DataFrame(y_train_new)

print('After OverSampling, the shape of X_train_new: {}'.format(X_train_new.shape))
print('After OverSampling, the shape of y_train_new: {} \n'.format(y_train_new.shape))

In [None]:
%%time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

EPOCHS = 4

y_preds = np.zeros(sample_submission.shape[0])


kf = StratifiedKFold(n_splits=EPOCHS, random_state= 99, shuffle=True)
y_oof_new = np.zeros(X_train_new.shape[0])
gc.collect()

for tr_idx, val_idx in kf.split(X_train_new, y_train_new):
    clf = xgb.XGBClassifier(
            n_estimators=500,
            max_depth=17,
            learning_rate=0.035,
            subsample=0.9,
            colsample_bytree=0.9,
            tree_method='gpu_hist',
            missing=-999
        )
    
    X_tr, X_vl = X_train_new.iloc[tr_idx, :], X_train_new.iloc[val_idx, :]
    y_tr, y_vl = y_train_new.iloc[tr_idx], y_train_new.iloc[val_idx]
    clf.fit(X_tr, y_tr)
    y_pred_train = clf.predict_proba(X_vl)[:,1]
    #y_oof[val_idx] = y_pred_train
    print('ROC AUC {}'.format(roc_auc_score(y_vl, y_pred_train)))
    y_oof_new[val_idx] = y_pred_train    
    y_preds+= clf.predict_proba(X_test)[:,1] / EPOCHS
    del clf
    gc.collect()
print('ROC AUC oof_new {}'.format(roc_auc_score(y_train_new, y_oof_new))) 
del X_train_new
gc.collect()

sample_submission3 = sample_submission.copy()
sample_submission3['isFraud'] = y_preds
# Separate file name so this run does not overwrite the previous RandomOverSampler output
sample_submission3.to_csv('xgboost_oversample_random2.csv')
sample_submission3['isFraud'].describe()

## Random undersampling of '0' class, multiple runs

In [None]:
%%time
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

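EPOCHS = 4   # number of CV folds per run
TIMES = 24   # number of undersampling runs
# Fraction of class 0 kept in each run: 0.1000, 0.1025, ..., 0.1575
frac_inc = np.arange(0,TIMES,1)/400 +0.1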
print(frac_inc)

y_preds = np.zeros(sample_submission.shape[0])

for run in range(TIMES):
    kf = StratifiedKFold(n_splits=EPOCHS, random_state=run, shuffle=True)
    # Keep a random fraction of class 0 plus every class 1 row
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    X_train_new = pd.concat([X_train[y_train == 0].sample(frac=frac_inc[run]), X_train[y_train == 1]])
    y_train_new = y_train.loc[X_train_new.index].reset_index().drop(columns = 'TransactionID')
    X_train_new = X_train_new.reset_index().drop(columns = 'TransactionID')
    #X_train_new
    #y_train_new
    y_oof_new = np.zeros(X_train_new.shape[0])
    gc.collect()

    for tr_idx, val_idx in kf.split(X_train_new, y_train_new):
        clf = xgb.XGBClassifier(
            n_estimators=500,
            max_depth=10,
            learning_rate=0.04,
            subsample=0.8,
            colsample_bytree=0.9,
            tree_method='gpu_hist',
            missing=-999,
            min_child_weight=2
        )
    
        X_tr, X_vl = X_train_new.iloc[tr_idx, :], X_train_new.iloc[val_idx, :]
        y_tr, y_vl = y_train_new.iloc[tr_idx], y_train_new.iloc[val_idx]
        clf.fit(X_tr, y_tr)
        y_pred_train = clf.predict_proba(X_vl)[:,1]
        #y_oof[val_idx] = y_pred_train
        #print('ROC AUC {}'.format(roc_auc_score(y_vl, y_pred_train)))
        y_oof_new[val_idx] = y_pred_train    
        y_preds+= clf.predict_proba(X_test)[:,1] / EPOCHS   
    print('ROC AUC oof_new {}'.format(roc_auc_score(y_train_new, y_oof_new))) 
    
sample_submission4 = sample_submission.copy()
sample_submission4['isFraud'] = y_preds/TIMES
sample_submission4.to_csv('xgboost_undersample.csv')
sample_submission4['isFraud'].describe()

## Blending over- and under-sampled results
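
Each blend in this section is a weighted average in which the divisor equals the sum of the weights (for the first blend, 1 + 1 + 0.5 + 0.5 + 1 = 4). A small helper makes that normalisation explicit. `weighted_blend` is a hypothetical name and the prediction values are made up.

```python
import numpy as np

def weighted_blend(preds, weights):
    """Weighted average of prediction vectors, normalised by the weight sum."""
    preds = np.asarray(preds, dtype=float)      # shape: (n_models, n_rows)
    weights = np.asarray(weights, dtype=float)  # shape: (n_models,)
    return weights @ preds / weights.sum()

# Five toy prediction vectors for two test rows
toy_preds = [
    [0.2, 0.8],   # KMeansSMOTE model
    [0.4, 0.6],   # SMOTE model
    [0.0, 1.0],   # random oversampling, run 1
    [1.0, 0.0],   # random oversampling, run 2
    [0.6, 0.4],   # undersampling ensemble
]
blend = weighted_blend(toy_preds, [1, 1, 0.5, 0.5, 1])
print(blend)
```

Normalising by the weight sum keeps the blended scores in the same range as the inputs, so changing one weight does not require recomputing the divisor by hand.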

In [None]:
sample_submission_blend = sample_submission.copy()
sample_submission_blend['isFraud'] = (sample_submission1a['isFraud']+sample_submission1['isFraud'] + sample_submission2['isFraud']*0.5+ sample_submission3['isFraud']*0.5+ sample_submission4['isFraud'])/4
sample_submission_blend.to_csv('xgboost_under_over_blend.csv')
sample_submission_blend2 = sample_submission.copy()
sample_submission_blend2['isFraud'] = (sample_submission1a['isFraud']*0.5+sample_submission1['isFraud']*0.5 + sample_submission2['isFraud']*0.25+ sample_submission3['isFraud']*0.25+ sample_submission4['isFraud'])/2.5
sample_submission_blend2.to_csv('xgboost_under_over_blend2.csv')
sample_submission_blend_equal = sample_submission.copy()
sample_submission_blend_equal['isFraud'] = (sample_submission1a['isFraud']+sample_submission1['isFraud'] + sample_submission2['isFraud']+ sample_submission3['isFraud']+ sample_submission4['isFraud'])/5
sample_submission_blend_equal.to_csv('xgboost_under_over_blend_equal.csv')