## Data Overview

### Project Goals

The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. The goal is to predict the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.

The data is split into two files, identity and transaction, which are joined on ``TransactionID``. Not all transactions have corresponding identity information.
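
Because not every transaction has an identity row, a left join from the transaction table keeps all transactions and leaves the identity columns empty where nothing matches. The sketch below shows this on two tiny made-up tables; the values and the `identity_coverage` name are illustrative assumptions, not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [10.0, 25.5, 7.0, 99.9],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction; rows without identity data get NaN.
merged = transactions.merge(identity, on="TransactionID", how="left", indicator=True)

# Share of transactions that have a matching identity row.
identity_coverage = (merged["_merge"] == "both").mean()
print(merged)
print(f"identity coverage: {identity_coverage:.0%}")
```

The `indicator=True` column is only a diagnostic here; it can be dropped once the coverage has been checked.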

The training dataset has more than 400 features and about 590,000 samples. This is a supervised binary classification problem: the goal is to predict whether a credit card transaction is fraudulent from the input features described below.

### Information of Variables

According to the Data Description given by the data provider Vesta:

https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203

**Continuous Variables**

``TransactionDT``: timedelta from a given reference datetime (not an actual timestamp)

``TransactionAmt``: transaction payment amount in USD

``dist``: distance, possibly between addresses

``C1`` - ``C14``: counts, such as how many addresses are found to be associated with the payment card. The actual meaning is masked.

``D1`` - ``D15``: timedeltas, such as days since the previous transaction.

``V1`` - ``V339``: Vesta engineered rich features, including ranking, counting, and other entity relations.

``id_01`` - ``id_11`` (in the identity table)
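
Since ``TransactionDT`` is an offset rather than a timestamp, it can still be turned into relative day and hour-of-day features. The sketch below assumes the offset is measured in seconds, which the description above does not state; the `dt_day` and `dt_hour` column names are hypothetical.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60  # assumption: TransactionDT is measured in seconds

# Made-up offsets: exactly one day, one day plus one hour, two days plus 13 hours.
toy = pd.DataFrame({"TransactionDT": [86400, 90000, 172800 + 13 * 3600]})

# Whole days elapsed since the (unknown) reference datetime.
toy["dt_day"] = toy["TransactionDT"] // SECONDS_PER_DAY

# Hour within the day, relative to the reference datetime.
toy["dt_hour"] = (toy["TransactionDT"] % SECONDS_PER_DAY) // 3600

print(toy)
```

If the unit turns out to be something other than seconds, only `SECONDS_PER_DAY` and the hour divisor need to change.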

**Categorical Variables**

From **Transaction** Table:

``ProductCD``: product code, the product for each transaction

``card1`` - ``card6``: payment card information, such as card type, card category, issue bank, country, etc.

``addr1``, ``addr2``: both refer to the purchaser; ``addr1`` is the billing region and ``addr2`` is the billing country

``P_emaildomain``, ``R_emaildomain``: purchaser and recipient email domain

``M1`` - ``M9``: match, such as names on card and address, etc.

From **Identity** Table:

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.

``id_12`` - ``id_38``, ``DeviceType``, ``DeviceInfo``

### Import Packages

In [None]:
import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
from pandas.api.types import CategoricalDtype

In [None]:
plt.rcParams['figure.figsize'] = (8,8)
pd.set_option('display.max_columns', 500)

In [None]:
def create_col_name(base_str, start_int, end_int):
    return [base_str + str(i) for i in range(start_int, end_int+1)]

In [None]:
create_col_name('card', 1, 6)

In [None]:
cat_cols = (['ProductCD'] + create_col_name('card', 1, 6) + ['addr1', 'addr2', 'P_emaildomain', 'R_emaildomain'] + 
            create_col_name('M', 1, 9) + ['DeviceType', 'DeviceInfo'] + create_col_name('id_', 12, 38))

id_cols = ['TransactionID', 'TransactionDT']

target = 'isFraud'

In [None]:
type_map = {c: str for c in cat_cols + id_cols}
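
The point of `type_map` is to stop pandas from inferring numeric types for columns that are really labels. A minimal sketch on an in-memory CSV (the rows and `toy_type_map` are made up for illustration) shows the difference:

```python
import io
import pandas as pd

# Made-up CSV: card1 looks numeric but is really a label with a leading zero.
csv_text = "TransactionID,card1,TransactionAmt\n2987000,0123,68.5\n2987001,4567,29.0\n"

# Without a dtype map, card1 is inferred as an integer and the leading zero is lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str for the categorical and ID columns, the values are kept verbatim.
toy_type_map = {"TransactionID": str, "card1": str}
typed = pd.read_csv(io.StringIO(csv_text), dtype=toy_type_map)

print(inferred["card1"].tolist())
print(typed["card1"].tolist())
```

Columns that are not listed in the map, such as `TransactionAmt` here, are still inferred as usual.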

### Loading the Data

In [None]:
df_train_id = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv', dtype=type_map)
df_train_trans = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv', dtype=type_map)


In [None]:
df_test_id = pd.read_csv('../input/ieee-fraud-detection/test_identity.csv', dtype=type_map)
df_test_trans = pd.read_csv('../input/ieee-fraud-detection/test_transaction.csv', dtype=type_map)

In [None]:
df_train_id.shape, df_train_trans.shape

In [None]:
df_train_id.head()

In [None]:
df_train_trans.head()

In [None]:
df_test_id.head()

#### Merging the Train & Test Data

In [None]:
df_train = df_train_trans.merge(df_train_id, on='TransactionID', how='left')

In [None]:
df_t = df_test_trans.merge(df_test_id, on='TransactionID', how='left')

In [None]:
del df_train_id, df_train_trans

In [None]:
del df_test_id,df_test_trans

### Reduce Memory Usage

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
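
One thing worth keeping in mind about `reduce_mem_usage`: the range checks only test magnitude, so a float column whose values fit inside the `float16` range is downcast even if that loses precision. The sketch below reproduces both cases on a made-up frame; the column names are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "small_int": np.array([0, 50, 100], dtype=np.int64),
    "amount": np.array([1.5, 250.25, 1234.56], dtype=np.float64),
})

# Integers strictly inside the int8 range can be stored in one byte each without loss.
fits_int8 = (toy["small_int"].min() > np.iinfo(np.int8).min
             and toy["small_int"].max() < np.iinfo(np.int8).max)
if fits_int8:
    toy["small_int"] = toy["small_int"].astype(np.int8)

# The float check passes too (all values are far below the float16 maximum),
# but float16 has only about three significant decimal digits.
toy["amount_f16"] = toy["amount"].astype(np.float16)
max_abs_error = (toy["amount"] - toy["amount_f16"].astype(np.float64)).abs().max()

print(toy.dtypes)
print("largest rounding error:", max_abs_error)
```

For tree models this rounding is often acceptable, but it may matter for features built from exact amounts.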

In [None]:
df_train = reduce_mem_usage(df_train)
df_t = reduce_mem_usage(df_t)

In [None]:
import gc
gc.collect()

In [None]:
numeric_cols = [col for col in df_train.columns.tolist() if col not in cat_cols + id_cols + [target]]

In [None]:
#assert(df_train.shape[0]==df_train_trans.shape[0])

In [None]:
df_train.head()

In [None]:
df_train[cat_cols].head()

In [None]:
df_train[numeric_cols].head()

In [None]:
df_t.head()

**The id column names in the train and test datasets do not match (``id-01`` vs. ``id_01``), so the test columns are renamed to match the train columns.**

In [None]:
# The test identity columns use hyphens ("id-01") where train uses underscores ("id_01").
df_test = df_t.rename(columns={f"id-{i:02d}": f"id_{i:02d}" for i in range(1, 39)})

In [None]:
df_test.head()

In [None]:
del df_t

## Feature Engineering

#### Count NULL Values

In [None]:
df_train['n_nulls'] = df_train.isnull().sum(axis=1)
df_test['n_nulls'] = df_test.isnull().sum(axis=1)

numeric_cols += ['n_nulls']
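
As a small illustration of the `n_nulls` feature, the sketch below counts missing values per row on a made-up frame. Note that the count has to be taken before any `fillna` call, otherwise every row would report zero.

```python
import numpy as np
import pandas as pd

# Made-up rows with missing values in both numeric and string columns.
toy = pd.DataFrame({
    "id_01": [0.0, np.nan, np.nan],
    "DeviceType": ["mobile", None, "desktop"],
    "TransactionAmt": [10.0, 20.0, np.nan],
})

# axis=1 sums the boolean null mask across columns, giving one count per row.
toy["n_nulls"] = toy.isnull().sum(axis=1)
print(toy)
```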

#### Remove Version Numbers

In [None]:
df_train['id_30'].unique()[:20]

In [None]:
df_test['id_30'].unique()[:20]

In [None]:
for col in ['id_30', 'id_31']:
    # Strip every non-letter character (digits, dots, spaces, underscores) to drop version numbers.
    df_train[col+'_clean'] = df_train[col].str.replace(r'[^A-Za-z]', '', regex=True)
    df_test[col+'_clean'] = df_test[col].str.replace(r'[^A-Za-z]', '', regex=True)
    cat_cols += [col+'_clean']
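
To see what the pattern `[^A-Za-z]` actually keeps, here is the same substitution applied with the standard `re` module to a few made-up OS and browser strings. Note that it removes spaces as well as digits, so multi-word names are collapsed.

```python
import re

# Same pattern as above: anything that is not an ASCII letter is removed.
NON_LETTERS = re.compile(r"[^A-Za-z]")

# Illustrative values only; the real id_30 / id_31 columns may differ.
raw_values = ["Windows 10", "Mac OS X 10_11_6", "chrome 63.0", "iOS 11.2.1"]
cleaned = [NON_LETTERS.sub("", value) for value in raw_values]

for raw, clean in zip(raw_values, cleaned):
    print(f"{raw!r} -> {clean!r}")
```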

### Convert Categories to Ints for LightGBM

In [None]:
def cat_to_int(df_train, df_test, col):
    catDtype = CategoricalDtype(categories=df_train[col].value_counts().index.values)
    return df_train[col].astype(catDtype).cat.codes.values, df_test[col].astype(catDtype).cat.codes.values
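
A short sketch of what `cat_to_int` produces, using made-up card brands: categories are ordered by training frequency, so the most common value gets code 0, and any test value that never appears in train is mapped to -1.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up columns; "amex" appears only in the test data.
train_col = pd.Series(["visa", "visa", "mastercard", "visa", "mastercard", "discover"])
test_col = pd.Series(["mastercard", "visa", "amex"])

# Categories ordered by training frequency, as in cat_to_int.
cat_dtype = CategoricalDtype(categories=train_col.value_counts().index.values)

train_codes = train_col.astype(cat_dtype).cat.codes.values
test_codes = test_col.astype(cat_dtype).cat.codes.values

print(train_codes)
print(test_codes)
```

Because unseen values all share the code -1, the model cannot tell them apart from each other.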

In [None]:
df_train.loc[:,cat_cols] = df_train[cat_cols].fillna('<UNK>')
df_test.loc[:,cat_cols] = df_test[cat_cols].fillna('<UNK>')

df_train = df_train.fillna(-999)
df_test = df_test.fillna(-999)

In [None]:
cat_cols_encoded = list()
for col in cat_cols:
    df_train[col+'_code'], df_test[col+'_code'] = cat_to_int(df_train, df_test, col)
    cat_cols_encoded.append(col+'_code')

## Modeling

#### CatBoost Modeling

In [None]:
from catboost import Pool, CatBoostClassifier
import lightgbm as lgb
from sklearn.model_selection import KFold
from pandas.api.types import CategoricalDtype

In [None]:
features = cat_cols + numeric_cols

In [None]:
N_val = int(df_train.shape[0]*0.05)
df_val = df_train.sort_values(by='TransactionDT').tail(N_val)
df_train_sample = df_train[~df_train.index.isin(df_val.index)]
shuffle_ks = True
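
The holdout above is the most recent 5% of transactions by ``TransactionDT``, so validation rows always come after the training rows in time. The sketch below repeats the same three steps on a made-up frame, with a larger holdout fraction so the split is visible on ten rows.

```python
import pandas as pd

# Made-up transactions in arbitrary row order.
toy = pd.DataFrame({
    "TransactionDT": [500, 100, 400, 200, 300, 900, 700, 600, 800, 1000],
    "isFraud": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

holdout_fraction = 0.2  # the notebook uses 0.05; larger here for the toy data
n_val = int(toy.shape[0] * holdout_fraction)

# Latest rows by time become the holdout; everything else is training data.
val = toy.sort_values(by="TransactionDT").tail(n_val)
train = toy[~toy.index.isin(val.index)]

print(train["TransactionDT"].max(), val["TransactionDT"].min())
```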

In [None]:
df_train.shape[0], df_train_sample.shape[0], df_val.shape[0]

In [None]:
df_train_sample.head()

In [None]:
def build_pool(df, features, cat_cols, target=None):
    if target:
        data = Pool(
            data=df[features],
            label=df[target],
            cat_features=cat_cols
        )
    else:
        data = Pool(
            data=df[features],
            cat_features=cat_cols
        )
        
    return data

In [None]:
train_data = build_pool(df_train_sample, features, cat_cols, target)
holdout_data = build_pool(df_val, features, cat_cols, target)

In [None]:
len(train_data.get_label())

In [None]:
df_train_sample.shape

In [None]:
w = (df_train_sample[target]==0).sum() / (df_train_sample[target]==1).sum() / 5
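
The weight `w` is the negative-to-positive class ratio divided by 5, so the positive class is upweighted but by less than full balancing would give. A worked example with made-up class counts (not the real dataset's):

```python
# Made-up label counts: 9650 legitimate and 350 fraudulent transactions.
labels = [0] * 9650 + [1] * 350

n_negative = sum(1 for y in labels if y == 0)
n_positive = sum(1 for y in labels if y == 1)

# Full balancing would weight each positive by this ratio.
balance_ratio = n_negative / n_positive

# The notebook divides the ratio by 5, giving a milder positive-class weight.
damped_weight = balance_ratio / 5

print(round(balance_ratio, 2), round(damped_weight, 2))
```

The divisor 5 is the notebook's choice; treat it as a tunable value rather than a recommended constant.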

In [None]:
params = {
    'iterations': 1500,
    'learning_rate': 0.05,
    #'depth': 15,
    'eval_metric': 'AUC',
    'od_type': 'Iter',
    'od_wait': 50,
    'task_type': 'CPU',
    'devices': '3',
    'scale_pos_weight': w,
}

In [None]:
import warnings

model_single = CatBoostClassifier(**params)
model_single.fit(train_data, eval_set=holdout_data, plot=True, verbose=False)

### Train LGB

In [None]:
train_data_lgb = lgb.Dataset(
    data=df_train_sample[numeric_cols + cat_cols_encoded], 
    label=df_train_sample[target],
    categorical_feature=cat_cols_encoded,
    free_raw_data=False,
)

holdout_data_lgb = lgb.Dataset(
    data=df_val[numeric_cols + cat_cols_encoded], 
    label=df_val[target],
    categorical_feature=cat_cols_encoded,
    free_raw_data=False,
)

In [None]:
# https://www.kaggle.com/timon88/lgbm-baseline-small-fe-no-blend

lgb_params = {
    'num_leaves': 491,
    'min_data_in_leaf': 106,
    'max_depth': -1,
    'min_child_weight': 0.03,
    'feature_fraction': 0.38,
    'bagging_fraction': 0.42,
    'objective': 'binary',
    'learning_rate': 0.0069,
    "boosting_type": "gbdt",
    "bagging_seed": 0,
    "metric": 'auc',
    "verbosity": -1,
    'reg_alpha': 0.39,
    'reg_lambda': 0.65,
    'random_state': 0,
    'scale_pos_weight': w,
}

In [None]:
%%time 

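num_round = 3000
# Early stopping is passed as a callback; the early_stopping_rounds keyword was removed from lgb.train in LightGBM 4.0.
bst = lgb.train(lgb_params, train_data_lgb, num_round, valid_sets=[holdout_data_lgb],
                callbacks=[lgb.early_stopping(stopping_rounds=100)])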

In [None]:
bst.best_score

### Evaluate Performance

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
def plot_roc(y_trues, y_preds, labels, x_max=1.0):
    fig, ax = plt.subplots()
    for i, y_pred in enumerate(y_preds):
        y_true = y_trues[i]
        fpr, tpr, thresholds = roc_curve(y_true, y_pred)
        auc = roc_auc_score(y_true, y_pred)
        ax.plot(fpr, tpr, label='%s; AUC=%.3f' % (labels[i], auc), marker='o', markersize=1)

    ax.legend()
    ax.grid()
    ax.plot(np.linspace(0, 1, 20), np.linspace(0, 1, 20), linestyle='--')
    ax.set_title('ROC curve')
    ax.set_xlabel('False Positive Rate')
    ax.set_xlim([-0.01, x_max])
    _ = ax.set_ylabel('True Positive Rate')

In [None]:
plot_roc(
    [df_val[target]]*2,
    [model_single.predict_proba(holdout_data)[:,1], bst.predict(df_val[numeric_cols+cat_cols_encoded].values, num_iteration=bst.best_iteration)],
    ['Single Catboost Model', 'LGB']
)
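
For intuition about the AUC values shown in the legend: AUC equals the share of (fraud, non-fraud) pairs in which the fraud case receives the higher score, with ties counted as half. The sketch below computes it that way in plain Python on made-up scores; `pairwise_auc` is a hypothetical helper for illustration and is far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: one positive (0.35) is ranked below one negative (0.4).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)
```

Since only the ranking matters, rescaling the predicted probabilities does not change the AUC.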

## CV

Cross-validation using the full dataset

In [None]:
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True)
models_cv = list()
for train_index, test_index in kf.split(df_train.index.values):
    print("Train shape: ", len(train_index))
    train_data = build_pool(df_train.iloc[train_index,:], features, cat_cols, target)
    holdout_data = build_pool(df_train.iloc[test_index,:], features, cat_cols, target)
    
    model = CatBoostClassifier(ignored_features=None, **params)
    model.fit(train_data, eval_set=holdout_data, plot=True, verbose=False)
    models_cv.append(model)
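
`KFold.split` yields positional indices, which is why the loop above uses `iloc`. The sketch below shows, on ten dummy rows, that each row lands in exactly one validation fold; the `random_state` is an addition for repeatability and is not set in the notebook. Note that shuffled folds mix early and late transactions, unlike the time-ordered holdout used earlier.

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 10
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # random_state added for repeatability

validation_counts = np.zeros(n_rows, dtype=int)
fold_sizes = []
for train_idx, val_idx in kf.split(np.arange(n_rows)):
    # Indices are positions into the data, not index labels.
    validation_counts[val_idx] += 1
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
print(validation_counts)
```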

In [None]:
test_data = build_pool(df_test, features, cat_cols)

In [None]:
y_test_catboost = model_single.predict_proba(test_data)[:,1]

In [None]:
y_test_lgb = bst.predict(df_test[numeric_cols + cat_cols_encoded].values, num_iteration=bst.best_iteration)

In [None]:
fig, ax = plt.subplots()
ax.scatter(x=y_test_catboost, y=y_test_lgb, s=1)

In [None]:
y_test_hats = list()
for model in models_cv:
    y_test_hats.append(model.predict_proba(test_data)[:,1])

In [None]:
y_test_hat = np.vstack([y_test_catboost, y_test_lgb]+y_test_hats).mean(axis=0)
df_test['isFraud'] = y_test_hat
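
The blend above is an unweighted mean over all stacked prediction vectors, so each model counts equally: with one single CatBoost model, one LightGBM model, and five fold models, LightGBM contributes 1/7 of the average. A minimal sketch with made-up predictions (two fold models instead of five, for brevity):

```python
import numpy as np

# Made-up predicted probabilities for two test rows.
pred_catboost_single = np.array([0.10, 0.80])
pred_lgb = np.array([0.20, 0.60])
pred_catboost_folds = [np.array([0.12, 0.78]), np.array([0.08, 0.82])]

# One row per model, one column per test sample.
stacked = np.vstack([pred_catboost_single, pred_lgb] + pred_catboost_folds)

# Column-wise mean gives the blended prediction for each test sample.
blend = stacked.mean(axis=0)

# Each model's share of the blend is 1 / number of stacked models.
lgb_share = 1 / stacked.shape[0]
print(blend, lgb_share)
```

If a different balance between model families is wanted, `np.average` with an explicit `weights` argument is one way to express it.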

### Generate Test Predictions

In [None]:
df_test[['TransactionID', 'isFraud']].to_csv('my_submission_v1.csv', index=False)

In [None]:
submission = pd.read_csv('my_submission_v1.csv')

In [None]:
submission.head()

In [None]:
#!kaggle competitions submit -c ieee-fraud-detection -f my_submission_v1.csv -m "Fraud Detection with CatBoost + LGB-weighted + Cat_5fold"