# Overall summary - Methodology

Create a data pipeline to depict your sequence of actions that move data from a source to a destination.
Show and explain all phases or steps involved in the data mining process, including the data
preprocessing (ETL/ELT), modelling, evaluation, and deployment in your project development.

## Preprocessing
We apply a few transformations when reading the data; the data is then fed into a preprocessing pipeline that we built with the `sklearn` library.

Suppose that we have 10 input variables and 1 categorical target feature. We write code that normalizes each input feature and one-hot encodes the categorical feature. `ColumnTransformer` and `Pipeline` let us describe these processing steps so that the preprocessor can receive and transform any data that has the same format. Ideally, the preprocessor would also know how to inverse-transform each encoded feature. However, this is not supported by the current `sklearn` framework.

When building a pipeline with `sklearn`, `ColumnTransformer` applies its transformations in parallel to separate subsets of columns, whereas `Pipeline` applies its transformations sequentially. We therefore combine the two to achieve the desired overall transformation.
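The combination can be sketched as follows. This is a toy example under stated assumptions: the frame `toy` and its columns `amount` and `kind` are hypothetical and are not part of our dataset. The numeric branch is a `Pipeline` (impute, then scale), and the `ColumnTransformer` runs that branch side by side with a one-hot encoder and concatenates the results in the order the branches are listed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 20.0],
    "kind": ["x", "y", "x", "y"],
})

# Sequential: the scaler receives the output of the imputer.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Parallel: each branch only sees its own columns.
toy_preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipe, ["amount"]),
        ("cat", OneHotEncoder(), ["kind"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out.shape)  # one scaled column followed by two one-hot columns
```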

As in the previous EDA section, we reduce features on the basis of correlation. To impute missing values and aggregate grouped variables, we define a basic unit, `impute_reduce_pipe`. `correlated_group_reducer` is built from `impute_reduce_pipe`.

Initially, we wanted the reducer to adjust its imputation and aggregation strategies to the size of each variable group. For example, a group with a single variable is passed through unchanged; a group of 2 to 5 variables is averaged; a group of more than 5 variables is reduced with PCA. However, given the size of the dataset, KNN imputation and dimension reduction techniques do not scale well and are too slow. Hence, we did not adopt this strategy.
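For reference, the size-dependent strategy we abandoned could look roughly like the sketch below. The helper names `row_mean`, `pick_reducer`, and `adaptive_reducer` are hypothetical, the thresholds follow the example in the text, and mean imputation stands in for KNN imputation to keep the sketch fast. It runs on small random data only and says nothing about performance on the full dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def row_mean(X):
    """Average the columns of a group into a single 2D column."""
    return np.asarray(X).mean(axis=1).reshape(-1, 1)


def pick_reducer(group_size):
    """Choose an aggregation step from the number of columns in the group."""
    if group_size <= 1:
        return FunctionTransformer()          # identity: nothing to aggregate
    if group_size <= 5:
        return FunctionTransformer(row_mean)  # 2 to 5 columns: take the average
    return PCA(n_components=1)                # more than 5 columns: project to 1D


def adaptive_reducer(group_size):
    """Impute first, then reduce the group to a single column."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="mean")),  # stand-in for KNN imputation
        ("reduce", pick_reducer(group_size)),
    ])


rng = np.random.default_rng(0)
for size in (1, 3, 8):
    X_demo = rng.normal(size=(50, size))
    X_demo[0, 0] = np.nan  # inject a missing value
    reduced = adaptive_reducer(size).fit_transform(X_demo)
    print(size, reduced.shape)
```

Every group, regardless of size, ends up as a single column, which is what lets the reducers be dropped into one `ColumnTransformer`.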

As a result, we reduce more than 300 features to 185 input features.

## Model
According to the competition's winning solution, the XGB model performs very well, so we adopt the same model. Once model development and training are done, we submit `submission.csv` to Kaggle to view the precision score.

Since the distribution of the target feature is heavily imbalanced, we use the precision score rather than accuracy.
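A small sketch shows why accuracy is misleading here. The labels are hypothetical (1,000 transactions with 3% fraud, not our real data): a model that never flags fraud and a model that flags 40 transactions reach the same accuracy, and only precision tells them apart.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels: 30 fraud cases followed by 970 legitimate ones.
y_true = np.array([1] * 30 + [0] * 970)

# Model A never predicts fraud.
y_all_negative = np.zeros_like(y_true)

# Model B flags 40 transactions: 20 real fraud cases and 20 false alarms.
y_flagged = np.zeros_like(y_true)
y_flagged[:20] = 1     # true positives
y_flagged[30:50] = 1   # false positives

acc_a = accuracy_score(y_true, y_all_negative)
acc_b = accuracy_score(y_true, y_flagged)
prec_a = precision_score(y_true, y_all_negative, zero_division=0)
prec_b = precision_score(y_true, y_flagged)

print(acc_a, acc_b)    # 0.97 0.97
print(prec_a, prec_b)  # 0.0 0.5
```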

After trying different models with different parameters (e.g., random forest and decision tree), the precision score is still capped at 0.70, whereas the top 100 solutions all score above 0.95. We attribute this to two things: we do not include the identity dataset in model training, and we do not aggressively engineer our features via time consistency checking.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from functools import reduce
import sklearn as sk
from variable_group import *

In [2]:
def flatten_list(li):
    # concatenate an iterable of lists into one flat list
    return reduce(lambda acm, cur: acm + cur, li)

vesta_subset_grp = flatten_list(vesta_subset_grp_pages)
vesta_columns = [f'V{i+1}' for i in range(339)]

In [3]:
input_features = ['TransactionAmt', 'dist1', 'dist2'] + vesta_columns + flatten_list(new_time_subsets) + flatten_list(count_feature_subset) + categorical_cols
len(input_features)

381

In [4]:
# the file is very large, so `nrows` can be set to read only the first rows
def get_data(name, parent_zip = 'ieee-fraud-detection.zip', nrows = None):
    from zipfile import ZipFile
    with ZipFile(parent_zip, 'r') as f:
        df = pd.read_csv(f.open(name), nrows = nrows)
        # preprocess
        df.index = df['TransactionID']
        df = df.drop('TransactionID', axis=1)
        
        df['TransactionDay'] = df['TransactionDT'] // (24*60*60)
        
        df[categorical_cols] = df[categorical_cols].astype('category')
        
    return df

transaction_df = get_data('train_transaction.csv')

In [5]:
def create_test_df(n = 100):
    test_df = transaction_df.head(n).copy()
    return test_df.drop(['isFraud', 'TransactionDT', 'addr2'], axis=1)

test_df = create_test_df(100)
test_df.head()

TransactionID,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,dist1,...,V331,V332,V333,V334,V335,V336,V337,V338,V339,TransactionDay
2987000,68.5,W,13926,,150.0,discover,142.0,credit,315.0,19.0,...,,,,,,,,,,1
2987001,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,,...,,,,,,,,,,1
2987002,59.0,W,4663,490.0,150.0,visa,166.0,debit,330.0,287.0,...,,,,,,,,,,1
2987003,50.0,W,18132,567.0,150.0,mastercard,117.0,debit,476.0,,...,,,,,,,,,,1
2987004,50.0,H,4497,514.0,150.0,mastercard,102.0,credit,420.0,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [6]:
from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline
'''
Notes

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
'''
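def mean_agg(X):
    # row-wise mean over the columns of a group, reshaped so that it is 2D
    return np.mean(X, axis=1).reshape(-1, 1)

def MeanTransformer():
    # FunctionTransformer expects a callable, not an already computed array
    return FunctionTransformer(mean_agg)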

def impute_reduce_pipe(imputer, aggregrator):
    return Pipeline([
        ('impute', imputer),
        ('reduce', aggregrator),
    ])

def vesta_reducer(aggregrator = None):
    if aggregrator is None:
        aggregrator = FunctionTransformer()
    return impute_reduce_pipe(SimpleImputer(strategy='constant', fill_value=-999), aggregrator)

def correlated_group_reducer(subset_group, prefix):
    
    transformers = []
    for i, grp in enumerate(subset_group):
        sz = len(grp)
        if sz <= 1:
            transformers.append((f'{prefix}_{i}', vesta_reducer(), grp))
        # elif sz >= 7: # too slow
        #     transformers.append((f'{prefix}_{i}', vesta_reducer(PCA(1)), grp))
        else:
            transformers.append((f'{prefix}_{i}',vesta_reducer(MeanTransformer()), grp))
    return transformers


email_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')), 
    ('encode', OrdinalEncoder())
    ])

vesta_transfomers = correlated_group_reducer(vesta_subset_grp, 'TransV')
count_transformers = correlated_group_reducer(count_feature_subset, 'TransC')
time_transformers = correlated_group_reducer(new_time_subsets, 'TransD')

transfomers = []
transfomers += count_transformers
transfomers += [
    ('identity', vesta_reducer(), ['dist1', 'dist2']), 
    ('log', impute_reduce_pipe(SimpleImputer(strategy='mean'),FunctionTransformer(np.log10)), ['TransactionAmt']),
    # ('KNNImputer1', KNNImputer(), time_features),
    ('cat', OrdinalEncoder(encoded_missing_value=-999), categorical_cols),
    ('email_imputer', email_pipe, ['P_emaildomain', 'R_emaildomain']),
]
transfomers += time_transformers
transfomers += vesta_transfomers
# check if all input features are covered
def check():
    prv = set(input_features)
    cur = set(flatten_list(trans[2] for trans in transfomers))
    assert(prv.difference(cur) == set())
    return len(transfomers)
print(check())
preprocessor = ColumnTransformer(transfomers, remainder='drop')

162


In [7]:
# test preprocessor on small data
test_df = create_test_df()
res = preprocessor.fit_transform(test_df)
assert(pd.DataFrame(res).isna().any().sum() == 0) # ensure there is no missing values
res.astype('float'); # ensure all values are float type

In [8]:
# test preprocessor on full data
y = transaction_df['isFraud']

test_df = transaction_df.drop(['isFraud', 'TransactionDT', 'addr2'], axis=1)
X = preprocessor.fit_transform(test_df)
assert(pd.DataFrame(X).isna().any().sum() == 0) # ensure there is no missing values
X.astype('float'); # ensure all values are float type

In [9]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X,y, test_size = 0.5, random_state=42)
[tmp.shape for tmp in [X_train,y_train,X_test, y_test]]

[(295270, 180), (295270,), (295270, 180), (295270,)]

In [24]:
import xgboost as xgb
import os
import joblib
if os.path.exists('saved_model/xgb_1.joblib'):  # also works when the folder does not exist yet
    clf = joblib.load('saved_model/xgb_1.joblib')
else:
    print('start fitting')
    clf = xgb.XGBClassifier( 
            n_estimators=2000,
            max_depth=10, 
            learning_rate=0.02, 
            subsample=0.8,
            colsample_bytree=0.4, 
            missing=-1,  # note: the preprocessor fills most missing values with -999, not -1
            eval_metric='auc',
            # USE CPU
            #tree_method='hist' 
            # USE GPU
            tree_method='gpu_hist',
        )
    clf.fit(X_train, y_train, verbose=True)
    os.makedirs('saved_model', exist_ok=True)  # create the output folder if needed
    joblib.dump(clf, 'saved_model/xgb_1.joblib')

import sklearn.metrics  # make sure the `sk.metrics` submodule is loaded

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
{
    'train': sk.metrics.accuracy_score(y_train, y_train_pred),
    'test': sk.metrics.accuracy_score(y_test, y_test_pred),
    'train_precision': sk.metrics.precision_score(y_train, y_train_pred),
    'test_precision': sk.metrics.precision_score(y_test, y_test_pred)
}

{'train': 0.993355234192434,
 'test': 0.9827005791309649,
 'train_precision': 0.9961580021611238,
 'test_precision': 0.9409136047666335}

# Submission

In [25]:
test_transaction_df = get_data('test_transaction.csv')

test_dat = test_transaction_df.drop(['TransactionDT', 'addr2'], axis=1)
test_dat = preprocessor.transform(test_dat)  # reuse the preprocessor fitted on the training data; do not refit on test data
assert(pd.DataFrame(test_dat).isna().any().sum() == 0) # ensure there is no missing values
test_X = test_dat.astype('float')
test_X

MemoryError: Unable to allocate 1.42 GiB for an array with shape (375, 506691) and data type float64

In [None]:
test_pred_y = clf.predict(test_X)
test_transaction_df['isFraud'] = test_pred_y
test_transaction_df[['isFraud']].to_csv('submission.csv')