# Final Project
Syracuse Applied Data Science, IST-718 Big Data Analytics  

Team: AUQ-42
Team Members:
* Ryan Timbrook
* Amanda Carvalho
* Luigi Penaloza
* Chikeung Cheung

DATE:   
ASSIGNMENT: IEEE-CIS Fraud Detection (Kaggle competition)


## Business Question
Improve the efficacy of fraudulent transaction alerts, helping hundreds of thousands of businesses reduce their fraud losses and increase their revenue while protecting consumers' peace of mind and wallets!





## Problems to solve
Identify fraudulent e-commerce transactions in real time using advanced machine learning algorithms, and automate alerts that block highly suspicious activity.

## Why the problem is important
Everyone who uses e-commerce technology and modern banking systems is at risk of becoming a victim of fraud. Fraud costs both the individual and the merchant who refunds fraudulent transactions, and not all scenarios are covered, leaving many individuals to pay.  

Chargebacks are a growing and costly burden for merchants. By eliminating chargebacks, fines, and fees related to third-party fraud and unauthorized charges, the client, VESTA, can significantly reduce the operational costs and resources associated with complex chargeback management solutions and with the specialized staff needed for rapid, scalable business growth. This leaves all of the cost risk with the client. Improving automated fraud detection technology will greatly reduce this cost.  
 

# About the Data
The core data set for this project is provided by VESTA, the world's leading payment service company, through a Kaggle competition facilitated by the [IEEE Computational Intelligence Society](https://www.kaggle.com/c/ieee-fraud-detection/data).

## [VESTA](https://www.kaggle.com/c/ieee-fraud-detection/data)
The task is to predict the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.  

The data is split into two files, identity and transaction, which are joined by **TransactionID**. *Not all transactions have corresponding identity information.*
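
As a minimal sketch of what this join implies, the toy example below uses made-up rows (only the column names TransactionID, TransactionAmt, and DeviceType come from the data set). A left join keeps every transaction and leaves the identity columns empty where no identity record exists; the `indicator` flag is used here only to measure how many transactions matched.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity files (values are made up).
transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [68.5, 29.0, 59.0, 50.0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

# A left join keeps every transaction; identity columns are NaN where no match exists.
merged = transaction.merge(identity, on="TransactionID", how="left", indicator=True)

# Share of transactions that have identity information attached.
identity_coverage = (merged["_merge"] == "both").mean()

print(merged)
print(f"identity coverage: {identity_coverage:.0%}")
```

A left join (rather than an inner join) matters here because dropping transactions without identity rows would also drop their isFraud labels.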

#### Categorical Features - Transaction
* ProductCD
* card1 - card6
* addr1, addr2
* P_emaildomain
* R_emaildomain
* M1 - M9

#### Categorical Features - Identity
* DeviceType
* DeviceInfo
* id_12 - id_38

*The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp).*
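
Because TransactionDT is only an offset, it can still be turned into relative time features. The sketch below rests on two assumptions that the data description does not confirm: that the offset is measured in seconds, and a placeholder reference date (`REFERENCE_DATE`) chosen arbitrarily for illustration. The column names `pseudo_datetime`, `elapsed_days`, and `hour_in_cycle` are hypothetical.

```python
import pandas as pd

# ASSUMPTION: TransactionDT is measured in seconds (the unit is not documented).
# ASSUMPTION: REFERENCE_DATE is an arbitrary placeholder, not the real reference datetime.
REFERENCE_DATE = pd.Timestamp("2000-01-01")

df = pd.DataFrame({"TransactionDT": [86400, 86401, 86469, 90000]})

# Interpret the raw offsets as timedeltas.
delta = pd.to_timedelta(df["TransactionDT"], unit="s")

# A pseudo timestamp, only meaningful relative to the placeholder reference date.
df["pseudo_datetime"] = REFERENCE_DATE + delta

# Relative features that do not depend on which reference date is chosen.
df["elapsed_days"] = delta.dt.days
df["hour_in_cycle"] = delta.dt.components["hours"]

print(df)
```

If the true reference is not midnight, `hour_in_cycle` is shifted by a constant, so it should be read as a position within a 24-hour cycle rather than as a clock time.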

##### Files
* train_{transaction, identity}.csv - the training set
* test_{transaction, identity}.csv - the test set (you must predict the isFraud value for these observations)
* sample_submission.csv - a sample submission file in the correct format

## CCFD (Kaggle)



## FTC

## Findings / Recommendations
place findings and recommendations here  





### --------------------------------------------------------------------------------------------
## Coding Environment Setup

# ONLY RUN WHEN WORKING ON COLAB

In [1]:
# toggle for working with colab
isColab = False

In [None]:
# mount google drive for working in colab
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

# working within colab, set base working directory
base_dir = "./gdrive/My Drive/IST718_PRJ_FraudDetection/workspace"

# validate directory mapping
#ls f'{base_dir}'

# upload custom python files
from google.colab import files
uploaded_files = files.upload()

# print files uploaded
for f in uploaded_files.keys():
  print(f'file name: {f}')

isColab = True

In [2]:
# import packages for analysis and modeling
import pandas as pd #data frame operations
import numpy as np #arrays and math functions

import warnings
warnings.filterwarnings('ignore')

import logging
logging.getLogger('tensorflow').disabled = True

In [3]:
# import custom packages
import auq_42_utils as auq

All the files are downloaded


In [4]:
# set global properties
if not isColab:
    dataDir = './data/'
    outputDir = './output/'
    configDir = './config/'
    logOutDir = './logs/'
    imageDir = './images/'
    modelDir = './models/'
else:
    # working within colab
    dataDir = f'{base_dir}/data/'
    outputDir = f'{base_dir}/output/'
    configDir = f'{base_dir}/config/'
    logOutDir = f'{base_dir}/logs/'
    imageDir = f'{base_dir}/images/'
    modelDir = f'{base_dir}/models/'


In [5]:
# get a logger for troubleshooting / data exploration
appName = 'rt_fraud_obtain_data' # sets the logger file name
loglevel = 10 # 10-DEBUG, 20-INFO, 30-WARNING, 40-ERROR, 50-CRITICAL
logger = auq.getFileLogger(logOutDir,appName,level=loglevel)
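
The implementation of `auq.getFileLogger` is not shown in this notebook. The cell outputs suggest that messages are echoed to the notebook as well as written to a file, so a helper with the same call shape might look like the sketch below. The function name `get_file_logger`, the log file naming, and the message formats are assumptions, not the team's actual code.

```python
import logging
import os
import sys
import tempfile


def get_file_logger(log_dir, app_name, level=logging.INFO):
    """Hypothetical stand-in for auq.getFileLogger.

    Writes to <log_dir>/<app_name>.log and echoes messages to stdout.
    """
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(app_name)
    logger.setLevel(level)
    logger.propagate = False  # keep messages from being repeated by the root logger

    # Only attach handlers once, so re-running a notebook cell does not duplicate output.
    if not logger.handlers:
        file_handler = logging.FileHandler(os.path.join(log_dir, f"{app_name}.log"))
        file_handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(file_handler)

        stream_handler = logging.StreamHandler(sys.stdout)
        stream_handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(stream_handler)
    return logger


# Usage: a temporary directory stands in for logOutDir.
log_dir = tempfile.mkdtemp()
logger = get_file_logger(log_dir, "demo_fraud_logger", level=logging.DEBUG)  # 10 == DEBUG
logger.debug("logger ready")
log_path = os.path.join(log_dir, "demo_fraud_logger.log")
```

Guarding on `logger.handlers` is a design choice for notebooks, where the same cell is often executed several times in one session.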

## OBTAIN the data
Import external datasets for evaluation

### Vesta Dataset
#### IEEE-CIS Fraud Detection
##### Can you detect fraud from customer transactions?
* kaggle [link](https://www.kaggle.com/c/ieee-fraud-detection/data)

### FTC Dataset

### Credit Card Fraud Detection
#### Anonymized credit card transactions labeled as fraudulent or genuine
* Kaggle [link](https://kaggle.com/mlg-ulb/creditcardfraud)

In [6]:
# read in datasets
import pickle
# look for reduced memory dataset first
isMemoryReductionTrain = False
isMemoryReductionTest = False

# training datasets
try:
    with open(f'{dataDir}v_train.pkl','rb') as f:
        v_train = pickle.load(f)
        logger.info('saved pickled vesta train dataset found...')
        isMemoryReductionTrain = True
except FileNotFoundError:
    logger.info('v_train file not found... pulling data in from csv files')
    # VESTA
    v_train_identity = pd.read_csv(f'{dataDir}ieee_train_identity.csv')
    v_train_transaction = pd.read_csv(f'{dataDir}ieee_train_transaction.csv')
    # merge VESTA training datasets
    v_train = pd.merge(v_train_transaction, v_train_identity, on='TransactionID', how='left') 
    # free up memory of loaded datasets after merging
    v_train_identity = None
    v_train_transaction = None

# testing datasets
try:
    with open(f'{dataDir}v_test.pkl','rb') as f:
        v_test = pickle.load(f)
        logger.info('saved pickled vesta testing dataset found...')
        isMemoryReductionTest = True
except FileNotFoundError:
    logger.info('v_test file not found... pulling data in from csv files')
    v_test_identity = pd.read_csv(f'{dataDir}ieee_test_identity.csv')
    v_test_transaction = pd.read_csv(f'{dataDir}ieee_test_transaction.csv')
    # merge test datasets
    v_test = pd.merge(v_test_transaction, v_test_identity, on='TransactionID', how='left')
    # free up memory of loaded datasets after merging
    v_test_identity = None
    v_test_transaction = None


v_train file not found... pulling data in from csv files
v_test file not found... pulling data in from csv files


In [7]:
# look at the dataset shapes
logger.info(f'Vesta Train dataset shape: [{v_train.shape}]')
logger.info(f'Vesta Test dataset shape: [{v_test.shape}]')
logger.info(f'Vesta Train dataset Total NAN count: [{auq.getNaNCount(v_train)[0]}]')
logger.info(f'Vesta Test dataset Total NAN count: [{auq.getNaNCount(v_test)[0]}]')

Vesta Train dataset shape: [(590540, 434)]
Vesta Test dataset shape: [(506691, 433)]
Vesta Train dataset Total NAN count: [115523073]
Vesta Test dataset Total NAN count: [90186908]


In [9]:
# which columns have NaN fields, and how many are there
logger.info(f'Vesta Train dataset column NANs {auq.getColumnsNaNCnts(v_train, logger)}')

Vesta Train dataset column NANs [('card2', 8933), ('card3', 1565), ('card4', 1577), ('card5', 4259), ('card6', 1571), ('addr1', 65706), ('addr2', 65706), ('dist1', 352271), ('dist2', 552913), ('P_emaildomain', 94456), ('R_emaildomain', 453249), ('D1', 1269), ('D2', 280797), ('D3', 262878), ('D4', 168922), ('D5', 309841), ('D6', 517353), ('D7', 551623), ('D8', 515614), ('D9', 515614), ('D10', 76022), ('D11', 279287), ('D12', 525823), ('D13', 528588), ('D14', 528353), ('D15', 89113), ('M1', 271100), ('M2', 271100), ('M3', 271100), ('M4', 281444), ('M5', 350482), ('M6', 169360), ('M7', 346265), ('M8', 346252), ('M9', 346252), ('V1', 279287), ('V2', 279287), ('V3', 279287), ('V4', 279287), ('V5', 279287), ('V6', 279287), ('V7', 279287), ('V8', 279287), ('V9', 279287), ('V10', 279287), ('V11', 279287), ('V12', 76073), ('V13', 76073), ('V14', 76073), ('V15', 76073), ('V16', 76073), ('V17', 76073), ('V18', 76073), ('V19', 76073), ('V20', 76073), ('V21', 76073), ('V22', 76073), ('V23', 76073),
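
The helpers `auq.getNaNCount` and `auq.getColumnsNaNCnts` are custom and not shown here. As a rough sketch of how the same counts could be obtained with plain pandas, the example below uses a small made-up frame; the variable names are hypothetical and the output format is only modeled on the log lines above.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (values are illustrative only).
df = pd.DataFrame({
    "card1": [13926, 2755, 4663],
    "card2": [np.nan, 404.0, 490.0],
    "addr1": [np.nan, np.nan, 330.0],
})

# Total number of missing cells in the frame.
total_nan = int(df.isna().sum().sum())

# (column, count) pairs for columns that contain at least one NaN.
per_column = df.isna().sum()
nan_columns = [(col, int(n)) for col, n in per_column.items() if n > 0]

# Fraction missing per column, highest first; useful for deciding what to drop or impute.
nan_fraction = df.isna().mean().sort_values(ascending=False)

print(total_nan)
print(nan_columns)
print(nan_fraction)
```

Looking at fractions rather than raw counts makes it easier to compare columns between the train and test sets, which have different row counts.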

## SCRUB / CLEAN
Clean the data and perform initial transformation steps

### Run Memory Reduction Pre-Processing
**Session memory after loading the datasets was 11 GB. After running the memory reduction steps, it is 1.3 GB.**
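
The body of `auq.reduce_df_memory` is not included in this notebook. A common way to get a reduction of this kind is to downcast numeric columns and store repetitive strings as categoricals; the sketch below illustrates that idea under those assumptions. The function names `downcast_numeric` and `mem_usage_mb` and the 50% uniqueness threshold are hypothetical.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Hypothetical sketch of a memory reduction pass (not the actual auq.reduce_df_memory)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest integer type that still holds every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 where possible; this trades precision for memory.
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_object_dtype(out[col]) or pd.api.types.is_string_dtype(out[col]):
            # Columns with few distinct strings are much smaller as categoricals.
            if out[col].nunique(dropna=True) < 0.5 * len(out):
                out[col] = out[col].astype("category")
    return out


def mem_usage_mb(df):
    """Deep memory usage of a frame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Toy frame using three column names from the data set (values are made up).
raw = pd.DataFrame({
    "isFraud": np.tile([0, 0, 0, 1], 250),
    "TransactionAmt": np.linspace(1.0, 500.0, 1000),
    "ProductCD": ["W", "H", "C", "R"] * 250,
})
reduced = downcast_numeric(raw)

print(f"before: {mem_usage_mb(raw):.3f} MB, after: {mem_usage_mb(reduced):.3f} MB")
print(reduced.dtypes)
```

Downcasting floats to 32 bits can change values slightly, so it is worth checking that amounts and engineered features are not sensitive to that loss of precision.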

In [10]:
# run memory reduction
if not isMemoryReductionTrain:
    logger.info(f'Before Reduction Memory Usage: VESTA Training Dataset:[{auq.mem_usage(v_train)}]')
    v_train = auq.reduce_df_memory(v_train, logger)
    auq.save_df(v_train, f'{dataDir}v_train.pkl', logger)
    logger.info(f'After Reduction Memory Usage: VESTA Training Dataset:[{auq.mem_usage(v_train)}]')
if not isMemoryReductionTest:
    logger.info(f'Before Reduction Memory Usage: VESTA Testing Dataset: [{auq.mem_usage(v_test)}]')
    v_test = auq.reduce_df_memory(v_test, logger)
    auq.save_df(v_test, f'{dataDir}v_test.pkl', logger)
    logger.info(f'After Reduction Memory Usage: VESTA Testing Dataset: [{auq.mem_usage(v_test)}]')

Before Reduction Memory Usage: VESTA Training Dataset:[2584.98 MB]
converted [0] of [434] columns in time [4.500000000007276e-05]
converted [20] of [434] columns in time [9.480417699999975]
converted [40] of [434] columns in time [22.199514499999964]
converted [60] of [434] columns in time [34.50245589999997]
converted [80] of [434] columns in time [47.18627500000002]
converted [100] of [434] columns in time [60.533856300000025]
converted [120] of [434] columns in time [72.86636939999994]
converted [140] of [434] columns in time [83.92050119999999]
converted [160] of [434] columns in time [95.95080859999996]
converted [180] of [434] columns in time [105.02796739999997]
converted [200] of [434] columns in time [113.17697569999996]
converted [220] of [434] columns in time [122.79845479999994]
converted [240] of [434] columns in time [129.82537979999995]
converted [260] of [434] columns in time [136.30953239999997]
converted [280] of [434] columns in time [142.15557360000003]
converted [3

## EXPLORE
Explore the datasets

In [11]:
# perform exploratory data analysis techniques
logger.info(v_train.columns)
logger.info(v_train.head())
logger.info(v_test.head())

Index(['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5',
       ...
       'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
       'DeviceType', 'DeviceInfo'],
      dtype='object', length=434)
   TransactionID  isFraud  TransactionDT  TransactionAmt ProductCD  card1  \
0        2987000        0          86400            68.5         W  13926   
1        2987001        0          86401            29.0         W   2755   
2        2987002        0          86469            59.0         W   4663   
3        2987003        0          86499            50.0         W  18132   
4        2987004        0          86506            50.0         H   4497   

   card2  card3       card4  card5  ...                id_31  id_32  \
0    NaN  150.0    discover  142.0  ...                  NaN    NaN   
1  404.0  150.0  mastercard  102.0  ...                  NaN    NaN   
2  490.0  150.0        visa  

### Base Feature Engineering / Transformation


In [None]:
# explore feature importance using Random Forest Classifier
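
This cell is still a placeholder. As one possible starting point, the sketch below fits a scikit-learn `RandomForestClassifier` on synthetic data and ranks features by impurity-based importance. The feature names, the synthetic target, and the hyperparameters are assumptions for illustration only. On the real data, categorical columns would need encoding first and, depending on the scikit-learn version, missing values may need to be imputed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 1000

# Synthetic features: one informative column and two pure-noise columns.
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# Imbalanced synthetic target driven almost entirely by "signal".
y = (X["signal"] + 0.1 * rng.normal(size=n) > 1.0).astype(int)

# class_weight="balanced" is one way to account for the rare positive class.
model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)
model.fit(X, y)

# Impurity-based importances sum to 1; sort so the strongest feature comes first.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances tend to favor high-cardinality features, so they are best treated as a first screen rather than a final ranking.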

## MODEL
Create models

In [None]:
# perform model creation and validation techniques

### Model Validation
Perform model validations

## iNterpret
Interpret the model results and make knowledge-based recommendations

In [None]:
# perform interpretation steps

## -----------------------------------------------------------------------------
## Custom Class Objects - Run these sections first
## -----------------------------------------------------------------------------

In [None]:
# classes

## -----------------------------------------------------------------------------
## Local Utility Functions - Run these sections second
## -----------------------------------------------------------------------------

In [None]:
# functions
