# Fraud ETL

- Build a profile of each user's spending behavior
- Fraudulent transactions may be more likely to have high amounts
- TransactionDT acts as a timestamp; add dollars per unit of time as a metric. If it is high, someone is withdrawing more money than their past activity suggests they need
- The recipient email could look suspicious or differ from the user's usual one
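The profile and dollars-per-time ideas above could be prototyped roughly as follows. This is a minimal sketch, not a settled approach: it assumes `card1` can stand in for a user identifier (the data has no explicit user ID), and the helper `add_spending_features` and the feature names `amt_vs_card_mean` and `amt_per_dt` are made up for illustration.

```python
import pandas as pd

def add_spending_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-card spending features (card1 used as a user proxy)."""
    out = df.sort_values(['card1', 'TransactionDT']).copy()
    grouped = out.groupby('card1')

    # Mean of the card's earlier amounts only, so a row never sees itself or the future.
    prior_mean = grouped['TransactionAmt'].transform(
        lambda s: s.shift().expanding().mean()
    )
    out['amt_vs_card_mean'] = out['TransactionAmt'] / prior_mean

    # Amount divided by the TransactionDT gap since the card's previous transaction.
    # Zero gaps become NaN to avoid dividing by zero.
    gap = grouped['TransactionDT'].diff()
    out['amt_per_dt'] = out['TransactionAmt'] / gap.where(gap > 0)
    return out

# Tiny synthetic example (not real data)
demo = pd.DataFrame({
    'card1': [1, 1, 1, 2],
    'TransactionDT': [100, 200, 300, 150],
    'TransactionAmt': [10.0, 30.0, 200.0, 50.0],
})
features = add_spending_features(demo)
print(features)
```

The first transaction of each card has no history, so both features are NaN there; XGBoost accepts NaN inputs, so those rows do not need to be dropped.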

## Data descriptions

- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
- Categorical features: ProductCD, card1 - card6, addr1, addr2, P_emaildomain, R_emaildomain, M1 - M9
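Since TransactionDT is an offset rather than a real timestamp, one way to make it usable is to split it into a day index and an hour of day. The sketch below assumes the unit is seconds, which the descriptions above do not state; the helper name `add_time_features` and the column names `day` and `hour` are illustrative.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60  # assumption: TransactionDT counts seconds

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive a day index and hour of day from TransactionDT."""
    out = df.copy()
    out['day'] = out['TransactionDT'] // SECONDS_PER_DAY
    out['hour'] = (out['TransactionDT'] % SECONDS_PER_DAY) // 3600
    return out

# Synthetic offsets for illustration
demo = pd.DataFrame({'TransactionDT': [86400, 86401, 90000, 172800 + 7200]})
timed = add_time_features(demo)
print(timed)
```

Because the reference datetime is unknown, `hour` is only meaningful relative to other rows, not as a true local time.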

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import xgboost

In [4]:
base = Path('/home/zach/datasets/ieee-fraud-detection')
train = pd.read_csv(base / 'train_transaction.csv')

test = pd.read_csv(base / 'test_transaction.csv')
train.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
counts = train['isFraud'].value_counts()
counts[1] / (counts[0] + counts[1])

np.float64(0.03499000914417313)
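With a fraud rate of about 3.5%, a model that never flags fraud is already about 96.5% accurate, so accuracy alone says little here. The sketch below illustrates this on synthetic labels drawn at roughly the same rate (these are not the competition data).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic labels with roughly the 3.5% positive rate computed above
y_true = (rng.random(100_000) < 0.035).astype(int)
# A "model" that always predicts the majority class (not fraud)
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
true_pos = int(((y_true == 1) & (y_pred == 1)).sum())
false_neg = int(((y_true == 1) & (y_pred == 0)).sum())
recall = true_pos / (true_pos + false_neg)

print(f'accuracy={accuracy:.3f}, recall={recall:.3f}')
```

Recall (and therefore F1) on the fraud class is zero even though accuracy looks high, which is why a metric such as F1 or recall is a better fit for this problem.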

In [6]:
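# Baseline performance (scored on the training data itself, so this is an optimistic estimate)
from sklearn.metrics import accuracy_score

classifier = xgboost.XGBClassifier(eval_metric='logloss')
labels = train['isFraud']
# XGBoost cannot consume raw string columns, so keep only the numeric features for the baseline
features = train.drop(columns=['isFraud', 'TransactionID']).select_dtypes(exclude='object')

classifier.fit(features, labels)
y_pred = classifier.predict(features)

accuracy = accuracy_score(labels, y_pred)
print(f'Accuracy: {accuracy:.4f}')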

ImportError: sklearn needs to be installed in order to use this module

## Resampling for imbalanced datasets
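One simple resampling option is random undersampling of the majority class. The following is a minimal pandas-only sketch; the helper name `undersample` and its `ratio` parameter are assumptions for illustration.

```python
import pandas as pd

def undersample(df: pd.DataFrame, label: str = 'isFraud',
                ratio: float = 1.0, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: keep all positives and `ratio` times as many random negatives."""
    pos = df[df[label] == 1]
    neg = df[df[label] == 0]
    n_neg = min(len(neg), int(len(pos) * ratio))
    sampled = pd.concat([pos, neg.sample(n=n_neg, random_state=seed)])
    # Shuffle so positives and negatives are interleaved
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Synthetic frame: 5 fraud rows out of 100
demo = pd.DataFrame({'isFraud': [1] * 5 + [0] * 95, 'x': range(100)})
balanced = undersample(demo)
print(balanced['isFraud'].value_counts())
```

Resampling should be applied to the training split only; the held-out split should keep the original class balance so that the reported score reflects real conditions.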

In [16]:
train['TransactionAmt'].describe()

count    590540.000000
mean        135.027176
std         239.162522
min           0.251000
25%          43.321000
50%          68.769000
75%         125.000000
max       31937.391000
Name: TransactionAmt, dtype: float64
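The summary above shows a strong right skew: the mean (about 135) is roughly twice the median (about 69), and the maximum is near 32,000. A log transform is one common way to compress such a tail. The sketch below applies `np.log1p` to the quantile values printed above rather than to the full column.

```python
import numpy as np
import pandas as pd

# min, 25%, 50%, 75%, max taken from the describe() output above
amounts = pd.Series([0.251, 43.321, 68.769, 125.0, 31937.391], name='TransactionAmt')

# log1p(x) = log(1 + x); safe for small amounts and invertible with expm1
log_amounts = np.log1p(amounts)
print(log_amounts.round(3))
```

Tree-based models such as XGBoost are insensitive to monotonic transforms of a single feature, so this mainly helps plots and any linear or distance-based models.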

In [17]:
train['R_emaildomain'].value_counts()

R_emaildomain
gmail.com           57147
hotmail.com         27509
anonymous.com       20529
yahoo.com           11842
aol.com              3701
outlook.com          2507
comcast.net          1812
yahoo.com.mx         1508
icloud.com           1398
msn.com               852
live.com              762
live.com.mx           754
verizon.net           620
me.com                556
sbcglobal.net         552
cox.net               459
outlook.es            433
att.net               430
bellsouth.net         422
hotmail.fr            293
hotmail.es            292
web.de                237
mac.com               218
prodigy.net.mx        207
ymail.com             207
optonline.net         187
gmx.de                147
yahoo.fr              137
charter.net           127
mail.com              122
hotmail.co.uk         105
gmail                  95
earthlink.net          79
yahoo.de               75
rocketmail.com         69
embarqmail.com         68
scranton.edu           63
yahoo.es               5
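The domain counts above have a long tail, with many domains appearing only a few hundred times or fewer. One option before encoding is to fold rare domains into a single bucket. This is a sketch only; the helper `collapse_rare`, the `min_count` threshold, and the `'other'` label are illustrative choices.

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 100, other: str = 'other') -> pd.Series:
    """Hypothetical helper: replace values seen fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Missing values are left as NaN because isin() is False for them
    return series.where(~series.isin(rare), other)

# Synthetic example with a low threshold
demo = pd.Series(['gmail.com'] * 5 + ['yahoo.es'] + [None])
collapsed = collapse_rare(demo, min_count=2)
print(collapsed.value_counts(dropna=False))
```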

In [21]:
import xgboost as xgb
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.preprocessing import OrdinalEncoder

In [24]:
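def encode_categorical(df):
    # Ordinal-encode every object-dtype (string) column and return a new frame.
    # The encoder is fitted on the column data itself, not on the column names.
    cat_cols = df.select_dtypes(include='object').columns
    encoded = df.copy()
    encoded[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])
    return encoded

def test_model(model, df):
    # Separate the label from the features so it cannot leak into X;
    # TransactionID is dropped as in the baseline cell.
    X = df.drop(columns=['isFraud', 'TransactionID'])
    y = df['isFraud']
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.70, random_state=42)

    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    # f1_score expects (y_true, y_pred)
    f1 = f1_score(y_test, preds)

    return f1

In [26]:
model = xgb.XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1)
# Encode into a new frame. Assigning through a nested key such as
# train[[train.select_dtypes(include='object').columns]] = encoded
# raises "ValueError: Columns must be same length as key".
train_encoded = encode_categorical(train)
f1 = test_model(model, train_encoded)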