# Credit Card Transactions Fraud Detection Forecasting


## About the Dataset

This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions from 1 January 2019 to 31 December 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants.

Source: https://www.kaggle.com/datasets/kartik2112/fraud-detection?select=fraudTest.csv

## Data Dictionary

trans_date_trans_time -> Transaction time stamp <br>
cc_num -> Credit card number <br>
merchant -> merchant name <br>
category -> transaction category <br>
amt -> Transaction amount <br>
first -> First name of card holder <br>
last -> Last name of card holder <br>
gender -> Sex of card holder <br>
street -> transaction address <br>
city -> transaction city <br>
state -> transaction state <br>
zip -> transaction zipcode <br>
lat -> transaction latitude <br>
long -> transaction longitude <br>
city_pop -> Population of the city <br>
job -> job of the card holder <br>
dob -> date of birth of card holder <br>
trans_num -> transaction identifier <br>
unix_time -> time in unix format <br>
merch_lat -> latitude of the merchant <br>
merch_long -> longitude of merchant <br>
is_fraud -> nature of transaction (fraud or not fraud) <br>
Here, the 'is_fraud' variable is our target variable.

## Import the Dataset and Packages

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from mlflow import log_metric, log_param, log_artifacts
import mlflow
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")



In [2]:
# Import the train and test datasets and merge them: the original split is chronological, and we want to randomize it ourselves later.
train_data = pd.read_csv("fraudTrain.csv")
test_data = pd.read_csv("fraudTest.csv")
#concatenating the two datasets
dat = pd.concat([train_data, test_data]).reset_index()
dat.drop(dat.columns[:2], axis=1, inplace=True)
dat.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0
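
The loading step can also be written so that no leftover index columns need to be dropped afterwards. The following is a minimal sketch, not the notebook's own code: the two in-memory CSVs and their rows are hypothetical stand-ins for `fraudTrain.csv` and `fraudTest.csv`, and `dat_sketch` is an illustrative name.

```python
import io
import pandas as pd

# Tiny stand-ins for fraudTrain.csv and fraudTest.csv (hypothetical rows).
# The first, unnamed column mimics the saved row index in the real files.
train_csv = io.StringIO(",amt,is_fraud\n0,4.97,0\n1,107.23,0\n")
test_csv = io.StringIO(",amt,is_fraud\n0,220.11,0\n1,45.00,1\n")

# index_col=0 reads the unnamed column as the index instead of as data
train_data = pd.read_csv(train_csv, index_col=0)
test_data = pd.read_csv(test_csv, index_col=0)

# ignore_index=True renumbers the rows 0..n-1 without adding an "index" column
dat_sketch = pd.concat([train_data, test_data], ignore_index=True)
print(dat_sketch)
```

With this variant the combined frame contains only the data columns, so the positional `drop` of the first two columns is not needed.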


In [3]:
dat.shape

(1852394, 22)

In [4]:
dat.dtypes

trans_date_trans_time     object
cc_num                     int64
merchant                  object
category                  object
amt                      float64
first                     object
last                      object
gender                    object
street                    object
city                      object
state                     object
zip                        int64
lat                      float64
long                     float64
city_pop                   int64
job                       object
dob                       object
trans_num                 object
unix_time                  int64
merch_lat                float64
merch_long               float64
is_fraud                   int64
dtype: object

In [5]:
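# Restrict to numeric columns; recent pandas versions raise an error on string columns
dat.select_dtypes(include="number").corr()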

Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud
cc_num,1.0,0.001826,0.041504,-0.058744,-0.048429,-0.009118,0.000284,-0.058415,-0.048421,-0.001125
amt,0.001826,1.0,0.001979,-0.00067,-0.000735,0.004921,-0.002411,-0.000613,-0.000711,0.209308
zip,0.041504,0.001979,1.0,-0.114554,-0.909795,0.077601,0.001017,-0.113934,-0.908981,-0.00219
lat,-0.058744,-0.00067,-0.114554,1.0,-0.014744,-0.154816,0.000741,0.993582,-0.014709,0.002904
long,-0.048429,-0.000735,-0.909795,-0.014744,1.0,-0.052359,-0.000574,-0.014585,0.999118,0.001022
city_pop,-0.009118,0.004921,0.077601,-0.154816,-0.052359,1.0,-0.001636,-0.153863,-0.052329,0.000325
unix_time,0.000284,-0.002411,0.001017,0.000741,-0.000574,-0.001636,1.0,0.000696,-0.000571,-0.013329
merch_lat,-0.058415,-0.000613,-0.113934,0.993582,-0.014585,-0.153863,0.000696,1.0,-0.014554,0.002778
merch_long,-0.048421,-0.000711,-0.908981,-0.014709,0.999118,-0.052329,-0.000571,-0.014554,1.0,0.000999
is_fraud,-0.001125,0.209308,-0.00219,0.002904,0.001022,0.000325,-0.013329,0.002778,0.000999,1.0
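
In the matrix above, `amt` is the only column with a noticeable linear correlation with `is_fraud` (about 0.21). A quick way to read this off is to rank the features by absolute correlation with the target. The sketch below does this on a small made-up frame; the `toy` data and the `ranking` name are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data: amounts are much larger for the two fraud rows,
# while city_pop is unrelated to the target by construction.
toy = pd.DataFrame({
    "amt": [5.0, 10.0, 20.0, 15.0, 300.0, 500.0],
    "city_pop": [100, 5000, 300, 4000, 200, 4500],
    "category": ["misc_net", "grocery_pos", "misc_net",
                 "gas_transport", "misc_net", "grocery_pos"],
    "is_fraud": [0, 0, 0, 0, 1, 1],
})

# Correlation is only defined for numeric columns, so drop the string column first
corr = toy.select_dtypes(include="number").corr()

# Rank the features by the strength of their linear relationship with the target
ranking = corr["is_fraud"].drop("is_fraud").abs().sort_values(ascending=False)
print(ranking)
```

Pearson correlation only captures linear relationships, so a low value here does not mean a feature is useless to a tree-based model.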


In [6]:
dat.isnull().sum()

trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64

## Data Wrangling

In [7]:
# Transform column names 
dat.rename(columns={"trans_date_trans_time":"transaction_time",
                         "cc_num":"credit_card_number",
                         "amt":"amount(usd)",
                         "trans_num":"transaction_id"},
                inplace=True)

In [8]:
# Transform to datetime (infer_datetime_format is deprecated in recent pandas and no longer needed)
dat["transaction_time"] = pd.to_datetime(dat["transaction_time"])
dat["dob"] = pd.to_datetime(dat["dob"])

In [9]:
# The 'unix_time' column holds the same information as the transaction time, only in Unix format, so we drop it and keep 'transaction_time'.
dat.drop('unix_time', axis=1,inplace=True)

In [10]:
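# Age in years at transaction time, using the mean Gregorian year (365.2425 days)
dat['age'] = np.round((dat['transaction_time'] - 
                      dat['dob']).dt.total_seconds() / (365.2425 * 24 * 3600))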
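
The age calculation can be checked by hand on the first row of the dataset (transaction at 2019-01-01 00:00:18, date of birth 1988-03-09). This is a minimal sketch under the assumption that a year is taken as 365.2425 days, which is the length NumPy uses for `np.timedelta64(1, 'Y')`; the `SECONDS_PER_YEAR` constant is an illustrative name.

```python
import pandas as pd

# Mean Gregorian year in seconds (365.2425 days), an assumption matching NumPy's 'Y' unit
SECONDS_PER_YEAR = 365.2425 * 24 * 3600

# Values taken from the first row of the dataset shown earlier
transaction_time = pd.Series(pd.to_datetime(["2019-01-01 00:00:18"]))
dob = pd.Series(pd.to_datetime(["1988-03-09"]))

# Elapsed time in years, rounded to the nearest whole year
age = ((transaction_time - dob).dt.total_seconds() / SECONDS_PER_YEAR).round()
print(age.iloc[0])
```

Because the value is rounded rather than floored, a card holder who is about 30.8 years old is recorded as 31.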

In [11]:
# Separate transaction time into year, month, day and hour
dat['year'] = dat.transaction_time.dt.year
dat['month'] = dat.transaction_time.dt.month
dat['day'] = dat.transaction_time.dt.day
dat['hour'] = dat.transaction_time.dt.hour
dat.drop('transaction_time', axis=1,inplace=True)

In [12]:
# change male to 1, female to 0
dat["gender"]=dat.gender.apply(lambda x: 1 if x=="M" else 0)

In [13]:
dat.head()

Unnamed: 0,credit_card_number,merchant,category,amount(usd),first,last,gender,street,city,state,...,dob,transaction_id,merch_lat,merch_long,is_fraud,age,year,month,day,hour
0,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,0,561 Perry Cove,Moravian Falls,NC,...,1988-03-09,0b242abb623afc578575680df30655b9,36.011293,-82.048315,0,31.0,2019,1,1,0
1,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,0,43039 Riley Greens Suite 393,Orient,WA,...,1978-06-21,1f76529f8574734946361c461b024d99,49.159047,-118.186462,0,41.0,2019,1,1,0
2,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,1,594 White Dale Suite 530,Malad City,ID,...,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,43.150704,-112.154481,0,57.0,2019,1,1,0
3,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,1,9443 Cynthia Court Apt. 038,Boulder,MT,...,1967-01-12,6b849c168bdad6f867558c3793159a81,47.034331,-112.561071,0,52.0,2019,1,1,0
4,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,1,408 Bradley Rest,Doe Hill,VA,...,1986-03-28,a41d7549acf90789359a9aa5346dcb46,38.674999,-78.632459,0,33.0,2019,1,1,0


## Data Preparation 
Since our target y takes only the values 0 and 1, this is a binary classification problem. We'll build an XGBoost model, so the remaining variables need to be numeric.

In [14]:
dat.drop(['merchant','first','last','street','city','state','job','dob','transaction_id'], axis=1,inplace=True)

In [15]:
# Label-encode the category strings as integers
labelencoder = LabelEncoder()
dat['category'] = labelencoder.fit_transform(dat['category'])

In [16]:
# One-hot encode the (label-encoded) category with pandas get_dummies
dat_dum = pd.get_dummies(dat['category'])
#pd.DataFrame(dat_dum)

In [17]:
# Merge our dummies into dataframe
dat = pd.concat([dat, dat_dum], axis="columns")
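
Because the dummies here are built from the label-encoded integers, the new columns are named with plain integers. One possible variant, sketched below on hypothetical rows, one-hot encodes the original strings directly and uses `prefix` to get readable string column names. It also drops the original `category` column, which the notebook keeps, so treat it as an alternative rather than a description of the cells above.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror the notebook
toy = pd.DataFrame({
    "category": ["misc_net", "grocery_pos", "entertainment", "misc_net"],
    "amount(usd)": [4.97, 107.23, 220.11, 41.96],
})

# prefix gives each dummy column a readable string name; dtype=int yields 0/1 values
dummies = pd.get_dummies(toy["category"], prefix="category", dtype=int)

# Replace the string column with its one-hot representation
encoded = pd.concat([toy.drop(columns="category"), dummies], axis="columns")
print(encoded.columns.tolist())
```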

In [18]:
# Uncomment the following lines to work on a 100,000-row sample if you want to apply SMOTE

#dat_sample = dat.sample(n = 100000,random_state =50)
#X= dat_sample.iloc[:,dat_sample.columns!= 'is_fraud']
#y= dat_sample.iloc[:,dat_sample.columns== 'is_fraud']

In [19]:
X= dat.iloc[:,dat.columns!= 'is_fraud']
y= dat.iloc[:,dat.columns== 'is_fraud']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

In [21]:
y_train.value_counts()

is_fraud
0           1234658
1              6445
dtype: int64
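
These counts show how imbalanced the training set is. The short calculation below uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the y_train.value_counts() output above
n_legit = 1_234_658
n_fraud = 6_445
n_total = n_legit + n_fraud

fraud_share = n_fraud / n_total        # fraction of fraudulent training rows
imbalance_ratio = n_legit / n_fraud    # legitimate rows per fraudulent row
baseline_accuracy = n_legit / n_total  # accuracy of always predicting "not fraud"

print(f"fraud share:       {fraud_share:.4%}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} : 1")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

Roughly one training row in 190 is fraudulent, so a model that never predicts fraud would still be more than 99% accurate. This is why precision, recall and F1 for the fraud class are more informative than accuracy here.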

In [22]:
y_train.shape

(1241103, 1)

In [24]:
# Uncomment the following lines to oversample the fraud class with SMOTE

#from imblearn.over_sampling import SMOTE

#smote = SMOTE(random_state=0,sampling_strategy=0.3)
#X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)


## Data Modeling

In [25]:
# Use IP of your remote machine here
server_ip = "0.0.0.0"

# set up minio credentials and connection
os.environ['AWS_ACCESS_KEY_ID'] = 'admin'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'adminadmin'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = "http://localhost:9000"

# Set the MLflow tracking URI and the experiment name
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("XGBoost")

<Experiment: artifact_location='s3://mlflow/1', experiment_id='1', lifecycle_stage='active', name='XGBoost', tags={}>

In [28]:
# Remember to change X_train, y_train to X_train_smote, y_train_smote if you want to use SMOTE

with mlflow.start_run(run_name="XGBoost_all:"):

    #xgboost    
    xgboostModel = XGBClassifier(n_estimators=500, learning_rate= 0.1, objective='binary:logistic', booster='gbtree')
    
    xgboostModel.fit(X_train, y_train)

    predicted = xgboostModel.predict(X_test)
    
    # Log the column names, dataset sizes, metrics and the fitted model to MLflow
    param=dat.columns.to_list()
    for i in range(len(param)):
        mlflow.log_param("parameter%d"%(i+1),param[i]) 
    mlflow.log_param("Train rows", len(X_train))
    mlflow.log_param("Test rows", len(X_test))
    print("Classification report:\n", classification_report(y_test, predicted))
    mlflow.log_metric("Accuracy", accuracy_score(y_test, predicted))
    mlflow.log_metric("Recall" , recall_score(y_test, predicted))
    mlflow.log_metric("Precision", precision_score(y_test, predicted))
    mlflow.log_metric("F1", f1_score(y_test, predicted))
    mlflow.sklearn.log_model(xgboostModel,"XGBoost")

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    608085
           1       0.97      0.86      0.91      3206

    accuracy                           1.00    611291
   macro avg       0.98      0.93      0.95    611291
weighted avg       1.00      1.00      1.00    611291
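
The fraud-class figures in the report can be cross-checked from the definitions. The sketch below starts from the rounded values printed above, so the results are approximate; in particular, the number of missed fraud cases is an estimate, not a value taken from a confusion matrix.

```python
# Values copied from the classification report above (rounded to two decimals)
precision_fraud = 0.97
recall_fraud = 0.86
recall_legit = 1.00
support_fraud = 3206

# F1 is the harmonic mean of precision and recall
f1_fraud = 2 * precision_fraud * recall_fraud / (precision_fraud + recall_fraud)

# The macro average weights both classes equally, regardless of support
macro_recall = (recall_legit + recall_fraud) / 2

# Rough estimate of fraud cases in the test set that the model did not flag
approx_missed = round(support_fraud * (1 - recall_fraud))

print(f"F1 (fraud):       {f1_fraud:.2f}")
print(f"macro recall:     {macro_recall:.2f}")
print(f"missed (approx.): {approx_missed}")
```

The accuracy and weighted averages of 1.00 are dominated by the 608,085 legitimate transactions, while the fraud-class recall of 0.86 means that roughly one fraudulent transaction in seven goes undetected.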

