## Financial Fraud Classifier
#### Sam Berkson and Ben Puryear
#### CPSC 322 Final Project

In [11]:
from mysklearn import myclassifiers, myevaluation, mypytable, myutils
import importlib

# set up the data for later usage
importlib.reload(mypytable)

# first we are going to import the dataset into a mypytable object
mytable = mypytable.MyPyTable()
mytable.load_from_file("input_data/Fraud_chop.csv")

# the values of 'type' are strings, so we convert them to ints so the classifiers can use them
mytable.convert_col_to_int('type')
mytable.convert_col_to_int('amount')
# we don't need step, nameOrig, or nameDest; isFlaggedFraud is kept for the baseline comparison in the EDA
mytable.drop_cols(['step', 'nameOrig', 'nameDest'])

### Introduction

* Dataset
    * Our dataset originates from Kaggle and comes packaged as a CSV file.  This CSV file contains just over a million instances of different types of financial transactions from different accounts.  Each instance records whether the transaction was flagged as fraudulent and whether it was actually fraudulent.  Our dataset contains 11 attributes:
    * step
        * Step maps to a number of hours, where 1 step is 1 hour
    * type
        * This identifies the type of transaction.  It can be CASH-IN, CASH-OUT, DEBIT, PAYMENT, or TRANSFER.
        * This could be a pretty useful attribute to use in classification.
    * amount
        * This is the amount of money transferred (in the local currency).
        * This can also be a pretty useful attribute to use in our classification.
    * nameOrig
        * This identifies the customer who initiated the transaction.
    * oldBalanceOrg
        * This is the initial balance before the transaction.
    * newBalanceOrig
        * This is the new balance after the transaction.
    * nameDest
        * This identifies the recipient of the transfer.
    * oldBalanceDest
        * This is the recipient's initial balance before the transaction.
    * newBalanceDest
        * This is the recipient's new balance after the transaction.
        * This can be a useful tool for classification when used in comparison to oldBalanceDest for any given instance.
    * isFraud
        * This marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.
    * isFlaggedFraud
        * The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
    
Our classification goal is to correctly predict whether any given transaction is fraudulent or not (our class label is 'isFraud').
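
To make the isFlaggedFraud rule described above concrete, here is a minimal sketch of it. This is an illustration only: the function name `flag_transaction` and the constant `FLAG_THRESHOLD` are our own (hypothetical), the transaction type is passed as a string rather than the int encoding used in the notebook, and the sketch assumes the rule applies only to TRANSFER transactions above 200,000, as the attribute description states.

```python
# Assumed from the attribute description: transfers above this amount are flagged.
FLAG_THRESHOLD = 200_000


def flag_transaction(tx_type, amount, threshold=FLAG_THRESHOLD):
    """Return 1 if the simple rule would flag the transaction, else 0.

    Hypothetical helper: only TRANSFER transactions strictly above
    the threshold are flagged.
    """
    return int(tx_type == "TRANSFER" and amount > threshold)


print(flag_transaction("TRANSFER", 250_000))  # large transfer
print(flag_transaction("TRANSFER", 150_000))  # transfer below the threshold
print(flag_transaction("PAYMENT", 250_000))   # large, but not a transfer
```

A rule this narrow can only ever catch large transfers, which is worth keeping in mind when comparing it against a learned classifier.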

### EDA
While we won't go over all of our EDA results, here is the CliffsNotes version along with some interesting finds.

__Attribute Distributions__

* type

![image](media/Transaction_Type_Distribution.jpeg) 
    
* isFraud

![image](media/Fradulent_Transaction_Distribution.jpeg)


__Attribute Relationships__

* oldBalanceOrg and newBalanceOrig
    * slope: $0.6599$, just about $2/3$

![image](media/Old_Balance_v_New_Balance.jpeg)

* type and isFraud

![image](media/Fraudulent_Transaction_Types_Distribution.jpeg)
![image](media/Non-Fraudulent_Transaction_Types_Distribution.jpeg)

* type and isFlaggedFraud

![image](media/Fraudulent_Transaction_Types_Flagged_Distribution.jpeg)
![image](media/Non-Fraudulent_Transaction_Types_Flagged_Distribution.jpeg)
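
The slope reported for oldBalanceOrg and newBalanceOrig above can be reproduced in spirit with an ordinary least-squares fit. The sketch below is an assumption about how such a slope might be computed, not a record of the exact method used for the plot; the helper name `least_squares_slope` and the toy balances are made up for illustration.

```python
def least_squares_slope(xs, ys):
    """Slope m of the least-squares line y = m*x + b through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-style numerator and variance-style denominator
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# toy balances (not from the dataset): each new balance is 2/3 of the old one
old_balance = [0.0, 300.0, 600.0, 900.0]
new_balance = [0.0, 200.0, 400.0, 600.0]
print(least_squares_slope(old_balance, new_balance))  # close to 2/3
```

With the real columns, the same call would take `mytable.get_column(...)` results for the two balance attributes in place of the toy lists.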

In [12]:
# recipient (destination) balances before and after each transaction
oldBalanceDest = mytable.get_column('oldbalanceDest')
newBalanceDest = mytable.get_column('newbalanceDest')
amount = mytable.get_column('amount')

avgOrgBalance = sum(oldBalanceDest)/len(oldBalanceDest)
avgNewBalance = sum(newBalanceDest)/len(newBalanceDest)
# old minus new: a negative value means recipient balances grew on average
difference = avgOrgBalance - avgNewBalance
# average transaction amount (computed for reference, not printed)
avgTransfer = sum(amount)/len(amount)

print("Average original balance:", avgOrgBalance)
print("Average new balance:", avgNewBalance)
print("Average difference: $", difference)

Average original balance: 833508.1312484784
Average new balance: 1304552.5530755178
Average difference: $ -471044.42182703933


In [13]:
isFlaggedFraud = mytable.get_column('isFlaggedFraud')
isFraud = mytable.get_column("isFraud")
total = len(isFraud)  # number of instances (avoids hard-coding 1642)

isFlaggedFraudT = 0
isFlaggedFraudF = 0
isFraudT = 0
isFraudF = 0
correctlyPredicted = 0
incorrectlyPredicted = 0

# treat isFlaggedFraud as the prediction and isFraud as the true label
for flagged, fraud in zip(isFlaggedFraud, isFraud):
    if flagged == 1:
        isFlaggedFraudT += 1
    else:
        isFlaggedFraudF += 1

    if fraud == 1:
        isFraudT += 1
    else:
        isFraudF += 1

    # the flag is correct when it agrees with the true label
    if (flagged == 1 and fraud == 1) or (flagged == 0 and fraud == 0):
        correctlyPredicted += 1
    else:
        incorrectlyPredicted += 1

flagPercentage = (100 / total) * isFlaggedFraudT
print("Number of instances: ", isFraudT + isFraudF)
print("Number of fraudulent transactions:", isFraudT)
print("Number of non-fraudulent transactions:", isFraudF)
print("Number of transactions flagged as fraudulent:", isFlaggedFraudT)
print("Number of transactions not flagged:", isFlaggedFraudF)
print("Number of transaction types:", 5)
print("Percentage of instances flagged as fraudulent: ", str(flagPercentage) + " %")
print("Number of instances correctly predicted: ", correctlyPredicted)
print("Number of instances incorrectly predicted: ", incorrectlyPredicted)
print("Accuracy: ", (100 / total) * correctlyPredicted)

Number of instances:  1642
Number of fraudulent transactions: 821
Number of non-fraudulent transactions: 821
Number of transactions flagged as fraudulent: 3
Number of transactions not flagged: 1639
Number of transaction types: 5
Percentage of instances flagged as fraudulent:  0.18270401948842874 %
Number of instances correctly predicted:  824
Number of instances incorrectly predicted:  818
Accuracy:  50.18270401948843


The dataset's built-in flag (isFlaggedFraud) agrees with isFraud on only 50.18% of instances, so we will need a classifier accuracy greater than 50.18% in order to surpass the dataset's classifier.
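
The counts printed above are enough to recover the full confusion matrix of the built-in flag, which shows where its errors come from. The following is a small worked sketch; the helper name `flag_confusion_counts` is hypothetical, and the inputs are the four totals from the output (1642 instances, 821 fraudulent, 3 flagged, 824 in agreement).

```python
def flag_confusion_counts(total, actual_pos, predicted_pos, correct):
    """Recover (tp, fp, tn, fn) from four summary totals.

    Uses: predicted_pos = tp + fp, actual_neg = tn + fp, correct = tp + tn,
    so correct = (predicted_pos - fp) + (actual_neg - fp).
    """
    actual_neg = total - actual_pos
    fp = (predicted_pos + actual_neg - correct) // 2
    tp = predicted_pos - fp
    tn = actual_neg - fp
    fn = actual_pos - tp
    return tp, fp, tn, fn


tp, fp, tn, fn = flag_confusion_counts(
    total=1642, actual_pos=821, predicted_pos=3, correct=824
)
print("TP, FP, TN, FN:", tp, fp, tn, fn)        # 3 0 821 818
print("accuracy:", round((tp + tn) / 1642, 4))  # 0.5018
print("precision:", tp / (tp + fp))             # 1.0
print("recall:", round(tp / (tp + fn), 4))      # 0.0037
```

In other words, the flag never raises a false alarm on this sample but catches only 3 of the 821 fraudulent transactions, so the real room for improvement is in recall.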

### Classification

For our classification, we began by running all of our supervised learning classifiers over our training and test sets.  We then took our $4$ best-performing classifiers:
* Linear Regressor
* Dummy Classifier
* Naive Bayes
* Forest Classifier

Since we are trying to predict whether a given transaction is fraudulent or not, we're dealing with binary classification. We tracked the following metrics to measure the performance of our classifiers and decide which classifier to build into our Heroku app:
* Accuracy
* Binary F1
* Binary Precision
* Binary Recall
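
For reference, the four metrics listed above can be computed from predictions as in the sketch below. This is a minimal standalone version, not the `myevaluation` implementation; the function name `binary_metrics` is hypothetical, and returning 0 when a denominator is zero is an assumed convention.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # assumed convention: a metric with a zero denominator is reported as 0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# a classifier that never predicts fraud, on a balanced toy sample
y_true = [1, 1, 0, 0]
y_pred = [0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))  # (0.5, 0.0, 0.0, 0.0)
```

On balanced data, always predicting the negative class yields an accuracy of 0.5 with precision, recall, and F1 all at 0, which is why accuracy alone is not enough to judge a fraud classifier.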

The results (rounded to the nearest hundredth) for each classifier are as follows:
* Linear Regressor:
    * Accuracy: $0.47$ or $47$%
    * Binary F1: $0.44$
    * Binary Precision: $0.47$
    * Binary Recall: $0.42$

* Dummy Classifier:
    * Accuracy: $0.5$
    * Binary F1: $0$
    * Binary Precision: $0$
    * Binary Recall: $0$

* Naive Bayes:
    * Accuracy: $0.5$ or $50$%
    * Binary F1: $0$
    * Binary Precision: $0$
    * Binary Recall: $0$

* Forest Classifier:
    * We ran our forest classifier implementation with the following settings:
        * $1000$ weak learners
        * $15$ better learners
        * $4$ random attribute subsets
    * Accuracy: $0.78$
    * Binary F1: $0.72$
    * Binary Precision: $1.0$
    * Binary Recall: $0.56$
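
The forest settings above suggest the usual pattern of training many weak learners, keeping the best few, and letting them vote. The sketch below illustrates that pattern under stated assumptions; it is not our `myclassifiers` implementation. The names `build_forest`, `forest_predict`, and `train_stump` are hypothetical, the weak learner is a toy one-split stump standing in for a decision tree, learners are ranked by out-of-bag accuracy (an assumption), and the toy data is made up.

```python
import random
from collections import Counter


def train_stump(X, y, rng, f=2):
    """Toy weak learner: best mean-threshold split among f random attributes."""
    default = Counter(y).most_common(1)[0][0]
    best = None
    for a in rng.sample(range(len(X[0])), f):
        threshold = sum(row[a] for row in X) / len(X)
        left = [lab for row, lab in zip(X, y) if row[a] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[a] > threshold]
        left_label = Counter(left).most_common(1)[0][0] if left else default
        right_label = Counter(right).most_common(1)[0][0] if right else default
        correct = (sum(lab == left_label for lab in left)
                   + sum(lab == right_label for lab in right))
        if best is None or correct > best[0]:
            best = (correct, a, threshold, left_label, right_label)
    _, attr, cut, low, high = best
    return lambda row: low if row[attr] <= cut else high


def build_forest(train_fn, X, y, n_learners, m_best, rng):
    """Train n_learners on bootstrap samples; keep the m_best by held-out accuracy."""
    n = len(X)
    scored = []
    for _ in range(n_learners):
        picked = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        picked_set = set(picked)
        held_out = [i for i in range(n) if i not in picked_set]
        predict = train_fn([X[i] for i in picked], [y[i] for i in picked], rng)
        hits = sum(predict(X[i]) == y[i] for i in held_out)
        scored.append((hits / len(held_out) if held_out else 0.0, predict))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [predict for _, predict in scored[:m_best]]


def forest_predict(forest, row):
    """Majority vote over the kept learners."""
    return Counter(predict(row) for predict in forest).most_common(1)[0][0]


# toy data: attribute 0 determines the class, attributes 1 and 2 are noise
rng = random.Random(0)
X = [[i, rng.random(), rng.random()] for i in range(40)]
y = [1 if row[0] >= 20 else 0 for row in X]
forest = build_forest(train_stump, X, y, n_learners=50, m_best=5, rng=rng)
print(forest_predict(forest, [35, 0.5, 0.5]))
print(forest_predict(forest, [3, 0.5, 0.5]))
```

In this sketch, `n_learners`, `m_best`, and `f` play the roles we assume for the three settings listed above (weak learners trained, better learners kept, and random attributes considered).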

**Results**:
* The Random Forest classifier was the best in every category, which is why we used it in our Heroku app.

* Our forest classifier also surpasses the dataset's classifier accuracy of $50.18$%.  Pretty neat!

### Conclusion

* Potential Improvements
    * We implemented all of our ideas for improving classification in our dataset.  On the algorithm side of things, we believe we have created the best Forest Classifier we can. 
    * Some potential improvements might involve toying more with attribute selection, and exploring other relationships between attributes that could be useful in classification.  

* Since our dataset came completely pre-cleaned and ready to roll, there wasn't a whole lot we had to do with the dataset other than load it in and start working with it.  Since it is binary classification and contains 11 informative attributes, this did not create many issues for classification beyond trying to get our implementation working.  We did not encounter any challenges in the classification of our dataset after we achieved equal class distributions in our dataset.

* After we achieved an equal class distribution, all that was left was to plug our dataset into our random subsampling functions and interpret the results.  When we saw our Forest Classifier's performance, we knew that it was our best classification tool and decided to use it as our Heroku model, as well as our featured classification method for the whole project.


* Contributions
    * Ben handled our classification and classifier evaluation and the Heroku site.  
    * Sam handled our project proposal, EDA, and compilation of resources into our presentation and report.


* Sources:
    * Dataset:
        * https://www.kaggle.com/datasets/vardhansiramdasu/fraudulent-transactions-prediction?resource=download
    * Images:
        * https://www.istockphoto.com/illustrations/elder-fraud

    