## Financial Fraud Classifier
#### Sam Berkson and Ben Puryear
#### CPSC 322 Final Project

In [55]:
from mysklearn import myclassifiers, myevaluation, mypytable, myutils
import importlib

# set up the data for later usage
importlib.reload(mypytable)

# first we are going to import the dataset into a mypytable object
mytable = mypytable.MyPyTable()
mytable.load_from_file("input_data/Fraud_chop.csv")

# the values of type are strings, so we convert them to ints so the classifiers can use them
mytable.convert_col_to_int('type')
mytable.convert_col_to_int('amount')
mytable.drop_cols(['step','nameOrig', 'nameDest']) # we don't need step, nameOrig, or nameDest (isFlaggedFraud is kept for the EDA and dropped later)

# we also will make x and y
X = []
y = []
for row in mytable.data:
    X.append(row[0:len(row)-1])
    y.append(row[-1])

# creating our training and test sets
X_train_folds_indexes, X_test_folds_indexes = myevaluation.kfold_cross_validation(X,13) # 13 proved to be our most accurate # of folds

X_test_folds, X_train_folds, y_test_folds, y_train_folds = myutils.indexes_to_fold(X_test_folds_indexes, X_train_folds_indexes, X, y)
X_test, X_train, y_test, y_train = myutils.folds_to_train_test(X_test_folds, X_train_folds, y_test_folds, y_train_folds)


### Introduction)  

* Why are we doing this in the first place?
    * The potential impact of our results is pretty intuitive.  Automated fraud detection systems have massive applications for both businesses and financial institutions.  A system like this could be used to stop fraudulent activities and help identify the perpetrators of financial crimes.  A system like ours would likely affect the following stakeholders:
    * Consumers
        * Consumers' financial assets gain increased security, which helps increase trust in modern financial institutions and can help stimulate economic growth.
    * Businesses
        * Businesses can avoid needless expense by ensuring that they are not the victims of fraudulent transactions out of company accounts or of fraudulent orders.
    * Financial Institutions
        * Banks can automate the task of consumer protection, allowing them greater resources to pursue recovery of funds through all available avenues.
    * Regulatory Agencies
        * Regulatory agencies can work with financial institutions to share identifying information regarding fraudulent transactions, and use this information to pursue recovery of funds through a variety of methods.  One cool method is to appoint a receiver to recover and equitably distribute recovered funds among secured and unsecured creditors, which Sam just so happens to do for work over the summer!
    * Financial Criminals
        * This gang isn't so lucky.  Criminals only incur detriment from committing felonies and robbing consumers and institutions of their assets.

    * Personal interest:
        * Sam deals with financial fraud cases on a daily basis at work.

![image](media/saftey.jpeg)

* Dataset
    * Our dataset originates from Kaggle and comes packaged as a csv file.  This csv file contains just over a million instances of different types of financial transactions from different accounts, and marks whether each transaction was flagged as fraudulent and whether it was actually fraudulent.  Our dataset contains 11 attributes:
    * step
        * Step maps to a number of hours, where 1 step is 1 hour
    * type
        * This identifies the type of transaction.  It can be: CASH-IN; CASH-OUT; DEBIT; PAYMENT; and TRANSFER.
        * This could be a pretty useful attribute to use in classification.
    * amount
        * This is the amount of money transferred (in the local currency).
        * This can also be a pretty useful attribute to use in our classification.
    * nameOrig
        * This identifies the customer who initiated the transaction.
    * oldBalanceOrg
        * This is the initial balance before the transaction.
    * newBalanceOrig
        * This is the new balance after the transaction.
    * nameDest
        * This identifies the recipient of the transfer.
    * oldBalanceDest
        * This is the recipient's initial balance before the transaction.
    * newBalanceDest
        * This is the recipient's new balance after the transaction.
        * This can be a useful tool for classification when used in comparison to oldBalanceDest for any given instance.
    * isFraud
        * This marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying the funds by transferring them to another account and then cashing out of the system.
    * isFlaggedFraud
        * The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
    
Our classification goal is to correctly predict whether any given transaction is fraudulent or not.
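As a rough illustration of how the attributes above could be turned into numeric features, here is a minimal sketch. The `TYPE_CODES` mapping and the `encode_row` helper are hypothetical: the actual integer codes produced by `convert_col_to_int` may differ, and our classifiers were trained on the raw balance columns rather than on the balance changes computed here.

```python
# Hypothetical type-to-int mapping; the real codes used by convert_col_to_int may differ.
TYPE_CODES = {"CASH-IN": 0, "CASH-OUT": 1, "DEBIT": 2, "PAYMENT": 3, "TRANSFER": 4}


def encode_row(row):
    """Return [type_code, amount, orig_delta, dest_delta] for one transaction dict."""
    # how much left the originating account
    orig_delta = row["oldBalanceOrg"] - row["newBalanceOrig"]
    # how much arrived in the destination account
    dest_delta = row["newBalanceDest"] - row["oldBalanceDest"]
    return [TYPE_CODES[row["type"]], row["amount"], orig_delta, dest_delta]


# made-up transaction, for illustration only
example = {
    "type": "TRANSFER",
    "amount": 1000.0,
    "oldBalanceOrg": 1000.0,
    "newBalanceOrig": 0.0,
    "oldBalanceDest": 500.0,
    "newBalanceDest": 1500.0,
}
features = encode_row(example)
```

Comparing `dest_delta` against `amount` is one way to use newBalanceDest together with oldBalanceDest, as suggested in the attribute list.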

### EDA)
While we won't go over all of our EDA results, here are the highlights and some interesting finds.

__Attribute Distributions__

* type

![image](media/Transaction_Type_Distribution.jpeg) 
    
* isFraud

![image](media/Fradulent_Transaction_Distribution.jpeg)


__Attribute Relationships__

* oldBalanceOrg and newBalanceOrig
    * slope: .6599, just about 2/3

![image](media/Old_Balance_v_New_Balance.jpeg)

* type and isFraud

![image](media/Fraudulent_Transaction_Types_Distribution.jpeg)
![image](media/Non-Fraudulent_Transaction_Types_Flagged_Distribution.jpeg)

* type and isFlaggedFraud

![image](media/Fraudulent_Transaction_Types_Flagged_Distribution.jpeg)
![image](media/Non-Fraudulent_Transaction_Types_Flagged_Distribution.jpeg)
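A slope like the one reported for oldBalanceOrg and newBalanceOrig can be obtained from a simple least-squares fit. The sketch below assumes that is how the value was computed; `least_squares_slope` is a hypothetical helper and the sample points are made up for illustration.

```python
def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-style numerator and variance-style denominator
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# made-up balances where the new balance is exactly 2/3 of the old balance
old_balances = [0.0, 3.0, 6.0]
new_balances = [0.0, 2.0, 4.0]
slope = least_squares_slope(old_balances, new_balances)
```

With the real columns, `xs` would be the oldBalanceOrg values and `ys` the newBalanceOrig values.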

In [56]:
oldBalanceDest = mytable.get_column('oldbalanceDest')
newBalanceDest = mytable.get_column('newbalanceDest')
amount = mytable.get_column('amount')

avgOrgBalance = sum(oldBalanceDest)/len(oldBalanceDest)
avgNewBalance = sum(newBalanceDest)/len(newBalanceDest)
difference = avgOrgBalance - avgNewBalance
avgTransfer = sum(amount)/len(amount)

print("Average original balance:", avgOrgBalance)
print("Average new balance:", avgNewBalance)
print("Average difference: $", difference)

Average original balance: 833508.1312484784
Average new balance: 1304552.5530755178
Average difference: $ -471044.42182703933


In [57]:
isFlaggedFraud = mytable.get_column('isFlaggedFraud')
isFraud = mytable.get_column("isFraud")

isFlaggedFraudT = 0
isFlaggedFraudF = 0
isFraudT = 0
isFraudF = 0
correctlyPredicted = 0
incorrectlyPredicted = 0

for index, value in enumerate(isFlaggedFraud):
    if isFlaggedFraud[index] == 1:
        isFlaggedFraudT += 1
    else:
        isFlaggedFraudF += 1
        
    if isFraud[index] == 1:
        isFraudT += 1
    else:
        isFraudF += 1

    if isFlaggedFraud[index] == 1 and isFraud[index] == 1:
        correctlyPredicted += 1
    elif isFlaggedFraud[index] == 0 and isFraud[index] == 0:
        correctlyPredicted += 1
    else:
        incorrectlyPredicted += 1

total = len(isFraud)
flagPercentage = (100 / total) * isFlaggedFraudT
print("Number of instances: ", isFraudT + isFraudF)
print("Number of fraudulent transactions:", isFraudT)
print("Number of non-fraudulent transactions:", isFraudF)
print("Number of transactions flagged as fraud:", isFlaggedFraudT)
print("Number of transactions not flagged:", isFlaggedFraudF)
print("Number of transaction types:", 5)
print("Percentage of instances flagged as fraudulent: ", str(flagPercentage) + " %")
print("Number of instances correctly predicted: ", correctlyPredicted)
print("Number of instances incorrectly predicted: ", incorrectlyPredicted)
print("Accuracy: ", (100 / total) * correctlyPredicted)

Number of instances:  1642
Number of fraudulent transactions: 821
Number of non-fraudulent transactions: 821
Number of transactions flagged as fraud: 3
Number of transactions not flagged: 1639
Number of transaction types: 5
Percentage of instances flagged as fraudulent:  0.18270401948842874 %
Number of instances correctly predicted:  824
Number of instances incorrectly predicted:  818
Accuracy:  50.18270401948843


We will need a classifier accuracy greater than 50.2% in order to outperform the dataset's built-in isFlaggedFraud flag.

### Classification)

For our classification, we began by running all of our supervised learning classifiers over our training and test sets.  Since we are trying to predict whether a given transaction is fraudulent or not, we're dealing with binary classification. We tracked the following metrics for all classifiers to measure performance:
* Accuracy
* Binary F1
* Binary Precision
* Binary Recall
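For reference, the four metrics can be computed from the true and predicted labels as sketched below. This is a minimal standalone version, not our `myevaluation` implementation; the `binary_metrics` name and the choice to report $0$ when a denominator is zero are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # assumption: report 0.0 instead of raising when a denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# a classifier that never predicts the positive class on balanced labels
never_fraud = binary_metrics([1, 1, 0, 0], [0, 0, 0, 0])
# a classifier with one true positive, one false positive, one false negative
mixed = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

Under this convention, a classifier that always predicts "not fraudulent" on a balanced dataset gets an accuracy of $0.5$ but a precision, recall, and F1 of $0$.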

The results (rounded to the nearest hundredth) for each classifier are as follows:
* Linear Regressor:
    * Accuracy: $0.47$ or $47$%
    * Binary F1: $0.44$
    * Binary Precision: $0.47$
    * Binary Recall: $0.42$

* Dummy Classifier:
    * Accuracy: $0.5$ or $50$%
    * Binary F1: $0$
    * Binary Precision: $0$
    * Binary Recall: $0$

* Naive Bayes:
    * Accuracy: $0.5$ or $50$%
    * Binary F1: $0$
    * Binary Precision: $0$
    * Binary Recall: $0$

* Forest Classifier:
    * We ran our forest classifier implementation with the following settings:
    * $1000$ weak learners
    * $15$ better learners
    * $4$ random attribute subsets  
    * Our resulting accuracy was $0.78$, or $78$%.
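To illustrate how the "weak learners" and "better learners" settings could interact, here is a minimal sketch of keeping the best $M$ trees out of $N$ and predicting by majority vote. It is an assumption about the general technique, not a copy of our `MyRandomForestClassifier`; `select_best_trees`, `majority_vote`, and the `ConstantTree` stand-in are hypothetical names.

```python
from collections import Counter


def select_best_trees(trees, X_val, y_val, m):
    """Keep the m trees with the highest accuracy on a validation set."""
    def accuracy(tree):
        predictions = tree.predict(X_val)
        return sum(1 for p, t in zip(predictions, y_val) if p == t) / len(y_val)
    return sorted(trees, key=accuracy, reverse=True)[:m]


def majority_vote(trees, X):
    """Predict each instance in X with the most common label across the trees."""
    per_tree = [tree.predict(X) for tree in trees]
    # zip(*per_tree) groups the votes for each instance together
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_tree)]


class ConstantTree:
    """Hypothetical stand-in for a fitted tree: always predicts one label."""
    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


trees = [ConstantTree(1), ConstantTree(0), ConstantTree(1)]
best = select_best_trees(trees, X_val=[[0], [0], [0]], y_val=[1, 1, 0], m=2)
votes = majority_vote(best, [[0], [0]])
```

When the vote is tied, `Counter.most_common` returns the label it saw first, so an odd $M$ such as $15$ avoids ties in a binary problem.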

**Results**:
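* Among the Linear Regressor, Dummy Classifier, and Naive Bayes, the Linear Regressor had the highest score in 3 of the 4 categories, while its accuracy ($0.47$) was slightly below the $0.5$ of the other two:
    1. Binary F1
    1. Binary Precision
    1. Binary Recall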
    
* However, our forest classifier had the best accuracy at $78$%.  Because of this, we used it in our Heroku app.

### Conclusion)

* Potential Improvements
    * We implemented all of our ideas for improving classification in our dataset.  We believe that further improving the accuracy of our classifier would require implementing more production-style algorithms (as in more accurate sklearn implementations), which are beyond our scope.

    * We did not encounter any challenges in the classification of our dataset after we achieved equal class distributions in our dataset.

* Key Code Components:

```py
import numpy as np

from mysklearn import myclassifiers


def generate_weak_forest(X_remainder, y_remainder, N, F, random_state=None):
    """
    Generates a weak forest of N decision trees, each fit on a random subset
    of the remainder set.

    Parameters
    ----------
    X_remainder : list
        list of lists of attributes
    y_remainder : list
        list of labels
    N : int
        number of trees in the forest
    F : int
        the number of random draws used to build each tree's subset of the data
    random_state : int, optional
        seed for numpy's random number generator, for reproducible forests

    Returns
    -------
    list
        the N fitted MyDecisionTreeClassifier objects
    """
    if random_state is not None:
        np.random.seed(random_state)
    weak_forest = []
    for _ in range(N):
        X_subset = []
        y_subset = []
        for _ in range(F):
            # draw with replacement; rows already in the subset are skipped,
            # so a subset may end up with fewer than F instances
            rand_index = np.random.randint(0, len(X_remainder))
            if X_remainder[rand_index] not in X_subset:
                X_subset.append(X_remainder[rand_index])
                y_subset.append(y_remainder[rand_index])
        # now that we have the subset, we can create the tree
        tree = myclassifiers.MyDecisionTreeClassifier()
        tree.fit(X_subset, y_subset)
        weak_forest.append(tree)
    return weak_forest
```

In [58]:
mytable.drop_cols(['isFlaggedFraud']) # isFlaggedFraud is the dataset's own flag, so we drop it before training

# we also will make x and y
X = []
y = []
for row in mytable.data:
    X.append(row[0:len(row)-1])
    y.append(row[-1])

# creating our training and test sets
X_train_folds_indexes, X_test_folds_indexes = myevaluation.kfold_cross_validation(X,13) # 13 proved to be our most accurate # of folds

X_test_folds, X_train_folds, y_test_folds, y_train_folds = myutils.indexes_to_fold(X_test_folds_indexes, X_train_folds_indexes, X, y)
X_test, X_train, y_test, y_train = myutils.folds_to_train_test(X_test_folds, X_train_folds, y_test_folds, y_train_folds)

# declare the settings for the forest classifier (n weak learners, m better learners, f subset size)
n = 1000
m = 15
f = 4

forest_clf = myclassifiers.MyRandomForestClassifier(random_state=100)
forest_clf.fit(X, y, n, m, f)
y_predicted = forest_clf.predict(X_test)
accuracy = myevaluation.accuracy_score(y_test, y_predicted)
print("Accuracy:", accuracy)

Accuracy: 0.7813641900121803
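The fold indexes above come from `myevaluation.kfold_cross_validation`, whose implementation is not shown in this report. As a rough idea of what a $13$-fold split over our $1642$ instances looks like, here is a minimal sketch without shuffling; `kfold_indexes` is a hypothetical helper, not our actual function.

```python
def kfold_indexes(n_samples, n_splits):
    """Return a list of (train_indexes, test_indexes) pairs, one per fold."""
    indexes = list(range(n_samples))
    # the first (n_samples % n_splits) folds get one extra instance
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indexes[start:start + size]
        train = indexes[:start] + indexes[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = kfold_indexes(1642, 13)
test_sizes = [len(test) for _, test in folds]
```

Every instance lands in exactly one test fold, so each one is used for testing once and for training in the other folds.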


In [59]:
import requests
responses = []

n = 1000
m = 15
f = 4

# for the presentation, to save time we will use less data

y_predicted = []
forest_clf = myclassifiers.MyRandomForestClassifier(random_state=100)
forest_clf.fit(X, y, n, m, f)

online_predicted = []

for instance in X_test:
    url = "http://127.0.0.1:5001/" # could also be https://financial-fraud-classifier.herokuapp.com/ 
    # but to save time in the presentation we will use the local server

    types = instance[0]
    amount = instance[1]
    oldbalanceOrg = instance[2]
    newbalanceOrig = instance[3]
    oldbalanceDest = instance[4]
    newbalanceDest = instance[5]
    url += "predict?type=" + str(types) + "&amount=" + str(amount) + "&oldbalanceOrg=" + str(oldbalanceOrg) + "&newbalanceOrig=" + \
        str(newbalanceOrig) + "&oldbalanceDest=" + \
        str(oldbalanceDest) + "&newbalanceDest=" + str(newbalanceDest)

    response = requests.get(url)
    dictionary_version = dict(response.json())

    online_predicted.append(dictionary_version["prediction"][0])
    y_predicted.append(forest_clf.predict([instance])[0])


accuracy = myevaluation.accuracy_score(y_test, y_predicted)
print("Accuracy for local forest:", accuracy)

accuracy = myevaluation.accuracy_score(y_test, online_predicted)
print("Accuracy for deployed forest:", accuracy)

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=5001): Max retries exceeded with url: /predict?type=4&amount=420&oldbalanceOrg=325470.07&newbalanceOrig=0.0&oldbalanceDest=19771.15&newbalanceDest=345241.22 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f285b7caf10>: Failed to establish a new connection: [Errno 111] Connection refused'))

## Some test links:
#### Fraud:
* https://financial-fraud-classifier.herokuapp.com/predict?type=4&amount=25529.76&oldbalanceOrg=25529.76&newbalanceOrig=0.0&oldbalanceDest=9798.98&newbalanceDest=35328.74 

* https://financial-fraud-classifier.herokuapp.com/predict?type=4&amount=290090.57&oldbalanceOrg=290090.57&newbalanceOrig=0.0&oldbalanceDest=2000395.18&newbalanceDest=2290485.75 
* https://financial-fraud-classifier.herokuapp.com/predict?type=2&amount=2154005.71&oldbalanceOrg=2154005.71&newbalanceOrig=0.0&oldbalanceDest=0.0&newbalanceDest=0.0 

#### Non-Fraudulent: 
* https://financial-fraud-classifier.herokuapp.com/predict?type=3&amount=180941.96&oldbalanceOrg=9343.0&newbalanceOrig=190284.96&oldbalanceDest=0.0&newbalanceDest=0.0 

* https://financial-fraud-classifier.herokuapp.com/predict?type=0&amount=8842.23&oldbalanceOrg=25081.04&newbalanceOrig=16238.82&oldbalanceDest=0.0&newbalanceDest=0.0 

* https://financial-fraud-classifier.herokuapp.com/predict?type=3&amount=85491.08&oldbalanceOrg=2217709.25&newbalanceOrig=2303200.33&oldbalanceDest=3404000.21&newbalanceDest=3318509.13 
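The query strings in the links above can also be assembled with the standard library instead of string concatenation. The sketch below assumes the instance values are ordered the same way as in our test set; `build_predict_url` is a hypothetical helper, and it only builds the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://financial-fraud-classifier.herokuapp.com/predict"
PARAM_NAMES = ["type", "amount", "oldbalanceOrg", "newbalanceOrig",
               "oldbalanceDest", "newbalanceDest"]


def build_predict_url(instance, base_url=BASE_URL):
    """Pair each value in the instance with its parameter name and encode it."""
    return base_url + "?" + urlencode(dict(zip(PARAM_NAMES, instance)))


# values taken from the first fraud test link
url = build_predict_url([4, 25529.76, 25529.76, 0.0, 9798.98, 35328.74])
```

The same dictionary could instead be passed to `requests.get` through its `params` argument, which performs the encoding itself.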

* Contributions
    * Ben handled our classification and classifier evaluation.  
    * Sam handled our EDA and bringing our work into both the presentation and final report.  


* Sources:
    * Dataset:
        * https://www.kaggle.com/datasets/vardhansiramdasu/fraudulent-transactions-prediction?resource=download
    * Images:
        * https://www.istockphoto.com/illustrations/elder-fraud

    