## Financial Fraud Classifier
#### Sam Berkson and Ben Puryear
#### CPSC 322 Final Project

In [3]:
from mysklearn import myclassifiers, myevaluation, mypytable, myutils
import importlib

### Introduction

* Why are we doing this in the first place?
    * The potential impact of our results is fairly intuitive.  Automated fraud detection systems have broad applications for both businesses and financial institutions.  A system like this could be used to stop fraudulent activities and help identify the perpetrators of financial crimes.  A system like ours would likely affect the following stakeholders:
    * Consumers
        * Consumers' financial assets gain increased security, which helps build trust in modern financial institutions and can in turn help stimulate economic growth.
    * Businesses
        * Businesses can avoid needless expense by ensuring that they are not the victims of fraudulent transactions out of company accounts or of fraudulent orders.
    * Financial Institutions
        * Banks can automate the task of consumer protection, allowing them greater resources to pursue recovery of funds through all available avenues.
    * Regulatory Agencies
        * Regulatory agencies can work with financial institutions to share identifying information regarding fraudulent transactions, and use this information to pursue recovery of funds through a variety of methods.  One cool method is appointing a receiver to recover and equitably distribute the recovered funds among secured and unsecured creditors, which Sam just so happens to do for work over the summer!
    * Financial Criminals
        * This gang isn't so lucky.  Criminals only stand to lose, since a system like this deters them from committing felonies and robbing consumers and institutions of their assets.

    * Personal interest:
        * Sam deals with financial fraud on a daily basis at work.

![image](media/saftey.jpeg)

* Dataset
    * Our dataset originates from Kaggle and comes packaged as a CSV file.  The file contains just over a million instances of different types of financial transactions from different accounts, and records both whether each transaction was flagged as fraudulent and whether it was actually fraudulent.  Our dataset contains 11 attributes:
 
    * step
        * Step maps to a number of hours, where 1 step is 1 hour
    * type
        * This identifies the type of transaction.  It can be one of CASH-IN, CASH-OUT, DEBIT, PAYMENT, or TRANSFER.
        * This could be a pretty useful attribute to use in classification.
    * amount
        * This is the amount of money transferred (in the local currency).
        * This can also be a pretty useful attribute to use in our classification.
    * nameOrig
        * This identifies the customer who initiated the transaction.
    * oldBalanceOrg
        * This is the originating customer's balance before the transaction.
    * newBalanceOrig
        * This is the originating customer's balance after the transaction.
    * nameDest
        * This identifies the recipient of the transfer.
    * oldBalanceDest
        * This is the recipient's balance before the transaction.
    * newBalanceDest
        * This is the recipient's balance after the transaction.
        * This can be a useful tool for classification when compared against oldBalanceDest for any given instance.
    * isFraud
        * This marks the transactions made by the fraudulent agents inside the simulation.  In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.
    * isFlaggedFraud
        * The business model aims to control massive transfers from one account to another and flags illegal attempts.  An illegal attempt in this dataset is an attempt to transfer more than $200{,}000$ (in the local currency) in a single transaction.
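
The isFlaggedFraud rule and the oldBalanceDest/newBalanceDest comparison described above can be sketched in a few lines.  This is only an illustration: the function names and the example rows are made up, the threshold assumes the dataset's "200.000" means two hundred thousand, and the attribute description does not say whether the rule looks at anything besides the amount.

```python
# Assumed reading of the threshold in the isFlaggedFraud description.
FLAG_THRESHOLD = 200_000


def exceeds_flag_threshold(amount, threshold=FLAG_THRESHOLD):
    """Return 1 if a single transaction moves more than the threshold, else 0."""
    return int(amount > threshold)


def dest_balance_change(old_balance_dest, new_balance_dest):
    """Hypothetical derived feature: how much the recipient's balance moved."""
    return new_balance_dest - old_balance_dest


# Made-up rows: (amount, oldBalanceDest, newBalanceDest)
rows = [
    (250000.0, 0.0, 250000.0),
    (1200.0, 500.0, 500.0),
    (90000.0, 1000.0, 91000.0),
]

flags = [exceeds_flag_threshold(amount) for amount, _, _ in rows]
changes = [dest_balance_change(old, new) for _, old, new in rows]

print(flags)    # [1, 0, 0]
print(changes)  # [250000.0, 0.0, 90000.0]
```

A row like the second one, where money was sent but the recipient's balance did not move, is the kind of pattern the balance comparison can surface.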
    
Our classification goal is to correctly predict whether any given transaction is fraudulent or not.

### EDA

### Classification

For our classification, we began by running all of our supervised learning classifiers over our training and test sets.  Since we are trying to predict whether a given transaction is fraudulent or not, we are dealing with binary classification. We tracked the following metrics for all classifiers to measure performance:

* Accuracy
* Binary F1
* Binary Precision
* Binary Recall
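
For reference, here is a minimal pure-Python sketch of how these four metrics are computed for a binary problem.  It is not the `myevaluation` code; `binary_metrics` and the toy labels are made up for illustration, with $1$ standing for a fraudulent transaction.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0]

# Two of three frauds caught, one false alarm.
scores = binary_metrics(y_true, [1, 1, 0, 1, 0, 0])

# A classifier that never predicts fraud on a balanced set.
always_negative = binary_metrics(y_true, [0, 0, 0, 0, 0, 0])

print(scores)
print(always_negative)
```

The second call shows why accuracy alone can mislead on a balanced set: never predicting fraud still scores $0.5$ accuracy while precision, recall, and F1 are all $0$.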

The results (rounded to the nearest hundredth) for each classifier are as follows:
* Linear Regressor:
    * Accuracy: $0.47$ or $47$%
    * Binary F1: $0.44$
    * Binary Precision: $0.47$
    * Binary Recall: $0.42$

* Dummy Classifier:
    * Accuracy: $0.5$ or $50$%
    * Binary F1: $0$
    * Binary Precision: $0$
    * Binary Recall: $0$

* Naive Bayes:
    * Accuracy: $0.5$ or $50$%
    * Binary F1: $0$
    * Binary Precision: $0$
    * Binary Recall: $0$

* Forest Classifier:
    * We ran our forest classifier implementation with the following settings:
        * $1000$ weak learners (`n`)
        * $15$ better learners (`m`)
        * $4$ random attribute subsets (`f`)
    * Our resulting accuracy was $0.78$, or $78$%.
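
To give a feel for the "weak learners" versus "better learners" idea, here is a small sketch of one common way such a forest makes a prediction: score every learner on validation data, keep the best few, and take a majority vote.  We are assuming that reading of the settings; this is not the `MyRandomForestClassifier` code, and the stand-in learners, their scores, and the names `select_best` and `majority_vote` are all hypothetical.

```python
from collections import Counter


def select_best(scored_learners, m):
    """Keep the m learners with the highest validation score."""
    ranked = sorted(scored_learners, key=lambda pair: pair[1], reverse=True)
    return [learner for learner, _score in ranked[:m]]


def majority_vote(learners, instance):
    """Predict the label that most of the learners agree on."""
    votes = Counter(learner(instance) for learner in learners)
    return votes.most_common(1)[0][0]


# Stand-in "trees": each is a single hand-written rule.
def by_amount(row):
    return int(row["amount"] > 100000)


def by_drained(row):
    return int(row["newBalanceOrig"] == 0)


def by_type(row):
    return int(row["type"] == "TRANSFER")


def always_zero(row):
    return 0


# (learner, made-up validation accuracy)
candidates = [(always_zero, 0.50), (by_type, 0.70),
              (by_amount, 0.80), (by_drained, 0.75)]

best = select_best(candidates, m=3)

suspicious = {"type": "CASH-OUT", "amount": 250000.0, "newBalanceOrig": 0.0}
ordinary = {"type": "TRANSFER", "amount": 50.0, "newBalanceOrig": 10.0}

print(majority_vote(best, suspicious))  # 1
print(majority_vote(best, ordinary))    # 0
```

Using an odd number of voters on a two-class problem avoids tied votes.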

**Results**:
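* Among the classifiers for which we recorded all four metrics, the Linear Regressor had the highest score in 3 of the 4 categories:
    1. Binary F1
    1. Binary Precision
    1. Binary Recall

* Its accuracy ($0.47$), however, was below the $0.5$ of the Dummy Classifier and Naive Bayes, and our forest classifier had the best accuracy overall at $78$%.  Because of this, we used the forest classifier in our Heroku app.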

### Conclusion

* Potential Improvements
    * We implemented all of our ideas for improving classification in our dataset.  We believe that further improving the accuracy of our classifier would require implementing more production-style algorithms (as in more accurate sklearn implementations), which are beyond our scope.

    * We did not encounter any challenges in the classification of our dataset after we achieved equal class distributions in our dataset.
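
One common way to reach equal class distributions is random undersampling, where rows of the larger class are dropped until every class matches the smallest one.  The sketch below assumes that approach purely for illustration; it is not necessarily how our balanced file was produced, and `undersample` is a hypothetical helper.

```python
import random


def undersample(X, y, seed=0):
    """Randomly drop rows so each class has as many rows as the rarest class."""
    rng = random.Random(seed)

    # Group row indexes by class label.
    indexes_by_label = {}
    for index, label in enumerate(y):
        indexes_by_label.setdefault(label, []).append(index)

    # Every class is cut down to the size of the smallest class.
    target = min(len(indexes) for indexes in indexes_by_label.values())
    keep = []
    for indexes in indexes_by_label.values():
        keep.extend(rng.sample(indexes, target))
    keep.sort()  # preserve the original row order

    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 8 legitimate rows (0) and 2 fraudulent rows (1).
X_demo = [[i] for i in range(10)]
y_demo = [0] * 8 + [1] * 2

X_balanced, y_balanced = undersample(X_demo, y_demo)
print(y_balanced.count(0), y_balanced.count(1))  # 2 2
```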

* Key Code Components:

In [4]:
importlib.reload(mypytable)
# first, import the dataset into a MyPyTable object
mytable = mypytable.MyPyTable()
mytable.load_from_file("input_data/Fraud_chop.csv")

# the values of type are strings, so convert them to ints so the classifiers can use them
mytable.convert_col_to_int('type')
# we don't need step, nameOrig, nameDest, or isFlaggedFraud
mytable.drop_cols(['step', 'nameOrig', 'nameDest', 'isFlaggedFraud'])

# build X (every column except the last) and y (the last column, isFraud)
X = []
y = []
for row in mytable.data:
    X.append(row[:-1])
    y.append(row[-1])

# create our training and test sets
X_train_folds_indexes, X_test_folds_indexes = myevaluation.kfold_cross_validation(X, 13) # 13 proved to be our most accurate number of folds

X_test_folds, X_train_folds, y_test_folds, y_train_folds = myutils.indexes_to_fold(X_test_folds_indexes, X_train_folds_indexes, X, y)
X_test, X_train, y_test, y_train = myutils.folds_to_train_test(X_test_folds, X_train_folds, y_test_folds, y_train_folds)

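# declare the settings for the forest classifier
n = 1000  # weak learners
m = 15    # better learners
f = 4     # random attribute subsets

forest_clf = myclassifiers.MyRandomForestClassifier(random_state=100)
# note: fit() is given the full X and y here, not X_train and y_train
forest_clf.fit(X, y, n, m, f)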
y_predicted = forest_clf.predict(X_test)
accuracy = myevaluation.accuracy_score(y_test, y_predicted)
print("Accuracy:", accuracy)

Accuracy: 0.7813641900121803
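
For readers without the `mysklearn` package, the fold-splitting step can be pictured with the sketch below.  It is a stand-in written for this report, not the `myevaluation.kfold_cross_validation` code: `kfold_indexes` is a hypothetical name, and it does no shuffling or stratification.

```python
def kfold_indexes(n_samples, n_splits):
    """Split range(n_samples) into n_splits folds.

    Returns (train_folds, test_folds), two parallel lists of index lists.
    The first n_samples % n_splits folds get one extra index.
    """
    indexes = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]

    train_folds, test_folds = [], []
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_folds.append(indexes[start:stop])
        # Everything outside the test slice is used for training.
        train_folds.append(indexes[:start] + indexes[stop:])
        start = stop
    return train_folds, test_folds


train_folds, test_folds = kfold_indexes(10, 3)
print(test_folds)      # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(train_folds[0])  # [4, 5, 6, 7, 8, 9]
```

Each index lands in exactly one test fold, so every instance is used for testing once across the folds.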


* Contributions
    * Ben handled our classification and evaluation.
    * Sam handled our EDA and brought our work together in both the presentation and the final report.
* Sources:
    * Dataset:
        * https://www.kaggle.com/datasets/vardhansiramdasu/fraudulent-transactions-prediction?resource=download
    * Images:
        * https://www.1800homecare.com/homecare/elderly-scams/

    