Blocker Fraud Company

Summary

0. Business Problem
1. Solution Strategy and Assumptions Resume
- 1.1. First CRISP Cycle
2. Exploratory Data Analysis
- 2.2. Top 3 EDA Insights
3. Data Preparation
4. Embedding Space Study
5. Machine Learning Models
6. Model Tuning
7. Model Business Results
8. Model Deployment
9. References

0. Business Problem

"Blocker Fraud Company is a company specialized in detecting fraud in financial transactions made through mobile devices. The company has a service called “Blocker Fraud” that guarantees the blocking of fraudulent transactions."

The company's business model is service-based, with monetization tied to the provider's performance: the user pays a fixed fee for each successful detection of fraud in the customer's transactions.

You have been hired by a DS consultancy to create a model with high precision and accuracy for detecting fraud in transactions made through mobile devices. You need to deliver an API with access to the model, as well as answers to the following questions:

1. What is the precision and accuracy of the model?

2. How reliable is the model in classifying new transactions?

3. What is the company's expected revenue if 100% of the transactions are classified with the model?

4. What is the loss expected by the company if the model fails?

5. What is the expected profit for using this model?

0.1. What is a Service

A service is a business model similar to consultancy: the company performs work and earns a profit based on the results of that work. Blocker Fraud Company, for example, is expanding in Brazil with a new service model to win new clients based solely on its "Blocker Fraud" model.

Blocker Fraud is B2B (business to business): it offers the service to other businesses, such as banks and finance houses.

1. The company will receive 25% of the value of each transaction that is truly detected as fraud.

2. The company will receive 5% of the value of each transaction detected as fraud when the transaction is actually legitimate.

3. The company will refund 100% of the value to the customer for each transaction detected as legitimate when the transaction is actually a fraud.
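The three rules above can be sketched as a small function that returns the revenue of one classified transaction. This is a minimal sketch, not the project's code: the function name and the example transactions are hypothetical, and the zero revenue for a correctly classified legitimate transaction is an assumption, since the rules do not mention that case.

```python
def service_revenue(amount: float, predicted_fraud: bool, is_fraud: bool) -> float:
    """Revenue for one transaction (negative means a refund to the customer)."""
    if predicted_fraud and is_fraud:
        return 0.25 * amount   # rule 1: fraud correctly detected, 25% fee
    if predicted_fraud and not is_fraud:
        return 0.05 * amount   # rule 2: legitimate flagged as fraud, 5% fee
    if not predicted_fraud and is_fraud:
        return -1.00 * amount  # rule 3: missed fraud, 100% refund
    return 0.0                 # assumption: nothing is charged for a true negative


# Hypothetical transactions: (amount, predicted_fraud, is_fraud)
transactions = [
    (1000.0, True, True),
    (200.0, True, False),
    (500.0, False, True),
    (300.0, False, False),
]
total = sum(service_revenue(a, p, f) for a, p, f in transactions)
print(total)
```

A single missed fraud of 500 wipes out the fees earned on the two flagged transactions, which is why missed frauds dominate the business result.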

For the customer this is an excellent deal. Although the fee charged on success is very high (25%), the customer reduces its costs from fraudulent transactions that are correctly detected, and any loss caused by an error in the service is covered by Blocker Fraud itself.

For Blocker Fraud, this strategy helps attract new clients in Brazil, but guaranteeing reimbursement in case of failure is risky: the outcome depends directly on the precision and accuracy of the model in making these detections.

B2B metrics are very important. Let's look at some of them:

💸 Market Strategy;
💸 Target Clients;
💸 Number of New Clients;
💸 Client Acquisition Cost;
💸 Lead Conversion Rate;
💸 Churn Rate;

0.2. What is a Fraud

"Wrongful or criminal deception intended to result in financial or personal gain." ~ Wiki

Fraud is an intentionally deceptive action designed to provide the perpetrator with an unlawful gain or to deny a right to a victim. In addition, it is a deliberate act (or failure to act) with the intention of obtaining an unauthorized benefit, either for oneself or for the institution, by using deception or false suggestions or suppression of truth or other unethical means, which are believed and relied upon by others. Depriving another person or the institution of a benefit to which he/she/it is entitled by using any of the means described above also constitutes fraud.

Types of fraud include tax fraud, credit card fraud, wire fraud, securities fraud, and bankruptcy fraud. Fraudulent activity can be carried out by one individual, multiple individuals or a business firm as a whole.

More Examples of Fraud:

Data theft on fake websites.
Purchase of physical facilities.
Document change.
Computer data theft.
Interception of goods.
Phishing.
Bots and Stealing Files.
Receipt of benefits not granted.
Ethics Violations.
Conflict of interests.
Appropriation or misuse of resources, e.g. supplies.

Fraud involves the false representation of facts, whether by intentionally withholding important information or providing false statements to another party for the specific purpose of gaining something that may not have been provided without the deception.

0.3. Fraud in Financial Transactions

A financial transaction is an agreement, or communication, between a buyer and seller to exchange goods, services, or assets for payment. Any transaction involves a change in the status of the finances of two or more businesses or individuals. A financial transaction always involves one or more financial assets, most commonly money or another valuable item such as gold or silver.

0.3.1. Types of Transactions

0.3.1.1. Cash Transactions

A cash transaction is any transaction where money is exchanged for a good, service, or other commodity. Cash transactions can refer to items bought with physical money, such as coins or cash, or with a debit card. These differ from credit transactions because the money is immediately taken from the buyer and given to the seller.

A cash transaction stands in contrast to other modes of payment, such as credit transactions in a business involving bills receivable. Similarly, a cash transaction is also different from credit card transactions.

0.3.1.2. Transfer

A transfer is the movement of assets, funds, or ownership rights from one place to another. The term is also used to describe the process by which ownership of funds or assets is reassigned to a new owner. Banking, brokerage, cryptocurrency, asset titles, and loan transfers are a few examples of domains and transaction types where transfers occur.

0.3.1.3. Debit

A debit card payment is the same as an immediate payment of cash as the amount gets instantly debited from your bank account.

Debit cards allow bank customers to spend money by drawing on existing funds they have already deposited at the bank, such as from a checking account. A debit transaction using your PIN (personal identification number) is an online transaction completed in real time. When you complete a debit transaction, you authorize the purchase with your PIN and the merchant communicates immediately with your bank or credit union, causing the funds to be transferred in real time.

In a credit card payment, by contrast, the amount is not debited from your bank account instantly. You can buy the product within the credit limit of your card, and you only need to pay after your credit card bill is generated.

0.3.1.4. Payment

An act initiated by the payer or payee, or on behalf of the payer, of placing, transferring or withdrawing funds, irrespective of any underlying obligations between the payer and payee.

1. Solution Strategy and Assumptions Resume

The first problem found is the massive size of the dataset. The dataset dimensions are shown below.

Columns	Rows
11	6353307

1.1. First CRISP Cycle

Streamlit Dashboard!

Data Cleaning & Descriptive Statistics: The first step in a data science project is to clean the dataset, check the data types, and fix simple input errors. Once the dataset is clean, the next step is to examine its behavior with simple statistics (mean, median, std, skew, kurtosis).
Feature Engineering: In this step I used coggle.it to build a mind map and derived a list of hypotheses from it. From that list I created new features based on differences, such as the origin and destination balance differences, plus a merchant flag and a day feature derived from the "step" feature.
Data Filtering: In the first cycle I kept only a range of values for the amount, because the dataset contains no fraud among very high-value transactions.
Exploratory Data Analysis: In this step I dug deep into the dataset to search for behaviors in three steps (univariate, bivariate, and multivariate analysis) in order to detect fraud behaviors, validate the business hypotheses, and check correlations.
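The feature engineering step described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the column names (`step`, `oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, `newbalanceDest`, `nameDest`), the convention that merchant accounts start with "M", and the reading of "step" as an hour counter starting at 1 are all assumptions that do not appear in this document.

```python
def add_features(row: dict) -> dict:
    """Return a copy of one transaction with the derived features added."""
    out = dict(row)
    # Balance differences on the origin and destination accounts.
    out["diff_orig"] = row["oldbalanceOrg"] - row["newbalanceOrig"]
    out["diff_dest"] = row["newbalanceDest"] - row["oldbalanceDest"]
    # Merchant flag (assumption: merchant account names start with "M").
    out["is_merchant_dest"] = int(row["nameDest"].startswith("M"))
    # Day of the simulation (assumption: one step is one hour, first step is 1).
    out["day"] = (row["step"] - 1) // 24 + 1
    return out


# Hypothetical transaction used only to show the output shape.
example = add_features({
    "step": 25,
    "amount": 900.0,
    "oldbalanceOrg": 1000.0,
    "newbalanceOrig": 100.0,
    "oldbalanceDest": 0.0,
    "newbalanceDest": 900.0,
    "nameDest": "M123456",
})
print(example["diff_orig"], example["diff_dest"], example["is_merchant_dest"], example["day"])
```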

2. Exploratory Data Analysis

2.1. EDA On First Cycle

I divided the EDA into three steps: univariate, bivariate, and multivariate. In the univariate step I checked the target and all the features (categorical and numerical). In the bivariate step I validated 8 business hypotheses. The multivariate step covers the correlations and the pair plots.

Main Results

Usually after fraudulent transactions, all money is withdrawn from the destination account.
Only Transactions and Debit types have fraudulent activities.
The dataset does not follow a standard cash flow due to the simulator that generated the data.
There is not much information related to CASH-OUT to answer item 3; it can be assumed that the cash-out is a transaction worth 0 and therefore has some value that does not follow a flow.
The numerical variables have distributions that only become visible after applying a log transformation.
There is a spike in fraudulent transactions at the beginning of the simulation month.
Fraudulent transactions do not have negative diff in origin based on amount.
The larger the transaction, the more susceptible it is to fraud.

2.2. Top 3 EDA Insights

1. Debit transactions represent more than 50% of fraud cases.

2. Transactions after the 20th of the month account for 20% more fraud cases.

3. 80% of fraud cases happen when the destination of the transaction already has money in the account.

3. Data Preparation

3.1. Encoding and Rescaling

In the first cycle I used only Frequency Encoding for all categorical features and, for the most part, Robust Scaler for the numerical features, because all numerical variables are extremely skewed. In the EDA, however, I also saw some good candidates for a log transformation; for example, the amount takes a nearly normal shape once it is transformed. For the day feature I chose a sin/cos transformation.
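The four transformations mentioned above can be sketched without external dependencies. This is a minimal illustration, not the project's pipeline: the function names, the example values, and the 31-day period for the cyclical encoding are assumptions. The robust scaling follows the usual definition, (x - median) / IQR, which is what scikit-learn's `RobustScaler` computes with its default settings.

```python
import math
import statistics
from collections import Counter


def frequency_encode(values):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]


def robust_scale(values):
    """Center on the median and scale by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # guard against a zero IQR
    return [(v - median) / iqr for v in values]


def log_transform(values):
    """log(1 + x), which compresses a long right tail and tolerates zeros."""
    return [math.log1p(v) for v in values]


def cyclical_encode(day, period=31):
    """Map a day onto the unit circle (period of 31 days is an assumption)."""
    angle = 2 * math.pi * day / period
    return math.sin(angle), math.cos(angle)


# Hypothetical values for illustration.
types = ["A", "A", "B", "C"]
amounts = [1.0, 2.0, 3.0, 4.0, 100.0]
encoded_types = frequency_encode(types)
scaled_amounts = robust_scale(amounts)
logged_amounts = log_transform(amounts)
print(encoded_types, scaled_amounts, cyclical_encode(10))
```

Because the median and IQR ignore the tail, the outlier of 100 stays far from the other scaled values instead of squeezing them together, which is the reason to prefer this scaler on skewed features.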

In the second cycle I will try the log transformation and other encodings, such as One-Hot for the transaction type, because the types have no inherent order, and I think frequency encoding introduces a bias because the types occur with very different frequencies.

However, the EDA showed that fraud occurs only in the Debit and Transaction types of this feature.

3.2. Feature Selection

In the first cycle I chose a simple and fast method to check feature importance using only two models, Random Forest and XGBoost. In the second cycle I will try Boruta.

In summary, XGBoost gave better results than Random Forest. Why XGB? It picked up the differences in the type feature (cashin, cashout...) and did not overfit to the other features, whereas RF selected only two features.

4. Embedding Space Study

Next Cycle

5. Machine Learning Models

In this step I used only four models for prediction: Stochastic Gradient Descent, LGBM, Random Forest Classifier, and XGBoost Classifier. The dataset was split into three parts: train, validation, and test.

5.1. Stochastic Gradient Descent

SGD is my second baseline model. In some cases this model fits the data well and performs well, but for this dataset, given the results of data preparation, SGD performed worse than RF and XGB.

5.2. XGBoost

XGBoost gave the best results of all models on the full training dataset. The results on the test dataset are:

In cross-validation with the test dataset I got these results on both datasets:

5.3. Model Metrics

The key to these results is the model's recall, a performance metric that is driven by false negatives. At Blocker Fraud, a lower recall means the model is missing fraudulent transactions. This is terrible for the current business model, which depends on detecting fraud: if the model misses a genuine fraud, it can cause a lot of financial damage, especially if the transaction is very large.

In the confusion matrix (2x2 plot) the model misses six fraudulent transactions, and the question remains: what if just one of these transactions costs the company everything? The business model is a 100% refund if the model misses a fraud!
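The metrics discussed above follow directly from the four cells of the confusion matrix. The sketch below uses hypothetical counts (only the six false negatives echo the text) to show how accuracy can stay very high on an imbalanced dataset while recall still exposes the missed frauds.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # penalized by false negatives
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Hypothetical counts: 1,000 frauds among 100,000 transactions, 6 of them missed.
metrics = classification_metrics(tp=994, fp=4, fn=6, tn=98996)
print(metrics)
```

With these counts accuracy is 0.9999 while recall is 0.994, so the six missed frauds are only visible in the recall.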

6. Model Tuning

In this step, I used the following techniques:

Cross Validation: To measure the real performance of the model on different pieces of the dataset.

Cross Validation for Default Model.
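A dependency-free sketch of the cross-validation idea is shown below. It is an illustration only: the stratified split (keeping the fraud ratio similar in every fold), the toy threshold "model", and the generated data are all assumptions and stand in for the real models and dataset. In practice a library splitter such as scikit-learn's `StratifiedKFold` would normally be used.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=42):
    """Split indices into k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def cross_val_recall(X, y, fit, predict, k=5):
    """Train on k-1 folds, score recall on the held-out fold, repeat k times."""
    scores = []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in held_out]
        scores.append(recall([y[i] for i in held_out], preds))
    return statistics.mean(scores), statistics.pstdev(scores)


# Toy "model" (hypothetical): flag amounts at or above the smallest fraud seen.
def fit_threshold(X_train, y_train):
    return min(x for x, label in zip(X_train, y_train) if label == 1)


def predict_threshold(threshold, x):
    return int(x >= threshold)


amounts = list(range(100))
labels = [1 if a >= 90 else 0 for a in amounts]
mean_recall, std_recall = cross_val_recall(amounts, labels, fit_threshold, predict_threshold)
print(f"recall: {mean_recall:.4f} +/- {std_recall:.4f}")
```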

Grid Search: To maximize model performance by tuning the model's parameters.

Using Grid Search I got a simpler model, with 85 estimators for example.

Change Threshold: Analyze whether or not it is worth changing the Threshold for better performance.

By default, sklearn uses a threshold of 0.5; in this analysis I checked which threshold is best for this model's classification.
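One way to run such a threshold analysis is to scan candidate thresholds over the predicted probabilities and compare precision and recall at each one. The sketch below is an assumption about how this could be done, not the analysis used in the project: the selection rule (best precision among thresholds that keep recall at or above a target, preferring the higher threshold on ties), the function names, and the example scores are all hypothetical.

```python
def precision_recall(y_true, proba, threshold):
    """Precision and recall when predicting fraud for proba >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, proba):
        predicted_fraud = p >= threshold
        if predicted_fraud and y == 1:
            tp += 1
        elif predicted_fraud and y == 0:
            fp += 1
        elif not predicted_fraud and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(y_true, proba, min_recall=1.0,
                   candidates=tuple(i / 10 for i in range(1, 10))):
    """Best-precision threshold that still reaches min_recall (None if none does)."""
    best, best_precision = None, -1.0
    for t in sorted(candidates):
        precision, recall = precision_recall(y_true, proba, t)
        if recall >= min_recall and precision >= best_precision:
            best, best_precision = t, precision
    return best


# Hypothetical labels and predicted fraud probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.30, 0.35, 0.45, 0.80, 0.95]
best = pick_threshold(y_true, proba)
print(best, precision_recall(y_true, proba, 0.5))
```

In this toy data the default threshold of 0.5 misses the fraud scored at 0.35, so a lower threshold is needed to keep recall at 1.0, at the cost of more false positives.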

Calibration Curves: Calibrate the model for better performance.

Predicted probabilities that match the expected distribution of probabilities for each class are referred to as calibrated. The distribution of the probabilities can be adjusted to better match the expected distribution observed in the data.
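To check whether predicted probabilities are calibrated, they can be grouped into bins and the mean predicted probability of each bin compared with the observed fraction of frauds in it. The sketch below computes only this diagnostic curve, not the calibration itself; the function name, the number of bins, and the example data are assumptions. scikit-learn offers a comparable `calibration_curve` function in `sklearn.calibration`.

```python
def reliability_curve(y_true, proba, n_bins=5):
    """Return (mean predicted probability, observed fraud rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, proba):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)
            frac_pos = sum(y for y, _ in members) / len(members)
            curve.append((mean_pred, frac_pos))
    return curve


# Hypothetical scores: the model says 0.1 and 0.9, but the observed fraud
# rates in those groups are 0.25 and 0.75, so it is overconfident.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
proba = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
curve = reliability_curve(y_true, proba)
for mean_pred, frac_pos in curve:
    print(f"predicted {mean_pred:.2f} -> observed {frac_pos:.2f}")
```

A perfectly calibrated model would produce points on the diagonal, where the predicted probability equals the observed rate.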

7. Model Business Results

Now I will translate model performance into business performance. In other words, how much money will I make using the model?

1. What is the precision and accuracy of the model?

In a single train and test run, the model reached an accuracy of 0.999994 and a precision of 0.99513. However, these two metrics are not sufficient for Blocker Fraud: if the model gets a fraud detection wrong, the company has to refund 100% of the transaction amount, so I also used recall.

The lower the recall, the higher the chance of false negatives (frauds that the model classifies as not fraud). In a single train and test run I got a recall of 1.0, based on the number of frauds not detected.

In cross-validation I got a recall of 0.9951 +/- 0.0.

2. How reliable is the model in classifying new transactions?

Answer in Second Cycle!

3. What is the company's expected revenue if 100% of the transactions are classified with the model?

If all transactions are truly detected as fraud, the company will receive a total of R$ 56616833024,00.
If all fraudulent transactions are detected by the model, the company will receive a total of R$ 3014103808,00.
The total obtained from correct detections alone is R$ 59630936064,00.

4. What is the loss expected by the company if the model fails?

If none of the fraudulent transactions in the dataset is detected, the company will lose R$ 3014103808,00.
The detection difference is R$ -9042311168,00; if the model misses frauds, the company loses more.

5. What is the expected profit for using this model?

The company will save approximately R$ 11996415433,55 on the fraudulent transactions in the dataset alone!

The model misses six fraudulent transactions; to answer this question I sorted the values and dropped the six transactions with the highest amounts.
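One possible reading of this procedure is a worst-case estimate: sort the fraudulent amounts, treat the largest ones as the frauds the model missed, and apply the 25% fee to the rest and the 100% refund to the missed ones. The sketch below follows that reading. It is an assumption about the method, and the function name and the example amounts are hypothetical.

```python
def worst_case_profit(fraud_amounts, n_missed, fee_rate=0.25, refund_rate=1.0):
    """Profit on fraudulent transactions if the n_missed largest ones are missed."""
    ordered = sorted(fraud_amounts)
    split = len(ordered) - n_missed
    if split < 0:
        raise ValueError("n_missed cannot exceed the number of frauds")
    detected, missed = ordered[:split], ordered[split:]
    # Fee earned on detected frauds minus full refunds on missed frauds.
    return fee_rate * sum(detected) - refund_rate * sum(missed)


# Hypothetical fraud amounts; the largest one is assumed to be missed.
profit = worst_case_profit([100.0, 200.0, 300.0, 400.0], n_missed=1)
print(profit)
```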

8. Model Deployment

8.1. ETL Process

This is the first cloud architecture idea.

Data Source -> PostgreSQL: The "raw" data is in PostgreSQL on Heroku Cloud. There are only 5,000 rows to classify with the model.
Airflow Task I: In this task, Airflow collects data daily from PostgreSQL and stores it in a local CSV file for backup and for classification.
Airflow Task II: In this task, Airflow makes a daily POST request to the model in the cloud to classify the 5,000 rows as fraud or not fraud.
Airflow Task III: Store the CSV/Parquet file in an S3 bucket.
Database Dashboard: Daily job to collect new data to feed the Power BI/Streamlit dashboard.

MinIO S3 buckets (data lake) with PySpark Parquet files and a CSV file.

8.2. Airflow Jobs

Airflow on Windows with docker-compose; dark green indicates success for all data pipeline jobs.

The pipeline consists of four simple tasks:

Create Buckets: This step checks whether the buckets exist on S3.
Collect Raw Dataset: Collect raw data from production.
Model API Request: Make a POST request to the model API to classify whether the new transactions are fraud or not.
Store Dataset on S3: Store the newly classified dataset in an S3 bucket in Parquet format.
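The "Model API Request" task can be sketched with the Python standard library alone. Everything specific here is an assumption: the URL is a placeholder, the JSON payload (a list of transaction records) and the JSON response are hypothetical, and the real API contract is not described in this document.

```python
import json
import urllib.request

API_URL = "https://example.com/predict"  # hypothetical endpoint, not the real one


def build_request(rows, url=API_URL):
    """Build a JSON POST request carrying the transactions to classify."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(rows, url=API_URL, timeout=30):
    """Send the request and decode the JSON response (assumed response format)."""
    with urllib.request.urlopen(build_request(rows, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Build (but do not send) a request for one hypothetical transaction.
request = build_request([{"step": 1, "type": "TRANSFER", "amount": 181.0}])
print(request.get_method(), request.full_url)
```

Keeping the request construction separate from the network call makes the task easy to check locally before it is scheduled in Airflow.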

In Future:

More Robust PySpark Structure;
Try Hadoop Clusters;

8.3. Streamlit or BI Dashboard