# Intro

## 1.1. The Blocker Fraud Company

**Blocker Fraud Company** specializes in detecting fraud in financial transactions made through mobile devices. Its "Blocker Fraud" service guarantees that fraudulent transactions are blocked.

The company operates as a service business and is paid according to the performance of the service it provides: the user pays a fixed fee based on the success of the fraud detection.

## 1.2. Expansion Strategy in Brazil

To expand its business in Brazil, the company has adopted the following strategy:

* The company will receive 25% of the value of each transaction correctly detected as *fraud*.
* The company will receive 5% of the value of each transaction detected as *fraud* that is actually *legitimate*.
* The company will return 100% of the value to the customer for each transaction detected as *legitimate* that is actually a *fraud*.
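
The three rules above can be sketched as a small function that returns the company's cash flow for a single transaction. This is only an illustration: the function name `company_cashflow` is hypothetical, and the zero cash flow for a legitimate transaction classified as legitimate is an assumption, since the strategy does not mention that case.

```python
def company_cashflow(amount: float, predicted_fraud: bool, actual_fraud: bool) -> float:
    """Cash flow for one transaction under the fee rules listed above."""
    if predicted_fraud and actual_fraud:
        return 0.25 * amount   # fraud correctly detected: 25% fee
    if predicted_fraud and not actual_fraud:
        return 0.05 * amount   # legitimate transaction flagged as fraud: 5% fee
    if not predicted_fraud and actual_fraud:
        return -1.00 * amount  # missed fraud: 100% returned to the customer
    return 0.0                 # assumption: no fee when nothing is flagged


# Hypothetical transaction of 1,000 currency units under each outcome
for predicted, actual in [(True, True), (True, False), (False, True), (False, False)]:
    print(predicted, actual, company_cashflow(1000.0, predicted, actual))
```

A single missed fraud costs as much as four correct detections of the same value earn, which is why recall matters as much as precision for this business model.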

## 1.3. Context
There is a lack of publicly available datasets on financial services, especially in the emerging domain of mobile money transactions. Financial datasets are important to many researchers, and in particular to us, who perform research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

## 1.4. Content
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, who is the provider of the mobile financial service which is currently running in more than 14 countries all around the world.

This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.

## 1.5. Goal
1. What are the model's *precision* and *accuracy*?
2. How reliable is the model in classifying transactions as *legitimate* or *fraudulent*?
3. What is the company's expected billing if 100% of the transactions are classified with the model?
4. What is the company's expected loss in case of model failure?
5. What is the profit expected by the **Blocker Fraud Company** when using the model?

`Disclaimer: The following context is completely fictional, the company, the context, the CEO and the business questions.`

## 1.6. Data

Data provided by Kaggle: [Synthetic Financial Datasets for Fraud Detection](https://www.kaggle.com/ntnu-testimon/paysim1)

## 1.7. Analysis

This solution uses descriptive statistics and data visualization to find key figures for understanding the distribution, count, and relationship between variables. Since the goal of the project is to predict fraud, classification algorithms from the supervised learning family of machine learning models will be implemented.

## 1.8. Evaluation

The project will conclude with the evaluation of the machine learning model selected with a validation data set. The output of the predictions can be checked through a confusion matrix, and metrics such as accuracy, precision, recall, F1 and Kappa scores.
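
As a reminder of what these metrics measure, the sketch below computes them directly from the four cells of a binary confusion matrix. The function name `classification_metrics` and the counts in the example are hypothetical and are not results from this project.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, F1 and Cohen's kappa from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With heavily imbalanced classes, accuracy can stay high even when many frauds are missed, so precision, recall and kappa are the more informative figures.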

# 2. Data

## Loading Data

In [1]:
# importing
import numpy as np
import pandas as pd

# data viz
import matplotlib.pyplot as plt
import seaborn as sns

# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# model selection
from sklearn.model_selection import train_test_split

# read csv
df = pd.read_csv('PS_20174392719_1491204439457_log.csv')


## Data Check

In [None]:
# rows and columns
print(df.shape)

# check first five rows
df.head()

In [None]:
df.dtypes

## About Data

### Headers

| Feature        | Description                                                  |
| :------------- | :----------------------------------------------------------- |
| step           | Maps a unit of time in the real world. In this case, 1 step is 1 hour of time. There are 744 steps in total, which corresponds to 31 days of simulated time. |
| type           | Transaction type (CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER) |
| amount         | amount of the transaction in local currency                  |
| nameOrig       | customer who started the transaction                         |
| oldbalanceOrg  | initial balance before the transaction                       |
| newbalanceOrig | new balance after the transaction                            |
| nameDest       | customer who is the recipient of the transaction             |
| oldbalanceDest | initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants) |
| newbalanceDest | new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants) |
| isFraud        | Marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system |
| isFlaggedFraud | The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200K in a single transaction |
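
The `isFlaggedFraud` rule described in the table can be sketched as follows. The helper name `is_flagged`, the strict "greater than" comparison and the restriction to `TRANSFER` transactions are assumptions made for illustration; the flags actually stored in the dataset may not follow this rule exactly.

```python
FLAG_THRESHOLD = 200_000  # "more than 200K in a single transaction", per the table above


def is_flagged(tx_type: str, amount: float) -> int:
    """Return 1 if a transfer exceeds the threshold, else 0 (assumed reading of the rule)."""
    return int(tx_type == "TRANSFER" and amount > FLAG_THRESHOLD)


# Hypothetical transactions
print(is_flagged("TRANSFER", 250_000.0))  # large transfer
print(is_flagged("TRANSFER", 150_000.0))  # transfer below the threshold
print(is_flagged("CASH_OUT", 250_000.0))  # large amount, but not a transfer
```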

# Exploratory Data Analysis

##  Summary Statistics

In [3]:
# describe data 
df.describe()


## Missing Data

It is important to check whether there are any missing values that might need to be imputed or removed. The check below shows that the data is clean, so no imputation is needed.

In [None]:
df.isnull().sum()

## Explore the Data

To examine fraudulent activity, the data is split into a subset containing only fraudulent transactions (`fraud`) and a subset containing only legitimate ones (`no_fraud`).

# Hypothesis

## 4.1. Step
1. Fraud should occur in transactions made before day 15.
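
Because the dataset records time as hourly steps rather than days, testing this hypothesis needs a conversion from `step` to day. A minimal sketch, assuming steps start at 1 and each step is one hour (the helper names are hypothetical):

```python
HOURS_PER_DAY = 24


def step_to_day(step: int) -> int:
    """1-based day of the simulation, assuming step 1 is the first hour of day 1."""
    return (step - 1) // HOURS_PER_DAY + 1


def is_before_day_15(step: int) -> bool:
    """True for transactions made on days 1 to 14."""
    return step_to_day(step) < 15


for step in (1, 24, 25, 336, 337, 744):
    print(step, step_to_day(step), is_before_day_15(step))
```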

## 4.2. Transactions amount
1. Fraud occurs in transactions with high amounts.

## 4.3. Transactions Origin
1. Fraud does not occur when `oldbalanceOrg` is equal to zero.
2. Fraud does not occur when `newbalanceOrig` is equal to zero.
3. Fraud occurs when the difference between the balance before and after the transaction is different from the transaction amount.
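
The third hypothesis can be checked per transaction with a small predicate. The sketch below assumes an outgoing transaction (the origin balance should drop by exactly the amount) and uses a hypothetical tolerance of 0.01 to absorb rounding; the name `balance_mismatch` is also hypothetical.

```python
def balance_mismatch(old_balance: float, new_balance: float,
                     amount: float, tol: float = 0.01) -> bool:
    """True when the drop in the origin balance does not match the amount sent."""
    return abs((old_balance - new_balance) - amount) > tol


# Hypothetical transactions: (oldbalanceOrg, newbalanceOrig, amount)
examples = [
    (1000.0, 800.0, 200.0),   # consistent: balance dropped by the amount
    (1000.0, 0.0, 1500.0),    # inconsistent: amount exceeds the balance change
    (0.0, 0.0, 300.0),        # inconsistent: balance did not change at all
]
for old, new, amount in examples:
    print(old, new, amount, balance_mismatch(old, new, amount))
```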

## 4.4. Transactions Destination
1. Fraud does not occur when `oldbalanceDest` is equal to zero.
2. Fraud does not occur when `newbalanceDest` is equal to zero.

## 4.5. Transactions Type
1. Fraud occurs with cash out.
2. Fraud occurs with transfer.

# Data Visualization

## 5.1. Relationship

Understand relationships between variables.

Fraudulent transactions represent 0.13% of the dataset.

In [None]:
plt.figure(figsize=(5,5))
sns.set(style="darkgrid")
# percentage of legitimate (0) and fraudulent (1) transactions
print(df.isFraud.value_counts(normalize=True) * 100)
ax = sns.countplot(data=df, x='isFraud', palette='Set3')
plt.show()

Fraud occurs in only 2 types of transactions: **TRANSFER** and **CASH_OUT**.

In [None]:
# plotting most common fraud type
fraud = df[df.isFraud == 1]  # subset with only the fraudulent transactions
plt.figure(figsize=(5,5))
sns.set(style="darkgrid")
print(fraud.type.value_counts())
ax = sns.countplot(data=fraud, x='type', palette='Set1')
plt.title('Fraud by transfer and cash out', fontsize=14)
plt.show()

In [None]:
# plotting no fraud transactions
no_fraud = df[df.isFraud == 0]  # subset with only the legitimate transactions
plt.figure(figsize=(5,5))
sns.set(style="darkgrid")
print(no_fraud.type.value_counts())
ax = sns.countplot(data=no_fraud, x='type', palette='Set1')
plt.title('No Fraud', fontsize=14)
plt.show()

Fraudulent transactions account for about 1% of the total transaction amount.

In [None]:
amount_transaction = df.groupby('isFraud')['amount'].sum()
print((amount_transaction/amount_transaction.sum())*100)
amount_transaction.plot(kind='bar')
plt.show()

There is a strong correlation between the columns: `oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest` and `newbalanceDest`.

In [None]:
plt.figure(figsize=(13,10))
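# numeric_only=True skips the text columns (type, nameOrig, nameDest); requires pandas >= 1.5
sns.heatmap(df.corr(numeric_only=True), annot=True)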
plt.show()

## Distribution

Show the possible values that we can expect to see in a variable, along with how likely they are.

* `step` is right-skewed: its median is lower than its mean.
* In the `fraud` data, `step` shows no obvious clustering; its distribution is approximately *uniform* and *symmetric*.
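
The mean-versus-median comparison behind these two observations can be sketched as a rough heuristic. The function name `skew_direction` and the 5% tolerance are assumptions chosen for illustration, not a formal skewness test, and the sample values are made up.

```python
from statistics import mean, median


def skew_direction(values, rel_tol: float = 0.05) -> str:
    """Rough skew label from the gap between mean and median, relative to the range."""
    avg, mid = mean(values), median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(avg - mid) <= rel_tol * spread:
        return "roughly symmetric"
    return "right-skewed" if avg > mid else "left-skewed"


# Hypothetical samples: a long right tail pulls the mean above the median
print(skew_direction([1, 2, 2, 3, 3, 3, 4, 20]))
print(skew_direction(list(range(1, 11))))
```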

In [None]:
# mean and median
df_step_mean = df.step.mean()
df_step_median = df.step.median()
fraud_step_mean = fraud.step.mean()
fraud_step_median = fraud.step.median()

# plotting first histogram (histplot replaces the deprecated distplot; requires seaborn >= 0.11)
fig, axs = plt.subplots(nrows= 2, ncols=1, figsize=(20,6), sharex=True)
fig.suptitle('Steps in hour', fontsize=20)
sns.histplot(x=df.step, ax=axs[0], color='green')
axs[0].set_xlabel('Step in Overall Transactions')
axs[0].axvline(df_step_median, color = 'r', linewidth = 3, label = 'median')
axs[0].axvline(df_step_mean, color='b', linewidth = 3, label = 'mean')
axs[0].legend()

# plotting second histogram
sns.histplot(x=fraud.step, ax=axs[1])
axs[1].set_xlabel('Step in Fraud transactions')
axs[1].axvline(fraud_step_median, color = 'g', linewidth = 1, label = 'median')
axs[1].axvline(fraud_step_mean, color='black', linewidth = 1, label = 'mean')
axs[1].legend()
plt.show()

## Trends

Identify patterns of change over time.

In [None]:
fig, axs = plt.subplots(nrows= 3, ncols=1, figsize=(20,6), sharex=True, sharey=True)
fig.suptitle('Transactions', fontsize=20)
# amount linechart
sns.lineplot(data = df, x='step', y='amount', ax=axs[0], color='green', ci=None)
axs[0].set_ylabel('Amount')

# origin balance linechart
sns.lineplot(data = df, x='step', y='oldbalanceOrg', ax=axs[1], color='blue', ci=None, label='Old Balance')
sns.lineplot(data = df, x='step', y='newbalanceOrig', ax=axs[1], color='black', ci=None, label='New Balance')
axs[1].set_ylabel('Balance Origin')
axs[1].legend()

# destination balance linechart
sns.lineplot(data = df, x='step', y='oldbalanceDest', ax=axs[2], color='purple', ci=None, label='Old Balance')
sns.lineplot(data = df, x='step', y='newbalanceDest', ax=axs[2], color='tomato', ci=None, label='New Balance')
axs[2].set_ylabel('Balance Destination')
axs[2].legend()
plt.show()

# Variables Filtering

This step aims to remove the outliers from the dataset. As previously checked in the descriptive statistics, some features have a huge range of values, particularly the amount, destination balance before the transaction `oldbalanceDest` and the destination balance after the transaction `newbalanceDest`, as shown in the boxplot below.

## Removing Outliers
Define a function that drops outliers, then apply it to the three features:

In [None]:
def drop_outliers(var: str, dataset: pd.DataFrame):

    # find Q1, Q3 and IQR
    Q1 = np.quantile(dataset[var], .25)
    Q3 = np.quantile(dataset[var], .75)
    IQR = Q3 - Q1

    # calculate the outlier boundaries with the 1.5 * IQR rule
    low = Q1 - 1.5 * IQR
    high = Q3 + 1.5 * IQR

    # keep only the rows strictly inside the boundaries
    dados_resultado = dataset.loc[(dataset[var] > low) & (dataset[var] < high)]
    return dados_resultado

In [None]:
df = drop_outliers('amount', df)
df = drop_outliers('oldbalanceDest', df)
df = drop_outliers('newbalanceDest', df)
df.shape
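
To make the 1.5 × IQR rule concrete, here is a worked example on a handful of made-up amounts. It is a sketch that uses the standard library's `statistics.quantiles` with `method="inclusive"` (linear interpolation between data points) instead of NumPy so that it runs on its own; the helper name `iqr_bounds` is hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Lower and upper outlier boundaries from the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical amounts with one extreme value
amounts = [10, 12, 13, 15, 16, 18, 20, 500]
low, high = iqr_bounds(amounts)
kept = [a for a in amounts if low < a < high]

print(low, high)  # Q1 = 12.75, Q3 = 18.5, IQR = 5.75 -> 4.125 and 27.125
print(kept)       # 500 falls outside the boundaries and is dropped
```

Because each call removes rows, applying the rule to several columns in sequence means the later boundaries are computed on data already filtered by the earlier ones.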

## Visualizing Outliers with boxplot

In [None]:
# boxplot 1
sns.boxplot(x=df.amount)
plt.xlabel('Amount')
plt.show()

In [None]:
# boxplot 2
sns.boxplot(x=df.oldbalanceOrg)
plt.xlabel('Old Balance Origin')
plt.show()

## New Trends

As seen earlier, fraudulent transactions occur only in the TRANSFER and CASH_OUT types, so a new DataFrame, `df2`, is created with only these two types.

In [None]:
# .copy() gives an independent DataFrame that can be modified later without warnings
df2 = df.loc[(df.type == 'TRANSFER') | (df.type == 'CASH_OUT')].copy()
print(df2.shape)

In [None]:
fig, axs = plt.subplots(nrows= 5, ncols=1, figsize=(20,6), sharex=True, sharey=True)
fig.suptitle('Transactions x step', fontsize=20)
# amount
sns.scatterplot(data = df2, x='step', y='amount', hue='isFraud', ax=axs[0], color='green')

# origin balance
sns.scatterplot(data = df2, x='step', y='oldbalanceOrg', hue='isFraud', ax=axs[1], color='blue', label='Old Balance')
sns.scatterplot(data = df2, x='step', y='newbalanceOrig', ax=axs[2], hue='isFraud',  color='black', label='New Balance')

# destination balance
sns.scatterplot(data = df2, x='step', y='oldbalanceDest', ax=axs[3], hue='isFraud', color='purple', label='Old Balance')
sns.scatterplot(data = df2, x='step', y='newbalanceDest', ax=axs[4], hue='isFraud', color='tomato', label='New Balance')

plt.show()

# Preprocessing

## Dummy Variables
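In this step, the categorical variable `type` is converted into a dummy variable, a binary indicator that identifies the category. Because only two categories remain, a single column is enough:
* 0 = Cash out
* 1 = Transfer

In [None]:
# get_dummies returns one column per category; keep only the TRANSFER indicator
# so that 1 = TRANSFER and 0 = CASH_OUT
df2['type'] = pd.get_dummies(df2['type'], dtype=int)['TRANSFER']
print(len(df2))
df2.head()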

In [None]:
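# create the output folder if it does not exist, then save the filtered data
import os
os.makedirs('dataset', exist_ok=True)
df2.to_csv('dataset/df2.csv', index=False)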

## Splitting Data
Before training the machine learning models, the data needs to be split into training and validation sets. In this split, 20% of the data is reserved for the final validation, while 80% is kept for training the model.

In [None]:
# loading df2 data
df2 = pd.read_csv('dataset/df2.csv')

# Y is the target column and X holds the remaining features
X = df2.drop(columns=['nameOrig', 'nameDest', 'isFraud', 'isFlaggedFraud'])
Y = df2['isFraud']

In [None]:
# split the data into chunks
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_val.shape)
print(Y_train.shape)
print(Y_val.shape)

# Prediction

## Model Building
Now it's time to create some models. For this project, four common algorithms will be used to make predictions: Logistic Regression, Decision Tree, Random Forest and XGBoost.

## Logistic Regression
The first model is logistic regression.

In [None]:
lr = LogisticRegression().fit(X_train, Y_train)

In [None]:
lrpred = lr.predict(X_val)

In [None]:
acc = accuracy_score(Y_val, lrpred)
precision = precision_score(Y_val, lrpred)
recall = recall_score(Y_val, lrpred)
f1 = f1_score(Y_val, lrpred)

logistic_regression = pd.DataFrame(
    [['Logistic Regression', acc, precision, recall, f1]],
    columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score']
     )

## Decision Trees

In [None]:
dt = DecisionTreeClassifier().fit(X_train, Y_train)

In [None]:
dtpred = dt.predict(X_val)

In [None]:
acc = accuracy_score(Y_val, dtpred)
precision = precision_score(Y_val, dtpred)
recall = recall_score(Y_val, dtpred)
f1 = f1_score(Y_val, dtpred)

decision_tree = pd.DataFrame(
    [['Decision Tree Classifier', acc, precision, recall, f1]],
    columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score']
     )

## Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators = 42).fit(X_train, Y_train)

In [None]:
rf_predict = rf.predict(X_val)

In [None]:
acc = accuracy_score(Y_val, rf_predict)
precision = precision_score(Y_val, rf_predict)
recall = recall_score(Y_val, rf_predict)
f1 = f1_score(Y_val, rf_predict)

random_forest = pd.DataFrame(
    [['Random Forest', acc, precision, recall, f1]],
    columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score']
     )

## XGBoost Classifier

In [None]:
xgb = XGBClassifier().fit(X_train, Y_train)

In [None]:
xgb_predict = xgb.predict(X_val)

In [None]:
acc = accuracy_score(Y_val, xgb_predict)
precision = precision_score(Y_val, xgb_predict)
recall = recall_score(Y_val, xgb_predict)
f1 = f1_score(Y_val, xgb_predict)

xg_boost = pd.DataFrame(
    [['XGB', acc, precision, recall, f1]],
    columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score']
     )

## Confusion Matrix
The confusion matrices show the true values on the y axis and the predicted values on the x axis. The diagonal cells hold the correctly classified transactions (true negatives in the top-left cell, true positives in the bottom-right cell), so high counts on the diagonal relative to the off-diagonal cells indicate high accuracy.
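
To make the cell layout explicit, the sketch below builds a binary confusion matrix by hand, in the same `[[TN, FP], [FN, TP]]` arrangement that scikit-learn's `confusion_matrix` returns for labels 0 and 1. The function name `confusion_counts` and the labels in the example are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Binary confusion matrix as [[TN, FP], [FN, TP]] (rows: actual, columns: predicted)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1   # fraud correctly detected
        elif actual == 0 and predicted == 1:
            fp += 1   # legitimate transaction flagged as fraud
        elif actual == 1 and predicted == 0:
            fn += 1   # fraud that went undetected
        else:
            tn += 1   # legitimate transaction correctly passed
    return [[tn, fp], [fn, tp]]


# Hypothetical labels: 1 = fraud, 0 = legitimate
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # [[4, 1], [1, 2]]
```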

In [None]:
# logistic regression
cmLR = confusion_matrix(Y_val, lrpred)
cm_LR = pd.DataFrame(cmLR, columns=['predicted_non_fraud', 'predicted_fraud'], index=['actual_non_fraud', 'actual_fraud'])

# decision trees
cmDT = confusion_matrix(Y_val, dtpred)
cm_DT = pd.DataFrame(cmDT, columns=['predicted_non_fraud', 'predicted_fraud'], index=['actual_non_fraud', 'actual_fraud'])

# random forest
cmRF = confusion_matrix(Y_val, rf_predict)
cm_RF = pd.DataFrame(cmRF, columns=['predicted_non_fraud', 'predicted_fraud'], index=['actual_non_fraud', 'actual_fraud'])

# XGBoost
cmXG = confusion_matrix(Y_val, xgb_predict)
cm_XG = pd.DataFrame(cmXG, columns=['predicted_non_fraud', 'predicted_fraud'], index=['actual_non_fraud', 'actual_fraud'])

In [None]:
plt.figure(figsize=(15,8))

plt.suptitle("Confusion Matrices",fontsize=24)
plt.subplots_adjust(wspace = 0.4, hspace= 0.4)

plt.subplot(2,2,1)
plt.title("Logistic Regression Confusion Matrix")
sns.heatmap(cm_LR,annot=True,cmap="Greens",fmt="d",cbar=False, annot_kws={"size": 24})

plt.subplot(2,2,2)
plt.title("Decision Tree Confusion Matrix")
sns.heatmap(cm_DT,annot=True,cmap="Blues",fmt="d",cbar=False, annot_kws={"size": 24})

plt.subplot(2,2,3)
plt.title("Random Forest Confusion Matrix")
sns.heatmap(cm_RF,annot=True,cmap="Reds",fmt="d",cbar=False, annot_kws={"size": 24})

plt.subplot(2,2,4)
plt.title("XGBoost Confusion Matrix")
sns.heatmap(cm_XG,annot=True,cmap="Reds",fmt="d",cbar=False, annot_kws={"size": 24})
plt.show()

## Machine Learning Scores

In [None]:
ml_scores = pd.concat([logistic_regression, decision_tree, random_forest, xg_boost], ignore_index=True)
ml_scores

# Conclusion

## Questions

The questions are answered using the best-performing machine learning models: **Decision Tree** and **Random Forest**.

* The company will receive 25% of the value of each transaction correctly detected as *fraud*.
* The company will receive 5% of the value of each transaction detected as *fraud* that is actually *legitimate*.
* The company will return 100% of the value to the customer for each transaction detected as *legitimate* that is actually a *fraud*.

Median amount of a transaction: \$74,871.8 (the median is used because the amount distribution is highly skewed).

Portfolio: 6,362,620 transactions (fraudulent + non fraudulent).
* True positive amount = [0.25 x (median amount) x (number of TP transactions)]
* False positive amount = [0.05 x (median amount) x (number of FP transactions)]
* False negative amount = [1 x (median amount) x (number of FN transactions)]
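
The three formulas above combine into a profit estimate as sketched below. The function name `estimated_profit`, the round median of 1,000 and the transaction counts are hypothetical values chosen to keep the arithmetic easy to follow; they are not results from the models.

```python
def estimated_profit(median_amount: float, n_tp: int, n_fp: int, n_fn: int) -> float:
    """Profit estimate: fees on flagged transactions minus refunds for missed frauds."""
    tp_amount = 0.25 * median_amount * n_tp   # fees on correctly detected frauds
    fp_amount = 0.05 * median_amount * n_fp   # fees on legitimate transactions flagged as fraud
    fn_amount = 1.00 * median_amount * n_fn   # refunds for frauds classified as legitimate
    return tp_amount + fp_amount - fn_amount


# Hypothetical inputs: 100 TP, 10 FP, 5 FN with a median amount of 1,000
# 0.25 * 1000 * 100 = 25,000; 0.05 * 1000 * 10 = 500; 1.00 * 1000 * 5 = 5,000
print(estimated_profit(1000.0, n_tp=100, n_fp=10, n_fn=5))  # 25,000 + 500 - 5,000 = 20,500
```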

In [None]:
# join predictions and amounts on the validation set in order to calculate the business performance
# (Y_val is a Series, so the columns are gathered in a new DataFrame)
results = X_val[['amount']].copy()
results['isFraud'] = Y_val
results['predictions'] = dtpred

# the company receives 25% of each transaction value correctly detected as fraud
fraud_detected = results[(results['isFraud'] == 1) & (results['predictions'] == 1)]
fraud_detected_receive = fraud_detected['amount'].sum() * 0.25

# the company receives 5% of each transaction value detected as fraud but actually legitimate
fraud_detected_leg = results[(results['isFraud'] == 0) & (results['predictions'] == 1)]
fraud_detected_leg_receive = fraud_detected_leg['amount'].sum() * 0.05

# the company gives back 100% of the value for each fraud detected as legitimate
fraud_not_detected = results[(results['isFraud'] == 1) & (results['predictions'] == 0)]
fraud_not_detected_amount = fraud_not_detected['amount'].sum()

# print results
print('The company will receive ${:,.2f} due to transactions truly detected as fraud'.format(fraud_detected_receive))
print('The company will receive ${:,.2f} due to transactions detected as fraud, but actually legitimate'.\
      format(fraud_detected_leg_receive))
print('The company will give back ${:,.2f} due to transactions detected as legitimate, but actually fraud'.\
      format(fraud_not_detected_amount))
# the company profit
profit = fraud_detected_receive + fraud_detected_leg_receive - fraud_not_detected_amount
print(f'The Blocker Fraud Company forecasted profit: ${profit:,.2f}')